CMU-CS-25-112
Computer Science Department
School of Computer Science, Carnegie Mellon University
Democratizing On-Device LLM Inference with WebLLM
Charlie F. Ruan
M.S. Thesis
May 2025
Large language models (LLMs) have traditionally relied on cloud-based inference due to their high computational and memory demands. However, recent advances in small LLMs and consumer hardware have made on-device inference increasingly practical. Among potential deployment targets, the web browser stands out as a uniquely compelling platform: it is universally accessible, naturally abstracts away hardware heterogeneity, requires no dependency installation for web applications, and provides a natural agentic environment for task automation. This thesis presents WebLLM, a high-performance TypeScript framework that enables LLM inference entirely within client-side web browsers. WebLLM compiles LLMs ahead of time using the MLC-LLM and Apache TVM compiler stack to generate optimized WebGPU kernels and a portable WebAssembly runtime. It exposes a familiar OpenAI-style API, supports efficient GPU acceleration, and integrates seamlessly with browser environments through Web Workers and WebAssembly. To enable structured generation, which is especially challenging for small LLMs, WebLLM incorporates XGrammar, an efficient grammar-constrained decoding engine that lets developers enforce output formats such as JSON or DSLs with near-zero overhead. Together, these components demonstrate a path toward democratizing LLM access, making intelligent, private, and responsive AI experiences universally available through the web.

42 pages
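To illustrate the OpenAI-style API described in the abstract, the following is a minimal TypeScript sketch. It assumes the published @mlc-ai/web-llm npm package and uses a model ID assumed to appear in WebLLM's prebuilt model list; the sketch is illustrative and is not taken from the thesis itself.

    import { CreateMLCEngine } from "@mlc-ai/web-llm";

    async function main() {
      // Download and compile the model in-browser; the model ID below is an
      // assumed entry from WebLLM's prebuilt model list.
      const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC", {
        initProgressCallback: (report) => console.log(report.text),
      });

      // OpenAI-style chat completion, executed locally via WebGPU kernels.
      const reply = await engine.chat.completions.create({
        messages: [{ role: "user", content: "Name three colors as a JSON object." }],
        // Requesting JSON output engages grammar-constrained (XGrammar) decoding.
        response_format: { type: "json_object" },
      });
      console.log(reply.choices[0].message.content);
    }

    main();

For the Web Worker integration mentioned in the abstract, WebLLM also provides a worker-backed engine constructor (CreateWebWorkerMLCEngine) that moves inference off the UI thread while keeping the same chat-completion interface.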
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department