LiteRT-LM is a production-ready, open-source inference framework designed to deliver high-performance, cross-platform LLM deployments on edge devices.
- Production-Ready: Battle-tested infrastructure that goes beyond basic inference to provide critical functionalities required by real-world products.
- Open-Source: Democratize on-device LLM capabilities with an open-source codebase providing broad support for mainstream open-weight models.
- High-Performance: Industry-leading performance and acceleration across CPU/GPU/NPU empowered by LiteRT and optimized ML kernels from the ODML team.
- Cross-Platform: Empower developers to deploy LLMs across mobile, desktop, web, and IoT with an extended set of language bindings (Kotlin, Swift, etc.).
| Platform | CPU Support | GPU Support | NPU Support |
|---|---|---|---|
| Android | ✅ | ✅ | ✅ |
| iOS | ✅ | ✅ | - |
| macOS | ✅ | ✅ | - |
| Windows | ✅ | ✅ | - |
| Linux | ✅ | ✅ | - |
| Embedded | ✅ | - | - |
Want to try it out first? Before proceeding with the full setup, you can use the pre-built binaries for desktop or the Google AI Edge Gallery app for mobile to run LiteRT-LM immediately.
The Google AI Edge Gallery is a demo app that puts the power of cutting-edge Generative AI models directly into your hands, powered by LiteRT-LM.
After downloading the lit binary, just run lit to see the options.
Here is a simple use case:
```
# Set the HuggingFace token in the HUGGING_FACE_HUB_TOKEN environment variable
# so that lit can pull the model from HuggingFace.
# On Linux or macOS
export HUGGING_FACE_HUB_TOKEN="your_huggingface_token"
# On Windows Command Prompt
set HUGGING_FACE_HUB_TOKEN=your_huggingface_token
# On Windows PowerShell
$env:HUGGING_FACE_HUB_TOKEN = "your_huggingface_token"

lit list --show_all
lit pull gemma3-1b
lit run gemma3-1b [--backend=<cpu|gpu>]
```
Tips and platform-specific steps
Note: Running on the GPU on Windows requires the DirectXShaderCompiler. Download dxc_2025_07_14.zip (or the latest zip file) from https://github.com/microsoft/DirectXShaderCompiler/releases, unzip it, locate the directory for your architecture under bin, and copy dxil.dll and dxcompiler.dll into the same directory as the executable (e.g., lit or litert_lm_main).
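For reference, the copy step from a Windows PowerShell prompt might look like the sketch below; the source and destination paths are illustrative and depend on where you unzipped the release and where your executable lives.

```
# Illustrative PowerShell commands: copy the DXC DLLs next to the executable.
# Adjust both paths to your unzipped release and your install location.
Copy-Item .\dxc_2025_07_14\bin\x64\dxil.dll -Destination C:\tools\litert-lm\
Copy-Item .\dxc_2025_07_14\bin\x64\dxcompiler.dll -Destination C:\tools\litert-lm\
```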
Tip: For more functionality, use lit --help or lit <command> --help
Tip: Follow this link to get your own Hugging Face token
Tip: You may have to chmod +x lit and explicitly approve the use of pre-built binaries. For example, on macOS, go to System Settings > Privacy & Security > Security to approve the binary.
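For example, on Linux or macOS the first run of the downloaded binary typically looks like this:

```
# Make the downloaded binary executable, then run it once to see the options.
# On macOS, the first run may be blocked until you approve the binary in
# System Settings > Privacy & Security.
chmod +x ./lit
./lit --help
```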
The LiteRT-LM SDK provides high-level, idiomatic abstractions to integrate LLMs into your applications with minimal boilerplate. These APIs manage the entire lifecycle—from model loading and tokenization to hardware acceleration and session management.
| Language | Status | Best For... | Documentation |
|---|---|---|---|
| Kotlin | ✅ Stable | Native Android apps and JVM-based desktop tools. Optimized for Coroutines. | Kotlin API Reference |
| C++ | ✅ Stable | High-performance, cross-platform core logic and embedded systems. | C++ API Reference |
| Swift | 🚀 In Dev | Native iOS and macOS integration with specialized Metal support. | Coming Soon |
| Python | 🚀 In Dev | Rapid prototyping, development, and desktop-side scripting. | Python API Reference |
🛑 Note for App Developers: You do not need to build this project from source to use it in your apps. If you are using Kotlin, Swift, or Python, please use our pre-built SDKs listed in the Choose Your Platform section above.
This section provides instructions for compiling the core LiteRT-LM C++ framework from scratch. You should only follow these steps if you are:
- A core contributor fixing bugs or adding features to the LiteRT-LM engine.
- A native C++ developer who requires custom compilation flags for an embedded system.
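As a rough sketch, a from-source build is driven by Bazel; the target label and flags below are assumptions, so consult the build instructions in the docs directory for the authoritative commands.

```
# Minimal from-source build sketch (target label and flags are assumptions;
# see the repository's build docs for the exact invocation).
git clone https://github.com/google-ai-edge/LiteRT-LM.git
cd LiteRT-LM
bazelisk build -c opt //runtime/engine:litert_lm_main
```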
LiteRT-LM uses the .litertlm model format.
You can find and download compatible models below:
| Model | Usage Type | Quantization | Context size | Model Size (MB) | Give it a try |
|---|---|---|---|---|---|
| Gemma3-1B | Chat Ready | 4-bit per-channel | 4096 | 557 | Download |
| Gemma-3n-E2B | Chat Ready | 4-bit per-channel | 4096 | 2965 | Download |
| Gemma-3n-E4B | Chat Ready | 4-bit per-channel | 4096 | 4235 | Download |
| phi-4-mini | Chat Ready | 8-bit per-channel | 4096 | 3728 | Download |
| qwen2.5-1.5b | Chat Ready | 8-bit per-channel | 4096 | 1524 | Download |
| FunctionGemma-270M | Base (Fine-tuning required) | 8-bit per-channel | 1024 | 288 | Fine-tuning Guide |
| ↪ TinyGarden-270M | Demo | 8-bit per-channel | 1024 | 288 | Download / Try App |
Below are the performance numbers from running each model on various devices. Note that the benchmark is measured with 1024 tokens prefill and 256 tokens decode (with a performance lock on Android devices).
| Model | Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Context size |
|---|---|---|---|---|---|
| Gemma3-1B | MacBook Pro (2023 M3) | CPU | 422.98 | 66.89 | 4096 |
| Gemma3-1B | Samsung S24 (Ultra) | CPU | 243.24 | 43.56 | 4096 |
| Gemma3-1B | Samsung S24 (Ultra) | GPU | 1876.5 | 44.57 | 4096 |
| Gemma3-1B | Samsung S25 (Ultra) | NPU | 5836.6 | 84.8 | 1280 |
| Gemma-3n-E2B | MacBook Pro (2023 M3) | CPU | 232.5 | 27.6 | 4096 |
| Gemma-3n-E2B | Samsung S24 (Ultra) | CPU | 110.5 | 16.1 | 4096 |
| Gemma-3n-E2B | Samsung S24 (Ultra) | GPU | 816.4 | 15.6 | 4096 |
| Gemma-3n-E4B | MacBook Pro (2023 M3) | CPU | 170.1 | 20.1 | 4096 |
| Gemma-3n-E4B | Samsung S24 (Ultra) | CPU | 73.5 | 9.2 | 4096 |
| Gemma-3n-E4B | Samsung S24 (Ultra) | GPU | 548.0 | 9.4 | 4096 |
| FunctionGemma | Samsung S25 (Ultra) | CPU | 1718.4 | 125.9 | 1024 |
Note that the first time a given model is loaded on a given device, it will take longer to load. This is because the model weights are being arranged to run optimally on your particular device. Subsequent loads will be faster because the optimized weights are cached on your device.
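If you want to reproduce numbers like these with the litert_lm_main binary, the invocation is roughly along the following lines; the benchmark-related flag names are assumptions, so check litert_lm_main --help for what your build actually supports.

```
# Hypothetical benchmark run using 1024 prefill and 256 decode tokens, matching
# the table above. Flag names other than --backend are assumptions; verify with
# ./litert_lm_main --help.
./litert_lm_main \
  --model_path=gemma3-1b.litertlm \
  --backend=cpu \
  --benchmark \
  --benchmark_prefill_tokens=1024 \
  --benchmark_decode_tokens=256
```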
When a model exceeds 1.5GB, it often surpasses the "over-the-air" download limits of cellular networks or the internal limits of standard app bundles. A remote fetch strategy is required.
Host your model file, then have your app download the latest version of the model from your hosted URL. Firebase provides solutions for downloading large files on Android and iOS.
Alternatively, you can fetch a model directly from HuggingFace by using the
HuggingFace API.
For private or gated models, you will need to include a Hugging Face User
Access Token in the Authorization: Bearer <TOKEN> header of your download
request.
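For example, a direct download with curl might look like the following; the org, repo, and file names are placeholders for the model you are fetching.

```
# Download a .litertlm file from Hugging Face. The Authorization header is only
# required for private or gated models. Org/repo and file names are placeholders.
curl -L \
  -H "Authorization: Bearer $HUGGING_FACE_HUB_TOKEN" \
  -o gemma3-1b.litertlm \
  "https://huggingface.co/<org>/<repo>/resolve/main/<model>.litertlm"
```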
For detailed documentation, please visit the docs directory.
- Jan 31, 2026 : Repository Migration to Git LFS
The LiteRT-LM repository has been migrated to use Git LFS (Large File Storage) for all prebuilt binaries. Because this involved a history rewrite to shrink the repository size, all previous commit hashes are now invalid.
If you have a local copy of this repository from before January 31, 2026,
your local history is now incompatible with the remote. Please do not attempt
to git pull.
To fix your local environment, please perform a fresh clone:
```
# 1. Remove your old directory (or move it to a backup).
rm -rf LiteRT-LM
# 2. Re-clone the repository.
git clone https://github.com/google-ai-edge/LiteRT-LM.git
cd LiteRT-LM
# 3. Ensure LFS is initialized. If this is your first time installing LFS,
#    download LFS from https://git-lfs.com.
git lfs install
git lfs pull
```
- Nov 2025 : Desktop GPU support and more (v0.8.0)
  - Desktop GPU support.
  - Simple CLI for Desktop: Link to Quick Start section
  - Multi-Modality support: Vision and Audio input are supported when models support it. See more details here
  - Kotlin API for Android and JVM (Linux, macOS, Windows): Link to LiteRT-LM Kotlin API
  - Conversation API: Link to Conversation API
  - Function calling support: Link to Tool Use
- June 24, 2025 : Run Gemma models with NPU Support (v0.7.0)
  Unlock significant performance gains! Our latest release leverages the power of Neural Processing Units (NPUs) on devices with Qualcomm and MediaTek chipsets to run the Gemma3 1B model with incredible efficiency. Note: LiteRT-LM NPU acceleration is only available through an Early Access Program. Please check out this page for more information about how to sign up.
- June 10, 2025 : The Debut of LiteRT-LM: A New Framework for On-Device LLMs
  We're proud to release an early preview (v0.6.1) of the LiteRT-LM codebase! This foundational release enables you to run the latest Gemma series models across a wide range of devices, with initial support for CPU execution and powerful GPU acceleration on Android.
LiteRT, LiteRT-LM, and MediaPipe GenAI Tasks are three libraries within the Google AI Edge stack that build on each other. By exposing functionality at different abstraction layers, we hope to enable developers to balance their respective needs between flexibility and complexity.
LiteRT is Google AI Edge's underlying on-device runtime. Developers can convert individual PyTorch, TensorFlow, and JAX models to LiteRT and run them on-device.
LiteRT-LM gives developers the pipeline framework to stitch together multiple LiteRT models with pre- and post-processing components (e.g., tokenizer, vision encoder, text decoder).
MediaPipe GenAI Tasks are out-of-the-box native APIs (Kotlin, Swift, JS) to run language models by just setting a few parameters such as temperature and topK.
MediaPipe GenAI Tasks currently use .task files to represent language models.
Task files are zip archives of multiple LiteRT files, components, and metadata.
.litertlm is an evolution of the .task file format to include additional
metadata and enable better compression.
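Since a .task bundle is a zip archive, you can inspect its contents with standard tools; the file name below is a placeholder.

```
# List the LiteRT files, components, and metadata packed inside a .task bundle.
unzip -l gemma3-1b.task
```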
During our LiteRT-LM preview, we will release a small number of .litertlm
files. MediaPipe APIs will continue to use .task files. Once we have the first
full release of LiteRT-LM, we will migrate MediaPipe APIs to use the new
.litertlm files and release a wider collection of .litertlm files on the
LiteRT Hugging Face Community.
If you encounter a bug or have a feature request, we encourage you to use the GitHub Issues page to report it.
Before creating a new issue, please search the existing issues to avoid duplicates. When filing a new issue, please provide a clear title and a detailed description of the problem, including steps to reproduce it. The more information you provide, the easier it will be for us to help you.
