Skip to content

google-ai-edge/LiteRT-LM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1,284 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LiteRT-LM

LiteRT-LM is a production-ready, open-source inference framework designed to deliver high-performance, cross-platform LLM deployments on edge devices.

  • Production-Ready: Battle-tested infrastructure that goes beyond basic inference to provide critical functionalities required by real-world products.
  • Open-Source: Democratize on-device LLM capabilities with open-source codebase providing broad support for mainstream open-weight models.
  • High-Performance: Industry-leading performance and acceleration across CPU/GPU/NPU empowered by LiteRT and optimized ML kernels from the ODML team.
  • Cross-Platform: Empower developers to deploy LLMs across mobile, desktop, web, and IoT with an extended set of language bindings (Kotlin, Swift, etc.).

Supported Backends & Platforms

Platform CPU Support GPU Support NPU Support
Android
iOS -
macOS -
Windows -
Linux -
Embedded - -

Quick Start

Want to try it out first? Before proceeding with the full setup, you can use the pre-built binaries for desktop or the Google AI Edge Gallery app for mobile to run LiteRT-LM immediately.

Mobile Apps

The Google AI Edge Gallery is a demo app that puts the power of cutting-edge Generative AI models directly into your hands, powered by LiteRT-LM.

Desktop CLI (Lit)

After downloading the lit binary, just run lit to see the options. Here is a simple use case:

# Set the HuggingFace token in the HUGGING_FACE_HUB_TOKEN environment variable
# so that lit can pull the model from HuggingFace.

# On Linux or MacOS
export HUGGING_FACE_HUB_TOKEN="your_huggingface_token"

# On Windows Command Prompt
set HUGGING_FACE_HUB_TOKEN=your_huggingface_token

# On Windows Powershell
$env:HUGGING_FACE_HUB_TOKEN = "your_huggingface_token"
lit list --show_all
lit pull gemma3-1b
lit run gemma3-1b [--backend=<cpu|gpu>]
Tips and platform specific steps

Note: Running GPU on Windows requires the DirectXShaderCompiler. Download the dxc_2025_07_14.zip or the latest zip file from https://github.com/microsoft/DirectXShaderCompiler/releases, unzip the file and locate the right architecture directory under bin, copy the dxil.dll and dxcompiler.dll into the same directory as the executable like lit or litert_lm_main.

Tip: For more functionality, use lit --help or lit <command> --help

Tip: Follow this link to get your own Hugging Face token

Tip: You may have to chmod +x lit and explicitly approve the usage of pre-built binaries. For example, in MacOS, you should go to System Settings > Privacy & Security > Security to approve the binary.

🔧 Build Your App: API & SDK References

The LiteRT-LM SDK provides high-level, idiomatic abstractions to integrate LLMs into your applications with minimal boilerplate. These APIs manage the entire lifecycle—from model loading and tokenization to hardware acceleration and session management.

Choose Your Platform

Language Status Best For... Documentation
Kotlin
Stable
Native Android apps and JVM-based desktop tools. Optimized for Coroutines. Kotlin API Reference
C++
Stable
High-performance, cross-platform core logic and embedded systems. C++ API Reference
Swift 🚀
In Dev
Native iOS and macOS integration with specialized Metal support. Coming Soon
Python 🚀
In Dev
Rapid prototyping, development, and desktop-side scripting. Python API Reference

Building from Source (Advanced)

🛑 Note for App Developers: You do not need to build this project from source to use it in your apps. If you are using Kotlin, Swift, or Python, please use our pre-built SDKs listed in the Choose Your Platform section above.

This section provides instructions for compiling the core LiteRT-LM C++ framework from scratch. You should only follow these steps if you are:

Supported Models and Performance

LiteRT-LM uses the .litertlm model format. You can find and download compatible models below:

Model Usage Type Quantization Context size Model Size (Mb) Give it a try
Gemma3-1B Chat Ready 4-bit per-channel 4096 557 Download
Gemma-3n-E2B Chat Ready 4-bit per-channel 4096 2965 Download
Gemma-3n-E4B Chat Ready 4-bit per-channel 4096 4235 Download
phi-4-mini Chat Ready 8-bit per-channel 4096 3728 Download
qwen2.5-1.5b Chat Ready 8-bit per-channel 4096 1524 Download
FunctionGemma-270M Base (Fine-tuning required) 8-bit per-channel 1024 288 Fine-tuning Guide
↪ TinyGarden-270M Demo 8-bit per-channel 1024 288 Download / Try App

Below are the performance numbers of running each model on various devices. Note that the benchmark is measured with 1024 tokens prefill and 256 tokens decode ( with performance lock on Android devices).

Model Device Backend Prefill (tokens/sec) Decode (tokens/sec) Context size
Gemma3-1B MacBook Pro
(2023 M3)
CPU 422.98 66.89 4096
Gemma3-1B Samsung S24
(Ultra)
CPU 243.24 43.56 4096
Gemma3-1B Samsung S24
(Ultra)
GPU 1876.5 44.57 4096
Gemma3-1B Samsung S25
(Ultra)
NPU 5836.6 84.8 1280
Gemma-3n-E2B MacBook Pro
(2023 M3)
CPU 232.5 27.6 4096
Gemma-3n-E2B Samsung S24
(Ultra)
CPU 110.5 16.1 4096
Gemma-3n-E2B Samsung S24
(Ultra)
GPU 816.4 15.6 4096
Gemma-3n-E4B MacBook Pro
(2023 M3)
CPU 170.1 20.1 4096
Gemma-3n-E4B Samsung S24
(Ultra)
CPU 73.5 9.2 4096
Gemma-3n-E4B Samsung S24
(Ultra)
GPU 548.0 9.4 4096
FunctionGemma Samsung S25
(Ultra)
CPU 1718.4 125.9 1024

Note that the first time a given model is loaded on a given device, it will take longer to load. This is because the model weights are being arranged to run optimally on your particular device. Subsequent loads will be faster because the optimized weights are cached on your device.

Model Hosting and Deployment

When a model exceeds 1.5GB, it often surpasses the "over-the-air" download limits of cellular networks or the internal limits of standard app bundles. A remote fetch strategy is required.

Host your model file, then have your app fetch the latest version of your model URL for download. Firebase provides solutions for downloading large files on Android and iOS.

Alternatively, you can fetch a model directly from HuggingFace by using the HuggingFace API. For private or gated models, you will need to include a Hugging Face User Access Token in the Authorization: Bearer <TOKEN> header of your download request.

Documentation

For detailed documentation, please visit the docs directory.

Release Notes

  • Jan 31, 2026 : Repository Migration to Git LFS

The LiteRT-LM repository has been migrated to use Git LFS (Large File Storage) for all prebuilt binaries. Because this involved a history rewrite to shrink the repository size, all previous commit hashes are now invalid.

Action Required:

If you have a local copy of this repository from before January 31, 2026, your local history is now incompatible with the remote. Please do not attempt to git pull.

To fix your local environment, please perform a fresh clone:

# 1. Remove your old directory (or move it to a backup).
rm -rf LiteRT-LM

# 2. Re-clone the repository.
git clone https://github.com/google-ai-edge/LiteRT-LM.git
cd LiteRT-LM

# 3. Ensure LFS is initialized. If this is your first time installing LFS,
#    download LFS from https://git-lfs.com.
git lfs install
git lfs pull
  • Nov 2025 : Desktop GPU support and more (v0.8.0)

  • June 24, 2025 : Run Gemma models with NPU Support (v0.7.0) Unlock significant performance gains! Our latest release leverages the power of Neural Processing Units (NPUs) on devices with Qualcomm and MediaTek chipsets to run the Gemma3 1B model with incredible efficiency.

    Note: LiteRT-LM NPU acceleration is only available through an Early Access Program. Please check out this page for more information about how to sign it up.

  • June 10, 2025 : The Debut of LiteRT-LM: A New Framework for On-Device LLMs We're proud to release an early preview (v0.6.1) of the LiteRT-LM codebase! This foundational release enables you to run the latest Gemma series models across a wide of devices with initial support for CPU execution and powerful GPU acceleration on Android.

FAQ

LiteRT vs LiteRT-LM vs MediaPipe GenAI Tasks

LiteRT, LiteRT-LM, and MediaPipe GenAI Tasks are three libraries within the Google AI Edge stack that build on each other. By exposing functionality at different abstraction layers, we hope to enable developers to balance their respective needs between flexibility and complexity.

LiteRT is Google AI Edge's underlying on-device runtime. Developers can convert individual PyTorch, TensorFlow, and JAX models to LiteRT and run them on-device.

LiteRT-LM gives developers the pipeline framework to stitch together multiple LiteRT models with pre- and post-processing components (e.g., tokenizer, vision encoder, text decoder).

MediaPipe GenAI Tasks are out-of-the-box native APIs (Kotlin, Swift, JS) to run language models by just setting a few parameters such as temperature and topK.

.litertlm vs .task

MediaPipe GenAI Tasks currently use .task files to represent language models. Task files are zip archives of multiple LiteRT files, components, and metadata. .litertlm is an evolution of the .task file format to include additional metadata and enable better compression.

During our LiteRT-LM preview, we will release a small number of .litertlm files. MediaPipe APIs will continue to use .task files. Once we have the first full release of LiteRT-LM, we will migrate MediaPipe APIs to use the new .litertlm files and release a wider collection of .litertlm files on the LiteRT Hugging Face Community

Reporting Issues

If you encounter a bug or have a feature request, we encourage you to use the GitHub Issues page to report it.

Before creating a new issue, please search the existing issues to avoid duplicates. When filing a new issue, please provide a clear title and a detailed description of the problem, including steps to reproduce it. The more information you provide, the easier it will be for us to help you.