ZML is a production inference stack, purpose-built to decouple AI workloads from proprietary hardware.
Any model, any hardware, one codebase, peak performance.
Models compile directly for NVIDIA, AMD, TPU, and Trainium targets, delivering peak hardware performance on any accelerator. No rewriting.
It is built using the Zig language, MLIR, and Bazel.
We use Bazel to build ZML and its dependencies. The only prerequisite is Bazel itself, which we recommend installing through bazelisk.
With Homebrew:

```shell
brew install bazelisk
```

Or download the release binary directly (Linux x86_64):

```shell
curl -L -o /usr/local/bin/bazel 'https://github.com/bazelbuild/bazelisk/releases/download/v1.28.0/bazelisk-linux-amd64'
chmod +x /usr/local/bin/bazel
```

Run the MNIST example:
```shell
bazel run //examples/mnist
```

This downloads a small pretrained MNIST model, compiles it, loads the weights, and classifies a random handwritten digit.
The main LLM example is //examples/llm. It currently supports:
- Llama 3.1 / 3.2
- Qwen 3.5
- LFM 2.5
Authenticate with Hugging Face if you want to load gated repos such as Meta Llama:
```shell
bazel run //tools/hf -- auth login
```

Alternatively, set the `HF_TOKEN` environment variable.
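A minimal sketch of the environment-variable route; the token value below is a placeholder, not a real credential (get yours from your Hugging Face account settings):

```shell
# Export a Hugging Face access token so gated repos can be fetched.
# "hf_xxxxxxxx" is a placeholder value.
export HF_TOKEN="hf_xxxxxxxx"
echo "$HF_TOKEN"
```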
Then run a prompt directly:
```shell
bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.2-1B-Instruct --prompt="What is the capital of France?"
```

Open the interactive chat loop by omitting `--prompt`:

```shell
bazel run //examples/llm -- --model=hf://meta-llama/Llama-3.2-1B-Instruct
```

You can also load from:
- a local directory: `--model=/var/models/meta-llama/Llama-3.2-1B-Instruct`
- S3: `--model=s3://bucket/path/to/model`
Append one or more platform flags when compiling or running:
- NVIDIA CUDA: `--@zml//platforms:cuda=true`
- AMD ROCm: `--@zml//platforms:rocm=true`
- Google TPU: `--@zml//platforms:tpu=true`
- AWS Trainium / Inferentia 2: `--@zml//platforms:neuron=true`
- Disable CPU compilation: `--@zml//platforms:cpu=false`
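Flag ordering matters: platform flags belong to Bazel and go before the `--` separator, while everything after `--` is passed to the example binary. A sketch that composes (but does not invoke) such a command:

```shell
# Bazel flags (platform selection) precede the "--" separator;
# program arguments (model, prompt) follow it.
PLATFORM_FLAGS="--@zml//platforms:cuda=true --@zml//platforms:cpu=false"
MODEL_ARG="--model=hf://meta-llama/Llama-3.2-1B-Instruct"
echo "bazel run //examples/llm $PLATFORM_FLAGS -- $MODEL_ARG"
```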
Example on CUDA:

```shell
bazel run //examples/llm --@zml//platforms:cuda=true -- --model=hf://meta-llama/Llama-3.2-1B-Instruct --prompt="Write a haiku about Zig"
```

Example on ROCm:

```shell
bazel run //examples/llm --@zml//platforms:rocm=true -- --model=hf://meta-llama/Llama-3.2-1B-Instruct --prompt="Write a haiku about Zig"
```

Run the test suite:

```shell
bazel test //zml:test
```

The repository ships several examples:

- `examples/llm`: unified LLM CLI for Llama, Qwen, and LFM
- `examples/mnist`: smallest end-to-end model run
- `examples/sharding`: logical mesh, partitioners, shard-local execution, profiler output
- `examples/io`: inspect and load local, `hf://`, `https://`, and `s3://` repositories through the VFS layer
- `examples/benchmark`: measure loading and execution performance
The MNIST example's model definition, in full:

```zig
const Mnist = struct {
    fc1: Layer,
    fc2: Layer,

    const Layer = struct {
        weight: zml.Tensor,
        bias: zml.Tensor,

        pub fn init(store: zml.io.TensorStore.View) Layer {
            return .{
                .weight = store.createTensor("weight", .{ .d_out, .d }, null),
                .bias = store.createTensor("bias", .{.d_out}, null),
            };
        }

        pub fn forward(self: Layer, input: zml.Tensor) zml.Tensor {
            return self.weight.dot(input, .d).add(self.bias).relu().withTags(.{.d});
        }
    };

    pub fn init(store: zml.io.TensorStore.View) Mnist {
        return .{
            .fc1 = .init(store.withPrefix("fc1")),
            .fc2 = .init(store.withPrefix("fc2")),
        };
    }

    pub fn load(
        self: *const Mnist,
        allocator: std.mem.Allocator,
        io: std.Io,
        platform: *const zml.Platform,
        store: *const zml.io.TensorStore,
        shardings: []const zml.sharding.Sharding,
    ) !zml.Bufferized(Mnist) {
        return zml.io.load(Mnist, self, allocator, io, platform, store, .{
            .shardings = shardings,
            .parallelism = 1,
            .dma_chunks = 1,
            .dma_chunk_size = 16 * 1024 * 1024,
        });
    }

    pub fn unloadBuffers(self: *zml.Bufferized(Mnist)) void {
        self.fc1.weight.deinit();
        self.fc1.bias.deinit();
        self.fc2.weight.deinit();
        self.fc2.bias.deinit();
    }

    /// Just two linear layers with a ReLU activation.
    pub fn forward(self: Mnist, input: zml.Tensor) zml.Tensor {
        var x = input.flatten().convert(.f32).withTags(.{.d});
        const layers: []const Layer = &.{ self.fc1, self.fc2 };
        for (layers) |layer| {
            x = layer.forward(x);
        }
        return x.argMax(0).indices.convert(.u8);
    }
};
```

For a full walkthrough, see:
- Run more examples in `./examples`
- Read the example-specific notes in `examples/llm/README.md`
- Learn tagged dimensions in `working_with_tensors.md`
- Start building a model with `write_first_model.md`
- Explore deployment in `deploy_on_server.md`
ZML is licensed under the Apache 2.0 license.