Most "run a local LLM" tutorials have you installing Python, creating a venv, wrestling with CUDA drivers, and pulling down Docker images before you've seen a single generated token. Llamafile 0.10.0 — released today, March 20, 2026 — skips all of that. You download one file, make it executable, and run it. That's the whole setup.
Mozilla-AI's llamafile project just shipped its biggest release in over a year, and the headline change is one that users had been waiting on for months: GPU acceleration is back.
What Llamafile Actually Is
The core idea is almost offensively simple. Llamafile bundles a model's weights together with a compiled version of llama.cpp using something called Cosmopolitan Libc — a technique that produces what the project calls an APE (Actually Portable Executable). One binary that runs on Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD, across both AMD64 and ARM64, without any runtime dependencies.
No Python. No Node. No CUDA toolkit. No container runtime. The file is the runtime.
When you run it, llamafile spins up a local HTTP server with a chat interface, opens a tab in your browser, and you're talking to a local model. Or skip the browser entirely and chat in the terminal. It also exposes an OpenAI-compatible API endpoint — so anything already pointing at OpenAI's API can be redirected at your local llamafile with a single URL change.
Note
What "APE" means in practice: Cosmopolitan Libc makes the binary self-aware of its host OS at runtime. The same bytes work as a Linux ELF, a macOS Mach-O, and a Windows PE. It's not cross-compilation — it's one file that figures out where it is when it runs.
What Changed in 0.10.0
The last llamafile release was May 2025 — nearly 10 months ago. There was a real worry that Mozilla was quietly letting the project die the way DeepSpeech was shelved years back. That's not what happened. Instead, the team was rebuilding the architecture from scratch.
Here's what's actually new:
GPU acceleration is restored. This is the big one. Previous versions had lost CUDA support as the project's llama.cpp dependency fell behind. 0.10.0 brings back CUDA for Linux (tested) and Metal for macOS ARM64. If you have an NVIDIA GPU on Linux — an RTX 4080 Super is a fantastic option here — inference is no longer CPU-bound.
The llama.cpp core is current. The rebuild synced llamafile with the latest llama.cpp, which unlocks model support that older builds simply lacked: Qwen3.5 for vision tasks and lfm2 for tool calling, plus an Anthropic Messages API endpoint for running Claude Code against a local model.
Multiple interfaces. CLI tool, HTTP server, and terminal chat interface. Multimodal support in the terminal chat. Whisperfile — a single-file speech-to-text tool built on Whisper — ships alongside the main executable.
Models from 0.6B to 27B. Mozilla-AI provides pre-built llamafiles covering different capability profiles: thinking models, multimodal models, tool-calling models. The smallest is the Qwen3 0.6B — a reasonable starting point if you're just testing the workflow.
Warning
Windows GPU support is missing. Windows caps executables at 4GB, which already rules out large bundled models — and GPU acceleration for Windows hasn't landed in this release. Windows users should download the llamafile.exe runtime separately, then point it at a .gguf weights file. Two files instead of one, but it works.
How to Run It in Under 5 Minutes
This is tested on Ubuntu 22.04 and macOS Sonoma. The steps are the same on both.
Step 1: Download a llamafile.
Head to the Mozilla-AI HuggingFace page or the llamafile GitHub releases page. For a first run, the Qwen3 0.6B is small (about 500MB) and fast even on CPU:
wget https://huggingface.co/mozilla-ai/Qwen3-0.6B-llamafile/resolve/main/Qwen_Qwen3-0.6B-Q4_K_M.llamafile
If you want more capability and have the hardware for it, grab a 7B or 8B model instead. The Llama 3 8B Instruct llamafile is a better all-rounder for real work.
Step 2: Make it executable.
chmod +x Qwen_Qwen3-0.6B-Q4_K_M.llamafile
One line. That's the entire "installation."
Step 3: Run it.
./Qwen_Qwen3-0.6B-Q4_K_M.llamafile
It opens a browser tab with a chat interface. Or add --cli to stay in the terminal. Or --server to expose the API without the UI.
Step 4: Use the API (optional).
The server runs on localhost:8080 by default and speaks the OpenAI chat completions format. Point any OpenAI-compatible client at it:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"local","messages":[{"role":"user","content":"What is 17 * 23?"}]}'
That's it. No tokens per minute. No billing. No data leaving your machine.
Tip
zsh users on older macOS: If the llamafile won't execute, run it as sh -c './your-llamafile.llamafile'. zsh versions before 5.9 have a bug that blocks direct execution — a known issue with an easy workaround.
Which Model Should You Actually Use
The 0.6B model is fine for testing the workflow, but you'll outgrow it fast. For real tasks — coding, summarization, document analysis — the sweet spot is the 7B–8B range with a Q4 quantization. That fits in about 5-6GB of memory: system RAM if you're on CPU, or VRAM if you have a GPU.
For tool calling specifically, the lfm2 model that Mozilla includes is worth trying. It handles structured outputs and function calls better than similarly sized general models.
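Since the server speaks the OpenAI chat completions format, tool definitions go over the wire in the standard function-calling shape. A sketch of such a request body — the get_weather tool is invented for illustration, and whether the model actually emits tool calls depends on the model you loaded:

```python
import json

# A tools array in the OpenAI function-calling format. "model": "local"
# follows the llamafile defaults described earlier; get_weather is a
# made-up example tool, not part of llamafile.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
body = json.dumps(payload)  # POST this to /v1/chat/completions
```

A capable model answers with a tool_calls entry naming the function and its JSON arguments, which your code then executes and feeds back as a tool-role message.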
If you want multimodal — sending images along with your text prompts — grab the Qwen3.5 vision variant. It runs heavier but the capability gap is real: you can describe screenshots, analyze charts, or interrogate photos without any cloud service involved.
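If the server follows the OpenAI convention for image inputs — an assumption worth checking against your build's docs — an image rides along as a base64 data URL inside the message content:

```python
import base64
import json

# OpenAI-style multimodal message: a text part plus an image part encoded
# as a base64 data URL. The placeholder bytes below stand in for a real
# PNG read with open("chart.png", "rb").read().
def vision_payload(prompt, png_bytes):
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": "local",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }

payload = vision_payload("Describe this chart.", b"\x89PNG placeholder")
body = json.dumps(payload)  # POST this to /v1/chat/completions
```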
GPU Acceleration: How Much It Actually Matters
On CPU alone, a quantized 7B model generates somewhere around 8–15 tokens per second depending on your machine. That's readable but not fast — a paragraph takes 10–15 seconds to complete.
With a midrange NVIDIA GPU on Linux, that number jumps to 60–90 tokens per second. Conversations feel instantaneous. The difference isn't marginal.
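To put those rates in concrete terms, here is the back-of-envelope arithmetic, assuming a typical paragraph runs around 120 tokens and taking midpoints of the ranges quoted above:

```python
# Rough latency arithmetic for a ~120-token paragraph.
paragraph_tokens = 120
cpu_rate = 10    # tokens/sec on CPU (midpoint of the 8-15 range)
gpu_rate = 75    # tokens/sec on a midrange GPU (midpoint of the 60-90 range)

print(paragraph_tokens / cpu_rate)  # 12.0 seconds on CPU
print(paragraph_tokens / gpu_rate)  # 1.6 seconds on GPU
```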
The 0.10.0 CUDA support is currently marked "tested on Linux" — which is honest, if a little cautious. It uses dynamic GPU library loading, so llamafile will detect your CUDA installation at runtime and fall back to CPU if it doesn't find what it needs. You don't have to configure anything; it either finds CUDA or it doesn't.
Metal on Apple Silicon is solid. M3 and M4 chips handle 7B models well — inference speed is competitive with a dedicated NVIDIA GPU at the lower end. If you're on a MacBook Pro with an M-series chip, the performance here is genuinely good.
If you've been holding off on upgrading your GPU and you want to run 13B+ models at usable speeds, this is the moment to think about it. An RTX 4080 Super with 16GB of VRAM handles quantized 13B models with room to spare, and 70B models via split inference if you're patient.
Llamafile vs Ollama: When to Use Which
Ollama is the dominant tool in this space right now — it crossed 100,000 GitHub stars in 2026, has a broad ecosystem of integrations, and ships good Docker support. It's the right choice if you're building a backend service, scripting inference, or running on a headless server.
Llamafile is the right choice when portability or simplicity is the actual constraint. Air-gapped machine? No internet, no package manager, doesn't matter — copy one file and run. Sharing a model with a non-technical colleague? Send them a file that opens like an app. Moving between machines? One file goes with you.
The deeper difference is architecture philosophy. Ollama is a daemon you install. Llamafile is a file you run. Neither is universally better — but the use cases don't overlap as much as they seem to.
The Verdict
Llamafile 0.10.0 is a genuinely good release. The 10-month gap was worth it — the ground-up rebuild means GPU support actually works now, llama.cpp is current, and the model support covers things that matter: vision, tool calling, speech-to-text via Whisperfile. The "download and run" experience holds up.
The Windows limitation is real and annoying. The GPU support being marked "tested on Linux" rather than fully landed is honest but provisional. But for Linux users and anyone on Apple Silicon, the case for llamafile as your first local LLM tool is strong.
Download a 7B llamafile, run it, and you'll be chatting with a local model within minutes.