
Llamafile 0.10.0: Run Any LLM as a Single File — Now With Real GPU Speed

By Georgia Thomas

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

CPU inference runs at 2-3 tokens per second. That's not a model limitation — that's a punishment. You ask it something and then you wait, watching text appear one letter at a time like it's loading over dial-up. The earlier llamafile rebuilds had exactly this problem. CUDA support was stripped out while Mozilla's team rewrote the core, leaving the project CPU-only for months. Good idea, brutal timing.

That's fixed now. Llamafile 0.10.0 shipped March 20th with CUDA back in and a rebuilt foundation under the hood. It's the version that finally justifies the original promise.

What Llamafile Actually Is (And Why the Single File Part Matters)

The whole concept is that an LLM should run the same way a PDF opens. You download it. You double-click it. It works.

Mozilla built this using Cosmopolitan Libc — a library that lets C programs compile into what they call an APE, an Actually Portable Executable. A single binary that's simultaneously valid on Windows, macOS, Linux, and FreeBSD. The file inspects its own environment at launch and runs natively. No virtualization, no compatibility layer, no interpreter involved.

The practical effect: you download a llamafile and you have the model weights and the inference engine bundled into one file. There's no Python environment to create, no package manager commands to run, no Docker daemon sitting in your system tray burning CPU for no reason. The file is the software.

The project sits at 23,800 GitHub stars, so the idea clearly resonates. But for a while the execution was lagging badly behind.

[!INFO] What's an APE? APE (Actually Portable Executable) is a binary format from the Cosmopolitan Libc project. The file is simultaneously a valid ZIP archive, a valid ELF binary for Linux, a valid PE binary for Windows, and a valid Mach-O binary for macOS. The OS picks the format it understands and runs it natively — no compatibility shim required.

What's New in 0.10.0

Mozilla rebuilt llamafile from scratch. The goal was a "polyglot build" of llama.cpp — keeping llamafile's portability while staying synced with upstream llama.cpp. Earlier builds had fallen months behind, which meant missing models, accumulating bugs, and the project starting to look like it might be slowly abandoned. (Mozilla has a history here — they've killed more projects than most companies have shipped.)

Apparently not this time.

CUDA is back. Metal support for macOS ARM64 returned in December 2025. CUDA on Linux came back in February 2026. The 0.10.0 release is the first stable package shipping both together.

Model support expanded substantially. The new build supports:

  • Qwen 3.5 models including vision capabilities
  • LFM-2 for tool calling and structured output
  • Anthropic Messages API compatibility — you can point Claude Code at a local model and run it completely offline

That last one deserves a moment. Anthropic Messages API support means tools built for Claude work with local models as drop-in replacements. You're not just running a chatbot anymore. You're running the same infrastructure stack, locally, for free, with no data leaving your machine.
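As a sketch of what that looks like in practice: Claude Code reads its API endpoint from environment variables, so redirecting it at a local llamafile server is a matter of overriding them. The variable names below follow Claude Code's documented overrides; the token value is a placeholder, on the assumption that the local server does not check authentication.

```shell
# Point an Anthropic-API client (e.g. Claude Code) at the local
# llamafile server instead of Anthropic's hosted endpoint.
# ANTHROPIC_BASE_URL is Claude Code's documented endpoint override;
# the auth token is a dummy value for a local server.
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_AUTH_TOKEN="local-placeholder"
```

With those set, the client sends its Messages API traffic to localhost and never reaches the network.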

Whisperfile also returns — the same format applied to Whisper speech-to-text. One file, no setup, local audio transcription. If you've been paying for Whisper API credits, the math on switching is pretty favorable.

Tip

Multiple UIs in one file. The llamafile server ships a CLI tool, an HTTP server with OpenAI-compatible API, and a terminal chat interface — all in the same executable. Curl, browser, or terminal: pick whatever fits your workflow. Nothing extra to install.
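In practice the three interfaces are just different invocations of the same file. The flag names below are illustrative of llamafile's conventions; verify the exact names for your version with `--help`.

```shell
# One executable, three interfaces (flags are illustrative; check
# ./model.llamafile --help for the exact names on your version):
./model.llamafile                        # default: opens the browser chat UI
./model.llamafile --chat                 # interactive terminal chat
./model.llamafile --server --port 8080   # OpenAI-compatible HTTP API
```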

Getting Started in Three Steps

Total setup time: under five minutes, mostly waiting on the download.

Step 1. Grab a pre-built llamafile. Mozilla's GitHub releases page hosts pre-built models ranging from 0.6B to 27B parameters. Pick based on your VRAM. For an RTX 3090 with 24GB, go straight to the Qwen 3.5 27B.

Step 2. Make it executable (Linux/macOS):

chmod +x qwen3.5-27b.llamafile
./qwen3.5-27b.llamafile

On Windows: rename it by adding .exe to the end. That's it. That is the entire installation process for Windows.

Step 3. It opens a browser tab. You're talking to a local LLM.

For API access instead:

./qwen3.5-27b.llamafile --server --port 8080

Point any OpenAI-compatible client — Open WebUI, Continue.dev, anything — at localhost:8080 and it works without configuration.
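For a quick smoke test without any client at all, a raw request works too. This is a minimal sketch assuming the server from the command above is running; the path and payload follow the OpenAI chat-completions convention, and the `model` field can be any name since the server has only one model loaded.

```shell
# Minimal chat request against the local OpenAI-compatible endpoint.
# Assumes the llamafile server is listening on localhost:8080.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello in one word."}]
      }'
```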

To bring your own model from Hugging Face rather than using Mozilla's pre-built files:

./llamafile -m your-model.gguf

Any GGUF format file works. You're not locked into a curated list.
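When bringing your own GGUF, you may also want to control GPU offload explicitly. The flag below follows llama.cpp's `-ngl` (n-gpu-layers) convention, which llamafile inherits; treat it as a sketch and confirm against `./llamafile --help`.

```shell
# Run a custom GGUF with all layers offloaded to the GPU
# (-ngl follows llama.cpp's n-gpu-layers convention; a large
# value like 999 means "offload as many layers as fit").
./llamafile -m your-model.gguf -ngl 999
```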

Warning

CUDA is Linux-only in 0.10.0. macOS gets Metal acceleration (fast). Windows CUDA support is still pending — you'll get CPU inference there for now, which means that 2-3 tok/s problem is alive on Windows until the next release. Check the GitHub release notes before assuming GPU acceleration on your platform.

RTX 3090 + Llamafile — The Pairing Worth Knowing

The RTX 3090 is an odd GPU in 2026. Ampere architecture, three generations old, draws 350W at load. But it has 24GB of GDDR6X, and that VRAM number is what determines which models you can actually run at full GPU speed.

On the used market, RTX 3090s are landing at $650-$850. That 24GB is the same amount as GPUs costing two or three times more. The newer cards are faster per watt, but they don't give you more VRAM at this price point — and for inference, VRAM is the bottleneck. If the model doesn't fit, everything else is irrelevant.

24GB is where things get genuinely useful. Qwen 3.5 27B at Q4_K_M quantization uses about 16.7GB, leaving 7GB of headroom for context. Real-world performance on this setup: 35 tokens per second, completely flat from a 4K context window all the way out to 262K. That last detail is the surprising part — most setups slow as context fills. This doesn't.
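The arithmetic behind that headroom number is worth seeing once. A rough rule of thumb (an assumption, not a spec) is that Q4_K_M lands around 4.8 effective bits per weight once the mixed-precision layers are averaged in:

```shell
# Back-of-envelope VRAM estimate for a 27B model at Q4_K_M.
# 4.8 bits/weight is a rough rule of thumb for this quant; the
# article's 16.7GB figure includes additional runtime overhead.
awk 'BEGIN {
  params = 27e9                       # 27B parameters
  bits   = 4.8                        # effective bits per weight (assumed)
  gb     = params * bits / 8 / 1e9    # bits -> bytes -> GB
  printf "weights: %.1f GB\n", gb
}'
# → weights: 16.2 GB
```

That leaves the remaining ~7GB of a 24GB card for the KV cache, which is what the context window consumes as it fills.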

Smaller models are faster. Llama 3.1 8B runs at 40-60 tok/s on a 3090, which is fast enough that you're reading slower than it generates. CodeLlama 13B sits around 15-25 tok/s — fine for code generation where you're reviewing each block anyway.

The combination of 24GB VRAM and llamafile is specifically what makes a home rig feel less like a science experiment. The GPU is underpriced because it's not the newest card. The software now handles GPU acceleration without ceremony. The gap between "gaming PC" and "running 27B parameter models locally" is one GPU purchase and one file download.

Which Models to Actually Run

The pre-built llamafiles from Mozilla are a reasonable starting point. But since GGUF files work directly, you're not limited to whatever Mozilla decided to package.

For an RTX 3090:

  • Qwen 3.5 27B Q4_K_M — best all-around at this VRAM level. Coding, reasoning, 262K native context.
  • Llama 3.1 8B — faster, uses ~5GB, solid for most everyday tasks.
  • LFM-2 — for agentic work where tool calling accuracy matters.

Avoid anything above 30B at Q4 on a single 3090. You'll start hitting partial CPU offloading and speed drops off a cliff — 35 tok/s becomes 8 tok/s and suddenly it feels like CPU inference again.

Llamafile vs. Ollama — The Honest Comparison

Ollama is more popular. Most tutorials point there. It's genuinely easy: three commands, managed downloads, clean API, solid GPU support across platforms.

But Ollama runs as a daemon. It installs into your system. There's a service starting on boot. None of that is bad; it's just a different model.

Llamafile has no daemon. It doesn't install anywhere. Download a file, run it, delete it when you're done. No ollama serve to remember. No systemd unit. No background process you forgot about. The file is the service, and when you close it, it's gone.

For air-gapped environments, reproducible builds, or situations where you actually care what's running on your machine — llamafile's approach is cleaner. You can put the model on a USB drive, hand it to someone, and they run it with zero setup. That's not a use case Ollama handles well.

The tradeoff: Ollama's model library and CLI make daily use smoother. ollama pull qwen2.5:32b is a better experience than navigating Hugging Face for the right GGUF file. Windows support is more mature. The community tooling is bigger.

Both are good tools. Llamafile is the right choice when portability and transparency actually matter. Ollama is the right choice when you want the fastest path from zero to running and you're not moving the model anywhere.

Verdict

Llamafile 0.10.0 is the version this project should have shipped a year ago. CUDA returning is what moves it from "interesting demo" into a real inference option worth considering alongside Ollama and LM Studio. The rebuilt core means it now tracks upstream llama.cpp properly — model support will stay current instead of falling months behind with every release cycle.

If you have an NVIDIA GPU on Linux and you've been running Ollama purely because it was the default choice — download one llamafile and try it. The argument for more complexity than that is hard to make.

The RTX 3090 + Llamafile pairing is the specific recommendation here: $650-$850 for a used 3090, free software, 35 tokens per second on a 27B parameter model. That's a local AI workstation that's actually usable daily — not a benchmark rig, not a demo, a real working setup.

