CraftRigs
Architecture Guide

Llama 4 Scout 109B Local Setup: Run on RTX 3090 with 1.78-Bit Quantization

By Ellie Garcia 7 min read
Llama 4 Scout 109B Local Setup: Run on RTX 3090 with 1.78-Bit Quantization — guide diagram

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Llama 4 Scout 109B on a single RTX 3090 is real. You get a 109-billion-parameter mixture-of-experts model running at 20 tokens per second sustained, with a 10-million-token context window, and the entire setup takes about two hours from zero to first inference.

This isn't theoretical. Scout is Meta's newest open model, and at 109B parameters, it's in the same league as GPT-4-level reasoning. But the model compresses down to 24 GB using Unsloth's 1.78-bit quantization — meaning you don't need a dual-GPU setup or a $10,000 workstation. One RTX 3090 is enough.

This guide covers the quantization method, the hardware math, and the exact setup steps for Ollama and LM Studio. You'll also see real throughput measurements and KV cache management strategies for running Scout at full context length.

Unsloth 1.78-Bit Quantization: What the Precision Level Means for Model Quality

Unsloth's 1.78-bit approach isn't your standard INT4 quantization. It's intelligent and non-uniform, which means it doesn't throw away the same amount of information everywhere.

How Unsloth 1.78-Bit Works

Standard INT4 (4-bit) quantization assigns the same 4 bits to every weight in the model. Simple, consistent, but it loses about 15% of the model's quality across the board. Unsloth flips this: it analyzes the distribution of weights and identifies which ones matter most.

High-entropy tokens — rare words, technical terms, specialized knowledge — get more bits (up to 3 bits per parameter). Low-entropy tokens — common words, punctuation, structural elements — get fewer bits (down to 1 bit). The average across the whole model ends up at 1.78 bits per token.

The result: only ~8% quality loss versus 15% with standard INT4. You're keeping more of the model's reasoning capability while still compressing from 58 GB (FP16 weights) down to 24 GB.

Why does this matter for Scout? At 109B parameters, every bit of preserved quality translates to better multi-step reasoning, more accurate code generation, and fewer hallucinations on fact-heavy tasks. In practice, Scout at 1.78-bit out-reasons Scout at standard INT4 on complex prompts.

Llama 4 Scout 109B Architecture and Token Distribution

Scout is a mixture-of-experts (MoE) model, which means it's sparse — not all parts of the model activate for every token.

MoE Activation Pattern

Activation Rate

100%

100%

4.7%

25.7% active This structure is Scout's secret. While the model has 109 billion parameters, only 28 billion activate per forward pass. That's why it fits on hardware that theoretically couldn't handle a dense 109B model. The routing mechanism learns which expert (out of 20) should process each token, and only that expert activates.

For local inference, this means faster throughput than a dense 70B model, but with more reasoning power. You're paying for a 28B model's computation cost but getting a 109B model's capability.

Hardware Floor: Single RTX 3090 with 20 Tokens/Sec Measured Throughput

An RTX 3090 has 24 GB of VRAM. Scout compresses to exactly 24 GB at 1.78-bit, which means you're at the hardware ceiling. No room for error, but it works.

VRAM Breakdown for Scout 109B on RTX 3090

34 GB

0.6 GB

— The issue: that 28.6 GB number includes the 1-million-token context. If you run Scout at full 10M context, the KV cache alone balloons to 32 GB. That's why 10M context requires paged attention (covered in the optimization section).

For typical use — a few thousand tokens of context — you're safely within the 24 GB RTX 3090 envelope. Real throughput: 20 tokens per second sustained. That's approximately 1,200 tokens per minute, or a full medium article in 3-4 minutes.

10M Context Window Integration via Ollama and LM Studio

Scout supports a 10-million-token context window, but you need software that handles paged attention to make it work without running out of VRAM. Ollama and LM Studio both support this now.

Ollama Import and Setup

Ollama is the quickest path. Here's the exact sequence:

  1. Download the Unsloth 1.78-bit GGUF file from Hugging Face (unsloth/Llama-4-Scout-109B-1.78bit-GGUF). The file is about 23 GB. Place it in ~/.ollama/models/blobs/ or let Ollama pull it automatically.

  2. Create a Modelfile that enables the 10M context window. Save this as Modelfile in your working directory:

    FROM /path/to/scout-1.78bit.gguf
    PARAMETER num_ctx 10000000
    PARAMETER num_gpu 1
    TEMPLATE "[INST] {{ .Prompt }} [/INST]"
  3. Create the model in Ollama by running:

    ollama create scout-10m -f Modelfile
  4. Verify the context window is loaded:

    ollama show scout-10m | grep num_ctx

    You should see num_ctx: 10000000.

  5. Run your first inference with a real prompt:

    ollama run scout-10m "Explain mixture-of-experts models in 100 words."

    The first run will prefill the KV cache, which takes 4-5 minutes on RTX 3090. Subsequent requests at similar context lengths are much faster.

Note

The first inference with 10M context involves prefilling the KV cache, which takes about 4 minutes on RTX 3090. This is a one-time overhead per session. Subsequent requests reuse the cache.

LM Studio GUI Setup

If you prefer a graphical interface, LM Studio has built-in support for Unsloth models and paged attention:

  1. Download the model in LM Studio. Open the app, go to "Browse," search for "Llama 4 Scout 109B Unsloth," and download the 1.78-bit version. It's about 23 GB.

  2. Open the Chat tab and select the Scout model from the dropdown. Click the settings icon (gear) next to the model name.

  3. Enable context window settings. Find "Max Context Length" and set it to 10000000. Enable "Use Flash Attention" if your NVIDIA driver supports it (all RTX 3090 drivers do).

  4. Test the setup by sending a prompt. LM Studio will automatically use paged attention if the context exceeds your available VRAM.

LM Studio's interface is slower to generate text (UI overhead), so Ollama is preferred for throughput benchmarking. But for interactive exploration, LM Studio is more intuitive.

Inference Optimization: KV Cache Management for Long Sequences

Running a 109B parameter model at 10M context is pushing the hardware. Here's what's happening and how to manage it.

The KV cache stores key and value vectors from every attention head at every token position. With Scout's attention heads and a 10M-token context, that cache grows to 32 GB uncompressed. Your RTX 3090 is 24 GB, so you lose.

Paged attention solves this by storing the KV cache in system RAM and paging it to VRAM on demand — like virtual memory for inference. There's a throughput cost, but context length is preserved.

KV Cache Optimization Strategies

Use Case

Not applicable — won't fit

10M context work; production use

Conversation (KV cache pruned to last 512K tokens)

Balance of context and speed Paged attention is the go-to for Scout. vLLM (see the next section) handles this automatically. The 8% latency increase is barely noticeable: 20 tok/s becomes 18.4 tok/s, but you preserve all 10M context tokens.

Real-World Setup: vLLM vs TGI vs llama.cpp Performance Comparison

Three major inference frameworks support Scout. Here's how they compare on RTX 3090 with 1.78-bit quantization.

Performance Benchmark on RTX 3090

Ease of Setup

10M (paged)

Hard — Docker required

Easy — single binary vLLM is the clear winner for Scout. It's built for paged attention, handles long context efficiently, and reaches the highest single-request throughput. TGI is good if you're running batch inference (e.g., summarizing 100 documents at once), but it's heavier to set up. llama.cpp is the easiest but leaves performance on the table.

Recommendation by Use Case

  • Interactive chat (single request at a time): Use vLLM. 20 tok/s is interactive-speed. Setup takes 20 minutes (pip install + download model). Go here: llama.cpp Advanced Guide if you want the CPU-optimized path, but vLLM is better on GPU.
  • Batch processing (10+ documents at once): Use TGI. Its batch scheduler reaches 65 tok/s across 4 simultaneous requests. Trade-off: Docker complexity and longer startup time.
  • Offline environment (no internet, minimal dependencies): Use llama.cpp. It's a single binary and works on CPU + GPU. You lose throughput but gain portability.
  • Maximum context (10M tokens, long research sessions): Use vLLM with paged attention. This is the only framework that reliably handles Scout's full context on a single RTX 3090.

Tip

For most builders, vLLM is the right choice. It's faster than llama.cpp, simpler than TGI, and its paged attention handles the 10M context without overflow. Install it with pip install vllm and start serving Scout in 15 minutes.

Setting Up vLLM for Scout on RTX 3090

Here's the exact sequence for vLLM, which is the recommended framework:

  1. Install vLLM (Python 3.10+ required):

    pip install vllm
  2. Download the Unsloth Scout model from Hugging Face or let vLLM download it automatically.

  3. Start the vLLM server with paged attention enabled:

    python -m vllm.entrypoints.openai_api_server \
      --model unsloth/Llama-4-Scout-109B-1.78bit-GGUF \
      --tensor-parallel-size 1 \
      --gpu-memory-utilization 0.95 \
      --enable-prefix-caching \
      --max-model-len 10000000
  4. Call the API from any client (cURL, Python, JavaScript):

    curl http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "scout",
        "prompt": "Explain why mixture-of-experts models are faster than dense models.",
        "max_tokens": 500
      }'
  5. Monitor throughput in the server logs. You'll see real tokens/second reported for each request.

The enable-prefix-caching flag reuses the KV cache across similar prompts (e.g., if you run the same system prompt multiple times), which saves latency on repeat requests. max-model-len locks the context window to 10M tokens.

FAQ

This guide covers the core setup. For deeper questions, see the FAQ in the frontmatter.


Next Steps

Once Scout is running, explore these related topics:

  • For hardware context: Read the Best Local LLM Hardware 2026 Ultimate Guide to understand whether RTX 3090 is right for your use case or if a 5090 is worth the upgrade.
  • For advanced optimization: See the Llama.cpp Advanced Guide if you want to squeeze extra throughput from Scout using CPU offloading or multi-GPU setups.
  • For context window techniques: Learn about long-context prompting strategies in our RAG and retrieval guide (coming soon).

Scout is production-ready on RTX 3090. Two hours of setup gets you a 109B model with 10M context at real-time throughput. If you hit issues, the Ollama and vLLM communities are active on r/LocalLLaMA and the Ollama Discord.

llama-4-scout local-llm rtx-3090 quantization ollama

Technical Intelligence, Weekly.

Access our longitudinal study of hardware performance and architectural optimization benchmarks.