
Claude Mythos Local Hardware: How Much VRAM You Actually Need for Frontier Models

By Charlotte Stewart

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Anthropic accidentally published close to 3,000 unpublished assets to the public internet last week. A CMS misconfiguration was all it took. The March 26, 2026 leak confirmed what r/LocalLLaMA had been speculating for months: Claude Mythos — codenamed "Capybara" — is real, training is complete, and Anthropic's own draft announcement calls it "by far the most powerful AI model we've ever developed."

The community's reaction was predictable: a flood of "what GPU do I need?" threads with answers ranging from "RTX 5090 is fine" to "you need an H100." Most of them are wrong in one direction or the other.

TL;DR: No official parameter count has been published. Community estimates range from 140B to significantly larger. If Mythos is 140B-class, Q4_K_M quantization needs an estimated 85-100GB VRAM — which eliminates every single-consumer-GPU solution on the market today. The realistic hardware paths are dual RTX 5090s (64GB, Q3_K_M capable, ~$7,500-$9,000 all-in) or dual RTX 6000 Ada units (96GB, Q4_K_M capable, ~$14,000-$17,000). Here's the math behind those numbers.

What the Mythos Leak Actually Tells Us

The CMS misconfiguration exposed a near-complete draft announcement for Claude Mythos. Here's what was confirmed:

  • Mythos is a new tier above Opus — not just a better Opus, but a "Capybara" category that didn't exist before. The framing matters for hardware planning.
  • Training is complete. Anthropic has a small early-access group testing the model now.
  • "Step change" capabilities — dramatically higher scores than any prior Anthropic model on coding, reasoning, and cybersecurity benchmarks.
  • It's expensive to run. The leak notes the model isn't ready for general release partly due to inference cost. That's a meaningful signal about scale.

Here's what was NOT in the leak: parameter count, VRAM requirements, or architecture details. The 140B figure circulating in the community is extrapolation from scaling patterns — and given that Anthropic describes this as an entirely new tier, it could be considerably larger. Some community estimates run far higher. Plan for 140B as the conservative floor, not the ceiling.

Note

No parameter count has been officially confirmed by Anthropic. The VRAM estimates in this guide use 140B as the planning baseline. If Mythos is larger, every requirement scales proportionally upward.

When Does Mythos Actually Ship?

Training is done. Early access is active. Anthropic typically runs 8-16 weeks between "training complete" and general release when no safety complications surface. Q2 2026 (May-June) is plausible; Q3 2026 remains possible. You have roughly 6-12 weeks to make hardware decisions before weights start circulating in any form.

The VRAM Scaling Problem

This is where most GPU advice threads go wrong. People see "140B = 2× 70B parameters" and assume 2× VRAM. It doesn't work that way.

The scaling formula at Q4 quantization: approximately 0.56 GB per billion parameters for weights alone. Add KV cache on top, which scales with context length and model width. For reference:

Model Size    Total Estimate (Q4_K_M weights + KV cache)
8B            ~5.5 GB
70B           ~43 GB
140B          ~86-100 GB

The RTX 5090 has 32GB. The RTX 6000 Ada has 48GB. Neither fits a 140B model at Q4_K_M. At Q3_K_M (0.42 GB/B), a 140B model needs roughly 59-70GB for weights plus KV cache — still above the 48GB single-card ceiling.

That's the core problem. There is no single consumer GPU in March 2026 that runs a 140B-class frontier model at Q3_K_M or better.
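To make that arithmetic reproducible, here's a minimal sketch using the per-billion figures above. The flat 8GB KV-cache allowance is an assumption for illustration; the real cache grows with context length and model width.

    # Rough VRAM estimate: weights plus KV cache, using this article's
    # per-billion-parameter weight figures. The flat 8GB KV-cache allowance
    # is an assumption; real cache size scales with context length and width.
    GB_PER_B = {"Q4_K_M": 0.56, "Q3_K_M": 0.42}

    def estimate_vram_gb(params_b: float, quant: str, kv_cache_gb: float = 8.0) -> float:
        """Weight footprint at the given quant level plus an assumed KV-cache allowance."""
        return params_b * GB_PER_B[quant] + kv_cache_gb

    for quant in GB_PER_B:
        print(f"140B @ {quant}: ~{estimate_vram_gb(140, quant):.0f} GB")
    # 140B @ Q4_K_M: ~86 GB  (article estimate: 86-100 GB)
    # 140B @ Q3_K_M: ~67 GB  (article estimate: 59-70 GB)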

Quantization Tradeoffs at This Scale

Q4_K_M retains 97-99% of FP16 quality. That's the floor for frontier inference — the whole point of running a Mythos-tier model is the quality ceiling. Below Q4:

  • Q3_K_M: ~85% of FP16 quality. Usable for code and math; noticeable degradation on nuanced reasoning and extended context — the precise tasks where a frontier model earns its keep.
  • Q2_K_M: ~75% quality. You'd be paying frontier inference costs for mid-tier output.

The practical recommendation: plan hardware for Q3_K_M at minimum. Don't buy a rig that forces Q2 on a model you specifically chose for its reasoning ceiling. See our quantization comparison guide for the full tradeoff breakdown.
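If your VRAM is fixed, the same math runs in reverse: pick the best quant level a given pool supports. A sketch using the article's Q4/Q3 figures; the Q2_K_M footprint (~0.33 GB/B) is an assumed value, not from any published benchmark.

    # Pick the highest-quality quant that fits a VRAM budget. Q4/Q3 figures
    # are from this article; the Q2_K_M footprint (~0.33 GB/B) is an assumed
    # value for illustration.
    QUANTS = [("Q4_K_M", 0.56), ("Q3_K_M", 0.42), ("Q2_K_M", 0.33)]  # best first

    def best_quant(params_b: float, vram_gb: float, kv_cache_gb: float = 8.0) -> str | None:
        for name, gb_per_b in QUANTS:
            if params_b * gb_per_b + kv_cache_gb <= vram_gb:
                return name
        return None  # doesn't fit at any of these quant levels

    print(best_quant(140, 96))                   # dual RTX 6000 Ada -> Q4_K_M
    print(best_quant(140, 64, kv_cache_gb=4.0))  # dual RTX 5090, 8K context -> Q3_K_M

Note that the 64GB case only clears Q3_K_M with a modest KV allowance; that's the 8K-context caveat in the dual-5090 section below.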

Single-GPU Reality Check

RTX 5090 (32GB GDDR7)

The RTX 5090 launched at $1,999 MSRP. As of March 2026, actual market prices run $2,900-$3,500+ for AIB models due to GDDR7 supply constraints and demand — the Founders Edition stays closest to MSRP but sells out in minutes.

At 32GB, the RTX 5090 can't run Llama 3.1 70B at Q4_K_M without CPU offloading — the weights alone (~39GB) exceed available VRAM. When offloading kicks in on 70B models, token speed collapses to 1-2 tok/s. The card handles 70B at Q3_K_M with borderline headroom, and excels at anything under 30B. For a 140B model, the RTX 5090 alone isn't a Mythos card — it's the second GPU in a Mythos rig.
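If you want to see the offload cliff for yourself, a minimal llama-cpp-python sketch of partial GPU offload looks like this; the model filename and layer count are illustrative, not a tested configuration.

    # Minimal llama-cpp-python sketch of partial GPU offload on a 32GB card.
    # The model filename and layer count are illustrative, not a tested config.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical local file
        n_gpu_layers=48,  # layers that fit in 32GB; the rest run on CPU, slowly
        n_ctx=8192,       # KV cache for this context competes for the same VRAM
    )
    out = llm("Why does CPU offloading hurt token speed?", max_tokens=128)
    print(out["choices"][0]["text"])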

RTX 6000 Ada (48GB GDDR6)

The current single-GPU champion for 70B inference. At ~18 tok/s on Llama 3 70B Q4_K_M (as of Q1 2026 benchmarks), the RTX 6000 Ada is the card that lets you run today's frontier models without compromise. Used market pricing sits at roughly $7,000-$8,500 in March 2026; availability is limited and mostly through enterprise channels.

But 48GB is 48GB. A 140B model at Q3_K_M needs 65-70GB. A single RTX 6000 Ada drops you to Q2_K_M territory for Mythos, which is an uncomfortable quality floor for a $7,000+ investment.

Warning

A single RTX 6000 Ada (48GB) cannot run a 140B-class model at Q3_K_M or Q4_K_M. It's the best solo card for current 70B workloads — but it sits below the frontier VRAM threshold. Verify this against your actual use case before committing $7,000+ to a single-card build.

RTX 5080 Super (24GB) — Unreleased

The 5080 Super is everywhere in the forums, but it hasn't shipped. Expected Q3 2026, leaked at 24GB GDDR7, estimated $999-$1,299. At 24GB it's a strong card for 14B-30B models and feasible for 70B at Q3 — not a Mythos card regardless of when it launches. Don't plan a frontier rig around it.

Multi-GPU: The Only Consumer Path to Frontier Inference

For a 140B-class model at any quantization level worth running, multi-GPU isn't optional.

Dual RTX 5090 (64GB Combined)

Two RTX 5090s at current market pricing: $5,800-$7,000 for the GPUs. Add $600-$900 for a PSU upgrade to handle the combined 1,150W TDP. Total built system: roughly $7,500-$9,000.

At 64GB combined VRAM, this configuration handles a 140B model at Q3_K_M with 8K context windows — the weights (~59GB) plus typical KV cache fit, though long-context sessions push against the ceiling. On today's 70B models, dual RTX 5090 in Ollama achieves 27-33 tok/s (tested with DeepSeek-R1 70B, Q1 2026, databasemart.com).

One critical caveat: Ollama distributes VRAM across multiple GPUs but doesn't parallelize computation — you get more addressable memory, not a 2× speed multiplier. For actual tensor parallelism on 140B models, you need vLLM with explicit multi-GPU partitioning. The RTX 5090 also lacks NVLink, so inter-GPU communication runs over PCIe Gen 5 — adequate for most inference workloads but a ceiling under high-batch scenarios. Full framework setup in our tensor parallelism guide.
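For reference, explicit tensor parallelism in vLLM is a one-parameter change. This sketch assumes a hypothetical two-GPU box and a placeholder model ID, since no Mythos checkpoint exists as of this writing.

    # Sketch of explicit two-GPU tensor parallelism in vLLM. The model ID is
    # a placeholder; no Mythos checkpoint exists as of this writing.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="org/placeholder-140b",  # hypothetical HF model ID
        tensor_parallel_size=2,        # shard every weight matrix across both GPUs
        gpu_memory_utilization=0.90,   # leave headroom for KV cache
        max_model_len=8192,            # 8K context keeps the cache in budget
    )
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)

Unlike Ollama's layer splitting, tensor parallelism puts both cards to work on every token, which is where the throughput gain comes from.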

Tip

Dual RTX 5090 vs dual RTX 6000 Ada comes down to one question: is Q3_K_M acceptable, or do you need Q4_K_M? Dual 5090s (64GB, ~$7,500-$9,000) can run 140B at Q3. Dual 6000 Adas (96GB, ~$14,000-$17,000) run 140B at Q4. The quality gap is real for nuanced reasoning tasks. The price gap is $5,000-$7,000. Only you know if that delta is justified by your workload.

Dual RTX 6000 Ada (96GB Combined)

Two RTX 6000 Ada cards run $14,000-$17,000 for the pair. That's enterprise-grade spend for a home lab. What you get: 96GB combined VRAM that comfortably fits a 140B model at Q4_K_M with room for generous context windows. At ~18 tok/s per card on 70B Q4_K_M, a properly configured dual-card tensor parallel setup should exceed 30 tok/s for distributed inference. This configuration handles anything that ships in the 2026-2027 model cycle without revisiting the hardware question.

The Price-Performance Summary

Configuration        VRAM    Price (March 2026)    70B tok/s
RTX 5090             32GB    $2,900-$3,500         ~1-2 (offloaded); ~30 (30B fits)
RTX 6000 Ada         48GB    $7,000-$8,500 used    ~18 (clean Q4_K_M)
Dual RTX 5090        64GB    ~$7,500-$9,000        ~27-33
Dual RTX 6000 Ada    96GB    ~$14,000-$17,000      ~35+ (tensor parallel est.)
MacBook Pro 96GB     96GB    $3,999                ~2-4 (slow)

Prices as of March 2026. Dual RTX 5090 tok/s on 70B via Ollama; dual RTX 6000 Ada figure is a tensor parallel estimate — not independently benchmarked on 140B as of this writing. Mythos not yet publicly available.

The MacBook Pro number deserves acknowledgment. At $3,999 for 96GB unified memory, it offers the most accessible path to Q4_K_M Mythos on paper. The reality is 2-4 tok/s for a 140B model — viable for batch workflows, painful for interactive use. For GPU-accelerated Q4_K_M inference, the dual RTX 6000 Ada at $14,000+ is the only practical path that doesn't compromise on quality.

Upgrading from Your Current 70B Rig

If you're running an RTX 4090 today (24GB, used market ~$1,500-$2,200 as of March 2026): your card handles 70B at Q3_K_M with partial offloading. For Mythos, the 4090 isn't the problem — it's a viable second card. Pairing it with an RTX 5090 ($1,999 MSRP) gets you 56GB combined, which puts a 140B model at Q3_K_M within reach at 8K context (Q3 weights alone run ~59GB, so expect a few layers to land on CPU). That's a $2,000-$3,500 upgrade, a fraction of a full rebuild. If you eventually replace the 4090 with a second 5090, its $1,500-$2,200 resale value offsets most of the cost. Worth exploring before committing to a dual-5090 build from scratch.
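For that mixed 24GB + 32GB pair specifically, llama.cpp can weight the split per card. A sketch, with the GGUF filename and ratio as assumptions:

    # Sketch of an uneven VRAM split across an RTX 4090 (24GB) and RTX 5090
    # (32GB) in llama-cpp-python. Filename and ratio are illustrative.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mythos-140b.Q3_K_M.gguf",  # hypothetical community conversion
        n_gpu_layers=-1,            # push all layers to GPU; lower this if
                                    # the weights overflow the 56GB pool
        tensor_split=[0.43, 0.57],  # ~24/56 and ~32/56 of weights per card
        n_ctx=8192,
    )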

For the full spec comparison between the RTX 5090 and RTX 6000 Ada, see our side-by-side breakdown.

Decision Framework

Want Q4_K_M Mythos, budget isn't the primary constraint: Dual RTX 6000 Ada (96GB). Nothing else consumer-accessible gets you there. Verify framework support and weight availability before buying — Mythos weights may lag the API launch by weeks.

Want Q3_K_M Mythos at the best consumer price point: Dual RTX 5090 (~$7,500-$9,000 all-in). Strong 70B performance today, Q3 frontier capability when Mythos ships, good resale path if hardware takes another leap.

Own an RTX 4090, want to extend without full replacement: Add one RTX 5090 ($1,999 MSRP). The resulting 56GB combined VRAM puts Q3_K_M within reach for Mythos-class models (with minimal CPU offload, since Q3 weights alone run ~59GB) and costs a fraction of a full rebuild.

Building fresh for 70B today, future-proofing toward Mythos: Single RTX 6000 Ada. It won't run Mythos at Q3+, but it's the cleanest solo card for current 70B inference at ~18 tok/s Q4_K_M — and positions you to add a second unit when Mythos weights become accessible locally.

Waiting for a single GPU that changes the frontier equation: You're waiting for consumer 80GB+ VRAM. No such card is on any current announced roadmap. For inference framework decisions while you plan, see our Ollama vs. vLLM comparison.

FAQ

Will my RTX 4090 (24GB) run Mythos? Technically yes — at Q2_K_M with CPU offloading, you'll get Mythos weights loaded. Expect 4-8 tok/s with noticeable quality degradation. The RTX 4090 is better used as the second card in a multi-GPU stack than as a solo Mythos runner. For the full VRAM math behind this, see our VRAM scaling explainer.

Is 48GB VRAM future-proof? For 70B models through 2026-2027, yes — comfortably. For frontier models above 70B, 48GB is already showing limits against the 140B speculation. Single 48GB cards are the current ceiling for 70B workloads, not the floor of the frontier stack.

Should I wait for the H200? H200 (80GB HBM3e) exists in datacenter deployments at $25,000-$40,000+ new. There's no consumer-accessible path at reasonable pricing. If you're a business deploying frontier models at scale, the unit economics may work. For home lab and power-user builds, it's not a realistic option — don't plan around it.

What's the cheapest single card that can run Mythos at Q3_K_M? Nothing available in the consumer market as of March 2026. A used NVIDIA A100 80GB PCIe (~$8,000-$12,000 on the secondary market) gets within range of Q4_K_M for a 140B model and handles Q3 cleanly. Availability is inconsistent and it's a server card — power and cooling requirements matter. For most builders, dual RTX 5090 remains the more practical path at similar cost.

Can I run Mythos on dual RTX 5090s the moment weights drop? Likely yes, at Q3_K_M with 8K context windows and explicit multi-GPU config in vLLM or llama.cpp. Ollama's VRAM splitting gives you access to 64GB but won't provide tensor parallelism. Weight availability in GGUF or safetensors format may lag the API release — community conversions typically take 1-3 weeks after official launch.

Verdict

Claude Mythos is real and closer than most people's GPU budgets are ready for. The parameter count is unknown — the 140B figure is the community's best extrapolation from scaling patterns, but "new tier above Opus" and "by far our most powerful" language both suggest the ceiling is higher than that number implies. Plan with headroom: roughly 64GB is the Q3_K_M floor for a 140B model, and 80-100GB covers Q4_K_M at that size, or Q3_K_M if Mythos lands meaningfully larger.

In March 2026, the viable hardware options are: dual RTX 5090s for Q3_K_M Mythos at $7,500-$9,000 all-in, or dual RTX 6000 Ada for uncompromised Q4_K_M at $14,000-$17,000. Single-GPU frontier inference isn't a category that exists yet — not until consumer cards clear 80GB of VRAM, which no announced product delivers.

Buy what your actual workflow justifies. If you're running frontier models daily for high-quality reasoning work, the Q4_K_M path earns back the investment. If you're experimenting and occasionally pushing boundaries, dual RTX 5090s cover Q3_K_M and are outstanding for everything up to frontier scale today.

Prices as of March 2026. RTX 5090: MSRP $1,999, market $2,900-$3,500+. RTX 6000 Ada used: $7,000-$8,500. RTX 5080 Super: unreleased as of this writing, specs from leaks only. Claude Mythos parameter count unconfirmed — all VRAM estimates are scaling law extrapolations. Benchmarks: 70B tok/s figures from databasemart.com Ollama tests (Q1 2026) and bestgpusforai.com RTX 6000 Ada vs 4090 comparison (Q1 2026).

