
NVIDIA vs AMD vs Intel vs Groq: The 2026 Inference Chip War Explained

By Chloe Smith · 13 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.


The chip war was one-sided for years. NVIDIA built CUDA, everyone else built workarounds. But between 2024 and 2026, three real alternatives — AMD's MI300X, Groq's LPU, and Intel's Gaudi 3 — shipped in production volume with public benchmarks you can reproduce. The question isn't whether they exist. It's whether any of them are worth choosing over NVIDIA for your specific workload.

**TL;DR: NVIDIA's H100 and B200 remain the inference standard for production local AI in 2026, but the competition finally matters. AMD's MI300X costs 40–60% less than the H100, and its 192GB of unified memory delivers up to 40% lower latency on large-context RAG workloads. Groq's LPU delivers 278–664 tok/s on 8B models — unmatched for small-model throughput. Intel's Gaudi 3 is a training chip mislabeled as an inference chip; skip it unless you're doing both. For most power users and small businesses deploying Llama 3.1 70B, NVIDIA is still the correct default. But "just buy NVIDIA" stopped being the complete answer about 18 months ago.**

---

## Why the Inference Chip War Matters Now

For most of 2022–2024, the inference chip conversation was short: NVIDIA H100, done. Every major framework — vLLM, TGI, LM Studio — was built on CUDA. The secondary H100 market was thin and overpriced. The only "alternatives" were slower on every metric that mattered.

2024 changed the calculus. AMD's MI300X shipped in volume after its October 2024 launch. Intel's Gaudi 3 reached OEM general availability in Q3 2024, PCIe add-in card GA in Q4. Groq scaled its GroqCloud API to enterprise customers with independently verified throughput numbers on models people actually run. For the first time, a small business doing RFQs for an inference cluster had three vendors worth a call.

### What Changed Since 2024

It wasn't just new chips — it was new chips shipping *with usable software*. ROCm matured enough that vLLM runs on MI300X without a custom engineering project. Groq published reproducible throughput benchmarks on Llama 3.1 8B. And the H100 secondary market softened as B200 ramp began, making used H100s more accessible to buyers who aren't hyperscalers.

NVIDIA's installed base is still overwhelming. But enterprise inference RFQs in 2026 are genuinely multi-vendor conversations in a way they weren't in 2023.

---

## NVIDIA's Inference Crown: H100 and B200

The H100 is the deployed standard because it was first, CUDA works, and the used market is deep. A used H100 SXM runs $30K–$40K depending on condition and configuration as of March 2026. New PCIe H100s price around $25K–$30K. That's not cheap — but for teams who need predictable performance and zero framework surprises, the premium makes sense.

The B200 is a different generation, not an incremental update. [VRAM](/glossary/vram) capacity jumped to 192GB HBM — tied with AMD's MI300X for the largest single-chip memory in production — and NVIDIA's own MLPerf v5.0 data shows B200 delivering roughly 3× the throughput of H100 on Llama 3.1 70B. If H100 single-chip throughput at batch size 1 sits around 20–25 tok/s, the B200 is in the 60–75+ tok/s range under equivalent conditions. OEM pricing landed at $45K–$55K per chip — well below the $60K–$80K figures cited in early projections. B200 was in volume production by late 2024 / early 2025; it's not a future chip.

### Real-World NVIDIA Inference Performance

Workload: Llama 3.1 70B, Q4 [quantization](/glossary/quantization), batch size 1, 2K context. Software: vLLM 0.6+, CUDA 12.2. (vLLM 0.6 introduced a 2.7× throughput improvement over 0.4.x — which version a benchmark used matters.)

- **H100 PCIe:** ~20–25 tok/s. TDP cap: 350W.
- **H100 SXM:** Similar throughput per-chip; configurable TDP up to 700W. Most datacenter deployments use SXM. If someone quotes 350W for H100 power consumption without specifying PCIe vs SXM, they're describing only one variant.
- **B200:** ~60–75+ tok/s. TDP: 1,000–1,200W depending on variant. Power per token drops substantially because throughput nearly triples.

*Last verified: March 2026, vLLM 0.6+ on CUDA 12.2.*
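The power-per-token claim is easy to sanity-check from the figures above. A minimal sketch — the TDP and throughput numbers come from this article; using worst-case B200 TDP and range midpoints is my assumption:

```python
def joules_per_token(tdp_watts: float, tok_per_s: float) -> float:
    """Energy per generated token at sustained load: W / (tok/s) = J/token."""
    return tdp_watts / tok_per_s

# Midpoints of the ranges above; B200 taken at its worst-case 1,200W TDP.
h100_sxm = joules_per_token(700, 22.5)   # ~31 J/token
b200 = joules_per_token(1200, 67.5)      # ~18 J/token

print(f"H100 SXM: {h100_sxm:.1f} J/token")
print(f"B200:     {b200:.1f} J/token")
print(f"B200 uses ~{h100_sxm / b200:.1f}x less energy per token")
```

Even charging the B200 for its full worst-case TDP, energy per token drops by roughly 40% — which is what "throughput nearly triples while power less than doubles" works out to.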

> [!NOTE]
> The B200 wasn't launched in April 2026. It was in production by late 2024. If a vendor is telling you they're "waiting for B200 availability," they're managing procurement backlogs — not waiting on a chip release.

### Why NVIDIA Still Wins

Three compounding advantages: software maturity, used market depth, and forward compatibility.

When a new quantization method ships — AWQ, Q4_K_M, FP4 — NVIDIA CUDA support lands first. Sometimes by days, sometimes by months. That matters for teams iterating on model versions in production. vLLM, TensorRT-LLM, and SGLang all treat CUDA as their primary target; ROCm support is second.

A model optimized for H100 runs on B200 without recompilation. That's not a trivial thing when you have production workflows that can't absorb migration overhead.

---

## AMD's MI300X: The Credible Challenger

The MI300X carries 192GB of unified HBM — same as the B200 — for $15K–$18K per chip as of March 2026. That's roughly half the cost of a new H100 SXM. The case writes itself: same memory, 40–60% less money, comparable memory bandwidth.

Real inference throughput on Llama 3.1 70B Q4 at batch size 1: ~15 tok/s (ROCm + vLLM, per AMD's own published benchmarks at 1,025 input / 256 output tokens). That's 25–35% slower than H100 in equivalent single-request conditions. But cost-per-[token](/glossary/tokens-per-second) at that price differential still runs in AMD's favor across most deployment models.
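The cost-per-token claim can be made concrete with a back-of-envelope amortization. A sketch, not a pricing model — the chip prices, TDPs, and throughputs are from this article; the 3-year amortization, 50% utilization, and $0.12/kWh electricity rate are illustrative assumptions:

```python
def cost_per_million_tokens(chip_price: float, tdp_watts: float,
                            tok_per_s: float, years: float = 3.0,
                            utilization: float = 0.5,
                            kwh_price: float = 0.12) -> float:
    """Amortized hardware + electricity cost per 1M generated tokens."""
    active_seconds = years * 365 * 24 * 3600 * utilization
    total_tokens = tok_per_s * active_seconds
    power_cost = (tdp_watts / 1000) * (active_seconds / 3600) * kwh_price
    return (chip_price + power_cost) / total_tokens * 1e6

# Article figures: used H100 SXM midpoint $35K @ ~22.5 tok/s, 700W;
# MI300X midpoint $16.5K @ ~15 tok/s, 750W.
h100 = cost_per_million_tokens(35_000, 700, 22.5)    # ≈ $34 per 1M tokens
mi300x = cost_per_million_tokens(16_500, 750, 15)    # ≈ $25 per 1M tokens
print(f"H100 SXM: ${h100:.0f}/M tok, MI300X: ${mi300x:.0f}/M tok")
```

Under these assumptions the MI300X comes out roughly 25% cheaper per token despite being ~30% slower — the price gap dominates the throughput gap.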

One figure you'll see cited incorrectly: MI300X power consumption is not 320W. The confirmed TDP from AMD's official datasheet is 750W. Plan your rack thermal accordingly.

### AMD's Unified Memory Advantage

The MI300X's architecture differs from NVIDIA's in a way that changes the comparison for specific workloads. HBM is directly addressable by the CPU — no separate GPU memory pool with PCIe transfers when you reload model weights or update a vector index mid-inference.

For RAG workloads with large context windows, this matters in ways that raw throughput numbers miss. When embeddings and model weights share the same memory address space with no bus crossing between them, end-to-end latency on 32K–128K token contexts drops significantly. Independent benchmarks show MI300X delivering up to 40% lower inference latency vs H100 for LLM workloads with large pre-loaded vector indexes — which is far more than the 10–15% figure that gets quoted for narrow context windows. The advantage scales with context length.
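A rough sense of why bus crossings matter: moving a large vector index into GPU memory over PCIe takes real wall-clock time that unified memory simply never spends. A back-of-envelope sketch — the 40GB index size and ~63 GB/s effective PCIe 5.0 x16 figure are illustrative assumptions, not benchmark numbers from this article:

```python
def transfer_seconds(bytes_moved: float, bus_gb_per_s: float) -> float:
    """Time to move data across a bus, ignoring protocol overhead."""
    return bytes_moved / (bus_gb_per_s * 1e9)

index_bytes = 40e9   # hypothetical 40GB pre-loaded vector index
pcie5_x16 = 63.0     # ~effective GB/s for PCIe 5.0 x16 (assumption)

t = transfer_seconds(index_bytes, pcie5_x16)
print(f"Reloading a 40GB index over PCIe 5.0 x16: ~{t:.2f}s")
```

Hundreds of milliseconds per reload is the same order of magnitude as end-to-end inference latency on long contexts, which is why an architecture that never pays that cost can show the gap described above.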

> [!TIP]
> If your deployment relies heavily on RAG with long context windows (16K–128K tokens), don't assume H100 wins the performance comparison. Benchmark MI300X's unified memory architecture before committing. The cost and latency math may both favor AMD at that workload.

### The AMD Problem: Software

ROCm is functional. It's not CUDA. That 2–3 year ecosystem gap is real: some vLLM optimizations hit CUDA first, recruiting engineers who know ROCm is harder, and debugging a ROCm-specific issue has a smaller community and fewer Stack Overflow answers to fall back on.

For enterprises buying 10+ chips, AMD's lower per-chip cost can be partially offset by higher integration and ongoing engineering overhead. Run the full TCO calculation before committing — not just the chip price.
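That TCO calculation can be sketched in a few lines. Chip prices and TDPs come from this article; the integration cost, ongoing engineering overhead, utilization, and electricity rate are illustrative assumptions you should replace with your own numbers:

```python
def cluster_tco(chip_price: float, n_chips: int, tdp_watts: float,
                years: float = 3.0, kwh_price: float = 0.12,
                integration_cost: float = 0.0,
                annual_eng_cost: float = 0.0,
                utilization: float = 0.5) -> float:
    """3-year total cost: hardware + electricity + integration + engineering."""
    hardware = chip_price * n_chips
    active_hours = years * 365 * 24 * utilization
    power = n_chips * (tdp_watts / 1000) * active_hours * kwh_price
    return hardware + power + integration_cost + annual_eng_cost * years

# Illustrative 10-chip cluster. ROCm integration ($100K one-time) and
# extra engineering ($30K/yr) are hypothetical overhead figures.
h100 = cluster_tco(35_000, 10, 700)
mi300x = cluster_tco(16_500, 10, 750,
                     integration_cost=100_000, annual_eng_cost=30_000)
print(f"H100 cluster:   ${h100:,.0f}")
print(f"MI300X cluster: ${mi300x:,.0f}")
```

At these assumed overhead figures the two clusters land within a few percent of each other — the per-chip price advantage is real, but it can be largely consumed by software integration costs, which is exactly why the full calculation matters.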

---

## Groq's LPU: The Throughput Specialist

Groq designed their Language Processing Unit from scratch for inference — not repurposed training silicon adapted for inference workloads the way both NVIDIA and AMD chips are. The architecture optimizes for [tokens per second](/glossary/tokens-per-second) above memory capacity or flexibility.

The public numbers on Llama 3.1 8B are real: GroqCloud delivers 278–664 tok/s depending on load conditions, per ArtificialAnalysis rolling averages as of late 2025. Time-to-first-token on their API lands in the 200–300ms range for most models — faster than comparably-priced GPU alternatives for streaming inference at smaller model sizes.

One structural clarification: Groq doesn't sell bare chips. They sell inference-as-a-service through GroqCloud. When you see "Groq 8×LPU cluster" in benchmarks, that's Groq's internal infrastructure — you can't buy it and rack it yourself. Each LPU has 48GB of HBM. Running a 70B model requires distributing it across hundreds of chips, which Groq handles internally but which you don't control.

### When Groq Actually Wins

Real-time applications where latency is the user experience: voice AI pipelines (speech-to-text → LLM → text-to-speech), streaming chat, real-time translation. The difference between 200ms TTFT and 400ms TTFT is the difference between "feels instant" and "there's a lag." For applications that live or die on that perception, Groq is the fastest option available for 8B–14B models as of March 2026.
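The perceived-latency arithmetic is simple: total response time is time-to-first-token plus generation time. A sketch using the Groq figures from this article — the 400ms / 60 tok/s GPU baseline is a hypothetical comparison point, not a measured number:

```python
def response_time_ms(ttft_ms: float, n_tokens: int, tok_per_s: float) -> float:
    """End-to-end time to stream a full reply: first token + generation."""
    return ttft_ms + n_tokens / tok_per_s * 1000

# Groq: mid-range TTFT and throughput from this article, 150-token reply.
groq = response_time_ms(250, 150, 450)   # first word at 250ms
# Hypothetical GPU baseline for comparison (assumption, not a benchmark).
gpu = response_time_ms(400, 150, 60)

print(f"Groq: full reply in ~{groq:.0f}ms (speaking starts at 250ms)")
print(f"GPU baseline: full reply in ~{gpu:.0f}ms")
```

For voice pipelines the TTFT term dominates perception — the user hears the reply begin at 250ms vs 400ms, regardless of how long the tail of the stream takes.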

Not competitive: training, fine-tuning, long-context RAG, or any workload where owning the hardware matters. The LPU is memory-constrained per chip, Groq's software API is proprietary, and migrating an application off GroqCloud is nontrivial. You're buying latency, not infrastructure.

### The Groq Reality Check

The 70B story is less clean than the 8B story. Running Llama 3.1 70B at scale on Groq requires their internal cluster infrastructure. For applications that need 70B quality at Groq throughput numbers, you're dependent on their pricing and availability — you don't own the hardware. That's a vendor lock-in decision, not a hardware decision. If you're evaluating inference vendors, include Groq. If you're building a local inference cluster you own and control, Groq doesn't fit that model.

---

## Intel Gaudi 3: Built for Training, Not Inference

Gaudi 3 reached OEM general availability in Q3 2024, PCIe add-in card GA in Q4 2024, and IBM Cloud availability in early 2025. It's been a production chip for over a year — the framing of it as new in 2026 is wrong.

Architecture: 128GB HBM, 24 tensor processors. Intel's own benchmarks on Gaudi 3 performance position it as comparable to H100 on Llama 3.1 70B — not the 8–12 tok/s figure that circulates incorrectly. For 8-card OAM configurations, Gaudi 3 training performance is the selling point: Intel priced the 8-chip OAM baseboard at $125,000 (~$15.6K per chip), with the argument being lower cost-per-training-FLOP vs H100 clusters.

For inference: Gaudi 3 reaches H100-comparable throughput, but the vLLM + Gaudi inference integration is significantly less mature than either CUDA or ROCm. Expect rough edges for at least 12–18 more months. There's no compelling reason to choose Gaudi 3 for an inference-primary deployment in 2026.

### Why Enterprises Actually Buy Gaudi 3

The distributed training stack. Gaudi's collective communication layer for multi-chip training is solid, and the Habana Gaudi SDK includes training optimizations for LLaMA and Mistral variants. If you're training custom 70B-class models and need to justify the compute cluster cost, Gaudi 3 makes the cost-per-FLOP argument against H100 clusters.

The logical combined setup: Gaudi 3 cluster for training, MI300X for inference. Both are under $20K per chip, and you're not paying H100 prices for either workload. For labs doing both, that's worth modeling.

---

## Cerebras Wafer-Scale: Not Yet

Cerebras' WSE-3 is technically compelling — roughly 900K cores on a single wafer, 4 trillion transistors, every compute unit talking to every other without PCIe latency. Their published benchmarks on specific long-context transformer workloads are real. But it's a $5M+ system targeting the top handful of AI labs globally.

For 2026 local AI builders: ignore it. Check back in 2030.

---

## Head-to-Head: Performance, Cost, and Real-World Trade-offs

*Workload: Llama 3.1 70B, Q4 quantization, batch size 1, 2K context. Software: vLLM 0.6+ where applicable. Groq figures are GroqCloud API (Llama 3.1 8B — see note). Last verified March 2026.*

| Chip | Throughput (70B, Q4, batch 1) | TDP | Price per chip (Mar 2026) |
|---|---|---|---|
| NVIDIA H100 SXM | ~20–25 tok/s | Up to 700W | $30K–$40K used |
| NVIDIA B200 | ~60–75+ tok/s | 1,000–1,200W | $45K–$55K |
| AMD MI300X | ~15 tok/s | 750W | $15K–$18K |
| Groq LPU† | 278–664 tok/s (8B, GroqCloud) | N/A (service only) | N/A (API pricing) |
| Intel Gaudi 3 | ~H100-comparable | ~900W | ~$15.6K (8-chip OAM) |

*†Groq throughput applies to Llama 3.1 8B on GroqCloud API. 70B inference requires multi-chip distributed infrastructure managed by Groq. Sources: AMD ROCm docs v6.4, NVIDIA MLPerf Inference v5.0, ArtificialAnalysis (late 2025), Intel Gaudi benchmarks (gaudi-3-performance).*

### Which Chip Wins on What Metric

- **Throughput (70B):** NVIDIA B200 at ~60–75+ tok/s
- **Cost per chip:** AMD MI300X at $15K–$18K
- **Small-model throughput:** Groq at 278–664 tok/s on 8B
- **Large-context RAG latency:** AMD MI300X — up to 40% lower latency at 32K+ context vs H100
- **Software ecosystem flexibility:** NVIDIA H100/B200 — any model, any quantization, first
- **Training + inference combined:** Intel Gaudi 3 (training) + MI300X (inference)

---

## What This Means for Local AI Builders in 2026

The consumer GPU market — RTX 4070, RTX 5070 Ti — is completely untouched by this war. If you're running a local AI workstation on consumer hardware, see our [hardware upgrade ladder for local LLM builders](/articles/100-local-llm-hardware-upgrade-ladder/) for how those decisions work. This comparison lives entirely in the $12K–$80K per-chip tier.

For everyone operating in that tier:

**Budget-constrained inference deployment:** AMD MI300X is the new baseline consideration. Half the price of an H100, same 192GB VRAM as the B200, and a documented latency advantage for large-context workloads. Be honest about ROCm overhead. But dismissing it without benchmarking is leaving money on the table.

**Production inference service where uptime and developer velocity are non-negotiable:** Stay NVIDIA. H100 for cost-sensitivity, B200 if you're running multiple H100s today and want to consolidate. The software ecosystem savings compound. See our [dual-GPU inference stack comparison](/articles/102-dual-gpu-local-llm-stack/) for how to think about scaling.

**Real-time applications where latency is the product:** Run a Groq pilot on your actual use case. If sub-300ms TTFT at 8B–14B model quality changes your user experience, the GroqCloud economics work. If it doesn't change the experience, stay NVIDIA.

**Research lab doing both training and inference:** Gaudi 3 cluster for training, MI300X for inference is a defensible setup. Both chips are under $20K, and you avoid paying H100 prices for either workload separately.

### Decision Tree

- Inference only, budget-constrained → **AMD MI300X**
- Inference, cost irrelevant, need max throughput → **NVIDIA B200**
- Real-time application, TTFT is the product → **Groq pilot + NVIDIA baseline**
- Training + inference on same cluster → **Gaudi 3 (train) + MI300X (infer)**
- Maximum software compatibility, zero integration friction → **NVIDIA H100** — still can't go wrong
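The branches above can be written down as a small function — purely illustrative, encoding this article's picks and nothing more:

```python
def recommend(workload: str, budget_constrained: bool = False,
              latency_is_product: bool = False) -> str:
    """Mirror of the decision tree above; return values are this article's picks."""
    if workload == "training+inference":
        return "Gaudi 3 (train) + MI300X (infer)"
    if latency_is_product:
        return "Groq pilot + NVIDIA baseline"
    if workload == "inference" and budget_constrained:
        return "AMD MI300X"
    if workload == "inference":
        return "NVIDIA B200"
    # Default: maximum compatibility, zero integration friction.
    return "NVIDIA H100"

print(recommend("inference", budget_constrained=True))  # AMD MI300X
```

Useful mostly as a reminder that the decision has few inputs: workload mix, budget pressure, and whether latency is the product.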

---

## The 2026 Verdict: NVIDIA Wins, But the Game Is Different

NVIDIA's dominance is structural. CUDA's network effects — every framework, every optimization, every hire — compound faster than competitors can close the gap, and Blackwell's generational leap in throughput extended that lead technically while the software gap widened.

But AMD's MI300X changed the cost calculus in a way that doesn't reverse. Paying 40–60% less for 75% of the throughput with superior memory architecture for large-context workloads isn't a compromise you have to rationalize. For RAG-heavy deployments at 32K+ context, it might not even be a compromise at all — the unified memory latency advantage is documented and reproducible.

Groq proved specialized silicon works for throughput. 278–664 tok/s on 8B models isn't a benchmark artifact — it's available via their public API today. The constraint is that Groq is a service vendor, not a hardware vendor. You're buying latency-as-a-product, not owning infrastructure.

Intel proved that training-optimized silicon doesn't translate cleanly to inference economics. Gaudi 3 is a real chip for training workloads; it's an also-ran for inference in 2026.

For most local AI builders this year: NVIDIA is the safe choice. But AMD is no longer irrelevant. The next 24 months will determine whether ROCm's software ecosystem matures fast enough that "start with MI300X" becomes the default recommendation — and that outcome is no longer implausible.

---

## FAQ

**Is AMD MI300X better than NVIDIA H100 for inference?**

For raw throughput at batch size 1, the H100 is faster — ~20–25 tok/s vs MI300X's ~15–18 tok/s on Llama 3.1 70B (Q4 quantization, ROCm + vLLM, March 2026 benchmarks). But the MI300X's 192GB unified memory architecture delivers up to 40% lower inference latency for large-context RAG workloads with 32K+ token windows, where the absence of PCIe transfers for model and embedding data changes the throughput profile. At $15K–$18K vs $30K–$40K for an H100 SXM, the cost-per-token math runs strongly in AMD's favor for inference-heavy deployments that can absorb the ROCm learning curve.

**Is Groq's LPU faster than NVIDIA for inference?**

For 8B–14B models served through GroqCloud, yes — 278–664 tok/s with time-to-first-token in the 200–300ms range outperforms comparable GPU options for streaming applications. For 70B models, the comparison changes: each LPU has only 48GB of HBM, so 70B inference requires Groq's internal multi-chip infrastructure, which you access via their API rather than owning. Groq sells inference throughput as a service, not bare hardware — the decision is "use Groq's API vs run your own cluster," not "buy an LPU vs buy an H100."

**Should I buy an AMD MI300X or NVIDIA H100 for local AI in 2026?**

For Llama 3.1 70B inference workloads: MI300X is worth benchmarking at $15K–$18K vs $30K–$40K for an H100 SXM. The CUDA software stack is more mature than ROCm, and H100's secondary market is more developed — those are real friction costs that matter for smaller teams. But if you're running large-context workloads and can absorb the ROCm integration overhead, MI300X's cost-per-token argument is compelling. For deployments requiring zero integration risk, H100 remains the default.

**Is Intel Gaudi 3 worth buying for inference in 2026?**

No — not for inference-only deployments. Gaudi 3 has been generally available since Q3–Q4 2024 and performs comparably to H100 on inference at ~$15.6K per chip, but the vLLM + Gaudi inference software integration is significantly less mature than CUDA or ROCm. The real case for Gaudi 3 is training: it prices below H100 clusters per training FLOP with a solid distributed training stack. If you need both training and inference, a Gaudi 3 cluster for training combined with MI300X for inference is a reasonable combined setup. Inference only? Buy an MI300X or H100.
