**NVIDIA paid $20 billion for Groq on Christmas Eve 2025** — its largest acquisition in company history. Most coverage treated this as a financial story about NVIDIA buying a competitor. It's actually a hardware signal: NVIDIA, the undisputed training champion, paid 2.9x a competitor's most recent funding valuation to absorb their inference-optimized chip technology. That doesn't happen when you're comfortable with your position.
**TL;DR: Groq built a chip purpose-designed for inference speed, not training throughput. Their LPU independently benchmarks at 276+ tokens/second on Llama 3 70B — roughly 10x faster than a dual RTX 5090 running the same model. NVIDIA acquired Groq's assets for ~$20B in December 2025 to integrate LPU technology into their Vera Rubin platform. For local builders, this doesn't change your near-term hardware decision — but it confirms that inference speed is now a real competitive moat, and NVIDIA's response tells you everything about where the market is heading.**
---
## What Is Groq — And Why It's Not Just Another AI Chip
Groq spent nine years building a processor designed for exactly one job: running already-trained AI models as fast as physically possible. Not training them. Not fine-tuning them. Just running them.
The chip is called an [LPU — Language Processing Unit](/glossary/lpu). The name is a bit of marketing, but the underlying architecture is genuinely different from a GPU in ways that matter.
Most AI chips — including every NVIDIA product — optimize for massive parallelism. GPUs run thousands of operations simultaneously, which is perfect for training, where you're updating billions of weights at once across a huge batch. That architecture comes with trade-offs during inference. You're generating one token at a time, sequentially. A GPU designed for parallel batch processing is doing a lot of expensive coordination work for a job that doesn't require it.
Groq's answer: strip that complexity out entirely.
### LPU vs GPU — What's Actually Different
The core architectural difference is memory placement. GPUs access weights from HBM (High Bandwidth Memory) — off-chip memory with approximately 8 TB/s bandwidth. Fast, but still a separate trip for each operation. Groq's LPU puts hundreds of megabytes of SRAM directly on the chip, delivering ~80 TB/s bandwidth — roughly 10x faster, per Groq's own architecture documentation.
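A back-of-envelope calculation makes the bandwidth argument concrete. In single-stream decoding, every weight has to be read for each generated token, so memory bandwidth sets a hard ceiling on tokens per second. The sketch below is illustrative only: it assumes 8-bit weights and batch size 1, and it ignores KV-cache traffic and the fact that Groq shards a 70B model across many SRAM-limited chips.

```python
# Back-of-envelope decode ceiling: with batch size 1, every weight is read
# once per generated token, so tokens/sec <= bandwidth / bytes-per-token.
# Illustrative only: 8-bit weights assumed; KV cache and sharding ignored.

def decode_ceiling_tok_s(params_billion: float, bytes_per_weight: float,
                         bandwidth_tb_s: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_weight
    return bandwidth_tb_s * 1e12 / bytes_per_token

print(decode_ceiling_tok_s(70, 1.0, 8))   # HBM at ~8 TB/s   -> ~114 tok/s ceiling
print(decode_ceiling_tok_s(70, 1.0, 80))  # SRAM at ~80 TB/s -> ~1,140 tok/s ceiling
```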
The second difference is determinism. NVIDIA GPUs make dynamic decisions at runtime: branch prediction, cache management, thread scheduling. That unpredictability adds latency. Groq's compiler pre-plans every single memory move before the model runs. There are no cache misses because nothing is dynamically cached — every operation is statically scheduled at compile time.
What you get is a chip that trades GPU generalism for one specific advantage: generating tokens with minimal latency. It does everything else — training, fine-tuning, large-batch throughput — worse than a GPU.
> [!NOTE]
> Groq is not a GPU you slot into your build. It's a cloud service. You access it via API, pay per token, and their hardware cluster handles the rest. This distinction matters when we get to the cost comparison.
---
## The $20B Acquisition — What It Actually Means
Let's get the facts straight, because most coverage conflated two separate events.
Groq's **last independent funding round** was $750 million at a $6.9 billion post-money valuation, closed September 2025, led by Disruptive AI with participation from BlackRock, Samsung, and Cisco. That was the funding story.
The **$20 billion figure** is NVIDIA's acquisition price — announced December 24, 2025. NVIDIA paid $20B to acquire Groq's assets and license its LPU technology, with Groq's founder Jonathan Ross and senior leadership joining NVIDIA to integrate LPU technology into the [Vera Rubin AI platform](https://developer.nvidia.com/blog/inside-nvidia-groq-3-lpx-the-low-latency-inference-accelerator-for-the-nvidia-vera-rubin-platform/). Groq technically continues as an independent company under its former finance chief — but the engineers who built the LPU are now at NVIDIA.
### Why NVIDIA Paid 2.9x Groq's Last Valuation
A 2.9x acquisition premium over a three-month-old valuation isn't a normal deal. A few things explain it.
First, Groq was winning enterprise customers on latency benchmarks that NVIDIA hardware couldn't match. That's the kind of problem NVIDIA prefers to acquire rather than race against. A company paying $20B is also paying to prevent a competitor from scaling independently.
Second, timing. NVIDIA's Vera Rubin platform was already in development for 2026–2027. Integrating Groq's compiler and SRAM architecture into that roadmap, rather than building equivalent inference optimization internally, is expensive in dollars but cheap in calendar time — and calendar time matters more when customers are choosing infrastructure vendors right now.
Third, the IP itself. Groq's deterministic compiler, which statically schedules every inference operation, represents years of specialized engineering. You can't easily reverse-engineer a working implementation.
The post-acquisition claims support the thesis: NVIDIA says the NVIDIA Groq 3 chip will deliver 35x higher throughput per megawatt for trillion-parameter models versus Blackwell NVL72. That's a marketing number worth independent verification before trusting — but the technical direction is real.
### The Risk — If Inference Speed Stops Mattering
Every acquisition at 2.9x premium embeds a bet about the future. Here, the bet is that inference speed on large models remains a meaningful differentiator as the market matures.
That's not guaranteed. Llama 4 Scout delivers strong performance while activating only 17B parameters per token, and dense models in the 8B–32B range keep closing the quality gap at sizes that run comfortably in the VRAM of a single consumer GPU. If quality thresholds keep dropping to smaller models, the case for 70B+ inference at extreme speed becomes harder to justify. You can serve users well with a smaller model running locally for next to nothing.
The counter-argument: enterprise workloads don't switch easily. Once compliance SLAs, latency requirements, and reliability guarantees are built around a specific inference tier, companies don't migrate for marginal model improvements. Groq's enterprise customer base has stickiness that persists regardless of model efficiency trends.
---
## Groq's Actual Performance — The Numbers That Hold Up
The speed gap is real, but some numbers in circulation are optimistic. Here's what independent benchmarking shows, verified by [Artificial Analysis](https://artificialanalysis.ai/providers/groq).
**Groq on Llama 3 70B**: 276–284 tokens/second, independently verified by Artificial Analysis in early 2026. With speculative decoding on Llama 3.3 70B, a newer Groq endpoint hits 1,665 tok/s — but that's a specific workload configuration, not a general comparison baseline.
**RTX 5090 on Llama 3 70B**: A single RTX 5090 (32GB VRAM) cannot hold Llama 3 70B at standard 4-bit quantization. The Q4\_K\_M GGUF file runs approximately 40–43GB, exceeding the card's memory. Running it requires either aggressive quantization (Q2/Q3, which fits but degrades quality) or CPU offloading, which drops throughput to ~15–25 tok/s. A dual RTX 5090 setup (64GB combined VRAM, with the model split across both cards over PCIe) runs Llama 3 70B at approximately 27 tok/s.
For smaller models, the picture changes: the RTX 5090 handles 32B at ~61 tok/s and 8B at ~213 tok/s — both comfortably within VRAM.
| Setup | Llama 3 70B (tok/s) | 32B (tok/s) | 8B (tok/s) | VRAM | Hardware cost |
|---|---|---|---|---|---|
| Groq LPU (cloud) | 276–284 | — | — | N/A | pay per token |
| Single RTX 5090 | ~15–25 (CPU offload) | ~61 | ~213 | 32 GB | ~$2,200 |
| Dual RTX 5090 | ~27 | ~120 | ~400+ | 64 GB | ~$5,000+ |
*Benchmarks as of March 2026. RTX 5090 costs are hardware-only; electricity adds ~$0.38/M tokens for 70B inference at 575W TDP.*
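If you want to sanity-check whether a quantized model fits your card before downloading 40GB of weights, the arithmetic is simple: parameters times bits per weight, divided by 8. A rough sketch, assuming Q4\_K\_M averages about 4.8 bits per weight (real quant formats mix bit widths, and KV cache plus runtime overhead add several GB on top):

```python
# Quick fit check: approximate quantized model file size in GB.
# Assumes a uniform average bits-per-weight; KV cache and runtime
# overhead add several GB beyond the file size shown here.

def approx_model_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    return params_billion * bits_per_weight / 8

print(approx_model_gb(70))  # ~42 GB -> exceeds a single 32 GB RTX 5090
print(approx_model_gb(32))  # ~19 GB -> fits one card with room for KV cache
print(approx_model_gb(8))   # ~5 GB  -> fits almost anything
```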
### Why Token Latency Matters More Than Throughput
There are two ways to measure inference speed, and most coverage conflates them.
**Throughput** is how many tokens you can generate per second across a batch — relevant when you're serving hundreds or thousands of users simultaneously. GPUs are competitive here because they run parallel batches efficiently.
**[Token latency](/guides/inference-speed-explained/)** — or TTFT (time to first token) — is how long a user waits before the first word appears. This is what users perceive. A chat interface feels instant at sub-500ms and feels sluggish above 1 second.
Groq's independently benchmarked TTFT for Llama 3.3 70B ranges from approximately 300ms to 820ms depending on load and region — placing it among the fastest cloud providers, but not quite the sub-200ms figure that early marketing implied. For local inference on your own hardware, a single user's TTFT is just prompt prefill time, typically a fraction of a second, since you're not queuing behind anyone else's requests.
> [!TIP]
> The TTFT advantage Groq has over cloud GPU providers matters in production systems serving many concurrent users. For personal homelab use, your single-user local setup already wins on TTFT regardless of GPU — the question becomes throughput and model size.
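You can measure TTFT yourself rather than trust anyone's marketing: any OpenAI-compatible streaming endpoint makes it a short script. A minimal sketch follows; the base URL and model id are assumptions you should replace with your provider's actual values.

```python
# Minimal TTFT probe against an OpenAI-compatible streaming endpoint.
# Base URL and model id below are assumptions; substitute your provider's.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed model id
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.3f}s")
        break
```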
---
## Groq Cloud API vs Your Local RTX 5090 — The Real Cost Math
Groq's pricing as of March 2026: **$0.59/M input tokens, $0.79/M output tokens** for Llama 3.3 70B Versatile. Newer Llama 4 models are cheaper — Llama 4 Scout at $0.11/M input, Llama 4 Maverick at $0.20/M input.
Local cost math is messier. The RTX 5090 draws 575W at peak inference load (confirmed TDP — not the 320W figure that circulates online). At $0.12/kWh, continuous inference costs ~$1.65/day in electricity. At a realistic throughput of ~50 tok/s on quantized 70B models, that's roughly **$0.38/M tokens in electricity only**.
Add the $2,200 GPU cost amortized over two years, and actual cost-per-million-tokens for 70B inference on a single RTX 5090 runs **$0.90–$1.20/M** during the payback period. Cheaper than Groq by year three, more expensive before that.
> [!WARNING]
> The electricity-only comparison looks better for local than it actually is. A $2,200 GPU running Llama 70B at low utilization takes 3+ years to justify itself on cost alone. Run the amortization math before concluding local is cheaper.
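Here's that amortization math as a small calculator, using the figures quoted in this section. The duty-cycle parameter is what usually kills the local case: a GPU that generates tokens only a few hours a day spreads its hardware cost over far fewer tokens.

```python
# Cost per million tokens for local inference: electricity plus amortized
# hardware. Inputs are the figures quoted above; duty_cycle is the fraction
# of the day the GPU is actually generating tokens.

def local_cost_per_m_tok(gpu_price: float, amortize_years: float,
                         watts: float, kwh_price: float,
                         tok_per_sec: float, duty_cycle: float = 1.0) -> float:
    hours_per_year = 24 * 365 * duty_cycle
    electricity = watts / 1000 * hours_per_year * kwh_price
    hardware = gpu_price / amortize_years
    tokens = tok_per_sec * hours_per_year * 3600
    return (electricity + hardware) / tokens * 1e6

print(local_cost_per_m_tok(2200, 2, 575, 0.12, 50))       # 24/7:    ~$1.08/M
print(local_cost_per_m_tok(2200, 2, 575, 0.12, 50, 0.25)) # 6 h/day: ~$3.17/M
```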
### When Groq's API Makes Financial Sense
- **Bursty, unpredictable workloads**: 80% idle, 20% intense. You only pay for what you use — no hardware sitting idle.
- **User-facing chat products at scale**: Sub-second latency for many concurrent users requires infrastructure you'd spend months building and maintaining locally.
- **Teams where engineering time is expensive**: Maintaining local GPU inference infrastructure has a labor cost that doesn't appear in hardware pricing.
- **Non-sensitive data**: If privacy isn't a constraint, cloud inference is straightforward.
### When Local Builds Still Win
- **Privacy-sensitive workloads**: Healthcare, legal, finance. Data that can't leave your network has one answer — local hardware, full stop.
- **[High-volume, consistent daily inference](/articles/100-local-llm-hardware-upgrade-ladder/)**: Running models continuously for 8+ hours every day flips the amortization math in year two.
- **Models under 30B parameters**: A single RTX 5090 hits 61 tok/s on 32B models — close enough to cloud performance that the privacy and zero-TTFT advantages become compelling.
- **Research and offline experimentation**: Fine-tuning, adversarial testing, offline development — none of this works on a cloud API.
---
## What the Acquisition Means for Your Build Decision in 2026
The strategic implication isn't about your next GPU purchase. It's about where the market is heading over the next 24 months.
NVIDIA paying $20B to acquire inference optimization technology is the clearest signal yet that **inference is a separate market from training**, with different competitive dynamics and different hardware requirements. For years, the assumption was that whoever made the best training hardware automatically won inference. The Groq deal challenges that assumption directly — and NVIDIA responded by spending $20B rather than waiting to find out.
### The Inference Speed Arms Race
Before the acquisition, NVIDIA's Blackwell architecture already cut inference latency 15–25% over Hopper. Post-acquisition, the NVIDIA Groq 3 chip — slated for the Vera Rubin platform — is claiming 35x higher throughput per megawatt for trillion-parameter models versus Blackwell NVL72. AMD is also positioning its Instinct accelerator line explicitly as an inference-optimized alternative.
For local builders, this arms race is good news. More competition in inference means lower cloud prices and better API economics over the next 12–18 months. NVIDIA's acquisition may consolidate the market longer-term, but in the near term it accelerates investment from every competitor.
### Strategic Implication for Local Builders
If you're buying hardware today, your [GPU selection decision](/comparisons/inference-latency-benchmark-2026/) is unchanged. The Vera Rubin integration is a 2026–2027 story and won't affect the card you install this quarter.
What does shift: if you're architecting a production inference system, Groq's cloud pricing will likely decrease over the next 12–18 months as NVIDIA competition matures. The local vs cloud breakeven point trends toward cloud for large-model workloads. For privacy-sensitive workloads, that math doesn't change — but for everything else, the economics of cloud inference are improving faster than the cost of local hardware.
Groq's success at the enterprise level validates one thing for the rest of us: **the market for inference is large enough that multiple approaches can co-exist**. Cloud wins on scale and latency-critical UX. Local wins on privacy, cost-at-volume, and research flexibility. Both are real, both will persist.
---
## Common Misconceptions
**"Groq will make GPUs obsolete."** No. LPU technology optimizes inference. GPUs dominate training, fine-tuning, and large-batch throughput. The NVIDIA-Groq acquisition is NVIDIA adding an inference capability to their platform — not replacing their core product line. Both markets exist and will continue to.
**"Local builds will be killed by cloud inference."** Privacy-sensitive workloads cannot move to cloud regardless of price. Healthcare, legal, and financial AI deployments run locally because regulation and security requirements demand it, not because of economics. That market is not affected by Groq's pricing.
**"The $20B price tag proves Groq is worth it as a vendor."** Acquisition price is NVIDIA's bet on a future market, not a guarantee of execution. NVIDIA paid 2.9x because they believed in the thesis — that thesis still requires Groq's LPU technology to outperform alternatives as model scales grow. Worth verifying with your own workload before committing.
**"Groq's benchmarks are unfair to NVIDIA."** Partially. Groq's headline numbers use speculative decoding and optimal configurations — every vendor does this. The independent Artificial Analysis benchmarks are more reliable, and they still show Groq at 276+ tok/s on Llama 3 70B versus 15–27 tok/s for consumer NVIDIA hardware. The gap is real, even after adjusting for marketing.
---
## FAQ
**What did NVIDIA acquire from Groq, and why does it matter for the inference market?**
NVIDIA acquired Groq's assets and licensed its LPU technology for ~$20 billion in December 2025. Groq's founder Jonathan Ross and senior leaders joined NVIDIA to integrate LPU architecture into the Vera Rubin platform. The deal matters because it confirms that inference-optimized hardware is a separate, valuable market — one that GPU-first architecture wasn't going to win without intervention. The NVIDIA Groq 3 chip, announced post-acquisition, claims 35x throughput per megawatt for trillion-parameter models versus Blackwell NVL72.
**How does Groq's LPU speed actually compare to local GPU inference?**
On Llama 3 70B, Groq independently benchmarks at 276–284 tok/s (Artificial Analysis, 2026). A single RTX 5090 cannot run Llama 3 70B in VRAM at standard quantization — the model exceeds 32GB — and CPU offloading brings throughput to ~15–25 tok/s. Dual RTX 5090 gets ~27 tok/s on 70B. For 32B models, a single RTX 5090 hits ~61 tok/s. For 8B models, ~213 tok/s. The gap narrows significantly at smaller model sizes.
**Is Groq's cloud actually cheaper than running local hardware?**
For 70B models: Groq charges $0.59/M input tokens. A single RTX 5090 costs ~$0.38/M tokens in electricity alone at 575W TDP — but that's before amortizing the $2,200 GPU cost. Including hardware, local inference costs $0.90–$1.20/M during the 2-year payback period. Groq wins financially for bursty or low-volume workloads. Local hardware wins for high-volume consistent inference after year two or three.
**Does this acquisition change what GPU I should buy for local LLM inference?**
No — not for hardware you're buying today. The Groq LPU integration into NVIDIA's next-gen Vera Rubin platform is a 2026–2027 deployment story. Your [near-term GPU decision](/articles/102-dual-gpu-local-llm-stack/) is unchanged. The longer-term implication: cloud inference pricing will get more competitive as this technology matures, which shifts the local vs cloud breakeven point for large-model workloads. Privacy-sensitive use cases still require local hardware regardless.