Samsung + NVIDIA Groq 3 LPU: What the Partnership Actually Means for AI Inference

By Charlotte Stewart 10 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.


# Samsung + NVIDIA Groq 3 LPU: What the Partnership Actually Means for AI Inference

The headlines in December called it a partnership. Samsung manufacturing Groq's chips. Interesting footnote for the inference market. Worth noting.

What those headlines buried was the $20 billion story published the same week: NVIDIA bought Groq. Not a partnership. Not a supply agreement either side could walk away from. NVIDIA spent more acquiring Groq's IP and team than on any other deal in its 30-plus-year history, and announced it on Christmas Eve, when nobody was watching.

The Samsung manufacturing relationship is real. The LPU architecture is genuinely impressive. But calling this "Samsung + Groq" in 2026 is like calling M3 chips "TSMC + Apple." Samsung is the foundry. NVIDIA is the customer. And what NVIDIA built with Groq's technology is exclusively for hyperscalers.

Here's what actually changed.

**TL;DR: NVIDIA acquired Groq for $20B in December 2025. Samsung manufactures the resulting Groq 3 LPU on 4nm, with commercial shipments beginning Q3 2026. This is a rack-scale enterprise component for NVIDIA's Vera Rubin platform — no consumer product exists or appears on any roadmap. For local builders in 2026, your options are unchanged. Buy the RTX 5070 Ti. Don't wait.**

---

## What the Samsung-Groq Partnership Actually Is

NVIDIA's deal with Groq was announced on Christmas Eve 2025. The structure: $20 billion in IP licensing and an acquihire, pulling founder Jonathan Ross and the core engineering team into NVIDIA. Groq as a legal entity nominally continues to exist. Groq as an independent AI chip company does not.

Samsung's role is straightforward: foundry contractor. NVIDIA's Groq 3 LPU — branded the "Groq 3 LPX" in its rack-scale form — is manufactured on Samsung's 4nm process. Production ramped from roughly 9,000 wafers to approximately 15,000 wafers for commercial scale. Shipments begin Q3 2026, confirmed by Jensen Huang at GTC on March 16, 2026.

Why Samsung instead of TSMC? Groq made that call in 2023, before NVIDIA entered the picture. Samsung's 4nm wafer pricing runs about 30% cheaper than TSMC's equivalent — around $13,000 per wafer versus $18,500. TSMC's advanced node capacity is also fully loaded by Apple, AMD, and NVIDIA's own Blackwell orders. Getting a new LPU program into that queue would have taken years. Samsung had capacity. NVIDIA kept the arrangement after acquiring the company.
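
The wafer-price gap is easy to check by hand. The sketch below uses only the two per-wafer prices quoted above and the 15,000-wafer commercial-scale figure; the function names are illustrative, and the article does not say over what period those wafers are produced, so the total is a rough sense of scale and not a forecast.

```python
# Per-wafer prices quoted in the article (USD).
SAMSUNG_4NM_USD = 13_000
TSMC_4NM_USD = 18_500


def wafer_discount_pct(cheaper: float, pricier: float) -> float:
    """Percentage saved per wafer by choosing the cheaper foundry."""
    return (1 - cheaper / pricier) * 100


def total_savings_usd(wafers: int, cheaper: float, pricier: float) -> float:
    """Absolute savings across a given number of wafers."""
    return wafers * (pricier - cheaper)


if __name__ == "__main__":
    pct = wafer_discount_pct(SAMSUNG_4NM_USD, TSMC_4NM_USD)
    print(f"{pct:.1f}% cheaper per wafer")
    # 15,000 is the article's commercial-scale wafer figure; the period is not stated.
    saved = total_savings_usd(15_000, SAMSUNG_4NM_USD, TSMC_4NM_USD)
    print(f"${saved:,.0f} saved across 15,000 wafers")
```

The per-wafer difference works out to just under 30%, which matches the "about 30%" figure.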

> [!NOTE]
> This is NVIDIA's chip now. When you read "Samsung-Groq partnership," substitute "Samsung manufacturing NVIDIA's Groq-derived inference accelerator." That framing matters for understanding where this hardware ends up and who can buy it.

---

## LPU vs GPU: What the Architecture Actually Does Differently

The LPU was built to solve one problem GPUs are genuinely bad at: getting the first token out fast for a single interactive user.

A GPU is a parallel processing machine, designed for rendering, adapted for training, and then adapted a second time for [inference](/glossary/inference/). Token generation is sequential: token one, token two, token three, each depending on the ones before it. GPUs handle this awkwardly because their architecture is optimized for doing thousands of things simultaneously, not one thing very fast.

The LPU flips this. The core architectural insight — which is now NVIDIA's — is that inference is fundamentally a memory bandwidth problem. Moving weights from memory to compute is the bottleneck. Eliminate the memory hierarchy, eliminate the bottleneck.

### The Memory Hierarchy Difference

GPUs carry HBM chips mounted alongside the processor die. Fast by conventional standards. An H100 delivers roughly 3.35 TB/s of memory bandwidth.

The Groq 3 LPU carries 512 MB of on-chip SRAM per die. Not external. Physically on the same silicon as the compute units. That architecture delivers 150 TB/s of memory bandwidth — roughly 45x what the H100 achieves. Unlike external HBM, on-chip SRAM has zero contention. No memory bank conflicts. When the compiler schedules a weight transfer at clock cycle N, it happens at clock cycle N. Deterministic.
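
To see why bandwidth dominates, here is a back-of-envelope sketch. It assumes single-user decoding where every weight is read once per generated token, and 2 bytes per parameter (FP16). Both are simplifying assumptions, not figures from NVIDIA or Groq. The bandwidth and SRAM numbers are the ones quoted above; the function names are illustrative.

```python
import math

H100_BANDWIDTH_TB_S = 3.35   # HBM bandwidth quoted above
LPU_BANDWIDTH_TB_S = 150.0   # on-chip SRAM bandwidth quoted above
LPU_SRAM_MB = 512            # on-chip SRAM per die quoted above


def bandwidth_bound_tok_s(bandwidth_tb_s: float, params_billions: float,
                          bytes_per_param: float) -> float:
    """Upper bound on single-stream tokens/s if each token reads all weights once."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token


def dies_for_weights(params_billions: float, bytes_per_param: float,
                     sram_mb_per_die: float) -> int:
    """Dies needed just to hold the weights in SRAM (decimal MB, no activations)."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return math.ceil(total_bytes / (sram_mb_per_die * 1e6))


if __name__ == "__main__":
    ratio = LPU_BANDWIDTH_TB_S / H100_BANDWIDTH_TB_S
    print(f"Bandwidth ratio: {ratio:.1f}x")
    bound = bandwidth_bound_tok_s(H100_BANDWIDTH_TB_S, 70, 2)
    print(f"H100 bound, 70B at FP16: {bound:.1f} tok/s")
    print(f"Dies to hold 70B at FP16: {dies_for_weights(70, 2, LPU_SRAM_MB)}")
```

Under these assumptions a single H100 tops out in the low twenties of tokens per second on a 70B model, and a 70B model needs hundreds of 512 MB dies to sit entirely in SRAM. That is why LPU deployments are rack-scale.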

In practice: the first-gen 14nm LPU benchmarked at roughly 250 [tok/s](/glossary/tokens-per-second/) on Llama 3.1 70B in independent API measurements by Artificial Analysis (as of early 2026). With speculative decoding, Llama 3.3 70B reached 1,660 tok/s on GroqCloud.

For comparison: an RTX 5070 Ti running a 70B model locally at [Q4 quantization](/glossary/quantization/) generates roughly 15-25 tok/s.

Depending on which of those figures you pair, the LPU is roughly 10x to 110x faster on raw token generation.
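
The size of the gap depends on which numbers you pair. A minimal sketch using only the throughput figures quoted above (the helper name is illustrative):

```python
def speedup_range(lpu_tok_s: float, local_low: float, local_high: float):
    """Return (smallest, largest) speedup of the LPU over a local tok/s range."""
    return lpu_tok_s / local_high, lpu_tok_s / local_low


if __name__ == "__main__":
    local_low, local_high = 15, 25  # local 70B Q4 range quoted above
    for label, lpu_tok_s in [("baseline", 250), ("speculative decoding", 1660)]:
        low, high = speedup_range(lpu_tok_s, local_low, local_high)
        print(f"{label}: {low:.1f}x to {high:.1f}x")
```

The baseline figure gives a speedup in the low teens, while the speculative-decoding figure pushes it past 100x against the slow end of the local range.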

> [!WARNING]
> That speed comparison isn't apples-to-apples. GroqCloud runs full-precision models on rack-scale cloud infrastructure. Your local setup runs quantized GGUF weights on consumer hardware with no servers, no subscription fees, and no data leaving your machine. The use cases are different. Don't let the tok/s gap make you think you're missing out on a near-equivalent local option — you're not comparing the same thing.

---

## Why Samsung Instead of TSMC Matters (But Not for the Reason You Think)

The foundry story matters less for performance than for supply chain resilience.

TSMC fabbing 90%+ of the world's advanced AI chips is a geopolitical risk that NVIDIA understands better than anyone. One earthquake, one political disruption, one production fire — global NVIDIA GPU supply halts. That's not hypothetical. The 2021 semiconductor shortage left that lesson in the industry's institutional memory.

Samsung operates advanced fabs in South Korea. If TSMC faces disruption, Samsung's production line continues. For a company shipping rack-scale inference infrastructure to every major hyperscaler, production across two geographies isn't a luxury. It's basic supply chain engineering.

Cost is the second factor. 30% wafer cost savings at the volume NVIDIA needs — hundreds of thousands of Groq 3 LPUs for AWS, OpenAI, and others — compounds into real manufacturing margin. Samsung gets AI production credibility and wafer revenue. NVIDIA gets cost structure and geographic redundancy.

But none of those savings pass to consumers. Samsung manufacturing the chip doesn't make it accessible to individual builders. The Groq 3 LPX is a rack component inside NVIDIA's Vera Rubin platform. You access it through cloud inference, which NVIDIA projects providers can price at up to $45 per million tokens for premium LPU-accelerated service. You cannot put it in a PC.

---

## Realistic Impact on Pricing and Availability (2026-2027)

**2026:** Groq 3 LPX supply goes exclusively to hyperscaler partners — AWS is the first confirmed deployer, alongside OpenAI's committed 3 GW of dedicated inference capacity. No consumer SKU exists.

**2027:** No consumer or prosumer roadmap from NVIDIA or any OEM partner has been published. Post-acquisition, Groq is NVIDIA's inference division. The product strategy targets AI factories, not local builders. A standalone inference appliance in the $1,500 range is not on any roadmap from any party as of March 2026.

One pricing signal does exist: NVIDIA projects inference providers can charge up to $45 per million tokens for premium LPU-accelerated inference. That's a cloud pricing signal aimed at enterprise buyers, not a consumer product price point.
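
For a sense of scale against that figure, the electricity cost of generating a million tokens locally can be estimated. This is a sketch under stated assumptions: the 450 W system draw and $0.15/kWh rate are illustrative placeholders, not measurements, and 20 tok/s is the local 70B figure used elsewhere in this article. Hardware cost is deliberately excluded.

```python
def electricity_usd_per_million_tokens(tok_per_s: float, system_watts: float,
                                       usd_per_kwh: float) -> float:
    """Electricity cost to generate one million tokens at a steady rate."""
    seconds = 1_000_000 / tok_per_s
    kwh = system_watts * seconds / 3_600_000  # watt-seconds to kWh
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # Assumed values: 450 W draw and $0.15/kWh are hypothetical.
    cost = electricity_usd_per_million_tokens(20, 450, 0.15)
    print(f"~${cost:.2f} of electricity per million tokens")
```

Swap in your own wattage and utility rate. The point is the order of magnitude, which sits far below an enterprise-tier cloud price once the hardware is already paid for.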

Precedent from the GPU market: when AMD re-entered the discrete GPU space competitively, NVIDIA prices compressed modestly but didn't collapse. Competitive pressure takes years to materialize in consumer pricing, even when the underlying technology is legitimate.

---

## What This Means for Your Build Decision Right Now

Nothing changes. That's the honest answer.

If you're building a [local LLM inference rig](/guides/gpu-for-inference/) in 2026, the Groq 3 LPX is irrelevant to that decision. There's no Groq product you can buy, no delivery timeline that affects a 2026 purchase, and no spec sheet to compare against what's available today.

For 2026 local builds, the calculus is exactly what it was before Christmas Eve:

- **Running 8B-14B models daily:** RTX 5070 Ti (16 GB VRAM). Handles Llama 3.1 8B and Qwen 14B at full Q4 quality, 40+ tok/s, fits any PCIe x16 slot
- **Running 30B-70B regularly:** RTX 5080 (16 GB VRAM) or a dual-card setup. Q4 70B gets you 20+ tok/s, higher VRAM budget required
- **70B full quality or 100B+ models:** Multi-GPU build or Apple Silicon for the unified memory path
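
A quick way to sanity-check these tiers is to estimate the weight footprint against available VRAM. The sketch below assumes a flat 4 bits per weight for Q4 (real GGUF Q4 variants use somewhat more) and a hypothetical 2 GB allowance for KV cache and runtime overhead. Both are simplifications, and the function names are illustrative.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (decimal)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed overhead allowance fit in VRAM."""
    return weight_footprint_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size in (8, 14, 70):
        gb = weight_footprint_gb(size, 4)
        verdict = "fits" if fits_in_vram(size, 4, 16) else "does not fit"
        print(f"{size}B at 4-bit: ~{gb:.0f} GB of weights, {verdict} in 16 GB")
```

At 4 bits, 8B and 14B models leave comfortable headroom on a 16 GB card, while a 70B model needs roughly 35 GB for weights alone. That is why the 70B tiers point toward multi-GPU or unified-memory setups.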

The [RTX 5070 Ti vs 5080 comparison](/comparison/rtx-5070-ti-vs-5080/) covers the specific inference trade-offs in detail. Short version: 5070 Ti for most people, 5080 if you're running 70B daily and response latency matters more than the price delta.

None of that changes because NVIDIA acquired an enterprise inference chip company.

> [!TIP]
> If you want LPU-speed inference without the hardware build, GroqCloud's free tier lets you run Llama 3.3 70B at cloud LPU speeds right now. It's not local: your prompts go to a server. But for coding assistance, document review, or research where privacy isn't the concern, the free tier costs nothing to try. The $45 per million tokens figure is NVIDIA's projected ceiling for premium enterprise inference, not a hobbyist price, and it doesn't undercut the electricity cost of a local 70B rig you already own.

---

## Common Misconceptions About the LPU and Samsung Deal

**"LPU is the future of AI hardware."** LPU is the future of *inference latency at hyperscale*. That's a specific niche. Training, fine-tuning, mixed workloads, and consumer inference all still run on GPUs. Even NVIDIA's own Vera Rubin platform pairs the LPX rack alongside a GPU cluster — not instead of one.

**"Samsung manufacturing means LPU will be cheaper than GPU."** Samsung's cost advantage goes to NVIDIA's margin, not to your purchase price. Consumer GPU pricing is set by gaming demand, competitive dynamics with AMD, and NVIDIA's own pricing strategy — not by what a rack component costs to fabricate.

**"I should wait for Groq instead of buying an RTX 5070 Ti today."** There's nothing to wait for. No consumer Groq product exists. Waiting means running ChatGPT for another 18 months while a local setup would have been running on your desk since February.

**"This deal proves AI chip design is moving away from NVIDIA."** NVIDIA spent $20 billion to acquire the one domestic company with real architectural differentiation in inference. That's not a retreat from dominance — that's extending the moat. GPU dominance is secure for 2026 and well beyond.

---

## Competitive Positioning: NVIDIA Stays Dominant

| 2026 Status |
|---|
| Shipping now |
| Enterprise rack, Q3 2026 |
| Shipping now |
| Enterprise-focused |
| Google Cloud only |

NVIDIA's dominance comes from software lock-in, not just silicon. CUDA, PyTorch, ONNX, and the entire training and fine-tuning toolchain run on NVIDIA. Groq runs inference only, on a proprietary software stack, with no local model support. NVIDIA absorbed that threat and kept everything else intact.

### The Real Competitive Threat (Spoiler: It's Not Groq)

The genuine competitive pressure on NVIDIA comes from hyperscalers building their own silicon. Google TPU, Meta's MTIA chip, Microsoft Maia — these companies are the biggest GPU customers on earth and they're all quietly developing alternatives to reduce that dependency. Chinese competitors, including Huawei Ascend and Alibaba's inference hardware, are advancing fast in their domestic market.

Groq was U.S.-focused, inference-only, and cloud-native. NVIDIA absorbed the one domestic competitor with real architectural differentiation. The hyperscaler custom silicon programs are the story worth watching — not Samsung's wafer output.

---

## Key Milestones to Watch in 2026-2027

**Q3 2026:** First Groq 3 LPX shipments to AWS and OpenAI. NVIDIA will publish benchmark claims for the Vera Rubin + LPX combination. Treat them as marketing claims until third-party validation arrives.

**Q4 2026:** Real-world performance data starts leaking from hyperscalers. If first-token latency on 70B+ models drops measurably at major inference providers, you'll see it in API response times.

**Q1-Q2 2027:** Third-party benchmarks comparing Groq 3 LPX against GPU inference on equivalent workloads. This is the data that actually tells you whether the architecture delivers on its claims at production scale.

**CraftRigs trigger point:** The [best GPU for local LLM guide](/guides/best-gpu-for-local-llm-2026/) gets a Groq hardware section when, and only when, a sub-$1,500 inference appliance ships with independently verified throughput above 30 tok/s on 70B single-user inference. Until a product like that exists, this remains enterprise news.

---

## FAQ

**What is the Samsung-Groq LPU partnership?**
It's more accurate to call it a Samsung-NVIDIA manufacturing arrangement. NVIDIA acquired Groq's IP and engineering team in a $20B deal announced December 24, 2025. Samsung is the foundry contractor producing NVIDIA's Groq 3 LPU on a 4nm process. Production ramped in early 2026, with commercial shipments to hyperscalers beginning Q3 2026.

**Can I buy a Groq LPU for local AI?**
No. The Groq 3 LPX is a rack-scale enterprise component inside NVIDIA's Vera Rubin platform. It's designed for hyperscalers like AWS and OpenAI running trillion-parameter models at scale. No consumer or prosumer Groq product appears on any published roadmap from any party.

**Why Samsung over TSMC for the LPU?**
Groq selected Samsung in 2023, before NVIDIA acquired the company. Samsung's 4nm wafer pricing is roughly 30% cheaper (~$13,000 vs TSMC's ~$18,500). TSMC's advanced node capacity was also fully loaded by Apple, AMD, and NVIDIA's own Blackwell production. NVIDIA kept Samsung as the foundry partner post-acquisition. Cost and capacity both drove the decision.

**How fast is Groq LPU compared to a consumer GPU?**
The first-gen 14nm Groq LPU benchmarked at roughly 250 tok/s on Llama 3.1 70B in independent API measurements (Artificial Analysis, 2026). With speculative decoding, Llama 3.3 70B reached 1,660 tok/s on GroqCloud. An RTX 5070 Ti running the same size model locally at Q4 quantization generates roughly 15-25 tok/s. The LPU is dramatically faster on raw token generation — but it runs full-precision cloud models, not locally quantized GGUF weights. These are different use cases with different privacy trade-offs.

**Does this change which GPU to buy in 2026?**
No. No Groq consumer product exists and none is coming. Your local AI hardware decision in 2026 is between NVIDIA RTX 50-series GPUs and Apple Silicon — same as it was before December 2025. Build what's available now. The RTX 5070 Ti at ~$749 is the right call for most people running models up to 30B.