
Intel Arc Pro B65: The Only 32GB GPU Under $1,000 for Local AI Builds

By Charlotte Stewart · 10 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

There's exactly one slot in the 32GB discrete GPU market under $1,000, and NVIDIA doesn't fill it. Intel does.

The Intel Arc Pro B65 ships mid-April 2026 with 32 GB of GDDR6 VRAM, 608 GB/s of memory bandwidth, and a price tag Intel hasn't announced — but which will be below the B70's $949 launch price. It's the only discrete GPU at this price point that can fit a Llama 3.1 70B model entirely in VRAM without leaning on system RAM. If a 24 GB ceiling has been blocking your 70B inference builds, the B65 is worth understanding before it hits shelves.

This isn't a full review — the card ships in two weeks and no independent benchmarks exist yet. What this is: everything you need to know about the specs, the software path, the realistic memory math, and whether the B65 makes sense for your build before you decide to wait or move on.

The B65 Arrives (And Why Nobody's Covering It)

Intel announced the Arc Pro B70 and B65 on March 25, 2026. The B70 is available now at $949 through board partners. The B65 follows in mid-April at an unannounced-but-lower price.

And almost no one in the local AI community noticed.

That's partly fair — Arc's early reputation was earned the hard way. The A-series launched in 2022 with drivers that genuinely broke things, and that memory sticks. But the Battlemage architecture (B-series) is different hardware with a substantially more mature software story, and gaming benchmarks are the wrong lens for evaluating it anyway.

Nearly all Arc coverage focuses on gaming frame rates. The local AI angle gets ignored almost entirely.

Why the Silence?

The harder truth is that Intel entered this segment late, priced its best card at $949, and is asking builders to abandon CUDA for a vLLM-based inference stack most haven't touched. That's a real ask. The press doesn't cover "ask" well.

But the memory math is stubborn. NVIDIA's cheapest 32 GB option is the RTX 6000 Ada — around $2,500 used. The RTX 4090 tops out at 24 GB. If your workload is 70B inference and you don't want to spend $2,500+, Intel just gave you the only alternative.

Specs Breakdown: What You're Actually Getting

Intel Arc Pro B65 at a glance:

  • Architecture: Xe2 (Battlemage)
  • Xe2 Cores: 20
  • XMX Engines: 160
  • Peak AI Performance: 197 INT8 TOPS
  • VRAM: 32 GB GDDR6 ECC
  • Memory Bus: 256-bit
  • Memory Bandwidth: 608 GB/s
  • Boost Clock: 2,400 MHz
  • TBP (reference): 200W
  • PCIe Interface: PCIe 5.0 x16
  • Availability: Mid-April 2026
  • Price: TBD (below $949)

The B65 and B70 are not dramatically different cards — they carry identical memory configurations. The B65 drops from 32 Xe2 cores to 20, which reduces peak AI TOPS from 367 to 197. For inference workloads, where memory bandwidth typically bottlenecks single-user throughput more than raw compute, the memory spec matters more than the core count difference.

The 200W TBP figure is the official Intel design point. Board partners like ASRock and Sparkle implement their own cooling and power targets — some B70 partner cards are rated up to 330W peak. Expect B65 partner cards to vary between 160–250W in practice.

The PCIe 5.0 x16 interface is a genuine strength. Research on LLM inference and PCIe bandwidth shows less than 2% throughput difference between PCIe generations at x16 — not a factor.

Note

An earlier draft of this article listed the B65 as PCIe 4.0 x8. It's PCIe 5.0 x16 per Intel's official specification sheet, so the bandwidth bottleneck concern doesn't apply.

How Much Is 32 GB Really?

This is the question that matters. Here's the honest memory math for Llama 3.1 70B:

Fits in 32 GB? (Llama 3.1 70B)

  • Q2_K: Yes (headroom)
  • Q3_K_M (~29–32 GB): Yes (tight)
  • Q4_K_M (~40–42 GB): No (partial offload)
  • Q5_K_M and above: No

Q3_K_M on Llama 3.1 70B is a meaningful quantization level — quality loss is noticeable compared to Q4, but it's not broken output. And running Q3 fully in VRAM beats running Q4 with heavy CPU offloading. The RTX 4090 (24 GB) can't fit even Q3_K_M without offloading; it has to drop to Q2 or offload significantly, which caps throughput at roughly 8–15 tok/s due to PCIe memory transfer overhead.

For smaller models, the B65 has no trouble at all. Qwen 2.5 14B at Q8 needs around 15 GB — you could run two of them simultaneously with room to spare.
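You can sanity-check these numbers yourself: weights take roughly parameters × bits-per-weight ÷ 8 bytes, plus an allowance for the KV cache and runtime buffers. The bits-per-weight and overhead values below are rough illustrative assumptions, not measured figures:

```python
# Rough VRAM estimate: weight bytes plus a flat allowance for KV cache
# and runtime buffers. The bpw and overhead values are assumptions.
def est_vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    weights_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weights_gb + overhead_gb

for label, bits in [("70B @ Q3_K_M (~3.4 bpw)", 3.4), ("70B @ Q4_K_M (~4.6 bpw)", 4.6)]:
    need = est_vram_gb(70, bits)
    print(f"{label}: ~{need:.0f} GB, {'fits' if need <= 32 else 'does not fit'} in 32 GB")

print(f"14B @ Q8: ~{est_vram_gb(14, 8):.0f} GB each")  # two fit in 32 GB
```

Real quant formats mix bit widths per layer, so treat the output as a planning estimate, not a guarantee.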

The Software Reality: LLM-Scaler + vLLM

This is where Intel's story gets complicated, and anyone telling you otherwise is writing from outdated notes.

The old path — IPEX-LLM — was the library that enabled Ollama on Arc. Intel archived the IPEX-LLM repository on January 28, 2026, and explicitly stated they won't provide or guarantee continued support. It still works for some configurations, but you shouldn't build a production workflow on an archived codebase.

The current path is LLM-Scaler, Intel's actively maintained inference solution. It runs a vLLM backend with native Arc Pro B-series support — vLLM added this in their November 2025 release. Intel ships Docker images for both Linux and Windows. Setup is more involved than ollama run llama3 but it's documented and it works.

Warning

Don't follow tutorials that instruct you to set GPU_LIBRARY_PATH for an Ollama install. That workflow targeted IPEX-LLM, which is now archived. Use Intel's LLM-Scaler Docker images or vLLM with Intel's XPU backend instead.

Getting Started with LLM-Scaler

The setup is roughly four steps:

  1. Install the latest Intel Arc Pro GPU driver from Intel's developer portal
  2. Pull Intel's LLM-Scaler Docker image (ships with the correct vLLM build and Intel XPU runtime pre-configured)
  3. Launch the container and set ONEAPI_DEVICE_SELECTOR=level_zero:gpu to target your Arc GPU (older oneAPI builds used the now-deprecated SYCL_DEVICE_FILTER)
  4. Load your model via the vLLM API — LLM-Scaler exposes an OpenAI-compatible endpoint

First-run model load takes 30–60 seconds while the XPU runtime compiles kernels. Subsequent loads are faster. The OpenAI-compatible API means existing tooling (Open WebUI, Continue.dev, anything using the OpenAI SDK) works without modification.
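Because the endpoint speaks the OpenAI wire format, any HTTP client can drive it. Here's a minimal standard-library sketch; the localhost:8000 address and the model name are placeholders for whatever your container actually serves:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for a local LLM-Scaler /
# vLLM server. BASE_URL and the model name are placeholder assumptions.
BASE_URL = "http://localhost:8000/v1"

def chat_request(model: str, prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("llama-3.1-70b-q3", "Hello from Arc")
# resp = urllib.request.urlopen(req)  # uncomment with a running server
print(req.full_url)
```

In practice you'd more likely point the official OpenAI SDK (or Open WebUI) at the same base URL; the request shape is identical either way.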

It's not as frictionless as Ollama. But frictionless is a luxury you give up when you step outside the CUDA world. For a dedicated inference server, the setup cost is a one-time thing.

Performance Expectations: What We Know

The B65 doesn't have independent benchmarks yet. It ships mid-April and no review units are circulating in the local AI community as of this writing. Intel's own published comparisons show the Arc Pro B-series delivering significantly better performance-per-dollar than NVIDIA's RTX Pro 4000 workstation GPU at roughly half the price — but those are Intel's numbers on Intel's workloads.

What we can reason from:

The B65 and B70 share identical memory bandwidth (608 GB/s). For 70B inference, which is memory-bandwidth-bound in single-user configurations, the B65's throughput on large models should be close to the B70's. The gap widens on smaller models or compute-heavy workloads, where the B70's extra cores matter more.

For comparison: an RTX 4090 with CPU offloading on 70B models delivers roughly 8–15 tokens per second. A B65 running the same model fully in VRAM (Q3_K_M) should significantly exceed that — the offload penalty alone accounts for most of the gap. Whether that translates to 18 tok/s or 30 tok/s on the B65 specifically is a question that needs real benchmarks to answer.

Tip

If you're an early adopter willing to run a B65 through its paces in April, the local AI community needs your data. The r/LocalLLaMA benchmarking threads are where first-mover data gets validated fastest.

Cost Analysis: Where B65 Makes Financial Sense

Hardware cost comparison, as of March 2026:

70B Without Offload?

  • Intel Arc Pro B65 (32 GB, est. $700–$800): Q3_K_M: Yes
  • Intel Arc Pro B70 (32 GB, $949): Q3_K_M: Yes
  • RTX 4090 (24 GB, ~$1,600): No
  • RTX 6000 Ada (~$2,500 used): Yes

The B65's power advantage is real, but it's the smaller part of the cost story. At 200W TBP running 8 hours/day:

  • B65 (200W): ~$88/year in electricity at $0.15/kWh
  • RTX 4090 (450W under load): ~$197/year at the same assumptions

That's about $109/year saved, or $327 over three years. Not transformative, but it adds up — especially if you're running multiple machines or 24/7 workloads.
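The electricity figures above are just watts × hours × rate; a few lines make the assumptions explicit so you can rerun them with your own utility rate:

```python
# Annual electricity cost: (watts / 1000) kW * hours/day * 365 days * $/kWh.
def annual_cost(watts: float, hours_per_day: float = 8, rate_kwh: float = 0.15) -> float:
    return watts / 1000 * hours_per_day * 365 * rate_kwh

print(f"B65 (200W):      ${annual_cost(200):.0f}/yr")
print(f"RTX 4090 (450W): ${annual_cost(450):.0f}/yr")
```

At $0.30/kWh (common in parts of Europe and California), the gap roughly doubles, which is when the power argument starts to carry real weight.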

The real cost argument is upfront hardware. If you're choosing between a $1,600 RTX 4090 that can't fit 70B in VRAM and an estimated $700–$800 B65 that can (at Q3 quality), the performance-per-dollar math favors Intel on this specific workload. The B65 is the cheapest path to running 70B models without offloading in 2026 — by a wide margin.

Who Should Buy the B65 (And Who Shouldn't)

Buy the B65 if:

  • Running 70B+ models is your primary use case and you want to do it cleanly, without CPU offloading tanking your throughput
  • You're building a dedicated inference server and the LLM-Scaler setup overhead is a one-time cost, not an ongoing burden
  • You're running multiple smaller models in parallel (two 14B models at Q8 fits in 32 GB with headroom)
  • Power budget matters — 200W under sustained load is meaningfully cheaper than 300–450W cards over time

Skip the B65 if:

  • Your entire inference stack is CUDA-dependent — fine-tuning scripts, custom CUDA kernels, workflows that assume CUDA — and you can't migrate to XPU backends
  • You need validated, independent benchmark data before deploying. The card ships mid-April; wait six weeks and you'll have real numbers from real users
  • Maximum raw throughput is the priority (production SLA, latency-sensitive applications). The B70 at $949 has nearly the same memory bandwidth and 86% more compute — if you're paying for throughput, the extra $150–$200 to step up likely pays off

See our GPU tier comparison for 2026 for how this fits against NVIDIA options across all budget ranges.

The Honest Take: Is Arc Finally Ready?

For inference specifically? Closer than ever.

The Arc A-series criticisms from 2023–2024 targeted driver stability in gaming and rasterization paths. The Battlemage inference story runs through a completely different software stack — LLM-Scaler and vLLM, not GPU drivers in the traditional sense. That doesn't mean there are no rough edges. It means the rough edges are different ones: setup friction, smaller community, fewer tutorials.

Training and fine-tuning are still a different story. The CUDA ecosystem for fine-tuning — Unsloth, axolotl, trl — doesn't have mature XPU equivalents. If you're fine-tuning models locally, buy NVIDIA.

For pure inference on 70B+ models on a budget? The B65 fills a gap NVIDIA hasn't bothered to fill below $1,000. Intel's 32 GB at this price point isn't an accident — it's a deliberate targeting of the VRAM ceiling that's been blocking budget builders for two years. Whether the execution holds up under independent testing is the remaining question. We'll know the answer by early May.

Check out our RTX 5060 Ti 8GB vs 16GB analysis for comparison on the NVIDIA side of the sub-$500 tier.

Prices and availability as of March 29, 2026. B65 pricing estimated; Intel has not announced official pricing. Last verified: March 29, 2026


FAQ

How much VRAM does the Intel Arc Pro B65 have?

32 GB of ECC GDDR6 on a 256-bit bus, delivering 608 GB/s of memory bandwidth. Both the B65 and B70 carry the same 32 GB memory configuration — the B65 simply cuts core count from 32 Xe2 cores to 20, which reduces peak AI TOPS from 367 to 197. For memory-bandwidth-bound inference workloads like 70B model generation, the memory spec matters more than the compute gap.

Can the Intel Arc Pro B65 run Llama 3.1 70B?

Yes, at Q3_K_M quantization, which requires roughly 29–32 GB of VRAM. At 32 GB, the B65 fits the model without CPU offloading — the single biggest driver of inference slowdowns on smaller GPUs. Q4_K_M requires 40–42 GB, so you'll still need minimal CPU offloading at that level, but far less than any 24 GB card. No NVIDIA GPU under $1,000 can run 70B Q3_K_M without offloading.

Does the Intel Arc Pro B65 work with Ollama?

Not with the standard Ollama install. Intel's current recommended path is LLM-Scaler, which runs a vLLM backend with native Arc Pro B-series GPU support. IPEX-LLM — the older library that enabled a custom Ollama build on Arc — was archived by Intel in January 2026 and is no longer maintained. LLM-Scaler ships official Docker images and exposes an OpenAI-compatible API, so existing frontends like Open WebUI work without changes.

When does the Intel Arc Pro B65 ship and what does it cost?

The B65 launches mid-April 2026 through board partners (ASRock, Gunnir, Sparkle). Intel has not announced a price as of March 29, 2026. The B70 launched at $949; the B65 will be priced below that. The comparable Arc Pro B60 launched around $660 — expect B65 pricing in the $700–$800 range, though that's an estimate, not a confirmed figure.

Is Intel Arc driver quality still a concern in 2026?

For inference workloads, this is largely a non-issue. The driver stability concerns with Arc were concentrated in the gaming path — rasterization, DirectX, anti-cheat compatibility. LLM inference on Arc runs through the Level Zero runtime and Intel's XPU compute stack, which bypasses the graphics driver path almost entirely. The Battlemage launch has been significantly cleaner than A-series. Anyone telling you to skip Arc in 2026 because of 2023 driver problems is giving you stale advice.

intel-arc local-llm gpu-guide 70b-models inference-hardware
