Project Feynman: What Local AI Hardware Actually Looks Like in 2028

Q: Will a $1,000 GPU in 2028 run 70B models?

Not at Q4 quality without help. A 70B model at Q4_K_M quantization requires roughly 40–43 GB of VRAM just for the weights — and a 2028 $1,000 GPU will likely have 24–32 GB. You can run 70B with aggressive Q2/Q3 quantization or layer offloading, but it's not clean. The better move: by 2028, a 13–20B model will match today's 70B in capability. Build for that, not for 2026's 70B requirements.

Q: Should I buy a GPU now or wait until 2028?

Active workloads justify buying now. An RTX 5080 at $999 (as of March 2026) pays for itself in roughly two months versus H100 cloud rental at $3–5/hr. Hobbyists with no time pressure should wait: 2027–2028 hardware will offer better VRAM-per-dollar, and inference frameworks will be more stable. The cost of waiting isn't obsolescence — it's two years of slower iteration.

Q: Are multi-GPU setups worth it for local AI in 2028?

Not for home builders. GPU TDP is rising — the RTX 5090 already draws 575W, up from the RTX 4090's 450W. A dual-GPU rig hitting 1,000W+ requires dedicated electrical circuits and professional installation running $1,500–$5,000+. At current cloud H100 rates of $3–5/hr, the economics only favor local multi-GPU if you're running 24+ GPU-hours per day of sustained, predictable workloads. That's enterprise territory.

Q: How much VRAM do I actually need for a 70B model?

Minimum 48 GB for comfortable Q4_K_M inference. The weights alone take 40–43 GB at Q4, plus you need headroom for context. The RTX 5090's 32 GB falls about 10 GB short — workable with aggressive Q2/Q3 quantization, not ideal. Practical 48 GB options: dual RTX 4090s with split layers, an NVIDIA A6000 (48 GB), or a used A100 80 GB. Verified against llama.cpp VRAM tables, as of March 2026.

Q: What will GPU prices look like in 2028?

Roughly 15–25% cheaper in real terms for equivalent performance tiers, based on historical generational pricing patterns and AMD competition. The bigger shift is VRAM-per-dollar: 24 GB will likely be the standard mid-range tier at $600–800 by 2028, versus $999+ today. That's the main financial argument for hobbyists who can afford to wait.

The RTX 5090 draws 575 watts. Not a typo — that's 125 more watts than the RTX 4090, which already required an 850-watt power supply. Consumer GPU power draw is going up, not plateauing, and if you're planning a 2028 multi-GPU stack under the assumption that it will be electrically similar to a 2026 single-GPU build, that assumption will cost you somewhere between $1,500 and $5,000 in electrical work alone.

In 2028, a $1,000 GPU will likely have 24–32 GB of VRAM and deliver roughly 60–80 TFLOPS of FP32 compute — a real step up from today's RTX 5070 Ti (16 GB, ~44 TFLOPS, $749 as of March 2026). But that 70B model question doesn't resolve the way most people expect: Q4_K_M quantization of a 70B model requires 40–43 GB of VRAM just for the weights, and 2028's $1,000 tier won't reach that. The smarter bet is to stop planning around today's 70B models and start planning around 2028's smaller, more capable ones — a 2028 13B model will likely do what today's 34–50B does. Buy a single 16–32 GB GPU now if you have active workloads. Skip multi-GPU entirely unless you're running enterprise-scale inference every day.

This article separates what we actually know from what we're inferring. Hardware roadmaps through 2027 are public records. Thermal physics doesn't bend for marketing. Model efficiency trends are measurable. Everything else — cloud pricing wars, framework consolidation, regulatory shifts — is informed speculation, labeled as such.

The Feynman Framework — What We Know vs. What We're Guessing

Richard Feynman had a principle: know precisely what you know, and know precisely what you don't. The worst hardware predictions mix confirmed specs with market optimism and present both as fact. That's how people get burned on multi-GPU builds that made sense on paper and turned into $4,000 electrical projects.

Here's the separation for 2028:

Confirmed: NVIDIA's RTX 50 series specs are public and shipping. AMD's RDNA 4 launched in March 2025 (RX 9070 XT and RX 9070). TSMC's manufacturing roadmap through N3P node is publicly documented. Physics — how much heat air cooling can dissipate, how fast GDDR7 moves data, how many watts a 15A home circuit can sustain.

Reasonable speculation: AMD's next generation (RDNA 5) has reportedly taped out on TSMC N3P, with credible leaks pointing to a mid-2027 launch, but AMD hasn't announced anything officially. Model efficiency trends extrapolate forward from measurable data. GPU commodity pricing follows predictable generational patterns.

Unknowable: What any AI lab ships in 2027. Whether a software breakthrough makes current VRAM math irrelevant. Supply chain disruptions. Regulatory intervention in compute markets.

Every claim in this article carries one of those three labels, at least implicitly.

GPU Roadmap 2026–2028: What's Confirmed

The RTX 50 series is fully public. Several specs circulated in the community during the run-up to launch were wrong — corrections matter here because they change your VRAM planning.

FP32 TFLOPS (GPU labels and other table columns lost in extraction): 104.8, ~84, ~44

Warning

A widely repeated pre-launch claim had the RTX 5080 at 24 GB of VRAM. It shipped at 16 GB. Every board partner — ASUS, Gigabyte, MSI, PNY — confirms this. There is no 24 GB variant. If you're sizing a local AI rig around a 24 GB RTX 5080, the math doesn't work. Check actual manufacturer specs before planning builds around pre-release leaks.

GDDR7 delivers roughly 30% more memory bandwidth than GDDR6X at the same bus width: the RTX 5080 reaches 960 GB/s versus 736 GB/s for the RTX 4080 Super on an identical 256-bit bus. That bandwidth matters more for quantized inference throughput than raw TFLOPS, because token generation is usually memory-bound: each generated token requires reading the model's weights from VRAM, so bandwidth rather than compute sets the ceiling.
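
A rough way to see why bandwidth dominates is to treat single-stream token generation as memory-bound and divide bandwidth by the size of the weights. The sketch below does that with the two bandwidth figures quoted above. The 8 GB model size is a hypothetical value chosen for illustration, and the ceiling ignores KV-cache reads and kernel overhead, so real throughput will be lower.

```python
# Bandwidth figures quoted in the text (GB/s, identical 256-bit bus).
GDDR7_GBPS = 960.0   # RTX 5080
GDDR6X_GBPS = 736.0  # prior-generation figure quoted in the text


def bandwidth_gain(new_gbps: float, old_gbps: float) -> float:
    """Fractional bandwidth increase of the new card over the old one."""
    return new_gbps / old_gbps - 1.0


def decode_ceiling_tokens_per_s(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on decode speed, assuming every weight is read once per token.

    Ignores KV-cache traffic, kernel overhead and compute limits.
    """
    return bandwidth_gbps / weights_gb


gain = bandwidth_gain(GDDR7_GBPS, GDDR6X_GBPS)
print(f"GDDR7 vs GDDR6X: +{gain:.0%}")

# Hypothetical 8 GB quantized model, used only to illustrate the ceiling.
for name, bw in [("GDDR6X", GDDR6X_GBPS), ("GDDR7", GDDR7_GBPS)]:
    print(f"{name}: at most {decode_ceiling_tokens_per_s(bw, 8.0):.0f} tokens/s")
```

Under this assumption the ceiling scales linearly with bandwidth, which is why a 30% bandwidth bump can matter more than a larger TFLOPS bump.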

The full RTX 50 series Blackwell architecture also includes native FP8 tensor core support (per NVIDIA's Blackwell GPU architecture documentation). FP8 support is hardware-ready. The models optimized to use it are still arriving — 2027 and 2028 are when FP8-native weights become common. You're buying future readiness, not current capability.

AMD RDNA 4 (Shipping) and RDNA 5 (Speculation)

AMD's RDNA 4 is live — the RX 9070 XT and RX 9070 launched March 2025, with strong performance-per-watt gains over RDNA 3 and a 2x ray tracing throughput improvement per compute unit. AMD is not currently leading NVIDIA for LLM inference, where years of CUDA optimization depth create a real gap. But the value proposition in the $500–700 price tier is improving.

RDNA 5, targeted at mid-2027, is a credible rumor — tape-out on TSMC N3P has leaked, and sources including VideoCardz have reported the mid-2027 window — but AMD has made no official announcement using that name or date. Plan around it as "AMD's next generation arrives sometime in 2027" and treat any specific specs as placeholders until an official announcement.

The Power Wall Is Going Up, Not Staying Flat

This chart matters for home lab planning:

Year	Flagship GPU	TDP
---	---	---
2020	RTX 3090	350W
2022	RTX 4090	450W
2025	RTX 5090	575W

That's roughly a 28% increase per generation at the flagship tier. Performance-per-watt has improved (the 5090 does more work per watt than the 4090), but absolute power draw keeps climbing. The 5090 has already seen power connector overheating reports in the field (March 2025, multiple hardware review sites). NVIDIA recommends a 1,000W power supply.

Mid-range is more reasonable: the RTX 5070 Ti at 300W and RTX 5080 at 360W are manageable on any modern 850W PSU. But the direction for 2028 flagship cards is clear — 600W or higher is coming.
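
The per-generation growth rate can be checked with a few lines of arithmetic. This sketch uses the 450W and 575W figures from the text; the 350W value for the 2020 flagship (RTX 3090) is added here as the third data point. The final projection simply assumes the same average growth continues for one more generation, which is a naive extrapolation rather than a forecast.

```python
# Flagship board power by launch year, in watts.
# 450 and 575 come from the text; 350 (RTX 3090) is added here for 2020.
FLAGSHIP_TDP_W = {2020: 350, 2022: 450, 2025: 575}


def generational_growth(tdp_by_year: dict[int, int]) -> list[float]:
    """Fractional TDP increase between each pair of consecutive generations."""
    years = sorted(tdp_by_year)
    return [tdp_by_year[b] / tdp_by_year[a] - 1.0 for a, b in zip(years, years[1:])]


growth = generational_growth(FLAGSHIP_TDP_W)
avg_growth = sum(growth) / len(growth)

# Naive assumption: the next flagship grows by the same average rate.
projected_next_w = FLAGSHIP_TDP_W[2025] * (1 + avg_growth)

print([f"{g:.1%}" for g in growth])
print(f"average growth: {avg_growth:.1%}")
print(f"naive next-generation projection: {projected_next_w:.0f} W")
```

The straight-line projection lands well above 600W, so "600W or higher" reads as the conservative end of the range.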

For dual-GPU builds: two mid-tier cards at 300–360W each means 600–720W just for the GPUs, plus CPU, storage, fans. You're looking at a 30A dedicated circuit. Professional installation: $1,500–$5,000 depending on your panel, house wiring, and municipality. That's before you buy the second GPU.

This is the real case against home multi-GPU, and it gets worse every generation.

What $1,000 Actually Buys in 2028

Based on historical performance-per-dollar scaling and TSMC node progression, a 2028 mid-range GPU at the $1,000 price point will likely offer:

VRAM: 24–32 GB (up from 16 GB on today's RTX 5080)
FP32 compute: 60–80 TFLOPS (up from ~44 TFLOPS on the RTX 5070 Ti)
Memory bandwidth: ~900–1,100 GB/s

Estimated inference performance — these are projections, not benchmarks, labeled as such:

Notes (model and speed columns lost in extraction): Fits in 24 GB; ~70% size reduction; Near-full quality; Full precision; CPU offloading required

Software improvements in inference kernels (llama.cpp, vLLM, Ollama) will account for an estimated 20–35% of the performance gains over 2026 baselines: the same hardware gets faster as frameworks mature. This projection is speculative but consistent with the 2024–2026 trend.

Note

The tokens per second estimates above use today's inference engines as a baseline and apply conservative scaling. If a major kernel optimization lands in 2027 (plausible, given MLX and llama.cpp velocity), these numbers could be 30–40% higher.

The 70B VRAM Reality

This is where a lot of builds go wrong, so it's worth stating plainly.

A 70B model at Q4_K_M quantization requires approximately 40–43 GB of VRAM just for the model weights. Add an 8K context window and you're at roughly 43–46 GB. Extend to a 32K context window and add another ~10 GB. The minimum comfortable setup for serious 70B work is 48 GB of total VRAM: a single NVIDIA RTX A6000 (48 GB), a single A100 80 GB, or two RTX 4090s with layers split across both cards.
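
These figures can be reproduced with a back-of-the-envelope estimate: weight size is parameters times bits per weight, and the context cost is the key/value cache. The sketch below assumes roughly 4.85 bits per weight for Q4_K_M and a Llama-3-70B-style layout (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache). Those architecture values and the function names are assumptions for illustration, and other 70B models will differ.

```python
GB = 1e9  # decimal gigabytes


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def kv_cache_gb(
    context_tokens: int,
    n_layers: int = 80,        # assumed: Llama-3-70B-style depth
    n_kv_heads: int = 8,       # assumed: grouped-query attention
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # FP16 cache
) -> float:
    """Approximate KV-cache size; the factor 2 covers keys and values."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens / GB


Q4_K_M_BPW = 4.85  # approximate effective bits per weight (assumption)

w = weights_gb(70, Q4_K_M_BPW)
for ctx in (8192, 32768):
    kv = kv_cache_gb(ctx)
    print(f"{ctx:>6} tokens: weights {w:.1f} GB + KV {kv:.1f} GB = {w + kv:.1f} GB")
```

Runtime buffers and framework overhead come on top of these totals, which is why 48 GB is a comfortable floor rather than an exact requirement.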

Even the RTX 5090 at 32 GB falls about 10 GB short for full Q4_K_M inference. It can run 70B at Q2 or Q3, which is workable but carries real quality degradation. At Q3, you're trading a meaningful chunk of the model's reasoning capability for fit. That's not a great deal when a 34B model at Q4_K_M, roughly 20 GB of weights, fits in 24 GB with much milder quantization loss. (Q6 for a 34B model needs closer to 28 GB and does not fit.)

A 2028 $1,000 GPU with 24–32 GB faces the same arithmetic on today's 70B models. Q4_K_M requirements don't get easier because the GPU got faster. This is a fundamental constraint.

See the GPU VRAM requirements guide for how to size any build to specific model requirements.

But Here's Why It Doesn't Matter as Much as You'd Think

Llama 3.1 8B, released in 2024, matches or beats Llama 2 70B from 2023 on many common benchmarks. Same capability class, roughly 9x fewer parameters. That happened in one year.

Extrapolating the 2023–2026 model efficiency trend forward: a 2028 13B model will likely match a 2026 34B model across most benchmark categories. On narrow tasks, some 2028 13B models may approach today's 70B performance levels.

The implication is that planning your 2028 VRAM budget around running today's Llama 70B is the wrong frame. Plan around running whatever 13–20B model exists in 2028 that delivers 2026's 70B capability. That model fits easily in 24 GB at Q6, or in 16 GB at Q4. This is a different optimization target — and it points to 24 GB being a genuinely sufficient tier for most local AI use in 2028.
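
The fit claims above are easy to sanity-check. This is a minimal sketch that assumes approximate effective bits per weight for each format and a flat 2 GB allowance for context and runtime buffers. Both the table values and the allowance are illustrative assumptions, not measured numbers.

```python
# Approximate effective bits per weight (assumed values for illustration).
APPROX_BPW = {"Q4_K_M": 4.85, "Q6_K": 6.56}


def fits(params_billion: float, quant: str, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if the quantized weights plus a flat overhead allowance fit in VRAM."""
    weights_gb = params_billion * APPROX_BPW[quant] / 8
    return weights_gb + overhead_gb <= vram_gb


CASES = [
    (13, "Q6_K", 24),
    (20, "Q6_K", 24),
    (20, "Q4_K_M", 16),
    (70, "Q4_K_M", 32),
]

for params, quant, vram in CASES:
    verdict = "fits" if fits(params, quant, vram) else "does not fit"
    print(f"{params}B at {quant} in {vram} GB: {verdict}")
```

Long contexts need more than the flat 2 GB, so treat a marginal "fits" as a warning rather than a guarantee.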

Tip

The right question for 2028 planning isn't "will I fit a 70B model?" It's "what size 2028 model will do what a 2026 70B model does?" Current trajectory: 13–20B. Size your VRAM budget around that.

Quantization Evolution: Q4 to FP8

Current state, 2026: Q4_K_M is the standard local deployment format. It reduces a 70B model from ~140 GB (FP16) to roughly 40–43 GB, approximately a 70% reduction, with about 5% quality degradation across most tested models (Qwen 2.5 quantizes the most stably; Llama 3.3 degrades faster at very low bit-widths, per benchmarks from ionio.ai).

The direction through 2028: FP8 inference is maturing. RTX 50 series Blackwell hardware has native FP8 tensor core support, which enables lower-precision inference with less accuracy loss than Q3 or Q2. The limitation is that current models weren't trained with FP8 in mind. Models specifically optimized for FP8 deployment will start appearing through 2027, with mainstream availability by 2028.

FP8 stores one byte per weight, so when FP8-native 70B models are common (a 2028 development) they'll occupy roughly 70 GB: half the FP16 footprint, but larger than today's Q4_K_M files and still far too large for a 24–32 GB single GPU. A 34B-class FP8 model needs approximately 34 GB. A 13B FP8 model, though, needs roughly 13 GB and fits in 16 GB at near-full quality. That's a real shift in what budget hardware can handle.

The lesson: don't upgrade your GPU specifically for "better quantization support" right now. The models that take advantage of it are still arriving. Buy hardware that's FP8-ready (RTX 50 series checks that box) and let the models catch up to the hardware.

The Electricity Math (Using Actual Current Prices)

US residential electricity averaged approximately $0.17/kWh in 2024, per EIA data. This has risen roughly 10.5% since January 2025. The $0.13/kWh figure that appears in older local AI economics analyses reflects pre-2020 pricing — using it today understates your running costs by 25–30%.

Running the numbers at $0.17/kWh for 8 hours of inference per day:

Setup	Monthly electricity
---	---
Single GPU, 300W (RTX 5070 Ti)	~$12.24
Single GPU, 400W (mid-tier)	~$16.32
Dual GPU, 800W	~$32.64
Single GPU, 400W (always-on server)	~$48.96
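
The table follows from one formula: kilowatts times hours per day times days times the rate. A minimal sketch, assuming a 30-day month and the $0.17/kWh rate quoted above (the function name is illustrative):

```python
RATE_USD_PER_KWH = 0.17  # 2024 US residential average quoted in the text


def monthly_electricity_usd(
    watts: float,
    hours_per_day: float,
    days: int = 30,  # assumed month length
    rate: float = RATE_USD_PER_KWH,
) -> float:
    """Monthly electricity cost for a constant load run a fixed number of hours per day."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * rate


SETUPS = [
    ("Single GPU, 300W", 300, 8),
    ("Single GPU, 400W", 400, 8),
    ("Dual GPU, 800W", 800, 8),
    ("Single GPU, 400W, always on", 400, 24),
]

for label, watts, hours in SETUPS:
    print(f"{label}: ${monthly_electricity_usd(watts, hours):.2f}/month")
```

Swap in your own utility rate to localize the numbers, since residential rates vary widely by region.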

Compare that to cloud H100 rental. Market rates as of 2026: Lambda Labs at $2.49–$3.29/hr, AWS at ~$3.90/hr, Paperspace at ~$5.95/hr, with market-low providers under $2/hr. The widely cited figure of $50/hr for H100 doesn't reflect current pricing — it's not supported by any major cloud provider's rate card. At 8 hours/day of serious workloads: 240 GPU-hours/month × $3.13 (market average) = approximately $750/month in cloud costs.

Against $16/month local electricity for the same daily schedule, local hardware pays for itself fast. An RTX 5080 at $999 breaks even against cloud in roughly 45 days of daily use.
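
The break-even point is the hardware price divided by the daily saving over cloud. The sketch below uses the figures from this section ($999 card, $3.13/hr average cloud rate, 8 hours per day, a 400W load at $0.17/kWh). Like the comparison above, it treats a rented H100-hour and a local GPU-hour as interchangeable, which is a simplifying assumption.

```python
def breakeven_days(
    hardware_usd: float,
    cloud_usd_per_hour: float,
    hours_per_day: float,
    local_watts: float,
    rate_usd_per_kwh: float = 0.17,
) -> float:
    """Days of use until cumulative cloud spend exceeds hardware plus electricity."""
    cloud_per_day = cloud_usd_per_hour * hours_per_day
    local_per_day = local_watts / 1000 * hours_per_day * rate_usd_per_kwh
    daily_saving = cloud_per_day - local_per_day
    if daily_saving <= 0:
        return float("inf")  # local never pays off at these rates
    return hardware_usd / daily_saving


days = breakeven_days(
    hardware_usd=999,
    cloud_usd_per_hour=3.13,
    hours_per_day=8,
    local_watts=400,
)
print(f"break-even after about {days:.0f} days of daily use")
```

With these inputs the result is about six weeks, in the same range as the figure quoted above. Dropping `hours_per_day` to 1 pushes break-even to most of a year, which is the bursty-workload case where cloud wins.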

Where cloud wins: bursty, unpredictable workloads — a few jobs per week, not daily sustained compute. For those patterns, local infrastructure doesn't make financial sense.

For a complete ROI breakdown, see the single GPU vs multi-GPU ROI comparison.

Should You Buy Now or Wait Until 2028?

The "Buy Now" Case

You have a real workload (professional inference, active fine-tuning, daily use) and you're currently paying cloud costs. Buy now. A 2026 RTX 5080 (16 GB, $999) runs 13–20B models cleanly at Q4, handles 34B-class models only with aggressive quantization or partial CPU offload, and will still run 2028's 13B models well. A 2026 RTX 5090 (32 GB, $1,999) is the closest single-consumer-GPU option for 70B experimentation and handles multi-LoRA workflows with headroom to spare.

Today's 2026 RTX 5080 will not be obsolete in 2028. It'll be slower than 2028 hardware. That's not the same thing. The RTX 4090 from 2022 still runs every current model that fits in its 24 GB without complaint. GPU generations create speed gaps, not obsolescence.

Best buys for active workloads right now: RTX 5070 Ti (16 GB, $749) for 13–20B daily use, with 34B-class models possible under aggressive quantization or partial offload; RTX 5080 (16 GB, $999) for heavier fine-tuning; RTX 5090 (32 GB, $1,999) for anyone who actually needs 70B access or large-context fine-tuning.

The "Wait Until 2027–2028" Case

You're a hobbyist exploring local AI with no time pressure. 2027–2028 hardware will offer better VRAM-per-dollar: 24 GB will likely be the standard mid-range tier at $600–800 versus $999+ today. Software will be more stable — inference frameworks are still evolving fast, and 2027 Ollama and vLLM will be noticeably more mature than 2026 versions.

The cost of waiting two years is slower iteration, not irrelevance. If you can borrow cloud GPU time for exploration, waiting is the rational financial call.

The Multi-GPU Reality Check

For home builders: skip it. The answer is almost always no.

The hardware cost is manageable — two RTX 5080s run $2,000–2,500. What isn't manageable is the electrical infrastructure: a dedicated 30A 240V circuit costs $250–$900 installed; if your panel needs upgrading to support the load, add $800–$4,000. Realistically, a complete dual-GPU home lab electrical project runs $1,500–$5,000+ depending on your starting setup.

And by 2028, a 2026 dual-RTX-5080 rig (32 GB combined, ~720W) will perform roughly comparably to a single mid-tier 2028 GPU — one that draws 350W and runs on your existing circuit.

Multi-GPU for home use only makes sense with genuinely sustained, heavy 70B inference workloads running most of the day, every day. That's an enterprise decision, not a home lab one.

Common Misconceptions (2026 Edition)

"My 2026 GPU will be obsolete in 2028." No. Slower, yes. The RTX 4090 from 2022 still runs every current model that fits in its 24 GB without issue. "Obsolete" means unusable, and that's not what's coming.

"AI will run on phone CPUs by 2028." Small models (3B and under) run on Apple A-series chips right now. For 13B and above, you still need serious memory capacity and bandwidth, and phone-class memory bandwidth is not on track to approach a desktop GPU's by 2028, even where capacity reaches 16 GB. This claim is accurate for consumer-facing apps running tiny models, not for serious local AI workloads.

"Quantization will eliminate the VRAM problem." It reduces it. Q4_K_M cuts a 70B model from ~140 GB to ~42 GB — genuinely significant. But 42 GB still requires hardware most home builders don't have. Quantization is a meaningful compression method, not a solution to running any model on any hardware.

"Cloud will always be cheaper." Cloud is cheaper for bursty, irregular use. If you're running 8+ hours/day, local hardware pays for itself in under two months versus cloud H100 rates. Professionals with consistent workloads should own hardware.

"Software optimization will replace hardware scaling." They're complementary. Better kernels improve throughput on existing hardware by 20–35%, and that's real — but hardware determines the ceiling. Software optimization is the multiplier; hardware is the base.

CraftRigs Take — The Feynman Bet

Here's what we're confident about:

NVIDIA's RTX 50 series roadmap is set. AMD's next generation targets mid-2027 based on credible tape-out evidence. VRAM will grow — 24 GB at the mid-range tier by 2028 is close to certain. Consumer GPU TDP will reach 600W+ at the flagship level (plan your power supply and circuit accordingly). Model efficiency improves every six months, measurably and consistently. A 2028 13B model will be materially more capable than a 2026 13B model.

What we're speculating on with confidence: GPU prices drop 15–25% in real terms by 2028. FP8-native models become mainstream in 2027–2028. Cloud GPU rates drop as competition increases. Local AI remains a niche market — professionals and privacy-sensitive users, not consumer mainstream.

The safe bet: A single-GPU rig with 16–32 GB of VRAM purchased in 2026 remains a productive local AI machine through 2028. The models improve around it. The software gets faster. You don't need the next GPU until your specific workload outgrows your current one — and that specific workload should determine what you buy next.

The unsafe bet: Multi-GPU setups staying economically justifiable for home builders. Rising power draw and electrical infrastructure costs make the economics worse every generation, not better.

Don't future-proof by overspending on hardware you don't need yet. Future-proof by buying flexible VRAM capacity — because VRAM constraints don't change with model versions, but compute requirements do.

The Road to 2028: What to Watch

2026 signals worth tracking: RTX 5090 and 5080 price normalization (both are above MSRP in many markets as of March 2026). AMD RX 9070 XT real-world LLM inference benchmarks as they accumulate. llama.cpp FP8 inference support maturing from experimental to stable.

2027 signals: An official AMD RDNA 5 announcement — watch VRAM specs and price tier positioning. Any major model lab releasing FP8-native weights as the default download format (signals quantization infrastructure is ready). Cloud H100 pricing dropping below $2/hr at commodity providers.

2028 decision point: If 32 GB reaches the $700–800 price tier as predicted, the upgrade case weakens for anyone with a current 16 GB RTX 5080. The reason to upgrade won't be "my GPU can't run the models" — it'll be "my workflow has grown beyond what 16 GB can handle." That's a much healthier upgrade trigger than chasing specs.

FAQ

Will a $1,000 GPU in 2028 run 70B models?

Not at full Q4 quality without compromises. A 70B model at Q4_K_M quantization needs roughly 40–43 GB of VRAM for the weights — and a 2028 $1,000 GPU will likely land at 24–32 GB. You can run 70B at Q2 or Q3 with quality degradation, or use CPU offloading for some layers with a speed penalty. The real answer is to stop planning around 2026's 70B models: by 2028, a 13–20B model will match what today's 70B delivers, and that fits in 24 GB at full Q6 quality.

Should I buy a GPU now or wait until 2028?

Active workloads justify buying now — an RTX 5080 at $999 pays for itself in roughly two months versus daily H100 cloud rentals. Hobbyists without time pressure are rational to wait: 2027–2028 hardware will bring better VRAM-per-dollar ratios and more mature software. Waiting costs you two years of slower iteration. It doesn't cost you relevance.

Are multi-GPU setups worth it for local AI in 2028?

For home builders: almost never. GPU TDP is rising — the RTX 5090 already draws 575W (up from the RTX 4090's 450W), and a dual-flagship setup hits 1,000W+. Dedicated electrical circuits and panel work cost $1,500–$5,000+ before you buy a second GPU. The economics only favor local multi-GPU at 24+ GPU-hours of sustained daily compute. At that scale, you're running enterprise workloads and should be evaluating enterprise hardware, not consumer GPUs.

How much VRAM do I actually need for a 70B model?

Minimum 48 GB for comfortable Q4_K_M inference — weights take 40–43 GB, context overhead adds on top. The RTX 5090 (32 GB) falls about 10 GB short for full Q4 work. Practical 48 GB setups: dual RTX 4090s with layers split across both cards, an NVIDIA A6000 (48 GB single card), or a used A100 80 GB for headroom at scale. Verified against llama.cpp VRAM requirements, as of March 2026.

What will GPU prices look like in 2028?