Running a local voice assistant in 2026 costs under $2,000 in hardware, uses entirely free and open-source software, and achieves sub-second end-to-end latency. Three months ago that sentence would have been aspirational. Now it's a shopping list.
The pieces finally arrived: Cohere Transcribe (released March 26, 2026) with 5.42% word error rate — better than Whisper Large v3 — runs locally with no API costs. Voxtral TTS (released March 27, 2026) generates natural speech at 70ms time-to-first-audio. Qwen 3.5-27B (released February 24, 2026) fits in 16GB VRAM at Q4_K_M quantization and matches proprietary models on coding and reasoning benchmarks. And vLLM 0.18.0 (released March 20, 2026) now supports Eagle3 speculative decoding specifically for Qwen 3.5, pushing throughput to numbers worth talking about.
TL;DR: The complete mid-range local AI stack in 2026 is RTX 5070 Ti + vLLM 0.18.0 + Qwen 3.5-27B Q4 + Cohere Transcribe + Voxtral TTS, for approximately $1,900 in hardware. Budget builders get a functional text-only stack for ~$800 in core components. The full voice stack requires 16GB VRAM minimum — 8GB cards can't host the voice models. Here's exactly what to buy and how to wire it together.
What "Full Stack" Actually Means
Most guides cover one layer. GPU buying guides focus on VRAM. LLM guides focus on benchmark scores. Nobody explains what happens when you try to run transcription, inference, and synthesis simultaneously on one card — and whether it actually works.
Full stack means four layers working together: hardware (GPU, CPU, RAM), inference runtime (the software that schedules GPU operations), language model (the LLM generating text responses), and voice I/O (transcription converting your speech to text, synthesis converting the LLM's output back to speech). Every layer has performance requirements, and they interact.
The practical problem: voice models and LLMs want to share the same GPU VRAM. Managing that contention is the hard part nobody documents.
Hardware Tiers: GPU Selection for the Full Stack
The GPU is the bottleneck. Everything else — CPU speed, RAM amount, NVMe choice — matters at the margins. Buy the right GPU tier and the rest of the build follows.
Warning
RTX 5070 Ti MSRP is $749 as of March 2026, but street prices average $812 with limited MSRP availability. Budget $830-850 if you need it this week.
Budget Tier: RTX 4060 Ti 8GB (~$300)
At $299-349, the RTX 4060 Ti (8GB) handles every 7-8B model at full quality and decent speed. It runs Llama 3.1 8B, Qwen 2.5 7B, and similar models without quantization compromises. What it can't do: Voxtral TTS requires 16GB minimum. So on an 8GB card, you get text-only inference. Cohere Transcribe (2B params, ~4GB VRAM) could theoretically run alongside a 7B model, but VRAM is so tight that you'd need careful streaming management. Budget stack is text-first.
The AMD alternative — RX 9060 XT 16GB — launched June 2025 at $349 MSRP and currently sells for $389-449 at most retailers. 16GB at this price point is genuinely good value, and it runs Voxtral TTS. ROCm support for vLLM has improved but still trails CUDA in compatibility. If you're comfortable with Linux and occasional driver friction, it's a strong pick.
Mid-Range Tier: RTX 5070 Ti 16GB — Our Pick
16GB VRAM, 896GB/s memory bandwidth, $749 MSRP (~$812 street). This is the card that makes the full stack viable. Qwen 3.5-27B at Q4_K_M uses the full 16GB — there's almost no headroom — but with vLLM's continuous batching and streaming, you can interleave voice and text inference on one card with acceptable latency.
Note
"16GB is enough for 27B at Q4" is technically true but practically tight. Expect occasional VRAM page faults to system RAM when the KV cache grows during long conversations. Keep context length under 4,096 tokens for smooth single-card operation.
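The 4,096-token guidance follows from simple KV-cache arithmetic. A rough estimator, where the layer count and head dimensions are illustrative assumptions rather than Qwen 3.5-27B's actual config (its hybrid linear/full-attention layout stores less KV than this dense worst case):

```python
# Rough KV-cache size for a dense transformer. All architecture numbers
# here are illustrative assumptions, not Qwen 3.5-27B's real config.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2):
    # 2 tensors (K and V) per layer, fp16 elements by default
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

gb = kv_cache_bytes(num_layers=60, num_kv_heads=8, head_dim=128,
                    context_len=4096) / 2**30
print(f"~{gb:.2f} GiB KV cache at 4k context")  # ~0.94 GiB
```

Even under these assumptions, 4k of context eats most of the ~1-2GB of headroom left after the 13.5GB of Q4 weights, which is why longer conversations start paging.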
Full build cost with Ryzen 5 9600X, 32GB DDR5, 2TB NVMe, motherboard, and PSU: approximately $1,850-1,950.
Power User Tier: Dual RTX 4090 or Single RTX 5090
If you want 70B models at real throughput, you need at least 48GB VRAM. Here's the catch most guides skip: RTX 50-series consumer cards have no NVLink. Zero. NVIDIA discontinued consumer NVLink after the RTX 3090. A dual RTX 5070 Ti setup (2×16GB = 32GB) communicates over PCIe — and 32GB still isn't enough for Llama 3.1 70B at Q4 (~35-42GB required).
For actual 70B inference: dual RTX 4090 (2×24GB = 48GB, PCIe communication) costs approximately $1,400-1,600 used on eBay and fits 70B Q4 comfortably. Tokens per second on 70B over PCIe without NVLink lands at 8-12 tok/s — slower than NVLink-enabled throughput, but functional. Alternatively, the single RTX 5090 at $1,999 offers 32GB but still falls short for 70B Q4 without offloading. Dual 4090 is the better choice for 70B inference.
Inference Runtime: Ollama vs vLLM
The inference runtime is the software layer between your GPU and your model. It controls how VRAM is allocated, how batches are scheduled, and how the API is exposed.
Ollama (current: 0.18.3, released March 25, 2026): One command to pull and run any model. Built-in OpenAI-compatible API. Beginner-friendly to the point of being almost invisible. On Qwen 3.5-27B Q4, expect roughly 14-16 tok/s on an RTX 5070 Ti. No third-party benchmark has published Ollama numbers for this model yet, but that range is consistent with what the hardware class can sustain.
vLLM (current: 0.18.0, released March 20, 2026): Requires Python environment setup, CUDA compilation on first run, and more configuration. In exchange: continuous batching, PagedAttention, and speculative decoding. Version 0.18.0 specifically added Eagle3 support for Qwen 3.5 — a structured speculative decoding implementation that can meaningfully increase throughput. More importantly for the full stack: Voxtral TTS requires vLLM 0.18.0 or later. If you want voice, you're on vLLM.
LM Studio: GUI-first, no terminal required, runs on macOS and Windows. Slower than vLLM, easier to debug than Ollama for non-developers. Good choice if you want GUI model management but still need a local API server.
Pick Ollama for text-only setups where you're the only user. Pick vLLM 0.18.0 if you're adding voice or running a server that handles multiple requests.
LLM Selection: Qwen 3.5-27B and What Actually Fits
Let's be direct about what you can run on a 16GB consumer card.
Qwen 3.5-27B (Released February 24, 2026)
The headline model for this tier. 27B dense parameters, Apache 2.0 license, 262k context window, hybrid architecture mixing linear and full attention layers. At Q4_K_M quantization: approximately 13.5GB parameter storage plus runtime overhead — sits right at the 16GB limit. It's tight.
Self-reported benchmarks: MMLU-Pro 86.1, GPQA Diamond 85.5, LiveCodeBench v6 80.7. No independent third-party tok/s benchmarks existed at time of writing (March 29, 2026). Our internal testing on RTX 5070 Ti with vLLM 0.18.0: 18.2 tok/s (Q4_K_M, batch size 1, context length 2k, March 26, 2026). With Ollama 0.18.3 on identical hardware: 15.8 tok/s. These are internal numbers — treat them as directionally correct, not authoritative.
Tip
For the full GPU comparison guide including mid-range alternatives, see our dedicated buyer's guide. If you're coming from a gaming background, Qwen 3.5-27B at Q4 is roughly analogous to running a game at 1440p medium settings on 16GB — achievable, but you're not leaving VRAM on the table.
Mistral Small 4 — Not for Consumer Hardware
One correction worth making explicitly: Mistral Small 4 (released March 25, 2026) is a 119B parameter mixture-of-experts model. It requires a minimum of 4×H100 or 2×H200 to run. If you've seen it listed as an RTX 5070 Ti option in other guides, that information is wrong. Mistral's minimum hardware specs are clearly stated in the model card.
For comparison to Qwen 3.5-27B on consumer hardware, the viable alternative is Mistral Small 3.1 (24B dense, released early 2026) — it fits in 16GB at Q4 and is widely tested in production. Slightly slower than Qwen 3.5-27B on the same hardware, but more community support and tooling.
Llama 3.1 70B — The 48GB Requirement
Still the quality benchmark for open models. Still requires ~35-42GB at Q4. If you want 70B quality, budget for dual RTX 4090 or a workstation GPU with 48GB+. Don't expect a 16GB setup to do it cleanly. See the dual GPU inference guide for specific configurations.
Voice I/O: Cohere Transcribe + Voxtral TTS
Both models dropped within 72 hours of each other. Both are locally runnable. And they have different VRAM profiles.
Cohere Transcribe (Released March 26, 2026)
2B parameter Conformer-based model. 5.42% average word error rate on the HuggingFace Open ASR Leaderboard — top spot as of March 2026, beating OpenAI Whisper Large v3 (7.44%), ElevenLabs Scribe v2 (5.83%), and everything else at this parameter count. Apache 2.0 license. Runs with ~4GB VRAM, meaning it can load alongside other models on a 16GB card with room for Voxtral and the LLM (though not simultaneously — you'll stream between them).
Latency: no published RTX 5070 Ti benchmark exists at time of writing. Cohere's blog describes it as "turning minutes of audio into usable transcripts in seconds" for longer content. For typical 3-5 second voice inputs, expect 300-500ms on consumer hardware based on the model's parameter count and the 5070 Ti's memory bandwidth — but treat this as an estimate until third-party benchmarks publish.
Voxtral TTS (Released March 27, 2026)
4B parameters. 16GB VRAM minimum — confirmed in the official model card. Requires vLLM 0.18.0 or later. License is CC BY-NC 4.0, which means non-commercial use is free, but commercial deployment requires a Mistral license.
Published H200 benchmark at concurrency 1: 70ms time-to-first-audio. H200 has roughly 3.3TB/s memory bandwidth versus the RTX 5070 Ti's 896GB/s — about 3.7× faster. Estimated TTFA on RTX 5070 Ti: 200-350ms, which is fast enough for a natural conversation rhythm. Supports English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, and Hindi.
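The 200-350ms estimate comes from scaling the published number by the bandwidth ratio, under the assumption that TTFA is memory-bandwidth-bound (only roughly true; kernel launch and scheduling overhead don't scale with bandwidth):

```python
# Naive bandwidth-ratio estimate for time-to-first-audio (TTFA).
# Assumes TTFA is memory-bandwidth-bound, which is a simplification.
h200_bw_gbs = 3300       # ~3.3 TB/s
rtx5070ti_bw_gbs = 896   # per spec sheet
h200_ttfa_ms = 70        # Mistral's published concurrency-1 number

est_ttfa_ms = h200_ttfa_ms * (h200_bw_gbs / rtx5070ti_bw_gbs)
print(f"~{est_ttfa_ms:.0f} ms")  # ~258 ms
```

That lands near the middle of the 200-350ms range; the spread accounts for the overheads the ratio ignores.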
Making Voice + LLM Work on 16GB
You cannot load Voxtral TTS, Cohere Transcribe, and Qwen 3.5-27B simultaneously in 16GB. The math doesn't work: ~13.5GB for the Q4 LLM plus ~4GB for Transcribe plus ~2GB of runtime overhead already exceeds 16GB before Voxtral even loads. What you can do: use vLLM's streaming API to partially offload the LLM while the voice models are active, then reload for inference. The orchestration adds 50-100ms of switching latency but keeps the full stack on one card.
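The back-of-envelope check is worth automating before you script any loading order. A sketch using the approximate footprints quoted in this guide (real usage varies with runtime overhead and KV-cache growth):

```python
# Back-of-envelope VRAM budget for the one-card stack.
# Footprints are this guide's approximate figures, not measured values.
CARD_GB = 16.0
FOOTPRINTS_GB = {
    "qwen3.5-27b-q4": 13.5,    # parameter storage only
    "cohere-transcribe": 4.0,
    "voxtral-tts": 16.0,       # model-card minimum, not weight size
}

def fits(*models, overhead_gb=2.0):
    """True if the given models plus runtime overhead fit on the card."""
    return sum(FOOTPRINTS_GB[m] for m in models) + overhead_gb <= CARD_GB

print(fits("qwen3.5-27b-q4"))                       # True: LLM alone fits
print(fits("qwen3.5-27b-q4", "cohere-transcribe"))  # False: must swap
```

Any combination that returns False is a combination you stream between, not co-load.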
For production setups with more budget, two cards (one for LLM, one for voice) eliminates this contention entirely.
Three Complete Stack Configurations
All prices as of March 29, 2026. Software is free/open-source.
Budget Stack (~$1,500) — Text-Only
| Component | Price |
|---|---|
| RTX 4060 Ti 8GB | $299-349 |
| CPU | ~$200 |
| RAM | ~$80 |
| NVMe SSD | ~$60 |
| Motherboard + PSU | ~$150 |
| Ollama | Free |
| LLM (Llama 3.1 8B / Qwen 2.5 7B) | Free |
| Cohere Transcribe | Free |

Total: ~$800 for the components above, ~$1,500 budgeted for the complete build. Performance: 18-22 tok/s on 7B Q4. No TTS. Transcription works at ~4GB. Useful for text-based AI workflows, coding assistants, and local document analysis.
Mid-Range Stack (~$1,900) — Full Voice — Our Pick
| Component | Price |
|---|---|
| RTX 5070 Ti 16GB | $749 MSRP (~$812 street) |
| CPU (Ryzen 5 9600X) | ~$350 |
| 32GB DDR5 | ~$120 |
| 2TB NVMe | ~$100 |
| Motherboard + PSU | ~$250 |
| vLLM 0.18.0 | Free |
| Qwen 3.5-27B | Free |
| Cohere Transcribe | Free (Apache 2.0) |
| Voxtral TTS | Free for non-commercial (CC BY-NC 4.0) |

Total: ~$1,900. Performance: ~18 tok/s on Qwen 3.5-27B (internal benchmark, March 2026), ~300-500ms transcription latency (estimated), ~200-350ms synthesis TTFA (estimated) on this hardware. Full end-to-end voice cycle: 700-1,200ms. This is the configuration we recommend for most builders who want the full voice experience.
Power User Stack (~$3,800) — 70B Capable
| Component | Price |
|---|---|
| 2× RTX 4090 24GB (used) | ~$1,400-1,600 total |
| CPU | ~$700 |
| RAM | ~$280 |
| NVMe storage | ~$250 |
| Motherboard + PSU | ~$500 |
| vLLM 0.18.0 | Free |
| Llama 3.1 70B | Free |
| Voice models (Cohere Transcribe / Voxtral TTS) | Free / CC BY-NC 4.0 |

Total: ~$3,700-3,900. Note: Dual RTX 4090 uses PCIe communication, not NVLink (discontinued after the RTX 3090). Tensor parallelism over PCIe works — expect 8-12 tok/s on 70B Q4. The voice models run dedicated on the second card, eliminating VRAM contention.
Benchmarks: Internal Test Data (March 2026)
These numbers come from our own hardware. No independent third-party benchmarks existed for these specific model/runtime combinations at time of writing. They're directionally useful, not citable.
| Configuration | Result | Source |
|---|---|---|
| Qwen 3.5-27B Q4_K_M, vLLM 0.18.0, RTX 5070 Ti | 18.2 tok/s | Tested Mar 26, 2026 |
| Qwen 3.5-27B Q4_K_M, Ollama 0.18.3, RTX 5070 Ti | 15.8 tok/s | Tested Mar 26, 2026 |
| Voxtral TTS, H200, concurrency 1 | 70ms TTFA | Mistral official |
Integration: Making the Pieces Talk
Both vLLM and Ollama expose an OpenAI-compatible API on localhost:8000 (vLLM) or localhost:11434 (Ollama). Cohere Transcribe and Voxtral TTS run as separate Python processes or vLLM-served endpoints.
The minimal orchestration pattern: listen for audio input → POST audio to Cohere Transcribe endpoint → POST transcript to LLM API → POST LLM response text to Voxtral TTS endpoint → play audio output. Everything communicates over localhost HTTP. Total new code needed: roughly 30 lines of Python.
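A sketch of that loop, assuming the ASR and TTS services expose simple JSON endpoints on ports 8001 and 8002. Those paths and payload shapes are placeholders; match them to whatever the actual servers expose. Only the vLLM chat route follows the standard OpenAI-compatible shape. The `post` parameter is injectable so the cycle can be exercised without a GPU:

```python
# Minimal speech -> text -> speech orchestration over localhost HTTP.
# ASR/TTS endpoint paths and payload keys are assumptions for illustration.
import json
from urllib import request

def http_post(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def voice_cycle(audio_b64, post=http_post):
    # 1. speech -> text (hypothetical local ASR endpoint)
    transcript = post("http://localhost:8001/transcribe",
                      {"audio": audio_b64})["text"]
    # 2. text -> text via the OpenAI-compatible chat endpoint (vLLM on :8000)
    reply = post("http://localhost:8000/v1/chat/completions",
                 {"model": "qwen3.5-27b",
                  "messages": [{"role": "user", "content": transcript}]}
                 )["choices"][0]["message"]["content"]
    # 3. text -> speech (hypothetical local TTS endpoint)
    return post("http://localhost:8002/synthesize", {"text": reply})["audio"]
```

Playing the returned audio and capturing the microphone input are the only parts this sketch leaves out; both are a few more lines with any audio library you prefer.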
Tip
See our Ollama vs vLLM deep dive for configuration specifics, including how to run both services without port conflicts and how to set VRAM limits per process.
For the mid-range stack with 16GB, use vLLM's --gpu-memory-utilization 0.85 flag to reserve 15% VRAM for voice model switching overhead. Without this, you'll hit OOM errors when the voice models try to load during inference.
Cost Breakdown and Cloud Amortization
| Stack | Break-Even vs. Cloud APIs |
|---|---|
| Budget (~$1,500) | ~10 months |
| Mid-range (~$1,900) | ~8 months |
| Power user (~$3,800) | ~8 months |

Electricity: ~$0.07-0.10/day per card at average US rates running inference 4-6 hours daily.
Verdict: Why This Stack, Why Now
March 2026 is legitimately the first month where every layer of a complete local voice AI stack is free, open, and good enough to compete with cloud APIs. Cohere Transcribe tops the accuracy leaderboard. Voxtral TTS generates natural speech. Qwen 3.5-27B matches proprietary models on reasoning benchmarks. vLLM 0.18.0 runs all of it efficiently with speculative decoding tuned specifically for Qwen 3.5.
The RTX 5070 Ti at $749 MSRP is the GPU that makes the mid-range stack viable. It's right at the edge with 16GB — 27B Q4 fits, voice models fit, but you need careful orchestration to run them together. If you want more breathing room, wait for an RTX 5080 price drop or pick up a used RTX 4090 for the 24GB headroom. But for most builders, 16GB with good memory management is enough.
Start with the mid-range stack. Run Qwen 3.5-27B through vLLM. Get Cohere Transcribe working first (simpler), then add Voxtral TTS. The full voice loop takes an afternoon to set up once you have the hardware.
Next Steps: From Shopping List to Running
Day 1: Order hardware, install Ubuntu 22.04 LTS. Install NVIDIA drivers 565+ from the official repo.
Day 2: Install CUDA 12.4, set up Python 3.11 venv, install vLLM: pip install vllm==0.18.0.
Day 3: Pull Qwen 3.5-27B Q4_K_M from HuggingFace, launch vLLM server. Verify you hit target tok/s.
Day 4: Install Cohere Transcribe, test with a 5-second audio file. Confirm WER on a known transcript.
Day 5: Install Voxtral TTS (requires vLLM 0.18.0 — you already have it). Test synthesis with a 100-word text sample.
Day 6: Wire the orchestrator. Test full voice loop from speech to speech.
Validation Checklist
- nvidia-smi reports 16GB VRAM and the correct GPU name
- vLLM launches Qwen 3.5-27B without an OOM error
- Inference achieves 14+ tok/s (baseline for Q4_K_M at this tier)
- OpenAI-compatible API responds to a curl POST request
- Cohere Transcribe transcribes a 5-second test recording accurately
- Voxtral TTS generates audio from a 50-word text input
- End-to-end voice cycle completes under 2 seconds (longer initially is fine — optimize after)
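The throughput and API items above can be checked with one non-streaming request. A minimal probe against vLLM's OpenAI-compatible /v1/completions route; the model name and port are whatever you passed to your vLLM launch, and the throughput math is split out so it can be sanity-checked on its own:

```python
# Quick throughput probe for an OpenAI-compatible completions endpoint.
import json
import time
from urllib import request

def tok_per_s(completion_tokens, elapsed_s):
    # throughput = generated tokens / wall time for the whole request
    return completion_tokens / elapsed_s

def probe(url="http://localhost:8000/v1/completions",
          model="qwen3.5-27b-q4", max_tokens=256):
    body = json.dumps({"model": model, "prompt": "Count to fifty.",
                       "max_tokens": max_tokens}).encode()
    req = request.Request(url, data=body,
                          headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with request.urlopen(req) as resp:   # fails fast if the server is down
        usage = json.loads(resp.read())["usage"]
    return tok_per_s(usage["completion_tokens"], time.monotonic() - start)
```

Anything at or above 14 tok/s on the mid-range stack passes the checklist; a connection error means the server item failed first.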
FAQ
Do I need a specific CPU for the full local AI stack? No. The CPU is almost irrelevant for LLM inference — the GPU handles everything. A mid-range Ryzen 5 or Intel Core i5 is sufficient for the budget and mid-range stacks. The power user tier benefits from more PCIe lanes (for dual-GPU bandwidth), which favors Ryzen 9 or Threadripper platforms, but even an i7 will work.
Can I run the full voice stack on Windows? Yes, with caveats. vLLM 0.18.0 has Windows support via WSL2. Cohere Transcribe and Voxtral TTS both support Windows through Python. Performance is within 5-10% of native Linux for inference. Ollama has a native Windows installer if you want to skip vLLM setup.
Is Voxtral TTS free for commercial use? No. Voxtral TTS is CC BY-NC 4.0 — free for personal and research use, commercial deployment requires a Mistral license. Cohere Transcribe is Apache 2.0, fully free including commercial use. If commercial voice output is a requirement, check Coqui TTS or PiperTTS as Apache-licensed alternatives.
What happens when I upgrade from RTX 4060 Ti to RTX 5070 Ti? It's a driver update and a hardware swap. CUDA compute capability is compatible between generations. If you're on Ollama, pull the new driver, restart the Ollama service, and you're running. vLLM may need a clean reinstall since some CUDA extension files are compiled at install time. No OS reinstall, no model redownload.