Three voice AI releases in one week. All three marketing pages use the words "fast," "local," and "private" — sometimes for models that are cloud-only. One of the "local" options needs a 16 GB GPU to run officially. The benchmark numbers circulating on X are fabricated because the models are literally three days old.
**TL;DR: Voxtral TTS is the strongest local option for NVIDIA builders with a 16 GB GPU — real software, commercial API path, published benchmarks. Covo-Audio is the only open-weight model with full-duplex barge-in support, making it worth the rougher setup experience. Gemini 3.1 Flash Live wins on zero-hardware-cost deployments, but all your audio goes to Google. Don't believe the "3 GB VRAM" claims — real local deployment requires more headroom.**
---
## Quick Specs Comparison
| Model | Type | Params | Latency | License | Deployment |
|---|---|---|---|---|---|
| Gemini 3.1 Flash Live | Cloud audio-to-audio (A2A) | Undisclosed | "Sub-second" (unspecified) | Proprietary | Google API |
| Voxtral TTS | Local text-to-speech | 4B | 70 ms TTFA (H200, single request) | CC BY-NC 4.0 | Self-hosted (vllm-omni) or Mistral API |
| Covo-Audio | Local audio LLM (A2A) | ~8B (7B backbone) | Unpublished | Custom Tencent research license | Self-hosted only |
> [!NOTE]
> The Q4_0 GGUF for Voxtral TTS has a 2.67 GB weight file. Actual VRAM usage with KV cache and activations is higher — estimate 3–5 GB based on standard [quantization](/glossary/quantization) overhead for a 4B model — but no hardware-specific benchmark on consumer cards was published as of March 29, 2026.
---
## Gemini 3.1 Flash Live — Cloud-First, Zero Local Hardware
Gemini 3.1 Flash Live launched March 26, 2026. It's built specifically for audio-to-audio (A2A) real-time dialogue — not a text model with a voice wrapper bolted on. The protocol is a stateful WebSocket connection: raw PCM audio in at 16 kHz, 24 kHz PCM out. Zero GPU required on your end. Runs on a Raspberry Pi, a Lambda function, a MacBook Air — anything with an internet connection.
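The raw PCM rates above translate directly into bandwidth and chunk-size budgets for your client. A quick sketch — assuming 16-bit mono samples (typical for speech APIs, though the bit depth isn't stated above) and a 20 ms chunk size chosen purely for illustration:

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Raw PCM bandwidth: samples/sec * bytes/sample * channels."""
    return sample_rate_hz * (bits_per_sample // 8) * channels

def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int,
                     bits_per_sample: int = 16) -> int:
    """Bytes in one streaming chunk of the given duration."""
    return pcm_bytes_per_second(sample_rate_hz, bits_per_sample) * chunk_ms // 1000

uplink = pcm_bytes_per_second(16_000)    # 32,000 B/s of audio to Google
downlink = pcm_bytes_per_second(24_000)  # 48,000 B/s of audio back
frame = chunk_size_bytes(16_000, 20)     # 640 B per 20 ms uplink chunk
```

Uncompressed PCM at these rates is trivial for any broadband link; the latency question is server-side, not bandwidth.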
The tradeoff is obvious: all your audio goes to Google's infrastructure. Every word of every conversation. No local mode, no self-hosted option.
What it does that neither local model matches yet: support for 70 languages, barge-in (the user interrupts mid-sentence and the model responds naturally), affective dialogue (it responds to tone, not just words), and built-in Google Search tool use. These aren't minor features for production voice assistants.
**Pricing** (paid tier, as of March 2026): $0.005/minute audio input, $0.018/minute audio output. A free tier exists with no published rate limit at launch. Vertex AI pricing for enterprise tiers may differ — check the Google Cloud pricing page directly.
### Real Latency: What Google Actually Says
Google describes Gemini 3.1 Flash Live as "sub-second" and "lower latency than Gemini 2.5 Flash Native Audio." No specific millisecond figure appears in the official model card or API docs as of this writing.
> [!WARNING]
> No third-party end-to-end latency benchmark for Gemini 3.1 Flash Live exists yet. "Sub-second" is Google's own claim — network conditions, server region, and request size all affect production numbers. Don't publish latency figures you can't source.
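If you need a latency number you can publish, measure time-to-first-audio yourself at the client. A minimal harness that works with any iterator of audio chunks — the chunk source is a placeholder for whatever your client library actually exposes:

```python
import time
from typing import Iterable, Iterator, Tuple

def time_to_first_audio(chunks: Iterable[bytes]) -> Tuple[float, Iterator[bytes]]:
    """Seconds until the first audio chunk arrives, plus an iterator
    that replays every chunk (the first one included)."""
    start = time.monotonic()
    it = iter(chunks)
    first = next(it)  # blocks until the server sends audio
    ttfa = time.monotonic() - start

    def replay() -> Iterator[bytes]:
        yield first
        yield from it

    return ttfa, replay()
```

Run it from the same region and network you'll deploy in — a TTFA measured from a laptop on hotel Wi-Fi tells you nothing about production.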
---
## Voxtral TTS — The Local Option With a Catch
Voxtral TTS (model ID: `mistralai/Voxtral-4B-TTS-2603`) dropped March 26, 2026. The arXiv paper is three days old. It's a 4B-parameter text-to-speech model — text in, audio out. This is not the same as Voxtral Mini, which is a speech-to-text transcription model released in July 2025. They're used together in a pipeline; they're not interchangeable.
The official minimum [VRAM](/glossary/vram) for Voxtral TTS is **16 GB for BF16 inference**. That's an RTX 4080, RTX 3090 24 GB, or an M3 Pro 18 GB Mac at minimum. The "3 GB VRAM" figure circulating on social media is based on confusion with Voxtral Mini or with the community GGUF weight file size — neither reflects actual runtime VRAM.
A community Q4_0 GGUF quantization exists at `TrevorJS/voxtral-tts-q4-gguf` on HuggingFace, with a 2.67 GB weight file. Based on standard 4B Q4 inference math, total VRAM with runtime overhead is probably 3–5 GB — which would fit an 8 GB RTX 4060. But no one has confirmed this with a published benchmark yet.
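That 3–5 GB band comes from simple additive accounting: weight file plus KV cache, activations, and runtime context. A sketch, where the overhead constants are assumptions rather than measurements:

```python
def q4_vram_band_gb(weight_file_gb: float) -> tuple:
    """Rough runtime VRAM band for a Q4 GGUF model: weights plus KV cache,
    activations, and CUDA/runtime context. The overhead constants below
    are rule-of-thumb assumptions, not measured values."""
    low = weight_file_gb + 0.5   # short context, small batch
    high = weight_file_gb + 2.5  # long context, fp16 KV cache
    return round(low, 2), round(high, 2)

q4_vram_band_gb(2.67)  # (3.17, 5.17) — the 3–5 GB estimate above
```

Context length is the swing factor: the KV cache grows linearly with it, so a long-context serving config lands near the top of the band.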
**What the official H200 benchmarks show** (Mistral's own arXiv figures):
- Single concurrent request: 70 ms time-to-first-audio, 9.7× real-time processing
- 32 concurrent requests: 552 ms TTFA, 1,430 characters/second/GPU throughput
These are H200 datacenter figures. Consumer card performance will be meaningfully slower — we just don't have consumer benchmarks yet for a model this new.
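To put the 1,430 chars/s figure in conversational terms, divide by an assumed speaking rate — roughly 150 wpm of English, about 15 characters per second (our assumption, not Mistral's) — to get a rough ceiling on simultaneous real-time streams per GPU:

```python
def max_realtime_streams(chars_per_sec_gpu: float,
                         speech_chars_per_sec: float = 15.0) -> int:
    """How many simultaneous real-time TTS streams one GPU can sustain,
    given its character throughput and an assumed speaking rate
    (~150 wpm English ≈ 15 chars/sec)."""
    return int(chars_per_sec_gpu / speech_chars_per_sec)

max_realtime_streams(1430)  # ≈ 95 streams on one H200, by this rough model
```

A consumer card will land well below that ceiling, but the same arithmetic applies once community throughput numbers exist.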
### The Ollama Problem
If you're planning to wire Voxtral TTS into an Ollama pipeline: you can't, yet. Voxtral TTS requires **vllm-omni >= 0.18.0**, a separate package from standard vLLM. An open GitHub issue ([ollama/ollama#11432](https://github.com/ollama/ollama/issues/11432)) confirms Ollama support hasn't shipped as of March 2026.
Setup path that actually works: install vllm-omni, load the Voxtral TTS model, serve via the vllm-omni audio API. Expect 20–30 minutes for initial setup if you're familiar with vLLM. For a full local voice stack guide once you've picked your model, see [local LLM voice integration setup](/guides/local-llm-voice-integration-guide/).
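As a sketch of that setup path — only the package name and version floor come from the docs above; the serve invocation mirrors standard vLLM conventions and the actual vllm-omni entry point and flags may differ:

```shell
# Documented requirement: vllm-omni, not standard vLLM
pip install "vllm-omni>=0.18.0"

# Hypothetical serve command in the standard vLLM style —
# check the vllm-omni docs for the real audio-API entry point and flags
vllm serve mistralai/Voxtral-4B-TTS-2603 --dtype bfloat16 --port 8000
```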
### VRAM by Configuration
| Configuration | VRAM | Confirmed? |
|---|---|---|
| BF16 (official) | 16 GB minimum | Yes — Mistral's stated requirement |
| Q4_0 GGUF (community) | est. 3–5 GB | No published result |
**Licensing note:** Voxtral TTS is CC BY-NC 4.0 — non-commercial self-hosting is fine. Building a commercial product requires a license from Mistral. The Apache 2.0 license on Voxtral Mini does NOT apply here.
---
## Covo-Audio — Full-Duplex Voice in One Model
Covo-Audio (Tencent, [GitHub](https://github.com/Tencent/Covo-Audio)) is architecturally different from both competitors. It's not a TTS model and not a transcription model — it handles voice understanding AND synthesis in one pass. Audio in, audio out, no separate ASR-then-LLM-then-TTS pipeline to manage. The backbone is Qwen2.5-7B-Base with a Whisper-large-v3 audio encoder and a flow-matching speech decoder.
Three variants were released:
- **Covo-Audio** — base pretrained model
- **Covo-Audio-Chat** — optimized for half-duplex conversation (open-source weights confirmed)
- **Covo-Audio-Chat-FD** — full-duplex with barge-in and turn-taking (availability status unclear at time of writing)
The full-duplex variant is the interesting one. Barge-in support — where the user interrupts mid-sentence and the model responds naturally rather than playing out its current utterance — requires architectural support, not just clever prompting. Covo-Audio-Chat-FD handles it natively.
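For contrast, here's what bolting barge-in onto a half-duplex pipeline looks like: a toy controller that cancels playback when voice activity arrives mid-utterance. This is illustrative glue code, not Covo's API — the point is that the FD variant moves this logic inside the model instead of around it:

```python
from enum import Enum, auto

class TurnState(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class BargeInController:
    """Toy half-duplex turn-taking controller. When voice activity is
    detected while the assistant is speaking, cut playback and flip
    back to listening."""
    def __init__(self):
        self.state = TurnState.LISTENING
        self.cancelled_utterances = 0

    def start_reply(self):
        self.state = TurnState.SPEAKING

    def on_voice_activity(self):
        if self.state is TurnState.SPEAKING:
            self.cancelled_utterances += 1  # stop playback mid-utterance
            self.state = TurnState.LISTENING
```

The hard part this sketch hides is everything downstream: flushing queued audio, discarding the half-generated LLM response, and re-prompting with the interruption in context — exactly the state a full-duplex model manages natively.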
> [!TIP]
> If natural barge-in is a requirement for your voice assistant pipeline, Covo-Audio-Chat-FD is the only open-weight model in this comparison that handles it natively. Voxtral TTS is synthesis-only and needs a separate transcription step. Gemini Flash Live does barge-in, but cloud-only. Mac builders: Qwen2.5 backbone has strong Metal Performance Shaders support via llama.cpp — see the [Mac M3/M4 local AI voice setup guide](/guides/mac-m3-m4-local-ai-voice-setup/) for the full workflow.
### VRAM Requirements
Covo-Audio's technical report (arXiv:2602.09823, submitted February 2026, revised March 16) doesn't include hardware-specific VRAM measurements. Based on the ~8B parameter count at BF16, **expect ~16 GB minimum** — in the same class as Voxtral TTS. Comfortable inference means a 24 GB card (RTX 3090, RTX 4090) or an Apple Silicon Mac with 24 GB+ unified memory.
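That ~16 GB estimate is just the BF16 rule of thumb — two bytes per parameter, weights only, before KV cache and activations:

```python
def bf16_weights_gb(params_billion: float) -> float:
    """BF16 stores 2 bytes per parameter. Weights only —
    KV cache and activations come on top (using 1 GB ≈ 1e9 bytes)."""
    return params_billion * 2

bf16_weights_gb(8)  # 16 GB of weights for an ~8B model
```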
No GGUF or GPTQ quantization existed on HuggingFace at time of writing. When the community ports one, the hardware floor will drop.
**Pricing:** Research release only. No managed API, no subscription tier, no hosted service. Custom Tencent research license — read it before building anything production-critical.
### Covo vs Voxtral: The Real Difference
On pure TTS throughput, Voxtral TTS has a published benchmark and a commercial API path. Covo-Audio has neither yet.
Where Covo-Audio wins architecturally: it doesn't need a separate transcription model to handle voice input. One model, one inference pass, audio in and audio out. For a [local LLM hardware build](/articles/100-local-llm-hardware-upgrade-ladder/) already running at capacity on a 24 GB card, replacing two model loads with one is meaningful.
---
## Performance Benchmarks — What We Can Actually Verify
The honest answer: most consumer-hardware benchmark numbers for these models don't exist yet. All three models are fewer than two weeks old. Here's what Tier-1 sources actually confirm as of March 29, 2026:
| Model | Verified benchmark (as of March 29, 2026) |
|---|---|
| Gemini 3.1 Flash Live | N/A — "sub-second" is Google's own unquantified claim |
| Voxtral TTS | 70 ms TTFA (single request); 552 ms TTFA, 1,430 chars/s/GPU (32 concurrent) — H200 only |
| Covo-Audio | N/A — technical report omits hardware-specific figures |
Voxtral TTS's benchmarks are from Mistral's own H200 datacenter hardware — an H200 is not a consumer GPU. For GPU-specific voice inference comparisons across consumer cards, bookmark [RTX 4060 vs 4070 voice AI benchmarks](/comparisons/rtx-4060-4070-voice-ai-benchmarks/) — we'll update that page as community results come in.
Anyone publishing "198ms on an RTX 4060 Ti" for either of these models right now is making those numbers up. The hardware benchmarks don't exist yet.
---
## Local vs Cloud: What It Actually Costs
**Gemini 3.1 Flash Live:** $0.005/min audio input + $0.018/min audio output. Billing the full session duration in both directions (a worst-case simplification), a 10-minute daily voice session costs roughly $0.23/day, or about $7/month. At production scale — 100 conversations/day at 5 minutes average — the same worst-case math puts audio output alone at $9/day, roughly $270/month, before any input.
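The same worst-case arithmetic as a reusable function — rates are the March 2026 figures quoted above, and the full-duration-both-directions billing assumption is ours:

```python
def gemini_monthly_cost(sessions_per_day: float, minutes_per_session: float,
                        input_rate: float = 0.005, output_rate: float = 0.018,
                        days: int = 30) -> float:
    """Worst-case monthly cost in USD: the full session duration is
    billed in both directions."""
    minutes = sessions_per_day * minutes_per_session * days
    return round(minutes * (input_rate + output_rate), 2)

gemini_monthly_cost(1, 10)   # ≈ $6.90/month for one daily 10-minute session
gemini_monthly_cost(100, 5)  # ≈ $345/month at 100 five-minute sessions/day
```

Real bills will be lower, since input and output minutes don't fully overlap — treat these as a ceiling for budgeting.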
**Voxtral TTS and Covo-Audio (self-hosted):** GPU cost is the upfront hit, then electricity.
- RTX 4080 16 GB: ~$750 new, ~$550 used (as of March 2026) — handles BF16 Voxtral TTS at the official minimum
- RTX 3090 24 GB: ~$650 used — comfortable for both models with headroom
- RTX 4060 8 GB: works if Q4_0 GGUF estimates hold up — hardware test pending
- Electricity: a 200–320W GPU running 4 hours/day at $0.12/kWh adds $3–6/month
At 100 voice sessions/day, the hardware pays for itself against Gemini's managed pricing within a few months. At 10 sessions/day, cloud is cheaper for the first couple of years.
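The break-even arithmetic, parameterized. The electricity default matches the $3–6/month band above; the GPU and cloud-bill inputs are illustrative:

```python
def breakeven_months(gpu_cost: float, cloud_monthly: float,
                     electricity_monthly: float = 4.0) -> float:
    """Months until a one-time GPU purchase beats a recurring cloud bill,
    net of the electricity the GPU itself consumes."""
    savings = cloud_monthly - electricity_monthly
    return round(gpu_cost / savings, 1)

breakeven_months(650, 270)   # ≈ 2.4 months at heavy usage
breakeven_months(650, 34.5)  # ≈ 21 months at ~10 five-minute sessions/day
```

This ignores your setup time and any resale value on the card — both of which shift the answer, in opposite directions.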
The Mistral managed API ($0.001/min) is cheaper per minute than Gemini for audio input — but covers TTS only. You'd still need a separate transcription solution for a full voice loop, which changes the math.
---
## Privacy & Compliance
**Voxtral TTS and Covo-Audio (self-hosted):** Zero audio leaves your machine. Full stop. HIPAA compliance means your infrastructure practices are the compliance boundary — no vendor BAA required because there's no vendor in the data path.
**Gemini 3.1 Flash Live:** All audio is processed on Google's infrastructure. HIPAA compliance requires a signed Business Associate Agreement (BAA) executed through Google Workspace Enterprise on Vertex AI. Standard AI API access (ai.google.dev) is not HIPAA-covered — this isn't ambiguous. GDPR and CCPA compliance requires reviewing Google's data processing agreements for your jurisdiction.
If your use case involves medical, legal, financial, or simply private audio, local inference isn't a preference — it's the only path.
---
## Use Case Matrix
| If you need | Pick | Why |
|---|---|---|
| Local TTS with a commercial path | Voxtral TTS | Zero cloud contact; commercial API available |
| Production today, no GPU budget | Gemini 3.1 Flash Live | Zero hardware, 70 languages, tool use |
| HIPAA / sensitive audio | Voxtral TTS or Covo-Audio (self-hosted) | Self-hosted = you own the compliance boundary |
| Apple Silicon Mac build | Covo-Audio | Qwen2.5 backbone, strong MPS support |
---
## The Verdict
Voxtral TTS is the most production-ready local voice option right now. It has an official release, a commercial API path, a published latency benchmark (H200), and a clear setup path via vllm-omni. The hardware floor is higher than the hype — 16 GB VRAM officially, with community quantization as an untested path to 8–12 GB cards — but it's real software you can actually ship.
Covo-Audio wins on architecture if you need an end-to-end voice model without a cascade pipeline. The full-duplex variant handles barge-in natively, which no other open-weight model does cleanly. But it's a research release: no managed hosting, custom license, hardware benchmarks missing. Only deploy it where you own the full stack.
Gemini 3.1 Flash Live is the practical choice when you don't have a GPU budget, privacy isn't a hard constraint, and you need something working today. The free tier is real. Latency will be fast enough for most conversational applications. But the moment you're in a HIPAA environment or handling sensitive audio, it stops being an option.
Pick local if you care about where your audio goes. Pick Gemini if you care about shipping fast without a GPU invoice.
---
## FAQ
**Does Voxtral TTS run on an RTX 4060?**
Not officially — Mistral's stated minimum is 16 GB VRAM. A community Q4_0 GGUF with a 2.67 GB weight file exists on HuggingFace, and based on standard inference overhead for a 4B Q4 model, you'd estimate 3–5 GB total VRAM. That's within an 8 GB card's range in theory. No published benchmark confirms this on consumer hardware as of March 29, 2026 — the model is three days old. If you run it on a 4060, share the numbers.
**Is Gemini 3.1 Flash Live free to use?**
A free tier exists at launch in March 2026. Paid usage is $0.005/minute for audio input and $0.018/minute for audio output. It's cloud-only — no local inference exists regardless of what hardware you have.
**What's the difference between Voxtral TTS and Voxtral Mini?**
Completely different models. Voxtral TTS (Voxtral-4B-TTS-2603, March 2026) takes text and outputs audio — text-to-speech. Voxtral Mini (3B, July 2025) takes audio and outputs a transcript — speech-to-text. You'd use them in sequence for a full voice pipeline. Voxtral Mini is Apache 2.0; Voxtral TTS is CC BY-NC 4.0. Don't confuse the VRAM requirements — Voxtral Mini is ~9.5 GB BF16, Voxtral TTS is 16 GB minimum.
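The cascade that answer describes, as a skeleton. All three stage functions are placeholders standing in for real model calls — none of them are actual Mistral APIs:

```python
def transcribe(audio: bytes) -> str:
    return "<transcript>"          # stand-in for Voxtral Mini (speech -> text)

def chat(prompt: str) -> str:
    return f"reply to {prompt}"    # stand-in for any text LLM in the middle

def synthesize(text: str) -> bytes:
    return text.encode()           # stand-in for Voxtral TTS (text -> speech)

def voice_turn(audio_in: bytes) -> bytes:
    """One conversational turn through the three-stage cascade:
    STT -> LLM -> TTS, each a separate model load."""
    return synthesize(chat(transcribe(audio_in)))
```

Each arrow in that cascade adds its own latency and VRAM footprint — which is exactly the overhead a single-pass model like Covo-Audio avoids.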
**Can I use Covo-Audio commercially?**
Not yet, officially. It's a research release under a custom Tencent license. No commercial API, no managed hosting, no subscription tier exists as of March 2026. Review the LICENSE file in the GitHub repo before building anything production-critical — this isn't Apache 2.0.
**Is Gemini 3.1 Flash Live HIPAA compliant?**
Only under specific conditions. You need a signed Business Associate Agreement (BAA) executed through Google Workspace Enterprise on Vertex AI. Standard API access through ai.google.dev is not HIPAA-covered. Compliance isn't automatic — it only applies to managed enterprise accounts with a BAA in place, and some features may be restricted for BAA customers.
Gemini 3.1 Flash Live vs Voxtral TTS vs Covo-Audio: Which Voice Stack Runs Locally?
By Chloe Smith • 9 min read