CraftRigs
Architecture Guide

Voxtral TTS on 3GB VRAM: Local Voice Cloning on Any Modern GPU

By Charlotte Stewart · 10 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Your ElevenLabs bill is optional now.

Mistral released Voxtral TTS on March 26, 2026 — a 4-billion-parameter voice cloning model that runs on 3GB of VRAM and clones any voice from 3 seconds of audio. In blind testing, 62.8% of listeners preferred it over ElevenLabs Flash v2.5. The model weights are free to download from Hugging Face and run on any NVIDIA GPU from the last five years.

TL;DR: Voxtral TTS outperforms ElevenLabs Flash v2.5 in blind listening tests and needs only 3GB of VRAM — meaning your RTX 4060, 3070, or even older GPU can run it today. For personal use and non-commercial projects, setup takes about 20 minutes. The catch: CC BY-NC 4.0 license means you'll need a paid Mistral agreement for anything revenue-generating.


What Is Voxtral TTS (and Why It Matters)

Voxtral TTS is Mistral AI's first open-weight text-to-speech model. It's a 4B-parameter streaming speech model trained for low-latency, multilingual voice generation — the same category of product as ElevenLabs Flash, but downloadable and self-hostable.

The headline numbers: 90ms time-to-first-audio on a single H200 GPU, generation at six times real-time speed, and voice cloning from as little as 3 seconds of reference audio. Mistral made the weights available on Hugging Face the same day it launched the API, which is not something you see often from a company with a commercial product in the same space.

It covers nine languages at launch: English, French, Spanish, German, Italian, Portuguese, Dutch, Polish, and Japanese. Not as broad as ElevenLabs' 32-language support, but enough for most production use cases.

How It's Different From Previous Local TTS

If you've tried local TTS before and written it off, that's a fair reaction. Tortoise TTS needed 10–30 seconds per sentence and sounded like a text reader from 2008. Bark was better but still noticeably synthetic, and the inference times made it impractical for anything interactive.

Voxtral is trained on a different scale of real speech data and tuned specifically for naturalness. On the SEED-TTS benchmark, it hits a 1.23% word error rate — ElevenLabs v3 scores 1.26%. That's not a dramatic gap, but it means Voxtral is genuinely in the same quality tier as the best commercial product, not "good for local."

Note

Voxtral is licensed CC BY-NC 4.0. Personal use, research, and non-commercial content creation are free. Anything that generates revenue requires a commercial license from Mistral. Factor this into your stack decision before deploying to production.


GPU Requirements & Cost Breakdown

The 3GB VRAM figure Mistral cites is for quantized inference — the model loaded at reduced precision so it fits on consumer hardware. Full-precision inference needs a 16GB+ card. For almost everyone reading this, quantized is what you'll run.

Voxtral use case and minimum GPU:

  • TTS inference only: RTX 4060 8GB (~$299 new)
  • TTS inference only (budget, used market): RTX 3070 Ti 8GB (~$230–280)
  • TTS + 7B LLM simultaneously: RTX 4070 12GB (~$700 new, ~$525 used)
  • TTS + 7B LLM + headroom: RTX 4070 Super 12GB (~$480 used)
  • Multi-voice or 70B LLM + TTS: RTX 5080 16GB (~$999 MSRP)

Prices as of March 2026. Verify current prices on Newegg or Best Buy before purchasing.

Voxtral TTS Alone vs. Voxtral + 7B LLM

Running Voxtral TTS in isolation is easy — 3GB VRAM, done. Running it alongside a language model on the same GPU is where you need to do the math.

Minimum GPU:

  • Voxtral TTS alone: RTX 4060 8GB
  • Voxtral TTS + 7B LLM: RTX 4070 12GB
  • Recommended for the combined stack: RTX 4070 12GB

The RTX 4060's 8GB of VRAM is not enough to run both Voxtral TTS and a 7B model simultaneously — you'll either get an out-of-memory error or have to time-share between processes, which breaks interactive latency. The RTX 4070's 12GB handles the combined stack with ~3GB to spare.
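To sanity-check whether your own card can hold the combined stack, a rough budget helps. Only the 3GB Voxtral figure comes from Mistral; the 7B-at-Q4 weight size and the runtime overhead below are round-number assumptions, not measurements:

```python
# Back-of-envelope VRAM budget for the combined stack, in GB.
VOXTRAL_TTS = 3.0   # quantized Voxtral TTS (Mistral's figure)
LLM_7B_Q4 = 4.5     # 7B model at 4-bit quantization (assumed)
OVERHEAD = 1.0      # KV cache, CUDA context, display output (assumed)

total = VOXTRAL_TTS + LLM_7B_Q4 + OVERHEAD
print(f"Combined stack: ~{total:.1f} GB")
print("Fits on an 8GB card:", total <= 8.0)
print("Fits on a 12GB card:", total <= 12.0)
```

With these assumptions the total lands around 8.5GB, which is why 8GB cards fall just short and 12GB cards work with headroom.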

Warning

The RTX 4070 has 12GB GDDR6X — not 10GB. Multiple sources and retailer listings have the spec wrong. Verify before purchasing by checking the manufacturer's spec sheet or NVIDIA's product page.


Benchmark: Voxtral TTS vs. ElevenLabs

Mistral ran blind human preference evaluations comparing Voxtral TTS against ElevenLabs Flash v2.5. The result: 62.8% of listeners preferred Voxtral in overall evaluations, and 68.4% preferred it specifically in multilingual voice cloning tests.

For raw accuracy, the SEED-TTS benchmark shows Voxtral at 1.23% word error rate versus ElevenLabs v3 at 1.26%. Speaker similarity score is 0.628 for Voxtral. These aren't dramatic wins — they're parity numbers. The real difference is what you pay for that parity.

Voxtral TTS vs. ElevenLabs Flash v2.5:

  • Blind listener preference: Voxtral 62.8% vs. ElevenLabs 37.2%
  • Word error rate (SEED-TTS): Voxtral 1.23% vs. ElevenLabs 1.26% (v3)
  • Time-to-first-audio: ~90ms (H200) vs. ~90ms (API)
  • Voice cloning sample: 3 seconds vs. ~1 minute (pro cloning)
  • Price: $0 local (non-commercial) vs. $22/month (Creator, 100K chars)
  • Languages: 9 vs. 32

The cost comparison deserves a closer look. ElevenLabs' Creator tier ($22/month) gives you 100,000 characters. At 5 characters per average word, that's 20,000 words — about 2–3 hours of audio. If you're narrating a YouTube video weekly, you'll burn through that fast. Voxtral at $0 is not "pretty good for free." It's better than the paid alternative on most metrics.
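Here's that character math in one place. The 5 characters per word and ~150 spoken words per minute are round-number assumptions, not figures from either vendor:

```python
# Rough cost math for the ElevenLabs Creator tier.
CHARS_PER_MONTH = 100_000   # Creator tier quota
CHARS_PER_WORD = 5          # assumption
WORDS_PER_MINUTE = 150      # typical narration pace (assumption)
TIER_PRICE = 22.0           # USD per month

words = CHARS_PER_MONTH / CHARS_PER_WORD   # 20,000 words
hours = words / WORDS_PER_MINUTE / 60      # ~2.2 hours of audio
print(f"{words:.0f} words, ~{hours:.1f} h of audio, ~${TIER_PRICE / hours:.2f}/hour")
```

At roughly $10 per finished hour of audio, a weekly narration habit eats the quota quickly; self-hosted Voxtral amortizes to electricity.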

When ElevenLabs Still Wins

ElevenLabs has a real moat in three areas. First, language support — 32 languages versus Voxtral's nine means ElevenLabs is the only option if your audience isn't in one of Voxtral's supported locales. Second, streaming output — ElevenLabs can begin sending audio before the full text is processed; Voxtral runs full inference first. For interactive conversational agents where you want instant-feeling responses, that matters. Third, the voice library — ElevenLabs has thousands of pre-built voices. Voxtral requires a voice sample for every voice you want to use.


Installation & Setup (~20 Minutes)

Ollama does not support Voxtral TTS as of March 2026. Use either vLLM-Omni (faster, production-grade) or the Hugging Face transformers library (simpler for local testing).

vLLM-Omni is the officially supported inference backend for Voxtral TTS. It's faster and handles streaming better than transformers.

Step 1. Install vLLM 0.18.0 or later:

pip install "vllm>=0.18.0"

Step 2. Install vLLM-Omni:

pip install git+https://github.com/vllm-project/vllm-omni.git --upgrade

Step 3. Confirm your CUDA version is 11.8 or higher:

nvcc --version

Step 4. Download the model from Hugging Face. You'll need a free HF account:

huggingface-cli download mistralai/Voxtral-4B-TTS-2603

The download is roughly 8GB. It loads into ~3GB of VRAM during inference.

Step 5. Clone a voice sample. You need 3–30 seconds of clean audio, mono, 16kHz WAV:

ffmpeg -i your_audio.mp3 -ar 16000 -ac 1 voice_sample.wav
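Before feeding the sample to the model, it's worth verifying the conversion actually produced mono 16kHz audio of usable length. A small check using Python's standard-library wave module, with the 3-second floor taken from Voxtral's cloning minimum:

```python
import wave

def check_voice_sample(path: str) -> list:
    """Return a list of problems with a reference WAV; empty means usable."""
    problems = []
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        duration = w.getnframes() / rate
        if w.getnchannels() != 1:
            problems.append("not mono")
        if rate != 16000:
            problems.append(f"sample rate is {rate} Hz, expected 16000")
        if duration < 3.0:
            problems.append(f"only {duration:.1f}s of audio, need 3s or more")
    return problems
```

Run it on voice_sample.wav after the ffmpeg step; an empty list means the file meets the format requirements.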

Step 6. Run inference using the vLLM-Omni demo script from the repo, or call the model via the OpenAI-compatible API endpoint that vLLM-Omni exposes.
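For the API route, here's a minimal client sketch using only the standard library. The endpoint path, model name, and voice field are assumptions based on vLLM-Omni's OpenAI-compatible surface; verify them against the vLLM-Omni docs before relying on this:

```python
import json
import urllib.request

VLLM_OMNI_URL = "http://localhost:8000/v1/audio/speech"

def build_payload(text: str, voice_sample: str) -> dict:
    # Field names are assumptions -- verify against the vLLM-Omni schema.
    return {"model": "voxtral-4b-tts", "input": text, "voice": voice_sample}

def synthesize(text: str, voice_sample: str, out_path: str) -> None:
    req = urllib.request.Request(
        VLLM_OMNI_URL,
        data=json.dumps(build_payload(text, voice_sample)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Assumes the server answers with raw WAV bytes in the response body.
    with urllib.request.urlopen(req, timeout=120) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Call synthesize("Hello there.", "voice_sample.wav", "out.wav") with the server running to get a WAV file back.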

Option B: Hugging Face Transformers (Simpler)

pip install torch torchaudio transformers accelerate

Then in Python:

from transformers import pipeline

tts = pipeline(
    "text-to-speech",
    model="mistralai/Voxtral-4B-TTS-2603",
    device=0  # GPU index
)

audio = tts(
    "The quick brown fox jumped over the lazy dog.",
    voice_sample="voice_sample.wav"
)

# Save to file
import soundfile as sf
sf.write("output.wav", audio["audio"], audio["sampling_rate"])

Troubleshooting: Out of Memory?

If you get a CUDA OOM error even though your card meets the 3GB requirement: something else is using VRAM. Close your browser, Discord, or any game running in the background. Windows doesn't always release VRAM promptly when applications close; if usage stays high after closing everything, a reboot clears it.
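To see what's eating VRAM before resorting to a reboot, you can query nvidia-smi from Python. This is a convenience wrapper around a standard nvidia-smi query and returns None if the tool isn't on your PATH:

```python
import shutil
import subprocess

def vram_used_mib():
    """Per-GPU memory use in MiB via nvidia-smi, or None if it's unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]
```

If the reported usage is already near your card's capacity before you load Voxtral, hunt down the offending process first.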

If audio sounds robotic: your voice sample is too short or too compressed. Aim for at least 15 seconds of clean, non-echoey speech. A conference call recording will perform worse than a headset recording in a quiet room.

If latency is over 500ms: you're likely on the transformers path with CPU offloading. Confirm your GPU is being used with nvidia-smi while inference runs — look for high GPU memory usage.


Real-World Use Cases

YouTube narration. The most obvious one. Feed a script to Voxtral with your own voice sample, get a rendered WAV file back in seconds. For a 1,000-word script (~7 minutes of audio), expect 60–90 seconds of generation time on an RTX 4070. That's faster than recording yourself if you include retakes.
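Those numbers hang together if you assume ~150 spoken words per minute (a typical narration pace, not a figure from the article) and the six-times-real-time generation speed quoted earlier, which Mistral cites for high-end hardware, so a 4070 may land toward the slower end:

```python
# Sanity-check the narration estimate for a 1,000-word script.
WORDS = 1_000
WPM = 150      # spoken words per minute (assumption)
SPEEDUP = 6    # generation speed relative to real time (article's figure)

audio_min = WORDS / WPM                # ~6.7 minutes of audio
gen_sec = audio_min * 60 / SPEEDUP     # ~67 seconds to generate
print(f"~{audio_min:.1f} min of audio, ~{gen_sec:.0f}s to generate")
```

The result sits inside the 60–90 second window quoted above.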

LLM voice companions. Connect a local LLM for response generation and Voxtral for speech output. The result is a fully local voice assistant — nothing leaves your machine, no API costs, no rate limits. See the build section below.

Game dialogue. Dynamic NPC voice generation where each NPC has its own voice clone. Practical for indie games; the voice library problem (you need audio samples for each character) is solvable if you record your own cast.

Accessibility tooling. Custom voice text-to-speech for users who prefer a specific voice style or accent. The voice cloning speed makes personalization fast.

Example: Local AI Voice Assistant on One GPU

The setup: Llama 3.1 8B running at Q4_K_M quantization, Voxtral TTS on the same RTX 4070 12GB, a local Whisper instance for speech-to-text.

  • Speech input → Whisper transcription (~1 second for short phrases)
  • Text → Llama 3.1 8B response generation (~4–6 seconds for a 150–200 token reply at ~35 tok/s)
  • Text response → Voxtral TTS audio (~90ms generation + ~1 second file write + audio playback)
  • Total round-trip: 6–8 seconds

That's not instant. It's also zero per-query cost, zero data leaving your machine, and fully customizable at every layer.

Tip

For shorter perceived latency, start Voxtral TTS inference as soon as the first complete sentence comes out of the LLM — don't wait for the full response. This overlaps generation and playback and cuts perceived wait time by 30–40%.
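A minimal sketch of that sentence-level overlap, assuming the LLM yields text in chunks as a stream; the regex treats ., !, and ? followed by whitespace as sentence boundaries, which is a simplification (it will split on abbreviations like "Dr."):

```python
import re

SENTENCE_END = re.compile(r"([.!?])\s")

def sentences_from_stream(token_stream):
    """Yield complete sentences as soon as they appear in an LLM token
    stream, so TTS can start before the full response is finished."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while (m := SENTENCE_END.search(buffer)):
            yield buffer[: m.end(1)].strip()   # up to and including the punctuation
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()                   # flush whatever remains
```

Feed each yielded sentence straight to the TTS endpoint while the LLM keeps generating; playback of sentence one then masks the generation time of sentence two.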


Budget GPU Recommendations

If you don't already have a GPU that can run Voxtral TTS, here's where to spend:

Under $300 — RTX 4060 8GB (~$299 new, ~$250 used): Runs Voxtral TTS cleanly. Can't run a 7B LLM simultaneously — you'll need to switch between tasks or offload the LLM to CPU. Fine if your only goal is TTS automation and you're not building an interactive voice assistant. See our GPU guide for beginners for more on the RTX 4060.

Best value around $250 — Used RTX 3070 Ti 8GB (~$230–280 secondhand): Same VRAM as the RTX 4060, older architecture, but the price gap is meaningful for a budget build. Avoid mining-damaged cards — look for cards sold by gamers, not datacenter liquidators, and verify the card's thermal history if the seller will share it.

Sweet spot for a full voice assistant stack — RTX 4070 12GB (~$700 new, ~$525 used): This is the right card if you want to run Voxtral TTS and a 7B LLM on the same GPU without juggling VRAM. The 12GB gives you just enough headroom. If you can find the RTX 4070 Super used for around $480, that's the sharper deal — same VRAM, marginally faster.

Production or 70B models — RTX 5080 16GB (~$999 MSRP): Stock is constrained and street prices are running $1,100–$1,700 depending on when you shop. Not worth paying the premium unless you're specifically running 70B models alongside TTS. If you can wait, supply should improve by Q3 2026. For more on dual-GPU configurations, see our dual GPU local LLM stack guide.


Voxtral TTS + 7B LLM Voice Assistant (Complete Python Build)

This is the full stack: Llama 3.1 8B via the Ollama HTTP API for response generation and Voxtral TTS via vLLM-Omni for voice output. To keep the script short, it takes typed input; swap in a local Whisper instance for speech-to-text if you want it fully hands-free.

import tempfile
import soundfile as sf
import sounddevice as sd
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
VLLM_OMNI_URL = "http://localhost:8000/v1/audio/speech"
VOICE_SAMPLE = "voice_sample.wav"

def get_llm_response(user_text: str) -> str:
    """Send text to local Ollama LLM, return response string."""
    payload = {
        "model": "llama3.1:8b",
        "prompt": user_text,
        "stream": False
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=120)
    return response.json()["response"]

def synthesize_speech(text: str, output_path: str) -> None:
    """Send text to Voxtral TTS via vLLM-Omni, save WAV."""
    payload = {
        "model": "voxtral-4b-tts",
        "input": text,
        "voice": VOICE_SAMPLE
    }
    response = requests.post(VLLM_OMNI_URL, json=payload, timeout=120)
    with open(output_path, "wb") as f:
        f.write(response.content)

def play_wav(path: str) -> None:
    data, sample_rate = sf.read(path)
    sd.play(data, sample_rate)
    sd.wait()

def voice_assistant_loop():
    print("Voice assistant ready. Type a message, then press Enter.")
    while True:
        user_input = input("You: ").strip()
        if not user_input:
            continue
        if user_input.lower() in ("quit", "exit"):
            break

        # Get LLM response
        response_text = get_llm_response(user_input)
        print(f"Assistant: {response_text}")

        # Synthesize and play
        # Close the temp file handle before writing to it, so this also
        # works on Windows, where an open NamedTemporaryFile can't be
        # reopened by another writer.
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            wav_path = tmp.name
        synthesize_speech(response_text, wav_path)
        play_wav(wav_path)

if __name__ == "__main__":
    voice_assistant_loop()

Requirements: pip install soundfile sounddevice requests. Ollama handles the LLM; vLLM-Omni handles TTS. Both need to be running as separate processes before you launch this script. For a complete walkthrough of setting up the 7B LLM side, see our 7B LLM build guide.


FAQ

How much VRAM does Voxtral TTS actually need? About 3GB for quantized inference. If you're running nothing else on your GPU at the time, an 8GB card like the RTX 4060 handles it cleanly. Combine it with a 7B LLM and you're looking at 8–9GB total — that's the RTX 4070 12GB's territory.

Is Voxtral TTS free for commercial use? No. The weights are CC BY-NC 4.0 — free for personal use, research, and non-commercial content. Revenue-generating use (paid courses, client work, products) requires a commercial license from Mistral. ElevenLabs' Creator tier at $22/month actually makes more sense if you're using it commercially and don't want to negotiate an enterprise agreement.

Does Voxtral TTS work on AMD or Intel Arc GPUs? AMD ROCm support is in progress — community reports suggest it runs but with more setup friction than NVIDIA. Intel Arc via oneAPI is even more experimental. If you're on AMD, budget extra time for driver debugging. NVIDIA with CUDA 11.8+ is the clean path.

Does Voxtral work with Ollama? Not as of March 2026. There's an open GitHub issue requesting it (ollama/ollama#11432), but Voxtral TTS isn't in the standard Ollama model library. Use vLLM-Omni or Hugging Face transformers.

How do I clone a celebrity voice legally?

voxtral local-tts voice-cloning gpu-guide text-to-speech
