TL;DR: You can run multiple LLMs on one GPU, but only if their combined VRAM footprint fits. On a 24 GB card, two 8B models at Q4 is doable. On anything less, you're better off loading and unloading models as needed -- which Ollama handles automatically.
The Core Constraint: VRAM Is Finite
Every loaded model occupies VRAM. A Llama 3.1 8B at Q4_K_M quantization (model compression that reduces VRAM usage while preserving most quality) uses about 5.5 GB. Load two of those and you need 11 GB, plus context memory for both.
The math is simple: add up the VRAM for each model plus their KV caches (context memory), and that total must fit in your GPU. No exceptions, no magic.
Approximate VRAM per loaded model (Q4_K_M, 4K context):
- 3B model: ~2.5 GB
- 8B model: ~5.5 GB
- 14B model: ~8.5 GB
- 70B model: ~40 GB
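The budget check is simple enough to sketch as a quick shell calculation. The sizes below follow the Q4_K_M estimates above; the per-model KV cache figure is a rough assumption for ~4K context, not a measured value:

```shell
# Rough single-GPU budget check: weights + KV caches must fit in VRAM.
# All figures are estimates for Q4_K_M models at ~4K context.
model_a_gb=5.5   # e.g. Llama 3.1 8B at Q4_K_M
model_b_gb=5.5   # a second 8B model
kv_cache_gb=0.5  # rough KV cache per model at 4K context (assumption)
gpu_gb=24        # e.g. RTX 4090

total=$(awk -v a="$model_a_gb" -v b="$model_b_gb" -v kv="$kv_cache_gb" \
  'BEGIN { printf "%.1f", a + b + 2 * kv }')
echo "Need ${total} GB of ${gpu_gb} GB"

awk -v t="$total" -v g="$gpu_gb" \
  'BEGIN { if (t <= g) print "Fits: yes"; else print "Fits: no" }'
```

Swap in your own numbers; if the total comes out within a gigabyte or two of your VRAM, treat it as a no -- the driver and desktop overhead eat into the budget too.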
So on an RTX 4090 (24 GB), you could theoretically fit:
- Two 8B models (~11 GB total, comfortable)
- One 8B + one 3B (~8 GB total, lots of headroom)
- One 14B + one 3B (~11 GB, fine)
- Two 14B models (~17 GB, tight but possible with short context)
On an RTX 4070 Ti (12 GB), realistically you can run one 8B model well. Two 3B models would fit, but 3B models are limited in capability, so it's rarely worth the hassle.
For a full breakdown of what fits on each GPU, check how much VRAM you actually need.
Ollama's Automatic Model Management
Ollama makes multi-model usage easy by automatically loading and unloading models. When you request a different model, Ollama loads it on demand; an idle model is unloaded after a timeout (default 5 minutes), or sooner if its VRAM is needed for the new one.
The keep_alive parameter controls this:
# Keep model loaded for 30 minutes after this request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "keep_alive": "30m"
}'
# Unload immediately after the response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "keep_alive": 0
}'
# Keep loaded until Ollama restarts (a request with no prompt just preloads the model)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "keep_alive": -1
}'
To run two models simultaneously in Ollama, set keep_alive to -1 on both and make sure they fit in VRAM together. Ollama will keep both in memory and route requests to the right one.
If the combined VRAM exceeds your GPU, Ollama will partially offload one model to CPU. This works but the CPU-offloaded model will be significantly slower -- expect 5-10x reduction in tokens per second (the standard measure of LLM generation speed) for the offloaded portion.
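Ollama also exposes server-side knobs for this. The environment variable names below exist in recent Ollama versions, though their defaults vary by version -- check your release's documentation:

```shell
# Allow up to 2 models resident at once, 1 concurrent request per model.
# (Env var names are from Ollama's server docs; defaults vary by version.)
OLLAMA_MAX_LOADED_MODELS=2 OLLAMA_NUM_PARALLEL=1 ollama serve
```

Capping OLLAMA_MAX_LOADED_MODELS at what your VRAM can actually hold prevents the silent partial-offload slowdown described above.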
CPU Offload Strategy
If you need two models available but don't have enough VRAM for both, consider this approach:
Keep your primary model 100% on GPU. This is the model you use most -- your coding assistant, your chat model, whatever gets the most requests.
Run your secondary model on CPU. Fully CPU inference on a modern processor gives you 5-15 tokens/sec for an 8B model. That's slow but usable for background tasks, summarization, or low-priority requests.
In llama.cpp, you'd run two server instances:
# Primary model - fully on GPU (-ngl = layers offloaded; 33 covers all layers of an 8B model)
./llama-server -m primary.gguf -ngl 33 --port 8080
# Secondary model - CPU only (no layers offloaded)
./llama-server -m secondary.gguf -ngl 0 --port 8081
This gives you two endpoints, each serving a different model, without VRAM conflicts.
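With both servers up, each model is reachable through llama.cpp's OpenAI-compatible API (the prompts below are just placeholders):

```shell
# Ask the GPU-backed primary model (port 8080)
curl http://localhost:8080/v1/chat/completions -d '{
  "messages": [{"role": "user", "content": "Refactor this function."}]
}'

# Ask the CPU-backed secondary model (port 8081)
curl http://localhost:8081/v1/chat/completions -d '{
  "messages": [{"role": "user", "content": "Summarize this log file."}]
}'
```

Since each server hosts exactly one model, no model field is needed in the request -- the port does the routing.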
Practical Limits and When to Add a Second GPU
One GPU, two models works when:
- You have 24 GB VRAM and both models are 8B or smaller
- One model is on CPU and latency doesn't matter
- You're using Ollama's load/unload cycling (not truly simultaneous)
You should add a second GPU when:
- You consistently need two 8B+ models loaded simultaneously
- Your primary model needs more than half your VRAM, leaving no room for a second
- CPU offloading is too slow for your use case
- You want to run a single large model (70B) that doesn't fit on one card
Adding a second GPU doesn't just double your VRAM for multi-model use -- it also lets you split a single large model across both cards using tensor splitting. Our dual-GPU LLM rig guide covers the hardware setup, and the llama.cpp advanced guide covers the --tensor-split flag.
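As a sketch of what tensor splitting looks like in llama.cpp (the even 1,1 split and the model filename are assumptions -- tune the ratio to each card's VRAM):

```shell
# Split one large model across two GPUs, half and half.
# -ngl 99 offloads all layers; --tensor-split sets per-GPU proportions.
./llama-server -m llama-3.1-70b.Q4_K_M.gguf -ngl 99 --tensor-split 1,1 --port 8080
```

For mismatched cards (say, 24 GB + 12 GB), a proportional split like --tensor-split 2,1 keeps each card within its budget.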
For a broader view of what hardware to invest in at every price point, see our local AI budget guide.
The Bottom Line
Don't overthink multi-model setups on a single GPU. For most people, Ollama's automatic load/unload with a 5-minute timeout is the right approach. You switch models, wait 5-10 seconds for the new one to load, and you're running. The only scenario where true simultaneous multi-model matters is if you're running a production-like setup serving multiple users or applications at once -- and at that point, a second GPU pays for itself.
Tested with Ollama 0.6.x and llama.cpp b4917 as of March 2026.