Ollama is the right tool for running a local LLM as a personal assistant. vLLM is the right tool for serving a local LLM as an API to an application — and the distinction matters once you start building anything that handles more than one request at a time.
The core difference: Ollama processes requests from a queue, largely one at a time (recent versions allow limited parallelism via OLLAMA_NUM_PARALLEL, but it is not built for high concurrency). One user asks a question, Ollama handles it, then the next request starts. For personal use, that is fine. For an application where multiple users or processes make requests concurrently, sequential processing means the second request waits for the first to finish completely: a 30-second wait if the first request generates a long response.
vLLM solves this with continuous batching: incoming requests join the current processing batch mid-stream, dramatically increasing throughput under load. Combined with PagedAttention for efficient KV cache management, vLLM handles concurrent requests with near-linear scaling up to the point where VRAM becomes the bottleneck.
Quick Summary
- Key advantage over Ollama: Continuous batching handles concurrent requests without sequential queuing
- Minimum practical VRAM: 24GB for most 8B models — vLLM's overhead is higher than llama.cpp/Ollama
- Best use case: Local API serving for apps with multiple users, batch inference pipelines, LoRA serving
vLLM vs Ollama: Choosing the Right Tool
vLLM
- Setup complexity: Moderate
- Memory overhead: Higher
- Single-request performance: Good
- Concurrency: Continuous batching
- LoRA serving: Yes
- Platform: Linux preferred
Use vLLM if you are building a product, serving multiple users, or need LoRA serving. Use Ollama if you are running personal inference on your workstation. For a gaming PC API server serving just yourself and your home lab, the Ollama setup is simpler with no meaningful downside. For a decision matrix comparing all four runtimes, see Ollama vs LM Studio vs llama.cpp vs vLLM.
Hardware Requirements
vLLM has higher baseline memory overhead than Ollama because it pre-allocates KV cache blocks for PagedAttention. On a 24GB card, a significant portion of VRAM goes to KV cache reservation.
Minimum for Useful Deployment
GPU: NVIDIA RTX 3090, 4090, or any 24GB VRAM card.
With 24GB VRAM and Llama 3.1 8B:
- Model weights at BF16: ~16GB
- KV cache allocation (what remains of the 90% VRAM budget after weights): ~5GB
- Total VRAM reserved at startup: ~21.6GB (0.90 × 24GB)
That ~5GB of KV cache with PagedAttention can support approximately 8–16 concurrent requests at moderate context lengths. On Ollama, you would get one at a time.
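To see where those concurrency numbers come from, here is a back-of-envelope KV-cache calculation. The architecture constants (32 layers, 8 KV heads via grouped-query attention, head dimension 128) come from the published Llama 3.1 8B config; the ~5GB budget and the 4k context length are the assumptions from above, so treat the result as an estimate, not a guarantee.

```python
# Back-of-envelope KV-cache budget for Llama 3.1 8B at FP16.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

# Both K and V are stored per layer, per KV head, per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(bytes_per_token)  # 131072 bytes = 128 KiB per token

kv_budget_bytes = 5 * 1024**3          # ~5GB KV-cache budget
tokens = kv_budget_bytes // bytes_per_token
print(tokens)                          # ~40k tokens of cache total
print(tokens // 4096)                  # ~10 concurrent requests at 4k context
```

That lands in the middle of the 8–16 range: shorter contexts fit more concurrent sequences, and PagedAttention's block-level allocation means requests only consume cache for tokens they have actually generated.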
16GB Cards (RTX 4060 Ti, 4080 Super)
Technically possible, but tighter than it looks: Llama 3.1 8B at FP16 is ~16GB of weights, which already exceeds the ~14.4GB that a 90% reservation leaves on a 16GB card. In practice you need a quantized build (AWQ/GPTQ, or FP8 on RTX 4000-series cards) plus a reduced context length, for example a community INT4 quantization:
python -m vllm.entrypoints.openai.api_server \
--model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.90
The --max-model-len 4096 limit forces vLLM to pre-allocate a smaller KV cache pool, and the 4-bit weights drop to roughly 5–6GB. Feasible for light workloads; Ollama is still better for most 16GB use cases.
8GB Cards
Not recommended for vLLM. Use Ollama or llama.cpp directly.
Installation
Prerequisites
- NVIDIA GPU with compute capability 7.0+ (Volta and newer; every RTX 2000-series or later consumer card qualifies)
- CUDA 12.1 or newer
- Python 3.9–3.12
- Linux (strongly preferred; Windows via WSL2 works but with caveats)
Check your CUDA version:
nvcc --version
# or
nvidia-smi | grep "CUDA Version"
Install via pip
# Create virtual environment
python -m venv vllm-env
source vllm-env/bin/activate
# Install vLLM
pip install vllm
The vllm pip package ships pre-compiled CUDA kernels. The install pulls roughly 5–10GB of dependencies (PyTorch, CUDA libraries), so the first install takes a while.
Verify Installation
python -c "from vllm import LLM; print('vLLM installed successfully')"
Serving Your First Model
Basic Server Start
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--port 8000
This downloads the model from Hugging Face on first run (requires huggingface-cli login for gated models like Llama 3.1).
For locally downloaded models:
python -m vllm.entrypoints.openai.api_server \
--model /path/to/your/model \
--max-model-len 8192 \
--port 8000
Key Startup Parameters
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 32 \
--host 0.0.0.0 \
--port 8000
- --max-model-len: maximum context length per request, which also caps how much KV cache a single sequence can claim
- --gpu-memory-utilization: fraction of total VRAM reserved for weights plus KV cache (default 0.90). Lower it if you see OOM errors during startup; raise it (up to about 0.95) for more KV cache headroom
- --max-num-seqs: maximum number of sequences (requests) processed concurrently
- --host 0.0.0.0: listen on all interfaces, for LAN access
- --port 8000: API port
Using the OpenAI-Compatible API
vLLM's server is a drop-in replacement for the OpenAI API. Any code that uses the openai client against https://api.openai.com can be pointed at your local vLLM server by changing base_url.
Python Client
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed" # vLLM ignores API keys by default
)
# Single request
response = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "What makes PagedAttention efficient?"}],
max_tokens=512,
temperature=0.7,
)
print(response.choices[0].message.content)
# Streaming
for chunk in client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain CUDA memory hierarchy"}],
stream=True,
):
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Concurrent Requests (Where vLLM Shines)
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
async def query(prompt: str, request_id: int):
response = await client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": prompt}],
max_tokens=256,
)
return request_id, response.choices[0].message.content
async def main():
# Send 8 requests simultaneously
tasks = [
query(f"Summarize GPU architecture concept #{i}", i)
for i in range(8)
]
results = await asyncio.gather(*tasks)
for req_id, result in results:
print(f"Request {req_id}: {result[:100]}...")
asyncio.run(main())
On Ollama, these 8 requests would queue and execute one at a time. On vLLM, they batch together and complete in roughly the time it takes to process one long request.
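The gap is easy to feel with a toy model. The sketch below is not vLLM code; it simulates eight fixed-length requests handled one at a time versus all in flight at once, which is roughly the Ollama-vs-vLLM difference from the client's point of view (real continuous batching adds some per-token overhead, which this ignores).

```python
import asyncio
import time

GEN_TIME = 0.2  # stand-in for the wall-clock time to generate one response

async def handle(req_id: int) -> int:
    await asyncio.sleep(GEN_TIME)  # pretend we are generating tokens
    return req_id

async def sequential(n: int) -> float:
    """Ollama-style: each request waits for the previous one to finish."""
    start = time.perf_counter()
    for i in range(n):
        await handle(i)
    return time.perf_counter() - start

async def batched(n: int) -> float:
    """vLLM-style: all requests are in flight together."""
    start = time.perf_counter()
    await asyncio.gather(*(handle(i) for i in range(n)))
    return time.perf_counter() - start

seq = asyncio.run(sequential(8))
bat = asyncio.run(batched(8))
print(f"sequential: {seq:.2f}s, batched: {bat:.2f}s")  # ~1.60s vs ~0.20s
```

The sequential total grows linearly with the number of requests; the batched total stays near the cost of a single request until VRAM (KV cache) runs out.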
LoRA Serving
vLLM supports hot-loading LoRA adapters without restarting the server — a feature Ollama does not have.
Enable LoRA at Server Startup
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--lora-modules \
coding-assistant=/path/to/coding-lora \
customer-support=/path/to/support-lora \
--max-model-len 4096
Request with Specific LoRA
response = client.chat.completions.create(
model="coding-assistant", # Uses the LoRA adapter
messages=[{"role": "user", "content": "Write a quicksort in Python"}],
)
Different requests to the same server can use different LoRA adapters simultaneously. vLLM manages adapter loading and memory automatically.
Consumer GPU Caveats
No FP8/FP4 on Older Cards
FP8 quantization (which cuts weight VRAM usage roughly in half vs BF16 with minimal quality loss) requires hardware FP8 support, meaning Hopper (H100) or Ada Lovelace (RTX 4000 series) and newer. Ampere cards, including the RTX 3000 series and the A100, cannot use FP8. On an RTX 3090, you are limited to:
- BF16/FP16 (full precision)
- INT8 via --quantization bitsandbytes (requires an additional pip install bitsandbytes)
- GPTQ or AWQ quantized models
For RTX 4090 users, FP8 is supported and dramatically increases the number of concurrent requests you can handle.
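As a rough guide to what each precision buys you, weight memory scales linearly with bits per parameter. A quick estimate for an 8B model (8.03e9 is Llama 3.1 8B's approximate parameter count; activations and KV cache come on top of these figures):

```python
# Approximate weight footprint of an 8B-parameter model per precision.
params = 8.03e9  # Llama 3.1 8B

for name, bits in [("BF16/FP16", 16), ("FP8/INT8", 8), ("4-bit AWQ/GPTQ", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB")
# BF16/FP16: ~16.1 GB
# FP8/INT8: ~8.0 GB
# 4-bit AWQ/GPTQ: ~4.0 GB
```

Halving the weight footprint with FP8 frees roughly 8GB for KV cache on a 24GB card, which is why FP8 so dramatically increases concurrency headroom.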
Blackwell RTX 50 Series
RTX 5090 and 5080 get the best vLLM support — FP4, FP8, and improved tensor parallelism. vLLM's support for Blackwell is being actively developed. If you are buying a GPU specifically for vLLM serving in 2026, RTX 5090 (32GB) is the clear choice.
AWQ Quantization for VRAM-Constrained Cards
If you need to run a larger model than FP16 allows, AWQ-quantized models cut weight memory to roughly a quarter while generally holding quality better than GGUF Q4. Mind the arithmetic, though: 4-bit weights for a 70B model are still about 35GB, so Llama 70B AWQ does not fit on a single 24GB card; it needs two of them with tensor parallelism:
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-70B-chat-AWQ \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 4096
Even split across two 24GB cards, context length is reduced and throughput under concurrent load will be limited. On a single 24GB card, AWQ is better spent on models in roughly the 13B–34B range.
Adding Authentication
vLLM's built-in API has no authentication by default. For any production or team setup:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--api-key "your-secret-key-here"
Client requests then require Authorization: Bearer your-secret-key-here. For more complex auth (per-user keys, rate limiting), put nginx or Caddy in front of vLLM.
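For the reverse-proxy route, a minimal nginx sketch: TLS termination plus per-IP rate limiting in front of vLLM. The hostname, certificate paths, and rate numbers are placeholders, not recommendations.

```nginx
# Rate-limit bucket: 10 requests/second per client IP.
limit_req_zone $binary_remote_addr zone=vllm_rl:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name llm.example.lan;                  # placeholder hostname

    ssl_certificate     /etc/ssl/certs/llm.crt;   # placeholder paths
    ssl_certificate_key /etc/ssl/private/llm.key;

    location / {
        limit_req zone=vllm_rl burst=20;
        proxy_pass http://127.0.0.1:8000;         # the vLLM server
        proxy_set_header Host $host;
        proxy_buffering off;                      # do not buffer streamed tokens
        proxy_read_timeout 300s;                  # allow long generations
    }
}
```

vLLM's own --api-key check still applies behind the proxy, so the two layers can be combined.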
For a full team local AI server setup, see our guide on local AI API server for teams. For a simpler single-user server approach using Ollama or LM Studio, see our gaming PC homelab server guide.
Quick-Start systemd Service
To run vLLM as a persistent service on Linux:
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI API Server
After=network.target
[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
Environment="PATH=/home/your-username/vllm-env/bin:/usr/bin:/bin"
ExecStart=/home/your-username/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--host 0.0.0.0 \
--port 8000
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
sudo systemctl enable vllm
sudo systemctl start vllm
sudo journalctl -u vllm -f # Follow logs