CraftRigs
Architecture Guide

vLLM on a Single Consumer GPU: Serve Local LLMs Like a Production API

By Georgia Thomas · 6 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Ollama is the right tool for running a local LLM as a personal assistant. vLLM is the right tool for serving a local LLM as an API to an application — and the distinction matters once you start building anything that handles more than one request at a time.

The core difference: Ollama queues requests and processes them sequentially. One user asks a question, Ollama handles it, then the next request starts. For personal use, that is fine. For an application where multiple users or processes are making requests concurrently, sequential processing means the second request waits for the first to finish completely — a 30-second wait if the first request generates a long response.

vLLM solves this with continuous batching: incoming requests join the current processing batch mid-stream, dramatically increasing throughput under load. Combined with PagedAttention for efficient KV cache management, vLLM handles concurrent requests with near-linear scaling up to the point where VRAM becomes the bottleneck.
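To make the throughput difference concrete, here is a toy cost model, not vLLM internals: it assumes a fixed per-step decode latency (the 25 ms figure is illustrative) and ignores prefill and batching overhead.

```python
# Toy cost model: sequential serving vs. continuous batching.
# Assumes a fixed decode latency per step; ignores prefill and
# batching overhead. Numbers are illustrative, not benchmarks.
PER_TOKEN_MS = 25  # assumed decode latency per step

def sequential_ms(output_lengths):
    # Ollama-style queue: each request runs alone, so times add up.
    return sum(n * PER_TOKEN_MS for n in output_lengths)

def batched_ms(output_lengths):
    # Continuous batching: all requests decode in the same steps,
    # so wall-clock time tracks the longest request.
    return max(output_lengths) * PER_TOKEN_MS

lengths = [120, 300, 80, 500]      # output tokens per request
print(sequential_ms(lengths))      # 25000 ms
print(batched_ms(lengths))         # 12500 ms
```

Under this model, four requests that would take 25 seconds back-to-back finish together in the time of the longest one.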

Quick Summary

  • Key advantage over Ollama: Continuous batching handles concurrent requests without sequential queuing
  • Minimum practical VRAM: 24GB for most 8B models — vLLM's overhead is higher than llama.cpp/Ollama
  • Best use case: Local API serving for apps with multiple users, batch inference pipelines, LoRA serving

vLLM vs Ollama: Choosing the Right Tool

vLLM:

  • Setup difficulty: Moderate
  • Memory overhead: Higher
  • Single-user performance: Good
  • Concurrent requests: Continuous batching
  • LoRA hot-swapping: Yes
  • Platform: Linux preferred
Use vLLM if you are building a product, serving multiple users, or need LoRA serving. Use Ollama if you are running personal inference on your workstation. For a gaming PC API server serving just yourself and your home lab, the Ollama setup is simpler with no meaningful downside. For a decision matrix comparing all four runtimes, see Ollama vs LM Studio vs llama.cpp vs vLLM.


Hardware Requirements

vLLM has higher baseline memory overhead than Ollama because it pre-allocates KV cache blocks for PagedAttention. On a 24GB card, a significant portion of VRAM goes to KV cache reservation.

Minimum for Useful Deployment

GPU: NVIDIA RTX 3090, 4090, or any 24GB VRAM card.

With 24GB VRAM and Llama 3.1 8B:

  • Model weights at BF16: ~16GB
  • KV cache pool (the remainder of the VRAM budget after weights and activation overhead): ~5GB
  • Total VRAM used at startup: ~21.6GB (24GB × the default --gpu-memory-utilization of 0.90)

That ~5GB KV cache pool, managed with PagedAttention, can support approximately 8–16 concurrent requests at moderate context lengths. On Ollama, you would get one at a time.
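That estimate follows from simple KV-cache arithmetic. A back-of-envelope sketch, using the layer and head counts from the Llama 3.1 8B model config:

```python
# Back-of-envelope KV cache sizing for Llama 3.1 8B at BF16.
# Model-config numbers: 32 layers, 8 KV heads (grouped-query
# attention), head_dim 128, 2 bytes per BF16 value. Each layer
# stores one K and one V vector per KV head per token.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
print(kv_per_token)            # 131072 bytes = 128 KiB per token

tokens_per_gib = (1024**3) // kv_per_token
print(tokens_per_gib)          # 8192 tokens per GiB of KV pool

# At a 4k context, each GiB of KV pool holds ~2 in-flight requests.
print(tokens_per_gib // 4096)  # 2
```

So a pool of a few GiB supports on the order of ten concurrent 4k-context requests, and more at shorter contexts.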

16GB Cards (RTX 4060 Ti, 4080 Super)

Technically possible. vLLM on a 16GB card with Llama 3.1 8B in FP16 leaves almost no KV cache budget. You can work around this with:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90

The --max-model-len 4096 cap limits how much KV cache any single request can claim, which lets the server start even though the remaining pool is far smaller than the model's full 128K context window. Feasible for light workloads; Ollama is still better for most 16GB use cases.

8GB Cards

Not recommended for vLLM. Use Ollama or llama.cpp directly.


Installation

Prerequisites

  • NVIDIA GPU with compute capability 7.0+ (RTX 2000 series and newer)
  • CUDA 12.1 or newer
  • Python 3.9–3.12
  • Linux (strongly preferred; Windows via WSL2 works but with caveats)

Check your CUDA version:

nvcc --version
# or
nvidia-smi | grep "CUDA Version"

Install via pip

# Create virtual environment
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install vllm

The vllm pip package ships pre-compiled CUDA kernels. Installation pulls approximately 5–10GB of dependencies (PyTorch, CUDA libraries), so the first install takes a while.

Verify Installation

python -c "from vllm import LLM; print('vLLM installed successfully')"

Serving Your First Model

Basic Server Start

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --port 8000

This downloads the model from Hugging Face on first run (requires huggingface-cli login for gated models like Llama 3.1).

For locally downloaded models:

python -m vllm.entrypoints.openai.api_server \
  --model /path/to/your/model \
  --max-model-len 8192 \
  --port 8000

Key Startup Parameters

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --host 0.0.0.0 \
  --port 8000

  • --max-model-len: maximum context length; caps per-request KV cache use
  • --gpu-memory-utilization: fraction of total VRAM vLLM may use for weights plus KV cache (default: 0.90)
  • --max-num-seqs: maximum concurrent sequences (requests)
  • --host 0.0.0.0: listen on all interfaces (for LAN access)

--gpu-memory-utilization 0.90 means vLLM reserves 90% of total VRAM. Lower this if you see OOM errors during startup; raise it (up to about 0.95) for more KV cache headroom.
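The trade-off is easy to see with illustrative arithmetic for a 24GB card running Llama 3.1 8B (~16GB of BF16 weights; activation overhead ignored for simplicity):

```python
# How --gpu-memory-utilization changes the KV pool on a 24GB card
# with ~16GB of BF16 weights. Activation overhead is ignored, so
# real pools will be somewhat smaller.
TOTAL_GB, WEIGHTS_GB = 24, 16

for util in (0.85, 0.90, 0.95):
    kv_pool = TOTAL_GB * util - WEIGHTS_GB
    print(f"util={util:.2f}: KV pool ~{kv_pool:.1f} GB")
```

Going from 0.85 to 0.95 roughly half-again increases the KV pool, which translates directly into more concurrent requests.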


Using the OpenAI-Compatible API

vLLM's server is a drop-in replacement for the OpenAI API. Any code that uses the openai client against https://api.openai.com can be redirected to your local vLLM server by changing the base_url.

Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM ignores API keys by default
)

# Single request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What makes PagedAttention efficient?"}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)

# Streaming
for chunk in client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain CUDA memory hierarchy"}],
    stream=True,
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Concurrent Requests (Where vLLM Shines)

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def query(prompt: str, request_id: int):
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return request_id, response.choices[0].message.content

async def main():
    # Send 8 requests simultaneously
    tasks = [
        query(f"Summarize GPU architecture concept #{i}", i)
        for i in range(8)
    ]
    results = await asyncio.gather(*tasks)
    for req_id, result in results:
        print(f"Request {req_id}: {result[:100]}...")

asyncio.run(main())

On Ollama, these 8 requests would queue and execute one at a time. On vLLM, they batch together and complete in roughly the time it takes to process one long request.


LoRA Serving

vLLM supports hot-loading LoRA adapters without restarting the server — a feature Ollama does not have.

Enable LoRA at Server Startup

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules \
    coding-assistant=/path/to/coding-lora \
    customer-support=/path/to/support-lora \
  --max-model-len 4096

Request with Specific LoRA

response = client.chat.completions.create(
    model="coding-assistant",  # Uses the LoRA adapter
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
)

Different requests to the same server can use different LoRA adapters simultaneously. vLLM manages adapter loading and memory automatically.


Consumer GPU Caveats

No FP8/FP4 on Older Cards

FP8 quantization (which cuts weight VRAM roughly in half vs BF16 with minimal quality loss) requires hardware FP8 support, which arrived with Hopper (H100) and Ada Lovelace (RTX 4000 series). Ampere cards, including the A100 and the entire RTX 3000 series, cannot use FP8. On an RTX 3090, you are limited to:

  • BF16/FP16 (full precision)
  • INT8 via --quantization bitsandbytes (requires additional install)
  • GPTQ or AWQ quantized models

For RTX 4090 users, FP8 is supported and dramatically increases the number of concurrent requests you can handle.
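The concurrency gain follows from the same VRAM budget arithmetic as before, a sketch that ignores activation overhead:

```python
# Why FP8 helps concurrency on an RTX 4090: halving weight bytes
# more than doubles the KV cache pool. Illustrative arithmetic;
# activation overhead is ignored.
PARAMS_B = 8              # Llama 3.1 8B
BUDGET_GB = 24 * 0.90     # default 90% utilization on a 24GB card

for name, bytes_per_param in (("BF16", 2), ("FP8", 1)):
    weights_gb = PARAMS_B * bytes_per_param
    pool = BUDGET_GB - weights_gb
    print(f"{name}: weights ~{weights_gb} GB, KV pool ~{pool:.1f} GB")
```

An ~8GB swing in weight footprint becomes an ~8GB larger KV pool, i.e. roughly two to three times the concurrent capacity at the same context length.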

Blackwell RTX 50 Series

RTX 5090 and 5080 get the best vLLM support — FP4, FP8, and improved tensor parallelism. vLLM's support for Blackwell is being actively developed. If you are buying a GPU specifically for vLLM serving in 2026, RTX 5090 (32GB) is the clear choice.

AWQ Quantization for VRAM-Constrained Cards

If you need to stretch a 24GB card, AWQ-quantized models cut weight VRAM to roughly 4 bits per parameter while typically preserving quality better than GGUF Q4:

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq \
  --max-model-len 4096

Note that 4-bit weights for a 70B model still come to roughly 35GB, so 70B-class models do not fit on a single 24GB card even with AWQ. On one 24GB card, AWQ is practical up to roughly the 30B class at reduced context length, though throughput will be limited for concurrent use.
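As a quick sketch of 4-bit weight footprints (params × 0.5 bytes; quantization scales and zero-points add a few percent on top):

```python
# Approximate 4-bit (AWQ-style) weight footprints in GB.
# Real checkpoints run a few percent larger due to quantization
# metadata (scales and zero-points).
for name, params_b in (("13B", 13), ("34B", 34), ("70B", 70)):
    print(f"{name}: ~{params_b * 0.5:.1f} GB of weights")
```

The 34B class is the practical ceiling for a single 24GB card once some KV cache headroom is accounted for.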


Adding Authentication

vLLM's built-in API has no authentication by default. For any production or team setup:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key "your-secret-key-here"

Client requests then require Authorization: Bearer your-secret-key-here. For more complex auth (per-user keys, rate limiting), put nginx or Caddy in front of vLLM.
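As one sketch of that front-proxy pattern (the hostname is a placeholder), a minimal Caddyfile that terminates TLS and forwards to the vLLM port:

```text
# Caddyfile sketch; llm.example.com is a hypothetical hostname
llm.example.com {
    reverse_proxy localhost:8000
}
```

Caddy provisions the TLS certificate automatically; per-user keys and rate limiting would be layered on at this proxy rather than inside vLLM.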

For a full team local AI server setup, see our guide on local AI API server for teams. For a simpler single-user server approach using Ollama or LM Studio, see our gaming PC homelab server guide.


Quick-Start systemd Service

To run vLLM as a persistent service on Linux:

# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI API Server
After=network.target

[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username
Environment="PATH=/home/your-username/vllm-env/bin:/usr/bin:/bin"
ExecStart=/home/your-username/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
Reload systemd, then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo journalctl -u vllm -f  # Follow logs
