TL;DR: Fine-tuning a 7B model with LoRA requires 12-20 GB VRAM (vs ~5 GB for inference). Full fine-tuning requires 80-100+ GB when you include weights, gradients, optimizer states, and activations — impossible on consumer hardware. For $1,500-$2,000 builds, QLoRA on an RTX 4070 or 4090 is the practical path. For 13B+ models at production scale, budget $4,500+ or use cloud spot instances for training and run locally for inference.
What Is Fine-Tuning and Why Would You Do It Locally?
Fine-tuning updates a pre-trained model's weights using your own training data so the model learns your specific task, domain knowledge, or communication style — rather than relying on general knowledge from pre-training.
During training, gradient descent runs over your curated dataset and updates weights to minimize loss on that data. The GPU has to hold model weights + gradients + optimizer states + activations — at least 2-3× the VRAM of inference even for parameter-efficient methods, and an order of magnitude more for a full fine-tune. That's the number people miss.
Think of it this way: inference is driving a borrowed car. Fine-tuning is sending it to a custom shop. You need the full vehicle on the lift (all weights in memory), plus the tools and workspace (gradients, optimizer states). You can't do the work with just the keys.
Why do it locally? Privacy (your data never leaves your machine), cost (no per-token API fees), and iteration speed. Cloud training adds up fast at scale. Once your model is trained, local inference is essentially free.
Why Fine-Tuning Needs So Much More VRAM Than Inference
This is the most common failure mode. Someone gets Mistral 7B running smoothly for inference and assumes fine-tuning is just "inference plus a little extra." It's not.
Mistral 7B inference with Q4_K_M quantization: ~4.5 GB VRAM. Mistral 7B full fine-tune in FP16: roughly 100 GB once weights, gradients, optimizer states, and activations are all counted. Even the weights and gradients alone come to 42 GB, a 9× jump that is immediately out of reach for every consumer GPU on the market.
Even with the most efficient method (QLoRA), you're looking at 8-10 GB VRAM for a 7B model. That's still 2× the inference cost, not "a little extra."
The 4 VRAM Consumers During Training
Understanding where the memory goes is the key to choosing the right method.
Model weights — same as inference. FP16 Mistral 7B = 14 GB. QLoRA uses a 4-bit quantized base, so ~4.5 GB instead.
Gradients — one FP32 gradient per trainable weight. For a full fine-tune: 4 bytes × 7 billion parameters = 28 GB. LoRA only computes gradients for the small adapter parameters (a few percent of the total weights at most), so this drops to ~0.7 GB.
Optimizer states (AdamW) — two FP32 moment estimates (mean and variance) per trainable parameter, 4 bytes each. Full fine-tune: 56 GB for AdamW on a 7B model. Frozen parameters don't generate optimizer states — this is where LoRA wins big.
Activations — intermediate layer outputs stored during the forward pass for backprop. These scale with batch size and sequence length. Gradient checkpointing trades compute for memory by recomputing activations instead of storing them.
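The first three consumers reduce to bytes-per-parameter arithmetic. A minimal sketch of that accounting (byte counts assume FP16 weights, FP32 gradients, and FP32 AdamW moments, as described above; activations are excluded because they scale with batch size and sequence length, not parameter count):

```python
# Bytes held per parameter for each VRAM consumer during training.
BYTES_PER_PARAM = {
    "weights_fp16": 2,       # model weights stored in FP16
    "gradients_fp32": 4,     # one FP32 gradient per trainable weight
    "adamw_states_fp32": 8,  # two FP32 moment estimates per trainable weight
}

def component_gb(n_params: float, bytes_per_param: int) -> float:
    """VRAM in GB for one consumer, given a trainable parameter count."""
    return n_params * bytes_per_param / 1e9

# Full fine-tune of a 7B model: every parameter is trainable.
for name, b in BYTES_PER_PARAM.items():
    print(f"{name}: {component_gb(7e9, b):.0f} GB")  # 14, 28, 56 GB
```

For LoRA, only the adapter parameter count goes into the gradient and optimizer rows, which is where the savings come from.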
LoRA vs Full Fine-Tuning — The VRAM Math
Full fine-tune on Mistral 7B (FP16):
- Weights: 14 GB
- Gradients: 28 GB
- Optimizer states: 56 GB
- Activations (~batch 4): ~6 GB
- Total: ~104 GB minimum — requires two A100 80 GB cards in parallel
LoRA rank-64 on Mistral 7B (FP16 base):
- Base weights: 14 GB
- LoRA params + gradients: ~0.7 GB
- Optimizer states: ~1.4 GB
- Activations: ~4 GB
- Total: ~20 GB — fits RTX 4090
QLoRA (4-bit base + LoRA):
- 4-bit weights: ~4.5 GB
- LoRA params + gradients + optimizer + activations: ~4-5.5 GB
- Total: ~8-10 GB — fits RTX 4070 with careful batch sizing
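All three totals follow from one formula: weight bytes for the whole model, plus gradient and optimizer bytes for the trainable parameters only, plus activations. A rough estimator; the LoRA trainable count (~175M), the per-method activation figures, and the ~0.65 bytes/param for an NF4 base (quantization constants included) are illustrative assumptions chosen to be consistent with the breakdowns above, not measured values:

```python
def training_vram_gb(n_params, weight_bytes, n_trainable, activations_gb):
    """Estimate training VRAM: weight bytes for all params, FP32 gradients
    (4 B) and AdamW states (8 B) only for trainable params, plus activations."""
    weights = n_params * weight_bytes / 1e9
    grads_and_opt = n_trainable * (4 + 8) / 1e9
    return weights + grads_and_opt + activations_gb

# Full fine-tune: all 7B params trainable, FP16 (2 bytes/param), ~6 GB activations
full = training_vram_gb(7e9, 2, 7e9, 6)        # ~104 GB
# LoRA rank-64: ~175M trainable adapter params (assumed), ~4 GB activations
lora = training_vram_gb(7e9, 2, 175e6, 4)      # ~20 GB
# QLoRA: ~0.65 bytes/param NF4 base (assumed), ~2 GB activations
qlora = training_vram_gb(7e9, 0.65, 175e6, 2)  # ~9 GB
```

The base-model term dominates for LoRA and the trainable-parameter terms nearly vanish, which is exactly what the bullet lists above show.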
Note
nvidia-smi often shows a misleadingly low VRAM number until the first backward pass. Activation and optimizer state allocations can spike 3-4 GB mid-training step — this is why you hit OOM after training starts, not during model load.
Fine-Tuning Methods — LoRA, QLoRA, and Full Fine-Tune
The method you choose determines both VRAM requirements and training quality. LoRA and QLoRA dominate consumer GPU fine-tuning because they make it feasible at all.
The core insight: LoRA doesn't modify the original weights. It trains small "adapter" matrices (1-5% of total model size) that are added to the base model's output at each attention layer. Those adapters are all you're updating — and all you're computing gradients for.
Full Fine-Tuning (Requires 80-100+ GB VRAM for 7B Models)
Updates every model weight. Maximum quality potential, maximum resource requirement.
Only practical on datacenter GPUs (A100, H100) or multi-GPU setups connected via NVLink. Cloud cost for fine-tuning a 7B model on 10K samples runs $50-200 on A100 spot instances. Use full fine-tuning when you need maximum adaptation quality, you're training on 100K+ samples, or LoRA quality isn't cutting it for your task.
For most builders: skip this entirely.
LoRA (Low-Rank Adaptation) — 12-20 GB VRAM for 7B
Freezes base model weights, trains small rank-decomposed adapter matrices injected at each attention layer. VRAM reduction from full fine-tune: 5-8×. Quality gap vs full fine-tune is small for most tasks — for instruction following and domain adaptation, it's negligible.
Key parameters: rank (r=8 to 128, higher rank = more capacity + more VRAM), alpha, and target modules (q_proj, v_proj, up_proj, down_proj are typical choices). Available in Hugging Face PEFT, Unsloth, and Axolotl.
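Rank drives adapter size directly: each adapted weight matrix gains rank × (d_in + d_out) trainable parameters from its two low-rank factors. A quick counter, using Mistral-7B-like shapes as an illustrative assumption (hidden size 4096, 32 layers, v_proj output 1024 under grouped-query attention):

```python
def lora_param_count(rank: int, module_shapes: list[tuple[int, int]],
                     n_layers: int) -> int:
    """Trainable parameters added by LoRA adapters.

    Each adapted matrix of shape (d_in, d_out) gets two low-rank factors,
    A (d_in x rank) and B (rank x d_out): rank * (d_in + d_out) params.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return per_layer * n_layers

# q_proj: 4096 -> 4096, v_proj: 4096 -> 1024 (Mistral-7B-like, assumed)
shapes = [(4096, 4096), (4096, 1024)]
print(lora_param_count(16, shapes, 32))  # 6,815,744 params at rank 16
print(lora_param_count(64, shapes, 32))  # 4x that at rank 64
```

Adapter parameters scale linearly with rank, and each one drags FP32 gradients and optimizer states along with it, which is why r=128 costs noticeably more VRAM than r=8.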
RTX 4080 (16 GB) territory. Tight but doable.
QLoRA (4-bit Quantized Base + LoRA) — 8-10 GB VRAM for 7B
Further reduces VRAM by quantizing the base model to 4-bit NF4 (normal float 4) before training. This is the method that makes consumer GPU fine-tuning practical.
Quality penalty vs full-precision LoRA: typically less than 1 MMLU point on standard benchmarks. The Unsloth library provides QLoRA with a 2-3× speedup over vanilla Hugging Face QLoRA — use it.
Training speed on RTX 4090 with Unsloth + Llama 3.1 8B, batch size 4: approximately 1,400 steps/hour.
QLoRA enables:
- 7B models on RTX 4070 (12 GB)
- 13B models on RTX 4090 (24 GB)
- 34B models on dual RTX 4090 (48 GB combined)
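A rough fit check behind that list. The ~0.65 bytes/param for the NF4 base (quantization constants included) and the flat ~5 GB training allowance are assumptions consistent with the figures above, not measured constants; real overhead grows somewhat with model size and batch configuration:

```python
def qlora_fits(n_params_b: float, vram_gb: float,
               bytes_per_param: float = 0.65, overhead_gb: float = 5.0) -> bool:
    """True if a model of n_params_b billion parameters plausibly trains with
    QLoRA in vram_gb: 4-bit base weights plus a fixed allowance for adapters,
    gradients, optimizer states, and activations."""
    base_gb = n_params_b * bytes_per_param
    return base_gb + overhead_gb <= vram_gb

print(qlora_fits(7, 12))   # RTX 4070
print(qlora_fits(13, 24))  # RTX 4090
print(qlora_fits(34, 48))  # dual RTX 4090
```

Note that a 13B model does not fit a 12 GB card under these assumptions, which is why the tiers above step up in GPU rather than just batch size.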
"My GPU Is Fast Enough for Inference So Fine-Tuning Will Work" — Wrong
This assumption kills training jobs. Let's correct it directly.
Correction 1: VRAM for training is not "inference + a little extra." Even QLoRA — the most memory-efficient method — requires ~2× the inference VRAM for a 7B model. A full fine-tune requires roughly 20×.
Correction 2: As the note above says, nvidia-smi reads misleadingly low until the first backward pass, and allocations can spike 3-4 GB mid-step. You'll see 6 GB used in nvidia-smi, then OOM at step 1.
Correction 3: Gradient checkpointing (enabled by default in Unsloth) helps with activations but adds ~30% compute overhead. Training takes longer, but VRAM peaks are lower. Worth it on constrained hardware.
Where the confusion comes from: "Fine-tuning" is sometimes used loosely to mean prompt tuning or system prompt crafting. Neither requires any VRAM beyond inference — you're not touching weights. Gradient-based fine-tuning is a fundamentally different operation.
Warning
Don't start a long training run without testing your QLoRA config with 100 steps first. Confirm VRAM peaks and training speed before committing hours of GPU time to a job that might OOM on step 500.
Fine-Tuning in Practice — QLoRA on RTX 4090 with Llama 3.1 8B
Here's a real training run, not a theory exercise.
Hardware: RTX 4090 (24 GB GDDR6X), Ubuntu 22.04, CUDA 12.1, Unsloth + Hugging Face Transformers
Task: Fine-tuning Llama 3.1 8B on 5,000 instruction-response pairs (customer service Q&A)
QLoRA config:
- 4-bit NF4 base quantization
- Rank-16 LoRA, alpha=32
- Target modules: q_proj + v_proj
- Gradient checkpointing: ON
Results:
- VRAM usage during training: 14.8 GB peak (comfortable headroom on 24 GB)
- Training speed: ~1,200 steps/hour
- 3 epochs over 5K samples = 12.5K steps = ~10.4 hours to completion
- Task accuracy on held-out eval: 78% vs 61% base model (zero-shot) on task-specific questions
That 17-point accuracy gain is the payoff for the hardware investment.
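Wall-clock estimates like the one above are worth doing before you hit "run". A back-of-envelope helper; the step count is taken as given from this run, since steps per epoch depend on your effective batch size:

```python
def training_hours(total_steps: int, steps_per_hour: float) -> float:
    """Wall-clock hours for a training run at a measured throughput."""
    return total_steps / steps_per_hour

# 3 epochs over 5K samples -> 12.5K steps at ~1,200 steps/hour (this run)
print(round(training_hours(12_500, 1_200), 1))  # 10.4 hours
```

Measure steps/hour over the first few hundred steps of your own run, then plug it in; throughput varies with GPU, sequence length, and whether gradient checkpointing is on.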
What to try first: Use Unsloth's Colab notebook to validate your dataset before committing local GPU hours. Run 100-200 steps in the cloud to confirm your data formatting is correct and the loss is actually decreasing. Then switch to local training once you know the approach works.
Budget Tier Recommendations for Fine-Tuning
| Build | Example GPU | Fine-Tuning Method | Use Case |
| --- | --- | --- | --- |
| $1,500-$2,000 | RTX 4070 (12 GB) | QLoRA (7B) | Personal projects, small datasets <20K samples |
| Mid-range | RTX 4090 (24 GB) | QLoRA (13B), LoRA (7B) | Small-business fine-tuning workflows |
| $4,500+ | Dual RTX 4090 (48 GB) | LoRA or full fine-tune (7B), QLoRA (34-70B) | Production-scale fine-tuning (13B+) |
Tip
If your primary use case is inference but you want the option to fine-tune occasionally, prioritize VRAM capacity over bandwidth. The RTX 4090 is the sweet spot — 24 GB gives you enough room for QLoRA on 13B models without going to a multi-GPU setup.
For anything requiring 70B+ model fine-tuning or full fine-tuning of 13B+ models: use cloud spot instances for training (A100 80 GB at $2-4/hour) and run locally for inference. Buying hardware to cover peak training demand is usually not economical.
Related Concepts for Local AI Builders
- LoRA / QLoRA — The primary method for consumer GPU fine-tuning. Understanding rank, alpha, and target modules is prerequisite knowledge before your first training run.
- VRAM — The binding constraint for training. How much you need scales with method and model size — the tables above give you the exact numbers.
- Quantization / bits per weight — QLoRA uses quantized base models. Understanding bpw helps you understand the quality tradeoff you're accepting.
- Fine-tuning glossary entry — The canonical definition and terminology reference.