
LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning technique that adds small trainable weight matrices to a frozen base model — enabling custom model training on consumer hardware.

LoRA (Low-Rank Adaptation) is a method for fine-tuning large language models without updating all the original weights. Instead of modifying a 7B-parameter model directly (which would mean storing a full 14GB copy of updated weights at 16-bit precision), LoRA inserts small, low-rank matrices at key points in the model architecture. Only these small matrices are trained; the base model stays frozen.

Why This Matters

Full fine-tuning of a 7B model requires storing gradients and optimizer states for all 7 billion parameters during training. On consumer hardware that's prohibitive: you'd need roughly 40–80GB of VRAM for the training state alone.

LoRA reduces the trainable parameter count by 90–99%. A typical LoRA for a 7B model might have 20–30 million trainable parameters instead of 7 billion. The VRAM requirements drop dramatically — fine-tuning is possible on a single RTX 3090 or 4090.
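The parameter-count reduction is easy to verify with back-of-envelope arithmetic. The figures below are illustrative assumptions (32 layers, hidden size 4096, adapters on the four attention projections, rank 16), not a specific model's configuration:

```python
# Back-of-envelope trainable-parameter count for a LoRA on a 7B-class model.
# All shapes are illustrative assumptions.
layers, d_model, rank = 32, 4096, 16
target_matrices_per_layer = 4  # q, k, v, o attention projections (assumption)

# Each adapted d×d matrix adds A (r×d) and B (d×r) -> 2·d·r parameters
params_per_matrix = 2 * d_model * rank
lora_params = layers * target_matrices_per_layer * params_per_matrix

print(f"{lora_params:,} trainable parameters")   # ~16.8M at rank 16
print(f"{lora_params / 7e9:.3%} of a 7B base model")
```

Raising the rank or adapting the MLP matrices as well pushes the count toward the 20–30 million range, still well under 1% of the base model.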

How It Works

The hypothesis behind LoRA is that the weight updates learned during fine-tuning have low intrinsic rank. LoRA therefore decomposes the update ΔW to a weight matrix W into two much smaller matrices, B (d×r) and A (r×d), such that ΔW ≈ B × A for some small rank r. Training only A and B requires a fraction of the compute and memory of training W directly. After training, the LoRA weights can be merged back into the base model or kept separate as an adapter.
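A toy sketch in plain Python makes the mechanism concrete: the adapter path computes W·x + B·(A·x), and merging the adapter (W' = W + B·A) produces the same output. Matrix sizes and values here are arbitrary examples:

```python
# Toy LoRA forward pass with tiny dense matrices as plain Python lists.
# Shapes: W is d×d (frozen), A is r×d and B is d×r (trainable).
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):  # P: m×k, Q: k×n
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

d, r = 4, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]                    # r×d
B = [[0.5, 0.0], [0.0, 0.5], [0.2, 0.1], [0.0, 0.4]]                # d×r
scale = 1.0  # alpha / r in real implementations
x = [1.0, 2.0, 3.0, 4.0]

# Adapter kept separate: base output plus low-rank correction
y_adapter = [w + scale * b
             for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Adapter merged: W' = W + scale·B·A gives the same output
BA = matmul(B, A)
W_merged = [[W[i][j] + scale * BA[i][j] for j in range(d)] for i in range(d)]
y_merged = matvec(W_merged, x)

assert all(abs(a - b) < 1e-9 for a, b in zip(y_adapter, y_merged))
```

Because the merged and unmerged forms are equivalent, an adapter adds zero inference latency once merged, while the separate form lets you swap adapters over one shared base model.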

LoRA Adapters in Practice

LoRA adapters are distributed as small files (typically 50–500MB) that modify a base model's behavior. You can find LoRA adapters on Hugging Face for tasks like:

  • Character personas and roleplay styles
  • Domain-specific writing (medical, legal, technical)
  • Coding style preferences
  • Instruction-following improvements

Inference tools such as Ollama and llama.cpp support loading LoRA adapters at runtime, making it straightforward to experiment with different fine-tunes on the same base model.
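As a sketch of what runtime loading looks like (the model and adapter file names below are hypothetical): llama.cpp accepts a GGUF-converted adapter via its `--lora` flag, and an Ollama Modelfile can layer an adapter over a base model with the `ADAPTER` instruction.

```
# llama.cpp: apply an adapter file at load time
./llama-cli -m base-model.gguf --lora my-adapter.gguf -p "Hello"

# Ollama Modelfile: layer the same adapter over a base model
FROM llama3
ADAPTER ./my-adapter.gguf
```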

QLoRA

QLoRA extends LoRA by quantizing the frozen base model to 4-bit precision (the NF4 data type) during training. This further reduces VRAM requirements; it's what made fine-tuning 65B-class models on a single 48GB GPU feasible. The QLoRA paper reports quality nearly matching LoRA training at 16-bit precision.

Related guides:

  • Fine-tuning a 7B LLM on a consumer GPU with Unsloth and LoRA — step-by-step setup for LoRA and QLoRA training on a single GPU.
  • GGUF vs GPTQ vs AWQ vs EXL2: which quantization format should you use? — exporting fine-tuned LoRA adapters to different quantization formats after training.