AI Hardware Calculator
Estimate VRAM requirements, inference speed, and power consumption for Large Language Models across consumer and enterprise hardware.
Target Hardware
CPU offloading reduces VRAM requirements but significantly slows inference
Model Parameters
Total parameters loaded into VRAM
Total VRAM Required: The total GPU memory needed: model weights + KV cache + runtime overhead. If this exceeds your GPU's VRAM, you'll need to quantize further, use RAM offload, or get a bigger GPU.
Exceeds Single GPU Capacity (24GB)
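The VRAM total above can be sketched as a simple sum. This is a rough illustration only; the bits-per-weight, KV cache, and overhead values below are assumed placeholder figures, not the constants this calculator uses.

```python
def estimate_vram_gb(params_b, bits_per_weight=4.5,
                     kv_cache_gb=1.0, overhead_gb=1.5):
    """Rough VRAM estimate: weights + KV cache + runtime overhead.

    params_b        -- model size in billions of parameters
    bits_per_weight -- effective bits per weight after quantization
                       (~4.5 is an illustrative value for a 4-bit format)
    """
    weights_gb = params_b * bits_per_weight / 8  # billions of params * bytes each
    return weights_gb + kv_cache_gb + overhead_gb

# A 70B model at ~4.5 bits/weight needs roughly 42 GB total,
# which exceeds a single 24 GB consumer GPU.
needed = estimate_vram_gb(70)
```

In practice the KV cache grows with context length and batch size, so treat the fixed values here as a floor.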
Est. Speed: Estimated tokens generated per second. 10+ tok/s feels conversational. 5-10 is usable with lag. Below 5 is best for batch/offline use. Speed depends on memory bandwidth, not compute cores.
Good for Chat
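Because single-stream decoding reads roughly the full set of model weights once per generated token, speed is memory-bandwidth-bound. A minimal sketch, assuming an illustrative bandwidth-efficiency factor (the real achievable fraction varies by GPU and software):

```python
def estimate_tok_per_s(bandwidth_gb_s, model_gb, efficiency=0.6):
    """Bandwidth-bound decode speed.

    bandwidth_gb_s -- published peak memory bandwidth of the device
    model_gb       -- size of the quantized weights in VRAM
    efficiency     -- assumed fraction of peak bandwidth actually
                      achieved (0.6 is a placeholder, not a measurement)
    """
    return bandwidth_gb_s * efficiency / model_gb

# Example: a GPU with ~1000 GB/s peak bandwidth and a 40 GB model
# lands in the conversational 10+ tok/s range under this assumption.
speed = estimate_tok_per_s(1000, 40)
```

This is why a heavily quantized model can be faster than a larger-precision one on the same card: fewer bytes to stream per token.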
Technical Breakdown
Hardware Notes
All calculations are estimates based on standard quantization overheads and published memory bandwidth specs. Actual performance depends on quantization format (GGUF, EXL2, AWQ), software (llama.cpp, vLLM, MLX), system configuration, driver version, and workload. MoE models load all expert weights but activate only a subset per token. Apple Silicon estimates use MLX-typical throughput. Used GPU prices fluctuate; check eBay and r/hardwareswap for current market rates.
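The MoE point above (all experts loaded, few active per token) can be made concrete with a small sketch. The expert counts and sizes below are illustrative placeholders, not the parameters of any specific model:

```python
def moe_params_b(n_experts, params_per_expert_b, active_experts, shared_b):
    """Return (total, active) parameter counts in billions for a
    mixture-of-experts model.

    total  -- what must fit in VRAM (every expert plus shared layers)
    active -- what is read per token (routed experts plus shared layers)
    All inputs here are hypothetical example values.
    """
    total = n_experts * params_per_expert_b + shared_b
    active = active_experts * params_per_expert_b + shared_b
    return total, active

# 8 experts of ~5.6B each, 2 routed per token, ~1.7B shared:
# VRAM sizing follows the total, per-token speed follows the active count.
total, active = moe_params_b(8, 5.6, 2, 1.7)
```

This is why MoE models need the VRAM of their full size but can decode closer to the speed of a much smaller dense model.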