AI Hardware Calculator
Estimate VRAM requirements, inference speed, and power consumption for Large Language Models across consumer and enterprise hardware.
Target Hardware
CPU offloading reduces VRAM requirements but significantly slows inference
Model Parameters
Total parameters loaded into VRAM
Total VRAM Required: The total GPU memory needed: model weights + KV cache + runtime overhead. If this exceeds your GPU's VRAM, you'll need to quantize further, use RAM offload, or get a bigger GPU.
Exceeds Single GPU Capacity (24GB)
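The VRAM total above can be sketched as a simple sum. This is a rough illustration only; the bits-per-weight, KV cache, and overhead values below are assumed placeholder figures, not the constants this calculator uses.

```python
def estimate_vram_gb(params_b, bits_per_weight=4.5,
                     kv_cache_gb=1.0, overhead_gb=1.5):
    """Rough VRAM estimate: weights + KV cache + runtime overhead.

    params_b        -- model size in billions of parameters
    bits_per_weight -- effective bits per weight after quantization
                       (~4.5 is an illustrative value for a 4-bit format)
    """
    weights_gb = params_b * bits_per_weight / 8  # billions of params * bytes each
    return weights_gb + kv_cache_gb + overhead_gb

# A 70B model at ~4.5 bits/weight needs roughly 42 GB total,
# which exceeds a single 24 GB consumer GPU.
needed = estimate_vram_gb(70)
```

In practice the KV cache grows with context length and batch size, so treat the fixed values here as a floor.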
Est. Speed: Estimated tokens generated per second. 10+ tok/s feels conversational. 5-10 is usable with lag. Below 5 is best for batch/offline use. Speed depends on memory bandwidth, not compute cores.
Good for Chat
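Because single-stream decoding reads roughly the full set of model weights once per generated token, speed is memory-bandwidth-bound. A minimal sketch, assuming an illustrative bandwidth-efficiency factor (the real achievable fraction varies by GPU and software):

```python
def estimate_tok_per_s(bandwidth_gb_s, model_gb, efficiency=0.6):
    """Bandwidth-bound decode speed.

    bandwidth_gb_s -- published peak memory bandwidth of the device
    model_gb       -- size of the quantized weights in VRAM
    efficiency     -- assumed fraction of peak bandwidth actually
                      achieved (0.6 is a placeholder, not a measurement)
    """
    return bandwidth_gb_s * efficiency / model_gb

# Example: a GPU with ~1000 GB/s peak bandwidth and a 40 GB model
# lands in the conversational 10+ tok/s range under this assumption.
speed = estimate_tok_per_s(1000, 40)
```

This is why a heavily quantized model can be faster than a larger-precision one on the same card: fewer bytes to stream per token.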
Technical Breakdown
Hardware Notes
All calculations are estimates based on standard quantization overheads and published memory bandwidth specs. Actual performance depends on quantization format (GGUF, EXL2, AWQ), software (llama.cpp, vLLM, MLX), system configuration, driver version, and workload. MoE models load all expert weights but activate only a subset per token. Apple Silicon estimates use MLX-typical throughput. Used GPU prices fluctuate; check eBay and r/hardwareswap for current market rates.
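The MoE point above (all experts loaded, few active per token) can be made concrete with a small sketch. The expert counts and sizes below are illustrative placeholders, not the parameters of any specific model:

```python
def moe_params_b(n_experts, params_per_expert_b, active_experts, shared_b):
    """Return (total, active) parameter counts in billions for a
    mixture-of-experts model.

    total  -- what must fit in VRAM (every expert plus shared layers)
    active -- what is read per token (routed experts plus shared layers)
    All inputs here are hypothetical example values.
    """
    total = n_experts * params_per_expert_b + shared_b
    active = active_experts * params_per_expert_b + shared_b
    return total, active

# 8 experts of ~5.6B each, 2 routed per token, ~1.7B shared:
# VRAM sizing follows the total, per-token speed follows the active count.
total, active = moe_params_b(8, 5.6, 2, 1.7)
```

This is why MoE models need the VRAM of their full size but can decode closer to the speed of a much smaller dense model.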