LLM Memory Calculator
Calculate GPU VRAM required to run LLMs locally with different quantization levels
How Memory is Calculated
Model Weights: Parameters × Bits per Parameter ÷ 8 = Bytes. A 7B model at FP16 needs ~14GB just for weights.
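The weights formula above can be sketched in a few lines of Python (the helper name `weight_memory_gb` is ours, not part of the calculator):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Model weights: parameters x bits per parameter / 8 = bytes."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

# 7B model at FP16 (16 bits per parameter): ~14 GB just for weights
print(f"{weight_memory_gb(7, 16):.1f} GB")  # 14.0 GB
```
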
KV Cache: Grows with context length. Longer contexts = more memory. This is why 128K context models need significantly more VRAM.
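A minimal sketch of why the KV cache grows with context length, using an illustrative Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, FP16 cache); the exact numbers depend on the model's architecture:

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    # Two tensors (K and V) are cached per layer, per token
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch
    return total_bytes / 1e9  # decimal GB

# Same model, growing context: the cache scales linearly with tokens
print(f"4K context:   {kv_cache_gb(4096, 32, 32, 128):.2f} GB")
print(f"128K context: {kv_cache_gb(131072, 32, 32, 128):.2f} GB")
```

At 4K tokens the cache is ~2 GB; at 128K it is 32× larger, which is why long-context models need so much more VRAM.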
Quantization Impact: INT4 uses 4× less memory than FP16, making it possible to run 70B models on consumer GPUs.
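The 4× savings falls straight out of the weights formula; a quick check for a 70B model:

```python
# Weight memory for a 70B-parameter model at two precisions (decimal GB)
PARAMS = 70e9
fp16_gb = PARAMS * 16 / 8 / 1e9  # 16 bits per parameter -> 140 GB
int4_gb = PARAMS * 4 / 8 / 1e9   # 4 bits per parameter  -> 35 GB
print(f"FP16: {fp16_gb:.0f} GB, INT4: {int4_gb:.0f} GB ({fp16_gb / int4_gb:.0f}x smaller)")
```
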
Frequently Asked Questions
Can I run Llama 70B on my RTX 4090?
Yes, with INT4 quantization (GPTQ/AWQ) and some compromises. The model plus full context needs ~35-40GB, which exceeds the 4090's 24GB of VRAM, so you'll need to run a shorter context and/or offload some layers to system RAM.
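A rough fit check, counting only INT4 weights (context and runtime overhead add more on top):

```python
# Does Llama 70B at INT4 fit in a single RTX 4090? (weights only, decimal GB)
weights_gb = 70e9 * 4 / 8 / 1e9   # ~35 GB of INT4 weights
vram_gb = 24.0                    # RTX 4090 VRAM
overflow_gb = weights_gb - vram_gb
print(f"at least {overflow_gb:.0f} GB must be offloaded to system RAM")
```
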
What's the quality loss with quantization?
INT8 has minimal quality loss (~1%). INT4 shows ~2-5% degradation on benchmarks but is often unnoticeable for general use.
Can I split models across multiple GPUs?
Yes! Tools like llama.cpp, vLLM, and Hugging Face Accelerate support model parallelism across GPUs.