Calculate GPU VRAM required to run LLMs locally with different quantization levels
Model Weights: Parameters × Bits per Parameter ÷ 8 = Bytes. A 7B model at FP16 needs ~14GB just for weights.
KV Cache: Grows linearly with context length, since every token in the context keeps key and value tensors for each attention layer. This is why 128K-context models need significantly more VRAM.
Quantization Impact: INT4 uses 4× less memory than FP16 (4 bits vs. 16 bits per weight), making it possible to run 70B models on consumer GPUs. A rough estimator combining these factors is sketched below.
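As a rough illustration of how weights, KV cache, and quantization combine, here is a minimal Python sketch. The architecture numbers (32 layers, 32 KV heads, head dimension 128) are assumptions modelled on a Llama-2-7B-style layout, and the estimate ignores activations and framework overhead.

```python
# Rough VRAM estimator: weights + KV cache. All figures are approximations;
# real usage also includes activations, runtime overhead, and fragmentation.

GIB = 1024 ** 3

def weight_bytes(n_params: float, bits_per_param: float) -> float:
    # Parameters x bits per parameter / 8 = bytes
    return n_params * bits_per_param / 8

def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> float:
    # Each token stores one key and one value vector per layer (FP16 = 2 bytes).
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

# Assumed 7B-class architecture: 32 layers, 32 KV heads, head dim 128 (Llama-2-7B-like).
params = 7e9
kv = kv_cache_bytes(context_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    w = weight_bytes(params, bits)
    print(f"{name}: weights {w / GIB:.1f} GiB + KV cache (4K ctx) {kv / GIB:.1f} GiB")
```

Running it prints roughly 13 GiB of weights at FP16 versus about 3.3 GiB at INT4, with a ~2 GiB FP16 KV cache at 4K context on top.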
Yes, with INT4 quantization (GPTQ or AWQ). You'll need roughly 35-40GB at full context, so use a shorter context or offload some layers to system RAM.
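As a back-of-envelope check, assuming the figure covers weights plus an FP16 KV cache for a Llama-2-70B-style model (80 layers, 8 KV heads under grouped-query attention, head dimension 128): 70 × 10⁹ parameters × 4 bits ÷ 8 ≈ 35GB of weights, and the KV cache at the full 4K context adds about 2 × 80 × 8 × 128 × 2 bytes × 4096 tokens ≈ 1.3GB, which lands in the 35-40GB range once runtime overhead is included.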
INT8 shows minimal quality loss (~1%). INT4 shows roughly 2-5% degradation on benchmarks, though the difference is often unnoticeable in general use.
Yes. Tools like llama.cpp, vLLM, and Hugging Face Accelerate support model parallelism, splitting a model's layers or tensors across multiple GPUs (see the sketch below).
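For example, Hugging Face Transformers (with Accelerate installed) can shard a checkpoint across all visible GPUs via device_map="auto". This is a minimal sketch, not a tuned setup; the model ID is illustrative, and the gated Llama-2 weights require access approval.

```python
# Minimal sketch: shard a large model across all visible GPUs with Accelerate's
# device_map="auto". The model ID is illustrative; substitute your own checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # illustrative; gated, requires access approval

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # splits layers across available GPUs (and CPU if needed)
    torch_dtype=torch.float16,  # FP16 weights; combine with quantization to fit smaller GPUs
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```

llama.cpp and vLLM offer comparable multi-GPU splitting through their own configuration options.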