🧠

LLM Memory Calculator

Calculate GPU VRAM required to run LLMs locally with different quantization levels

Context Length: 512 to 128K tokens

Total VRAM Required: 20.7 GB

Memory Breakdown

Model Weights: 16.00 GB
KV Cache (4,096 ctx): 2.34 GB
Activations: 0.80 GB
Overhead: 1.60 GB
Total: 20.74 GB

GPU Compatibility

✓ RTX 4090: 24GB (86% used)
✗ RTX 4080: 16GB
✗ RTX 4070 Ti: 12GB
✓ RTX 3090: 24GB (86% used)
✗ RTX 3080: 10GB
✓ A100 40GB: 40GB (52% used)
✓ A100 80GB: 80GB (26% used)
✓ H100 80GB: 80GB (26% used)
✓ A6000: 48GB (43% used)
✗ Mac M1: 8GB
✗ Mac M1 Pro: 16GB
✓ Mac M1 Max: 32GB (65% used)
✓ Mac M2 Ultra: 64GB (32% used)
✓ Mac M3 Max: 48GB (43% used)

How Memory is Calculated

Model Weights: Parameters × Bits per Parameter ÷ 8 = Bytes. A 7B model at FP16 needs ~14GB just for weights.
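
As a quick sanity check, here is a minimal Python sketch of that formula (the model sizes and bit widths below are illustrative inputs, not fixed values):

def weight_memory_gb(params_billion, bits_per_param):
    # Parameters × bits per parameter ÷ 8 = bytes; divide by 1e9 for GB
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weight_memory_gb(7, 16))   # 7B at FP16  -> 14.0 GB
print(weight_memory_gb(7, 4))    # 7B at INT4  ->  3.5 GB
print(weight_memory_gb(70, 4))   # 70B at INT4 -> 35.0 GB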

KV Cache: Grows linearly with context length, so longer contexts mean more memory. This is why 128K-context models need significantly more VRAM.
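
A common estimate is 2 (keys and values) × layers × KV heads × head dim × context length × bytes per element. The Python sketch below assumes Llama-2-7B-style defaults (32 layers, 32 KV heads, head dim 128, FP16 cache), so it won't exactly match the breakdown above; models with grouped-query attention or a quantized KV cache come in much lower.

def kv_cache_gb(ctx_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # factor of 2: both K and V are stored for every token in every layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

print(kv_cache_gb(4_096))    # ~2.1 GB at 4K context
print(kv_cache_gb(131_072))  # ~68.7 GB at 128K context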

Quantization Impact: INT4 uses 4× less memory than FP16, making it possible to run 70B models on consumer GPUs.
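
To make that concrete for a 70B model, here is a short sketch using the same weight formula (real quantized checkpoints add a small overhead for scales and zero points, so treat these as lower bounds):

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = 70e9 * bits / 8 / 1e9
    print(f"70B at {name}: {gb:.0f} GB weights")  # 140 GB, 70 GB, 35 GB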

Frequently Asked Questions

Can I run Llama 70B on my RTX 4090?

Only with offloading. Even at INT4 (GPTQ/AWQ), a 70B model needs roughly 35-40GB, more than the 4090's 24GB, so you'll need to offload part of the model to system RAM (as llama.cpp and similar tools support) and keep the context short.
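
As a rough back-of-the-envelope sketch (assuming ~35 GB of INT4 weights, 80 transformer layers as in Llama 2 70B, an even per-layer split, and ~4 GB reserved for KV cache and overhead):

weights_gb = 70e9 * 4 / 8 / 1e9       # ~35 GB of INT4 weights
n_layers = 80                         # Llama 2 70B has 80 transformer layers
per_layer_gb = weights_gb / n_layers  # ~0.44 GB per layer
vram_budget_gb = 24 - 4               # leave ~4 GB for KV cache and overhead
print(int(vram_budget_gb / per_layer_gb))  # ~45 layers on the GPU; the rest stay in RAM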

What's the quality loss with quantization?

INT8 has minimal quality loss (~1%). INT4 shows ~2-5% degradation on benchmarks, though the difference is often unnoticeable in general use.

Can I split models across multiple GPUs?

Yes! Tools like llama.cpp, vLLM, and Hugging Face Accelerate support model parallelism across GPUs.
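
For example, a minimal Hugging Face Transformers + Accelerate sketch that shards a model across whatever GPUs are visible (the model ID and dtype here are placeholders, not a recommendation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # placeholder; swap in the model you actually use
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets Accelerate split the layers across all available GPUs,
# spilling to CPU RAM if VRAM runs out
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)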