🧠

LLM Memory Calculator

Calculate GPU VRAM required to run LLMs locally with different quantization levels

Context Length: 512 to 128K tokens

Total VRAM Required: 20.7 GB

Memory Breakdown

Model Weights: 16.00 GB
KV Cache (4,096 ctx): 2.34 GB
Activations: 0.80 GB
Overhead: 1.60 GB
Total: 20.74 GB

GPU Compatibility

✓ RTX 4090: 24GB (86% used)
✗ RTX 4080: 16GB
✗ RTX 4070 Ti: 12GB
✓ RTX 3090: 24GB (86% used)
✗ RTX 3080: 10GB
✓ A100 40GB: 40GB (52% used)
✓ A100 80GB: 80GB (26% used)
✓ H100 80GB: 80GB (26% used)
✓ A6000: 48GB (43% used)
✗ Mac M1: 8GB
✗ Mac M1 Pro: 16GB
✓ Mac M1 Max: 32GB (65% used)
✓ Mac M2 Ultra: 64GB (32% used)
✓ Mac M3 Max: 48GB (43% used)

How Memory is Calculated

Model Weights: Parameters × Bits per Parameter ÷ 8 = Bytes. A 7B model at FP16 needs ~14GB just for weights.
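
As a quick sanity check, here is a minimal Python sketch of that formula (the model sizes and bit widths below are illustrative inputs, not fixed values):

def weight_memory_gb(params_billion, bits_per_param):
    # Parameters × bits per parameter ÷ 8 = bytes; divide by 1e9 for GB
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weight_memory_gb(7, 16))   # 7B at FP16  -> 14.0 GB
print(weight_memory_gb(7, 4))    # 7B at INT4  ->  3.5 GB
print(weight_memory_gb(70, 4))   # 70B at INT4 -> 35.0 GB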

KV Cache: Grows linearly with context length, so longer contexts mean more memory. This is why 128K-context models need significantly more VRAM.
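
A common estimate is 2 (keys and values) × layers × KV heads × head dim × context length × bytes per element. The Python sketch below assumes Llama-2-7B-style defaults (32 layers, 32 KV heads, head dim 128, FP16 cache), so it won't exactly match the breakdown above; models with grouped-query attention or a quantized KV cache come in much lower.

def kv_cache_gb(ctx_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # factor of 2: both K and V are stored for every token in every layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

print(kv_cache_gb(4_096))    # ~2.1 GB at 4K context
print(kv_cache_gb(131_072))  # ~68.7 GB at 128K context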

Quantization Impact: INT4 uses 4× less memory than FP16, making it possible to run 70B models on consumer GPUs.
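
To make that concrete for a 70B model, here is a short sketch using the same weight formula (real quantized checkpoints add a small overhead for scales and zero points, so treat these as lower bounds):

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = 70e9 * bits / 8 / 1e9
    print(f"70B at {name}: {gb:.0f} GB weights")  # 140 GB, 70 GB, 35 GB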

Frequently Asked Questions

Can I run Llama 70B on my RTX 4090?

Only with offloading. Even at INT4 (GPTQ/AWQ), a 70B model needs roughly 35-40GB, more than the 4090's 24GB, so you'll need to offload part of the model to system RAM (as llama.cpp and similar tools support) and keep the context short.
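
As a rough back-of-the-envelope sketch (assuming ~35 GB of INT4 weights, 80 transformer layers as in Llama 2 70B, an even per-layer split, and ~4 GB reserved for KV cache and overhead):

weights_gb = 70e9 * 4 / 8 / 1e9       # ~35 GB of INT4 weights
n_layers = 80                         # Llama 2 70B has 80 transformer layers
per_layer_gb = weights_gb / n_layers  # ~0.44 GB per layer
vram_budget_gb = 24 - 4               # leave ~4 GB for KV cache and overhead
print(int(vram_budget_gb / per_layer_gb))  # ~45 layers on the GPU; the rest stay in RAM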

What's the quality loss with quantization?

INT8 has minimal quality loss (~1%). INT4 shows ~2-5% degradation on benchmarks, though the difference is often unnoticeable in general use.

Can I split models across multiple GPUs?

Yes! Tools like llama.cpp, vLLM, and Hugging Face Accelerate support model parallelism across GPUs.
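
For example, a minimal Hugging Face Transformers + Accelerate sketch that shards a model across whatever GPUs are visible (the model ID and dtype here are placeholders, not a recommendation):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # placeholder; swap in the model you actually use
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets Accelerate split the layers across all available GPUs,
# spilling to CPU RAM if VRAM runs out
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)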