GPU Performance Calculator
Explore LLM inference benchmarks across GPU types and estimate hardware requirements for your model.
Configuration
Paste a link to a model's config.json for precise KV cache estimation and auto-detection of MoE architecture.
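As a rough illustration of what a config-based KV cache estimate can look like, the sketch below applies the standard per-token formula (keys plus values, for every layer and KV head) to fields commonly found in Hugging Face style config.json files. The function name and the example values are hypothetical, and this is not necessarily how the calculator itself computes its figure.

```python
import json

def kv_cache_bytes_per_token(config: dict, bytes_per_value: int = 2) -> int:
    """Estimate KV cache bytes stored per token (assumes FP16/BF16 by default)."""
    layers = config["num_hidden_layers"]
    heads = config["num_attention_heads"]
    # Models with grouped-query attention store fewer KV heads than query heads.
    kv_heads = config.get("num_key_value_heads", heads)
    # Some configs state head_dim explicitly; otherwise derive it.
    head_dim = config.get("head_dim") or config["hidden_size"] // heads
    # Factor 2 covers one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Hypothetical config values, chosen only for illustration.
example_config = json.loads("""
{
  "num_hidden_layers": 80,
  "num_attention_heads": 64,
  "num_key_value_heads": 8,
  "hidden_size": 8192
}
""")

per_token = kv_cache_bytes_per_token(example_config)
print(f"{per_token / 1024:.1f} KiB per token")
```

Without a config.json the calculator falls back to an estimate, so a figure computed this way can differ from the one shown in the Quick Reference.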
Estimated GPUs Required
8
TP8 or TP4/DP2
Estimated Latency
Data-driven (nearest)
TTFT
1.30 s
Time to first token
Throughput
4,174
tok/s/GPU
E2E
3.40 s
1024 in / 128 out
Based on the nearest available data point. H100 SXM dense data: qwen3-32b (32B active) at 1,024, 2,048, 4,096, 8,192, and 15K context tokens. 100 concurrent requests, 128 output tokens.
Theoretical (FLOPS / bandwidth model)
TTFT
51.8 ms
Time to first token
TPOT
8.0 ms
Per output token
E2E
1.08 s
1024 in / 128 out
Theoretical: single-request estimate. Prefill is treated as compute-bound (~35% of the 989 TFLOPS peak). Decode is treated as memory-bandwidth-bound (~65% of the 3,350 GB/s peak).
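The theoretical figures above can be reproduced with a few lines of arithmetic. The sketch below is a minimal version of the FLOPS / bandwidth model using the numbers shown on this page (70B parameters, 2 bytes per parameter, 8 GPUs, 1,024 input and 128 output tokens). The function names are illustrative, and the E2E formula (TTFT plus one TPOT per output token) is an assumption that happens to match the displayed 1.08 s.

```python
def ttft_seconds(params: float, input_tokens: int, gpus: int,
                 peak_flops: float, compute_eff: float) -> float:
    """Prefill time: ~2 FLOPs per parameter per input token, compute-bound."""
    return 2 * params * input_tokens / (gpus * peak_flops * compute_eff)

def tpot_seconds(weight_bytes: float, gpus: int,
                 bandwidth: float, bandwidth_eff: float) -> float:
    """Decode time per token: all weights are read once, bandwidth-bound."""
    return weight_bytes / (gpus * bandwidth * bandwidth_eff)

PARAMS = 70e9            # model parameters
WEIGHT_BYTES = 140e9     # 70B x 2 bytes/param
GPUS = 8
PEAK_FLOPS = 989e12      # per GPU
BANDWIDTH = 3350e9       # bytes/s per GPU

ttft = ttft_seconds(PARAMS, 1024, GPUS, PEAK_FLOPS, 0.35)
tpot = tpot_seconds(WEIGHT_BYTES, GPUS, BANDWIDTH, 0.65)
# Assumption: end-to-end latency is prefill plus one decode step per output token.
e2e = ttft + 128 * tpot

print(f"TTFT {ttft * 1e3:.1f} ms, TPOT {tpot * 1e3:.1f} ms, E2E {e2e:.2f} s")
```

Because this models a single request with fixed efficiency factors, it is expected to come out well below the measured figures taken at 100 concurrent requests.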
VRAM Breakdown
Model Weights
KV Cache
Overhead
Total VRAM needed: 326.0 GB
Available VRAM (8 GPUs): 640 GB
Utilization: 51%
Quick Reference
Model weights: 70B × 2 bytes/param = 140.0 GB
KV cache per token: ~2050.8 KB (estimated)
KV cache total: 2050.8 KB × 8,192 tokens × 10 req = 172.0 GB
Framework overhead: ~10% of model = 14.0 GB
TTFT: 2 × 70.0B × 1,024 tokens / (8 × 989 TFLOPS × 35%) ≈ 51.8 ms
TPOT: 140.0 GB / (8 × 3350 GB/s × 65%) ≈ 8.0 ms
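The VRAM lines in the Quick Reference can be checked the same way. The sketch below sums weights, KV cache, and overhead using the values shown on this page. Two points are assumptions rather than documented behavior: "KB" is read as 1,024 bytes (which is what reproduces the 172.0 GB total), and the GPU count is obtained by rounding the minimum number of 80 GB GPUs up to a power of two so that it splits evenly for tensor parallelism.

```python
import math

GB = 1e9  # the page's GB totals are consistent with decimal gigabytes

def vram_breakdown(params: float, bytes_per_param: int,
                   kv_kib_per_token: float, context_tokens: int,
                   concurrent_requests: int, overhead_frac: float = 0.10) -> dict:
    """Return weights, KV cache, overhead, and total VRAM in bytes."""
    weights = params * bytes_per_param
    # Assumption: the per-token "KB" figure means 1,024 bytes.
    kv_cache = kv_kib_per_token * 1024 * context_tokens * concurrent_requests
    overhead = overhead_frac * weights
    return {"weights": weights, "kv_cache": kv_cache,
            "overhead": overhead, "total": weights + kv_cache + overhead}

def gpus_required(total_bytes: float, vram_per_gpu: float) -> int:
    """Assumed heuristic: minimum GPU count, rounded up to a power of two."""
    minimum = math.ceil(total_bytes / vram_per_gpu)
    return 1 << (minimum - 1).bit_length()

breakdown = vram_breakdown(70e9, 2, 2050.8, 8192, 10)
gpus = gpus_required(breakdown["total"], 80 * GB)
utilization = breakdown["total"] / (gpus * 80 * GB)

for name, value in breakdown.items():
    print(f"{name:>9}: {value / GB:6.1f} GB")
print(f"GPUs: {gpus}, utilization: {utilization:.0%}")
```

With these inputs the total comes to about 326 GB, which needs at least five 80 GB GPUs and rounds up to the 8 shown in the estimate.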
Real Benchmark Reference
Actual results on H100 SXM at 1,024 input tokens (the closest available match to your 1,024-token input). 8 GPUs, 100 concurrent requests, 100 prompts, 128 output tokens.
| Model | Engine | Config | Throughput/GPU | TTFT | E2E Latency |
|---|---|---|---|---|---|
| qwen3-32b | vLLM | TP8/DP1 | 4,174 tok/s | 1296 ms | 3.40 s |
| gpt-oss-120b | vLLM | TP4/DP2 | 4,022 tok/s | 1169 ms | 3.54 s |
| gpt-oss-120b | vLLM | TP8/DP1 | 3,809 tok/s | 1413 ms | 3.70 s |
| qwen3-32b | vLLM | TP4/DP2 | 3,760 tok/s | 1411 ms | 3.52 s |
| glm-4.7-fp8 | vLLM | TP4/DP2 | 1,379 tok/s | 3041 ms | 10.31 s |
| glm-4.7-fp8 | vLLM | TP8/DP1 | 1,354 tok/s | 3037 ms | 10.44 s |
| gpt-oss-120b | SGLang | TP4/DP2 | 1,005 tok/s | 3700 ms | 5.55 s |
| gpt-oss-120b | SGLang | TP8/DP1 | 907 tok/s | 3809 ms | 7.16 s |