GPU Performance Calculator

Explore LLM inference benchmarks across GPU types and estimate hardware requirements for your model.

Configuration

Paste a link to a model's config.json for precise KV cache estimation and auto-detection of MoE architecture.
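A minimal sketch of what that estimation involves, using the standard Hugging Face config.json fields (num_hidden_layers, num_key_value_heads, head_dim, hidden_size, num_attention_heads). The calculator's exact heuristic isn't shown on this page, so the function and the example config values below are illustrative assumptions:

```python
def kv_cache_bytes_per_token(config: dict, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    layers = config["num_hidden_layers"]
    # GQA/MQA models store fewer KV heads than attention heads.
    kv_heads = config.get("num_key_value_heads", config["num_attention_heads"])
    head_dim = config.get("head_dim",
                          config["hidden_size"] // config["num_attention_heads"])
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K and V

# Illustrative 70B-class dense config (Llama-style values, for demonstration):
cfg = {"num_hidden_layers": 80, "num_attention_heads": 64,
       "num_key_value_heads": 8, "hidden_size": 8192}
print(f"{kv_cache_bytes_per_token(cfg) / 1024:.1f} KB/token")  # -> 320.0 KB/token
```

For typical MoE architectures the KV-cache formula is unchanged, since attention layers are not expert-routed; detecting MoE mainly matters for the compute estimate (active vs. total parameters).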

Estimated GPUs Required: 8 (TP8 or TP4/DP2)

Estimated Latency

Data-driven (nearest)
TTFT: 1.30 s (time to first token)
Throughput: 4,174 tok/s/GPU
E2E: 3.40 s (1,024 in / 128 out)

Based on the nearest available data point. H100 SXM dense data: qwen3-32b (32B active) at context lengths of 1,024, 2,048, 4,096, 8,192, and ~15K tokens; 100 concurrent requests, 128 output tokens.
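The matching itself is presumably a nearest-neighbor lookup over those benchmarked context lengths; a tiny sketch of that assumed logic (the page does not show its actual implementation):

```python
# Assumed nearest-neighbor matching for the data-driven estimate: snap the
# requested input length to the closest benchmarked context length.
BENCH_CTX = [1024, 2048, 4096, 8192, 15000]  # H100 SXM dataset noted above

def nearest_ctx(input_tokens: int) -> int:
    return min(BENCH_CTX, key=lambda c: abs(c - input_tokens))

print(nearest_ctx(1500))  # -> 1024
```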

Theoretical (FLOPS / bandwidth model)
TTFT: 51.8 ms (time to first token)
TPOT: 8.0 ms (per output token)
E2E: 1.08 s (1,024 in / 128 out)

Theoretical figures assume a single request: prefill is compute-bound (~35% of 989 TFLOPS effective), decode is memory-bandwidth-bound (~65% of 3,350 GB/s effective).
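The Quick Reference section below gives the exact expressions; as a runnable sketch, assuming H100 SXM peak numbers (989 TFLOPS dense BF16, 3,350 GB/s HBM) and the stated efficiency factors:

```python
def theoretical_latency(params_b: float, in_tokens: int, out_tokens: int,
                        gpus: int = 8, peak_tflops: float = 989.0,
                        bw_gbps: float = 3350.0, bytes_per_param: int = 2,
                        compute_eff: float = 0.35, bw_eff: float = 0.65):
    # Prefill is compute-bound: ~2 FLOPs per parameter per prompt token.
    prefill_flops = 2 * params_b * 1e9 * in_tokens
    ttft = prefill_flops / (gpus * peak_tflops * 1e12 * compute_eff)
    # Decode is memory-bandwidth-bound: each output token re-reads the weights.
    weight_bytes = params_b * 1e9 * bytes_per_param
    tpot = weight_bytes / (gpus * bw_gbps * 1e9 * bw_eff)
    return ttft, tpot, ttft + tpot * out_tokens

ttft, tpot, e2e = theoretical_latency(70.0, 1024, 128)
print(f"TTFT {ttft*1e3:.1f} ms, TPOT {tpot*1e3:.1f} ms, E2E {e2e:.2f} s")
# -> TTFT 51.8 ms, TPOT 8.0 ms, E2E 1.08 s
```

The data-driven numbers above come out much slower largely because they reflect 100 concurrent requests, while this model covers a single request and ignores KV-cache reads and communication overhead.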

VRAM Breakdown

Model Weights: 140.0 GB
KV Cache: 172.0 GB
Overhead: 14.0 GB
Total VRAM needed: 326.0 GB
Available VRAM (8 GPUs): 640 GB
Utilization: 51%

Quick Reference

Model weights: 70B params × 2 bytes/param = 140.0 GB
KV cache per token: ~2050.8 KB (estimated)
KV cache total: 2050.8 KB × 8,192 tokens × 10 requests = 172.0 GB
Framework overhead: ~10% of model weights = 14.0 GB
TTFT: 2 × 70.0B params × 1,024 tokens / (8 × 989 TFLOPS × 35%) = 51.8 ms
TPOT: 140.0 GB / (8 × 3,350 GB/s × 65%) = 8.0 ms
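Putting the whole table together, a sketch of the VRAM arithmetic and GPU-count estimate. The 2050.8 KB/token figure comes from the page; rounding the GPU count up to a power of two is an assumption (it matches the TP8 / TP4-DP2 suggestion above):

```python
import math

GB = 1e9  # decimal gigabytes, matching the page's figures

weights  = 70e9 * 2 / GB                    # 2 bytes/param (BF16)    -> 140.0 GB
kv_total = 2050.8 * 1024 * 8192 * 10 / GB   # KB/token x ctx x reqs   -> 172.0 GB
overhead = 0.10 * weights                   # ~10% framework overhead ->  14.0 GB
total = weights + kv_total + overhead       #                         -> 326.0 GB

gpus = max(1, math.ceil(total / 80))        # 80 GB HBM per H100
gpus = 2 ** math.ceil(math.log2(gpus))      # assumed power-of-two round-up -> 8
print(f"total {total:.1f} GB, GPUs: {gpus}, "
      f"utilization {100 * total / (gpus * 80):.0f}%")  # -> 326.0 GB, 8, 51%
```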

Real Benchmark Reference

Actual results on H100 SXM at 1,024 input tokens (the closest benchmarked point to the configured 1,024-token input). 8 GPUs, 100 concurrent requests, 100 prompts, 128 output tokens.

Model         Engine  Config   Throughput/GPU  TTFT     E2E Latency
qwen3-32b     vLLM    TP8/DP1  4,174 tok/s     1296 ms   3.40 s
gpt-oss-120b  vLLM    TP4/DP2  4,022 tok/s     1169 ms   3.54 s
gpt-oss-120b  vLLM    TP8/DP1  3,809 tok/s     1413 ms   3.70 s
qwen3-32b     vLLM    TP4/DP2  3,760 tok/s     1411 ms   3.52 s
glm-4.7-fp8   vLLM    TP4/DP2  1,379 tok/s     3041 ms  10.31 s
glm-4.7-fp8   vLLM    TP8/DP1  1,354 tok/s     3037 ms  10.44 s
gpt-oss-120b  SGLang  TP4/DP2  1,005 tok/s     3700 ms   5.55 s
gpt-oss-120b  SGLang  TP8/DP1    907 tok/s     3809 ms   7.16 s