GPU Performance Calculator

Explore LLM inference benchmarks across GPU types and estimate hardware requirements for your model.

Configuration

Paste a link to a model's config.json for precise KV cache estimation and auto-detection of MoE architecture.
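A minimal sketch of what that estimation involves, using the standard Hugging Face config.json fields (num_hidden_layers, num_key_value_heads, head_dim, hidden_size, num_attention_heads). The calculator's exact heuristic isn't shown on this page, so the function and the example config values below are illustrative assumptions:

```python
def kv_cache_bytes_per_token(config: dict, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    layers = config["num_hidden_layers"]
    # GQA/MQA models store fewer KV heads than attention heads.
    kv_heads = config.get("num_key_value_heads", config["num_attention_heads"])
    head_dim = config.get("head_dim",
                          config["hidden_size"] // config["num_attention_heads"])
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K and V

# Illustrative 70B-class dense config (Llama-style values, for demonstration):
cfg = {"num_hidden_layers": 80, "num_attention_heads": 64,
       "num_key_value_heads": 8, "hidden_size": 8192}
print(f"{kv_cache_bytes_per_token(cfg) / 1024:.1f} KB/token")  # -> 320.0 KB/token
```

For typical MoE architectures the KV-cache formula is unchanged, since attention layers are not expert-routed; detecting MoE mainly matters for the compute estimate (active vs. total parameters).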

Estimated GPUs Required: 8 (TP8 or TP4/DP2)

Estimated Latency

Data-driven (nearest)
TTFT: 1.30 s (time to first token)
Throughput: 4,174 tok/s/GPU
E2E: 3.40 s (1,024 in / 128 out)

Based on the nearest available data point. H100 SXM dense data: qwen3-32b (32B active) at context lengths of 1,024, 2,048, 4,096, 8,192, and ~15K tokens; 100 concurrent requests, 128 output tokens.
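The matching itself is presumably a nearest-neighbor lookup over those benchmarked context lengths; a tiny sketch of that assumed logic (the page does not show its actual implementation):

```python
# Assumed nearest-neighbor matching for the data-driven estimate: snap the
# requested input length to the closest benchmarked context length.
BENCH_CTX = [1024, 2048, 4096, 8192, 15000]  # H100 SXM dataset noted above

def nearest_ctx(input_tokens: int) -> int:
    return min(BENCH_CTX, key=lambda c: abs(c - input_tokens))

print(nearest_ctx(1500))  # -> 1024
```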

Theoretical (FLOPS / bandwidth model)
TTFT: 51.8 ms (time to first token)
TPOT: 8.0 ms (per output token)
E2E: 1.08 s (1,024 in / 128 out)

Theoretical figures assume a single request: prefill is compute-bound (~35% of 989 TFLOPS effective), decode is memory-bandwidth-bound (~65% of 3,350 GB/s effective).
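The Quick Reference section below gives the exact expressions; as a runnable sketch, assuming H100 SXM peak numbers (989 TFLOPS dense BF16, 3,350 GB/s HBM) and the stated efficiency factors:

```python
def theoretical_latency(params_b: float, in_tokens: int, out_tokens: int,
                        gpus: int = 8, peak_tflops: float = 989.0,
                        bw_gbps: float = 3350.0, bytes_per_param: int = 2,
                        compute_eff: float = 0.35, bw_eff: float = 0.65):
    # Prefill is compute-bound: ~2 FLOPs per parameter per prompt token.
    prefill_flops = 2 * params_b * 1e9 * in_tokens
    ttft = prefill_flops / (gpus * peak_tflops * 1e12 * compute_eff)
    # Decode is memory-bandwidth-bound: each output token re-reads the weights.
    weight_bytes = params_b * 1e9 * bytes_per_param
    tpot = weight_bytes / (gpus * bw_gbps * 1e9 * bw_eff)
    return ttft, tpot, ttft + tpot * out_tokens

ttft, tpot, e2e = theoretical_latency(70.0, 1024, 128)
print(f"TTFT {ttft*1e3:.1f} ms, TPOT {tpot*1e3:.1f} ms, E2E {e2e:.2f} s")
# -> TTFT 51.8 ms, TPOT 8.0 ms, E2E 1.08 s
```

The data-driven numbers above come out much slower largely because they reflect 100 concurrent requests, while this model covers a single request and ignores KV-cache reads and communication overhead.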

VRAM Breakdown

Model Weights: 140.0 GB
KV Cache: 172.0 GB
Overhead: 14.0 GB
Total VRAM needed: 326.0 GB
Available VRAM (8 GPUs): 640 GB
Utilization: 51%

Quick Reference

Model weights: 70B params × 2 bytes/param = 140.0 GB
KV cache per token: ~2050.8 KB (estimated)
KV cache total: 2050.8 KB × 8,192 tokens × 10 requests = 172.0 GB
Framework overhead: ~10% of model weights = 14.0 GB
TTFT: 2 × 70.0B params × 1,024 tokens / (8 × 989 TFLOPS × 35%) = 51.8 ms
TPOT: 140.0 GB / (8 × 3,350 GB/s × 65%) = 8.0 ms
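Putting the whole table together, a sketch of the VRAM arithmetic and GPU-count estimate. The 2050.8 KB/token figure comes from the page; rounding the GPU count up to a power of two is an assumption (it matches the TP8 / TP4-DP2 suggestion above):

```python
import math

GB = 1e9  # decimal gigabytes, matching the page's figures

weights  = 70e9 * 2 / GB                    # 2 bytes/param (BF16)    -> 140.0 GB
kv_total = 2050.8 * 1024 * 8192 * 10 / GB   # KB/token x ctx x reqs   -> 172.0 GB
overhead = 0.10 * weights                   # ~10% framework overhead ->  14.0 GB
total = weights + kv_total + overhead       #                         -> 326.0 GB

gpus = max(1, math.ceil(total / 80))        # 80 GB HBM per H100
gpus = 2 ** math.ceil(math.log2(gpus))      # assumed power-of-two round-up -> 8
print(f"total {total:.1f} GB, GPUs: {gpus}, "
      f"utilization {100 * total / (gpus * 80):.0f}%")  # -> 326.0 GB, 8, 51%
```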

Real Benchmark Reference

Actual results on H100 SXM at 1,024 input tokens (the closest benchmarked point to the configured 1,024-token input). 8 GPUs, 100 concurrent requests, 100 prompts, 128 output tokens.

Model         Engine  Config   Throughput/GPU  TTFT     E2E Latency
qwen3-32b     vLLM    TP8/DP1  4,174 tok/s     1296 ms   3.40 s
gpt-oss-120b  vLLM    TP4/DP2  4,022 tok/s     1169 ms   3.54 s
gpt-oss-120b  vLLM    TP8/DP1  3,809 tok/s     1413 ms   3.70 s
qwen3-32b     vLLM    TP4/DP2  3,760 tok/s     1411 ms   3.52 s
glm-4.7-fp8   vLLM    TP4/DP2  1,379 tok/s     3041 ms  10.31 s
glm-4.7-fp8   vLLM    TP8/DP1  1,354 tok/s     3037 ms  10.44 s
gpt-oss-120b  SGLang  TP4/DP2  1,005 tok/s     3700 ms   5.55 s
gpt-oss-120b  SGLang  TP8/DP1    907 tok/s     3809 ms   7.16 s