GPU Performance Calculator
Explore LLM inference benchmarks across GPU types and estimate hardware requirements for your model.
Benchmark Glossary
Definitions for the metrics reported by vLLM and SGLang benchmark runs and shown across these dashboards.
Request counts
- Successful requests
- Requests that completed end-to-end without error. Throughput numbers are computed only over these — survivor-only rates can flatter unstable configs.
- Failed requests
- Requests the server failed to complete (timeouts, OOM, errors). A non-zero count is a signal that the displayed throughput does not reflect the full offered load.
- num_prompts
- Total number of requests sent during the run. The success ratio is successful / num_prompts.
- Maximum request concurrency (Conc)
- Cap on in-flight requests to the server during the run. The benchmark client never exceeds this number simultaneously.
- Peak concurrent requests
- Largest number of requests actually in flight at any sampled instant. The reported value can read slightly above the configured maximum concurrency; this is a sampling artifact, not a breach of the cap.
- Benchmark duration (s)
- Wall-clock seconds the benchmark spent issuing and collecting requests. Throughput rates are total tokens divided by this duration.
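The success ratio described above can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual code: the function name `summarize_requests` and the returned keys are hypothetical, and it assumes every request sent is counted as either successful or failed.

```python
def summarize_requests(successful: int, failed: int, num_prompts: int) -> dict:
    """Summarize request counts for one benchmark run.

    Hypothetical helper: assumes successful + failed == num_prompts,
    i.e. every request sent ends up in exactly one of the two buckets.
    """
    if num_prompts <= 0:
        raise ValueError("num_prompts must be positive")
    if successful + failed != num_prompts:
        raise ValueError("successful + failed should equal num_prompts")
    return {
        # Success ratio as defined in the glossary: successful / num_prompts.
        "success_ratio": successful / num_prompts,
        # Any failure means throughput covers only part of the offered load.
        "throughput_is_partial": failed > 0,
    }


summary = summarize_requests(successful=95, failed=5, num_prompts=100)
print(summary)
```

A run with `throughput_is_partial` set is worth reading with extra care, because its rates are computed over the surviving requests only.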
Throughput
Critical distinction: output token throughput measures generation speed (decode), while total token throughput bundles input (prefill) tokens with output tokens. Prefill is much faster than decode, so total throughput inflates the number and is misleading as a "how fast does it answer" or "$/generated token" metric.
- Output token throughput (tok/s)
- Generated tokens per second across all successful requests. Use this for user-facing speed and $/output-token cost.
- Total token throughput (tok/s)
- Input + output tokens per second. Useful for raw engine utilization, but misleading for user-perceived speed because prefill processes thousands of tokens per second per GPU while decode is much slower.
- Peak output token throughput (tok/s)
- Highest instantaneous output rate observed during the run. Indicates the engine's ceiling once batches are full; the mean output throughput is what users actually see on average.
- Request throughput (req/s)
- Completed requests per second. Sensitive to output length — short replies finish faster regardless of decode speed.
- Throughput / GPU
- Total throughput divided by the number of GPUs in the deployment (TP × PP × DP). Lets you compare configurations of different sizes on equal footing.
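To make the difference between output and total token throughput concrete, the sketch below computes both from run totals. The function names and the example numbers are illustrative assumptions, not values taken from any benchmark shown here.

```python
def throughput_metrics(input_tokens: int, output_tokens: int,
                       successful_requests: int, duration_s: float) -> dict:
    """Compute throughput rates from totals over successful requests."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return {
        # Decode-only rate: generated tokens per second.
        "output_tok_s": output_tokens / duration_s,
        # Prefill + decode: input and output tokens per second.
        "total_tok_s": (input_tokens + output_tokens) / duration_s,
        # Completed requests per second.
        "req_s": successful_requests / duration_s,
    }


def per_gpu(rate: float, tp: int = 1, pp: int = 1, dp: int = 1) -> float:
    """Normalize a rate by the GPU count of the deployment (TP x PP x DP)."""
    return rate / (tp * pp * dp)


# Made-up run: 1000 requests, 2048 input and 128 output tokens each, 100 s.
m = throughput_metrics(input_tokens=2_048_000, output_tokens=128_000,
                       successful_requests=1000, duration_s=100.0)
print(m["output_tok_s"])                     # 1280.0
print(m["total_tok_s"])                      # 21760.0
print(per_gpu(m["total_tok_s"], tp=4, dp=2)) # 2720.0
```

In this made-up run the total figure is 17 times the output figure purely because each request carries 16 times more input than output tokens, which is why the two numbers should not be compared directly.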
Latency
Three numbers to track separately: how long until the first token shows up, how fast subsequent tokens stream, and the total wall time.
- TTFT — Time to First Token (ms)
- Time from request submission until the first generated token returns. Dominated by prefill cost; grows roughly linearly with input length and queueing depth.
- TPOT — Time per Output Token (ms)
- Average milliseconds per output token, excluding the first token. This is the steady-state decode rate users feel as 'streaming speed'. 1000 / TPOT ≈ tokens/sec per request.
- ITL — Inter-token Latency (ms)
- Gap between successive tokens within a single response. Similar to TPOT but includes scheduling jitter; the mean tracks TPOT while the P99 reveals stutters.
- E2EL — End-to-end Latency (ms)
- Total time from request submitted to last token received. Equals roughly TTFT + (output_len − 1) × TPOT.
- Mean / Median / P99
- Mean is sensitive to outliers. Median is the typical experience. P99 is the tail: 1 in 100 requests is at least this slow. Watch the gap between median and P99 for stability.
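The latency definitions above can be sketched from per-token arrival timestamps. This is an illustration under stated assumptions: the function names are hypothetical, timestamps are in milliseconds, TPOT is taken as decode time averaged over all tokens after the first, and the percentile uses the nearest-rank method, which may differ from what a given benchmark tool reports.

```python
def request_latency(submit_ms: float, token_times_ms: list) -> dict:
    """Derive TTFT, TPOT, ITL, and E2EL for a single request.

    token_times_ms holds the arrival time of each generated token, in order.
    """
    if not token_times_ms:
        raise ValueError("need at least one generated token")
    n = len(token_times_ms)
    ttft = token_times_ms[0] - submit_ms
    e2el = token_times_ms[-1] - submit_ms
    # Gaps between successive tokens within this response.
    itl = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    # Average time per output token, excluding the first token.
    tpot = (e2el - ttft) / (n - 1) if n > 1 else 0.0
    return {"ttft_ms": ttft, "tpot_ms": tpot, "itl_ms": itl, "e2el_ms": e2el}


def percentile(values: list, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values or not 1 <= pct <= 100:
        raise ValueError("need values and 1 <= pct <= 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100)
    return ordered[rank - 1]


r = request_latency(submit_ms=0, token_times_ms=[500, 520, 540, 560, 580])
print(r["ttft_ms"], r["tpot_ms"], r["e2el_ms"])  # 500 20.0 580
print(percentile(list(range(1, 101)), 99))        # 99
```

With TPOT defined this way, TTFT + (output_len − 1) × TPOT reproduces E2EL for a single request; across a whole run the relation is only approximate because the means are taken separately.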
Parallelism
Total GPUs in a deployment = TP × PP × DP. The Config column encodes these as TP4/PP1/DP2 (PP is omitted when it is 1).
- TP — Tensor Parallelism
- Each layer's weight matrices are split across TP GPUs. Reduces per-GPU memory but adds all-reduce communication on every forward pass. Used to fit large models that don't fit on one GPU.
- PP — Pipeline Parallelism
- Layers are partitioned into stages, each on its own GPU. Cheap on bandwidth (only activations cross stages) but introduces pipeline bubbles. Useful when inter-GPU bandwidth is limited (e.g. multi-node, no NVLink).
- DP — Data Parallelism
- Independent model replicas serve different requests in parallel. Throughput scales roughly linearly with the number of replicas, with no communication between replicas, at the cost of holding a full copy of the model on every replica.
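As a sketch of how a Config string maps to a GPU count, the snippet below parses strings of the form shown above. The parser is hypothetical and makes one assumption beyond the text: any degree missing from the string (not only PP) is treated as 1.

```python
import re


def parse_config(config: str) -> dict:
    """Parse a string such as 'TP4/PP1/DP2' into parallelism degrees.

    Assumption: degrees that do not appear in the string default to 1.
    """
    degrees = {"TP": 1, "PP": 1, "DP": 1}
    for part in config.split("/"):
        match = re.fullmatch(r"(TP|PP|DP)(\d+)", part.strip())
        if match is None:
            raise ValueError(f"unrecognized config component: {part!r}")
        degrees[match.group(1)] = int(match.group(2))
    return degrees


def total_gpus(config: str) -> int:
    """Total GPUs in the deployment: TP x PP x DP."""
    d = parse_config(config)
    return d["TP"] * d["PP"] * d["DP"]


print(total_gpus("TP4/PP1/DP2"))  # 8
print(total_gpus("TP4/DP2"))      # 8, PP omitted and taken as 1
```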
Workload
- Ctx — input context length
- Number of input tokens per request, before any generation. Bigger context means more prefill work and more KV-cache memory; both push TTFT up and throughput down.
- Output length
- Tokens generated per request. Most benchmarks here use 128 output tokens. Decode time scales linearly with this number, so end-to-end latency per request is roughly TTFT + (output_len − 1) × TPOT.
- Total input / generated tokens
- Sum across all successful requests in a run. Useful for sanity-checking that a benchmark actually exercised the workload it claimed.
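One way to run the sanity check mentioned above is to compare the reported totals against what the workload settings imply. This is a rough sketch: the function name, the 5% tolerance, and the assumption that every request uses the same context and output length are all illustrative choices.

```python
def check_token_totals(successful: int, ctx_len: int, output_len: int,
                       total_input: int, total_generated: int,
                       tolerance: float = 0.05) -> dict:
    """Compare reported token totals with the totals the workload implies.

    Assumes a fixed ctx_len and output_len per request; the tolerance
    is an arbitrary illustrative threshold.
    """
    expected_input = successful * ctx_len
    expected_generated = successful * output_len

    def within(actual: int, expected: int) -> bool:
        return expected > 0 and abs(actual - expected) <= tolerance * expected

    return {
        "input_ok": within(total_input, expected_input),
        "generated_ok": within(total_generated, expected_generated),
    }


# Made-up run: 1000 successful requests at Ctx 2048 with 128 output tokens.
print(check_token_totals(1000, 2048, 128, 2_048_000, 128_000))
# A run that stopped generating early would fail the second check.
print(check_token_totals(1000, 2048, 128, 2_048_000, 64_000))
```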