I ran vLLM on an RTX 4090, loaded Qwen2.5‑7B‑Instruct, and started testing longer outputs. At 1,024 tokens, TTFT jumped to 144 seconds. TPOT stayed around 18 ms.
The GPU wasn’t the issue. Something else was off.
Two numbers you need to know
TTFT → time to first token: how long you waited at the restaurant before the waiter even acknowledged you.
TPOT → time per output token: how fast the food arrived once they started bringing it.
When TTFT explodes but TPOT stays flat, the kitchen is fine. You just couldn't get a table.
Everything in this post follows from that distinction.
What came out of my run
Same hardware. Same model. Same offered request rate. Only output length changes.
| Output tokens | Actual req/s | Mean TTFT | Mean TPOT |
|---|---|---|---|
| 128 | 0.95 | 196 ms | 18.2 ms |
| 256 | 0.87 | 3,098 ms | 18.2 ms |
| 512 | 0.46 | 49,929 ms | 18.0 ms |
| 1,024 | 0.24 | 144,074 ms | 18.0 ms |
TPOT doesn't move. Across an 8× increase in output length, token generation speed is essentially unchanged: 18.19 ms at 128 tokens, 17.98 ms at 1,024. That's noise, not signal.
TTFT goes from 196 ms to 144 seconds.
Those two facts together tell you exactly what's happening. The GPU is not slow. Requests are waiting to get in.
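To make that split concrete, here is a small sketch that takes the rows from the table above and estimates how much of each request's latency is spent waiting for the first token versus decoding. It approximates decode time as output tokens × mean TPOT, which is a simplification, and the function name `split_latency` is just an illustrative helper.

```python
# Rows copied from the results table:
# (output tokens, actual req/s, mean TTFT in ms, mean TPOT in ms)
RUNS = [
    (128, 0.95, 196, 18.2),
    (256, 0.87, 3_098, 18.2),
    (512, 0.46, 49_929, 18.0),
    (1_024, 0.24, 144_074, 18.0),
]


def split_latency(output_tokens, ttft_ms, tpot_ms):
    """Return (decode_seconds, ttft_share) for one request.

    decode_seconds approximates generation time as output_tokens * TPOT.
    ttft_share is the fraction of total latency spent before the first token.
    """
    decode_s = output_tokens * tpot_ms / 1000
    ttft_s = ttft_ms / 1000
    return decode_s, ttft_s / (ttft_s + decode_s)


for tokens, _req_s, ttft_ms, tpot_ms in RUNS:
    decode_s, share = split_latency(tokens, ttft_ms, tpot_ms)
    print(
        f"{tokens:>5} tokens: decode ~{decode_s:5.1f} s, "
        f"TTFT {ttft_ms / 1000:6.1f} s ({share:.0%} of latency)"
    )
```

At 128 tokens the wait before the first token is a small slice of the total. At 1,024 tokens it dominates, even though each token is still produced at the same pace.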
Why KV cache is the actual bottleneck
Every request that enters vLLM claims physical memory on the GPU, not for computation but for storage. Specifically, it claims blocks of GPU memory (GDDR6X on an RTX 4090, HBM on data-center cards) to hold the key and value vectors for every token it has processed, across every transformer layer. This is the KV cache.
A request holds those blocks from the moment it is admitted until the moment it finishes generating. It releases nothing mid-flight. And as it generates more tokens, it claims more blocks - one new block every 16 tokens at vLLM's default block size.
This means a long-output request is a long-term GPU-memory tenant. It sits in GPU memory for its entire generation lifetime, blocking that capacity from anything else. When enough of these requests are in flight simultaneously, the KV pool fills. New requests arrive, find no available blocks, and wait.
The GPU keeps decoding the requests already in flight, so TPOT stays flat. But nothing new can start, so TTFT explodes.
This is what "idle GPU, queued requests" actually means. The compute units are occupied. The memory is full. Admission is closed.
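The block accounting described above can be sketched in a few lines. This is a toy model, not vLLM's scheduler: it assumes every resident request eventually reaches its peak block count, and the pool size (2,000 blocks) and prompt length (64 tokens) are hypothetical values chosen for illustration, not measurements from this run.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block, the figure used in the text above


def blocks_held(prompt_tokens, generated_tokens, block_size=BLOCK_SIZE):
    """KV blocks one request occupies after generating some tokens."""
    return math.ceil((prompt_tokens + generated_tokens) / block_size)


def max_resident_requests(pool_blocks, prompt_tokens, output_tokens,
                          block_size=BLOCK_SIZE):
    """Requests that fit if each one must eventually hold its peak block count."""
    peak = blocks_held(prompt_tokens, output_tokens, block_size)
    return pool_blocks // peak


# Hypothetical pool and prompt sizes, for illustration only.
POOL_BLOCKS = 2_000
PROMPT_TOKENS = 64

for output_tokens in (128, 256, 512, 1_024):
    peak = blocks_held(PROMPT_TOKENS, output_tokens)
    fit = max_resident_requests(POOL_BLOCKS, PROMPT_TOKENS, output_tokens)
    print(f"{output_tokens:>5} output tokens: {peak:>3} blocks at peak, "
          f"{fit:>3} requests fit in the pool")
```

The pool size never changes, but the number of requests it can host shrinks as output length grows, and each of those requests also stays resident for longer.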
What the throughput number is really saying
Going from 128 to 1,024 output tokens dropped actual req/s from 0.95 to 0.24 on the same hardware, same model, same offered rate.
That's a 4× collapse in throughput with TPOT completely unchanged.
The GPU did not slow down. The scheduler stopped admitting. Those are different problems with different fixes, and mixing them up is how you end up tuning the wrong thing for weeks.
The diagnostic I now run first
Three metrics, checked in this order:
gpu_cache_usage_perc above 80% - KV pool is close to full. New requests will wait.
num_requests_waiting above zero while cache usage is high - confirmed: the scheduler has stopped admitting.
TTFT climbing while TPOT is flat - the requests already running are fine. The slowdown is entirely at admission.
If all three are true at the same time, you’re not compute‑bound. You’re not bandwidth‑bound. You have a KV residency problem. And no amount of batch‑size or parallelism tuning will fix it.
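The three checks can be wired together as a rough sketch like the one below. Several things here are assumptions: that the metrics arrive in Prometheus text format with a `vllm:` prefix, that cache usage is reported as a fraction between 0 and 1, and that a 5% tolerance is a sensible definition of "flat" TPOT. Metric names vary between vLLM versions, so check your own endpoint's output. The sample payload and the helper names `parse_metrics` and `is_kv_residency_problem` are made up for illustration.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}.

    Labels are dropped; if a metric appears more than once, the last value wins.
    Assumes lines carry no trailing timestamp.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        values[name] = float(value)
    return values


def is_kv_residency_problem(cache_usage, waiting, ttft_ms, tpot_ms,
                            cache_threshold=0.80, tpot_tolerance=0.05):
    """Apply the three checks. ttft_ms and tpot_ms are (earlier, later) means."""
    cache_full = cache_usage > cache_threshold
    queue_backed_up = waiting > 0
    ttft_climbing = ttft_ms[1] > ttft_ms[0]
    tpot_flat = abs(tpot_ms[1] - tpot_ms[0]) <= tpot_tolerance * tpot_ms[0]
    return cache_full and queue_backed_up and ttft_climbing and tpot_flat


# Hypothetical scrape, not captured from the run in this post.
SAMPLE = """
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen2.5-7B-Instruct"} 0.93
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Qwen/Qwen2.5-7B-Instruct"} 12
"""

metrics = parse_metrics(SAMPLE)
verdict = is_kv_residency_problem(
    cache_usage=metrics["vllm:gpu_cache_usage_perc"],
    waiting=metrics["vllm:num_requests_waiting"],
    ttft_ms=(196, 144_074),   # mean TTFT at 128 vs 1,024 output tokens
    tpot_ms=(18.2, 18.0),     # mean TPOT at 128 vs 1,024 output tokens
)
print("KV residency problem suspected:", verdict)
```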
On larger models this cliff arrives faster
This ran on Qwen2.5‑7B on an RTX 4090. The behavior is real, but the numbers are compressed. On a 70B model, the KV footprint per token is bigger, the pool drains faster, and the TTFT cliff shows up at lower concurrency. Same mechanism, smaller margin.
I’m building toward 70B experiments next. Based on the KV math, the cliff moves earlier. I’ll show that in a later build.
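The KV math in question is short enough to sketch. Per token, the cache stores one key vector and one value vector for every KV head in every layer. The architecture numbers below are assumptions: the Qwen2.5‑7B values are what I believe the published config specifies, and the 70B entry is a hypothetical 70B-class dense model with grouped-query attention, so verify both against the relevant `config.json` before relying on them.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: keys plus values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed architecture values (16-bit KV cache); check each model's config.
QWEN25_7B = dict(num_layers=28, num_kv_heads=4, head_dim=128)
# Hypothetical 70B-class dense model with grouped-query attention.
DENSE_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

small = kv_bytes_per_token(**QWEN25_7B)
large = kv_bytes_per_token(**DENSE_70B)

print(f"7B-class:  {small / 1024:.0f} KiB per token")
print(f"70B-class: {large / 1024:.0f} KiB per token")
print(f"ratio:     {large / small:.1f}x")
```

Under these assumptions each token costs several times more KV memory on the larger model, before accounting for the extra memory its weights take away from the pool.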
What to check before you tune anything
Offered req/s vs actual req/s → are you saturated?
TTFT vs TPOT → where is time being lost?
Prompt length → is prefill work growing?
Output length → is decode residency blocking admission?
Only after that does tuning make sense.
Full results: vLLM Performance Triage Build 1 · RTX 4090 · Qwen2.5-7B-Instruct

I run these experiments on real hardware while developing diagnostic frameworks for inference engineering.
This build ran on hardware provided by CloudRift.ai, an AI infrastructure platform that enables enterprises, telecom operators, and data centers to run ML models and host GPU clouds securely on their own hardware. The experiments in this post are real workloads on real compute, not simulated benchmarks.
For more on their platform and compute: CloudRift
