Beyond the GPU

GPU size is the most visible variable in LLM inference, but it is rarely the only one that matters.

I benchmarked DeepSeek-V2-Lite-Chat on vLLM V1 to understand how a modern Mixture of Experts (MoE) + Multi-head Latent Attention (MLA) architecture behaves under serving pressure: prefill scaling, decode stability, KV-cache growth, and concurrency ramping.

I expected memory to become the constraint. It did not.

I expected high concurrency to trigger preemptions or failures. It did not.

The surprise was request arrival shape.

When requests arrived as fast as possible (request_rate=inf), first-token latency spiked at concurrency 8 and 16. When requests were spaced out (request_rate=1), the same sweep stayed stable. Same model. Same GPU. Same input/output shape. Different arrival pattern.

The bottleneck had not disappeared. It had moved from GPU memory pressure to request admission behavior.

What makes this model different

KV cache is one of the central constraints in inference engineering. Each token adds key and value state that must remain in GPU memory for the duration of the request, and at scale that is often what fills VRAM, triggers preemptions, and sets the ceiling on concurrency.

DeepSeek-V2-Lite attacks that problem from two directions.

The first is Mixture of Experts, or MoE. The model has 16 billion total parameters but only activates 2.4 billion per token. A routing mechanism selects a subset of expert networks for each token instead of running the full model, which reduces compute without reducing total capacity.

The second is multi-head latent attention, or MLA. It compresses key and value state before it enters the cache and reconstructs it when needed for attention. That adds a little compute, but it keeps the KV footprint small even at long contexts and high concurrency.

Both of these design choices show up directly in the data.

Six charts. One question each.

Chart 1: What happens to first-token latency as prompts get longer?

It rises, but much more slowly than the growth in prompt length would suggest.

| Input tokens | TTFT p99 |
|---|---|
| 1,024 | 167 ms |
| 2,048 | 84 ms |
| 4,096 | 111 ms |
| 8,192 | 201 ms |
| 12,288 | 247 ms |
| 15,360 | 262 ms |

Input grew 15x. TTFT grew 57%. That is not a dramatic scaling penalty.
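Both figures come from the first and last rows of the table. A quick check, using only the numbers reported above:

```python
# First and last rows of the Chart 1 table (input tokens, TTFT p99 in ms).
input_tokens = (1_024, 15_360)
ttft_p99_ms = (167, 262)

# Ratio of the longest prompt to the shortest prompt.
input_growth = input_tokens[1] / input_tokens[0]

# Relative increase in p99 time-to-first-token over the same range.
ttft_growth_pct = (ttft_p99_ms[1] - ttft_p99_ms[0]) / ttft_p99_ms[0] * 100

print(f"input grew {input_growth:.0f}x, TTFT p99 grew {ttft_growth_pct:.0f}%")
```

The comparison uses the 1,024-token row as the baseline. Measuring from the 84 ms reading at 2,048 tokens instead would give a larger percentage, which is one reason that row gets its own discussion.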

Chunked prefill is the reason. Instead of one large blocking prefill pass, vLLM V1 breaks the prompt into chunks and schedules them across steps. The GPU sees steadier, more uniform work, which keeps the curve from steepening too sharply.

The 2,048 token anomaly

The dip to 84 ms at 2,048 tokens is real, not noise. There are two plausible explanations: the batch size hit a favorable alignment with the GPU's execution boundaries, or residual cache warmth carried over from the previous run. This build did not include kernel-level instrumentation, so I cannot confirm either hypothesis. I have left the reading in the data as an open artifact rather than smoothing it out of the narrative.

Chart 2: How much GPU memory does the model use as prompts get longer?

At 15,360 tokens, peak KV cache usage still stays under 1% of total VRAM.

| Input tokens | Peak KV cache % (96 GB VRAM) |
|---|---|
| 1,024 | 0.07% |
| 4,096 | 0.24% |
| 8,192 | 0.48% |
| 15,360 | 0.89% |

This is MLA working as intended. Conventional attention stores full key and value state for every token for the life of the request, so KV memory grows quickly with context length and eventually becomes a major deployment constraint. MLA caches a lower-dimensional latent representation instead and reconstructs what it needs at attention time. The footprint still grows roughly linearly with context, as the table shows, but the per-token cost is small enough that usage stays very low even at long contexts. The price is a little extra compute.

On a 96 GB GPU, 0.89% at 15k tokens makes the point clearly: memory is not the limiting factor in this deployment.
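To get a feel for why the cached state is so small, here is a rough per-token estimate. It is a sketch under assumptions: the layer count, head count, and dimensions below are the values I understand the DeepSeek-V2-Lite configuration to use, and they should be checked against the model's config.json. The "conventional" baseline is a hypothetical multi-head layout with the same head count, not a real model. It is not an attempt to reproduce the measured percentages in the table.

```python
# Assumed architecture values for DeepSeek-V2-Lite; verify against config.json.
NUM_LAYERS = 27
NUM_HEADS = 16
HEAD_DIM = 128        # per-head key/value width in the hypothetical uncompressed baseline
KV_LATENT_DIM = 512   # compressed KV latent that MLA caches per token, per layer
ROPE_DIM = 64         # decoupled rotary key cached alongside the latent
BYTES_PER_VALUE = 2   # assuming a 16-bit cache dtype (bf16 or fp16)


def kv_bytes_per_token_conventional() -> int:
    """Keys plus values, for every head, in every layer."""
    return 2 * NUM_HEADS * HEAD_DIM * BYTES_PER_VALUE * NUM_LAYERS


def kv_bytes_per_token_mla() -> int:
    """One shared latent plus one rotary key per token, in every layer."""
    return (KV_LATENT_DIM + ROPE_DIM) * BYTES_PER_VALUE * NUM_LAYERS


conventional = kv_bytes_per_token_conventional()
mla = kv_bytes_per_token_mla()
print(f"conventional: {conventional} bytes/token")
print(f"MLA:          {mla} bytes/token")
print(f"reduction:    {conventional / mla:.1f}x")
```

Under these assumptions the MLA layout caches about 31 KB per token against about 221 KB for the baseline, roughly a 7x reduction. Both grow linearly with context length. MLA just starts from a much smaller per-token cost.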

Chart 3: Does token generation slow down as responses get longer?

No. The numbers barely move.

| Output tokens | TPOT p99 | ITL p99 |
|---|---|---|
| 128 | 9.3 ms | 10.3 ms |
| 256 | 9.1 ms | 10.2 ms |
| 512 | 9.8 ms | 10.5 ms |
| 1,024 | 9.8 ms | 10.1 ms |
| 1,536 | 9.3 ms | 10.1 ms |

In conventional attention-based decoding, longer responses often get slower because each new token has to attend over a growing history. That increases the decoding cost as the sequence length expands.

Here it barely changes, because MLA keeps the attention state compressed. The 500th token costs about the same as the 50th.

That is valuable in production when output length is unpredictable, which is most real applications. It makes latency much easier to reason about.
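A flat TPOT means end-to-end latency can be estimated with simple arithmetic. The sketch below uses the common approximation that the first token costs one TTFT and every later token costs one TPOT. The 100 ms TTFT is a hypothetical placeholder, and mixing p99 values this way gives a pessimistic estimate rather than a measured one.

```python
def estimated_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: one TTFT, then one TPOT per remaining token."""
    return ttft_s + tpot_s * (output_tokens - 1)


# Hypothetical TTFT of 100 ms; TPOT of 9.8 ms is the highest p99 in the table above.
for n in (128, 512, 1536):
    print(f"{n:>5} output tokens -> ~{estimated_latency_s(0.100, 0.0098, n):.2f} s")
```

Because TPOT does not drift with output length in this run, the estimate is linear in the number of output tokens, which is what makes capacity planning tractable when response length is not known in advance.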

Chart 4: How does throughput scale with concurrency?

Throughput improves about 7x from concurrency 1 to 32 (199 to 1,417 output tok/s), with zero preemptions at every point.

| Concurrency | Output tok/s |
|---|---|
| 1 | 199 |
| 4 | 433 |
| 8 | 585 |
| 16 | 846 |
| 24 | 1,174 |
| 32 | 1,417 |

MoE models activate only a fraction of their total parameters for each token. At low concurrency, that leaves a lot of GPU capacity underused between steps. As concurrency rises, more of that idle capacity gets filled, and throughput climbs cleanly.
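Aggregate throughput and per-request throughput tell different halves of this story. The sketch below derives both from the table. Dividing aggregate output tok/s by concurrency is only a rough proxy for what a single request experiences.

```python
# Output tokens per second at each tested concurrency level (from the table above).
throughput = {1: 199, 4: 433, 8: 585, 16: 846, 24: 1174, 32: 1417}
baseline = throughput[1]


def speedup(concurrency: int) -> float:
    """Aggregate throughput relative to the single-request baseline."""
    return throughput[concurrency] / baseline


def per_request_tps(concurrency: int) -> float:
    """Rough share of aggregate throughput that each concurrent request receives."""
    return throughput[concurrency] / concurrency


for c in throughput:
    print(f"c={c:>2}  speedup={speedup(c):.2f}x  per-request={per_request_tps(c):.0f} tok/s")
```

At concurrency 32 the system delivers about 7.1x the aggregate throughput of a single request, while each request's share falls from 199 to roughly 44 tok/s. That is the usual trade in batched serving: the GPU does more total work, and each request gets a smaller slice of it.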

No memory pressure. No failures. Just scaling.

Chart 5: What happens to first-token latency as concurrency increases?

This is where the system stops looking intuitive.

| Concurrency | TTFT p99 |
|---|---|
| 1 | 92 ms |
| 2 | 84 ms |
| 4 | 109 ms |
| 8 | 3,200 ms |
| 16 | 3,969 ms |
| 24 | 401 ms |
| 32 | 462 ms |

Concurrency 4 is still healthy at 109 ms. Concurrency 8 jumps to 3,200 ms and concurrency 16 reaches 3,969 ms. Then concurrency 24 drops back to 401 ms.

That is a roughly 29x latency penalty at concurrency 8 compared with concurrency 4, and 36x at concurrency 16, followed by a sharp recovery at higher concurrency.

Memory was not the cause. KV cache never exceeded 5%. Preemptions stayed at zero. Failures stayed at zero.

The numbers made it look like a GPU bottleneck, but the evidence said otherwise.

Chart 6: How much memory pressure does concurrent serving create?

Almost none.

| Concurrency | Peak KV cache |
|---|---|
| 1 | 0.15% |
| 8 | 1.16% |
| 16 | 2.33% |
| 32 | 4.69% |

At concurrency 32, with 2,048-token inputs and 512-token outputs per request, the system was using less than 5% of available VRAM for KV state. Growth is linear, pressure stays low, and there are no preemptions.
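The linearity is easy to verify: dividing each peak by its concurrency level should give roughly the same per-request cost. The sketch below does that and then extrapolates to a full cache. The extrapolation is an assumption, not a measurement. It presumes the same input/output shape and ignores every other limit the scheduler might hit first.

```python
# Peak KV cache usage (percent) at each tested concurrency level, from the table above.
peak_kv_pct = {1: 0.15, 8: 1.16, 16: 2.33, 32: 4.69}

# If growth is linear, the cost per concurrent request should be nearly constant.
per_request_pct = {c: pct / c for c, pct in peak_kv_pct.items()}
average_pct = sum(per_request_pct.values()) / len(per_request_pct)

# Naive linear extrapolation: concurrency at which KV usage would reach 100%.
concurrency_at_full_cache = 100 / average_pct

for c, pct in per_request_pct.items():
    print(f"c={c:>2}  KV per request = {pct:.3f}%")
print(f"extrapolated concurrency at 100% KV usage: ~{concurrency_at_full_cache:.0f}")
```

Each request costs about 0.15% at this shape, so the extrapolated ceiling lands somewhere near 680 concurrent requests. The exact figure matters less than the order of magnitude: the tested range is nowhere near the memory limit.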

If memory is not the bottleneck and the GPU is not saturated, something else is setting the limit. That is the question Chart 5 raised. The rerun answered it.

The rerun: one variable changed

After seeing the Chart 5 inversion, I reran the KV concurrency sweep with one change only: request_rate=1 instead of request_rate=inf.

request_rate=inf is the default burst mode in vLLM benchmarking: it launches requests as fast as possible. request_rate=1 spaces them out at an average of one request per second. Same concurrency levels, same model, same hardware, but a different arrival pattern.
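The difference between the two settings is easiest to see as a list of arrival timestamps. This is a simplified model, not vLLM's benchmark code. It assumes exponentially distributed gaps for a finite rate, which is my understanding of the benchmark's default behavior, and the function name is my own.

```python
import math
import random


def arrival_times(num_requests: int, request_rate: float, seed: int = 0) -> list[float]:
    """Return arrival timestamps in seconds for a given average request rate.

    An infinite rate means every request arrives at t=0 (a burst).
    A finite rate is modeled here with exponential inter-arrival gaps.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(num_requests):
        times.append(now)
        if math.isinf(request_rate):
            continue  # burst mode: no gap between requests
        now += rng.expovariate(request_rate)  # mean gap is 1 / request_rate seconds
    return times


print("rate=inf:", arrival_times(8, float("inf")))
print("rate=1:  ", [round(t, 2) for t in arrival_times(8, 1.0)])
```

Under rate=inf the scheduler sees all eight prompts in the same instant. Under rate=1 it sees them spread over several seconds, so each prefill usually has time to finish before the next request lands.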

| Concurrency | TTFT p99 (rate=inf) | TTFT p99 (rate=1) |
|---|---|---|
| 1 | 3,289 ms | 61 ms |
| 2 | 879 ms | 58 ms |
| 4 | 101 ms | 67 ms |
| 8 | 3,440 ms | 74 ms |
| 16 | 3,263 ms | 76 ms |
| 24 | 411 ms | 80 ms |
| 32 | 993 ms | 79 ms |

Under rate=1, TTFT stayed between 58 and 80 ms across the entire concurrency range. The inversion disappeared completely.

The GPU did not change. The model did not change. The input/output shape did not change. Only the arrival pattern changed, and the behavior changed with it.

What was actually happening

When requests arrived all at once under rate=inf, they piled up in the scheduler at the same time. vLLM V1 manages a fixed token budget per step and continuously batches prefill and decode work. Under a burst, prefill chunks can get fragmented across many iterations while active decode traffic keeps moving. First-token latency spikes because each new request is waiting for a prefill slot that keeps getting pushed back.

At higher concurrency, the queue becomes deep enough that the scheduler settles into a rhythm. Prefill chunks are packed more consistently alongside decode batches, and the backlog clears.

Under rate=1, requests arrive more evenly. The scheduler never sees the same burst collision, so prefill slots open before the next request arrives. TTFT stays flat.
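A toy model makes the queueing effect concrete. This is a deliberately simplified sketch, not vLLM's scheduler: it serves prefill in FIFO order under a fixed per-step token budget, ignores decode traffic entirely, and uses a hypothetical budget and step time. It shows why a burst inflates TTFT while paced arrivals do not. It does not reproduce the recovery seen at concurrency 24 and 32.

```python
def simulate_ttft(arrivals, prompt_tokens=2048, token_budget=2048, step_time_s=0.05):
    """Return per-request TTFT (seconds) for a FIFO scheduler with a per-step token budget.

    `arrivals` must be sorted. `token_budget` and `step_time_s` are hypothetical values.
    """
    n = len(arrivals)
    remaining = [prompt_tokens] * n   # prefill tokens still to process per request
    ttft = [0.0] * n
    now = 0.0
    unfinished = n
    while unfinished:
        ready = [i for i in range(n) if remaining[i] > 0 and arrivals[i] <= now]
        if not ready:
            # Scheduler is idle: jump ahead to the next arrival.
            now = min(arrivals[i] for i in range(n) if remaining[i] > 0)
            continue
        now += step_time_s            # one scheduler step elapses
        budget = token_budget
        for i in ready:               # FIFO: earlier arrivals get budget first
            if budget == 0:
                break
            used = min(budget, remaining[i])
            remaining[i] -= used
            budget -= used
            if remaining[i] == 0:     # prefill done, first token can be emitted
                ttft[i] = now - arrivals[i]
                unfinished -= 1
    return ttft


burst = simulate_ttft([0.0] * 8)                       # all requests at t=0
paced = simulate_ttft([float(i) for i in range(8)])    # one request per second
print(f"burst worst TTFT: {max(burst):.2f} s")
print(f"paced worst TTFT: {max(paced):.2f} s")
```

In the burst case the last request waits behind seven other prefills, so its TTFT is eight steps long. In the paced case every request finds an empty queue and finishes in one step.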

This is not a bug. It is how continuous batching behaves under burst admission. The important distinction is that --request-rate controls how fast requests are initiated, while --max-concurrency controls how many are active at once. Those are different load shapes, and they can produce very different behavior.
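The two knobs interact, and a small model shows how. The function below is my own illustration, not part of any vLLM tool. Given arrival times, a concurrency cap, and an assumed fixed service time per request, it computes when each request is actually allowed to start.

```python
import heapq


def start_times(arrivals, max_concurrency, service_time_s):
    """Return the time each request starts, given sorted arrivals and a concurrency cap.

    Assumes every request takes the same `service_time_s` (a simplification).
    """
    finish_heap = []   # finish times of requests admitted so far
    starts = []
    for t in arrivals:
        if len(finish_heap) >= max_concurrency:
            # Cap may be binding: wait for the earliest admitted request to finish.
            earliest_finish = heapq.heappop(finish_heap)
            t = max(t, earliest_finish)
        starts.append(t)
        heapq.heappush(finish_heap, t + service_time_s)
    return starts


# Burst arrivals: the cap decides when requests start.
print(start_times([0.0, 0.0, 0.0, 0.0], max_concurrency=2, service_time_s=1.0))
# Paced arrivals: the arrival rate decides, and the cap never binds.
print(start_times([0.0, 1.0, 2.0, 3.0], max_concurrency=2, service_time_s=1.0))
```

With a burst, requests start in waves of two, as soon as slots free up. With paced arrivals the same cap is never reached, so start times simply follow arrival times. The cap is identical in both cases, but the load the server sees is not.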

Why this matters beyond the charts

The mental model most benchmark posts imply is simple:

GPU bigger → inference faster

What the serving stack actually looks like is more like this:

Model architecture → KV cache residency → vLLM scheduler and batching → Request arrival pattern → Host CPU and GPU execution → TTFT / TPOT / throughput

Every layer can become the bottleneck. Build 2 showed that sequence in practice.

Multi-head latent attention kept KV residency low, so memory was not the limiting layer. Mixture-of-Experts kept the active compute path sparse, so throughput scaled cleanly in the tested range. vLLM handled decode batching without preemptions or failures. But burst admission under request_rate=inf changed the TTFT curve.

That has a direct cost lesson.

If I had stopped at the first TTFT spike, the easy conclusion would have been: “I need a bigger GPU.” That would have been the wrong fix. KV-cache usage stayed below 5%, preemptions stayed at zero, waiting stayed at zero, and failed requests stayed at zero. Memory pressure does not explain the spike, so more VRAM would not have fixed it.

The cheaper fix is usually earlier in the stack:

  • shape the traffic.

  • cap concurrency.

  • tune scheduler budgets.

  • right-size host CPU.

  • route long-context requests intentionally.
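The first two items can be combined in a small client-side gate. This is a minimal sketch using only the Python standard library. The class name, its parameters, and the fake request are all hypothetical, and a production setup would more likely do this in a gateway or load balancer than in application code.

```python
import asyncio
import time


class AdmissionGate:
    """Caps in-flight requests and enforces a minimum gap between admissions."""

    def __init__(self, max_in_flight: int, min_interval_s: float):
        self._sem = asyncio.Semaphore(max_in_flight)
        self._min_interval_s = min_interval_s
        self._next_slot = 0.0   # earliest monotonic time the next request may be admitted

    async def __aenter__(self):
        await self._sem.acquire()                   # concurrency cap
        now = time.monotonic()
        wait = max(0.0, self._next_slot - now)      # traffic shaping
        self._next_slot = now + wait + self._min_interval_s
        if wait:
            await asyncio.sleep(wait)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self._sem.release()


async def demo(num_requests: int = 4) -> list[float]:
    """Send a burst through the gate and return the admission timestamps."""
    gate = AdmissionGate(max_in_flight=2, min_interval_s=0.05)
    admitted = []

    async def fake_request():
        async with gate:
            admitted.append(time.monotonic())
            await asyncio.sleep(0.01)   # stand-in for a call to the inference server

    await asyncio.gather(*(fake_request() for _ in range(num_requests)))
    return admitted


if __name__ == "__main__":
    times = asyncio.run(demo())
    print([round(t - times[0], 2) for t in times])
```

All four requests are created at the same instant, but the gate releases them about 50 ms apart and never lets more than two run at once. That turns a burst into the kind of paced arrival pattern that kept TTFT flat in the rerun.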

In this run, changing arrival rate from request_rate=inf to request_rate=1 normalized the c8/c16 TTFT spike on the same machine. That means the first dollar should not go to a larger GPU. It should go to understanding admission control, scheduler behavior, and node balance.

The host CPU is part of that stack too. vLLM’s scheduler loop, tokenizer, async runtime, metrics collection, and API serving all rely on CPU-side work. With only 11 vCPUs on this instance, burst load can make host-side contention more visible. I would not call CPU the only root cause here, because the controlled-arrival run fixed the behavior on the same node. But I would call it part of the serving-node boundary.

An inference node is not a graphics card. It is a stack. Each layer has its own failure mode, and each layer has its own cost lever.

The job is not to ask which GPU is faster. It is to find which layer is setting the limit.

The most expensive inference mistake is not picking a GPU that is too small. It is buying a bigger GPU to solve a bottleneck that was never in GPU memory.

Here is the full Build 2 dashboard that ties the story together.

Setup

  • Model: deepseek-ai/DeepSeek-V2-Lite-Chat

  • GPU: NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM

  • Instance: 11 vCPUs, 56 GB RAM - CloudRift.ai

  • Serving stack: vLLM V1 (VLLM_USE_V1=1)

  • Run result: 0 failed requests, 0 preemptions

Acknowledgments

I’m grateful to CloudRift.ai for providing the sponsored infrastructure used in this build. CloudRift.ai offers GPU compute for AI workloads, making it easier to run real inference experiments on powerful hardware without managing the underlying infrastructure yourself.