In traditional infrastructure, when you hit a bottleneck you add more servers. Simple.
LLMs don't work that way.
LLM performance problems usually aren't about your code or cluster size. They come from running two fundamentally different workloads on the same GPU. One needs raw compute, the other needs memory speed. Optimizing for one hurts the other.
That's where latency spikes come from. And it's the problem the industry is now racing to solve.
1. FLOPS vs. Bandwidth
To understand why modern inference engines like vLLM and NVIDIA Dynamo are moving toward disaggregated serving, start with the hardware.
The Prefill Phase (Compute-Bound)
When you send a 2,000-token prompt to a model, the GPU processes all those tokens in parallel. This is a massive matrix-multiplication exercise. The goal is to maximize FLOPS (Floating Point Operations Per Second).
The Bottleneck: Tensor Core utilization.
The Metric: Time to First Token (TTFT).
The Decode Phase (Memory-Bound)
Once the prompt is processed, the model generates tokens one at a time. For every token, the GPU must read all of the model weights and the accumulated KV cache from HBM. On an H100, the arithmetic units can sit mostly idle during this phase (figures around 90% are often cited for small batches) because they are waiting for data to arrive over the memory bus.
The Bottleneck: HBM Bandwidth (approx. 3.35 TB/s on the H100 SXM).
The Metric: Time Per Output Token (TPOT).
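To put rough numbers on the two phases, here is a back-of-envelope sketch in Python. The 8B-parameter model, the 16-bit weights, and the 989 TFLOPS dense BF16 peak are assumptions chosen for illustration; only the 3.35 TB/s bandwidth comes from the figure above. It ignores the KV cache and attention cost, so treat the results as floors, not benchmarks.

```python
# Back-of-envelope roofline numbers. All model and compute figures are assumptions.
PEAK_FLOPS = 989e12       # assumed dense BF16 peak for an H100 SXM, in FLOP/s
HBM_BANDWIDTH = 3.35e12   # bytes/s, the H100 figure quoted in the article
PARAMS = 8e9              # hypothetical 8B-parameter model
BYTES_PER_PARAM = 2       # 16-bit weights


def prefill_seconds(prompt_tokens: int) -> float:
    """Compute-bound floor: roughly 2 FLOPs per parameter per prompt token."""
    return 2 * PARAMS * prompt_tokens / PEAK_FLOPS


def decode_seconds_per_token() -> float:
    """Memory-bound floor: every weight is read from HBM once per generated token."""
    return PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH


def arithmetic_intensity(tokens_in_flight: int) -> float:
    """FLOPs performed per byte of weights read from HBM."""
    return 2 * PARAMS * tokens_in_flight / (PARAMS * BYTES_PER_PARAM)


# FLOPs per byte the GPU needs before compute, not memory, becomes the limit.
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH

if __name__ == "__main__":
    print(f"prefill, 2,000 tokens: {prefill_seconds(2000) * 1e3:.1f} ms")
    print(f"decode, per token:     {decode_seconds_per_token() * 1e3:.2f} ms")
    print(f"intensity at 1 token:  {arithmetic_intensity(1):.0f} FLOP/byte")
    print(f"ridge point:           {ridge_point:.0f} FLOP/byte")
```

Under these assumptions a single decode stream performs about 1 FLOP per byte read, while the GPU needs roughly 295 FLOPs per byte to keep its math units busy. A 2,000-token prefill clears that bar easily, which is why the two phases land on opposite sides of the roofline.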
The Infrastructure Dilemma: When both workloads share the same GPU, prefill tasks interrupt decode tasks, causing unpredictable latency spikes under load.
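A toy model makes the interruption concrete. The sketch below assumes a GPU that runs one task at a time, a 5 ms decode step, and a 32 ms prefill; all three are illustrative assumptions, and a real scheduler batches and interleaves work in more sophisticated ways.

```python
def inter_token_gaps(decode_step_ms, prefill_ms, prefill_before_steps, num_steps):
    """Gap (ms) a user sees before each decode step on a one-task-at-a-time GPU.

    prefill_before_steps: step indices that have a prefill scheduled right before them.
    """
    gaps = []
    for step in range(num_steps):
        gap = decode_step_ms
        if step in prefill_before_steps:
            gap += prefill_ms  # this decode step waits for the prefill to finish
        gaps.append(gap)
    return gaps


# Shared GPU: two new prompts arrive while a stream is decoding.
shared = inter_token_gaps(5.0, 32.0, {3, 7}, 10)
# Dedicated decode GPU: no prefill ever lands here.
dedicated = inter_token_gaps(5.0, 32.0, set(), 10)

if __name__ == "__main__":
    print("shared   :", shared)
    print("dedicated:", dedicated)
```

In this toy run the typical gap stays at 5 ms, but two tokens on the shared GPU take 37 ms. The average barely moves; the tail is what users notice.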
2. The Solution: Disaggregated Serving
The logical evolution is to stop forcing these two workloads to share the same hardware. Disaggregated Serving (also known as a prefill/decode or P/D split) creates two distinct GPU pools:
Prefill Nodes: High-compute instances (like H100/H200) optimized for massive parallel processing.
Decode Nodes: Instances optimized for memory bandwidth, built for long-running token generation.
Once the pools are separated, each fleet scales on its own: the prefill fleet with request volume and prompt length, the decode fleet with generation length.
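Here is a minimal sizing sketch of that independence. The function names and the per-GPU throughput figures are hypothetical placeholders; real numbers would come from benchmarking your own model and hardware.

```python
import math


def prefill_gpus(requests_per_s, prompt_tokens, prefill_tokens_per_s_per_gpu):
    """GPUs needed to keep up with incoming prompt tokens."""
    return math.ceil(requests_per_s * prompt_tokens / prefill_tokens_per_s_per_gpu)


def decode_gpus(requests_per_s, output_tokens, decode_tokens_per_s_per_gpu):
    """GPUs needed to keep up with generated tokens."""
    return math.ceil(requests_per_s * output_tokens / decode_tokens_per_s_per_gpu)


# Hypothetical baseline: 10 requests/s, 2,000-token prompts, 500-token answers.
baseline = (prefill_gpus(10, 2000, 20_000), decode_gpus(10, 500, 2_500))

# Answers get twice as long: only the decode pool has to grow.
longer_answers = (prefill_gpus(10, 2000, 20_000), decode_gpus(10, 1000, 2_500))

if __name__ == "__main__":
    print("baseline (prefill, decode):      ", baseline)
    print("longer answers (prefill, decode):", longer_answers)
```

With a shared pool, the second scenario would force you to add full-purpose GPUs even though the prompt-processing load has not changed.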

How to Read This Architecture:
The Control Plane (Red): The "brain." Routes each request based on whether the KV cache already exists in the global pool, skipping prefill entirely for cached prefixes.
The Data Plane (Blue): The "muscle." The NIXL/RDMA path that lets prefill and decode nodes share memory without routing through the CPU.
Left (Prefill): Optimized for math - high FLOPS to process the prompt fast.
Right (Decode): Optimized for speed - high HBM bandwidth to stream tokens as fast as the silicon allows.
3. The New Bottleneck: The KV Cache Transfer
Disaggregation introduces a new challenge: the KV cache computed on a prefill node must be transferred to a decode node before generation can begin. Over standard TCP/IP networking, that transfer can cost more time than the split saves. This is where the infrastructure stack gets interesting:
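To see why the network matters, it helps to size the cache. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and compares two assumed link speeds. It ignores protocol overhead and latency, so real transfers would be slower.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Size of the KV cache for one sequence: a key and a value per layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes, link_bits_per_s):
    """Ideal transfer time at full line rate, with no overhead."""
    return num_bytes * 8 / link_bits_per_s


cache = kv_cache_bytes(2000)                  # the 2,000-token prompt from earlier
over_10gbe = transfer_seconds(cache, 10e9)    # assumed 10 Gb/s Ethernet
over_400g = transfer_seconds(cache, 400e9)    # assumed 400 Gb/s RDMA-capable fabric

if __name__ == "__main__":
    print(f"KV cache: {cache / 1e6:.0f} MB")
    print(f"10 Gb/s:  {over_10gbe * 1e3:.0f} ms")
    print(f"400 Gb/s: {over_400g * 1e3:.1f} ms")
```

For this shape the cache is about 262 MB. At 10 Gb/s that is roughly 210 ms before the first decode step can even start; at 400 Gb/s it is closer to 5 ms.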
NIXL (NVIDIA Inference Xfer Library): A vendor-agnostic data movement library that enables zero-copy, point-to-point transfers between GPUs. It uses RDMA to keep the CPU out of the data path, moving KV tensors at close to wire speed.
NVIDIA Dynamo's Smart Router: Unlike a traditional round-robin load balancer, Dynamo's router is LLM-aware. It uses a Global Radix Tree to track which nodes already have specific prompt prefixes cached, routing requests to minimize re-computation.
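The prefix-matching idea can be sketched in a few lines. This is not Dynamo's implementation: it is a simplified token-level trie (a true radix tree would compress chains of single children), and the class names and worker names are hypothetical. A production router would presumably also weigh worker load and cache eviction.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}    # token id -> PrefixNode
        self.workers = set()  # workers holding KV cache for the prefix ending here


class PrefixIndex:
    """Tracks which worker has cached which token prefix."""

    def __init__(self):
        self.root = PrefixNode()

    def record(self, worker, tokens):
        """Note that `worker` now holds KV cache for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())
            node.workers.add(worker)

    def best_match(self, tokens):
        """Return (worker, matched_length) for the longest cached prefix, or (None, 0)."""
        node, best, depth = self.root, (None, 0), 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None or not node.workers:
                break
            depth += 1
            best = (min(node.workers), depth)  # min() only makes ties deterministic
        return best


if __name__ == "__main__":
    index = PrefixIndex()
    index.record("worker-a", [1, 2, 3, 4])  # e.g. a shared system prompt
    index.record("worker-b", [1, 2, 9])
    print(index.best_match([1, 2, 3, 4, 5]))
    print(index.best_match([7]))
```

A request whose first four tokens are already cached on one worker only needs prefill for the remaining tokens if it is routed there. A round-robin balancer has no way to know that.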
4. Operationalizing with Red Hat llm-d
For those of us in the Kubernetes ecosystem, managing this manually is a nightmare. This is why llm-d (distributed inference for LLMs) is becoming the standard well-lit path.
Launched by Red Hat with founding contributors including CoreWeave, Google Cloud, IBM Research, and NVIDIA, llm-d provides the Kubernetes-native orchestration layer for this complexity. It allows SREs and infrastructure engineers to:
Define SLO-based Scaling: Automatically spin up more prefill workers if TTFT exceeds a target such as 100 ms.
Manage the Fleet: Use Custom Resource Definitions (CRDs) to describe disaggregated graphs of vLLM workers.
Ensure Portability: Because it’s built on open standards, it prevents hardware lock-in while still taking advantage of NVIDIA-specific optimizations like NIXL when available.
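The SLO-based scaling rule in the list above could be sketched as follows. This is not llm-d's API; the function name, the replica cap, and the scale-up-only behavior are assumptions for illustration. The proportional formula has the same shape as the one Kubernetes documents for the Horizontal Pod Autoscaler.

```python
import math


def desired_prefill_replicas(current, observed_ttft_ms, target_ttft_ms=100.0, max_replicas=16):
    """Scale the prefill pool in proportion to how far TTFT overshoots its target.

    Hypothetical helper: only scales up, and never beyond max_replicas.
    """
    if observed_ttft_ms <= target_ttft_ms:
        return current  # SLO met, leave the pool alone
    wanted = math.ceil(current * observed_ttft_ms / target_ttft_ms)
    return min(wanted, max_replicas)


if __name__ == "__main__":
    print(desired_prefill_replicas(4, 80.0))    # under target
    print(desired_prefill_replicas(4, 150.0))   # 50% over target
    print(desired_prefill_replicas(4, 1000.0))  # capped by max_replicas
```

Because TTFT is a prefill metric, this rule only touches the prefill pool. A matching rule keyed on TPOT would govern the decode pool.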
The Takeaway
Inference in 2026 is a distributed systems problem that spans high-speed interconnects, intelligent routing, and hardware-specific memory management.
Next time you're troubleshooting latency, don't start with your code. Start with your hardware. Is the bottleneck FLOPS or HBM bandwidth? That single question determines everything else.
📺 Coming up next on StackBytes: A deep dive into NVIDIA Dynamo's Planner. How real-time SLO signals drive automated GPU rebalancing, and why that changes how you think about inference at scale.

