The Problem: Most MLOps content stops at the docker run stage. But in an enterprise environment, a model is only as good as the infrastructure supporting it.

The Goal: Build an inference ecosystem modeled after the high-concurrency architectures used by companies like Netflix and Airbnb.

The Technical Core:

  • Orchestration: Kubernetes (EKS)

  • Serving: NVIDIA Triton (Dynamic Batching enabled)

  • Provisioning: Terraform (100% IaC)

  • Monitoring: Prometheus & Grafana

The High-Level Design: A clean abstraction of the inference lifecycle, from the Model Repository to the Kubernetes Load Balancer.

The Why: I chose NVIDIA Triton for three specific engineering reasons:

  1. Concurrent Model Execution: Running multiple models, or multiple instances of one model, in parallel on a single GPU so that no single model monopolizes it.

  2. Model Control API: Loading and unloading models over the API (with Triton running in explicit model control mode) without restarting the entire pod.

  3. Framework-Agnostic: Serving PyTorch, ONNX, and TensorRT models behind one unified endpoint.
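
To make the Model Control API point concrete, here is a minimal sketch of a load/unload call against Triton's HTTP model repository endpoints. It assumes Triton was started with --model-control-mode=explicit and is reachable on its default HTTP port; the base URL and the model name "resnet50" are placeholders for illustration, not values from this build.

```python
import urllib.request
from urllib.parse import quote

# Assumption: Triton's HTTP endpoint, e.g. exposed via `kubectl port-forward`.
TRITON_URL = "http://localhost:8000"


def model_control_url(base_url: str, model_name: str, action: str) -> str:
    """Build the model repository URL that loads or unloads one model."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    base = base_url.rstrip("/")
    return f"{base}/v2/repository/models/{quote(model_name, safe='')}/{action}"


def build_request(base_url: str, model_name: str, action: str) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    return urllib.request.Request(
        model_control_url(base_url, model_name, action),
        data=b"{}",  # empty JSON body; no extra load parameters
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def control_model(base_url: str, model_name: str, action: str, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_request(base_url, model_name, action)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # "resnet50" is a hypothetical model directory in the model repository.
    print(control_model(TRITON_URL, "resnet50", "load"))
    print(control_model(TRITON_URL, "resnet50", "unload"))
```

Because the pod keeps running while models come and go, a new model version can be swapped in without dropping requests that other models on the same server are handling.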

What I actually learned

This wasn't a "clean" build. To get to a stable state, I had to solve:

  • The GPU Plugin Battle: Configuring the NVIDIA Device Plugin so Kubernetes pods actually "claimed" the hardware.

  • The Mount Point Headache: Debugging why the models wouldn't mount into the pods, which came down to RBAC and PV/PVC mismatches.

  • The Latency Trap: Fine-tuning Triton's scheduler to balance throughput vs. response time.
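
As a sketch of what the Latency Trap tuning touches, the helper below renders the dynamic_batching block of a Triton config.pbtxt. The two fields it writes, preferred_batch_size and max_queue_delay_microseconds, are the main scheduler knobs; the helper name and the numbers in the usage example are illustrative assumptions, not the settings this build ended up with.

```python
from typing import Sequence


def render_dynamic_batching(preferred_batch_sizes: Sequence[int], max_queue_delay_us: int) -> str:
    """Render a dynamic_batching block for a Triton config.pbtxt.

    preferred_batch_sizes: batch sizes the scheduler should try to form.
    max_queue_delay_us: how long a request may wait in the queue for a batch to fill.
    """
    if not preferred_batch_sizes:
        raise ValueError("at least one preferred batch size is required")
    if any(size <= 0 for size in preferred_batch_sizes):
        raise ValueError("batch sizes must be positive")
    if max_queue_delay_us < 0:
        raise ValueError("queue delay cannot be negative")

    sizes = ", ".join(str(size) for size in preferred_batch_sizes)
    return (
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {sizes} ]\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
    )


if __name__ == "__main__":
    # Illustrative values only: a short delay favors response time,
    # a longer one gives the scheduler more time to fill larger batches.
    print(render_dynamic_batching([4, 8], 100))
```

Raising the queue delay lets the scheduler form larger batches and improves GPU throughput, but each request can then wait up to that long before it runs, which is the throughput vs. response time trade-off in practice.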

Under the Hood: The Implementation Reality.

This build was a reminder that MLOps isn't just about the model; it's about the ecosystem that keeps that model alive.
