The Problem: Most MLOps content stops at the docker run stage. But in an enterprise environment, a model is only as good as the infrastructure supporting it.
The Goal: Build an inference ecosystem modeled after the high-concurrency architectures used by companies like Netflix and Airbnb.
The Technical Core:
Orchestration: Kubernetes (EKS)
Serving: NVIDIA Triton (Dynamic Batching enabled)
Provisioning: Terraform (100% IaC)
Monitoring: Prometheus & Grafana
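To make the stack above a little more concrete, here is a minimal sketch of how the serving and orchestration layers could meet in a single Pod manifest, built as a Python dict and printed as JSON (which kubectl accepts). The pod name, labels, and image reference are hypothetical placeholders, not values from this build. The model repository volume is left out for brevity. The nvidia.com/gpu resource name and Triton's default ports (8000 HTTP, 8001 gRPC, 8002 metrics) are the standard ones.

```python
import json


def triton_pod_manifest(image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest for a Triton container that requests GPUs.

    Assumption: the NVIDIA Device Plugin is already running on the node,
    otherwise the nvidia.com/gpu resource is not schedulable.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # Hypothetical name and label, chosen only for illustration.
        "metadata": {"name": "triton-inference", "labels": {"app": "triton"}},
        "spec": {
            "containers": [
                {
                    "name": "triton",
                    "image": image,
                    "command": ["tritonserver"],
                    # /models is an assumed mount path; the volume itself is omitted here.
                    "args": ["--model-repository=/models"],
                    "ports": [
                        {"name": "http", "containerPort": 8000},
                        {"name": "grpc", "containerPort": 8001},
                        {"name": "metrics", "containerPort": 8002},
                    ],
                    # GPUs are requested through limits under the device plugin's resource name.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ]
        },
    }


# Placeholder image reference; substitute the Triton image and tag you actually use.
manifest = triton_pod_manifest("example.registry/tritonserver:placeholder")
print(json.dumps(manifest, indent=2))
```

The metrics port is the one Prometheus would scrape. In a real deployment this would more likely be a Deployment managed through Terraform than a bare Pod.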
The High-Level Design: A clean abstraction of the inference lifecycle, from the Model Repository to the Kubernetes Load Balancer.
The Why: I chose NVIDIA Triton for three specific engineering reasons:
Concurrent Model Execution: Running multiple models on one GPU without resource starvation.
Model Control API: Loading/unloading models via API without restarting the entire pod.
Framework Agnostic: Serving PyTorch, ONNX, and TensorRT under one unified endpoint.
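As an illustration of the Model Control API point, here is a minimal sketch that builds the HTTP requests for loading and unloading a model through Triton's model repository endpoints (POST /v2/repository/models/<name>/load and /unload). It assumes the server was started with --model-control-mode=explicit. The base URL and the model name resnet50 are hypothetical. The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request


def model_control_request(base_url: str, model: str, action: str) -> urllib.request.Request:
    """Build a POST request that asks Triton to load or unload one model."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    # Quote the model name so unusual characters cannot break the URL path.
    name = urllib.parse.quote(model, safe="")
    url = f"{base_url.rstrip('/')}/v2/repository/models/{name}/{action}"
    return urllib.request.Request(
        url,
        data=b"{}",  # empty JSON body; optional parameters are not used in this sketch
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical server address and model name, for illustration only.
req = model_control_request("http://localhost:8000", "resnet50", "load")
print(req.get_method(), req.full_url)
```

Against a running server, passing the request to urllib.request.urlopen(req) would trigger the load without restarting the pod.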
What I actually learned
This wasn't a "clean" build. To get to a stable state, I had to solve:
The GPU Plugin Battle: Configuring the NVIDIA Device Plugin so Kubernetes pods actually "claimed" the hardware.
The Mount Point Headache: Debugging why the model repository volume wouldn't mount into pods, which came down to RBAC and PV/PVC mismatches.
The Latency Trap: Tuning Triton's scheduler to balance throughput vs. response time.
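The latency trade-off above mostly comes down to two settings in the dynamic_batching block of a model's config.pbtxt: preferred_batch_size and max_queue_delay_microseconds. The sketch below renders that block from Python so two profiles can be compared side by side. The specific numbers are illustrative assumptions, not the values used in this build.

```python
from typing import Sequence


def render_dynamic_batching(preferred_batch_sizes: Sequence[int], max_queue_delay_us: int) -> str:
    """Render a dynamic_batching block for a Triton config.pbtxt."""
    sizes = ", ".join(str(size) for size in preferred_batch_sizes)
    return (
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {sizes} ]\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
    )


# Illustrative values only: a short queue delay favors response time,
# a longer delay lets the scheduler form larger batches for throughput.
latency_leaning = render_dynamic_batching([4, 8], 100)
throughput_leaning = render_dynamic_batching([16, 32], 5000)

print(latency_leaning)
print(throughput_leaning)
```

A larger queue delay gives the scheduler more time to fill a preferred batch size, which tends to raise GPU utilization at the cost of extra per-request wait. The right point depends on the model and traffic pattern, so it is worth measuring rather than guessing.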
Under the Hood: The Implementation Reality.

This build was a reminder that MLOps isn't just about the model; it's about the ecosystem that keeps that model alive.
