I'm building an open-source LLM inference platform with intelligent routing.
The goal: Reduce inference costs by 60% without sacrificing quality for complex queries.
The approach: Use tiered model routing - fast/cheap models for the majority of traffic, larger models reserved for hard cases.
This isn't theoretical. The approach is validated by:
Berkeley's RouteLLM paper: roughly 40-85% cost reduction reported on benchmarks, depending on the task
vLLM research: 2-4× throughput improvements through PagedAttention
Production systems at Anthropic/OpenAI: Multi-model routing at scale
But most companies don't have the infrastructure team to build this themselves.
The Problem: The Always-Large-Model Trap
Most companies deploy ONE large model for everything.
Typical deployment:
Every query → LLaMA 70B
Cost per 1K tokens: $0.0060
Average query: 500 tokens (input + output)
2M queries/day
Daily cost: $6,000
Monthly cost: $180,000
Annual run rate: $2.16M
And this is a medium-sized deployment.
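The arithmetic behind these figures can be checked in a few lines. This is a sketch under stated assumptions: it treats $0.0060 as a price per 1K tokens (the unit under which the daily total comes to $6,000), uses a 30-day month, and takes the annual run rate as 12 such months. The function and variable names are illustrative only.

```python
# Baseline: every query goes to the large model.
PRICE_PER_1K_TOKENS = 0.0060   # USD, assumed to be a per-1K-token rate
TOKENS_PER_QUERY = 500         # input + output
QUERIES_PER_DAY = 2_000_000


def baseline_costs(price_per_1k, tokens_per_query, queries_per_day):
    """Return (daily, monthly, annual) cost in USD for a single-model deployment."""
    tokens_per_day = tokens_per_query * queries_per_day
    daily = tokens_per_day / 1_000 * price_per_1k
    monthly = daily * 30   # assumed 30-day month
    annual = monthly * 12  # run rate: 12 such months
    return daily, monthly, annual


daily, monthly, annual = baseline_costs(
    PRICE_PER_1K_TOKENS, TOKENS_PER_QUERY, QUERIES_PER_DAY
)
print(f"daily=${daily:,.0f} monthly=${monthly:,.0f} annual=${annual:,.0f}")
```

At 1 billion tokens per day, this reproduces the $6,000 daily, $180,000 monthly, and $2.16M annual figures.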
Deploying one model is easy:
# Deploy once, done
vllm serve meta-llama/Llama-2-70b-hf
Adding routing adds complexity:
Which model for which query?
How to classify complexity?
What if the router is wrong?
How to prove savings?
What if the small model gives bad responses?
This is a legitimate concern. But research shows:
Roughly 60-70% of queries are "simple" (Q&A, summaries, classification)
Small models (7B params) handle these well
You're over-provisioning for the majority of your traffic.
Lack of Tools
There's no "npm install llm-router" solution.
You need to build:
Query classification logic
Model serving infrastructure
Cost tracking
Observability
Fallback mechanisms
That's what I'm building.
Intelligent Multi-Model Routing
Not every request deserves the biggest model.
Most user queries are short, direct, and predictable. Sending all of them to a high-capacity model wastes GPU time and drives up cost. Instead, the platform uses tiered routing: straightforward requests go to smaller, low-latency models, while genuinely complex queries are routed to larger reasoning models.
This keeps GPUs busy doing useful work, lowers the average cost per request, and still preserves quality for the cases that actually need deeper reasoning.
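To make "lowers the average cost per request" concrete, here is a rough blended-cost estimate. It is a sketch under stated assumptions, not a measured result: it uses the $0.0008 and $0.0060 rates listed under Key Components, assumes queries on both tiers have the same token count, and ignores router overhead and fallback retries. The function names are illustrative.

```python
def blended_rate(small_rate, large_rate, small_share):
    """Average price per token unit when `small_share` of traffic uses the small model."""
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    return small_share * small_rate + (1.0 - small_share) * large_rate


def savings_vs_large_only(small_rate, large_rate, small_share):
    """Fractional cost reduction relative to sending everything to the large model."""
    return 1.0 - blended_rate(small_rate, large_rate, small_share) / large_rate


SMALL_RATE = 0.0008  # USD per token unit, from the post
LARGE_RATE = 0.0060  # same unit as SMALL_RATE; only the ratio matters here

for share in (0.60, 0.70):
    pct = savings_vs_large_only(SMALL_RATE, LARGE_RATE, share) * 100
    print(f"{share:.0%} to small model -> {pct:.1f}% cheaper")
```

Under these assumptions, a 70% small-model share works out to about a 61% reduction, in line with the 60% goal, while a 60% share gives about 52%. Any request that falls back to the large model pays for both tiers, so real savings would be somewhat lower.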
The Architecture Decisions So Far

Key Components
1. Smart Router
Analyzes query complexity using embeddings
Adds rule-based heuristics for obvious cases
Makes decision in <10ms
Tracks cost per request
2. Small Model (Mistral family)
Handles 60-70% of queries
250 tokens/sec throughput on H100
$0.0008 per 1K tokens
Perfect for: Q&A, summaries, simple tasks
3. Large Model (LLaMA family)
Handles 30-40% of queries
144 tokens/sec throughput on H100
$0.0060 per 1K tokens
Perfect for: reasoning, code, analysis
4. Confidence-Based Fallback
Small model returns response + confidence score
If confidence < threshold (0.65), route to large model
Catches 85-90% of routing errors
Only 15-20% fallback rate
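The router and fallback steps above might be wired together roughly as follows. This is a minimal sketch, not the platform's implementation: the keyword list, the 40-word cutoff, the `ModelFn` callables, and the way a confidence score is produced are all hypothetical placeholders. Only the 0.65 threshold comes from the text. The embedding-based classifier is left out here and replaced by rule-based heuristics alone.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.65  # from the post
# Hypothetical markers of "hard" queries; a real router would tune or learn these.
HARD_KEYWORDS = ("prove", "debug", "refactor", "step by step", "analyze", "derive")
MAX_SIMPLE_WORDS = 40  # assumed cutoff, not from the post

# A model is any callable returning (response_text, confidence in [0, 1]).
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class RoutedResponse:
    text: str
    model: str       # "small" or "large"
    fell_back: bool  # True if the small model answered first but was overridden


def classify(query: str) -> str:
    """Rule-based tier choice: long queries or hard keywords go to the large model."""
    lowered = query.lower()
    if len(lowered.split()) > MAX_SIMPLE_WORDS:
        return "large"
    if any(keyword in lowered for keyword in HARD_KEYWORDS):
        return "large"
    return "small"


def answer(query: str, small: ModelFn, large: ModelFn,
           threshold: float = CONFIDENCE_THRESHOLD) -> RoutedResponse:
    """Route the query, escalating to the large model on low small-model confidence."""
    if classify(query) == "large":
        text, _ = large(query)
        return RoutedResponse(text, "large", fell_back=False)
    text, confidence = small(query)
    if confidence < threshold:
        text, _ = large(query)
        return RoutedResponse(text, "large", fell_back=True)
    return RoutedResponse(text, "small", fell_back=False)


# Stub models standing in for real inference endpoints.
def small_stub(query: str) -> Tuple[str, float]:
    return ("small answer", 0.9 if "capital" in query else 0.4)


def large_stub(query: str) -> Tuple[str, float]:
    return ("large answer", 0.95)


print(answer("What is the capital of France?", small_stub, large_stub))
print(answer("Summarize this memo", small_stub, large_stub))
print(answer("Debug this stack trace", small_stub, large_stub))
```

With these stubs, the first query stays on the small model, the second falls back because the stub's confidence is below the threshold, and the third goes straight to the large model. Recording `fell_back` on every response is what would make the fallback rate measurable later.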
Research Validation
This isn't a new idea. It's validated by multiple research papers and production systems.
What I'm Building Differently
While RouteLLM is research-focused, I'm building for production:
1. Full Observability
Prometheus metrics for routing decisions
Grafana dashboards showing cost savings
Per-request cost tracking
Routing accuracy monitoring
2. Production-Grade Infrastructure
Kubernetes deployment (EKS)
Terraform IaC for reproducibility
Multi-environment support (dev/prod)
CI/CD automation with GitHub Actions
3. Simple, Maintainable Design
Flat architecture (no premature microservices)
Co-located related code
Tests at every level
Clear documentation
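Before Prometheus and Grafana are wired in, the per-request cost tracking listed under Full Observability can be prototyped with in-process counters. The sketch below is one possible shape for that, with assumptions stated plainly: the `CostTracker` class and its fields are hypothetical, rates are supplied by the caller and treated as USD per 1,000 tokens, and savings are measured against sending the same tokens to the large model.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CostTracker:
    """Accumulates token usage per model and compares spend to a large-only baseline."""
    rates: dict                    # model name -> USD per 1,000 tokens (caller-supplied)
    baseline_model: str = "large"
    tokens: Counter = field(default_factory=Counter)
    requests: Counter = field(default_factory=Counter)

    def record(self, model: str, n_tokens: int) -> float:
        """Record one request and return its cost in USD."""
        if model not in self.rates:
            raise KeyError(f"unknown model: {model}")
        self.tokens[model] += n_tokens
        self.requests[model] += 1
        return n_tokens / 1_000 * self.rates[model]

    def total_cost(self) -> float:
        """Total spend so far across all models."""
        return sum(n / 1_000 * self.rates[m] for m, n in self.tokens.items())

    def baseline_cost(self) -> float:
        """What the same tokens would have cost on the baseline model alone."""
        return sum(self.tokens.values()) / 1_000 * self.rates[self.baseline_model]

    def savings_fraction(self) -> float:
        """Fraction saved versus the baseline; 0.0 when nothing has been recorded."""
        baseline = self.baseline_cost()
        return 0.0 if baseline == 0 else 1.0 - self.total_cost() / baseline


tracker = CostTracker(rates={"small": 0.0008, "large": 0.0060})
for _ in range(7):
    tracker.record("small", 500)
for _ in range(3):
    tracker.record("large", 500)
print(f"spent ${tracker.total_cost():.4f}, saved {tracker.savings_fraction():.1%}")
```

The same counters map naturally onto Prometheus counters later, with the model name as a label.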
What I’m Exploring Next
Over the coming weeks, I’ll continue iterating on the platform, capturing what works, what doesn’t, and why.
Current areas of focus:
vLLM in practice
Observing real-world behavior of batching, memory usage, and throughput
Identifying configuration levers that meaningfully impact latency and GPU utilization
Routing logic
Evolving from simple heuristics to embedding-based signals
Testing failure points and understanding when escalation is needed
Operational visibility
Tracking end-to-end latency, token throughput, and GPU utilization
Building basic cost visibility linked to actual traffic patterns
Deployment lessons
Evaluating Kubernetes and Terraform choices that helped—or hindered—setup
Documenting failure modes and tradeoffs to reconsider in future iterations
This is a learning-focused project: the goal is to understand system behavior, not to claim perfection.
Following Along
Newsletter: Weekly updates in StackBytes
Technical deep dives as I build
Benchmarks as I collect real data
Honest documentation of failures and pivots
This platform tackles real problems—runaway LLM costs—using production-grade patterns (K8s, Terraform, observability). I’m building it, learning from it, and documenting everything along the way.
Let's build it.
