I'm building an open-source LLM inference platform with intelligent routing.

The goal: Reduce inference costs by 60% without sacrificing quality for complex queries.

The approach: Use tiered model routing - fast/cheap models for the majority of traffic, larger models reserved for hard cases.

This isn't theoretical. The approach is validated by:

Berkeley's RouteLLM paper: 40-85% cost reduction reported on benchmarks
vLLM research: 2-4× throughput improvements through PagedAttention
Production systems at Anthropic/OpenAI: Multi-model routing at scale

But most companies don't have the infrastructure team to build this themselves.

The Problem: The Always-Large-Model Trap

Most companies deploy ONE large model for everything.

Typical deployment:

Every query → LLaMA 70B
Cost per 1K tokens: $0.0060

Average query: 500 tokens (input + output)
2M queries/day
Daily cost: $6,000
Monthly cost: $180,000
Annual run rate: $2.16M

And this is a medium-sized deployment.
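
To make the arithmetic above easy to check, here's a minimal sketch of the baseline calculation. It assumes the $0.0060 rate applies per 1,000 tokens, since that is the unit that reproduces the $6,000/day figure, and a 30-day month. The function and constant names are mine, for illustration only.

```python
# Baseline: every query goes to the large model.
# Assumption: $0.0060 is the price per 1,000 tokens (the unit that
# reproduces the $6,000/day figure) and a month has 30 days.

PRICE_PER_1K_TOKENS = 0.0060   # USD, large model
TOKENS_PER_QUERY = 500         # input + output
QUERIES_PER_DAY = 2_000_000


def daily_cost(price_per_1k: float, tokens_per_query: int, queries_per_day: int) -> float:
    """Return the daily spend in USD for a single-model deployment."""
    tokens_per_day = tokens_per_query * queries_per_day
    return tokens_per_day / 1_000 * price_per_1k


daily = daily_cost(PRICE_PER_1K_TOKENS, TOKENS_PER_QUERY, QUERIES_PER_DAY)
monthly = daily * 30
annual = monthly * 12

print(f"daily=${daily:,.0f} monthly=${monthly:,.0f} annual=${annual:,.0f}")
```

Changing any one of the three inputs shows how quickly the run rate scales with traffic or response length.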

Deploying one model is easy:

# Deploy once, done
vllm serve meta-llama/Llama-2-70b-hf

Adding routing adds complexity:

Which model for which query?
How to classify complexity?
What if the router is wrong?
How to prove savings?

What if the small model gives bad responses?

This is a legitimate concern. But research shows:

70% of queries are "simple" (Q&A, summaries, classification)
Small models (7B params) handle these well

You're over-provisioning for 70% of your traffic.

Lack of Tools

There's no "npm install llm-router" solution.

You need to build:

Query classification logic
Model serving infrastructure
Cost tracking
Observability
Fallback mechanisms

That's what I'm building.

Intelligent Multi-Model Routing

Not every request deserves the biggest model.

Most user queries are short, direct, and predictable. Sending all of them to a high-capacity model wastes GPU time and drives up cost. Instead, the platform uses tiered routing: straightforward requests go to smaller, low-latency models, while genuinely complex queries are routed to larger reasoning models.

This keeps GPUs busy doing useful work, lowers the average cost per request, and still preserves quality for the cases that actually need deeper reasoning.

Architecture Decisions So Far

Key Components

1. Smart Router

Analyzes query complexity using embeddings
Adds rule-based heuristics for obvious cases
Makes decision in <10ms
Tracks cost per request
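
As a rough illustration of the rule-based half of the router, here's a minimal sketch. The keyword lists, the 150-word cutoff, and the `route` / `RoutingDecision` names are all hypothetical placeholders, not the platform's actual rules, and the embedding-based signal is left out entirely.

```python
import re
from dataclasses import dataclass

# Hypothetical hints; real rules would be tuned on actual traffic.
REASONING_HINTS = re.compile(
    r"\b(prove|derive|debug|refactor|step[- ]by[- ]step|analy[sz]e|compare)\b",
    re.IGNORECASE,
)
CODE_HINTS = re.compile(r"```|\bdef \w+\(|#include")


@dataclass(frozen=True)
class RoutingDecision:
    tier: str    # "small" or "large"
    reason: str  # why the rule fired, useful for later auditing


def route(query: str, max_simple_words: int = 150) -> RoutingDecision:
    """Pick a model tier using cheap, obvious-case rules only."""
    if CODE_HINTS.search(query):
        return RoutingDecision("large", "contains code")
    if REASONING_HINTS.search(query):
        return RoutingDecision("large", "reasoning keyword")
    if len(query.split()) > max_simple_words:
        return RoutingDecision("large", "long prompt")
    return RoutingDecision("small", "default")


print(route("What is the capital of France?"))
print(route("Debug this function step by step"))
```

Rules like these cost microseconds, so they fit comfortably inside a <10ms budget. Recording the `reason` alongside each decision makes it possible to audit which rule misroutes most often.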

2. Small Model (Mistral family)

Handles 60-70% of queries
250 tokens/sec throughput on H100
$0.0008 per 1K tokens
Perfect for: Q&A, summaries, simple tasks

3. Large Model (LLaMA family)

Handles 30-40% of queries
144 tokens/sec throughput on H100
$0.0060 per 1K tokens
Perfect for: reasoning, code, analysis
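
With both tier prices on the table, the blended cost can be sketched in a few lines. This is a back-of-the-envelope model, not a measurement. It assumes both prices use the same token unit, that every query uses the same number of tokens on either model, and that an escalated query pays for both models. The function names are illustrative.

```python
# Assumption: both prices share the same unit (per 1,000 tokens here).
SMALL_PRICE = 0.0008  # USD, small model
LARGE_PRICE = 0.0060  # USD, large model


def blended_price(small_share: float, fallback_rate: float = 0.0) -> float:
    """Average price per unit of tokens under tiered routing.

    small_share:   fraction of all queries first sent to the small model.
    fallback_rate: fraction of those small-model queries that are re-run
                   on the large model (and therefore billed twice).
    """
    large_share = 1.0 - small_share
    escalated = small_share * fallback_rate
    return small_share * SMALL_PRICE + (large_share + escalated) * LARGE_PRICE


def savings(small_share: float, fallback_rate: float = 0.0) -> float:
    """Fractional saving versus sending everything to the large model."""
    return 1.0 - blended_price(small_share, fallback_rate) / LARGE_PRICE


for rate in (0.0, 0.15, 0.20):
    print(f"small_share=0.70 fallback_rate={rate:.2f} savings={savings(0.70, rate):.1%}")
```

Under these assumptions, a 70/30 split with no escalation lands close to the 60% goal, and every escalated query eats into that, so the fallback rate is worth tracking as closely as the routing split.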

4. Confidence-Based Fallback

Small model returns response + confidence score
If confidence < threshold (0.65), route to large model
Catches 85-90% of routing errors
Only 15-20% fallback rate
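
The fallback flow above can be sketched as follows. The models are passed in as plain callables so the example runs without a GPU. How the small model produces its confidence score is not specified here, so the stubs simply fake one, and `Answer`, `answer_with_fallback`, and the stub functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.65  # threshold from the list above

# A small model returns (text, confidence); a large model returns text.
SmallModel = Callable[[str], Tuple[str, float]]
LargeModel = Callable[[str], str]


@dataclass(frozen=True)
class Answer:
    text: str
    served_by: str   # "small" or "large"
    escalated: bool  # True if the small model's answer was discarded


def answer_with_fallback(
    query: str,
    small: SmallModel,
    large: LargeModel,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Answer:
    """Try the small model first; escalate when confidence is below threshold."""
    text, confidence = small(query)
    if confidence < threshold:
        return Answer(large(query), "large", True)
    return Answer(text, "small", False)


# Stubs for illustration only: pretend short queries are easy.
def small_stub(query: str) -> Tuple[str, float]:
    return "small answer", 0.9 if len(query.split()) < 10 else 0.4


def large_stub(query: str) -> str:
    return "large answer"


print(answer_with_fallback("What is vLLM?", small_stub, large_stub))
```

Keeping the `escalated` flag on the result is what makes the fallback rate measurable later.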

Research Validation

This isn't a new idea. It's validated by multiple research papers and production systems.

What I'm Building Differently

While RouteLLM is research-focused, I'm building for production:

1. Full Observability

Prometheus metrics for routing decisions
Grafana dashboards showing cost savings
Per-request cost tracking
Routing accuracy monitoring
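
To make the metrics concrete, here's a dependency-free sketch of the counters I have in mind, rendered in the Prometheus text exposition format. In a real service the official Python client library would handle this. The metric names (`router_requests_total`, `router_cost_usd_total`) and the `RoutingMetrics` class are hypothetical, and the price passed to `record` is assumed to be per 1,000 tokens.

```python
from collections import Counter


class RoutingMetrics:
    """In-memory counters for routing decisions and per-tier spend."""

    def __init__(self) -> None:
        self.requests = Counter()  # keyed by (tier, escalated)
        self.cost_usd = Counter()  # keyed by tier

    def record(self, tier: str, escalated: bool, tokens: int, price_per_1k: float) -> None:
        """Count one request and add its cost (price assumed per 1,000 tokens)."""
        self.requests[(tier, escalated)] += 1
        self.cost_usd[tier] += tokens / 1_000 * price_per_1k

    def render(self) -> str:
        """Render all counters in Prometheus text exposition format."""
        lines = [
            "# HELP router_requests_total Requests served, by tier and escalation.",
            "# TYPE router_requests_total counter",
        ]
        for (tier, escalated), count in sorted(self.requests.items()):
            flag = str(escalated).lower()
            lines.append(f'router_requests_total{{tier="{tier}",escalated="{flag}"}} {count}')
        lines += [
            "# HELP router_cost_usd_total Cumulative spend in USD, by tier.",
            "# TYPE router_cost_usd_total counter",
        ]
        for tier, cost in sorted(self.cost_usd.items()):
            lines.append(f'router_cost_usd_total{{tier="{tier}"}} {cost:.6f}')
        return "\n".join(lines) + "\n"


metrics = RoutingMetrics()
metrics.record("small", False, tokens=500, price_per_1k=0.0008)
metrics.record("large", True, tokens=500, price_per_1k=0.0060)
print(metrics.render())
```

With these two counters, a dashboard can derive the routing split, the escalation rate, and the spend per tier.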

2. Production-Grade Infrastructure

Kubernetes deployment (EKS)
Terraform IaC for reproducibility
Multi-environment support (dev/prod)
CI/CD automation with GitHub Actions

3. Simple, Maintainable Design

Flat architecture (no premature microservices)
Co-located related code
Tests at every level
Clear documentation

What I’m Exploring Next

Over the coming weeks, I’ll continue iterating on the platform, capturing what works, what doesn’t, and why.

Current areas of focus:

vLLM in practice

Observing real-world behavior of batching, memory usage, and throughput
Identifying configuration levers that meaningfully impact latency and GPU utilization

Routing logic

Evolving from simple heuristics to embedding-based signals
Testing failure points and understanding when escalation is needed

Operational visibility

Tracking end-to-end latency, token throughput, and GPU utilization
Building basic cost visibility linked to actual traffic patterns

Deployment lessons

Evaluating Kubernetes and Terraform choices that helped—or hindered—setup
Documenting failure modes and tradeoffs to reconsider in future iterations

This is a learning-focused project: the goal is to understand system behavior, not to claim perfection.

Following Along

Newsletter: Weekly updates in StackBytes

Technical deep dives as I build
Benchmarks as I collect real data
Honest documentation of failures and pivots

This platform tackles real problems—runaway LLM costs—using production-grade patterns (K8s, Terraform, observability). I’m building it, learning from it, and documenting everything along the way.

Let's build it.

Building an LLM inference platform with intelligent routing