I'm building an open-source LLM inference platform with intelligent routing.

The goal: Reduce inference costs by 60% without sacrificing quality for complex queries.

The approach: Use tiered model routing - fast/cheap models for the majority of traffic, larger models reserved for hard cases.

This isn't theoretical. The approach is validated by:

  • Berkeley's RouteLLM paper: reported cost reductions of roughly 40-85% on benchmarks, at close to strong-model quality

  • vLLM research: 2-4× throughput improvements through PagedAttention

  • Production systems at Anthropic/OpenAI: Multi-model routing at scale

But most companies don't have the infrastructure team to build this themselves.

The Problem: The Always-Large-Model Trap

Most companies deploy ONE large model for everything.

Typical deployment:

Every query → LLaMA 70B
Cost per 1K tokens: $0.0060 ($6.00 per 1M tokens)
  • Average query: 500 tokens (input + output)

  • 2M queries/day

  • Daily cost: $6,000

  • Monthly cost: $180,000

  • Annual run rate: $2.16M

And this is a medium-sized deployment.
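To make the arithmetic concrete, here is a small Python sketch of the baseline. It assumes the $0.0060 price applies per 1K tokens, since that is the unit under which 500 tokens per query and 2M queries per day come out to $6,000 per day. The function name and the 30-day month are illustrative choices, not part of any library.

```python
# Baseline cost of sending every query to the large model.
# Assumption: $0.0060 is the price per 1K tokens (the unit that reproduces
# the $6,000/day figure from 500 tokens/query and 2M queries/day).

PRICE_PER_1K_TOKENS = 0.0060   # USD, large model
TOKENS_PER_QUERY = 500         # input + output
QUERIES_PER_DAY = 2_000_000


def daily_cost(price_per_1k: float, tokens_per_query: int, queries_per_day: int) -> float:
    """Return the daily spend in USD for a single-model deployment."""
    tokens_per_day = tokens_per_query * queries_per_day
    return tokens_per_day / 1_000 * price_per_1k


daily = daily_cost(PRICE_PER_1K_TOKENS, TOKENS_PER_QUERY, QUERIES_PER_DAY)
monthly = daily * 30    # assumes a 30-day month
annual = monthly * 12   # annual run rate

print(f"daily=${daily:,.0f} monthly=${monthly:,.0f} annual=${annual:,.0f}")
```

Run as written, this reproduces the figures above: $6,000 per day, $180,000 per month, $2,160,000 per year.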

Deploying one model is easy:

# Deploy once, done
vllm serve meta-llama/Llama-2-70b-hf

Adding routing adds complexity:

  • Which model for which query?

  • How to classify complexity?

  • What if the router is wrong?

  • How to prove savings?

What if the small model gives bad responses?

This is a legitimate concern. But research shows:

  • Roughly 60-70% of queries are "simple" (Q&A, summaries, classification)

  • Small models (7B params) handle these well

You're over-provisioning for most of your traffic.

Lack of Tools

There's no "npm install llm-router" solution.

You need to build:

  • Query classification logic

  • Model serving infrastructure

  • Cost tracking

  • Observability

  • Fallback mechanisms

That's what I'm building.

Intelligent Multi-Model Routing

Not every request deserves the biggest model.

Most user queries are short, direct, and predictable. Sending all of them to a high-capacity model wastes GPU time and drives up cost. Instead, the platform uses tiered routing: straightforward requests go to smaller, low-latency models, while genuinely complex queries are routed to larger reasoning models.

This keeps GPUs busy doing useful work, lowers the average cost per request, and still preserves quality for the cases that actually need deeper reasoning.

The Architecture Decisions So Far

Key Components

1. Smart Router

  • Analyzes query complexity using embeddings

  • Adds rule-based heuristics for obvious cases

  • Makes a routing decision in <10 ms

  • Tracks cost per request
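As a rough illustration of how heuristics and an embedding signal can combine, here is a minimal sketch. Everything in it is an assumption for demonstration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the seed prompts, the 0.5 threshold, and the heuristic rules are hypothetical, and none of it is the platform's actual router.

```python
import hashlib
import math
import re

DIM = 256  # toy embedding width (hypothetical)


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words unit vector; a stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity because inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical seed prompts that should go to the large model.
COMPLEX_EXAMPLES = [
    "prove that the algorithm terminates and analyze its complexity",
    "debug this python function and explain the race condition",
    "compare the tradeoffs of these two database designs step by step",
]
COMPLEX_VECS = [toy_embed(t) for t in COMPLEX_EXAMPLES]

# Obvious "this is code" markers: a code fence, a traceback, a function def.
FENCE = "`" * 3
CODE_HINT = re.compile(
    re.escape(FENCE) + r"|Traceback \(most recent call last\)|\bdef \w+\("
)


def route(query: str, threshold: float = 0.5) -> str:
    """Return "small" or "large" for a query."""
    # Rule-based heuristics handle the obvious cases first.
    if CODE_HINT.search(query) or len(query) > 2000:
        return "large"
    if len(query.split()) <= 3:
        return "small"
    # Embedding signal: similarity to the nearest known-complex example.
    score = max(cosine(toy_embed(query), v) for v in COMPLEX_VECS)
    return "large" if score >= threshold else "small"


print(route("hi there"))
print(route("def add(a, b): return a + b"))
```

The heuristics run before the embedding lookup so that trivially classifiable requests never pay for it. With a real embedding model, the nearest-example comparison would likely be replaced by a trained classifier or a vector index.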

2. Small Model (Mistral family)

  • Handles 60-70% of queries

  • 250 tokens/sec throughput on H100

  • $0.0008 per 1K tokens

  • Perfect for: Q&A, summaries, simple tasks

3. Large Model (LLaMA family)

  • Handles 30-40% of queries

  • 144 tokens/sec throughput on H100

  • $0.0060 per 1K tokens

  • Perfect for: reasoning, code, analysis
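With both tier prices on the table, the expected savings can be estimated with a short calculation. This is a sketch under simplifying assumptions: every request uses the same number of tokens, and `fallback_rate` (a hypothetical parameter) is the fraction of small-tier requests that get re-run on the large model and therefore pay for both calls. The result depends only on the ratio of the two prices, so the token unit does not matter.

```python
SMALL_PRICE = 0.0008  # USD per unit of tokens, small tier
LARGE_PRICE = 0.0060  # USD per the same unit, large tier


def blended_price(small_share: float, fallback_rate: float = 0.0) -> float:
    """Average price per token unit across both tiers.

    small_share:   fraction of all requests first sent to the small model
    fallback_rate: fraction of those small-tier requests that are re-run on
                   the large model (they pay for both calls)
    """
    if not (0.0 <= small_share <= 1.0 and 0.0 <= fallback_rate <= 1.0):
        raise ValueError("shares and rates must be between 0 and 1")
    large_share = (1.0 - small_share) + small_share * fallback_rate
    return small_share * SMALL_PRICE + large_share * LARGE_PRICE


def savings(small_share: float, fallback_rate: float = 0.0) -> float:
    """Fractional savings compared with sending everything to the large tier."""
    return 1.0 - blended_price(small_share, fallback_rate) / LARGE_PRICE


for share in (0.60, 0.70):
    print(f"small_share={share:.0%} savings={savings(share):.1%}")
```

Under these prices, a 60% small-tier share saves about 52% and a 70% share about 61%, before any escalation. If 15% of small-tier requests are re-run on the large model, the 70% case drops to roughly 50%, so the 60% goal is sensitive to both the traffic split and the escalation rate.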

4. Confidence-Based Fallback

  • Small model returns response + confidence score

  • If confidence < threshold (0.65), route to large model

  • Catches 85-90% of routing errors

  • Only 15-20% fallback rate
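The escalation rule itself is small. Here is a minimal sketch, assuming each model is exposed as a callable that returns a response plus a confidence score. The `Completion` type, the stub models, and the way confidence is produced are all hypothetical; only the 0.65 threshold comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Completion:
    text: str
    confidence: float  # 0.0 to 1.0, however the serving layer derives it
    model: str         # which tier produced the answer


Model = Callable[[str], Completion]

CONFIDENCE_THRESHOLD = 0.65


def answer_with_fallback(
    query: str,
    small: Model,
    large: Model,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Completion:
    """Try the small model first; escalate when its confidence is too low."""
    first = small(query)
    if first.confidence >= threshold:
        return first
    # Confidence below threshold: discard the draft and ask the large model.
    return large(query)


# Stub models for demonstration only; real ones would call an inference server.
def stub_small(query: str) -> Completion:
    confidence = 0.9 if len(query.split()) < 8 else 0.4
    return Completion("short answer", confidence, "small")


def stub_large(query: str) -> Completion:
    return Completion("detailed answer", 0.95, "large")


print(answer_with_fallback("what is vLLM?", stub_small, stub_large).model)
```

Note that an escalated request pays for both calls, which is why the fallback rate matters for the overall savings.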

Research Validation

This isn't a new idea. It's validated by multiple research papers and production systems.

What I'm Building Differently

While RouteLLM is research-focused, I'm building for production:

1. Full Observability

  • Prometheus metrics for routing decisions

  • Grafana dashboards showing cost savings

  • Per-request cost tracking

  • Routing accuracy monitoring
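To give a feel for what per-request cost tracking might look like, here is a stdlib-only sketch that accumulates counters and renders them in the Prometheus text exposition format. The metric names, labels, and the per-1K-token pricing are assumptions for illustration; a real deployment would more likely use an existing Prometheus client library than hand-rolled rendering.

```python
from collections import defaultdict


class RoutingMetrics:
    """In-memory counters for routing decisions and spend (illustrative only)."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)    # (tier, escalated) -> request count
        self.cost_usd = defaultdict(float)  # tier -> cumulative USD

    def observe(self, tier: str, tokens: int, price_per_1k: float,
                escalated: bool = False) -> None:
        """Record one request served by `tier`."""
        self.requests[(tier, escalated)] += 1
        self.cost_usd[tier] += tokens / 1_000 * price_per_1k

    def render(self) -> str:
        """Return the counters in Prometheus text exposition format."""
        lines = ["# TYPE router_requests_total counter"]
        for (tier, escalated), count in sorted(self.requests.items()):
            flag = str(escalated).lower()
            lines.append(
                f'router_requests_total{{tier="{tier}",escalated="{flag}"}} {count}'
            )
        lines.append("# TYPE router_cost_usd_total counter")
        for tier, usd in sorted(self.cost_usd.items()):
            lines.append(f'router_cost_usd_total{{tier="{tier}"}} {usd:.6f}')
        return "\n".join(lines) + "\n"


metrics = RoutingMetrics()
metrics.observe("small", tokens=500, price_per_1k=0.0008)
metrics.observe("large", tokens=500, price_per_1k=0.0060, escalated=True)
print(metrics.render())
```

Labelling requests by tier and by whether they were escalated is what lets a dashboard show savings and fallback rate side by side.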

2. Production-Grade Infrastructure

  • Kubernetes deployment (EKS)

  • Terraform IaC for reproducibility

  • Multi-environment support (dev/prod)

  • CI/CD automation with GitHub Actions

3. Simple, Maintainable Design

  • Flat architecture (no premature microservices)

  • Co-located related code

  • Tests at every level

  • Clear documentation

What I’m Exploring Next

Over the coming weeks, I’ll continue iterating on the platform, capturing what works, what doesn’t, and why.

Current areas of focus:

vLLM in practice

  • Observing real-world behavior of batching, memory usage, and throughput

  • Identifying configuration levers that meaningfully impact latency and GPU utilization

Routing logic

  • Evolving from simple heuristics to embedding-based signals

  • Testing failure points and understanding when escalation is needed

Operational visibility

  • Tracking end-to-end latency, token throughput, and GPU utilization

  • Building basic cost visibility linked to actual traffic patterns

Deployment lessons

  • Evaluating Kubernetes and Terraform choices that helped—or hindered—setup

  • Documenting failure modes and tradeoffs to reconsider in future iterations

This is a learning-focused project: the goal is to understand system behavior, not to claim perfection.

Following Along

Newsletter: Weekly updates in StackBytes

  • Technical deep dives as I build

  • Benchmarks as I collect real data

  • Honest documentation of failures and pivots

This platform tackles a real problem, runaway LLM costs, using production-grade patterns (K8s, Terraform, observability). I’m building it, learning from it, and documenting everything along the way.

Let's build it.
