TorchRouter Documentation Hub
Commercial launch docs plus the operator reference behind them.
Start with the Control Audit and Assisted Pilot outlines if you are evaluating the commercial motion. Use the technical sections below if you need the routing, deployment, and observability reference behind the control-plane story.
Overview
What TorchRouter is doing under the hood.
TorchRouter routes agent traffic through a controlled lane system rather than letting every request hit the same model path. The classifier scores the request, the queue decides whether a local lane can take it now, the ranker compares local candidates, and LiteLLM executes the selected backend.
The goal is local-first when the fleet can carry it, with explicit fallback when it cannot. That keeps operational behavior predictable: lane selection is visible, queueing is bounded, and observability is built into the response path.
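The stage order described above can be sketched as a small pipeline. This is an illustrative outline only: the function names, the `Decision` shape, and the stand-in callables are assumptions made for the example, and the final LiteLLM execution step is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    tier: str      # "H", "R", or "C"
    backend: str   # name of the chosen backend (hypothetical naming)
    local: bool    # True when a local lane took the request


def route(
    request: dict,
    classify: Callable[[dict], str],
    queue_has_room: Callable[[str], bool],
    rank_local: Callable[[str, dict], Optional[str]],
    cloud_fallback: Callable[[str], str],
) -> Decision:
    """Classifier, then queue, then ranker, then an explicit fallback."""
    tier = classify(request)
    if queue_has_room(tier):
        candidate = rank_local(tier, request)
        if candidate is not None:
            return Decision(tier, candidate, local=True)
    # The local fleet cannot carry the request, so fall back explicitly.
    return Decision(tier, cloud_fallback(tier), local=False)


# Stand-in stages, used only to exercise the control flow.
local_decision = route(
    {"agent": "webmaster"},
    classify=lambda req: "R",
    queue_has_room=lambda tier: True,
    rank_local=lambda tier, req: "qwen3:8b",
    cloud_fallback=lambda tier: "cloud-default",
)
busy_decision = route(
    {"agent": "webmaster"},
    classify=lambda req: "R",
    queue_has_room=lambda tier: False,
    rank_local=lambda tier, req: "qwen3:8b",
    cloud_fallback=lambda tier: "cloud-default",
)
```

Passing the stages in as callables keeps the ordering visible in one place, which is the property the overview emphasises.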
Key Capabilities
Five systems make the control plane useful.
Tier Routing
Requests are grouped into H, R, or C tiers so high-judgment, routine, and coding traffic can follow different lane policies.
Ranker
The ranker compares local candidates with warm state, quality, speed, and VRAM pressure before a local request is admitted.
Circuit-Breaker
Cloud failure can fall back to a local lane, and local failures can escalate when the selected lane is not usable.
Queue
The queue publishes depth and wait estimates per group, keeping local capacity visible instead of hidden behind retries.
Capability Matrix
Profiles track context window, tools support, reasoning mode, warm state, hardware class, and quality-speed scores.
Local requests are not blindly preferred. They are admitted only when the lane, the queue, and the candidate profile all agree the route is safe enough for the current request.
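One way to picture the three-way agreement is an admission gate that returns a reason alongside its verdict. The sketch below is hypothetical: the parameter names, the `max_depth` bound, and the `lockout_state` key are assumptions chosen for illustration, not TorchRouter's actual interface.

```python
from typing import Tuple


def admit(
    lane_healthy: bool,
    queue_depth: int,
    max_depth: int,
    profile: dict,
    needs_tools: bool,
) -> Tuple[bool, str]:
    """Admit a local request only if lane, queue, and profile all agree."""
    if not lane_healthy:
        return False, "lane unavailable"
    if queue_depth >= max_depth:
        return False, "queue full"
    if profile.get("lockout_state", "none") != "none":
        return False, "candidate locked out"
    if needs_tools and not profile.get("supports_tools", False):
        return False, "tools unsupported"
    return True, "admitted"


ok = admit(True, 2, 8, {"supports_tools": True}, needs_tools=True)
full = admit(True, 8, 8, {"supports_tools": True}, needs_tools=False)
no_tools = admit(True, 0, 8, {"supports_tools": False}, needs_tools=True)
```

Returning the reason keeps a rejection explainable, which matches the emphasis on visible lane selection.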
Agent Tiers
Three tiers, three model families, one header contract.
High-judgment
Balanced-local first, then qwen3:30b-a3b or qwen2.5:32b depending on the lane and fleet state.
Routine
Fast-chat-local first, usually qwen3:8b with think:false so routine traffic stays terse and cheap.
Coding
Code-local first, with qwen2.5-coder:7b and qwen2.5-coder:32b carrying the coding lane.
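The tier-to-lane contract can be written down as a lookup table keyed by tier. In this sketch the model names come from the tier descriptions above, while the lowercase lane spellings and every agent-to-tier assignment are assumptions made up for the example.

```python
from typing import Mapping, Optional

# Model names are taken from the tier descriptions; lane spellings are assumed.
TIER_POLICY = {
    "H": {"lane": "balanced-local", "models": ["qwen3:30b-a3b", "qwen2.5:32b"]},
    "R": {"lane": "fast-chat-local", "models": ["qwen3:8b"], "think": False},
    "C": {"lane": "code-local", "models": ["qwen2.5-coder:7b", "qwen2.5-coder:32b"]},
}

# Hypothetical agent-to-tier assignments; the real mapping is not shown here.
AGENT_TIERS = {"webmaster": "R", "reviewer": "H", "builder": "C"}


def policy_for(headers: Mapping[str, str]) -> Optional[dict]:
    """Look up the tier policy for the agent named in X-Agent-Id."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in headers.items()}
    tier = AGENT_TIERS.get(lowered.get("x-agent-id", ""))
    return TIER_POLICY.get(tier) if tier else None


routine = policy_for({"X-Agent-Id": "webmaster"})
unknown = policy_for({"X-Agent-Id": "nobody"})
```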
API Reference
Classifier port 4001 speaks a small, explicit surface.
GET /health: Returns {"status":"ok","daily_spend_usd":float}.
GET /capability-matrix: Returns schema_version, a live count, and the full profile map. The current matrix exposes 23 model profiles.
GET /queue/status: Returns queue depth, active count, wait estimates, policy state, and per-group fallback metadata.
GET /metrics: Prometheus text output with the classifier_ prefix family.
POST /v1/chat/completions: OpenAI-compatible proxy. Set X-Agent-Id and send a standard chat payload; the router forwards the request to the selected group.
Response headers
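Responses include routing headers that explain the outcome, such as the X-Router-Timing-* family described under Observability. The commands below exercise each endpoint: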
curl -sf http://sauron:4001/health
{"status":"ok","daily_spend_usd":0.42}
curl -sf http://sauron:4001/capability-matrix | jq '.count, .profiles["matthew-100.86.174.11/ollama/qwen3:8b"]'
curl -sf http://sauron:4001/queue/status | jq '.groups.balanced.depth, .groups["fast-chat"].wait_ms'
curl -s http://sauron:4001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Agent-Id: webmaster' \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Summarize the routing policy."}
    ]
  }'
Use X-Agent-Id so the classifier can apply the right tier policy. The body model name may still be agent:group for passthrough callers, but the header is the authoritative identity when both are present.
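The precedence rule (header first, body model name second) might look like the following. This is a sketch under assumptions: the function name is hypothetical, and treating any colon-separated model string as agent:group is a simplification, since a real implementation would have to tell it apart from model tags such as qwen3:8b.

```python
from typing import Mapping, Optional


def resolve_agent(headers: Mapping[str, str], body: Mapping[str, object]) -> Optional[str]:
    """Return the agent identity, preferring X-Agent-Id over the body model."""
    for name, value in headers.items():
        if name.lower() == "x-agent-id" and value:
            return value  # the header is authoritative when present
    model = str(body.get("model", ""))
    if ":" in model:
        # Simplified agent:group parsing; see the caveat in the lead-in.
        agent, _group = model.split(":", 1)
        return agent or None
    return None


both = resolve_agent({"X-Agent-Id": "webmaster"}, {"model": "other:balanced"})
body_only = resolve_agent({}, {"model": "other:balanced"})
neither = resolve_agent({}, {"model": "auto"})
```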
Routing Flow
Request to backend, with the failure path made explicit.
The path below matches the live routing policy: classifier first, then queue, then ranker, then LiteLLM, then local Ollama or cloud; cloud failure can drop to a tier fallback and then to the local group.
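The failure path (cloud, then tier fallback, then the local group) behaves like an ordered ladder in which each step is tried only after the previous one fails. The sketch below simulates that ordering with stand-in functions; the step names, error types, and `run_ladder` helper are all hypothetical.

```python
from typing import Callable, List, Tuple


def run_ladder(steps: List[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Try each step in order; return (step_name, result) for the first success."""
    failures = []
    for name, call in steps:
        try:
            return name, call()
        except Exception as exc:  # a real router would catch narrower error types
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all routes failed: " + "; ".join(failures))


def cloud() -> str:
    raise ConnectionError("cloud backend unavailable")  # simulated outage


def tier_fallback() -> str:
    raise TimeoutError("tier fallback timed out")  # simulated second failure


def local_group() -> str:
    return "answered by local group"


used, reply = run_ladder(
    [("cloud", cloud), ("tier-fallback", tier_fallback), ("local", local_group)]
)
```

Collecting the failure messages along the way is what lets the final error, or a log record, explain which rungs were tried.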
[Routing flow diagram; surviving step label: "X-Agent-Id and lane context."]
Fleet Configuration
The capability matrix is the fleet contract.
Each profile in /capability-matrix carries a lane, host ID, context window, tools support,
reasoning profile, think:false requirement, hardware class, preemptibility, lockout state,
warm state, API base, and speed/quality scores. The runtime updates warm state without restarting.
Warm state is one of cold, warm, hot, or unknown.
Lockout state is one of none, codex, auth, or budget.
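Because both states are closed sets, a consumer of the matrix can validate them with enumerations. In this sketch the enum values come from the two lines above; the `lockout_state` key name and the defaults applied when a key is missing are assumptions.

```python
from enum import Enum
from typing import Tuple


class WarmState(str, Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"
    UNKNOWN = "unknown"


class LockoutState(str, Enum):
    NONE = "none"
    CODEX = "codex"
    AUTH = "auth"
    BUDGET = "budget"


def parse_states(profile: dict) -> Tuple[WarmState, LockoutState]:
    """Validate the two state fields; unrecognised values raise ValueError."""
    warm = WarmState(profile.get("warm_state", "unknown"))      # assumed default
    lockout = LockoutState(profile.get("lockout_state", "none"))  # assumed key and default
    return warm, lockout


states = parse_states({"warm_state": "hot", "lockout_state": "budget"})
```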
| Field | Why it exists | How TorchRouter uses it |
|---|---|---|
| lane | Primary LiteLLM group for the model. | Used by the ranker and fallback logic to decide where the model belongs. |
| context_window | Token budget for the candidate. | Prevents long-context work from entering a lane that cannot safely carry it. |
| supports_tools | Tool-call compatibility. | Tool-requiring requests are kept away from candidates that cannot comply. |
| warm_state | Runtime freshness marker. | Hot and warm candidates score better; cold candidates lose rank unless no better option exists. |
| hardware_class | Local GPU, cloud GPU, or offload classification. | Lets the fleet distinguish cheap local lanes from heavier cloud paths. |
| quality_score / speed_score | Normalised profile scores. | Balance quality against latency when selecting the best local candidate. |
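The fields in the table suggest a two-step selection: filter out candidates that cannot carry the request, then score the rest. The sketch below uses the field names from the table, but the weights, the warm-state bonus values, and the scoring formula are invented for illustration and are not the ranker's real formula.

```python
from typing import List, Optional

# Assumed bonus per warm state: hot and warm rank above cold.
WARM_BONUS = {"hot": 1.0, "warm": 0.6, "unknown": 0.3, "cold": 0.0}


def eligible(profile: dict, needed_tokens: int, needs_tools: bool) -> bool:
    """Hard filters: context window and tool support."""
    if profile["context_window"] < needed_tokens:
        return False
    if needs_tools and not profile["supports_tools"]:
        return False
    return True


def score(profile: dict, quality_weight: float = 0.5) -> float:
    """Blend quality and speed, then add a warm-state bonus (assumed weights)."""
    blended = (
        quality_weight * profile["quality_score"]
        + (1.0 - quality_weight) * profile["speed_score"]
    )
    return blended + 0.25 * WARM_BONUS[profile["warm_state"]]


def pick(profiles: List[dict], needed_tokens: int, needs_tools: bool) -> Optional[dict]:
    candidates = [p for p in profiles if eligible(p, needed_tokens, needs_tools)]
    return max(candidates, key=score, default=None)


fleet = [
    {"name": "a", "context_window": 32768, "supports_tools": True,
     "warm_state": "cold", "quality_score": 0.9, "speed_score": 0.4},
    {"name": "b", "context_window": 32768, "supports_tools": True,
     "warm_state": "hot", "quality_score": 0.7, "speed_score": 0.6},
]
best = pick(fleet, needed_tokens=8000, needs_tools=True)
too_long = pick(fleet, needed_tokens=100000, needs_tools=False)
```

With these made-up numbers the hot candidate wins despite its lower quality score, which mirrors the table's note that cold candidates lose rank unless no better option exists.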
The ranker reserves VRAM for a candidate before admitting a local request and releases that reservation when the request completes. That is how the fleet avoids overcommitting a host under mixed load.
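A reserve-then-release pattern like this is often expressed as a context manager, so the release happens even if the request fails. The `VramLedger` class below is a hypothetical sketch of the idea, with made-up names and megabyte units, not the ranker's real bookkeeping.

```python
import threading
from contextlib import contextmanager


class VramLedger:
    """Tracks reserved VRAM for one host (hypothetical helper)."""

    def __init__(self, total_mb: int) -> None:
        self.total_mb = total_mb
        self.reserved_mb = 0
        self._lock = threading.Lock()

    @contextmanager
    def reserve(self, needed_mb: int):
        # Refuse admission rather than overcommit the host.
        with self._lock:
            if self.reserved_mb + needed_mb > self.total_mb:
                raise MemoryError("not enough free VRAM on this host")
            self.reserved_mb += needed_mb
        try:
            yield
        finally:
            # Release on completion, whether the request succeeded or not.
            with self._lock:
                self.reserved_mb -= needed_mb


ledger = VramLedger(total_mb=24000)
with ledger.reserve(20000):
    during = ledger.reserved_mb
after = ledger.reserved_mb
```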
Observability
Everything important is visible somewhere.
TorchRouter publishes Prometheus metrics from /metrics, routing headers on the response,
and routing continuity records to the JSONL sink when the event path is configured. The point is not
just to serve the request, but to explain the request after the fact.
Prometheus metrics share the classifier_ prefix and cover routing, queue, spend, circuit, and fleet state.
When TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH is set, TorchRouter writes JSONL event records such as route_selected, fallback_selected, and degradation_ladder_step to that path.
The timing headers X-Router-Timing-Score-Ms, X-Router-Timing-Queue-Ms, X-Router-Timing-Litellm-Ttft-Ms, and related headers show where latency landed.
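Two small helpers show how an operator might consume these signals after the fact. Both are sketches under assumptions: the timing header values are assumed to be plain millisecond numbers, and the JSONL records are assumed to carry their type in an `event` field, which the documentation above does not confirm.

```python
import json
from collections import Counter
from typing import Iterable, Mapping

PREFIX = "x-router-timing-"


def timing_breakdown(headers: Mapping[str, str]) -> dict:
    """Collect X-Router-Timing-*-Ms headers into {stage: milliseconds}."""
    timings = {}
    for name, value in headers.items():
        key = name.lower()
        if key.startswith(PREFIX) and key.endswith("-ms"):
            stage = key[len(PREFIX):-len("-ms")]
            try:
                timings[stage] = float(value)
            except ValueError:
                continue  # skip values that are not numeric
    return timings


def count_events(lines: Iterable[str], type_field: str = "event") -> Counter:
    """Count JSONL records by type; the field name is an assumption."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get(type_field, "unknown")] += 1
    return counts


sample_headers = {
    "X-Router-Timing-Score-Ms": "3",
    "X-Router-Timing-Queue-Ms": "12.5",
    "X-Router-Timing-Litellm-Ttft-Ms": "240",
    "Content-Type": "application/json",
}
sample_lines = [
    '{"event": "route_selected"}',
    '{"event": "fallback_selected"}',
    '{"event": "route_selected"}',
]
timings = timing_breakdown(sample_headers)
events = count_events(sample_lines)
```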
Deployment
Dev stack commands and prod ports are fixed in the stack files.
The launch site itself serves this manual on localhost:3201. The TorchRouter runtime uses
classifier port 4001 and LiteLLM port 4000 in production on sauron.
Development mounts the dev stack on 4002 (classifier-dev) and 4003 (litellm-dev).
# Launch site
cd /home/chrisb/torchrouter-launch-site/server
npm install
PORT=3201 npm start
# Dev router stack on sauron
docker compose -f sauron/docker-compose.yml -f sauron/docker-compose.dev.yml \
  up -d litellm-dev classifier-dev
# Dev health checks
curl http://localhost:4002/health
curl http://localhost:4003/health/liveliness
# Production ports
# 4000 = LiteLLM
# 4001 = classifier
# 9090 = Prometheus
# 3030 = Grafana
Relevant environment variables: LITELLM_URL, LITELLM_MASTER_KEY, ROUTING_MODE, TORCHROUTER_QUEUE_ENABLED, TORCHROUTER_FAST_CHAT_CANARY_SHARE, and TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH.
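A deployment script might gather these variables into one typed object before starting the stack. The variable names below come from the list above; the value types, the defaults, the accepted boolean spellings, and the 0-to-1 range for the canary share are all assumptions made for this sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RouterSettings:
    litellm_url: str
    litellm_master_key: Optional[str]
    routing_mode: Optional[str]
    queue_enabled: bool
    fast_chat_canary_share: float
    continuity_event_path: Optional[str]


def _as_bool(raw: Optional[str]) -> bool:
    # Assumed spellings for "enabled".
    return (raw or "").strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> RouterSettings:
    share = float(env.get("TORCHROUTER_FAST_CHAT_CANARY_SHARE", "0"))
    if not 0.0 <= share <= 1.0:  # assumed to be a fraction of traffic
        raise ValueError("canary share must be between 0 and 1")
    return RouterSettings(
        litellm_url=env.get("LITELLM_URL", "http://localhost:4000"),  # assumed default
        litellm_master_key=env.get("LITELLM_MASTER_KEY"),
        routing_mode=env.get("ROUTING_MODE"),
        queue_enabled=_as_bool(env.get("TORCHROUTER_QUEUE_ENABLED")),
        fast_chat_canary_share=share,
        continuity_event_path=env.get("TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH"),
    )


settings = load_settings(
    {"TORCHROUTER_QUEUE_ENABLED": "true", "TORCHROUTER_FAST_CHAT_CANARY_SHARE": "0.1"}
)
```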
Operational Summary
Route the work, keep the lane explicit, and leave a trail.
TorchRouter is not a magic model picker. It is a control plane that makes lane choice, queue behavior, fallback rules, and evidence visible enough that operators can trust it and debug it.