TorchRouter Documentation Hub

Commercial launch docs plus the operator reference behind them.

Start with the Control Audit and Assisted Pilot outlines if you are evaluating the commercial motion. Use the technical sections below if you need the routing, deployment, and observability reference behind the control-plane story.

Tier routing | Ranker | Circuit-breaker | Queue | Capability matrix | think:false enforcement | Prometheus
Classifier port 4001 | LiteLLM port 4000 | Launch site port 3201 | OpenAI-compatible

Overview

What TorchRouter is doing under the hood.

TorchRouter routes agent traffic through a controlled lane system rather than letting every request hit the same model path. The classifier scores the request, the queue decides whether a local lane can take it now, the ranker compares local candidates, and LiteLLM executes the selected backend.

The goal is local-first when the fleet can carry it, with explicit fallback when it cannot. That keeps operational behavior predictable: lane selection is visible, queueing is bounded, and observability is built into the response path.

Key Capabilities

Five systems make the control plane useful.

Tier Routing

Requests are grouped into H, R, or C tiers so high-judgment, routine, and coding traffic can follow different lane policies.

Ranker

The ranker compares local candidates with warm state, quality, speed, and VRAM pressure before a local request is admitted.

Circuit-Breaker

Cloud failure can fall back to a local lane, and local failures can escalate when the selected lane is not usable.

Queue

The queue publishes depth and wait estimates per group, keeping local capacity visible instead of hidden behind retries.

Capability Matrix

Profiles track context window, tools support, reasoning mode, warm state, hardware class, and quality-speed scores.

Operating rule

Local requests are not blindly preferred. They are admitted only when the lane, the queue, and the candidate profile all agree the route is safe enough for the current request.
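The operating rule can be read as a three-way conjunction. The sketch below is illustrative only: the `Candidate` and `Request` classes, the `max_wait_ms` threshold, and every field name are assumptions made for this example, not TorchRouter's actual types.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical slice of a capability-matrix profile."""
    context_window: int
    supports_tools: bool
    lockout_state: str = "none"


@dataclass
class Request:
    """Hypothetical request summary used only for this sketch."""
    prompt_tokens: int
    needs_tools: bool


def admit_local(lane_enabled: bool, queue_wait_ms: int, max_wait_ms: int,
                candidate: Candidate, request: Request) -> bool:
    """Admit a local route only if lane, queue, and profile all agree."""
    lane_ok = lane_enabled
    # Assumed queue rule: the estimated wait must fit a configured bound.
    queue_ok = queue_wait_ms <= max_wait_ms
    profile_ok = (
        candidate.lockout_state == "none"
        and request.prompt_tokens <= candidate.context_window
        and (candidate.supports_tools or not request.needs_tools)
    )
    return lane_ok and queue_ok and profile_ok
```

Any single "no" is enough to reject the local route, which is the point of the rule: local is preferred only when nothing objects.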

Agent Tiers

Three tiers, three model families, one header contract.

H

High-judgment

Balanced-local first, then qwen3:30b-a3b or qwen2.5:32b depending on the lane and fleet state.

Agents: trogdor, valkyrie, orchestration-elena, keel
Default group: balanced / balanced-local
Cloud primary: qwen3:30b-a3b, qwen2.5:32b

R

Routine

Fast-chat-local first, usually qwen3:8b with think:false so routine traffic stays terse and cheap.

Agents: linda, broham, harbor, ledger, gavel
Default group: fast-chat / fast-chat-local
Local floor: qwen3:8b with think:false

C

Coding

Code-local first, with qwen2.5-coder:7b and qwen2.5-coder:32b carrying the coding lane.

Agents: forge, webmaster, tooling-raj, debug-anika
Default group: code / code-local
Setup: send the header X-Agent-Id: <agent_name> and set model to auto in the request body.
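The tier assignments above can be captured as a small lookup table. This is a sketch of how a caller might mirror the mapping; the agent names and groups come from this page, while the `TIERS` structure, the function names, and the `None` result for unknown agents are assumptions of the example.

```python
from typing import Optional, Tuple

# Agent and group names are taken from the tier descriptions on this page.
TIERS = {
    "H": {
        "agents": ["trogdor", "valkyrie", "orchestration-elena", "keel"],
        "groups": ("balanced", "balanced-local"),
    },
    "R": {
        "agents": ["linda", "broham", "harbor", "ledger", "gavel"],
        "groups": ("fast-chat", "fast-chat-local"),
    },
    "C": {
        "agents": ["forge", "webmaster", "tooling-raj", "debug-anika"],
        "groups": ("code", "code-local"),
    },
}

# Invert the table once so lookups by agent name are direct.
AGENT_TO_TIER = {
    agent: tier for tier, spec in TIERS.items() for agent in spec["agents"]
}


def tier_for(agent: str) -> Optional[str]:
    """Return "H", "R", or "C"; None for unknown agents (an assumption)."""
    return AGENT_TO_TIER.get(agent)


def default_groups(agent: str) -> Optional[Tuple[str, str]]:
    """Return the tier's default group pair, or None for unknown agents."""
    tier = tier_for(agent)
    return TIERS[tier]["groups"] if tier else None
```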

API Reference

Classifier port 4001 speaks a small, explicit surface.

GET /health

Returns {"status":"ok","daily_spend_usd":float}.

GET /capability-matrix

Returns schema_version, a live count, and the full profile map. The current matrix exposes 23 model profiles.

GET /queue/status

Returns queue depth, active count, wait estimates, policy state, and per-group fallback metadata.

GET /metrics

Prometheus text output with the classifier_ prefix family.

POST /v1/chat/completions

OpenAI-compatible proxy. Set X-Agent-Id and send a standard chat payload; the router forwards the request to the selected group.

Response headers

Responses include routing headers that explain the outcome:

X-Queue-Selected-Group, X-Queue-State, X-Queue-Wait-Ms, X-Router-Timing-*

curl -sf http://sauron:4001/health
{"status":"ok","daily_spend_usd":0.42}

curl -sf http://sauron:4001/capability-matrix | jq '.count, .profiles["matthew-100.86.174.11/ollama/qwen3:8b"]'

curl -sf http://sauron:4001/queue/status | jq '.groups.balanced.depth, .groups["fast-chat"].wait_ms'

curl -s http://sauron:4001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Agent-Id: webmaster' \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Summarize the routing policy."}
    ]
  }'

Header contract

Use X-Agent-Id so the classifier can apply the right tier policy. The body model name may still be agent:group for passthrough callers, but the header is the authoritative identity when both are present.
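For callers that prefer Python over curl, the same request can be assembled with the standard library. This is a minimal sketch: the endpoint path, header name, and `model: auto` come from this page, while the helper names `build_chat_request` and `routing_headers` are made up for the example.

```python
import json
import urllib.request
from typing import Dict, Mapping

# Production classifier address as given in this manual.
BASE_URL = "http://sauron:4001"


def build_chat_request(agent_id: str, prompt: str,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat request."""
    body = {
        "model": "auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Agent-Id": agent_id},
        method="POST",
    )


def routing_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only the X-Queue-* and X-Router-Timing-* response headers."""
    prefixes = ("x-queue-", "x-router-timing-")
    return {k: v for k, v in headers.items() if k.lower().startswith(prefixes)}
```

To send the request, pass it to `urllib.request.urlopen` and hand the response's `headers` to `routing_headers` to see which group was selected and where the time went.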

Routing Flow

Request to backend, with the failure path made explicit.

The path below matches the live routing policy: classifier first, then queue, then ranker, then LiteLLM, then local Ollama or cloud; cloud failure can drop to a tier fallback and then to the local group.

Request: OpenAI-compatible chat payload arrives with X-Agent-Id and lane context.
Classifier: Scores complexity, criticality, and alpha-weighted policy before selecting a group.
Queue: Publishes depth and wait estimates, then decides whether the local lane can take work now.
Ranker: Evaluates local candidates using warm state, quality, speed, and VRAM pressure.
LiteLLM: Receives the selected group and forwards the request with the standard OpenAI shape.
Local / Cloud: Executes on local Ollama or cloud. A cloud error can fall back to a tier-approved local path.
Trace: Response headers and continuity events document the path without changing the serving result.
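The failure path can be pictured as an ordered list of routes that are tried in turn while a trace is recorded. The sketch below is an assumption-laden illustration: `BackendError`, `execute_with_fallback`, and the route labels are hypothetical and do not describe the actual circuit-breaker code.

```python
from typing import Callable, List, Sequence, Tuple


class BackendError(Exception):
    """Hypothetical error raised when a backend cannot serve the request."""


def execute_with_fallback(
    routes: Sequence[Tuple[str, Callable[[], str]]],
) -> Tuple[str, List[str]]:
    """Try each route in order; return the first result plus a trace."""
    trace: List[str] = []
    for name, call in routes:
        try:
            result = call()
        except BackendError:
            trace.append(f"{name}:failed")
            continue  # fall through to the next approved route
        trace.append(f"{name}:ok")
        return result, trace
    raise BackendError("all routes failed: " + ", ".join(trace))
```

A caller would pass routes in policy order, for example cloud primary, then the tier fallback, then the local group. The trace plays the role that response headers and continuity events play in the real system: it explains the path without changing the result.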

Fleet Configuration

The capability matrix is the fleet contract.

Each profile in /capability-matrix carries a lane, host ID, context window, tools support, reasoning profile, think:false requirement, hardware class, preemptibility, lockout state, warm state, API base, and speed/quality scores. The runtime updates warm state without restarting.

Warm state is one of cold, warm, hot, or unknown. Lockout state is one of none, codex, auth, or budget.

| Field | Why it exists | How TorchRouter uses it |
| --- | --- | --- |
| lane | Primary LiteLLM group for the model. | Used by the ranker and fallback logic to decide where the model belongs. |
| context_window | Token budget for the candidate. | Prevents long-context work from entering a lane that cannot safely carry it. |
| supports_tools | Tool-call compatibility. | Tool-requiring requests are kept away from candidates that cannot comply. |
| warm_state | Runtime freshness marker. | Hot and warm candidates score better; cold candidates lose rank unless no better option exists. |
| hardware_class | Local GPU, cloud GPU, or offload classification. | Lets the fleet distinguish cheap local lanes from heavier cloud paths. |
| quality_score / speed_score | Normalised profile scores. | Balance quality against latency when selecting the best local candidate. |

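To make the field roles concrete, here is a rough sketch of filtering and ranking candidates from profile data. The field names and the warm and lockout state values follow this page; the `Profile` class, the `WARM_BONUS` weights, the `quality_weight` blend, and all scores in any usage are invented for illustration, and VRAM pressure is left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, List

WARM_STATES = ("cold", "warm", "hot", "unknown")
LOCKOUT_STATES = ("none", "codex", "auth", "budget")

# Hypothetical adjustments: hot and warm gain rank, cold loses rank.
WARM_BONUS = {"hot": 0.2, "warm": 0.1, "unknown": 0.0, "cold": -0.2}


@dataclass
class Profile:
    name: str
    lane: str
    context_window: int
    supports_tools: bool
    warm_state: str
    quality_score: float
    speed_score: float
    lockout_state: str = "none"

    def __post_init__(self) -> None:
        if self.warm_state not in WARM_STATES:
            raise ValueError(f"bad warm_state: {self.warm_state}")
        if self.lockout_state not in LOCKOUT_STATES:
            raise ValueError(f"bad lockout_state: {self.lockout_state}")


def score(p: Profile, quality_weight: float = 0.6) -> float:
    """Blend quality and speed, then adjust for warm state."""
    blended = quality_weight * p.quality_score
    blended += (1.0 - quality_weight) * p.speed_score
    return blended + WARM_BONUS[p.warm_state]


def rank(profiles: Iterable[Profile], lane: str,
         prompt_tokens: int, needs_tools: bool) -> List[Profile]:
    """Drop candidates that cannot serve the request, best score first."""
    eligible = [
        p for p in profiles
        if p.lane == lane
        and p.lockout_state == "none"
        and p.context_window >= prompt_tokens
        and (p.supports_tools or not needs_tools)
    ]
    return sorted(eligible, key=score, reverse=True)
```

Cold candidates are penalised rather than excluded, so they can still win when nothing warmer is eligible, matching the warm_state row above.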
VRAM management

The ranker reserves VRAM for a candidate before admitting a local request and releases that reservation when the request completes. That is how the fleet avoids overcommitting a host under mixed load.
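The reserve-then-release pattern maps naturally onto a context manager. The following is a sketch under stated assumptions: `VramLedger`, `VramExhausted`, and the megabyte accounting are hypothetical and only illustrate the idea of per-host reservations.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class VramExhausted(Exception):
    """Hypothetical error: the host cannot fit another reservation."""


class VramLedger:
    """Tracks reserved VRAM for one host (illustrative only)."""

    def __init__(self, total_mb: int) -> None:
        self._total = total_mb
        self._reserved = 0
        self._lock = threading.Lock()

    @property
    def free_mb(self) -> int:
        with self._lock:
            return self._total - self._reserved

    @contextmanager
    def reserve(self, needed_mb: int) -> Iterator[None]:
        """Reserve before admission; always release on completion."""
        with self._lock:
            if self._reserved + needed_mb > self._total:
                raise VramExhausted(f"need {needed_mb} MB, host is full")
            self._reserved += needed_mb
        try:
            yield
        finally:
            with self._lock:
                self._reserved -= needed_mb
```

Because the release sits in a `finally` clause, a request that fails mid-flight still returns its reservation, so a crash cannot leave the host permanently overcommitted on paper.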

Observability

Everything important is visible somewhere.

TorchRouter publishes Prometheus metrics from /metrics, routing headers on the response, and routing continuity records to the JSONL sink when the event path is configured. The point is not just to serve the request, but to explain the request after the fact.

Prometheus: Metrics use the classifier_ prefix and cover routing, queue, spend, circuit, and fleet state.
Continuity events: When TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH is set, the router writes JSONL event records such as route_selected, fallback_selected, and degradation_ladder_step to that path.
think:false enforcement: The router enforces think:false on the request payload as a belt-and-suspenders measure, so the qwen3 floor and triage lanes cannot leak reasoning narration.
Timing headers: X-Router-Timing-Score-Ms, X-Router-Timing-Queue-Ms, X-Router-Timing-Litellm-Ttft-Ms, and related headers show where latency landed.
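Since continuity events are JSONL, a few lines of Python are enough to summarise a sink file. This sketch assumes each record is a JSON object whose event name lives under an `event` key; that key name is a guess and is exposed as the `type_field` parameter so it can be adjusted to the real schema.

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str], type_field: str = "event") -> Counter:
    """Count continuity records by event type, skipping unreadable lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written trailing line
        if isinstance(record, dict) and type_field in record:
            counts[record[type_field]] += 1
    return counts
```

Opening the file named by TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH and passing the file object to `count_events` gives a quick view of how often fallbacks fire relative to normal route selections.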

Deployment

Dev stack commands and prod ports are fixed in the stack files.

The launch site itself serves this manual on localhost:3201. In production on sauron, the TorchRouter runtime uses port 4001 for the classifier and port 4000 for LiteLLM. The dev stack is exposed on ports 4002 (classifier-dev) and 4003 (litellm-dev).

# Launch site
cd /home/chrisb/torchrouter-launch-site/server
npm install
PORT=3201 npm start

# Dev router stack on sauron
docker compose -f sauron/docker-compose.yml -f sauron/docker-compose.dev.yml \
  up -d litellm-dev classifier-dev

# Dev health checks
curl http://localhost:4002/health
curl http://localhost:4003/health/liveliness

# Production ports
# 4000 = LiteLLM
# 4001 = classifier
# 9090 = Prometheus
# 3030 = Grafana

Key env vars: LITELLM_URL, LITELLM_MASTER_KEY, ROUTING_MODE, TORCHROUTER_QUEUE_ENABLED, TORCHROUTER_FAST_CHAT_CANARY_SHARE, and TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH.
Local-first default: Production defaults to passthrough routing with controlled local fallback, not broad live routing.
Content gap flagged: This manual documents the live schema and the current defaults, but runtime values like warm state, queue depth, and spend are intentionally dynamic and can change between requests.
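When debugging a deployment it can help to dump the key environment variables in one place. The sketch below reads the variable names listed above; the value types (boolean for the queue flag, float for the canary share) and the `load_router_env` helper are assumptions, and the master key is reported only as set or unset so it never lands in logs.

```python
import os
from typing import Any, Dict, Mapping, Optional


def _as_bool(value: str) -> bool:
    """Assumed truthy spellings; the real parser may differ."""
    return value.strip().lower() in ("1", "true", "yes", "on")


def load_router_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect the key TorchRouter variables from an environment mapping."""
    env = os.environ if env is None else env
    share = env.get("TORCHROUTER_FAST_CHAT_CANARY_SHARE")
    return {
        "litellm_url": env.get("LITELLM_URL"),
        # Report presence only; never echo the secret itself.
        "litellm_master_key_set": bool(env.get("LITELLM_MASTER_KEY")),
        "routing_mode": env.get("ROUTING_MODE"),
        "queue_enabled": _as_bool(env.get("TORCHROUTER_QUEUE_ENABLED", "")),
        "fast_chat_canary_share": float(share) if share else None,
        "continuity_event_path": env.get(
            "TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH"
        ),
    }
```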

Operational Summary

Route the work, keep the lane explicit, and leave a trail.

TorchRouter is not a magic model picker. It is a control plane that makes lane choice, queue behavior, fallback rules, and evidence visible enough that operators can trust it and debug it.