TorchRouter Documentation Hub

Commercial launch docs plus the operator reference behind them.

Start with the Control Audit and Assisted Pilot outlines if you are evaluating the commercial motion. Use the technical sections below if you need the routing, deployment, and observability reference behind the control-plane story.

Tier routing, Ranker, Circuit-breaker, Queue, Capability matrix, think:false enforcement, Prometheus

Classifier port 4001, LiteLLM port 4000, Launch site port 3201, OpenAI-compatible

Contents

Jump straight to the operational sections.

Overview: What TorchRouter does
Agent Tiers: H, R, and C routing rules
API Reference: Health, matrix, queue, metrics
Routing Flow: Classifier to queue to backend
Fleet Configuration: Profiles, warm state, VRAM
Observability: Metrics and continuity events
Deployment: Dev stack and prod ports

Overview

What TorchRouter is doing under the hood.

TorchRouter routes agent traffic through a controlled lane system rather than letting every request hit the same model path. The classifier scores the request, the queue decides whether a local lane can take it now, the ranker compares local candidates, and LiteLLM executes the selected backend.

The goal is local-first when the fleet can carry it, with explicit fallback when it cannot. That keeps operational behavior predictable: lane selection is visible, queueing is bounded, and observability is built into the response path.

Key Capabilities

Five systems make the control plane useful.

Tier Routing

Requests are grouped into H, R, or C tiers so high-judgment, routine, and coding traffic can follow different lane policy.

Ranker

The ranker compares local candidates with warm state, quality, speed, and VRAM pressure before a local request is admitted.

Circuit-Breaker

Cloud failure can fall back to a local lane, and local failures can escalate when the selected lane is not usable.

Queue

The queue publishes depth and wait estimates per group, keeping local capacity visible instead of hidden behind retries.

Capability Matrix

Profiles track context window, tools support, reasoning mode, warm state, hardware class, and quality-speed scores.

Operating rule

Local requests are not blindly preferred. They are admitted only when the lane, the queue, and the candidate profile all agree the route is safe enough for the current request.
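
The operating rule can be pictured as a three-way gate. The sketch below is illustrative only: the class names, the field names, and the max_depth bound are assumptions made for the example, not TorchRouter's actual types or policy.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical view of one group's queue; field names are assumptions."""
    depth: int
    max_depth: int


@dataclass
class Candidate:
    """Hypothetical candidate profile, trimmed to what the gate needs."""
    lane: str
    context_window: int
    supports_tools: bool


def admit_local(lane, queue, candidate, prompt_tokens, needs_tools):
    """Return True only when the lane, the queue, and the profile all agree."""
    lane_ok = candidate.lane == lane
    queue_ok = queue.depth < queue.max_depth  # assumed bounded-queue check
    profile_ok = (
        prompt_tokens <= candidate.context_window
        and (candidate.supports_tools or not needs_tools)
    )
    return lane_ok and queue_ok and profile_ok
```

Any single disagreement rejects the local route, which is the point of the rule: a free queue slot does not help if the candidate cannot hold the prompt.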

Agent Tiers

Three tiers, three model families, one header contract.

High-judgment

Balanced-local first, then qwen3:30b-a3b or qwen2.5:32b depending on the lane and fleet state.

Agents: trogdor, valkyrie, orchestration-elena, keel

Default group: balanced / balanced-local

Cloud primary: qwen3:30b-a3b, qwen2.5:32b

Routine

Fast-chat-local first, usually qwen3:8b with think:false so routine traffic stays terse and cheap.

Agents: linda, broham, harbor, ledger, gavel

Default group: fast-chat / fast-chat-local

Local floor: qwen3:8b with think:false

Coding

Code-local first, with qwen2.5-coder:7b and qwen2.5-coder:32b carrying the coding lane.

Agents: forge, webmaster, tooling-raj, debug-anika

Default group: code / code-local

Setup: send the header X-Agent-Id: <agent_name> and set model to auto in the request body.
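
The tier lists above amount to a lookup from agent name to tier and default group. The following is a minimal sketch of that lookup, using the agent and group names listed in this section. The function name, the reading of each "Default group" pair as (group, local group), and the None result for unknown agents are assumptions; the documentation does not say how the classifier treats an unrecognised X-Agent-Id.

```python
# Agent names and group names are copied from the tier lists in this section.
TIER_AGENTS = {
    "H": ["trogdor", "valkyrie", "orchestration-elena", "keel"],
    "R": ["linda", "broham", "harbor", "ledger", "gavel"],
    "C": ["forge", "webmaster", "tooling-raj", "debug-anika"],
}
TIER_GROUPS = {
    "H": ("balanced", "balanced-local"),
    "R": ("fast-chat", "fast-chat-local"),
    "C": ("code", "code-local"),
}
AGENT_TO_TIER = {
    agent: tier for tier, agents in TIER_AGENTS.items() for agent in agents
}


def resolve_groups(agent_id):
    """Return (tier, group, local_group), or None for an unknown agent.

    Returning None is an assumption made for this sketch only.
    """
    tier = AGENT_TO_TIER.get(agent_id.strip().lower())
    if tier is None:
        return None
    group, local_group = TIER_GROUPS[tier]
    return (tier, group, local_group)
```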

API Reference

Classifier port 4001 speaks a small, explicit surface.

GET /health

Returns {"status":"ok","daily_spend_usd":float}.

GET /capability-matrix

Returns schema_version, a live count, and the full profile map. The current matrix exposes 23 model profiles.

GET /queue/status

Returns queue depth, active count, wait estimates, policy state, and per-group fallback metadata.

GET /metrics

Prometheus text output with the classifier_ prefix family.

POST /v1/chat/completions

OpenAI-compatible proxy. Set X-Agent-Id and send a standard chat payload; the router forwards the request to the selected group.

Response headers

Responses include routing headers that explain the outcome:

X-Queue-Selected-Group, X-Queue-State, X-Queue-Wait-Ms, X-Router-Timing-*

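# Health check, followed by an example response
curl -sf http://sauron:4001/health
{"status":"ok","daily_spend_usd":0.42}

# Profile count plus one profile from the capability matrix
curl -sf http://sauron:4001/capability-matrix | jq '.count, .profiles["matthew-100.86.174.11/ollama/qwen3:8b"]'

# Queue depth for balanced and wait estimate for fast-chat
curl -sf http://sauron:4001/queue/status | jq '.groups.balanced.depth, .groups["fast-chat"].wait_ms'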

curl -s http://sauron:4001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Agent-Id: webmaster' \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Summarize the routing policy."}
    ]
  }'

Header contract

Use X-Agent-Id so the classifier can apply the right tier policy. The body model name may still be agent:group for passthrough callers, but the header is the authoritative identity when both are present.
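
For callers that are not using curl, the same request can be built with the Python standard library. This is a sketch under stated assumptions: the function names are hypothetical, and the host, port, header, and payload shape are taken from the curl example above. Note that urllib stores header names with normalised capitalisation; HTTP header names are case-insensitive, so the router sees the same header.

```python
import json
import urllib.request

# Host and port are taken from the curl examples in this section.
ROUTER_URL = "http://sauron:4001/v1/chat/completions"


def build_chat_request(agent_id, prompt, url=ROUTER_URL):
    """Build (but do not send) a chat request carrying X-Agent-Id."""
    body = json.dumps(
        {"model": "auto", "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "X-Agent-Id": agent_id},
        method="POST",
    )


def routing_headers(headers):
    """Keep only the X-Queue-* and X-Router-Timing-* response headers."""
    prefixes = ("x-queue-", "x-router-timing-")
    return {k: v for k, v in headers.items() if k.lower().startswith(prefixes)}


# Usage, assuming the router is reachable from this machine:
# req = build_chat_request("webmaster", "Summarize the routing policy.")
# with urllib.request.urlopen(req) as resp:
#     print(routing_headers(resp.headers))
```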

Routing Flow

Request to backend, with the failure path made explicit.

The path below matches the live routing policy: classifier first, then queue, then ranker, then LiteLLM, then local Ollama or cloud; cloud failure can drop to a tier fallback and then to the local group.

Request: OpenAI-compatible chat payload arrives with X-Agent-Id and lane context.

Classifier: Scores complexity, criticality, and alpha-weighted policy before selecting a group.

Queue: Publishes depth and wait estimates, then decides whether the local lane can take work now.

Ranker: Evaluates local candidates using warm state, quality, speed, and VRAM pressure.

LiteLLM: Receives the selected group and forwards the request with the standard OpenAI shape.

Local / Cloud: Executes on local Ollama or cloud. A cloud error can fall back to a tier-approved local path.

Trace: Response headers and continuity events document the path without changing the serving result.
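
The failure path in the flow above can be read as an ordered ladder of groups tried in turn. The sketch below illustrates that reading. The execute callable, the BackendError type, and the shape of the trace records are hypothetical; only the event names route_selected, fallback_selected, and degradation_ladder_step come from this manual, and the exact meaning assigned to each here is an assumption.

```python
class BackendError(Exception):
    """Raised by the hypothetical execute() callable when a backend fails."""


def run_with_fallback(request, ladder, execute):
    """Try each group in ladder order and record which step served the request.

    `ladder` is an ordered list of group names, for example the primary
    group, then a tier fallback, then the local group.
    """
    trace = []
    for step, group in enumerate(ladder):
        try:
            result = execute(group, request)
        except BackendError as exc:
            # Record the failed rung, then move down the ladder.
            trace.append({"event": "degradation_ladder_step", "step": step,
                          "group": group, "error": str(exc)})
            continue
        event = "route_selected" if step == 0 else "fallback_selected"
        trace.append({"event": event, "step": step, "group": group})
        return result, trace
    raise BackendError("all groups in the ladder failed")
```

Keeping the trace separate from the result mirrors the Trace step: the record explains the path without changing what is served.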

Fleet Configuration

The capability matrix is the fleet contract.

Each profile in /capability-matrix carries a lane, host ID, context window, tools support, reasoning profile, think:false requirement, hardware class, preemptibility, lockout state, warm state, API base, and speed/quality scores. The runtime updates warm state without restarting.

Warm state is one of cold, warm, hot, or unknown. Lockout state is one of none, codex, auth, or budget.

Field	Why it exists	How TorchRouter uses it
lane	Primary LiteLLM group for the model.	Used by the ranker and fallback logic to decide where the model belongs.
context_window	Token budget for the candidate.	Prevents long-context work from entering a lane that cannot safely carry it.
supports_tools	Tool-call compatibility.	Tool-requiring requests are kept away from candidates that cannot comply.
warm_state	Runtime freshness marker.	Hot and warm candidates score better; cold candidates lose rank unless no better option exists.
hardware_class	Local GPU, cloud GPU, or offload classification.	Lets the fleet distinguish cheap local lanes from heavier cloud paths.
quality_score / speed_score	Normalised profile scores.	Balance quality against latency when selecting the best local candidate.
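
To make the table concrete, here is one way the fields could combine into a ranking. It is a sketch only: the weights, the warm-state multipliers, and the scoring formula are invented for illustration and are not TorchRouter's actual ranker. The field names match the table; profiles are plain dicts for brevity.

```python
# Assumed multipliers: hot and warm candidates keep more of their score.
WARM_BONUS = {"hot": 1.0, "warm": 0.7, "cold": 0.2, "unknown": 0.0}


def rank_candidates(profiles, prompt_tokens, needs_tools, quality_weight=0.6):
    """Filter out unsafe candidates, then sort the rest best-first."""
    eligible = [
        p for p in profiles
        if p["context_window"] >= prompt_tokens
        and (p["supports_tools"] or not needs_tools)
    ]

    def score(p):
        # Blend quality against speed, then scale by warm state.
        blended = (quality_weight * p["quality_score"]
                   + (1 - quality_weight) * p["speed_score"])
        return blended * (0.5 + 0.5 * WARM_BONUS.get(p["warm_state"], 0.0))

    return sorted(eligible, key=score, reverse=True)
```

Because the warm-state factor only scales the score, a cold candidate still wins when it is the only eligible one, which matches the "unless no better option exists" behaviour described in the table.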

VRAM management

The ranker reserves VRAM for a candidate before admitting a local request and releases that reservation when the request completes. That is how the fleet avoids overcommitting a host under mixed load.
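
The reserve-then-release pattern described above maps naturally onto a context manager. The sketch below assumes a single per-host budget in megabytes; the class name, the units, and the use of MemoryError to signal a refused reservation are all assumptions, not the router's real accounting.

```python
import threading
from contextlib import contextmanager


class VramLedger:
    """Hypothetical per-host VRAM budget with reserve/release bookkeeping."""

    def __init__(self, total_mb):
        self.total_mb = total_mb
        self.reserved_mb = 0
        self._lock = threading.Lock()

    @contextmanager
    def reserve(self, needed_mb):
        """Hold needed_mb for the duration of the with-block."""
        with self._lock:
            if self.reserved_mb + needed_mb > self.total_mb:
                raise MemoryError("not enough free VRAM on this host")
            self.reserved_mb += needed_mb
        try:
            yield
        finally:
            # Release even if the request raised, so the budget cannot leak.
            with self._lock:
                self.reserved_mb -= needed_mb
```

Releasing in the finally clause is the design choice that matters: a failed request must return its reservation, otherwise the host slowly looks full under mixed load.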

Observability

Everything important is visible somewhere.

TorchRouter publishes Prometheus metrics from /metrics, routing headers on the response, and routing continuity records to the JSONL sink when the event path is configured. The point is not just to serve the request, but to explain the request after the fact.

Prometheus: Metrics use the classifier_ prefix and cover routing, queue, spend, circuit, and fleet state.

Continuity events: When TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH is set, the router writes JSONL event records such as route_selected, fallback_selected, and degradation_ladder_step to that path.

think:false enforcement: As a belt-and-suspenders measure, the router enforces think:false on the request payload so qwen3 floor and triage lanes cannot leak reasoning narration.

Timing headers: X-Router-Timing-Score-Ms, X-Router-Timing-Queue-Ms, X-Router-Timing-Litellm-Ttft-Ms, and the other X-Router-Timing-* headers show where latency was spent.
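
A client can turn the timing headers into a per-stage breakdown. This sketch assumes each X-Router-Timing-*-Ms value is a plain number of milliseconds, which the manual implies but does not state; the function names are hypothetical.

```python
PREFIX = "x-router-timing-"
SUFFIX = "-ms"


def timing_breakdown(headers):
    """Map each X-Router-Timing-*-Ms header to a float of milliseconds."""
    stages = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered.startswith(PREFIX) and lowered.endswith(SUFFIX):
            stage = lowered[len(PREFIX):-len(SUFFIX)]
            try:
                stages[stage] = float(value)
            except ValueError:
                # Skip values that are not numeric (assumed to be rare).
                continue
    return stages


def slowest_stage(headers):
    """Return the stage with the largest timing, or None if none were found."""
    stages = timing_breakdown(headers)
    return max(stages, key=stages.get) if stages else None
```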

Deployment

Dev stack commands and prod ports are fixed in the stack files.

The launch site serves this manual on localhost:3201. In production on sauron, the TorchRouter runtime uses classifier port 4001 and LiteLLM port 4000. The dev stack runs on ports 4002 and 4003; judging by the health-check paths below, 4002 is the dev classifier and 4003 is dev LiteLLM.

# Launch site
cd /home/chrisb/torchrouter-launch-site/server
npm install
PORT=3201 npm start

# Dev router stack on sauron
docker compose -f sauron/docker-compose.yml -f sauron/docker-compose.dev.yml \
  up -d litellm-dev classifier-dev

# Dev health checks
curl http://localhost:4002/health
curl http://localhost:4003/health/liveliness

# Production ports
# 4000 = LiteLLM
# 4001 = classifier
# 9090 = Prometheus
# 3030 = Grafana

Key env vars: LITELLM_URL, LITELLM_MASTER_KEY, ROUTING_MODE, TORCHROUTER_QUEUE_ENABLED, TORCHROUTER_FAST_CHAT_CANARY_SHARE, and TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH.

Local-first default: Production defaults to passthrough routing with controlled local fallback, not broad live routing.

Content gap flagged: This manual documents the live schema and the current defaults, but runtime values like warm state, queue depth, and spend are intentionally dynamic and can change between requests.
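
For tooling that needs to read the same settings, the env vars above could be loaded along these lines. Only the variable names come from this manual. The value types, the defaults (including the literal string "passthrough" and the localhost URL), and the 0-to-1 range for the canary share are assumptions for illustration; LITELLM_MASTER_KEY is left out so the sketch never holds a secret.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouterEnv:
    """Hypothetical typed view of the documented env vars."""
    litellm_url: str
    routing_mode: str
    queue_enabled: bool
    fast_chat_canary_share: float
    continuity_event_path: Optional[str]


def load_env(env=os.environ):
    """Parse settings from a mapping; all defaults here are assumed."""
    share = float(env.get("TORCHROUTER_FAST_CHAT_CANARY_SHARE", "0"))
    if not 0.0 <= share <= 1.0:
        raise ValueError("canary share is assumed to be a fraction in [0, 1]")
    enabled = env.get("TORCHROUTER_QUEUE_ENABLED", "").lower()
    return RouterEnv(
        litellm_url=env.get("LITELLM_URL", "http://localhost:4000"),
        routing_mode=env.get("ROUTING_MODE", "passthrough"),
        queue_enabled=enabled in {"1", "true", "yes"},
        fast_chat_canary_share=share,
        continuity_event_path=env.get(
            "TORCHROUTER_ROUTING_CONTINUITY_EVENT_PATH") or None,
    )
```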

Operational Summary

Route the work, keep the lane explicit, and leave a trail.

TorchRouter is not a magic model picker. It is a control plane that makes lane choice, queue behavior, fallback rules, and evidence visible enough that operators can trust it and debug it.