API
Models
The live catalog, how we bill per token, and what the smart-router brands route to.
Models
Every model wired into Token Harbor is listed on /models. One Universal Key gets you all of them — no per-vendor signup, no per-vendor balance to keep topped up.
TH Orchestra — th-orchestra
Plan → build → review. Every request is classified server-side and routed to one of three model pools, so a coding task gets planned, built, and reviewed without you juggling models. The model you pay for is the upstream model that actually answered — there is no Orchestra surcharge.
| Stage | Pool | When it fires | Model class |
|---|---|---|---|
| Plan | planner | "how should I…", "design", "outline a plan", the opening turn of a task — and the review pass | Premium reasoning model (vision-capable) |
| Build | coder | "fix", "implement", "refactor", UI work, debugging, searching — edits + runs commands | Strong agentic coding model (vision-capable) |
| Compact | summarizer | Auto-compaction under context-window pressure | Cheap fast model |
The classifier still recognises finer intents (frontend, debugger, fetcher, reviewer), but they fold onto these pools: frontend / debugger / fetcher run on the build executor, and the review pass runs on the planner.
Images. Any request carrying an image is always routed to a vision-capable model — never a text-only one.
Session stickiness. Within one agent session the build stays pinned to a single executor model unless a clearly different stage (plan or review) is detected. Switching every turn would cold-start the vendor's prompt cache; holding it steady keeps that cache warm so your repeated context bills at the cached rate.
The classifier is multilingual and inspects the last tool call when you're mid-loop, so multi-turn agent flows like Cursor / opencode / Claude Code stay coherent between turns.
Use it from any OpenAI-compatible client:
curl https://tokenharbor.ai/v1/chat/completions \
-H "Authorization: Bearer $TH_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"th-orchestra","messages":[{"role":"user","content":"fix the off-by-one in parser.ts"}]}'
Behind the scenes the request is routed to the build (coder) pool and a coder-specific system addendum is injected. Trace each call's classified role + the upstream model that answered on /dashboard/usage.
How we bill
Metered per token, no flat per-turn fee. Every call costs:
cost = (input_tokens × upstream_in_per_1m + output_tokens × upstream_out_per_1m)
/ 1,000,000
× (1 + markup_pct / 100)
× (1 − volume_discount)
upstream_*_per_1m— the vendor's published input/output price for the model. Visible on every /models card.markup_pct— Token Harbor's per-model markup over upstream. Default is 0% today.
What counts as input_tokens. Token counts come from the upstream provider's own usage report on the final request body — which includes a small amount of gateway-added scaffolding on top of your messages:
- Every user message is wrapped in a
<UNTRUSTED_USER_INPUT>security fence (about 10–14 tokens per message) that hardens against prompt-injection attacks. th-orchestracalls additionally carry a role-specific system addendum (the coder/planner/reviewer instructions mentioned above).
These tokens are billed at the model's normal input rate because the upstream charges us for them. Direct model calls carry only the fence; if you need byte-exact prompts with zero gateway additions, prompt-caching passthrough mode (requests with cache_control markers) skips the fence entirely.
Three caching layers further reduce real billed amount:
| Layer | Triggered when | What it saves |
|---|---|---|
| Upstream cache (vendor side) | Long repeated prefixes (≥1024 tokens). Marked with the Upstream cache badge on /models. | ~90% off the cached input portion. |
| Semantic cache (Token Harbor) | Same/very-similar question seen recently. | 100% — billed $0, no upstream call. |
| Exact cache (Token Harbor) | Same model + same messages + same sampling within 5 minutes. | 100% — billed $0, no upstream call. |
Cache savings show as a green dollar line on /dashboard/usage under Today's Usage.
Knowing what you're paying
- /models shows every model's live price, release date, and knowledge cutoff.
- /v1/models returns the same data as JSON for SDKs that pre-fetch the catalog.
- /dashboard/usage shows your last 100 requests with the exact tokens-in / tokens-out / cost-paid for each one. Export to CSV from there.
Bypassing or forcing cache
Per-request control via header:
| Header | Behaviour |
|---|---|
X-TH-Cache-Control: bypass | Skip lookup. The response is still written into the exact cache so the next identical request can hit. |
X-TH-Cache-Control: force-refresh | Skip lookup AND skip write. Pure passthrough — useful for testing/fresh-roll scenarios. |
When a response is served from cache, the route adds X-TH-Cache-Layer: exact (or semantic) so your client can tell.
Context limits
We never truncate your conversation. Each request goes straight to the upstream; if the upstream refuses with context_length_exceeded, that error is forwarded verbatim.