Models

The live catalog, how we bill per token, and what the smart-router brands route to.

Models

Every model wired into Token Harbor is listed on /models. One Universal Key gets you all of them — no per-vendor signup, no per-vendor balance to keep topped up.

TH Orchestra — `th-orchestra`

Plan → build → review. Every request is classified server-side and routed to one of three model pools, so a coding task gets planned, built, and reviewed without you juggling models. The model you pay for is the upstream model that actually answered — there is no Orchestra surcharge.

Stage	Pool	When it fires	Model class
Plan	`planner`	"how should I…", "design", "outline a plan", the opening turn of a task — and the review pass	Premium reasoning model (vision-capable)
Build	`coder`	"fix", "implement", "refactor", UI work, debugging, searching — edits + runs commands	Strong agentic coding model (vision-capable)
Compact	`summarizer`	Auto-compaction under context-window pressure	Cheap fast model

The classifier still recognises finer intents (frontend, debugger, fetcher, reviewer), but they fold onto these pools: frontend / debugger / fetcher run on the build executor, and the review pass runs on the planner.
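To make the folding concrete, here is a minimal sketch of the intent-to-pool mapping described above. The intent and pool names come from this page; the dictionary and the `pool_for` helper are purely illustrative, since the real classifier runs server-side and is not exposed.

```python
# Illustrative only: names are taken from the text above, but the gateway's
# actual classifier is server-side and may be implemented differently.
POOLS = {"planner", "coder", "summarizer"}

INTENT_TO_POOL = {
    "frontend": "coder",    # UI work runs on the build executor
    "debugger": "coder",    # debugging runs on the build executor
    "fetcher": "coder",     # searching runs on the build executor
    "reviewer": "planner",  # the review pass runs on the planner
}


def pool_for(intent):
    """Return the pool a classified intent would run on (hypothetical helper)."""
    if intent in POOLS:
        return intent
    # Raises KeyError for intents this page does not mention.
    return INTENT_TO_POOL[intent]
```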

Images. Any request carrying an image is always routed to a vision-capable model — never a text-only one.

Session stickiness. Within one agent session the build stays pinned to a single executor model unless a clearly different stage (plan or review) is detected. Switching every turn would cold-start the vendor's prompt cache; holding it steady keeps that cache warm so your repeated context bills at the cached rate.

The classifier is multilingual and inspects the last tool call when you're mid-loop, so multi-turn agent flows like Cursor / opencode / Claude Code stay coherent between turns.

Use it from any OpenAI-compatible client:

curl https://tokenharbor.ai/v1/chat/completions \
  -H "Authorization: Bearer $TH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"th-orchestra","messages":[{"role":"user","content":"fix the off-by-one in parser.ts"}]}'

Behind the scenes the request is routed to the build (coder) pool and a coder-specific system addendum is injected. You can trace each call's classified role and the upstream model that answered on /dashboard/usage.
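If you would rather call the endpoint from Python than from curl, the same request can be sketched with the standard library alone. The URL, headers, and body mirror the curl example; `build_chat_request` and `send` are hypothetical helper names, and the response parsing assumes the usual OpenAI chat-completions shape, which this page does not spell out.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
CHAT_URL = "https://tokenharbor.ai/v1/chat/completions"


def build_chat_request(prompt, api_key, model="th-orchestra"):
    """Build (but do not send) a chat-completions request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        CHAT_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Typical use would be `send(build_chat_request("fix the off-by-one in parser.ts", os.environ["TH_API_KEY"]))`, after importing `os`.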

How we bill

Metered per token, no flat per-turn fee. Every call costs:

cost = (input_tokens × upstream_in_per_1m + output_tokens × upstream_out_per_1m)
       / 1,000,000
       × (1 + markup_pct / 100)
       × (1 − volume_discount)

upstream_*_per_1m — the vendor's published input/output price for the model. Visible on every /models card.
markup_pct — Token Harbor's per-model markup over upstream. Default is 0% today.
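As a worked example, the formula above translates directly into a few lines of Python. The prices in the usage note are made up for illustration (check /models for real ones), and `volume_discount` is treated as a fraction such as 0.10 for 10%, which is how the `(1 − volume_discount)` term reads.

```python
def call_cost(
    input_tokens,
    output_tokens,
    upstream_in_per_1m,
    upstream_out_per_1m,
    markup_pct=0.0,       # default markup is 0% per this page
    volume_discount=0.0,  # assumed to be a fraction, e.g. 0.10 for 10%
):
    """Cost of one call in the same currency as the per-1M prices."""
    base = (
        input_tokens * upstream_in_per_1m
        + output_tokens * upstream_out_per_1m
    ) / 1_000_000
    return base * (1 + markup_pct / 100) * (1 - volume_discount)
```

With hypothetical prices of $3 per 1M input tokens and $15 per 1M output tokens, a call with 12,000 input and 3,000 output tokens comes to (36,000 + 45,000) / 1,000,000 = $0.081 at 0% markup and no discount.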

What counts as input_tokens. Token counts come from the upstream provider's own usage report on the final request body — which includes a small amount of gateway-added scaffolding on top of your messages:

Every user message is wrapped in a <UNTRUSTED_USER_INPUT> security fence (about 10–14 tokens per message) that hardens against prompt-injection attacks.
th-orchestra calls additionally carry a role-specific system addendum (the coder/planner/reviewer instructions mentioned above).

These tokens are billed at the model's normal input rate because the upstream charges us for them. Direct model calls carry only the fence; if you need byte-exact prompts with zero gateway additions, prompt-caching passthrough mode (requests with cache_control markers) skips the fence entirely.
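To get a feel for how much the fence adds, you can bound it from the number of user messages in a request. This is a rough sketch using the 10 to 14 tokens-per-message figure quoted above; it ignores the th-orchestra system addendum, whose size is not stated here, and `fence_overhead` is a hypothetical helper.

```python
# Per-message fence size quoted on this page ("about 10 to 14 tokens").
FENCE_TOKENS_LOW = 10
FENCE_TOKENS_HIGH = 14


def fence_overhead(messages):
    """Return (low, high) estimated extra input tokens from the security fence.

    Only messages with role "user" are counted, since the fence wraps
    user messages.
    """
    user_count = sum(1 for m in messages if m.get("role") == "user")
    return user_count * FENCE_TOKENS_LOW, user_count * FENCE_TOKENS_HIGH
```

A conversation with 40 user turns would therefore carry an estimated 400 to 560 extra input tokens per request.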

Three caching layers further reduce the amount you are actually billed:

Layer	Triggered when	What it saves
Upstream cache (vendor side)	Long repeated prefixes (≥1024 tokens). Marked with the `Upstream cache` badge on /models.	~90% off the cached input portion.
Semantic cache (Token Harbor)	Same/very-similar question seen recently.	100% — billed $0, no upstream call.
Exact cache (Token Harbor)	Same model + same messages + same sampling within 5 minutes.	100% — billed $0, no upstream call.

Cache savings show as a green dollar line on /dashboard/usage under Today's Usage.
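The effect of the upstream cache on the input side of a bill can be estimated as below. This is a sketch under assumptions: the 90% figure is the approximate saving quoted in the table, actual vendor discounts vary, and `input_cost_with_upstream_cache` is a hypothetical helper rather than anything the gateway exposes.

```python
def input_cost_with_upstream_cache(
    input_tokens,
    cached_tokens,
    upstream_in_per_1m,
    cached_discount=0.90,  # "~90% off the cached input portion"; approximate
):
    """Estimate input cost when part of the prompt hits the vendor's cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    # Cached tokens are billed at the discounted rate, the rest at full rate.
    billable = uncached_tokens + cached_tokens * (1 - cached_discount)
    return billable * upstream_in_per_1m / 1_000_000
```

For example, at a hypothetical $3 per 1M input tokens, a 10,000-token prompt with 8,000 cached tokens costs roughly $0.0084 on the input side instead of $0.03.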

Knowing what you're paying

/models shows every model's live price, release date, and knowledge cutoff.
/v1/models returns the same data as JSON for SDKs that pre-fetch the catalog.
/dashboard/usage shows your last 100 requests with the exact tokens-in / tokens-out / cost-paid for each one. Export to CSV from there.
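For SDKs or scripts that pre-fetch the catalog, something like the following may be enough. Treat it as a sketch: the host is inferred from the curl example, sending the key on this endpoint is an assumption, and the `{"data": [{"id": ...}]}` shape is the common OpenAI-compatible convention rather than something this page documents.

```python
import json
import urllib.request

# Assumed from the host used in the curl example plus the /v1/models path.
MODELS_URL = "https://tokenharbor.ai/v1/models"


def model_ids(payload):
    """Extract model ids, assuming an OpenAI-style {"data": [{"id": ...}]} body."""
    return [
        entry["id"]
        for entry in payload.get("data", [])
        if isinstance(entry, dict) and "id" in entry
    ]


def fetch_model_ids(api_key, timeout=30):
    """Fetch the catalog and return its model ids (performs a network call)."""
    request = urllib.request.Request(
        MODELS_URL, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return model_ids(json.load(resp))
```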

Bypassing or forcing cache

Per-request control via header:

Header	Behaviour
`X-TH-Cache-Control: bypass`	Skip lookup. The response is still written into the exact cache so the next identical request can hit.
`X-TH-Cache-Control: force-refresh`	Skip lookup AND skip write. Pure passthrough — useful for testing/fresh-roll scenarios.

When a response is served from cache, it carries an X-TH-Cache-Layer: exact (or semantic) response header so your client can tell.
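On the client side, the two headers can be handled with a pair of small helpers. The header names and values are the ones listed above; `cache_control_header` and `served_from` are hypothetical names, and the lookup is case-insensitive because HTTP header names are.

```python
VALID_CACHE_MODES = ("bypass", "force-refresh")


def cache_control_header(mode):
    """Return the request header dict for a per-request cache override."""
    if mode not in VALID_CACHE_MODES:
        raise ValueError(f"unknown cache mode: {mode!r}")
    return {"X-TH-Cache-Control": mode}


def served_from(response_headers):
    """Return "exact" or "semantic" if the response came from cache, else None."""
    for name, value in response_headers.items():
        if name.lower() == "x-th-cache-layer":
            return value.strip()
    return None
```

Merge the result of `cache_control_header("bypass")` into your request headers, then pass the response headers to `served_from` to see which layer, if any, answered.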

Context limits

We never truncate your conversation. Each request goes straight to the upstream; if the upstream refuses with context_length_exceeded, that error is forwarded verbatim.
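Because the error is forwarded verbatim, trimming history is the client's job. A minimal detection sketch follows; it assumes an OpenAI-style error envelope (`{"error": {"code": ...}}`), which may not hold for every upstream vendor, and `is_context_overflow` is a hypothetical helper.

```python
import json


def is_context_overflow(body):
    """Return True if an error body reports context_length_exceeded.

    Assumes an OpenAI-style envelope; other vendors may use a different shape.
    """
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False  # not JSON, so we cannot tell
    if not isinstance(payload, dict):
        return False
    error = payload.get("error")
    if not isinstance(error, dict):
        return False
    # Check both fields, since vendors differ on where the code lives.
    marker = "context_length_exceeded"
    return error.get("code") == marker or error.get("type") == marker
```

When this returns True, a client would typically drop or summarise older turns and retry.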
