Quick start guide
Registration, API key, integration, and compression: everything in 4 simple steps
Registration
Creating an API Key
Integration
TokenCompress vs Prompt Caching
Not all providers cache prompts. OpenAI, DeepSeek, Mistral, and Qwen do not cache prompts server-side. With them, TokenCompress delivers maximum savings: up to 87% fewer tokens per request.
Providers with caching (Anthropic, Google) discount repeated tokens within a TTL window (typically ~5 min). In practice, though, developers rarely work with a single file: the model operates on dozens of files, the context changes constantly, and the cache misses. Every miss means full price.
TokenCompress compresses every request, whether it is a new file or an old one, the first question or the hundredth. No TTL, no misses. Fragments under 500 tokens are not compressed (not worth it); everything else is processed in under a second. Here's how they compare:
| | Prompt Caching | TokenCompress |
|---|---|---|
| What it does | Discounts price per token for repeated context | Reduces number of tokens sent to the LLM |
| When it works | Same context repeated within TTL (typically ~5 min) | Every request, regardless of context or timing |
| Different files | Cache miss: full price | Still compresses: same savings |
| Provider support | Anthropic, Google (varies by provider) | Works with any provider |
| Combined | With providers that have caching, you can use both: TokenCompress reduces token count, caching discounts whatever remains on repeated calls | |
Bottom line: With providers that don't cache (OpenAI, DeepSeek, Mistral, Qwen), TokenCompress saves up to 87% of tokens. With caching providers (Anthropic, Google), compression still works and is especially useful on cache misses, which are inevitable when working with many files.
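To make the "Combined" row concrete, here is a rough back-of-the-envelope sketch. The 30,000-token request and the 66% compression figure are taken from the Vaultwarden example in this guide; the 30% cache hit rate and the 90% cache discount are illustrative assumptions, not measured values or published prices. The model also treats the whole prompt as cacheable and ignores any cache-write surcharge.

```python
def billed_token_equivalent(tokens, compression=0.0,
                            cache_hit_rate=0.0, cache_discount=0.0):
    """Average number of full-price input tokens billed for one request.

    compression    -- fraction of tokens removed before sending (0.66 = 66%)
    cache_hit_rate -- fraction of requests that hit the provider cache
    cache_discount -- price reduction on a cache hit (0.9 = 90% cheaper)
    All inputs are hypothetical values chosen for illustration.
    """
    sent = tokens * (1.0 - compression)          # tokens that reach the LLM
    hit_cost = sent * (1.0 - cache_discount)     # cost when the cache hits
    return cache_hit_rate * hit_cost + (1.0 - cache_hit_rate) * sent


scenarios = {
    "no optimization": billed_token_equivalent(30_000),
    "caching only": billed_token_equivalent(
        30_000, cache_hit_rate=0.3, cache_discount=0.9),
    "compression only": billed_token_equivalent(30_000, compression=0.66),
    "both": billed_token_equivalent(
        30_000, compression=0.66, cache_hit_rate=0.3, cache_discount=0.9),
}

for name, cost in scenarios.items():
    print(f"{name:>17}: {cost:,.0f} token-equivalents")
```

Changing `cache_hit_rate` shows how the balance shifts: the more often the cache misses, the larger the share of savings that comes from compression.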
Add this to your ~/.continue/config.yaml file. The apiKey consists of two parts: your TokenCompress key (created in the dashboard) and your LLM provider key, joined with :: (double colon). Set apiBase to https://tokencompress.com/v1/{provider} where {provider} matches the table below:
# ~/.continue/config.yaml
models:
  - name: TokenCompress - DeepSeek
    provider: openai
    model: deepseek-chat
    apiKey: ak_live_xxx...xxx::sk-your-provider-key
    apiBase: https://tokencompress.com/v1/deepseek
    roles:
      - chat
      - edit
      - apply
    defaultCompletionOptions:
      stream: true
apiKey = ak_live_... :: provider-api-key
The apiKey field is a composite key consisting of two parts separated by double colons (::). The first part (ak_live_...) is created in your TokenCompress dashboard. The second part is your LLM provider's own API key (e.g. sk-... for DeepSeek or OpenAI). Example: ak_live_abc123::sk-your-provider-key
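If you build the composite key in code rather than pasting it into a config file, a small helper keeps the format consistent. This is a minimal sketch: the function names and the environment variable names (`TOKENCOMPRESS_API_KEY`, `PROVIDER_API_KEY`) are hypothetical and not part of TokenCompress.

```python
import os
from typing import Tuple

SEPARATOR = "::"  # double colon, as described above


def make_composite_key(tokencompress_key: str, provider_key: str) -> str:
    """Join the TokenCompress key and the provider key with a double colon."""
    if not tokencompress_key or not provider_key:
        raise ValueError("both keys are required")
    if SEPARATOR in tokencompress_key or SEPARATOR in provider_key:
        raise ValueError("individual keys must not contain '::'")
    return f"{tokencompress_key}{SEPARATOR}{provider_key}"


def split_composite_key(composite: str) -> Tuple[str, str]:
    """Split a composite key back into (TokenCompress key, provider key)."""
    first, sep, second = composite.partition(SEPARATOR)
    if not sep or not first or not second:
        raise ValueError("expected '<tokencompress-key>::<provider-key>'")
    return first, second


def composite_key_from_env() -> str:
    """Read both parts from (hypothetical) environment variables."""
    return make_composite_key(
        os.environ["TOKENCOMPRESS_API_KEY"],
        os.environ["PROVIDER_API_KEY"],
    )
```

Keeping the two parts in separate environment variables also avoids committing either key to a shared config file.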
After saving the config, restart Continue. The model "TokenCompress - DeepSeek" will appear in the model list. All your requests will be automatically compressed, saving up to 87% on tokens.
TokenCompress is compatible with the OpenAI API. Set base_url to https://tokencompress.com/v1/{provider} and api_key to your composite key (TokenCompress key :: provider key):
This example downloads a 3,100+ line Rust file from the open-source Vaultwarden project and asks the LLM to analyze a specific function. TokenCompress automatically compresses the code context before sending it to the LLM, saving ~66% of tokens.
import urllib.request
from langchain_openai import ChatOpenAI

# Point LangChain at the TokenCompress proxy
llm = ChatOpenAI(
    base_url="https://tokencompress.com/v1/anthropic",
    api_key="ak_live_xxx...xxx::sk-ant-your-anthropic-key",
    model="claude-sonnet-4-20250514",
)

# Download a large open-source Rust file (3,000+ lines)
url = "https://raw.githubusercontent.com/dani-garcia/" \
      "vaultwarden/main/src/api/core/organizations.rs"
org_rs = urllib.request.urlopen(url).read().decode()

# Ask about a specific function; TokenCompress compresses
# the code context automatically before forwarding to Anthropic
response = llm.invoke(
    f"""Here is the file organizations.rs:
```rust
{org_rs}
```
What does post_groups do if org_groups_enabled() returns false?"""
)
print(response.content)
Result: The 3,100+ line file is compressed from ~30,000 to ~10,200 tokens before reaching the LLM. You pay only for the compressed tokens, and the LLM still answers correctly.
Install the package first: pip install langchain-openai. All standard LangChain features work: chains, agents, tools, and output parsers.
Use the same ChatOpenAI client inside your LangGraph nodes. Same composite api_key and base_url format as LangChain:
This example builds a code review agent that downloads the Vaultwarden organizations.rs file and performs a multi-step security review. Each agent step sends the full code context through TokenCompress, and compression is applied automatically at every LLM call.
import urllib.request
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.graph import StateGraph, MessagesState, END

llm = ChatOpenAI(
    base_url="https://tokencompress.com/v1/anthropic",
    api_key="ak_live_xxx...xxx::sk-ant-your-anthropic-key",
    model="claude-sonnet-4-20250514",
)

# Download a large Rust file from Vaultwarden
url = "https://raw.githubusercontent.com/dani-garcia/" \
      "vaultwarden/main/src/api/core/organizations.rs"
org_rs = urllib.request.urlopen(url).read().decode()

# Step 1: Identify security-sensitive functions
def find_sensitive(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

# Step 2: Review those functions for vulnerabilities
def review_security(state: MessagesState):
    followup = HumanMessage(
        content="Now review the functions you identified for "
                "authorization bypasses or injection vulnerabilities."
    )
    return {"messages": [llm.invoke(state["messages"] + [followup])]}

graph = StateGraph(MessagesState)
graph.add_node("find_sensitive", find_sensitive)
graph.add_node("review_security", review_security)
graph.set_entry_point("find_sensitive")
graph.add_edge("find_sensitive", "review_security")
graph.add_edge("review_security", END)
app = graph.compile()

# Run the agent; TokenCompress compresses at every step
result = app.invoke({"messages": [
    SystemMessage(content="You are a security auditor."),
    HumanMessage(content=f"""Review this Rust file:
```rust
{org_rs}
```
List all functions that handle authentication or authorization."""),
]})
print(result["messages"][-1].content)
Savings: The 3,100+ line file is sent through TokenCompress at every graph step. Code context that doesn't relate to the query gets compressed away, so the LLM sees only the relevant functions.
Install: pip install langchain-openai langgraph. TokenCompress compresses context transparently, so your graph logic stays exactly the same.
OpenAI Codex CLI is a terminal-based coding agent that reads files, runs shell commands, and writes code. TokenCompress compresses every file and command output in the conversation, dramatically reducing token usage for multi-turn sessions.
Create or edit ~/.codex/config.toml:
# ~/.codex/config.toml
model = "claude-haiku-4-5-20251001"
model_reasoning_effort = "medium"
openai_base_url = "https://tokencompress.com/v1/anthropic"
openai_api_key = "ak_live_xxx...xxx::sk-ant-your-anthropic-key"
Full config example with project trust and Windows settings:
# ~/.codex/config.toml (full example)
model = "claude-haiku-4-5-20251001"
model_reasoning_effort = "medium"
service_tier = "fast"
openai_base_url = "https://tokencompress.com/v1/anthropic"
openai_api_key = "ak_live_xxx...xxx::sk-ant-your-anthropic-key"
# Trust your project directories
[projects.'~/my-project']
trust_level = "trusted"
# Windows: use elevated sandbox
[windows]
sandbox = "elevated"
Then run Codex CLI in a project directory:
# Clone Vaultwarden and ask Codex to analyze it
$ git clone https://github.com/dani-garcia/vaultwarden.git
$ cd vaultwarden
$ codex "Read src/api/core/organizations.rs and explain what post_groups does if org_groups_enabled() returns false"
How it works: Codex reads the 3,100+ line file, and TokenCompress compresses the code before each LLM call. In our tests, tool outputs (file contents, grep results) were compressed from 2,221 to 916 tokens (59% savings) and from 1,167 to 443 tokens (62% savings).
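The percentages quoted above follow directly from the token counts. A short sketch of the arithmetic (the function name is ours, chosen for illustration):

```python
def savings_percent(original_tokens: int, compressed_tokens: int) -> float:
    """Percentage of tokens removed by compression."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (1.0 - compressed_tokens / original_tokens)


# Token counts quoted in this guide: (before, after)
measurements = [(2221, 916), (1167, 443)]

for before, after in measurements:
    print(f"{before} -> {after} tokens: "
          f"{savings_percent(before, after):.0f}% saved")
# prints:
# 2221 -> 916 tokens: 59% saved
# 1167 -> 443 tokens: 62% saved
```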
openai_api_key = ak_live_... :: provider-api-key
The openai_api_key value is a composite key consisting of two parts separated by a double colon (::). The first part (ak_live_...) is created in your TokenCompress dashboard. The second part is your LLM provider's own API key (e.g. sk-ant-... for Anthropic). Example: ak_live_abc123::sk-ant-your-anthropic-key
Codex CLI uses WebSocket (wss://), which TokenCompress fully supports. Works with any model: Claude, GPT-4o, DeepSeek, etc.
Use the openai Python package directly. This works with any OpenAI-compatible client: just change the base_url.
import urllib.request
from openai import OpenAI

client = OpenAI(
    base_url="https://tokencompress.com/v1/anthropic",
    api_key="ak_live_xxx...xxx::sk-ant-your-anthropic-key",
)

# Download Vaultwarden organizations.rs (3,000+ lines of Rust)
url = "https://raw.githubusercontent.com/dani-garcia/" \
      "vaultwarden/main/src/api/core/organizations.rs"
org_rs = urllib.request.urlopen(url).read().decode()

# TokenCompress compresses the code, then forwards to Anthropic
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": f"""
```rust
{org_rs}
```
What does post_groups do if org_groups_enabled() returns false?"""}],
)
print(response.choices[0].message.content)
Works with pip install openai. No additional dependencies needed.
Claude Code is Anthropic's agentic CLI tool that reads files, runs commands, and writes code using the native Anthropic Messages API. TokenCompress compresses tool results (file contents, command output) before each API call.
Install Claude Code and configure ~/.claude/settings.json:
// Install Claude Code
$ npm install -g @anthropic-ai/claude-code
// ~/.claude/settings.json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://tokencompress.com/v1/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "ak_live_xxx...xxx::sk-ant-your-anthropic-key",
    "ANTHROPIC_MODEL": "claude-sonnet-4-20250514"
  }
}
ANTHROPIC_MODEL is optional; Claude Code uses its default model if omitted. You can also override it per session: claude --model claude-sonnet-4-20250514
Then run Claude Code in your project:
$ cd my-project
$ claude "Read src/main.rs and explain what it does"
How it works: Claude Code sends requests in native Anthropic Messages API format. TokenCompress extracts code from tool results (file reads, command output), compresses it, and forwards the request to Anthropic, all transparently. SSE streaming works end-to-end.
ANTHROPIC_AUTH_TOKEN = ak_live_... :: sk-ant-...
The ANTHROPIC_AUTH_TOKEN value is a composite key consisting of two parts separated by a double colon (::). The first part (ak_live_...) is created in your TokenCompress dashboard. The second part is your Anthropic API key (sk-ant-...). Example: ak_live_abc123::sk-ant-your-anthropic-key
On savings with Claude Code
Claude Code uses Anthropic's built-in prompt caching. For details on how caching affects savings, see the "TokenCompress vs Prompt Caching" section above.
Depending on how it is configured, Claude Code sends the key either in an x-api-key header or as a Bearer token. TokenCompress handles both automatically.
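Accepting both header styles could look roughly like this on the proxy side. This is an illustration only, not TokenCompress's actual implementation; the function name and the precedence (x-api-key first, then Authorization) are assumptions.

```python
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the key from x-api-key or 'Authorization: Bearer', else None."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}

    key = normalized.get("x-api-key")
    if key:
        return key

    scheme, _, token = normalized.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return None
```

Either way the extracted value is the same composite key, so the rest of the request handling does not need to know which header carried it.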
OpenClaw is a multi-provider AI client (similar to Claude Code) that supports OpenAI, Anthropic, Google, DeepSeek and other providers via native SDKs. Configure it to use TokenCompress by changing the baseUrl in your provider config:
Option A: OpenAI-compatible providers (OpenAI, DeepSeek, Ollama)
// ~/.openclaw/config.json
{
  "providers": {
    "openai": {
      "baseUrl": "https://tokencompress.com/v1/openai",
      "apiKey": "ak_live_xxx...xxx::sk-your-openai-key",
      "models": ["gpt-4o", "o3-mini"]
    }
  }
}
Works with any OpenAI-compatible provider. Replace the provider path segment (e.g. /v1/openai, /v1/deepseek) to match your provider.
Option B: Anthropic via native API
// ~/.openclaw/config.json
{
  "providers": {
    "anthropic": {
      "baseUrl": "https://tokencompress.com/v1/anthropic",
      "apiKey": "ak_live_xxx...xxx::sk-ant-your-anthropic-key",
      "models": ["claude-sonnet-4-20250514"]
    }
  }
}
OpenClaw sends native Anthropic Messages API requests. TokenCompress handles them transparently, including SSE streaming and cache_control fields.
apiKey = ak_live_... :: provider-api-key
The apiKey field is a composite key consisting of two parts separated by double colons (::). The first part (ak_live_...) is created in your TokenCompress dashboard. The second part is your LLM provider's own API key (e.g. sk-... for DeepSeek or OpenAI). Example: ak_live_abc123::sk-your-provider-key
Savings depend on the provider
For details, see the "TokenCompress vs Prompt Caching" section above.
OpenClaw supports multiple providers simultaneously. You can configure each provider with its own TokenCompress baseUrl for maximum savings.
Supported Providers
The apiBase URL must end with the provider path segment shown in the table below. Set the provider field to the value listed for your provider; for most providers this is openai, because TokenCompress uses an OpenAI-compatible API format.
| LLM Provider | provider field | apiBase URL | Note |
|---|---|---|---|
| OpenAI | openai | https://tokencompress.com/v1/openai | |
| Anthropic | anthropic | https://tokencompress.com/v1/anthropic | |
| DeepSeek | openai | https://tokencompress.com/v1/deepseek | |
| Google AI (Gemini) | openai | https://tokencompress.com/v1/google | |
| Mistral AI | openai | https://tokencompress.com/v1/mistral | |
| Qwen (Alibaba) | openai | https://tokencompress.com/v1/qwen | |
| OpenRouter | openrouter | https://tokencompress.com/v1/openrouter | |
| LM Studio (Local) | openai | https://tokencompress.com/v1/lm-studio | Enterprise |
| Ollama (Local) | openai | https://tokencompress.com/v1/ollama | Enterprise |
| Ollama (Cloud) | openai | https://tokencompress.com/v1/ollama-cloud | |
| Kilo AI | openai | https://tokencompress.com/v1/kilo | |
model field:
Use any model name supported by your chosen LLM provider. TokenCompress places no restrictions on this value.
Need a provider not listed here? Contact us and we'll add it promptly.
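If you configure several providers from code, a lookup built from the table above catches typos in the path segment early. This is a small sketch; the helper name `api_base_for` is hypothetical, and the set of slugs simply mirrors the URLs listed in the table.

```python
BASE = "https://tokencompress.com/v1"

# Path segments copied from the apiBase URL column of the table above.
PROVIDER_SLUGS = {
    "openai", "anthropic", "deepseek", "google", "mistral", "qwen",
    "openrouter", "lm-studio", "ollama", "ollama-cloud", "kilo",
}


def api_base_for(provider_slug: str) -> str:
    """Build the apiBase / base_url for a supported provider slug."""
    slug = provider_slug.strip().lower()
    if slug not in PROVIDER_SLUGS:
        raise ValueError(f"unsupported provider: {provider_slug!r}")
    return f"{BASE}/{slug}"


print(api_base_for("deepseek"))
# prints: https://tokencompress.com/v1/deepseek
```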
How Compression Works
TokenCompress uses a multi-stage intelligent compression pipeline. Code is not simply truncated: it goes through deep structural analysis and semantic extraction to preserve maximum meaning with minimum tokens.
Step 1: Structural Analysis
The incoming code is parsed through language-aware AST (Abstract Syntax Tree) analyzers supporting 22+ programming languages. The system builds a full structural map: functions, classes, modules, imports, type signatures, and their relationships. This allows precise identification of code boundaries and dependency graphs.
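To give a feel for what a structural map looks like, here is a minimal sketch using Python's standard `ast` module on a Python snippet. It illustrates the general technique only; TokenCompress's own analyzers, the languages they cover, and their output format are not shown here, and the `structural_map` function and `SAMPLE` source are made up for this example.

```python
import ast


def structural_map(source: str) -> dict:
    """Collect imports, classes, and function signatures from Python source."""
    outline = {"imports": [], "classes": [], "functions": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            outline["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            outline["imports"].append(node.module or ".")
        elif isinstance(node, ast.ClassDef):
            outline["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = ", ".join(arg.arg for arg in node.args.args)
            outline["functions"].append(f"{node.name}({params})")
    return outline


SAMPLE = '''
import os
from pathlib import Path

class Store:
    def load(self, path):
        return Path(path).read_text()

def main(argv):
    return Store().load(argv[1])
'''

print(structural_map(SAMPLE))
```

The outline keeps names and signatures but drops bodies, which is the kind of information later stages can use to decide what to keep in full.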
Step 2: Semantic Relevance Scoring
Each code fragment is evaluated for semantic relevance to the current query context. The scoring engine weighs structural importance (public API, entry points, error handling), reference frequency, and proximity to the user's question. Low-relevance fragments (boilerplate, redundant imports, formatting, repetitive patterns) are marked for compression.
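The idea of ranking fragments against a query can be sketched with plain term overlap. This is a deliberately simple stand-in for real semantic scoring: it ignores structural importance and reference frequency, and the fragments below are made-up placeholders, not real Vaultwarden code.

```python
import re


def terms(text: str) -> set:
    """Split text into lowercase word parts, so post_groups gives post, groups."""
    parts = re.findall(r"[A-Za-z][a-z0-9]*", text)
    return {part.lower() for part in parts if len(part) > 2}


def relevance(fragment: str, query: str) -> float:
    """Fraction of query terms that also occur in the fragment (0.0 to 1.0)."""
    query_terms = terms(query)
    if not query_terms:
        return 0.0
    return len(query_terms & terms(fragment)) / len(query_terms)


def rank_fragments(fragments: dict, query: str) -> list:
    """Return fragment names ordered from most to least relevant."""
    return sorted(fragments,
                  key=lambda name: relevance(fragments[name], query),
                  reverse=True)


QUERY = "What does post_groups do if org_groups_enabled() returns false?"
FRAGMENTS = {
    "get_users": "async fn get_users(conn) { conn.list_users() }",
    "post_groups": "async fn post_groups(org_id) "
                   "{ if !org_groups_enabled() { return Err(disabled) } }",
}

print(rank_fragments(FRAGMENTS, QUERY))
```

Fragments that end up at the bottom of the ranking are the candidates for compression in the next step.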
Step 3: Intelligent Compression
Marked fragments undergo multi-level compression: signature-preserving extraction (function stubs with type info), semantic deduplication, and context-aware summarization. The system maintains structural coherence: compressed output is still valid, parseable code context that LLMs can reason about accurately.
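Signature-preserving extraction can be illustrated on Python source with the standard `ast` module (Python 3.9+ for `ast.unparse`). This sketch only shows the stub idea: low-relevance functions keep their signature and type hints while the body is replaced with `...`. The `stub_functions` helper and the `keep` parameter are hypothetical, and this is not TokenCompress's implementation.

```python
import ast


def stub_functions(source: str, keep: set) -> str:
    """Replace bodies of functions not named in `keep` with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name not in keep:
            # Keep name, parameters, and annotations; drop the body.
            node.body = [ast.Expr(value=ast.Constant(value=...))]
    return ast.unparse(ast.fix_missing_locations(tree))


SAMPLE = '''
def add(a: int, b: int) -> int:
    total = a + b
    return total

def main() -> None:
    print(add(1, 2))
'''

print(stub_functions(SAMPLE, keep={"main"}))
```

The result still parses as Python, so a reader (or an LLM) can see that `add` exists and what it accepts without paying for its body.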
Step 4: Integrity Verification
The compressed output passes through a verification layer that ensures referential integrity: all referenced symbols remain resolvable, type chains are intact, and no critical call paths are broken. This guarantees that the LLM receives a coherent, self-consistent context.
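A referential-integrity check can be approximated by comparing the names a piece of code reads with the names it defines. The sketch below does this for Python source and is intentionally scope-insensitive; it is an illustration of the idea, not the verification layer TokenCompress uses, and `unresolved_names` is a hypothetical helper.

```python
import ast
import builtins


def unresolved_names(source: str) -> set:
    """Names that are read but never defined, imported, or built in."""
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)            # function parameters
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ExceptHandler) and node.name:
            defined.add(node.name)           # `except Error as name`
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)            # name is read
            else:
                defined.add(node.id)         # name is assigned or deleted
    return used - defined


INTACT = "def helper(x):\n    return x + 1\n\ndef main():\n    return helper(2)\n"
BROKEN = "def main():\n    return helper(2)\n"

print(unresolved_names(INTACT))   # prints: set()
print(unresolved_names(BROKEN))   # prints: {'helper'}
```

If compression had dropped `helper` entirely, the check flags it, and the pipeline could restore at least a stub for that symbol.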
Result
A 3,100+ line source file (~30,000 tokens) is compressed to ~10,200 tokens, a 66% reduction, while retaining all semantically significant structures. The LLM receives a dense, high-signal context and produces answers of the same quality as with the full uncompressed input.
Compression activates automatically for code blocks exceeding 500 tokens. Fragments below this threshold are passed through unchanged, because the overhead would exceed the savings.
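The threshold rule amounts to a simple gate in front of the compressor. In this sketch the token estimate (about four characters per token) is a rough assumption rather than a real tokenizer, and `compress` stands for any hypothetical compression function passed in by the caller.

```python
from typing import Callable

THRESHOLD_TOKENS = 500  # threshold stated above


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def maybe_compress(block: str, compress: Callable[[str], str],
                   threshold: int = THRESHOLD_TOKENS) -> str:
    """Pass small blocks through unchanged; compress only larger ones."""
    if estimate_tokens(block) <= threshold:
        return block
    return compress(block)
```

For example, `maybe_compress(small_snippet, my_compressor)` returns `small_snippet` untouched when the estimate is at or below 500 tokens, so short fragments incur no processing overhead.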