Understand Token Cost

Understanding Token Billing: Why Your AI API Bill Is Exploding (And How to Cap It)

OpenAI or Anthropic APIs seem simple: send a prompt, get a response. But behind this simplicity lies a billing model based on "tokens." In 2026, with increasingly long context windows, these units of measurement have become the primary drivers of budget unpredictability.

Published June 13, 2026 · 7 min read

The mechanics of tokens: understanding the unit

To bill for model usage, AI providers use the concept of a "token." In English, 1 token is approximately 0.75 words — so 1,000 tokens is roughly 750 words of text. However, this ratio is far from fixed. Tokenization depends on the specific vocabulary of each model. Technical terms, code, special characters, and non-English text all tokenize differently, often consuming more tokens per character than plain prose.

A few things that surprise developers when they first look at their token usage:

A 10-line Python function might cost 150–250 tokens depending on indentation and variable names
JSON payloads tokenize inefficiently due to structural characters like brackets and quotes
Whitespace and formatting in prompts consume real tokens — a well-structured prompt can use 20% more tokens than a compact one carrying the same information
Non-ASCII characters (accented letters, CJK characters, emoji) typically consume more tokens per character than ASCII

Input tokens vs. output tokens: why the split matters

Most providers price input and output tokens differently, and the gap is significant. For gpt-4o, output tokens cost roughly 4× more per token than input tokens. For Claude Sonnet models, the multiplier is similar. This asymmetry has important implications for cost optimization.

The practical consequence: generating long outputs is expensive. A feature that asks the model to produce a detailed 2,000-word report costs dramatically more than one that asks for a 100-word summary — even if the input prompt is the same length. When designing AI features, consider whether the output length is actually necessary for the use case, or whether a shorter, structured response would serve the user equally well.

Asking for JSON-structured short answers instead of prose narratives, using max_tokens to cap responses, and splitting long generation tasks into stages are all strategies that exploit this asymmetry to reduce costs without degrading user experience.

The context trap: the snowball effect

This is where budgets most often spiral. Every time you make an API call in a multi-turn conversation, you must resend the entire conversation history — because the model has no memory between calls. The implication: each additional turn in a conversation multiplies the cost of every subsequent turn.

A concrete example: a 10-turn support chat with 200 tokens per exchange does not cost 10 × 200 = 2,000 tokens. It costs 200 + 400 + 600 + ... + 2,000 = 11,000 tokens — more than 5× what a naive calculation would suggest. At scale, with thousands of concurrent conversations, this compounding effect creates a cost profile that is almost impossible to predict without dedicated tooling.

If you are not actively managing context window length, you are almost certainly paying for tokens that do not contribute to response quality. Common strategies include:

Sliding window truncation: Keep only the last N turns of conversation history, dropping the oldest messages when the context exceeds a threshold.
Summarization compression: Replace older conversation segments with a compact summary, preserving semantic content while reducing token count by 60–80%.
Relevance filtering: For RAG applications, only inject document chunks that are semantically relevant to the current query — not the entire retrieved set.

Prompt caching: the discount most teams leave on the table

Both OpenAI and Anthropic now offer prompt caching — a mechanism where a static prefix of a prompt (typically the system prompt and any fixed context) can be cached on the server and billed at a heavily discounted rate on repeat calls. OpenAI charges around 50% of the standard input token rate for cached tokens; Anthropic charges around 10% for cache reads.

For most production applications, the system prompt is identical across thousands of requests. If yours runs to 500–1,000 tokens, enabling prompt caching can reduce your input token costs by 30–50% on that portion alone. The mechanics are simple: structure your prompts so the static prefix appears first, before any dynamic content. Most providers activate caching automatically for prefixes above a minimum length threshold.

The caveat: caches have TTLs (typically 5 minutes for OpenAI, configurable for Anthropic). High-traffic applications benefit most. If your request rate is low, cache hit rates may not justify the slight additional complexity in prompt structuring.

Model pricing variance: not all tokens are equal

The difference in per-token cost between frontier and smaller models is larger than most teams realize — and has grown as the model landscape has fragmented. As a rough benchmark:

gpt-4o costs roughly 5–10× more per token than gpt-4o-mini
Claude Opus costs roughly 15× more per token than Claude Haiku
Open-source models self-hosted on commodity GPU hardware can run at 1/20th the cost of frontier API pricing

This spread means that routing decisions — choosing which model handles which request — are often the single most impactful cost lever available. Classification tasks, short extractions, simple Q&A, and structured data generation typically run just as well on smaller models. Reserving frontier models for tasks that genuinely require their capability (complex reasoning, nuanced writing, code generation) can cut total token spend by 40–70% without user-perceptible quality changes.

Best practices to master your spending

Model routing: Don't systematically use the most powerful models. Classify request complexity first, then route to the cheapest model that can handle it reliably.
Prompt auditing: Review your longest system prompts quarterly. Teams routinely ship 800-token system prompts where 200 tokens of tighter writing would carry the same intent.
Context window hygiene: Implement truncation or summarization strategies before context length becomes a budget problem, not after.
Enable prompt caching: Structure your prompts for caching eligibility if you are on OpenAI or Anthropic. It is one of the lowest-effort, highest-return optimizations available.
Cap output length: Use max_tokens as a hard guardrail. Know what your maximum acceptable output length is per use case, and enforce it explicitly.
Budget per session: Implement per-user and per-session spending limits to prevent runaway conversations or abuse scenarios from generating unexpected costs.

From opacity to control

The underlying theme of all these strategies is the same: you cannot control what you cannot see. Teams that get a grip on their AI costs start by instrumenting every request with model, token count, feature tag, and cost. Once that data exists, the optimizations become obvious. Without it, you are guessing — and in a billing model where a single architectural choice can change your cost by 10×, guessing is expensive.

Regain control of your AI costs

Identify sources of overconsumption and optimize your API calls with AIntOps today.

Try for free →