Cost Intelligence

What is context tax and how does Cognocient detect it?

Identify features paying a 'context tax' — static prompt overhead on every call that could be eliminated through prompt caching or compression.

Context tax is the cost of static prompt overhead on every API call — system prompts, tool definitions, and boilerplate that could be cached or compressed. Cognocient identifies features paying the highest context tax and shows you exactly how much you'd save by fixing it.

What is the context tax?

The context tax is the cost of sending the same large static prompt on every API call. If your prompt is 4,000 tokens of system instructions and 200 tokens of variable content, you are paying for 4,000 tokens of overhead on every single call, even though 95% of the prompt never changes.

Example:

System prompt: [2,800 tokens of static instructions and examples]
User message:  [150 tokens of actual query]
─────────────────────────────────────────────────────
Total input:   2,950 tokens   →  95% static overhead

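At 10,000 calls/month, that prompt adds up to 29.5 million input tokens, or ~$118/month (GPT-4o pricing; the figure implies about $4.00 per million input tokens). If all of those tokens were billed at a cached rate of about $0.50 per million, the cost would drop to $14.75, an 87.5% reduction.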
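
The arithmetic behind these figures can be reproduced with a short sketch. The per-million-token rates below are assumptions back-derived from the numbers quoted above ($118 and $14.75 over 2,950 tokens × 10,000 calls), not quoted provider prices, and `monthly_input_cost` is a hypothetical helper.

```python
def monthly_input_cost(tokens_per_call, calls_per_month, usd_per_million_tokens):
    """Input-token cost in USD for one month of calls."""
    return tokens_per_call * calls_per_month * usd_per_million_tokens / 1_000_000


# Token counts from the example above.
STATIC_TOKENS = 2_800
VARIABLE_TOKENS = 150
CALLS_PER_MONTH = 10_000

# ASSUMED rates, inferred from the quoted $118 and $14.75 figures.
ASSUMED_RATE = 4.00          # USD per million uncached input tokens
ASSUMED_CACHED_RATE = 0.50   # USD per million cached input tokens

total_tokens = STATIC_TOKENS + VARIABLE_TOKENS

# Every input token billed at the uncached rate.
uncached = monthly_input_cost(total_tokens, CALLS_PER_MONTH, ASSUMED_RATE)

# Every input token billed at the cached rate (matches the quoted figure).
cached_all = monthly_input_cost(total_tokens, CALLS_PER_MONTH, ASSUMED_CACHED_RATE)
reduction = 1 - cached_all / uncached

# More conservative: only the static tokens hit the cache.
cached_static_only = (
    monthly_input_cost(STATIC_TOKENS, CALLS_PER_MONTH, ASSUMED_CACHED_RATE)
    + monthly_input_cost(VARIABLE_TOKENS, CALLS_PER_MONTH, ASSUMED_RATE)
)

print(f"uncached: ${uncached:.2f}, cached: ${cached_all:.2f}, reduction: {reduction:.1%}")
print(f"static-only caching: ${cached_static_only:.2f}")
```

The last figure is a slightly more conservative estimate in which only the 2,800 static tokens are billed at the cached rate. Cache-write costs and cache expiry are not modelled.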

How Cognocient detects it

The Context Tax Analyser measures the coefficient of variation (CV) of prompt token counts per feature:

CV = standard deviation of prompt tokens / mean prompt tokens

A CV below 0.15 means your prompt size barely changes between calls — a strong signal that the prompt is template-heavy with little variable content.

Combined with a large average prompt size (>500 tokens), this indicates a high static overhead that is a prime caching/compression candidate.
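
As a rough illustration, the check described above could be sketched like this. The function names are hypothetical, and the use of the population standard deviation is an assumption; the source does not say which variant the Context Tax Analyser uses. The 0.15 and 500-token thresholds come from the text above.

```python
from statistics import mean, pstdev

CV_THRESHOLD = 0.15           # below this, prompt size barely changes
MIN_AVG_PROMPT_TOKENS = 500   # only large prompts are worth caching


def coefficient_of_variation(prompt_tokens):
    """CV = standard deviation / mean, for a non-empty list of token counts."""
    avg = mean(prompt_tokens)
    if avg == 0:
        return 0.0
    # ASSUMPTION: population standard deviation.
    return pstdev(prompt_tokens) / avg


def is_cache_candidate(prompt_tokens):
    """True when prompts are both large on average and nearly constant in size."""
    if not prompt_tokens:
        return False
    return (
        mean(prompt_tokens) > MIN_AVG_PROMPT_TOKENS
        and coefficient_of_variation(prompt_tokens) < CV_THRESHOLD
    )


# Illustrative per-call prompt token counts for three hypothetical features.
template_heavy = [2950, 2940, 2965, 2955, 2948]   # large and stable
dynamic = [600, 2400, 1200, 3100, 900]            # large but highly variable
small_static = [120, 118, 122, 121]               # stable but too small to matter

for name, counts in [("template_heavy", template_heavy),
                     ("dynamic", dynamic),
                     ("small_static", small_static)]:
    print(name, round(coefficient_of_variation(counts), 3), is_cache_candidate(counts))
```

Only the first feature is flagged: the second fails the CV test and the third fails the size test.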

Where to find it

Dashboard → Waste → Context Tax tab shows all features with their CV score and whether they are flagged as cache candidates.

Column      What it means
Feature     The X-Cost-Feature tag
Calls       Call count in the period
Avg Prompt  Average prompt token count
CV          Coefficient of variation (lower = more static)
Cost        Total spend for this feature
Status      Cache candidate / Dynamic

How to fix a context tax finding

Option 1: Prompt caching

Anthropic Claude and OpenAI both support prompt caching: the static portion of your prompt is cached server-side and billed at a significant discount on cache hits.

Anthropic Claude:

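import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API; adjust to your use case
    system=[{
        "type": "text",
        "text": static_system_prompt,  # static instructions, identical on every call
        "cache_control": {"type": "ephemeral"},  # mark this block as cacheable
    }],
    messages=[{"role": "user", "content": variable_content}],
)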

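OpenAI caches automatically for prompts of 1,024 tokens or more, with no code change required. Cache hits require an exact prefix match, so keep the static content at the start of the prompt and the variable content at the end.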
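
To confirm that caching is actually taking effect, one option is to inspect the usage data returned with each response. The sketch below assumes the Chat Completions usage shape, where cached input tokens are reported under `prompt_tokens_details.cached_tokens`. The helper `cached_fraction` and the sample payload are hypothetical and only illustrate the calculation.

```python
def cached_fraction(usage):
    """Share of prompt tokens served from cache, given a usage payload as a dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Illustrative payload, not real API output.
sample_usage = {
    "prompt_tokens": 2950,
    "completion_tokens": 120,
    "prompt_tokens_details": {"cached_tokens": 2816},
}

print(f"{cached_fraction(sample_usage):.1%} of prompt tokens were cache hits")
```

With the official Python SDK, a dict like this can typically be obtained from `response.usage.model_dump()`. A fraction that stays near zero across repeated calls suggests the static content is not at the start of the prompt or the prompt is below the caching threshold.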

Option 2: Retrieval instead of static context

Move static knowledge into embeddings and retrieve only what's relevant. This reduces per-call token counts at the cost of a retrieval step.
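
A minimal sketch of the retrieval idea, assuming the static knowledge has already been split into chunks and embedded. The three-dimensional vectors are toy values chosen for illustration; a real system would obtain embeddings from an embedding model and usually store them in a vector index. All names here are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunks, k=1):
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return [c["text"] for c in ranked[:k]]


# Static knowledge that previously lived in the system prompt (toy embeddings).
KNOWLEDGE = [
    {"text": "Refunds are issued within 14 days of purchase.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Standard shipping takes 3 to 5 business days.", "embedding": [0.1, 0.9, 0.0]},
    {"text": "The API allows 100 requests per minute.", "embedding": [0.0, 0.1, 0.9]},
]

# Toy embedding for a refund-related question.
query_vec = [0.8, 0.2, 0.1]

selected = top_k_chunks(query_vec, KNOWLEDGE, k=1)
prompt = "Answer using only this context:\n" + "\n".join(selected)
print(prompt)
```

Only the selected chunk is sent with the call, so the prompt grows with the number of relevant chunks rather than with the size of the whole knowledge base.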

Option 3: Summarise and compress

Distil your system prompt. Most 3,000-token system prompts can be compressed to 800 tokens without quality loss by removing redundant examples and reformatting instructions.

Cognocient automatically detects prompt cache opportunities (cache-miss runs with repeated prompts) in the Waste Detection → Waste Overview view. Enable the Prompt Cache & Batch advisor for proactive recommendations.

Relationship to context bloat

Context bloat (in the Waste Overview tab) measures individual calls where the total prompt is disproportionately large relative to the output. Context tax measures systematic static overhead across all calls to a feature. The two often co-occur but have different root causes and different fixes.


Related: Prompt Cache & Batch · Waste Detection · Feature Intelligence
