Cost Intelligence

What is token maxing and how does Cognocient detect it?

Automatically detect when frontier models (GPT-4, Claude Opus, Gemini Pro) are being used for short outputs — tasks a cheaper model could handle at a fraction of the cost.

Token maxing is using expensive frontier models — GPT-4o, Claude Opus, Gemini Pro — for tasks a cheaper model handles identically: sentiment classification, entity extraction, short summarisation. Cognocient detects this automatically and quantifies exactly how much you'd save by switching.

What is token maxing?

Token maxing is when your application uses a frontier-tier model (GPT-4o, Claude Opus, Gemini 1.5 Pro) to produce very short completions — under 500 tokens. You are paying flagship prices for commodity outputs.

The problem:

CallModelOutputCost per call
"Classify this ticket as urgent or normal"GPT-4o"urgent" (3 tokens)$0.0045
"Classify this ticket as urgent or normal"GPT-4o-mini"urgent" (3 tokens)$0.00006

The same output. 75× cost difference. Token Maxing Detector finds these automatically.

Where to find it

Dashboard → Token Maxing shows all feature–model combinations where a frontier model produced outputs averaging under 500 completion tokens, with at least 5 calls in the period.

Reading the findings table

ColumnWhat it means
FeatureThe X-Cost-Feature tag on these calls
Current ModelThe frontier model being used
CallsCall count in the selected period
Avg OutputAverage completion token count
CostTotal spend on these calls
SuggestionRecommended cheaper model

Fixing a token maxing finding

The suggested action is always to create a routing rule that automatically downgrades the model for that feature.

  1. Click Create routing rule at the bottom of the findings table
  2. Set condition: tag_feature = "your-feature" AND model = "gpt-4o"
  3. Set action: Reroute to gpt-4o-mini
  4. Save and deploy

Alternatively, change the model directly in your application code — the routing rule approach is better for zero-code rollouts.

Not all short completions indicate a problem. Some tasks require frontier-model reasoning even for brief answers (e.g., "Which of these 5 approaches has the lowest Big-O complexity?"). Review the findings table manually before creating routing rules.

Model tier reference

Cognocient classifies the following as frontier models for this detector:

ProviderFrontier models
OpenAIGPT-4, GPT-4-turbo, GPT-4o
AnthropicClaude 3 Opus, Claude Opus 4.x
GoogleGemini 1.5 Pro, Gemini 2.0 Flash Thinking
MistralMistral Large
Meta (Together)Llama 3.1 70B, 405B

Typical savings

In a 30-day period, most SaaS companies with more than 50k API calls find 15–30% of frontier-model calls can be safely routed to cheaper models — with no measurable quality degradation for classification, extraction, and summarisation tasks.


Related: Routing Rules · Waste Detection · AI Advisor

On this page