What is token maxing and how does Cognocient detect it?
Automatically detect when frontier models (GPT-4, Claude Opus, Gemini Pro) are being used for short outputs — tasks a cheaper model could handle at a fraction of the cost.
Token maxing is using expensive frontier models (GPT-4o, Claude Opus, Gemini Pro) for tasks a cheaper model handles just as well: sentiment classification, entity extraction, short summarisation. Cognocient detects this automatically and quantifies how much you would save by switching.
What is token maxing?
Token maxing is when your application uses a frontier-tier model (GPT-4o, Claude Opus, Gemini 1.5 Pro) to produce very short completions — under 500 tokens. You are paying flagship prices for commodity outputs.
The problem:
| Call | Model | Output | Cost per call |
|---|---|---|---|
| "Classify this ticket as urgent or normal" | GPT-4o | "urgent" (3 tokens) | $0.0045 |
| "Classify this ticket as urgent or normal" | GPT-4o-mini | "urgent" (3 tokens) | $0.00006 |
Both calls return the same output, yet the GPT-4o call costs 75× more. The Token Maxing Detector finds these cases automatically.
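The 75× figure follows directly from the two per-call costs in the table. A minimal Python check is sketched below; the function names and the 10,000-call volume are illustrative assumptions, not part of Cognocient.

```python
# Per-call costs in USD, copied from the table above.
GPT_4O_COST = 0.0045
GPT_4O_MINI_COST = 0.00006


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more the expensive call costs."""
    return expensive / cheap


def saving_for(calls: int, expensive: float, cheap: float) -> float:
    """Return the total USD saved by moving `calls` calls to the cheaper model."""
    return calls * (expensive - cheap)


ratio = cost_ratio(GPT_4O_COST, GPT_4O_MINI_COST)
print(f"Cost ratio: {ratio:.0f}x")

# 10,000 calls is an assumed volume, chosen only for illustration.
saving = saving_for(10_000, GPT_4O_COST, GPT_4O_MINI_COST)
print(f"Saved over 10,000 calls: ${saving:.2f}")
```

The per-call difference looks negligible on its own; it only becomes visible once it is multiplied by call volume.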
Where to find it
Dashboard → Token Maxing shows all feature–model combinations where a frontier model produced outputs averaging under 500 completion tokens, with at least 5 calls in the period.
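The detection rule described above (frontier model, average output under 500 completion tokens, at least 5 calls) can be approximated in a few lines of Python. This is only a sketch of the stated criteria, not Cognocient's actual implementation: the `Call` record, the lower-case model identifiers, and the shortened frontier list are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed identifiers for a subset of the frontier tier (hypothetical spelling).
FRONTIER_MODELS = {"gpt-4", "gpt-4-turbo", "gpt-4o", "claude-3-opus", "mistral-large"}
MAX_AVG_COMPLETION_TOKENS = 500  # threshold stated in the text
MIN_CALLS = 5                    # minimum call count stated in the text


@dataclass
class Call:
    feature: str            # value of the feature tag on the call
    model: str
    completion_tokens: int
    cost_usd: float


def find_token_maxing(calls):
    """Group frontier-model calls by (feature, model) and apply both thresholds."""
    groups = defaultdict(list)
    for call in calls:
        if call.model in FRONTIER_MODELS:
            groups[(call.feature, call.model)].append(call)

    findings = []
    for (feature, model), group in groups.items():
        avg_output = sum(c.completion_tokens for c in group) / len(group)
        if len(group) >= MIN_CALLS and avg_output < MAX_AVG_COMPLETION_TOKENS:
            findings.append({
                "feature": feature,
                "model": model,
                "calls": len(group),
                "avg_output": avg_output,
                "cost": sum(c.cost_usd for c in group),
            })
    return findings
```

Grouping by the feature and model pair, rather than by model alone, is what lets a finding point at one specific part of the application.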
Reading the findings table
| Column | What it means |
|---|---|
| Feature | The X-Cost-Feature tag on these calls |
| Current Model | The frontier model being used |
| Calls | Call count in the selected period |
| Avg Output | Average completion token count |
| Cost | Total spend on these calls |
| Suggestion | Recommended cheaper model |
Fixing a token maxing finding
The suggested action is always to create a routing rule that automatically downgrades the model for that feature.
- Click Create routing rule at the bottom of the findings table
- Set condition: `tag_feature = "your-feature" AND model = "gpt-4o"`
- Set action: Reroute to `gpt-4o-mini`
- Save and deploy
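To make the effect of such a rule concrete, the sketch below models it in Python. It assumes a rule is a simple match on feature tag and model followed by a substitution; `RoutingRule` and `resolve_model` are hypothetical names, and the real rule engine may evaluate conditions differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRule:
    tag_feature: str   # feature tag the rule applies to
    model: str         # model the application requested
    reroute_to: str    # cheaper model to use instead

    def matches(self, feature: str, requested_model: str) -> bool:
        # Both parts of the condition must hold (the AND in the rule).
        return feature == self.tag_feature and requested_model == self.model


def resolve_model(rules, feature: str, requested_model: str) -> str:
    """Return the model to call: the first matching rule wins, else no change."""
    for rule in rules:
        if rule.matches(feature, requested_model):
            return rule.reroute_to
    return requested_model


rules = [RoutingRule("your-feature", "gpt-4o", "gpt-4o-mini")]
print(resolve_model(rules, "your-feature", "gpt-4o"))   # rerouted
print(resolve_model(rules, "other-feature", "gpt-4o"))  # unchanged
```

Because the rule matches on the feature tag as well as the model, other features that still call `gpt-4o` are left untouched.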
Alternatively, change the model directly in your application code. The routing rule approach is the better choice when you want to roll out the change without a code deployment.
Not all short completions indicate a problem. Some tasks require frontier-model reasoning even for brief answers (e.g., "Which of these 5 approaches has the lowest Big-O complexity?"). Review the findings table manually before creating routing rules.
Model tier reference
Cognocient classifies the following as frontier models for this detector:
| Provider | Frontier models |
|---|---|
| OpenAI | GPT-4, GPT-4-turbo, GPT-4o |
| Anthropic | Claude 3 Opus, Claude Opus 4.x |
| Google | Gemini 1.5 Pro, Gemini 2.0 Flash Thinking |
| Mistral | Mistral Large |
| Meta (Together) | Llama 3.1 70B, 405B |
Typical savings
In a 30-day period, most SaaS companies with more than 50k API calls find 15–30% of frontier-model calls can be safely routed to cheaper models — with no measurable quality degradation for classification, extraction, and summarisation tasks.
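As a rough illustration of what the 15–30% range could mean in money terms, the Python sketch below applies it to an assumed workload. The 50,000 frontier-model calls and the per-call costs (taken from the ticket classification example) are assumptions; actual savings depend on your own call mix and prices.

```python
def estimated_saving(frontier_calls: int, reroutable_share: float,
                     frontier_cost: float, cheaper_cost: float) -> float:
    """Return the USD saved if `reroutable_share` of frontier calls move to the cheaper model."""
    rerouted_calls = frontier_calls * reroutable_share
    return rerouted_calls * (frontier_cost - cheaper_cost)


# Hypothetical 30-day workload: 50,000 frontier-model calls.
FRONTIER_CALLS = 50_000
# Per-call costs from the classification example (USD).
FRONTIER_COST = 0.0045
CHEAPER_COST = 0.00006

low = estimated_saving(FRONTIER_CALLS, 0.15, FRONTIER_COST, CHEAPER_COST)
high = estimated_saving(FRONTIER_CALLS, 0.30, FRONTIER_COST, CHEAPER_COST)
print(f"Estimated 30-day saving: ${low:.2f} to ${high:.2f}")
```

Calls with longer prompts cost more per call than this three-token example, so the same share of rerouted calls would produce a larger absolute saving.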
Related: Routing Rules · Waste Detection · AI Advisor