What is token maxing and how does Cognocient detect it?

Automatically detect when frontier models (GPT-4, Claude Opus, Gemini Pro) are being used for short outputs — tasks a cheaper model could handle at a fraction of the cost.

Token maxing is using expensive frontier models — GPT-4o, Claude Opus, Gemini Pro — for tasks a cheaper model handles identically: sentiment classification, entity extraction, short summarisation. Cognocient detects this automatically and quantifies exactly how much you'd save by switching.

What is token maxing?

Token maxing is when your application uses a frontier-tier model (GPT-4o, Claude Opus, Gemini 1.5 Pro) to produce very short completions — under 500 tokens. You are paying flagship prices for commodity outputs.

The problem:

Call	Model	Output	Cost per call
"Classify this ticket as urgent or normal"	GPT-4o	"urgent" (3 tokens)	$0.0045
"Classify this ticket as urgent or normal"	GPT-4o-mini	"urgent" (3 tokens)	$0.00006

The output is the same, but the cost differs by 75×. The Token Maxing Detector finds these cases automatically.
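The arithmetic behind that comparison can be sketched in a few lines of Python. The per-call costs come from the table above; the monthly volume of 100,000 calls is a hypothetical figure chosen only for illustration.

```python
# Per-call costs taken from the comparison table above.
GPT_4O_COST_PER_CALL = 0.0045
GPT_4O_MINI_COST_PER_CALL = 0.00006


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive model costs per call."""
    return expensive / cheap


def savings(calls: int, expensive: float, cheap: float) -> float:
    """Total saved by moving `calls` calls to the cheaper model."""
    return calls * (expensive - cheap)


# Hypothetical volume, not a figure from this article.
CALLS_PER_MONTH = 100_000

ratio = cost_ratio(GPT_4O_COST_PER_CALL, GPT_4O_MINI_COST_PER_CALL)
saved = savings(CALLS_PER_MONTH, GPT_4O_COST_PER_CALL, GPT_4O_MINI_COST_PER_CALL)
print(f"{ratio:.0f}x cost difference, ${saved:,.2f} saved per {CALLS_PER_MONTH:,} calls")
```

At that assumed volume, the difference works out to roughly $444 per month for a single classification feature.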

Where to find it

Dashboard → Token Maxing shows all feature–model combinations where a frontier model produced outputs averaging under 500 completion tokens, with at least 5 calls in the period.
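The criteria above (frontier model, average under 500 completion tokens, at least 5 calls) can be approximated as a simple grouping pass over call records. This is a minimal sketch of the described thresholds, not Cognocient's actual implementation; the `Call` record, the `find_token_maxing` function, and the model ID strings are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_AVG_COMPLETION_TOKENS = 500  # threshold stated in this article
MIN_CALLS = 5                    # minimum call count stated in this article


@dataclass
class Call:
    """Hypothetical call record; field names are assumptions."""
    feature: str
    model: str
    completion_tokens: int


def find_token_maxing(calls, frontier_models):
    """Return feature-model pairs that meet the detection criteria."""
    groups = defaultdict(list)
    for call in calls:
        if call.model in frontier_models:
            groups[(call.feature, call.model)].append(call.completion_tokens)

    findings = []
    for (feature, model), tokens in groups.items():
        avg_output = sum(tokens) / len(tokens)
        if len(tokens) >= MIN_CALLS and avg_output < MAX_AVG_COMPLETION_TOKENS:
            findings.append({
                "feature": feature,
                "model": model,
                "calls": len(tokens),
                "avg_output": avg_output,
            })
    return findings


# Example data: only "ticket-triage" on gpt-4o should be flagged.
sample_calls = (
    [Call("ticket-triage", "gpt-4o", 3)] * 5       # short outputs, enough calls
    + [Call("report-writer", "gpt-4o", 1200)] * 5  # long outputs, not flagged
    + [Call("rare-feature", "gpt-4o", 3)] * 4      # too few calls, not flagged
    + [Call("ticket-triage", "gpt-4o-mini", 3)] * 9  # not a frontier model
)
sample_findings = find_token_maxing(sample_calls, frontier_models={"gpt-4o"})
```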

Reading the findings table

Column	What it means
Feature	The `X-Cost-Feature` tag on these calls
Current Model	The frontier model being used
Calls	Call count in the selected period
Avg Output	Average completion token count
Cost	Total spend on these calls
Suggestion	Recommended cheaper model

Fixing a token maxing finding

The suggested action is always to create a routing rule that automatically downgrades the model for that feature.

1. Click Create routing rule at the bottom of the findings table.
2. Set the condition: `tag_feature = "your-feature" AND model = "gpt-4o"`
3. Set the action: Reroute to `gpt-4o-mini`
4. Save and deploy.
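To make the rule's semantics concrete, here is a small sketch of how such a condition and action might behave when evaluated against a single call. The `RoutingRule` class is hypothetical and exists only to illustrate the matching logic; it is not Cognocient's rule engine or configuration format, and `ticket-triage` is a made-up feature tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRule:
    """Hypothetical model of a rule: two equality conditions and a reroute."""
    tag_feature: str
    model: str
    reroute_to: str

    def apply(self, tag_feature: str, model: str) -> str:
        """Return the model to use for a call with the given tag and model."""
        # Both conditions must hold (the AND in the rule condition).
        if tag_feature == self.tag_feature and model == self.model:
            return self.reroute_to
        # Anything else passes through unchanged.
        return model


rule = RoutingRule(tag_feature="ticket-triage", model="gpt-4o", reroute_to="gpt-4o-mini")

rerouted = rule.apply("ticket-triage", "gpt-4o")       # matches both conditions
other_feature = rule.apply("code-review", "gpt-4o")    # feature differs
other_model = rule.apply("ticket-triage", "gpt-4")     # model differs
```

Only calls that match both the feature tag and the current model are downgraded, so other features that use the same frontier model are unaffected.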

Alternatively, change the model directly in your application code. The routing rule approach is preferable when you want to roll out the change without touching or redeploying code.

Not all short completions indicate a problem. Some tasks require frontier-model reasoning even for brief answers (e.g., "Which of these 5 approaches has the lowest Big-O complexity?"). Review the findings table manually before creating routing rules.

Model tier reference

Cognocient classifies the following as frontier models for this detector:

Provider	Frontier models
OpenAI	GPT-4, GPT-4-turbo, GPT-4o
Anthropic	Claude 3 Opus, Claude Opus 4.x
Google	Gemini 1.5 Pro, Gemini 2.0 Flash Thinking
Mistral	Mistral Large
Meta (Together)	Llama 3.1 70B, Llama 3.1 405B
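If you want to mirror this classification in your own tooling, for example to pre-screen logs, a lookup along these lines may be enough. It is a rough sketch built from the display names in the table above; how Cognocient actually matches model identifiers is not documented here, so the normalisation and the prefix handling for "Claude Opus 4.x" are assumptions.

```python
# Display names from the tier table, lower-cased for comparison.
FRONTIER_EXACT = {
    "gpt-4", "gpt-4-turbo", "gpt-4o",
    "claude 3 opus",
    "gemini 1.5 pro", "gemini 2.0 flash thinking",
    "mistral large",
    "llama 3.1 70b", "llama 3.1 405b",
}

# "Claude Opus 4.x" covers a family, so it is matched by prefix (assumption).
FRONTIER_PREFIXES = ("claude opus 4",)


def is_frontier(model_name: str) -> bool:
    """Return True if the display name falls in the frontier tier above."""
    key = model_name.strip().lower()
    return key in FRONTIER_EXACT or key.startswith(FRONTIER_PREFIXES)
```

Exact matching is used for most entries so that a name such as GPT-4o-mini is not mistaken for GPT-4o.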

Typical savings

In a 30-day period, most SaaS companies with more than 50k API calls find 15–30% of frontier-model calls can be safely routed to cheaper models — with no measurable quality degradation for classification, extraction, and summarisation tasks.

Related: Routing Rules · Waste Detection · AI Advisor