Cost Intelligence

How does Cognocient detect and recover AI spend waste?

Cognocient automatically scans every API call and classifies spend as investment or waste across five categories. The average team recovers 28-40% of their AI bill. Each finding includes a one-click fix that applies immediately.

Waste category | What it catches | Typical recovery
Retry waste | Failed calls that were re-billed with no value | Up to $800/mo
Model mismatch | Frontier models on tasks a mini model handles identically | Up to $2,400/mo
Context bloat | Unbounded context window growth across conversation turns | Up to $600/mo
Context starvation | Two calls within 60 seconds because data wasn't ready for the first | Up to $400/mo

No configuration required. Detection activates automatically as soon as calls flow through the proxy. Results appear in the Waste Detection dashboard tab, grouped by feature and category.

Retry waste

What it detects: Failed API calls that were automatically retried — you're billed full input tokens every time a retry fires, even though no useful output was produced. A 10,000-token prompt that times out and retries three times wastes 30,000 tokens.
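The arithmetic above can be sketched as a small helper. The function name and the per-million-token price are illustrative assumptions, not Cognocient's own accounting; substitute your provider's actual input price.

```python
def retry_waste(prompt_tokens: int, failed_attempts: int,
                usd_per_million_input_tokens: float) -> tuple[int, float]:
    """Return (wasted input tokens, wasted USD) for attempts that produced no output."""
    wasted_tokens = prompt_tokens * failed_attempts
    wasted_usd = wasted_tokens / 1_000_000 * usd_per_million_input_tokens
    return wasted_tokens, wasted_usd


# A 10,000-token prompt that fails three times; the price is a placeholder value.
tokens, usd = retry_waste(10_000, 3, usd_per_million_input_tokens=2.50)
print(tokens, round(usd, 4))
```

The per-call cost looks small, but it scales linearly with both prompt size and the number of failed attempts, which is why large prompts in a flaky feature dominate this category.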

The dashboard shows: "You wasted $342 on retried calls last month. 34 failed calls triggered automatic retries in pdf-extractor."

Fix — use exponential backoff, not immediate retries:

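import time
from openai import OpenAI, APITimeoutError, APIConnectionError

# max_retries=0 turns off the SDK's built-in retries so they don't stack with the loop below.
client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
    max_retries=0,
)

def chat_with_backoff(messages, max_retries=3):
    """Call the model, retrying only transient network failures with exponential backoff."""
    # Here max_retries is the total number of attempts, including the first one.
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                extra_headers={"X-Cost-Feature": "pdf-extractor"},
                timeout=30,
            )
        except (APITimeoutError, APIConnectionError):
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(2 ** attempt)  # wait 1s, then 2s, before the next attempt
        # Any other exception (400, 401, content policy) propagates without a retry.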

Don't retry on 400 errors (bad request), 401 (auth), or content policy violations. These will never succeed and you pay for every attempt.
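One way to encode this rule is a small status-code check that a retry loop consults before sleeping. This is a minimal sketch: the helper name is hypothetical, and the set of retryable codes follows a common convention (request timeout, conflict, rate limit, server errors) rather than anything Cognocient specifies, so check it against your provider's documentation.

```python
# Assumed classification of HTTP status codes; adjust to your provider's documentation.
RETRYABLE_STATUS = {408, 409, 429}


def is_retryable(status_code: int) -> bool:
    """True for transient failures: timeouts, rate limits, and 5xx server errors."""
    return status_code in RETRYABLE_STATUS or 500 <= status_code <= 599


for code in (400, 401, 429, 503):
    print(code, is_retryable(code))
```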

Model mismatch

What it detects: Expensive frontier models used for simple tasks such as sentiment analysis, classification, entity extraction, and short summarisation. GPT-4o on a "classify this as positive/negative" task costs roughly 17× more than GPT-4o-mini (the ~94% saving in the table below), with comparable output quality on tasks this simple.

The dashboard shows: "Switching sentiment-analysis from gpt-4o to gpt-4o-mini would save $822/month. AI confidence: high."

Task type | Recommended model | Savings vs GPT-4o
Sentiment / classification | gpt-4o-mini | ~94%
Entity extraction | gpt-4o-mini | ~94%
Short summarisation | gpt-4o-mini | ~94%
Complex reasoning | gpt-4o | -
Long document understanding | claude-sonnet-4-6 | Comparable cost, higher quality
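The savings column follows from the price ratio between the two models. The sketch below shows the calculation; the function name is hypothetical and the per-million-token prices are illustrative assumptions that should be replaced with current list prices. With these example values the result rounds to 94.

```python
def savings_percent(current_price: float, cheaper_price: float) -> float:
    """Percentage saved by moving the same token volume to the cheaper model."""
    return (1 - cheaper_price / current_price) * 100


# Illustrative per-million-input-token prices in USD (assumptions; check current pricing).
GPT_4O_PRICE = 2.50
GPT_4O_MINI_PRICE = 0.15

print(round(savings_percent(GPT_4O_PRICE, GPT_4O_MINI_PRICE)))
```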

Fix: Apply the one-click recommendation in Dashboard → AI Advisor. Cognocient creates a routing rule that silently redirects matching calls to the cheaper model — no code changes required.

Context bloat

What it detects: Sessions where each turn sends the full conversation history. By turn 10, you're paying for turns 1–9 on every single new call. A 20-turn session can pay for the same early tokens 19 times.
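To see how quickly this adds up, the sketch below compares total input tokens when every call resends the full history against sending only the new turn. It assumes, for simplicity, that each turn adds the same number of tokens; the function names are illustrative.

```python
def full_history_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when every call resends all earlier turns."""
    # Call k carries k turns of text, so the total is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


def new_turn_only_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens if each call carried only its own new turn."""
    return tokens_per_turn * turns


# A 20-turn session at 200 tokens per turn.
print(full_history_input_tokens(20, 200), new_turn_only_input_tokens(20, 200))
```

Growth is quadratic in the number of turns, so long sessions account for most of the waste even when individual messages are short.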

The dashboard shows: "$193 wasted on bloated context windows. 15 sessions exceeded 50% context growth in support-chat."

Fix — summarise old turns instead of sending them all:

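# Reuses the `client` defined in the retry example above.
def trim_context(messages: list, max_tokens: int = 4000, keep_recent: int = 6) -> list:
    """Summarise older turns once the estimated prompt size exceeds max_tokens."""
    # Rough estimate: ~1.3 tokens per whitespace-separated word (a heuristic, not a tokenizer).
    # Assumes plain-string content; `or ""` guards against messages whose content is None.
    estimated_tokens = sum(len((m.get("content") or "").split()) * 1.3 for m in messages)

    if estimated_tokens <= max_tokens or len(messages) <= keep_recent:
        return messages

    old_messages = messages[:-keep_recent]
    recent_messages = messages[-keep_recent:]

    # Send the old turns as readable "role: content" lines rather than a raw Python repr.
    # Note: a leading system prompt in old_messages is summarised too; keep it
    # separately if your app relies on it.
    transcript = "\n".join(f'{m["role"]}: {m.get("content") or ""}' for m in old_messages)

    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarise this conversation in 2–3 sentences."},
            {"role": "user", "content": transcript},
        ],
        extra_headers={"X-Cost-Feature": "context-summariser"},
    ).choices[0].message.content

    return [
        {"role": "system", "content": f"Conversation so far: {summary}"},
        *recent_messages,
    ]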

Context starvation

What it detects: Two API calls within 60 seconds where the second call's prompt is 50%+ larger than the first. This pattern means your app called the model before it had all the context it needed, then had to call again. You paid for both calls but could have made one.
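The rule described above can be expressed in a few lines. This is only a sketch of the stated thresholds (60 seconds, 50% growth) applied to a flat list of calls; it is not Cognocient's implementation, the `Call` type and function name are hypothetical, and a real detector would presumably group calls by feature or session first.

```python
from dataclasses import dataclass


@dataclass
class Call:
    timestamp: float      # seconds since some reference point
    prompt_tokens: int    # input tokens sent with the call


def find_starved_pairs(calls, window_s: float = 60.0, growth: float = 0.5):
    """Return consecutive (first, second) pairs that match the starvation pattern."""
    ordered = sorted(calls, key=lambda c: c.timestamp)
    pairs = []
    for first, second in zip(ordered, ordered[1:]):
        close_in_time = second.timestamp - first.timestamp <= window_s
        much_larger = second.prompt_tokens >= first.prompt_tokens * (1 + growth)
        if close_in_time and much_larger:
            pairs.append((first, second))
    return pairs


calls = [Call(0, 1000), Call(20, 1600), Call(200, 3000)]
print(len(find_starved_pairs(calls)))
```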

The dashboard shows: "$127/month wasted on iterative prompts. 34 sequences where your app called the model before the prompt was fully assembled."

Fix — gather all context before the first call:

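import asyncio
from openai import AsyncOpenAI

# The awaited call below needs the async client; the sync OpenAI client cannot be awaited.
async_client = AsyncOpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)

async def process_document(doc_id: str, user_id: str):
    # fetch_document, fetch_metadata, fetch_user_preferences and build_prompt
    # stand for your application's own helpers.
    # Fetch everything in parallel, then make one call
    document, metadata, prefs = await asyncio.gather(
        fetch_document(doc_id),
        fetch_metadata(doc_id),
        fetch_user_preferences(user_id),
    )

    return await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_prompt(document, metadata, prefs)}],
        extra_headers={"X-Cost-Feature": "doc-analyser"},
    )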

What happens after waste is detected

Once Cognocient identifies waste in a feature, it generates a specific recommendation in Dashboard → AI Advisor with:

  • Estimated monthly saving
  • Confidence level (High / Medium / Low)
  • One-click apply — creates a routing rule, no code changes needed

Recommendations are re-evaluated daily as new call data arrives.


Next steps: Investment vs. Waste Classification · Semantic Caching · Routing Rules
