How do I save up to 75% with prompt caching and batch routing?

Most teams leave two structural cost reductions on the table: prompt caching (up to 75% off cached reads) and async batch routing (50% off non-real-time jobs). Cognocient detects opportunities for both automatically and shows them on the Recommendations page.

| Discount | Value | Detail |
| --- | --- | --- |
| Anthropic prompt cache discount | 75% | on cache_read tokens |
| OpenAI prompt cache discount | 50% | on cached prefix reads |
| Batch API discount | 50% | gpt-4o, claude-sonnet, and more |

Prompt caching

Prompt caching is a feature of the Anthropic and OpenAI APIs that charges a fraction of the normal input price when the beginning of your prompt is identical across calls. If you have a 10,000-token system prompt that stays the same, only the first call pays full price (plus a small cache-write surcharge on Anthropic). Subsequent calls with the same prefix pay the cached read rate for as long as the cache entry is still live.

| Provider | Model | Normal input | Cache write | Cache read | Saving |
| --- | --- | --- | --- | --- | --- |
| Anthropic | claude-sonnet-4-6 | $3.00 / MTok | $3.75 / MTok | $0.30 / MTok | 90% on cache reads |
| Anthropic | claude-haiku | $0.25 / MTok | $0.30 / MTok | $0.03 / MTok | 88% on cache reads |
| OpenAI | gpt-4o | $2.50 / MTok | $2.50 / MTok | $1.25 / MTok | 50% on cache reads |
| OpenAI | gpt-4o-mini | $0.15 / MTok | $0.15 / MTok | $0.075 / MTok | 50% on cache reads |
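
To make the table concrete, here is a minimal sketch of the arithmetic for the 10,000-token system prompt described above, using the claude-sonnet-4-6 rates from the table. The function name is illustrative, and the sketch assumes that the first call writes the cache and every later call is a cache hit. Real hit rates depend on cache expiry and traffic patterns.

```python
def prompt_cost_usd(prefix_tokens, calls, normal, cache_write, cache_read):
    """Return (uncached, cached) cost in USD for a static prompt prefix.

    Prices are USD per million tokens. Assumes the first call writes the
    cache and every later call reads it (an optimistic simplification).
    """
    mtok = prefix_tokens / 1_000_000
    uncached = calls * mtok * normal
    cached = mtok * cache_write + (calls - 1) * mtok * cache_read
    return uncached, cached


# claude-sonnet-4-6 rates from the table above; 1,000 calls is an example volume
uncached, cached = prompt_cost_usd(
    prefix_tokens=10_000, calls=1_000,
    normal=3.00, cache_write=3.75, cache_read=0.30,
)
saving_pct = 100 * (1 - cached / uncached)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saving {saving_pct:.1f}%")
```

Under these assumptions the prefix costs roughly $30 uncached and roughly $3 cached. The saving approaches 90% as the number of calls grows, because the one-off cache write becomes negligible.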

Cognocient analyses your call patterns and flags features where the same large system prompt appears in more than 50% of calls. On the Recommendations page you'll see: "Enable Anthropic prompt caching on 'support-chat' — saves $1,240/month. Large system prompt (12,000 tokens) repeated 4,800 times last month."
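
Cognocient's actual detection logic is not described here, but the rule stated above (same large system prompt in more than 50% of a feature's calls) can be sketched as follows. The call-log shape, the key names `feature`, `system_prompt` and `system_tokens`, and the 1,024-token floor for "large" are all assumptions made for illustration.

```python
import hashlib
from collections import Counter, defaultdict


def flag_cache_candidates(calls, min_share=0.5, min_tokens=1024):
    """Flag features whose most common system prompt dominates their traffic.

    `calls` is an iterable of dicts with hypothetical keys
    'feature', 'system_prompt' and 'system_tokens'.
    """
    by_feature = defaultdict(list)
    for call in calls:
        by_feature[call["feature"]].append(call)

    flagged = {}
    for feature, items in by_feature.items():
        counts = Counter()
        tokens_by_hash = {}
        for c in items:
            # Hash each prompt so identical prompts compare cheaply
            digest = hashlib.sha256(c["system_prompt"].encode("utf-8")).hexdigest()
            counts[digest] += 1
            tokens_by_hash[digest] = c["system_tokens"]
        digest, repeats = counts.most_common(1)[0]
        share = repeats / len(items)
        if share > min_share and tokens_by_hash[digest] >= min_tokens:
            flagged[feature] = {
                "share": share,
                "tokens": tokens_by_hash[digest],
                "repeats": repeats,
            }
    return flagged


# Toy call log: 'support-chat' reuses one large prompt, 'search' never repeats
calls = (
    [{"feature": "support-chat", "system_prompt": "DOCS", "system_tokens": 12_000}] * 8
    + [{"feature": "support-chat", "system_prompt": "OTHER", "system_tokens": 300}] * 2
    + [{"feature": "search", "system_prompt": f"q{i}", "system_tokens": 50} for i in range(5)]
)
flagged = flag_cache_candidates(calls)
print(flagged)
```

Only `support-chat` is flagged in this toy log, because its dominant prompt covers 8 of 10 calls and exceeds the size floor.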

Enabling prompt caching — Anthropic

Anthropic requires you to add cache_control: {type: 'ephemeral'} to the content blocks you want cached. Cognocient passes this through and records the cache_read_input_tokens at the discounted rate.

import anthropic
 
client = anthropic.Anthropic(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)
 
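# Your large, stable system prompt
SYSTEM_PROMPT = """
You are a helpful support agent for Acme Corp.
[... 10,000 tokens of product documentation, FAQs, policies ...]
"""

# Example placeholder; in production this comes from the end user
user_question = "How do I reset my password?"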
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # <-- mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": user_question}  # only this changes per call
    ],
    extra_headers={
        "X-Cost-Feature": "support-chat",
    },
)
 
# Cognocient dashboard shows:
# cache_write_input_tokens: 10,000 (first call only)
# cache_read_input_tokens:  10,000 (all subsequent calls — 90% cheaper)
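
If you want to verify the discount yourself, the following sketch prices the input side of a single call from the usage object that the Anthropic SDK returns. The SDK reports cache writes as `cache_creation_input_tokens`, so the dashboard label above may differ. The rates are copied from the table on this page, and the helper name and the `SimpleNamespace` stand-ins for `response.usage` are illustrative assumptions.

```python
from types import SimpleNamespace

# USD per million tokens for claude-sonnet-4-6, taken from the table on this page
RATES = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30}


def input_cost_usd(usage, rates=RATES):
    """Price the input side of one call from an Anthropic usage object."""
    # Default to 0 when the cache fields are absent or None
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    fresh = usage.input_tokens  # tokens neither written to nor read from the cache
    return (
        fresh * rates["input"]
        + written * rates["cache_write"]
        + read * rates["cache_read"]
    ) / 1_000_000


# Stand-ins for response.usage on the first call and on a later call
first = SimpleNamespace(
    input_tokens=40, cache_creation_input_tokens=10_000, cache_read_input_tokens=0
)
later = SimpleNamespace(
    input_tokens=40, cache_creation_input_tokens=0, cache_read_input_tokens=10_000
)
print(input_cost_usd(first), input_cost_usd(later))
```

In a real integration you would pass `response.usage` from the call above instead of the stand-ins.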

Enabling prompt caching — OpenAI

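OpenAI prompt caching is automatic and needs no code change. OpenAI caches any prompt prefix of 1,024 or more tokens that is identical across calls, so keep static content (system prompt, documentation) at the start of the prompt and per-call content at the end. Cognocient tracks and reports the cached_tokens value from the usage object (usage.prompt_tokens_details.cached_tokens in the OpenAI response).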

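# OpenAI: prompt caching is automatic, nothing to change in the request
from openai import OpenAI

client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)

# LARGE_SYSTEM_PROMPT and user_question are placeholders for your own values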
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LARGE_SYSTEM_PROMPT},  # cached automatically if ≥1024 tokens
        {"role": "user", "content": user_question},
    ],
    extra_headers={"X-Cost-Feature": "support-chat"},
)
 
# Cognocient dashboard shows usage breakdown:
# prompt_tokens:        10,240
# prompt_tokens_cached:  9,800  (at 50% discount)
# completion_tokens:       180
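
To check cache hits on your side, you can read `prompt_tokens_details.cached_tokens` from the usage object in the OpenAI response. The sketch below computes the share of prompt tokens served from the cache. The helper name is an assumption, and the `SimpleNamespace` object stands in for `response.usage` using the numbers from the dashboard example above.

```python
from types import SimpleNamespace


def cached_fraction(usage):
    """Return the share of prompt tokens served from OpenAI's prompt cache."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0


# Stand-in for response.usage, using the dashboard numbers above
usage = SimpleNamespace(
    prompt_tokens=10_240,
    completion_tokens=180,
    prompt_tokens_details=SimpleNamespace(cached_tokens=9_800),
)
print(f"{cached_fraction(usage):.1%} of prompt tokens were cache hits")
```

A fraction that stays near zero on a feature with a long, stable system prompt usually means something variable (a timestamp, a user name) appears before the static content and breaks the shared prefix.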

Batch routing (50% off async workloads)

Batch APIs let you submit jobs that complete within 24 hours at half the normal price. If your workload doesn't need a real-time response — report generation, bulk analysis, nightly summarisation — batch routing is the single largest cost lever available.

| Model | Sync price (input) | Batch price (input) | Max turnaround |
| --- | --- | --- | --- |
| gpt-4o | $2.50 / MTok | $1.25 / MTok | 24h |
| gpt-4o-mini | $0.15 / MTok | $0.075 / MTok | 24h |
| claude-sonnet-4-6 | $3.00 / MTok | $1.50 / MTok | 24h |
| claude-haiku | $0.25 / MTok | $0.125 / MTok | 24h |

Batch-eligible workloads

| Workload | Detail |
| --- | --- |
| Report generation | Nightly AI summaries, weekly analytics reports |
| Bulk document analysis | Processing uploaded PDFs, contract review |
| Data enrichment | Tagging, classification, extraction at scale |
| Evaluation pipelines | Automated QA scoring, model evals |
| Email summarisation | End-of-day digest, CRM enrichment |
| Embedding generation | Building vector indexes, semantic search |

Do not use batch routing for features where users wait for a response (chat, search, generation in the UI). Batch responses take minutes to hours. It is designed for background processing pipelines only.

Batch API example — OpenAI

import json
from openai import OpenAI
 
client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)
 
# 1. Prepare batch requests as JSONL
# load_documents() is a placeholder for your own loader; each item needs a .text attribute
documents = load_documents()  # e.g., 1,000 contracts to analyse
 
requests_jsonl = []
for i, doc in enumerate(documents):
    requests_jsonl.append(json.dumps({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Extract key clauses from this contract."},
                {"role": "user", "content": doc.text},
            ],
            "max_tokens": 500,
        },
    }))
 
# 2. Upload the batch file
batch_file = client.files.create(
    file=("batch.jsonl", "\n".join(requests_jsonl).encode()),
    purpose="batch",
)
 
# 3. Submit the batch (50% off — up to 24h turnaround)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"feature": "contract-extractor"},  # for Cognocient attribution
)
 
print(f"Batch submitted: {batch.id}")
# Check later: client.batches.retrieve(batch.id)
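
Once the batch has finished, the results arrive as a JSONL file whose lines are not guaranteed to be in input order, which is why each request carries a `custom_id`. The sketch below shows one way to collect them. `parse_batch_output` and `fetch_batch_results` are illustrative helper names, and the sketch assumes the proxy forwards the batch and file endpoints in the same way as the submission example above.

```python
import json


def parse_batch_output(jsonl_text):
    """Map custom_id -> assistant text (None for failed requests)."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None  # failed; inspect the error details
            continue
        body = response["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results


def fetch_batch_results(client, batch_id):
    """Return parsed results once the batch has completed, else None."""
    batch = client.batches.retrieve(batch_id)
    if batch.status != "completed":
        return None  # still validating or in progress; try again later
    return parse_batch_output(client.files.content(batch.output_file_id).text)


# Offline demonstration with two hand-written output lines
sample = "\n".join([
    json.dumps({
        "custom_id": "doc-0",
        "error": None,
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant", "content": "Clause A"}}]},
        },
    }),
    json.dumps({"custom_id": "doc-1", "error": {"code": "x", "message": "boom"}, "response": None}),
])
parsed = parse_batch_output(sample)
print(parsed)
```

In a pipeline you would call `fetch_batch_results(client, batch.id)` on a schedule (for example every few minutes) until it returns a dict, then join the results back to your documents by `custom_id`.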

Finding batch-eligible features

On the Cognocient Recommendations page, features eligible for batch routing are flagged with: "report-generator: 89% of calls are non-interactive (triggered by cron/webhook, not user action). Switching to Batch API would save $640/month." Cognocient detects non-interactive calls by analysing call timing patterns and the presence of job-scheduler headers.
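
The two signals named above (timing patterns and job-scheduler headers) can be approximated with a simple heuristic. This is only a sketch and not Cognocient's actual detection logic: the header names in `SCHEDULER_HEADERS`, the call-log shape, and the rule "the same hour and minute recurs on at least three days" are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical header names; real ones depend on your scheduler
SCHEDULER_HEADERS = {"x-cron-job", "x-scheduled-task"}


def non_interactive_share(calls, min_days=3):
    """Estimate the share of calls that look job-triggered rather than user-triggered.

    `calls` is a list of dicts with hypothetical keys 'timestamp' (datetime)
    and 'headers' (dict of request headers).
    """
    # Days on which each (hour, minute) slot was seen: cron jobs hit the same slot daily
    days_by_slot = defaultdict(set)
    for c in calls:
        ts = c["timestamp"]
        days_by_slot[(ts.hour, ts.minute)].add(ts.date())

    def looks_scheduled(c):
        has_header = any(name.lower() in SCHEDULER_HEADERS for name in c["headers"])
        ts = c["timestamp"]
        recurring = len(days_by_slot[(ts.hour, ts.minute)]) >= min_days
        return has_header or recurring

    return sum(looks_scheduled(c) for c in calls) / len(calls) if calls else 0.0


# Toy log: three nightly 02:00 runs, one header-tagged job, one ad hoc user call
calls = [
    {"timestamp": datetime(2024, 5, d, 2, 0), "headers": {}} for d in (1, 2, 3)
] + [
    {"timestamp": datetime(2024, 5, 1, 14, 37), "headers": {"X-Cron-Job": "nightly"}},
    {"timestamp": datetime(2024, 5, 2, 9, 13), "headers": {}},
]
share = non_interactive_share(calls)
print(f"{share:.0%} of calls look non-interactive")
```

A feature whose share is high under a heuristic like this is a candidate for the Batch API, provided nobody is waiting on the response.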


Next steps: Semantic Similarity Caching · AI Recommendations · Waste Detection
