How do I save up to 75% with prompt caching and batch routing?

Two structural cost reductions that most teams leave on the table are prompt caching (up to 75% off cached reads) and async batch routing (50% off non-real-time jobs). Cognocient detects opportunities for both automatically and shows them on the Recommendations page.

Discount	Value	Detail
Anthropic prompt cache discount	up to 90%	on cache_read tokens (per the pricing table below)
OpenAI prompt cache discount	50%	on cached prefix reads
Batch API discount	50%	gpt-4o, claude-sonnet, and more

Prompt caching

Prompt caching is a feature on Anthropic and OpenAI that lets you pay a fraction of the normal price when the beginning of your prompt is identical across calls. If you have a 10,000-token system prompt that stays the same, only the first call pays for it in full (on Anthropic, at the slightly higher cache write rate shown in the table below); subsequent calls with the same prefix pay the cached read rate.

Provider	Model	Normal input	Cache write	Cache read	Saving
Anthropic	`claude-sonnet-4-6`	$3.00 / MTok	$3.75 / MTok	$0.30 / MTok	90% on cache reads
Anthropic	`claude-haiku`	$0.25 / MTok	$0.30 / MTok	$0.03 / MTok	88% on cache reads
OpenAI	`gpt-4o`	$2.50 / MTok	$2.50 / MTok	$1.25 / MTok	50% on cache reads
OpenAI	`gpt-4o-mini`	$0.15 / MTok	$0.15 / MTok	$0.075 / MTok	50% on cache reads
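
As a rough worked example of the table above, the sketch below compares the cost of a static prompt prefix with and without caching. The workload (a 10,000-token prefix and 1,000 calls) is hypothetical, and the function assumes the best case: the first call pays the cache write rate and every later call is a cache hit. In practice cache entries expire, so some calls pay the write rate again.

```python
def prefix_cost(prefix_tokens, calls, normal, write, read):
    """Return (uncached_usd, cached_usd) for a static prompt prefix.

    Prices are USD per million tokens (MTok), taken from the table above.
    Assumes the first call pays the cache-write rate and every later call
    is a cache hit, which is the best case.
    """
    mtok = prefix_tokens / 1_000_000
    uncached = calls * mtok * normal
    cached = mtok * write + (calls - 1) * mtok * read
    return uncached, cached


# Hypothetical workload: 10,000-token system prompt, 1,000 calls,
# priced with the claude-sonnet-4-6 row of the table
uncached, cached = prefix_cost(10_000, 1_000, normal=3.00, write=3.75, read=0.30)
saving = 1 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saving {saving:.1%}")
```

With these inputs the prefix costs about $30 uncached and about $3 cached, so the saving on the prefix approaches the 90% read discount as the number of calls grows.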

Cognocient analyses your call patterns and flags features where the same large system prompt appears in more than 50% of calls. On the Recommendations page you'll see: "Enable Anthropic prompt caching on 'support-chat' — saves $1,240/month. Large system prompt (12,000 tokens) repeated 4,800 times last month."
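
The detection rule described above can be approximated in a few lines. This is only an illustrative sketch, not Cognocient's actual implementation: the call-log shape (`feature`, `system_prompt`, `system_tokens`), the function name, and the 1,024-token threshold for "large" are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def flag_cacheable_features(calls, min_share=0.5, min_tokens=1024):
    """Flag features whose most common system prompt is large and repeated.

    `calls` is a list of dicts with hypothetical keys:
    'feature', 'system_prompt', 'system_tokens'.
    """
    by_feature = defaultdict(list)
    for call in calls:
        by_feature[call["feature"]].append(call)

    flagged = {}
    for feature, rows in by_feature.items():
        # Count how often each distinct system prompt appears for this feature
        counts = Counter(row["system_prompt"] for row in rows)
        prompt, repeats = counts.most_common(1)[0]
        share = repeats / len(rows)
        tokens = next(r["system_tokens"] for r in rows if r["system_prompt"] == prompt)
        # Flag only prompts that are both large and present in most calls
        if share > min_share and tokens >= min_tokens:
            flagged[feature] = {"share": share, "tokens": tokens, "repeats": repeats}
    return flagged


# Made-up sample log: one feature with a repeated large prompt, one without
sample_calls = (
    [{"feature": "support-chat", "system_prompt": "DOCS", "system_tokens": 12_000}] * 9
    + [{"feature": "support-chat", "system_prompt": "OTHER", "system_tokens": 200}]
    + [{"feature": "search", "system_prompt": f"q{i}", "system_tokens": 50} for i in range(5)]
)
flagged = flag_cacheable_features(sample_calls)
print(flagged)
```

Here only `support-chat` is flagged, because its large prompt appears in 9 of 10 calls, while every `search` call uses a different short prompt.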

Enabling prompt caching — Anthropic

Anthropic requires you to add cache_control: {type: 'ephemeral'} to the content blocks you want cached. Cognocient passes this through and records the cache_read_input_tokens at the discounted rate.

import anthropic
 
client = anthropic.Anthropic(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)
 
# Your large, stable system prompt
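SYSTEM_PROMPT = """
You are a helpful support agent for Acme Corp.
[... 10,000 tokens of product documentation, FAQs, policies ...]
"""

# Placeholder question; in your application this comes from the end user
user_question = "How do I reset my password?"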
 
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # <-- mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": user_question}  # only this changes per call
    ],
    extra_headers={
        "X-Cost-Feature": "support-chat",
    },
)
 
# Cognocient dashboard shows:
# cache_write_input_tokens: 10,000 (first call only)
# cache_read_input_tokens:  10,000 (all subsequent calls — 90% cheaper)
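
To check the discount yourself, you can price a single response from its usage fields. The sketch below assumes the proxy returns the provider's usage object unchanged, with `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` as the Anthropic SDK reports them. The helper name and the `SimpleNamespace` stand-ins are illustrative only, and the default prices are the `claude-sonnet-4-6` row of the pricing table.

```python
from types import SimpleNamespace


def anthropic_input_cost(usage, normal=3.00, write=3.75, read=0.30):
    """Input cost in USD for one response, using per-MTok prices.

    `usage` is expected to expose input_tokens, cache_creation_input_tokens
    and cache_read_input_tokens, as response.usage does in the Anthropic SDK.
    """
    uncached = usage.input_tokens or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    return (uncached * normal + written * write + cached * read) / 1_000_000


# Stand-ins for response.usage on the first call and on a later cache hit
first_call = SimpleNamespace(
    input_tokens=40, cache_creation_input_tokens=10_000, cache_read_input_tokens=0
)
later_call = SimpleNamespace(
    input_tokens=40, cache_creation_input_tokens=0, cache_read_input_tokens=10_000
)

first_cost = anthropic_input_cost(first_call)
later_cost = anthropic_input_cost(later_call)
print(f"first call ${first_cost:.5f}, later call ${later_cost:.5f}")
```

In real code you would pass `response.usage` from the call above instead of the stand-in objects.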

Enabling prompt caching — OpenAI

OpenAI prompt caching is automatic — no code change needed. OpenAI caches any prompt prefix of 1,024+ tokens that is identical across calls. Cognocient tracks and reports on cached_tokens from the usage object.

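# OpenAI: prompt caching is automatic, nothing to change
from openai import OpenAI

client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)

# LARGE_SYSTEM_PROMPT and user_question are supplied by your application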
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LARGE_SYSTEM_PROMPT},  # cached automatically if ≥1024 tokens
        {"role": "user", "content": user_question},
    ],
    extra_headers={"X-Cost-Feature": "support-chat"},
)
 
# Cognocient dashboard shows usage breakdown:
# prompt_tokens:        10,240
# prompt_tokens_cached:  9,800  (at 50% discount)
# completion_tokens:       180
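
The same check works for OpenAI responses. This sketch assumes the usage object is passed through unchanged, so cached tokens appear under `usage.prompt_tokens_details.cached_tokens` and are already included in `prompt_tokens`. The helper name is hypothetical, the default prices are the `gpt-4o` row of the pricing table, and the token counts reuse the dashboard example above.

```python
from types import SimpleNamespace


def openai_input_cost(usage, normal=2.50, cached_rate=1.25):
    """Input cost in USD for one response, using per-MTok prices.

    Cached tokens are part of prompt_tokens, so they are subtracted
    before the normal rate is applied.
    """
    details = getattr(usage, "prompt_tokens_details", None)
    cached = (getattr(details, "cached_tokens", 0) or 0) if details else 0
    uncached = usage.prompt_tokens - cached
    return (uncached * normal + cached * cached_rate) / 1_000_000


# Stand-in for response.usage, using the token counts from the example above
usage = SimpleNamespace(
    prompt_tokens=10_240,
    prompt_tokens_details=SimpleNamespace(cached_tokens=9_800),
    completion_tokens=180,
)
cost = openai_input_cost(usage)
full_price = usage.prompt_tokens * 2.50 / 1_000_000
print(f"with cache ${cost:.5f}, without cache ${full_price:.5f}")
```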

Batch routing (50% off async workloads)

Batch APIs let you submit jobs that complete within 24 hours at half the normal price. If your workload doesn't need a real-time response (report generation, bulk analysis, nightly summarisation), batch routing is one of the largest cost levers available, because the discount applies to the whole request rather than only to a cached prefix.

Model	Sync price	Batch price	Max turnaround
`gpt-4o`	$2.50 / MTok in	$1.25 / MTok in	24h
`gpt-4o-mini`	$0.15 / MTok in	$0.075 / MTok in	24h
`claude-sonnet-4-6`	$3.00 / MTok in	$1.50 / MTok in	24h
`claude-haiku`	$0.25 / MTok in	$0.125 / MTok in	24h

Batch-eligible workloads

Workload	Detail
Report generation	Nightly AI summaries, weekly analytics reports
Bulk document analysis	Processing uploaded PDFs, contract review
Data enrichment	Tagging, classification, extraction at scale
Evaluation pipelines	Automated QA scoring, model evals
Email summarisation	End-of-day digest, CRM enrichment
Embedding generation	Building vector indexes, semantic search

Do not use batch routing for features where users wait for a response (chat, search, generation in the UI). Batch responses take minutes to hours. It is designed for background processing pipelines only.
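
The rule above can be expressed as a small routing check in your own pipeline code. This is a minimal sketch under assumed names: the `Job` fields and `choose_route` are hypothetical and not part of any Cognocient or provider API.

```python
from dataclasses import dataclass


@dataclass
class Job:
    feature: str
    user_is_waiting: bool   # True for chat, search, or generation in the UI
    deadline_hours: float   # how long the result can wait


BATCH_TURNAROUND_HOURS = 24  # maximum turnaround from the pricing table


def choose_route(job: Job) -> str:
    """Return 'batch' only for background jobs that can tolerate a 24h wait."""
    if job.user_is_waiting:
        return "sync"
    if job.deadline_hours < BATCH_TURNAROUND_HOURS:
        return "sync"
    return "batch"


print(choose_route(Job("support-chat", user_is_waiting=True, deadline_hours=0)))
print(choose_route(Job("report-generator", user_is_waiting=False, deadline_hours=48)))
```

A background job with a deadline shorter than 24 hours still goes to the synchronous API, since the batch turnaround is only guaranteed within the full window.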

Batch API example — OpenAI

import json
from openai import OpenAI
 
client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)
 
# 1. Prepare batch requests as JSONL
documents = load_documents()  # e.g., 1,000 contracts to analyse
 
requests_jsonl = []
for i, doc in enumerate(documents):
    requests_jsonl.append(json.dumps({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Extract key clauses from this contract."},
                {"role": "user", "content": doc.text},
            ],
            "max_tokens": 500,
        },
    }))
 
# 2. Upload the batch file
batch_file = client.files.create(
    file=("batch.jsonl", "\n".join(requests_jsonl).encode()),
    purpose="batch",
)
 
# 3. Submit the batch (50% off — up to 24h turnaround)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"feature": "contract-extractor"},  # for Cognocient attribution
)
 
print(f"Batch submitted: {batch.id}")
# Check later: client.batches.retrieve(batch.id)
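
Once the batch is submitted, you still need to wait for it and read the results. The sketch below polls `client.batches.retrieve` until the batch reaches a terminal status, then parses the output JSONL into a dictionary keyed by `custom_id`. It assumes the proxy forwards the batch and file endpoints unchanged; the helper names and the sample output are illustrative only.

```python
import json
import time

TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60):
    """Poll until the batch reaches a terminal status, then return it."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        time.sleep(poll_seconds)


def parse_batch_output(jsonl_text):
    """Map each custom_id to its completion text, or None if the request failed."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        response = row.get("response")
        if row.get("error") or not response or response.get("status_code") != 200:
            results[row["custom_id"]] = None
            continue
        results[row["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
    return results


# Usage against a finished batch (needs network access, so shown as comments):
# batch = wait_for_batch(client, batch.id)
# if batch.status == "completed":
#     clauses = parse_batch_output(client.files.content(batch.output_file_id).text)

# Made-up output lines in the batch output format, for a local dry run
sample_output = "\n".join([
    json.dumps({
        "custom_id": "doc-0",
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"content": "Clause A"}}]},
        },
        "error": None,
    }),
    json.dumps({
        "custom_id": "doc-1",
        "response": None,
        "error": {"code": "server_error", "message": "failed"},
    }),
])
parsed = parse_batch_output(sample_output)
print(parsed)
```

Keying results by `custom_id` matters because batch output lines are not guaranteed to come back in the order they were submitted.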

Finding batch-eligible features

On the Cognocient Recommendations page, features eligible for batch routing are flagged with: "report-generator: 89% of calls are non-interactive (triggered by cron/webhook, not user action). Switching to Batch API would save $640/month." Cognocient detects non-interactive calls by analysing call timing patterns and the presence of job-scheduler headers.
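
To get a rough estimate from your own logs before consulting the Recommendations page, you could count calls that carry scheduler headers. The sketch below is an assumption-heavy approximation, not Cognocient's detection logic: the header names are hypothetical placeholders, the log shape is invented, and the timing analysis is left out entirely.

```python
# Hypothetical header names; substitute whatever your scheduler actually sends
SCHEDULER_HEADERS = {"x-cron-job", "x-scheduled-task"}


def non_interactive_share(calls):
    """Fraction of calls that carry at least one scheduler header."""
    if not calls:
        return 0.0
    flagged = sum(
        1
        for call in calls
        if SCHEDULER_HEADERS & {name.lower() for name in call.get("headers", {})}
    )
    return flagged / len(calls)


def estimated_batch_saving(monthly_spend_usd, share, discount=0.5):
    """Saving if the non-interactive share of spend moved to the Batch API."""
    return monthly_spend_usd * share * discount


# Made-up log: 8 scheduled calls and 2 user-triggered calls
sample_log = (
    [{"headers": {"X-Cron-Job": "nightly-report"}}] * 8
    + [{"headers": {"User-Agent": "browser"}}] * 2
)
share = non_interactive_share(sample_log)
saving = estimated_batch_saving(1_000.0, share)
print(f"non-interactive share {share:.0%}, estimated saving ${saving:.2f}/month")
```

The saving estimate assumes spend is spread evenly across calls, which will not hold if scheduled jobs use much longer prompts than interactive ones.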

Next steps: Semantic Similarity Caching · AI Recommendations · Waste Detection
