How do I save up to 75% with prompt caching and batch routing?
Two structural cost reductions most teams leave on the table: prompt caching (up to 75% off cached reads) and async batch routing (50% off non-real-time jobs). Cognocient detects opportunities automatically and shows them on the Recommendations page. (New feature)
Prompt caching discounts repeated static content in your prompts by up to 75%. Batch routing saves 50% on non-real-time jobs. Cognocient detects opportunities for both automatically and surfaces them on the Recommendations page.
| Discount | Value | Detail |
|---|---|---|
| Anthropic prompt cache discount | 75% | on cache_read tokens |
| OpenAI prompt cache discount | 50% | on cached prefix reads |
| Batch API discount | 50% | gpt-4o, claude-sonnet, and more |
Prompt caching
Prompt caching is a feature on Anthropic and OpenAI that lets you pay a fraction of the normal price when the beginning of your prompt is identical across calls. If you have a 10,000-token system prompt that stays the same, only the first call charges you full price — subsequent calls with the same prefix pay the cached read rate.
| Provider | Cache model | Normal input | Cache write | Cache read | Saving |
|---|---|---|---|---|---|
| Anthropic | claude-sonnet-4-6 | $3.00 / MTok | $3.75 / MTok | $0.30 / MTok | 90% on cache reads |
| Anthropic | claude-haiku | $0.25 / MTok | $0.30 / MTok | $0.03 / MTok | 88% on cache reads |
| OpenAI | gpt-4o | $2.50 / MTok | $2.50 / MTok | $1.25 / MTok | 50% on cache reads |
| OpenAI | gpt-4o-mini | $0.15 / MTok | $0.15 / MTok | $0.075 / MTok | 50% on cache reads |
Cognocient analyses your call patterns and flags features where the same large system prompt appears in more than 50% of calls. On the Recommendations page you'll see: "Enable Anthropic prompt caching on 'support-chat' — saves $1,240/month. Large system prompt (12,000 tokens) repeated 4,800 times last month."
Enabling prompt caching — Anthropic
Anthropic requires you to add cache_control: {type: 'ephemeral'} to the content blocks you want cached. Cognocient passes this through and records the cache_read_input_tokens at the discounted rate.
Enabling prompt caching — OpenAI
OpenAI prompt caching is automatic — no code change needed. OpenAI caches any prompt prefix of 1,024+ tokens that is identical across calls. Cognocient tracks and reports on cached_tokens from the usage object.
Batch routing (50% off async workloads)
Batch APIs let you submit jobs that complete within 24 hours at half the normal price. If your workload doesn't need a real-time response — report generation, bulk analysis, nightly summarisation — batch routing is the single largest cost lever available.
| Model | Sync price | Batch price | Max turnaround |
|---|---|---|---|
gpt-4o | $2.50 / MTok in | $1.25 / MTok in | 24h |
gpt-4o-mini | $0.15 / MTok in | $0.075 / MTok in | 24h |
claude-sonnet-4-6 | $3.00 / MTok in | $1.50 / MTok in | 24h |
claude-haiku | $0.25 / MTok in | $0.125 / MTok in | 24h |
Batch-eligible workloads
| Workload | Detail |
|---|---|
| Report generation | Nightly AI summaries, weekly analytics reports |
| Bulk document analysis | Processing uploaded PDFs, contract review |
| Data enrichment | Tagging, classification, extraction at scale |
| Evaluation pipelines | Automated QA scoring, model evals |
| Email summarisation | End-of-day digest, CRM enrichment |
| Embedding generation | Building vector indexes, semantic search |
Do not use batch routing for features where users wait for a response (chat, search, generation in the UI). Batch responses take minutes to hours. It is designed for background processing pipelines only.
Batch API example — OpenAI
Finding batch-eligible features
On the Cognocient Recommendations page, features eligible for batch routing are flagged with: "report-generator: 89% of calls are non-interactive (triggered by cron/webhook, not user action). Switching to Batch API would save $640/month." Cognocient detects non-interactive calls by analysing call timing patterns and the presence of job-scheduler headers.
Next steps: Semantic Similarity Caching · AI Recommendations · Waste Detection
Related articles