How do I enable semantic caching to eliminate duplicate API calls?
Add one header to enable caching for any prompt. Semantically equivalent requests return in under 10ms at $0.00. A 30% cache hit rate reduces your AI bill by 25-35%.
Goal: Eliminate redundant API calls — both exact duplicates and semantically equivalent prompts — so you pay $0.00 and get a response in under 10ms whenever the same question has been asked before.
Time: 5 minutes to enable. Cache hits appear immediately.
Best for: FAQ chatbots, help centre assistants, support bots, product Q&A — any workload where users ask similar questions repeatedly.
Step 1 — Add the cache header to eligible requests
Add X-Cog-Similarity-Cache: true to any request you want cached. That is the only change required.
Set it once at the client level
To enable caching on every call from a client — not per-request — set it in default_headers:
Step 2 — Verify a cache hit
Make two similar requests (not necessarily identical). Then check Dashboard → Live Calls — cache hits show $0.00 cost with <10ms latency.
You can also inspect response headers directly:
| Header | Value on a cache hit |
|---|---|
x-cog-cache-hit | true |
x-cog-cache-type | semantic or exact |
x-cog-latency | 8ms (vs ~400ms uncached) |
x-cog-similarity-score | 0.97 (how similar the cached prompt was) |
Step 3 — Tune the similarity threshold (optional)
The default threshold of 0.92 works well for most workloads. Adjust with X-Cog-Similarity-Threshold:
| Threshold | Behaviour | Use when |
|---|---|---|
0.98 | Near-exact matches only | Factual data, technical specifications |
0.92 | Default — good balance | General FAQ, help centre content |
0.85 | Aggressive matching | Policy questions, canonical answers |
Do not enable caching for prompts that include user-specific content, real-time data, or anything that should vary per user. Cache entries are shared across your account.
Step 4 — Set cache TTL for time-sensitive content
By default, cached responses expire after 24 hours. Override with x-cog-cache-ttl (in seconds):
What to expect after 7 days
Check Dashboard → Recommendations for your cache hit rate and savings estimate. Typical results:
| Workload | Expected hit rate | Bill reduction |
|---|---|---|
| FAQ / help centre | 40–70% | 35–60% |
| Support bot (varied users) | 20–40% | 15–30% |
| Product Q&A | 30–50% | 25–40% |
| Classification (fixed inputs) | 60–90% | 50–80% |
Related: Semantic Similarity Caching · Prompt Cache + Batch Routing · Cut Your AI Bill with One Click
Related articles
Tag Your First AI Call
Add 2 headers to your existing code and see per-feature spend in under 5 minutes.
Set a Monthly Spending Limit
Create a hard budget enforced at the proxy before charges reach your provider bill.
Cut Your AI Bill with One Click
Use AI Advisor recommendations to apply model downgrades and caching without code changes.