How do I enable semantic caching to eliminate duplicate API calls?

Add one header to enable caching for any prompt. Cache hits, including semantically equivalent requests, return in under 10ms at $0.00. A 30% cache hit rate typically reduces your AI bill by 25–35%.

Goal: Eliminate redundant API calls — both exact duplicates and semantically equivalent prompts — so you pay $0.00 and get a response in under 10ms whenever the same question has been asked before.

Time: 5 minutes to enable. Cache hits appear immediately.

Best for: FAQ chatbots, help centre assistants, support bots, product Q&A — any workload where users ask similar questions repeatedly.

Step 1 — Add the cache header to eligible requests

Add X-Cog-Similarity-Cache: true to any request you want cached. That is the only change required.
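For a single request, the header can be passed through the SDK's `extra_headers` argument, the same mechanism Step 3 uses. A minimal sketch, assuming the OpenAI Python SDK; `cached_request_kwargs` is a hypothetical helper for illustration, and only the header name and value come from this guide:

```python
def cached_request_kwargs(prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Build keyword arguments for client.chat.completions.create().

    Hypothetical helper: it only assembles the arguments, it does not
    send anything over the network.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Opt this one request into the cache
        "extra_headers": {"X-Cog-Similarity-Cache": "true"},
    }


kwargs = cached_request_kwargs("How do returns work?")
# With a configured client:
#   response = client.chat.completions.create(**kwargs)
```

Requests sent without the header are left uncached, so this form suits codebases where only some calls are eligible.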

Set it once at the client level

To enable caching on every call from a client — not per-request — set it in default_headers:

from openai import OpenAI

client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
    default_headers={
        "X-Cost-Feature":         "support-bot",
        "X-Cog-Similarity-Cache": "true",
    }
)
# Every call from this client now sends the cache header automatically

Step 2 — Verify a cache hit

Make two similar requests (not necessarily identical). Then check Dashboard → Live Calls — cache hits show $0.00 cost with <10ms latency.

You can also inspect response headers directly:

Header	Value on a cache hit
`x-cog-cache-hit`	`true`
`x-cog-cache-type`	`semantic` or `exact`
`x-cog-latency`	`8ms` (vs ~400ms uncached)
`x-cog-similarity-score`	`0.97` (how similar the cached prompt was)
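To check these headers from code rather than the dashboard, something like the following could be used. It is a sketch that parses the four headers from the table above; `summarize_cache_headers` is a hypothetical helper, and the sample values are copied from the table rather than captured from a live response:

```python
from typing import Mapping


def summarize_cache_headers(headers: Mapping[str, str]) -> dict:
    """Extract the cache-related response headers into a small dict."""
    # HTTP header names are case-insensitive, so normalise the keys first
    lowered = {key.lower(): value for key, value in headers.items()}
    score = lowered.get("x-cog-similarity-score")
    return {
        "hit": lowered.get("x-cog-cache-hit", "").lower() == "true",
        "type": lowered.get("x-cog-cache-type"),
        "latency": lowered.get("x-cog-latency"),
        "similarity": float(score) if score is not None else None,
    }


# Sample values taken from the table above
sample = {
    "x-cog-cache-hit": "true",
    "x-cog-cache-type": "semantic",
    "x-cog-latency": "8ms",
    "x-cog-similarity-score": "0.97",
}
summary = summarize_cache_headers(sample)
```

With the OpenAI Python SDK, `client.chat.completions.with_raw_response.create(...)` returns an object whose `.headers` can be passed to this function and whose `.parse()` yields the usual completion object.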

Step 3 — Tune the similarity threshold (optional)

The default threshold of 0.92 works well for most workloads. Adjust with X-Cog-Similarity-Threshold:

Threshold	Behaviour	Use when
`0.98`	Near-exact matches only	Factual data, technical specifications
`0.92`	Default — good balance	General FAQ, help centre content
`0.85`	Aggressive matching	Policy questions, canonical answers

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How do returns work?"}],
    extra_headers={
        "X-Cog-Similarity-Cache":     "true",
        "X-Cog-Similarity-Threshold": "0.90",  # slightly looser than the 0.92 default
    }
)
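To see how the threshold interacts with the `x-cog-similarity-score` value from Step 2, the comparison can be sketched as below. This assumes a cached entry is served when its score is greater than or equal to the threshold; the guide does not state whether the comparison is inclusive, and `would_match` is a hypothetical name:

```python
DEFAULT_THRESHOLD = 0.92  # default from this guide


def would_match(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True if a cached prompt with this score would be served.

    Assumption: the comparison is inclusive (score >= threshold).
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return score >= threshold


# A prompt scoring 0.90 against a cached entry, at the three thresholds
# from the table above
results = {t: would_match(0.90, t) for t in (0.98, 0.92, 0.85)}
```

Under that assumption, the same 0.90 score is a miss at 0.98 and 0.92 and a hit at 0.85, which is why lower thresholds raise the hit rate at the cost of looser matches.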

Do not enable caching for prompts that include user-specific content, real-time data, or anything that should vary per user. Cache entries are shared across your account.
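One way to apply this rule in code is to attach the header only when the caller has marked a prompt as safe to share. A minimal sketch; `cache_headers_for` and its `shareable` flag are hypothetical names, and deciding what counts as shareable stays with your application:

```python
def cache_headers_for(shareable: bool) -> dict:
    """Return per-request headers, opting in to caching only when safe."""
    if shareable:
        return {"X-Cog-Similarity-Cache": "true"}
    # No cache header: the request is not opted in
    return {}


faq_headers = cache_headers_for(shareable=True)       # generic FAQ prompt
account_headers = cache_headers_for(shareable=False)  # prompt with user data
```

This pattern assumes caching is opted into per request through `extra_headers` rather than through `default_headers` on the client.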

Step 4 — Set cache TTL for time-sensitive content

By default, cached responses expire after 24 hours. Override with x-cog-cache-ttl (in seconds):

# Cache policy answers for 7 days
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are your pricing plans?"}],
    extra_headers={
        "X-Cog-Similarity-Cache": "true",
        "x-cog-cache-ttl":        "604800",  # 7 days
    }
)
 
# Cache inventory answers for only 5 minutes
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Is product #4821 in stock?"}],
    extra_headers={
        "X-Cog-Similarity-Cache": "true",
        "x-cog-cache-ttl":        "300",     # 5 minutes
    }
)
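Because the TTL is given in raw seconds, it can help to derive the value from a readable duration. A small sketch using the standard library; `ttl_headers` is a hypothetical helper, and the header names are the ones used in this step:

```python
from datetime import timedelta


def ttl_headers(duration: timedelta) -> dict:
    """Build cache headers with the TTL expressed in whole seconds."""
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("TTL must be a positive duration")
    return {
        "X-Cog-Similarity-Cache": "true",
        "x-cog-cache-ttl": str(seconds),
    }


policy_headers = ttl_headers(timedelta(days=7))        # TTL "604800"
inventory_headers = ttl_headers(timedelta(minutes=5))  # TTL "300"
```

Either dict can be passed as `extra_headers` in the calls above.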

What to expect after 7 days

Check Dashboard → Recommendations for your cache hit rate and savings estimate. Typical results:

Workload	Expected hit rate	Bill reduction
FAQ / help centre	40–70%	35–60%
Support bot (varied users)	20–40%	15–30%
Product Q&A	30–50%	25–40%
Classification (fixed inputs)	60–90%	50–80%
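As a rough way to relate hit rate to bill reduction, the estimate below multiplies spend by the share of traffic sent with the cache header and by the hit rate on that traffic. It is a simplified model, not the dashboard's formula: it assumes cache hits cost $0.00 (as stated above) and that cached and uncached calls cost the same on average. `estimated_savings` and `cacheable_share` are hypothetical names:

```python
def estimated_savings(
    monthly_spend: float,
    hit_rate: float,
    cacheable_share: float = 1.0,
) -> float:
    """Estimate monthly savings from caching under a simple model.

    hit_rate: fraction of cache-enabled calls served from cache.
    cacheable_share: fraction of spend on calls sent with the cache header.
    """
    for name, value in (("hit_rate", hit_rate), ("cacheable_share", cacheable_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return monthly_spend * cacheable_share * hit_rate


# Illustrative numbers only: $1,000 monthly spend, 80% of it on
# cache-enabled calls, 40% hit rate on those calls
savings = estimated_savings(1000.0, hit_rate=0.40, cacheable_share=0.80)
```

Under this model, traffic that is not cache-enabled is one plausible reason the bill reduction in the table is lower than the hit rate.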
