How-to Guides

How do I enable semantic caching to eliminate duplicate API calls?

Add one header to enable caching for any prompt. Semantically equivalent requests then return in under 10ms at a cost of $0.00. A 30% cache hit rate typically reduces your AI bill by 25–35%.

Goal: Eliminate redundant API calls — both exact duplicates and semantically equivalent prompts — so you pay $0.00 and get a response in under 10ms whenever the same question has been asked before.

Time: 5 minutes to enable. Cache hits appear immediately.

Best for: FAQ chatbots, help centre assistants, support bots, product Q&A — any workload where users ask similar questions repeatedly.


Step 1 — Add the cache header to eligible requests

Add X-Cog-Similarity-Cache: true to any request you want cached. That is the only change required.

Set it once at the client level

To enable caching on every call from a client — not per-request — set it in default_headers:

from openai import OpenAI

client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
    default_headers={
        "X-Cost-Feature":         "support-bot",
        "X-Cog-Similarity-Cache": "true",
    }
)
# Every call from this client now sends the cache header automatically

Step 2 — Verify a cache hit

Make two similar requests (not necessarily identical). Then check Dashboard → Live Calls — cache hits show $0.00 cost with <10ms latency.

You can also inspect response headers directly:

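| Header                 | Value on a cache hit                      |
|------------------------|-------------------------------------------|
| x-cog-cache-hit        | true                                      |
| x-cog-cache-type       | semantic or exact                         |
| x-cog-latency          | 8ms (vs ~400ms uncached)                  |
| x-cog-similarity-score | 0.97 (how similar the cached prompt was)  |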
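
To read these headers from Python, the OpenAI SDK's with_raw_response accessor returns the HTTP response alongside the parsed completion. The helper below is a minimal sketch: parse_cache_headers is a hypothetical name, and the value formats ("8ms", "0.97") are assumed from the values listed above rather than from a published specification.

```python
from typing import Mapping, Optional, TypedDict


class CacheInfo(TypedDict):
    hit: bool
    cache_type: Optional[str]
    latency_ms: Optional[int]
    similarity: Optional[float]


def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    """Summarise the x-cog-* cache headers from one response."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    latency = h.get("x-cog-latency")
    score = h.get("x-cog-similarity-score")
    return {
        "hit": h.get("x-cog-cache-hit", "").lower() == "true",
        "cache_type": h.get("x-cog-cache-type"),
        # Assumed format: an integer followed by "ms", e.g. "8ms".
        "latency_ms": int(latency.removesuffix("ms")) if latency else None,
        "similarity": float(score) if score else None,
    }


# Usage with the client from Step 1 (needs network access and a valid key):
# raw = client.chat.completions.with_raw_response.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": "How do returns work?"}],
# )
# print(parse_cache_headers(raw.headers))
# completion = raw.parse()  # the usual ChatCompletion object
```

A response without any x-cog-* headers yields hit=False and None for the other fields, so the same helper can be used on cached and uncached calls.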

Step 3 — Tune the similarity threshold (optional)

The default threshold of 0.92 works well for most workloads. Adjust it with the X-Cog-Similarity-Threshold header; lower values match more loosely:

| Threshold | Behaviour                | Use when                                |
|-----------|--------------------------|-----------------------------------------|
| 0.98      | Near-exact matches only  | Factual data, technical specifications  |
| 0.92      | Default: good balance    | General FAQ, help centre content        |
| 0.85      | Aggressive matching      | Policy questions, canonical answers     |

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How do returns work?"}],
    extra_headers={
        "X-Cog-Similarity-Cache":     "true",
        "X-Cog-Similarity-Threshold": "0.90",  # more aggressive for FAQ
    }
)

Do not enable caching for prompts that include user-specific content, real-time data, or anything that should vary per user. Cache entries are shared across your account.
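
One way to follow this rule is to make caching opt-in per request, on a client created without the cache header in default_headers. The sketch below uses a hypothetical helper, cache_headers_if_shareable, and assumes the caller already knows whether a prompt carries user-specific or real-time content; it is not part of any SDK.

```python
from typing import Dict


def cache_headers_if_shareable(shareable: bool) -> Dict[str, str]:
    """Return the cache header only for prompts that are safe to share.

    An empty dict means the request is sent without the cache header.
    """
    if shareable:
        return {"X-Cog-Similarity-Cache": "true"}
    return {}


# A generic policy question: the same answer suits every user.
faq_headers = cache_headers_if_shareable(True)

# A prompt that embeds one customer's order details: leave caching off.
private_headers = cache_headers_if_shareable(False)

# Pass the result as extra_headers=... on client.chat.completions.create().
```

Keeping the decision in one function makes it easier to audit which call sites can ever share a cached answer.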

Step 4 — Set cache TTL for time-sensitive content

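By default, cached responses expire after 24 hours. Override this with the x-cog-cache-ttl header, whose value is in seconds (HTTP header names are case-insensitive, so the lowercase form works alongside X-Cog-Similarity-Cache):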

# Cache policy answers for 7 days
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are your pricing plans?"}],
    extra_headers={
        "X-Cog-Similarity-Cache": "true",
        "x-cog-cache-ttl":        "604800",  # 7 days
    }
)

# Cache inventory answers for only 5 minutes, because stock levels change quickly
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Is product #4821 in stock?"}],
    extra_headers={
        "X-Cog-Similarity-Cache": "true",
        "x-cog-cache-ttl":        "300",     # 5 minutes
    }
)
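
Raw second counts such as 604800 are easy to mistype. A small helper can derive the header value from a timedelta instead. This is a sketch: ttl_headers is a hypothetical name, and the rule that the TTL must be a positive whole number of seconds is an assumption rather than documented behaviour.

```python
from datetime import timedelta
from typing import Dict


def ttl_headers(ttl: timedelta) -> Dict[str, str]:
    """Build cache headers with a TTL expressed in whole seconds."""
    seconds = int(ttl.total_seconds())
    if seconds <= 0:
        # Assumption: a zero or negative TTL is not meaningful.
        raise ValueError("ttl must be at least one second")
    return {
        "X-Cog-Similarity-Cache": "true",
        "x-cog-cache-ttl": str(seconds),
    }


week = ttl_headers(timedelta(days=7))         # x-cog-cache-ttl: 604800
five_min = ttl_headers(timedelta(minutes=5))  # x-cog-cache-ttl: 300
```

Either dict can be passed as extra_headers in the calls shown above.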

What to expect after 7 days

Check Dashboard → Recommendations for your cache hit rate and savings estimate. Typical results:

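| Workload                      | Expected hit rate | Bill reduction |
|-------------------------------|-------------------|----------------|
| FAQ / help centre             | 40–70%            | 35–60%         |
| Support bot (varied users)    | 20–40%            | 15–30%         |
| Product Q&A                   | 30–50%            | 25–40%         |
| Classification (fixed inputs) | 60–90%            | 50–80%         |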
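
For a rough sense of what a hit rate means in dollars, the arithmetic can be sketched as below. It assumes cache hits cost $0.00 (as stated above) and that every request would otherwise cost the same. Under that assumption the bill reduction equals the hit rate, which is higher than the ranges listed above, so treat the result as an upper bound rather than a prediction. The function name and the example figures are illustrative.

```python
def estimate_monthly_savings(
    monthly_requests: int,
    avg_cost_per_request: float,
    hit_rate: float,
) -> float:
    """Estimate dollars saved per month when cache hits cost $0.00."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return monthly_requests * avg_cost_per_request * hit_rate


# Illustrative figures: 100,000 requests at $0.002 each with a 40% hit rate.
baseline = 100_000 * 0.002  # monthly spend without caching
saved = estimate_monthly_savings(100_000, 0.002, 0.40)
print(f"baseline ${baseline:.2f}, saved ${saved:.2f}")
```

Comparing this estimate with the savings figure under Dashboard → Recommendations shows how far your workload departs from the uniform-cost assumption.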

Related: Semantic Similarity Caching · Prompt Cache + Batch Routing · Cut Your AI Bill with One Click
