How does Cognocient's semantic similarity caching work?

Cognocient eliminates redundant API calls by caching both exact duplicates and semantically equivalent prompts. When a new prompt is close enough to a cached one, Cognocient returns the cached response in under 10ms without calling the provider. A 30% cache hit rate typically reduces your AI bill by 25–35%, and some FAQ workloads reach hit rates of 70% or more.

Metric	Value
Cache hit latency	`<10ms`
Cost per cache hit	$0.00
Hit rate for FAQ workloads	70%+
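
As a rough illustration of the arithmetic behind these figures, the sketch below assumes that cache hits are billed at $0.00 (as in the table above) and that cached requests cost about as much as the average request. Under that assumption the saving simply equals the hit rate; the quoted 25–35% range suggests that real workloads deviate somewhat from the average. The function name `bill_with_cache` and the $1,000 baseline are hypothetical.

```python
def bill_with_cache(baseline_bill: float, hit_rate: float) -> float:
    """Estimate the bill when `hit_rate` of the spend is served from cache for free.

    Assumes cached requests cost the same as the average request.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    return baseline_bill * (1.0 - hit_rate)


baseline = 1000.00  # hypothetical monthly spend in dollars
for rate in (0.30, 0.70):
    after = bill_with_cache(baseline, rate)
    print(f"hit rate {rate:.0%}: ${after:,.2f} (saves ${baseline - after:,.2f})")
```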

Exact match caching

The simplest form of caching. When the exact same prompt is sent again (byte-for-byte identical), Cognocient returns the cached response immediately without forwarding to the provider.

Enable with a single header:

from openai import OpenAI
 
client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is our refund policy?"}],
    extra_headers={
        "x-cog-cache": "true",  # enable caching for this request
    }
)

Cache hit indicators:

Header	Value
`x-cog-cache-hit`	`true`
`x-cog-cache-type`	`exact`
`x-cog-latency`	`8ms` (vs ~400ms uncached)
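
To make "byte-for-byte identical" concrete, here is a minimal sketch of an exact-match lookup. It assumes the cache key is a SHA-256 digest of the prompt text; Cognocient's actual key derivation is not documented here and may well include other request fields such as the model. The names `exact_cache_key`, `put`, `get`, and `store` are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def exact_cache_key(prompt: str) -> str:
    """Hypothetical cache key: SHA-256 digest of the prompt's UTF-8 bytes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


# In-memory stand-in for the cache store (an assumption for illustration).
store: Dict[str, str] = {}


def put(prompt: str, response: str) -> None:
    store[exact_cache_key(prompt)] = response


def get(prompt: str) -> Optional[str]:
    return store.get(exact_cache_key(prompt))


put("What is our refund policy?", "<cached response>")

print(get("What is our refund policy?"))   # identical bytes: hit
print(get("What is our refund policy? "))  # trailing space: miss (None)
print(get("what is our refund policy?"))   # different case: miss (None)
```

Any difference in whitespace, casing, or punctuation produces a different key, which is the gap that semantic caching is meant to close.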

Semantic similarity caching

Exact match only catches identical prompts. Semantic caching catches prompts that mean the same thing. "What is your return policy?" and "How do I return an item?" are different strings but semantically equivalent — semantic caching serves both from cache.

How it works under the hood:

Incoming prompt
       │
       ▼
OpenAI text-embedding-3-small
       │  generates 1536-dim vector
       ▼
pgvector HNSW index
       │  cosine similarity search
       │  threshold: 0.92 by default (configurable)
       ▼
Cache hit?  ──YES──▶  Return cached response (<10ms, $0.00)
       │
       NO
       │
       ▼
Forward to AI provider (normal flow)
       │
       ▼
Store embedding + response in cache
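
The flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cognocient's implementation: the 3-dimensional vectors stand in for the 1536-dimensional embeddings, a linear scan stands in for the pgvector HNSW index (which performs an approximate nearest-neighbour search), and the `SemanticCache` class is hypothetical. The threshold defaults to 0.92, the documented default for `X-Cog-Similarity-Threshold`.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory stand-in for the embedding index and response store."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def store(self, embedding: Sequence[float], response: str) -> None:
        self.entries.append((list(embedding), response))

    def lookup(self, embedding: Sequence[float]) -> Optional[Tuple[str, float]]:
        """Return (response, score) for the closest entry if it meets the threshold."""
        best: Optional[Tuple[str, float]] = None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if best is None or score > best[1]:
                best = (response, score)
        if best is not None and best[1] >= self.threshold:
            return best
        return None


cache = SemanticCache()
cache.store([1.0, 0.0, 0.0], "<cached response>")  # toy embedding of a cached prompt

print(cache.lookup([0.9, 0.1, 0.0]))  # close in direction: hit
print(cache.lookup([0.7, 0.7, 0.0]))  # too far apart: miss (None)
```

On a miss, the caller would forward the request to the provider and then call `store` with the new embedding and response.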

New headers for semantic caching:

X-Cog-Similarity-Cache: true

Enable semantic (vector) caching for this request. When set to true, Cognocient embeds the prompt and searches the HNSW index before forwarding to the provider.

X-Cog-Similarity-Threshold: 0.95

Cosine similarity threshold for a cache hit. Range: 0.0–1.0. Default: 0.92. Higher values (0.98+) only match near-identical prompts. Lower values (0.85) are more aggressive and may occasionally return slightly mismatched responses.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}],
    extra_headers={
        "X-Cog-Similarity-Cache":     "true",
        "X-Cog-Similarity-Threshold": "0.95",  # optional, default 0.92
    }
)
 
# Check if it was a semantic cache hit
if response.model == "cached":  # model field is "cached" on hits
    print("Served from semantic cache — $0.00 cost")
 
# Or check the response header (when using raw fetch)
# x-cog-cache-hit: true
# x-cog-cache-type: semantic
# x-cog-similarity-score: 0.973

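import OpenAI from 'openai';

// Point the client at the Cognocient proxy, as in the Python examples
const openai = new OpenAI({
  apiKey: 'sk-cog-YOUR-PROXY-KEY',
  baseURL: 'https://api.cognocient.com/v1',
});

const response = await openai.chat.completions.create(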
  {
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'How do I get a refund?' }],
  },
  {
    headers: {
      'X-Cog-Similarity-Cache':     'true',
      'X-Cog-Similarity-Threshold': '0.95',
    },
  }
);
 
// If "What is your return policy?" is already cached and matches with similarity 0.973,
// → the same cached response is returned and $0.00 is billed
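
The cache indicator headers shown in this section can be collected into a small structure for logging or metrics. The sketch below is a hypothetical helper (`CacheInfo` and `parse_cache_headers` are not part of any SDK) that reads the documented header names from any header mapping, ignoring case.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CacheInfo:
    hit: bool
    cache_type: Optional[str]    # "exact" or "semantic" when present
    similarity: Optional[float]  # only sent for semantic hits


def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    """Extract the x-cog-* cache indicators from a response header mapping."""
    lowered = {name.lower(): value for name, value in headers.items()}
    score = lowered.get("x-cog-similarity-score")
    return CacheInfo(
        hit=lowered.get("x-cog-cache-hit", "").lower() == "true",
        cache_type=lowered.get("x-cog-cache-type"),
        similarity=float(score) if score is not None else None,
    )


info = parse_cache_headers({
    "x-cog-cache-hit": "true",
    "x-cog-cache-type": "semantic",
    "x-cog-similarity-score": "0.973",
})
print(info)
```

With the OpenAI Python SDK, `client.chat.completions.with_raw_response.create(...)` returns an object that exposes `.headers`, which could be passed to `parse_cache_headers`, and `.parse()`, which yields the usual completion object.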

Cache TTL (time-to-live)

By default, cached responses expire after 24 hours. Set a custom TTL per request with the `x-cog-cache-ttl` header, specified in seconds.

# Cache this response for 7 days (604800 seconds)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are your pricing plans?"}],
    extra_headers={
        "x-cog-cache":     "true",
        "x-cog-cache-ttl": "604800",  # 7 days in seconds
    }
)
 
# Cache this response for only 5 minutes (300 seconds)
# Good for: data that updates frequently (prices, inventory)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the current stock status?"}],
    extra_headers={
        "x-cog-cache":     "true",
        "x-cog-cache-ttl": "300",
    }
)
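
For intuition about what a TTL does on the cache side, here is a minimal sketch of expiry logic, assuming each entry stores an absolute expiry time and is dropped on the first read after that time. The `TTLCache` class and the injectable clock are hypothetical; only the 24-hour default and the per-request TTL in seconds come from the documentation above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # documented default: 24 hours


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, response: str,
            ttl_seconds: float = DEFAULT_TTL_SECONDS) -> None:
        self._entries[key] = (self._clock() + ttl_seconds, response)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return response


# A fake clock makes the expiry visible without waiting.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("stock status", "<cached response>", ttl_seconds=300)

print(cache.get("stock status"))  # within 5 minutes: hit
now[0] = 301.0
print(cache.get("stock status"))  # after 5 minutes: miss (None)
```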

Fail-safe behaviour

Caching is designed to be transparent and non-disruptive. If the cache lookup fails for any reason (index unavailable, timeout, error), the request automatically falls through to the provider. Your application never sees a cache error.

Scenario	Behaviour	Latency	Cost
Cache hit	Return cached response	`<10ms`	$0.00
Cache miss	Forward to provider	Normal	Normal
Cache unavailable	Forward to provider (silent fallback)	Normal	Normal
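
The fallback behaviour in the table can be pictured as a fail-open wrapper: any error during the cache lookup is treated as a miss. The sketch below is an illustration only; `complete_with_cache`, `cache_lookup`, and `call_provider` are hypothetical names, not Cognocient internals.

```python
from typing import Callable, Optional


def complete_with_cache(
    prompt: str,
    cache_lookup: Callable[[str], Optional[str]],
    call_provider: Callable[[str], str],
) -> str:
    """Serve from cache when possible; never let a cache failure reach the caller."""
    try:
        cached = cache_lookup(prompt)
    except Exception:
        cached = None  # index unavailable, timeout, or other error: treat as a miss
    if cached is not None:
        return cached
    return call_provider(prompt)


def broken_lookup(prompt: str) -> Optional[str]:
    raise TimeoutError("cache index unavailable")


def provider(prompt: str) -> str:
    return "<provider response>"


print(complete_with_cache("What is our refund policy?", broken_lookup, provider))
```

Errors raised by the provider call itself are deliberately not caught here, since only cache failures are meant to be silent.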

Use semantic caching aggressively on FAQ-style workloads (help centres, product docs, policy questions). A threshold of 0.90 is generally safe for these use cases, because the questions are highly canonical and vary little in meaning.

When to use (and when not to)

Good candidates

FAQ responses and help documentation
Product descriptions and policy questions
Classification prompts with fixed inputs
Any prompt where inputs are drawn from a finite set
Support bot responses to common issues

Not suitable

Personalised responses with user-specific content
Real-time data queries (prices, inventory, news)
Streaming responses where freshness matters
Creative generation (should vary each time)
Prompts containing the current date/time
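
One way to apply these lists in application code is to choose the request headers by workload type. The sketch below is only an illustration: the workload labels and the `cache_headers` helper are hypothetical, and it assumes caching stays off when no cache header is sent, which the opt-in header examples suggest but this page does not state outright. The 0.90 threshold for FAQ traffic follows the tip above.

```python
from typing import Dict

# Hypothetical workload labels, loosely following the "Good candidates" list.
CACHE_FRIENDLY = {"faq", "policy", "classification", "support"}


def cache_headers(workload: str) -> Dict[str, str]:
    """Return the extra headers to send for a given (hypothetical) workload label."""
    if workload not in CACHE_FRIENDLY:
        return {}  # personalised, real-time, or creative traffic: send no cache headers
    headers = {"X-Cog-Similarity-Cache": "true"}
    if workload == "faq":
        # More aggressive matching for highly canonical questions.
        headers["X-Cog-Similarity-Threshold"] = "0.90"
    return headers


print(cache_headers("faq"))
print(cache_headers("support"))
print(cache_headers("creative"))
```

The returned dictionary could be passed as `extra_headers` in the Python examples earlier on this page.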

Next steps: Prompt Cache + Batch Routing · Waste Detection · AI Recommendations
