How does Cognocient's semantic similarity caching work?

Cognocient eliminates redundant API calls by caching both exact duplicates and semantically equivalent prompts. When a new prompt is close enough to a cached one, Cognocient returns the cached response in under 10ms instead of calling the provider. A 30% cache hit rate typically reduces your AI bill by 25–35%, and some FAQ workloads reach hit rates of 70% or more.

Metric                        Value
Cache hit latency             <10ms
Cost per cache hit            $0.00
Hit rate for FAQ workloads    70%+

Exact match caching

The simplest form of caching. When the exact same prompt is sent again (byte-for-byte identical), Cognocient returns the cached response immediately without forwarding to the provider.
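To make "byte-for-byte identical" concrete, here is a minimal sketch of how an exact-match cache key could be derived. This is an illustration only: the function name `exact_cache_key`, the use of SHA-256, and the choice to include the model in the key are all assumptions, not Cognocient's documented implementation.

```python
import hashlib
import json


def exact_cache_key(model: str, messages: list[dict]) -> str:
    """Hypothetical cache key: SHA-256 over a canonical JSON encoding of the request."""
    payload = json.dumps(
        {"model": model, "messages": messages},
        sort_keys=True,            # stable key order so equal requests encode equally
        separators=(",", ":"),     # no incidental whitespace in the encoding
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = exact_cache_key("gpt-4o-mini", [{"role": "user", "content": "What is our refund policy?"}])
b = exact_cache_key("gpt-4o-mini", [{"role": "user", "content": "What is our refund policy?"}])
c = exact_cache_key("gpt-4o-mini", [{"role": "user", "content": "what is our refund policy?"}])

print(a == b)  # True: identical prompt, same key
print(a == c)  # False: a single changed character produces a different key
```

Under a scheme like this, any difference in casing, punctuation, or whitespace is a miss, which is the gap semantic caching is meant to close.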

Enable with a single header:

from openai import OpenAI
 
client = OpenAI(
    api_key="sk-cog-YOUR-PROXY-KEY",
    base_url="https://api.cognocient.com/v1",
)
 
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is our refund policy?"}],
    extra_headers={
        "x-cog-cache": "true",  # enable caching for this request
    }
)

Cache hit indicators:

Header              Value
x-cog-cache-hit     true
x-cog-cache-type    exact
x-cog-latency       8ms (vs ~400ms uncached)
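The indicators above arrive as HTTP response headers, so they need to be read from the raw response rather than from the parsed completion. The helper below is a small sketch of how they might be summarised; `describe_cache_headers` is a hypothetical name, and treating a missing `x-cog-cache-hit` header as a miss is an assumption.

```python
from typing import Mapping


def describe_cache_headers(headers: Mapping[str, str]) -> dict:
    """Summarise the cache-related response headers (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        # Assumption: an absent header means the response was not served from cache.
        "hit": lowered.get("x-cog-cache-hit", "false").lower() == "true",
        "type": lowered.get("x-cog-cache-type"),   # "exact" or "semantic"
        "latency": lowered.get("x-cog-latency"),   # e.g. "8ms"
    }


# Example using the header values listed above.
info = describe_cache_headers({
    "X-Cog-Cache-Hit": "true",
    "x-cog-cache-type": "exact",
    "x-cog-latency": "8ms",
})
print(info)
```

With the OpenAI Python SDK, the headers should be reachable via `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and a `.parse()` method that returns the usual completion object.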

Semantic similarity caching

Exact match only catches identical prompts. Semantic caching catches prompts that mean the same thing. "What is your return policy?" and "How do I return an item?" are different strings but semantically equivalent — semantic caching serves both from cache.

How it works under the hood:

Incoming prompt
       │
       ▼
OpenAI text-embedding-3-small
       │  generates 1536-dim vector
       ▼
pgvector HNSW index
       │  cosine similarity search
       │  threshold: 0.95 in this example (default 0.92, configurable)
       ▼
Cache hit?  ──YES──▶  Return cached response (<10ms, $0.00)
       │
       NO
       │
       ▼
Forward to AI provider (normal flow)
       │
       ▼
Store embedding + response in cache
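The flow above can be sketched in a few lines of Python. This is a toy, in-memory illustration and not Cognocient's code: a linear scan stands in for the pgvector HNSW index, `toy_embed` is a hypothetical keyword-count embedding standing in for text-embedding-3-small, and `SemanticCache`, `answer`, and `fake_provider` are names invented for the example.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # zero vectors match nothing


class SemanticCache:
    """Toy stand-in for the embedding and vector-index steps."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def lookup(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:  # linear scan, not HNSW
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # Only the nearest neighbour at or above the threshold counts as a hit.
        return best_response if best_score >= self.threshold else None

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))


def answer(prompt: str, cache: SemanticCache, call_provider: Callable[[str], str]) -> str:
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached                    # cache hit: provider is never called
    response = call_provider(prompt)     # cache miss: normal flow
    cache.store(prompt, response)        # store embedding + response
    return response


def toy_embed(text: str) -> Vector:
    # Hypothetical embedding: counts of a few keywords.
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in ("return", "refund", "policy", "shipping")]


calls: list[str] = []


def fake_provider(prompt: str) -> str:
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = SemanticCache(toy_embed)
first = answer("What is your return policy?", cache, fake_provider)    # miss, stored
second = answer("What is the return policy?", cache, fake_provider)    # hit
third = answer("How long does shipping take?", cache, fake_provider)   # miss
```

A real deployment would replace the linear scan with an approximate nearest-neighbour index, since scanning every entry does not scale past small caches.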

New headers for semantic caching:

X-Cog-Similarity-Cache: true

Enable semantic (vector) caching for this request. When set to true, Cognocient embeds the prompt and searches the HNSW index before forwarding to the provider.

X-Cog-Similarity-Threshold: 0.95

Cosine similarity threshold for a cache hit. Range: 0.0–1.0. Default: 0.92. Higher values (0.98+) only match near-identical prompts. Lower values (0.85) are more aggressive and may occasionally return slightly mismatched responses.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}],
    extra_headers={
        "X-Cog-Similarity-Cache":     "true",
        "X-Cog-Similarity-Threshold": "0.95",  # optional, default 0.92
    }
)
 
# Check if it was a semantic cache hit
if response.model == "cached":  # model field is "cached" on hits
    print("Served from semantic cache — $0.00 cost")
 
# Or check the response header (when using raw fetch)
# x-cog-cache-hit: true
# x-cog-cache-type: semantic
# x-cog-similarity-score: 0.973
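The same request from Node.js:

import OpenAI from 'openai';

// Point the client at the Cognocient proxy
const openai = new OpenAI({
  apiKey: 'sk-cog-YOUR-PROXY-KEY',
  baseURL: 'https://api.cognocient.com/v1',
});

const response = await openai.chat.completions.create(
  {
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'How do I get a refund?' }],
  },
  {
    headers: {
      'X-Cog-Similarity-Cache':     'true',
      'X-Cog-Similarity-Threshold': '0.95',
    },
  }
);

// If "What is your return policy?" is already cached and scores 0.973 against
// this prompt, the same response is returned and $0.00 is billed.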
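Because the threshold is passed as a string header, an out-of-range value is easy to send by accident. The helper below is a small sketch that validates the documented 0.0–1.0 range before building the headers; `similarity_headers` is a hypothetical name, and how Cognocient itself handles an invalid value is not stated here.

```python
def similarity_headers(threshold: float = 0.92) -> dict[str, str]:
    """Build the semantic-caching headers, rejecting out-of-range thresholds."""
    # 0.92 mirrors the documented default; the range check is client-side only.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0.0 and 1.0, got {threshold}")
    return {
        "X-Cog-Similarity-Cache": "true",
        "X-Cog-Similarity-Threshold": str(threshold),
    }


faq_headers = similarity_headers(0.9)      # more aggressive matching
strict_headers = similarity_headers(0.98)  # near-identical prompts only
print(faq_headers)
print(strict_headers)
```

The returned dictionary can be passed directly as `extra_headers` in the Python examples above.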

Cache TTL (time-to-live)

By default, cached responses expire after 24 hours. Set a custom TTL per request using the x-cog-cache-ttl header (in seconds).

# Cache this response for 7 days (604800 seconds)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are your pricing plans?"}],
    extra_headers={
        "x-cog-cache":     "true",
        "x-cog-cache-ttl": "604800",  # 7 days in seconds
    }
)
 
# Cache this response for only 5 minutes (300 seconds)
# Good for: data that updates frequently (prices, inventory)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the current stock status?"}],
    extra_headers={
        "x-cog-cache":     "true",
        "x-cog-cache-ttl": "300",
    }
)
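Hand-computed second counts are easy to get wrong, so it can help to derive the header value from a `timedelta`. This is a convenience sketch: `cache_headers` is a hypothetical helper, and rejecting non-positive TTLs is an assumption rather than documented behaviour.

```python
from datetime import timedelta


def cache_headers(ttl: timedelta) -> dict[str, str]:
    """Build exact-match caching headers with a TTL expressed in whole seconds."""
    seconds = int(ttl.total_seconds())
    if seconds <= 0:
        # Assumption: a zero or negative TTL is never what the caller intends.
        raise ValueError("TTL must be at least one second")
    return {"x-cog-cache": "true", "x-cog-cache-ttl": str(seconds)}


week = cache_headers(timedelta(days=7))
five_minutes = cache_headers(timedelta(minutes=5))
print(week["x-cog-cache-ttl"])          # 604800
print(five_minutes["x-cog-cache-ttl"])  # 300
```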

Fail-safe behaviour

Caching is designed to be transparent and non-disruptive. If the cache lookup fails for any reason (index unavailable, timeout, error), the request automatically falls through to the provider. Your application never sees a cache error.

Scenario             Behaviour                                Latency    Cost
Cache hit            Return cached response                   <10ms      $0.00
Cache miss           Forward to provider                      Normal     Normal
Cache unavailable    Forward to provider (silent fallback)    Normal     Normal
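The three scenarios above follow a fail-open pattern, which can be sketched as below. This illustrates the pattern only and is not Cognocient's implementation; `fetch_with_failsafe`, `broken_cache`, and `provider` are hypothetical names.

```python
from typing import Callable, Optional


def fetch_with_failsafe(
    prompt: str,
    cache_lookup: Callable[[str], Optional[str]],
    call_provider: Callable[[str], str],
) -> str:
    """Serve from cache when possible; treat any cache failure as a miss."""
    try:
        cached = cache_lookup(prompt)
    except Exception:
        cached = None  # index unavailable, timeout, or other error: fall through
    if cached is not None:
        return cached              # cache hit
    return call_provider(prompt)   # cache miss or cache unavailable


def broken_cache(prompt: str) -> Optional[str]:
    raise TimeoutError("vector index unavailable")


def provider(prompt: str) -> str:
    return f"provider answer to: {prompt}"


# The cache error is swallowed and the provider answers as normal.
result = fetch_with_failsafe("What is our refund policy?", broken_cache, provider)
print(result)
```

The caller sees the same return type in every scenario, which is what keeps the cache transparent to the application.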

Use semantic caching aggressively on FAQ-style workloads (help centres, product docs, policy questions). A threshold of 0.90 is generally safe for these use cases because the questions are highly canonical and vary little in wording.

When to use (and when not to)

Good candidates

  • FAQ responses and help documentation
  • Product descriptions and policy questions
  • Classification prompts with fixed inputs
  • Any prompt where inputs are drawn from a finite set
  • Support bot responses to common issues

Not suitable

  • Personalised responses with user-specific content
  • Real-time data queries (prices, inventory, news)
  • Streaming responses where freshness matters
  • Creative generation (should vary each time)
  • Prompts containing the current date/time

Next steps: Prompt Cache + Batch Routing · Waste Detection · AI Recommendations
