How does Cognocient's semantic similarity caching work?
Cognocient eliminates redundant API calls by caching both exact duplicates and semantically equivalent prompts. When a new prompt is close enough to a cached one, Cognocient returns the cached response in under 10ms instead of calling the provider. A 30% cache hit rate typically reduces your AI bill by 25–35%, and some FAQ workloads reach hit rates of 70% or more.
| Metric | Value |
|---|---|
| Cache hit latency | <10ms |
| Cost per cache hit | $0.00 |
| Hit rate for FAQ workloads | 70%+ |
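To see how a hit rate translates into savings, the arithmetic can be sketched as follows. This is a rough estimate, not Cognocient's billing formula: `estimated_savings` and its `relative_hit_cost` parameter are hypothetical names, and the only fact taken from the table above is that a cache hit costs $0.00.

```python
def estimated_savings(hit_rate: float, relative_hit_cost: float = 1.0) -> float:
    """Estimate the fraction of the original bill avoided by caching.

    hit_rate: share of requests served from cache (0.0 to 1.0).
    relative_hit_cost: hypothetical knob, the average provider cost of the
        requests that hit the cache relative to those that miss
        (1.0 means both groups cost the same per request).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    if relative_hit_cost < 0.0:
        raise ValueError("relative_hit_cost must be non-negative")

    avoided = hit_rate * relative_hit_cost   # spend that cache hits remove
    remaining = 1.0 - hit_rate               # spend still sent to the provider
    total = avoided + remaining              # what the bill was without caching
    return avoided / total if total else 0.0


if __name__ == "__main__":
    for cost in (0.78, 1.0, 1.25):
        print(f"relative cost {cost}: {estimated_savings(0.30, cost):.1%} saved")
```

Under these assumptions, a 30% hit rate saves about 30% when cached and uncached requests cost the same, and lands nearer 25% or 35% when the cached requests are somewhat cheaper or more expensive than average.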
Exact match caching
The simplest form of caching. When the exact same prompt is sent again (byte-for-byte identical), Cognocient returns the cached response immediately without forwarding to the provider.
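The idea can be illustrated with a minimal in-memory sketch. It is not Cognocient's implementation: the `ExactMatchCache` class, the SHA-256 key, and the `call_provider` callback are assumptions chosen for illustration.

```python
import hashlib
from typing import Callable, Dict, Tuple


class ExactMatchCache:
    """Illustrative exact-match cache keyed by a hash of the prompt bytes."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Hashing the UTF-8 bytes means only byte-for-byte identical
        # prompts map to the same key.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def complete(
        self, prompt: str, call_provider: Callable[[str], str]
    ) -> Tuple[str, bool]:
        """Return (response, cache_hit)."""
        key = self._key(prompt)
        if key in self._store:
            return self._store[key], True    # served from cache
        response = call_provider(prompt)     # forward to the provider
        self._store[key] = response
        return response, False
```

Note that even a trailing space or a change in capitalisation produces a different key, which is why exact matching alone misses many repeat questions.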
Enable with a single header:
Cache hit indicators:
| Header | Value |
|---|---|
| x-cog-cache-hit | true |
| x-cog-cache-type | exact |
| x-cog-latency | 8ms (vs ~400ms uncached) |
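A client can inspect these headers to log or measure cache behaviour. The sketch below assumes the response headers are available as a plain mapping and that `x-cog-latency` is formatted like `8ms`. `CacheInfo` and `parse_cache_headers` are hypothetical helper names, not part of any Cognocient SDK.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CacheInfo:
    hit: bool
    cache_type: Optional[str]
    latency_ms: Optional[float]


def parse_cache_headers(headers: Mapping[str, str]) -> CacheInfo:
    """Extract the cache indicators from a response's headers."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}

    hit = lowered.get("x-cog-cache-hit", "").lower() == "true"
    cache_type = lowered.get("x-cog-cache-type") or None

    latency_ms: Optional[float] = None
    raw_latency = lowered.get("x-cog-latency", "")
    if raw_latency.lower().endswith("ms"):   # assumed format, e.g. "8ms"
        try:
            latency_ms = float(raw_latency[:-2])
        except ValueError:
            latency_ms = None                # unparseable value: leave unset

    return CacheInfo(hit=hit, cache_type=cache_type, latency_ms=latency_ms)
```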
Semantic similarity caching
Exact match only catches identical prompts. Semantic caching catches prompts that mean the same thing. "What is your return policy?" and "How do I return an item?" are different strings but semantically equivalent — semantic caching serves both from cache.
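How it works under the hood: Cognocient embeds the incoming prompt, searches the HNSW index for the nearest cached prompt, and returns the cached response if the cosine similarity meets the threshold. Otherwise the request is forwarded to the provider.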
New headers for semantic caching:
X-Cog-Similarity-Cache: true
Enable semantic (vector) caching for this request. When set to true, Cognocient embeds the prompt and searches the HNSW index before forwarding to the provider.
X-Cog-Similarity-Threshold: 0.95
Cosine similarity threshold for a cache hit. Range: 0.0–1.0. Default: 0.92. Higher values (0.98+) only match near-identical prompts. Lower values (0.85) are more aggressive and may occasionally return slightly mismatched responses.
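The effect of the threshold can be sketched with a toy cache. To stay self-contained, this example takes embedding vectors as plain lists and uses a brute-force scan instead of an HNSW index. `SemanticCache` and its methods are hypothetical names and do not reflect Cognocient's internals.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Brute-force stand-in for a vector index (illustration only)."""

    def __init__(self, threshold: float = 0.92) -> None:
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def add(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best cached response if it meets the threshold."""
        best_score, best_response = -1.0, None
        for vector, response in self._entries:
            score = cosine_similarity(embedding, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

With the default of 0.92, a query vector of `[0.99, 0.05]` matches a cached `[1.0, 0.0]` (similarity of roughly 0.999), while `[0.5, 0.5]` (roughly 0.707) is treated as a miss. Lowering the threshold to 0.70 would turn that second lookup into a hit, which shows the trade-off between hit rate and answer accuracy.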
Cache TTL (time-to-live)
By default, cached responses expire after 24 hours. Set a custom TTL per request using the x-cog-cache-ttl header (in seconds).
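As a rough illustration, the caching headers described on this page could be assembled per request as shown below. The header names come from this article; `build_cache_headers` and its validation rules are assumptions, and the resulting dictionary would be merged into whatever headers your HTTP client already sends.

```python
from typing import Dict, Optional


def build_cache_headers(
    similarity_cache: bool = True,
    threshold: Optional[float] = None,
    ttl_seconds: Optional[int] = None,
) -> Dict[str, str]:
    """Build per-request caching headers; omitted values use the defaults."""
    headers: Dict[str, str] = {}

    if similarity_cache:
        headers["X-Cog-Similarity-Cache"] = "true"
        if threshold is not None:
            if not 0.0 <= threshold <= 1.0:
                raise ValueError("threshold must be between 0.0 and 1.0")
            headers["X-Cog-Similarity-Threshold"] = str(threshold)

    if ttl_seconds is not None:
        if ttl_seconds <= 0:
            raise ValueError("ttl_seconds must be a positive number of seconds")
        headers["x-cog-cache-ttl"] = str(int(ttl_seconds))  # value is in seconds

    return headers


if __name__ == "__main__":
    # Cache semantically similar prompts for one hour instead of 24 hours.
    print(build_cache_headers(threshold=0.95, ttl_seconds=3600))
```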
Fail-safe behaviour
Caching is designed to be transparent and non-disruptive. If the cache lookup fails for any reason (index unavailable, timeout, error), the request automatically falls through to the provider. Your application never sees a cache error.
| Scenario | Behaviour | Latency | Cost |
|---|---|---|---|
| Cache hit | Return cached response | <10ms | $0.00 |
| Cache miss | Forward to provider | Normal | Normal |
| Cache unavailable | Forward to provider (silent fallback) | Normal | Normal |
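The fall-through pattern in the table can be sketched as a small wrapper. This is a generic fail-open illustration rather than Cognocient's code: `complete_with_failsafe`, `cache_lookup`, and `call_provider` are hypothetical names.

```python
from typing import Callable, Optional


def complete_with_failsafe(
    prompt: str,
    cache_lookup: Callable[[str], Optional[str]],
    call_provider: Callable[[str], str],
) -> str:
    """Serve from cache when possible; never let a cache failure surface."""
    try:
        cached = cache_lookup(prompt)
    except Exception:
        # Index unavailable, timeout, or any other cache error:
        # treat it as a miss and fall through silently.
        cached = None

    if cached is not None:
        return cached                # cache hit
    return call_provider(prompt)     # cache miss or cache unavailable
```

Errors raised by the provider call itself are deliberately not caught here, so the caller sees exactly what it would see without caching.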
Use semantic caching aggressively on FAQ-style workloads (help centres, product docs, policy questions). A threshold of 0.90 is generally safe for these use cases because the questions are highly canonical and vary little in wording.
When to use (and when not to)
Good candidates
- FAQ responses and help documentation
- Product descriptions and policy questions
- Classification prompts with fixed inputs
- Any prompt where inputs are drawn from a finite set
- Support bot responses to common issues
Not suitable
- Personalised responses with user-specific content
- Real-time data queries (prices, inventory, news)
- Streaming responses where freshness matters
- Creative generation (should vary each time)
- Prompts containing the current date/time
Next steps: Prompt Cache + Batch Routing · Waste Detection · AI Recommendations