How do I set velocity limits and guardrails on AI calls?
Pre-call budget enforcement, max_tokens clamping, hierarchical budget chains, and velocity circuit breakers. Stop runaway spend before it reaches your provider bill.
Guardrails add a second enforcement layer on top of budgets — per-minute and per-hour call velocity limits that catch runaway usage before it drains your monthly budget. Combined with max_tokens clamping, they prevent overspend structurally rather than alerting after the fact.
Most cost tools read billing exports that are 24 hours late. By the time an alert fires, a weekend agent loop has already run. Cognocient checks and reserves budget before the call reaches the provider — making it impossible to exceed your limits, not just possible to be notified after the fact.
The check is a single atomic Redis operation: reserve the estimated cost, allow the call if the reservation fits within the budget, reject it if not. If 50 concurrent agent calls, each estimated at $0.01, hit the same $0.10 budget, exactly 10 succeed, every time.
Budget enforcement modes
Three modes, same atomic check:
| Mode | What happens | Best for |
|---|---|---|
| Block | HTTP 429 returned. Provider never sees the request. No charge possible. | Experiments, dev/test, agent sandboxes |
| Degrade | Request auto-switched to a cheaper model and continues. | Production services with SLA requirements |
| Alert | Request proceeds normally. Slack/email notification sent at threshold. | Baseline measurement before committing to limits |
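As a rough illustration of the table above, the three modes can be read as three different reactions to the same check result. This is a minimal sketch, not Cognocient's implementation: the `Mode` enum, the `decide` function, and the returned action strings are hypothetical names, and the alert threshold is treated as identical to the budget limit for simplicity.

```python
from enum import Enum


class Mode(Enum):
    BLOCK = "block"
    DEGRADE = "degrade"
    ALERT = "alert"


def decide(mode: Mode, within_budget: bool) -> str:
    """Map the result of the budget check to an action (hypothetical names)."""
    if within_budget:
        return "forward"
    if mode is Mode.BLOCK:
        return "reject_429"  # provider never sees the request
    if mode is Mode.DEGRADE:
        return "forward_cheaper_model"  # request continues on a cheaper model
    return "forward_and_notify"  # alert mode: proceed and send a notification
```

The point of the sketch is that the check itself does not change between modes; only the branch taken on a failed check does.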
How budget enforcement works
When a request arrives:
- Cognocient estimates the cost from prompt tokens (inferred from message length) and max_tokens
- An atomic Lua script reserves that amount in Redis, or rejects the request if the reservation would exceed the limit
- The call proceeds (or returns 429)
- After the call, the reservation is reconciled to the actual cost
Because reservation and check happen in a single Redis operation, there is no race condition. Concurrent calls cannot collectively exceed the budget.
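The reserve-then-reconcile flow described above can be sketched in memory. This is an illustration under stated assumptions, not the actual Lua script: the `BudgetLedger` class and `simulate` helper are hypothetical, a `threading.Lock` stands in for the atomicity that a Redis script provides, and amounts are tracked in integer micro-dollars to avoid floating-point drift.

```python
import threading


class BudgetLedger:
    """In-memory stand-in for the atomic reservation (illustration only)."""

    def __init__(self, limit_micros: int) -> None:
        self.limit_micros = limit_micros
        self.reserved_micros = 0
        # The lock plays the role of Redis executing a script atomically.
        self._lock = threading.Lock()

    def try_reserve(self, estimate_micros: int) -> bool:
        # Check and reserve inside one critical section, so two callers
        # can never both see "room left" and overshoot together.
        with self._lock:
            if self.reserved_micros + estimate_micros > self.limit_micros:
                return False  # the proxy would answer HTTP 429 here
            self.reserved_micros += estimate_micros
            return True

    def reconcile(self, estimate_micros: int, actual_micros: int) -> None:
        # After the call, swap the estimate for the actual cost.
        with self._lock:
            self.reserved_micros += actual_micros - estimate_micros


def simulate(calls: int, limit_micros: int, estimate_micros: int) -> int:
    """Fire `calls` concurrent reservations and count how many succeed."""
    ledger = BudgetLedger(limit_micros)
    results = []

    def worker() -> None:
        results.append(ledger.try_reserve(estimate_micros))

    threads = [threading.Thread(target=worker) for _ in range(calls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

With a $0.10 limit (100,000 micro-dollars) and a $0.01 estimate per call, `simulate(50, 100_000, 10_000)` returns 10 regardless of thread scheduling.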
HTTP 429 response:
max_tokens clamping
When a budget reservation is made in block mode, Cognocient also clamps the forwarded max_tokens to the value the remaining budget can actually afford.
Why this matters: without clamping, a caller requests max_tokens=4096. Budget has $0.10 remaining. The proxy reserves $0.10 (enough for ~167 tokens at gpt-4o rates) but forwards max_tokens=4096. The stream runs 4096 tokens. Actual spend: $0.40. Reservation: $0.10. Budget overshot 4x — the reservation was meaningless.
With clamping, the forwarded request gets max_tokens=167. The model physically cannot generate beyond what was reserved.
Response headers when clamping occurs:
Your application can read these headers to understand when clamping occurred. The clamped value is always at least 16 (minimum meaningful response).
Clamping is applied to chat completion calls only — embeddings have no max_tokens concept.
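The clamping rule can be written as a small calculation. This is a sketch under assumptions: the function name, the micro-dollar units, and the per-token price in the usage note are hypothetical, and prompt-token cost is ignored to keep the arithmetic visible. Only the floor of 16 comes from the text above.

```python
MIN_MAX_TOKENS = 16  # minimum clamped value stated in the text


def clamp_max_tokens(
    requested: int,
    remaining_micros: int,
    price_micros_per_token: int,
) -> int:
    """Return the max_tokens value to forward to the provider."""
    # Number of output tokens the remaining budget can pay for.
    affordable = remaining_micros // price_micros_per_token
    # Never forward more than requested, never less than the floor.
    return max(MIN_MAX_TOKENS, min(requested, affordable))
```

For example, with 3,000 micro-dollars remaining and a hypothetical price of 15 micro-dollars per output token, a request for 4096 tokens is forwarded with max_tokens=200, while a request for 100 tokens passes through unchanged.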
Hierarchical budget enforcement
Budgets form a chain: run → feature → department → org. Every matching level is checked atomically before a call proceeds. If any level is exhausted, the call is blocked — even if child budgets still have room.
Example: A per-run limit of $0.50 looks fine individually. 50 runs × $0.49 = $24.50 against a $20 department budget. Without hierarchy, every individual run passes while the department budget is blown. With hierarchy, the department ceiling wins.
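The example above can be worked through in code. This is a single-threaded sketch with hypothetical names (`try_reserve_chain`, `run_example`); in the real system the whole chain is checked atomically, which this sketch does not attempt to reproduce. Amounts are in cents.

```python
def try_reserve_chain(
    spent: dict[str, int],
    limits: dict[str, int],
    scopes: list[str],
    cost_cents: int,
) -> bool:
    """Reserve cost at every level, or at none of them."""
    # First pass: any exhausted level blocks the call.
    if any(spent.get(s, 0) + cost_cents > limits[s] for s in scopes):
        return False
    # Second pass: all levels have room, so record the spend everywhere.
    for s in scopes:
        spent[s] = spent.get(s, 0) + cost_cents
    return True


def run_example() -> int:
    """50 runs at $0.49 each, $0.50 per-run limit, $20 department limit."""
    limits = {"department": 2000}
    spent: dict[str, int] = {}
    allowed = 0
    for i in range(50):
        run_scope = f"run-{i}"
        limits[run_scope] = 50
        if try_reserve_chain(spent, limits, [run_scope, "department"], 49):
            allowed += 1
    return allowed
```

Here `run_example()` returns 40: the first 40 runs consume $19.60, and the 41st would bring the department to $20.09, so it and every later run are blocked even though each run's own $0.50 limit still has room.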
See Budget Enforcement for the full hierarchy explanation and scope labeling.
Velocity circuit breaker
An independent limit on tokens per minute (TPM), separate from budget enforcement. Activates automatically on runaway usage spikes — a single agent loop generating tokens 10x faster than the normal baseline is blocked, even if the budget has room.
The circuit breaker uses a sliding 60-second window. If TPM exceeds your key's baseline by a configured multiplier, the key is rate-limited and a Slack/webhook alert fires.
Configure the multiplier in Settings → API Keys → Velocity limit.
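A sliding-window breaker of this kind might look like the following. It is a simplified sketch, not the product's implementation: the `VelocityBreaker` class and its parameters are hypothetical, timestamps are passed in explicitly so the behavior is deterministic, and a call is rejected when it would push the window total past `baseline_tpm * multiplier`.

```python
from collections import deque


class VelocityBreaker:
    """Tokens-per-minute limiter over a sliding window (illustration only)."""

    def __init__(
        self, baseline_tpm: int, multiplier: float, window_s: float = 60.0
    ) -> None:
        self.threshold = baseline_tpm * multiplier
        self.window_s = window_s
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)
        self.total = 0

    def allow(self, now: float, tokens: int) -> bool:
        # Evict events that have slid out of the window.
        while self.events and self.events[0][0] <= now - self.window_s:
            _, old_tokens = self.events.popleft()
            self.total -= old_tokens
        # Trip the breaker if this call would exceed the threshold.
        if self.total + tokens > self.threshold:
            return False
        self.events.append((now, tokens))
        self.total += tokens
        return True
```

Note that the decision never consults the budget, which matches the independence described above: a key with plenty of budget left is still limited once its token rate spikes.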
Agentic enforcement — write vs. read
The X-Cost-Workload header controls what happens when a budget limit is reached during an agentic workflow:
| Workload | On budget exceeded | Why |
|---|---|---|
| agentic-write | Hard stop (429), always | Write ops mutate external state. A degraded cheaper model may produce incorrect actions. |
| agentic-read | Graceful degradation: switches to cheaper model | Read ops are safe to run with lower quality output. |
| (not set) | Inferred from tool names. Tools with create, update, delete → write. Everything else → read. | |
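The inference rule in the last row can be sketched as follows. The function names are hypothetical, and the matching strategy (case-insensitive substring match on the tool name) is an assumption; the text only says that tools "with create, update, delete" count as writes.

```python
from typing import Optional

WRITE_MARKERS = ("create", "update", "delete")


def infer_workload(header_value: Optional[str], tool_names: list[str]) -> str:
    """Resolve the workload from X-Cost-Workload, else from tool names."""
    if header_value in ("agentic-write", "agentic-read"):
        return header_value  # an explicit header always wins
    for name in tool_names:
        # Assumption: a substring match is enough to flag a write tool.
        if any(marker in name.lower() for marker in WRITE_MARKERS):
            return "agentic-write"
    return "agentic-read"


def on_budget_exceeded(workload: str) -> str:
    """Writes hard-stop; reads degrade to a cheaper model."""
    if workload == "agentic-write":
        return "hard_stop_429"
    return "degrade_to_cheaper_model"
```

Setting the header explicitly avoids relying on naming conventions: a tool called `sync_records` that mutates state would be classified as a read under this inference.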
Defense in depth — orchestration-layer check
For multi-step agent workflows, add a second protection layer by querying remaining budget before making the next tool call. This lets your agent wrap up gracefully instead of being hard-stopped mid-execution.
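A framework-neutral sketch of that orchestration-layer check is shown below. Everything here is hypothetical: `run_agent`, the `get_remaining_cents` callable (which in practice would query the remaining budget), and the `"wrap_up"` marker are illustrative names, not part of any Cognocient, LangGraph, or CrewAI API.

```python
from typing import Callable


def run_agent(
    steps: list[tuple[str, Callable[[], None]]],
    get_remaining_cents: Callable[[], int],
    min_remaining_cents: int,
) -> list[str]:
    """Run steps in order, wrapping up early when budget runs low."""
    log: list[str] = []
    for name, action in steps:
        # Check remaining budget before committing to the next tool call.
        if get_remaining_cents() < min_remaining_cents:
            log.append("wrap_up")  # summarize and exit instead of hitting a 429
            break
        action()
        log.append(name)
    return log
```

The threshold should leave enough headroom for the wrap-up step itself, since that step usually involves one more model call.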
See Budget Enforcement → Defense in depth for the full LangGraph and CrewAI examples.
Next steps: Budget Enforcement (full docs) · Cost Attribution · Security & Privacy