How do I set velocity limits and guardrails on AI calls?

Pre-call budget enforcement, max_tokens clamping, hierarchical budget chains, and velocity circuit breakers. Stop runaway spend before it reaches your provider bill.

Guardrails add a second enforcement layer on top of budgets — per-minute and per-hour call velocity limits that catch runaway usage before it drains your monthly budget. Combined with max_tokens clamping, they prevent overspend structurally rather than alerting after the fact.

Most cost tools read billing exports that are 24 hours late, so by the time an alert fires, a weekend agent loop has already run up the bill. Cognocient checks and reserves budget before the call reaches the provider, so spend stops at your limit instead of being reported after the fact.

The check is a single atomic Redis operation: reserve the estimated cost, allow the call if the reservation fits within the budget, and reject it otherwise. For example, if each call reserves $0.01, then 50 concurrent agent calls against the same $0.10 budget result in exactly 10 succeeding, every time.

Budget enforcement modes

Three modes, same atomic check:

| Mode | What happens | Best for |
|---|---|---|
| Block | HTTP 429 returned. Provider never sees the request. No charge possible. | Experiments, dev/test, agent sandboxes |
| Degrade | Request auto-switched to a cheaper model and continues. | Production services with SLA requirements |
| Alert | Request proceeds normally. Slack/email notification sent at threshold. | Baseline measurement before committing to limits |

How budget enforcement works

When a request arrives:

  1. Cognocient estimates the cost from prompt tokens (inferred from message length) and max_tokens
  2. An atomic Lua script reserves that amount in Redis — or rejects if the reservation would exceed the limit
  3. The call proceeds (or returns 429)
  4. After the call, the reservation is reconciled to the actual cost

Because reservation and check happen in a single Redis operation, there is no race condition. Concurrent calls cannot collectively exceed the budget.
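
The reserve, check, and reconcile steps above can be sketched in a few lines. This is a minimal in-memory illustration, not Cognocient's implementation: a `threading.Lock` stands in for the atomic Redis Lua script, and the `BudgetLedger` class, its method names, and the $0.01 per-call estimate are assumptions made for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BudgetLedger:
    """Hypothetical in-memory stand-in for the atomic reserve-or-reject check.

    The lock plays the role of the single atomic Redis operation. Amounts are
    integer micro-dollars so the arithmetic is exact.
    """

    def __init__(self, limit_micro_usd: int) -> None:
        self.limit = limit_micro_usd
        self.reserved = 0
        self._lock = threading.Lock()

    def reserve(self, estimate: int) -> bool:
        # Check and reserve inside one critical section, so two callers can
        # never both pass the check against the same remaining balance.
        with self._lock:
            if self.reserved + estimate > self.limit:
                return False  # the proxy would return HTTP 429 here
            self.reserved += estimate
            return True

    def reconcile(self, estimate: int, actual: int) -> None:
        # After the call, swap the estimated cost for the actual cost.
        with self._lock:
            self.reserved += actual - estimate


def run_concurrent(ledger: BudgetLedger, calls: int, estimate: int) -> int:
    """Fire `calls` reservations in parallel and count how many succeed."""
    with ThreadPoolExecutor(max_workers=calls) as pool:
        results = list(pool.map(lambda _: ledger.reserve(estimate), range(calls)))
    return sum(results)


ledger = BudgetLedger(limit_micro_usd=100_000)                # $0.10 budget
allowed = run_concurrent(ledger, calls=50, estimate=10_000)   # $0.01 per call
print(allowed)  # 10
```

If the check and the increment were separate steps, several threads could read the same balance and all pass, which is the race the single atomic operation removes.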

HTTP 429 response:

{
  "error": {
    "message": "Budget limit exceeded: customer-success",
    "type": "budget_error",
    "scope": "department"
  }
}
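
A caller may want to tell a budget rejection apart from other 429 responses. The sketch below reads only the fields shown in the response above; the `BudgetRejection` type and `parse_budget_rejection` function are hypothetical names, and treating any other 429 as a non-budget error is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetRejection:
    scope: str
    message: str


def parse_budget_rejection(status_code: int, body: dict) -> Optional[BudgetRejection]:
    """Return rejection details for a budget 429, or None for anything else."""
    if status_code != 429:
        return None
    error = body.get("error", {})
    if error.get("type") != "budget_error":
        return None  # assumed: some other kind of 429, handled elsewhere
    return BudgetRejection(
        scope=error.get("scope", ""),
        message=error.get("message", ""),
    )


sample = {
    "error": {
        "message": "Budget limit exceeded: customer-success",
        "type": "budget_error",
        "scope": "department",
    }
}
rejection = parse_budget_rejection(429, sample)
```

Because a budget rejection will not clear on an immediate retry, a caller would typically surface it or wrap up rather than back off and retry.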

max_tokens clamping

When a budget reservation is made in block mode, Cognocient also clamps the forwarded max_tokens to the value the remaining budget can actually afford.

Without clamping (broken):
  Caller sends:  max_tokens=4096
  Budget has:    $0.10 remaining (~167 tokens' worth at $0.0006/token)
  Proxy reserves $0.10 but forwards max_tokens=4096
  Model generates 4096 tokens → ACTUAL SPEND: ~$2.46 (4096 × $0.0006)
  Budget overshot ~25×; the reservation was meaningless ✗

With clamping (Cognocient):
  Caller sends:  max_tokens=4096
  Budget has:    $0.10 remaining
  Proxy computes: $0.10 ÷ $0.0006/token ≈ 167 tokens
  Proxy forwards: max_tokens=167
  Model physically cannot exceed reservation ✓
  Response headers: x-cog-max-tokens-clamped: 167
                    x-cog-max-tokens-original: 4096

Why this matters: a reservation only bounds spend if the model cannot generate past it. Without clamping, the proxy reserves $0.10 (enough for ~167 tokens at the $0.0006/token rate used in this example) but the stream still runs to 4096 tokens, so actual spend is roughly 25 times the reservation.

With clamping, the forwarded request gets max_tokens=167. The model physically cannot generate beyond what was reserved.

Response headers when clamping occurs:

x-cog-max-tokens-clamped: 167
x-cog-max-tokens-original: 4096

Your application can read these headers to understand when clamping occurred. The clamped value is always at least 16 (minimum meaningful response).

Clamping is applied to chat completion calls only — embeddings have no max_tokens concept.
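
The clamping arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the proxy's code: `clamp_max_tokens` is a hypothetical name, the per-token price is passed in by the caller, and the sketch rounds down, so it yields 166 where the rounded figure on this page is 167 (the proxy's actual rounding rule is not documented here). The floor of 16 and the two header names come from the text above.

```python
from decimal import Decimal
from typing import Dict, Tuple

MIN_CLAMPED_TOKENS = 16  # minimum meaningful response, per the text above


def clamp_max_tokens(
    requested: int,
    remaining_usd: Decimal,
    usd_per_token: Decimal,
) -> Tuple[int, Dict[str, str]]:
    """Return the max_tokens to forward, plus headers if clamping occurred."""
    # Whole tokens the remaining budget can pay for (rounded down: assumption).
    affordable = int(remaining_usd // usd_per_token)
    # Never clamp below the documented floor, never raise the caller's request.
    ceiling = max(MIN_CLAMPED_TOKENS, affordable)
    forwarded = min(requested, ceiling)

    headers: Dict[str, str] = {}
    if forwarded < requested:
        headers["x-cog-max-tokens-clamped"] = str(forwarded)
        headers["x-cog-max-tokens-original"] = str(requested)
    return forwarded, headers


forwarded, headers = clamp_max_tokens(4096, Decimal("0.10"), Decimal("0.0006"))
print(forwarded, headers)
```

`Decimal` avoids floating-point drift in the division. Note that when the floor of 16 applies, the forwarded value can cost slightly more than the remaining budget, which is the trade-off the minimum implies.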

Hierarchical budget enforcement

Budgets form a chain: run → feature → department → org. Every matching level is checked atomically before a call proceeds. If any level is exhausted, the call is blocked — even if child budgets still have room.

Example: A per-run limit of $0.50 looks fine individually. 50 runs × $0.49 = $24.50 against a $20 department budget. Without hierarchy, every individual run passes while the department budget is blown. With hierarchy, the department ceiling wins.
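
The example above can be worked through in code. This is a single-threaded sketch with hypothetical `Budget` and `reserve_chain` names; it leaves out the atomicity that the real check provides and only shows the all-levels-or-none rule, using integer cents for exact arithmetic.

```python
from typing import List, Optional


class Budget:
    def __init__(self, name: str, limit_cents: int) -> None:
        self.name = name
        self.limit_cents = limit_cents
        self.spent_cents = 0


def reserve_chain(chain: List[Budget], cost_cents: int) -> Optional[str]:
    """Reserve against every level or none.

    Returns the name of the first exhausted level, or None if the call may proceed.
    """
    for budget in chain:
        if budget.spent_cents + cost_cents > budget.limit_cents:
            return budget.name
    # Every level has room, so commit the reservation to all of them.
    for budget in chain:
        budget.spent_cents += cost_cents
    return None


department = Budget("department", 2000)       # $20 department ceiling
allowed = 0
blocked_by = []
for i in range(50):
    run = Budget(f"run-{i}", 50)              # $0.50 per-run limit
    result = reserve_chain([run, department], 49)   # each run costs $0.49
    if result is None:
        allowed += 1
    else:
        blocked_by.append(result)

print(allowed, len(blocked_by))  # 40 10
```

Forty runs fit under the $20 ceiling (40 × $0.49 = $19.60). The remaining ten are blocked at the department level even though each run's own budget still has room.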

See Budget Enforcement for the full hierarchy explanation and scope labeling.

Velocity circuit breaker

An independent limit on tokens per minute (TPM), separate from budget enforcement. Activates automatically on runaway usage spikes — a single agent loop generating tokens 10x faster than the normal baseline is blocked, even if the budget has room.

The circuit breaker uses a sliding 60-second window. If TPM exceeds your key's baseline by a configured multiplier, the key is rate-limited and a Slack/webhook alert fires.

Configure the multiplier in Settings → API Keys → Velocity limit.
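
A sliding 60-second window of this kind can be sketched with a deque of timestamped token counts. The `VelocityBreaker` class, its parameters, and the choice to reject the request that would cross the threshold are assumptions for illustration; the product's actual bookkeeping is not described on this page. Time is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque
from typing import Deque, Tuple


class VelocityBreaker:
    """Hypothetical tokens-per-minute breaker over a sliding window."""

    def __init__(self, baseline_tpm: int, multiplier: float,
                 window_seconds: float = 60.0) -> None:
        self.threshold = baseline_tpm * multiplier
        self.window = window_seconds
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.total = 0

    def allow(self, now: float, tokens: int) -> bool:
        # Drop events that have slid out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_tokens = self.events.popleft()
            self.total -= old_tokens
        # Trip if this request would push the window past the threshold.
        if self.total + tokens > self.threshold:
            return False
        self.events.append((now, tokens))
        self.total += tokens
        return True


breaker = VelocityBreaker(baseline_tpm=1000, multiplier=10)  # trips above 10,000 TPM
first = breaker.allow(now=0.0, tokens=6000)    # within threshold
second = breaker.allow(now=10.0, tokens=5000)  # would reach 11,000, rejected
third = breaker.allow(now=61.0, tokens=5000)   # first event has expired
```

Because the window slides rather than resetting on the minute, a burst that straddles a minute boundary is still counted in full.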

Agentic enforcement — write vs. read

The X-Cost-Workload header controls what happens when a budget limit is reached during an agentic workflow:

| Workload | On budget exceeded | Why |
|---|---|---|
| agentic-write | Hard stop (429), always | Write ops mutate external state. A degraded cheaper model may produce incorrect actions. |
| agentic-read | Graceful degradation: switches to a cheaper model | Read ops are safe to run with lower quality output. |
| (not set) | Inferred from tool names. Tools with create, update, delete → write. Everything else → read. | |
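
The inference rule in the last row can be sketched as below. The function names are hypothetical, and matching by case-insensitive substring is an assumption; the page only says that tools "with create, update, delete" count as writes.

```python
from typing import Iterable, Optional

WRITE_MARKERS = ("create", "update", "delete")


def infer_workload(tool_names: Iterable[str], header: Optional[str] = None) -> str:
    """Resolve the workload from X-Cost-Workload, else from tool names."""
    if header in ("agentic-write", "agentic-read"):
        return header  # an explicit header wins over inference
    for name in tool_names:
        # Assumed matching rule: marker appears anywhere in the tool name.
        if any(marker in name.lower() for marker in WRITE_MARKERS):
            return "agentic-write"
    return "agentic-read"


def on_budget_exceeded(workload: str) -> str:
    """Map the workload to the enforcement action from the table above."""
    return "block" if workload == "agentic-write" else "degrade"


action = on_budget_exceeded(infer_workload(["search_tickets", "update_ticket"]))
print(action)  # block
```

Setting the header explicitly avoids relying on naming conventions, which matters for tools whose names do not reveal that they mutate state.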

Defense in depth — orchestration-layer check

For multi-step agent workflows, add a second protection layer by querying remaining budget before making the next tool call. This lets your agent wrap up gracefully instead of being hard-stopped mid-execution.

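# Before each agent step, check the remaining budget
import os

import httpx

COGNOCIENT_KEY = os.environ["COGNOCIENT_KEY"]  # assumed to be set in the environment


async def check_budget(feature: str, run_id: str) -> bool:
    # AsyncClient provides the awaitable get(); the top-level httpx.get() is synchronous
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "https://api.cognocient.com/api/budgets/status",
            headers={"Authorization": f"Bearer {COGNOCIENT_KEY}"},
            params={"feature": feature, "run_id": run_id},
        )
    resp.raise_for_status()   # surface auth or server errors instead of a KeyError below
    status = resp.json()
    if not status["can_proceed"]:
        return False          # Budget exhausted: wrap up
    if status["remaining_usd"] < 0.05:
        return False          # Less than $0.05 left: wrap up
    return True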

See Budget Enforcement → Defense in depth for the full LangGraph and CrewAI examples.


Next steps: Budget Enforcement (full docs) · Cost Attribution · Security & Privacy
