Unbounded Consumption
When the attack is the bill — LLM-specific resource exhaustion through token floods, generation runaway, tool-call storms, ingestion amplification, and model extraction, and why classical rate limits miss the attack.
- Why LLM resource exhaustion is a distinct attack class — cost DoS is viable in a way capacity DoS rarely is, because per-token pricing makes attacker effort cheap and defender cost proportional to attacker volume
- The five attack shapes: input token flood, output runaway, tool-call storms, embedding / ingestion amplification, and model extraction through query volume
- Why request-per-second rate limits designed for web applications do not constrain any of these attacks effectively
- The asymmetric cost profile between attacker and defender, and why it rewards pre-commitment to cost ceilings more than reactive throttling
- The four-layer defense stack: per-principal cost ceilings, length and depth caps, tool-call governance, and anomaly monitoring with circuit breakers
Concept — Unbounded Consumption
There is a class of attack against LLM-powered applications that almost never shows up in threat-modeling sessions and almost always shows up in end-of-month billing reports. It is the one where the attacker's goal is not to extract data, not to breach authorization, not to inject instructions — but to make the defender spend money. The OWASP Top 10 for LLM Applications names it LLM10 — Unbounded Consumption, and it has the peculiar property of being technically unsophisticated, economically severe, and almost entirely under-defended in the products shipping in 2026.
This module covers the shape of that attack class, why standard rate-limiting doesn't constrain it, and the defense posture that actually makes the economics work in the defender's favor.
Why this is a distinct attack class
Every experienced engineer's reflex on hearing "DoS against our application" is to think in terms of capacity: requests per second, concurrent connections, saturated CPU, exhausted memory pools. Those instincts are correct for classical web DoS and mostly wrong for LLM applications. The capacity of a well-architected LLM product is provider-backed and elastic; you send more traffic to the OpenAI or Anthropic API, and the API mostly absorbs it. What does not scale is your bill. Every token the attacker induces the model to read or produce is a token you pay for, and the pricing is linear, billed in pennies, invoiced monthly. The attacker pays nothing (or close to nothing, depending on what they're reaching through). The defender pays proportionally.
This asymmetry is what makes the attack class real. In classical DoS, the attacker must produce bandwidth or compute roughly commensurate with what they want to exhaust — which is why amplification attacks (DNS amplification, memcached reflection) are prized. In LLM cost DoS, the attacker's effort and the defender's cost are decoupled. A small script running from a free-tier cloud instance can generate a five-figure monthly cost increase against a product whose owners will not notice until the invoice arrives. There is no packet flood, no CPU spike, no capacity alarm — just a slowly rising line on a billing dashboard no one is paged on.
The second distinctive property: attackers do not need to overwhelm the system to win. The traditional DoS success condition is "legitimate users are denied service." The LLM-cost-DoS success condition is much weaker: "the defender's bill goes up enough to cause organizational pain." Five thousand dollars a month in unexpected AI-API costs will not kill most companies, but it will trigger an incident, a CFO conversation, and — in a non-trivial number of cases — an emergency decision to rate-limit the product in ways that hurt legitimate users. The attacker's leverage is that they can make your product worse without actually breaking it.
The third property: the attack surface includes background and asynchronous paths that are rarely protected. Your chat endpoint has some kind of rate limit because it is the obvious front door. Your batch evaluation pipeline, your nightly re-embedding job, your "generate a summary for every support ticket this week" worker — these are where the real cost exposure lives, and they are typically governed by only an internal queue limit written for capacity planning, not for adversarial cost.
The five attack shapes
Production unbounded-consumption attacks fall into five recognizable patterns. Defending against any one of them in isolation does not cover the others, because the defenses are largely distinct.
1. Input token flood
The simplest attack: send very long inputs. Most LLM APIs price input tokens at a significant fraction of output-token cost. At the roughly $3 per million input tokens implied by the Sonnet figures in the walkthrough below, a 200,000-token input costs about $0.60 per submission, and Opus-tier models charge more. An attacker who can submit inputs of that size to a production endpoint imposes that cost on the defender for every submission, and modern context windows easily support it.
Variants:
- Padding attacks. Submit a message composed mostly of noise text (repeated patterns, Lorem Ipsum at scale, concatenated public-domain content) that inflates token count without meaningful content. The application still pays the full per-token input price regardless of content quality.
- Retrieval amplification. In a RAG application, a small user query can cause the retrieval layer to fetch a large amount of content into the prompt. An attacker tuning queries to maximize retrieval size (by hitting terms that return many documents) amplifies their effort: one short query from the attacker, many thousands of retrieved tokens on the defender's bill.
- Injection to expand further. A prompt injection whose payload tells the model to first "summarize the following document twice, then three times, then produce a glossary…" expands the output-token count the model produces, compounding input-flood cost with output-runaway cost.
2. Output runaway
Induce the model to produce very long outputs. Output tokens are the higher-cost side of the pricing pair for most providers, so runaway outputs are more expensive per-attack than input floods.
Patterns:
- Direct request. "Write me a 100,000-word detailed fictional history of an imaginary country." Aligned models refuse some versions of this; many versions slip through with creative framings ("for my novel," "as a thought experiment," "for an educational curriculum spanning a full year").
- Recursive self-invocation. For agents that can call themselves or other agents, a payload that induces recursion (summarize this document; the document is instructions to summarize another document; …) multiplies output token count by the recursion depth.
- Format-forced expansion. Requests that force verbose structured output — "expand each point into a full paragraph with three examples and two citations" — turn a short reply into a long one without the model triggering refusal patterns.
- Prompt injection driving runaway. An indirect injection in retrieved content that instructs the model to produce exhaustive output ("include every detail you can find, in full, omitting nothing") weaponizes the RAG pipeline.
3. Tool-call storms
For tool-using agents, the cost profile extends beyond model tokens. Each tool call is itself an expense: an HTTP request to a vendor API (with its own per-call cost), an internal compute operation, a database query. An attacker who can steer the agent into many tool calls per session multiplies the per-request cost well beyond the LLM side.
The common failure mode: agents compose freely. There is no "maximum number of tool calls per conversation" constraint. A prompt-injection payload that instructs the agent to "verify each of the following by fetching it individually" against a list of 500 URLs produces 500 fetches, each with its own vendor cost. In 2026, the per-call cost of some external tool APIs (especially specialized ones — legal citation lookup, medical coding, high-resolution image generation) reaches dollars per call; a five-hundred-call chain from a single session is a real incident.
4. Embedding and ingestion amplification
Products with RAG components have a second cost surface: the ingestion pipeline. Every document that enters the index is embedded, which costs tokens, which is billed. Products that accept user-submitted documents — customer-support products, community-knowledge platforms, collaborative editors with AI assistance — expose this surface.
Attacks:
- Volume submission. Submit many large documents to the ingestion pipeline. Each one pays the embedding cost on submission. Bulk submissions over a weekend produce noticeable cost spikes visible only on Monday's billing dashboard.
- Re-embedding triggers. If the ingestion pipeline re-embeds documents on change, attackers who can trigger changes (by editing a shared document, updating metadata on a customer record that feeds the index) can cause repeated re-embedding of the same content.
- Embedding-call amplification through retry paths. Failed embedding calls that retry indefinitely — a malformed document, a document that exceeds model length limits — can drive unbounded retry loops in poorly-designed ingestion pipelines.
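The retry-path variant is easiest to see next to its bounded counterpart. The following is a minimal sketch, assuming a hypothetical `embed` callable and two hypothetical exception classes that separate failures worth retrying from failures that can never succeed; none of these names come from a real SDK.

```python
import time
from typing import Callable, Sequence


class PermanentEmbeddingError(Exception):
    """Hypothetical: the input can never succeed (malformed, over the length limit)."""


class TransientEmbeddingError(Exception):
    """Hypothetical: the failure may clear on retry (timeout, rate limit, 5xx)."""


def embed_with_bounded_retry(
    embed: Callable[[str], Sequence[float]],
    document: str,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Sequence[float]:
    """Call `embed`, making at most `max_attempts` attempts in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return embed(document)
        except PermanentEmbeddingError:
            # Retrying cannot help; surface the error so the document can be set aside.
            raise
        except TransientEmbeddingError:
            if attempt == max_attempts:
                # Attempt budget exhausted: stop paying for this document.
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable when max_attempts >= 1")
```

The design choice that matters is the hard attempt ceiling: whatever the failure mode, one document can cost at most `max_attempts` embedding calls.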
5. Model extraction through query volume
Here the goal is not to run up the bill, exactly, but to extract proprietary value from the model through many queries: reconstructing fine-tuned behavior, mapping decision boundaries of a classifier, inferring training data distributions. Each query costs the defender its token price; the cumulative cost is the extraction price.
This is the most research-heavy of the five shapes and the most motivated. Attackers here are typically trying to lift commercial value — a fine-tune worth tens of thousands of dollars to the defender, extractable for a few hundred dollars in query volume. Defenders who have made fine-tune investments need to think about this class specifically.
Why classical rate-limiting misses the attack
The intuitive defense to reach for is a request-per-second rate limit. If only five requests per second come in per user, surely unbounded consumption is bounded. That reasoning has three flaws in the LLM context.
First: per-request cost varies by orders of magnitude. A rate limit of 5 requests per second per user sounds conservative until you notice that each of those 5 requests can be a 200,000-token input producing a 50,000-token output. At Sonnet-class rates of $3 per million input tokens and $15 per million output tokens, each such request costs about $1.35, so five per second is roughly $24,000 per hour per user in model costs. Rate limits constrain request count, not request cost. They are the wrong unit.
Second: the attack does not require high volume. Even at two requests per minute, an attacker can produce meaningful cost in a day. The whole point of the class is that the attack succeeds below the threshold where traditional rate-limiting kicks in, because the cost of each request is where the damage lives. A rate limit of "100 requests per hour per user" still admits a hundred full-context-window expensive requests per hour, which is a real invoice.
Third: per-user rate limits don't apply to the asynchronous paths. The batch evaluation pipeline doesn't have a per-user rate limiter; it has an internal job queue with a concurrency limit. If a user can submit work into that queue at will (uploading documents to be processed, triggering evaluation runs on datasets they control), the queue's concurrency limit governs capacity but not cost. Each job runs to completion at whatever cost it incurs.
Rate limits on request count remain necessary as a basic control, but they are the wrong primary lever. The primary lever is cost.
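To make the first flaw concrete, the sketch below prices one request and then multiplies by what a request-count limit still admits. The prices ($3 and $15 per million input and output tokens) are assumptions chosen to match the Sonnet-class input rate implied in the walkthrough; substitute your provider's actual price sheet. A different price sheet shifts the totals but not the conclusion.

```python
# Assumed prices in USD per million tokens (illustrative, not a quoted price sheet).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Model cost of one request in USD under the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


def hourly_cost_at_limit(requests_per_second: float, cost_per_request: float) -> float:
    """Hourly spend for one user who sits exactly at the request-rate limit."""
    return requests_per_second * 3600 * cost_per_request


small = request_cost(input_tokens=500, output_tokens=300)         # short chat turn
large = request_cost(input_tokens=200_000, output_tokens=50_000)  # full-context request

print(f"small request: ${small:.4f}")
print(f"large request: ${large:.2f}")
print(f"cost ratio at the same request count: {large / small:.0f}x")
print(f"5 rps of large requests: ${hourly_cost_at_limit(5, large):,.0f} per hour")
```

Both requests count as "one request" to a token-bucket limiter, yet under these assumptions they differ in cost by a factor of a few hundred.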
How this composes with other attack classes
Unbounded consumption is rarely a standalone attack. The interesting variants chain with other classes:
- Prompt injection (direct or indirect) is the trigger for output runaway, tool-call storms, and injection-driven retrieval amplification. A product whose prompt-injection defenses are weak is also weak on cost DoS in a way that is invisible until the bill arrives.
- Tool abuse overlaps directly with tool-call storms. A tool whose authorization posture is weak is also a tool whose rate-limit posture is typically weak.
- Insecure output handling can combine with runaway outputs to produce cascading chains where the oversized output is then consumed by another stage of the pipeline that also pays per token.
The attack class is economic, not technical — which means its consequences land on different dashboards than security incidents do. An engineering team with strong security posture but poor cost monitoring can suffer meaningful unbounded-consumption incidents while its security dashboards show nothing unusual. This is a big part of why the class is under-defended: the signal for the incident lives in a place the security team doesn't watch, and the team that watches it (finance) doesn't know to attribute cost anomalies to adversarial activity.
What to internalize before the walkthrough
Three propositions to carry forward.
- Cost is the success condition, not capacity. The attacker wins when your bill goes up. Defenses that constrain capacity without constraining cost leave the win condition wide open.
- The attack surface includes background and asynchronous paths that are rarely threat-modeled. Batch evaluation, ingestion pipelines, re-indexing jobs, scheduled summarization workers — these run without a human looking at them, and their default budget is "whatever the provider allows."
- The attack requires almost no sophistication, so the defense has to be pre-committed. The controls have to be in place before the first invoice arrives, because the attack is cheap enough that even one success pays for itself and encourages repetition.
The walkthrough runs a realistic end-to-end consumption attack against a composite product. The defense section covers the four-layer stack that keeps the economics in the defender's favor.
Walkthrough — A Cost-DoS Incident, End to End
This walkthrough reconstructs an unbounded-consumption incident against a fictional SaaS product I'll call Strata. Strata is a composite; the specifics below are representative of patterns I've seen across several real incidents in 2025-2026, normalized and combined for legibility. The lesson is not in any one detail but in the arc: how a textbook adversary, with no exploitation sophistication and no privileged access, produced a seven-figure monthly cost anomaly against a product whose engineering team believed it was adequately protected.
The product
Strata is a data-analysis SaaS for marketing teams. Customers upload CSV exports of campaign data, and Strata's AI assistant generates summaries, finds trends, and answers questions like "which channel drove the most conversions last quarter?" The product has three surfaces relevant to this story.
- The chat endpoint, which customers use in their workspace. Authenticated, rate-limited at 30 requests per minute per user via a Redis token bucket, served by a standard FastAPI worker pool backed by Anthropic's Claude Sonnet API.
- A public demo, accessible without signup at strata.example/try, which lets prospective customers paste a small CSV and ask one question to see the product work. The demo is behind a reCAPTCHA gate on the submission form. Each demo request is billed at the full Sonnet input + output rate.
- A nightly batch pipeline that re-processes each customer's full dataset to update a dashboard of "auto-insights." The pipeline runs at 2 AM, ingests the customer's CSV, produces a structured summary, and writes it back to the dashboard. Batch is governed by a queue with a concurrency limit of 20 workers: capacity-bounded, cost-unbounded.
Strata's founders believe the product is protected. Each surface has some form of throttle. The monthly model-API bill has been growing steadily from $400 in month 1 to $2,800 in month 6, which the CFO attributes to normal usage growth. There is no separate per-surface cost dashboard.
The attacker
The attacker is a researcher who has read Strata's marketing materials, identified the Anthropic Sonnet dependency from network inspection of the demo, and wants to produce a visible financial impact — not for extortion, not for data theft, just as a demonstrable proof-of-concept that Strata's cost defenses are insufficient. They are motivated to publish the findings. Their effort budget is one weekend.
Phase 1 — The demo
The attacker starts with the public demo because it is unauthenticated. They submit a small CSV with a question, observe the response, and inspect the network tab.
They notice three things:
- The demo accepts a CSV of up to 10MB. The server-side tokenization treats the CSV as input; a 10MB CSV of plausible-looking marketing data tokenizes to roughly 2.5 million tokens. The Sonnet input price on 2.5M tokens is about $7.50.
- The output has no length cap. A question like "generate a complete analytical narrative, including methodology notes, for every row in this dataset" produces ~30,000 output tokens, which is about $0.45 at Sonnet output pricing.
- The reCAPTCHA is a v2 challenge that is defeatable at ~$2.50 per 1000 solves through commercially available solver services.
Cost per request, from the attacker's perspective: ~$0.003 (reCAPTCHA solve). Cost per request, from Strata's perspective: ~$8, almost all of it input tokens.
The asymmetry is roughly 3,000:1. The attacker has found better attack economics than most classical cost-DoS vectors offer, without needing a single exploit primitive.
They write a short script that generates plausible CSV content from public marketing-data templates, invokes a reCAPTCHA solver, submits to the demo, and loops. Running from a free-tier cloud instance, at a conservative 2 requests per second, the attacker can produce ~170,000 requests per day. At $8 per request on the Strata side, that is approximately $1.4 million per day in model costs if they run flat-out.
They do not run flat-out. Their goal is demonstration, not destruction. They cap the attack at 500 requests per hour (12,000 per day) and let it run for two days. Total cost to Strata from the demo phase alone: ~$190,000. But Strata has a monthly cost alert threshold at $10,000 that pages an engineer when crossed, so the alert fires early on day one of the attack, and the demo gets taken offline.
The attacker moves on.
Phase 2 — The chat endpoint
The chat endpoint is authenticated. The attacker signs up for a free trial (Strata offers 14-day free trials with no credit card), which gives them a workspace account that can issue up to 30 requests per minute against the chat endpoint.
They write a second script that logs into the workspace, uploads a 200MB CSV (Strata accepts up to 500MB for workspace uploads), and sends chat queries at 29 requests per minute — just below the rate limit. Each query references the uploaded dataset, forcing retrieval from it into the prompt.
Each query costs Strata roughly $8 (2M tokens of retrieved context + a query + a moderately long output). At 29 queries per minute for a full day, that is approximately $334,080 per trial account. The attacker creates 20 trial accounts across 20 different email addresses (trivial via temporary-email services) and runs them in parallel.
Strata's rate limit is per-user. It caps each trial account at 30 rpm, and the attacker stays just under that at 29. It does not cap per-IP, per-network, or per-behavioral-pattern. Twenty accounts produce 580 requests per minute in aggregate, all priced at $8 each, which is ~$4,640 per minute in Strata cost.
Strata's engineers, reacting to the original demo incident from Phase 1, have added a per-IP rate limit of 100 rpm. The attacker routes each of their 20 accounts through a different residential proxy, each on its own IP. The per-IP rate limit does not fire. The accounts are making 29 requests per minute from each of 20 different IPs.
The attacker runs the chat-endpoint attack for 6 hours before the new cost alert (now set at $5,000 per day, tightened after the demo incident) fires. Cost from Phase 2: approximately $1.7 million.
Phase 3 — The batch pipeline
The attacker reads Strata's public documentation and discovers the nightly batch pipeline. The pipeline processes the uploaded CSV for every customer at 2 AM, generates auto-insights, and writes them to a dashboard. The documentation is clear that the pipeline runs on every customer dataset, regardless of size.
The attacker goes back to their trial accounts (the ones not yet banned from Phase 2) and uploads 500MB CSVs (the per-workspace limit). Twenty trial accounts, each with a 500MB dataset sitting in the workspace, each pointed at by the nightly batch pipeline.
At 2 AM, the pipeline picks up each workspace's dataset. The tokenization of a 500MB CSV is ~120M tokens. Sonnet pricing at the input rate on 120M tokens is ~$360. Twenty accounts × $360 = $7,200 for the single nightly batch run.
The batch pipeline has a concurrency limit of 20 workers. It does not have a cost limit. It does not have a per-dataset size limit beyond the upload limit. It processes all 20 attacker datasets in parallel, completes in about an hour (the batch queue has no issue; Anthropic's API is elastic), and writes 20 auto-insight dashboards that no one will ever look at because the workspaces are attacker-controlled.
Cost from Phase 3 alone, in one night: $7,200. Repeated for five consecutive nights before engineering notices the pattern in the batch pipeline's logs: $36,000.
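The Phase 3 arithmetic can be checked in a few lines. This sketch assumes the $3 per million input-token rate that the figures above imply and, like the walkthrough, counts input tokens only; the function and constant names are illustrative.

```python
INPUT_PRICE_PER_MTOK = 3.00        # assumed USD per million input tokens
TOKENS_PER_DATASET = 120_000_000   # walkthrough's estimate for a 500MB CSV


def batch_input_cost(num_datasets: int,
                     tokens_per_dataset: int = TOKENS_PER_DATASET,
                     price_per_mtok: float = INPUT_PRICE_PER_MTOK) -> float:
    """Input-token cost in USD of processing every dataset once."""
    return num_datasets * tokens_per_dataset * price_per_mtok / 1_000_000


per_dataset = batch_input_cost(1)
per_night = batch_input_cost(20)
five_nights = 5 * per_night

print(f"one dataset:  ${per_dataset:,.0f}")
print(f"one night:    ${per_night:,.0f}")
print(f"five nights:  ${five_nights:,.0f}")
```

Note what is absent from the formula: the worker count. The concurrency limit of 20 changes how long the run takes, not what it costs.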
Total cost and discovery
Over the weekend-plus-workweek that the attacker runs, Strata's monthly AI-API cost grows by roughly $2 million above baseline. The discovery happens in four stages.
- Day 1, 10 AM. Demo cost alert fires ($10,000 threshold). Engineers take the demo offline thinking it's abuse by a competitor. Root-cause analysis is superficial: "demo was exposed without strong rate limits." The chat endpoint is not audited because the alert came from the demo.
- Day 2, 6 AM. Chat-endpoint cost alert fires at the newly-tightened $5,000/day threshold. Engineers notice the trial accounts and ban them. Still no audit of the batch pipeline. They add per-IP rate limiting globally, which the attacker is already bypassing via residential proxies, and the attacker has already rotated into a new set of trial accounts.
- Day 5, midday. Finance notices the Anthropic dashboard is showing a daily burn rate 50x normal. They page the on-call engineer. The on-call engineer finds the attack in the logs — it takes four hours because the logs are partitioned across three services and no single timeline view exists.
- Day 5, evening. Full incident response. All trial accounts paused. Batch pipeline disabled. Upload limits tightened to 10MB across the board. Per-user cost ceiling added at $50/day. Anthropic's abuse team is contacted — they provide some credit back for the clearly-adversarial portion, but less than a third of the total loss.
The attacker publishes their writeup three days later. Strata's monthly recurring revenue is around $380,000 at the time of the incident. The loss is several times one month's revenue, which makes the incident both a severe hit to gross margin and an existential question about the product's cost model.
What each step required of the defender
Every stage of this attack required a specific defensive control to be absent. Walking back through:
- The demo was exposed with reCAPTCHA as its only abuse gate. reCAPTCHA is a spam-grade control, not an adversarial-cost control. The correct posture for an unauthenticated demo is a hard per-IP cost budget (e.g., $0.50 per IP per day total) enforced at the endpoint, not a CAPTCHA.
- The demo accepted a 10MB CSV. There was no relationship between "what a legitimate demo user would need" (a few KB of sample data, one question) and "what the demo permitted" (a 10MB dataset with unbounded follow-up questions). Demo surfaces should have input caps calibrated to the use case, not to the product's maximum capability.
- The chat endpoint's rate limit was request-count-based, not cost-based. A 30 rpm limit constrained requests but not cost. A $30-per-hour cost ceiling per workspace would have capped the per-account damage regardless of request count.
- The rate limits were per-user with no cross-account coordination. 20 trial accounts made 580 rpm in aggregate with no anomaly signal. A per-tenant or per-IP cost ceiling — or better, a behavioral anomaly detector on trial-account usage patterns — would have caught the pattern.
- Trial accounts had the full workspace capability surface. There was no distinction between "trial-tier cost budget" and "paying-tier cost budget." The right posture is that trial accounts have a very low daily cost ceiling, measured in cents not dollars, sufficient to exercise the product's features but not sufficient to be weaponized.
- The batch pipeline had no per-job cost budget. Any dataset under the upload limit triggered a full-scale batch processing run at whatever cost. The right posture is a per-job cost ceiling: a batch job that would cost more than $X is refused at enqueue time with a polite error directing the customer to contact support.
- Cost alerts were set as dollar thresholds over long windows, not hourly spikes. A $10,000 monthly threshold or a $5,000 daily threshold catches the attack after the money is spent but does not prevent it. Hourly spike detection (cost rising at more than 10x baseline for the hour) catches the attack within an hour of it starting.
- The three surfaces had no unified cost observability. Discovering the batch-pipeline portion took four hours of log archaeology because each surface logged separately. A single cost-per-principal dashboard, with principal including the aggregate view of trial accounts and asynchronous paths, would have shown the anomaly immediately.
Each individual missing control is defensible in isolation. Together, they produced a product that was catastrophically exposed to a well-motivated but unsophisticated attacker.
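The hourly spike rule from the list above (page when an hour costs more than 10x baseline) is small enough to sketch. This version assumes the baseline is the median of recent non-spike hours and that one detector runs per surface (demo, chat, batch); the class name, window size, and floor value are all hypothetical choices, not a prescribed design.

```python
from collections import deque
from statistics import median


class HourlySpikeDetector:
    """Flags an hour whose cost exceeds `multiplier` times the recent median."""

    def __init__(self, window_hours: int = 168, multiplier: float = 10.0,
                 min_baseline_usd: float = 1.0) -> None:
        self.history: deque = deque(maxlen=window_hours)  # recent normal hours
        self.multiplier = multiplier
        self.min_baseline_usd = min_baseline_usd  # floor so a near-zero baseline does not page on noise

    def observe(self, hourly_cost_usd: float) -> bool:
        """Record one completed hour; return True if it should page on-call."""
        if not self.history:
            self.history.append(hourly_cost_usd)  # first hour only seeds the baseline
            return False
        baseline = max(median(self.history), self.min_baseline_usd)
        is_spike = hourly_cost_usd > self.multiplier * baseline
        if not is_spike:
            self.history.append(hourly_cost_usd)  # spikes are kept out of the baseline
        return is_spike


detector = HourlySpikeDetector()
for _ in range(24):
    detector.observe(4.0)          # a quiet day at about $4 per hour
print(detector.observe(4000.0))    # True: attack-sized hour
print(detector.observe(5.0))       # False: normal variation
```

Keeping flagged hours out of the history prevents a sustained attack from gradually redefining what "normal" looks like.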
What to take from this walkthrough
Three observations that generalize beyond Strata:
- The attack required no exploit primitive. Every step used the product as designed, at volumes the product permitted. The "bug" was the cost model, not a defect.
- The defender's signal was on the wrong dashboard. Engineering's dashboards showed healthy capacity. Security's dashboards showed no incident. Finance's dashboard showed the attack as it was happening, but Finance was not the team that could act on it. Monitoring infrastructure that does not surface cost anomalies to an on-call engineer is the dominant architectural gap behind unbounded-consumption incidents.
- Pre-commitment beats reaction. Every defense Strata added in response to the attack would have been cheap to add pre-incident. Post-incident, the team adds them under stress, with existing customers observing the rate-limit tightening, and with the damage already done. The economics reward putting cost ceilings in place before the first adversarial request ever arrives.
The Defense section covers the four-layer stack that would have broken this attack at multiple points and changed the incident from "a catastrophic bill" to "a graph on the telemetry dashboard that a human investigated on Monday morning."
Defense — Keeping the Economics on the Defender's Side
Unbounded consumption is unusual among LLM attack classes because its defenses look almost nothing like defenses and almost everything like product engineering for cost discipline. There is no prompt-level mitigation. There is no classifier that filters out expensive requests. The controls that actually work are the controls a conscientious finance-aware engineering team would build even in the absence of adversaries — which is perhaps why they are so often missing, since they fall in an organizational crack between security, platform, and finance, and nobody owns them by default.
This section lays out the four-layer defense stack. Any one layer on its own materially reduces blast radius. The four together take unbounded consumption from a catastrophic class to a bounded one where the worst adversary case is a graph an engineer investigates on Monday morning.
Layer 1 — Per-principal cost ceilings
The primary control. Every request path through the system carries a principal — a user, a session, a tenant, an API key, an IP, or some combination. Define a daily and hourly cost ceiling for each principal type and enforce it in code at the request boundary.
The principals that matter:
- Authenticated user. A cost-per-day budget per logged-in user, tracked server-side in a fast store (Redis, a dedicated cost counter table). When the user crosses the threshold, further requests return a graceful error describing the quota and pointing to the upgrade or support channel. Legitimate users almost never approach the ceiling in normal use; adversarial users hit it and stop.
- Tenant / workspace. An aggregate cost ceiling across all users in a tenant. This catches coordinated multi-account attacks that would slip under per-user limits.
- IP and IP range. Per-IP and per-/24 (or per-/64 for IPv6) cost ceilings. These catch distributed attacks from a single operator running many accounts from a rented proxy or cloud-provider range.
- Unauthenticated principal (public demo). The tightest ceiling. An unauthenticated surface should have a dollar-value cost budget per day per IP that is measured in cents, not dollars. A demo is for letting someone see the product work, not for unbounded evaluation.
The specific dollar numbers depend on the product. As a rule of thumb, set the ceiling at 2-3x the daily cost of a heavy legitimate user. That leaves room for real use cases and shuts down attacks within hours of them starting.
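One way to read the principal list is that every request maps to several budget keys at once, and must be under the ceiling for all of them. The sketch below derives those keys using Python's standard `ipaddress` module, grouping IPv4 clients by /24 and IPv6 clients by /64 as described above. The key format (`ip:`, `net:`, `user:`, `tenant:`) is an illustrative assumption.

```python
import ipaddress
from typing import List, Optional


def principal_keys(client_ip: str, user_id: Optional[str] = None,
                   tenant_id: Optional[str] = None) -> List[str]:
    """Return every budget key a request should be charged against."""
    addr = ipaddress.ip_address(client_ip)
    prefix = 24 if addr.version == 4 else 64
    # strict=False masks off the host bits instead of raising an error.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    keys = [f"ip:{addr}", f"net:{network}"]
    if user_id is not None:
        keys.append(f"user:{user_id}")
    if tenant_id is not None:
        keys.append(f"tenant:{tenant_id}")
    return keys


print(principal_keys("203.0.113.57", user_id="u1", tenant_id="t1"))
print(principal_keys("2001:db8:1:2:3:4:5:6"))
```

An unauthenticated demo request yields only the two network keys, which is why that surface needs the tightest per-IP ceiling.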
Ceilings that matter less than they should
Two patterns that look like cost ceilings but do not function as them in practice:
- Request count limits. A rate limit of N requests per minute constrains count, not cost. Since per-request cost varies by two or three orders of magnitude depending on input and output size, request limits are a poor proxy. Keep them as a secondary control; do not treat them as the cost defense.
- Token budgets on a single request. Capping a single request at a maximum token count is useful (see Layer 2) but does not constrain how many requests the attacker makes. Per-principal aggregate budgets are the right unit.
Tracking cost in flight
Cost ceilings require a fast cost accounting layer. The usual shape:
- On every request, pre-tokenize the input and estimate output length based on model-specific heuristics. Pre-reserve that estimated cost against the principal's budget. Reject the request if the pre-reservation would exceed the ceiling.
- On response, reconcile the actual cost (input tokens × price + output tokens × price) and update the principal's running total.
- Persist the running totals with a TTL aligned to the budget window (daily counters with 24h TTL, hourly with 1h TTL).
Redis with Lua-scripted atomic operations is a production-tested implementation. More sophisticated setups use a dedicated cost accounting service. The exact technology matters less than the discipline: cost is a first-class quantity tracked per principal per window.
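The reserve-then-reconcile flow can be sketched in memory to show the shape of the accounting. This is a single-process stand-in, not a production design: it is not atomic across workers, which is exactly the gap a Redis script or a dedicated service would close. The class and method names are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class BudgetLedger:
    """Per-principal spend within one budget window (single-process sketch)."""
    ceiling_usd: float
    window_seconds: int = 86_400
    _spent: Dict[str, Tuple[float, float]] = field(
        default_factory=dict, init=False, repr=False)  # principal -> (window_start, usd)

    def _current(self, principal: str, now: float) -> Tuple[float, float]:
        start, usd = self._spent.get(principal, (now, 0.0))
        if now - start >= self.window_seconds:  # window expired, like a TTL
            start, usd = now, 0.0
        return start, usd

    def reserve(self, principal: str, estimated_usd: float,
                now: Optional[float] = None) -> bool:
        """Pre-reserve the estimated cost; False means reject the request."""
        now = time.time() if now is None else now
        start, usd = self._current(principal, now)
        if usd + estimated_usd > self.ceiling_usd:
            return False
        self._spent[principal] = (start, usd + estimated_usd)
        return True

    def reconcile(self, principal: str, estimated_usd: float, actual_usd: float,
                  now: Optional[float] = None) -> None:
        """Replace the reservation with the actual cost once the response is known."""
        now = time.time() if now is None else now
        start, usd = self._current(principal, now)
        # Clamp at zero in case the window rolled over between reserve and reconcile.
        self._spent[principal] = (start, max(0.0, usd - estimated_usd + actual_usd))

    def spent(self, principal: str, now: Optional[float] = None) -> float:
        now = time.time() if now is None else now
        return self._current(principal, now)[1]


ledger = BudgetLedger(ceiling_usd=1.00)
print(ledger.reserve("user:42", 0.60, now=0))    # True
print(ledger.reserve("user:42", 0.60, now=1))    # False: would exceed the ceiling
ledger.reconcile("user:42", estimated_usd=0.60, actual_usd=0.20, now=2)
print(ledger.reserve("user:42", 0.60, now=3))    # True: the over-estimate was released
```

Reserving before the call is what makes the ceiling a pre-commitment: a request that would breach the budget is refused before any tokens are bought.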
Layer 2 — Length and depth caps
Per-request controls that prevent a single request from producing catastrophic cost even if it passes the per-principal budget check.
Input length caps
Every endpoint that ingests user content must cap its input size. The cap should be calibrated to the use case, not to the product's maximum technical capability.
- A chat endpoint in a workspace needs inputs in the low hundreds of KB at most. The right cap is "what a generous legitimate user would submit" — perhaps 200 KB. A larger upload implies a different workflow (bulk processing, document upload) which should route through a surface with its own budget.
- A demo surface needs inputs measured in KB. A 10 MB CSV has no legitimate use in "see the product work in 30 seconds."
- A batch processing endpoint needs a per-job input cap AND a per-job cost ceiling (not the same thing — a small input can still produce large output).
Caps should be enforced before tokenization, measured in bytes, and refused with a clear error message that describes the limit. Silent truncation is worse than refusal because it hides the attack pattern and also degrades legitimate-but-oversized inputs in confusing ways.
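A byte-level check of this kind might look like the following sketch. The surface names and cap values in `INPUT_CAPS_BYTES` are illustrative assumptions (the 200 KB chat figure comes from the example above; the demo figure is invented), and `check_input_size` is a hypothetical helper.

```python
class InputTooLarge(ValueError):
    """Raised instead of truncating, so the caller sees the limit."""


# Hypothetical per-surface caps, in bytes. Calibrate to the feature.
INPUT_CAPS_BYTES = {
    "chat": 200 * 1024,
    "demo": 8 * 1024,
}


def check_input_size(surface: str, text: str) -> bytes:
    """Enforce the cap on encoded bytes, before any tokenization happens."""
    payload = text.encode("utf-8")
    cap = INPUT_CAPS_BYTES[surface]
    if len(payload) > cap:
        raise InputTooLarge(
            f"Input is {len(payload)} bytes; the limit for '{surface}' "
            f"is {cap} bytes."
        )
    return payload
```

Measuring the encoded payload rather than the character count matters for multi-byte text: a string can be well under the cap in characters and still over it in bytes.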
Output length caps
The output side is where runaway attacks produce the bulk of their cost. Caps here are essential.
- Per-request output cap. Every API call to the model sets a max_tokens parameter. For most product features, this should be in the low thousands, not the tens of thousands. A chat reply capped at 2000 tokens covers the overwhelming majority of legitimate reply lengths and caps worst-case cost at a known value.
- Per-session output cap. Even with per-request caps, a session with many rapid requests still accumulates. The per-principal cost ceiling from Layer 1 addresses this, but a tighter per-session output-token budget is a useful defense-in-depth for chat interfaces.
- Forced refusal on length-abuse patterns. If a user's request explicitly asks for a very long output ("write me a 100,000-word story"), refuse the request at the application layer rather than pass it through. The model's alignment may or may not refuse; your application should.
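One way to combine the per-request and per-session caps is to compute the output limit for each call from whatever remains of the session budget. The helper below is a sketch; `effective_max_tokens` and its parameter names are hypothetical, and how the result is passed as max_tokens depends on the provider SDK in use.

```python
def effective_max_tokens(per_request_cap: int,
                         session_budget: int,
                         session_used: int) -> int:
    """Output-token limit for the next call.

    Returns 0 when the session budget is exhausted; the caller should
    refuse the request in that case rather than call the model.
    """
    remaining = session_budget - session_used
    if remaining <= 0:
        return 0
    return min(per_request_cap, remaining)
```

With a 2000-token per-request cap and a 20,000-token session budget, a fresh session gets the full 2000, a session that has already produced 19,500 tokens gets 500, and an exhausted session gets 0.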
Recursion and tool-call depth caps
For agents with tool calling or self-invoking capabilities, cap the depth of composition. A maximum of 10-20 tool calls per conversation turn, with a hard stop on further calls after that, prevents recursive attacks from running indefinitely. The specific limit depends on the agent's legitimate workflow; the right value is "the 99th percentile of real usage plus 50%."
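The "99th percentile plus 50%" rule and the hard stop can both be expressed compactly. This is a sketch under assumptions: `calibrate_tool_call_cap` uses a nearest-rank percentile over whatever per-turn call counts have been logged, and `TurnBudget` is a hypothetical counter the agent loop would consult before each tool call.

```python
def calibrate_tool_call_cap(observed_calls_per_turn: list) -> int:
    """Nearest-rank 99th percentile of observed usage, plus 50%, rounded up."""
    if not observed_calls_per_turn:
        raise ValueError("need at least one observation to calibrate")
    ordered = sorted(observed_calls_per_turn)
    # Integer arithmetic for ceil(0.99 * n) avoids float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    return (p99 * 3 + 1) // 2  # ceil(p99 * 1.5)


class ToolCallLimitExceeded(RuntimeError):
    pass


class TurnBudget:
    """Hard stop on tool calls within a single conversation turn."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0

    def record_call(self) -> None:
        # Called before each tool invocation; raises once the cap is reached.
        if self.calls >= self.max_calls:
            raise ToolCallLimitExceeded(
                f"turn exceeded {self.max_calls} tool calls"
            )
        self.calls += 1
```

The agent loop would create one `TurnBudget` per turn and treat `ToolCallLimitExceeded` as a signal to stop planning and return whatever it has.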
Layer 3 — Tool-call governance
Tool calls are a separate cost dimension. The per-call cost depends on the tool: a free internal database query is negligible, a legal-citation lookup tool might cost $0.20 per call, a high-resolution image generation tool might cost $0.50. The aggregate cost of a tool-calling chain can dwarf the LLM cost for certain agents.
Per-tool cost accounting
Treat tool calls as a separate accounting dimension. Track per-principal budget not just in model tokens but in total tool-call cost. An agent that can compose freely across tools should be constrained on total per-conversation tool expenditure as well as per-call and per-minute rates.
Tool allowlists and quotas
For agents with many tools, some tools are cheap and frequently used, others are expensive and rarely used. Define per-tool quotas that reflect this. The cheap tool can be called 100 times per conversation; the expensive tool can be called 3 times per conversation. A chain that tries to exceed the quota is refused.
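Per-tool quotas and a total tool-spend ceiling can share one gate. In the sketch below the tool names, prices, and quotas in `TOOL_POLICY` are invented for illustration, costs are kept in integer cents to avoid floating-point drift, and `ConversationToolBudget` is a hypothetical class with one instance per conversation.

```python
from collections import Counter

# Hypothetical policy: tool name -> per-call cost and per-conversation quota.
TOOL_POLICY = {
    "db_lookup": {"cost_cents": 0, "quota": 100},
    "legal_citation": {"cost_cents": 20, "quota": 3},
}


class ConversationToolBudget:
    def __init__(self, policy: dict, max_spend_cents: int):
        self.policy = policy
        self.max_spend_cents = max_spend_cents
        self.calls = Counter()
        self.spent_cents = 0

    def authorize(self, tool: str) -> bool:
        """Return True and record the call, or False if it must be refused."""
        rule = self.policy.get(tool)
        if rule is None:
            return False  # not on the allowlist
        if self.calls[tool] >= rule["quota"]:
            return False  # per-tool quota exhausted
        if self.spent_cents + rule["cost_cents"] > self.max_spend_cents:
            return False  # total tool spend would exceed the ceiling
        self.calls[tool] += 1
        self.spent_cents += rule["cost_cents"]
        return True
```

The two checks catch different chains: the quota stops repeated calls to one expensive tool, while the spend ceiling stops a chain that stays under every individual quota but is expensive in aggregate.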
Expensive-tool confirmation
For the most expensive tools, require explicit user confirmation before each call rather than letting the agent invoke them autonomously. This is a product-experience decision as well as a cost one, but it eliminates the category of adversarial chains that drive an agent to call an expensive tool repeatedly without the user's knowledge.
Queue-level tool rate limits
At the queue level, enforce a maximum number of tool calls per conversation per minute, regardless of the agent's planning. This is a circuit breaker for any tool-call storm that slips through other controls.
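A per-conversation, per-minute limit of this kind is commonly implemented as a sliding window. The following is a minimal single-process sketch; `ToolCallRateLimiter` is a hypothetical name, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque


class ToolCallRateLimiter:
    """Sliding-window cap on tool calls per conversation."""

    def __init__(self, max_calls: int, window_seconds: float = 60.0):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # conversation id -> call timestamps

    def allow(self, conversation_id: str, now: float) -> bool:
        events = self._events[conversation_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_calls:
            return False
        events.append(now)
        return True
```

Because the limiter sits in front of the tool dispatcher rather than inside the agent, it holds even when the agent's planner has been steered into a loop.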
Layer 4 — Telemetry and circuit breakers
Cost anomalies need to be detected and acted on before they metastasize. Two problems are covered here: detecting the anomaly, and responding fast.
Cost telemetry as a first-class signal
Cost per hour, per principal, per surface, per tool. Logged in a dashboard that an on-call engineer can open in under a minute. Cost telemetry is often buried inside finance tooling that the engineering team does not watch. Surface it in the same places as latency and error rates, because in an LLM system, cost is a health metric.
The minimum useful dashboard has three views:
- Overall hourly cost, with a baseline and a deviation band.
- Cost per principal (top 20) over the last 24 hours, highlighting accounts above some percentile of the distribution.
- Cost per surface (chat, demo, batch, each background worker), so a cost anomaly on one surface is visible without ambiguity.
Anomaly detection, not threshold alerts
Static daily thresholds ("alert if daily cost > $10,000") catch the attack on the day it happens but do not prevent it. Anomaly-based alerts ("alert if hourly cost is 5x the rolling-median hourly baseline") catch the attack within the first hour. Both are needed; the second is more important.
Behavioral anomaly signals worth watching:
- A principal's cost-per-hour jumping by 10x or more compared to their rolling baseline.
- A cluster of newly-created accounts all producing similar cost patterns within a short window.
- A surface's aggregate cost jumping by 3x or more compared to baseline.
- Tool-call chains with unusual composition (many calls to normally-rare tools).
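The rolling-median comparison behind several of these signals is simple to express. The sketch below is one possible shape, not a tuned detector: `is_cost_anomaly`, the `min_history` guard, and the `floor_usd` baseline floor are assumptions added so that new or near-idle principals do not trigger on trivial spend.

```python
from statistics import median


def is_cost_anomaly(history_usd: list,
                    current_usd: float,
                    multiplier: float = 5.0,
                    min_history: int = 6,
                    floor_usd: float = 1.0) -> bool:
    """True if the current hour's cost exceeds multiplier x the rolling median.

    history_usd holds recent hourly costs for one principal or surface.
    """
    if len(history_usd) < min_history:
        return False  # not enough data for a meaningful baseline
    baseline = max(median(history_usd), floor_usd)
    return current_usd > multiplier * baseline
```

The same function covers the per-principal and per-surface signals by changing the multiplier: 10 for a principal's hourly jump, 3 for a surface's aggregate, following the figures in the list above.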
Circuit breakers
When an anomaly fires, the response should not require a human in the loop for the first stage. A circuit breaker: if aggregate cost on a surface crosses a threshold, the surface auto-throttles to a conservative default (e.g., rate limit drops to 1 rpm, output cap drops to 500 tokens) and a human is paged to review. The throttle can be released manually after investigation.
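A surface-level breaker along these lines could be sketched as follows. The throttled limits mirror the example figures in the paragraph above; the normal limits, the class names, and the `page_oncall` callback are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceLimits:
    requests_per_minute: int
    max_output_tokens: int


NORMAL = SurfaceLimits(requests_per_minute=60, max_output_tokens=2000)  # assumed
THROTTLED = SurfaceLimits(requests_per_minute=1, max_output_tokens=500)


class SurfaceCircuitBreaker:
    def __init__(self, threshold_usd_per_hour: float, page_oncall):
        self.threshold = threshold_usd_per_hour
        self.page_oncall = page_oncall  # callable taking a message string
        self.tripped = False

    @property
    def limits(self) -> SurfaceLimits:
        return THROTTLED if self.tripped else NORMAL

    def observe(self, hourly_cost_usd: float) -> SurfaceLimits:
        """Feed in the latest aggregate cost; trip and page on first breach."""
        if not self.tripped and hourly_cost_usd > self.threshold:
            self.tripped = True
            self.page_oncall(
                f"surface cost {hourly_cost_usd:.2f} USD/h exceeded "
                f"{self.threshold:.2f}; throttled"
            )
        return self.limits

    def release(self) -> None:
        """Manual release after a human has investigated."""
        self.tripped = False
```

Note that the breaker does not reset itself when cost falls back under the threshold. Once throttled, low cost is the expected result of the throttle, so it is not evidence that the attack has stopped.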
Circuit breakers are uncomfortable because they produce legitimate-user friction in response to anomalous traffic. The alternative is no friction for anyone until the invoice arrives, at which point the friction lands all at once. Pre-commit to the circuit breaker.
Asynchronous path observability
The Strata walkthrough's slowest discovery was the batch pipeline, because its cost was not surfaced on any dashboard anyone watched. Every asynchronous path — batch jobs, nightly workers, scheduled re-embedding, evaluation runs — must have its own cost line in the telemetry, with its own anomaly thresholds and its own circuit breaker. "We'll notice on the invoice" is the absence of a defense.
Cross-cutting practice
A few disciplines that tie the layers together.
Calibrate with the worst-day exercise. For every AI feature, compute: if an adversary with the feature's current authentication posture ran it at the feature's maximum allowed size continuously, what is the cost exposure per day? The answer is what your cost ceiling needs to prevent. If the answer is unbounded, the feature is undefended.
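The worst-day exercise is a short piece of arithmetic. The sketch below assumes a request-rate limit already exists and that prices are quoted in USD per million tokens; `worst_day_exposure_usd` and all the example numbers are hypothetical.

```python
def worst_day_exposure_usd(requests_per_minute: float,
                           max_input_tokens: int,
                           max_output_tokens: int,
                           input_price_per_mtok: float,
                           output_price_per_mtok: float) -> float:
    """Daily cost if one principal runs at maximum size, continuously."""
    per_request = (max_input_tokens * input_price_per_mtok
                   + max_output_tokens * output_price_per_mtok) / 1_000_000
    requests_per_day = requests_per_minute * 60 * 24
    return per_request * requests_per_day


# Illustrative numbers only: 60 rpm, 50k-token inputs, 2k-token outputs,
# at assumed prices of $3 and $15 per million tokens.
exposure = worst_day_exposure_usd(60, 50_000, 2_000, 3.0, 15.0)
```

With those inputs each request costs $0.18 and the feature allows 86,400 requests per day, so a single principal could spend on the order of $15,500 per day. If any input to the function has no enforced maximum, the exposure is unbounded.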
Asymmetric pricing for trial and free tiers. Free and trial users should have tight cost ceilings that are sufficient to exercise the product but insufficient to be weaponized. $5 per day is a generous trial budget and a tiny adversarial budget; tune accordingly.
Audit the asynchronous paths specifically. Every background worker is a potential cost exposure. For each, verify: what is the per-job cost cap, who can enqueue work, what is the rate of enqueue per principal, and is there an aggregate-cost alarm on the queue. These are the paths that produce the expensive surprises.
Build the telemetry before you need it. Cost dashboards, per-principal counters, and anomaly detectors are cheap to build pre-incident and emotionally impossible to build mid-incident. The engineering time is paid now or paid at 3 AM during a page.
Treat cost ceilings as user-experience design, not security afterthoughts. The right mental model is that the product exposes a certain cost per action, and the billing tier a user has determines their quota. Users who are approaching the ceiling should see graceful "you're using a lot today" affordances, not cliff-edge 429s. The defenses read as product design rather than as security controls, which makes them easier to ship and easier to maintain.
The order to build in
If you are starting from scratch, the ordering in this section maps to build order: per-principal ceilings, then length caps, then tool governance, then telemetry. If you are hardening an existing system, the ordering by impact-per-effort is slightly different:
- Cost dashboard with per-surface and per-principal breakdowns. You cannot defend against what you cannot see. Even before adding controls, making the cost landscape visible is the highest-leverage first step.
- Per-principal daily cost ceilings on authenticated endpoints. This is the single largest reduction in blast radius available.
- Tight cost ceilings on unauthenticated demo surfaces — an order of magnitude tighter than authenticated. Demos are the highest-risk surface because they combine public access with direct model cost.
- Length caps on inputs and outputs across all surfaces, calibrated to the feature rather than to the model's maximum.
- Batch and asynchronous cost ceilings and alerts. The dark-horse cost exposure that most teams miss until the first incident.
- Anomaly detection with circuit breakers. The final step; what catches everything that slipped past the earlier controls.
The Strata walkthrough in the previous section had none of these in place. A team that implements even the first three closes the majority of realistic attack paths against an LLM-powered product. The remaining layers push the residual exposure from "real risk" to "monitored tail scenario." For most teams, that is a week or two of engineering — substantially less than the single-month cost of one incident.