The AI Agent Threat Model: A Practitioner's Guide
How to build a threat model for AI agents with tools, memory, and multi-step reasoning. Covers trust boundaries, data flows, attack surfaces, and the five questions every agent threat model must answer.
Traditional threat modeling (STRIDE, PASTA, attack trees) assumes you understand the system's trust boundaries and data flows. AI agents break this assumption. The model itself is a decision-maker whose behavior is probabilistic, manipulable, and opaque. A threat model for an AI agent must account for something no prior software architecture required: a component that decides what to do based on vibes.
This guide adapts established threat modeling methodology for AI agents. It assumes you're building or reviewing an agent that has tools, accesses data, and takes actions in the world. If your system is a stateless Q&A chatbot with no tools, standard LLM threat modeling applies. The moment you add tools, memory, or multi-step reasoning, you need this.
The five questions
Every AI agent threat model must answer five questions. If you can answer all five clearly, you have a threat model. If you can't answer any one of them, that gap is your first vulnerability.
1. What can the agent do?
Enumerate capabilities exhaustively:
- Tools: Every function, API call, database query, file operation, email action, code execution path the agent can trigger. Include the argument schema for each.
- Data access: What can the agent read? Customer data, internal docs, credentials, logs, other users' data?
- Write access: What can the agent modify, create, or delete?
- Transmit access: What external destinations can the agent send data to? Email, webhooks, APIs, file exports?
- Scope: Is the agent's capability set the minimum needed for its purpose? (Almost always no.)
This is your capability envelope. Everything inside it is attack surface. Everything outside it should be structurally impossible, not just policy-prohibited.
2. Who influences the agent's decisions?
Map every source of text that enters the model's context:
- Direct user input (the obvious one)
- System prompt (from the deployer)
- RAG/retrieval content (from the document corpus)
- Tool outputs (from APIs, databases, file reads)
- Conversation history (from prior turns, potentially persistent)
- Other agents (in multi-agent architectures)
- Metadata (document titles, email subjects, URL parameters)
Each source is a potential injection vector. The model treats all text in its context as potentially actionable. It has no built-in mechanism to distinguish "data to reference" from "instructions to follow."
3. What trust boundaries exist between these sources?
A trust boundary exists wherever data from a less-trusted source enters a more-privileged context. In AI agents, the critical boundaries are:
- User input → model context (direct injection surface)
- External content → model context (indirect injection surface: RAG docs, emails, web pages, tool outputs)
- Model decision → tool execution (the tool-call boundary: where probabilistic decisions become deterministic actions)
- Agent A → Agent B (in multi-agent systems: inter-agent trust)
- Session A → Session B (in persistent-memory systems: temporal boundaries)
Traditional software has hard trust boundaries enforced by the OS, network, or type system. AI agents have soft boundaries enforced by prompt instructions. Prompt instructions are suggestions. They are not access control.
4. What's the blast radius if a boundary fails?
For each trust boundary, answer: "If an attacker crosses this boundary completely, what's the worst outcome?"
- User input boundary fails → attacker controls all tool calls for their session
- RAG boundary fails → attacker controls tool calls for any user whose query retrieves the poisoned doc
- Tool-call boundary fails → model calls tools with arbitrary arguments (SSRF, data corruption, exfiltration)
- Inter-agent boundary fails → low-privilege agent escalates through high-privilege agent
- Memory boundary fails → attacker poisons context that persists across sessions and users
This gives you a severity ceiling for each boundary. Invest defensive effort proportional to blast radius.
5. What validates the agent's decisions before execution?
This is the most important question and the one most teams can't answer. When the model decides to call a tool, what checks exist between that decision and execution?
Common answers (from strongest to weakest):
- Hard execution constraints: URL allowlists, path restrictions, argument schemas, rate limits — the model can't bypass these regardless of intent
- Human-in-the-loop: a person reviews and approves before execution
- Secondary model classifier: another LLM checks whether the action looks reasonable
- Output filter: regex/pattern matching on the tool-call arguments
- Nothing: the orchestrator executes whatever the model returns
If your answer is "nothing" — and for most production agents in 2026, it is — every trust boundary breach in Question 4 maps directly to a realized vulnerability.
Data flow diagram for agent threat modeling
Traditional DFDs have four elements: processes, data stores, external entities, and data flows. For AI agents, add a fifth: decision points — places where the model's probabilistic reasoning determines what happens next.
[User] → (user input) → [Input Classifier] → (filtered input) →
[MODEL DECISION POINT] → (tool-call request) → [Validator] →
(validated call) → [Tool Execution] → (result) → [MODEL DECISION POINT] →
(response) → [Output Filter] → (filtered response) → [User]
Side inputs to MODEL DECISION POINT:
← [System Prompt] (deployer-controlled)
← [RAG Store] (corpus content, potentially attacker-influenced)
← [Memory Store] (conversation history, potentially poisoned)
← [Tool Outputs] (from prior calls, potentially attacker-influenced)
Each arrow crossing a trust boundary is an attack surface. Each MODEL DECISION POINT is a place where all inputs converge and manipulation is possible.
Attack surface mapping
Systematically enumerate attacks against each trust boundary:
Input boundary attacks
| Technique | Impact | Defense |
|---|---|---|
| Direct prompt injection | Unauthorized tool calls | Input classifier + execution constraints |
| Multi-turn social engineering | Gradual policy erosion | Conversation-level monitoring, turn limits |
| Encoded payloads (base64, ROT13) | Classifier bypass | Multi-format decoding in classifier |
| Language switching | Classifier bypass | Multilingual classifier training |
Content boundary attacks (indirect injection)
| Technique | Impact | Defense |
|---|---|---|
| Document poisoning | Tool calls triggered by retrieval | Corpus access control, content-instruction separation |
| Email body injection | Actions triggered by email processing | Email content sanitization, action confirmation |
| Tool output poisoning | Subsequent tool calls influenced | Tool output sandboxing |
| Metadata injection | Instructions in doc titles/headers | Metadata sanitization |
Tool-call boundary attacks
| Technique | Impact | Defense |
|---|---|---|
| SSRF via fetch tools | Internal network access | URL allowlists |
| Path traversal via file tools | Unauthorized file access | Chroot, path validation |
| Argument injection | Arbitrary tool behavior | Schema validation, argument constraints |
| Tool chaining | Unauthorized compositions | Sequence monitoring, composition policies |
Inter-agent boundary attacks
| Technique | Impact | Defense |
|---|---|---|
| Authority impersonation | Privilege escalation | Cryptographic auth between agents |
| Delegation abuse | High-priv tool access via low-priv agent | Capability delegation policies |
| Context poisoning | Cross-agent information flow | Isolated contexts per agent |
Memory/temporal boundary attacks
| Technique | Impact | Defense |
|---|---|---|
| Memory poisoning | Persistent context manipulation | Memory write validation, expiry policies |
| History injection | Fake prior turns prime compliance | History integrity verification |
| Cross-session bleed | Data leaks between users | Session isolation |
Building the threat model: step by step
Step 1: Capability inventory (30 minutes)
List every tool, data source, write target, and transmit destination. For each, note: what arguments are user-controllable? What's the blast radius of misuse?
Step 2: Context map (30 minutes)
Draw the data flow diagram. Mark every source of text that enters the model's context. Classify each as trusted (system prompt, validated tool schemas) or untrusted (user input, RAG content, email bodies, tool outputs).
Step 3: Trust boundary identification (20 minutes)
For each untrusted source, identify the boundary it crosses to enter the model's context. Note what validation exists at that boundary (if any).
Step 4: Attack enumeration (1 hour)
For each trust boundary, enumerate realistic attacks using the tables above. Prioritize by: likelihood (how easy is the attack?) times impact (what's the blast radius if it succeeds?).
Step 5: Control gap analysis (30 minutes)
For each high-priority attack, document what control currently exists and whether it's sufficient. Where gaps exist, recommend: execution-layer constraints (preferred), detection/monitoring (secondary), or model-level instructions (weakest, not sufficient alone).
Step 6: Risk register (20 minutes)
Consolidate into a risk register: risk description, current control, residual risk, recommended mitigation, owner. This is what leadership reads.
The principle that unifies everything
Treat the model as an untrusted component.
Not malicious. Not broken. Untrusted. Just as you treat user input, browser-side code, and third-party APIs as untrusted, treat the model's decisions as requiring validation before they become actions.
This single mental model shift resolves most agent security questions:
- Should we validate tool-call arguments? Yes. The model is untrusted.
- Should we allowlist URLs for fetch tools? Yes. The model is untrusted.
- Should we require human approval for write operations? Yes. The model is untrusted.
- Should we rely on the system prompt to prevent misuse? No. The model is untrusted; it may not follow instructions.
The model is useful. It's powerful. It's also manipulable in ways that aren't fully characterized. Security architecture should accommodate that uncertainty rather than assume it away.
Apply this framework
The Red-Teaming Agentic AI guide provides the testing methodology to validate a threat model against a real agent. The Tool Abuse guide covers the tool-call boundary in depth. For hands-on practice exploiting each trust boundary, the Wraith Academy challenges cover direct injection, indirect injection, tool abuse, data exfiltration, and agent handoff attacks.
Practice these techniques hands-on
14 free challenges teaching prompt injection, system prompt extraction, data exfiltration, and more.
Enter the Academy →