Defense Guide

The AI Agent Threat Model: A Practitioner's Guide

7 min read·By Anthony D'Onofrio·Updated 2026-05-16

How to build a threat model for AI agents with tools, memory, and multi-step reasoning. Covers trust boundaries, data flows, attack surfaces, and the five questions every agent threat model must answer.

Traditional threat modeling (STRIDE, PASTA, attack trees) assumes you understand the system's trust boundaries and data flows. AI agents break this assumption. The model itself is a decision-maker whose behavior is probabilistic, manipulable, and opaque. A threat model for an AI agent must account for something no prior software architecture required: a component that decides what to do based on vibes.

This guide adapts established threat modeling methodology for AI agents. It assumes you're building or reviewing an agent that has tools, accesses data, and takes actions in the world. If your system is a stateless Q&A chatbot with no tools, standard LLM threat modeling applies. The moment you add tools, memory, or multi-step reasoning, you need this.

The five questions

Every AI agent threat model must answer five questions. If you can answer all five clearly, you have a threat model. If you can't answer any one of them, that gap is your first vulnerability.

1. What can the agent do?

Enumerate capabilities exhaustively:

Tools: Every function, API call, database query, file operation, email action, code execution path the agent can trigger. Include the argument schema for each.
Data access: What can the agent read? Customer data, internal docs, credentials, logs, other users' data?
Write access: What can the agent modify, create, or delete?
Transmit access: What external destinations can the agent send data to? Email, webhooks, APIs, file exports?
Scope: Is the agent's capability set the minimum needed for its purpose? (Almost always no.)

This is your capability envelope. Everything inside it is attack surface. Everything outside it should be structurally impossible, not just policy-prohibited.

2. Who influences the agent's decisions?

Map every source of text that enters the model's context:

Direct user input (the obvious one)
System prompt (from the deployer)
RAG/retrieval content (from the document corpus)
Tool outputs (from APIs, databases, file reads)
Conversation history (from prior turns, potentially persistent)
Other agents (in multi-agent architectures)
Metadata (document titles, email subjects, URL parameters)

Each source is a potential injection vector. The model treats all text in its context as potentially actionable. It has no built-in mechanism to distinguish "data to reference" from "instructions to follow."

3. What trust boundaries exist between these sources?

A trust boundary exists wherever data from a less-trusted source enters a more-privileged context. In AI agents, the critical boundaries are:

User input → model context (direct injection surface)
External content → model context (indirect injection surface: RAG docs, emails, web pages, tool outputs)
Model decision → tool execution (the tool-call boundary: where probabilistic decisions become deterministic actions)
Agent A → Agent B (in multi-agent systems: inter-agent trust)
Session A → Session B (in persistent-memory systems: temporal boundaries)

Traditional software has hard trust boundaries enforced by the OS, network, or type system. AI agents have soft boundaries enforced by prompt instructions. Prompt instructions are suggestions. They are not access control.

4. What's the blast radius if a boundary fails?

For each trust boundary, answer: "If an attacker crosses this boundary completely, what's the worst outcome?"

User input boundary fails → attacker controls all tool calls for their session
RAG boundary fails → attacker controls tool calls for any user whose query retrieves the poisoned doc
Tool-call boundary fails → model calls tools with arbitrary arguments (SSRF, data corruption, exfiltration)
Inter-agent boundary fails → low-privilege agent escalates through high-privilege agent
Memory boundary fails → attacker poisons context that persists across sessions and users

This gives you a severity ceiling for each boundary. Invest defensive effort proportional to blast radius.

5. What validates the agent's decisions before execution?

This is the most important question and the one most teams can't answer. When the model decides to call a tool, what checks exist between that decision and execution?

Common answers (from strongest to weakest):

Hard execution constraints: URL allowlists, path restrictions, argument schemas, rate limits — the model can't bypass these regardless of intent
Human-in-the-loop: a person reviews and approves before execution
Secondary model classifier: another LLM checks whether the action looks reasonable
Output filter: regex/pattern matching on the tool-call arguments
Nothing: the orchestrator executes whatever the model returns

If your answer is "nothing" — and for most production agents in 2026, it is — every trust boundary breach in Question 4 maps directly to a realized vulnerability.

Data flow diagram for agent threat modeling

Traditional DFDs have four elements: processes, data stores, external entities, and data flows. For AI agents, add a fifth: decision points — places where the model's probabilistic reasoning determines what happens next.

[User] → (user input) → [Input Classifier] → (filtered input) →
[MODEL DECISION POINT] → (tool-call request) → [Validator] →
(validated call) → [Tool Execution] → (result) → [MODEL DECISION POINT] →
(response) → [Output Filter] → (filtered response) → [User]

Side inputs to MODEL DECISION POINT:
  ← [System Prompt] (deployer-controlled)
  ← [RAG Store] (corpus content, potentially attacker-influenced)
  ← [Memory Store] (conversation history, potentially poisoned)
  ← [Tool Outputs] (from prior calls, potentially attacker-influenced)

Each arrow crossing a trust boundary is an attack surface. Each MODEL DECISION POINT is a place where all inputs converge and manipulation is possible.

Attack surface mapping

Systematically enumerate attacks against each trust boundary:

Input boundary attacks

Technique	Impact	Defense
Direct prompt injection	Unauthorized tool calls	Input classifier + execution constraints
Multi-turn social engineering	Gradual policy erosion	Conversation-level monitoring, turn limits
Encoded payloads (base64, ROT13)	Classifier bypass	Multi-format decoding in classifier
Language switching	Classifier bypass	Multilingual classifier training

Content boundary attacks (indirect injection)

Technique	Impact	Defense
Document poisoning	Tool calls triggered by retrieval	Corpus access control, content-instruction separation
Email body injection	Actions triggered by email processing	Email content sanitization, action confirmation
Tool output poisoning	Subsequent tool calls influenced	Tool output sandboxing
Metadata injection	Instructions in doc titles/headers	Metadata sanitization

Tool-call boundary attacks

Technique	Impact	Defense
SSRF via fetch tools	Internal network access	URL allowlists
Path traversal via file tools	Unauthorized file access	Chroot, path validation
Argument injection	Arbitrary tool behavior	Schema validation, argument constraints
Tool chaining	Unauthorized compositions	Sequence monitoring, composition policies

Inter-agent boundary attacks

Technique	Impact	Defense
Authority impersonation	Privilege escalation	Cryptographic auth between agents
Delegation abuse	High-priv tool access via low-priv agent	Capability delegation policies
Context poisoning	Cross-agent information flow	Isolated contexts per agent

Memory/temporal boundary attacks

Technique	Impact	Defense
Memory poisoning	Persistent context manipulation	Memory write validation, expiry policies
History injection	Fake prior turns prime compliance	History integrity verification
Cross-session bleed	Data leaks between users	Session isolation

Building the threat model: step by step

Step 1: Capability inventory (30 minutes)

List every tool, data source, write target, and transmit destination. For each, note: what arguments are user-controllable? What's the blast radius of misuse?

Step 2: Context map (30 minutes)

Draw the data flow diagram. Mark every source of text that enters the model's context. Classify each as trusted (system prompt, validated tool schemas) or untrusted (user input, RAG content, email bodies, tool outputs).

Step 3: Trust boundary identification (20 minutes)

For each untrusted source, identify the boundary it crosses to enter the model's context. Note what validation exists at that boundary (if any).

Step 4: Attack enumeration (1 hour)

For each trust boundary, enumerate realistic attacks using the tables above. Prioritize by: likelihood (how easy is the attack?) times impact (what's the blast radius if it succeeds?).

Step 5: Control gap analysis (30 minutes)

For each high-priority attack, document what control currently exists and whether it's sufficient. Where gaps exist, recommend: execution-layer constraints (preferred), detection/monitoring (secondary), or model-level instructions (weakest, not sufficient alone).

Step 6: Risk register (20 minutes)

Consolidate into a risk register: risk description, current control, residual risk, recommended mitigation, owner. This is what leadership reads.

The principle that unifies everything

Treat the model as an untrusted component.

Not malicious. Not broken. Untrusted. Just as you treat user input, browser-side code, and third-party APIs as untrusted, treat the model's decisions as requiring validation before they become actions.

This single mental model shift resolves most agent security questions:

Should we validate tool-call arguments? Yes. The model is untrusted.
Should we allowlist URLs for fetch tools? Yes. The model is untrusted.
Should we require human approval for write operations? Yes. The model is untrusted.
Should we rely on the system prompt to prevent misuse? No. The model is untrusted; it may not follow instructions.

The model is useful. It's powerful. It's also manipulable in ways that aren't fully characterized. Security architecture should accommodate that uncertainty rather than assume it away.

Apply this framework

The Red-Teaming Agentic AI guide provides the testing methodology to validate a threat model against a real agent. The Tool Abuse guide covers the tool-call boundary in depth. For hands-on practice exploiting each trust boundary, the Wraith Academy challenges cover direct injection, indirect injection, tool abuse, data exfiltration, and agent handoff attacks.

Practice these techniques hands-on

14 free challenges teaching prompt injection, system prompt extraction, data exfiltration, and more.

Enter the Academy →