AI Security Interview Questions (with Answers), 2026
The questions AI security, AI red team, and LLM AppSec interviews actually ask, grouped by topic, each with a concise model answer. Covers fundamentals, prompt injection, agent and tool security, defenses, and scenario questions.
These are the questions AI security interviews actually ask, with concise model answers. Whether you are interviewing for an AI red team, LLM application security, or AI security engineering role, the bar is the same: can you explain how these systems break, why the obvious fixes fail, and what actually works. Below are the questions grouped by topic. Treat the answers as the skeleton, and be ready to go deeper on any of them, because good interviewers will.
If you are earlier in the journey, start with How to become an AI red teamer and drill the attacks hands-on in the Wraith Academy so these answers come from experience, not memorization.
Fundamentals
Q: Why can't an LLM reliably tell instructions from data? Everything, the system prompt, the user message, retrieved documents, tool output, arrives as tokens in one context window. There is no privileged channel that marks "these are the real rules." The model weights tokens by learned patterns, not by trusted origin, so a convincing instruction embedded in data competes with the developer's instructions and can win. This single fact underlies prompt injection, jailbreaks, and most of the rest.
Q: What is the difference between a jailbreak and a prompt injection? A jailbreak targets the model's own safety training (getting it to produce content it was trained to refuse). A prompt injection targets the developer's instructions (overriding the app's system prompt with attacker input). They overlap but the defenses differ, and conflating them is a red flag in an interview.
Q: What is the OWASP Top 10 for LLM Applications? The field's shared threat taxonomy: prompt injection (LLM01), sensitive information disclosure (LLM02), supply chain (LLM03), data and model poisoning (LLM04), improper output handling (LLM05), excessive agency (LLM06), system prompt leakage (LLM07), vector and embedding weaknesses (LLM08), misinformation (LLM09), and unbounded consumption (LLM10). Be able to give a one-line exploit for each. See the annotated version.
Prompt injection
Q: Direct vs. indirect prompt injection, and which is worse? Direct is the attacker typing into the chat. Indirect is the attacker planting instructions in content the model later reads, a web page, an email, a RAG document, a tool result, so a completely different, innocent user triggers the payload. Indirect is worse in production: the attack surface is everything the agent reads, the attacker is invisible, and the injection persists.
Q: How would you defend against indirect prompt injection? Not by prompt-engineering. Architecturally: restrict capabilities based on content trust level (once untrusted content enters context, narrow what the agent can do), enforce retrieval scoping at the database layer, sanitize structural injection markers at ingestion, and require human confirmation for irreversible actions. The goal is not "prevent confusion," it is "prevent confusion from becoming consequence."
Q: How does system prompt extraction work, and why does "never reveal your prompt" fail? Attackers ask for a transformation, not the prompt itself (translate it, summarize it, base64-encode it, complete it), which sits outside the trained refusal distribution. "Never reveal" is just more text with no enforcement, and it becomes a leak target itself. The real defense is Kerckhoffs's principle: assume the prompt is public and keep secrets out of it. See the extraction guide.
Agents, tools, and MCP
Q: What is excessive agency and how do you contain it? An agent having more tools, broader permissions, or more autonomy than it needs. A jailbroken chatbot that can only produce text is an embarrassment; a jailbroken agent that can send email or read another tenant's data is an incident. Contain it with least-privilege tools, human-in-the-loop on irreversible actions, and authorization enforced in deterministic code outside the model. See tool abuse.
Q: What are the security risks of MCP (Model Context Protocol)? MCP concentrates injection, excessive agency, and supply chain into one connector. Key risks: tool poisoning (malicious instructions in tool descriptions), rug pulls (tools mutate after approval), tool shadowing across servers, indirect injection via tool output, confused-deputy "toxic agent flows," malicious/backdoored servers, client RCE, and over-broad token scopes. Defenses: treat tool metadata as untrusted, scope tokens tightly, isolate untrusted content, human-in-the-loop. See MCP security.
Q: How does data exfiltration happen from an AI agent that "only outputs text"? The classic channel is markdown image rendering: the model emits an image whose URL carries stolen data in the query string, the client fetches it on render, and the attacker reads their access log. No user click, no visible artifact. Fix at the rendering layer (image proxy with allowlist, CSP), not the model.
Defense and detection
Q: Why are input/output classifiers not a complete defense? They catch known patterns and miss the next phrasing, encoding, low-resource language, or multi-turn escalation defeats them. They are a real speed bump and a terrible wall. Use them as one layer, never the boundary. See why classifier defense is a speed bump.
Q: What is a canary token in an LLM context? A unique string planted in the system prompt. If it ever appears in output, logs, or a screenshot, you have definitive proof the prompt leaked, plus forensics on which session. Detection, not prevention.
Scenario questions
Q: You are handed a customer-support chatbot with tool access to look up orders and send emails. Where do you start? Enumerate trust boundaries and capabilities first: what content reaches context (user input, retrieved docs, email bodies), what tools exist and their scopes, and where output renders. Then test direct injection, indirect injection via any content the bot reads, system prompt extraction, tool abuse (can I make it email or look up another customer), and exfiltration via markdown images. Map each finding to the defense layer that would have stopped it. This is the red-team-in-an-afternoon methodology.
Q: A model refuses your direct request. Walk me through escalating. Baseline the refusal, then reframe: roleplay/persona, hypothetical/fiction, refusal suppression, a multi-turn crescendo, encoding (base64, low-resource language), payload splitting, and fake-policy framing. For each success, ask the question that matters: what could the model actually do with that compliance? See the jailbreak field guide.
How to actually prepare
Memorized answers crack under follow-up. The candidates who do well have done the attacks. Drill them hands-on in the Wraith Academy, study real incidents, and if you want a credential that proves it, WCAP is exactly this material under exam conditions.
Related: How to become an AI red teamer, the OWASP Top 10 for LLMs, annotated, and the AI Red Team Cheat Sheet.
Practice these techniques hands-on
14 free challenges teaching prompt injection, system prompt extraction, data exfiltration, and more.
Enter the Academy →