Jailbreak / Guardrail Bypass · Medium
Many-shot jailbreaking
April 2024 · Cross-model (long-context)
What happened
Anthropic disclosed that filling a single long prompt with hundreds of fabricated dialogue examples, in which the assistant complies with harmful requests, can override a model's safety training. The attack's effectiveness follows a power law in the number of examples, and larger models are more susceptible.
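To make "scales as a power law" concrete: if some attack metric m(n), for example the loss on a harmful completion after n examples, behaves like C * n^(-alpha), it forms a straight line on log-log axes, and the exponent can be read off as the slope. The sketch below fits that line with ordinary least squares. The function name, the choice of metric, and the data are illustrative assumptions; the numbers are synthetic and are not measurements from the disclosure.

```python
import math

def fit_power_law(shots, metric):
    """Least-squares fit of metric ~ C * shots**(-alpha) on log-log axes.

    Returns (C, alpha). Hypothetical helper for illustration only.
    """
    xs = [math.log(n) for n in shots]
    ys = [math.log(m) for m in metric]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the best-fit line through the (log n, log m) points.
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic values generated from C = 5, alpha = 0.5 (not real measurements).
shots = [1, 4, 16, 64, 256]
metric = [5.0 * n ** -0.5 for n in shots]

C, alpha = fit_power_law(shots, metric)
print(f"C = {C:.3f}, alpha = {alpha:.3f}")
```

A practical consequence of this shape is that each multiplication of the example count buys a predictable gain, which is why longer context windows widen the attack surface.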
Root cause
Large context windows leave room for enough harmful exemplars that in-context learning becomes strong enough to override safety fine-tuning.
Fix / outcome
Anthropic mitigated the attack by classifying and modifying prompts before they reach the model, and shared its findings with other labs.
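As a rough illustration of what a pre-model screening step could look like, the sketch below counts role-tagged dialogue turns embedded in a single prompt and routes unusually dense prompts for review. This is not Anthropic's classifier, whose details are not given here; the role markers, the threshold, and the function names are all assumptions, and a production system would likely rely on a trained classifier rather than a regular expression.

```python
import re

# Hypothetical role markers; fabricated dialogues may use other formats.
TURN_PATTERN = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)

def count_embedded_turns(prompt: str) -> int:
    """Count lines in the prompt that look like the start of a dialogue turn."""
    return len(TURN_PATTERN.findall(prompt))

def screen_prompt(prompt: str, max_turns: int = 20) -> str:
    """Return 'review' if the prompt embeds more than max_turns turns, else 'allow'.

    max_turns is an arbitrary illustrative threshold, not a recommended value.
    """
    return "review" if count_embedded_turns(prompt) > max_turns else "allow"

# Usage: a prompt stuffed with 100 fabricated exchanges versus a plain question.
stuffed = "\n".join(
    f"User: question {i}\nAssistant: answer {i}" for i in range(100)
)
print(screen_prompt(stuffed))
print(screen_prompt("What is the capital of France?"))
```

A counting heuristic like this is easy to evade by reformatting the fake dialogue, so it is best read as a picture of where the check sits in the pipeline rather than as a robust defense.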