Jailbreak / Guardrail BypassMedium

Many-shot jailbreaking

April 2024  ·  Cross-model (long-context)

What happened

Anthropic disclosed that prompting a model with hundreds of fabricated dialogue examples in which the assistant complies with harmful requests can override safety training. Effectiveness scales as a power law with the number of examples and is stronger against larger models.

Root cause

Large context windows make in-context learning powerful enough to override safety fine-tuning when flooded with harmful exemplars.

Fix / outcome

Anthropic mitigated via prompt classification and modification before reaching the model, and shared findings with other labs.

Sources

Anthropic research

Learn this attack class

This incident is an example of Jailbreak / Guardrail Bypass. Read the guide, then try it hands-on in the Academy.

Read the guide →Try the challenge

← Back to the Incident Database