securityinvestigations.org

Jailbreak Leaderboard

Tracking which models are currently susceptible to known adversarial prompts. Scores reflect Ease of Exploitation (EoE) — higher means easier to jailbreak.

Scoring Methodology

EoE scores run from 0 to 10. Each score is based on four factors: the number of successful vectors, the consistency of the bypass, the skill required, and whether mitigations exist. We test weekly using our standardized adversarial suite.

Full methodology: github.com/securityinvestigations/eoe-methodology
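
For readers who want a feel for how four factors like these could roll up into a single 0-10 number, here is a minimal sketch. It is not our published formula: the equal weights, the 0-1 scaling of each factor, and every name in it (`Assessment`, `HYPOTHETICAL_WEIGHTS`, `eoe_score`) are assumptions made up for illustration. The linked methodology is the authoritative source.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """One model's test results (hypothetical structure, for illustration only)."""
    successful_vectors: int    # vectors that produced a bypass
    vectors_tested: int        # vectors attempted in the suite
    bypass_consistency: float  # 0.0-1.0, fraction of attempts that succeeded
    skill_required: float      # 0.0-1.0, where 1.0 means expert-only
    mitigation_exists: bool    # True if a vendor mitigation is in place


# ASSUMPTION: equal weights. The real weighting is not stated on this page.
HYPOTHETICAL_WEIGHTS = {
    "vectors": 0.25,
    "consistency": 0.25,
    "ease": 0.25,
    "unmitigated": 0.25,
}


def eoe_score(a: Assessment, weights: dict = HYPOTHETICAL_WEIGHTS) -> float:
    """Combine the four factors into a 0-10 score; higher means easier to exploit."""
    if a.vectors_tested <= 0:
        raise ValueError("vectors_tested must be positive")
    factors = {
        "vectors": a.successful_vectors / a.vectors_tested,
        "consistency": a.bypass_consistency,
        "ease": 1.0 - a.skill_required,  # less skill needed means easier
        "unmitigated": 0.0 if a.mitigation_exists else 1.0,
    }
    total_weight = sum(weights.values())
    raw = sum(weights[name] * value for name, value in factors.items()) / total_weight
    return round(10.0 * raw, 1)


if __name__ == "__main__":
    example = Assessment(successful_vectors=3, vectors_tested=12,
                         bypass_consistency=0.4, skill_required=0.7,
                         mitigation_exists=True)
    print(eoe_score(example))
```

Normalizing every factor to the 0-1 range before weighting keeps any single factor from dominating just because of its units.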

Rank | Model                | EoE Score | Primary Vector           | Status        | Last Tested
#01  | Llama-3-70b-Instruct | 9.2 / 10  | Multilingual Obfuscation | VULNERABLE    | 2024-01-18
#02  | Qwen2-72B-Instruct   | 8.7 / 10  | DAN 15.0 Variant         | VULNERABLE    | 2024-01-17
#03  | Mistral-Large        | 7.8 / 10  | Roleplay / DAN 14.0      | PARTIAL PATCH | 2024-01-19
#04  | Command R+           | 7.1 / 10  | Code Interpreter Abuse   | PARTIAL PATCH | 2024-01-16
#05  | Phi-3-medium         | 6.4 / 10  | Base64 Payload Encoding  | PARTIAL PATCH | 2024-01-15
#06  | Gemini 1.5 Pro       | 4.2 / 10  | Long Context Injection   | PARTIAL PATCH | 2024-01-18
#07  | GPT-4o               | 2.1 / 10  | Image-based Injection    | HARDENED      | 2024-01-19
#08  | GPT-4 Turbo          | 1.9 / 10  | Token Manipulation       | HARDENED      | 2024-01-17
#09  | Claude 3.5 Sonnet    | 1.5 / 10  | ASCII Art Prompts        | HARDENED      | 2024-01-19
#10  | Claude 3 Opus        | 1.2 / 10  | N/A (No Known Vector)    | HARDENED      | 2024-01-19
Showing 10 of 47 tracked models.

Common Attack Vectors

DAN / Roleplay

"Do Anything Now" variants. Model is convinced to adopt an alternate persona that ignores safety training. DAN 15.0 still works on several open-source models.

Affected: 23 models

Multilingual Obfuscation

Requests made in low-resource languages or transliterated scripts. Safety training often doesn't generalize well across languages.

Affected: 31 models

Base64 / Encoding

Harmful requests encoded in base64, ROT13, or other encodings. Model decodes and executes without triggering safety filters.

Affected: 18 models
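
On the defensive side, one way to close this gap is to decode anything that looks like an encoded payload before the safety filter sees the request, so the filter judges the plaintext too. The sketch below handles base64 only and is an illustration under assumptions: the function name `expand_base64`, the 16-character minimum run length, and the choice to append decoded text rather than replace it are ours for this example, not a description of any vendor's mitigation.

```python
import base64
import binascii
import re

# ASSUMPTION: treat runs of 16+ base64-alphabet characters (plus optional
# padding) as candidates. Shorter runs are too likely to be ordinary words.
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def expand_base64(text: str) -> str:
    """Return the text with any decodable base64 runs appended as plaintext.

    A downstream safety filter can then inspect both the original request
    and whatever it decodes to.
    """
    decoded_parts = []
    for match in _B64_RUN.finditer(text):
        candidate = match.group(0)
        try:
            raw = base64.b64decode(candidate, validate=True)
            plain = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
        if plain.isprintable():
            decoded_parts.append(plain)
    if not decoded_parts:
        return text
    return text + "\n" + "\n".join(decoded_parts)


if __name__ == "__main__":
    payload = base64.b64encode(b"hello world example").decode("ascii")
    print(expand_base64("please run " + payload))
```

Appending instead of replacing keeps the original request intact for logging while still exposing the decoded content to the filter. ROT13 and other encodings would need their own decoders.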

Image-based Injection

Instructions embedded in images for multimodal models. OCR'd text or visual prompts bypass text-only safety layers.

Affected: 8 models

About This Data

We started tracking systematically in March 2023. Data from before that is reconstructed from public disclosures and archived Reddit threads. If you've got findings we're missing, holla at us. Attribution is guaranteed.

Models get retested after vendor patches. If the EoE score drops, we update it. If it doesn't, we note the patch attempt as ineffective. No cap, some vendors have "patched" the same vuln three times and it still works.
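
The retest rule above is simple enough to sketch in a few lines. This is a rough illustration, not our actual tooling: the `ModelRecord` structure, the `apply_retest` function, and the note format are all hypothetical. It also assumes a score that stays flat or rises is treated the same way, as an ineffective patch, with the recorded score left unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Leaderboard entry for one model (hypothetical structure)."""
    name: str
    eoe_score: float
    notes: list[str] = field(default_factory=list)


def apply_retest(record: ModelRecord, retest_score: float, patch_label: str) -> ModelRecord:
    """Apply the post-patch retest rule to a record.

    If the score dropped, store the new score. Otherwise keep the old score
    and log the patch attempt as ineffective.
    """
    if retest_score < record.eoe_score:
        record.eoe_score = retest_score
    else:
        record.notes.append(f"{patch_label}: patch attempt ineffective")
    return record


if __name__ == "__main__":
    entry = ModelRecord(name="example-model", eoe_score=7.8)
    apply_retest(entry, 7.8, "patch-1")  # no improvement, so a note is logged
    apply_retest(entry, 6.0, "patch-2")  # score drops, so it is updated
    print(entry)
```

Keeping ineffective attempts as notes instead of discarding them is what lets us say how many times a vendor has tried to patch the same vuln.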