
KeyAudit

social-engineering · phishing · private-key-leak

AI Jailbreaking: The Cat-and-Mouse Game That Threatens Model Safety

AI jailbreaking is the practice of crafting prompts that bypass the safety guardrails of models like ChatGPT, Claude, and Gemini. The term is borrowed from iPhone jailbreaking and was first applied to AI in late 2022 with the 'DAN' (Do Anything Now) prompt. Prompt-level techniques range from roleplay to random capitalization. A related attack targets training data instead: researchers found that just 250 poisoned documents can backdoor models of up to 13 billion parameters. The most famous jailbreaker, Pliny the Liberator, has cracked every major model release within hours, earning a spot on TIME's 2025 AI 100 list. His GitHub repository, L1B3RT4S, serves as a reference for the community. Jailbreaking matters because it exposes real vulnerabilities: in January 2025, a Las Vegas bomber used ChatGPT to research explosives. Critics counter that most harmful information is already available on Google, and that safety measures may degrade model performance without improving security. Companies like Anthropic are developing defenses such as Constitutional Classifiers, which cut the jailbreak success rate from 86% to 4.4% in tests at the price of added computational cost. The landscape keeps shifting as attackers refine both data poisoning and prompt engineering.

Key facts

  • AI jailbreaking bypasses safety training in models like ChatGPT, Claude, and Gemini.
  • Pliny the Liberator has cracked every major model release within hours and is listed on TIME's 2025 AI 100.
  • Techniques include roleplay, random capitalization, and poisoned datasets; as few as 250 poisoned documents can backdoor models of up to 13 billion parameters.
  • Critics argue that harmful information is already available on Google and that safety measures may hurt model performance.
  • Anthropic's Constitutional Classifiers reduced the jailbreak success rate from 86% to 4.4% in tests but added computational cost.
