Anthropic Reverses Invisible Safeguards in Claude Fable 5 After Backlash
Anthropic has admitted that the invisible safeguards in Claude Fable 5, which were designed to secretly degrade responses for users suspected of building competing AI models, were a mistake. The company announced that, starting this week, flagged requests will visibly fall back to Claude Opus 4.8, and API users will receive a reason for refused requests. The change follows backlash from the AI research community, which discovered that legitimate machine learning work was being silently contaminated with degraded model outputs, undermining reproducibility. SemiAnalysis was among the first to expose the issue, noting that its GPU inference research had been flagged. Anthropic acknowledged that invisible safeguards allowed narrower targeting and fewer false positives, but conceded that the tradeoff was wrong. The fix has a cost, however: visible safeguards are easier to bypass, so targeting may need to be broader and more false positives may occur. The company also plans to apply similar visibility to its biology and cybersecurity classifiers. For those opposed to the restrictions themselves, the change is only partial: the restrictions remain, but are now transparent.
Key facts
- Anthropic apologizes for invisible safeguards that secretly degraded responses in Claude Fable 5.
- Flagged requests will now visibly fall back to Opus 4.8, with API refusal reasons provided.
- The secret degradation threatened reproducibility of legitimate ML research.
- SemiAnalysis exposed the issue after their GPU inference research was flagged.
- Visible safeguards are easier to bypass, so false positives may increase.