KeyAudit

AI Models Can't Beat Human Engineers in Production Incident Analysis, Datadog Benchmark Shows

Datadog and Carnegie Mellon have released ARFBench, the first AI benchmark built entirely from real production incidents. Drawing on 63 actual outages extracted from engineers' Slack threads, the benchmark comprises 750 multiple-choice questions covering 142 monitoring metrics and 5.38 million data points. The questions fall into three tiers: Tier I asks whether an anomaly is present, Tier II asks about its timing and severity, and Tier III, the hardest, requires reasoning across metrics. Among existing AI models, GPT-5 leads with 62.7% accuracy, below domain experts at 72.7% and even non-domain experts at 69.7%. On Tier III, GPT-5's F1 score drops to 47.5%. Datadog's hybrid model, which combines Toto with Qwen3-VL 32B, tops the leaderboard at 63.9% accuracy, still short of both human groups. The key finding is that AI and humans have complementary error profiles: the questions each gets wrong barely overlap. A theoretical Model-Expert Oracle that always picks the better of the AI and human answers reaches 87.2% accuracy, an upper bound on what collaborative human-AI incident response could achieve on this benchmark. The benchmark is available on Hugging Face.
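
To make the oracle idea concrete, here is a minimal Python sketch of how such a ceiling could be computed from per-question results. It assumes the oracle counts a question as correct whenever either the model or the expert answered it correctly. The function names and the ten-question toy data are hypothetical, invented for illustration; they are not ARFBench data or Datadog's evaluation code.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(bool(c) for c in correct) / len(correct)


def oracle_accuracy(model: Sequence[bool], expert: Sequence[bool]) -> float:
    """Accuracy of a hypothetical oracle that is right whenever
    the model or the expert is right on a given question."""
    if len(model) != len(expert):
        raise ValueError("both result lists must cover the same questions")
    return accuracy([m or e for m, e in zip(model, expert)])


def shared_error_rate(model: Sequence[bool], expert: Sequence[bool]) -> float:
    """Fraction of questions that both the model and the expert miss."""
    if len(model) != len(expert):
        raise ValueError("both result lists must cover the same questions")
    return accuracy([not m and not e for m, e in zip(model, expert)])


if __name__ == "__main__":
    # Toy per-question outcomes (True = correct); NOT real benchmark results.
    model = [True, True, True, False, False, True, False, True, True, False]
    expert = [True, False, True, True, True, True, True, False, True, False]

    print(f"model accuracy:  {accuracy(model):.1%}")
    print(f"expert accuracy: {accuracy(expert):.1%}")
    print(f"oracle accuracy: {oracle_accuracy(model, expert):.1%}")
    print(f"shared errors:   {shared_error_rate(model, expert):.1%}")
```

In this toy data the model and the expert each miss several questions, but only one question is missed by both, so the oracle scores well above either alone. Under the stated assumption, oracle accuracy always equals one minus the shared error rate, which is why a small overlap in errors translates into a high collaborative ceiling.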

Key facts

  • ARFBench is built from 63 real production incidents and 750 verified questions.
  • GPT-5 leads AI models at 62.7% accuracy, below domain experts at 72.7%.
  • The Toto and Qwen3-VL 32B hybrid edges out GPT-5 with 63.9% accuracy.
  • AI and human errors barely overlap, which leaves room for collaboration.
  • A theoretical human-AI oracle achieves 87.2% accuracy.
