New Benchmark Reveals AI Agents Fail at Real-World Personal Assistant Tasks
Researchers from Huawei, Beijing Institute of Technology, Peking University, and the Chinese Academy of Sciences have released Claw-Anything, a benchmark that evaluates AI agents on realistic personal-assistant tasks. Each task involves a long-term event stream spanning more than three months, interdependent services (10.1 per task on average), and multi-device interaction across a Linux command line and an Android GUI. The average context length per task is 191,700 words, far more than the 1,700–12,000 words of existing benchmarks. GPT-5.5, OpenAI's flagship model, scored only 34.5% on the pass@1 metric, and the success rate on proactive assistance tasks was just 6.7%. The study argues that current benchmarks measure the wrong things: agents struggle with cross-service coordination, irrelevant events, and accumulated noise. The team also released an automated data pipeline that generates 2,000 training environments; fine-tuning Qwen3.5-27B on 1,500 successful trajectories improved pass@1 by 23.7%, enough to beat several closed-source models. The dataset is available on Hugging Face and the code on GitHub.
Key facts
- Claw-Anything benchmark tests AI agents on 3+ month event streams with 10.1 services per task.
- GPT-5.5 scored 34.5% pass@1; proactive tasks only 6.7% success.
- Average context length per task is 191,700 words in Claw-Anything, versus 1,700–12,000 words in existing benchmarks.
- Fine-tuning Qwen3.5-27B on 1,500 trajectories improved pass@1 by 23.7%.
- Dataset on Hugging Face; code on GitHub.
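
For readers unfamiliar with the pass@1 metric cited above, the following is a minimal sketch of how pass@k is commonly computed in agent and code-generation evaluations. The article does not describe how Claw-Anything computes its score, so the estimator, the function names (`pass_at_k`, `mean_pass_at_k`), and the sample numbers are illustrative assumptions, not the benchmark's own implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for a single task.

    n: attempts sampled for the task
    c: attempts that succeeded
    k: attempt budget being scored (k=1 gives pass@1)

    Uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k).
    Assumption: the article does not confirm Claw-Anything uses this formula.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task pass@k over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: three tasks, one attempt each, only the first succeeds.
    made_up_results = [(1, 1), (1, 0), (1, 0)]
    print(f"pass@1 = {mean_pass_at_k(made_up_results, k=1):.3f}")
```

With one attempt per task, pass@1 reduces to the fraction of tasks solved on the first try, which is how a figure such as 34.5% is usually read.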