KeyAudit

Study finds nearly half of AI chatbot health answers are problematic

A peer-reviewed study published in BMJ Open on April 14 found that nearly half (49.6%) of the health and medical answers given by five major AI chatbots were problematic. Researchers from UCLA, the University of Alberta, and Wake Forest tested Gemini, DeepSeek, Meta AI, ChatGPT, and Grok on 250 health questions covering cancer, vaccines, stem cells, nutrition, and athletic performance. Using adversarial questions designed to push the chatbots toward bad advice, they rated 30% of responses 'somewhat problematic' and 19.6% 'highly problematic', potentially leading to ineffective or dangerous treatments. Grok performed worst, with 58% of its responses problematic and 30% highly problematic, a result linked to its training on data from X. Nutrition and athletic performance questions fared worst across all models. No chatbot produced a fully accurate reference list; median completeness was 40%. Every response scored at a 'Difficult' reading level, exceeding medical recommendations. The authors call for public education, professional training, and regulatory oversight. The study tested only free-tier chatbots, but it highlights core issues: these models pattern-match text without reasoning or ethical judgment, and they rarely refuse to answer.

Key facts

  • 49.6% of chatbot health responses were problematic; 19.6% highly problematic.
  • Grok was the worst performer, with 58% problematic responses (30% highly problematic), significantly worse than expected.
  • Nutrition and athletic performance questions worst across all models.
  • No chatbot produced a fully accurate reference list; median completeness 40%.
  • All responses at 'Difficult' reading level, exceeding AMA recommendations.