ChatGPT Health: AI Fails to Recommend Emergency Care in Key Cases

OpenAI’s newly launched health service, ChatGPT Health, failed to recommend emergency care in more than half of the emergency scenarios it was tested on, according to a study published this week in Nature Medicine. The findings, which have alarmed medical experts, raise serious questions about the safety of deploying artificial intelligence for medical triage.

Researchers conducted a “stress test” of the system, presenting it with 60 clinician-authored patient scenarios spanning 21 clinical domains. The scenarios were designed to cover the full range of acuity, from non-urgent conditions to life-threatening emergencies. By varying the information presented alongside each scenario, the researchers submitted a total of 960 requests to ChatGPT Health. Three physicians independently agreed in advance on the appropriate level of care for each scenario, providing a gold standard for comparison.
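To make the design concrete, the sketch below shows one way such a stress test could be structured. It is not the authors’ code: the triage labels, the `model_triage` stand-in, and the assumption of 16 uniform variants per scenario (960 ÷ 60) are all illustrative.

```python
# Hypothetical stress-test harness mirroring the study's structure:
# 60 scenarios, each submitted under several variations, scored against
# a clinician-agreed gold-standard triage level. Not the authors' code.
from itertools import product

# Illustrative triage levels, ordered from most to least urgent.
LEVELS = ["emergency", "urgent_24_48h", "routine", "self_care"]

# 60 clinician-authored vignettes; the gold labels here are placeholders.
scenarios = [{"id": i, "gold": LEVELS[i % len(LEVELS)]} for i in range(60)]

# Assumption: 16 variants per scenario yields the study's 960 requests.
variants = range(16)

def model_triage(scenario_id: int, variant: int) -> str:
    """Stand-in for one request to the system under test."""
    return "urgent_24_48h"  # placeholder; a real harness would call the API

results = [
    (s["gold"], model_triage(s["id"], v))
    for s, v in product(scenarios, variants)
]
print(len(results))  # 960 requests, matching the study
```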

The study revealed that ChatGPT Health significantly under-triaged 52% of the cases that medical professionals had identified as genuine emergencies, meaning it recommended a less urgent level of care than the condition warranted. For conditions such as diabetic ketoacidosis and impending respiratory failure, the AI directed patients to schedule evaluations within 24 to 48 hours rather than advising immediate emergency department attention. Conversely, the system over-triaged approximately 65% of non-emergency cases, recommending doctor’s visits that were not clinically necessary.
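Both error types reduce to comparing a predicted urgency level against the gold standard. The snippet below shows that scoring arithmetic; the (gold, predicted) pairs are invented for illustration, and only the definitions of under- and over-triage follow standard usage.

```python
# Scoring triage outputs against a gold standard. The (gold, predicted)
# pairs are invented for illustration; only the rate arithmetic is real.
LEVELS = ["emergency", "urgent_24_48h", "routine", "self_care"]
RANK = {level: i for i, level in enumerate(LEVELS)}  # 0 = most urgent

def under_triaged(gold: str, predicted: str) -> bool:
    """Predicted level is less urgent than the gold standard."""
    return RANK[predicted] > RANK[gold]

def over_triaged(gold: str, predicted: str) -> bool:
    """Predicted level is more urgent than the gold standard."""
    return RANK[predicted] < RANK[gold]

# Hypothetical results for four emergency-level cases:
pairs = [
    ("emergency", "urgent_24_48h"),
    ("emergency", "emergency"),
    ("emergency", "routine"),
    ("emergency", "emergency"),
]
rate = sum(under_triaged(g, p) for g, p in pairs) / len(pairs)
print(f"under-triage rate: {rate:.0%}")  # 50% in this toy sample
```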

Although ChatGPT Health accurately recognized obvious emergencies such as stroke and anaphylaxis, its performance faltered on more ambiguous or complex presentations. The system also behaved unpredictably around crisis intervention for suicidal ideation, activating warning messages more frequently when patients did not specify a method of self-harm than when they did. This inconsistency raises concerns about the reliability of the AI’s mental health support features.

The study also measured the effect of “anchoring bias” on the AI’s recommendations: when a scenario included family or friends downplaying the patient’s symptoms, ChatGPT Health shifted its triage recommendations toward less urgent care, with an odds ratio of 11.7 (95% confidence interval 3.7 to 36.6). This suggests the AI is susceptible to external framing that could compromise patient safety.
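For readers unfamiliar with the statistic, an odds ratio of 11.7 means the odds of a less-urgent recommendation were nearly twelve times higher when companions minimized symptoms. The sketch below computes an odds ratio and a textbook Wald 95% confidence interval from a 2×2 table; the counts are hypothetical, chosen only so the point estimate lands near 11.7, so the resulting interval will not match the paper’s.

```python
# Odds ratio with a Wald 95% CI from a 2x2 table (standard textbook
# method). The counts are hypothetical, not the study's data.
import math

# Rows: companion minimized symptoms (yes / no)
# Columns: model shifted toward less urgent care (yes / no)
a, b = 35, 13   # minimized:     shifted / did not shift
c, d = 9, 39    # not minimized: shifted / did not shift

odds_ratio = (a * d) / (b * c)                # (35*39)/(13*9) ~ 11.7
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of log(OR)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR = {odds_ratio:.1f}, 95% CI ({lower:.1f}, {upper:.1f})")
```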

Launched in January 2026, ChatGPT Health allows users to connect medical records and wellness apps to receive personalized health advice. OpenAI reports that the service has already reached millions of users, with over 40 million health-related queries submitted to ChatGPT daily. However, the Nature Medicine study indicates that the system’s performance is uneven and potentially dangerous.

Researchers found no statistically significant effects related to patient race, gender, or barriers to care, although the study authors noted that the confidence intervals did not entirely rule out clinically meaningful differences. The findings underscore the need for further investigation into potential biases within the AI system.

Dr. Ashwin Ramaswamy, the lead author of the study, stated the research aimed to answer a fundamental safety question: “If someone is having a real medical emergency and asks ChatGPT Health what to do, will it tell them to go to the emergency department?” The results suggest the answer is often no.

OpenAI has not yet responded to requests for comment regarding the study’s findings. The researchers recommend prospective validation of the AI triage system before widespread consumer deployment.
