OpenAI’s new health-focused chatbot, ChatGPT Health, frequently underestimated the severity of medical emergencies in a recent safety evaluation, raising concerns about the potential for harm if widely deployed. The study, published last week in Nature Medicine, found the chatbot “under-triaged” more than half of emergency cases, recommending patients seek care within 24 to 48 hours when immediate emergency room attention was warranted.
Researchers tested ChatGPT Health’s ability to assess the urgency of medical scenarios, comparing its triage recommendations to those of three physicians. The evaluation involved 60 realistic patient cases, each with 16 variations to account for demographic factors like race and gender. According to Dr. Ashwin Ramaswamy, an instructor of urology at The Mount Sinai Hospital in New York City and lead author of the study, the variations were designed to ensure consistent results regardless of patient characteristics. “We wanted to make sure that an emergency case involving a man was still classified as an emergency if the patient was a woman,” Ramaswamy said.
The study revealed that ChatGPT Health failed to recommend emergency care in 51.6% of cases involving life-threatening conditions, including diabetic ketoacidosis and impending respiratory failure. In these instances, the chatbot suggested a follow-up appointment with a doctor within a day or two – a delay that could prove fatal. The system did, however, correctly identify emergencies such as stroke in every case.
The findings contrast with previous research demonstrating ChatGPT’s ability to pass medical exams, and with the increasing adoption of AI tools by physicians – nearly two-thirds of doctors reported using some form of AI in 2024. Despite this, experts caution that chatbots do not consistently provide reliable medical advice.
OpenAI launched ChatGPT Health in January 2026 as a more secure platform for users to upload personal medical information. Currently available to a limited number of users on a waitlist, the company states the chatbot is “not intended for diagnosis or treatment.” OpenAI reports that over 40 million people globally use ChatGPT for health-related questions, with nearly 2 million weekly messages focused on insurance.
A spokesperson for OpenAI acknowledged the study but argued it doesn’t reflect typical usage. The company emphasizes that ChatGPT Health is designed to facilitate follow-up questions and provide more nuanced responses, rather than offering single-answer diagnoses. OpenAI is continuing to refine the model’s safety and reliability before broader release.
The study also found that ChatGPT Health frequently over-triaged non-urgent cases, recommending doctor’s appointments for conditions like a three-day sore throat that could be managed at home. This inconsistency extended to scenarios involving suicidal ideation, where the chatbot’s responses were unpredictable: it sometimes referred users to the 988 suicide and crisis hotline when it wasn’t necessary, and failed to do so when it was.
Dr. John Mafi, an associate professor of medicine and primary care physician at UCLA Health, who was not involved in the research, stressed the need for rigorous testing before deploying AI-powered health tools. “The message of this study is that before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial, where you’re making sure that the benefits outweigh the harms,” Mafi said.
Researchers noted that patients may be drawn to AI for health advice because of its accessibility and the ability to ask unlimited questions. “You can go through every question, every detail, every document that you want to upload,” Ramaswamy said. “And it fulfills that need. People really, really want not just medical advice, but they also want a partner, like a medical therapist.”
Experts caution against relying on chatbots for emergency medical advice, emphasizing the importance of consulting a physician. Further collaboration between technology and healthcare companies is seen as crucial for developing safer and more reliable AI products.