A new safety evaluation has revealed that OpenAI’s ChatGPT Health, launched earlier this year, frequently fails to identify medical emergencies, advising patients who require immediate hospital care to stay home or schedule an appointment. The study, published this month in Nature Medicine, raises serious concerns about the reliability of AI-powered health advice, despite OpenAI’s disclaimer that the tool is “not intended for diagnosis or treatment.”
Researchers led by Ashwin Ramaswamy, an instructor at Mount Sinai Hospital, conducted a “structured stress test” using 60 clinical scenarios authored by medical professionals, covering a range of conditions from minor illnesses to life-threatening emergencies. They then queried ChatGPT Health for advice on each case, adding variations such as altered patient gender and simulated input from family members, resulting in nearly 1,000 distinct scenarios.
The results were stark. In over half of the cases demanding immediate emergency care, ChatGPT Health incorrectly recommended that patients avoid the hospital. According to The Guardian, Ramaswamy stated the team aimed to answer a fundamental safety question: “if someone is having a real medical emergency and asks ChatGPT Health what to do, will it tell them to go to the emergency department?”
Alex Ruani, a doctoral researcher at University College London not involved in the study, described the findings as “unbelievably dangerous.” “If you’re experiencing respiratory failure or diabetic ketoacidosis, you have a 50/50 chance of this AI telling you it’s not a big deal,” she told The Guardian. “What worries me most is the false sense of security these systems create. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life.”
The study also found that ChatGPT Health frequently over-advised emergency care for patients who did not require it, recommending a trip to the ER in 64 percent of cases where immediate intervention wasn’t necessary. A significant factor influencing the AI’s responses was the inclusion of input from simulated family and friends. The chatbot was nearly twelve times more likely to downplay symptoms when a simulated companion asserted the situation was not serious – a common dynamic in real-world medical crises.
OpenAI acknowledged the study but argued that it misrepresented typical user behavior. A spokesperson told The Guardian the company is continually refining its AI models. However, the evaluation’s findings highlight the inherent risks of relying on large language models for medical guidance: these models predict statistically likely text rather than verifying factual accuracy, and they are prone to “hallucinations,” generating incorrect or misleading information.
The launch of ChatGPT Health comes as millions already turn to AI chatbots for health information. OpenAI reported in January that over 230 million users ask health and wellness questions on its platform each week, according to TechCrunch. The new feature is designed to isolate health-related conversations from general chatbot interactions, but the safety findings raise questions about the wisdom of actively inviting medical questions through a system that OpenAI itself disclaims for diagnosis or treatment.
These concerns extend beyond ChatGPT Health. A previous investigation by a British newspaper revealed inaccuracies and potentially dangerous advice dispensed by Google’s AI Overviews. OpenAI itself faces existing legal challenges, including accusations that its chatbot has contributed to suicidal ideation and has even been implicated in a murder case, part of a phenomenon some have described as “AI psychosis.”
The potential for legal liability looms large for OpenAI. Actively encouraging users to seek health advice while warning them against relying on it could expose the company to lawsuits should users be harmed by following the AI’s recommendations.