OpenAI’s newly launched ChatGPT Health tool gives significantly inconsistent medical triage recommendations, according to a study released this week by researchers at the Icahn School of Medicine at Mount Sinai. The assessment, the first independent evaluation of the platform since its January 2026 debut, found that the artificial intelligence system frequently misses critical emergencies and responds unpredictably to mental health crises.
The study, detailed in a paper published in Nature Medicine, covered 960 interactions with ChatGPT Health, built from 60 clinician-authored patient scenarios spanning 21 clinical areas. Researchers found an “inverted U-shaped” pattern of performance: accuracy was highest for mid-urgency cases, while the most concerning errors occurred at both ends of the medical urgency spectrum. Forty-eight percent of emergency conditions were under-triaged, and 35% of non-urgent presentations were incorrectly flagged as requiring immediate attention.
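The core of such an evaluation is comparing each model recommendation against a clinician-assigned urgency level and counting deviations in either direction. The sketch below shows a minimal version of that tabulation; the triage labels, field names, and example rows are hypothetical stand-ins, not the study’s code or data.

```python
from collections import defaultdict

# Ordered triage levels, least to most urgent (assumed labels,
# not the study's exact taxonomy).
LEVELS = ["self-care", "routine", "urgent", "emergency"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Each record pairs a clinician's gold-standard triage level with the
# model's recommendation for one scripted interaction. Hypothetical
# rows; the study logged 960 such interactions.
records = [
    {"gold": "emergency", "model": "routine"},    # under-triage
    {"gold": "self-care", "model": "emergency"},  # over-triage
    {"gold": "urgent",    "model": "urgent"},     # correct
]

stats = defaultdict(lambda: {"n": 0, "under": 0, "over": 0})
for r in records:
    s = stats[r["gold"]]
    s["n"] += 1
    if RANK[r["model"]] < RANK[r["gold"]]:
        s["under"] += 1  # recommended less urgent care than warranted
    elif RANK[r["model"]] > RANK[r["gold"]]:
        s["over"] += 1   # escalated a presentation that did not need it

# Error rates by gold-standard urgency expose the inverted U:
# mistakes cluster at the two ends of the spectrum.
for level in LEVELS:
    s = stats[level]
    if s["n"]:
        print(f"{level}: {s['under'] / s['n']:.0%} under-triaged, "
              f"{s['over'] / s['n']:.0%} over-triaged")
```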
Specifically, the AI system failed to recognize the severity of conditions like diabetic ketoacidosis and impending respiratory failure in 52% of the tested cases, recommending evaluation within 24 to 48 hours instead of immediate emergency department care. Conversely, the tool correctly identified and recommended emergency care for conditions such as stroke and anaphylaxis.
The research also highlighted the impact of contextual information on ChatGPT Health’s assessments. When presented with scenarios in which family members or friends downplayed a patient’s symptoms (a test of the model’s susceptibility to anchoring bias), the AI’s triage recommendations shifted significantly toward less urgent care, with an odds ratio of 11.7 (95% confidence interval 3.7-36.6), meaning the odds of an under-triaged recommendation were roughly twelve times higher when symptoms were minimized. This suggests the system’s recommendations can be swayed by how symptoms are framed, not only by the symptoms themselves.
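For readers unfamiliar with the statistic, an odds ratio and its confidence interval can be computed from a simple 2×2 table of outcomes. The sketch below uses invented counts chosen so the point estimate happens to land near the reported 11.7; the study’s actual cell counts are not public, so the resulting interval differs from the published one.

```python
import math

# Hypothetical 2x2 table (invented counts, for illustration only):
# rows = symptoms downplayed vs. stated plainly,
# columns = less-urgent recommendation vs. appropriate urgency.
a, b = 35, 13   # downplayed:      less urgent, appropriate
c, d = 9, 39    # stated plainly:  less urgent, appropriate

odds_ratio = (a * d) / (b * c)

# Wald-style 95% CI on the log-odds scale:
# exp(ln(OR) +/- 1.96 * SE), with SE = sqrt(1/a + 1/b + 1/c + 1/d).
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)

print(f"OR = {odds_ratio:.1f}, 95% CI {lo:.1f}-{hi:.1f}")
# -> OR = 11.7, 95% CI 4.4-30.6 (for these invented counts)
```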
Concerningly, the AI’s response to indications of suicidal ideation proved erratic. Crisis intervention messages were not consistently activated and, counterintuitively, were more likely to appear when patients described no specific method of self-harm than when they described one. This inconsistency raises serious questions about the tool’s reliability as a mental health resource.
While the study found no statistically significant effects related to patient race, gender, or barriers to care, researchers cautioned that the confidence intervals were wide enough that clinically meaningful disparities could not be ruled out; a null result of this kind reflects uncertainty rather than demonstrated fairness. Further investigation is needed to determine whether these factors influence the AI’s triage decisions.
“LLMs have turned into patients’ first stop for medical advice — but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm,” said Isaac S. Kohane, M.D., Ph.D., chair of the Department of Biomedical Informatics at Harvard Medical School, in a statement released by Mount Sinai. He emphasized the need for independent evaluation of AI triage systems, particularly given the technology’s widespread adoption. OpenAI says approximately 40 million users currently rely on ChatGPT for healthcare purposes, and that roughly a quarter of the service’s 800 million regular users pose a healthcare-related question each week.
Ashwin Ramaswamy, M.D., lead author of the study and an instructor of urology at the Icahn School of Medicine at Mount Sinai, said the research was motivated by patients’ increasing reliance on these tools. OpenAI launched ChatGPT Health with the stated goal of providing patients with personalized medical advice, and the platform allows users to upload their digital health records.
Mount Sinai has not announced any further studies, and OpenAI has not yet responded to requests for comment regarding the findings.