Some doctors see LLMs as a boon for medical literacy. The average patient might struggle to navigate the vast landscape of online medical details—and, in particular, to distinguish high-quality sources from polished but factually dubious websites—but LLMs can do that job for them, at least in theory. Treating patients who had searched for their symptoms on Google required “a lot of attacking patient anxiety [and] reducing misinformation,” says Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist. But now, he says, “you see patients with a college education, a high school education, asking questions at the level of something an early med student might ask.”
The release of ChatGPT Health, and Anthropic’s subsequent announcement of new health integrations for Claude, indicate that the AI giants are increasingly willing to acknowledge and encourage health-related uses of their models. Such uses certainly come with risks, given LLMs’ well-documented tendencies to agree with users and make up information rather than admit ignorance.
But those risks also have to be weighed against potential benefits. There’s an analogy here to autonomous vehicles: When policymakers consider whether to allow Waymo in their city, the key metric is not whether its cars are ever involved in accidents but whether they cause less harm than the status quo of relying on human drivers. If Dr. ChatGPT is an improvement over Dr. Google—and early evidence suggests it might be—it could lessen the enormous burden of medical misinformation and unnecessary health anxiety that the internet has created.
Pinning down the effectiveness of a chatbot such as ChatGPT or Claude for consumer health, though, is tricky. “It’s exceedingly difficult to evaluate an open-ended chatbot,” says Danielle Bitterman, the clinical lead for data science and AI at the Mass General Brigham health-care system. Large language models score well on medical licensing examinations, but those exams use multiple-choice questions that don’t reflect how people use chatbots to look up medical information.
Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, attempted to close that gap by evaluating how GPT-4o responded to licensing exam questions when it did not have access to a list of possible answers. Medical experts who evaluated the responses scored only about half of them as entirely correct. But multiple-choice exam questions are designed to be tricky enough that the answer options don’t entirely give the answer away.