How Good Are AI ‘Clinicians’ at Medical Conversations?
Researchers design a more realistic test to evaluate AI’s clinical communication skills
Artificial-intelligence tools such as ChatGPT have been touted for their promise to alleviate clinician workload by triaging patients, taking medical histories, and even providing preliminary diagnoses. These tools, known as large language models, are already being used by patients to make sense of their symptoms and medical test results.
But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world?
Not that great, according to the findings of a new study led by researchers at Harvard Medical School and Stanford University.
For their analysis, published Jan. 2 in Nature Medicine, the researchers designed an evaluation framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and applied it to four large language models to see how well they performed in settings that closely mimic actual interactions with patients.
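To make the difference between the two test formats concrete, here is a minimal Python sketch of what a conversational evaluation harness of this kind might look like. It is a hypothetical illustration, not the authors' code: the function names (query_model, query_patient_agent), the prompts, and the ten-turn limit are all assumptions made for exposition, and the stubbed-out calls would need to be wired to a real model API.

```python
# Illustrative sketch only: a conversational evaluation loop in the spirit
# of CRAFT-MD, contrasted with an exam-style evaluation of the same case.
# All names, prompts, and the turn limit are assumptions, not the study's
# actual implementation.

def query_model(messages: list[dict]) -> str:
    """Stand-in for a call to the language model being evaluated.
    Swap in a real chat-completion API call here."""
    raise NotImplementedError

def query_patient_agent(vignette: str, doctor_question: str) -> str:
    """Stand-in for a simulated patient that answers the doctor model's
    question using only information contained in the case vignette."""
    raise NotImplementedError

def run_exam_style_case(vignette: str, options: list[str]) -> str:
    """Exam-style format: the model sees the full case up front and
    picks from multiple-choice answers."""
    prompt = (f"{vignette}\n\nWhich diagnosis is most likely?\n"
              + "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
              + "\nAnswer with a single letter.")
    return query_model([{"role": "user", "content": prompt}])

def run_conversational_case(vignette: str, max_turns: int = 10) -> str:
    """Conversational format: the model must elicit the history itself,
    one question at a time, before committing to a diagnosis."""
    messages = [{
        "role": "system",
        "content": ("You are a physician taking a patient history. "
                    "Ask one question per turn. When confident, reply "
                    "with 'DIAGNOSIS: <diagnosis>'."),
    }]
    for _ in range(max_turns):
        doctor_turn = query_model(messages)
        messages.append({"role": "assistant", "content": doctor_turn})
        if doctor_turn.strip().upper().startswith("DIAGNOSIS:"):
            return doctor_turn.split(":", 1)[1].strip()
        # The patient agent reveals only what is asked about, so key
        # findings stay hidden unless the model asks the right questions.
        patient_reply = query_patient_agent(vignette, doctor_turn)
        messages.append({"role": "user", "content": patient_reply})
    messages.append({"role": "user",
                     "content": "Please give your final DIAGNOSIS now."})
    return query_model(messages).split(":", 1)[-1].strip()
```

The contrast is the point: in the exam-style function the model receives every relevant finding up front, while in the conversational loop it learns only what it thinks to ask about, which is the kind of setting where the study found performance dropped.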
All four large language models did well on medical exam-style questions, but their performance worsened in conversations that more closely resembled real-world clinical interactions.
This gap, the researchers said, underscores a twofold need: first, to create more realistic evaluations that better gauge the fitness of clinical AI models for real-world use and, second, to improve these tools' ability to make diagnoses from more realistic interactions before they are deployed in the clinic.
Evaluation tools like CRAFT-MD, the research team said, could not only assess AI models more accurately for real-world readiness but also help optimize their performance in the clinic.
“Our work reveals a striking paradox — while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics in the Blavatnik Institute at HMS. “The dynamic nature of medical conversations — the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms — poses unique challenges that go far beyond answering multiple-choice questions. When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy.”