How Good Are AI ‘Clinicians’ at Medical Conversations?

Researchers design a more realistic test to evaluate AI’s clinical communication skills


At a glance:

  • Researchers design a new way to more reliably evaluate AI models’ ability to make clinical decisions in scenarios that closely mimic real-life patient interactions.
  • The analysis finds that large language models excel at making diagnoses from exam-style questions but struggle to do so from conversational notes.
  • The researchers propose a set of guidelines to optimize AI tools’ performance and align them with real-world practice before integrating them into the clinic.

Artificial-intelligence tools such as ChatGPT have been touted for their promise to alleviate clinician workload by triaging patients, taking medical histories, and even providing preliminary diagnoses. These tools, known as large language models, are already being used by patients to make sense of their symptoms and medical test results.

But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world?

Not that great, according to the findings of a new study led by researchers at Harvard Medical School and Stanford University.

For their analysis, published Jan. 2 in Nature Medicine, the researchers designed an evaluation framework — or a test — called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and deployed it on four large language models to see how well they performed in settings closely mimicking actual interactions with patients.

All four large language models did well on medical exam-style questions, but their performance worsened when they engaged in conversations that more closely mimicked real-world interactions.

This gap, the researchers said, underscores a two-fold need: First, to create more realistic evaluations that better gauge the fitness of clinical AI models for use in the real world and, second, to improve the ability of these tools to make diagnoses based on more realistic interactions before they are deployed in the clinic.

Evaluation tools like CRAFT-MD, the research team said, can not only assess AI models more accurately for real-world fitness but also help optimize their performance in the clinic.

“Our work reveals a striking paradox — while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor’s visit,” said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics in the Blavatnik Institute at HMS. “The dynamic nature of medical conversations — the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms — poses unique challenges that go far beyond answering multiple-choice questions. When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy.”

Read full article in HMS News