General-purpose LLMs outperform commercial medical AI tools
As artificial intelligence tools become increasingly embedded in clinical workflows, an independent evaluation published in Nature Medicine found that several commercial medical AI platforms did not outperform leading general-purpose large language models across standardized benchmarks and real-world physician queries.
“Clinical AI tools may carry institutional legitimacy and are likely safe for routine use, but our results show that they are not superior to frontier models on knowledge, communication, or clinical alignment,” wrote Eric Karl Oermann, MD, and colleagues. “These findings highlight the need for independent, real-world evaluation of AI tools before they enter clinical settings.”
The researchers compared three frontier large language models (LLMs)—GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6—with two clinical AI tools, OpenEvidence and UpToDate Expert AI.
Performance was evaluated across three benchmarks: MedQA, a standardized US Medical Licensing Examination-style assessment of medical knowledge; HealthBench, which measures agreement with expert clinician responses; and Real Clinical Queries (RCQ), a benchmark derived from 100 de-identified physician questions submitted during routine clinical care and evaluated through blinded review by practicing clinicians.
All models performed well on MedQA. Gemini achieved the highest accuracy at 97%, followed by GPT-5.2 at 94%. Claude Opus 4.6 and OpenEvidence each scored 90%, and UpToDate Expert AI scored 88%, reported Dr. Oermann, of the Department of Neurological Surgery, NYU Langone Health, New York, and fellow researchers.
The performance gap widened on HealthBench, which evaluates how closely model responses align with expert clinician expectations, including accuracy, completeness, communication quality, context awareness, and instruction-following. GPT-5.2 achieved the highest score at 88%, followed by Gemini at 79% and Claude at 77%. OpenEvidence and UpToDate Expert AI scored 63% and 61%, respectively.
HealthBench also included responses grouped into seven themes: emergency referrals, context seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, and response depth. Under this analysis, GPT-5.2 ranked first or tied for first across all seven categories, whereas OpenEvidence and UpToDate ranked lowest or tied for lowest.
For the RCQ benchmark, 12 clinicians performed randomized blinded reviews to evaluate responses for clinical correctness, completeness, safety and harm avoidance, and clarity. Each response was independently reviewed by three clinicians.
Frontier models formed a statistically distinct top-performing tier. Gemini achieved the highest aggregate RCQ score at 3.62 on a 4-point scale, followed by GPT-5.2 at 3.54 and Claude at 3.52. Clinical AI tools scored lower, with OpenEvidence receiving 3.24 and UpToDate Expert AI receiving 3.17. Google Search AI Overview, included as a real-world comparator, scored 3.27, comparable to the clinical AI tools.
Differences between models were greater for response clarity than for clinical correctness. OpenEvidence received the lowest clarity ratings, suggesting its weakness was communication and not knowledge. Reviewers also more frequently identified incomplete clinical content, safety-related omissions, and disorganized responses among OpenEvidence and Google AI Overview outputs.
Refusal rates also differed substantially. UpToDate Expert AI declined to answer 19% of clinical queries, compared with 6% for Google AI Overview and 1% to 3% for the frontier models and OpenEvidence. Despite performance differences, rates of harmful responses and hallucinations remained low and did not differ significantly among the systems, with harmful-response rates of up to 3% and hallucination rates of up to 1%.
The researchers cautioned that the results represent “a snapshot of a rapidly evolving landscape rather than a permanent ordering of approaches,” adding that “deeply subspecialized medical tasks may favor more sophisticated, domain-specific adaptation.”
Several limitations were noted. Clinical AI tools lacked public application programming interfaces (APIs) and therefore were accessed through browser interfaces, potentially introducing differences in retrieval behavior, hidden prompting, and response formatting.
The researchers also acknowledged that public benchmarks, such as MedQA and HealthBench, may be susceptible to training-data exposure. In addition, HealthBench was developed by OpenAI, raising the possibility of benchmark-developer overlap that could favor GPT-5.2. Blinded clinician evaluation of the RCQ benchmark represented the study's primary evidence.
Dr. Oermann reported equity in MarchAI and Artisight, spousal employment by Eikon Therapeutics, and consulting relationships with Sofinnova Partners and Google. The remaining researchers reported no competing interests.
AACE Endocrine AI is published by Conexiant under a license arrangement with the American Association of Clinical Endocrinology, Inc. (AACE®). The ideas and opinions expressed in AACE Endocrine AI do not necessarily reflect those of Conexiant or AACE.