Hallucinated guidelines raise clinical concerns
Large language models may inconsistently cite, omit, or even fabricate clinical guidelines in diagnostic outputs, raising important questions about their readiness for use in clinical decision support, according to new research published in BMJ Health & Care Informatics.
The study, led by Dr. Robin van Kessel and colleagues, examined whether two commercially available large language models (LLMs), GPT-4.1 and DeepSeek-V3, correctly incorporated clinical guidelines when generating diagnostic outputs.
Rather than assessing whether the models reached the right diagnosis, the researchers focused on a narrower but clinically significant question: Did the models cite existing guidelines, omit them, or produce “hallucinated” guideline references?
“Diagnostic accuracy is a common outcome metric used in existing publications. However, from a clinical governance perspective, it's equally important to understand why an LLM comes to the right conclusion as that an LLM comes to the right conclusion,” explained Dr. van Kessel, leader of the LSE Health Digital unit at The London School of Economics and Political Science, London, UK. “A model that gets the right diagnosis but cites a fabricated or unsuitable or deprecated guideline could possibly lead a clinician down the wrong treatment pathway. Guidelines are also the mechanism through which evidence-based medicine is operationalized in practice, so if LLMs can't reliably surface them, that undermines one of the key value propositions for clinical decision support.”
Existing, Incomplete, or Hallucinated Guidelines
To test the two LLMs, the team created simulated outpatient case vignettes for hypercholesterolaemia and type 2 diabetes mellitus. Each vignette contained identical clinical information, while sociodemographic characteristics varied by sex, ethnicity, and location. Locations used were London, UK, and Rochester, US, allowing the researchers to examine whether geography also influenced the models’ use of guideline evidence.
The investigators generated a total of 13,824 outputs across both models and diseases. After excluding responses that did not follow the required format, 12,197 outputs were included in the final analysis. Clinical guideline citations were then reviewed and grouped into three categories: existing clinical guidelines, incomplete or semantic variations of real guidance, and hallucinated guidelines for which no corresponding source could be identified.
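The screening arithmetic and the three-way grouping described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the label names, the extra "none" label for outputs without a citation, the one-label-per-output simplification, and the toy data are all assumptions made for the example. Only the two totals come from the article.

```python
from collections import Counter

# Totals reported in the article.
TOTAL_OUTPUTS = 13_824
ANALYZED_OUTPUTS = 12_197

# Outputs dropped for not following the required format.
excluded = TOTAL_OUTPUTS - ANALYZED_OUTPUTS
retained_share = ANALYZED_OUTPUTS / TOTAL_OUTPUTS

# Hypothetical labels mirroring the article's three categories,
# plus "none" for an output that cites no guideline (an assumption).
CATEGORIES = ("existing", "incomplete", "hallucinated", "none")


def category_prevalence(labels):
    """Return the share of outputs carrying each category label."""
    if not labels:
        raise ValueError("no labels supplied")
    counts = Counter(labels)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return {category: counts[category] / len(labels) for category in CATEGORIES}


# Toy labels invented for illustration; these are NOT study data.
toy_labels = ["existing"] * 5 + ["incomplete"] * 2 + ["hallucinated"] + ["none"] * 2

print(f"excluded outputs: {excluded}")
print(f"retained share: {retained_share:.1%}")
print(category_prevalence(toy_labels))
```

With the reported totals, 1,627 outputs were excluded and roughly 88% were retained for analysis.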
GPT-4.1 omitted existing clinical guidelines in 54% of hypercholesterolaemia outputs and 31% of type 2 diabetes outputs. DeepSeek-V3 omitted existing guidelines far more often, with omission rates of 97% and 98%, respectively.
GPT-4.1 was more likely to cite existing guidance but, more worryingly, it also generated hallucinated guidelines. Hallucination rates were 7% for hypercholesterolaemia and 8% for type 2 diabetes. Conversely, DeepSeek-V3 produced no hallucinated guideline citations in the analyzed outputs, but this came with very low rates of accurate guideline citation.
“Hallucinations happened in various shapes and sizes, but one of the more alarming hallucinations was the mixture of type-1 and type-2 diabetes guidelines in the UK, where certain LLM outputs suggested screening practices from type-1 diabetes care for a case built around type-2 diabetes/prediabetes,” said Dr. van Kessel. “Other kinds of hallucinated guidelines referred to authoritative titles that, when searched, simply yielded no results. These references can often seem plausible enough to pass the eye-test and be acted upon if not approached with the necessary vigilance.”
Geographic Location and LLM Output
Another notable finding was the effect of patient location on LLM output. While sex and ethnicity generally had limited impact on guideline citation patterns, location strongly influenced model behavior. For GPT-4.1, the prevalence of existing guideline citation in hypercholesterolaemia outputs was much lower for UK-focused vignettes than for US-focused vignettes. In type 2 diabetes, UK-focused GPT-4.1 outputs also showed a particularly high hallucination prevalence, approaching 20%.
To explain this geographic discrepancy, van Kessel hypothesized that the training data and model weighting used in GPT-4.1 were potentially US-centric. He noted that this explanation is speculative, because the training data are not publicly disclosed. However, “if a model is trained to favor specific localities and the policies and practices of that locality, then it would stand to reason that it is less sensitive to the specific clinical practices of another country,” he added.
Van Kessel observed that another explanation for these differences might lie in the way LLMs interpret the training data they are given. He pointed to type 2 diabetes care in the US (which has consolidated guidelines for type 1 and type 2 diabetes) versus the UK (which has separate guidelines for each type of diabetes) as an example: “An LLM does not construct sentences using words and their inherent meaning, but using tokens and their most likely proximity,” he explained. “When you then have two guidelines that are semantically very similar but have a very different underlying meaning, you can start to see how it's clear for human clinicians that type-1 and type-2 diabetes guidelines refer to very different clinical entities, but that it is not obvious for an LLM.”
Overall, the study argues that these patterns demonstrate the “stochastic nature” of LLMs: Identical prompts can generate variable responses, even under controlled conditions. This variability challenges conventional evaluation methods that rely on single-prompt, single-output testing.
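One way to see why single-output testing falls short is to treat each response to an identical prompt as a random draw and estimate a prevalence with an uncertainty interval. The sketch below is illustrative only and is not the study's statistical method: `sample_flags` is a hypothetical stand-in for calling a model repeatedly, the 0.08 rate is an arbitrary illustrative value, and the Wilson score interval is simply one common choice for a binomial proportion.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point drift at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def sample_flags(n, rate, seed=0):
    """Hypothetical stand-in for n model calls with the same prompt.

    Each entry is True when the simulated output would be flagged as
    containing a hallucinated guideline citation.
    """
    rng = random.Random(seed)
    return [rng.random() < rate for _ in range(n)]


# Illustrative assumption: 500 repeated runs at an underlying rate of 0.08.
flags = sample_flags(n=500, rate=0.08)
low, high = wilson_interval(sum(flags), len(flags))
print(f"observed prevalence: {sum(flags) / len(flags):.3f}")
print(f"approx. 95% interval: {low:.3f} to {high:.3f}")

# A single clean output says very little about the underlying rate.
print(wilson_interval(0, 1))
```

With one output that shows no hallucination, the interval still stretches from 0 to roughly 0.79, whereas hundreds of repeated runs narrow it to a few percentage points.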
For clinicians, the message is clear: LLM-generated clinical reasoning should not be assumed to be guideline-grounded simply because it appears authoritative.
“If a clinician is going to use an LLM as a support tool right now, they need to independently verify every guideline citation, just as they would with any unfamiliar source,” noted Dr. van Kessel. “Technological developments are at an all-time high, and it can be hard to keep up. However, the role of clinicians to share their voices has never been more important. Meaningful innovation in the AI space will only happen when innovators have a very clear message from the end-user (i.e., the clinician) on what would make for a meaningful application of an LLM.”
The authors did not report any disclosures or conflicts of interest.
AACE Endocrine AI is published by Conexiant under a license arrangement with the American Association of Clinical Endocrinology, Inc. (AACE®). The ideas and opinions expressed in AACE Endocrine AI do not necessarily reflect those of Conexiant or AACE.