
Selective LLM use may improve electronic health record phenotyping accuracy 

May 13, 2026 By Matthew Solan

An uncertainty-guided strategy that applied large language models selectively, rather than to every patient, improved electronic health record-based phenotyping for diabetes mellitus and peripheral arterial disease, according to a study published in the Journal of the American Medical Informatics Association.

Researchers developed a framework that combined structured electronic health record (EHR) data with targeted analysis of unstructured clinical notes using retrieval-augmented generation (RAG). Instead of applying large language model (LLM) review to all patients, the system first used an ensemble of machine learning models trained on structured EHR data to identify patients at the highest risk for phenotype misclassification. Only those flagged patients underwent LLM-based chart review using GPT-4o.

The retrospective study, led by Dylan Owens, PhD, and colleagues, included 3,384 patients from the Society of Thoracic Surgeons Adult Cardiac Surgery Database for model development; internal validation was performed in 2,032 patients from the Get With The Guidelines–Stroke Program, and external validation in 1,983 patients from an independent health system. 

The structured-data ensemble incorporated L2-regularized logistic regression, L1-regularized logistic regression, random forest, k-nearest neighbors, and gradient boosting models. A downstream triage model estimated misclassification risk using prediction uncertainty and disagreement among the base learners. 
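As a rough illustration of what "prediction uncertainty and disagreement among the base learners" could look like in practice, the sketch below computes three simple signals from one patient's base-learner probabilities. The function name `triage_features`, the specific signals (entropy of the mean, standard deviation, vote split), and the toy probabilities are assumptions for illustration; the article does not specify which features the study's triage model used.

```python
import math
from statistics import mean, pstdev

def triage_features(probs):
    """Summarize one patient's base-learner probabilities (hypothetical helper).

    probs: list of predicted phenotype probabilities, one per base learner.
    """
    p_mean = mean(probs)
    # Binary entropy of the ensemble mean: largest when the mean is near 0.5.
    p = min(max(p_mean, 1e-12), 1 - 1e-12)
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    # Disagreement as the spread of the base learners' probabilities.
    spread = pstdev(probs)
    # Disagreement as the share of learners voting with the minority at 0.5.
    positive_votes = sum(q >= 0.5 for q in probs) / len(probs)
    vote_split = min(positive_votes, 1 - positive_votes)
    return {"mean": p_mean, "entropy": entropy,
            "spread": spread, "vote_split": vote_split}

# Toy probabilities from five base learners (not study data).
confident = triage_features([0.95, 0.90, 0.97, 0.92, 0.96])
borderline = triage_features([0.55, 0.30, 0.70, 0.45, 0.60])
print(confident)
print(borderline)
```

In this toy case the borderline patient scores higher on all three signals, which is the kind of pattern a downstream triage model could learn to associate with structured-model error.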

For diabetes mellitus, researchers flagged the 15% of patients with the highest predicted risk of structured-model error for LLM augmentation. In the internal validation cohort, selective augmentation improved sensitivity from 81% to 90% while maintaining specificity at approximately 93%. Accuracy improved from 88% to 92%, and the F1 score increased from 0.83 to 0.87. 
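The selection step described here can be sketched in a few lines: rank patients by predicted error risk, flag the top fraction, and let an LLM-derived label replace the structured-model label only for flagged patients. Everything in the block (function names, the 10-patient toy cohort, the risk scores, and the LLM labels) is hypothetical and chosen only to show the mechanics, not to reproduce the study's results.

```python
import math

def flag_top_fraction(risk_scores, fraction=0.15):
    """Return indices of the patients with the highest predicted error risk."""
    n_flag = math.ceil(fraction * len(risk_scores))
    ranked = sorted(range(len(risk_scores)),
                    key=lambda i: risk_scores[i], reverse=True)
    return set(ranked[:n_flag])

def metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and F1 for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {"sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "accuracy": (tp + tn) / len(y_true),
            "f1": 2 * tp / (2 * tp + fp + fn)}

# Toy cohort (not study data): reference labels, structured-model labels,
# and a triage risk score per patient.
y_true     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
structured = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
risk       = [0.05, 0.10, 0.90, 0.80, 0.05, 0.10, 0.10, 0.20, 0.10, 0.30]

flagged = flag_top_fraction(risk, fraction=0.15)
# Hypothetical LLM chart-review labels, produced only for flagged patients.
llm_labels = {2: 1, 3: 1}
final = [llm_labels[i] if i in flagged else structured[i]
         for i in range(len(structured))]

before = metrics(y_true, structured)
after = metrics(y_true, final)
print(before)
print(after)
```

Because unflagged patients keep their structured-model label, specificity in this toy cohort is unchanged while sensitivity rises, mirroring the direction of the effect reported above.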

The triage system concentrated most structured-model errors within the flagged subgroup. Among triage-selected patients, structured-model misclassification rates reached 76% in the internal validation cohort and 71% in the external validation cohort. Notably, all structured-model false negatives were contained within the triage-flagged subset. 

Selective augmentation also improved reclassification performance for diabetes. In the holdout testing subset, the net reclassification improvement was 0.08, and the integrated discrimination improvement was 0.08, driven primarily by improved identification of patients with diabetes. Reclassified patients were correctly reassigned in 88% of cases. 
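For readers unfamiliar with the metric, the sketch below computes a categorical net reclassification improvement for binary labels: the net share of true cases moved in the correct direction plus the net share of non-cases moved in the correct direction. The article does not state which NRI variant the study used, so this is one common definition applied to toy labels, not the authors' calculation.

```python
def net_reclassification_improvement(y_true, old_pred, new_pred):
    """Categorical NRI for binary labels (1 = phenotype present)."""
    events = [(o, n) for t, o, n in zip(y_true, old_pred, new_pred) if t == 1]
    nonevents = [(o, n) for t, o, n in zip(y_true, old_pred, new_pred) if t == 0]
    # For true cases, moving from 0 to 1 is a correct reclassification.
    up_events = sum(n > o for o, n in events) / len(events)
    down_events = sum(n < o for o, n in events) / len(events)
    # For non-cases, moving from 1 to 0 is a correct reclassification.
    up_nonevents = sum(n > o for o, n in nonevents) / len(nonevents)
    down_nonevents = sum(n < o for o, n in nonevents) / len(nonevents)
    return (up_events - down_events) + (down_nonevents - up_nonevents)

# Toy labels (not study data): two missed cases are recovered, nothing else moves.
y_true   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
old_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
new_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
nri = net_reclassification_improvement(y_true, old_pred, new_pred)
print(nri)
```

With binary labels, the event term equals the change in sensitivity and the non-event term equals the change in specificity, so an improvement that comes almost entirely from recovered cases shows up in the first term.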

The benefits were especially pronounced for peripheral arterial disease (PAD), a phenotype frequently undercoded in structured EHR data. In the PAD holdout cohort, the structured-data-only model achieved 12% sensitivity and 99% specificity. After selective augmentation, sensitivity increased to 97%, specificity remained at 99%, and accuracy improved from 90% to 99%. 

Only 72 patients—approximately 10% of the PAD cohort—required LLM analysis. Among those flagged patients, 96% represented structured-model misclassifications and 93% were false negatives. Selective augmentation correctly reclassified 97% of reassigned PAD cases, with a net reclassification improvement of 0.86. 

Little Benefit to All-Patient LLM Review

Researchers reported that applying LLM review to all patients provided little additional benefit compared with selective augmentation while substantially increasing computational cost. LLM analysis required approximately 40 to 60 seconds per patient and cost an estimated $0.05 to $0.14 per case. 
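To put those per-patient figures in context, the sketch below scales them to a cohort. The per-case cost and time ranges come from the article; the 1,000-patient cohort, the 15% flag rate applied to it, the helper name `llm_review_budget`, and the assumption of one-at-a-time processing are illustrative only.

```python
def llm_review_budget(n_patients, cost_range=(0.05, 0.14), seconds_range=(40, 60)):
    """Low/high estimates of cost (USD) and sequential run time (hours)."""
    return {
        "cost_usd": (n_patients * cost_range[0], n_patients * cost_range[1]),
        "hours": (n_patients * seconds_range[0] / 3600,
                  n_patients * seconds_range[1] / 3600),
    }

cohort_size = 1000                     # hypothetical cohort
n_flagged = round(0.15 * cohort_size)  # 150 patients if 15% are flagged

all_patients = llm_review_budget(cohort_size)
selective = llm_review_budget(n_flagged)
print(all_patients)
print(selective)
```

Under these assumptions, reviewing every patient would cost roughly $50 to $140 and take about 11 to 17 hours sequentially, versus roughly $7.50 to $21 and 1.7 to 2.5 hours for the flagged subset.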

The RAG workflow embedded patient notes using the text-embedding-3-small model and analyzed them with GPT-4o in a HIPAA-compliant Microsoft Azure environment. The system retrieved the five most semantically relevant note passages for each query and required the model to generate structured outputs with supporting evidence from the medical record. 
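The retrieval step ("the five most semantically relevant note passages") is commonly implemented as a top-k search by cosine similarity between a query embedding and passage embeddings. The sketch below shows that generic technique with tiny made-up vectors; it makes no API calls, the function names are hypothetical, and the article does not confirm that the study used cosine similarity specifically.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_passages(query_vec, passage_vecs, k=5):
    """Return indices of the k passages most similar to the query."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(passage_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy 3-dimensional "embeddings" (real embedding vectors are much longer).
query = [1.0, 0.0, 0.0]
passages = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [1.0, 0.0, 0.0],
    [0.1, 0.0, 0.9],
    [0.6, 0.0, 0.4],
    [0.0, 0.0, 1.0],
]
top = top_k_passages(query, passages, k=5)
print(top)
```

In a full workflow, the retrieved passages would then be passed to the LLM together with the phenotype question and a required output schema, so that each classification can be traced back to quoted evidence.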

In an exploratory expert review of 138 discordant diabetes classifications, adjudicators favored the LLM-augmented classification over the registry-derived label in 73% of cases.

The investigators noted several limitations, including the need for phenotype-specific feature engineering, dependence on accurate calibration of the structured-data model, and evaluation of PAD in only a single registry. Performance also may vary across health systems with different documentation practices, and ongoing monitoring may be needed as LLM deployment evolves. 

“By reserving unstructured data analysis for patients most likely to benefit, the framework improves case identification, preserves interpretability, and substantially reduces computational burden,” researchers wrote. 

No conflicts of interest were reported. 

 

AACE Endocrine AI is published by Conexiant under a license arrangement with the American Association of Clinical Endocrinology, Inc. (AACE®). The ideas and opinions expressed in AACE Endocrine AI do not necessarily reflect those of Conexiant or AACE. For more information, see Policies.
