
AI Tutorial Series Part 3: How AI models learn (and overlearn)

Training, validation, and the art of not fooling yourself.

June 05, 2026 By David Toro-Tobon, MD 9 min read

In our previous installment of the AI Tutorial Series, we established a core premise: A solitary performance metric is a poor proxy for clinical utility. An algorithm boasting 95% accuracy in detecting adrenocortical carcinoma is clinically useless if it achieves that number simply by predicting "benign" for every patient in a heavily imbalanced dataset.

As the medical community adopts these tools, whether deploying computer vision models for thyroid nodule stratification or fine-tuning modern Large Language Models (LLMs) for clinical decision support, a more existential question arises: Are the metrics we are looking at actually real?

When an AI system is evaluated, the resulting numbers are entirely contingent upon the structural integrity of the data used to generate them. The distinction between a model that has genuinely learned the underlying pathophysiology of a disease and one that has merely memorized its training dataset is the defining boundary between a safe clinical tool and a profound risk to patient safety.

The Medical Education Analogy

To appraise an AI model, it helps to map how a supervised machine learning algorithm acquires knowledge onto the rigorous training of a medical resident. Supervised learning operates by connecting input features, like the pixels of a thyroid ultrasound or a sequence of laboratory values, to known target labels, such as "benign" or "malignant".

The Training Dataset: This is the resident’s textbook. By studying thousands of classic case presentations, the algorithm learns to associate specific clinical features with the correct disease state.

The Validation Dataset: These are the in-service practice exams. They evaluate whether the algorithm has grasped diagnostic principles or is merely recalling the textbook's exact phrasing. If it fails, the developers adjust the model's learning strategy (tuning its "hyperparameters").

The Test Dataset: This is the medical board examination, consisting of entirely novel clinical scenarios. The score on this final, unseen exam is the only trustworthy measure of competence, and only if the exam is taken once, after all tuning is finished.

If the board examiners inadvertently include questions from the practice exams on the final test, an exceptionally high score reflects an administrative error rather than genuine clinical acumen. In the computational realm, if an AI model is evaluated on data it has already seen, the resulting metrics are actively deceptive.
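The three-way division described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed pipeline: the function name `three_way_split`, the 70/15/15 proportions, and the assumption that each case ID belongs to a different patient are all hypothetical choices made for the example.

```python
import random

def three_way_split(case_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test sets.

    Assumes every case ID comes from a different patient (an illustrative
    simplification).
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_frac)
    n_val = round(len(ids) * val_frac)
    test_ids = ids[:n_test]                   # the board exam: locked away
    val_ids = ids[n_test:n_test + n_val]      # the practice exams: used for tuning
    train_ids = ids[n_test + n_val:]          # the textbook
    return train_ids, val_ids, test_ids

train_ids, val_ids, test_ids = three_way_split(range(1000))

# The "board exam" must share no cases with the textbook or the practice exams.
overlap = set(test_ids) & (set(train_ids) | set(val_ids))
print(len(train_ids), len(val_ids), len(test_ids), len(overlap))
```

An empty overlap is the computational equivalent of confirming that no practice questions made it onto the board exam.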

The Raw Material: Size, Imbalance, and Spectrum Bias

Before an algorithm begins to learn, the structural integrity of its training data determines its ceiling of competence. Even the most sophisticated neural network cannot overcome fundamental flaws in the data it is fed.

Dataset size: Modern deep learning models can have millions of internal parameters. If the dataset is too small relative to that capacity, the model tends to memorize the data rather than learn generalizable patterns. Robust medical imaging models often require tens of thousands of annotated examples to genuinely learn complex physiological variations.

Class imbalance: Medical datasets are notoriously skewed. Consider a diabetes prediction model trained on a cohort in which 95% of individuals are non-diabetic. Because the algorithm's goal during training is to minimize its overall error rate, it can achieve a spectacular 95% training accuracy simply by predicting "non-diabetic" for every patient. The metrics look fantastic on paper, but the model has a sensitivity of zero. Developers must intervene with techniques such as synthetic oversampling of the minority class or class weighting to force the algorithm to pay attention to the rare cases.

Spectrum bias: A model can only recognize the pathology it has been explicitly taught. If an AI designed for thyroid nodule risk stratification is trained exclusively on unequivocally benign cysts and classic, highly suspicious papillary thyroid carcinomas, it will suffer from severe spectrum bias. When presented with the "gray zone" of endocrinology, such as Bethesda III or IV nodules, the model will be forced to blindly categorize the indeterminate lesion into one of the extreme buckets it knows.
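The arithmetic behind the class-imbalance example is easy to verify. The sketch below builds a hypothetical cohort of 1,000 patients with a 5% diabetes prevalence and scores a "model" that labels everyone non-diabetic. The cohort size and the helper name `accuracy_and_sensitivity` are assumptions made for illustration.

```python
def accuracy_and_sensitivity(y_true, y_pred):
    """Return (accuracy, sensitivity) for binary labels where 1 = disease present."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    # Sensitivity is undefined when the cohort contains no positive cases.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, sensitivity

# Hypothetical cohort: 50 diabetic patients (label 1), 950 non-diabetic (label 0).
y_true = [1] * 50 + [0] * 950
# A degenerate "model" that predicts non-diabetic for every patient.
y_pred = [0] * 1000

accuracy, sensitivity = accuracy_and_sensitivity(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}")
# prints "accuracy=0.95, sensitivity=0.00"
```

Reporting sensitivity alongside accuracy is what exposes the failure: all 50 diabetic patients are missed.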

The Generalization Problem: Underfitting and Overfitting

The ultimate goal of any predictive model is generalization, the ability to perform accurately on new, unseen data drawn from the broader clinical population. When models fail to generalize, they typically succumb to either underfitting or overfitting.

Underfitting occurs when a model is too simplistic to capture the complexity of the clinical data. Returning to our analogy, an underfitted model is like the resident who barely studied and fails both the practice exams and the final boards.

Overfitting represents a far more prevalent danger in modern medical AI. It occurs when an algorithm becomes excessively complex, modeling not just the underlying pathological signal but also the random statistical noise and institutional artifacts present in the training data. The overfitted resident has memorized the exact wording of every textbook case but lacks a conceptual understanding of human physiology; they achieve a perfect score when quizzed on the textbook but fail when an exam presents the same disease in a slightly different context (see Figure 1 below).

In endocrinology, deep neural networks are remarkably susceptible to overfitting. If a computer vision model is trained on images in which all malignant nodules were scanned with a specific high-end ultrasound machine, the model may ignore biological features entirely. Instead, it learns to identify the subtle resolution signatures, dark borders, or text annotations unique to that scanner. When deployed in a novel clinic with different hardware, its predictive capability evaporates. It learned the institutional workflow, not the pathology (See Figure 2 below).
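The gap between memorizing and generalizing can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not a clinical model: it invents a single feature between 0 and 1, a true rule ("malignant if above 0.5"), and 20% label noise, then compares a model that memorizes every training case (a 1-nearest-neighbor lookup) with a model that learns one cut-off. All names and numbers are hypothetical.

```python
import random

rng = random.Random(42)

def make_cohort(n, noise=0.2):
    """Synthetic cohort: one feature in [0, 1]; the true rule is 'label 1 if x > 0.5',
    but a fraction of labels is flipped to mimic noise and artifacts."""
    cohort = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        cohort.append((x, label))
    return cohort

def accuracy(predict, cohort):
    return sum(predict(x) == y for x, y in cohort) / len(cohort)

train, test = make_cohort(400), make_cohort(1000)

# Overfitted model: copies the label of the closest training case, noise included.
def memorizer(x):
    return min(train, key=lambda case: abs(case[0] - x))[1]

# Simpler model: a single cut-off chosen to minimize training error.
cutoff = min((c / 100 for c in range(101)),
             key=lambda c: sum(int(x > c) != y for x, y in train))

def threshold_rule(x):
    return int(x > cutoff)

for name, model in [("memorizer", memorizer), ("threshold rule", threshold_rule)]:
    print(f"{name}: train={accuracy(model, train):.2f}, test={accuracy(model, test):.2f}")
```

The memorizer reproduces its own training labels perfectly, noise and all, yet on new cases it falls behind the simpler rule. A large gap between training and held-out performance is the practical signature of overfitting.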

Appropriate Validation: Matching Method to Scale

The strategy used to validate an algorithm must be strictly tailored to the dataset's size and nature. When working with massive, multi-institutional databases, permanently locking away a robust portion of the cohort as an independent test set is often the most appropriate and straightforward way to demonstrate generalizability.

However, high-quality, annotated medical data is frequently scarce. In these instances, permanently locking away 20% of a limited cohort can starve the training process of critical examples. To ensure the model's resulting performance isn't merely a statistical anomaly based on a single "lucky" split of this limited data, developers employ techniques like cross-validation.

In k-fold cross-validation, the development dataset is randomly partitioned into k roughly equal-sized subsets, or folds. The model is trained k times, each time holding out a different fold as the validation set and training on the remaining k - 1 folds; the k validation scores are then averaged. The clinical equivalent is evaluating a resident across 5 different clinical environments rather than basing their entire evaluation on a single month in 1 clinic. Averaging performance across diverse settings provides a far more reliable estimate of the ability to generalize when data is limited.
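The fold rotation can be sketched as follows. This is a minimal, standard-library illustration: `k_fold_indices` and `cross_validate` are hypothetical helper names, and `fit_and_score` stands in for whatever training and scoring routine a developer would actually supply.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, val_idx

def cross_validate(n_samples, fit_and_score, k=5):
    """Average a caller-supplied score over k folds.

    fit_and_score(train_idx, val_idx) is a hypothetical callback that trains a
    fresh model on train_idx and returns its score on val_idx.
    """
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores

# Placeholder callback so the sketch runs; it only reports the validation fold size.
mean_score, fold_scores = cross_validate(100, lambda tr, va: len(va), k=5)
print(mean_score, fold_scores)
```

The spread of the per-fold scores is as informative as their mean: a wide spread suggests that any single split could have been "lucky" or "unlucky". When a dataset contains several images per patient, the folds would need to be formed over patients rather than images.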

Data Leakage: The Ultimate Contaminant

If overfitting is a failure of generalization, data leakage is a structural failure of the experiment itself. Leakage occurs when information from outside the training dataset, most damagingly from the test set, inadvertently permeates the training process. This grants the model an unfair advantage, leading to spectacularly high performance during development followed by devastating failures in real-world deployment.

In the complex data ecosystems of medical imaging, leakage frequently manifests as "patient-level leakage." When an endocrinologist performs a thyroid ultrasound, they generate multiple data points per nodule, including dynamic cine-clips containing hundreds of sequential frames. A single patient might contribute dozens of distinct images of the same nodule to a database.

If a developer randomly splits this database at the image level, allocating 80% of all pooled images to the training set and 20% to the test set, catastrophic leakage is guaranteed. Frame 1 of Patient A's nodule will likely go to the training set, while Frame 2 goes to the test set.

During training, the model need not learn the sonographic hallmarks of malignancy. It can instead learn to recognize Patient A's specific anatomical features, the surrounding strap muscles, or the precise angle of the ultrasound probe. When evaluated on the test set, the model easily identifies Frame 2 because it has already memorized the background architecture of Frame 1. To preserve the integrity of the evaluation, data partitioning must be strictly executed at the patient level, ensuring the algorithm evaluates pathologies in anatomies it has never encountered.
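The difference between the two partitioning strategies can be checked directly. The sketch below assumes a hypothetical database of 50 patients with 30 frames each, stored as `(patient_id, frame_number)` pairs; `patient_level_split` is an illustrative helper, not a reference implementation.

```python
import random
from collections import defaultdict

def patient_level_split(frames, test_frac=0.2, seed=0):
    """Assign whole patients, never individual frames, to training or test."""
    by_patient = defaultdict(list)
    for patient_id, frame_no in frames:
        by_patient[patient_id].append((patient_id, frame_no))
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_frac))
    test = [f for p in patients[:n_test] for f in by_patient[p]]
    train = [f for p in patients[n_test:] for f in by_patient[p]]
    return train, test

def shared_patients(train, test):
    """Patients who contribute frames to both sets (should be empty)."""
    return {p for p, _ in train} & {p for p, _ in test}

# Hypothetical database: 50 patients, 30 cine-clip frames per patient.
frames = [(f"P{p:02d}", f) for p in range(50) for f in range(30)]

# Leaky approach: shuffle the pooled frames and cut 80/20 at the image level.
pooled = frames[:]
random.Random(0).shuffle(pooled)
cut = len(pooled) * 4 // 5
leaky = shared_patients(pooled[:cut], pooled[cut:])

# Safer approach: split by patient.
train, test = patient_level_split(frames)
clean = shared_patients(train, test)

print(f"image-level split: {len(leaky)} patients appear on both sides")
print(f"patient-level split: {len(clean)} patients appear on both sides")
```

Counting the patients who appear on both sides of a split is a quick audit that can be requested of any developer or vendor; the only acceptable answer is zero.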

The Art of Critical Skepticism

High accuracy rates and impressive AUC scores are devoid of clinical meaning if a model was permitted to overfit to statistical noise, or if its test data was contaminated through leakage.

The art of not fooling oneself in machine learning requires absolute methodological discipline: enforcing strict patient-level data partitioning, employing appropriate validation strategies matched to the dataset's scale, ensuring the training data represents the true clinical spectrum, and subjecting the algorithm to the unforgiving crucible of external validation. Clinicians evaluating these tools must transcend the superficial appeal of vendor-supplied accuracy metrics.

By understanding the mechanics of how models learn, medical professionals can ask the probing questions required to expose methodological flaws and safely integrate genuinely validated algorithms into patient care.

Figure 1: The Lifecycle of an Artificial Intelligence Model and the Overfitting Threshold. This learning curve illustrates the relationship between training duration (algorithm learning iterations) and diagnostic error rate. During the initial phase (Underfitting), the model has not yet captured the underlying pathophysiology, resulting in high error rates on both the training data (blue line) and validation data (red line). At the optimal stopping point (The Sweet Spot), the model achieves maximum clinical generalizability. If training continues beyond this threshold (Overfitting), the model begins to memorize statistical noise, patient-specific artifacts, or institutional nuances present in the training data. Consequently, its performance on novel validation data degrades, exposing the algorithm's inability to generalize to new clinical populations.

Figure 2: The Impact of Model Complexity on Decision Boundaries. Two-dimensional scatter plots representing synthetic clinical data (e.g., sonographic features of thyroid nodules), categorized as benign (red) or malignant (blue). The background shading indicates the algorithm's predictive classification boundary. (A) Underfitting: An overly simplistic model draws a rigid, linear boundary that fails to capture the true anatomical or physiological variance, leading to numerous misclassifications. (B) Appropriate Fit: A properly tuned model establishes a smooth, generalized boundary that accurately separates the core pathologies while ignoring minor statistical outliers, ensuring robust performance on novel data. (C) Overfitting: An excessively complex model generates an erratic boundary to perfectly encapsulate every training data point, including noise and measurement artifacts. This model lacks generalizability and will perform poorly in real-world clinical deployment.

AACE Endocrine AI is published by Conexiant under a license arrangement with the American Association of Clinical Endocrinology, Inc. (AACE®). The ideas and opinions expressed in AACE Endocrine AI do not necessarily reflect those of Conexiant or AACE. For more information, see Policies.
