Key metrics for evaluating AI models: AI Tutorial Series

You do not need to build AI models to be an excellent evaluator of them.

May 15, 2026 By Editor-in-Chief: Johnson Thomas, MD, FACE, FEAA 18 min read

In Part 1, we opened the black box of artificial intelligence, demystifying machine learning, neural networks, and large language models. We promised that Article 2 would teach you how to evaluate AI tools, because knowing what AI is means very little if you can’t tell good AI from bad AI.

Introduction: The Seductive Lie of “95% Accuracy”

Here is a scenario that should keep you up at night: A shiny new AI tool lands on your desk. The brochure says it is “95% accurate” at detecting thyroid nodules. Everybody is thrilled. The vendor is booking celebratory dinners. They are ready to deploy it tomorrow.

But what does 95% accuracy actually mean? That number alone tells you about as much as a TSH without a free T4 in a patient with a pituitary adenoma: it’s a start, but making clinical decisions based on it alone is a recipe for disaster.

Key Takeaway: A single metric can never capture the full performance of an AI model, just as a single lab value cannot capture a patient’s full clinical picture.

This article will equip you with the vocabulary and conceptual toolkit to critically evaluate AI performance claims. No coding required. No advanced math. Just clear thinking, which, as endocrinologists, you already do every time you interpret a patient’s discordant lab results.

Section 1: Accuracy, The Popularity Contest Winner That Cheats

What Is Accuracy?

Let’s start with the simplest metric. Accuracy is the proportion of all predictions that the model got right:

Accuracy = (Correct Predictions)/(Total Predictions)

Sounds reasonable, right? If a model gets 95 out of 100 cases right, it’s 95% accurate. Pop the champagne.

Not so fast.

The Class Imbalance Problem: Why Accuracy Fails in Medicine

Imagine you build an AI model to detect adrenal incidentalomas that are actually adrenocortical carcinomas (ACC). ACC is rare, roughly 1-2 per million per year. Let’s say in your dataset of 10,000 adrenal incidentalomas, only 50 are ACC.

Now imagine the laziest AI model in the world. It doesn’t even look at the CT images. It just stamps every single case as “benign.” Every. Single. One.

This model’s accuracy? A stunning 99.5% (9,950 out of 10,000 correct).

But this model missed every single cancer. All 50 patients with ACC were told they were fine.

This is the class imbalance problem. When one class (benign) vastly outnumbers another (malignant), a model can achieve spectacular accuracy by simply predicting the majority class every time. In medicine, where rare but serious conditions are exactly what we need AI to catch, accuracy alone is not helpful.
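For readers who like to check the arithmetic, here is a minimal Python sketch of the "lazy model" described above. It uses only the numbers from the example (10,000 incidentalomas, 50 of them ACC); the variable names are ours and purely illustrative.

```python
# Dataset from the example above: 10,000 adrenal incidentalomas, 50 of them ACC.
labels = [1] * 50 + [0] * 9950        # 1 = ACC, 0 = benign
predictions = [0] * len(labels)       # the lazy model stamps every case "benign"

# Accuracy = correct predictions / total predictions
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# How many of the actual cancers did the model flag?
cancers_caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

print(f"Accuracy: {accuracy:.1%}")
print(f"Cancers caught: {cancers_caught} of {sum(labels)}")
```

The accuracy figure looks superb while the count of cancers caught is zero, which is the whole problem in two lines of output.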

When Accuracy Is Actually Useful

Accuracy is not entirely useless. When your classes are roughly balanced, accuracy gives a reasonable first impression. One example is subtyping confirmed primary aldosteronism into unilateral aldosterone-producing adenoma (APA) versus bilateral idiopathic hyperaldosteronism (IHA), where adrenal venous sampling data show roughly a 40/60 split between the two subtypes. But the moment you are dealing with rare diseases, screening scenarios, or anything where the stakes of missing a case are high, you need better metrics.

Let’s meet them.

Section 2: The Confusion Matrix, Your Diagnostic Rosetta Stone

Before we can understand the better metrics, we need to understand where predictions can go right and wrong. The confusion matrix is one of the clearest tools for AI evaluation, despite its misleading name.

Think of it this way. Every time an AI model makes a prediction, it falls into one of 4 buckets:

	Actually Positive (Disease)	Actually Negative (No Disease)
AI Predicts Positive	True Positive (TP): Correctly caught the disease	False Positive (FP): False alarm
AI Predicts Negative	False Negative (FN): Missed the disease	True Negative (TN): Correctly ruled out

If this looks familiar, it should. This is essentially the same 2 x 2 table you learned in epidemiology and use every time you think about diagnostic test performance. The AI world just gave it a fancier name.

Clinical Translation: True Positive = the AI correctly identifies a patient with Graves’ disease. False Positive = the AI flags a euthyroid patient as having Graves’. False Negative = the AI misses a patient who actually has Graves’. True Negative = the AI correctly says “no Graves’” for a healthy patient.
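If you want to see how the four buckets are filled in practice, the sketch below counts them from paired lists of true diagnoses and AI predictions. The function name `confusion_counts` and the ten-patient toy dataset are made up for illustration; they do not come from any real tool or study.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = disease, 0 = no disease)."""
    counts = Counter()
    for y, p in zip(actual, predicted):
        if p == 1 and y == 1:
            counts["TP"] += 1   # correctly caught the disease
        elif p == 1 and y == 0:
            counts["FP"] += 1   # false alarm
        elif p == 0 and y == 1:
            counts["FN"] += 1   # missed the disease
        else:
            counts["TN"] += 1   # correctly ruled out
    return counts

# Made-up toy data: 10 patients, the first 4 truly have Graves' disease.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_counts(actual, predicted)
print(dict(cm))
```

Every metric in the rest of this article is some ratio of these four counts.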

Section 3: Sensitivity and Specificity, Old Friends in New Clothes

You already know these. If you survived medical school biostatistics—and the fact that you are reading this confirms you did—you learned sensitivity and specificity. AI evaluation uses them in exactly the same way.

Sensitivity (Recall, True Positive Rate)

Sensitivity answers: Of all the patients who actually have the disease, how many did the model correctly identify?

Sensitivity = TP / (TP + FN)

A model with high sensitivity is like an overzealous intern who orders every possible test: they rarely miss a diagnosis, but they might raise a lot of false alarms along the way. High sensitivity means few false negatives.

When you want high sensitivity: Screening programs. Cancer detection. Any situation where missing a case has devastating consequences. Think of screening for pheochromocytoma; you want a test with high sensitivity here.

Specificity (True Negative Rate)

Specificity answers: Of all the patients who are actually healthy, how many did the model correctly rule out?

Specificity = TN / (TN + FP)

A model with high specificity is like the seasoned attending who only orders a test when they are fairly certain it will be positive. When this model says something is abnormal, you believe it. High specificity means few false positives.

When you want high specificity: Confirmatory testing. Situations where a false positive leads to invasive procedures, unnecessary surgery, or extreme patient anxiety. Think about the difference between an FNA cytology reading of Bethesda VI (diagnostic of malignancy) and one of Bethesda III (atypia of undetermined significance): the former demands high specificity because it triggers thyroidectomy.

The Sensitivity/Specificity Trade-off

Here is the uncomfortable truth about AI models: Sensitivity and specificity are usually in tension. Increase sensitivity, and you will likely sacrifice specificity (more false alarms). Increase specificity, and you might miss more true cases.

The Medical Analogy: The tradeoff is exactly like setting the PSA cutoff for a prostate cancer workup. Drop the threshold to 2.5 ng/mL, and you will catch more aggressive cancers early (higher sensitivity)—but you will also biopsy thousands of men with benign prostatic hyperplasia or indolent cancers that would never have killed them, leading to incontinence, impotence, and infection from treatments they never needed (lower specificity). Raise it to 10 ng/mL and the overdiagnosis problem vanishes (increased specificity)—but so do the lives you could have saved (reduced sensitivity). There is no free lunch in medicine or in AI.
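The same trade-off can be watched in a few lines of Python. The sketch below applies the sensitivity and specificity formulas from this section at three different cutoffs. The ten model scores are invented for illustration, and `sens_spec_at` is a hypothetical helper name, not part of any library.

```python
def sens_spec_at(threshold, scores, labels):
    """Sensitivity and specificity when a score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Invented model scores: first 5 patients have the disease (1), last 5 do not (0).
scores = [0.95, 0.80, 0.70, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for threshold in (0.25, 0.50, 0.75):
    sens, spec = sens_spec_at(threshold, scores, labels)
    print(f"threshold {threshold:.2f}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

On this toy data, lowering the cutoff catches every diseased patient at the cost of more false alarms, and raising it does the reverse. The model itself never changed; only the threshold did.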

Section 4: Positive Predictive Value and Negative Predictive Value, The Clinician’s Real Questions

Sensitivity and specificity describe how the model performs relative to the disease. But in clinic, the question is reversed. Your patient is sitting across from you, the AI has given you a result, and what you really want to know is:

“The AI says this patient has the disease. How much should I believe it?”

“The AI says this patient is fine. How much should I believe that?”

These questions are answered by Positive Predictive Value (PPV) and Negative Predictive Value (NPV).

Positive Predictive Value (PPV, Precision)

PPV = TP / (TP + FP)

PPV tells you: When the model says “positive,” what percentage of the time is it actually right? A model with high PPV is reliable when it raises an alarm. When it says “cancer,” it usually means cancer.

Negative Predictive Value (NPV)

NPV = TN / (TN + FN)

NPV tells you: When the model says “negative,” what percentage of the time is it actually right? A model with high NPV is reassuring. When it says, “no cancer,” you can breathe easier.

The Prevalence Trap

Here is the critical insight that trips up many physicians: PPV and NPV depend heavily on disease prevalence. The same model with the same sensitivity and specificity will have wildly different PPV depending on whether it is deployed in a high-risk referral center or a community screening program.

Let us make this concrete. Suppose an AI model for detecting pheochromocytoma has 95% sensitivity and 95% specificity. Impressive numbers on paper. Now let us see how it performs in two different settings:

Setting	Prevalence	PPV
Referral center (patients with adrenal incidentalomas and symptoms)	20%	~83%
General screening (asymptomatic hypertensive patients)	0.1%	~1.9%

Read that again. The same model with the same sensitivity and specificity has a PPV of about 83% in the referral setting, but a PPV of less than 2% in the screening setting. In the screening scenario, for every true pheochromocytoma the AI finds, it generates roughly 50 false alarms. That means 50 terrified patients undergo unnecessary 24-hour urine catecholamines, plasma metanephrines, and possibly CTs or MRIs of the adrenals, all because an AI model with “95% sensitivity and specificity” was deployed in the wrong context.

The Bottom Line: Don’t simply ask, “How accurate is this model?” Ask, “How accurate is this model in MY patient population?” Context is everything.
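The PPV figures in the prevalence table can be reproduced with Bayes' theorem. The sketch below is one way to do the calculation, using the sensitivity, specificity, and prevalence values given above. The function name `predictive_values` is ours, and working in population fractions rather than patient counts is simply a convenience.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    # Fractions of the whole population that land in each confusion-matrix cell
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

for setting, prevalence in [("Referral center", 0.20), ("General screening", 0.001)]:
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    false_alarms_per_true_case = (1 - ppv) / ppv
    print(f"{setting}: PPV {ppv:.1%}, NPV {npv:.2%}, "
          f"{false_alarms_per_true_case:.1f} false alarms per true positive")
```

Plugging in your own clinic's estimated prevalence is a quick way to translate a vendor's sensitivity and specificity into the numbers you will actually experience.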

Section 5: The F1 Score, When You Want One Number (But a Better One)

Sometimes you need a single number that balances precision (PPV) and sensitivity (recall). Maybe you are comparing 10 different AI models, and you need a quick way to rank them. This is where the F1 score comes in.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The F1 score is the harmonic mean of precision and recall. If either precision or recall is terrible, the F1 score tanks; it refuses to be fooled by one good number masking a bad one.

Think of it this way: If accuracy is like your patient’s average glucose (easily gamed by class imbalance, just as an average glucose of 120 mg/dL can hide wild glycemic variability), then the F1 score is more like time in range on a continuous glucose monitor. It demands respectable precision and respectable recall.

F1 scores range from 0 to 1, where 1 is perfect. An F1 of 0.9 means the model has a strong balance of catching true cases without too many false alarms. Like a CGM sensor that has given up on life by day 14, an F1 of 0.3 means something has gone terribly wrong.
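To see how the harmonic mean punishes a lopsided model, the short sketch below compares F1 with a plain average. The precision and recall values are invented for illustration, and the zero-division guard is our own convention for the case where both are zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (defined as 0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example: a model that is almost always right when it fires,
# but fires for only 10% of the patients who truly have the disease.
precision, recall = 0.95, 0.10

simple_average = (precision + recall) / 2
f1 = f1_score(precision, recall)

print(f"Simple average: {simple_average:.2f}")
print(f"F1 score:       {f1:.2f}")
```

The simple average still looks like a passing grade, while F1 is dragged down toward the weaker of the two numbers.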

Section 6: AUC-ROC, The Celebrity Metric (With a Dirty Secret)

What Is the ROC Curve?

The Receiver Operating Characteristic (ROC) curve, originally developed during World War II to evaluate radar operators distinguishing enemy planes from noise, plots sensitivity (true positive rate) on the y-axis against 1 minus specificity (false positive rate) on the x-axis at every possible decision threshold.

The Area Under the Curve (AUC) is simply the total area under this ROC curve. In practice it ranges from 0.5 (coin flip, no better than chance) to 1.0 (perfect model); values below 0.5 are possible and mean the model ranks patients worse than chance. An AUC of 0.5 means your expensive AI model has the diagnostic acumen of flipping a coin. An AUC of 0.8-0.9 is generally considered good. Above 0.9 is excellent.

AUC Range	Interpretation
0.5	No discrimination (coin flip)
0.5–0.7	Poor
0.7–0.8	Acceptable
0.8–0.9	Good (commonly published)
0.9–1.0	Excellent
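One way to build intuition for AUC is through an equivalent definition: it is the probability that a randomly chosen diseased patient receives a higher score than a randomly chosen healthy one. The sketch below computes AUC that way, by brute-force pairwise comparison. It is meant for small illustrative datasets rather than production use, and the scores are invented.

```python
def roc_auc(scores, labels):
    """AUC as the share of (diseased, healthy) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented model scores: first 5 patients have the disease (1), last 5 do not (0).
scores = [0.95, 0.80, 0.70, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC: {auc:.2f}")
```

Note that no threshold appears anywhere in the calculation, which is exactly why AUC summarizes ranking ability and says nothing about performance at the cutoff you will actually use.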

When AUC Lies to You

AUC is the most widely reported metric in AI papers, and for good reason: It summarizes discrimination across all thresholds in a single number that is independent of prevalence. But it has serious limitations that you need to know:

Problem 1: AUC does not tell you about clinical utility at any specific threshold. A model with an AUC of 0.92 might have excellent sensitivity at the threshold where specificity is only 60%, or excellent specificity at the threshold where sensitivity is only 70%. The AUC averages across all thresholds, but you will only operate at one. It is like knowing a restaurant has a 4.5-star average review but not knowing whether they are great at appetizers and terrible at desserts, or vice versa.

Problem 2: AUC treats all errors as equally bad. In the AUC calculation, a false positive and a false negative contribute equally. But in clinical practice, these errors are rarely equivalent. Missing an insulinoma (false negative) that causes recurrent life-threatening hypoglycemia is far worse than falsely flagging a case that gets ruled out with additional testing (false positive). AUC cannot capture this asymmetry.

Problem 3: AUC can be misleadingly high with class imbalance. Remember our ACC detection problem? With 9,950 negatives and 50 positives, the ROC curve’s x-axis (false positive rate) is dominated by the huge number of negatives. A model that produces 100 false positives looks like it has a very low false positive rate (100/9,950 ≈ 1%) even though those 100 false positives mean twice as many false alarms as actual cancers. The AUC stays high because the false positive rate is low, even though the absolute number of false positives is clinically unacceptable.

Problem 4: Two very different models can have the same AUC. The ROC curve is a 2D shape, and infinitely many different curves can have the same area underneath. One model might be uniformly decent across all thresholds, while another might be phenomenal at one end and terrible at the other. Same AUC, very different clinical behavior.

Clinical Analogy: AUC is like HbA1c. It is a useful summary measure, but two patients with the same HbA1c of 7.0% can have wildly different glycemic profiles: one with stable glucoses in the 140–160 range, another swinging between 40 and 300. You would never manage these patients the same way, and you should not treat models with the same AUC the same way either.

Better Alternatives for Specific Use Cases

For imbalanced datasets (most medical applications): Consider the Precision-Recall Curve and its AUC (PR-AUC or AUPRC) instead. The precision-recall curve focuses specifically on the positive (disease) class and is much more sensitive to poor performance on rare events. When only 1% of your cases are positive, PR-AUC will punish a model that generates too many false positives far more harshly than ROC-AUC will.

For clinical decision making: Look at performance metrics at the specific threshold the model will actually use in practice. Ask the vendor: “At your chosen operating threshold, what are the sensitivity, specificity, PPV, and NPV in a population similar to mine?”
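To see why the precision-recall view is harsher than the ROC view on imbalanced data, the sketch below reuses the ACC numbers from Problem 3 (9,950 negatives, 50 positives, 100 false positives). The article does not say how many cancers that model caught, so the code assumes the best case, all 50. That figure is our assumption, and with fewer cancers caught the precision would be lower still.

```python
negatives, positives = 9950, 50
false_positives = 100
true_positives = 50   # ASSUMPTION: best case, every cancer is caught

# What the ROC curve's x-axis sees: false alarms as a share of all healthy patients
false_positive_rate = false_positives / negatives

# What the precision-recall curve sees: true cancers as a share of all alarms
precision = true_positives / (true_positives + false_positives)
recall = true_positives / positives

print(f"False positive rate: {false_positive_rate:.1%}")
print(f"Precision (PPV):     {precision:.1%}")
print(f"Recall:              {recall:.0%}")
```

The same 100 false alarms look negligible when divided by thousands of healthy patients, and look serious when compared with the handful of real cancers.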

Section 7: Calibration, Does the Model Mean What It Says?

Here is a metric that gets far less attention than it deserves. When an AI model tells you there is a “70% probability this thyroid nodule is malignant,” does it actually mean that 70 out of 100 nodules with that score turn out to be malignant? Or is the model confidently making up numbers?

This is the question of calibration. A well-calibrated model means its predicted probabilities correspond to actual observed frequencies. If the model says 80% risk, then roughly 80% of similar cases should indeed have the disease.

Why Calibration Matters in Medicine

In many clinical scenarios, you are not just asking the AI for a yes/no binary decision. You are asking for a risk estimate that feeds into shared decision-making with your patient. A patient told they have a “30% risk” of malignancy makes different decisions than one told “80% risk.” If those probabilities are not calibrated, if the model says 80% but the true risk is actually 30%, you are making shared decisions based on shared delusions.

Poorly calibrated models are like a clock that runs fast: it might always rank things in the right order (good discrimination), but the actual times it gives are wrong. You would not set your insulin pump timing based on a fast clock, and you should not set treatment thresholds based on an uncalibrated model.

Quick Check: If a paper only reports AUC and does not mention calibration, be cautious. The model might rank patients well but assign meaningless probability values, making it unsuitable for risk communication.
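A basic calibration check is easy to sketch: group patients by predicted risk, then compare the average predicted risk in each group with how often the disease actually occurred. The code below is a simplified illustration using equal-width bins. The function name `calibration_table` and the ten-patient dataset are invented, and real studies use far more patients and often smoother methods.

```python
def calibration_table(probabilities, outcomes, n_bins=5):
    """Compare mean predicted risk with observed event rate in each probability bin.

    Returns a list of (mean_predicted, observed_rate, n_patients), skipping empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)   # a probability of 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_predicted, observed_rate, len(members)))
    return rows

# Invented data: 5 nodules scored at 10% risk, 5 scored at 80% risk (1 = malignant).
probabilities = [0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.8, 0.8, 0.8, 0.8]
outcomes      = [0,   0,   0,   0,   1,   1,   1,   0,   0,   0]

rows = calibration_table(probabilities, outcomes)
for mean_predicted, observed_rate, n in rows:
    print(f"predicted {mean_predicted:.0%} -> observed {observed_rate:.0%} (n={n})")
```

In this toy example the high-risk group is ranked above the low-risk group, so discrimination looks fine, yet the "80%" label overstates the observed malignancy rate by a factor of two.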

Section 8: A Practical Guide, Which Metrics for Which Situation?

With all these metrics swirling around, how do you decide what to look for when evaluating an AI tool? Here is a practical decision framework:

Clinical Scenario	Priority Metrics	Why
Screening (catch every case)	Sensitivity, NPV	Missing a case is worse than a false alarm. You want the model to cast a wide net.
Confirmatory diagnosis	Specificity, PPV	A positive result triggers treatment. You need to be confident it is correct.
Risk stratification/shared decision-making	Calibration, AUC	Patients and clinicians need accurate probabilities, not just rankings.
Comparing multiple models	F1, PR-AUC	Need a single summary number that does not hide poor performance on the rare class.
Rare disease detection	PR-AUC, Sensitivity, PPV	Standard AUC is overly optimistic with class imbalance. PR-AUC is more honest.
AI-assisted image reading (e.g., thyroid US)	Sensitivity, Specificity at operating threshold	Need to know real-world performance at the exact cutoff the tool uses, not averaged across all thresholds.

Section 9: Red Flags When Reading AI Studies

Armed with your new metrics vocabulary, here are warning signs that should make you skeptical when reading AI papers or evaluating AI products:

Only accuracy is reported. If the paper touts “97% accuracy” without mentioning sensitivity, specificity, or class distribution, the model might just be predicting the majority class. Ask: What is the prevalence?
AUC is reported without a confidence interval. An AUC of 0.94 sounds great, but if the 95% confidence interval is 0.71–1.0, the model’s performance is uncertain.
No external validation. The model was trained and tested on data from the same hospital. This is like a student writing and grading their own exam.
Calibration is never mentioned. If the tool provides probability estimates to patients or clinicians but calibration was never assessed, those probabilities could be meaningless.
The test population does not match your population. A model validated in a Korean tertiary referral center may not perform the same way in a rural U.S. primary care clinic. Demographics, disease prevalence, imaging equipment, and clinical workflows all matter.
Comparison is only against “junior” clinicians. If an AI is compared to first-year residents but never to experienced specialists, the reported superiority may evaporate against an appropriate benchmark.
No subgroup analysis. A model that performs well overall might fail in specific demographic groups—different ethnicities, ages, or disease subtypes. Without subgroup analysis, hidden biases can lurk beneath impressive aggregate numbers.
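The confidence interval red flag is worth a closer look, because small test sets can produce an impressive AUC with enormous uncertainty. The sketch below estimates a 95% interval with a percentile bootstrap, which is one common approach among several and not necessarily the method any given paper used. The ten scores are invented, the helper names are ours, and the fixed random seed is only there to make the run repeatable.

```python
import random

def roc_auc(scores, labels):
    """AUC as the share of (diseased, healthy) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_interval(scores, labels, n_resamples=2000, seed=0):
    """Percentile bootstrap 95% interval for AUC, resampling patients with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("need both diseased and healthy cases")
    rng = random.Random(seed)
    cases = list(zip(scores, labels))
    estimates = []
    while len(estimates) < n_resamples:
        sample = [rng.choice(cases) for _ in cases]
        sample_labels = [y for _, y in sample]
        if len(set(sample_labels)) < 2:
            continue  # AUC is undefined without both classes, so draw again
        estimates.append(roc_auc([s for s, _ in sample], sample_labels))
    estimates.sort()
    return estimates[int(0.025 * n_resamples)], estimates[int(0.975 * n_resamples) - 1]

# Invented test set of only 10 patients (first 5 diseased, last 5 healthy).
scores = [0.95, 0.80, 0.70, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

point = roc_auc(scores, labels)
low, high = bootstrap_auc_interval(scores, labels)
print(f"AUC {point:.2f}, 95% bootstrap interval {low:.2f} to {high:.2f}")
```

With so few patients the interval comes out wide, which is the reason to ask for it whenever a headline AUC is quoted.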

Section 10: Quick Reference Glossary

Term	Plain English Definition
Accuracy	Percentage of all predictions the model got right. Misleading when classes are imbalanced.
Sensitivity (Recall)	Of everyone who has the disease, what proportion did the model catch?
Specificity	Of everyone who is healthy, what proportion did the model correctly identify as healthy?
PPV (Precision)	When the model says “positive,” how often is it right?
NPV	When the model says “negative,” how often is it right?
F1 Score	A single number that balances precision and recall. Ranges from 0 (terrible) to 1 (perfect).
AUC-ROC	Measures discrimination across all thresholds. 0.5 = coin flip, 1.0 = perfect.
PR-AUC (AUPRC)	Like AUC but focused on the positive class. Better for imbalanced datasets.
Calibration	Do the model’s predicted probabilities match real-world frequencies?
Confusion Matrix	A 2×2 table showing TP, FP, FN, TN. The foundation for all other metrics.
Class Imbalance	When one class (e.g., healthy) vastly outnumbers another (e.g., diseased) in the dataset.
External Validation	Testing a model on data from a different source than it was trained on.
Threshold	The cutoff point above which the model predicts “positive.” Can be adjusted to trade sensitivity for specificity.

Conclusion: Be the Clinician AI Needs You to Be

AI models, like lab tests, are tools. Like all tools, their value depends entirely on whether they are used appropriately and interpreted correctly. A 95% accuracy rate is meaningless without context. An AUC of 0.92 is impressive but incomplete. A predicted probability of 80% is only useful if the model is calibrated.

The metrics in this article are not technical trivia for data scientists; they are the clinical reasoning framework you need to evaluate whether an AI tool belongs in your practice. Every time a vendor, a journal article, or a hospital administrator throws a performance number at you, you now have the questions to ask:

What is the prevalence in my population? What are the sensitivity and specificity at the operating threshold? Is this model calibrated? Was it externally validated? Does the test population look like my patients?

You do not need to build AI models to be an excellent evaluator of them. You already have decades of experience interpreting imperfect tests in imperfect clinical scenarios. AI evaluation is just that same skill applied to a new kind of test.

AACE Endocrine AI is published by Conexiant under a license arrangement with the American Association of Clinical Endocrinology, Inc. (AACE®). The ideas and opinions expressed in AACE Endocrine AI do not necessarily reflect those of Conexiant or AACE. For more information, see Policies.