Stop pretending AI peer review isn’t happening
The peer review system is under growing strain: submissions keep rising, qualified reviewers are in short supply, and biases persist despite repeated reform efforts.
Artificial intelligence (AI) presents promising solutions to many of these issues, yet the scientific community has hesitated to embrace its potential.
The NIH prohibited AI in peer review in 2023.[1] Science prohibits the use of AI to generate reviews or respond to reviewers’ comments.[2] Elsevier, Springer Nature, and the Institute of Electrical and Electronics Engineers forbid reviewers from uploading manuscripts into AI systems.[3,4,5] In 2026, the International Conference on Machine Learning rejected 497 papers whose authors violated AI-use policies during reciprocal review.[6]
These actions send a clear message: AI is currently not welcome in the scientific evaluation process.
This approach may need to be reconsidered. Recent evidence strongly suggests that including AI reviewers alongside human experts can significantly enhance the peer review process.
Every journal article could benefit from an AI reviewer, and a collaborative model that blends human and AI insights appears to be the most promising path forward for scientific progress.
The Crisis Is Real and Getting Worse
Global scientific output exceeded 3.4 million papers in 2025, with indexed articles growing by 47% since 2016, far outpacing the availability of reviewers. Many scholars decline review invitations, and just 10% of reviewers handle half of all reviews globally.[7] The unpaid labor propping up this system is valued at $1.5 billion annually in the US, a figure that is likely an underestimate.[8]
The quality problem is equally stark. In the landmark Godlee et al. (1998) experiment, reviewers given manuscripts deliberately seeded with eight major errors detected an average of only two; 16% caught none.[9] A meta-analysis of 48 studies (19,443 manuscripts) found inter-reviewer agreement at just r = 0.34.[10] The current system is simultaneously too slow, too unreliable, and too easily gamed.
A study on the impact of ChatGPT on conference peer reviews found that more than 10% of the sentences in submitted reviews had been substantially modified by ChatGPT.[11] A Frontiers survey of 1,600 researchers across 111 countries found more than 50% had used AI in peer review, despite policies prohibiting it.[12] Detection tools are imperfect and increasingly outpaced by fluent LLMs. Editors cannot pay reviewers enough to matter, cannot compel participation, and cannot reliably distinguish human from AI writing.
The Evidence AI Review Works
In the largest experiment of its kind, researchers from Stanford, UCLA, Columbia, and Google deployed a Review Feedback Agent at the International Conference on Learning Representations (ICLR) 2025, providing AI-generated feedback on more than 20,000 randomly selected reviews. In results published in Nature Machine Intelligence, 27% of reviewers who received feedback updated their reviews, incorporating 12,000-plus suggestions.[13] Blinded evaluators preferred the revised reviews in 89% of cases. Reviews increased in length by an average of 80 words, became more specific and actionable, and reviewers engaged more deeply during author rebuttals. Cost: approximately $0.50 per review. Critically, acceptance rates were unaffected (32.3% vs 30.8%), meaning AI improved quality without biasing outcomes.[13,14]
Comparative studies confirm the pattern. AI excels at detecting inconsistencies between claims and evidence, verifying statistical reporting, checking structural completeness, and identifying formatting issues—precisely the areas where human reviewers consistently fall short.[15] As study researcher James Zou, associate professor of Biomedical Data Science at Stanford University, summarized, “AI is strongest on objective, checkable issues and weaker on subjective judgments about novelty or significance.”[16]
Debunking the Objections
There are many arguments against using AI in journal peer review. Consider the most common:
“AI cannot judge novelty.” True—and irrelevant. No one proposes AI as sole reviewer. AI handles the systematic, verifiable dimensions (statistics, references, methodology, logical coherence) while humans focus on significance and originality. This division of labor frees up overwhelmed reviewers to do what only they can do.
“AI violates confidentiality.” The strongest objection and the most solvable. Open-source models (Llama, Mixtral, Qwen) run entirely on institutional servers with zero external data transmission. Enterprise APIs now offer contractual zero-data-retention guarantees compliant with HIPAA and SOC 2. The ICLR 2025 experiment itself operated within the conference’s own platform—no manuscript data left OpenReview.[13] Banning all AI because some deployments risk confidentiality is like banning email because some providers are insecure. The answer is secure implementation, not prohibition.
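To make the confidentiality point concrete, the sketch below shows how an institution might query an open-weight model served entirely inside its own network, so manuscript text never crosses the firewall. It is a minimal illustration, not a production system: the endpoint URL, model name, and prompt are hypothetical, and it assumes the local inference server exposes an OpenAI-compatible chat API, as common open-source serving tools do.

```python
# Minimal sketch: querying an open-weight model hosted on an institutional server,
# so no manuscript text leaves the institution's network.
# The endpoint, model name, and prompt are hypothetical placeholders.
import requests

LOCAL_ENDPOINT = "http://ai-review.internal.example.edu/v1/chat/completions"  # assumed local server
LOCAL_MODEL = "llama-3-70b-instruct"  # assumed locally hosted open-weight model

def confidential_consistency_check(manuscript_excerpt: str) -> str:
    """Ask the locally hosted model to flag claim-evidence inconsistencies."""
    payload = {
        "model": LOCAL_MODEL,
        "messages": [
            {"role": "system",
             "content": "You are a peer-review assistant. List any inconsistencies "
                        "between the stated claims and the reported evidence."},
            {"role": "user", "content": manuscript_excerpt},
        ],
        "temperature": 0.0,  # deterministic output for auditability
    }
    resp = requests.post(LOCAL_ENDPOINT, json=payload, timeout=120)
    resp.raise_for_status()
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return resp.json()["choices"][0]["message"]["content"]
```

Because the request terminates inside the institution, the confidentiality exposure is no greater than that of any other internal editorial tool.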
“AI hallucinates.” So do humans. Reviewers misremember literature, apply incorrect standards, and miss the majority of seeded errors.[9] The ICLR system addressed hallucination with multi-layer guardrails—five LLMs in the pipeline, each with automated reliability tests.[13] AI hallucination is a solvable engineering problem. Human inattention at scale is a structural problem with no clear solution under current models.
“AI homogenizes science.” The current system already homogenizes through conservatism bias. Studies show papers from famous authors and elite institutions receive inflated scores under single-blind review.[17,18] An AI reviewer is blind to institutional prestige, nationality, gender, and professional networks—evaluating work on its merits, which is exactly what peer review is supposed to do.
“Reviewing builds essential skills.” If we accept this premise, then we should ban calculators because they erode arithmetic skills, or statistical software because it prevents researchers from learning to do analysis by hand. The ICLR experiment directly refutes this concern. AI provided feedback on human-written reviews, functioning as a mentor. Reviewers who engaged with AI feedback produced more specific, evidence-based evaluations.[13] AI enhanced the pedagogical function rather than eroding it.
The Human-AI Collaborative Model
The question is not whether AI should replace human reviewers, but how to combine the two. A four-stage workflow could be helpful (a rough sketch in code follows the list):
1. AI pre-screening for statistical consistency, reference verification, reporting guideline compliance, and structural integrity, catching the errors humans most often miss, in minutes rather than months.
2. AI-augmented human review, in which reviewers receive a structured AI analysis alongside the manuscript. They would first review the article without AI support, then engage with the issues the AI has flagged.
3. Post-review AI quality checks on the reviews themselves, identifying vague criticisms and missing evaluation dimensions.
4. Human editorial decision to accept, revise, or reject, informed by both the AI and the human analyses. AI informs, and humans decide.
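As a rough illustration of how the four stages fit together, here is a minimal Python sketch. Every name in it is hypothetical, the model call is a stub standing in for the institution-hosted model from the earlier sketch, and the one design point it encodes is that AI output is attached to the record while the accept/revise/reject field is left for a human editor to fill in.

```python
# Rough sketch of the four-stage workflow; all names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ReviewRecord:
    manuscript: str
    ai_prescreen: str = ""                                        # stage 1 output
    human_reviews: list[str] = field(default_factory=list)       # stage 2 output
    ai_review_feedback: list[str] = field(default_factory=list)  # stage 3 output
    editorial_decision: str = ""                                  # stage 4: set only by a human editor

def ask_local_model(instruction: str, text: str) -> str:
    """Stub for a call to an institution-hosted model (see the earlier sketch)."""
    # A real deployment would POST the instruction and text to the local inference endpoint.
    return f"[local model output for: {instruction}]"

def run_workflow(record: ReviewRecord, human_reviews: list[str]) -> ReviewRecord:
    # Stage 1: AI pre-screening of statistics, references, and structure.
    record.ai_prescreen = ask_local_model(
        "Check statistical reporting, references, and structural completeness.",
        record.manuscript)
    # Stage 2: human reviews are written first; reviewers then see the AI analysis.
    record.human_reviews = list(human_reviews)
    # Stage 3: AI quality check on each human review.
    for review in record.human_reviews:
        record.ai_review_feedback.append(ask_local_model(
            "Flag vague criticisms and missing evaluation dimensions.", review))
    # Stage 4: deliberately untouched; only a human editor sets editorial_decision.
    return record
```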
This model preserves domain expertise, judgment about significance, and field context while compensating for documented weaknesses: inconsistency, bias, superficial evaluation, and glacial speed. At the very least, a collaborative human-AI peer review process provides a nuanced, human-led solution to a problem that’s becoming more apparent every day. Isn’t that better than the alternative: pretending AI peer review isn’t happening?
References
[1] Kaiser J. Science funding agencies say no to using AI for peer review. Science. 2023. doi:10.1126/science.adk0634 [Original source: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-23-149.html]
[2] Science. Science journals: editorial policies. Accessed 2026. https://www.science.org/content/page/science-journals-editorial-policies
[3] Elsevier. Generative AI policies for journals. Updated September 2025. https://www.elsevier.com/about/policies-and-standards/generative-ai-policies-for-journals
[4] Nature Portfolio. Editorial policies: artificial intelligence (AI). https://www.nature.com/nature-portfolio/editorial-policies/ai. Accessed March 28, 2026.
[5] IEEE. Using AI-generated content in an IEEE article and its review. https://ieee-aess.org/using-ai-generated-content-ieee-article-and-its-review. Accessed March 28, 2026.
[6] Nature News. Major conference catches illicit AI use — and rejects hundreds of papers. Nature. 2026. doi:10.1038/d41586-026-00893-2
[7] Publons. Global State of Peer Review. Clarivate Analytics. 2018. https://publons.com/community/gspr
[8] Aczel B, Szaszi B, Holcombe AO. A billion-dollar donation: estimating the cost of researchers’ time spent on peer review. Res Integr Peer Rev. 2021;6(1):14. doi:10.1186/s41073-021-00118-2
[9] Godlee F, Gale CR, Martyn CN. Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: a randomized controlled trial. JAMA. 1998;280(3):237–240. doi:10.1001/jama.280.3.237
[10] Bornmann L, Mutz R, Daniel H-D. A reliability-generalization study of journal peer reviews: a multilevel meta-analysis of inter-rater reliability and its determinants. PLoS ONE. 2010;5(12):e14331. doi:10.1371/journal.pone.0014331
[11] Liang W, et al. Monitoring AI-modified content at scale: a case study on the impact of ChatGPT on AI conference peer reviews. In: Proc 41st International Conference on Machine Learning. 2024;29575–29620.
[12] Naddaf M. More than half of researchers now use AI for peer review — often against guidance. Nature. 2025;649:273. doi:10.1038/d41586-025-04066-5
[13] Thakkar N, Yuksekgonul M, Silberg J, et al. Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025. Nat Mach Intell. 2026. doi:10.1038/s42256-026-01188-x
[14] ICLR Blog. Leveraging LLM feedback to enhance review quality. Published April 15, 2025. https://blog.iclr.cc/2025/04/15/leveraging-llm-feedback-to-enhance-review-quality/
[15] Doskaliuk B, Zimba O, Yessirkepov M, Klishch I, Yatsyshyn R. Artificial intelligence in peer review: enhancing efficiency while preserving integrity. J Korean Med Sci. 2025;40(7):e92. doi:10.3346/jkms.2025.40.e92
[16] Stanford HAI. AI’s growing role as scientific peer reviewer (interview with James Zou). Stanford Institute for Human-Centered Artificial Intelligence. 2025. https://hai.stanford.edu/news/ais-growing-role-as-scientific-peer-reviewer
[17] Tomkins A, Zhang M, Heavlin WD. Reviewer bias in single- versus double-blind peer review. Proc Natl Acad Sci USA. 2017;114(48):12708–12713. doi:10.1073/pnas.1707323114
[18] Sun Y, et al. Does double-blind peer review reduce bias? Evidence from a top computer science conference. J Assoc Inf Sci Technol. 2022;73(6):811–827. doi:10.1002/asi.24582
[19] Human–AI complementarity in peer review: empirical analysis of PeerJ data and design of an efficient collaborative review framework. Publications. 2025;14(1):1. doi:10.3390/publications14010001