Almost All Medical AI Is “Wrong.” Compared to What? Or to Whom?
A new paper says almost all medical AI is wrong. It never asks the obvious question: wrong next to whom? The doctors it is measured against disagree with an expert panel's reading about a quarter of the time.
Two pathologists look at the same breast biopsy. One sees cancer. One does not. This is not a thought experiment. In a 2015 study in JAMA, 115 pathologists read the same slides. Their diagnoses matched an expert panel's reference diagnosis only about 75 percent of the time. The cases that mattered most, the early, borderline changes, were the ones they disagreed on most. Same slide. Same patient. A different answer depending on who held the microscope.
I keep that number in mind whenever someone tells me medical AI cannot be trusted.
A new editorial in the International Journal of Medical Informatics makes a strong claim right in its title: almost all machine learning models for medicine are wrong. The authors are serious people, and part of their argument is right.
The labels used to train these tools come from human experts who disagree.
The models are tested on the same kind of patients they learned from, not on new hospitals or new populations. They are scored with numbers that look exact but shift the moment the setting changes. And once switched on, they are rarely watched, even though medicine keeps changing underneath them.
The title is not new.
It echoes a famous 2005 essay, “Why most published research findings are false.” That one was about human research — the same studies and expert opinions these models are trained to copy. Follow its logic and almost all of medicine rests on shaky findings too. Somehow the cannon gets pointed only at the machine.
The paper's criticisms are all true, as far as they go. I have made versions of these points myself. Build a tool on a weak foundation, never test it in the real world, and you should not trust it. But the paper never asks the one question that matters most. Wrong compared to what?
Every flaw on their list is also a flaw in human doctors — usually a bigger one. Doctors disagree with each other, as those pathologists did.
They are trained once, tested once, and then practice for decades. Nobody runs a doctor through a second hospital to confirm she still performs. Nobody measures, year after year, whether her judgment still matches the evidence. And the knowledge she learned in training is often more than a decade behind the best current science. By one widely cited estimate, it takes about 17 years for research to reach everyday practice. A machine can be corrected overnight. A habit takes a generation.
So when a study calls a model “wrong,” I want to know: wrong next to what human standard?
The human standard is not perfect truth.
It is a group of people who disagree, carry old knowledge, and are almost never checked.
This is where the paper trips over its own feet. Its strongest section explains that there is no clean truth in medicine. Diagnoses are fuzzy. Lab tests vary. Experts split. The authors say so plainly. Then they turn around and call the machine “wrong” — wrong against the same fuzzy standard they just told us not to trust.
You cannot have it both ways. If the ruler is made of rubber, the word “wrong” loses its meaning. When a model disagrees with a label, it often disagrees with a label a second expert would have rejected too. That is not error. That is the ordinary disagreement of medicine, renamed as a machine defect.
So who is the arbiter of truth? In medicine, there isn’t one. There is a committee of fallible humans, and we have agreed to treat their majority vote as the truth. That is fine — as long as we judge the machine by the same yardstick we use on ourselves, not a stricter one we invented the day the machine walked in.
I will give the authors one real point: scale. One tired doctor harms patients one at a time. A bad model, switched on across a whole hospital system, can harm thousands before lunch. That is a genuine reason to watch deployed tools closely — to test them, keep them honest, and check them over time. Their list of fixes is a good list. I would sign most of it. But that is an argument about how we deploy these tools, not about whether they are “wrong.” The methods in this paper are right. The frame is a double standard wearing a white coat. We are demanding proof from the machine that we never demanded from ourselves.
My position is simple.
Hold medical AI to a high standard.
Then hold human medicine to the exact same one.
Stop asking whether the model is perfect.
Start asking whether it is better than what we already do. Then actually measure what we already do, something we have almost never done. A tool does not need to be flawless. It needs to beat the disagreeing, aging, unchecked standard we already live with. Often, it already does.
The next time you read that almost all medical AI is wrong, ask the question the headline skips: compared to whom?
References
1. Cabitza F, Jurman G, Molinari F, Bellazzi R. Why almost all ML models for medicine are wrong — and what we need for evidence-based medical AI. Int J Med Inform. 2026 (in press). doi:10.1016/j.ijmedinf.2026.106538.
2. Elmore JG, Longton GM, Carney PA, et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA. 2015;313(11):1122-1132. doi:10.1001/jama.2015.1405. PMID 25781441.
3. Morris ZS, Wooding S, Grant J. The answer is 17 years, what is the question: understanding time lags in translational research. J R Soc Med. 2011;104(12):510-520. doi:10.1258/jrsm.2011.110180. PMID 22179294.
4. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124. doi:10.1371/journal.pmed.0020124. PMID 16060722.


