AI Is Learning to Say “I Don’t Know.” That’s a Skill Medicine Never Required of Itself.
A wave of new research asks when medical AI should stay silent, but the doctor it tells you to “go see” makes up confident answers under uncertainty too, and that changes what we should actually be building.
About one in three American adults now ask an AI chatbot a health question in a given year.
That is roughly the same share that turns to social media for health information. So a quiet research question has become an urgent one: when should a medical AI keep its mouth shut?
A new review in npj Digital Medicine pulls together the young science of “abstention” — a technical word for a simple idea.
It means knowing when to stop, ask for more information, or just say “I don’t know.” The finding is blunt. Today’s models almost never do it. In one study, a leading model held back on only 1 in 10 medical questions that had no safe answer. The rest of the time it answered anyway, fluent, confident, and often wrong.
So researchers are building tools to teach AI when to stay quiet. And the behavior they score as “safe,” again and again, is the same: when in doubt, the model should stop and say, “Please consult a doctor.”
That sounds right.
It is also where the thinking stops.
The whole approach rests on one assumption nobody checks: that the doctor you are sent to will do better.
Start with the machine. When an AI gives a confident answer it has no real basis for, we now call it confabulation — making up a plausible story to fill a gap in what it actually knows. The field takes this seriously. There are now careful tools that catch a model in the act of inventing, by sampling its answers and flagging when its confidence is hollow. Good work, and necessary.
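For readers who want to see the shape of that idea, here is a minimal sketch in Python. It is not the published method: real detectors group answers by meaning, while this one only lowercases and compares strings. The function names, the 1-bit threshold, and the sample answers are all invented for illustration. The point is only that a model asked the same question several times should land on the same answer if it actually knows it.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in bits) over the distinct answers in `samples`.

    Assumption: answers count as "the same" if they match after trimming
    and lowercasing. Published detectors compare meaning, not strings.
    """
    counts = Counter(s.strip().lower() for s in samples)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_confabulated(samples, threshold_bits=1.0):
    """Flag a question whose repeated answers scatter too widely.

    The 1-bit threshold is a hypothetical choice, not a validated cutoff.
    """
    return answer_entropy(samples) > threshold_bits


# Invented samples: five repeated answers to the same question.
stable = ["option a", "Option A", "option a", "option a", "option a"]
scattered = ["option a", "option b", "option c", "option d", "option a"]

stable_flag = looks_confabulated(stable)
scattered_flag = looks_confabulated(scattered)
print(stable_flag, scattered_flag)
```

A model that gives one answer five times scores zero entropy and passes. A model that gives four different answers in five tries is flagged, however fluent each one sounds.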
But confabulation is not a computer glitch. It is what any mind does, silicon or human, when it is unsure and feels pressure to produce an answer.
Daniel Kahneman spent a career proving this about people. He showed that confidence is not a readout of accuracy. It is a feeling, produced by how good a story the mind can tell itself. He called it the illusion of validity: we feel sure when the pieces fit, even when the evidence is thin. And he named the engine: the mind builds its answer from whatever is in front of it and quietly ignores what is missing.
Read that again. It describes a language model. It also describes a tired clinician at 3 a.m.
The evidence on doctors is not gentle. In a landmark 2008 review, two patient-safety researchers argued that physicians badly underrate how often their diagnoses are wrong. When one of them asked rooms full of doctors whether they had made a diagnostic error in the past year, only about 1 in 100 said yes. A 2013 study put numbers on the gap. Physicians got about 55 out of 100 easy cases right — and fewer than 6 out of 100 hard ones. Their confidence barely moved: about 7.2 out of 10 on the easy cases, 6.4 on the hard ones. Accuracy fell off a cliff. Confidence took a gentle stroll downhill. On the hard cases they were wrong more than nine times in ten and stayed nearly as sure of themselves.
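The gap in those 2013 numbers can be put on one scale with a few lines of arithmetic. This is a rough sketch, not the study's own analysis: it assumes a 0-to-10 confidence rating can be read as a probability, and it uses 6 in 100 as a ceiling for hard-case accuracy, since the reported figure is "fewer than 6." The function name is made up for the example.

```python
# Figures as quoted in the paragraph above. Reading a 0-10 confidence
# rating as a probability is this sketch's assumption, not the study's.
cases = {
    "easy": {"accuracy": 0.55, "confidence_0_to_10": 7.2},
    "hard": {"accuracy": 0.06, "confidence_0_to_10": 6.4},  # upper bound
}


def overconfidence_gap(accuracy, confidence_0_to_10):
    """Stated confidence, rescaled to 0-1, minus observed accuracy."""
    return confidence_0_to_10 / 10 - accuracy


gaps = {name: round(overconfidence_gap(**c), 2) for name, c in cases.items()}
print(gaps)  # {'easy': 0.17, 'hard': 0.58}
```

On easy cases, confidence ran about 17 points ahead of accuracy. On hard cases the gap was at least 58 points. A well-calibrated clinician, or model, would show a gap near zero on both.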
What they almost never said was “I don’t know.” That phrase is in remarkably few doctors’ repertoires. The culture of medicine treats uncertainty as something close to failure, and tolerance for not-knowing measurably drops over the course of medical school. We train it out of people.
So here is the asymmetry the AI field keeps walking past. We have built detectors to catch the machine bluffing. We have built nothing of the kind for the clinician who does the same thing, undetected, and then writes it in the chart. The machine that confabulates gets a benchmark and a leaderboard. The human it defers to gets the benefit of the doubt.
In obstetrics this matters twice over, because there are two patients. And there is a second blind spot in the research: it treats silence as the safe choice. At the bedside it often is not. The harm in my field comes far more from things not done (the warning sign not named, the worrying tracing not acted on, the escalation that came too late) than from saying too much. An AI trained to go quiet whenever it is unsure can be trained straight into the passivity that hurts patients. And “consult a doctor” assumes there is a doctor to consult, and that the doctor knows. For many people typing questions at midnight, neither is true.
This is the part the silence debate misses, and it is the heart of a clinical opinion my colleagues and I just published. The danger was never that clinicians would use AI. The danger is that they will use it poorly: copying confident output without checking it, accepting an error because it reads well, treating a chatbot as an authority instead of a tool. The answer is not to avoid these tools, and it is not to make them quieter. It is competence: knowing when AI helps and when it does not, how to test its answers, how to catch a made-up drug dose before it reaches a patient, how to say plainly what is known and what is not, and who stays accountable when the screen goes dark. The clinician does.
Obstetrics has done this before. Ultrasound, electronic fetal monitoring, cell-free DNA screening: each arrived to resistance, and none of them replaced the obstetrician’s judgment. Each demanded a higher version of it. AI is next in that line.
But notice what that competence really asks for. Knowing the limits of your own knowledge. Saying “I don’t know” out loud. Matching your confidence to what you actually checked. That is exactly the calibration we are now trying to engineer into the machine — and it is the same skill medicine has never seriously required of itself.
For a patient, the practical version is small and powerful. The AI article you read before your appointment can sound certain and be wrong. So can the confident answer across the desk. The clinician worth trusting is not the one who never hesitates.
It is the one who can say, “I’m not sure. Let me check,” and then actually checks.
You are allowed to ask two questions out loud: How sure are you? And did you verify it? A good doctor will not flinch at either.
My take: we are studying the wrong half of the problem. The question was never silence versus speech. It is calibration — knowing what you know, and admitting what you don’t. Kahneman built his entire method on the premise that he could not trust his own confidence, so he backed everything with data. That is the discipline. It is rare in the model. It is just as rare in the physician we hold up as the safe alternative.
The arrival of AI is forcing a calibration test on medicine that medicine never volunteered for. We should take it. The goal is not a quieter machine. It is a profession — human and artificial together — that is finally honest about the edge of its own knowledge. In obstetrics, where two patients depend on that honesty, “first do no change” has to give way to “first do good.” Competence, not silence, is the safer answer.
Bottom line: don’t ask how to make AI stay quiet. Ask how to make both the machine and the clinician tell you the truth about how much they actually know.
If this is the kind of thinking you want more of, subscribe to ObGyn Intelligence. It is free, it is independent, and it says plainly what the data shows.
References
1. Presacan O, Nik A, Ojha J, Thambawita V, Ionescu B, Riegler MA. When silence is safer: a review and decision-theoretic framework for LLM abstention in healthcare. npj Digit Med. 2026. doi:10.1038/s41746-026-02882-1.
2. Grünebaum A, Dudenhausen J, Chervenak FA. Clinical artificial intelligence competence in obstetrics and gynecology: patient safety, physician accountability, and responsible use. Am J Obstet Gynecol. 2026. doi:10.1016/j.ajog.2026.06.011.
3. Berner ES, Graber ML. Overconfidence as a cause of diagnostic error in medicine. Am J Med. 2008;121(5 Suppl):S2-S23. doi:10.1016/j.amjmed.2008.01.001.
4. Meyer AND, Payne VL, Meeks DW, Rao R, Singh H. Physicians’ diagnostic accuracy, confidence, and resource requests: a vignette study. JAMA Intern Med. 2013;173(21):1952-1958. doi:10.1001/jamainternmed.2013.10081.
5. Kahneman D. Thinking, Fast and Slow. New York, NY: Farrar, Straus and Giroux; 2011.
6. Grünebaum A, Chervenak J, Pollet SL, Katz A, Chervenak FA. The exciting potential for ChatGPT in obstetrics and gynecology. Am J Obstet Gynecol. 2023;228:696-705. doi:10.1016/j.ajog.2023.03.009.


