"failure rates exceeded 80 per cent for all models when they needed to do so-called differential diagnosis — when full patient information was lacking" [80 per cent]
The article treats this as a general property of current LLM architecture, not a tuning problem specific to one vendor. All 21 models tested, including leading offerings from OpenAI, DeepSeek, Anthropic, Google, and xAI, exhibited the same failure mode. That consistency across vendors suggests the limitation is intrinsic to how transformer-based models handle incomplete information under diagnostic constraints, rather than a deployment or prompt-engineering issue. The gap between differential-diagnosis and final-diagnosis performance suggests that LLMs lack the probabilistic reasoning framework human clinicians use to keep multiple hypotheses open under uncertainty.
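
To make "maintaining multiple hypotheses under uncertainty" concrete, here is a minimal Python sketch of a Bayesian update over a differential. It is only an illustration of the idea, not the method used in the article or by any of the tested models. The function name `update_differential`, the condition labels, and all probabilities are hypothetical and chosen purely for the arithmetic.

```python
def update_differential(prior, likelihoods):
    """Bayes update: posterior is proportional to prior * P(finding | diagnosis)."""
    unnormalized = {dx: prior[dx] * likelihoods[dx] for dx in prior}
    total = sum(unnormalized.values())
    return {dx: p / total for dx, p in unnormalized.items()}


# Hypothetical prior over three candidate diagnoses (illustrative numbers only).
differential = {"condition_a": 0.5, "condition_b": 0.3, "condition_c": 0.2}

# Hypothetical probability of observing one new finding under each diagnosis.
finding_likelihood = {"condition_a": 0.2, "condition_b": 0.6, "condition_c": 0.6}

posterior = update_differential(differential, finding_likelihood)

# Every hypothesis stays in play; only the relative weights shift.
for dx, p in sorted(posterior.items(), key=lambda item: -item[1]):
    print(f"{dx}: {p:.2f}")
```

The point of the sketch is that no hypothesis is discarded after a single finding: the ranking changes, but the full differential is carried forward until more information arrives. That is the behaviour the commentary says the models fail to show when patient information is incomplete.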