Harvard Study: AI Outperforms Doctors in Emergency Diagnosis Tasks

A study led by researchers at Harvard Medical School has found that an advanced artificial intelligence system can outperform human doctors in certain emergency diagnosis tasks. The research compared physicians with an AI model, OpenAI o1, using real-world emergency department cases and structured clinical scenarios. In one experiment involving 76 patients, the AI produced correct or near-correct diagnoses more often than doctors when both were given the same written patient records. Experts say the findings reflect rapid progress in AI-driven clinical reasoning, while emphasising that the technology should support, not replace, human judgement.

How AI outperformed doctors in a landmark Harvard study

Researchers evaluated AI and physicians across real emergency cases and controlled clinical scenarios. In the emergency setting, both were given identical electronic health records containing vital signs, demographic details and brief clinical notes. Neither conducted physical examinations, meaning the comparison focused solely on interpreting written medical information.

In this setup, the AI achieved correct or near-correct diagnoses in about 67% of cases, compared with 50% to 55% for doctors. With additional patient information, AI accuracy increased to around 82%, while doctors reached 70% to 79%, though the difference was not statistically significant.

The system also performed strongly in treatment planning tasks. When analysing case studies, it scored about 89%, significantly higher than the roughly 34% achieved by physicians using conventional resources.

Why the AI showed an edge

The advantage was most evident in high-pressure situations with limited information, such as emergency triage. The AI can process large volumes of data quickly and evaluate multiple diagnostic possibilities at once, reducing the impact of common cognitive biases that affect human decision-making under stress.

In one example, a patient with worsening lung symptoms was initially thought to be failing treatment. The AI identified an alternative explanation linked to the patient's history of lupus, an interpretation that was later supported, demonstrating the system's ability to detect less obvious patterns.

Important limitations

Despite its performance, the system has clear constraints. It relied entirely on text-based records and could not assess physical cues such as appearance, behaviour or distress. As a result, it functioned more like a second-opinion tool than a full clinician.

The study was also limited in scope, involving a relatively small sample from a single hospital, leaving open questions about performance across broader and more diverse populations.

Expert views and concerns

Researchers including Arjun Manrai and Adam Rodman said the findings point towards a future where AI supports clinical decision-making. Ewen Harrison described such systems as useful second-opinion tools, while Wei Xing cautioned that the results do not demonstrate readiness for routine clinical use.

Concerns remain around reliability, bias and accountability, with no clear framework yet defining responsibility in cases of AI-assisted errors.

What this means for the future of medicine

The findings underline the growing role of AI in healthcare, particularly in fast-paced environments such as emergency departments. While the technology shows clear potential to improve diagnostic accuracy and efficiency, it remains an assistive tool rather than a replacement for human expertise.

Further large-scale and prospective studies will be needed to determine how AI can be safely integrated into everyday clinical practice.