
30 Nov 2021 ~ 3 min read

Sequential Diagnosis with Language Models



Published Jun 2025
Tags: Evaluation, Multi-Agent, Diagnosis

Takeaways

  1. An interesting setup for testing the DIAGNOSTIC ABILITY of LLMs.
  2. OpenAI’s O3 is VERY GOOD at medical diagnosis.
  3. Claims of cost savings and superiority over human physicians should be taken with a grain of salt.

Method

  • 304 New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases.

    • 56 most recent cases as a held-out test for generalization.
    • VERY CHALLENGING and BIASED: the case mix skews toward malignant syndromes.
  • Build a “gatekeeper” as a simulated environment for patient diagnosis.

    • The doctor/LLM has three actions: ask a question, request a test, or diagnose.
    • The environment responds with real or synthetic information.
    • Doctors verified 508 responses (real and synthetic): 0 leaks, 8 potentially problematic.
  • Build a multi-agent system: MAI-Dx Orchestrator

    • Prompt-tuned with GPT-4.1 and tested with O3.
    • Dr. Hypothesis: Proposes diagnoses.
    • Dr. Test-Chooser: Assigns tests.
    • Dr. Challenger: Verifies/rethinks the approach.
    • Dr. Stewardship: Optimizes cost.
    • Dr. Checklist: Ensures the correct format.
    • QUESTION: How do the agents debate? Is there a MAIN agent for handoffs? Or is this written in the workflow?
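The gatekeeper setup above can be sketched as a minimal turn-based loop. Everything here (`run_episode`, the `Action` enum, the turn budget) is my own illustrative assumption, not the paper's actual interface:

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ASK = "ask"            # ask the patient/environment a question
    TEST = "test"          # request a diagnostic test
    DIAGNOSE = "diagnose"  # commit to a final diagnosis


@dataclass
class Turn:
    action: Action
    content: str  # question text, test name, or diagnosis


def run_episode(doctor, gatekeeper, budget: int = 20) -> str:
    """Alternate doctor actions and gatekeeper responses until the
    doctor commits to a diagnosis or the turn budget runs out."""
    transcript = []
    for _ in range(budget):
        turn = doctor(transcript)      # doctor picks one of the 3 actions
        if turn.action is Action.DIAGNOSE:
            return turn.content        # episode ends on a diagnosis
        # the gatekeeper answers from the real case record, or with a
        # synthetic-but-consistent finding when the record is silent
        transcript.append(gatekeeper(turn))
    return "no diagnosis within budget"


# demo: a scripted "doctor" that asks one question, then diagnoses
script = iter([Turn(Action.ASK, "Any fever?"),
               Turn(Action.DIAGNOSE, "Lyme disease")])
result = run_episode(lambda transcript: next(script),
                     lambda turn: f"(synthetic) answer to: {turn.content}")
```

In the real system the doctor side would be the orchestrator's agent panel (Dr. Hypothesis, Dr. Test-Chooser, etc.) deliberating before each turn; how that deliberation is wired is exactly the open question noted above.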

Results

Baselines:

  • Different models with a very basic prompt… NO MENTION OF COST
  • Primary-care and generalist physicians… NO SPECIALISTS, NO EXTERNAL RESOURCES (books/internet).


Criticisms

  1. NOT A GOOD REFLECTION of the ability/cost trade-off.

    • NO MENTION of COST in the base prompt → not a well-calibrated result.
    • Figure 8 perhaps gives a better comparison of UPPER-BOUND ABILITY.
  2. The setting is a little biased.

    • Tests often take time, and doctors often order them to rule out an emergency rather than to pin down a diagnosis; it’s a “minimize regret vs. maximize outcome” trade-off.
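The trade-off criticism can be stated more precisely: without cost instructions in the prompt, two systems are only cleanly comparable when one Pareto-dominates the other on (accuracy, cost). A small sketch with made-up numbers (none of these figures come from the paper):

```python
def pareto_dominates(a, b):
    """a and b are (accuracy, cost) pairs; a dominates b if it is at
    least as accurate AND at least as cheap, and strictly better on
    at least one of the two axes."""
    acc_a, cost_a = a
    acc_b, cost_b = b
    return acc_a >= acc_b and cost_a <= cost_b and (acc_a > acc_b or cost_a < cost_b)


# Illustrative, made-up numbers -- NOT the paper's reported figures.
orchestrator = (0.80, 4000.0)  # (diagnostic accuracy, avg. test cost in $)
base_prompt = (0.70, 7000.0)
physician = (0.20, 3000.0)

# A system that is both more accurate and cheaper dominates...
assert pareto_dominates(orchestrator, base_prompt)
# ...but against a cheaper physician the trade-off stays unresolved.
assert not pareto_dominates(orchestrator, physician)
```

Whenever neither side dominates, the comparison needs an explicit cost budget in the prompt, which is exactly what the baselines lack.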

Additional Criticisms

A doctor shares similar concerns [1]:

  1. Used ZERO healthy patients.
  2. “Cost-effective” ignores HUMAN TOLL.
  3. The physician comparison is “RIGGED” (too many restrictions).
  4. The “Retrospective Oracle” problem: CPC cases come pre-solved, but not every real-life case resolves to a clean answer.
  5. NO “TIME-TO-STOP”. Great doctors know when NOT to test.

Outlook

  1. The Orchestrator should be compared against:

    • A single agent with detailed instructions.
    • A multi-shot agent.
    • This would show whether division among specialist agents is actually NECESSARY.
  2. Test against noisy information; the vignette seems TOO IDEALISTIC.

    • The MedPAIR study [2] reveals that (non-thinking) LLMs are very sensitive to the vignette and tend to jump to a conclusion from WRONG EVIDENCE.
  3. Test against easier/more common cases.

  4. Test against safety.
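Point 2 could be probed with a simple perturbation harness: append irrelevant-but-plausible findings to a vignette and check whether the diagnosis stays stable. A hypothetical sketch (`with_noise` and the sample strings are my own, not from the paper):

```python
import random


def with_noise(vignette: str, distractors: list[str], k: int, seed: int = 0) -> str:
    """Append k irrelevant findings to a case vignette, to test whether
    a model's diagnosis survives realistic clutter."""
    rng = random.Random(seed)  # seeded so perturbations are reproducible
    return vignette + " " + " ".join(rng.sample(distractors, k))


case = "58-year-old with weight loss and night sweats."
distractors = ["Mild seasonal allergies.", "Old ankle sprain.",
               "Wears reading glasses."]
noisy = with_noise(case, distractors, k=2)
# A robust system should return the same diagnosis for `case` and `noisy`.
```

Running the gatekeeper evaluation on both the clean and the noisy vignette, and measuring how often the diagnosis flips, would address the idealism concern directly.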

References

  1. Dr. Dominic Ng: https://x.com/DrDominicNg/status/1939816655829475648
  2. MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering: https://arxiv.org/pdf/2505.24040

Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.