Takeaways
- An interesting setup for testing the DIAGNOSTIC ABILITY of LLMs.
- OpenAI’s o3 is VERY GOOD at medical diagnosis.
- Claims on cost/superiority over humans should be taken with a grain of salt.
Method
- 304 New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases.
- 56 most recent cases as a held-out test for generalization.
- VERY CHALLENGING and BIASED: the cases skew toward malignant syndromes.
- Build a “gatekeeper” as a simulated environment for patient diagnosis.
- The doctor/LLM has three actions: ask a question, request a test, or diagnose.
- The environment responds with real or synthetic information.
- Doctors verified 508 responses (real and synthetic): 0 leaks, 8 potentially problematic.
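The gatekeeper loop can be sketched roughly as follows. This is only an illustration of the setup described above, not the paper’s implementation; `agent_reply` (the diagnosing LLM) and `gatekeeper_reply` (answers from the case record, or a synthetic finding when the record lacks one) are hypothetical stand-ins.

```python
import re

def run_episode(agent_reply, gatekeeper_reply, max_turns=20):
    """One diagnostic episode: the agent acts, the gatekeeper responds."""
    transcript = ["Case abstract: ..."]
    for _ in range(max_turns):
        action = agent_reply("\n".join(transcript))
        # The agent has exactly three action types: ask, test, or diagnose.
        match = re.search(r"<diagnosis>(.*?)</diagnosis>", action, re.S)
        if match:  # diagnosing ends the episode
            return match.group(1).strip(), transcript
        transcript.append(action)
        # Gatekeeper answers with real case data or a synthetic finding.
        transcript.append(gatekeeper_reply(action))
    return None, transcript  # no diagnosis within the turn budget
```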
- Build a multi-agent system: MAI-Dx Orchestrator.
- Prompt-tuned with GPT-4.1 and tested with o3.
- Dr. Hypothesis: Proposes diagnoses.
- Dr. Test-Chooser: Selects tests.
- Dr. Challenger: Verifies/rethinks the approach.
- Dr. Stewardship: Optimizes cost.
- Dr. Checklist: Ensures the correct format.
- QUESTION: How do the agents debate? Is there a MAIN agent for handoffs? Or is this written in the workflow?
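One plausible reading is a fixed workflow rather than a free-form debate: each “doctor” is the same LLM called with a different role prompt, and a plain loop does the handoffs. The sketch below is an assumption for illustration (the role instructions and the `llm` callable are placeholders, not the paper’s code).

```python
# Hypothetical fixed-workflow orchestrator: each role is a prompt, the loop
# is the "main agent", and the last role emits the validated action.
ROLES = {
    "Dr. Hypothesis":   "Maintain a ranked differential diagnosis.",
    "Dr. Test-Chooser": "Pick tests that best discriminate the hypotheses.",
    "Dr. Challenger":   "Attack the leading hypothesis; flag anchoring bias.",
    "Dr. Stewardship":  "Veto low-value or redundant tests to control cost.",
    "Dr. Checklist":    "Check that the final action is well-formed.",
}

def orchestrate(llm, case_state):
    notes = []
    for role, instruction in ROLES.items():  # fixed order, no dynamic handoff
        notes.append(f"{role}: " + llm(role, instruction, case_state, notes))
    return notes[-1]  # Dr. Checklist's validated output
```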
Results
Baselines:
- Different models with a very basic prompt… NO MENTION OF COST
- Primary physician and generalist… NO SPECIALIST, NO EXTERNAL RESOURCE (books/internet).
Criticisms
- NOT A GOOD REFLECTION on the ability/cost trade-off.
- NO MENTION about COST in the base prompt → not a very well-calibrated result.
- Figure 8 perhaps gives a better comparison of UPPER-BOUND ABILITY.
- The setting is a little biased.
- Tests often take time. Doctors often order tests based on eliminating an emergency instead of finding a diagnosis; it’s a “minimize regret vs. maximize outcome” trade-off.
The base prompt:

```
You are a diagnostic assistant. Order tests and ask patient questions
to determine the diagnosis.

To order tests, use <test></test> tags:
<test>CBC</test>
<test>Chest X-ray</test>
...more tests...

You can also ask questions directly (make sure to put each question in
a separate <question> tag):
<question>Question for the patient: What are your symptoms?</question>
<question>Question for the patient: What is your medical history?
</question>
...more questions...

You cannot mix <test> and <question> tags in the same turn; just use all
<test> tags or all <question> tags.

Make sure to ask for enough questions and tests to reach a diagnosis.

When ready to diagnose, use <diagnosis></diagnosis> tags:
<diagnosis>Your diagnosis here</diagnosis>
```
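The prompt’s action rules (including the no-mixing constraint) can be enforced with a small parser. This is a sketch of how such a harness might validate a turn, not the paper’s code:

```python
import re

def parse_turn(reply):
    """Extract the action from one model turn, enforcing the prompt's rules."""
    diagnosis = re.findall(r"<diagnosis>(.*?)</diagnosis>", reply, re.S)
    tests     = re.findall(r"<test>(.*?)</test>", reply, re.S)
    questions = re.findall(r"<question>(.*?)</question>", reply, re.S)
    if diagnosis:  # diagnosing takes precedence and ends the episode
        return ("diagnose", diagnosis[0].strip())
    if tests and questions:
        raise ValueError("cannot mix <test> and <question> in one turn")
    if tests:
        return ("test", [t.strip() for t in tests])
    if questions:
        return ("question", [q.strip() for q in questions])
    raise ValueError("no recognised action in turn")
```<test>Chest X-ray</test>") == ("test", ["CBC", "Chest X-ray"])
assert parse_turn("<diagnosis>flu</diagnosis>") == ("diagnose", "flu")
try:
    parse_turn("<test>CBC</test><question>Any pain?</question>")
    assert False
except ValueError:
    pass
</test>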
Additional Criticisms
A doctor shares similar concerns [1]:
- Used ZERO healthy patients.
- “Cost-effective” ignores HUMAN TOLL.
- The physician comparison is “RIGGED” (too many restrictions).
- The “Retrospective Oracle” problem: in these retrospective cases every ordered test has a known recorded result, but real-world tests are often unavailable or inconclusive.
- NO “TIME-TO-STOP”. Great doctors know when NOT to test.
Outlook
- The Orchestrator should be compared against:
- A single agent with detailed instructions.
- A multi-shot agent.
- Division among specialist agents may be UNNECESSARY.
- Test against noisy information; the vignette seems TOO IDEALISTIC.
- The MedPAIR study [2] reveals that (non-thinking) LLMs are very sensitive to the vignette’s wording and tend to jump to a conclusion from the WRONG EVIDENCE.
- Test against easier/more common cases.
- Test for safety.
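The noise-robustness test suggested above could be run roughly like this: inject irrelevant-but-plausible sentences into the vignette and measure how often the diagnosis changes. Purely a sketch; the distractor sentences and the `diagnose` callable are assumptions.

```python
import random

def perturb_vignette(vignette, distractors, k=2, seed=0):
    """Inject k irrelevant sentences at random positions in a vignette."""
    rng = random.Random(seed)
    sentences = vignette.split(". ")
    for d in rng.sample(distractors, k):
        sentences.insert(rng.randrange(len(sentences) + 1), d)
    return ". ".join(sentences)

def robustness(diagnose, vignette, distractors, trials=10):
    """Fraction of perturbed vignettes on which the diagnosis is unchanged."""
    base = diagnose(vignette)
    hits = sum(
        diagnose(perturb_vignette(vignette, distractors, seed=s)) == base
        for s in range(trials)
    )
    return hits / trials
```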
References
1. Dr. Dominic Ng: https://x.com/DrDominicNg/status/1939816655829475648
2. MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering: https://arxiv.org/pdf/2505.24040