
30 Nov 2021 ~ 3 min read

Sequential Diagnosis with Language Models



Published Jun 2025
Tags: Evaluation, Multi-Agent, Diagnosis

Takeaways

  1. An interesting setup for testing the DIAGNOSTIC ABILITY of LLMs.
  2. OpenAI’s O3 is VERY GOOD at medical diagnosis.
  3. Claims of cost savings and superiority over human physicians should be taken with a grain of salt.

Method

  • 304 New England Journal of Medicine clinicopathological conference (NEJM-CPC) cases.

    • 56 most recent cases as a held-out test for generalization.
    • VERY CHALLENGING and BIASED: the case mix skews toward malignant syndromes.
  • Build a “gatekeeper” as a simulated environment for patient diagnosis.

    • The doctor/LLM has three actions: ask a question, request a test, or diagnose.
    • The environment responds with real or synthetic information.
    • Doctors verified 508 responses (real and synthetic): 0 leaks, 8 potentially problematic.
  • Build a multi-agent system: MAI-Dx Orchestrator

    • Prompt-tuned with GPT-4.1 and tested with O3.
    • Dr. Hypothesis: Proposes diagnoses.
    • Dr. Test-Chooser: Assigns tests.
    • Dr. Challenger: Verifies/rethinks the approach.
    • Dr. Stewardship: Optimizes cost.
    • Dr. Checklist: Ensures the correct format.
    • QUESTION: How do the agents debate? Is there a MAIN agent for handoffs? Or is this written in the workflow?
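The gatekeeper setup above can be sketched as a minimal turn-based loop. Everything here (`run_episode`, the `Action` enum, the turn budget) is my own illustrative assumption, not the paper's actual interface:

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ASK = "ask"            # ask the patient/environment a question
    TEST = "test"          # request a diagnostic test
    DIAGNOSE = "diagnose"  # commit to a final diagnosis


@dataclass
class Turn:
    action: Action
    content: str  # question text, test name, or diagnosis


def run_episode(doctor, gatekeeper, budget: int = 20) -> str:
    """Alternate doctor actions and gatekeeper responses until the
    doctor commits to a diagnosis or the turn budget runs out."""
    transcript = []
    for _ in range(budget):
        turn = doctor(transcript)      # doctor picks one of the 3 actions
        if turn.action is Action.DIAGNOSE:
            return turn.content        # episode ends on a diagnosis
        # the gatekeeper answers from the real case record, or with a
        # synthetic-but-consistent finding when the record is silent
        transcript.append(gatekeeper(turn))
    return "no diagnosis within budget"


# demo: a scripted "doctor" that asks one question, then diagnoses
script = iter([Turn(Action.ASK, "Any fever?"),
               Turn(Action.DIAGNOSE, "Lyme disease")])
result = run_episode(lambda transcript: next(script),
                     lambda turn: f"(synthetic) answer to: {turn.content}")
```

In the real system the doctor side would be the orchestrator's agent panel (Dr. Hypothesis, Dr. Test-Chooser, etc.) deliberating before each turn; how that deliberation is wired is exactly the open question noted above.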

Results

Baselines:

  • Different models with a very basic prompt… NO MENTION OF COST
  • Primary-care and generalist physicians… NO SPECIALISTS, NO EXTERNAL RESOURCES (books/internet).


Criticisms

  1. NOT A GOOD REFLECTION of the ability/cost trade-off.

    • NO MENTION of COST in the base prompt → not a well-calibrated result.
    • Figure 8 perhaps gives a better comparison of UPPER-BOUND ABILITY.
  2. The setting is a little biased.

    • Tests often take time, and doctors often order them to rule out an emergency rather than to pin down a diagnosis; it’s a “minimize regret vs. maximize outcome” trade-off.
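The trade-off criticism can be stated more precisely: without cost instructions in the prompt, two systems are only cleanly comparable when one Pareto-dominates the other on (accuracy, cost). A small sketch with made-up numbers (none of these figures come from the paper):

```python
def pareto_dominates(a, b):
    """a and b are (accuracy, cost) pairs; a dominates b if it is at
    least as accurate AND at least as cheap, and strictly better on
    at least one of the two axes."""
    acc_a, cost_a = a
    acc_b, cost_b = b
    return acc_a >= acc_b and cost_a <= cost_b and (acc_a > acc_b or cost_a < cost_b)


# Illustrative, made-up numbers -- NOT the paper's reported figures.
orchestrator = (0.80, 4000.0)  # (diagnostic accuracy, avg. test cost in $)
base_prompt = (0.70, 7000.0)
physician = (0.20, 3000.0)

# A system that is both more accurate and cheaper dominates...
assert pareto_dominates(orchestrator, base_prompt)
# ...but against a cheaper physician the trade-off stays unresolved.
assert not pareto_dominates(orchestrator, physician)
```

Whenever neither side dominates, the comparison needs an explicit cost budget in the prompt, which is exactly what the baselines lack.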

Additional Criticisms

A doctor shares similar concerns [1]:

  1. Used ZERO healthy patients.
  2. “Cost-effective” ignores HUMAN TOLL.
  3. The physician comparison is “RIGGED” (too many restrictions).
  4. The “Retrospective Oracle” problem: CPC cases come pre-solved, but not every real-life case resolves to a clean answer.
  5. NO “TIME-TO-STOP”. Great doctors know when NOT to test.

Outlook

  1. The Orchestrator should be compared against:

    • A single agent with detailed instructions.
    • A multi-shot agent.
    • This would show whether division among specialist agents is actually NECESSARY.
  2. Test against noisy information; the vignette seems TOO IDEALISTIC.

    • The MedPAIR study [2] reveals that (non-thinking) LLMs are very sensitive to the vignette and tend to jump to a conclusion from WRONG EVIDENCE.
  3. Test against easier/more common cases.

  4. Test against safety.
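Point 2 could be probed with a simple perturbation harness: append irrelevant-but-plausible findings to a vignette and check whether the diagnosis stays stable. A hypothetical sketch (`with_noise` and the sample strings are my own, not from the paper):

```python
import random


def with_noise(vignette: str, distractors: list[str], k: int, seed: int = 0) -> str:
    """Append k irrelevant findings to a case vignette, to test whether
    a model's diagnosis survives realistic clutter."""
    rng = random.Random(seed)  # seeded so perturbations are reproducible
    return vignette + " " + " ".join(rng.sample(distractors, k))


case = "58-year-old with weight loss and night sweats."
distractors = ["Mild seasonal allergies.", "Old ankle sprain.",
               "Wears reading glasses."]
noisy = with_noise(case, distractors, k=2)
# A robust system should return the same diagnosis for `case` and `noisy`.
```

Running the gatekeeper evaluation on both the clean and the noisy vignette, and measuring how often the diagnosis flips, would address the idealism concern directly.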

References

  1. Dr. Dominic Ng: https://x.com/DrDominicNg/status/1939816655829475648
  2. MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering: https://arxiv.org/pdf/2505.24040

Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.