
30 Nov 2021 ~ 4 min read

The Diagnostic Agent


⚠️ Warning: This article is currently a work in progress and may contain incomplete or draft content.

This is the final piece of the med series.

TL;DR

Yes, I am throwing the buzzword “reasoning” at you, and on top of that, I claim I want to build a fancy “medical diagnosis agent” with it. The ideas presented here are largely inspired by three things:

  1. The necessary condition: Zhang Xiangyu’s talk on multimodality & reasoning, which had me believe reasoning is necessary for image understanding.
  2. The sufficient condition: OpenAI’s o3 results, which convinced me that an LLM has the ability to reason and plan.
  3. The approach: the o3 place-detective blog post, which made me realize this approach can work.

The following are just my logical steps and specifics of the claim.

The logic is pretty straightforward:

  1. The current way of “solving” imaging tasks is running into scalability issues in two ways:
     a. The data scarcity issue: as the performance of the model increases, it’s actually harder to curate more diverse data and to label it.
     b. No matter how well we develop a model in one domain, it’s pretty much a one-time deal; its transferability to other applications is rather limited.

How?

The approach involves building a multimodal reasoning system that combines specialized vision models with language model planning and reasoning capabilities.
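To make this concrete, here is a minimal sketch of what “a vision model as a tool” might look like. All names here (detect_nodules, NoduleFinding, the tool schema) are hypothetical placeholders for illustration, not an existing API:

```python
# A specialized vision model wrapped as a "tool" the language model can call.
# Everything here is a made-up placeholder, not a real library interface.
from dataclasses import dataclass
from typing import List


@dataclass
class NoduleFinding:
    """One structured finding returned by the vision tool."""
    location: str        # e.g. "right upper lobe"
    diameter_mm: float
    confidence: float


def detect_nodules(image_path: str) -> List[NoduleFinding]:
    """Run a (hypothetical) specialized detector and return structured findings."""
    # In practice this would load the study and run a trained CV model;
    # here we only illustrate the interface the planner sees.
    raise NotImplementedError("plug in your own detector")


# Tool schema handed to the language model so it can decide when to call it.
NODULE_TOOL = {
    "name": "detect_nodules",
    "description": "Detect pulmonary nodules in a chest CT; returns location, size, confidence.",
    "parameters": {"image_path": "path to the CT series"},
}
```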

What is an agent?

The term “agent” has been overused and misused, just like “AGI”. I would rather follow Lilian Weng’s definition [1]. She framed the agent as an autonomous problem-solver with four major components: memory, planning, tool use, and action.

[Figure: agent overview]
  • Memory is still an active area of research. Most working systems settle for something simpler: filtering, summarizing, or concatenating the previous context. I am very much interested in a more generic/elegant solution, but I do not have one to offer here.

  • Planning seems to be a more settled topic, as reasoning and a long context really enable an LLM to craft a list of executable steps.

  • Tools enable the agent to call upon external resources to perform actions; standardized interfaces like MCP have popularized this pattern.

  • Action used to be troublesome but has gotten much better with reasoning and better models.
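Putting the four components together, a toy agent loop might look like the sketch below. The llm and tools interfaces are assumptions for illustration, not a real framework:

```python
# A toy sketch of the four components (memory, planning, tool use, action).
# `llm` is any text-in/text-out callable; `tools` maps tool names to callables.
from typing import Callable, Dict, List


class Agent:
    def __init__(self, llm: Callable[[str], str], tools: Dict[str, Callable]):
        self.llm = llm                 # planning + reasoning
        self.tools = tools             # tool use: name -> callable
        self.memory: List[str] = []    # memory: here just a running transcript

    def step(self, observation: str) -> str:
        self.memory.append(f"OBSERVATION: {observation}")
        # Planning: ask the LLM what to do next, given the transcript so far.
        plan = self.llm("\n".join(self.memory) + "\nWhich tool should be called next, and why?")
        self.memory.append(f"PLAN: {plan}")
        return plan

    def act(self, tool_name: str, **kwargs) -> str:
        # Action: execute the chosen tool and remember the result.
        result = self.tools[tool_name](**kwargs)
        self.memory.append(f"RESULT[{tool_name}]: {result}")
        return str(result)
```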

How does it work?

The diagnostic agent combines vision models and language models in a structured workflow:

  1. Vision tools extract specific findings from medical images
  2. Language model plans diagnostic reasoning steps
  3. Memory maintains context across a multi-step analysis
  4. Actions execute diagnostic protocols and follow-ups
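Below is a hedged sketch of that four-step workflow, reusing the hypothetical Agent and detect_nodules pieces from the earlier snippets; the prompt text, tool names, and stop condition are all assumptions:

```python
# Illustrative only: one possible loop over plan -> tool call -> memory -> action.
def diagnose(agent, image_path: str, max_steps: int = 5) -> str:
    agent.memory.append(f"TASK: provide a diagnostic impression for {image_path}")
    for _ in range(max_steps):
        # Step 2: the language model plans the next diagnostic reasoning step.
        plan = agent.step(f"Image available at {image_path}")
        # Step 1: a vision tool extracts structured findings when the plan asks for it.
        if "detect_nodules" in plan:
            agent.act("detect_nodules", image_path=image_path)
        # Step 3: memory (agent.memory) carries the transcript across steps.
        # Step 4: stop once the plan declares a final diagnosis.
        if "FINAL DIAGNOSIS" in plan:
            return plan
    return agent.llm("\n".join(agent.memory) + "\nGive the FINAL DIAGNOSIS.")
```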

Paradigm Comparison

| Paradigm | Medical Relevance | Data Format | Annotation Challenge | Advantage | Disadvantage |
| --- | --- | --- | --- | --- | --- |
| Vision-only model | Medical findings | Raw image | Labeling quality | Works well and runs fast | Hard to transfer and scale |
| MLLM | Medical findings and impressions | Image and report pair | Curating/synthesizing reports | Scales well across tasks and domains | Not very robust; reported to hallucinate and be short-sighted |
| Agent | Medical findings, impressions, and diagnosis | Image, CoT/RL with diagnosis | Cold start with report-synthetic CoT -> self-adapted RL | Scales well across tasks and domains | Complex to implement and validate |
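For the agent row, one way to picture the “Image, CoT/RL with diagnosis” data format is a training example like the one below; the field names and the idea of synthesizing the CoT from an existing report are my assumptions, not an established schema:

```python
# Hypothetical shape of one training example for the agent paradigm.
from dataclasses import dataclass


@dataclass
class AgentTrainingExample:
    image_path: str      # raw study, e.g. a CT series
    source_report: str   # radiology report the CoT is synthesized from (cold start)
    synthetic_cot: str   # step-by-step reasoning trace used for supervised warm-up
    diagnosis: str       # final label, later usable as a reward signal for RL
```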

Benefits

The agent approach combines the best of both worlds: the speed and precision of specialized vision models with the planning, reasoning, and transferability of a language model.

Watch-Outs

We are making the assumption that “knowledge” and “skills” are composable and complement each other. This requires careful validation in clinical settings.

Final Thoughts

By this point, I hope you understand what I am trying to build:

  • Vision models alone are limited: Data and annotation constraints force narrow specialization
  • Language models provide reasoning: They excel at planning, memory, and action coordination
  • CV models become tools: Repositioned as specialized instruments within a broader system

The beauty of this approach:

  • Extensible: More modalities = more tools
  • Transferable: Cross-checks between regions, modalities, and follow-ups
  • Scalable: Potential to move from diagnosis to prognosis

To excel at doing ONE task, we may have to do well in ALL tasks.

References

  1. Lilian Weng, “LLM Powered Autonomous Agents” (2023). https://lilianweng.github.io/posts/2023-06-23-agent/

Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.