The Diagnostic Agent

This is the final piece of the med series.

TL;DR

Yes, I am throwing the buzzword “reasoning” at you, and on top of that, I claim I want to build a fancy “medical diagnosis agent” with it. The ideology presented here is largely inspired by three things:

The necessary condition: Zhang Xiangyu’s talk on multimodality & reasoning, which had me believe reasoning is necessary for image understanding.
The sufficient condition: OpenAI’s O3 results, which convinced me that an LLM has the ability to reason and plan
The approach: O3 place detective blog, which made me realize this approach will work

The following are just my logical steps and specifics of the claim.

The logic is pretty straightforward:

The current way of “solving” imaging tasks is running into scalability issues in two ways: a. The data scarcity issue: As the performance of the model increases, it’s actually harder to curate more diverse data and to label it. b. No matter how well we develop a model in one domain, it’s pretty much a one-time deal; its transferability to other applications is rather limited.

How?

The approach involves building a multimodal reasoning system that combines specialized vision models with language model planning and reasoning capabilities.

What is an agent?

The term “agent” has been overused and misused, just like “AGI”. I would rather follow Lilian Weng’s definition¹. She framed the agent as an autonomous problem-solver with four major components: memory, planning, tool use, and action.

Memory is still an active area of research. Most working solutions choose to use a simpler solution of filtering, summarizing, or concatenating the previous context. I am very much interested in a more generic/elegant solution, but I do not have one to offer here.
Planning seems to be a more settled topic, as reasoning and a long context really enable an LLM to craft a list of executable steps.
Tools enable the agent to call upon external resources to perform actions and were popularized by MCP.
Action used to be troublesome but has gotten much better with reasoning and better models.

How does it work?

The diagnostic agent combines vision models and language models in a structured workflow:

Vision tools extract specific findings from medical images
Language model plans diagnostic reasoning steps
Memory maintains context across a multi-step analysis
Actions execute diagnostic protocols and follow-ups

Paradigm Comparison

Paradigm	Medical Relevance	Data Format	Annotation Challenge	Advantage	Disadvantage
Vision-Only Model	Medical findings	Raw image	Labeling quality	Works well and runs fast	Hard to transfer and scale
MLLM	Medical findings and impressions	Image and report pair	Curating/synthesizing reports	Scales well across tasks and domains	Not very robust; has been reported to have hallucinations and be short-sighted
Agent	Medical findings, impressions, and diagnosis	Image, CoT/RL with diagnosis	Cold start with report-synthetic CoT -> self-adapted RL	Scales well across tasks and domains	Complex to implement and validate

Benefits

The agent approach combines the best of both worlds:

To show: How and what
To tell: Contexts and system prompts allow changes in behavior. https://www.dbreunig.com/2025/05/07/claude-s-system-prompt-chatbots-are-more-than-just-models.html

Watch-Outs

We are making the assumption that “knowledge” and “skills” are composable and complement each other. This requires careful validation in clinical settings.

Final Thoughts

By this point, I hope you understand what I am trying to build:

Vision models alone are limited: Data and annotation constraints force narrow specialization
Language models provide reasoning: They excel at planning, memory, and action coordination
CV models become tools: Repositioned as specialized instruments within a broader system

The beauty of this approach:

Extensible: More modalities = more tools
Transferable: Cross-checks between regions, modalities, and follow-ups
Scalable: Potential to move from diagnosis to prognosis

To excel at doing ONE task, we may have to do well in ALL tasks.

References

https://lilianweng.github.io/posts/2023-06-23-agent/ ↩