TL;DR
I argue that the fundamental challenge in medical AI is scalability—across anatomies, tasks, and possibly modalities. Computer vision (CV) models cannot solve this challenge alone. The solution lies in leveraging LLMs to create unified, intelligent systems that can learn and reason in a more human-like way.
Multimodal Large Language Models (MLLMs) are showing impressive capabilities, but their performance is uneven across different types of medical images.
- 2D Imaging (e.g., X-ray): MLLMs are already achieving resident-level performance in diagnostic tasks.
- 3D Imaging (e.g., CT scans): These models are not yet effective for practical clinical use.
The Core Problem: A Lack of Scalability
This is the part where I start introducing LLMs to the game. But why? Leveraging prior knowledge in medical image analysis today resembles the pre-deep learning era, where practitioners manually designed kernels, features, and complex workflows. Similarly, modern medical imaging workflows still require manually designed loss functions, networks, and custom dataset preparations for each distinct task.
This approach, however, does not scale. Progress in one task rarely translates to another:
- A state-of-the-art lung nodule detector offers little help in building a liver nodule detector.
- A classifier for breast lesion malignancy provides a minimal head start for classifying thyroid lesions.
- A model that segments coronary vessels in CT scans is of little use for segmenting them in MRI scans.
While foundation models offer a potential starting point, they are almost never directly usable in a clinical setting without extensive fine-tuning. This is not just a CV problem but a fundamental challenge rooted in the nature of medical data. The NLP space faces similar issues; one project famously produced 380 specialized models for medical Named Entity Recognition [3]. We are stuck building fleets of small, isolated models.
Why does this happen?
- Visual Variations: Medical imaging datasets inherently contain subtle yet crucial visual variations. These arise from differences in scanner vendors, acquisition protocols, and even the diverse visual phenotypes within the same disease subtype. Capturing all these variations without sufficient supervision is ineffective [4].
- Ambiguity in Task Definition: The “correct” way to perform a task is highly dependent on the clinical goal. Consider segmenting a liver: for surgical planning, excluding major vessels may be necessary, while a routine volumetric measurement might include them. Decisions around handling imaging artifacts or delineating ambiguous anatomical boundaries—whether relying on anatomical knowledge or purely visual cues—can shift considerably from one application or dataset to another.
- Limitations of Current Labeling Methods: Even with clearly defined tasks, current labeling methods inadequately capture complex clinical semantics. We often rely on oversimplified tools, such as single segmentation masks or one-hot labels, to communicate intricate clinical objectives. This simplification forces models to decode complex visual and contextual information from overly simplistic labels, leading to supervision that is frequently noisy, inconsistent, and semantically shallow.
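To make the information loss concrete, here is a toy illustration. All clinical details and field names below are invented for the sketch; the point is only the asymmetry between what the annotator knew and what a one-hot label transmits.

```python
# A hypothetical thyroid-lesion case, purely illustrative.

# What a one-hot label communicates to the model:
one_hot_label = [0, 1]  # index 1 = "malignant"; everything else is discarded

# What the annotator actually knew when assigning that label:
rich_annotation = {
    "finding": "hypoechoic nodule, 14 mm, right lobe",
    "features": ["microcalcifications", "taller-than-wide", "irregular margins"],
    "context": "biopsy-confirmed papillary carcinoma",
    "caveat": "posterior shadowing partially obscures the margin",
}

# The model is asked to reverse-engineer all of the above from a single bit.
assert sum(one_hot_label) == 1
```

The supervision signal is one bit per image, while the clinical reasoning behind it is a paragraph; that gap is exactly what makes the resulting labels look noisy and inconsistent to the model.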
MLLMs: A Path to Scalable Intelligence
Clinicians excel at adapting to new tasks and domains with minimal data because they directly address these challenges:
- Visual Expertise: Highly developed visual processing abilities allow clinicians to bridge the visual domain gap effectively.
- Contextual Understanding: Deep comprehension of clinical context and task objectives helps clinicians navigate task ambiguities.
- Comprehensive Training: Extensive biological and clinical knowledge enables clinicians to reason beyond superficial observations, assembling visual patterns into a coherent clinical interpretation.
Scaling up a traditional CV model with more data and parameters might improve its ability to recognize low-level visual patterns, but it does little to imbue it with the conceptual understanding needed for true medical reasoning and understanding of task requirements [1].
This is where MLLMs come in.
- The vision encoder captures and describes low-level visual patterns, addressing the visual domain gap.
- A contextually rich prompt unequivocally defines task requirements during training and inference.
- The large language model (LLM), pretrained on vast amounts of medical books and reports, focuses on understanding high-level concepts, applying clinical knowledge, and performing reasoning.
This separation of concerns naturally aligns with the “allocation of expertise” that I advocated for in Thoughts on data annotation.
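The division of labour can be sketched in a few lines of code. Every function name below is a placeholder, not a real API; the sketch only shows how the three components hand off responsibility to one another.

```python
# Minimal sketch of the separation of concerns. All names are hypothetical.

def vision_encoder(volume):
    """Turns raw pixels into visual tokens, absorbing the visual domain gap
    (scanner vendor, protocol, phenotype variation)."""
    return [f"vis_token_{i}" for i in range(4)]  # stand-in for patch embeddings

def build_prompt(task_spec):
    """A contextually rich prompt pins down the task unambiguously,
    instead of hoping the model infers it from a bare mask or label."""
    return (
        "Segment the liver for surgical planning. "
        f"Exclude major vessels: {task_spec['exclude_vessels']}. "
        f"Artifact policy: {task_spec['artifact_policy']}."
    )

def llm(visual_tokens, prompt):
    """The LLM layers clinical knowledge and reasoning on top of both inputs."""
    return {"tokens_seen": len(visual_tokens), "task": prompt}

task = {"exclude_vessels": True, "artifact_policy": "interpolate from adjacent slices"}
answer = llm(vision_encoder("ct_volume"), build_prompt(task))
```

Note how the task ambiguities from the previous section (vessels in or out, artifact handling) become explicit prompt parameters rather than implicit conventions baked into a dataset.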
The good news is that this approach works. Google’s Med-Gemini, for instance, has shown that a moderately sized model can achieve promising results on 2D image tasks with a feasible amount of data. An independent study even found that Med-Gemini surpassed radiology residents in diagnostic accuracy on chest X-rays [5].
The 3D Challenge: Why CT Scans Break MLLMs
Despite the success with 2D images, extending this paradigm to 3D modalities like CT has proven difficult. What is the gap?
- From Comprehension to Localization: Most 2D medical AI benchmarks are “comprehension” tasks (e.g., “What is the primary finding in this X-ray?”), where the visual cue is dominant. This plays to the strengths of MLLMs. However, 3D analysis relies heavily on precise localization and dense prediction (e.g., “Where is the lesion and what is its exact volume?”), tasks where current MLLMs lag significantly behind specialized CV models [6].
- Granularity Mismatch: The visual tokens used in MLLMs are highly sensitive to network layer selection, image resolution, and pretraining strategies [7]. This limitation becomes particularly acute with CT scans, where the required granularity ranges widely from millimeter-level lesion identification to broader anatomical context at a sub-meter scale.
- The Data Bottleneck: MLLMs thrive on vast paired datasets of images and detailed textual annotations. While sufficient reports paired with 2D image captures exist for chest X-ray, dermatology, pathology, and ophthalmology [8], 3D imaging data is substantially lacking.
Note: This field is evolving rapidly. Google Health is already experimenting with modeling CT scans as a series of 2D slices based on the video-pretrained VideoCoCa model [2], and MLLMs designed for long video sequences are on the rise [9], which could pave the way for better 3D understanding.
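The “CT as video” idea is simple enough to sketch: treat the axial slices of a volume as frames of a clip. The shapes and the subsampling stride below are invented for illustration; real pipelines would also normalize intensities and window the Hounsfield units.

```python
import numpy as np

# Toy CT volume: (depth, height, width). Shapes are illustrative only.
volume = np.random.rand(128, 256, 256)

# Treat each axial slice as one "video frame".
frames = [volume[z] for z in range(volume.shape[0])]  # 128 frames of 256x256

# A video-pretrained encoder would consume these as a clip; subsampling
# (here: every 8th slice) keeps the token count tractable for the LLM context.
clip = frames[::8]  # 16 frames
```

The trade-off is visible even in this toy: the stride that keeps the sequence short enough for the context window is the same stride that can skip a millimeter-scale lesion, which is the granularity mismatch described above.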
Conclusion
Leveraging LLMs is the most promising path toward building truly scalable medical AI. Their value is not their ability to chat, but their capacity to be trained with rich, detailed annotations that are impossible to capture with simple CV labels. They come pre-loaded with a wealth of biological and clinical knowledge, providing the conceptual framework that vision models lack.
The immediate challenge is making this powerful combination work for higher-dimensional data like CT and MRI. In my next post, I will propose a potential solution: the cliché, unsurprising diagnostic agent.
References
1. Zhang Xiangyu’s talk on multimodal & reasoning
2. CT Foundation: https://github.com/Google-Health/imaging-research/tree/master/ct-foundation#overview
3. Unlocking Healthcare AI: I’m Releasing State-of-the-Art Medical Models for Free. Forever. https://huggingface.co/blog/MaziyarPanahi/open-health-ai
4. How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks? https://www.cs.jhu.edu/~zongwei/publication/li2023suprem.pdf
5. ReXVQA: https://arxiv.org/pdf/2506.04353
6. How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks: https://github.com/EPFL-VILAB/fm-vision-evals
7. Perception Encoder: The best visual embeddings are not at the output of the network: https://arxiv.org/pdf/2504.13181
8. Med-Gemini: https://arxiv.org/abs/2507.05201
9. Long-RL: Scaling RL to Long Sequences: https://github.com/NVlabs/Long-RL