

Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation

Published Apr 2025
Tags: RRG, X-ray, RL

Takeaways

  • Intuitive approach: instead of direct X-ray-to-report generation, leverage chain-of-thought (CoT) to get the structured reasoning right.

  • Extensibility is LIMITED: the workflow is fixed to a small set of predefined labels and anatomical regions, and does not generalize to other modalities.

Methods

Three-stage method:

  1. Medical Concept Learning (MCL), an SFT stage that rewrites each report into a structured CoT (Finding → Disease Category → Anatomical Region)

    1. Disease Category = 14 clinical labels derived from CheXpert (Irvin et al., 2019):

      pneumonia, fracture, consolidation, cardiomegaly, no finding, pleural other, pneumothorax, atelectasis, support devices, edema, pleural effusion, lung lesion, and lung opacity

    2. Anatomical Region = 12 areas based on prior work (STREAM: Yang et al., 2025):

      abdomen, cardiac silhouette, left apical zone, left hilar structures, left lung, mediastinum, right apical zone, right hilar structures, right lung, whole lung, spine, and trachea

    3. Example: Finding (“Lungs are low in volume”) → Disease (Atelectasis, via intermediate concepts like collapse and mediastinal shift) → Anatomy (Whole Lung)

  2. Spatially Verifiable Reinforcement (SVR), an RL stage (GRPO) that requires the predicted location to be backed by an accurate bounding-box (bbox) prediction.

    1. Interestingly, the reward is only an IoU reward plus a format reward, with NO “correctness” reward on the report content.
    2. I guess the goal of this stage is to reinforce the structured output format and accurate bbox grounding.
  3. Report adapter, a LoRA that makes the LLM “talk” like a free-text report, instead of in the structured MCL format.

    Applying LoRA only to the language part works best.

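To make the MCL stage concrete, here is a tiny parser for a structured CoT output. The `<finding>/<disease>/<region>` tag serialization and the `parse_cot` helper are my own assumptions for illustration, not the paper's actual format; the two label vocabularies are the ones listed above.

```python
import re

# The paper's 14 CheXpert disease labels and 12 anatomical regions.
DISEASES = {
    "pneumonia", "fracture", "consolidation", "cardiomegaly", "no finding",
    "pleural other", "pneumothorax", "atelectasis", "support devices",
    "edema", "pleural effusion", "lung lesion", "lung opacity",
}
REGIONS = {
    "abdomen", "cardiac silhouette", "left apical zone", "left hilar structures",
    "left lung", "mediastinum", "right apical zone", "right hilar structures",
    "right lung", "whole lung", "spine", "trachea",
}

def parse_cot(text: str):
    """Parse a structured CoT string into (finding, disease, region, valid) tuples.

    `valid` checks that disease and region fall in the predefined vocabularies;
    the tag format here is a hypothetical serialization, not the paper's.
    """
    pattern = re.compile(
        r"<finding>(.*?)</finding>\s*<disease>(.*?)</disease>\s*<region>(.*?)</region>",
        re.DOTALL,
    )
    triples = []
    for finding, disease, region in pattern.findall(text):
        valid = (disease.strip().lower() in DISEASES
                 and region.strip().lower() in REGIONS)
        triples.append((finding.strip(), disease.strip(), region.strip(), valid))
    return triples
```

E.g. the atelectasis example above would serialize as one finding/disease/region triple and parse back into a single valid tuple.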
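The SVR reward (IoU plus format, no correctness term) can be sketched as follows. The weights `w_iou` and `w_format` are placeholders I chose, not the paper's values:

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def svr_reward(pred_box, gt_box, well_formatted,
               w_iou=1.0, w_format=0.5):
    """Total reward = IoU term + format term; note there is no
    report-correctness term. Weights are illustrative placeholders."""
    return w_iou * iou(pred_box, gt_box) + w_format * float(well_formatted)
```

In GRPO this scalar reward would be computed per sampled rollout and normalized within each group to form the advantage.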
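For stage 3, the adapter mechanism itself is standard LoRA: freeze the pretrained weight and learn a low-rank update, applied here only to the language side. A minimal numpy sketch (rank `r` and `alpha` are illustrative, not the paper's hyperparameters):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, w, r=4, alpha=8, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w = w                                              # frozen, shape (out, in)
        self.a = rng.normal(scale=0.01, size=(r, w.shape[1]))   # trainable down-proj
        self.b = np.zeros((w.shape[0], r))                      # trainable up-proj, init 0
        self.scale = alpha / r

    def __call__(self, x):
        # y = (W + scale * B A) x; at init B = 0, so the adapted layer
        # reproduces the base model exactly.
        return self.w @ x + self.scale * (self.b @ (self.a @ x))
```

Because `B` starts at zero, adding the adapter is a no-op until training updates it, which is why LoRA can restyle the LLM's output toward report language without disturbing the visual grounding learned in stages 1–2.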

Data

For stages 1 and 2:

  1. MS-CXR (https://arxiv.org/abs/2204.09817): built from MIMIC-CXR (https://physionet.org/content/mimic-cxr-jpg/2.0.0/); includes 1,162 image–sentence pairs with annotated bounding boxes and corresponding medical phrases.

  2. LATTE-CXR (https://physionet.org/content/latte-cxr/1.0.0/): built from MIMIC-CXR and REFLACX; includes 13,751 verified bounding box annotations aligned with radiological findings.

For stage 3 (language LoRA):

  1. MIMIC-CXR: comprises 377,110 chest X-ray images paired with 227,835 radiology reports.

  2. IU X-Ray (https://openi.nlm.nih.gov/): 7,470 images and 3,955 reports.

Results


Interestingly, LoRA alone works quite well.

That makes sense given it sees by far the most data: ~230k reports vs. ~15k box annotations for stages 1 and 2.

The final model achieves clinical efficacy (CE) F1 scores of 0.412 on MIMIC-CXR and 0.610 on IU X-ray.
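For reference, CE F1 is computed over per-report label sets extracted from the generated and reference reports by an automatic labeler (e.g. CheXbert); a minimal micro-F1 sketch with the label sets given directly, skipping the labeler:

```python
def micro_f1(pred_labels, true_labels):
    """Micro-averaged F1 over per-report label sets (sets of strings)."""
    tp = fp = fn = 0
    for pred, true in zip(pred_labels, true_labels):
        tp += len(pred & true)   # labels in both
        fp += len(pred - true)   # predicted but not in reference
        fn += len(true - pred)   # in reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```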


These scores seem low, but they actually rank top-tier on each benchmark.

  • According to a concurrent paper, Medical Report Generation Is A Multi-label Classification Problem (https://arxiv.org/pdf/2409.00250), even a ResNet50 trained for label classification on these datasets gets poor precision, which they attribute to the very long-tailed label distribution.


  • According to GPT, these two benchmarks are inherently noisy and challenging; human experts reach only ~0.5–0.6 consistency on MIMIC-CXR.

Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.