Takeaways
- Intuitive approach: instead of direct X-ray-to-report generation, leverage CoT to get the structured thinking right.
- Extension is LIMITED: fixed workflow with a small set of predefined labels and anatomical regions; does not generalize to other modalities.
Methods
Three-stage method:
- Medical Concept Learning (MCL): an SFT stage that rewrites a report into a structured CoT (Finding → Disease Category → Anatomical Region).
  - Disease Category = 14 clinical labels derived from CheXpert (Irvin et al., 2019): pneumonia, fracture, consolidation, cardiomegaly, no finding, pleural other, pneumothorax, atelectasis, support devices, edema, pleural effusion, lung lesion, lung opacity, and enlarged cardiomediastinum.
  - Anatomical Region = 12 areas based on prior work (STREAM; Yang et al., 2025): abdomen, cardiac silhouette, left apical zone, left hilar structures, left lung, mediastinum, right apical zone, right hilar structures, right lung, whole lung, spine, and trachea.
  - Example: Findings (“Lungs are low in volume”) → Disease (Atelectasis, via intermediate concepts like collapse and mediastinal shift) → Anatomy (Whole Lung).
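A minimal sketch of what the MCL rewriting target might look like. The label and region vocabularies are from the paper; the XML-ish tag format and the `build_cot_target` helper are my assumptions, not the paper's exact serialization:

```python
# Hypothetical serializer for the MCL SFT target: one report sentence is
# rewritten into a structured (Finding -> Disease -> Anatomy) triple.

DISEASE_LABELS = {
    "pneumonia", "fracture", "consolidation", "cardiomegaly", "no finding",
    "pleural other", "pneumothorax", "atelectasis", "support devices",
    "edema", "pleural effusion", "lung lesion", "lung opacity",
    "enlarged cardiomediastinum",
}
ANATOMICAL_REGIONS = {
    "abdomen", "cardiac silhouette", "left apical zone", "left hilar structures",
    "left lung", "mediastinum", "right apical zone", "right hilar structures",
    "right lung", "whole lung", "spine", "trachea",
}

def build_cot_target(finding: str, disease: str, region: str) -> str:
    """Serialize one (finding, disease, region) triple as an SFT target string."""
    if disease not in DISEASE_LABELS:
        raise ValueError(f"unknown disease label: {disease}")
    if region not in ANATOMICAL_REGIONS:
        raise ValueError(f"unknown anatomical region: {region}")
    return (
        f"<finding>{finding}</finding>\n"
        f"<disease>{disease}</disease>\n"
        f"<region>{region}</region>"
    )
```

For the running example: `build_cot_target("Lungs are low in volume", "atelectasis", "whole lung")`.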
- Spatially Verifiable Reinforcement (SVR): an RL stage (GRPO) that enforces that the stated location also comes with an accurate bounding-box prediction.
  - Interestingly, it rewards only IoU and format, with NO “correctness” reward.
  - I guess the goal of this stage is to reinforce the structured parsing format and accurate bounding boxes, not label correctness.
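The two-part reward can be sketched roughly as below. The box serialization format, the regex, and the 0.5/0.5 split between format and IoU rewards are assumptions; only "IoU reward + format reward, no correctness reward" comes from the paper:

```python
import re

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Assumed box syntax inside the completion, e.g. "<box>(10, 20), (110, 220)</box>".
BOX_PATTERN = re.compile(r"<box>\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)</box>")

def svr_reward(completion: str, gt_box) -> float:
    """Format reward for emitting a parseable box, plus IoU against ground truth."""
    m = BOX_PATTERN.search(completion)
    if m is None:
        return 0.0  # unparseable output earns neither reward
    pred = tuple(float(g) for g in m.groups())
    return 0.5 + 0.5 * iou(pred, gt_box)  # assumed equal weighting
```

Note there is no term checking whether the disease label is right, matching the observation above that GRPO here only shapes localization and output structure.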
- Report adaptor: a LoRA that makes the LLM “talk” the way a report does, instead of in the structured language from MCL.
  - Applying LoRA to only the language part works best.
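A sketch of the "LoRA on the language part only" choice: given a VLM's module names, pick adapter targets only from the language decoder and leave the vision tower untouched. The HuggingFace-style module naming (`language_model.`, `q_proj`, etc.) is an assumption; the paper's model may name things differently:

```python
# Attention projections typically targeted by LoRA (assumed naming).
LORA_TARGET_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj")

def select_lora_targets(module_names):
    """Return module names that should receive LoRA adapters:
    attention projections inside the language model, never the vision tower."""
    return [
        name for name in module_names
        if name.startswith("language_model.")
        and name.rsplit(".", 1)[-1] in LORA_TARGET_SUFFIXES
    ]

modules = [
    "vision_tower.blocks.0.attn.q_proj",
    "language_model.layers.0.self_attn.q_proj",
    "language_model.layers.0.mlp.gate_proj",
]
print(select_lora_targets(modules))  # only the language-model q_proj survives
```

With `peft`, the resulting list could be passed as `target_modules` in a `LoraConfig`.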
Data
For stages 1 and 2:
- MS-CXR (https://arxiv.org/abs/2204.09817): built on MIMIC-CXR (https://physionet.org/content/mimic-cxr-jpg/2.0.0/); includes 1,162 image–sentence pairs with annotated bounding boxes and corresponding medical phrases.
- LATTE-CXR (https://physionet.org/content/latte-cxr/1.0.0/): built from MIMIC-CXR and REFLACX; includes 13,751 verified bounding-box annotations aligned with radiological findings.
For stage 3 (the language LoRA):
- MIMIC-CXR: 377,110 chest X-ray images paired with 227,835 radiology reports.
- IU X-Ray (https://openi.nlm.nih.gov/): 7,470 images and 3,955 reports.
Results
Interestingly, LoRA alone works pretty well.
This makes sense given that the LoRA stage has by far the most data: ~230k reports vs ~15k box annotations.
Achieves CE metrics with an F1 score of 0.412 on MIMIC-CXR and 0.610 on IU X-ray.
These numbers seem low, but they actually rank top-tier on each benchmark.
- According to a concurrent paper, Medical Report Generation Is A Multi-label Classification Problem (https://arxiv.org/pdf/2409.00250), even training a ResNet50 for label classification on these datasets yields poor precision, which they attribute in part to a very long-tailed label distribution.
- According to GPT, these two benchmarks are noisy and challenging by nature; human experts reach a consistency of only ~0.5–0.6 on MIMIC-CXR.
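For reference, CE F1 is usually computed by extracting CheXpert-style labels from both generated and ground-truth reports (via a labeler such as CheXbert; whether this paper uses that exact labeler is an assumption) and then scoring label agreement. A minimal micro-averaged version:

```python
def micro_f1(pred_labels, true_labels):
    """Micro-averaged F1 over per-report sets of positive label names.

    pred_labels / true_labels: lists of sets, one set per report.
    """
    tp = fp = fn = 0
    for pred, true in zip(pred_labels, true_labels):
        tp += len(pred & true)   # labels correctly predicted positive
        fp += len(pred - true)   # spurious positives
        fn += len(true - pred)   # missed positives
    denom = tp + 0.5 * (fp + fn)  # equivalent to 2tp / (2tp + fp + fn)
    return tp / denom if denom else 0.0
```

Under this metric, a report that finds one of two true labels with no false alarms already scores F1 = 2/3, which helps calibrate why ~0.4–0.6 can be top-tier on noisy benchmarks.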