

Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation

Published Apr 2025
Tags: RRG, X-ray, RL

Takeaways

  • Intuitive approach: instead of direct X-ray-to-report generation, leverage chain-of-thought (CoT) to get the structured reasoning right.

  • Extensibility is LIMITED: the workflow is fixed to a small set of predefined labels and anatomical regions, and does not generalize to other modalities.

Methods

Three-stage method:

  1. Medical Concept Learning (MCL), an SFT stage that rewrites each report into a structured CoT (Finding → Disease Category → Anatomical Region)

    1. Disease Category = 14 clinical labels derived from CheXpert (Irvin et al., 2019):

      pneumonia, fracture, consolidation, cardiomegaly, no finding, pleural other, pneumothorax, atelectasis, support devices, edema, pleural effusion, lung lesion, and lung opacity

    2. Anatomical Region = 12 areas based on prior work (STREAM: Yang et al., 2025):

      abdomen, cardiac silhouette, left apical zone, left hilar structures, left lung, mediastinum, right apical zone, right hilar structures, right lung, whole lung, spine, and trachea

    3. Example: Finding (“Lungs are low in volume”) → Disease (Atelectasis, via intermediate concepts like collapse and mediastinal shift) → Anatomy (Whole Lung)

  2. Spatially Verifiable Reinforcement (SVR), an RL stage (GRPO) that requires the predicted location to be backed by an accurate bounding-box (bbox) prediction.

    1. Interestingly, the reward is only an IoU reward plus a format reward, with NO “correctness” reward on the report content.
    2. I guess the goal of this stage is to reinforce the structured output format and accurate bbox grounding.
  3. Report adapter, a LoRA that makes the LLM “talk” like a free-text report, instead of in the structured MCL format.

    Applying LoRA only to the language part works best.

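To make the MCL stage concrete, here is a tiny parser for a structured CoT output. The `<finding>/<disease>/<region>` tag serialization and the `parse_cot` helper are my own assumptions for illustration, not the paper's actual format; the two label vocabularies are the ones listed above.

```python
import re

# The paper's 14 CheXpert disease labels and 12 anatomical regions.
DISEASES = {
    "pneumonia", "fracture", "consolidation", "cardiomegaly", "no finding",
    "pleural other", "pneumothorax", "atelectasis", "support devices",
    "edema", "pleural effusion", "lung lesion", "lung opacity",
}
REGIONS = {
    "abdomen", "cardiac silhouette", "left apical zone", "left hilar structures",
    "left lung", "mediastinum", "right apical zone", "right hilar structures",
    "right lung", "whole lung", "spine", "trachea",
}

def parse_cot(text: str):
    """Parse a structured CoT string into (finding, disease, region, valid) tuples.

    `valid` checks that disease and region fall in the predefined vocabularies;
    the tag format here is a hypothetical serialization, not the paper's.
    """
    pattern = re.compile(
        r"<finding>(.*?)</finding>\s*<disease>(.*?)</disease>\s*<region>(.*?)</region>",
        re.DOTALL,
    )
    triples = []
    for finding, disease, region in pattern.findall(text):
        valid = (disease.strip().lower() in DISEASES
                 and region.strip().lower() in REGIONS)
        triples.append((finding.strip(), disease.strip(), region.strip(), valid))
    return triples
```

E.g. the atelectasis example above would serialize as one finding/disease/region triple and parse back into a single valid tuple.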
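The SVR reward (IoU plus format, no correctness term) can be sketched as follows. The weights `w_iou` and `w_format` are placeholders I chose, not the paper's values:

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def svr_reward(pred_box, gt_box, well_formatted,
               w_iou=1.0, w_format=0.5):
    """Total reward = IoU term + format term; note there is no
    report-correctness term. Weights are illustrative placeholders."""
    return w_iou * iou(pred_box, gt_box) + w_format * float(well_formatted)
```

In GRPO this scalar reward would be computed per sampled rollout and normalized within each group to form the advantage.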
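For stage 3, the adapter mechanism itself is standard LoRA: freeze the pretrained weight and learn a low-rank update, applied here only to the language side. A minimal numpy sketch (rank `r` and `alpha` are illustrative, not the paper's hyperparameters):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, w, r=4, alpha=8, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w = w                                              # frozen, shape (out, in)
        self.a = rng.normal(scale=0.01, size=(r, w.shape[1]))   # trainable down-proj
        self.b = np.zeros((w.shape[0], r))                      # trainable up-proj, init 0
        self.scale = alpha / r

    def __call__(self, x):
        # y = (W + scale * B A) x; at init B = 0, so the adapted layer
        # reproduces the base model exactly.
        return self.w @ x + self.scale * (self.b @ (self.a @ x))
```

Because `B` starts at zero, adding the adapter is a no-op until training updates it, which is why LoRA can restyle the LLM's output toward report language without disturbing the visual grounding learned in stages 1–2.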

Data

For stages 1 and 2:

  1. MS-CXR (https://arxiv.org/abs/2204.09817): built from MIMIC-CXR (https://physionet.org/content/mimic-cxr-jpg/2.0.0/); includes 1,162 image–sentence pairs with annotated bounding boxes and corresponding medical phrases.

  2. LATTE-CXR (https://physionet.org/content/latte-cxr/1.0.0/): built from MIMIC-CXR and REFLACX; includes 13,751 verified bounding box annotations aligned with radiological findings.

For stage 3 (language LoRA):

  1. MIMIC-CXR: comprises 377,110 chest X-ray images paired with 227,835 radiology reports.

  2. IU X-Ray (https://openi.nlm.nih.gov/): 7,470 images and 3,955 reports.

Results


Interestingly, LoRA alone works quite well.

That makes sense given it sees by far the most data: ~230k reports vs. ~15k box annotations for stages 1 and 2.

The final model achieves clinical efficacy (CE) F1 scores of 0.412 on MIMIC-CXR and 0.610 on IU X-ray.
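For reference, CE F1 is computed over per-report label sets extracted from the generated and reference reports by an automatic labeler (e.g. CheXbert); a minimal micro-F1 sketch with the label sets given directly, skipping the labeler:

```python
def micro_f1(pred_labels, true_labels):
    """Micro-averaged F1 over per-report label sets (sets of strings)."""
    tp = fp = fn = 0
    for pred, true in zip(pred_labels, true_labels):
        tp += len(pred & true)   # labels in both
        fp += len(pred - true)   # predicted but not in reference
        fn += len(true - pred)   # in reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```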


These scores seem low, but they actually rank top-tier on each benchmark.

  • According to a concurrent paper, Medical Report Generation Is A Multi-label Classification Problem (https://arxiv.org/pdf/2409.00250), even a ResNet50 trained for label classification on these datasets gets poor precision, which they attribute to the very long-tailed label distribution.


  • According to GPT, these two benchmarks are inherently noisy and challenging; human experts reach only ~0.5–0.6 consistency on MIMIC-CXR.

Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.