
30 Nov 2021 ~ 6 min read

CT-to-Report Does NOT Work, Yet


Observations as of July 2025; things MAY CHANGE

TL;DR

This is a collection of studies on the task of RRG (radiology report generation: “CT DICOMs in, report out”).

  • We finally see a good-sized (~50k) dataset with reports, which has in turn attracted more “annotations”
  • None of the direct report generation methods are really working
  • Even worse, the numbers DON’T MATCH across papers (e.g., the CT2Rep baseline)
    • which shows the lack of a consistent benchmark
    • perhaps the new challenge will redeem the problem 3

CT-CLIP

Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

Published Mar 2024
Tags
CTDatasetCLIP

NOTE: The paper and dataset were first published in 03/24, but the figure is drawn from the v3 version in 04/25.

Takeaway

  • a good amount of CT data with labels 1
  • native, whole-CT abnormality classification DOES NOT work

Dataset (CT-RATE)

  • 50,188 non-contrast 3D chest CT volumes (series_id) from 25,692 distinct CT experiments (study_id) conducted on 21,304 unique patients (patient_id).
  • with free-text reports, parsed into 18 distinct types of abnormalities
  • with extended metadata
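The patient → study → volume hierarchy above is easy to mix up, so here is a minimal sketch of how the three counts relate; the tuples are toy rows, and the real metadata lives in the CT-RATE CSVs on Hugging Face:

```python
# Toy rows illustrating CT-RATE's hierarchy: one patient can have several
# studies, and one study can have several reconstructed volumes (series).
rows = [
    # (patient_id, study_id, series_id)
    ("p1", "p1_s1", "p1_s1_v1"),
    ("p1", "p1_s1", "p1_s1_v2"),  # same study, a second reconstructed volume
    ("p2", "p2_s1", "p2_s1_v1"),
]

n_patients = len({r[0] for r in rows})   # 21,304 in the real dataset
n_studies  = len({r[1] for r in rows})   # 25,692
n_volumes  = len({r[2] for r in rows})   # 50,188
print(n_patients, n_studies, n_volumes)  # 2 2 3
```

This is why the three headline numbers differ: each level of the hierarchy deduplicates the one below it.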

Method and Results

  • Extract labels from reports with fine-tuned RadBERT
  • CT-ViT with
    • patch size of 20 × 20 × 10
    • normalized spacing of 0.75mm x 0.75mm x 1.5mm
    • normalized resolution of 480 × 480 × 240
    • ending up with 24^3 = 13,824 patches, compressed into a 512-channel embedding before alignment with the text embedding
  • several strategies, such as supervised, linear probing, CT-CLIP, etc.
  • Performance is better than random but NOT usable; even worse, it GENERALIZES POORLY
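The patch arithmetic above can be checked in a couple of lines, using the normalized resolution and patch size listed in the bullets:

```python
# Sanity-check of CT-ViT's token count from the preprocessing numbers above.
resolution = (480, 480, 240)  # normalized volume, voxels (x, y, z)
patch      = (20, 20, 10)     # CT-ViT patch size (x, y, z)

grid = [r // p for r, p in zip(resolution, patch)]  # patches per axis
n_patches = grid[0] * grid[1] * grid[2]
print(grid, n_patches)  # [24, 24, 24] 13824
```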

(figure)

RadGenome-ChestCT

RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis

Published Apr 2024
Tags
CTAnnotationVQA

Takeaway

More annotations for CT-RATE: segmentation, hierarchical structure identities, and VQA 2

Method

New annotations added on top of the CT-RATE dataset:

  • Universal segmenter SAT 4: whole-body segmentation of 197 categories
  • LLM and NER parsing: GPT-4-annotated hierarchical structure that breaks each report into an anatomically hierarchical tree and aligns every sentence to the matching mask region
  • Rule-based templates: generate grounded VQA pairs that ask about abnormality presence, location, size, etc.
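To make the third bullet concrete, here is a minimal sketch of what a rule-based grounded-VQA template could look like; the wording and fields are my illustration, not the paper’s exact rules:

```python
# Hypothetical rule-based template: given a parsed (abnormality, region,
# presence) triple, emit grounded question-answer pairs.
def make_vqa_pairs(abnormality, region, present):
    answer = "Yes" if present else "No"
    pairs = [(f"Is there {abnormality} in the {region}?", answer)]
    if present:  # location question only makes sense for positive findings
        pairs.append((f"Where is the {abnormality} located?", region))
    return pairs

print(make_vqa_pairs("pleural effusion", "left hemithorax", True))
```

Because the triples come from the report parse and the regions from the SAT masks, every generated pair is grounded in a specific mask region.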

(figure)

CT2Rep

CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging

Published Mar 2024
Tags
CTAnnotationRRG

Takeaway

  • A good first-effort attempt; has been used as a baseline for comparison for CT RRG

Method

  • First direct pipeline (CT → image encoder → text decoder) to generate a report

(figure)

  • similar to CT-CLIP preprocessing, but with a patch size of 12 × 24 × 24, ending up with 20^3 = 8,000 vision embeddings
  • Added a follow-up setting, CT2RepLong, but the results did not improve…

Result

(figure)

  • NOTE: the CT2RepLong CE metrics have WORSE precision and recall but a higher F1
    • Be aware of Simpson’s paradox
    • Also, the F1/P/R of the top two rows don’t seem consistent

CT-AGRG

CT-AGRG: Automated Abnormality-Guided Report Generation from 3D

Published Aug 2024
Tags
CTRRG

Takeaway

  • A native baseline to start with
  • CT-Net (CNN architecture) > CT-ViT (Transformer architecture)
    • Aligns with my expectation: ViT is limited in the current setting (full-size CT → sparse supervision)

Method

(figure)

  • Train a visual encoder on supervised classification first, then integrate it with a small GPT-2 for report generation
    • multi-label → multi-task embedding → report sentence
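The abnormality-guided idea above can be sketched in a few lines; the threshold and the templated “decoder” are stand-ins for the real classifier head and GPT-2:

```python
# Hypothetical sketch of CT-AGRG's two-stage pipeline: a multi-label
# classifier fires per abnormality, and each detected finding is turned
# into its own report sentence (the paper uses a small GPT-2 for this step).
ABNORMALITIES = ["atelectasis", "cardiomegaly", "pleural effusion"]

def generate_report(probs, threshold=0.5):
    sentences = []
    for name, p in zip(ABNORMALITIES, probs):
        if p >= threshold:  # only detected findings get a sentence
            sentences.append(f"Findings consistent with {name}.")
    return " ".join(sentences) or "No acute abnormality detected."

print(generate_report([0.9, 0.2, 0.7]))
```

The key design choice is that generation is conditioned per detected abnormality instead of decoding one long report from a global embedding.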

Result

(figure)

CT-Agent

CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering

Published May 2025
Tags
CTRRGAgent

Takeaway

  • Perhaps the first attempt to integrate a whole CT volume into an LLM

  • Interestingly, regionally prompted analysis reveals that the MLLM CANNOT identify some organs well

Method

(figure)

  • Frozen 2D CLIP (ViT-B/16) as the vision encoder: each slice yields 256 tokens, giving a 240 (slices) × 256 (tokens) × 1024 (emb_dim) tensor to start with
  • Global Token Aggregation (GTA): compresses the volume into a 256 × 1024 global embedding
  • Local Token Selection (LTS): selects the top-K slices and merges the remaining slices by similarity into M slices, yielding a (K+M) × 1024 local embedding
  • Projector: maps vision tokens to the LLM (LLaVA-Med-v1.5) embedding dim, 1024 → 4096
  • LoRA: trained on EACH anatomical region to answer VQA
  • Frozen DeepSeek-V3: acts as a planner to orchestrate everything
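At the shape level, the token pipeline above can be sketched as follows; the mean for GTA, the random relevance scores, and the random merged slices are stand-ins, since the paper’s actual aggregation and selection rules are more involved:

```python
import numpy as np

S, T, D = 240, 256, 1024           # slices, tokens per slice, embedding dim
vision = np.random.randn(S, T, D)  # frozen 2D CLIP features, one row per slice

# GTA: collapse the slice axis into one 256 x 1024 global embedding
# (shown here as a simple mean over slices).
global_tokens = vision.mean(axis=0)

# LTS: keep the top-K slices by some relevance score, plus M similarity-merged
# slices; both reduced to one 1024-d vector per slice.
K, M = 8, 4
scores = np.random.randn(S)                           # stand-in relevance score
top_k = vision[np.argsort(scores)[-K:]].mean(axis=1)  # (K, 1024)
merged = np.random.randn(M, D)                        # stand-in merged slices
local_tokens = np.concatenate([top_k, merged])        # (K+M, 1024)

print(global_tokens.shape, local_tokens.shape)  # (256, 1024) (12, 1024)
```

Both token groups then pass through the 1024 → 4096 projector before reaching the LLM.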

Results

  • Shows improvement? Still pretty bad; the CT2Rep baseline is again SIGNIFICANTLY WORSE than what’s reported in the original paper

(figure)

  • The regionally prompted analysis is by far the most interesting part, which shows that the agent either doesn’t know or can’t identify some organs well…
    • Surprisingly, bone is hard?

(figure)

MedRegion-CT

MedRegion-CT: Region-Focused Multimodal LLM for Comprehensive 3D CT

Published Jun 2025
Tags
CTLLMVQA

Takeaway

An interesting way to “manage” contexts for an MLLM, though the benefits are limited

Method

(figure)

  1. Leverage a pre-trained 2D vision encoder (RAD-DINO) plus heuristics to produce global-slice and region-slice tokens
  2. Leverage a 3D segmentation model (SAT) to add masks of 6 major organs and attributes (organ and lesion sizes)

Full setup:

R = LLM(T_vision, T_seg, T_attr, I)
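Shape-wise, this setup amounts to concatenating the token groups into one LLM input sequence; the token counts below are illustrative, not the paper’s:

```python
import numpy as np

D = 4096                       # LLM embedding dim (assumed, e.g. a 7B model)
t_vision = np.zeros((256, D))  # T_vision: global + region slice tokens
t_seg    = np.zeros((6, D))    # T_seg: one token per segmented organ mask
t_attr   = np.zeros((6, D))    # T_attr: organ/lesion size attributes
t_instr  = np.zeros((32, D))   # I: the instruction/question tokens

# The LLM sees one flat sequence of all token groups.
llm_input = np.concatenate([t_vision, t_seg, t_attr, t_instr])
print(llm_input.shape)  # (300, 4096)
```

The segmentation and attribute tokens are the “context management” trick: they inject region structure without the LLM having to attend over every slice token.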

Result

(figure)

  • Minor improvements on NLG metrics
  • The larger degradation in the GREEN score may suggest a loss of medical findings…

References

  1. https://huggingface.co/datasets/ibrahimhamamci/CT-RATE/
  2. https://huggingface.co/datasets/RadGenome/RadGenome-ChestCT
  3. VLM3D Challenge – Task 2: Multi-Abnormality Classification: https://abnclass.vlm3dchallenge.com/evaluation/test/leaderboard/
  4. Large-Vocabulary Segmentation for Medical Images with Text Prompts: https://arxiv.org/pdf/2312.17183

Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.