TL;DR
This is a collection of studies on the task of RRG ("CT DICOMs in, report out").
- We finally have a good-sized (~50k) dataset with reports, which in turn attracts more "annotations"
- None of the direct report-generation methods really work yet
- Even worse, the numbers DON'T MATCH across papers (e.g., for the CT2Rep baseline)
- which shows the lack of a consistent benchmark
- perhaps the new challenge will remedy the problem 3
CT-CLIP
NOTE: The paper and dataset were first published in 03/24, but the figure is drawn from the v3 version in 04/25.
Takeaway
- a good amount of CT data with labels 1
- native, whole-CT abnormality classification DOES NOT work
Dataset (CT-RATE)
- 50,188 non-contrast 3D chest CT volumes (series_id) from 25,692 distinct CT experiments (study_id) conducted on 21,304 unique patients (patient_id)
- with reports, each parsed into 18 distinct types of abnormalities
- with extended metadata
- Medical material
- Arterial wall calcification
- Cardiomegaly
- Pericardial effusion
- Coronary artery wall calcification
- Hiatal hernia
- Lymphadenopathy
- Emphysema
- Atelectasis
- Lung nodule
- Lung opacity
- Pulmonary fibrotic sequela
- Pleural effusion
- Mosaic attenuation pattern
- Peribronchial thickening
- Consolidation
- Bronchiectasis
- Interlobular septal thickening
Method and Results
- Extract labels from reports with fine-tuned RadBERT
- CT-ViT with
- patch size of 20 × 20 × 10
- normalized spacing of 0.75mm x 0.75mm x 1.5mm
- normalized resolution of 480 × 480 × 240
- ending up with 24^3 = 13,824 patches, compressed into a 512-dimensional embedding before alignment with the text embedding
- several training strategies: supervised, linear probing, CT-CLIP, etc.
- Performance is better than random but NOT usable; even worse, it GENERALIZES POORLY
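The patch arithmetic above (and the CT2Rep variant later) can be sanity-checked with a quick sketch:

```python
def num_patches(resolution, patch_size):
    """Per-axis count of non-overlapping 3D patches in a volume."""
    return [r // p for r, p in zip(resolution, patch_size)]

# CT-CLIP's CT-ViT: 480 x 480 x 240 volume with 20 x 20 x 10 patches
dims = num_patches((480, 480, 240), (20, 20, 10))
print(dims, dims[0] * dims[1] * dims[2])  # [24, 24, 24] 13824

# CT2Rep: patch size 12 x 24 x 24 (depth listed first in the paper),
# applied to the same normalized volume -> 20^3 = 8,000 patches
dims2 = num_patches((240, 480, 480), (12, 24, 24))
```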
RadGenome-ChestCT
Takeaway
More annotations for CT-RATE: segmentation, hierarchical structure identities, and VQA 2
Method
New annotations added to the CT-RATE dataset
- universal segmenter SAT 4 : whole-body segmentation into 197 categories
- LLM and NER parsing: GPT-4 breaks each report into an anatomically hierarchical tree and aligns every sentence to the matching mask region
- Rule-based templates: generate grounded VQA pairs asking about abnormality presence, location, size, etc.
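A minimal sketch of what such rule-based VQA templating might look like (the template wording and field names here are my own invention, not the paper's):

```python
# Hypothetical sketch: turn parsed abnormality records into grounded VQA pairs.
TEMPLATES = {
    "presence": "Is {abnormality} present in this scan?",
    "location": "Where is the {abnormality} located?",
}

def make_vqa_pairs(record):
    """record: dict with 'abnormality', 'present', and optional 'location'."""
    pairs = []
    answer = "yes" if record["present"] else "no"
    pairs.append((TEMPLATES["presence"].format(**record), answer))
    # Location questions only make sense for abnormalities that are present.
    if record["present"] and record.get("location"):
        pairs.append((TEMPLATES["location"].format(**record), record["location"]))
    return pairs

pairs = make_vqa_pairs(
    {"abnormality": "pleural effusion", "present": True, "location": "left lower lobe"}
)
```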
CT2Rep
Takeaway
- A good first attempt; widely used as a comparison baseline for CT RRG
Method
- First direct CT → image encoder → text decoder pipeline to generate a report
- similar to CT-CLIP preprocessing, but with a patch size of 12 × 24 × 24, ending up with 20^3 = 8,000 vision embeddings
- Added a longitudinal follow-up setting, CT2RepLong, but the results did not improve…
Result
- NOTE: CT2RepLong's CE metrics have WORSE precision and recall but a higher F1
- Be aware of Simpson's paradox
- Also, the F1/P/R of the top 2 rows don't seem correct
- NOTE: Some of the F1 scores are CLEARLY WRONG, e.g., medical material
- As a rule of thumb, F1 is the harmonic mean of P and R, so it must fall between their min and max; P ≈ 0.7 and R ≈ 0.7 will NOT result in F1 ≈ 0.3
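The sanity check is a one-liner:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# The harmonic mean always lies between min(P, R) and max(P, R),
# so P ~ 0.7 and R ~ 0.7 can never produce F1 ~ 0.3.
print(round(f1(0.7, 0.7), 3))  # 0.7
```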
CT-AGRG
Takeaway
- A native baseline to start with
- CT-Net (CNN architecture) > CT-ViT (Transformer architecture)
- Aligns with my expectation: ViT is limited in the current setting (full-size CT → sparse supervision)
Method
- Use a visual encoder to train on supervised classification first, then integrate it with a small GPT-2 for report generation
- multi-label → multi-task embedding → report sentence
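The two-stage shape can be sketched roughly (thresholds and sentence wording below are illustrative stand-ins, not from the paper):

```python
# Hypothetical sketch of CT-AGRG's pipeline shape: a multi-label classifier
# fires per-abnormality heads, and only the positive heads trigger sentence
# generation (stubbed here with templates in place of the small GPT-2).
ABNORMALITIES = ["cardiomegaly", "pleural effusion", "lung nodule"]

def generate_report(probs, threshold=0.5):
    sentences = []
    for name, p in zip(ABNORMALITIES, probs):
        if p >= threshold:  # in the paper, a per-task embedding is decoded instead
            sentences.append(f"Findings consistent with {name}.")
    return " ".join(sentences) or "No significant abnormality detected."

report = generate_report([0.9, 0.2, 0.7])
```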
Result
The results of CT2Rep are COMPLETELY DIFFERENT from what is reported in the original paper, though the frequencies are the SAME
NOTE: The detailed table is from v3 and was removed in the latest version
CT-Agent
Takeaway
- First attempt (?) to integrate a whole CT volume into an LLM
- Interestingly, the regionally prompted analysis reveals that the MLLM CANNOT identify some organs well
Method
- frozen 2D CLIP (ViT-B/16) as the vision encoder: each slice is 256 tokens, giving a 240 (slice) × 256 (num_token) × 1024 (emb_dim) tensor to start with
- Global Token Aggregation (GTA): compresses the volume into a 256 × 1024 global embedding
- Local Token Selection (LTS): selects the top-K slices, plus M slices merged by similarity, yielding a (K+M) × 1024 local embedding
- Projector: maps vision tokens to the LLM's (LLaVAMed-v1.5) embedding dim, 1024 → 4096
- LoRA: trained on EACH anatomical region to answer VQA
- Frozen DeepSeek V3: as a planner to orchestrate everything
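The token budget that GTA and LTS buy can be tallied in a few lines (the tensor shapes come from the paper; K and M below are hypothetical picks for illustration):

```python
# Token bookkeeping for CT-Agent: how much GTA + LTS compress the input.
slices, tokens, dim = 240, 256, 1024
raw = slices * tokens            # vision tokens before compression
global_tokens = tokens           # GTA output: 256 x 1024
K, M = 8, 4                      # illustrative top-K + merged-M slice counts
local_tokens = K + M             # LTS output: (K+M) x 1024
kept = global_tokens + local_tokens
print(raw, kept)                 # the LLM sees well under 1% of the raw tokens
```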
Results
- Shows improvement? Still pretty bad; the CT2Rep baseline is again SIGNIFICANTLY WORSE than what's reported in the original paper
- The regionally prompted analysis is by far the most interesting part, which shows that the agent either doesn’t know or can’t identify some organs well…
- Surprisingly, bone is hard?
MedRegion-CT
Takeaway
An interesting way to “manage” contexts for an MLLM, though the benefits are limited
Method
Input:
The global visual information is provided here: <image>.
The following regions of interest have been identified: lung <region1>,
airway <region2>, .... abdomen <region6>
Additional clinical attribute information is provided as follows:
Organ Volumes:
  Lung:
    - Right Upper Lobe: {right_upper_lobe_volume}ml
    - Right Middle Lobe: {right_middle_lobe_volume}ml
    - Right Lower Lobe: {right_lower_lobe_volume}ml
    - Left Upper Lobe: {left_upper_lobe_volume}ml
    - Left Lower Lobe: {left_lower_lobe_volume}ml
  Heart:
    - Left Atrium: {left_atrium_volume}ml
    - Right Atrium: {right_atrium_volume}ml
    - Left Ventricle: {left_ventricle_volume}ml
    - Right Ventricle: {right_ventricle_volume}ml
  Liver: {liver_volume}ml
  Kidney:
    - Left: {left_kidney_volume}ml
    - Right: {right_kidney_volume}ml
Lesion Details:
  Nodule:
    - count: {nodule_count}
    - diameter (mm): {nodule_diameter}
    - location: {nodule_location}
  Cyst:
    - count: {cyst_count}
    - diameter (mm): {cyst_diameter}
    - location: {cyst_location}
  Effusion:
    - count: {effusion_count}
    - diameter (mm): {effusion_diameter}
    - location: {effusion_location}
Describe this medical scan with findings.
Output:
[Lung]: {lung_findings}
[Airways]: {airways_findings}
[Mediastinum]: {mediastinum_findings}
[Heart]: {heart_findings}
[Osseous]: {osseous_findings}
[Abdomen]: {abdomen_findings}
- Leverages a pre-trained 2D vision encoder (RAD-DINO) and heuristics to train global-slice and region-slice tokens
- Leverages a 3D segmentation model (SAT) to add masks of 6 major organs plus attributes (sizes of organs and lesions)
Full setup:
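A rough sketch of how such a prompt could be assembled programmatically (slot names follow the template above; the attribute values are invented):

```python
# Hypothetical sketch: fill MedRegion-CT-style prompt slots from an attribute
# dict. Only a few slots are shown; the real template covers all organs.
attrs = {
    "right_upper_lobe_volume": 812,
    "left_atrium_volume": 58,
    "nodule_count": 2,
}

prompt = (
    "Additional clinical attribute information is provided as follows:\n"
    "Organ Volumes:\n"
    f"  Lung:\n    - Right Upper Lobe: {attrs['right_upper_lobe_volume']}ml\n"
    f"  Heart:\n    - Left Atrium: {attrs['left_atrium_volume']}ml\n"
    f"Lesion Details:\n  Nodule:\n    - count: {attrs['nodule_count']}\n"
    "Describe this medical scan with findings."
)
```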
Result
- Minor improvements over NLG metrics
- The greater degradation in the GREEN score may suggest a loss of medical findings…
References
- CT-RATE dataset: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE/
- RadGenome-ChestCT dataset: https://huggingface.co/datasets/RadGenome/RadGenome-ChestCT
- VLM3D Challenge – Task 2: Multi-Abnormality Classification: https://abnclass.vlm3dchallenge.com/evaluation/test/leaderboard/
- Large-Vocabulary Segmentation for Medical Images with Text Prompts (SAT): https://arxiv.org/pdf/2312.17183