
30 Nov 2021 ~ 4 min read

Green Score


GREEN: Generative Radiology Report Evaluation and Error Notation

Published May 2024
Tags
RRG, Metric

Takeaways

  • Solid motivation: current radiology report generation (RRG) metrics rely on general NLG metrics, which do NOT reflect a report's factual accuracy.
  • Okay solution: train a small LLM to judge a generated report by counting findings inconsistent with the reference.
  • More investigation required: the LLM's preferences run strongly opposite to radiologists'. Why? And maybe that disagreement is a metric by itself?

Method

  • Distill GPT-4's analysis of common findings/errors into a smaller LLM.

  • The generative evaluator: fine-tune LLaMA-2 (7B) and Phi-2 variants (“RadLLaMA-2”, “RadPhi-2”) on 100k report pairs that GPT-4 has annotated for six error categories:

    1. False finding
    2. Missed finding
    3. Wrong location
    4. Wrong severity
    5. Hallucinated comparison
    6. Missed comparison
  • GREEN score formula: essentially an IoU between predicted findings and ground-truth findings, with a weight λ trading off minor vs. major errors.

    $$\text{GREEN} = \frac{\text{Matched Findings}}{\text{Sig. Errors} + \lambda\,\text{Insig. Errors} + \text{Matched Findings}}$$
    • λ = weight < 1
    • Bounded in [0, 1]
    • Higher is better
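The formula above is simple enough to sketch directly. A minimal implementation, assuming the evaluator has already produced the three counts (the function name and the default λ = 0.5 are my own placeholders, not values from the paper):

```python
def green_score(matched: int, sig_errors: int, insig_errors: int,
                lam: float = 0.5) -> float:
    """GREEN = matched / (sig + lam * insig + matched), bounded in [0, 1].

    `lam` < 1 down-weights clinically insignificant errors relative to
    significant ones; a report with no matched findings scores 0.
    """
    denom = sig_errors + lam * insig_errors + matched
    return matched / denom if denom > 0 else 0.0
```

For example, 4 matched findings with 2 significant and 2 insignificant errors gives 4 / (2 + 0.5·2 + 4) = 4/7 ≈ 0.57.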

Result

The Correlation Analysis

Using the ReXVal dataset with assessments from six board-certified radiologists:

  • 200 report pairs from 50 MIMIC-CXR (Yu et al., 2023b) test cases
  • 0.63 correlation with the mean expert assessment
  • Individual expert correlations: 0.48-0.64
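The reported numbers are correlations between metric scores and expert error assessments over report pairs. A sketch of how such a check might be run, with a hand-rolled Pearson coefficient and made-up toy numbers (not ReXVal data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy illustration: metric scores for five report pairs vs. the mean
# expert error count. More errors should mean a lower score, so a good
# metric shows a strong negative correlation here.
green = [0.9, 0.7, 0.6, 0.4, 0.2]
expert_errors = [0.5, 1.2, 2.0, 3.1, 4.0]
r = pearson(green, expert_errors)
```

In practice one would use `scipy.stats.pearsonr` (or a rank correlation like Kendall's τ, which is common for metric-vs-expert comparisons); the manual version just keeps the sketch dependency-free.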


  • GREEN's local models perform on par with the GPT-4 baseline


The Preference Analysis

This is the unsettling part of the study.

While error counting provides systematic evaluation, true clinical utility depends on alignment with human preferences.


  1. Raw error count is a noticeably worse metric than GREEN.
    • Yet the authors use the error-count correlation as “proof” that GREEN aligns with human judgment…
  2. Direct preference by GPT-4 is REALLY bad, but why? If it is what I think it is, reversing GPT-4's preference would actually give the best metric.
    • This is worth more investigation, maybe more than the GREEN score itself, since everything GREEN learns comes from GPT-4.
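The flipped-preference observation is easy to make concrete. A hypothetical pairwise-agreement check (my own framing, not the paper's exact protocol): score both reports in each pair with a metric, and count how often the metric's ranking matches the radiologist's stated preference. If a judge's agreement is far below 50%, inverting it yields a judge with agreement far above 50%.

```python
def preference_accuracy(scores_a, scores_b, radiologist_prefers_a):
    """Fraction of report pairs where the metric ranks A above B exactly
    when the radiologist preferred A. An accuracy well below 0.5 means the
    reversed judge (1 - accuracy) would agree with radiologists instead."""
    agree = sum(
        (sa > sb) == pref
        for sa, sb, pref in zip(scores_a, scores_b, radiologist_prefers_a)
    )
    return agree / len(scores_a)
```

For example, a judge that disagrees on every pair has accuracy 0.0, so flipping its verdicts gives a perfect 1.0, which is why a strongly anti-correlated GPT-4 preference is itself an interesting signal.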

Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.