Takeaways
- Solid motivation: current radiology report generation (RRG) metrics rely on NLG overlap scores, which do NOT reflect a report's factual accuracy.
- Okay solution: train a small LLM judge to score a generated report by counting findings that are inconsistent with the reference.
- More investigation required: the LLM's preferences are largely the opposite of the radiologists'. Why? And could that disagreement be a metric by itself?
Method
Objective:
Evaluate the accuracy of a candidate radiology report in comparison to a reference
radiology report composed by expert radiologists.
Process Overview:
You will be presented with:
1. The criteria for making a judgment.
2. The reference radiology report.
3. The candidate radiology report.
4. The desired format for your assessment.
1. Criteria for Judgment:
For each candidate report, determine:
- The count of clinically significant errors.
- The count of clinically insignificant errors.
Errors can fall into one of the following categories:
a) A false report of a finding in the candidate.
b) Missing a finding that is present in the reference.
c) Misidentification of a finding's anatomical location/position.
d) Misassessment of the severity of a finding.
e) Mentioning a comparison that isn't in the reference.
f) Omitting a comparison detailing a change from a prior study.
Note: Concentrate on the clinical findings rather than the report's writing style.
Evaluate only the findings that appear in both reports.
2. Reference Report:
**Reference Report**
3. Candidate Report:
**Candidate Report**
4. Reporting Your Assessment:
Follow this specific format for your output, even if no errors are found:
[Explanation]:
<Explanation>
[Clinically Significant Errors]:
(a) <Error Type>: <The number of errors>. <Error 1>; <Error 2>; ...; <Error n>
....
(f) <Error Type>: <The number of errors>. <Error 1>; <Error 2>; ...; <Error n>
[Clinically Insignificant Errors]:
(a) <Error Type>: <The number of errors>. <Error 1>; <Error 2>; ...; <Error n>
....
(f) <Error Type>: <The number of errors>. <Error 1>; <Error 2>; ...; <Error n>
[Matched Findings]:
<The number of matched findings>. <Finding 1>; <Finding 2>; ...; <Finding n>
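The structured output format above is straightforward to parse mechanically. A hypothetical sketch of such a parser (the section names follow the template above, but the regexes and the function itself are my own assumptions, not the paper's code):

```python
import re

def parse_error_counts(response: str) -> dict:
    """Extract per-category error counts from the judge's structured output.

    Expects sections like:
      [Clinically Significant Errors]:
      (a) <Error Type>: 2. <Error 1>; <Error 2>
    Returns {"significant": {...}, "insignificant": {...}, "matched": int}.
    """
    def section_counts(name: str) -> dict:
        # Grab the text between this bracketed header and the next one.
        m = re.search(rf"\[{name}\]:\s*(.*?)(?=\n\[|\Z)", response, re.S)
        counts = {}
        if m:
            # Each line looks like "(a) <Error Type>: <count>. ..."
            for cat, n in re.findall(r"\(([a-f])\)[^:]*:\s*(\d+)", m.group(1)):
                counts[cat] = int(n)
        return counts

    matched = re.search(r"\[Matched Findings\]:\s*(\d+)", response)
    return {
        "significant": section_counts("Clinically Significant Errors"),
        "insignificant": section_counts("Clinically Insignificant Errors"),
        "matched": int(matched.group(1)) if matched else 0,
    }
```

The fixed `[Section]:` headers and `(a)`–`(f)` prefixes are what make the "follow this specific format even if no errors are found" instruction important: a free-form answer would break this kind of downstream counting.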
- Distilling GPT-4's analysis of common findings/errors into a smaller LLM.
- The generative evaluator: fine-tune smaller open models (LLaMA-2 7B and Phi-2 variants, "RadLLaMA-2" and "RadPhi-2") on 100k report pairs that GPT-4 has annotated for six error categories:
- False finding
- Missed finding
- Wrong location
- Wrong severity
- Hallucinated comparison
- Missed comparison
- GREEN score formula: basically an IoU between predicted findings and GT findings, with an adjustment weight λ < 1 that discounts minor (clinically insignificant) errors relative to major ones.
- Bounded in [0, 1]; higher is better.
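A minimal sketch of how such a weighted IoU-style score could be computed from the parsed counts. Note that λ and the exact aggregation here are assumptions based on the description above; the paper's formula may differ in detail:

```python
def green_score(matched: int, significant: int, insignificant: int,
                lam: float = 0.5) -> float:
    """IoU-style score: matched findings over matched findings plus errors.

    lam < 1 discounts clinically insignificant errors relative to
    significant ones. Returns a value in [0, 1]; higher is better.
    NOTE: a sketch following the notes above, not the paper's exact formula.
    """
    weighted_errors = significant + lam * insignificant
    denom = matched + weighted_errors
    if denom == 0:
        # No findings and no errors: treat as perfect agreement.
        return 1.0
    return matched / denom

print(green_score(matched=4, significant=2, insignificant=2))  # 4/7 ≈ 0.571
```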
Result
The Correlation Analysis
Using the ReXVal dataset with assessments from six board-certified radiologists:
- 200 report pairs from 50 MIMIC-CXR (Yu et al., 2023b) test cases
- 0.63 correlation with the mean expert assessment
- Individual expert correlations: 0.48-0.64
- GREEN local models show similar performance to the GPT-4 baseline
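As a sanity check on what "correlation with expert assessment" means here: a higher metric score should track a lower mean expert error count, i.e. a strong negative rank correlation. A toy illustration with a hand-rolled Kendall τ (the numbers and the choice of τ are illustrative assumptions, not the paper's data or statistic):

```python
def kendall_tau(xs, ys):
    """Kendall rank correlation (no tie correction) between two sequences."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            concordant += s > 0
            discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-report-pair data (ReXVal provides the real annotations):
metric_scores = [0.9, 0.7, 0.4, 0.85, 0.2, 0.55]
mean_expert_errors = [0.5, 1.5, 3.0, 1.0, 4.5, 2.0]

print(kendall_tau(metric_scores, mean_expert_errors))  # -1.0 for this toy data
```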
The Preference Analysis
This is the unsettling part of the study.
While error counting provides systematic evaluation, true clinical utility depends on alignment with human preferences.
- Raw error counting is an inferior metric to GREEN by a noticeable margin.
- Yet the author uses the error-count correlation as "proof" that GREEN aligns well with human judgment...
- Direct preference by GPT-4 is REALLY BAD. But why? If it is what I think it is, then *reversing* GPT-4's preference would actually give the best metric.
- This is definitely worth more investigation, maybe more than the GREEN score itself, since everything GREEN learns comes from GPT-4.