Takeaways
- Solid motivation: current radiology report generation (RRG) metrics rely on NLG overlap scores, which do NOT reflect a report's factual accuracy.
- Okay solution: train a small LLM judge to score a generated report by counting findings that are inconsistent with the reference.
- More investigation required: the LLM's preferences are largely the opposite of the radiologists'. Why? And could that disagreement be a metric by itself?
Method
Objective:
Evaluate the accuracy of a candidate radiology report in comparison to a reference
radiology report composed by expert radiologists.
Process Overview:
You will be presented with:
1. The criteria for making a judgment.
2. The reference radiology report.
3. The candidate radiology report.
4. The desired format for your assessment.
1. Criteria for Judgment:
For each candidate report, determine:
- The count of clinically significant errors.
- The count of clinically insignificant errors.
Errors can fall into one of the following categories:
a) A false report of a finding in the candidate.
b) Missing a finding that is present in the reference.
c) Misidentification of a finding's anatomical location/position.
d) Misassessment of the severity of a finding.
e) Mentioning a comparison that isn't in the reference.
f) Omitting a comparison detailing a change from a prior study.
Note: Concentrate on the clinical findings rather than the report's writing style.
Evaluate only the findings that appear in both reports.
2. Reference Report:
**Reference Report**
3. Candidate Report:
**Candidate Report**
4. Reporting Your Assessment:
Follow this specific format for your output, even if no errors are found:
[Explanation]:
<Explanation>
[Clinically Significant Errors]:
(a) <Error Type>: <The number of errors>. <Error 1>; <Error 2>; ...; <Error n>
....
(f) <Error Type>: <The number of errors>. <Error 1>; <Error 2>; ...; <Error n>
[Clinically Insignificant Errors]:
(a) <Error Type>: <The number of errors>. <Error 1>; <Error 2>; ...; <Error n>
....
(f) <Error Type>: <The number of errors>. <Error 1>; <Error 2>; ...; <Error n>
[Matched Findings]:
<The number of matched findings>. <Finding 1>; <Finding 2>; ...; <Finding n>
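The structured output format above is straightforward to parse mechanically. A hypothetical sketch of such a parser (the section names follow the template above, but the regexes and the function itself are my own assumptions, not the paper's code):

```python
import re

def parse_error_counts(response: str) -> dict:
    """Extract per-category error counts from the judge's structured output.

    Expects sections like:
      [Clinically Significant Errors]:
      (a) <Error Type>: 2. <Error 1>; <Error 2>
    Returns {"significant": {...}, "insignificant": {...}, "matched": int}.
    """
    def section_counts(name: str) -> dict:
        # Grab the text between this bracketed header and the next one.
        m = re.search(rf"\[{name}\]:\s*(.*?)(?=\n\[|\Z)", response, re.S)
        counts = {}
        if m:
            # Each line looks like "(a) <Error Type>: <count>. ..."
            for cat, n in re.findall(r"\(([a-f])\)[^:]*:\s*(\d+)", m.group(1)):
                counts[cat] = int(n)
        return counts

    matched = re.search(r"\[Matched Findings\]:\s*(\d+)", response)
    return {
        "significant": section_counts("Clinically Significant Errors"),
        "insignificant": section_counts("Clinically Insignificant Errors"),
        "matched": int(matched.group(1)) if matched else 0,
    }
```

The fixed `[Section]:` headers and `(a)`–`(f)` prefixes are what make the "follow this specific format even if no errors are found" instruction important: a free-form answer would break this kind of downstream counting.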
- Distilling GPT-4's analysis of common findings/errors into a smaller LLM.
- The generative evaluator: fine-tune smaller open models (LLaMA-2 7B and Phi-2 variants, "RadLLaMA-2" and "RadPhi-2") on 100k report pairs that GPT-4 has annotated for six error categories:
- False finding
- Missed finding
- Wrong location
- Wrong severity
- Hallucinated comparison
- Missed comparison
- GREEN score formula: basically an IoU between predicted findings and GT findings, with an adjustment weight λ < 1 that discounts minor (clinically insignificant) errors relative to major ones.
- Bounded in [0, 1]; higher is better.
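A minimal sketch of how such a weighted IoU-style score could be computed from the parsed counts. Note that λ and the exact aggregation here are assumptions based on the description above; the paper's formula may differ in detail:

```python
def green_score(matched: int, significant: int, insignificant: int,
                lam: float = 0.5) -> float:
    """IoU-style score: matched findings over matched findings plus errors.

    lam < 1 discounts clinically insignificant errors relative to
    significant ones. Returns a value in [0, 1]; higher is better.
    NOTE: a sketch following the notes above, not the paper's exact formula.
    """
    weighted_errors = significant + lam * insignificant
    denom = matched + weighted_errors
    if denom == 0:
        # No findings and no errors: treat as perfect agreement.
        return 1.0
    return matched / denom

print(green_score(matched=4, significant=2, insignificant=2))  # 4/7 ≈ 0.571
```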
Result
The Correlation Analysis
Using the ReXVal dataset with assessments from six board-certified radiologists:
- 200 report pairs from 50 MIMIC-CXR (Yu et al., 2023b) test cases
- 0.63 correlation with the mean expert assessment
- Individual expert correlations: 0.48-0.64
- GREEN local models show similar performance to the GPT-4 baseline
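As a sanity check on what "correlation with expert assessment" means here: a higher metric score should track a lower mean expert error count, i.e. a strong negative rank correlation. A toy illustration with a hand-rolled Kendall τ (the numbers and the choice of τ are illustrative assumptions, not the paper's data or statistic):

```python
def kendall_tau(xs, ys):
    """Kendall rank correlation (no tie correction) between two sequences."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            concordant += s > 0
            discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-report-pair data (ReXVal provides the real annotations):
metric_scores = [0.9, 0.7, 0.4, 0.85, 0.2, 0.55]
mean_expert_errors = [0.5, 1.5, 3.0, 1.0, 4.5, 2.0]

print(kendall_tau(metric_scores, mean_expert_errors))  # -1.0 for this toy data
```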
The Preference Analysis
This is the unsettling part of the study.
While error counting provides systematic evaluation, true clinical utility depends on alignment with human preferences.
- Raw error counting is an inferior metric to GREEN by a noticeable margin.
- Yet the author uses the error-count correlation as "proof" that GREEN aligns well with human judgment...
- Direct preference by GPT-4 is REALLY BAD. But why? If it is what I think it is, then *reversing* GPT-4's preference would actually give the best metric.
- This is definitely worth more investigation, maybe more than the GREEN score itself, since everything GREEN learns comes from GPT-4.