Successor to GEMeX
Data
- 206,071 ThinkVG instances for 21,994 images
- Open-ended = 62,712
- Closed-ended = 37,340
- Single-choice = 53,905
- Multi-choice = 52,114
- Generated with the DeepSeek-R1 text-only reasoning model
- Generated with the following prompt, optimized for 50 x 4 question types
messages = [{"role": "user", "content": f'''Suppose you are viewing a CXR that shows the following: "The hilar contours are normal [visual location: bilateral hilar structures ([116, 112, 227, 182])] ...". Given the question: "YOUR QUESTION", provide a detailed thinking process (around 100 words), including a specific visual location (e.g., (region [x1,y1,x2,y2])) about how to solve this question with answer "ANSWER". You must assume that you are viewing the CXR image rather than reading the textual findings, thus, do not output words like "observe the report" or "from report" or "report states" or "given findings" or "provided findings" or "described findings".'''}]
- Verified text-reasoning
RL
messages = [{"role": "user", "content": f'''We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. For your reference, the visual content in the image is represented with a caption describing the same image. Please rate the accuracy (most important) and relevance of their responses, considering both answer and reason (if any). Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. Please output both the scores and your reason in JSON format {{"assistant1": score, "assistant2": score, "reason": your reason}}.'''}]
- GRPO
- Semantic reward
- Uses OpenBio-LLM-70B to grade the generated answer vs. ground truth on a 1-10 scale
- If the two scores differ by < 2 → reward = 1, else 0
- Treats free-text and multiple choice uniformly
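As I read it, the semantic reward reduces to a threshold on the gap between the judge's two 1-10 scores. A minimal sketch (function name and the default threshold are my own labeling of the "< 2" rule above):

```python
def semantic_reward(pred_score: float, gt_score: float, threshold: float = 2.0) -> float:
    """Binary semantic reward: the judge (OpenBio-LLM-70B in the paper)
    scores the generated answer and the ground-truth answer on a 1-10
    scale; reward is 1 when the scores are within `threshold`, else 0."""
    return 1.0 if abs(pred_score - gt_score) < threshold else 0.0
```

Because only the score gap matters, free-text and multiple-choice answers go through the same code path.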
- Grounding reward
- Checks that the number of predicted boxes matches GT
- Computes mean IoU; if > 0.75 → reward = 1
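A toy sketch of the grounding reward as described (count check, then mean IoU against a 0.75 threshold); I assume boxes are compared in order here, though the paper may pair them differently:

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_reward(pred_boxes, gt_boxes, thr=0.75):
    """Reward 1 only if the box count matches AND mean IoU > thr."""
    if len(pred_boxes) != len(gt_boxes):
        return 0.0
    if not gt_boxes:  # vacuously correct when no boxes are expected
        return 1.0
    mean_iou = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
    return 1.0 if mean_iou > thr else 0.0
```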
Results
- Training is significantly useful (RL or SFT)
- RL helps with accuracy and data utilization
- V-score still relies on more data
- SFT with vs. without ThinkVG makes no difference (row 2 vs. 3)
- RL also helps (last 2 rows)
- A-score is higher than previous versions almost across the board; perhaps a model improvement?
Robustness
- Two checkpoints with the same amount of SFT data, so the only difference is the thinking trace.
- Closed = closed-ended questions; swap polarity (is ↔ isn't, normal ↔ abnormal)
- Single = single-choice QA; shuffle the choice order
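These perturbations are simple to script. A toy sketch (helper names and the first-match-wins replacement rule are mine, not from the paper):

```python
import random

def swap_polarity(question: str) -> str:
    """Flip a closed-ended question's polarity; toy version that
    rewrites only the first matching phrase."""
    pairs = [(" isn't ", " is "), (" is ", " isn't "),
             ("abnormal", "normal"), ("normal", "abnormal")]
    for old, new in pairs:
        if old in question:
            return question.replace(old, new, 1)
    return question

def shuffle_choices(choices, answer_idx, seed=0):
    """Shuffle single-choice options and track the new answer index."""
    order = list(range(len(choices)))
    random.Random(seed).shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(answer_idx)
```

Tracking the answer index through the shuffle is the point: a robust model should pick the same option text regardless of its position.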
Personal Takes
- Reasoning and verification done in text-only space
- Not sure if region grounding actually helps (e.g., do not predict bbox)
- Not sure the visual description is actually derived from the proposed region; all of it is just decoded from ALL visual embeddings
- “Losing textual-visual fidelity” mentioned in Reason Like a Radiologist
GEMeX
A dataset for VQA/VLM created by cleaning Chest ImaGenome for:
- Normalized locations (Table 8)
- Deleted small parts (carina, right clavicle, left clavicle, aortic arch)
- Merged multiple parts → left mid + left lower = left mid-to-lower lung zone
- One-to-one sentence-location match
- Leveraged GPT-4o to generate large and diverse QA
- Total 1.6M QA
- Test set is human-verified
- 300 images from the MIMIC-CXR test set
- Initially accompanied by 3,291 questions automatically generated by GPT-4o
- Radiologists corrected 10 incorrect answers and adjusted 3 inaccurate location annotations
- Contributed approximately 600 new questions.
- Results on fine-tuning
- A-score = accuracy
- V-score = mIoU of Pred vs. GT (for multiple choice) with Hungarian matching
- Seems low, but maybe instance mismatch really hurts the metric?
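To make the instance-mismatch concern concrete, here is a sketch of an order-invariant mIoU: each GT box is matched one-to-one to a predicted box so the summed IoU is maximized (brute force stands in for the Hungarian algorithm; fine for the handful of boxes per answer). Count mismatches still drag the score down because unmatched GT boxes contribute 0:

```python
from itertools import permutations

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def v_score(pred_boxes, gt_boxes):
    """mIoU under the best one-to-one assignment of preds to GT boxes."""
    if not gt_boxes:
        return 0.0
    # Pad with None so every GT box has an assignment slot even when
    # there are fewer predictions than GT boxes (those slots score 0).
    slots = list(range(len(pred_boxes))) + [None] * max(0, len(gt_boxes) - len(pred_boxes))
    best = 0.0
    for perm in permutations(slots, len(gt_boxes)):
        total = sum(0.0 if p is None else iou(pred_boxes[p], gt_boxes[i])
                    for i, p in enumerate(perm))
        best = max(best, total)
    return best / len(gt_boxes)
```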
- AR-score = using GPT-4o as a judge
- Model has high AR but low accuracy = format issue?