Successor to GEMeX
Data
- 206,071 ThinkVG instances for 21,994 images
- Open-ended = 62,712
- Closed-ended = 37,340
- Single-choice = 53,905
- Multi-choice = 52,114
- Generated with the DeepSeek-R1 text-only reasoning model
- Generated with the following prompt, optimized for 50 x 4 question types
messages = [{"role": "user", "content": f'''Suppose you are viewing a CXR that shows the following: "The hilar contours are normal [visual location: bilateral hilar structures ([116, 112, 227, 182])] ...". Given the question: "YOUR QUESTION", provide a detailed thinking process (around 100 words), including a specific visual location (e.g., (region [x1,y1,x2,y2])) about how to solve this question with answer "ANSWER". You must assume that you are viewing the CXR image rather than reading the textual findings, thus, do not output words like "observe the report" or "from report" or "report states" or "given findings" or "provided findings" or "described findings".'''}]
- Verified text-reasoning
RL
messages = [{"role": "user", "content": f'''We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. For your reference, the visual content in the image is represented with a caption describing the same image. Please rate the accuracy (most important) and relevance of their responses, considering both answer and reason (if any). Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. Please output both the scores and your reason in JSON format {{"assistant1": score, "assistant2": score, "reason": your reason}}.'''}]
- GRPO
- Semantic reward
- Uses OpenBio-LLM-70B to grade the generated answer vs. ground truth on a 1-10 scale
- If the two scores differ by < 2 → reward = 1, else 0
- Treats free-text and multiple choice uniformly
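As I read it, the semantic reward reduces to a threshold on the gap between the judge's two 1-10 scores. A minimal sketch (function name and the default threshold are my own labeling of the "< 2" rule above):

```python
def semantic_reward(pred_score: float, gt_score: float, threshold: float = 2.0) -> float:
    """Binary semantic reward: the judge (OpenBio-LLM-70B in the paper)
    scores the generated answer and the ground-truth answer on a 1-10
    scale; reward is 1 when the scores are within `threshold`, else 0."""
    return 1.0 if abs(pred_score - gt_score) < threshold else 0.0
```

Because only the score gap matters, free-text and multiple-choice answers go through the same code path.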
- Grounding reward
- Checks that the number of predicted boxes matches GT
- Computes mean IoU; if > 0.75 → reward = 1
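A toy sketch of the grounding reward as described (count check, then mean IoU against a 0.75 threshold); I assume boxes are compared in order here, though the paper may pair them differently:

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_reward(pred_boxes, gt_boxes, thr=0.75):
    """Reward 1 only if the box count matches AND mean IoU > thr."""
    if len(pred_boxes) != len(gt_boxes):
        return 0.0
    if not gt_boxes:  # vacuously correct when no boxes are expected
        return 1.0
    mean_iou = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
    return 1.0 if mean_iou > thr else 0.0
```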
Results
- Training is significantly useful (RL or SFT)
- RL helps with accuracy and data utilization
- V-score still relies on more data
- SFT with vs. without ThinkVG makes no difference (row 2 vs. 3)
- RL also helps (last 2 rows)
- A-score is higher than previous versions almost across the board; perhaps a model improvement?
Robustness
- Two checkpoints with the same amount of SFT data, so the only difference is the thinking trace.
- Closed = closed-ended questions; swap polarity (is ↔ isn't, normal ↔ abnormal)
- Single = single-choice QA; shuffle the choice order
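These perturbations are simple to script. A toy sketch (helper names and the first-match-wins replacement rule are mine, not from the paper):

```python
import random

def swap_polarity(question: str) -> str:
    """Flip a closed-ended question's polarity; toy version that
    rewrites only the first matching phrase."""
    pairs = [(" isn't ", " is "), (" is ", " isn't "),
             ("abnormal", "normal"), ("normal", "abnormal")]
    for old, new in pairs:
        if old in question:
            return question.replace(old, new, 1)
    return question

def shuffle_choices(choices, answer_idx, seed=0):
    """Shuffle single-choice options and track the new answer index."""
    order = list(range(len(choices)))
    random.Random(seed).shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(answer_idx)
```

Tracking the answer index through the shuffle is the point: a robust model should pick the same option text regardless of its position.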
Personal Takes
- Reasoning and verification done in text-only space
- Not sure if region grounding actually helps (e.g., do not predict bbox)
- Not sure the visual description is actually derived from the proposed region; all of it is just decoded from ALL visual embeddings
- “Losing textual-visual fidelity” mentioned in Reason Like a Radiologist
GEMeX
A dataset for VQA/VLM created by cleaning Chest ImaGenome for:
- Normalized locations (Table 8)
- Deleted small parts (carina, right clavicle, left clavicle, aortic arch)
- Merged multiple parts → left mid + left lower = left mid-to-lower lung zone
- One-to-one sentence-location match
- Leveraged GPT-4o to generate large and diverse QA
- Total 1.6M QA
- Test set is human-verified
- 300 images from the MIMIC-CXR test set
- Initially accompanied by 3,291 questions automatically generated by GPT-4o
- Radiologists corrected 10 incorrect answers and adjusted 3 inaccurate location annotations
- Contributed approximately 600 new questions.
- Results on fine-tuning
- A-score = accuracy
- V-score = mIoU of Pred vs. GT (for multiple choice) with Hungarian matching
- Seems low, but maybe instance mismatch really hurts the metric?
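To make the instance-mismatch concern concrete, here is a sketch of an order-invariant mIoU: each GT box is matched one-to-one to a predicted box so the summed IoU is maximized (brute force stands in for the Hungarian algorithm; fine for the handful of boxes per answer). Count mismatches still drag the score down because unmatched GT boxes contribute 0:

```python
from itertools import permutations

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def v_score(pred_boxes, gt_boxes):
    """mIoU under the best one-to-one assignment of preds to GT boxes."""
    if not gt_boxes:
        return 0.0
    # Pad with None so every GT box has an assignment slot even when
    # there are fewer predictions than GT boxes (those slots score 0).
    slots = list(range(len(pred_boxes))) + [None] * max(0, len(gt_boxes) - len(pred_boxes))
    best = 0.0
    for perm in permutations(slots, len(gt_boxes)):
        total = sum(0.0 if p is None else iou(pred_boxes[p], gt_boxes[i])
                    for i, p in enumerate(perm))
        best = max(best, total)
    return best / len(gt_boxes)
```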
- AR-score = using GPT-4o as a judge
- Model has high AR but low accuracy = format issue?