
30 Nov 2021 ~ 4 min read

GEMeX-ThinkVG


GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

Successor to GEMeX

Data

  • 206,071 ThinkVG instances for 21,994 images

    • Open-ended = 62,712
    • Closed-ended = 37,340
    • Single-choice = 53,905
    • Multi-choice = 52,114
  • Generated with the DeepSeek-R1 text-only reasoning model

    • Generated with the following prompt, optimized for the 50 × 4 question types

      ```python
      messages = [{
          "role": "user",
          "content": f'Suppose you are viewing a CXR that shows the following: '
                     f'"The hilar contours are normal [visual location: bilateral hilar '
                     f'structures ([116, 112, 227, 182])] ...". Given the question: '
                     f'"YOUR QUESTION", provide a detailed thinking process (around 100 '
                     f'words), including a specific visual location (e.g., (region '
                     f'[x1,y1,x2,y2])) about how to solve this question with answer '
                     f'"ANSWER". You must assume that you are viewing the CXR image '
                     f'rather than reading the textual findings, thus, do not output '
                     f'words like "observe the report" or "from report" or "report '
                     f'states" or "given findings" or "provided findings" or '
                     f'"described findings".'
      }]
      ```

  • The generated text reasoning is then verified


RL

  • GRPO

  • Semantic reward

    • Uses OpenBioLLM-70B to score the generated answer and the ground-truth answer on a 1-10 scale
    • If the two scores differ by less than 2 → reward = 1, else 0
    • Treats free-text and multiple-choice answers uniformly
  • Grounding reward

    • Checks that the number of predicted boxes matches the GT
    • Computes the mean IoU; reward = 1 if it exceeds 0.75, else 0
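The two reward rules above can be sketched in a few lines of Python. The OpenBioLLM judging step is out of scope here, so `semantic_reward` just takes the two 1-10 scores as inputs; pairing predicted and GT boxes by order (rather than by matching) is my assumption:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def semantic_reward(pred_score, gt_score):
    """1 if the judge's 1-10 scores differ by less than 2, else 0."""
    return 1.0 if abs(pred_score - gt_score) < 2 else 0.0

def grounding_reward(pred_boxes, gt_boxes, thresh=0.75):
    """1 if the box counts match and the mean IoU exceeds the threshold."""
    if len(pred_boxes) != len(gt_boxes) or not gt_boxes:
        return 0.0
    mean_iou = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
    return 1.0 if mean_iou > thresh else 0.0
```

Both rewards are binary, which keeps the GRPO group advantages simple.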

Results


  • Training helps significantly, whether via RL or SFT
  • RL improves both accuracy and data efficiency
  • V-score still depends mainly on data scale
    • With vs. without ThinkVG makes little difference (row 2 vs. row 3)
    • RL also helps (last two rows)
  • A-score is higher than previous versions almost across the board; perhaps a base-model improvement?

Robustness


  • Two checkpoints trained on the same amount of SFT data, so the only difference is the thinking trace.
  • Closed = closed-ended questions, perturbed by swapping polarity (is vs. isn't, normal vs. abnormal)
  • Single = single-choice QA, perturbed by shuffling the choice order
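The two perturbations can be sketched as simple string/list transforms. The function names and the naive polarity-swap heuristic are mine, for illustration only:

```python
import random

def shuffle_choices(choices, answer_idx, seed=0):
    """Shuffle option order for a single-choice question; return the
    shuffled options and the new index of the correct answer."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    return [choices[i] for i in order], order.index(answer_idx)

def flip_polarity(question, answer):
    """Swap 'is' <-> "isn't" in a closed-ended question; the yes/no
    answer flips with it. (Naive first-occurrence string swap.)"""
    if " isn't " in question:
        q = question.replace(" isn't ", " is ", 1)
    else:
        q = question.replace(" is ", " isn't ", 1)
    return q, {"Yes": "No", "No": "Yes"}[answer]
```

A robust model should keep its answer consistent under both transforms.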

Personal Takes

  • Reasoning and verification happen entirely in text space
    • Unclear whether region grounding actually helps (e.g., an ablation that does not predict bboxes would tell)
    • Unclear whether the visual descriptions are actually derived from the proposed regions; they are all just decoded from ALL visual embeddings

GEMeX

GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

A dataset for VQA/VLM created by cleaning Chest ImaGenome for:

  • Normalized locations (Table 8)
    • Deleted small parts (carina, right clavicle, left clavicle, aortic arch)
    • Merged multiple parts → left mid + left lower = left mid-to-lower lung zone
  • One-to-one sentence-location match
  • Leveraged GPT-4o to generate large and diverse QA


  • Total 1.6M QA


  • Test set is human-verified

    • 300 images from the MIMIC-CXR test set
    • Initially accompanied by 3,291 questions automatically generated by GPT-4o
    • Radiologists fixed 10 incorrect answers and adjusted 3 inaccurate location annotations
    • They also contributed approximately 600 new questions
  • Results on fine-tuning

    • A-score = answer accuracy
    • V-score = mIoU of predicted vs. GT boxes, with Hungarian matching when there are multiple boxes
      • Seems low, but maybe a box-count mismatch really hurts the metric?
    • AR-score = using GPT-4o as a judge
      • Model has high AR but low accuracy = format issue?
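As I understand it, the V-score can be sketched as follows. I brute-force the Hungarian matching over permutations, which is fine for the handful of boxes per question; the normalization for mismatched box counts (unmatched boxes count as IoU 0) is my guess:

```python
from itertools import permutations

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def v_score(pred_boxes, gt_boxes):
    """mIoU under the best one-to-one matching of predicted to GT boxes.
    Unmatched boxes on either side contribute an IoU of 0."""
    if len(pred_boxes) < len(gt_boxes):  # IoU is symmetric, so swap sides
        pred_boxes, gt_boxes = gt_boxes, pred_boxes
    n = max(len(pred_boxes), 1)
    best = 0.0
    # try every assignment of GT boxes to distinct predicted boxes
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(iou(pred_boxes[pi], g) for pi, g in zip(perm, gt_boxes))
        best = max(best, total)
    return best / n
```

Under this reading, predicting one extra spurious box halves a perfect score, which would explain why V-scores look low.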



Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.