TL;DR
Yes, I am throwing the buzzword “reasoning” at you, and on top of that I claim I want to build a fancy “medical diagnosis agent” with it. The ideology presented here is largely inspired by three things:
- the necessary condition: Zhang Xiangyu’s talk on multimodal & reasoning, which convinced me that reasoning is necessary for image understanding.
- the sufficient condition: [the sequential paper], which convinced me that LLMs have the ability to reason and plan.
- the approach: [o3 place detective blog], which made me realize this approach can work.
The following are just my logical steps and the specifics of the claim.
My blabla
The logic is pretty straightforward:
- the current way of “solving” imaging tasks is running into scalability issues in two ways:
  a. the data scarcity issue: as the performance of the model increases, it actually becomes harder to curate more diverse data, and to label it
  b. no matter how good a model we develop for one task, it’s pretty much a one-time deal; its transferability to other applications is rather limited
- I argue the challenge itself stems from the goal of solving two problems at the same time, namely: what does it look like, and what is it?
From working closely with annotators for years, I quickly realized that labeling medical images can roughly be divided into two general types:
- the laborious work: e.g. finding abnormalities. The key to this line of work is patience and dexterity rather than expertise; a short amount of background training (within days if not hours) is usually sufficient for annotators to get started. Examples range from scrolling slice by slice through vast abdominal CT scans for “blob-like” lung nodules to, at the extreme, shading pixel by pixel all the tubular lung vessels. Unsurprisingly, it is sometimes the annotators WITHOUT clinical expertise who produce better quality annotations, due to the sheer amount of attention and energy they invest.
- the expertise work: e.g. differential diagnosis. This is where real clinical expertise comes in, and it takes years to master.
Why not a simple MLLM
To put it simply, I don’t think we are there yet:
- there is a SIGNIFICANT gap in localization/dense prediction tasks between MLLMs and traditional CV models 1
- MLLM vision tokens are sensitive to layer/resolution/pretraining, even with massive data and model size 2. It is not only practical but efficient to take advantage of existing small/single-task models, and MAYBE slowly merge everything together.
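The division of labor argued for above (specialist models for “what does it look like”, a reasoning model for “what is it”) can be sketched as a small dispatch loop. This is a minimal illustration only: the specialist models are stubbed with placeholder lambdas, and the `reason` step stands in for an actual prompted LLM; none of the names here refer to real APIs.

```python
# Sketch of the proposed split: small single-task models handle dense
# perception, and a reasoning model consumes their outputs.
# All model names and outputs below are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    tool: str
    result: str


# Registry of existing small/single-task models (stubbed out here).
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "nodule_detector": lambda scan: "2 nodules, largest 8 mm",
    "vessel_segmenter": lambda scan: "vessel mask computed",
}


def perceive(scan_path: str) -> List[Finding]:
    """Run every specialist model: the 'what does it look like' half."""
    return [Finding(name, fn(scan_path)) for name, fn in SPECIALISTS.items()]


def reason(findings: List[Finding]) -> str:
    """Placeholder for the LLM step: the 'what is it' half.

    In practice this would be a prompted reasoning model performing
    differential diagnosis, not string concatenation.
    """
    summary = "; ".join(f"{f.tool}: {f.result}" for f in findings)
    return f"Findings -> {summary}. Next: correlate with patient history."


if __name__ == "__main__":
    print(reason(perceive("ct_abdomen_001.nii.gz")))
```

The point of the structure is that each specialist can be swapped or retrained independently, while the reasoning layer stays the same, which is exactly the transferability that a monolithic per-task model lacks.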