This is an interview[1] with Zhang Xiangyu, a long-time research scientist. It is not a traditional peer-reviewed scientific paper per se; in my opinion, it is a more elaborate, Ilya-level talk. You should put yourself in a very intellectual mode/environment and listen/read into it; it is worth your time.
Podwise has already generated many highlights and summaries automatically, so I am not going to copy-paste them here.
I will, however, list the most important lessons I drew from the talk, and my fuzzy thoughts related to them.
1. Image alone scales poorly
This is actually a bigger claim than the original talk, which is more about "self-supervised learning for images scales poorly". First of all, there is NO SUCH THING as UNsupervised learning. LLMs have strong supervision in two forms: 1. language by itself is well structured, and 2. selecting which data to train on is a form of annotation.[2] And Zhangxy's insight is that SSL in vision is actually supervised, and hence limited by its developers' knowledge of crafting augmentations. More specifically:
- Contrastive learning essentially learns a HUMAN's understanding of object invariance to color/scale/aspect ratio/etc.
- Masked Autoencoder (MAE) essentially learns occlusion invariance.

But there is more we would hope for: physics, composition, object relationships, causal effects. SSL on video may have a shot at solving this, but its information density is too low[3] to brute-force the problem at the moment. A more efficient solution is yet to come.
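The "hidden supervision" point can be made concrete with the InfoNCE objective used in contrastive learning. A minimal numpy sketch (toy embeddings of my own, not from the talk): note that the ONLY label is the human-designed augmentation that decides which pair counts as "the same image".

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor's positive is an augmented view of the
    same image; all other images in the batch act as negatives. The only
    supervision is the human-chosen augmentation defining the pairing."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = positive pairs

# Toy embeddings: row i of `pos` is a perturbed view of row i of `anc`,
# standing in for a color-jitter/crop augmentation.
rng = np.random.default_rng(0)
anc = rng.normal(size=(8, 32))
pos = anc + 0.05 * rng.normal(size=(8, 32))
print(info_nce(anc, pos))  # low loss: views of the same image align
```

Swap the augmentation and you swap the invariance the model learns, which is exactly why the method is bounded by its designer's imagination.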
And now come my personal takes on the "supervised learning ALSO scales poorly" statement. Though Google DeepMind,[4] Meta,[5] and BAAI[6] all show some improvements from MASSIVE scaling of the vision encoder, the improvements are, in my opinion, nice-to-have. I doubt they bump many applications from the not-okay to okay, or okay to great, tier of usage. I even believe image resolution (hence the number of tokens utilized) plays a bigger role in image understanding; at least it does result in a significant difference in the holistic-understanding vs. localization/OCR tradeoff.[7]
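To make the resolution-vs-token-count point concrete, here is a back-of-the-envelope calculation. It assumes a ViT-style encoder with 14×14 patches (a common setup; the patch size is my assumption, not from the talk):

```python
def vit_token_count(height, width, patch=14):
    """Number of visual tokens a ViT-style encoder emits for an image,
    assuming the image divides evenly into patch x patch squares."""
    return (height // patch) * (width // patch)

# Doubling resolution quadruples the token (and attention-compute) budget:
print(vit_token_count(224, 224))  # 16 * 16 = 256 tokens
print(vit_token_count(448, 448))  # 32 * 32 = 1024 tokens
print(vit_token_count(896, 896))  # 64 * 64 = 4096 tokens
```

So higher resolution buys the language model far more visual tokens to attend over, which plausibly matters more for OCR/localization than another billion encoder parameters.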
2. The LLM inverse scaling phenomenon
This is a concurrent insight from the back[3]-and-forth[4] discussion on "emergent ability",[2] and Xiangyu's take is more intuitive to follow: when an LLM gets larger, it DOES have better intuition for the (reasoning) task, hence it prefers to take the shortcut of spelling out the answer directly, which incurs a higher error rate than faithfully solving the problem step by step. When properly instructed (CoT), the larger model IS better.
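A toy error model (my own illustration with made-up numbers, not from the talk) shows why the shortcut loses even when each individual reasoning step is easy: a one-shot guess at 70% accuracy is beaten by ten chained steps at 99% each.

```python
def chain_accuracy(step_acc, n_steps):
    """Probability that a chain of independent reasoning steps is
    entirely correct: each step must succeed for the answer to hold."""
    return step_acc ** n_steps

shortcut = 0.70                       # direct "spell out the answer" guess
stepwise = chain_accuracy(0.99, 10)   # ten careful CoT steps
print(f"shortcut={shortcut:.3f}  stepwise={stepwise:.3f}")
# stepwise (~0.904) beats the shortcut (0.700)
```

Under this framing, "inverse scaling" is just a bigger model being confident enough to skip the chain, and CoT prompting restores the step-by-step regime where it wins.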
This makes me wonder whether vision can be solved with data + compute at all. After all, a native vision model DOES NOT have a DYNAMIC "reasoning circuit" the way an LLM can spend more compute by reasoning with CoT. NOTE: VLMs[8] seem to have some reasoning ability via auto-regressive next-image prediction, but with a fixed budget.
3. The power of reasoning model
This is the most fascinating part of the talk. It is no secret:
- WHAT: RL works[9]
- HOW: DeepSeek's GRPO[10] made it widely popular
- But WHY? It is too simple to be true; why didn't it happen earlier?
Xiangyu's insight is that NO NEW KNOWLEDGE is introduced in post-training (SFT or RL) anyway. RL is just a better way to elicit EXISTING PATTERNS (like CoT) out of pre-training, and the aha-moment words ("wait…") are common phrases in high-quality math discussion data. This combines with test-time compute, which allows the model to dynamically spend more tokens on "thinking paths" with higher success rates, and also to recover from previously chosen wrong paths.
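Part of why GRPO feels "too simple to be true" is its advantage estimate: sample a group of completions per prompt and score each one against the group itself, with no learned value network. A minimal numpy sketch of that normalization, as I understand the R1 recipe:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: for one prompt, sample a
    group of completions, then A_i = (r_i - mean(r)) / std(r).
    The group itself serves as the baseline; no critic is trained."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled answers scored 0/1 by a verifier:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# correct answers get a positive advantage, wrong ones negative
```

Nothing here injects new knowledge; it only pushes up whichever already-sampled chains happened to end in the right answer, which is exactly the "eliciting existing patterns" story.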
My intuition is that:
- Instead of slamming hard on tokens, which are multi-modal in nature and carry non-reducible variance, we would hope to supervise the thinking circuit itself.
- RL is a better way (than SFT) to reinforce concepts the model already has.
4. The path to o3-level image reasoning is still largely UNKNOWN
Xiangyu’s insight
Though there are many
References
- TODO ↩
- I think this is from RBG’s talk somewhere, but I can’t find it ↩
- TODO: Link to MAE in video paper ↩
- TODO: Link to scaling siglip unlock mutli-cultual paper ↩
- TODO: Link to some scaling paper ↩
- TODO: Link to EVA22b? paper ↩
- TODO: Link to META perceptor? paper ↩
- TODO: Link to MAE in video paper ↩
- TODO: SFT memorize, RL generalize ↩
- TODO: R1 paper ↩
- TODO: Link to MAE in video paper ↩