This is an interview[1] with Zhang Xiangyu, a long-time research scientist. It is not a traditional peer-reviewed scientific paper per se; in my opinion, it is a more elaborate, Ilya-level talk. You should put yourself in a very intellectual mode/environment and listen/read into it; it is worth your time.
Podwise has already generated many highlights and summaries automatically, so I am not going to copy-paste them here.
I will, however, list the most important lessons I drew from the talk, and my fuzzy thoughts related to them.
1. Image alone scales poorly
This is actually a bigger claim than the original talk, which is more about "self-supervised learning for images scales poorly". First of all, there is NO SUCH THING as UNsupervised learning. LLMs have strong supervision in two forms: 1. language by itself is well structured, and 2. selecting which data to train on is a form of annotation.[2] And Zhangxy's insight is that SSL in vision is actually supervised, and hence limited by its developers' knowledge of crafting augmentations. More specifically:
- Contrastive learning essentially learns a HUMAN's understanding of object invariance to color/scale/aspect ratio/etc.
- Masked Autoencoder (MAE) essentially learns occlusion invariance.

But there is more we would hope for: physics, composition, object relationships, causal effects. SSL on video may have a shot at solving this, but its information density is too low[3] to brute-force the problem at the moment. A more efficient solution is yet to come.
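The "hidden supervision" point can be made concrete with the InfoNCE objective used in contrastive learning. A minimal numpy sketch (toy embeddings of my own, not from the talk): note that the ONLY label is the human-designed augmentation that decides which pair counts as "the same image".

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor's positive is an augmented view of the
    same image; all other images in the batch act as negatives. The only
    supervision is the human-chosen augmentation defining the pairing."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # diagonal = positive pairs

# Toy embeddings: row i of `pos` is a perturbed view of row i of `anc`,
# standing in for a color-jitter/crop augmentation.
rng = np.random.default_rng(0)
anc = rng.normal(size=(8, 32))
pos = anc + 0.05 * rng.normal(size=(8, 32))
print(info_nce(anc, pos))  # low loss: views of the same image align
```

Swap the augmentation and you swap the invariance the model learns, which is exactly why the method is bounded by its designer's imagination.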
And now come my personal takes on the "supervised learning ALSO scales poorly" statement. Though Google DeepMind,[4] Meta,[5] and BAAI[6] all show some improvements from MASSIVE scaling of the vision encoder, the improvements are, in my opinion, nice-to-have. I doubt they bump many applications from the not-okay to okay, or okay to great, tier of usage. I even believe image resolution (hence the number of tokens utilized) plays a bigger role in image understanding; at least it does result in a significant difference in the holistic-understanding vs. localization/OCR tradeoff.[7]
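To make the resolution-vs-token-count point concrete, here is a back-of-the-envelope calculation. It assumes a ViT-style encoder with 14×14 patches (a common setup; the patch size is my assumption, not from the talk):

```python
def vit_token_count(height, width, patch=14):
    """Number of visual tokens a ViT-style encoder emits for an image,
    assuming the image divides evenly into patch x patch squares."""
    return (height // patch) * (width // patch)

# Doubling resolution quadruples the token (and attention-compute) budget:
print(vit_token_count(224, 224))  # 16 * 16 = 256 tokens
print(vit_token_count(448, 448))  # 32 * 32 = 1024 tokens
print(vit_token_count(896, 896))  # 64 * 64 = 4096 tokens
```

So higher resolution buys the language model far more visual tokens to attend over, which plausibly matters more for OCR/localization than another billion encoder parameters.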
2. The LLM inverse scaling phenomenon
This is a concurrent insight from the back[3]-and-forth[4] discussion on "emergent ability",[2] and Xiangyu's take is more intuitive to follow: when an LLM gets larger, it DOES have better intuition for the (reasoning) task, hence it prefers to take the shortcut of spelling out the answer directly, which incurs a higher error rate than faithfully solving the problem step by step. When properly instructed (CoT), the larger model IS better.
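A toy error model (my own illustration with made-up numbers, not from the talk) shows why the shortcut loses even when each individual reasoning step is easy: a one-shot guess at 70% accuracy is beaten by ten chained steps at 99% each.

```python
def chain_accuracy(step_acc, n_steps):
    """Probability that a chain of independent reasoning steps is
    entirely correct: each step must succeed for the answer to hold."""
    return step_acc ** n_steps

shortcut = 0.70                       # direct "spell out the answer" guess
stepwise = chain_accuracy(0.99, 10)   # ten careful CoT steps
print(f"shortcut={shortcut:.3f}  stepwise={stepwise:.3f}")
# stepwise (~0.904) beats the shortcut (0.700)
```

Under this framing, "inverse scaling" is just a bigger model being confident enough to skip the chain, and CoT prompting restores the step-by-step regime where it wins.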
This makes me wonder whether vision can be solved with data + compute at all. After all, a native vision model DOES NOT have a DYNAMIC "reasoning circuit" the way an LLM can spend more compute by reasoning with CoT. NOTE: VLMs[8] seem to have some reasoning ability via auto-regressive next-image prediction, but with a fixed budget.
3. The power of reasoning model
This is the most fascinating part of the talk. It is no secret:
- WHAT: RL works[9]
- HOW: DeepSeek's GRPO[10] made it widely popular
- But WHY? It is too simple to be true; why didn't it happen earlier?
Xiangyu's insight is that NO NEW KNOWLEDGE is introduced in post-training (SFT or RL) anyway. RL is just a better way to elicit EXISTING PATTERNS (like CoT) out of pre-training, and the aha-moment words ("wait…") are common phrases in high-quality math discussion data. This combines with test-time compute, which allows the model to dynamically spend more tokens on "thinking paths" with higher success rates, and also to recover from previously chosen wrong paths.
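Part of why GRPO feels "too simple to be true" is its advantage estimate: sample a group of completions per prompt and score each one against the group itself, with no learned value network. A minimal numpy sketch of that normalization, as I understand the R1 recipe:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: for one prompt, sample a
    group of completions, then A_i = (r_i - mean(r)) / std(r).
    The group itself serves as the baseline; no critic is trained."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled answers scored 0/1 by a verifier:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# correct answers get a positive advantage, wrong ones negative
```

Nothing here injects new knowledge; it only pushes up whichever already-sampled chains happened to end in the right answer, which is exactly the "eliciting existing patterns" story.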
My intuition is that:
- Instead of slamming hard on tokens, which are multi-modal in nature and carry non-reducible variance, we would hope to supervise the thinking circuit itself.
- RL is a better way (than SFT) to reinforce concepts the model already has.
4. The path to o3-level image reasoning is still largely UNKNOWN
Xiangyu’s insight
Though there are many
References
- TODO ↩
- I think this is from RBG’s talk somewhere, but I can’t find it ↩
- TODO: Link to MAE in video paper ↩
- TODO: Link to scaling siglip unlock mutli-cultual paper ↩
- TODO: Link to some scaling paper ↩
- TODO: Link to EVA22b? paper ↩
- TODO: Link to META perceptor? paper ↩
- TODO: Link to MAE in video paper ↩
- TODO: SFT memorize, RL generalize ↩
- TODO: R1 paper ↩
- TODO: Link to MAE in video paper ↩