This is an interview with Zhang Xiangyu, a long-time research scientist. It is not a traditional peer-reviewed scientific paper; in my opinion, it’s a more elaborate, Ilya-level talk 13 . I highly recommend listening to or reading it with your full attention—it’s worth your time.
Podwise 10 already has many automatically generated highlights and summaries, so I won’t copy and paste them here. Instead, I will list the most important lessons I drew from the talk, along with my fuzzy thoughts around them.
1. Image alone scales poorly
This is actually a bigger claim than the original talk, which focused more on “self-supervised learning for images scales poorly”.
First of all, there is NO SUCH THING as UNSUPERVISED learning. LLMs have strong supervision in two forms: 1) language itself is well-structured, and 2) selecting which data to train on is a form of annotation. 1
Zhang’s insight is that SSL in vision is actually supervised and hence limited by the developers’ knowledge of crafting augmentations. More specifically:
- Contrastive Learning is essentially learning a HUMAN’S understanding of object invariance regarding color, scale, aspect ratio, etc.
- A Masked Autoencoder is essentially learning occlusion invariance
But there’s more we would hope the model to learn: physics, composition, object relationships, and causal effects
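To make the “augmentations encode human priors” point concrete, here is a minimal, hypothetical sketch of a contrastive-learning view pipeline (not any specific paper’s recipe; the transform names and image representation are my own assumptions). Each transform is a HUMAN decision about what should not change an image’s identity—the model can only learn the invariances we choose to bake in:

```python
import random

def color_jitter(img, strength=0.4):
    # encodes a human prior of color invariance:
    # scale every pixel by a random brightness factor
    f = 1.0 + random.uniform(-strength, strength)
    return [[min(1.0, p * f) for p in row] for row in img]

def random_crop(img, size):
    # encodes a human prior of scale/position invariance:
    # take a random sub-window of the image
    top = random.randint(0, len(img) - size)
    left = random.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]

def two_views(img, crop=2):
    # contrastive learning treats these two views as "the same object",
    # so the learned invariances are exactly the ones we hand-crafted
    return (random_crop(color_jitter(img), crop),
            random_crop(color_jitter(img), crop))

# a toy 4x4 grayscale "image" with values in [0, 1]
img = [[0.1 * (r + c) for c in range(4)] for r in range(4)]
v1, v2 = two_views(img)
```

Physics, composition, or causality never appear in such a pipeline, which is exactly the limitation Zhang points at.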
SSL in video may have a shot at solving this, but its information density is too low 2 to brute-force the problem at the moment. More efficient solutions are yet to come.
Now comes my personal take on the “supervised learning ALSO scales poorly” statement. Though Google DM 3 , META 4 , and BAAI 5 also show improvements from MASSIVE scaling of vision encoders, these improvements are, in my opinion, somewhat nice-to-have. I doubt they bump many applications from the not-okay → okay → great tier of usage. I even believe image resolution (and hence the number of tokens utilized) plays a bigger role in image understanding—at least it produces significant differences in the holistic-understanding vs. localization/OCR trade-off. 6
2. The LLM inverse scaling phenomenon
This is a concurrent insight from the back-and-forth 14 discussion 15 on “emergent abilities” 11 , and Zhang’s version is more intuitive to follow: as LLMs get larger, they DO develop better intuition for solving (reasoning) tasks, and hence prefer the shortcut of spelling out answers directly, which incurs higher error rates than faithfully solving problems step by step. When properly instructed (CoT), larger models ARE better.
This makes me wonder if vision can be solved with data and compute at all. After all, the native vision model does not have DYNAMIC “reasoning circuits” that can reason with more compute via CoT. NOTE: A VLM 7 seems to have reasoning ability with autoregressive next-image prediction, but with a fixed budget.
3. The power of a reasoning model (with RL)
This is the most fascinating part of the talk. It is no secret:
- WHAT: RL works 8
- HOW: DeepSeek’s GRPO 9 makes it widely popular
- BUT WHY? It seems too simple to be true—why didn’t it happen earlier?
Zhang’s insight is that NO NEW KNOWLEDGE is introduced in post-training (SFT or RL) anyway. RL simply does a better job of eliciting EXISTING PATTERNS (like CoT) from pre-training, and the “aha moment” words (like “wait…”) are common phrases in high-quality math discussion data. Combined with test-time compute, this lets models dynamically spend more tokens on “thinking paths” with higher success rates, and adapt away from previously wrong choices.
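For intuition on why GRPO 9 needs no new knowledge (or even a learned value model), here is a minimal sketch of its group-relative advantage, as I understand it from the DeepSeek reports—this is my own simplification, not their actual code. Several answers are sampled per prompt, scored by a verifier, and rewards are normalized within the group, so the gradient only reweights behaviors the model could already produce:

```python
def group_advantages(rewards, eps=1e-6):
    # normalize each sampled answer's reward against its own group:
    # advantage = (reward - group mean) / group std
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. 4 sampled answers to one math prompt;
# only the first passes the verifier and earns reward 1.0
adv = group_advantages([1.0, 0.0, 0.0, 0.0])
```

The correct sample gets a positive advantage and the wrong ones get negative advantages that sum to roughly zero, so training pushes probability mass toward sampling paths that already succeed—elicitation, not new knowledge.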
My intuition is that:
- Instead of hammering on tokens, which are multimodal in nature and carry irreducible variance, we should supervise the thinking circuits
- RL is a better way (than SFT) to search and elicit existing patterns
- In layman’s terms, RL explores better solutions from the model’s point of view and helps the model develop a better “intuition” in the solution space instead of the token space
Update: I found this fascinating explanation of WHY RL works 16

4. The path to reasoning with images
OpenAI’s O3 is living testament 12 to the fact that reasoning with images is possible. If you haven’t seen it, be sure to watch it—truly special, even magical.
Just to be clear, O3’s image implementation is still largely UNKNOWN.
Zhang’s insight is that it PROBABLY can be traced back to pre-training as well:
- There are lots of explanation patterns for images
- These patterns often involve annotation or simply zooming to regions of interest, e.g., fixing electronics, labeling the special usage of tools, etc.
- O3’s thinking patterns can be seen as successful elicitation of cropping/zooming/annotation patterns
- As a side note, image manipulation is implemented with live coding in Python, though I think more structured, parameterized function calls would be more appropriate.
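To illustrate what a structured alternative to free-form live coding could look like, here is a hypothetical parameterized zoom tool (the interface, names, and normalized-box convention are all my own assumptions, not O3’s actual implementation). The model would emit a typed call instead of arbitrary Python:

```python
from dataclasses import dataclass

@dataclass
class ZoomCall:
    # region of interest as (left, top, right, bottom), normalized to [0, 1]
    box: tuple
    scale: float = 2.0  # magnification applied after cropping

def apply_zoom(call, width, height):
    # translate the normalized call into a concrete pixel crop
    # and the output size after magnification
    l, t, r, b = call.box
    crop = (int(l * width), int(t * height),
            int(r * width), int(b * height))
    out = (int((crop[2] - crop[0]) * call.scale),
           int((crop[3] - crop[1]) * call.scale))
    return crop, out

# zoom into the center quarter of a 1024x768 image at 2x
crop, out = apply_zoom(ZoomCall(box=(0.25, 0.25, 0.75, 0.75)), 1024, 768)
```

A schema like this is easier to validate, log, and sandbox than executing model-written Python, which is the trade-off the side note is pointing at.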
References
- I think this is from RBG’s talk somewhere, but I can’t find it ↩
- TODO: Link to MAE in video paper ↩
- TODO: Link to scaling siglip unlock multi-cultural paper ↩
- TODO: Link to some scaling paper ↩
- TODO: Link to EVA22b? paper ↩
- TODO: Link to META perceptor? paper ↩
- TODO: Link to MAE in video paper ↩
- TODO: SFT memorizes, RL generalizes ↩
- TODO: R1 paper ↩
- https://podwise.ai/dashboard/episodes/4209997 ↩
- https://www.jasonwei.net/blog/common-arguments-regarding-emergent-abilities ↩
- https://simonwillison.net/2025/Apr/26/o3-photo-locations/ ↩
- Why Next-Token Prediction Could Surpass Human Intelligence: https://www.youtube.com/watch?v=Yf1o0TQzry8 ↩
- Inverse Scaling: When Bigger Isn’t Better: https://arxiv.org/pdf/2306.09479 ↩
- Inverse scaling can become U-shaped: https://arxiv.org/pdf/2211.02011 ↩
- [UCLA RL-LLM] Chapter 0: Course outline and prologue: https://www.youtube.com/watch?v=q9972BRoXzQ ↩