This is an interview with Zhang Xiangyu, a long-time research scientist. It is not a traditional peer-reviewed scientific paper; in my opinion, it’s a more elaborate, Ilya-level talk 13 . I highly recommend listening to or reading it with your full attention—it’s worth your time.
Podwise 10 already has many automatically generated highlights and summaries, so I won’t copy and paste them here. Instead, I will list the most important lessons I drew from the talk, along with my fuzzy thoughts around them.
1. Image alone scales poorly
This is actually a bigger claim than the original talk, which focused more on “self-supervised learning for images scales poorly”.
First of all, there is NO SUCH THING as UNSUPERVISED learning. LLMs have strong supervision in two forms: 1) language itself is well-structured, and 2) selecting which data to train on is a form of annotation. 1
Zhang’s insight is that SSL in vision is actually supervised and hence limited by the developers’ knowledge of crafting augmentations. More specifically:
- Contrastive Learning is essentially learning a HUMAN’S understanding of object invariance regarding color, scale, aspect ratio, etc.
- A Masked Autoencoder is essentially learning occlusion invariance
But there’s more we would hope the model to learn: physics, composition, object relationships, and causal effects
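To make the “augmentations encode human priors” point concrete, here is a minimal, hypothetical sketch of a contrastive-learning view pipeline (not any specific paper’s recipe; the transform names and image representation are my own assumptions). Each transform is a HUMAN decision about what should not change an image’s identity—the model can only learn the invariances we choose to bake in:

```python
import random

def color_jitter(img, strength=0.4):
    # encodes a human prior of color invariance:
    # scale every pixel by a random brightness factor
    f = 1.0 + random.uniform(-strength, strength)
    return [[min(1.0, p * f) for p in row] for row in img]

def random_crop(img, size):
    # encodes a human prior of scale/position invariance:
    # take a random sub-window of the image
    top = random.randint(0, len(img) - size)
    left = random.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]

def two_views(img, crop=2):
    # contrastive learning treats these two views as "the same object",
    # so the learned invariances are exactly the ones we hand-crafted
    return (random_crop(color_jitter(img), crop),
            random_crop(color_jitter(img), crop))

# a toy 4x4 grayscale "image" with values in [0, 1]
img = [[0.1 * (r + c) for c in range(4)] for r in range(4)]
v1, v2 = two_views(img)
```

Physics, composition, or causality never appear in such a pipeline, which is exactly the limitation Zhang points at.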
SSL in video may have a shot at solving this, but its information density is too low 2 to brute-force the problem at the moment. More efficient solutions are yet to come.
Now comes my personal take on the “supervised learning ALSO scales poorly” statement. Though Google DM 3 , META 4 , and BAAI 5 also show improvements from MASSIVE scaling of vision encoders, these improvements are, in my opinion, somewhat nice-to-have. I doubt they bump many applications from the not-okay → okay → great tier of usage. I even believe image resolution (and hence the number of tokens utilized) plays a bigger role in image understanding—at least it produces significant differences in the holistic-understanding vs. localization/OCR trade-off. 6
2. The LLM inverse scaling phenomenon
This is a concurrent insight from the back-and-forth 14 discussion 15 on “emergent abilities” 11 , and Zhang’s version is more intuitive to follow: as LLMs get larger, they DO develop better intuition for solving (reasoning) tasks, and hence prefer the shortcut of spelling out answers directly, which incurs higher error rates than faithfully solving problems step by step. When properly instructed (CoT), larger models ARE better.
This makes me wonder if vision can be solved with data and compute at all. After all, the native vision model does not have DYNAMIC “reasoning circuits” that can reason with more compute via CoT. NOTE: A VLM 7 seems to have reasoning ability with autoregressive next-image prediction, but with a fixed budget.
3. The power of a reasoning model (with RL)
This is the most fascinating part of the talk. It is no secret:
- WHAT: RL works 8
- HOW: DeepSeek’s GRPO 9 makes it widely popular
- BUT WHY? It seems too simple to be true—why didn’t it happen earlier?
Zhang’s insight is that NO NEW KNOWLEDGE is introduced in post-training (SFT or RL) anyway. RL simply does a better job of eliciting EXISTING PATTERNS (like CoT) from pre-training, and the “aha moment” words (like “wait…”) are common phrases in high-quality math discussion data. Combined with test-time compute, this lets models dynamically spend more tokens on “thinking paths” with higher success rates, and adapt away from previously wrong choices.
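For intuition on why GRPO 9 needs no new knowledge (or even a learned value model), here is a minimal sketch of its group-relative advantage, as I understand it from the DeepSeek reports—this is my own simplification, not their actual code. Several answers are sampled per prompt, scored by a verifier, and rewards are normalized within the group, so the gradient only reweights behaviors the model could already produce:

```python
def group_advantages(rewards, eps=1e-6):
    # normalize each sampled answer's reward against its own group:
    # advantage = (reward - group mean) / group std
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. 4 sampled answers to one math prompt;
# only the first passes the verifier and earns reward 1.0
adv = group_advantages([1.0, 0.0, 0.0, 0.0])
```

The correct sample gets a positive advantage and the wrong ones get negative advantages that sum to roughly zero, so training pushes probability mass toward sampling paths that already succeed—elicitation, not new knowledge.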
My intuition is that:
- Instead of hammering on tokens, which are multimodal in nature and carry irreducible variance, we should supervise the thinking circuits
- RL is a better way (than SFT) to search and elicit existing patterns
- In layman’s terms, RL explores better solutions from the model’s point of view and helps the model develop a better “intuition” in the solution space instead of the token space
Update: I found this fascinating explanation of WHY RL works 16

4. The path to reasoning with images
OpenAI’s O3 is living testament 12 to the fact that reasoning with images is possible. If you haven’t seen it, be sure to watch it—truly special, even magical.
Just to be clear, O3’s image implementation is still largely UNKNOWN.
Zhang’s insight is that it PROBABLY can be traced back to pre-training as well:
- There are lots of explanation patterns for images
- These patterns often involve annotation or simply zooming to regions of interest, e.g., fixing electronics, labeling the special usage of tools, etc.
- O3’s thinking patterns can be seen as successful elicitation of cropping/zooming/annotation patterns
- As a side note, image manipulation is implemented with live coding in Python, though I think more structured, parameterized function calls would be more appropriate.
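To illustrate what a structured alternative to free-form live coding could look like, here is a hypothetical parameterized zoom tool (the interface, names, and normalized-box convention are all my own assumptions, not O3’s actual implementation). The model would emit a typed call instead of arbitrary Python:

```python
from dataclasses import dataclass

@dataclass
class ZoomCall:
    # region of interest as (left, top, right, bottom), normalized to [0, 1]
    box: tuple
    scale: float = 2.0  # magnification applied after cropping

def apply_zoom(call, width, height):
    # translate the normalized call into a concrete pixel crop
    # and the output size after magnification
    l, t, r, b = call.box
    crop = (int(l * width), int(t * height),
            int(r * width), int(b * height))
    out = (int((crop[2] - crop[0]) * call.scale),
           int((crop[3] - crop[1]) * call.scale))
    return crop, out

# zoom into the center quarter of a 1024x768 image at 2x
crop, out = apply_zoom(ZoomCall(box=(0.25, 0.25, 0.75, 0.75)), 1024, 768)
```

A schema like this is easier to validate, log, and sandbox than executing model-written Python, which is the trade-off the side note is pointing at.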
References
- I think this is from RBG’s talk somewhere, but I can’t find it ↩
- TODO: Link to MAE in video paper ↩
- TODO: Link to scaling siglip unlock multi-cultural paper ↩
- TODO: Link to some scaling paper ↩
- TODO: Link to EVA22b? paper ↩
- TODO: Link to META perceptor? paper ↩
- TODO: Link to MAE in video paper ↩
- TODO: SFT memorizes, RL generalizes ↩
- TODO: R1 paper ↩
- https://podwise.ai/dashboard/episodes/4209997 ↩
- https://www.jasonwei.net/blog/common-arguments-regarding-emergent-abilities ↩
- https://simonwillison.net/2025/Apr/26/o3-photo-locations/ ↩
- Why Next-Token Prediction Could Surpass Human Intelligence: https://www.youtube.com/watch?v=Yf1o0TQzry8 ↩
- Inverse Scaling: When Bigger Isn’t Better: https://arxiv.org/pdf/2306.09479 ↩
- Inverse scaling can become U-shaped: https://arxiv.org/pdf/2211.02011 ↩
- [UCLA RL-LLM] Chapter 0: Course outline and prologue: https://www.youtube.com/watch?v=q9972BRoXzQ ↩