
30 Nov 2021 ~ 5 min read

Anatomical Positional Embedding


I have been sitting on this idea for a while. I confess it’s still a bit of a stretch. I am sharing it only because I find it very interesting, if not inspiring. So, hear me out.

TL;DR

The goals of an anatomical embedding are two-fold:

  1. Pass “global” structural context along to local patches (and to an LLM?)
  2. Retrieve the most relevant patch when queried from an LLM (e.g., “take a second look at the left ventricle”)

What is this for?

Before talking about what it is, I want to clarify WHY we need a different positional embedding in the first place. Traditional learned or sinusoidal positional embeddings have worked pretty well for natural images.

This all circles back to the “large data space, low sample size” issue.

The common practice is to process the image in a sliding-window manner, especially when the application deals with fine texture (e.g., small lesion analysis), although research shows that larger context windows usually help. [1]
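To make the sliding-window idea concrete, here is a minimal NumPy sketch that tiles a 3D volume into overlapping cubic patches. The patch and stride sizes are placeholder assumptions; production pipelines typically add Gaussian-weighted blending of overlapping predictions.

```python
import numpy as np

def sliding_window_patches(volume, patch=64, stride=32):
    """Yield (origin, patch) pairs covering a 3D volume with overlap."""
    D, H, W = volume.shape
    for z in range(0, max(D - patch, 0) + 1, stride):
        for y in range(0, max(H - patch, 0) + 1, stride):
            for x in range(0, max(W - patch, 0) + 1, stride):
                yield (z, y, x), volume[z:z+patch, y:y+patch, x:x+patch]

ct = np.zeros((128, 128, 128), dtype=np.float32)  # stand-in CT volume
origins = [o for o, _ in sliding_window_patches(ct)]
```

Each patch carries only its local content; the origin coordinates are exactly the kind of "global" information a positional embedding would need to inject.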

When dealing with partial images, traditional positional embeddings encode limited information—perhaps only the relative permutation among patches.

In reality, there are more important contexts that can and should be carried into a local patch analysis: the relative position with respect to the whole body and the general variation/topology of the body region.

Hence the proposal for an anatomical embedding.

Pillars of a good positional embedding

If you haven’t paid attention to how positional embeddings have evolved, please take a look at designing-positional-encoding.

The author stated that the desirable properties of a good positional embedding for a GENERAL SEQUENCE are:

  1. UNIQUE encoding for each position
  2. INTERPOLATABLE between two encoded positions
  3. GENERALIZABLE to longer sequences
  4. DETERMINISTIC with respect to input
  5. EXTENSIBLE to multiple dimensions

I find that almost all of them still apply if we simply swap the semantics of a “position” in physical space for an “anatomical position” in the human body. (Except #3: space beyond the human body is meaningless.)
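The classic sinusoidal encoding already satisfies uniqueness, interpolatability, and determinism; a sketch of it, with the anatomical twist being only what we feed in as `pos` (a normalized body-axis coordinate rather than a patch index — my assumption, not an established scheme):

```python
import numpy as np

def sinusoidal_embedding(pos, dim=16, base=10000.0):
    """Classic Transformer encoding; `pos` may be fractional (interpolatable)."""
    i = np.arange(dim // 2)
    freq = base ** (-2 * i / dim)           # geometric frequency ladder
    angles = np.asarray(pos)[..., None] * freq
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Anatomical variant: pos could be a coordinate in [0, 1] along the body
# axis (0 = head, 1 = feet) instead of an integer patch index.
e_mid_chest = sinusoidal_embedding(0.25)
e_mid_thigh = sinusoidal_embedding(0.75)
```

Because the input is bounded to the body, property #3 (generalization to longer sequences) indeed becomes moot, as noted above.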

Anatomy vs Segmentation Embedding

The next immediate question is: how is this different from (or related to) semantic segmentation of body regions, on which there is already abundant research?

From an alignment point of view, it is a trade-off between input pixel space and “parameterized” structural space.

Specifically, anatomy focuses more on the holistic representation, while segmentation focuses more on pixel-perfect mapping. This difference is subtle in most cases but becomes prominent in ambiguous regions: anatomy infers TOP-DOWN from the overall body arrangement, while segmentation mostly infers BOTTOM-UP from visual cues.

(Example suggested by Gemini, which I found appropriate)

Consider an ambiguous region on a CT scan, like the boundary between the liver and the right kidney.

A segmentation model (bottom-up) might struggle if the pixel intensities are similar, potentially mislabeling the boundary.

An anatomical embedding (top-down) would provide the context that the patch is located where the superior pole of the kidney typically abuts the inferior surface of the liver. This “knowledge” of the global body plan provides a strong prior that can resolve local ambiguity.

As a final remark, making the embedding useful to language models requires extra consideration. This means that (with an additional projection layer?) we should be able to prompt textual information out of such an embedding.

Potential Implementation

To build this, we could use a Variational Autoencoder (VAE), which is great at learning compressed representations of data. The key is to train it with three distinct goals, or “supervisions,” to ensure the final embedding captures all the context we need:

  • Pixel-level Alignment: To make sure the embedding understands the local tissue, we use a standard reconstruction loss: a decoder must be able to reconstruct the original image patch from the embedding. There are many VAE variants [2]; I am currently leaning towards the beta-VAE for its simplicity and its controllable trade-off between reconstruction quality and disentanglement, which may suit anatomical embeddings as well.

  • Structural Alignment: To ensure the embedding knows where it is in the grand scheme of the body, we use a technique inspired by Masked Autoencoders (MAE) [3]. The model would need to predict the overall organ structure (e.g., a full-body mask) from just a few local patch embeddings, forcing it to learn the global anatomical layout.

  • Language Alignment: To make the embedding searchable via text prompts (like “show me the left ventricle”), we add a third decoder. This component connects the image embedding to a language model’s embedding space, likely using a contrastive loss like SigLIP [4].

Balancing the three terms would, hopefully, yield an embedding good enough to pass anatomical context along to local patches and to an LLM.
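The three supervisions above can be sketched as loss terms in PyTorch. This is only an illustration of how they might be combined; the architectures producing `mu`, `logvar`, the mask logits, and the embeddings, as well as all weights, are assumptions on my part.

```python
import torch
import torch.nn.functional as F

def beta_vae_terms(x, recon, mu, logvar, beta=4.0):
    """Pixel-level alignment: reconstruction + beta-weighted KL (beta-VAE)."""
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

def structural_term(pred_mask_logits, body_mask):
    """Structural alignment: predict a coarse body mask from patch embeddings
    (MAE-inspired)."""
    return F.binary_cross_entropy_with_logits(pred_mask_logits, body_mask)

def siglip_term(img_emb, txt_emb, t=10.0, b=-10.0):
    """Language alignment: pairwise sigmoid loss in the style of SigLIP."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b
    n = img_emb.size(0)
    labels = 2 * torch.eye(n) - 1          # +1 on matching pairs, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

def total_loss(pix, struct, lang, w_struct=1.0, w_lang=1.0):
    """Weighted sum of the three supervisions; weights need tuning."""
    return pix + w_struct * struct + w_lang * lang
```

The weights `w_struct` and `w_lang` are exactly the "balance" in question; in practice they would likely need a schedule or per-term normalization.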

References

  1. Why is patch size important? Dealing with context variability in segmentation networks: https://openreview.net/forum?id=y6g0cu8q19#discussion
  2. From Autoencoder to Beta-VAE: https://lilianweng.github.io/posts/2018-08-12-vae/
  3. Masked Autoencoders Are Scalable Vision Learners: https://arxiv.org/abs/2111.06377
  4. Sigmoid Loss for Language Image Pre-Training: https://arxiv.org/abs/2303.15343

Hi, I'm Qianyi. I'm an ML engineer based in Beijing. Read more about me on my website.