Seeing Superposition
#technical

I’ve been doing mech-interp for some time, and “superposition” kept appearing in everything I read. I had a basic sense of the term and wrote a brief explanation post. Still, the best way to understand something is to see it with your own eyes, so I decided to replicate an experiment from Elhage et al., 2022 (notebook). It’s not the whole paper, just the heatmap experiment. I wanted to see superposition emerge from training and to answer a few questions I had while replicating it: why the non-linearity matters, and what the off-diagonal patterns are actually showing.
Before we even try to visualize and see superposition in practice, we need to identify what kind of data we should be working with. We have the following premises regarding data that’ll lead to interpretable results:
- Sparsity: most features we observe in practice rarely occur.
- Features > Model dimensions: a constraining factor that forces the model to compress.
- Feature Importance: not all features are equally important for a given task.
We set up a small model with $n$ features and $m$ hidden dimensions, where $n > m$ (the paper's heatmap experiment uses $n = 20$ and $m = 5$). We also need to vary the sparsity level and assign a different importance to each feature.
As for the synthetic data, the input vectors $x$ simulate the premises above. Every $x_i$ (which is a “feature”) has an associated sparsity $S_i$ and importance $I_i$. Every $x_i$ equals $0$ with probability $S_i$ and is uniformly distributed on $[0, 1]$ otherwise. As for the importance, the paper uses geometric decay: $I_i = 0.7^i$. The base $0.7$ is arbitrarily chosen and isn’t a magic number. Looking at $I_i$, each feature carries $0.7$ times the weight of the one before it.
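The sampling rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the function names `importance` and `sample_batch` are hypothetical, indexing features from 0 is an assumption, and the batch size and sparsity value are picked only for the example.

```python
import random

def importance(n_features, base=0.7):
    """Geometric decay: feature i gets weight base**i (indexing from 0 is an assumption)."""
    return [base ** i for i in range(n_features)]

def sample_batch(batch_size, n_features, sparsity, rng):
    """Each x_i is 0 with probability `sparsity`, otherwise uniform on [0, 1)."""
    return [
        [0.0 if rng.random() < sparsity else rng.random()
         for _ in range(n_features)]
        for _ in range(batch_size)
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch = sample_batch(batch_size=1000, n_features=20, sparsity=0.9, rng=rng)

# Fraction of exactly-zero entries should sit close to the sparsity level.
zero_fraction = sum(v == 0.0 for row in batch for v in row) / (1000 * 20)
print(importance(5))
print(round(zero_fraction, 2))
```

In a real notebook this would be vectorized with NumPy or PyTorch; the list-based version just keeps the sampling rule visible.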
Importance affects the loss: errors on more significant features are penalized more heavily, so the model prioritizes representing them. The loss is:

$$L = \sum_x \sum_i I_i \,(x_i - x'_i)^2$$
Now that we have the loss function, what exactly is the model being trained to do?
The model tries to reconstruct the input $x$ with $n$ features via an $m$-dimensional space. The model looks like this:

$$x' = \mathrm{ReLU}(W^T W x + b)$$
The paper’s hypothesis is that every feature in the $n$-dimensional space can be represented in the lower $m$-dimensional one. We use the linear map $h = Wx$, where $W \in \mathbb{R}^{m \times n}$ is the weight matrix. Each column $W_i$ represents the direction of the feature $x_i$.
We use the transpose $W^T$ to map $h$ back and recover the original vector.
We also add a bias $b$ to the recovered result. The reason for doing so is to allow the model to nudge the features toward their expected values.
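To make the forward pass and the importance-weighted loss concrete, here is a small sketch in plain Python. The helper names (`forward`, `weighted_loss`) and the tiny 3-feature, 2-dimension example are assumptions made for illustration, and the loss is averaged over the batch rather than summed.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def forward(W, b, x):
    """x' = ReLU(W^T W x + b), with W stored as m rows of n columns."""
    m, n = len(W), len(W[0])
    h = [sum(W[k][j] * x[j] for j in range(n)) for k in range(m)]            # h = W x
    pre = [sum(W[k][i] * h[k] for k in range(m)) + b[i] for i in range(n)]   # W^T h + b
    return relu(pre)

def weighted_loss(W, b, batch, importance):
    """Importance-weighted squared error, averaged over the batch."""
    total = 0.0
    for x in batch:
        x_hat = forward(W, b, x)
        total += sum(I * (xi - xh) ** 2 for I, xi, xh in zip(importance, x, x_hat))
    return total / len(batch)

# Hypothetical example: features 0 and 1 each get their own axis,
# feature 2 is not represented at all.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
b = [0.0, 0.0, 0.0]
importance = [1.0, 0.7, 0.49]
batch = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]

loss = weighted_loss(W, b, batch, importance)
print(loss)
```

Only the third sample contributes to the loss here, because it is the only one that activates the unrepresented feature, and its error is scaled by that feature's importance.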
Analytical insight
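Besides showing the actual loss computed during training, the paper explains analytically why superposition occurs, using this decomposition of the loss for the linear model (no ReLU, no bias):

$$L \sim \sum_i I_i \,(1 - \|W_i\|^2)^2 + \sum_{i \neq j} I_j \,(W_j \cdot W_i)^2$$

Feature benefit (the first term) is the value the model gains from representing a feature: the term shrinks to zero as $\|W_i\|^2$ approaches $1$.
Interference (the second term) is the noise between the embeddings $W_i$ and $W_j$ when they are not orthogonal to each other.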
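As a rough illustration of how the two terms trade off, the sketch below evaluates them for two hand-written weight matrices. The function name `benefit_and_interference` and both matrices are hypothetical; the formula used is the linear-model decomposition from the paper, $\sum_i I_i (1 - \|W_i\|^2)^2$ plus $\sum_{i \neq j} I_j (W_j \cdot W_i)^2$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def columns(W):
    """Feature directions W_i: the columns of an m x n matrix stored as rows."""
    return [list(col) for col in zip(*W)]

def benefit_and_interference(W, importance):
    cols = columns(W)
    n = len(cols)
    # Zero when ||W_i|| = 1 (fully represented), I_i when W_i = 0 (ignored).
    benefit_term = sum(
        importance[i] * (1.0 - dot(cols[i], cols[i])) ** 2 for i in range(n)
    )
    # Penalty for every ordered pair of non-orthogonal directions.
    interference_term = sum(
        importance[j] * dot(cols[j], cols[i]) ** 2
        for i in range(n) for j in range(n) if i != j
    )
    return benefit_term, interference_term

importance = [1.0, 0.7, 0.49]

# Option A: drop feature 2 entirely.
W_orth = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
# Option B: let feature 2 share feature 1's direction.
W_shared = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 1.0]]

print(benefit_and_interference(W_orth, importance))
print(benefit_and_interference(W_shared, importance))
```

In this dense, linear setting, sharing a direction costs more in interference than it saves in feature benefit, which matches the observation that superposition does not show up without sparsity.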
Full derivation: from MSE to feature benefit + interference
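We start deriving from our original MSE loss:

$$L = \sum_x \sum_i I_i \,(x_i - x'_i)^2$$

Now we substitute the value of $x'$ in terms of $x$. For the linear version of the model (no ReLU, no bias), $x' = W^T W x$:

$$L = \sum_x \sum_i I_i \,\bigl(x_i - (W^T W x)_i\bigr)^2$$

Knowing that $(W^T W)_{ij} = W_i \cdot W_j$, we replace the matrix multiplication with an explicit sum:

$$x'_i = \sum_j (W_i \cdot W_j)\, x_j$$

Since $W_i \cdot W_i = \|W_i\|^2$, we substitute that for the $j = i$ case:

$$x'_i = \|W_i\|^2 x_i + \sum_{j \neq i} (W_i \cdot W_j)\, x_j$$

Following the original loss equation, the squared error for feature $i$ expands into three terms:

$$(x_i - x'_i)^2 = \underbrace{(1 - \|W_i\|^2)^2 x_i^2}_{(A)} \;-\; \underbrace{2\,(1 - \|W_i\|^2) \sum_{j \neq i} (W_i \cdot W_j)\, x_i x_j}_{(B)} \;+\; \underbrace{\Bigl(\sum_{j \neq i} (W_i \cdot W_j)\, x_j\Bigr)^2}_{(C)}$$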
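Now we take the expectation $\mathbb{E}_x$. The standard assumption in the toy model is that features are independent with $\mathbb{E}[x_i] = 0$ and $\mathbb{E}[x_i x_j] = 0$ for $i \neq j$, while $\mathbb{E}[x_i^2]$ is some constant (typically normalized so it acts as $1$, which is why it disappears below; if you don’t normalize, just carry it through as a scalar).

Term (A) survives directly:

$$\mathbb{E}\bigl[(1 - \|W_i\|^2)^2 x_i^2\bigr] = (1 - \|W_i\|^2)^2\, \mathbb{E}[x_i^2] = (1 - \|W_i\|^2)^2$$

Term (B) vanishes, because every summand contains $\mathbb{E}[x_i x_j]$ with $j \neq i$, which is zero:

$$\mathbb{E}\Bigl[2\,(1 - \|W_i\|^2) \sum_{j \neq i} (W_i \cdot W_j)\, x_i x_j\Bigr] = 2\,(1 - \|W_i\|^2) \sum_{j \neq i} (W_i \cdot W_j)\, \mathbb{E}[x_i x_j] = 0$$

Term (C) simplifies because the cross terms $\mathbb{E}[x_j x_k] = 0$ for $j \neq k$, leaving only the diagonal:

$$\mathbb{E}\Bigl[\Bigl(\sum_{j \neq i} (W_i \cdot W_j)\, x_j\Bigr)^2\Bigr] = \sum_{j \neq i} \sum_{k \neq i} (W_i \cdot W_j)(W_i \cdot W_k)\, \mathbb{E}[x_j x_k] = \sum_{j \neq i} (W_i \cdot W_j)^2$$

Putting (A), (B), and (C) back together:

$$\mathbb{E}_x\Bigl[\sum_i I_i \,(x_i - x'_i)^2\Bigr] = \sum_i I_i \,(1 - \|W_i\|^2)^2 + \sum_i \sum_{j \neq i} I_i \,(W_i \cdot W_j)^2$$

The second double sum is over pairs $(i, j)$ with $j \neq i$, which we can rewrite as $\sum_{i \neq j}$. The paper indexes the importance on the interfering feature $j$ rather than $i$ (a relabeling that leaves the sum unchanged because $W_i \cdot W_j$ is symmetric; both forms appear in the literature), giving the form quoted at the top:

$$L \sim \sum_i I_i \,(1 - \|W_i\|^2)^2 + \sum_{i \neq j} I_j \,(W_j \cdot W_i)^2$$

The $\sim$ rather than $=$ absorbs the constant $\mathbb{E}[x_i^2]$ from the normalization assumption.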
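One way to sanity-check the derivation is to compare the expected loss of the linear model against the closed form for a small example. The sketch below is an illustration under stated assumptions: inputs are taken to be independent signs in $\{-1, +1\}$ (zero mean, unit second moment), so the expectation can be computed exactly by enumerating every input; the weight matrix and the function names are made up for the example.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_loss(W, importance, x):
    """sum_i I_i (x_i - x'_i)^2 for the linear model x' = W^T W x."""
    cols = [list(c) for c in zip(*W)]
    n = len(cols)
    total = 0.0
    for i in range(n):
        x_hat_i = sum(dot(cols[i], cols[j]) * x[j] for j in range(n))
        total += importance[i] * (x[i] - x_hat_i) ** 2
    return total

def closed_form(W, importance):
    """Feature benefit term plus interference term from the derivation."""
    cols = [list(c) for c in zip(*W)]
    n = len(cols)
    benefit = sum(importance[i] * (1.0 - dot(cols[i], cols[i])) ** 2 for i in range(n))
    interference = sum(
        importance[j] * dot(cols[j], cols[i]) ** 2
        for i in range(n) for j in range(n) if i != j
    )
    return benefit + interference

# Arbitrary 2 x 3 weights, chosen only for illustration.
W = [[0.9, 0.3, -0.5],
     [0.1, 0.8, 0.4]]
importance = [1.0, 0.7, 0.49]

# Enumerate all sign vectors: an exact expectation under the assumed input distribution.
signs = list(product([-1.0, 1.0], repeat=3))
expected_loss = sum(linear_loss(W, importance, x) for x in signs) / len(signs)

print(expected_loss, closed_form(W, importance))
```

The two numbers agree up to floating-point error. Note that this checks the algebra under the zero-mean assumption; the actual training data, uniform on $[0, 1]$, does not have zero mean.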
Visualization
Across the panels we’re looking at $W^T W$, which is the matrix of pairwise dot products between feature directions.
As sparsity increases, the model starts to represent more features, but more interference also emerges. The diagonal entries $\|W_i\|^2$ drive the feature benefit term, while the off-diagonal entries $W_i \cdot W_j$ drive the interference term.
We see that in the densest case, only the diagonal entries for the 5 most important features are highlighted. As sparsity increases, more diagonal entries light up, and at the same time more off-diagonal entries show up. In the later, sparser panels, a dense block forms in the bottom-right: the low-importance features group together while the high-importance features keep cleaner directions.
The model forms these patterns because as sparsity rises, interference is paid less in the expectation (because features fire less often when sparse), so the model can represent more features at the cost of letting them share directions.
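A quick way to see what "paid less in expectation" means is to look at the mean squared activation of a single feature under the data distribution described earlier ($0$ with probability $S$, uniform on $[0, 1]$ otherwise). Since $\mathbb{E}[u^2] = 1/3$ for $u$ uniform on $[0, 1]$, this comes out to $(1 - S)/3$. The sketch below checks that with a Monte Carlo estimate; the function name and the sparsity values are illustrative assumptions.

```python
import random

def mean_square_activation(sparsity, n_samples, rng):
    """Monte Carlo estimate of E[x^2] for x = 0 w.p. sparsity, else uniform on [0, 1)."""
    total = 0.0
    for _ in range(n_samples):
        x = 0.0 if rng.random() < sparsity else rng.random()
        total += x * x
    return total / n_samples

rng = random.Random(0)  # fixed seed for repeatability
for S in (0.0, 0.7, 0.9, 0.99):
    estimate = mean_square_activation(S, 50_000, rng)
    exact = (1.0 - S) / 3.0
    print(S, round(estimate, 4), round(exact, 4))
```

The factor shrinks linearly with $1 - S$, so the same overlap between two directions produces far less expected interference at high sparsity.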
One thing worth noting is the colors of the off-diagonals. The colors aren’t arbitrary: red means $W_i \cdot W_j > 0$ (pointing in similar directions), blue means $W_i \cdot W_j < 0$ (pointing in opposite directions).
We see that in the earlier heatmaps the off-diagonal cells are strongly blue. They form because the model arranges these feature directions as antipodal pairs: two feature directions pointing in exactly opposite directions, so $W_i \cdot W_j < 0$.
This actually answers a question I had been sitting with: why does the model need a non-linearity to superpose? Here’s why. When feature $j$ fires and contaminates feature $i$’s readout, an antipodal arrangement makes that contamination negative. ReLU clips it to zero, clearing the interference for free. Without ReLU, antipodal pairs would buy the model nothing, because the squared dot product $(W_i \cdot W_j)^2$ is the same in either direction.
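The smallest example of this is two features sharing a single hidden dimension as an antipodal pair. The sketch below is a hand-built illustration (the weights are chosen by hand, not learned, and the bias is omitted) comparing the readout with and without ReLU.

```python
def readout(W, x, use_relu):
    """Compute W^T W x, optionally followed by ReLU (bias omitted)."""
    m, n = len(W), len(W[0])
    h = [sum(W[k][j] * x[j] for j in range(n)) for k in range(m)]
    out = [sum(W[k][i] * h[k] for k in range(m)) for i in range(n)]
    return [max(0.0, v) for v in out] if use_relu else out

# One hidden dimension, two features pointing in opposite directions.
W = [[1.0, -1.0]]
x = [1.0, 0.0]  # only feature 0 fires

linear = readout(W, x, use_relu=False)
clipped = readout(W, x, use_relu=True)

print(linear)   # [1.0, -1.0]  feature 1 picks up negative interference
print(clipped)  # [1.0, 0.0]   ReLU removes it
```

The trick only works when the two features rarely fire together: with `x = [1.0, 1.0]` the hidden activation cancels and both readouts collapse to zero, which is why sparsity is part of the bargain.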
As the sparsity rises, we see red cells emerge. This is when angles between directions become acute and the model accepts non-cleanable interference because the features causing it are low-importance and rarely fire. The model can’t keep using antipodal arrangement since it can only have a limited number of them. For example, if we try to fit 5 features into 2 dimensions, geometry forces some pairs into acute angles.
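The 5-in-2 case can be checked directly. Five directions split the circle into five gaps that sum to 360°, so the smallest gap is at most 72°, which is acute. The sketch below uses evenly spaced unit vectors (a regular pentagon, assumed here as the most spread-out arrangement) and lists the pairs with a positive dot product.

```python
import math

def unit_directions(n):
    """n unit vectors spread evenly around the circle."""
    return [
        (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

dirs = unit_directions(5)

# Pairwise dot products, i.e. the W^T W matrix for these five directions.
gram = [[a[0] * b[0] + a[1] * b[1] for b in dirs] for a in dirs]

# Pairs at an acute angle show up as positive off-diagonal entries (red cells).
positive_pairs = [
    (i, j) for i in range(5) for j in range(i + 1, 5) if gram[i][j] > 0
]
print(positive_pairs)
print(round(gram[0][1], 3), round(gram[0][2], 3))
```

Each direction ends up with two neighbours at 72° (positive dot product) and two at 144° (negative), so even the best-spread arrangement has red cells next to the blue ones.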
Conclusion
Before the replication, I had only a rough understanding of how the model compresses features into its hidden dimensions. Now I’d say the model is using the non-linearity to filter out cheap interference, and superposition is the consequence.
The paper covers a lot more than this one experiment. I’m especially curious about geometric organization of features in superposition and how phase transitions occur between configurations.