Understanding the Impact of Random Initializations on TopK Sparse Autoencoders
In machine learning, and particularly in neural network training, initialization plays a decisive role. This article presents findings from our investigation of TopK Sparse Autoencoders (SAEs): specifically, how different random initializations can lead to divergent feature representations even when models are trained on identical data in the same batch order.
- Divergence in Latent Representations
- Interpretability of Unshared Latents
- Feature Splitting and Absorption
- Stability Across Different Architectures
- Methodology: Measuring Latent Alignment
- Latent Overlap Across Multiple Models
- Frequency of Latent Activation
- The Influence of SAE Size on Feature Overlap
- Investigating Interpretability of Unique Latents
- Conclusion
Divergence in Latent Representations
When two TopK SAEs are trained on the same data, in the same batch order, but from different random initializations, only about 53% of the features end up shared between the two models. This relatively low overlap means that many latents in one SAE have no close counterpart in the other, and vice versa. The implication is that the features an SAE learns are not fixed or universal; they can vary substantially with initialization.
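As a concrete reference point, a TopK SAE keeps only the k largest pre-activations for each input and reconstructs from those. A minimal sketch in NumPy (the function and variable names are illustrative, and some TopK variants omit the ReLU or subtract the decoder bias before encoding):

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a TopK SAE: keep only the k largest pre-activations.

    A sketch, not the trained models' code. Some TopK variants skip the
    ReLU or center x by b_dec before encoding.
    """
    pre = x @ W_enc + b_enc                  # pre-activations, shape (n_latents,)
    idx = np.argpartition(pre, -k)[-k:]      # indices of the k largest entries
    z = np.zeros_like(pre)
    z[idx] = np.maximum(pre[idx], 0.0)       # zero everything outside the top-k
    x_hat = z @ W_dec + b_dec                # reconstruction, shape (d_model,)
    return z, x_hat
```

Note that sparsity is enforced structurally here: at most k latents are ever nonzero, regardless of the input.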
Interpretability of Unshared Latents
Interestingly, many of the unshared latents exhibit interpretability. This raises the question of how different training paths can lead to distinct yet interpretable representations. Furthermore, we observed that narrower SAEs tend to have a higher overlap of features across random seeds. In contrast, as the size of the SAE increases, the degree of overlap diminishes. This trend aligns with existing literature on feature splitting and absorption, indicating that the features learned by SAEs can be somewhat arbitrary.
Feature Splitting and Absorption
The behavior of SAEs supports the idea that learned features are not atomic. Instead, different configurations can lead to various interpretations of the same latent features. As the size of the SAEs increases, we also see a phenomenon known as feature absorption, where some latents gain an “implicit” meaning alongside their “explicit” feature interpretation. This duality in representation can allow models to learn disjoint representations even when they are trained on the same data.
Stability Across Different Architectures
Our findings suggest that SAE architecture strongly affects how stable feature learning is across random seeds. Previous studies indicate that some architectures, such as ReLU SAEs trained with an L1 penalty, are notably stable across initializations. TopK SAEs, by contrast, appear to benefit from methods that explicitly align latents across seeds, highlighting the need for care in architectural choices.
Methodology: Measuring Latent Alignment
To quantify the alignment between independently trained SAEs, we employed the Hungarian algorithm. This method efficiently computes the matching between latents, maximizing the average cosine similarity between matched encoder and decoder vectors. The resulting alignment score provides a clear measure of how similarly the two models interpret the latent space.
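The matching step described above can be sketched with `scipy.optimize.linear_sum_assignment`, which implements the Hungarian algorithm. Here we match decoder rows of two SAEs by cosine similarity; the function name and the row-per-latent convention are our own assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_latents(D1, D2):
    """Match latents of two SAEs to maximize mean cosine similarity.

    D1, D2: (n_latents, d_model) decoder matrices, one direction per row.
    Returns the permutation of SAE-2 latents assigned to SAE-1 latents
    and the per-pair cosine similarities. (A sketch; the same procedure
    can be run on encoder vectors and the two matchings compared.)
    """
    D1n = D1 / np.linalg.norm(D1, axis=1, keepdims=True)
    D2n = D2 / np.linalg.norm(D2, axis=1, keepdims=True)
    sim = D1n @ D2n.T                          # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)   # negate to maximize total similarity
    return cols, sim[rows, cols]
```

Running this separately on encoder and decoder vectors yields two matchings whose agreement can itself be inspected.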
The distribution of matched cosine similarities is bimodal: one mode reflects high similarity, the other low similarity. While some latents are closely aligned across seeds, others diverge sharply. In cases where the encoder and decoder matchings disagree, the cosine similarity tends to be lower, underscoring the complexity of the latent space.
Latent Overlap Across Multiple Models
Further exploration revealed that introducing a third SAE, trained with yet another random seed, reduced the fraction of shared latents from 47% to 35%. In other words, the majority of latents shared between the first two models are also shared with the third, an interesting dynamic of latent retention across seeds.
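One way to compute such overlap fractions is to threshold the matched similarities: a latent counts as "shared" when its matched cosine similarity clears the cutoff. A sketch under that assumption (the threshold value is hypothetical, not the one used in the study):

```python
import numpy as np

def shared_mask(sims, threshold=0.7):
    """Mark SAE-1 latents as 'shared' with another seed when their matched
    cosine similarity exceeds a threshold (hypothetical cutoff)."""
    return sims >= threshold

def overlap_fractions(sims_12, sims_13):
    """Fraction of SAE-1 latents shared with SAE 2, and with both SAE 2 and SAE 3.

    sims_12, sims_13: per-latent matched similarities of SAE 1 against
    SAE 2 and SAE 3 respectively.
    """
    m12 = shared_mask(sims_12)
    m123 = m12 & shared_mask(sims_13)
    return m12.mean(), m123.mean()
```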
Frequency of Latent Activation
An important aspect of our investigation was examining the frequency of latent activation across models. We found that the latents most frequently activated in SAE 1 were also those shared with SAE 2 and SAE 3. Conversely, the latents that activated infrequently in SAE 1 were those unique to that model. Intriguingly, some latents exclusive to SAE 1 exhibited a higher average firing rate than those present across all models, hinting at a complex relationship between activation frequency and latent representation.
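The firing-rate comparison above can be sketched as follows, treating a latent as "active" on a token whenever its activation is nonzero. The boolean mask of shared latents is assumed to come from a prior matching step like the one described in the methodology:

```python
import numpy as np

def firing_rates(Z):
    """Fraction of tokens on which each latent is active (nonzero).

    Z: (n_tokens, n_latents) activation matrix from one SAE.
    """
    return (Z != 0).mean(axis=0)

def compare_rates(Z, shared):
    """Mean firing rate of shared vs. model-unique latents.

    shared: boolean mask over latents, True where the latent has a close
    counterpart in the other seeds (assumed given).
    """
    rates = firing_rates(Z)
    return rates[shared].mean(), rates[~shared].mean()
```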
The Influence of SAE Size on Feature Overlap
Our research also shows a clear relationship between SAE width and the fraction of unshared latents: even under a more lenient criterion for what counts as a shared feature, larger SAEs retained a greater number of unique features. Analyzing these larger models is computationally expensive, which further complicates the exploration of this relationship.
Investigating Interpretability of Unique Latents
To probe the interpretability of unshared latents, we used an auto-interp approach to evaluate over 7,000 latents from two 32,768-latent SAEs. The average interpretability score was a promising 0.72, with most explanations falling within a reasonable range of clarity. However, latents with low interpretability scores often also had low similarity across seeds, suggesting that latents unique to a specific initialization may not lend themselves to clear interpretation.
Conclusion
Training TopK SAEs under different random initializations yields latent representations that diverge substantially with initial conditions. These results challenge the notion of a single universal set of features and suggest treating feature discovery as a compositional problem. Continued investigation should sharpen our understanding of how architecture, initialization, and feature representation interact in neural networks.
Understanding these dynamics will help us make better use of SAEs and other machine learning models, paving the way for more robust and interpretable AI systems.

