Understanding the Impact of Random Initializations on TopK Sparse Autoencoders
In machine learning, and particularly in neural network training, initialization plays a decisive role. This article presents findings from our investigation of TopK Sparse Autoencoders (SAEs): specifically, how different random initializations can lead to divergent feature representations even when models are trained on identical data in the same batch order.
- Divergence in Latent Representations
- Interpretability of Unshared Latents
- Feature Splitting and Absorption
- Stability Across Different Architectures
- Methodology: Measuring Latent Alignment
- Latent Overlap Across Multiple Models
- Frequency of Latent Activation
- The Influence of SAE Size on Feature Overlap
- Investigating Interpretability of Unique Latents
- Conclusion
Divergence in Latent Representations
When two TopK SAEs are trained on the same data, in the same batch order, but from different random initializations, only about 53% of the features end up shared between the two models. This relatively low overlap means that many latents in one SAE have no close counterpart in the other, and vice versa. The implication is that the features an SAE learns are not fixed or universal; they can vary substantially with initialization.
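As a concrete reference point, a TopK SAE keeps only the k largest pre-activations for each input and reconstructs from those. A minimal sketch in NumPy (the function and variable names are illustrative, and some TopK variants omit the ReLU or subtract the decoder bias before encoding):

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a TopK SAE: keep only the k largest pre-activations.

    A sketch, not the trained models' code. Some TopK variants skip the
    ReLU or center x by b_dec before encoding.
    """
    pre = x @ W_enc + b_enc                  # pre-activations, shape (n_latents,)
    idx = np.argpartition(pre, -k)[-k:]      # indices of the k largest entries
    z = np.zeros_like(pre)
    z[idx] = np.maximum(pre[idx], 0.0)       # zero everything outside the top-k
    x_hat = z @ W_dec + b_dec                # reconstruction, shape (d_model,)
    return z, x_hat
```

Note that sparsity is enforced structurally here: at most k latents are ever nonzero, regardless of the input.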
Interpretability of Unshared Latents
Interestingly, many of the unshared latents exhibit interpretability. This raises the question of how different training paths can lead to distinct yet interpretable representations. Furthermore, we observed that narrower SAEs tend to have a higher overlap of features across random seeds. In contrast, as the size of the SAE increases, the degree of overlap diminishes. This trend aligns with existing literature on feature splitting and absorption, indicating that the features learned by SAEs can be somewhat arbitrary.
Feature Splitting and Absorption
The behavior of SAEs supports the idea that learned features are not atomic. Instead, different configurations can lead to various interpretations of the same latent features. As the size of the SAEs increases, we also see a phenomenon known as feature absorption, where some latents gain an “implicit” meaning alongside their “explicit” feature interpretation. This duality in representation can allow models to learn disjoint representations even when they are trained on the same data.
Stability Across Different Architectures
Our findings suggest that SAE architecture strongly affects how stable feature learning is across random seeds. Previous studies indicate that some architectures, such as ReLU SAEs trained with an L1 penalty, are notably stable across initializations. TopK SAEs, by contrast, appear to benefit from methods that explicitly align latents across seeds, highlighting the need for care in architectural choices.
Methodology: Measuring Latent Alignment
To quantify the alignment between independently trained SAEs, we employed the Hungarian algorithm. This method efficiently computes the matching between latents, maximizing the average cosine similarity between matched encoder and decoder vectors. The resulting alignment score provides a clear measure of how similarly the two models interpret the latent space.
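The matching step described above can be sketched with `scipy.optimize.linear_sum_assignment`, which implements the Hungarian algorithm. Here we match decoder rows of two SAEs by cosine similarity; the function name and the row-per-latent convention are our own assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_latents(D1, D2):
    """Match latents of two SAEs to maximize mean cosine similarity.

    D1, D2: (n_latents, d_model) decoder matrices, one direction per row.
    Returns the permutation of SAE-2 latents assigned to SAE-1 latents
    and the per-pair cosine similarities. (A sketch; the same procedure
    can be run on encoder vectors and the two matchings compared.)
    """
    D1n = D1 / np.linalg.norm(D1, axis=1, keepdims=True)
    D2n = D2 / np.linalg.norm(D2, axis=1, keepdims=True)
    sim = D1n @ D2n.T                          # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)   # negate to maximize total similarity
    return cols, sim[rows, cols]
```

Running this separately on encoder and decoder vectors yields two matchings whose agreement can itself be inspected.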
The distribution of matched cosine similarities is bimodal: one mode reflects high similarity, the other low similarity. While some latents are closely aligned across seeds, others diverge sharply. In cases where the encoder and decoder matchings disagree, the cosine similarity tends to be lower, underscoring the complexity of the latent space.
Latent Overlap Across Multiple Models
Further exploration revealed that introducing a third SAE, trained with yet another random seed, reduced the fraction of shared latents from 47% to 35%. In other words, the majority of latents shared between the first two models are also shared with the third, an interesting dynamic of latent retention across seeds.
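One way to compute such overlap fractions is to threshold the matched similarities: a latent counts as "shared" when its matched cosine similarity clears the cutoff. A sketch under that assumption (the threshold value is hypothetical, not the one used in the study):

```python
import numpy as np

def shared_mask(sims, threshold=0.7):
    """Mark SAE-1 latents as 'shared' with another seed when their matched
    cosine similarity exceeds a threshold (hypothetical cutoff)."""
    return sims >= threshold

def overlap_fractions(sims_12, sims_13):
    """Fraction of SAE-1 latents shared with SAE 2, and with both SAE 2 and SAE 3.

    sims_12, sims_13: per-latent matched similarities of SAE 1 against
    SAE 2 and SAE 3 respectively.
    """
    m12 = shared_mask(sims_12)
    m123 = m12 & shared_mask(sims_13)
    return m12.mean(), m123.mean()
```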
Frequency of Latent Activation
An important aspect of our investigation was examining the frequency of latent activation across models. We found that the latents most frequently activated in SAE 1 were also those shared with SAE 2 and SAE 3. Conversely, the latents that activated infrequently in SAE 1 were those unique to that model. Intriguingly, some latents exclusive to SAE 1 exhibited a higher average firing rate than those present across all models, hinting at a complex relationship between activation frequency and latent representation.
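The firing-rate comparison above can be sketched as follows, treating a latent as "active" on a token whenever its activation is nonzero. The boolean mask of shared latents is assumed to come from a prior matching step like the one described in the methodology:

```python
import numpy as np

def firing_rates(Z):
    """Fraction of tokens on which each latent is active (nonzero).

    Z: (n_tokens, n_latents) activation matrix from one SAE.
    """
    return (Z != 0).mean(axis=0)

def compare_rates(Z, shared):
    """Mean firing rate of shared vs. model-unique latents.

    shared: boolean mask over latents, True where the latent has a close
    counterpart in the other seeds (assumed given).
    """
    rates = firing_rates(Z)
    return rates[shared].mean(), rates[~shared].mean()
```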
The Influence of SAE Size on Feature Overlap
Our research also shows a clear relationship between SAE width and the fraction of unshared latents: even under a more lenient criterion for what counts as a shared feature, larger SAEs retained a greater number of unique features. Analyzing these larger models is computationally expensive, which further complicates the exploration of this relationship.
Investigating Interpretability of Unique Latents
To probe the interpretability of unshared latents, we used an auto-interp approach to evaluate over 7,000 latents from two 32,768-latent SAEs. The average interpretability score was a promising 0.72, with most explanations falling within a reasonable range of clarity. However, latents with low interpretability scores often also had low similarity across seeds, suggesting that latents unique to a specific initialization may not lend themselves to clear interpretation.
Conclusion
Training TopK SAEs under different random initializations yields latent representations that diverge substantially with initial conditions. These results challenge the notion of a single universal set of features and suggest treating feature discovery as a compositional problem. Continued investigation should sharpen our understanding of how architecture, initialization, and feature representation interact in neural networks.
Understanding these dynamics will help us make better use of SAEs and other machine learning models, paving the way for more robust and interpretable AI systems.

