The Ongoing Battle Over AI Training Data: Transparency, Compliance, and the Future of Open Development
Introduction to the Battle
The landscape of artificial intelligence (AI) is evolving rapidly, and a critical battle is unfolding around the training data that fuels these technologies. This battle is manifesting in courts, legislative halls, and standardization bodies like the Internet Engineering Task Force (IETF). Copyright holders are demanding better compensation and ways to manage who can use their data. Meanwhile, AI developers are advocating for more freedom to source and utilize data, which they believe is vital for scaling operations and gaining a competitive edge.
Yet amid these discussions, one aspect often gets overlooked: transparency about training data. Understanding what data AI developers use, where it comes from, and how it is applied matters to multiple stakeholders. Developers once discussed their training data more openly. However, as competition intensified, they became more secretive, focusing on safeguarding their proprietary information.
The Role of the European Union’s AI Act
A pivotal provision in the European Union’s AI Act may usher in a new era of transparency. This regulation requires developers of “general-purpose AI” models to publish a summary of the data employed in training their models. This summary must conform to a specific template provided by the European Commission.
Why is this significant? For one, it benefits not only copyright holders but also privacy watchdogs and researchers. Copyright holders need confirmation that their preferences are respected, while privacy advocates must ensure that sensitive data isn’t inadvertently incorporated into AI models. Researchers are also keenly interested in understanding the data fed into large language models, which affects the outcomes and usefulness of these technologies.
Transparency: A Double-Edged Sword
Despite the opposition to this disclosure mandate and criticisms regarding the adequacy of the template, this provision presents a significant opportunity. It serves as a potential breakthrough for transparency into how leading AI models like Google’s Gemini, Anthropic’s Claude, and OpenAI’s GPT source and utilize data.
Recent peer-reviewed research, supported by Mozilla and presented at the 9th ACM Conference on Fairness, Accountability, and Transparency, developed a framework for assessing AI companies’ training data summaries. The framework draws on established standards and practices from software development, and it is meant to help both developers compiling summaries and the European Commission evaluating compliance.
Open-Source Developers Leading the Charge
Looking closely at the public summaries already released by AI developers, particularly in the open-source community, offers promising insights. The evaluation of five of these summaries found that only one, Microsoft’s Phi model, did not pass the assessment criteria. In contrast, open-source developers like Hugging Face and Swiss AI delivered compelling results, earning praise for their high standards of transparency.
The positive takeaway is that it’s feasible for smaller teams to provide transparent summaries. If these smaller projects can navigate the complexities of disclosing training data, one would expect the same—or better—from well-funded AI labs.
Major Players Lagging in Transparency
However, a troubling trend emerges when assessing larger AI developers. To date, leading companies like OpenAI, Google, and xAI have not published any required summaries, despite the AI Act mandating them to do so. Recent reports corroborate this lack of compliance. Some companies have released minimal information about their training data, but such disclosures fall far short of what the law requires.
It appears that these industry giants might be exploiting a legal gray area. While the obligation for AI developers to publish summaries took effect in August 2025, the European Commission is still in the process of gaining enforcement powers. In the meantime, companies may be testing how much information they can withhold from the public and competitors while still claiming to comply in good faith.
The Importance of Upholding Standards
The situation poses questions about leading AI developers’ commitment to the principles enshrined in the AI Act’s Code of Practice, which emphasizes transparency. As the EU AI Office prepares to enforce compliance, it will need to consider research findings and expert opinions to ensure that enforcement is objective and fair.
The potential for blatant non-compliance, especially by the industry’s giants, makes it urgent to ensure that transparency rules support, rather than penalize, the smaller developers who are acting in good faith. Following the law should not be treated as optional or as an inconvenience.
Future Implications of AI Transparency
The journey toward greater transparency in AI training data is fraught with challenges, but it is also filled with opportunities for establishing better practices within the industry. As stakeholders continue advocating for clearer regulations and frameworks, the focus on transparency will only increase.
Understandably, the balance between competitive advantage and the public interest remains delicate. However, the move toward increased transparency is essential for ensuring ethical AI development. By holding leading AI developers accountable for their data practices, we can work toward a more equitable future in AI technology.

