The Linguistic Bias of Generative AI: Whose English Are We Using?
In generative AI, an estimated 90% of training data is in English. That figure raises an important question: which English? While English serves as a global lingua franca, spoken by approximately 1.5 billion people worldwide, the variety that dominates AI systems is overwhelmingly mainstream American English. This focus on a single variety has significant implications for linguistic diversity and representation in AI.
The Dominance of Mainstream American English
The prevalence of American English in digital spaces is not by chance; it arises from a complex interplay of historical, economic, and technological factors. The United States has long been at the forefront of internet development, content creation, and the rise of major tech companies—think Google, Meta, Microsoft, and OpenAI. The linguistic norms that these companies embed in their products reflect the cultural priorities of mainstream America, creating a homogenous digital landscape.
Research shows that this dominance can alienate speakers of non-mainstream Englishes. In one study, speakers of various English dialects found the accents generated by AI voice technologies to be predominantly American, and often exclusionary. One participant remarked that these technologies seemed designed “with some other people in mind,” highlighting the disconnect between users and the tools built for them.
The Standardization of English
Mainstream varieties of English have historically been treated as the “standard” against which all other forms are measured. Consider the work of linguist John Baugh, whose research on linguistic profiling showed that accent can directly affect access to goods and services: landlords responded more readily to housing inquiries made in a mainstream American accent than to identical inquiries made in African-American or Latino accents. This systemic bias not only perpetuates inequality but also seeps into the algorithmic decisions made by AI systems.
The models behind various AI tools—such as autocorrect, voice-to-text, and writing assistants—are often trained on datasets that prioritize mainstream American English. This data is largely sourced from US-based media and online platforms, leading to a systematic disregard for grammatical, syntactical, and vocabulary variations found in other English dialects.
The Cost of Linguistic Bias
The stakes of this linguistic bias become even more pronounced when AI technologies are implemented globally. Consider the implications if an AI tutor cannot comprehend a construction unique to Nigerian English or if an AI-powered resume scanner penalizes an applicant for using Indian English. Furthermore, when voice recognition software misrepresents culturally significant terms in the oral histories of Australian First Nations elders, what knowledge is lost or distorted?
These scenarios illustrate the urgent need for a more inclusive approach as governments, educational institutions, and corporations increasingly rely on AI technologies.
Embracing Diverse Englishes
The belief that there is a single “correct” English is a myth. In reality, English is spoken in a multitude of forms, each shaped by local societies, cultures, histories, and identities. Aboriginal English, for instance, has its own structure and rules and is as fully expressive as any other variety. Indian English contributes lexical innovations such as “prepone” (to move something to an earlier time), and Singapore English (Singlish) incorporates elements from Malay, Hokkien, and Tamil.
These variations are not “broken” forms of English; they are legitimate expressions of linguistic identity. Unfortunately, in the realm of AI development, this rich diversity is often overlooked. Non-standardized varieties frequently find themselves underrepresented in training datasets and excluded from evaluation benchmarks, resulting in an AI ecosystem that claims to be multilingual but is, in practice, monolingual.
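To make the exclusion concrete, here is a toy sketch (with invented word lists, not any real product's code) of how a checker built only from a mainstream American English vocabulary ends up flagging legitimate words from other Englishes, such as the Indian English “prepone”:

```python
# Toy illustration: a "spell checker" whose vocabulary is drawn only from
# mainstream American English text flags legitimate words from other
# Englishes as errors. The word list below is a tiny, made-up sample.

MAINSTREAM_VOCAB = {
    "please", "reschedule", "the", "meeting", "to", "an", "earlier", "date",
}

def flag_unknown_words(sentence: str, vocab: set[str]) -> list[str]:
    """Return the words this checker would mark as 'misspelled'."""
    return [w for w in sentence.lower().split() if w not in vocab]

# "prepone" is an everyday word for hundreds of millions of Indian English
# speakers, but it is absent from the mainstream-only vocabulary.
print(flag_unknown_words("please prepone the meeting", MAINSTREAM_VOCAB))
# ['prepone']
```

The same dynamic scales up: when training data and evaluation benchmarks omit a variety, its regular, rule-governed features register as errors rather than as language.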
Toward Linguistic Justice in AI
So, what would it look like to build AI systems that recognize and respect a variety of English forms? First and foremost, a mindset shift is necessary. Instead of enforcing a “correct” language, AI systems should embrace linguistic variation. This could involve supporting community-led initiatives that document and digitize local linguistic varieties on their own terms.
Collaboration across disciplines—linking linguists, technologists, educators, and community leaders—is crucial. The goal should not be to “fix” language but to develop technology that yields just outcomes. By focusing on the technology itself rather than forcing speakers to conform to a single standard, we can create a more equitable AI landscape.
The Power of Language Diversity
English has served as a tool of empire and of resistance, of creativity and of solidarity. Around the globe, speakers have adapted the language to their own contexts, making it their own. As we move toward an AI-enabled future, it is essential to build systems that reflect this linguistic richness.
Next time you encounter a spelling suggestion from your phone or find an AI chatbot misinterpreting your phrasing, take a moment to ponder: whose English is being modeled? And perhaps more critically, whose English is being marginalized or excluded? This reflection is vital as we strive for a more inclusive and representative approach in the development of AI technologies.

