The Linguistic Bias of Generative AI: Whose English Are We Using?
In generative AI, an estimated 90% of training data is in English. That figure raises an important question: which English? While English serves as a global lingua franca, spoken by approximately 1.5 billion people worldwide, the variety that dominates AI systems is overwhelmingly mainstream American English. This focus on a single variety has significant implications for linguistic diversity and representation in AI.
The Dominance of Mainstream American English
The prevalence of American English in digital spaces is not by chance; it arises from a complex interplay of historical, economic, and technological factors. The United States has long been at the forefront of internet development, content creation, and the rise of major tech companies—think Google, Meta, Microsoft, and OpenAI. The linguistic norms that these companies embed in their products reflect the cultural priorities of mainstream America, creating a homogenous digital landscape.
Research shows that this dominance can alienate speakers of non-mainstream Englishes. In one study, speakers of various English dialects found the accents generated by AI voice technologies to be predominantly American, and often exclusionary. One participant remarked that these technologies seemed designed “with some other people in mind,” highlighting the disconnect between users and the tools built for them.
The Standardization of English
Mainstream varieties of English have historically been treated as the “standard” against which all other forms are measured. Consider the work of linguist John Baugh, whose research on linguistic profiling showed that accent can directly affect access to goods and services: landlords responded more readily to housing inquiries made in a mainstream American accent than to identical inquiries made in African-American or Latino accents. This systemic bias not only perpetuates inequality but also seeps into the algorithmic decisions made by AI systems.
The models behind various AI tools—such as autocorrect, voice-to-text, and writing assistants—are often trained on datasets that prioritize mainstream American English. This data is largely sourced from US-based media and online platforms, leading to a systematic disregard for grammatical, syntactical, and vocabulary variations found in other English dialects.
The Cost of Linguistic Bias
The stakes of this linguistic bias become even more pronounced when AI technologies are implemented globally. Consider the implications if an AI tutor cannot comprehend a construction unique to Nigerian English or if an AI-powered resume scanner penalizes an applicant for using Indian English. Furthermore, when voice recognition software misrepresents culturally significant terms in the oral histories of Australian First Nations elders, what knowledge is lost or distorted?
These scenarios illustrate the urgent need for a more inclusive approach as governments, educational institutions, and corporations increasingly rely on AI technologies.
Embracing Diverse Englishes
The belief that there is a single “correct” English is a myth. In reality, English is spoken in a multitude of forms, each shaped by local societies, cultures, histories, and identities. Aboriginal English, for instance, has its own structure and rules and is as fully expressive as any other variety. Indian English contributes lexical innovations such as “prepone” (to move something to an earlier time), and Singapore English (Singlish) incorporates elements from Malay, Hokkien, and Tamil.
These variations are not “broken” forms of English; they are legitimate expressions of linguistic identity. Unfortunately, in the realm of AI development, this rich diversity is often overlooked. Non-standardized varieties frequently find themselves underrepresented in training datasets and excluded from evaluation benchmarks, resulting in an AI ecosystem that claims to be multilingual but is, in practice, monolingual.
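To make the exclusion concrete, here is a toy sketch (with invented word lists, not any real product's code) of how a checker built only from a mainstream American English vocabulary ends up flagging legitimate words from other Englishes, such as the Indian English “prepone”:

```python
# Toy illustration: a "spell checker" whose vocabulary is drawn only from
# mainstream American English text flags legitimate words from other
# Englishes as errors. The word list below is a tiny, made-up sample.

MAINSTREAM_VOCAB = {
    "please", "reschedule", "the", "meeting", "to", "an", "earlier", "date",
}

def flag_unknown_words(sentence: str, vocab: set[str]) -> list[str]:
    """Return the words this checker would mark as 'misspelled'."""
    return [w for w in sentence.lower().split() if w not in vocab]

# "prepone" is an everyday word for hundreds of millions of Indian English
# speakers, but it is absent from the mainstream-only vocabulary.
print(flag_unknown_words("please prepone the meeting", MAINSTREAM_VOCAB))
# ['prepone']
```

The same dynamic scales up: when training data and evaluation benchmarks omit a variety, its regular, rule-governed features register as errors rather than as language.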
Toward Linguistic Justice in AI
So, what would it look like to build AI systems that recognize and respect a variety of English forms? First and foremost, a mindset shift is necessary. Instead of enforcing a “correct” language, AI systems should embrace linguistic variation. This could involve supporting community-led initiatives that document and digitize local linguistic varieties on their own terms.
Collaboration across disciplines—linking linguists, technologists, educators, and community leaders—is crucial. The goal should not be to “fix” language but to develop technology that yields just outcomes. By focusing on the technology itself rather than forcing speakers to conform to a single standard, we can create a more equitable AI landscape.
The Power of Language Diversity
English has served as a tool of empire and of resistance, of creativity and of solidarity. Around the globe, speakers have adapted the language to their own contexts, making it their own. As we move toward an AI-enabled future, it is essential to build systems that reflect this linguistic richness.
Next time you encounter a spelling suggestion from your phone or find an AI chatbot misinterpreting your phrasing, take a moment to ponder: whose English is being modeled? And perhaps more critically, whose English is being marginalized or excluded? This reflection is vital as we strive for a more inclusive and representative approach in the development of AI technologies.

