More than 200 million people speak some form of Swahili. The world’s most powerful AI systems speak it poorly. The reasons stretch from colonial-era data pipelines, to the explosive street slang of Nairobi’s youth, to a language structure so architecturally different from English that the machines simply were not built to understand it.

Ask ChatGPT in Swahili whether the price of unga (maize flour) has risen in Nairobi, and it will probably answer you. Ask it in Sheng, the rolling, restless street language of Nairobi’s youth, and it will likely either stall, respond in English, or produce an answer that no teenager in Eastlands would recognize as their own. Ask it about the philosophy of ubuntu in the register of formal Tanzanian Swahili, and it might give you something that is technically correct but linguistically hollow, the equivalent of answering a question about jazz with a chord chart.
Swahili is not a minor language. It is spoken, in one form or another, by more than 200 million people across East and Central Africa, is an official language of the East African Community and the African Union, and has been spreading rapidly into new territories since Somalia joined the EAC in 2024. Yet despite all of that demographic heft, it remains, in the language of artificial intelligence research, a low-resource language. That designation has nothing to do with the number of people who speak it, and everything to do with how little of that speech has ever been written down, digitized, or fed into a machine.
The consequences are not trivial. Across East Africa, governments are deploying AI tools in education, healthcare, agriculture, and public administration. Chatbots counsel patients in Kenya. Digital assistants help Tanzanian farmers assess crop disease. Automated systems process legal aid applications in Uganda. If the underlying language model speaks Swahili poorly, the downstream failures are not academic; they are medical, economic, and social.
THE DATA PROBLEM: HOW A LANGUAGE OF 200 MILLION BECAME ‘LOW-RESOURCE’
The term low-resource, as applied to languages, is a product of a specific hierarchy: the more text a language has indexed on the internet, in academic papers, in published books, and in government documents, the better AI systems trained on that data will perform in it. English sits at the top of this hierarchy, dominating training datasets. Swahili occupies a much lower tier.
A 2025 study published on the arXiv research repository reviewed six major large language models and eight smaller ones. Of Africa’s roughly 2,000 languages, only 42 appeared in supported language lists, and just four (Swahili, Amharic, Afrikaans, and Malagasy) received consistent treatment. More than 98 percent of the continent’s languages received no meaningful coverage at all. Swahili, in that bleak taxonomy, is among the fortunate few. And yet the gap between its coverage and the coverage of European languages remains enormous.
The reasons are structural. A January 2025 paper in MDPI Applied Sciences (quoted below) observed that Swahili lacks the “extensive datasets necessary for training advanced language models,” pointing to a shortage of annotated corpora, digitized academic texts, and domain-specific training data. Swahili flourishes in spoken life, in markets and radio stations and WhatsApp voice notes. But it is underrepresented in the written, indexed, searchable text that AI systems consume.
A 2025 survey of Kiswahili use across the region found that while Kiswahili flourishes on social media, its broader digital presence is limited by low content creation and inadequate AI tools. Academia compounds the problem: English dominates scholarly publishing, and without institutional incentives to publish in Kiswahili, research that could feed language models simply does not exist in the language.
OpenAI’s own benchmark tests acknowledge the disparity. When GPT-4 launched, it outperformed GPT-3.5 in 24 of 26 languages tested, including Swahili. The framing was positive, but the baseline was telling: Swahili was in a group of languages described as “traditionally challenging,” alongside Welsh and Latvian, languages spoken by far fewer people but with deep archives of digitized written text.
“Swahili, spoken by over 100 million people in East Africa, is a linguistically rich and culturally significant language but is underrepresented in the field of NLP. Unlike widely spoken languages such as English or Chinese, which benefit from abundant digital resources, Swahili lacks the extensive datasets necessary for training advanced language models.”
MDPI Applied Sciences, January 2025
THE GRAMMAR MACHINE: HOW SWAHILI’S ARCHITECTURE DEFEATS STANDARD AI
Even if the data problem were solved tomorrow, a second, deeper challenge would remain. Swahili is not structured like English. It belongs to the Bantu language family, and its grammar operates on principles so different from the Indo-European languages that dominated AI’s foundational development that standard tools fail at the most elementary level: breaking a sentence into meaningful units.
Swahili is an agglutinative language. Rather than using separate words to convey meaning, it builds meaning by stacking prefixes and suffixes onto a root. The result is that a single Swahili word can express what English requires an entire clause to say. The word “hatutakwenda” means “we will not go.” Properly analyzed, it contains six distinct morphemes: ha- (negation), tu- (we), ta- (future tense), kw- (infinitive marker), -end- (verb stem for go), and -a (final vowel). AI tokenizers, trained overwhelmingly on English text, split words into statistically frequent subword fragments. Applied to Swahili, those fragments rarely align with morpheme boundaries, so the model sees arbitrary chunks where a fluent speaker sees six units of meaning.
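The mismatch is easy to demonstrate. The short Python sketch below, which assumes the open-source tiktoken library (the tokenizer family behind recent OpenAI models), contrasts a subword tokenizer’s split of hatutakwenda with its actual morphemes; the exact fragments vary by tokenizer, but they almost never line up with the grammar.

```python
# A minimal sketch: contrast subword tokenization of a Swahili verb
# with its linguistic morphemes. Assumes `pip install tiktoken`.
import tiktoken

word = "hatutakwenda"  # "we will not go"

# The six morphemes a fluent speaker (or morphological analyzer) sees.
morphemes = ["ha-", "tu-", "ta-", "kw-", "-end-", "-a"]

# What a frequency-based subword tokenizer sees. cl100k_base is the
# encoding used by GPT-3.5/GPT-4-era models; its splits reflect the
# byte-pair statistics of mostly English text, not Swahili grammar.
enc = tiktoken.get_encoding("cl100k_base")
fragments = [enc.decode([t]) for t in enc.encode(word)]

print("Morphemes:", morphemes)
print("Tokenizer:", fragments)  # arbitrary chunks cutting across morpheme boundaries
```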
A 2024 paper published through the ACM International Conference on NLP introduced a dataset of 319,156 Swahili verb conjugation forms, created specifically because no such resource existed. The dataset covers five tenses, three grammatical persons, singular and plural forms, and various modal constructions. Its necessity underscores the vacuum that preceded it. Swahili verbs, the paper notes, “are formed by the addition of multiple affixes that indicate tense, aspect, person, and number, making tokenization, lemmatization, and morphological analysis especially challenging.”
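The scale of such a dataset follows directly from the arithmetic of agglutination. As a rough illustration (using standard affirmative subject prefixes and tense markers, and ignoring negation, object markers, relative clauses, and moods, all of which multiply the total much further), even a toy paradigm grows fast:

```python
# Back-of-the-envelope illustration of Swahili verb-form growth.
# Affirmative indicative only; real paradigms add negation, object
# markers, relative markers, and moods, multiplying the count further.
from itertools import product

subject_prefixes = {"ni": "I", "u": "you (sg)", "a": "he/she",
                    "tu": "we", "m": "you (pl)", "wa": "they"}
tense_markers = {"na": "present", "li": "past", "ta": "future", "me": "perfect"}
stems = {"som": "read", "pend": "love", "pik": "cook"}  # regular -a verbs

forms = {
    s + t + stem + "a": (subject_prefixes[s], tense_markers[t], stems[stem])
    for s, t, stem in product(subject_prefixes, tense_markers, stems)
}
print(len(forms), "surface forms from just 6 subjects x 4 tenses x 3 verbs")
print("ninasoma ->", forms["ninasoma"])  # ('I', 'present', 'read'): "I am reading"
```

Each of those 72 strings is, to a naive tokenizer, a separate vocabulary item. Scale the same arithmetic across every verb in the language and it becomes clear why a verb-form dataset runs into the hundreds of thousands of entries.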
Compounding this is the noun class system. Swahili has 18 grammatical noun classes, compared to the simple singular/plural distinction of English. Every noun belongs to a class, and adjectives, verbs, and pronouns must all agree with that class through a system of prefixes. When an AI model assigns a noun to the wrong class, or fails to track class agreement across a sentence, the resulting output is not merely accented or imprecise. It is grammatically wrong in ways that a fluent speaker notices immediately, and that can change the meaning of a sentence entirely.
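To make the agreement system concrete, here is a deliberately tiny sketch covering four of the 18 classes with the common adjective stem -zuri (“good”); real concord involves phonological exceptions this toy ignores.

```python
# Toy illustration of Swahili noun-class concord: adjectives must carry
# the agreement prefix of the noun's class. Four classes of 18 shown;
# real Swahili has many irregularities this sketch ignores.
NOUN_CLASS = {"mtu": 1, "watu": 2, "kitabu": 7, "vitabu": 8}
ADJ_PREFIX = {1: "m", 2: "wa", 7: "ki", 8: "vi"}

def agree(noun: str, adj_stem: str = "zuri") -> str:
    """Attach the class-agreeing prefix to the adjective stem."""
    return f"{noun} {ADJ_PREFIX[NOUN_CLASS[noun]]}{adj_stem}"

print(agree("kitabu"))  # kitabu kizuri ("a good book")
print(agree("vitabu"))  # vitabu vizuri ("good books")
print(agree("mtu"))     # mtu mzuri     ("a good person")
# A model that loses track of class produces forms like "kitabu vizuri",
# which a fluent speaker instantly hears as wrong.
```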
Researchers writing for the Equalyz AI publication on Medium described the consequence clearly: morphological richness “exponentially increases vocabulary size and complicates tokenization. Standard tokenization approaches developed for English often perform poorly in African languages.” The mismatch is not a bug to be patched; it is a foundational design problem.
SWAHILI IS NOT ONE LANGUAGE
Every discussion of AI and Swahili runs into a definitional difficulty: which Swahili? The language that a grandmother speaks in coastal Mombasa, the version that a trader uses in Kisangani in the Democratic Republic of Congo, the register that a politician employs in Dar es Salaam’s parliament, and the words that a teenager types on Instagram in Nairobi are all called Swahili. They share a grammar and a core vocabulary. They are, in many ways, mutually intelligible. But they are not the same.
Standard Swahili, “Kiswahili sanifu,” is based on the Kiunguja dialect of Zanzibar and is most fully realized in Dar es Salaam. Tanzania’s state-controlled language body, BAKITA (the National Swahili Council), is the only institution empowered to approve new vocabulary. This formal, authoritative register is what most academic texts and AI training data use, when they use Swahili at all.
Kenyan Swahili diverges significantly, most notably in its code-switching with English. Tanzanians and Kenyans both mix English with Swahili, but in different patterns and with different phonological outcomes. Congolese Swahili differs further still: it borrows from French rather than English, employs grammatical constructions absent from East African Swahili, and has yet to converge on a single standard. Speakers in Bunia, Kisangani, Bukavu, and Lubumbashi understand each other, but their varieties diverge in vocabulary, pronunciation, and grammar in ways that would perplex a model trained only on Tanzanian corpora.
A 2025 multilingual study of Swahili use found strong daily usage in Tanzania and Kenya but lower adoption in Uganda, the DRC, and diaspora communities, with researchers warning that over-standardization risks erasing dialectal diversity. For AI models that already flatten Swahili into its most formal register, this risk is not abstract. When a Congolese user converses with ChatGPT in their variety of Swahili and receives responses calibrated to Tanzanian Kiswahili sanifu, the exchange is not multilingual communication. It is the imposition of one dialect as the default, with all others treated as error.
A Snapshot of Swahili’s AI Representation Gap
| Metric | English | Swahili |
| --- | --- | --- |
| Internet content share | ~55% | <0.1% |
| Approx. speakers | ~1.5 billion | 200+ million |
| NLP verb-form dataset | Vast (billions of tokens) | 319,156 forms (2024) |
| AI model coverage | Dominant / first-tier | 1 of only 4 consistently covered African languages |
Sources: arXiv 2025, Wikipedia, ACM 2024, Internet World Stats
THE SHENG PROBLEM: WHEN THE LANGUAGE REFUSES TO STAND STILL
If dialectal variation across borders is one challenge, the velocity of linguistic change within Kenya is another entirely. Sheng, the youth language of Nairobi, is perhaps the single most vivid illustration of why AI models trained on formal, static Swahili corpora fail to understand how East Africans actually speak.
Sheng emerged in the 1970s in Nairobi’s Eastlands, born in the slums of the city’s multicultural, multilingual neighborhoods. Its syntax is rooted in Swahili, but its lexicon is drawn from Luo, Kikuyu, Luhya, Maa, English, and dozens of other languages spoken in the city. Standard Swahili -kula (to eat) becomes -laku in Sheng, reversed and reshaped through the linguistic playfulness that defines the language. Sigara (cigarette) becomes ngife, derived from the English slang fag, adapted to Swahili syllable structure, then inverted.
What makes Sheng especially challenging for AI is that it does not stand still. Vocabulary that was current in Eastlands two years ago may already be out of date. The language mutates continuously, partly as a mechanism of identity and exclusion: once adults or outsiders understand it, young people create new words. A researcher quoted in a Boydell & Brewer publication described Sheng speakers as regularly switching between Sheng, Swahili, English, Kikuyu, and Dholuo within a single conversation, a phenomenon linguists call translanguaging, one that bears almost no resemblance to the monolingual, formal-register Swahili that appears in the texts AI models train on.
Sheng is no longer limited to low-income Nairobi neighborhoods. Research published in the Journal of African Cultural Studies found that Sheng has expanded into mainstream domains including media, politics, education, and corporate advertising. Kenyan television channels briefly experimented with Sheng-language news broadcasts. Advertising campaigns deploy it to reach younger consumers. The language that AI models consistently mishandle is not a niche dialect; it is the primary voice of Kenyans under 35, in a country where three-quarters of the population falls in that bracket.
Research on AI content moderation in low-resource languages has illustrated the practical failure mode precisely. In one study, a Sheng hashtag, #TupataneTuesday (meaning “let’s meet each other on Tuesday”), required the correct morphemic segmentation of the Swahili word Tupatane: Tu- (we), -pat- (to meet), -ane (each other). As researchers documented in an arXiv preprint, English-trained frequency-based tokenizers routinely fail this kind of segmentation, generating nonsense or incorrect tokens where meaning should be.
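The two-step problem can be sketched in a few lines of Python: first split the hashtag on its capital letters, then segment the Swahili word. The affix table below is a hypothetical, hand-written stand-in for the learned morphological analyzer a real system would need.

```python
import re

def split_hashtag(tag: str) -> list[str]:
    """Split a CamelCase hashtag into its component words."""
    return re.findall(r"[A-Z][a-z]+", tag)

# Hypothetical, hand-written affix table: a stand-in for the learned
# morphological analyzer a production system would actually require.
SUBJECT_PREFIXES = ("tu", "ni", "wa", "u", "a", "m")  # longest first
RECIPROCAL_SUFFIX = "ane"  # reciprocal -an- plus subjunctive -e

def segment(word: str) -> list[str]:
    """Very rough morpheme split: subject prefix + stem + reciprocal suffix."""
    w, parts = word.lower(), []
    for p in SUBJECT_PREFIXES:
        if w.startswith(p):
            parts, w = [p + "-"], w[len(p):]
            break
    if w.endswith(RECIPROCAL_SUFFIX):
        return parts + ["-" + w[:-len(RECIPROCAL_SUFFIX)] + "-", "-" + RECIPROCAL_SUFFIX]
    return parts + [w]

print(split_hashtag("#TupataneTuesday"))  # ['Tupatane', 'Tuesday']
print(segment("Tupatane"))                # ['tu-', '-pat-', '-ane']
```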
THE IEEE VERDICT: WHAT THE BENCHMARKS ACTUALLY SHOW
The academic literature is unambiguous, if diplomatically worded. A study published through IEEE Xplore evaluated ChatGPT-3.5 on three Swahili classification tasks: news classification, emotion classification, and sentiment classification. The results were sobering. ChatGPT underperformed SwahBERT, a model trained specifically on Swahili, and mBERT, a general multilingual model, in news and emotion classification. It performed comparably to SwahBERT only in sentiment classification, the most basic of the three tasks. The study concluded that “ChatGPT can still be improved in situations with limited resources,” a sentence that carries considerably more weight when those limited resources are the language of 200 million people.
A separate effort by Lelapa AI, a South African AI lab, developed InkubaLM-0.4B, a small language model trained from scratch on 1.9 billion tokens of African language data spanning Swahili, isiZulu, Yoruba, Hausa, and isiXhosa. Despite being a fraction of the size of GPT-4, InkubaLM outperformed several larger models on African language tasks. The finding points to an important truth: scale alone does not solve the Swahili problem. A model trained on trillions of tokens of predominantly English text will not learn Swahili well regardless of its size, because the underlying data distribution never corrected for the gap.
“Over 90% of AI model training for African languages occurs on non-African infrastructure, exposing the continent to risks of data sovereignty violations and surveillance.”
Geopolitical Monitor, 2025
THE SOVEREIGNTY QUESTION: WHO OWNS SWAHILI’S DIGITAL FUTURE?
Beneath the technical debates runs a political one. A 2025 analysis by Geopolitical Monitor reported that over 90 percent of AI model training for African languages occurs on non-African infrastructure, meaning data about how Swahili speakers communicate is processed on servers in the United States, China, and Europe, governed by external ethical standards, and folded into models over which African institutions have no meaningful control.
The practical implications are corrosive. When a Kenyan farmer uses a chatbot to identify crop disease and receives a response in formal Tanzanian Swahili, or worse, in English, because the model could not parse their dialect, the failure is not merely linguistic. It is the failure of a technology built, in part, on data extracted from African users to serve those same users in return. Critics describe this as a new form of linguistic dependency, wrapped in the language of progress. Adding a Swahili interface to a foreign-built model, they argue, is not the same as meaningful inclusion.
The Masakhane Research Foundation, a pan-African NLP collective bringing together over 400 researchers from 30 African countries, has attempted to address this through community-driven data collection and model development. Mozilla’s Common Voice project has assembled over 200 hours of validated Swahili speech, among the most extensive open-source voice datasets for any African language. These efforts represent genuine progress, but they also reveal the scale of the gap: 200 hours of speech is a starting point, not a foundation.
WHAT PROGRESS LOOKS LIKE, AND HOW FAR IT FALLS SHORT
OpenAI acknowledges the gap and points to improvements. GPT-4 and GPT-4o have shown measurably better Swahili performance than their predecessors. A 2025 update from DataStudios noted that model improvements have reduced token inefficiency for non-Latin languages and improved translation accuracy for low-resource pairs including Swahili. The redesigned tokenizer in 2025 was said to cut token usage by 25-35 percent for complex scripts, though Swahili uses the standard Latin alphabet and so benefits less directly from those changes.
Academic researchers have explored promising alternatives. A January 2025 paper in MDPI Applied Sciences demonstrated that Retrieval-Augmented Generation, a technique that allows an AI model to retrieve relevant documents before generating a response, significantly outperforms standard fine-tuning for Swahili conversation systems. The best-performing model in that study achieved a BLEU score of 56.88 percent and a query performance score of 84.34 percent, results that suggest a workable architecture for improving Swahili AI even without massively expanded training data.
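The paper’s pipeline is not reproduced here, but the shape of retrieval-augmented generation is simple enough to sketch under stated assumptions: score stored passages against the query, retrieve the best match, and fold it into the prompt the model receives. This toy version uses TF-IDF similarity from scikit-learn and stops where the model call would begin.

```python
# A minimal RAG sketch (not the paper's implementation): retrieve the
# most relevant Swahili passage for a query, then build a grounded
# prompt for a language model. Requires `pip install scikit-learn`.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative document store; a real system would index
# thousands of curated Swahili documents.
docs = [
    "Bei ya unga wa mahindi imepanda Nairobi wiki hii.",      # maize flour price rose
    "Mvua kubwa inatarajiwa pwani ya Tanzania kesho.",        # heavy rain expected
    "Wakulima wa Arusha wanapambana na ugonjwa wa mahindi.",  # maize disease in Arusha
]

query = "Je, bei ya unga imepanda Nairobi?"  # "Has the price of unga risen in Nairobi?"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)
sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
best = docs[sims.argmax()]  # the passage most similar to the query

prompt = (
    f"Jibu swali kwa Kiswahili ukitumia muktadha ufuatao.\n"  # "Answer in Swahili using this context."
    f"Muktadha: {best}\nSwali: {query}\nJibu:"
)
print(prompt)  # this grounded prompt is what would be sent to the model
```

The design point is that retrieval supplies the model with correct, in-register Swahili at answer time, which is why the technique can outperform fine-tuning when training data is scarce.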
But the fundamental constraints remain. The internet, the principal source of AI training data, is still overwhelmingly English. Swahili academic publishing remains peripheral. Sheng is almost entirely absent from any formal corpus. Congolese and Kenyan Swahili dialects are consistently underrepresented even in Swahili-specific datasets. And the institutions with the resources to change this are, largely, not African.
To build an AI that speaks Swahili as East Africans actually speak it, researchers would need to digitize the oral; to capture the code-switching and the slang and the formal registers and the generational shifts in a single coherent dataset; to involve native speakers not just as data sources but as architects of the systems being built. None of that is impossible. None of it has been done.
Until it is, ChatGPT will continue to speak Swahili the way a textbook tourist speaks it: grammatically cautious, formally correct in its standard register, and utterly disconnected from the living language that 200 million people use every day to argue, to laugh, to grieve, to buy unga, and to be.