English is a minority language too

English is the dominant default in NLP, but it’s not so special. The economic value of localized language technology is large and growing.
languages
localization
Author: Lucien Carroll

Published: April 3, 2022

In California, more than 1 out of 4 residents were born in another country. In the San Francisco Bay Area, that number is about 1 in 3, and among tech workers in the area, about 3 out of 4. So in tech circles, which Silicon Valley dominates, English is treated as the majority and default language, and most exposure to other languages frames them as foreign, personal, or home languages. Even as home languages, they are losing ground to English, which typically replaces them by the second or third generation. Someone who works in Silicon Valley might easily conclude that English overwhelmingly dominates the business world, and that people who don’t use English in business life are switching to it.

It’s certainly true that English has a dominant position in the global economy and in the technology world. It is the most common home language and a national language in three of the G7 countries, and it is used as the primary trade language among speakers of many other languages. In language technology, especially in NLP research, English is often treated as the unspoken default. However, the dominance of English should not be exaggerated. English accounts for a minority of the global economy, a minority of technology users, and a minority of internet content. Moreover, the dominance of English is declining.

A minority of economic activity

If you associate the GDP of each country or region with the primary language of that country or region, you can get an approximate breakdown of economic activity attributable to each language, along the lines of this table:

Rank  Language    GDP (US$ trillions)  % of World GDP
1     English     24.86                29.59%
2     Chinese     14.73                17.53%
3     Japanese    5.15                 6.14%
4     German      4.82                 5.74%
5     Spanish     4.77                 5.67%
6     French      3.41                 4.06%
7     Arabic      2.75                 3.27%
8     Portuguese  2.08                 2.48%
9     Italian     2.00                 2.38%
10    Russian     1.87                 2.22%

With the caveat that not all economic activity in a country is carried out in the dominant language, but recognizing that most economic activity in Japan, for example, is carried out in Japanese, we can make some fair generalizations. Even though English is the language accounting for the most nominal GDP, it is approximately the same sum as the next three languages combined (Chinese, Japanese and German), and the top 10 languages here account for less than 80% of the global GDP.
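The aggregation and the two generalizations above can be checked with a few lines of Python. This is only a sketch: `total_by_language` is a hypothetical helper that assumes each country maps to exactly one primary language (the simplification described earlier), and the `TABLE` figures are copied from the table above rather than recomputed from per-country data.

```python
from collections import defaultdict


def total_by_language(gdp_by_country, language_of):
    """Sum each country's GDP under its single assigned primary language.

    Both arguments are plain dicts keyed by country name. Assigning one
    language per country is a simplifying assumption, not a survey method.
    """
    totals = defaultdict(float)
    for country, gdp in gdp_by_country.items():
        totals[language_of[country]] += gdp
    return dict(totals)


# Figures copied from the table above: (GDP as printed, % of world GDP).
TABLE = {
    "English": (24.86, 29.59),
    "Chinese": (14.73, 17.53),
    "Japanese": (5.15, 6.14),
    "German": (4.82, 5.74),
    "Spanish": (4.77, 5.67),
    "French": (3.41, 4.06),
    "Arabic": (2.75, 3.27),
    "Portuguese": (2.08, 2.48),
    "Italian": (2.00, 2.38),
    "Russian": (1.87, 2.22),
}

# English compared with the next three languages combined.
english_gdp = TABLE["English"][0]
next_three_gdp = sum(TABLE[lang][0] for lang in ("Chinese", "Japanese", "German"))

# Combined share of world GDP for the ten languages listed.
top_ten_share = sum(share for _, share in TABLE.values())

print(f"English: {english_gdp:.2f}, next three: {next_three_gdp:.2f}")
print(f"Top ten share: {top_ten_share:.2f}%")
```

With the table's figures, the next three languages sum to about 24.70 against 24.86 for English, and the ten listed languages cover about 79% of world GDP.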

A minority of technology users

Surveys of technology users similarly show a long fat tail of preferred languages. In the internet language survey results shown below, the single largest group of users do prefer English, but that number is smaller than the next two languages combined (Chinese and Spanish), and the top 10 languages together only account for 77% of the online population.

Rank  Language    Users (millions)  % of World Users
1     English     1186              25.90%
2     Chinese     888               19.40%
3     Spanish     364               7.90%
4     Arabic      237               5.20%
5     Indonesian  198               4.30%
6     Portuguese  172               3.70%
7     French      152               3.30%
8     Japanese    119               2.60%
9     Russian     116               2.50%
10    German      93                2.00%

If you look at surveys of smartphone penetration per country and assign each population to just one language, Chinese jumps far into the lead, with English in second place.

A minority of internet content

A standard survey of website languages reports that 62.9% of websites are in English. However, the methodology used there assigns a language according to the default of the top-level page, ignoring subdomains, so any multilingual website that sets English as the default language will count for English in the results. Methods that count web pages, such as the Indicators of languages in the internet survey, produce much lower estimates of the portion of internet content in English.

Rank  Language    % of Content
1     Chinese     21.60%
2     English     19.60%
3     Spanish     7.85%
4     Hindi       3.76%
5     Russian     3.76%
6     French      3.33%
7     Portuguese  3.13%
8     Arabic      3.09%
9     Japanese    2.66%
10    German      2.37%

So though English is the default language in many of the most visited websites, it has dropped into second place in overall internet content. As with other tables above, this distribution also has a long fat tail. The top ten languages here account for only 71% of the surveyed content.
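To see why the two counting methods can disagree so sharply, here is a toy illustration in Python. The `SITES` data is invented purely for this example and does not come from either survey, and the function names are hypothetical. The sketch only shows the mechanism: counting one default language per site inflates English whenever multilingual sites default to it, while counting pages does not.

```python
from collections import Counter

# Hypothetical toy data, NOT survey results. Each entry is
# (default language of the top-level page, {language: number of pages}).
SITES = [
    ("English", {"English": 10, "Spanish": 10, "Chinese": 10}),
    ("English", {"English": 20}),
    ("Chinese", {"Chinese": 30}),
    ("English", {"English": 5, "Hindi": 15}),
]


def to_shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {language: n / total for language, n in counts.items()}


def site_level_shares(sites):
    """Count each site once, under the default language of its top-level page."""
    return to_shares(Counter(default for default, _ in sites))


def page_level_shares(sites):
    """Count every page under the language it is actually written in."""
    pages = Counter()
    for _, page_counts in sites:
        pages.update(page_counts)
    return to_shares(pages)


by_site = site_level_shares(SITES)
by_page = page_level_shares(SITES)

print("By site:", by_site)
print("By page:", by_page)
```

In this made-up sample, English is the default on three of the four sites but accounts for only 35 of the 100 pages, so the site-level count puts English far ahead while the page-level count puts it behind Chinese.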

Declining dominance

Since the economies of Asia and Africa have generally been growing faster than those of Europe and North America, English accounts for a declining portion of the global economy. In addition, as digital infrastructure development extends the reach of smartphones, and of the internet more generally, the pool of technology users continues to diversify.

[Figure: Global GDP by region, 2010 to 2019]

All of this means that if you are on a research team investigating language technology or on a product team developing a language-dependent system, you really need to keep a global perspective and consider multiple languages in your plan. Otherwise, you will either limit your impact to a small portion of your potential users or serve the rest of them very poorly.

Fortunately, the cost of localized language technology is quickly falling, making it feasible to support more languages. But that is a topic for another post.