Columbia Undergraduate Science Journal

Humans use language for cooperation, self-expression, and much more. So it's not surprising that, since the dawn of computing, scientists have envisioned building technology that can communicate with us seamlessly in natural language—i.e., human (as opposed to programming) language (1).

Thanks to decades of research, that goal is finally in view. Virtual assistants such as Apple's Siri and Amazon's Alexa can manage simple verbal interactions. Auto-correct and predictive text make drafting emails almost too easy. For speakers of major languages like English, the seamless integration of language into technology is so ordinary as to be invisible. Yet for speakers of endangered, minoritized, and under-resourced languages, it is often impossible to use their native languages in public spaces, such as school, work, or social media, without encountering social stigma and technological barriers.

Endangered languages are often indigenous languages. They face declining numbers of native speakers, in large part because of the forced assimilation of indigenous peoples through colonization and globalization (2). According to the UN, around 40% of the world's 7,000 languages are endangered. The UN declared 2019 the International Year of Indigenous Languages (IYIL) in recognition of indigenous languages' importance to cultural diversity, knowledge sharing and preservation, economic development, and peace and stability (3).

The IYIL action plan called for more technological resources to stem the worldwide decline in linguistic diversity (4). While especially acute for endangered languages, under-resourcing occurs even for languages with millions of speakers. Since tech companies have a financial incentive to focus on common languages in developed countries, only a tiny fraction of all languages have adequate technological support (5).

At the time of writing, for instance, Google Translate claims to help people "communicate in over 100 languages," or roughly 1-2% of the world's living languages. Google Translate's website, however, shows that the minority and endangered languages supported tend to lack translation features available to major languages, such as the ability to take dictation or translate script from touchscreen writing (6). And while Google has researched how to improve translation quality for low-resource languages, Google Translate still provides subpar translations for these languages much of the time. Among other factors, the lack of large bodies of quality data in minority languages poses a challenge to the development of language models that would produce better translations (7, 8).

One technology that could make it easier for speakers of endangered languages to carve out their place in the digital sphere is the digital keyboard, which also generates language data that can advance the development of further technologies for their languages. One example, Poio, comes from researchers at the Interdisciplinary Centre for Social and Language Documentation, in partnership with endangered language community members in Europe. Poio collects endangered language data from diverse sources, including Wikipedia articles and digitized dictionaries, into a ready-to-use form for both computational linguistics research and apps, such as the predictive keyboard that the researchers developed for 27 endangered languages. Besides aiding researchers like themselves, the Poio creators hope their product will help younger generations learn their communities' languages. They are now working on improving the quality of their keyboard's text predictions, expanding it to more languages, and supporting offline use (9). Thanks to Poio and similar projects (such as the indigenous languages keyboard developed by the Canada-based First Peoples' Cultural Council), what was once a common convenience only for major languages is becoming accessible for minority languages (10).
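To make the link between language data and text prediction concrete, here is a minimal sketch of one simple approach: counting which word most often follows another in a corpus (a bigram model). It is an illustration only and is not claimed to be Poio's method, which is not described here. The function names `train_bigrams` and `suggest` are hypothetical, and the English sentences are placeholder data standing in for a real endangered-language corpus.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, how often each other word follows it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def suggest(following, word, k=3):
    """Return up to k of the most frequent words seen after `word`."""
    counts = following.get(word.lower(), Counter())
    return [candidate for candidate, _ in counts.most_common(k)]


# Placeholder corpus (assumption): a real keyboard would be trained on
# text in the target language, such as encyclopedia articles or dictionaries.
corpus = [
    "the river is wide",
    "the river is cold",
    "the river runs fast",
    "the mountain is high",
]
model = train_bigrams(corpus)
print(suggest(model, "river"))
```

Even in this toy form, the suggestions improve only as the corpus grows, which is why the scarcity of written data is such an obstacle for under-resourced languages.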

What about endangered languages from cultures with oral traditions? These languages often lack both a standardized script and the manpower needed for the painstaking work of transcribing and translating speech (11). To address this problem, computer scientists are automating the transcription of endangered languages from audio and aligning audio recordings with their translations into more common languages. This work helps preserve these languages for future generations and bypasses the so-called "transcription bottleneck" that has plagued linguistic fieldwork for years (12). And language communities that no longer have fluent speakers may one day benefit from advances in automated language reconstruction (13).

Reflecting upon what he described as this "brief period of overlap between the mass extinction of the world’s languages and the advent of the digital age," computational linguist Steven Bird asked his colleagues, "What can we do—as individuals and as a professional association—as we wake up to this global linguistic crisis?" (14). Over a decade after Bird posed that question, the crisis of language death is still upon us. So are the burgeoning perils and promises of technology.

References:

1. Schubert, L. (2014). Computational Linguistics. In Stanford Encyclopedia of Philosophy. Stanford Center for the Study of Language and Information. https://plato.stanford.edu/archives/spr2020/entries/computational-linguistics/
2. Woodbury, A. (n.d.). FAQ: What is an Endangered Language? Linguistic Society of America. Retrieved July 16, 2021, from https://www.linguisticsociety.org/resource/faq-what-endangered-language
3. About IYIL 2019. (n.d.). 2019 - International Year of Indigenous Language. Retrieved July 15, 2021, from https://en.iyil2019.org/about/
4. Action plan for organizing the 2019 International Year of Indigenous Languages. (2018). UNESCO. https://en.iyil2019.org/wp-content/uploads/2018/09/N1804802.pdf
5. Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56, 85–100. https://doi.org/10.1016/j.specom.2013.07.008
6. Languages. (n.d.). Google Translate. Retrieved July 16, 2021, from https://translate.google.com/intl/en-GB/about/languages/
7. Benjamin, M. (2019, March 30). Empirical Evaluation of Google Translate across 107 Languages. Teach You Backwards. https://www.teachyoubackwards.com/empirical-evaluation/
8. Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M. X., Cao, Y., Foster, G., Cherry, C., Macherey, W., Chen, Z., & Wu, Y. (2019). Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. ArXiv:1907.05019 [Cs]. http://arxiv.org/abs/1907.05019
9. Zamora Fernández, G., Ferreira, V., & Manha, P. (2020). Poio Text Prediction: Lessons on the Development and Sustainability of LTs for Endangered Languages. Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), 106–110. https://aclanthology.org/2020.sltu-1.14
10. Ibaraki, S. (n.d.). Turning To AI To Save Endangered Languages. Forbes. Retrieved July 15, 2021, from https://www.forbes.com/sites/cognitiveworld/2018/11/23/turning-to-ai-to-save-endangered-languages/
11. Millour, A., & Fort, K. (2020). Text Corpora and the Challenge of Newly Written Languages. Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), 111–120. https://aclanthology.org/2020.sltu-1.15
12. Zanon Boito, M., Villavicencio, A., & Besacier, L. (2020). Investigating Language Impact in Bilingual Approaches for Computational Language Documentation. Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), 79–87. https://aclanthology.org/2020.sltu-1.11
13. Ciobanu, A. M., & Dinu, L. P. (2020). Automatic Identification and Production of Related Words for Historical Linguistics. Computational Linguistics, 45(4), 667–704. https://doi.org/10.1162/coli_a_00361
14. Bird, S. (2009). Natural Language Processing and Linguistic Fieldwork. Computational Linguistics, 35(3), 469–474. https://doi.org/10.1162/coli.35.3.469

The Role of Technology in Preserving Linguistic Diversity