11/09/2025
The Cognitive Science of Language lecture series talk will take place on Tuesday, November 11, 3:30-5:20pm, in LRW 2001. The lecture will be delivered by Dr. Annie En-Shiun Lee.
Professor Annie En-Shiun Lee is an assistant professor at Ontario Tech University and the University of Toronto (status-only). Professor Lee’s goal is to make language technology inclusive and accessible to as many people as possible, under the institutional vision “Tech with a Conscience”. Professor Lee directs the Lee Language Lab (L^3), whose research focuses on multilingual and multicultural natural language processing and large language models, especially the adoption and usability of technology for disenfranchised or underserved groups at risk of being left behind by the digital divide. Professor Lee’s research has been published in Nature Digital Medicine, ACM Computing Surveys, ACL (EACL, NAACL, ACL), SIGCSE, IEEE TKDE, and Bioinformatics. Dr. Lee was the demo co-chair for NAACL 2024 and has received numerous recognitions, including the Outstanding Paper Award and Best Theme Paper Award at NAACL 2025, the Audience Award at Teaching NLP 2024, and the ARIA Spotlight Award for MScAC 2024, as well as nominations for the Tim McTiernan Student Mentorship Award 2025 and the Women in AI Researcher of the Year Award 2025.
Title: Bridging the World with Words: Multilingual and Multicultural Natural Language Processing
Abstract: Despite the rapid advances in Large Language Models (LLMs), research efforts have historically focused disproportionately on high-resource languages, particularly English, leaving over 7,000 living languages underserved. This talk explores strategies for closing the gap in low-resource language (LRL) translation in multilingual language models. LRLs are typically characterized by a scarcity of both unlabeled and labeled data, as well as limited tools and models. The talk will detail the importance of quantifying linguistic distance to improve the robustness and predictability of multilingual model performance. We introduce our improvements to the World Languages Database, URIEL+, an expanded linguistic knowledge base that addresses issues like low feature coverage, lack of morphological data, and ambiguous distance calculations. URIEL+ significantly expands typological coverage for nearly 2,900 languages and supports 361 new low-resource languages. By utilizing this enhanced resource, our work on Open Toolkits & Models demonstrates that integrating linguistic distance metrics and dataset information (ProxyLLM) leads to models that are more robust and consistently more accurate, improving performance on downstream tasks by up to 50%. Success in LRL research requires widespread cooperation through Community Collaboration, so we detail our projects in Creating New Datasets for linguistic and cultural diversity. We highlight dataset collaborations with grassroots groups, including WorldCuisines with Southeast Asia Crowd (recognized as a "best paper"), IrokoBench with Masakhane (recognized as an "outstanding paper"), and Error Annotation in Machine Translations of Chinese Dialects.
Lastly, we discuss developing language tools such as Language Learning Apps (Language Education/ATAIIGI), the Annotation Correction App (Annotation Correction), and materials focused on Teaching Multilinguality (recognized with an Audience Award).
We are looking forward to seeing you at the event!