Building a Dataset and Exploring Low-Resource Approaches to Natural Language Inference with Myanmar

Htet, Aung Kyaw

doi:10.25949/24653928.v1

thesis

posted on 2024-02-13, 22:43, authored by Aung Kyaw Htet

Despite dramatic recent progress in NLP, applying Large Language Models (LLMs) to low-resource languages remains a major challenge. This is most visible in benchmarks such as Cross-Lingual Natural Language Inference (XNLI), a key task that demonstrates the cross-lingual capabilities of NLP systems across a set of 15 languages.

In this thesis, we extend the XNLI task to one additional low-resource language, Myanmar, as a proxy challenge for low-resource languages more broadly, and make three core contributions. First, we build a dataset called Myanmar XNLI (myXNLI) using community crowd-sourced methods, as an extension to the existing XNLI corpus. This involves a two-stage process of community-based construction followed by expert verification; through an analysis, we demonstrate and quantify the value of the expert verification stage in the context of community-based construction for low-resource languages. We make the myXNLI dataset available to the community for future research. Second, we evaluate recent multilingual language models on the myXNLI benchmark and explore data-augmentation methods to improve model performance. Our data-augmentation methods improve model accuracy by up to 2 percentage points for Myanmar, while also improving accuracy for other languages. Third, we investigate how well these data-augmentation methods generalise to other low-resource languages in the XNLI dataset.

History

Awarding Institution

Macquarie University

Degree Type

Thesis MRes

Degree

Master of Research

Department, Centre or School

School of Computing

Year of Award

2023

Principal Supervisor

Mark Dras

Additional Supervisor 1

Diego Molla-Aliod

Rights

Copyright: The Author. Copyright disclaimer: https://www.mq.edu.au/copyright-disclaimer

Language

English

Extent

104 pages

Former Identifiers

AMIS ID: 282558

Keywords

Myanmar, Burmese, Natural Language Inference, Low-resource Language, Multilingual Language Model

Licence

In Copyright
