Grammatical Error Correction Incorporating First Language Information

Thesis, posted on 2023-08-07, authored by Yitao Liu

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in a text. Modern GEC systems are generally data-driven, using machine learning. Research has found that, compared with native English corpora, corpora written by learners of English as a Second Language (ESL) help a GEC system achieve better results when correcting ESL texts. However, most GEC research ignores the first language of the ESL writers. Different first languages have different syntax, vocabulary, and cultural backgrounds, which affect language learning and lead ESL learners to make first language-specific errors. At the same time, ESL corpora are limited in quantity compared with native corpora, and ESL corpora recording writers' first language information are rarer still, which makes them hard to utilize directly. In this thesis, we investigate different methods for utilizing ESL corpora, especially ESL corpora with first language information, to improve the performance of GEC systems.

In the first part of the thesis, we use a domain adaptation method to train a GEC classifier on a large native corpus and a small ESL corpus together. In contrast to earlier methods for combining such corpora, specifically the error inflation method and the Naive Bayes adaptation method, we use the 'Frustratingly Easy Domain Adaptation' method of Daume III (2007), which augments the feature vectors directly to improve the performance of the classifier; unlike the previous work, it is not tied to specific classifiers or artificial data. We examine this approach for correcting article errors against a number of baseline systems, and the results indicate that, with an SGD classifier, the performance relative to the baselines is encouraging for the use of domain adaptation.

Following this, we focus on deep learning approaches to GEC, where domain adaptation is typically achieved through transfer learning. In our first exploration of using first language information in this context, we train a fully convolutional Neural Machine Translation (NMT) model on two ESL corpora, and then fine-tune it on corpus data from specific first language backgrounds in order to correct texts from the same backgrounds. Experiments show that data cleaning techniques are important for the fine-tuning process: models fine-tuned on the appropriate first language corpus, starting from the clean pre-trained model, perform substantially better than models fine-tuned on an equivalent amount of randomly selected data. An analysis of error types, motivated by the Second Language Acquisition literature, finds that the improvements occur especially for error types strongly related to writers' first language backgrounds.

Moving beyond individual first languages, we then use the idea of a language family to enlarge the usable fine-tuning corpus, pooling corpora whose writers' first languages belong to the same family. In experiments on the Italic language family, where the first language background of the test texts is one specific member of the family, this approach performs better than fine-tuning on equivalent amounts of randomly chosen data, and is also superior to the individual first language models described above.
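To make the domain adaptation method from the first part concrete: the feature augmentation of Daume III (2007) maps each feature vector into a shared copy plus a domain-specific copy, so a linear classifier can learn both general and domain-specific weights. The following is a minimal sketch, not the thesis's implementation; the function name and toy data are our own.

    import numpy as np

    def feda_augment(X, domain, n_domains=2):
        # Daume III (2007): copy each feature vector into a shared block
        # plus one block reserved for its domain; all other blocks are zero.
        n, d = X.shape
        out = np.zeros((n, d * (n_domains + 1)))
        out[:, :d] = X                     # shared copy, seen by all domains
        start = d * (domain + 1)
        out[:, start:start + d] = X        # copy seen only by this domain
        return out

    # Hypothetical usage: domain 0 = native corpus, domain 1 = ESL corpus.
    X_native = np.random.rand(100, 50)
    X_esl = np.random.rand(20, 50)
    X_train = np.vstack([feda_augment(X_native, 0), feda_augment(X_esl, 1)])

With a linear classifier, the shared block captures behaviour common to native and ESL text, while the domain-specific blocks capture ESL-specific patterns, which is why the method is not tied to any particular classifier.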
Finally, we modify the architecture of the NMT model to add first language information as side information, with minimal modification so that the method remains usable regardless of the choice of neural network and makes maximal use of the fine-tuning data. We systematically analyze candidate injection sites for the side information (before the embedding layer, before the encoder layer, and before the output layer) and test each site individually as well as in combination. In our experiments, the best-performing method injects first language information as side information before the encoder layer, and it is superior to all the earlier methods incorporating first language information in this thesis. Moreover, it requires training only one fine-tuned model, instead of a set of models as in the other methods investigated in this thesis.
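One natural way to realize injection before the encoder layer is to prepend a learned first language embedding to the token embeddings. The sketch below, in PyTorch, is our own illustration under that assumption; the class and argument names are hypothetical, and the thesis's exact mechanism may differ.

    import torch
    import torch.nn as nn

    class L1SideInfoGEC(nn.Module):
        # Hypothetical sketch: a learned first-language (L1) embedding is
        # prepended to the token embeddings, so the encoder can condition
        # on the writer's L1 without changing the encoder itself.
        def __init__(self, encoder, vocab_size, n_l1, d_model):
            super().__init__()
            self.tok_embed = nn.Embedding(vocab_size, d_model)
            self.l1_embed = nn.Embedding(n_l1, d_model)  # one vector per L1
            self.encoder = encoder  # any module taking (batch, seq, d_model)

        def forward(self, tokens, l1_id):
            x = self.tok_embed(tokens)               # (batch, seq, d_model)
            l1 = self.l1_embed(l1_id).unsqueeze(1)   # (batch, 1, d_model)
            x = torch.cat([l1, x], dim=1)            # inject before encoder
            return self.encoder(x)

Prepending a single vector leaves the encoder's internals untouched, which fits the stated goal of keeping the modification minimal and network-agnostic.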

History

Table of Contents

Chapter 1. Introduction -- Chapter 2. Related Work -- Chapter 3. Grammatical Error Correction based on Domain Adaptation -- Chapter 4. Fine-tuning GEC Model by ESL Corpus from Specific First Language Background -- Chapter 5. Fine-tuning GEC Model Based on Language Family Corpus -- Chapter 6. Injecting First Language Information as Side Information into GEC Model -- Chapter 7. Conclusions and Future Works -- Appendix -- References

Awarding Institution

Macquarie University

Degree Type

Thesis PhD

Department, Centre or School

Department of Computing

Year of Award

2021

Principal Supervisor

Mark Dras

Rights

Copyright: The Author
Copyright disclaimer: https://www.mq.edu.au/copyright-disclaimer

Language

English

Extent

262 pages
