Native language identification: explorations and applications
Thesis posted on 28.03.2022, 17:44, authored by Shervin Malmasi
The prediction of an author's native language using only their second language writing -- a task called Native Language Identification (NLI) -- is usually tackled using supervised classification. This is underpinned by the presupposition that an author will be disposed towards certain language production patterns in their second language (L2), due to influence from their first language (L1). The identification of such L1-specific linguistic patterns can be used to study language transfer effects in Second Language Acquisition (SLA). NLI can also be used in forensic linguistics, serving as a tool for authorship profiling to provide evidence about a writer's linguistic background. NLI is a young but rapidly growing research topic. It has become a well-defined classification task in the past decade, and increasing work during the last five years has brought an unprecedented level of research focus and momentum to the area, culminating in the first NLI Shared Task in 2013. Most work hitherto has focused on the core machine learning and feature engineering facets of the task, on obtaining suitable data, and on unifying the area with a common evaluation framework. This thesis makes three broad contributions: (1) exploring the task in new ways; (2) investigating how NLI can inform SLA; and (3) introducing the novel task of L1-based text segmentation. Following our implementation of an NLI system for the shared task -- which investigated the effects of classifier ensembles, feature types and feature diversity -- we explored the task in several new ways. We first looked at human and oracle performance, gauging potential for further improvements in classification performance. We used two oracles to estimate upper bounds for NLI accuracy, applying them to a new dataset composed of all submissions to the 2013 shared task, revealing interesting error patterns. We then presented the first study of human performance for NLI using a group of experts.
The experts did not outperform our NLI system, with the performance gap likely to widen on the standard NLI setup, demonstrating that this is a hard task at which, uncharacteristically, computers can perform better than humans. We next explored the cross-lingual applicability of NLI by extending it to other languages. To this end we identified six typologically very different sources of non-English L2 data and, via a series of experiments using common features, established that NLI accuracy is similar across the L2s and a wide range of L1s. We showed that other patterns, e.g. oracle performance and feature diversity, also hold across languages. Next, we considered practical applications of NLI in SLA research, investigating ways to use the classification task to give a broad linguistic interpretation of the data. Our first exploration focused on language transfer, the characteristic L2 usage patterns caused by native language interference, which is investigated by SLA researchers seeking to find overused and underused features. We proposed a method for deriving ranked lists of such discriminative features and then analyzed our results to see how useful they might be in formulating plausible language transfer hypotheses. We then defined and examined an approach to formulating and testing hypotheses about errors and the environments in which they are made, a process that in SLA has traditionally involved substantial effort. To this end we defined a new task for finding contexts for errors that vary with the native language of the writer and proposed four graph-theoretic models for doing so. The findings in this chapter form the basis of a useful research direction for developing methods to help SLA experts develop hypotheses using large data. The final part of this dissertation introduced the novel task of native language-based text segmentation, exploring how discriminative NLI features investigated in the previous task can be exploited here.
The goal is to partition a text into regions that exhibit differing L1 influence; such methods could be applied to intrinsic plagiarism detection or even literary analysis. We adapted an unsupervised Bayesian approach originally developed for topic segmentation to one with generative models built over features useful in NLI. We investigated several models; the one with alternating asymmetric priors performed best, and compactness of the distributions over features proved to be important.