Native language identification incorporating syntactic knowledge
Thesis posted on 2022-03-28, 23:08, authored by Sze-Meng Jojo Wong
"Inferring characteristics of authors from their textual data, often termed authorship profiling, is typically treated as a classification task, where an author is classified with respect to characteristics including gender, age, native language, and so on. This profile information is often of interest to marketing organisations for product promotional reasons as well as governments for crime investigation purposes. The thesis focuses on the specific task of inferring the native language of an author based on texts written in a second language, typically English; this is referred as native language identification (NLI). Since the seminal work of Koppel et al. in 2005, this task has been primarily tackled as a text classification task using supervised machine learning techniques. Lexical features, such as function words, character n-grams, and part-of-speech (PoS) n-grams, have been proven to be useful in NLI. Syntactic features, on the other hand, in particular those that capture grammatical errors, which might potentially be useful for this task, have received little attention. The thesis explores the relevance of concepts from the field of second language acquisition, with a focus on those which postulate that constructions of the native language lead to some form of characteristic errors or patterns in a second language. In the first part of the thesis, an experimental study is conducted to determine the native language of seven different groups of authors in a specially constructed corpus of non-native English learners (International Corpus of Learner English). Three commonly observed syntactic errors that might be attributed to the transfer effects from the native language are examined - namely, subject-verb disagreement, noun-number disagreement, and misuse of determiners. 
Based on the results of a statistical analysis, it is demonstrated that these features generally have some predictive power, but that they do not improve on the best performance of the supervised classification when compared with a baseline using lexical features. In the second part, a second experimental study aims to learn syntax-based errors from syntactic parsing, with the purpose of uncovering more useful error patterns in the form of parse structures which might characterise language-specific ungrammaticality. The study demonstrates that parse structures, represented by context-free grammar (CFG) production rules and parse reranking features, are useful in general sentence grammaticality judgement. Consequently, when these syntactic features are adapted to NLI, with parse production rules in particular, a statistically significant improvement over the lexical features is observed in the overall classification performance. The final part of the thesis takes a Bayesian approach to NLI through topic modeling in two ways. Topic modeling, using a probabilistic CFG formulation, is first taken as a feature clustering technique to discover coherent latent factors (known as 'topics') that might capture predictive features for individual native languages. Rather than the word n-grams that are typical of topic modeling, the topics consist of bi-grams over parts of speech. While there is some evidence of topic cluster coherence, this does not improve the classification performance. The second approach explores adaptor grammars, a hierarchical non-parametric extension of probabilistic CFGs (and also interpretable as an extension of topic modeling), for feature selection of useful collocations. Adaptor grammars are extended to identify n-gram collocations of arbitrary length over mixtures of PoS and function words, using both maxent and induced syntactic language model approaches to NLI classification.
It is demonstrated that the learned collocations used as features can also improve over the baseline (lexical) performance, although success varies with the approach taken.
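
As a rough illustration of what "CFG production rules as features" can mean in practice, the following minimal sketch reads a bracketed parse tree and counts the non-lexical production rules it contains. It is not the thesis's implementation: the function names `parse_tree` and `production_rules`, the Penn-Treebank-style bracket format, and the hand-written example parse are all assumptions made for illustration only.

```python
from collections import Counter


def parse_tree(s):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Hypothetical helper: assumes a Penn-Treebank-style format such as
    "(S (NP (DT The)) ...)" with balanced brackets.
    """
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def production_rules(tree, include_lexical=False):
    """Return the list of rules 'LHS -> RHS' used in the tree.

    Lexical rules (tag -> word) are skipped by default so that the
    features describe syntactic structure rather than vocabulary.
    """
    label, children = tree
    rules = []
    if all(isinstance(c, str) for c in children):
        # Pre-terminal node: its children are words only.
        if include_lexical:
            rules.append(f"{label} -> {' '.join(children)}")
    else:
        rhs = " ".join(c[0] if isinstance(c, tuple) else c for c in children)
        rules.append(f"{label} -> {rhs}")
    for c in children:
        if isinstance(c, tuple):
            rules.extend(production_rules(c, include_lexical))
    return rules


# Hand-written (assumed) parse of a sentence with subject-verb disagreement.
example = "(S (NP (DT The) (NNS students)) (VP (VBZ writes) (NP (NNS essays))))"
features = Counter(production_rules(parse_tree(example)))
```

Counts of this kind, collected per document, could then be passed to any standard supervised classifier alongside or instead of lexical features; the choice of classifier and of feature weighting is left open here.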