Effective and efficient topical collocation models
Topic models are powerful tools for automatically detecting latent topic distributions from a collection of documents. The "bag-of-words" representation used in conventional topic models, however, ignores potentially useful dependencies between words. Topical collocation models extend topic models by modelling relationships between mutually informative consecutive words. Since the standard methods for evaluating "bag-of-words" topic models do not directly apply to topical collocation models, several fundamental questions remain open: whether topical collocation models actually outperform standard topic models, which topical collocation model performs best, and how well the best-performing model scales to large datasets.
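To illustrate what the bag-of-words representation discards (a hypothetical example, not taken from the thesis): two sentences with different meanings can produce identical word counts, whereas a representation that keeps a collocation such as "white house" as a single unit distinguishes them.

```python
from collections import Counter

# Bag-of-words: two sentences with different meanings get identical counts.
s1 = "the white house issued a statement".split()
s2 = "the house issued a white statement".split()
assert Counter(s1) == Counter(s2)

def with_collocations(tokens, collocations=frozenset({("white", "house")})):
    """Merge known two-word collocations into single tokens (toy sketch)."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# With the collocation preserved, the two sentences are no longer conflated.
assert Counter(with_collocations(s1)) != Counter(with_collocations(s2))
```

Here the collocation set is fixed by hand; the models studied in this thesis instead learn such units from data.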
To address these questions, this thesis proceeds in three parts. In the first part, we develop four evaluation methods for topical collocation models and apply them to five topical collocation models and a standard topic model, Latent Dirichlet Allocation (LDA). We evaluate the models using human annotation, a pointwise mutual information (PMI) metric, and practical downstream information retrieval and classification tasks. The experiments reveal that 1) almost all the topical collocation models outperform LDA on all the evaluation methods; and 2) topical collocation models using Adaptor Grammars (AG-colloc) almost always set a new state of the art, though some improvements over baselines are marginal.
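PMI-based evaluation typically scores a topic by how strongly its top words co-occur in a reference corpus. A minimal sketch of such a metric, assuming document-frequency counts are available (the exact formulation used in the thesis may differ):

```python
from itertools import combinations
from math import log

def pmi_coherence(top_words, doc_freq, co_doc_freq, num_docs, eps=1e-12):
    """Average pairwise PMI over a topic's top words.

    doc_freq:    word -> number of documents containing that word
    co_doc_freq: frozenset({w1, w2}) -> number of documents containing both
    num_docs:    total number of documents in the reference corpus
    """
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1 = doc_freq.get(w1, 0) / num_docs
        p2 = doc_freq.get(w2, 0) / num_docs
        p12 = co_doc_freq.get(frozenset((w1, w2)), 0) / num_docs
        # PMI = log p(w1, w2) / (p(w1) p(w2)); eps guards against log(0)
        scores.append(log((p12 + eps) / (p1 * p2 + eps)))
    return sum(scores) / len(scores)
```

Words that always appear together score above zero; words that co-occur no more often than chance score near zero, giving an automatic proxy for human judgements of topic quality.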
Having identified the best topical collocation model, the second part of this dissertation focuses on scaling it to very large datasets via a sparse parallel reformulation. We present an efficient reformulation of the AG-colloc model, an unsupervised topical collocation model that can learn collocations of arbitrary length. Taking advantage of sparsity in both collocation and topic distributions, we develop a novel linear-time sampling algorithm that can be easily parallelised, so that the reformulation can handle large-scale corpora.
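The core idea behind such sparsity-aware samplers can be sketched as follows (an illustrative simplification, not the thesis's algorithm): when most entries of the unnormalised sampling distribution are zero, drawing a sample costs time linear in the number of non-zero entries rather than in the full topic or collocation vocabulary.

```python
import random

def sample_sparse(weights):
    """Draw an index from an unnormalised categorical distribution stored
    sparsely as {index: weight}. Cost is O(k) in the number of non-zero
    entries k, not in the size of the full support."""
    total = sum(weights.values())
    u = random.random() * total
    for idx, w in weights.items():
        u -= w
        if u <= 0:
            return idx
    return idx  # guard against floating-point rounding at the boundary
```

Because each token's sampling step touches only its non-zero entries, independent tokens or documents can also be dispatched to separate workers, which is the intuition behind parallelising such a sampler.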
In the third part, we compare the new implementation to LDA and to PA, a topic model for learning topical collocations, on large-scale corpora in terms of speed and quality, using the evaluation methods developed in the first part of the thesis.
The three main contributions of this thesis are:
1. An empirical comparison of five topic models for learning topical collocations (PA, LDACOL, TNG, AG-colloc, and AG-colloc2) to a standard topic model (LDA), using four evaluation methods;
2. An efficient reformulation of the AG-colloc model;
3. A novel linear-time sampling algorithm that can be easily parallelised, enabling the reformulation to handle large-scale corpora.