Effective and efficient topical collocation models

Zhao, Zhendong

doi:10.25949/21616641.v1

thesis

posted on 2022-11-24, 05:38 authored by Zhendong Zhao

Topic models are powerful tools for automatically detecting latent topic distributions from a set of documents. The "bag-of-words" representation used in conventional topic models, however, ignores potentially useful dependencies between words. Topical collocation models extend topic models by modelling relationships between mutually informative consecutive words. Since the standard methods for evaluating "bag-of-words" topic models do not directly apply to topical collocation models, many fundamental questions remain open. For example, it is not clear whether topical collocation models are better than topic models, which kind of topical collocation model is the best, and how well the best-performing model scales to large datasets.

To address these questions, this thesis has three parts. In the first part, we develop four evaluation methods for topical collocation models and apply them to five topical collocation models and a standard topic model, Latent Dirichlet Allocation (LDA). We evaluate the models using human annotation, a pointwise mutual information metric, and practical downstream information retrieval and classification tasks. The experiments reveal that 1) almost all the topical collocation models outperform LDA on all the evaluation methods; and 2) topical collocation models using Adaptor Grammars (AG-colloc) almost always set a new state of the art, though some improvements over baselines are marginal.
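
As a rough illustration of the pointwise mutual information idea mentioned above, the following minimal sketch estimates PMI for a pair of consecutive words from raw counts. It is a generic textbook formulation, not the metric defined in the thesis; the function name bigram_pmi, the toy token list, and the unsmoothed count-based estimates are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens, w1, w2):
    """Estimate PMI(w1, w2) = log(p(w1, w2) / (p(w1) * p(w2))) for adjacent words.

    Hypothetical helper for illustration only: probabilities are plain
    relative frequencies with no smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    joint = bigrams[(w1, w2)]
    if joint == 0:
        # The pair never occurs consecutively, so its estimated PMI is -infinity.
        return float("-inf")
    p_joint = joint / (len(tokens) - 1)   # relative frequency of the bigram
    p_w1 = unigrams[w1] / len(tokens)     # relative frequency of the first word
    p_w2 = unigrams[w2] / len(tokens)     # relative frequency of the second word
    return math.log(p_joint / (p_w1 * p_w2))


# Toy corpus (assumed data): "new york" co-occurs more often than chance predicts.
tokens = "new york is a big city and new york is busy".split()
score = bigram_pmi(tokens, "new", "york")
```

A positive score means the two words appear together more often than their individual frequencies would predict, which is the intuition behind treating such a pair as a candidate collocation.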

Having identified the best topical collocation model, the second part of this dissertation focuses on scaling it to very large datasets with a sparse parallel reformulation. We present an efficient reformulation of the AG-colloc model, an unsupervised topical collocation model that can learn collocations of arbitrary length. Taking advantage of sparsity in both collocation and topic distributions, we develop a novel linear-time sampling algorithm that can be easily parallelised so that the reformulation is capable of handling large-scale corpora.

In the third part, the new implementation is compared to LDA and a topic model for learning topical collocations (PA) on large-scale corpora in terms of speed and quality, using the evaluation methods developed in the first part of the thesis.

The three contributions of this thesis are:

1. An empirical comparison of five topic models for learning topical collocations (PA, LDACOL, TNG, AG-colloc, and AG-colloc2) to a standard topic model (LDA), using four evaluation methods;

2. An efficient reformulation of the AG-colloc model;

3. A novel linear-time sampling algorithm, which can be easily parallelised so that the reformulation is capable of handling large-scale corpora.

Table of Contents

1 Introduction -- 2 Background on topic models and topical collocation models -- 3 Finding the most effective topic model for learning topical collocations for small corpora -- 4 An efficient reformulation of adaptor grammar for learning topical collocations -- 5 Evaluations on large-scale corpora -- 6 Conclusions and future work

Notes

A thesis submitted to Macquarie University for the degree of Doctor of Philosophy

Awarding Institution

Macquarie University

Degree Type

Thesis PhD

Degree

Thesis PhD, Macquarie University, Department of Computing, 2020

Department, Centre or School

Department of Computing

Year of Award

2020

Principal Supervisor

Mark Johnson

Rights

Copyright: The Author
Copyright disclaimer: https://www.mq.edu.au/copyright-disclaimer

Language

English

Extent

173 pages

Keywords

topic modelling, collocation

Licence

In Copyright
