Modeling topics and knowledge bases with vector representations
Thesis posted on 2022-03-28, 02:01, authored by Dat Quoc Nguyen
Motivated by the recent success of latent feature vector representations (i.e. embeddings) in various natural language processing tasks, this thesis investigates how such representations can help build better topic models and improve link prediction models for knowledge base completion. The first objective is to incorporate latent feature word representations, which carry external information from a large corpus, in order to improve topic inference in a small corpus. The second objective is to develop new embedding models for predicting missing relationships between entities in knowledge bases.

The major contributions of this thesis are summarized as follows:

- We propose two latent feature topic models which integrate word representations trained on large external corpora into two Dirichlet multinomial topic models: a Latent Dirichlet Allocation model and a one-topic-per-document Dirichlet Multinomial Mixture model.
- We introduce a triple-based embedding model named STransE to improve complex relationship prediction in knowledge bases. We also describe a new embedding approach, which combines the Latent Dirichlet Allocation model and the STransE model, to improve search personalization.
- We present a neighborhood mixture model in which we formalize an entity representation as a mixture of its neighborhood in the knowledge base.

Extensive experiments show that our proposed models and approach obtain better performance than well-established baselines in a wide range of evaluation tasks.
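
The abstract names STransE as a triple-based embedding model but does not state how a triple is scored. As a rough illustration only, the sketch below assumes the commonly published form of the STransE score, in which each relation has a translation vector and two projection matrices, and a triple (head, relation, tail) is scored by the norm of W_head * head + r - W_tail * tail. The function names, the toy vectors, and the use of identity matrices are assumptions made for this example and are not taken from the thesis; no training procedure is shown.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def stranse_score(head, tail, rel_vec, w_head, w_tail, norm=1):
    """Distance-style score for one triple; lower means more plausible.

    Assumed form: || w_head * head + rel_vec - w_tail * tail ||,
    using the L1 norm when norm == 1 and the L2 norm otherwise.
    """
    projected_head = mat_vec(w_head, head)
    projected_tail = mat_vec(w_tail, tail)
    diff = [h + r - t for h, r, t in zip(projected_head, rel_vec, projected_tail)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    return sum(d * d for d in diff) ** 0.5


def identity(k):
    """Return a k x k identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]


# Hypothetical 2-dimensional toy embeddings, chosen by hand for illustration.
head = [1.0, 0.0]
rel = [0.5, 0.5]
tail_good = [1.5, 0.5]   # equals head + rel, so it should score 0
tail_bad = [-1.0, 2.0]   # far from head + rel, so it should score higher

W = identity(2)  # with identity projections the score is a plain translation distance

good = stranse_score(head, tail_good, rel, W, W)
bad = stranse_score(head, tail_bad, rel, W, W)
print(good, bad)
```

In a link prediction setting, a score of this kind would be computed for every candidate tail (or head) entity and the candidates ranked by ascending score. The relation-specific matrices are what let the head and tail roles of an entity be projected differently, which is the usual motivation given for handling complex relationships such as one-to-many.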