Generating Actionable Knowledge from Big Data: Knowledge Extraction and Truth Discovery
Thesis posted on 2022-03-28, 16:14, authored by Xiu Fang
To revolutionize modern society by harnessing the wisdom of Big Data, numerous knowledge bases (KBs) have been constructed to feed knowledge-driven applications with Resource Description Framework (RDF) triples. The main challenges in KB construction are extracting information from large-scale, possibly conflicting, and differently structured data sources (i.e., the knowledge extraction problem) and reconciling the conflicts that reside in those sources (i.e., the truth discovery problem). Tremendous research effort has been devoted to each of these problems. However, existing KBs are still far from comprehensive and accurate. In this dissertation, we first propose a system for generating actionable knowledge from Big Data, and use this system to construct a comprehensive KB called GrandBase. We then address the research issues raised by GrandBase construction by developing a series of methodologies. First, we study predicate extraction and implement ontology augmentation for knowledge base expansion. Second, we address truth discovery (on both single-valued and multi-valued objects or predicates) and the performance evaluation of truth discovery methods for knowledge base purification. In particular, we first propose a framework for extracting new predicates from four types of data sources, namely Web texts, Document Object Model (DOM) trees, existing KBs, and query streams, to augment the ontology of an existing KB (i.e., Freebase). We use query streams and two major KBs, DBpedia and Freebase, to seed the predicate extraction from Web texts and DOM trees. Then, to estimate value veracity for multi-valued objects, we model the endorsement relations among sources by quantifying their two-sided inter-source agreements. Two aspects of source reliability are derived from the two graphs constructed by modeling the inter-source relations.
To estimate source reliability more precisely for effective multi-valued truth discovery, our graph-based model incorporates four important implications: two types of source relations, object popularity, loose mutual exclusion, and the long-tail phenomenon in source coverage. After that, to fully leverage the advantages of existing truth discovery methods and achieve better and more robust truth discovery, we propose to extract truth from the prediction results of those methods. Our ensemble approach distinguishes between the single-valued and multi-valued truth discovery problems. Finally, for the performance evaluation of truth discovery methods, because ground truth may be very limited or even impossible to obtain, we attempt to conduct the evaluation without using ground truth. For each of the models and approaches presented in this dissertation, we have conducted extensive experiments using either real-world or synthetic datasets. Empirical studies show the effectiveness of our approaches. We also discuss future research directions regarding GrandBase construction and extension.