A graph-based solution for data discovery in large organisations
Being a data-driven organisation starts with finding the right datasets and then understanding and analysing the data. With a myriad of data sources spread across heterogeneous databases, modern organisations struggle to retrieve all the relevant data. In large organisations, data is stored in multiple databases, from cloud environments to legacy warehouses and mainframe applications. Hence, discovering data has become a time-consuming, iterative process for data analysts. Data discovery is a business intelligence term for the process of collecting data from various databases and consolidating it into a single source. Traditional data discovery solutions, such as dimensional search engines (e.g., OLAP), focused only on individual data units and not on the relationships among datasets. This is becoming problematic as the industry shows more interest in those relationships. Therefore, graph models are employed as an alternative method to store and manage the relationships among individual entities. Graphs have also shown their advantages in integrating data sources into a unified semantic graph under the virtual knowledge graph (VKG) paradigm. However, more research is needed to make these solutions practical and simple to use. This thesis demonstrates how graphs can remedy some of the weaknesses of existing technologies in the data discovery process. The proposed solution orchestrates a set of concepts and technologies to introduce a technology-agnostic layer, called Graph Gateway, that hides the complexity of the data environment from end-users. We replace the query language to help analysts formulate queries simply and flexibly. The proposed query language enables users to query several databases in a single query. Our model integrates data catalogues and data dictionaries into the graph layer to help users formulate and complete their queries.