Incremental graph pattern based node matching
thesisposted on 2022-03-28, 10:21 authored by Guohao Sun
Graph Pattern based Subgraph Matching (GPSM) is used to identify all the matching subgraphs of a pattern graph GP in a data graph GD. The existing GPSM solutions are based on subgraph isomorphism or Bounded Graph Simulation (BGS), which aims to find the entire matching subgraphs in GD. However, in some real applications, such as group finding and expert recommendation, people are more interested in identifying some nodes based on a specified structure between them, leading to the Graph Pattern based Node Matching (GPNM) problem. GPNM aims to find all the matches of the nodes in a data graph GD based on a given pattern graph GP. Firstly, in real scenarios, both GP and GD are updated frequently. However, the existing GPNM methods must perform a new GPNM procedure from scratch to deliver the node matching results based on the updated GP and updated GD, which consumes significant time. Therefore, there is a pressing need for a method to efficiently deliver the node matching results on the updated graphs. To address this problem, in this thesis, we propose a novel incremental GPNM method called INC-GPNM, where we first build an index to incrementally record the shortest path length range between different label types in GD, and then identify the affected parts of GD in GPNM including nodes and edges with respect to the updates of GP and GD. Based on the index structure and our novel search strategies, INC-GPNM can efficiently deliver node matching results taking the updates of GP and GD as inputs, and can greatly reduce the query processing time with improved time complexity. Secondly, in some real applications (e.g., social graph searches on Facebook), many typical pattern graphs frequently and repeatedly appear in users' queries in a short period of time. In this situation, it is still time-consuming to apply the incremental GPNM procedure for each of the updates in data graph. To address this problem, in this thesis, we first analyze the updates in the data graph and find that not all the updates in the data graph essentially affect the GPNM results. Then, we propose the notion of elimination relationships and analyze the elimination relationships between multiple updates in the data graph. In addition, if one update Ua can eliminate the other update Ub, and Ub can eliminate update Uc, there exists the hierarchical structure between these elimination relationships. We further generate an Elimination Hierarchy Tree (EH-Tree) to index the elimination relationships. Based on the EH-Tree, we propose a GPNM method called EH-GPNM, that considers the elimination relationships between multiple updates in the data graph. EH-GPNM can efficiently deliver node matching results when facing frequent and repeated pattern graphs with multiple updates in the data graph. Thirdly, inspired by EH-GPNM, we noted that the elimination relationships not only exist among the updates in the data graph, but are also present among the updates in the pattern graph and even in the cross updates from GP and GD. To further improve the GPNM efficiency when both GP and GD are updated frequently, in this thesis, we propose a more efficient GPNM method, called UA-GPNM. UA-GPNM first detects the elimination relationships between multiple independent updates in GP and GD, and also the cross elimination relationships between the updates from GP and GD, then UA-GPNM generates an EH-Tree to index all the elimination relationships. In addition, we also propose a graph partition strategy in UA-GPNM to accelerate the GPNM procedure. The experiments show that UA-GPNM can achieve better efficiency compared with INC-GPNM and EH-GPNM when facing the updates of GP and GD. All the methods proposed above in this thesis have been validated and evaluated through sufficient experiments and theoretical analysis. The results have demonstrated that the proposed methods significantly outperform the existing work of GPNM -- abstract.