Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and...

Algorithmics and Applications of Tree and Graph Searching

Dennis Shasha, Jason T. L. Wang, and Rosalba Dennis Shasha, Jason T. L. Wang, and Rosalba GiugnoGiugno

Presenters:Presenters:Jerod Watson & Christan GrantJerod Watson & Christan Grant

Introduction Searching in Trees

Approximate Containment Queries Path-Only Searches Extension to Trees

Searching in Graphs Keygraph Searching in Graph DBs GraphGrep Subgraph Matching

Conclusion

Introduction

Modern search engines Keyword-based queries Impressive speed

Several research efforts have attempted to generalize keyword search to keytree and keygraph searching

XQuery

AQUA Query

Query expressed as a tree pattern, termed “query tree”

DB can be represented as single tree or as set of trees

Each tree could be ordered or unordered

Queries often concerned with the parent-child, ancestor-descendant”, or path relationship among nodes

Queries can be expressed by containment mapping.

Query tree may contain fixed length don’t cares (FLDCs) ex. “?”

Query tree may contain variable length don’t cares (VLDCs) ex. “*”

This class of queries referred to as approximate containment (AC) queries

Path-Only Searches

Many AC queries are concerned with paths only. Ex. “Find the descendants of Mary who

is a child of John”

XISS is an indexing and querying system designed to support regular path expressions

Extension to Trees

Pathfix algorithm Phase 1: Encodes each root-to-leaf path

of every data tree into a suffix array DB Phase 2: Compares the query tree Q

with each data tree D in the DB allowing a difference of DIFF

Handling Don’t Cares Partition query into connected subtrees

having don’t cares Match each of those don’t care free

subtrees with data trees in the DB For the matched subtrees that belong to

the same data tree, determine whether they combine to match the query based on the matching semantics of the don’t cares.

Filtering

Implementation ATreeGrep

GraphsGraphs

Graphs

Abstract data type of elements (nodes or vertices) interconnected by edges.

A graph is a specialized tree in which there is no constraint on the number of paths is possible from a node

No root Graph may contain cycles

Keygraph Searching

Searching for a particular graph or order of elements inside of a large graph (i.e. internet)

Searching for a particular graph or structure among several graphs (i.e. chemical elements)

Use indexing to reduce complexity

Keygraph Searching

Three basic steps1. Reduce the search space by filtering2. Formulate query into simple

structures3. Match

Keygraph Searching (survey) A* algorithm GraphDB Daylight Lore

A*

Seminal work by Nilson (1980) Route finding algorithm that keeps track of its

visited nodes and the distance it has traveled. Applications:

Protein databases (discovery and search) Image databases Chinese character databases CAD circuit data and software source code

A*

Pseudocode function A*(start,goal) var closed := the empty set var q := make_queue(path(start)) while q is not empty var p := remove_first(q) var x := the last node of p if x in closed continue if x = goal return p add x to closed foreach y in successors(x) enqueue(q, p, y) return failure

GraphDB

Specifies a data model and query model.

1. Queries are in the form of regular expressions

2. Nodes are classes representing data objects

3. Edges are classes to store paths in the database

4. Path classes are and indexing data structures are used to index database

Provides graph and search operations to:

Shortest path between two nodes

Subgraphs from a starting node and range

GraphDB

Daylight

"Provide the best known computer algorithms for chemical information processing to those who need them."

Uses finger printing to index/prune

ChemDBChemDB(Contains 6.5 million unique structures or subgraphs)

Lore

Database management system for XML Modeled using rooted labeled subgraph Indexed in four ways for fast regular

expression use Vindex, Tindex, Lindex, Pindex(Data Guide)

Lore

1) Vindex: For each edge labeled l, all nodes are index with incomming edges labeled l and some unique atomic value that satisfy some condition.

2) Tindex: A text index for all nodes with l-labeled edges a with a string of specific values containing specific words

3) Lindex: Link index to index nodes with outgoing l-labled edges

4) Pindex (DataGuide): indexes all nodes reachable from root through labled path.

The DataGuide is used by all queries from root. Other queries traverse paths using indexs(1-3), pruning

what is not a match.

Tindex (1999)

A Data structure to index semistructured database nodes that are reachable from several regular path expressions

T-index may be more efficient than P-index because it relaxes some constraints

Reportedly in graph of size 1500 T-index is 13% of database

GraphGrep

Uses variable length paths (cyclic or acyclic) to index DB. This provides for efficient filtering.

Nodes have ids (numbers) and labels (letters).

GraphGrep

Index Construction1. Choose an lp max indexing length

2. Create “path-representation”3. Create fingerprint

GraphGrep

Filtering the Database1. Query graph is parsed and a

fingerprint built2. Fingerprint are compared

1. If a graph has at least one value in its fingerprint that is less than the query fingerprint it is discarded.

2. Remaining graphs may contain > 1 sub graphs

GraphGrep

Filtering the Database Takes linear time to the size of the

database But discards 99% of database!!!

GraphGrep

Finding Subgraphs Matching with Queries Query tree depth first traversal

branches are decomposed into sequences of overlapping label-paths (patterns)

GraphGrep

Overlaps1. Last node in a patters coincides with

first node of next pattern (e.g. ABCB (lp = 3) ABC CB)

2. If a node has branches, it is included in the first pattern of every branch

3. The first node in a cycle is visited twice

GraphGrep

Matching Example1. Select the set of paths2. Combine lists with constraints3. Remove lists with equal id nodes in

non overlapping positions

GraphGrep

Techniques for Queries with Wildcards Consider the parts of the query graph that

is between wild cards (like pathfix) The cartesian product of the components

that match are valid. An entry in the cartesian product is a valid

path (length = wildcards) between nodes.

GraphGrep

1 GHz pentium III NCI databases (1,000 – 16,000

nodes) Average 20 nodes in db (max 270

nodes) Queries 13-189 nodes Lp = 4 and 10

GraphGrep

Linear in size of DB Different lp influence running time

Conclusions / Questions

Searching in Trees Introduces ATreeGrep

Searching in GraphsIntroduces GraphGrep

Thanks to:

God Class Wikipedia Various other Googled sources

Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and...

Documents

Transcript of Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and...