Algorithmics and Applications of Tree and Graph Searching
Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno
Presenters: Jerod Watson & Christan Grant
Introduction
Searching in Trees
  Approximate Containment Queries
  Path-Only Searches
  Extension to Trees
Searching in Graphs
  Keygraph Searching in Graph DBs
  GraphGrep
  Subgraph Matching
Conclusion
Introduction
Modern search engines
  Keyword-based queries
  Impressive speed
Several research efforts have attempted to generalize keyword search to keytree and keygraph searching
XQuery
AQUA Query
Query expressed as a tree pattern, termed “query tree”
DB can be represented as single tree or as set of trees
Each tree could be ordered or unordered
Queries are often concerned with the parent-child, ancestor-descendant, or path relationships among nodes
Queries can be expressed by containment mapping.
Query tree may contain fixed-length don’t cares (FLDCs), e.g. “?”
Query tree may contain variable-length don’t cares (VLDCs), e.g. “*”
This class of queries is referred to as approximate containment (AC) queries
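As a rough illustration of how don't cares behave on a single path, the sketch below matches a label pattern against a root-to-node label sequence. It assumes that "?" stands for exactly one node and "*" for zero or more consecutive nodes; the function name `path_matches` and the list-of-labels encoding are hypothetical and not taken from the slides.

```python
from functools import lru_cache


def path_matches(pattern, path):
    """Return True if the label sequence `path` matches `pattern`.

    Assumed semantics: "?" (FLDC) matches exactly one node,
    "*" (VLDC) matches zero or more consecutive nodes.
    """
    pattern, path = tuple(pattern), tuple(path)

    @lru_cache(maxsize=None)
    def go(i, j):
        # i indexes the pattern, j indexes the data path.
        if i == len(pattern):
            return j == len(path)
        if pattern[i] == "*":
            # The VLDC either matches nothing or absorbs one more node.
            return go(i + 1, j) or (j < len(path) and go(i, j + 1))
        if j == len(path):
            return False
        if pattern[i] == "?" or pattern[i] == path[j]:
            return go(i + 1, j + 1)
        return False

    return go(0, 0)
```

For example, `path_matches(["John", "*", "Mary"], ["John", "Mary"])` holds under these assumptions, while replacing "*" with "?" requires exactly one intermediate node.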
Path-Only Searches
Many AC queries are concerned with paths only.
Ex. “Find the descendants of Mary who is a child of John”
XISS is an indexing and querying system designed to support regular path expressions
Extension to Trees
Pathfix algorithm
  Phase 1: Encodes each root-to-leaf path of every data tree into a suffix array DB
  Phase 2: Compares the query tree Q with each data tree D in the DB, allowing a difference of DIFF
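The two phases can be sketched in a simplified form. This is not the Pathfix implementation: it assumes trees are given as `(label, children)` tuples, stores every suffix of every root-to-leaf path in a sorted list standing in for the suffix array, and treats DIFF as the number of query paths allowed to be missing from a data tree. All function names here are hypothetical.

```python
from bisect import bisect_left


def root_to_leaf_paths(tree):
    """Yield every root-to-leaf label path of a (label, children) tree."""
    label, children = tree
    if not children:
        yield (label,)
    for child in children:
        for rest in root_to_leaf_paths(child):
            yield (label,) + rest


def build_suffix_array(trees):
    """Phase 1 (simplified): sorted (suffix, tree_id) pairs over all paths."""
    entries = set()
    for tree_id, tree in enumerate(trees):
        for path in root_to_leaf_paths(tree):
            for start in range(len(path)):
                entries.add((path[start:], tree_id))
    return sorted(entries)


def trees_containing(suffix_array, query_path):
    """Ids of trees in which query_path is a prefix of some stored suffix."""
    query_path = tuple(query_path)
    # (query_path,) sorts just before every entry whose suffix starts with it.
    pos = bisect_left(suffix_array, (query_path,))
    hits = set()
    while (pos < len(suffix_array)
           and suffix_array[pos][0][:len(query_path)] == query_path):
        hits.add(suffix_array[pos][1])
        pos += 1
    return hits


def candidates(suffix_array, num_trees, query_tree, diff):
    """Phase 2 (simplified): trees missing at most `diff` query paths."""
    missing = [0] * num_trees
    for path in root_to_leaf_paths(query_tree):
        hits = trees_containing(suffix_array, path)
        for tree_id in range(num_trees):
            if tree_id not in hits:
                missing[tree_id] += 1
    return [t for t in range(num_trees) if missing[t] <= diff]
```

The sorted list makes each path lookup a binary search followed by a short scan, which is the property a suffix array is meant to provide here.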
Handling Don’t Cares
  Partition the query into connected subtrees having no don’t cares
  Match each of those don’t-care-free subtrees with data trees in the DB
  For the matched subtrees that belong to the same data tree, determine whether they combine to match the query based on the matching semantics of the don’t cares.
Filtering
Implementation: ATreeGrep
Graphs
Abstract data type of elements (nodes or vertices) interconnected by edges.
A graph generalizes a tree: there is no constraint on the number of paths possible from a node
No root
Graph may contain cycles
Keygraph Searching
Searching for a particular graph or order of elements inside a large graph (e.g. the Internet)
Searching for a particular graph or structure among several graphs (e.g. chemical structures)
Use indexing to reduce complexity
Keygraph Searching
Three basic steps
1. Reduce the search space by filtering
2. Formulate query into simple structures
3. Match
Keygraph Searching (survey)
  A* algorithm
  GraphDB
  Daylight
  Lore
A*
Seminal work by Nilsson (1980)
Route-finding algorithm that keeps track of its visited nodes and the distance it has traveled
Applications:
  Protein databases (discovery and search)
  Image databases
  Chinese character databases
  CAD circuit data and software source code
A*
Pseudocode
function A*(start, goal)
    var closed := the empty set
    var q := make_queue(path(start))
    while q is not empty
        var p := remove_first(q)
        var x := the last node of p
        if x in closed
            continue
        if x = goal
            return p
        add x to closed
        foreach y in successors(x)
            enqueue(q, p, y)
    return failure
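The pseudocode above leaves the queue ordering implicit. A minimal runnable sketch in Python follows, assuming the queue is a priority queue ordered by cost so far plus a heuristic estimate; the `successors` and `heuristic` callables are parameters introduced for illustration and are not part of the slides.

```python
import heapq


def a_star(start, goal, successors, heuristic):
    """Return a lowest-cost path from start to goal, or None.

    successors(x) yields (neighbor, edge_cost) pairs and heuristic(x)
    estimates the remaining cost to the goal. Both are assumed inputs,
    and the heuristic is assumed to be consistent.
    """
    closed = set()
    # Queue entries: (estimated total cost, cost so far, path).
    queue = [(heuristic(start), 0, [start])]
    while queue:
        _, cost, path = heapq.heappop(queue)
        node = path[-1]
        if node in closed:
            continue
        if node == goal:
            return path
        closed.add(node)
        for neighbor, step in successors(node):
            if neighbor not in closed:
                new_cost = cost + step
                heapq.heappush(
                    queue,
                    (new_cost + heuristic(neighbor), new_cost, path + [neighbor]),
                )
    return None
```

With a heuristic that always returns 0 the search degenerates to uniform-cost (Dijkstra-style) search, which is a convenient way to check the code on a small graph.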
GraphDB
Specifies a data model and query model.
1. Queries are in the form of regular expressions
2. Nodes are classes representing data objects
3. Edges are classes to store paths in the database
4. Path classes and indexing data structures are used to index the database
Provides graph and search operations to:
  Find the shortest path between two nodes
  Extract subgraphs from a starting node and range
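The subgraph operation can be illustrated with a breadth-first search. The sketch below is not GraphDB's API: it assumes the graph is a plain adjacency dictionary and interprets "range" as a maximum number of hops from the starting node; `subgraph_within_range` is a hypothetical name.

```python
from collections import deque


def subgraph_within_range(adjacency, start, max_hops):
    """Return the subgraph induced by nodes within max_hops of start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the range
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    nodes = set(dist)
    # Keep only edges whose endpoints both lie inside the range.
    return {n: [m for m in adjacency.get(n, ()) if m in nodes] for n in nodes}
```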
GraphDB
Daylight
"Provide the best known computer algorithms for chemical information processing to those who need them."
Uses fingerprinting to index and prune
ChemDB (contains 6.5 million unique structures or subgraphs)
Lore
Database management system for XML
Modeled using rooted labeled subgraph
Indexed in four ways for fast regular expression use
  Vindex, Tindex, Lindex, Pindex (DataGuide)
Lore
1) Vindex: for each label l, indexes all nodes with incoming edges labeled l whose atomic values satisfy some condition.
2) Tindex: a text index over all nodes with incoming l-labeled edges whose string values contain specific words.
3) Lindex: a link index over nodes with outgoing l-labeled edges.
4) Pindex (DataGuide): indexes all nodes reachable from the root through a labeled path.
The DataGuide is used by all queries that start from the root. Other queries traverse paths using indexes 1-3, pruning whatever is not a match.
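A toy version of two of these indexes may make the descriptions more concrete. This is only a sketch and does not reflect Lore's actual data structures: it assumes the database is a list of `(parent, label, child)` edges plus a dictionary of atomic values, and the names `build_indexes` and `vindex_lookup` are hypothetical.

```python
from collections import defaultdict


def build_indexes(edges, values):
    """Build toy value and link indexes.

    edges: (parent, label, child) triples; values: node -> atomic value.
    """
    vindex = defaultdict(list)  # label -> [(value, node)] for atomic nodes
    lindex = defaultdict(set)   # (child, label) -> parents with that edge
    for parent, label, child in edges:
        lindex[(child, label)].add(parent)
        if child in values:
            vindex[label].append((values[child], child))
    return vindex, lindex


def vindex_lookup(vindex, label, condition):
    """Nodes with an incoming `label` edge whose value satisfies condition."""
    return {node for value, node in vindex[label] if condition(value)}
```

A query such as "books with price above 20" could then start from `vindex_lookup(vindex, "price", lambda v: v > 20)` and walk upward through the link index instead of scanning from the root.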
Tindex (1999)
A data structure that indexes the nodes of a semistructured database that are reachable through several regular path expressions
T-index may be more efficient than P-index because it relaxes some constraints
Reportedly, in a graph of size 1500, the T-index is 13% of the database
GraphGrep
Uses variable length paths (cyclic or acyclic) to index DB. This provides for efficient filtering.
Nodes have ids (numbers) and labels (letters).
GraphGrep
Index Construction
1. Choose an lp max indexing length
2. Create “path-representation”
3. Create fingerprint
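The construction steps can be sketched as follows. This is a simplification rather than GraphGrep's actual code: it assumes lp counts nodes per path, enumerates only simple paths (no repeated node), and uses a plain counter keyed by label-path instead of a hashed fingerprint table. The name `fingerprint` and the input encoding are hypothetical.

```python
from collections import Counter


def fingerprint(labels, adjacency, lp):
    """Count id-paths per label-path for all paths of 1..lp nodes.

    labels: node id -> label; adjacency: node id -> iterable of neighbors.
    """
    counts = Counter()

    def extend(path):
        # Record the label-path spelled out by the current id-path.
        counts["".join(labels[n] for n in path)] += 1
        if len(path) == lp:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:  # simple paths only (a simplification)
                extend(path + [nxt])

    for node in labels:
        extend([node])
    return counts
```

On the three-node chain A-B-C with lp = 3, the label-path "ABC" is counted once, as is its reverse "CBA".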
GraphGrep
Filtering the Database
1. The query graph is parsed and a fingerprint is built
2. Fingerprints are compared
   1. If a graph has at least one value in its fingerprint that is less than the corresponding value in the query fingerprint, it is discarded.
   2. Remaining graphs may contain more than one matching subgraph
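The comparison rule amounts to a per-entry dominance check. A minimal sketch, assuming each fingerprint is a mapping from label-path to count (the function names `may_contain` and `filter_database` are hypothetical):

```python
from collections import Counter


def may_contain(db_fingerprint, query_fingerprint):
    """False if any query count exceeds the graph's count for that path."""
    return all(db_fingerprint.get(path, 0) >= count
               for path, count in query_fingerprint.items())


def filter_database(db_fingerprints, query_fingerprint):
    """Ids of graphs that survive the fingerprint filter."""
    return [gid for gid, fp in db_fingerprints.items()
            if may_contain(fp, query_fingerprint)]
```

Passing the filter is only a necessary condition: surviving graphs still have to go through the subgraph matching step.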
GraphGrep
Filtering the Database
  Takes time linear in the size of the database
  But discards 99% of the database!
GraphGrep
Finding Subgraphs Matching Queries
  The branches of a depth-first traversal tree of the query are decomposed into sequences of overlapping label-paths (patterns)
GraphGrep
Overlaps
1. The last node in a pattern coincides with the first node of the next pattern (e.g. with lp = 3, ABCB becomes ABC, CB)
2. If a node has branches, it is included in the first pattern of every branch
3. The first node in a cycle is visited twice
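Rule 1 can be shown on a single branch. The sketch below assumes lp counts nodes per pattern and handles only one linear branch, so rules 2 and 3 (branching nodes and cycles) are out of scope; `split_into_patterns` is a hypothetical name.

```python
def split_into_patterns(path, lp):
    """Split a label path into patterns of at most lp nodes,
    where consecutive patterns share exactly one node."""
    if lp < 2:
        raise ValueError("lp must be at least 2 for patterns to overlap")
    if len(path) <= 1:
        return [path]
    patterns = []
    start = 0
    while start < len(path) - 1:
        patterns.append(path[start:start + lp])
        start += lp - 1  # step back one node so the patterns overlap
    return patterns
```

Calling `split_into_patterns("ABCB", 3)` reproduces the decomposition from the slide: ABC followed by CB.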
GraphGrep
Matching Example
1. Select the set of paths
2. Combine lists with constraints
3. Remove lists with equal id nodes in non-overlapping positions
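Steps 2 and 3 behave like a join over id-path lists. The sketch below assumes two consecutive patterns that overlap in one node (for example ABC and CB) and that each pattern already has a list of matching id-paths; cycles, where a repeated id is expected, are not handled. The name `combine` is hypothetical.

```python
def combine(first_paths, second_paths):
    """Join id-paths whose overlap node agrees and drop joins that
    reuse a node id in a non-overlapping position."""
    results = []
    for p in first_paths:
        for q in second_paths:
            if p[-1] != q[0]:
                continue  # overlap constraint: last id must equal first id
            joined = p + q[1:]
            if len(set(joined)) == len(joined):  # no id appears twice
                results.append(joined)
    return results
```

For instance, joining (1, 2, 3) with (3, 2) is rejected because node 2 would appear twice, while joining it with (3, 5) yields (1, 2, 3, 5).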
GraphGrep
Techniques for Queries with Wildcards
  Consider the parts of the query graph that are between wildcards (like Pathfix)
  Take the Cartesian product of the components that match
  An entry in the Cartesian product is valid if there is a path (length = wildcards) between the nodes
GraphGrep
1 GHz Pentium III
NCI databases (1,000 – 16,000 nodes)
Average 20 nodes in db (max 270 nodes)
Queries 13-189 nodes
lp = 4 and 10
GraphGrep
Linear in the size of the DB
Different lp values influence running time
Conclusions / Questions
Searching in Trees
  Introduces ATreeGrep
Searching in Graphs
  Introduces GraphGrep
Thanks to:
God
Class
Wikipedia
Various other Googled sources