Solr Graph Query: Presented by Kevin Watters, KMW Technology
Date posted: 16-Apr-2017. Category: Technology.
October 11-14, 2016, Boston, MA
Solr Graph Query
Kevin Watters
Founder, KMW Technology
Solr 6.0 Graph Query Overview
Kevin Watters, KMW Technology, kwatters@kmwllc.com, www.kmwllc.com
October 14, 2016
KMW Technology Overview
Boston based software consulting and professional services organization.
Founded in 2010.
Developers & consultants with deep industry experience.
Boutique firm specializing in Open Source, Search, Big Data, Machine Learning, and AI.
Custom Connectors, Pipelines, Classifiers, Search, UI/UX development.
Data and Information Architecture.
What is a Graph?
One data model to rule them all!
A generic representation of all linked data models.
G = ?!?!
A graph is made up of nodes and edges.
Nodes/Vertices ( node_id ) have metadata and links to other nodes.
Edges/Links ( edge_ids ) are associated with a node and point to other nodes.
Nodes can be modeled as documents in the index with a multi-value field containing the edges.
For other use cases edges can also be modeled as documents.
Graph Traversal
There are many graph traversal / exploration algorithms: DFS, BFS, A*, alpha-beta, etc.
Solr Graph Query implements BFS.
Breadth-First Search: each hop expands the frontier of the graph.
It explores all current edges in a single step/query!
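The frontier expansion described above can be sketched in memory. This is only an illustration and not Solr's implementation: it assumes each document is a dict with a `node_id` and a list of `edge_ids`, and the function name `graph_traverse` and the sample data are made up for the example.

```python
def graph_traverse(docs, root_ids, max_depth=-1, return_root=True):
    """Breadth-first traversal over docs shaped like {"node_id": ..., "edge_ids": [...]}."""
    by_id = {doc["node_id"]: doc for doc in docs}
    roots = {r for r in root_ids if r in by_id}
    visited = set(roots)   # plays the role of the result bit set
    frontier = set(roots)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        # Collect every outgoing edge of the current frontier in one step.
        edges = {e for node_id in frontier for e in by_id[node_id]["edge_ids"]}
        # Nodes already seen are skipped, which doubles as the cycle check.
        frontier = {e for e in edges if e in by_id and e not in visited}
        visited |= frontier
        depth += 1
    return visited if return_root else visited - roots


# Hypothetical sample graph with a cycle (d points back to a).
docs = [
    {"node_id": "a", "edge_ids": ["b", "c"]},
    {"node_id": "b", "edge_ids": ["d"]},
    {"node_id": "c", "edge_ids": ["d"]},
    {"node_id": "d", "edge_ids": ["a"]},
]
reachable = graph_traverse(docs, ["a"])
one_hop = graph_traverse(docs, ["a"], max_depth=1, return_root=False)
```

Each pass of the loop corresponds to one hop: all edges of the current frontier are gathered at once, and only unseen nodes form the next frontier.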
Graph Query Parser Syntax
Parameter | Default | Description
from | | Field containing the node id
to | | Field containing the edge id(s)
maxDepth | -1 | The number of hops to traverse from the root of the graph. -1 means traverse until all edges and documents have been collected. maxDepth=1 is similar behavior to a JOIN.
traversalFilter | null | Arbitrary query string to apply at each hop of the traversal
returnRoot | true | true or false: whether the documents matching the root query should be returned
returnOnlyLeaf | false | true or false: return only documents in the result set that do not have a value in the to field
useAutn | false | Whether to use an Automaton query or a TermsQuery for edge traversal
Uses Solr's query parser plugin and local params syntax: {!graph from=node_id to=edge_ids}query
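As a rough illustration of the local params syntax, the query string can be assembled with a small helper. The helper name `graph_query` is hypothetical, and it assumes parameter values contain no spaces or quotes (no escaping is attempted).

```python
def graph_query(root_query, from_field="node_id", to_field="edge_ids", **options):
    """Build a {!graph ...} local-params query string around a root query."""
    params = {"from": from_field, "to": to_field, **options}
    parts = []
    for name, value in params.items():
        if isinstance(value, bool):
            # Solr expects lowercase true/false.
            value = "true" if value else "false"
        parts.append(f"{name}={value}")
    return "{!graph " + " ".join(parts) + "}" + root_query


basic = graph_query("id:user_1")
limited = graph_query("id:user_1", maxDepth=2, returnRoot=False)
```

Here `basic` is `{!graph from=node_id to=edge_ids}id:user_1`, matching the form shown above.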
Key Features and Design Goals
"Graph is a Filter on top of your data" - someone
Designed for large scale and large number of edges and very deep traversals.
Limited memory usage for traversal.
Cycle detection for free (based on current bit set!)
Highly cacheable via the FilterCache!
Support multiValued fields for nodes and/or edges.
Support arbitrary query filters during the exploration with the Traversal Filter.
Follow Every Edge! No edge left behind! Traversal is complete!
Works with Facets, Facet Queries, and other search components seamlessly.
Memory Usage
One bit set to rule them all (for the result set).
BitSet provides cycle detection for free. (Have I been here before?)
BitSet equal to size of index!
100 Million doc index only uses about 12 MB RAM per query! (Same size as 1 filter cache entry!)
root nodes: BitSet only if returnRoot = false
leaf nodes: same for all graph queries.
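The "about 12 MB" figure follows from one bit per document in the index. A quick back-of-the-envelope check, assuming a plain fixed-size bit set with no compression or object overhead (the function name is made up for the example):

```python
def bitset_megabytes(num_docs):
    """Size of a one-bit-per-document bit set, in megabytes (10^6 bytes)."""
    return num_docs / 8 / 1_000_000


size_mb = bitset_megabytes(100_000_000)  # 100 million documents
```

For 100 million documents this gives 12.5 MB, in line with the figure quoted above.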
Performance Considerations
Use DocValues, they're SO MUCH FASTER!
Don't tokenize your node/edge ids! (unless that's what you want)
Performance is a function of the number of unique edges that are traversed, not the number of nodes.
Limit depth if you know how far to go in the traversal.
Graph Query For Security
Graph queries are elegant and simple to use for traversing security hierarchies such as LDAP and AD.
Custom security models that are hierarchical or folder based in nature.
Supports Users being members of Groups that can be members of other Groups.
Adding or removing a user/group is updating just 1 document, not re-indexing large portions of your index!
Example Company with Security Model
Document Security Model within the Solr Index
Graph Traversal for User 1
Graph Traversal for User 2
Graph Based Security Query
Single security query to traverse the graph: {!graph from=node_id to=edge_ids returnOnlyLeaf=true}id:user_1
Security query is applied as a filter to the query request to ensure the security filter is cached!
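One way to picture this is as request parameters in which the user's search goes in `q` and the graph traversal goes in `fq`. This is a minimal sketch: the helper name `secured_params` and the sample query `title:report` are made up, and it assumes the user id is a trusted value that needs no escaping.

```python
from urllib.parse import urlencode


def secured_params(user_query, user_id):
    """Pair a user query with the graph security traversal as a filter query."""
    security_filter = (
        "{!graph from=node_id to=edge_ids returnOnlyLeaf=true}id:" + user_id
    )
    # Keeping the traversal in fq lets Solr cache it separately from q.
    return {"q": user_query, "fq": security_filter}


params = secured_params("title:report", "user_1")
query_string = urlencode(params)  # ready to append to a /select URL
```

Because the filter depends only on the user, repeated searches by the same user can reuse the same filter string.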
Distributed & Solr Cloud
You can distribute the user/group records to all shards in the index with smart routing!
Distribute the documents only across the shards.
Fixed number of permissions on each shard and distributed documents keeps graph traversals local for the best performance!
Users, Actions and Items
Model your browsing/purchase history as:
Users (have an ID)
Items (have an ID, metadata, category, etc.)
Actions (link between user and Items, such as rating, purchase, like/dislike)
Find similar users
Graph traversal from a user (or set of users) through their actions to items they like, to find similar users, and out to items they like.
Now, exclude the original starting set: returnRoot=false
Users who buy X also buy Y
[Figure: traversal from Item 1 (root) through Action/Buy nodes (depth=1) to User 1 and User 2 (depth=2), then through Action/Buy nodes (depth=3) to Item 2, Item 3 and Item 4 (depth=4).]
4 hops in the graph from an Item gets you to related items; omit the starting point and only return records that are items:
{!graph from=node_id to=edge_id maxDepth=4 returnRoot=false}id:Item_1 AND type:item
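The four-hop item-to-item traversal can be tried on a toy data set. Everything here is a hypothetical stand-in: the ids, the `type` field, and the exact wiring between users, actions and items are assumptions, and the in-memory `traverse` function only mimics the traversal rather than calling Solr.

```python
# Hypothetical documents keyed by node_id; edge_id lists the nodes each one points to.
docs = {
    "item_1": {"type": "item", "edge_id": ["buy_1", "buy_2"]},
    "buy_1": {"type": "action", "edge_id": ["user_1"]},
    "buy_2": {"type": "action", "edge_id": ["user_2"]},
    "user_1": {"type": "user", "edge_id": ["buy_3", "buy_4"]},
    "user_2": {"type": "user", "edge_id": ["buy_5"]},
    "buy_3": {"type": "action", "edge_id": ["item_2"]},
    "buy_4": {"type": "action", "edge_id": ["item_4"]},
    "buy_5": {"type": "action", "edge_id": ["item_3"]},
    "item_2": {"type": "item", "edge_id": []},
    "item_3": {"type": "item", "edge_id": []},
    "item_4": {"type": "item", "edge_id": []},
}


def traverse(docs, root_id, max_depth):
    """Breadth-first traversal that leaves out the root, like returnRoot=false."""
    visited = {root_id}
    frontier = {root_id}
    for _ in range(max_depth):
        edges = {e for node in frontier for e in docs[node]["edge_id"]}
        frontier = {e for e in edges if e in docs and e not in visited}
        visited |= frontier
    return visited - {root_id}


reached = traverse(docs, "item_1", max_depth=4)
# Keep only item records, applied here as a separate filtering step.
related_items = sorted(n for n in reached if docs[n]["type"] == "item")
```

With this wiring, the traversal passes through two actions and two users before landing on the three other items.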
WordNet as a Knowledge Graph
WordNet, maintained by Princeton University, provides a hierarchical model of the English language.
Words have relationships to each other such as:
Hypernym: a more general case of another word
Hyponym: a more specific case of another word
Jaguar is a type of Cat. Cat is a type of Animal.
Cat is a hypernym of Jaguar. Jaguar is a hyponym of cat.
Index WordNet entries with fields containing the links to the hypernyms and hyponyms!
WordNet Hypernym Traversal
+{!graph from="synset_id" to="hypernym_id" maxDepth=8}sense_lemma:jaguar
WordNet Graph Intersections
Is a jaguar a type of animal? If a graph intersection exists, the answer is yes!
Intersection of knowledge graph traversals can be used to answer questions!
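The jaguar question can be worked through on a tiny hand-made hypernym table. This is a sketch with invented data (only the jaguar, cat, animal chain comes from the slides; `car` and `vehicle` are added as a negative case), not real WordNet content or Solr calls.

```python
# Hypothetical hypernym links: each word points to its more general terms.
hypernyms = {
    "jaguar": ["cat"],
    "cat": ["animal"],
    "animal": [],
    "car": ["vehicle"],
    "vehicle": [],
}


def hypernym_closure(word, max_depth=8):
    """Collect every term reachable by following hypernym links up to max_depth hops."""
    visited = {word}
    frontier = {word}
    for _ in range(max_depth):
        frontier = {h for w in frontier for h in hypernyms.get(w, [])} - visited
        if not frontier:
            break
        visited |= frontier
    return visited


def is_a(word, category):
    """True if the traversal from word intersects the category term."""
    return bool(hypernym_closure(word) & {category})


jaguar_is_animal = is_a("jaguar", "animal")
jaguar_is_vehicle = is_a("jaguar", "vehicle")
```

A non-empty intersection between the traversal result and the category is what answers the question with "yes".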
Wikipedia
Pages have links! Lots of links.
Pages have Infoboxes that contain great metadata.
Infobox types like: person, scientist, writer, artist, etc.
What if you're looking for all Wikipedia pages about people?
Infobox facets
The infobox tags are more specific than the user's search/request.
Searching for People should include Scientists, Authors, and Artists!
Wikipedia doesn't know a Scientist is a person, but WordNet does!
WordNet knows a scientist is a person!
Wikipedia pages linked to Graph Theory
Information Overload! It's difficult to see the people in this sea of information!
Combine WordNet and Wikipedia With Graph Queries to find people!
Using WordNet we're able to disambiguate that the entity_types of scientist, person and philosopher are all types of people! Normal Faceting is not enough!
Nested and Filtered Graph Queries!
The Graph query can be nested. This allows you to traverse one set of fields, then change the fields you are traversing.
This example first traverses all WordNet documents that are a type of person, then, based on that result set, does a 1 hop traversal to Wikipedia data on the entity_type field to restrict the results.
{!graph from="entity_type" to="sense_lemma" maxDepth=1}{!graph from="sense_lemma" to="sense_hyponym_lemma" maxDepth=2}sense_lemma:person
Intersect that with pages that are related/linked to from the Wikipedia query of node_id:Graph theory
{!graph from=node_id to=edge_ids maxDepth=1}node_id:Graph theory
Additionally use returnRoot=false if you want to omit the WordNet docs from the result set!
Gather Nodes?
If you're interested in doing some distributed Graph traversal in Solr there are a few options.
You can use the Gather Nodes functionality in Streaming Aggregations. Not super fast, but it gets the job done!
Distributed Graph Traversal
Do you think you need to scale up? We have an implementation based on Kafka & Solr Cloud that uses Kafka to distribute the frontier query.
What next?
Edge weights, Relevancy, and Scoring
Based on tf/idf or bm25
Based on numerical field values (min/max/sum/avg weight application)?
Skip high frequency edges?
Min distance computation
Driving directions?
Better support for visualization libraries like D3.js!
Distributed Traversal via Kafka frontier query broker
Additional Detail
Related Solr Tickets
https://issues.apache.org/jira/browse/SOLR-7543
https://issues.apache.org/jira/browse/SOLR-8632
https://issues.apache.org/jira/browse/SOLR-8176
Questions?
Kevin Watters, KMW Technology, kwatters@kmwllc.com
Actions occur over time
These events can't easily be aggregated or flattened onto a record.
Model this as a person record, with a set of action records.
Each action record has the id of the previous action.
Search for an action, graph traverse based on person id to another action, then finally to the person record.
OpenCV, Video Recognition
Imagine indexing each frame of video from security cameras.
Pass each frame of video through OpenCV for object recognition & face recognition.
Each frame has a frame number of