Solr Graph Query: Presented by Kevin Watters, KMW Technology

  • Date posted: 16-Apr-2017

  • Category: Technology

Transcript of Solr Graph Query: Presented by Kevin Watters, KMW Technology

  • OCTOBER 11-14, 2016 BOSTON, MA

  • Solr Graph Query

    Kevin Watters

    Founder, KMW Technology

  • Solr 6.0 Graph Query Overview

    Kevin Watters, KMW Technology, kwatters@kmwllc.com, www.kmwllc.com

    October 14, 2016

  • KMW Technology Overview

    Boston-based software consulting and professional services organization.

    Founded in 2010.

    Developers & consultants with deep industry experience.

    Boutique firm specializing in Open Source, Search, Big Data, Machine Learning, and AI.

    Custom Connectors, Pipelines, Classifiers, Search, UI/UX development.

    Data and Information Architecture.

  • What is a Graph?

    One data model to rule them all! A generic representation of all linked data models. G = ?!?!

    A graph is made up of nodes and edges.

    Nodes/Vertices ( node_id ) have metadata and links to other nodes.

    Edges/Links ( edge_ids ) are associated with a node and point to other nodes.

    Nodes can be modeled as documents in the index with a multi-valued field containing the edges.

    For other use cases, edges can also be modeled as documents.

  • Graph Traversal

    There are many graph traversal / exploration algorithms: DFS, BFS, A*, Alpha-beta, etc.

    Solr Graph Query implements BFS (Breadth-First Search): each hop expands the frontier of the graph.

    It explores all current edges in a single step/query!
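
    The frontier expansion described above can be sketched in a few lines of Python. This is a toy in-memory model, not Solr's implementation: the DOCS list, its field values, and the bfs helper are all hypothetical, and the sketch assumes each hop collects the edge_ids of the current frontier and matches them against node_id.

```python
from typing import Dict, List, Set

# Hypothetical toy index: each document has a node_id and a multi-valued edge_ids field.
DOCS: List[Dict] = [
    {"node_id": "A", "edge_ids": ["B", "C"]},
    {"node_id": "B", "edge_ids": ["D"]},
    {"node_id": "C", "edge_ids": ["D"]},
    {"node_id": "D", "edge_ids": ["A"]},  # cycle back to the root
    {"node_id": "E", "edge_ids": []},     # not reachable from A
]


def bfs(root_ids: Set[str], max_depth: int = -1) -> Dict[str, int]:
    """Return {node_id: depth} for every node reached; max_depth=-1 means unlimited."""
    by_id = {d["node_id"]: d for d in DOCS}
    depth_of = {n: 0 for n in root_ids if n in by_id}
    frontier = set(depth_of)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        depth += 1
        # One hop: gather every edge leaving the current frontier in a single step.
        edges = {e for n in frontier for e in by_id[n]["edge_ids"]}
        # Nodes already collected are skipped, which is what stops cycles.
        frontier = {e for e in edges if e in by_id and e not in depth_of}
        for n in frontier:
            depth_of[n] = depth
    return depth_of


print(bfs({"A"}))
```

    Each pass of the loop corresponds to one hop, so the number of passes is bounded by the depth of the graph rather than by the number of nodes.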

  • Graph Query Parser Syntax

    Parameter | Default | Description
    from | | Field containing the node id.
    to | | Field containing the edge id(s).
    maxDepth | -1 | The number of hops to traverse from the root of the graph. -1 means traverse until all edges and documents have been collected. maxDepth=1 behaves similarly to a JOIN.
    traversalFilter | null | Arbitrary query string to apply at each hop of the traversal.
    returnRoot | true | true or false: whether the documents matching the root query should be returned.
    returnOnlyLeaf | false | true or false: whether to return only documents in the result set that do not have a value in the to field.
    useAutn | false | Whether to use an Automaton query (true) or a TermsQuery (false) for edge traversal.

    Uses Solr's query parser plugin and local params syntax: {!graph from=node_id to=edge_ids}query
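
    As an illustration of the local params syntax, a small helper can assemble the query string from a dict of parameters. The graph_query function and its quoting rule are assumptions made for this sketch, not part of Solr or of any client library; only the {!graph ...}query shape comes from the slide.

```python
from typing import Dict, Union

ParamValue = Union[str, int, bool]


def graph_query(root_query: str, local_params: Dict[str, ParamValue]) -> str:
    """Build '{!graph key=value ...}root_query' from a dict of local params."""
    parts = ["!graph"]
    for key, value in local_params.items():
        if isinstance(value, bool):
            text = "true" if value else "false"  # lowercase booleans, as on the slide
        else:
            text = str(value)
        if any(ch.isspace() for ch in text):
            # Assumption: values containing whitespace are wrapped in double quotes.
            text = '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return "{" + " ".join(parts) + "}" + root_query


print(graph_query("id:user_1", {"from": "node_id", "to": "edge_ids", "maxDepth": 2}))
# {!graph from=node_id to=edge_ids maxDepth=2}id:user_1
```

    A dict is used instead of keyword arguments because from is a reserved word in Python.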

  • Key Features and Design Goals

    "Graph is a Filter on top of your data" -someone

    Designed for large scale, a large number of edges, and very deep traversals.

    Limited memory usage for traversal.

    Cycle detection for free (based on current bit set!)

    Highly cacheable via the FilterCache!

    Supports multiValued fields for nodes and/or edges.

    Supports arbitrary query filters during the exploration with the Traversal Filter.

    Follow Every Edge! No edge left behind! Traversal is complete!

    Works with Facets, Facet Queries, and other search components seamlessly.

  • Memory Usage

    One bit set to rule them all (for the result set).

    The BitSet provides cycle detection for free. (Have I been here before?)

    The BitSet has one bit per document in the index! A 100 million doc index only uses about 12 MB RAM per query! (Same size as 1 filter cache entry!)

    Additional BitSets: root nodes (only if returnRoot = false) and leaf nodes (same for all graph queries).
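
    The 12 MB figure follows from one bit per document. A quick check of the arithmetic, where bitset_bytes is a hypothetical helper and any per-object overhead of a real bit set implementation is ignored:

```python
def bitset_bytes(num_docs: int) -> int:
    """Bytes needed for one bit per document, rounded up to whole bytes."""
    return (num_docs + 7) // 8


size = bitset_bytes(100_000_000)
print(size)              # 12500000 bytes
print(size / 1_000_000)  # 12.5 (decimal megabytes)
```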

  • Performance Considerations

    Use DocValues, they're SO MUCH FASTER!

    Don't tokenize your node/edge ids! (unless that's what you want)

    Performance is a function of the number of unique edges that are traversed, not the number of nodes.

    Limit depth if you know how far to go in the traversal.

  • Graph Query For Security

    Graph queries are elegant and simple to use for traversing security hierarchies such as:

    LDAP and AD.

    Custom security models that are hierarchical or folder based in nature.

    Supports Users being members of Groups that can be members of other Groups.

    Adding or removing a user/group means updating just 1 document, not re-indexing large portions of your index!

  • Example Company with Security Model

  • Document Security Model within the Solr Index

  • Graph Traversal for User 1

  • Graph Traversal for User 2

  • Graph Based Security Query

    Single security query to traverse the graph: {!graph from=node_id to=edge_ids returnOnlyLeaf=true}id:user_1

    Security query is applied as a filter to the query request to ensure the security filter is cached!
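
    One way to apply the security query as a filter is to pass it in Solr's fq parameter next to the user's own q. The sketch below only builds the request parameters; secured_params is a hypothetical helper, and the user id is assumed to contain no characters that need escaping in Solr query syntax.

```python
from typing import Dict
from urllib.parse import urlencode


def secured_params(user_query: str, user_id: str) -> Dict[str, str]:
    """Pair the user's query with the graph-based security filter from the slide."""
    # Assumption: user_id needs no query-syntax escaping.
    security_fq = "{!graph from=node_id to=edge_ids returnOnlyLeaf=true}id:" + user_id
    return {"q": user_query, "fq": security_fq}


params = secured_params("quarterly report", "user_1")
print(urlencode(params))  # URL-encoded query string for a select request
```

    Keeping the security clause out of q means the same filter can be reused across different searches by the same user.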

  • Distributed & Solr Cloud

    You can distribute the user/group records to all shards in the index with smart routing!

    Distribute only the documents across the shards.

    A fixed number of permissions on each shard, combined with distributed documents, keeps graph traversals local for the best performance!

  • Users, Actions and Items

    Model your browsing/purchase history as:

    Users (have an ID)

    Items (have an ID, metadata, category, etc.)

    Actions (link between Users and Items, such as rating, purchase, like/dislike)

  • Find similar users

    Graph traversal from a user (or set of users) through their actions to items they like, to find similar users, and out to items they like.

    Now, exclude the original starting set: returnRoot=false

  • Users who buy X also buy Y

    [Figure: traversal from Item 1 (root) through Action/Buy records (depth=1) to User 1 and User 2 (depth=2), through further Action/Buy records (depth=3), out to Item 2, Item 3, and Item 4 (depth=4).]

    4 hops in the graph from an Item gets you to related items; omit the starting point and only return records that are items:

    {!graph from=node_id to=edge_id maxDepth=4 returnRoot=false}id:Item_1 AND type:item
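
    The four-hop walk from an item to related items can be tried on a toy data set. Everything below is hypothetical (the documents, the buy_N ids, and the helper names), and it assumes items and users list their action ids while each action lists both of its endpoints.

```python
from typing import Dict, List, Set

# Hypothetical index: items and users link to actions, actions link to both endpoints.
DOCS: List[Dict] = [
    {"node_id": "Item_1", "type": "item", "edge_ids": ["buy_1", "buy_2"]},
    {"node_id": "Item_2", "type": "item", "edge_ids": ["buy_3"]},
    {"node_id": "Item_3", "type": "item", "edge_ids": ["buy_4"]},
    {"node_id": "Item_9", "type": "item", "edge_ids": ["buy_9"]},
    {"node_id": "User_1", "type": "user", "edge_ids": ["buy_1", "buy_3"]},
    {"node_id": "User_2", "type": "user", "edge_ids": ["buy_2", "buy_4"]},
    {"node_id": "User_9", "type": "user", "edge_ids": ["buy_9"]},
    {"node_id": "buy_1", "type": "action", "edge_ids": ["Item_1", "User_1"]},
    {"node_id": "buy_2", "type": "action", "edge_ids": ["Item_1", "User_2"]},
    {"node_id": "buy_3", "type": "action", "edge_ids": ["Item_2", "User_1"]},
    {"node_id": "buy_4", "type": "action", "edge_ids": ["Item_3", "User_2"]},
    {"node_id": "buy_9", "type": "action", "edge_ids": ["Item_9", "User_9"]},
]
BY_ID = {d["node_id"]: d for d in DOCS}


def traverse(root: str, max_depth: int) -> Set[str]:
    """Breadth-first walk of at most max_depth hops; returns every node reached."""
    seen = {root}
    frontier = {root}
    for _ in range(max_depth):
        edges = {e for n in frontier for e in BY_ID[n]["edge_ids"]}
        frontier = {e for e in edges if e in BY_ID} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def related_items(item_id: str) -> Set[str]:
    reached = traverse(item_id, max_depth=4) - {item_id}  # mirrors returnRoot=false
    return {n for n in reached if BY_ID[n]["type"] == "item"}  # keep only items


print(sorted(related_items("Item_1")))  # ['Item_2', 'Item_3']
```

    Item_9 is never returned for Item_1 because no user who bought Item_1 also bought it.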

  • WordNet as a Knowledge Graph

    WordNet, maintained by Princeton University, provides a hierarchical model of the English language.

    Words have relationships to each other, such as:

    Hypernym: a more general case of another word.

    Hyponym: a more specific case of another word.

    Jaguar is a type of Cat. Cat is a type of Animal.

    Cat is a hypernym of Jaguar. Jaguar is a hyponym of Cat.

    Index WordNet entries with fields containing the links to the hypernyms and hyponyms!

  • WordNet Hypernym Traversal

    +{!graph from="synset_id" to="hypernym_id" maxDepth=8}sense_lemma:jaguar

  • WordNet Graph Intersections

    Is a jaguar a type of animal? If a graph intersection exists, the answer is yes!

    Intersection of knowledge graph traversals can be used to answer questions!
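
    The intersection idea can be shown with a tiny in-memory hypernym map. Only the jaguar, cat, animal chain comes from the slides; the sedan and car entries, the HYPERNYMS table, and both helper functions are hypothetical and stand in for real WordNet data.

```python
from typing import Dict, List, Set

# Hypothetical hypernym links; only jaguar -> cat -> animal comes from the slides.
HYPERNYMS: Dict[str, List[str]] = {
    "jaguar": ["cat"],
    "cat": ["animal"],
    "animal": [],
    "sedan": ["car"],
    "car": [],
}


def hypernym_closure(word: str, max_depth: int = 8) -> Set[str]:
    """Follow hypernym links upward for at most max_depth hops."""
    seen = {word}
    frontier = {word}
    for _ in range(max_depth):
        frontier = {h for w in frontier for h in HYPERNYMS.get(w, [])} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def is_a(word: str, category: str) -> bool:
    """True when the traversal from word intersects the target node set."""
    return bool(hypernym_closure(word) & {category})


print(is_a("jaguar", "animal"))  # True
print(is_a("sedan", "animal"))   # False
```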

  • Wikipedia

    Pages have links! Lots of links.

    Pages have Infoboxes that contain great metadata.

    Infobox types like: person, scientist, writer, artist, etc.

    What if you're looking for all Wikipedia pages about people?

  • Infobox facets

    The infobox tags are more specific than the user's search/request.

    Searching for People should include Scientists, Authors, and Artists!

    Wikipedia doesn't know a Scientist is a person, but WordNet does!

  • WordNet knows a scientist is a person!

  • Wikipedia pages linked to Graph Theory

    Information Overload! It's difficult to see the people in this sea of information!

  • Combine WordNet and Wikipedia With Graph Queries to find people!

    Using WordNet we're able to disambiguate that the entity_types of scientist, person and philosopher are all types of people! Normal Faceting is not enough!

  • Nested and Filtered Graph Queries!

    The Graph query can be nested. This allows you to traverse one set of fields, then change the fields you are traversing.

    This example first traverses all WordNet documents that are a type of person, then, based on that result set, does a 1-hop traversal to Wikipedia data on the entity_type field to restrict the results:

    {!graph from="entity_type" to="sense_lemma" maxDepth=1}{!graph from="sense_lemma" to="sense_hyponym_lemma" maxDepth=2}sense_lemma:person

    Intersect that with pages that are related/linked to from the Wikipedia query of node_id:"Graph theory":

    {!graph from=node_id to=edge_ids maxDepth=1}node_id:"Graph theory"

    Additionally, use returnRoot=false if you want to omit the WordNet docs from the result set!
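
    The nesting can be composed programmatically by feeding one graph query in as the root query of another. In this sketch the graph helper is hypothetical, quoting the phrase "Graph theory" is an assumption, and expressing the intersection as two fq filters is one possible choice rather than the approach shown in the talk.

```python
from typing import Dict, List, Union


def graph(inner: str, from_field: str, to_field: str, max_depth: int) -> str:
    """Wrap an inner query (possibly another graph query) in a graph traversal."""
    return "{!graph from=%s to=%s maxDepth=%d}%s" % (from_field, to_field, max_depth, inner)


# WordNet traversal: everything that is a type of person, up to 2 hops down.
people = graph("sense_lemma:person", "sense_lemma", "sense_hyponym_lemma", 2)
# Nested 1-hop traversal from those WordNet docs onto Wikipedia entity_type values.
people_pages = graph(people, "entity_type", "sense_lemma", 1)
# Pages linked from the Graph theory page.
linked = graph('node_id:"Graph theory"', "node_id", "edge_ids", 1)

# Assumption: the two result sets are intersected by sending both as filter queries.
params: Dict[str, Union[str, List[str]]] = {"q": "*:*", "fq": [people_pages, linked]}
print(params["fq"])
```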

  • Gather Nodes?

    If you're interested in doing some distributed Graph traversal in Solr, there are a few options.

    You can use the Gather Nodes functionality in Streaming Aggregations. Not super fast, but it gets the job done!

  • Distributed Graph Traversal

    Do you think you need to scale up? We have an implementation based on Kafka & Solr Cloud that uses Kafka to distribute the frontier query.

  • What next?

    Edge weights, Relevancy, and Scoring: based on tf/idf or bm25? Based on numerical field values (min/max/sum/avg weight application)? Skip high frequency edges?

    Min distance computation. Driving directions?

    Better support for visualization libraries like D3.js!

    Distributed Traversal via Kafka frontier query broker.

  • Additional Detail

    Related Solr Tickets:

    https://issues.apache.org/jira/browse/SOLR-7543

    https://issues.apache.org/jira/browse/SOLR-8632

    https://issues.apache.org/jira/browse/SOLR-8176

    Questions? Kevin Watters, KMW Technology, kwatters@kmwllc.com

  • Actions occur over time

    These events can't easily be aggregated or flattened onto a record.

    Model this as a person record, with a set of action records. Each action record has the id of the previous action.

    Search for an action, graph traverse based on person id to another action, then finally to the person record.

  • OpenCV, Video Recognition

    Imagine indexing each frame of video from security cameras.

    Pass each frame of video through OpenCV for object recognition & face recognition.

    Each frame has a frame number of