Taxonomical Semantical Magical Search - Doug Turnbull, OpenSource Connections

Taxonomical Semantical Magical Search
OpenSource Connections
Doug Turnbull, Relevance Lead
[email protected]
@softwaredoug
© OpenSource Connections, 2017


Who are we?

Solr/ES consulting: team 100% focused on relevance

Learn to rank – semantic search – relevance – personalization – findability

Reflect: What problem are you trying to solve when you jump to 'semantic search'?

"We studied spontaneous word choice for objects in five application-related domains, and found the variability to be surprisingly large. In every case two people favored the same term with probability <0.20."

"Simulations show how this fundamental property of language limits the success of various design methodologies for vocabulary-driven interaction."

Solve with keyword stuffing?

- Content creators guarantee every "shoe" has a "shoe" keyword somewhere!

- And every wing-tip mentions dress shoes…

- ...Ad infinitum…


Solve with tagging?

- Java is a type of JVM language. Should this be tagged JVM too? What is a "query string"? Which of these tags is useful for search?

- Who tags everything? Is it consistent? What are the rules?

(taken from Stack Overflow)

Solve with synonyms?

Yes! Synonyms can be a tool that can help us. But it's easy to mess up:

shoes => dress shoes
wing tips,shoes
tennis shoes,shoes

When I search for tennis shoes, why do I get wing tips; why do I get dresses?!?
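To see how the three rules above can misfire, here is a toy sketch of word-by-word synonym expansion. It is a simplification for illustration, not Solr's synonym filter: it assumes the query is split on whitespace before analysis, so each word is expanded on its own and multi-word phrases are never matched as a unit. The helper names `parse_rules` and `expand_per_word` are made up for this example.

```python
def parse_rules(lines):
    """Parse Solr-style synonym lines into {source phrase: set of target phrases}."""
    table = {}
    for line in lines:
        if "=>" in line:
            # Explicit mapping: left-hand phrases are replaced by right-hand phrases.
            lhs, rhs = line.split("=>", 1)
            sources = [p.strip() for p in lhs.split(",")]
            targets = [p.strip() for p in rhs.split(",")]
        else:
            # Equivalence list: every phrase expands to every phrase in the list.
            sources = targets = [p.strip() for p in line.split(",")]
        for source in sources:
            table.setdefault(source, set()).update(targets)
    return table


def expand_per_word(query, table):
    """Expand each whitespace-separated word on its own (ASSUMPTION: no phrase matching)."""
    tokens = set()
    for word in query.lower().split():
        # Words without a rule pass through unchanged.
        for phrase in table.get(word, {word}):
            tokens.update(phrase.split())
    return tokens


RULES = ["shoes => dress shoes", "wing tips,shoes", "tennis shoes,shoes"]
print(sorted(expand_per_word("tennis shoes", parse_rules(RULES))))
# ['dress', 'shoes', 'tennis', 'tips', 'wing']
```

The word "shoes" alone pulls in "dress", "wing", and "tips", which is one way a search for tennis shoes ends up matching dresses and wing tips.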


Talking teaches/reminds vocab (Searching)

shoes -> dress shoes -> brown wing tips

Searcher uncertain: uses broad queries to experiment

Searcher learning: results give clues that help the shopper refine further

Searcher trusting: more confident about which terms to use

Searchers get more specific...

Hierarchy of Ideas:

shoes

    NP (item): "shoe"

(More specific)

wing tips

    NP (item): "wing tips"
    type_of: "dress shoes"
    type_of: "shoe"

… and try types of modifiers

wing tips

    NP (item): "wing tips"
    type_of: "dress shoes"
    type_of: "shoe"

sapphire wing tips

    NP (item): "wing tips"
    type_of: "dress shoes"
    type_of: "shoe"

    ADJ (color): "sapphire"
    type_of: "blue"

Semantic search: enable semantic exploration

Low term specificity: search term specifies a wide category to explore

Searching for "shoes"

High term specificity: search term too specific, try semantically broader/similar items

"Show 'dress shoes' for 'oxfords'"

Make Solr grok type-of relationships

"wing tip" is a type of "dress shoe" is a type of "shoe"

Search here, only show wing tips

Search here, show all things that are a type-of shoe

Beyond the actual terms used in docs


Per-entity terms: a taxonomy

Shoes
    Athletic Shoes
        Running Shoes
        Tennis Shoes
    Dress Shoes
        High Heels
        Oxfords
        Wing Tips

Blue
    Sapphire
    Sky blue

A search taxonomy (not the taxonomy for your site nav)

Index-time tax. expansion

Item

Color

Size

Substrings -> Entities

Expand to broad/narrow

tennis shoes => footwear\shoes\athletic\tennis_shoes

sapphire => blue\sapphire
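A minimal sketch of the "expand to broad/narrow" step, assuming the taxonomy is stored as child-to-parent links. The `PARENT` table and both function names are hypothetical. The prefix list is similar in spirit to what a path-hierarchy tokenizer emits, so a document tagged with the narrow concept also matches every broader one.

```python
# Hypothetical child -> parent links for a small search taxonomy.
PARENT = {
    "tennis_shoes": "athletic",
    "athletic": "shoes",
    "shoes": "footwear",
    "sapphire": "blue",
}


def taxonomy_path(entity, parent=PARENT):
    """Walk up the parent links and return the root-to-entity path.

    Assumes the links contain no cycles.
    """
    path = [entity]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return "\\".join(reversed(path))


def path_prefixes(path):
    """Return every broader prefix of a path, broadest first."""
    parts = path.split("\\")
    return ["\\".join(parts[:i]) for i in range(1, len(parts) + 1)]


print(taxonomy_path("tennis_shoes"))  # footwear\shoes\athletic\tennis_shoes
print(taxonomy_path("sapphire"))      # blue\sapphire
print(path_prefixes(taxonomy_path("sapphire")))
```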


In Solr...

Item

Color

Size

Possible to build from simple keepwords

Query- or index-time synonyms use the TF*IDF of the concept

Substrings -> Entities

Expand to broad/narrow

tennis shoes => tennis_shoes,athletic_shoes,shoes,...

sapphire => sapphire,blue


In Solr, index time...

"Item" copy field:

(Input Text) You will love these maroon dress shoes
(tokenization & maybe stemming) [you] [will] [love] [these] [maroon] [dress] [shoes]
compound/decompound (syn filter) [you] [will] [love] [these] [maroon] [dress_shoes]
Keepwords for entity [dress_shoes]
Semantic expansion (syn filter) [dress_shoes] [shoes]

"Color" copy field:

(Input Text) You will love these maroon dress shoes
(tokenization & maybe stemming) [you] [will] [love] [these] [maroon] [dress] [shoes]
compound/decompound (syn filter) [you] [will] [love] [these] [maroon] [dress_shoes]
Keepwords for entity [maroon]
Semantic expansion (syn filter) [maroon] [brown]
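The index-time chain (tokenize, join compounds, keep entity words, expand to broader concepts) can be simulated in plain Python. This is only a sketch of the idea, not a Solr analyzer: the lookup tables and function names below are hypothetical stand-ins for the synonym and keepword files each field would use.

```python
import re

# Hypothetical vocabularies standing in for synonym / keepword files.
COMPOUNDS = {("dress", "shoes"): "dress_shoes", ("wing", "tips"): "wing_tips"}
ITEM_KEEP = {"dress_shoes", "wing_tips", "shoes"}
COLOR_KEEP = {"maroon", "brown"}
ITEM_BROADER = {"dress_shoes": ["shoes"], "wing_tips": ["dress_shoes", "shoes"]}
COLOR_BROADER = {"maroon": ["brown"]}


def tokenize(text):
    """Lowercase and split on non-alphanumerics (no stemming in this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def join_compounds(tokens):
    """Merge known two-word phrases into a single token, left to right."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            out.append(COMPOUNDS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def analyze(text, keepwords, broader):
    """Keep only this field's entities, then add their broader concepts."""
    out = []
    for token in join_compounds(tokenize(text)):
        if token in keepwords:
            out.append(token)
            out.extend(broader.get(token, []))
    return out


text = "You will love these maroon dress shoes"
print(analyze(text, ITEM_KEEP, ITEM_BROADER))    # ['dress_shoes', 'shoes']
print(analyze(text, COLOR_KEEP, COLOR_BROADER))  # ['maroon', 'brown']
```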

Index time solution

(Input Text) brown wing tips
(Item analyzer output) [wing_tips] [dress_shoes] [shoes]

(Input Text) brown wing tips
(Color analyzer output) [brown]

Matches maroon, because at index time: maroon => brown, maroon

IDF: highest for wing_tips, lowest for shoes (eliminate TF? norms?)

q=brown wing tips&defType=edismax&sow=false&qf=item^100 color^10

(you'll want to search more than these semantic fields)

Query-time tax. expansion

How do users think of your items?

Item

Color

Size

Trained/built From Query logs

Substrings -> Entities

Expand to broad/narrow

Example query: sapphire tennis shoes

tennis shoes => item:"tennis shoes" OR item:"athletic shoes" OR item:"shoes" ...

sapphire => color:blue OR color:sapphire
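A minimal sketch of query-time expansion: find known entities as substrings of the query and rewrite each one into an OR group over its field. The `BROADER` table is hypothetical (in practice it might be built from query logs), and joining the groups with AND is an assumption made here; the slide does not say how the groups combine.

```python
# Hypothetical per-field entities with their broader concepts.
BROADER = {
    "item": {"tennis shoes": ["athletic shoes", "shoes"]},
    "color": {"sapphire": ["blue"]},
}


def clause(field, term):
    """Quote multi-word terms so they are searched as phrases."""
    return f'{field}:"{term}"' if " " in term else f"{field}:{term}"


def expand_query(query):
    """Rewrite recognized entities into fielded OR groups."""
    remaining = query.lower()
    groups = []
    for field, entities in BROADER.items():
        # Try longer entities first so "tennis shoes" wins over a shorter overlap.
        for entity in sorted(entities, key=len, reverse=True):
            if entity in remaining:  # naive substring match
                terms = [entity] + entities[entity]
                groups.append("(" + " OR ".join(clause(field, t) for t in terms) + ")")
                remaining = remaining.replace(entity, " ")
    # ASSUMPTION: groups are ANDed together.
    return " AND ".join(groups)


print(expand_query("sapphire tennis shoes"))
# (item:"tennis shoes" OR item:"athletic shoes" OR item:shoes) AND (color:sapphire OR color:blue)
```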


Query Phrase In Solr...

ItemSemanticAnalyzer:

(Input Text) Brown wing tips
Semantic expansion (syn filter) [wing tips] [dress shoes] [shoes]

ColorSemanticAnalyzer:

(Input Text) Brown wing tips
Semantic expansion (syn filter) [brown] [maroon]

Transform to description("dress shoes" OR "wing tips" OR shoes OR maroon OR brown)

Problems:
- Two query analyzers for the same field are not possible in Solr
- Can't re-tokenize [dress shoes] -> "dress shoes" phrase query

Match Query Parser
https://github.com/o19s/match-query-parser

q=brown wing tips&defType=edismax&qf=description title

&bq={!match analyze_as=item_tax search_with=phrase qf=description v=$q}^100

&bq={!match analyze_as=color_tax search_with=phrase qf=description v=$q}

analyze_as: how to analyze the query string

search_with=phrase: retokenize multi-word tokens and do a phrase search

Other building blocks

Auto Phrase Token Filter / Query Auto Filtering:
- https://github.com/lucidworks/auto-phrase-tokenfilter
- https://lucidworks.com/2015/02/17/introducing-query-autofiltering/

Health-on-net Lucene Synonyms:
- https://github.com/healthonnet/hon-lucene-synonyms

Sematext Query Segmenter:
- https://github.com/sematext/query-segmenter

Shopping 24 Bmax Query Parser:
- https://github.com/shopping24/solr-bmax-queryparser

Deriving Querqy rules from taxonomies

https://github.com/renekrie/querqy


Query Time vs Index Time

Query Time:

PROS
- No need to reindex when updating managed vocab

CONS
- Relevance scoring of terms (boosts help)
- Complex / slow queries

Index Time:

PROS
- TF*IDF more accurate scoring (broad concepts score low, narrow score high)
- Faster queries

CONS
- Reindexing for synonym changes

Structure your docs for query understanding

Relevance engineer's challenge:

- Where can we begin with a taxonomy?
    - Reuse filters & facets
    - Reuse your page's navigational taxonomy?
    - Track which searches land on pages (old school click tracking)?
    - Zero results tracking?

- How do we incentivize content creators to move away from keyword stuffing and toward organizing content around a search keyword taxonomy?

- Finally: we don't care about the source data model, only what helps users find things


SHReC Algorithm

Simple doc frequency in-content to look for super-concepts / sub-concepts

term/phrase x subsumes y (x parent concept?) when:

df(x) > df(y)

df(x ∧ y) / df(y) >= α (α = 1 complete subsumption)
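The two conditions above translate directly into a few lines of Python. This is a sketch over an in-memory corpus where each document is represented as a set of phrases; the function name `subsumes`, the default α of 0.8, and the sample documents are illustrative choices, not part of the talk.

```python
def subsumes(x, y, docs, alpha=0.8):
    """Return True when x looks like a parent concept of y.

    docs is a list of sets of phrases; df counts the documents containing a phrase.
    """
    docs_x = {i for i, doc in enumerate(docs) if x in doc}
    docs_y = {i for i, doc in enumerate(docs) if y in doc}
    if not docs_y:
        return False  # y never occurs, nothing to subsume
    broader = len(docs_x) > len(docs_y)                 # df(x) > df(y)
    overlap = len(docs_x & docs_y) / len(docs_y)        # df(x AND y) / df(y)
    return broader and overlap >= alpha


# Made-up corpus: every "wing tips" doc also mentions "shoes".
docs = [
    {"shoes", "wing tips"},
    {"shoes", "wing tips"},
    {"shoes", "wing tips"},
    {"shoes", "tennis shoes"},
    {"shoes"},
]
print(subsumes("shoes", "wing tips", docs))  # True
print(subsumes("wing tips", "shoes", docs))  # False
```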


SHReC Algorithm Example

[Diagram: Shoes, Wing Tips]

df("shoes") > df("wing tips")

df("shoes" ∧ "wing tips") / df("wing tips") >= 0.8


SHReC Algorithm with Solr

[Diagram: Shoes, Wing Tips]

df("shoes") > df("wing tips")

df("shoes" ∧ "wing tips") / df("wing tips") >= 0.8

Cache doc freq (q=*:*&facet.field=item&facet=true)

q=item:"wing tips" AND item:shoes, num results
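A sketch of how the two Solr requests could feed the subsumption check. The facet request returns counts as a flat `[term, count, term, count, ...]` list in the JSON response, and the co-occurrence count is the `numFound` of the AND query. The sample responses below are made up, no request is actually sent, and the helper names are hypothetical.

```python
def facet_counts(solr_json, field):
    """Turn Solr's flat [term, count, term, count, ...] facet list into a dict."""
    flat = solr_json["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def is_parent(df_x, df_y, df_both, alpha=0.8):
    """Subsumption test on raw counts: df(x) > df(y) and df(x AND y)/df(y) >= alpha."""
    return df_y > 0 and df_x > df_y and df_both / df_y >= alpha


# Made-up response to q=*:*&facet=true&facet.field=item (cache this one).
facet_response = {
    "facet_counts": {"facet_fields": {"item": ["shoes", 120, "wing tips", 10]}}
}
# Made-up response to q=item:"wing tips" AND item:shoes
cooccurrence_response = {"response": {"numFound": 9, "start": 0, "docs": []}}

df = facet_counts(facet_response, "item")
both = cooccurrence_response["response"]["numFound"]
print(is_parent(df["shoes"], df["wing tips"], both))  # True
```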


Unfortunately reality is messy

Your data probably looks like:

[Diagram: Shoes, Wing Tips]

Idea: mine other corpus?

[Diagram: Shoes, Wing Tips]

● but still, what phrases do you test?

Statistically sig. collocations

[Diagram: Wing Tips; Wing, Tips]

Student's t-test against the null hypothesis that "wing" and "tips" are unrelated
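One common way to run this test on a corpus is the textbook collocation t-score: compare the observed bigram rate with the rate expected if the two words occurred independently. The sketch below assumes you already have unigram and bigram counts; it uses the usual approximation that the sample variance is roughly the observed rate. The function name and all counts are made up for illustration.

```python
import math


def collocation_t_score(count_w1, count_w2, count_bigram, total_tokens):
    """t-score for the bigram "w1 w2" against the null hypothesis of independence."""
    if count_bigram <= 0:
        raise ValueError("count_bigram must be positive")
    observed = count_bigram / total_tokens
    # Under independence, P(w1 w2) = P(w1) * P(w2).
    expected = (count_w1 / total_tokens) * (count_w2 / total_tokens)
    # Bernoulli variance p(1 - p) is approximately p when p is small.
    variance = observed
    return (observed - expected) / math.sqrt(variance / total_tokens)


# Made-up counts: "wing" 100 times, "tips" 200 times, "wing tips" 50 times.
t = collocation_t_score(100, 200, 50, 100_000)
# 2.576 is the critical value at the 0.005 level for large samples.
print(t > 2.576)
```

A score well above the critical value suggests "wing tips" should be treated as a single phrase rather than two unrelated words.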


Refinements

shoe
    dress shoe (12%)
    wing tip (23%)
    tennis shoe (11%)
    brown dress shoe (20%)
    blue dress shoe (1%)
    sapphire brooks brothers dress shoe (0.001%)

Colors scattered throughout

Sub concepts, likely child phrases

Siblings refine each other:

    tennis shoe (11%)
    running shoe (34%)

Should these be in supercategory "athletic shoes"?

Refinement mining in Solr

docs = [{
    "query": "shoe",
    "refinement": "dress shoe"
}, {
    "query": "shoe",
    "refinement": "brown shoe"
}, {
    "query": "tie",
    "refinement": "brown tie"
}]

q=query:shoe&facet=true&facet.field=refinement

Refinements:
- dress shoe (4)
- tennis shoe (2)
- ...
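The facet query above is essentially a group-and-count over the refinement log. A minimal in-memory equivalent, assuming each log record has the same `query` and `refinement` keys as the docs on this slide (the function name and sample log are illustrative):

```python
from collections import Counter


def refinement_counts(log, query):
    """Count refinements that followed a given query, most frequent first."""
    counts = Counter(rec["refinement"] for rec in log if rec["query"] == query)
    return counts.most_common()


# Made-up refinement log.
log = [
    {"query": "shoe", "refinement": "dress shoe"},
    {"query": "shoe", "refinement": "dress shoe"},
    {"query": "shoe", "refinement": "brown shoe"},
    {"query": "tie", "refinement": "brown tie"},
]
print(refinement_counts(log, "shoe"))  # [('dress shoe', 2), ('brown shoe', 1)]
```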


SHReC w/ Refinements

q=query:shoe&facet=true&facet.field=refinement

Num results for q=shoe

(Slow, but you do this rarely)

Seed the corpus exploration SHReC


SHReC w/ sig terms

What's actually happening in SHReC is significance scoring, which is baked into Solr:

scoreNodes(
    select(
        facet(collectionName,
              q="query:shoes",
              buckets="refinements",
              bucketSorts="count(*) desc",
              bucketSizeLimit="100",
              count(*)),
        refine_graph as node,
        "count(*)",
        replace(collection, null, withValue=collectionName),
        replace(field, null, withValue=refine_graph)))

Relationship of local vs global
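One simple way to express the local-versus-global relationship is a ratio of shares: how common a refinement is after a given query compared with how common it is across all queries. This is an illustrative measure only, not the formula Solr uses inside scoreNodes, and the function name and counts below are made up.

```python
def lift(local_count, local_total, global_count, global_total):
    """Share of a term in the foreground set divided by its share overall."""
    local_share = local_count / local_total
    global_share = global_count / global_total
    return local_share / global_share


# "dress shoe": 40 of 100 refinements after "shoe", 200 of 10,000 overall.
print(lift(40, 100, 200, 10_000))
# "brown": 20 of 100 after "shoe", but 1,800 of 10,000 overall.
print(lift(20, 100, 1_800, 10_000))
```

A refinement that is frequent everywhere, such as a color, scores close to 1, while one that is concentrated after a single query scores much higher and is a better candidate for a child concept.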


Other ways of measuring term stat. significance

● Trey G. Solr knowledge graph (hope you saw his talk)!
https://lucidworks.com/video/leveraging-lucenesolr-as-a-knowledge-graph-and-intent-engine/

● Mark Harwood Elastic Graph / Sig Terms
https://www.elastic.co/elasticon/conf/2016/sf/graph-capabilities-in-the-elastic-stack

But word2vec, LDA, etc.

- Focused on content, not users: they discover topics/synonyms in content, but we often need mappings from search-query vernacular to content vernacular

- Traditional topic modeling is flat

- Hierarchies extracted from content don't reflect users' hierarchies & how they map to content

- Don't confuse co-occurrences with synonyms without extensive data modeling/munging to get your content here

Questions?

Further Reading:
- Relevant Search!
- Blog articles:
    - Building Entity-focused search w/ Keyphrases:
        - http://opensourceconnections.com/blog/2016/12/02/solr-elasticsearch-synonyms-better-patterns-keyphrases/
    - Synonym best practices:
        - http://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
    - Match Query Parser:
        - http://opensourceconnections.com/blog/2017/01/23/our-solution-to-solr-multiterm-synonyms/

Discount code: relsearch
http://manning.com

Shameless plug
- <shoutout BLOOOMBERG!!>
- We built a learning to rank plugin for that other search engine...