Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Transcript of Intent Algorithms: The Data Science of Smart Information Retrieval Systems
![Page 1: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/1.jpg)
Intent Algorithms: The data science of smart information retrieval systems
Trey Grainger, SVP of Engineering, Lucidworks
Southern Data Science Conference, 2017.04.07
![Page 2: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/2.jpg)
About Me
Trey Grainger, SVP of Engineering
• Previously Director of Engineering @ CareerBuilder
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Information Retrieval & Web Search - Stanford University
Other fun projects:
• Co-author of Solr in Action, plus numerous research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
![Page 3: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/3.jpg)
Agenda
• Introduction
- Apache Solr
- Lucidworks / Fusion
• Search Engine Fundamentals
- Keyword Search
- Relevancy Ranking
- Domain-specific Relevancy
- Crafting Relevancy Functions
• Reflected Intelligence
- Signals (Demo)
- Recommendations (Demo)
- Learning to Rank (Demo)
• Semantic Search
- RDF / SPARQL
- Entity Extraction
- Query Parsing
- Semantic Knowledge Graph
Southern Data Science
![Page 4: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/4.jpg)
Dimensions of User Intent
[Diagram labels: Traditional Keyword Search, Recommendations, Semantic Search, Personalized Search, Augmented Search, Domain-aware Matching, User Intent]
![Page 5: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/5.jpg)
what do you do?
![Page 6: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/6.jpg)
![Page 7: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/7.jpg)
Search-Driven Everything
Customer Service
Customer Insights
Fraud Surveillance
Research Portal
Online Retail
Digital Content
![Page 8: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/8.jpg)
Apache Solr
![Page 9: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/9.jpg)
“Solr is the popular, blazing-fast,
open source enterprise search
platform built on Apache Lucene™.”
![Page 10: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/10.jpg)
Key Solr Features:
● Multilingual Keyword search
● Relevancy Ranking of results
● Faceting & Analytics (nested / relational)
● Highlighting
● Spelling Correction
● Autocomplete/Type-ahead Prediction
● Sorting, Grouping, Deduplication
● Distributed, Fault-tolerant, Scalable
● Geospatial search
● Complex Function queries
● Recommendations (More Like This)
● Graph Queries and Traversals
● SQL Query Support
● Streaming Aggregations
● Batch and Streaming processing
● Highly Configurable / Plugins
● Learning to Rank
● Building machine-learning models
● … many more
*source: Solr in Action, chapter 2
![Page 11: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/11.jpg)
The standard for enterprise search.
90% of Fortune 500 uses Solr.
![Page 12: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/12.jpg)
Lucidworks Fusion
![Page 13: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/13.jpg)
Typical Search Architecture Evolution
[Architecture diagram. Recoverable labels:]
• Optional: Spark/Hadoop (Worker, Worker, Cluster Manager)
• Solr (Shards, Shards)
• HDFS
• Zookeeper (ZK 1 … ZK N): Shared Config Management, Leader Election, Load Balancing
• Nutch/Heritrix
• Log Proc.
• Mahout (Recommender)
• ManifoldCF* (Connectors)
• Security (Roll your own)
• SiLK
• Scheduling (cron?)
• Admin UI
• Deployment (Roll your own)
• Monitoring (Roll your own)
• Relevance Tools (Roll your own)
• NLP tools
*only 12 connectors available, compared w/ 60+ in Fusion
Tika ships w/ Solr, but can’t be scaled independently
![Page 14: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/14.jpg)
Fusion Simplifies the Deployment
[Architecture diagram. Recoverable labels:]
• Security built-in
• Apache Solr (Shards, Shards)
• Apache Zookeeper (ZK 1): Leader Election, Load Balancing, Shared Config Management
• Apache Spark (Worker, Worker, Cluster Manager)
• REST API
• Admin UI
• Lucidworks View
• Logs, File, Web, Database, Cloud
• Core Services: ETL and Query Pipelines, Recommenders/Signals, NLP, Machine Learning, Alerting and Messaging, Security, Connectors, Scheduling, …
• HDFS (Optional)
![Page 15: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/15.jpg)
Lucidworks Fusion
![Page 16: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/16.jpg)
Fusion powers search for the brightest companies in the world.
![Page 17: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/17.jpg)
search & relevancy
![Page 18: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/18.jpg)
Intent Algorithm Spectrum
Basic Keyword Search (inverted index, tf-idf, bm25, multilingual text analysis, query formulation, etc.)
Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)
Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)
Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)
Self-learning
![Page 19: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/19.jpg)
Basic Keyword Search
The beginning of a typical search journey
![Page 20: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/20.jpg)
The inverted index

What you SEND to Lucene/Solr:

| Document | Content Field |
|---|---|
| doc1 | once upon a time, in a land far, far away |
| doc2 | the cow jumped over the moon. |
| doc3 | the quick brown fox jumped over the lazy dog. |
| doc4 | the cat in the hat |
| doc5 | The brown cow said “moo” once. |
| … | … |

How the content is INDEXED into Lucene/Solr (conceptually):

| Term | Documents |
|---|---|
| a | doc1 [2x] |
| brown | doc3 [1x], doc5 [1x] |
| cat | doc4 [1x] |
| cow | doc2 [1x], doc5 [1x] |
| … | … |
| once | doc1 [1x], doc5 [1x] |
| over | doc2 [1x], doc3 [1x] |
| the | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x] |
| … | … |
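As a rough illustration of the two views on this slide, the sketch below builds a term-to-documents mapping with occurrence counts from the five example documents. The `tokenize` and `build_inverted_index` helpers are hypothetical names for this example, and the lowercase-and-split tokenization is an assumption; it is not how Lucene stores its index internally.

```python
import re
from collections import Counter, defaultdict

# The example documents from the slide (what you SEND).
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": "The brown cow said “moo” once.",
}

def tokenize(text):
    """Assumed analysis: lowercase, then keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(documents):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

index = build_inverted_index(docs)
print(index["the"])  # postings for "the", with per-document counts
```

Looking up a term is then a single dictionary access, which is the reason the index is "inverted": the lookup key is the term, not the document.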
![Page 21: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/21.jpg)
Matching queries to documents

/solr/select/?q=apache solr

| Term | Documents |
|---|---|
| … | … |
| apache | doc1, doc3, doc4, doc5 |
| … | … |
| hadoop | doc2, doc4, doc6 |
| … | … |
| solr | doc1, doc3, doc4, doc7, doc8 |
| … | … |

[Venn diagram: apache only: doc5; both apache and solr: doc1, doc3, doc4; solr only: doc7, doc8]
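The Venn diagram on this slide amounts to set operations over postings lists. A minimal sketch, using the postings shown on the slide; the `match` helper is hypothetical, and whether a real query for `apache solr` behaves as OR or AND depends on how the query parser is configured.

```python
# Postings from the slide: term -> set of matching document ids.
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}

def match(terms, operator="OR"):
    """Combine the postings of each term with a boolean operator."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)

both = match(["apache", "solr"], "AND")    # the overlap in the diagram
either = match(["apache", "solr"], "OR")   # everything inside either circle
print(sorted(both), sorted(either))
```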
![Page 22: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/22.jpg)
Text Analysis
Generating terms to index from raw text
![Page 23: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/23.jpg)
Text Analysis in Solr
A text field in Lucene/Solr has an Analyzer containing:
① Zero or more CharFilters: take incoming text and “clean it up” before it is tokenized
② One Tokenizer: splits incoming text into a Token Stream containing zero or more Tokens
③ Zero or more TokenFilters: examine and optionally modify each Token in the Token Stream
*From Solr in Action, Chapter 6
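The three-stage chain can be sketched in plain Python. This is only a conceptual model: the function names below are hypothetical stand-ins, not Lucene or Solr classes, and the stopword list is made up for the example.

```python
import re

def html_strip_char_filter(text):
    """CharFilter stage: clean raw text before tokenizing (here, drop tags)."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    """Tokenizer stage: split the cleaned text into a token stream."""
    return text.split()

def lowercase_filter(tokens):
    """TokenFilter stage: modify each token."""
    return [token.lower() for token in tokens]

def stop_filter(tokens, stopwords=frozenset({"a", "the", "in"})):
    """TokenFilter stage: remove tokens (assumed stopword list)."""
    return [token for token in tokens if token not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<b>The</b> Cat in the Hat",
    [html_strip_char_filter],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(tokens)
```

The order of token filters matters: lowercasing before the stop filter is what lets "The" match the lowercase stopword list.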
![Page 24: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/24.jpg)
![Page 25: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/25.jpg)
![Page 26: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/26.jpg)
![Page 27: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/27.jpg)
Per-language Analysis Chains
*Some of the 32 different language configurations in Appendix B of Solr in Action
![Page 28: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/28.jpg)
![Page 29: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/29.jpg)
![Page 30: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/30.jpg)
Relevancy Ranking
Scoring the results, returning the best matches
![Page 31: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/31.jpg)
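Classic Lucene/Solr Relevancy Algorithm:

$$\mathrm{Score}(q, d) = \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big) \cdot \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q)$$

Where: t = term; d = document; q = query; f = field

$$\mathrm{tf}(t \in d) = \sqrt{\mathrm{numTermOccurrencesInDocument}}$$

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$

$$\mathrm{coord}(q, d) = \frac{\mathrm{numTermsInDocumentFromQuery}}{\mathrm{numTermsInQuery}}$$

$$\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}}$$

$$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$

$$\mathrm{norm}(t, d) = d.\mathrm{getBoost}() \cdot \mathrm{lengthNorm}(f) \cdot f.\mathrm{getBoost}()$$

*Source: Solr in Action, chapter 3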
![Page 32: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/32.jpg)
![Page 33: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/33.jpg)
TF * IDF
• Term Frequency: “How well does a term describe a document?”
– Measure: how often a term occurs per document
• Inverse Document Frequency: “How important is a term overall?”
– Measure: how rare the term is across all documents
*Source: Solr in Action, chapter 3
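A small worked example may help here. It uses the tf and idf definitions given with the classic relevancy formula in this deck (square root of the term count, and 1 + log(numDocs / (docFreq + 1))). The natural logarithm and the function names are assumptions of this sketch, and the full classic score also squares idf and applies boosts and norms, which are left out.

```python
import math

def tf(term_count):
    """Term frequency: square root of the occurrences in the document."""
    return math.sqrt(term_count)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms get a larger value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def tf_idf(term_count, num_docs, doc_freq):
    return tf(term_count) * idf(num_docs, doc_freq)

# In the five-document inverted index example, "the" appears in 4 documents
# and "cat" in 1, so "cat" carries more weight per occurrence.
common = tf_idf(term_count=2, num_docs=5, doc_freq=4)  # "the" in doc4
rare = tf_idf(term_count=1, num_docs=5, doc_freq=1)    # "cat" in doc4
print(common, rare)
```

Even though "the" occurs twice in doc4 and "cat" only once, the rare term ends up with the higher weight.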
![Page 34: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/34.jpg)
BM25 (Okapi “Best Match” 25th Iteration)

$$\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k + 1)}{\mathrm{tf}(t \in d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

Where: t = term; d = document; q = query; i = index

$$\mathrm{tf}(t \in d) = \sqrt{\mathrm{numTermOccurrencesInDocument}}$$

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$

$$|d| = \sum_{t \in d} 1$$

$$\mathrm{avgdl} = \frac{\sum_{d \in i} |d|}{\sum_{d \in i} 1}$$

k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
b = Free parameter. Usually ~0.75. Increases impact of document normalization.
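The BM25 formula on this slide can be sketched directly in Python. This follows the slide's definitions of tf and idf as written; production implementations may define these terms differently, and the toy corpus, the natural logarithm, and the `bm25_score` name are assumptions of this example.

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k=1.2, b=0.75):
    """Score one tokenized document against a query, per the slide's formula."""
    num_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / num_docs  # average document length
    score = 0.0
    for term in query_terms:
        count = doc_tokens.count(term)
        if count == 0:
            continue  # a term absent from the document contributes nothing
        doc_freq = sum(1 for doc in corpus if term in doc)
        tf = math.sqrt(count)
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))
        length_norm = 1 - b + b * len(doc_tokens) / avgdl
        score += idf * (tf * (k + 1)) / (tf + k * length_norm)
    return score

# Hypothetical pre-tokenized corpus.
corpus = [
    "the cow jumped over the moon".split(),
    "the quick brown fox jumped over the lazy dog".split(),
    "the brown cow said moo once".split(),
]
scores = [bm25_score(["brown", "cow"], doc, corpus) for doc in corpus]
print(scores)
```

The third document matches both query terms and scores highest; of the two single-match documents, the shorter one wins because of the length normalization controlled by b.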
![Page 35: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/35.jpg)
That’s great, but what about domain-specific knowledge?
News search: popularity and freshness drive relevance
Restaurant search: geographical proximity and price range are critical
Ecommerce: likelihood of a purchase is key
Movie search: more popular titles are generally more relevant
Job search: category of job, salary range, and geographical proximity matter
TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!
![Page 36: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/36.jpg)
Domain-specific relevancy calculation (News Website Example)
News website:
/select?
fq=$myQuery&
q=_query_:"{!func}scale(query($myQuery),0,100)"
AND _query_:"{!func}div(100,map(geodist(),0,1,1))"
AND _query_:"{!func}recip(rord(publicationDate),0,100,100)"
AND _query_:"{!func}scale(popularity,0,100)"&
myQuery="street festival"&
sfield=location&
pt=33.748,-84.391
(Each of the four function clauses in q is labeled 25% on the slide.)
*Example from chapter 16 of Solr in Action
![Page 37: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/37.jpg)
Fancy boosting functions (Restaurant Search Example)
Distance (50%) + keywords (30%) + category (20%)
q=_val_:"scale(mul(query($keywords),1),0,30)" AND
_val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,50)" AND
_val_:"scale(mul(query($category),1),0,20)"
&keywords=filet mignon
&radiusInKm=48.28
&distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"
&category="fine dining"
&fq={!cache=false v=$keywords}
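The weighting idea behind this query (distance 50%, keywords 30%, category 20%) can be illustrated outside Solr. The sketch below min-max scales each signal across the candidate set into its weight range and sums the parts, which is roughly what the `scale(...)` calls are doing. The restaurant data, field names, and the `rank_restaurants` helper are all hypothetical.

```python
def scale(values, low, high):
    """Min-max scale values into [low, high] across the candidate set."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [float(low)] * len(values)
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

def rank_restaurants(candidates, radius_km):
    keywords = scale([c["keyword_score"] for c in candidates], 0, 30)
    # Closer is better, so score radius minus distance (as in the query above).
    closeness = scale([radius_km - c["distance_km"] for c in candidates], 0, 50)
    category = scale([c["category_score"] for c in candidates], 0, 20)
    totals = [k + d + g for k, d, g in zip(keywords, closeness, category)]
    ranked = sorted(zip(candidates, totals), key=lambda pair: pair[1], reverse=True)
    return [(c["name"], round(total, 2)) for c, total in ranked]

# Made-up candidates that already matched the keyword filter.
candidates = [
    {"name": "A", "keyword_score": 2.0, "distance_km": 1.0, "category_score": 1.0},
    {"name": "B", "keyword_score": 4.0, "distance_km": 40.0, "category_score": 1.0},
    {"name": "C", "keyword_score": 0.0, "distance_km": 20.0, "category_score": 0.0},
]
ranked = rank_restaurants(candidates, radius_km=48.28)
print(ranked)
```

Restaurant B has the best keyword match but is far away, so the nearby restaurant A outranks it once distance carries half of the total weight.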
![Page 38: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/38.jpg)
This is powerful, but feels like
a lot of work to get right…
![Page 39: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/39.jpg)
what is “reflected intelligence”?
![Page 40: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/40.jpg)
The Three C’s
Content: Keywords and other features in your documents
Collaboration: How others have chosen to interact with your system
Context: Available information about your users and their intent
Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”
![Page 41: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/41.jpg)
Examples of Reflected Intelligence
● Recommendation Algorithms
● Building user profiles from past searches, clicks, and other actions
● Identifying correlations between keywords/phrases
● Building out automatically-generated ontologies from content and queries
● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs
● Learning to Rank - using relevancy judgements and machine learning to train a relevance model
● Discovering misspellings, synonyms, acronyms, and related keywords
● Disambiguation of keyword phrases with multiple meanings
● Learning what’s important in your content
![Page 42: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/42.jpg)
Consider what you know about users
John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.
Irene is a bartender in Dublin and is only interested in jobs within 10 km of her location in the food service industry.
Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.
Jane is a nurse educator in Boston seeking between $40K and $60K.
*Example from chapter 16 of Solr in Action
![Page 43: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/43.jpg)
Query for Jane
Jane is a nurse educator in Boston seeking between $40K and $60K
http://localhost:8983/solr/jobs/select/?
fl=jobtitle,city,state,salary&
q=(
jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
)
AND (
(city:"Boston" AND state:"MA")^15
OR state:"MA")
AND _val_:"map(salary, 40000, 60000,10, 0)"
*Example from chapter 16 of Solr in Action
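The salary clause in Jane's query relies on Solr's five-argument `map` function. As I understand it, `map(x, min, max, target, default)` yields `target` when x falls within [min, max] and `default` otherwise, so jobs inside Jane's range get a flat boost of 10. A plain-Python sketch of that behaviour, with `solr_map` as a hypothetical stand-in and sample salaries in and out of the range:

```python
def solr_map(x, min_value, max_value, target, default):
    """Return target if min_value <= x <= max_value, else default."""
    return target if min_value <= x <= max_value else default

# (jobtitle, city, salary) sample records for illustration.
jobs = [
    ("Clinical Educator (New England/ Boston)", "Boston", 41503),
    ("Nurse Educator", "Braintree", 56183),
    ("Nurse Educator", "Brighton", 71359),
]
boosts = [solr_map(salary, 40000, 60000, 10, 0) for _, _, salary in jobs]
print(boosts)
```

A job outside the range gets no salary boost, but it can still match the query through the title and location clauses.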
![Page 44: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/44.jpg)
Search Results for Jane
{ ...
"response":{"numFound":22,"start":0,"docs":[
{"jobtitle":" Clinical Educator (New England/ Boston)",
"city":"Boston",
"state":"MA",
"salary":41503},
{"jobtitle":"Nurse Educator",
"city":"Braintree",
"state":"MA",
"salary":56183},
{"jobtitle":"Nurse Educator",
"city":"Brighton",
"state":"MA",
"salary":71359},
…]}}
*Example documents available @ http://github.com/treygrainger/solr-in-action
![Page 45: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/45.jpg)
You just built a
recommendation engine!
![Page 46: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/46.jpg)
Collaborative Filtering

What you SEND to Lucene/Solr:

| Document | “Users who bought this product” field |
|---|---|
| doc1 | user1, user4, user5 |
| doc2 | user2, user3 |
| doc3 | user4 |
| doc4 | user4, user5 |
| doc5 | user4, user1 |
| … | … |

How the content is INDEXED into Lucene/Solr (conceptually):

| Term | Documents |
|---|---|
| user1 | doc1, doc5 |
| user2 | doc2 |
| user3 | doc2 |
| user4 | doc1, doc3, doc4, doc5 |
| user5 | doc1, doc4 |
| … | … |
![Page 47: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/47.jpg)
Step 1: Find similar users who like the same documents

q=documentid: ("doc1" OR "doc4")

| Document | “Users who bought this product” field |
|---|---|
| doc1 | user1, user4, user5 |
| doc2 | user2, user3 |
| doc3 | user4 |
| doc4 | user4, user5 |
| doc5 | user4, user1 |
| … | … |

[Diagram: doc1 matches user1, user4, user5; doc4 matches user4, user5]

Top-scoring results (most similar users):
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)
*Source: Solr in Action, chapter 16
![Page 48: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/48.jpg)
Step 2: Search for docs “liked” by those similar users

/solr/select/?q=userlikes:("user4"^2 OR "user5"^2 OR "user1"^1)

Most similar users:
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)

| Term | Documents |
|---|---|
| user1 | doc1, doc5 |
| user2 | doc2 |
| user3 | doc2 |
| user4 | doc1, doc3, doc4, doc5 |
| user5 | doc1, doc4 |
| … | … |

Top recommended documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)
// doc2 does not match
*Source: Solr in Action, chapter 16
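The two steps can be traced end to end in a few lines of Python. This is a simplified sketch, not Solr's scoring: the weight of each similar user is just the number of shared likes (mirroring the ^2 and ^1 boosts in the query), and the helper names are hypothetical.

```python
from collections import Counter

# Document -> users who liked it (the field sent to the index).
likes = {
    "doc1": {"user1", "user4", "user5"},
    "doc2": {"user2", "user3"},
    "doc3": {"user4"},
    "doc4": {"user4", "user5"},
    "doc5": {"user4", "user1"},
}

def similar_users(liked_docs):
    """Step 1: count how many of the given docs each user also liked."""
    counts = Counter()
    for doc in liked_docs:
        counts.update(likes.get(doc, ()))
    return counts

def recommend(liked_docs):
    """Step 2: score every doc by the summed weights of similar users who liked it."""
    weights = similar_users(liked_docs)
    scores = Counter()
    for doc, users in likes.items():
        for user in users:
            if user in weights:
                scores[doc] += weights[user]
    return [doc for doc, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

users = similar_users(["doc1", "doc4"])
recommended = recommend(["doc1", "doc4"])
print(users, recommended)
```

As on the slide, the seed documents (doc1, doc4) appear in the output; a real recommender would probably filter out items the user already has.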
![Page 49: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/49.jpg)
Using matrix factorization is typically more efficient (Ships with Fusion 3.1):
![Page 50: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/50.jpg)
Feedback Loops
[Cycle diagram: User searches → User sees results → User takes an action → Users’ actions inform system improvements]
![Page 51: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/51.jpg)
Demo:
Signals & Recommendations
![Page 52: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/52.jpg)
![Page 53: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/53.jpg)
![Page 54: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/54.jpg)
• 200%+ increase in click-through rates
• 91% lower TCO
• 50,000 fewer support tickets
• Increased customer satisfaction
![Page 55: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/55.jpg)
Learning to Rank
![Page 56: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/56.jpg)
Learning to Rank (LTR)
● It applies machine learning techniques to discover the combination of features that provides the best ranking.
● It requires a labeled set of documents with relevancy scores for a given set of queries.
● Features used for ranking are usually more computationally expensive than the ones used for matching.
● It typically re-ranks a subset of the matched documents (e.g. top 1000).
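The last bullet, re-ranking only the top of the result list, can be sketched as follows. The linear model with hand-set weights is a stand-in for a trained LTR model, and the feature names, weights, and `rerank` helper are all hypothetical.

```python
def rerank(results, weights, rerank_docs):
    """Re-order only the first `rerank_docs` results using a model score.

    results: list of (doc_id, features) in first-pass (e.g. keyword) order.
    weights: feature name -> weight, standing in for a learned model.
    """
    head, tail = results[:rerank_docs], results[rerank_docs:]

    def model_score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())

    head = sorted(head, key=lambda result: model_score(result[1]), reverse=True)
    return [doc_id for doc_id, _ in head + tail]

# First-pass results, already ordered by a cheap matching score.
results = [
    ("d1", {"bm25": 3.0, "popularity": 0.1}),
    ("d2", {"bm25": 2.5, "popularity": 0.9}),
    ("d3", {"bm25": 2.0, "popularity": 1.0}),
]
weights = {"bm25": 1.0, "popularity": 2.0}
top2 = rerank(results, weights, rerank_docs=2)
top3 = rerank(results, weights, rerank_docs=3)
print(top2, top3)
```

With `rerank_docs=2`, d3 keeps its original position even though the model would score it above d1; that is the trade-off of limiting the more expensive model to a subset of the matches.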
![Page 57: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/57.jpg)
![Page 58: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/58.jpg)
Common LTR Algorithms
• RankNet* (neural networks, boosted trees)
• LambdaMART* (regression trees)
• SVM Rank** (SVM classifier)
* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf
![Page 59: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/59.jpg)
Demo: Learning to Rank
![Page 60: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/60.jpg)
#1: Pull, Build, Start Solr
git clone https://github.com/apache/lucene-solr.git && cd lucene-solr/solr
ant server
bin/solr -e techproducts -Dsolr.ltr.enabled=true
#2: Run Searches
http://localhost:8983/solr/techproducts/browse?q=ipod
#3: Supply User Relevancy Judgements
cd contrib/ltr/example/
nano user_queries.txt
#4: Install Training Library
curl -L https://github.com/cjlin1/liblinear/archive/v210.zip > liblinear-2.1.tar.gz
tar -xf liblinear-2.1.tar.gz && mv liblinear-210 liblinear
cd liblinear && make && cd ../
#5: Train and Upload Model
./train_and_upload_demo_model.py -c config.json
#6: Re-run Searches using Machine-learned Ranking Model
http://localhost:8983/solr/techproducts/browse?q=ipod
&rq={!ltr model=exampleModel reRankDocs=25 efi.user_query=$q}
![Page 61: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/61.jpg)
# Run Searches
http://localhost:8983/solr/techproducts/select?q=ipod
![Page 62: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/62.jpg)
# Supply User Relevancy Judgements
nano contrib/ltr/example/user_queries.txt
#Format: query | doc id | relevancy judgement | source
# Train and Upload Model
./train_and_upload_demo_model.py -c config.json
![Page 63: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/63.jpg)
# Re-run Searches using Machine-learned Ranking Model
http://localhost:8984/solr/techproducts/browse?q=ipod
&rq={!ltr model=exampleModel reRankDocs=100 efi.user_query=$q}
![Page 64: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/64.jpg)
semantic search
![Page 65: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/65.jpg)
Building a Taxonomy of Entities
Many ways to generate this:
• Statistical Analysis of interesting phrases
- Word2Vec / GloVe / Dice Conceptual Search
• Topic Modelling
• Clustering of documents / phrases
• Buy a dictionary (often doesn’t work for domain-specific search problems)
• Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain*
* K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
![Page 66: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/66.jpg)
![Page 67: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/67.jpg)
![Page 68: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/68.jpg)
![Page 69: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/69.jpg)
entity extraction
![Page 70: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/70.jpg)
![Page 71: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/71.jpg)
semantic query parsing
![Page 72: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/72.jpg)
![Page 73: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/73.jpg)
Probabilistic Query Parser
Goal: given a query, predict which combinations of keywords should be treated as phrases
Example:
senior java developer hadoop
Possible Parsings:
senior, java, developer, hadoop
"senior java", developer, hadoop
"senior java developer", hadoop
"senior java developer hadoop"
"senior java", "developer hadoop"
senior, "java developer", hadoop
senior, java, "developer hadoop"
Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
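The candidate parsings for a query can be enumerated mechanically. The following is a minimal sketch, not the parser from the talk: it only generates the contiguous phrase groupings and leaves out the probabilistic step that chooses among them. The function name `phrase_segmentations` is made up for illustration.

```python
from itertools import product

def phrase_segmentations(query):
    """Yield every way to split the query into contiguous phrases."""
    words = query.split()
    if not words:
        return
    # Each gap between adjacent words is either a phrase boundary or not,
    # so a query of n words has 2 ** (n - 1) candidate parsings.
    for breaks in product([True, False], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(" ".join(current))
        yield phrases

parsings = list(phrase_segmentations("senior java developer hadoop"))
for parsing in parsings:
    print(parsing)
```

For the four-word example this yields eight groupings: the seven listed on the slide plus senior, "java developer hadoop".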
![Page 74: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/74.jpg)
Semantic Query Parsing
Identification of phrases in queries using two steps:
1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.*
2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model)
*K. AlJadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
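The two steps can be sketched in a few lines of Python. This is an illustration under loud assumptions, not the SolrTextTagger or the parser from the cited paper. `KNOWN_PHRASES` and `CORPUS` are made-up toy data, and the bigram ratio in `bigram_score` is a simple stand-in for a real language model.

```python
from collections import Counter

# Hypothetical dictionary of known phrases (step 1).
KNOWN_PHRASES = {"java developer", "registered nurse"}

# Hypothetical toy corpus standing in for language-model statistics (step 2).
CORPUS = [
    "senior java developer with machine learning experience",
    "machine learning engineer",
    "java developer hadoop",
    "machine learning and hadoop",
]

unigrams, bigrams = Counter(), Counter()
for doc in CORPUS:
    tokens = doc.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_score(w1, w2):
    """Share of the rarer word's occurrences that fall inside the bigram."""
    rarer = min(unigrams[w1], unigrams[w2])
    return bigrams[(w1, w2)] / rarer if rarer else 0.0

def tag_known_phrases(words, max_len=3):
    """Step 1: greedy longest-match dictionary lookup; returns (text, is_known) pairs."""
    tagged, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in KNOWN_PHRASES:
                tagged.append((candidate, True))
                i += n
                break
        else:
            # No dictionary phrase starts here; keep the single word untagged.
            tagged.append((words[i], False))
            i += 1
    return tagged

def parse_query(query, threshold=0.8):
    tagged = tag_known_phrases(query.lower().split())
    phrases, i = [], 0
    while i < len(tagged):
        text, known = tagged[i]
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        # Step 2: join two adjacent untagged words when the corpus statistics support it.
        if not known and nxt is not None and not nxt[1] and bigram_score(text, nxt[0]) >= threshold:
            phrases.append(text + " " + nxt[0])
            i += 2
        else:
            phrases.append(text)
            i += 1
    return phrases

print(parse_query("senior java developer machine learning hadoop"))
```

Here "java developer" comes from the dictionary, while "machine learning" is absent from the dictionary and is picked up only by the corpus statistics.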
![Page 75: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/75.jpg)
query augmentation
![Page 76: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/76.jpg)
![Page 77: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/77.jpg)
![Page 78: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/78.jpg)
Documents

| id | job_title | desc | skills |
|---|---|---|---|
| 1 | Software Engineer | software engineer at a great company | .Net, C#, java |
| 2 | Registered Nurse | a registered nurse at hospital doing hard work | oncology, phlebotomy |
| 3 | Java Developer | a software engineer or a java engineer doing work | java, scala, hibernate |

Terms-Docs Inverted Index

| field | term | doc | pos |
|---|---|---|---|
| desc | a | 1 | 4 |
| | | 2 | 1 |
| | | 3 | 1, 5 |
| | at | 1 | 3 |
| | | 2 | 4 |
| | company | 1 | 6 |
| | doing | 2 | 6 |
| | | 3 | 8 |
| | engineer | 1 | 2 |
| | | 3 | 3, 7 |
| | great | 1 | 5 |
| | hard | 2 | 7 |
| | hospital | 2 | 5 |
| | java | 3 | 6 |
| | nurse | 2 | 3 |
| | or | 3 | 4 |
| | registered | 2 | 2 |
| | software | 1 | 1 |
| | | 3 | 2 |
| | work | 2 | 8 |
| | | 3 | 9 |
| job_title | java developer | 3 | 1 |
| … | … | … | … |

Docs-Terms Forward Index

| field | doc | terms |
|---|---|---|
| desc | 1 | a, at, company, engineer, great, software |
| | 2 | a, at, doing, hard, hospital, nurse, registered, work |
| | 3 | a, doing, engineer, java, or, software, work |
| job_title | 1 | Software Engineer |
| … | … | … |
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
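Both indexes on this slide can be derived from the three example documents. The sketch below is a rough illustration rather than how Lucene/Solr stores them. It assumes whitespace tokenization with lowercasing for desc and treats each job_title as a single lowercased term; the variable names are made up for the example.

```python
from collections import defaultdict

docs = {
    1: {"job_title": "Software Engineer",
        "desc": "software engineer at a great company"},
    2: {"job_title": "Registered Nurse",
        "desc": "a registered nurse at hospital doing hard work"},
    3: {"job_title": "Java Developer",
        "desc": "a software engineer or a java engineer doing work"},
}

# inverted[field][term][doc_id] -> list of 1-based term positions
inverted = defaultdict(lambda: defaultdict(dict))
# forward[field][doc_id] -> sorted list of distinct terms
forward = defaultdict(dict)

for doc_id, fields in docs.items():
    # Assumption: desc is tokenized on whitespace and lowercased.
    positions = defaultdict(list)
    for pos, term in enumerate(fields["desc"].lower().split(), start=1):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        inverted["desc"][term][doc_id] = pos_list
    forward["desc"][doc_id] = sorted(positions)

    # Assumption: job_title is indexed as one untokenized, lowercased term.
    title = fields["job_title"].lower()
    inverted["job_title"][title][doc_id] = [1]
    forward["job_title"][doc_id] = [title]

print(inverted["desc"]["engineer"])
print(forward["desc"][3])
```

The inverted index answers "which documents contain this term?", and the forward index answers "which terms does this document contain?". The Semantic Knowledge Graph traversal relies on being able to move in both directions.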
![Page 79: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/79.jpg)
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
How the Graph Traversal Works
[Figure: the same data shown in three views. Set-theory View and Graph View: the nodes skill: Java, skill: Scala, skill: Hibernate, and skill: Oncology, connected through doc 1 to doc 6. Data Structure View: the labels Java, Scala, Hibernate, and Oncology alongside the document groups "docs 1, 2, 6", "docs 3, 4", and "doc 5".]
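One way to picture the traversal is as a hop from a term to its documents and then on to the other terms in those documents. The sketch below uses hypothetical skill assignments for six documents. They are assumed for illustration and are not read off the slide's figure, and `related_skills` is a made-up helper name.

```python
from collections import Counter

# Hypothetical skills field for six documents (assumed data, not the slide's).
doc_skills = {
    1: {"java"},
    2: {"java"},
    3: {"java", "scala", "hibernate"},
    4: {"java", "scala", "hibernate"},
    5: {"oncology"},
    6: {"java"},
}

# Inverted index: skill -> set of documents containing it.
postings = {}
for doc_id, skills in doc_skills.items():
    for skill in skills:
        postings.setdefault(skill, set()).add(doc_id)

def related_skills(skill):
    """Traverse skill -> documents -> co-occurring skills, counting shared documents."""
    shared = Counter()
    for doc_id in postings.get(skill, set()):       # terms -> docs (inverted index)
        for other in doc_skills[doc_id] - {skill}:  # docs -> terms (forward index)
            shared[other] += 1
    return dict(shared)

print(related_skills("java"))
print(postings["java"] & postings["scala"])
```

An edge between two skills exists exactly when their document sets intersect, so in this toy data java links to scala and hibernate but not to oncology.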
![Page 80: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/80.jpg)
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
Scoring nodes in the Graph
Foreground vs. Background Analysis
Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

$$z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot (1 - \mathrm{probBG}(x))}}$$

Foreground Query: "Hadoop"
{ "type":"keywords", "values":[
{ "value":"hive", "relatedness": 0.9765, "popularity":369 },
{ "value":"spark", "relatedness": 0.9634, "popularity":15653 },
{ "value":".net", "relatedness": 0.5417, "popularity":17683 },
{ "value":"bogus_word", "relatedness": 0.0, "popularity":0 },
{ "value":"teaching", "relatedness": -0.1510, "popularity":9923 },
{ "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] }
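The z formula translates directly into code. The sketch below covers only that formula, with made-up counts. The relatedness values in the example response all fall between -1 and 1, which suggests a further normalization of z that this slide does not spell out, so the sketch stops at z. The zero-variance guard is an added assumption.

```python
import math

def z_score(count_fg, total_docs_fg, prob_bg):
    """z-score of a term's foreground count against its background probability."""
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        # Assumption: a term with background probability 0 or 1 gets a neutral score.
        return 0.0
    return (count_fg - expected) / std_dev

# Hypothetical numbers: 100 foreground documents, a term whose background
# probability is 10%, appearing in 50 (over-represented) or 0 (under-represented).
print(z_score(count_fg=50, total_docs_fg=100, prob_bg=0.1))
print(z_score(count_fg=0, total_docs_fg=100, prob_bg=0.1))
```

A positive z means the term shows up in the foreground more often than its background rate predicts, and a negative z means it shows up less often.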
![Page 81: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/81.jpg)
Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
Knowledge Graph
Multi-level Graph Traversal with Scores
[Figure: a traversal from the starting node "software engineer*" (materialized node) along has_related_skill edges to skill nodes, along has_related_skill edges again to further skill nodes, and along has_related_job_title edges to job title nodes. Skill nodes shown: Java, C#, .NET, Hibernate, Scala, VB.NET. Job title nodes shown: .NET Developer, Java Developer, Software Engineer, Data Scientist. Each edge is labeled with a score; the scores shown range from 0.34 to 0.93.]
![Page 82: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/82.jpg)
Knowledge Graph
![Page 83: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/83.jpg)
Knowledge Graph
![Page 84: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/84.jpg)
![Page 85: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/85.jpg)
Knowledge Graph
Use Case: Summarizing Document Intent
Experiment: Pass in raw text (extracting phrases as needed), and rank the extracted phrases by their similarity to the documents using the Semantic Knowledge Graph (SKG).
Additionally, the graph can be traversed to “related” entities/keyword phrases NOT found in the original document.
Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring
![Page 86: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/86.jpg)
Intent Algorithm Spectrum
• Basic Keyword Search (inverted index, tf-idf, bm25, multilingual text analysis, query formulation, etc.)
• Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)
• Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)
• Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks)
• Self-learning
![Page 87: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/87.jpg)
Contact Info
Trey Grainger
[email protected]
@treygrainger
http://solrinaction.com
Meetup discount (39% off): 39grainger
Other presentations: http://www.treygrainger.com
![Page 88: Intent Algorithms: The Data Science of Smart Information Retrieval Systems](https://reader034.fdocuments.net/reader034/viewer/2022050614/5a6ed3e77f8b9a42298b586f/html5/thumbnails/88.jpg)
Additional References: