Dynamic Building of Domain Specific Lexicons Using Emergent Semantics
Dynamic Search Using Semantics & Statistics
Transcript of Dynamic Search Using Semantics & Statistics
![Page 1: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/1.jpg)
Text Mining - Bayesian Topic Modeling for Interactive Retrieval
at SAP and Cisco
Ram Akella, University of California and Stanford
With Karla Caballero, Maria Daltayanni, Chunye Wang (UCSC) and Paul Hofmann (SAP Labs)
October 6, 2011 SAP
![Page 2: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/2.jpg)
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo
![Page 3: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/3.jpg)
![Page 4: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/4.jpg)
Motivation
Search example: the same topic serves different information needs.
• Doctor: "Depression treatment of patients…"
• Social scientist: "Depression influence on family relationships…"
Successive queries in one session:
• q1: elderly depression
• q2: depression symptoms
• q3: symptoms and treatment
The user expects to find more relevant results each time she interacts with the system.
The relevance of the presented documents depends on the user context.
![Page 5: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/5.jpg)
Interactive Retrieval Model
[Figure: components of the model]
• Query
• User feedback
• Feedback and propagation to similar documents
• Information need update
• Document collection
• Metadata Generation System
• Interactive Retrieval System
![Page 6: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/6.jpg)
Metadata Generation System
Adds metadata to each document to facilitate the retrieval process. This metadata consists of:
1. Statistical topic mixture
2. Knowledge extraction based on the business process (problem, cause, solution)
![Page 7: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/7.jpg)
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
– Motivation
– Related Work
– Proposed Approach
– Topic Modeling and Entity Association
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo
![Page 8: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/8.jpg)
Topic Modeling: Motivation
• Given a set of documents, we want to identify the main areas or topics discussed, in an unsupervised manner. We take advantage of the semantic associations between words across the documents: if two words appear in the same document, they should be related.
• Each topic has its own distribution over words, and each document might contain material about a variety of topics.
[Figure: the word "play" can belong to Music or Sports. Example document mixture: Topic 1 (80%) Sports, Topic 2 (5%), Topic 3 (20%) Common Words. Example topic words: ball, net, racquet, game; notes, instrument.]
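The mixture idea on this slide can be made concrete with a small sketch. The word probabilities and topic weights below are invented for illustration (they are not taken from the talk), and `word_probability` is a hypothetical helper name. It shows only that the probability of a word in a document is the topic-weighted sum of per-topic word probabilities.

```python
# Toy illustration (all numbers are made up): a document's word distribution
# is a mixture of per-topic word distributions.

# Hypothetical per-topic word probabilities, phi[topic][word].
phi = {
    "sports": {"play": 0.2, "ball": 0.3, "net": 0.2, "racquet": 0.2, "game": 0.1},
    "music": {"play": 0.3, "notes": 0.4, "instrument": 0.3},
}


def word_probability(word, theta, phi):
    """p(word | doc) = sum over topics of theta[topic] * phi[topic][word]."""
    return sum(weight * phi[topic].get(word, 0.0) for topic, weight in theta.items())


# Hypothetical topic mixture for one document (weights sum to 1).
theta = {"sports": 0.8, "music": 0.2}

p_ball = word_probability("ball", theta, phi)  # only the sports topic contributes
p_play = word_probability("play", theta, phi)  # both topics contribute
```

An ambiguous word such as "play" receives probability from both topics, which is why the document's topic mixture matters when interpreting it.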
![Page 9: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/9.jpg)
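Related Work
Models compared: LDA [2003], Correlated Topics [2005], Pachinko Allocation Model [2006], and our model (GD-LDA).
Comparison criteria:
• Complexity based on the number of topics K (surviving table entries: K, 2K)
• Speed
• Scalable
• Handles topic correlations
• Effective topic selection and truncation
[Table: the per-model marks for each criterion did not survive in the transcript.]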
![Page 10: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/10.jpg)
Our Approach
– The higher probability mass is accommodated in the upper part of the tree (this facilitates truncation and a reduction in the number of topics).
– We can define a method to determine the number of topics suitable for a particular dataset without training the model several times (once for each specified number of topics).
[Figure: topic tree. Example topics: (bush, campaign, mccain, bradley, republican, candidate); (film, show, music, movie, story, play); (company, percent, stock, market, price, rate); (patient, disease, people, study, medic, health); (peace, talk, syrian, clinton, syria, golan). Node weights shown: 0.0096, 0.0146, 0.0310, 0.0660, 0.0851.]
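One way to picture truncation when the probability mass is concentrated in a few topics is a cumulative-mass cutoff. The sketch below is an assumption for illustration, not the authors' GD-LDA procedure: the function name `truncate_topics`, the 0.9 threshold, and the weights are all hypothetical.

```python
def truncate_topics(weights, target_mass=0.9):
    """Return how many leading topics are needed to cover target_mass.

    weights: topic weights sorted in decreasing order and summing to 1.
    """
    total = 0.0
    for count, weight in enumerate(weights, start=1):
        total += weight
        # Small tolerance guards against floating-point rounding in the sum.
        if total >= target_mass - 1e-12:
            return count
    return len(weights)


# Hypothetical weights from a model trained once with a generous topic budget.
weights = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]
kept = truncate_topics(weights, target_mass=0.9)
```

With these made-up weights the first four topics already cover 90% of the mass, so the remaining three could be dropped without retraining for each candidate number of topics.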
![Page 11: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/11.jpg)
Experimental Setup
The datasets are of two types:
• Scientific articles (NIPS)
– Longer documents
• News data (NYT, APW, XIE)
– Shorter documents
– More diverse vocabulary
• We compare the performance of the algorithm against three approaches in the literature: LDA, CTM and Pachinko.
• We test our model using Empirical Likelihood.
– This method estimates how likely it is that a test document will be generated from the estimated model.
– We want this value to be high (better generalization and applicability to unseen documents).

| Dataset | NIPS | NYT | APW | XIE |
| --- | --- | --- | --- | --- |
| # documents | 1840 | 5553 | 4954 | 5275 |
| # unique terms | 13649 | 11229 | 6955 | 3890 |
| Doc length | 1322 | 274 | 170 | 81 |
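The slides do not spell out how empirical likelihood is computed. A common Monte Carlo reading is to sample topic proportions from the model's prior, score the test document under each sample, and average. The sketch below assumes a plain Dirichlet prior (as in LDA) rather than the generalized Dirichlet of GD-LDA; `empirical_log_likelihood`, `phi`, and `alpha` are illustrative names and values.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw topic proportions from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def empirical_log_likelihood(doc, phi, alpha, num_samples=1000, seed=0):
    """Estimate log p(doc) by averaging over sampled topic proportions.

    doc: list of word ids; phi[k][w]: p(word w | topic k); alpha: prior.
    """
    rng = random.Random(seed)
    log_probs = []
    for _ in range(num_samples):
        theta = sample_dirichlet(alpha, rng)
        log_p = 0.0
        for w in doc:
            p_w = sum(t * topic[w] for t, topic in zip(theta, phi))
            log_p += math.log(p_w)
        log_probs.append(log_p)
    # Log of the average likelihood, computed stably (log-sum-exp).
    m = max(log_probs)
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs) / num_samples)


# Hypothetical two-topic model over a three-word vocabulary.
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
score = empirical_log_likelihood([0, 0, 2, 1], phi, alpha=[0.5, 0.5])
```

Higher (less negative) values mean the fitted model assigns more probability to unseen documents, which is the sense in which the slide wants this value to be high.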
![Page 12: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/12.jpg)
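Results: NYT Dataset
We obtain the topic mixture for the NYT dataset using K=20 topics. Top words of the topics shown in the figure:
• year, live, century, museum, people, music, time, star, book
• story, p, journal, constitution, time, editor, budget, york
• military, war, nuclear, president, politic, chechnya, power, soviet
• internet, information, technology, service, i, people, e, busy, work, mail
• computer, make, hand, system, tv, people, program, network, dont, drive, call
• study, patient, people, univers, disease, medic, increase, care, state
• bank, problem, economy, system, investor, percent, price, investment, economist, financial
• drug, state, united, talk, nato, clinton, american
[Figure: links between topics are marked + or -; the link layout is not recoverable from the transcript.]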
![Page 13: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/13.jpg)
Results: Empirical Likelihood
[Figure: four panels (APW, NIPS, NYT, and XIE datasets) comparing empirical likelihood across models; one curve is labeled "Our Model".]
![Page 14: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/14.jpg)
Results: Running Time
[Figure: four panels (APW, NIPS, NYT, and XIE datasets) comparing running time in minutes across models; one curve is labeled "Our Model".]
![Page 15: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/15.jpg)
Illustrative Example: NYT Dataset
NORTHRIDGE TAUGHT A LESSON
LOS ANGELES _ School has been out at Cal State Northridge since the week before Christmas, but since you can learn something everyday, Mississippi State's women's basketball team gave a lesson. Northridge has talked about taking its game to the next level. The 21st-ranked Bulldogs _ the first nationally ranked team to play here in Northridge's Division I era _ gave a glimpse of that level in a 98-64 nonconference victory before a crowd of 165 Friday night.
![Page 16: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/16.jpg)
![Page 17: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/17.jpg)
![Page 18: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/18.jpg)
![Page 19: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/19.jpg)
Topic Modeling & Entity Association
This work was presented at SAPPHIRE NOW 2010.
• Base knowledge source: the Valukas Report about why Lehman Brothers failed (6 volumes)
• UCSC Topic Mining System: produces topics
• SAP Business Objects Entity Extractor: produces entities
• Saffron Associative Memory Base: Saffron Associative Memory creates associations among entities and topics
• Text data to be monitored
• Query: we would like to know who the actors are that were involved in a particular action that led to the failure of Lehman Brothers
![Page 20: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/20.jpg)
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
– Knowledge Extraction System
– System Architecture
– Domain Knowledge
– Improving Productivity
– Performance of Service Request Recommender
• Interactive Retrieval
• Interactive Retrieval Demo
![Page 21: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/21.jpg)
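Knowledge Extraction System at Cisco
• Input: a service request (unstructured text) from the Service Request Database.
• The Text Mining System answers three questions and labels the text accordingly:
– What was the problem? → Problem
– Why did it occur? → Cause
– How was it solved? → Solution
– Remaining text → Irrelevant content
• The extracted knowledge is stored in a Knowledge Database and used by applications such as retrieval.
• Example application: finding different solutions to the same problem. Document 1 and Document 2 are compared segment by segment (Problem, Cause, Solution); the figure marks the similarities as high, high, and low.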
![Page 22: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/22.jpg)
System Architecture
[Figure: components are Service Request, Hierarchical Classifier, Labeled Paragraphs, Preprocessor, Service Request Recommender, User, Bag-of-words, Domain Knowledge, Expertise, and Feature Generator. Legend: data flow of Analyzer, data flow of Recommender, data output for User.]
Features from Expertise

| Type | Feature | Class and motivation |
| --- | --- | --- |
| Statistical features | Length of paragraph | Short paragraphs are usually irrelevant. |
| Statistical features | Relative position of a paragraph in a service request | Service requests have the hidden process "problem → cause → solution". |
| Statistical features | Number of "%" | Error codes (relevant) begin with "%". |
| Contextual features | Contain "Hi", "Hello", "my name", or "I'm" | Introduction, irrelevant |
| Contextual features | Contain "feel free", "to contact", or "have a ... day"; begin with "Best" or "Thank" | Salutation, irrelevant |
| Contextual features | Telephone number, zip code, or affiliation | Contact information, irrelevant |
| Hint words | Contain "problem", "error message" or "symptom" | Problem |
| Hint words | Contain "suspect", "seem", "looks like", "indicate", "try", "test", or "check" | Troubleshooting |
| Hint words | Contain "recommend", "suggest", "replace", "reseat", "RMA", or "workaround" | Solution |
| Lexical features | Number of words from domain dictionary | Usually relevant |
| Lexical features | Product name | Usually relevant |
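A few of the features listed on this slide can be computed with simple string tests. The following is a minimal sketch, assuming one plain-text paragraph at a time; `paragraph_features` and the feature keys are hypothetical names, and the matching rules (word-boundary and prefix matching) are a simplification, not the system's actual feature generator.

```python
import re

# Hint words copied from the slide; keys are the classes they point to.
HINT_WORDS = {
    "problem": ("problem", "error message", "symptom"),
    "troubleshooting": ("suspect", "seem", "looks like", "indicate", "try", "test", "check"),
    "solution": ("recommend", "suggest", "replace", "reseat", "rma", "workaround"),
}
GREETINGS = ("hi", "hello", "my name", "i'm")


def paragraph_features(paragraph, index, total):
    """Compute a few statistical, contextual, and hint-word features.

    index: position of the paragraph in the service request (0-based).
    total: number of paragraphs in the service request.
    """
    text = paragraph.lower()
    features = {
        "length": len(text.split()),
        "relative_position": index / max(total - 1, 1),
        "percent_count": paragraph.count("%"),
        # Whole-word match so that "hi" does not fire on "this".
        "is_greeting": any(
            re.search(r"\b" + re.escape(g) + r"\b", text) for g in GREETINGS
        ),
    }
    for label, hints in HINT_WORDS.items():
        # Prefix match so that "seems" or "recommended" also count.
        features["hint_" + label] = any(
            re.search(r"\b" + re.escape(h), text) for h in hints
        )
    return features


example = paragraph_features(
    "We suspect a bad line card; error message %LINK-3-UPDOWN seen.", 1, 3
)
```

A classifier would consume such feature dictionaries to label each paragraph; the feature values alone do not decide the class.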
![Page 23: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/23.jpg)
Domain Knowledge
- Internetworking Terms and Acronyms Dictionary (ITAD)
- Benefits: (1) the expansion of acronyms and terminology; (2) the enhancement of concept dependencies.
- Example (measuring similarity):
Snippet from Doc1: The phone boots up and it does a DHCP [Dynamic Host Configuration Protocol. Provides a mechanism for allocating IP addresses dynamically so that addresses can be reused when hosts no longer need them] request in the native VLAN [virtual LAN]. There it gets an IP address [32-bit address assigned to hosts using TCP/IP] and an option that it needs to boot up in the VLAN 40 and that it need to go in trunking [physical and logical connection between two switches across which network traffic travels] mode.
Snippet from Doc2: Host Server with 2 interfaces [connection between two systems or devices] and one default gateway. When ping Vlan-B [virtual LAN] interface an ARP [Address Resolution Protocol. Internet protocol used to map an IP address to a MAC address] request with a source IP of Vlan-B is sent to Default Router [network layer device that uses one or more metrics to determine the optimal path along which network traffic should be forwarded. Routers forward packets from one network to another based on network layer information] on Vlan-A, but Router does not respond to ARP request.
Legend: […] is the explanation from ITAD. In the original slide, blue marks overlapping words between the unexpanded excerpts and red marks overlapping words introduced by ITAD.
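The effect of dictionary expansion on word overlap can be sketched as follows. The glossary is a tiny stand-in with entries abridged from the snippets on this slide, and Jaccard overlap is an assumed similarity measure chosen for illustration; the slide does not say which measure the system uses.

```python
import re

# Tiny stand-in for ITAD; explanations abridged from the slide's snippets.
GLOSSARY = {
    "dhcp": "dynamic host configuration protocol mechanism for allocating ip address",
    "arp": "address resolution protocol map ip address to mac address",
    "vlan": "virtual lan",
}


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand(text, glossary):
    """Return the token set of text plus explanations of known terms."""
    words = tokens(text)
    for word in list(words):
        if word in glossary:
            words.extend(tokens(glossary[word]))
    return set(words)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc1 = "The phone does a DHCP request in the native VLAN"
doc2 = "Router does not respond to ARP request"
plain = jaccard(set(tokens(doc1)), set(tokens(doc2)))
expanded = jaccard(expand(doc1, GLOSSARY), expand(doc2, GLOSSARY))
```

Before expansion the two snippets share only "does" and "request"; after expansion they also share "protocol", "ip", and "address", which is the kind of overlap the slide attributes to ITAD.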
![Page 24: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/24.jpg)
Improving Productivity
Compare the time spent by engineers in reading service requests before and after using our system.
Engineer workflow:
1. Browse a service request and decide whether it is relevant (Y/N). This is the time to assess relevance.
2. If it is relevant, read and understand it thoroughly until enough has been read (Y/N). This is the time to extract knowledge.
3. Create a knowledge article.

| | Time to assess relevance | Time to extract knowledge |
| --- | --- | --- |
| Before using system | 27 minutes | 97 minutes |
| After using system | 11 minutes | 67 minutes |
| Productivity improved by | 145% | 45% |
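The improvement percentages are consistent with reading productivity as throughput, so the gain is the ratio of the before and after times minus one. That definition is an assumption (the slide does not state its formula), and `productivity_gain` is a hypothetical helper, but it reproduces both figures:

```python
def productivity_gain(before_minutes, after_minutes):
    """Relative throughput gain, (before / after) - 1, as a percentage."""
    return (before_minutes / after_minutes - 1.0) * 100.0


assess = productivity_gain(27, 11)   # time to assess relevance
extract = productivity_gain(97, 67)  # time to extract knowledge
```

Rounded to whole percentages, these give 145% and 45%, matching the table.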
![Page 25: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/25.jpg)
Performance of Service Request Recommender
Result 1: Both the deterministic and the probabilistic model achieved much better results when labeled paragraphs were used; this validates our hypothesis of the inherent diagnostic business process.
Result 2: Using domain knowledge further improves retrieval results.
Result 3: The probabilistic recommender outperformed the deterministic recommender.
Retrieval schemes:

| | Baseline | Our method |
| --- | --- | --- |
| Retrieval models | Deterministic model | Probabilistic model |
| Information | The whole document | The semantically labeled paragraphs |
| Domain knowledge | None | Dictionary |
![Page 26: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/26.jpg)
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
– Problem
– Reinforcement Learning Formulation
– How many interaction steps are needed
– How much feedback is needed
– Interactive Retrieval Using Topic Modeling
• Interactive Retrieval Demo
![Page 27: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/27.jpg)
Interactive Retrieval
• Model the user intent to retrieve relevant documents
• Identify the trade-off between
– Retrieval accuracy (how accurate does the user require the results to be?)
– Interaction time (how much time is the user willing to spend on interaction?)
• Applied to
– Medical document retrieval (e.g., search for past patient cases with similar symptoms)
– Resume retrieval in a labor marketplace (e.g., search for Python developers who work in machine learning)
[Figure labels: MORE IMPORTANT, LESS IMPORTANT]
![Page 28: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/28.jpg)
Problem
What is the best path to choose?
[Figure: paths from User Intent to the Set of Relevant Documents over steps t1, t2, t3, …, tn, shown for three strategies: Static, Myopic, and Dynamic. Dynamic approaches listed: Dynamic Programming, Reinforcement Learning.]
![Page 29: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/29.jpg)
Reinforcement Learning Formulation of IIR

| Element | In the IIR setting |
| --- | --- |
| Agent | IIR system |
| Environment | User |
| Intent | Best guess for user intent or need (expressed in query terms) |
| Action | Ranking R_k |
| Reward | Improvement v(R_k) - v(R_{k-1}) (as observed from user feedback) |
| Objective | Maximize the sum of rewards |
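The reward defined in this formulation is the step-to-step improvement in the value of the ranking. A minimal sketch, assuming an undiscounted sum and made-up ranking values (`session_rewards` and the numbers are illustrative only):

```python
def session_rewards(values):
    """Per-step rewards v(R_k) - v(R_{k-1}) for k = 1..n.

    values[k] is v(R_k), the value of the ranking shown at step k.
    """
    return [values[k] - values[k - 1] for k in range(1, len(values))]


# Hypothetical values of successive rankings, e.g. precision@10 per step.
values = [0.3, 0.5, 0.4, 0.7]
rewards = session_rewards(values)
total = sum(rewards)
```

Without discounting, the sum of these rewards telescopes to `values[-1] - values[0]`, so maximizing it over a fixed-length session amounts to maximizing the value of the final ranking; a step can have a negative reward and still belong to the best path.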
![Page 30: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/30.jpg)
Experiments Set-Up
• Dataset: TREC-9 OHSUMED, 348,566 medical documents
– with a list of relevance judgments
• 65 user queries
– query title: 2 to 5 words
– query description: 5 to 10 words
• Interactive sessions of 3 to 5 steps
• Relevance function is binary
• Value of results (with appropriate weights w_i)
– Precision@10: percentage of relevant documents in the top-10 results
– We compare our results with pseudo-relevance feedback
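Precision@10 with binary relevance, as used in this setup, can be computed as below. The document ids and relevance judgments are invented for illustration, and `precision_at_k` is a hypothetical helper name.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are judged relevant."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking of 20 documents and a set of relevance judgments.
ranked = list(range(1, 21))
relevant = {2, 3, 5, 7, 11, 13}
p10 = precision_at_k(ranked, relevant, k=10)
```

Only documents 2, 3, 5, and 7 fall in the top ten here, so four of the ten results count as relevant.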
![Page 31: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/31.jpg)
How many interaction steps needed?
9/19/2011
![Page 32: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/32.jpg)
How much feedback is needed?
[Figure: precision@10 (y-axis, 0.60 to 0.85) versus the number of documents on which feedback is provided per step (x-axis, 1 to 7).]
Experiments tested on the 348,566-document OHSUMED medical dataset, TREC 2002.
![Page 33: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/33.jpg)
Interactive Retrieval with Topic Modeling
• Topics help us to reduce the search
– They add context to the query
– Some important terms to describe the users' intent may not be included in the query
– Topics are calculated a priori and added to each document as metadata
[Figure: the meta-query (a combination of user inputs) is scored using a combination of term and topic relevance scores, drawing on the topic mixture of relevant docs and the topic mixture of non-relevant docs.]
The meta-query is updated each time the user provides feedback (clicks) or additional information to the system (query redefinition).
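One plausible way to combine term and topic relevance, sketched here under stated assumptions, is a Rocchio-style topic profile: the average topic mixture of documents marked relevant minus that of documents marked non-relevant. This is not the authors' state-based method; the function names, the 0.5 weight, and all mixtures below are hypothetical.

```python
def mean_mixture(mixtures, num_topics):
    """Average a list of topic mixtures; zeros if the list is empty."""
    if not mixtures:
        return [0.0] * num_topics
    return [sum(m[k] for m in mixtures) / len(mixtures) for k in range(num_topics)]


def topic_profile(relevant, non_relevant, num_topics):
    """Topics favoured by relevant docs minus those of non-relevant docs."""
    pos = mean_mixture(relevant, num_topics)
    neg = mean_mixture(non_relevant, num_topics)
    return [p - n for p, n in zip(pos, neg)]


def combined_score(term_score, doc_mixture, profile, weight=0.5):
    """Weighted sum of a term-based score and a topic-profile match."""
    topic_score = sum(d * p for d, p in zip(doc_mixture, profile))
    return weight * term_score + (1.0 - weight) * topic_score


# Hypothetical feedback: one relevant and one non-relevant document.
profile = topic_profile([[0.8, 0.1, 0.1]], [[0.1, 0.8, 0.1]], num_topics=3)

# Two candidates with the same term score but different topic mixtures.
score_a = combined_score(0.5, [0.7, 0.2, 0.1], profile)
score_b = combined_score(0.5, [0.2, 0.7, 0.1], profile)
```

The two candidates match the query terms equally well, yet the one whose topics resemble the relevant feedback ranks higher; recomputing `profile` after each click mirrors the per-interaction update described on the slide.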
![Page 34: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/34.jpg)
Proposed Dataset
• We test our approach using the HARD TREC queries, whose collection consists of:
– 851,018 news documents from the NYT, APW, and XIE agencies
– Each document has an average length of 305 terms
– There are 496,779 unique terms
– We infer the topic information of the corpus using 75 topics
– For testing purposes we use m=3 interactions
– We use 30 test queries
– We compare our algorithm with mixture relevance feedback
![Page 35: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/35.jpg)
Preliminary Results
[Figure: precision (y-axis, 0.1 to 1) versus number of interactions (x-axis, 1 to 3) for two methods: Mixture and State Based.]
![Page 36: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/36.jpg)
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo
![Page 37: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/37.jpg)
Example User Intent
• young female with fevers and increased CPK (creatine phosphokinase)
– CPK: an enzyme; elevated levels may indicate a heart attack or severe muscle breakdown
• neuroleptic malignant syndrome (life-threatening neurological disorder)
– Associated with CPK
– Symptoms: muscular cramps, fever, unstable blood pressure, changes in cognition, including agitation, delirium and coma
• differential diagnosis
– List symptoms
– List causes of the symptoms
– Prioritize by the most dangerous
– Treat
• treatment
![Page 38: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/38.jpg)
Relevant Documents
• Non-relevant documents:
Doc 1: Significance of elevated levels of CPK in febrile diseases: a prospective study. The incidence and significance of elevated serum levels of (CPK) in febrile diseases were studied prospectively in all patients admitted with fever to a department of medicine during 1 year.
Doc 2: Metoclopramide-induced neuroleptic malignant syndrome… Symptoms of NMS include rigidity, hyperpyrexia, altered consciousness, and autonomic instability. This syndrome is generally associated with neuroleptic medications used to treat psychotic and major depressive illnesses…
• Relevant document:
Doc 3: Neuroleptic malignant syndrome: guidelines for treatment and reinstitution of neuroleptics… Cardinal symptoms include fever, muscular rigidity, an elevated serum level of creatine phosphokinase, changes in mental status, and autonomic dysfunction…
![Page 39: Dynamic Search Using Semantics & Statistics](https://reader036.fdocuments.net/reader036/viewer/2022062514/558555d2d8b42a0a3a8b5007/html5/thumbnails/39.jpg)
Interactive Demo
• InteractiveDemo_MedicalData
• Sub-queries
– young female with fevers and increased CPK
– neuroleptic malignant syndrome
– differential diagnosis
– treatment