Information Discovery on Vertical Domains
Vagelis Hristidis
Assistant Professor
School of Computing and Information Sciences
Florida International University (FIU), Miami
Need for Information Discovery
Amount of available data increases
Needle in the haystack problem
Some applications:
◦ Web
◦ Desktop search
◦ Data Warehousing
◦ Bibliographic database
◦ Homes, cars search, e.g., realtor.com, autotrader.com
◦ Scientific domains, e.g., genes, proteins, publications in biology, elements and interactions of components in chemistry
◦ Patient hospitalizations, physician info, procedure outcomes in hospitals
Vagelis Hristidis - FIU - Information Discovery on Vertical Domains
Strengths and Limitations of Current Approaches
Web Search
+ Scalability
+ Handle free text
+ Exploit content and link structure to achieve ranking
+ Simple keyword queries
- Limited query expressive power
- Generic, domain-independent ranking algorithms
- Return pages, not answers
Database Querying
+ Efficient
+ Handle structured data
+ Well-defined theory and answers
- Must learn query language, e.g., SQL
- No automatic ranking of results
Keyword Search in Databases
+ Simple keyword queries
+ Exploit links (e.g., primary-foreign keys)
- Generic ranking, typically by size of result
- No domain semantics
[Figure: example data graph with nodes:]
p1: person[name="John", nation="US"]
l1: lineitem[quantity=10, shipdate=Oct 14 2001]
l2: lineitem[quantity=10, shipdate=Oct 15 2001]
pa3: part[partkey=1005, name="TV"]
pa1: part[partkey=1008, name="VCR"]
pa2: part[partkey=1009, name="VCR & DVD"]
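To illustrate keyword search over key links and why ranking by result size alone is generic, here is a minimal sketch. The node texts follow the example data graph, but the links between nodes are an assumption (the figure's edges are not recoverable from the text), and the function names are hypothetical. It handles two-keyword queries only and returns the shortest connecting path for each pair of matching nodes, ordered by size.

```python
from collections import deque

# Node texts follow the example graph; LINKS are assumed primary-foreign key links.
NODES = {
    "p1": 'person name="John" nation="US"',
    "l1": "lineitem quantity=10 shipdate=Oct 14 2001",
    "l2": "lineitem quantity=10 shipdate=Oct 15 2001",
    "pa1": 'part partkey=1008 name="VCR"',
    "pa2": 'part partkey=1009 name="VCR & DVD"',
}
LINKS = [("p1", "l1"), ("l1", "pa1"), ("p1", "l2"), ("l2", "pa2")]


def matches(keyword):
    """Nodes whose text contains the keyword (case-insensitive)."""
    return [n for n, text in NODES.items() if keyword.lower() in text.lower()]


def shortest_path(src, dst, adj):
    """Breadth-first search; returns the node list from src to dst, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


def keyword_search(kw1, kw2):
    """Connect every kw1 node to every kw2 node; smaller results rank first."""
    adj = {}
    for a, b in LINKS:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    results = []
    for s in matches(kw1):
        for t in matches(kw2):
            path = shortest_path(s, t, adj)
            if path:
                results.append(path)
    return sorted(results, key=len)


results = keyword_search("John", "VCR")
```

Under these assumed links both results have three nodes, so size-based ranking cannot separate them, which is the limitation the slide points to.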
Research Objective
Allow effective and efficient information discovery on vertical domains
Strategy:
◦ Exploit associations between entities
◦ Model domain semantics, e.g., patient entity is critical for medical practitioner, but not for biologist
◦ Model users of a domain
◦ Use knowledge of domain experts, and existing knowledge structures (e.g., domain ontologies)
◦ Exploit user feedback
◦ Go beyond plain keyword search. Explore best search interface for each domain, e.g., faceted search
Specific Domains Studied (or being studied)
◦ Products marketplace
◦ Biological databases
◦ Clinical databases
◦ Bibliographic
◦ Patents
Products Marketplace
Project started while visiting Microsoft Research at Redmond, in Summer 2003
SQL Returns Unordered Sets of Results
Overwhelms Users of Information Discovery Applications
How Can Ranking be Introduced, Given that ALL Results Satisfy Query?
Products Marketplace (cont’d)
Example: Realtor Database
House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too Many Results!
Intuitively, Houses with lower Price, more Bedrooms, or BoatDock are generally preferable
Products Marketplace (cont’d)
Rank According to Unspecified Attributes [VLDB’04, TODS’06]
Score of a Result Tuple t depends on:
Global Score: Global Importance of Unspecified Attribute Values
◦ E.g., Newer Houses are generally preferred
Conditional Score: Correlations between Specified and Unspecified Attribute Values
◦ E.g., Waterfront correlates with BoatDock
◦ E.g., Many Bedrooms correlates with Good School District
Products Marketplace (cont’d)
Key Problems
Given a Query Q, How to Combine the Global and Conditional Scores into a Ranking Function.
◦ Use Probabilistic Information Retrieval (PIR).
How to Calculate the Global and Conditional Scores.
◦ Use Query Workload and Data.
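A rough sketch of how global and conditional scores might be estimated from a query workload is given here. It is a simplification and not the published ranking function: the workload, the houses, and the function names are hypothetical, only workload frequencies are used (no data statistics), and the two scores are combined by plain addition rather than through PIR.

```python
# Hypothetical past workload: each query is a set of (attribute, value) conditions.
WORKLOAD = [
    {("City", "Seattle"), ("Waterfront", True), ("BoatDock", True)},
    {("City", "Seattle"), ("Waterfront", True), ("BoatDock", True)},
    {("City", "Seattle"), ("Bedrooms", 4)},
    {("Waterfront", True), ("Bedrooms", 4)},
]

# Hypothetical result tuples; all of them satisfy the query below.
HOUSES = {
    "h1": {"City": "Seattle", "Waterfront": True, "BoatDock": True, "Bedrooms": 2},
    "h2": {"City": "Seattle", "Waterfront": True, "BoatDock": False, "Bedrooms": 4},
    "h3": {"City": "Seattle", "Waterfront": True, "BoatDock": False, "Bedrooms": 2},
}
QUERY = {"City": "Seattle", "Waterfront": True}


def global_score(cond, workload):
    """Fraction of past queries that ask for this attribute value."""
    return sum(cond in q for q in workload) / len(workload)


def conditional_score(cond, query, workload):
    """Average, over the specified conditions, of how often past queries
    containing that condition also ask for cond."""
    scores = []
    for spec in query.items():
        with_spec = [q for q in workload if spec in q]
        if with_spec:
            scores.append(sum(cond in q for q in with_spec) / len(with_spec))
    return sum(scores) / len(scores) if scores else 0.0


def score(row, query, workload):
    """Sum the two scores over the attributes the query left unspecified."""
    total = 0.0
    for attr, value in row.items():
        if attr in query:  # specified attribute: every result matches it
            continue
        cond = (attr, value)
        total += global_score(cond, workload) + conditional_score(cond, query, workload)
    return total


def rank(houses, query, workload):
    return sorted(houses, key=lambda h: score(houses[h], query, workload), reverse=True)


ranked = rank(HOUSES, QUERY, WORKLOAD)
```

With this toy workload the house with a boat dock ranks first, because past queries that asked for a Seattle or waterfront house often also asked for a boat dock.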
Products Marketplace (cont’d)
Other Projects
Select the best attributes to output: attribute ordering problem [SIGMOD’06]
◦ E.g., Color is important for sports cars but not much for family cars
Product Advertising: Select best attributes to display for a product to maximize its visibility among its competitors [ICDE’08, TKDE’09]
◦ Use past query workload
◦ Maximize number of past queries for which the product is returned
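The product advertising objective can be sketched as a brute-force search, under the assumption (made here for illustration) that a past query retrieves the product only when every attribute it asks for is among the displayed ones. The car attributes, the workload, and the function name are hypothetical.

```python
from itertools import combinations


def best_attributes(product_attrs, workload, m):
    """Try every m-subset of the product's attributes and keep the one
    that satisfies the most past queries (exhaustive, for small inputs only)."""
    best, best_hits = (), -1
    for subset in combinations(sorted(product_attrs), m):
        shown = set(subset)
        # A query counts as a hit when all attributes it asks for are shown.
        hits = sum(1 for q in workload if q <= shown)
        if hits > best_hits:
            best, best_hits = subset, hits
    return set(best), best_hits


CAR = {"AC", "FourDoor", "Turbo", "PowerLocks", "AutoTrans"}
PAST_QUERIES = [
    {"AC", "FourDoor"},
    {"AC"},
    {"Turbo"},
    {"AC", "FourDoor"},
    {"PowerLocks", "AutoTrans"},
]

chosen, hits = best_attributes(CAR, PAST_QUERIES, m=2)
```

Exhaustive enumeration grows combinatorially with the number of attributes, so it only serves to make the objective concrete.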
Biological Databases [EDBT’09]
With University of Maryland
Intuitive but powerful query language, based on soft (ranking) and hard (pruning) filters
Goal is to improve the experience of PubMed users
Exploit associations between entities (genes, proteins, publications)
Example of Query: Find the most important publications on “cancer” that are related to the “TNF” gene through a protein.
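The difference between hard (pruning) and soft (ranking) filters can be sketched on the example query. The query language itself is not shown in these slides, so everything below is a hypothetical illustration: the toy publications, the protein-to-gene map, and the keyword-count score are assumptions.

```python
# Hypothetical toy data: publication -> (text, proteins it mentions); protein -> gene.
PUBS = {
    "pub1": ("TNF signalling in cancer cells and cancer therapy", {"protA"}),
    "pub2": ("cancer screening guidelines", {"protB"}),
    "pub3": ("protein folding kinetics", {"protA"}),
}
PROTEIN_GENE = {"protA": "TNF", "protB": "BRCA1"}


def hard_filter(pub_id, gene):
    """Pruning: keep only publications linked to the gene through a protein."""
    _, proteins = PUBS[pub_id]
    return any(PROTEIN_GENE.get(p) == gene for p in proteins)


def soft_score(pub_id, keyword):
    """Ranking: count keyword occurrences (a stand-in for a real relevance score)."""
    text, _ = PUBS[pub_id]
    return text.lower().split().count(keyword.lower())


def run_query(keyword, gene):
    kept = [p for p in PUBS if hard_filter(p, gene)]
    return sorted(kept, key=lambda p: soft_score(p, keyword), reverse=True)


answer = run_query("cancer", "TNF")
```

Here pub2 is pruned even though it mentions the keyword, while pub3 survives the hard filter and is merely ranked last.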
Results Navigation in PubMed with BioNav [ICDE’09, TKDE’10]
With SUNY Buffalo.
Most publications in PubMed are annotated with Medical Subject Headings (MeSH) terms.
Present results in MeSH tree.
Propose navigation model and smart expansion techniques that may skip tree levels.
BioNav: Exploring PubMed Results
Static Navigation Tree for query “prothymosin”
[Figure: MeSH navigation tree with the number of results per node; collapsed branches ("N more nodes") omitted. Nodes shown:]
MESH (313)
Amino Acids, Peptides, and Proteins (310)
Proteins (307)
Nucleoproteins (40)
Biological Phenomena, … (217)
Cell Physiology (161)
Cell Growth Processes (99)
Genetic Processes (193)
Gene Expression (92)
Transcription, Genetic (25)
Histones (15)
- Query Keyword: prothymosin
- Number of results: 313
- Navigation Tree stats:
• # of nodes: 3941
• depth: 10
• total citations: 30897
Big tree with many duplicates!
Vagelis Hristidis, Searching and Exploring Biomedical Data
BioNav: Exploring PubMed Results
Reveal to the user a selected set of descendant concepts that:
(a) Collectively contain all results
(b) Minimize the expected user navigation cost
Not all children of the root are necessarily revealed as in static navigation.
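Condition (a) can be checked mechanically. The sketch below uses a hypothetical four-concept hierarchy with made-up citation ids; it does not model condition (b), since the cost model is not given in these slides.

```python
# Hypothetical toy hierarchy: concept -> children, and concept -> citations
# annotated directly with that concept. A citation may appear under several concepts.
CHILDREN = {
    "MESH": ["Proteins", "Genetic Processes"],
    "Proteins": ["Histones"],
    "Histones": [],
    "Genetic Processes": [],
}
DIRECT = {
    "MESH": set(),
    "Proteins": {1, 2},
    "Histones": {2, 3},
    "Genetic Processes": {3, 4},
}


def subtree_results(concept):
    """All citations attached to the concept or to any of its descendants."""
    results = set(DIRECT[concept])
    for child in CHILDREN[concept]:
        results |= subtree_results(child)
    return results


def covers_all(revealed, root="MESH"):
    """True when the revealed concepts collectively contain every result."""
    covered = set()
    for concept in revealed:
        covered |= subtree_results(concept)
    return covered == subtree_results(root)


distinct = len(subtree_results("MESH"))
attached = sum(len(ids) for ids in DIRECT.values())
```

In this toy tree `attached` exceeds `distinct` because citations are attached to more than one concept, a small-scale version of the duplicates noted for the static tree.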
BioNav Evaluation
[Figure: bar chart comparing Static and BioNav navigation. Axis: Overall Navigation Cost (# of Concepts Revealed + # of EXPAND Actions)]
XOntoRank: Use Ontologies to Search Electronic Medical Records [ICDE’09]
With Miami Children’s Hospital, Indiana University School of Medicine, IBM Almaden.
Latest EMR format: HL7 CDA (XML-based)
Algorithm to enhance keyword search using ontological knowledge (e.g., SNOMED)
Medical Dictionary
[Figure: fragment of the medical dictionary. Concepts are linked by "Is a", "May be", and "Finding site of" relationships. Concepts shown:]
50043002 Disorder of Respiratory system
79688008 Respiratory Obstruction
118946009 Disorder of Thorax
41427001 Disorder of Bronchus
195967001 Asthma
301229001 Bronchial Finding
405944004 Asthmatic Bronchitis
266364000 Asthma attack
955009 Bronchial Structure
82094008 Lower respiratory tract structure
XOntoRank: Example 1
q = {“bronchitis”, “albuterol”}
result = Observation (code, value Bronchitis, value Albuterol)
XOntoRank: Example 2
q = {“asthma”, “albuterol”}
result = ???
XOntoRank
A CDA node may be associated with a query keyword w through the ontology.
XOntoRank first assigns scores to ontological concepts
◦ OntoScore OS(): Semantic relevance of a concept c in the ontology to a query keyword w.
Then, given these scores, assign Node Scores NS() to document nodes
Other aggregation functions are possible.
Computing OntoScore of Concept Given Query Keyword
Three ways to view the ontology graph:
◦ As an unlabeled, undirected graph.
◦ As a taxonomy.
◦ As a complete set of relationships.
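As an illustration of the first view (unlabeled, undirected graph), here is a minimal sketch in which a concept's score decays with its distance from concepts that mention the keyword. This is not the XOntoRank formula: the decay factor, the substring matching, the use of max as aggregation for node scores, and the edges themselves are all assumptions. Only the concept names come from the dictionary figure.

```python
from collections import deque

# Concept names come from the dictionary figure; which concepts are linked
# here is an assumption made for the example.
EDGES = [
    ("Asthma", "Disorder of Bronchus"),
    ("Asthmatic Bronchitis", "Asthma"),
    ("Asthma attack", "Asthma"),
    ("Disorder of Bronchus", "Bronchial Structure"),
]


def neighbors(edges):
    """Undirected adjacency: relationship labels and directions are ignored."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def onto_scores(keyword, edges, decay=0.5):
    """Concepts naming the keyword score 1.0; each further hop multiplies by decay."""
    adj = neighbors(edges)
    seeds = [c for c in adj if keyword.lower() in c.lower()]
    scores = {c: 1.0 for c in seeds}
    queue = deque(seeds)
    while queue:  # multi-source breadth-first search
        concept = queue.popleft()
        for nxt in adj[concept]:
            if nxt not in scores:
                scores[nxt] = scores[concept] * decay
                queue.append(nxt)
    return scores


def node_score(node_concepts, scores):
    """Score of a document node: best score among the concepts it references."""
    return max((scores.get(c, 0.0) for c in node_concepts), default=0.0)


asthma_scores = onto_scores("asthma", EDGES)
ns = node_score(["Disorder of Bronchus"], asthma_scores)
```

Under these assumptions a document node coded only with "Disorder of Bronchus" still receives a non-zero score for the keyword "asthma", which is the kind of association the ontology is meant to supply.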
Authority Flow Ranking in EMRs
A subset of the electronic health record dataset.
Work under submission.
Query: “pericardial effusion”
[Figure: data graph of EMR objects v1 to v7, linked by edges such as prescribed_to and recorded_by. Object contents:]
EventsPlan: TimeStampCreated="2004-11-03 11:57:00.0" Events="….small residual pericardial effusion….."
Hospitalization: TimeStampCreated="2004-10-27 22:00:00.0" History="18 year old boy with an aggressive form of chest lymphoma…" Allergies="NKDA" …...
Cardiac: PatientID="1438" Complication="apical impulse … Echo-large increasing pericardial effusion…"
Employee: TimeStampCreated="2004-12-23 14:03:00.0" Title="Pediatric Cardiologist" ….
EventsPlan: Events="4 month old baby… pericardial effusion..."
Medication: TimeStampCreated="2003-02-13 21:57:00.0" ..
Hospitalization: History="48 year old.."
ObjectRank on EMRs: Authority Flow Ranking
Schema of the EMR dataset
[Figure: schema graph over the entity types Hospitalization, Employee, Associated_Events, Patient, and Medication. Edge types: A-E, P-M, H-M, M-E, A-H, H-E, P-E. Edge labels include created_by, recorded_by, prescribed_by, prescribed_to, of, and for.]
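Authority flow ranking can be sketched as a keyword-specific random walk: random jumps land only on objects that contain the query keywords (the base set), and authority then flows along typed edges. The code below is a simplified illustration, not the Clinical ObjectRank implementation. The edge list, the transfer rates per edge type, and the damping factor are assumed values.

```python
def object_rank(edges, base_set, rates, d=0.85, iterations=50):
    """edges: (source, target, edge_type) triples.
    base_set: objects that contain the query keywords.
    rates: authority transfer rate per edge type (assumed values)."""
    nodes = {n for s, t, _ in edges for n in (s, t)} | set(base_set)
    # A type's rate is split evenly among a node's outgoing edges of that type.
    out_degree = {}
    for s, _, etype in edges:
        out_degree[(s, etype)] = out_degree.get((s, etype), 0) + 1
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Random jumps land only on base-set objects.
        nxt = {n: ((1 - d) / len(base_set) if n in base_set else 0.0) for n in nodes}
        for s, t, etype in edges:
            nxt[t] += d * rates[etype] / out_degree[(s, etype)] * rank[s]
        rank = nxt
    return rank


# Hypothetical instance graph using edge labels from the schema figure.
EMR_EDGES = [
    ("eventsplan_1", "employee_1", "recorded_by"),
    ("cardiac_1", "employee_1", "recorded_by"),
    ("medication_1", "hospitalization_1", "prescribed_to"),
]
BASE_SET = {"eventsplan_1", "cardiac_1"}  # objects mentioning "pericardial effusion"
RATES = {"recorded_by": 0.5, "prescribed_to": 0.5}

ranks = object_rank(EMR_EDGES, BASE_SET, RATES)
```

In this toy graph the employee object receives authority from both matching records although it does not contain the keywords itself, while the unrelated hospitalization ends with a score of zero.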
User Study Results
[Figure: two bar charts, Mean Sensitivity and Mean Specificity (Average Sensitivity and Average Specificity on a 0 to 1 axis), for CO085BM25, BM25, CO085, and CO030]
BM25: Traditional Information Retrieval Ranking Function
CO: Clinical ObjectRank (Authority Flow)
Other challenges of Searching EMRs [NSF Symposium on Next Generation of Data Mining ’07]
Entity and Association Semantics
Negative Statements
Personalization
Treatment of Time and Location Attributes
Free Text Embedded in CDA Document
Syntax vs. Semantics in Schema
Example – query “Asthma Theophylline”
More details at [Hristidis et al. NSF Symposium on Next Generation of Data Mining ’07]
Bibliographic Databases
Work started while at UCSD
Exploit citations link structure to create query-specific ranking [VLDB’04, TODS’08]
Demo available for Database literature at http://dbir.cs.fiu.edu/BibObjectRank
Bibliographic Databases (cont’d)
Query Reformulation
Work with U of Maryland [ICDE’08]
Based on user-selected results
Perform query expansion: add/change weight of query keywords
Adjust authority flow weights
Currently working on applying these ideas to queries on PubMed.
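The query expansion step can be sketched in the spirit of classic relevance feedback. This is an assumed stand-in and not the published reformulation method: the `boost` and `top_k` parameters, the frequency-based weights, and the function name are hypothetical, and no stop-word handling is done.

```python
from collections import Counter


def expand_query(weights, selected_texts, boost=0.5, top_k=2):
    """Re-weight existing keywords and add the most frequent new terms
    found in the results the user selected."""
    counts = Counter(w for text in selected_texts for w in text.lower().split())
    total = sum(counts.values())
    new_weights = dict(weights)
    if total == 0:  # nothing selected: leave the query unchanged
        return new_weights
    # Change weights of keywords already in the query.
    for term in weights:
        new_weights[term] += boost * counts[term] / total
    # Add the top_k most frequent terms that are not yet in the query.
    new_terms = [t for t, _ in counts.most_common() if t not in weights][:top_k]
    for term in new_terms:
        new_weights[term] = boost * counts[term] / total
    return new_weights


selected = ["Index Selection for OLAP", "Range Queries in OLAP Data Cubes"]
expanded = expand_query({"olap": 1.0}, selected)
```

A practical version would at least drop stop words such as "for" and "in" before picking new terms.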
Explaining Query Results – Explaining Subgraph
Target Object: “Modeling Multidimensional Databases” paper.
Explaining Subgraph Creation
1. BFS in reverse direction from target object.
2. BFS in forward direction from base set objects (authority sources).
3. Subgraph contains all nodes/edges traversed in forward direction.
4. Compute explaining authority flow along each edge by eliminating the authority leaving the subgraph (iterative procedure).
5. Structure-based reformulation: High-flow edges in explaining subgraph receive weight boost.
[Figure: explaining subgraph with nodes v1 to v6; edges are labeled with explaining authority flow values ranging from 1.59e-7 to 0.001. Node contents:]
Paper: Authors=“H. Gupta, V. Harinarayan, A. Rajaraman, J. Ullman” Title=“Index Selection for OLAP.” Year=“ICDE 1997”
Paper: Authors=“C. Ho, R. Agrawal, N. Megiddo, R. Srikant” Title=“Range Queries in OLAP Data Cubes.” Year=“SIGMOD 1997”
Paper (TARGET OBJECT): Authors=“R. Agrawal, A. Gupta, S. Sarawagi” Title=“Modeling Multidimensional Databases.” Year=“ICDE 1997”
Author: Name=“R. Agrawal”
Year: Name=“ICDE”, Year=1997, Location=Birmingham
Conference: Name=“ICDE”
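Steps 1 to 3 of the explaining subgraph creation can be sketched as two breadth-first searches. One reading is assumed here: the forward search from the base set is restricted to nodes that the reverse search found able to reach the target. The edge list and function names are hypothetical, and the flow computation of step 4 is not covered.

```python
from collections import deque


def reachable(start_nodes, adjacency, allowed=None):
    """Breadth-first search; optionally restricted to an allowed node set."""
    seen = {n for n in start_nodes if allowed is None or n in allowed}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen and (allowed is None or nxt in allowed):
                seen.add(nxt)
                queue.append(nxt)
    return seen


def explaining_subgraph(edges, base_set, target):
    forward, backward = {}, {}
    for s, t in edges:
        forward.setdefault(s, []).append(t)
        backward.setdefault(t, []).append(s)
    # Step 1: reverse BFS from the target finds every node that can reach it.
    can_reach_target = reachable([target], backward)
    # Step 2: forward BFS from the base set, staying inside that node set.
    nodes = reachable(base_set, forward, can_reach_target)
    # Step 3: keep the edges between the nodes visited in the forward pass.
    sub_edges = [(s, t) for s, t in edges if s in nodes and t in nodes]
    return nodes, sub_edges


# Hypothetical authority transfer edges.
GRAPH = [("a", "b"), ("b", "target"), ("a", "c"), ("d", "target")]
sub_nodes, sub_edges = explaining_subgraph(GRAPH, base_set={"a"}, target="target")
```

Node "c" is dropped because it cannot reach the target, and node "d" is dropped because no base-set object reaches it.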
Search Patents
Special characteristics of patents:
◦ Patents are organized into classes and subclasses.
◦ Patents have links to external publications and to other patents.
◦ Patents are organized into various sections (abstract, claims, description and images).
◦ Patents use specific legal wording in the claims section. Further, claims have references to other claims, that is, claims can be viewed as a graph.
Demo at PatentsSearcher.com
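The observation that claims can be viewed as a graph can be made concrete with a small sketch. It assumes references are phrased as "claim N"; the sample claims are made up, and ranges or lists such as "claims 1 or 2" are only partly handled (only the first number is captured).

```python
import re


def claim_graph(claims):
    """claims: dict of claim number -> claim text.
    Returns claim number -> sorted list of the claims it refers to."""
    graph = {}
    for number, text in claims.items():
        refs = {int(n) for n in re.findall(r"\bclaims?\s+(\d+)", text, flags=re.IGNORECASE)}
        graph[number] = sorted(refs - {number})  # ignore self-references
    return graph


# Made-up claims for illustration.
CLAIMS = {
    1: "A device comprising a sensor.",
    2: "The device of claim 1, wherein the sensor is optical.",
    3: "The device of claim 2, further comprising a housing.",
}

graph = claim_graph(CLAIMS)
```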
End - Thank You
For more information, please go to: http://ww.cis.fiu.edu/~vagelis
Supported by
◦ NSF CAREER, 2010-2015
◦ NSF grant IIS-0811922: III-CXT-Small: Information Discovery on Domain Data Graphs, 2008-2011
◦ DHS grant 2009-ST-062-000016: Information Delivery and Knowledge Discovery for Hurricane Disaster Management, 2009-2011