HOMME: Hierarchical‐Ontological Mind Map Explorer
Yi‐Shin Chen, Pei‐Ling Hsu, Hsiao‐Shan Hsieh, Li‐Chin Lee, Carlos Argueta
Institute of Information Systems and Applications
National Tsing Hua University
IDEA Lab
Outline
• Introduction to IDEA Lab
• Introduction to HOMME
• Framework
• Experimental Evaluation
• Conclusions and Future Work
Intelligent Data Engineering and Applications (IDEA) Laboratory
Research Focus
(Diagram: Query, Mining, Optimization, Storage, Index, DB)
Corresponding Research Issues
(Diagram: AI, HCI, Network, Database, Web, Pattern Recognition)
CURRENT PROJECTS
Current Projects
• GoogolPlex
– Web information integration and retrieval
– Topic expansion and integration
– Group “answers” based on topic and sentiment
GoogolPlex Project (Cont.)
• Apply cloud computing to speed up the analysis of large‐scale, heterogeneous data (Googolplex size)
GoogolPlex Project (Cont.)
• Related research issues
– Automatic ontology construction from heterogeneous data
GoogolPlex Project (Cont.)
• Related research issues
– Sentiment analysis for short articles (e.g., micro‐blogs, social network messages) in multi‐language environments
I hate it when it’s rainy and cold!
Loved today’s trip.
I can’t believe this happened!
GoogolPlex Project (Cont.)
• Related research issues
– Keyword extraction from short articles (e.g., micro‐blogs, social network messages) in multi‐language environments
…task of algorithm analysis consists…
…in a Markov Chain is…
…when sorting is…
GoogolPlex Project (Cont.)
• Related research issues
– Semantic analysis for different purposes, such as geo‐tagging
– TweoLocator: A Non‐Intrusive Geographical Locator System for Twitter
• Identify the location of a particular Twitter user at a given time
– Using exclusively the content of his/her tweets
HOMME Conceptual Finder Demo
HOMME Ontology Builder Demo (cont’d)
TweoLocator: Framework
TweoLocator: Experimental Results
Tweets
                         US    GB    CA    AU    IN    Others  Avg Acc
Correct tweets           463   288   353   169   125   23      65.6%
Unrelated tweets         110   55    114   53    41    18      18.1%
Disagreed & reallocated  142   175   22    0     14    0       16.3%
Accuracy                 65%   56%   72%   76%   69%   56%     66%
Profiles
              US    GB    PH    CA    Others  AU    SE
Correct       240   88    39    28    26      22    9
Wrong         24    3     0     2     0       0     0
N/A           44    17    3     6     1       3     1
Disagreement  16    0     0     1     0       0     0
Accuracy      74%   81%   93%   76%   96%     88%   90%
Current Projects
• GoogolPlex
– Web information integration and retrieval
• iConduct
– Interactive conducting system
iConduct Project
• Analyze the intentions from data streams
• Instantly aggregate user intentions and multimedia data
Current Projects
• GoogolPlex
– Web information integration and retrieval
• iConduct
– Interactive conducting systems
• MyMining
– Market analysis
MyMining Project
• Mining market information from
– Stock data (numerical data)
– News, blogs, and micro blogs (text data)
• Find the relationship between Stock Market and social networking sites
Goal
• In this research, our goal is to build a system that helps us to:
– Automatically integrate stock news and identify events.
– Evaluate each event’s influence at the industry level and use that information to verify price movements.
MyMining Project
Methodology
Off‐line
On‐line
Current Performance
• Accuracy of four methods:
Method             Average accuracy
Pheromone          0.5784574
Adjust regression  0.5323214
Regression         0.5134457
Blind test         0.3045479
PEOPLE IN IDEA LAB
People
• Current students:
– Domestic students: 7
– International students: 8
Nationality (pie chart): Taiwan 46%, Honduras 20%, Myanmar 7%, Indonesia 7%, Saint Lucia 7%, El Salvador 7%, Malaysia 6%
INTRODUCTION TO HOMME
Humans generate Knowledge
• Collecting all human knowledge has always been a recurring goal
Internet Era
• WWW has made collecting all human knowledge possible.
Data Flood
• Redundant
• Scattered
• Mutually complementary
Integration
• It is crucial to integrate heterogeneous data sources.
– Easier access
– Summarization
– Less redundancy
Previous Work (1)
• Web data integration and organization based on expert knowledge or collaboratively‐created (crowd wisdom) data
– Manually
– Semi‐automatic
– Automatic
Previous Work (2)
• Wikipedia: most successful collaboratively‐created collection of human knowledge on the web
• Unstructured articles
• Structured information (infoboxes)
Previous Work (3)
• Other works used Wikipedia structured data to integrate web data.
– YAGO:
• Wikipedia Categories + WordNet
• http://www.mpi-inf.mpg.de/yago-naga/yago/
– DBpedia:
• Wikipedia infoboxes
• http://dbpedia.org/About
Previous Work (4)
• Other sources of crowd wisdom studied to integrate and organize web data
– Social annotations
– Search logs
Previous Work (5)
• Two approaches to integrate web data:
– External resources to extract relationships
• Relatively small coverage
– Bottom‐up approach to web data integration
• Difficulty in labeling the semantic relationships
HOMME
• Relies on multiple heterogeneous “crowd wisdom” data sources.
• Bottom‐up extraction of semantic relationships present in the web data.
• Presents a mind‐map‐like representation of knowledge for easy navigation
FRAMEWORK
Framework
Data Sources
• Multiple heterogeneous data sources
– Search logs
– Social annotations: Delicious tags
– Web directory: Open Directory Project (ODP)
Framework
Resource Integrator
• Normalize and decompose heterogeneous data into smaller elements with common characteristics.
• We use the notion of word sequences and concept sequences
Word Sequences
• Every query in the search log is considered a word sequence
• Every URL in the search log can be decomposed into a word sequence
– www.mtv.com/music/artist/bowlingforsoupartist.jhtml
<mtv, music, artist, bowling, for, soup, artist>
• All the Delicious tags assigned to a URL are a word sequence
• The ODP title assigned to a URL is a word sequence.
• The ODP category assigned to a URL is turned into a word sequence.
– E.g. air/travel/agent → <air, travel, agent>
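The decomposition rules above can be sketched in Python. This is only an illustration: the tiny vocabulary, the list of ignored URL tokens, and all function names are assumptions, and the slides do not say how HOMME segments run‐together strings such as bowlingforsoupartist.

```python
import re

# Hypothetical vocabulary used to segment run-together URL tokens.
VOCAB = {"mtv", "music", "artist", "bowling", "for", "soup"}
# Hypothetical tokens that carry no meaning in a URL.
IGNORED = {"www", "com", "jhtml", "html", "http", "https"}


def segment(token, vocab=VOCAB):
    """Split a token into vocabulary words (longest match first); None if impossible."""
    if not token:
        return []
    for end in range(len(token), 0, -1):
        head = token[:end]
        if head in vocab:
            rest = segment(token[end:], vocab)
            if rest is not None:
                return [head] + rest
    return None


def url_to_word_sequence(url):
    """Decompose a URL into a word sequence, dropping the ignored tokens."""
    words = []
    for token in re.split(r"[^a-z0-9]+", url.lower()):
        if not token or token in IGNORED:
            continue
        parts = segment(token)
        # Tokens that cannot be segmented are kept whole.
        words.extend(parts if parts is not None else [token])
    return words


def category_to_word_sequence(category):
    """Turn an ODP category path such as air/travel/agent into a word sequence."""
    return [level for level in category.lower().split("/") if level]
```

With this vocabulary, url_to_word_sequence applied to the MTV URL above yields the word sequence shown on the slide, and category_to_word_sequence("air/travel/agent") yields air, travel, agent.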
Concept Sequences
• A sequence of words can represent a concept
Framework
Term Extractor
• For each frequent word sequence, it tries to split the sequence into concepts.
– E.g. Query: “star wars light saber”
– Word sequence: <star, wars, light, saber>
– Concept sequences: <<star, wars>, <light, saber>>
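One way to picture the split is a greedy longest‐match against a set of already known concepts. The sketch below assumes such a set exists (for example, taken from frequent queries); the set contents, the max_len limit, and the function name are hypothetical, and the actual Term Extractor may work differently.

```python
# Hypothetical set of known multi-word concepts (e.g. taken from frequent queries).
KNOWN_CONCEPTS = {("star", "wars"), ("light", "saber")}


def split_into_concepts(words, known=KNOWN_CONCEPTS, max_len=3):
    """Greedily group adjacent words into the longest known concept."""
    concepts = []
    i = 0
    while i < len(words):
        for size in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + size])
            # A single word is always accepted as a fallback concept.
            if size == 1 or candidate in known:
                concepts.append(list(candidate))
                i += size
                break
    return concepts
```

For the word sequence star, wars, light, saber this returns the two concepts star wars and light saber; words that match no known concept stay as single‐word concepts.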
Framework
Term Mapper
• Term Mapper uses the output of Term Extractor to build a feature matrix.
1. Classify concepts by ODP category.
2. Use the frequency of tags assigned to queries as features.
Framework
Relationship Finder
• Input data from Term Extractor: word sequences
• Goal of Relationship Finder:
– Seeks to find important semantic relationships between word sequences
• Challenges:
– To detect concept candidates in word sequences
– To gather correlated concept candidates
– To name semantic relationships between concept candidates
Relationship Finder
• Solutions:
– Rules for detecting concept candidates from word sequences
• Mapped with existing concepts
• Mapped with dictionaries
• Crowd wisdom: frequent queries, ODP titles
• Word sequences containing “of”
– Considering the contexts among word sequences
– Considering the meanings of relationships
Relationship Finder
• Hierarchical Relationships
– Has‐Subclass
– Is‐A
• Synonymous Relationships
– Is‐Equal‐To
– Has‐Meaning
• Other relationships
– Has‐Data‐About
– Has‐Website
Relationship Finder
• Hierarchical Relationships (common relationships in ontologies)
– Has‐Subclass
– Is‐A
• Synonymous Relationships
– Is‐Equal‐To
– Has‐Meaning
• Other relationships
– Has‐Data‐About
– Has‐Website
Relationship Finder
• Hierarchical Relationships
– Has‐Subclass (top down: class Has‐Subclass class)
– Is‐A (bottom up: instance is a class, class is a class)
• Synonymous Relationships
– Is‐Equal‐To
– Has‐Meaning
• Other relationships
– Has‐Data‐About
– Has‐Website
Has‐Subclass Relationship Finder
• Hierarchical Relationships (common relationships in ontologies)
– Has‐Subclass (top down)
• Utilizing ODP categories
• Mapping with crowd wisdom: frequent queries
• For instance
– Query: “travel agent phone”
– ODP category: air/travel/agent
– Output: travel has‐Subclass travel agent
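The example can be reproduced with a small Python sketch. The matching rule used here is an assumption made for illustration, since the slides do not spell it out: two adjacent ODP category levels produce a has‐Subclass relationship when they also appear as adjacent words in a query. The function name is hypothetical.

```python
def has_subclass(query, odp_category):
    """Return (parent, 'has-Subclass', child) triples under an assumed matching rule."""
    words = query.lower().split()
    bigrams = set(zip(words, words[1:]))
    levels = [level for level in odp_category.lower().split("/") if level]
    triples = []
    for parent, child in zip(levels, levels[1:]):
        # Assumed rule: adjacent category levels that are also adjacent query words.
        if (parent, child) in bigrams:
            triples.append((parent, "has-Subclass", f"{parent} {child}"))
    return triples
```

For the query “travel agent phone” and the category air/travel/agent, only the travel/agent pair matches, which gives travel has‐Subclass travel agent.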
Is‐A Relationship Finder
• Hierarchical Relationships (common relationships in ontologies)
– Is‐A (bottom up)
• Word sequences with crowd wisdom
– Queries, ODP titles
• Hierarchies among word sequences
– Word sequences with “of”
– Additional words for ambiguous words
• For instance
– Query: “apple company”
– Ambiguous word: apple
– Additional words: company
– Output: apple company Is‐A company
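A minimal Python sketch of the ambiguous‐word case follows. The list of ambiguous words and the function name are hypothetical; the slides do not say how ambiguous words are detected, and the “of” pattern is not covered here.

```python
# Hypothetical list of ambiguous words; the slides do not say how they are detected.
AMBIGUOUS = {"apple"}


def is_a(query, ambiguous=AMBIGUOUS):
    """Return (query, 'Is-A', additional words) when the query starts with an ambiguous word."""
    words = query.lower().split()
    if len(words) >= 2 and words[0] in ambiguous:
        additional = " ".join(words[1:])
        return (" ".join(words), "Is-A", additional)
    return None
```

For the query “apple company” this returns apple company Is‐A company; queries without a leading ambiguous word return no relationship.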
Relationship Finder
• Hierarchical Relationships
– Has‐Subclass
– Is‐A
• Synonymous Relationships (referring to the same concepts)
– Is‐Equal‐To
– Has‐Meaning
• Other relationships
– Has‐Data‐About
– Has‐Website
Synonymous Relationship Finder (1)
• Many word sequences refer to the same concepts.
• Is‐Equal‐To
– <cartoonnetwork>, and <cartoon, network>
• Has‐Meaning
– <ae>, <american, eagle>, and <american, eagle, outfitter>
• Finds distinct queries and ODP data referring to the same concepts.
• Steps:
1. Groups queries based on navigational intention
– Intention inferred from clicked URLs
– Groups the navigational queries based on the clicked URL
2. ODP data is added to the groups based on their referring URLs.
Synonymous Relationship Finder (2)
• For instance:
– Query: “american eagle”
– Clicked URL: www.ae.com
– ODP title: “american eagle outfitter”
– Output:
• “ae” has‐Meaning “american eagle”
• “american eagle” has‐Meaning “american eagle outfitter”
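The grouping by clicked URL can be sketched in Python as below. Several details are assumptions made for illustration and are not stated in the slides: the site label is taken from the host name, URLs are given without a scheme, names in a group are chained from shortest to longest, and two names count as equal when they match after spaces are removed.

```python
from collections import defaultdict


def host_label(url):
    """'www.ae.com' -> 'ae' (assumed way to obtain a label from the URL)."""
    host = url.lower().split("/")[0]
    parts = [p for p in host.split(".") if p != "www"]
    return parts[0]


def synonym_relations(records):
    """records: (text, url) pairs from navigational queries and ODP titles."""
    groups = defaultdict(set)
    for text, url in records:
        groups[url].add(text.lower())
        groups[url].add(host_label(url))
    relations = []
    for names in groups.values():
        # Assumed ordering: chain names from shortest to longest.
        ordered = sorted(names, key=lambda n: (len(n), n))
        for short, long_ in zip(ordered, ordered[1:]):
            if short.replace(" ", "") == long_.replace(" ", ""):
                relations.append((short, "is-Equal-To", long_))
            else:
                relations.append((short, "has-Meaning", long_))
    return relations
```

Feeding it the query “american eagle” and the ODP title “american eagle outfitter”, both tied to www.ae.com, reproduces the two has‐Meaning relationships shown above.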
Relationship Finder
• Hierarchical Relationships
– Has‐Subclass
– Is‐A
• Synonymous Relationships
– Is‐Equal‐To
– Has‐Meaning
• Other relationships
– Has‐Data‐About
– Has‐Website
Has‐Data‐About Relationship Finder
• Some terms in word sequences denote concepts present in a web site.
• Finds frequent matches between query terms and parts of clicked URLs.
• For instance:
– Query: “bowling for soup”
– Clicked URL: www.mtv.com/music/artist/bowlingforsoupartist.jhtml
– Output:
• “mtv” has‐Data‐About “music”
• “mtv” has‐Data‐About “artist”
• “mtv” has‐Data‐About “bowling for soup”
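A minimal Python sketch of the matching for a single query and clicked URL follows. It is an illustration under assumptions: the site label comes from the host name, directory‐like path segments are treated as topics, and the query matches when it occurs without spaces in the last path segment. The frequency counting over the whole search log that the slide mentions is left out, and the function name is hypothetical.

```python
def has_data_about(query, clicked_url):
    """Return (site, 'has-Data-About', topic) triples for one query/URL pair."""
    host, _, path = clicked_url.lower().partition("/")
    site = [p for p in host.split(".") if p != "www"][0]
    segments = [s for s in path.split("/") if s]
    # Directory-like path segments are taken as topics of the site (assumption).
    topics = segments[:-1]
    # The query counts as a match when it occurs, without spaces, in the last segment.
    squeezed = query.lower().replace(" ", "")
    if segments and squeezed in segments[-1]:
        topics.append(query.lower())
    return [(site, "has-Data-About", topic) for topic in topics]
```

For the query “bowling for soup” and the MTV URL above, this produces the three relationships listed in the output.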
Has‐Website Relationship Finder
• Uses word sequences from queries, URLs, and ODP titles
• For instance:
– Query: “online dictionary”
– Clicked URL: www.m-w.com
– ODP title: “merriam‐webster online”
– Output:
• “online dictionary” has‐Website www.m-w.com
• “merriam‐webster online” has‐Website www.m-w.com
Iterative Process
• The extracted relationships are used to improve the term extraction process.
• Constant interaction between the Term Extractor and the Relationship Finder.
Framework
Concept Cluster Finder
• Uses the feature matrix generated by Term Mapper.
• Uses the k‐means algorithm to cluster queries.
• Each cluster is automatically labeled based on the cluster representative.
– Features with highest scores
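The clustering and labeling steps can be sketched with a plain k‐means in Python. The feature names, queries, and matrix values below are made up for illustration, and seeding the centroids with the first k rows is a simplification rather than the setup used in HOMME.

```python
def kmeans(rows, k, iterations=20):
    """Plain k-means; the first k rows seed the centroids (a simplification)."""
    centroids = [list(r) for r in rows[:k]]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to the nearest centroid (squared Euclidean distance).
        for i, row in enumerate(rows):
            distances = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
            assignment[i] = distances.index(min(distances))
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def label_clusters(centroids, feature_names):
    """Label each cluster with the highest-scoring feature of its centroid."""
    return [feature_names[c.index(max(c))] for c in centroids]


# Hypothetical feature matrix: one row per query, one column per feature.
FEATURES = ["music", "travel"]
QUERIES = ["bowling for soup", "travel agent phone", "mtv artist", "air travel"]
MATRIX = [[5.0, 0.0], [0.0, 4.0], [3.0, 1.0], [1.0, 6.0]]
```

Calling kmeans(MATRIX, 2) and then label_clusters on the returned centroids groups the two music‐related queries under “music” and the two travel‐related queries under “travel”.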
EXPERIMENTAL EVALUATION
Setup
• Three data sources:
– Search log by MS Live Labs from US users in May 2006
• 1,512,556 navigational queries extracted
– Open Directory Project (ODP)
– Delicious tags crawled from February to May 2010
• Implementation:
– Prototype front end: PHP + JavaScript InfoVis Toolkit
Demonstration
Ontology Builder
Demonstration
Concept Linker (1)
Concept Linker (2)
Experimental Results – Concept Linker
• Our work was compared to other works:
– Single‐link Agglomerative Hierarchical Clustering (AHC)
– DBSCAN
• We want to evaluate the ability to discover query clusters.
• Ground truth: manually labeled 50 queries from each category.
HOMME and AHC
HOMME and DBSCAN
Experimental Results ‐ Relationship Finder
• 11 volunteers checked a sample of output relationships
• Each checked 100 tuples for each relationship type.
– Total 400 output relationships
– All checked the same set
Relationship Finder Evaluated by Human Expert
CONCLUSIONS AND FUTURE WORK
Conclusions
• The proposed approach uses heterogeneous sources to
– Effectively cluster queries related to a concept.
– Extract relationships between concepts automatically.
• The relationships recognized by HOMME are also recognized by humans most of the time.
Future Work
• Improve coverage for Relationship Finder
• Add more relationship types
• Improve execution times for offline part