YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET
FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM
Subbalakshmi Iyer
Motivation for an Ontology
Natural language communication
Automated text translation
Finding information on the Internet
Computer-processable collection of knowledge
What is an Ontology?
An ontology is the description of a domain, its classes and properties, and the relationships between those classes, by means of a formal language
A collection of knowledge about the world, a knowledge base
Example ontologies:
  large taxonomies categorizing Web sites (such as on Yahoo!)
  categorizations of products for sale and their features (such as on Amazon.com)
Uses of Ontologies
Machine Translation
Word Sense Disambiguation
Document Classification
Question Answering
Entity and fact-oriented Web Search
What is YAGO?
Yet Another Great Ontology
Part of the YAGO-NAGA project
Goal: to build a knowledge base that is
  large-scale
  domain-independent
  automatically constructed
  highly accurate
Uses Wikipedia and WordNet
More about YAGO
2 million entities
20 million facts
Facts represented as RDF triples
Accuracy of 95%
Examples:
  Elvis Presley isA singer
  singer subClassOf person
  Elvis Presley bornOnDate 1935-01-08
  Elvis Presley bornIn Tupelo
  Tupelo locatedIn Mississippi(state)
  Mississippi(state) locatedIn USA
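The example facts can be held as plain subject-relation-object triples. The following is a minimal sketch, not YAGO's actual storage format: the names Fact, objects and transitive_objects are hypothetical, and treating locatedIn as a relation that may be followed transitively is an assumption made for illustration.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One triple: subject, relation, object (hypothetical helper type)."""
    subject: str
    relation: str
    obj: str


# The example facts from the slide, written as triples.
FACTS = [
    Fact("Elvis Presley", "isA", "singer"),
    Fact("singer", "subClassOf", "person"),
    Fact("Elvis Presley", "bornOnDate", "1935-01-08"),
    Fact("Elvis Presley", "bornIn", "Tupelo"),
    Fact("Tupelo", "locatedIn", "Mississippi(state)"),
    Fact("Mississippi(state)", "locatedIn", "USA"),
]


def objects(subject, relation, facts=FACTS):
    """Return all objects o such that (subject, relation, o) is a stored fact."""
    return [f.obj for f in facts if f.subject == subject and f.relation == relation]


def transitive_objects(subject, relation, facts=FACTS):
    """Follow a relation repeatedly (assumes the relation is transitive)."""
    seen = []
    frontier = [subject]
    while frontier:
        current = frontier.pop()
        for nxt in objects(current, relation, facts):
            if nxt not in seen:
                seen.append(nxt)
                frontier.append(nxt)
    return seen


print(transitive_objects("Tupelo", "locatedIn"))
```

Under these assumptions, Tupelo is found to lie in both Mississippi(state) and USA, although only the first of those is stated directly.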
The YAGO model
Slight extension of RDFS
Represents knowledge as
  entities
  classes
  relations
  facts
Properties of relations, like transitivity
Simple and decidable model
Knowledge Representation in YAGO
All objects are entities, e.g. Elvis Presley, Grammy Award
Two entities can stand in a relationship, e.g. hasWonAward
  Elvis Presley hasWonAward Grammy Award
The triple of entity, relationship, entity is a fact
  e.g. "Elvis Presley hasWonAward Grammy Award" is a fact
Knowledge Representation in YAGO - 2
Numbers, dates and strings are also entities
  Elvis Presley BornInYear 1935
Words are entities
  “Elvis” means Elvis Presley
An entity is an instance of a class
  Elvis Presley Type Singer
Classes are also entities
  Singer Type class
Knowledge Representation in YAGO - 3
Classes are arranged in hierarchies
  Singer SubClassOf Person
Relations are also entities
  subClassOf Type atr (acyclic transitive relation)
Each fact has a fact identifier
  #1 FoundIn Wikipedia
Key Contributions of YAGO
Information extraction from Wikipedia
  Infoboxes
  Category pages
Combination with the WordNet taxonomy
Quality control
  Canonicalization
  Type checking
Information Extraction - 1
Entities from Wikipedia
  Each page title is a candidate entity
  Wiki Markup Language
  Wikipedia dump as of September 2008
Information Extraction Techniques
Infobox Harvesting
  Wikipedia infoboxes
Word-Level Techniques
  Wikipedia redirects
Category Harvesting
  Wikipedia categories
Type Extraction
  Wikipedia categories, WordNet classes
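[Figure: infobox harvesting for the attribute "Born"]
Infobox entry: Born: January 8, 1935
Infobox Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect)
  Born -> bornOnDate
Relation Map (columns: Relation, Domain, Range)
  bornOnDate: domain person, range yagoDate
Resulting fact: Elvis Presley bornOnDate January 8, 1935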
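[Figure: infobox harvesting for the attribute "Died"]
Infobox entry: Died: August 16, 1977
Infobox Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect)
  Died -> diedOnDate
Relation Map (columns: Relation, Domain, Range)
  diedOnDate: domain person, range yagoDate
Resulting fact: Elvis Presley diedOnDate August 16, 1977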
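[Figure: infobox harvesting for the attribute "Genre"]
Infobox entry: Genre: Rock and Roll
Infobox Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect)
  Genre -> isOfGenre
Relation Map (columns: Relation, Domain, Range)
  isOfGenre: domain entity, range yagoClass
Resulting fact: Elvis Presley isOfGenre Rock and Roll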
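[Figure: infobox harvesting for the attribute "Birth Name"]
Infobox entry: Birth Name: Elvis Aaron Presley
Infobox Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect)
  birth name -> means
Relation Map (columns: Relation, Domain, Range)
  means: domain yagoWord, range entity
Resulting fact: “Elvis Aaron Presley” means Elvis Presley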
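The four infobox examples follow one pattern: look the attribute up in the attribute map, then emit a fact with the article entity as one argument. The sketch below illustrates that lookup under stated assumptions: the function name harvest is hypothetical, attribute names are matched case-insensitively, and "birth name" is assumed to be an inverse attribute, so the infobox value becomes the first argument of the fact.

```python
# Attribute map from the slides: infobox attribute -> relation.
ATTRIBUTE_MAP = {
    "born": "bornOnDate",
    "died": "diedOnDate",
    "genre": "isOfGenre",
    "birth name": "means",
}

# Assumption: attributes whose value is the FIRST argument of the fact.
INVERSE_ATTRIBUTES = {"birth name"}


def harvest(entity, infobox):
    """Turn the attribute/value pairs of one infobox into facts about entity."""
    facts = []
    for attribute, value in infobox.items():
        key = attribute.lower()
        relation = ATTRIBUTE_MAP.get(key)
        if relation is None:
            continue  # attribute not covered by the map: no fact
        if key in INVERSE_ATTRIBUTES:
            facts.append((value, relation, entity))
        else:
            facts.append((entity, relation, value))
    return facts


elvis_infobox = {
    "Born": "January 8, 1935",
    "Died": "August 16, 1977",
    "Genre": "Rock and Roll",
    "Birth Name": "Elvis Aaron Presley",
}
for fact in harvest("Elvis Presley", elvis_infobox):
    print(fact)
```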
Manifold Attributes
Some attributes may have multiple values e.g. a person may have multiple children
Multiple facts are generated e.g. one hasChild fact for each child
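A manifold attribute can be handled by splitting its value and emitting one fact per part. This is only a sketch: the function name harvest_manifold is hypothetical, the assumption that values are separated by commas, semicolons or line breaks is not taken from the slides, and the entity and child names are placeholders.

```python
import re


def harvest_manifold(entity, relation, raw_value):
    """Emit one fact per value of a multi-valued (manifold) attribute.

    Assumes the values are separated by commas, semicolons or newlines.
    """
    values = [part.strip() for part in re.split(r"[,;\n]", raw_value)]
    return [(entity, relation, value) for value in values if value]


# Placeholder names, for illustration only.
print(harvest_manifold("Some Person", "hasChild", "Child A; Child B"))
```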
Indirect Attributes - 1
Some attributes do not concern the article entity, but another fact
  e.g. the attribute "gdp year" does not concern the article entity (Republic of Singapore), but the fact that states its GDP
Therefore, the facts generated are:
  #14: Singapore hasGDP 238.755 billion
  #14 during 2008
  i.e. Singapore hasGDP 238.755 billion during 2008
Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect)
  gdp ppp -> hasGDP
  gdp year -> during
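An indirect attribute needs a way to make a fact the argument of another fact, which is what fact identifiers provide. The sketch below is a hypothetical illustration: the class FactStore and its add method are not part of YAGO, and the starting identifier 14 is chosen only to reproduce the identifier shown on the slide.

```python
from itertools import count


class FactStore:
    """Stores facts under identifiers so that facts can refer to other facts."""

    def __init__(self, first_id=1):
        self._ids = count(first_id)
        self.facts = {}  # fact identifier -> (subject, relation, object)

    def add(self, subject, relation, obj):
        """Store one fact and return its identifier, e.g. '#14'."""
        fact_id = f"#{next(self._ids)}"
        self.facts[fact_id] = (subject, relation, obj)
        return fact_id


store = FactStore(first_id=14)
# Direct attribute "gdp ppp": a fact about the article entity.
gdp_fact = store.add("Singapore", "hasGDP", "238.755 billion")
# Indirect attribute "gdp year": a fact about the previous fact.
year_fact = store.add(gdp_fact, "during", "2008")

print(gdp_fact, store.facts[gdp_fact])
print(year_fact, store.facts[year_fact])
```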
Type of Infobox
Song Infobox: American Pie
  Released: October, 1971
  Format: vinyl record
  Genre: Folk Rock
  Length: 8:33 mins
  Label: United Artists
  Writer: Don McLean
Car Infobox: Tesla Roadster
  Manufacturer: Tesla Motors
  Production: 2008-present
  Class: Roadster
  Length: 3,946 mm
  Width: 1,873 mm
  Height: 1,127 mm
Type of Infobox: Attribute Map
Attribute Map (columns: Attribute, Relation, Inverse, Manifold, Indirect)
  car #length -> hasLength
  song #length -> hasDuration
Resulting facts:
  Song Infobox: American Pie hasDuration 8:33
  Car Infobox: Tesla Roadster hasLength 3946
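Because the same attribute name means different things in different infoboxes, the lookup can be keyed on the infobox type as well as the attribute. A minimal sketch, assuming a dictionary keyed by (infobox type, attribute) pairs; the names TYPED_ATTRIBUTE_MAP and relation_for are hypothetical.

```python
# (infobox type, attribute) -> relation, following the two rows on the slide.
TYPED_ATTRIBUTE_MAP = {
    ("car", "length"): "hasLength",
    ("song", "length"): "hasDuration",
}


def relation_for(infobox_type, attribute):
    """Pick the relation for an attribute, depending on the infobox type."""
    return TYPED_ATTRIBUTE_MAP.get((infobox_type.lower(), attribute.lower()))


print(("American Pie", relation_for("Song", "Length"), "8:33"))
print(("Tesla Roadster", relation_for("Car", "Length"), "3946"))
```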
Information Extraction - Word-Level Techniques
Wikipedia Redirects
  a virtual redirect page for “Presley, Elvis” links to “Elvis Presley”
  each redirect gives a ‘means’ fact, e.g. “Presley, Elvis” means Elvis Presley
Parsing Person Names
  extract the name components
  establish the relations givenNameOf and familyNameOf
  e.g. Presley familyNameOf Elvis Presley
       Elvis givenNameOf Elvis Presley
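Both word-level techniques reduce to small string operations. The sketch below is illustrative only: the function names are hypothetical, and the rule "first token is the given name, last token is the family name" is a simplifying assumption that will not hold for every name.

```python
def redirect_fact(redirect_title, target_title):
    """A redirect page yields a 'means' fact from its title to the target entity."""
    return (redirect_title, "means", target_title)


def name_facts(person):
    """Split a person name into components (assumes 'Given ... Family' order)."""
    parts = person.split()
    if len(parts) < 2:
        return []  # a single token cannot be split into given and family name
    return [
        (parts[-1], "familyNameOf", person),
        (parts[0], "givenNameOf", person),
    ]


print(redirect_fact("Presley, Elvis", "Elvis Presley"))
print(name_facts("Elvis Presley"))
```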
Wikipedia Categories
Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents
Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands
Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers
Facts created from Wikipedia Categories
Rhine locatedIn Germany
Bryan Adams bornOnDate 1959
Bryan Adams hasWonAward Grammy Award
Abraham Lincoln politicianOf United States
Information Extraction - Category Harvesting
Relational Categories
Regular Expression                  Relation
([0-9]{3,4}) births                 bornOnDate
([0-9]{3,4}) deaths                 diedOnDate
([0-9]{3,4}) establishments         establishedOnDate
([0-9]{3,4}) books|novels           writtenOnDate
Mountains|Rivers in (.*)            locatedIn
Presidents|Governors of (.*)        politicianOf
(.*) winners                        hasWonPrize
[A-Za-z]+ (.*) winners              hasWonPrize
Table: Some Category Heuristics
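The category heuristics can be tried out directly with regular expressions. This sketch covers a subset of the table and makes several assumptions that are not in the slides: alternations are grouped with (?:...), matching ignores case, "in" and "of" are both accepted after Mountains|Rivers so that the category "Rivers of Germany" matches, and the name category_fact is hypothetical.

```python
import re

# (pattern, relation) pairs adapted from the table of category heuristics.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births", re.IGNORECASE), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths", re.IGNORECASE), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments", re.IGNORECASE), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)", re.IGNORECASE), "writtenOnDate"),
    (re.compile(r"(?:Mountains|Rivers) (?:in|of) (.*)", re.IGNORECASE), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (.*)", re.IGNORECASE), "politicianOf"),
    (re.compile(r"(.*) winners", re.IGNORECASE), "hasWonPrize"),
]


def category_fact(entity, category):
    """Return the fact implied by a relational category, or None."""
    for pattern, relation in CATEGORY_HEURISTICS:
        match = pattern.fullmatch(category)
        if match:
            return (entity, relation, match.group(1))
    return None  # not a relational category under these heuristics


print(category_fact("Bryan Adams", "1959 births"))
print(category_fact("Rhine", "Rivers of Germany"))
print(category_fact("Bryan Adams", "Living people"))
```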
2. Connecting Wikipedia and WordNet – What is WordNet?
Lexical database for the English language
Created at the Cognitive Science Laboratory of Princeton University
Groups English words into sets of synonyms called synsets
Provides short, general definitions
Provides hypernym/hyponym relations
  e.g. canine is the hypernym, dog is the hyponym
Connecting Wikipedia and WordNet – Type Extraction
Goal: create a class hierarchy
  e.g. singer subClassOf performer
       performer subClassOf artist
The hyponymy relation is taken from WordNet
The Wikipedia class ‘American people in Japan’ is a subclass of the WordNet class ‘person’
Classifications of Categories
Conceptual Categories
  e.g. Albert Einstein is in ‘Naturalized citizens of the United States’
Administrative Categories
  e.g. Albert Einstein is in ‘Articles with unsourced statements’
Relational Information
  e.g. 1879 births
Thematic Vicinity
  e.g. Physics
Identification of Conceptual Categories
Only conceptual categories are used
Shallow linguistic parsing of category names
  e.g. category ‘American people in Japan’
Break the category name into
  pre-modifier - ‘American’
  head - ‘people’
  post-modifier - ‘in Japan’
If the head is plural, then the category is a conceptual category
Extract the class from the Wikipedia category
Connect it to a class from WordNet
  e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’
Algorithm
Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
1  head = headCompound(c)
2  pre = preModifier(c)
3  post = postModifier(c)
4  head = stem(head)
5  If there is a WordNet synset s for pre + head
6      return s
7  If there are WordNet synsets s1, … , sn for head
8      (ordered by their frequency for head)
9      return s1
10 fail
Explanation of Algorithm
Input: American people in Japan
1. Pre-modifier: American
2. Head: people
3. Post-modifier: in Japan
4. Stem(head): person
5. If there is a WordNet synset for ‘American person’
6.     return that synset
7. If there are synsets s1, …, sn for ‘person’
8.     (ordered by frequency for ‘person’)
9.     return s1
10. Fail
Output: person
Result: American People in Japan subClassOf person
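The algorithm can be exercised with toy data in place of WordNet and a real linguistic parser. Everything below is a sketch under loud assumptions: TOY_SYNSETS and TOY_STEMS are hypothetical stand-ins (the labels such as "person#1" are invented placeholders, not WordNet identifiers), the post-modifier is assumed to start at the first preposition, the head is assumed to be the last word before it, and a head counts as plural only if the toy stem table knows it.

```python
# Hypothetical stand-in for WordNet: phrase -> synsets, most frequent first.
TOY_SYNSETS = {
    "person": ["person#1", "person#2"],
    "singer": ["singer#1"],
}
# Hypothetical stand-in for a stemmer: plural form -> singular form.
TOY_STEMS = {"people": "person", "singers": "singer"}
PREPOSITIONS = {"in", "of", "from", "by", "at", "for", "with"}


def split_category(c):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = c.split()
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS),
               len(words))
    before, post = words[:cut], words[cut:]
    if not before:
        return "", "", " ".join(post)
    return " ".join(before[:-1]), before[-1], " ".join(post)


def stem(word):
    """Return the singular form if the toy table knows it."""
    return TOY_STEMS.get(word.lower(), word.lower())


def is_conceptual(c):
    """A category is treated as conceptual if its head is a known plural."""
    _, head, _ = split_category(c)
    return head.lower() in TOY_STEMS


def wiki2wordnet(c):
    """Map a Wikipedia category name to a (toy) WordNet synset, or None."""
    pre, head, _post = split_category(c)
    head = stem(head)
    phrase = f"{pre} {head}".strip().lower()
    if pre and phrase in TOY_SYNSETS:      # steps 5-6: synset for pre + head
        return TOY_SYNSETS[phrase][0]
    if head in TOY_SYNSETS:                # steps 7-9: most frequent synset
        return TOY_SYNSETS[head][0]
    return None                            # step 10: fail


print(split_category("American people in Japan"))
print(wiki2wordnet("American people in Japan"))
```

Returning the first list element mirrors "return s1" only because the toy lists are written in frequency order by hand.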
Exceptions
Complete hierarchy of classes
  Upper classes from WordNet
  Leaves from Wikipedia
2 dozen cases failed
  Categories with head compound “capital”
  In Wikipedia, it means “capital city”
  In WordNet, it means “financial asset”
These cases were corrected manually
3. Quality Control
Canonicalization
  Each fact and each entity reference is unique
  An entity is always referred to by the same identifier in all facts in YAGO
Type Checking
  Eliminates individuals that do not have a class
  Eliminates facts that do not respect domain and range constraints
  An argument of a fact in YAGO is always an instance of the class required by the relation
Canonicalization - 1
Redirect Resolution
  Infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments
  These links may not be correct Wikipedia page identifiers
  Check whether each argument is a correct Wikipedia identifier
  Replace it by the correct, redirected identifier
  e.g. Hermitage Museum locatedIn St. Petersburg
       becomes Hermitage Museum locatedIn Saint Petersburg
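Redirect resolution amounts to following the redirect table until a real page identifier is reached. The sketch below uses toy tables: REDIRECTS and PAGES are hypothetical stand-ins for the Wikipedia dump, and dropping a fact whose argument cannot be resolved is an assumption rather than something the slides state.

```python
# Hypothetical toy data standing in for the Wikipedia dump.
REDIRECTS = {"St. Petersburg": "Saint Petersburg"}
PAGES = {"Hermitage Museum", "Saint Petersburg"}


def resolve(title):
    """Follow redirects to a real page title; None if no such page exists."""
    seen = set()
    while title in REDIRECTS and title not in seen:
        seen.add(title)  # guards against redirect cycles
        title = REDIRECTS[title]
    return title if title in PAGES else None


def canonicalize(fact):
    """Replace both arguments of a fact by their canonical identifiers."""
    subject, relation, obj = fact
    subject, obj = resolve(subject), resolve(obj)
    if subject is None or obj is None:
        return None  # assumption: unresolved arguments discard the fact
    return (subject, relation, obj)


print(canonicalize(("Hermitage Museum", "locatedIn", "St. Petersburg")))
```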
Canonicalization - 2
Removal of Duplicate Facts
  Sometimes, two heuristics deliver the same fact
  Canonicalization eliminates one of them
  e.g. the category ‘1935 births’ yields the fact:
       Elvis Presley bornOnDate 1935
       the infobox attribute ‘Born: January 8, 1935’ yields the fact:
       Elvis Presley bornOnDate January 8, 1935
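The slide does not say which of the two facts survives. The sketch below assumes the more precise one is kept, and it assumes dates have already been normalized to ISO form so that "1935" is a string prefix of "1935-01-08". The prefix test is a crude stand-in for real date comparison and would be wrong for arbitrary strings; the function name remove_duplicates is hypothetical.

```python
def remove_duplicates(facts):
    """Drop exact duplicates and less precise variants of the same fact.

    Two facts with the same subject and relation are treated as the same
    statement if one value is a string prefix of the other (toy heuristic).
    """
    kept = []
    for subject, relation, value in facts:
        for i, (s, r, v) in enumerate(kept):
            if (s, r) != (subject, relation):
                continue
            if v.startswith(value):
                break  # identical to, or less precise than, a kept fact
            if value.startswith(v):
                kept[i] = (subject, relation, value)  # more precise: replace
                break
        else:
            kept.append((subject, relation, value))
    return kept


facts = [
    ("Elvis Presley", "bornOnDate", "1935"),        # from category '1935 births'
    ("Elvis Presley", "bornOnDate", "1935-01-08"),  # from the infobox
    ("Elvis Presley", "bornIn", "Tupelo"),
    ("Elvis Presley", "bornIn", "Tupelo"),          # exact duplicate
]
print(remove_duplicates(facts))
```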
Type Checking - 1
Reductive Type Checking
  Sometimes the class of an entity cannot be determined
  Such facts are discarded
  e.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet
Inductive Type Checking
  Type constraints can be used to generate facts
  e.g. Elvis Presley bornOnDate January 8, 1935
       So, Elvis Presley is a person
  A regular expression check ensures that the entity name follows the pattern of given name and family name
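Both kinds of type checking can be sketched over a toy knowledge base. All tables below are hypothetical: the types, the class hierarchy, the (person, location) signature for bornIn and the name pattern are assumptions chosen for illustration, not values taken from YAGO.

```python
import re

# Hypothetical toy knowledge.
TYPES = {"Elvis Presley": {"singer"}, "Tupelo": {"city"}}
SUBCLASS_OF = {"singer": "person", "city": "location"}
SIGNATURES = {"bornIn": ("person", "location")}   # relation -> (domain, range)
DOMAIN_OF = {"bornOnDate": "person"}
NAME_PATTERN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def is_instance(entity, cls):
    """True if one of the entity's types is cls or a subclass of cls."""
    for current in TYPES.get(entity, ()):
        while current is not None:
            if current == cls:
                return True
            current = SUBCLASS_OF.get(current)
    return False


def reductive_check(facts):
    """Keep only facts whose arguments satisfy domain and range."""
    kept = []
    for subject, relation, obj in facts:
        signature = SIGNATURES.get(relation)
        if signature is None:
            continue
        domain, range_ = signature
        if is_instance(subject, domain) and is_instance(obj, range_):
            kept.append((subject, relation, obj))
    return kept


def inductive_type(fact):
    """Derive a type fact from the domain of the relation, if plausible."""
    subject, relation, _ = fact
    cls = DOMAIN_OF.get(relation)
    if cls is None:
        return None
    if cls == "person" and not NAME_PATTERN.fullmatch(subject):
        return None  # name does not look like 'Given Family'
    return (subject, "type", cls)


print(reductive_check([("Elvis Presley", "bornIn", "Tupelo"),
                       ("Proposed Article", "bornIn", "Tupelo")]))
print(inductive_type(("Elvis Presley", "bornOnDate", "January 8, 1935")))
```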
Type Checking - 2
Type Coherence Checking
  Sometimes, classification yields wrong results
  e.g. Abraham Lincoln is an instance of 13 classes
       12 are subclasses of the class ‘person’, e.g. lawyer, president
       the 13th class is the class ‘cabinet’
  The class hierarchy of YAGO is partitioned into branches
  e.g. locations, artifacts, people, other physical entities, and abstract entities
  The branch that most types lead to is determined
  Other types are purged
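Type coherence checking is a majority vote over branches. In the sketch below the table BRANCH_OF is hypothetical; in particular, placing ‘cabinet’ in the artifacts branch is an assumption, since the slide only says that it falls outside the person branch. Only three of Lincoln's 13 classes are shown.

```python
from collections import Counter

# Hypothetical mapping from class to top-level branch.
BRANCH_OF = {
    "lawyer": "people",
    "president": "people",
    "cabinet": "artifacts",
}


def purge_incoherent(types):
    """Keep only the types in the branch that most of the types lead to."""
    counts = Counter(BRANCH_OF[t] for t in types if t in BRANCH_OF)
    if not counts:
        return list(types)  # nothing known about any type: leave unchanged
    majority_branch, _ = counts.most_common(1)[0]
    return [t for t in types if BRANCH_OF.get(t) == majority_branch]


print(purge_incoherent(["lawyer", "president", "cabinet"]))
```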
References
YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Max-Planck-Institute for Computer Science, Saarbruecken, Germany.
Automated Construction and Growth of a Large Ontology. Fabian M. Suchanek. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University.
Wikipedia: http://en.wikipedia.org/wiki/Main_Page
WordNet: http://wordnet.princeton.edu/