YAGO: A LARGE ONTOLOGY FROM WIKIPEDIA AND WORDNET
FABIAN M. SUCHANEK, GJERGJI KASNECI, GERHARD WEIKUM

Subbalakshmi Iyer

Motivation for an Ontology

Natural language communication
Automated text translation
Finding information on the internet
Computer-processable collection of knowledge

What is an Ontology?

An ontology is the description of a domain, its classes and properties, and the relationships between those classes, by means of a formal language
A collection of knowledge about the world, a knowledge base
Example ontologies:
  large taxonomies categorizing Web sites (such as on Yahoo!)
  categorizations of products for sale and their features (such as on Amazon.com)

Uses of Ontologies

Machine Translation
Word Sense Disambiguation
Document Classification
Question Answering
Entity and fact-oriented Web Search

What is YAGO?

Yet Another Great Ontology
Part of the YAGO-NAGA project
Goal: to build a knowledge base with
  Large Scale
  Domain Independence
  Automatic Construction
  High Accuracy
Uses Wikipedia and WordNet

More about YAGO

2 million entities
20 million facts
Facts represented as RDF triples
Accuracy of 95%
Examples:
  Elvis Presley isA singer
  singer subClassOf person
  Elvis Presley bornOnDate 1935-01-08
  Elvis Presley bornIn Tupelo
  Tupelo locatedIn Mississippi(state)
  Mississippi(state) locatedIn USA

The YAGO model

Slight extension of RDFS
Represents knowledge as
  Entities
  Classes
  Relations
  Facts
Properties of relations, like transitivity
Simple and decidable model
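A minimal sketch of how such facts could be held as subject-relation-object triples, using the example facts from the "More about YAGO" slide. The names `Fact`, `TRANSITIVE`, `objects` and `reachable` are illustrative assumptions, not part of YAGO; marking locatedIn and subClassOf as transitive is also an assumption made for the example.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


# Example facts copied from the slides.
FACTS = [
    Fact("Elvis Presley", "isA", "singer"),
    Fact("singer", "subClassOf", "person"),
    Fact("Elvis Presley", "bornOnDate", "1935-01-08"),
    Fact("Elvis Presley", "bornIn", "Tupelo"),
    Fact("Tupelo", "locatedIn", "Mississippi(state)"),
    Fact("Mississippi(state)", "locatedIn", "USA"),
]

# Assumption for this sketch: these relations are treated as transitive.
TRANSITIVE = {"locatedIn", "subClassOf"}


def objects(subject, relation, facts=FACTS):
    """Direct objects of (subject, relation, ?)."""
    return [f.obj for f in facts if f.subject == subject and f.relation == relation]


def reachable(start, relation, facts=FACTS):
    """Objects reachable from start; follows chains only for transitive relations."""
    if relation not in TRANSITIVE:
        return objects(start, relation, facts)
    seen, frontier = [], [start]
    while frontier:
        current = frontier.pop(0)
        for nxt in objects(current, relation, facts):
            if nxt not in seen:
                seen.append(nxt)
                frontier.append(nxt)
    return seen
```

With these facts, asking where Tupelo is located follows the locatedIn chain through Mississippi(state) up to USA.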

Knowledge Representation in YAGO

All objects are entities
  e.g. Elvis Presley, Grammy Award
Two entities can stand in a relationship
  e.g. hasWonAward: Elvis Presley hasWonAward Grammy Award
The triple of entity, relationship, entity is a fact
  e.g. Elvis Presley hasWonAward Grammy Award is a fact

Knowledge Representation in YAGO - 2

Numbers, dates and strings are also entities
  Elvis Presley bornInYear 1935
Words are entities
  “Elvis” means Elvis Presley
An entity is an instance of a class
  Elvis Presley type singer
Classes are also entities
  singer type class

Knowledge Representation in YAGO - 3

Classes have hierarchies
  singer subClassOf person
Relations are also entities
  subClassOf type atr (acyclic transitive relation)
Each fact has a fact identifier
  #1 foundIn Wikipedia
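A small sketch of the fact-identifier idea: because every fact has an identifier, another fact can use that identifier as its subject. The store layout and the helper names `build_store` and `facts_about` are assumptions for illustration; which fact carries the identifier #1 is also assumed here.

```python
def build_store(rows):
    """Assign identifiers #1, #2, ... to (subject, relation, object) rows."""
    return {f"#{i}": row for i, row in enumerate(rows, start=1)}


STORE = build_store([
    ("Elvis Presley", "bornInYear", "1935"),   # assumed to be fact #1
    ("#1", "foundIn", "Wikipedia"),            # a fact about fact #1
])


def facts_about(fact_id, store):
    """All facts whose subject is the given fact identifier."""
    return [(fid, fact) for fid, fact in store.items() if fact[0] == fact_id]
```

Looking up `facts_about("#1", STORE)` returns the foundIn fact, which records where fact #1 was extracted from.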

Key Contributions of YAGO

Information Extraction from Wikipedia
  Infoboxes
  Category Pages
Combination with WordNet
  Taxonomy
Quality Control
  Canonicalization
  Type Checking

Information Extraction -1

Entities from Wikipedia
Each page title is a candidate entity
Wiki Markup Language
Wikipedia dump as of September, 2008

Information Extraction - WML

Information Extraction Techniques

Infobox Harvesting: Wikipedia Infoboxes
Word-Level Techniques: Wikipedia Redirects
Category Harvesting: Wikipedia Categories
Type Extraction: Wikipedia Categories, WordNet Classes

1. Information Extraction from Wikipedia – Infobox Harvesting

Wikipedia Infobox (Elvis Presley)
Born: January 8, 1935

Infobox Attribute Map
Attribute    Relation      Inverse   Manifold   Indirect
…
Born         bornOnDate

Relation Map
Relation     Domain   Range
…
bornOnDate   person   yagoDate
…

Resulting fact: Elvis Presley bornOnDate January 8, 1935

Wikipedia Infobox (Elvis Presley)
Died: August 16, 1977

Infobox Attribute Map
Attribute    Relation      Inverse   Manifold   Indirect
…
Died         diedOnDate

Relation Map
Relation     Domain   Range
…
diedOnDate   person   yagoDate
…

Resulting fact: Elvis Presley diedOnDate August 16, 1977

Wikipedia Infobox (Elvis Presley)
Genre: Rock and Roll

Infobox Attribute Map
Attribute    Relation      Inverse   Manifold   Indirect
…
Genre        isOfGenre

Relation Map
Relation     Domain   Range
…
isOfGenre    entity   yagoClass

Resulting fact: Elvis Presley isOfGenre Rock and Roll

Wikipedia Infobox (Elvis Presley)
Birth Name: Elvis Aaron Presley

Infobox Attribute Map
Attribute    Relation      Inverse   Manifold   Indirect
…
birth name   means

Relation Map
Relation     Domain     Range
…
means        yagoWord   entity

Resulting fact: “Elvis Aaron Presley” means Elvis Presley

Manifold Attributes

Some attributes may have multiple values
  e.g. a person may have multiple children
Multiple facts are generated
  e.g. one hasChild fact for each child
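The attribute-map lookups on the preceding slides can be sketched roughly as follows. This is only an illustration under assumptions: the `Rule` structure, the `children` attribute name, marking `birth name` as an inverse attribute, and the semicolon separator for multi-valued attributes are all hypothetical choices, not taken from YAGO.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    relation: str
    inverse: bool = False    # value becomes the first argument of the fact
    manifold: bool = False   # attribute may hold several values


# Hypothetical attribute map modelled on the slide examples.
ATTRIBUTE_MAP = {
    "born": Rule("bornOnDate"),
    "died": Rule("diedOnDate"),
    "genre": Rule("isOfGenre"),
    "birth name": Rule("means", inverse=True),
    "children": Rule("hasChild", manifold=True),  # assumed attribute name
}


def harvest(entity, infobox):
    """Turn infobox attribute/value pairs into (subject, relation, object) facts."""
    facts = []
    for attribute, raw in infobox.items():
        rule = ATTRIBUTE_MAP.get(attribute.lower())
        if rule is None:
            continue  # attribute not covered by the map
        # Assumption: multiple values are separated by semicolons.
        values = raw.split(";") if rule.manifold else [raw]
        for value in (v.strip() for v in values):
            if not value:
                continue
            if rule.inverse:
                facts.append((value, rule.relation, entity))
            else:
                facts.append((entity, rule.relation, value))
    return facts
```

For example, `harvest("Person X", {"Children": "Child A; Child B"})` yields one hasChild fact per child, which is the manifold behaviour described above.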

Indirect Attributes - 1

Some attributes do not concern the article entity, but another fact
  e.g. the attribute "gdp year" does not concern the article entity (Republic of Singapore), but the hasGDP fact
Therefore, the facts generated are:
  #14: Singapore hasGDP 238.755 billion
  #14 during 2008
  read together: Singapore hasGDP 238.755 billion during 2008

Attribute Map
Attribute   Relation   Inverse   Manifold   Indirect
…
gdp ppp     hasGDP
gdp year    during
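One way to picture indirect attributes is a two-pass harvest: first create the direct facts and remember their identifiers, then attach the indirect attributes to those identifiers. The map layout below (an `indirect` entry naming the attribute whose fact is qualified) and the function name are assumptions for this sketch only.

```python
# Hypothetical map: "indirect" names the attribute whose fact is being qualified.
ATTRIBUTE_MAP = {
    "gdp ppp": {"relation": "hasGDP", "indirect": None},
    "gdp year": {"relation": "during", "indirect": "gdp ppp"},
}


def harvest_with_indirect(entity, infobox, first_id=1):
    """Return {fact id: (subject, relation, object)} including facts about facts."""
    facts = {}
    id_of_attribute = {}
    next_id = first_id

    # Pass 1: direct attributes produce facts about the article entity.
    for attribute, value in infobox.items():
        rule = ATTRIBUTE_MAP.get(attribute)
        if rule and rule["indirect"] is None:
            fact_id = f"#{next_id}"
            next_id += 1
            facts[fact_id] = (entity, rule["relation"], value)
            id_of_attribute[attribute] = fact_id

    # Pass 2: indirect attributes produce facts about the facts from pass 1.
    for attribute, value in infobox.items():
        rule = ATTRIBUTE_MAP.get(attribute)
        if rule and rule["indirect"] is not None:
            target = id_of_attribute.get(rule["indirect"])
            if target is None:
                continue  # nothing to qualify
            fact_id = f"#{next_id}"
            next_id += 1
            facts[fact_id] = (target, rule["relation"], value)
    return facts
```

Starting the numbering at 14 reproduces the slide example: fact #14 states the GDP, and a second fact says that #14 holds during 2008.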

Indirect Attributes - 2

Singapore Infobox

Type of Infobox

Song Infobox: American Pie
  Released: October, 1971
  Format: vinyl record
  Genre: Folk Rock
  Length: 8:33 mins
  Label: United Artists
  Writer: Don McLean

Car Infobox: Tesla Roadster
  Manufacturer: Tesla Motors
  Production: 2008-present
  Class: Roadster
  Length: 3,946 mm
  Width: 1,873 mm
  Height: 1,127 mm

Type of Infobox: Attribute Map

Attribute Map
Attribute      Relation      Inverse   Manifold   Indirect
…
car #length    hasLength
…
song #length   hasDuration

Song Infobox: American Pie hasDuration 8:33
Car Infobox: Tesla Roadster hasLength 3946
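The same attribute name can therefore map to different relations depending on the infobox type. A rough sketch, assuming a lookup keyed by (infobox type, attribute) and a simple value clean-up that drops the unit and the thousands separator; both the key layout and the clean-up rule are illustrative assumptions.

```python
import re

# Hypothetical typed attribute map following the slide: car#length vs. song#length.
TYPED_ATTRIBUTE_MAP = {
    ("car", "length"): "hasLength",
    ("song", "length"): "hasDuration",
}


def clean_value(raw):
    """Keep the leading numeric part, e.g. '3,946 mm' -> '3946', '8:33 mins' -> '8:33'."""
    match = re.match(r"[\d:,.]+", raw.strip())
    return match.group(0).replace(",", "") if match else raw.strip()


def harvest_typed(entity, infobox_type, infobox):
    """Map attributes to relations using the infobox type as part of the key."""
    facts = []
    for attribute, raw in infobox.items():
        relation = TYPED_ATTRIBUTE_MAP.get((infobox_type.lower(), attribute.lower()))
        if relation is not None:
            facts.append((entity, relation, clean_value(raw)))
    return facts
```

Attributes without an entry for the given infobox type, such as Width or Label here, are simply skipped.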

Information Extraction - Word Level Techniques

Wikipedia Redirects
  a virtual redirect page for “Presley, Elvis” links to “Elvis Presley”
  each redirect gives a ‘means’ fact
  e.g. “Presley, Elvis” means Elvis Presley
Parsing Person Names
  extract the name components
  establish the relations givenNameOf and familyNameOf
  e.g. Presley familyNameOf Elvis Presley
       Elvis givenNameOf Elvis Presley
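Both word-level techniques are easy to sketch. The name parser below is deliberately naive and is an assumption of this example: it treats the first token as the given name and the last token as the family name, which will not hold for every name.

```python
def redirect_fact(redirect_title, target_title):
    """A redirect page yields a 'means' fact from the redirect title to the target entity."""
    return (redirect_title, "means", target_title)


def name_facts(full_name):
    """Naive name parsing: first token is the given name, last token the family name."""
    parts = full_name.split()
    if len(parts) < 2:
        return []  # a single token cannot be split into given and family name
    return [
        (parts[-1], "familyNameOf", full_name),
        (parts[0], "givenNameOf", full_name),
    ]
```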

Wikipedia Categories

Categories: Presidents of the United States | Lists of office-holders | Lists of Presidents

Categories: Rift Valleys | North Sea | Rivers of Germany | Articles needing translation from German Wikipedia | Rivers of Netherlands

Categories: Canadian Singers | Canadian male singers | 1959 births | English-language singers | Living people | Grammy Award Winners | Portrait photographers

Facts created from Wikipedia Categories

Rhine locatedIn Germany
Bryan Adams bornOnDate 1959
Bryan Adams hasWonAward Grammy Award
Abraham Lincoln politicianOf United States

Information Extraction - Category Harvesting

Relational Categories

Regular Expression                Relation
([0-9]{3,4}) births               bornOnDate
([0-9]{3,4}) deaths               diedOnDate
([0-9]{3,4}) establishments       establishedOnDate
([0-9]{3,4}) books|novels         writtenOnDate
Mountains|Rivers in (.*)          locatedIn
Presidents|Governors of (.*)      politicianOf
(.*) winners                      hasWonPrize
[A-Za-z]+ (.*) winners            hasWonPrize

Table: Some Category Heuristics
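A hedged sketch of how such heuristics could be applied with Python's `re` module. It deviates from the table in a few labelled ways: alternations are grouped explicitly, "of" is accepted next to "in" so that "Rivers of Germany" matches, the winners pattern ignores case, and the last row of the table is omitted because only the first matching pattern is used. The function name is hypothetical.

```python
import re

# Patterns adapted from the table; see the notes above for the deviations.
CATEGORY_HEURISTICS = [
    (re.compile(r"([0-9]{3,4}) births"), "bornOnDate"),
    (re.compile(r"([0-9]{3,4}) deaths"), "diedOnDate"),
    (re.compile(r"([0-9]{3,4}) establishments"), "establishedOnDate"),
    (re.compile(r"([0-9]{3,4}) (?:books|novels)"), "writtenOnDate"),
    (re.compile(r"(?:Mountains|Rivers) (?:in|of) (.*)"), "locatedIn"),
    (re.compile(r"(?:Presidents|Governors) of (.*)"), "politicianOf"),
    (re.compile(r"(.*) winners", re.IGNORECASE), "hasWonPrize"),
]


def facts_from_categories(entity, categories):
    """Apply the first matching heuristic to each category name."""
    facts = []
    for category in categories:
        for pattern, relation in CATEGORY_HEURISTICS:
            match = pattern.fullmatch(category)
            if match:
                facts.append((entity, relation, match.group(1)))
                break
    return facts
```

Categories that match no pattern, such as "Living people", produce no fact.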

2. Connecting Wikipedia and WordNet – What is WordNet

Lexical database for the English language
Created at the Cognitive Science Laboratory of Princeton University
Groups English words into sets of synonyms called synsets
Provides short, general definitions
Provides hypernym/hyponym relations
  e.g. canine is the hypernym, dog is the hyponym

Connecting Wikipedia and WordNet – Type Extraction

Goal: create a class hierarchy
  e.g. singer subClassOf performer
       performer subClassOf artist
The hyponymy relation is taken from WordNet
The Wikipedia class ‘American people in Japan’ is a subclass of the WordNet class ‘person’

Classifications of Categories

Conceptual Categories
  e.g. Albert Einstein is in ‘Naturalized citizens of the United States’
Administrative Categories
  e.g. Albert Einstein is in ‘Articles with unsourced statements’
Relational Information
  e.g. 1879 births
Thematic Vicinity
  e.g. Physics

Identification of Conceptual Categories

Only conceptual categories are used
Shallow linguistic parsing of category names
  e.g. category ‘American people in Japan’
Break the category into
  pre-modifier - ‘American’
  head - ‘people’
  post-modifier - ‘in Japan’
If the head is plural, then the category is a conceptual category
Extract the class from the Wikipedia category
Connect it to a class from WordNet
  e.g. the Wikipedia class ‘American people in Japan’ has to be made a subclass of the WordNet class ‘person’

Algorithm

Function wiki2wordnet(c)
Input: Wikipedia category name c
Output: WordNet synset
1  head = headCompound(c)
2  pre = preModifier(c)
3  post = postModifier(c)
4  head = stem(head)
5  If there is a WordNet synset s for pre + head
6    return s
7  If there are WordNet synsets s1, …, sn for head
8    (ordered by their frequency for head)
9    return s1
10 fail

Explanation of Algorithm

Input: American people in Japan
1. Head: people
2. Pre-modifier: American
3. Post-modifier: in Japan
4. Stem(head): person
5. If there is a WordNet synset for ‘American person’
6.   return that synset
7. If there are synsets s1, …, sn for ‘person’
8.   (ordered by frequency for ‘person’)
9.   return s1
10. Fail

Output: person
Result: American People in Japan subClassOf person
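The algorithm can be tried out with a toy stand-in for WordNet. Everything below is a sketch under stated assumptions: `TOY_WORDNET` is a made-up dictionary with placeholder sense labels, the parser simply splits at the first preposition, and the stemmer and plural test are crude. A real implementation would use a proper noun-phrase parser, a real stemmer and the actual WordNet data.

```python
PREPOSITIONS = {"in", "of", "from", "by", "at", "with", "for"}
IRREGULAR_PLURALS = {"people": "person"}

# Toy stand-in for WordNet: word or phrase -> senses ordered by frequency.
# The sense labels are placeholders, not real WordNet entries.
TOY_WORDNET = {
    "person": ["person#1", "person#2"],
    "singer": ["singer#1"],
}


def parse_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = name.split()
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS), len(words))
    before, post = words[:cut], words[cut:]
    if not before:
        return "", "", " ".join(post)
    return " ".join(before[:-1]), before[-1], " ".join(post)


def stem(word):
    """Very crude singularization, sufficient only for this example."""
    lower = word.lower()
    if lower in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[lower]
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]
    return lower


def is_conceptual(name):
    """Plural-head test. Crude: singular words ending in 's' (e.g. 'Physics') fool it,
    and administrative or relational categories are assumed to be filtered out earlier."""
    _, head, _ = parse_category(name)
    return bool(head) and stem(head) != head.lower()


def wiki2wordnet(category, wordnet=TOY_WORDNET):
    """Return the chosen sense for the category, or None where the pseudocode fails."""
    pre, head, _post = parse_category(category)
    if not head:
        return None
    head = stem(head)
    phrase = f"{pre} {head}".strip().lower()
    if phrase in wordnet:                 # steps 5-6: synset for pre + head
        return wordnet[phrase][0]
    senses = wordnet.get(head, [])        # steps 7-9: most frequent sense of head
    return senses[0] if senses else None  # step 10: fail
```

For ‘American people in Japan’ there is no entry for "american person" in the toy data, so the most frequent sense of "person" is returned, mirroring the walk-through above.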

Fig.: WordNet search for “person”

Fig.: WordNet search for ‘American Person’

Exceptions

Complete hierarchy of classes
  Upper classes from WordNet
  Leaves from Wikipedia
Two dozen cases failed
  Categories with the head compound “capital”
  In Wikipedia, it means “capital city”
  In WordNet, it means “financial asset”
  These cases were corrected manually

3. Quality Control

Canonicalization
  Each fact and each entity reference is unique
  An entity is always referred to by the same identifier in all facts in YAGO
Type Checking
  Eliminates individuals that do not have a class
  Eliminates facts that do not respect domain and range constraints
  An argument of a fact in YAGO is always an instance of the class required by the relation

Canonicalization - 1

Redirect Resolution
  Infobox heuristics deliver facts that have Wikipedia entities (i.e. Wikipedia links) as arguments
  These links may not be correct Wikipedia page identifiers
  Check whether each argument is a correct Wikipedia identifier
  Replace it by the correct, redirected identifier
  e.g. Hermitage Museum locatedIn St. Petersburg
       becomes Hermitage Museum locatedIn Saint Petersburg
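Redirect resolution amounts to replacing every argument by the title its redirect chain ends at. A minimal sketch, assuming the redirects are available as a simple title-to-title dictionary (`REDIRECTS` is a hypothetical stand-in for the data taken from the Wikipedia dump):

```python
# Hypothetical redirect table: redirect title -> target title.
REDIRECTS = {"St. Petersburg": "Saint Petersburg"}


def canonical(title, redirects=REDIRECTS):
    """Follow redirects until a non-redirect title is reached (guards against cycles)."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def resolve_fact(fact, redirects=REDIRECTS):
    """Replace both arguments of a (subject, relation, object) fact by canonical titles."""
    subject, relation, obj = fact
    return (canonical(subject, redirects), relation, canonical(obj, redirects))
```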

Canonicalization - 2

Removal of Duplicate Facts
  Sometimes, two heuristics deliver the same fact
  Canonicalization eliminates one of them
  e.g. the category ‘1935 births’ yields the fact:
         Elvis Presley bornOnDate 1935
       the infobox attribute ‘Born: January 8, 1935’ yields the fact:
         Elvis Presley bornOnDate January 8, 1935
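A possible sketch of duplicate removal. The slide does not say which of the two bornOnDate facts survives; this example assumes that a bare year is dropped when a fuller date ending in that year exists for the same entity and relation. That preference, and the function name, are assumptions.

```python
def remove_duplicates(facts):
    """Drop exact repeats, then drop year-only facts shadowed by a fuller date."""
    unique = list(dict.fromkeys(facts))  # removes exact repeats, keeps order
    result = []
    for subject, relation, obj in unique:
        shadowed = any(
            s2 == subject and r2 == relation and o2 != obj
            and obj.isdigit() and o2.endswith(obj)
            for s2, r2, o2 in unique
        )
        if not shadowed:
            result.append((subject, relation, obj))
    return result
```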

Type Checking - 1

Reductive Type Checking
  Sometimes the class of an entity cannot be determined
  Facts about such entities are discarded
  e.g. Wikipedia entities that have been proposed for an article, but that do not have a page yet
Inductive Type Checking
  Type constraints can be used to generate facts
  e.g. Elvis Presley bornOnDate January 8, 1935
       so, Elvis Presley is a person
  A regular expression check ensures that the entity name follows the pattern of given name and family name
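The two checks can be combined in one pass over the facts. In this sketch the domain table, the name pattern and the function name are illustrative assumptions; it covers only the reductive and inductive steps and does not verify range constraints.

```python
import re

# Hypothetical domain constraints for a few relations.
DOMAIN = {"bornOnDate": "person", "diedOnDate": "person"}

# Assumed pattern for "given name + family name": capitalized words separated by spaces.
NAME_PATTERN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z.]+)+")


def type_check(facts, types):
    """Return (kept facts, updated types); types maps entity -> set of classes."""
    types = {entity: set(classes) for entity, classes in types.items()}
    kept = []
    for subject, relation, obj in facts:
        if not types.get(subject):
            domain = DOMAIN.get(relation)
            if domain == "person" and NAME_PATTERN.fullmatch(subject):
                # Inductive step: the relation's domain tells us the subject is a person.
                types.setdefault(subject, set()).add(domain)
            else:
                # Reductive step: no class can be determined, so the fact is discarded.
                continue
        kept.append((subject, relation, obj))
    return kept, types
```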

Type Checking - 2

Type Coherence Checking
  Sometimes, classification yields wrong results
  e.g. Abraham Lincoln is an instance of 13 classes
       12 are subclasses of the class ‘person’, e.g. lawyer, president
       the 13th class is the class ‘cabinet’
  The class hierarchy of YAGO is partitioned into branches
    e.g. locations, artifacts, people, other physical entities, and abstract entities
  The branch that most types lead to is determined
  Other types are purged
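The majority-branch idea can be sketched with a toy class hierarchy. The `PARENT` table is hypothetical (in particular, placing ‘cabinet’ under artifacts is an assumption for the example), as are the branch root names and the function names.

```python
from collections import Counter

# Toy hierarchy for illustration only: class -> parent class.
PARENT = {"lawyer": "person", "president": "person", "cabinet": "artifact"}
BRANCH_ROOTS = {"location", "artifact", "person", "physical entity", "abstract entity"}


def branch_of(cls):
    """Walk up the hierarchy until a branch root is reached; None if there is none."""
    seen = set()
    while cls not in BRANCH_ROOTS:
        if cls in seen or cls not in PARENT:
            return None
        seen.add(cls)
        cls = PARENT[cls]
    return cls


def purge_incoherent(classes):
    """Keep only the classes that lie in the branch most classes lead to."""
    branches = {c: branch_of(c) for c in classes}
    counts = Counter(b for b in branches.values() if b is not None)
    if not counts:
        return list(classes)
    majority, _ = counts.most_common(1)[0]
    return [c for c in classes if branches[c] == majority]
```

In the Abraham Lincoln example, the person branch wins and ‘cabinet’ is purged.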

References

YAGO: A Large Ontology from Wikipedia and WordNet. Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum. Max-Planck-Institute for Computer Science, Saarbruecken, Germany
Automated Construction and Growth of a Large Ontology. Fabian M. Suchanek. Thesis for obtaining the title of Doctor of Engineering of the Faculties of Natural Sciences and Technology of Saarland University
Wikipedia: http://en.wikipedia.org/wiki/Main_Page
WordNet: http://wordnet.princeton.edu/

Thank You, Any Questions?