Distinguishing between Instances and Classes in the ...
Transcript of Distinguishing between Instances and Classes in the ...
![Page 1: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/1.jpg)
Distinguishing between Instancesand Classes in the Wikipedia
Taxonomy
Cacilia Zirn
c©EML Research – p. 1
![Page 2: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/2.jpg)
A Wikipedia Ontology?
Geography of Spain
Subdivisions of Spain Landforms of Spain
Autonomous communities of Spain
Canary Islands
IS-A
Balearic Islands
IS-A
Volcanoes of Spain
IS-A
Islands of Spain
IS-A
Islands of the Canary Islands
IS-A
Tenerife
IS-A
Gran Canaria
IS-A
Sport in Tenerife
CD Tenerife
Santa Cruz de Tenerife province
CONTAINS
Municipalities in Santa Cruz de Tenerife
Las Palmas de Gran Canaria
IS-LOCATED-IN
Sport in the Canary Islands
IS-A
c©EML Research – p. 2
![Page 3: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/3.jpg)
Wikipedia Ontology
The big goal:
Deriving an ontology from Wikipedia automatically
Necessary steps:
1. derive a taxonomy from Wikipedia (identify ISA relations),Ponzetto & Strube (AAAI 2007)
2. distinguish between instances and classes(work presented now)
3. interpret remaining relations ,Nastase & Strube (AAAI 2008)
c©EML Research – p. 3
![Page 4: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/4.jpg)
Outline
1. Deriving a taxonomy from Wikipedia
2. Instances and classes
3. Methods
4. Evaluation
5. Conclusions
c©EML Research – p. 4
![Page 5: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/5.jpg)
Prerequisites: Category Network
c©EML Research – p. 5
![Page 6: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/6.jpg)
Deriving a taxonomy
Geography of Spain
Subdivisions of Spain Landforms of Spain
Autonomous communities of Spain
Canary Islands Balearic Islands
Volcanoes of Spain Islands of Spain
Islands of the Canary Islands
Tenerife Gran Canaria
Sport in Tenerife
CD Tenerife
Santa Cruz de Tenerife province
Municipalities in Santa Cruz de Tenerife
Las Palmas de Gran Canaria
Sport in the Canary Islands
c©EML Research – p. 6
![Page 7: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/7.jpg)
Deriving a taxonomy
Geography of Spain
Subdivisions of Spain Landforms of Spain
Autonomous communities of Spain
Canary Islands
IS-A
Balearic Islands
IS-A
Volcanoes of Spain
IS-A
Islands of Spain
IS-A
Islands of the Canary Islands
IS-A
Tenerife
IS-A
Gran Canaria
IS-A
Sport in Tenerife
CD Tenerife
Santa Cruz de Tenerife province
Municipalities in Santa Cruz de Tenerife
Las Palmas de Gran Canaria
Sport in the Canary Islands
IS-A
c©EML Research – p. 6
![Page 8: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/8.jpg)
Outline
1. Deriving a taxonomy from Wikipedia
2. Instances and classes
3. Methods
4. Evaluation
5. Conclusions
c©EML Research – p. 7
![Page 9: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/9.jpg)
Instances and Classes
Instances
TENERIFE, TEIDE, 2008• are unique entities in the world• in reasoning, they are mapped
to objects
C
C C
I I I I
Classes
MUNICIPALITIES IN SANTA CRUZ DE TENERIFE,VOLCANOES OF SPAIN
• concepts that subsume classes or individuals• in reasoning, they are mapped to predicates
c©EML Research – p. 8
![Page 10: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/10.jpg)
Instances and Classes
Distinction between instances and classes...• can be found in WordNet and Cyc• was done manually there• agreement coefficient on this task
on WordNet data κ = 0.75(Miller & Hristea, Computational Linguistics 2006)
➠ high cost!
c©EML Research – p. 9
![Page 11: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/11.jpg)
Instances and Classes
Distinction between instances and classes...• can be found in WordNet and Cyc• was done manually there• agreement coefficient on this task
on WordNet data κ = 0.75(Miller & Hristea, Computational Linguistics 2006)
➠ high cost!
develop heuristics to distinguish betweeninstances and classes fully automatically
c©EML Research – p. 9
![Page 12: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/12.jpg)
Instances and Classes
• intuitively all Wikipedia categories seem to be classes
➠ not the case• instances also used as categories to organize
related concepts
c©EML Research – p. 10
![Page 13: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/13.jpg)
Outline
1. Deriving a taxonomy from Wikipedia
2. Instances and classes
3. Methods
4. Evaluation
5. Conclusions
c©EML Research – p. 11
![Page 14: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/14.jpg)
Methods
• development of 5 methods
• Structure-based method• NER (Named entity recognition)• Capitalization• Plural• Page
• all are heuristics• use NLP techniques• based on category network
c©EML Research – p. 12
![Page 15: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/15.jpg)
Methods
• development of 5 methods
• Structure-based method• NER (Named entity recognition)• Capitalization• Plural• Page
c©EML Research – p. 13
![Page 16: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/16.jpg)
Structure-based method (1)
Only classes can have instances and classes.
TENERIFE, TENERIFE NORTH AIRPORT
c©EML Research – p. 14
![Page 17: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/17.jpg)
Structure-based method (1)
Only classes can have instances and classes.
TENERIFE, TENERIFE NORTH AIRPORT
• if a category has hyponyms, it has to be a class• count hyponyms (incoming ISA-links)
c©EML Research – p. 14
![Page 18: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/18.jpg)
Structure-based method (2)
• if a category has more than one hyponym:➠ the category is labeled as Class
• if a category has no hyponym:➠ the category is labeled as Instance
C
I ? ?
I ?
I
c©EML Research – p. 15
![Page 19: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/19.jpg)
Structure-based method (3)
Only classes can have instances and classes.
TENERIFE, TENERIFE NORTH AIRPORT
c©EML Research – p. 16
![Page 20: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/20.jpg)
Structure-based method (3)
Only classes can have instances and classes.
TENERIFE, TENERIFE NORTH AIRPORT
• labeling of the ISA-links has been done automatically• possible that links are classified erroneously
➠ tolerate one erroneous link
c©EML Research – p. 16
![Page 21: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/21.jpg)
Structure-based method (4)
• if a category has exactly one hyponym:
• if the hyponym has a hyponym itself:➠ the category is labeled as Class
• if the hyponym has no hyponym :➠ the category is labeled as Instance
C
I I C
I I
I
c©EML Research – p. 17
![Page 22: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/22.jpg)
Methods
• development of 5 methods
• Structure-based method• NER (Named entity recognition)• Capitalization• Plural• Page
c©EML Research – p. 18
![Page 23: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/23.jpg)
Method: NER (1)
Instances correspond to unique entities in the world andare therefore named entities.
TENERIFE, CD TENERIFE
c©EML Research – p. 19
![Page 24: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/24.jpg)
Method: NER (1)
Instances correspond to unique entities in the world andare therefore named entities.
TENERIFE, CD TENERIFE
➠ idea: use a named entity recognizer
• we use CRFClassifier (Stanford)• Person, Location, Organization for named entities• Other for the rest
c©EML Research – p. 19
![Page 25: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/25.jpg)
Method: NER (2)
Instances correspond to unique entities in the world andare therefore named entities.
TENERIFE, CD TENERIFE
c©EML Research – p. 20
![Page 26: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/26.jpg)
Method: NER (2)
Instances correspond to unique entities in the world andare therefore named entities.
TENERIFE, CD TENERIFE
• some names consist of complex noun structures:AUTONOMOUS COMMUNITIES OF SPAIN
• only lexical heads are passed to named entityrecognizer
• lexical heads are extracted using Stanford Parser
c©EML Research – p. 20
![Page 27: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/27.jpg)
Utility: Lexical head finder
• lexical heads: determine the syntactic properties of aphrase
• in a noun phrase: the noun
ROOT
NP`````
NPXXXXX
�����
JJ
autonomous
NNS
communities
PPQQ
��
IN
of
NP
NNP
Spainc©EML Research – p. 21
![Page 28: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/28.jpg)
Method: NER (3)
• if the named entity recognizer returns one of the labels:Person, Location, Organization:➠ the category is labeled as Instance
• if the named entity recognizer returns the label Other :➠ the category is labeled as Class
c©EML Research – p. 22
![Page 29: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/29.jpg)
Method: NER (3)
• if the named entity recognizer returns one of the labels:Person, Location, Organization:➠ the category is labeled as Instance
• if the named entity recognizer returns the label Other :➠ the category is labeled as Class
the parser sometimes returns several heads
• if the majority of returned labels is Other :➠ the category is labeled as Class
• otherwise:➠ the category is labeled as Instance
c©EML Research – p. 22
![Page 30: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/30.jpg)
Methods
• development of 5 methods
• Structure-based method• NER (Named entity recognition)• Capitalization• Plural• Page
c©EML Research – p. 23
![Page 31: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/31.jpg)
Method: Capitalization (1)
Content words belonging to a named entity are capitalized.
Convention for Wikipedia titles.
TENERIFE LADIES OPEN
and
AUTONOMOUS COMMUNITIES OF SPAIN
c©EML Research – p. 24
![Page 32: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/32.jpg)
Method: Capitalization (1)
Content words belonging to a named entity are capitalized.
Convention for Wikipedia titles.
TENERIFE LADIES OPEN
and
AUTONOMOUS COMMUNITIES OF SPAIN
• Bunescu & Pasca (2006) developed a heuristic to processWikipedia page titles:“If all content words of a page title are capitalized, itcorresponds to a named entity”
➠ we apply this heuristic to category titles
c©EML Research – p. 24
![Page 33: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/33.jpg)
Method: Capitalization (2)
1. preprocess first word• first word is always capitalized
➠ pass it to CRFClassifier• if it is not recognized as a named entity: lowercase the
word
2. filter out function words
3. analyze remaining words• if all words are capitalized:
➠ the category is labeled as Instance• otherwise:
➠ the category is labeled as Class
c©EML Research – p. 25
![Page 34: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/34.jpg)
Methods
• development of 5 methods
• Structure-based method• NER (Named entity recognition)• Capitalization• Plural• Page
c©EML Research – p. 26
![Page 35: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/35.jpg)
Method: Plural (1)
Instances are unique ➠ generally used in singular form.
TENERIFE, SPAIN
and
AUTONOMOUS COMMUNITIES OF SPAIN
c©EML Research – p. 27
![Page 36: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/36.jpg)
Method: Plural (1)
Instances are unique ➠ generally used in singular form.
TENERIFE, SPAIN
and
AUTONOMOUS COMMUNITIES OF SPAIN
• exceptions: “The Millers are coming to our party”not to be expected in Wikipedia category titles
c©EML Research – p. 27
![Page 37: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/37.jpg)
Method: Plural (2)
Instances are unique ➠ generally used in singular form.
TENERIFE, SPAIN
and
AUTONOMOUS COMMUNITIES OF SPAIN
c©EML Research – p. 28
![Page 38: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/38.jpg)
Method: Plural (2)
Instances are unique ➠ generally used in singular form.
TENERIFE, SPAIN
and
AUTONOMOUS COMMUNITIES OF SPAIN
• the grammatical number of the lexical head is the same asthe number of the category title
➠ we parse the category title with the Stanford Parser,obtaining:• the lexical head(s)• the part-of-speech tags
c©EML Research – p. 28
![Page 39: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/39.jpg)
Method: Plural (3)
Instances are unique ➠ generally used in singular form.
TENERIFE, SPAIN
and
AUTONOMOUS COMMUNITIES OF SPAIN
c©EML Research – p. 29
![Page 40: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/40.jpg)
Method: Plural (3)
Instances are unique ➠ generally used in singular form.
TENERIFE, SPAIN
and
AUTONOMOUS COMMUNITIES OF SPAIN
• tags of interest:• NNPS = noun, proper, plural• NNS = noun, common, plural• NNP = noun, proper, singular
Autonomous/JJ communities/NNS of/IN Spain/NNP
c©EML Research – p. 29
![Page 41: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/41.jpg)
Method: Plural (3)
• if the lexical head of a phrase is tagged as plural noun(NNS, NNPS):➠ the category is labeled as Class
• otherwise:➠ the category is labeled as Instance
c©EML Research – p. 30
![Page 42: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/42.jpg)
Methods
• development of 5 methods
• Structure-based method• NER (Named entity recognition)• Capitalization• Plural• Page
c©EML Research – p. 31
![Page 43: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/43.jpg)
Method: Page (1)
Articles should be placed in categorieswith the same name.
Advice for authors in Wikipedia.
SPAIN, TENERIFE
c©EML Research – p. 32
![Page 44: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/44.jpg)
Method: Page (1)
Articles should be placed in categorieswith the same name.
Advice for authors in Wikipedia.
SPAIN, TENERIFE
• a number of articles have homonymous categories• most articles refer to unique entities
➠ heuristic: a category containing a page with the same nameis an instance
c©EML Research – p. 32
![Page 45: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/45.jpg)
Method: Page (2)
• if a page with homonymous title exists:➠ the category is labeled as Instance
• otherwise:➠ the category is labeled as Class
c©EML Research – p. 33
![Page 46: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/46.jpg)
Implementation
• working on Wikipedia derived taxonomy• containing 127,124 nodes and 106,258 (ISA) links• data accessed via Wikipedia dump• own Wikipedia-API (in Java)
c©EML Research – p. 34
![Page 47: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/47.jpg)
Outline
1. Deriving a taxonomy from Wikipedia
2. Instances and classes
3. Methods
4. Evaluation
5. Conclusions
c©EML Research – p. 35
![Page 48: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/48.jpg)
Data (1)
Use ResearchCyc as gold standard .
ResearchCyc• distinguishes between #$Individual
and #$SetOrCollection• distinction is done manually
overlap Wikipedia / ResearchCyc:• 7860 concepts
• 44.35%(3486)#$Individual• 55.65%(4374)#$SetOrCollection
c©EML Research – p. 36
![Page 49: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/49.jpg)
Data (2)
Use ResearchCyc as gold standard .
C
I C
I I
#$S
#$I
#$S
#$S #$S
#$I
c©EML Research – p. 37
![Page 50: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/50.jpg)
Measures (1)
Tinstances Fclasses
Finstances Tclasses
Precinstances = Tinstances
Tinstances+Finstances
Tinstances: Instance in Wiki & Individual in CycFinstances: Instance in Wiki but not Individual in CycTclasses: Class in Wiki & SetOrCollection in CycFclasses: Class in Wiki but not SetOrCollection in Cyc
c©EML Research – p. 38
![Page 51: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/51.jpg)
Measures (2)
Tinstances Fclasses
Finstances Tclasses
Precclasses = Tclasses
Tclasses+Fclasses
Tinstances: Instance in Wiki & Individual in CycFinstances: Instance in Wiki but not Individual in CycTclasses: Class in Wiki & SetOrCollection in CycFclasses: Class in Wiki but not SetOrCollection in Cyc
c©EML Research – p. 39
![Page 52: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/52.jpg)
Measures (3)
Tinstances Fclasses
Finstances Tclasses
Accuracy = Tinstances+Tclasses
Tinstances+Finstances+Tclasses+Fclasses
Tinstances: Instance in Wiki & Individual in CycFinstances: Instance in Wiki but not Individual in CycTclasses: Class in Wiki & SetOrCollection in CycFclasses: Class in Wiki but not SetOrCollection in Cyc
c©EML Research – p. 40
![Page 53: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/53.jpg)
Evaluate every method separately
Method Precinstances Precclasses
NER 85.23 76.84page 66.1 91.5capitalization 85.99 82.44plural 66.44 87.99structure 56.17 87.21
c©EML Research – p. 41
![Page 54: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/54.jpg)
Evaluate every method separately
Method Precinstances Precclasses Accuracy
NER 85.23 76.84 79.69page 66.1 91.5 75.74capitalization 85.99 82.44 83.82plural 66.44 87.99 75.24structure 56.17 87.21 64.71
c©EML Research – p. 41
![Page 55: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/55.jpg)
Final setting
Classification schemes
A) Accuracy scheme• method with best accuracy: capitalization• (regard scheme as baseline)
c©EML Research – p. 42
![Page 56: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/56.jpg)
Final setting
Classification schemes
B) Precision scheme• order methods according to their precision
(Precinstances or Precclasses)
1. page ➠ class2. plural ➠ class3. structure ➠ class4. capitalization ➠ instance5. remaining categories ➠ class
c©EML Research – p. 43
![Page 57: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/57.jpg)
Final setting
Classification schemes
C) Voting scheme
1. page & plural ➠ class2. capitalization & NER ➠ instance3. remaining categories ➠ precision scheme
c©EML Research – p. 44
![Page 58: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/58.jpg)
Final setting
Classification schemes
A) Accuracy scheme
B) Precision scheme
C) Voting scheme
special form of cross-validation:• 5 rounds of binary random splits• maintain the #$Individual / #$SetOrCollection distribution• evaluate on the resulting 10 data sets
c©EML Research – p. 45
![Page 59: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/59.jpg)
Final results
Method Precisioninstances Precisionclasses Accuracy
A) Accuracy sc. 85.99±0.54 82.44±0.63 82.82±0.5B) Precision sc. 90.92±0.41 77.36±0.52 81.64±0.42C) Voting sc. 89.21±0.46 81.82±0.52 84.52±0.34
c©EML Research – p. 46
![Page 60: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/60.jpg)
Discussion
• preprocessing errors, e.g. wrong parsing results(...AND YOU WILL KNOW US BY THE TRAIL OF DEAD
ALBUMS)• recognizing named entities:
BEE TRAIN
If components of a named entity are not named entities,it is not recognized
• classification in gold standard:Inter-agreement between judges is not 100%➠ different possible judgements
c©EML Research – p. 47
![Page 61: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/61.jpg)
Outline
1. Deriving a taxonomy from Wikipedia
2. Instances and classes
3. Methods
4. Evaluation
5. Conclusions
c©EML Research – p. 48
![Page 62: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/62.jpg)
Conclusions
• automatic distinction between instances and classes ispossible with a high accuracy (84.52%)
• combining the methods with machine learning couldimprove performance even more
• next step: introducing distinction between instances andclasses to Wikipedia articles
• methods can easily be applied to other languages
c©EML Research – p. 49
![Page 63: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/63.jpg)
The resulting resource
• we applied classification scheme Cto the 127,124 nodes in the Wikipedia taxonomy• 15,472 nodes were classified as instance• 111,652 nodes were classified as class
• we converted the data into RDF Schema file(using the Jena Semantic Web Framework)
• the result is freely available
c©EML Research – p. 50
![Page 64: Distinguishing between Instances and Classes in the ...](https://reader031.fdocuments.net/reader031/viewer/2022012101/6169ead911a7b741a34cccec/html5/thumbnails/64.jpg)
Thanks!
Acknowledgements• Michael Strube and Vivi Nastase• Anette Frank for supervising my bachelor thesis• Simone Ponzetto for his work in deriving the taxonomy• Klaus Tschira Foundation
Check out
... the results (RDF Schema)
www.eml-research.de/nlp/download/wikitaxonomy.php
... more papers on Wikipedia
www.eml-research.de/∼strube
c©EML Research – p. 51