Finding homogeneous word sets – Towards a dissertation in NLP
Chris Biemann
NLP Department, University of Leipzig
Universitetet i Oslo, 12/10/2005
Outline
• Preliminaries: co-occurrences
• Unsupervised methods
- language separation
- POS tagging
• Weakly supervised methods
- gazetteer building for NER
- semantic lexicon extension
- extension of lexical-semantic word nets
Statistical Co-occurrences

• Occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbours, document, window, ...)
• Significant co-occurrences reflect relations between words
• Significance measure (log-likelihood):
- k is the number of sentences containing a and b together
- ab is (number of sentences with a) × (number of sentences with b)
- n is the total number of sentences in the corpus
sig(A, B) = x - k · log x + log k!,   with x = ab / n and n the total number of sentences.
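A minimal Python sketch of this measure, using `lgamma` for log k! (the normalisation by log n used in some variants of the Leipzig measure is left out here, matching the definition on the slide; the function name is illustrative):

```python
import math

def cooc_significance(n_a, n_b, k, n):
    """Poisson-based log-likelihood significance of words a and b
    co-occurring in k of n sentences, where n_a and n_b are the
    numbers of sentences containing a and b, respectively."""
    x = n_a * n_b / n                      # expected number of joint sentences
    # sig(A, B) = x - k * log(x) + log(k!)
    return x - k * math.log(x) + math.lgamma(k + 1)
```

With an expected one joint sentence, observing the pair 50 times scores far higher than observing it once, which is the intended behaviour of a significance measure.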
Unsupervised methods

• Unsupervised means: no training data, there is nothing like a "training set"
• This means: the discovery and usage of any structure in language must be entirely algorithmic
• Unsupervised means knowledge-free: no prior knowledge allowed
• Famous unsupervised method: clustering

Advantages:
• language-independent
• no need to build manual resources (cheap)
• robust

Disadvantages:
• labeling problem
• unaware of errors
• often not traceable
• difficult to interpret / evaluate
Unsupervised Language Discrimination

Supervised language identification:
• needs training
• operates on letter n-grams or common words as features
• works almost error-free for texts of 500 letters or more

Drawbacks:
• does not work for previously unknown languages
• danger of misclassifying instead of reporting "unknown"

Example: http://odur.let.rug.nl/~vannoord/TextCat/Demo
• "xx xxx x xxx ..." classified as Nepali
• "öö ö öö ööö ..." classified as Persian

Unsupervised language discrimination task: given a mixed-language corpus, split it into the different languages.
Biemann, C., Teresniak, S. (2005): Disentangling from Babylonian Confusion - Unsupervised Language Identification. Proceedings of CICLing-2005, Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico; Springer LNCS 3406, pp. 762-773
Co-occurrence Graphs

• The entirety of all significant co-occurrences forms a co-occurrence graph G(V, E) with
V: vertices = words
E: edges (v1, v2, s) with v1, v2 words and s the significance value
• The co-occurrence graph is weighted and undirected
• It has the small-world property
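As a sketch, such a graph can be kept as a plain adjacency dictionary; thresholding on the significance value decides which co-occurrences become edges (function and argument names are illustrative, not from the talk):

```python
def build_cooc_graph(cooccurrences, threshold):
    """Turn (word1, word2, significance) triples into an undirected,
    weighted graph stored as {node: {neighbour: weight}}.
    Only pairs at or above the significance threshold become edges."""
    graph = {}
    for v1, v2, sig in cooccurrences:
        if sig < threshold:
            continue
        graph.setdefault(v1, {})[v2] = sig   # store the edge in both
        graph.setdefault(v2, {})[v1] = sig   # directions: undirected graph
    return graph
```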
Chinese Whispers - Motivation

• (Small-world) graphs consist of regions with a high clustering coefficient and hubs that connect those regions
• The nodes in cluster regions should be assigned the same label per region
• Every node gets a label and whispers it to its neighbouring nodes. A node changes to a label if most of its neighbours whisper this label, or it invents a new one
• Under the assumption that strongly connected words are semantically close, well-motivated clusters should emerge
Chinese Whispers Algorithm

Assign a different label to every node in the graph;
for iteration i from 1 to total_iterations {
  mutation_rate = 1/(i^2);
  for each word w in the graph {
    new_label of w = highest ranked label in the neighbourhood of w;
    with probability mutation_rate: new_label of w = new class label;
  }
  labels = new_labels;
}
• graph clustering algorithm
• linear time in the number of nodes
• random mutation can be omitted, but showed better results for small graphs
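A compact Python sketch of the algorithm on an adjacency dictionary `{node: {neighbour: weight}}`; the random mutation step is omitted, which the slide notes is permissible:

```python
import random

def chinese_whispers(graph, iterations=20, seed=0):
    """Sketch of Chinese Whispers: every node starts in its own class;
    in each pass, every node adopts the label with the highest total
    edge weight among its neighbours. Mutation is omitted for brevity."""
    rng = random.Random(seed)
    labels = {v: i for i, v in enumerate(graph)}
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                  # process nodes in random order
        for v in nodes:
            scores = {}
            for u, w in graph[v].items():   # sum edge weights per label
                scores[labels[u]] = scores.get(labels[u], 0.0) + w
            if scores:
                labels[v] = max(scores, key=scores.get)
    return labels
```

On two disconnected triangles, each triangle collapses onto a single label while the two components keep distinct labels, since labels cannot propagate across missing edges.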
[Figure: example graph of five nodes with current labels L1-L4, edge weights and node degrees 1-5, illustrating one whispering step]
Chinese Whispers on 7 Languages
Assigning languages to sentences

• Use a word-based language identification tool
• The largest clusters form word lists for the different languages
• A sentence is assigned a cluster label if
- it contains at least 2 words from the cluster and
- no more words from another cluster
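One reading of this assignment rule as Python (a tie between clusters leaves the sentence unclassified; names are illustrative):

```python
def assign_language(sentence_words, clusters):
    """Assign a cluster label to a sentence if it shares at least 2
    words with that cluster and no other cluster matches as many
    words; otherwise return None (unclassified)."""
    counts = {label: len(set(sentence_words) & words)
              for label, words in clusters.items()}
    best = max(counts, key=counts.get)
    hits = counts[best]
    if hits >= 2 and all(c < hits for l, c in counts.items() if l != best):
        return best
    return None
```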
Questions for evaluation:
• Up to what number of languages is this possible?
• How much can the corpus be biased?
Evaluation
Mix of seven languages, equal number of sentences:
• Languages used: Dutch, Estonian, English, French, German, Icelandic and Italian
• At least 100 sentences per language are necessary for consistent clusters

Two languages with strong bias:
• At least 500 sentences out of 100,000 are needed to find the smaller language
• Tested on English in Estonian, Dutch in German, French in Italian
Common mistakes
• Unclassified:
- mostly enumerations of sports teams
- very short sentences, e.g. headlines
- legal act ciphers in the Estonian case, e.g. 10.12.96 jõust.01.01.97 - RT I 1996 , 89 , 1590
• Misclassified: mixed-language sentences, like
French: Frönsku orðin "cinéma vérité" þýða "kvikmyndasannleikur"
English: Die Beatles mit "All you need is love".
Induction of POS Information

Given: an unstructured monolingual text corpus

Goal: induction of POS tags for many (all) words. The result is a list of words with their corresponding tags. Application to text (the actual POS tagging) is a separate task.

Motivation:
• POS information is a processing step in a variety of NLP applications such as parsing, IE, indexing
• POS taggers need a considerable amount of hand-tagged training data, which is expensive and only available for major languages
• Even for major languages, POS taggers are suited to well-formed texts and do not cope well with domain-dependent issues as found e.g. in e-mail or spoken corpora
Literature Overview

[Schütze 93, Schütze 95, Clark 00, Freitag 04] show a similar architecture at a high level, but differ in details.

Steps to achieve word classes:
1. Calculation of global contexts using a window of 1-2 words to the left and right and the most frequent 150-250 words as features
2. Clustering of these contexts yields word classes

| | Finch & Chater 92 | Schütze 93 | Schütze 95 | Clark 00 | Freitag 04 |
|---|---|---|---|---|---|
| Context | 4 × 150 | 4 × 5000, by SVD: 15 | 4 × 500 / 2 × 250, by SVD: 50 | 2 × classes | 2 × 5000 |
| Similarity | Mutual Inf. | Cosine | Cosine | KL divergence | Mutual Inf. |
| Clustering | 1-Nearest Neighbour (?) | Buckshot, 500 classes | Buckshot, 200 classes | Iterative, 77/100/150 classes | Co-clustering, 200 classes |
Method Description

• Contexts: the most frequent N (100, 200, 300) words are used for 4 × N context vectors for the most frequent 10,000 words in the corpus
• Cosine similarity between all pairs of the 10,000 top words is calculated
• Transformation to a graph: draw an edge with weight 1/(1 - cos(x, y)) between x and y if cos(x, y) is above some threshold
• Chinese Whispers (CW) on the graph yields word class clusters

Differences to previous methods:
• CW clustering does not need the number of classes as input
• no dimensionality reduction techniques such as SVD
• explicit threshold for similarity
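The graph construction step can be sketched as follows (brute-force over all pairs; words with identical context vectors, cos = 1, are skipped to avoid an infinite edge weight, a detail the slide leaves open):

```python
import math

def similarity_graph(vectors, threshold=0.8):
    """Connect two words with an edge of weight 1/(1 - cos(x, y))
    whenever the cosine similarity of their context vectors exceeds
    the threshold. vectors maps word -> list of numbers."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    words = list(vectors)
    graph = {w: {} for w in words}
    for i, v in enumerate(words):
        for u in words[i + 1:]:
            c = cos(vectors[v], vectors[u])
            if threshold < c < 1.0:          # c == 1 would give infinite weight
                graph[v][u] = graph[u][v] = 1.0 / (1.0 - c)
    return graph
```

The result is exactly the adjacency structure Chinese Whispers expects as input.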
Toy Example (1)
Corpus fragments:
... _KOM_ sagte der Sprecher bei der Sitzung _ESENT_
... _KOM_ rief der Vorsitzende in der Sitzung _ESENT_
... _KOM_ warf in die Tasche aus der Ecke _ESENT_
Features: der(1), die(2), bei(3), in(4), _ESENT_(5), _KOM_(6)
| Word | -2 | -1 | +1 | +2 |
|---|---|---|---|---|
| sagte | | _KOM_ (6) | der (1) | |
| rief | | _KOM_ (6) | der (1) | |
| warf | | _KOM_ (6) | in (4) | die (2) |
| Sprecher | | der (1) | bei (3) | der (1) |
| Vorsitzende | | der (1) | in (4) | der (1) |
| Tasche | in (4) | die (2) | | der (1) |
| Sitzung | bei (3), in (4) | der (1) ×2 | _ESENT_ (5) ×2 | |
| Ecke | | der (1) | _ESENT_ (5) | |

(Each cell lists the feature(s) observed at that position relative to the word; counts greater than 1 are marked.)
Toy Example (2)

| cos(x, y) | sagte | rief | warf | Sprecher | Vorsitzende | Tasche | Sitzung | Ecke |
|---|---|---|---|---|---|---|---|---|
| sagte | 1 | | | | | | | |
| rief | 1 | 1 | | | | | | |
| warf | 0.4082 | 0.4082 | 1 | | | | | |
| Sprecher | 0 | 0 | 0 | 1 | | | | |
| Vorsitzende | 0 | 0 | 0.3333 | 0.6666 | 1 | | | |
| Tasche | 0 | 0 | 0 | 0.4082 | 0.3333 | 1 | | |
| Sitzung | 0 | 0 | 0 | 0.3333 | 0.3333 | 0.1666 | 1 | |
| Ecke | 0 | 0 | 0 | 0.4082 | 0.4082 | 0 | 0.6666 | 1 |
Here, CW cuts the graph into 2 partitions: nouns and verbs.
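The similarity values can be reproduced from the positional feature vectors of Toy Example (1); for instance cos(sagte, warf) ≈ 0.4082 (the index layout of the 24-dimensional vectors is an assumption of this sketch):

```python
import math

def cos(a, b):
    """Cosine similarity between two context vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# 24-dimensional vectors: 4 position blocks (-2, -1, +1, +2) x 6 features
sagte = [0.0] * 24
sagte[6 + 5] = 1.0    # position -1: _KOM_ (feature 6)
sagte[12 + 0] = 1.0   # position +1: der (feature 1)

warf = [0.0] * 24
warf[6 + 5] = 1.0     # position -1: _KOM_ (feature 6)
warf[12 + 3] = 1.0    # position +1: in (feature 4)
warf[18 + 1] = 1.0    # position +2: die (feature 2)
```

The only shared feature is _KOM_ at position -1, so the similarity is 1 / (√2 · √3) ≈ 0.4082, matching the table.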
Norwegian – Labels
Corpus size and features: cluster purity vs. coverage

[Plot: cluster purity (x-axis, 0.5-1.0) vs. coverage (y-axis, 0-1) for different corpus sizes and feature counts; series: de010K_100, de010K_200, de010K_300, de050K_100, de050K_200, de050K_300, de100K_100, de100K_200, de100K_300, de1M_200, de10M_200, de10_it0_85]
Example: time words in Norwegian
Cluster sizes and clusters per word class
• When optimizing cluster purity (CP), words of the same word class tend to end up in several clusters, especially for open word classes
• Open word classes are the most interesting ones for further processing steps like IE, relation learning, ...
• Cluster sizes are Zipf-distributed; there are always many small clusters
• Hierarchical CW could be used to lower the number of clusters while preserving POS distinctions
Outlook: Constructing a POS tagger
• Using word clusters to initialize a POS tagger
• Evaluation based on types instead of tokens
Open questions:
• Context window backoff model for unknown words
• Leave out or take in unclustered high-frequency words (as singletons)?
• Can the many classes per POS be unified using tagger behaviour?
Weakly Supervised Methods

Weakly supervised means:
• very little training data and prior knowledge
• learning from labeled and unlabeled data
• bootstrapping methods

Advantages:
• very little input: still cheap
• no labeling problem
• easier to evaluate

Disadvantages:
• subject to error propagation
• stopping criterion difficult to define
Bootstrapping of lexical items

For learning by bootstrapping, two things are needed: a start set of some known items with classes, and a rule set that states how more information can be obtained using known items.

Generic bootstrapping algorithm:
Knowledge = 0
New = Start_set
while New > 0 {
  Knowledge += New
  New = find new items using Knowledge and Rule_set
}
[Plot: # items over iterations; known items grow while new items per iteration first rise (phase of growth), then fall off (phase of exhaustion)]
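The generic loop above, sketched in Python with the rule set abstracted into a callable (names are illustrative):

```python
def bootstrap(start_set, find_new):
    """Generic bootstrapping: repeatedly apply the rule set (here the
    callable find_new) to everything learned so far, until nothing
    new turns up."""
    knowledge = set()
    new = set(start_set)
    while new:                               # phase of growth ...
        knowledge |= new
        new = find_new(knowledge) - knowledge
    return knowledge                         # ... until exhaustion
```

For example, with a toy rule "every known number also teaches its successor up to 5", a single seed grows into the full set.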
Benefits and Bothers of Bootstrapping
Pro:
• Only small start sets (seeds) are needed; these can be rapidly prepared
• The process needs no further supervision (weakly supervised learning)

Cons:
• danger of error propagation
• when to stop is unclear
Patterns for word classes and their relations
Examples of word classes in text:
• Islands: "On the island of Cuba ...", "Caribbean island of Trinidad"
• Companies: "the ACME Ltd. Incorporated"
• Verbs of utterance: "she said: <something>"
• Person names: John W. Smith, Ellen Meyer

Observation:
• Words belonging to the same class can be interchanged without invalidating the relation
• Sometimes there are no trigger words
Problem definition
Let Ri ⊆ A1 × ... × An be n-ary relations over word sets A1, ..., An.

Given:
• some elements of the sets A1, ..., An
• a large corpus

Needed:
• the sets A1, ..., An
• the tuples (a1, ..., an) ∈ Ri

Necessary: rules for classification
Pattern Matching Rules
• Annotate text with known items and flat features (tagging is nice, but tag sets of 4 tags will do for English)
" ... said Jonas Berger , who ... "
" ... LC   UC    LN     PM  LC ... "
• Use rules like
UC* LN -> FN
FN UC* -> LN
to classify "Jonas" as a first name
• Rules of this kind are weak hypotheses because they sometimes misclassify, e.g. in
"As Berger turned over, ..."
"... tickets at Agency Berger, Munich."

Rules alone are not sufficient.
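The first rule can be sketched over a flat tag sequence; the second assertion below shows exactly the kind of misfire described above (tag names follow the slide, but the rule formalism here is a simplification):

```python
def apply_fn_rule(tokens, tags):
    """Sketch of the weak rule 'UC* LN -> FN': an uppercase word of
    unknown class (UC) immediately before a known last name (LN) is
    hypothesised to be a first name (FN)."""
    hypotheses = set()
    for i in range(len(tokens) - 1):
        if tags[i] == "UC" and tags[i + 1] == "LN":
            hypotheses.add(tokens[i])
    return hypotheses
```

Run on "As Berger turned over", the same rule happily proposes "As" as a first name, which is why the candidates need a verification step.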
Pendulum Algorithm: Bootstrapping with verification

Initialize Knowledge, Rules, New_items
while New_items > 0 {
  Last_new_items = New_items; New_items = 0
  for all i in Last_new_items {
    fetch text containing i from the corpus          // search step
    find candidates in the text using Knowledge and Rules
    for each candidate k {
      fetch text containing k                        // verification step
      rate k on the basis of that text
    }
  }
  New_items += candidates with high ratings
  Knowledge += New_items
}
Quasthoff, U., Biemann, C., Wolff, C. (2002): Named Entity Learning and Verification: EM in Large Corpora. Proceedings of CoNLL-2002, the Sixth Workshop on Computational Language Learning, in association with COLING 2002, Taipei, Taiwan

Biemann, C., Böhm, K., Quasthoff, U., Wolff, C. (2003): Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations. Proceedings of I-KNOW '03, International Conference on Knowledge Management, Graz; Journal of Universal Computer Science (JUCS), Vol. 9, No. 6, pp. 530-541, June 2003
Explanations on the Pendulum
• The same rules are used for both search and verification of candidates
• Previously known and previously learned items are used for both search and verification of candidates
• A word is only taken into the knowledge base if it occurs
- multiple times and
- at a high rate
in the corpus with its classification.
Example: island names and island specifiers
Results – German Person Names

Start set and prior knowledge: 9 first names, 10 last names, 15 rules, 12 regular expressions for titles

Corpus: Projekt Deutscher Wortschatz, 36 million sentences

Found: 42,000 items, of which 74% LN (Prec > 99%), 15% FN (Prec > 80%), 11% TIT (Prec > 99%)
[Plot: new items and total items per iteration step]
Extending a semantic lexicon using co-occurrences and HaGenLex

Size for nouns: about 13,000.

50 semantic classes for nouns are constructed from allowed combinations of:
• 16 semantic features (binary), e.g. HUMAN+, ARTIFICIAL-
• 17 ontological sorts (hierarchy), e.g. concrete, abstract-situation, ...

| Word | Semantic class |
|---|---|
| Aggressivität | nonment-dyn-abs-situation |
| Agonie | nonment-stat-abs-situation |
| Agrarprodukt | nat-discrete |
| Ägypter | human-object |
| Ahn | human-object |
| Ahndung | nonment-dyn-abs-situation |
| Ähnlichkeit | relation |
| Airbag | nonax-mov-art-discrete |
| Airbus | mov-nonanimate-con-potag |
| Airport | art-con-geogr |
| Ajatollah | human-object |
| Akademiker | human-object |
| Akademisierung | nonment-dyn-abs-situation |
| ... | ... |
Underlying Assumptions
• Harris 1968, distributional hypothesis: semantic similarity is a function of the global contexts of words; the more similar the contexts, the more similar the words
• Projected onto nouns and adjectives: nouns of similar semantic classes are modified by similar adjectives
Neighbouring Co-occurrences and Profiles
• Neighbouring co-occurrence: a pair of words that occur next to each other more often than expected under the assumption of statistical independence
• The neighbouring co-occurrence relation between adjectives as left neighbours and nouns as right neighbours approximates typical head-modifier structures
• The set of adjectives that co-occur significantly often to the left of a noun is called its adjective profile (noun profiles for adjectives are defined analogously)
• For the experiments, I used the most recent German corpus of Projekt Deutscher Wortschatz, 500 million tokens
Example: neighbouring profiles
Amount: 160,000 nouns, 23,400 adjectives

| Word | Adjective / noun profile |
|---|---|
| Buch | neu, erschienen, erst, neuest, jüngst, gut, geschrieben, letzt, zweit, vorliegend, gleichnamig, herausgegeben, nächst, dick, veröffentlicht, ... |
| Käse | gerieben, überbacken, kleinkariert, fett, französisch, fettarm, löchrig, holländisch, handgemacht, grün, würzig, selbstgemacht, produziert, schimmelig, ... |
| Camembert | gebacken, fettarm, reif |
| überbacken | Schweinesteak, Aubergine, Blumenkohl, Käse |
| erlegt | Tier, Wild, Reh, Stück, Beute, Großwild, Wildkatzen, Büffel, Rehbock, Beutetier, Wal, Hirsch, Hase, Grizzly, Wildschwein, Thier, Eber, Bär, Mücke, ... |
| ganz | Leben, Bündel, Stück, Volk, Wesen, Vermögen, Herz, Heer, Arsenal, Dorf, Land, Können, Berufsleben, Paket, Kapitel, Stadtviertel, Rudel, Jahrzehnt, ... |

| Word (transl.) | Adjective / noun profile (translations) |
|---|---|
| book | new, published, first, newest, most recent, good, written, last, second, on hand, eponymous, next, thick, ... |
| cheese | grated, baked over, small-minded, fat, French, low-fat, holey, Dutch, hand-made, green, spicy, self-made, produced, moldy |
| camembert | baked, low-fat, ripe |
| baked over | pork steak, aubergine, cauliflower, cheese |
| brought down | animal, game, deer, piece, prey, big game, wild cat, buffalo, roebuck, prey animal, whale, hart, bunny, grizzly, wild pig, boar, bear, ... |
| whole | life, bundle, piece, population, kind, fortune, heart, army, arsenal, village, country, ability, career, packet, chapter, quarter, pack, decade, ... |
Mechanism of Inheritance

Algorithm:
Initialize adjective and noun profiles;
Initialize the start set;
as long as new nouns get classified {
  calculate class probabilities for each adjective;
  for all yet unclassified nouns n {
    multiply the class probabilities of the modifying adjectives per class;
    assign the class with the highest probability to n;
  }
}

Which class is assigned to N4 in the next step?

Class probabilities per adjective:
• count the occurrences of each class among the classified nouns the adjective modifies
• normalize to 1
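A sketch of the inheritance loop; the class probabilities are simplified to relative frequencies, and classes unseen for an adjective are simply skipped in the product (an assumption of this sketch, not necessarily the exact scheme of the thesis):

```python
def classify_nouns(adj_profiles, seed_classes):
    """adj_profiles: noun -> list of modifying adjectives.
    seed_classes: noun -> semantic class (the start set).
    Iterates until no new noun gets classified."""
    known = dict(seed_classes)
    changed = True
    while changed:
        changed = False
        # estimate a class distribution per adjective from classified nouns
        adj_probs = {}
        for noun, adjs in adj_profiles.items():
            if noun not in known:
                continue
            for a in adjs:
                dist = adj_probs.setdefault(a, {})
                dist[known[noun]] = dist.get(known[noun], 0) + 1
        for a, dist in adj_probs.items():
            total = sum(dist.values())
            adj_probs[a] = {c: v / total for c, v in dist.items()}
        # assign each unclassified noun the class with the highest product
        for noun, adjs in adj_profiles.items():
            if noun in known:
                continue
            scores = {}
            for a in adjs:
                for c, p in adj_probs.get(a, {}).items():
                    scores[c] = scores.get(c, 1.0) * p
            if scores:
                known[noun] = max(scores, key=scores.get)
                changed = True
    return known
```

On a toy profile set, a noun modified by an adjective seen only with animals inherits the animal class.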
Experimental Data

Distribution of semantic classes (total: 6045): nonment-dyn-abs-situation, human-object, prot-theor-concept, nonoper-attribute, ax-mov-art-discrete, nonment-stat-abs-situation, animal-object, nonmov-art-discrete, ment-stat-abs-situation, nonax-mov-art-discrete, tem-abstractum, mov-nonanimate-con-potag, art-con-geogr, abs-info, art-substance, nat-discrete, nat-substance, prot-discrete, nat-con-geogr, prot-substance, mov-art-discrete, meas-unit, oper-attribute, institution, ment-dyn-abs-situation, plant-object, mov-nat-discrete, con-info, con-geogr, con-object, animate-object, prot-method, dyn-abs-situation, object, nonmov-nonanimate-con-potag, abs-geogr, stat-abs-situation, modality, relation, con-potag, prot-con-object, nonmov-nat-discrete, noninstit-abs-potag, thc-relation, nonanimate-con-potag, abs-situation, abs-potag

• 5133 nouns meet minAdj = 5, which means a maximal recall of 84.9%
• In all experiments, 10-fold cross-validation was used
Results: Global Classification

• Classification was carried out directly on the 50 semantic classes
• The different measuring points correspond to the parameters minAdj in {5, 10, 15, 20} and maxClass in {2, 5, 50}
• Results are too poor for lexicon extension
[Plot: precision vs. recall for the global classifier]
Combining Single Classifiers

Architecture: binary classifiers for single features, then combining their outcomes. Parameters: minAdj = 5, maxClass = 2

[Diagram: binary classifiers for the 16 semantic features (ANIMAL +/-, ANIMATE +/-, ARTIF +/-, AXIAL +/-, ...) and the 17 ontological sorts (ab +/-, abs +/-, ad +/-, as +/-, ...) feed a selection step that keeps compatible semantic classes that are minimal w.r.t. the hierarchy and unambiguous; the output is a result class or a reject]
Results: Single Semantic Features
• For bias > 0.05: good to excellent precision
• total precision: 93.8% (86.8% for feature +)
• total recall: 70.7% (69.2% for feature +)
Results: Ontologic Sorts
• For bias > 0.10: good to excellent precision
• total precision: 94.1% (89.5% for sort +)
• total recall: 73.6% (69.6% for sort +)
Results: Comb. Semantic Classes
• No connection between class size and results is visible
• total precision: 82.3%
• total recall: 32.8%
• number of newly classified nouns: 8500 (minAdj = 2: ~13,000)
Typical mistakes

Pflanze (plant): animal-object instead of plant-object
zart, fleischfressend, fressend, verändert, genmanipuliert, transgen, exotisch, selten, giftig, stinkend, wachsend, ...

Nachwuchs (offspring): human-object instead of animal-object
wissenschaftlich, qualifiziert, akademisch, eigen, talentiert, weiblich, hoffnungsvoll, geeignet, begabt, journalistisch, ...

Café (café): art-con-geogr instead of nonmov-art-discrete (cf. Restaurant)
Wiener, klein, türkisch, kurdisch, romanisch, cyber, philosophisch, besucht, traditionsreich, schnieke, gutbesucht, ...

Neger (negro): animal-object instead of human-object
weiß, dreckig, gefangen, faul, alt, schwarz, nackt, lieb, gut, brav

but:
Skinhead (skinhead): human-object (ok)
{16,17,18,19,20,21,22,23,30}jährig, gleichaltrig, zusammengeprügelt, rechtsradikal, brutal
In most cases the wrong class is semantically close. Evaluation metrics did not account for that.
Biemann, C., Osswald, R. (2005): Automatic Extension of Feature-based Semantic Lexicons via Contextual Attributes. Proceedings of the 29th Annual Meeting of the GfKl, Magdeburg, 2005
Extending CoreNet – a Korean WordNet

CoreNet characteristics:
• rather large groups of words per concept, as opposed to the fine-grained WordNet structure
• the same concept hierarchy is used for all word classes

Size of the KAIST Korean corpus:
• 38 million tokens
• 2.3 million sentences
• 3.8 million types

| Word class | Lemmas | Senses |
|---|---|---|
| NOUN | 28,823 | 56,523 |
| VERB | 1,757 | 4,717 |
| ADJECTIVE | 804 | 1,392 |
Pendulum Algorithm on co-occurrences

LastLearned = StartSet;
Knowledge = StartSet;
NewLearned = {};
while (|LastLearned| > 0) {
  for all i in LastLearned {
    Candidates = getCooccurrences(i);          // search step
    for all c in Candidates {
      VerifySet = getCooccurrences(c);         // verification step
      if |VerifySet ∩ Knowledge| > threshold {
        NewLearned += c;
        Knowledge += c;
      }
    }
  }
  LastLearned = NewLearned;
  NewLearned = {};
}
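A runnable Python version of this pseudocode; getCooccurrences is passed in as a callable, and the set intersection makes the verification step explicit:

```python
def pendulum(start_set, get_cooccurrences, threshold=3):
    """Pendulum bootstrapping on co-occurrences: the co-occurrences of
    known items are candidates (search step); a candidate is accepted
    only if its own co-occurrence profile overlaps the current
    knowledge in more than `threshold` items (verification step)."""
    knowledge = set(start_set)
    last_learned = set(start_set)
    while last_learned:
        new_learned = set()
        for item in last_learned:
            for cand in get_cooccurrences(item):      # search step
                if cand in knowledge:
                    continue
                verify = get_cooccurrences(cand)      # verification step
                if len(verify & knowledge) > threshold:
                    new_learned.add(cand)
                    knowledge.add(cand)
        last_learned = new_learned
    return knowledge
```

On a toy corpus where five concept words all co-occur with each other and a noise word co-occurs with only one of them, the noise word fails verification while the rest of the clique is learned.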
Sample step
Seed:
Search with yields (amongst others):
Verify:
Evaluation
• Selection of concepts performed by a non-Korean speaker
• Evaluation performed manually; only new words counted
• Heuristics for avoiding result set infection:
- iteratively lower the verification threshold from 8 down to 3 until the result set gets too large
- take the lowest threshold whose result set has a reasonable size (not exceeding the start set)
• A typical run needed 3-7 iterations to converge
Biemann, C., Shin, S.-I., Choi, K.-S. (2004): Semiautomatic Extension of CoreNet using a Bootstrapping Mechanism on Corpus-based Co-occurrences. Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland
Results

Not enough for automatic extension, but a good source of candidates.

| CoreNet ID | Concept | Size | # new | # ok | Precision |
|---|---|---|---|---|---|
| 50 | human good/bad | 119 | 36 | 5 | 13.89% |
| 111 | human relation | 274 | 3 | 2 | 66.67% |
| 113 | partner / co-worker | 123 | 23 | 8 | 34.78% |
| 114 | partner / member | 71 | 5 | 3 | 60.00% |
| 181 | human ability | 213 | 7 | 2 | 28.57% |
| 430 | store | 128 | 12 | 11 | 91.67% |
| 471 | land, area | 260 | 10 | 2 | 20.00% |
| 548 | insect, bug | 75 | 43 | 6 | 13.95% |
| 552 | part of animal | 736 | 10 | 6 | 60.00% |
| 553 | head | 139 | 7 | 4 | 57.14% |
| 577 | forehead | 72 | 4 | 2 | 50.00% |
| 590 | legs and arms | 86 | 7 | 3 | 42.86% |
| 672 | plant (vegetation) | 461 | 30 | 15 | 50.00% |
| 817 | clothes | 246 | 34 | 18 | 52.94% |
| Sum | | 3934 | 231 | 87 | 37.67% |
Problems ... and possible solutions

• "Coverage is low"
- increase corpus size for relevant domains
- make use of other features, e.g. patterns
• "Precision is not satisfactory"
- obtain multiple concepts simultaneously
- meta-level bootstrapping
- make use of other features, e.g. POS tags for word class information

This work gives a baseline of what is reachable without employing language-dependent features.
From Text to Ontologies

Text → sort by language → lang. 1, lang. 2, ..., lang. n → assign word classes → text with POS labels → determine patterns and extract word pairs → assign semantic properties to words → typed relations and instances
Questions?
THANK YOU!
Abstract
Methods are introduced that find, by corpus analysis, sets of words which have something in common. With the objective of largely automating the task and putting the knowledge into algorithms instead of training sets, two kinds of methods can be distinguished: completely unsupervised methods (clustering) and weakly supervised methods (bootstrapping).

Two unsupervised variants of standard preprocessing steps are discussed, namely language identification and part-of-speech tagging. In both, a novel, efficient graph clustering algorithm is employed.

After a general introduction to bootstrapping, which needs only a minimal training set, three bootstrapping experiments are described: gazetteer construction for Named Entity Recognition, extension of a semantic lexicon, and expansion of a lexical-semantic word net.

Follow-ups on the latter two can give rise to automatic ontology creation and extension.