COOC: Practicum / Software Project, SS 2000, Final Report. Tanja von den Berg, Tilman Jäger, Kerstin Klöckner, Stephan Lesch, Holger Neis, Norbert Pfleger, Diana Raileanu, Hubert Schlarb. Supervisors: Jan Alexandersson, Paul Buitelaar


COOC

Practicum / Software Project, SS 2000

Final Report

Tanja von den Berg, Tilman Jäger, Kerstin Klöckner, Stephan Lesch, Holger Neis, Norbert Pfleger,

Diana Raileanu, Hubert Schlarb

Supervisors: Jan Alexandersson, Paul Buitelaar


30.11.00 COOC 2

Contents

• Intro

• Theoretical foundations

• At the outset

• Project aspects
  – Preprocessing
  – Training
  – Application
  – Evaluation

• Outlook


Intro

• Word Sense Disambiguation (WSD) as preparation for semantic analysis of text documents

• Application areas: translation systems, information retrieval systems, document classification, etc.

• Machine learning approaches:
  - supervised (semantically tagged corpora)
  - unsupervised (untagged corpora)

• COOC: the first unsupervised, corpus-based approach for German


Theoretical Foundations

WSD (Word Sense Disambiguation) in context:

E.g. German "Bank": bench (place to sit) vs. financial institution

I'm going to the bank to get some money.

COOC: cooccurrence of words in a given context

GermaNet: WordNet for German

WordNet: - lexical and semantic database

- semantic net, ontology

- lexical and conceptual relations (antonymy, hyponymy)


Theoretical Foundations (II)

Method:

- knowledge sources (WordNet, Thesaurus)

- the possibility of finding relations between words and meanings

supervised: - requires already disambiguated data

- requires large amounts of data

unsupervised: - requires even more data

- data need not be disambiguated


Theoretical Foundations (III)

Examples of unsupervised methods:

• Lesk (1986): comparison among dictionary entries

• Yarowsky (1992):
  - Roget's Thesaurus, Grolier's Encyclopedia
  - collections of contexts for a thesaurus category
  - identification of characteristic words

• Resnik (1997):
  - Penn Treebank corpus, POS-tagged, syntactically annotated
  - selectional preference (predicate arguments)


At the outset

Approach of Seligman (94):

• Japanese dialogues (direction finding, hotel reservations in spontaneous speech)
• thesaurus with 4 fixed abstraction levels
• explicit semantic smoothing

COOC project:

• Tiger corpus (Frankfurter Rundschau)
• GermaNet with varying number of abstraction levels (up to 26)
• implicit semantic smoothing


Flow diagram


Preprocessing

• Conversion of the training corpus (plain text) into the COOC format

• Statistics on GermaNet categories


Resources

• Tiger corpus (1,051,446 tokens)

- German newspaper text from the Frankfurter Rundschau

• TnT tagger (Brants 2000) - statistical Part-of-Speech tagger

• Mmorph (Petitpierre & Russell, 1995) - morphological analysis tool

• GermaNet: - lexico-semantic network for German (about 25,000 nouns, 6,000 verbs, 3,500 adjectives)


COOC format

Philip Glass wurde auf seinen weltweiten Tourneen mit Kassetten und Tonbändern überschüttet. (Philip Glass was showered with cassettes and audio tapes during his worldwide tours.)

...

166 seinen NA PPOSAT
167 weltweiten weltweit ADJA [ 113815 113669 111763 111559 ]

...

172 Tonbändern Tonband NN [ 75749 ... 1749365 ] ... [ 75749 ... 144863 ]
173 überschüttet überschütten VVPP [ 353400 ... 226602 ] [ 353400 ... 2266023 ]

...
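The line layout shown above can be read mechanically. The following is a minimal sketch, assuming each line is `id form lemma POS` followed by zero or more bracketed lists of GermaNet IDs, with `NA` standing for a missing lemma. The names `CoocToken` and `parse_cooc_line` are hypothetical and not part of the project code; the sketch expects complete lines, without the `...` elisions used on the slide.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CoocToken:
    token_id: int
    form: str
    lemma: Optional[str]          # None where the corpus line has "NA"
    pos: str
    # one list of GermaNet IDs per bracket group (assumed: one group per candidate meaning)
    paths: List[List[int]] = field(default_factory=list)

_BRACKET = re.compile(r"\[([^\]]*)\]")

def parse_cooc_line(line: str) -> CoocToken:
    """Parse 'id form lemma POS [ ids ] [ ids ] ...' into a CoocToken."""
    head = line.split("[", 1)[0].split()
    token_id, form, lemma, pos = head[:4]
    paths = [[int(x) for x in group.split()] for group in _BRACKET.findall(line)]
    return CoocToken(int(token_id), form, None if lemma == "NA" else lemma, pos, paths)

adjective = parse_cooc_line("167 weltweiten weltweit ADJA [ 113815 113669 111763 111559 ]")
pronoun = parse_cooc_line("166 seinen NA PPOSAT")
```

A token without bracket groups, such as line 166, ends up with an empty `paths` list.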


GermaNet Hierarchy


Statistics on GermaNet Categories

• Omission of higher-frequency categories
• Reduction of computational complexity
• Format: Frequency ID(Offset) Synset
• Example:
  70725 1749365 Objekt_0
  43450 369009 Situation_0
  ...........
  2 843903 Kofferraum_0
  1 695036 Intellekt_0_Genius_0
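A frequency list of this kind could be produced as sketched below. This is an illustration under assumptions: each token carries lists of category IDs, and a category is counted once per token even if several of the token's lists contain it. The slides do not state the counting rule. The function names and the toy IDs are hypothetical.

```python
from collections import Counter

def category_frequencies(token_paths):
    """token_paths: iterable of per-token lists of ID lists. Returns Counter: ID -> frequency."""
    freq = Counter()
    for paths in token_paths:
        # assumption: count each category once per token
        for cat in {c for path in paths for c in path}:
            freq[cat] += 1
    return freq

def in_frequency_interval(freq, low, high):
    """Keep only categories whose frequency lies in [low, high], e.g. to drop very frequent ones."""
    return {cat for cat, n in freq.items() if low <= n <= high}

# toy IDs, not real GermaNet offsets
toy_corpus = [[[10, 1]], [[11, 1], [12, 2]], [[13, 2]]]
freq = category_frequencies(toy_corpus)
rare = in_frequency_interval(freq, 1, 1)

for cat, n in freq.most_common():
    print(n, cat)   # "Frequency ID" order, as in the slide's format
```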


Segmentation...

...at sentence boundaries:

Landesbank schlägt Verträge zwischen Stadt und privaten Investoren vor Überall wird gebuddelt und gemauert. Hamburg erlebt den größten Geschäftsbau-Boom. Jährlich hinzukommen rund 300 000 Quadratmeter an Büroräumen. (State bank proposes contracts between the city and private investors. Everywhere there is digging and bricklaying. Hamburg is experiencing its biggest commercial building boom. Every year around 300,000 square meters of office space are added.)

...or e.g. after every 3 significant words:

Landesbank schlägt Verträge zwischen Stadt und privaten Investoren vor Überall wird gebuddelt und gemauert. Hamburg erlebt den größten Geschäftsbau-Boom. Jährlich hinzukommen rund 300 000 Quadratmeter an Büroräumen.
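The second segmentation method can be sketched as follows, assuming a word counts as significant when it has GermaNet categories and that a segment is closed immediately after its k-th significant word. Both the function name and the handling of leftover words are assumptions, not taken from the project code.

```python
def segment_by_significant(tokens, k=3):
    """tokens: list of (form, is_significant). Close a segment after every k significant words."""
    segments, current, seen = [], [], 0
    for form, significant in tokens:
        current.append(form)
        if significant:
            seen += 1
            if seen == k:
                segments.append(current)
                current, seen = [], 0
    if current:                      # leftover words form a final, shorter segment
        segments.append(current)
    return segments

toy = [("a", True), ("b", False), ("c", True), ("d", True), ("e", False), ("f", True)]
toy_segments = segment_by_significant(toy, k=3)
```

Segmenting at sentence boundaries would replace the counter with a check for sentence-final punctuation.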


Windows

Text window: n segments with the current segment in the middle; wider scope than n-grams

S(i) S(i+1) S(i+2) S(i+3) S(i+4)

W(t) W(t+1) W(t+2)

n = 3
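With five segments and n = 3 there are three full windows, each shifted by one segment. A minimal sketch, assuming only full windows are produced (how the project treats the first and last segments of a text is not stated); the function name is hypothetical:

```python
def sliding_windows(segments, n=3):
    """Yield every run of n consecutive segments; the current segment is the middle one."""
    for start in range(len(segments) - n + 1):
        yield segments[start:start + n]

five = ["S(i)", "S(i+1)", "S(i+2)", "S(i+3)", "S(i+4)"]
all_windows = list(sliding_windows(five, n=3))
```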


Training: unsupervised

Compare "Peter goes by train" with "Diana goes by bike":

train and bike should both be VEHICLES, but they have different ambiguities


Statistics

For a pair of categories:

• conditional probability

\[ \Pr(c_2 \mid c_1) = \frac{\Pr(c_1, c_2)}{\Pr(c_1)} \]

• mutual information

\[ MI(c_1, c_2) = \log_2 \frac{\Pr(c_2 \mid c_1)}{\Pr(c_2)} \]

Effect: correct category combinations emerge statistically
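Both statistics can be estimated from co-occurrence counts. The sketch below assumes that each training window contributes one set of categories, that Pr(c) is the fraction of windows containing c, and that Pr(c1, c2) is the fraction containing both. The slides do not spell out the estimator, so this is one plausible reading, and all names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def count_cooccurrences(windows):
    """windows: iterable of sets of category IDs seen together in one text window."""
    single, pair, total = Counter(), Counter(), 0
    for cats in windows:
        total += 1
        single.update(cats)
        pair.update(frozenset(p) for p in combinations(sorted(cats), 2))
    return single, pair, total

def cond_prob(c2, c1, single, pair, total):
    """Pr(c2 | c1) = Pr(c1, c2) / Pr(c1); c1 must have been seen in training."""
    return (pair[frozenset((c1, c2))] / total) / (single[c1] / total)

def mutual_information(c1, c2, single, pair, total):
    """MI(c1, c2) = log2( Pr(c2 | c1) / Pr(c2) ); -inf for pairs never seen together."""
    if pair[frozenset((c1, c2))] == 0:
        return float("-inf")
    return math.log2(cond_prob(c2, c1, single, pair, total) / (single[c2] / total))

toy_windows = [{"VEHICLE", "MOTION"}, {"VEHICLE", "MOTION"}, {"VEHICLE"}, {"FOOD"}]
single, pair, total = count_cooccurrences(toy_windows)
p = cond_prob("MOTION", "VEHICLE", single, pair, total)
mi = mutual_information("VEHICLE", "MOTION", single, pair, total)
```

In the toy data MOTION appears in two of the three windows that contain VEHICLE, so the pair scores above chance, while VEHICLE and FOOD never co-occur.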


Training: Parameters

• Segmentation methods

• Window width

limiting calculation time and space requirements:

• exclusion of certain POS combinations

• only categories in certain frequency intervals

• only pairs with frequency > minimum
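The three pruning rules could be bundled into a single filter, sketched here. The parameter names, default values, and the `keep_pair` function are hypothetical; only the rules themselves (excluded POS combinations, a frequency interval for categories, a minimum pair frequency that must be exceeded) come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingParams:
    excluded_pos_pairs: frozenset = frozenset()  # unordered POS combinations to skip
    min_cat_freq: int = 1                        # lower bound of the category frequency interval
    max_cat_freq: int = 10**9                    # upper bound of the category frequency interval
    min_pair_freq: int = 0                       # pairs are kept only if frequency > this value

def keep_pair(pos1, pos2, freq1, freq2, pair_freq, params):
    """Return True if a category pair passes all three pruning rules."""
    if frozenset((pos1, pos2)) in params.excluded_pos_pairs:
        return False
    if not all(params.min_cat_freq <= f <= params.max_cat_freq for f in (freq1, freq2)):
        return False
    return pair_freq > params.min_pair_freq

params = TrainingParams(
    excluded_pos_pairs=frozenset({frozenset(("ADJA", "ADJA"))}),
    min_cat_freq=2, max_cat_freq=1000, min_pair_freq=1,
)
```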


Application

• Actual disambiguation process
  – input: sentences/text in COOC format, containing ambiguous words
  – output: disambiguated sentences/words
  – requires training results


To proceed

• Connection to the training database
  – selection of parameters (window and segment size) of the training database
• Text processing
  – construction of the initial windows
  – disambiguation of the current segment
  – results are written to the output


To proceed (II)

• Window handling:
  – the middle (current) segment is disambiguated word by word
  – at the end of the segment, the window is moved one segment to the right

S(i) S(i+1) S(i+2) S(i+3) S(i+4)


To proceed (III)

• Handling the words in the middle (current) segment
  – distinguish significant vs. insignificant words (with and without GermaNet categories)
  – for significant words, the most probable meaning is computed and output
  – insignificant words are written unchanged to the output
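The window handling and the treatment of significant and insignificant words can be combined into one loop, sketched below. Several points are assumptions: the context is taken to be every category occurring anywhere in the window (including the word's own), the first and last segments of a text are skipped because they have no full window, and the actual choice of meaning is delegated to a caller-supplied function `choose`. None of these names come from the project.

```python
def disambiguate(segments, choose, n=3):
    """segments: list of segments, each a list of (form, paths).
    choose(paths, context) -> index of the selected meaning.
    Returns a list of (form, chosen_index_or_None)."""
    half = n // 2
    output = []
    for mid in range(half, len(segments) - half):
        window = segments[mid - half: mid + half + 1]
        # assumption: the context is the set of all categories in the window
        context = {c for seg in window for _, paths in seg for path in paths for c in path}
        for form, paths in segments[mid]:
            if not paths:                       # insignificant: passed through unchanged
                output.append((form, None))
            else:                               # significant: pick one meaning
                output.append((form, choose(paths, context)))
    return output

toy_segments = [
    [("a", [[1]])],
    [("der", []), ("Rock", [[2, 9], [3, 9]])],
    [("b", [[4]])],
]
result = disambiguate(toy_segments, choose=lambda paths, context: len(paths) - 1)
```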


Probability of the Appearance of a Category in Context

\[ \frac{1}{n} \sum_{i=1}^{n} MI(c_0, c_i) \, PR(c_0 \mid c_i) \]

• where:
• MI: mutual information
• PR: conditional probability
• c0: current category
• ci: context category


Calculation of the most probable meaning

\[ \max \frac{1}{n} \sum_{i=1}^{n} PC_i \]

• where:
• PC: probability of the appearance of a category given a context
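The two scoring steps can be sketched in code under one possible reading of the slide formulas: PC of a category c0 is the average over the context categories ci of MI(c0, ci) times PR(c0 | ci), and a meaning is scored by the average PC of the categories in its ID list, with the highest-scoring meaning selected. This reading is an assumption, as are the function names and the toy category labels; `mi` and `pr` are passed in as functions so that any trained tables can be plugged in.

```python
def pc(c0, context, mi, pr):
    """Average of MI(c0, ci) * PR(c0 | ci) over all context categories ci."""
    if not context:
        return 0.0
    return sum(mi(c0, ci) * pr(c0, ci) for ci in context) / len(context)

def most_probable_meaning(paths, context, mi, pr):
    """Return the index of the meaning whose categories have the highest mean PC."""
    def score(path):
        return sum(pc(c, context, mi, pr) for c in path) / len(path)
    return max(range(len(paths)), key=lambda k: score(paths[k]))

# toy tables with hypothetical category labels (German "Rock": skirt vs. rock music)
mi_table = {("ROCK_MUSIC", "MUSIC"): 2.0, ("MUSIC_STYLE", "MUSIC"): 1.0}
pr_table = {("ROCK_MUSIC", "MUSIC"): 0.5, ("MUSIC_STYLE", "MUSIC"): 0.5}
mi = lambda a, b: mi_table.get((a, b), 0.0)
pr = lambda a, b: pr_table.get((a, b), 0.0)

paths = [["ROCK_SKIRT", "CLOTHING"], ["ROCK_MUSIC", "MUSIC_STYLE"]]
best = most_probable_meaning(paths, ["MUSIC"], mi, pr)
```

In the toy data the second meaning wins because only its categories have nonzero statistics with the context category.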


Example: Disambiguation

Folklore, Rock, Klassik und Jazz zu vermischen reicht ihnen nicht, sie nutzen die Elektronik und sind sogar dazu übergegangen, Instrumente selbst zu bauen. (Not content with merely mixing folk, rock, classical, and jazz, they also make use of electronics and have even gone so far as to build instruments themselves.)

Input (COOC format):

3002 Rock Rock NN [ 39981 ... 3228 ] [ 39981 ... 3228 ]
3004 Klassik Klassik NN [ 221503 ... 221266 ]
3008 vermischen vermischen VVINF [ 643704 643048 ]
3009 reicht reichen VVFIN [ 21538 ] [ 339847 307402 ] [ 581324 ... 568361 ] [ 581324 ... 862674 ] [ 581324 ... 912753 ] [ 586102 585849 ] [ 588150 ... 586261 ]
3016 Elektronik Elektronik NN [ 405356 ... 383322 ]
3023 Instrumente Instrument NN [ 5357 3228 ] [ 142311 ... 3228 ]
3026 bauen bauen VVINF [ 650176 647379 ] [ 742021 ... 734399 ] [ 743571 ... 734399 ] [ 743710 735354 734399 ]

Output:

3002 Rock Rock NN 2 Rock_0
3004 Klassik Klassik NN 1 Klassik_0
3008 vermischen vermischen VVINF 1 vermengen_0_vermischen_0
3009 reicht reichen VVFIN 7 reichen_0
3014 nutzen nutzen VVFIN 2 nutzen_2_nützen_2
3016 Elektronik Elektronik NN 1 Elektronik_0
3023 Instrumente Instrument NN 2 Musikinstrument_0_Instrument_2
3026 bauen bauen VVINF 4 bauen_3


Evaluation: Comparison

Test corpus

1017 Komponisten Komponist NN 1 Komponist_0_Komponistin_0

2010 Möglichkeiten Möglichkeit NN 2 Möglichkeit_2_Eventualität_0

14011 verfügbar verfügbar ADJD 0

14014 machen machen VVINF 6 betätigen_0_treiben_0_machen_0

24006 wirkt wirken VVFIN 6 wirken_2

Evaluation corpus (Negra/Lexsem corpus)

1017 Komponisten Komponist NN Komponist_0_Komponistin_0

2010 Möglichkeiten Möglichkeit NN Möglichkeit_2_Chance_0_Gelegenheit_0

14011 verfügbar verfügbar ADJD unknown

14014 machen machen VVINF unspec

24006 wirkt wirken VVFIN wirken_2


Meanings in the test corpus

[Bar chart: number of words (y-axis, 0 to 1000) for each number of meanings per word (x-axis, 1 to 29)]

Bar values for 1 to 29 meanings: 980, 485, 299, 159, 113, 97, 34, 39, 20, 14, 13, 19, 8, 2, 2, 16, 3, 2, 11, 5, 2, 0, 20, 0, 0, 0, 0, 0, 3

2346 words annotated with 3.1 meanings per word, 1366 of these ambiguous, with an average of 4.6 meanings
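The summary figures can be checked against the bar labels of the chart. The list below is read from the extracted labels for 1 to 29 meanings; splitting the fused label run "020" into 0 and 20 is an assumption, chosen because it makes the bars sum to the stated 2346 words.

```python
# words per number of meanings (index 0 = 1 meaning, index 28 = 29 meanings)
counts = [980, 485, 299, 159, 113, 97, 34, 39, 20, 14, 13, 19, 8, 2, 2,
          16, 3, 2, 11, 5, 2, 0, 20, 0, 0, 0, 0, 0, 3]

words = sum(counts)
meanings = sum(k * n for k, n in enumerate(counts, start=1))
ambiguous = words - counts[0]                 # words with more than one meaning
avg_all = meanings / words
avg_ambiguous = (meanings - counts[0]) / ambiguous

print(words, ambiguous, round(avg_all, 1), round(avg_ambiguous, 1))
```

Under this reading the bars reproduce all four figures quoted on the slide: 2346 words, 1366 ambiguous, 3.1 and 4.6 meanings on average.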


Results (3 Segments/Window)

Segment size                0 (sentence)       2       5       7      10      15
count                               1882
trivial                              773
hitcount                             703     586     688     718     724     720
incorrect                            347     523     421     349     357     366
not disambiguated                     59     210      52      42      28      23
Precision (all) [32.3%]            80.97   81.28   79.84   81.03   80.74   80.31
Precision (amb.) [21.7%]           66.95   52.84   62.04   67.30   66.98   66.30
Recall                             96.51   88.51   96.88   97.41   98.15   98.41
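The two precision rows can be reproduced from the counts. The definitions below are inferred from the numbers, not stated on the slide: precision over all words counts trivial (unambiguous) words as correct and excludes words that were not disambiguated, while precision over ambiguous words is hits divided by hits plus incorrect choices. The recall definition is not recoverable from the table, so it is left out. The function names are hypothetical.

```python
def precision_all(count, trivial, hit, not_disambiguated):
    """Percent correct among all words for which a decision was made (inferred definition)."""
    return 100 * (trivial + hit) / (count - not_disambiguated)

def precision_ambiguous(hit, incorrect):
    """Percent correct among ambiguous words that received a decision (inferred definition)."""
    return 100 * hit / (hit + incorrect)

# sentence-segmentation column of the results table
p_all = precision_all(count=1882, trivial=773, hit=703, not_disambiguated=59)
p_amb = precision_ambiguous(hit=703, incorrect=347)
print(f"{p_all:.2f} {p_amb:.2f}")
```

For the sentence column the four counts (773 + 703 + 347 + 59) add up to the total of 1882, and the two computed values agree with the table's 80.97 and 66.95.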


Summary

COOC:
• is the first unsupervised, corpus-based method of disambiguating semantically ambiguous words for German
• goes beyond n-gram statistics
• uses plain text, GermaNet, MMorph and a POS tagger
• is a tool for unsupervised learning, semantic tagging, and evaluation
• first evaluation gives 67.3% precision on ambiguous words (81% over all words) and 97.4% recall


Outlook

• Use of GermaNet 2 (but a hand-labeled evaluation corpus is still needed)

• Repeat experiment with WordNet and Penn Treebank Corpus

• Several experiments to determine optimal parameters

• Two theses:
  – lexical disambiguation
  – general predictions