Thesaurus Design (from analysed corpora)

Pablo Gamallo, Alexandre Agustini, G.P. Lopes

{gamallo,aagustini}@di.fct.unl.pt

GLINt (Grupo de Língua Natural), FCT, Universidade Nova de Lisboa

Thesaurus design: Linguistic goals

• fine / sanction
• president / secretary
• small / big
• ministry / minister
• bank / organisation

Thesaurus design: Properties

Distributional hypothesis: words sharing similar contexts are semantically related

Domain-specific corpus

Types of context:

• simple co-occurrence (bigrams)
• co-occurrence within a window (n-grams)
• syntactic structures
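The window-based type of context can be sketched as follows. This is an illustration only: the function name `window_contexts` and its `window` parameter are assumptions, and the rest of these slides uses syntactic contexts rather than windows. With `window=1` the contexts reduce to adjacent words, which is close to the bigram case.

```python
from collections import Counter


def window_contexts(tokens, window=2):
    """For every token, count the tokens found within `window` positions of it."""
    contexts = {}
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Neighbours to the left and to the right, excluding the word itself.
        neighbours = tokens[lo:i] + tokens[i + 1:hi]
        contexts.setdefault(word, Counter()).update(neighbours)
    return contexts


tokens = "clinton sent a clear message to the authorities of portugal".split()
ctx = window_contexts(tokens, window=1)
print(ctx["message"])
```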

Thesaurus design: Steps

Extraction of syntactic contexts from the corpus

Similarity measure between words (based on their syntactic contexts)

For each word, identify its most similar words

Extraction of syntactic contexts

Tagging (PoS tags)

Chunking (parsing into basic chunks)

Attachment heuristics

Identification of binary dependencies

Extraction of syntactic contexts

Tagging and chunking

Clinton sent a clear message to the authorities of Portugal

Tagger:

Clinton_N sent_V a_ART clear_ADJ message_N to_PREP the_ART authorities_N of_PREP Portugal_N

Chunking:

NP (Clinton)

VP (send)

NP (message, clear)

PP (to, NP(authority))

PP (of, NP(portugal))
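The tagger output above uses a `word_TAG` notation. A small sketch of how such a line could be read back into (word, tag) pairs is given below; the helper name `parse_tagged` is an assumption and not part of the authors' tools.

```python
def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs.

    The split is made on the last underscore, so words that themselves
    contain underscores keep them. A token without any underscore yields
    an empty word, so real input would need validation.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


tagged = ("Clinton_N sent_V a_ART clear_ADJ message_N to_PREP "
          "the_ART authorities_N of_PREP Portugal_N")
pairs = parse_tagged(tagged)
nouns = [word for word, tag in pairs if tag == "N"]
print(nouns)
```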

Attachment Heuristics and Syntactic Dependencies

• Attachment of Basic Chunks:

<NP(Clinton), VP(sent)>

<VP(sent), NP(message, clear)>

<NP(message, clear), PP(to, NP(authority))>

<NP(authority), PP(of, NP(portugal))>

• Binary Dependencies:

<SUBJ, send, Clinton>

<DOBJ, send, message>

<TO, message, authority>

<OF, authority, portugal>

Syntactic Contexts

<DOBJ, send, message>:

<DOBJ, send, (*)> <DOBJ, (*), message>

<TO, message, authority>:

<TO, message, (*)> <TO, (*), authority>

<OF, authority, portugal>:

<OF, authority, (*)> <OF, (*), portugal>
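Each binary dependency therefore yields two contexts, one for each of its words. A minimal sketch of that split is shown here; the `Dependency` tuple and the function name `syntactic_contexts` are assumptions introduced for illustration, and `"*"` stands for the `(*)` slot.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    relation: str
    head: str
    dependent: str


def syntactic_contexts(dep):
    """Return (word, context) pairs for one binary dependency.

    The word that fills the '*' slot is the word the context belongs to.
    """
    # <REL, head, (*)> is a context of the dependent word.
    context_of_dependent = (dep.relation, dep.head, "*")
    # <REL, (*), dependent> is a context of the head word.
    context_of_head = (dep.relation, "*", dep.dependent)
    return [(dep.dependent, context_of_dependent), (dep.head, context_of_head)]


for word, context in syntactic_contexts(Dependency("DOBJ", "send", "message")):
    print(word, context)
```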

Similarity Measure: Binary Jaccard coefficient

The similarity between two words is the ratio between the number of contexts that are common to both words and the total number of their contexts.
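As a sketch, the binary coefficient can be computed on sets of contexts. The function name `binary_jaccard` is an assumption, "total number of their contexts" is read here as the size of the union, and the two context sets are those listed for book and novel in the micro-corpus example later in these slides.

```python
def binary_jaccard(contexts_a, contexts_b):
    """Shared contexts divided by all distinct contexts of the two words."""
    a, b = set(contexts_a), set(contexts_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two words without contexts are not similar
    return len(a & b) / len(union)


book = {("DOBJ", "read", "*"), ("IOBJ-DE", "love", "*")}
novel = {("DOBJ", "read", "*")}
print(binary_jaccard(book, novel))
```

Frequencies play no role here: a context seen once counts as much as one seen many times.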

Similarity Measure: Weighted Jaccard coefficient

Micro-corpus

Pedro is reading a book and Maria is reading a book,

Pedro is reading a novel and Maria read a novel yesterday,

Pedro is reading a lot of things, but Pedro loves Maria,

Maria loves books, in fact Maria loves a lot of things.

Maria is eating an apple and Pedro is eating an apple too,

Pedro ate eggs yesterday, Pedro eats a lot of things,

Maria is eating eggs, Maria loves eggs a lot.

Thesaurus relations between nouns

• Pedro, Maria
• book, novel
• apple, egg
• thing: book, egg, apple, novel
• (book, egg)?
• (Maria, thing)??
• (Pedro, egg)???

Extracting syntactic contexts of nouns

• Pedro: (<SUBJ, read, (*)>, 3) (<SUBJ, love, (*)>, 1) (<SUBJ, eat, (*)>, 3)

• Maria: (<SUBJ, read, (*)>, 2) (<SUBJ, love, (*)>, 3) (<SUBJ, eat, (*)>, 2) (<IOBJ-DE, love, (*)>, 1)

• novel: (<DOBJ, read, (*)>, 2)

• book: (<DOBJ, read, (*)>, 3) (<IOBJ-DE, love, (*)>, 1)

• thing: (<DOBJ, read, (*)>, 1) (<DOBJ, eat, (*)>, 1) (<IOBJ-DE, love, (*)>, 1)

• apple: (<DOBJ, eat, (*)>, 2)

• egg: (<DOBJ, eat, (*)>, 2) (<IOBJ-DE, love, (*)>, 1)
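A table of this kind could be built by counting contexts over a list of dependencies. The sketch below is an assumption-laden illustration: the triples are written by hand for the first micro-corpus sentence only (the slides do not show the parser output), the name `count_contexts` is hypothetical, and only contexts with `(*)` in the dependent slot are counted, as on this slide.

```python
from collections import Counter, defaultdict


def count_contexts(dependencies):
    """Map each dependent word to a Counter of its <REL, head, (*)> contexts."""
    counts = defaultdict(Counter)
    for relation, head, dependent in dependencies:
        counts[dependent][(relation, head, "*")] += 1
    return counts


# Hand-written triples for "Pedro is reading a book and Maria is reading a book".
deps = [
    ("SUBJ", "read", "Pedro"),
    ("DOBJ", "read", "book"),
    ("SUBJ", "read", "Maria"),
    ("DOBJ", "read", "book"),
]
counts = count_contexts(deps)
for word, contexts in counts.items():
    print(word, dict(contexts))
```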

Computing the weight of a context for each word (1):

Pedro: (<SUBJ, read, (*)>, 3)
GW(<SUBJ, read, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
LW(Pedro, <SUBJ, read, (*)>) = log(3) = 0.47
W(Pedro, <SUBJ, read, (*)>) = 1.03

Pedro: (<SUBJ, love, (*)>, 1)
GW(<SUBJ, love, (*)>) = log(1/3 + 3/4) / log(2) = 0.034 / 0.3 = 0.11
LW(Pedro, <SUBJ, love, (*)>) = log(1) = 0
W(Pedro, <SUBJ, love, (*)>) = 0.11

Pedro: (<SUBJ, eat, (*)>, 3)
GW(<SUBJ, eat, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
LW(Pedro, <SUBJ, eat, (*)>) = log(3) = 0.47
W(Pedro, <SUBJ, eat, (*)>) = 1.03

Computing the weight of a context for each word (2):

Maria: (<SUBJ, read, (*)>, 2)
GW(<SUBJ, read, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
LW(Maria, <SUBJ, read, (*)>) = log(2) = 0.3
W(Maria, <SUBJ, read, (*)>) = 0.86

Maria: (<SUBJ, love, (*)>, 3)
GW(<SUBJ, love, (*)>) = log(1/3 + 3/4) / log(2) = 0.034 / 0.3 = 0.11
LW(Maria, <SUBJ, love, (*)>) = log(3) = 0.47
W(Maria, <SUBJ, love, (*)>) = 0.58

Maria: (<SUBJ, eat, (*)>, 2)
GW(<SUBJ, eat, (*)>) = log(3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
LW(Maria, <SUBJ, eat, (*)>) = log(2) = 0.3
W(Maria, <SUBJ, eat, (*)>) = 0.86

Maria: (<IOBJ-DE, love, (*)>, 1)
GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
LW(Maria, <IOBJ-DE, love, (*)>) = log(1) = 0
W(Maria, <IOBJ-DE, love, (*)>) = 0.31

Computing the weight of a context for each word (3):

novel: (<DOBJ, read, (*)>, 2)
GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
LW(novel, <DOBJ, read, (*)>) = log(2) = 0.3
W(novel, <DOBJ, read, (*)>) = 1.45

book: (<DOBJ, read, (*)>, 3)
GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
LW(book, <DOBJ, read, (*)>) = log(3) = 0.47
W(book, <DOBJ, read, (*)>) = 1.62

book: (<IOBJ-DE, love, (*)>, 1)
GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
LW(book, <IOBJ-DE, love, (*)>) = log(1) = 0
W(book, <IOBJ-DE, love, (*)>) = 0.31

Computing the weight of a context for each word (4):

thing: (<DOBJ, read, (*)>, 1)
GW(<DOBJ, read, (*)>) = log(2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
LW(thing, <DOBJ, read, (*)>) = log(1) = 0
W(thing, <DOBJ, read, (*)>) = 1.15

thing: (<DOBJ, eat, (*)>, 1)
GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
LW(thing, <DOBJ, eat, (*)>) = log(1) = 0
W(thing, <DOBJ, eat, (*)>) = 1.1

thing: (<IOBJ-DE, love, (*)>, 1)
GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
LW(thing, <IOBJ-DE, love, (*)>) = log(1) = 0
W(thing, <IOBJ-DE, love, (*)>) = 0.31

Computing the weight of a context for each word (5):

apple: (<DOBJ, eat, (*)>, 2)
GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
LW(apple, <DOBJ, eat, (*)>) = log(2) = 0.3
W(apple, <DOBJ, eat, (*)>) = 1.4

egg: (<DOBJ, eat, (*)>, 2)
GW(<DOBJ, eat, (*)>) = log(1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
LW(egg, <DOBJ, eat, (*)>) = log(2) = 0.3
W(egg, <DOBJ, eat, (*)>) = 1.4

egg: (<IOBJ-DE, love, (*)>, 1)
GW(<IOBJ-DE, love, (*)>) = log(1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
LW(egg, <IOBJ-DE, love, (*)>) = log(1) = 0
W(egg, <IOBJ-DE, love, (*)>) = 0.31
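The weight computations on these slides follow one pattern, which can be sketched in code. The formulas below are read off the worked examples rather than taken from a stated definition, so treat them as an assumption: GW sums, over the words that occur in a context, the frequency divided by that word's number of distinct contexts, then divides the base-10 log of the sum by the log of the number of such words; LW is the log of the frequency; W is their sum. The function names, the string keys such as `"SUBJ:read"` and the guard for contexts with fewer than two words are also assumptions. Exact arithmetic can give figures that differ from those on the slides, which use rounded and in places approximate intermediate values.

```python
import math

# Frequencies transcribed from the slide that lists the contexts of each noun.
COUNTS = {
    "Pedro": {"SUBJ:read": 3, "SUBJ:love": 1, "SUBJ:eat": 3},
    "Maria": {"SUBJ:read": 2, "SUBJ:love": 3, "SUBJ:eat": 2, "IOBJ-DE:love": 1},
    "novel": {"DOBJ:read": 2},
    "book": {"DOBJ:read": 3, "IOBJ-DE:love": 1},
    "thing": {"DOBJ:read": 1, "DOBJ:eat": 1, "IOBJ-DE:love": 1},
    "apple": {"DOBJ:eat": 2},
    "egg": {"DOBJ:eat": 2, "IOBJ-DE:love": 1},
}


def global_weight(context, counts):
    """GW(c): log of summed relative frequencies over log of the word count."""
    words = [w for w in counts if context in counts[w]]
    if len(words) < 2:
        return 0.0  # assumption: this case never occurs in the slides
    total = sum(counts[w][context] / len(counts[w]) for w in words)
    return math.log10(total) / math.log10(len(words))


def local_weight(word, context, counts):
    """LW(w, c): base-10 log of the frequency of the word in the context."""
    return math.log10(counts[word][context])


def weight(word, context, counts):
    """W(w, c) = GW(c) + LW(w, c), as in the worked examples."""
    return global_weight(context, counts) + local_weight(word, context, counts)


for word, contexts in COUNTS.items():
    for context in contexts:
        print(word, context, round(weight(word, context, COUNTS), 2))
```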

Similarity between words (1)

WJ(Pedro, Maria) = 2.17 / 2.61 = 0.83
min( (1.03+0.11+1.03), (0.86+0.58+0.86) ) = 2.17
max( (1.03+0.11+1.03), (0.86+0.58+0.86+0.31) ) = 2.61

WJ(book, novel) = 1.45 / 1.93 = 0.75
min( (1.45), (1.62) ) = 1.45
max( (1.45), (1.62+0.31) ) = 1.93

WJ(book, thing) = 1.58 / 2.69 = 0.58
min( (1.62+0.33), (1.27+0.31) ) = 1.58
max( (1.62+0.33), (1.27+0.31+1.1) ) = 2.69

Similarity between words (2)

WJ(apple, egg) = 1.4 / 1.71 = 0.81
min( (1.4), (1.4) ) = 1.4
max( (1.4), (1.4+0.31) ) = 1.71

WJ(apple, thing) = 1.1 / 2.68 = 0.41
min( (1.4), (1.1) ) = 1.1
max( (1.4), (1.27+0.31+1.1) ) = 2.68

WJ(egg, thing) = 1.41 / 2.68 = 0.51
min( (1.4+0.25), (1.1+0.31) ) = 1.41
max( (1.4+0.25), (1.27+0.31+1.1) ) = 2.68

WJ(novel, thing) = 1.1 / 2.68 = 0.41
min( (1.45), (1.1) ) = 1.1
max( (1.45), (1.27+0.31+1.1) ) = 2.68

Similarity between words (3)

WJ(Maria, thing) = 0.31 / 2.68 = 0.09
min( (0.31), (0.31) ) = 0.31
max( (0.86+0.58+0.86+0.31), (1.27+0.31+1.1) ) = 2.68

WJ(book, egg) = 0.31 / 1.93 = 0.16
min( (0.31), (0.31) ) = 0.31
max( (1.62+0.31), (1.4+0.31) ) = 1.93

WJ(Pedro, thing) = 0 / 2.62 = 0

WJ(novel, egg) = 0 / 1.65 = 0

WJ(book, apple) = 0 / 1.87 = 0

WJ(Maria, egg) = 0.31 / 2.61 = 0.11
min( (0.31), (0.31) ) = 0.31
max( (0.86+0.58+0.86+0.31), (1.4+0.31) ) = 2.61
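The similarity computations above share one pattern, sketched here under stated assumptions. The numerator is the smaller of the two words' summed weights over the contexts they share; the denominator is the larger of the two words' summed weights over all of their contexts. This follows the arithmetic of the worked examples and is not presented as a general definition of the weighted Jaccard coefficient. The function name `weighted_jaccard` and the string keys are hypothetical, and the weights are the rounded values from the weight slides.

```python
def weighted_jaccard(weights_a, weights_b):
    """Similarity of two words given their {context: weight} dictionaries."""
    common = weights_a.keys() & weights_b.keys()
    if not common:
        return 0.0
    shared = min(
        sum(weights_a[c] for c in common),
        sum(weights_b[c] for c in common),
    )
    total = max(sum(weights_a.values()), sum(weights_b.values()))
    if total == 0:
        return 0.0  # assumption: avoid dividing by zero when all weights are 0
    return shared / total


pedro = {"SUBJ:read": 1.03, "SUBJ:love": 0.11, "SUBJ:eat": 1.03}
maria = {"SUBJ:read": 0.86, "SUBJ:love": 0.58, "SUBJ:eat": 0.86, "IOBJ-DE:love": 0.31}
book = {"DOBJ:read": 1.62, "IOBJ-DE:love": 0.31}
novel = {"DOBJ:read": 1.45}
egg = {"DOBJ:eat": 1.4, "IOBJ-DE:love": 0.31}

print(round(weighted_jaccard(pedro, maria), 2))
print(round(weighted_jaccard(book, novel), 2))
print(weighted_jaccard(novel, egg))
```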

Similarity between words (Sorting)

(0.83) Pedro Maria
(0.81) apple egg
(0.75) book novel
(0.58) thing book
(0.51) thing egg
(0.41) thing apple, novel
(0.16) book egg
(0.11) Maria egg
(0.09) Maria thing
(0.0) Pedro egg
(0.0) novel egg
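The last step of the method, identifying the most similar words of each word, can be sketched from this sorted list. The scores are transcribed from the slide, with the two 0.0 pairs left out; the function name `most_similar` and the `top_n` parameter are assumptions.

```python
from collections import defaultdict

# Pair scores transcribed from the sorted list (zero-score pairs omitted).
SCORES = {
    ("Pedro", "Maria"): 0.83,
    ("apple", "egg"): 0.81,
    ("book", "novel"): 0.75,
    ("thing", "book"): 0.58,
    ("thing", "egg"): 0.51,
    ("thing", "apple"): 0.41,
    ("thing", "novel"): 0.41,
    ("book", "egg"): 0.16,
    ("Maria", "egg"): 0.11,
    ("Maria", "thing"): 0.09,
}


def most_similar(scores, top_n=2):
    """For each word, list its `top_n` neighbours by decreasing score."""
    neighbours = defaultdict(list)
    for (a, b), score in scores.items():
        neighbours[a].append((score, b))
        neighbours[b].append((score, a))
    result = {}
    for word, pairs in neighbours.items():
        # The sort is stable, so ties keep the order of the input list.
        ranked = sorted(pairs, key=lambda pair: -pair[0])
        result[word] = [other for _, other in ranked[:top_n]]
    return result


thesaurus = most_similar(SCORES)
for word, similar in thesaurus.items():
    print(word, similar)
```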

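Lists of similar words

Corpus "Procuradoria Geral da República" (P.G.R.)

juiz (judge) | {dirigente (leader), presidente (president), subinspector (sub-inspector), governador (governor), árbitros (referees)}

diploma (legal act) | {decreto (decree), lei (law), artigo (article), convenção (convention), regulamento (regulation)}

decreto (decree) | {diploma (legal act), lei (law), artigo (article), nº (no.), código (code)}

regulamento (regulation) | {estatuto (statute), código (code), sistema (system), decreto (decree), norma (norm)}

regra (rule) | {norma (norm), princípio (principle), regime (regime), legislação (legislation), plano (plan)}

renda (rent) | {caução (surety), indemnização (compensation), reintegração (reinstatement), multa (fine), quota (quota)}

conceito (concept) | {noção (notion), estatuto (statute), regime (regime), temática (subject matter), montante (amount)}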