
Using bilingual LSA for FN annotation of French text from generic resources

Guillaume Pitel - LORIA/LED

FR.FrameNet Project

Funded by France-Berkeley Fund


Outline

- The (small) FR.FrameNet project
- The projection problem
- Realizations
  - French Frames database
  - Annotated reference sub-corpus
  - English semantic clusters from FEs
  - Projection into French
- Other potential applications


The (small) FR.FrameNet project

- A Berkeley-Nancy collaboration
- Funded by the France-Berkeley Fund - ICSI, ATILF, LORIA
- French participants: Susanne Alt, Benoît Crabbé, Christiane Jadelot, Guillaume Pitel, Laurent Romary
- Setting the foundations for a cheap bootstrapping of a French FrameNet
  - Reusing existing French lexical semantic resources
  - Reusing any available resources
  - Focus on automatic methods


The projection problem

- Use a semantic lexicon in language A to annotate a corpus in language B
- Resulting data is expected to be of much lower quality than a handcrafted lexicon
- It is a bootstrapping process: requires manual correction
- Important question: does it really speed up the final production?


Pado & Lapata approach

- Using a Source language/Target language parallel corpus
- The Source side of the corpus must be FN-annotated
- The roles are projected in the Target corpus
- Train a statistical semantic role parser for the Target language
- Automatic annotation of any corpus in the Target language


Pado & Lapata approach

- Problems
  - translation is not frame-conserving in many cases (20-30%)
  - parallel corpora are a rare resource
  - Berkeley's FrameNet is not built on the English side of a parallel (//) corpus :(
- But very useful with a resource like Europarl


The main bottleneck

- Existence of parallel AND annotated corpora: rare and expensive to build
- But…
  - Annotated corpora are available
  - Parallel, aligned corpora are available


The Semantic Space based approach (using LSA)

- Pure semantic annotation
  - no grammatical function
  - no POS
- Use a bilingual LSA space to make the projection
- Preparation:
  - Find the lexical units in the Target language that fit each frame
    - Use an available resource
    - Compute them automatically
  - Compute the semantic clusters of each frame element


The Semantic Space based approach (using LSA)

- Usage: automatic preannotation (or selection)
  - For each sentence in the Target corpus
    - Find potential frames from LUs
    - Compare each word (or head of constituent) of the sentence with the computed semantic clusters of the (core) roles of the candidate frames (or the corresponding roles in parent frames if training data is missing)
    - Candidate Frames and FEs are rated by the semantic distance
- What we can expect
  - Can't deal with anaphora
  - Can't deal with FEs that are not semantically narrow
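The preannotation loop described on this slide could be sketched roughly as follows. This is only an illustration under assumptions: the function name `rate_candidates`, the data layout (plain dictionaries), the use of cosine similarity, the averaging of role scores, the second frame `Hypothetical_frame` and the toy 2-D vectors are all invented for the example and are not taken from the project.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rate_candidates(sentence, lu_to_frames, role_clusters, space):
    """Rate the candidate frames of one sentence (hypothetical helper).

    sentence      : list of lemmas
    lu_to_frames  : lemma -> frames it may evoke
    role_clusters : frame -> {role: [cluster centre vectors]}
    space         : lemma -> vector in the semantic space
    """
    ratings = {}
    for lemma in sentence:
        for frame in lu_to_frames.get(lemma, []):
            roles = {}
            for role, centres in role_clusters[frame].items():
                # Best (word, similarity) over the other words and all clusters of the role.
                roles[role] = max(
                    ((w, cosine(space[w], c))
                     for w in sentence if w != lemma and w in space
                     for c in centres),
                    key=lambda pair: pair[1],
                    default=(None, 0.0),
                )
            # Assumption: a frame's score is the mean of its best role similarities.
            score = sum(s for _, s in roles.values()) / max(len(roles), 1)
            ratings[frame] = (score, roles)
    return ratings

# Toy data, invented for illustration only.
space = {"eau": [0.9, 0.1], "seau": [0.1, 0.9]}
lu_to_frames = {"remplir": ["Filling", "Hypothetical_frame"]}
role_clusters = {
    "Filling": {"Theme": [[1.0, 0.0]], "Goal": [[0.0, 1.0]]},
    "Hypothetical_frame": {"X": [[-1.0, 0.0]]},
}
ratings = rate_candidates(["remplir", "seau", "eau"], lu_to_frames, role_clusters, space)
for frame, (score, roles) in ratings.items():
    print(frame, round(score, 3), {r: w for r, (w, _) in roles.items()})
```

In this toy setting the Filling frame is rated higher than the hypothetical competitor, and each core role is attached to the word closest to one of its clusters.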


Subprojects

- Convert frames to French
  - Using the ISC Semantic Atlas (built from 2 synonym dictionaries + a minimal FR//EN corpus)
- Annotation of reference subcorpus
  - 1000 sentences from Europarl
- Projection using LSA


Convert Frames to French


English LUs to French LUs

- For each Frame in Berkeley FrameNet
  - For each LU, find potential translations in French
    - Using Semantic ATLAS (Ploux & Ji, 2003) - other languages?
  - Compute the French "profile" of the Frame
  - Manually check that a lemma can actually evoke the frame (pure subjective judgment)
- Frame-by-frame procedure
- Must be validated later by corpus evidence
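As a rough illustration of the "profile" step, the sketch below collects the candidate French lemmas for a frame from a translation lookup. The function `french_profile` and the plain dictionary standing in for the Semantic Atlas lookup are assumptions made for this example; the translations themselves are copied from the "Translations 1/4" slide.

```python
def french_profile(frame_lus, en_to_fr):
    """Map each candidate French lemma to the English LUs that produced it."""
    profile = {}
    for lu in frame_lus:
        lemma = lu.rsplit(".", 1)[0]          # "adorn.v" -> "adorn"
        for fr in en_to_fr.get(lemma, []):
            profile.setdefault(fr, []).append(lu)
    return profile

# A few LUs of the Filling frame with the translations listed in the slides.
filling_lus = ["adorn.v", "anoint.v", "embellish.v", "douse.v"]
en_to_fr = {
    "adorn": ["chamarrer", "embellir", "enjoliver", "orner", "parer", "revêtir"],
    "anoint": ["oindre"],
    "embellish": ["broder", "embellir", "enjoliver", "orner"],
    # "douse" has no translation in the lookup ("???" on the slide).
}

profile = french_profile(filling_lus, en_to_fr)
untranslated = [lu for lu in filling_lus if lu.rsplit(".", 1)[0] not in en_to_fr]

for fr, sources in sorted(profile.items()):
    print(fr, sources)
print("no translation found:", untranslated)
```

A lemma reached from several English LUs (here `embellir`, `enjoliver`, `orner`) might be a stronger candidate, but the manual check described above remains the deciding step.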


Lexical units in “Filling” Frame

adorn.v, anoint.v, asphalt.v, brush.v, butter.v, coat.v, cover.v, cram.v, crowd.v, dab.v, daub.v, douse.v, drape.v, drizzle.v, dust.v, embellish.v, fill.v, flood.v, gild.v, glaze.v, hang.v, heap.v, inject.v, jam.v, load.v, pack.v, paint.v, panel.v, pave.v, pile.v, plant.v, plaster.v, pump.v, scatter.v, seed.v, shower.v, smear.v, sow.v, spatter.v, splash.v, splatter.v, spray.v, spread.v, sprinkle.v, squirt.v, strew.v, stuff.v, suffuse.v, surface.v, tile.v, varnish.v, wallpaper.v, wrap.v


Translations 1/4

Adorn: chamarrer, embellir, enjoliver, orner, parer, revêtir
Anoint: oindre
Asphalt: asphalter, bitumer
Brush: badigeonner, brosser, effleurer
Butter: beurrer
Coat: empâter, enduire, enrober, revêtir
Cover: badigeonner, barbouiller, couvrir, franchir, gainer, garnir, habiller, monter, parcourir, quadriller, recouvrir, revêtir, saillir, se couvrir, subvenir, tapisser
Cram: bachoter, bâfrer, bûcher, chauffer, engraisser, lester, potasser
Crowd: foule (should be also peupler)
Dab: bassiner, tamponner, toucher
Daub: badigeonner, barbouiller, peinturlurer
Douse: ???
Drape: draper
Drizzle: brouillasser, bruiner, crachiner, pleuvasser, pleuviner
Dust: enlever la poussière, essuyer, poussière, saupoudrer, épousseter
Embellish: broder, embellir, enjoliver, orner
Fill: appliquer un enduit, boucher, bourrer, calfeutrer, combler, devenir plein, emplir, enfler, fourrer, garnir, gonfler, gorger, imprégner, lester, mastiquer, meubler, obturer, occuper, peupler, plomber, pourvoir, pourvoir à, pénétrer, remplir, s'enfler, se gonfler, se peupler, se remplir


Manual selection 1/4

Adorn: chamarrer, embellir, enjoliver, orner, parer, revêtir
Anoint: oindre
Asphalt: asphalter, bitumer
Brush: badigeonner, brosser, effleurer
Butter: beurrer
Coat: empâter, enduire, enrober, revêtir
Cover: badigeonner, barbouiller, couvrir, franchir, gainer, garnir, habiller, monter, parcourir, quadriller, recouvrir, revêtir, saillir, se couvrir, subvenir, tapisser
Cram: bachoter, bâfrer, bûcher, chauffer, engraisser, lester, potasser
Crowd: foule (should be also peupler)
Dab: bassiner, tamponner, toucher
Daub: badigeonner, barbouiller, peinturlurer
Douse: ???
Drape: draper
Drizzle: brouillasser, bruiner, crachiner, pleuvasser, pleuviner
Dust: enlever la poussière, essuyer, poussière, saupoudrer, épousseter
Embellish: broder, embellir, enjoliver, orner
Fill: appliquer un enduit, boucher, bourrer, calfeutrer, combler, devenir plein, emplir, enfler, fourrer, garnir, gonfler, gorger, imprégner, lester, mastiquer, meubler, obturer, occuper, peupler, plomber, pourvoir, pourvoir à, pénétrer, remplir, s'enfler, se gonfler, se peupler, se remplir


Frame building: Conclusion

- Quite inexpensive compared to building frames by introspection from scratch or with a corpus-based approach (Filling is a big frame with a lot of LUs; it took me ~30 min to select good instances, with manual color setting)
- Probably far from perfect coverage, low precision
- Need several annotators to duplicate the work


Our approach to cross-language semantic annotation

- The goal:
  - A lemma can be related to several Frames
  - We want to disambiguate between the possible choices
  - And also try to attribute roles (at least core roles) once we have made the choice
  - All of this in French, while we have the training data in English


Bilingual LSA approach


Latent Semantic Analysis

- Improvement of cooccurrence matrices
- Reduce the number of dimensions
- Example:
  - A occurs in documents (or contexts) 1, 2, 3
  - B in 2, 3, 4, 5
  - C in 4, 5, 6
  - A and C never occur in the same document
  - LSA would allow documents 1-6 to be reduced to one dimension
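The A/B/C example can be reproduced with a truncated singular value decomposition, which is the usual core of LSA. The sketch below uses NumPy (an assumption; the slides name Infomap-nlp as the actual tool) and a binary term-document matrix built directly from the example.

```python
import numpy as np

# Rows: terms A, B, C; columns: documents 1..6 (1 = the term occurs in the document).
M = np.array([
    [1, 1, 1, 0, 0, 0],   # A
    [0, 1, 1, 1, 1, 0],   # B
    [0, 0, 0, 1, 1, 1],   # C
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# In the raw document space, A and C share no document.
raw_ac = cosine(M[0], M[2])

# Keep only the k strongest latent dimensions.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 1
reduced = U[:, :k] * S[:k]            # term coordinates in the reduced space
reduced_ac = cosine(reduced[0], reduced[2])

print("A~C in document space:", raw_ac)
print("A~C in reduced space :", reduced_ac)
```

A and C are orthogonal in the original six-dimensional space, but because both co-occur with B they land on the same side of the single retained dimension, so they become similar after the reduction.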


Evaluating the semantic position of Frame Elements in LSA

- Computing an English LSA space
  - Tools: TreeTagger + Infomap-nlp
  - Corpus: BNC + English part of Europarl + translation of Balzac
  - POS+lemma: “NNyear”
  - Keep only Verbs, Adjectives, Nouns, Adverbs
  - Other combinations (no POS, all POS, raw form) don’t perform as well


Example

- FE’s annotations for Filling.Theme
  - with water.
  - with a fungicide such as green or yellow sulphur.
  - with a soft brush and malathion dust.
  - with a little cayenne pepper.
  - …
- Terms used for the FE’s representation
  - NNwater;NNfungicide;JJsuch;JJgreen;JJyellow;NNsulphur;JJsoft;NNbrush;NNmalathion;NNdust;JJlittle;NNcayenne;NNpepper
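A minimal sketch of how such POS+lemma terms could be derived from tagger output. It assumes the text has already been tagged and lemmatised into `(tag, lemma)` pairs with Penn-style tags, and that tags are truncated to two letters; the exact tagset and normalisation used in the project are not given in the slides, and `to_terms` is a hypothetical helper.

```python
# Assumed coarse tags for nouns, verbs, adjectives and adverbs.
KEPT = ("NN", "VB", "JJ", "RB")

def to_terms(tagged):
    """Turn (pos_tag, lemma) pairs into 'POSlemma' terms, dropping other categories."""
    terms = []
    for pos, lemma in tagged:
        coarse = pos[:2]                 # e.g. "NNS" -> "NN"
        if coarse in KEPT:
            terms.append(coarse + lemma.lower())
    return terms

# "with a soft brush and malathion dust." (hand-tagged for the example)
tagged = [
    ("IN", "with"), ("DT", "a"), ("JJ", "soft"), ("NN", "brush"),
    ("CC", "and"), ("NN", "malathion"), ("NN", "dust"),
]
terms = to_terms(tagged)
print(";".join(terms))
```

Function words such as "with", "a" and "and" are dropped, which matches the term list shown for Filling.Theme.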


Evaluating FE’s semantic coherence

- Compute the semantic center of the FE = center of each FE term’s position
- Find the N nearest neighbors of this center
- If the center is in a semantically coherent region, the average similarity between neighbors and center is high.
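The coherence measure can be sketched as follows. The assumptions here are that the center is the mean of length-normalised term vectors, that similarity is the cosine, and that the toy 3-D vectors stand in for a real LSA space; `coherence` is a hypothetical helper, not the project's code.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def coherence(fe_terms, space, n=10):
    """Return (average similarity, neighbours) for the semantic centre of an FE."""
    centre = unit(np.mean([unit(space[t]) for t in fe_terms if t in space], axis=0))
    # Cosine similarity of every word in the space with the centre, best first.
    sims = sorted(((float(np.dot(unit(vec), centre)), w) for w, vec in space.items()),
                  reverse=True)
    top = sims[:n]
    average = sum(s for s, _ in top) / len(top)
    return average, [(w, s) for s, w in top]

# Toy space, invented for illustration.
space = {
    "water": [1.0, 0.1, 0.0], "oil": [0.9, 0.2, 0.0], "paint": [1.0, 0.0, 0.1],
    "monday": [0.0, 1.0, 0.0], "he": [0.0, 0.0, 1.0],
}
tight, tight_neighbours = coherence(["water", "oil", "paint"], space, n=3)
loose, _ = coherence(["water", "monday", "he"], space, n=3)
print(round(tight, 3), round(loose, 3))
```

An FE whose terms sit in one region (the first call) gets a much higher average than an FE whose terms are scattered (the second call), which is the contrast the slide describes.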


FEs of Filling

Frame.FE            Average    Std        Min       Max       Nb annot
Filling.Agent       0.604941   0.0413504  0.563591  0.717469  279
Filling.Cause
Filling.Degree      0.595513   0.0431123  0.552401  0.697830  4
Filling.Depictive   0.683302   0.0502735  0.633029  0.804053  1
Filling.Goal        0.6483     0.0510976  0.597202  0.793063  543
Filling.Instrument  0.646028   0.0715617  0.574466  0.844308  4
Filling.Manner      0.647012   0.0795992  0.567413  0.896142  25
Filling.Means       0.67356    0.0502949  0.623265  0.820630  1
Filling.Path        0.708096   0.069683   0.638413  0.925448  2
Filling.Place       0.562765   0.0364663  0.526299  0.683526  2
Filling.Purpose     0.631099   0.0585047  0.572594  0.761788  5
Filling.Result      0.734567   0.0585102  0.676057  0.825459  37
Filling.Source      0.611222   0.0447367  0.566485  1.000000  1
Filling.Subregion   0.782659   0.0756196  0.707039  0.944916  2
Filling.Theme       0.747146   0.0485786  0.698567  0.890307  450
Filling.Time        0.474269   0.0474972  0.426772  0.628049  16


Neighbors of Filling.Theme

powder 0.890307, spray 0.836283, dry 0.821666, crushed 0.820905, charcoal 0.813571, plastic 0.806768, copper 0.804459, paste 0.802643, foam 0.802201, brush 0.799847, …

Computed from: with fake diamonds. with pictures of cute white bunnies. with jewels and fine gowns. with one of these pegs. with pictures , flowers , and messages of peace. with wreaths of flowers and garlands of feathers. with the finest furniture from a firm in London 's New Bond Street. with a crown. with beautifully hooked melodies and harmonies. with chrism , the sacred ointment ,. with gel. with such a leaden armour of expectations. with the poison. with these substances. with vaseline. with his pungent urine. with holy oil. in bulb fibre. in whipped cream and honey. with a foot of topsoil. with her hand. …


Neighbors of Filling.Agent

oliver 0.717469, jack 0.696716, joe 0.691628, marie 0.686812, harry 0.684113, charlie 0.681887, billy 0.680378, tom 0.678887, jane 0.676179, rose 0.669748, …

Computed from: Your man. I. They. The priests. He. the wife of Cnut 's henchman Tofi the Proud. The Reclusiarch. she. What father. The Indians. Over 200 species of birds. He. He. Father Peter. Viktor. by ecclesiastics. We. One girl. She. she. he. the white gravel. the reluctant soldier. I. Eva. he. Two people. he. the good beachcombers. Sylvester. he. He. Two girls. you. Cecil Beaton. you. Larsen. you. He. you. you. He. he. she. Mina and K. She. you. she. the programme that turns the cameras on teenagers and let's them do the talking and the interviews. Baldwin. by Molly Fletcher. She. I. They. she. Endill. They. He. the BBC and official propaganda…


FEs’ clusters

- Grouping terms of the FE by minimal distance (arbitrarily set), i.e. 0.8 = 74°
- Keeping clusters with more than 5% of terms

http://guillaume.work.free.fr/Frames.en.3
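The slides do not say which grouping algorithm was used, so the following is only one possible reading: link two terms whenever their cosine similarity reaches the threshold (0.8), take connected groups as clusters (single linkage), and discard groups holding too small a share of the FE's terms. The function `cluster_terms`, the toy vectors and the larger share used in the demonstration are assumptions.

```python
import numpy as np

def cluster_terms(terms, vectors, min_sim=0.8, min_share=0.05):
    """Single-linkage grouping of terms by cosine similarity (assumed method)."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T                       # pairwise cosine similarities
    parent = list(range(len(terms)))     # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if sims[i, j] >= min_sim:
                parent[find(i)] = find(j)

    groups = {}
    for i, term in enumerate(terms):
        groups.setdefault(find(i), []).append(term)
    # Keep only clusters holding more than min_share of all terms.
    return [g for g in groups.values() if len(g) > min_share * len(terms)]

# Toy vectors, invented for illustration.
terms = ["tin", "pot", "jar", "wall", "floor", "idea"]
vectors = [[1, 0.1], [0.9, 0.2], [1, 0], [0, 1], [0.1, 1], [-1, -1]]
# With only six terms, a 5% share would keep singletons, so the demo uses 25%.
clusters = cluster_terms(terms, vectors, min_sim=0.8, min_share=0.25)
print(clusters)
```

An FE whose terms never reach the threshold with enough other terms ends up with no cluster at all under this filter.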


Clusters of Filling frame

Agent: 2 cluster(s)
Degree: 4 cluster(s)
Depictive: 6 cluster(s)
Goal: 2 cluster(s)
Instrument: 6 cluster(s)
Manner: 2 cluster(s)
Means: 2 cluster(s)
Path: 1 cluster(s)
Place: 5 cluster(s)
Purpose: 1 cluster(s)
Result: 2 cluster(s)
Source: 1 cluster(s)
Subregion: 1 cluster(s)
Theme: 2 cluster(s)
Time: 0 cluster(s)


Clusters Filling.Agent

rachel 0.867663, sara 0.863332, ellen 0.856612, lily 0.855513, sally 0.853933, alice 0.849205, emily 0.847480, dad 0.845598, jenny 0.844003, kate 0.839664, maggie 0.836391

tom 0.924026, john 0.908828, hugh 0.898049, michael 0.897622, scott 0.892861, sir 0.891623, david 0.889539, frank 0.889324, murray 0.879660, anthony 0.879149, geoffrey 0.876748


Clusters Filling.Goal

tin 0.924426, pot 0.908988, jar 0.908169, cake 0.893367, bottle 0.888083, bag 0.871596, jug 0.866099, bowl 0.860658, basket 0.858857, plastic 0.852992, dish 0.846176, peel 0.834313

wall 0.911646, wooden 0.864492, entrance 0.851708, front 0.846124, floor 0.834214, porch 0.834039, staircase 0.827131, roof 0.823297, rear 0.815847, corner 0.815765, rear 0.813187, front 0.813136


Clusters Filling.Theme

powder 0.913015, salt 0.907773, dry 0.900202, aromatic 0.886529, vegetable 0.870903, spray 0.867004, bean 0.860508, herb 0.858321, meat 0.852165, apple 0.848998, vinegar 0.848045, pea 0.845492

shiny 0.915945, red 0.908281, pink 0.905748, tint 0.900729, grey 0.899490, yellow 0.882565, blue 0.882097, white 0.877434, ribbon 0.876266, brown 0.875334, pale 0.875016, silk 0.865824


Projection

- Compute French clusters from English clusters
- Corpus collection
  - Europarl (French-English) //
  - French-English Balzac from Project Gutenberg
  - French//English: 50M lemmas
  - Shakespeare, Hansard Corpus to be included
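The slides do not spell out how a French cluster is derived from an English one. One plausible reading, sketched here purely as an assumption, is to take the center of the English cluster in the bilingual space and keep the French lemmas nearest to it. The function `project_cluster`, the toy vectors and the value of `k` are all hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def project_cluster(en_cluster, space, fr_vocab, k=3):
    """French lemmas closest to the centre of an English cluster (assumed method)."""
    centre = unit(np.mean([unit(space[w]) for w in en_cluster if w in space], axis=0))
    scored = [(float(np.dot(unit(space[w]), centre)), w) for w in fr_vocab if w in space]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

# Toy bilingual space, invented for illustration.
space = {
    "eat": [1.0, 0.1], "drink": [0.9, 0.3], "river": [0.0, 1.0],
    "manger": [1.0, 0.12], "boire": [0.88, 0.3], "fleuve": [0.05, 1.0],
}
fr_vocab = ["manger", "boire", "fleuve"]
projected = project_cluster(["eat", "drink"], space, fr_vocab, k=2)
print(projected)
```

With real data the quality of such a projection depends entirely on how close translation pairs actually are in the bilingual space.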


Training data

- Lemmas interleaved on a sentence alignment basis
- Training with a larger window
- Only parallel corpus: experiments that introduce bits of pure monolingual corpus show a quality loss
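A small sketch of what interleaving lemmas of aligned sentences might look like. The round-robin order and the handling of unequal sentence lengths are assumptions, since the slides only state that lemmas are interleaved per aligned sentence pair; `interleave` is a hypothetical helper.

```python
from itertools import zip_longest

def interleave(en_lemmas, fr_lemmas):
    """Alternate the lemmas of two aligned sentences, keeping the longer one's tail."""
    mixed = []
    for en, fr in zip_longest(en_lemmas, fr_lemmas):
        if en is not None:
            mixed.append(en)
        if fr is not None:
            mixed.append(fr)
    return mixed

# One aligned sentence pair, invented for illustration.
aligned = [(["river", "flow"], ["fleuve", "couler", "vite"])]
training = [interleave(en, fr) for en, fr in aligned]
print(training[0])
```

Because word order differs between the two languages, a translation pair is only roughly adjacent in the mixed sequence, which is presumably why a larger co-occurrence window is used for training.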


Similarity between translations in the Bilingual Semantic Space

Results:
- eat / manger: 0.98 (32°)
- fleuve / river: 0.94 (55°)
- green / vert: 0.83 (92°)
- bleu / blue: 0.87 (81°)
- eat / fleuve: 0.77 (107°)
- drink / écran: 0.82 (96°)


Neighborhood in Bilingual Semantic Space

Eat/Manger

eat: 0.976250
manger: 0.976250
consommer: 0.823532 (consume)
boire: 0.818577 (drink)
feed: 0.784077
fumeur: 0.777815 (smoker)
consume: 0.775385
fumer: 0.757367 (to smoke)
cream: 0.744859


Neighborhood in Bilingual Semantic Space

Fleuve/River

river: 0.938150
fleuve: 0.938150
coastline: 0.810345
rivière: 0.807991
alp: 0.801064
sea: 0.774821
lake: 0.771523
coast: 0.761910
littoral: 0.756541 (seashore)
bassin: 0.755235 (basin)


Neighborhood in Bilingual Semantic Space

Vert/Green

vert: 0.825634
green: 0.825634
green: 0.748835
biotechnology: 0.745683
mandelkern: 0.675176
hatch: 0.664682
taslima: 0.633138
cote: 0.628252
converter: 0.624423
orée: 0.616550 (forest border)
hydrogen: 0.611002


Projection: Conclusion

- Projecting whole clusters gives variable results
- Results in the projection are very disappointing
  - Unusable in this state
  - Seems that it may simply come from alignment mistakes
- Can we improve the projected clusters with a bilingual dictionary?
- Relating clusters to Synsets? Not necessarily a good idea: Champagne and Caviar are not related in WN
- More generally, “simple” translation may cause undesired broadening of the cluster


Potential application

- Statistical processing is interesting because it can capture “usage-based” regularities
- Clusters built with LSA can be interesting information sources for the lexicographer
- They can also more simply be used to automatically find new semantic types/selectional preferences emerging from the annotation of a new domain (frequently occurring metaphors, for instance)
- In a multilingual, collaborative annotation task, they could be useful for transferring work between languages without requiring annotation of a parallel corpus.