Unsupervised acquisition of verb subcategorization frames from shallow-parsed corpora
Alessandro Lenci (Università di Pisa, Italy)
Barbara McGillivray (ILC-CNR / Università di Pisa, Italy)
Simonetta Montemagni (ILC-CNR, Italy)
Vito Pirrelli (ILC-CNR, Italy)
Outline
1. Subcategorization acquisition
2. MDL verb clustering
1. Subcategorization acquisition: summary
• Previous work
• Our acquisition process
• Evaluation of results
Previous work (1)
• Brent, 1991; Ushioda et al., 1993; Briscoe & Carroll, 1997; Korhonen, 2002
• These approaches presuppose a battery of predefined frames
• Problem: for some languages no such SCF repertoire is available yet
Previous work (2)
• alternative: acquisition process as a “SCF discovery” process in corpora
• Basili et al., 1997; Zeman & Sarkar, 2000; Alonso et al., 2007; Bourigault & Frérot, 2005
• we present a variation of this “discovery approach” to SC acquisition for Italian verbs
Our SC extraction method
• simply requires a “chunked” corpus and a limited number of search heuristics that do not rely on any previous knowledge about SCFs
– languages other than English
– a looser notion of SCF including typical verb modifiers and strongly selected arguments
The acquisition process
0. Experimental setting
– chunked PAROLE Corpus
• Italian general corpus
• 3 million word tokens
• chunked with CHUG-IT
– 47 communication verbs
The acquisition process (step 1)
1. extraction of verb local contexts (SLCs) from chunked texts
• Ex.: [N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico]
‘the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session’
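A minimal sketch of step 1, assuming the local context of a verb is the sequence of chunks to the right of the finite verb chunk (the slides do not spell this out). The function names `parse_chunks` and `local_context` are hypothetical and are not part of CHUG-IT.

```python
import re

# Matches one chunk such as "[P_C a Tokio]": a label, whitespace, then the words.
CHUNK_RE = re.compile(r"\[(\w+)\s+([^\]]+)\]")

def parse_chunks(line):
    """Return a list of (label, text) pairs from a bracketed chunk string."""
    return [(label, text.strip()) for label, text in CHUNK_RE.findall(line)]

def local_context(chunks, verb_label="FV_C"):
    """Return the chunks that follow the first finite verb chunk (assumed SLC)."""
    for i, (label, _) in enumerate(chunks):
        if label == verb_label:
            return chunks[i + 1:]
    return []

sentence = ("[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] "
            "[I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] "
            "[N_C il massimo storico]")
chunks = parse_chunks(sentence)
context = local_context(chunks)
print([label for label, _ in context])
# ['P_C', 'P_C', 'I_C', 'P_C', 'P_C', 'N_C']
```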
The acquisition process (step 2)
2. Context carving: linguistically motivated criteria select only those chunks that are in the dependency scope of the verb v → noise is minimized
• Ex.:
[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico]
‘the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session’
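The carving criteria themselves are not listed on the slides. Purely as an illustration, the sketch below assumes one simple criterion: the context is cut at the first chunk that opens a new clause. The set `CLAUSE_BOUNDARY_LABELS` and the function `carve` are hypothetical.

```python
# Hypothetical stop labels: chunk types assumed to open a new clause.
CLAUSE_BOUNDARY_LABELS = {"FV_C", "I_C"}

def carve(context, boundary_labels=CLAUSE_BOUNDARY_LABELS):
    """Keep only the chunks that precede the first clause-opening chunk."""
    carved = []
    for label, text in context:
        if label in boundary_labels:
            break
        carved.append((label, text))
    return carved

# Right context of "ha chiuso" in the example sentence.
context = [("P_C", "a Tokio"), ("P_C", "a 120"), ("I_C", "dopo aver toccato"),
           ("P_C", "nel corso"), ("P_C", "della seduta"),
           ("N_C", "il massimo storico")]
carved_context = carve(context)
print(carved_context)
# [('P_C', 'a Tokio'), ('P_C', 'a 120')]
```

Under this assumed rule the chunks after "dopo aver toccato" are discarded because they depend on the infinitive rather than on "ha chiuso".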
The acquisition process (step 3)
3. induction of potential subcategorization frames (PSF)
a. Assumption: all contextual chunks occurring immediately after the verb are very likely governed by it → potentially subcategorized slots (PSS)
b. Frequency filter on PSSs
c. an SLC is eligible as a PSF if its contextual chunks belong to the list of selected PSSs
d. Frequency filter on PSFs
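Steps a to d can be sketched as follows. The two thresholds are hypothetical, the toy contexts are invented, and reducing each SLC to its leading run of selected slots is only one reading of how SLCs map to PSFs; the slides do not give the exact rule.

```python
from collections import Counter

def induce_frames(slcs, pss_min=0.08, psf_min=0.08):
    """slcs: list of tuples of chunk labels observed after one verb."""
    n = len(slcs)
    # (a) the chunk immediately after the verb ("" stands for an empty context)
    first = Counter(slc[0] if slc else "" for slc in slcs)
    # (b) frequency filter on slots (threshold is an assumption)
    pss = {slot for slot, count in first.items() if count / n >= pss_min}
    # (c) reduce each SLC to its leading run of selected slots (assumed rule);
    #     a context that starts with an unselected slot reduces to the empty frame
    psf_counts = Counter()
    for slc in slcs:
        frame = []
        for label in slc:
            if label not in pss:
                break
            frame.append(label)
        psf_counts[tuple(frame)] += 1
    # (d) frequency filter on frames (threshold is an assumption)
    psf = {f: c / n for f, c in psf_counts.items() if c / n >= psf_min}
    return pss, psf

# Invented toy contexts for a single verb, for illustration only.
toy_slcs = ([()] * 6 + [("N_C",)] * 6 + [("N_C", "ADJ_C")] * 3
            + [("CHE_C",)] * 2 + [("I_C-di",)] * 2 + [("ADV_C",)] * 1)
pss, psf = induce_frames(toy_slcs)
print(sorted(pss))
print(psf)
```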
The acquisition process (step 3)
Verb: accettare ‘accept’
PSS: [ ], [CHE_C], [I_C-di], [N_C]

SLC                            PSF         Rel. freq.
[ ]                            [ ]         0.33
[CHE_C]                        [CHE_C]     0.05
[I_C-di]                       [I_C-di]    0.13
[N_C]                          [N_C]       0.45
[N_C][ADJ_C]                   [N_C]       0.45
[N_C][ADJPART_C]               [N_C]       0.45
[N_C][di_C]                    [N_C]       0.45
[N_C][NA_C]                    [N_C]       0.45
[N_C][P_C-a]                   [N_C]       0.45
[N_C][P_C-di]                  [N_C]       0.45
[N_C][P_C-di][ADJ_C]           [N_C]       0.45
[N_C][P_C-di][ADJPART_C]       [N_C]       0.45
Evaluation of results - Italian
• Evaluation of our SCF induction method
– extracted carved contexts: baseline (step 2)
– induced subcat frames (step 4)
o type precision: \( P = \frac{\text{correctly acquired frames}}{\text{all acquired frames}} \)
o type recall: \( R = \frac{\text{correctly acquired frames}}{\text{all frames in the gold standard}} \)
o F-measure: \( F = \frac{2 \cdot P \cdot R}{P + R} \)
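The three measures can be computed over sets of frame types as in the sketch below. The frame inventories `acquired` and `gold` are invented for illustration; the function names are hypothetical.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def type_prf(acquired, gold):
    """Type precision, recall and F-measure over two sets of frame types."""
    acquired, gold = set(acquired), set(gold)
    correct = acquired & gold
    p = len(correct) / len(acquired) if acquired else 0.0
    r = len(correct) / len(gold) if gold else 0.0
    return p, r, f_measure(p, r)

# Invented frame sets for one verb, for illustration only.
acquired = {"[N_C]", "[CHE_C]", "[I_C-di]", "[P_C-a]"}
gold = {"[N_C]", "[CHE_C]", "[P_C-per]"}
p, r, f = type_prf(acquired, gold)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

For example, `f_measure(0.52, 0.78)` gives about 0.62, the F value that corresponds to P = 52% and R = 78%.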
Evaluation - Italian (2)
• carried out against three gold standards, plus a manual evaluation
1. IGS1: a general purpose computational lexicon (SIMPLE-PAROLE-CLIPS lexicon)
2. IGS2: Italian dictionary (Sabatini-Coletti 2006)
3. IGS3: merging IGS1 and IGS2
4. Manual evaluation
Evaluation - Italian (3)
               IGS1   IGS2   IGS3   4 (manual)
SCFs       P   42%    30%    52%    93%
           R    8%    84%    78%    NA
           F   13%    44%    62%    NA
baseline   P   23%    13%    27%    40%
           R   72%    68%    75%    NA
           F   35%    22%    38%    NA
Evaluation - English
• four gold standards
1. EGS1: general purpose computational lexicon (Valex5 Lexicon)
2. EGS2: Longman Dictionary (2006)
3. EGS3: biomedical English lexicon (SPECIALIST Lexicon)
4. EGS4: merging EGS1, EGS2 and EGS3
Evaluation – English (2)
               EGS1+EGS2   EGS3   EGS4
SCFs       P   69%         52%    83%
           R   48%         54%    51%
           F   57%         53%    63%
baseline   P   28%         17%    33%
           R   52%         49%    53%
           F   36%         25%    41%
2. Verb clustering: summary
• The MDL Principle
• Verb clustering using MDL
Why verb clustering?
• syntax-semantics lexical interface
• starting from the SCFs extracted, we aim at inducing clusters of verbs that share similar semantic properties
• each verb is represented as a vector whose dimensions report its statistical distribution with the automatically extracted SCFs
• a clustering of verb vectors is performed using the Minimum Description Length Principle (MDL)
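A minimal sketch of the vector representation, assuming each dimension holds the relative frequency of one SCF for the verb. The counts are invented for illustration and the function names are hypothetical.

```python
def frame_distribution(frame_counts):
    """Turn raw SCF counts for one verb into relative frequencies."""
    total = sum(frame_counts.values())
    return {frame: count / total for frame, count in frame_counts.items()}

def to_vector(distribution, dimensions):
    """Order the distribution along a fixed list of SCF dimensions."""
    return [distribution.get(frame, 0.0) for frame in dimensions]

# Invented counts for one verb, for illustration only.
toy_counts = {"[ ]": 30, "[N]": 50, "[che]": 20}
dimensions = ["[ ]", "[che]", "[I-di]", "[N]"]
vector = to_vector(frame_distribution(toy_counts), dimensions)
print(vector)
# [0.3, 0.2, 0.0, 0.5]
```

Frames that a verb never occurs with get a zero, so all verbs share the same dimensions and can be compared directly.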
The MDL Principle
• from information theory (Rissanen 1989)
• model description length: code length in bits for the encoding of the model itself → complexity of the model
• data description length: code length in bits for the encoding of the given data observed through the model → fit of the model to the data
• MDL: “any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally”
\( M = \arg\min_{m} \, [L(m) + L(D \mid m)] \)
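Verb clustering using MDL
1) Baseline model: each verb forms a class of its own
\( M_0 : \{v_1\}, \ldots, \{v_r\} \)
2) Compare \( M_0 \) with any model \( M(h,k) \) in which two verbs \( v_h \) and \( v_k \) share a class
\( M(h,k) : \{v_j\},\ j = 1, \ldots, r,\ j \neq h, k;\ \{v_h, v_k\} \)
3) Choose \( (n_1, m_1) \) such that
\( (n_1, m_1) = \arg\max_{(h,k)} \, [L(M_0) - L(M(h,k))] \), with \( L(M_0) - L(M(h,k)) > 0 \)
4) Cluster \( v_{n_1} \) and \( v_{m_1} \) together into one class, \( \{v_{n_1}, v_{m_1}\} \), which leaves \( r - 1 \) classes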
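The greedy procedure in steps 1) to 4) can be sketched as below, repeating the merge step until no merge shortens the description. The coding scheme is an assumption, since the slides do not specify one: each class is a multinomial over frames, the model costs (k / 2) · log2(N) bits for k free parameters, and the data cost is the negative log-likelihood in bits. The verbs and counts are invented.

```python
import math
from itertools import combinations

def description_length(classes, counts, frames):
    """Two-part code length in bits for a partition of verbs into classes."""
    total = sum(sum(counts[v].values()) for cls in classes for v in cls)
    # Model cost: (k / 2) * log2(N) bits for k free multinomial parameters.
    k = len(classes) * (len(frames) - 1)
    model_bits = 0.5 * k * math.log2(total)
    # Data cost: negative log-likelihood under each class's pooled distribution.
    data_bits = 0.0
    for cls in classes:
        pooled = {f: sum(counts[v].get(f, 0) for v in cls) for f in frames}
        n = sum(pooled.values())
        for f in frames:
            if pooled[f]:
                data_bits -= pooled[f] * math.log2(pooled[f] / n)
    return model_bits + data_bits

def mdl_cluster(counts, frames):
    """Greedily merge the pair of classes that shortens the description most."""
    classes = [frozenset([v]) for v in counts]  # baseline: one class per verb
    while len(classes) > 1:
        base = description_length(classes, counts, frames)
        best_gain, best_pair = 0.0, None
        for a, b in combinations(classes, 2):
            merged = [c for c in classes if c not in (a, b)] + [a | b]
            gain = base - description_length(merged, counts, frames)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:  # no merge reduces the description length
            break
        a, b = best_pair
        classes = [c for c in classes if c not in (a, b)] + [a | b]
    return classes

# Invented frame counts for four placeholder verbs, for illustration only.
frames = ["[ ]", "[N]", "[che]", "[P-a]"]
counts = {
    "v1": {"[N]": 40, "[che]": 10, "[ ]": 50},
    "v2": {"[N]": 42, "[che]": 9, "[ ]": 49},
    "v3": {"[P-a]": 60, "[ ]": 40},
    "v4": {"[P-a]": 58, "[ ]": 42},
}
clusters = mdl_cluster(counts, frames)
print(sorted(sorted(c) for c in clusters))
```

With these toy counts the verbs with near-identical frame distributions end up together, because merging them saves parameter bits at almost no cost in data bits, while merging dissimilar verbs would cost far more data bits than it saves.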
MDL-clustering
• 47 Italian communication verbs: 23 clustering steps
[Figure: cluster diagram of the communication verbs. Legible labels: PROMETTERE, RISPONDERE, PARLARE, PROTESTARE, CHIEDERE, DIRE, ASSERIRE, MINACCIARE, COMANDARE, INSEGNARE, AMMONIRE, DICHIARARE, CONFESSARE, CHIARIRE, PROIBIRE, SUGGERIRE, COMUNICARE, ACCETTARE, PROPORRE, MOSTRARE, COMMENTARE, CHIAMARE, PREGARE, DISCUTERE, RIVELARE, RICHIAMARE, RIMPROVERARE, LEGGERE, SPIEGARE, REPLICARE, DESCRIVERE, RICHIEDERE, DENUNCIARE, OFFRIRE, RIMPIANGERE, ORDINARE]
Conclusions
• a preliminary qualitative analysis of induced verb clusters shows encouraging results
• we expect to evaluate the coherence of the obtained lexico-semantic clusters and the coverage of the subcategorization behaviour of clustered verbs
MDL-clustering
• The verb classes are assigned a new cluster-based frame distribution

                            [ ]    [che]   [I-di]   [N]    [P-a]   [perché]
chiarire ‘clarify’          0.34   0.10    0        0.40   0       0.009
comunicare ‘communicate’    0.24   0.15    0        0.31   0.08    0
proibire ‘forbid’           0.21   0.03    0.03     0.51   0       0
suggerire ‘suggest’         0.24   0.10    0.009    0.42   0.02    0.02
verb class (cluster)        0.25   0.10    0.008    0.41   0.02    0.02
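The slides do not state how the class row is computed. As a rough check (an assumption, not the authors' stated method), the sketch below compares the reported class row with the unweighted mean of the four verb rows. The mean matches on [N] and is close on most other frames, but it differs on [perché] (about 0.007 against 0.02), so the original presumably pools or weights the verbs differently.

```python
frames = ["[ ]", "[che]", "[I-di]", "[N]", "[P-a]", "[perché]"]

# Relative frequencies copied from the table above.
rows = {
    "chiarire":   [0.34, 0.10, 0.0,   0.40, 0.0,  0.009],
    "comunicare": [0.24, 0.15, 0.0,   0.31, 0.08, 0.0],
    "proibire":   [0.21, 0.03, 0.03,  0.51, 0.0,  0.0],
    "suggerire":  [0.24, 0.10, 0.009, 0.42, 0.02, 0.02],
}
reported_class_row = [0.25, 0.10, 0.008, 0.41, 0.02, 0.02]

def unweighted_mean(vectors):
    """Column-wise mean of equally weighted verb vectors."""
    vectors = list(vectors)
    return [sum(column) / len(vectors) for column in zip(*vectors)]

mean_row = unweighted_mean(rows.values())
for frame, mean, reported in zip(frames, mean_row, reported_class_row):
    print(f"{frame:10s} mean={mean:.3f} reported={reported:.3f}")
```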