Body-Part Nouns and Whole-Part Relations in Portuguese
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed
L2F - Spoken Language Systems Laboratory
Body part nouns and Whole-Part Relations in Portuguese
Ilia Markov (1,2,3), Nuno Mamede (2,3), Jorge Baptista (1,2,3)
(1) U. Algarve/CECL; (2) U. Lisboa/IST; (3) INESC-ID Lisboa/L2F
PROPOR2014 - Intl. Conference on Computational Processing of Portuguese October 6-8, 2014, ICMC, São Carlos, SP, Brazil
Objectives
• Improve the automatic extraction of semantic relations between textual elements in an existing NLP system, STRING
• Part-whole relations (meronymy)
• Human body-part nouns (Nbp)
O Pedro partiu o braço ‘Pedro broke the arm’ WHOLE-PART(Pedro,braço)
Objectives (cont.)
• Development of a rule-based meronymy detection module for Human-Nbp relations
• Implementation in STRING (Mamede et al., 2012)
STRING: a hybrid, statistical and rule-based, Natural Language Processing (NLP) system for Portuguese
string.l2f.inesc-id.pt
Motivation
Semantic relations are a device for structuring texts: they contribute to the cohesion and coherence of a text.
Automatic extraction of semantic relations is useful for some NLP tasks:
• Anaphora Resolution
O Pedro lavou a cara ‘Pedro washed the face’
WHOLE-PART(Pedro,cara)
O Pedro lavou a sua cara ‘Pedro washed his face’
WHOLE-PART(sua,cara) & ANTECEDENT(?,sua)
Motivation (cont.)
• Semantic Role Labeling
O Pedro partiu um braço ‘Pedro broke an arm’
WHOLE-PART(Pedro,braço) ➢ Pedro is an experiencer.
O Pedro partiu o braço do João ‘Pedro broke João’s arm’
WHOLE-PART(João,braço) ➢ Pedro is an agent.
Motivation (cont.)
• Opinion mining
É um bom hotel: o quarto era limpo, as camas eram feitas de lavado todos os dias, e os pequenos-almoços eram opíparos
‘It is a nice hotel: the room was clean, the beds (bed sheets) were changed every day, and the breakfasts were sumptuous’
Related Work
In NLP, various information extraction techniques have been developed to capture part-whole relations from texts:
• Hearst, 1992
Lexico-syntactic patterns to capture hyponymic (type-of) relations
• Girju et al., 2003, 2006
The method semi-automatically identifies patterns that encode part-whole relations and automatically learns the classification rules needed to extract part-whole relations from these patterns. The authors report an overall average precision of 80.95% and recall of 75.91%.
Related Work (cont.)
• Van Hage et al., 2006
A method for learning part-whole relations from vocabularies and text sources; the authors acquired 503 part-whole pairs from the AGROVOC Thesaurus and used them to learn 91 reliable part-whole patterns.
• Pantel and Pennacchiotti, 2006
The Espresso algorithm takes as input a few seed instances of a particular relation and learns surface patterns to extract more instances. The algorithm obtains a precision of 80%.
Related Work (cont.)
• Lexical ontologies for Portuguese:
- WordNet.PT
- PAPEL
- Onto.PT
• Parsers of Portuguese:
- The PALAVRAS parser (Bick, 2000), using the Visual Interactive Syntax Learning (VISL) environment;
- LX Semantic Role Labeler (Branco & Costa, 2010).
Dependency Rule in STRING
O Pedro partiu o braço do João ‘Pedro broke João’s arm’

IF( MOD[POST](#2[UMB-Anatomical-human],#1[human]) &
    PREPD(#1,?[lemma:de]) &
    CDIR[POST](#3,#2) &
    ~WHOLE-PART(#1,#2)
)
WHOLE-PART(#1,#2)

Result: WHOLE-PART(João,braço)
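The logic of the rule above can be approximated in a few lines of Python. This is only an illustrative sketch, not STRING's or XIP's implementation: the `Dep` tuple, the toy `HUMAN` and `BODY_PART` sets, and the `prep_of` mapping are hypothetical stand-ins for the parser's dependencies and lexical features.

```python
from typing import NamedTuple

class Dep(NamedTuple):
    name: str       # dependency label, e.g. "MOD_POST"
    head: str       # first argument of the dependency
    dependent: str  # second argument of the dependency

# Hypothetical toy lexicons standing in for the [human] and
# [UMB-Anatomical-human] features of the real system.
HUMAN = {"Pedro", "João"}
BODY_PART = {"braço", "cara", "cabelo", "dedos"}

def whole_part_from_de_complement(deps, prep_of):
    """Emit (whole, part) when a body-part noun that is a direct object
    is modified by a human noun introduced by the preposition 'de'."""
    direct_objects = {d.dependent for d in deps if d.name == "CDIR_POST"}
    found = []
    for d in deps:
        if d.name != "MOD_POST":
            continue
        part, whole = d.head, d.dependent
        if (part in BODY_PART and whole in HUMAN
                and prep_of.get(whole) == "de"      # PREPD(#1,?[lemma:de])
                and part in direct_objects          # CDIR[POST](#3,#2)
                and (whole, part) not in found):    # ~WHOLE-PART(#1,#2)
            found.append((whole, part))
    return found

# O Pedro partiu o braço do João
deps = [
    Dep("SUBJ_PRE", "partiu", "Pedro"),
    Dep("CDIR_POST", "partiu", "braço"),
    Dep("MOD_POST", "braço", "João"),
]
prep_of = {"João": "de"}  # hypothetical: noun -> preposition introducing it
relations = whole_part_from_de_complement(deps, prep_of)
print(relations)
```

Without the `de` complement (an empty `prep_of`), the function extracts nothing, which mirrors the role of the PREPD condition in the rule.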
Fixed Phrases and Frozen Sentences involving Nbp
‣ 400 semi-automatically crafted rules, based on an available lexicon-grammar of European Portuguese idioms
Other phenomena
• DET=um and bilateral symmetry
O Pedro partiu um braço ‘Pedro broke an arm’
• Relations between 2 Nbp
A Ana pinta as unhas dos pés ‘Ana paints the nails of the feet’
• Part-of Nbp
O Pedro tocou com a ponta da língua no gelado ‘Pedro touched the ice cream with the tip of the tongue’
• “Hidden” Nbp with disease nouns
O Pedro tem uma gastrite (estômago) ‘Pedro has gastritis (stomach)’
Evaluation
• First fragment of the CETEMPúblico corpus (Rocha & Santos, 2000): 14.7 M tokens; 6.3 M simple words; and 300 K sentences.
• Using an Nbp lexicon (151 lemmas), 16,746 sentences with Nbp were extracted.
• A random stratified sample of 1,000 sentences with Nbp, keeping the proportion of their total frequency in the source corpus.
• Divided among 4 annotators: 4 subsets of 225 sentences each, plus a common set of 100 sentences to assess inter-annotator agreement.
‣ WHOLE-PART, FIXED, nothing
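The sampling and distribution scheme described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names, the largest-remainder allocation, and the toy corpus (three lemmas with made-up frequencies) are all hypothetical.

```python
import random
from collections import Counter

def stratified_sample(sentences_by_lemma, sample_size, seed=0):
    """Sample sentences so that each lemma keeps its share of the corpus."""
    rng = random.Random(seed)
    total = sum(len(s) for s in sentences_by_lemma.values())
    # Largest-remainder allocation: quotas sum exactly to sample_size.
    exact = {l: len(s) * sample_size / total for l, s in sentences_by_lemma.items()}
    quota = {l: int(x) for l, x in exact.items()}
    leftover = sample_size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda l: exact[l] - quota[l], reverse=True)
    for lemma in by_remainder[:leftover]:
        quota[lemma] += 1
    sample = []
    for lemma, sents in sentences_by_lemma.items():
        sample.extend(rng.sample(sents, quota[lemma]))
    rng.shuffle(sample)
    return sample

def split_for_annotators(sample, n_annotators=4, common=100):
    """Reserve a common set for agreement, split the rest evenly."""
    shared, rest = sample[:common], sample[common:]
    size = len(rest) // n_annotators
    return shared, [rest[i * size:(i + 1) * size] for i in range(n_annotators)]

# Toy corpus: (lemma, sentence id) pairs with invented frequencies.
corpus = {
    "mão": [("mão", i) for i in range(1000)],
    "braço": [("braço", i) for i in range(600)],
    "língua": [("língua", i) for i in range(400)],
}
sample = stratified_sample(corpus, 1000)
per_lemma = Counter(lemma for lemma, _ in sample)
shared, subsets = split_for_annotators(sample)
print(per_lemma, len(shared), [len(s) for s in subsets])
```

With 1,000 sampled sentences, reserving 100 for the common set leaves 900, which is where the 4 subsets of 225 come from.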
Inter-annotator Agreement
Measures reported (table values not preserved in this transcript):
• Average Pairwise Percent Agreement
• Fleiss’ Kappa
• Average Pairwise Cohen’s Kappa
ReCal3: http://dfreelon.org/utils/recalfront/recal3/
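For reference, the three agreement measures named on this slide can be computed as in the sketch below. It uses the standard textbook definitions and is not the code of the tool linked above; the four-item toy annotation matrix is invented for illustration and does not reproduce the study's data.

```python
from collections import Counter
from itertools import combinations

def percent_agreement(a, b):
    """Share of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators (chance-corrected agreement)."""
    n = len(a)
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

def fleiss_kappa(items):
    """Fleiss' kappa; items is a list of per-item label lists,
    each with the same number of annotators."""
    n_items, n_raters = len(items), len(items[0])
    totals = Counter()
    p_bar = 0.0
    for labels in items:
        counts = Counter(labels)
        totals.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar /= n_items
    pe = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - pe) / (1 - pe)

def average_pairwise(items, measure):
    """Average a two-annotator measure over all annotator pairs."""
    raters = list(zip(*items))  # one label sequence per annotator
    pairs = list(combinations(raters, 2))
    return sum(measure(a, b) for a, b in pairs) / len(pairs)

W, F, N = "WHOLE-PART", "FIXED", "nothing"
# Invented toy data: 4 items, each labelled by 4 annotators.
items = [[W, W, W, W], [F, F, F, W], [N, N, N, N], [W, W, F, F]]
print(average_pairwise(items, percent_agreement))
print(fleiss_kappa(items))
print(average_pairwise(items, cohen_kappa))
```

Both kappa variants are undefined when expected agreement equals 1 (for example, when every annotator uses a single label throughout); the sketch does not guard against that case.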
Results (1st evaluation)
System’s performance for Nbp
Error Analysis false-positives
• Disambiguation of Nbp in context
- língua ‘tongue/language’
- língua portuguesa ‘Portuguese language’
- língua de Camões ‘language of Camões’
• New idioms have been encoded in the lexicon
- abrir o coração a ‘to open one’s heart to sb.’
- fazer face a ‘to face sth./to deal with’
• Nbp used figuratively
Além disso, a nova face desta Igreja chilena… ‘Moreover, the new face of this Chilean Church…’
Error Analysis false-negatives
• The whole and the part are not syntactically related and may be quite far away from each other:
O facto do corpo ter sido encontrado na cozinha, leva os bombeiros a suspeitar que a vítima, com graves problemas de saúde, tenha desmaiado e caído à lareira, o que poderá ter estado na origem do incêndio.
‘The fact that the body was found in the kitchen leads the firefighters to suspect that the victim, who had serious health problems, fainted and fell into the hearth, which may have been the origin of the fire.’
WHOLE-PART(vítima,corpo)
Error Analysis false-negatives (cont.)
• Some human nouns and all pronouns (including personal, relative and demonstrative) are not marked with the human feature (even if anaphora resolution performs well):
Segundo o responsável do hospital, o doente – que também sofreu graves ferimentos na cabeça – poderia ser ainda sujeito a uma segunda intervenção cirúrgica
‘According to the head of the hospital, the patient – who also suffered serious head injuries – could still be subjected to a second surgical intervention’
ANTECEDENT(doente,que)
WHOLE-PART(que,cabeça)
‣ Inheritance of features and relative placing of the AR (anaphora resolution) and WP (whole-part) modules within the STRING architecture
Error Analysis false-negatives (cont.)
• A modifier of a noun or an adjective (and not a verb):
Um mágico com um barrete (enfiado) na cabeça ‘A magician with a hat (stuck) on the head’
WHOLE-PART(mágico,cabeça)
Results (2nd evaluation)
System’s performance for Nbp
+0.13 +0.11 +0.12
Thank you!
Questions please!
echo "O Pedro penteou o cabelo do filho com os dedos" | xip/string.sh

TOP
  NP: ART(O) NOUN(Pedro)
  VF: VERB(penteou)
  NP: ART(o) NOUN(cabelo)
  PP: PREP(de) ART(o) NOUN(filho)
  PP: PREP(com) ART(os) NOUN(dedos)

MAIN(penteou)
MOD_POST(cabelo,filho)
MOD_POST(penteou,dedos)
SUBJ_PRE(penteou,Pedro)
CDIR_POST(penteou,cabelo)
WHOLE-PART(filho,cabelo)
WHOLE-PART(Pedro,dedos)
0>TOP{NP{O Pedro} VF{penteou} NP{o cabelo} PP{de o filho} PP{com os dedos}}

string.l2f.inesc-id.pt
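To post-process output like the demo above, the dependency lines can be read with a small regular expression. This sketch assumes that each dependency is printed on its own line as NAME(arg) or NAME(arg1,arg2); the transcript does not preserve the original line layout, and `parse_dependencies` is a hypothetical helper, not part of STRING.

```python
import re

# NAME(arg) or NAME(arg1,arg2); names may contain "_" or "-".
DEP_RE = re.compile(r"^([A-Z][A-Z_-]*)\(([^,()]+)(?:,([^,()]+))?\)$")

def parse_dependencies(lines):
    """Return (name, args) pairs for every line that looks like a dependency."""
    deps = []
    for line in lines:
        m = DEP_RE.match(line.strip())
        if m:
            name, *args = m.groups()
            deps.append((name, tuple(a for a in args if a is not None)))
    return deps

# Dependency lines copied from the demo output (assumed one per line).
output = """MAIN(penteou)
MOD_POST(cabelo,filho)
MOD_POST(penteou,dedos)
SUBJ_PRE(penteou,Pedro)
CDIR_POST(penteou,cabelo)
WHOLE-PART(filho,cabelo)
WHOLE-PART(Pedro,dedos)"""

deps = parse_dependencies(output.splitlines())
whole_part = [args for name, args in deps if name == "WHOLE-PART"]
print(whole_part)
```

Lines that do not match the pattern, such as the chunk bracketing that starts with `0>TOP{`, are skipped.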
References
Berland, M. and Charniak, E. 1999. Finding parts in very large corpora. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 57–64. Morristown, NJ, USA. Association for Computational Linguistics.
Bick, E. 2000. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.
Branco, A. and Costa, F. 2010. A Deep Linguistic Processing Grammar for Portuguese. In Pardo et al. (eds.), Computational Processing of Portuguese, LNAI 6001, Springer, pp. 86–89.
Girju, R., Badulescu, A., and Moldovan, D. 2006. Automatic discovery of part-whole relations. Computational Linguistics, 32(1):83–135.
Nascimento, M., Veloso, R., Marrafa, P., Pereira, L., Ribeiro, R., and Wittmann, L. 1998. LE-PAROLE: do Corpus à Modelização da Informação Lexical num Sistema-multifunção. Actas do XIII Encontro Nacional da Associação Portuguesa de Linguística, 2:115–134.
Mamede, N., Baptista, J., Diniz, C. and Cabarrão, V. 2012. STRING: An hybrid statistical and rule-based natural language processing chain for portuguese. http://www.propor2012.org/demos/DemoSTRING.pdf
References (cont.)
Pantel, P. and Pennacchiotti, M. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06), pages 113–120. Sydney, Australia.
Rocha, P. and Santos, D. 2000. "CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa". In Maria das Graças Volpe Nunes (ed.), V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR 2000) (São Paulo, Brasil, 19-22 de Novembro de 2000), São Paulo: ICMC/USP, pp. 131-140.
Widlöcher, A. and Mathet, Y. 2012. The Glozz Platform: a Corpus Annotation and Mining Tool. In Proceedings of the 2012 Association for Computational Linguistics Symposium on Document Engineering, DocEng ’12, pages 171–180, Paris, France. Telecom ParisTech, Association for Computational Linguistics.
Winston, M., Chaffin, R. and Herrmann, D. 1987. A Taxonomy of Part-Whole Relations. Cognitive Science, 11:417–444.