LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Transcript of LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Page 1: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

LIN 3098 Corpus Linguistics – Lecture 4

Albert Gatt

Page 2: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

LIN 3098 -- Corpus Linguistics

In this lecture

Levels of annotation
Corpus typology
    classification based on type and levels of annotation
    multilingual corpora

Page 3: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Part 1

Levels of corpus annotation (cont/d)

Page 4: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Levels of linguistic annotation

part-of-speech (word-level)
lemmatisation (word-level)
parsing (phrase & sentence-level)
semantics (multi-level)
    semantic relationships between words and phrases
    semantic features of words
discourse features (supra-sentence level)
phonetic transcription
prosody

Page 5: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Lemmatisation

Groups morphological variants of a word under the head word:
    mexa (walk)
        imxejt (I walked)
        imxejna (we walked)
        nimxu (we walk)
        ...

Together, these form a lemma.

Increasingly common these days.
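As a minimal sketch of the grouping idea, assuming a hand-built lookup table rather than a real Maltese lemmatiser, the forms on this slide can be mapped to their head word in Python. The names `FORM_TO_LEMMA` and `group_by_lemma` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict

# Toy lookup table built from the slide's examples (an assumption for
# illustration; a real lemmatiser would rely on morphological rules or a lexicon).
FORM_TO_LEMMA = {
    "mexa": "mexa",
    "imxejt": "mexa",
    "imxejna": "mexa",
    "nimxu": "mexa",
}

def group_by_lemma(tokens):
    """Group word forms under their lemma; unknown forms map to themselves."""
    groups = defaultdict(list)
    for token in tokens:
        lemma = FORM_TO_LEMMA.get(token.lower(), token.lower())
        groups[lemma].append(token)
    return dict(groups)

groups = group_by_lemma(["imxejt", "nimxu", "imxejna"])
print(groups)
```

Counting or searching by lemma rather than by surface form is what makes a lemmatised corpus convenient: all three forms above end up under one entry.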

Page 6: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Lemmatisation example: the SUSANNE corpus

Format: word + tag + lemma

    A05:0030.33 - VVDv said say

    A05:0030.33    corpus file:sentence.word
    VVDv           POS tag
    said           actual word
    say            headword (lemma)

Every word in the corpus is on a separate line.
Extremely useful for lexicography.
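Because each word sits on its own line with whitespace-separated fields, a line like the one above is easy to read programmatically. The following is a sketch only: it assumes exactly the five fields visible on the slide, and the field name `status` for the second column ("-") is an assumption, since the slide does not label it.

```python
from typing import NamedTuple

class SusanneToken(NamedTuple):
    reference: str  # corpus file:sentence.word, e.g. "A05:0030.33"
    status: str     # second column ("-" on the slide); the name is an assumption
    tag: str        # POS tag
    word: str       # actual word
    lemma: str      # headword

def parse_line(line):
    """Split one corpus line into its five whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return SusanneToken(*fields)

token = parse_line("A05:0030.33 - VVDv said say")
print(token.word, token.tag, token.lemma)
```

With tokens in this form, a lexicographer could, for example, collect every line whose `lemma` is "say" regardless of the inflected form in `word`.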

Page 7: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Automatic morphological analysis

For some languages, there are reasonably good lemmatisers/morphological analysers.

Examples for English:
    morpha: built at the University of Sussex
    EngTwol: commercial, by LingSoft.

Page 8: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

EngTwol output

    undeniable: "undeniable" <DER:ble> A ABS

    <DER:ble>    derived with -ble suffix
    A            adjective
    ABS          absolute form

This is a rule-based analyser. There are others which use corpus-derived statistical patterns.

Page 9: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Semantic annotation I: Two types

markup of semantic relations (e.g. predicate-argument structure)
    currently used in parsed corpora, to mark up function-argument structures etc.

markup of features of word meaning (mainly, word senses)
    has origins in content analysis, to arrive at conclusions about how prominent particular concepts are
    now used in a lot of work on word sense disambiguation

Page 10: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Example of type 1 semantic markup (Penn Treebank)

(S (NP-SBJ-1 Chris)
   (VP wants
       (S (NP-SBJ *-1)
          (VP to
              (VP throw (NP the ball))))))

Predicate Argument Structure: wants(Chris, throw(Chris, ball))

Empty embedded subject (*-1) is linked to NP subject no. 1
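To make the co-indexing concrete, here is a small Python sketch that reads the bracketed tree and replaces the empty subject with the NP it points to. It assumes the usual hyphenated Penn Treebank labels (`NP-SBJ-1` for the indexed subject, `*-1` for the trace); the function names are hypothetical and this is not a full Treebank reader.

```python
import re

TREE = ("(S (NP-SBJ-1 Chris) (VP wants "
        "(S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))")

def tokenize(text):
    """Split a bracketed string into parentheses and symbols."""
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens, pos=0):
    """Parse one constituent starting at tokens[pos] == '('.
    Returns ((label, children), next_position)."""
    label = tokens[pos + 1]
    children = []
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1

def leaves(node):
    """Return the words (and traces) under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node[1] for leaf in leaves(child)]

def antecedents(node, table=None):
    """Map each numeric index on a label (NP-SBJ-1 -> '1') to its words."""
    table = {} if table is None else table
    if not isinstance(node, str):
        match = re.search(r"-(\d+)$", node[0])
        if match:
            table[match.group(1)] = " ".join(leaves(node))
        for child in node[1]:
            antecedents(child, table)
    return table

def resolve(node, table):
    """Replace traces such as *-1 by the words of their antecedent."""
    resolved = []
    for leaf in leaves(node):
        match = re.fullmatch(r"\*-(\d+)", leaf)
        resolved.append(table[match.group(1)] if match else leaf)
    return resolved

tree, _ = parse(tokenize(TREE))
table = antecedents(tree)
print(table)
print(" ".join(resolve(tree, table)))
```

Once the trace is resolved, the embedded clause has an explicit subject, which is the information needed to read off throw(Chris, ball) inside wants(Chris, ...).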

Page 11: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Semantic markup type 2: lexical features

Most common type: word-sense tagged corpora

Main idea: disambiguate a word in context by tagging its sense

Often uses WordNet (Miller et al 1993)
    WordNet is a lexical taxonomy which represents lexical relations among a large number of words, including hyponymy (IS-A) relations etc.
    For each entry, all the (supposed) senses of the word are given.
    Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.
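The "pointer to a sense" idea can be sketched as a data structure in Python. This is an illustration under stated assumptions: the two glosses are WordNet's for the noun "move", but the sense identifiers (`move.n.a`, `move.n.b`) and the `Token` class are hypothetical and do not follow any particular corpus format.

```python
from dataclasses import dataclass
from typing import Optional

# Tiny sense inventory; identifiers are made up for this sketch.
SENSES = {
    "move.n.a": "the act of deciding to do something",
    "move.n.b": "the act of changing your residence or place of business",
}

@dataclass
class Token:
    word: str
    sense_id: Optional[str] = None  # pointer into SENSES, None if untagged

def gloss(token):
    """Follow a token's sense pointer and return the gloss, if any."""
    if token.sense_id is None:
        return None
    return SENSES.get(token.sense_id)

# "his first move was to hire a lawyer", with only "move" sense-tagged
sentence = [Token("his"), Token("first"), Token("move", "move.n.a"),
            Token("was"), Token("to"), Token("hire"), Token("a"), Token("lawyer")]

for token in sentence:
    if token.sense_id:
        print(token.word, "->", token.sense_id, ":", gloss(token))
```

Storing only an identifier in the corpus keeps the annotation compact, while the full definition stays in the lexical resource.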

Page 12: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

WordNet senses: Move (noun)

(377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")

(70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")

(57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")

(30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")

(5) move -- ((game) a player's turn to take some action permitted by the rules of the game)

Page 13: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

WordNet senses: Move (verb)

(130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")

(60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")

(52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")

(20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")

Page 14: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Check it out!

WordNet is freely available for download:

http://wordnet.princeton.edu/

Page 15: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Word sense annotation: other uses

tagging words with their semantic field (Wilson 1996)
    plant life
    men’s clothing
    …

tagging words with their “emotional” content (Campbell & Pennebaker 2002)
    based on a dictionary:
        social processes
        negative emotions

This approach underlies Pennebaker’s Linguistic Inquiry and Word Count (LIWC) system, which:
    analyses a text and comes up with a profile of its personal/emotional content
    relates this to some features of its author (gender, age…)
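A dictionary-based profile of this kind can be sketched in a few lines of Python. The category names come from the slide; the word lists are invented for illustration and are not the LIWC dictionary, and the function `profile` is a hypothetical name, not LIWC's actual procedure.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: categories from the slide, word lists made up.
CATEGORY_WORDS = {
    "social processes": {"friend", "talk", "family", "we"},
    "negative emotions": {"sad", "angry", "hate", "afraid"},
}

def profile(text):
    """Return, per category, the proportion of tokens found in its word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORY_WORDS.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORY_WORDS}

result = profile("We talk when I am sad or angry")
print(result)
```

The resulting proportions form the text's profile, which could then be compared across authors or groups of authors.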

Page 16: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Discourse annotation

Most common:
    text-level things such as paragraphs

Less common:
    anaphoric NPs and reference (cf. example from lecture 3)

Even less common:
    annotation of words which function as discourse cues (Stenstrom 1984): apology (sorry), hedges (sort of), etc.
    annotation of rhetorical structure

Page 17: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Discourse: Annotating rhetorical structure (I)

Rhetorical Structure Theory (Mann and Thompson 1988):
    views text as made up of “discourse units”
    units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc.

CONCESSION/CONTRAST relation:
    [Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis].
    Second unit is the main one (nucleus)
    First unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact.

Recent corpus (Marcu et al 2003) is annotated with this information.

Page 18: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Phonetic transcription

Not many phonetically transcribed corpora. MARSEC corpus is one of the best known.

This is a version of the Lancaster/IBM Spoken English Corpus.

Several databases of transcribed speech, however. Mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).

Page 19: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Annotating suprasegmentals

Aims: capture suprasegmental features such as stress, intonation and pauses in speech.

Some transcription systems exist:
    ToBI (American)
    Tonic Stress Marker (TSM; British)

These define ways of annotating suprasegmentals such as start/end of tone group, simultaneous speech, rise-fall tone, falling tone, etc.

Page 20: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Problem-oriented tagging

If you’re interested in a particular problem, and no corpus exists, build your own!

Many corpora define problem-specific annotation schemes.

Page 21: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Example: the TUNA Corpus

Problem: How do people refer to objects using definite NPs?
    Main interest: visual properties (colour, size etc)
    Focus: semantics of definite NPs, i.e. what people choose to include in their description.

Method:
    experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene”
    annotation of descriptions based on semantics

Page 22: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

TUNA Corpus: description

<DESCRIPTION NUM="SINGULAR">
<ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
<ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
<ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>

Red sofa, bigger version.

Features of the corpus:

1. represents the “target” referent

2. also represents the “distractors” (from which the target must be distinguished)

3. semantically transparent: annotation goes beyond language
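The annotated description on this slide is well-formed XML, so its semantic content can be pulled out with Python's standard library. This is a minimal sketch that handles only the fragment shown here; the function name `read_description` is hypothetical, and the full corpus files may contain further elements not covered.

```python
import xml.etree.ElementTree as ET

DESCRIPTION = """<DESCRIPTION NUM="SINGULAR">
<ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
<ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
<ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>"""

def read_description(xml_text):
    """Return (attribute name, semantic value, surface words) triples."""
    root = ET.fromstring(xml_text)
    return [
        (attr.get("NAME"), attr.get("VALUE"), (attr.text or "").strip())
        for attr in root.iter("ATTRIBUTE")
    ]

attributes = read_description(DESCRIPTION)
for name, value, words in attributes:
    print(f"{name}={value!r} realised as {words!r}")
```

Note how the size attribute carries the value "large" although the speaker wrote "bigger version": the annotation records the meaning, not just the wording.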

Page 23: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Part 2

Multilingual corpora

Page 24: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Why multilingual corpora?

comparative studies
    syntax
    morphology
    …

the cornerstone of most research in automatic machine translation
    nowadays most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.

Page 25: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Parallel corpora

Represent a text in its original language (L1), with a translation in another language (L2)
    long history: Medieval polyglot bibles were among the first “parallel” corpora

Alignment:
    Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level…
    Sentence-level alignment can be achieved automatically with very high accuracy!

Page 26: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Example: SMULTRON corpus

Developed and released in 2007-8

Relatively small

Aligned texts in English, Swedish and German
    E.g. Sophie’s World is one of the texts

Annotated with syntax, POS, morphology

Comes with a tool to view parallel syntactic trees.

Page 27: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

SMULTRON example: English (Sophie’s World)

<s id="s3">
<terminals>
<t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
<t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
<t id="s3_3" word="was" pos="VBD" morph="--"/>
<t id="s3_4" word="on" pos="IN" morph="--"/>
<t id="s3_5" word="her" pos="PRP$" morph="--"/>
<t id="s3_6" word="way" pos="NN" morph="--"/>
<t id="s3_7" word="home" pos="RB" morph="--"/>
<t id="s3_8" word="from" pos="IN" morph="--"/>
<t id="s3_9" word="school" pos="NN" morph="--"/>
<t id="s3_10" word="." pos="." morph="--"/>
</terminals>
</s>

This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP etc).
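As a rough sketch of how such terminal nodes can be read, the following Python snippet uses the standard library XML parser on the sentence above (written with plain double quotes). It only covers the `<t>` elements shown on the slide; the function name `terminals` is hypothetical, and the handling of non-terminals is left out.

```python
import xml.etree.ElementTree as ET

SENTENCE = """<s id="s3"><terminals>
<t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
<t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
<t id="s3_3" word="was" pos="VBD" morph="--"/>
<t id="s3_4" word="on" pos="IN" morph="--"/>
<t id="s3_5" word="her" pos="PRP$" morph="--"/>
<t id="s3_6" word="way" pos="NN" morph="--"/>
<t id="s3_7" word="home" pos="RB" morph="--"/>
<t id="s3_8" word="from" pos="IN" morph="--"/>
<t id="s3_9" word="school" pos="NN" morph="--"/>
<t id="s3_10" word="." pos="." morph="--"/>
</terminals></s>"""

def terminals(xml_text):
    """Return (word, POS tag) pairs for every terminal node, in order."""
    root = ET.fromstring(xml_text)
    return [(t.get("word"), t.get("pos")) for t in root.iter("t")]

tokens = terminals(SENTENCE)
text = " ".join(word for word, _ in tokens)
print(text)
print([word for word, pos in tokens if pos == "NN"])
```

The same function could be applied to the German or Swedish version of the sentence, since the terminals there use the same `word` and `pos` attributes.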

Page 28: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

SMULTRON: Same sentence in German

<s id="3">
<terminals>
<t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie " />
<t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen" />
<t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
<t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" />
<t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" />
<t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg" />
<t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" />
<t id="s3_8" word="der" pos="ART" morph="--" lemma="die" />
<t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />
<t id="s3_10" word="." pos="$." morph="--" lemma="--" />
</terminals>
</s>

Note: richer morphology, representation of lemmas, …

Page 29: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Translation corpora

Not parallel.
Have different texts in two or more different languages, of the same genre.

Examples:
    PAROLE corpus is a translation corpus for EU languages

Page 30: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Why translation corpora?

Parallel corpora, by definition, contain translations.
    Translation (L2) can give rise to errors.
    Artificiality and translation quality can be an issue.
        e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads “like a translation”.
    Problem can be overcome if the texts used are professionally translated.

Translation corpora have texts in two or more languages, “in the original”.
    Data is more natural.

Page 31: LIN 3098 Corpus Linguistics – Lecture 4 Albert Gatt.

Summary

We have now concluded our initial incursion into:
    corpus construction
    corpus annotation
    corpus typology

Next up: using corpora for linguistic research