Introduction to (still more) Computational Linguistics Pawel Sirotkin 28.11-01.12.2008, Riga.


Introduction to (still more) Computational Linguistics

Pawel Sirotkin, 28.11-01.12.2008, Riga


Rule-based CL

Computational Linguistics, NLL Riga 2008, by Pawel Sirotkin


• Rules have to be generated by hand
• Easily tailored to fit (or test) a particular theory
• First results with just a handful of rules

But:
• Very hard to get "all" the rules
• Rules may conflict
• Rules are language- and domain-specific


Statistical CL


• Needed: an algorithm that can create rules
• The algorithm needs training data to learn
• More and more data is around: digitized literature, official documents, corpora
• The learned rules can be applied to new texts

Good points:
• Largely independent of language, domain etc.
• Computational power is available in abundance


A brief aside: Corpora


• First major corpus: the Brown Corpus (mid-1960s)
  – 500 samples of 2,000 words each
  – From newspapers, fiction and non-fiction books
  – Around 80 part-of-speech tags
• Tagging took over 15 years to complete
• Modern corpora: BNC, COCA, ...
  – Sometimes hundreds of millions of words
  – Written and spoken texts
  – More or less syntactic and semantic annotation


Part-of-Speech Tagging


Linguistic background
• What are parts of speech?
• How do we recognize them?

Practical usage
• What are POS taggers good for?
• What should they do?

Implementation
• What are the possible problems?
• What are the possible solutions?


Parts of speech


Nouns, verbs, adjectives…

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.

(Martin Luther King)

How many nouns are there in this text?


Parts of speech


Nouns, verbs, adjectives…

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.

(Martin Luther King)

What defines a noun?


What defines a part of speech?


Noun: "a word (other than a pronoun) used to identify any of a class of people, places, or things (common noun), or to name a particular one of these (proper noun)" [OED]
– A semantic definition

"any member of a class of words that typically can be combined with determiners to serve as the subject of a verb, can be interpreted as singular or plural, can be replaced with a pronoun, and refer to an entity, quality, state, action, or concept" [Merriam-Webster]
– A syntactic and semantic definition


What parts of speech are there?

• More (closed) word classes in English
• More (or less, or different) word classes in other languages
• Different word classes in different linguistic models


Open word classes:
• Nouns (table, time, Wiebke)
• Verbs (go, use, sleep)
• Adjectives (nice, white, absent)
• Adverbs (quickly, clockwise, yesterday)
• Interjections (wow, ouch, er)

Closed word classes:
• Determiners (the, some, what)
• Auxiliary verbs (be, have, must)
• Pronouns (I, ourselves, his)
• Prepositions (on, by, after)
• Conjunctions (and, while, either ... or ...)


How to recognize word classes?


Substitution test: The small boy sits in a car.

• The, a, this: determiners
• Small, big, angry, clever: adjectives
• Boy, girl, cat, doll: nouns
• Sits, cries, sleeps: verbs
• In, on, outside: prepositions


Why do we need POS tags?


• Main aim: disambiguation
• Useful for most advanced CL applications
  – Machine translation
  – Named Entity Recognition/Extraction
  – Anaphora resolution
  – etc.


Part-of-Speech Tagger


Not surprisingly, an application for determining parts of speech in a text

Not/ADV surprisingly/ADV, an/DET application/N for/PREP determining/V parts/N of/PREP speech/N in/PREP a/DET text/N


Part-of-Speech Tagging – rules?


Rule-based POS tagging? Possible rules (simplified):

• If a word ends in "est", it's an adjective (superlative form)
  – Pest? Rest?
• If a word ends in "ed", it's a verb (past or participle form)
  – Bed? Sled?

Rules of this kind are few and unreliable. Largest problem: they don't help with the ambiguous words!
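Suffix rules like the ones above are easy to state in code, which also makes their failure cases easy to see. A minimal sketch (the rules and the example words are illustrative, taken from the slide):

```python
# A toy rule-based tagger using the simplified suffix rules from the slide.
def suffix_tag(word):
    if word.endswith("est"):
        return "ADJ"    # superlative? fails for "pest", "rest"
    if word.endswith("ed"):
        return "VERB"   # past/participle? fails for "bed", "sled"
    return "UNKNOWN"    # the rules cover almost nothing else

for w in ["nicest", "pest", "walked", "sled", "wind"]:
    print(w, "->", suffix_tag(w))
```

Note how "pest" and "sled" are tagged wrongly, and "wind", the genuinely ambiguous case, is not handled at all.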


Part-of-Speech Tagging – rules!


• The wind is blowing.
  – How do we know wind is a noun and not a verb?
  – Because it appears after an article and before a verb:
    ART ___ VERB → ART NOUN VERB
• We need rules about inter-word relations
• Hard to say what the rules are:
  – The cromulent wind
  – The cromulent wind up


Part-of-Speech Tagging: Stats


• Wind: 76% noun usage, 24% verb usage
• ART ___ VERB: 72% noun, 1% adverb
• The wind blows:
  – Verb probability: 24% × 0% = 0%
  – Adverb probability: 0% × 1% = 0%
  – Noun probability: 76% × 72% = 55%

Careful! The numbers are invented, and the real calculation is more complex than that.
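Using the slide's (invented) numbers, the combination is simply a product of the lexical probability of the word and the contextual probability of the tag in the ART ___ VERB slot:

```python
# Invented probabilities from the slide for "wind" in the frame ART ___ VERB.
p_wind = {"noun": 0.76, "verb": 0.24, "adverb": 0.0}     # lexical usage
p_context = {"noun": 0.72, "adverb": 0.01, "verb": 0.0}  # ART ___ VERB frame

for tag in ("noun", "verb", "adverb"):
    p = p_wind[tag] * p_context[tag]
    print(f"{tag}: {p:.0%}")
```

The noun reading wins with about 55%, exactly as computed on the slide.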


What do we need?


This is a simple sentence.

This text, excogitated by Dr. Samākslots of New York, is a bit more complicated. It consists of a few longer-than-usual sentences; also, it has punctuation etc. It will help us to learn the complexities of part-of-speech tagging, or POST.


We need…


• A tokenizer to split the text into tokens
• Tag probabilities for the tokens
  – E.g. left: 46% adjective, 31% noun, 23% verb
• Tag sequence probabilities
  – E.g. ADJ ___ NOUN: 57% noun, 43% adjective
  – How long should the sequences be?
• Methods for estimating unknown words
  – E.g. 80% proper noun probability if capitalized
  – No closed word classes
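The first and last ingredients are the easiest to sketch. Below is a minimal tokenizer plus a capitalization heuristic for unknown words; the regular expression, the tag names, and the fallback probabilities (other than the 80% figure from the slide) are my own illustrative choices:

```python
import re

def tokenize(text):
    # Split into word tokens and single punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def guess_unknown(token):
    # Heuristic from the slide: a capitalized unknown word is most
    # likely a proper noun. Closed classes are never guessed, since
    # their members are all known in advance.
    if token[0].isupper():
        return {"PROPER_NOUN": 0.8, "NOUN": 0.2}
    return {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2}  # invented fallback

print(tokenize("This is a simple sentence."))
print(guess_unknown("Samākslots"))
```

A real tokenizer must also handle abbreviations like "Dr." and "etc." from the example text, which is exactly where the complexity starts.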


Tag probabilities


The wind blows.

• The: 98% article, 2% adverb
• Wind: 76% noun, 24% verb
• Blows: 53% verb, 47% noun
• Article Noun: 72%, Article Verb: 1%
• Adverb Noun: 0%, Adverb Verb: 6%
• Noun Verb: 61%, Noun Noun: 4%
• Verb Verb: 3%, Verb Noun: 59%


Tag probability calculation


The wind blows.

• Article – noun – verb: 98% × 72% × 76% × 61% × 53% ≈ 17%
• Article – noun – noun: 98% × 72% × 76% × 4% × 47% ≈ 1%
• Article – verb – noun: 98% × 1% × 24% × 59% × 47% ≈ 0.07%
• Article – verb – verb: 98% × 1% × 24% × 3% × 53% ≈ 0.004%
• …
• The complexity of the calculation explodes as sentence length and the number of tags increase.
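The explosion is easy to see by brute force: for n tokens with t candidate tags each there are tⁿ sequences to score. With the slide's numbers for "The wind blows" that is only 2 × 2 × 2 = 8 sequences, but a 20-word sentence with 10 tags per word would already need 10²⁰. A sketch of the naive enumeration (probabilities as on the slides):

```python
from itertools import product

# Lexical probabilities per token (from the slide).
lexical = [
    {"ART": 0.98, "ADV": 0.02},    # the
    {"NOUN": 0.76, "VERB": 0.24},  # wind
    {"VERB": 0.53, "NOUN": 0.47},  # blows
]
# Tag sequence (transition) probabilities.
trans = {("ART", "NOUN"): 0.72, ("ART", "VERB"): 0.01,
         ("ADV", "NOUN"): 0.00, ("ADV", "VERB"): 0.06,
         ("NOUN", "VERB"): 0.61, ("NOUN", "NOUN"): 0.04,
         ("VERB", "VERB"): 0.03, ("VERB", "NOUN"): 0.59}

def score(seq):
    # Product of all lexical and transition probabilities for one sequence.
    p = 1.0
    for i, tag in enumerate(seq):
        p *= lexical[i][tag]
        if i > 0:
            p *= trans[(seq[i - 1], tag)]
    return p

# Enumerate every possible tag sequence and keep the best one.
best = max(product(*(d.keys() for d in lexical)), key=score)
print(best, f"{score(best):.0%}")
```

The winner is article – noun – verb at about 17%, matching the hand calculation above; the point of the Viterbi algorithm on the next slides is to get this result without enumerating everything.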


Hidden Markov Models


[Trellis diagram for "The wind blows": emission probabilities (the: 98% article / 2% adverb; wind: 76% noun / 24% verb; blows: 52% verb / 47% noun) and transition probabilities between the tag states (e.g. ART→NOUN 72%, ART→VERB 1%, ADV→VERB 6%, NOUN→VERB 61%, NOUN→NOUN 4%, VERB→NOUN 59%)]



Viterbi Algorithm


[Same trellis as on the previous slide, now annotated with the path probabilities computed step by step]

The wind blows

After "The":
• article: 98%
• adverb: 2%

After "wind":
• article – noun: 54%
• article – verb: 0.2%
• adverb – noun: 0%
• adverb – verb: 0.02%

After "blows":
• article – noun – verb: 17%
• article – noun – noun: 1%
• article – verb – verb: 0.02%
• article – verb – noun: 0.05%

Winning sequence: article – noun – verb
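The Viterbi algorithm keeps, for each position and tag, only the best-scoring path so far, so the work grows linearly with sentence length instead of exponentially. A compact sketch using the slides' (invented) numbers; the function signature is mine, and the start weights are set to 1 so the final score matches the slides, which ignore initial probabilities:

```python
def viterbi(tokens, lexical, trans, start):
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (start.get(t, 0.0) * lexical[0][t], [t]) for t in lexical[0]}
    for i in range(1, len(tokens)):
        new = {}
        for tag, p_lex in lexical[i].items():
            # Keep only the best predecessor for this tag.
            prev, (p, path) = max(
                best.items(),
                key=lambda kv: kv[1][0] * trans.get((kv[0], tag), 0.0))
            new[tag] = (p * trans.get((prev, tag), 0.0) * p_lex, path + [tag])
        best = new
    return max(best.values(), key=lambda v: v[0])

lexical = [{"ART": 0.98, "ADV": 0.02},
           {"NOUN": 0.76, "VERB": 0.24},
           {"VERB": 0.53, "NOUN": 0.47}]
trans = {("ART", "NOUN"): 0.72, ("ART", "VERB"): 0.01,
         ("ADV", "NOUN"): 0.00, ("ADV", "VERB"): 0.06,
         ("NOUN", "VERB"): 0.61, ("NOUN", "NOUN"): 0.04,
         ("VERB", "VERB"): 0.03, ("VERB", "NOUN"): 0.59}

prob, path = viterbi(["The", "wind", "blows"], lexical, trans,
                     {"ART": 1.0, "ADV": 1.0})
print(path, f"{prob:.0%}")  # ['ART', 'NOUN', 'VERB'] 17%
```

After "wind", the adverb paths have effectively died out, so the algorithm never extends them further, which is exactly the pruning shown on the slide.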


HMMs – the theory


A five-tuple (S, K, Π, A, B):
• Set of states S
  – here: the possible tags at any point
• Output alphabet K
  – here: the possible tokens
• Initial probabilities Π
  – here: probabilities for the first item in a sentence/text
• State transition probabilities A
  – here: tag sequence probabilities
• Symbol emission probabilities B
  – here: token–tag probabilities
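The five-tuple maps naturally onto a small container; this is only an illustrative encoding (the field names and the sample numbers, taken from earlier slides, are mine):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: set        # S: the possible tags
    alphabet: set      # K: the possible tokens
    initial: dict      # Pi: P(tag starts a sentence)
    transitions: dict  # A: P(next tag | tag), keyed by (tag, tag)
    emissions: dict    # B: P(token | tag), keyed by (tag, token)

hmm = HMM(states={"ART", "ADV", "NOUN", "VERB"},
          alphabet={"the", "wind", "blows"},
          initial={"ART": 0.9, "ADV": 0.1},
          transitions={("ART", "NOUN"): 0.72, ("NOUN", "VERB"): 0.61},
          emissions={("NOUN", "wind"): 0.76, ("VERB", "blows"): 0.53})
```

In a POS tagger the states are hidden (we observe only tokens, never tags), which is what puts the "hidden" in Hidden Markov Model.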


POST: Current state


• Baseline approach (tagging each token with its most frequent tag) delivers up to 90% accuracy
• State-of-the-art taggers reach 96–97% accuracy
• But: given an average sentence length of 20 words in a newspaper text, we still get errors in most sentences!
• POS taggers are used as a first step in most complex CL applications
• Some free online taggers: CLAWS, CST, CCG…
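The baseline mentioned above, tagging every token with its most frequent tag, takes only a few lines. A sketch (the counts are the invented numbers from the earlier slides, and the fallback for unknown tokens is my own crude choice):

```python
from collections import Counter, defaultdict

# Token/tag counts as they might come out of a tagged corpus
# (invented numbers from the slides).
counts = defaultdict(Counter)
for token, tag, n in [("wind", "NOUN", 76), ("wind", "VERB", 24),
                      ("blows", "VERB", 53), ("blows", "NOUN", 47),
                      ("the", "ART", 98), ("the", "ADV", 2)]:
    counts[token][tag] += n

def baseline_tag(token):
    # The most frequent tag wins; context is ignored entirely.
    if token in counts:
        return counts[token].most_common(1)[0][0]
    return "NOUN"  # crude fallback for unknown tokens

print([baseline_tag(w) for w in ["the", "wind", "blows"]])
```

On "The wind blows" the baseline happens to agree with the Viterbi result; its roughly 90% accuracy comes from the fact that most tokens simply are not ambiguous, which is also why the remaining percentage points are so hard to win.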