sms4science.ch A multi-lingual challenge for Part-of...


Page 1: sms4science.ch A multi-lingual challenge for Part-of ...ird-cmc-rennes.sciencesconf.org/conference/ird-cmc-rennes/pages/... · sms4science.ch: A multi-lingual challenge for ... I

German Seminar

sms4science.ch: A multi-lingual challenge for Part-of-Speech tagging
ird-cmc-rennes

Simone Ueberwasser M.A.

24/10/2015


I like you –> saumässig –> and my little –> härzli pöpperlet –> toujour –> per te! –> You are –> mon ceur, –> tu sei –> min stärn, –> I have you –> eifach –> molto –> gärn! –> Hdslmf –> din –> Hase


Outline

The corpus

Necessity of normalization

Aim of normalization

Approach

Conclusions

Outlook


The corpus: The project

- Funded by the Swiss National Science Foundation: ~€1.5 million
- Seven doctoral students
- Zurich, Bern, Neuchâtel, Leipzig
- Lead: Elisabeth Stark, Zurich
- www.sms4science.ch


The corpus: The data collection

- Nov 2009 – Jan 2010
- Collected in co-operation with Swisscom
- ~26,000 SMS
- ~650,000 tokens
- More than 50% with code-switching
- Demographic questionnaires: 1,316, covering 79% of SMS
- Freely available for research


The corpus: Languages

Language              SMS
Swiss German dialect  10,734
Standard German        7,254
French                 4,622
Italian                1,475
Romansh                1,120
English                  538
...
Dialetto                  50
Spanish                   43
Patois                    28


The corpus: Processing

- Anonymisation
- Language tagging (manual processing):
  - Main language contributes most tokens
  - Borrowings (established, in dictionary)
  - Nonce-borrowings (spontaneous, not in dictionary)
- Normalization (manual processing)
- PoS tagging
- (Morphosyntactic tagging)
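
The language-tagging criteria above can be sketched in a few lines of Python. This is only an illustration, not the project's actual procedure (the slides state that tagging was done manually): the language codes, the function names and the idea of passing in a dictionary as a set are all assumptions made for the example.

```python
from collections import Counter


def main_language(token_languages):
    """Return the language that contributes the most tokens to one SMS.

    `token_languages` holds one language code per token. The codes used
    below ("gsw", "eng", "fra") are illustrative, not the corpus tag set.
    """
    counts = Counter(token_languages)
    # most_common(1) yields [(language, count)]; on a tie, the language
    # seen first in the message wins.
    return counts.most_common(1)[0][0]


def borrowing_type(token, main_language_dictionary):
    """Classify a foreign token as an established or a nonce borrowing.

    Established borrowings are listed in a dictionary of the main
    language; nonce borrowings are spontaneous and not listed.
    """
    if token.lower() in main_language_dictionary:
        return "borrowing"
    return "nonce-borrowing"


if __name__ == "__main__":
    tags = ["gsw", "gsw", "eng", "gsw", "fra"]
    print(main_language(tags))
    print(borrowing_type("lol", {"lol"}))
```

In this sketch the dictionary is a plain set of lower-cased word forms; any real lookup would depend on which reference dictionary is chosen for each main language.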


Abbreviations, borrowings

Figure: lol as an abbreviation and borrowing


Dialect: Far from Standard German

Figure: Bern dialect

‘If he does not come, I don’t have to come either?’


Dialect without orthography: Some spelling variants of ich

Figure: Some spelling variants of ich (‘I’)


Clitics

Figure: Double clitics

‘. . . whether I it . . . ’


Compulsory ellipses

Figure: Standard German: An welche Adresse hast Du mir geschrieben?

‘To which one of my addresses did you write’


Compulsory particle: Does not exist in the Standard

Figure: Particle go

‘Did you seriously go back to sleep?’


Case: No accusative in the NP in the Swiss German dialect

Figure: Standard German: . . . gestern hast [Du] einen Zahnarzttermin verpasst

‘Hey, you skipped a dentist appointment yesterday’


Variation in prepositions: Auf in the dialect, nach in the Standard

Figure: Variation in the use of prepositions

‘let’s go to Bern’


Word order

Figure: Standard German: [ich] habe es mir nämlich noch überlegt

‘I was actually thinking about that’


No adequate/distinct equivalent in the Standard: Example: abe (‘downwards’)

Figure: Standard German equivalents: hinunter, herunter, hinab, ø


Necessity for normalization: Summary

- Unorthodox spelling (in all languages)
- Clitics
- Abbreviations, borrowings
- Dialect without spelling norms (German and Italian)
- Compulsory ellipses
- Compulsory particles
- Case
- Word order
- No adequate/distinct equivalents
- Five variants of Romansh


Main aims of normalization

- Research into the syntax (of the dialect)
- Research into lexical variation
- Prepare PoS


Approach

Linguistic Approach
Technical Approach


Main aims of normalization

- Research into the syntax (of the dialect)
  - No change to word order
  - No compensation of ellipses
  - Leave required particles but standardize (go, goge, ga, gage –> go)
  - Do not adjust case
  - Separate clitics
  - No ‘replacement’ of prepositions
- Research into lexical variation
  - Use the Standard German variant wherever possible
  - Find a lemma that is similar in meaning and form where there is no equivalent, but be consistent
  - Mark abbreviations, emoticons and borrowings
  - Leave unrecognized elements as they are (e.g. tkdn, iLSi)
- Prepare PoS
  - Capitalize nouns (in German)
  - Expand abbreviations (e.g. lg –> liebe Grüsse)
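
A few of the token-level rules above can be illustrated with a small Python sketch. It is not the project's implementation (normalization was done manually in the glossing tool); the function name, the labels and the example lexicon entry are assumptions. Only the particle variants, the abbreviation lg and the unrecognized item tkdn are taken from the slides.

```python
# Examples taken from the slides.
PARTICLE_VARIANTS = {"go", "goge", "ga", "gage"}  # all standardized to "go"
ABBREVIATIONS = {"lg": "liebe Grüsse"}


def normalize_token(token, lexicon):
    """Return (normalized_form, label) for a single token.

    `lexicon` maps dialect spellings to Standard German forms. Its
    contents are hypothetical here; nouns in it are expected to be
    stored capitalized already.
    """
    lower = token.lower()
    if lower in PARTICLE_VARIANTS:
        # Required particles are kept but spelled consistently.
        return "go", "particle"
    if lower in ABBREVIATIONS:
        # Abbreviations are expanded and marked.
        return ABBREVIATIONS[lower], "abbreviation"
    if lower in lexicon:
        # Use the Standard German variant wherever one is known.
        return lexicon[lower], "normalized"
    # Unrecognized elements are left exactly as they are.
    return token, "unrecognized"


if __name__ == "__main__":
    example_lexicon = {"gärn": "gern"}  # illustrative entry only
    for tok in ["goge", "lg", "gärn", "tkdn"]:
        print(tok, "->", normalize_token(tok, example_lexicon))
```

The sketch deliberately leaves out the sentence-level rules (word order, ellipses, case, prepositions), since those say what must not be changed rather than what to rewrite.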


Requirements

- Server based (–> co-operation)
- Common vocabulary for annotators
- Suggestions
- One-to-many and many-to-one
- Feedback (e.g. errors in tokenization)
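
The one-to-many and many-to-one requirement can be pictured as an alignment between groups of original tokens and groups of normalized tokens. The sketch below shows one possible representation; the `Alignment` class and its field names are hypothetical and do not describe the data model of the actual tool. The lg example comes from the slides, while the split compound is an invented illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Alignment:
    """Links one or more original tokens to one or more normalized tokens."""
    source: Tuple[str, ...]
    target: Tuple[str, ...]

    @property
    def kind(self) -> str:
        n_src, n_tgt = len(self.source), len(self.target)
        if n_src == 1 and n_tgt == 1:
            return "one-to-one"
        if n_src == 1:
            return "one-to-many"
        if n_tgt == 1:
            return "many-to-one"
        return "many-to-many"


if __name__ == "__main__":
    # One token expands into two (abbreviation from the slides).
    expanded = Alignment(source=("lg",), target=("liebe", "Grüsse"))
    # Two tokens merge into one (hypothetical split compound).
    merged = Alignment(source=("zahnarzt", "termin"), target=("Zahnarzttermin",))
    print(expanded.kind, merged.kind)
```

Storing both sides as tuples keeps the original tokens untouched, so tokenization errors can still be reported against the source side.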


The tool: SMS Glossing Tool

Figure: SMS Glossing Tool


Conclusions

- A small data set allows for manual treatment
- Linguistic rules: change as little as possible and be consistent when you have to change things
- Technical setup: Work with a self-growing dictionary that can be shared between annotators
- Resulting accuracy: ~95%
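
The idea of a self-growing dictionary that feeds suggestions back to annotators can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class and method names are hypothetical, the server-based sharing is not modelled, and the spelling variants used in the example are chosen for illustration rather than taken from the corpus.

```python
from collections import Counter, defaultdict


class SharedDictionary:
    """Grows with every annotator decision and suggests earlier choices."""

    def __init__(self):
        # original spelling -> Counter of normalized forms chosen so far
        self._entries = defaultdict(Counter)

    def record(self, original, normalized):
        """Store one annotator decision."""
        self._entries[original.lower()][normalized] += 1

    def suggest(self, original, limit=3):
        """Return the most frequently chosen normalizations, best first."""
        counts = self._entries.get(original.lower())
        if not counts:
            return []
        return [form for form, _ in counts.most_common(limit)]


if __name__ == "__main__":
    shared = SharedDictionary()
    # Illustrative decisions by two annotators for variants of "ich".
    shared.record("ig", "ich")
    shared.record("i", "ich")
    shared.record("ig", "ich")
    print(shared.suggest("ig"))
    print(shared.suggest("iLSi"))
```

Ranking suggestions by how often a form was chosen nudges annotators toward the consistent choices that the linguistic rules ask for, while still leaving the final decision to them.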


Outlook: What’s up, Switzerland

- Start: Jan 1st 2016
- 500 million tokens
- No manual processing possible
- SMS data will be used for automated annotation
- Accuracy: ???
