German Seminar
sms4science.ch: A multi-lingual challenge for Part-of-Speech tagging
ird-cmc-rennes
Simone Ueberwasser M.A.
24/10/2015 Page 1
I like you –> saumässig –> and my little –> härzli pöpperlet –> toujour –>per te! –> You are –> mon ceur, –> tu sei –> min stärn, –> I have you –>eifach –> molto –> gärn! –> Hdslmf –> din –> Hase
Outline
The corpus
Necessity of normalization
Aim of normalization
Approach (linguistic approach, technical approach)
Conclusions
Outlook
The corpus: The project
- Funded by the Swiss National Science Foundation: approx. €1.5 million
- Seven doctoral students
- Zurich, Bern, Neuchâtel, Leipzig
- Lead: Elisabeth Stark, Zurich
- www.sms4science.ch
The corpus: The data collection
- Nov 2009 – Jan 2010
- Collected in co-operation with Swisscom
- ~26,000 SMS
- ~650,000 tokens
- More than 50% with code-switching
- Demographic questionnaires: 1,316, covering 79% of SMS
- Freely available for research
The corpus: Languages
Language                 SMS
Swiss German dialect     10’734
Standard German          7’254
French                   4’622
Italian                  1’475
Romansh                  1’120
English                  538
...                      ...
Dialetto                 50
Spanish                  43
Patois                   28
The corpus: Processing
- Anonymisation
- Language tagging (manual processing):
  - Main language contributes most tokens
  - Borrowings (established, in dictionary)
  - Nonce-borrowings (spontaneous, not in dictionary)
- Normalization (manual processing)
- PoS tagging
- (Morphosyntactic tagging)
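The main-language rule above (the language that contributes most tokens) can be sketched in a few lines of Python. This is only an illustration: the project did its language tagging manually, and the function name `main_language` and the language codes in the example are assumptions, not part of the project's tooling.

```python
from collections import Counter

def main_language(token_languages):
    """Return the language label that occurs on the most tokens of one SMS."""
    if not token_languages:
        raise ValueError("an SMS needs at least one labelled token")
    counts = Counter(token_languages)
    # most_common(1) returns [(label, count)]; ties go to the label seen first.
    language, _ = counts.most_common(1)[0]
    return language

# Hypothetical per-token labels for a code-switched message.
labels = ["eng", "eng", "eng", "gsw", "eng", "eng", "eng", "gsw", "gsw", "fra"]
print(main_language(labels))  # prints "eng"
```

How ties should be resolved is not stated on the slide; taking the first label seen is an arbitrary choice here.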
Abbreviations, borrowings
Figure: lol as an abbreviation and borrowing
Dialect: Far from Standard German
Figure: Bern dialect
‘If he does not come, I don’t have to come either?’
Dialect without orthography: Some spelling variants of ich
Figure: Some spelling variants of ich (‘I’)
Clitics
Figure: Double clitics
‘. . . whether I it . . . ’
Compulsory ellipses
Figure: Standard German: An welche Adresse hast Du mir geschrieben?
‘To which one of my addresses did you write?’
Compulsory particle: Does not exist in the Standard
Figure: Particle go
‘Did you seriously go back to sleep?’
Case: No accusative in the NP in the Swiss German dialect
Figure: Standard German: . . . gestern hast [Du] einen Zahnarzttermin verpasst
‘Hey, you skipped a dentist appointment yesterday’
Variation in prepositions: Auf in the dialect, nach in the Standard
Figure: Variation in the use of prepositions
‘let’s go to Bern’
Word order
Figure: Standard German: [ich] habe es mir nämlich noch überlegt
‘I was actually thinking about that’
No adequate/distinct equivalent in the Standard
Example: abe (‘downwards’)
Figure: Standard German equivalents: hinunter, herunter, hinab, ø
Necessity for normalization: Summary
- Unorthodox spelling (in all languages)
- Clitics
- Abbreviations, borrowings
- Dialect without spelling norms (German and Italian)
- Compulsory ellipses
- Compulsory particles
- Case
- Word order
- No adequate/distinct equivalents
- Five variants of Romansh
Main aims of normalization
- Research into the syntax (of the dialect)
- Research into lexical variation
- Prepare PoS tagging
Main aims of normalization
- Research into the syntax (of the dialect)
  - No change to word order
  - No compensation of ellipses
  - Leave required particles but standardize (go, goge, ga, gage –> go)
  - Do not adjust case
  - Separate clitics
  - No ‘replacement’ of prepositions
- Research into lexical variation
  - Use the Standard German variant wherever possible
  - Find a lemma that is similar in meaning and form where there is no equivalent, but be consistent
  - Mark abbreviations, emoticons and borrowings
  - Leave unrecognized elements as they are (e.g. tkdn, iLSi)
- Prepare PoS tagging
  - Capitalize nouns (in German)
  - Expand abbreviations (e.g. lg –> liebe Grüsse)
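A few of these rules can be illustrated with a small token-level sketch. It is not the project's procedure (normalization was done manually): the lookup tables contain only the examples given on the slide, and the function names and the tag labels `particle`, `abbreviation` and `unchanged` are hypothetical. A context-free lookup is also a simplification, since a form such as ga may need disambiguation in real data.

```python
# Lookup tables holding only the examples from the slide (assumed structure).
PARTICLE_VARIANTS = {"go": "go", "goge": "go", "ga": "go", "gage": "go"}
ABBREVIATIONS = {"lg": "liebe Grüsse"}

def normalize_token(token):
    """Return (normalized_form, tag) for one token."""
    key = token.lower()
    if key in PARTICLE_VARIANTS:
        # Keep the required particle, but map all variants to one form.
        return PARTICLE_VARIANTS[key], "particle"
    if key in ABBREVIATIONS:
        # Expand the abbreviation and mark it as such.
        return ABBREVIATIONS[key], "abbreviation"
    # Unrecognized elements are left exactly as they are.
    return token, "unchanged"

def normalize(tokens):
    """Normalize token by token; the word order is never changed."""
    return [normalize_token(t) for t in tokens]

print(normalize(["goge", "lg", "tkdn"]))
# [('go', 'particle'), ('liebe Grüsse', 'abbreviation'), ('tkdn', 'unchanged')]
```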
Requirements
- Server-based (–> co-operation)
- Common vocabulary for annotators
- Suggestions
- One-to-many and many-to-one
- Feedback (e.g. errors in tokenization)
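The one-to-many and many-to-one requirement can be pictured as an alignment record between original and normalized tokens. The class below is a hypothetical data structure for illustration, not the SMS Glossing Tool's actual data model. The lg example comes from the slides; the split compound is invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alignment:
    source: tuple  # original SMS token(s)
    target: tuple  # normalized token(s)

    @property
    def kind(self):
        """Classify the mapping by the number of tokens on each side."""
        if len(self.source) == 1 and len(self.target) == 1:
            return "one-to-one"
        if len(self.source) == 1 and len(self.target) > 1:
            return "one-to-many"
        if len(self.source) > 1 and len(self.target) == 1:
            return "many-to-one"
        return "many-to-many"

# One original token expands into two normalized tokens (example from the slides).
print(Alignment(("lg",), ("liebe", "Grüsse")).kind)              # one-to-many
# Invented example: a compound written as two tokens is joined into one.
print(Alignment(("zahnarzt", "termin"), ("Zahnarzttermin",)).kind)  # many-to-one
```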
The tool: SMS Glossing Tool
Figure: SMS Glossing Tool
Conclusions
- A small data set allows for manual treatment
- Linguistic rules: change as little as possible and be consistent when you have to change things
- Technical setup: work with a self-growing dictionary that can be shared between annotators
- Resulting accuracy: ~95%
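The idea of a self-growing dictionary can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class `SharedLexicon` and its methods are hypothetical, the server side that would share it between annotators is not modelled, and the dialect form ig is an assumed spelling variant of ich.

```python
from collections import Counter, defaultdict

class SharedLexicon:
    """Grows with every confirmed normalization and suggests earlier choices."""

    def __init__(self):
        # original form (lower-cased) -> Counter of normalized forms
        self._entries = defaultdict(Counter)

    def confirm(self, original, normalized):
        """Record that an annotator normalized `original` as `normalized`."""
        self._entries[original.lower()][normalized] += 1

    def suggest(self, original, limit=3):
        """Return up to `limit` known normalizations, most frequent first."""
        counts = self._entries.get(original.lower())
        if not counts:
            return []
        return [form for form, _ in counts.most_common(limit)]

lexicon = SharedLexicon()
lexicon.confirm("ig", "ich")     # first annotator
lexicon.confirm("ig", "ich")     # second annotator confirms the same choice
print(lexicon.suggest("ig"))     # ['ich']
print(lexicon.suggest("tkdn"))   # [] (nothing known yet)
```

Ranking suggestions by how often they were confirmed is one way to encourage consistency between annotators; the slides do not say how the actual tool ranks them.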
Outlook: What’s up, Switzerland
- Start: Jan 1st 2016
- 500 million tokens
- No manual processing possible
- SMS data will be used for automated annotation
- Accuracy: ???