Post on 23-Mar-2021
Language and Space Lab
Basic natural languageprocessing for Swiss GermantextsTanja Samardzic
07/06/2018 Page 1
Language and Space Lab
Long-term contributionNoemi AepliFatima StadlerYves ScherrerElvira Glaser
Funding
Hasler Foundation grant No 16038
UZH URPP ’Language and Space’
Agreement with Spitch
Specific tasks
Henning BeywlChristof BlessAlexandra BunzliMatthias FriedliAnne GohringNoemi GrafAnja Hasse
Gordon HeathAgnes KolmerMike LinggPatrick MachlerEva PetersUliana Petrunina
Janine Richner-SteinerHana RuchBeni RuefPhillip StrobelSimone UeberwasserAlexandra Zoller
07/06/2018 Basic natural language processing for Swiss German texts Page 2
Language and Space Lab
Data
Language and Space Lab
Oral history project ArchiMob
07/06/2018 Basic natural language processing for Swiss German texts Page 4
Language and Space Lab
The ArchiMob corpus sample
07/06/2018 Basic natural language processing for Swiss German texts Page 5
Language and Space Lab
Some numbers
44 documents selected by Janine Richner-Steiner and Matthias Friedli,supervised by Elvira Glaser
Release 1.0 (2016):
– 34 documents, around 500 000 word tokens
– 23/44 documents transcribed in the period 2004–2014
– 11/44 documents transcribed in 2015, in collaboration with Spitch
Next release (2017):
– 43 documents, around 650 000 word tokens
– 6/44 documents transcribed in 2016
– 3/44 in progress
07/06/2018 Basic natural language processing for Swiss German texts Page 6
Language and Space Lab
Format
Language and Space Lab
Current format
07/06/2018 Basic natural language processing for Swiss German texts Page 8
Language and Space Lab
Content
je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN
jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV
07/06/2018 Basic natural language processing for Swiss German texts Page 9
Language and Space Lab
Transcription
Language and Space Lab
Transcription
je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN
jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV
07/06/2018 Basic natural language processing for Swiss German texts Page 11
Language and Space Lab
Manual transcription
1. 16 documents - Nisus Writer– No segmentation (only turns)– No text to speech alignment– Converted into XML, added segmentation and alignment
2. 7 documents - FOLKER (Schmidt, 2012)– Segmented into chunks of 4-10 seconds– XML and alignment output
3. 11 documents - EXMARaLDA (Schmidt, 2012)– same as FOLKER, just more convenient
07/06/2018 Basic natural language processing for Swiss German texts Page 12
Language and Space Lab
Some details
– Based on Dieth guidelines, but gradually simplified
– Utterance as the basic unit
– Turns not explicitly annotated
– Inconsistence in writing (pronouns and clitics)
– Pauses, repetitions
– Incomprehensible speech
07/06/2018 Basic natural language processing for Swiss German texts Page 13
Language and Space Lab
Normalisation
je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN
jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV
07/06/2018 Basic natural language processing for Swiss German texts Page 14
Language and Space Lab
Approach
– Manual normalisation of 6 documents, VARD2 and IGT
– Automatic normalisation– Character-level machine translation (CSMT) with MOSES– Training on the 6 manually normalised documents
07/06/2018 Basic natural language processing for Swiss German texts Page 15
Language and Space Lab
CSMT
Translation model: p(normalised |transcribed)
i s c h s c h i s t
Language model: p(normalisedi |normalisedi−1)
i s t
07/06/2018 Basic natural language processing for Swiss German texts Page 16
Language and Space Lab
Current state of the art
Yves Scherrer and Nikola Ljubesic (KONVENS 2016)
– Larger translation units (utterances instead of words)
– Language model augmented with German spoken data
– Improved tuning
– Result: 90.46 % accuracy
07/06/2018 Basic natural language processing for Swiss German texts Page 17
Language and Space Lab
Part-of-speech tagging
Language and Space Lab
Part-of-speech
je ja ITJde dann ADVhet hat VAFINme man PISno noch ADVgluegt gelugt VVPPtankt gedacht VVPPdasch das ist PDS+ez jetzt ADVde der ARTgenneraal general NN
jaa ja ITJdas das PDSischsch ist VAFINen en PPERez jetzt ADV
07/06/2018 Basic natural language processing for Swiss German texts Page 19
Language and Space Lab
Tagger development
STTS+ tag set
Train Test % Acc. % OOV
TuBa-D/S Normalised 70.31 24.21Starting NOAH Original 60.56 30.72
Removed TuBa-D/S Normalised 70.68 24.21punctuation NOAH Original 73.09 30.72
NOAH +Adapted ArchiMob Original 90.09 –
07/06/2018 Basic natural language processing for Swiss German texts Page 20
Language and Space Lab
Current activities
Tagger adaptation:
– Active learning: gradually add ArchiMob data in the train set
– CRF tagger
07/06/2018 Basic natural language processing for Swiss German texts Page 21
Language and Space Lab
Speech-to-text
Language and Space Lab
Speech-to-text
Acoustic model: p(transcribed |sound)
/.../ /.../ /.../ /.../ /.../ /.../ /.../ das ischsch en ez
Language model: p(transcribedi |transcribedi−1)
das ischsch en ez
07/06/2018 Basic natural language processing for Swiss German texts Page 23
Language and Space Lab
Approach
– Improving Spitch prototype with new language models
– Our own speech-to-text development with Kaldi
– Manual transcription
07/06/2018 Basic natural language processing for Swiss German texts Page 24
Language and Space Lab
Next steps
Language and Space Lab
Next steps
– Continue transcription, PoS tagging, normalisation
– Neural transducers (deep learning) for normalisation
– Subword language models for speech-to-text
– New data
07/06/2018 Basic natural language processing for Swiss German texts Page 26
Language and Space Lab
Your feedback!