Basic natural language processing for Swiss German texts

Language and Space Lab

Tanja Samardžić

07/06/2018



Long-term contribution

Noemi Aepli, Fatima Stadler, Yves Scherrer, Elvira Glaser

Funding

Hasler Foundation grant No 16038

UZH URPP ’Language and Space’

Agreement with Spitch

Specific tasks

Henning Beywl, Christof Bless, Alexandra Bünzli, Matthias Friedli, Anne Göhring, Noemi Graf, Anja Hasse, Gordon Heath, Agnes Kolmer, Mike Lingg, Patrick Mächler, Eva Peters, Uliana Petrunina, Janine Richner-Steiner, Hana Ruch, Beni Ruef, Phillip Ströbel, Simone Ueberwasser, Alexandra Zoller


Data


Oral history project ArchiMob


The ArchiMob corpus sample


Some numbers

44 documents selected by Janine Richner-Steiner and Matthias Friedli, supervised by Elvira Glaser

Release 1.0 (2016):

– 34 documents, around 500 000 word tokens

– 23/44 documents transcribed in the period 2004–2014

– 11/44 documents transcribed in 2015, in collaboration with Spitch

Next release (2017):

– 43 documents, around 650 000 word tokens

– 6/44 documents transcribed in 2016

– 3/44 in progress


Format


Current format


Content

Transcription   Normalisation   PoS tag
je              ja              ITJ
de              dann            ADV
het             hat             VAFIN
me              man             PIS
no              noch            ADV
gluegt          gelugt          VVPP
tankt           gedacht         VVPP
dasch           das ist         PDS+
ez              jetzt           ADV
de              der             ART
genneraal       general         NN

jaa             ja              ITJ
das             das             PDS
ischsch         ist             VAFIN
en              en              PPER
ez              jetzt           ADV
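Each token in the sample carries three layers: the transcription, its normalisation, and a part-of-speech tag. A minimal sketch of how such triples could be held in memory follows. The `Token` class, its field names, and the `layer` helper are illustrative assumptions, not the corpus's actual XML schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One token with its three annotation layers (hypothetical structure)."""
    transcription: str   # dialect spelling as transcribed
    normalisation: str   # normalised form; may contain several words
    pos: str             # part-of-speech tag


# Second utterance of the sample above.
utterance = [
    Token("jaa", "ja", "ITJ"),
    Token("das", "das", "PDS"),
    Token("ischsch", "ist", "VAFIN"),
    Token("en", "en", "PPER"),
    Token("ez", "jetzt", "ADV"),
]


def layer(tokens, name):
    """Return one annotation layer of an utterance as a list of strings."""
    return [getattr(t, name) for t in tokens]


normalised_text = " ".join(layer(utterance, "normalisation"))
```

Keeping the layers on the token, rather than in three parallel lists, makes one-to-many normalisations such as `dasch` to `das ist` easy to store without breaking alignment.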


Transcription


Manual transcription

1. 16 documents - Nisus Writer
   – No segmentation (only turns)
   – No text-to-speech alignment
   – Converted into XML, added segmentation and alignment

2. 7 documents - FOLKER (Schmidt, 2012)
   – Segmented into chunks of 4-10 seconds
   – XML and alignment output

3. 11 documents - EXMARaLDA (Schmidt, 2012)
   – Same as FOLKER, just more convenient


Some details

– Based on Dieth guidelines, but gradually simplified

– Utterance as the basic unit

– Turns not explicitly annotated

– Inconsistency in writing (pronouns and clitics)

– Pauses, repetitions

– Incomprehensible speech


Normalisation


Approach

– Manual normalisation of 6 documents, VARD2 and IGT

– Automatic normalisation
   – Character-level statistical machine translation (CSMT) with Moses
   – Training on the 6 manually normalised documents


CSMT

Translation model: p(normalised | transcribed)

i s c h s c h → i s t

Language model: p(normalised_i | normalised_{i−1})

i s t
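The two models above can be combined by adding their log-probabilities and keeping the best-scoring candidate. The sketch below is a toy illustration of that scoring step only: it ranks two hand-picked candidates instead of searching over character segments as a full CSMT system would. All probabilities, the `TM` and `LM` tables, and the fallback value `UNSEEN` are invented for illustration.

```python
import math

# Toy translation model: p(normalised | transcribed) for whole candidates.
# The numbers are invented, not estimated from any corpus.
TM = {
    "ischsch": {"ist": 0.6, "isch": 0.4},
}

# Toy character bigram language model over normalised text,
# with "^" as the start symbol and "$" as the end symbol.
LM = {
    ("^", "i"): 0.5, ("i", "s"): 0.4, ("s", "t"): 0.3, ("t", "$"): 0.5,
    ("s", "c"): 0.2, ("c", "h"): 0.9, ("h", "$"): 0.1,
}
UNSEEN = 1e-4  # fallback probability for bigrams missing from the toy table


def lm_logprob(word):
    """Sum of log p(char_i | char_{i-1}) over the padded character sequence."""
    chars = ["^", *word, "$"]
    return sum(math.log(LM.get(pair, UNSEEN)) for pair in zip(chars, chars[1:]))


def normalise(transcribed):
    """Pick the candidate with the highest combined log-score."""
    candidates = TM[transcribed]
    return max(candidates, key=lambda c: math.log(candidates[c]) + lm_logprob(c))


best = normalise("ischsch")
```

In this toy setting the language model reinforces the translation model: both prefer `ist` for `ischsch`.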


Current state of the art

Yves Scherrer and Nikola Ljubešić (KONVENS 2016)

– Larger translation units (utterances instead of words)

– Language model augmented with German spoken data

– Improved tuning

– Result: 90.46 % accuracy


Part-of-speech tagging


Tagger development

STTS+ tag set

Setting              Train             Test        % Acc.   % OOV
Starting             TüBa-D/S          Normalised  70.31    24.21
Starting             NOAH              Original    60.56    30.72
Removed punctuation  TüBa-D/S          Normalised  70.68    24.21
Removed punctuation  NOAH              Original    73.09    30.72
Adapted              NOAH + ArchiMob   Original    90.09    –
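The slide does not define the two measures, so the sketch below assumes the usual readings: accuracy is the share of test tokens that receive the gold tag, and the OOV rate is the share of test tokens whose word form never occurs in the training data. The function name `evaluate` and the toy data are illustrative only.

```python
def evaluate(gold, predicted_tags, train_vocab):
    """Return (accuracy %, OOV %) for a tagged test set.

    gold: list of (word, gold_tag) pairs
    predicted_tags: list of tags, aligned with gold
    train_vocab: set of word forms seen in the training data
    """
    n = len(gold)
    correct = sum(1 for (_, tag), pred in zip(gold, predicted_tags) if tag == pred)
    oov = sum(1 for word, _ in gold if word not in train_vocab)
    return 100.0 * correct / n, 100.0 * oov / n


# Toy test utterance taken from the sample; predictions and vocabulary are invented.
gold = [("jaa", "ITJ"), ("das", "PDS"), ("ischsch", "VAFIN"),
        ("en", "PPER"), ("ez", "ADV")]
predicted_tags = ["ITJ", "PDS", "VAFIN", "ART", "ADV"]
train_vocab = {"das", "en", "ez", "de"}

acc, oov = evaluate(gold, predicted_tags, train_vocab)
```

Under these definitions, a high OOV rate on the original transcriptions is what one would expect when spelling varies, since each variant counts as a separate word form.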


Current activities

Tagger adaptation:

– Active learning: gradually add ArchiMob data in the train set

– CRF tagger
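The slide does not say how the ArchiMob data to add is chosen. One common option is uncertainty sampling, sketched below as an assumption: in each round, the utterances the current tagger is least confident about are annotated and moved into the training set. The functions `fit`, `confidence`, and `annotate` are toy stand-ins. A real setup would train a tagger such as a CRF and use its output probabilities.

```python
def select_batch(pool, confidence, batch_size):
    """Pick the utterances with the lowest confidence score."""
    return sorted(pool, key=confidence)[:batch_size]


def active_learning(train, pool, fit, confidence, annotate, rounds, batch_size):
    """Gradually move the least confident utterances from pool to train."""
    for _ in range(rounds):
        model = fit(train)
        batch = select_batch(pool, lambda u: confidence(model, u), batch_size)
        train = train + [annotate(u) for u in batch]
        pool = [u for u in pool if u not in batch]
    return train, pool


def fit(train):
    # Toy "model": the set of word forms seen in training.
    return {word for utterance in train for word, _ in utterance}


def confidence(model, utterance):
    # Toy confidence: share of words already seen in training.
    return sum(word in model for word in utterance) / len(utterance)


def annotate(utterance):
    # Stand-in for a human annotator; "X" is a placeholder tag.
    return [(word, "X") for word in utterance]


train = [[("das", "PDS"), ("ez", "ADV")]]
pool = [["das", "ez"], ["jaa", "das"], ["ischsch", "en"]]
new_train, remaining = active_learning(
    train, pool, fit, confidence, annotate, rounds=2, batch_size=1
)
```

With this toy data, the utterance made up entirely of known words stays in the pool, because annotating it would teach the model the least.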


Speech-to-text


Acoustic model: p(transcribed | sound)

/.../ /.../ /.../ /.../ /.../ /.../ /.../ → das ischsch en ez

Language model: p(transcribed_i | transcribed_{i−1})

das ischsch en ez
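The language model term p(transcribed_i | transcribed_{i−1}) can be estimated from bigram counts over transcribed utterances. The sketch below uses unsmoothed relative frequencies, which is an assumption made for brevity. Practical speech-to-text systems smooth these estimates. The function name `train_bigram_lm` and the second toy utterance are invented for illustration.

```python
from collections import Counter


def train_bigram_lm(utterances):
    """Return a function p(prev, cur) estimated by relative frequency."""
    bigrams, contexts = Counter(), Counter()
    for words in utterances:
        padded = ["<s>", *words, "</s>"]  # sentence boundary symbols
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    def prob(prev, cur):
        # Unseen contexts get probability 0.0 in this unsmoothed sketch.
        if contexts[prev] == 0:
            return 0.0
        return bigrams[(prev, cur)] / contexts[prev]

    return prob


utterances = [
    ["jaa", "das", "ischsch", "en", "ez"],  # from the sample
    ["das", "ez"],                          # invented toy utterance
]
p = train_bigram_lm(utterances)
p_ischsch_given_das = p("das", "ischsch")
```

Here `das` is followed once by `ischsch` and once by `ez`, so each continuation gets half of the probability mass.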


Approach

– Improving Spitch prototype with new language models

– Our own speech-to-text development with Kaldi

– Manual transcription


Next steps


– Continue transcription, PoS tagging, normalisation

– Neural transducers (deep learning) for normalisation

– Subword language models for speech-to-text

– New data


Your feedback!