Language and Space Lab

Basic natural language processing for Swiss German texts

Tanja Samardzic

07/06/2018 Page 1

Long-term contribution

Noemi Aepli, Fatima Stadler, Yves Scherrer, Elvira Glaser

Funding

Hasler Foundation grant No 16038

UZH URPP ‘Language and Space’

Agreement with Spitch

Specific tasks

Henning Beywl, Christof Bless, Alexandra Bunzli, Matthias Friedli, Anne Gohring, Noemi Graf, Anja Hasse, Gordon Heath, Agnes Kolmer, Mike Lingg, Patrick Machler, Eva Peters, Uliana Petrunina, Janine Richner-Steiner, Hana Ruch, Beni Ruef, Phillip Strobel, Simone Ueberwasser, Alexandra Zoller

Data

Oral history project ArchiMob

The ArchiMob corpus sample

Some numbers

44 documents selected by Janine Richner-Steiner and Matthias Friedli, supervised by Elvira Glaser

Release 1.0 (2016):

– 34 documents, around 500 000 word tokens

– 23/44 documents transcribed in the period 2004–2014

– 11/44 documents transcribed in 2015, in collaboration with Spitch

Next release (2017):

– 43 documents, around 650 000 word tokens

– 6/44 documents transcribed in 2016

– 3/44 in progress

Format

Current format

Content

Transcription   Normalisation   PoS tag

je              ja              ITJ
de              dann            ADV
het             hat             VAFIN
me              man             PIS
no              noch            ADV
gluegt          gelugt          VVPP
tankt           gedacht         VVPP
dasch           das ist         PDS+
ez              jetzt           ADV
de              der             ART
genneraal       general         NN

jaa             ja              ITJ
das             das             PDS
ischsch         ist             VAFIN
en              en              PPER
ez              jetzt           ADV
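
Each token in the sample carries three fields: the transcribed form, its normalised form, and a PoS tag. A minimal sketch of such a record in Python follows. The `Token` class and its field names are assumptions made for illustration and are not part of the ArchiMob release; the rows are copied from the sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One corpus token; class and field names are illustrative assumptions."""
    transcribed: str   # form as written by the transcriber
    normalised: str    # normalised form (may contain more than one word)
    pos: str           # PoS tag


# A few rows copied from the sample.
utterance = [
    Token("jaa", "ja", "ITJ"),
    Token("das", "das", "PDS"),
    Token("ischsch", "ist", "VAFIN"),
    Token("en", "en", "PPER"),
    Token("ez", "jetzt", "ADV"),
]

# One transcribed token can map to several normalised words.
merged = Token("dasch", "das ist", "PDS+")


def normalised_text(tokens):
    """Join the normalised forms of a token sequence into one string."""
    return " ".join(t.normalised for t in tokens)


print(normalised_text(utterance))
```

Keeping the normalised form as a free string, rather than a single word, accommodates merged forms such as "dasch", which normalises to two words.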

Transcription

Manual transcription

1. 16 documents - Nisus Writer
   – No segmentation (only turns)
   – No text to speech alignment
   – Converted into XML, added segmentation and alignment

2. 7 documents - FOLKER (Schmidt, 2012)
   – Segmented into chunks of 4-10 seconds
   – XML and alignment output

3. 11 documents - EXMARaLDA (Schmidt, 2012)
   – Same as FOLKER, just more convenient

Some details

– Based on Dieth guidelines, but gradually simplified

– Utterance as the basic unit

– Turns not explicitly annotated

– Inconsistency in writing (pronouns and clitics)

– Pauses, repetitions

– Incomprehensible speech

Normalisation

Approach

– Manual normalisation of 6 documents, VARD2 and IGT

– Automatic normalisation
   – Character-level statistical machine translation (CSMT) with MOSES
   – Training on the 6 manually normalised documents

CSMT

Translation model: $p(\text{normalised} \mid \text{transcribed})$

i s c h s c h → i s t

Language model: $p(\text{normalised}_i \mid \text{normalised}_{i-1})$

i s t
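
A minimal sketch of the two ingredients above, assuming a toy setup rather than the actual MOSES pipeline. Words are written as space-separated characters so that a phrase-based system can treat each character as a token, and a character bigram model with add-one smoothing stands in for the language model. The function names, the smoothing choice, and the tiny training list are all hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


def to_chars(word):
    """Write a word as space-separated characters, e.g. 'ist' -> 'i s t'."""
    return " ".join(word)


def train_bigram_lm(words):
    """Count character bigrams and their left contexts over a list of words."""
    bigrams, contexts, vocab = Counter(), Counter(), {EOS}
    for word in words:
        chars = [BOS] + list(word) + [EOS]
        vocab.update(word)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def log_prob(word, model):
    """Add-one smoothed log p(word) = sum_i log p(c_i | c_{i-1})."""
    bigrams, contexts, vocab = model
    chars = [BOS] + list(word) + [EOS]
    return sum(
        math.log((bigrams[(p, c)] + 1) / (contexts[p] + len(vocab)))
        for p, c in zip(chars, chars[1:])
    )


# Toy normalised-side data (assumption: sample rows only, not a real corpus).
lm = train_bigram_lm(["ist", "das", "jetzt", "ja", "dann", "hat", "man", "noch"])
print(to_chars("ischsch"), "->", to_chars("ist"))
print(log_prob("ist", lm) > log_prob("ischsch", lm))
```

In this toy model the normalised form "ist" scores higher than the transcribed form "ischsch", which is the preference the language model is meant to contribute during decoding.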

Current state of the art

Yves Scherrer and Nikola Ljubesic (KONVENS 2016)

– Larger translation units (utterances instead of words)

– Language model augmented with German spoken data

– Improved tuning

– Result: 90.46 % accuracy

Part-of-speech tagging

Tagger development

STTS+ tag set

Setting               Train             Test         % Acc.   % OOV

Starting              TuBa-D/S          Normalised   70.31    24.21
Starting              NOAH              Original     60.56    30.72
Removed punctuation   TuBa-D/S          Normalised   70.68    24.21
Removed punctuation   NOAH              Original     73.09    30.72
Adapted               NOAH + ArchiMob   Original     90.09    –
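
The two figures reported in the table can be computed along the following lines. This is a minimal sketch assuming the usual definitions: accuracy as the share of test tokens with the correct tag, and OOV as the share of test tokens that do not occur in the training data. The slides do not spell out the exact definitions, and the function names and toy data are hypothetical.

```python
def accuracy(gold_tags, predicted_tags):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


def oov_rate(train_tokens, test_tokens):
    """Percentage of test tokens that never occur in the training data."""
    vocabulary = set(train_tokens)
    unseen = sum(tok not in vocabulary for tok in test_tokens)
    return 100.0 * unseen / len(test_tokens)


# Toy example built from the sample rows (not real evaluation data).
train = ["ja", "das", "ist", "jetzt", "dann"]
test = ["jaa", "das", "ischsch", "en", "ez"]
gold = ["ITJ", "PDS", "VAFIN", "PPER", "ADV"]
pred = ["ITJ", "PDS", "VAFIN", "ART", "ADV"]

print(accuracy(gold, pred))   # 4 of 5 tags correct
print(oov_rate(train, test))  # 4 of 5 test tokens unseen in train
```

The toy data also illustrates why training on normalised text and testing on original transcriptions yields a high OOV rate: most dialect spellings never appear in the training vocabulary.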

Current activities

Tagger adaptation:

– Active learning: gradually add ArchiMob data to the training set

– CRF tagger
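
One common way to add data gradually is pool-based selection: the sentences the current tagger is least sure about are corrected by hand and moved into the training set. The sketch below assumes that scheme. The slides do not state the selection criterion actually used, and the confidence proxy (share of known words) and all function names are hypothetical.

```python
def select_batch(pool, confidence, batch_size):
    """Pick the batch_size pool items the current tagger is least sure about."""
    return sorted(pool, key=confidence)[:batch_size]


def active_learning_round(train, pool, confidence, batch_size):
    """Move one batch from the unlabelled pool to the training set.

    In a real setup the selected sentences would be hand-corrected
    before being added; that step is omitted here.
    """
    batch = select_batch(pool, confidence, batch_size)
    new_train = train + batch
    new_pool = [s for s in pool if s not in batch]
    return new_train, new_pool


def known_word_ratio(train):
    """Toy confidence proxy: share of a sentence's tokens seen in training."""
    vocabulary = {tok for sent in train for tok in sent}
    return lambda sent: sum(tok in vocabulary for tok in sent) / len(sent)


train = [["das", "ist", "jetzt"]]
pool = [["das", "ischsch", "en", "ez"], ["das", "ist"], ["jaa"]]

train, pool = active_learning_round(train, pool, known_word_ratio(train), 1)
print(train[-1])
```

With a real tagger, the confidence function would typically come from the model itself, for example the probability of the best tag sequence.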

Speech-to-text

Acoustic model: $p(\text{transcribed} \mid \text{sound})$

/.../ /.../ /.../ /.../ /.../ /.../ /.../ → das ischsch en ez

Language model: $p(\text{transcribed}_i \mid \text{transcribed}_{i-1})$

das ischsch en ez
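
A minimal sketch of how the two scores might be combined when choosing between candidate transcriptions. It assumes a simple weighted sum of log scores and a word bigram model with a floor value for unseen bigrams. The acoustic scores, the bigram probabilities, the second candidate, and the function names are invented for illustration and do not come from the Spitch prototype or from Kaldi.

```python
import math


def lm_log_prob(words, bigram_probs, floor=1e-4):
    """Sum of log p(w_i | w_{i-1}); unseen bigrams get a small floor value."""
    padded = ["<s>"] + words
    return sum(
        math.log(bigram_probs.get((prev, cur), floor))
        for prev, cur in zip(padded, padded[1:])
    )


def best_hypothesis(hypotheses, bigram_probs, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def score(item):
        words, acoustic_log_prob = item
        return acoustic_log_prob + lm_weight * lm_log_prob(words, bigram_probs)
    return max(hypotheses, key=score)[0]


# Invented numbers for illustration only.
bigram_probs = {
    ("<s>", "das"): 0.2,
    ("das", "ischsch"): 0.3,
    ("ischsch", "en"): 0.1,
    ("en", "ez"): 0.1,
}
hypotheses = [
    (["das", "ischsch", "en", "ez"], -12.0),  # (words, acoustic log score)
    (["das", "isch", "en", "ez"], -11.5),
]
print(" ".join(best_hypothesis(hypotheses, bigram_probs)))
```

Here the second candidate has the slightly better acoustic score, but the language model prefers the first because its word sequence has been seen before.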

Approach

– Improving Spitch prototype with new language models

– Our own speech-to-text development with Kaldi

– Manual transcription

Next steps

– Continue transcription, PoS tagging, normalisation

– Neural transducers (deep learning) for normalisation

– Subword language models for speech-to-text

– New data

Your feedback!