Diachronic Analysis of the Italian Language exploiting Google Ngram

Page 1: Diachronic Analysis of the Italian Language exploiting Google Ngram

Diachronic Analysis of the Italian Language exploiting Google Ngram

Page 2: Diachronic Analysis of the Italian Language exploiting Google Ngram

Hello!

Pierpaolo Basile, Annalina Caputo, Roberta Luisi, Giovanni Semeraro
Department of Computer Science
University of Bari Aldo Moro - Italy

Page 3: Diachronic Analysis of the Italian Language exploiting Google Ngram

Background: TRI

P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning over time. IJCoL vol. 1: Emerging Topics at the First Italian Conference on Computational Linguistics.

[Figure: TRI pipeline, with the labels "Corpus with Temporal Information", "Dictionary/Random Vectors", "Temporal Random Indexing", and one "WordSpace" box per time period]

▪ Several WordSpaces for several time periods

▪ Word vectors are comparable across WordSpaces
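The two points above can be illustrated with a minimal sketch of random indexing over time. It assumes that every term gets one random vector shared by all time periods, and that a word's vector in a period is the sum of the random vectors of its neighbours in that period's text. The toy corpus, the dimension, the window size, and all function names are illustrative assumptions; this is not the code from the authors' repository.

```python
import random
from collections import defaultdict

DIM = 50       # vector dimension (illustrative; real settings are much larger)
NONZERO = 4    # number of +1/-1 entries per random vector (assumed)


def random_vector(rng):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def build_word_spaces(corpus_by_period, window=2, seed=42):
    """Return {period: {word: vector}} built from one shared dictionary."""
    rng = random.Random(seed)
    vocabulary = sorted({w for docs in corpus_by_period.values()
                         for doc in docs for w in doc})
    # One random vector per term, shared by every time period.
    random_vectors = {w: random_vector(rng) for w in vocabulary}

    spaces = {}
    for period, docs in corpus_by_period.items():
        space = defaultdict(lambda: [0] * DIM)
        for doc in docs:
            for i, word in enumerate(doc):
                lo, hi = max(0, i - window), min(len(doc), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        # Add the neighbour's random vector to the word vector.
                        rv = random_vectors[doc[j]]
                        space[word] = [a + b for a, b in zip(space[word], rv)]
        spaces[period] = dict(space)
    return spaces


# Hypothetical toy corpus: tokenised text grouped by period.
corpus = {
    1980: [["surf", "the", "wave"], ["surf", "the", "sea"]],
    2000: [["surf", "the", "web"], ["surf", "the", "internet"]],
}
spaces = build_word_spaces(corpus)
```

Because the random vectors are drawn once and reused in every period, `spaces[1980]["surf"]` and `spaces[2000]["surf"]` live in the same coordinate system and can be compared directly.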

Page 4: Diachronic Analysis of the Italian Language exploiting Google Ngram

Motivation 1: Detect meaning shift

Marty, in 2015 people will surf on the web!!!

Page 5: Diachronic Analysis of the Italian Language exploiting Google Ngram

Motivation 1: Detect meaning shift

Surf!?!?! On the web!?!?!?

Page 6: Diachronic Analysis of the Italian Language exploiting Google Ngram

Motivation 1: Detect meaning shift

Surf!?!?! On the web!?!?!?

surf the Net/Internet: to use the Internet

When was this meaning introduced?

Page 7: Diachronic Analysis of the Italian Language exploiting Google Ngram

Motivation 2: Large corpus

▪ Build a method for computing TRI relying on a very large corpus

▪ Google Ngram for the Italian language
▫ n-grams (up to five) extracted from Google Books
▫ over five million books spanning the years from 1500 to 2012

▪ covers several languages including Italian

Example record:

N-gram                          year  occurrences  books
analysis is often described as  1991  104          5
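As a rough illustration of how such records could be read and grouped into time periods, the sketch below parses the example record above. It assumes a tab-separated layout (n-gram, year, occurrences, books), as in the publicly distributed Google Books Ngram files, and it assumes that the 10-year periods start at years divisible by ten; the function names are hypothetical.

```python
from collections import defaultdict


def parse_ngram_line(line):
    """Split one tab-separated record into (tokens, year, occurrences, books)."""
    ngram, year, occurrences, books = line.rstrip("\n").split("\t")
    return ngram.split(" "), int(year), int(occurrences), int(books)


def period_start(year, span=10):
    """Map a year to the first year of its period, e.g. 1991 -> 1990."""
    return year - year % span


def group_by_period(lines, span=10):
    """Accumulate n-gram occurrence counts per time period."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        tokens, year, occurrences, _books = parse_ngram_line(line)
        counts[period_start(year, span)][tuple(tokens)] += occurrences
    return counts


record = "analysis is often described as\t1991\t104\t5"
tokens, year, occurrences, books = parse_ngram_line(record)
grouped = group_by_period([record])
```

Each period's counts can then serve as the co-occurrence input for building that period's WordSpace.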

Page 8: Diachronic Analysis of the Italian Language exploiting Google Ngram

Methodology

1. Run TRI on the Italian Google Ngram
▫ build a WordSpace for each time period (10 years)

2. Build a time series for each word

3. Search for significant changes in the time series

Page 9: Diachronic Analysis of the Italian Language exploiting Google Ngram

Time Series

Several time series Γ at the time interval k:

▪ log frequency: word frequency in each time period k

▪ point-wise: cosine similarity (cossim) between word vectors across two time periods

▪ cumulative: cosine similarity (cossim) that considers a cumulative vector of the previous k-1 time periods
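The three kinds of series can be sketched as below. This is an illustration under assumptions: the point-wise series is taken between consecutive periods, the cumulative vector is taken to be the plain sum of the earlier vectors, and the log-frequency series uses add-one smoothing on relative frequency. None of these details are stated on the slide, and the function names are hypothetical.

```python
import math


def cossim(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def log_frequency_series(counts, totals):
    """Log relative frequency per period (add-one smoothing is an assumption)."""
    return [math.log((c + 1) / (t + 1)) for c, t in zip(counts, totals)]


def pointwise_series(vectors):
    """Similarity between the word vector in period k-1 and in period k."""
    return [cossim(vectors[k - 1], vectors[k]) for k in range(1, len(vectors))]


def cumulative_series(vectors):
    """Similarity between the sum of all earlier vectors and the vector in period k."""
    series = []
    running = list(vectors[0])
    for k in range(1, len(vectors)):
        series.append(cossim(running, vectors[k]))
        running = [a + b for a, b in zip(running, vectors[k])]
    return series


# Hypothetical word vectors for three consecutive periods.
vectors = [[1, 0], [1, 0], [0, 1]]
point = pointwise_series(vectors)
cumulative = cumulative_series(vectors)
```

In this toy case the word keeps the same vector for two periods and then moves to an orthogonal one, so both series drop from high to low similarity at the last period.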

Page 10: Diachronic Analysis of the Italian Language exploiting Google Ngram

Change point detection

▪ Mean shift of Γ pivoted at time period j

▪ Search for statistically significant mean shifts
▫ bootstrapping approach under the null hypothesis that there is no change in the meaning
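A minimal sketch of this idea follows. It assumes that the mean shift at pivot j is the mean of the series from j onward minus the mean before j, and it approximates the null distribution by randomly shuffling the series. The shuffling scheme, the use of absolute values, the number of samples, and the significance threshold are all assumptions made for illustration, not details taken from the slides.

```python
import random


def mean_shift(series, j):
    """Mean of the values from position j on, minus the mean of the values before j."""
    before, after = series[:j], series[j:]
    return sum(after) / len(after) - sum(before) / len(before)


def detect_change_point(series, samples=1000, alpha=0.05, seed=0):
    """Return (pivot, p_value) for the most significant mean shift, or None."""
    rng = random.Random(seed)
    pivots = range(1, len(series))
    observed = {j: mean_shift(series, j) for j in pivots}
    exceed = {j: 0 for j in pivots}
    for _ in range(samples):
        # Under the null hypothesis the order of the values carries no information.
        shuffled = series[:]
        rng.shuffle(shuffled)
        for j in pivots:
            if abs(mean_shift(shuffled, j)) >= abs(observed[j]):
                exceed[j] += 1
    p_values = {j: exceed[j] / samples for j in pivots}
    best = min(pivots, key=lambda j: p_values[j])
    return (best, p_values[best]) if p_values[best] < alpha else None


# Hypothetical similarity series with a drop starting at position 5.
gamma = [0.90, 0.92, 0.91, 0.90, 0.93, 0.40, 0.42, 0.41, 0.39, 0.40]
result = detect_change_point(gamma)
```

The returned pivot is a position in the series; mapping it back to a year depends on how the time periods were defined.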

Page 11: Diachronic Analysis of the Italian Language exploiting Google Ngram

Evaluation: Dataset

Build a benchmark for meaning shift detection for the Italian language

▪ extract a set of words by pooling the output of several system settings

▪ find the correct change points in a dictionary (Sabatini Coletti / Etimologico Zanichelli)

Page 12: Diachronic Analysis of the Italian Language exploiting Google Ngram

Evaluation: Results

Method      Accuracy
TRIpoint    0.3086
TRIcum      0.2963
TRR1point   0.2716
log freq    0.2346
TRR2point   0.1728
TRR1cum     0.1605
TRR2cum     0.1235

Accuracy: a prediction counts as correct when the year predicted by the system is equal to or greater than the year reported in the gold standard
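The accuracy definition above can be written down as a short sketch. The words and years in the example are invented placeholders, not entries from the benchmark, and counting a word with no prediction as an error is an assumption.

```python
def accuracy(predicted, gold):
    """Fraction of gold-standard words whose predicted year is >= the gold year."""
    correct = sum(
        1
        for word, gold_year in gold.items()
        # A missing prediction is counted as wrong (assumption).
        if predicted.get(word) is not None and predicted[word] >= gold_year
    )
    return correct / len(gold)


# Hypothetical gold standard and system output.
gold = {"word_a": 1950, "word_b": 1980, "word_c": 1900}
predicted = {"word_a": 1960, "word_b": 1970, "word_c": 1900}
score = accuracy(predicted, gold)
```

Here `word_a` and `word_c` are counted as correct and `word_b` is not, because its predicted year precedes the gold-standard year.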

TRR1 and TRR2 are variants of TRI based on Reflective Random Indexing

Page 13: Diachronic Analysis of the Italian Language exploiting Google Ngram

Conclusions and Future Work

▪ TRI with point-wise detection gives the best accuracy among the tested methods (0.3086)
▫ it outperforms the baseline based on log frequency (0.2346)

▪ We provide a benchmark for the evaluation of meaning shifts for the Italian language

▪ Future work: extend the dataset and provide an evaluation for the English language

Page 14: Diachronic Analysis of the Italian Language exploiting Google Ngram

Thanks!!

Any questions?

https://github.com/pippokill/tri