HPC in linguistic research
![Page 2: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/2.jpg)
HPC use in linguistic research
• Linguistic and biological models
• Phylogenies
• Linguistic data
• Models of evolution
• Parallelism
• Scaling
• Results
• Ongoing work
• Key challenges
![Page 3: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/3.jpg)
Linguistic and biological systems

| Attribute | Genetics | Linguistics |
|---|---|---|
| Discrete units | nucleotides, codons, genes, individuals | words, grammar, syntax |
| Replication | transcription | teaching, learning, imitation |
| Dominant mode(s) of inheritance | parent-offspring, clonal | parent-offspring, peer groups, teaching |
| Horizontal transmission | many mechanisms | borrowing |
| Mutation | many mechanisms: SNPs, mobile DNA | mistakes, vowel shifts, innovation |
| Selection | fitness differences among alleles | ? |
![Page 4: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/4.jpg)
Inferring evolutionary histories from linguistic data
• Evolutionary histories, phylogenies
• Tools for understanding evolution
• Depict relationships between languages
• Identify groups which share a common ancestor
• Calculate the timing of events
• Account for lack of independence in the data
• Inferred from data taken from different languages
• Using an explicit statistical model of evolution
• Problem is NP-hard; the number of possible trees grows as a double factorial of the number of languages
• Markov chain Monte Carlo search methods, heuristic search, hill climber
• Product of Data + Model
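To give a feel for the double-factorial growth, the sketch below counts rooted, bifurcating, labelled trees using the standard result (2n − 3)!! for n taxa. The slide does not say whether rooted or unrooted trees are meant, so the rooted count is an assumption, and the function names are illustrative only.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ... (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_tree_count(n_taxa: int) -> int:
    """Number of rooted, bifurcating, labelled trees for n_taxa tips: (2n - 3)!!."""
    if n_taxa < 2:
        return 1
    return double_factorial(2 * n_taxa - 3)


# Tree counts for a few taxon numbers; 4 taxa already give 15 rooted trees.
for n in (4, 10, 16, 87):
    print(n, rooted_tree_count(n))
```

Even at 10 taxa there are more than 34 million candidate trees, which is why exhaustive search is ruled out and sampling or heuristic search is used instead.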
![Page 5: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/5.jpg)
[Figure: language phylogeny with groups labelled Greek, Indo-Iranian, Slavic, Germanic, Celtic and Romance]
![Page 6: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/6.jpg)
The Data
• Swadesh list, Morris Swadesh, 1940 onwards
• 200 meanings, present in (almost) all languages
• Chosen to be stable, slowly evolving and resistant to borrowing
• Somewhat of a language “gene”
![Page 7: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/7.jpg)
![Page 8: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/8.jpg)
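Cognate classes
• Words with a common evolutionary ancestry and meaning

Meaning “fish”, two cognate classes:

| Class “Fish” | Class “Ryba” |
|---|---|
| English: Fish | Czech: Ryba |
| Danish: Fisk | Russian: Ryba |
| Dutch: Visch | Bulgarian: Riba |
| 23 other languages | 34 other languages |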
![Page 9: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/9.jpg)
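Data coding, Cognates
• Cognates: words and meanings that are derived from a common ancestor
• Languages evolve by a process of descent with modification

| Language | “when” | “water” |
|---|---|---|
| English | when | water |
| German | wann | wasser |
| French | quand | eau |
| Italian | quando | acqua |
| Greek | qote | nero |
| Hittite | kuwapi | watar |

Binary coding, one column per cognate class:

| Language | when | water (class 1) | water (class 2) | water (class 3) |
|---|---|---|---|---|
| English | 1 | 1 | 0 | 0 |
| German | 1 | 1 | 0 | 0 |
| French | 1 | 0 | 1 | 0 |
| Italian | 1 | 0 | 1 | 0 |
| Greek | 1 | 0 | 0 | 1 |
| Hittite | 1 | 1 | 0 | 0 |

• “Water”: 3 cognates
• “When”: 1 cognate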
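The coding step can be sketched in a few lines of Python. The class labels A, B and C and the function name are illustrative assumptions; the slide only gives the resulting 0/1 matrix, which this sketch reproduces.

```python
# Cognate class of each word per language. Labels A/B/C are hypothetical names
# for the classes implied by the slide's 0/1 matrix.
cognate_classes = {
    "English": {"when": "A", "water": "A"},
    "German": {"when": "A", "water": "A"},
    "French": {"when": "A", "water": "B"},
    "Italian": {"when": "A", "water": "B"},
    "Greek": {"when": "A", "water": "C"},
    "Hittite": {"when": "A", "water": "A"},
}


def binary_code(classes):
    """Return (columns, matrix): one 0/1 column per (meaning, class) pair."""
    columns = []
    for per_language in classes.values():
        for meaning, label in per_language.items():
            if (meaning, label) not in columns:
                columns.append((meaning, label))  # order of first appearance
    matrix = {
        language: [1 if per_language.get(m) == c else 0 for m, c in columns]
        for language, per_language in classes.items()
    }
    return columns, matrix


columns, matrix = binary_code(cognate_classes)
for language, row in matrix.items():
    print(f"{language:8s}", row)
```

Each column is then treated as one site that is either present (1) or absent (0) in a language.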
![Page 10: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/10.jpg)
Continuous-time Markov Model
[Diagram: two states, 0 (non-cognate) and 1 (cognate), with transitions 0 → 1 at rate Q01 and 1 → 0 at rate Q10]
• Q01: rate at which cognates are gained
• Q10: rate at which cognates are lost
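For reference, a two-state continuous-time Markov chain with these two rates has a closed-form transition matrix. The slide shows only the state diagram, so the notation below (lower-case q for the rates, t for branch length) is an assumption rather than the author's own formula.

$$
Q = \begin{pmatrix} -q_{01} & q_{01} \\ q_{10} & -q_{10} \end{pmatrix},
\qquad
P(t) = e^{Qt} = \frac{1}{q_{01}+q_{10}}
\begin{pmatrix}
q_{10} + q_{01}\,e^{-(q_{01}+q_{10})t} & q_{01}\left(1 - e^{-(q_{01}+q_{10})t}\right) \\
q_{10}\left(1 - e^{-(q_{01}+q_{10})t}\right) & q_{01} + q_{10}\,e^{-(q_{01}+q_{10})t}
\end{pmatrix}
$$

Entry $P_{ij}(t)$ is the probability of ending in state $j$ after time $t$ when starting in state $i$. Each row sums to 1, and as $t \to \infty$ the rows converge to the stationary frequencies $\pi_0 = q_{10}/(q_{01}+q_{10})$ and $\pi_1 = q_{01}/(q_{01}+q_{10})$.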
![Page 11: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/11.jpg)
The Likelihood Model
• Calculates the likelihood of a tree (T), given the data (D) and a model of evolution (M); this is the fitness / evaluation function
• Accounts for > 99% of the run time
• Product over the model: 1–12 categories
• Product over the data: 200–100,000 sites
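A minimal sketch of how such a likelihood can be evaluated is given below. It assumes Felsenstein's pruning algorithm, a single model category, the two-state gain/loss model, and stationary frequencies at the root; the slides do not state which algorithm the author's code uses. The tree, branch lengths and rates are hypothetical values chosen only for illustration.

```python
import math


def transition_matrix(q01, q10, t):
    """Two-state transition probabilities P[i][j] over a branch of length t."""
    total = q01 + q10
    pi0, pi1 = q10 / total, q01 / total
    decay = math.exp(-total * t)
    return [[pi0 + pi1 * decay, pi1 * (1.0 - decay)],
            [pi0 * (1.0 - decay), pi1 + pi0 * decay]]


def conditional(node, data, site, q01, q10):
    """Return [L0, L1]: probability of the tips below `node` given its state."""
    if isinstance(node, str):  # a tip is a language name
        observed = data[node][site]
        return [1.0 if state == observed else 0.0 for state in (0, 1)]
    result = [1.0, 1.0]
    for child, branch_length in node:  # internal node: list of (child, length)
        p = transition_matrix(q01, q10, branch_length)
        below = conditional(child, data, site, q01, q10)
        for state in (0, 1):
            result[state] *= p[state][0] * below[0] + p[state][1] * below[1]
    return result


def log_likelihood(tree, data, q01, q10):
    """Sum of per-site log-likelihoods (the log of the product over sites)."""
    total = q01 + q10
    root_freqs = [q10 / total, q01 / total]  # assumed stationary at the root
    n_sites = len(next(iter(data.values())))
    log_l = 0.0
    for site in range(n_sites):
        cond = conditional(tree, data, site, q01, q10)
        log_l += math.log(root_freqs[0] * cond[0] + root_freqs[1] * cond[1])
    return log_l


# Hypothetical tree ((English, German), French) with arbitrary branch lengths.
tree = [([("English", 0.1), ("German", 0.1)], 0.2), ("French", 0.3)]
data = {"English": [1, 1, 0, 0], "German": [1, 1, 0, 0], "French": [1, 0, 1, 0]}
print(log_likelihood(tree, data, q01=0.5, q10=1.0))
```

The cost of one evaluation grows with the number of sites, the number of tree nodes and the number of model categories, and the search evaluates it for every proposed tree, which is consistent with the likelihood dominating the run time.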
![Page 12: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/12.jpg)
Level of parallelism
• Data: analysis of multiple datasets (3-5)
• Model: test a range of models (10-20)
• Run: stochastic process, multiple runs (5-10)
• Code: an individual run can still take years
The data, model and run levels are trivially parallel.
![Page 13: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/13.jpg)
The problem
• 2003: 16 taxa, 125 sites, 1 x model
• 2005: 87 taxa, 2450 sites, 4 x model
• 2007: 400 taxa, 34,440 sites, 100 x model
• Complexity up 700,000x, 5-6 orders of magnitude
• 4.8 years per run, typically 5 publication quality runs + 10 model tests
• 4.8 years > attention span of academics
• Results are required in days
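The 700,000x figure can be reproduced approximately from the 2003 and 2007 problem sizes, assuming that cost scales linearly with the number of taxa, the number of sites and the model multiplier. That scaling rule is an assumption made here for illustration; the slide does not state how the figure was derived.

```python
import math

# Problem sizes as listed on the slide.
problem_2003 = {"taxa": 16, "sites": 125, "models": 1}
problem_2007 = {"taxa": 400, "sites": 34440, "models": 100}

# Assumed cost model: cost is proportional to taxa * sites * model multiplier.
growth = 1.0
for key in ("taxa", "sites", "models"):
    growth *= problem_2007[key] / problem_2003[key]

print(f"growth factor: {growth:,.0f}")
print(f"orders of magnitude: {math.log10(growth):.2f}")
```

Under that assumption the growth factor is 25 x 275.52 x 100, about 689,000, which sits between five and six orders of magnitude.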
![Page 14: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/14.jpg)
Parallel method 1: Distribute the data (MPI)
[Diagram: the 0/1 data matrix (languages × cognates) is split by cognate into blocks, one block each for Core 1, Core 2 and Core 3]
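The idea of distributing the data can be sketched without MPI: sites are split into blocks, each worker scores its own block, and the partial sums are added. In the sketch below a thread pool stands in for MPI ranks, so it shows the decomposition rather than a real speedup, and `site_log_likelihood` is a deliberately simple placeholder (an independent Bernoulli score), not a tree likelihood. All names and values are assumptions.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def split_sites(n_sites, n_workers):
    """Split site indices 0..n_sites-1 into contiguous, near-equal blocks."""
    base, extra = divmod(n_sites, n_workers)
    blocks, start = [], 0
    for rank in range(n_workers):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def site_log_likelihood(column, p_present=0.3):
    """Placeholder per-site score (independent Bernoulli), NOT a tree likelihood."""
    return sum(math.log(p_present if x == 1 else 1.0 - p_present) for x in column)


def block_log_likelihood(columns, block):
    """Partial sum computed by one worker over its own block of sites."""
    return sum(site_log_likelihood(columns[i]) for i in block)


# Synthetic data: 20 sites (cognate columns) for 6 languages.
rng = random.Random(0)
columns = [[rng.randint(0, 1) for _ in range(6)] for _ in range(20)]

blocks = split_sites(len(columns), n_workers=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(lambda b: block_log_likelihood(columns, b), blocks))

# The reduction step: add the per-worker partial sums.
distributed_total = sum(partial_sums)
serial_total = sum(site_log_likelihood(c) for c in columns)
print(distributed_total, serial_total)
```

Because the log-likelihood is a sum over sites, the only communication needed per evaluation is the final reduction of one number per worker.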
![Page 15: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/15.jpg)
Parallel method 2: Distribute the model (OpenMP)
[Diagram: each core makes one pass over the full data: Core 1 runs Pass 1, Core 2 runs Pass 2, Core 3 runs Pass 3, Core 4 runs Pass 4]
![Page 16: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/16.jpg)
Distribute the data and the model (MPI + OpenMP)
[Diagram: each pass over the data is shared by a pair of cores: Pass 1 on Cores 1-2, Pass 2 on Cores 3-4, Pass 3 on Cores 5-6, Pass 4 on Cores 7-8]
![Page 17: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/17.jpg)
[Figure: run time in seconds (log10) against number of cores]
![Page 18: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/18.jpg)
[Figure: parallel efficiency against number of cores]
![Page 19: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/19.jpg)
Results
• Runtime reduced from 4.8 years to:

| Cores | Days |
|---|---|
| 60 | 31.5 |
| 150 | 14.5 |
| 300 | 8.5 |
| 600 | 6 |

• Good scaling, but not sustainable
• HPC has allowed for the accurate analysis of large complex data sets with statistically justifiable models.
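The reported timings can be turned into relative speedup and parallel efficiency. The sketch below uses the 60-core run as the baseline, which is a choice made here because the slide gives no single-core timing under the same conditions.

```python
# Core counts and run times in days, as reported on the slide.
timings_days = {60: 31.5, 150: 14.5, 300: 8.5, 600: 6.0}

# Assumption: measure everything relative to the smallest core count listed.
base_cores = min(timings_days)
base_days = timings_days[base_cores]

efficiency = {}
for cores, days in sorted(timings_days.items()):
    speedup = base_days / days
    efficiency[cores] = speedup / (cores / base_cores)
    print(f"{cores:4d} cores: speedup {speedup:.2f}x, efficiency {efficiency[cores]:.2f}")
```

Relative efficiency falls from 1.00 at 60 cores to roughly 0.87, 0.74 and 0.53 at 150, 300 and 600 cores, which fits the remark that the scaling is good but not sustainable.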
![Page 20: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/20.jpg)
Current work
• Phoneme data
• Modelling sound utterances
• Better resolution than cognacy data
• Relevant linguistic patterns are emerging
• 120 phonemes, 2 cognacy judgments
• Another 3 orders of magnitude in complexity
• Accelerator implementation: CUDA / OpenCL

| Language | Word | Cognacy | Phoneme |
|---|---|---|---|
| English | Fish | 1 | Fish |
| Danish | Fisk | 1 | Fisk |
![Page 21: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/21.jpg)
Scalable computing
• Last 10 years: 5-6 orders of magnitude increase in complexity
• Reasonably scalable code; redesign needed
• Need to change the how, not the what
• What: statistical framework, realistic models
• How: algorithm, language, parallelisation method, hardware
• Scalable algorithms
![Page 22: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/22.jpg)
• Burn-in: serial
• Convergence: parallel
![Page 23: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/23.jpg)
Parallel sampling using multiple chains
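A toy illustration of the idea: several independent Metropolis chains each go through their own burn-in, and the samples collected after burn-in are pooled. The target (a standard normal), the proposal, the seeds and the chain lengths are all assumptions chosen to keep the sketch small; a real analysis samples trees and model parameters. The chains are run one after another here, but since they share no state each could be placed on its own core.

```python
import math
import random


def log_target(x):
    """Unnormalised log-density of a standard normal (toy stand-in target)."""
    return -0.5 * x * x


def run_chain(seed, burn_in, n_samples, step=1.0):
    """Run one Metropolis chain and return the samples kept after burn-in."""
    rng = random.Random(seed)
    x = rng.uniform(-5.0, 5.0)  # dispersed starting point
    samples = []
    for i in range(burn_in + n_samples):
        proposal = x + rng.uniform(-step, step)  # symmetric proposal
        log_ratio = log_target(proposal) - log_target(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if i >= burn_in:  # discard the serial burn-in phase
            samples.append(x)
    return samples


# Four independent chains; every chain pays its own burn-in cost.
chains = [run_chain(seed, burn_in=500, n_samples=2000) for seed in range(4)]
pooled = [x for chain in chains for x in chain]
pooled_mean = sum(pooled) / len(pooled)
print(len(pooled), pooled_mean)
```

Adding chains multiplies the number of post-convergence samples per unit of wall-clock time, but it does not shorten the burn-in of any single chain.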
![Page 24: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/24.jpg)
Key challenges
• Computing is a rate-limiting step
• Treading water / drowning
• Widening gap between computing power and the complexity of data and models
• Data set size and model complexity are restricted
• 20-30 year old methods, which are less accurate and non-statistical, are returning
• Connecting researchers with results, not HPC
• HPC is a nuisance in science
• Steep learning curve
• High cost: hardware, running costs and personnel
• Access and flexibility
• Not a one-off activity: thousands of data sets are produced each year, 3000+ published in 2011
![Page 25: HPC in linguistic research](https://reader036.fdocuments.net/reader036/viewer/2022070500/5681682a550346895dddbe3b/html5/thumbnails/25.jpg)
Acknowledgments
Mark Pagel