NLP for Social Media - Indian Institute of Technology...
![Page 1: NLP for Social Media - Indian Institute of Technology Kharagpurcse.iitkgp.ac.in/~pawang/courses/SC15/Lecture6_mc.pdf · LDC-Words 0.62 0.78 0.71 0.94. The Outliers! Speaker Web Wiki](https://reader030.fdocuments.net/reader030/viewer/2022040904/5e780ad96fa34268e44b52ba/html5/thumbnails/1.jpg)
NLP for Social Media
Lecture 6: Processing Multilingual Content
Monojit Choudhury
Microsoft Research Lab
Distribution of Languages over WWW
[Figure: x-axis: Fraction of Degree k (#page/total page); y-axis: Pk (Fraction of languages having at least k pages); annotation: "= 5e-4"]
Languages labelled in the figure: English, German, French, Russian, Japanese, Chinese, Spanish, Italian, Korean, Dutch, Portuguese, Czech, Swedish, Polish, Danish, Catalan, Norwegian, Hungarian, Finnish, Slovak, Turkish
How representative is WWW
| Correlations | Speaker | Web | Wiki | LDC-Items |
| --- | --- | --- | --- | --- |
| Web | 0.29 | | | |
| Wiki | 0.36 | 0.92 | | |
| LDC-Items | 0.52 | 0.94 | 0.85 | |
| LDC-Words | 0.62 | 0.78 | 0.71 | 0.94 |
The Outliers!
Languages listed under each column, in slide order:
Speaker: Mandarin, Spanish, English, Arabic, Hindi, Portuguese, Bengali, Russian, Japanese, German, Wu, Javanese, Telugu, Marathi, Vietnamese, Korean, Tamil
Web: English, German, French, Russian, Japanese, Chinese, Spanish, Italian, Korean, Dutch, Portuguese, Czech, Swedish, Polish, Danish, Catalan, Norwegian
Wiki: English, German, French, Polish, Japanese, Dutch, Italian, Portuguese, Spanish, Swedish, Russian, Chinese, Norwegian, Finnish, Turkish, Esperanto, Romanian
LDC-I: English, Arabic, Chinese, Spanish, Japanese, Korean, German, French, Czech, Portuguese, Hindi, Tamil, Farsi, Russian, Vietnamese, Dutch, Italian
LDC-W: English, Arabic, Chinese, Spanish, German, French, Japanese, Korean, Portuguese, Hindi, Urdu, Bengali, Italian, Greek, Swedish, Norwegian, Russian
What about Social Media?
http://www.ciklopea.com/en/blog/consulting/the-distribution-of-languages-on-twitter/531/
Distribution of Languages on Twitter in 2013
Questions to ponder
• What factors determine the distribution of languages on a social network?
• How do we compute or estimate this distribution?
• What technologies, if any, are needed to make an OSN accessible to the speakers of a language?
• What technologies are needed to support and encourage multilingualism on an OSN?
• What can we learn about multilingualism from OSNs?
Agenda
• Language Detection
• Processing Code-switched text
• Some interesting stats
Scope of Language Identification
[Figure: grid with two axes]
Granularity: Document, Sentence, Word, Morpheme
Classes: L1 or Not-L1; L1 or L2; Universal
Utvald artikkel er ein bolk på som vert oppdatert ein gong i veka med
dei første avsnitta frå ein utvald artikkel i lag med eit bilete.
آفتابگردان
Kalla varthakal kallan mar asianetprajaranamAmarthi amarthi nte
phoninte display poi
Scope of Language Identification
CLD2, Linguini, Polyglot (w/o transliteration)
Doc-level Language Detection
Each Document is Monolingual
Documents can be multilingual
Lui, Lao and Baldwin (2014), Transactions of ACL
A brief History of Doc-level LI
1994: William B Cavnar, John M Trenkle, et al. N-gram-based text categorization. Ann Arbor MI, 48113(2):161–175.
1999: John M Prager. Linguini: Language identification for multilingual documents. In Systems Sciences, 1999. HICSS-32. Proceedings of the 32nd Annual Hawaii International Conference.
2005: P. McNamee. Language identification: A solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges, 20(3).
2011: Erik Tromp and Mykola Pechenizkiy. Graph-based n-gram language identification on short texts. In Proc. 20th Machine Learning conference of Belgium and The Netherlands, pages 27–34.
2012: Marco Lui and Timothy Baldwin. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30.
2013: Moises Goldszmidt, Marc Najork, and Stelios Paparizos. Boot-strapping language identifiers for short colloquial postings. In Proc. of ECMLPKDD.
LI for Web Documents (for Information Retrieval/Web Search)
LI for short and noisy text (Twitter & other user generated content)
Doc-level LI: Approaches
Unicode Block
◦ Idea: Different languages use different scripts
Latin script: English, French, German, Spanish, Portuguese, Swedish, Vietnamese, Tagalog, Malay, …
Cyrillic script: Russian, Bulgarian, Belorussian, Abkhasian, Serbian
How many languages use the Devanagari script?
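A minimal sketch of the Unicode-block idea, assuming Python's standard `unicodedata` module and using the first word of each character's Unicode name as a rough stand-in for its script. The helper names `script_of` and `dominant_script` are illustrative and not from the lecture. The sketch also shows the limitation raised by the question above: the script narrows the candidates but cannot separate languages that share it.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # skip digits, punctuation, spaces and combining marks
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def dominant_script(text):
    """Return the most frequent script among the letters of `text`, or None."""
    counts = Counter(s for s in map(script_of, text) if s)
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("Modi ke speech se India"))  # LATIN
print(dominant_script("के से हो गया"))              # DEVANAGARI
```

Both Hindi and Marathi text would come back as `DEVANAGARI` here, so a script check is best treated as a first filter before one of the other approaches.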
Doc-level LI: Approaches
Unicode Block
◦ Idea: Different languages use different scripts
Dictionary based
◦ Compute the intersection with each language's lexicon. Declare the highest-matching lexicon as the winner.
◦ Issues: Resource intensive; coverage; short text
N-gram based techniques
Which of these is Sanskrit? kshiprata, altakmbil
Character n-gram based word classifiers
Task:
Input: A word w
Output: Yes (if w belongs to L1) or No (otherwise)
Features: character n-grams (n = 2 to 5)
Classifier: Naïve Bayes*, Max-Ent, SVMs
Data:
◦ Positive examples: words of L1
◦ Negative examples: words from other languages
Output: prob or score of w being L1
Prob(kshiprata is Sanskrit) >> Prob(altakmbil is Sanskrit)
Example: kshiprata → $kshiprata$
2: $k, ks, sh, hi, ip, pr, ra, at, ta, a$
3: $ks, ksh, shi, hip, ipr, …, ta$
4: $ksh, kshi, ship, …, ata$
5: $kshi, kship, shipr, …, rata$
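The feature extraction and classifier described above can be sketched as follows. This is a toy illustration only: the add-one-smoothed Naïve Bayes class, its names, and the two tiny word lists are assumptions made for the example, not the lecture's actual model or data.

```python
import math
from collections import Counter


def char_ngrams(word, n_min=2, n_max=5):
    """Character n-grams of `word` padded with '$' boundary markers."""
    padded = f"${word}$"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = {}         # label -> Counter of n-gram counts
        self.totals = {}         # label -> total number of n-grams seen
        self.priors = Counter()  # label -> number of training words
        self.vocab = set()

    def fit(self, words, labels):
        for word, label in zip(words, labels):
            grams = char_ngrams(word)
            self.counts.setdefault(label, Counter()).update(grams)
            self.priors[label] += 1
            self.vocab.update(grams)
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def log_score(self, word, label):
        """log P(label) + sum of log P(n-gram | label) over the word's n-grams."""
        v = len(self.vocab)
        score = math.log(self.priors[label] / sum(self.priors.values()))
        for g in char_ngrams(word):
            score += math.log((self.counts[label][g] + self.alpha)
                              / (self.totals[label] + self.alpha * v))
        return score

    def predict(self, word):
        return max(self.counts, key=lambda y: self.log_score(word, y))


# Hypothetical toy data: positive examples for L1, negatives from elsewhere.
l1_words = ["kshama", "prakriti", "shakti", "kshetra", "pratap"]
other_words = ["the", "table", "milk", "block", "camel"]
model = NgramNaiveBayes().fit(l1_words + other_words,
                              ["L1"] * len(l1_words) + ["other"] * len(other_words))
print(model.predict("kshiprata"))
```

With realistic word lists the same structure applies; only the training data and, possibly, the choice of classifier (Max-Ent, SVM) would change.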
Doc-level LI: Approaches
Unicode Block
◦ Idea: Different languages use different scripts
Dictionary based
◦ Compute the intersection with each language's lexicon. Declare the highest-matching lexicon as the winner.
◦ Issues: Resource intensive; coverage; short text
N-gram based techniques
◦ Robust, easy to build, can be bootstrapped
◦ Issues: very short text, very noisy text
Other Features:
◦ Meta-data of a webpage
◦ User Info (in Twitter/social media)
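The dictionary-based approach listed above can be sketched in a few lines. The function name `dictionary_language_id` and the tiny lexicons are hypothetical, chosen only to make the example self-contained; real lexicons would be far larger, which is the resource cost the slide mentions.

```python
def dictionary_language_id(text, lexicons):
    """Pick the language whose lexicon overlaps most with the text's tokens.

    `lexicons` maps a language code to a set of lowercase words.
    Returns (best_language_or_None, overlap_counts).
    """
    tokens = set(text.lower().split())
    overlaps = {lang: len(tokens & lexicon) for lang, lexicon in lexicons.items()}
    best = max(overlaps, key=overlaps.get)
    return (best, overlaps) if overlaps[best] > 0 else (None, overlaps)


# Hypothetical toy lexicons.
lexicons = {
    "en": {"the", "was", "movie", "and"},
    "es": {"la", "es", "una", "pelicula"},
}
print(dictionary_language_id("the movie was amazing", lexicons))
print(dictionary_language_id("xyz", lexicons))  # no overlap at all
```

The second call illustrates the coverage and short-text issues: when no token is in any lexicon, the method has nothing to decide on.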
Some off-the-shelf Tools
| Tool | Reference | #Lang | Approach | Features | Type |
| --- | --- | --- | --- | --- | --- |
| linguini | Prager, 1999 | | Vector-space model | 2-5 Byte n-grams | Multi |
| polyglot | Lui and Baldwin, 2011/14 | 44 | Generative mixture model | Byte n-grams | Multi |
| langid.py | Lui and Baldwin, 2012 | 97 | Naïve Bayes Classifier | 1,2,3,4 Byte-gram | Multi |
| CLD2 | Google, 2013 | 83 | Naïve Bayes Classifier | character 4-grams | Mono |
Scope of Language Identification
EMNLP CS Workshop & FIRE Shared Tasks
CLD, Linguini, Polyglot (w/o transliteration)
Word-level Language Labeling: Problem Definition
| Token | Modi | ke | speech | se | India | inspired | ho | gaya | #namo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Label | NE | Hn | En | Hn | NE | En | Hn | Hn | Other |
Hindi tokens in Devanagari: के (ke), से (se), हो (ho), गया (gaya)
Other Labels:
• Mix: Part L1, part L2 (e.g., artiston, nachoing)
• Ambiguous: can be either language (e.g., computer, vote, football)
A Brief History of Word-level Language Labeling
2008: T Solorio and Y. Liu. Parts-of-speech tagging for English-Spanish code-switched text. In Proceedings of the Empirical Methods in natural Language Processing.
2013: Ben King and Steven Abney. Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of NAACL-HLT, pages 1110–1119.
2013: Rishiraj Saha Roy, Monojit Choudhury, Prasenjit Majumder, and Komal Agarwal. Overview and datasets of FIRE 2013 track on Transliterated Search. In FIRE Working Notes.
2014: Monojit Choudhury, Gokul Chittaranjan, Parth Gupta and Amitava Das. Overview FIRE 2014 track on Transliterated Search. In FIRE Working Notes.
2014: Thamar Solorio et al. Overview for the First Shared Task on Language Identification in Code-Switched Data.
2014: Utsab Barman, Amitava Das, Joachim Wagner and Jennifer Foster. Code Mixing: A Challenge for Language Identification in the Language of Social Media. 1st Workshop on Code-switching, EMNLP’14
Modeling as a Structured Prediction Problem
Given X: X1 = Modi, X2 = ke, …
Output: Y = Y1 (label for X1), Y2 (label for X2), …
such that p(Y|X) is maximized
Hidden Markov Models, Conditional Random Fields, …
Features
Training & Test Data
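To make the "maximize p(Y|X)" step concrete, here is a minimal Viterbi decoder for a two-label HMM. For an HMM, maximizing the joint p(X, Y) over Y gives the same label sequence as maximizing p(Y|X). The probabilities and lexicons below are made-up illustrative values, not parameters estimated from any dataset.

```python
import math


def viterbi(tokens, labels, log_start, log_trans, log_emit):
    """Return the label sequence with the highest HMM log-score for `tokens`."""
    if not tokens:
        return []
    # best[i][y]: best log-score of any label path ending in y at position i
    best = [{y: log_start[y] + log_emit(y, tokens[0]) for y in labels}]
    back = []  # back[i][y]: best previous label when position i+1 has label y
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + log_trans[p][y])
            scores[y] = best[-1][prev] + log_trans[prev][y] + log_emit(y, tok)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical parameters: labels tend to persist, lexicon words are likely.
labels = ["Hn", "En"]
lexicon = {"Hn": {"ke", "se", "ho", "gaya"}, "En": {"speech", "inspired"}}
log_start = {y: math.log(0.5) for y in labels}
log_trans = {p: {y: math.log(0.7 if p == y else 0.3) for y in labels} for p in labels}


def log_emit(label, token):
    return math.log(0.9 if token in lexicon[label] else 0.1)


tokens = ["ke", "speech", "se", "inspired", "ho", "gaya"]
print(viterbi(tokens, labels, log_start, log_trans, log_emit))
```

A CRF would replace the start, transition, and emission scores with learned feature weights, but decoding would follow the same dynamic program.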
Features
Token-based features
• Capitalization
• Script
• Special Characters
• Character n-gram based classifiers
• Word length
Lexical Features
• Regular lexicon
• Unigram Frequency
• Entity Lexicon
• Acronym/slang lexicon
Context Features
• Next 3 tokens
• Last 3 tokens
• Current token
• Previous label (Bigram or B)
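A sketch of how several of the features listed above might be computed for one token, in the dictionary-of-features form that sequence labelers commonly accept. The function name, feature keys, and lexicons are assumptions for illustration. The character n-gram classifier score, unigram frequency, and entity/slang lexicons are left out for brevity, and the previous-label feature is normally supplied by the sequence model itself.

```python
import unicodedata


def token_features(tokens, i, en_lexicon=frozenset(), hn_lexicon=frozenset()):
    """Token-based, lexical, and context features for tokens[i]."""
    tok = tokens[i]
    first_letter = next((ch for ch in tok if ch.isalpha()), None)
    feats = {
        # token-based features
        "token": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_special_char": any(not ch.isalnum() for ch in tok),
        "length": len(tok),
        "script": (unicodedata.name(first_letter, "UNKNOWN").split(" ")[0]
                   if first_letter else "NONE"),
        # lexical features (membership in hypothetical word lists)
        "in_en_lexicon": tok.lower() in en_lexicon,
        "in_hn_lexicon": tok.lower() in hn_lexicon,
    }
    # context features: previous and next three tokens, padded at the edges
    for k in range(1, 4):
        feats[f"prev_{k}"] = tokens[i - k].lower() if i - k >= 0 else "<s>"
        feats[f"next_{k}"] = tokens[i + k].lower() if i + k < len(tokens) else "</s>"
    return feats


sentence = "Modi ke speech se India inspired ho gaya #namo".split()
en_words = {"speech", "inspired"}
hn_words = {"ke", "se", "ho", "gaya"}
print(token_features(sentence, 0, en_words, hn_words))
```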
Datasets & Metrics
Shared Task in Code-switching Workshop @ EMNLP
Metrics:
• Word-level labeling accuracy
• Word-level class-wise Precision, Recall and F-score
• Tweet (doc) level accuracy
• Tweet (doc) level CS Precision, Recall and F-score
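The word-level and tweet-level metrics above can be sketched as below. One assumption is made explicit here: a tweet is counted as code-switched (CS) when its labels include both language labels, which may differ from the shared task's exact definition. The label names and toy data are illustrative.

```python
from collections import Counter


def flatten(seqs):
    return [x for seq in seqs for x in seq]


def word_accuracy(gold, pred):
    """Fraction of tokens whose predicted label matches the gold label."""
    g, p = flatten(gold), flatten(pred)
    return sum(a == b for a, b in zip(g, p)) / len(g)


def classwise_prf(gold, pred):
    """Per-label (precision, recall, F-score) at the word level."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(flatten(gold), flatten(pred)):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    result = {}
    for label in set(tp) | set(fp) | set(fn):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        result[label] = (prec, rec, f)
    return result


def is_code_switched(labels, lang_labels=("Hn", "En")):
    """Assumption: a tweet is CS if it contains tokens of both languages."""
    return len(set(labels) & set(lang_labels)) >= 2


def tweet_level_accuracy(gold, pred):
    """Fraction of tweets whose CS / not-CS decision matches the gold one."""
    hits = sum(is_code_switched(g) == is_code_switched(p) for g, p in zip(gold, pred))
    return hits / len(gold)


# Toy example: two tweets, one mislabeled token in the first.
gold = [["NE", "Hn", "En", "Hn"], ["En", "En"]]
pred = [["NE", "Hn", "Hn", "Hn"], ["En", "En"]]
print(word_accuracy(gold, pred), tweet_level_accuracy(gold, pred))
```

The toy example shows why both levels are reported: a single wrong token barely moves word-level accuracy but flips the tweet-level CS decision.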
Performance
Shared Task in Code-switching Workshop @ EMNLP
[Bar chart, y-axis 0 to 100. Language pairs: En-Es, En-Ne, En-Cn, Ar-Ar. Series: LA, Tweet F-score, Dict-baseline]
Pain points
English-Spanish class wise F-score
[Bar chart, y-axis 0 to 100. Categories: Token Level Accuracy, Lang1 F-score, Lang2 F-score, NE F-score, Other F-score, Tweet Level. Series: Dev, Test, Surprise. Data labels as extracted: 95, 85, 85.6, 87.9, 15.6, 85.6, 82; 91.8, 84.5, 86.4, 14.8, 83.7, 81.5]
Agenda
Language Detection
Processing Code-switched text
Some interesting stats
Did you like Interstellar?
Spanglish: Interstellar es una amazing movie.
Chinese: Interstellar 是了不起的电影。
星际 es una 了不起的电影。
How does Skype Translator work?
Skype Translator pipeline: English Speech → ASR → English Text → SMT → Chinese Text → TTS → Chinese Speech
Resources and the models they train:
• English Speech Data (1000+ hours) → Phone model
• English Text Data (10^11 words) → Language model
• En – Cn parallel data (10^7 sentences) → Translation Model
• English Tree bank (10^6 trees) → Parser
• English POS label data (10^7 words) → POS tagger
English … …
For Skyping in Spanglish…
Skype Translator pipeline: Spanglish Speech → ASR → Spanglish Text → SMT → Chinese Text → TTS → Chinese Speech
Resources and the models they train:
• Spanglish Speech Data (1000+ hours) → Phone model
• Spanglish Text Data (10^11 words) → Language model
• SE – Cn parallel data (10^7 sentences) → Translation Model
• Spanglish Tree bank (10^6 trees) → Parser
• Spanglish POS label data (10^7 words) → POS tagger
Spanglish … …
There are at least 300 commonly spoken code-mixed tongues!
For Skyping in Spanglish…
Monolingual models:
• Es-Cn Trans. Model, Es. Parser, POS tagger, Es Language Model
• En-Cn Trans. Model, En Parser, En POS tagger, En Language Model
Spanglish resources and the models they train:
• SE – Cn parallel data (10^4 sentences) → Spanglish Trans. Model
• Spanglish Tree bank (10^3 trees) → Spanglish Parser
• Spanglish POS label data (10^4 words) → Spanglish POS tagger
• Spanglish text (10^6 words) → Spanglish LM
SE – Cn parallel data (10^7 sentences)
SE Tree bank (10^6 trees)
SE POS label data (10^7 words)
English …
State-of-the-art
Monolingual models:
• Es-Cn Trans. Model, Es. Parser, POS tagger, Es Language Model
• En-Cn Trans. Model, En Parser, En POS tagger, En Language Model
Code-mixed (CM) resources and the models they train:
• CM parallel data (10^4 sentences) → Trans. Model for CM
• CM Tree bank (10^3 trees) → Parser for CM
• CM POS label data (10^4 words) → POS tagger for CM
• CM text (10^6 words) → Language Model for CM
Language Detection
POS Tagging
| Token | Modi | ke | speech | se | India | inspired | ho | gaya | #namo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Language | NE | Hn | En | Hn | NE | En | Hn | Hn | Other |
| POS | NP | ADP | NN | ADP | NP | VB | VB | VB | X |
Hindi tokens in Devanagari: के (ke), से (se), हो (ho), गया (gaya)
T Solorio and Y. Liu. 2008. Parts-of-speech tagging for English-Spanish code-switched text. In Proceedings of the Empirical Methods in natural Language Processing.
1. Tag the whole sentence using L1 tagger [L1 POS annotated data]
2. Tag the whole sentence using L2 tagger [L2 POS annotated data]
3. Use the L1 tag and L2 tag as features (plus more) and learn to predict the POS tag for CM text [CM annotated data]
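The three steps above can be sketched as a stacking setup. Everything in this block is a stand-in: the lookup "taggers", the `TagCombiner` class (which just memorizes the most frequent gold tag for each pair of L1/L2 tags), and the toy English-Hindi words are assumptions for illustration. The paper trains a proper machine-learned classifier on richer features.

```python
from collections import Counter, defaultdict


def make_lookup_tagger(lexicon):
    """Stand-in for a monolingual POS tagger: dictionary lookup, else 'UNK'."""
    return lambda tokens: [lexicon.get(t.lower(), "UNK") for t in tokens]


def stacked_features(tokens, tag_l1, tag_l2):
    """Steps 1 and 2: tag the whole sentence with both monolingual taggers."""
    return list(zip((t.lower() for t in tokens), tag_l1(tokens), tag_l2(tokens)))


class TagCombiner:
    """Step 3 (toy learner): most frequent gold tag per (L1 tag, L2 tag) pair."""

    def fit(self, feature_rows, gold_tags):
        votes = defaultdict(Counter)
        for (_, a, b), gold in zip(feature_rows, gold_tags):
            votes[(a, b)][gold] += 1
        self.table = {pair: c.most_common(1)[0][0] for pair, c in votes.items()}
        return self

    def predict(self, feature_rows):
        # Unseen tag pair: fall back to whichever tagger gave a known tag.
        return [self.table.get((a, b), a if a != "UNK" else b)
                for _, a, b in feature_rows]


# Hypothetical monolingual resources and a tiny code-mixed training sample.
en_tagger = make_lookup_tagger({"speech": "NN", "inspired": "VB"})
hn_tagger = make_lookup_tagger({"ke": "ADP", "se": "ADP", "ho": "VB", "gaya": "VB"})
train_rows = stacked_features(["ke", "speech", "se"], en_tagger, hn_tagger)
combiner = TagCombiner().fit(train_rows, ["ADP", "NN", "ADP"])

test_rows = stacked_features(["speech", "se", "inspired", "ho"], en_tagger, hn_tagger)
print(combiner.predict(test_rows))
```

The design point carried over from the slide is that the monolingual taggers are reused as-is, so only the small combining step needs code-mixed annotated data.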
En-Es Results: Heuristic based combinations
En-Es Results: Machine Learning Techniques
Features
POS-tagged CM data requirement
English data: Penn Treebank (97%), 5 Million words
Spanish data: CRATER
CM data: 8000 words
Some experiments with Hindi
[Bar chart, y-axis 40 to 90. x-axis labels: Hi, En, LID+POS, LID*+POS, POS+ML with LID as feature]
Vyas et al. POS Tagging of English-Hindi Code-Mixed Social Media Content. EMNLP 2014
English data: CMU ARK Tagger (95%)
Hindi data: SNLTR/MSR tagger (100k; 90%)
CM data: 4000 words
Agenda
Language Detection
Processing Code-switched text
Some interesting stats
Script Distribution of FB Posts
Code-Switching Stats on FB
In the 4 public forums studied:
◦ All threads are multilingual
◦ 17.2% of the comments/posts have code-switching or mixing
◦ 4.2% have code-switching
◦ 23.7% of Romanized Hindi posts have at least one English embedding
◦ 7.2% of the English posts have at least one Hindi embedding