Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India.

Search in Transliterated Space

Shared Task Proposal, FIRE 2012

Monojit Choudhury, Microsoft Research Lab India

A Transliterated World Wide Web

Song Lyrics

Reviews and Forums

Facebook and Twitter

And a lot more

Beyond Indic languages

Many languages that use a non-Roman script:

Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)

Persian

Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)

Thai, Vietnamese

Cyrillic (Russian, Ukrainian)

Chinese, Japanese, Korean (rare)

Aspects of Transliterated Text

Code Mixing

Transliteration

Errors, Contraction

IR Scenario - I

Mono-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni suhanee

Results: Only Roman transliterated documents

Challenge: Spelling variations, e.g. tandee hawa ye chandny soohaany
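
To make the spelling-variation challenge concrete, here is a minimal Python sketch that maps each Roman-transliterated word to a rough spelling-insensitive key, so that the two spellings of the query on this slide collide. The function names and the handful of rewrite rules are assumptions made for illustration only; they were chosen to cover this one example and are not the method proposed in the talk.

```python
import re


def normalize(word: str) -> str:
    """Map a Roman-transliterated word to a rough, spelling-insensitive key.

    The rules below are illustrative assumptions, not a validated scheme.
    """
    w = word.lower()
    w = w.replace("w", "v")                      # hawa / hava
    w = re.sub(r"(?<=[tdbgkpj])h", "", w)        # thandee / tandee
    w = w.replace("ee", "i").replace("oo", "u")  # long vowels written as digraphs
    w = re.sub(r"(.)\1+", r"\1", w)              # collapse doubled letters (aa -> a)
    w = re.sub(r"h$", "", w)                     # yeh / ye
    w = re.sub(r"y$", "i", w)                    # chandny / chandni
    return w


def query_key(query: str) -> tuple:
    """Normalize every whitespace-separated token of a query."""
    return tuple(normalize(token) for token in query.split())


if __name__ == "__main__":
    print(query_key("thandee hava yeh chandni suhanee"))
    print(query_key("tandee hawa ye chandny soohaany"))
```

Hand-written rules like these over-merge and under-merge quickly on real data; a practical system would more likely learn the equivalences from aligned word pairs.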

IR Scenario - II

Cross-script and Multi-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी

Results: Both Roman transliterated and native script documents

Challenge: Transliteration

IR Scenario - III

Cross-script and Cross-lingual IR

Query: death of mareech and subahoo

Document: Hindi (Transliterated and Devanagari) and English documents

Shared Task on Retrieval

Mono-script Monolingual IR: transliterated query in Roman; transliterated documents in Roman

Cross-script Monolingual IR: transliterated query in Roman; transliterated documents in native script

Multi-script Monolingual IR: query in Roman or native script; documents in Roman and native scripts

Shared Sub-Tasks

Language identification of transliterated queries, documents, code-mixed text

kooda kazhikkan oru urgan split pea soup undaki
ML    ML        ML  ML    EN    EN  EN   ML

Transliteration

Forward: കഴിക്കാൻ → kazhikkan

Backward: kazhikkan → കഴിക്കാൻ
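
For the language identification sub-task, the word-level labeling shown on this slide can be sketched with a simple lexicon lookup. The word list `ENGLISH_WORDS`, the function `tag_tokens`, and the fallback to the tag `ML` are all hypothetical, chosen only to reproduce the example above; a real system would need a much larger lexicon or a trained classifier.

```python
# Hypothetical, tiny English word list used only for this illustration.
ENGLISH_WORDS = {"split", "pea", "soup", "the", "and", "of"}


def tag_tokens(text: str, default_lang: str = "ML") -> list:
    """Tag each token EN if it is in the English list, else default_lang."""
    tagged = []
    for token in text.split():
        lang = "EN" if token.lower() in ENGLISH_WORDS else default_lang
        tagged.append((token, lang))
    return tagged


if __name__ == "__main__":
    sentence = "kooda kazhikkan oru urgan split pea soup undaki"
    print(" ".join(f"{tok}/{lang}" for tok, lang in tag_tokens(sentence)))
```

Pure lookup breaks down on words that are valid in both languages once transliterated, which is one reason the sub-task is non-trivial.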

Available Data

20000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)

35000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics

More data is under preparation from Facebook, covering a mixture of various languages.

Looking for partners to extend!

Available Data

Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics

Looking for partners to extend it to other (Indian) languages

Other domains?

Thank you!

Other resources

Lexicons

Pronunciation lexicons

G2P for some languages

Stemmers and morphological analyzers

Anything else?

Concluding Remarks

We have built a Multi-script Bollywood Song Search and are working on transliteration and code-mixing

These are just some initial ideas that came out of our experience

If you are interested, please let me know