December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu...
![Page 1: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/1.jpg)
December 13, 2008 FIRE
Not So Surprising Anymore: Hindi from TIDES to FIRE
Douglas W. Oard and Tan Xu
University of Maryland, USA
http://terpconnect.umd.edu/~oard
Slides from: Leah Larkey, Mike Maxwell, Franz Josef Och, David Yarowsky
Ideas from: Just about all of “Team TIDES”
![Page 2: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/2.jpg)
A Very Brief History of NLP
• 1966: ALPAC
  – Refocus investment on enabling technologies
• 1990: IBM’s Candide MT system
  – Credible data-driven approaches
• 1999: TIDES
  – Translation, Detection, Extraction, Summarization
![Page 3: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/3.jpg)
Surprise Language Framework
• English-only Users / Docs in language X
• Zero-resource start (treasure hunt)
• Sharply time constrained (29 days)
• Character-coded text
• Research-oriented
• Intense team-based collaboration
![Page 4: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/4.jpg)
Schedule
| Milestone | Cebuano | Hindi |
|---|---|---|
| Announce | Mar 5 | Jun 1 |
| Test Data | | Jun 27 |
| Stop Work | Mar 14 | Jun 30 |
| Newsletter | April | August |
| Talks | May 30 (HLT) | Aug 5 (TIDES PI) |
| Papers | | October (TALIP) |
![Page 5: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/5.jpg)
300-Language Survey
![Page 6: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/6.jpg)
• Five evaluated tasks
  – Automatic CLIR (English queries)
  – Topic tracking (English examples, event-based)
  – Machine translation into English
  – English “Headline” generation
  – Entity tagging (five MUC types)
• Several useful components
  – POS tags, morphology, time expressions, parsing
• Several demonstration systems
  – Interactive CLIR (two systems)
  – Cross-language QA (English Q, Translated A)
  – Machine translation (+ Translation elicitation)
  – Cross-document entity tracking
![Page 7: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/7.jpg)
16 Participating Teams
Cebuano + Hindi
USC-ISI
Maryland
NYU
Johns Hopkins
Sheffield
U Penn-LDC
CMU
UC Berkeley
MITRE
Hindi Only
U Mass
Alias-i
BBN
IBM
CUNY-Queens
K-A-T (Colorado)
Navy-SPAWAR
![Page 8: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/8.jpg)
Innovation Cycle
[Figure: innovation cycle diagram. Labels: Resource Harvesting (Books, Web, People); Lexicons; Corpora; Time; Systems (Translation, Detection, Extraction, Summarization); Research Results; Capture Process Knowledge; Coordination (Strategy, Push, Organize, Talk).]
![Page 9: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/9.jpg)
10-Day Cebuano Pre-Exercise
![Page 10: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/10.jpg)
Hindi Participants
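[Figure: participation matrix; the cell marks did not survive extraction.
Columns: Alias-i, UC Berkeley, BBN, CMU, CUNY, Johns Hopkins, IBM, ISI, LDC, MITRE, NYU, SPAWAR, U. Sheffield, U. Massachusetts, U. Maryland.
Rows: Resource Generation, Detection, Extraction, Summarization, Translation.]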
![Page 11: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/11.jpg)
Hindi Resources
• Much more data available than for Cebuano
• Data collected by all project participants
  – Web pages, News, Handbooks, Manually created, …
  – Dictionaries
• Major problems:
  – Many non-standard encodings
  – Often no converters available
  – Available converters often did not work properly
• Huge effort: data conversion and cleaning
• Resulting bilingual corpus: 4.2 million words
![Page 12: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/12.jpg)
Hindi Translation Elicitation Server
Johns Hopkins University (David Yarowsky)
• People voluntarily translated large numbers of Hindi news sentences for nightly prizes at a novel Johns Hopkins University website
• Performance is measured by BLEU score on the 20% of sentences that are randomly interspersed test sentences
  – Gives an immediate way to rank and reward quality translations and to exclude junk
• Result: 300,000 words of perfectly sentence-aligned bitext (exactly on genre) for 1-2 cents/word within ~5 days
  – Much cheaper than 25 cents/word for translation services or 5 cents/word for a prior MT group’s recruitment of local students
• Sample interface: user (English) translations are typed into text boxes, with a user choice of 2-3 encoding alternatives
• Observed exponential growth in usage (before prizes ended)
  – Viral advertising via family, friends, newsgroups, …
  – $0 in recruitment, advertising, and administrative costs
• Nightly incentive rewards given automatically via amazon.com gift certificates to email addresses (any $ amount, no fee)
  – No need for hiring overhead; rewards are given only for proven high-quality work already performed (prizes, not salary)
  – Immediate positive feedback encourages continued use
• Direct, immediate access to a worldwide labor market fluent in the source language
![Page 13: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/13.jpg)
MT Challenges
• Lexicon coverage
  – Hindi morphology
  – Transliteration of names
• Hindi word order:
  – SOV vs. SVO
• Training data inconsistencies, misalignments
• Incomplete tuning cycle
  – Same data/same model would give better results from better tuning of model parameters
![Page 14: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/14.jpg)
Example Translation
• Indonesian City of Bali in October last year in the bomb blast in the case of imam accused India of the sea on Monday began to be averted. The attack on getting and its plan to make the charges and decide if it were found guilty, he death sentence of May. Indonesia of the police said that the imam sea bomb blasts in his hand claim to be accepted. A night Club and time in the bomb blast in more than 200 people were killed and several injured were in which most foreign nationals. …
![Page 15: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/15.jpg)
MT Results Overview - Hindi
[Figure: bar chart. Axis: percent of human cased NIST (r3n4) score, 50 to 90. Bars: best competing, ISI public, ISI public+, ISI unrestricted, ISI late, Human 6, Human 5.]
Results in NIST evaluation: 7.43 Cased NIST (7.80 uncased)
![Page 16: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/16.jpg)
Comparison to other languages
| Language pair | Words of training data | NIST score | Relative to human NIST |
|---|---|---|---|
| Cebuano-English | 1.3M (w/o Bible: 400K) | ? | ? |
| Hindi-English | 4.2M | 7.4 | 73% |
| Chinese-English | 150M | 9.0 | 80% |
| Arabic-English | 120M | 10.1 | 89% |
Note: different (news) test corpora, NIST scores incomparable
![Page 17: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/17.jpg)
Hindi Week 1: Porting
• Monday
  – 2,973 BBC documents (UTF-8)
  – Batch CLIR (no stem, 2/3 known items rank 1)
• Tuesday
  – MIRACLE (“ITRANS”, gloss)
  – Stemmer (implemented from a paper)
• Wednesday
  – BBC CLIR collection (19 topics, known item)
• Friday
  – Parallel text (Bible: 900k words, Web: 4k words)
  – Devanagari OCR system
![Page 18: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/18.jpg)
Hindi Weeks 2/3/4: Exploration
• N-grams (trigrams best for UTF-8)
• Relative Average Term Frequency (Kwok)
• Scanned bilingual dictionary (Oxford)
• More topics for test collection (29)
• Weighted structured queries (IBM lexicon)
• Alternative stemmers (U Mass, Berkeley)
• Blind relevance feedback
• Transliteration
• Noun phrase translation
• MIRACLE integration (ISI MT, BBN headlines)
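The n-gram bullet can be illustrated with a minimal sketch of overlapping character trigram indexing terms. It assumes n-grams are taken over Unicode code points within whitespace-delimited tokens; the slides do not say whether bytes or code points were used, and the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=3):
    """Return overlapping character n-grams for each whitespace-delimited token.

    Assumption: n-grams are counted in Unicode code points, so a Devanagari
    vowel sign counts as one character. Tokens shorter than n are kept whole.
    """
    grams = []
    for token in text.split():
        if len(token) <= n:
            grams.append(token)
        else:
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


# Hypothetical usage on a short Hindi phrase and an English one.
print(char_ngrams("भारत सरकार"))
print(char_ngrams("hindi text"))
```

Indexing such n-grams in place of words sidesteps the need for a stemmer, at the cost of a larger index.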
![Page 19: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/19.jpg)
[Figure: timeline of Hindi resources shared between June 2 and June 24, color-coded by contributor (University of Maryland, LDC, ISI, Other Resources). Date markers shown: June 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 16, 17, 18, 19, 20, 23, 24. The alignment of items to dates did not survive extraction. Items, in extracted order:]
• IIIT Shabdanjali Dictionary in ISCII
• IIIT Shabdanjali Dictionary in UTF-8
• Original BBC-Hindi News Collection in HTML
• Cleaned BBC-Hindi News Collection in UTF-8
• Human Translated BBC News Documents
• Initial BLEU test collection
• Eng-Hindi CLIR Test Collection: 19 queries
• Hindi Bible
• Word Aligned Bible
• First MT System
• Paper: A Lightweight Stemmer for Hindi
• Hindi Stemmer
• First version CLIR system
• Second version CLIR system
• The first version of Internet Archive (IA) Web parallel
• Hindi Morphological Analysers
• Converter between ISCII and UTF-8
• Converter from UTF-8 Devanagari to hexadecimal and to ITRANS
• Eng_Hindi dict with POS tags
• Small web parallel corpus
• Transliterated Hindi Bible
• ITRANS Hindi Bible
• Full Hindi OCR System
• Expanding Coverage of Gloss Translation
• Scored translation lexicon
• Third version CLIR system
• Master Dictionary Version 0.7
• Cleaned Master Dictionary Version 0.7
• Relevance Judgement of 19 queries
• Fourth version CLIR system
• Hindi stemmer in UTF-8 hext
• Small BBC word alignment
• Eng-Hindi CLIR Test Collection: 29 queries
• OCR System; JHU
• 2nd version BBC Small word alignment
• BBC Small word alignment in UTF-8
• ISI Probabilistic Lexicon of June 13
• Berkeley Probabilistic Dictionaries of June 13
• Master Dictionary By Source Version 0.7 (only IIIT party)
• Scored translation lexicon version 2
• LDC Sentence Aligned Parallel Texts Collections
• Word alignment of LDC Parallel Texts
• Auto Word Aligned Hindi Bible
• EMILLE Corpus Version 0.1
• OCRed Oxford Hindi-English Dictionary
• Complete Machine Translated BBC Collection
• BBN Revised Hindi stemmer in UTF-8 hext
• Scored translation lexicon version 2.1
• Cleaned Complete Machine Translated BBC Collection
• XML Format OCRed Oxford Hindi-English Dictionary
• XML Format OCRed Oxford Hindi-English Dictionary Version 3.0
• ISI EMILLE Word Alignment June 23
• ISI EMILLE Word Alignment June 18
• Scored Lexicon June 24
• BBN/UMD Topic Lists
• Second Version BBN/UMD Topic Lists
![Page 20: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/20.jpg)
Formative Evaluation
[Figure: line plot. x-axis: Day (= Date - 1), 0 to 30. y-axis: Mean Reciprocal Rank, 0 to 0.8.]
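Mean reciprocal rank is the measure tracked in the formative evaluation, where each known-item topic has one target document. A minimal sketch of the standard definition follows; the function name and the example ranks are hypothetical, not data from the exercise.

```python
def mean_reciprocal_rank(ranks):
    """Mean reciprocal rank over topics.

    ranks: 1-based rank of the known item for each topic, or None when the
    known item was not retrieved (contributing a reciprocal rank of 0).
    """
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r) / len(ranks)


# Hypothetical ranks for five known-item topics.
example_ranks = [1, 1, 3, None, 2]
print(round(mean_reciprocal_rank(example_ranks), 3))
```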
![Page 21: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/21.jpg)
Lessons Learned
• We learned more from 2 languages than 1
  – Simple techniques worked for Cebuano
  – Hindi needed more (encoding, MT, transliteration)
• Usable systems can be built in a month
  – Parallel text for MT is the pacing item
• Broad collaboration yielded useful insights
![Page 22: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/22.jpg)
Our FIRE-2008 Goals
• Evaluate Surprise Language resources
  – IBM and LDC translation lexicons
  – Berkeley Stemmer
• Compare CLIR techniques
  – Probabilistic Structured Queries (PSQ)
  – Derived Aggregated Meaning Matching (DAMM)
![Page 23: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/23.jpg)
Comparing Test Collections
| | FIRE-2008 Test Collection | Surprise Language Test Collection |
|---|---|---|
| Query language | English | English |
| Doc language | Hindi | Hindi |
| Topics | 50 | 15 |
| Documents | 95,215 | 41,697 |
| Avg rel docs/topic | 68 | 41 |
![Page 24: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/24.jpg)
Monolingual Baselines
[Figure: bar chart. y-axis: Mean Average Precision, 0.0 to 0.5. Bars: UMD, UMD-BRF, UMass, UMass-BRF. Bar values as extracted (order not guaranteed to match the bars): 0.37, 0.47, 0.47, 0.38. Legend: Our FIRE-2008 Training (TDN); 2003 Surprise Language (TDNS).]
15 Surprise Language topics
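Mean average precision is the measure reported for these baselines and for the later CLIR comparisons. The sketch below shows the standard uninterpolated definition; the function names and the toy rankings are hypothetical and are not taken from the evaluation.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated average precision for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(topics):
    """topics: list of (ranked_doc_ids, relevant_ids) pairs, one per topic."""
    if not topics:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


# Hypothetical two-topic example.
toy_topics = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d1", "d2"], {"d2", "d9"}),
]
print(round(mean_average_precision(toy_topics), 3))
```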
![Page 25: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/25.jpg)
A Ranking Function: Okapi BM25
$$
\mathrm{score}(Q, d_k) = \sum_{e \in Q} \left[\log \frac{N - df(e) + 0.5}{df(e) + 0.5}\right] \left[\frac{2.2 \cdot tf(e, d_k)}{0.3 + 0.9 \cdot \frac{dl(d_k)}{avdl} + tf(e, d_k)}\right] \left[\frac{8 \cdot qtf(e)}{7 + qtf(e)}\right]
$$

• e: query term; Q: query; d_k: document
• df(e): document frequency
• tf(e, d_k): term frequency
• dl(d_k): document length
• avdl: average document length
• qtf(e): term frequency in query
• N: number of documents in the collection
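A minimal sketch of the Okapi BM25 ranking function with the constants that appear on this slide (2.2, 0.3, 0.9, 8 and 7, which correspond to k1 = 1.2, b = 0.75 and k3 = 7). The function name, argument layout and example numbers are hypothetical.

```python
import math


def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, num_docs):
    """Score one document against one query with Okapi BM25.

    query_tf: {term: term frequency in the query}
    doc_tf:   {term: term frequency in the document}
    df:       {term: document frequency in the collection}
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0.0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (2.2 * tf) / (0.3 + 0.9 * doc_len / avg_doc_len + tf)
        qtf_part = (8.0 * qtf) / (7.0 + qtf)
        score += idf * tf_part * qtf_part
    return score


# Hypothetical single-term query against an average-length document.
print(bm25_score({"e1": 1}, {"e1": 1}, doc_len=100, avg_doc_len=100,
                 df={"e1": 10}, num_docs=1000))
```

Because tf and df enter only as numbers, fractional estimates (for example, translation-weighted ones) can be passed in without changing the function.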
![Page 26: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/26.jpg)
Estimating TF and DF for Query Terms
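$$
tf(e_i, d_k) = \sum_{f_j} p(f_j \mid e_i) \cdot tf(f_j, d_k)
$$

$$
df(e_i) = \sum_{f_j} p(f_j \mid e_i) \cdot df(f_j)
$$

Example for query term e1 with four translations:
• f1: probability 0.4, tf 20, df 50
• f2: probability 0.3, tf 5, df 40
• f3: probability 0.2, tf 2, df 30
• f4: probability 0.1, tf 50, df 200

TF: 0.4*20 + 0.3*5 + 0.2*2 + 0.1*50 = 14.9
DF: 0.4*50 + 0.3*40 + 0.2*30 + 0.1*200 = 58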
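The worked example on this slide can be reproduced with a short sketch: the term frequency and document frequency of an English query term are estimated as probability-weighted sums over its document-language translations. The function names are hypothetical; the numbers are the ones on the slide.

```python
def estimate_tf(translation_probs, doc_tf):
    """Estimated tf of a query term: sum of p(f|e) * tf(f, d) over translations f."""
    return sum(p * doc_tf.get(f, 0) for f, p in translation_probs.items())


def estimate_df(translation_probs, df):
    """Estimated df of a query term: sum of p(f|e) * df(f) over translations f."""
    return sum(p * df.get(f, 0) for f, p in translation_probs.items())


# Values from the slide's example for query term e1.
p_f_given_e1 = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
tf_in_doc = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
df_in_collection = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}

print(round(estimate_tf(p_f_given_e1, tf_in_doc), 1))
print(round(estimate_df(p_f_given_e1, df_in_collection), 1))
```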
![Page 27: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/27.jpg)
Bidirectional Translation
wonders of ancient world (CLEF Topic 151)
Unidirectional:
se//0.31 demande//0.24 demander//0.08 peut//0.07 merveilles//0.04 question//0.02 savoir//0.02 on//0.02 bien//0.01 merveille//0.01 pourrait//0.01 si//0.01 sur//0.01 me//0.01 t//0.01 emerveille//0.01 ambition//0.01 merveilleusement//0.01 veritablement//0.01 cinq//0.01 hier//0.01
Bidirectional:
merveilles//0.92 merveille//0.03 emerveille//0.03 merveilleusement//0.02
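One simple way to obtain a sharper bidirectional distribution is to multiply the translation probabilities from the two directions and renormalize. This is an assumption for illustration; the slides do not state the exact combination rule. The forward probabilities below are taken from the slide, while the reverse probabilities and the function name are hypothetical.

```python
def bidirectional(p_f_given_e, p_e_given_f):
    """Combine two translation directions for one English term.

    p_f_given_e: {f: p(f|e)} for the English term e
    p_e_given_f: {f: p(e|f)} for the same e, one entry per candidate f
    Returns a renormalized product; candidates with zero support in either
    direction are dropped.
    """
    combined = {f: p * p_e_given_f.get(f, 0.0) for f, p in p_f_given_e.items()}
    total = sum(combined.values())
    if total == 0:
        return {}
    return {f: v / total for f, v in combined.items() if v > 0}


# Forward probabilities from the slide (subset); reverse ones are made up.
forward = {"se": 0.31, "demande": 0.24, "merveilles": 0.04, "merveille": 0.01}
reverse = {"merveilles": 0.9, "merveille": 0.5}  # hypothetical p(e|f)

result = bidirectional(forward, reverse)
print(sorted(result.items(), key=lambda kv: kv[1], reverse=True))
```

With these made-up reverse probabilities, frequent but spurious candidates such as "se" drop out because nothing supports them in the reverse direction.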
![Page 28: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/28.jpg)
Surprise Language Translation Lexicons

| Source | Translation pairs | English words | Hindi words |
|---|---|---|---|
| LDC (dict) | 69,195 | 21,842 | 33,251 |
| IBM (stat) | 181,110 | 50,141 | 77,517 |
| ISI (stat) | 512,248 | 65,366 | 97,275 |
p(h|e)
p(e|h)
40%
60%
![Page 29: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/29.jpg)
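Synonym Sets as Models of Term Meaning
[Figure: translation graph linking the English terms "bush" and "grass" to Chinese terms. Glossed senses: George W. Bush (乔治 . 布什), shrubbery (草丛), grass lawn (草坪), marijuana / grass (大麻). Chinese nodes: 布什, 草丛, 大麻. Edge weights as extracted: 0.7, 0.3, 0.8, 0.2 and 0.6, 1.0, 0.4, 1.0; the assignment of weights to edges did not survive extraction.]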
![Page 30: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/30.jpg)
“Meaning Matching” Variants
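[Table: the yes/no cell marks did not survive extraction.
Columns: icon; query translation knowledge?; document translation knowledge?; query language synsets?; document language synsets?; pre-aligned synsets?
Rows: FAMM, DAMM, PAMMq, PAMMd, IMM, PSQ, APSQ, PDT, APDT.
Icons as extracted, in order: (Q) (D); (Q) (D); (Q) D; Q (D); Q D; Q D; Q (D); Q D; (Q) D.]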
![Page 31: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/31.jpg)
Pruning Translations
Translations (probability): f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01)

| Cumulative Probability Threshold | Translations kept |
|---|---|
| 0.0 | f1 |
| 0.1 | f1 |
| 0.2 | f1 |
| 0.3 | f1 |
| 0.4 | f1 f2 |
| 0.5 | f1 f2 |
| 0.6 | f1 f2 f3 |
| 0.7 | f1 f2 f3 f4 |
| 0.8 | f1 f2 f3 f4 f5 |
| 0.9 | f1 f2 f3 f4 f5 f6 f7 |
| 1.0 | f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 |
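Pruning by cumulative probability threshold can be sketched as follows: translations are taken in decreasing order of probability until their cumulative probability reaches the threshold, always keeping at least one. The function name and the small tolerance `eps` (added to absorb floating-point rounding) are assumptions; the probabilities are the ones on the slide.

```python
def prune_translations(translations, threshold, eps=1e-9):
    """Keep the most probable translations until cumulative mass >= threshold.

    translations: list of (term, probability) pairs
    threshold:    cumulative probability threshold in [0, 1]
    At least one translation is always kept.
    """
    ranked = sorted(translations, key=lambda pair: pair[1], reverse=True)
    kept = []
    cumulative = 0.0
    for term, prob in ranked:
        kept.append(term)
        cumulative += prob
        if cumulative >= threshold - eps:
            break
    return kept


# Probabilities from the slide.
slide_translations = [
    ("f1", 0.32), ("f2", 0.21), ("f3", 0.11), ("f4", 0.09),
    ("f5", 0.08), ("f6", 0.05), ("f7", 0.04), ("f8", 0.03),
    ("f9", 0.03), ("f10", 0.02), ("f11", 0.01), ("f12", 0.01),
]

for t in (0.0, 0.4, 0.6, 0.9, 1.0):
    print(t, prune_translations(slide_translations, t))
```

A threshold of 0.0 degenerates to one-best translation, and 1.0 keeps every known translation.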
![Page 32: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/32.jpg)
Comparing PSQ and DAMM
[Figure: line plot. x-axis: Cumulative Probability Threshold, 0.0 to 1.0. y-axis: MAP: CLIR/Monolingual, 0% to 100%. Series: PSQ, DAMM.]
15 Surprise Language topics, TDN queries
![Page 33: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/33.jpg)
1/3 of Topics Improve w/DAMM
[Figure: per-topic bar chart. y-axis: MAP: DAMM - PSQ, -0.2 to 0.2.]
15 Surprise Language topics, TDN queries
![Page 34: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/34.jpg)
Official CLIR Results
[Figure: bar chart. y-axis: Mean Average Precision, 0.0 to 0.4. Bars: clir-EH-umd-man0 (UCB Stemmer); clir-EH-umd-man1 (YASS Stemmer); clir-EH-umd-man2 (UCB Stemmer, Pre-Trans BRF); clir-EH-median; mono-HH-best.]
50 FIRE-2008 topics, TDN queries
![Page 35: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/35.jpg)
Comparing Stemmers
[Figure: per-topic bar chart. y-axis: MAP: YASS - Berkeley, -1.0 to 1.0. Positive region: YASS Stemmer Better. Negative region: Berkeley Stemmer Better.]
50 FIRE-2008 topics, TDN queries
![Page 36: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/36.jpg)
Best (Overall) CLIR Run
[Figure: per-topic bar chart. y-axis: AP: clir-EH-umd-man2 - Median, -1.0 to 1.0. Positive region: clir-EH-umd-man2 Better. Negative region: Median Better.]
41 FIRE-2008 topics with ≥ 5 relevant documents, TDN queries
![Page 37: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/37.jpg)
Cross-Language “Retrieval”
[Figure: pipeline diagram. Labels: Query, Query Translation, Translated Query, Search, Ranked List.]
![Page 38: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/38.jpg)
Interactive Translingual Search
[Figure: interactive search loop diagram. Labels: Query Formulation, Query, Query Translation, Translated Query, Search, Ranked List, Selection, Document, Examination, Document, Use, Query Reformulation. Supporting resources: MT, Translated “Headlines”, English Definitions.]
![Page 39: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/39.jpg)
UMass Interactive Hindi CLIR
![Page 40: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/40.jpg)
MIRACLE Design Goals
• Value-added interactive search
  – Regardless of available resources
• Maximize the value of minimal resources
  – Bilingual term list + Comparable English text
• Leverage other available resources
  – Parallel text, morphology, MT, summarization
![Page 41: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/41.jpg)
![Page 42: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/42.jpg)
![Page 43: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/43.jpg)
![Page 44: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/44.jpg)
![Page 45: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/45.jpg)
Summary
• Larger Hindi test collection
  – Prerequisite for insightful failure analysis
• Surprise Language resources were useful
  – Translation lexicons
  – Berkeley stemmer (combine with YASS?)
• DAMM is robust with weaker resources
![Page 46: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/46.jpg)
Looking Forward
• Shared resources
  – Test collections
  – Translation lexicons (or parallel corpora)
  – Stemmers
• System infrastructure
  – IL variants of Indri/Terrier/Zettair/Lucene
• Community-based cycle of innovation
  – Students are our most important “result”
![Page 47: December 13, 2008FIRE Not So Surprising Anymore: Hindi from TIDES to FIRE Douglas W. Oard and Tan Xu University of Maryland, USA oard.](https://reader035.fdocuments.net/reader035/viewer/2022062716/56649e035503460f94aedbb6/html5/thumbnails/47.jpg)
For More Information
• Team TIDES newsletter
  – http://language.cnri.reston.va.us/TeamTIDES.html
  – Cebuano: April 2003
  – Hindi: October 2003
• Papers
  – NAACL/HLT 2003
  – MT Summit 2003
  – ACM TALIP Special Issues (Jun/Sep 2003)