Cross-Language Information Retrieval
Sumin Byeon, University of Arizona
Overview
Google Search example: the source-language query 안드로이드 이메일 암호화 ("Android email encryption") is run through the matching algorithm against a bilingual corpus database, returning results in English.
Background
• Corpus - a collection of written text: a single word, multiple words, or even phrases and sentences
• Comparable corpus - a collection of text from pairs of languages referring to the same domain [1]; a (source text, target text) pair
• N-gram - an n-character or n-word slice of a longer string [2]. Here, "n-gram" refers to n-character slices. We use 4-grams (four-grams, or quad-grams)
• Source language - the language of the original phrases
• Target language - the language into which CLIR translates the original phrases

[1]: Picchi, Eugenio, and Carol Peters. Cross-Language Information Retrieval: A System for Comparable Corpus Querying. Vol. 2. Springer US, 1998. 1387-5264. Print.
[2]: Cavnar, William B., and John M. Trenkle. "N-Gram-Based Text Categorization." (1994). Print.
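The n-character slicing defined above is simple enough to sketch in a few lines of Python (a minimal illustration; the slides do not show the actual implementation):

```python
def ngrams(text, n=4):
    """Return all overlapping n-character slices of text (quad-grams when n=4)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A 4-character string yields exactly one quad-gram;
# longer strings yield overlapping slices.
print(ngrams("Java"))      # ['Java']
print(ngrams("variable"))  # ['vari', 'aria', 'riab', 'iabl', 'able']
```

Note that slices overlap by n-1 characters, which is what lets a query and a corpus string share grams even when they align at different offsets.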
Motivation
• Users want to acquire information even when it is not sufficiently available in their native language
• Surveys have shown that people have a higher foreign-language proficiency in reading than in writing
• CLIR may bridge the gap between the desire to obtain information and the unavailability or under-availability of that information in the native language
Goals
• Allow users to query for domain-specific (i.e., computer science and software engineering) information in their native language
• Present relevant search results in the target language: the language in which the largest amount of information is available
Components
• Domain-specific bilingual corpus extraction from multiple sources
• Corpus indexing
• Querying and string matching
Corpus Extraction
Corpus Indexing
• Each (source, target) corpus pair maps to a fingerprint list: (S, T) -> (i1, h1), (i2, h2), …, (in, hn), where each (i, h) pair is the position of a gram within the text and its hash value
• Quad-grams (k = 4)
• Overlapping fingerprints are acceptable, although not the most space-efficient representation
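A minimal sketch of the (i, h) fingerprinting step. The hash function here is an assumption for illustration; the slides do not specify how values like 20451 are produced. Following the slide examples, fingerprints are computed over the target-language string:

```python
def fingerprint(gram):
    """Illustrative 16-bit hash of a gram (the slides' actual hash is unspecified)."""
    h = 0
    for ch in gram:
        h = (h * 31 + ord(ch)) & 0xFFFF  # keep the value in a 16-bit range
    return h

def index_pair(source, target, k=4):
    """Map a (source, target) corpus pair to the target's (position, hash) list."""
    return [(i, fingerprint(target[i:i + k]))
            for i in range(len(target) - k + 1)]

# "global variable" yields overlapping fingerprints, including ones at
# positions 3 ("bal ") and 8 ("aria"), as in the slide example.
for i, h in index_pair("전역 변수", "global variable"):
    print(i, h)
```

Storing every overlapping fingerprint, as the slide notes, trades space for simplicity; schemes that keep only a subset would be more compact.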
Example index entries, shown as position: gram (hash), with the paired source-language string:

Java (자바):
  0: Java (20451)
global variable (전역 변수):
  3: bal_ (14870)
  8: aria (14269)
example (예제):
  1: xamp (20451)

[Chart: frequency distribution of fingerprint values; y-axis: Frequency, 0 to 50,000]
Querying & Matching
Query fingerprints for "Java global variable example":
  1: ava_ (24085)
  8: bal_ (14870)
  13: aria (14269)
  22: xamp (20451)

Matched index entries:
Java (자바):
  0: Java (20451)
global variable (전역 변수):
  3: bal_ (14870)
  8: aria (14269)
example (예제):
  1: xamp (20451)
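One way to realize this lookup, assuming the index is a dictionary from fingerprint hash to the corpus pairs containing it (a sketch under that assumption, not the authors' implementation; the hash function is likewise illustrative):

```python
def fingerprint(gram):
    """Illustrative 16-bit hash (assumed; the slides do not specify theirs)."""
    h = 0
    for ch in gram:
        h = (h * 31 + ord(ch)) & 0xFFFF
    return h

def build_index(pairs, k=4):
    """index: gram hash -> list of (target, source) corpus pairs containing it."""
    index = {}
    for source, target in pairs:
        for i in range(len(target) - k + 1):
            h = fingerprint(target[i:i + k])
            index.setdefault(h, []).append((target, source))
    return index

def match(query, index, k=4):
    """Count, per corpus pair, how many of the query's gram hashes hit the index."""
    hits = {}
    for i in range(len(query) - k + 1):
        for pair in index.get(fingerprint(query[i:i + k]), []):
            hits[pair] = hits.get(pair, 0) + 1
    return hits

corpus = [("자바", "Java"), ("전역 변수", "global variable"), ("예제", "example")]
idx = build_index(corpus)
print(match("Java global variable example", idx))
```

Because the query contains each target string verbatim, every corpus pair receives at least one hit; hash collisions (the slides themselves show 20451 for both "Java" and "xamp") would add spurious candidates that the later matching stage must filter.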
Multiple Candidates
• Longest match first
• Confidence: how many times does this comparable corpus pair appear in a set of documents?
• Outcome of matching depends on the domain of the documents stored in the database
global (세계적인, "worldwide"):
  0: loba (25848)
variable (변수, "variable" as a noun):
  1: aria (14269)
global variable (전역 변수, "global variable"):
  3: bal_ (14870)
  8: aria (14269)
variable (가변적인, "changeable"):
  1: aria (14269)
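The longest-match-first rule with a confidence tiebreaker can be sketched as follows, where each candidate carries the matched target text, its source-language pairing, and a confidence count (how often the pair appears in the document set); the counts here are made-up illustrative values:

```python
def pick(candidates):
    """Prefer the longest matched text; break ties by confidence count."""
    return max(candidates, key=lambda c: (len(c[0]), c[2]))

candidates = [
    ("variable", "변수", 9),             # "variable" (noun)
    ("global variable", "전역 변수", 7),  # longest match wins despite lower count
    ("variable", "가변적인", 2),          # "changeable" (adjective)
]
print(pick(candidates))  # ('global variable', '전역 변수', 7)
```

Sorting by length before confidence reflects the slide's rule: the span "global variable" absorbs the shorter, ambiguous "variable" candidates.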
Indexing and Querying Recap
Query (source language): 자바 전역 변수 예제
Result (target language): Java global variable example

Candidate translations:
  자바: Java
  전역: transfer
  전역 변수: global variable
  예제: example
  변수: variable
  전역: all parts (of)
Relationship with Content Addressability
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque id Java tristique nunc. Vestibulum sit amet tortor ullamcorper, pretium augue ac, facilisis quam. Ut convallis suscipit mauris, at porta erat vulputate in. Nulla vitae consectetur risus. global variable Aenean justo risus, mollis sed condimentum sed, sagittis eget nisl. Phasellus sem leo, commodo at dignissim vitae, ullamcorper nec metus. Proin pretium porta lectus nec example pulvinar. Nulla non elementum nisi, vel hendrerit quam. Curabitur bibendum lobortis tincidunt. Proin vel velit porta, tempus ligula a, interdum leo. Aenean lorem nibh, facilisis ut porta sit amet, ornare quis ligula.

Terms matched in the document: Java, global variable, example
Query: 자바 전역 변수 예제 (자바, 전역 변수, 예제)
Evaluation
• Matching
• Did it translate all the search terms to the target language properly?
• Did it preserve domain-specific information?
• Searching
• Hit ratio: # of relevant web pages / # of results on the first page
• Total number of search results
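The hit-ratio metric is a straightforward fraction; as a check, here it is computed with first-page counts of the kind reported in the evaluation slides:

```python
def hit_ratio(relevant_on_first_page, results_on_first_page):
    """# of relevant web pages divided by # of results on the first page."""
    return relevant_on_first_page / results_on_first_page

print(hit_ratio(6, 10))   # 0.6  (e.g., a source-language query)
print(hit_ratio(10, 10))  # 1.0  (e.g., a target-language query)
```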
Evaluation
• 재귀 열거 집합 - recursively enumerable sets
• (3/3, 1/1)
• 배낭 문제 시간 복잡도 - 배낭 issue the time complexity (배낭, "knapsack", was left untranslated)
• (3/4, 1/2)
• 가상화를 통한 데이터센터 에너지 효율 극대화 - through virtualization datacenter energy efficiency maximization
• (7/7, 4/4)
Evaluation
• Query in source language “재귀 열거 집합”
• (6/10, 15,300)
• Query in target language “recursively enumerable sets”
• (10/10, 105,000)
• Google Translate result “Set of recursive enumeration”
• (10/10, 1,990,000)
Evaluation
• Query in source language “배낭 문제 시간 복잡도”
• (10/10, 31,200)
• Query in target language “배낭 issue time complexity”
• (2/6, 2,270)
• Google Translate result “Knapsack problem, the time complexity”
• (10/10, 206,000)
Evaluation
• Query in source language “가상화를 통한 데이터센터 에너지 효율 극대화”
• (5/10, 36,100)
• Query in target language “through virtualization datacenter energy efficiency maximization”
• (8/10, 264,000)
• Google Translate result “Maximize energy efficiency through data center virtualization”
• (10/10, 284,000)
Conclusion & Future Work
• Preliminary results look satisfactory
• Machine-translation-based CLIR appears to be more useful in many cases
• The evaluation factors may not reflect the actual quality of the system
• The evaluation process is labor-intensive; an automated evaluation is needed
• Fuzzy matching based on lexical information (e.g., call, calls)
• Fuzzy matching based on semantic information (e.g., maximize, maximizing, maximization, maximum)
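The proposed lexical fuzzy matching could start from something as naive as suffix stripping; this is a hypothetical sketch of the future-work item, not an implemented component, and the suffix list is invented for illustration:

```python
# Illustrative suffix list, longest first so "-ization" is tried before "-ize".
SUFFIXES = ("ization", "ation", "izing", "ize", "ing", "es", "s", "e")

def stem(word):
    """Naively strip one known suffix so related word forms share a key."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

def fuzzy_equal(a, b):
    return stem(a.lower()) == stem(b.lower())

print(fuzzy_equal("call", "calls"))           # True
print(fuzzy_equal("maximize", "maximization"))  # True
# Purely lexical stripping does not unify "maximum" with "maximize";
# that is the gap the semantic variant in the slides would have to cover.
print(fuzzy_equal("maximize", "maximum"))     # False
```

A real implementation would more likely use an established stemmer (e.g., a Porter-style algorithm) rather than a hand-rolled suffix list.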