Cross-Language Information Retrieval
Sumin Byeon, University of Arizona
Overview
Google Search example: the source-language query 안드로이드 이메일 암호화 ("Android email encryption") is run through the matching algorithm against a bilingual corpus database, returning results in English.
Background
• Corpus - a collection of written text: a single word, multiple words, or even phrases and sentences
• Comparable corpus - a collection of text from pairs of languages referring to the same domain [1]; a (source text, target text) pair
• N-gram - an n-character or n-word slice of a longer string [2]. Here, "n-gram" refers to n-character slices. We use 4-grams (four-grams, or quad-grams)
• Source language - the language of the original phrases
• Target language - the language into which CLIR translates the original phrases

[1]: Picchi, Eugenio, and Carol Peters. Cross-Language Information Retrieval: A System for Comparable Corpus Querying. Vol. 2. Springer US, 1998. 1387-5264. Print.
[2]: Cavnar, William B., and John M. Trenkle. "N-Gram-Based Text Categorization." (1994). Print.
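The n-character slicing defined above is simple enough to sketch in a few lines of Python (a minimal illustration; the slides do not show the actual implementation):

```python
def ngrams(text, n=4):
    """Return all overlapping n-character slices of text (quad-grams when n=4)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A 4-character string yields exactly one quad-gram;
# longer strings yield overlapping slices.
print(ngrams("Java"))      # ['Java']
print(ngrams("variable"))  # ['vari', 'aria', 'riab', 'iabl', 'able']
```

Note that slices overlap by n-1 characters, which is what lets a query and a corpus string share grams even when they align at different offsets.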
Motivation
• Users want to acquire information even when it is not sufficiently available in their native language
• Surveys have shown that people have a higher foreign-language proficiency in reading than in writing
• CLIR may bridge the gap between the desire to obtain information and the unavailability or under-availability of that information in the native language
Goals
• Allow users to query for domain-specific (i.e., computer science and software engineering) information in their native language
• Present relevant search results in the target language: the language in which the largest amount of information is available
Components
• Domain-specific bilingual corpus extraction from multiple sources
• Corpus indexing
• Querying and string matching
Corpus Extraction
Corpus Indexing
• Each (source, target) corpus pair maps to a fingerprint list: (S, T) -> (i1, h1), (i2, h2), …, (in, hn), where each (i, h) pair is the position of a gram within the text and its hash value
• Quad-grams (k = 4)
• Overlapping fingerprints are acceptable, although not the most space-efficient representation
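A minimal sketch of the (i, h) fingerprinting step. The hash function here is an assumption for illustration; the slides do not specify how values like 20451 are produced. Following the slide examples, fingerprints are computed over the target-language string:

```python
def fingerprint(gram):
    """Illustrative 16-bit hash of a gram (the slides' actual hash is unspecified)."""
    h = 0
    for ch in gram:
        h = (h * 31 + ord(ch)) & 0xFFFF  # keep the value in a 16-bit range
    return h

def index_pair(source, target, k=4):
    """Map a (source, target) corpus pair to the target's (position, hash) list."""
    return [(i, fingerprint(target[i:i + k]))
            for i in range(len(target) - k + 1)]

# "global variable" yields overlapping fingerprints, including ones at
# positions 3 ("bal ") and 8 ("aria"), as in the slide example.
for i, h in index_pair("전역 변수", "global variable"):
    print(i, h)
```

Storing every overlapping fingerprint, as the slide notes, trades space for simplicity; schemes that keep only a subset would be more compact.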
Example index entries, shown as position: gram (hash), with the paired source-language string:

Java (자바):
  0: Java (20451)
global variable (전역 변수):
  3: bal_ (14870)
  8: aria (14269)
example (예제):
  1: xamp (20451)

[Chart: frequency distribution of fingerprint values; y-axis: Frequency, 0 to 50,000]
Querying & Matching
Query fingerprints for "Java global variable example":
  1: ava_ (24085)
  8: bal_ (14870)
  13: aria (14269)
  22: xamp (20451)

Matched index entries:
Java (자바):
  0: Java (20451)
global variable (전역 변수):
  3: bal_ (14870)
  8: aria (14269)
example (예제):
  1: xamp (20451)
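One way to realize this lookup, assuming the index is a dictionary from fingerprint hash to the corpus pairs containing it (a sketch under that assumption, not the authors' implementation; the hash function is likewise illustrative):

```python
def fingerprint(gram):
    """Illustrative 16-bit hash (assumed; the slides do not specify theirs)."""
    h = 0
    for ch in gram:
        h = (h * 31 + ord(ch)) & 0xFFFF
    return h

def build_index(pairs, k=4):
    """index: gram hash -> list of (target, source) corpus pairs containing it."""
    index = {}
    for source, target in pairs:
        for i in range(len(target) - k + 1):
            h = fingerprint(target[i:i + k])
            index.setdefault(h, []).append((target, source))
    return index

def match(query, index, k=4):
    """Count, per corpus pair, how many of the query's gram hashes hit the index."""
    hits = {}
    for i in range(len(query) - k + 1):
        for pair in index.get(fingerprint(query[i:i + k]), []):
            hits[pair] = hits.get(pair, 0) + 1
    return hits

corpus = [("자바", "Java"), ("전역 변수", "global variable"), ("예제", "example")]
idx = build_index(corpus)
print(match("Java global variable example", idx))
```

Because the query contains each target string verbatim, every corpus pair receives at least one hit; hash collisions (the slides themselves show 20451 for both "Java" and "xamp") would add spurious candidates that the later matching stage must filter.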
Multiple Candidates
• Longest match first
• Confidence: how many times does this comparable corpus pair appear in a set of documents?
• Outcome of matching depends on the domain of the documents stored in the database
global (세계적인, "worldwide"):
  0: loba (25848)
variable (변수, "variable" as a noun):
  1: aria (14269)
global variable (전역 변수, "global variable"):
  3: bal_ (14870)
  8: aria (14269)
variable (가변적인, "changeable"):
  1: aria (14269)
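The longest-match-first rule with a confidence tiebreaker can be sketched as follows, where each candidate carries the matched target text, its source-language pairing, and a confidence count (how often the pair appears in the document set); the counts here are made-up illustrative values:

```python
def pick(candidates):
    """Prefer the longest matched text; break ties by confidence count."""
    return max(candidates, key=lambda c: (len(c[0]), c[2]))

candidates = [
    ("variable", "변수", 9),             # "variable" (noun)
    ("global variable", "전역 변수", 7),  # longest match wins despite lower count
    ("variable", "가변적인", 2),          # "changeable" (adjective)
]
print(pick(candidates))  # ('global variable', '전역 변수', 7)
```

Sorting by length before confidence reflects the slide's rule: the span "global variable" absorbs the shorter, ambiguous "variable" candidates.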
Indexing and Querying Recap
Query (source language): 자바 전역 변수 예제
Result (target language): Java global variable example

Candidate translations:
  자바: Java
  전역: transfer
  전역 변수: global variable
  예제: example
  변수: variable
  전역: all parts (of)
Relationship with Content Addressability
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque id Java tristique nunc. Vestibulum sit amet tortor ullamcorper, pretium augue ac, facilisis quam. Ut convallis suscipit mauris, at porta erat vulputate in. Nulla vitae consectetur risus. global variable Aenean justo risus, mollis sed condimentum sed, sagittis eget nisl. Phasellus sem leo, commodo at dignissim vitae, ullamcorper nec metus. Proin pretium porta lectus nec example pulvinar. Nulla non elementum nisi, vel hendrerit quam. Curabitur bibendum lobortis tincidunt. Proin vel velit porta, tempus ligula a, interdum leo. Aenean lorem nibh, facilisis ut porta sit amet, ornare quis ligula.

Terms matched in the document: Java, global variable, example
Query: 자바 전역 변수 예제 (자바, 전역 변수, 예제)
Evaluation
• Matching
• Did it translate all the search terms to the target language properly?
• Did it preserve domain-specific information?
• Searching
• Hit ratio: # of relevant web pages / # of results on the first page
• Total number of search results
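The hit-ratio metric is a straightforward fraction; as a check, here it is computed with first-page counts of the kind reported in the evaluation slides:

```python
def hit_ratio(relevant_on_first_page, results_on_first_page):
    """# of relevant web pages divided by # of results on the first page."""
    return relevant_on_first_page / results_on_first_page

print(hit_ratio(6, 10))   # 0.6  (e.g., a source-language query)
print(hit_ratio(10, 10))  # 1.0  (e.g., a target-language query)
```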
Evaluation
• 재귀 열거 집합 - recursively enumerable sets
• (3/3, 1/1)
• 배낭 문제 시간 복잡도 - 배낭 issue the time complexity (배낭, "knapsack", was left untranslated)
• (3/4, 1/2)
• 가상화를 통한 데이터센터 에너지 효율 극대화 - through virtualization datacenter energy efficiency maximization
• (7/7, 4/4)
Evaluation
• Query in source language “재귀 열거 집합”
• (6/10, 15,300)
• Query in target language “recursively enumerable sets”
• (10/10, 105,000)
• Google Translate result “Set of recursive enumeration”
• (10/10, 1,990,000)
Evaluation
• Query in source language “배낭 문제 시간 복잡도”
• (10/10, 31,200)
• Query in target language “배낭 issue time complexity”
• (2/6, 2,270)
• Google Translate result “Knapsack problem, the time complexity”
• (10/10, 206,000)
Evaluation
• Query in source language “가상화를 통한 데이터센터 에너지 효율 극대화”
• (5/10, 36,100)
• Query in target language “through virtualization datacenter energy efficiency maximization”
• (8/10, 264,000)
• Google Translate result “Maximize energy efficiency through data center virtualization”
• (10/10, 284,000)
Conclusion & Future Work
• Preliminary results look satisfactory
• Machine-translation-based CLIR appears to be more useful in many cases
• The evaluation factors may not reflect the actual quality of the system
• The evaluation process is labor-intensive; an automated evaluation is needed
• Fuzzy matching based on lexical information (e.g., call, calls)
• Fuzzy matching based on semantic information (e.g., maximize, maximizing, maximization, maximum)
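The proposed lexical fuzzy matching could start from something as naive as suffix stripping; this is a hypothetical sketch of the future-work item, not an implemented component, and the suffix list is invented for illustration:

```python
# Illustrative suffix list, longest first so "-ization" is tried before "-ize".
SUFFIXES = ("ization", "ation", "izing", "ize", "ing", "es", "s", "e")

def stem(word):
    """Naively strip one known suffix so related word forms share a key."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

def fuzzy_equal(a, b):
    return stem(a.lower()) == stem(b.lower())

print(fuzzy_equal("call", "calls"))           # True
print(fuzzy_equal("maximize", "maximization"))  # True
# Purely lexical stripping does not unify "maximum" with "maximize";
# that is the gap the semantic variant in the slides would have to cover.
print(fuzzy_equal("maximize", "maximum"))     # False
```

A real implementation would more likely use an established stemmer (e.g., a Porter-style algorithm) rather than a hand-rolled suffix list.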