
Learning Translations via Matrix Completion

Derry Wijaya1, Brendan Callahan1, John Hewitt1, Jie Gao1, Xiao Ling1, Marianna Apidianaki12 and Chris Callison-Burch1

1 Computer and Information Science Department, University of Pennsylvania
2 LIMSI, CNRS, Université Paris-Saclay, 91403 Orsay

[email protected]

Abstract

Bilingual Lexicon Induction is the task of learning word translations without bilingual parallel corpora. We model this task as a matrix completion problem, and present an effective and extendable framework for completing the matrix. This method harnesses diverse bilingual and monolingual signals, each of which may be incomplete or noisy. Our model achieves state-of-the-art performance for both high and low resource languages.

1 Introduction

Machine translation (MT) models typically require large, sentence-aligned bilingual texts to learn good translation models (Wu et al., 2016; Sennrich et al., 2016a; Koehn et al., 2003). However, for many language pairs, such parallel texts may only be available in limited quantities, which is problematic. Alignments at the word or subword levels (Sennrich et al., 2016b) can be inaccurate in the limited parallel texts, which can in turn lead to inaccurate translations. Due to the low quantity and thus coverage of the texts, there may still be "out-of-vocabulary" words encountered at run-time. The Bilingual Lexicon Induction (BLI) task (Rapp, 1995), which learns word translations from monolingual or comparable corpora, is an attempt to alleviate this problem. The goal is to use plentiful, more easily obtainable, monolingual or comparable data to infer word translations and reduce the need for parallel data to learn good translation models. The word translations obtained by BLI can, for example, be used to augment MT systems and improve alignment accuracy, coverage, and translation quality (Gulcehre et al., 2016; Callison-Burch et al., 2006; Daumé and Jagarlamudi, 2011).


Figure 1: Our framework allows us to use a diverse range of signals to learn translations, including incomplete bilingual dictionaries, information from related languages (like Indonesian loan words from Dutch shown here), word embeddings, and even visual similarity cues.

Previous research has explored different sources for estimating translation equivalence from monolingual corpora (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Irvine and Callison-Burch, 2013, 2017). These monolingual signals, when combined in a supervised model, can enhance end-to-end MT for low resource languages (Klementiev et al., 2012a; Irvine and Callison-Burch, 2016). More recently, similarities between words in different languages have been approximated by constructing a shared bilingual word embedding space with different forms of bilingual supervision (Upadhyay et al., 2016).

We present a framework for learning translations by combining diverse signals of translation that are each potentially sparse or noisy. We use matrix factorization (MF), which has been shown to be effective for harnessing incomplete or noisy distant supervision from multiple sources of information (Fan et al., 2014; Rocktäschel et al., 2015). MF has also been shown to result in good cross-lingual representations for tasks such as alignment (Goutte et al., 2004), QA (Zhou et al., 2013), and cross-lingual word embeddings (Shi et al., 2015).


Specifically, we represent translation as a matrix with source words in the columns and target words in the rows, and model the task of learning translations as a matrix completion problem. Starting from some observed translations (e.g., from existing bilingual dictionaries), we infer missing translations in the matrix using MF with a Bayesian Personalized Ranking (BPR) objective (Rendle et al., 2009). We select BPR for a number of reasons: (1) BPR has been shown to outperform traditional supervised methods in the presence of positive-only data (Riedel et al., 2013), which is true in our case since we only observe positive translations. (2) BPR is easily extendable to incorporate additional signals for inferring missing values in the matrix (He and McAuley, 2016). Since observed translations may be sparse, i.e., the "cold start" problem in the matrix completion task, incorporating additional signals of translation equivalence estimated on monolingual corpora is useful. (3) BPR is also shown to be effective for multilingual transfer learning (Verga et al., 2016). For low resource source languages, there may be related, higher resource languages from which we can project available translations (e.g., translations of loan words) to the target language (Figure 1).

We conduct large scale experiments to learn translations from both low and high resource languages to English and achieve state-of-the-art performance on these languages. Our main contributions are as follows:

• We introduce a MF framework that learns translations by integrating diverse bilingual and monolingual signals of translation, each potentially noisy/incomplete.

• The framework is easily extendable to incorporate additional signals of translation equivalence. Since ours is a framework for integration, each signal can be improved separately to improve the overall system.

• Large scale experiments on both low and high resource languages show the effectiveness of our model, outperforming the current state-of-the-art.

• We make our code, datasets, and output translations publicly available.1

2 Related Work

1 http://www.cis.upenn.edu/%7Ederry/translations.html

Bilingual Lexicon Induction Previous research has used different sources for estimating translations from monolingual corpora. Signals such as contextual, temporal, topical, and orthographic similarities between words are used to measure their translation equivalence (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Irvine and Callison-Burch, 2013, 2017).

With the increasing popularity of word embeddings, many recent works approximate similarities between words in different languages by constructing a shared bilingual embedding space (Klementiev et al., 2012b; Zou et al., 2013; Vulic and Moens, 2013; Mikolov et al., 2013a; Faruqui and Dyer, 2014; Chandar A P et al., 2014; Gouws et al., 2015; Luong et al., 2015; Lu et al., 2015; Upadhyay et al., 2016). In the shared space, words from different languages are represented in a language-independent manner such that similar words, regardless of language, have similar representations. Similarities between words can then be measured in the shared space. One approach to induce this shared space is to learn a mapping function between the languages' monolingual semantic spaces (Mikolov et al., 2013a; Dinu et al., 2014). The mapping relies on seed translations, which can come from existing dictionaries or be reliably chosen from pseudo-bilingual corpora of comparable texts, e.g., Wikipedia with interlanguage links. Vulic and Moens (2015) show that by learning a linear function with a reliably chosen seed lexicon, they outperform other models with more expensive bilingual signals for training on benchmark data.

Most prior work on BLI, however, either makes use of only one monolingual signal or uses unsupervised methods (e.g., rank combination) to aggregate the signals. Irvine and Callison-Burch (2016) show that combining monolingual signals in a supervised logistic regression model produces higher accuracy word translations than unsupervised models. More recently, Vulic et al. (2016) show that their multi-modal model that employs a simple weighted sum of word embeddings and visual similarities can improve translation accuracy. These works show that there is a need for combining diverse, multi-modal monolingual signals of translation. In this paper, we take this a step further by combining the monolingual signals with bilingual signals of translation from existing bilingual dictionaries of related, "third" languages.

Bayesian Personalized Ranking (BPR) Our approach is based on extensions to the probabilistic model of MF in collaborative filtering (Koren et al., 2009; Rendle et al., 2009). We represent our translation task as a matrix with source words in the columns and target words in the rows (Figure 1). Based on some observed translations in the matrix found in a seed dictionary, our model learns low-dimensional feature vectors that encode the latent properties of the words in the rows and the words in the columns. The dot product of these vectors, which indicates how "aligned" the source and the target word properties are, captures how likely they are to be translations.

Since we do not observe false translations in the seed dictionary, the training data in the matrix consists only of positive translations. The absence of values in the matrix does not imply that the corresponding words are not translations. In fact, we seek to predict which of these missing values are true. The BPR approach to MF (Rendle et al., 2009) formulates the task of predicting missing values as a ranking task. With the assumption that observed true translations should be given higher values than unobserved translations, BPR learns to optimize the difference between the values assigned to the observed translations and the values assigned to the unobserved translations.

However, due to the sparsity of existing bilingual dictionaries (for some language pairs such dictionaries may not exist), the traditional formulation of MF with BPR suffers from the "cold start" issue (Gantner et al., 2010; He and McAuley, 2016; Verga et al., 2016). In our case, these are situations in which some source words have no translations to any word in the target or related languages. For these words, additional information, e.g., monolingual signals of translation equivalence or language-independent representations such as visual representations, must be used.

We use bilingual translations from the source to the target language, English, obtained from Wikipedia page titles with interlanguage links. Since Wikipedia pages in the source language may be linked to pages in languages other than English, we also use high accuracy, crowdsourced translations (Pavlick et al., 2014) from these third languages to English as additional bilingual translations. To alleviate the cold start issue, when a source word has no existing known translation to English or other third languages, our model backs off to additional signals of translation equivalence estimated based on its word embedding and visual representations.

3 Method

In this section, we describe our framework for integrating bilingual and monolingual signals for learning translations. First we formulate the task of Bilingual Lexicon Induction and introduce our model for learning translations given observed translations and additional monolingual/language-independent signals. Then we derive our learning procedure using the BPR objective function.

Problem Formulation Given a set of source words $F$ and a set of target words $E$, the pair $\langle e, f \rangle$ where $e \in E$ and $f \in F$ is a candidate translation with an associated score $x_{e,f} \in [0, 1]$ indicating the confidence of the translation. The input to our model is a set of observed translations $T := \{\langle e, f \rangle \mid x_{e,f} = 1\}$. These could come from an incomplete bilingual dictionary. We also add word identities to the matrix, i.e., we define $T_{\text{identity}} := \{\langle e, e \rangle\}$, where $T_{\text{identity}} \subset T$. The task of Bilingual Lexicon Induction is then to generate missing translations: for a given source word $f$ and a set of target words $\{e \mid \langle e, f \rangle \notin T\}$, predict the score $x_{e,f}$ of how likely it is for $e$ to be a translation of $f$.

Bilingual Signals for Translation One way to predict $x_{e,f}$ is by using matrix factorization. The problem of predicting $x_{e,f}$ can be seen as a task of estimating a matrix $X : E \times F$. $X$ is approximated by a matrix product of two low-rank matrices $P : |E| \times k$ and $Q : |F| \times k$:

$$X := PQ^T$$

where $k$ is the rank of the approximation. Each row $p_e$ in $P$ can be seen as a feature vector describing the latent properties of the target word $e$, and each row $q_f$ of $Q$ describes the latent properties of the source word $f$. Their dot product encodes how aligned the latent properties are and, since these vectors are trained on observed translations, it encodes how likely they are to be translations of each other. Thus, we can write this formulation of predicted scores $x_{e,f}$ with MF as:

$$x^{MF}_{e,f} = p_e^T q_f = \sum_{i=1}^{k} p_{e_i} \cdot q_{f_i} \quad (1)$$
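To make the factorization concrete, the following is a minimal numpy sketch (not from the paper; the sizes and variable names are illustrative) of completing and scoring the translation matrix with low-rank factors as in eq. 1:

```python
import numpy as np

# Illustrative sizes: |E| target words, |F| source words, rank k (not the paper's settings).
num_E, num_F, k = 1000, 800, 50

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(num_E, k))  # row p_e: latent vector of target word e
Q = rng.normal(scale=0.1, size=(num_F, k))  # row q_f: latent vector of source word f

def score(e_idx, f_idx):
    """x^MF_{e,f} = p_e . q_f as in eq. (1)."""
    return P[e_idx] @ Q[f_idx]

# Completing the whole matrix at once: X is approximated by P Q^T.
X_hat = P @ Q.T

# Ranking all English candidates for one source word f:
f_idx = 3
top10 = np.argsort(-X_hat[:, f_idx])[:10]  # indices of the 10 highest-scoring target words
```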

Figure 2: The word tidur (id) is a cold word with no associated translation in the matrix. Auxiliary features $\theta_f$ about the words can be used to predict translations for cold words.

Auxiliary Signals for Translation Because the observed bilingual translations may be sparse, the MF approach can suffer from the existence of cold items: words that have none or too few associated observed translations to estimate their latent dimensions accurately (Figure 2). Additional signals for measuring translation equivalence can alleviate this problem. Hence, in the case of cold words, we use a formulation of $x_{e,f}$ that involves auxiliary features about the words in the predicted $x_{e,f}$:

$$x^{AUX}_{e,f} = \theta_e^T \theta_f + \beta^T \theta_f \quad (2)$$

$\theta_f$ represents auxiliary information about the cold word $f$, e.g., its word embedding or visual features. $\theta_e$ is a feature vector to be trained, whose dot product with $\theta_f$ models the extent to which the word $e$ matches the auxiliary features of word $f$. In practice, learning $\theta_e$ amounts to learning a classifier, one for each target word $e$, that learns weights $\theta_e$ given the feature vectors $\theta_f$ of its translations. $\beta$ models the target words' overall bias toward a given word $f$.

Since each word can have multiple additional feature vectors, we can formulate $x^{AUX}_{e,f}$ as a weighted sum of available auxiliary features2:

$$x^{AUX}_{e,f} = \alpha_1\, \theta_e^T \theta_f + \alpha_2\, \phi_e^T \phi_f + \ldots + \alpha_n\, \psi_e^T \psi_f$$

where the $\alpha_m$ are parameters assigned to control the contribution of each auxiliary feature.

In practice, we can combine the MF and auxiliary formulations by defining:

$$x_{e,f} = x^{MF}_{e,f} + x^{AUX}_{e,f}$$

2 We omit bias terms for brevity.

However, since the bilingual signals that are input to $x^{MF}_{e,f}$ are often precise but sparse, while the monolingual signals that are input to $x^{AUX}_{e,f}$ are often noisy and not sparse, in our model we only back off to the less precise $x^{AUX}_{e,f}$ for cold source words that have none or too few associated translations (more details are given in the experiments, Section 4). For other source words, we use $x^{MF}_{e,f}$ to predict.

Learning with Bayesian Personalized Ranking Unlike traditional supervised models that try to maximize the scores assigned to positive instances (in our case, observed translations), the objective of Bayesian Personalized Ranking (BPR) is to maximize the difference between the scores assigned to the observed translations and those assigned to the unobserved translations. Given a training set $D$ consisting of triples of the form $\langle e, f, g \rangle$, where $\langle e, f \rangle \in T$ and $\langle e, g \rangle \notin T$, BPR wants to maximize $x_{e,f,g}$, defined as:

$$x_{e,f,g} = x_{e,f} - x_{e,g}$$

where $x_{e,f}$ and $x_{e,g}$ can be defined either by eq. 1 or eq. 2 (for cold words). Specifically, BPR optimizes (Rendle et al., 2009):

$$\sum_{\langle e,f,g \rangle \in D} \ln \sigma(x_{e,f,g}) - \lambda_{\Theta} ||\Theta||^2$$

where $\sigma$ is the logistic sigmoid function, $\Theta$ is the parameter vector of $x_{e,f,g}$ to be trained, and $\lambda_{\Theta}$ is its hyperparameter vector. BPR can be trained using stochastic gradient ascent, where a triple $\langle e, f, g \rangle$ is sampled from $D$ and parameter updates are performed:

$$\Theta \leftarrow \Theta + \eta \left( \sigma(-x_{e,f,g}) \frac{\partial x_{e,f,g}}{\partial \Theta} - \lambda_{\Theta} \Theta \right)$$

where $\eta$ is the learning rate. Hence, for the MF formulation of $x_{e,f,g}$, we can sample a triple $\langle e, f, g \rangle$ from $D$ and update its parameters as:

$$p_e \leftarrow p_e + \eta \left( \sigma(-x^{MF}_{e,f,g})(q_f - q_g) - \lambda_P\, p_e \right)$$
$$q_f \leftarrow q_f + \eta \left( \sigma(-x^{MF}_{e,f,g})(p_e) - \lambda_{Q^+}\, q_f \right)$$
$$q_g \leftarrow q_g + \eta \left( \sigma(-x^{MF}_{e,f,g})(-p_e) - \lambda_{Q^-}\, q_g \right)$$

while for the auxiliary formulation of $x_{e,f,g}$, we can sample a triple $\langle e, f, g \rangle$ from $D$ and update its parameters as:

$$\theta_e \leftarrow \theta_e + \eta \left( \sigma(-x^{AUX}_{e,f,g})(\theta_f - \theta_g) - \lambda_{\Theta}\, \theta_e \right)$$
$$\beta \leftarrow \beta + \eta \left( \sigma(-x^{AUX}_{e,f,g})(\theta_f - \theta_g) - \lambda_{\beta}\, \beta \right)$$
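As a rough illustration of these update rules, the sketch below implements the MF formulation of the BPR updates in numpy. It is not the authors' implementation (which extends the Java library LIBREC, Section 4); the toy translation set, sampling scheme, and hyperparameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
num_E, num_F, k = 1000, 800, 50
P = rng.normal(scale=0.1, size=(num_E, k))   # target-word factors p_e
Q = rng.normal(scale=0.1, size=(num_F, k))   # source-word factors q_f

eta, lam = 0.05, 0.01                        # learning rate and regularization (illustrative)
T = {(0, 1), (0, 2), (5, 7)}                 # toy observed translations <e, f>
observed = list(T)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    e, f = observed[rng.integers(len(observed))]  # sample an observed pair <e, f> in T
    g = int(rng.integers(num_F))                  # sample a source word g with <e, g> not in T
    if (e, g) in T:
        continue
    x_efg = P[e] @ Q[f] - P[e] @ Q[g]             # x_{e,f,g} = x_{e,f} - x_{e,g}
    w = sigmoid(-x_efg)                           # sigma(-x_{e,f,g})
    p_e = P[e].copy()
    P[e] += eta * (w * (Q[f] - Q[g]) - lam * p_e)
    Q[f] += eta * (w * p_e - lam * Q[f])
    Q[g] += eta * (w * (-p_e) - lam * Q[g])
```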


4 Experiments

To implement our approach, we extend the implementation of BPR in LIBREC,3 which is a publicly available Java library for recommender systems.

3 https://www.librec.net/index.html

We evaluate our model for the task of Bilingual Lexicon Induction (BLI). Given a source word $f$, the task is to rank all candidate target words $e$ by their predicted translation scores $x_{e,f}$. We conduct large-scale experiments on 27 low- and high-resource source languages and evaluate their translations to English. We use the 100K most frequent words from English Wikipedia as candidate English target words ($E$).

At test time, for each source language, we evaluate the top-10 accuracy (Acc10): the percentage of source language words in the test set for which a correct English translation appears in the top-10 ranked English candidates.
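For clarity, Acc10 can be computed as in this short sketch (hypothetical data structures; ranked_candidates stands for whatever ranker is being evaluated):

```python
def acc10(test_pairs, ranked_candidates):
    """test_pairs: list of (source_word, set_of_gold_english_translations).
    ranked_candidates(f): English candidates sorted by predicted score x_{e,f}, best first."""
    hits = 0
    for f, gold in test_pairs:
        top10 = ranked_candidates(f)[:10]
        if any(e in gold for e in top10):
            hits += 1
    return hits / len(test_pairs)
```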

4.1 Data

4.1.1 Test sets

We use benchmark test sets for the task of bilingual lexicon induction to evaluate the performance of our model. The VULIC1000 dataset (Vulic and Moens, 2016) comprises 1000 nouns in Spanish, Italian, and Dutch, along with their one-to-one ground-truth word translations in English.

We construct a new test set (CROWDTEST) for a larger set of 27 languages from crowdsourced dictionaries (Pavlick et al., 2014). For each language, we randomly pick up to 1000 words that have only one English word translation in the crowdsourced dictionary to be the test set for that language. On average, there are 967 test source words with a variety of POS per language. Since different languages treat grammatical categories such as tense and number differently (for example, unlike English, tenses are not expressed by specific forms of words in Indonesian (id); rather, they are expressed through context), we make our evaluation on all languages in CROWDTEST generic by treating a predicted English translation of a foreign word as correct as long as it has the same lemma as the gold English translation. To facilitate further research, we make CROWDTEST publicly available on our website.

4.1.2 Bilingual Signals for Translation

We use Wikipedia to incorporate information from a third language into the matrix, with observed translations to both the source language and the target language, English. We first collect all interlingual links from English Wikipedia pages to pages in other languages. Using these links, we obtain translations of Wikipedia page titles in many languages to English, e.g., id.wikipedia.org/wiki/Kulkas → fridge (en). The observed translations are projected to fill the missing translations in the matrix (Figure 3). We call these bilingual translations WIKI.

Figure 3: Wikipedia pages with observed translations to the source (id) and the target (en) languages act as a third language in the matrix.

From the links that we have collected, we can also infer links from Wikipedia pages in the source language to other pages in non-target languages, e.g., id.wikipedia.org/wiki/Kulkas → it.wikipedia.org/wiki/Frigorifero. The titles of these pages can be translated to English if they exist as entries in the dictionaries. These non-source, non-target language pages can act as yet another third language whose observed translations can be projected to fill the missing translations in the matrix (Figure 3). We call these bilingual translations WIKI+CROWD.

4.1.3 Monolingual Signals for Translation

We define cold source words in our experiments as source words that have no associated WIKI translations and fewer than 2 associated WIKI+CROWD translations. For each cold source word $f$, we predict the score of its translation to each candidate English word $e$ using the auxiliary formulation of $x_{e,f}$ (Equation 2). There are two auxiliary signals about the words that we use in our experiments: (1) bilingually informed word embeddings and (2) visual representations.

Bilingually Informed Word Embeddings For each language, we learn monolingual embeddings for its words by training a standard monolingual word2vec skipgram model (Mikolov et al., 2013b) on tokenized Wikipedia pages of that language using Gensim (Rehurek and Sojka, 2010). We obtain 100-dimensional word embeddings with 15 epochs, 15 negatives, window size of 5, and cutoff value of 5.
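Assuming an iterator over tokenized Wikipedia sentences, these settings correspond to a Gensim call along the following lines (a sketch using the pre-4.0 Gensim keyword names; the toy sentences are placeholders):

```python
from gensim.models import Word2Vec

# Stand-in for an iterator over tokenized Wikipedia sentences (assumed input).
sentences = [["kulkas", "itu", "dingin"], ["saya", "tidur", "di", "rumah"]] * 200

model = Word2Vec(
    sentences,
    sg=1,          # skip-gram
    size=100,      # 100-dimensional vectors ("vector_size" in Gensim >= 4.0)
    window=5,      # window size of 5
    negative=15,   # 15 negative samples
    min_count=5,   # cutoff value of 5
    iter=15,       # 15 epochs ("epochs" in Gensim >= 4.0)
)
vec = model.wv["kulkas"]   # 100-d embedding for an in-vocabulary word
```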

Given two monolingual embedding spaces $\mathbb{R}^{d_F}$ and $\mathbb{R}^{d_E}$ of the source and target languages $F$ and $E$, where $d_F$ and $d_E$ denote the dimensionality of the monolingual embedding spaces, we use the set of crowdsourced translations that are not in the test set as our seed bilingual translations4 and learn a mapping function $W \in \mathbb{R}^{d_E \times d_F}$ that maps the target language vectors in the seed translations to their corresponding source language vectors.5

We learn two types of mapping, linear and non-linear, and compare their performances. The linear mapping (Mikolov et al., 2013a; Dinu et al., 2014) minimizes $||X_E W - X_F||_F^2$ where, following the notation in (Vulic and Korhonen, 2016), $X_E$ and $X_F$ are matrices obtained by respective concatenation of the target language and source language vectors that are in the seed bilingual translations. We solve this optimization problem using stochastic gradient descent (SGD).
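A minimal sketch of that optimization (illustrative shapes and hyperparameters; the rows of X_E and X_F are assumed to hold the target-side and source-side embeddings of the seed translation pairs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_seeds, d_E, d_F = 5000, 100, 100
X_E = rng.normal(size=(n_seeds, d_E))   # target-language vectors of the seed pairs (rows)
X_F = rng.normal(size=(n_seeds, d_F))   # corresponding source-language vectors

W = np.zeros((d_E, d_F))
eta, batch = 0.01, 64

# Mini-batch SGD on the squared Frobenius loss ||X_E W - X_F||_F^2
for step in range(2000):
    idx = rng.integers(n_seeds, size=batch)
    E_b, F_b = X_E[idx], X_F[idx]
    grad = 2 * E_b.T @ (E_b @ W - F_b)   # gradient of the loss on this batch
    W -= eta * grad / batch

mapped = X_E @ W                          # target vectors mapped into the source space
```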

We also consider a non-linear mapping (Socher et al., 2013) using a simple four-layer neural network, $W = (\phi^{(1)}, \phi^{(2)}, \phi^{(3)}, \phi^{(4)})$, that is trained to minimize:

$$\sum_{x_f \in X_F} \sum_{x_e \in X_E} ||x_f - \phi^{(4)} s(\phi^{(3)} s(\phi^{(2)} s(\phi^{(1)} x_e)))||^2$$

where $\phi^{(1)} \in \mathbb{R}^{h_1 \times d_E}$, $\phi^{(2)} \in \mathbb{R}^{h_2 \times h_1}$, $\phi^{(3)} \in \mathbb{R}^{h_3 \times h_2}$, $\phi^{(4)} \in \mathbb{R}^{d_F \times h_3}$, $h_n$ is the size of the hidden layer, and $s = \tanh$ is the chosen non-linear function.
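A forward pass of such a network, under assumed hidden-layer sizes (training the four weight matrices on the objective above by SGD is omitted here), might look like:

```python
import numpy as np

d_E, d_F = 100, 100
h1, h2, h3 = 200, 200, 200            # assumed hidden-layer sizes (not given in the paper)

rng = np.random.default_rng(0)
phi1 = rng.normal(scale=0.1, size=(h1, d_E))
phi2 = rng.normal(scale=0.1, size=(h2, h1))
phi3 = rng.normal(scale=0.1, size=(h3, h2))
phi4 = rng.normal(scale=0.1, size=(d_F, h3))

def map_to_source(x_e):
    """phi4 . tanh(phi3 . tanh(phi2 . tanh(phi1 . x_e))): maps a target-word vector into R^{d_F}."""
    s = np.tanh
    return phi4 @ s(phi3 @ s(phi2 @ s(phi1 @ x_e)))

x_e = rng.normal(size=d_E)            # a candidate English word vector
x_e_mapped = map_to_source(x_e)       # used as the WORD-AUX feature for that word
```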

Once the map $W$ is learned, all candidate target word vectors $x_e$ can be mapped into the source language embedding space $\mathbb{R}^{d_F}$ by computing $x_e^T W$. Instead of the raw monolingual word embeddings $x_e$, we use these bilingually-informed mapped word vectors $x_e^T W$ as the auxiliary word features WORD-AUX to estimate $x^{AUX}_{e,f}$.

Visual Representations Pilot Study Recent work (Vulic et al., 2016) has shown that combining word embeddings and visual representations of words can help achieve more accurate bilingual translations. Since the visual representation of a word seems to be language-independent (e.g., the concept of water has similar images whether expressed in English or French, Figure 4), the visual representations of a word may be useful for inferring its translation and for complementing the information learned in text.

Figure 4: Five images for the French word eau and its top 4 translations ranked using visual similarities of images associated with English words (Bergsma and Van Durme, 2011).

We performed a pilot study to include visual features as auxiliary features in our framework. We use a large multilingual corpus of labeled images (Callahan, 2017) to obtain the visual representation of the words in our source and target languages. The corpus contains 100 images for up to 10k words in each of 100 foreign languages, plus images of each of their translations into English. For each of the images, a convolutional neural network (CNN) feature vector is also provided, following the method of Kiela et al. (2015). For each word, we use 10 images provided by this corpus and use their CNN features as auxiliary visual features VISUAL-AUX to estimate $x^{AUX}_{e,f}$.

4 On average, there are 9,846 crowdsourced translations per language that we can use as seed translations.

5 We find that mapping from target to source vectors gives better performance across models in our experiments.

4.1.4 Combining Signals

During training, we trained the parameters of $x^{MF}_{e,f}$ and $x^{AUX}_{e,f}$ using a variety of signals:

• $x^{MF-W}_{e,f}$ is trained using WIKI translations as the set of observed translations $T$

• $x^{MF-W+C}_{e,f}$ is trained using WIKI+CROWD translations as the set of observed translations $T$

• $x^{AUX-WE}_{e,f}$ is trained using the set of word identities $T_{\text{identity}}$ and WORD-AUX as $\theta_f$

• $x^{AUX-VIS}_{e,f}$ is trained using the set of word identities $T_{\text{identity}}$ and VISUAL-AUX as $\theta_f$

During testing, we use the following back-off scheme to predict translation scores given a source word $f$ and a candidate target word $e$:

$$x_{e,f} = \begin{cases} x^{MF-W}_{e,f} & \text{if } f \text{ has} \geq 1 \text{ associated WIKI translation,} \\ x^{MF-W+C}_{e,f} & \text{else if } f \text{ has} \geq 2 \text{ associated WIKI+CROWD translations,} \\ x^{AUX}_{e,f} & \text{otherwise} \end{cases}$$

where $x^{AUX}_{e,f} = \alpha_{we}\, x^{AUX-WE}_{e,f} + \alpha_{vis}\, x^{AUX-VIS}_{e,f}$.
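In code, this back-off amounts to a simple per-source-word dispatch (a sketch; the count lookups and scoring functions are assumed to come from the trained components described above):

```python
def predict_score(e, f, n_wiki, n_wiki_crowd,
                  score_mf_w, score_mf_wc, score_we, score_vis,
                  alpha_we=0.5, alpha_vis=0.5):
    """Back-off scheme for x_{e,f}.
    n_wiki[f], n_wiki_crowd[f]: number of observed WIKI / WIKI+CROWD translations of f.
    score_* : scoring functions for the trained x^{MF-W}, x^{MF-W+C}, x^{AUX-WE}, x^{AUX-VIS}.
    The alpha weights are illustrative defaults."""
    if n_wiki.get(f, 0) >= 1:
        return score_mf_w(e, f)
    if n_wiki_crowd.get(f, 0) >= 2:
        return score_mf_wc(e, f)
    # Cold source word: weighted combination of word-embedding and visual auxiliary scores.
    return alpha_we * score_we(e, f) + alpha_vis * score_vis(e, f)
```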

4.2 Results

We conduct experiments using the following variants of our model, each of which progressively incorporates more signals to rank candidate English target words. When a variant uses more than one formulation of $x_{e,f}$, it applies them using the back-off scheme that we have described before.

• BPR W uses only $x^{MF-W}_{e,f}$

• BPR W+C uses $x^{MF-W}_{e,f}$ and $x^{MF-W+C}_{e,f}$

• BPR LN uses only $x^{AUX-WE}_{e,f}$ with linear mapping

• BPR NN uses only $x^{AUX-WE}_{e,f}$ with neural network (NN) mapping

• BPR WE uses $x^{MF-W}_{e,f}$, $x^{MF-W+C}_{e,f}$, and $x^{AUX-WE}_{e,f}$ with NN mapping

• BPR VIS adds $x^{AUX-VIS}_{e,f}$ to BPR WE

Table 1: Acc10 performance on VULIC1000

        Baseline (MNN)   BPR+MNN   BPR LN   BPR WE
IT-EN       78.8%          79.4%     81.3%    86.0%
ES-EN       81.8%          82.1%     83.4%    87.1%
NL-EN       80.8%          81.6%     83.2%    87.2%

We evaluate the performance of BPR WE against a baseline that is the state-of-the-art model of Vulic and Korhonen (2016) on benchmark VULIC1000 (Table 1). The baseline (MNN) learns a linear mapping between monolingual embedding spaces and finds translations in an unsupervised manner: it ranks candidate target words based on their cosine similarities to the source word in the mapped space. As seed translation pairs, MNN uses mutual nearest neighbor pairs (MNN) obtained from pseudo-bilingual corpora constructed from unannotated monolingual data of the source and target languages (Vulic and Moens, 2016). We train MNN and our models using the same 100-dimensional word2vec monolingual word embeddings.

As seen in Table 1, we see the benefit of learning translations in a supervised manner. BPR+MNN uses the same MNN seed translations as MNN, obtained from unannotated monolingual data of English and the foreign language, to learn the linear mapping between their embedding spaces. However, unlike MNN, BPR+MNN uses the mapped word vectors to predict the ranking in a supervised manner with the BPR objective. This results in higher accuracies than MNN. Using seed translations from crowdsourced dictionaries to learn the linear mapping (BPR LN) improves accuracies even further compared to using MNN seed translations obtained from unannotated data. Finally, BPR WE, which learns translations in a supervised manner and uses third language translations and a non-linear mapping (trained with crowdsourced translations not in the test set), performs consistently and very significantly better than the state-of-the-art on all benchmark test sets. This shows that incorporating more and better signals of translation can improve performance significantly.

Evaluating on CROWDTEST, we observe a similar trend over all 27 languages (Figure 5). In particular, we see that BPR W and BPR W+C suffer from the cold start issue, where there are too few or no observed translations in the matrix to make accurate predictions. Incorporating auxiliary information in the form of bilingually-informed word embeddings improves the accuracy of the predictions dramatically. For many languages, learning these bilingually-informed word embeddings with non-linear mapping improves accuracy even more. The top accuracy scores achieved by the model vary across languages and seem to be influenced by the amount of data, i.e., Wikipedia tokens and seed lexicon entries available for training. Somali (so), for example, has only 0.9 million tokens available in its Wikipedia for training the word2vec embeddings and only 3 thousand seed translations for learning the mapping between the word embedding spaces. In comparison, Spanish (es) has over 500 million tokens available in its Wikipedia and 11 thousand seed translations. We also believe that our choice of tokenization may not be suitable for some languages: we use a simple regular-expression based tokenizer for many languages that do not have a trained NLTK6 tokenization model. This may influence performance on languages such as Vietnamese (vi), on which we have low performance despite its large Wikipedia corpus.

6 http://www.nltk.org/


Figure 5: Acc10 on CROWDTEST across all 27 languages shows that adding more and better signals for translation improves translation accuracies. The top accuracies achieved by our model, BPR WE, vary across languages and appear to be influenced by the amount of data (Wikipedia tokens and seed translations) and the tokenization available for the language.

Table 2: Top-5 translations of the Indonesian word kesadaran (awareness) using different model variants

BPR W          BPR LN         BPR NN         BPR WE
kesadaran      kesadaran      kesadaran      kesadaran
consciousness  consciousness  consciousness  conscience
goddess        awareness      empathy        awareness
friendship     empathy        awareness      understanding
night          perception     perceptions    consciousness
nation         mindedness     perception     acquaintance

Some example translations of an Indonesian word produced by different variants of our model are shown in Table 2. Adding third language translation signals on top of the bilingually-informed auxiliary signals improves accuracies even further.7 The accuracies achieved by BPR WE on these languages are significantly better than previously reported accuracies (Irvine and Callison-Burch, 2017) on test sets constructed from the same crowdsourced dictionaries (Pavlick et al., 2014).8

The accuracies across languages appear to improve consistently with the amount of signal being input to the model. In the following experiments, we investigate how sensitive these improvements are to varying training size.

In Figure 6, we show accuracies obtained by BPR WE with varying sizes of seed translation lexicons used to train its mapping. The results show that a seed lexicon size of 5K is enough across languages to achieve optimum performance. This finding is consistent with the finding of Vulic and Korhonen (2016) that accuracies peak at about 5K seed translations across all their models and languages. For future work, it will be interesting to investigate further why this is the case: e.g., how the optimal seed size is related to the quality of the seed translations and the size of the test set, and how the optimum seed size should be chosen.

Figure 6: Acc10 across different seed lexicon sizes.

Lastly, we experiment with incorporating auxiliary visual signals for learning translations on the multilingual image corpus (Callahan, 2017). The corpus contains 100 images for up to 10K words in each of 100 foreign languages, plus images of each of their translations into English. We train and test our BPR VIS model to learn translations of 5 low- and high-resource languages in this corpus. We use the translations of up to 10K words in each of these languages as the test set and use up to 10 images (CNN features) of the words in this set as auxiliary visual signals to predict their translations. In this experiment, we weigh the auxiliary word embedding and visual features equally. To train the mapping of our word embedding features, we use as seeds the crowdsourced translations that are not in the test set.

Table 3: Acc10 performance on the multilingual image corpus test set (Callahan, 2017)

        Baseline (CNN-AvgMax)   BPR VIS   # Seeds
IT-EN          31.4%             55.8%       581
ES-EN          33.0%             58.3%       488
NL-EN          35.5%             69.2%      1857
FR-EN          37.1%             65.9%      1697
ID-EN          36.9%             45.3%       462

7 Actual improvement per language depends on the coverage of the Wikipedia interlanguage links for that language.

8 The comparison, however, cannot be made apples-to-apples since the way Irvine and Callison-Burch (2017) select test sets from the crowdsourced dictionaries may be different, and they do not release their test sets.

We compare the quality of our translations with the baseline CNN-AVGMAX (Bergsma and Van Durme, 2011), which considers cosine similarities between individual images from the source and target language words and takes the average of their maximum similarities as the final similarity between a source and a target word. For each source word, the candidate target words are ranked according to these final similarities. This baseline has been shown to be effective for inducing translations from images, both in uni-modal (Bergsma and Van Durme, 2011; Kiela et al., 2015) and multi-modal models (Vulic et al., 2016).
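The baseline can be summarized in a few lines (a sketch; each word is assumed to come with a small set of CNN feature vectors, 10 per word here):

```python
import numpy as np

def avg_max_similarity(src_images, tgt_images):
    """CNN-AvgMax: for each source image, take its max cosine similarity over the target
    word's images, then average these maxima (Bergsma and Van Durme, 2011)."""
    S = np.stack([v / np.linalg.norm(v) for v in src_images])   # |src| x d, L2-normalized
    T = np.stack([v / np.linalg.norm(v) for v in tgt_images])   # |tgt| x d, L2-normalized
    cos = S @ T.T                                               # pairwise cosine similarities
    return cos.max(axis=1).mean()

# Ranking candidate English words for one source word (toy CNN features):
rng = np.random.default_rng(0)
src_imgs = rng.normal(size=(10, 4096))                          # e.g., 4096-d CNN features
candidates = {"water": rng.normal(size=(10, 4096)),
              "fire": rng.normal(size=(10, 4096))}
ranked = sorted(candidates, key=lambda e: -avg_max_similarity(src_imgs, candidates[e]))
```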

As seen in Table 3, incorporating additional bilingual and textual signals alongside the visual signals improves translations. Accuracies on these image corpus test sets are lower overall as they contain many translations from our crowdsourced dictionaries; thus we have far fewer seeds to train our word embedding mapping. Furthermore, these test sets contain 10 times as many translations as our previous test sets. Using more images instead of just 10 per word may also improve performance.

5 Conclusion

In this paper, we propose a novel framework for combining diverse, sparse, and potentially noisy multi-modal signals for translation. We view the problem of learning translations as a matrix completion task and use an effective and extendable matrix factorization approach with BPR to learn translations.

We show the effectiveness of our approach in large scale experiments. Starting from minimally-trained monolingual word embeddings, we consistently and very significantly outperform state-of-the-art approaches by combining these features with other features in a supervised manner using BPR. Since our framework is modular, each input to our prediction can be improved separately to improve the whole system, e.g., by learning better word embeddings or a better mapping function to input into the auxiliary component. Our framework is also easily extendable to incorporate more bilingual and auxiliary signals of translation equivalence.

Acknowledgments

This material is based in part on research sponsored by DARPA under grant number HR0011-15-C-0115 (the LORELEI program). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA and the U.S. Government.

This work was also supported by the French National Research Agency under project ANR-16-CE33-0013, and by Amazon through the Amazon Academic Research Awards (AARA) program.

References

Shane Bergsma and Benjamin Van Durme. 2011. Learning Bilingual Lexicons Using the Visual Similarity of Labeled Web Images. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, pages 1764–1769, Barcelona, Spain.

Brendan Callahan. 2017. Image-based bilingual lexicon induction for low resource languages. Master's thesis, University of Pennsylvania.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 17–24, New York City.

Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C Raykar, and Amrita Saha. 2014. An Autoencoder Approach to Learning Bilingual Word Representations. In Advances in Neural Information Processing Systems 27, pages 1853–1861.

Hal Daumé and Jagadeesh Jagarlamudi. 2011. Domain Adaptation for Machine Translation by Mining Unseen Words. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 407–412, Portland, Oregon.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. In Proceedings of ICLR Workshop, San Diego, California.

Miao Fan, Deli Zhao, Qiang Zhou, Zhiyuan Liu, Thomas Fang Zheng, and Edward Y. Chang. 2014. Distant Supervision for Relation Extraction with Matrix Completion. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 839–849, Baltimore, Maryland.

Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden.

Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, and Lars Schmidt-Thieme. 2010. Learning attribute-to-feature mappings for cold-start recommendations. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 176–185. IEEE.

Cyril Goutte, Kenji Yamada, and Eric Gaussier. 2004. Aligning words using matrix factorisation. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 502. Association for Computational Linguistics.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pages 748–756, Lille, France.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the Unknown Words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany.

Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI Conference on Artificial Intelligence, pages 144–150. AAAI Press.

Ann Irvine and Chris Callison-Burch. 2013. Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013), pages 518–523, Atlanta, Georgia.

Ann Irvine and Chris Callison-Burch. 2016. End-to-end statistical machine translation with zero or small parallel texts. Natural Language Engineering, 22(04):517–548.

Ann Irvine and Chris Callison-Burch. 2017. A Comprehensive Analysis of Bilingual Lexicon Induction. Computational Linguistics, 43(2):273–310.

Douwe Kiela, Ivan Vulic, and Stephen Clark. 2015. Visual Bilingual Lexicon Induction with Transferred ConvNet Features. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 148–158, Lisbon, Portugal.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012a. Toward Statistical Machine Translation without Parallel Corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 130–140, Avignon, France.

Alexandre Klementiev and Dan Roth. 2006. Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 817–824, Sydney, Australia.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012b. Inducing Crosslingual Distributed Representations of Words. In Proceedings of COLING 2012, pages 1459–1474, Mumbai, India.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54.

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer, 42(8):30–37.

Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Deep Multilingual Correlation for Improved Word Embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 250–256, Denver, Colorado.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual Word Representations with Monolingual Quality in Mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159, Denver, Colorado.

Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, and Chris Callison-Burch. 2014. The language demographics of Amazon Mechanical Turk. Transactions of the Association for Computational Linguistics, 2:79–92.

Reinhard Rapp. 1995. Identifying Word Translations in Non-Parallel Texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 320–322, Cambridge, Massachusetts.

Radim Rehurek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation Extraction with Matrix Factorization and Universal Schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84, Atlanta, Georgia.

Tim Rocktäschel, Sameer Singh, and Sebastian Riedel. 2015. Injecting Logical Background Knowledge into Embeddings for Relation Extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1119–1129, Denver, Colorado.

Charles Schafer and David Yarowsky. 2002. Inducing Translation Lexicons via Diverse Similarity Measures and Bridge Languages. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1–7, Taipei, Taiwan.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Tianze Shi, Zhiyuan Liu, Yang Liu, and Maosong Sun. 2015. Learning Cross-lingual Word Embeddings via Matrix Co-factorization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 567–572, Beijing, China.

Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-Shot Learning Through Cross-Modal Transfer. In Advances in Neural Information Processing Systems 26, pages 935–943.

Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1661–1670, Berlin, Germany.

Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2016. Multilingual relation extraction using compositional universal schema. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 886–896, San Diego, California.

Ivan Vulic, Douwe Kiela, Stephen Clark, and Marie-Francine Moens. 2016. Multi-Modal Representations for Improved Bilingual Lexicon Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 188–194, Berlin, Germany.

Ivan Vulic and Anna Korhonen. 2016. On the Role of Seed Lexicons in Learning Bilingual Word Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 247–257, Berlin, Germany.

Ivan Vulic and Marie-Francine Moens. 2013. A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1613–1624, Seattle, Washington.

Ivan Vulic and Marie-Francine Moens. 2015. Bilingual Word Embeddings from Non-Parallel Document-Aligned Data Applied to Bilingual Lexicon Induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 719–725, Beijing, China.

Ivan Vulic and Marie-Francine Moens. 2016. Bilingual Distributed Word Representations from Document-Aligned Comparable Data. Journal of Artificial Intelligence Research, 55:953–994.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.

Guangyou Zhou, Fang Liu, Yang Liu, Shizhu He, and Jun Zhao. 2013. Statistical machine translation improves question retrieval in community question answering via matrix factorization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 852–861, Sofia, Bulgaria.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398, Seattle, Washington.