Retrofitting Word Vectors to Semantic Lexicons
Retrofitting Word Vectors to Semantic Lexicons
Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, Noah A. Smith
NAACL 2015
Presenter: Sho Takase (高瀬翔), Knowledge Acquisition Study Group (知識獲得研究会), 2015/4/21
1
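Vector representations of words
• Acquiring word meanings (vector representations) from a corpus is an important technique in NLP
– Methods: word-context co-occurrence matrices, dimensionality reduction of co-occurrence matrices, neural language models, etc.
– Words with similar properties have similar vectors
• Also useful as features for downstream tasks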
2
                 masterpiece  author  track and field  literary prize  speed  muscle  ranking  write
Franz Kafka               80      60                0              30      0       0        0     40
Kenzaburo Oe              70      60                0              50      0       0        0     60
Usain Bolt                 0       0              100               0     30      40       80      0
Carl Lewis                 0       0               90               0     40      50       70      0
Context vectors of words
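The claim that similar words get similar vectors can be checked directly on the table above. The following is a minimal sketch, assuming NumPy and cosine similarity as the similarity measure (the slide does not name one); the helper `cosine` and the variable names are illustrative only.

```python
import numpy as np

# Co-occurrence counts copied from the table above; column order:
# masterpiece, author, track and field, literary prize, speed, muscle, ranking, write
vectors = {
    "Franz Kafka":  np.array([80, 60, 0, 30, 0, 0, 0, 40], dtype=float),
    "Kenzaburo Oe": np.array([70, 60, 0, 50, 0, 0, 0, 60], dtype=float),
    "Usain Bolt":   np.array([0, 0, 100, 0, 30, 40, 80, 0], dtype=float),
    "Carl Lewis":   np.array([0, 0, 90, 0, 40, 50, 70, 0], dtype=float),
}

def cosine(u, v):
    """Cosine similarity between two context vectors (assumed measure)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_writers = cosine(vectors["Franz Kafka"], vectors["Kenzaburo Oe"])
sim_athletes = cosine(vectors["Usain Bolt"], vectors["Carl Lewis"])
sim_mixed = cosine(vectors["Franz Kafka"], vectors["Usain Bolt"])

print(f"writers:  {sim_writers:.3f}")
print(f"athletes: {sim_athletes:.3f}")
print(f"mixed:    {sim_mixed:.3f}")
```

The two writers and the two athletes score close to 1, while a writer and an athlete share no context words in this table and score 0.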
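Injecting external knowledge into vector representations, and problems with prior work
• Using external knowledge improves the quality of vector representations [Yu+ 14, Chang+ 13]
– External knowledge: WordNet, FrameNet, etc.
• Problem: the ways of constructing the vectors are restricted
– Use of the corpus and use of external knowledge are fused into a single procedure
– External knowledge cannot be added to already-given vectors as a refinement (so new training methods cannot be accommodated)
• Example [Yu+ 14]: the objective function contains an external-knowledge term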
3
Context / External knowledge
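Contributions of this work
• Proposes a method that injects external knowledge into vector representations as a post-processing step
– Can be combined with any method of constructing vectors
– The proposed method is fast
• Runs in about 5 seconds on 300-dimensional vectors for 100,000 words
• Demonstrates its usefulness through a range of experiments
– Comparisons across training methods, external knowledge sources, vector dimensions, languages, etc.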
4
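Proposed method
• Two goals
– Keep each vector similar to the vector obtained from the corpus (the input)
– Make words that are related in the external knowledge have similar vectors
• Related: synonyms, hypernyms/hyponyms, paraphrases
• Objective function
– Minimize the Euclidean distance between vectors that should be similar
• First term: corpus information (stay close to the input vector)
• Second term: external knowledge (move close to words related in the external knowledge)
– E: the set of edges drawn between words that are related in the external knowledge
– α, β: hyperparameters (α = 1, β = 1 / degree of the word's node)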
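The two terms of the objective can be written out as a short function. This is a minimal sketch, assuming NumPy, vectors stored as rows of an array, and the slide's setting α = 1, β = 1 / degree; the function name `retrofit_objective` and the `neighbors` dictionary (word index to list of related word indices) are hypothetical and not taken from the authors' tool.

```python
import numpy as np

def retrofit_objective(Q, Q_hat, neighbors, alpha=1.0):
    """Sum over words of alpha * ||q_i - q_hat_i||^2
    plus beta * ||q_i - q_j||^2 for every lexicon neighbor j of i.

    Q, Q_hat : (n, d) arrays of current and input vectors.
    neighbors: dict mapping word index i to a list of related indices j
               (assumed representation of the edge set E).
    """
    total = 0.0
    for i in range(Q.shape[0]):
        # First term: stay close to the input (corpus) vector.
        total += alpha * np.sum((Q[i] - Q_hat[i]) ** 2)
        nbrs = neighbors.get(i, [])
        if nbrs:
            beta = 1.0 / len(nbrs)  # beta = 1 / degree, as on the slide
            # Second term: stay close to words related in the lexicon.
            for j in nbrs:
                total += beta * np.sum((Q[i] - Q[j]) ** 2)
    return float(total)

# Two related words whose input vectors are 2 apart.
Q_hat = np.array([[0.0, 0.0], [2.0, 0.0]])
neighbors = {0: [1], 1: [0]}
at_input = retrofit_objective(Q_hat, Q_hat, neighbors)
at_midpoint = retrofit_objective(np.array([[1.0, 0.0], [1.0, 0.0]]), Q_hat, neighbors)
```

In this toy case the objective is 8.0 when the vectors are left at their input values and 2.0 when both are moved to the midpoint, which illustrates the trade-off between the two terms.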
5
Figure 1: Word graph with edges between related words
showing the observed (grey) and the inferred (white)
word vector representations.
Experimentally, we show that our method works well with different state-of-the-art word vector models, using different kinds of semantic lexicons and gives substantial improvements on a variety of benchmarks, while beating the current state-of-the-art approaches for incorporating semantic information in vector training and trivially extends to multiple languages. We show that retrofitting gives consistent improvement in performance on evaluation benchmarks with different word vector lengths and show a qualitative visualization of the effect of retrofitting on word vector quality. The retrofitting tool is available at: https://github.com/mfaruqui/retrofitting.
2 Retrofitting with Semantic Lexicons
Let $V = \{w_1, \ldots, w_n\}$ be a vocabulary, i.e., the set of word types, and $\Omega$ be an ontology that encodes semantic relations between words in $V$. We represent $\Omega$ as an undirected graph $(V, E)$ with one vertex for each word type and edges $(w_i, w_j) \in E \subseteq V \times V$ indicating a semantic relationship of interest. These relations differ for different semantic lexicons and are described later (§4).
The matrix $\hat{Q}$ will be the collection of vector representations $\hat{q}_i \in \mathbb{R}^d$, for each $w_i \in V$, learned using a standard data-driven technique, where $d$ is the length of the word vectors. Our objective is to learn the matrix $Q = (q_1, \ldots, q_n)$ such that the columns are both close (under a distance metric) to their counterparts in $\hat{Q}$ and to adjacent vertices in $\Omega$. Figure 1 shows a small word graph with such edge connections; white nodes are labeled with the $Q$ vectors to be retrofitted (and correspond to $V_\Omega$); shaded nodes are labeled with the corresponding vectors in $\hat{Q}$, which are observed. The graph can be interpreted as a Markov random field (Kindermann and Snell, 1980).
The distance between a pair of vectors is defined to be the Euclidean distance. Since we want the inferred word vector to be close to the observed value $\hat{q}_i$ and close to its neighbors $q_j$, $\forall j$ such that $(i, j) \in E$, the objective to be minimized becomes:
$$\Psi(Q) = \sum_{i=1}^{n} \left[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \right]$$
where $\alpha$ and $\beta$ values control the relative strengths of associations (more details in §6.1).
In this case, we first train the word vectors independent of the information in the semantic lexicons and then retrofit them. $\Psi$ is convex in $Q$ and its solution can be found by solving a system of linear equations. To do so, we use an efficient iterative updating method (Bengio et al., 2006; Subramanya et al., 2010; Das and Petrov, 2011; Das and Smith, 2011). The vectors in $Q$ are initialized to be equal to the vectors in $\hat{Q}$. We take the first derivative of $\Psi$ with respect to one $q_i$ vector, and by equating it to zero arrive at the following online update:
$$q_i = \frac{\sum_{j:(i,j) \in E} \beta_{ij} q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j) \in E} \beta_{ij} + \alpha_i} \tag{1}$$
In practice, running this procedure for 10 iterations converges to changes in Euclidean distance of adjacent vertices of less than $10^{-2}$. The retrofitting approach described above is modular; it can be applied to word vector representations obtained from any model as the updates in Eq. 1 are agnostic to the original vector training model objective.
Semantic Lexicons during Learning. Our proposed approach is reminiscent of recent work on improving word vectors using lexical resources (Yu and Dredze, 2014; Bian et al., 2014; Xu et al., 2014) which alters the learning objective of the original vector training model with a prior (or a regularizer) that encourages semantically related vectors (in $\Omega$) to be close together, except that our technique is applied as a second stage of learning. We describe the
Vector obtained from the corpus (input)
Vector after refinement
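How to solve it
• Find the solution by iterative updates
– For each q_i, repeatedly update it to the value that minimizes the objective function
– q_i is initialized with the input vector
• Empirically, after 10 iterations the change in Euclidean distance between vectors that should be close falls below 0.01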
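The iterative procedure above can be sketched in a few lines. This is a minimal sketch, assuming NumPy, vectors stored as rows, α = 1 and β = 1 / degree, and a hypothetical `neighbors` dictionary from word index to related word indices; it illustrates the update of Eq. (1) and is not the authors' released tool.

```python
import numpy as np

def retrofit(Q_hat, neighbors, n_iters=10, alpha=1.0):
    """Iteratively move each vector toward its lexicon neighbors
    while keeping it close to its input vector.

    Q_hat    : (n, d) array of input vectors (left unmodified).
    neighbors: dict mapping word index i to a list of related indices j.
    """
    Q_hat = np.asarray(Q_hat, dtype=float)
    Q = Q_hat.copy()                      # initialize with the input vectors
    for _ in range(n_iters):
        for i, nbrs in neighbors.items():
            if not nbrs:
                continue                  # words without edges keep their input vector
            beta = 1.0 / len(nbrs)        # beta = 1 / degree
            numerator = beta * Q[nbrs].sum(axis=0) + alpha * Q_hat[i]
            denominator = beta * len(nbrs) + alpha
            Q[i] = numerator / denominator  # online update, uses latest neighbor values
    return Q

# Words 0 and 1 are related in the lexicon; word 2 has no edges.
Q_hat = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
Q = retrofit(Q_hat, {0: [1], 1: [0]})
```

In this toy example the two related vectors are pulled toward each other (to roughly 0.67 and 1.33 on the first axis), while the unconnected word keeps its input vector.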
6
Update rule: Eq. (1)
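Experiments
• Take a variety of publicly available vectors as input
– GloVe [Pennington+ 14]: models co-occurrence information with vectors
– SG [Mikolov+ 13]: trained to predict surrounding words
– GC [Huang+ 12]: trained by combining local and document-level context
– Multi [Faruqui+ 14]: applies CCA to word vectors across different languages
• Use a variety of external knowledge sources
– PPDB: a database of paraphrases, collected as words that translate to the same word
– WordNet: a hand-built dictionary (synonyms only (syn), or synonyms plus hypernyms/hyponyms (all))
– FrameNet: a frame dictionary; edges are drawn between words that share a frame
• Verify performance gains on a variety of tasks
– Word similarity tasks
– TOEFL: choose, from the options, the word with the same meaning as a given word
– Syntactic word analogy task
– Sentiment analysis: build a classifier using the average of the vectors of the words in a sentence as features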
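For lexicons such as WordNet synonym sets or FrameNet frames, the edge set can be derived from groups of related words. The following is a minimal sketch, assuming each lexicon entry is given as a plain list of words; the function name `edges_from_groups` and the example words are made up for illustration and are not taken from any of the lexicons above.

```python
from collections import defaultdict
from itertools import combinations

def edges_from_groups(groups):
    """Connect every pair of words that appear in the same group
    (e.g. the same synonym set or the same frame).

    Returns a dict mapping each word to a sorted list of its neighbors.
    """
    neighbors = defaultdict(set)
    for words in groups:
        # The graph is undirected, so each edge is stored in both directions.
        for a, b in combinations(sorted(set(words)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {word: sorted(related) for word, related in neighbors.items()}

# Hypothetical groups, for illustration only.
groups = [["buy", "purchase", "acquire"], ["buy", "sell"]]
graph = edges_from_groups(groups)
print(graph["buy"])
```

A word that appears in several groups accumulates neighbors from all of them, so its degree (and hence β = 1 / degree) reflects every relation it takes part in.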
7
Results (gain on each task)
8
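Gains are seen on every task except SYN-REL (syntactic word analogy) → retrofitting adds semantic information to the word vectors and improves their quality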
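Measuring the effect of post-processing
• External-knowledge information can also be incorporated during training
– Two ways of incorporating it are tried
• Consider a log-bilinear model
– Introduce it as a regularization term during training
• Lazy updates every 100,000 words (lazy)
– Update the vectors with the proposed method every k examples seen by stochastic gradient descent (periodic)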
9
Results
• lazy also yields gains
• periodic improves performance by a much larger margin than lazy
• retrofitting (the proposed method) is competitive with periodic, and sometimes outperforms it
10
Comparison with prior work
• Compared with [Yu+ 14], performance improves on all tasks
• Compared with [Xu+ 14], performance also improves on almost all tasks
11
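Summary
• Proposes a method for incorporating external knowledge into word vector representations
– Because it is applied as post-processing, it works with any vectors
• Experiments verify the performance gains from the proposed method
– It outperforms existing methods that use external knowledge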
12