Identifying Users’ Topical Tasks in Web Search

W. Hua, Y. Song, H. Wang, Z. Zhou, WSDM 2013

2013/06/30 SEXI2013 reading group

@harapon

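What this paper is about

A search task consists of a query and its reformulations

Task identification is important for search engines

• Providing valuable information

• Predicting the user's search intent

• Suggesting queries to the user

Previous approaches rely on

• temporal features of queries

• lexical features (word-level features)

Many query reformulations are topical, so a reformulation may not share any words with the original query (Problem 1)

Suggesting “cheap US flight” for “flight to LA” → easy

Suggesting “hotel in LA” for “flight to LA” → hard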

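What this paper is about (continued)

Furthermore, even within a single search session several tasks may be interleaved, so tasks cannot be separated simply by ordering the queries by timestamp (Problem 2)

To address these problems, the paper

• builds a similarity measure that compares two queries at the semantic level, which addresses Problem 1, and

• proposes the Sequential Cut and Merge (SCM) algorithm for queries that are far apart in time, which addresses Problem 2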

3. ProBase

http://research.microsoft.com/en-us/projects/probase/

There is a gap between the bag-of-words representation and human understanding

Semantic network

3.1 Knowledge Representation

• entity (“Barack Obama”)

• concept (“President of America”)

• attribute (e.g., “age”, “color”)

• isA relationship (“Barack Obama” isA “President of America”)

• isAttributeOf relationship (“population” isAttributeOf “country”)

Probase edges are weighted with probabilistic information

• P(instance | concept)

• e.g., P(“poodle” | “dogs”) > P(“pugs” | “dogs”)

• P(concept | instance), P(concept | attribute), and P(attribute | concept) can also be defined

Built from data extracted from billions of web pages
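As a rough illustration of how edge weights such as P(instance | concept) could be obtained, the sketch below normalizes isA co-occurrence counts. The counts are invented for the example, and estimating the probabilities by simple count normalization is an assumption made here, not something the slides state.

```python
from collections import defaultdict

# Hypothetical isA co-occurrence counts: (instance, concept) -> count.
# The numbers are invented purely for illustration.
ISA_COUNTS = {
    ("poodle", "dogs"): 80,
    ("pug", "dogs"): 20,
    ("apple", "fruit"): 60,
    ("apple", "company"): 40,
}

def p_instance_given_concept(counts):
    """Normalize counts per concept to estimate P(instance | concept)."""
    totals = defaultdict(float)
    for (_, concept), n in counts.items():
        totals[concept] += n
    return {(inst, con): n / totals[con] for (inst, con), n in counts.items()}

def p_concept_given_instance(counts):
    """Normalize counts per instance to estimate P(concept | instance)."""
    totals = defaultdict(float)
    for (inst, _), n in counts.items():
        totals[inst] += n
    return {(con, inst): n / totals[inst] for (inst, con), n in counts.items()}
```

With these toy counts, "poodle" is a more typical instance of "dogs" than "pug", which mirrors the inequality on the slide.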

4. Methodology

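Task identification boils down to two problems

• How to quantify the similarity/relatedness between two queries?

• How to efficiently cluster similar queries within a single query session?

Definition of a task

Given a session as a sequence of queries, a task is a subset of those queries defined in terms of

• the timestamp of each query

• the similarity between two queries

• a similarity threshold

The queries in a task may not be consecutive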

4.1 Similarity Calculation

Four kinds of features are built for query similarity

• conceptual features

• lexical features

• template features

• temporal features

The conceptual features are the main ones

• Motivated by resolving query ambiguity

• e.g., does “apple” mean the fruit or the company (Apple)?

4.1.1 Conceptual Features

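Turn the concepts behind a query into features

Step 1: Parsing

• Split the query into words

• Use the longest word sequence that maps to an instance/attribute in Probase

• If two candidates have the same length, prefer the one linked to more concepts

• For the query “truck driving school pay after training”, the longest instances found in Probase are “truck driving”, “school”, “pay”, and “training”; “driving” alone is rejected

• For the query “tiger woods”, the result is “tiger woods”, not “tiger” and “woods”

• This yields a query representation that is easier to interpret than BoW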
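The parsing rules above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `VOCAB` is a hypothetical stand-in for the Probase instance/attribute vocabulary, the concept counts attached to each term are invented, and `max_len` is an assumed cap on term length.

```python
# Hypothetical vocabulary standing in for Probase instances/attributes,
# mapped to the number of concepts each term is linked to (invented values).
VOCAB = {
    "truck driving": 12, "driving": 30, "school": 50, "pay": 20,
    "training": 25, "tiger woods": 8, "tiger": 15, "woods": 10,
}

def parse_query(query, vocab, max_len=4):
    """Select non-overlapping known terms, longest first."""
    words = query.lower().split()
    # Collect every word n-gram that is a known instance/attribute.
    candidates = []
    for start in range(len(words)):
        for n in range(1, min(max_len, len(words) - start) + 1):
            term = " ".join(words[start:start + n])
            if term in vocab:
                candidates.append((start, start + n, term))
    # Prefer longer terms; break ties by the number of linked concepts.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), -vocab[c[2]]))
    used = [False] * len(words)
    chosen = []
    for start, end, term in candidates:
        if not any(used[start:end]):
            used[start:end] = [True] * (end - start)
            chosen.append((start, term))
    # Return the selected terms in their original query order.
    return [term for _, term in sorted(chosen)]
```

Words that match nothing in the vocabulary (such as "after" in the first example query) are simply dropped in this sketch.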

Step 2: Conceptualization

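• Map a query to a set of instances/attributes

• Infer the concepts that best represent these instances/attributes

• First, identify candidate concepts: each instance/attribute t_l is mapped to a concept vector c_l = (P(c_1 | t_l), …, P(c_M | t_l))

• M is the total number of concepts in Probase

• Only the top K concepts are kept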

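Clustering

• Next, to find the multiple topics within a query, group the instances/attributes so that each cluster corresponds to one topic

• e.g., “alabama home insurance” gives “alabama” (“state”) and “home insurance” (“insurance” and “benefits”)

• Build a weighted graph

• Each edge weight is the cosine similarity between the concept vectors of its two nodes

• Remove edges whose cosine similarity is below a threshold; the remaining subgraphs are the instance/attribute clusters

• Cluster r is a subgraph with a node set (the nodes matching T) and an edge set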
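The clustering step can be sketched as a thresholded similarity graph whose connected components are the clusters. The concept vectors and the threshold value below are invented for illustration, and using union-find to extract the components is an implementation choice made here, not something taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_terms(concept_vectors, threshold=0.3):
    """Group terms into connected components of the thresholded graph."""
    terms = list(concept_vectors)
    parent = {t: t for t in terms}  # union-find forest

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    # Keep an edge only if its cosine similarity reaches the threshold.
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(concept_vectors[a], concept_vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())

# Invented concept vectors P(concept | term) for "alabama home insurance".
VECTORS = {
    "alabama": {"state": 0.9, "place": 0.1},
    "home insurance": {"insurance": 0.7, "benefits": 0.3},
}
```

Because the two toy vectors share no concept, their cosine similarity is zero and the terms end up in separate clusters, one per topic.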

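Resolving query ambiguity

• The instances/attributes in each cluster r are conceptualized into a concept vector c_r

• A naive Bayes function computes the concepts shared by the concept vectors of the instances/attributes in cluster r

• The instances/attributes are assumed to be mutually independent

• The shared concepts in cluster r are then ranked

P(c_k | t_l^r): the k-th value of the concept vector of instance/attribute t_l^r

P(c_k): the popularity of concept c_k in Probase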
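The slide's exact formula is not recoverable from this transcript, so the sketch below uses one standard naive Bayes form as an assumption: under independence, the score of a concept is proportional to the product of its per-term probabilities divided by its prior raised to the power (number of terms minus one). All probabilities in the example are invented.

```python
def combine_concepts(term_vectors, prior, top_k=3):
    """Naive Bayes combination of per-term concept vectors in one cluster.

    score(c) is proportional to prod_l P(c | t_l) / P(c) ** (L - 1),
    computed only for concepts shared by every term in the cluster.
    """
    if not term_vectors:
        return []
    num_terms = len(term_vectors)
    shared = set(term_vectors[0])
    for vec in term_vectors[1:]:
        shared &= set(vec)
    scores = {}
    for c in shared:
        prod = 1.0
        for vec in term_vectors:
            prod *= vec[c]
        scores[c] = prod / (prior[c] ** (num_terms - 1))
    total = sum(scores.values())
    if not total:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Normalize over all shared concepts, then keep the top_k.
    return [(c, s / total) for c, s in ranked[:top_k]]

# Invented values: "apple" is ambiguous, "microsoft" is not.
APPLE = {"fruit": 0.6, "company": 0.3, "brand": 0.1}
MICROSOFT = {"company": 0.8, "brand": 0.2}
PRIOR = {"fruit": 0.01, "company": 0.05, "brand": 0.02}
```

Calling `combine_concepts([APPLE, MICROSOFT], PRIOR)` ranks "company" first and drops "fruit" entirely, which is the disambiguation effect described above: a concept survives only if every term in the cluster supports it.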

Example of query disambiguation

Conceptualizing the whole query

• The query is conceptualized from the disambiguated concept vectors

Step 3: Calculating conceptual similarity

• Compute the cosine similarity between queries from each query's conceptualization result (concept vector c_q)

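Overall flow (figure): query → word sequence → instance/attribute set T = (t_1, t_2, …, t_L) → one concept vector per term (t_1 → c_1, t_2 → c_2, …, t_L → c_L) → clustering of the terms → one concept vector per cluster (T_1 → c_1, …, T_r → c_r) → concept vector of query q_i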

4.1.2 Lexical Features

Two ways of measuring BoW similarity between queries

• N-word Jaccard

• e.g., “the car james bond drive” with 2-word units gives [“the car”, “car james”, “james bond”, “bond drive”]

• N-char Jaccard

• Defined in the same way at the character level

v_i: the N-word set of query q_i

v_i^k: the term frequency of the k-th N-word in set v_i

m, n: the sizes of sets v_i and v_j

k_i, k_j: the indexes of a common N-word in sets v_i and v_j

v_i^{k_i}, v_j^{k_j}: the term frequencies of that common N-word in sets v_i and v_j
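A minimal sketch of the two lexical features, assuming plain (unweighted) set Jaccard. The symbol definitions above suggest that the original formula also weights by term frequency; that weighting is not reproduced here because the formula itself is missing from the transcript. The function names are illustrative.

```python
def word_ngrams(query, n=2):
    """Contiguous N-word units of a query."""
    words = query.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(query, n=3):
    """Contiguous N-character units of a query."""
    text = query.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def jaccard(a, b):
    """Unweighted Jaccard coefficient between two collections of units."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

For example, `word_ngrams("the car james bond drive")` reproduces the four 2-word units listed on the slide, and `jaccard(word_ngrams(q1, 1), word_ngrams(q2, 1))` gives a 1-word similarity between two queries.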

4.1.3 Template Features

Follows the method of Huang et al. (2009)

Reformulation types: substring/superstring, add/remove words, stemming, spelling correction, acronym and abbreviation, etc.

In short, an edit distance that captures typos and derived word forms

Levenshtein edit distance

ed(q_i, q_j): the Levenshtein edit distance between queries q_i and q_j

len(q_i): the length of query q_i
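A minimal sketch of an edit-distance-based template feature. The Levenshtein routine is the standard dynamic program; the normalization 1 - ed / max(len) in `template_similarity` is an assumption, since the slide's formula is not present in the transcript and only its symbols ed and len are known.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def template_similarity(qi, qj):
    """Assumed normalization: 1 - ed(qi, qj) / max(len(qi), len(qj))."""
    longest = max(len(qi), len(qj))
    return 1.0 - levenshtein(qi, qj) / longest if longest else 1.0
```

A misspelled query and its correction differ by only a few edits, so they score close to 1 under this normalization, while unrelated queries score close to 0.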

4.1.4 Temporal Features

Time interval between consecutive queries

The closer two queries are in time, the more likely they belong to the same task

t(q_i): the time query q_i is issued

d(q_i): the dwell time of query q_i (the sum of the dwell times of the clicks after q_i)

4.2 Task Identification

Sequential Cut and Merge (SCM)

Drawbacks of existing approaches (figure):

• Cannot handle interleaved tasks

• Computationally expensive

• Over-merging

4.2 Task Identification

Sequential Cut and Merge (SCM)

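• First, apply SC; the resulting tasks are called sub-tasks

• Merge the BoW of the queries in each sub-task into a new query that represents the sub-task

• Apply GC to the set of sub-tasks, cutting edges at or below the threshold

• Selling points of SCM

• Addresses the weakness of SC (interleaved tasks)

• Takes less computation time than GC (because GC is run at the higher, sub-task level)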
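The three SCM steps above can be sketched as follows. Several things here are assumptions rather than the paper's method: SC is taken to mean cutting the session wherever two consecutive queries are dissimilar, GC is taken to mean dropping low-similarity edges and returning connected components, `bow_similarity` (word-level Jaccard) stands in for the learned query similarity, and the threshold value is arbitrary.

```python
def bow_similarity(a, b):
    """Word-level Jaccard; a stand-in for the learned query similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def sequential_cut(queries, sim, threshold):
    """Cut the session wherever two consecutive queries are dissimilar."""
    subtasks = [[queries[0]]] if queries else []
    for prev, curr in zip(queries, queries[1:]):
        if sim(prev, curr) > threshold:
            subtasks[-1].append(curr)
        else:
            subtasks.append([curr])
    return subtasks

def graph_cut(items, sim, threshold):
    """Drop edges at or below the threshold; return connected components."""
    n = len(items)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and sim(items[i], items[j]) > threshold:
                    seen.add(j)
                    stack.append(j)
        components.append(sorted(comp))
    return components

def scm(queries, sim=bow_similarity, threshold=0.2):
    """Sequential cut, merge each sub-task's words, then graph cut."""
    subtasks = sequential_cut(queries, sim, threshold)
    merged = [" ".join(sub) for sub in subtasks]  # one merged query per sub-task
    groups = graph_cut(merged, sim, threshold)
    return [[q for idx in group for q in subtasks[idx]] for group in groups]

SESSION = ["flight to LA", "cheap flight LA", "weather tokyo", "flight LA deals"]
```

In `SESSION`, the flight task is interrupted by an unrelated weather query. The sequential cut alone produces three sub-tasks, and the graph cut over the merged sub-tasks rejoins the two flight-related ones. The pairwise comparison runs over sub-tasks rather than individual queries, which is where the time saving relative to a graph cut over all queries comes from.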

5. Evaluation and Results

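Sessions were extracted from one day of logs from a commercial browser in May 2012

For simplicity, the data was filtered to English-language sessions from US users with at least 10 queries per session

This yielded 45,813 sessions, of which 600 samples were labeled by hand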

Effectiveness of Classifiers and Features

Error Rate

Accuracy of Algorithms

Computational Time

• Fast

GC improves over SC by 12.03% in F-measure and 46.21% in Jaccard; SCM improves over GC by 1.49% in F-measure and 12.61% in Jaccard

6. Conclusions

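Built conceptual features using Probase

Combining them with existing features improves classifier accuracy

For task identification, the SCM algorithm, which combines existing algorithms, improves both computation time and identification accuracy

Future work

• Resolve the problem that “celtics members” and “kevin garnette” end up being assigned to the same task

• The former refers to an NBA team, the latter to an NBA player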
