-
DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval
Zhiwen Tang, Grace Hui Yang
InfoSense, Department of Computer Science
Georgetown University
AAAI 2019 @ Honolulu, Hawaii
-
Neural Information Retrieval (Neu-IR)
• Ad-hoc Information Retrieval: to satisfy a user’s information need by retrieving and ranking documents from a document collection. E.g. Google, Lucene
[Figure: a query matched to documents by relevance]
• Neu-IR: Retrieval systems based on deep neural networks
-
What is Relevance: A Visualization Perspective
scrollbar-based visualization, by Byrd, DL’99
Fisheye interface, by Hornbæk and Frøkjær, CHI’01
TileBars, by Hearst, CHI’95
• Visualizations offer efficient, direct, and informative feedback to a search engine user.
• What makes a visualization practically valuable to humans might also be valuable to a deep neural network, given the network's resemblance to human neurons
• Can we prove that?
-
TileBars
• TileBars visualizes how query terms are distributed within a document and produces a compact image that explicitly shows the relevance (the matches) to the user.
• Proposed by Marti A. Hearst in the 90s. (Hearst, CHI’95)
• Segment documents by topic
• Visualize the query term hits within each segment
• Display the query terms vertically and the document segments horizontally
• The darker the cell, the more relevant the text
• It is called Word-to-Segment Matching
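The grid described above can be sketched in a few lines of Python. This is a minimal illustration, not Hearst's implementation: the segments are hand-made strings rather than the output of a topic segmenter, and the function names `tilebars_grid` and `render` are hypothetical.

```python
from collections import Counter


def tilebars_grid(query_terms, segments):
    """Return one row per query term and one column per document segment.

    Each cell holds how often the term occurs in that segment.
    """
    counts = [Counter(seg.lower().split()) for seg in segments]
    return [[c[term.lower()] for c in counts] for term in query_terms]


def render(grid, query_terms, shades=" .:#"):
    """Draw the grid as text; denser characters stand for darker cells."""
    lines = []
    for term, row in zip(query_terms, grid):
        cells = "".join(shades[min(hits, len(shades) - 1)] for hits in row)
        lines.append(f"{term:<10}|{cells}|")
    return "\n".join(lines)


# Toy document, already split into three segments (an assumption).
query = ["tax", "board"]
segments = [
    "the franchise tax board collects state tax",
    "contact the board by mail",
    "weather and sports news",
]
grid = tilebars_grid(query, segments)
rendered = render(grid, query)
print(rendered)
```

Query terms run vertically and segments horizontally, as on the slide; the third segment stays blank because it matches neither term.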
-
Matching Patterns
Relevant
Relevant
Irrelevant
Irrelevant
en0005-66-32430: January 2008 :: California Tax Attorney Blog California Tax Attorney Blog Published by California Tax Lawyer Mitchell A. Port Blog Home Firm Website Practice Areas Contact Us Home > January 2008
January 30, 2008 Tax Avoidance Or Tax Evasion? Employment Tax Evasion Schemes California employers: be
en0010-68-18656: Significant Changes to U.S. Taxation of Expatriating Citizens and Long-Term Residents. [21K] Printed (pdf) version. [172K] October 2005 United States and Sweden Sign New Income Tax Treaty Protocol. [8K] Printed (pdf) version. [127K] November 2004 IRS Proposes Section 482 Regulations on Intangible Property and
en0004-27-03654: Franchise business opportunity - Fast food franchise opportunity, Food franchise opportunity, International franchise opportunity, Commercial janitorial franchise opportunity, Small business franchise, Quiznos franchise for sale - Franchise business opportunity Franchise opportunity in Canada Carpet flooring
en0004-27-03654: Franchise business opportunity - Fast food franchise opportunity, Food franchise opportunity, International franchise opportunity, Commercial janitorial franchise opportunity, Small business franchise, Quiznos franchise for sale - Franchise business opportunity Franchise opportunity in Canada Carpet flooring
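Query: California Franchise Tax Board
[Figure: for each document, its word-to-word matching matrix next to its word-to-segment matching matrix]
• Word-to-word matching: difficult to distinguish relevant from irrelevant
• Word-to-segment matching: easy to observe consecutive matching blocks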
Given a query, common relevance matching patterns include (Skorochod’ko 1971; Hearst 1997):
• Chained
• Ringed
• Monolith
• Piecewise
• Hierarchical
[Figure: examples of the Chained, Ringed, Monolith, and Piecewise patterns]
-
Case Study: Spam & Documents
Excerpt from a web document:
Fast food franchise opportunity Restaurant franchise opportunity franchise opportunity uk Top franchise opportunity Small business franchise opportunity Canadian franchise opportunity New franchise opportunity Retail franchise opportunity Home based franchise opportunity Home based franchise opportunity business franchise opportunity starbucks Start up franchise opportunity Top ten franchise opportunity National franchise business opportunity show Fitness franchise opportunity franchise business opportunity show Entrepreneur franchise opportunity franchise opportunity in canada franchise business opportunity for sale franchise buying opportunity franchise opportunity canada Car wash franchise opportunity Us franchise opportunity Business directory franchise opportunity New franchise business opportunity Carpet flooring franchise opportunity Business franchise in opportunity Low cost franchise opportunity Child franchise opportunity Learning franchise opportunity Food franchise opportunity Coffee franchise opportunity International franchise opportunity Business franchise massage opportunity Small franchise opportunity
• Web documents often contain scripts, advertisements, and spam.
• Using word-to-word matching, these irrelevant documents may appear to be even “more” relevant to a query:
• E.g., for the query “California Franchise Tax Board”, “franchise” appears repeatedly in a long spam passage.
• With word-to-segment matching, a spam passage contributes to only one cell
• This makes spam documents easier to distinguish from the relevant ones
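A toy comparison of the two matching schemes on a spam-like document is sketched below. The documents, their segmentation, and the two counting functions are assumptions made for illustration; the actual model stores features in each cell rather than counting lit cells.

```python
def word_to_word_hits(query_terms, tokens):
    """Count document positions that exactly match any query term."""
    wanted = set(query_terms)
    return sum(1 for tok in tokens if tok in wanted)


def word_to_segment_hits(query_terms, segments):
    """Count (query term, segment) cells that contain at least one match."""
    return sum(1 for term in query_terms for seg in segments if term in seg)


query = ["california", "franchise", "tax", "board"]

# A spam passage forms a single long, topically uniform segment.
spam_segments = [["franchise", "opportunity"] * 20]

# A relevant page mentions the query terms across several segments.
relevant_segments = [
    ["california", "franchise", "tax", "board", "home"],
    ["file", "your", "tax", "return", "with", "the", "board"],
    ["california", "tax", "forms"],
]


def flatten(segments):
    return [tok for seg in segments for tok in seg]


spam_w2w = word_to_word_hits(query, flatten(spam_segments))
relevant_w2w = word_to_word_hits(query, flatten(relevant_segments))
spam_w2s = word_to_segment_hits(query, spam_segments)
relevant_w2s = word_to_segment_hits(query, relevant_segments)

print("word-to-word   :", spam_w2w, "vs", relevant_w2w)
print("word-to-segment:", spam_w2s, "vs", relevant_w2s)
```

Under word-to-word matching the spam page collects more hits than the relevant page; under word-to-segment matching it fills a single cell while the relevant page fills many.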
-
Case Study: Proximity Queries
Excerpt from a web document:
Sonoma Valley Hospital - Provides inpatient, outpatient and continuing care to the community. (Sonoma)
Specialty Healthcare Services Inc. - Long term acute care hospitals with locations throughout the United States. Headquartered in California.
St. Bernardine Medical Center - A member of the Catholic Healthcare West (CHW), a not-for-profit corporation sponsored by several religious communities.
St. Dominic's Hospital - Manteca, CA - One of six hospitals in the St. Josephs Regional Health System, a member of Catholic Healthcare West, a co-sponsored health ministry serving California, Arizona, and Nevada.
St. Elizabeth Community Hospital - Medical staff of more than 50 primary care physicians and specialists. Affiliated with Catholic Healthcare West. (Red Bluff)
St. Francis Medical Center - Serving the health care and social needs - body, mind, and spirit - of the communities of Southeast Los Angeles. The Center is founded upon and advances the healing ministry of Christ in the tradition of service established by St. Vincent de Paul, St. Louise de Marillac and St. Elizabeth Ann Seton.
St. Helena Hospital/St. Helena Center for Health - Located in Northern California's Napa Valley, a fully accredited, nonprofit community hospital offering a full range of acute-care and wellness services and programs.
St. Joseph's Behavioral Health Center - The Center is a licensed nonprofit psychiatric hospital serving central California. (Stockton)
St. Josephs Medical Center - Offers a full range of comprehensive services for people of all ages. (Stockton)
St. Jude Medical Center - Located in Fullerton. Information on a wide variety of medical services and community education programs.
• Proximity queries are AND queries that require all the query terms to appear together within a certain distance.
• e.g. Find phrase “sonoma county medical services” within 500 words
• Some proximity queries span a large window, but most existing Neu-IR models can only handle a tight window, e.g. terms 2 or 3 words apart (Hui et al. 2017).
• Word-to-segment matching can resolve this issue by merging topically coherent words into a single segment and obtaining a stronger relevance signal from it.
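The contrast between a tight token window and a segment-level check can be sketched as follows. The text, its segmentation, and both helper functions are hypothetical; they only illustrate why a segment can capture terms that are many tokens apart.

```python
def within_token_window(query_terms, tokens, window):
    """True if some run of `window` consecutive tokens contains every query term."""
    needed = set(query_terms)
    for start in range(max(1, len(tokens) - window + 1)):
        if needed <= set(tokens[start:start + window]):
            return True
    return False


def within_one_segment(query_terms, segments):
    """True if a single segment contains every query term."""
    needed = set(query_terms)
    return any(needed <= set(seg) for seg in segments)


query = ["sonoma", "county", "medical", "services"]

# Two hand-made segments (an assumption); the first covers one topic.
segments = [
    ("sonoma valley hospital provides inpatient care to the county "
     "community with a medical staff and wellness services").split(),
    "directions parking and visiting hours".split(),
]
tokens = [tok for seg in segments for tok in seg]

tight = within_token_window(query, tokens, window=3)
same_segment = within_one_segment(query, segments)
print(tight, same_segment)
```

The four terms never fall within three adjacent tokens, yet they all land in the first segment, so the segment-level check still finds the match.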
-
The Proposed Work
• In the indexing phase for the search engine,
• Visualize matching between query and segments
• “Color” the grid with features that follow good information retrieval principles
• Then, in the retrieval phase,
• Use an end-to-end deep learning model to obtain the final ranking scores
-
Segmentation
• TextTiling by Hearst
• Hearst, Marti A. "Multi-paragraph segmentation of expository text." ACL’94.
• Query independent, sequentially laid topics
[Figure: smoothed similarity score plotted over the token sequences]
• 1. Token sequence generation
• A token sequence is like a fixed-length pseudo sentence
• 2. Similarity computation
• The similarity for two neighboring sequences is calculated over the two windows to which they each belong.
[Figure: example text divided into token sequences]
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities
took place . The jury further said in term-end presentments that the City Executive Committee , which had over-all charge
of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was
conducted . The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of
possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number
of voters and the size of this city '' . The jury said it did find that many of Georgia's registration and
election laws `` are outmoded or inadequate and often ambiguous '' . It recommended that Fulton legislators act `` to have these laws studied
and revised to the end of modernizing
• 3. Boundary determination
• Boundaries are discovered at positions where similarity scores drop sharply
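The three steps can be sketched in plain Python as below. This is a simplified illustration, not the TextTiling algorithm as published: it skips stop-word removal and score smoothing, and the cutoff (mean minus half a standard deviation of the depth scores), the window sizes `w` and `k`, and all function names are assumptions.

```python
import math
from collections import Counter


def token_sequences(tokens, w):
    """Step 1: cut the token stream into fixed-length pseudo-sentences."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[tok] for tok, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def gap_similarities(seqs, k):
    """Step 2: compare the k sequences before each gap with the k after it."""
    sims = []
    for gap in range(1, len(seqs)):
        left = Counter(tok for seq in seqs[max(0, gap - k):gap] for tok in seq)
        right = Counter(tok for seq in seqs[gap:gap + k] for tok in seq)
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """How far each gap's similarity sits below the nearest peak on each side."""
    depths = []
    for i, s in enumerate(sims):
        left_peak = right_peak = s
        for v in reversed(sims[:i]):      # climb left while similarity rises
            if v < left_peak:
                break
            left_peak = v
        for v in sims[i + 1:]:            # climb right while similarity rises
            if v < right_peak:
                break
            right_peak = v
        depths.append((left_peak - s) + (right_peak - s))
    return depths


def find_boundaries(sims):
    """Step 3: place a boundary where the similarity valley is deep enough."""
    if not sims:
        return []
    depths = depth_scores(sims)
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - std / 2               # illustrative cutoff (assumption)
    return [i + 1 for i, d in enumerate(depths) if d > cutoff]


def segment(tokens, w=4, k=1):
    """Return the document as a list of segments (lists of tokens)."""
    seqs = token_sequences(tokens, w)
    cuts = find_boundaries(gap_similarities(seqs, k))
    segments, start = [], 0
    for cut in cuts + [len(seqs)]:
        segments.append([tok for seq in seqs[start:cut] for tok in seq])
        start = cut
    return segments


tokens = ("jury election jury city election jury city election "
          "tax board tax franchise board tax franchise board").split()
segments = segment(tokens)
print(segments)
```

On the toy input the similarity between neighboring sequences collapses where the vocabulary switches from election terms to tax terms, so a single boundary is placed there.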
-
Standardize Matrix Dimensions
• All query-document interaction matrices are standardized to the same dimensions
• Pad queries up to the maximum query length
• For short documents, add empty segments
• For long documents, merge the last segments
• A compact representation:
• More than 90% of documents contain 30 or fewer segments in ClueWeb09
• More than 80% of documents contain 30 or fewer segments in LETOR
[Figure: interaction matrices of different sizes brought to a common dimension]
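A minimal sketch of the padding and merging rules follows. It operates on a matrix of per-segment counts and merges overflow segments by summing their counts into the last column, which is an assumption about how "merge" is realized; the function name and the toy sizes are hypothetical.

```python
def standardize(matrix, n_q, n_b, empty=0):
    """Bring a query-by-segment matrix to exactly n_q rows and n_b columns.

    matrix: one row per query term, one value per document segment.
    """
    rows = []
    for row in matrix:
        if len(row) > n_b:
            # Long document: fold every overflow segment into the last column.
            row = row[:n_b - 1] + [sum(row[n_b - 1:])]
        else:
            # Short document: append empty segments.
            row = row + [empty] * (n_b - len(row))
        rows.append(row)
    # Short query: append empty rows up to the maximum query length.
    rows += [[empty] * n_b for _ in range(n_q - len(rows))]
    return rows


long_doc = standardize([[1, 2, 3, 4], [0, 1, 0, 2]], n_q=3, n_b=3)
short_doc = standardize([[1, 2]], n_q=2, n_b=3)
print(long_doc)
print(short_doc)
```

Every matrix leaves the function with the same shape, so a batch of query-document pairs can be stacked into one tensor.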
-
Color the Cells
• Each cell represents an interaction between a query term and a document segment
• Each cell consists of three channels
• It is like the canonical RGB colors in multimedia and image research
• First channel: term frequency, $tf(w_i, B_j)$
• Second channel: inverse document frequency, $idf(w_i) \times \mathbb{I}_{B_j}(w_i)$
• Third channel: word similarity based on the distributional hypothesis, $\max_{t \in B_j} e^{-(v_{w_i} - v_t)^2}$
• In theory, more features can be included in the framework
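The three channels of a single cell can be computed as in the sketch below. The IDF table and the two-dimensional word vectors are made-up toy values, the squared difference of vectors is read as squared Euclidean distance (an assumption), and `cell_channels` is a hypothetical name.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cell_channels(term, segment, idf, vectors):
    """Return (tf, idf-if-present, best embedding similarity) for one cell.

    term:    a query term w_i (assumed to have an entry in `vectors`)
    segment: list of tokens in segment B_j
    """
    tf = segment.count(term)
    idf_if_present = idf.get(term, 0.0) if term in segment else 0.0
    similarity = max(
        (math.exp(-squared_distance(vectors[term], vectors[t]))
         for t in segment if t in vectors),
        default=0.0,  # no token of the segment has a vector
    )
    return tf, idf_if_present, similarity


# Toy resources (assumptions, not values from the paper).
idf = {"tax": 2.0}
vectors = {"tax": [1.0, 0.0], "levy": [0.9, 0.1], "weather": [-1.0, 0.5]}

present = cell_channels("tax", ["tax", "board", "tax"], idf, vectors)
absent = cell_channels("tax", ["levy", "weather"], idf, vectors)
print(present)
print(absent)
```

When the term is absent from the segment, the first two channels are zero but the third still signals a near-synonym such as "levy".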
-
DeepTileBars: The Neural Network
[Figure: DeepTileBars architecture. Stage labels: Word-to-segment matching, Relevance Detection, Detection Results, Relevance Aggregation, Relevance Decision. Output: relevance score.]
-
Bagging of Various Kernel Sizes
• Bagging of various kernel sizes
• Each segment corresponds to a topic
• Super topics are created by combining adjacent topics
• Relevance evaluation is performed at different granularity levels
[Figure: the same query-document pair shown at 1, 3, 5, and 10 segments per cell]
Query: California franchise tax board (TREC Web 2011 q116)
Document: clueweb09-enwp00-69-13554
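The effect of different kernel widths can be sketched with a single-channel convolution whose kernel spans all query terms and k adjacent segments. The all-ones kernel is an assumption standing in for learned weights; the real model also has three input channels and many filters. The name n_b for the number of segments is assumed here.

```python
def conv_full_height(grid, kernel):
    """Slide an n_q x k kernel along the segment axis of an n_q x n_b grid.

    Returns n_b - k + 1 values, one per window of k adjacent segments.
    """
    n_q, n_b = len(grid), len(grid[0])
    k = len(kernel[0])
    return [
        sum(grid[i][j + d] * kernel[i][d] for i in range(n_q) for d in range(k))
        for j in range(n_b - k + 1)
    ]


# Toy term-frequency grid: 2 query terms x 5 segments.
grid = [
    [1, 0, 2, 0, 0],
    [0, 1, 1, 0, 3],
]

outputs = {}
for k in (1, 3, 5):
    ones = [[1] * k for _ in grid]      # placeholder for learned weights
    outputs[k] = conv_full_height(grid, ones)
    print(k, outputs[k])
```

A width-1 kernel scores each topic on its own, while wider kernels score "super topics" built from adjacent segments, which is the granularity idea behind combining several kernel sizes.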
-
Overview of DeepTileBars
[Figure: DeepTileBars architecture diagram. Stage labels: Word-to-segment matching, Relevance Detection, Detection Results, Relevance Aggregation, Relevance Decision. Output: relevance score.]
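In summary:
$z_k^0 = \mathrm{CNN}_k(I), \quad k = 1, 2, 3, \ldots, l$
$z_k^1 = \mathrm{LSTM}_k(z_k^0), \quad k = 1, 2, 3, \ldots, l$
$s = \mathrm{MLP}([z_1^1, z_2^1, \ldots, z_k^1, \ldots, z_l^1])$
Objective function:
$J(\Theta) = \sum_{(q, d^+, d^-)} -\log \frac{1}{1 + e^{-(s(d^+, \Theta) - s(d^-, \Theta))}}$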
Burges, Chris, et al. "Learning to rank using gradient descent." ICML’05
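The objective can be evaluated for given scores as in the sketch below. It covers only the loss, not the CNN, LSTM, or MLP that produce the scores; the function names are hypothetical, and the numerically stable rewriting of the logarithm is a standard identity rather than something shown on the slide.

```python
import math


def pairwise_loss(score_pos, score_neg):
    """-log(1 / (1 + exp(-(s+ - s-)))), i.e. log(1 + exp(-(s+ - s-))).

    Written as max(x, 0) + log1p(exp(-|x|)) with x = -(s+ - s-) so that
    large score differences do not overflow.
    """
    diff = score_pos - score_neg
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def objective(score_pairs):
    """Sum the pairwise loss over (relevant score, irrelevant score) pairs."""
    return sum(pairwise_loss(pos, neg) for pos, neg in score_pairs)


tied = pairwise_loss(0.0, 0.0)           # the model cannot tell them apart
correct_order = pairwise_loss(2.0, 0.0)  # relevant document scored higher
wrong_order = pairwise_loss(0.0, 2.0)    # relevant document scored lower
total = objective([(2.0, 0.0), (0.0, 2.0)])
print(tied, correct_order, wrong_order, total)
```

The loss shrinks as the relevant document's score pulls ahead of the irrelevant one and grows when the order is reversed.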
-
Experimental Setup
• Tasks: Ad-hoc Retrieval
• Datasets
• Text REtrieval Conference (TREC) Web Track 2010-2012 (Clarke et al. 2012)
• Collection: ClueWeb09 CatB (50 million English webpages, crawled by CMU from January to February 2009)
• 150 queries + 38,948 judged documents
• LETOR-MQ2008 (Qin and Liu 2013)
• Collection: Gov2 (25 million webpages, crawled from .gov sites in early 2004)
• 784 queries + 15,211 judged documents
• Evaluation Metrics
• NDCG (Järvelin and Kekäläinen 2002), ERR (Chapelle et al. 2009)
• Precision
$\mathrm{DCG} = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i + 1)}$
$\mathrm{ERR} = \sum_{i=1}^{n} \frac{1}{i} \prod_{j=1}^{i-1} (1 - R_j) R_i$
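Both metrics can be computed directly from a ranked list of relevance grades, as sketched below. The mapping from a grade to the stopping probability R_i, (2^grade - 1) / 2^max_grade, follows the usual ERR definition and is not spelled out on the slide; the function names and toy grades are assumptions.

```python
import math


def dcg(grades):
    """DCG = sum over ranks i of rel_i / log2(i + 1), with ranks starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(grades, start=1))


def ndcg(grades):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0


def err(grades, max_grade):
    """ERR = sum over ranks i of (1/i) * prod_{j<i} (1 - R_j) * R_i."""
    total, not_stopped = 0.0, 1.0
    for i, grade in enumerate(grades, start=1):
        r = (2 ** grade - 1) / 2 ** max_grade  # chance the user stops at rank i
        total += not_stopped * r / i
        not_stopped *= 1 - r
    return total


dcg_value = dcg([3, 2, 0])
ndcg_value = ndcg([3, 2, 0])
err_value = err([1, 0, 1], max_grade=1)
print(dcg_value, ndcg_value, err_value)
```

ERR discounts a relevant document more heavily when an earlier result is already likely to have satisfied the user, which DCG does not model.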
-
Experiments: Baseline Systems
• Traditional IR approaches
• BM25 (Robertson and Zaragoza, 2009)
• Language modeling (Zhai and Lafferty, 2017)
• TREC Best Runs:
• Sophisticated term weighting (Dinçer and Karaoglan, 2010; Elsayed, 2010)
• Simple neural nets but with abundant feature engineering (Boytsov and Belova 2011; Al-akashi and Inkpen 2012)
• Neu-IR models
• DRMM (Guo et al. 2016)
• MatchPyramid (Pang et al. 2016)
• DeepRank (Pang et al. 2017)
• Duet (Mitra et al. 2017)
• HiNT (Fan et al. 2018)
• Variants of DeepTileBars: word-to-word vs. word-to-segment, different kernel sizes
-
Results: TREC 2010-2012 Web Tracks
System                          ERR@20  NDCG@20  P@20
TREC-Best                       0.188   0.236    0.382
BM25                            0.102   0.137    0.253
LM                              0.118   0.166    0.297
DRMM                            0.127   0.184    0.346
MatchPyramid                    0.113   0.125    0.228
DeepRank                        0.127   0.134    0.224
HiNT                            0.157   0.205    0.322
DeepTileBars(n_q * 1)           0.140   0.207    0.368
DeepTileBars(n_q * 3)           0.150   0.212    0.369
DeepTileBars(n_q * 5)           0.146   0.211    0.371
DeepTileBars(n_q * 7)           0.142   0.207    0.366
DeepTileBars(n_q * 9)           0.147   0.213    0.372
DeepTileBars(w2w, all kernels)  0.110   0.123    0.248
DeepTileBars(w2s, all kernels)  0.168   0.229    0.384
The bigger the number, the better the search engine effectiveness.
-
Results: LETOR-MQ 2008
System P@5 P@10 NDCG@5 NDCG@10
BM25 0.337 0.245 0.461 0.220
LM 0.323 0.236 0.441 0.206
DRMM 0.337 0.242 0.466 0.219
MatchPyramid 0.329 0.239 0.442 0.211
DeepRank 0.359 0.252 0.496 0.240
Duet 0.341 0.240 0.471 0.216
HiNT 0.367 0.255 0.501 0.244
DeepTileBars 0.427 0.320 0.553 0.256
The bigger the number, the better the search engine effectiveness.
-
Conclusions
• A new, light-weight Neu-IR model inspired by classical work in term distribution visualization
• We propose to use:
• Word-to-segment matching
• Bagging of multiple CNNs
• Why does it work?
• It is practically a hierarchical modeling of document structure
• Topic - super topic - super super topic - ... - document
• Better handles proximity queries & spam documents
• A possible new direction for Neu-IR: visualize relevance signals by first turning texts into images, then using deep neural networks
-
Thank you!
• Contacts:
• [email protected] (Zhiwen Tang)
• [email protected] (Grace Hui Yang)
• Slides: http://infosense.cs.georgetown.edu/~zt79/aaai2019_v3.key
• Code: https://github.com/smt-HS/DeepTileBars-release
• Implementation of TextTiling in NLTK
• https://www.nltk.org/_modules/nltk/tokenize/texttiling.html
• InfoSense Website:
• http://infosense.cs.georgetown.edu/