Dealing with Diversity in Mining and Query Processing
Jeffrey Xu Yu (于旭), Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
Books on Social Networks
• Social and Economic Networks, by Matthew O. Jackson
• Social Network Data Analysis, by Charu C. Aggarwal
• Exploratory Social Network Analysis with Pajek, by Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg
• Networks: An Introduction, by M.E.J. Newman
Some Online Courses
• Mining of Massive Datasets (Anand Rajaraman and Jeff Ullman): http://infolab.stanford.edu/~ullman/mmds.html
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg: http://www.cs.cornell.edu/home/kleinber/networks-book
• Topics in Data Management & Mining – Social Networks, Laks V.S. Lakshmanan: http://www.cs.ubc.ca/~laks/534l/cpsc534l.html
Stanford Large Network Dataset Collection: http://snap.stanford.edu/data
• Social networks, communication networks, citation networks, collaboration networks, Web graphs, Amazon networks, Internet networks, road networks, autonomous systems, signed networks, Wikipedia networks and metadata, Twitter and Memetracker
Graph Database: http://en.wikipedia.org/wiki/Graph_database
• Pregel: Google’s internal graph processing platform
• Trinity: Microsoft Research Asia
• Neo4j: commercial graph database
• …
Diversified Ranking
Why diversified ranking?
• Diversity of information requirements
• Incomplete queries
Problem Statement
• For query-dependent diversified ranking, the goal is to find K nodes in a graph that are relevant to the query node and also dissimilar to each other.
• For query-independent diversified ranking, the goal is to find K prestige nodes in a graph that are dissimilar to each other.
Main applications
• Ranking nodes in social networks, ranking papers, etc.
Challenges
• Diversity measures: there are no widely accepted diversity measures on graphs in the literature.
• Scalability: most existing methods do not scale to large graphs.
• Lack of intuitive interpretation.
Existing Methods
• Grasshopper [Zhu, et al., HLT-NAACL’07]
• ManiRank [Zhu, et al., WWW’11]
• DivRank [Mei, et al., KDD’10]
• DRAGON [Tong, et al., KDD’11]
• Resistive Graph Centers [Dubey, et al., KDD’11]
Grasshopper/ManiRank
The main idea
• Work in an iterative manner.
• Select a node at one iteration by random walk.
• Set the selected node to be an absorbing node, and perform the random walk again to select the second node.
• Repeat the same process for K iterations to get K nodes.
No diversity measure
• Diversity is achieved only by intuition and experiments.
Cannot scale to large graphs (time complexity O())
[Figure: Grasshopper/ManiRank: initial random walk with no absorbing states; absorbing random walk after ranking the first item.]
DivRank
• Based on a vertex-reinforced random walk.
• No diversity measure.
• Convergence properties are not clear.
• Time and space complexity is …
DRAGON, Resistive Graph Centers
• DRAGON [Tong, et al., KDD’11]: the diversity measure lacks a clear topological interpretation.
• Resistive Graph Centers [Dubey, et al., KDD’11]: based on personalized PageRank with a learnable teleportation parameter; does not scale to large graphs.
A Summary
Comparison with existing methods
Our Approach
The main idea
• Relevance of the top-K nodes (denoted by a set S) is achieved by large (Personalized) PageRank scores.
• Diversity of the top-K nodes is achieved by a large expansion ratio.
• Expansion ratio of a set of nodes S: σ(S)=|N(S)|/n
• A larger expansion ratio implies better diversity.
• K-step expansion ratio of S: σk(S)=|Nk(S)|/n
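The two ratios can be sketched in a few lines of Python. This is a minimal illustration, assuming that N(S) contains S itself plus its direct neighbors, that Nk(S) is the set of nodes within k hops of S, and that n is the number of nodes; the function names and the toy path graph are hypothetical.

```python
from collections import deque

def k_step_neighborhood(adj, seeds, k):
    """Return N_k(S): nodes within k hops of any seed (seeds included, an assumed convention)."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if dist[u] == k:          # do not expand beyond k hops
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def expansion_ratio(adj, seeds, k=1):
    """sigma_k(S) = |N_k(S)| / n, with n the number of nodes in the graph."""
    return len(k_step_neighborhood(adj, seeds, k)) / len(adj)

# Toy undirected path graph 0-1-2-3-4-5 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
```

On this path, the adjacent pair {0, 1} reaches only half of the graph, while the spread-out pair {1, 4} reaches all of it, which is the sense in which a larger ratio signals a more diverse set.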
Our diversity measures
The K-step Expansion
• The diversified ranking problem on a graph is formulated as a discrete optimization problem.
• Submodularity: F(S) is shown to be submodular and non-decreasing.
• The greedy algorithm: a 1-1/e approximation algorithm for solving Eq. (1), with linear time and space complexity w.r.t. the size of the graph.
A Discrete Optimization Problem
The Greedy Algorithm
• Works in K rounds.
• Selects a node with maximal marginal gain at each round.
• Marginal gain
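The K-round selection can be sketched as follows. The objective is not spelled out in this transcript, so the code assumes one plausible form, F(S) = (1 − λ)·Σ relevance(u) + λ·|N(S)|/n, where the trade-off λ (`lam`), the relevance values, and the toy graph are all hypothetical.

```python
def greedy_diversified(adj, relevance, K, lam=0.5):
    """Pick K nodes, each round taking the node with the largest marginal gain.

    Assumed objective: (1 - lam) * sum of relevance + lam * |N(S)| / n,
    where N(S) is S plus its direct neighbors.
    """
    n = len(adj)
    selected, covered = [], set()
    for _ in range(K):
        best_node, best_gain = None, float("-inf")
        for u in adj:
            if u in selected:
                continue
            newly_covered = ({u} | set(adj[u])) - covered
            gain = (1 - lam) * relevance[u] + lam * len(newly_covered) / n
            if gain > best_gain:
                best_node, best_gain = u, gain
        if best_node is None:     # fewer than K nodes in the graph
            break
        selected.append(best_node)
        covered |= {best_node} | set(adj[best_node])
    return selected

# Two stars: hub "a" with three leaves, hub "b" with two leaves (hypothetical data).
adj = {"a": ["a1", "a2", "a3"], "a1": ["a"], "a2": ["a"], "a3": ["a"],
       "b": ["b1", "b2"], "b1": ["b"], "b2": ["b"]}
relevance = {"a": 0.30, "a1": 0.25, "a2": 0.05, "a3": 0.05,
             "b": 0.20, "b1": 0.10, "b2": 0.05}
```

With `lam=0.5` the second pick moves to the other star (`["a", "b"]`); with `lam=0.0` the ranking falls back to pure relevance (`["a", "a1"]`).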
• Maximize Fk(S) subject to the cardinality constraint |S| ≤ K.
• Submodularity: Fk(S) is shown to be submodular and non-decreasing.
• Randomized greedy algorithm: a near 1-1/e approximation algorithm, with linear time and space complexity w.r.t. the size of the graph.
Generalized Diversified Ranking Optimization
Randomized greedy algorithm
• Same idea as the greedy algorithm: it works in K rounds and, at each round, selects the node with maximal marginal gain.
• But evaluating the maximal marginal gain is expensive.
• Our idea: use a probabilistic counting data structure to sketch the k-step neighborhood of each node.
• Marginal gain of a node u: combines the weight w_u with the neighborhood growth |Nk(S ∪ {u})| − |Nk(S)|.
• A probabilistic counting structure, devised by Flajolet and Martin.
• It can be used to estimate the cardinality of a multi-set using only log C + t bits, where C denotes the cardinality and t is a small constant.
• Each FM Sketch is a (log C + t)-bit bitmap.
• Advantage: to estimate the cardinality of the union of two multi-sets, we only need a bitwise-OR between the two FM Sketches.
FM Sketch and Its Properties
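A single-bitmap FM sketch can be written as below. This is a sketch under stated assumptions: the 32-bit bitmap length, the SHA-256 based hash, and the function names are illustrative choices, and 0.77351 is the standard Flajolet-Martin correction constant.

```python
import hashlib

BITS = 32  # bitmap length, standing in for log C + t

def _rho(x):
    """Index of the least-significant 1 bit of x (BITS - 1 if x is 0)."""
    if x == 0:
        return BITS - 1
    return (x & -x).bit_length() - 1

def fm_sketch(items, seed=0):
    """Build the bitmap: each item sets the bit chosen by the hash of the item."""
    bitmap = 0
    for item in items:
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        h = int.from_bytes(digest[:4], "big")   # 32-bit hash value
        bitmap |= 1 << _rho(h)
    return bitmap

def fm_estimate(bitmap):
    """Estimate the cardinality from the position of the lowest 0 bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / 0.77351

a = fm_sketch(range(0, 600))
b = fm_sketch(range(400, 1000))
union = a | b          # sketch of the union, obtained by bitwise-OR only
```

Duplicates never change the bitmap, and `a | b` is exactly the sketch that would be built from the union of the two inputs. One bitmap gives only a coarse estimate; averaging several sketches built with different `seed` values is the usual way to tighten it.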
The Randomized Greedy Algorithm
• For each node u, use an FM Sketch to sketch Nk({u}).
• Use the following rule to sketch Nk({u}), which can be implemented in a recursive way:
$$N_k(\{u\}) = \bigcup_{(u,v) \in E} N_{k-1}(\{v\})$$
• Use an FM Sketch to sketch Nk(S).
• Evaluating the marginal gain can be implemented by a bitwise-OR between the sketches of Nk(S) and Nk({u}).
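The recursive rule can be illustrated with exact bitmaps (Python integers) standing in for FM Sketches, so that the result is easy to check by hand. Two assumptions are made here: N_0({u}) = {u}, and every node stays in its own neighborhood at each step. The function names and the path graph are hypothetical.

```python
def k_step_bitmaps(adj, k):
    """Return {u: bitmap of N_k({u})} computed with k rounds of neighbor unions."""
    index = {u: i for i, u in enumerate(adj)}
    current = {u: 1 << index[u] for u in adj}        # N_0({u}) = {u}
    for _ in range(k):
        nxt = {}
        for u in adj:
            merged = current[u]                      # assumption: u remains in its own set
            for v in adj[u]:
                merged |= current[v]                 # union of the neighbors' (k-1)-step sets
            nxt[u] = merged
        current = nxt
    return current

def marginal_growth(sketch_s, sketch_u):
    """|N_k(S ∪ {u})| - |N_k(S)|, evaluated with one bitwise-OR."""
    return bin(sketch_s | sketch_u).count("1") - bin(sketch_s).count("1")

# Path graph 0-1-2-3-4 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
n2 = k_step_bitmaps(adj, 2)
```

Replacing the exact bitmaps with FM Sketches keeps the same loop structure, since both support union through bitwise-OR; only the counting step becomes an estimate.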
Experimental Studies
• We conduct experiments on 5 real networks (3 collaboration networks, 1 citation network, and 1 social network).
• We show some results on Flickr, a popular photo-sharing website (from the ASU social computing data repository): an undirected social network with 80,513 nodes, 5,899,882 edges, and 195 different groups.
Some Testing Results on Flickr
Make a Top-K Algorithm Diversified
[Figure: the result of searching “apple” in Google Images.]
Existing top-K search algorithms
• Search results are ranked independently.
• When searching “apple” in Google Images, 9 out of the top 15 results are the logo of Apple Inc.
Structural Keyword Search (1)
Example: Keyword Search in Graphs
• Input: a graph with text information on each node, and a user-given keyword query
• Output: the top-k minimal Steiner trees that contain all user-given keywords
[Figure: keyword search on DBLP for “graph patterns” and “keyword search”. Nodes: a1 (Author: Jiawei Han); w1 to w4 (Action: Write); p1 to p4 (Papers: “Mining Graph Patterns”, “Mining Significant Graph Patterns by Leap Search”, “Optimizing Index for Taxonomy Keyword Search”, “Keyword Search in Text Cube: Finding Top-k Aggregated Cell Documents”). Results: v1 = {a1, w1, p1, w2, p2}, v2 = {a1, w1, p1, w4, p4}, v3 = {a1, w3, p3, w2, p2}, v4 = {a1, w3, p3, w4, p4}.]
Structural Keyword Search (2)
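[Figure: the four results with their scores: v1 score=0.8, v2 score=0.5, v3 score=0.5, v4 score=0.4. Edges between results carry similarity values: four pairs at 0.6 and two pairs at 0.2.]
• Suppose each pair of results has a similarity value, as labeled in the figure.
• A result set is worse than another when it contains results that are similar to each other.
• Among result sets without similar results, the one with the larger total score is better.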
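Diversified Top-K
We should consider both similarity and score.
• Let S = {v1, v2, …} be a list of search results.
• Let score(vi) be the score of result vi.
• Let sim(vi, vj) be the similarity of vi and vj.
• For any vi, vj in S, vi and vj are similar if sim(vi, vj) > τ, where τ is a user-given threshold.
Diversified top-K results D(S)
• D(S) is a subset of S.
• At most K results: |D(S)| ≤ K.
• No two results in D(S) are similar.
• The total score of the results in D(S) is maximized.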
A Diversity Graph
[Figure: a diversity graph over results v1 to v6, with node scores 10, 8, 7, 7, 6, and 1.]
Diversity Graph
• Undirected graph G(S) = (V, E) with V = S; there is an edge (vi, vj) in E if vi is similar to vj.
• The diversified top-K result set is an independent set of G(S).
• K=2, D={v1, v2}; K=3, D={v1, v2}
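To make the definition concrete, the following sketch builds a diversity graph from pairwise similarities and finds the diversified top-K set by brute force. The scores 0.8, 0.5, 0.5, 0.4 come from the keyword search example; which pairs carry similarity 0.6 and which carry 0.2 is an assumption (results sharing a paper are taken to be the 0.6 pairs), as are the threshold 0.5 and the function names. Brute force is only meant for tiny inputs.

```python
from itertools import combinations

def diversity_graph(results, sim, tau):
    """Edges join every pair of results whose similarity exceeds tau."""
    return {frozenset(pair) for pair in combinations(results, 2) if sim(*pair) > tau}

def diversified_top_k(scores, edges, K):
    """Exhaustively search independent sets of size <= K for the best total score."""
    best_score, best_set = 0.0, ()
    for size in range(1, K + 1):
        for cand in combinations(scores, size):
            if any(frozenset(p) in edges for p in combinations(cand, 2)):
                continue          # contains two similar results
            total = sum(scores[v] for v in cand)
            if total > best_score:
                best_score, best_set = total, cand
    return best_score, set(best_set)

scores = {"v1": 0.8, "v2": 0.5, "v3": 0.5, "v4": 0.4}
# Assumed similarity assignment: pairs sharing a paper get 0.6, the others 0.2.
high = {frozenset(p) for p in [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]}

def sim(a, b):
    return 0.6 if frozenset((a, b)) in high else 0.2

edges = diversity_graph(scores, sim, tau=0.5)
best_score, best_set = diversified_top_k(scores, edges, K=2)
```

Under these assumptions the best pair is {v1, v4}: the plain top-2 {v1, v2} is rejected because its members are similar, and {v2, v3} is dissimilar but has a lower total score.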
Existing Top-K Search Frameworks
Most existing top-K search frameworks avoid exploring all search results by finding an early stop condition.
Incremental Top-K
• Results are generated one by one in ranked order.
• Stops when K results are output.
Bounding Top-K
• Results are generated, not necessarily in ranked order.
• A non-increasing score upper bound u for unseen results is maintained.
• Stops when the K-th largest score generated is no smaller than u.
Our Framework
We support the existing top-K frameworks
• Results are generated one by one.
• Stops if a certain stop condition is satisfied.
Our framework
We extend the existing algorithms to get top-K diversified results by three new functions.
• sufficient(): a new early stop condition
• necessary(): the necessary stop condition
• div-search(): search top-K diversified results on the current results
Step 1
• Check the stop condition sufficient()
• Stops if sufficient() is satisfied
Step 2
• Generate the next result using the original top-K algorithm
Step 3
• Check the necessary() condition
• If necessary() is satisfied, search the diversified top-K results using div-search()
• Go to Step 1
Sufficient Stop Condition
Sufficient stop condition sufficient()
• S: the set of current generated results
• best(S): an upper bound of the optimal solution calculated from the current generated results
• D_K(S): the current diversified top-K results, with score score(D_K(S))
• u: the score upper bound of all unseen results
• In the ideal situation, all the remaining results are unseen results whose scores are set to be u; this yields best(S).
• The sufficient stop condition is
$$score(D_K(S)) \ge best(S)$$
Necessary Stop Condition
Necessary stop condition necessary()
• S: the set of current generated results
• Assume the stop condition of the original algorithm is satisfied; otherwise the algorithm cannot stop.
• S′: the set of results when necessary() was last satisfied (or ∅ if necessary() has never been satisfied)
• If D_i(S′) ≠ ∅ for a certain i, we need at least K − i more results generated in order to get K results.
• The necessary stop condition is
$$|S| \ge |S'| + K - \max\{\, i \mid 1 \le i \le K,\ D_i(S') \ne \emptyset \,\}$$
The Possible Search Algorithms
Given the diversity graph G(S) for the current generated result set S, finding D(S) on G(S) is an NP-hard problem.
Greed is Not Good
[Figure: a diversity graph G (K=100), drawn twice, with nodes v0 to v100 and u1 to u100 and node scores 100, 99, 99, …, 99 and 0.5, 1, 1, …, 1. One copy marks the greedy solution, the other the optimal solution.]
Greedy Solution: score=199
Optimal Solution: score=9900
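The gap between 199 and 9900 can be reproduced on one graph that is consistent with those numbers. The exact edges of the slide's graph are not recoverable, so the structure below is an assumption: v0 (score 100) is adjacent to each of v1 to v100 (score 99 each), and 99 isolated nodes of score 1 stand in for the low-scoring u nodes. The function name is hypothetical.

```python
def greedy_by_score(scores, adj, K):
    """Scan results by decreasing score, keeping each one that conflicts with none kept so far."""
    chosen = []
    for v in sorted(scores, key=scores.get, reverse=True):
        if len(chosen) == K:
            break
        if all(v not in adj[c] for c in chosen):
            chosen.append(v)
    return chosen

K = 100
scores = {"v0": 100}
scores.update({f"v{i}": 99 for i in range(1, 101)})
scores.update({f"u{i}": 1 for i in range(1, 100)})

adj = {v: set() for v in scores}
for i in range(1, 101):           # assumed edges: v0 is similar to every v_i
    adj["v0"].add(f"v{i}")
    adj[f"v{i}"].add("v0")

greedy = greedy_by_score(scores, adj, K)
greedy_total = sum(scores[v] for v in greedy)

# Skipping v0 frees all of v1..v100, which are pairwise dissimilar here.
optimal = [f"v{i}" for i in range(1, 101)]
optimal_total = sum(scores[v] for v in optimal)
```

Greedy commits to v0 first and is then locked out of every 99-point node, ending at 100 + 99 × 1 = 199, while the set v1 to v100 totals 9900.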
Three New Search Algorithms
We propose three exact algorithms
• div-astar: an A* based approach
• div-dp: decompose div-astar using operator ⊕
• div-cut: further decompose div-dp using operators ⊕ and ⊗
[Figure: the NP-hard search decomposed into progressively smaller subproblems under div-astar, div-dp, and div-cut.]
An A* Based Approach
• We use a heap H to maintain partial solutions.
• Each partial solution has the form (solution, score, bound):
  solution: the set of results selected in the partial solution
  score: the total score of the results in solution
  bound: the upper bound of the score if the partial solution is expanded to a full solution
• Entries in H are expanded in non-increasing order of bound.
• The algorithm stops when the bound of the next solution is no larger than the score of the current best solution.
Calculation of bound
• The bound is defined by a constrained maximization that uses, for each result, the set of its adjacent nodes in the diversity graph.
• The equation is a relaxation of the optimal solution.
• One of its constraints is there to avoid generating redundant results.
• The bound can be calculated in … time in the worst case.
An A* Based Approach: An Example (K=3)
[Figure: a diversity graph over v1 to v6 with scores 10, 8, 7, 7, 6, 3.]
Search tree entries are written as solution, score, bound. The root entry is ∅,0,25.
Step 1: Expand ∅ (bound 25). New entries: {v1},10,21; {v2},8,8; {v3},7,20; {v4},7,13; {v5},6,6; {v6},3,3
Step 2: Expand {v1} (bound 21). New entries: {v1,v2},18,18; {v1,v6},13,13
Step 3: Expand {v3} (bound 20). New entries: {v3,v4},14,20; {v3,v5},13,13
Step 4: Expand {v3,v4} (bound 20). New entry: {v3,v4,v5},20,20
Step 5: Expand {v3,v4,v5} (bound 20). Current best score is 20, and next best score is 18: stop.
Optimal solution: {v3, v4, v5}
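The search in this example can be sketched with a heap ordered by bound. The edge set below is an assumption, chosen so that the computed bounds match the entries listed in the example (∅ at 25, {v1} at 21, {v3} at 20, and so on); the bound used here, the current score plus the best remaining scores among later non-adjacent nodes, is likewise one relaxation that reproduces those numbers, not necessarily the exact one from the talk.

```python
import heapq
from itertools import count

def div_astar(scores, adj, K):
    """scores: list in non-increasing order; adj[i]: set of indices adjacent to node i."""
    n = len(scores)

    def bound(sol, pos, score):
        # Relaxation: ignore conflicts among the remaining candidates themselves.
        free = [scores[j] for j in range(pos + 1, n)
                if all(j not in adj[i] for i in sol)]
        return score + sum(free[:K - len(sol)])

    tie = count()                                  # tie-breaker so tuples never compare solutions
    heap = [(-bound((), -1, 0), next(tie), (), -1, 0)]
    best_score, best_sol = 0, ()
    while heap and -heap[0][0] > best_score:       # stop once no entry can beat the best
        _, _, sol, pos, score = heapq.heappop(heap)
        if len(sol) == K:
            continue
        for j in range(pos + 1, n):                # only later nodes: avoids duplicate sets
            if any(j in adj[i] for i in sol):
                continue
            child, child_score = sol + (j,), score + scores[j]
            if child_score > best_score:
                best_score, best_sol = child_score, child
            heapq.heappush(heap, (-bound(child, j, child_score), next(tie),
                                  child, j, child_score))
    return best_score, best_sol

scores = [10, 8, 7, 7, 6, 3]                       # v1..v6 stored at indices 0..5
adj = [{2, 3, 4}, {2, 3, 4, 5}, {0, 1, 5}, {0, 1, 5}, {0, 1, 5}, {1, 2, 3, 4}]  # assumed edges
best_score, best_sol = div_astar(scores, adj, K=3)
```

On this input the search expands ∅, {v1}, {v3}, and {v3, v4} in that order and stops with score 20 for indices (2, 3, 4), that is {v3, v4, v5}.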
A DP Based Approach
• The diversity graph may contain many disconnected components.
• It is costly to apply the A* algorithm on the whole diversity graph.
• Combine the results of disconnected components using operator ⊕, based on Dynamic Programming (DP).
Dynamic Programming
• Suppose G contains two disconnected components G1 and G2.
• State G.s_i: the optimal score of the diversified top-i results on G
• State transition equation:
$$G.s_i = \max_{0 \le j \le i} \{\, G_1.s_j + G_2.s_{i-j} \,\}$$
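The state transition can be written as a small function over score tables. One convention here is an assumption: a size for which a component has no solution is stored as `None` rather than 0, so that infeasible sizes cannot be combined by accident. The input tables reuse the numbers from the example in this talk, reading its zero entries at i = 4 and i = 5 as "no solution of that size".

```python
def oplus(g1, g2):
    """Combine the score tables of two disconnected components.

    g1[i] / g2[i] hold the best score using exactly i results, or None if infeasible.
    """
    size = len(g1)
    out = [None] * size
    for i in range(size):
        options = [g1[j] + g2[i - j]
                   for j in range(i + 1)
                   if g1[j] is not None and g2[i - j] is not None]
        out[i] = max(options) if options else None
    return out

g1 = [0, 10, 18, 20, None, None]
g2 = [0, 10, 18, 22, None, None]
g = oplus(g1, g2)
```

For instance, the entry for i = 5 comes from taking 2 results in the first component and 3 in the second (18 + 22 = 40), which beats the 3 + 2 split (20 + 18 = 38).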
A DP Based Approach: An Example
[Figure: a diversity graph G made of two disconnected components G1 and G2.]
i    G1.s_i    G2.s_i    G.s_i = (G1 ⊕ G2).s_i
0    0         0         0
1    10        10        10
2    18        18        20
3    20        22        28
4    0         0         36
5    0         0         40
A Cut Point Based Approach
Cut point of graph G
• Suppose G is a connected graph.
• A cut point is a point whose removal makes G disconnected.
G can be further decomposed using cut points. Suppose c is a cut point of G; there are two situations:
• G.ex(c): c is excluded in the final solution. After removing c, G becomes several disconnected components.
• G.in(c): c is included in the final solution. After removing c and all of c’s adjacent nodes, G becomes several disconnected components. Add c to each result in G.in(c).
• G.ex(c) and G.in(c) are combined using operator ⊗ to compute G.
Let c be a cut point of G.
• Let G1 be the solution by excluding c.
• Let G2 be the solution by including c.
• G1 and G2 are mutually exclusive.
• G.s_i: the optimal score of the diversified top-i results on G
• Calculating G = G1 ⊗ G2:
$$G.s_i = \max\{\, G_1.s_i,\ G_2.s_i \,\}$$
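Because the two cases are mutually exclusive, combining them is an entry-wise maximum. The sketch below assumes the same table layout as the DP step (one best score per solution size) and uses `None` for an infeasible size, which is a convention introduced here. The two input tables are the ones shown in the cut point example of this talk; which of them is the "excluded" case and which the "included" case is inferred from the values.

```python
def otimes(excluded, included):
    """Entry-wise best of the 'cut point excluded' and 'cut point included' tables."""
    out = []
    for a, b in zip(excluded, included):
        feasible = [s for s in (a, b) if s is not None]
        out.append(max(feasible) if feasible else None)
    return out

g_ex = [0, 10, 20, 28, 36, 40]   # assumed to be the table with the cut point excluded
g_in = [0, 13, 23, 33, 36, 39]   # assumed to be the table with the cut point included
g = otimes(g_ex, g_in)
```

Including the cut point wins for small sizes (13 against 10 at i = 1), while excluding it wins at i = 5 (40 against 39), so neither case can be dropped in advance.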
A Cut Point Based Approach: handling multiple cut points
Step 1: Construct a cut-point tree (cptree)
• Each node is associated with a cut point (a leaf node is associated with a virtual cut point).
• Each edge is associated with a subgraph that connects two cut points (the subgraph can be empty or disconnected).
• [Figure: a sample cptree with nodes c0 to c6 and edge subgraphs G1 to G6.]
Step 2: Search the cptree in a bottom-up fashion
A Cut Point Based Approach: An Example
[Figure: subgraphs G1, G2, G3, and G4, the intermediate results G12 and G34, the final result G, and the cut points c12, c34, and c24.]
• Suppose the results for G1, G2, G3, and G4 have been computed. We now compute G12 and G34.
• Computing G12: in Case 1 the cut point is excluded; in Case 2 the cut point is included, and its adjacent nodes are first removed from the subgraphs. The two cases are then combined. G34 can be computed similarly.
• Computing G: again, Case 1 excludes the cut point and Case 2 includes it. Do not forget to add the cut point to all the results of the included case.
A Cut Point Based Approach: An Example
[Figure: a diversity graph G with highlighted nodes w2 to w6, decomposed into subgraphs G1, G23, G3, and G4.]
i    G.in(w2).s_i    G.ex(w2).s_i    G.s_i = (G.in(w2) ⊗ G.ex(w2)).s_i
0    0               0               0
1    13              10              13
2    23              20              23
3    33              28              33
4    36              36              36
5    39              40              40
Further Improvements
• Example: a node can be removed from G if there exists another node that satisfies the removal condition.
• After the removal, further nodes become cut points.
[Figure: the graph G before the removal and the graph G′ after it, with nodes w1 to w6 marked.]
Performance Studies
Experimental Setup
• We use 2 real datasets: Enwiki and Reuters
  Enwiki: 11,930,681 articles from English Wikipedia
  Reuters: 21,578 news articles from Reuters
• Query: a set of keywords. Answer: top-K documents.
• We compare three algorithms
  div-astar: A* based approach
  div-dp: dynamic programming based approach
  div-cut: cut point based approach
• We vary 3 parameters
  K (two groups). Small K: 40, 80, 120, 160, 200, default 120. Large K: 500, 700, 900, 1300, 2000, default 900.
  Similarity threshold τ: 0.4, 0.5, 0.6, 0.7, 0.8, default 0.6
  Keyword frequency: 5 levels (1, 2, 3, 4, 5), default 3
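Score function: given a query Q and a document d,
$$score(Q, d) = \sum_{q \in Q} \frac{tf(q, d) \times idf(q)}{\sqrt{len(d)}}$$
• tf(q, d) is the term frequency of keyword q in d
• idf(q) is the inverse document frequency of q for the dataset
• len(d) is the total number of words in d
Similarity function: given two documents d1 and d2,
$$sim(d_1, d_2) = \frac{\sum_{w \in d_1 \cap d_2} idf(w)}{\sum_{w \in d_1 \cup d_2} idf(w)}$$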
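Both functions are straightforward to compute once idf values are available. In this sketch documents are lists of words, tf is taken to be the raw count of a keyword in the document, and idf is passed in as a precomputed dictionary because its exact formula is not given here; the sample words and idf values are hypothetical.

```python
import math
from collections import Counter

def score(query, doc_words, idf):
    """Sum of tf * idf over the query keywords, normalized by sqrt(document length)."""
    tf = Counter(doc_words)
    return sum(tf[q] * idf.get(q, 0.0) for q in query) / math.sqrt(len(doc_words))

def sim(d1_words, d2_words, idf):
    """idf-weighted overlap: shared words' idf mass over the idf mass of all words."""
    a, b = set(d1_words), set(d2_words)
    union_mass = sum(idf.get(w, 0.0) for w in a | b)
    if union_mass == 0:
        return 0.0
    return sum(idf.get(w, 0.0) for w in a & b) / union_mass

idf = {"graph": 2.0, "pattern": 3.0, "mining": 1.0, "search": 1.0}   # hypothetical values
d1 = ["mining", "graph", "pattern", "graph"]
d2 = ["graph", "search"]
```

Here `score(["graph", "pattern"], d1, idf)` is (2 × 2.0 + 1 × 3.0) / √4 = 3.5, and `sim(d1, d2, idf)` is 2.0 / 7.0, since "graph" is the only shared word.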
[Figure: Performance Studies, varying K on Enwiki, with panels for small K and large K.]
Conclusion
• We study diversified ranking.
• We study the diversified top-K search problem.
• The diversity uses only the similarity of the search results themselves.
• We propose a framework such that most top-K algorithms can be easily extended to handle diversified top-K search by applying it.
APWeb 2013 in Sydney, Australia
• The 15th International Asia-Pacific Web Conference (APWeb), 4-6 April, 2013, Sydney, Australia
• Just before ICDE 2013.
• Paper Submission Deadline: October 20.
• Three Keynote Speakers: H.V. Jagadish (University of Michigan), Dan Suciu (University of Washington), Mark Sanderson (RMIT)
• A Special Issue on WWW Journal
Research Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]
Research Postgraduate Programs
• M.Phil, PhD, M.Phil-PhD (Articulated)
• Deadlines: December 1, 2012 (First Round); January 31, 2013 (Official Final Round). But, due to Chinese New Year, submit early, before January 20.
• Postgraduate Studentship: HK$13,600 per month (non-taxable)
• Current Tuition Fees: HK$42,100/year
Hong Kong PhD Fellowship Scheme 2013-2014 (135 positions in HK)
• Deadline: December 1, 2012
• Monthly stipend of HK$20,000
• 10,000 travel allowance
• Current Tuition Fees: HK$42,100/year
Taught Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]
Taught Postgraduate Programmes
• MSc Programme in SEEM (Systems Engineering and Engineering Management)
• MSc Programme in ECLT (E-Commerce and Logistics Technologies)
• Current Tuition Fees: (Provisional) HK$128,000
• Full-Time One-Year study in HK
• Application deadline: 1st Round: January 15, 2013; 2nd Round: March 15, 2013
• Early applications are encouraged; offers may be made to eligible applicants well before March 15.
Thank you! Questions?