DBSocial 2013, New York
Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions
Foteini Alvanaki, Sebastian Michel
Excellence Cluster on Multimodal Computing and Interaction (MMCI)
Motivation: enBlogue (1)
[Figure: a stream of documents annotated with hash-tag sets, e.g. {#flood, #Lourdes}, {#NSA, #Orwell}, {#Kim, #Baby}, {#Bieber, #NBAFinals}, {#obamainBerlin, #Merkel}]
• enBlogue: Identifies emergent topics
• Input: A stream of documents annotated with hash-tags (e.g. Tweets)
• Restricts the focus to the more recent documents using a time sliding window
Motivation: enBlogue (2)
[Figure: the same stream of tag-annotated documents; the pair {#Kim, #Baby} is highlighted and its correlation is plotted over time]
• Tracks the correlation of co-occurring hash-tags over time
• Reports on unexpected changes in the correlation
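Jaccard Coefficient
• $T$: A set containing the document ids annotated with tag $t$
• Pair of tags $\{t_1, t_2\}$:
  $J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
• Set of $n$ tags $\{t_1, t_2, \ldots, t_n\}$:
  $J(t_1, \ldots, t_n) = \frac{\left|\bigcap_{i=1}^{n} T_i\right|}{\left|\bigcup_{i=1}^{n} T_i\right|}$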
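As a minimal sketch of the definitions above, assuming each tag's document ids fit in an in-memory Python set (the function name `jaccard` and the sample ids are illustrative and not taken from the talk):

```python
from functools import reduce


def jaccard(*doc_id_sets):
    """Jaccard coefficient of one or more sets of document ids."""
    if not doc_id_sets:
        raise ValueError("at least one set is required")
    intersection = reduce(set.intersection, doc_id_sets)
    union = reduce(set.union, doc_id_sets)
    # Convention chosen for this sketch: tags without documents give J = 0.
    return len(intersection) / len(union) if union else 0.0


# Hypothetical document ids for three tags.
T_kim = {1, 2, 3, 4}
T_baby = {2, 3, 4, 5}
T_kanye = {1, 2}

pair = jaccard(T_kim, T_baby)             # |{2, 3, 4}| / |{1, ..., 5}|
triple = jaccard(T_kim, T_baby, T_kanye)  # |{2}| / |{1, ..., 5}|
```

Holding the full id sets is only practical for small inputs; the following slides replace them with counters.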
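Jaccard Coefficient Computation
• Maintain counters for all subsets of co-occurring tags
• Document {a, b, c}: subsets {a, b}, {a, c}, {b, c}, {a, b, c}
• Document {b, c, d}: additional subsets {c, d}, {b, d}, {b, c, d}
• Counters per subset (union and intersection):
  – $|A \cup B|$, $|A \cap B|$
  – $|A \cup C|$, $|A \cap C|$
  – $|B \cup C|$, $|B \cap C|$
  – $|C \cup D|$, $|C \cap D|$
  – $|B \cup D|$, $|B \cap D|$
  – $|A \cup B \cup C|$, $|A \cap B \cap C|$
  – $|B \cup C \cup D|$, $|B \cap C \cap D|$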
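The subset enumeration on this slide can be sketched as follows. This is an illustration only: the helper name `co_occurring_subsets` is hypothetical, and only the intersection counters are maintained, since a union counter would also need documents that carry just one of the tags.

```python
from collections import Counter
from itertools import combinations


def co_occurring_subsets(doc_tags):
    """All subsets with at least two tags of one document's tag set."""
    tags = sorted(doc_tags)
    return [frozenset(c)
            for size in range(2, len(tags) + 1)
            for c in combinations(tags, size)]


# Intersection counters only (assumption of this sketch, see lead-in).
intersection_counts = Counter()
for doc in [{"a", "b", "c"}, {"b", "c", "d"}]:
    for subset in co_occurring_subsets(doc):
        intersection_counts[subset] += 1
```

The number of subsets grows exponentially with the size of a tag set, which is tolerable here because the tag sets are short.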
Inclusion – Exclusion Principle
• Compute the cardinality of the union of n sets using the cardinalities of the intersections of all its subsets:
  $|X \cup Z| = |X| + |Z| - |X \cap Z|$
  $\left|\bigcup_{i=1}^{n} T_i\right| = \sum_{k=1}^{n} (-1)^{k+1} \left( \sum_{1 \le i_1 < \cdots < i_k \le n} \left|T_{i_1} \cap \cdots \cap T_{i_k}\right| \right)$
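A short sketch of how the principle turns intersection counters into a union size and a Jaccard coefficient. The dictionary layout (frozensets of tags as keys, single tags as one-element frozensets) and the counter values are assumptions made for illustration.

```python
from itertools import combinations


def union_size(tags, inter):
    """Size of the union of the tags' document sets via inclusion-exclusion.

    `inter` maps a frozenset of tags to the number of documents annotated
    with all of them; a missing key is treated as 0.
    """
    tags = sorted(tags)
    total = 0
    for k in range(1, len(tags) + 1):
        sign = 1 if k % 2 == 1 else -1          # (-1)^(k+1)
        total += sign * sum(inter.get(frozenset(c), 0)
                            for c in combinations(tags, k))
    return total


def jaccard_from_counters(tags, inter):
    union = union_size(tags, inter)
    return inter.get(frozenset(tags), 0) / union if union else 0.0


# Hypothetical counters for the tags a, c and d.
inter = {
    frozenset("a"): 4, frozenset("c"): 5, frozenset("d"): 3,
    frozenset("ac"): 2, frozenset("ad"): 1, frozenset("cd"): 2,
    frozenset("acd"): 1,
}
size = union_size("acd", inter)          # 4 + 5 + 3 - 2 - 1 - 2 + 1
```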
Inclusion – Exclusion Principle: Advantages
• Needs to maintain fewer counters
• Adapts more easily to changes in the load
[Figure: for the documents {a, b, c} and {b, c, d}, the union counters $|A \cup B|$, $|A \cup C|$, $|B \cup C|$, $|C \cup D|$, $|B \cup D|$, $|A \cup B \cup C|$, $|B \cup C \cup D|$ are replaced by the single-tag counters $|A|$, $|B|$, $|C|$, $|D|$, while the intersection counters are kept; a new document $d' = \{a, d\}$ adds the counter $|A \cap D|$]
Problem
• For each subset $\{t_1, t_2, \ldots, t_n\}$ of co-occurring tags, maintain:
  – Number of documents annotated with each tag: $|T_i|$
  – Number of documents annotated with all tags: $\left|\bigcap_{i=1}^{n} T_i\right|$
• A large number of co-occurring tag sets
• New documents arrive fast, changing the numbers
Solution: Let multiple nodes compute the Jaccard coefficient for different tag sets
Outline
• Motivation
  – enBlogue
  – Jaccard Coefficient
  – Inclusion – Exclusion Principle
  – Problem
• Idea
  – Architecture
  – Partition Tags
  – Updating Counters
• Results
  – Theoretical Results
  – Experimental Results
• Conclusion
Architecture
[Figure: two groups of nodes: nodes computing the partitions, and nodes computing the Jaccard coefficients]
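Partition Tags: Requisites
1. Treat tag-sets as inseparable units
2. Minimise the overlap of single tags tracked by different nodes
[Figure: tag-sets {a, b}, {c, d} and {a, c, d}. Node N1: {a, b} keeps $|A|$, $|B|$, $|A \cap B|$; node N2: {c, d} keeps $|C|$, $|D|$, $|C \cap D|$. Placing {a, c, d} requires $|A|$, $|A \cap C|$, $|A \cap D|$ next to $|C|$, $|D|$, $|C \cap D|$]
$J(a, b) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$
$J(c, d) = \frac{|C \cap D|}{|C| + |D| - |C \cap D|}$
$J(a, c, d) = \frac{|A \cap C \cap D|}{|A| + |C| + |D| - |A \cap C| - |A \cap D| - |C \cap D| + |A \cap C \cap D|}$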
Partition Tags: Algorithm
Phase 1: Create an initial assignment of the tags to the nodes
  – Max-k cover: selects k out of n sets that cover the maximum number of elements
Phase 2: Make sure all sets of tags are assigned to some node
Partition Tags: Example
• d1 = {a, b, c}
• d2 = {b, c}
• d3 = {a, b, f}
• d4 = {d, e, g}
• d5 = {a, d, e}
PHASE 1: MAX-2 COVER
• {a, b, c}
• {a, d, e}
PHASE 2: ASSIGNING REMAINING SETS
• {a, b, f} joins {a, b, c}
• {d, e, g} joins {a, d, e}
Resulting partitions: {a, b, c, f} and {a, d, e, g}
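Phase 2 of this example can be replayed with a small sketch. The slides do not state how a remaining tag-set picks its node, so the largest-overlap rule used here (ties go to the lowest node index) is an assumption, as is the function name `assign_remaining`. The phase 1 result is taken from the slide as given.

```python
def assign_remaining(seeds, tag_sets):
    """Phase 2 sketch: put every tag-set on the node it overlaps most.

    `seeds` are the tag-sets selected in phase 1, one per node.
    """
    partitions = [set(s) for s in seeds]
    assignment = {}
    for tag_set in tag_sets:
        # max() returns the first maximum, so ties go to the lowest index.
        best = max(range(len(partitions)),
                   key=lambda i: len(partitions[i] & tag_set))
        partitions[best] |= tag_set      # the tag-set stays one unit
        assignment[frozenset(tag_set)] = best
    return partitions, assignment


docs = [{"a", "b", "c"}, {"b", "c"}, {"a", "b", "f"},
        {"d", "e", "g"}, {"a", "d", "e"}]
seeds = [{"a", "b", "c"}, {"a", "d", "e"}]   # phase 1 result from the slide
partitions, assignment = assign_remaining(seeds, docs)
```

Tag a ends up on both nodes, which is the kind of overlap the second requisite tries to keep small.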
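Update Counters
• N1: {a, b, c, d} with counters $|A|$, $|B|$, $|A \cap B|$, $|C|$, $|D|$, $|C \cap D|$
• N2: {b, e, f} with counters $|B|$, $|E|$, $|F|$, $|B \cap E|$, $|B \cap F|$, $|E \cap F|$, $|B \cap E \cap F|$
• d4 = {c, d}
  – N1: $|C|$++, $|D|$++, $|C \cap D|$++
• d5 = {b, f}
  – N1: $|B|$++
  – N2: $|B|$++, $|F|$++, $|B \cap F|$++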
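One way a node could apply such updates is sketched below. The `Node` class and its layout are hypothetical: each node knows the tag-sets assigned to it and increments, once per document, every counter whose tags all appear in the document and all belong to one tracked tag-set.

```python
from collections import Counter
from itertools import combinations


class Node:
    def __init__(self, tag_sets):
        self.tag_sets = [frozenset(s) for s in tag_sets]
        self.counters = Counter()   # frozenset of tags -> document count

    def update(self, doc_tags):
        doc_tags = frozenset(doc_tags)
        touched = set()
        for tracked in self.tag_sets:
            common = sorted(tracked & doc_tags)
            for size in range(1, len(common) + 1):
                touched.update(frozenset(c)
                               for c in combinations(common, size))
        # Collecting keys in a set first avoids counting a document twice
        # when two tracked tag-sets share a tag.
        for key in touched:
            self.counters[key] += 1


n1 = Node([{"a", "b"}, {"c", "d"}])
n2 = Node([{"b", "e", "f"}])
n1.update({"c", "d"})            # d4 reaches N1 only
for node in (n1, n2):
    node.update({"b", "f"})      # d5 reaches both nodes through tag b
```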
Finding nodes
Inverted Index (tag : nodes tracking it)
• a : {N1, N2}
• b : {N1}
• c : {N1}
• d : {N2}
• e : {N2}
• f : {N1}
• g : {N2}
d2000 = {a, c} ⇒ {N1, N2} ∪ {N1} = {N1, N2}
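The lookup amounts to a union over the index entries of the document's tags. A minimal sketch, with the index held in a plain dictionary and tags that no node tracks simply ignored (both are assumptions of this illustration):

```python
def nodes_for(doc_tags, inverted_index):
    """Nodes that must receive a document: union of its tags' node sets."""
    nodes = set()
    for tag in doc_tags:
        nodes |= inverted_index.get(tag, set())
    return nodes


inverted_index = {
    "a": {"N1", "N2"}, "b": {"N1"}, "c": {"N1"}, "d": {"N2"},
    "e": {"N2"}, "f": {"N1"}, "g": {"N2"},
}
targets = nodes_for({"a", "c"}, inverted_index)   # the d2000 example
```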
Theoretic expectation
$E[\text{affected nodes}] = k \cdot \left[ 1 - \left[ \frac{\binom{v-m}{m}}{\binom{v}{m}} \right]^{n/k} \right]$
• k partitions
• v total tags (vocabulary)
• m randomly selected tags per set
• n total tag-sets
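The expectation can be evaluated numerically. The sketch below assumes the formula reads k · (1 − (C(v−m, m) / C(v, m))^(n/k)), where C(v−m, m) / C(v, m) is the probability that two random m-tag sets share no tag and each partition holds n/k tag-sets. The function name is illustrative.

```python
from math import comb


def expected_affected_nodes(k, v, m, n):
    """Expected number of partitions sharing a tag with a new m-tag set."""
    # Probability that two random m-tag sets drawn from v tags are disjoint.
    p_disjoint = comb(v - m, m) / comb(v, m)
    # A partition is unaffected if all of its n/k tag-sets are disjoint
    # from the new set (treated as independent events).
    return k * (1.0 - p_disjoint ** (n / k))


# Setting from the talk's theoretical results: 10 partitions, 1,000,000 tags.
# The values of m and n below are hypothetical.
estimate = expected_affected_nodes(k=10, v=1_000_000, m=3, n=100_000)
```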
Theoretical Results
• Partitions: 10
• Vocabulary Size: 1,000,000
Real Data Experiments
• Dataset: Tweets of 15th March 2013
• Partitions: 10
Conclusion
• An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream
• Applicable to all measures using intersections and/or unions of sets (e.g. Dice)
• Results show small replication
• Load equally distributed to the nodes
Thank you!