How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan,...
Transcript of How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan,...
![Page 1: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/1.jpg)
How to Summarize the Universe: Dynamic Maintenance of Quantiles
Gilbert, Kotidis, Muthukrishnan, Strauss
Presented by Itay Malinger
December 2003
![Page 2: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/2.jpg)
Problem Definition
► The Universe: $U = \{0, \ldots, |U|-1\}$
► Number of records in data set: $\|A\| = N$
► Data set can be thought of as an array: A[i] – number of records with value i
► $A_S$ – number of records with values in S
► The Φ-quantiles of an ordered sequence of N data items are the values with rank $k\Phi N$, for $k = 1, 2, \ldots, 1/\Phi$
► Our goal is computing ε-approximate Φ-quantiles – find a $j_k$ such that:
$$(k\Phi - \varepsilon)N \le \sum_{i \le j_k} A[i] \qquad \text{and} \qquad \sum_{i < j_k} A[i] \le (k\Phi + \varepsilon)N$$
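As a point of reference for the definition above, the exact (ε = 0) quantiles can be read directly off the count array. This is only a sketch of the definition, not of the RSS algorithm; the function name `exact_quantiles` and the sample counts are hypothetical.

```python
from itertools import accumulate

def exact_quantiles(A, phi):
    """Return j_k for k = 1..1/phi: the smallest value whose prefix count reaches k*phi*N."""
    N = sum(A)
    prefix = list(accumulate(A))      # prefix[j] = A[0] + ... + A[j]
    num_quantiles = round(1 / phi)
    result = []
    for k in range(1, num_quantiles + 1):
        target = k * phi * N          # rank of the k-th phi-quantile
        j = next(j for j, count in enumerate(prefix) if count >= target)
        result.append(j)
    return result

# Hypothetical data set over U = {0, ..., 7} with N = 12 records.
A = [2, 0, 3, 1, 0, 4, 0, 2]
quartiles = exact_quantiles(A, 0.25)
```

This needs the whole array A, so it costs O(|U|) space, which is what the summary structure in the following slides is meant to avoid.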
![Page 3: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/3.jpg)
[Figure: histogram of A[i] (vertical axis, 0 to 12) over the universe U (horizontal axis, 1 … |U|)]
![Page 4: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/4.jpg)
Transactions
► Insert(i): A[i] ← A[i] + 1
► Delete(i): A[i] ← A[i] – 1
► Let $N_t = \sum_i A_t[i]$
► ASSUME: The universe size |U| is known
![Page 5: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/5.jpg)
The Main Algorithmic Result
► The RSS Algorithm
► Space complexity: $O\left(\log^2(|U|) \cdot \log(\log(|U|)/\delta) / \varepsilon^2\right)$
► Update: in every transaction, in O(space) time
► Estimation: on demand, in O(space) time
► One pass over the data
![Page 6: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/6.jpg)
Dyadic Intervals
► log(|U|)+1 resolution levels j
► 2|U|-1 dyadic intervals
$$I_{j,k} = \left[k \cdot 2^{\log(|U|)-j},\ (k+1) \cdot 2^{\log(|U|)-j} - 1\right], \qquad I_{0,0} = U, \qquad I_{\log(|U|),i} = \{i\}$$
0 1 2 3 4 5 6 7
I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
I(2,0) I(2,1) I(2,2) I(2,3)
I(1,0) I(1,1)
I(0,0)
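The interval formula can be checked with a few lines of code. This is a minimal sketch, assuming |U| is a power of two and writing `log_u` for log(|U|); both function names are hypothetical.

```python
def dyadic_interval(j, k, log_u):
    """Return the inclusive bounds (low, high) of I(j, k) in a universe of size 2**log_u."""
    width = 1 << (log_u - j)          # each level-j interval covers 2^(log|U| - j) values
    return (k * width, (k + 1) * width - 1)

def all_dyadic_intervals(log_u):
    """List every (j, k) pair: level j has 2**j intervals."""
    return [(j, k) for j in range(log_u + 1) for k in range(1 << j)]
```

For |U| = 8 this reproduces the tree on the slide: `dyadic_interval(2, 2, 3)` covers the values 4 and 5, and there are 2|U|-1 = 15 intervals in total.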
![Page 7: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/7.jpg)
Arbitrary intervals
► Any interval can be written as a disjoint union of dyadic intervals: at most log(|U|) of them for an interval starting at 0, and at most 2·log(|U|) in general
► For example A[0,6] = I(1,0)+I(2,2)+I(3,6)
► Intervals starting at 0 will not use the same resolution twice
0 1 2 3 4 5 6 7
I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
I(2,0) I(2,1) I(2,2) I(2,3)
I(1,0) I(1,1)
I(0,0)
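The decomposition of an interval starting at 0 follows the binary digits of its length, which is why no resolution is used twice. A minimal sketch, assuming |U| = 2**log_u; the function name `decompose_prefix` is hypothetical.

```python
def decompose_prefix(end, log_u):
    """Split [0, end] into disjoint dyadic intervals, returned as (j, k) pairs, coarsest first."""
    length = end + 1
    pieces = []
    start = 0
    for j in range(log_u + 1):
        width = 1 << (log_u - j)
        if length & width:            # this power of two appears in the length
            pieces.append((j, start // width))
            start += width
    return pieces

example = decompose_prefix(6, 3)      # the slide's example A[0,6] with |U| = 8
```

Each level contributes at most one piece because each binary digit of the length is either 0 or 1.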
![Page 8: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/8.jpg)
Computing quantiles
► Assuming we have the number of records in each dyadic interval, we can efficiently compute the number of records in any arbitrary interval in A.
► To compute the Φ-quantiles, for each k we need a $j_k$ s.t.:
$$A[0, j_k) < k\Phi N \le A[0, j_k + 1)$$
► Use binary search to find it.
► Keeping all intervals is costly (O(|U|))
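The binary search can be sketched with exact dyadic counts, before any randomization is introduced. This is an illustration under stated assumptions: |U| = 2**log_u, the counts are stored in a plain dictionary (the costly O(|U|) option), and all function names are hypothetical.

```python
def build_dyadic_counts(A, log_u):
    """Exact number of records in every dyadic interval I(j, k)."""
    counts = {}
    for j in range(log_u + 1):
        width = 1 << (log_u - j)
        for k in range(1 << j):
            counts[(j, k)] = sum(A[k * width:(k + 1) * width])
    return counts

def prefix_count(counts, end, log_u):
    """Number of records with value in [0, end), summed over at most log|U| + 1 dyadic pieces."""
    total, start = 0, 0
    for j in range(log_u + 1):
        width = 1 << (log_u - j)
        if end & width:
            total += counts[(j, start // width)]
            start += width
    return total

def find_quantile(counts, target, log_u):
    """Binary search for j with A[0, j) < target <= A[0, j + 1); assumes 0 < target <= N."""
    lo, hi = 0, (1 << log_u) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(counts, mid + 1, log_u) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

A = [2, 0, 3, 1, 0, 4, 0, 2]          # hypothetical counts, N = 12
counts = build_dyadic_counts(A, 3)
median = find_quantile(counts, 0.5 * sum(A), 3)
```

The search makes about log(|U|) prefix queries, each touching at most one dyadic interval per level.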
![Page 9: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/9.jpg)
Random Subset Sums
► In case j = log(|U|)
► Let S be a subset of U
► Each u ∈ U has p=½ of being in S
► E(|S|) = ½|U|
► Define: $A_S = \sum_{i \in S} A[i]$
► E(||A_S||) = ½||A|| = ½N
![Page 10: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/10.jpg)
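Estimating A[i]
$$E[A_S \mid i \in S] = A[i] + \tfrac{1}{2} A_{U \setminus \{i\}}$$
$$\begin{aligned}
E\left(2A_S - \|A\| \mid i \in S\right) &= E\left[2(A_S \mid i \in S)\right] - E[\|A\|] \\
&= 2\left(A[i] + \tfrac{1}{2} A_{U \setminus \{i\}}\right) - \|A\| \\
&= A[i] + A[i] + A_{U \setminus \{i\}} - \|A\| \\
&= A[i]
\end{aligned}$$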
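The unbiasedness of the estimate 2·A_S − ||A|| can be confirmed by brute force on a tiny universe: averaging over every subset S that contains i gives exactly A[i]. This is a check of the derivation with truly random subsets, not of the pseudorandom construction used later; the function name and sample counts are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def conditional_mean_estimate(A, i):
    """Average of 2*A_S - N over all subsets S of U that contain i (each equally likely)."""
    others = [u for u in range(len(A)) if u != i]
    N = sum(A)
    total = Fraction(0)
    num_sets = 0
    for r in range(len(others) + 1):
        for rest in combinations(others, r):
            a_s = A[i] + sum(A[u] for u in rest)   # A_S for S = {i} plus the chosen rest
            total += 2 * a_s - N
            num_sets += 1
    return total / num_sets

A = [3, 0, 5, 1]                       # hypothetical counts over U = {0, 1, 2, 3}
means = [conditional_mean_estimate(A, i) for i in range(len(A))]
```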
![Page 11: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/11.jpg)
Improvement
► Instead of keeping random sets of point dyadic intervals only, keep random sets of all resolutions
► We need a method of keeping a random set of j-resolution dyadic intervals (keeping it explicitly is O(|U|))
► Instead of keeping the sets, keep a small representation of them
![Page 12: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/12.jpg)
Pseudorandom set generator
► We need to keep a small representation of a random set S (each i ∈ U is in S with p = ½)
► Given a seed of size log(|U|)+1
► Represent a set S of size O(|U|)
► Quickly test if i ∈ S or not
► Use Extended Hamming Code
![Page 13: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/13.jpg)
Extended Hamming Code
► Given a seed, tells whether i ∈ S
► For example: |U| = 8, seed size: log|U|+1 = 4
G(seed, i) = seed × i'th column mod 2
$$\begin{pmatrix} 1 & 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1&0&1&0&1&0&1&0 \\ 1&1&0&0&1&1&0&0 \\ 1&1&1&1&0&0&0&0 \\ 1&1&1&1&1&1&1&1 \end{pmatrix} \bmod 2 \quad \sim \quad S = \{0, 2, 5, 7\}$$
► Efficient to compute
► 3-wise independent
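The membership test G(seed, i) can be sketched as an inner product mod 2. This sketch assumes that column i of the matrix holds the binary digits of i (lowest bit in the first row) followed by a constant 1, which is one reading of the matrix on the slide; the function name `in_set` is hypothetical.

```python
def in_set(seed_bits, i, log_u):
    """Return True if i belongs to the set S generated by the seed.

    seed_bits has log_u + 1 entries; column i is assumed to be
    (bit 0 of i, bit 1 of i, ..., bit log_u - 1 of i, 1).
    """
    column = [(i >> b) & 1 for b in range(log_u)] + [1]
    return sum(s * c for s, c in zip(seed_bits, column)) % 2 == 1

# The slide's example: |U| = 8, seed (1, 0, 1, 1).
S = [i for i in range(8) if in_set((1, 0, 1, 1), i, 3)]
```

Only the log(|U|)+1 seed bits are stored; the set itself is never materialized, and each membership test costs O(log |U|) bit operations.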
![Page 14: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/14.jpg)
The Data Structure
► For each resolution level j keep num_copies random subsets S of all dyadic intervals in that level (we only keep the representation seed), where
$$\text{num\_copies} = 24 \cdot \log(\log(|U|)/\delta) \cdot \log(|U|) / \varepsilon^2$$
► Keep $A_S = \sum_{i \in S} A[i]$
► Maintain N = ||A||
► We got $S_1, \ldots, S_{\text{num\_copies}}$ per level
![Page 15: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/15.jpg)
Upon Transactions
► Insert(i) / Delete(i): for each resolution level j
► Locate the single $I_{j,k}$ into which i falls (high-order binary bits)
► Determine all $S_\ell$ containing $I_{j,k}$
► For each $S_\ell$ increase/decrease $\|A_{S_\ell}\|$ by 1
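The update steps above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the class name `RSSSketch` is hypothetical, seeds are drawn with Python's `random` module, and membership is tested as the parity of the seed ANDed with the bits of k followed by a constant 1 (an extended-Hamming-style product).

```python
import random

class RSSSketch:
    def __init__(self, log_u, num_copies, rng):
        self.log_u = log_u
        self.n = 0                                   # N = ||A||
        # One (j + 1)-bit seed per copy and level; level j has 2**j intervals.
        self.seeds = [[rng.getrandbits(j + 1) for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        # sums[j][l] holds ||A_S|| for the l-th random set of level j.
        self.sums = [[0] * num_copies for _ in range(log_u + 1)]

    def contains(self, j, l, k):
        """Is interval I(j, k) in the l-th random set of level j?"""
        column = (k << 1) | 1                        # bits of k plus a constant 1
        return bin(self.seeds[j][l] & column).count("1") % 2 == 1

    def update(self, i, delta):
        """Insert(i) is update(i, +1); Delete(i) is update(i, -1)."""
        self.n += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)                # high-order bits of i locate I(j, k)
            for l in range(len(self.sums[j])):
                if self.contains(j, l, k):
                    self.sums[j][l] += delta

sketch = RSSSketch(log_u=3, num_copies=5, rng=random.Random(0))
for value in [1, 5, 5, 7]:
    sketch.update(value, +1)
sketch.update(5, -1)
```

Each update touches every counter once, so its cost is proportional to the space used.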
![Page 16: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/16.jpg)
Estimating Quantiles: Dyadic Intervals
► Given a dyadic interval $I = I_{j,k}$
► There are num_copies sets of resolution j:
$$\text{num\_copies} = \underbrace{3\log(\log(|U|)/\delta)}_{G} \cdot \underbrace{8\log(|U|)/\varepsilon^2}_{E}$$
► Quickly test each $S_\ell$ and check if $I \in S_\ell$, and if so estimate
$$\tilde{A}_I = 2A_S - \|A\|, \qquad I \in S$$
► Group all estimations into G groups of E elements
► For each group g calculate the average of all estimations, $A_{g,j,k}$
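To get a feel for the sizes involved, the values of G, E, and num_copies can be computed for concrete parameters. The slide does not state the base of the logarithms or how fractional values are rounded; this sketch assumes base-2 logarithms and rounding up, and the function name is hypothetical.

```python
import math

def rss_parameters(universe_size, eps, delta):
    """Return (G, E, num_copies), assuming base-2 logs and rounding up."""
    log_u = math.log2(universe_size)
    G = math.ceil(3 * math.log2(log_u / delta))   # number of groups (median step)
    E = math.ceil(8 * log_u / eps ** 2)           # estimates per group (average step)
    return G, E, G * E

# Hypothetical setting: |U| = 2**16, eps = 0.25, delta = 0.25.
G, E, num_copies = rss_parameters(2 ** 16, 0.25, 0.25)
```

The product G·E matches the num_copies formula of the data structure, since 3 · 8 = 24.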
![Page 17: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/17.jpg)
Estimating Quantiles: Arbitrary intervals
► Given an interval I, write it as a disjoint union of at most log(|U|) dyadic intervals $I_{j,k}$
► Form G groups and, for each group g, calculate the sum of the averages $A_{g,j,k}$ over all $I_{j,k}$ comprising I.
► Take the median of all G groups as the final estimate of $A_I$
► It's more convenient to refer to the result as an overestimate: $|A_I| \le |\tilde{A}_I| \le |A_I| + \varepsilon N$
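The average-then-sum-then-median combination can be sketched on its own. This sketch assumes that every dyadic interval in the decomposition already has exactly G·E individual estimates available, listed group by group; the function name and the numbers are hypothetical.

```python
from statistics import mean, median

def combine_estimates(estimates_per_interval, G, E):
    """Median over G groups of the summed per-interval group averages.

    estimates_per_interval[p] lists the G * E estimates for the p-th dyadic
    interval, with group g occupying positions g*E .. (g+1)*E - 1.
    """
    group_sums = []
    for g in range(G):
        total = 0.0
        for estimates in estimates_per_interval:
            total += mean(estimates[g * E:(g + 1) * E])   # A_{g,j,k}
        group_sums.append(total)
    return median(group_sums)

# Two dyadic intervals, G = 3 groups of E = 2 estimates each (made-up numbers).
estimates = [
    [10, 12, 8, 10, 30, 10],
    [1, 3, 2, 2, 0, 4],
]
interval_estimate = combine_estimates(estimates, G=3, E=2)
```

Averaging reduces the variance within a group, while the median makes the final value robust to a few groups that are far off, such as the third group here.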
![Page 18: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/18.jpg)
[Figure: example with 3 dyadic intervals, E = 4 elements per group, and G = 3 groups; SUM, AVERAGE, and MEDIAN stages produce the interval's estimate]
![Page 19: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/19.jpg)
Analysis
► Lemma: The algorithm estimates each quantile to within εN with p > 1-δ
► Proof: For a fixed resolution level j, let
$$X_k = \begin{cases} 2A_{I_k}, & I_k \in S \\ 0, & \text{otherwise} \end{cases} \qquad X = \sum_k X_k$$
Then:
$$\begin{aligned}
E[X \mid I_{k_0} \in S] &= 2A_{I_{k_0}} + \sum_{k \ne k_0} E[X_k] \\
&= 2A_{I_{k_0}} + \sum_{k \ne k_0} A_{I_k} \\
&= A_{I_{k_0}} + \|A\|
\end{aligned}$$
$$X = 2A_S, \qquad \tilde{A}_{I_{k_0}} = X - \|A\|, \qquad \operatorname{var}[X \mid I_{k_0} \in S] \le \|A\|^2$$
![Page 20: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/20.jpg)
$$A_{I_{j,k}} = E\left(2A_S - \|A\| \mid I_{j,k} \in S\right)$$
![Page 21: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/21.jpg)
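Analysis (cont.)
For an interval $I = \bigcup_j I_j$ made of γ dyadic intervals, one per resolution level used, let $Y = \sum_j X^{(j)}$, where $X^{(j)} = \sum_k X^{(j)}_k$ is the sum X of level j. Then:
$$E[Y \mid I_j \in S] = A_I + \gamma \|A\|$$
$$A_I = E[Y - \gamma N \mid I_j \in S]$$
$$\operatorname{var}[Y \mid I_j \in S] \le \log|U| \cdot \|A\|^2 = \log|U| \cdot N^2$$
With $E = 8\log(|U|)/\varepsilon^2$ and Z the average of E copies of $Y - \gamma N$:
$$P\left[|Z - A_I| > \varepsilon N\right] \le \frac{\operatorname{var}(Z)}{\varepsilon^2 N^2} = \frac{\operatorname{var}(Y)}{E \varepsilon^2 N^2} \le \frac{N^2 \log|U|}{\varepsilon^2 N^2 \cdot 8\log|U|/\varepsilon^2} = \frac{1}{8}$$
$$P\left[|Z - A_I| \le \varepsilon N\right] \ge \frac{7}{8}$$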
![Page 22: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/22.jpg)
Analysis (cont.)
► We take $G = 3\log(\log(|U|)/\delta)$ copies of Z and take the median, mZ.
► By the Chernoff inequality,
$$P\left[|mZ - A_I| \le \varepsilon N\right] \ge 1 - \delta/\log|U|$$
► The binary search looked for a $j_k$ such that
$$\tilde{A}[0, j_k) < k\Phi N \le \tilde{A}[0, j_k + 1)$$
$$A[0, j_k) \le \tilde{A}[0, j_k) < k\Phi N \le \tilde{A}[0, j_k + 1) \le A[0, j_k + 1) + \varepsilon N$$
► We made log|U| checks in the binary search
► The probability that any of them failed is at most log|U| times what we achieved, i.e. δ
![Page 23: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/23.jpg)
RSS Properties
► The algorithm may return a quantile value which was not seen in the input
► Changing the order of insertions and deletions doesn't affect results
► The RSSs are composable: U can be split into many disjoint ranges with some pre-agreed common random subsets
![Page 24: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/24.jpg)
Extension: U is unknown
► Predict a range [0, u-1] for U.
► Upon insertion of i > u-1, add another instance of RSS with range [u, u²-1], and so on…
► Because RSS is composable, we only have to join the results upon query
► Increased cost factor: log² log(|U|).
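The growth of the ranges can be sketched as below: each new instance starts where the previous one ended and squares the upper bound, so the number of instances grows roughly like log log of the largest value seen. The function name is hypothetical, and the sketch assumes an initial guess u ≥ 2.

```python
def instance_ranges(u, max_value):
    """Ranges covered by successive RSS instances until max_value is included (requires u >= 2)."""
    ranges = [(0, u - 1)]
    upper = u
    while ranges[-1][1] < max_value:
        lower, upper = upper, upper * upper   # next instance covers [u, u^2 - 1], and so on
        ranges.append((lower, upper - 1))
    return ranges

ranges = instance_ranges(4, 1000)             # hypothetical initial guess u = 4
```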
![Page 25: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/25.jpg)
Experiments
► What is the median length of all active AT&T calls?
► When a call starts: add its start timestamp; when it ends: delete its start timestamp
► 4 KB used for RSS
► Compared: RSS, GK, GK2
![Page 26: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/26.jpg)
Number of Active Phone Calls Over Time
![Page 27: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/27.jpg)
Error in Computation of Median Over Time
![Page 28: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/28.jpg)
Average Error for Last 50 Snapshots, For Deciles
![Page 29: How to Summarize the Universe: Dynamic Maintenance of Quantiles Gilbert, Kotidis, Muthukrishnan, Strauss Presented by Itay Malinger December 2003.](https://reader035.fdocuments.net/reader035/viewer/2022062422/56649f1f5503460f94c37e26/html5/thumbnails/29.jpg)
The End