Approximating the Depth via Sampling and Emptiness
Transcript of Approximating the Depth via Sampling and Emptiness
Lecture: Adi Vardi
Depth(r,S) – Given a set S of n objects and a query range r, let depth(r,S) be the number of objects of S intersected by r.
Depth(r,S) = 3
Range counting queries – preprocessing a set S of n objects and a class of ranges, so that given a range r, one can quickly report the number of objects in S intersecting r.
Example: Range tree
• S = set of points in the plane
• r = query rectangle
Range tree – primary tree
• Balanced binary search tree
• Ordered by x-coordinate
• Each point is stored at a leaf node
Range tree query
• Given interval [a,b], search for a and b
• Find where the paths split, and look at the subtrees hanging off the search paths
Range tree complexity
• Query time: O(log^(d−1) n)
• Space: O(n·log^(d−1) n)
• Construction time: O(n·log^(d−1) n)
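To make the counting/emptiness query notions concrete, here is a minimal 1-D sketch (not the full range tree the slides describe for the plane): with the points kept in a sorted array, both counting and emptiness for an interval [a, b] take O(log n) time via binary search.

```python
import bisect

def build(points):
    # the 1-D "range tree" degenerates to a sorted array
    return sorted(points)

def range_count(tree, a, b):
    # number of points p with a <= p <= b
    return bisect.bisect_right(tree, b) - bisect.bisect_left(tree, a)

def range_empty(tree, a, b):
    return range_count(tree, a, b) == 0

tree = build([7, 1, 4, 9, 3, 12])
print(range_count(tree, 3, 9))    # points 3, 4, 7, 9 -> 4
print(range_empty(tree, 10, 11))  # no point in [10, 11] -> True
```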
Range emptiness queries – preprocessing a set S of objects and a class of ranges, so that given a range r, one can quickly report whether r intersects any object of S.
Emptiness(r,S) = not_empty
Emptiness(r,S) = empty
Sampling - selection of a subset of objects from a set S to estimate characteristics of S
Approximate – Let μr = depth(r,S) denote the depth of r. For a prespecified ε > 0 the data structure outputs a number αr such that (1 - ε)μr ≤ αr ≤ μr
Example: μr = 20, ε = 0.1
o Valid αr: 18, 19, 20
o Invalid αr: 17, 21
Motivation: counting queries are much harder than emptiness queries.
Example: halfspace queries in 2 and 3 dimensions
o Counting queries can only be answered in polynomial time
o Emptiness queries can be answered in logarithmic time
Goal: answer approximate range counting queries using polylogarithmic emptiness queries.
Halfplane emptiness
Preprocessing: find the convex hull.
Query:
o Rotate the polygon
o Find the upper and lower envelopes – binary search
o Find extreme points – binary search
o Orientation test
Idea 1
• Create M independent samples R1,…,RM of S
• Each sample is formed by picking each element of S with probability p
• Xi = 1 if r intersects any object of Ri, Xi = 0 otherwise
• Yr = Σi Xi
• αr = f(Yr, M, p)
• Problems?
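A small simulation of Idea 1, with an illustrative choice of the unspecified function f (inverting the hit probability E[Yr/M] = 1 − (1 − p)^μ); the M Bernoulli "hit" events are simulated directly rather than via explicit samples.

```python
import math
import random

random.seed(1)

def estimate_depth(depth, p, M):
    # Each sample Ri contains an object hit by r with probability
    # 1 - (1-p)^depth; simulate these M independent hit events.
    Y = sum(random.random() < 1 - (1 - p) ** depth for _ in range(M))
    if Y == M:                      # every sample hit: the estimate saturates
        return float("inf")
    # one natural f(Yr, M, p): solve E[Yr/M] = 1 - (1-p)^mu for mu
    return math.log(1 - Y / M) / math.log(1 - p)

# true depth 50, p = 0.01, M = 20000 samples
print(estimate_depth(50, 0.01, 20000))  # close to 50
```

With p badly mismatched to the true depth (the "problems" discussed next), Y/M crowds toward 0 or 1 and the inversion becomes unreliable.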
Idea 1 – problems
Number of samples M = M(ε, n) (the number of "Bernoulli trials")
Probability to pick an object: p = p(μr)
A wrong p might yield an invalid result:
o "Light" depth requires a large p
o "Heavy" depth requires a small p
How to find an appropriate p?
Idea 1 – problems
"Heavy" depth
o n = 1000, μr = 997, ε → 0
o Sample size must be less than 4 (otherwise none of the samples will be "empty")
"Light" depth
o n = 1000, μr = 3
o Sample size must be very large (otherwise we won't be able to catch the "non-empty" objects)
Idea 2
Guess a starting value z = depth(r,S).
Probability to pick an object: p = 1/z.
Probability that a range r of depth k intersects Ri:
pz(k) = 1 − (1 − 1/z)^k
If r has depth z, then Δ = E[Yr] = M·pz(z).
If Yr > Δ then μr > z, otherwise μr ≤ z.
Perform a binary search on [0, n].
Problems?
Idea 2 – problems
Only the expectation of Yr is M·pz(z).
The decision μr < z might be mistaken.
How can we overcome mistakes and guarantee (1 − ε)μr ≤ αr ≤ μr?
Halfspace complexity status
Emptiness query – logarithmic time
Binary search – logarithmic time (probably)
Number of samples M = M(ε, n) – logarithmic?
The decision procedure
Given parameters z ∈ [0, n] and 1/2 > ε > 0, we construct a data structure, such that for any δ with 1/2 > δ ≥ ε, and a query range r, we can decide with high probability whether μr < z or μr ≥ z.
The data structure is allowed to make a mistake if μr ∈ [(1 − δ)z, (1 + δ)z].
Why δ? A trade-off between query time and accuracy: use a large δ when the binary-search range is large, and a small δ when the range is small.
The data structure
Let R1,…,RM be M independent random samples of S, formed by picking every element with probability 1/z, where M(ε) = ⌈c3·ε^(−2)·log n⌉ and c3 is a sufficiently large absolute constant.
Build M separate emptiness-query data structures D1,…,DM, respectively, and put D = D(z,ε) := {D1,…,DM}.
Answering a query
For i = 1,…,M(δ): Xi = 1 if r intersects any object of Ri, Xi = 0 otherwise.
Yr = Σi Xi
Probability that a range r of depth k intersects Ri: pz(k) = 1 − (1 − 1/z)^k.
If r has depth z, then Δ = E[Yr] = M·pz(z).
Our data structure returns "depth(r,S) < z" if Yr < Δ, and "depth(r,S) ≥ z" otherwise.
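A sketch of this decision procedure, assuming objects are plain indices and the emptiness query against a sample Ri reduces to a set-disjointness test; the sizes n, z, M below are illustrative, not the ⌈c3·ε^(−2)·log n⌉ of the construction.

```python
import random

random.seed(0)

n, z, M = 1000, 100, 400   # illustrative sizes, not the lemma's constants

def build_structure():
    # M independent samples of {0,...,n-1}; each object is picked w.p. 1/z
    return [{i for i in range(n) if random.random() < 1 / z} for _ in range(M)]

def decide(samples, hit_set):
    # hit_set = objects intersected by r, so depth(r,S) = len(hit_set);
    # the "emptiness query" against Ri is a disjointness test in this model
    Y = sum(1 for Ri in samples if Ri & hit_set)
    delta = len(samples) * (1 - (1 - 1 / z) ** z)  # E[Yr] at depth exactly z
    return "depth >= z" if Y >= delta else "depth < z"

D = build_structure()
print(decide(D, set(range(300))))  # true depth 300, well above z
print(decide(D, set(range(20))))   # true depth 20, well below z
```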
Lemma 1
Pr[Yr > Δ | depth(r,S) ≤ (1 − δ)z] does not exceed n^(−c4), where c4 = c4(c3) > 0 depends only on c3 and can be made arbitrarily large by choosing a sufficiently large c3 > 0.
Reminder – Chernoff bound
Let X1,…,Xn be n independent Bernoulli trials with Pr[Xi = 1] = pi and Pr[Xi = 0] = qi = 1 − pi.
Let X = Σ_{i=1}^{n} Xi and μ = E[X].
For any δ > 0:
o Pr[X ≥ (1+δ)μ] ≤ (e^δ / (1+δ)^(1+δ))^μ
o Pr[X < (1−δ)μ] ≤ exp(−μδ²/2)
For any δ ≤ 2e − 1:
o Pr[X ≥ (1+δ)μ] ≤ exp(−μδ²/4)
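The lower-tail bound can be sanity-checked empirically; the sketch below estimates Pr[X < (1 − δ)μ] by simulation and compares it with exp(−μδ²/2) (the parameters are arbitrary).

```python
import math
import random

random.seed(2)

# X = sum of n Bernoulli(p) trials; compare the empirical lower tail
# Pr[X < (1 - delta) * mu] against the Chernoff bound exp(-mu * delta^2 / 2).
n, p, delta, trials = 500, 0.3, 0.2, 2000
mu = n * p
bound = math.exp(-mu * delta ** 2 / 2)

hits = sum(
    sum(random.random() < p for _ in range(n)) < (1 - delta) * mu
    for _ in range(trials)
)
empirical = hits / trials
print(empirical, "<=", bound)  # the bound should hold comfortably
```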
Lemma 1 – proof
Observation 1: for 0 ≤ x ≤ y < 1, (1−x)/(1−y) = 1 + (y−x)/(1−y) ≥ 1 + y − x.
The probability α is maximized when depth(r,S) is as large as allowed, and since δ ≥ ε it suffices to bound α ≤ Pr[Yr > Δ | depth(r,S) = (1−ε)z].
Then Pr[Xi = 1] = pz((1−ε)z), and
E[Yr] = M·pz((1−ε)z) = M·[1 − (1 − 1/z)^((1−ε)z)] ≥ M·(1 − e^(−(1−ε))) ≥ M/3,
since (1 − 1/z)^z ≤ e^(−1) and ε ≤ 1/2.
By Observation 1, with y = (1 − 1/z)^((1−ε)z) and x = (1 − 1/z)^z:
ξ = Δ/E[Yr] = M·[1 − (1 − 1/z)^z] / M·[1 − (1 − 1/z)^((1−ε)z)]
≥ 1 + (1 − 1/z)^((1−ε)z) − (1 − 1/z)^z = 1 + (1 − 1/z)^((1−ε)z)·[1 − (1 − 1/z)^(εz)]
Lemma 1 – proof (cont.)
ξ = Δ/E[Yr] ≥ 1 + (1 − 1/z)^((1−ε)z)·[1 − (1 − 1/z)^(εz)]
Applying exp(−2x) ≤ 1 − x (for 0 ≤ x ≤ 1/2) and 1 + x ≤ exp(x):
ξ ≥ 1 + exp(−(2/z)·(1−ε)z)·[1 − exp(−(1/z)·εz)] ≥ 1 + (1/e²)·[1 − exp(−ε)]
≥ 1 + (1/e²)·[1 − (1 − ε/2)] = 1 + ε/(2e²) ≥ 1 + ε/15
Deploying the Chernoff inequality:
α = Pr[Yr > Δ] ≤ Pr[Yr > ξ·E[Yr]] ≤ Pr[Yr > (1 + ε/15)·E[Yr]]
≤ exp(−E[Yr]·(1/4)·(ε/15)²) ≤ exp(−Mε²/c) = exp(−ε²·⌈c3·ε^(−2)·log n⌉/c) ≤ n^(−c4)
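The key inequality ξ ≥ 1 + ε/15 can be checked numerically straight from its definition (the common factor M cancels):

```python
def xi(z, eps):
    # xi = Delta / E[Yr] from the proof; the factor M cancels
    return (1 - (1 - 1 / z) ** z) / (1 - (1 - 1 / z) ** ((1 - eps) * z))

for z in (10, 100, 10_000):
    for eps in (0.05, 0.1, 0.4):
        assert xi(z, eps) >= 1 + eps / 15
print("xi >= 1 + eps/15 on all tested (z, eps)")
```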
Lemma 2
Given a set S of n objects, a parameter 0 < ε < 1/2 and z ∈ [0, n], one can construct a data structure D(z) which, given a range r and a parameter 1/2 > δ ≥ ε, returns either LOW or HIGH.
If it returns LOW, then μr ≤ (1+δ)z, and if it returns HIGH, then μr ≥ (1−δ)z.
The data structure might return either answer if μr ∈ [(1−δ)z, (1+δ)z].
The data structure D consists of M = O(ε^(−2)·log n) emptiness data structures.
Lemma 2 – cont.
The space needed is O(S(2n/z)·ε^(−2)·log n), where S(m) is the space needed for a single emptiness data structure storing m objects.
The query time is O(Q(2n/z)·δ^(−2)·log n), where Q(m) is the time needed for a single query in such a structure storing m objects.
All bounds hold with high probability.
Proof?
Lemma 3
Given the data structure of Lemma 2, z, and δ > c5 (where c5 is a sufficiently large constant), one can decide for a query range r whether μr < z/(1+δ) or μr ≥ z(1+δ).
The data structure is allowed to return any answer if μr ∈ [z/(1+δ), z(1+δ)].
This requires M = ⌈c6·(log n)/ln δ⌉ emptiness queries, and the answer returned is correct with high probability, where c6 is an appropriate absolute constant.
Proof?
Range counting data structure
"Light" depth values:
o Build a separate data structure Di = D(vi, εi) of Lemma 2
o vi = i/2, εi = 1/(8i)
o i = 1, 2, …, U = O(ε^(−1))
"Heavy" depth values:
o Build a separate data structure Dj = D(vj, εj) of Lemma 2
o vj = (U/4)·(1 + ε/16)^j, εj = ε/16
o j = U+1, U+2, …, W, where W := c·log_{1+ε/16} n = O(ε^(−1)·log n) and vW ≈ n
Range counting data structure
Note: log_{1+ε/16} n = (log n)/log(1 + ε/16), and for 0 < ε ≤ 1 we have 50·log(1 + ε/16) ≥ ε, hence W = O(ε^(−1)·log n).
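The two families of scales can be sketched directly; the constant in U = ⌈4/ε⌉ and the starting value of the heavy range are illustrative choices, not the exact constants of the construction.

```python
import math

def build_scales(n, eps):
    # "Light" scales: v_i = i/2 with eps_i = 1/(8i); U = O(1/eps)
    U = math.ceil(4 / eps)          # the constant 4 is illustrative
    light = [(i / 2, 1 / (8 * i)) for i in range(1, U + 1)]
    # "Heavy" scales: geometric with ratio (1 + eps/16), up to n
    heavy, v = [], U / 4
    while v < n:
        v *= 1 + eps / 16
        heavy.append((v, eps / 16))
    return light, heavy

light, heavy = build_scales(n=10 ** 6, eps=0.25)
print(len(light), len(heavy))   # O(1/eps) light scales, O(log(n)/eps) heavy ones
```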
Answering a query
Given a query range r, each data structure in our list returns LOW or HIGH.
If we were to query all the data structures, we would get a sequence of HIGHs, followed by a sequence of LOWs.
The value associated with the last data structure returning HIGH (rounded to the nearest integer) yields the required approximation.
We can use binary search on D1,…,DW to locate this changeover value using a total of O(log W) = O(log(ε^(−1)·log n)) queries.
Overall time: O(Q(n)·ε^(−2)·(log n)·log(ε^(−1)·log n)).
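The changeover can be located exactly as described: binary search over a monotone HIGH…LOW sequence. The sketch below uses a callback oracle in place of the real data structures Di and counts how many are actually queried.

```python
def find_changeover(query, W):
    # query(i) -> "HIGH" or "LOW"; answers are monotone (HIGHs, then LOWs),
    # so O(log W) probes locate the last structure answering HIGH
    lo, hi = -1, W              # lo: last known HIGH, hi: first known LOW
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if query(mid) == "HIGH":
            lo = mid
        else:
            hi = mid
    return lo

calls = []
answers = ["HIGH"] * 7 + ["LOW"] * 5
oracle = lambda i: (calls.append(i), answers[i])[1]
idx = find_changeover(oracle, len(answers))
print(idx, len(calls))  # changeover at index 6, found with 4 probes
```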
“Light” depth
Key observation: error ≤ U·ε = c·ε^(−1)·ε = c. A small error range.
Example: μr = 3, ε = 0.1, error = 0.3. αr must be equal to μr!
Assume μr = x.
D2x might return HIGH or LOW.
v2x − v2x−1 = 1/2 > 1/16 = ((2x−1)/2)·(1/(8(2x−1))) = v2x−1·ε2x−1,
so D2x−1 must return HIGH.
v2x+1 − v2x = 1/2 > 1/16 = ((2x+1)/2)·(1/(8(2x+1))) = v2x+1·ε2x+1,
so D2x+1 must return LOW.
“Light” depth – option 1
D2x = HIGH ⇒ αr = v2x = x = μr
“Light” depth – option 2
D2x = LOW ⇒ αr = ⌈v2x−1⌉ = ⌈x − 1/2⌉ = x = μr
“Heavy” depth
vj+1 = (1 + ε/16)·vj, εj = ε/16
Observation: vj+1 − vj = vj·εj
Assume μr = vj. Then Dj−2 must return HIGH, and
vj+2 − vj = vj·(ε/16) + vj+1·(ε/16) = vj·(ε/16)·[1 + (1 + ε/16)] > 2·vj·(ε/16)
> vj·(ε/16)·(1 + ε/16)² = vj+2·εj+2,
so Dj+2 must return LOW. Dj−1, Dj and Dj+1 may return either answer.
“Heavy” depth
vj+1 ≥ αr ≥ vj−2. If we choose the data structure previous to the changeover:
μr = vj ≥ αr ≥ vj−3 ≥ vj·(1 − 3ε/16) ≥ vj·(1 − ε)
Improved data structure
Treat D1,…,DW as a linked list LM, where M = ⌈log W⌉ = O(log(ε^(−1)·log n)).
Build a hierarchy of lists in which Li−1 is formed from Li by picking every other element.
The base list L1 has 4–8 elements.
Answering a query
Search top-down, starting from L1.
At the i-th stage, maintain pointers to four consecutive data structures in Li, such that the left two return HIGH and the right two return LOW.
The portion of Li+1 delimited by these two data structures in Li is a sublist of at most seven data structures in Li+1.
We query at most three new data structures to maintain the sublist.
Key observation: at each level we use the largest δi such that the error intervals of all these data structures are disjoint.
Query time: O(ε^(−2)·Q(n)·log n)
Answering a query
δ1 = n^(1/4).
During the coarse search:
o O(log log n) levels
o δi = O(√δi−1)
During the refine search:
o O(log(ε^(−1))) levels
o δ′j = 1/2^j
Coarse search
After O(log(ε^(−1))) levels, all "light"-depth data structures disappear.
At the i-th level we need
vj·(1 + δi) < vj+1/(1 + δi), i.e. (1 + δi)² < vj+1/vj = Ci = (1 + ε/16)^(2^i),
so 1 + δi = √Ci.
Coarse search
As long as δi > c5, √Ci ≫ 1, and therefore δi ≈ √Ci.
At the first level there are only 4 elements, therefore C1 = n^(1/2) and δ1 = n^(1/4).
Since Ci−1 = Ci² and δi ≈ √Ci, we get δi = O(√δi−1).
Coarse search
C1 = n^(1/2), C2 = n^(1/4), C3 = n^(1/8), C4 = n^(1/16), …
δ1 = n^(1/4), δ2 = n^(1/8), δ3 = n^(1/16), δ4 = n^(1/32), …
There are O(log log n) levels, since n^(1/4) = 2^((log n)/4) becomes constant after O(log log n) square-root operations.
Refine search – “light” depth
Suppose δ·(vj + 2d) + δ·(vj + d) < d and δ·(vj + d) + δ·vj < d, i.e. the error intervals at spacing d are disjoint. Then
2δ·(vj + 2d) + 2δ·vj = δ·(vj + 2d) + δ·(vj + d) + δ·(vj + d) + δ·vj < 2d,
so at spacing 2d the doubled parameter still keeps the intervals disjoint: δi−1 = 2δi.
Refine search – “heavy” depth
Observation 2: for k ≤ log(ε^(−1)), (1 + ε/16)^(2^k) ≈ 1 + 2^k·ε/16.
Proof: (1 + ε/16)^(2^k) = 1 + Σ_{i=1}^{2^k} C(2^k, i)·(ε/16)^i, and
C(2^k, i+1)·(ε/16)^(i+1) = C(2^k, i)·(ε/16)^i · ε·(2^k − i)/(16·(i+1)) ≤ (1/16)·C(2^k, i)·(ε/16)^i,
since ε·2^k ≤ 1; so the terms decrease geometrically and the first term dominates.
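Observation 2 is easy to check numerically; the sketch below compares (1 + ε/16)^(2^k) with 1 + 2^k·ε/16 over the stated range of k.

```python
import math

eps = 0.01
for k in range(int(math.log2(1 / eps)) + 1):
    exact = (1 + eps / 16) ** (2 ** k)
    approx = 1 + (2 ** k) * eps / 16
    # the approximation error is a small fraction of exact - 1
    assert abs(exact - approx) <= 0.05 * (exact - 1)
print("Observation 2 holds on the tested range of k")
```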
Refine search – “heavy” depth
Suppose δ·vj·(1 + ε/16)^(2^k) + δ·vj·(1 + ε/16)^(2^(k−1)) < d2
and δ·vj·(1 + ε/16)^(2^(k−1)) + δ·vj < d1.
For k ≤ log(ε^(−1)), applying Observation 2:
2δ·vj·(1 + ε/16)^(2^k) + 2δ·vj ≈ δ·vj·(1 + ε/16)^(2^k) + δ·vj·(1 + 2^k·ε/16) + 2δ·vj
= δ·vj·(1 + ε/16)^(2^k) + 2δ·vj·(1 + 2^(k−1)·ε/16) + δ·vj
≈ δ·vj·(1 + ε/16)^(2^k) + 2δ·vj·(1 + ε/16)^(2^(k−1)) + δ·vj < d1 + d2,
so again δi−1 = 2δi.
Complexity analysis – coarse search
During the coarse search δi = O(√δi−1), and Lemma 3 bounds the number of emptiness queries per level by O((log n)/log δi). Therefore the total is
Σ_{i: δi > c5} O((log n)/log δi) = (log n)/((log n)/4) + (log n)/((log n)/8) + (log n)/((log n)/16) + … + (log n)/2^(O(1))
= 4 + 8 + 16 + … + O(log n) = O(log n),
since log δi = (log n)/2^(i+1), so the series is geometric and dominated by its last, O(log n), term.
Complexity analysis – refine search
During the refine search δ′j = 1/2^j, and the structure of Lemma 2 uses O((log n)/δ′j²) emptiness queries per level. Therefore the total is
Σ_{j=2}^{O(log(1/ε))} O((log n)/δ′j²) = Σ_{j=2}^{O(log(1/ε))} O(2^(2j)·log n) = O(ε^(−2)·log n).
Summary
Query time: O(ε^(−2)·Q(n)·log n)
Space: O(S(n)·ε^(−3)·log² n)
Construction time: O(T(n)·ε^(−3)·log² n)
In some cases we can reduce the space and construction time requirements. Intuition:
o the bounds really involve S(2n/z) and T(2n/z)
o degree λ: S(n/i) = O(S(n)/i^λ)
Applications – halfplanes
Emptiness queries can be answered in logarithmic time when d = 2, 3.
S(n) = O(n), T(n) = O(n·log n), Q(n) = O(log n).
Approximate counting queries:
o Query time: O(ε^(−2)·log² n)
o Space: O(n·ε^(−2)·log n)
o Construction time: O(n·ε^(−2)·log² n)
Applications – disks
Use the standard lifting of points in R² to the paraboloid in R³ (it maps balls in R^d to halfspaces in R^(d+1), and points in R^d to points on the standard paraboloid in R^(d+1)).
A disk range query in the plane thus reduces to a halfspace range query in three dimensions.
Similar results hold for disk range counting:
o Query time: O(ε^(−2)·log² n)
o Space: O(n·ε^(−2)·log n)
o Construction time: O(n·ε^(−2)·log² n)
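The lifting can be written out directly: a point (x, y) maps to (x, y, x² + y²), and membership in a disk becomes a linear (halfspace) test on the lifted point. A quick consistency check:

```python
# A disk with center (a, b) and radius r maps to the halfspace
#   z - 2a*x - 2b*y + (a^2 + b^2 - r^2) <= 0
# in lifted coordinates, where z = x^2 + y^2.
def in_disk(px, py, a, b, r):
    return (px - a) ** 2 + (py - b) ** 2 <= r ** 2

def below_lifted_plane(px, py, a, b, r):
    z = px * px + py * py              # lift the query point
    return z - 2 * a * px - 2 * b * py + (a * a + b * b - r * r) <= 0

for p in [(0, 0), (1, 2), (3, 3), (-1, 0.5)]:
    assert in_disk(*p, 1, 1, 2) == below_lifted_plane(*p, 1, 1, 2)
print("disk test agrees with lifted halfspace test")
```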
Applications – depth queries
Pseudo-disks: a set of objects is a collection of pseudo-disks if the boundaries of every pair of them intersect at most twice.
Applications – depth queries
Compute the union of n pseudo-disks in the plane and preprocess the union for point-location queries.
Emptiness queries can then be answered in logarithmic time.
S(n) = O(n), T(n) = O(n·log n), Q(n) = O(log n).
Approximate counting queries:
o Query time: O(ε^(−2)·log² n)
o Space: O(n·ε^(−2)·log n)
o Construction time: O(n·ε^(−2)·log² n)
Relative approximation via sampling
Approximate depth(r,S) for a query point r using a single sample: pick each object into a random sample R with probability p.
If r has sufficiently large depth, then its depth can be estimated reliably by depth(r,R)/p.
The deeper r is, the better this estimate is.
Lemma 4 – reliable sampling
Let S be a set of n objects, 0 < ε < 1/2, and let r be a point of depth u ≥ k in S.
Let R be a random sample of S, such that every element is picked with probability p = (8/(k·ε²))·ln(1/δ).
Let X be the depth of r in R (the exact depth). Then X/p lies in the interval [(1−ε)u, (1+ε)u], and this estimate succeeds with probability ≥ 1 − δ^(u/k) ≥ 1 − δ.
Lemma 4 – example
u = 10000, k = 5000, ε = 0.2, δ = 0.001.
p = (8/(5000·0.2²))·ln(1/0.001) ≈ 0.276.
X/p ∈ [8000, 12000] with Pr[success] ≥ 1 − δ = 0.999.
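The example can be replayed as a simulation, drawing the sampling of the u objects containing r directly (parameters as in the example above).

```python
import math
import random

random.seed(3)

u, k, eps, delta = 10_000, 5_000, 0.2, 0.001
p = 8 / (k * eps ** 2) * math.log(1 / delta)   # ~0.276, as in the example

# each of the u objects containing r enters the sample independently w.p. p
X = sum(random.random() < p for _ in range(u))
estimate = X / p
print((1 - eps) * u <= estimate <= (1 + eps) * u)  # succeeds w.p. >= 1 - delta
```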
Lemma 4 – proof
μ = E[X] = pu. By Chernoff's inequality:
Pr[X ∉ [(1−ε)μ, (1+ε)μ]] = Pr[X < (1−ε)μ] + Pr[X > (1+ε)μ] ≤ exp(−puε²/2) + exp(−puε²/4)
= exp(−4·(u/k)·ln(1/δ)) + exp(−2·(u/k)·ln(1/δ)) ≤ δ^(u/k),
since u ≥ k.
Lemma 4 – conclusions
If depth(r,S) is (say) u ≤ 10k, then depth(r,R) ≤ (1+ε)·pu = O(ε^(−2)·ln(1/δ)), which is a (relatively) small number.
Therefore, via sampling, we turned the task of estimating the depth of a heavy range into estimating the depth of a shallow range.
We can thus perform a binary search for the depth of r using a sequence of coarser-to-finer samples.