How to Design Robust Algorithms using Noisy Comparison Oracle
Raghavendra Addanki, UMass Amherst
Sainyam Galhotra, UMass Amherst
Barna Saha, UC Berkeley
ABSTRACT
Metric based comparison operations such as finding maximum, nearest and farthest neighbor are fundamental to studying various clustering techniques such as k-center clustering and agglomerative hierarchical clustering. These techniques crucially rely on accurate estimation of pairwise distance between records. However, computing exact features of the records, and their pairwise distances, is often challenging, and sometimes not possible. We circumvent this challenge by leveraging weak supervision in the form of a comparison oracle that compares the relative distance between the queried points, such as "Is point u closer to v, or w closer to x?".
However, it is possible that some queries are easier to answer than others using a comparison oracle. We capture this by introducing two different noise models called adversarial and probabilistic noise. In this paper, we study various problems that include finding maximum, and nearest/farthest neighbor search under these noise models. Building upon the techniques we develop for these comparison operations, we give robust algorithms for k-center clustering and agglomerative hierarchical clustering. We prove that our algorithms achieve good approximation guarantees with a high probability and analyze their query complexity. We evaluate the effectiveness and efficiency of our techniques empirically on various real-world datasets.
PVLDB Reference Format:
Raghavendra Addanki, Sainyam Galhotra, Barna Saha. How to Design Robust Algorithms using Noisy Comparison Oracle. PVLDB, 14(9): XXX-XXX, 2021. doi:XX.XX/XXX.XX
1 INTRODUCTION
Many real world applications such as data summarization, social network analysis, and facility location crucially rely on metric based comparative operations such as finding maximum, nearest neighbor search, or ranking. As an example, data summarization aims to identify a small representative subset of the data where each representative is a summary of similar records in the dataset. Popular clustering algorithms such as k-center clustering and hierarchical clustering are often used for data summarization [25, 39]. In this paper, we study fundamental metric based operations such as finding maximum and nearest neighbor search, and use the developed techniques to study clustering algorithms such as k-center clustering and agglomerative hierarchical clustering.
Clustering is often regarded as a challenging task, especially due to the absence of domain knowledge, and the final set of clusters identified can be highly inaccurate and noisy [7]. It is often hard to compute the exact features of points, and thus pairwise distance computation from these feature vectors could be highly noisy. This renders the clusters computed based on objectives such as k-center unreliable.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 9. ISSN 2150-8097. doi:XX.XX/XXX.XX

Figure 1: Data summarization example (six numbered images of landmarks).
To address these challenges, there has been a recent interest in leveraging supervision from crowd workers (abstracted as an oracle), which provides significant improvement in accuracy, but at an added cost incurred by human intervention [20, 55, 57]. For clustering, the existing literature on oracle based techniques mostly uses optimal cluster queries, which ask questions of the form "do the points u and v belong to the same optimal cluster?" [6, 17, 42, 57]. The goal is to minimize the number of queries, aka query complexity, while ensuring high accuracy of the clustering output. This model is relevant for applications where the oracle (a human expert or a crowd worker) is aware of the optimal clusters, such as in entity resolution [20, 55]. However, in most applications, the clustering output depends highly on the required number of clusters and the presence of other records. Without a holistic view of the entire dataset, answering optimal cluster queries may not be feasible for any realistic oracle. Let us consider an example data summarization task that highlights some of these challenges.
Example 1.1. Consider a data summarization task over a collection of images (shown in Figure 1). The goal is to identify k images (say k = 3) that summarize the different locations in the dataset. The images 1, 2 refer to the Eiffel tower in Paris, 3 is the Colosseum in Rome, 4 is the replica of the Eiffel tower at Las Vegas, USA, 5 is Venice and 6 is the Leaning tower of Pisa. The ground truth output in this case would be {{1, 2}, {3, 5, 6}, {4}}. We calculated pairwise similarity between images using the visual features generated from Google Vision API [1]. The pair (1, 4) exhibits the highest similarity of 0.87, while all other pairs have similarity lower than 0.85. Distance between a pair of images u and v, denoted as d(u, v), is defined as (1 − similarity between u and v). We ran a user experiment by querying crowd workers to answer simple Yes/No questions to help summarize the data (please refer to Section 6.2 for more details).
In this example, we make the following observations.
arXiv:2105.05782v1 [cs.DS] 12 May 2021
• Automated clustering techniques generate noisy clusters. Consider the greedy approach for k-center clustering [27], which sequentially identifies the farthest record as a new cluster center. In this example, records 1 and 4 are placed in the same cluster by the greedy k-center clustering, thereby leading to poor performance. In general, automated techniques are known to generate erroneous similarity values between records due to missing information or even presence of noise [19, 56, 58]. Even Google's landmark detection API [1] did not identify the location of images 4 and 5.
• Answering pairwise optimal cluster queries is infeasible. Answering whether 1 and 3 belong to the same optimal cluster when presented in isolation is impossible unless the crowd worker is aware of other records present in the dataset, and the granularity of the optimal clusters. Using the pairwise Yes/No answers obtained from the crowd workers for the (6 choose 2) = 15 pairs in this example, the identified clusters achieved 0.40 F-score for k = 3. Please refer to Section 6.2 for additional details.
• Comparing relative distance between the locations is easy. Answering relative distance queries of the form "Is 1 closer to 3, or is 5 closer to 6?" does not require any extra knowledge about other records in the dataset. For the 6 images in the example, we asked relative distance queries, and the final clusters constructed for k = 3 achieved an F-score of 1.
In summary, we observe that humans have an innate understanding of the domain knowledge and can answer relative distance queries between records easily. Motivated by the aforementioned observations, we consider a quadruplet comparison oracle that compares the relative distance between two pairs of points (u_1, u_2) and (v_1, v_2) and outputs the pair with the smaller distance between them, breaking ties arbitrarily. Such oracle models have been studied extensively in the literature [11, 17, 24, 32, 34, 48, 49]. Even though quadruplet queries are easier than binary optimal cluster queries, some oracle queries may be harder than the rest. In a comparison query, if there is a significant gap between the two distances being compared, then such queries are easier to answer [9, 15]. However, when the two distances are close, the chances of an error could increase. For example, "Is the location in image 1 closer to 3, or is 2 closer to 6?" may be difficult to answer.
To capture noise in quadruplet comparison oracle answers, we consider two noise models. In the first noise model, when the pairwise distances are comparable, the oracle can return the pair of points that are farther instead of closer. Moreover, we assume that the oracle has access to all previous queries and can answer queries by acting adversarially. More formally, there is a parameter μ > 0 such that if max{d(u_1, u_2), d(v_1, v_2)} / min{d(u_1, u_2), d(v_1, v_2)} ≤ (1 + μ), then adversarial error may occur; otherwise the answers are correct. We call this the "Adversarial Noise Model". In the second noise model, called the "Probabilistic Noise Model", given a pair of distances, we assume that the oracle answers correctly with a probability of 1 − p for some fixed constant p < 1/2. We consider a persistent probabilistic noise model, where our oracle answers are persistent, i.e., query responses remain unchanged even upon repeating the same query multiple times. Such noise models have been studied extensively [9, 10, 20, 24, 42, 46], since the error due to oracles often does not change with repetition, and in some cases increases upon repeated querying [20, 42, 46]. This is in contrast to the noise models studied in [17], where the response to every query is independently noisy. Persistent query models are more difficult to handle than independent query models, where repeating each query is sufficient to generate the correct answer by majority voting.
1.1 Our Contributions
We present algorithms for finding maximum, nearest and farthest neighbors, k-center clustering and hierarchical clustering objectives under the adversarial and probabilistic noise models using a comparison oracle. We show that our techniques have provable approximation guarantees for both noise models, are efficient, and have good query complexity. We empirically evaluate the robustness and efficiency of our techniques on real world datasets.

(i) Maximum, Farthest and Nearest Neighbor: Finding maximum has received significant attention under both adversarial and probabilistic models [4, 9, 15, 18, 21-23, 38]. In this paper, we provide the following results.
• Maximum under adversarial model. We present an algorithm that returns a value within (1 + μ)^3 of the maximum among a set of n values V with probability 1 − δ¹ using O(n log^2(1/δ)) oracle queries and running time (Theorem 3.6).
• Maximum under probabilistic model. We present an algorithm that requires O(n log^2(n/δ)) queries to identify an O(log^2(n/δ))th rank value with probability 1 − δ (Theorem 3.7). In other words, in O(n log^2 n) time we can identify an O(log^2 n)th value in the sorted order with probability 1 − 1/n^c for any constant c. To contrast our results with the state of the art, Ajtai et al. [4] study a slightly different additive adversarial error model where the answer of a maximum query is correct if the compared values differ by θ (for some θ > 0), and otherwise the oracle answers adversarially. Under this setting, they give an additive 3θ-approximation with O(n) queries. Although our model cannot be directly compared with theirs, we note that our model is scale invariant, and thus provides a much stronger bound when distances are small. As a consequence, our algorithm can be used under the additive adversarial model as well, obtaining the same approximation guarantees (Theorem 3.10).

For the probabilistic model, after a long series of works [9, 21, 23, 38], only recently has an algorithm been proposed with query complexity O(n log n) that returns an O(log n)th rank value with probability 1 − 1/n [22]. Previously, the best query complexity was O(n^{3/2}) [23]. While our bounds are slightly worse than [22], our algorithm is significantly simpler.

The rest of the work on finding maximum allows repetition of queries and assumes the answers are independent [15, 18]. As discussed earlier, persistent errors are much more difficult to handle than independent errors. In [18], the authors present an algorithm that finds the maximum using O(n log 1/δ) queries and succeeds with probability 1 − δ. Therefore, even under persistent errors, we obtain guarantees close to the existing ones which assume independent error. The algorithms of [15, 18] do not extend to our model.
• Nearest Neighbor. Nearest neighbor queries can be cast as "finding minimum" among a set of distances. We can obtain bounds

¹ δ is the confidence parameter and is standard in the literature of randomized algorithms.
similar to finding maximum for the nearest neighbor queries. In the adversarial model, we obtain a (1 + μ)^3-approximation, and in the probabilistic model, we are guaranteed to return an element with rank O(log^2(n/δ)) with probability 1 − δ, using O(n log^2(1/δ)) and O(n log^2(n/δ)) oracle queries respectively.

Prior techniques have studied nearest neighbor search under noisy distance queries [41], where the oracle returns a noisy estimate of the distance between queried points, and repetitions are allowed. Neither the algorithm of [41] nor the other techniques developed for maximum [4, 18] and top-k [15] extend to nearest neighbor under our noise models.
• Farthest Neighbor. Similarly, the farthest neighbor query can be cast as finding the maximum among a set of distances, and the results for computing the maximum extend to this setting. However, computing the farthest neighbor is one of the basic primitives for more complex tasks like k-center clustering, and for that, the existing bounds under the probabilistic model that return an O(log n)th rank element are insufficient. Since distances in a metric space satisfy the triangle inequality, we exploit this to get a constant approximation to the farthest query under the probabilistic model and a mild distribution assumption (Theorem 3.10).

(ii) k-center Clustering: k-center clustering is one of the fundamental models of clustering and is very well-studied [52, 59].
• k-center under adversarial model. We design an algorithm that returns a clustering that is a (2 + μ)-approximation for small values of μ with probability 1 − δ using O(nk^2 + nk log^2(n/δ)) queries (Theorem 4.2). In contrast, even when exact distances are known, k-center cannot be approximated better than a 2-factor unless P = NP [52]. Therefore, we achieve near-optimal results.
• k-center under probabilistic noise model. For probabilistic noise, when optimal k-center clusters are of size at least Ω(√n), our algorithm returns a clustering that achieves a constant approximation with probability 1 − δ using O(nk log^2(n/δ)) queries (Theorem 4.4). To the best of our knowledge, even though k-center clustering is an extremely popular and basic clustering paradigm, it has not been studied under the comparison oracle model, and we provide the first results in this domain.

(iii) Single Linkage and Complete Linkage Agglomerative Hierarchical Clustering: Under adversarial noise, we show a clustering technique that loses only a multiplicative factor of (1 + μ)^3 in each merge operation and has an overall query complexity of O(n^2). Prior work [24] considers comparison oracle queries to perform average linkage, in which the unobserved pairwise similarities are generated according to a normal distribution. These techniques do not extend to our noise models.
1.2 Other Related Work
For finding the maximum among a given set of values, it is known that techniques based on tournaments obtain optimal guarantees, and they are widely used [15]. For the problem of finding the nearest neighbor, techniques based on locality sensitive hashing generally work well in practice [5]. Clustering points using the k-center objective is NP-hard and there are many well known heuristics and approximation algorithms [59], with the classic greedy algorithm achieving an approximation ratio of 2. All these techniques are not applicable when pairwise distances are unknown. As distances between points cannot always be accurately estimated, many recent techniques leverage supervision in the form of an oracle. Most oracle based clustering frameworks consider "optimal cluster" queries [13, 28, 33, 42, 43] to identify ground truth clusters. Recent techniques for distance based clustering objectives, such as k-means [6, 12, 36, 37] and k-median [3], use optimal cluster queries in addition to distance information for obtaining better approximation guarantees. As "optimal cluster" queries can be costly or sometimes infeasible, there has been recent interest in leveraging distance based comparison oracles, similar to our quadruplet oracles, for other problems [17, 24].

Distance based comparison oracles have been used to study a wide range of problems, and we list a few of them: learning fairness metrics [34], top-down hierarchical clustering with a different objective [11, 17, 24], correlation clustering [49], classification [32, 48], identifying the maximum [30, 53], top-k elements [14-16, 38, 40, 45], information retrieval [35], and skyline computation [54]. To the best of our knowledge, there is no work that considers quadruplet comparison oracle queries to perform k-center clustering and single/complete linkage based hierarchical clustering.

Closely related to finding maximum, sorting has also been well studied under various comparison oracle based noise models [8, 9]. The work of [15] considers a different probabilistic noise model with error varying as a function of the difference in the values, but they assume that each query is independent and therefore repetition can help boost the probability of success. Using a quadruplet oracle, [24] studies the problem of recovering a hierarchical clustering under a planted noise model and is not applicable to single linkage.
2 PRELIMINARIES
Let V = {v_1, v_2, . . . , v_n} be a collection of n records such that each record may be associated with a value val(v_i), ∀i ∈ [1, n]. We assume that there exists a total ordering over the values of elements in V. For simplicity, we denote the value of record v_i as v_i instead of val(v_i) whenever it is clear from the context.

Given this setting, we consider a comparison oracle that compares the values of any pair of records (v_i, v_j) and outputs Yes if v_i ≤ v_j and No otherwise.
Definition 2.1 (Comparison Oracle). An oracle is a function O : V × V → {Yes, No}. Each oracle query considers two values as input and outputs O(v_1, v_2) = Yes if v_1 ≤ v_2 and No otherwise.
Note that a comparison oracle is defined for any pair of values. Given this oracle setting, we define the problem of identifying the maximum over the records V.
Problem 2.2 (Maximum). Given a collection of n records V = {v_1, . . . , v_n} and access to a comparison oracle O, identify argmax_{v_i ∈ V} v_i with the minimum number of queries to the oracle.
As a natural extension, we can also study the problem of identifying the record corresponding to the smallest value in V.
2.1 Quadruplet Oracle Comparison Query
In applications that consider distance based comparison of records, like nearest neighbor identification, the records V = {v_1, . . . , v_n} are generally considered to be present in a high-dimensional metric space along with a distance d : V × V → R+ defined over pairs of
records. We assume that the embedding of records in the latent space is not known, but there exists an underlying ground truth [5]. Prior techniques mostly assume complete knowledge of an accurate distance metric and are not applicable in our setting. In order to capture the setting where we can compare distances between pairs of records, we define the quadruplet oracle below.
Definition 2.3 (Quadruplet Oracle). An oracle is a function O : V × V × V × V → {Yes, No}. Each oracle query considers two pairs of records as input and outputs O(v_1, v_2, v_3, v_4) = Yes if d(v_1, v_2) ≤ d(v_3, v_4) and No otherwise.
The quadruplet oracle is similar to the comparison oracle discussed before, with the difference that the two values being compared are associated with pairs of records as opposed to individual records. Given this oracle setting, we define the problem of identifying the farthest record in V with respect to a query point q as follows.
Problem 2.4 (Farthest point). Given a collection of n records V = {v_1, . . . , v_n}, a query record q and access to a quadruplet oracle O, identify argmax_{v_i ∈ V \ {q}} d(q, v_i).
Similarly, the nearest neighbor query returns a point that satisfies argmin_{u_i ∈ V \ {q}} d(q, u_i). Now, we formally define the k-center clustering problem.
Problem 2.5 (k-center clustering). Given a collection of n records V = {v_1, . . . , v_n} and access to a comparison oracle O, identify k centers (say S ⊆ V) and a mapping of records to corresponding centers, σ : V → S, such that the maximum distance of any record from its center, i.e., max_{v_i ∈ V} d(v_i, σ(v_i)), is minimized.
We assume that the points v_i ∈ V exist in a metric space and that the distance between any pair of points is not known. We denote the unknown distance between any pair of points (v_i, v_j), where v_i, v_j ∈ V, as d(v_i, v_j), and use k to denote the number of clusters. Optimal clusters are denoted as C*, with C*(v_i) ⊆ V denoting the set of points belonging to the optimal cluster containing v_i. Similarly, C(v_i) ⊆ V refers to the nodes belonging to the cluster containing v_i for any clustering given by C(·).
In addition to k-center clustering, we study single linkage and complete linkage agglomerative clustering techniques, where the distance metric over the records is not known a priori. These techniques initialize each record v_i in a separate singleton cluster and sequentially merge the pair of clusters having the least distance between them. In case of single linkage, the distance between two clusters C_1 and C_2 is characterized by the closest pair of records, defined as:

d_SL(C_1, C_2) = min_{v_i ∈ C_1, v_j ∈ C_2} d(v_i, v_j)

In complete linkage, the distance between a pair of clusters C_1 and C_2 is calculated by identifying the farthest pair of records:

d_CL(C_1, C_2) = max_{v_i ∈ C_1, v_j ∈ C_2} d(v_i, v_j)
2.2 Noise Models
The oracle models discussed in Problems 2.2, 2.4 and 2.5 assume that the oracle answers every comparison query correctly. In real world applications, however, the answers can be wrong, which can lead to noisy results. To formalize the notion of noise, we consider two different models. First, the adversarial noise model considers a setting where a comparison query can be adversarially wrong if the two values being compared are within a multiplicative factor of (1 + μ) for some constant μ > 0.

O(v_1, v_2) =
    Yes, if v_1 < v_2/(1 + μ)
    No, if v_1 > (1 + μ)·v_2
    adversarially incorrect, if 1/(1 + μ) ≤ v_1/v_2 ≤ (1 + μ)
The parameter μ corresponds to the degree of error. For example, μ = 0 implies a perfect oracle. The model extends to the quadruplet oracle as follows.

O(v_1, v_2, v_3, v_4) =
    Yes, if d(v_1, v_2) < d(v_3, v_4)/(1 + μ)
    No, if d(v_1, v_2) > (1 + μ)·d(v_3, v_4)
    adversarially incorrect, if 1/(1 + μ) ≤ d(v_1, v_2)/d(v_3, v_4) ≤ (1 + μ)
The second model is a probabilistic noise model, where each comparison query is incorrect independently with a probability p < 1/2, and asking the same query multiple times yields the same response. We discuss ways to estimate μ and p from real data in Section 6.
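The two models can be simulated concretely. Below is a minimal sketch of our own (the class names and the Boolean `query` interface, answering "is d(u1,u2) ≤ d(v1,v2)?", are illustrative assumptions, not from the paper): the adversarial oracle may flip any answer inside the (1+μ) ambiguity band, and the probabilistic oracle is wrong with probability p but caches its answers, so repetition never helps.

```python
import random

class AdversarialOracle:
    """Answers correctly whenever the two distances differ by more than a
    (1+mu) factor; inside that band this worst-case adversary always lies."""
    def __init__(self, dist, mu):
        self.dist, self.mu = dist, mu

    def query(self, u1, u2, v1, v2):
        a, b = self.dist(u1, u2), self.dist(v1, v2)
        if max(a, b) <= (1 + self.mu) * min(a, b):
            return a > b        # ambiguity band: flipped answer
        return a <= b           # clear gap: truthful answer

class ProbabilisticOracle:
    """Each distinct query is wrong independently with probability p < 1/2,
    but answers are persistent: repeating a query replays the cached answer."""
    def __init__(self, dist, p, seed=0):
        self.dist, self.p = dist, p
        self.rng = random.Random(seed)
        self.memo = {}

    def query(self, u1, u2, v1, v2):
        key = (u1, u2, v1, v2)
        if key not in self.memo:
            truth = self.dist(u1, u2) <= self.dist(v1, v2)
            self.memo[key] = truth ^ (self.rng.random() < self.p)
        return self.memo[key]

# Toy metric: three points on a line.
pts = {1: 0.0, 2: 1.0, 3: 10.0}
dist = lambda u, v: abs(pts[u] - pts[v])
adv = AdversarialOracle(dist, mu=0.5)
print(adv.query(1, 2, 1, 3))   # d=1 vs d=10: clear gap, answered correctly -> True
```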
3 FINDING MAXIMUM
In this section, we present robust algorithms to identify the record corresponding to the maximum value in V under the adversarial noise model and the probabilistic noise model. Later, we extend the algorithms to find the farthest and the nearest neighbor. We note that our algorithms for the adversarial model are parameter free (they do not depend on μ), and the algorithms for the probabilistic model can use p = 0.5 as a worst case estimate of the noise.
3.1 Adversarial Noise
Consider a trivial approach that maintains a running maximum while sequentially processing the records, i.e., if a larger value is encountered, the current maximum is updated to that value. This approach requires n − 1 comparisons. However, in the presence of adversarial noise, the output can have a significantly lower value than the correct maximum. In general, if v_max is the true maximum of V, then the above approach can return an approximate maximum whose value could be as low as v_max/(1 + μ)^{n−1}. To see this, assume v_1 = 1, and v_i = (1 + μ − ε)^i, where ε > 0 is very close to 0. It is possible that while comparing v_i and v_{i+1}, the oracle returns v_i as the larger element. If this mistake is repeated for every i, then v_1 will be declared the maximum element, whereas the correct answer is v_n ≈ v_1·(1 + μ)^{n−1}.
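The degradation above can be simulated. The construction below is our own variant of the argument (processing the chain in decreasing order): every adjacent comparison falls inside the (1+μ) ambiguity band, so an adversary can walk the running maximum all the way down to the smallest value, a (1+μ)^{n−1} factor below the true maximum.

```python
def sequential_max(values, oracle):
    """Naive running-maximum scan: n-1 oracle comparisons."""
    best = values[0]
    for v in values[1:]:
        if oracle(best, v):     # oracle claims best <= v
            best = v
    return best

mu, eps, n = 1.0, 1e-6, 20
chain = [(1 + mu - eps) ** (n - i) for i in range(n)]   # decreasing chain

def adversary(a, b):
    # Adjacent chain values differ by less than a (1+mu) factor, so every
    # such comparison is ambiguous and the adversary may claim a <= b.
    if max(a, b) <= (1 + mu) * min(a, b):
        return True
    return a <= b

print(sequential_max(chain, adversary) == chain[-1])   # -> True: fooled down to the minimum
```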
To improve upon this naive strategy, we introduce a natural score keeping idea: given a set S ⊆ V of records, we maintain Count(v, S), equal to the number of values smaller than v in S.

Count(v, S) = Σ_{x ∈ S \ {v}} 1{O(v, x) == No}
It is easy to observe that when the oracle makes no mistakes, Count(s_max, S) = |S| − 1 obtains the highest score, where s_max is the maximum value in S. Using this observation, in Algorithm 1, we output the value with the highest Count score.
Given a set of records S, we show in Lemma 3.1 that Count-Max(S), obtained using Algorithm 1, always returns a good approximation of the maximum value in S.
Lemma 3.1. Given a set of values S with maximum value v_max, Count-Max(S) returns a value u_max where u_max ≥ v_max/(1 + μ)^2, using O(|S|^2) oracle queries.
Using Example 3.2, we demonstrate that the (1 + μ)^2 = 4 approximation ratio is achieved by Algorithm 1 when μ = 1.
Example 3.2. Let S denote a set of four records u, v, w and t with ground truth values 51, 101, 102 and 202, respectively. While identifying the maximum value under adversarial noise with μ = 1, the oracle must return a correct answer to O(u, t), and all other oracle query answers can be incorrect adversarially. If the oracle answers all other queries incorrectly, the Count values of t, w, u, v are 1, 1, 2, and 2 respectively. Therefore, u and v are equally likely, and when Algorithm 1 returns u, we have a 202/51 ≈ 3.96 approximation.
From Lemma 3.1, we have that O(n^2) oracle queries, where |S| = n, are required to get a (1 + μ)^2 approximation. In order to improve the query complexity, we use a tournament to obtain the maximum value. The idea of using a tournament for finding maximum has been studied in the past [15, 18].
Algorithm 2 presents the pseudo code of the approach, which takes values V as input and outputs an approximate maximum value. It constructs a balanced λ-ary tree T containing n leaf nodes such that a random permutation of the values V is assigned to the leaves of T. In a tournament, the internal nodes of T are processed bottom-up such that at every internal node w, we assign the value that is largest among the children of w. To identify the largest value, we calculate argmax_{v ∈ children(w)} Count(v, children(w)) at the internal node w, where Count(v, B) refers to the number of elements in B that are considered smaller than v. Finally, we return the value at the root of T as our output. In Lemma 3.3, we show that Algorithm 2 returns a value that is a (1 + μ)^{2 log_λ n} multiplicative approximation of the maximum value.
Algorithm 1 Count-Max(S): finds the maximum value by counting
1: Input: A set of values S
2: Output: An approximate maximum value of S
3: for v ∈ S do
4:    Calculate Count(v, S)
5: u_max ← argmax_{v ∈ S} Count(v, S)
6: return u_max
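Algorithm 1 is short enough to state directly in code. Below is a sketch assuming an `oracle(a, b)` callable (our own interface choice) that answers the comparison "is a ≤ b?", possibly with noise:

```python
def count_max(values, oracle):
    """Algorithm 1 (Count-Max): return the element the oracle judges larger
    than the most others, using O(|S|^2) comparison queries."""
    def count(i):
        v = values[i]
        # Count(v, S): elements x for which the oracle answers "No" to v <= x.
        return sum(1 for j, x in enumerate(values) if j != i and not oracle(v, x))
    return values[max(range(len(values)), key=count)]

perfect = lambda a, b: a <= b          # noise-free oracle for illustration
print(count_max([51, 101, 102, 202], perfect))   # -> 202
```

With a noise-free oracle the true maximum always attains the highest Count score; Lemma 3.1 bounds how far the winner can fall under adversarial noise.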
Lemma 3.3. Suppose v_max is the maximum value among the set of records V. Algorithm 2 outputs a value u_max such that u_max ≥ v_max/(1 + μ)^{2 log_λ n}, using O(nλ) oracle queries.
According to Lemma 3.3, Algorithm 2 identifies a constant approximation when λ = Θ(n) and μ is a fixed constant, and it has a query complexity of Θ(n^2). By reducing the degree of the tournament tree from λ to 2, we can achieve Θ(n) query complexity, but with a worse approximation ratio of (1 + μ)^{log n}.
Now, we describe our main algorithm (Algorithm 4), which uses the following observation to improve the overall query complexity.
Observation 3.4. At an internal node w ∈ T, the identified maximum is incorrect only if there exists x ∈ children(w) that is very close to the true maximum (say w_max), i.e., w_max/(1 + μ) ≤ x ≤ (1 + μ)·w_max.
Based on the above observation, our algorithm Max-Adv uses two steps to identify a good approximation of v_max. Consider the case when there are many values close to v_max. In Algorithm Max-Adv, we use a subset R ⊆ V of size √n·t (for a suitable choice of parameter t) obtained using uniform sampling with replacement. We show that using a sufficiently large subset R, obtained by sampling, we ensure that at least one value close to v_max is in R, thereby giving a good approximation of v_max.
Algorithm 2 Tournament: finds the maximum value using a tournament tree
1: Input: Set of values V, degree λ
2: Output: An approximate maximum value u_max
3: Construct a balanced λ-ary tree T with |V| nodes as leaves.
4: Let P_r be a random permutation of V assigned to the leaves of T
5: for i = 1 to log_λ |V| do
6:    for each internal node w at level log_λ |V| − i do
7:       Let B denote the children of w.
8:       Set the internal node w to Count-Max(B)
9: u_max ← value at root of T
10: return u_max
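A compact rendering of the tournament follows, a sketch of our own under the same assumed `oracle(a, b)` ("is a ≤ b?") interface; each layer of the λ-ary tree keeps the Count-Max winner of every block of children, and with a noise-free oracle the exact maximum reaches the root.

```python
import random

def _count_max(block, oracle):
    # Winner of one internal node: the element judged larger than most others.
    def count(i):
        return sum(1 for j in range(len(block))
                   if j != i and not oracle(block[i], block[j]))
    return block[max(range(len(block)), key=count)]

def tournament(values, oracle, degree=2, seed=0):
    layer = list(values)
    random.Random(seed).shuffle(layer)        # random permutation at the leaves
    while len(layer) > 1:                     # process the tree bottom-up
        layer = [_count_max(layer[i:i + degree], oracle)
                 for i in range(0, len(layer), degree)]
    return layer[0]

perfect = lambda a, b: a <= b                 # noise-free oracle for illustration
print(tournament(list(range(100)), perfect))  # -> 99
```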
In order to handle the case when there are only a few values close to v_max, we divide the entire data set into m disjoint parts (for a suitable choice of m) and run the Tournament algorithm with degree λ = 2 on each of these parts separately (Algorithm 3). As there are very few points close to v_max, the probability of comparing any such value with v_max is small, and this ensures that in the partition containing v_max, Tournament returns v_max. We collect the maximum values returned by Algorithm 2 from all the partitions and include these values in B in Algorithm Max-Adv. We repeat this procedure t times and set m = √n, t = 2 log(2/δ) to achieve the desired success probability 1 − δ. We combine the outputs of both steps, i.e., R and B, and output the maximum among them using Count-Max. This ensures that we get a good approximation, as we use the best of both approaches.
Algorithm 3 Tournament-Partition
1: Input: Set of values V, number of partitions m
2: Output: A set A of maximum values, one from each partition
3: Randomly partition V into m equal parts V_1, V_2, . . . , V_m
4: for i = 1 to m do
5:    a_i ← Tournament(V_i, 2)
6:    A ← A ∪ {a_i}
7: return A
Theoretical Guarantees. In order to prove the approximation guarantee of Algorithm 4, we first argue that the sample R contains a good approximation of the maximum value v_max with a high probability. Let C denote the set of values that are very close to v_max: C = {u : v_max/(1 + μ) ≤ u ≤ v_max}. In Lemma 3.5, we first
Algorithm 4 Max-Adv: Maximum with Adversarial Noise
1: Input: Set of values V, number of iterations t, number of partitions m
2: Output: An approximate maximum value u_max
3: i ← 1, B ← ∅
4: Let R denote a sample of size √n·t selected uniformly at random (with replacement) from V.
5: for i ≤ t do
6:    B_i ← Tournament-Partition(V, m)
7:    B ← B ∪ B_i
8: u_max ← Count-Max(R ∪ B)
9: return u_max
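Putting the pieces together, Max-Adv can be sketched as follows. This is our own rendering: the `oracle(a, b)` ("is a ≤ b?") interface, the slice-based partitioning, and the concrete choices m = ⌊√n⌋ and t = ⌈2 ln(2/δ)⌉ are implementation assumptions.

```python
import math
import random

def _count_max(block, oracle):
    def count(i):
        return sum(1 for j in range(len(block))
                   if j != i and not oracle(block[i], block[j]))
    return block[max(range(len(block)), key=count)]

def _tournament(values, oracle, rng):
    layer = list(values)
    rng.shuffle(layer)
    while len(layer) > 1:
        layer = [_count_max(layer[i:i + 2], oracle)   # degree lambda = 2
                 for i in range(0, len(layer), 2)]
    return layer[0]

def max_adv(values, oracle, delta=0.1, seed=0):
    rng = random.Random(seed)
    n = len(values)
    m = max(1, math.isqrt(n))                      # ~sqrt(n) partitions
    t = max(1, math.ceil(2 * math.log(2 / delta))) # repetition count
    # Step 1: uniform sample R of size sqrt(n)*t, drawn with replacement.
    R = [rng.choice(values) for _ in range(m * t)]
    # Step 2: t rounds of partitioned binary tournaments; winners go into B.
    B = []
    for _ in range(t):
        shuffled = values[:]
        rng.shuffle(shuffled)
        parts = [shuffled[i::m] for i in range(m)]  # m roughly equal parts
        B.extend(_tournament(p, oracle, rng) for p in parts if p)
    # Output the Count-Max winner over R and B combined.
    return _count_max(R + B, oracle)

perfect = lambda a, b: a <= b
print(max_adv(list(range(1, 65)), perfect))  # -> 64 with a perfect oracle
```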
show that R contains a value v_j such that v_j ≥ v_max/(1 + μ) whenever the size of C is large, i.e., |C| > √n/2. Otherwise, we show that we can recover v_max correctly with probability 1 − δ/2 whenever |C| ≤ √n/2.
Lemma 3.5. (1) If |C| > √n/2, then there exists a value v_j ∈ R satisfying v_j ≥ v_max/(1 + μ) with probability 1 − δ/2.
(2) Suppose |C| ≤ √n/2. Then, B contains v_max with probability at least 1 − δ/2.
Now, we briefly provide a sketch of the proof of Lemma 3.5. Consider the first step, where we use a uniformly random sample R of √n·t points from V (obtained with replacement). When |C| ≥ √n/2, the probability that R contains a value from C is given by

1 − (1 − |C|/n)^{|R|} ≥ 1 − (1 − 1/(2√n))^{2√n·log(2/δ)} ≥ 1 − δ/2.
In the second step, Algorithm 4 uses a modified tournamenttree that partitions the set π into π =
βπ parts of size π/π =
βπ
each and identifies a maximum ππ from each partition ππ usingAlgorithm 2. We have that the expected number of elements fromπΆ in a partition ππ containing π£max is |πΆ |/π =
βπ/(2βπ) = 1/2.
Thus by the Markovβs inequality, the probability that ππ containsa value from πΆ is β€ 1/2. With 1/2 probability, π£max will never becompared with any point fromπΆ in the partitionππ . To increase thesuccess probability, we run this procedure π‘ times and obtain all theoutputs. Among the π‘ runs of Algorithm 2, we argue that π£max isnever compared with any value ofπΆ in at least one of the iterationswith a probability at least 1 β (1 β 1/2)2 log(2/πΏ) β₯ 1 β πΏ/2.
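Spelling out the first-step probability bound (using |C| ≥ √n/2, |M| = √n·t with t = 2 log(2/δ), and the standard inequality 1 − x ≤ e^{−x}):

```latex
\Pr[M \cap C \neq \emptyset]
  = 1 - \Bigl(1 - \tfrac{|C|}{n}\Bigr)^{|M|}
  \;\ge\; 1 - \Bigl(1 - \tfrac{1}{2\sqrt{n}}\Bigr)^{2\sqrt{n}\log(2/\delta)}
  \;\ge\; 1 - e^{-\log(2/\delta)}
  \;=\; 1 - \tfrac{\delta}{2}.
```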
In Lemma 3.1, we show that using Count-Max we get a (1 + μ)^2 multiplicative approximation. Combining it with Lemma 3.5, we have that u_max returned by Algorithm 4 satisfies u_max ≥ v_max/(1 + μ)^3 with probability 1 − δ. For the query complexity, the t runs of Algorithm 3 identify √n·t values, denoted by T. These identified values, along with M, are then processed by Count-Max to identify the maximum u_max. This step requires O(|T ∪ M|^2) = O(n log^2(1/δ)) oracle queries.
Theorem 3.6. Given a set of values V, Algorithm 4 returns a (1 + μ)^3 approximation of the maximum value with probability 1 − δ using O(n log^2(1/δ)) oracle queries.
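Putting the pieces of Section 3.1 together, here is a hedged end-to-end sketch of Algorithm 4 on numeric values; count_max stands in for both the Count-Max and Tournament subroutines, and a synthetic oracle replaces crowd answers.

```python
import math
import random

def make_comparison_oracle(error, rng):
    """Simulated comparison oracle: 'is u larger than v?', answers
    flipped independently with probability `error`."""
    def oracle(u, v):
        ans = u > v
        return (not ans) if rng.random() < error else ans
    return oracle

def count_max(S, oracle):
    """Count-Max: score each value by the number of comparisons it wins
    against the rest of S; return the highest scorer."""
    S = list(S)
    return max(S, key=lambda v: sum(1 for x in S if x is not v and oracle(v, x)))

def max_adv(V, t, m, oracle, rng):
    """Algorithm 4 (sketch): a uniform sample M of sqrt(n)*t values plus
    t rounds of partition maxima T, followed by Count-Max on T ∪ M."""
    n = len(V)
    M = [rng.choice(V) for _ in range(int(math.isqrt(n)) * t)]
    T = []
    for _ in range(t):
        W = list(V)
        rng.shuffle(W)
        T.extend(count_max(W[i::m], oracle) for i in range(m) if W[i::m])
    return count_max(T + M, oracle)
```

With an error-free oracle, every partition maximum is exact, so the true maximum reaches the final Count-Max pass and wins it.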
3.2 Probabilistic Noise
We cannot directly extend the algorithms for the adversarial noise model to probabilistic noise. Specifically, the theoretical guarantees of Lemma 3.3 do not apply when the noise is probabilistic. In this section, we develop several new ideas to handle probabilistic noise.
Figure 2: Example for Lemma 3.1 with μ = 1.
Let rank(u, V) denote the index of u in the non-increasing sorted order of values in V; so v_max has rank 1, and so on. Our main idea is an early stopping approach: we use a sample S ⊆ V of O(log(n/δ)) values selected randomly and, for every value u not in S, we calculate Count(u, S) and discard u using a chosen threshold on the Count scores. We argue that doing so helps eliminate the values that are far away from the maximum in the sorted ranking. This process is continued Θ(log n) times to identify the maximum value. We present the pseudocode in the Appendix and prove the following approximation guarantee.
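A minimal sketch of this early-stopping idea on numeric values follows; the threshold (1 − p)|S|/2 and the loop bounds are illustrative stand-ins for the constants chosen in the Appendix pseudocode.

```python
import math
import random

def make_prob_oracle(p, rng):
    """Probabilistic comparison oracle: 'is u larger than x?', each
    answer flipped independently with probability p."""
    def oracle(u, x):
        ans = u > x
        return (not ans) if rng.random() < p else ans
    return oracle

def approx_max_prob(V, p, delta, rng):
    """Early-stopping sketch: repeatedly draw a small sample S, score
    each surviving value u by Count(u, S) (wins against S), and discard
    values whose score falls below a threshold."""
    oracle = make_prob_oracle(p, rng)
    alive = list(V)
    s = max(2, math.ceil(math.log(len(V) / delta)))
    for _ in range(math.ceil(math.log2(len(V))) + 1):
        if len(alive) <= s:
            break  # early stop: few enough survivors for a direct pass
        S = rng.sample(alive, s)
        thresh = (1 - p) * s / 2
        alive = [u for u in alive
                 if sum(oracle(u, x) for x in S if x != u) >= thresh] or S
    # final pass: Count-Max among the survivors
    return max(alive, key=lambda u: sum(oracle(u, x) for x in alive if x != u))
```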
Theorem 3.7. There is an algorithm that returns u_max ∈ V such that rank(u_max, V) = O(log^2(n/δ)) with probability 1 − δ and requires O(n log^2(n/δ)) oracle queries.
The algorithm to identify the minimum value is the same as that for the maximum, with a modification where the Count scores consider the case of Yes (instead of No): Count(v, S) = Σ_{x ∈ S\{v}} 1{O(v, x) == Yes}.
3.3 Extension to Farthest and Nearest Neighbor
Given a set of records V, the farthest record from a query u corresponds to the record u′ ∈ V such that d(u, u′) is maximum. This query is equivalent to finding the maximum in the set of distance values D(u) = {d(u, u′) | ∀u′ ∈ V}, containing n values, for which we already developed algorithms in Section 3. Since the ground truth distance between any pair of records is not known, we require a quadruplet oracle (instead of a comparison oracle) to identify the maximum element in D(u). Similarly, the nearest neighbor of a query record u corresponds to finding the record with the minimum distance value in D(u). The algorithms for finding the maximum from previous sections extend to these settings with similar guarantees.
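The reduction can be sketched as follows, with a simulated quadruplet oracle over points with a known distance function (the helper names are ours, not the paper's):

```python
import random

def make_quad_oracle(dist, p, rng):
    """Simulated quadruplet oracle: O(a, b, c, d) answers
    'is d(a,b) < d(c,d)?', flipped independently with probability p."""
    def oracle(a, b, c, d):
        ans = dist(a, b) < dist(c, d)
        return (not ans) if rng.random() < p else ans
    return oracle

def farthest(u, V, oracle):
    """Farthest record from u via Count-Max over D(u): score v by how
    often the oracle answers No to 'is d(u,v) < d(u,x)?'."""
    cands = [v for v in V if v != u]
    return max(cands, key=lambda v: sum(
        1 for x in cands if x != v and not oracle(u, v, u, x)))

def nearest(u, V, oracle):
    """Nearest neighbor of u: the same scoring with Yes instead of No."""
    cands = [v for v in V if v != u]
    return max(cands, key=lambda v: sum(
        1 for x in cands if x != v and oracle(u, v, u, x)))
```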
Example 3.8. Figure 2 shows a worst-case example for the approximation guarantee to identify the farthest point from s (with μ = 1). Similar to Example 3.2, the Count values of t, w, u, v are 1, 1, 2, 2 respectively. Therefore, u and v are equally likely, and when Algorithm 1 outputs u, we have a ≈ 3.96 approximation.
For probabilistic noise, the farthest point identified in Section 3.2 is guaranteed to rank within the top O(log^2 n) values of the set V (Theorem 3.7). In this section, we show that it is possible to compute the farthest point within a small additive error under the probabilistic model, if the data set satisfies an additional property discussed below. For simplicity of exposition, we assume p ≤ 0.40, though our algorithms work for any value of p < 0.5 (with different constants).
One of the challenges in developing robust algorithms for farthest point identification is that every relative distance comparison of records from u (O(u, v_i, u, v_j) for some v_i, v_j ∈ V) may be answered incorrectly with constant error probability p, and the success probability cannot be boosted by repetition. We overcome this challenge by performing pairwise comparisons in a robust manner. Suppose the desired failure probability is δ; we observe that if Θ(log(1/δ)) records closest to the query u are known (say
Figure 3: Algorithm 5 returns "Yes" as d(u, v_i) < d(u, v_j) − 2α. In this example, O(u, v_i, u, v_j) is answered correctly with probability 1 − p. To boost the correctness probability, FCount uses the queries O(x, v_i, x, v_j) for all x in the region around u, denoted by R.
R) and max_{x∈R} d(u, x) ≤ α for some α > 0, then each pairwise comparison of the form O(u, v_i, u, v_j) can be replaced by Algorithm PairwiseComp when executing Algorithm 4. Algorithm 5 takes the two records v_i and v_j as input, along with R, and outputs Yes or No, where Yes denotes that v_i is closer to u. We calculate FCount(v_i, v_j) = Σ_{x∈R} 1{O(v_i, x, v_j, x) == Yes} as a robust estimate of how often the oracle considers v_i to be closer to x than v_j. If FCount(v_i, v_j) is smaller than 0.3|R| ≤ (1 − p)|R|/2, then we output No, and Yes otherwise. Therefore, every pairwise comparison query is replaced with Θ(log(1/δ)) quadruplet queries using Algorithm 5.
We argue that Algorithm 5 will output the correct answer with high probability if |d(u, v_j) − d(u, v_i)| ≥ 2α (see Figure 3). In Lemma 3.9, we show that, if d(u, v_j) > d(u, v_i) + 2α, then FCount(v_i, v_j) ≥ 0.3|R| with probability 1 − δ.
Lemma 3.9. Suppose max_{v_i∈R} d(u, v_i) ≤ α and |R| ≥ 6 log(1/δ). Consider two records v_i and v_j such that d(u, v_i) < d(u, v_j) − 2α. Then FCount(v_i, v_j) ≥ 0.3|R| with probability at least 1 − δ.
With the help of Algorithm 5, the relative distance query of any pair of records v_i, v_j from u can be answered correctly with high probability, provided |d(u, v_i) − d(u, v_j)| ≥ 2α. Therefore, the output of Algorithm 5 is equivalent to an additive adversarial error model, where any quadruplet query can be adversarially incorrect if |d(u, v_i) − d(u, v_j)| < 2α and is correct otherwise. In the Appendix, we show that Algorithm 4 can be extended to the additive adversarial error model, such that each comparison (u, v_i, u, v_j) is replaced by PairwiseComp (Algorithm 5). We give an approximation guarantee that loses an additive 6α, following an analysis similar to that of Theorem 3.6.
Algorithm 5 PairwiseComp(u, v_i, v_j, R)
1: Calculate FCount(v_i, v_j) = Σ_{x∈R} 1{O(x, v_i, x, v_j) == Yes}
2: if FCount(v_i, v_j) < 0.3|R| then
3:   return No
4: else return Yes
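A runnable sketch of Algorithm 5, again with a simulated quadruplet oracle; with anchors R within α = 1 of the query u = 0 and records whose distance gap exceeds 2α, the majority vote is correct with overwhelming probability even at p = 0.1.

```python
import random

def make_quad_oracle(dist, p, rng):
    """Simulated quadruplet oracle: 'is d(a,b) < d(c,d)?', each answer
    flipped independently with probability p."""
    def oracle(a, b, c, d):
        ans = dist(a, b) < dist(c, d)
        return (not ans) if rng.random() < p else ans
    return oracle

def pairwise_comp(vi, vj, R, oracle):
    """Algorithm 5 (sketch): vote over the anchor set R of records close
    to the query. Returns True ('Yes': v_i is closer) iff
    FCount(v_i, v_j) >= 0.3|R|."""
    fcount = sum(1 for x in R if oracle(x, vi, x, vj))
    return fcount >= 0.3 * len(R)
```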
Theorem 3.10. Given a query vertex u and a set R with |R| = Ω(log(n/δ)) such that max_{v∈R} d(u, v) ≤ α, the farthest point identified using Algorithm 4 (with PairwiseComp), denoted by u_max, is within 6α distance of the optimal farthest point, i.e., d(u, u_max) ≥ max_{v∈V} d(u, v) − 6α, with probability 1 − δ. Further, the query complexity is O(n log^3(n/δ)).
4 k-CENTER CLUSTERING
In this section, we present algorithms for k-center clustering and prove constant approximation guarantees for our algorithm. Our algorithm is an adaptation of the classical greedy algorithm for k-center [27]. The greedy algorithm [27] is initialized with an arbitrary point as the first cluster center and then iteratively identifies the next centers. In each iteration, it assigns all points to the current set of centers, by identifying the closest center for each point. Then, it finds the farthest point among the clusters and uses it as the new center. This technique requires O(nk) distance comparisons in the absence of noise and guarantees a 2-approximation of the optimal clustering objective. We provide the pseudocode for this approach in Algorithm 6. Using an argument similar to the one presented for the worst-case example in Section 3, we can show that if we use Algorithm 6 with every comparison replaced by an oracle query, the generated clusters can be arbitrarily bad even for small error. In order to improve its robustness, we devise new algorithms to perform the assignment of points to their respective clusters and farthest point identification. Missing details from this section are discussed in Appendices 10 and 11.
Algorithm 6 Greedy Algorithm
1: Input: Set of points V
2: Output: Clusters C
3: s_1 ← arbitrary point from V, S = {s_1}, C = {V}.
4: for i = 2 to k do
5:   s_i ← Approx-Farthest(S, C)
6:   S ← S ∪ {s_i}
7:   C ← Assign(S)
8: return C
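For reference, the classical noiseless greedy algorithm that Algorithm 6 adapts can be sketched in a few lines (exact distances, no oracle):

```python
def greedy_k_center(V, k, dist):
    """Classical greedy 2-approximation with exact distances: repeatedly
    add the point farthest from the current set of centers, then assign
    every point to its closest center."""
    centers = [V[0]]
    for _ in range(k - 1):
        centers.append(max(V, key=lambda v: min(dist(v, c) for c in centers)))
    clusters = {c: [] for c in centers}
    for v in V:
        clusters[min(centers, key=lambda c: dist(c, v))].append(v)
    return centers, clusters
```

The noisy setting replaces both the farthest-point step and the closest-center step with the oracle-based subroutines described next.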
4.1 Adversarial Noise
Now, we describe the two steps (Approx-Farthest and Assign) of the Greedy Algorithm that complete the description of Algorithm 6. To do so, we build upon the results from the previous section that give algorithms for obtaining the maximum/farthest point.
Approx-Farthest. Given a clustering C and a set of centers S, we construct the pairs (v_i, s_i) where v_i is assigned to the cluster C(s_i) centered at s_i ∈ S. Using Algorithm 4, we identify the point-center pair with the maximum distance, i.e., argmax_{v_i∈V} d(v_i, s_i), which corresponds to the farthest point. For the parameters, we use m = √n, t = log(2k/δ) and number of samples |M| = √n·t.
Assign. After identifying the farthest point, we reassign all the points to the centers (now including the farthest point as the new center) closest to them. We calculate a movement score called MCount for every point with respect to each center: MCount(u, s_j) = Σ_{s_l∈S\{s_j}} 1{O(s_j, u, s_l, u) == Yes}, for any record u ∈ V and s_j ∈ S. This step is similar to the Count-Max Algorithm. We assign the point u to the center with the highest MCount value.
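The MCount-based assignment can be sketched as follows (synthetic quadruplet oracle as before):

```python
import random

def make_quad_oracle(dist, p, rng):
    """Simulated quadruplet oracle: 'is d(a,b) < d(c,d)?', flipped
    independently with probability p."""
    def oracle(a, b, c, d):
        ans = dist(a, b) < dist(c, d)
        return (not ans) if rng.random() < p else ans
    return oracle

def assign_mcount(V, centers, oracle):
    """Assign (sketch): each point u goes to the center s with the
    highest MCount(u, s), i.e., the number of other centers the oracle
    judges to be farther from u than s is."""
    clusters = {c: [] for c in centers}
    for u in V:
        best = max(centers, key=lambda s: sum(
            1 for t in centers if t != s and oracle(s, u, t, u)))
        clusters[best].append(u)
    return clusters
```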
Example 4.1. Suppose we run the k-center algorithm with k = 2 and μ = 1 on the points in Example 3.8. The optimal centers are u and t with radius 51. On running our algorithm, suppose w is chosen as the first center and Approx-Farthest calculates Count values similar to Example 3.2. We have Count values of s, t, u, v as 1, 2, 3, 0 respectively. Therefore, our algorithm identifies u as the second center, achieving a 3-approximation.
Theoretical Guarantees. We now prove the approximation guarantee obtained by Algorithm 6.
In each iteration, we show that Assign reassigns each point to a center whose distance is approximately the distance from the closest center. This is surprising, given that we only use MCount scores for the assignment. Similarly, we show that Approx-Farthest (Algorithm 4) identifies a close approximation of the true farthest point. Concretely, we show that every point is assigned to a center which is a (1 + μ)^2 approximation, and Algorithm 4 identifies a farthest point w which is a (1 + μ)^5 approximation.
In every iteration of the Greedy algorithm, if we identify an α-approximation of the farthest point and a β-approximation when reassigning the points, then we show that the output clusters are a 2αβ^2-approximation to the k-center objective. For complete details, please refer to Appendix 10. Combining all the claims, for a given error parameter μ, we obtain:
Theorem 4.2. For μ < 1/18, Algorithm 6 achieves a (2 + O(μ))-approximation for the k-center objective using O(nk^2 + nk·log^2(n/δ)) oracle queries with probability 1 − δ.
4.2 Probabilistic Noise
For probabilistic noise, each query can be incorrect with probability p, and therefore Algorithm 6 may lead to poor approximation guarantees. Here, we build upon the results from Section 3.3 and provide Approx-Farthest and Assign algorithms. We denote the size of the minimum cluster among the optimum clusters C* by m, and the total failure probability of our algorithms by δ. We assume p ≤ 0.40, a constant strictly less than 1/2. Let γ = 450 be a large constant used in our algorithms to obtain the claimed guarantees.
Overview. Algorithm 7 presents the pseudo-code of our algorithm, which operates in two phases. In the first phase (lines 3–12), we sample each point with probability γ log(n/δ)/m to identify a small sample of ≈ γ·n·log(n/δ)/m points (denoted by P) and use Algorithm 7 to identify k centers iteratively. In this process, we also identify a core for each cluster (denoted by R). Formally, a core is defined as a set of Θ(log(n/δ)) points that are very close to the center with high probability. The cores are then used in the second phase (line 15) for the assignment of the remaining points. Now, we describe
Algorithm 7 Greedy Clustering
1: Input: Set of points V, smallest cluster size m.
2: Output: Clusters C
3: For every u ∈ V, include u in P with probability γ log(n/δ)/m
4: s_1 ← select an arbitrary point from P, S ← {s_1}
5: C(s_1) ← P
6: R(s_1) ← Identify-Core(C(s_1), s_1)
7: for i = 2 to k do
8:   s_i ← Approx-Farthest(S, C)
9:   C, R ← Assign(S, s_i, R)
10:  S ← S ∪ {s_i}
11: C ← Assign-Final(S, R, V \ P)
12: return C
the main challenge in extending the Approx-Farthest and Assign ideas of Algorithm 6. Given a cluster C containing the center s_i, when we find the Approx-Farthest, the ideas from Section 3.2 give an O(log^2 n) rank approximation. As shown in Section 3.3, we can improve the approximation guarantee by considering a set of Θ(log(n/δ)) points closest to s_i, denoted by R(s_i) and called the core of s_i. We argue that such an assumption on the set R is justified. For example, consider the case when clusters are of size Θ(n); then sampling γ log(n/δ) points gives us log(n/δ) points from each optimum cluster, which means that there are log(n/δ) points within a distance of 2·OPT from every sampled point, where OPT refers to the optimum k-center objective.
Assign. Consider a point s_j such that we have to assign points to form the cluster C(s_j) centered at s_j. We calculate an assignment score (called ACount in line 4) for every point u of a cluster C(s_i) \ R(s_i) centered at s_i. ACount captures the total number of times u is considered to belong to the same cluster as x, for each x in the core. Intuitively, points that belong to the same cluster as s_j are expected to have a higher ACount score. Based on the scores, we move u to C(s_j) or keep it in C(s_i).
Algorithm 8 Assign(S, s_j, R)
1: C(s_j) ← {s_j}
2: for s_i ∈ S do
3:   for u ∈ C(s_i) \ R(s_i) do
4:     ACount(u, s_j, s_i) = Σ_{v_l∈R(s_i)} 1{O(u, s_j, u, v_l) == Yes}
5:     if ACount(u, s_j, s_i) > 0.3|R(s_i)| then
6:       C(s_j) ← C(s_j) ∪ {u}; C(s_i) ← C(s_i) \ {u}
7: R(s_j) ← Identify-Core(C(s_j), s_j)
8: return C, R
Algorithm 9 Identify-Core(C(s_i), s_i)
1: for u ∈ C(s_i) do
2:   Count(u) = Σ_{x∈C(s_i)} 1{O(s_i, x, s_i, u) == No}
3: Let R(s_i) denote the set of 8γ log(n/δ)/9 points with the highest Count values.
4: return R(s_i)
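Algorithm 9's scoring step can be sketched as follows (simulated oracle; `core_size` plays the role of 8γ log(n/δ)/9):

```python
import random

def make_quad_oracle(dist, p, rng):
    """Simulated quadruplet oracle: 'is d(a,b) < d(c,d)?', flipped
    independently with probability p."""
    def oracle(a, b, c, d):
        ans = dist(a, b) < dist(c, d)
        return (not ans) if rng.random() < p else ans
    return oracle

def identify_core(cluster, center, core_size, oracle):
    """Identify-Core (sketch): score every point u by how often the
    oracle says u is at least as close to the center as another cluster
    member, and keep the `core_size` highest scorers."""
    def count(u):
        return sum(1 for x in cluster if x != u
                   and not oracle(center, x, center, u))
    return sorted(cluster, key=count, reverse=True)[:core_size]
```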
Identify-Core. After forming the cluster C(s_i), we identify the core of s_i. For this, we calculate a score for each point u, denoted Count(u), which captures the number of times u is closer to s_i than other points in C(s_i). Intuitively, we expect points with high values of Count to belong to C*(s_i), i.e., the optimum cluster containing s_i. Therefore, we sort these Count scores and return the highest scored points.
Approx-Farthest. For a set of clusters C and a set of centers S, we construct the pairs (v_i, s_i) where v_i is assigned to cluster C(s_i) centered at s_i ∈ S, and each center s_i ∈ S has a corresponding core R(s_i). The farthest point can be found by finding the maximum distance (point, center) pair among all the points considered. To do so, we use the ideas developed in Section 3.3.
We leverage ClusterComp (Algorithm 10) to compare the distances of two points, say v_i and v_j, from their respective centers s_i and s_j. ClusterComp gives a robust answer to the pairwise comparison query O(v_i, s_i, v_j, s_j) using the cores R(s_i) and R(s_j). ClusterComp can be used as the pairwise comparison subroutine, in place of PairwiseComp, in the algorithm of Section 3 to calculate the farthest point. For every s_i ∈ S, let T(s_i) denote an arbitrary set of √|R(s_i)| points from R(s_i). For a ClusterComp comparison
query between the pairs (v_i, s_i) and (v_j, s_j), we use these subsets in Algorithm 10 to ensure that we only make Θ(log(n/δ)) oracle queries for every comparison. However, when the query is between points of the same cluster, say C(s_i), we use all the Θ(log(n/δ)) points from R(s_i). For the parameters used to find the maximum using Algorithm 4, we use m = √n, t = log(n/δ).
Example 4.3. Suppose we run k-center Algorithm 7 with k = 2 and m = 2 on the points in Example 3.8. Let w denote the first center chosen, and suppose Algorithm 7 identifies the core R(w) by calculating Count values. If O(u, w, s, w) and O(s, w, t, w) are answered incorrectly (with probability p), we obtain Count values of v, s, u, t as 3, 2, 1, 0 respectively, and v is added to R(w). We identify the second center u by calculating FCount for s, u and t (see Fig. 3). After assigning (using Assign), the clusters identified are {w, v} and {u, s, t}, achieving a 3-approximation.
Algorithm 10 ClusterComp(v_i, s_i, v_j, s_j)
1: comparisons ← 0, FCount(v_i, v_j) ← 0
2: if s_i = s_j then
3:   FCount(v_i, v_j) = Σ_{x∈R(s_i)} 1{O(v_i, x, v_j, x) == Yes}
4:   comparisons ← |R(s_i)|
5: else FCount(v_i, v_j) = Σ_{x∈T(s_i), y∈T(s_j)} 1{O(v_i, x, v_j, y) == Yes}
6:   comparisons ← |T(s_i)| · |T(s_j)|
7: if FCount(v_i, v_j) < 0.3 · comparisons then
8:   return No
9: else return Yes
Assign-Final. After obtaining k clusters on the set of sampled points P, we assign the remaining points using ACount scores, similar to the procedure described in Assign. For every point u that is not sampled, we first assign it to s_1 ∈ S, and if ACount(u, s_2, s_1) ≥ 0.3|R(s_1)|, we re-assign it to s_2, and continue this process iteratively. After assigning all the points, the clusters are returned as output.
Theoretical Guarantees
Our algorithm first constructs a sample P ⊆ V and runs the greedy algorithm on this sampled set of points. Our main idea for ensuring a good approximation of the k-center objective lies in identifying a good core around each center. Using a sampling probability of γ log(n/δ)/m ensures that we have at least Θ(log(n/δ)) points from each of the optimal clusters in our sampled set P. By finding the closest points using Count scores, we identify O(log(n/δ)) points around every center that are in the optimal cluster. Essentially, this forms the core of each cluster. These cores are then used for robust pairwise comparison queries (similar to Section 3.3) in our Approx-Farthest and Assign subroutines. We give the following theorem, which guarantees a constant, i.e., O(1), approximation with high probability.
Theorem 4.4. Given p ≤ 0.4, a failure probability δ, and m = Ω(log^3(n/δ)/δ), Algorithm 7 achieves an O(1)-approximation for the k-center objective using O(nk log(n/δ) + (n^2/m^2)·k log^2(n/δ)) oracle queries with probability 1 − O(δ).
5 HIERARCHICAL CLUSTERING
In this section, we present robust algorithms for agglomerative hierarchical clustering using the single linkage and complete linkage objectives. The naive algorithms initialize every record as a singleton cluster and iteratively merge the closest pair of clusters. For a set of clusters C = {C_1, ..., C_t}, the distance between a pair of clusters C_i and C_j under single linkage clustering is defined as the minimum distance between any pair of records in the clusters: d_SL(C_1, C_2) = min_{v_1∈C_1, v_2∈C_2} d(v_1, v_2). For complete linkage, the cluster distance is defined as the maximum distance between any pair of records. All algorithms discussed in this section extend easily to complete linkage, and therefore we study single linkage clustering. The main challenge in implementing single linkage clustering in the presence of adversarial noise is the identification of the minimum value in a list of at most (n choose 2) distance values. In each iteration, the closest pair of clusters can be identified by using Algorithm 4 (with t = 2 log(n/δ)) to calculate the minimum over the set containing the pairwise distances. For this algorithm, Lemma 5.1 shows that the pair of clusters merged in any iteration is a constant approximation of the optimal merge operation at that iteration. The proof of this lemma follows from Theorem 3.6.
Lemma 5.1. Given a collection of clusters C = {C_1, ..., C_t}, our algorithm to calculate the closest pair (using Algorithm 4) identifies C_1 and C_2 to merge according to the single linkage objective such that d_SL(C_1, C_2) ≤ (1 + μ)^3 min_{C_i,C_j∈C} d_SL(C_i, C_j) with probability 1 − δ/n, and requires O(n^2 log^2(n/δ)) queries.
Algorithm 11 Greedy Algorithm
1: Input: Set of points V
2: Output: Hierarchy H
3: H ← {{v} | v ∈ V}, C ← {{v} | v ∈ V}
4: for C_i ∈ C do
5:   N_i ← NearestNeighbor of C_i among C \ {C_i} using Sec 3.3
6: while |C| > 1 do
7:   Let (C_j, C_l) be the closest pair among (C_i, N_i), ∀C_i ∈ C
8:   C′ ← C_j ∪ C_l
9:   Update adjacency list of C′ with respect to C
10:  Add C′ as parent of C_j and C_l in H.
11:  C ← (C \ {C_j, C_l}) ∪ {C′}
12:  N′ ← NearestNeighbor of C′ from its adjacency list
13: return H
Overview. Agglomerative clustering techniques are known to be inefficient. Each merge iteration compares at most (n choose 2) pairs of distance values, and the algorithm performs n merge operations to construct the hierarchy. This yields an overall query complexity of O(n^3). To improve the query complexity, the SLINK algorithm [47] was proposed, which constructs the hierarchy in O(n^2) comparisons. To implement this algorithm with a comparison oracle, for every cluster C_i ∈ C, we maintain an adjacency list containing every cluster C_j ∈ C along with a pair of records whose distance equals the distance between the two clusters. For example, the entry for C_j in the adjacency list of C_i contains the pair of records (v_i, v_j) such that d(v_i, v_j) = min_{v_a∈C_i, v_b∈C_j} d(v_a, v_b). Algorithm 11 presents the pseudocode for single linkage clustering under the adversarial noise model. The algorithm is initialized with singleton clusters, where every record is a separate cluster. Then, we identify the closest cluster for every C_i ∈ C, denoted by N_i. This step takes
n nearest neighbor queries, each requiring O(n log^2(n/δ)) oracle queries. In every subsequent iteration, we identify the closest pair of clusters (using Section 3.3), say C_j and C_l, from C.
After merging these clusters, the data structure is updated as follows. To update the adjacency list, we need the pair of records with minimum distance between the merged cluster C′ ≡ C_j ∪ C_l and every other cluster C_i ∈ C. From the previous iteration of the algorithm, we already have the minimum distance record pair for (C_j, C_i) and (C_l, C_i). Therefore, a single query between these two pairs of records is sufficient to identify the minimum distance edge between C′ and C_i (formally, d_SL(C_j ∪ C_l, C_i) = min{d_SL(C_j, C_i), d_SL(C_l, C_i)}). The nearest neighbor of the merged cluster is identified by running the minimum calculation over its adjacency list. In Algorithm 11, as we identify the closest pair of clusters, each iteration requires O(n log^2(n/δ)) queries. As the algorithm terminates in at most n iterations, it has an overall query complexity of O(n^2 log^2(n/δ)). In Theorem 5.2, we give an approximation guarantee for every merge operation of Algorithm 11.
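The adjacency-list maintenance can be sketched (noiselessly) as follows; after each merge, the best record pair for (C′, C_i) is the better of the two inherited pairs, which is the single extra comparison the text describes:

```python
def single_linkage(points, dist):
    """SLINK-style single linkage with exact distances (sketch): for
    every pair of clusters keep the closest record pair; after a merge,
    the entry for the new cluster is the better of the two inherited
    entries."""
    clusters = [frozenset([p]) for p in points]
    best = {(A, B): (next(iter(A)), next(iter(B)))
            for A in clusters for B in clusters if A != B}
    merges = []
    while len(clusters) > 1:
        # closest pair of clusters under the single linkage objective
        A, B = min(((A, B) for A in clusters for B in clusters if A != B),
                   key=lambda ab: dist(*best[ab]))
        merged = A | B
        merges.append((set(A), set(B)))
        clusters = [C for C in clusters if C not in (A, B)]
        for C in clusters:
            # d_SL(A ∪ B, C) = min(d_SL(A, C), d_SL(B, C)): one comparison
            pair = min(best[(A, C)], best[(B, C)], key=lambda ab: dist(*ab))
            best[(merged, C)] = pair
            best[(C, merged)] = (pair[1], pair[0])
        clusters.append(merged)
    return merges
```

In the noisy setting, each `dist` comparison above becomes one oracle query (or one robust ClusterComp call).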
Theorem 5.2. In any iteration, suppose the distance between a cluster C_j ∈ C and its identified nearest neighbor N_j is an α-approximation of its distance from the optimal nearest neighbor. Then the distance between the pair of clusters merged by Algorithm 11 is an α(1 + μ)^3 approximation of the optimal distance between the closest pair of clusters in C, with probability 1 − δ, using O(n log^2(n/δ)) oracle queries.
Probabilistic Noise Model. The above discussed algorithms do not extend to probabilistic noise due to the constant probability of error for each query. However, when we are given, a priori, a partitioning of V into clusters of size > log n such that the maximum distance between any pair of records in every cluster is smaller than a constant α, Algorithm 11 can be used to construct the hierarchy correctly. For this case, the algorithm to identify the closest and farthest pair of clusters is the same as the one discussed in Section 3.3. Note that agglomerative clustering algorithms are known to require Ω(n^2) queries, which can be infeasible for million-scale datasets. However, blocking based techniques present efficient heuristics to prune out low similarity pairs [44]. Devising provable algorithms with better time complexity is outside the scope of this work.
6 EXPERIMENTS
In this section, we evaluate the effectiveness of our techniques on various real world datasets and answer the following questions. Q1: Is the quadruplet oracle practically feasible? How do the different types of queries compare in terms of quality and time taken by annotators? Q2: Are the proposed techniques robust to different levels of noise in oracle answers? Q3: How do the query complexity and solution quality of the proposed techniques compare with the optimum for varied levels of noise?
6.1 Experimental Setup
Datasets. We consider the following real-world datasets.
(1) cities dataset [2] comprises 36K cities of the United States. The features of the cities include state, county, zip code, population, time zone, latitude and longitude.
(2) caltech dataset comprises 11.4K images from 20 categories. The ground truth distance between records is calculated using the hierarchical categorization as described in [29].
(3) amazon dataset contains 7K images and textual descriptions collected from amazon.com [31]. For obtaining the ground truth distances, we use Amazon's hierarchical catalog.
(4) monuments dataset comprises 100 images belonging to 10 tourist locations around the world.
(5) dblp contains 1.8M titles of computer science papers from different areas [60]. From these titles, noun phrases were extracted and a dictionary of all the phrases was constructed. The Euclidean distance in word2vec embedding space is considered the ground truth distance between concepts.
Baselines. We compare our techniques with the optimal solution (whenever possible) and the following baselines. (a) Tour2 constructs a binary tournament tree over the entire dataset to compare the values; the root node corresponds to the identified maximum/minimum value (Algorithm 2 with λ = 2). This approach is an adaptation of the maximum-finding algorithm of [15], with the difference that each query is not repeated multiple times to increase the success probability. We also use it to identify the farthest and nearest point in the greedy k-center Algorithm 6 and the closest pair of clusters in hierarchical clustering.
(b) Samp considers a sample of √n records and identifies the farthest/nearest by performing a quadratic number of comparisons over the sampled points using Count-Max. For k-center, Samp considers a sample of k log n points and identifies k centers over these samples using the greedy algorithm. It then assigns all the remaining points to the identified centers by querying each record with every pair of centers.
Calculating the optimal clustering objective for k-center is NP-hard even in the presence of accurate pairwise distances [59]. So, we compare the solution quality with respect to the greedy algorithm on the ground truth distances, denoted by TDist. For farthest, nearest neighbor and hierarchical clustering, TDist denotes the optimal technique that has access to the ground truth distance between records.
Our algorithms are labelled Far for farthest identification, NN for nearest neighbor, kC for k-center and HC for hierarchical clustering, with subscript a denoting the adversarial model and subscript p denoting the probabilistic noise model. All algorithms are implemented in C++ and run on a server with 64GB RAM. The reported results are averaged over 100 randomly chosen iterations. Unless specified, we set t = 1 in Algorithm 4 and γ = 2 in Algorithm 7.
Evaluation Metric. For finding the maximum and nearest neighbors, we compare the different techniques by evaluating the true distance of the returned solution from the queried points. For k-center, we use the objective value, i.e., the maximum radius of the returned clusters, as the evaluation metric and compare against the true greedy algorithm (TDist) and other baselines. For datasets where ground truth clusters are known (amazon, caltech and monuments), we use the F-score over intra-cluster pairs for comparison with the baselines [20]. For hierarchical clustering, we compute the pairs of clusters merged in every iteration and compare the average true distance between these clusters. In addition to the quality of the returned solution, we compare the query complexity and running time of the proposed techniques with the baselines described above.
Noise Estimation. For the cities, amazon, caltech, and monuments datasets, we ran a user study on Amazon Mechanical Turk to estimate the noise in oracle answers over a small sample of the dataset,
Figure 4: Accuracy values (denoted by the color of a cell) for different distance ranges observed during our user study: (a) caltech, (b) amazon. The diagonal entries refer to the quadruplets with similar distance between the corresponding pairs, and the distance increases as we go further away from the diagonal.
often referred to as the validation set. Using crowd responses, we trained a classifier (a random forest [51] obtained the best results) using active learning to act as the quadruplet oracle and reduce the number of queries to the crowd. Our active learning algorithm [50] uses batches of 20 queries, and we stop when the classifier accuracy on the validation set does not improve by more than 0.01 [26]. To efficiently construct a small set of candidates for active learning, and to prune low similarity pairs for dblp, we employ token based blocking [44] for the datasets. For the synthetic oracle, we simulate the quadruplet oracle with different values of the noise parameters.
6.2 User study
In this section, we evaluate users' ability to answer quadruplet queries and compare it with other types of queries.
Setup. We ran a user study on the Amazon Mechanical Turk platform for four datasets: cities, amazon, caltech and monuments. We consider the ground truth distance between record pairs, discretize the distances into buckets, and assign a pair of records to a bucket if its distance falls within the bucket's range. For every pair of buckets, we issue a random subset of log n quadruplet oracle queries (where n is the size of the dataset). Each query is answered by three different crowd workers, and a majority vote is taken as the answer to the query.
6.2.1 Qualitative Analysis of Oracle. In Figure 4, for every pair of buckets, we plot (as a heat map) the accuracy of the answers obtained from crowd workers for quadruplet queries. For all datasets, the average accuracy of quadruplet queries is more than 0.83, and the accuracy is lowest (as low as 0.5) when both pairs of records belong to the same bucket. However, we observe varied behavior across datasets as the distance between the considered pairs increases.
For the caltech dataset, we observe that when the ratio of the distances is more than 1.45 (indicated by a black line in Figure 4(a)), there is no noise (or close to zero noise) in the query responses. Since we observe a sharp decline in noise as the distance between the pairs increases, this suggests that the adversarial noise model is satisfied for this dataset. We observe a similar pattern for the cities and monuments datasets. For the amazon dataset, we observe substantial noise across all distance ranges (see Figure 4(b)) rather than a sharp decline, suggesting that the probabilistic model is satisfied.
Figure 5: Comparison of farthest and NN techniques (TDist, Far/NN, Tour2, Samp) for crowdsourced oracle queries: (a) farthest, higher is better; (b) nearest neighbor (NN), lower is better.
6.2.2 Comparison with pairwise querying mechanisms. To evaluate the benefit of quadruplet queries, we compare the quality of quadruplet comparison oracle answers with the following pairwise oracle query models. (a) Optimal cluster query: this query asks questions of the type "do u and v refer to the same/similar type?". (b) Distance query: "how similar are the records x and y?" In this query, the annotator scores the similarity of the pair on a scale of 1 to 10.

We make the following observations. (i) Optimal cluster queries are answered correctly only if the ground truth clusters refer to different entities (each cluster referring to a distinct entity). Crowd workers tend to answer "No" if the pair of records refer to different entities. Therefore, we observe high precision (more than 0.90) but low recall (0.50 on amazon and 0.30 on caltech for k = 10) of the returned labels. (ii) We observed very high variance in the distance estimation query responses. For all record pairs with identical entities, the users returned distance estimates that were within 20% of the correct distances. In all other cases, the estimates had errors of up to 50%. We provide a more detailed comparison of the quality of clusters identified by pairwise query responses along with quadruplet queries in the next section.
6.3 Crowd Oracle: Solution Quality & Query Complexity
In this section, we compare the quality of our proposed techniques on the datasets for which we performed the user study. Following the findings of Section 6.2, we use the probabilistic model based algorithm for amazon (with p = 0.50) and the adversarial noise model based algorithm for caltech, monuments and cities.

Finding Max and Farthest/Nearest Neighbor. Figure 5 compares the quality of the farthest and nearest neighbor (NN) identified by our proposed techniques along with other baselines. The values are normalized by the maximum value to present all datasets on the same scale. Across all datasets, the point identified by Far and NN is closest to the optimal value, TDist. In contrast, the farthest returned by Tour2 is better than that of Samp for the cities dataset but not for caltech, monuments and amazon. We found that this difference in quality across datasets is due to varied distance distributions between pairs. The cities dataset has a skewed distribution of distances between record pairs, leading to a unique optimal solution to the farthest/NN problem. Due to this, the set of records sampled by Samp does not contain any record that is a good approximation of the optimal farthest. However, ground truth distances between record pairs in amazon, monuments and caltech are less skewed, with more than log n records approximating the optimal farthest point for all queries. Therefore, Samp performs better than Tour2
Figure 6: k-center clustering objective (for kC, Tour2, Samp and TDist) as a function of k, under the adversarial and probabilistic noise models: (a) cities, µ = 1; (b) dblp, µ = 0.5; (c) cities, p = 0.1; (d) dblp, p = 0.1.
Figure 7: Comparison of hierarchical clustering techniques (TDist, HC, Tour2, Samp) with the crowdsourced oracle: (a) single linkage; (b) complete linkage.
on these datasets. We observe that Samp performs worse for NN because our sample does not always contain the closest point.

k-center Clustering. We evaluate the F-score² of the clusters generated by our techniques along with baselines and techniques for the pairwise optimal query mechanism (denoted Oq)³. Table 1 presents a summary of our results for different values of k. Across all datasets, our technique achieves more than 0.90 F-score. On the other hand, Tour2 and Samp do not identify the ground truth clusters correctly, leading to low F-scores. Similarly, Oq achieves poor recall (and hence low F-score) as it labels many record pairs as belonging to separate clusters. For example, a frog and a butterfly belong to the same optimal cluster for caltech (k = 10), but the two records are assigned to different clusters by Oq.

Hierarchical Clustering. Figure 7 compares the average distance of the merged clusters across different iterations of the agglomerative clustering algorithm. Tour2 has O(n³) complexity and does not finish on the cities dataset within 48 hours. The objective values of the different techniques are normalized by the optimal value, with TDist denoting 1. For all datasets, HC performs better than Samp and Tour2. Among datasets, the quality of the hierarchies generated for monuments is similar for all techniques due to low noise.

Query Complexity. To ensure scalability, we trained an active learning based classifier for all the aforementioned experiments. In total, amazon, cities, and caltech required 540 (cost: $32.40), 220 (cost: $13.20) and 280 (cost: $16.80) queries to the crowd, respectively.
6.4 Simulated Oracle: Solution Quality & Query Complexity
In this section, we compare the robustness of the techniques when the query responses are simulated synthetically for given values of µ and p.
²Optimal clusters are identified from the original source of the datasets (amazon and caltech) and manually for monuments.
³We report the results on the sample of queries asked to the crowd, as opposed to training a classifier, because the classifier generates noisier results and has poorer F-score than the quality of labels generated by crowdsourcing.
Technique         | kC   | Tour2 | Samp | Oq*
caltech (k = 10)  | 1    | 0.88  | 0.91 | 0.45
caltech (k = 15)  | 1    | 0.89  | 0.88 | 0.49
caltech (k = 20)  | 0.99 | 0.93  | 0.87 | 0.58
monuments (k = 5) | 1    | 0.95  | 0.97 | 0.77
amazon (k = 7)    | 0.96 | 0.74  | 0.57 | 0.48
amazon (k = 14)   | 0.92 | 0.66  | 0.54 | 0.72

Table 1: F-score comparison of k-center clustering. Oq is marked with * as it was computed on a sample of 150 pairwise queries to the crowd³. All other techniques were run on the complete dataset using a classifier.
Figure 8: Comparison of farthest identification techniques (TDist, Far, Tour2, Samp) for the adversarial and probabilistic noise models: (a) cities, adversarial (varying µ); (b) cities, probabilistic (varying p).
Finding Max and Farthest/Nearest Neighbor. In Figure 8(a), µ = 0 denotes the setting where the oracle answers all queries correctly. In this case, Far and Tour2 identify the optimal solution, but Samp does not identify the optimal solution for cities. In both datasets, Far identifies the correct farthest point for µ < 1. Even with an increase in noise (µ), we observe that the identified farthest point is always at a distance within 4 times the optimal distance (see Figure 8(a)). We observe that the quality of the farthest point identified by Tour2 is close to that of Far for smaller µ because the optimal farthest point v_max has only a few points in the confusion region C (see Section 3) that contains the points close to v_max. For example, less than 10% of the points are present in C when µ = 1 for the cities dataset, i.e., less than 10% of the points return erroneous answers when compared with v_max.

In Figure 8(b), we compare the true distance of the identified farthest points for the case of probabilistic noise with error probability p. We observe that Far identifies points with distance values very close to the farthest distance TDist, across all datasets and error values. This shows that Far performs significantly better than the theoretical approximation presented in Section 3. On the other hand, the solution returned by Samp is more than 4× smaller than the value returned by Far for an error probability of 0.3. Tour2 has similar performance to Far for p ≤ 0.1, but we observe a decline in solution quality for higher noise (p) values.
Figure 9: Comparison of nearest neighbor techniques (TDist, NN, Tour2) for the adversarial and probabilistic noise models (lower is better): (a) cities, adversarial (varying µ); (b) cities, probabilistic (varying p).
In Figures 9(a) and 9(b), we compare the true distance of the identified nearest neighbor for different baselines.
NN shows superior performance compared to Tour2 across all error values. This confirms the lack of robustness of Tour2, as discussed in Section 3. The solution quality of NN does not worsen with increasing error. We omit Samp from the plots because the returned points had very poor performance (as bad as 700 even in the absence of error). We observed similar behavior for other datasets. In terms of query complexity, NN requires around 53 × 10³ queries for the cities dataset, and the number of queries grows linearly with the dataset size. Among the baselines, Tour2 uses 37 × 10³ queries and Samp uses 18 × 10³. In conclusion, we observe that our techniques achieve the best quality across all datasets and error values, while Tour2 performs similarly to Far for low error, and its quality degrades with increasing error.
k-center Clustering. Figure 6 compares the k-center objective of the returned clusters for varying k in the adversarial and probabilistic noise models. TDist denotes the best possible clustering objective, which is guaranteed to be a 2-approximation of the optimal objective. The set of clusters returned by kC is consistently very close to TDist across all datasets, validating the theory. For higher values of k, kC approaches closer to TDist, thereby improving the approximation guarantees. The quality of clusters identified by kC is similar to that of Tour2 and Far for adversarial noise (Figure 6a,b) but considerably better for probabilistic noise (Figure 6c,d).

Running time. Table 2 compares the running time and the number of required quadruplet comparisons for various problems under the adversarial noise model with µ = 1 for the largest dataset, dblp. Far and NN require less than 6 seconds for both the adversarial and probabilistic error models. Our k-center clustering technique requires less than 450 minutes to identify 50 centers for the dblp dataset across the different noise models; the running time grows linearly with k. While the running times of our algorithms are slightly higher than Tour2 for farthest, nearest and k-center, Tour2 did not finish within 48 hours for single and complete linkage hierarchical clustering due to its O(n³) running time. We observe similar performance for the probabilistic noise model. Note that even though the number of comparisons is in the millions, this dataset requires only 740 queries to the crowd workers to train the classifier.
7 CONCLUSION
In this paper, we show how algorithms for various basic tasks such as finding the maximum, nearest neighbor search, k-center clustering, and agglomerative hierarchical clustering can be designed using a distance based comparison oracle in the presence of noise. We believe
Problem          | Our Approach: Time / # Comp | Tour2: Time / # Comp | Samp: Time / # Comp
Farthest         | 0.1 / 2.2M                  | 0.06 / 2M            | 0.07 / 1M
Nearest          | 0.075 / 2M                  | 0.07 / 2M            | 0.61 / 1M
kC (k = 50)      | 450 / 120M                  | 375.3 / 95M          | 477 / 105M
Single Linkage   | 1813 / 990M                 | DNF                  | 1760 / 940M
Complete Linkage | 1950 / 940M                 | DNF                  | 1940 / 920M

Table 2: Running time (in minutes) and number of quadruplet comparisons (denoted by # Comp, in millions) of different techniques for the dblp dataset under the adversarial noise model with µ = 1. DNF denotes "did not finish".
our techniques can be useful for other clustering tasks such as k-means and k-median, and we leave those as future work.
REFERENCES
[1] Google Vision API. https://cloud.google.com/vision.
[2] United States cities database. https://simplemaps.com/data/us-cities.
[3] Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate
clustering with same-cluster queries. In 9th Innovations in Theoretical Computer
Science Conference (ITCS 2018), volume 94, page 40. Schloss DagstuhlβLeibniz-Zentrum fuer Informatik, 2018.
[4] Miklós Ajtai, Vitaly Feldman, Avinatan Hassidim, and Jelani Nelson. Sorting and selection with imprecise comparisons. In International Colloquium on Automata,
Languages, and Programming, pages 37β48. Springer, 2009.[5] Akhil Arora, Sakshi Sinha, Piyush Kumar, and Arnab Bhattacharya. Hd-index:
pushing the scalability-accuracy boundary for approximate knn search in high-dimensional spaces. Proceedings of the VLDB Endowment, 11(8):906β919, 2018.
[6] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances in neural information processing systems, pages3216β3224, 2016.
[7] Shai Ben-David. Clustering-what both theoreticians and practitioners are doingwrong. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] Mark Braverman, Jieming Mao, and S Matthew Weinberg. Parallel algorithmsfor select and partition with noisy comparisons. In Proceedings of the forty-eighth
annual ACM symposium on Theory of Computing, pages 851β862, 2016.[9] Mark Braverman and Elchanan Mossel. Noisy sorting without resampling. In
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms,pages 268β276. Society for Industrial and Applied Mathematics, 2008.
[10] Marco Bressan, Nicolò Cesa-Bianchi, Andrea Paudice, and Fabio Vitale. Correlation clustering with adaptive similarity queries. In Advances in Neural Information
Processing Systems, pages 12510β12519, 2019.[11] Vaggos Chatziafratis, Rad Niazadeh, and Moses Charikar. Hierarchical clustering
with structural constraints. arXiv preprint arXiv:1805.09476, 2018.[12] I Chien, Chao Pan, and Olgica Milenkovic. Query k-means clustering and the
double dixie cup problem. In Advances in Neural Information Processing Systems,pages 6649β6658, 2018.
[13] Tuhinangshu Choudhury, Dhruti Shah, and Nikhil Karamchandani. Top-mclustering with a noisy oracle. In 2019 National Conference on Communications
(NCC), pages 1β6. IEEE, 2019.[14] Eleonora Ciceri, Piero Fraternali, Davide Martinenghi, and Marco Tagliasacchi.
Crowdsourcing for top-k query processing over uncertain data. IEEE Transactionson Knowledge and Data Engineering, 28(1):41β53, 2015.
[15] Susan Davidson, Sanjeev Khanna, Tova Milo, and Sudeepa Roy. Top-k andclustering with noisy comparisons. ACM Trans. Database Syst., 39(4), December2015.
[16] Eyal Dushkin and Tova Milo. Top-k sorting under partial order information. InProceedings of the 2018 International Conference on Management of Data, pages1007β1019, 2018.
[17] Ehsan Emamjomeh-Zadeh and David Kempe. Adaptive hierarchical clusteringusing ordinal queries. In Proceedings of the Twenty-Ninth Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 415β429. SIAM, 2018.[18] Uriel Feige, Prabhakar Raghavan, David Peleg, and Eli Upfal. Computing with
noisy information. SIAM Journal on Computing, 23(5):1001β1018, 1994.[19] Donatella Firmani, Barna Saha, and Divesh Srivastava. Online entity resolution
using an oracle. PVLDB, 9(5):384β395, 2016.[20] Sainyam Galhotra, Donatella Firmani, Barna Saha, and Divesh Srivastava. Robust
entity resolution using random graphs. In Proceedings of the 2018 International
Conference on Management of Data, pages 3β18, 2018.[21] Barbara Geissmann, Stefano Leucci, Chih-Hung Liu, and Paolo Penna. Sorting
with recurrent comparison errors. In 28th International Symposium on Algo-
rithms and Computation (ISAAC 2017). Schloss Dagstuhl-Leibniz-Zentrum fuerInformatik, 2017.
[22] Barbara Geissmann, Stefano Leucci, Chih-Hung Liu, and Paolo Penna. Optimal sorting with persistent comparison errors. In 27th Annual European Symposium on
Algorithms (ESA 2019), volume 144, page 49. Schloss Dagstuhl-Leibniz-ZentrumfΓΌr Informatik, 2019.
[23] Barbara Geissmann, Stefano Leucci, Chih-Hung Liu, and Paolo Penna. Optimaldislocation with persistent errors in subquadratic time. Theory of Computing
Systems, 64(3):508–521, 2020.
[24] Debarghya Ghoshdastidar, Michaël Perrot, and Ulrike von Luxburg. Foundations
of comparison-based hierarchical clustering. In Advances in Neural Information
Processing Systems, pages 7454β7464, 2019.[25] Yogesh Girdhar and Gregory Dudek. Efficient on-line data summarization using
extremum summaries. In 2012 IEEE International Conference on Robotics and
Automation, pages 3490β3496. IEEE, 2012.[26] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F Naughton, Narasimhan
Rampalli, Jude Shavlik, and Xiaojin Zhu. Corleone: Hands-off crowdsourcing forentity matching. In Proceedings of the 2014 ACM SIGMOD international conference
on Management of data, pages 601β612, 2014.[27] Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance.
Theoretical Computer Science, 38:293β306, 1985.[28] Kasper Green Larsen, Michael Mitzenmacher, and Charalampos Tsourakakis.
Clustering with a faulty oracle. In Proceedings of The Web Conference 2020, WWW '20, pages 2831–2834, New York, NY, USA, 2020. Association for Computing Machinery.
[29] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object categorydataset. 2007.
[30] Stephen Guo, Aditya Parameswaran, and Hector Garcia-Molina. So who won?dynamic max discovery with the crowd. In Proceedings of the 2012 ACM SIGMOD
International Conference on Management of Data, pages 385β396, 2012.[31] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution
of fashion trends with one-class collaborative filtering. In proceedings of the 25th
international conference on world wide web, pages 507β517, 2016.[32] Max Hopkins, Daniel Kane, Shachar Lovett, and Gaurav Mahajan. Noise-
tolerant, reliable active classification with comparison queries. arXiv preprintarXiv:2001.05497, 2020.
[33] Wasim Huleihel, Arya Mazumdar, Muriel Médard, and Soumyabrata Pal. Same-cluster querying for overlapping clusters. In Advances in Neural Information
Processing Systems, pages 10485β10495, 2019.[34] Christina Ilvento. Metric learning for individual fairness. arXiv preprint
arXiv:1906.00250, 2019.[35] Ehsan Kazemi, Lin Chen, Sanjoy Dasgupta, and Amin Karbasi. Comparison based
learning from weak oracles. arXiv preprint arXiv:1802.06942, 2018.[36] Taewan Kim and Joydeep Ghosh. Relaxed oracles for semi-supervised clustering.
arXiv preprint arXiv:1711.07433, 2017.[37] Taewan Kim and Joydeep Ghosh. Semi-supervised active clustering with weak
oracles. arXiv preprint arXiv:1709.03202, 2017.[38] Rolf Klein, Rainer Penninger, Christian Sohler, and David P Woodruff. Tolerant
algorithms. In European Symposium on Algorithms, pages 736β747. Springer,2011.
[39] Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. Fair k-center clustering for data summarization. In International Conference on Machine Learning, pages 3448–3457, 2019.
[40] Ngai Meng Kou, Yan Li, Hao Wang, Leong Hou U, and Zhiguo Gong. Crowd-
sourced top-k queries by confidence-aware pairwise judgments. In Proceedings of
the 2017 ACM International Conference on Management of Data, pages 1415β1430,2017.
[41] Blake Mason, Ardhendu Tripathy, and Robert Nowak. Learning nearest neighborgraphs from noisy distance samples. In Advances in Neural Information Processing
Systems, pages 9586β9596, 2019.[42] Arya Mazumdar and Barna Saha. Clustering with noisy queries. In Advances in
Neural Information Processing Systems, pages 5788β5799, 2017.[43] Arya Mazumdar and Barna Saha. Query complexity of clustering with side
information. In Advances in Neural Information Processing Systems, pages 4682β4693, 2017.
[44] George Papadakis, Jonathan Svirsky, Avigdor Gal, and Themis Palpanas. Com-parative analysis of approximate blocking techniques for entity resolution. Pro-ceedings of the VLDB Endowment, 9(9):684β695, 2016.
[45] Vassilis Polychronopoulos, Luca De Alfaro, James Davis, Hector Garcia-Molina, and Neoklis Polyzotis. Human-powered top-k lists. In WebDB, pages 25–30, 2013.
[46] Dražen Prelec, H. Sebastian Seung, and John McCoy. A solution to the single-question crowd wisdom problem. Nature, 541(7638):532–535, 2017.
[47] Robin Sibson. Slink: an optimally efficient algorithm for the single-link clustermethod. The computer journal, 16(1):30β34, 1973.
[48] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai.Adaptively learning the crowd kernel. In Proceedings of the 28th International
Conference on International Conference on Machine Learning, pages 673β680, 2011.[49] Antti Ukkonen. Crowdsourced correlation clustering with relative distance
comparisons. In 2017 IEEE International Conference on Data Mining (ICDM), pages1117β1122. IEEE, 2017.
[50] modAL library. https://modal-python.readthedocs.io/en/latest/.
[51] Scikit-learn. https://scikit-learn.org/stable/.
[52] Vijay V Vazirani. Approximation algorithms. Springer Science & Business Media,2013.
[53] Petros Venetis, Hector Garcia-Molina, Kerui Huang, and Neoklis Polyzotis. Maxalgorithms in crowdsourcing environments. In Proceedings of the 21st internationalconference on World Wide Web, pages 989β998, 2012.
[54] Victor Verdugo. Skyline computation with noisy comparisons. In Combinatorial
Algorithms: 31st International Workshop, IWOCA 2020, Bordeaux, France, June
8β10, 2020, Proceedings, page 289. Springer.[55] Vasilis Verroios and Hector Garcia-Molina. Entity resolution with crowd errors.
In 2015 IEEE 31st International Conference on Data Engineering, pages 219β230.IEEE, 2015.
[56] Norases Vesdapunt, Kedar Bellare, and Nilesh Dalvi. Crowdsourcing algorithmsfor entity resolution. Proceedings of the VLDB Endowment, 7(12):1071β1082, 2014.
[57] Ramya Korlakai Vinayak and Babak Hassibi. Crowdsourced clustering: Queryingedges vs triangles. In Advances in Neural Information Processing Systems, pages1316β1324, 2016.
[58] Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. Crowder:Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11),2012.
[59] David P. Williamson and David B. Shmoys. The design of approximation algorithms. Cambridge University Press, 2011.
[60] Chao Zhang, Fangbo Tao, Xiusi Chen, Jiaming Shen, Meng Jiang, Brian Sadler,Michelle Vanni, and Jiawei Han. Taxogen: Unsupervised topic taxonomy con-struction by adaptive term embedding and clustering. In Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,pages 2701β2709, 2018.
8 FINDING MAXIMUM
Lemma 8.1 (Hoeffding's Inequality). If X_1, X_2, · · · , X_n are independent random variables with a_i ≤ X_i ≤ b_i for all i ∈ [n], then

    Pr[ | Σ_i X_i − E[Σ_i X_i] | ≥ nε ] ≤ 2 exp( −2n²ε² / Σ_i (b_i − a_i)² )
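As an illustrative sanity check (not part of the paper; the helper names below are our own), the bound can be compared against the empirical tail frequency of a Bernoulli sum:

```python
import math
import random

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    # Two-sided Hoeffding bound for n i.i.d. variables supported on [a, b].
    return 2 * math.exp(-2 * n * eps ** 2 / (b - a) ** 2)

def empirical_tail(n, eps, trials=2000, p=0.5, rng=random.Random(0)):
    # Fraction of trials where |sum - E[sum]| >= n * eps for Bernoulli(p) sums.
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(n))
        if abs(s - n * p) >= n * eps:
            hits += 1
    return hits / trials

# The observed tail frequency stays below the Hoeffding bound.
assert empirical_tail(200, 0.1) <= hoeffding_bound(200, 0.1)
```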
8.1 Adversarial Noise
Let the maximum value among N be denoted by v_max, and let the set of records for which the oracle answer can be incorrect be given by

    C = { u | u ∈ N, u ≥ v_max/(1 + µ) }
Claim 8.2. For any partition N_i, Tournament(N_i) uses at most 2|N_i| oracle queries.

Proof. Consider the j-th round in Tournament. We can observe that the number of remaining values is at most |N_i|/2^j. So, we make at most |N_i|/2^(j+1) oracle queries in this round. The total number of oracle queries made is

    Σ_{j=0}^{log n} |N_i|/2^(j+1) ≤ 2|N_i|. □
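A minimal sketch of this pairwise-elimination tournament (with a hypothetical boolean `oracle(a, b)` that declares whether a beats b; this is our illustration, not the paper's exact pseudocode):

```python
def tournament(values, oracle):
    """Repeatedly compare disjoint pairs and keep the declared winners.

    Round j starts with at most |N|/2^j values and uses at most
    |N|/2^(j+1) queries, so the total number of queries is at most 2|N|.
    """
    remaining = list(values)
    queries = 0
    while len(remaining) > 1:
        winners = []
        for a, b in zip(remaining[::2], remaining[1::2]):
            queries += 1
            winners.append(a if oracle(a, b) else b)
        if len(remaining) % 2 == 1:  # an odd element advances for free
            winners.append(remaining[-1])
        remaining = winners
    return remaining[0], queries
```

With a noiseless oracle the true maximum wins every round; with noisy answers, the winner can only be displaced by values it was directly compared against.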
Lemma 8.3. Given a set of values N, Count-Max(N) returns a (1 + µ)² approximation of the maximum value of N using O(|N|²) oracle queries.

Proof. Let v_max = max{x ∈ N}. Consider a value w ∈ N such that w < v_max/(1 + µ)². We compare the Count values for v_max and w, given by Count(v_max, N) = Σ_{x∈N} 1{O(v_max, x) == No} and Count(w, N) = Σ_{x∈N} 1{O(w, x) == No}. We argue that w can never be returned by Algorithm 1, i.e., Count(w, N) < Count(v_max, N).

    Count(v_max, N) = Σ_{x∈N} 1{O(v_max, x) == No} ≥ Σ_{x∈N\{v_max}} 1{x < v_max/(1 + µ)}
                    = 1{O(v_max, w) == No} + Σ_{x∈N\{v_max, w}} 1{x < v_max/(1 + µ)}
                    = 1 + Σ_{x∈N\{v_max, w}} 1{x < v_max/(1 + µ)}

    Count(w, N) = Σ_{y∈N} 1{O(w, y) == No} = Σ_{y∈N\{w, v_max}} 1{O(w, y) == No}
                ≤ Σ_{y∈N\{w, v_max}} 1{y ≤ (1 + µ)w}
                ≤ Σ_{y∈N\{w, v_max}} 1{y ≤ v_max/(1 + µ)}

Combining the two, we have: Count(v_max, N) > Count(w, N).

This shows that the Count of v_max is strictly greater than the Count of any point w with w < v_max/(1 + µ)². Therefore, our algorithm would have output v_max instead of w. For calculating the Count for all values in N, we make at most |N|² oracle queries, as we compare every value with every other value. Finally, we output the value with the highest Count as the maximum. Hence, the claim. □
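As a rough illustration, the counting scheme can be sketched as follows (our hypothetical interface: `oracle(v, x)` returns True when the oracle declares that x is not larger than v; this is a sketch, not the paper's exact Algorithm 1):

```python
def count_max(values, oracle):
    """Return the value that "wins" the most pairwise comparisons.

    Uses |N|^2 oracle queries; a bounded number of erroneous answers
    can only perturb the winner's rank, not hand the win to a value
    far below the maximum.
    """
    best, best_count = None, -1
    for i, v in enumerate(values):
        count = sum(1 for j, x in enumerate(values) if j != i and oracle(v, x))
        if count > best_count:
            best, best_count = v, count
    return best
```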
Lemma 8.4 (Lemma 3.3 restated). Suppose v_max is the maximum value among the set of records N. Algorithm 2 outputs a value u_max such that u_max ≥ v_max/(1 + µ)^(2 log_λ n), using O(nλ) oracle queries.

Proof. From Lemma 8.3, we lose a factor of (1 + µ)² in each level of the tournament tree; hence, after log_λ n levels, the final output has an approximation guarantee of (1 + µ)^(2 log_λ n). The total number of queries used is given by

    Σ_{i=0}^{log_λ n} (|N_i|/λ) · λ² = O(nλ)

where |N_i| is the number of records at level i. □
Lemma 8.5. Suppose |C| > √n/2. Let S denote a set of 2√n log(2/δ) samples obtained by uniform sampling with replacement from N. Then, S contains a (1 + µ) approximation of the maximum value v_max with probability 1 − δ/2.

Proof. Consider the first step, where we use a uniformly random sample S of √n·t = 2√n log(2/δ) values from N (obtained by sampling with replacement). Given |C| ≥ √n/2, the probability that S contains a value from C is given by

    Pr[S ∩ C ≠ ∅] = 1 − (1 − |C|/n)^|S| > 1 − (1 − 1/(2√n))^(2√n log(2/δ)) > 1 − δ/2

So, with probability 1 − δ/2, there exists a value u ∈ C ∩ S. Hence, the claim. □
Lemma 8.6. Suppose the partition N_i contains the maximum value v_max of N. If |C| ≤ √n/2, then Tournament(N_i) returns v_max with probability at least 1/2.

Proof. Algorithm 4 uses a modified tournament tree that partitions the set N into r = √n parts of size n/r = √n each and identifies a maximum b_i from each partition N_i using Algorithm 2. If v_max ∈ N_i, then

    E[|C ∩ N_i|] = |C|/r = √n/(2√n) = 1/2

Using Markov's inequality, the probability that N_i contains a value from C is given by

    Pr[|C ∩ N_i| ≥ 1] ≤ E[|C ∩ N_i|] ≤ 1/2

Therefore, with probability at least 1/2, v_max is never compared with any point from C in the partition N_i containing v_max. Hence, v_max is returned by Tournament(N_i) with probability at least 1/2. □
Lemma 8.7 (Lemma 3.5 restated). (1) If |C| > √n/2, then there exists a value v_j ∈ S satisfying v_j ≥ v_max/(1 + µ) with probability 1 − δ/2. (2) Suppose |C| ≤ √n/2. Then, B contains v_max with probability at least 1 − δ/2.

Proof. Claim (1) follows from Lemma 8.5.

In every iteration i ≤ t of Algorithm 4, we have that v_max ∈ B_i with probability 1/2 (using Lemma 8.6). To increase the success probability, we run this procedure t times and collect all the outputs. Among the t = 2 log(2/δ) runs of Algorithm 2, v_max is never compared with any value of C in at least one of the iterations with probability at least

    1 − (1 − 1/2)^(2 log(2/δ)) ≥ 1 − δ/2

Hence, B = ∪_i B_i contains v_max with probability 1 − δ/2. □
Theorem 8.8 (Theorem 3.6 restated). Given a set of values N, Algorithm 4 returns a (1 + µ)³ approximation of the maximum value with probability 1 − δ, using O(n log²(1/δ)) oracle queries.

Proof. In Algorithm 4, we first identify an approximate maximum value using sampling. If |C| ≥ √n/2, then, from Lemma 8.5, the value returned is a (1 + µ) approximation of the maximum value of N. Otherwise, from Lemma 8.7, B contains v_max with probability 1 − δ/2. As we use Count-Max on the set S ∪ B, the value returned, i.e., u_max, is a (1 + µ)² approximation of the maximum among the values in S ∪ B. Therefore, u_max ≥ v_max/(1 + µ)³. Using a union bound, the total probability of failure is δ.

For the query complexity, Algorithm 3 obtains a set S of √n·t sample values. Along with the obtained set B (where |B| = nt/r), we use Count-Max on S ∪ B to output the maximum u_max. This step requires O(|S ∪ B|²) = O((√n·t + nt/r)²) oracle queries. In an iteration i, for obtaining B_i, we make O(√n · |N_i|) = O(n) oracle queries (Claim 8.2), and over t iterations, we make O(nt) queries. Using t = 2 log(2/δ) and r = √n, in total we make O(nt + (√n·t + nt/r)²) = O(n log²(1/δ)) oracle queries. Hence, the theorem. □
8.2 Probabilistic Noise

Lemma 8.9. Suppose the maximum value u_max is returned by Algorithm 2 with parameters (N, n). Then, rank(u_max, N) = O(√(n log(1/δ))) with probability 1 − δ.

Proof. For the maximum value v_max, the expected count value is

    E[Count(v_max, N)] = Σ_{w∈N} Pr[O(v_max, w) == No] = (n − 1)(1 − p)

Using Hoeffding's inequality, with probability 1 − δ/2:

    Count(v_max, N) ≥ (n − 1)(1 − p) − √(((n − 1) log(2/δ))/2)

Consider a record u ∈ N with rank at least 5√(2n log(2/δ)). Then,

    E[Count(u, N)] = Σ_{w∈N} Pr[O(u, w) == No] = (n − rank(u))(1 − p) + (rank(u) − 1)p

Using Hoeffding's inequality, with probability 1 − δ/2:

    Count(u, N) < (n − 1)(1 − p) − (rank(u) − 1)(1 − 2p) + √(0.5(n − 1) log(2/δ))
                < (n − 1)(1 − p) − (5√(2n log(2/δ)) − 1)(1 − 2p) + √(0.5(n − 1) log(2/δ))
                < Count(v_max, N)

The last inequality holds for any value of p ≤ 0.4. As Algorithm 2 returns the record u_max with the maximum Count value, we have that rank(u_max, N) = O(√(n log(1/δ))). Using a union bound over the above conditions, we have the claim. □
To improve the query complexity, we use an early stopping criterion that discards a value x using Count(x, S) when it determines that x has no chance of being the maximum. Algorithm 12 presents the pseudocode for this modified count calculation. We sample 100 log(n/δ) values uniformly at random, denoted by S_t, and compare every non-sampled point with S_t. We argue that doing so helps us eliminate the values that are far away from the maximum in the sorted ranking. Using Algorithm 12, we compute the Count score of each value u ∈ N \ S_t with respect to S_t, and if Count(u, S_t) ≥ 50 log(n/δ), we keep u available for the subsequent iterations.
Algorithm 12 Count-Max-Prob: Maximum with Probabilistic Noise
1: Input: A set N of n values, failure probability δ.
2: Output: An approximate maximum value of N.
3: t ← 1
4: while t < log(n) or |N| > 100 log(n/δ) do
5:     Let S_t denote a set of 100 log(n/δ) values obtained by sampling uniformly at random from N with replacement.
6:     Set R ← ∅
7:     for u ∈ N \ S_t do
8:         if Count(u, S_t) ≥ 50 log(n/δ) then
9:             R ← R ∪ {u}
10:    N ← R, t ← t + 1
11: u_max ← Count-Max(N)
12: return u_max
As Algorithm 12 considers each value u ∈ N \ S_t by iteratively comparing it with each value x ∈ S_t, and the error probability is less than p, the expected count of v_max (if it is still present) at any iteration t is (1 − p)|S_t|. Accounting for the deviation around the expected value, we have that Count(v_max, S_t) is at least 50 log(n/δ) when p ≤ 0.4⁴. If a particular value u has Count(u, S_t) < 50 log(n/δ) in any iteration, then it cannot be the largest value in N, and therefore we remove it from the set of possible candidates for the maximum. Therefore, any value that remains in N after an iteration t must have rank close to that of v_max. We argue that after every iteration, at least a 1/60 fraction of the remaining candidates is discarded.
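A compact sketch of this elimination loop (our own simplification: hypothetical helper names, a stopping rule that exits as soon as the candidate set is small, and a guard that keeps the candidate set non-empty):

```python
import math
import random

def count(u, sample, oracle):
    # Number of sampled values that u "wins" against.
    return sum(1 for x in sample if oracle(u, x))

def count_max_prob(values, oracle, delta=0.1, rng=random.Random(0)):
    """Discard values whose win-count against a random sample is too
    low to be the maximum, then run an exhaustive count on the rest."""
    n = len(values)
    m = max(1, int(100 * math.log(n / delta)))       # sample size
    thresh = max(1, int(50 * math.log(n / delta)))   # survival threshold
    candidates, t = list(values), 1
    while t < math.log(n) and len(candidates) > m:
        sample = [rng.choice(candidates) for _ in range(m)]
        keep = [u for u in candidates
                if u not in sample and count(u, sample, oracle) >= thresh]
        candidates = keep or candidates  # guard against an empty survivor set
        t += 1
    # Final exhaustive count over the survivors.
    return max(candidates, key=lambda u: count(u, candidates, oracle))
```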
Lemma 8.10. In an iteration t with n_t remaining records, using Algorithm 5, with probability 1 − δ/n, we discard at least (59/60) · n_t records.
Proof. Consider an iteration t of Algorithm 5 with n_t remaining records, and a record u with rank α · n_t. Now, we have:
E[Count(u, S_t)] = ((1 − α)(1 − p) + αp) · 100 log(n/δ)
For α = 0, i.e., for the maximum value v_max:
E[Count(v_max, S_t)] = (1 − p) · 100 log(n/δ)
Using p ≤ 0.4 and Hoeffding's inequality, with probability 1 − δ/n², we have:
Count(v_max, S_t) ≥ (1 − p) · 100 log(n/δ) − √(100 log(n/δ)) ≥ 50 log(n/δ)
⁴The constants 50, 100, etc. are not optimized and are set just to satisfy certain concentration bounds.
For u, we calculate the Count value. Using p ≤ 0.4 and Hoeffding's inequality, with probability 1 − δ/n², we have:
Count(u, S_t) < ((1 − α)(1 − p) + αp) · 100 log(n/δ) + √(100((1 − α)(1 − p) + αp) log(n/δ))
< ((1 − 0.6α) · 100 + √(100(1 − 0.6α))) log(n/δ) < 50 log(n/δ)
Upon calculation, the last inequality holds for α > 59/60. Therefore, using a union bound, with probability 1 − O(δ/n), all records u with rank at least (59/60) · n_t satisfy:
Count(u, S_t) < Count(v_max, S_t)
So, all such values can be removed. Hence, the claim. □
In the previous lemma, we argued that in every iteration at least a 59/60 fraction of the candidates is removed, and therefore the algorithm terminates in Θ(log n) iterations. In each iteration, we discard the sampled values S_t to ensure that there is no dependency between the Count scores, so that our guarantees hold. As we remove at most O(t · log(n/δ)) = O(log²(n/δ)) sampled points, our final statement of the result is:
Lemma 8.11. The query complexity of Algorithm 5 is O(n · log²(n/δ)), and u_max satisfies rank(u_max, V) ≤ O(log²(n/δ)) with probability 1 − δ.
Proof. From Lemma 8.10, with probability 1 − δ/n, after iteration t, at least (59/60) · n_t records are removed, along with the 100 log(n/δ) records that are sampled. Therefore, we have:
n_{t+1} ≤ n_t/60 − 100 log(n/δ)
After log(n/δ) iterations, we have n_{t+1} ≤ 1. As we removed log_60 n · 100 log(n/δ) sampled records in total, these could include records with rank ≤ 100 log²(n/δ). So, the rank of the output u_max is at most 100 log²(n/δ). In an iteration t, the number of oracle queries for calculating Count values is O(n_t · log(n/δ)). In total, Algorithm 5 makes O(n log²(n/δ)) oracle queries. Using a union bound over log(n/δ) iterations, we get a total failure probability of δ. □
Theorem 8.12 (Theorem 3.7 restated). There is an algorithm that returns u_max ∈ V such that rank(u_max, V) = O(log²(n/δ)) with probability 1 − δ, and requires O(n log²(n/δ)) oracle queries.
Proof. The proof follows from Lemma 8.11. □
9 FARTHEST AND NEAREST NEIGHBOR
Lemma 9.1 (Lemma 3.9 restated). Suppose max_{v_i∈T} d(u, v_i) ≤ α and |T| ≥ 6 log(1/δ). Consider two records v_i and v_j such that d(u, v_i) < d(u, v_j) − 2α. Then FCount(v_i, v_j) ≥ 0.3|T| with a probability of 1 − δ.
Proof. Since d(u, v_i) < d(u, v_j) − 2α, for a point x ∈ T:
d(v_j, x) ≥ d(u, v_j) − d(u, x)
> d(u, v_i) + 2α − d(u, x)
≥ d(v_i, x) − d(u, x) + 2α − d(u, x)
= d(v_i, x) + 2α − 2d(u, x)
≥ d(v_i, x)
So, O(v_i, x, v_j, x) is No with a probability p. As p ≤ 0.4, we have:
E[FCount(v_i, v_j)] = (1 − p)|T|
Pr[FCount(v_i, v_j) ≤ 0.3|T|] ≤ Pr[FCount(v_i, v_j) ≤ (1 − p)|T|/2]
From Hoeffding's inequality (with binary random variables), with probability exp(−|T|(1 − p)²/2) ≤ δ (using |T| ≥ 6 log(1/δ), p < 0.4), we have FCount(v_i, v_j) ≤ (1 − p)|T|/2. Therefore, with probability at most δ, FCount(v_i, v_j) ≤ 0.3|T|. □
For the sake of completeness, we restate the Count definition that is used in Algorithm Count-Max. For every oracle comparison, we replace it with the pairwise comparison query described in Section 3.3. Let u be a query point and T denote a set of Θ(log(n/δ)) points within a distance of α from u. We maintain a Count score for a given point v_i ∈ S as:
Count(u, v_i, S, T) = Σ_{v_j ∈ S\{v_i}} 1{Pairwise-Comp(u, v_i, v_j, T) == No}
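A minimal sketch of this counting scheme is below. It assumes (our reading of Section 3.3, not a verbatim restatement) that Pairwise-Comp answers by thresholding FCount over the anchor set T at 0.3|T|, as in Lemma 9.1; the oracle constructor and all names are illustrative.

```python
import random

def make_noisy_oracle(dist, p, rng):
    """Quadruplet oracle O(a,b,c,d): is d(a,b) <= d(c,d)? Flips w.p. p."""
    def O(a, b, c, d):
        truth = dist(a, b) <= dist(c, d)
        return truth if rng.random() >= p else not truth
    return O

def pairwise_comp(vi, vj, T, O):
    """FCount(vi, vj) counts anchors x in T with O(vi, x, vj, x) == Yes;
    answer "Yes" (vi judged closer) when FCount >= 0.3|T| (Lemma 9.1)."""
    fcount = sum(1 for x in T if O(vi, x, vj, x))
    return "Yes" if fcount >= 0.3 * len(T) else "No"

def count_max(S, T, O):
    """Sketch of Algorithm 13: return the candidate in S collecting the
    most "No" answers, i.e., the point judged farthest from the query
    point that the anchors in T surround."""
    def count(vi):
        return sum(1 for vj in S if vj != vi and pairwise_comp(vi, vj, T, O) == "No")
    return max(S, key=count)
```

Note that the query point enters only implicitly, through the anchor set T of points close to it.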
Lemma 9.2. Given a query vertex u and a set T with |T| = Ω(log(n/δ)) such that max_{v∈T} d(u, v) ≤ α, the farthest point identified using Algorithm 13 (with Pairwise-Comp), denoted by u_max, is within 4α distance of the optimal farthest point, i.e., d(u, u_max) ≥ max_{v∈S} d(u, v) − 4α, with a probability of 1 − δ. Further, the query complexity is O(n² log(n/δ)).
Algorithm 13 Count-Max : finds the farthest point by counting in S
1: Input : A set of points S, a query point u, and a set T.
2: Output : An approximate farthest point from u
3: for v_i ∈ S do
4:   Calculate Count(u, v_i, S, T)
5: u_max ← arg max_{v_i∈S} Count(u, v_i, S, T)
6: return u_max
Proof. Let v_max = arg max_{v∈S} d(u, v). Consider a value w ∈ S such that d(u, w) < d(u, v_max) − 4α. We compare the Count values for v_max and w, given by Count(u, v_max, S, T) = Σ_{v_j∈S\{v_max}} 1{Pairwise-Comp(u, v_max, v_j, T) == No} and Count(u, w, S, T) = Σ_{v_j∈S\{w}} 1{Pairwise-Comp(u, w, v_j, T) == No}. We argue that w can never be returned by Algorithm 13, i.e., Count(u, w, S, T) < Count(u, v_max, S, T). Using Lemma 9.1, we have:
Count(u, v_max, S, T) = Σ_{v_j∈S\{v_max}} 1{Pairwise-Comp(u, v_max, v_j, T) == No}
≥ Σ_{v_j∈S\{v_max}} 1{d(u, v_j) < d(u, v_max) − 2α}
= 1{d(u, w) < d(u, v_max) − 2α} + Σ_{v_j∈S\{v_max,w}} 1{d(u, v_j) < d(u, v_max) − 2α}
= 1 + Σ_{v_j∈S\{v_max,w}} 1{d(u, v_j) < d(u, v_max) − 2α}

Count(u, w, S, T) = Σ_{v_j∈S\{w}} 1{Pairwise-Comp(u, w, v_j, T) == No}
≤ Σ_{v_j∈S\{w}} 1{d(u, v_j) < d(u, w) + 2α}
≤ Σ_{v_j∈S\{w,v_max}} 1{d(u, v_j) < d(u, v_max) − 2α}

Combining the two, we have:
Count(u, v_max, S, T) > Count(u, w, S, T)
This shows that the Count of v_max is strictly greater than the Count of any point w with d(u, w) < d(u, v_max) − 4α. Therefore, our algorithm would output v_max instead of w. For calculating the Count of all points in S, we make at most |S|² · |T| oracle queries, as we compare every point with every other point using Algorithm 5. Finally, we output the point u_max as the value with the highest Count. From Lemma 9.1, when |T| = Ω(log(n/δ)), the answer to any pairwise query is correct with a failure probability of δ/n². As there are n² pairwise comparisons, each with failure probability δ/n², by a union bound the total failure probability is δ. Hence, the claim. □
Algorithm 14 Tournament : finds the farthest point using a tournament tree
1: Input : Set of values S, degree λ, query point u, and a set T.
2: Output : An approximate farthest point from u
3: Construct a balanced λ-ary tree 𝒯 with |S| nodes as leaves.
4: Let σ_S be a random permutation of S assigned to the leaves of 𝒯.
5: for i = 1 to log_λ |S| do
6:   for each internal node w at level log_λ |S| − i do
7:     Let B denote the children of w.
8:     Set the internal node w to Count-Max(u, B, T)
9: u_max ← point at the root of 𝒯
10: return u_max
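The tournament structure can be sketched as follows; `select` stands in for Count-Max(u, B, T) and is assumed to return a block's approximate farthest point. The level-by-level tree walk is flattened into repeated grouping rounds, which visits the same blocks as the balanced λ-ary tree.

```python
import random

def tournament(V, fanout, select, rng=None):
    """Sketch of Algorithm 14: assign a random permutation of V to the
    leaves of a balanced fanout-ary tree; each internal node receives the
    winner, under `select`, among its children."""
    rng = rng or random.Random(0)
    players = list(V)
    rng.shuffle(players)  # random leaf assignment
    while len(players) > 1:
        players = [select(players[i:i + fanout])
                   for i in range(0, len(players), fanout)]
    return players[0]  # point at the root
```

With a noise-free selector the winner is the exact farthest point; with a noisy Count-Max the guarantees degrade as analyzed above.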
Let the farthest point from query point u among S be denoted by v_max, and let the set of records for which the oracle answer can be incorrect be given by:
C = {w | w ∈ S, d(u, w) ≥ d(u, v_max) − 2α}
Algorithm 15 Tournament-Partition
1: Input : Set of values S, number of partitions m, query point u, and a set T.
2: Output : A set of farthest points, one from each partition
3: Randomly partition S into m equal parts B_1, B_2, ..., B_m
4: for i = 1 to m do
5:   a_i ← Tournament(u, T, B_i, 2)
6:   M ← M ∪ {a_i}
7: return M
Algorithm 16 Max-Prob : Maximum with Probabilistic Noise
1: Input : Set of values S, number of iterations t, query point u, and a set T.
2: Output : An approximate farthest point u_max
3: i ← 1, M ← ∅
4: Let R denote a sample of size √(nt) selected uniformly at random (with replacement) from S.
5: for i = 1 to t do
6:   M_i ← Tournament-Partition(u, T, S, m)
7:   M ← M ∪ M_i
8: u_max ← Count-Max(u, T, R ∪ M)
9: return u_max
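Algorithms 15 and 16 combine as sketched below, again with `select` standing in for the Tournament / Count-Max subroutines; the parameter choices m ≈ √n and |R| ≈ √(nt) follow the analysis above. This is an illustrative composition, not the paper's code.

```python
import math
import random

def tournament_partition(V, m, select, rng):
    """Sketch of Algorithm 15: randomly partition V into m parts and
    return each part's tournament winner."""
    parts = list(V)
    rng.shuffle(parts)
    size = math.ceil(len(parts) / m)
    return [select(parts[i:i + size]) for i in range(0, len(parts), size)]

def max_prob(V, t, select, rng=None):
    """Sketch of Algorithm 16 (Max-Prob): run t rounds of partitioned
    tournaments, pool the winners M with a uniform sample R of size
    ~sqrt(|V| t), and finish with one Count-Max over R ∪ M."""
    rng = rng or random.Random(0)
    n = len(V)
    m = max(1, math.isqrt(n))                              # number of partitions
    R = [rng.choice(V) for _ in range(math.isqrt(n * t))]  # with replacement
    M = []
    for _ in range(t):
        M.extend(tournament_partition(V, m, select, rng))
    return select(R + M)  # final Count-Max over the pooled candidates
```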
Lemma 9.3. (1) If |C| > √n/2, then there exists a value v_j ∈ R satisfying d(u, v_j) ≥ d(u, v_max) − 2α with a probability of 1 − δ/2.
(2) Suppose |C| ≤ √n/2. Then M contains v_max with a probability at least 1 − δ/2.
Proof. The proof is similar to Lemma 8.7. □
Theorem 9.4 (Theorem 3.10 restated). Given a query vertex u and a set T with |T| = Ω(log(n/δ)) such that max_{v∈T} d(u, v) ≤ α, the farthest point identified using Algorithm 4 (with Pairwise-Comp), denoted by u_max, is within 6α distance of the optimal farthest point, i.e., d(u, u_max) ≥ max_{v∈S} d(u, v) − 6α, with a probability of 1 − δ. Further, the query complexity is O(n log³(n/δ)).
Proof. The proof is similar to Theorem 8.8. In Algorithm 16, we first identify an approximate maximum value using sampling. If |C| ≥ √n/2, then, from Lemma 9.3, the value returned is a 2α-additive approximation of the maximum value of S. Otherwise, from Lemma 9.3, M contains v_max with a probability 1 − δ/2. As we use Count-Max on the set R ∪ M, the value returned, i.e., u_max, is within 4α of the maximum among values from R ∪ M. Therefore, d(u, u_max) ≥ d(u, v_max) − 6α. Using a union bound over n · t comparisons, the total probability of failure is δ.
For the query complexity, Algorithm 15 obtains a set R of √(nt) sample values. Along with the set M obtained (where |M| = nt/m), we use Count-Max on R ∪ M to output the maximum u_max. This step requires O(|R ∪ M|² |T|) = O((√(nt) + nt/m)² log(n/δ)) oracle queries. In an iteration i, for obtaining M_i, we make O(√n |B_i| log(n/δ)) = O(n log(n/δ)) oracle queries (Claim 8.2), and over t iterations, we make O(nt log(n/δ)) queries. Using t = 2 log(2n/δ) and m = √n, in total we make O(nt log(n/δ) + (√(nt) + nt/m)² log(n/δ)) = O(n log³(n/δ)) oracle queries. Hence, the theorem. □
10 π-CENTER : ADVERSARIAL NOISE
Lemma 10.1. Suppose in an iteration t of the Greedy algorithm, the centers are given by S_t, and we reassign points using Assign, which is a β-approximation to the correct assignment. In iteration t + 1, using this assignment, suppose we obtain an α-approximate farthest point using Approx-Farthest. Then, after k iterations, the Greedy algorithm obtains a 2αβ²-approximation for the k-center objective.
Proof. Consider an optimum clustering C* with centers u_1, u_2, ..., u_k and clusters C*(u_1), C*(u_2), ..., C*(u_k), respectively. Let the centers obtained by Algorithm 6 be denoted by S. If |S ∩ C*(u_i)| = 1 for all i, then, for some point x ∈ C*(u_i) assigned to s_j ∈ S by Algorithm Assign, we have:
d(x, S ∩ C*(u_i)) ≤ d(x, u_i) + d(u_i, S ∩ C*(u_i)) ≤ 2 OPT
⟹ d(x, s_j) ≤ β min_{s_l∈S} d(x, s_l) ≤ β d(x, S ∩ C*(u_i)) ≤ 2β OPT
Therefore, every point in V is at a distance of at most 2β OPT from its assigned center in S.
Suppose for some j we have |S ∩ C*(u_j)| ≥ 2. Let s_1, s_2 ∈ S ∩ C*(u_j), where s_2 appeared after s_1, in iteration t + 1. As s_1 ∈ S_t, we have min_{w∈S_t} d(w, s_2) ≤ d(s_1, s_2). In iteration t, we know that the farthest point s_2 is an α-approximation of the true farthest point (say f_t). Moreover, suppose s_2 is assigned in iteration t to the cluster with center s_j, which is a β-approximation of its true center. Therefore,
(1/α) min_{w∈S_t} d(w, f_t) ≤ d(s_j, s_2) ≤ β min_{w∈S_t} d(w, s_2) ≤ β d(s_1, s_2)
Because s_1 and s_2 are in the same optimum cluster, from the triangle inequality we have d(s_1, s_2) ≤ 2 OPT. Combining all of the above, we get min_{w∈S_t} d(w, f_t) ≤ 2αβ OPT, which means that the farthest point of iteration t is at a distance of at most 2αβ OPT from S_t. In the subsequent iterations, the distance of any point to the final set of centers, given by S, only gets smaller. Hence,
max_v min_{w∈S} d(v, w) ≤ max_v min_{w∈S_t} d(v, w) = min_{w∈S_t} d(f_t, w) ≤ 2αβ OPT
However, when we output the final clusters and centers, the farthest point after k iterations (say f_k) could be assigned to a center v_j ∈ S that is a β-approximation of the distance to its true center:
d(f_k, v_j) ≤ β min_{w∈S} d(f_k, w) ≤ 2αβ² OPT
Therefore, every point is assigned to a cluster at distance at most 2αβ² OPT. Hence the claim. □
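The Greedy loop analyzed by this lemma can be sketched as follows, with `assign` and `farthest` as black boxes for the β-approximate Assign and α-approximate Approx-Farthest subroutines. The exact stand-ins shown correspond to α = β = 1 on the line metric and are for illustration only.

```python
def greedy_k_center(V, k, assign, farthest):
    """Sketch of the Greedy k-center loop of Lemma 10.1: in each of k-1
    rounds, (re)assign every point to an (approximately) nearest center,
    then promote the (approximately) farthest point to a new center."""
    centers = [V[0]]  # arbitrary first center
    for _ in range(k - 1):
        assignment = {v: assign(v, centers) for v in V}  # Assign step
        centers.append(farthest(V, assignment))          # Approx-Farthest step
    return centers

# Exact (noise-free) stand-ins on the line metric:
exact_assign = lambda v, centers: min(centers, key=lambda c: abs(v - c))
exact_farthest = lambda V, assignment: max(V, key=lambda v: abs(v - assignment[v]))
```

With exact subroutines this is the classical Gonzalez 2-approximation; the lemma quantifies how the factor degrades to 2αβ² under approximate ones.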
Lemma 10.2. Given a set S of centers, Algorithm Assign assigns a point u to a cluster s_j ∈ S such that d(u, s_j) ≤ (1 + μ)² min_{s_t∈S} d(u, s_t), using O(nk) queries.
Proof. The proof is essentially the same as Lemma 8.3 and uses MCount instead of Count. □
Lemma 10.3. Given a set of centers S, Algorithm 4 identifies, with probability 1 − δ/k, a point v_j such that
min_{s_j∈S} d(v_j, s_j) ≥ max_{v_t∈V} min_{s_t∈S} d(v_t, s_t) / (1 + μ)⁵
Proof. Suppose v_t is the farthest point, assigned to center s_t ∈ S. Let v_j, assigned to s_j ∈ S, be the point returned by Algorithm 4. From Theorem 8.8, we have:
d(v_j, s_j) ≥ max_{v_i∈V} d(v_i, s_i) / (1 + μ)³ ≥ d(v_t, s_t) / (1 + μ)³ ≥ min_{s'_t∈S} d(v_t, s'_t) / (1 + μ)³
Due to the error in assignment, using Lemma 10.2:
d(v_j, s_j) ≤ (1 + μ)² min_{s'_j∈S} d(v_j, s'_j)
Combining the above equations, we have:
min_{s'_j∈S} d(v_j, s'_j) ≥ min_{s'_t∈S} d(v_t, s'_t) / (1 + μ)⁵
For Approx-Farthest, we use m = √n, t = log(2n/δ) and s = √(nt). So, following the proof of Theorem 3.6, we succeed with probability 1 − δ/k. Hence, the lemma. □
Lemma 10.4. Given a current set of centers S:
(1) Assign assigns a point u to a cluster C(s_j) such that d(u, s_j) ≤ (1 + μ)² min_{s_i∈S} d(u, s_i), using O(nk) additional oracle queries.
(2) Approx-Farthest identifies a point w in cluster C(s_j) such that min_{s_j∈S} d(w, s_j) ≥ max_{v_t∈V} min_{s_t∈S} d(v_t, s_t) / (1 + μ)⁵ with probability 1 − δ/k, using O(n log²(n/δ)) oracle queries.
Proof. (1) From Lemma 10.2, we have the claim. We assign a point to a cluster based on the scores the cluster center received in comparison to other centers. Except for the newly created center, we have previously queried every center against every other center. Therefore, the number of new oracle queries made for every point is O(k), which gives a total of O(nk) additional new queries used by Assign.
(2) From Lemma 10.3, we have that min_{s_j∈S} d(w, s_j) ≥ max_{v_t∈V} min_{s_t∈S} d(v_t, s_t)/(1 + μ)⁵ with probability 1 − δ/k. The total number of queries made by Algorithm 4 is O(nt + (nt/m + √(nt))²). For Approx-Farthest, we use m = √n, t = log(2n/δ) and s = √(nt); therefore, the query complexity is O(n log²(n/δ)). □
Theorem 10.5 (Theorem 4.2 restated). For μ < 1/18, Algorithm 6 achieves a (2 + O(μ))-approximation for the k-center objective using O(nk² + nk · log²(n/δ)) oracle queries, with probability 1 − δ.
Proof. From the above-discussed claim and Lemma 10.4, we have that Algorithm 6 achieves a 2(1 + μ)⁹-approximation for the k-center objective. When μ < 1/18, we can simplify the approximation factor to 2 + 18μ, i.e., 2 + O(μ). From Lemma 10.4, in each iteration we succeed with probability 1 − δ/k; using a union bound, the total failure probability is δ. For the query complexity, as there are k iterations, and in each iteration we use Assign and Approx-Farthest, the theorem follows from Lemma 10.4. □
11 π-CENTER : PROBABILISTIC NOISE
11.1 Sampling
Lemma 11.1. Consider the sample R ⊆ V of points obtained by selecting each point with probability 450 log(n/δ)/m. Then we have 400 n log(n/δ)/m ≤ |R| ≤ 500 n log(n/δ)/m, and for every i ∈ [k], |C*(c_i) ∩ R| ≥ 400 log(n/δ), with probability 1 − O(δ).
Proof. We include every point in R with probability 450 log(n/δ)/m, where m is the size of the smallest cluster. Using the Chernoff bound, with probability 1 − O(δ), we have:
400 n log(n/δ)/m ≤ |R| ≤ 500 n log(n/δ)/m
Consider an optimal cluster C*(c_i) with center c_i. As every point is included with probability 450 log(n/δ)/m:
E[|C*(c_i) ∩ R|] = |C*(c_i)| · 450 log(n/δ)/m ≥ 450 log(n/δ)
Using the Chernoff bound, with probability at least 1 − δ/n, we have:
|C*(c_i) ∩ R| ≥ 400 log(n/δ)
Using a union bound over all the k clusters, we have the lemma. □
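The sampling step itself is one line of Bernoulli sampling; a sketch, assuming the smallest optimal cluster size m (or a lower bound on it) is available to the algorithm:

```python
import math
import random

def sample_for_cores(V, m_min, delta, rng=None):
    """Sketch of the sampling of Lemma 11.1: include each point of V
    independently with probability min(1, 450 log(n/delta) / m_min),
    where m_min bounds the smallest optimal cluster size from below."""
    rng = rng or random.Random(0)
    n = len(V)
    q = min(1.0, 450 * math.log(n / delta) / m_min)
    return [v for v in V if rng.random() < q]
```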
11.2 Assignment
ACount(u, c_i, c_j) = Σ_{x∈I(c_i)} 1{O(u, x, u, c_j) == Yes}
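The assignment rule built on ACount can be sketched as follows; the sequential keep-the-current-best sweep is our reading of the assignment step (Algorithm 8), with the 0.3|I(c)| threshold taken from Lemma 11.2, and all names illustrative.

```python
def acount(u, core_i, c_j, O):
    """ACount(u, c_i, c_j): how many core points of c_i the oracle judges
    to be at least as close to u as the rival center c_j is."""
    return sum(1 for x in core_i if O(u, x, u, c_j))

def assign(u, centers, cores, O):
    """Sketch of the assignment step: sweep the centers keeping a current
    best; a challenger c replaces it when ACount against the current best
    clears the 0.3|I(c)| threshold of Lemma 11.2."""
    best = centers[0]
    for c in centers[1:]:
        if acount(u, cores[c], best, O) >= 0.3 * len(cores[c]):
            best = c
    return best
```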
Lemma 11.2. Consider a point u and centers c_i, c_j ∈ S such that d(u, c_i) ≤ d(u, c_j) − 2 OPT and |I(c_i)| ≥ 12 log(n/δ). Then ACount(u, c_i, c_j) ≥ 0.3|I(c_i)| with a probability of 1 − δ/n².
Proof. Using the triangle inequality, for any x ∈ I(c_i):
d(u, x) ≤ d(u, c_i) + d(c_i, x) ≤ d(u, c_j) − 2 OPT + d(c_i, x) ≤ d(u, c_j)
So, O(u, x, u, c_j) is Yes with probability at least 1 − p. We have:
E[ACount(u, c_i, c_j)] = Σ_{x∈I(c_i)} E[1{O(u, x, u, c_j) == Yes}] ≥ (1 − p)|I(c_i)|
Using Hoeffding's inequality, with probability exp(−|I(c_i)|(1 − p)²/2) ≤ δ/n² (using p ≤ 0.4), we have:
ACount(u, c_i, c_j) ≤ (1 − p)|I(c_i)|/2
We have Pr[ACount(u, c_i, c_j) ≤ 0.3|I(c_i)|] ≤ Pr[ACount(u, c_i, c_j) ≤ (1 − p)|I(c_i)|/2]. Therefore, with probability at most δ/n², we have ACount(u, c_i, c_j) ≤ 0.3|I(c_i)|. Hence, the lemma. □
Lemma 11.3. Suppose u ∈ C*(c_i), and for some c_j ∈ S we have d(c_i, c_j) ≥ 6 OPT. Then Algorithm 8 assigns u to center c_i (rather than c_j) with probability 1 − δ/n².
Proof. As u ∈ C*(c_i), we have d(u, c_i) ≤ 2 OPT. Therefore,
d(c_j, u) − d(c_i, u) ≥ d(c_i, c_j) − 2d(c_i, u) ≥ 2 OPT
⟹ d(c_j, u) ≥ d(c_i, u) + 2 OPT
From Lemma 11.2, we have that if d(u, c_i) ≤ d(u, c_j) − 2 OPT, then we will assign u to c_i with probability 1 − δ/n². □
Lemma 11.4. Given a set of centers S, every u ∈ V is assigned to a cluster with center c_i such that d(u, c_i) ≤ min_{c_j∈S} d(u, c_j) + 2 OPT, with a probability of 1 − δ/n.
Proof. From Lemma 11.2, we have that a point u is assigned to c_i over c_j if d(u, c_i) ≤ d(u, c_j) − 2 OPT. If c_i is the final assigned center of u, then, for every c_j, it must be true that d(u, c_j) ≥ d(u, c_i) − 2 OPT, which implies d(u, c_i) ≤ min_{c_j∈S} d(u, c_j) + 2 OPT. Using a union bound over at most n points, we have that with a probability of 1 − δ/n, every point u is assigned as claimed. □
11.3 Core Calculation
Consider a cluster C(c_i) with center c_i. Let n_a^b denote the number of points in the set {x : a ≤ d(x, c_i) < b}. For a point u ∈ C(c_i), define:
Count(u) = Σ_{x∈C(c_i)} 1{O(c_i, x, c_i, u) == No}
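The core computation can be sketched as ranking the cluster points by this Count score and keeping the top few of them; Lemma 11.7 suggests a core size of about 200 log(n/δ). The function below is an illustrative sketch, not the paper's implementation.

```python
def core(center, cluster, O, size):
    """Sketch of core calculation: Count(u) = #{x in C(c_i) :
    O(c_i, x, c_i, u) == No} counts the points the oracle judges farther
    from the center than u; the core I(c_i) keeps the `size` points with
    the highest Count, i.e., those judged closest to the center."""
    def count(u):
        return sum(1 for x in cluster if not O(center, x, center, u))
    return sorted(cluster, key=count, reverse=True)[:size]
```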
Lemma 11.5. Consider any two points u_1, u_2 ∈ C(c_i) such that d(u_1, c_i) ≤ d(u_2, c_i). Then E[Count(u_1)] − E[Count(u_2)] = (1 − 2p) · n_{d(u_1,c_i)}^{d(u_2,c_i)}.
Proof. For a point u ∈ C(c_i):
E[Count(u)] = E[Σ_{x∈C(c_i)} 1{O(c_i, x, c_i, u) == No}] = n_0^{d(u,c_i)} · p + n_{d(u,c_i)}^∞ · (1 − p)
Therefore,
E[Count(u_1)] − E[Count(u_2)]
= (n_0^{d(u_1,c_i)} p + n_{d(u_1,c_i)}^{d(u_2,c_i)} (1 − p) + n_{d(u_2,c_i)}^∞ (1 − p)) − (n_0^{d(u_1,c_i)} p + n_{d(u_1,c_i)}^{d(u_2,c_i)} p + n_{d(u_2,c_i)}^∞ (1 − p))
= (1 − 2p) · n_{d(u_1,c_i)}^{d(u_2,c_i)} □
Lemma 11.6. Consider any two points u_1, u_2 ∈ C(c_i) such that d(u_1, c_i) ≤ d(u_2, c_i) and n_{d(u_1,c_i)}^{d(u_2,c_i)} ≥ √(100 |C(c_i)| log(n/δ)). Then Count(u_1) > Count(u_2) with probability 1 − δ/n².
Proof. Suppose u_1, u_2 ∈ C(c_i). Both Count(u_1) and Count(u_2) are sums of |C(c_i)| binary random variables. Using Hoeffding's inequality, each of the following events holds with probability at most exp(−β²/2|C(c_i)|):
Count(u_1) ≤ E[Count(u_1)] − β/2
Count(u_2) > E[Count(u_2)] + β/2
Using a union bound, with probability at least 1 − 2 exp(−β²/2|C(c_i)|), we can conclude that
Count(u_1) − Count(u_2) > E[Count(u_1) − Count(u_2)] − β = (1 − 2p) n_{d(u_1,c_i)}^{d(u_2,c_i)} − β
Choosing β = (1 − 2p) n_{d(u_1,c_i)}^{d(u_2,c_i)}, we have Count(u_1) > Count(u_2) with probability (for constant p ≤ 0.4)
1 − 2 exp(−(1 − 2p)² (n_{d(u_1,c_i)}^{d(u_2,c_i)})² / 2|C(c_i)|) ≥ 1 − 2 exp(−0.02 (n_{d(u_1,c_i)}^{d(u_2,c_i)})² / |C(c_i)|)
Further, simplifying using n_{d(u_1,c_i)}^{d(u_2,c_i)} ≥ √(100 |C(c_i)| log(n/δ)), the probability of failure is 2 exp(−2 log(n/δ)) = O(δ/n²). □
Lemma 11.7. If |C(c_i)| ≥ 400 log(n/δ), then |I(c_i)| ≥ 200 log(n/δ) with probability 1 − |C(c_i)|² δ/n².
Proof. From Lemma 11.6, if there are √(100 |C(c_i)| log(n/δ)) many points between two points u_1, u_2, then we can identify the closer one correctly. When |C(c_i)| ≥ 400 log(n/δ), we have √(100 |C(c_i)| log(n/δ)) ≥ 200 log(n/δ) points between every point and the point with rank 200 log(n/δ). Therefore, |I(c_i)| ≥ 200 log(n/δ). Using a union bound over all pairs of points in the cluster, we get the claim. □
Lemma 11.8. If x ∈ C*(c_i), then x ∈ C(c_i), or x is assigned to a cluster with center c_j such that d(x, c_j) ≤ 8 OPT.
Proof. If x ∈ C*(c_i), we argue that it will be assigned to C(c_i) whenever the centers are far apart. For the sake of contradiction, suppose x is assigned to a cluster C(c_j) for some c_j ∈ S with d(c_i, c_j) ≥ 6 OPT. We have d(x, c_i) ≤ 2 OPT, and by the triangle inequality:
d(c_i, c_j) ≤ d(c_i, x) + d(c_j, x) ⟹ d(c_j, x) ≥ 4 OPT
However, we know that d(c_j, x) ≤ d(c_i, x) + 2 OPT ≤ 4 OPT from Lemma 11.2. We have a contradiction. Therefore, x is assigned to c_i. Otherwise, if d(c_i, c_j) ≤ 6 OPT, we have d(x, c_j) ≤ d(x, c_i) + d(c_i, c_j) ≤ 8 OPT. Hence, the lemma. □
11.4 Farthest point computation
Let I(c_i) represent the core of the cluster C(c_i); it contains Θ(log(n/δ)) points. We define FCount for comparing two points v_i, v_j using their centers c_i, c_j respectively. If c_i ≠ c_j, we let:
FCount(v_i, v_j) = Σ_{x∈I(c_i), y∈I(c_j)} 1{O(v_i, x, v_j, y) == Yes}
Otherwise, we let FCount(v_i, v_j) = Σ_{x∈I(c_i)} 1{O(v_i, x, v_j, x) == Yes}. First, we observe that the first summation is over |I(c_i)||I(c_j)| many terms and the second over |I(c_i)| many terms, each of which is Ω(log(n/δ)).
Lemma 11.9. Consider two records v_i, v_j in different clusters C(c_i), C(c_j) respectively, such that d(c_i, v_i) < d(c_j, v_j) − 4 OPT. Then FCount(v_i, v_j) ≥ 0.3|I(c_i)||I(c_j)| with a probability of 1 − δ/n².
Proof. We know max_{x∈I(c_i)} d(x, c_i) ≤ 2 OPT and max_{y∈I(c_j)} d(y, c_j) ≤ 2 OPT. For a point x ∈ I(c_i), y ∈ I(c_j):
d(v_j, y) ≥ d(c_j, v_j) − d(c_j, y)
> d(v_i, c_i) + 4 OPT − d(c_j, y)
> d(v_i, x) − d(x, c_i) + 4 OPT − d(c_j, y)
> d(v_i, x)
So, O(v_i, x, v_j, y) is No with a probability p. As p ≤ 0.4, we have:
E[FCount(v_i, v_j)] = (1 − p)|I(c_i)||I(c_j)|
Pr[FCount(v_i, v_j) ≤ 0.3|I(c_i)||I(c_j)|] ≤ Pr[FCount(v_i, v_j) ≤ (1 − p)|I(c_i)||I(c_j)|/2]
From Hoeffding's inequality (with binary random variables), with probability exp(−|I(c_i)||I(c_j)|(1 − p)²/2) ≤ δ/n² (using |I(c_i)||I(c_j)| ≥ 12 log(n/δ), p < 0.4), we have FCount(v_i, v_j) ≤ (1 − p)|I(c_i)||I(c_j)|/2. Therefore, with probability at most δ/n², we have FCount(v_i, v_j) ≤ 0.3|I(c_i)||I(c_j)|. □
In order to calculate the farthest point, we use the ideas discussed in Section 3 to identify the point that has the maximum distance from its assigned center. As noted in Section 3.3, our approximation guarantees depend on the maximum distance of points in the core from the center. In the next lemma, we show that, assuming a bound on the maximum distance of a point in the core (see Lemma 11.8), we can obtain a good approximation for the farthest point.
Lemma 11.10. Let max_{c_i∈S, u∈I(c_i)} d(u, c_i) ≤ α. In every iteration, if the farthest point is at a distance more than (6α + 3 OPT), then Approx-Farthest outputs a (6α/OPT + 3)-approximation. Otherwise, the point output is at most (6α + 3 OPT) away.
Proof. The farthest point output by Approx-Farthest is a 6α-additive approximation. However, the assignment of points to clusters introduces another additive error of 2 OPT, resulting in a total additive error of 6α + 2 OPT. Suppose in the current iteration the distance of the farthest point is β OPT; then the point output by Approx-Farthest is at least β OPT − (6α + 2 OPT) away. So, the approximation ratio is β OPT / (β OPT − (6α + 2 OPT)). If β OPT ≥ 6α + 3 OPT, we have β OPT / (β OPT − (6α + 2 OPT)) ≤ β. As we are trying to minimize the approximation ratio, we set β OPT = 6α + 3 OPT and get the claimed guarantee. □
11.5 Final Guarantees
Throughout this section, we assume that m = Ω(log³(n/δ)/δ) for a given failure probability δ > 0.
Lemma 11.11. Given a current set of centers S with max_{c_i∈S, u∈I(c_i)} d(u, c_i) ≤ α, we have:
(1) Every point u is assigned to a cluster C(c_i) such that d(u, c_i) ≤ min_{c_j∈S} d(u, c_j) + 2 OPT, using O(nk log(n/δ)) oracle queries, with probability 1 − O(δ).
(2) Approx-Farthest identifies a point w in cluster C(c_i) such that min_{c_j∈S} d(w, c_j) ≥ max_{v∈V} min_{c_j∈S} d(v, c_j)/(6α/OPT + 3) with probability 1 − O(δ/k), using O(|R| log³(n/δ)) oracle queries.
Proof. (1) First, we argue that the cores are calculated correctly. From Lemma 11.3, we have that a point u ∈ C*(c_i) is assigned to the correct center c_i. Therefore, all the points from R ∩ C*(c_i) move to C(c_i). As |C(c_i)| ≥ |R ∩ C*(c_i)| ≥ 400 log(n/δ), we have |I(c_i)| ≥ 200 log(n/δ) with a probability 1 − |C(c_i)|² δ/n² (from Lemma 11.6). Using a union bound, all the cores are calculated correctly with a failure probability of Σ_i |C(c_i)|² δ/n² ≤ δ.
For every point, we compare the distance with every cluster center while maintaining a center that is the current closest. From Lemma 11.2, each such query fails with a probability of δ/n². Using a union bound, the failure probability is O(nk δ/n²) ≤ δ. From Lemma 11.2, we have the approximation guarantee.
(2) From Lemma 11.10, we have our claim regarding the approximation guarantee. For Approx-Farthest, we use the parameters t = 2 log(2n/δ), m = √|R|. As we make O(|R| log²(n/δ)) cluster comparisons using Algorithm ClusterComp (for Approx-Farthest), the total number of oracle queries is O(|R| log(n/δ) · log²(n/δ)) = O(|R| log³(n/δ)). Using a union bound, the failure probability is O(δ/k + |R| log²(n/δ)/n²) = O(δ/k). □
Theorem 11.12 (Theorem 4.4 restated). Given p ≤ 0.4, a failure probability δ, and m = Ω(log³(n/δ)/δ), Algorithm 7 achieves an O(1)-approximation for the k-center objective using O(nk log(n/δ) + (n²/m²) · k log²(n/δ)) oracle queries, with probability 1 − O(δ).
Proof. Using a proof similar to Lemma 10.1, we have that the approximation ratio of Algorithm 7 is 4(6α/OPT + 3) + 2. Using α = 8 OPT from Lemma 11.8, we have that the approximation factor is 206. For the first stage, from Lemma 11.11, over all the k iterations the number of oracle queries is O(|R| k log³(n/δ)); using a union bound over the k iterations, the success probability is 1 − O(δ). For the calculation of the cores, the query complexity is O(|R|² k). For assignment, the query complexity is O(nk log(n/δ)). Therefore, the total query complexity is O(nk log(n/δ) + (n/m) · k log⁴(n/δ) + (n²/m²) · k log²(n/δ)) = O(nk log(n/δ) + (n²/m²) · k log²(n/δ)). □
12 HIERARCHICAL CLUSTERING
Lemma 12.1 (Lemma 5.1 restated). Given a collection of clusters C = {C_1, ..., C_r}, our algorithm to calculate the closest pair (using Algorithm 4) identifies C_1 and C_2 to merge according to the single linkage objective such that d_SL(C_1, C_2) ≤ (1 + μ)³ min_{C_i,C_j∈C} d(C_i, C_j), with probability 1 − δ, and requires O(n² log²(n/δ)) queries.
Proof. In each iteration, our algorithm considers a list of (n choose 2) distance values and calculates the closest using Algorithm 4. The claim follows from the proof of Theorem 3.6. □
Using the same analysis, we get the following result for complete linkage.
Lemma 12.2. Given a collection of clusters C = {C_1, ..., C_r}, our algorithm to calculate the closest pair (using Algorithm 4) identifies C_1 and C_2 to merge according to the complete linkage objective such that d_CL(C_1, C_2) ≤ (1 + μ)³ min_{C_i,C_j∈C} d(C_i, C_j), with probability 1 − δ, and requires O(n² log²(n/δ)) queries.
Theorem 12.3 (Theorem 5.2 restated). In any iteration, suppose the distance between a cluster C_j ∈ C and its identified nearest neighbor C_l is an α-approximation of its distance from the optimal nearest neighbor. Then the distance between the pair of clusters merged by Algorithm 11 is an α(1 + μ)³-approximation of the optimal distance between the closest pair of clusters in C, with a probability of 1 − δ, using O(n log²(n/δ)) oracle queries.
Proof. Algorithm 11 iterates over the list of pairs (C_i, C_l), ∀C_i ∈ C, where C_l is the identified nearest neighbor of C_i, and identifies the closest pair using Algorithm 4. The claim follows from the proof of Theorem 3.6. □
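The agglomerative loop around Algorithm 11 can be sketched as follows, with the (noisy) closest-pair selection abstracted behind `closest_pair`; an exact single-linkage stand-in is included for illustration. Both functions are our sketch, not the paper's implementation.

```python
def noisy_hac(points, target_k, closest_pair):
    """Sketch of the agglomerative loop: repeatedly ask a (possibly noisy)
    closest-pair routine which two clusters to merge, until target_k
    clusters remain. `closest_pair(clusters)` returns indices (i, j), i < j."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        i, j = closest_pair(clusters)
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

def exact_single_linkage(clusters):
    """Noise-free stand-in: exact single-linkage closest pair on a line."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]
```

Swapping `exact_single_linkage` for the noisy closest-pair routine of Lemma 12.1 yields the (1 + μ)³-approximate merges analyzed above.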