



The VLDB Journal (2009) 18:1279–1312, DOI 10.1007/s00778-009-0140-7

REGULAR PAPER

Statistical structures for Internet-scale data management

Nikos Ntarmos · Peter Triantafillou · Gerhard Weikum

Received: 21 December 2007 / Revised: 27 January 2009 / Accepted: 9 February 2009 / Published online: 5 March 2009 © Springer-Verlag 2009

Abstract Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. With the work in this paper we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We show how to construct several types of histograms, such as simple Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation of these tools for distributed statistical synopses, and report on a comprehensive experimental performance evaluation, assessing our contributions in terms of efficiency, accuracy, and scalability.

Keywords Distributed information systems · Data management over peer-to-peer data networks · Distributed data synopses

N. Ntarmos · P. Triantafillou (B) R.A. Computer Technology Institute and Computer Engineering and Informatics Department, University of Patras, Rio, Greece e-mail: [email protected]

N. Ntarmos e-mail: [email protected]

G. Weikum Max-Planck-Institut für Informatik, Saarbrücken, Germany e-mail: [email protected]

1 Introduction

In the last decade, we have witnessed a proliferation of global data management applications that involve a large number of nodes spread across the Internet. Prominent examples are:

– Peer-to-peer (P2P) file sharing in overlay networks like Gnutella or BitTorrent [82]. These networks have millions of users (peers) that provide storage and bandwidth for searching and fetching files, and they exhibit a high degree of churn with users joining and leaving at high rates.

– Grid-based sharing of scientific data [33] such as virtual observatories in astronomy or models and experiments on biochemical networks in life sciences. Here the data typically needs a much higher degree of consistency than in P2P applications, and calls for advanced querying.

– Distributed data analysis over structured records as part of data-fusion applications over relational, XML, or RDF sources [53,84]. A grand challenge along these lines could be to perform real-time analysis of Internet-traffic data in order to combat (and ideally prevent) attacks, spam, and other anomalies [40]. This class of applications comes with a rich repertoire of query types including relational operators like join and group-by [31].

All these Internet-scale application areas have a need for distributed aggregation queries over a large number of network nodes which can be dynamically selected by a filter predicate. In many cases, the aggregates have a statistical nature, and can thus sometimes be estimated with sufficient accuracy and without computing the full, exact result; this paradigm is known as approximate query processing (AQP) [2,3,12,41,50]. For example, in P2P file sharing, one may be
interested in estimating the total number of distinct files that the entire network has available at a given point—a CountDistinct aggregation. In e-science, the total number of X-ray spectra for triple star systems available anywhere in a Grid network could be an interesting measure, requiring a distributed Count operation. For the analysis of Internet-traffic logs, the total amount of bytes transferred to clients in a particular range of IP addresses can be computed by a distributed Sum operation. Last but not least, the advanced join-and-group queries over structured data sources typically require choosing a low-cost query execution plan, and this query optimization in turn relies on sufficiently accurate statistical estimators for the selectivity of query predicates (i.e., the cardinality of intermediate results). If the data itself is widely distributed, we thus face the problem of having to compute histograms and other statistical synopses over a large number of network nodes, each of which holds some data fragment.

In centralized settings, the statistics management for aggregation queries, selectivity estimation, and approximate query processing has reached a fairly mature state. For distributed data, however, the issues are much less understood. And for Internet-scale, widely decentralized settings, building distributed statistical synopses and computing accurate estimates for the aforementioned kinds of queries still poses major challenges. This paper takes a first step towards addressing these challenges, and providing solutions to some of the involved issues.

1.1 Problem formulation

The general framework within which our solution is envisaged to operate is the following. We consider a networked data system of cooperating peer nodes, built over a structured P2P overlay. The network consists of a possibly large number of nodes, which collectively form the system's infrastructure. These nodes contribute and/or store data items and are thus involved in operations such as computing synopses and building histograms. In general, queries do not affect all nodes. In particular, aggregation queries compute aggregation functions over data sets dynamically determined by a filter predicate of the query. The relevant data items are stored in unpredictable ways in a subset of all nodes. Further, the networked data management system is not single-purpose: it is intended to provide a large number of "data services" concurrently. That is, a large number of different data sets are expected to exist, stored at (perhaps overlapping) subsets of the network. And relevant queries and synopses may be built and used over any of these data sets.

Our target design is depicted in Fig. 1. We assume operation in a structured overlay, such as either traditional or newer locality-preserving Distributed Hash Table (DHT) overlays.

[Fig. 1 layer diagram: Aggregate and Histogram Inference Engines (with Queries and Auto-Config inputs) operating over Equi-Width, Average-Shifted, Equi-Depth, Compressed(V,F), and MaxDiff(V,F) Histograms; below them, the Distributed Synopsis Implementation (FreeDHS), the DHT Overlay, TCP/IP – UDP/IP, and the Physical Network. The upper layers collectively form FreeSHADE.]

Fig. 1 Building blocks of our target system: all algorithms presented in this work, collectively coined FreeSHADE [29], are built on top of our distributed synopsis layer, FreeDHS [27,68], operating on top of a DHT, which in turn is a network overlay above the IP network

Data stored in this P2P network are assumed to be structured1 in relations. Each such relation R consists of (k + l) attributes or columns, R(a1, . . . , ak, b1, . . . , bl); attributes ai are used as single-attribute indices of the tuples of R, while no index information is kept for attributes bi. Each attribute ai is characterized by its value domain ai · D : {ai · vmin, ai · vmax}, consisting of the minimum and maximum values of the attribute. Every tuple t in R is uniquely identified by a primary key t.key. This key can be either one of the attributes of the tuple, or can be calculated otherwise (e.g. based on the values of a combination of its attributes). The set of nodes on which a data tuple t is stored is defined in one of three ways:

1. The tuple is replicated at the nodes in the P2P overlay dictated by its key t.key and by each of the t.ai attributes, using hash functions of the DHT infrastructure.

2. Alternatively, one can store just one instance of the data tuple, for example at the node responsible for t.key, and store pointers to this node at the other locations.

1 The notions of structure and relations are not as strict as they are in centralized database management systems. For example, all MP3 files in the system can be thought of as belonging to a relation, since they are all annotated using a predefined set of attributes, such as "artist", "title", "album", etc. In essence, in such a setting relations are akin to "namespaces".


3. Last, there is also the scenario of only the node contributing a given tuple actually storing a complete instance of the latter, and nodes dictated by t.key and t.ai storing just pointers to this one node.

In all three cases, we can efficiently identify all nodes holding tuples that match a single-attribute filter predicate on any of the indexed attributes ai.
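To make the placement concrete, the following minimal Python sketch (our illustration; the helper names and the choice of SHA-1 are assumptions, not prescribed by the paper) derives the DHT keys under which a tuple, or pointers to it, would be placed—one key for t.key and one per indexed attribute value t.ai:

    import hashlib

    def dht_key(value: str) -> int:
        # assumption: the DHT's base hash is SHA-1, as in Pastry/Chord-style overlays
        return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big")

    def storage_keys(tuple_key: str, indexed_values: list[str]) -> list[int]:
        """Keys under which a tuple (or a pointer to it) is placed: one for t.key and
        one per indexed attribute value t.a_i; storage options 1-3 above differ only
        in whether a full copy or a pointer is stored at these locations."""
        return [dht_key(tuple_key)] + [dht_key(v) for v in indexed_values]

A node answering a single-attribute filter predicate would then simply look up dht_key(predicate value) to reach the node(s) holding matching tuples or pointers to them.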

In [68] we introduced Distributed Hash Sketches (DHS), a distributed counting mechanism extending hash sketches [24,26]. The solution proposed there supports efficient counting in a decentralized and load-balanced way, both in a duplicate-sensitive and a duplicate-insensitive fashion. In this work we use DHS as a building block for our solutions for decentralized aggregate query processing and histogram creation and use, while also presenting and comparing alternative designs and implementations for creating distributed synopses.

The key desiderata that an acceptable solution should satisfy are:

D1 Efficiency: the number of nodes that need to be contacted for query-answering must be small in order to enjoy small latency and bandwidth requirements;

D2 Scalability and availability, seemingly contradicting the efficiency goal: notwithstanding the goal of minimizing the number of involved nodes, in bad cases an arbitrarily large number of nodes may be involved (e.g., when counting with a non-selective filter or when adding elements to multiple multisets), which dictates the need for a truly decentralized solution, avoiding single-point-based scalability, bottleneck, and availability problems;

D3 Access and storage load balancing: query-answering and related overheads should be distributed fairly across all nodes; this should pertain to both the cost of inserting items in the overlay and the cost of disseminating data synopses to interested nodes;

D4 Accuracy: tunable, highly accurate estimation of statistical synopses, with robustness to network dynamics (churn) and failures;

D5 Ease of integration: special-purpose indexing structures, and their required extra (routing) state to be maintained by nodes, should be avoided;

D6 Duplicate (in)sensitivity when counting: the proposed solution must be able to count both the total number of items as well as the number of unique items in multisets. For example, counting the total number of distinct MP3 songs in a network where each node holds some subset of songs and popular songs are held by many nodes, or the number of all songs whose title contains some specific, popular word (e.g. love). These aggregations are query-driven and involve a variable, a priori unknown number of nodes.

These desiderata suggest a number of design decisions. First, we have chosen to build our solutions over a DHT overlay, solely using DHT primitives and the DHT paradigm. The choice of a DHT overlay is made primarily because of its scalability and efficiency in storing and locating data items of interest, and also because of mature solutions for handling network dynamics [77,82]. Second, our proposed solution should require no extra structures and associated (routing) state that needs to be maintained. For example, we would be hesitant to introduce additional multicast trees into the system (other than the mechanisms that are provided by the DHT anyway), as these would require maintenance of additional routing information. Third, we wish to contribute a statistics maintenance infrastructure that should be thought of as the counterpart of (a part of) catalog information in centralized environments, in the following sense: if a query is posed in the system and there exist relevant data in the catalog, then that data will be readily used for processing and optimization of that query; otherwise, the query is executed without any optimization step, since the relevant catalog data is too expensive to compute at query run-time. Fourth and last, we opt for a single DHT-based solution, to be utilized for concurrently providing many data services, answering multiple aggregation queries and building multiple histograms. This goes a long way towards reducing the complexity associated with such large networked data system infrastructures and, in turn, towards understanding and maintaining the system better.

We will first leverage basic statistical structures (such as hash sketches and distributed hash sketches) to compute high-precision estimates of many known aggregate queries in a P2P data system. We will further exploit these structures as building blocks in order to scalably and efficiently produce and utilize higher-level statistical structures, such as various types of histograms, for query optimization purposes. Our hope is that, by doing so, it becomes feasible to harness the wealth of research results produced for query processing and optimization in centralized and distributed database systems and port it into the P2P realm.

1.2 Contributions

Query optimization in database management systems (DBMSs) relies on precomputed statistics and related data synopses such as histograms or sketches (see, e.g., [15,44,58,59] and further references given there). With this work, we make the first step towards an in-depth treatment of developing and maintaining such statistical information in a decentralized fashion that is appropriate for large-scale distributed data management systems. We show how to perform decentralized aggregate query processing and how to construct, maintain, and use in a decentralized fashion several types of histograms. Our implementation and extensive performance
evaluation, on the one hand, testify that the proposed algorithms and structures enjoy efficiency, scalability, and accuracy and, on the other, help bring to the surface related trade-offs.

Our specific contributions are:

1. Algorithms for computing important aggregates, such as COUNT-DISTINCT, COUNT, SUM, and AVG.

2. Algorithms for constructing, maintaining, and utilizing several histogram types, including Equi-Width, Average-Shifted, and Equi-Depth histograms, satisfying the aforementioned design desiderata.

3. A full implementation of the above algorithms for decentralized aggregate query processing and histogram construction, use, and maintenance. Specifically, our implementation—coined FreeSHADE for "Statistics, Histograms, and Aggregates in a DHT-based Environment"—is carried out over our distributed synopsis implementation, built on top of the open-source FreePastry [28] overlay network. Our software is itself available to the community to test, validate, and use [29].

4. We contribute a comprehensive performance evaluation of our algorithms in terms of statistical estimation errors, hop-count efficiency, network bandwidth requirements, scalability, and load distribution fairness among network nodes. For comparison, we have additionally implemented a rendezvous-based solution.

We have developed an arsenal of basic building blocks (basic aggregates and histogram types) and provide fully distributed implementations for each of them, believing that more elaborate queries and statistical structures can stem from this thread of research. Our histogram-related contributions in this paper first create distributed Equi-Width histograms, utilizing the DHS distinct-value estimator structure, and then proceed to infer more elaborate histogram types based on the Equi-Width histograms. We stress that this work claims neither that creating histograms over a distinct-value estimator is the best possible approach, nor that our approach for inferring these complex histograms is the best way to do this. Rather, the focus of this paper is on the DHS-based architectural principles and composability properties towards our design desiderata D1 through D6.

1.3 Outline

The rest of this paper proceeds as follows. Section 2 outlines the foundational preliminaries and related work that this paper is based on, including a query optimization primer, overviews of histograms and hash sketches, and query processing and optimization in peer-to-peer data systems. Section 3 discusses rendezvous-based approaches to computing hash sketches in a distributed manner, and leverages DHS as a fully decentralized implementation of hash sketches. Sections 4 and 5 discuss ways to compute basic aggregates in a P2P setting and methods of computing various types of histograms. Section 6 presents our implementation and experimental performance evaluation of the proposed solutions. Section 7 concludes the paper.

2 Preliminaries and related work

This work strives to bridge the gap between traditional query processing and optimization and P2P data management systems. To do so, we leverage tools from both the peer-to-peer and centralized world. In this section we present a short overview of traditional and P2P query processing and optimization and related statistical structures and techniques.

2.1 Query optimization primer

DBMSs depend on statistics for efficient query processing, typically encompassing various data aggregates, sketches of base data, and histograms. Such statistics are used in many ways throughout the lifetime of a query, either as part of the query optimizer logic—for example, to optimize the access paths for single-relation queries [79], to calculate the selectivity of predicates [21,46,73,74], or to determine the optimal order of predicate evaluation in multi-predicate queries [45,72]—or as high-quality samples of the base data in approximate query answering systems [2,3,71]. Even if exact answers are sought, such quick approximate answers may be desirable as feedback to the user prior to execution of a large and time/resource-consuming query [41].

System R [79] was among the first to use cost-based optimization. It maintained in the system catalogs such statistics as the number of tuples and data pages per relation, the number of data pages and distinct values per index, and the ratio of data pages per segment that hold information for any given relation. The cost of candidate access paths was then estimated using a set of formulas with static factors and the above statistics as input. Later on, the industry turned to histograms as a more accurate and compact way of summarizing information about stored data. [44] and [74] give a taxonomy and a brief history of the evolution of histograms, while [52] presents a survey of distributed query processing in the pre-P2P era.

Chaudhuri [11] identified the following information as necessary for an optimizer to reach an informed decision: (1) the number of tuples in a relation, (2) the number of physical pages used by a table, (3) statistical information for table columns, in the form of histograms, minimum and maximum (or second-lowest and second-highest) values, and the number of distinct values in the column, and (4) information on the correlations among attribute values, in the form of either
multi-dimensional histograms (which have the disadvantage of growing very big with the number of dimensions) or, for multi-column indices, a single-dimensional histogram on the leading column plus the total count of distinct combinations of column values present in the data. In the following sections we shall discuss techniques to compute such statistics in a P2P setting.

Sketches, histograms, aggregates, and data synopses in general have found uses in many other fields of computer science. Research on data streams [17,18,35,66] has turned to such synopses of base data to alleviate large data transfers and ease storage and processing requirements on stream processors. Histograms and sketches have also been proposed as a means of decreasing the amount of data transmitted—and, consequently, the amount of power consumed—by nodes of sensor networks [16,85]. Along the same line of thought, distributed systems such as publish/subscribe systems [10,87], distributed web proxy caches [25], and peer-to-peer web search engines [55,61,62] have all used statistical synopses to quickly and efficiently exchange information about data stored on the various nodes in the system.

2.2 Histograms

Histograms are by far the most common technique used by commercial databases as a statistical summary and an approximation of the distribution of values in base relations. For a given attribute/column, a histogram is a grouping of attribute values into "buckets", whose collective frequency is approximated by statistics maintained in each such bucket.

All histograms make some basic assumptions concerning the distribution of items in each bucket. One of the most prominent such assumptions is the uniform spread assumption [74]. According to it, values in a bucket are assumed to exist only at the points of equal spread (equal to the bucket average), and to have a frequency equal to the ratio of the frequency of the cell over the number of distinct values in it. In order to calculate this, one needs to store the number of distinct attribute values along with the minimum and maximum values per bucket.
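As a concrete illustration (a minimal sketch of ours, not taken from the paper), the per-value frequency estimate implied by the uniform spread assumption can be computed from the per-bucket statistics as follows:

    def uniform_spread_estimate(bucket_freq, distinct_vals, v_min, v_max):
        """Under the uniform spread assumption, the distinct values of a bucket are
        assumed to lie at equally spaced points in [v_min, v_max], each carrying
        bucket_freq / distinct_vals tuples."""
        spread = (v_max - v_min) / max(distinct_vals - 1, 1)
        points = [v_min + i * spread for i in range(distinct_vals)]
        per_value_freq = bucket_freq / distinct_vals
        return points, per_value_freq

    # e.g. a bucket with 120 tuples, 4 distinct values, and value range [10, 40]
    # yields assumed values 10, 20, 30, 40 with an estimated 30 tuples each.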

The most basic histogram variant—the Equi-Width histogram—partitions the attribute value domain into cells (buckets) of equal spread and assigns to each the number of tuples with an attribute value within the cell's boundaries. Relevant research has focused on improving the statistical properties of histograms, with many interesting results. [74] present a taxonomy of histograms along with several novel histogram types, and take an extensive look into their accuracy and efficiency with regard to both construction time and storage space requirements. In the following sections, we shall discuss tools, protocols, and techniques to build and compute several types of histograms. Namely, apart from plain Equi-Width histograms, we deal with Average-Shifted Equi-Width and Equi-Depth histograms, described shortly.

"Average Shifted Histograms" or ASHs [78] pose an interesting alternative to simple Equi-Width histograms. An ASH consists of a set of Equi-Width histograms with the same number of buckets and the same bucket spread but different starting points. The frequency of each value in a bucket is then computed as the average of the estimations given by each of these histograms. ASH—in essence a kernel estimator—can provide a much smoother approximation of the actual distribution of tuple values compared to plain Equi-Width histograms.
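The following toy sketch (our own illustration, not the construction of [78]) conveys the idea: several equi-width histograms over the same data, shifted against each other by a fraction of the bucket width, are averaged to estimate the frequency at a point.

    def ash_estimate(data, x, bucket_width, num_shifts):
        """Average Shifted Histogram estimate of the frequency at value x:
        average the equi-width bucket counts of num_shifts histograms whose
        bucket boundaries are offset by bucket_width / num_shifts from each other."""
        estimates = []
        for s in range(num_shifts):
            offset = s * bucket_width / num_shifts
            lo = ((x - offset) // bucket_width) * bucket_width + offset  # bucket containing x
            estimates.append(sum(1 for v in data if lo <= v < lo + bucket_width))
        return sum(estimates) / num_shifts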

Equi-Depth histograms [67,80] have been widely used in commercial database management systems. They consist of a partitioning of the attribute value domain into disjoint intervals, such that the number of data tuples whose attribute value falls in each such interval is (almost) equal among all intervals. That is, in an Equi-Depth histogram all buckets have equal frequencies but not (necessarily) equal spreads.

2.3 Hash sketches

Hash sketches were first proposed by Flajolet and Martin in [26] under the name of Probabilistic Counting with Stochastic Averaging, or PCSA, as a means of estimating the number of distinct items in a multiset D of data in a database,2 building on the counting algorithm of [65]. The estimate obtained is (virtually) unbiased, while the authors also provide upper bounds on its standard deviation. The only assumption underlying hash sketches is the existence of a pseudo-uniform hash function h() : D → [0, 1, . . . , 2^L)—an assumption also present in most (if not all) P2P-related research. Durand and Flajolet presented a similar algorithm [24], coined superLogLog counting, which reduced the space complexity and relaxed the assumptions on the statistical properties of the hash function of [26].3 Hash sketches have been used in many application domains where counting distinct elements in multisets is of some importance, such as approximate query answering in very large databases [54], data mining on the Internet graph [69], and stream processing [22,30]. The selection of such sketches is dictated by the fact that related data synopsis techniques relying on a global ordering of (hashed) data—such as, for example, count-min sketches [17] and the distinct value estimators of [7]—cannot be implemented in a completely decentralized manner following the DHT paradigm, without resorting to such techniques as gossiping, multi-broadcasting, or a rendezvous-based approach, all of which have undesirable properties in our setting, as we shall see shortly.

2 For a survey of distinct-value estimators, see [7].
3 The analysis leading to the equations used in this section is well beyond the scope of this paper and can be found in [24,26].


Fig. 2 Inserting items into a hash sketch: single bitmap case

Fig. 3 Inserting items into a hash sketch: multiple bitmaps case

A hash sketch consists of a bit vector B[·] of length L, with all bits initially set to 0, and a hash function h() as above. Let ρ(y) : [0, 2^L) → [0, L) be the position of the least significant (leftmost) 1-bit in the binary representation of y; that is, ρ(y) = min{k ≥ 0 : bit(y, k) ≠ 0} for y > 0, and ρ(0) = L, where bit(y, k) denotes the kth bit in the binary representation of y (bit-position 0 corresponds to the least significant bit). In order to estimate the number n of distinct elements in a multiset D, we apply ρ(h(d)) to all d ∈ D and record the results in the bitmap vector B[0 . . . L − 1] (see Fig. 2). Since h() distributes values uniformly over [0, 2^L), it follows that P(ρ(h(d)) = k) = 2^(−k−1). Thus, when counting elements in an n-item multiset, B[0] will be set to 1 approximately n/2 times, B[1] approximately n/4 times, etc. This fact is rather intuitive: imagine all n possible L-bit numbers; the least significant bit (bit 0) will be 1 for half of them (odd numbers); of the remaining n/2 numbers, half will have bit 1 set, or n/4 overall, and so on.

Then, the quantity R(D) = max_{d ∈ D} ρ(h(d)) provides an estimation of the value of log(n), with an additive bias of 1.33 and a standard deviation of 1.87. Thus, 2^R estimates n "logarithmically", within 1.87 binary orders of magnitude. However, the expectation of 2^R is infinite and thus cannot be used to estimate n. To this end, [24] propose the following technique (similar to the stochastic averaging technique in [26]): (1) use a set of m = 2^c different B⟨i⟩[·] vectors (c being a non-negative integer), each resulting in a different R⟨i⟩ estimate, (2) for each element d, select one of these vectors using the first c bits of h(d), and (3) update the selected vector and compute R⟨i⟩ using the remaining bits of h(d) (see Fig. 3).
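A minimal Python sketch of this insertion path (our own illustration; the "first c bits" are taken here to be the low-order bits of h(d), which is an assumption): the first c bits select one of the m = 2^c bitmaps, and ρ of the remaining bits selects the position to set.

    def rho(y, L):
        # position of the least significant 1-bit of y; rho(0) = L by convention
        if y == 0:
            return L
        return (y & -y).bit_length() - 1

    def insert(bitmaps, h_d, c, L):
        """Insert an item with hash value h_d into a set of m = 2**c hash-sketch bitmaps."""
        i = h_d & ((1 << c) - 1)           # first c bits select the bitmap
        r = rho(h_d >> c, L)               # rho of the remaining bits selects the position
        bitmaps[i][min(r, L - 1)] = 1      # rho(0) = L falls outside the bitmap; clamp in this toy sketch
        return i, r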

If M⟨i⟩ is the (random) value of the parameter R for vector i, then the arithmetic mean (1/m) · Σ_{i=1..m} M⟨i⟩ is expected to approximate log(n/m) plus an additive bias. The estimate of n is then computed by the formula E(n) = α_m · m · 2^((1/m) · Σ_{i=1..m} M⟨i⟩), where α_m = (−m · (2^(−1/m) − 1) / log(2) · ∫_0^∞ e^(−t) · t^(−1/m) dt)^(−m) [24]. The authors further propose a truncation rule, consisting of taking into account only the m0 = ⌊θ0 · m⌋ smallest M values. θ0 is a real number between 0 and 1, with θ0 = 0.7 producing near-optimal results. With this modification, the estimate formula becomes E(n) = α_m · m0 · 2^((1/m0) · Σ* M⟨i⟩), where Σ* indicates the truncated sum and the modified constant α_m ensures that the estimate remains unbiased (see Fig. 4). The resulting estimate has a standard deviation of 1.05/√m, while the hash function must have a length of at least H0 = log(m) + ⌈log(n_max/m) + 3⌉, n_max being the maximum cardinality estimated.
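For illustration only, a small Python sketch of the estimator (ours, not the authors' code): given the per-bitmap values M⟨i⟩ and a constant alpha (the exact α_m and its truncated variant must be taken from [24]; the asymptotic value ≈ 0.397 below is a mere placeholder), the truncated estimate is computed as:

    import math

    ALPHA_M = 0.397  # placeholder; the exact (modified) alpha_m constants come from [24]

    def superloglog_estimate(M, theta0=0.7, alpha=ALPHA_M):
        """Truncated superLogLog estimate: average the m0 smallest M<i> values
        and scale by alpha * m0 * 2**(mean of the truncated M values)."""
        m0 = max(int(math.floor(theta0 * len(M))), 1)
        truncated = sorted(M)[:m0]
        return alpha * m0 * 2 ** (sum(truncated) / m0)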


Fig. 4 Counting items using a hash sketch

PCSA counting. The algorithm in [26] is based on the same hashing scheme (i.e. using ρ(·)) and the same observations as [24]. The PCSA algorithm differs from the superLogLog algorithm in the following: (1) [26] rely on the existence of an explicit family of hash functions exhibiting ideal random properties, while [24] have relaxed this assumption; (2) [26] set R to be the position of the leftmost 0-bit in the bitmap B[·], as opposed to the position of the rightmost 1-bit in the bitmap with [24]; (3) [24] use in the order of log log(max cardinality) bits per bitmap, while [26] need in the order of log(max cardinality) bits per bitmap; (4) the estimation in [26] is computed as E(n) = (1/0.77351) · m · 2^((1/m) · Σ_{i=0..m−1} M⟨i⟩); and (5) the bias and standard error of [26] are closely approximated by 1 + 0.31/m and 0.78/√m, respectively. Note that the data insertion algorithm is the same for both [24] and [26] (with the sole difference of the assumptions on the hash function).

Hash sketches exhibit a natural distributivity; the hash sketch of the union of any number of sets can be computed from the hash sketches of these sets by a bitwise OR of the corresponding bit vectors, given that all of them have the same number of bit vectors and length, and that they have been built using the same set of hash functions. Thus, if an initial set A is spread across several hosts (e.g. across a P2P network), one can compute the global hash sketch for A from each of the locally computed hash sketches corresponding to the subset of A that each peer is responsible for.
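A one-line illustration of this property (our sketch; sketches are represented as lists of bitmaps, which must share m, L, and the hash functions):

    def merge_sketches(sketch_a, sketch_b):
        # bitwise OR of corresponding bitmaps yields the sketch of the union of the two sets
        return [[a | b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(sketch_a, sketch_b)]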

2.4 Summation sketches

Considine et al. [16] have introduced the notion of "summation sketches", building on hash sketches [24,26]. Assume again we have a multiset D = {d1, d2, d3, . . .} of data items di = (ki, vi), each identified by a key ki and bearing a value vi. Then, the distinct summation problem consists of computing the quantity S = Σ_{distinct (ki, vi) ∈ D} vi.

The basic idea is to model data item values as a series of item insertions. That is, in order to estimate the sum of the values of two distinct data items d1 = (k1, v1) and d2 = (k2, v2), with k1 ≠ k2, the algorithm in [16] proceeds as follows: first, (1) insert v1 distinct items in a first hash sketch; then (2) insert v2 distinct items in a second hash sketch; (3) take the bitwise OR of the resulting sketches; finally, (4) use the standard PCSA or superLogLog estimator on the combined sketch to compute the actual sum result. This technique takes O(vi) expected time to add a data item di = (ki, vi) to the summation sketch. Obviously, this does not scale well for large values of vi. [16] address this by emulating the vi insertions. The proposed method consists of two steps:

1. First, set the lowest δi = ⌊log(vi) − 2 log log(vi)⌋ of the summation sketch bits to all ones, since these bits are all set to one with high probability after vi insertions (see the proof of Theorem 2 in [26]).

2. Simulate the insertions that set bits δi and higher in thehash sketch.

An item di sets a given bit position p ≥ δi if and only if bit(h(di), j) = 0 for all 0 ≤ j < p, which happens with probability 2^(−p). This means that, for a set of vi insertions, the number of insertions setting bits above a position p follows a binomial distribution with parameters vi and 2^(−δi). Thus, in order to add an item di = (ki, vi), the authors advocate first drawing a random sample y from B(vi, 2^(−δi)), considering each of these y insertions as having reached bit δi, and then using the classic hash sketch insertion process to set the remaining bits beyond δi.
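The following Python sketch (our own single-bitmap illustration, with a clamped δi and illustrative helper names) emulates the vi insertions as described: the low δi bits are set directly, and a binomial draw decides how many emulated insertions reach bit δi, each of which is then extended geometrically.

    import math
    import numpy as np

    def add_value(bits, v, rng=None):
        """Emulated insertion of a value v into a single summation-sketch bitmap."""
        if rng is None:
            rng = np.random.default_rng()
        L = len(bits)
        if v <= 0:
            return
        # delta ~ floor(log(v) - 2*log(log(v))), clamped to [0, L) to keep the toy code safe
        log_v = math.log2(v)
        delta = int(max(0, min(L - 1, math.floor(log_v - 2 * math.log2(max(log_v, 2))))))
        for p in range(delta):
            bits[p] = 1                          # these bits are set w.h.p. after v insertions
        y = rng.binomial(v, 2.0 ** -delta)       # number of insertions that reach bit position delta
        for _ in range(y):
            p = delta
            while p < L - 1 and rng.random() < 0.5:
                p += 1                           # each further position is reached with probability 1/2
            bits[p] = 1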


In order to take advantage of multiple bitmaps in the hash sketch without losing accuracy, the above algorithm is modified as follows. First, each value vi is transformed to qi · m + ri, for some integers qi and ri with 0 ≤ ri < m. Then, in order to add such an item to the summation sketch, [16] first add ri distinct items once, as in standard PCSA, and then add qi to each bitmap independently. The time cost of this insertion algorithm is in O(m log2(vi/m)), while the result has the same precision guarantees as that of standard PCSA and/or superLogLog counting.

2.5 Statistics in P2P systems for aggregate queries and histograms

The peer-to-peer research corpus has already begun to investigate ways of providing DBMS functionality over P2P data networks [34]. Most prominent such systems, mainly targeted by this work, are built on top of Distributed Hash Tables (DHTs). Distributed Hash Tables are a family of structured peer-to-peer network overlays exposing a hash-table-like interface. The main advantage of DHTs over unstructured P2P networks lies in the strict theoretical probabilistic (in the presence of node failures and network dynamics) performance guarantees offered by the former. Prominent examples of traditional DHTs include Pastry [23], CAN [75], Chord [83], Kademlia [60], etc.

DHTs offer two basic primitives: insert(key, value) and lookup(key). Nodes are assigned unique identifiers from a circular ID space—usually computed as the hash (SHA-1, MD5, etc.) of their IP address and the port number on which the P2P application is operating—and arranged according to a predefined geometry and distance function [36]. This results in a partitioning of the node-ID space among nodes, so that each node is responsible for a well-defined set (arc) of identifiers. Each item is also assigned a unique identifier from the same ID space—usually by simply feeding the item to the same cryptographic hash function used to generate the node IDs—and is stored at the node whose ID is closest to the item's ID, according to the DHT's distance function. Each node in an N-node DHT maintains direct IP links (aka fingers) to O(log(N)) other nodes at appropriate positions in the overlay, as dictated by the DHT's geometry, so that routing between any two nodes takes O(log(N)) hops in the worst case.4
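Schematically, a DHT client can be thought of as exposing roughly the following interface (an illustrative Python sketch; the method names are ours and do not correspond to any particular DHT implementation):

    import hashlib
    from typing import Any

    class DHT:
        """Schematic DHT client: keys and node IDs live in the same circular ID space."""
        def key_of(self, obj: bytes) -> int:
            return int.from_bytes(hashlib.sha1(obj).digest(), "big")   # e.g. SHA-1-based IDs

        def insert(self, key: int, value: Any) -> None:
            ...  # route to the node whose ID is closest to key (O(log N) hops) and store value there

        def lookup(self, key: int) -> Any:
            ...  # route to the responsible node and return the stored value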

In general, due to the pseudo-random output of cryptographic hash functions, each DHT node will be responsible for storing a (possibly random) subset of the attribute values—and hence (pointers to) tuples—of each relation. This holds regardless of the specific DHT employed and has grave implications on the efficiency of several query types. For example, aggregate or range queries may introduce a messaging overhead that is in O(N) in an N-node network. It is this challenge that this paper endeavors to meet, in an effort to make efficient and scalable peer-to-peer query processing and optimization feasible.

4 All log(·) notation refers to base-2 logarithms.

Distributed counting/aggregation solutions proposed by the peer-to-peer research corpus so far can be categorized into one of the following groups:

1. Rendezvous-based protocols,
2. Gossip-based protocols,
3. Broadcast/convergecast-type protocols, also known as aggregation-tree approaches,
4. Sampling-based protocols.

Rendezvous-based protocols. The first type of solution is also the first that comes to mind when using a structured overlay (DHT): select a node in the overlay (e.g. by using the hash function(s) of the DHT overlay) and use it to maintain the aggregate value (e.g. see the distributed counting mechanism outlined in [26]). Hash-partitioned counters—where the counting space is partitioned into disjoint intervals, with each such interval mapped to a (set of) node(s) in the overlay—or "coordinator"-based solutions, abundant in sensor networks and distributed data stream processing [18,39], where summaries of data are gathered at a central aggregation point to be processed, also fall in this category. Solutions of this type suffer from potential shortcomings regarding our design desiderata D1 through D6. Having a central aggregation node means that this node will be contacted on every update of, and on every query for, the current value of the aggregate, potentially violating (D2). Moreover, each of these central aggregation nodes withstands a high access and storage load, violating (D3), while one can argue that such highly loaded nodes will exhibit high response times, also violating (D1). Using a (fixed) number of rendezvous nodes does not really solve the problem but merely shifts the scalability issues to the cost of inserting items to and/or querying the value of such an aggregate, as these costs grow linearly with the number of rendezvous nodes engaged in the computation, while also violating (D1). Similar arguments hold for the case when multiple rendezvous nodes have to be contacted as the result of simultaneously maintaining multiple aggregates (i.e. counting multiple quantities at the same time). Despite the theoretical and practical advantages of the other types of solutions to be discussed shortly, rendezvous-based solutions are by far the most popular in real-world implementations, mainly due to their simplicity and excellent hop-count performance in the single-aggregate case; for example, IETF's Service Location Protocol,5

5 http://www.ietf.org/rfc/rfc2608.txt.


Skype,6 iTunes,7 TiVo,8 the Asterisk PBX,9 as well as any protocol relying on super-peers to function, all use some sort of rendezvous-based protocol to bring together resources and coordinate access to them. For this reason, we have chosen to compare our proposal against variants of the rendezvous-based solution.

6 http://www.skype.com/.
7 http://www.apple.com/itunes/.
8 http://www.tivo.com/.
9 http://www.asterisk.org.

Gossip-based protocols. The second type of solutions, based on gossiping (e.g. [4,48,49,51,63,64]), usually provides probabilistic semantics of "eventual consistency" for its outcome; gossip-based protocols are based on an iterative procedure, according to which every node exchanges information with a (set of) its neighboring node(s) on every iteration. Eventual consistency means that, in the presence of failures and dynamicity in the P2P overlay, the algorithm will eventually converge to a stable state after the overlay has itself stabilized. Although the bandwidth requirements of these approaches are low when amortized over all nodes, the overall bandwidth consumption and hop count are usually very high. Moreover, the fact that all nodes have to actively participate in a gossip-based computation, even if it is of no interest to them, coupled with the multi-round property of these solutions, violates (D1) and (D2), while their semantics violates (D4). Of course, in unstructured overlays it is not clear if one can do better than this anyway. All in all, gossiping solutions for aggregation are completely decentralized; however, they are best suited for an environment where simple aggregations can take place and small amounts of extra data can be passed along (piggybacked) with regular messaging behavior between neighbors. In a DHT-based infrastructure, on the other hand, approaches like the one advocated in this paper are much more advantageous.

Aggregation-tree-based protocols. The third type of solutions [5,6,16,56,81,85,88,89] is based on a two-round procedure: (1) a broadcast phase, during which the querying node broadcasts a query through the network, creating a (virtual) tree of nodes as the query propagates in the overlay; and (2) a convergecast phase, during which each node sends its local part of the answer, along with answers received from nodes deeper down the tree, to its "parent" node. Solutions that are based on pre-built tree structures also belong in this group. Of these works, Astrolabe [88] was among the first to talk of aggregation in the peer-to-peer landscape; the authors proposed the creation and maintenance of a hierarchical, tree-like overlay, used to propagate complex queries and their results through the peer-to-peer overlay. A similar idea has been proposed in [89]. Bawa et al. [5] propose building a (set of) multicast overlay tree(s) to propagate queries and results back and forth, while using flood-like methods to send messages around the network. Although these structures have nice properties and are capable of computing aggregates at a wide scale, they are not directly applicable to the creation and maintenance of multiple simultaneous counters/aggregates. First, they require the creation and maintenance of a separate (possibly extra) network overlay, thus violating the desired property (D5). Second, similarly to rendezvous-based approaches, the cost of maintaining multiple counters at the same time grows linearly with the number of such counters (e.g. when having to maintain a different tree per counter). Furthermore, even with just one counter, if the number of nodes containing items to be counted is in O(N), then the counting cost (total hop count and number of messages) is also in O(N), thus violating (D1). Moreover, even if the load is balanced, functionality is not; although all intermediate nodes communicate with the same number of neighbors (that is, if the tree is full), nodes closer to the root of the tree are more "important" than leaf nodes, thus violating (D2) and (D3). The time for computing an aggregate is in O(log N) if all queries go out in parallel across links between parent and children nodes (that is, if the tree is balanced), but this statement obscures the fact that the total number of messages is in O(N). In brief, our solution can also achieve O(log N) if we send out queries in parallel. However, the total number of messages is a better metric as far as resource consumption (e.g., network bandwidth and per-node processor load) and scalability are concerned; in this regard, our solutions do much better than the O(N) performance of aggregation trees.

Sampling-based protocols. The core idea of the last type of solutions [8,57] is to estimate the value of the counter in question by selectively querying (sampling) a set of nodes in the network. Bharambe et al. [8] attempt to compute approximate histograms of system statistics by using random sampling of nodes in the network. Manku [57] estimates the number of nodes in the overlay by also using a random sampling algorithm. First, there is no obvious way to generalize these techniques to count arbitrary quantities (other than the ones they were designed for). Second, with data tuples arbitrarily replicated across the network overlay and with a possibly highly skewed tuple popularity distribution, simplistic random selection schemes will lead to highly biased samples of the actual dataset. This is caused by the fact that sampling-based techniques are known to suffer when duplicates exist in the base data [14,38], thus conflicting with both desiderata (D4) and (D6). On the other hand, if the sample is big enough to guarantee a certain level of confidence, then these solutions may fall short of satisfying (D1). Even if distributed sampling were practical, arguments similar to those presented earlier for rendezvous and tree-based approaches hold for the reuse of sampled data for future queries. In general, we discern
two categories: pre-computed synopses-based versus online approaches. To our knowledge, with the exception of our DHS-based approach, there has not been a method for storing or utilizing such pre-computed synopses in a decentralized manner in distributed environments. In Sect. 5, we will adopt a two-step procedure to create and maintain complex histograms. Keen readers may suggest borrowing algorithms and constructions from online histograms [32,86]. However, online histogram algorithms rely heavily on sampling and are thus impractical for the reasons mentioned above.

In summary, each of the above families of protocols has its specific strengths and weaknesses. By and large, the strengths are more congruent with unstructured overlay networks, whereas this paper focuses on DHT-based structured overlays. Moreover, as we anticipate the need to compute aggregations and other statistical estimates in combination with filter predicates of the posed queries, we demand that we can efficiently identify the subset of network nodes that hold the base statistics for such dynamically restricted aggregations. None of the above four approaches is well suited for this requirement. Hybrid methods that combine different elements of several of the above paradigms, or modifications of these methods, are conceivable as well, but would entail new research and thus fall outside the scope of the current paper.

3 Distributing data synopses

In the following sections, we will present various distributed algorithms and protocols for computing aggregates and histograms, to be used in distributed query optimization. The common denominator of all these is the need for a distributed synopsis infrastructure. Assume we wish to compute a hash sketch for a set A distributed across a peer-to-peer data network. We identify two major directions of attacking this problem: (1) the "conservative" but popular rendezvous-based approach, and (2) the completely decentralized and highly scalable way of DHS, in which no node has any special functionality.

3.1 The rendezvous approach

The rendezvous approach is what one would call the natural evolution of client–server architectures in the distributed world of peer-to-peer networks. Nodes storing items belonging to the set A first compute a rendezvous ID (for example, by feeding "A" to the underlying DHT's base hash function). Then, they compute locally the synopsis of choice and send the outcome to the node whose ID is closest to the above ID (called the "rendezvous node"). The rendezvous node is responsible for combining the individual synopses (by bitwise OR) into the global synopsis for A. Interested nodes can then acquire the global synopsis for A by querying the rendezvous node. As is obvious, the message cost for a node to "insert" an item into this distributed synopsis, as well as the cost for a node to acquire the global synopsis, is in O(log(N)).
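A minimal sketch of this scheme (ours; it reuses the schematic DHT interface from Sect. 2.5 and an ad-hoc message format, both of which are assumptions rather than part of the paper):

    import hashlib

    def rendezvous_key(set_name: str) -> int:
        # the rendezvous ID is derived by feeding the set's name to the DHT's base hash function
        return int.from_bytes(hashlib.sha1(set_name.encode()).digest(), "big")

    def publish_local_synopsis(dht, set_name: str, local_sketch):
        # each contributing node sends its locally built sketch to the rendezvous node,
        # which ORs it into the global synopsis for the set (O(log N) routing hops)
        dht.insert(rendezvous_key(set_name), ("OR_INTO_SYNOPSIS", local_sketch))

    def fetch_global_synopsis(dht, set_name: str):
        return dht.lookup(rendezvous_key(set_name))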

Although clean, simple, and highly efficient in terms of hop count, this solution suffers from two major scalability issues. First, the rendezvous node is burdened with disproportionally higher communication cost than the rest of the nodes. In the worst case, where all (or a large portion of) the nodes in the system store items from A—or a large portion of the nodes wish to acquire the global synopsis—the rendezvous node will have to withstand O(N) incoming connections and the corresponding bandwidth overhead.

The "easy way out" is to use multiple rendezvous nodes per distributed synopsis so as to spread the load among them. This can happen in one of two ways: either (1) nodes spread their items across rendezvous nodes at insertion and maintenance time (proactive replication), but pay the extra hop-count cost of having to visit every one of them to reconstruct the complete synopsis, or (2) they always update all rendezvous nodes on insertion, thus paying this extra cost during insertions, but only need to visit a single rendezvous node to acquire the complete synopsis. Second, along the same lines, in order to compute the global synopses for multiple sets, one has to contact as many rendezvous nodes as there are sets. That is, the overall message cost for computing multiple synopses grows linearly with the number of synopses. However, for the sake of completeness, we shall not drop this approach, since it appears to be quite popular in the relevant literature.

3.2 Distributed hash sketches

In [68], we presented a DHT-based implementation of hash sketches, coined Distributed Hash Sketches (or DHS). DHTs already feature a pseudo-uniform hash function; node and document IDs are (usually) computed as either the secure hash of some object-specific piece of information [23,83] (e.g. the IP address and port of nodes, the content for files, etc.), or as the outcome of a pseudo-uniform random number generator [60]. In both cases, the resulting ID is an L-bit pseudo-uniform number (for some fixed, system-specific L), thus satisfying the main assumption of hash sketches. We denote by k ≤ L the length of the DHS bitmap vectors and assume that items are added to the DHS using the k lower-order bits of their corresponding DHT keys. In [68] we discussed techniques and protocols to implement both PCSA and superLogLog counting within DHS. In this work we use only the latter, for clarity of presentation, but we would like to note that the solutions discussed in this work are also applicable to the PCSA-based DHS. We first discuss the case with a single bitmap vector and a single
estimated quantity, extending our design for multiple vectors and multiple quantities later.

Fig. 5 Distributed hash sketches: (a) mapping of bit positions to nodes – single bitmap; (b) mapping of bit positions to nodes – multiple bitmaps; (c) counting items using a DHS

3.2.1 Mapping DHS bits to DHT nodes

The core observation is that higher bit positions in the hash sketch bitmaps are set with exponentially decreasing frequency. We thus proposed to also partition the node ID space into consecutive disjoint regions of exponentially decreasing size, and to associate every bit position with the region whose size corresponds to the bit's frequency. That is, we partition the node ID space, [0, 2^L), into k consecutive, non-overlapping intervals Ir = [thr(r), thr(r − 1)), r ∈ [0, k), where thr(r) = 2^(L−r−1). Using this partitioning, bit r of B[·] is mapped to node IDs randomly (uniformly) chosen from Ir (bit k is mapped to the interval [0, thr(k − 1))). Remember that when counting distinct items in an n-object multiset, bit r of the bitmap vector is "visited" n · 2^(−r−1) times. With the k-bit IDs used in DHS, this translates to a maximum of 2^k distinct objects in any possible multiset, or to a maximum of 2^(k−r−1) objects being mapped to position r in the bitmap vector. Now, note that the intervals Ir have exponentially decreasing sizes |Ir| = 2^(L−r−1). The above results in a distribution of information across all nodes in the network that is as uniform as the hash function used. With this partitioning scheme, bit position 0 is mapped to the first half of the ID space, bit position 1 is mapped to the next quarter of the ID space, bit 2 to the next eighth, and so on (Fig. 5a). The mapping of bit positions to ID-space regions is such that the bit-position-to-region correspondence is the same for all bitmaps of the distributed hash sketch—that is, different bitmaps coincide in the ID space (see Fig. 5b).
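A small sketch of this mapping (ours; L is the ID length and k the bitmap length, as above):

    import random

    def thr(r, L):
        # lower boundary of interval I_r = [thr(r), thr(r-1)); thr(-1) is the top of the ID space
        return 2 ** L if r < 0 else 2 ** (L - r - 1)

    def region_for_bit(r, k, L):
        """ID-space region responsible for bit position r: I_r = [thr(r), thr(r-1)) for r in [0, k),
        while position k absorbs the remaining low IDs [0, thr(k-1))."""
        if r >= k:
            return 0, thr(k - 1, L)
        return thr(r, L), thr(r - 1, L)

    def random_id_for_bit(r, k, L):
        # every operation on bit r targets a uniformly random ID inside the bit's region
        lo, hi = region_for_bit(r, k, L)
        return random.randrange(lo, hi)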

3.2.2 DHS data insertion

Assume a node wishes to add an item to the DHS. First itinserts the item to a temporary local hash sketch instance.This operation consists of taking the item’s ID as producedby the DHT’s base hash function, applying the ρ(·) function,and setting the corresponding bit in the local sketch. For anobject o with ID o.id, we first compute r = ρ(lsbk(o.id)),where lsbk(·) returns the k lower-order bits of its argument;then we select a random ID in the interval [thr(r), thr(r−1))

in the ID space, corresponding to bit position r set during the previous step, and send a "set-to-1" message to the node responsible for that ID. This in essence means that the mapping of bits to nodes within the bit's specified region is done in a randomized fashion: every time an operation (set a bit to 1, check the current bit's value, etc.) is to be performed on a bit, a random ID is chosen uniformly from the corresponding ID-space region and the operation is carried out by the node responsible for that ID. The DHT overlay guarantees, with high probability, that the message will reach the target node in O(log(N)) hops in the worst case. Each DHS tuple is of the form <metric_id, bit, time_out>, where metric_id is an identifier uniquely identifying the metric¹⁰ to be estimated, bit = r denotes the position in the distributed vector of the bit that is to be set, and time_out defines a time-to-live interval for the current tuple, reset at every update of the tuple, allowing DHS entries to age out.

When multiple (m) bitmaps are used, the item insertion is performed in the exact same manner as in the single-bitmap case, only now selecting one out of m vectors using

10 For the time being, we will assume that there is only one metric estimated in the overlay; counting multiple metrics at once (also called "multi-dimensional counting") will be discussed later.


lsb_k(o.id) mod m, and then using r = ρ(lsb_k(o.id) div m)

as the position of the bit to be set. The DHS tuple data must now be extended to <metric_id, vector_id, bit, time_out>, where vector_id is the ID of the vector being updated. The worst-case hop-count cost for a node to insert an item in such a DHS is in O(log N)—that is, independent of the number of bitmaps—since each insertion only touches a single bitmap.
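A minimal local sketch of this target selection under the multi-bitmap scheme follows; the names are ours, routing of the resulting "set-to-1" message is left to the underlying DHT, and L ≤ 62 is assumed so that longs suffice.

    import java.util.concurrent.ThreadLocalRandom;

    // Sketch of selecting the DHS target for one item insertion (names are ours).
    public final class DhsInsertTarget {
        final int L;   // bits in the DHT ID space
        final int k;   // bits per DHS bitmap, k <= L
        final int m;   // number of bitmaps

        DhsInsertTarget(int L, int k, int m) { this.L = L; this.k = k; this.m = m; }

        // rho(y): position of the least-significant 1-bit of y, as in hash
        // sketches; the all-zero corner case is pinned to the last position here.
        int rho(long y) {
            return (y == 0) ? k - 1 : Long.numberOfTrailingZeros(y);
        }

        // Returns {vectorId, bitPosition, targetDhtId} for the DHT key of object o.
        long[] target(long objectKey) {
            long key = objectKey & ((1L << k) - 1);              // lsb_k(o.id)
            long vectorId = key % m;                             // which bitmap to update
            int r = rho(key / m);                                // bit position to set
            long low = 1L << (L - r - 1);                        // thr(r)
            long high = (r == 0) ? (1L << L) : (1L << (L - r));  // thr(r - 1)
            long targetId = ThreadLocalRandom.current().nextLong(low, high);
            return new long[] { vectorId, r, targetId };
        }
    }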

When multiple items are to be inserted, nodes can either iterate over the itemset repeating the above algorithm one item at a time, or they can first insert all these items to a local hash sketch and then send a "set-to-1" message to only one node per bit position/ID-space region. The latter operation is carried out by (1) first constructing a message containing the bits to be set, then (2) sending it to a random node in the region corresponding to the least-significant set bit in the message, and (3) iterating the last step for the next higher set bit in the message. For I items per node, setting on average log(I) bits in the local hash sketch, the per-node average worst-case insertion hop-count cost becomes O(log(I) · log(N)).

3.2.3 DHS data maintenance

The maintenance of the DHS bits stored at a node can be provided using either a "soft-state" approach, whereby the bits are periodically refreshed by the nodes responsible for setting them, or using an "explicit-update" approach, whereby bits are set until explicitly deleted by the responsible nodes. Our contributions here are orthogonal with respect to the above decision. Both alternatives can be supported. For concreteness, we discuss one alternative, based on the soft-state approach.

Remember that a time-to-live value is stored along with every piece of information; data items are then deleted if not updated within this time period, so deleting an item incurs no extra cost. The computation of this time_out field poses an interesting trade-off. Larger time-out values will result in fewer updates per time unit needed to keep the DHS up-to-date. On the other hand, a smaller value will allow for faster adaptation to abrupt fluctuations in the value of the estimated metric, but will incur a higher maintenance cost as far as (primarily) network resources are concerned. However, we have to point out once again that the per-node bandwidth and storage requirements of DHS are very low, thus even a high update rate may translate to negligible bandwidth consumption.

3.2.4 Counting with DHS

Just like in the insertion phase, in order to check the state of a bit position, the querying node first computes a random ID in the region corresponding to the probed bit position and then sends a "probe" message to the node responsible for

that ID. The worst-case message cost for this operation is also in O(log(N)). Furthermore, the fact that the ID-space region corresponding to a given bit position is the same for all bitmaps allows the querying node to acquire state information for the probed bit position for any and all bitmaps and metrics with just a single operation. This design achieves a balanced storage and maintenance load across all nodes in the system.

In [68], superLogLog counting using a populated DHS is performed by starting from the most significant bit region of the ID space and probing nodes in regions corresponding to successively less significant bit positions, until all bitmaps have at least one 1-bit (see Fig. 5c). For PCSA hash sketches, counting proceeds in the opposite direction: starting from the least significant bit region of the ID space and probing nodes in regions corresponding to successively more significant bit positions, until at least one 0-bit has been located for every bitmap. Probed nodes respond with a <metricID, bit-vector> tuple (metricID corresponds to the quantity being estimated). For an m-bitmap DHS, bit-vector is an m-bit vector with bit positions corresponding to the value of the probed bit for each of the bitmaps in the DHS. The worst-case message cost for this procedure is in O(L · log(N)), for L bits per hash sketch bitmap and N nodes in the P2P overlay, independently of the number of bitmaps, items, or dimensions.

We have slightly modified the above counting procedures to perform counting in a recursive manner, since recursive routing has proved to be more efficient than iterative routing in real-world systems [19,77], and in order to use caching of the probe results along the path of recursion to deal with some of the shortcomings of the techniques in [68]. In more detail, in both the superLogLog and PCSA cases, counting proceeds from the least-significant bit position to bit positions of increasingly higher significance (depicted later in Fig. 7). Suppose a random node wishes to execute the counting procedure for a given metric. Counting begins at the querying node with an empty hash sketch (i.e. a hash sketch whose bits are all set to an "undefined" value) and proceeds as follows:

1. The current node first checks its local cache for a full hash sketch for the desired metric.

2. If such a sketch is found:

(i) The sketch is returned to the previous node in the path (or to the user if this is the querying node).

(ii) The target node caches the received sketch and recurses to step (i).

3. Else:

(a) The current node computes the hash sketch based on its local data and merges it with the hash sketch retrieved from the previous node (if any).


(b) If the merged hash sketch is complete (i.e. for PCSA counting, there is at least one 0-bit in every bitmap of the sketch, while for superLogLog counting, there is a bit position which is set to 0 for all bitmaps in the sketch):¹¹

– The node caches the final result, and goes to step 2(i).

(c) Else:

– The node computes a random ID in the region for the next higher bit position (or bit position 0 if this is the querying node), and

– Sends a probe for the desired metric, along with the merged hash sketch, to the node responsible for that ID.

Note that since the algorithm visits at most L regions and it takes O(log(N)) hops in the worst case to go from one region to the next during the forward phase (propagation of the probe) of the algorithm, the resulting worst-case hop count complexity is again in O(L · log(N)).
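As a purely local illustration of steps 3(a) and 3(b), the following sketch merges two partial hash sketches by bitwise OR and tests the two completeness conditions quoted above. The bitmap representation and names are ours: a partial sketch is taken to be m bitmaps of L bits each, one long per bitmap, with L ≤ 64.

    import java.util.Arrays;

    // Local sketch of merging partial hash sketches and testing completeness.
    public final class SketchMerge {
        // Bitwise-OR merge of two partial sketches of the same dimensions.
        static long[] merge(long[] a, long[] b) {
            long[] out = Arrays.copyOf(a, a.length);
            for (int i = 0; i < out.length; i++) out[i] |= b[i];
            return out;
        }

        // PCSA-style completeness: every bitmap contains at least one 0-bit.
        static boolean completeForPcsa(long[] bitmaps, int L) {
            long full = (L == 64) ? -1L : ((1L << L) - 1);
            for (long bm : bitmaps)
                if ((bm & full) == full) return false;   // this bitmap has no 0-bit yet
            return true;
        }

        // superLogLog-style completeness: some bit position is 0 in all bitmaps.
        static boolean completeForSuperLogLog(long[] bitmaps, int L) {
            long full = (L == 64) ? -1L : ((1L << L) - 1);
            long anySet = 0L;
            for (long bm : bitmaps) anySet |= bm;        // positions set in at least one bitmap
            return (anySet & full) != full;              // some position is 0 everywhere
        }
    }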

3.2.5 Compensating for random selection errors

We have further examined analytically the error introduced by the above-mentioned randomized distribution of bit information to multiple nodes, and have presented a trade-off between accuracy, availability, and efficiency. In brief, the randomized algorithms used when setting bits and when checking their values in the DHS introduce a probability of probing nodes not storing any bit data during the query phase. In order to deal with this, when a node answers back with an empty result set, we retry the probe on its immediate neighbors in the overlay, for up to a precomputed number of times. In order to compute this threshold, assume that n′ items have been uniformly distributed to N′ bins (i.e. mapped to an N′-node interval in the DHS). The counting process corresponds to uniformly and independently picking a bin from the set of bins without replacement, and checking whether there is any item stored in it. The probability P(X = t) that t empty bins are selected in the first t probes equals:

P(X = t) = ((N′ − t)/N′)^n′ (see [68] for a sketch of the proof).

By solving this equation for t, and taking into account that with multiple (m) bitmaps items are quasi-uniformly partitioned among them, we get that, in order to choose a non-empty bin with probability of at least p, one has to visit at least ⌈N′ · (1 − p^(m/n′))⌉ nodes. Empirically we have found

that, when there are as many items as there are nodes in the

11 For superLogLog counting, although there may be more 1-bits after an all-0 bit position, we have empirically verified that (1) the probability that this happens is negligible, and (2) in the rare opposite case, these bit positions are always among those omitted during the truncation phase of the superLogLog algorithm.

overlay, up to 5 nodes have to be visited in the worst case in order to guarantee that a non-empty node will be found with a probability of at least 99%, if such a node does exist (i.e. if the bit probed has actually been set during the insertion phase). Last, these are all 1-hop operations and the number of nodes visited during the retry phase can be considered a small constant.
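For illustration only, the retry bound quoted above can be computed locally as in the following sketch; the variable names are ours, and N′ here denotes the number of nodes in the probed region.

    // Sketch of the retry bound: number of nodes to probe in an nPrime-node
    // region holding itemsPrime items spread over m bitmaps, so that a non-empty
    // node is found with probability at least p (per the bound quoted above).
    public final class RetryBound {
        static int nodesToVisit(long nPrime, long itemsPrime, int m, double p) {
            return (int) Math.ceil(nPrime * (1.0 - Math.pow(p, (double) m / itemsPrime)));
        }
    }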

3.2.6 Simultaneously counting multiple quantities

As noted earlier, when counting, one probes a node ID in a region for its bit information for all bitmaps and all metrics of interest, and the probed node responds with a sequence of tuples, one for each estimated dimension/quantity. Thus, counting in multiple dimensions (i.e. estimating multiple quantities) has the same worst-case hop-count complexity as counting in a single dimension. Such quantities can be, for example, the frequencies of the cells of a DHS-based histogram, to be discussed shortly, thus allowing a node to gather all cell frequencies in just O(L · log(N)) hops.

4 Decentralized aggregate query processing for P2P data management

We move on now to discuss ways of computing certain basic aggregates in a P2P data management system.

COUNT-DISTINCT

Both rendezvous-based hash sketches and DHS are directly applicable for estimating the number of (distinct) items in a multiset. Assume we wish to be able to estimate the number of distinct values in a column C of a relation R stored in our Internet-scale data management system. In the rendezvous-based scenario, the algorithm proceeds as follows: (1) nodes storing tuples of R first hash together the identifiers of R and C to compute the rendezvous ID; (2) each such node adds the relevant values it stores into a locally populated synopsis, and (3) sends its synopsis to the node responsible for the rendezvous ID. That node will then be responsible for gathering all synopses and constructing (via bitwise OR) the overall synopsis. Any node will then be able to ask the rendezvous node for the estimated cardinality of R.C (see Fig. 6). The worst-case number of messages to populate the overall synopsis is in O(N · log(N)) in an N-node network—translating to an O(b · N · log(N)) bandwidth usage figure, for b bytes per local synopsis—while querying the populated synopsis needs O(log(N)) messages.
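As a concrete illustration of step (1), the rendezvous ID can be obtained by hashing the relation and column identifiers together; the minimal sketch below assumes SHA-1 (as commonly used for DHT IDs) and uses our own helper names.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch: derive a rendezvous ID for the synopsis of column R.C by hashing
    // the relation and column identifiers together (helper names are ours).
    public final class RendezvousKey {
        static BigInteger rendezvousId(String relation, String column) {
            try {
                MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
                byte[] digest = sha1.digest(
                        (relation + "." + column).getBytes(StandardCharsets.UTF_8));
                return new BigInteger(1, digest);   // non-negative 160-bit ID
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }

The node responsible for this ID in the overlay then acts as the rendezvous node for R.C and OR-merges the local synopses it receives.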

Note that, when counting distinct items, a synopsis-based rendezvous solution is much more efficient than the simple approach of having nodes send lists of tuple-ID/value pairs they store to a rendezvous node; the rendezvous node would


Fig. 6 Example of a rendezvous-based approach: node 48 is designated as the rendezvous node and all nodes with tuples for attribute1 send it their data

Fig. 7 Example of a DHS-based approach: nodes insert their tuples in the DHS and the load is spread across the overlay

need space proportional to the actual estimated quantity to be able to discern duplicates, while hash sketches considerably trim this figure down (roughly to the order of the logarithm of the measured quantity); a similar argument also holds for the bandwidth used to transfer the lists to the rendezvous node.

In the DHS-based case, nodes storing tuples of R insert them into the DHS, by:

1. Hashing them on their C value (C.vi) and inserting them to a local hash sketch, as described in Sect. 3.2,

2. Selecting a random ID in the interval corresponding to the bit set during the previous step, and

3. Sending a "set-to-1" message to the node responsible for this ID (see Fig. 7).

Assume the DHS uses L-bit bitmaps. Then, the worst-case message count cost to populate the DHS is in O(N · L · log(N)), which translates to an O(b′ · N · L · log(N)) overall bandwidth consumption, with b′ bytes per "set-to-1" message. In a real-world scenario, such a message would merely include the identifier of the measured quantity (i.e. the "metric ID" per the lingo of [68]) and the bitmap and bit position to set. Querying the populated DHS requires O(L · log(N))

messages in the worst case. Comparing the two approaches, we can see that populating

the DHS-based estimator needs a factor of L more messages and a factor of L more overall bandwidth¹² in the worst case, while querying it also requires a factor of L more messages in

12 Divide by N to get the corresponding cost figures per node, as opposed to overall costs.


the worst case, compared to the rendezvous-based estimator. Note, however, that: (1) in the rendezvous-based approach, the rendezvous node receives all the load of counting and constitutes a single point of failure and performance bottleneck, while the DHS-based estimator is completely decentralized and balanced with regard to per-node load, and (2) when estimating multiple quantities, the hop-count cost of populating and querying the DHS-based estimator remains unchanged, while the relevant costs of the rendezvous solution grow linearly with the number of estimated quantities.

COUNT

In order to count the number of items in a column including duplicates, we would proceed as above, only adding the tuple IDs to the corresponding synopsis (hash sketch or DHS) instead of the values of the column in question. Furthermore, if identical tuples stored on different nodes should also be accounted for, nodes hash together the node and tuple IDs and use the result as the input to the synopsis insertion procedures.

Node cardinality

One of the metrics used in the optimizer of System-R, and a very important one when it comes to actually executing a query, is the number of data pages per relation or per index. This metric defines the number of secondary storage accesses that the query processor will have to make when doing a full scan of the relation/index. We propose an analogy between P2P overlay nodes and secondary storage disk blocks. In the traditional, centralized database world, the number of secondary storage accesses is the main cost component of query access plans. Furthermore, it is generally known that disk I/O delays are mainly due to seek time. In the distributed, peer-to-peer setting, nodes can be thought of as the counterpart of disk blocks: routing to a node in the network overlay corresponds to a seek in a disk subsystem, while accessing items on a node corresponds to a disk block access. As a matter of fact, by P2P norms, the communication overhead dominates by far all local processing costs, to the extent that any local operation may safely be (and actually usually is) ignored in the cost formulas.

In order to estimate the number of nodes that store tuples of a given relation, each node storing tuples for the relation/column in question inserts its ID into a global synopsis. Since node IDs are unique across the P2P overlay, this operation is equivalent to a COUNT-DISTINCT aggregate over an imaginary column populated with the IDs of nodes storing data for the relation in question. As such, we can use any of the techniques discussed earlier (i.e. hash sketches with rendezvous nodes or the DHS) to estimate this quantity.

When applied to a whole relation R, the "node cardinality" metric corresponds to the number of nodes in the system storing tuples of R. One can easily think of important cases where this computation may prove useful. For example, in later sections we shall show how to compute histograms over attributes in a relation stored in an Internet-scale data management system. In order to reach an informed decision, a possible optimizer will need, in addition to the attribute-value frequency distribution histograms, further information as to how many nodes store tuples belonging to a given histogram cell or having a given attribute value. By building node cardinality histograms, one can answer such questions using standard statistical techniques from the relevant literature. Such statistics, coupled with a method for computing overlaps between the data sets of various nodes (such as [61]), will greatly improve the efficiency and quality of query processing in such distributed settings.

SUM and AVG

In order to compute the SUM aggregate, we follow the algorithm in [16]: each node locally computes the sum of the values of the column tuples it stores, populates a local hash sketch using the algorithm in [16], and then either posts this hash sketch to the rendezvous node in the former case, or sends batches of "set-to-1" messages to the appropriate nodes on the DHS in the latter case. For the rendezvous case, the message count, bandwidth consumption, and query overhead figures are obviously the same as those discussed previously. For the DHS case, note that the O(N · L · log(N)) figure also holds here since, with L-bit bitmaps, at most L bits can be set. Therefore, the cost analysis is the same as that of earlier aggregates. Similarly, computing the average (AVG) of the values of a column consists of estimating the SUM and COUNT of the column and then taking their ratio.

5 Decentralized histogram creation and use for P2P data management

In this section, we shall attempt to harness the tools discussed so far and present techniques to implement Average Shifted Equi-Width histograms, thus improving the accuracy of the approach in [68], and a method to construct Equi-Depth histograms, over an Internet-scale data management system. We also discuss the case of extending our approach for computing more complex histogram types, such as MaxDiff(V, F) and Compressed(V, F) histograms.

Equi-Width histograms

The core idea is to partition the attribute value domain into a fixed number of equally sized intervals/buckets (say B)


and use a different distributed synopsis "metric" for each bucket to estimate the number of data items belonging to it. This construct corresponds to a distributed Equi-Width histogram (coined DEWH) over the given attribute. If using a DHS-based distributed synopsis, this technique requires O(N · L · log(N)) overall messages in the worst case for all nodes to insert all of their items in the DHS and O(L · log(N))

hops to reconstruct the histogram (remember that counting in multiple dimensions with DHS incurs no additional hop-count overhead). For a rendezvous-based approach, these costs are in O(N · B · log(N)) and O(B · log(N)), respectively, since there will (on average) be B different rendezvous nodes. An even more extreme case would be to store all histogram buckets on a single rendezvous node; in this case, the message cost for data insertion and histogram reconstruction is in O(log(N)), as discussed earlier, but the load imbalance is even more severe, as we shall shortly see in the experimental evaluation section.
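For concreteness, a minimal sketch of the per-bucket metric assignment follows; the naming scheme is ours and only illustrative.

    // Sketch: map a value in [minVal, maxVal) to one of B equi-width buckets and
    // derive the DHS metric identifier for that bucket (naming scheme is ours).
    public final class EquiWidthBuckets {
        static int bucketOf(double v, double minVal, double maxVal, int B) {
            int b = (int) (((v - minVal) / (maxVal - minVal)) * B);
            return Math.min(Math.max(b, 0), B - 1);   // clamp values at the boundaries
        }

        static String metricId(String relation, String column, int bucket) {
            return relation + "." + column + ".bucket" + bucket;
        }
    }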

Note that, even in this extreme case, the rendezvous-based approach is still approximate and not exact. Due to the arbitrary replication of data tuples across nodes in the overlay, for such an approach to be exact, the rendezvous node should maintain complete lists of at least the primary keys of every item mapped to every bucket, so that identical items inserted by different nodes do not get counted twice in the bucket frequencies. However, this clearly does not scale well with the number of nodes and/or the number of items in the overlay. Thus, the rendezvous node has to either (1) keep a synopsis of the primary keys mapped to every bucket (e.g. a Bloom filter—but without a-priori knowledge of the amount of items mapped to each bucket, this should be made large enough to be able to record the primary keys of all data items in the overlay, which in turn is also not a known quantity), or (2) use a synopsis for the per-bucket frequency (e.g. a hash sketch), which is based on the exact same notion as the per-bucket distributed hash sketch of the DHS-based approach.

Average shifted Equi-Width histograms

By taking advantage of the dimension-free counting of DHS, we can also implement a distributed version of ASHs (coined Distributed ASH or DASH), using the technique outlined earlier for standard Equi-Width distributed histograms. Each DASH will consist of several DEWH with the same bucket widths but different starting positions in the value space.

In a DHS-based approach, each node constructs a local hash sketch for each cell of each of the equi-width histograms comprising the ASH, and sends batches of "set-to-1" messages for every non-zero bit position in the resulting hash sketches, containing the identifiers of the metrics/ASH cells for which this bit is to be set. Thus, the message cost of populating DASH is the same as in the previous case, while the cost of reconstructing the histogram is again in O(L · log(N)).

With rendezvous-based DEWH as a base, on the other hand, the cost of inserting the items is in O(N · S · B · log N)

for S shifted DEWHs, and the cost of reconstructing the histogram is in O(S · B · log N), since in this case there will be S · B different rendezvous nodes. Again, one could take the rendezvous case to the limits and store all buckets of all shifted histograms on the same rendezvous node so as to bring these figures down to O(log(N)), incurring however an even greater load imbalance than in the previous approach.
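To illustrate the DASH cell assignment, the following sketch (names and the particular shifting convention are ours) computes, for a value v, the cell it falls into in each of S equi-width grids shifted by equal fractions of the bucket width; each (shift, cell) pair would then be counted under its own DHS metric.

    // Sketch of DASH cell assignment: S equi-width grids with bucket width w,
    // the s-th grid shifted by s/S of a bucket (convention is ours).
    public final class DashCells {
        static int[] cellsOf(double v, double minVal, double w, int S) {
            int[] cells = new int[S];
            for (int s = 0; s < S; s++) {
                double offset = (w * s) / S;   // origin of grid s is minVal - offset
                cells[s] = (int) Math.floor((v - minVal + offset) / w);
            }
            return cells;
        }
    }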

Equi-Depth histograms

Distributed synopses, as described earlier, are good for estimating the number of items belonging to a predefined group (e.g. value range), but cannot count items belonging to a range whose boundaries change on-the-fly. This makes distributed synopses inappropriate for direct use in building histogram types other than Equi-Width or Average-Shifted ones.

To this extent, first note that although researchers usually reject Equi-Width histograms in favor of more complex types, such as Compressed or MaxDiff histograms, there are situations where the former perform as well as or even better than the latter [9], while there may be settings where even a one-bucket histogram is adequate [21].

Studying the relevant literature, we noted that more complex histogram types, such as Equi-Depth histograms, usually require much fewer cells to provide a good approximation of the attribute value distribution than Equi-Width histograms [74]. Moreover, remember that, when using DHS, we can build Equi-Width or Average Shifted histograms with many cells for no extra cost, having a separate metric for each histogram cell, due to the counting properties of DHS.

We thus propose to use either DEWH or DASH histograms as our base and try to infer more complex histogram types from them. The general approach is to retrieve information from a DEWH or DASH histogram, and then to do a series of local computations to infer the more complex histogram types. In more detail, our solution for computing an Equi-Depth histogram consists of the following steps (a code sketch follows the list):

1. First, build a relatively large DEWH or DASH histogram—for example, having 10-to-20 times more buckets than the target histogram.

2. Use the uniform spread assumption to project frequencies to values within each cell,

3. Consider the set of values of all cells, and

4. Find a partitioning of them into c groups, such that the order of values is not changed and all groups have equal (or almost equal) sums of value frequencies.
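A local sketch of this inference follows, assuming the base histogram's bucket frequencies have already been reconstructed from the DHS. Instead of enumerating the projected values explicitly, this simplified variant places boundaries by interpolating within base buckets under the uniform spread assumption; all names are ours.

    // Sketch: infer c equi-depth boundaries from a reconstructed base equi-width
    // histogram. freqs[i] is the estimated frequency of base bucket i; base
    // bucket i spans [minVal + i*w, minVal + (i+1)*w). Returns c-1 boundaries.
    public final class EquiDepthInference {
        static double[] boundaries(long[] freqs, double minVal, double w, int c) {
            long total = 0;
            for (long f : freqs) total += f;
            double[] bounds = new double[c - 1];
            long acc = 0;
            int next = 1;
            for (int i = 0; i < freqs.length && next < c; i++) {
                acc += freqs[i];
                // Place a boundary whenever the accumulated mass crosses the next
                // multiple of total/c; interpolate within the bucket (uniform spread).
                while (next < c && acc >= (double) total * next / c) {
                    double excess = acc - (double) total * next / c;
                    double frac = (freqs[i] == 0) ? 0.0 : 1.0 - excess / freqs[i];
                    bounds[next - 1] = minVal + (i + frac) * w;
                    next++;
                }
            }
            return bounds;
        }
    }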


More complex histogram types

One could advocate using the above inference technique to compute even more complex histogram types, such as MaxDiff(V,F) histograms—used by Microsoft SQL Server [44]—and Compressed(V,F) histograms—proved to be the perfect trade-off between optimality under the variance metric [46] and practicality due to their size and construction time overheads, and used by the DB2 optimizer [44]. In short, MaxDiff(V,F) histograms first sort items on their values and then place cell boundaries at the points in the value domain where items have the largest differences in their frequency, while in a c-cell Compressed(V,F) histogram, the c − k highest-frequency items are placed in single-value buckets, and the remaining values are placed in k buckets in an Equi-Depth manner.

After projecting frequencies to values within each cell using the uniform spread assumption, a c-cell MaxDiff(V, F) histogram could be inferred from a large plain Equi-Width or DASH histogram (a code sketch follows these lists), by:

1. Creating a sorted list of the differences in frequency of the above values,

2. Selecting the c − 1 largest differences as cell boundaries,

3. Grouping together values within each new cell, and

4. Computing the cell's frequency as the sum of the values' frequencies.

while for a c-cell Compressed(V, F) histogram we could:

1. Locate the c − k highest-frequency items,

2. Place them in singleton buckets, and

3. Assign the remaining values to k Equi-Depth buckets as in the Equi-Depth case above.
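A corresponding local sketch of the MaxDiff(V, F) boundary selection, operating on the projected per-value frequencies in value order (names are ours):

    import java.util.Arrays;

    // Sketch of MaxDiff(V,F) boundary selection: freqs[i] is the projected
    // frequency of the i-th smallest value; returns the indices after which the
    // c - 1 cell boundaries are placed.
    public final class MaxDiffInference {
        static int[] boundaryIndices(double[] freqs, int c) {
            int n = freqs.length;
            if (n < 2 || c < 2) return new int[0];
            Integer[] idx = new Integer[n - 1];
            for (int i = 0; i < n - 1; i++) idx[i] = i;
            // Order candidate boundaries by decreasing difference in frequency
            // between neighbouring values, and keep the c - 1 largest.
            Arrays.sort(idx, (a, b) -> Double.compare(
                    Math.abs(freqs[b + 1] - freqs[b]), Math.abs(freqs[a + 1] - freqs[a])));
            int[] chosen = new int[Math.min(c - 1, n - 1)];
            for (int i = 0; i < chosen.length; i++) chosen[i] = idx[i];
            Arrays.sort(chosen);   // cell j then spans positions (chosen[j-1], chosen[j]]
            return chosen;
        }
    }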

However, as we shall see shortly in the experimental evaluation section, these cases are somewhat problematic. This is mainly due to the fact that the width (and, therefore, the number) of the buckets of the histogram in the first step of the above inference algorithm defines the accuracy and granularity of the subsequent inference steps. In order not to lose any information, there should be a separate histogram bucket for each attribute value (i.e. buckets should be 1 value wide). On the other hand, if the attribute value domain is very large (e.g. millions of values), having singleton buckets will not be practical. Moreover, the use of DHS imposes a trade-off between the number of items falling into each bucket and the accuracy of the corresponding estimate, as outlined in [68], with more items resulting in better accuracy. Since the bucket frequency is projected to values in the bucket using the uniform spread assumption, the proposed method loses in accuracy as the buckets get wider; this holds especially for the Compressed(V, F) and MaxDiff(V, F) histograms whose

construction relies on accurately locating the highest-frequency or highest-frequency-difference value positions; for example, if the value distribution is very skewed and there is one very popular value in a multi-valued bucket, then the proposed inference method will assign a fraction of the total bucket frequency to each of the distinct equal-spread values in the bucket, and will thus fail to accurately compute the frequency-to-value mapping in that bucket.

We shall revisit this trade-off in the experimental evaluation section. As we shall also see there, this issue affects almost solely the Compressed and MaxDiff histogram construction; for Equi-Depth histograms, a DEWH or DASH with 10-to-20 times more buckets than the target histogram achieves nearly excellent accuracy at a very low cost. On the other hand, for Compressed and MaxDiff histograms we require a DEWH or DASH with a few values, 1-to-5, per bucket. There is a new hidden trade-off here regarding bandwidth requirements; unlike the number of hops, the amount of data fetched during reconstruction of the base DEWH or DASH histograms grows linearly with the number of buckets in the histograms. To this extent, we have considered adding a module to our inference engine that is responsible for automatically tuning the number and width of buckets and/or shifts in the underlying DEWH or DASH histograms, according to the domain size, the type of histogram to construct, and a user-supplied network bandwidth threshold. The current implementation of this module is in preliminary stages and remains a subject of ongoing and future work.

We also wish to explicitly state that we do not deal with multi-dimensional histograms. The main thrust behind this work is not to provide implementations for all specific histogram types or to solve all well-known issues in histogram-related research. We simply wish to show that such statistical structures can be maintained in a purely distributed manner, to present alternatives and discuss issues arising in this setting.

5.1 Applications in P2P query optimization

As already mentioned, we think of remote nodes and network accesses as if they were blocks and secondary storage accesses in a centralized database management system. Thus, a query optimizer for a P2P DBMS should, in addition to taking into account the statistics of the data values stored on the P2P overlay, also consider statistics on the node population, such as the "number of nodes per relation" (as opposed to the "number of data pages per relation"). We thus advocate storing, along with value frequencies, the node cardinality of each histogram cell, and taking into consideration both the number of data items and the number of nodes involved in a query, to further optimize the query access paths. This


operation roughly corresponds to building "node histograms" in addition to the histograms on the data value frequencies.

With the infrastructure discussed so far, we can compute several pieces of information required by optimizers, as discussed in [11] (and mentioned earlier). More specifically:

1. The number of tuples in a relation. This quantity corresponds to the relation cardinality and can now be computed with a single COUNT and/or COUNT-DISTINCT operation, depending on whether we want to account for duplicates or not.

2. The number of physical pages used by a table. This quantity corresponds to the "node cardinality" of the relation, which can also be directly computed, as discussed earlier.

3. Statistical information for table columns, in the form of:

(a) Histograms. Earlier in the current section we discussed techniques to compute several types of histograms over base data.

(b) The number of distinct values in a column. Again, this quantity can be computed with a single COUNT-DISTINCT operation on the column of interest.

(c) The minimum and maximum, or second-lowest and second-highest, values. As discussed earlier, this kind of functionality is usually provided by locality-preserving DHTs, on top of which we have assumed to operate. Otherwise, we can use histograms and the inference technique outlined earlier to compute an estimate of these values for each bucket, or for the column as a whole.

4. Information on the correlations among attribute values, in the form of:

(a) Multi-dimensional histograms, which could be created using the infrastructure presented so far, by defining the boundaries of the multi-dimensional buckets, and assigning a DHS metric to each such bucket, instead of the single-dimensional buckets we had up to now. This solution may well suffer from the curse of dimensionality and probably be subject to a degradation in accuracy with an increasing number of dimensions; however, there is also the following alternative.

(b) A single-dimensional histogram on the leading column, plus the total count of distinct combinations of column values present in the data, for multi-column indices [11]. This can also be computed with our infrastructure, even on a per-bucket basis, by having nodes insert combinations of values under appropriate DHS metrics.

Furthermore, we can provide for range predicate selectivity estimation. To this extent, we consider the uniform spread assumption, with a twist. Remember that rendezvous hash sketches and DHS can count both the total number of items in a set including duplicates and the number of distinct values, depending on whether we insert the values or the keys of tuples. Since we are dealing with range queries at this point, we assume that the underlying DHT efficiently supports them, so we can find the minimum and maximum values of a cell by probing the nodes responsible for the values corresponding to the cell's boundaries; if, on the other hand, there is no such functionality available or we do not wish to make these two extra network accesses, we can assume that the minimum and maximum values coincide with the cell's boundaries, or estimate them as earlier.

Given a range predicate α ≤ X ≤ β, we proceed as follows (a code sketch follows the list):

1. First determine the histogram buckets affected by the range query (the buckets with which the range predicate overlaps).

2. Using the DHS, determine the frequency of each bucket involved in the query, following the algorithms outlined earlier.

3. Using the uniform spread assumption [74], determine the values of each bucket and their frequencies.

4. Among these values, find those that belong to the range predicate, and

5. Return the sum of their frequencies as the estimated size of the range query result set.
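A local sketch of this estimation follows, for once the bucket frequencies have been fetched from the DHS. This variant uses the covered fraction of each bucket directly, which is equivalent to steps 3–5 under the uniform spread assumption when bucket minima and maxima coincide with the bucket boundaries; the names are ours.

    // Sketch: estimate the size of a <= X <= b over an equi-width histogram with
    // B buckets covering [minVal, maxVal), under the uniform spread assumption.
    public final class RangeSelectivity {
        static double estimate(long[] bucketFreqs, double minVal, double maxVal,
                               double a, double b) {
            int B = bucketFreqs.length;
            double w = (maxVal - minVal) / B;
            double est = 0.0;
            for (int i = 0; i < B; i++) {
                double lo = minVal + i * w, hi = lo + w;
                double overlap = Math.min(hi, b) - Math.max(lo, a);
                if (overlap > 0) {
                    // Fraction of the bucket covered by the predicate, times its frequency.
                    est += bucketFreqs[i] * (overlap / w);
                }
            }
            return est;
        }
    }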

Moreover, we could compute an estimate of a relation's self-join size by using Compressed(V,F) histograms (due to the inefficiencies of basic types of histograms as estimators when the values of the joined relations are correlated [21]). What we need to do is:

1. Use the DHS to reconstruct the base Equi-Width histogram and infer the Compressed(V, F) histogram,

2. Iterate over all histogram buckets, and

3. Take the sum of the squares of the bucket frequencies.

Due to the counting properties of DHS, reconstructing the histogram has a worst-case message cost of O(L · log(N)). Note that the above computation should be used only as a heuristic, as the accuracy of Compressed(V, F) histograms is subject to many subtle factors, as we shall shortly see. We are currently investigating ways of applying our methods in other classic query optimization scenarios, such as projections, single- and multi-way joins, etc.


6 Implementation and performance evaluation

6.1 Setup

We initially implemented a basic Chord-like DHT, DHS, and the solutions proposed in this work in an event-driven simulator coded in C++ from scratch [68]. The simulator supported all basic DHT primitives—that is, node joins and leaves/failures and data addition and deletion—and all the functionality of DHS and subsequent constructions. We used this simulator for preliminary performance evaluation purposes and sanity checks, and in order to observe how our solutions behave and scale in a fully controlled environment. We have now implemented DHS and DHS-based aggregates and histograms over FreePastry [28], a publicly available implementation of the Pastry overlay [23] in the Java programming language. In Pastry (and FreePastry), lookups have a hop-count cost in O(log_{2^b} N), with b a positive integer. We tweaked FreePastry so that its routing cost is in O(log_2 N)

(i.e. b = 1), so that hop-count results are easily comparable to those of other DHTs. Both implementations gave similar results, so we chose to show only results computed using the latter, to showcase the performance of our solutions as implemented in a real-world system. The source code of our implementation, coined FreeSHADE for "Statistics, Histograms, and Aggregates in a DHT-based Environment", consists of approximately 8,500 lines of Java code¹³ and is publicly available on the world-wide web through [29].

In both cases, the performance evaluation was carried out in the following steps:

1. We generated the workload; that is, the data tuples of the base relations and the queries to be executed on them later on. As is standard practice in the relevant literature, we synthetically generated data sets that (1) can test our system for various value distributions ranging from near uniform to highly skewed, and (2) correspond to value distributions also observed in real-world systems [37,76].

2. We populated the network with peers and allowed enough time for the DHT to stabilize. We chose to simulate a P2P DBMS running on a 1000-node DHT overlay, as an example of a mid-range distributed system, but larger networks are naturally supported by both our solutions and the code base.

3. We randomly assigned data tuples from the base data to nodes in the overlay. This corresponds to the data contributed by each overlay node.

4. Then we had all nodes insert into the P2P DBMS and the distributed synopses all relevant information (attribute

13 Counted using David A. Wheeler's 'SLOCCount'.

values, tuple IDs, node ID) for the tuples they were assigned during the previous step.

5. Finally, we selected random nodes and had them execute our algorithms for reconstructing histograms and computing aggregates.

For the DHS-based cases, we used 32 bits per hash sketch bitmap. Note that the expected performance figures are directly affected by the bitmap length, and that with 32-bit bitmaps we are able to count over trillions of items. A more conservative (and better looking, with regard to the resulting hop-count figures) approach would be to use smaller (e.g. 20-bit) bitmaps; however, we present the 32-bit case as an example of a worst-case scenario. Finally, we varied the number of bitmaps from 64 to 512. The results we shall present shortly were averaged over multiple runs for every case, to avoid statistical artifacts.

Our main focus is on distribution transparency; thus we mainly want to prove that the proposed solutions (1) are as accurate as their centralized counterparts, (2) impose low run-time overhead, and (3) scale well with the size of the network and of the overall data collection of peers. We thus measure the accuracy and (message-count) performance of the proposed solutions, attempting to showcase their applicability under the wide-scale distribution setting of a P2P DBMS.

More specifically, for aggregate query processing, we were primarily interested in the number of hops required to do the estimation, the accuracy of the estimation itself, as well as the fairness of the load distribution across nodes in the network. We instrumented FreePastry with the appropriate hooks to allow us to measure the number of hops each message needs to reach its destination node and the corresponding bandwidth usage. We present hop-count results both for inserting items into the DHS and for doing the actual estimation. We present results for three aggregates, namely COUNT, SUM, and AVG, as examples of aggregates based on classic hash sketches, on summation sketches, and on a combination of these two, respectively. Moreover, we report on the mean error of the estimation, computed as the percentage by which the distributed estimate differs from the actual value of the estimated aggregate computed over the base data in a centralized manner (i.e. as if all data were stored on a single host). Last, we report on the distribution of insertion and query load across nodes in the overlay, in order to showcase the trade-off of performance versus scalability/load distribution between the DHS and rendezvous-based approaches.

For histograms, our main concern was the accuracy of the estimated bucket frequencies and bucket boundaries (for inferred histograms), as well as the cost of reconstructing the base DEWH or DASH histogram in terms of number of hops and bandwidth usage (the inference step is a local operation and hence considered of negligible cost in the P2P


environment) and the load distribution. For the hop-count and bandwidth measurements we proceeded as above. In order to measure the accuracy of the estimated histograms, we first computed the histograms of choice over the base data in a centralized manner, then reconstructed and inferred the corresponding distributed histograms. Accuracy is again measured as the percentage by which the per-bucket estimated frequencies or bucket boundaries differ from the corresponding bucket frequencies/boundaries of the centralized histograms.

As far as the (fairness of the) distribution of load across participating hosts is concerned, we measure the load on any given node as the insertion and query/probe "hits" on this node; that is, the number of times this node is the target of an insertion or query/probe operation. We further instrumented FreePastry to report on the above metrics, namely Node Insertion Hits and Node Query Hits. Now, assume that l_i, 1 ≤ i ≤ N, is the load on the i-th node, and that μ_l and σ_l are the mean and sample standard deviation of these loads, respectively. That is:

μ_l = (1/N) · Σ_{i=1..N} l_i

and

σ_l = sqrt( (1/(N − 1)) · Σ_{i=1..N} (l_i − μ_l)² ).

We employed a multitude of metrics to visualize the impact of the different approaches outlined earlier in this paper. More specifically, we used the following (a code sketch of their computation follows the list):

– The Gini Coefficient [20], as proposed in [70], calculated by:

GC = (1/(2 · N² · μ_l)) · Σ_{i=1..N} Σ_{j=1..N} |l_i − l_j|.

That is, GC is the mean of the absolute differences of every possible pair of load values, normalized by twice the mean load. The Gini Coefficient takes values in the interval [0, 1), where a GC value of 0.0 is the best possible state, with 1.0 being the worst. The Gini Coefficient roughly represents the amount of imbalance in the system, so that, for example, a GC value of 0.25 translates to ≈75% of the total load being equally distributed across all nodes in the system.

– The Fairness Index [47]:

FI = (Σ_{i=1..N} l_i)² / (N · Σ_{i=1..N} l_i²).

The Fairness Index (also known as "Jain's Fairness Index" after the first of the authors of [47]) takes values in the interval (0, 1], with 0 and 1 being the worst and the best value, respectively. FI is to some extent the inverse of the GC in that it represents the amount of balance in the system, so that, for example, an FI value of 0.25 roughly translates to the load being equally spread across 0.25 of the nodes in the system.

– The maximum and total loads for DHS- and rendezvous-based approaches.
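Both metrics are straightforward to compute locally from the per-node load counters; a small sketch following the formulas above (names are ours; the O(N²) Gini computation is adequate for the network sizes considered here):

    // Sketch: Gini Coefficient and Jain's Fairness Index of per-node loads.
    public final class LoadBalanceMetrics {
        static double gini(double[] load) {
            int n = load.length;
            double mean = 0.0;
            for (double l : load) mean += l;
            mean /= n;
            double sumAbsDiff = 0.0;
            for (double li : load)
                for (double lj : load)
                    sumAbsDiff += Math.abs(li - lj);
            return sumAbsDiff / (2.0 * n * n * mean);
        }

        static double fairnessIndex(double[] load) {
            double sum = 0.0, sumSq = 0.0;
            for (double l : load) { sum += l; sumSq += l * l; }
            return (sum * sum) / (load.length * sumSq);
        }
    }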

6.2 Results

6.2.1 Aggregate queries

In this part of our experimental evaluation we primarily measured the hop-count efficiency and the accuracy of rendezvous-based hash sketches and of the DHS as the substrate for computing aggregates in a P2P DBMS. We initially created single-attribute relations, with integer values in the interval [0, 1000), following either a uniform distribution (depicted as a Zipf with θ equal to 0.0) or a shuffled Zipf distribution with θ equal to 0.7, 1.0, and 1.2 [37,76]. Relations contained 300,000 tuples each, distributed across all nodes in the overlay. Note that we use single-column relations for presentation reasons and ease of modeling, as a means of evaluating queries on statistical structures over index attributes of a (possibly much) larger relation; we have also tested our system with larger configurations (more and larger tuples per relation, real-valued attributes, etc.) with similar results.

Estimation error and hop-count costs

Figures 8a and b plot the average error of the resulting estimate for the COUNT aggregate, using a rendezvous- and a DHS-based approach, respectively. In both cases, the resulting error is due to the use of hash sketches to estimate the aggregate, thus both approaches exhibit the same average error. As expected, the higher the number of bitmaps in the synopsis, the better the accuracy. Specifically, for 64 bitmaps the average error is ≈12%, dropping to ≈5.7% for 128 bitmaps, ≈3.4% for 256 bitmaps, and ≈2.9% for 512 bitmaps. The estimation error is slightly larger for the SUM aggregate, due to the error introduced by the use of the summation technique of [16] (Fig. 8c, d). Last, the estimation error curve for the AVG aggregate closely follows that of the SUM aggregate.

We have measured and show the per-node average hop count for inserting all tuples into the distributed synopsis, and the hop count for computing the aggregate. Figures 9a and b depict the insertion hop-count cost for all of the evaluated aggregates. Figure 10a depicts the query hop-count cost for the rendezvous-based approach; since there is a single


Fig. 8 Average % estimation error for the COUNT and SUM/AVG aggregates versus their centralized values: (a) COUNT aggregate, rendezvous approach; (b) COUNT aggregate, DHS approach; (c) SUM/AVG aggregate, rendezvous approach; (d) SUM/AVG aggregate, DHS approach (x-axis: number of bitmaps; y-axis: average estimation error in %; curves for Zipf 0.0, 0.7, 1.0, 1.2)

Fig. 9 Hop count for populating the distributed synopsis using a rendezvous-based approach without replication and a DHS-based approach: (a) insertion hop count, rendezvous approach; (b) insertion hop count, DHS approach (x-axis: number of bitmaps; y-axis: average hops per node; curves for Zipf 0.0, 0.7, 1.0, 1.2)

rendezvous node per aggregate, this curve is the same for all evaluated aggregates. Finally, Fig. 10b, c, and d depict the query hop-count cost for the DHS-based approach, for the COUNT, SUM, and AVG aggregates, respectively. As can be seen, the per-node hop-count costs are higher for the DHS-based approach by a factor of approximately 8× for both the insertion and query cases. Given that each node does one DHT lookup to insert all of its tuples and to query the distributed synopsis in the rendezvous-based case, this translates to approximately 8 DHT lookups per node in the DHS-based case. The query hop-count cost for the DHS-based

approaches for SUM and AVG (Fig. 10c, d) is somewhat lower compared to the COUNT aggregate, as more bits are set to 1 due to the algorithm in [16] and thus it is easier for the DHS probing algorithm to find set bits.

Load distribution

The extra hop-count cost of the DHS-based approach pays back when it comes to load distribution fairness. Figures 11a, b and 12a, b plot the Gini Coefficient and the Fairness Index of the load imposed on nodes in the system during tuple


Fig. 10 Hop count for computing the COUNT, SUM, and AVG aggregates using a rendezvous-based approach without replication and a DHS-based approach: (a) all aggregates, rendezvous approach; (b) COUNT aggregate, DHS approach; (c) SUM aggregate, DHS approach; (d) AVG aggregate, DHS approach (x-axis: number of bitmaps; y-axis: average estimation hops; curves for Zipf 0.0, 0.7, 1.0, 1.2)

Fig. 11 Gini Coefficient (smaller is better) and Fairness Index (larger is better—note the logarithmic scale on the Y axis) of the load on nodes for populating a rendezvous-based synopsis: (a) Gini Coefficient; (b) Fairness Index (x-axis: number of bitmaps; curves for Zipf 0.0, 0.7, 1.0, 1.2)

Fig. 12 Gini Coefficient (smaller is better) and Fairness Index (larger is better—note the logarithmic scale on the Y axis) of the load on nodes for populating the DHS-based synopsis for the COUNT aggregate: (a) Gini Coefficient; (b) Fairness Index (x-axis: number of bitmaps; curves for Zipf 0.0, 0.7, 1.0, 1.2)


Fig. 13 Gini Coefficient (smaller is better) and Fairness Index (larger is better—note the logarithmic scale on the Y axis) of the load on nodes for populating the DHS-based synopsis for the SUM and AVG aggregates: (a) Gini Coefficient; (b) Fairness Index (x-axis: number of bitmaps; curves for Zipf 0.0, 0.7, 1.0, 1.2)

Fig. 14 Evolution over time of the Gini coefficient and Fairness index for querying the distributed synopsis for all three aggregates: (a) Gini Coefficient; (b) Fairness Index (x-axis: number of queries; curves for the rendezvous approach and for the DHS approach with 64, 128, 256, and 512 bitmaps)

insertion for the COUNT aggregate (the figures for the rendezvous-based approach are practically identical for all three aggregates tested, so we will reuse Fig. 11a and b). Recall that the load on any given node is measured as the number of times that node is visited (a.k.a. node hits) during data insertion and/or query processing. There is an improvement of more than two orders of magnitude for both metrics. More specifically, according to the Gini Coefficient, we go from ≈0.1% of the load being equally distributed across all nodes in the rendezvous approach to ≈26.4% of the load being equally spread for the DHS approach; likewise, the FI value jumps from ≈0.001 for the rendezvous-based case to ≈0.100 for the DHS-based approach. This also holds for the SUM and AVG aggregates (see Fig. 13); the DHS-based approach manages on average to outperform the rendezvous-based approach by a factor of ≈200× and ≈40× as far as GC and FI are concerned, respectively.

Last, Fig. 14 plots the evolution of the Gini coefficient, Fairness index, and total query load (node hits) over time, as queries are executed in the system. With the rendezvous-based approach there is a single node that is burdened with all of the query load, thus the GC and FI are constant over time and equal to or worse than ≈0.999 and ≈0.001, respectively. The DHS-based approaches converge to GC and FI values of ≈0.5, which equal the GC and FI values of the distribution of the distances between consecutive nodes in the ID space and are thus the best respective values achievable by any algorithm that uses a randomized assignment of items to nodes.

Replicated rendezvous

Supporters of the rendezvous-based approach might think that the GC and FI figures can be greatly improved by using multiple nodes to store the distributed synopsis and one of the replication schemes outlined in Sect. 3—namely, simple replication (i.e. each node randomly chooses one of the rendezvous nodes and inserts all of its tuples there) or proactive replication (i.e. each node inserts all of its tuples on all rendezvous nodes), representing the two extremes in load distribution for the rendezvous-based case. Figure 15 plots the GC and FI values for an increasingly larger number of rendezvous nodes (replicas). As can be seen, even for 2,000 replicas the GC value does not drop below 0.725 and the FI merely climbs to just below 0.3.

This might seem somewhat counterintuitive to the uninformed reader; one might expect that the load would be totally balanced with as many replicas as there are nodes (i.e. 1000). It all boils down to the way these replica nodes are selected; since a node does not (and cannot) know a priori the IDs of all nodes in the system (and even if it did, this information would soon become obsolete with nodes entering and leaving the P2P overlay), the best possible way of choosing the rendezvous IDs is by picking random IDs (following a uniform distribution) from the node ID space. Although node IDs are selected in a quasi-uniform manner (due to the use of a cryptographic hash function for their computation), this does not mean that the distances between consecutive


Fig. 15 Gini coefficient and Fairness index of the load on nodes for populating the distributed synopsis for the COUNT aggregate, for an increasingly larger number of rendezvous nodes sharing the computation (either through replication or because of multiple computed aggregates): (a) Gini Coefficient (smaller is better); (b) Fairness Index (larger is better) (x-axis: number of replicas; curves for simple and proactive replication)

Fig. 16 Hop counts for the COUNT aggregate using a rendezvous-based approach with 50 replicas: (a) simple replication, query hop count; (b) proactive replication, insertion hop count (x-axis: number of bitmaps; curves for Zipf 0.0, 0.7, 1.0, 1.2)

For example, in our 1000-node, 32-bit ID-space setup, the minimum, average, and maximum distances between any two consecutive nodes were ≈2^11, ≈2^22, and ≈2^25, respectively, with a standard deviation of ≈2^22. Note that this distribution of the distances between consecutive nodes in the network has GC and FI values of ≈0.5. Therefore, choosing among these nodes by first picking a random (uniform) ID and then selecting the node responsible for it ends up in a rather skewed load distribution.
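This behavior is easy to reproduce; the sketch below (ours, assuming SHA-1-derived 32-bit node IDs purely for illustration) draws 1,000 node IDs, computes the gaps between consecutive IDs on the ring, and reports their spread and Gini coefficient, which indeed comes out close to 0.5.

import hashlib

N, ID_BITS = 1000, 32
RING = 1 << ID_BITS
ids = sorted(int(hashlib.sha1(("node-%d" % i).encode()).hexdigest(), 16) % RING
             for i in range(N))
# Gaps between consecutive node IDs on the ring (wrapping around at 2^32).
gaps = [(ids[(i + 1) % N] - ids[i]) % RING for i in range(N)]

def gini(xs):
    xs, n, total = sorted(xs), len(xs), float(sum(xs))
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

print(min(gaps), sum(gaps) // N, max(gaps))  # min / average / max gap
print(gini(gaps))                            # typically close to 0.5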

Now assume that there were a way for a node to know the IDs of all other nodes currently present in the system at any given point in time. Even in that case, it would take at least ≈275 rendezvous node replicas (either with simple or with proactive replication) for the rendezvous-based approach to achieve the GC and FI figures of the DHS-based approach. However, in that case the hop-count cost for inserting tuples into and/or computing the distributed synopsis would be abysmal. As a rule of thumb, Fig. 16 depicts the hop-count cost for inserting all tuples under proactive replication and for computing the COUNT aggregate under simple replication, using a rendezvous-based distributed synopsis, for a much milder replication scenario of just 50 replicas.

So far, there seems to be a clear-cut trade-off between the rendezvous- and the DHS-based approaches; the former is better when raw hop-count cost is more important, but the latter is the best choice when load distribution comes into play. This is only half the truth, though; let us look at Figs. 15 and 16 from a different perspective: that of multiple computed aggregates, as opposed to multiple replicas of a single aggregate. Assume, for example, that we are to compute multiple COUNT, SUM, AVG, etc. aggregates over the same peer-to-peer data network. Unless we store all such aggregates on a single rendezvous node—a nightmare from a load distribution and, hence, from a scalability and fault tolerance point of view—the only other choice is to assign these aggregates to randomly chosen nodes in the overlay. This scenario, however, is identical to that of a single aggregate with a proactive replication scheme and as many replicas as there are aggregates. The bottom line is that, although in the single-aggregate (or single-“metric”) case the rendezvous approach seems to be the winner with regard to raw hop count, its advantage vanishes quickly with more than a handful of estimated aggregates.

To further illustrate this, Fig. 17 plots the total load imposed on the nodes in the system at query time for 10, 20, 50, and 100 COUNT aggregates (or, conversely, for a single COUNT aggregate and proactive replication with a factor of 1, 2, 5, and 10%, respectively; the figure for the simple-replication rendezvous scenario would be even worse). Note that the y-axis is in logarithmic scale. This figure showcases in the best possible way the linear increase in the total query load with the number of computed aggregates for the rendezvous-based approaches.


[y-axis: total node load (query hits), log scale; x-axis: number of queries; curves for 10, 20, 50, and 100 rendezvous-based aggregates and for the DHS approach with 64, 128, 256, and 512 bitmaps]

Fig. 17 Total query load in the system when multiple aggregates are to be computed concurrently

Note that only one curve is shown for the DHS-based solution for each of the possible numbers of bitmaps, since the total query load (and hop-count cost) is the same irrespective of the number of aggregates/“metrics” estimated, due to the properties of the DHS outlined in Sect. 3.2.
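One plausible back-of-the-envelope model of this difference is sketched below (ours; the per-lookup and per-query constants are illustrative placeholders, not measured values): the rendezvous-based load grows linearly with the number of aggregates, whereas the DHS-based load does not depend on it.

def rendezvous_query_load(num_queries, num_aggregates, hops_per_lookup=4):
    # Every aggregate lives on its own rendezvous node, so a query touching
    # all aggregates pays one DHT lookup per aggregate: total load grows
    # linearly with the number of aggregates.
    return num_queries * num_aggregates * hops_per_lookup

def dhs_query_load(num_queries, nodes_visited_per_query=60):
    # A single DHS collection pass gathers all "metrics" stored on the
    # visited nodes, so the load is independent of the number of aggregates.
    return num_queries * nodes_visited_per_query

for k in (10, 20, 50, 100):
    print(k, rendezvous_query_load(1000, k), dhs_query_load(1000))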

Scalability

Given the above discussions, we now turn our attention to studying the scalability of the DHS-based solutions. As made obvious thus far, the greatest strengths of the DHS-based solutions are their scalability for large numbers of computed aggregates and their fair load distribution across the nodes in the overlay. The DHS-based approach further exhibits excellent scalability with regard to the number of nodes and tuples in the system. In this section we repeat the measurements for the COUNT aggregate, only now varying the number of nodes in the system from 1,000 to 2,500, 5,000, and 10,000 nodes. The number of data tuples managed by the overlay is also scaled to either 30× or 300× the number of nodes (e.g. 30,000 and 300,000 tuples for the 1,000-node network, 75,000 and 750,000 tuples for the 2,500-node network, etc.).

Figures 18a, 19a, and 20a plot the per-node average insertion load, the maximum node insertion load, and the hop-count cost, respectively, for the 30× case—that is, the average number of times a node is hit during insertion of all tuples into the DHS synopsis, the maximum such number, and the average number of hops each node pays in order to insert all of its tuples into the DHS—for various combinations of input value distributions and numbers of bitmaps in the DHS. Similarly, Figs. 18b, 19b, and 20b depict the same metrics for the 300× case. As we can see, the per-node insertion load is nearly constant, while the maximum insertion load scales logarithmically with the number of nodes and sublinearly with the number of items in the overlay (the relevant figures for the rendezvous approach would obviously show a linear dependence on the number of nodes and independence of the number of items). Moreover, the per-node average insertion hop count also grows nearly logarithmically with the number of nodes and tuples in the system—there is less than double the hop-count cost for a ten-fold increase in the number of nodes and a one-hundred-fold increase in the number of tuples in the overlay!

Likewise, Figs. 21a and 22a plot the average number of nodes visited and the hop-count cost for computing the COUNT aggregate for the 30× case, with the 300× case depicted in Figs. 21b and 22b. From this set of figures we can conclude that the average query hop count is independent of the number of items and grows logarithmically with the number of nodes in the system. The average query load also grows logarithmically with the number of nodes. Note, however, that our DHS-based solutions perform better, load-wise, for a larger number of items in the system, depending on the number of bitmaps used. This is due to the fact that a larger number of items inserted into the DHS-based synopsis, combined with a lower number of bitmaps in the synopsis, means that fewer nodes have to be visited because of retries (see Sect. 3.2.5). The average estimation error, Gini coefficient, and Fairness index were as shown earlier in Figs. 8b, 12a, and 12b, respectively. Last, the figures for the other aggregates were similar to those presented here for COUNT and are thus omitted.

6.2.2 Histograms

We would like to emphasize again that our key goal is distribution transparency. We are thus not concerned with the statistical properties of the histograms per se, or with whether a given histogram type is optimal or not; these have been extensively studied in the literature. Instead, we compare our results with histograms of the same type and characteristics built using a centralized approach.

Our experiments proceed as follows: (1) first, we generate the base data and compute the histogram(s) of choice over it in a centralized manner; (2) then, we configure and populate the simulated 1000-node network with peers and items as described earlier; (3) finally, we choose random nodes and have them reconstruct and/or infer the same set of histograms, following the approach outlined in Sect. 5. Note that we are dealing solely with single-dimensional histograms.

We first populate and reconstruct a 100-cell Equi-Width DHS-based histogram, and then use it to infer 10-bucket Equi-Depth, Compressed(V,F), and MaxDiff(V,F) histograms. In this part of our experimental evaluation we use only 256 bitmaps per DHS hash sketch, as this number poses a good trade-off between cost and accuracy, as showcased by the previous results. All relations have a single integer-valued attribute, with values drawn from the attribute value domain using either a uniform or a shuffled Zipf distribution with θ equal to 0.7, 1.0, and 1.2 [37,76].
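As a simplified illustration of the inference step (the full procedure is that of Sect. 5; here we only assume that the mass of a source cell can be split by linear interpolation), a 10-bucket Equi-Depth histogram can be derived from the 100 reconstructed Equi-Width cell counts by walking the cumulative frequency distribution:

def equi_depth_from_equi_width(cell_counts, cell_width, num_buckets=10):
    # cell_counts[i]: estimated frequency of equi-width cell i;
    # cell_width: width of each source cell in the attribute value domain.
    total = float(sum(cell_counts))
    target = total / num_buckets
    boundaries, acc, filled = [], 0.0, 0
    for i, c in enumerate(cell_counts):
        acc += c
        # Emit a boundary each time another 1/num_buckets of the total mass
        # is covered, interpolating linearly inside the current source cell.
        while filled < num_buckets - 1 and acc >= target * (filled + 1):
            overshoot = acc - target * (filled + 1)
            frac = 1.0 - (overshoot / c if c else 0.0)
            boundaries.append((i + frac) * cell_width)
            filled += 1
    boundaries.append(len(cell_counts) * cell_width)  # upper end of the domain
    return boundaries

# Example (hypothetical input): 100 equi-width cells over [0, 1,000,000)
# collapsed to 10 equi-depth buckets.
# bounds = equi_depth_from_equi_width(reconstructed_counts, cell_width=10000)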


Fig. 18 Per-node average insertion load for a COUNT aggregate using the DHS, for various numbers of nodes in the overlay; a tenfold increase in the number of items leads to a ≈60% increase in the number of times each node is visited during the insertion phase

[(a) Number of data tuples = 30× number of nodes; (b) Number of data tuples = 300× number of nodes; one panel per Zipf θ ∈ {0.0, 0.7, 1.0, 1.2}, with curves for 64, 128, 256, and 512 bitmaps]

For each of the inferred histogram types we measured the average cell frequency error versus the real attribute-value frequency distribution. For the Equi-Depth histogram, we further show what we call the “positioning error”: the error in the estimation of the cell boundaries. For the latter two histogram types we also show the recall rate—the ratio of top frequencies or single-valued buckets (respectively) that were inferred correctly out of the 10 total. Finally, for the Compressed histograms we further show the “ranking error”, that is, the fraction of top-frequency values that were inferred at the correct ranks. Again, the reported values were averaged over several runs of the simulation.
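For reference, these metrics can be stated operationally as in the sketch below (our own formulation of the definitions above; normalizing the first two errors by the total frequency mass and by the domain size, respectively, is an assumption on our part).

def avg_cell_frequency_error(est_freqs, true_freqs, total_mass):
    # Average absolute cell-frequency error, reported as a percentage
    # of the total frequency mass (normalization is our assumption).
    errs = [abs(e - t) for e, t in zip(est_freqs, true_freqs)]
    return 100.0 * sum(errs) / (len(errs) * total_mass)

def avg_positioning_error(est_bounds, true_bounds, domain_size):
    # Equi-Depth "positioning error": distance of each estimated cell boundary
    # from the exact one, as a percentage of the attribute value domain.
    errs = [abs(e - t) for e, t in zip(est_bounds, true_bounds)]
    return 100.0 * sum(errs) / (len(errs) * domain_size)

def recall_rate(inferred_top10, true_top10):
    # Recall rate: fraction of the true top-10 entries present in the inferred top-10.
    return len(set(inferred_top10) & set(true_top10)) / 10.0

def correctly_ranked(inferred_top10, true_top10):
    # Ranking accuracy: fraction of top-frequency values inferred at the correct rank.
    return sum(1 for e, t in zip(inferred_top10, true_top10) if e == t) / 10.0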

Initially we used a 10-million-tuple relation, with values in the interval [0, 1,000,000). Figure 23a summarizes the results from this set of experiments. With regard to the hop-count costs and load distribution, the same argumentation holds as in the multiple-aggregates case, as the proposed way of implementing distributed Equi-Width histograms is to use a different DHS “metric” or a different rendezvous ID per histogram bucket. This means that a rendezvous-based solution would have to either (i) use multiple rendezvous nodes, thus incurring the outrageous hop-count costs and total query load depicted in Figs. 16 and 17, or (ii) store all histogram buckets on a single node, hence creating a severe load imbalance in the system. We thus omit any further discussion of such costs in this part of our experimental evaluation and concentrate on DHS-based histograms.


Fig. 19 Maximum node insertion load for a COUNT aggregate using the DHS, for various numbers of nodes in the overlay; a tenfold increase in the data tuples in the overlay leads to a ≈6-fold increase in the maximum number of times any node is visited during the insertion phase, yet still an order of magnitude lower than that of the rendezvous-based approach

[(a) Number of data tuples = 30× number of nodes; (b) Number of data tuples = 300× number of nodes; one panel per Zipf θ ∈ {0.0, 0.7, 1.0, 1.2}, with curves for 64, 128, 256, and 512 bitmaps]

As far as accuracy is concerned, the average cell frequency error is nearly excellent for the Equi-Width, Average-Shifted, and Equi-Depth histograms. The very small error in the Equi-Depth cases is somewhat artificial; since all buckets have approximately the same frequency, there is little margin for error. What makes more sense in this case is the cell boundary positioning error, depicted in Fig. 23b. Again, our algorithms achieve nearly excellent accuracy, even in the highly skewed case where input values are drawn from a Zipf distribution with θ equal to 1.2. On the cost side, the average hop count for reconstructing the initial DHS histogram is ≈26 hops (i.e. equaling the cost of merely ≈6 DHT lookups), while the average bandwidth consumption is ≈60–70 kbytes.

No results are shown in Fig. 23a for the MaxDiff(V,F) and Compressed(V,F) histograms, as they proved to be quite problematic. The main reason for this is that, with 100 buckets in the initial DHS histogram and 1 million values in the attribute value domain, 10,000 values are mapped into every histogram cell. This makes it nearly impossible to guess, based solely on the number of unique values in each bucket and the bucket's overall frequency, the exact mapping of frequencies to values required for these histogram types.


Fig. 20 Per-node average insertion hop-count cost for a COUNT aggregate using the DHS, for various numbers of nodes in the overlay; a tenfold increase in the number of tuples in the overlay leads to a ≈35% increase in the hop-count cost each node has to pay to insert all of its items in the DHS

[(a) Number of data tuples = 30× number of nodes; (b) Number of data tuples = 300× number of nodes; one panel per Zipf θ ∈ {0.0, 0.7, 1.0, 1.2}, with curves for 64, 128, 256, and 512 bitmaps]

To this end, we switched our experimental evaluation in a different direction: again, we used a 1,000-node overlay and tried to build the same histograms as before, only now using a 10-million-tuple relation, drawing tuple values from the interval [0, 1,000) and building a 1,000-bucket initial DEWH histogram. This corresponds to having a separate DHS counter for every possible value in the attribute value domain.
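With one counter per value, the inferring node effectively holds the complete value–frequency vector, so the textbook bucketing rules can be applied locally. As an illustration (our sketch of the standard MaxDiff(V,F) rule, not necessarily the exact code of our implementation), bucket borders are placed after the largest frequency differences between adjacent values:

def maxdiff_vf_buckets(freqs, num_buckets=10):
    # freqs[v]: estimated frequency of attribute value v.
    # Bucket borders go after the num_buckets-1 largest differences
    # between the frequencies of adjacent attribute values.
    diffs = [abs(freqs[v + 1] - freqs[v]) for v in range(len(freqs) - 1)]
    cut_after = sorted(sorted(range(len(diffs)), key=lambda i: diffs[i],
                              reverse=True)[:num_buckets - 1])
    buckets, start = [], 0
    for cut in cut_after:
        buckets.append((start, cut))      # bucket covers values start..cut
        start = cut + 1
    buckets.append((start, len(freqs) - 1))
    return buckets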

The results are depicted in Figs. 24a–d. As we can see, the average cell frequency error is at the same levels as before (Fig. 24a), and so is the cell boundary positioning error for the Equi-Depth histogram case (Fig. 24b). As far as the average ranking error (Fig. 24c) and recall rate (Fig. 24d) are concerned, we discern a certain trend: in both cases our algorithms lose accuracy when the input values are drawn from the uniform distribution, and get better as the input distribution becomes more skewed. Note, however, that the strengths of these two histogram types manifest themselves with skewed value distributions; if the tuple values are uniformly distributed over their value domain, then one gains nothing from using such a histogram, or any histogram at all for that matter!


Fig. 21 Average number of nodes visited when computing a COUNT aggregate using the DHS, for various numbers of nodes in the overlay

[(a) Number of data tuples = 30× number of nodes; (b) Number of data tuples = 300× number of nodes; one panel per Zipf θ ∈ {0.0, 0.7, 1.0, 1.2}, with curves for 64, 128, 256, and 512 bitmaps]


However, the most interesting thing about this set of experiments is the hop-count cost for reconstructing the initial DEWH histogram: across all sets of parameters, this hop-count cost did not exceed 30 hops (for an average of ≈27 hops)! The average bandwidth cost for reconstructing the base DEWH histogram was ≈600–700 kbytes. As already mentioned, the lurking trade-off here is related to bandwidth requirements; with 1,000 buckets versus the 100 buckets we had earlier, each node visited during histogram reconstruction may respond with up to 10 times more data. Remember that this data is merely a set of <metricID, bit-vector> tuples whose size is very small (24–84 bytes per tuple in our real-world implementation).

This trade-off becomes much more apparent when dealing with large attribute value domains. For example, if attribute values were drawn from the interval [0, 1,000,000) so that the interval is densely populated (i.e. there were tuples for most of the 1M values in the domain) and we required a 1-million-bucket DEWH base histogram, then the histogram reconstruction bandwidth cost would go up to around 600–700 Mbytes! A first note here is that this figure refers to uncompressed DHS data; an appropriate compression algorithm could probably cut down bandwidth requirements by one to two orders of magnitude. Besides that, (1) to our knowledge, most realistic databases and workloads build indices and histograms for columns with up to a few (tens of) thousands of distinct values, (2) in most queries we shall not need information for all of the buckets but only for a handful of them, and (3) even if the previous conditions do not hold, we can still build highly accurate Equi-Width, ASH, and Equi-Depth histograms over the base data with a very low hop-count and bandwidth consumption cost.
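The linear dependence of the reconstruction bandwidth on the number of base cells can be captured with a back-of-the-envelope estimate; in the sketch below (ours), the per-tuple sizes are the 24–84 bytes quoted above, while the tuples-per-cell count is an illustrative guess rather than a measured figure.

def reconstruction_bandwidth(num_cells, tuples_per_cell, tuple_bytes=(24, 84)):
    # Each base cell contributes a set of <metricID, bit-vector> tuples of
    # 24-84 bytes each; total bandwidth therefore grows linearly with the
    # number of base histogram cells.
    lo, hi = tuple_bytes
    return num_cells * tuples_per_cell * lo, num_cells * tuples_per_cell * hi

# tuples_per_cell is an illustrative guess, not a figure from the measurements.
for cells in (100, 1000, 1000000):
    lo, hi = reconstruction_bandwidth(cells, tuples_per_cell=12)
    print(cells, "%.2f-%.2f MB" % (lo / 2.0**20, hi / 2.0**20))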


Fig. 22 Average hop-count cost per query when computing a COUNT aggregate using the DHS, for various numbers of nodes in the overlay

[(a) Number of data tuples = 30× number of nodes; (b) Number of data tuples = 300× number of nodes; one panel per Zipf θ ∈ {0.0, 0.7, 1.0, 1.2}, with curves for 64, 128, 256, and 512 bitmaps]


Discussion

We would like to pinpoint the necessity and appropriateness of the techniques presented, implemented, and evaluated in this work for solving basic problems in the P2P DBMS realm. The above experimental results showed that the message count and bandwidth costs for reconstructing the base DEWH or DASH histograms and inferring the more complex histogram types can be quite low. Now, note that (1) after a node has paid the cost to reconstruct the histograms, choosing the optimal execution plan is a local operation, and (2) the above figures refer to the cost of reconstructing the whole histogram; query processing may require estimation of the cardinality of only specific buckets, depending on the query predicate constraints.

Now assume that a query optimizer, armed with the above histograms, is (at least) able to select the optimal query execution plan.


[(a) Average cell frequency error (%) per histogram type; (b) Average cell boundary positioning error (%) per cell, Equi-Depth histogram case only; bars for Zipf θ = 0.0, 0.7, 1.0, and 1.2]

Fig. 23 Histogram inference results: 1,000 nodes, 10,000,000 items, values in [0, 1,000,000), 100 initial DHS buckets, 10 buckets per inferred histogram

[(a) Average cell frequency error (%) per histogram type; (b) Average cell boundary positioning error (%), Equi-Depth histogram case; (c) Correctly ranked top frequencies (out of 10), Compressed(V,F) histogram case; (d) Top-frequency positions found (out of 10), Compressed(V,F) and MaxDiff(V,F) histogram cases; bars for Zipf θ = 0.0, 0.7, 1.0, and 1.2]

Fig. 24 Histogram inference results: 1,000 nodes, 10,000,000 items, values in [0, 1,000), 1,000 initial DHS buckets, 10 buckets per inferred histogram

In this case, the savings in bandwidth and response time will be considerable [8,43]. For example, in [43] the authors consider multi-way joins in a much smaller setting than the one discussed above (256 nodes and relations with 256,000 tuples each or 100 tuples per node). The optimal join strategy in the three-way join case results in a data transfer of 47 Mbytes, as opposed to 71 Mbytes transferred by FREddies, both of which are orders of magnitude larger than the ≈0.7 Mbytes required to reconstruct the histograms using DHS. Thus, for example, if DHS-based histograms were added to the PIER query processing logic, PIER would select the optimal join plan at a very reasonable additional cost, resulting in major overall bandwidth and latency savings. This gives an insight into the impact that DHS can have on Internet-scale query engines.

7 Conclusions

In this paper, we have developed a new framework for distributed statistical synopses that are particularly well suited for Internet-scale overlay networks such as P2P systems. Our rationale has been to start with the best known techniques for centralized settings and extend them towards distributed settings with particular care about scalability.

Our contributions rest on utilizing specific statistical structures, such as hash sketches and distributed hash sketches, and on showing new methods for harnessing them to process important types of aggregate queries in a P2P environment. Using this substrate, we have shown how to develop DHT-based higher-level synopses such as Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms, which figure prominently in the literature on traditional DBMS statistics management for query optimization. Our implementation and extensive performance evaluation show that these structures can be constructed and exploited efficiently and scale well with growing network size, while achieving high accuracy. The implementation of our solutions, over the popular stable DHT software FreePastry, is available as open source for anyone to use and test [29].

There are several routes to explore for future research. First, we would like to examine the design and implementation of auto-tuning capabilities for our histogram inference engine; these issues have been addressed in centralized database systems [1], but they are completely untouched by the P2P research community. Integrating our solutions with Internet-scale query processing systems, in the spirit of [42], comes up next in the list; we believe that the outcome would bring us closer to the elusive goal of completely self-organized P2P solutions to advanced data management. Third, we intend to look into implementing further types of synopses, aggregates, and histogram variants. Finally, using our tools for approximate query answering, in the sense of [2,3,13,50], would be a very intriguing direction as well.

Acknowledgments Nikos Ntarmos was supported by the 03ED719 research project, implemented within the framework of the “Reinforcement Program of Human Research Manpower” (PENED), co-financed by National and Community Funds (25% from the Greek Ministry of Development, General Secretariat of Research and Technology, and 75% from the European Union, European Social Fund). Peter Triantafillou was funded by the 6th Framework Program of the European Union through the Integrated Project DELIS (#001907) on Dynamically Evolving Large-scale Information Systems. We would like to sincerely thank the associate editor and the reviewers for their careful reading of our paper and thoughtful and valuable comments.

References

1. Aboulnaga, A., Chaudhuri, S.: Self-tuning histograms: Build-ing histograms without looking at data. In: Proceedings of theACM SIGMOD International Conference on Management of Data(SIGMOD) (1999)

2. Acharaya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: The AQUA approximate query answering system. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD) (1999)

3. Acharya, S., Gibbons, P., Poosala, V., Ramaswamy, S.: Joinsynopses for approximate query answering. In: Proceedings of theACM SIGMOD International Conference on Management of Data(SIGMOD) (1999)

4. Babaoglu, Ö., Meling, H., Montresor, A.: Anthill: A framework forthe development of agent-based peer-to-peer systems. In: Proceed-ings of the IEEE International Conference on Distributed Comput-ing and Systems (ICDCS) (2002)

5. Bawa, M., Garcia-Molina, H., Gionis, A., Motwani, R.: Estimatingaggregates on a peer-to-peer network. Technical report, ComputerScience Department, Stanford University (2003)

6. Bawa, M., Gionis, A., Garcia-Molina, H., Motwani, R.: The priceof validity in dynamic networks. In: Proceedings of the ACM SIG-MOD International Conference on Management of Data (SIG-MOD) (2004)

7. Beyer, K., Haas, P.J., Reinwald, B., Sismanis, Y., Gemulla, R.: Onsynopses for distinct-value estimation under multiset operations.In: Proceedings of the ACM SIGMOD International Conferenceon Management of Data (SIGMOD) (2007)

8. Bharambe, A., Agrawal, M., Seshan, S.: Mercury: Supportingscalable multi-attribute range queries. In: Proceedings of theACM Symposium on Communications Architectures and Proto-cols (SIGCOMM) (2004)

9. Blohsfeld, B., Korus, D., Seeger, B.: A comparison of selectivityestimators for range queries on metric attributes. In: Proceedingsof the ACM SIGMOD International Conference on Managementof Data (SIGMOD) (1999)

10. Carzaniga, A., Rosenblum, D.S., Wolf, A.L.: Design and evalua-tion of a wide-area event notification service. ACM Trans. Comput.Syst. (TOCS) 19(3), 332–383 (2001)

11. Chaudhuri, S.: An overview of query optimization in relational sys-tems. In: Proceedings of the ACM SIGACT-SIGMOD-SIGARTSymposium on Principles of Database Systems (PODS) (1998)

12. Chaudhuri, S., Das, G., Narasayya, V.R.: A robust, optimization-based approach for approximate answering of aggregate queries.In: Proceedings of the ACM SIGMOD International Conferenceon Management of Data (SIGMOD) (2001)

13. Chaudhuri, S., Das, G., Narasayya, V.R.: Optimized stratified sam-pling for approximate query processing. ACM Trans. DatabaseSyst. (TODS) 32(2), 9 (2007)

14. Chaudhuri, S., Motwani, R., Narasayya, R.: Random sampling forhistogram construction: how much is enough? In: Proceedings ofthe ACM SIGMOD International Conference on Management ofData (SIGMOD) (1998)

15. Chaudhuri, S., Narasayya, V.R.: Automating statistics manage-ment for query optimizers. IEEE Trans. Knowl. Data Eng. (TKDE)13(1), 7–20 (2001)

16. Considine, J., Li, F., Kollios, G., Byers, J.: Approximate aggrega-tion techniques for sensor databases. In: Proceedings of the Inter-national Conference on Data Engineering (ICDE) (2004)

17. Cormode, G., Muthukrishnan, S.: An improved data stream sum-mary: the count-min sketch and its applications. In: Proceedingsof the Latin American Symposium on Theoretical Informatics(LATIN) (2004)

18. Cormode, G., Muthukrishnan, S.: Space efficient mining of mult-igraph streams. In: Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS)(2005)

19. Dabek, F., Li, J., Sit, E., Robertson, J., Kaashoek, M.F., Morris, R.:Designing a DHT for low latency and high throughput. In: Proceed-ings of the USENIX Symposium on Networked Systems Designand Implementation (NSDI) (2004)

20. Damgaard, C., Weiner, J.: Describing inequality in plant size orfecundity. Ecology 81, 1139–1142 (2000)


21. Dobra, A.: Histograms revisited: When are histograms the bestapproximation method for aggregates over joins? In: Proceedingsof the ACM SIGMOD-SIGACT-SIGART Symposium on Princi-ples of Database Systems (PODS) (2005)

22. Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Sketch-basedmulti-query processing over data streams. In: Proceedings ofthe International Conference on Extending Database Technology(EDBT) (2004)

23. Druschel, P., Rowstron, A.: Pastry: Scalable, distributed objectlocation and routing for large-scale peer-to-peer systems. In: Pro-ceedings of the IFIP/ACM IFIP/ACM International Conference onDistributed Systems Platforms (Middleware) (2001)

24. Durand, M., Flajolet, P.: Loglog counting of large cardinalities. In:Proceedings of the Annual European Symposium on Algorithms(ESA) (2003)

25. Fan, L., Cao, P., Almeida, J., Broder, A.Z.: Summary cache: a scal-able wide-area web cache sharing protocol. In: Proceedings of theACM Symposium on Communications Architectures and Proto-cols (SIGCOMM) (1998)

26. Flajolet, P., Martin, G.N.: Probabilistic counting algorithms fordata base applications. J. Comput. Syst. Sci. 31(2), 182–209(1985)

27. FreeDHS: Homepage (2006). http://netcins.ceid.upatras.gr/DHS.php

28. FreePastry: Homepage (2002). http://freepastry.org/FreePastry/

29. FreeSHADE (2006). http://netcins.ceid.upatras.gr/SHADE.php

30. Ganguly, S., Garofalakis, M., Rastogi, R.: Processing set expressions over continuous update streams. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD) (2003)

31. Garofalakis, M. (ed.): Special issue on in-network query process-ing. IEEE Data Eng. Bull. 28(1) (2005)

32. Gibbons, P.B., Matias, Y., Poosala, V.: Fast incremental main-tenance of approximate histograms. ACM Trans. Database Syst.(TODS) 27(3), 261–298 (2002)

33. Gray, J., Liu, D.T., Nieto-Santisteban, M.A., Szalay, A.S.,DeWitt, D.J., Heber, G.: Scientific data management in the comingdecade. SIGMOD Record 34(4), 34–41 (2005)

34. Gribble, S., Halevy, A., Ives, Z., Rodrig, M., Suciu, D.: What canpeer-to-peer do for databases, and vice versa? In: Proceedings ofthe International Workshop on the Web and Databases (WebDB)(2001)

35. Guha, S., Koudas, N., Shim, K.: Data-streams and histograms. In:Proceedings of the Annual ACM Symposium on Theory of Com-puting (STOC) (2001)

36. Gummadi, K., Gummadi, R., Gribble, S., Ratnasamy, S.,Shenker, S., Stoica, I.: The impact of DHT routing geometry onresilience and proximity. In: Proceedings of the ACM Symposiumon Communications Architectures and Protocols (SIGCOMM)(2003)

37. Gummadi, K.P., Dunn, R.J., Saroiu, S., Gribble, S.D., Levy,H.M., Zahorjan, J.: Measurement, modeling, and analysis of apeer-to-peer file-sharing workload. In: Proceedings of the ACMSymposium on Operating Systems Principles (SOSP), pp. 314–329(2003)

38. Haas, P., Naughton, J.F., Seshadri, S., Stokes, L.: Sampling-basedestimation of the number of distinct values of an attribute. In: Pro-ceedings of the International Conference on Very Large Data Bases(VLDB) (1995)

39. Hadjieleftheriou, M., Byers, J.W., Kollios, G.: Robust sketchingand aggregation of distributed data streams. Technical Report2005-011, Computer Science Department, Boston University(2005)

40. Hellerstein, J.M., Condie, T., Garofalakis, M., Loo, B.T., Maniatis, P., Roscoe, T., Taft, N.: Public health for the internet (PHI). In: Proceedings of the ACM SIGMOD/VLDB Biennial Conference on Innovative Data Systems Research (CIDR) (2007)

41. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In:Proceedings of the ACM SIGMOD International Conference onManagement of Data (SIGMOD) (1997)

42. Huebsch, R., Chun, B.N., Hellerstein, J.M., Loo, B.T., Maniatis, P.,Roscoe, T., Shenker, S., Stoica, I., Yumerefendi, A.R.: The archi-tecture of PIER: an internet-scale query processor. In: Proceedingsof the ACM SIGMOD/VLDB Biennial Conference on InnovativeData Systems Research (CIDR) (2005)

43. Huebsch, R., Jeffery, S.: FREddies: DHT-based adaptive queryprocessing via FedeRated Eddies. Technical Report UCB/CSD-4-1339, University of Californa at Berkeley (2004)

44. Ioannidis, Y.: The history of histograms (abridged). In: Proceedingsof the International Conference on Very Large Data Bases (VLDB)(2003)

45. Ioannidis, Y., Christodoulakis, S.: Optimal histograms for limitingworst-case error propagation in the size of join results. ACM Trans.Database Syst. (TODS) 18(4), 709–748 (1993)

46. Ioannidis, Y., Poosala, V.: Balancing histogram optimality andpracticality for query result size estimation. In: Proceedings of theACM SIGMOD International Conference on Management of Data(SIGMOD) (1995)

47. Jain, R., Chiu, D., Hawe, W.: A quantitative measure of fairness anddiscrimination for resource allocation in shared computer systems.Technical Report TR-301, DEC (1984)

48. Jelasity, M., Montresor, A.: Epidemic-style proactive aggregationin large overlay networks. In: Proceedings of the IEEE Interna-tional Conference on Distributed Computing and Systems (ICDCS)(2004)

49. Jelasity, M., Montresor, A., Babaoglu, Ö.: Gossip-based aggre-gation in large dynamic networks. ACM Trans. Comput. Syst.(TOCS) 23(3), 219–252 (2005)

50. Jermaine, C.M., Arumugam, S., Pol, A., Dobra, A.: Scalableapproximate query processing with the DBO engine. In: Proceed-ings of the ACM SIGMOD International Conference on Manage-ment of Data (SIGMOD) (2007)

51. Kempe, D., Dobra, A., Gehrke, J.: Computing aggregate informa-tion using gossip. In: Proceedings of the Annual IEEE Symposiumon Foundations of Computer Science (FOCS) (2003)

52. Kossmann, A.: The state of the art in distributed query processing.ACM Comput. Surv. 32(4), 422–469 (2000)

53. Koudas, N. (ed.): Special issue on data quality. IEEE Data Eng.Bull. 29(2) (2006)

54. Krishnan, P.: Online prediction algorithms for databases and oper-ating systems. Ph.D. thesis, Brown University (1995)

55. Li, J., Loo, B.T., Hellerstein, J.M., Kaashoek, M.F., Karger, D.,Morris, R.: On the feasibility of peer-to-peer web indexing andsearch. In: Proceedings of the International Workshop on Peer-to-Peer Systems (IPTPS) (2003)

56. Madden, S., Franklin, M.J., Hellerstein, J.M., Hong, W.: TAG:A Tiny AGgregation service for ad-hoc sensor networks. In: Pro-ceedings of the USENIX Symposium on Operating Systems Designand Implementation (OSDI) (2002)

57. Manku, G.: Routing networks for Distributed Hash Tables. In:Proceedings of the ACM Symposium on Principles of DistributedComputing (PODC) (2003)

58. Markl, V., Haas, P.J., Kutsch, M., Megiddo, N., Srivastava, U.,Tran, T.M.: Consistent selectivity estimation via maximumentropy. VLDB J. 16(1), 55–76 (2007)

59. Matia, Y., Matias, Y.: Calibration and profile based synopses errorestimation and synopses reconciliation. In: Proceedings of theInternational Conference on Data Engineering (ICDE) (2007)

60. Maymouknov, P., Mazières, D.: Kademlia: A peer-to-peer infor-mation system based on the XOR metric. In: Proceedings of theInternational Workshop on Peer-to-Peer Systems (IPTPS) (2002)


61. Michel, S., Bender, M., Ntarmos, N., Triantafillou, P.,Weikum, G., Zimmer, C.: Discovering and exploiting keyword andattribute-value co-occurrences to improve P2P routing indices. In:Proceedings of the ACM Conference on Information and Knowl-edge Management (CIKM) (2006)

62. Michel, S., Triantafillou, P., Weikum, G.: KLEE: A framework fordistributed top-k query algorithms. In: Proceedings of the Interna-tional Conference on Very Large Databases (VLDB) (2005)

63. Montresor, A., Jelasity, M., Babaoglu, Ö.: Robust aggregation pro-tocols for large-scale overlay networks. In: Proceedings of the Inter-national Conference on Dependable Systems and Networks (DSN)(2004)

64. Montresor, A., Meling, H., Babaoglu, Ö.: Messor: Load-balanc-ing through a swarm of autonomous agents. In: Proceedings of theWorkshop on Agent and Peer-to-Peer Systems (2002)

65. Morris, R.: Counting large numbers of events in small registers.Commun. ACM 21(10), 840–842 (1978)

66. Munro, J.I., Paterson, M.S.: Selection and sorting with limited stor-age. Theor. Comput. Sci. 12(3), 315–323 (1980)

67. Muralikrishna, M., DeWitt, D.J.: Equi-depth histograms forestimating selectivity factors for multi-dimensional queries. In:Proceedings of the ACM SIGMOD International Conference onManagement of Data (SIGMOD) (1988)

68. Ntarmos, N., Triantafillou, P., Weikum, G.: Counting at large: Effi-cient cardinality estimation in internet-scale data networks. In:Proceedings of the International Conference on Data Engineering(ICDE) (2006)

69. Palmer, C.R., Siganos, G., Faloutsos, M., Faloutsos, C.,Gibbons, P.B.: The connectivity and fault-tolerance of the internettopology. In: Proceedings of the Workshop on Network-RelatedData Management (NRDM) (2001)

70. Pitoura, T., Triantafillou, P.: Load distribution fairness in p2p datamanagement systems. In: Proceedings of the International Confer-ence on Data Engineering (ICDE) (2007)

71. Poosala, V., Ganti, V., Ioannidis, Y.: Approximate query answeringusing histograms. IEEE Data Eng. Bull. 22(4), 5–14 (1999)

72. Poosala, V., Ioannidis, Y.: Estimation of query-result distributionand its application in parallel-join load balancing. In: Proceedingsof the International Conference on Very Large Data Bases (VLDB)(1996)

73. Poosala, V., Ioannidis, Y.: Selectivity estimation without the attri-bute value independence assumption. In: Proceedings of the Inter-national Conference on Very Large Data Bases (VLDB) (1997)

74. Poosala, V., Ioannidis, Y., Haas, P., Shekita, E.: Improvedhistograms for selectivity estimation of range predicates. In: Pro-ceedings of the ACM SIGMOD International Conference on Man-agement of Data (SIGMOD) (1996)

75. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.:A scalable content-addressable network. In: Proceedings of theACM Symposium on Communications Architectures and Proto-cols (SIGCOMM) (2001)

76. Reynolds, P., Vahdat, A.: Efficient peer-to-peer keyword searching.In: Proceedings of the IFIP/ACM IFIP/ACM International Confer-ence on Distributed Systems Platforms (Middleware), pp. 21–40(2003)

77. Rhea, S., Geels, D., Roscoe, T., Kubiatowicz, J.: Handling churnin a DHT. In: Proceedings of the USENIX Annual Technical Con-ference (USENIX) (2004)

78. Scott, D.W.: Average shifted histograms: effective nonparamet-ric density estimators in several dimensions. Ann. Stat. 13(3),1024–1040 (1985)

79. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie,R.A., Price, T.G.: Access path selection in a relationaldatabase management system. In: Proceedings of the ACMSIGMOD International Conference on Management of Data (SIG-MOD) (1979)

80. Shapiro, G.P., Connell, C.: Accurate estimation of the number oftuples satisfying a condition. In: Proceedings of the ACM SIG-MOD International Conference on Management of Data (SIG-MOD) (1984)

81. Shrivastava, N., Buragohain, C., Agrawal, D., Suri, S.: Mediansand beyond: New aggregation techniques for sensor networks. In:Proceedings of the ACM Conference on Embedded NetworkedSensor Systems (SenSys) (2004)

82. Steinmetz, R., Wehrle, K. (eds.): Peer-to-Peer Systems and Appli-cations. Springer, Berlin (2005)

83. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F.,Balakrishnan, H.: Chord: A scalable Peer-To-Peer lookupservice for internet applications. In: Proceedings of the ACMSymposium on Communications Architectures and Protocols(SIGCOMM) (2001)

84. Suciu, D. (ed.): Special issue on web-scale data, systems, andsemantics. IEEE Data Eng. Bull. 29(4) (2006)

85. Tao, Y., Kollios, G., Considine, J., Li, F., Papadias, D.: Spatio-temporal aggregation using sketches. In: Proceedings of the Inter-national Conference on Data Engineering (ICDE) (2004)

86. Thaper, N., Guha, S., Indyk, P., Koudas, N.: Dynamic multidimen-sional histograms. In: Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data (SIGMOD) (2002)

87. Triantafillou, P., Economides, A.: Subscription summarization:A new paradigm for efficient publish/subscribe systems. In: Pro-ceedings of the IEEE International Conference on DistributedComputing and Systems (ICDCS) (2004)

88. van Renesse, R., Birman, K.P., Vogels, W.: Astrolabe: a robustand scalable technology for distributed system monitoring, man-agement, and data mining. ACM Trans. Comput. Syst.(TOCS) 21(2), 164–206 (2003)

89. Yalagandula, P., Dahlin, M.: A scalable distributed informationmanagement system. In: Proceedings of the ACM Symposium onCommunications Architectures and Protocols (SIGCOMM) (2004)
