In network aggregation techniques for wireless sensor networks - a survey

In-network Aggregation Techniques for WirelessSensor Networks: A Survey

Elena Fasolo†, Michele Rossi†, Jorg Widmer? and Michele Zorzi††DEI, University of Padova, via Gradenigo 6/B – 35131, Padova, Italy

?DoCoMo Euro-Labs, Landsberger Strasse 312 – 80687 Munich, Germany

Abstract— In this paper, we provide a comprehensive reviewof the existing literature on techniques and protocols for in-network aggregation in wireless sensor networks. We first definesuitable criteria to classify existing solutions, and then describethem by separately addressing the different layers of the protocolstack while highlighting the role of a cross-layer design approach,which is likely to be needed for optimal performance. Throughoutthe paper we identify and discuss open issues, and proposedirections for future research in the area.

I. I NTRODUCTION

Recent advances in technology make it feasible to massproduce small sensor nodes with sensing, computation, andcommunication capabilities. This has spurred a substantialamount of research on wireless sensor networks over the pastfew years. For ease of deployment, sensor devices should beinexpensive, small, and have a long lifetime, which makesit important to develop very efficient software and hardwaresolutions. For this reason, protocols for sensor networks shouldbe carefully designed so as to make the most efficient use ofthe limited resources in terms of energy, computation, andstorage. These restrictions are likely to remain, since in manycases it is desirable to exploit technological improvementsto develop smaller and more energy efficient devices ratherthan making them more powerful. Typical applications envi-sioned for sensor networks (e.g., environmental monitoring,surveillance, tracking, etc.), along with the already mentionedresource-constrained character of sensor nodes, usually resultin very different network requirements and communicationspatterns compared to other types of ad hoc network scenarios.The area of communications and protocol design for sensornetworks has been widely researched in the past few years,and many solutions have been proposed and compared.

In this survey paper we focus instead on another importantaspect of sensor networks, namely in-network aggregation anddata management. These techniques allow to trade off commu-nication for computational complexity. Given the applicationarea, network resource constraints, and the fact that localcomputation often consumes significantly less energy thancommunication, in-network data aggregation and managementare at the very heart of sensor network research. In particular,resource efficiency, timely delivery of data to the sink node,and accuracy or granularity of the results are conflicting goalsand the optimal tradeoff among them largely depends on thespecific application.

Initially, in-network aggregation techniques involved differ-ent ways to route packets in order to combine data comingfrom different sources but directed towards the same desti-nation(s). In other words, these protocols were simply routing

algorithms which differed from more traditional ad hoc routingprotocols in the metric they used to select the routing paths.More recently, many additional studies have been published,addressing not only the routing problem but also mechanismsto represent and combine data more efficiently. In-networkdata aggregation is a complex problem that involves manylayers of the protocol stack and different aspects of protocoldesign, and a characterization and classification of conceptsand algorithms is still lacking in the literature.

The aim of the present paper is to provide a taxonomy ofin-network aggregation by defining the main concepts andcovering the most important and recent work in the field.Our major contributions are, on the one hand, to definecriteria to classify existing solutions and, on the other hand, toidentify and propose directions for future research in this area.Compared with well-researched topics in sensor networks,such as for example MAC and routing protocol design, dataaggregation does not seem to have received as much attention,and we think that it provides many interesting opportunitiesfor relevant contributions. The goal of this paper is to helppeople to get an updated view of this area and to provide amotivation and a starting point for researchers and studentswho are interested in these issues.

The paper is organized as follows. In Section II we definethe in-network aggregation paradigm, by identifying the mainproblems involved and giving some criteria to classify existingalgorithms. In Section III we discuss theoretical performancelimits of in-network aggregation techniques. In Section IV weintroduce some protocol issues in the presence of in-networkprocessing, classify the most recent solutions, and discusstheir advantages and disadvantages. In Section V we focus onpossible techniques to combine data by means of aggregationfunctions, highlight how these interact with routing protocols,and discuss the benefits arising from a cross-layer design(routing and aggregation). Finally, in Section VI we summarizethe in-network aggregation approaches discussed throughoutthe paper, and give directions and motivations for futureresearch.

II. BASICS OFIN-NETWORK AGGREGATION

In typical sensor network scenarios, data is collected bysensor nodes throughout some area, and needs to be madeavailable at some central sink node(s), where it is processed,analyzed, and used by the application. In many cases, datagenerated by different sensors can be jointly processed whilebeing forwarded towards the sink, e.g., by fusing togethersensor readings related to the same event or physical quantity,or by locally processing raw data before this is transmitted.

In-network aggregation deals with this distributed processingof data within the network. Data aggregation techniques aretightly coupled with how data is gathered at the sensor nodesas well as how packets are routed through the network,and have a significant impact on energy consumption andoverall network efficiency (e.g., by reducing the number oftransmissions or the length of the packets to be transmitted).In-network data aggregation can be considered a relativelycomplex functionality, since the aggregation algorithms shouldbe distributed in the network and therefore require coordina-tion among nodes to achieve better performance. Also, we em-phasize that data size reduction through in-network processingshall not hide statistical information about the monitored event.For instance, when multiple sensors collaborate in observingthe same event, the number of nodes reporting it and thetimings of the reports may reveal the event’s size and/ordynamics, respectively.

We define the in-network aggregation process as follows:

In-network aggregation is the global process ofgathering and routing information through a multi-hop network, processing data at intermediate nodeswith the objective of reducing resource consumption(in particular energy), thereby increasing networklifetime.

We can distinguish between two approaches:

• In-network aggregation with size reductionrefers to theprocess of combining and compressing data coming fromdifferent sources in order to reduce the information tobe sent over the network. As an example, assume thata node receives two packets from two different sourcescontaining the locally measured temperatures. Instead offorwarding the two packets, the sensor may compute theaverage of the two readings and send it in a single packet.

• In-network aggregation without size reductionrefers tothe process of merging packets coming from differentsources into the same packet without data processing:assume to receive two packets carrying different physicalquantities, e.g., temperature and humidity. These twovalues cannot be processed together but they can still betransmitted in a single packet, thereby reducing overhead.

The first approach is better able to reduce the amount of datato be sent over the network but it may also reduce the accuracywith which the gathered information can be recovered at thesink. After the aggregation operation, it is usually not possibleto perfectly reconstruct all of the original data.1 The secondapproach, instead, preserves the original information (i.e., atthe sink, the original data can be perfectly reconstructed).Which solution to use depends on many factors including thetype of application, the data rate, the network characteristics,and so on. Both of the above strategies may involve thetreatment of data at different network layers.

In-network aggregation techniques require three basic in-gredients: suitablenetworking protocols, effectiveaggregationfunctions and efficient ways ofrepresenting the data(see

1This actually depends on the type of aggregation function in use, i.e., lossyor lossless.

Fig. 1). In the remainder of this section we briefly introduceeach of these aspects.

Routing Protocols [1]–[9]: The most important ingredient forin-network aggregation is a well designed routing protocol.Data aggregation requires a different forwarding paradigmcompared to classic routing. Classic routing protocols typicallyforward data along the shortest path to the destination (with re-spect to some specified metric). If, however, we are interestedin aggregating data to minimize energy expenditure, nodesshould route packets based on the packet content and choosethe next hop in order to promote in-network aggregation. Thistype of data forwarding is often referred to asdata centricrouting. According to the data centric paradigm, as a nodesearches for the relay nodes, it needs to use metrics whichtake into account the positions of the most suitable aggregationpoints, the data type, the priority of the information, etc.Altogether, the application scenario, routing scheme, and dataaggregation mechanism are closely interrelated.

Moreover, in-network aggregation techniques may requiresome form of synchronization among nodes. In particular, thebest strategy at a given node is not always to send data as soonas it is available. Waiting for information from neighboringnodes may lead to better data aggregation opportunities and,in turn, improved performance. Timing strategies are requiredespecially in the case of monitoring applications where sensornodes need to periodically report their readings to the sink.These strategies usually involve data gathering trees rooted atthe sink. The main timing strategies proposed so far in theliterature are summarized below [10]:• Periodic simple aggregationrequires each node to wait

for a pre-defined period of time, to aggregate all dataitems received, and then to send out a packet with theresult of the aggregation.

• Periodic per-hop aggregationis quite similar to theprevious approach, the only difference being that theaggregated data is transmitted as soon as the node hearsfrom all of its children. This requires that each nodeknows the number of its children. In addition, a timeoutis used in case some children’s packets are lost.

• Periodic per-hop adjusted aggregationadjusts the timeoutof a node, after which it sends the aggregated data,depending on the node’s position in the gathering tree.

Note that the choice of the timing strategy strongly affectsthe design of the routing protocol [10]–[12].

Aggregation Functions [8], [13]–[20]: One of the mostimportant functionalities that in-network aggregation tech-niques should provide is the ability to combine data comingfrom different nodes. There are several types of aggregationfunctions and most of them are closely related to the specificsensor application. Nevertheless, we can identify some com-mon paradigms for their classification:• Lossy and lossless:Aggregation functions can compress

and merge data according to either a lossy or a losslessapproach. In the first case the original values can notbe recovered after having merged them by means ofthe aggregation function. In addition, we may lose inprecision with respect to transmitting all readings un-

compressed. In contrast, the second approach (lossless)allows to compress the data by preserving the originalinformation. This means that all readings can be perfectlyreconstructed from their aggregate at the receiver side.

• Duplicate sensitive and duplicate insensitive:An inter-mediate node may receive multiple copies of the sameinformation. In this case, it may happen that the samedata is considered multiple times when the information isaggregated. If the aggregation function in use is duplicatesensitive, the final result depends on the number oftimes the same value has been considered. Otherwise, theaggregation function is said to be duplicate insensitive.For instance, a function that takes the average is duplicatesensitive, whereas a function that takes the minimumvalue is duplicate insensitive.

Good aggregation functions for wireless sensor networks needto meet additional requirements. In particular, they should takeinto account the very limited processing and energy capabili-ties of sensor devices, and should therefore be implementableby means of elementary operations. Also, different devicesmay be suitable for different types of operations, dependingon their energy resources and computation capabilities. Thesefacts need to be considered in the design of aggregationfunctions and routing protocols.

Data representation [21]–[24]: Due to its limited storagecapabilities, a node may not be able to store all the re-ceived/generated information in its internal buffer. It there-fore needs to decide whether to store, discard, compress,or transmit the data. All these operations require a suitableway to represent the information. The corresponding datastructure may vary according to the application requirements.Finally, even though the data structure is usually common toall nodes, it should be adaptable to node-specific or locationspecific characteristics. A recent and promising method to dealwith data representation and compression is distributed sourcecoding techniques, that compress data on the basis of someknowledge about its correlation. More details on the approachare given in section V-F.

Although we described routing, aggregation and data rep-resentation in isolation, they are intimately related and shouldbe designed and implemented jointly for optimal performance.Most of the related work in the literature covers only partialaspects of the joint optimization of these functionalities, andoften neglects or oversimplifies some of the others. Furtherwork on cross-layer optimization for in-network aggregationshould therefore be appreciated as innovative and is verymuch needed. In the sequel, we thoroughly review each ofthe aforementioned functionalities. We start with a review ofrecent work on the theoretical limits of aggregation techniquesin the next section.

III. T HEORETICAL L IMITS OF IN-NETWORK

AGGREGATIONTECHNIQUES

Several theoretical studies provide limits and bounds on theperformance of in-network data aggregation techniques andthus assist in the design of suitable algorithms. The efficiencyof these algorithms depends on the correlation among the

data generated by different information sources (sensor units).Such a correlation can be spatial, when the values generatedby close-by sensors are related, temporal, when the sensorreadings change slowly over time, or semantic, when thecontents of different data packets can be classified under thesame semantic group (e.g., the data is generated by sensorsplaced in the same room). The gains of in-network dataaggregation can be best demonstrated in the extreme case whendata generated by different sources can be combined into asingle packet (e.g., when the sources generate identical data).If there areK sources all close to each other and far awayfrom the sink, the combination of their data into a single packetleads, on average, to aK-fold reduction in transmissions withrespect to the case where all data are sent separately. Generally,the optimal joint routing and compression structure is a Steinertree, which is known to be NP hard [25]. However, there existpolynomial solutions for special cases where the informationsources are close to each other [26]. The authors in [27]propose a model to describe the spatial correlation in termsof joint entropy. They analyze a symmetric line network withdifferent degrees of correlation among neighboring nodes. Forthe uncorrelated case, the authors show that the best routingstrategy is to forward packets along shortest paths. In contrast,in case of completely correlated information, the best strategyis to aggregate data as soon as possible. After that, a singlepacket (formed by the aggregated data) is sent to the sink alongthe shortest path. In all the intermediate cases, clustering-basedsolutions may be the optimal choice, although no formal proofis given in the paper.

In [28] the authors study the impact of data correlation onthe energy expenditure of data distribution protocols. Theyfocus on various energy aware data aggregation trees underdifferent network conditions, such as node density, sourcedensity, source distribution, and data aggregation degree. Thefindings of the study are in agreement with the results in [27]but in addition provide more quantitative results. In particular,the authors focus on tree structures and compare the MinimumSteiner Tree (MST) with the Shortest Path Tree (SPT). TheMST is found to be the optimal aggregation tree structure.Although the SPT guarantees low delays and can be builtin an online fashion, its performance in terms of aggregationeffectiveness is largely inferior to that of the MST.

In addition, in [28] opportunistic aggregation is comparedto systematic aggregation in terms ofcost ratio which isthe cost of thecorrelation unaware(SPT) tree over that ofthe correlation aware(MST) tree considering the same setof sources and sinks. The authors prove, using an analyticalmodel, that the expected cost improvement of MST over SPTin sensor networks increases asO(

√log N), whereN is the

number of nodes in the network. This result makes SPT aviable solution for many practical cases (small networks).Based on this study, the authors propose a hybrid tree structurecalled SCT (Semantic/Spatial Correlation Tree) [29]. SCT isbased on the identification of an aggregation backbone whichis used to generate efficient aggregation trees, regardless ofsources distribution and density. The aim is to efficiently buildand maintain a network structure for data aggregation. To thisend, the authors of [29] propose a ring-sector subdivision of

the network. A subset of nodes is elected as aggregation nodesand they are organized in a spanning tree to form the dataaggregation backbone. Each node belonging to the backboneaggregates messages coming from a certain sub-area.

A further tree-based aggregation algorithm that exploits datacorrelation is presented in [30]. It is based on shallow lighttrees (SLT) that unify the properties of MST and SPT. In anSLT, the total cost of the tree is only a constant factor largerthan that of the MST, while the distances (delays) betweenany node and the sink are only a constant factor larger thanthe shortest paths. In [31], the authors analyze aggregationproperties of a tree structure that is based on an SPT of nodesclose to the sink node, while nodes that are further away areconnected to the leaves of the SPT via paths found by anapproximation algorithm for the traveling salesman problem.Simulations show that these trees outperform SLTs in manyscenarios.

IV. N ETWORKING PROTOCOLS AND HIERARCHIES FOR

IN-NETWORK AGGREGATION

Most of the work done so far on in-network aggregationdeals with the problem of forwarding packets in order to fa-cilitate the in-network aggregation of the information therein.Initially, the main ideas were to enhance existing routingalgorithms in such a way to make data aggregation possible. Tothis respect, many studies proposed solutions exploitingtree-based (or hierarchical) structures. These consist of routingalgorithms based on a tree rooted at the sink. Trees areusually SPTs but some approaches exist which consider morecomplex tree constructions. The tree based approaches arereferred to in this paper asclassical approaches. Sometimesthe tree structure can be optimized to the type of data to begathered. Also, the nodes can be locally grouped into clustersfor improved efficiency. Recently, a few notable exceptionslooked at the problem from a different angle. These papersaddress the weaknesses of the tree-based approach by focusingon multi-path routing. Finally, some very recent schemesimplement a mixture of tree-based and multi-path solutions.These are referred to here ashybrid approachesto emphasizethe adaptive nature of their routing algorithms.

In the following, we focus on each class of routing protocolsseparately (tree-based, cluster-based, multi-path and hybrid)by reviewing the main concepts and briefly commenting thepros and cons of each scheme. As seen from the numberof schemes discussed in each subsection, many solutions areproposed in the tree-based and cluster-based categories. Onthe other hand, very few studies use the multi-path and hybridapproaches. This leaves room for further work in this area.

A. Tree-based Approaches

Classic routing strategies [32], [33] are usually based ona hierarchical organization of the nodes in the network. Infact, the simplest way to aggregate data flowing from thesources to the sink is to elect some special nodes which workas aggregation points and define a preferred direction to befollowed when forwarding data.

In addition, a node may be marked as special depending onmany factors such as its position within the data gathering

tree [34], its resources [35], the type of data stored in itsqueue [36], [37], or the processing cost due to aggregationprocedures [38]. According to thetree-based approach[1],[3], [6] a spanning tree rooted at the sink is constructedfirst. Subsequently, such a structure is exploited in answeringqueries generated by the sink. This is done by performing in-network aggregation along theaggregation treeby proceedinglevel by level from its leaves to its root. Thus, as two or moremessages get to a given node, their aggregate can be computedexactly. However, this way of operating has some drawbacksas actual wireless sensor networks are not free from failures.More precisely, when a packet is lost at a given level of thetree, e.g., due to channel impairments, the data coming fromthe related subtree are lost as well. In fact, a single message ata given level of the tree may aggregate the data coming fromthe whole related subtree. In spite of the potentially high costof maintaining a hierarchical structure in dynamic networksand the scarce robustness of the system in case of link/devicefailures, these approaches are particularly suitable to designoptimal aggregation functions and perform efficient energymanagement. In fact, there are some studies where the sinkorganizes routing paths to evenly and optimally distribute theenergy consumption while favoring the aggregation of data atthe intermediate nodes [36], [39], [40]. In [39] the authorscompute aggregation topologies by taking into account theresidual energy of each node through linear programming.Further algorithms can be found in [34], [35], [41], [42].In [41] the authors investigate which nodes in the network canbe exploited as aggregation points for optimal performance.In [34], [42] the focus is on the nodes that should be entrustedwith the transmission of the sensed values, whereas in [35] theemphasis is put on the proper scheduling of sleeping/activeperiods. Often, optimal paths are calculated in a centralizedmanner at the sink by exploiting different assumptions onthe data correlation and selecting the best aggregation pointsby means of cost functions [43]. Recently, also tree-basedschemes for real time or time-constrained applications havebeen proposed [44]–[46].

Finally, a last approach based on aggregation trees relieson the construction ofconnected dominating sets[47]. Thesesets consist of a small subset of nodes which form a connectedbackbone and whose positions are such that they can collectdata from any point in the network. Nodes that do not belongto these sets are allowed to sleep when they do not have datato send. Some rotation of the nodes in the dominating set isrecommended for energy balancing.

In the following paragraphs, we review the main routingapproaches based on aggregation trees.

TAG [5] - The Tiny AGgregation(TAG) approach is adata centric protocol. It is based on aggregation trees and isspecifically designed formonitoring applications. This meansthat all nodes should produce relevant information periodically.Therefore, it is possible to classify TAG as aperiodic per-hop adjustedaggregation approach. The implementation ofthe core TAG algorithm consists of two main phases: 1)the distribution phase, where queries are disseminated to thesensors, and 2) thecollection phase, where the aggregated

sensor readings are routed up the aggregation tree.For thedistribution phase, TAG uses a tree based routing

scheme rooted at the sink node. The sink broadcasts a messageasking nodes to organize into a routing tree and then sends itsqueries. In each message there is a field specifying the level,or distance from the root, of the sending node (the level of theroot is equal to zero). Whenever a node receives a messageand it does not yet belong to any level, it sets its own level tobe the level of the message plus one. It also elects the nodefrom which it receives the message as its parent. The parent isthe node that is used to route messages towards the sink. Eachsensor then rebroadcasts the received message adding its ownidentifier (ID) and level. This process continues until all nodeshave been assigned an ID and a parent. The routing messagesare periodically broadcast by the sink in order to keep the treestructure updated. After the construction of the tree, the queriesare sent along the structure to all nodes in the network. TAGadopts the selection and aggregation facilities of the databasequery languages (SQL). Accordingly, TAG queries have thefollowing form:

SELECT{agg(expr), attrs} from SENSORWHERE{selPreds}GROUP BY{attrs}HAVING{havingPreds}EPOCH DURATION i

In practice, the sink sends a query, where it specifies thequantities that it wants to collect (attrs field), how these mustbe aggregated (agg(expr)) and the sensors that should beinvolved in the data retrieval. This last request is specifiedthrough the WHERE, GROUP and HAVING clauses [5]. Fi-nally, an EPOCH duration field specifies the time (in seconds)each device should wait before sending new sensor readings.This means the readings used to compute an aggregate recordall belong to the same time interval, or epoch.

During the datacollection phase, due to the tree structure,each parent has to wait for data from all of its children beforeit can send its aggregate up the tree. Epochs are divided intoshorter intervals called communication slots. The number ofthese slots equals the maximum depth of the routing tree. Theslot mechanism gives a nice benefit. As the time is slotted,sensor nodes can be put to sleep until the next scheduledtransmission interval. In practice, a node goes back to sleepsoon after it has finished sending its readings to its parent.Data aggregation is performed by all intermediate nodes.However, in order not to limit TAG to the few and verysimple aggregation functions defined by the SQL language(such as COUNT, MIN, MAX, SUM, and AVERAGE) amore general classification is accounted for by partitioningaggregates according to theDuplicate Sensitivity, Exemplaryand SummaryandMonotonicproperties [5].

As for most tree-based schemes, TAG may be inefficient incase of dynamic topologies or link/device failures: as discussedabove, trees are particularly sensitive to failures at intermediatenodes as the related subtree may become disconnected. Inaddition, as the topology changes, TAG has to re-organize thetree structure and this means high costs in terms of energyconsumption and overhead.

Directed Diffusion [1] - Directed Diffusionis a reactive data

centric protocol. The routing scheme is specifically tailoredfor those applications where one or few sinks ask somespecific information by flooding the network with their queries.Directed Diffusion is organized in three phases (see Fig. 2,originally shown in [1]): 1)interest dissemination, 2) gradientsetup and 3) data forwarding along the reinforced paths(path reinforcement and forwarding). When a certain sink isinterested in collecting data from the nodes in the network,it propagates aninterest message (interest dissemination),describing the type of data the node is interested in, andsetting a suitable operational mode for its collection. Eachnode, on receiving the interest, re-broadcasts it to its neighbors.In addition, the node sets upinterest gradients, i.e., vectorscontaining the next hop that has to be used to propagate theresult of the query back to the sink node (gradient setup).As an illustrative example (see Fig. 2), if the Sink sends aninterest which reaches nodesa and b, and both forward theinterest to nodec, then nodec sets up two vectors indicatingthat the data matching that interest should be sent back toa and/or b. The strength of such a gradient can be adapted,which may result in a different amount of information beingredirected to each neighbor. To this end, various metrics suchas the node’s energy level, its communication capability andits position within the network can be used. Each gradient isrelated to the attribute it has been set up for. As the gradientsetup phase for a certain interest is complete, only a singlepath for each source isreinforced and used to route packetstowards the sink (path reinforcement and forwarding).

Data aggregation is performed when data is forwarded tothe sink by means of proper methods, which can be selectedaccording to application requirements. The data gathering tree(i.e., reinforced paths) must be periodically refreshed by thesink and this can be expensive in case of dynamic topologies.A tradeoff, depending on the network dynamics, is involvedbetween the frequency of the gradient setup (i.e., energyexpenditure) and the achieved performance. A valuable featureof Directed Diffusion consists of thelocal interactionamongnodes in setting up gradients and reinforcing paths. This allowsfor increased efficiency as there is no need to spread thecomplete network topology to all nodes in the network.

We observe that attention is to be paid to MAC Layerdesign. Consider as an example the IEEE802.11 wirelesstechnology. As said above, queries are propagated by meansof broadcasts (basic access in IEEE802.11). However, data issent back to the sink via unicast transmissions. This meansthat when either the node density increases or the duplicatesuppression rule is not used, due to MAC collisions andsubsequent backoffs, the delay may become excessively large.Hence, the local traffic should be kept at an acceptably lowlevel in order to avoid collisions. Several approaches [36], [48],[49] have been proposed to reduce the control traffic generatedby the local interactions among nodes with Directed Diffusion.In these solutions, the authors use properly defined aggregationtrees with the main purpose of reducing both traffic and delay.In [48] a modified version of Directed Diffusion,EnhancedDirected Diffusion(EDD), is proposed. This protocol jointlyexploits Directed Diffusion to collect data and a cluster-basedarchitecture to increase the efficiency of the local interactions

(decreasing local traffic and related collisions). A similarapproach is investigated in [50].

PEGASIS [3] - The key idea inPower-Efficient GAtheringin Sensor Information Systems(PEGASIS) is to organize thesensor nodes in a chain. Moreover, nodes take turns to act asthe chain leader, where at every instant the chain leader is theonly node allowed to transmit data directly to the sink. In thisway, it is possible to evenly distribute the energy expenditureamong the nodes in the network. The chain can be built eitherin a centralized (by the sink) or distributed manner (by usinga greedy algorithm at each node). In both cases, however, theconstruction of the chain requires global knowledge of thenetwork at all nodes. The chain building process starts with thenode furthest from the sink. Then the closest neighbor to thisnode is chosen as the next one in the chain and so on. Nodestake turns to act as leader according to the following rule:nodei is elected as the leader in roundi. If there areN nodesin the network, rounds cyclically take values in{1, 2, . . . , N}according to a TDMA schedule. As a consequence, the leaderis not always the same but, during each transmission round, itis at a different position in the chain. Note that in this schemea direct communication channel from each sensor to the sinkis required.

In PEGASIS, each node receives data from a neighbor andaggregates it with its own reading by generating a singlepacket of the same length. Subsequently, such an aggregateis transmitted to the next node in the chain until the packetreaches the current chain leader. At this point the leaderincludes its own data into the packet and sends it to the sink.A possible drawback of the scheme comes from the distanceamong neighbors. In fact, when the neighbors along the chainare too distant the energy expenditure can be very high.In addition, transmission energies are not evenly distributedbut depend on the actual distances between the nodes andtheir neighbors, i.e., nodes with distant neighbors dissipatemore energy. PEGASIS can therefore be enhanced by notallowing such nodes to become leaders, for example using athreshold-based leader election policy. The main disadvantagesof PEGASIS are the necessity of having a complete view of thenetwork topology at each node for a proper chain constructionand that all nodes must be able to transmit directly to the sink.This makes the scheme unsuitable for those networks with atime varying topology. In addition, also link failures and packetlosses may affect the performance of this protocol. In fact, thefailure of any intermediate node compromises the delivery ofall data aggregated and sent by the previous nodes in the chain.Hence, some improvements to the scheme may be needed inorder to increase its robustness.

DB-MAC [7] - A different approach to route packets byperforming data aggregation is presented in [7], where therouting and the MAC protocols are jointly designed. Theprimary objective of theDelay Bounded Medium AccessControl (DB-MAC) [7] scheme is to minimize the latencyfor delay bounded applications while taking advantage ofdata aggregation mechanisms for increased energy efficiency.DB-MAC adopts a CSMA/CA contention scheme based onan RTS/CTS/DATA/ACK handshake. The protocol is most

suitable for those cases where different sources sense an eventalmost at the same time and, due to the delay constraints, haveto send their measurements right away to the sink. In suchcases, the generated data flows can be dynamically aggregatedwhile routing them towards the sink. This gives rise to an ag-gregation tree, which is built on the fly and without having anyknowledge about the network topology. The MAC protocol isvery similar to the IEEE802.11 RTS/CTS Access [51] withsome minor modifications: RTS/CTS messages are exploitedto perform data aggregation and backoff intervals are com-puted by taking into account the priorities assigned to differenttransmissions. In particular, each node takes advantage of thetransmissions from other nodes by overhearing CTSs in orderto facilitate data aggregation. This leads to choosing the relaynode among those nodes that already have some packets totransmit in their queue. This is implemented to promote dataaggregation with the information stored along the path.

As an example, refer to the scenario in Fig. 3. We havetwo nodesS1 and S2 which want to transmit their packetsto the sink using one of their neighbors (R1 and R2 in thefigure) as the relay. At the beginning of the contention, a nodetransmits a newly generated packet by setting its priority to themaximum value. The packet priority is subsequently decreasedat each traversed node. BecausePr(S1) > Pr(S2), S1 wins thecontention for the medium and sends its packet toR1 whichdecreases its priority by one unit. After this,Pr(S2) becomesequal to the priority of the packet just transmitted, which isnow stored at nodeR1. If S2 is placed in the coverage areaof both S1 and R1, it can overhear all messages exchangedbetween these two nodes (remember that the packet atS2 stillhas to be forwarded). If this is the case,S2 may now want tosend its packet toR1 instead ofR2 as it knows thatR1 alreadyhas one packet in its queue (the packet previously transmittedby S1). This facilitates in-network aggregation. DB-MACgives an example of how routing and data aggregation mayinfluence each other, and shows that, in most cases, energyefficient solutions are achieved only through a cross-layerdesign. The advantage of this strategy is the flexible anddistributed procedure for the construction of aggregation trees,which appears to be suitable for wireless networks with adynamic topology.

Further Algorithms - Regarding the tree-based approaches,many additional solutions have been proposed to solve theproblem of efficiently constructing aggregation trees. Theauthors in [36] define an efficient, distributed and energyaware heuristics (EADAT) to build the aggregation tree. A nicefeature of such an approach is that the tree construction processonly relies on a local knowledge of the network topology.Hence, the costs incurred in updating the tree in response tonode mobility, device failures and duty cycles may be limited.In addition, to further increase the energy savings, the schemein [36] uses an aggregation tree rooted at the sink where allnon-leaf sensors perform data aggregation while leaf nodescan turn off their radios in order to save energy. In [52],the problem of constructing the optimal aggregation tree istreated from a game theoretic perspective. The authors developa framework including payoff functions that take into account

the path reliability, the path length and the energy constraintsof the nodes. They finally propose and evaluate a couple ofheuristics to implement opportunistic in-network aggregationstrategies. [53] presents a solution for the mobile sink case.The authors define a protocol to maintain the aggregation treein the presence of mobile sinks. In their solution, they rely ontrusted nodes which work as gateways between the networkand the sinks. A further contribution can be found in [54],where the authors combine a tree based scheme with datacompression based on polynomial regression.

An additional problem related to the aggregation tree isaddressed in [37], where the authors present a set of algo-rithms to minimize the overall energy consumption of thesensor nodes in the presence of latency constraints underthe assumption of perfect knowledge of the aggregation tree.The problem is solved by devising appropriate schedulingstrategies for each node. This contribution is particularlyimportant for applications requiring a prompt delivery of theinformation to the sink. A major drawback, however, is that theproblem of constructing the aggregation tree is not addressed.Finally, in [55] theSecure Data Aggregation Protocol(SDAP)scheme is presented. This algorithm addresses the problemof delivering data over aggregation trees in a secure manner.Further data aggregation schemes for secure communicationsare collected in [56].

B. Cluster-based Approaches

Similarly to tree-based algorithms, cluster-basedschemes[2], [4], [48], [57] also consist of a hierarchicalorganization of the network. However, here nodes aresubdivided into clusters. Moreover, special nodes, referred toascluster-heads, are elected in order to aggregate data locallyand transmit the result of such an aggregation to the sink.The advantages and disadvantages of cluster-based schemesare very similar to those of tree-based approaches.

LEACH [2] - Low-Energy Adaptive Clustering Hierarchy(LEACH) is a self-organizing and adaptive clustering protocolusing randomization to evenly distribute the energy expendi-ture among the sensors. Clustered structures are exploited toperform data aggregation where cluster-heads act as aggre-gation points. The protocol works in rounds and defines twomain phases: 1) asetup phaseto organize the clusters and 2)a steady-state phasewhich deals with the actual data transfersto the sink node.

In the first phase the nodes organize themselves into clus-ters. Within each cluster a node is elected as thecluster-head.At the beginning of the setup phase, each sensor elects itself tobe the local cluster-head for the current round. This decisionis made according to a distributed probabilistic approach. Theaim is to have, on average, a percentageP of the nodesacting as cluster-heads, whereP has to be optimally chosenaccording to the node density. In practice, sensors calculatethe following threshold:

T (n) =

P

1− P (R mod (1/P ))if n ∈ G

0 otherwise,(1)

whereP is the desired percentage of cluster-heads,R is theround number andG is the set of nodes that have not beencluster-heads during the last1/P rounds. A given nodenpicks a random number[0, 1] and decides to be a cluster-head if this number is lower thanT (n). A cluster-headsends advertisements to its neighbors using a CSMA MAC.Surrounding nodes decide which cluster to join based on thesignal strength of these messages. Finally, based on the numberof nodes that are willing to be part of the cluster, each cluster-head creates a TDMA schedule to optimally manage the localtransmissions.

The actual data transmission starts in the second phase of theprotocol. All source nodes (S in Fig. 4) send their data to theircluster-heads according to the established schedule. The useof a TDMA protocol in the data collection phase ensures thatthere are no collisions within the clusters, saving both energyand time. After cluster-heads (CH in Fig. 4) have received allthe data from the active sources, they send them back to thesink using a single direct transmission (dotted lines in Fig. 4).If the sink is placed far away from a cluster-head, a high powermay be necessary to successfully deliver the message. Also, adoze mode is implemented to further save energy. When dozemode is used, the nodes’ radios may be switched off until theirscheduled TDMA transmission slot. Note that cluster-headscannot switch their radio off as they have to receive packetsfrom potentially all nodes in the cluster. LEACH is completelydistributed in the sense that neither control messages from thesink nor the distribution of global information to the sensornodes are required for correct operation. Moreover, LEACHoutperforms classical clustering algorithms by accounting foradaptive clusters and rotating cluster-heads.

The LEACH framework also offers the opportunity toimplement any aggregation function at the cluster-heads.However, several problems may arise in highly dynamicenvironments. In this case continuous updates are neededin order to keep the clusters consistent with the underlyingtopology. This requires to send many control messageswhich, in turn, may substantially impact the performance. Inaddition, in case of mobility additional problems may arise.A node close to a cluster-head at a given instant in timemay move away from the cluster-head. As a consequence,the node needs to increase its power, thereby spending muchmore energy to transmit to the cluster-head than expected. Arecent improvement of LEACH has been presented in [58]where further energy savings are achieved by the introductionof meta-data negotiation.

COUGAR [4], [59] - Cougar is most suitable for moni-toring applications, where nodes produce relevant informationperiodically. The protocol can be classified as aperiodic per-hop aggregation approach. Cougar is basically a clusteringscheme. As soon as the cluster-heads receive all data from thenodes in their clusters, they send their partial aggregates to agateway node. Of course, being similar to LEACH, Cougaris also affected by the same problems in highly dynamicenvironments.

Noticeably, Cougar differs from the previous clusteringbased algorithms in the way cluster-heads are elected. Unlike

in LEACH, where each node picks its cluster-head basedon signal strength measurements, in Cougar the cluster-headselection may be driven by additional metrics. In fact, a nodecould be more than one hop away from its cluster-head.For this reason, the routing algorithm adopted to exchangepackets within clusters is based on the AODV (Ad hocOn demand Distance Vector) technique. As AODV does notgenerate duplicate data packets, Cougar is particularly suitableto perform in-network aggregation with duplicate sensitiveaggregators. The core Cougar algorithm consists of the nodesynchronization engine which ensures that data is aggregatedcorrectly. Each cluster-head has a waiting list containing allnodes it expects a message from. The list is updated everytime the node receives a record from a node in its cluster.The cluster-head does not report its reading to the gatewayuntil, at timetsend, it hears from all nodes in its waiting list.A prediction mechanism is also implemented at each cluster-head in order to infer the instanttsend. In addition, a childnode can determine whether its cluster-head is waiting fora packet from it and can use anotification packetto refinethe prediction at the cluster-head. Timeouts and backoffs areimplemented to deal with wrong predictions.

In [4], the authors define three different data aggregationfeatures:Direct delivery, where data aggregation is performedat the cluster-heads only,Packet mergingwhich consists ofaggregation of packets without size reduction andPartialaggregation, where data aggregation is implemented at theintermediate nodes.

Further Algorithms - Many additional studies exploitinga hierarchical organization of the nodes have been proposedin the literature. Some of them are improvements of ex-isting protocols. In [60], for instance, the authors proposeenhancements to the LEACH and PEGASIS schemes. Forthe performance evaluation, the authors propose the new DataAggregation Quality (DAQ) metric, which is defined as theratio between the size of the aggregated data and its jointentropy. DAQ is an interesting performance measure as it takesinto account both the effectiveness in reducing the size of thedata to be transmitted and the quality of the information. Afurther improvement to LEACH is presented in [61], wherea code is added to the data transmission to enhance theintra-cluster communication security. A similar approach isproposed in [57] where the cluster-based scheme is enhancedby a secure transmission protocol called SecureDAV.

[27] presents a location-based clustering scheme wherethe sensors self-organize to form static clusters. The datagenerated within each cluster is sent to the related cluster-head along shortest paths, and in-network aggregation isperformed at the intermediate nodes. The cluster-heads sendthe aggregated data to the sink through a multi-hop pathwithout any further aggregation. The cluster size can be variedto tune the degree of aggregation. The authors in [62] studythe impact of partially correlated data on the performance ofclustering algorithms. They analyze the behavior of multi-hoprouting and, by combining random geometry techniques andrate distortion theory, predict the total energy consumptionand network lifetime of their cluster-based scheme. Further

cluster-based algorithms for data aggregation can be foundin [63], [64]. An interesting work about clustering and dataaggregation is presented in [65]. Here, a cross-layer approachis adopted and some issues concerning MAC design areaddressed.

Another work based on a hierarchical organization of thenetwork is proposed in [11]. Assuming that some algorithmsare used to form an aggregation tree or a cluster-based aggre-gation structure, the authors propose a scheme to dynamicallyadapt the data aggregation period (see [10]) according to theaggregation quality required by the sink.

A different approach is presented in [66] where a semi-structured approach, named ToD, is defined in order to alle-viate the problem of maintaining a hierarchical organizationof nodes in case of dynamic large scale networks. This studyis enriched by simulations with 2000 nodes and experimentalresults over 105 Mica2 devices.

C. Multi-path Approaches

In order to overcome the robustness problems of aggregationtrees, a new approach has been recently proposed in [8],[9], [67]. Instead of having an aggregation tree where eachnode has to send the partial result of its aggregation to asingle parent, these solutions send data over multiple paths.The main idea is that each node can send the data to its(possibly) multiple neighbors by exploiting the broadcastcharacteristics of the wireless medium. Hence, data mayflow from the sources to the sinks along multiple pathsand aggregation may be performed by each node. Observethat in contrast to the tree-based schemes discussed above,multi-path approaches allow to propagate duplicates of thesame information. Clearly, such schemes trade a higherrobustness (as multiple copies of the same data can be sentalong multiple paths) for some extra overhead (due to sendingduplicates). An aggregation structure that fits well with thismethodology is calledrings topology, where sensor nodesare divided into several levels according to the number ofhops separating them from the data sink. Data aggregation isperformed over multiple paths as packets move level by leveltowards the sink (see Fig. 5). Next, we review the synopsisdiffusion framework which belongs to this class of protocols.

Synopsis Diffusion [8] - The authors of [8] present theSyn-opsis Diffusionprotocol, where data aggregation is performedthrough a multi-path approach. The underlying topology fordata dissemination is organized in concentric rings aroundthe sink. Synopsis Diffusion consists of two phases: 1) thedistribution of the queriesand 2) thedata retrievalphase. Thering topology is formed when a node sends a query over thenetwork. In particular, two different structures, listed below,can be taken into account. The first type of topology consistsof a simple ring structure. During the query distribution phase,the network nodes form a set of rings around the querying nodeq, which is the only sensor belonging to ringR0. A node isin ring Ri if it is i hops away from the querying node.

The second type of topology has some improvements thatmake it more robust than the previous one and able to copewith changes in the network. This topology is calledAdaptive

Rings. The distribution phase does not change but this time anodeu in ring i keeps track of the number of times,nov, thetransmissions from any nodeni−1 in ring i − 1 included itsown data during the last few epochs. That is, nodeu checkswhether its data is aggregated with the information sent by anynode in ringi−1. If nov is small,u tries to find a better ring inorder to have more of its own data included in the subsequenttransmissions. In fact, ringsi, i + 1, i− 2 and i + 2 can alsobe considered for aggregating data (ringsi−2 andi+2 couldbe overheard in case of mobility). To allow for these checks,the list of all node IDs participating in the construction of thesynopsis (data aggregation result) is included in the headerof each packet. This feature is also exploited at each node asa sort of implicit acknowledgment. Finally, the decision onwhich ring to join is made according to heuristics dependingon ni−1, ni, ni+1, ni+2 andni−2 [8]. The query aggregationperiod is divided intoepochsand one aggregate is provided atthe end of each. Specific time slots are allocated within eachepoch and used to schedule the node transmissions in a TDMAfashion. Sensors can be put to sleep and woken up at theirscheduled transmission slots. The aggregation starts from theoutermost ring, e.g.,Ri, proceeds towards the subsequent ring,e.g.,Ri−1 and propagates level by level towards the sink. Inthe example in Fig. 5, the data generated at node A can reachthe sink through seven paths:{A, B, F, I, S}, {A, B, F, H, S},{A, B, F, G, H, S}, {A, C, D, E, I, S}, {A, C, F, H, S}, {A, C,F, I, S}, and{A, C, G, H, S}. Note that, as the main feature ofSynopsis Diffusion is that data can flow over multiple paths,a node may receive duplicates of the same information. Thismay affect the aggregation result, especially when aggregationfunctions are duplicate sensitive. This problem is addressed bythe authors in [8] by proposing proper aggregation functionsand data structures, see Section V-D. On the upside, multi-pathschemes are suitable for networks with frequent packet lossesdue to mobility or channel impairments, as the extra overhead(duplicates) pays off in terms of robustness: if a link fails, thedata may still reach the sink through a different path.

Further Algorithms - Another way to implement multi-pathschemes is based on multiple spanning trees. For instance, theauthors in [68] define a method to provide fault tolerance topacket losses by forming a Directed Acyclic Graph (DAG).DAG allows to have multiple parent nodes at each sensor. Inaddition, the method ensures correct data transmission timing,according to the actual hop count of the DAG edges.

D. Hybrid Data Aggregation Approaches

In order to benefit from the advantages of both tree-basedand multi-path schemes, it is possible to definehybridapproacheswhich adaptively tune their data aggregationstructure for optimal performance. To the best of ourknowledge, a single work [9] has been proposed with thisaim. The related protocol is presented next.

Tributaries and Deltas [9] - The Tributaries and Deltasprotocol tries to overcome the problems of both the tree andmulti-path based structures, by combining the best featuresof both schemes. The result is a hybrid algorithm where both

data aggregation structures may simultaneously run in differentregions of the network. The idea is that under low packetloss rates, a data aggregation tree is the most suitable struc-ture due to the possibility of implementing efficient sleepingmodes (see previous sections) and to the good efficiency inrepresenting and compressing the data. On the other hand,in case of high loss rates or when transmitting partial resultswhich are accumulated from many sensor readings, a multi-path approach may be the best option due to its increasedrobustness. Hence, nodes are divided into two categories:nodes using a tree-based approach to forward packets (alsocalled T nodes) and nodes using a multi-path scheme (Mnodes). This means that the network is organized in regionsimplementing one of the two schemes. The main difficulty isto link regions running different data aggregation structures.In doing so, the following rules have to be satisfied [9]:

• Edge Correctness:An edge originating from anM nodecan never be incident on aT node. It means that theaggregation result in a multi-path region can only bereceived by anM node (see Fig. 6).

• Path Correctness:M nodes form a subgraph includingthe sink node, which is fed by trees composed ofT nodes(see Fig. 6).

According to the above rules, the sink is surrounded byMnodes only. These form the so calleddelta regionwhich canbe expanded or shrunk by switching nodes from the tree mode(T ) to the multi-path mode (M ) and vice-versa, respectively. Inpractice, only the nodes lying along the boundary between thetwo regions are allowed to change their operating mode [9](see Fig. 6). Expanding the delta region corresponds to in-creasing the number of paths towards the sink, which is goodwhen the packet loss probability is high. On the other hand,shrinking the region is beneficial in case the network is staticand the packet loss probability is small. The user can set athreshold to specify the minimum percentage of nodes thatshould contribute to the aggregation operation. Note that thispercentage increases in case of a wider delta region. In fact,this implies that more multi-path nodes are available thusleading to a higher robustness against failures and, in turn, tomore nodes which actively contribute to the aggregation result.The opposite holds when the delta region is shrunk. To see this,consider nodeT5 in Fig. 6. This node is switched to anMvertex (diagram on the right) and, as a consequence, can nowtransmit the aggregated data flow also to nodesM4 andM5.In particular,M5 can now contribute to the data aggregationby passing the data coming from nodeM to lower levels.

In [9] the authors compare the Tributaries and Deltas algo-rithm to TAG [5] (pure tree-based) and Synopsis Diffusion [8](pure multi-path). The simulation results in [9] only focus onthe quality of the gathered information (RMS error), whiledisregarding the energy consumption aspect. In particular, theydemonstrate that Tributaries and Deltas guarantees smallererrors with respect to TAG and that the approach nicelysolves the drawbacks of pure multi-path schemes (SynopsisDiffusion). The major weakness of this approach is the pos-sibly high overhead incurred in updating the data gatheringstructure. The maintenance of the quite complex network

structure may also be a problem in case of node mobility (thisis also an open issue, not addressed in [9]).

Finally, particular attention is to be paid to the increase intraffic and therefore to the MAC scheme in use. We stress thatmost of the work on data aggregation done so far does notconsider this problem. We emphasize the need for true cross-layer approaches which jointly consider routing, aggregationfunctions and MAC aspects with particular focus on both datarepresentation efficiency and energy consumption.

V. DATA REPRESENTATIONS ANDIN-NETWORK

AGGREGATIONFUNCTIONS

As discussed in Section II, the problems of finding aproper data representation and an optimal aggregation functionare strongly related and complex. The solutions proposed sofar mostly adopt very simple aggregation functions such asaverage, median, quantile, min, max, etc. (see [3], [5]). Thesestrongly reduce the amount of data to be transmitted over thenetwork but also heavily affect the precision of the transmittedinformation (lossyaggregation functions). However, in manycases, we may be interested in a more detailed represen-tation of the data, which calls for more complex functionsand data structures (taking into account the spatial, temporaland semantic correlation of the readings). In this direction,complex frameworks to provide data fusion rules have beenrecently proposed in order to provide cross-layer functionsself-adaptable to the application dynamics [15], [69], [70].

A first improvement to a simple data aggregation functionto take into account the spatial correlation is presented in [13].In this strategy, the dependence on the distance among nodesis quantified by adecay functionwhich may, e.g., decayexponentially with an increasing hop distance [13]. Duringthe data aggregation, each reading is weighed by a decayingfactor which decreases with the distance to its source. Theframework can be extended by additionally accounting fortemporal and semantic correlation. However, this remains anopen and mostly unaddressed issue.

In the following sections, we describe a selection of in-network aggregation functions according to our classificationin Section II. We review the simplest methods first, andsubsequently consider more complex approaches. At the endof the section, we discuss distributed source coding techniqueswhich perform joint coding of correlated data from multiplesources in a distributed manner.

A. TiNA [14]

Temporal coherency-aware in-Network Aggregation[14](TiNA) works on top of a routing tree (i.e., TAG or Cougar,see Section IV-A) having the data gathering point (sink) asits root. It exploits the temporal correlation in a sequence ofsensor readings to reduce energy consumption by suppressingthose values that do not affect the expected quality of theaggregated data. This is implemented through aTOLERANCEclause which is added to the SQL query. Thetct parameter ofthis clause is used to specify the temporal coherency tolerancefor the query. As an example, at aleaf node, each new availablevalue,Vnew, is compared against the last reported data point,Vold. Vnew is transmitted (and aggregated) up the tree if and

only if it satisfies the following requirement (data suppressionrule):

|Vnew − Vold|Vnew

> tct, (2)

TiNA uses the clause GROUP BY of the SQL query to decidehow different messages shall be processed, i.e., two data pointscan only be aggregated if they belong to the same GROUP.The data gathering procedure executed at theinternal nodesisas follows. They first gather and combine packets sent by theirchildren. If a given node does not receive valid data from anyof its children, it replaces the missing information using thelast reported data from the same child (previously stored in itsbuffer). The node then considers its own reading. In case it canbe aggregated with some other data in its buffer (they belongto the same GROUP), then the reading is aggregated with thatdata regardless of thetct value. Doing so, internal nodes canreport their values more often than leaf nodes thus increasingthe accuracy of the aggregation. On the other hand, if theinternal node needs to create a new group, it does so and addsthe new reading only if this data satisfies Eq. (2). The idea isthat new groups are created only when the new measurementssignificantly differ from old data points (Eq. (2)).

Moreover, in TiNA a very simple mechanism to counteractlink failures is used. Children, when suppressing data, mustsendheartbeatmessages to their parent at regular intervals.The cost of this message is low as it is just a notificationpacket. Thanks to these packets each parent knows whetherits children are still alive. Thus it can infer whether the oldreadings are to be kept valid. In case of a missing notification,the appropriate child is discarded until the parent hears fromthat child again. These messages can also be used in case ofmobile sensors as nodes change their location in the network.Finally, the periodic heartbeats allow children to reconnect tothe data gathering tree in case of parent failure.

B. DADMA [16]

Data Aggregation and Dilution by Modulus Addressing(DADMA) [16] is a distributed data aggregation and dilutiontechnique for sensor networks where nodes aggregate ordilute sensed values according to the rules given in an SQLstatement. DADMA treats a wireless sensor network as adistributed relational database. This database has a single viewwhich is created by joining records which are locally storedin the sensor nodes. This technique can be used over wellknown routing schemes such as Directed Diffusion [1] andLEACH [2], see Section IV-A. The sensor network databaseview (SNDV) is temporarily created and maintained at the sinknode. The basic idea in DADMA is to aggregate data comingfrom a group of sensors or to exclude some sensors from thedata gathering tree. These operations are carried out accordingto two simple rules. First, a user can retrieve a subset of datafields available in an SNDV and aggregate data by using thefollowing aggregate mfunction:

fa(x) = x div m. (3)

Moreover, sensor nodes can be excluded from a query by adilute m function as follows:

fd(x) = (x/r) mod (m/r). (4)

In the previous equationsx is the grid location of a node withrespect to one of the axes,r is the resolution in meters andmis the aggregation (or dilution) factor. As the sink sends a newquery, it also specifies abased onfield and a command thatcould be eitheraggregateor dilute. Each sensor node comparesthe result of its aggregation or dilution function with thebasedon value and decides its behavior.

For instance, on receiving adilute m command a nodefirst uses Eq. (4) to calculate its location indices for both thehorizontal and vertical axes (fd(x) andfd(y)). Subsequently,it compares these values with thex and y indices includedin the based onfield of the query. If they match, the sensorreplies to the query. In a similar way, when anaggregatem command is received, the values measured by a sensornode are aggregated with those measured by the other nodeshaving the same indices. We observe that such a strategy is apractical way to take into account the spatial location of thenodes by, for instance, aggregating only those values comingfrom closely placed devices. The author in [16] studies theperformance of DADMA by putting particular emphasis onthe energy savings coming from the reduction of the numberof transmissions and on the probability of event detection.Moreover, he devises a mechanism to achieve a good tradeoffamong the cost, the accuracy, and the reliability in retrievingthe wanted information. The same concepts are addressedin [71] where, in addition to the aggregation/dilution schemes,two location based hash functions are introduced to determinehow the sensed data can be grouped or which sensors shouldbe excluded from a query.

C. Data Aggregation by means of Feedback Control [72]

The authors of [72] define a strategy to tune the degree ofdata aggregation while maintaining specified latency boundson data delivery and minimizing the energy consumption. Theyconsider time-constrained reference scenarios dealing withreal-time applications which impose specific time constraintsto the delivery of sensor measurements. Data is groupedinto different classes associated with different bounds on thedelivery time. The aim is to guarantee the delivery of all dataatthe minimum energy costwhile satisfying all time constraints.The data aggregation degree is adapted accordingly to meetthese requirements. If the total communication load exceedsthe system capacity, the amount of data has to be reduced (thedata aggregation degree has to be increased), whereas the dataaggregation degree may be relaxed in case of low traffic. Inthe former case, a so calledlossy feedback loopmechanismassigns a data aggregation degree (d) on the basis of loadand capacity estimates. This algorithm runs independently ateach node. Specifically,d is defined as the ratio betweenthe number of outgoing and incoming packets. For instance,if d = 0.66, three received packets have to be aggregatedinto two packets (e.g., by averaging two of them).2 In thelimiting case whered = 1 no data aggregation is performed.Moreover,d is continuously adapted according to new loadand capacity estimates. In addition, when the system operatesin a non-overloaded regime, a further strategy calledlossless

2Note that all packets have the same size in this case.

feedback loopcan be used to reduce the energy consumption.According to this scheme incoming messages are collectedand transmitted in a single packet without data size reduction.

This solution is interesting for two reasons: the control ofthe data aggregation is based on physical measurements of thenetwork conditions, thus making the mechanism self-adaptableto the actual network dynamics. Second, it aims at satisfyingtime constraints that, in general, are rarely considered bywireless sensor network algorithms. This solution is extendedin [15], where the authors define a complete data aggregationframework (AIDA), by considering general aggregation rules.

D. Synopsis Diffusion Framework [8]

A recent solution to the data aggregation problem has beenproposed in [8]. The main contribution of the paper is todefine aggregation functions and data structures which arerobust to considering the same sensor readings in the dataaggregation process multiple times (double-countingproblem).This is crucial when data aggregation is used in conjunctionwith multi-path routing schemes (see Section IV-C).

The approach definesorder and duplicate insensitive(ODI)properties whose role is to make sure that the final resultof the aggregation is independent of the routing topology.That is, the computed aggregate must be the same irrespectiveof the order in which the sensor readings are merged andthe number of times they are considered in the aggregationprocess. Asynopsisis defined as a summary of the partialresult of the overall aggregation process received at a givennode. Three functions on the synopses are possible to performdata aggregation:

• Synopsis Generation: Given a sensor reading, a synopsisgeneration functionSG(·) produces the correspondingsynopsis for that data.

• Synopsis Fusion: Given two synopses, a synopsis fusionfunction SF (·, ·) generates a new synopsis that summa-rizes both.

• Synopsis Evaluation: Given a synopsis, a synopsis eval-uation functionSE(·) yields up the final result.

The exact implementations of the functions and the synopsisdefinitions are strictly related to the considered aggregationquery. A simple and fast way to check whether a synopsisdiffusion algorithm is ODI-correct is based on the followingfour properties:

• Preserves duplicates:if two readings contain the samedata values, the algorithm generates the same synopsis.

• The synopsis functionSF (·) is commutative:for anytwo synopsess1 and s2 we have thatSF (s1, s2) =SF (s2, s1).

• The synopsis functionSF (·) is associative: for anytriple (s1, s2, s3) we have thatSF (s1, SF (s2, s3)) =SF (SF (s1, s2), s3).

• The synopsis functionSF (·) is same-synopsis idempo-tent: for any synopsiss we have thatSF (s, s) = s.

The four properties above are necessary and sufficient for ODI-correctness. More properties and examples can be found in therelated paper [8], where the authors also discuss the advantagesof their solution with respect to TAG [5].

E. The Quantile Digest [21]

Quantile Digest [21] (q-digest) is a data structure forrepresenting sensor readings with an arbitrary degree of ap-proximation (trading data size for precision). The data com-pression algorithm adapts its behavior to the data distributionby automatically grouping the sensed data into variable sizebuckets of almost equal weight. As in [21], we assume thatsensor readings are integer numbers falling within the range[1, σ]. A q-digest consists of a set of buckets of differentsizes and their associated counts. More specifically, considera complete binary treeT . In a q-digest, each element of thetree v ∈ T can be considered as a bucket with a specificrange. For example, the range associated with the root of theq-digest is[1, σ] and its two children have ranges[1, σ/2] and[σ/2 + 1, σ], respectively. In addition, every bucketv ∈ Thas a counter (count(v)) associated with it. The structure isrecursive and ranges are halved as we proceed from the rootto the leaves of the tree. A q-digest is simply a subset ofthe (complete) tree which only contains those elements withpositive counts. For its construction, we say that an elementof the original treev ∈ T is in the q-digest if and only if itsatisfies the following properties:

q1) count(v) ≤ bn/kc, where n is the number ofreadings andk is the compression factor. This ruleensures that the internal (non-leaf) elementv in thetree does not have a high count.

q2) count(v) + count(vp) + count(vs) > bn/kc wherevp and vs are the parent and the sibling ofv,respectively.

q3) Since there are no parent and sibling for the rootit can violate property q2). A leaf node is insteadallowed to violate property q1).

In Fig. 7 we show an example illustrating how a q-digest isbuilt. The example is the same described in [21].n = 15 is thenumber of readings at any one sensor, which are compressedand summarized in the data structure. The leaf nodes, fromleft to right, represent the values1, 2, . . . , 8 and the numberinside the boxes represent the counts. Thecompression factork is equal to5 which means that the q-digest hasbn/kc = 3levels. Finally,σ = 8 is the size of the data interval, wherewe assume to collect integer values spanning from1 to 8.Consider a set ofn = 15 readings within this range, as shownin Fig. 7(a). The number of buckets needed to store all data is7. In Fig. 7(a), the children of nodesa, c andd do not satisfythe digest property (q2). Hence, we compress their values intoa single bucket by getting to the structure in Fig. 7(b). Atthis point, nodee still does not satisfy property (q2). Hence,we compress the value therein by getting to Fig. 7(c). Now,nodeg still does not satisfy property (q2) and hence a furthercompression is needed. This last compression leads us to theq-digest in Fig. 7(d). Note that only5 buckets are neededto store the final result, in spite of the7 buckets that wereoriginally needed to store the data without compression. Ascan be observed from this example, this procedure results ina larger loss of accuracy for the readings with a small count.The compression factork is used to tune the procedure to thedesired accuracy. It also affects the memory requirements for

storing a q-digest [21].For its practical implementation, the q-digest structure needs

two functions: 1) to construct the q-digest and 2) to merge twoor more q-digests. The first function is calledcompressas ittakes the uncompressed q-digest, the number of readingsn andthe compression factork as input and generates a compressedrepresentation of the q-digest as output (see the above exam-ple). The second functionality is themergefunction which isused for example when two sensors send their q-digests tothe same parent. The parent merges these two q-digests into asingle q-digest and adds its own values to the new structure.The mergefunction first takes the union of the two q-digests,which is obtained by adding the counts of the buckets with thesame range. After this, it compresses the resulting q-digest byapplying thecompressfunction above. As soon as the q-digeststructure has been built, each sensor packs it and transmits itto its parent (predecessor node) in the data gathering tree.

In principle, this scheme can be used on top of any routingprotocol that avoids loops and duplicates of the same packet.We observe, however, that the joint design of these data repre-sentation and compression techniques with routing algorithmsis still a completely open research issue.

F. Distributed Source Coding

A recent paradigm to perform data aggregation exploitsDistributed Source Coding (DSC). These techniques are basedon the Slepian-Wolf theorem [73], which allows joint codingof correlated data from multiple sources and without explicitcommunication. This is possible as long as the individualsource rates satisfy certain constraints about conditional en-tropies. These techniques require that the correlation struc-ture is available a priori at the independent encoders. Ref-erence [23] gives a good survey on DSC techniques andrelated open issues in this emerging field. The probably mostimportant contribution to DSC was derived by Slepian andWolf in their landmark paper [24]. A simple way to encodeand transmit the data generated by two generic sourcesXandY is to apply separate coding with total rateR1 + R2 =H(X) + H(Y ), whereH(·) denotes the entropy of the dataflow. If the two sources can communicate, then they couldcoordinate their coding operations and use together a totalrate ofH(X, Y ) ≤ R1 +R2. The authors in [24] showed thattwo correlated sources can be coded with a total rate equal tothe joint entropyH(X, Y ) even though they arenot able tocommunicate with each other, as long as their individual ratesare at least equal to the conditional entropiesH(X|Y ) andH(Y |X) respectively. Although different sources do not needto communicate with each other, theydo needto have somecommon information about the correlation structure. Towardsthis end, the sink node may first collect a certain amount ofdata from the network, process it and subsequently send theproper correlation information to all sensors. Only after thisoperation, can each node start compressing its readings.

The theory has been generalized and recently applied towireless sensor networks. For instance, in [74] the authors fo-cus on LDPC codes which are well known for their capacity ofapproaching the Shannon limit; Slepian and Wolf proved thatthe theoretical limitH(X,Y ) can be reached with equality,

but without devising practical schemes to approach it. In [75],the authors apply Slepian-Wolf coding in its simplest form byproving its effectiveness. Note that, in order for Slepian-Wolfdecoding to be effective we need to have a good estimate ofdata correlation properties. Accordingly, the scheme in [75]uses an algorithm, running at the sink, to measure the actualdata correlation. Then, a set of nodes is allowed to sendcompressed data, where the compression is achieved locallyand decoding is performed in a centralized fashion at the datagathering node. At the sink, the uncompressed samples comingfrom the sensors that are not allowed to compress are used asthe side information for decoding. Notably, this approach hasthe drawback that data is not aggregated along the path to thesink. Hence, further savings can be achieved by exploiting innetwork data fusion on top of the distributed per node datacompression. Also, this approach might be affected by packetlosses as, in such a case, the needed side information might notbe fully available at the sink (decoding entity). In the paper,the authors discuss these issues but without giving detailedresults. In [76], the authors present and solve theminimumcost data gathering treeproblem. The network is modeled asa graphG = (V, E), whereV and E are the set of vertices(nodes) and edges, respectively. Slepian-Wolf coding is used atevery node. Moreover, a communication costwe is associatedwith each edgee ∈ E. The cost function is assumed to beseparable, i.e.,f(xe, we) = xewe, wherexe is the amount ofinformation to be sent over edgee andwe is the edge cost (e.g.,transmission power). The minimum cost data gathering treeproblem consists of finding the spanning tree ofG and the rateallocation for each node inV that minimize the cost functionof the network (i.e., the sum of the costs of all links). Theshortest path tree is optimal for any rate allocation and thus theoptimization problem can be separated into a spanning tree anda rate allocation optimization subproblems. [76] gives exactalgorithms to solve both of them. Overall, the results in [76]allow to code the data in a completely distributed fashion byexploiting the side information in a recursive manner.

The main drawback of this scheme is that it involves thecalculation of an SPT and that it requires the full (centralized)knowledge of the data correlation structure forall nodesin thenetwork to express the rate constraints. Lossless encoders canthen separately and independently encode data at each nodeas efficiently as if its encoder would see the data values sentby all other nodes. Notably, the scheme’s inability to toleratefailures may eliminate this advantage. In fact, if the encodedbits from one node are lost, the sink may not be able toreconstruct several sensor values. The authors of reference [77]highlight the drawbacks of previous approaches [75] [76] whenthe network is error prone and, as a partial solution, proposeto exploit multi-path routing schemes. The advantages of theirapproach come at the cost of a higher energy consumption tosetup/maintain multiple trees and to transmit multiple copies(extra overhead) of the same message.

In summary, DSC effectively makes routing and codingdecisions independent of each other. On the downside, how-ever, this solution increases the computational complexity andrequires the collection of information about joint statistics,which may not always be easy in practice.

VI. D ISCUSSIONS ANDCONCLUSIONS

In this paper we have presented a detailed review of in-network aggregation techniques for wireless sensor networks.One of the main design aspects for sensor network archi-tectures is energy efficiency, to keep the network operationalas long as possible. Therefore, aggregation techniques are anessential building block, as they aim at reducing the numberof transmissions required for data collection which, in turn,reduces energy consumption.

In this survey, we have provided a definition of in-networkdata aggregation and identified its key elements: data dissem-ination and query mechanisms (with particular focus on therouting and MAC layer), aggregation functions, and data struc-ture. Fig. 8 and Fig. 9 summarize the basic characteristics ofthe presented solutions and provide a qualitative comparison.By its very nature, in-network aggregation concerns severallayers of the protocol stack, and any efficient solution is likelyto require a cross-layer design. However, we note that mostof the existing research focuses on networking issues suchas routing, often considering only very simple approaches toaggregate data. In addition, much work still remains to be doneto provide cross-layer solutions, accounting for application,data representation, routing and MAC aspects. In fact, theschemes proposed so far often focus on only a subset of theseaspects, typically trying to merge routing and data aggregationtechniques, but ignoring MAC, application or data representa-tion issues. Finally, another aspect still not deeply investigatedconcerns the memory and the computational resources allowto sustain data aggregation processing [78].

For routing, many protocols are based on clustering. Amajor advantage of a clustered structure is that it directlyallows aggregation of data at the cluster head. Such algorithmswork well in relatively static networks where the clusterstructure remains unchanged for a sufficiently long time, butthey may be fragile when used in more dynamic environments.Often, the cost required to maintain the hierarchical struc-ture is substantial. Similar considerations apply to tree-basedschemes. Initial work addresses some of these problems [53],[79] but further research efforts are required to keep a networkfunctional under mobility. This last aspect is in fact largelyunexplored, and it is not clear how different protocols performin its presence. Also, multi-path algorithms may be ableto deal with (limited) topology changes due to their higherrobustness [8]. An interesting alternative research directionis provided by reactive and localized routing protocols [7].This study is also one of the very few that take MAC layerissues into account [7], [65]. We stress that without such ajoint design, the performance gained at the routing layer maybe partially lost due to MAC inefficiencies. Hybrid algorithmsallow to combine the properties of different approaches. Thisis the case for the algorithm in [9], which provides a goodtradeoff between tree-based and multi-path schemes. Hybridalgorithms allow to tune the degree of aggregation and mayfacilitate the adaptation of the aggregation scheme (e.g., to thepacket loss probability). For these reasons, they are particularlysuitable for the design of schemes that are able to adapt toapplication needs.

As discussed above, only very few studies provide a deeperanalysis of the aggregation functions. Previous work mostlytakes spatial correlation [13], [80] and temporal correla-tion [14] of data into account, but semantic correlation is notsufficiently well studied. In this context, distributed sourcecoding is a fairly recent and very promising research area.However, while many theoretical results are known, few ofthem have been turned into practical algorithms applicable towireless sensor networks.

REFERENCES

[1] C. Intanagonwiwat, R. Govindan, D. Estrin, J. Heidemann, and F. Silva,“Directed diffusion for wireless sensor networking,”IEEE/ACM Trans.Networking, vol. 11, no. 1, pp. 2–16, Feb. 2002.

[2] W. B. Heinzelman, A. P. Chandrakasan, and H. Balakrishnan, “Anapplication-specific protocol architecture for wireless microsensor net-works,” IEEE Trans. Wireless Commun., vol. 1, no. 4, pp. 660–670,Oct. 2002.

[3] S. Lindsey, C. Raghavendra, and K. M. Sivalingam, “Data GatheringAlgorithms in Sensor Networks using Energy Metrics,”IEEE Trans.Parallel Distrib. Syst., vol. 13, no. 9, pp. 924–935, Sep. 2002.

[4] Y. Yao and J. Gehrke, “Query processing for sensor networks,” inACMCIDR 2003, Asilomar, CA, US, Jan. 2003.

[5] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong, “TAG:a Tiny AGgregation Service for Ad-Hoc Sensor Networks,” inOSDI2002, Boston, MA, US, Dec. 2002.

[6] Y. Xu, J. Heidemann, and D. Estrin, “Geographic-Informed EnergyConservation for Ad Hoc Routing,” inACM/SIGMOBILE MobiCom2001, Rome, Italy, Jul. 2001.

[7] G. Di Bacco, T. Melodia, and F. Cuomo, “A MAC Protocol for Delay-Bounded Applications in Wireless Sensor Networks,” inMed-Hoc-Net2004, Bodrum, Turkey, Jun. 2004.

[8] S. Nath, P. B. Gibbons, Z. R. Anderson, and S. Seshan, “SynopsisDiffusion for Robust Aggregation in Sensor Networks,” inACM SenSys2004, Baltimore, MD, US, Nov. 2004.

[9] A. Manjhi, S. Nath, and P. B. Gibbons, “Tributaries and Deltas:Efficient and Robust Aggregation in Sensor Network Stream,” inACMSIGMOD 2005, Baltimore, MD, US, Jun. 2005.

[10] I. Solis and K. Obraczka, “The Impact of Timing in Data Aggregationfor Sensor Networks,” inIEEE ICC 2004, Paris, France, Jun. 2004.

[11] F. Hu, C. Xiaojun, and C. May, “Optimized Scheduling for DataAggregation in wireless sensor networks,” inIEEE ITCC 2005, LasVegas, NV, US, Apr. 2005.

[12] X. Jianbo, Z. Siliang, and Q. Fengjiao, “A new In-network dataaggregation technology of wireless sensor networks.” inIEEE SKG’06,Guilin, China, Nov. 2006.

[13] E. Cohen and H. Kaplan, “Spatially-Decaying Aggregation Over aNetwork: Model and Algorithms,” inACM SIGMOD 2004, Paris,France, Jun. 2004.

[14] A. Sharaf, J. Beaver, A. Labrinidis, and K. Chrysanthis, “Balancingenergy efficiency and quality of aggregate data in sensor networks,”The VLDB Journal, vol. 13, no. 4, pp. 384–403, Dec. 2004.

[15] T. He, B. M. Blum, J. A. Stankovic, and T. Abdelzaher, “AIDA:Adaptive Apllication-Independent data aggregation in wireless sensornetworks,”ACM Trans. on Embedded Computing Systems, vol. 3, no. 2,pp. 426–457, May 2004.

[16] E. Cayirci, “Data Aggregation and Dilution by modulus addressing inwireless sensor networks,”IEEE Communications Letters, vol. 7, no. 8,pp. 355–357, Aug. 2003.

[17] D. Petrovic, R. C. Shah, K. Ramchandran, and J. Rabaey, “DataFunneling: Routing with Aggregation and Compression for WirelessSensor Networks,” inIEEE SNPA 2003, Anchorage, AK, US, May2003.

[18] M. Riedewald, D. P. Agrawal, and A. El Abbadi, “pCube: Update-efficient online aggregation with progressive feedback and errorbounds,” inIEEE SSDBM 2000, Berlin, Germany, Jul. 2000.

[19] L. Huang, B. Zhao, A. Joseph, and J. Kubiatowicz,“Probabilistic data aggregation in distributed networks,”EECS Department, University of California, Berkeley, Tech.Rep. UCB/EECS-2006-11, February 6 2006. [Online]. Avail-able: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-11.html

[20] X. Wu and Z. Tian, “Optimized Data Fusion in Bandwidth and EnergyConstrained Sensor Networks.” inIEEE ICASSP 2006, Toulouse,France, May 2006.

[21] N. Shrivastava, C. Buragohain, D. P. Agrawal, and S. Suri, “Mediansand Beyond: New Aggregation Techniques for Sensor Networks,” inACM SenSys 2004, Baltimore, MD, US, Nov. 2004.

[22] A. Bezenchek, M. Rafanelli, and L. Tininini, “A data structure for rep-resenting aggregate data,” inIEEE SSDBM 1996, Stockholm, Sweden,Jun. 1996.

[23] Z. Xiong, A. D. Liveris, and S. Cheng, “Distributed source coding forsensor networks,”IEEE Signal Processing Mag., vol. 21, no. 5, pp.80–94, Sep. 2004.

[24] D. Slepian and J. K. Wolf, “Noiseless coding of correlated informationsources,”IEEE Trans. Inform. Theory, vol. 19, no. 4, pp. 471–480, Jul.1973.

[25] R. M. Karp,Reducibility among combinatorial problems, in: Complex-ity of Computer Computations, R. Miller, J.W. Thatcher , Ed. NewYork, US: Plenum Press, 1972.

[26] B. Krishnamachari, D. Estrin, and S. Wicker, “The Impact of DataAggregation in Wireless Sensor Networks,” inIEEE ICDCS 2002,Vienna, Austria, Jul. 2002.

[27] S. Pattem, B. Krishnamachari, and R. Govindan, “The Impact ofSpatial Correlation on Routing with Compression in Wireless SensorNetworks,” in ACM/IEEE IPSN 2004, Berkeley, CA, US, Apr. 2004.

[28] Y. Zhu, K. Sundaresan, and R. Sivakumar, “Practical Limits on Achiev-able Energy Improvements and Useable Delay Tolerance in CorrelationAware Data Gathering in Wireless Sensor Networks,” inIEEE SECON2005, Santa Clara, CA, US, Sep. 2005.

[29] Y. Zhu, R. Vedantham, S. J. Park, and R. Sivakumar, “A Scalable Cor-relation Aware Aggregation Strategy for Wireless Sensor Networks,”in IEEE WICON 2005, Budapest, Hungary, Jul. 2005.

[30] P. von Rickenbach and R. Wattenhofer, “Gathering correlated datain sensor networks,” inACM DIALM-POMC 2004, Philadelphia, PA,USA, Oct. 2004.

[31] R. Cristescu, B. Beferull-Lozano, M. Vetterli, and R. Wattenhofer,“Network correlated data gathering with explicit communication: Np-completeness and algorithms,” insubmitted to IEEE/ACM Transactionson Networking.

[32] K. Akkaya and M. Younis, “A survey of routing protocols in wirelesssensor networks,”Elsevier Ad Hoc Network Journal, vol. 3, no. 3, pp.325–349, May 2005.

[33] J. N. Al-Karaki and A. E. Kamal, “Routing techniques in wirelesssensor networks: a survey,”IEEE Wireless Communications, vol. 11,no. 6, pp. 6–28, Dec. 2004.

[34] I. Solis and K. Obraczka, “Isolines: energy-efficient mapping in sensornetworks,” in IEEE ISCC 2005, Cartagena, Spain, Jun. 2005.

[35] V. Erramilli, I. Matta, and A. Bestavros, “On the interaction betweendata aggregation and topology control in wireless sensor networks,” inIEEE SECON 2004, Santa Clara, CA, US, Oct. 2004.

[36] M. Ding, X. Cheng, and G. Xue, “Aggregation Tree Construction inSensor Networks,” inIEEE VTC 2003, Orlando, FL, US, Oct. 2003.

[37] Y. Yu, B. Krishnamachari, and V. Prasanna, “Energy-Latency Tradeoffsfor Data Gathering in Wireless Sensor Networks,” inIEEE Infocom2004, Hong Kong, Mar. 2004.

[38] H. Luo, J. Luo, Y. Liu, and S. Das, “Energy efficient routing withadaptive data fusion in sensor Networks.” inThird ACM/SIGMOBILEWorkshop on Foundations of Mobile Computing, Cologne, Germany,Aug. 2005.

[39] K. Dasgupta, K. Kalpakis, and P. Namjoshi, “An efficient Clustering-based heuristic for data gathering and aggregation in sensor networks,”in IEEE WCNC 2003, New Orleans, LA, US, Mar. 2003.

[40] H. Albert, R. Kravets, and I. Gupta, “Building Trees Based OnAggregation Efficiency in Sensor Networks,” inMed-Hoc-Net 2006,Lipari, Italy, Jun. 2006.

[41] J. N. Al-Karaki, R. Ul-Mustafa, and A. E. Kamal, “Data aggregation inwireless sensor networks: exact and approximate algorithms,” inIEEEHPSR 2004, Phoenix, AZ, US, Apr. 2004.

[42] G. Hartl and B. Li, “infer: A Bayesian Approach towards EnergyEfficient Data Collection in Dense Sensor Networks,” inIEEE ICDCS2005, Columbia, OH, US, Jun. 2005.

[43] H. O. Tan and I. Korpeoglu, “Power Efficient Data gathering andaggregation in wireless sensor networks,”ACM SIGMOD Record,vol. 32, no. 4, pp. 66–71, Dec. 2003.

[44] H. Cheng, Q. Liu, and X. Jia, “Heuristic algorithms for real-timedata aggregation in wireless sensor networks.” inACM IWCCC 2066,Vancouver, British Columbia, Canada, Jul. 2006.

[45] Y. Hu, N. Yu, and X. Jia, “Energy efficient real-time data aggregationin wireless sensor networks.” inACM IWCCC 2006, Vancouver, BritishColumbia, Canada, Jul. 2006.

[46] J. Choi, J. W. Lee, K. Lee, S. Choi, W. H. Kwon, and H. S. Park,“Aggregation Time Control Algorithm for Time constrained DataDelivery in Wireless Sensor Networks.” inIEEE VTC 2006-Spring,Melbourne, Australia, May 2006.

[47] H. Gupta, V. Navda, S. R. Das, and V. Chowdhary, “Efficient Gatheringof Correlated Data in Sensor Networks,” inACM MobiHoc 2005,Urbana-Champaign, IL, US, May 2005.

[48] B. Zhou, L. H. Ngoh, B. S. Lee, and C. P. Fu, “A HierarchicalScheme for Data Aggregation in Sensor Network,” inIEEE ICON 2004,Singapore, Nov. 2004.

[49] C. Intanagonwiwat, D. Estrin, R. Govindan, and J. Heidemann, “Impactof Network Density on Data Aggregation in Wireless Sensor Net-works,” in IEEE ICDCS 2002, Vienna, Austria, Jul. 2002.

[50] M. Lee and V. W. S. Wong, “An energy-aware spanning tree algorithmfor data aggregation in wireless sensor networks,” inIEEE PacRrim2005, Victoria, BC, Canada, Aug. 2005.

[51] IEEE LAN MAN Standards, Part 11: Wireless LAN Medium AccessControl (MAC) and Physical Layer (PHY) specifications”, ANSI/IEEEStd., March 1999.

[52] R. Kannan and S. S. Iyengar, “Game-theoretic models for reliable path-length and energy-constrained routing with data aggregation in wirelesssensor networks,”IEEE J. Select. Areas Commun., vol. 22, no. 6, pp.1141–1150, Aug. 2004.

[53] H. S. Kim, T. F. Abdelzaher, and W. H. Kwon, “Minimum-EnergyAsynchronous Dissemination to Mobile Sinks in Wireless sensornetworks,” inACM SenSys 2003, Los Angeles, CA, US, Nov. 2003.

[54] T. Banerjee, K. Chowdhury, and D. P. Agrawal, “Tree based dataaggregation in sensor networks using polynomial regression,” inIEEEFUSION 2005, Philadelphia, PA, US, Jul. 2005.

[55] Y. Yang, X. Wang, S. Zhu, and G. Cao, “SDAP: A Secure HopbyHopData Aggregation Protocol for Sensor Networks.” inACM MobiHoc2006, Florence, Italy, May 2006.

[56] Y. Sang, H. Shen, Y. Inoguchi, Y. Tan, and N. Xiong, “Secure DataAggregation in Wireless Sensor Networks: A Survey.” inPDCAT’06,Taipei, Taiwan, Dec 2006.

[57] A. Mahimkar and T. S. Rappaport, “SecureDAV: A secure dataaggregation and verification protocol for sensor networks,” inIEEEGLOBECOM 2004, Dallas, TX, US, Nov. 2004.

[58] H. Chen, H. Mineno, and T. Mizuno, “A Meta-Data-Based DataAggregation Scheme in Clustering Wireless Sensor Networks.” inIEEE/ACM MDM 2006, Nara, Japan, May 2006.

[59] Y. Yao and J. Gehrke, “The cougar approach to in-network queryprocessing in sensor networks,”ACM SIGMOD Record, vol. 31, no. 3,pp. 9–18, Sep. 2002.

[60] T. Pham, E. J. Kim, and M. Moh, “On data aggregation qualityand energy efficiency of wireless sensor network protocols,” inIEEEBroadNets 2004, San Jose, CA, US, Oct. 2004.

[61] H. Cam, S. Ozdemir, P. Nair, and D. Muthuavinashiappan, “ESPDA:Energy-efficient and secure pattern-based data aggregation for wirelesssensor networks,”IEEE SENSORS 2003, Oct. 2003.

[62] M. Lotfinezhad and B. Liang, “Effect of partially correlated data onclustering in wireless sensor networks,” inIEEE SECON 2004, SantaClara, CA, US, Oct. 2004.

[63] O. Younis and S. Fahmy, “Distributed clustering in ad-hoc sensornetworks: a hybrid, energy-efficient approach,” inIEEE Infocom 2004,Hong Kong, Mar. 2004.

[64] V. Mhatre and C. Rosenberg, “Design guidelines for wireless sensornetworks: Communication, clustering and aggregation,”Elsevier AdHoc Networks Journal, vol. 2, no. 1, pp. 45–63, Jan. 2004.

[65] P. Popovski, F. H. P. Fitzek, H. Yomo, T. K. Madsen, R. Prasad, andN. J. Vej, “MAC-layer approach for cluster-based aggregation in sensornetworks,” in IEEE IWWAN’04, Oulu, Finland, May 2004.

[66] K.-W. Fan, S. Liu, and P. Sinha, “Scalable data aggregation for dynamicevents in sensor networks.” inACM SenSys 2006, Boulder, CO, US,Oct. 2006.

[67] S. Chen and Z. Zhang, “Localized algorithm for aggregate fairness inwireless sensor networks.” inACM/SIGMOBILE MobiCom 2006, LosAngeles, CA, US, Sep. 2006.

[68] S. Motegi, K. Yoshihara, and H. Horiuchi, “DAG Based In-NetworkAggregation for Sensor Network Monitoring,” inIEEE SAINT 2006,Phoenix, AZ, US, Jan. 2006.

[69] U. Ramachandran, R. Kumar, M. Wolenetz, B. Cooper, B. Agarwalla,J. Shin, P. Hutto, and A. Paul, “Dynamic data fusion for future sensor

networks.” ACM Transactions on Sensor Networks, vol. 2, no. 3, pp.404–443, Aug. 2006.

[70] M. Wenz and H. Worn, “Event-based Production Rules for Data Aggre-gation in Wireless Sensor Networks.” inIEEE MFI 2006, Heidelberg,Germany, Sep. 2006.

[71] E. Cayirci and T. Coplu, “Distributed spatial data aggregation anddilution based on hashing and relational algebra in wireless sensornetworks,” in IEEE ISSNIP 2004, Melbourne, Australia, Dec. 2004.

[72] T. Abdelzaher, T. He, and J. Stankovic, “Feedback control of dataaggregation in sensor networks,” inIEEE CDC 2004, Atlantis, ParadiseIsland, Bahamas, Dec. 2004.

[73] T. M. Cover and J. A. Thomas,Elements of Information Theory. JohnWiley, 1991.

[74] M. Sartipi and F. Fekri, “Source and channel coding in wireless sensornetworks using LDPC codes,” inIEEE SECON 2004, Santa Clara, CA,US, Oct. 2004.

[75] J. Chou, D. Petrovic, and K. Ramchandran, “A Distributed and Adap-tive Signal Processing Approach to Reducing Energy Consumption inSensor Networks,” inIEEE Infocom 2003, San Francisco, CA, US,Mar. 2003.

[76] R. Cristescu, B. Beferull-Lozano, and M. Vetterli, “On NetworkCorrelated Data Gathering,” inIEEE Infocom 2004, Hong Kong, Mar.2004.

[77] S. Coleri and P. Varaiya, “Fault Tolerant and Energy Efficient Routingfor Sensor Networks,” inIEEE GLOBECOM 2004, Dallas, TX, US,Nov. 2004.

[78] J. Verd, J. Garca, M. Nemirovsky, and M. Valero, “The impact of trafficaggregation on the memory performance of networking applications.”in ACM MEDEA 2004, Antibes Juan-les-Pins, France, Oct. 2004.

[79] K.-I. Hwang, J. In, and D.-S. Eom, “Distributed Dynamic Shared Treefor Minimum Energy Data Aggregation of Multiple Mobile Sinks inWireless Sensor Networks.” inIEEE EWSN06, Zurich, Switzerland,Feb. 2006.

[80] M. Sharifzadeh and C. Shahabi, “Supporting Spatial Aggregation insensor network database,” inACM GIS 2004, Washington, DC, US,Nov. 2004.

Elena Fasolo ([email protected]) was born in Padova,Italy, in 1980. She received the Laurea Degree inTelecommunications Engineering from the University ofPadova, in 2004. She is currently a PhD Student at the Schoolin Information Engineering at the Information EngineeringDepartment, University of Padova, Italy, and she is spendingsix months at the NTT DoCoMO European Laboratoriesto do research on network coding techniques for wirelessnetworks. Her main interests are on networking and MACprotocol layers especially on efficient data gathering anddissemination schemes.

Michele Rossi([email protected]) was born in Ferrara, Italy,in 1974. He received both the Laurea degree and the Ph.D. inInformation Engineering from the University of Ferrara, Italy,in 2000 and 2004, respectively. During 2003 he was on leaveat the Center for Wireless Communications at the Universityof California San Diego, US. Since November 2005 he isan Assistant Professor at the Department of InformationEngineering of the University of Padova, Italy. His researchinterests include channel access, routing and data distributionalgorithms for wireless networks.

Jorg Widmer ([email protected]) is seniormanager at DoCoMo Euro-Labs in Munich, Germany. Heleads the Self-Organized Ambient Networking research group,working on several projects in the area of wireless ad-hocand sensor networks. Before joining DoCoMo, he spent twoyears as postdoctoral researcher at EPFL, Switzerland, doingresearch on ultra-wide band networks. Joerg Widmer obtainedhis M.S. degree and Ph.D. degree in computer science fromthe University of Mannheim, Germany in 2000 and 2003,respectively. In 1999/2000, he was at the ICSI Center forInternet Research, Berkeley, CA working on equation-basedcongestion control. His research interests include efficientalgorithms for wireless ad-hoc and sensor networks, networkcoding, and transport protocols.

Michele Zorzi ([email protected]) was born in Venice, Italy,in 1966. He received the Laurea Degree and the Ph.D. inElectrical Engineering from the University of Padova, Italy,in 1990 and 1994, respectively. During the Academic Year1992/93, he was on leave at the University of California,San Diego (UCSD), attending graduate courses and doingresearch on multiple access in mobile radio networks. In1993, he joined the faculty of the Dipartimento di Elettronicae Informazione, Politecnico di Milano, Italy. After spendingthree years with the Center forWireless Communications atUCSD, in 1998 he joined the School of Engineering of theUniversity of Ferrara, Italy, where he became a Professor in2000. Since November 2003, he has been on the faculty atthe Information Engineering Department of the University ofPadova. His present research interests include performanceevaluation in mobile communications systems, random accessin mobile 18 radio networks, ad hoc and sensor networks, and

energy constrained communications protocols. Dr. Zorzi is theEditor-In-Chief of the IEEE Wireless Communications Maga-zine, and currently serves on the Editorial Boards of the IEEETransactions on Communications, the IEEE Transactions onWireless Communications, the IEEE Transactions on MobileComputing, the Wiley Journal of Wireless Communicationsand Mobile Computing and the ACM/URSI/Kluwer Journal ofWireless Networks. He was also guest editor for special issuesin the IEEE Personal Communications Magazine (EnergyManagement in Personal Communications Systems) and theIEEE Journal on Selected Areas in Communications (Multi-media Network Radios).

Figure and Table captions

Fig. 1 Diagram for in-network aggregation techniques and their relation with differentprotocol layers. We stress that in general data processing also interacts with theApplication, MAC and PHY layers.

Fig. 2 A simplified scheme for Directed Diffusion [1].

Fig. 3 A message exchange example in DB-MAC.

Fig. 4 LEACH clustering approach.

Fig. 5 Examples of aggregation paths over a ring structure.

Fig. 6 Example of data gathering regions in Tributary and Delta.

Fig. 7 Q-digest example [21]: the complete treeT is derived by a recursive binary splittingof the original (root) interval[1, σ]. The q-digest consists of the non-empty boxes of thedata structure in sub-figure (d).

Fig. 8 Summary of the basic characteristics of the routing protocols.

Fig. 9 Summary of the basic characteristics of the data aggregation functions.

Data representations

Data aggregation functions

Routing protocols

Applications

MAC and PHY layers

Data representations

Data aggregation functions

Routing protocols

Applications

MAC and PHY layers

Fig. 1. Diagram for in-network aggregation techniques and their relation with different protocol layers. We stress that in general data processing also interactswith the Application, MAC and PHY layers.

c

b Source

Sink

(a) Interest dissemination

c

b

a

Source

Sink

(b) Gradients set up

c

b

a

Source

Sink

(c) Data delivery along the reinforced path

a

c

b Source

Sink

(a) Interest dissemination

c

b

a

Source

Sink

(b) Gradients set up

c

b

a

Source

Sink

(c) Data delivery along the reinforced path

a

Fig. 2. A simplified scheme for Directed Diffusion [1].

S1

S2

R1

R2

Sink direction

Pr = 10

Pr = 9

RTS(10)

CTS(10)

CTS(10)

S1

S2

R1

R2

Sink direction

Pr = 10

Pr = 9

RTS(10)

CTS(10)

CTS(10)

Fig. 3. A message exchange example in DB-MAC.

S

S

Sink

S

CH

S

S

S

S

S

SS S

SS

S

CH

CH

S

S

Sink

S

CH

S

S

S

S

S

SS S

SS

S

CH

CH

Fig. 4. LEACH clustering approach.

Sink

G

I

DF

CB

A

R1

R2

R3

R4

HE

Sink

G

I

DF

CB

A

R1

R2

R3

R4

HE

Fig. 5. Examples of aggregation paths over a ring structure.

T1

T4

T2

T3

M2

M1

T5 M

3

M4

M5

T1

T4

T2

T3

M2

M1

M M3

M4

M5

Delta region Expanded

Delta region

SINK SINK

T1

T4

T2

T3

M2

M1

T5 M

3

M4

M5

T1

T4

T2

T3

M2

M1

M M3

M4

M5


Delta region

T1

T4

T2

T3

M2

M1

T5 M

3

M4

M5

T1

T4

T2

T3

M2

M1

M M3

M4

M5


Delta region

SINK SINK

Fig. 6. Example of data gathering regions in Tributary and Delta.

n=15, k=5, σ=8

11 64 111

a b c

root

1

4 6

2 2

e f

d

4 6

2 2

1

g1

4 6

2 2

(a)

(d)

(b)

(c)

Fig. 7. Q-digest example [21]: the complete treeT is derived by a recursive binary splitting of the original (root) interval[1, σ]. The q-digest consists ofthe non-empty boxes of the data structure in sub-figure (d).

AsynchronousAsynchronousPeriodic per hopPeriodic per hopAsynchronousPeriodic per hopAsynchronousPeriodic per

hop adjustedTiming strategy

NoneNoneLocal route

repairs

Rotation of the

cluster-head,

sleeping periods

NoneRotation of the

leader

NoneSleeping

periodsEnergy saving

methods

MediumHighLowLowHighVery LowMediumLowResilience in case of

node mobility

MediumHighLowLowHighVery LowMediumLowScalability

MediumMediumMediumMediumLowHighHighHigh

Overhead to setup/maintain

the aggregation structure

HighHighMediumLowMediumLowMediumMediumResilience to link failures

Tree/Multi-path

based, driven by the

sink

Multi-path

based, on-line,

distributed

Cluster-based,

on-line,

distributed

synchronous

Cluster-based,

on-line,

distributed

Completely

distributed,

asynchronous

Chain-based,

centralized or

distributed

Tree-based, on-

line, driven by

the sink

Tree-based,

on-line, driven

by the sinkAggregation method

Tributaries and

Deltas [9]

Synopsis

Diffusion [8]

COUGAR

[4] [49]

LEACH

[2]

DB-MAC

[7]

PEGASIS

[3]

Directed

Diffusion [1]

TAG

[5]

AsynchronousAsynchronousPeriodic per hopPeriodic per hopAsynchronousPeriodic per hopAsynchronousPeriodic per

hop adjustedTiming strategy

NoneNoneLocal route

repairs

Rotation of the

cluster-head,

sleeping periods

NoneRotation of the

leader

NoneSleeping

periodsEnergy saving

methods

MediumHighLowLowHighVery LowMediumLowResilience in case of

node mobility

MediumHighLowLowHighVery LowMediumLowScalability

MediumMediumMediumMediumLowHighHighHigh

Overhead to setup/maintain

the aggregation structure

HighHighMediumLowMediumLowMediumMediumResilience to link failures

Tree/Multi-path

based, driven by the

sink

Multi-path

based, on-line,

distributed

Cluster-based,

on-line,

distributed

synchronous

Cluster-based,

on-line,

distributed

Completely

distributed,

asynchronous

Chain-based,

centralized or

distributed

Tree-based, on-

line, driven by

the sink

Tree-based,

on-line, driven

by the sinkAggregation method

Tributaries and

Deltas [9]

Synopsis

Diffusion [8]

COUGAR

[4] [49]

LEACH

[2]

DB-MAC

[7]

PEGASIS

[3]

Directed

Diffusion [1]

TAG

[5]Algo.

Char.

Algo.

Char.

Fig. 8. Summary of the basic characteristics of the routing protocols.

√(Temporal, Spatial)

√ / XX√(Spatial)

√(Temporal)

Correlation awareness

LowHighLowMediumHighResilience to losses/failures

√X√ / X√XDuplicate sensitive

√ / X√ / X√ / X√√Lossy aggregation

Q-digest [18]

Synopsis Diffusion [8]

AIDA [58]

DADMA [15]

TiNA[13]

√(Temporal, Spatial)

√ / XX√(Spatial)

√(Temporal)

Correlation awareness

LowHighLowMediumHighResilience to losses/failures

√X√ / X√XDuplicate sensitive

√ / X√ / X√ / X√√Lossy aggregation

Q-digest [18]

Synopsis Diffusion [8]

AIDA [58]

DADMA [15]

TiNA[13]

Algo.Char.

Algo.Char.

Fig. 9. Summary of the basic characteristics of the data aggregation functions.

In network aggregation techniques for wireless sensor networks - a survey

Documents

Transcript of In network aggregation techniques for wireless sensor networks - a survey