
  • Bleach: A Distributed Stream Data Cleaning System

    Yongchao Tian, Eurecom
    Biot, France. Email: [email protected]

    Pietro Michiardi, Eurecom
    Biot, France. Email: [email protected]

    Marko Vukolić, IBM Research
    Zurich, Switzerland. Email: [email protected]

    Abstract—Existing scalable data cleaning approaches have focused on batch data cleaning. However, batch data cleaning is not suitable for streaming big data systems, in which dynamic data is generated continuously. Despite the increasing popularity of stream-processing systems, no stream data cleaning techniques have been proposed so far. In this paper, we bridge this gap by addressing the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and the ability to cope with the continuous nature of data streams.

    We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a “cumulative” sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using both synthetic and real data streams and experimentally validate its high throughput, low latency and high cleaning accuracy, which are preserved even with rule dynamics. In the absence of an existing comparable stream-cleaning baseline, we compared Bleach to a baseline system built on the micro-batch streaming paradigm, and experimentally show the superior performance of Bleach.

    I. INTRODUCTION

    Modern big data and machine learning applications critically rely on data, and derivative representations thereof, to meet certain quality criteria. Issues with data quality can lead to misleading analysis outcomes on a “garbage in, garbage out” basis. To address this issue, a range of data cleaning techniques were proposed recently [25], [20], [17]. However, existing data cleaning solutions have focused on batch data cleaning, by processing static data stored in data warehouses.

    This is in sharp contrast with the requirements (and popularity) of distributed systems for streaming data processing (e.g., [23], [2]). In streaming data processing systems, data is continuously and simultaneously generated from potentially thousands of data sources. Examples of streaming data include log files generated by customers of mobile and web applications, online purchases, online gaming, social networks, financial trading floors, telemetry from sensors or other connected devices, etc. In contrast to the popularity of such systems, data cleaning solutions for cleaning streaming data have not received adequate attention.

    In this paper, we address this gap and focus on stream data cleaning. The challenges and requirements in stream data cleaning that make it different from established batch data cleaning are manifold, but we highlight the following ones:

    • Stream cleaning requires both real-time guarantees as well as high accuracy, requirements that are often at odds. A naïve approach to stream data cleaning could simply extend existing batch techniques, by buffering data records in a temporary data store and cleaning them periodically before feeding them into downstream components. Although likely to achieve high accuracy, such a method clearly violates the real-time requirements of streaming applications.

    • The problem is exacerbated by the volume of data that cleaning systems need to process, which prohibits centralized solutions. Therefore, our goal is to design a distributed stream data cleaning system, which achieves efficient and accurate cleaning in real time.

    • Specific challenges that a stream cleaning system needs to address arise due to the long-term and dynamic nature of data streams. The dynamics of streaming systems may cause the very definition of dirty data to change over time.

    In this paper, we specifically focus on rule-based data cleaning, whereby a set of domain-specific rules define how data should be cleaned: in particular, we consider functional dependencies (FDs) and conditional functional dependencies (CFDs). Despite extensive work on rule-based data cleaning [1], [6], [9], [10], [12], [18], [7], [17], we are not aware of any prior rule-based stream data cleaning system. Our system, called Bleach, proceeds in two phases: violation detection, to find rule violations, and violation repair, to repair data based on such violations. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state (e.g., summaries of past data) to repair data, using an incremental equivalence class algorithm. To address the long-term and dynamic nature of streams, Bleach supports dynamic rules, which can be added and deleted without requiring idle time. Additionally, Bleach implements a sliding window operation that trades modest additional storage requirements, to temporarily store cumulative statistics, for increased cleaning accuracy.

    The experimental performance evaluation of our Bleach prototype is two-fold. First, we study the performance, in terms of throughput, latency and accuracy, focusing on the impact of Bleach parameters and on the effects of rule dynamics. Then we compare Bleach to an alternative approach. In the absence of an existing comparable baseline, we compared Bleach to a system that we implemented using micro-batch streaming. Our experiments indicate superior performance of Bleach. Beyond demonstrating the superior performance of our system, our comparative analysis attests that designing an efficient stream-cleaning solution goes beyond a naïve application of batch cleaning techniques to stream processing.

  • The paper is organized as follows. Section II gives a more precise problem statement. We present the Bleach system design in Sections III and IV; dynamic rule management and windowing are discussed in Sections V and VI, respectively. Section VII presents our experimental results. Section VIII overviews related work and Section IX concludes.

    II. PRELIMINARIES

    Next, we introduce the basic notation we use throughout the paper, then we define the problem statement we consider.

    A. Background and Definitions

    In this paper we assume that a stream data cleaning system ingests a data stream and outputs a cleaned data stream instance. We consider an input data stream instance Din with schema S(A1, A2, ..., Am), where Aj is an attribute in schema S. We assume the existence of unique tuple identifiers for every tuple in Din: thus, given a tuple ti, id(ti) is the identifier of ti. In general, we define a function id(e) which returns the identifier (ID) of e, where e can be any element. A list of IDs [id(e1), id(e2), ..., id(en)] is expressed as id(e1, e2, ..., en) for brevity. The output data stream instance Dout complies with schema S and has the same tuple identifiers as Din, i.e., with no tuple loss or duplication. The basic unit, a cell ci,j, is the concatenation of a tuple ID, an attribute and the projection of the tuple on the attribute: ci,j = (id(ti), Aj, ti(Aj)). Note that ti(Aj) is the value of ci,j, which can also be expressed as v(ci,j). Sometimes, we may simply express ci,j as ci when the cell attribute is not relevant to the discussion. In our work, when we point at a specific tuple ti, we also refer to this tuple as the current tuple. Tuples appearing earlier than ti in the data stream are referred to as earlier tuples and those appearing after ti are referred to as later tuples.
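
    To make the notation concrete, the following sketch (hypothetical Python, not Bleach's actual code; the names Cell and make_cells are ours) models tuples and cells as defined above.

    from dataclasses import dataclass
    from typing import Any, Dict, List

    @dataclass(frozen=True)
    class Cell:
        """A cell c_i,j = (id(t_i), A_j, t_i(A_j))."""
        tuple_id: int     # id(t_i)
        attribute: str    # A_j
        value: Any        # t_i(A_j), also written v(c_i,j)

    def make_cells(tuple_id: int, tup: Dict[str, Any]) -> List[Cell]:
        """Decompose a tuple (a dict keyed by attribute name) into its cells."""
        return [Cell(tuple_id, attr, val) for attr, val in tup.items()]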

    To perform data cleaning, we define a set of rules Σ = [r1, ..., rn], in which rk is either a functional dependency (FD) rule or a conditional FD (CFD) rule. Each rule has a unique rule identifier id(rk). A CFD rule rk is represented by (X → A, cond(Y)), in which cond(Y) is a boolean function on a set of attributes Y where Y ⊆ S. X and A are respectively referred to as the set of left-hand side (LHS) attributes and the right-hand side (RHS) attribute: LHS(rk) = X, RHS(rk) = A. When the rule is clear from the context, we omit rk so that LHS = X, RHS = A. Cells of LHS (RHS) attributes are also referred to as LHS (RHS) cells. Y is referred to as the set of conditional attributes. If there exists a pair of tuples t1 and t2 satisfying condition cond(t1(Y)) = cond(t2(Y)) = true where t1(B) = t2(B) for all B ∈ X but t1(A) ≠ t2(A), then we say t1 and t2 violate rk.

    A data stream instance D satisfies rk, denoted as D |= rk, when no violations of rk exist in D. An FD rule can be seen as a special case of a CFD rule where cond(Y) is always true and Y is ∅. We refer to an attribute as an intersecting attribute if it is involved in multiple rules. If D satisfies a set of rules Σ, denoted D |= Σ, then D |= rk for all rk ∈ Σ. If D does not satisfy Σ, D is a dirty data stream instance.
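
    As a further sketch (again hypothetical Python with illustrative names), an FD/CFD rule and the pairwise violation test between two tuples can be written as below; an FD is simply a CFD whose condition always holds.

    from dataclasses import dataclass
    from typing import Any, Callable, Dict, List

    StreamTuple = Dict[str, Any]   # a tuple of the stream, keyed by attribute name

    @dataclass
    class CFDRule:
        rule_id: str
        lhs: List[str]                                        # X
        rhs: str                                              # A
        cond: Callable[[StreamTuple], bool] = lambda t: True  # cond(Y); always true for an FD

    def violates(rule: CFDRule, t1: StreamTuple, t2: StreamTuple) -> bool:
        """t1 and t2 violate the rule iff both satisfy cond(Y), agree on all
        LHS attributes, but disagree on the RHS attribute."""
        if not (rule.cond(t1) and rule.cond(t2)):
            return False
        if any(t1[b] != t2[b] for b in rule.lhs):
            return False
        return t1[rule.rhs] != t2[rule.rhs]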

    B. Challenges and Goals

    Fig. 1. Illustrative example of a data stream consisting of on-line transactions.

         item          category   clientid   city     zipcode
    t1   MacBook       computer   11111      France   75001
    t2   bike          sports     33333      Lyon     null
    t3   Interstellar  movies     22222      Paris    75001
    t4   bike          toys       44444      Nice     06000
    t5   Titanic       movies     11111      Paris    null
    (the tuples t1, ..., t5 arrive in this order over time)

    An ideal stream data cleaning system should accept a dirty input stream Din and output a clean stream Dout, in which all rule violations in Din are repaired (Dout |= Σ). However, this is not possible in reality due to:

    • Real-time constraint: As the data cleaning is incremental, the cleaning decision for a tuple (repair or not repair) can only be made based on itself and earlier tuples in the data stream, which is different from data cleaning in data warehouses, where the entire dataset is available. In other words, if a dirty tuple only has violations with later tuples in the data stream, it cannot be cleaned. A late update for a tuple in the output data stream cannot be accepted.

    • Dynamic rules: In a stream data cleaning system, the rule set is not static. A new rule may be added or an obsolete rule may be deleted at any time. A processed data tuple cannot be cleaned again with an updated rule set. Reprocessing the whole data stream whenever the rule set is updated is not realistic.

    • Unbounded data: A data stream produces an unbounded amount of data, which cannot be stored completely. Thus, stream data cleaning cannot afford to perform cleaning on the full data history. Namely, if a dirty tuple only has violations with tuples that appear much earlier in the data stream, it is likely that such a tuple will not be cleaned.

    Consider the example in Figure 1, which is a data stream of on-line shopping transactions. Each tuple represents a purchase record, which contains a purchased item (item), the category of that item (category), a client identifier (clientid), the city of the client (city) and the zip code of that city (zipcode). In the example, we show an extract of five data tuples of the data stream, from t1 to t5. Now, assume we are given two FD rules and one CFD rule stating what a clean data stream should look like: (r1) the same items can only belong to the same category; (r2) two records with the same clientid must have the same city; (r3) two records with the same non-null zip code must have the same city:

    (r1) item → category
    (r2) clientid → city
    (r3) zipcode → city, zipcode ≠ null
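
    Using the sketch from Section II-A (hypothetical code, for illustration only), the three example rules and violation v1 can be expressed as:

    r1 = CFDRule("r1", lhs=["item"], rhs="category")
    r2 = CFDRule("r2", lhs=["clientid"], rhs="city")
    r3 = CFDRule("r3", lhs=["zipcode"], rhs="city",
                 cond=lambda t: t["zipcode"] is not None)

    t1 = {"item": "MacBook", "category": "computer", "clientid": "11111",
          "city": "France", "zipcode": "75001"}
    t3 = {"item": "Interstellar", "category": "movies", "clientid": "22222",
          "city": "Paris", "zipcode": "75001"}
    assert violates(r3, t1, t3)   # v1: same non-null zipcode, different city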

    In our example, there are three violations of rules r1, r2 and/or r3: (v1) t1 and t3 have the same non-null zipcode (t1(zipcode) = t3(zipcode) ≠ null) but different city names (t1(city) ≠ t3(city)); (v2) t2 claims bikes belong to category sports while t4 classifies bikes as toys (t2(item) = t4(item), t2(category) ≠ t4(category)); and (v3) t1 and t5 have the same clientid but different city names (t1(clientid) = t5(clientid), t1(city) ≠ t5(city)).

    Fig. 2. An example of a violation graph, derived from our running example: one subgraph connects t1(city), t3(city) and t5(city) through violations v1 and v3; another subgraph contains t2(category) and t4(category) from violation v2.

    Note that when a stream data cleaning system receives tuple t1, no violation can be detected as, in our example, t1 only has violations with later tuples t3 and t5. Thus, no modification can be made to t1. Furthermore, delaying the cleaning process for t1 is not a feasible option, not only because of real-time constraints, but also because it is difficult to predict for how long this tuple should be buffered for it to be cleaned. Therefore, stream data cleaning must be incremental: whenever a new piece of data arrives, the data cleaning process starts immediately. Although performing incremental violation detection seems straightforward, incremental violation repair is much more complex to achieve. Coming back to the example in Figure 1, assume that the stream cleaning system receives tuple t5 and successfully detects the violation v3 between t5 and t1. Such detection is not sufficient to make the correct repair decision, as the tuple t1 also conflicts with another tuple, t3. An incremental repair in a stream data cleaning system should also take the violations among earlier tuples into account.

    To account for the intricacies of the violation repair process, we use the concept of a violation graph [17]. A violation graph is a data structure containing the detected violations, in which each node represents a cell. If some violations share a common cell, they are grouped into a single subgraph. Therefore, the violation graph is partitioned into smaller independent subgraphs. A single cell can only be in one subgraph. If two subgraphs share a common cell, they need to merge. The repair decision for a tuple is only relevant to the subgraphs in which its cells are involved. A violation graph for our example can be seen in Figure 2. Given this violation graph, to make the repair decision for tuple t5, the cleaning system can only rely on the upper subgraph, which consists of violations v1 and v3 with the common cell t1(city). We now give our problem statement as follows.

    Problem statement: Given an unbounded data stream with an associated schema (footnote 1) and a dynamic set of rules, how can we design an incremental and real-time data cleaning system, including violation detection and violation repair mechanisms, using bounded computing and storage resources, to output a cleaned data stream?

    In the following sections, we overview the Bleach architecture and provide details about its components. As shown in Figure 3, the input data stream first enters the detect module (Section III), which reveals violations against defined rules. The intermediate data stream is enriched with violation information, which the repair module (Section IV) uses to make repair decisions. Finally, the system outputs a cleaned data stream. The rule controller module is discussed in Section V, and the windowing operation in Section VI.

    1 Note that although we restrict the data stream to have a fixed schema in this work, it is easy to extend our work to support a dynamic schema.

    Fig. 3. Stream data cleaning overview: the data stream flows through the detect module (backed by the data history), then, as an intermediate data stream with violations, through the repair module (backed by the violation graph), producing the cleaned data stream; a rule controller feeds rule updates to both modules.

    Fig. 4. The Detect Module

    III. VIOLATION DETECTION

    The violation detection module aims at finding input tuples that violate rules. To do so, it stores the tuples in memory, in an efficient and compact data structure that we call the data history. Input tuples are thus compared to those in the data history to detect violations. Figure 4 illustrates the internals of the detect module: it consists of an ingress router, an egress router and multiple detect workers (DW). Bleach maps violation rules to such DWs: each worker is in charge of finding violations for a specific rule.

    A. The Ingress Router

    The role of the ingress router is to partition and distribute incoming tuples to DWs. As discussed in Section II, only a subset of the attributes of an input tuple are relevant when verifying data validity against a given rule. For example, an FD rule only requires its LHS and RHS attributes to be verified, ignoring the rest of the input tuple attributes.

    Therefore, when the ingress router receives an input tuple, it partitions the tuple based on the current rule set, and only sends the relevant information to each DW in charge of each specific rule. As such, an input tuple is broken into multiple sub-tuples, which all share the identifier of the corresponding input tuple. Note that some attributes of an input tuple might be required by multiple rules: in this case, sub-tuples will contain redundant information, allowing each DW to work independently. An example of tuple partitioning can be found in Figure 4, where we reuse the input data schema and the rules from Section II.

  • B. The Detect Worker

    Each DW is assigned a rule, and receives the relevant sub-tuples stemming from the input stream. For each sub-tuple, a DW performs a lookup operation in the data history, and emits a message to downstream components when a rule violation is detected. To achieve efficiency and performance, lookup operations need to be fast, and the intermediate data stream should avoid redundant information. Next, we describe how the data history is represented and materialized in memory; then, we describe the output messages a DW generates, and finally outline the DW algorithm.

    Data history representation. A DW accumulates relevant input sub-tuples in a compact data structure that enables an efficient lookup process, which makes it similar to a traditional indexing mechanism. The structure of the data history (footnote 2) is illustrated in Figure 5. First, to speed up the lookup process, sub-tuples are grouped by the value of the LHS attributes used by a given rule: we call such a group a cell group (CG). Thus, a CG stores all RHS cells whose sub-tuples share the same LHS value. The identifier of a cell group cgl is the combination of the rule assigned to the DW and the value of the LHS attributes, expressed as id(cgl) = (id(rk), t(LHS)), where rk is the rule assigned to the DW.

    Next, to achieve a compact data representation, all cells in a CG sharing the same RHS value are grouped into a super cell (SC): scm = [c1,j, c2,j, ..., cn,j]. From Section II, recall that a cell is made of a tuple ID, an attribute and a value: (id(ti), Aj, ti(Aj)). Therefore, a super cell can be compressed as a list of tuple IDs, an attribute and their common value: scm = (id(t1, t2, ..., tn), Aj, t(Aj)) where t(Aj) = t1(Aj) = ... = tn(Aj). Hence, within an individual DW, sub-tuples whose cells are compressed in the same SC are equivalent, as they have the same LHS attributes value (the identity of the cell group) and the same RHS attribute value (the value of the super cell). A cell group cgl can now be expressed as: cgl = ((id(rk), t(LHS)), [sc1, sc2, ...]), including an identifier and a list of super cells.

    In summary, the lookup process for a given input sub-tuple is as follows. Cell groups are stored in a hash map using their identifiers as keys: therefore the DW first finds the CG corresponding to the current sub-tuple. Cells in the corresponding CG are the only cells that might be in conflict with the current cell. Overall, the complexity of the lookup process for a sub-tuple is O(1).

    Violation messages. DWs generate an intermediate data stream of violation messages, which help downstream components to eventually repair input tuples. The goal of the DW is to generate as few messages as possible, while allowing effective data repair. When the lookup process reveals that the current tuple does not violate a rule, DWs emit a non-violation message (msgnvio). Instead, when a violation is detected, a DW constructs a message with all the necessary information to repair it, including: the ID of the cell group corresponding to the current tuple and the RHS cells of the current and earlier tuples in the data history: msgvio = (id(cgl), ccur, cold).
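
    The sketch below (hypothetical Python, building on the CFDRule sketch of Section II; DetectWorker and the message layout are illustrative, not Bleach's actual code) condenses the data history and the emission logic of a DW: cell groups form a hash map keyed by the LHS value, and each cell group maps an RHS value to the list of tuple IDs forming a super cell.

    from collections import defaultdict
    from typing import Any, Dict, List, Optional, Tuple

    class DetectWorker:
        """Data history for one rule, plus the per-sub-tuple lookup."""

        def __init__(self, rule: CFDRule):
            self.rule = rule
            # cell group id -> {RHS value -> list of tuple IDs (a super cell)}
            self.history: Dict[Tuple, Dict[Any, List[int]]] = defaultdict(dict)

        def receive(self, tid: int, sub: Dict[str, Any]) -> Optional[dict]:
            """Return a violation message, or None for a non-violation."""
            if not self.rule.cond(sub):
                return None
            cg_id = (self.rule.rule_id, tuple(sub[a] for a in self.rule.lhs))
            cg = self.history[cg_id]            # creates the cell group if missing
            rhs_val = sub[self.rule.rhs]
            msg = None
            if len(cg) == 1:                    # exactly one super cell sc_old
                (old_val, old_ids), = cg.items()
                if old_val != rhs_val:          # complete violation message
                    msg = {"cg": cg_id, "cur": (tid, rhs_val),
                           "old": (tuple(old_ids), old_val)}
            elif len(cg) > 1:                   # append-only violation message
                msg = {"cg": cg_id, "cur": (tid, rhs_val), "old": None}
            cg.setdefault(rhs_val, []).append(tid)   # add c_cur to the cell group
            return msg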

    Now, to reduce the number of violation messages, the DW can use a super cell in place of a single cell (cold) in conflict with the current tuple. In addition, recall that a single CG can contain multiple super cells, thus possibly requiring multiple messages for each group. However, we observe that two cells in the same CG must also conflict with each other, as long as their values are different. Since the data repair module in Bleach is stateful, it is safe to omit some violation messages.

    2 The techniques we use are similar to the notion of partitions and compression introduced in Nadeef [10].

    Fig. 5. The structure of the data history in a detect worker: cell groups are indexed by v(LHS); within a cell group, super cells are indexed by v(RHS).

    Algorithm 1 Violation Detection
     1: given rule r = (X → Aj, cond(Y))
     2: procedure RECEIVE(sub-tuple ti)              ▷ cond(ti(Y)) = true
     3:   if ∃ id(cgl) = (id(r), ti(X)) then
     4:     if |cgl| = 1 then                        ▷ cgl contains scold
     5:       if v(scold) = ti(Aj) then
     6:         Emit msgnvio
     7:       else
     8:         Emit msgvio(id(cgl), ccur, scold)
     9:       end if
    10:     else
    11:       Emit msgvio(id(cgl), ccur, null)
    12:     end if
    13:   else
    14:     Create cgl                               ▷ Create a new cell group
    15:     Emit msgnvio
    16:   end if
    17:   Add ccur to cgl
    18: end procedure

    Algorithm details. Next, we present the details of the DW violation detection algorithm, as illustrated in Algorithm 1. The algorithm starts by treating FD rules as a special case of CFD rules (line 1). Then, when a DW receives a sub-tuple ti satisfying the rule condition (line 2), it performs a lookup in the data history to check if the corresponding cell group cgl exists (line 3). If yes, it determines the number of SCs contained in cgl (line 4). If there is only one SC scold, violation detection works as follows. If the RHS cell of the current sub-tuple, ccur, has the same value as scold, it emits a non-violation message (lines 5-6). Otherwise, a violation has been detected: the DW emits a complete violation message, containing both the current cell and the old cell (line 8). If the CG contains more than one SC, the DW emits a single append-only violation message, which only contains the cell of the current sub-tuple (line 11). Such compact messages omit the SCs from the data history, since they must be contained in earlier violation messages. Finally, if the lookup procedure (line 3) fails, the DW creates a new cell group and emits a non-violation message (lines 14-15). At this point, the current cell ccur is added to the corresponding cell group cgl (line 17), either in an existing SC, or as a new distinct cell. It is worth noticing that, following Algorithm 1, a DW emits a single message for each input sub-tuple, no matter how many tuples in the data history it conflicts with.

    Fig. 6. Violation Repair: an ingress router broadcasts the intermediate data stream with violations to the repair workers, which hold partitions of the violation graph; a coordinator and an aggregator complete the module.

    C. The Egress Router

    The egress router gathers (violation or non-violation) messages for a given data tuple, as received from all DWs, and sends them downstream to the repair module.

    IV. VIOLATION REPAIR

    The goal of this module is to take repair decisions for dirty data tuples, based on the intermediate stream of violation messages generated by the detect module. To achieve this, Bleach uses a data structure called the violation graph. Violation messages contribute to the creation and dynamics of the violation graph, which essentially groups those cells that, together, are used to perform data repair. Figure 6 sketches the internals of the repair module: it consists of an ingress router, the repair workers (RW), and an aggregation component that emits clean data. An additional component, called the coordinator, steers violation graph management, with the contribution of RWs.

    A. The Ingress Router

    The ingress router broadcasts all incoming violation messages to all RWs. As opposed to its counterpart in the detection module, it does not perform data partitioning. Although each RW receives all violation messages, a cell in a violation message will only be stored in one RW, with the goal of creating and maintaining the violation graph.

    B. The Repair Worker

    Next, we describe the operation of a RW. First, we focus on the violation graph and the data repair algorithm. Then, we move to the key challenge that RWs address, that is, how to maintain a distributed violation graph. As such, we focus on graph partitioning and maintenance. Due to violation graph dynamics, coordination issues might arise in a distributed setting: such problems are addressed by the coordinator component.

    The repair algorithm. Current data repair algorithms use a violation graph to repair dirty data based on user-defined rules. A violation graph is a succinct representation of cells (both current and historical) that are in conflict according to some rules. A violation graph is composed of subgraphs. As incoming data streams in, the violation graph evolves: specifically, its subgraphs might merge or split, depending on the contents of violation messages. Using the violation graph, several algorithms can perform data cleaning, such as the equivalence class algorithm [5] or the holistic data cleaning algorithm [9]. Currently, Bleach uses an incremental version of the equivalence class algorithm that supports streaming input data, although alternative approaches can easily be plugged into our system. Thus, a subgraph in the violation graph can be interpreted as an equivalence class, in which all cells are supposed to have the same value.

    The Bleach violation graph is built using the violation messages output by the detect module. We say that a subgraph sg intersects with a violation message msgvio, denoted by msgvio ∩ sg ≠ ∅, either when any of the current or old cells encapsulated in msgvio are already contained in sg or when sg has cells which are in the same cell group as any of the cells in msgvio. When there is only one RW, upon receiving a violation message msgvio, the RW checks if there is a subgraph intersecting with msgvio. If such a sg exists, the RW adds msgvio to sg, denoted by msgvio --add--> sg, by adding both cells in msgvio. If none of the subgraphs intersects with msgvio, a new subgraph is created with the two cells in msgvio, denoted by msgvio --add--> null. If more than one such subgraph exists, Bleach merges these subgraphs into a single subgraph, and then adds msgvio to it: msgvio --add--> (sg1, sg2, ...)merged. We define a subgraph identifier id(sgk) to be the list of cell group IDs comprised in sgk: id(cg1, cg2, ...). A subgraph can be expressed as sgk = (id(cg1, cg2, ...), [sc1, sc2, ...]): it consists of a group of SCs, stored in compressed format, as shown in Section III-B. Note that when two subgraphs merge, their identifiers are also merged by concatenating both CG ID lists. To make the subgraph ID explicit, sgk can be written as sgid(cg1,cg2,...).
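
    A minimal sketch of this bookkeeping for the single-RW case (hypothetical Python; Subgraph and the message layout are illustrative): intersecting subgraphs are merged, their identifiers concatenated, and the cells of the message added to the result.

    from typing import List

    class Subgraph:
        def __init__(self, cg_ids, cells):
            self.cg_ids = list(cg_ids)   # subgraph identifier: list of cell group IDs
            self.cells = set(cells)      # cells stored in this subgraph

    def add_violation(graph: List[Subgraph], msg: dict) -> None:
        """Fold one violation message into the violation graph of a single RW."""
        new_cells = {c for c in (msg["cur"], msg["old"]) if c is not None}
        # A subgraph intersects the message if it already holds one of its
        # cells, or any cell of the same cell group.
        hits = [sg for sg in graph
                if msg["cg"] in sg.cg_ids or (sg.cells & new_cells)]
        if not hits:                              # msg --add--> null
            graph.append(Subgraph([msg["cg"]], new_cells))
            return
        target, rest = hits[0], hits[1:]
        for sg in rest:                           # merge intersecting subgraphs
            target.cg_ids += sg.cg_ids
            target.cells |= sg.cells
            graph.remove(sg)
        if msg["cg"] not in target.cg_ids:
            target.cg_ids.append(msg["cg"])
        target.cells |= new_cells                 # msg --add--> merged subgraph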

    Distributed violation graph. Due to the unbounded nature of streaming data, it is reasonable to expect the violation graph to grow to sizes exceeding the capacity of a single RW. As such, in Bleach, the violation graph is a distributed data structure, partitioned across all RWs. However, unlike for DWs, the partitioning scheme cannot simply be rule based, because a cell may violate multiple rules, creating issues related to coordination and load balancing. More generally, no partitioning scheme can guarantee that cells from a single violation message or a single subgraph are placed in a single RW. Therefore, Bleach partitions the violation graph based on cells, using cell tuple IDs (e.g., hash partitioning). Since violation messages are broadcast to all RWs, a violation message msgvio is partially added to a subgraph sg in each RW, denoted by msgvio --p-add--> sg, such that only cells matching the partitioning scheme are added to sg. Hence, a subgraph spans several RWs, each storing a fraction of the cells comprised in the subgraph. We use the subgraph identifier to recognize partitions of the same subgraph.
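
    A possible rendering of the partitioning step (hypothetical helper functions): every RW sees the broadcast message, but it only materializes the cells whose tuple ID hashes to it, while the cell group ID is always kept so that all RWs can name the same subgraph.

    NUM_RW = 4   # number of repair workers (illustrative)

    def owner(tuple_id) -> int:
        """Hash partitioning: a cell is stored by exactly one RW, chosen by its tuple ID."""
        return hash(tuple_id) % NUM_RW

    def local_view(msg: dict, rw_index: int) -> dict:
        """The local portion of a broadcast violation message (msg --p-add--> sg)."""
        def keep(cell):
            return cell is not None and owner(cell[0]) == rw_index
        return {"cg": msg["cg"],
                "cur": msg["cur"] if keep(msg["cur"]) else None,
                "old": msg["old"] if keep(msg["old"]) else None}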

    An illustrative example is in order. Let us assume there are two RWs, rw1 and rw2, and the current violation graph consists of two subgraphs: sgid(cg1), containing cells c1, c2, c3, and sgid(cg2), containing cells c4, c5. In our example, the violation graph is partitioned as in Figure 7(a): both RWs have a portion of the cells of every subgraph.

    Fig. 7. Violation graph build example: (a) initial state, with sgid(cg1) = {c1, c2, c3} and sgid(cg2) = {c4, c5} split across rw1 and rw2; (b) after a new violation, the merge happens only in rw1, yielding inconsistent subgraph identifiers; (c) consistent state after the merge happens in both rw1 and rw2.

    C. The Coordinator

    The problem we address now stems from violation graph dynamics: the graph evolves as new violation messages stream into the repair module. As each subgraph is partitioned among all RWs, subgraph partitions must be identified by the same ID. Continuing with the example from Figure 7(a), suppose a new violation message {id(cg3), c6, c1} is received by both RWs. Now, in rw1, the new violation is added to subgraph sgid(cg1), since both the message and the subgraph share the same cell c1: as such, the new subgraph becomes sgid(cg1,cg3). Instead, in rw2, the new violation triggers the creation of a new subgraph sgid(cg3), since no common cells are shared between the message and the existing subgraphs in rw2. The violation graph becomes inconsistent, as shown in Figure 7(b): this is a consequence of the independent operation of RWs. Instead, the repair algorithm requires the violation graph to be in a consistent state, as shown in Figure 7(c), where both RWs use the same subgraph identifier for the same equivalence class.

    To guarantee the consistency of the violation graph among independent RWs, Bleach uses a stateless coordinator component that helps RWs agree on subgraph identifiers. In what follows we present three variants of the simple protocol RWs use to communicate with the coordinator.

    RW-basic. Algorithm 2 demonstrates how RWs work with the coordinator in the RW-basic approach. When a RW receives violation messages for a tuple, it adds the cells in the messages to the violation graph, according to its local state and the partitioning scheme. Note that in Algorithm 2 (lines 4-6), (sgi1, sgi2, ...)merged is a general case, which also covers the situations where there is no or only one intersecting subgraph. Then, the RW creates a merge proposal containing the subgraph IDs for each conflicting attribute, and sends it to the coordinator. Once the coordinator receives merge proposals from all RWs, it merges subgraph IDs for each attribute from the various merge proposals and produces a merge decision, which is sent back to all RWs. With the merge decision, RWs merge their local subgraphs and converge to a globally consistent state. Then, RWs are ready to generate repair proposals (more details in Section IV-D).

    Algorithm 2 RW-basic
     1: procedure RECEIVE([msgvio1, msgvio2, ...])
     2:   Initialize a merge proposal mp
     3:   for msgvioi in [msgvio1, msgvio2, ...] do
     4:     Find [sgi1, sgi2, ...] where msgvioi ∩ sgij ≠ ∅
     5:     msgvioi --p-add--> (sgi1, sgi2, ...)merged
     6:     Add (Attr(msgvioi), id((sgi1, sgi2, ...)merged)) to mp
     7:   end for
     8:   Send mp to the coordinator
     9: end procedure
    10: procedure RECEIVE(md)                        ▷ merge decision
    11:   for (Ai, id(sgi)) in md do
    12:     Find [sgi1, sgi2, ...] where id(sgij) ⊆ id(sgi)
    13:     Merge subgraphs [sgi1, sgi2, ...] such that id((sgi1, sgi2, ...)merged) = id(sgi)
    14:   end for
    15:   Send a repair proposal to the aggregator
    16: end procedure

    Clearly, such a simple approach to coordination harms Bleach performance. Indeed, the RW-basic scheme requires one round-trip message for every incoming data tuple, from all RWs. However, we note that coordination is not always needed for every tuple. For example, when every cell violates at most one rule, every subgraph would only have a single CG ID; thus, coordination is not necessary. More generally, given the violation messages for a tuple, coordination is only necessary when there is a complete violation message containing an old cell which already exists in the violation graph because of a different violation rule. Figure 8 gives an example, where the initial state (Figure 8(a)) is the same as in Figure 7(a). Then, two violation messages, {id(cg1), c6, null} and {id(cg2), c6, null}, are received. Cell c6 is a current cell contained in the current tuple. Obviously sgid(cg1) and sgid(cg2) should merge into sgid(cg1,cg2). This can be accomplished without coordination by both repair workers, as shown in Figure 8(b). Indeed, each RW is aware that c6 is involved in two subgraphs, although c6 is only stored in rw2 because of the partitioning scheme.

    Next, we use the above observations and propose two variants of the coordination mechanism that aim at bypassing the coordinator component to improve performance.

    RW-dr. In RW-dr, coordination is only conducted when necessary: in that case, the repair worker sends a merge proposal to the coordinator and waits for the merge decision. However, this approach is not exempt from drawbacks: it may cause some data tuples in the stream to be delivered out of order. This is because the repair worker waits for the merge decision in a non-blocking way: the violation messages of a tuple which do not require coordination may be processed in the coordination gap of an earlier tuple.

    RW-ir. With this variant, no matter whether the violation messages of a tuple require coordination or not, a RW immediately updates its local subgraphs, executes the repair algorithm and emits a repair proposal downstream to the aggregator component. Then, if necessary, the RW lazily executes the coordination protocol. Clearly, this approach caters to system performance and avoids tuples being delivered out of order, but might harm cleaning accuracy. Indeed, individual data repair proposals from a RW are based on a local view prior to finishing all necessary merge operations on subgraphs, which has a direct impact on equivalence classes.

    Fig. 8. Example of a violation graph built without coordination: (a) initial state, as in Figure 7(a); (b) after independent processing, both RWs converge to sgid(cg1,cg2) without involving the coordinator.

    D. The Aggregator

    With the consistent distributed violation graph, each RW emits a data repair proposal, which includes all candidate values (footnote 3) and their frequencies computed in its local subgraph partition. The aggregator component collects all repair proposals and selects, as the candidate value to repair a given cell, the one having the highest aggregate frequency. Finally, the aggregator modifies the current data tuple and outputs a clean data stream.
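
    A sketch of the aggregation step (hypothetical Python): each RW reports, per candidate value, the frequency observed in its local subgraph partition, and the aggregator picks the value with the highest aggregate frequency.

    from collections import Counter
    from typing import Any, Dict, List

    def aggregate_repair(proposals: List[Dict[Any, int]]) -> Any:
        """Merge per-RW {candidate value -> local frequency} proposals and
        return the candidate with the highest aggregate frequency."""
        totals = Counter()
        for proposal in proposals:
            totals.update(proposal)
        value, _count = totals.most_common(1)[0]
        return value

    # e.g., two RW partitions of the subgraph for t5(city) in our running example:
    # {"Paris": 1, "France": 1} and {"Paris": 1} aggregate to Paris (2 vs. 1).
    assert aggregate_repair([{"Paris": 1, "France": 1}, {"Paris": 1}]) == "Paris"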

    Note that the aggregator only modifies the current tuples in the output stream. Instead, cells stored in the violation graph are not modified, regardless of the repair decision: this allows updating frequency counts as new data streams into the system, thus steering the aggregator to make different repair decisions as the violation graph evolves. To avoid potential bottlenecks, Bleach can have multiple coordinators and aggregators, so that their workload can be distributed based on current tuple IDs.

    3 In case there are too many candidate values, we only send the top-k values, where k = 5.

    V. DYNAMIC RULE MANAGEMENT

    In stream data cleaning the rule set is usually not immutable but dynamic. Therefore, we now introduce a new component, the rule controller, shown in Figure 3, which allows Bleach to adapt to rule dynamics. The rule controller accepts rule updates as input and guides the detect and repair modules to adapt to rule dynamics without stopping the cleaning process and without losing state. Rule updates can be of two types: one for adding a new rule and one for deleting an existing rule.

    Detect. In the detect module, the addition of a rule triggers the instantiation of a new DW, as input tuples are partitioned by rule. The new DW starts with no state, which is built upon receiving new input tuples. As such, violation detection using past tuples cannot be achieved, which is consistent with the Bleach design goals. Instead, the deletion of an existing rule simply triggers the removal of a DW, with its own local data history.

    Repair. In the repair module, the addition of a new rule is not problematic with respect to violation graph maintenance operations. Instead, the removal of a rule implies violation graph dynamics (subgraphs might shrink or split) which are more challenging to address. Thus, in a subgraph, we further group cells by cell groups. A subgraph can now also be expressed as: sgk = (id(cg1, cg2, ...), [cg1, cg2, ...]), where each cell group gathers super cells. Some cells might span multiple groups, as they may violate multiple rules. We label such peculiar cells as hinge cells. For each hinge cell, the subgraph keeps the IDs of its connecting cell groups: c*i,j = (ci,j, id(cgi1, cgi2, ...)). Hinge cells with the same value and the same connecting cell groups are also compressed into super cells.

    Fig. 9. Subgraph split example: (a) the initial subgraph sgid(cg1,cg2,cg3), with hinge cells c1 and c7 bridging its three cell groups; (b) the result of deleting the rule behind cg2; (c) the incorrect result of deleting the rule behind cg3; (d) the correct result of deleting the rule behind cg3, in which the subgraph splits into sgid(cg1) and sgid(cg2).

    With the new organization of cells in subgraphs, the violation graph updates as follows upon the removal of a rule. If a subgraph contains a single cell group related to the deleted rule, RWs are simply instructed to remove it. If a subgraph contains multiple cell groups, RWs remove the cell groups related to the deleted rule and update the hinge cells. With the remaining hinge cells, RWs check the connectivity of the remaining cell groups in the subgraph and decide whether to split the subgraph or not. An example of a split operation can be seen in Figure 9. The initial state of a subgraph is shown in Figure 9(a): the subgraph is sgid(cg1,cg2,cg3), and its contents are three cell groups. Cells c1 and c7 are hinge cells, which work as bridges, connecting different cell groups together. Now, as a simple case, assume we want to remove the rule pertaining to cg2: the subgraph should become sgid(cg1,cg3), as shown in Figure 9(b). Note that cell c7 loses its status of hinge cell. A more involved case arises when we delete the rule pertaining to cg3 instead of the rule pertaining to cg2. In this case, the subgraph should not become sgid(cg1,cg2) as shown in Figure 9(c). Indeed, removing cg3 eliminates all existing hinge cells connecting the remaining cell groups. Thus, the subgraph must split into two separate subgraphs, sgid(cg1) and sgid(cg2), as shown in Figure 9(d).
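
    A sketch of the split check (hypothetical Python): after dropping the cell groups of the deleted rule, the remaining cell groups are re-clustered through the hinge cells that still bridge them, and each connected component becomes a subgraph of its own.

    from typing import Dict, List, Set

    def split_subgraph(cell_groups: Set[str],
                       hinge_cells: Dict[str, Set[str]],
                       deleted: Set[str]) -> List[Set[str]]:
        """cell_groups: CG IDs of the subgraph; hinge_cells: cell -> CG IDs it connects;
        deleted: CG IDs of the removed rule. Returns the surviving components."""
        remaining = cell_groups - deleted
        parent = {cg: cg for cg in remaining}     # union-find over cell groups
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for linked_cgs in hinge_cells.values():
            linked = [cg for cg in linked_cgs if cg in remaining]
            for a, b in zip(linked, linked[1:]):  # a surviving hinge cell unions its CGs
                parent[find(a)] = find(b)
        components: Dict[str, Set[str]] = {}
        for cg in remaining:
            components.setdefault(find(cg), set()).add(cg)
        return list(components.values())

    # Consistent with Figure 9: deleting cg3 leaves c1 in cg1 only and c7 in cg2 only,
    # so the subgraph splits into {cg1} and {cg2}.
    print(split_subgraph({"cg1", "cg2", "cg3"},
                         {"c1": {"cg1", "cg3"}, "c7": {"cg2", "cg3"}},
                         deleted={"cg3"}))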

    VI. WINDOWING

    Bleach provides windowed computations, which allow expressing data cleaning over a sliding window of data. Despite being a common operation in most streaming systems, window-based data cleaning addresses the challenge of the unbounded nature of streaming data: without windowing, the data structures Bleach uses to detect and repair a dirty stream would grow indefinitely, which is impractical. In this section, we discuss two windowing strategies: a basic, tuple-based windowing strategy and an advanced strategy that aims at improving cleaning accuracy.

    A. Basic Windowing

    The underlying idea of the basic windowing strategy is to only use tuples within the sliding window to populate the data structures used by Bleach to achieve its tasks. Next, we outline the basic windowing strategy for both DW and RW operation.

    Windowed Detection. We now focus on how DWs maintain their local data history. Clearly, the data history only contains cells that fall within the current window. When the window slides forward, DWs update the data history as follows: i) if a cell group ends up having no cells in the new window, DWs simply delete it; ii) for the remaining cell groups, DWs drop all cells that fall outside the new window, and update the remaining super cells accordingly.

    Note that, if implemented naively, the first operation above can be costly, as it involves a linear scan of all cell groups. To improve the efficiency of data history updates, Bleach uses the following approach. It creates a FIFO queue of k lists, which store cell groups. In case the sliding step is half the window size, k = 2; more generally, we set k to be the window size divided by the sliding step. Any new cell group from the current window enters the queue in the k-th list. Any update to a cell group, e.g., a new cell added to it, “promotes” the group from list j to list k. As the window slides forward, the queue drops the list (and its cell groups) in the first position and acquires a new empty list in position k + 1.
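
    A sketch of this queue (hypothetical Python): cell groups that see no activity for k window slides fall off the front and can be deleted without scanning the whole data history.

    from collections import deque

    class ExpiryQueue:
        """FIFO queue of k sets of cell-group IDs, as described above."""

        def __init__(self, window_size: int, sliding_step: int):
            self.k = window_size // sliding_step
            self.lists = deque(set() for _ in range(self.k))
            self.position = {}                    # cg_id -> the set it currently sits in

        def touch(self, cg_id) -> None:
            """A new or updated cell group is (re)placed in the most recent list."""
            old = self.position.get(cg_id)
            if old is not None:
                old.discard(cg_id)
            self.lists[-1].add(cg_id)
            self.position[cg_id] = self.lists[-1]

        def slide(self):
            """The window slides: drop the oldest list, return its (expired) cell
            groups, and append a fresh empty list."""
            expired = self.lists.popleft()
            for cg_id in expired:
                del self.position[cg_id]
            self.lists.append(set())
            return expired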

    Windowed Repair. Now we focus on how to maintain the violation graph in RWs. Again, the violation graph only stores cells within the current window. When the window slides forward, RWs update the violation graph as follows:

    • If a subgraph has no cells in the new window, RWs delete the subgraph;

    • For the remaining subgraphs, if a cell group has no cells in the new window, RWs delete the cell group;

    • RWs also delete hinge cells that are outside of the new window. This could require subgraphs to split, as they could miss a “bridge” cell connecting their cell groups;

    • For the remaining cell groups, RWs drop all cells outside of the new window, and update the remaining super cells accordingly.

    For efficiency reasons, Bleach uses the same k-list approach described for DWs to manage violation graph updates due to a sliding window.

    B. Bleach Windowing

    The basic windowing strategy only relies on the data within the current window to perform data cleaning, which may limit the cleaning accuracy. We begin with a motivating example, then describe the Bleach windowing strategy, which aims at improving cleaning accuracy. Note that here we only focus on the repair module and its violation graph, since Bleach windowing does not modify the operation of the detect module.

    Fig. 10. Motivating example: basic vs. Bleach windowing. All tuples have A = a; the sliding window covers tuples [1, 4] and then [3, 6].

                                           t1  t2  t3  t4  t5  t6
    (a) input data, attribute B:            b   b   b   c   c   b
    (b) output with basic windowing, B:     b   b   b   b   c   b
    (c) output with Bleach windowing, B:    b   b   b   b   b   b

    Figure 10(a) illustrates a data stream of two-attribute tuples. Assume we use a single FD rule (A → B), a window size of 4 tuples, a sliding step of 2 tuples, and the basic windowing strategy. When t4 arrives, the window covers tuples [1, 4]. According to the repair algorithm, Bleach repairs t4(B) and sets it to the value b in the output stream. Note that, as we described in Section IV, t4(B) remains unchanged in the violation graph. Now, when tuple t5 arrives, the window moves to cover tuples [3, 6], even though t6 has yet to arrive. With only three tuples in the current window, the algorithm determines that t5(B) is correct, because now the majority of tuples have value c. The output stream produced using basic windowing is shown in Figure 10(b). Clearly, cleaning accuracy is sacrificed, since it is easy to see that t5(B) should have been repaired to value b, which is the most frequent value overall. Hence the need for a different windowing strategy to overcome such problems.

    Bleach windowing relies on an extension of a super cell, which we call a cumulative super cell. The idea is for the violation graph to accumulate past state, to complement the view Bleach builds using tuples from the current window. Hence, a cumulative super cell is represented as a super cell, with an additional field that stores the number of occurrences of cells with the same RHS value, including those that have been dropped because they fall outside the sliding window boundaries. When using Bleach windowing, RWs maintain the violation graph by storing cumulative super cells instead of super cells. When the window slides forward, RWs update the violation graph as follows. The first two steps are equivalent to those of the basic strategy. The last two steps are modified as follows:

    • For the remaining subgraphs, RWs delete hinge cells that do not bridge cell groups anymore because of the update. Also, RWs split subgraphs according to the remaining hinge cells;

    • For the remaining cell groups and hinge cells, RWs update cumulative super cells, “flushing” cells which fall outside the new window while keeping their count.

  • TABLE I. EXAMPLE RULE SETS USED IN OUR EXPERIMENTS.

    r0: ss_item_sk → i_brand, (ss_item_sk ≠ null)
    r1: ss_item_sk → i_category, (ss_item_sk ≠ null)
    r2: ca_state, ca_city → ca_zip, (ca_state, ca_city ≠ null)
    r3: ss_promo_sk → p_promo_name, (ss_promo_sk ≠ null)
    r4: ss_store_sk → s_store_name, (ss_store_sk ≠ null)
    r5: ss_ticket_num → s_store_name, (ss_ticket_num ≠ null)
    r6: file_extension → mime_type, (file_extension ≠ null)

    Now, going back to the example in Figure 10(a), when tuple t5 arrives, Bleach stores two cumulative super cells: csc1(id(t) = [3], value = 'b', count = 3) and csc2(id(t) = [4, 5], value = 'c', count = 2). Although t1 and t2 have been deleted because they are outside the sliding window, they still contribute to the count field of csc1. Therefore, tuple t5(B) is correctly repaired to value b, as shown in Figure 10(c).
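
    A sketch of a cumulative super cell (hypothetical Python), mirroring the example above: tuple IDs that fall out of the window are flushed, but the occurrence count survives and keeps steering repair decisions. The sketch assumes tuple IDs grow with time, so that the window boundary can be expressed as the smallest admissible tuple ID.

    from dataclasses import dataclass, field
    from typing import Any, List

    @dataclass
    class CumulativeSuperCell:
        attribute: str
        value: Any
        tuple_ids: List[int] = field(default_factory=list)  # cells still inside the window
        count: int = 0          # occurrences ever seen, including flushed cells

        def add(self, tuple_id: int) -> None:
            self.tuple_ids.append(tuple_id)
            self.count += 1

        def flush_outside(self, window_start: int) -> None:
            """Drop cells that fell out of the window, keeping their count."""
            self.tuple_ids = [t for t in self.tuple_ids if t >= window_start]

    csc1 = CumulativeSuperCell("B", "b")
    for t in (1, 2, 3):
        csc1.add(t)
    csc1.flush_outside(window_start=3)
    print(csc1.tuple_ids, csc1.count)   # [3] 3, as in the example above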

    Additional notes: When using cumulative super cells, Bleach keeps track of candidate values to be used in the repair algorithm as long as the corresponding cell groups remain. By using cumulative super cells for hinge cells, subgraphs only split if some cell groups are removed when the window moves forward. Note that the introduction of cumulative super cells does not interfere with dynamic rule management: in particular, when deleting a rule, subgraphs update correctly when hinge cells use the cumulative format. Overall, to compute the count of a candidate value in a subgraph, Bleach accumulates the counts of the relevant cumulative super cells from all cell groups, taking into account any duplicate contributions from hinge cells.

    Obviously, Bleach windowing requires more storage than basic windowing, as cumulative super cells store additional information and keep the super cell structure even when they have an empty cell list. Section VII demonstrates that such additional overhead is well balanced by superior cleaning accuracy, making Bleach windowing truly desirable.

    VII. EVALUATION

    We built the Bleach prototype using Apache Storm [24] (footnote 4). Input streams, including both the data stream and rule updates, are fed into Bleach using Apache Kafka [16]. We conducted all experiments in a cluster of 18 machines, each with 4 cores, 8 GB of RAM and a 1 Gbps network interface.

    We evaluate Bleach using both synthetic and real-life datasets. The synthetic dataset is generated from TPC-DS (with scale factor 100 GB), where we join a fact table, store_sales, with its dimension tables to build a single table (288 million tuples). We manually design six CFD rules, r0 to r5, as shown in Table I. Among these rules, r4 and r5 have the same RHS attribute, s_store_name. We generate a dirty data stream as follows: we modify the values of RHS attributes with probability 10% and replace the values of LHS attributes with NULL with probability 10% (footnote 5). The real-life dataset we use is the result of merging all the log files of Ubuntu One servers [15] for 30 days (773 GB of CSV text). Instead of modifying any values, the dataset already contains dirty records. We design a CFD rule, r6, as shown in Table I. With rule r6, the dirty ratio of the dataset is roughly 7 × 10^-5. By exporting the datasets to Kafka, we simulate “unbounded” data streams. In all the experiments, we use Bleach windowing as the default windowing strategy and set the window size to 2M tuples and the sliding step to 1M tuples, unless otherwise specified. The synthetic dataset is used in all experiments except in our last battery of experiments, where the real-life dataset is used.

    4 Nothing prevents Bleach from being built using alternative systems, such as Apache Flink.

    5 In our experiments we also used BART [3], which is a well accepted dirty data generator. However, BART fails to scale to hundreds of millions of tuples due to memory reasons. Thus, we present results obtained using our custom process, which mimics that of BART but scales to large data streams.

    Our goal is to demonstrate that Bleach achieves efficient stream data cleaning under real-time constraints. Our evaluation uses throughput, latency and dirty ratio as performance metrics. We express the dirty ratio as the fraction of dirty data remaining in the output data stream: the smaller the dirty ratio, the higher the cleaning accuracy. The processing latency is measured on uniformly sampled tuples (1 per 100).

    Comparing Coordination Approaches. In this experiment we compare the three RW approaches discussed in Section IV, according to our performance metrics, as shown in Figure 11: RW-basic requires coordination among repair workers for each tuple; RW-dr omits coordination for tuples when possible; RW-ir is similar to RW-dr, but allows repair decisions to be made before finishing coordination.

    Figure 11(a) shows how Bleach throughput evolves with processed tuples. The throughput with both RW-dr and RW-ir is around 15K tuples/second, whereas RW-basic achieves roughly 13K tuples/second. The inferior performance of RW-basic is due to the large number of coordination messages required to converge to global subgraph identifiers, whereas RW-dr and RW-ir only require 7% of the coordination messages needed by RW-basic. Figure 11(b) shows the CDF of the tuple processing latency for the three RW approaches. RW-basic has the highest processing latency, on average 364 ms. The processing latency of RW-ir is on average 316 ms. RW-dr average latency is slightly higher, about 323 ms. This difference is again due to the additional round-trip messages required by coordination: with RW-ir, RWs make their repair proposals without waiting for coordination to complete, hence the smaller processing latency. Figure 11(c) illustrates the cleaning accuracy. All three approaches lower the ratio of dirty data significantly, to at most 0.5% (even 0% for rules r3 and r4). For the first five rules, the three approaches achieve similar cleaning accuracy. Instead, for rule r5 the RW-ir method suffers and the dirty ratio is larger. Indeed, for rule r5, whose cleaning accuracy is heavily linked to rule r4, RW-ir fails to correctly update some of its subgraphs because it eagerly emits repair proposals without waiting for coordination to complete. In the following experiments, we use RW-dr as the default mechanism.

Comparing Windowing Strategies. In this experiment, we compare the performance of the basic and Bleach windowing strategies. Additionally, for stress testing, we increase the input dirty data ratio by modifying the values of RHS attributes with probability 50% for data in the interval from 40M to 42M tuples. As shown in Figure 13 and Figure 14, the two windowing strategies are essentially equivalent in terms of throughput and latency: this is good news, as it implies the requirement for cumulative super cells imposes a negligible toll on performance. Next, we focus on a detailed view of the cleaning accuracy, which is shown in Figure 12. Bleach windowing achieves

Fig. 11. Comparison of Bleach coordination mechanisms: RW-basic, RW-dr and RW-ir. (a) Throughput (tuples/sec) vs. processed tuples (million); (b) CDF of processing latency (sec); (c) Dirty ratio per rule (r0–r5).

Fig. 12. Comparison of cleaning accuracy for basic and Bleach windowing strategies: dirty ratio vs. processed tuples (million), one panel per rule, (a) r0 through (f) r5.

Fig. 16. Comparison of cleaning accuracy for different Bleach window sizes: average dirty ratio per rule (r0, r1, r2, r5) for window sizes of 200K, 500K, 1M and 2M tuples.

superior cleaning accuracy: in general, the dirty ratio is one order of magnitude smaller than that of basic windowing. This advantage is preserved even in the presence of a dirty-ratio spike in the input data. In particular, for rules r3 and r4, Bleach windowing achieves a 0% dirty ratio, irrespective of the dirty-ratio spike. Overall, Bleach windowing shows that keeping state from past windows can indeed dramatically improve cleaning accuracy, with little to no performance overhead.
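
The accuracy gap is easiest to see with a toy model of the two strategies: basic windowing drops value statistics that fall out of the window, whereas Bleach windowing carries cumulative statistics forward (a stand-in for the cumulative super cells), so repairs in a new window can still rely on evidence from past windows. The class below is an illustrative approximation, not Bleach's actual data structure.

```python
from collections import Counter

class WindowedStats:
    """Toy per-attribute value statistics used to pick a repair value."""

    def __init__(self, cumulative):
        self.cumulative = cumulative   # True -> Bleach windowing, False -> basic windowing
        self.counts = Counter()

    def observe(self, value):
        self.counts[value] += 1

    def slide(self, expired_values):
        if self.cumulative:
            return                     # Bleach windowing: statistics survive the slide
        for v in expired_values:       # basic windowing: drop evidence that left the window
            self.counts[v] -= 1
            if self.counts[v] <= 0:
                del self.counts[v]

    def repair_value(self):
        return self.counts.most_common(1)[0][0] if self.counts else None
```

Under this model, the only extra cost of the cumulative variant is keeping the counters alive across slides, which is consistent with the negligible throughput and latency difference observed in Figures 13 and 14.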

Comparing Different Window Sizes. In this experiment, we evaluate Bleach with different window sizes. We set the window size to 200K, 500K, 1M and 2M tuples respectively (the sliding step is half of the window size); the results are shown in Figure 15 and Figure 16. We see that Bleach has a higher chance of cleaning the data stream when more tuples fit in the window. Figure 15 shows that the throughput

decreases as the window size increases. With a larger window, more tuples in the data history must be checked for violations. Hence, more violations are detected and sent to the repair module, and the violation graph in the RWs grows larger. As a consequence, subgraph operations, including merging and splitting, take more time to finish. With our implementation the throughput drops by 23% when the window size increases 10 times. In contrast, Figure 16 demonstrates that the cleaning accuracy may increase by more than 10 times when the window size increases 10 times.

Dynamic Rule Management. Next, we study the performance of Bleach in the presence of rule dynamics, as shown in Figure 17. To do this, we initially use the same input data stream and rule set as in the first sets of experiments. However, while Bleach is cleaning the input stream, we delete rule r5 and add two new rules, r7 (ss_ticket_num → c_email_addr, (ss_ticket_num ≠ null)) and r8 (ss_customer_sk → c_email_addr, (ss_customer_sk ≠ null)), as indicated in the figure. Figure 17(a) and Figure 17(b) show the evolution in time of throughput and latency, whereas Figure 17(c) gives the CDF of the processing latency. Figure 17(a) shows that rule dynamics can result in an increase in throughput. Indeed, removing r5 (at the 60M tuple) implies that Bleach needs to manage fewer rules; in addition, r4 becomes simpler to manage, as there are no more intersections with

6 The cleaning accuracy of rules r3 and r4 is not shown in Figure 16, as it is 100% for all four window sizes.

Fig. 13. Comparison of achievable throughput for basic and Bleach windowing strategies (throughput in tuples/sec vs. processed tuples in millions).

Fig. 14. Comparison of processing latency for basic and Bleach windowing strategies (CDF of latency in seconds).

Fig. 15. Achievable throughput for different Bleach window sizes (average throughput in tuples/sec vs. window size in thousands of tuples).

Fig. 17. Bleach performance analysis using dynamic rule management: (a) throughput vs. processed tuples; (b) latency in time; (c) latency CDF. The markers indicate when r5 is deleted and when r7 and r8 are added.

r5. Similarly, Figure 17(b) shows that latency also decreases upon r5 removal. When rules r7 and r8 are added (at the 90M tuple), the throughput drops and the latency grows, as Bleach has more rules to manage and because the new rules have intersecting attributes, requiring more work from RWs. Figure 17(c) shows the latency distribution computed from output tuple samples. While the average latency is roughly 320 ms, we notice a tail in the distribution, indicating that some (few) tuples experience latencies of up to seconds. This has been observed across all our experiments, and is due to the sliding window mechanism, which imposes computationally demanding operations when updating the violation graph and also triggers low-level garbage-collection issues.

Overall, we conclude that Bleach supports dynamic rule management seamlessly, with essentially no impact on performance and no system restart required.
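
As an illustration of the kind of runtime rule changes exercised in this experiment, the sketch below represents a CFD rule as (LHS attributes, RHS attribute, condition) and adds or removes rules while the stream is being processed; the RuleManager interface is an assumption made for exposition, not Bleach's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

@dataclass
class CFDRule:
    name: str
    lhs: Tuple[str, ...]                                  # LHS attribute names
    rhs: str                                              # RHS attribute name
    condition: Optional[Callable[[dict], bool]] = None    # e.g. "LHS attribute is not null"

@dataclass
class RuleManager:
    rules: Dict[str, CFDRule] = field(default_factory=dict)

    def add_rule(self, rule: CFDRule):
        self.rules[rule.name] = rule      # takes effect for subsequent tuples, no restart

    def delete_rule(self, name: str):
        self.rules.pop(name, None)

# Mirrors the experiment: drop r5, then add r7 and r8 (attribute names as in the text).
mgr = RuleManager()
mgr.delete_rule("r5")
mgr.add_rule(CFDRule("r7", ("ss_ticket_num",), "c_email_addr",
                     lambda t: t.get("ss_ticket_num") is not None))
mgr.add_rule(CFDRule("r8", ("ss_customer_sk",), "c_email_addr",
                     lambda t: t.get("ss_customer_sk") is not None))
```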

Comparing Bleach to a Baseline Approach. We conclude our evaluation with a comparative analysis of Bleach and a baseline approach based on the micro-batch streaming paradigm (which we refer to as micro-batch cleaning). Essentially, micro-batch cleaning buffers input data records and performs batch data cleaning periodically, as determined by a sliding window. Our implementation uses Spark Streaming, in which window processing is time-based rather than tuple-based. To demonstrate the performance of micro-batch cleaning and compare it to Bleach, we perform a series of experiments in which we increase the sliding window size. We use the stream data input from the Ubuntu One trace file and its violation rule r6. We feed the input stream at a constant throughput of 1317 tuples/second, which is the average speed of the generation of the trace file. Thus, we focus on performance analysis expressed only in terms of latency and dirty ratio. Instead of consuming all the tuples in the input stream, which

7 For our experiment, the performance difference between the two is negligible.

Fig. 18. Comparison of Bleach vs. micro-batch cleaning: latency/accuracy tradeoff (dirty ratio vs. latency, for micro-batch windows of 5, 15, 25, 45 and 60 minutes, and for Bleach).

would take 30 days, each experiment lasts 24 hours.

Figure 18 illustrates the performance of both systems. As expected, for micro-batch cleaning, the average latency is proportional to the window size: larger sliding windows entail higher latencies. Indeed, since the data in the input stream is uniformly distributed, the average latency equals the sum of half the window size and the average execution time for cleaning data in each window. As for the cleaning accuracy, intuitively, the larger the sliding window, the more accurate micro-batch cleaning gets, hence a smaller output stream dirty ratio. Although the tuple-based window in Bleach contains roughly the same number of tuples as the 25-minute time-based window in micro-batch cleaning, Bleach achieves better cleaning accuracy because of its cumulative statistics. In particular, we notice that to achieve the same cleaning accuracy as Bleach, micro-batch cleaning requires the sliding window to be larger than 45 minutes, which incurs an average latency larger than 22 minutes. In Bleach, instead, the average latency is less than 200 ms.
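
The latency argument can be checked with a one-line model: under uniformly distributed input, a tuple waits on average half the window before its batch is cleaned, so the expected latency is roughly window/2 plus the average cleaning time. The snippet below merely reruns that arithmetic for the window sizes of Figure 18; the cleaning-time term is a placeholder.

```python
def expected_latency_minutes(window_minutes, avg_cleaning_minutes):
    """Expected micro-batch latency: half the window plus the per-window cleaning time."""
    return window_minutes / 2 + avg_cleaning_minutes

# A 45-minute window already implies well over 22 minutes of average latency,
# whatever the (placeholder) per-window cleaning time turns out to be.
for w in (5, 15, 25, 45, 60):
    print(f"{w}-minute window -> ~{expected_latency_minutes(w, 0.5):.1f} minutes average latency")
```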

    VIII. RELATED WORK

Many approaches to data cleaning [6], [18], [4], [9], [10], [17], [14] tackle the problem of detecting and repairing dirty

data based on predefined data quality rules. [9] proposes a way to combine multiple rules together and to perform data cleaning work holistically. [7] focuses on functional dependency violations in a horizontally partitioned database, aiming to minimize data shipment and parallel computation time. NADEEF [10] is an extensible and generic data cleaning system, and BigDansing [17] is a large-scale version of NADEEF, which executes data cleaning jobs in frameworks like Hadoop and Spark. These approaches are effective when data is static. [11], [12] provide incremental algorithms to detect errors in distributed data when data is updated, which is similar to violation detection in Bleach. [26] introduces a continuous data cleaning framework that can be applied to evolving data and rules, driven by a classifier. These works focus on cleaning data stored in a data warehouse by batch processing, which achieves high accuracy but suffers from high latency.

In contrast, stream processing must operate in real time, a challenge that has drawn increasing attention from researchers [2], [13], [19]. Nevertheless, stream data cleaning approaches are still in their infancy. Some works [21], [28] focus on sensor stream data cleaning, where the data is a sequence of numerical values. These works achieve data cleaning through operations such as smoothing or deleting outliers. Instead, Bleach targets more general cases where data can be both numerical and categorical. In Spark Streaming [22], a data stream can be cleaned by joining it with precomputed information. However, precomputed information is not always available nor sufficient for accurate data cleaning. To the best of our knowledge, Bleach is the first stream data cleaning system based on data quality rules providing both high accuracy and low latency.

There is much other research work on data cleaning. For example, [27] and [8] show how to perform data cleaning via knowledge bases and crowdsourcing. BART [3] is a dirty data generator for evaluating data-cleaning algorithms. [1] studies the problem of temporal rules discovery for dirty web data. All such works are orthogonal to ours.

    IX. CONCLUSION

This work introduced Bleach, a novel stream data cleaning system that aims at efficient and accurate data cleaning under real-time constraints. First, we introduced the design goals and the related challenges underlying Bleach, showing that stream data cleaning is far from being a trivial problem. Then we illustrated the Bleach system design, focusing both on data quality (e.g., Bleach dynamic rule sets and stateful approach to windowing) and on systems aspects (e.g., data partitioning and coordination), which are required by the distributed nature of Bleach. We also provided a series of optimizations to improve system performance, by using compact and efficient data structures and by reducing the messaging overhead. Finally, we evaluated a prototype implementation of Bleach: our experiments showed that Bleach achieves low latency and high cleaning accuracy while absorbing a dirty data stream, despite rule dynamics. We also compared Bleach to a baseline system built on the micro-batch paradigm, and explained Bleach's superior performance. Our plan for future work is to support a more varied rule set and to explore alternative repair algorithms, which might require revisiting the inner data structures we use in Bleach.

REFERENCES

[1] Abedjan et al. Temporal rules discovery for web data cleaning. In Proc. of VLDB, pages 336–347, 2015.
[2] T. Akidau et al. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. In Proc. of VLDB, pages 1792–1803, 2015.
[3] Arocena et al. Messing up with BART: Error generation for evaluating data-cleaning algorithms. In Proc. of VLDB, pages 36–47, 2015.
[4] Beskales et al. Sampling the repairs of functional dependency violations under hard constraints. In Proc. of VLDB, number 1-2, pages 197–207, 2010.
[5] Bohannon et al. A cost-based model and effective heuristic for repairing constraints by value modification. In Proc. of SIGMOD, pages 143–154, 2005.
[6] P. Bohannon et al. Conditional functional dependencies for data cleaning. In ICDE, pages 746–755, April 2007.
[7] Q. Chen et al. Repairing functional dependency violations in distributed data. In DASFAA, pages 441–457, 2015.
[8] Chu et al. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proc. of SIGMOD, pages 1247–1261, 2015.
[9] X. Chu et al. Holistic data cleaning: Putting violations into context. In ICDE, pages 458–469, 2013.
[10] Dallachiesa et al. NADEEF: A commodity data cleaning system. In Proc. of SIGMOD, pages 541–552, 2013.
[11] W. Fan et al. Incremental detection of inconsistencies in distributed data. In ICDE, pages 318–329, 2012.
[12] W. Fan et al. Incremental detection of inconsistencies in distributed data. IEEE Transactions on Knowledge and Data Engineering, pages 1367–1383, 2014.
[13] Fernandez et al. Liquid: Unifying nearline and offline big data integration. In CIDR, 2015.
[14] Geerts et al. The Llunatic data-cleaning framework. In Proc. of VLDB, pages 625–636, 2013.
[15] Gracia-Tinedo et al. Dissecting UbuntuOne: Autopsy of a global-scale personal cloud back-end. In IMC, pages 155–168, 2015.
[16] Kafka. Webpage. http://kafka.apache.org/.
[17] Z. Khayyat et al. BigDansing: A system for big data cleansing. In Proc. of SIGMOD, pages 1215–1230, 2015.
[18] Kolahi et al. On approximating optimum repairs for functional dependency violations. In Proc. of ICDT, pages 53–62, 2009.
[19] Q. Lin et al. Scalable distributed stream join processing. In Proc. of SIGMOD, pages 811–825.
[20] OpenRefine. Webpage. http://openrefine.org/.
[21] S. Song et al. SCREEN: Stream data cleaning under speed constraints. In Proc. of SIGMOD, pages 827–841, 2015.
[22] Spark Streaming. Webpage. http://spark.apache.org/docs/latest/streaming-programming-guide.html.
[23] Stonebraker et al. The 8 requirements of real-time stream processing. SIGMOD Rec., 34:42–47, 2005.
[24] Storm. Webpage. https://storm.apache.org/.
[25] Trifacta. Webpage. https://www.trifacta.com/.
[26] M. Volkovs et al. Continuous data cleaning. In ICDE, pages 244–255, 2014.
[27] Wang et al. Crowd-based deduplication: An adaptive approach. In Proc. of SIGMOD, pages 1263–1277, 2015.
[28] Zhao et al. A model-based approach for RFID data stream cleansing. In Proc. of CIKM, pages 862–871, 2012.