A Grid-Based Middleware’s Support for Processing Distributed Data Streams

1


Liang Chen

Advisor: Gagan Agrawal

Computer Science & Engineering

2

Introduction-Motivation

• Data stream processing and analysis
– Data stream: data arrive continuously and need to be processed in real time

• Data stream applications
– Online network intrusion detection
– Sensor networks
– Network fault management system for telecommunication network elements
– Computer vision based surveillance

• Common features of data streams
– Continuous arrival
– Enormous volume
– Real-time constraints
– Data sources could be distributed

weili lin

Grid was first developed to enable resource sharing within far-flung scientific collaborations, such as collaborative visualization of large scientific datasets and distributed computing for highly computationally demanding data analysis. Just as the WWW began as a technology for scientific cooperation and was later adopted by e-business, people expect the same trajectory for Grid technologies.

weili lin

Resources in different data formats
Resources on different platforms
Different kinds of resources, such as storage resources, software, data and the like
Some sharing relationships are transient, possibly because resources are upgraded

3

Introduction-Motivation

Network Fault Management System

[Figure: a Network Fault Management System analyzing alarm message streams from a switch network; one element of the network is marked X]

4

Introduction-Motivation

Computer Vision Based Surveillance

5

Introduction-Motivation

• Challenges & possible solutions
– Challenge 1: Data and/or computation intensive

[Figure: switch network; one element is marked X]

6

Introduction-Motivation

• Challenges & possible solutions
– Challenge 1: Data and/or computation intensive
– Solution: Grid computing technologies

[Figure: switch network]

weili lin

Public-key based single sign-on: allows a user to authenticate once and thus create a proxy credential that a program can use to authenticate with remote services on the user's behalf, without further user intervention.

7

Introduction-Motivation

• Challenges & possible solutions
– Challenge 1: Data and/or computation intensive
– Solution: Grid computing technologies
– Challenge 2: Real-time analysis is required
– Solution: Self-adaptation functionality is desired

8

Introduction-Motivation

• From the point of view of developers who are interested in data stream applications
– Would like to concentrate on the applications themselves
– Would not like to focus their efforts on
• Grid computing
• Adaptation function

9

Introduction-Our Approach

• A middleware that is based on Grid standards and tools and provides self-adaptation functionality

• The middleware is referred to as GATES (Grid-based AdapTive Execution on Streams)
– Automatically distributed to proper computing nodes
– Automatically self-adaptive to a varying environment, without the developer having to implement adaptation algorithms

10

System Architecture and Design

(From Application Perspective)

• Break down a task into several sub-tasks so that the sub-tasks form a pipeline

• Implement each sub-task in Java

• Write an XML configuration file so that the sub-tasks can be deployed automatically, i.e.
– specify how many stages (sub-tasks) the pipeline has
– specify where the code implementing the sub-tasks resides

• Launch the application by running a Java program (StreamClient.class) provided by GATES

11

System Architecture and Design

(Architecture)

12

System Architecture and Design

(Architecture)

[Figure: a three-stage pipeline (Stage A, Stage B, Stage C) hosted on Grid services A, B and C. Legend: Grid services of GATES; stages of an application; queues between Grid services; buffers for applications]
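The pipeline structure in the architecture slide can be illustrated with a small in-process sketch. This is not GATES code: the class `PipelineSketch`, the `END` sentinel and the three arithmetic operations are all hypothetical, and plain threads with `LinkedBlockingQueue` stand in for Grid services and the queues between them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

final class PipelineSketch {
    // Sentinel marking the end of the demo stream (assumption: inputs never contain it).
    private static final Integer END = Integer.MIN_VALUE;

    // Starts one stage: take from `in`, apply `op`, put the result on `out`.
    static Thread stage(BlockingQueue<Integer> in, BlockingQueue<Integer> out,
                        Function<Integer, Integer> op) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Integer item = in.take();
                    if (item.equals(END)) { out.put(END); return; }
                    out.put(op.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Feeds `input` through stages A, B and C and collects what reaches the sink.
    static List<Integer> run(List<Integer> input) {
        BlockingQueue<Integer> source = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> ab = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> bc = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> sink = new LinkedBlockingQueue<>();
        Thread a = stage(source, ab, x -> x + 1);   // hypothetical stage A
        Thread b = stage(ab, bc, x -> x * 2);       // hypothetical stage B
        Thread c = stage(bc, sink, x -> x - 3);     // hypothetical stage C
        try {
            for (Integer x : input) source.put(x);
            source.put(END);
            a.join(); b.join(); c.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the pipeline", e);
        }
        List<Integer> result = new ArrayList<>();
        for (Integer x : sink) if (!x.equals(END)) result.add(x);
        return result;
    }
}
```

Each stage only sees its input and output queue, which is the property that would let a middleware place the stages on different nodes.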

13

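public class Sampling-Stage implements StreamProcessing {
    …
    void init() { … }
    …
    void work(buffer in, buffer out) {
        …
        while (true) {
            // read one image from the input buffer managed by GATES
            Image img = get-from-buffer-in-GATES(in);
            // reduce the image according to the current sampling ratio
            Image img-sample = Sampling(img, sampling-ratio);
            // hand the sampled image to the next stage
            put-to-buffer-in-GATES(img-sample, out);
        }
        …
    }
}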

System Architecture and Design

(Example)

sampling-ratio = GATES.getSuggestedParameter();

GATES.Information-About-Adjustment-Parameter(min, max, 1)
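The two callouts suggest that a stage first declares the range of its adjustment parameter and then re-reads the suggested value while processing. A minimal, self-contained sketch of that pattern follows. It does not use the GATES API: `AdaptiveSamplingSketch`, `sample` and `work` are hypothetical names, an `IntSupplier` stands in for `GATES.getSuggestedParameter()`, and the parameter is assumed to mean "keep every step-th element".

```java
import java.util.Queue;
import java.util.function.IntSupplier;

final class AdaptiveSamplingSketch {
    // Keeps every `step`-th element; step 1 keeps everything (assumed meaning of the parameter).
    static int[] sample(int[] data, int step) {
        if (step < 1) throw new IllegalArgumentException("step must be >= 1");
        int[] out = new int[(data.length + step - 1) / step];
        for (int i = 0, j = 0; i < data.length; i += step, j++) {
            out[j] = data[i];
        }
        return out;
    }

    // Drains `in`, asking for the currently suggested parameter before each item,
    // so a change suggested by the middleware takes effect on the next item.
    static void work(Queue<int[]> in, Queue<int[]> out, IntSupplier suggested) {
        int[] item;
        while ((item = in.poll()) != null) {
            int step = suggested.getAsInt();
            out.add(sample(item, step));
        }
    }
}
```

Reading the parameter inside the loop, rather than once in `init()`, is what lets the adjustment follow the load while the stream is running.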

14

Self-adaptation Algorithm

• Given a queue's long-term factor d̃ at each stage, we want to improve the method of adjusting the value of an adaptation parameter

1. Should the adaptation parameter be modified, and if so, in which direction?

2. How should the new value of the adaptation parameter be determined?

15

Enhanced Self-adaptation Algorithm

• Should the adaptation parameter be modified, and if so, in which direction?
– The answer is related to the load status of the queues at two consecutive stages
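As a rough illustration of how the load status of two consecutive queues could drive the direction decision, consider the sketch below. The rules are an assumption made for this example and are not the state table used by GATES: it assumes that a larger parameter value means more work and more forwarded data, and the thresholds `low` and `high` as well as all class and enum names are hypothetical.

```java
final class DirectionSketch {
    enum Load { UNDER, NORMAL, OVER }
    enum Direction { DECREASE, KEEP, INCREASE }

    // Classifies a queue by comparing its long-term factor with two thresholds.
    static Load classify(double factor, double low, double high) {
        if (factor < low) return Load.UNDER;
        if (factor > high) return Load.OVER;
        return Load.NORMAL;
    }

    // Illustrative rule: back off if either queue is overloaded,
    // push harder only if both queues are underloaded, otherwise keep the value.
    static Direction decide(Load own, Load next) {
        if (own == Load.OVER || next == Load.OVER) return Direction.DECREASE;
        if (own == Load.UNDER && next == Load.UNDER) return Direction.INCREASE;
        return Direction.KEEP;
    }
}
```

For example, with thresholds 0.2 and 0.8, factors of 0.1 at this stage and 0.9 at the next stage would yield a decrease under these assumed rules.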

16


Enhanced Self-adaptation Algorithm

Performance Parameter P_B

[Figure: nine load-state diagrams over stages A, B and C, grouped into convergent states and non-convergent states]

17

Enhanced Self-adaptation Algorithm

Summary of Load States

18


Enhanced Self-adaptation Algorithm

• How to determine the new value for the adaptation parameter
– Linear update: increase or decrease by a fixed value
• Hard to find a proper fixed value
– Previous method
[Equation: previous update rule for P_B, involving d̃, T1 and T2; not legible in this transcript]
– Binary tree search

20


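Enhanced Self-adaptation Algorithm

[Figure: two panels illustrating the binary search update. First panel: Left Border, Current Value, Right Border, New Value. Second panel: Left Border, Current Value, Right Border]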
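One plausible reading of the border labels in this figure is a bisection-style update: the current value becomes a border, and the new value is the midpoint between it and the opposite border. The sketch below follows that reading. It is an interpretation and not code from GATES, and the class and method names are hypothetical.

```java
final class BinarySearchUpdateSketch {
    private double left;     // left border of the search interval
    private double right;    // right border of the search interval
    private double current;  // current value of the adaptation parameter

    BinarySearchUpdateSketch(double min, double max, double initial) {
        if (!(min <= initial && initial <= max)) {
            throw new IllegalArgumentException("initial value must lie in [min, max]");
        }
        this.left = min;
        this.right = max;
        this.current = initial;
    }

    // The parameter should grow: the current value becomes the left border.
    double increase() {
        left = current;
        current = (current + right) / 2.0;
        return current;
    }

    // The parameter should shrink: the current value becomes the right border.
    double decrease() {
        right = current;
        current = (left + current) / 2.0;
        return current;
    }

    double value() {
        return current;
    }
}
```

Compared with a linear update, the step size here halves with each adjustment, so no fixed step has to be chosen in advance. A long-running system would presumably also need a way to widen the borders again when the environment changes, which this sketch leaves out.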

22

Data Mining Applications &

System Evaluation

• Two data mining applications
– CluStream: clustering data arriving in data streams

23


Data Mining Applications & System Evaluation

• Dist-Freq-Counting: finding frequent itemsets from distributed streams

24


System Evaluation

33

Resource Allocation Schemes

• Problem definition
– Grid resource scheduling for pipelined processing and real-time distributed streaming applications
– Mapping workflows onto a Grid is an NP-complete problem
– Static part: the resource allocation problem for GATES is to determine a deployment configuration
– Dynamic part

34

Static Allocation Scheme

• The number of data sources and their locations
• The destination
• The number of stages forming the pipeline
• The number of instances of each stage
• How the instances connect to each other
• The node where each instance is placed

[Figure: a deployment configuration with Data Source 1 (162.9.23.1), Data Source 3 (192.168.2.8) and the destination (m1.cluster2.edu). Stage 2 has placements 1 to n1, Stage 3 has placements 1 to n2, Stage 4 has placements 1 to n3]

Static allocation problem: determining a deployment configuration

Objective: automatically generate a deployment configuration according to information about the available resources
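To make the notion of a deployment configuration concrete, the sketch below represents it as a map from stage number to the list of nodes hosting that stage's instances, and fills it with a naive round-robin assignment. This is only an illustration of the data structure: round-robin is a stand-in chosen for brevity and is not the allocation scheme proposed in the talk, and the class name, method name and node names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DeploymentSketch {
    // Builds a configuration: stage number -> nodes hosting that stage's instances.
    // instancesPerStage[i] is the number of instances of stage (firstStage + i).
    static Map<Integer, List<String>> roundRobin(List<String> nodes,
                                                 int[] instancesPerStage,
                                                 int firstStage) {
        if (nodes.isEmpty()) throw new IllegalArgumentException("no nodes available");
        Map<Integer, List<String>> config = new LinkedHashMap<>();
        int next = 0; // index of the next node to use, wrapping around the node list
        for (int i = 0; i < instancesPerStage.length; i++) {
            List<String> placements = new ArrayList<>();
            for (int k = 0; k < instancesPerStage[i]; k++) {
                placements.add(nodes.get(next % nodes.size()));
                next++;
            }
            config.put(firstStage + i, placements);
        }
        return config;
    }
}
```

A real scheme would choose the number of instances and their nodes from resource information such as bandwidth and CPU capacity. Here both are simply given as inputs.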

35

Static Allocation Scheme

[Figure: examples of deployment configurations. Example 1: Data Sources 1 to 4, Stage 2 with Placements 1 and 2, Stage 3 with Placement 1, Stage 4 with Placement 1, and the Destination. Example 2: Data Sources 1 to 4, Stage 2 with Placements 1 to 3, Stage 3 with Placements 1 and 2, Stage 4 with Placements 1 and 2, and the Destination]

36

Related work

• Grid resource allocation
– Condor
– Realtor
– ACDS etc.
– Main difference: our work focuses on Grid resource allocation for workflow applications

• Adaptation through a middleware
– Cheng et al.'s adaptation framework
– SWiFT
– Conductor
– DART
– ROAM
– Main difference: our work focuses on general support for adaptation at run time

37

Summary

• Grid computing could be an effective solution for distributed data stream processing

• GATES
– Distributed processing
– Exploits Grid web services
– Self-adaptation to meet real-time constraints
– Grid resource allocation schemes
