A Grid-Based Middleware’s Support for Processing Distributed Data Streams


Page 1: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

1

A Grid-Based Middleware’s Support for

Processing Distributed Data Streams

Liang Chen

Advisor: Gagan Agrawal

Computer Science & Engineering

Page 2: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

2

Introduction-Motivation

• Data stream processing and analysis
– Data stream: data arrive continuously and need to be processed in real time

• Data stream applications:
– Online network intrusion detection
– Sensor networks
– Network fault management system for telecommunication network elements
– Computer vision based surveillance

• Common features of data streams
– Continuous arrival
– Enormous volume
– Real-time constraints
– Data sources could be distributed

weili lin
The Grid was first developed to enable resource sharing within far-flung scientific collaborations, such as collaborative visualization of large scientific datasets and distributed computing for highly computationally demanding data analysis. Just as the WWW began as a technology for scientific cooperation and was later adopted by e-business, people expect the same trajectory for Grid technologies.
weili lin
Resources in different data formats; resources on different platforms; different kinds of resources, such as storage resources, software, data, and the like. Some sharing relationships are transient, for example because resources are upgraded.
Page 3: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

3

Introduction-Motivation

Network Fault Management System

[Figure: a network fault management system analyzing alarm message streams from a switch network; an X marks a point in the network]

Page 4: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

4

Introduction-Motivation

Computer Vision Based Surveillance

Page 5: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

5

Introduction-Motivation

• Challenges & possible solutions
– Challenge 1: data and/or computation intensive

[Figure: switch network]

Page 6: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

6

Introduction-Motivation

• Challenges & possible solutions
– Challenge 1: data and/or computation intensive
– Solution: Grid computing technologies

[Figure: switch network]

weili lin
Public-key based single sign-on: allows a user to authenticate once and thus create a proxy credential that a program can use to authenticate with remote services on the user's behalf, without further user intervention.
Page 7: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

7

Introduction-Motivation

• Challenges & possible solutions
– Challenge 1: data and/or computation intensive
– Solution: Grid computing technologies
– Challenge 2: real-time analysis is required
– Solution: self-adaptation functionality is desired

Page 8: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

8

Introduction-Motivation

• From the point of view of developers who are interested in data stream applications
– Would like to concentrate on the applications themselves
– Would not like to focus their efforts on
  • Grid computing
  • Adaptation function

Page 9: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

9

Introduction-Our Approach

• A middleware that is based on Grid standards and tools and provides self-adaptation functionality

• The middleware is referred to as GATES (Grid-based AdapTive Execution on Stream)
– Automatically distributed to proper computing nodes
– Automatically self-adaptive to a varying environment, without implementing certain algorithms

Page 10: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

10

System Architecture and Design

(From Application Perspective)

• Break the task down into several sub-tasks so that the sub-tasks form a pipeline

• Implement each sub-task in Java

• Write an XML configuration file so that the sub-tasks can be deployed automatically, i.e.
– specify how many stages (sub-tasks) the pipeline has
– specify where the code implementing each sub-task resides

• Launch the application by running a Java program (StreamClient.class) provided by GATES
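The configuration step above can be pictured with a minimal sketch. The XML element and attribute names (`pipeline`, `stage`, `name`, `code`) and the example.org URLs are hypothetical and are not the actual GATES schema; the sketch only carries the two pieces of information the slide names: how many stages the pipeline has and where the code of each stage resides.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical configuration reader. The schema used here is an assumption
// made for illustration, not the real GATES configuration format.
class PipelineConfig {
    // One pipeline stage: a label and the location of its Java code.
    static final class Stage {
        final String name;
        final String codeLocation;
        Stage(String name, String codeLocation) {
            this.name = name;
            this.codeLocation = codeLocation;
        }
    }

    // Example document with two stages (names and URLs are made up).
    static final String EXAMPLE_XML =
        "<pipeline>"
        + "<stage name=\"sampling\" code=\"http://example.org/stages/SamplingStage.class\"/>"
        + "<stage name=\"analysis\" code=\"http://example.org/stages/AnalysisStage.class\"/>"
        + "</pipeline>";

    // Parse the XML text and return the stages in pipeline order.
    static List<Stage> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getDocumentElement().getElementsByTagName("stage");
            List<Stage> stages = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                stages.add(new Stage(e.getAttribute("name"), e.getAttribute("code")));
            }
            return stages;
        } catch (Exception ex) {
            throw new IllegalArgumentException("invalid pipeline configuration", ex);
        }
    }

    public static void main(String[] args) {
        List<Stage> stages = parse(EXAMPLE_XML);
        System.out.println("Number of stages: " + stages.size());
        for (Stage s : stages) {
            System.out.println(s.name + " -> " + s.codeLocation);
        }
    }
}
```

The order of the `stage` elements doubles as the pipeline order in this sketch, which keeps the file short; a real schema may encode the order and the links between stages explicitly.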

Page 11: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

11

System Architecture and Design (Architecture)

Page 12: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

12

System Architecture and Design (Architecture)

[Figure: a pipeline of three stages, Stage A, Stage B, and Stage C. Legend: Grid services of GATES; stages of an application; queues between Grid services; buffers for applications]
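The arrangement in the architecture figure, with stages A, B, and C linked by queues, can be sketched as follows. This is an illustration only: plain threads and in-memory queues stand in for the Grid services and the queues between them, the per-stage operations are arbitrary, and none of the names come from GATES.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

// Illustrative only: threads and in-memory queues stand in for Grid services
// and the queues between them. This is not GATES code.
class PipelineSketch {
    // End-of-stream marker; this value must not occur as real data.
    private static final Integer END = Integer.MIN_VALUE;

    // Start one stage: take items from `in`, apply `op`, put results on `out`.
    static Thread startStage(BlockingQueue<Integer> in, BlockingQueue<Integer> out,
                             UnaryOperator<Integer> op) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Integer item = in.take();
                    if (item.equals(END)) {
                        out.put(END); // forward the marker and stop this stage
                        return;
                    }
                    out.put(op.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Push `input` through stages A -> B -> C and collect what C emits.
    static List<Integer> run(List<Integer> input) {
        BlockingQueue<Integer> toA = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> aToB = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> bToC = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> fromC = new LinkedBlockingQueue<>();
        startStage(toA, aToB, x -> x + 1);   // stage A
        startStage(aToB, bToC, x -> x * 2);  // stage B
        startStage(bToC, fromC, x -> x - 3); // stage C
        List<Integer> result = new ArrayList<>();
        try {
            for (Integer x : input) {
                toA.put(x);
            }
            toA.put(END);
            for (Integer y = fromC.take(); !y.equals(END); y = fromC.take()) {
                result.add(y);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3))); // prints [1, 3, 5]
    }
}
```

Because each stage only sees its input and output queue, the length of a queue is a natural place to observe whether the next stage is keeping up, which is the kind of signal the self-adaptation slides build on.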

Page 13: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

13

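System Architecture and Design (Example)

// Identifiers follow the slide's pseudocode, with hyphens replaced so that
// they are legal Java names; "…" marks code elided on the slide.
public class SamplingStage implements StreamProcessing {
    …
    void init() {
        // Slide callout (its position in the class is assumed):
        // describe the adjustment parameter to GATES.
        GATES.Information_About_Adjustment_Parameter(min, max, 1);
        …
    }
    …
    void work(Buffer in, Buffer out) {
        while (true) {
            Image img = get_from_buffer_in_GATES(in);
            // Slide callout (position assumed): use the value GATES suggests.
            sampling_ratio = GATES.getSuggestedParameter();
            Image img_sample = Sampling(img, sampling_ratio);
            put_to_buffer_in_GATES(img_sample, out);
        }
        …
    }
}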

Page 14: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

14

Self-adaptation Algorithm

• Given a queue’s long-term factor $\tilde{d}$ at each stage, we want to improve the method of adjusting the value of an adaptation parameter

1. Should the adaptation parameter be modified, and if so, in which direction?

2. How to find a new value (update the value) of the adaptation parameter?

Page 15: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

15

Enhanced Self-adaptation Algorithm

• Should the adaptation parameter be modified, and if so, in which direction?
– The answer is related to the load status of the queues at two consecutive stages

Page 16: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

16

Enhanced Self-adaptation Algorithm

Performance Parameter $P_B$

[Figure: load states of the pipeline stages A, B, and C, grouped into convergent states and non-convergent states]

Page 17: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

17

Enhanced Self-adaptation Algorithm

Summary of Load States

Page 18: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

18

Enhanced Self-adaptation Algorithm

• How to determine the new value for the adaptation parameter
– Linear update: increase or decrease by a fixed value
  • Hard to find a proper fixed value
– Previous method
– Binary tree search

[Equation not recoverable from the extracted slide text; the surviving symbols are $P_B$, $\tilde{d}$, $T_1$, and $T_2$]

Page 19: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

20

Enhanced Self-adaptation Algorithm

[Figure: two panels, each showing a Left Border, a Current Value, and a Right Border, with a New Value marked]
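The two update options named on the slides, a linear update by a fixed value and a binary search between a left and a right border, can be contrasted in a short sketch. The slides do not spell out the exact GATES rule, so the border handling below is an assumption: the current value becomes one border and the new value is the midpoint of the remaining interval. The class and method names are hypothetical.

```java
// Illustrative sketch of the two update styles. The exact GATES update rule
// is not given on the slides, so the border handling here is an assumption.
class AdaptationParameter {
    private double left;    // left border of the current search interval
    private double right;   // right border of the current search interval
    private double current; // current value of the adaptation parameter

    // Start in the middle of the allowed range [min, max].
    AdaptationParameter(double min, double max) {
        this.left = min;
        this.right = max;
        this.current = (min + max) / 2.0;
    }

    double value() {
        return current;
    }

    // Linear update: move by a fixed step, clamped to [min, max].
    static double linearUpdate(double value, boolean increase, double step,
                               double min, double max) {
        double next = increase ? value + step : value - step;
        return Math.max(min, Math.min(max, next));
    }

    // Binary-search update: the current value becomes a border, and the new
    // value is the midpoint of the remaining interval.
    double binaryUpdate(boolean increase) {
        if (increase) {
            left = current;
        } else {
            right = current;
        }
        current = (left + right) / 2.0;
        return current;
    }

    public static void main(String[] args) {
        AdaptationParameter p = new AdaptationParameter(0.0, 1.0);
        System.out.println(p.value());             // 0.5
        System.out.println(p.binaryUpdate(true));  // 0.75
        System.out.println(p.binaryUpdate(false)); // 0.625
    }
}
```

The linear rule needs a well-chosen step, which is the difficulty the slide points out, while the binary rule halves the search interval on every decision and needs no step size.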

Page 20: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

21

Page 21: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

22

Data Mining Applications &

System Evaluation

• Two data mining applications
– Clustream: clustering data arriving in data streams

Page 22: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

23

Data Mining Applications &

System Evaluation

• Dist-Freq-Counting: finding frequent itemsets from distributed streams

Page 23: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

24

Data Mining Applications &

System Evaluation

Page 24: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

25

Data Mining Applications &

System Evaluation

Page 25: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

26

Data Mining Applications &

System Evaluation

Page 26: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

27

Data Mining Applications &

System Evaluation

Page 27: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

28

Data Mining Applications &

System Evaluation

Page 28: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

29

Data Mining Applications &

System Evaluation

Page 29: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

30

Data Mining Applications &

System Evaluation

Page 30: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

31

Data Mining Applications &

System Evaluation

Page 31: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

32

Data Mining Applications &

System Evaluation

Page 32: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

33

Resource Allocation Schemes

• Problem definition
– Grid resource scheduling for pipelined processing and real-time distributed streaming applications
– Mapping workflows onto a Grid is an NP-complete problem
– Static part: the resource allocation problem for GATES is to determine a deployment configuration
– Dynamic part

Page 33: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

34

Static Allocation Scheme

• The number of data sources and their locations
• The destination
• The number of stages in the pipeline
• The number of instances of each stage
• How the instances connect to each other
• The node where each instance is placed

[Figure: a deployment configuration template]
Data Source 1: 162.9.23.1
Data Source 2: 78.29.242.8
Data Source 3: 192.168.2.8
Data Source 4: 123.97.61.9
Stage 2: Placement 1 … Placement n1
Stage 3: Placement 1 … Placement n2
Stage 4: Placement 1 … Placement n3
Destination: m1.cluster2.edu

Static allocation problem: determining a deployment configuration

Objective: automatically generate a deployment configuration according to information about the available resources
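The items that a deployment configuration has to fix can be captured in a small data model, sketched below. The class, its fields, and the `example.org` node names are hypothetical; only the two data source addresses and the destination name are taken from the slide. The sketch records a configuration and does not attempt the allocation itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical data model for a deployment configuration. Class, field, and
// node names are illustrative and are not taken from GATES.
class DeploymentConfiguration {
    final List<String> dataSources = new ArrayList<>();
    String destination;
    // Stage number -> nodes hosting the instances (placements) of that stage.
    final Map<Integer, List<String>> placements = new LinkedHashMap<>();
    // Directed links {from, to} between sources, instances, and destination.
    final List<String[]> links = new ArrayList<>();

    // Add one instance of `stage` on `node`.
    void place(int stage, String node) {
        placements.computeIfAbsent(stage, k -> new ArrayList<>()).add(node);
    }

    void connect(String from, String to) {
        links.add(new String[] {from, to});
    }

    int stageCount() {
        return placements.size();
    }

    int instanceCount(int stage) {
        return placements.getOrDefault(stage, new ArrayList<>()).size();
    }

    // A small configuration: two sources, two stage-2 instances, one stage-3
    // instance, and one destination.
    static DeploymentConfiguration example() {
        DeploymentConfiguration c = new DeploymentConfiguration();
        c.dataSources.add("162.9.23.1");
        c.dataSources.add("78.29.242.8");
        c.destination = "m1.cluster2.edu";
        c.place(2, "nodeA.example.org"); // hypothetical node names
        c.place(2, "nodeB.example.org");
        c.place(3, "nodeC.example.org");
        c.connect("162.9.23.1", "nodeA.example.org");
        c.connect("78.29.242.8", "nodeB.example.org");
        c.connect("nodeA.example.org", "nodeC.example.org");
        c.connect("nodeB.example.org", "nodeC.example.org");
        c.connect("nodeC.example.org", "m1.cluster2.edu");
        return c;
    }

    public static void main(String[] args) {
        DeploymentConfiguration c = example();
        System.out.println("stages: " + c.stageCount());
        System.out.println("instances of stage 2: " + c.instanceCount(2));
        System.out.println("links: " + c.links.size());
    }
}
```

A static allocation scheme would fill in such a structure automatically from information about the available resources, rather than by hand as in `example()`.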

Page 34: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

35

Static Allocation Scheme

Examples of deployment configurations

[Figure: two example deployment configurations, each connecting Data Sources 1 to 4 to the Destination. First example: Stage 2 (Placements 1 and 2), Stage 3 (Placement 1), Stage 4 (Placement 1). Second example: Stage 2 (Placements 1, 2, and 3), Stage 3 (Placements 1 and 2), Stage 4 (Placements 1 and 2)]

Page 35: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

36

Related Work

• Grid resource allocation
– Condor
– Realtor
– ACDS, etc.
– Main difference: our work focuses on Grid resource allocation for workflow applications

• Adaptation through a middleware
– Cheng et al.’s adaptation framework
– SWiFT
– Conductor
– DART
– ROAM
– Main difference: our work focuses on general support for adaptation at run time

Page 36: A Grid-Based Middleware’s  Support for  Processing Distributed Data Streams

37

Summary

• Grid computing could be an effective solution for distributed data stream processing

• GATES
– Distributed processing
– Exploits Grid web services
– Self-adaptation to meet real-time constraints
– Grid resource allocation schemes