A Grid-Based Middleware’s Support for Processing Distributed Data Streams
1
A Grid-Based Middleware’s Support for
Processing Distributed Data Streams
Liang Chen
Advisor: Gagan Agrawal
Computer Science & Engineering
2
Introduction-Motivation
• Data stream processing and analysis
  – Data stream: data arrive continuously and need to be processed in real time
• Data stream applications:
  – Online network intrusion detection
  – Sensor networks
  – Network fault management system for telecommunication network elements
  – Computer vision based surveillance
• Common features of data streams
  – Continuous arrival
  – Enormous volume
  – Real-time constraints
  – Data sources could be distributed
3
Introduction-Motivation
Network Fault Management System: analyzing alarm message streams
[Figure: a switch network, with one element marked X, connected to the Network Fault Management System]
4
Introduction-Motivation
Computer Vision Based Surveillance
5
Introduction-Motivation
• Challenges & possible solutions
  – Challenge 1: data and/or computation intensive
[Figure: a switch network, with one element marked X]
6
Introduction-Motivation
• Challenges & possible solutions
  – Challenge 1: data and/or computation intensive
  – Solution: Grid computing technologies
[Figure: a switch network]
7
Introduction-Motivation
• Challenges & possible solutions
  – Challenge 1: data and/or computation intensive
  – Solution: Grid computing technologies
  – Challenge 2: real-time analysis is required
  – Solution: self-adaptation functionality is desired
8
Introduction-Motivation
• From the point of view of developers who are interested in data stream applications:
  – They would like to concentrate on the applications themselves
  – They would not like to spend effort on
    • Grid computing
    • Adaptation functionality
9
Introduction-Our Approach
• A middleware that is based on Grid standards and tools and provides self-adaptation functionality
• The middleware is referred to as GATES (Grid-based AdapTive Execution on Streams)
  – Automatically distributed to proper computing nodes
  – Automatically self-adaptive to a varying environment, without developers implementing specific adaptation algorithms
10
System Architecture and Design
(From Application Perspective)
• Break the task down into several sub-tasks so that the sub-tasks form a pipeline
• Implement each sub-task in Java
• Write an XML configuration file so that the sub-tasks can be deployed automatically, i.e.
  – specify how many stages (sub-tasks) the pipeline has
  – specify where the code implementing each sub-task resides
• Launch the application by running a Java program (StreamClient.class) provided by GATES
11
System Architecture and Design
(Architecture)
12
System Architecture and Design
(Architecture)
[Figure: a three-stage pipeline (Stage A, Stage B, Stage C). Legend: Grid services of GATES; stages of an application; queues between Grid services; buffers for applications]
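The stage-and-queue layout named in the architecture legend can be illustrated in plain Java. This is a minimal sketch, not GATES code: the class `PipelineSketch`, the `END` sentinel, the per-stage arithmetic, and the use of `BlockingQueue` are all assumptions made for illustration, and no Grid services are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: each stage is a thread that reads from an input
// queue, applies a function, and writes to the next stage's queue.
class PipelineSketch {
    // Sentinel marking the end of the stream (an assumption of this sketch).
    static final Integer END = Integer.MIN_VALUE;

    static void put(BlockingQueue<Integer> q, Integer v) {
        try {
            q.put(v);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    static Integer take(BlockingQueue<Integer> q) {
        try {
            return q.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Starts one stage; it stops after forwarding the END sentinel.
    static void stage(BlockingQueue<Integer> in, BlockingQueue<Integer> out,
                      Function<Integer, Integer> work) {
        new Thread(() -> {
            while (true) {
                Integer item = take(in);
                if (item.equals(END)) {
                    put(out, END);
                    return;
                }
                put(out, work.apply(item));
            }
        }).start();
    }

    // Runs the inputs through a three-stage pipeline A -> B -> C.
    static List<Integer> run(List<Integer> inputs) {
        BlockingQueue<Integer> qA = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> qB = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> qC = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> sink = new LinkedBlockingQueue<>();
        stage(qA, qB, x -> x + 1);   // Stage A
        stage(qB, qC, x -> x * 2);   // Stage B
        stage(qC, sink, x -> x - 3); // Stage C
        for (Integer i : inputs) {
            put(qA, i);
        }
        put(qA, END);
        List<Integer> results = new ArrayList<>();
        while (true) {
            Integer r = take(sink);
            if (r.equals(END)) {
                break;
            }
            results.add(r);
        }
        return results;
    }
}
```

Because each stage is a single thread reading from a FIFO queue, items leave the pipeline in the order they entered. In this sketch the length of each queue is the natural place to observe whether a stage is keeping up.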
13
System Architecture and Design
(Example)
public class SamplingStage implements StreamProcessing {
    …
    void init() {
        // Register the range of the adjustment parameter with GATES
        GATES.informationAboutAdjustmentParameter(min, max, 1);
    }
    …
    void work(buffer in, buffer out) {
        while (true) {
            Image img = getFromBufferInGATES(in);
            // Use the sampling ratio currently suggested by GATES
            samplingRatio = GATES.getSuggestedParameter();
            Image imgSample = Sampling(img, samplingRatio);
            putToBufferInGATES(imgSample, out);
        }
    }
}
14
Self-adaptation Algorithm
• Given the long-term factor d~ of the queue at each stage, we want to improve how the value of an adaptation parameter is adjusted. Two questions arise:
1. Should the adaptation parameter be modified, and if so, in which direction?
2. How should the new value of the adaptation parameter be determined?
15
Enhanced Self-adaptation Algorithm
• Should the adaptation parameter be modified, and if so, in which direction?
  – The answer depends on the load status of the queues at two consecutive stages
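One way to picture a decision based on the load status of two consecutive queues is the sketch below. It is only an illustration under stated assumptions: the three load classes, the occupancy thresholds, and the decision policy are all hypothetical and are not taken from the slides, whose load-state table is not reproduced in this transcript. The sketch also assumes that a larger parameter value (for example a higher sampling ratio) means more work at the stage and more data sent downstream.

```java
// Hypothetical sketch: choose an adjustment direction for a stage's
// adaptation parameter from the load of its own queue and the next queue.
class DirectionSketch {
    enum Load { UNDER, NORMAL, OVER }
    enum Direction { DECREASE, KEEP, INCREASE }

    // Classifies a queue by occupancy, given as a fraction of capacity.
    // The low/high thresholds are assumptions of this sketch.
    static Load classify(double occupancy, double low, double high) {
        if (occupancy < low) {
            return Load.UNDER;
        }
        if (occupancy > high) {
            return Load.OVER;
        }
        return Load.NORMAL;
    }

    // Assumed policy: back off if either queue is overloaded, push harder
    // only if both queues are underloaded, otherwise leave the value alone.
    static Direction decide(Load ownQueue, Load nextQueue) {
        if (ownQueue == Load.OVER || nextQueue == Load.OVER) {
            return Direction.DECREASE;
        }
        if (ownQueue == Load.UNDER && nextQueue == Load.UNDER) {
            return Direction.INCREASE;
        }
        return Direction.KEEP;
    }
}
```

For example, `DirectionSketch.decide(Load.OVER, Load.UNDER)` yields `DECREASE` under this policy, because the stage's own queue is backing up even though the downstream stage has spare capacity.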
16
Enhanced Self-adaptation Algorithm
Performance parameter P_B
[Figure: load states of the three-stage pipeline A, B, C, grouped into convergent states and non-convergent states]
17
Enhanced Self-adaptation Algorithm
Summary of Load States
18
Enhanced Self-adaptation Algorithm
• How to determine the new value for the adaptation parameter
  – Linear update: increase or decrease by a fixed value
    • Hard to find a proper fixed value
  – Previous method
  – Binary tree search
[Equation: an update of the performance parameter P_B in terms of the long-term factor d~ and the terms T1 and T2; not legible in this transcript]
20
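Enhanced Self-adaptation Algorithm
[Figure: binary search update, showing the Left Border, Current Value, and Right Border before the update, the New Value, and the Left Border, Current Value, and Right Border after the update]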
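The border labels in the figure suggest a bisection-style update, which can be sketched as follows. This is a hedged illustration rather than the GATES algorithm: the class `BinarySearchUpdate`, the midpoint rule, and the way the borders shrink are assumptions made for this sketch.

```java
// Hypothetical sketch: update a parameter by bisecting the interval
// between the current value and the border in the chosen direction.
class BinarySearchUpdate {
    double left;
    double right;
    double current;

    BinarySearchUpdate(double left, double right, double current) {
        if (!(left <= current && current <= right)) {
            throw new IllegalArgumentException("current must lie within the borders");
        }
        this.left = left;
        this.right = right;
        this.current = current;
    }

    // Moves toward the right border; the old value becomes the left border.
    double increase() {
        left = current;
        current = (current + right) / 2.0;
        return current;
    }

    // Moves toward the left border; the old value becomes the right border.
    double decrease() {
        right = current;
        current = (left + current) / 2.0;
        return current;
    }
}
```

Unlike a linear update, the step size here needs no hand-tuned constant: each update halves the remaining interval, so the steps shrink as the value settles.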
21
22
Data Mining Applications &
System Evaluation
• Two data mining applications
  – CluStream: clustering data arriving in data streams
23
Data Mining Applications &
System Evaluation
• Dist-Freq-Counting: finding frequent itemsets from distributed streams
33
Resource Allocation Schemes
• Problem definition
  – Grid resource scheduling for pipelined processing and real-time distributed streaming applications
  – Mapping workflows onto a Grid is an NP-complete problem
  – Static part: the resource allocation problem for GATES is to determine a deployment configuration
  – Dynamic part
34
Static Allocation Scheme
Static allocation problem: determining a deployment configuration
Objective: automatically generate a deployment configuration according to the information on available resources
A deployment configuration covers:
• The number of data sources and their locations
• The destination
• The number of stages making up the pipeline
• The number of instances of each stage
• How the instances connect to each other
• The node where each instance is placed
[Figure: example inputs and unknowns]
  Data Source 1: 162.9.23.1
  Data Source 2: 78.29.242.8
  Data Source 3: 192.168.2.8
  Data Source 4: 123.97.61.9
  Destination: m1.cluster2.edu
  Stage 2: Placement 1 … Placement n1
  Stage 3: Placement 1 … Placement n2
  Stage 4: Placement 1 … Placement n3
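To make the notion of a deployment configuration concrete, the sketch below represents one as a map from each stage to the nodes hosting its instances, and fills it with a naive round-robin placement. Everything here is an assumption for illustration: the class `DeploymentSketch`, the node names, and the round-robin rule are hypothetical stand-ins and do not describe the allocation scheme used by GATES.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a deployment configuration maps each stage to the
// nodes hosting its instances. Round-robin placement is a stand-in only.
class DeploymentSketch {
    // instancesPerStage.get(i) is the number of instances of stage i + 2;
    // numbering from Stage 2 mirrors the labels in the slide's figure.
    static Map<String, List<String>> generate(List<Integer> instancesPerStage,
                                              List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no nodes available");
        }
        Map<String, List<String>> config = new LinkedHashMap<>();
        int next = 0;
        for (int s = 0; s < instancesPerStage.size(); s++) {
            List<String> placements = new ArrayList<>();
            for (int k = 0; k < instancesPerStage.get(s); k++) {
                // Assign the next node in turn, wrapping around the list.
                placements.add(nodes.get(next % nodes.size()));
                next++;
            }
            config.put("Stage " + (s + 2), placements);
        }
        return config;
    }
}
```

A realistic scheme would also weigh node capacity and the network distance to the data sources and the destination; the sketch ignores both and only shows the shape of the output.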
35
Static Allocation Scheme
Examples of deployment configurations
[Figure: two example configurations, each with Data Sources 1-4 and one Destination]
  Example 1: Stage 2 with Placements 1-2; Stage 3 with Placement 1; Stage 4 with Placement 1
  Example 2: Stage 2 with Placements 1-3; Stage 3 with Placements 1-2; Stage 4 with Placements 1-2
36
Related work
• Grid resource allocation
  – Condor
  – Realtor
  – ACDS, etc.
  – Main difference: our work focuses on Grid resource allocation for workflow applications
• Adaptation through a middleware
  – Cheng et al.’s adaptation framework
  – SWiFT
  – Conductor
  – DART
  – ROAM
  – Main difference: our work focuses on general support for adaptation at run time
37
Summary
• Grid computing could be an effective solution for distributed data stream processing
• GATES
  – Distributed processing
  – Exploits Grid web services
  – Self-adaptation to meet real-time constraints
  – Grid resource allocation schemes