Optimizing of data access using replication technique

Renata Słota (1), Darin Nikolow (1), Łukasz Skitał (2), Jacek Kitowski (1,2)
(1) Institute of Computer Science AGH-UST, Cracow
(2) ACC CYFRONET AGH, Cracow

Transcript of Optimizing of data access using replication technique

Page 1: Optimizing of data access using replication technique

Optimizing of data access using replication technique
Renata Słota (1), Darin Nikolow (1), Łukasz Skitał (2), Jacek Kitowski (1,2)
(1) Institute of Computer Science AGH-UST, Cracow
(2) ACC CYFRONET AGH, Cracow

Page 2: Optimizing of data access using replication technique

Agenda
• Motivation of the work: why does grid computing need replication today?
• Replication basics
• Clusterix Data Management System
  – Architecture, optimization and replication algorithms
• Optimization example
• Replication example
• Summary, conclusions

Page 3: Optimizing of data access using replication technique

Site-level vs. Grid-level replication
• Site-level replication
  – Replicas in one site
  – Implementation examples: RAID, HSM
• Grid-level replication
  – Data management systems
  – Replicas spread over many sites

Page 4: Optimizing of data access using replication technique

Motivation of the work
Why does grid computing need replication today?
• Data protection and availability
  – A malfunction of one storage system does not affect the data itself; only performance is affected
• Performance
  – Low-level optimization and replication (RAID, HSM) are not sufficient
  – Limited network bandwidth
  – Limited storage performance

Page 5: Optimizing of data access using replication technique

Replication scenarios
• Static replication
  – Decision made by the system administrator or the user
  – Limited system support: replica selection, replica coherency, replica ordering
• Dynamic replication
  – Decision made by a dedicated grid component, based on the users' current data access pattern
  – Full system support

Page 6: Optimizing of data access using replication technique

Replication consequences
• Optimal replica selection algorithm
• Replica creation and removal algorithm
• Cost of replica creation, update and storage
• Replica coherency

Page 7: Optimizing of data access using replication technique

Clusterix: National Cluster of Linux Systems
• Project aim: to develop a set of tools and procedures for building a productive Grid environment based on local PC clusters spread across independent supercomputing centers
• Network layer: Pionier (Polish optical network)

Page 8: Optimizing of data access using replication technique

Clusterix Data Management System

Architecture

Page 9: Optimizing of data access using replication technique

Optimization Algorithm
• Selects the optimal storage element for:
  – data access
  – replica creation
• Takes into account the current state of the system
• The optimal storage element is the one with the maximal weight W(s,d):

W(s,d) = min((1 - NetLoad(s)) · bandwidth(s,d), (1 - Sload(s)) · Sbandwidth(s))

s – storage element
d – destination node
NetLoad(s) – load of the network interface of s
bandwidth(s,d) – available bandwidth between s and d
Sload(s) – storage system load
Sbandwidth(s) – storage system bandwidth
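The weight formula can be sketched in a few lines of Python. This is only an illustration, not the CDMS implementation (which the slides describe as C/C++): the `StorageElement` record, its field names and the sample numbers are hypothetical, and loads are assumed to be fractions between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageElement:
    """Hypothetical snapshot of one storage element s as seen from node d."""
    name: str
    net_load: float           # NetLoad(s), assumed to be a fraction in [0, 1]
    link_bandwidth: float     # bandwidth(s, d)
    storage_load: float       # Sload(s), assumed to be a fraction in [0, 1]
    storage_bandwidth: float  # Sbandwidth(s)


def weight(se: StorageElement) -> float:
    """W(s,d): the smaller of the free network and free storage throughput."""
    network = (1.0 - se.net_load) * se.link_bandwidth
    storage = (1.0 - se.storage_load) * se.storage_bandwidth
    return min(network, storage)


def select_storage_element(candidates):
    """Return the candidate with the maximal weight W(s,d)."""
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no storage element to choose from")
    return max(candidates, key=weight)


# Made-up measurements, for illustration only.
elements = [
    StorageElement("SE1", 0.5, 100.0, 0.25, 80.0),   # min(50, 60) = 50
    StorageElement("SE2", 0.25, 100.0, 0.5, 120.0),  # min(75, 60) = 60
    StorageElement("SE3", 0.75, 200.0, 0.0, 40.0),   # min(50, 40) = 40
]
print(select_storage_element(elements).name)  # SE2
```

Taking the minimum means the slower of the two resources (network path or storage system) bounds the weight, so a fast link cannot compensate for a saturated storage system.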

Page 10: Optimizing of data access using replication technique

Automatic replication algorithm
• Takes into account the gain from replication G(), the cost of replica creation C(), the cost of replica updates U() and an administrative factor A()
• Replication profit:

P(d,R,S,f) = G(d,R,S,f) + C(d,R,f) + U(d,R,S,f) + A(d,f)

d – storage element for which the profit is computed
R – set of storage elements containing replicas of f
S – statistical data (history of file usage)
f – the file under consideration
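A minimal sketch of how the profit could be evaluated. The slides give only the sum, not the definitions of G, C, U and A, so the four terms are passed in as functions. Two assumptions are made here and are not taken from the slides: the cost terms return non-positive values (since the formula adds them), and a replica is created only when the best profit exceeds a threshold. The toy terms at the bottom are invented for illustration.

```python
def replication_profit(d, replicas, stats, f, *,
                       gain, creation_cost, update_cost, admin_factor):
    """P(d,R,S,f) = G(d,R,S,f) + C(d,R,f) + U(d,R,S,f) + A(d,f)."""
    return (gain(d, replicas, stats, f)
            + creation_cost(d, replicas, f)
            + update_cost(d, replicas, stats, f)
            + admin_factor(d, f))


def best_target(candidates, replicas, stats, f, threshold=0.0, **terms):
    """Return the storage element with the highest profit above threshold.

    Returns None when no candidate qualifies. The threshold rule is an
    assumption of this sketch, not something stated on the slides.
    """
    best, best_profit = None, threshold
    for d in candidates:
        if d in replicas:  # d already holds a replica of f
            continue
        profit = replication_profit(d, replicas, stats, f, **terms)
        if profit > best_profit:
            best, best_profit = d, profit
    return best


# Toy terms, purely hypothetical: gain = past accesses to f near d,
# fixed creation cost, update cost growing with the number of replicas.
toy_terms = dict(
    gain=lambda d, R, S, f: float(S.get((f, d), 0)),
    creation_cost=lambda d, R, f: -2.0,
    update_cost=lambda d, R, S, f: -0.5 * len(R),
    admin_factor=lambda d, f: 0.0,
)
history = {("F", "SE3"): 1, ("F", "SE4"): 6}
holders = frozenset({"SE1", "SE2"})
print(best_target(["SE1", "SE2", "SE3", "SE4"], holders, history, "F", **toy_terms))  # SE4
```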

Page 11: Optimizing of data access using replication technique

Storage oriented problems
Data-intensive applications for Clusterix:
• Simulation of transonic flow past wing tips
• Visualization of complex multidimensional structures
• Ecosystem modeling and simulation

Page 12: Optimizing of data access using replication technique

Optimization Example
Node A needs file F, which is stored on SE1, SE2 and SE3.
[Diagram: Node A; storage elements SE1, SE2 and SE3, each holding a replica of F; JIMS and NMS monitoring components; CDMS with its Optimizer]

Page 13: Optimizing of data access using replication technique

Optimization Example
Node A sends a request to CDMS.
[Diagram: Node A; storage elements SE1, SE2 and SE3, each holding a replica of F; JIMS and NMS monitoring components; CDMS with its Optimizer]

Page 14: Optimizing of data access using replication technique

Optimization Example
CDMS uses the Optimizer to choose the optimal SE.
[Diagram: Node A; storage elements SE1, SE2 and SE3, each holding a replica of F; JIMS and NMS monitoring components; CDMS with its Optimizer]

Page 15: Optimizing of data access using replication technique

Optimization Example
Optimizer is working…
[Diagram: Node A; storage elements SE1, SE2 and SE3, each holding a replica of F; JIMS and NMS monitoring components; CDMS with its Optimizer, which computes one weight per storage element]

W(s1,d) = min((1 - NetLoad(s1)) · bandwidth(s1,d), (1 - Sload(s1)) · Sbandwidth(s1))
W(s2,d) = min((1 - NetLoad(s2)) · bandwidth(s2,d), (1 - Sload(s2)) · Sbandwidth(s2))
W(s3,d) = min((1 - NetLoad(s3)) · bandwidth(s3,d), (1 - Sload(s3)) · Sbandwidth(s3))
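The request flow of this example (Node A asks CDMS, CDMS asks the Optimizer, the Optimizer consults the monitoring data) can be sketched as below. All class and method names are hypothetical, and the real modules communicate over SOAP rather than by direct calls. The mapping of JIMS and NMS data onto the terms of W(s,d) is likewise an assumption, as are the table-backed stand-ins and the numbers.

```python
from __future__ import annotations

from typing import Protocol


class InformationSystem(Protocol):
    """Per-node data; stands in for what JIMS reports (assumed mapping)."""
    def net_load(self, se: str) -> float: ...
    def storage_load(self, se: str) -> float: ...
    def storage_bandwidth(self, se: str) -> float: ...


class NetworkMonitor(Protocol):
    """Link data; stands in for what NMS reports (assumed mapping)."""
    def bandwidth(self, src: str, dst: str) -> float: ...


class Optimizer:
    def __init__(self, info: InformationSystem, net: NetworkMonitor) -> None:
        self.info = info
        self.net = net

    def weight(self, se: str, node: str) -> float:
        """W(s,d) for storage element se and destination node."""
        network = (1.0 - self.info.net_load(se)) * self.net.bandwidth(se, node)
        storage = (1.0 - self.info.storage_load(se)) * self.info.storage_bandwidth(se)
        return min(network, storage)

    def choose(self, holders: list[str], node: str) -> str:
        if not holders:
            raise LookupError("no replica available")
        return max(holders, key=lambda se: self.weight(se, node))


class CDMS:
    """Maps a file name to its replica holders and delegates the choice."""
    def __init__(self, catalog: dict[str, list[str]], optimizer: Optimizer) -> None:
        self.catalog = catalog
        self.optimizer = optimizer

    def locate(self, filename: str, node: str) -> str:
        return self.optimizer.choose(self.catalog.get(filename, []), node)


class TableInfo:
    """Table-backed stand-in: se -> (net_load, storage_load, storage_bandwidth)."""
    def __init__(self, rows: dict[str, tuple[float, float, float]]) -> None:
        self.rows = rows

    def net_load(self, se: str) -> float:
        return self.rows[se][0]

    def storage_load(self, se: str) -> float:
        return self.rows[se][1]

    def storage_bandwidth(self, se: str) -> float:
        return self.rows[se][2]


class TableNet:
    """Table-backed stand-in: (src, dst) -> available bandwidth."""
    def __init__(self, links: dict[tuple[str, str], float]) -> None:
        self.links = links

    def bandwidth(self, src: str, dst: str) -> float:
        return self.links[(src, dst)]


# Made-up measurements, for illustration only.
info = TableInfo({"SE1": (0.5, 0.25, 80.0), "SE2": (0.25, 0.5, 120.0), "SE3": (0.75, 0.0, 40.0)})
net = TableNet({("SE1", "A"): 100.0, ("SE2", "A"): 100.0, ("SE3", "A"): 200.0})
cdms = CDMS({"F": ["SE1", "SE2", "SE3"]}, Optimizer(info, net))
print(cdms.locate("F", "A"))  # SE2
```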

Page 16: Optimizing of data access using replication technique

Automatic replication example
Situation:
• 3 clusters
• 4 storage elements, 2 of which contain a replica of file F
• A set of applications running on these clusters and accessing file F
[Diagram: storage elements SE1, SE2, SE3 and SE4 with replicas of F]

Page 17: Optimizing of data access using replication technique

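Automatic replication example
[Diagram: CDMS with Optimizer, Replication Module and Statistic Module; storage elements SE1, SE2, SE3 and SE4, two of them holding F. The Replication Module changes state from "Sleeping…" to "Working…" and evaluates: gain, cost of replication, cost of update, administrative factor]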

Page 18: Optimizing of data access using replication technique

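Automatic replication example
[Diagram: CDMS with Optimizer, Replication Module and Statistic Module; storage elements SE1, SE2, SE3 and SE4 with replicas of F. The Replication Module is "Working…", then "Sleeping…". Label: "Decision: SE2 F SE4"]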

Page 19: Optimizing of data access using replication technique

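Automatic replication example
[Diagram: CDMS with Optimizer, Replication Module and Statistic Module; storage elements SE1, SE2, SE3 and SE4, now with three replicas of F. The Replication Module is "Sleeping…"]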

Page 20: Optimizing of data access using replication technique

Summary
• The architecture of CDMS with Optimization and Replication modules has been designed
• Replication and optimization algorithms have been specified
• Module interfaces have been specified
Future work
• Integration and tests

Page 21: Optimizing of data access using replication technique

Conclusions
• Simulation of replication vs. real system implementation
• Replication should be designed to meet the specific profile of Clusterix applications
• Data availability
• Replication drawbacks

Page 22: Optimizing of data access using replication technique

Publications
• Extended functionality of Virtual Storage System for grid. Renata Słota, Darin Nikolow, Łukasz Skitał, Jacek Kitowski. Cracow Grid Workshop 2004, poster no. 13
• Application of data replication methods in Clusterix project (in Polish). Renata Słota, Darin Nikolow, Łukasz Skitał, Jacek Kitowski. Pionier 2004, 19-20 May, Poznań, electronic publication
• Implementation of replication methods in the Grid Environment. Renata Słota, Darin Nikolow, Łukasz Skitał, Jacek Kitowski. Submitted to European Grid Conference

Page 23: Optimizing of data access using replication technique

Thank You!

Page 24: Optimizing of data access using replication technique

Clusterix Data Management System

Architecture: Replication module
• Responsible for:
  – Automatic replica creation/removal
• Implementation:
  – Java
  – Apache SOAP
• Cooperates with:
  – Optimization module
  – Statistic module

Page 25: Optimizing of data access using replication technique

Clusterix Data Management System

Architecture: Optimization Module
• Responsible for:
  – storage element selection for a newly created replica
  – optimal replica selection
• Implementation:
  – C/C++
  – gSOAP
• Cooperates with:
  – Network Monitoring System (NMS)
  – Information System: JMX-based Infrastructure Monitoring System (JIMS)

Page 26: Optimizing of data access using replication technique

Clusterix Data Management System

Architecture: Information System (JIMS)
Department of Computer Science, AGH University of Science & Technology
Provides the following information for a selected node:
• Available storage capacity
• Total storage capacity
• Network interface load
• Network interface bandwidth
• Storage system load
• Average storage system load
• Maximal measured storage bandwidth

Page 27: Optimizing of data access using replication technique

Clusterix Data Management System

Architecture: Network Monitoring System
Poznan Supercomputing and Networking Center
Provides the following information:
• Maximum bandwidth between two network nodes
• Current load between two network nodes
• Node availability

Page 28: Optimizing of data access using replication technique

Clusterix Data Management System

Architecture: Statistic Module
Białystok Technical University
Responsible for gathering information about past data usage
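As an illustration of what "information about past data usage" might look like, here is a small in-memory access counter. The slides do not describe the Statistic Module's interface or storage, so the `AccessHistory` class and its methods are entirely hypothetical.

```python
from __future__ import annotations

from collections import Counter


class AccessHistory:
    """Hypothetical usage log: counts accesses per (file, requesting node)."""

    def __init__(self) -> None:
        self._by_file: dict[str, Counter] = {}

    def record(self, filename: str, node: str) -> None:
        """Register one access to filename issued from node."""
        self._by_file.setdefault(filename, Counter())[node] += 1

    def accesses(self, filename: str, node: str) -> int:
        """Number of recorded accesses; 0 for unknown files or nodes."""
        return self._by_file.get(filename, Counter())[node]

    def top_readers(self, filename: str, n: int = 1) -> list[tuple[str, int]]:
        """The n nodes that accessed filename most often, with their counts."""
        return self._by_file.get(filename, Counter()).most_common(n)


history = AccessHistory()
for node in ["A", "A", "B", "A"]:
    history.record("F", node)
print(history.top_readers("F"))  # [('A', 3)]
```

Counts like these are the kind of input the statistic data S in the replication profit P(d,R,S,f) would need, for example to estimate the gain of placing a replica near the most frequent reader.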