Post on 17-Jan-2018
University of Westminster – www.cpc.wmin.ac.uk
Checkpointing Mechanism for the Grid Environment
K Sajadah, G Terstyanszky, S Winter, P. Kacsuk
University of Westminster
Checkpointing of Parallel Applications in a Grid Environment

The Grid Environment
Nature of the Grid environment:
– Generic, heterogeneous, and dynamic, with many unreliable resources, which leaves it exposed to failures.
Solution:
– Fault-tolerant mechanisms should ensure successful execution of applications.

Fault Tolerant Solutions
Retrying
– When a job fails, it is re-executed, up to a certain number of times.
– The expected completion time of the job can become very long.
Replication
– Replicas of a job are executed on different Grid resources simultaneously.
– It requires extra processing power.
Checkpointing
– It stores a snapshot of an application's state and uses it to restart execution after a failure.
– It is very efficient in environments where the failure rate is high.

Checkpointing
Transparent Checkpointing
– The checkpointing mechanism, rather than the programmer, orchestrates the checkpointing process.
– Message synchronisation is performed.
– The checkpointing and recovery process is transparent to the programmer.
Non-Transparent Checkpointing
– The mechanism provides support for checkpointing through run-time libraries.
– The programmer can specify the data that should be included in the checkpoint file.
– The approach is not transparent to the programmer.
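
The non-transparent style can be illustrated with a minimal Python sketch. It assumes a single process whose state is a plain dictionary, and it uses the standard `pickle` module as a stand-in for a checkpointing run-time library. The names `save_checkpoint`, `load_checkpoint` and `run` are hypothetical and do not come from the slides or from any Grid toolkit.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the programmer-selected state to disk (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename, so that a crash during
    # the write never leaves a half-written checkpoint under `path`.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every):
    """Toy computation that resumes from the last checkpoint if one exists."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        state["total"] += state["step"]  # the "work" of one step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            # The programmer decides what goes into the checkpoint file.
            save_checkpoint(path, state)
    return state
```

Calling `run` again with the same path after an interruption continues from the last saved step instead of starting from step 0, which is the restart behaviour described above.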

Challenges in Checkpointing
When to take the checkpoint
How to synchronise (or how to minimise inter-process communication)
What kind of information to store at the checkpoint
Where to store the checkpoint’s information
How to restore the execution after a fault

Checkpointing (2)
Performance constraints in existing solutions:
– Overheads due to synchronisation of messages.
– Checkpoint intervals are either periodic or user-defined with no regular pattern.
Proposed solution:
– Take checkpoints at the best possible pre-defined intervals.
– Minimise (or optimise) the inter-process communication as much as possible.

Checkpointing (3)
Inter-process communication can cause inconsistent checkpoints due to lost messages or orphan messages.
– To achieve a globally consistent checkpoint, synchronisation should be performed.
Synchronisation introduces extra communication among processes.
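
A small Python sketch can make the two kinds of inconsistency concrete. It assumes, purely for illustration, that every process stores per-channel message counters in its checkpoint. A message is treated as an orphan when its receipt is recorded but its send is not, and as lost (in transit) when its send is recorded but its receipt is not. The function name `classify_channels` and the counter layout are hypothetical.

```python
def classify_channels(sent, received):
    """Compare per-channel counters taken from a set of local checkpoints.

    sent[(i, j)]     = messages i had sent to j when i took its checkpoint
    received[(i, j)] = messages j had received from i when j took its checkpoint
    Returns two dicts: orphan message counts and lost (in-transit) counts.
    """
    orphan, lost = {}, {}
    for channel in set(sent) | set(received):
        n_sent = sent.get(channel, 0)
        n_received = received.get(channel, 0)
        if n_received > n_sent:
            # Receiver remembers messages the sender has not yet sent.
            orphan[channel] = n_received - n_sent
        elif n_sent > n_received:
            # Sender remembers messages the receiver has not yet seen.
            lost[channel] = n_sent - n_received
    return orphan, lost


if __name__ == "__main__":
    # Made-up counters for three processes P1, P2 and P3.
    sent = {("P1", "P2"): 3, ("P2", "P3"): 1}
    received = {("P1", "P2"): 2, ("P2", "P3"): 2}
    print(classify_channels(sent, received))
```

A set of checkpoints is globally consistent in this simplified model only when both returned dictionaries are empty, or when the lost messages have been logged so they can be replayed.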

Approaches Used
Combination of:
– First Order Approximation.
– Natural Synchronisation Points.
First Order Approximation
– Calculates the optimal checkpointing interval.
– Based on the Poisson process.
• Failures occur at random, at a given failure rate.

First Order Approximation
The optimal checkpoint interval Tc is:
– $T_c = \sqrt{2\,T_s\,T_f}$, where:
• Ts is the time required to save information at a checkpoint.
• Tf is the mean time between failures, with Tf = Th/k.
The following data are needed:
– The number of hours the program will run on the machines (Th).
– The number of failures known to occur during that time (k).
– The time required to save information at a checkpoint (Ts).
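
As a worked illustration, the first-order rule Tc = sqrt(2 · Ts · Tf) with Tf = Th/k can be evaluated in a few lines of Python. The function names and the numbers in the usage example are made up for illustration and are not measurements from the slides. All times must be given in the same unit.

```python
import math


def mean_time_between_failures(run_time, failures):
    """Tf = Th / k: run time divided by the number of failures in that time."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return run_time / failures


def optimal_checkpoint_interval(save_time, mtbf):
    """First-order approximation of the optimal interval: sqrt(2 * Ts * Tf)."""
    if save_time < 0 or mtbf <= 0:
        raise ValueError("save_time must be >= 0 and mtbf must be > 0")
    return math.sqrt(2.0 * save_time * mtbf)


if __name__ == "__main__":
    # Hypothetical inputs, all in hours: a 100 h run with 4 failures,
    # and 0.02 h (72 s) to save one checkpoint.
    tf = mean_time_between_failures(100.0, 4)
    tc = optimal_checkpoint_interval(0.02, tf)
    print(f"Tf = {tf} h, Tc = {tc:.2f} h")
```

With these assumed inputs the mean time between failures is 25 hours and the suggested interval is about one hour. A longer save time or a more reliable machine both push the interval up.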
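
First Order Approximation (2)
[Figure: execution timeline starting at t = 0, with a checkpoint of save time Ts taken every Tc. Labels mark the point of failure, the restarting point, and the rerun time tr.]
Tc = checkpoint interval
Ts = time to save a checkpoint
tr = rerun time of a failed application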
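
The quantities in this timeline suggest one common way to motivate the first-order rule. The sketch below is not necessarily the derivation the authors used. It assumes that Ts is much smaller than Tc, that Tc is much smaller than Tf, and that a failure strikes on average halfway through an interval, so the expected rerun time tr is about Tc/2.

```latex
% Fraction of run time wasted, as a function of the checkpoint interval T_c:
%   T_s / T_c      : one checkpoint of cost T_s is saved per interval
%   T_c / (2 T_f)  : one failure every T_f on average, each losing about T_c / 2
W(T_c) \approx \frac{T_s}{T_c} + \frac{T_c}{2\,T_f}

% Setting the derivative to zero gives the interval that minimises the waste:
\frac{dW}{dT_c} = -\frac{T_s}{T_c^{2}} + \frac{1}{2\,T_f} = 0
\quad\Longrightarrow\quad
T_c = \sqrt{2\,T_s\,T_f}
```

Checkpointing more often than this spends too much time saving state, while checkpointing less often loses too much work at each failure.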

First Order Approximation (3)
Using the PROVE toolset, we can measure both the execution time and the checkpointing time of an application.
Nagios can be used to determine the failure rate of Grid resources.

Natural Synchronisation Points
Examples of natural synchronisation points:
– Barriers.
– Top or bottom of a main loop.
– Collective operations (broadcast, gather, scatter, etc.)
No inter-process communication at these points.
– Therefore, there is no need to be concerned with the state of the communication channels or possible in-transit messages.
– This eliminates the overhead incurred by the synchronisation process during checkpointing.
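
The idea can be sketched in Python with threads standing in for the parallel processes, which is a simplification of a real Grid application. The sketch uses the `action` callback of `threading.Barrier`, which one thread runs once all parties have arrived and before any of them is released, so the snapshot is taken while nothing is in flight. The function name `run_workers` and the toy workload are hypothetical.

```python
import threading


def run_workers(n_workers=3, n_steps=4):
    """Run a toy main loop and checkpoint at the barrier ending each step."""
    state = [0] * n_workers  # one slot of "application state" per worker
    checkpoints = []

    def take_checkpoint():
        # Runs in exactly one thread while all workers wait at the barrier,
        # so no worker is computing or communicating during the snapshot.
        checkpoints.append(list(state))

    barrier = threading.Barrier(n_workers, action=take_checkpoint)

    def worker(rank):
        for _ in range(n_steps):
            state[rank] += rank + 1  # the "work" of one iteration
            barrier.wait()           # natural synchronisation point

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return checkpoints


if __name__ == "__main__":
    for snapshot in run_workers():
        print(snapshot)
```

No extra coordination messages are exchanged here: the barrier that the application already needs doubles as the moment at which a consistent snapshot can be taken.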

Natural Synchronisation Points (2)
[Figure: application execution with processes P1, P2 and P3 interacting.]
[Figure: coordinated checkpoint of P1, P2 and P3, waiting for in-transit messages.]

Natural Synchronisation Points (4)
[Figure: coordinated checkpoint of P1, P2 and P3, logging in-transit messages.]
[Figure: checkpointing of P1, P2 and P3 at natural synchronisation points, with labels N.S.P 1, N.S.P 2 and Ckpt1, Ckpt2.]

New Checkpointing Approach
Using the First Order Approximation only:
– Involves synchronisation of messages and capturing in-transit messages.
Checkpointing at natural synchronisation points only:
– May not be very effective because natural synchronisation points occur with no regular pattern.

New Checkpointing Approach (2)
Use a combination of both the Natural Synchronisation Points and the First Order Approximation.
Take checkpoints at the natural synchronisation points that are closest to the optimal checkpoint intervals.
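
A simplified, offline version of this selection can be sketched in Python, assuming the times of all natural synchronisation points are known in advance (a real run would discover them as they occur). The function name `nearest_sync_points` and the sample times are hypothetical.

```python
from bisect import bisect_left


def nearest_sync_points(sync_times, interval, run_length):
    """For each optimal point (interval, 2 * interval, ...) up to run_length,
    return the natural synchronisation point closest to it."""
    sync_times = sorted(sync_times)
    chosen = []
    target = interval
    while target <= run_length:
        pos = bisect_left(sync_times, target)
        # Only the neighbours just before and just after the target can be nearest.
        candidates = sync_times[max(pos - 1, 0):pos + 1]
        if candidates:
            chosen.append(min(candidates, key=lambda t: abs(t - target)))
        target += interval
    return chosen


if __name__ == "__main__":
    # Made-up synchronisation times in minutes, with an 8 minute optimal interval.
    print(nearest_sync_points([3, 7, 9, 15, 18, 25], interval=8, run_length=26))
```

This version keeps the optimal points fixed at multiples of the interval and may pick the same synchronisation point twice when such points are sparse; it is meant only to show the "closest point" idea.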

Choosing Checkpoint Intervals
[Figure: choosing appropriate checkpointing intervals. A timeline shows the First Order Approximation points (Op1 to Op6), the natural synchronisation points (Ns1 to Ns10) and the critical regions, drawn as { }.]

Choosing Checkpoint Intervals (2)
The decision to select a checkpoint is based on:
– Optimal checkpoint interval,
– Natural synchronisation points and
– Critical Region.
The checkpointing process is triggered by signals sent to the coordinated process whenever synchronisation points are encountered.

The Checkpointing Process
When the coordinated process receives a signal, it checks whether the signal falls within the critical region.
– If so, a checkpoint is taken and the clock is reset.
– If not, no checkpointing is performed.
If no natural synchronisation point is met within the critical region, a checkpoint has to be forced at the end of the critical region.
– In such cases, the checkpointing mechanism will perform synchronisation to ensure there are no lost or orphan messages.
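
The decision procedure above can be simulated with a short Python sketch. It assumes that the critical region spans `tolerance` time units on either side of the optimal interval, measured from the last checkpoint, and that the first signal inside the region triggers the checkpoint. Both are interpretations of the slides rather than stated details, and the function name `schedule_checkpoints` and the sample times are hypothetical.

```python
def schedule_checkpoints(sync_times, interval, tolerance, run_length):
    """Return a list of (time, kind) pairs, kind being 'natural' or 'forced'."""
    if interval <= 0 or not 0 <= tolerance < interval:
        raise ValueError("need interval > 0 and 0 <= tolerance < interval")
    signals = sorted(sync_times)
    checkpoints = []
    last = 0  # time of the last checkpoint, i.e. when the clock was reset
    i = 0     # index of the next unprocessed synchronisation signal
    while True:
        region_start = last + interval - tolerance
        region_end = last + interval + tolerance
        if region_start > run_length:
            break
        # Signals that arrive before the critical region are ignored.
        while i < len(signals) and signals[i] < region_start:
            i += 1
        if i < len(signals) and signals[i] <= min(region_end, run_length):
            # Signal inside the critical region: checkpoint and reset the clock.
            last = signals[i]
            checkpoints.append((last, "natural"))
            i += 1
        elif region_end <= run_length:
            # No signal arrived in time: force a checkpoint at the region's end.
            last = region_end
            checkpoints.append((last, "forced"))
        else:
            break
    return checkpoints


if __name__ == "__main__":
    # Made-up signal times in minutes; interval and tolerance as in an 8 +/- 2 setup.
    for time, kind in schedule_checkpoints([7, 12, 15, 30, 41], 8, 2, 45):
        print(time, kind)
```

Only the forced checkpoints would need the explicit message synchronisation described above; the natural ones reuse points where no messages are in transit.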

The Testbed
The MadCity traffic simulation tool was used.
– It simulates traffic on a road network and shows how individual vehicles behave on roads and at junctions.
The MadCity traffic simulator can be parallelised using PGRADE.

The Testbed (2)
[Figure: proposed checkpointing solution. A timeline shows the First Order Approximation points (Op1 to Op6), the natural synchronisation points (Ns1 to Ns9), the forced synchronisation point (Fs1), the critical regions, drawn as { }, and the saved checkpoints. A span of 4 min is labelled.]

The Testbed (3)
Through the First Order Approximation, the calculated optimal checkpoint interval was 8 minutes.
A critical region within 2 minutes of the optimal checkpoint interval was defined.
Checkpoints were taken at: Ns1, Ns2, Ns5, Fs1, Ns6, Ns9.
Overall average time between checkpoints: 8.2 minutes

Conclusion
The proposed checkpointing mechanism provides a better and more efficient way to save checkpoint images.
– It minimises the need to synchronise messages.
– It ensures that the average checkpointing interval is close to the optimal checkpointing interval defined by the First Order Approximation.

Future Work
Integrate the checkpointing solution in PGRADE to provide an efficient fault-tolerant solution for applications executed as Grid workflows.
Provide an efficient and reliable storage mechanism.
Questions