This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement n.610402
http://www.montblanc-project.eu
Understanding and Addressing the Resiliency Issues for Future Exascale Computing with the Mont-Blanc Prototype
Ferad Zyulkyarov
Barcelona Supercomputing Center
CSW & BR Oslo, May 7, 2015
Acknowledgement
• Javier Arias
• Jesus Labarta
• Filippo Mantovani
• Dani Ruiz
• Omer Subasi
• Osman Unsal
• Oriol Vilarrubi
• Gulay Yalcin
About This Presentation
• Focus on memory resiliency
• First ever attempt to characterize the memory reliability of a large system which has no memory ECC
• Relate our numbers to the state-of-the-art
• Our software-based proposals to complement HW ECC
Comparing with the Related Work
[Figure: state-of-the-art systems (Cielo, Hopper) compared with Mont-Blanc.]
[1] Sridharan et al., "Memory Errors in Modern Systems: The Good, The Bad, and The Ugly", ASPLOS 2015
[2] Sridharan and Liberty, "A Study of DRAM Failures in the Field", SC 2012
Technical Details
Parameter          Cielo              Hopper             Mont-Blanc
Nodes              8,568              6,000              1,080
Cores              137,088            144,000            2,160
Core type          AMD Opteron        AMD Opteron        ARM Cortex-A15
Memory per node    32 GB              32 GB              4 GB
Total memory       268 TB             188 TB             4.3 TB
Memory type        DDR3               DDR3               LPDDR3
ECC                Chipkill-correct   Chipkill-detect    None
Peak performance   1,120 TFlops       1,054 TFlops       35 TFlops
Altitude           2,231 m            13 m               18 m
Location           Los Alamos, NM     Oakland, CA        Barcelona
Mont-Blanc is much smaller than Cielo and Hopper.
This Study
                        Related Work   Mont-Blanc
Node hours (million)    157            0.65
GB hours (million)      11,250         1.90
This is a preliminary study, and the results are not yet statistically strong enough to draw firm conclusions.
The Mont-Blanc study covers 917 nodes, and only 3 GB per node was scanned.
Memory FIT per 1GB
[Figure: Memory FIT for 1 GB. Y-axis: FIT (0 to 16,000). Mont-Blanc: 14,744; the remaining bars read 377, 129, 152, 89, 76, and 86.]
The LPDDR3 in Mont-Blanc has a very high FIT rate. We are not sure why, but we attribute it to the memory being a low-end product.
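As a rough cross-check of the Mont-Blanc figure, the sketch below assumes that FIT per GB is the number of observed errors divided by the scanned GB-hours, scaled to 10^9 hours (the conventional FIT definition). The count of 28 errors is taken from the entries in the error log shown later in this presentation, and the function name is illustrative. The slides do not state how the 14,744 figure was actually derived.

```python
def fit_per_gb(error_count, gb_hours):
    """Failures per 10^9 hours for one GB of memory (assumed definition)."""
    return error_count / gb_hours * 1e9


# Assumption: every entry in the error log counts as one fault.
errors_logged = 28
# GB-hours scanned on Mont-Blanc, as reported on the "This Study" slide.
gb_hours_scanned = 1.90e6

estimated_fit = fit_per_gb(errors_logged, gb_hours_scanned)
print(f"Estimated FIT per GB: {estimated_fit:,.0f}")
```

With the rounded 1.90 million GB-hours, the result lands close to the reported 14,744; the small gap is consistent with rounding of the GB-hours figure.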
MTBF (all faults)
[Figure: Memory faults (corrected & uncorrected) MTBF. Y-axis: 0.0 to 30.0. Bar values as extracted: 25.0, 0.3, 0.4, 4.3, 17.9, 10.7, 26.1, 21.4, 26.9.]
We definitely need ECC in hardware; otherwise a Mont-Blanc system at the scale of Cielo may fail every 20 minutes.
Chart annotations: What if Mont-Blanc had the same amount of memory as Cielo? What if it had the same amount as Hopper?
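The "every 20 minutes" estimate can be reproduced approximately under the assumption that the MTBF in hours is 10^9 divided by the product of FIT per GB and total memory in GB. The memory sizes below assume 1 TB = 1000 GB, and the scanned Mont-Blanc memory is taken as 917 nodes times 3 GB. These choices and the function name are assumptions, not the authors' stated method.

```python
def mtbf_hours(fit_per_gb, memory_gb):
    """Mean time between failures in hours for a system with memory_gb of memory."""
    return 1e9 / (fit_per_gb * memory_gb)


MONT_BLANC_FIT = 14_744  # FIT per GB reported for Mont-Blanc

# Memory sizes in GB (assumption: 1 TB = 1000 GB).
systems = {
    "Mont-Blanc (scanned)": 917 * 3,
    "Mont-Blanc with Cielo's memory": 268_000,
    "Mont-Blanc with Hopper's memory": 188_000,
}

for name, memory_gb in systems.items():
    hours = mtbf_hours(MONT_BLANC_FIT, memory_gb)
    print(f"{name}: {hours:.2f} h ({hours * 60:.0f} min)")
```

Under these assumptions the scanned Mont-Blanc memory gives about 25 hours, while the Cielo-sized and Hopper-sized cases give roughly 0.25 and 0.36 hours, that is, about 15 to 22 minutes.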
ECC in Hardware
[Figure: Uncorrectable Memory Faults MTBF. Y-axis: Days (0.0 to 90.0). Bar values as extracted: 43.1, 0.4, 0.6, 18.2, 26.3, 0.8, 32.7, 9.5, 85.0; two further labels read 394 and 3546.]
SECDED will not be strong enough to keep a large system operable.
Projections for Exascale
[Figure: Exascale projection MTBF (High FIT is Vendor A): MTBF for Mont-Blanc DRAM, chipkill-uncorrectable errors. X-axis: total memory (32 PB, 64 PB, 96 PB, 128 PB). Y-axis: Days (0.0 to 12.0). Series: 8 Gbit, 16 Gbit, and 32 Gbit devices, each at High FIT, Low FIT, and MB FIT.]
The MTBF for uncorrectable errors with chipkill in a system with commodity memory like in Mont-Blanc will be between 0.3 and 5.1 days.
Memory Reliability
Date        Time      Host      Address     Expected  Actual  Type          Num Cells
26/02/2015  10:21:25  mb-30-13  0x2b1f1cb4  0         8000    Permanent     1
06/03/2015  08:39:51  mb-38-15  0xb93e160   81F2      81FA    Transient     1
12/03/2014  22:39:05  mb-22-3   0x38d6ea10  10000532  532     Transient     1
14/03/2015  11:43:12  mb-21-10  0x775a4dbc  3723      7723    Intermittent  1
14/03/2015  11:43:22  mb-21-10  0x775a4dbc  3727      7727    Intermittent  1
14/03/2015  11:43:24  mb-21-10  0x775a4dbc  3728      7728    Intermittent  1
14/03/2015  11:43:27  mb-21-10  0x775a4dbc  3729      7729    Intermittent  1
14/03/2015  11:43:29  mb-21-10  0x775a4dbc  372A      772A    Intermittent  1
14/03/2015  11:43:31  mb-21-10  0x775a4dbc  372B      772B    Intermittent  1
14/03/2015  11:43:34  mb-21-10  0x775a4dbc  372C      772C    Intermittent  1
14/03/2015  11:43:36  mb-21-10  0x775a4dbc  372D      772D    Intermittent  1
14/03/2015  11:43:38  mb-21-10  0x775a4dbc  372E      772E    Intermittent  1
14/03/2015  11:43:41  mb-21-10  0x775a4dbc  372F      772F    Intermittent  1
14/03/2015  11:43:43  mb-21-10  0x775a4dbc  3730      7730    Intermittent  1
14/03/2015  11:43:46  mb-21-10  0x775a4dbc  3731      7731    Intermittent  1
14/03/2015  11:43:48  mb-21-10  0x775a4dbc  3732      7732    Intermittent  1
14/03/2015  11:43:51  mb-21-10  0x775a4dbc  3733      7733    Intermittent  1
14/03/2015  11:43:53  mb-21-10  0x775a4dbc  3734      7734    Intermittent  1
14/03/2015  11:43:55  mb-21-10  0x775a4dbc  3735      7735    Intermittent  1
14/03/2015  11:43:58  mb-21-10  0x775a4dbc  3736      7736    Intermittent  1
14/03/2015  11:44:41  mb-21-10  0x775a4dbc  3748      7748    Intermittent  1
14/03/2015  11:44:43  mb-21-10  0x775a4dbc  3749      7749    Intermittent  1
14/03/2015  11:44:46  mb-21-10  0x775a4dbc  374A      774A    Intermittent  1
18/03/2015  06:25:01  mb-24-4   0xb6c069e0  453DA     53DA    Transient     1
18/03/2015  16:00:58  mb-62-6   0xa83cb640  0         2       Transient     1
26/03/2015  12:17:52  mb-26-12  0x405dfc00  6E61      461     Transient     4
26/03/2015  13:32:17  mb-24-11  0xb4f33000  E6006358  58      Transient     9
29/03/2015  09:40:54  mb-35-4   0x3f6eb824  14388     4338    Transient     1
Permanent error. The most significant bit cannot be set to 1.
Transient errors
Intermittent errors at the same address and bit. Annotated gaps between consecutive errors: 10 sec, 3 sec, 2 sec.
Multi-bit errors: the two 26/03/2015 entries affect 4 and 9 cells.
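For the two multi-bit entries in the log, the Num Cells value matches the number of bits that differ between the expected and the actual word. A minimal sketch of that check, assuming both columns are hexadecimal values (the function name is illustrative and not part of the authors' tooling):

```python
def flipped_bits(expected, actual):
    """Count the bit positions in which two words differ."""
    return bin(expected ^ actual).count("1")


# (expected, actual) pairs copied from the error log.
log_entries = {
    "mb-26-12": (0x6E61, 0x461),
    "mb-24-11": (0xE6006358, 0x58),
    "mb-38-15": (0x81F2, 0x81FA),
}

for host, (expected, actual) in log_entries.items():
    print(host, flipped_bits(expected, actual))
```

XOR leaves a 1 in every position where the two words disagree, so counting the ones gives the number of affected cells.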
These coincide with major solar flares
Reliability Techniques in Software
• Task checkpoint and restart
• Task replication
• Other ongoing activities
Advantages of Tasks for Fault Tolerance
• Task boundaries explicitly delimit the scope of the checkpoints
• The explicit task input/output declarations decrease the checkpoint state
• Compared to pthread-like parallel programs, checkpointing does not require any complex coordination or synchronization between threads
• The recovery is asynchronous
Checkpoint and Recovery for Tasks
• Recover task execution from detected faults.
• Isolate the fault propagation within the task boundaries.
• Inputs are known at runtime through explicit declaration.
• Overheads of checkpointing are minimal.
• Recovery is asynchronous.
• Limitations: does not cover the execution outside tasks.
[Diagram: task T1 from task start to task end. The input is checkpointed at task start; when a fault is detected, the input is recovered from the checkpoint and the task execution is recovered.]
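The checkpoint-and-recover scheme for tasks can be sketched as follows. This is a minimal illustration, not the runtime's implementation: `DetectedFault`, `run_with_checkpoint`, `make_flaky_task`, and the retry limit are hypothetical names and choices, and the fault is simulated by raising an exception.

```python
import copy


class DetectedFault(Exception):
    """Hypothetical signal that a fault was detected while a task was running."""


def run_with_checkpoint(task, inputs, max_retries=3):
    """Checkpoint the declared inputs, run the task, and re-run it after a fault."""
    checkpoint = copy.deepcopy(inputs)  # only the task inputs are saved
    for _ in range(max_retries + 1):
        working_copy = copy.deepcopy(checkpoint)  # restore inputs from the checkpoint
        try:
            return task(working_copy)
        except DetectedFault:
            continue  # the fault stays inside the task; try again
    raise RuntimeError("task did not complete within the retry limit")


def make_flaky_task(failures):
    """Build a task that updates its input in place and faults `failures` times."""
    remaining = {"count": failures}

    def task(data):
        data["x"] += 1  # in-place update that a fault must not leak out
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise DetectedFault("simulated fault")
        return data["x"]

    return task


result = run_with_checkpoint(make_flaky_task(2), {"x": 41})
print(result)
```

Because every attempt starts from a fresh copy of the checkpointed input, the partial update made before each simulated fault does not accumulate across retries.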
Results: Checkpoint and Recovery for Tasks
[Figures: Single-Node Scalability; Multi-Node Scalability; Checkpointing Overheads (y-axis: 0% to 14%) for Sparse LU, Cholesky, FFT, Perlin, and Stream. Series: Baseline impl., Singleton.]
Task Replication
Detect and recover from silent data corruption.
[Diagram: the input is checkpointed; task T1 and its replica T1' run on the same input and their outputs are compared. No fault: the outputs match. Fault: a third execution, T1'', runs on the same input.]
Re-execute the task one more time and use the two outputs that match as the correct result.
1. The task and its replica execute asynchronously.
2. No synchronization between the task and its replica (only at the end of the task execution).
3. Faults do not limit parallelism; re-execution is also asynchronous.
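The replication scheme can be sketched as below. For clarity the replicas run one after another here, whereas the slides describe them as running asynchronously; `run_replicated` and `make_faulty_sum` are hypothetical names, and silent data corruption is simulated by flipping one bit of the result.

```python
import copy


def run_replicated(task, inputs):
    """Run a task twice on the same input; on a mismatch, run a third time and vote."""
    checkpoint = copy.deepcopy(inputs)  # every execution starts from the same input
    first = task(copy.deepcopy(checkpoint))
    second = task(copy.deepcopy(checkpoint))
    if first == second:
        return first  # no fault observed
    third = task(copy.deepcopy(checkpoint))
    if third == first or third == second:
        return third  # the two matching outputs are taken as correct
    raise RuntimeError("no two outputs agree")


def make_faulty_sum(corrupt_call):
    """Build a summing task whose result is silently corrupted on one call."""
    calls = {"count": 0}

    def task(values):
        calls["count"] += 1
        result = sum(values)
        if calls["count"] == corrupt_call:
            result ^= 0x4000  # simulated single-bit flip in the output
        return result

    return task


result = run_replicated(make_faulty_sum(corrupt_call=2), [1, 2, 3])
print(result)
```

The third execution is only needed when the first two outputs disagree, so the fault-free path costs two executions and one comparison.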
Results: Task Replication
[Figures: Single-Node Scalability; Multi-Node Scalability; Overheads of Replication (y-axis: 0% to 9%).]
Other activities
• Software-based ECC
  • To complement the lack of ECC or weak ECC in hardware
• Selective replication
  • To reduce the cost of resource utilization by replicating the reliability-critical code
• Checkpoint and restart for tasks with MPI calls
  • To provide multi-node checkpoint restart within the task-based programming model
• Hierarchical checkpoint restart with task checkpointing
  • To decrease the checkpoint overheads and recovery time
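The slides do not say which code the software-based ECC uses. Purely as an illustration of the idea, the sketch below uses a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword; the function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value as a 7-bit Hamming codeword (list of bits, positions 1..7)."""
    code = [0] * 8  # index 0 is unused so that indices equal bit positions
    # Data bits go to positions 3, 5, 6, 7.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Parity bits at positions 1, 2, 4 cover the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (value, syndrome); a non-zero syndrome is the position that was corrected."""
    code = [0] + list(bits)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # flip the faulty bit back
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data)), syndrome


word = hamming74_encode(0b1011)
word[4] ^= 1  # simulate a single-bit memory fault at position 5
value, corrected_position = hamming74_decode(word)
print(value == 0b1011, corrected_position)
```

A code this small cannot handle the multi-bit errors seen in the log; it only shows how software can detect and repair a single flipped bit without hardware support.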
Summary
• Preliminary memory reliability characterization
• Low-end commodity DRAM devices might be more susceptible to transient faults
• Even strong memory ECC alone may not be sufficient to mitigate transient faults in exascale computing
• SW-based fault tolerance which is coupled to a specific programming model might be a lightweight solution to complement HW-based ECC
Thank you!
Questions?