
This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 610402.

http://www.montblanc-project.eu

May 7, 2015, CSW & BR Oslo

Understanding and Addressing the Resiliency Issues for Future Exascale Computing with the Mont-Blanc Prototype

Ferad Zyulkyarov

Barcelona Supercomputing Center


Acknowledgement

• Javier Arias
• Jesus Labarta
• Filippo Mantovani
• Dani Ruiz
• Omer Subasi
• Osman Unsal
• Oriol Vilarrubi
• Gulay Yalcin


About This Presentation

• Focus on memory resiliency

• First ever attempt to characterize the memory reliability of a large system which has no memory ECC

• Relate our numbers to the state-of-the-art

• Our software-based proposals to complement HW ECC


Comparing with the Related Work

State of the art: Cielo and Hopper. This study: Mont-Blanc.

[1] Sridharan et al., "Memory Errors in Modern Systems: The Good, The Bad, and The Ugly", ASPLOS 2015
[2] Sridharan and Liberty, "A Study of DRAM Failures in the Field", SC 2012

Technical Details

Parameter        | Cielo            | Hopper          | Mont-Blanc
Nodes            | 8,568            | 6,000           | 1,080
Cores            | 137,088          | 144,000         | 2,160
Core type        | AMD Opteron      | AMD Opteron     | ARM Cortex-A15
Memory per node  | 32 GB            | 32 GB           | 4 GB
Total memory     | 268 TB           | 188 TB          | 4.3 TB
Memory type      | DDR3             | DDR3            | LPDDR3
ECC              | Chipkill-correct | Chipkill-detect | None
Peak performance | 1,120 TFlops     | 1,054 TFlops    | 35 TFlops
Altitude         | 2,231 m          | 13 m            | 18 m
Location         | Los Alamos, NM   | Oakland, CA     | Barcelona

Mont-Blanc is much smaller than Cielo and Hopper.


This Study

Metric               | Related Work | Mont-Blanc
Node hours (million) | 157          | 0.65
GB hours (million)   | 11,250       | 1.90

This is a preliminary study, and the results are not yet statistically strong enough to draw firm conclusions.

The Mont-Blanc study covers 917 nodes, and only 3 GB per node was scanned.


Memory FIT per 1GB

[Figure: bar chart "Memory FIT for 1GB"; y-axis: FIT, 0 to 16,000. Data labels: 14,744; 377; 129; 152; 89; 76; 86.]

The LPDDR3 in Mont-Blanc has a very high FIT rate. We are not sure why, but we attribute it to the memory being a low-end product.
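As a rough cross-check of the 14,744 value on this slide, FIT is conventionally the number of failures per 10^9 hours, here normalized to 1 GB of memory. The sketch below is an illustration only: the function name is hypothetical, and the fault count of 28 is an assumption taken from the number of rows in the error log in this deck (the slides do not state which count was used). The 1.90 million GB-hours come from the "This Study" slide.

```python
def fit_per_gb(fault_count: int, gb_hours: float) -> float:
    """Failures in time (FIT): faults per 10**9 hours, normalized to 1 GB."""
    if gb_hours <= 0:
        raise ValueError("gb_hours must be positive")
    return fault_count / gb_hours * 1e9


# Assumed inputs: 28 logged faults, 1.90 million GB-hours scanned.
mont_blanc_fit = fit_per_gb(28, 1.90e6)
print(round(mont_blanc_fit))
```

Under these assumptions the result lands within about 0.1% of the reported value; the small gap is plausibly due to rounding of the GB-hours figure.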


MTBF (all faults)

[Figure: bar chart "Memory faults (corrected & uncorrected) MTBF"; y-axis: 0.0 to 30.0. Data labels: 25.0; 0.3; 0.4; 4.3; 17.9; 10.7; 26.1; 21.4; 26.9.]

We definitely need ECC in hardware: otherwise a Mont-Blanc system at the scale of Cielo may fail every 20 minutes.
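The "every 20 minutes" estimate can be approximated from the per-GB FIT rate. This is a minimal sketch under stated assumptions, not the authors' method: it assumes faults scale linearly with memory capacity, that the chart is in hours, and that 1 TB is 1,000 GB. The function and constant names are hypothetical.

```python
HOURS_PER_FIT_UNIT = 1e9  # FIT counts failures per 10**9 hours


def mtbf_hours(fit_per_gb: float, capacity_gb: float) -> float:
    """MTBF in hours, assuming the fault rate scales linearly with capacity."""
    total_fit = fit_per_gb * capacity_gb
    return HOURS_PER_FIT_UNIT / total_fit


MONT_BLANC_FIT_PER_GB = 14_744
# Total memory of the two reference systems in GB (assuming 1 TB = 1,000 GB).
what_if = {"Cielo": 268_000, "Hopper": 188_000}

for system, capacity_gb in what_if.items():
    hours = mtbf_hours(MONT_BLANC_FIT_PER_GB, capacity_gb)
    print(f"{system}-sized memory: {hours:.2f} h ({hours * 60:.0f} min)")
```

Under these assumptions the two results round to 0.3 h and 0.4 h, that is, a fault roughly every 15 to 22 minutes.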

Chart annotations: "What if MB had the same amount of memory as Cielo?" and "What if MB had the same amount of memory as Hopper?"

ECC in Hardware

[Figure: bar chart "Uncorrectable Memory Faults MTBF"; y-axis: Days, 0.0 to 90.0. Data labels: 43.1; 0.4; 0.6; 18.2; 26.3; 0.8; 32.7; 9.5; 85.0; 394; 3546.]

SECDED will not be strong enough to keep a large system operable.


Projections for Exascale

[Figure: "Exascale projection MTBF (High FIT is Vendor A)", showing the MTBF of Mont-Blanc DRAM for Chipkill-uncorrectable errors. x-axis: total memory of 32PB, 64PB, 96PB, 128PB; y-axis: Days, 0.0 to 12.0. Series: 8Gbit, 16Gbit and 32Gbit devices, each with High FIT, Low FIT and MB FIT.]

For a system built with commodity memory like Mont-Blanc's, the projected MTBF for errors that chipkill cannot correct is between 0.3 and 5.1 days.

Memory Reliability

Date        Time      Host      Address     Expected  Actual  Type          Num Cells
26/02/2015  10:21:25  mb-30-13  0x2b1f1cb4  0         8000    Permanent     1
06/03/2015  08:39:51  mb-38-15  0xb93e160   81F2      81FA    Transient     1
12/03/2014  22:39:05  mb-22-3   0x38d6ea10  10000532  532     Transient     1
14/03/2015  11:43:12  mb-21-10  0x775a4dbc  3723      7723    Intermittent  1
14/03/2015  11:43:22  mb-21-10  0x775a4dbc  3727      7727    Intermittent  1
14/03/2015  11:43:24  mb-21-10  0x775a4dbc  3728      7728    Intermittent  1
14/03/2015  11:43:27  mb-21-10  0x775a4dbc  3729      7729    Intermittent  1
14/03/2015  11:43:29  mb-21-10  0x775a4dbc  372A      772A    Intermittent  1
14/03/2015  11:43:31  mb-21-10  0x775a4dbc  372B      772B    Intermittent  1
14/03/2015  11:43:34  mb-21-10  0x775a4dbc  372C      772C    Intermittent  1
14/03/2015  11:43:36  mb-21-10  0x775a4dbc  372D      772D    Intermittent  1
14/03/2015  11:43:38  mb-21-10  0x775a4dbc  372E      772E    Intermittent  1
14/03/2015  11:43:41  mb-21-10  0x775a4dbc  372F      772F    Intermittent  1
14/03/2015  11:43:43  mb-21-10  0x775a4dbc  3730      7730    Intermittent  1
14/03/2015  11:43:46  mb-21-10  0x775a4dbc  3731      7731    Intermittent  1
14/03/2015  11:43:48  mb-21-10  0x775a4dbc  3732      7732    Intermittent  1
14/03/2015  11:43:51  mb-21-10  0x775a4dbc  3733      7733    Intermittent  1
14/03/2015  11:43:53  mb-21-10  0x775a4dbc  3734      7734    Intermittent  1
14/03/2015  11:43:55  mb-21-10  0x775a4dbc  3735      7735    Intermittent  1
14/03/2015  11:43:58  mb-21-10  0x775a4dbc  3736      7736    Intermittent  1
14/03/2015  11:44:41  mb-21-10  0x775a4dbc  3748      7748    Intermittent  1
14/03/2015  11:44:43  mb-21-10  0x775a4dbc  3749      7749    Intermittent  1
14/03/2015  11:44:46  mb-21-10  0x775a4dbc  374A      774A    Intermittent  1
18/03/2015  06:25:01  mb-24-4   0xb6c069e0  453DA     53DA    Transient     1
18/03/2015  16:00:58  mb-62-6   0xa83cb640  0         2       Transient     1
26/03/2015  12:17:52  mb-26-12  0x405dfc00  6E61      461     Transient     4
26/03/2015  13:32:17  mb-24-11  0xb4f33000  E6006358  58      Transient     9
29/03/2015  09:40:54  mb-35-4   0x3f6eb824  14388     4338    Transient     1

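Permanent error (first row of the log): the most significant bit, bit 15, reads 1 where 0 was expected (expected 0, actual 8000), so the bit appears to be stuck at 1.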


Transient errors


Intermittent errors at the same address and bit. Gaps between detections marked on the slide: 10 sec, 3 sec, 2 sec.
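The gaps between detections can be recomputed from the log timestamps. The sketch below groups a few rows transcribed from the log by host and address and reports the gaps between successive detections at sites hit more than once. The slides do not state the rule used to label a fault as intermittent, so treating "repeated at the same host and address" as the criterion is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# A few rows transcribed from the error log: (timestamp, host, address).
LOG = [
    ("06/03/2015 08:39:51", "mb-38-15", 0xB93E160),
    ("14/03/2015 11:43:12", "mb-21-10", 0x775A4DBC),
    ("14/03/2015 11:43:22", "mb-21-10", 0x775A4DBC),
    ("14/03/2015 11:43:24", "mb-21-10", 0x775A4DBC),
    ("14/03/2015 11:43:27", "mb-21-10", 0x775A4DBC),
    ("18/03/2015 16:00:58", "mb-62-6", 0xA83CB640),
]


def recurring_faults(log):
    """Group detections by (host, address). Return the sites hit more than
    once, with the gaps in seconds between successive detections."""
    by_site = defaultdict(list)
    for stamp, host, address in log:
        when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        by_site[(host, address)].append(when)

    gaps = {}
    for site, times in by_site.items():
        if len(times) > 1:
            times.sort()
            gaps[site] = [
                int((later - earlier).total_seconds())
                for earlier, later in zip(times, times[1:])
            ]
    return gaps


gaps = recurring_faults(LOG)
print(gaps)
```

A permanent fault would also recur at the same site on every scan, so this rule alone does not separate intermittent from permanent faults.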


Multi-bit errors
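One way to see which log entries are multi-bit is to XOR the Expected and Actual columns and count the set bits. This is a small sketch that assumes both columns are hexadecimal and that the Num Cells column counts flipped bits; neither is stated on the slide. The function name is hypothetical.

```python
def flipped_bits(expected: int, actual: int) -> int:
    """Count bit positions where the value read differs from the expected one."""
    return bin(expected ^ actual).count("1")


# (expected, actual) pairs transcribed from the log, read as hexadecimal.
samples = {
    "14/03/2015 11:43:12": (0x3723, 0x7723),
    "26/03/2015 12:17:52": (0x6E61, 0x0461),
    "26/03/2015 13:32:17": (0xE6006358, 0x58),
}

for label, (expected, actual) in samples.items():
    print(label, flipped_bits(expected, actual))
```

For these three rows the counts are 1, 4 and 9, which agrees with the Num Cells column. A SECDED code corrects one flipped bit and detects two per codeword, so flips of 4 or 9 bits that fall within one codeword are beyond what it guarantees.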


These coincide with major solar flares


Reliability Techniques in Software

• Task checkpoint and restart

• Task replication

• Other ongoing activities


Advantages of Tasks for Fault Tolerance

• Task boundaries explicitly delimit the scope of the checkpoints

• The explicit task input/output declarations decrease the checkpoint state

• Compared to pthread-like parallel programs, checkpointing does not require complex coordination or synchronization between threads

• The recovery is asynchronous


Checkpoint and Recovery for Tasks

Recover task execution from detected faults.

Isolate the fault propagation within the task boundaries.

[Diagram: task T1 between "Task start" and "Task end"; labels: Input, Checkpoint, Fault detected, Recover, Recover execution.]

Inputs are known at runtime through explicit declaration.

Overheads of checkpointing are minimal.

Recovery is asynchronous.

Limitation: does not cover execution outside tasks.
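The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the runtime described in the talk: it is sequential rather than asynchronous, the `FaultDetected` exception stands in for whatever fault detector the system provides, and all names are hypothetical.

```python
import copy


class FaultDetected(Exception):
    """Raised by a (hypothetical) detector when a task hits a fault."""


def run_with_checkpoint(task, inputs, max_retries=3):
    """Checkpoint the declared inputs at task start, run the task, and
    re-run it from the checkpoint if a fault is detected inside it."""
    checkpoint = copy.deepcopy(inputs)  # only the declared inputs are saved
    for _ in range(max_retries + 1):
        working = copy.deepcopy(checkpoint)  # the task never sees the checkpoint
        try:
            return task(**working)
        except FaultDetected:
            continue  # discard the partial state; it stays inside the task
    raise RuntimeError("task failed after all retries")


attempts = {"count": 0}


def scale_in_place(vector, factor):
    """Example task that updates its input in place; the first attempt
    is interrupted by an injected fault after one element is updated."""
    attempts["count"] += 1
    for i in range(len(vector)):
        vector[i] *= factor
        if attempts["count"] == 1 and i == 0:
            raise FaultDetected("injected fault mid-task")
    return vector


result = run_with_checkpoint(scale_in_place, {"vector": [1, 2, 3], "factor": 10})
print(result, attempts["count"])
```

Because the retry starts from a fresh copy of the checkpointed input, the half-updated vector from the failed attempt is never reused.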


Results: Checkpoint and Recovery for Tasks

[Figures: panels "Single-Node Scalability" and "Multi-Node Scalability"; bar chart "Checkpointing Overheads", y-axis 0% to 14%, for Sparse LU, Cholesky, FFT, Perlin and Stream. Series: Baseline impl., Singleton.]

Task Replication

Detect and recover from silent data corruption.

[Diagram: task T1 and its replica T1' both start from the checkpointed Input and each produces an Output. Branches: "No fault" and "Fault". On a fault, T1'' runs from the Input and produces a third Output.]

On a fault: re-execute the task one more time and use the two outputs that match as the correct result.

1. The task and its replica execute asynchronously.
2. There is no synchronization between the task and its replica, except at the end of the task execution.
3. Faults do not limit parallelism; re-execution is also asynchronous.
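The replicate, compare and re-execute scheme can be sketched as follows. This is a sequential illustration only; the runtime in the talk runs the task and its replica asynchronously, and the function names and the injected bit flip are hypothetical.

```python
import copy


def run_replicated(task, inputs):
    """Run the task and a replica on private copies of the inputs. If the two
    outputs differ, run a third copy and return the output that two agree on."""
    first = task(**copy.deepcopy(inputs))
    second = task(**copy.deepcopy(inputs))
    if first == second:
        return first  # no fault observed

    third = task(**copy.deepcopy(inputs))
    if third == first or third == second:
        return third  # two matching outputs are taken as correct
    raise RuntimeError("no two outputs match")


calls = {"count": 0}


def dot(a, b):
    """Example task; the first execution suffers an injected silent bit flip."""
    calls["count"] += 1
    value = sum(x * y for x, y in zip(a, b))
    if calls["count"] == 1:
        value ^= 1 << 4  # injected silent data corruption
    return value


result = run_replicated(dot, {"a": [1, 2, 3], "b": [4, 5, 6]})
print(result, calls["count"])
```

In the fault-free case the cost is one extra execution; the third execution happens only when the first two outputs disagree.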


Results: Task Replication

[Figures: panels "Single-Node Scalability" and "Multi-Node Scalability"; bar chart "Overheads of Replication", y-axis 0% to 9%.]

Other activities

• Software-based ECC
  • To compensate for missing or weak ECC in hardware

• Selective replication
  • To reduce resource utilization by replicating only the reliability-critical code

• Checkpoint and restart for tasks with MPI calls
  • To provide multi-node checkpoint and restart within a task-based programming model

• Hierarchical checkpoint and restart with task checkpointing
  • To decrease the checkpoint overheads and recovery time
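To make the software-based ECC item concrete, here is a minimal Hamming(7,4) encoder and decoder. The slides do not say which code the software ECC uses; Hamming(7,4) is chosen here only because it is short, and the function names are hypothetical. It corrects one flipped bit per 7-bit codeword and is weaker than SECDED or chipkill.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword.
    Bit (pos - 1) of the result holds codeword position pos (1..7)."""
    bits = [0] * 8  # index 0 unused; positions 1..7
    # Data bits go to the positions that are not powers of two.
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each parity bit covers the positions whose index has that bit set.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(codeword: int) -> int:
    """Correct at most one flipped bit and return the 4 data bits."""
    bits = [0] + [(codeword >> (pos - 1)) & 1 for pos in range(1, 8)]
    # The XOR of the positions of all set bits is 0 for a valid codeword
    # and equals the position of the flipped bit after a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1
    data = [bits[3], bits[5], bits[6], bits[7]]
    return sum(bit << i for i, bit in enumerate(data))


stored = hamming74_encode(0b1011)
corrupted = stored ^ (1 << 5)  # flip one bit of the stored codeword
print(hamming74_decode(corrupted) == 0b1011)
```

Hamming(7,4) spends 3 check bits on every 4 data bits, so a practical scheme would use a wider code; hardware SECDED, for example, typically protects 64 data bits with 8 check bits.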


Summary

• Preliminary memory reliability characterization

• Low-end commodity DRAM devices might be more susceptible to transient faults

• Even strong memory ECC alone may not be sufficient to mitigate transient faults in exascale computing

• SW-based fault tolerance coupled to a specific programming model might be a lightweight way to complement HW-based ECC


Thank you!

Questions?
