Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to...
Transcript of Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to...
![Page 1: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/1.jpg)
Accurate Prediction of Soft Error Vulnerability of
Scientific Applications
Greg BronevetskyPost-doctoral Fellow
Lawrence Livermore National Lab
![Page 2: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/2.jpg)
Soft error: one-time corruption of system state
• Examples: Memory bit-flips, erroneous computations
• Caused by – Chip variability– Charged particles passing through transistors
• Decay of packaging materials (Lead208, Boron10)• Fission due to cosmic neutrons
– Temperature, power fluctuations
![Page 3: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/3.jpg)
Soft errors are a critical reliability challenge for supercomputers
• Real Machines:– ASCI Q: 26 radiation-induced errors/week– Similar-size Cray XD1: 109 errors/week
(estimated)– BlueGene/L: 3-4 L1 cache bit flips/day
• Problem grows worse with time– Larger machines ⇒ larger error probability– SRAMs growing exponentially more
vulnerable per chip
![Page 4: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/4.jpg)
We must understand the impact of soft errors on applications
• Soft errors corrupt application state• May lead to crashes or• Need to detect/tolerate soft errors
– State of the art: checkers/correctors for individual algorithms
– No general solution• Must first understand how errors affect
applications– Identify problem– Focus efforts
corrupt output
![Page 5: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/5.jpg)
Prior work says very little about most applications
• Prior fault analysis work focuses on injecting errors into individual applications– [Lu and Reed, SC04]: Linux + MPICH +
Cactus, NAMD, CAM– [Messer et al, ICSDN00]: Linux + Apache and
Linux + Java (Jess, DB, Javac, Jack)– [Some et al, AC02]: Lynx + Mars texture
segmentation application…
• Where’s my application?
![Page 6: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/6.jpg)
Extending vulnerability characterization to more applications• Goal: general purpose vulnerability
characterization– Same accuracy as per-application fault
injection– Much cheaper
• Initial steps– Fault injection iterative linear algebra methods– Library-based fault vulnerability analysis
…
![Page 7: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/7.jpg)
Step 1: Analyzing fault vulnerability of iterative methods
• Target domain: solvers for sparse linear problem Ax=b
• Goal:understand error vulnerability of class of algorithms– Raw error rates– Effectiveness of potential solutions
• Error model: memory bit-flips
![Page 8: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/8.jpg)
Possible run outcomes
• Success: <10% error
• Silent Data Corruption (SDC): ≥10% error
• Hang: method doesn’t reach target tolerance
• Abort: SegFault or failed SparseLib check
![Page 9: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/9.jpg)
Errors cause SDCs, Hangs, Aborts in ~8-10%, each
![Page 10: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/10.jpg)
Large scale applications vulnerable to silent data corruptions
• Scaled to 1-day, 1,000-processor run of application that only calls iterative method
10FIT/MB DRAM (1,000-5,000 Raw FIT/MB, 90%-98% effective error correction)
![Page 11: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/11.jpg)
Larger scale applications even more vulnerable to silent data corruptions
• Scaled to 10-day, 100,000-processor run of application that only calls iterative method
10FIT/MB DRAM (1,000-5,000 Raw FIT/MB, 90%-98% effective error correction)
![Page 12: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/12.jpg)
Error Detectors
Base
![Page 13: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/13.jpg)
Convergence detectors reduce SDC at <20% overhead
Base
![Page 14: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/14.jpg)
Convergence detectors reduce SDC at <20% overhead
Base
![Page 15: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/15.jpg)
Native detectors have little effect at little cost
Base
![Page 16: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/16.jpg)
Encoding-based detectors significantly reduce SDC at high cost
Base
![Page 17: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/17.jpg)
Encoding-based detectors significantly reduce SDC at high cost
Base
![Page 18: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/18.jpg)
First general analysis of error vulnerability of algorithm class
• Vulnerability analysis for class of common subroutines
• Described raw error vulnerability
• Analyzed various detection/tolerance techniques– No clear winner, rules of thumb
![Page 19: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/19.jpg)
Step 2: Vulnerability analysis of library-based applications
• Many applications mostly composed of calls to library routines
• If error hits some routine, output will be corrupted
• Later routines: corrupted inputs ⇒ corrupted outputs
Inputs Outputs
(Work in progress)
![Page 20: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/20.jpg)
Idea: predict application vulnerability from routine profiles
• Library implementors provide vulnerability profile for each routine:– Error pattern in routine’s output after errors– Function that maps input error patterns to
output error patterns
Inputs Outputs
![Page 21: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/21.jpg)
Idea: predict application vulnerability from routine profiles
• Given application’s dependence graph– Simulate effect of error in each routine– Average over all error locations to produce
error pattern at outputs
Inputs Outputs
![Page 22: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/22.jpg)
Examined applications that use BLAS and LAPACK
• 12 routines ≥O(n2), double precision real numbers– Matrix-vector multiplication – DGEMV– Matrix-matrix multiplication – DGEMM– Rank-1 update – DGER– Linear least squares – DGESV, DGELS– SVD factorization – DGESVD, DGGSVD,
DGESDD– Eigenvectors: DGEEV, DGGEV, DGEES,
DGGES
![Page 23: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/23.jpg)
Examined applications that use BLAS and LAPACK
• 12 routines ≥O(n2), double precision real numbers
• Executed on randomly-generated nxnmatrixes(n=62, 125, 250, 500)
• BLAS/LAPACK from Intel’s Math Kernel Library on Opteron(MLK10) and Itanium2(MKL8)– Same results on both
• Error model: memory bit-flips
![Page 24: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/24.jpg)
Error patterns: multiplicative error histograms
DGEMM
![Page 25: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/25.jpg)
Output error patterns fall into few major categories
1.E‐08
1.E‐06
1.E‐04
1.E‐02
1.E+00
1.E‐08
1.E‐06
1.E‐04
1.E‐02
1.E+00
DGGESOutput beta - 62x1
DGESVOutput L - 62x62
DGGESOutput vsr - 62x62
DGEMMOutput C - 62x62
![Page 26: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/26.jpg)
Error patterns may vary with matrix size
1.E‐07
1.E‐05
1.E‐03
1.E‐01
1.E‐07
1.E‐05
1.E‐03
1.E‐01
DGGSVDOutput beta
DGGSVDOutput V
62 125 250 500
![Page 27: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/27.jpg)
Input-Output error transition functions
• Input-Output error transition functions: trained predictors– Linear Least Squares– Support Vector Machines
(linear, 2nd degree polynomial, rbf kernels)– Artificial Neural Nets
(3,10,100 layers,; linear, gaussian, gaussiansymmetric and sigmoid transfer functions)
![Page 28: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/28.jpg)
Trained on multiple input error patterns
• DataInj: single bit errors
• DataInj-R: output errors of routines with DataInjinputs
• UniInj: uniform multiplicative errors ∈[-100,100]
• UniInj-R: output errors of routines with UniInjinputs
• Inj-R: output errors of error injected routines
![Page 29: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/29.jpg)
Input-Output error transition functions
• Input-Output error transition functions: trained predictors– Linear Least Squares– Support Vector Machines– Artificial Neural Nets
• Trained on sample input error patterns
uniform multiplicative errors∈[-100,100]UniInj:
outputs of routines with UniInj inputsUniInj-R:
single bit errorsDataInj:
outputs of error injected routinesInj-R:
outputs of routines with DataInj inputsDataInj-R:
![Page 30: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/30.jpg)
Output errors depend on input errors
• Equivalence classes– DataInj, DataInj-R | Inj-R– DataUni, DataUni-R
![Page 31: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/31.jpg)
Evaluated accuracy of all predictors on all training sets
• Error metric: – probability of error ≥δ– δ∈{1e-14, 1e-13, …, 2, 10, 100)
1E‐10
1E‐09
1E‐08
1E‐07
1E‐06
1E‐05
0.0001
0.001
0.01
0.1
1
Recorded
Predicted
![Page 32: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/32.jpg)
Evaluated accuracy of all predictors on all training sets
1E‐10
1E‐09
1E‐08
1E‐07
1E‐06
1E‐05
0.0001
0.001
0.01
0.1
1
Recorded
Predicted
1E‐10
1E‐09
1E‐08
1E‐07
1E‐06
1E‐05
0.0001
0.001
0.01
0.1
1
Recorded
Predicted
0%10%20%30%40%50%60%70%80%90%100%
Recorded Predicted Error
0%10%20%30%40%50%60%70%80%90%100%
Recorded Predicted Error
![Page 33: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/33.jpg)
Linear Least Squares has best accuracy, Neural nets worstEvaluation set: union of all training sets
![Page 34: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/34.jpg)
Linear Least Squares has best accuracy, Neural nets worst
![Page 35: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/35.jpg)
Accuracy varies among predictors
DGEES, output wr
![Page 36: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/36.jpg)
Linear Least Squares has best accuracy, Neural nets worst
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
e‐HeapInj.none.All uni0‐1All inj0‐1All e‐uni0‐1All e‐inj0‐1All
1.E‐14
2.E‐14
4.E‐14
9.E‐14
2.E‐13
3.E‐13
7.E‐13
2.E‐11
7.E‐10
2.E‐08
7.E‐07
2.E‐05
7.E‐04
2.E‐02
8.E‐01
2.E+00
1.E+01
![Page 37: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/37.jpg)
Linear Least Squares has best accuracy, Neural nets worst
Inj-R
DataInj
DataInj-R
DataUni
DataUni-R
![Page 38: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/38.jpg)
Evaluated predictors on randomly-generated applications
• Application has constant number of levels• Constant number of operations per level• Operations use as input data from prior
level(s)
Inputs Outputs
![Page 39: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/39.jpg)
Neural Nets: Poor accuracy for application vulnerability prediction
1E‐10
1E‐09
1E‐08
1E‐07
1E‐06
1E‐05
0.0001
0.001
0.01
0.1
1 Recorded
Predicted
Function=sigmoid, 3 hidden layers
![Page 40: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/40.jpg)
1E‐10
1E‐09
1E‐08
1E‐07
1E‐06
1E‐05
0.0001
0.001
0.01
0.1
1 Recorded
Predicted
Linear Least Squares: Good accuracy, restricted
![Page 41: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/41.jpg)
1E‐10
1E‐09
1E‐08
1E‐07
1E‐06
1E‐05
0.0001
0.001
0.01
0.1
1
Recorded
Predicted
SVMs: Good accuracy, general
Function=rbf, gamma=1.0
![Page 42: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/42.jpg)
Work is still in progress
• Correlating accuracy of input/output predictors to accuracy of application prediction
• More detailed fault injection
• Applications with loops
• Real applications
![Page 43: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/43.jpg)
Step 3: Compiler analyses
• No need to focus on external libraries
• Can use compiler analysis to – Do fault injection/propagation on per-function
basis– Propagate error profiles through more data
structures (matrix, scalar, tree, etc.)
![Page 44: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/44.jpg)
Step 4: Scalable analysis of parallel applications
• Cannot do fault injection on 1,000-process application
• Can modularize fault injection– Inject into individual processes
![Page 45: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/45.jpg)
Step 4: Scalable analysis of parallel applications
• Cannot do fault injection on 1,000-process application
• Can modularize fault injection– Inject into single-process runs– Propagate through small-scale runs
![Page 46: Bronevetsky - Accurate Prediction of Soft Error ......Extending vulnerability characterization to more applications • Goal: general purpose vulnerability characterization – Same](https://reader034.fdocuments.net/reader034/viewer/2022050205/5f58a37861e83164356ef920/html5/thumbnails/46.jpg)
Working toward understanding application vulnerability to errors
• Soft errors becoming increasing problem on HPC systems
• Must understand how applications react to soft errors
• Traditional approaches inefficient for realistic applications
• Developing tools to cheaply understand vulnerability of real scientific applications