Riza Suminto, Agung Laksono *, Anang Satria *, Thanh Do , Haryadi Gunawi *
Towards Pre-Deployment Detection of Performance Failures in Cloud Systems
Riza Suminto, Agung Laksono*, Anang Satria*, Thanh Do†, Haryadi Gunawi

Cloud Systems
SPV @ HotCloud ’15

Speed Matters!
Demands
- Users demand high dependability, reliability, and performance stability
- Amazon found that every 100ms of latency cost them 1% in sales
- Google found an extra 0.5 second in search page generation time dropped traffic by 20%

Performance failures happen
22% (from: What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, SoCC ’14)

Outline
- Performance Bug
- System Performance Verifier

Performance Bug
- Jobs take several times longer than usual to finish
- Improper speculative execution: JCH1 & TPL1 & FPL2 & FTY1
- Unnecessary repeated recovery: TPL1 & TPL4 & FTY4 & TOP1

Untriggered SpecExec
- DLCA: Map read locally
- TPLA: Mappers and reducers in different nodes
- JCHA: All-to-All
- FPLA: Fault at map node
- FTYA: Slow NIC
DLCA & TPLA & JCHA & FPLA & FTYA
[Figure: mappers M1, M2, M3 and reducers; the faulty map node is slow, so all reducers are slow]
No straggler = No SpecExec

Untriggered SpecExec, cont
DLCA & TPLA & JCHA & FPLA & FTYA
DLCB = read remote: Straggler!
[Figure: mappers M1, M2, M3 reading from a remote datanode (DN), shown next to the original mappers-and-reducers scenario]

Untriggered SpecExec, cont
DLCA & TPLA & JCHA & FPLA & FTYA
FPLB = slow reducer: Straggler!
[Figure: mappers M1, M2, M3 and reducers with the fault placed at a reducer, shown next to the original scenario]

O(n) Recovery
- TPLA: Mappers and reducers in different nodes
- TPLB: Mappers and reducers in different racks
- TOPA: Large number of nodes per rack
- FTYB: Slow inter-rack switch
TPLA & TPLB & TOPA & FTYB
[Figure: mappers (M) and a reducer (R) spread across Rack 1 and Rack 2; the inter-rack switch is slow]

Conditions that lead to performance bugs
Untriggered Speculative Execution
- MR-70001 = JCH1 & TPL1 & FPL2 & FTY1
- MR-70002 = DSR1 & DLC1 & FPL1 & FTY1
- MR-5533 = FTY2 & FPL3 & TPL3
- …
O(n) Recovery
- MR-5251 = FTY3 & FPL3 & FTM1
- MR-5060 = TPL1 & TPL3 & FTY1 & FPL2
- MR-1800 = TPL1 & TPL4 & FTY4 & TOP1
- …
Long lock contention
- MR-9191 = FTY3 & FPL3 & FTM1
- MR-9292 = TPL1 & TPL3 & FTY1 & FPL2
- MR-9393 = TPL1 & TPL4 & FTY4 & TOP1
- …

| Scenario Type | Possible Condition |
| --- | --- |
| DLC: Data Locality | (1) Read from remote disk, (2) read from local disk, ... |
| DSR: Data Source | (1) Some tasks read from same datanode, (2) all tasks read from different datanodes, … |
| JCH: Job Characteristic | Map-reduce is (1) many-to-all, (2) all-to-many, (3) large fan-in, (4) large fan-out, ... |
| JSZ: Job Size | (1) 1 GB jar file, (2) 1 MB jar file, ... |
| LSZ: Load Size | (1) Thousands of tasks, (2) small number of tasks, … |
| FTY: Fault Type | (1) Slow node/NIC, (2) node disconnect/packet drop, (3) disk error/out of space, (4) rack switch, … |
| FPL: Fault Placement | Slowdown fault injection at the (1) source datanode, (2) mapper, (3) reducer, … |
| FGR: Fault Granularity | (1) Single disk/NIC, (2) single node (dead node), (3) entire rack (network switch), … |
| FTM: Fault Timing | (1) During shuffling, (2) during 95% of task completion, … |
| TOP: Topology | (1) 30 nodes per rack, (2) 3 nodes per rack, … |
| TPL: Task Placement | (1) Mappers and reducers are in different nodes, (2) AM and reducers in different nodes, (3) mappers are in the same node, (4) most reducers placed in the same rack, ... |
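
The bug signatures above read as conjunctions of condition codes, so a scenario triggers a bug only when every code in the signature holds. A minimal Java sketch of that reading follows; the class name `BugSignature` and the representation of a scenario as a set of code strings are assumptions made for illustration, not part of SPV.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: treats a signature such as "TPL1 & TPL4 & FTY4 & TOP1" as a set of codes. */
final class BugSignature {
    /** Splits a signature on '&' and returns the trimmed condition codes. */
    static Set<String> parse(String signature) {
        Set<String> conditions = new TreeSet<>();
        for (String part : signature.split("&")) {
            String code = part.trim();
            if (!code.isEmpty()) {
                conditions.add(code);
            }
        }
        return conditions;
    }

    /** A scenario matches only if it satisfies every condition in the signature. */
    static boolean matches(Set<String> scenario, String signature) {
        return scenario.containsAll(parse(signature));
    }

    public static void main(String[] args) {
        // Conditions that hold in one deployment scenario (illustrative only).
        Set<String> scenario = parse("TPL1 & TPL4 & FTY4 & TOP1 & LSZ1");
        System.out.println(matches(scenario, "TPL1 & TPL4 & FTY4 & TOP1"));
        System.out.println(matches(scenario, "JCH1 & TPL1 & FPL2 & FTY1"));
    }
}
```

A set is used rather than a map from scenario type to condition number because one signature can name the same type twice (for example TPL1 and TPL4).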

Outline
- Performance Bug
- System Performance Verifier

Current Approach
Benchmarking
- Hundreds of benchmarks for every scenario
- Injecting slowdowns and failures
- Takes days to weeks!

What we want…
Four goals in performance verification
- Fast
- Covers many deployment scenarios
- Runs in pre-deployment
- Directly checks implementation code
Formal modeling tools!

System Performance Verifier (SPV)
Target system (e.g., Hadoop code):

    @Data
    public class JobInProgress {
      JobID jobId;
      TaskInProgress maps[];
      ...
    }

    @IO
    public HeartbeatResponse heartbeat(HeartbeatData hd) {
      ...
    }

Target system, then SPV Compiler, then auto-generated model (in Colored Petri Net), then performance verification.
The auto-generated model is 20X larger than the hand model.

Colored Petri Nets (CPN)
[Figure: a CPN with input places "Tasks" and "Node", a transition "Schedule Task", and an output place "Task to Run"; arc labels are task, node, and assignment]
- Tasks holds the token (“T1”,map)
- Node holds the token A @0
- Schedule Task fires with delay @+10, so Task to Run receives (A,“T1”,map) @10
Code segment of Schedule Task:

    input (node, task);
    output (assignment);
    action
      let val (id, type) = task
      in (node, id, type)
      end;
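
To make the token flow above concrete, here is a small Java simulation of the single "Schedule Task" transition. It is only a sketch under stated assumptions: the class `ScheduleTaskNet` is hypothetical, the task token is assumed to be available at time 0, and the firing time is simplified to the latest input timestamp rather than a full timed-CPN global clock.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of the "Schedule Task" transition from the slide. */
final class ScheduleTaskNet {
    /** A timed token: some fields plus the model time at which the token becomes available. */
    static final class Token {
        final String[] fields;
        final long time;

        Token(long time, String... fields) {
            this.time = time;
            this.fields = fields;
        }

        @Override
        public String toString() {
            return "(" + String.join(",", fields) + ") @" + time;
        }
    }

    final Deque<Token> nodePlace = new ArrayDeque<>();      // place "Node"
    final Deque<Token> tasksPlace = new ArrayDeque<>();     // place "Tasks"
    final Deque<Token> taskToRunPlace = new ArrayDeque<>(); // place "Task to Run"

    /** Fires the transition once if both input places hold a token; returns whether it fired. */
    boolean fireScheduleTask() {
        if (nodePlace.isEmpty() || tasksPlace.isEmpty()) {
            return false;
        }
        Token node = nodePlace.poll();
        Token task = tasksPlace.poll();
        // Simplification: fire as soon as both input tokens are available.
        long firedAt = Math.max(node.time, task.time);
        // The action builds (node, id, type); the "@+10" delay is added to the firing time.
        taskToRunPlace.add(new Token(firedAt + 10, node.fields[0], task.fields[0], task.fields[1]));
        return true;
    }

    /** Initial marking from the slide: node A at time 0 and task ("T1", map). */
    static ScheduleTaskNet example() {
        ScheduleTaskNet net = new ScheduleTaskNet();
        net.nodePlace.add(new Token(0, "A"));
        net.tasksPlace.add(new Token(0, "T1", "map")); // timestamp 0 is an assumption
        return net;
    }

    public static void main(String[] args) {
        ScheduleTaskNet net = example();
        net.fireScheduleTask();
        System.out.println(net.taskToRunPlace.peek()); // prints "(A,T1,map) @10"
    }
}
```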

Challenges: Two Different Worlds
CPN vs. Java

Our Approach
- Java to SysJava: data flattening, code modularization, annotation tagging
- SysJava to Model: compiler

Data Flattening
- Java system states = ArrayList, Map, Tree, …
- CPN states = multisets

    List<JobInProgress> runningJobs;

    public class JobInProgress {
      JobID jobId;
      TaskInProgress maps[];
      ...
    }

    class TaskInProgress {
      TaskID id;
      double progress;
      ...
    }

Flattened CPN places:
- Job In Progress: [(1)]
- Job Task Mapping: [(1,a),(1,b)]
- Task In Progress: [(a,10%),(b,15%)]
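
The flattening step can be illustrated with a short Java sketch that turns the nested job and task objects into three flat tuple collections, mirroring the three places on the slide. The class `FlattenDemo`, the use of `int` and `String` identifiers in place of `JobID` and `TaskID`, and the string rendering of tuples are all assumptions for illustration; the actual SPV transformation is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: flattens nested Java state into three flat tuple collections. */
final class FlattenDemo {
    static final class TaskInProgress {
        final String id;
        final double progress; // fraction in [0, 1]

        TaskInProgress(String id, double progress) {
            this.id = id;
            this.progress = progress;
        }
    }

    static final class JobInProgress {
        final int jobId;
        final List<TaskInProgress> maps;

        JobInProgress(int jobId, TaskInProgress... maps) {
            this.jobId = jobId;
            this.maps = Arrays.asList(maps);
        }
    }

    /** One list per CPN place; each element is one tuple rendered as text. */
    static final class FlatState {
        final List<String> jobs = new ArrayList<>();
        final List<String> jobTaskMapping = new ArrayList<>();
        final List<String> tasks = new ArrayList<>();
    }

    /** Replaces object nesting with an explicit (job, task) mapping relation. */
    static FlatState flatten(List<JobInProgress> runningJobs) {
        FlatState state = new FlatState();
        for (JobInProgress job : runningJobs) {
            state.jobs.add("(" + job.jobId + ")");
            for (TaskInProgress tip : job.maps) {
                state.jobTaskMapping.add("(" + job.jobId + "," + tip.id + ")");
                state.tasks.add("(" + tip.id + "," + Math.round(tip.progress * 100) + "%)");
            }
        }
        return state;
    }

    /** The state from the slide: job 1 with tasks a (10%) and b (15%). */
    static List<JobInProgress> example() {
        List<JobInProgress> jobs = new ArrayList<>();
        jobs.add(new JobInProgress(1, new TaskInProgress("a", 0.10), new TaskInProgress("b", 0.15)));
        return jobs;
    }

    public static void main(String[] args) {
        FlatState state = flatten(example());
        System.out.println("Job In Progress:  " + state.jobs);
        System.out.println("Job Task Mapping: " + state.jobTaskMapping);
        System.out.println("Task In Progress: " + state.tasks);
    }
}
```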

Code Modularization
Original function:

    private boolean processHeartbeat(TaskTrackerStatus trackerStats) {
      synchronized (taskTrackers) { ... }
      for (TaskStatus ts : trackerStats) {
        tasks.get(ts.id).updateStatus(ts);
      }
      ...
    }

Modular functions, separating control flow logic from CRUD logic:

    @ProcessState
    private void initCheck() {
      synchronized (taskTrackers) { ... }
    }

    @ForEach
    private void updateStatuses(TaskTrackerStatus trackerStats) {
      for (TaskStatus ts : trackerStats) { ... }
    }

    @GetState
    private TaskInProgress getTask(TaskID id) {
      return tasks.get(id);
    }

    @UpdateState
    private void tipUpdate(TaskInProgress tip, TaskStatus ts) {
      tip.updateStatus(ts);
    }

Annotation Tagging
- Annotations assist the compiler
- Annotation categories: Data Structure, I/O, CRUD & Process, Miscellaneous

    @Data
    public class JobInProgress {
      JobID jobId;
      TaskInProgress maps[];
      ...
    }

    @IO
    public HeartbeatResponse heartbeat(HeartbeatData hd) {
      ...
    }
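
As a rough illustration of how such tags could be consumed, the sketch below declares stand-in annotation types and lists the tagged methods of a toy class. Everything here is an assumption: the slides do not show how the annotations are declared, the class `AnnotationScan` and its toy `JobTracker` are hypothetical, and runtime reflection is swapped in for whatever static analysis the SPV compiler performs (the results slide mentions WALA).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: stand-in annotation types and a reflection pass that lists tagged methods. */
final class AnnotationScan {
    // Stand-in declarations; the real SPV annotation definitions are not shown in the slides.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE) @interface Data {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface IO {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface GetState {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface UpdateState {}

    @Data
    static final class TaskInProgress {
        String id;
        double progress;
    }

    /** Toy stand-in for a tagged system class. */
    static final class JobTracker {
        final Map<String, TaskInProgress> tasks = new TreeMap<>();

        @IO
        public String heartbeat(String trackerName) {
            return "ack:" + trackerName;
        }

        @GetState
        TaskInProgress getTask(String id) {
            return tasks.get(id);
        }

        @UpdateState
        void tipUpdate(TaskInProgress tip, double progress) {
            tip.progress = progress;
        }

        void untaggedHelper() {
            // No annotation: the scan below skips this method.
        }
    }

    /** Returns method name -> tag for every tagged method declared on the given class. */
    static Map<String, String> scan(Class<?> type) {
        Map<String, String> tags = new TreeMap<>();
        for (Method method : type.getDeclaredMethods()) {
            if (method.isAnnotationPresent(IO.class)) {
                tags.put(method.getName(), "IO");
            } else if (method.isAnnotationPresent(GetState.class)) {
                tags.put(method.getName(), "GetState");
            } else if (method.isAnnotationPresent(UpdateState.class)) {
                tags.put(method.getName(), "UpdateState");
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(scan(JobTracker.class));
        System.out.println(TaskInProgress.class.isAnnotationPresent(Data.class));
    }
}
```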

SPV Compiler
- Executable XML
- Define configurations, assertions, and specifications
- Explore every non-deterministic choice, e.g., task-to-node mapping
Model Checking
[Figure: from a state with task (“T1”,map) in Tasks and nodes A and B in Node, Schedule Task branches into two successor states. In "T1 on A", Task to Run holds (A,“T1”,map) and B remains in Node; in "T1 on B", Task to Run holds (B,“T1”,map) and A remains in Node.]
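
The branching on task-to-node mapping can be sketched as an exhaustive depth-first enumeration with an assertion checked on each outcome. This is only an illustration of the idea, not the SPV model checker: the class `MappingExplorer`, the task costs (10 on a healthy node, 100 on the slow node), and the time limit of 50 are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: explores every task-to-node mapping and checks a toy assertion. */
final class MappingExplorer {
    /** Depth-first enumeration of every way to place the tasks on the nodes. */
    static List<Map<String, String>> explore(List<String> tasks, List<String> nodes) {
        List<Map<String, String>> out = new ArrayList<>();
        branch(tasks, nodes, 0, new LinkedHashMap<>(), out);
        return out;
    }

    private static void branch(List<String> tasks, List<String> nodes, int next,
                               Map<String, String> partial, List<Map<String, String>> out) {
        if (next == tasks.size()) {
            out.add(new LinkedHashMap<>(partial)); // snapshot of one complete mapping
            return;
        }
        for (String node : nodes) { // one branch per non-deterministic choice
            partial.put(tasks.get(next), node);
            branch(tasks, nodes, next + 1, partial, out);
            partial.remove(tasks.get(next));
        }
    }

    /** Toy cost model (assumed): job time is the slowest task; 100 on the slow node, 10 elsewhere. */
    static long jobTime(Map<String, String> mapping, String slowNode) {
        long worst = 0;
        for (String node : mapping.values()) {
            worst = Math.max(worst, node.equals(slowNode) ? 100 : 10);
        }
        return worst;
    }

    /** Counts the mappings whose job time exceeds the limit, i.e. assertion violations. */
    static int countViolations(List<Map<String, String>> mappings, String slowNode, long limit) {
        int violations = 0;
        for (Map<String, String> mapping : mappings) {
            if (jobTime(mapping, slowNode) > limit) {
                violations++;
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<Map<String, String>> all = explore(List.of("T1", "T2"), List.of("A", "B"));
        for (Map<String, String> mapping : all) {
            System.out.println(mapping + " -> job time " + jobTime(mapping, "B"));
        }
        System.out.println("violations: " + countViolations(all, "B", 50));
    }
}
```

With two tasks and two nodes there are four mappings; under the assumed costs, the three that place at least one task on the slow node exceed the limit.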

Preliminary Result
- 5305 lines of code on top of WALA & Access/CPN
- Hadoop MapReduce 1.2.1, with 1067 lines of code changed
- 20x larger than hand-made model
- 34 scenarios, 30 assertion violations, 4 performance bugs
- 1.5 hours of model checking

| Configuration | Value |
| --- | --- |
| Worker Node | Node A, B |
| Data Node | Node A, B, C |
| Tasks | 2 Tasks |
| Fault Type | Slow Data Node |
| Fault Placement | Node C |

Thank you! Questions?
http://ucare.cs.uchicago.edu

Discussion
- Is it time for pre-deployment detection of performance bugs?
- Bridging system code and formal methods
- Future of data-centric languages
- Beyond Hadoop
- Root cause anatomy of performance bugs
- Beyond performance bugs