Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App...
![Page 1: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/1.jpg)
Mining Supercomputer Jobs' I/O Behavior from System Logs
Xiaosong Ma
![Page 2: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/2.jpg)
2
OLCF Architecture Overview
[Diagram: the Rhea development cluster (512 nodes) and the Eos Cray XC30 cluster (736 nodes) connect through the Scalable I/O Network (SION, InfiniBand) to the Atlas1 and Atlas2 file systems; each side of the diagram is labeled 144 OSS servers and 1008 OSTs (LUNs).]
![Page 3: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/3.jpg)
3
OLCF Architecture Overview
[Diagram: same architecture (Rhea, Eos, SION over InfiniBand, Atlas1 and Atlas2 with 144 OSS servers and 1008 OSTs each), now annotated with a monitoring tool that collects per-OST I/O throughput into a MySQL database, yielding server-side I/O throughput logs.]
![Page 4: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/4.jpg)
4
Server-side I/O Throughput Logs
RAID controller
Coarse-granule logging
I/O throughput logs
• Zero overhead
• No user effort
• No impact on user I/O
• Mixed I/O traffic
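The "mixed I/O traffic" point can be illustrated with a minimal sketch: server-side logs see only per-OST totals, so summing them per timestamp gives a system-wide series in which all concurrently running jobs are blended together. The row layout `(timestamp_s, ost_id, write_mb_per_s)` and the values below are hypothetical; the slides do not show the real monitoring schema.

```python
from collections import defaultdict

# Hypothetical log rows: (timestamp_s, ost_id, write_mb_per_s).
# The actual schema of the monitoring database is not given in the slides.
rows = [
    (0, 0, 10.0), (0, 1, 0.0), (0, 2, 5.0),
    (2, 0, 120.0), (2, 1, 80.0), (2, 2, 0.0),
    (4, 0, 0.0), (4, 1, 0.0), (4, 2, 2.5),
]


def aggregate_throughput(rows):
    """Sum per-OST throughput at each timestamp.

    The result mixes traffic from every job active at that time,
    because server-side logs carry no job or application identity.
    """
    totals = defaultdict(float)
    for timestamp, _ost_id, mb_per_s in rows:
        totals[timestamp] += mb_per_s
    return dict(sorted(totals.items()))


system_series = aggregate_throughput(rows)
print(system_series)
```

Attributing any part of such a series to one application needs extra information, such as the job scheduler's start and end times.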
![Page 8: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/8.jpg)
8
Prior Work: IOSI Workflow
IOSI input:
• Target app (user ID + app ID)
• Throughput logs
• Job scheduler logs

| Start_time | End_time |
|---|---|
| 2011-10-16 00:00 | 2011-10-16 02:01 |
| 2011-10-17 01:00 | 2011-10-17 04:00 |
| 2011-10-18 05:10 | 2011-10-18 07:20 |

IOSI steps: sample set -> data preprocessing -> per-sample wavelet transform -> cross-sample I/O burst identification
IOSI output
[Plots: write throughput (GB/s) over time (s) for the input samples and for the IOSI output]
IOSI paper: “Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces”, FAST '14
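The first step of the workflow, building the sample set, can be sketched as follows: for each run of the target app in the job scheduler log, cut the matching window out of the server-side throughput log. This is only an illustration under stated assumptions, not the IOSI implementation. The start and end times are the three rows shown on the slide; the throughput log (one constant-valued point every 30 minutes) and the function name `extract_samples` are hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Job scheduler log rows for one target app, as shown on the slide.
job_runs = [
    ("2011-10-16 00:00", "2011-10-16 02:01"),
    ("2011-10-17 01:00", "2011-10-17 04:00"),
    ("2011-10-18 05:10", "2011-10-18 07:20"),
]

# Hypothetical throughput log: one (time, GB/s) point every 30 minutes.
base = datetime(2011, 10, 16, 0, 0)
throughput_log = [(base + timedelta(minutes=30 * i), 1.0) for i in range(113)]


def extract_samples(throughput_log, job_runs):
    """Return one sample per job run as a list of (offset_s, GB/s) points."""
    samples = []
    for start_text, end_text in job_runs:
        start = datetime.strptime(start_text, FMT)
        end = datetime.strptime(end_text, FMT)
        sample = [
            ((t - start).total_seconds(), gbps)
            for t, gbps in throughput_log
            if start <= t <= end
        ]
        samples.append(sample)
    return samples


samples = extract_samples(throughput_log, job_runs)
print([len(s) for s in samples])
```

Each sample still contains traffic from other jobs that ran at the same time, which is why the later steps compare several samples of the same app.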
![Page 9: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/9.jpg)
9
Prior Work: IOSI Workflow
Strong assumption: identical runs of the app.
![Page 10: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/10.jpg)
10
AID: Automatic I/O Diverter
Automatically identifying I/O-heavy apps (no prior knowledge, no user involvement)
[Diagram: timeline of jobs grouped by application (App1 to App6, each with several jobs), a Start_time/End_time table, and a scheduling suggestion as output]

| Start_time | End_time |
|---|---|
| 2015-10-16 00:00 | 2015-10-16 02:01 |
| 2015-10-17 01:00 | 2015-10-17 04:00 |
| 2015-10-18 05:10 | 2015-10-18 07:20 |
![Page 11: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/11.jpg)
11
AID: Automatic I/O Diverter
SC|16 Tech paper presentation: Thursday 2pm, 355D
![Page 12: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/12.jpg)
Application I/O Characterization Results
12
| Name | Value |
|---|---|
| Total number of logged jobs | 181,969 |
| Unique applications identified | 9,998 |
| Initial I/O-intensive candidates | 95 |
| Candidates passing scope checking | 67 |
| Candidates passing minimum support | 42 |
| User-verified candidates | 8 |

Results from 5 months of Titan I/O traffic and job logs (user verification by email)
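As a quick reading aid for the counts above, the short sketch below computes what percentage of each stage survives into the next. The counts are taken from the table; the name `retention` and the rounding to two decimals are illustrative choices, not part of the original analysis.

```python
# Stage counts copied from the results table.
funnel = [
    ("Unique applications identified", 9998),
    ("Initial I/O-intensive candidates", 95),
    ("Candidates passing scope checking", 67),
    ("Candidates passing minimum support", 42),
    ("User-verified candidates", 8),
]


def retention(funnel):
    """Percentage of the previous stage that remains at each stage."""
    result = []
    for (_prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
        result.append((name, round(100.0 * count / prev_count, 2)))
    return result


rates = retention(funnel)
for name, percent in rates:
    print(f"{name}: {percent}% of previous stage")
```

Under this reading, fewer than 1% of the unique applications are flagged as initial I/O-intensive candidates.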
![Page 13: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/13.jpg)
Application I/O Characterization Results
13
User-verified I/O-intensive applications

| ID | Nodes | Time (min) | OSTs | App. domain |
|---|---|---|---|---|
| 1 | 8192 | 1440 | 64 | Geo-sciences |
| 2 | 250 | 6-60 | 1008 | Combustion |
| 3 | 2048 | 30-185 | 1008 | Astrophysics |
| 4 | 1760 | 720 | 180 | Combustion |
| 5 | 1024 | 110-230 | 1008 | Systems research |
| 6 | 200 | 30-190 | 1008 | Combustion |
| 7 | 1008 | 13-17 | 1008 | Computer Science |
| 8 | 16388 | 43-310 | 800 | Environmental |
![Page 16: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/16.jpg)
Application I/O Characterization Results
16
Applications not using parallel I/O systems well!
• Similar finding to Huong 2015 HPDC work (Darshan)
• Motivates better I/O performance data analysis
• Connecting programs to systems
![Page 17: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/17.jpg)
Questions?
Xiaosong Ma, [email protected]
Qatar Computing Research Institute, Hamad Bin Khalifa University
17
![Page 18: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/18.jpg)
I/O Contention on Large-Scale HPC Systems
18
ORNL’s Titan (World’s #3 Supercomputer)
• 27.1 PF peak performance
• 18,688 compute nodes
  • 16-core AMD Opteron
  • Nvidia Tesla GPU
  • 32 + 6 GB memory
• 3-D torus interconnect
Performance variance on HPC
• Shared parallel file system
• I/O-heavy jobs collision -> I/O performance degradation
[Figure: I/O performance variance on Titan with IOR [6]]
![Page 19: Mining Supercomputer Jobs' I/O Behavior from System Logs · Prior Work: IOSI Workflow Target App (User ID + App ID) Job scheduler Throughput logs logs Start_time End_time 2011-10-16](https://reader033.fdocuments.net/reader033/viewer/2022042306/5ed25d5f3bc515330636a2cd/html5/thumbnails/19.jpg)
CDF of per-OST I/O throughput
19
• 88.4% of time: < 1% of capacity (5 MB/s)
• 98.5% of time: < 5% of capacity (25 MB/s)
• 99.6% of time: < 20% of capacity (100 MB/s)
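Figures like these can be read off an empirical CDF. The sketch below shows one way to compute the fraction of samples under a threshold. The per-OST capacity of 500 MB/s is implied by the slide (1% of capacity = 5 MB/s); the sample values and the helper name `fraction_below` are hypothetical and do not reproduce the Titan data.

```python
from bisect import bisect_left

# Implied by the slide: 1% of capacity corresponds to 5 MB/s.
OST_CAPACITY_MB_S = 500.0

# Hypothetical per-OST throughput samples in MB/s (not the Titan data).
samples_mb_s = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 4.5, 10.0, 60.0, 300.0]


def fraction_below(samples, threshold):
    """Empirical CDF: share of samples strictly below the threshold."""
    ordered = sorted(samples)
    return bisect_left(ordered, threshold) / len(ordered)


for percent in (1, 5, 20):
    threshold = OST_CAPACITY_MB_S * percent / 100.0
    share = fraction_below(samples_mb_s, threshold)
    print(f"< {percent}% of capacity ({threshold:g} MB/s): {share:.1%} of time")
```

With evenly spaced log samples, the share of samples below a threshold equals the share of time spent below it.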