The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… ·...
Transcript of The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… ·...
![Page 1: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/1.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
The What, Why and How ofLinux on BlueGene Compute Nodes
George AlmásiJoséBrunherotoJosé CastaňosGábor DózsaSameer Kumar
Derek LieberTodd InglettEdi ShmueliAlbert Sidelnik
![Page 2: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/2.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Linux on compute nodes: talk outline
• Why bother?• Evaluate, fix performance gap• Supporting technology• Plans for future
![Page 3: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/3.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Arguments for Linux on compute nodes
• Time & effort on CNK kernel development– Do we really need another O/S?– We need Linux anyway – runs in I/O node
• Usability perspective– Applications that will not run on CNK:
• Java based• Database systems ala DB2, which require huge infrastructure• Distributed file systems, in-memory-databases etc
– Applications that suffer from CNK limitations:• Dynamic linking, externally steered apps, loosely coupled accelerators
• Memory limitations (no virtual memory on CNK)• Funny file I/O, sockets, heterogeneous applications
![Page 4: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/4.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Why we should not bother with Linux
• IBM comfort level with Linux commitment– IP issues, open source etc.– Potential increase in complexity, performance loss
• No performance compromise on a $50 million, 2MW machine• Customers ambivalent about need for Linux
– Extensive experience with small kernels (PUMA, BG/L)– Most customers want can be hacked up on top of CNK
• Python, dynamically loaded libraries• Multiple threads, scheduling
– Does anyone seriously want Java on 64k processors?– Some customers happy to follow IBM's lead on this– Others intend to replace O/S with Linux regardless
• Trend: CNK becoming more complex
![Page 5: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/5.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Technical arguments against Linux on comp. nodes
• On BG/L the processors in a node are not coherent– Nobody knows how to run linux on these nodes using both CPUs
• On BG/P the network device requires fixed virtual-to-physical mapping– Hard to achieve in Linux, where paging is important– Non-trivial fixes available (S. Kumar)
• [Naively] measured performance on Linux is inferior to blrts on BG/L.– Single-node performance– Scaling: O/S noise
![Page 6: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/6.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Linux: Preliminary measurement of perf. & scaling
• Goal:– Evaluate technical arguments objectively– Find, eliminate sources of performance degradation– Solve technical deployment problems
• Methodology:– Used a BG/D rack for measurements– Wrote a CIO variant for running both Linux & blrts
• BG/L and BG/P– Compiled & ran NAS benchmarks for blrts & Linux– Ran microbenchmarks
![Page 7: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/7.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
NAS Class C MG scaling: good
32 64 128 256 1024 Row 180
20
40
60
80
100
120
140
160
180
200
220
# processors
Mop
/s/c
pu
linuxblrts
![Page 8: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/8.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
NAS Class C BT scaling: bad
25 81 169 289 441 625 841 9610
25
50
75
100
125
150
175
200
225
250
275
300
325
350
# processors
Mop
/s/c
pu
linuxblrts
![Page 9: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/9.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
32 64 128 256 512 10240
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
5.5
6
# processors
Mop
/s/c
puNAS Class C IS scaling: ugly
linuxblrts
![Page 10: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/10.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
What is wrong with NAS IS?• Memory access pattern:
– Covers > 128 Mbytes on 32 nodes– Randomly jumps across memory– On Linux, TLB coverage is 4k*62 = 252Kbytes
• Many, many TLB misses• PPC 440 handles TLB misses in software
– On blrts (BG/L), TLB coverage is 100%• No TLB misses
• How to fix it: hugetlb support for Linux– Each TLB entry covers 16MB– 10 of 64 TLB entries cover enough for IS -> no misses
![Page 11: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/11.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Hugetlb support in Linux: a short history
• Early hugetlb:– Allow mmap() of static hugetlb pages•protection from being reused by kernel
•no paging, no COW–Implementations:
• Intel, in early linux 2.6–R. Seth [2002]
• 64 bit Power–D. Gibson [2003]
• No powerpc32• Problem: not
transparently usable
• Hugetlb grows up:–Linux 2.6.16:
•on-demand paging, COW• libhugetlb:
– Link application to map heap onto hugetlb
–D. Gibson, 2006• Can transparently make
Fortran codes use hugetlb
• Still no powerpc32 port
![Page 12: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/12.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Hugetlb on powerpc32• Why no powerpc32 port for hugetlb?
– All p-series processors today are 64 bit– Nobody cares about hugetlb on embedded CPUs! (we do)
• Enter Edi Shmueli– Ported hugetlb to powerpc32 (including BG)
• Both “old” and “new” hugetlb system– Started the open-source process (thru LTC)– Examined performance impact on single-node benchmarks
![Page 13: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/13.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation13 Linux for Compute Nodes Oct 11, 2006
Eliminating TLB Thrashing (measured by Edi)
• Class ‘A’, Serial, Integer-Sort NAS Benchmark– 96MB total memory footprint
• Total of three arrays, 32MB each– Total of 10 iteration, each with multiple memory read/writes– Arrays mapping to memory:
624576Number of pages
16MB pages4KB pages
TLB Thrashing
![Page 14: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/14.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation14 Linux for Compute Nodes Oct 11, 2006
Eliminating TLB Thrashing (measured by Edi)IS Benchmark Completed
Class = A
Size = 8388608
Iterations = 10
Time in seconds = 24.44
Mop/s total = 3.43
Operation type = keys ranked
Verification = SUCCESSFUL
Version = 3.2
Compile date = 31 Jul 2006
A
8388608
10
6.51
12.88
keys ranked
SUCCESSFUL
3.2
01 Aug 2006
Linux BLRTS
TLB Thrashing
A
8388608
10
6.38
13.15
keys ranked
SUCCESSFUL
3.2
31 Jul 2006
Linux with Huge-page support
No Thrashing
![Page 15: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/15.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
System noise on Linux• Scaling argument:
– Noisy systems don't scale– Linux is noisy– CNK is not
• Our argument:– Other machines are noisy– BG hardware is relatively quiet
– We can make residual Linux noise go away
Noise comparison by LANL team [ A. Hoisie] in 2003
![Page 16: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/16.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Measuring noise on Linux (S. Kumar, E. Shmueli)
MS
Maximal noise per iteration Standard process scheduling allows noise, resulting in an un-coordinated scheduling
real-time priority eliminates all noise
![Page 17: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/17.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
Linux on Compute Nodes: the technology
• Boot Linux on compute nodes– BG/P, BG/L, BG/L experimental control systems
• J. Castanos, D. Lieber, A. Sidelnik• Train & talk to all networks
– port of link training to Linux “firmware” (BG/P)• T. Inglett, A. Tauferner, J. Brunheroto, G. Almasi
– bglnet, bgpnet network drivers• E. Shmueli, G. Almasi
• Linux distribution– Special ramdisk for compute nodes (hacked by G. Almasi)
• Job Start, function shipping of file I/O during the run– Original port: D. Lieber–“Refined” SW architecture: G. Almasi, D. Lieber
![Page 18: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/18.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
hpcio: a “high performance” CIO protocol
• Same code base on /L, /P–Development on /L – stable environment–Possible compute node targets: CNK, blrts, Linux/P, Linux/L
–No inheritance, no virtual functions• Modular and extensible
– TCP/IP over tree implemented & tested– ANL to add PVFS specific functions– Multiple debugging tools: Etnus, Eclipse-based?– Heterogeneous MPI through hpcio?
• Standardized transports
![Page 19: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/19.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
hpcio components• CioTree protocol, library
– Transport layer for tree– Similar to BGML for MPI– G. Dozsa
• CioStream protocol, lib–Transport for CIO commands & data
– D. Lieber, M. Mundy• Executables:
–blrun (job runner)–ciod (I/O daemon)
• CIO modules, currently:– Job Control– File I/O– Standard I/O– Tree TCP/IP
• Linux File I/O client– Uses FUSE
• Filesystem in USErspace• Miklós Szeredi (non-IBM)
• cioc (CN daemon)
• CNK client library
![Page 20: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/20.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
hpcio architecture
Functional Ethernet
Functional Ethernet
I/O Node 0
Linux
ciod
C-Node 0I/O Node 1023
Linux
ciod
C-Node 0
Blrts or Linux
C-Node 63
C-Node 63
Control EthernetControl Ethernet
IDo chip
Control System
“blrun”
JTAG
torus
tree
DB2
Pset 1023
Pset 0
I2C
FileServers
cioc
Blrts or Linux
cioc
Blrts or Linux
cioc
Blrts or Linux
cioc
CioStream CioTree
![Page 21: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/21.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
File I/O
TCP/IP
Stdio
Mapping
JobControlCN IONCN IONCN IONCN IONCN ION
CioTree messaging library
DebuggingCN ION
SocketsCN ION
hpcio server libraryhpcio client library
ciod
CioStream
fuse
cioc
Hpcio architecture (detail)
CNKclient
CNK
blrun
![Page 22: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/22.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
CioTree API: a tree transport (G. Dozsa)
• portable (/X, /L, /P)– “sysdep” approach pioneered in BGML
– code shared with BGML• Point-to-point messages
–Plus ION->CN broadcast• Aligned buffers only• No restart capability• All messages have to be
acknowledged
• Requires some infrastructure to use
• 256 “protocols”• User buffers
–requires O(1) memory• Partially supports VN
mode• Multithreaded mode:
untested
![Page 23: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/23.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
hpcio status• Works on BG/L
– Used by Edi to run Linux tests– Part of the Colony project's deliverable (A. Sidelnik)
• Port to BG/P underway– Low priority; IBM Rochester wanted to port BG/L ciod first– Bug fixes: D. Lieber– Deployment T.B.D
![Page 24: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/24.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
What next?• Deploy hugetlb fix in Linux
– Measure performance• Deployment of hpcio underway
– There will be file I/O performance targets on /P: •maybe hpcio can help
• Port of newest Linux kernel to /L, /P underway• Other Linux applications:
– Distributed Java benchmarks (talking to D. Bacon)– TCP/IP over the torus!– In-memory databases w/ Linux– The Linux distro effort: get rid of cross-compilers!
• Compile natively on the I/O node – or even on compute nodes
![Page 25: The What, Why and How of Linux on BlueGene Computeestrabd/LACSI2006/workshops/workshop3/… · –Plus ION->CN broadcast •Aligned buffers only •No restart capability •All messages](https://reader036.fdocuments.net/reader036/viewer/2022081403/60a172ab3610d644674c702b/html5/thumbnails/25.jpg)
IBM TJ Watson Research Center
© 2006 IBM Corporation
What next? (cont)• Continue help out with Linux firmware effort
– A way to address IP issues (T. Inglett)• Ability to run CNK/blrts and Linux at the same time