HermitCore A Low Noise Unikernel for Extrem-Scale … · HBM support similar to memkind IP...
Transcript of HermitCore A Low Noise Unikernel for Extrem-Scale … · HBM support similar to memkind IP...
HermitCore – A Low Noise Unikernel for Extrem-Scale SystemsStefan Lankes1, Simon Pickartz1, Jens Breitbart2
1RWTH Aachen University, Aachen, Germany2Bosch Chassis Systems Control, Stuttgart, Germany
Agenda
Motivation
Challenges for the Future Systems
OS Architectures
HermitCore Design
Performance Evaluation
Conclusion and Outlook
2 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Motivation
Complexity of high-end HPC systems keeps growingExtreme degree of parallelismHeterogeneous core architecturesDeep memory hierarchyPower constrains
⇒ Need for scalable, reliable performance and capability to rapidly adapt to new HWApplications have also become complex
In-situ analysis, workflowsSophisticated monitoring and tools support, etc. . .Isolated, consistent simulation performance
⇒ Dependence on POSIX, MPI and OpenMP
Seemingly contradictory requirements. . .
3 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Operating System = Overhead
Every administration layer has its overhead ⇒ e. g. Hourglass benchmark
10000 20000 30000
100
102
104
106
Loop time (cycles)
Num
bero
feve
nts
How does the HPC / Cloud Community reduce overhead?
4 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Operating System = Overhead
Every administration layer has its overhead ⇒ e. g. Hourglass benchmark
10000 20000 30000
100
102
104
106
Loop time (cycles)
Num
bero
feve
nts
How does the HPC / Cloud Community reduce overhead?
4 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
How does the HPC community reduce the overhead?
Light-weight Kernels
Typically taken of an existing fat kernel (e. g. Linux)Removalof unneeded features to improve the scalabilitye. g. ZeptoOS
Multi-KernelsA specialized kernel runs side-by-side to full-weight kernel (e. g. Linux)Applications of the full-weight kernels are able to run on the specialized kernel.The specialized kernel catch every system call and delegate them to the full-weightkernel
Binary compatible to the full-weight kernelExamples: mOS, McKernel
5 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
OS Designs for Cloud Computing – Usage of Common OS
Operating System eth0
Hypervisor Software Virtual Switch
eth0OS
Application
eth0OS
Application
Two operating systems to maintain a single computer?Double Management!Why does a Cloud base on a Multi-User-/Multi-Tasking OS?
6 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
OS Designs for Cloud Computing – LibraryOS
Operating System eth0
Hypervisor Software Virtual Switch
eth0libOS
Application
eth0libOS
Application
Now, every system call is a function call ⇒ Low overheadWhole optimization of an image possible (including the library OS)
Link Time Optimization (LTO)Removing unneeded code ⇒ reduces the attack vector
7 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Related Unikernels
Rump kernels1
Part of NetBSD ⇒ e. g., NetBSD’s TCP / IP stack is available as libraryStrong dependencies to the hypervisorNot directly bootable on a standard hypervisor (e. g., KVM)
IncludeOS2
Runs natively on the hardware ⇒ minimal OverheadOnly 32 bit support to avoid the overhead of paging ⇒ uncommon in HPC
MirageOS3
Designed for the high-level language OCaml ⇒ uncommon in HPC
1A. Kantee and J. Cormack. “Rump Kernels – No OS? No Problem!”. In: ; login: 2014.2A. Bratterud et al. “IncludeOS: A Resource Efficient Unikernel for Cloud Services”. In:
7th Int. Conference on Cloud Computing Technology and Science. 2015.3A. Madhavapeddy et al. “Unikernels: Library Operating Systems for the Cloud”. In:
8th Int. Conference on Architectural Support for Programming Languages and Operating Systems. 2013.8 HermitCore | Stefan Lankes et al. | RWTH Aachen University |
5th April 2017
HermitCore Design
Combination of the Unikernel and Multi-Kernel to reduce the overheadThe same binary is able to run
in a VM (classical unikernel setup)or bare-metal side-by-side to Linux (multi-kernel setup)
Support for dominant programming models (MPI, OpenMP)Single-address space operating system
No TLB Shootdown
9 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
HermitCore Design
Classical Unikernel SetupApplications are able to boot directly within a VM
Tested with Qemu / KVMTested with uhyve (experimental KVM-based Hypervisor)
Qemu emulates more than HermitCore needs ⇒ large setup timeuhyve reduce the boot time (from 2 s to ~30 ms)
Amazon Web Services (available soon!)
Multi-Kernel SetupOne kernel per NUMA node
Only local memory accesses (UMA)Message passing between NUMA nodes
One FWK (Linux) in the system to get access to a broader driver supportOnly a backup for pre- / post-processingCritical path should be handled by HermitCore
10 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Booting HermitCore
Hardware
Linux kernel
libc
Proxy
Linux kernel
libc
Proxy
libos(LwIP, IRQ, etc.)
Newlib
OpenMP / MPI
App
On the detection of aHermitCore app, a proxy will bestarted.
The proxy unplugs a set ofcores.Triggers Linux to bootHermitCore on the unusedcores.A reliable connection will beestablished.By termination, the cores areset to the HALT state.Finally, reregistering of thecores to Linux.
11 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Booting HermitCore
Hardware
Linux kernel
libc
Proxy
Linux kernel
libc
Proxy
libos(LwIP, IRQ, etc.)
Newlib
OpenMP / MPI
App
On the detection of aHermitCore app, a proxy will bestarted.The proxy unplugs a set ofcores.
Triggers Linux to bootHermitCore on the unusedcores.A reliable connection will beestablished.By termination, the cores areset to the HALT state.Finally, reregistering of thecores to Linux.
11 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Booting HermitCore
Hardware
Linux kernel
libc
Proxy
Linux kernel
libc
Proxy
libos(LwIP, IRQ, etc.)
Newlib
OpenMP / MPI
App
On the detection of aHermitCore app, a proxy will bestarted.The proxy unplugs a set ofcores.Triggers Linux to bootHermitCore on the unusedcores.
A reliable connection will beestablished.By termination, the cores areset to the HALT state.Finally, reregistering of thecores to Linux.
11 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Booting HermitCore
Hardware
Linux kernel
libc
Proxy
Linux kernel
libc
Proxy
libos(LwIP, IRQ, etc.)
Newlib
OpenMP / MPI
App
On the detection of aHermitCore app, a proxy will bestarted.The proxy unplugs a set ofcores.Triggers Linux to bootHermitCore on the unusedcores.A reliable connection will beestablished.
By termination, the cores areset to the HALT state.Finally, reregistering of thecores to Linux.
11 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Booting HermitCore
Hardware
Linux kernel
libc
Proxy
Linux kernel
libc
Proxy
libos(LwIP, IRQ, etc.)
Newlib
OpenMP / MPI
App
On the detection of aHermitCore app, a proxy will bestarted.The proxy unplugs a set ofcores.Triggers Linux to bootHermitCore on the unusedcores.A reliable connection will beestablished.By termination, the cores areset to the HALT state.
Finally, reregistering of thecores to Linux.
11 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Booting HermitCore
Hardware
Linux kernel
libc
Proxy
Linux kernel
libc
Proxy
libos(LwIP, IRQ, etc.)
Newlib
OpenMP / MPI
App
On the detection of aHermitCore app, a proxy will bestarted.The proxy unplugs a set ofcores.Triggers Linux to bootHermitCore on the unusedcores.A reliable connection will beestablished.By termination, the cores areset to the HALT state.Finally, reregistering of thecores to Linux.
11 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Runtime Support
SSE, AVX2, AVX512, FMA,. . .Full C-library support (newlib)HBM support similar to memkindIP interface & BSD sockets (LwIP)
IP packets are forwarded to LinuxShared memory interface
PthreadsThread binding at start timeNo load balancing ⇒ less housekeeping
OpenMP via Intel’s RuntimeiRCCE- & MPI (via SCC-MPICH)Full support for the Go runtime
R
10
R
32
R
54
R
76
R
98
R
1110
R
1312
R
1514
R
1716
R
1918
R
2120
R
2322
R
2524
R
2726
R
2928
R
3130
R
3332
R
3534
R
3736
R
3938
R
4140
R
4342
R
4544
R
4746
MC 1
MC 0
MC 3
MC 2
FPGA
Router
TileMIU MPB
Core 23
Core 22
L2$
L2$
12 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Operating System Micro-Benchmarks
Test systemsIntel Haswell CPUs (E5-2650 v3) clocked at 2.3 GHzIntel KNL (Phi 7210) clocked at 1.3 GHz, SNC mode with four NUMA nodes
Results in CPU cycles
System activity KNL Haswell
HermitCore Linux HermitCore Linux
getpid() 15 486 14 143sched_yield() 197 983 97 370malloc() 3 051 12 806 3715 6575first write access to a page 2 078 3 967 2018 4007
13 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
0 50 100 150 200 250 3000
0.2
0.4
0.6
0.8
1·105
Time in s
Gap
inTi
cks
Standard Linux
0 50 100 150 200 250 3000
0.2
0.4
0.6
0.8
1·105
Time in s
Gap
inTi
cks
HermitCore
0 50 100 150 200 250 3000
0.2
0.4
0.6
0.8
1·105
Time in s
Gap
inTi
cks
Linux with enabled isolcpus & nohz
0 50 100 150 200 250 3000
0.2
0.4
0.6
0.8
1·105
Time in sGa
pin
Tick
s
HermitCore without IP thread
Overhead of VMs – Determined via NAS Parallel Benchmarks (Class B)
Native (Linux) HermitCore + uhyve
CG BT SP EP IS0
5
10
15
20
Spee
din
GMop
/s
15 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Conclusions
It works! ⇒ https://youtu.be/gDYCJ1DOTKw
Binary packages are availableReduce the OS noise significantlyTry it out!
http://www.hermitcore.org
Thank you for your kind attention!
16 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Outlook
A fast direct access to the interconnect is requiredSR-IOV simplifies the coordination between Linux & HermitCore
Core Core
Memory NIC
Linux
Node 0
Core Core
Memory vNIC
HermitCore
Node 1
Core Core
Memory vNIC
HermitCore
Node 2
Core Core
Memory vNIC
HermitCore
Node 3
Virt
ualI
PD
evice
/M
essa
gePa
ssin
gIn
tera
fce
18 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
OS Designs for Cloud Computing – Container
OSeth0
Container
Application
Container
Application
Building of virtual borders (namespaces)Containers and their processes doesn’t see each otherFast access to OS servicesLess secure because an exploit for the container attacks also the host OSDoesn’t reduce the OS noise of the host system
19 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
EPCC OpenMP Micro-Benchmarks
2 4 6 8 100
1
2
3
Number of Threads
Ove
rhea
din
µs
Parallel on Linux (gcc)Parallel For on Linux (gcc)Parallel on Linux (icc)Parallel For on Linux (icc)Parallel on HermitCoreParallel For on HermitCore
20 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Throughput Results of the Inter-kernel Communication Layer
256 4 Ki 64 Ki 1 Mi0
1
2
3
4
5
Size in Byte
Thro
ughp
utin
GiB/
s
PingPong via iRCCEPingPong via SCC-MPICHPingPong via ParaStation MPI
21 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Lack of programmability
Non-Uniform Memory Access
Costs for memory access may varyRun processes where memory isallocatedAllocate memory where the processresidesImplications for the performance
Where should the applicationsstore the data?Who should decide the location?
The operating system?The application developers?
Socket 0 Socket 1
Memory 0 Memory 1
Interconnect
Memory 2 Memory 3
22 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Lack of programmability
Non-Uniform Memory Access
Costs for memory access may varyRun processes where memory isallocatedAllocate memory where the processresidesImplications for the performance
Where should the applicationsstore the data?Who should decide the location?
The operating system?The application developers?
Memory 0 Memory 1
Interconnect
Memory 2 Memory 3
22 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Tuning Tricks
Parallelization via Shared Memory(OpenMP)
Many side-effects and error-proneIncremental parallelization
Parallelization via Message Passing(MPI)
Restructuring of the sequential codeLess side-effects
Performance TuningBind MPI applications on one NUMAnode
⇒ No remote memory access2x8 4x4 8x2 16x10
1
2
3
4
5
< threadcount > × < proccount >
Spee
din
GFlo
p/s
LU-MZ.C.16BT-MZ.C.16SP-MZ.C.16
23 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
OpenMP Runtime
GCC includes a OpenMP Runtime (libgomp)Reuse synchronization primitives of the Pthread libraryOther OpenMP runtimes scales betterIn addition, our Pthread library was originally not designed for HPC
Integration of Intel’s OpenMP RuntimeInclude its own synchronization primitivesBinary compatible to GCC’s OpenMP RuntimeChanges for the HermitCore support are small
Mostly deactivation of function to define the thread affinityTransparent usage
For the end-user, no changes in the build process
24 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Support of compilers beside GCC
Just avoid the standard environment (−ffreestanding)Set include path to HermitCore’s toolchainBe sure that the ELF file use HermitCore’s ABI
Patching object files via elfeditUse the GCC to link the binaryLD = x86_64 -hermit -gcc#CC = x86_64 -hermit -gcc# CFLAGS = -O3 -mtune= native -march= native -fopenmp -mno -red -zoneCC = icc -D__hermit__CFLAGS = -O3 -xHost -mno -red -zone -ffreestanding -I$( HERMIT_DIR ) -openmpELFEDIT = x86_64 -hermit - elfedit
stream .o: stream .c$(CC) $( CFLAGS ) -c -o $@ $<$( ELFEDIT ) --output -osabi HermitCore $@
stream : stream .o$(LD) -o $@ $< $( LDFLAGS ) $( CFLAGS )
25 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Is the software stack difficult to maintain?
Changes to the common software stack determined with cloc
Software Stack LoC Changes
binutils 5 121 217 226gcc 6 850 382 4 821Linux 15 276 013 1 296Newlib 1 040 826 5 472LwIP 38 883 832Pthread 13 768 466OpenMP RT 61 594 324HermitCore – 10 597
26 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
100002000030000
100
102
104
106
Loop time (cycles)
Num
bero
feve
nts Linux
100002000030000
100
102
104
106
Loop time (cycles)
Num
bero
feve
nts Linux (isolcpu)
100002000030000
100
102
104
106
Loop time (cycles)
Num
bero
feve
nts Hermit w LwIP
100002000030000
100
102
104
106
Loop time (cycles)N
umbe
rofe
vent
s Hermit wo LwIP
Hydro (preliminary results)
5 10 15 20
5,000
10,000
Number of Cores
MFL
OPS
Linux (1 process × n threads)Linux (1 proc. × n thr., bind-to 0–19)Linux (n proc. × 5 thr., bind-to numa)HermitCore (n proc. × 5 thr.)
28 HermitCore | Stefan Lankes et al. | RWTH Aachen University |5th April 2017
Thank you for your kind attention!
Stefan Lankes et al. – [email protected]
Institute for Automation of Complex Power SystemsE.ON Energy Research Center, RWTH Aachen UniversityMathieustraße 1052074 Aachen
www.acs.eonerc.rwth-aachen.de