Containers are good for more than serving cat pictures!
Reid Priedhorsky, HPC-ENV Tim Randles, HPC-DES
NASA Ames Machine Learning Workshop August 30, 2017
LA-UR-17-27793
The next 25 minutes of your life
1. Why user-defined software stacks will end your suffering
2. But only if you use containers
3. Use Charliecloud, and all your wildest dreams will come true
4. “Demo”
Some people need different software
Default software stacks are good at specific things • in the case of HPC resources, it’s MPI-based simulation codes
What if your thing is different? • non-MPI simulations • machine learning • data analytics • epic build process
Admins will install software for you • BUT only if there’s enough demand • unusual needs go unmet • are you a crackpot or innovative?
Solution: User-defined software stacks (UDSS)
BYOS (bring your own software) • Let users install software of their choice • ... up to and including a complete Linux distribution • ... and run it on compute resources they don’t own.
This UDSS lives in an image
Why UDSS?
Advantages • software dependencies: numerous, unusual, older, newer, internet for build, .... • portability of environments: e.g., across dev/test/small/large resources, .... • consistent environments: validated, standardized, archival, .... • usability
Possible pitfalls • missing functionality: HSN, accelerators, file systems • performance: many opportunities for overhead
Design goals 1. Standard, reproducible workflow 2. Work well on existing resources 3. Be very simple
Design goals
1. Standard, reproducible workflow • in contrast with “tinker ’til it’s ready, then freeze” • standard ⇒ reduce training/development costs, increase skill portability • reproducible ⇒ creation of images is simpler & more robust
2. Work well on existing resources • pitfalls above are already well-controlled in HPC centers • let’s not re-implement and re-optimize
• resource management: solved (Slurm, Moab, Torque, PBS, etc.) • file systems: solved (Lustre, Panasas, GPFS) • high-speed interconnect: solved (InfiniBand, OPA)
3. Be very simple • save costs: development, debugging, security, usability • UNIX philosophy: make each program do one thing well
Option 1: Compile it yourself
Advantages • available everywhere now • unprivileged • direct access to all resources • no performance penalty • in principle, you can do anything
Disadvantages • it’s not 1995 any more • tedious • error-prone • hard to update • no standard workflow • does not provide portability & consistency • in practice, difficulty is prohibitive
but see also Spack https://github.com/llnl/spack
Option 2: Virtual machines
def: Program that emulates a computer.
Advantages • ultimate flexibility: different kernel? non-Linux? no problem! • strong isolation: user is sandboxed • minimal performance penalty (for things used by industry)
Disadvantages • performance (for things uncommon in industry)
• can sometimes mitigate by compromising on isolation • must be provisioned with complete OS • supporting infrastructure can be complex
• e.g., network bridges, etc.
Should HPC centers become more like cloud? • no: let’s not re-implement and re-optimize
unless you really need the flexibility & isolation
Option 3: Containers
Share kernel with the native software stack • everything else, including libc, runs in the container • isolation accomplished with Linux namespaces and related kernel features
Advantages • sufficient flexibility • sufficient isolation • standard, reproducible workflows available • no performance penalty (if done right) • minimal support infrastructure (if done right) • super simple (if done right)
Disadvantages • requires newer Linux kernel (if done right) • various warts here and there
how containers work
Container implementations
Full-featured • image building • image management
• storage, caching, tagging, signing • images stored in custom formats • orchestration • storage management • runtime setup
• e.g., default command/script, inetd-alike • stateful containers • supervisor daemons
Lightweight • few features • given an image, run it
These features are useful, but there are drawbacks
1. Code size 2. Support burden 3. Privileged operations
Examples, toward lower-cost deployment: unshare(1), systemd-nspawn (?), NsJail (?), Charliecloud
Container ingredients
1. Linux namespaces • independent per-process views of kernel resources
2. cgroups • limit resource consumption per-process
3. prctl(PR_SET_NO_NEW_PRIVS) • prevent execve(2) from increasing privileges
4. seccomp(2) • filter system calls
5. SELinux, AppArmor • various features that change what a process may do
Don’t need all! You can choose.
The six namespaces
mount — filesystem tree and mounts
PID — process IDs
UTS — host name & domain name
network — all other network stuff
IPC — System V and POSIX IPC
user — UID/GID/capabilities
The first five are privileged: you need root to create them, unless you also create a user namespace. The user namespace itself is unprivileged: any bozo can do it.
You can mix-n-match. System calls: unshare(2), clone(2), setns(2)
Conclusion
Lightweight implementations are a better choice for HPC centers • they provide the most important cloud-like flexibility • they don’t compromise existing tools & strengths of HPC centers
But... some of those other features are important • e.g., image building & sharing • Docker smells really good from this perspective.
Charliecloud
Photo illustration by Hunter Easterday, Jordan Ogas, Adam Smith / HPC-DES
Our hybrid approach
1. Image building & sharing goes in a sandbox • safe place for user to be root: user workstation or virtual machine • wrap Docker for image building
• or anything else that can produce a directory tree • debootstrap(8), yum --installroot, etc.
• image sharing via docker (push|pull)
2. Run images with our own unprivileged runtime • mount & user namespaces only
• requires new-ish kernel (except in setuid test mode) • Crays will have it soon • RHEL/CentOS 7 can install via ELRepo
• RHEL 7.4 has support with a caveat • it’s a user program!!! • admins don’t need to do anything
Basic workflow
step                                          where        privileged?
1. Build Docker/etc. image                    sandbox      yes*
2. Dump image to tarball                      sandbox      yes*
3. Copy tarball to where you want to run      production   no
4. Unpack tarball                             production   no
5. Inject runtime configuration (if needed)   production   no
6. Run your commands in container             production   no
*These steps require root if using Docker, but other methods might not.
Hello world
$ cd ~/charliecloud/examples/serial/hello
$ ls
Dockerfile  hello.sh  README  test.bats
$ ch-build -t hello ~/charliecloud
Sending build context to Docker daemon 15.19 MB
[...]
Successfully built 30662b3f94f3
$ ch-docker2tar hello /var/tmp
57M /var/tmp/hello.tar.gz
$ ch-tar2dir /var/tmp/hello.tar.gz /var/tmp/hello
creating new image /var/tmp/hello
/var/tmp/hello unpacked ok
$ ch-run /var/tmp/hello -- echo "I'm in a container"
I'm in a container
Performance: CoMD and VPIC (32 nodes)

[Bar chart: wallclock time as percent of bare metal, CoMD and VPIC, bare metal vs. Charliecloud; Charliecloud overhead is 0.7% for CoMD and 4% for VPIC]
Charliecloud vs. design goals
1. Standard, reproducible workflow • Docker: industry standard for reproducible builds • or anything else you want to script up
2. Work well on existing resources • ch-run: minimal but sufficient isolation (mount & user namespaces) • performance unchanged; direct access to everything • scales using standard HPC tools
3. Be very simple • 5 shell scripts, 2 C programs, 900 lines of code • for comparison ...
• NsJail: 4,000 LOC • Singularity: 11,000 LOC • Shifter: 19,000 LOC • Docker: 133,000 LOC
Machine Learning Software Stack
Charliecloud status
1. Available now on LANL clusters
2. Available now on any Linux box • once you install it • newer kernel needed, or setuid tolerance • including cloud instances
3. Soon: pre-installed VirtualBox image • no root needed • Mac, Windows, Linux
4. Passes tests on Crays
The end
1. ;login: article (USENIX magazine) • “Linux containers for fun and profit in HPC” • published September; ask me for the draft
2. Supercomputing 2017 pre-print • “Charliecloud: Unprivileged containers for user-defined software stacks in HPC” • http://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-16-22370
3. Documentation • https://hpc.github.io/charliecloud • includes detailed tutorials
4. Source code • https://github.com/hpc/charliecloud
Reid Priedhorsky, HPC-ENV [email protected]
Tim Randles, HPC-DES [email protected]
Additional Slides
World’s most useless root shell
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
   uid_t euid = geteuid();
   int fd;

   printf("outside userns, uid=%d\n", euid);
   unshare(CLONE_NEWUSER);
   fd = open("/proc/self/uid_map", O_WRONLY);
   dprintf(fd, "0 %d 1\n", euid);
   close(fd);
   printf("in userns, uid=%d\n", geteuid());
   execlp("/bin/bash", "bash", (char *)NULL);
}
Notes on the code: the 1-to-1 UID map’s host side must be your EUID; unshare(CLONE_NEWUSER) creates the user namespace; and you are really still acting as yourself, because all access checks use the host EUID.
MPI test program: Report userns & rank; send & recv
#define _GNU_SOURCE
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <mpi.h>

#define TAG 0

int main(int argc, char **argv)
{
   int msg, rank, rank_ct;
   struct stat st;
   MPI_Status mstat;
   char hostname[HOST_NAME_MAX+1];

   stat("/proc/self/ns/user", &st);
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &rank_ct);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   gethostname(hostname, HOST_NAME_MAX+1);
   printf("%d: init ok %s, %d ranks, userns %lu\n",
          rank, hostname, rank_ct, st.st_ino);
   fflush(stdout);

   if (rank == 0) {
      for (int i = 1; i < rank_ct; i++) {
         msg = i;
         MPI_Send(&msg, 1, MPI_INT, i, TAG, MPI_COMM_WORLD);
         MPI_Recv(&msg, 1, MPI_INT, i, TAG, MPI_COMM_WORLD, &mstat);
      }
   } else {
      MPI_Recv(&msg, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &mstat);
      msg = -rank;
      MPI_Send(&msg, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD);
   }

   if (rank == 0) printf("%d: send/receive ok\n", rank);
   MPI_Finalize();
   if (rank == 0) printf("%d: finalize ok\n", rank);
   return 0;
}
Based on: http://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program
Dockerfile
FROM debian:jessie

# OS packages needed to build OpenMPI.
RUN apt-get update && apt-get install -y g++ gcc make wget \
 && rm -rf /var/lib/apt/lists/*

# Compile OpenMPI.
ENV VERSION 1.10.5
RUN wget -nv https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-${VERSION}.tar.gz
RUN tar xf openmpi-${VERSION}.tar.gz
RUN cd openmpi-${VERSION} \
 && CFLAGS=-O3 CXXFLAGS=-O3 \
    ./configure --prefix=/usr --sysconfdir=/mnt/0 \
      --disable-pty-support --disable-mpi-cxx --disable-mpi-fortran \
 && make -j$(getconf _NPROCESSORS_ONLN) install
RUN rm -Rf openmpi-${VERSION}*

# This example.
COPY /examples/mpi/mpihello /hello
WORKDIR /hello
RUN make clean && make
Step 1: Build Docker image
sndbx$ cd ~/charliecloud/examples/mpi/mpihello
sndbx$ ls
Dockerfile  hello.c  Makefile  README  slurm.sh  test.bats
sndbx$ ch-build -t mpihello ~/charliecloud
[sudo] password for reidpr:
Sending build context to Docker daemon 15.19 MB
Step 1/4 : FROM debian8openmpi
 ---> 8a2a63f43554
[...]
Step 4/4 : RUN make clean && make
 ---> Using cache
 ---> d028357f85e5
Successfully built d028357f85e5
privileged
Step 2: Dump Docker image to tarball
sndbx$ ch-docker2tar mpihello /var/tmp
146M /var/tmp/mpihello.tar.gz
sndbx$ tar tvf /var/tmp/mpihello.tar.gz | head
-rwxr-xr-x 0/0       0 2017-08-16 16:29 .dockerenv
drwxr-xr-x 0/0       0 2017-08-03 16:30 bin/
-rwxr-xr-x 0/0 1029624 2016-11-05 15:22 bin/bash
-rwxr-xr-x 0/0   51912 2015-03-14 09:47 bin/cat
-rwxr-xr-x 0/0   14560 2014-09-08 01:01 bin/chacl
-rwxr-xr-x 0/0   60072 2015-03-14 09:47 bin/chgrp
-rwxr-xr-x 0/0   55944 2015-03-14 09:47 bin/chmod
-rwxr-xr-x 0/0   64168 2015-03-14 09:47 bin/chown
-rwxr-xr-x 0/0  150824 2015-03-14 09:47 bin/cp
-rwxr-xr-x 0/0  125400 2014-11-08 06:49 bin/dash
privileged
Step 3: Tarball to (in this case) Woodchuck
sndbx$ scp /var/tmp/mpihello.tar.gz tfta:/users/reidpr
mpihello.tar.gz                     100%  145MB  76.6MB/s  00:01
sndbx$ ssh wtrw
wtrw $ ssh wc-fe
wc-fe$ ls -lh mpihello.tar.gz
-rw-r----- 1 reidpr reidpr 146M Aug 16 16:41 mpihello.tar.gz
unprivileged
Step 4: Start job & unpack image
wc-fe$ salloc -N32 -p charliecloud
wc003$ module load openmpi/1.10.5
wc003$ module load sandbox
wc003$ module load charliecloud
wc003$ srun ch-tar2dir ~/mpihello.tar.gz /var/tmp/mpihello
creating new image /var/tmp/mpihello   [32 times]
/var/tmp/mpihello unpacked ok          [32 times]
wc003$ ls /var/tmp/mpihello
WEIRD_AL_YANKOVIC  dev    home   media    opt   run   sys  var
bin                etc    lib    mnt      proc  sbin  tmp
boot               hello  lib64  oldroot  root  srv   usr
wc003$ du -sh /var/tmp/mpihello
362M /var/tmp/mpihello
wc003$ ls -R /var/tmp/mpihello | wc -l
18009
unprivileged
Step 5: Configure your stuff
Nothing needed — we’ll use host’s SLURM/MPI configuration.
Some things (like Spark) need SLURM stuff translated to config files.
unprivileged
Step 6: Run your code!!!
wc003$ mpirun ch-run /var/tmp/mpihello -- /hello/hello
[...]
0: init ok wc003.localdomain, 512 ranks, userns 4026561620
[...]
511: init ok wc034.localdomain, 512 ranks, userns 4026561650
[...]
0: send/receive ok
0: finalize ok
unprivileged