Containers are good for more than serving cat pictures!?

Reid Priedhorsky, HPC-ENV
Tim Randles, HPC-DES

NASA Ames Machine Learning Workshop
August 30, 2017

LA-UR-17-27793




The next 25 minutes of your life

  1. Why user-defined software stacks will end your suffering

  2. But only if you use containers

  3. Use Charliecloud, and all your wildest dreams will come true

  4. “Demo”


Some people need different software

  Default software stacks are good at specific things
    •  in the case of HPC resources, it’s MPI-based simulation codes

  What if your thing is different?
    •  non-MPI simulations
    •  machine learning
    •  data analytics
    •  epic build process

  Admins will install software for you
    •  BUT only if there’s enough demand
    •  unusual needs go unmet
    •  are you crackpot or innovative?


Solution: User-defined software stacks (UDSS)

  BYOS (bring your own software)
    •  Let users install software of their choice
    •  ... up to and including a complete Linux distribution
    •  ... and run it on compute resources they don’t own.

  This UDSS lives in an image


Why UDSS?

  Advantages
    •  software dependencies: numerous, unusual, older, newer, internet for build, ....
    •  portability of environments: e.g., across dev/test/small/large resources, ....
    •  consistent environments: validated, standardized, archival, ....
    •  usability

  Possible pitfalls
    •  missing functionality: HSN, accelerators, file systems
    •  performance: many opportunities for overhead

  Design goals
    1. Standard, reproducible workflow
    2. Work well on existing resources
    3. Be very simple


Design goals

  1. Standard, reproducible workflow
    •  in contrast with “tinker ’til it’s ready, then freeze”
    •  standard ⇒ reduce training/development costs, increase skill portability
    •  reproducible ⇒ creation of images is simpler & more robust

  2. Work well on existing resources
    •  pitfalls above are already well-controlled in HPC centers
    •  let’s not re-implement and re-optimize
        •  resource management: solved (Slurm, Moab, Torque, PBS, etc.)
        •  file systems: solved (Lustre, Panasas, GPFS)
        •  high-speed interconnect: solved (InfiniBand, OPA)

  3. Be very simple
    •  save costs: development, debugging, security, usability
    •  UNIX philosophy: make each program do one thing well


Option 1: Compile it yourself

  Advantages
    •  available everywhere now
    •  unprivileged
    •  direct access to all resources
    •  no performance penalty
    •  in principle, you can do anything

  Disadvantages
    •  it’s not 1995 any more
    •  tedious
    •  error-prone
    •  hard to update
    •  no standard workflow
    •  does not provide portability & consistency
    •  in practice, difficulty is prohibitive

  but see also Spack https://github.com/llnl/spack


Option 2: Virtual machines

  def: Program that emulates a computer.

  Advantages
    •  ultimate flexibility: different kernel? non-Linux? no problem!
    •  strong isolation: user is sandboxed
    •  minimal performance penalty (for things used by industry)

  Disadvantages
    •  performance (for things uncommon in industry)
        •  can sometimes mitigate by compromising on isolation
    •  must be provisioned with complete OS
    •  supporting infrastructure can be complex
        •  e.g., network bridges, etc.

  Should HPC centers become more like cloud?
    •  no: let’s not re-implement and re-optimize
       unless you really need the flexibility & isolation


Option 3: Containers

  Share kernel with the native software stack
    •  everything else, including libc, runs in the container
    •  isolation accomplished with Linux namespaces and related kernel features

  Advantages
    •  sufficient flexibility
    •  sufficient isolation
    •  standard, reproducible workflows available
    •  no performance penalty (if done right)
    •  minimal support infrastructure (if done right)
    •  super simple (if done right)

  Disadvantages
    •  requires newer Linux kernel (if done right)
    •  various warts here and there


how containers work


Container implementations

  Full-featured
    •  image building
    •  image management
        •  storage, caching, tagging, signing
        •  images stored in custom formats
    •  orchestration
    •  storage management
    •  runtime setup
        •  e.g., default command/script, inetd-alike
    •  stateful containers
    •  supervisor daemons

  These features are useful, but there are drawbacks
    1.  Code size
    2.  Support burden
    3.  Privileged operations

  Lightweight
    •  few features
    •  given an image, run it
    •  lower-cost deployment
    •  examples: unshare(1), systemd-nspawn (?), NsJail (?), Charliecloud


Container ingredients

  1. Linux namespaces
    •  independent per-process views of kernel resources

  2. cgroups
    •  limit resource consumption per-process

  3. prctl(PR_SET_NO_NEW_PRIVS)
    •  prevent execve(2) from increasing privileges

  4. seccomp(2)
    •  filter system calls

  5. SELinux, AppArmor
    •  various features that change what a process may do

  Don’t need all! You can choose.
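Several of these ingredients can be inspected from an ordinary shell on a Linux host. The following is a minimal sketch, assuming a Linux /proc filesystem. The function name `show_ingredients` is made up for illustration, and the `NoNewPrivs` and `Seccomp` fields only appear in /proc/self/status on reasonably recent kernels, so the script reports "unknown" when they are missing.

```shell
#!/bin/sh
# Sketch: report the state of some container ingredients for the current shell.
# show_ingredients is a hypothetical helper name, not part of any tool.
show_ingredients() {
    # 1. Namespaces: each entry in /proc/self/ns is one namespace we belong to.
    if [ -d /proc/self/ns ]; then
        echo "namespaces: $(ls /proc/self/ns | tr '\n' ' ')"
    else
        echo "namespaces: not visible"
    fi
    # 2. cgroups: membership of this process.
    if [ -r /proc/self/cgroup ]; then
        echo "cgroups: $(wc -l < /proc/self/cgroup) membership line(s)"
    else
        echo "cgroups: not visible"
    fi
    # 3 and 4. no_new_privs flag and seccomp mode (fields absent on older kernels).
    for field in NoNewPrivs Seccomp; do
        value=$(awk -v f="$field:" '$1 == f { print $2 }' /proc/self/status 2>/dev/null)
        echo "$field: ${value:-unknown}"
    done
}

show_ingredients
```

A value of 0 for `NoNewPrivs` or `Seccomp` means the corresponding ingredient is not in use for this process, which is the normal state for a login shell.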


The six namespaces

  mount — filesystem tree and mounts

  PID — process IDs

  UTS — host name & domain name

  network — all other network stuff

  IPC — System V and POSIX IPC

  user — UID/GID/capabilities

  privileged (mount, PID, UTS, network, IPC): need root to create, unless you also ...

  unprivileged (user): any bozo can do it

  You can mix-n-match.

  System calls: unshare(2), clone(2), setns(2)
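The namespaces a process belongs to are visible as symbolic links under /proc/self/ns. The sketch below prints the identifier of each of the six namespaces listed above, using the kernel's short names (mnt, pid, uts, net, ipc, user). It assumes a Linux host, and `list_namespaces` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the identifier of each of the six namespaces on the slide
# for the current process. list_namespaces is a hypothetical helper name.
list_namespaces() {
    for ns in mnt pid uts net ipc user; do
        if [ -e "/proc/self/ns/$ns" ]; then
            # The link target has the form "user:[<inode number>]".
            echo "$ns $(readlink "/proc/self/ns/$ns")"
        else
            echo "$ns unavailable"
        fi
    done
}

list_namespaces
```

Two processes share a namespace exactly when the corresponding identifiers match, so running this inside and outside a container shows which namespaces the container runtime actually unshared.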


Conclusion

  Lightweight implementations are a better choice for HPC centers
    •  deliver the most important cloud-like flexibility
    •  don’t compromise existing tools & strengths of HPC centers

  But... some of those other features are important
    •  e.g., image building & sharing
    •  Docker smells really good from this perspective.


  Charliecloud

  Photo illustration by Hunter Easterday, Jordan Ogas, Adam Smith / HPC-DES


Our hybrid approach

  1. Image building & sharing goes in a sandbox
    •  safe place for user to be root: user workstation or virtual machine
    •  wrap Docker for image building
        •  or anything else that can produce a directory tree
        •  debootstrap(8), yum --installroot, etc.
    •  image sharing via docker (push|pull)

  2. Run images with our own unprivileged runtime
    •  mount & user namespaces only
        •  requires new-ish kernel (except in setuid test mode)
        •  Crays will have it soon
        •  RHEL/CentOS 7 can install via ELRepo
        •  RHEL 7.4 has support with a caveat
    •  it’s a user program!!!
    •  admins don’t need to do anything


Basic workflow

  step                                          sandbox   production   privileged?
  1. Build Docker/etc. image                       ✓                   yes*
  2. Dump image to tarball                         ✓                   yes*
  3. Copy tarball to where you want to run         ✓                   no
  4. Unpack tarball                                          ✓         no
  5. Inject runtime configuration (if needed)                ✓         no
  6. Run your commands in container                          ✓         no

  *These steps require root if using Docker, but other methods might not.
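The six steps can be strung together in one script. The sketch below is a dry run: by default it only prints the commands, using the `ch-build`, `ch-docker2tar`, `ch-tar2dir`, and `ch-run` invocations exactly as they appear in this deck (other Charliecloud releases may use different commands). The image name, paths, and the `cluster:` scp target are hypothetical placeholders, and in a real setup steps 4 to 6 would be executed on the production machine rather than in the sandbox.

```shell
#!/bin/sh
# Sketch: the six workflow steps as one script. By default it only prints the
# commands (dry run); set RUN=1 to execute them. Image name, paths, and host
# below are hypothetical placeholders.
IMAGE=${IMAGE:-hello}
SRC=${SRC:-$HOME/charliecloud}       # Docker build context
TARDIR=${TARDIR:-/var/tmp}           # where the tarball is written
DEST=${DEST:-cluster:/var/tmp}       # scp target on the production side
IMGDIR=${IMGDIR:-/var/tmp/$IMAGE}    # unpacked image directory

step() {
    echo "+ $*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

workflow() {
    step ch-build -t "$IMAGE" "$SRC"                     # 1. build image (root if using Docker)
    step ch-docker2tar "$IMAGE" "$TARDIR"                # 2. dump image to tarball (root if using Docker)
    step scp "$TARDIR/$IMAGE.tar.gz" "$DEST"             # 3. copy tarball to where it will run
    step ch-tar2dir "$TARDIR/$IMAGE.tar.gz" "$IMGDIR"    # 4. unpack tarball
    # 5. inject runtime configuration: nothing needed for this image
    step ch-run "$IMGDIR" -- echo "I'm in a container"   # 6. run a command in the container
}

workflow
```

Keeping the privileged steps (1 and 2) at the top makes it easy to see where the sandbox ends: everything from the `scp` onward runs as an ordinary user.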


Hello world

  $ cd ~/charliecloud/examples/serial/hello
  $ ls
  Dockerfile hello.sh README test.bats
  $ ch-build -t hello ~/charliecloud
  Sending build context to Docker daemon 15.19 MB
  [...]
  Successfully built 30662b3f94f3
  $ ch-docker2tar hello /var/tmp
  57M /var/tmp/hello.tar.gz
  $ ch-tar2dir /var/tmp/hello.tar.gz /var/tmp/hello
  creating new image /var/tmp/hello
  /var/tmp/hello unpacked ok
  $ ch-run /var/tmp/hello -- echo "I'm in a container"
  I'm in a container


Performance: CoMD and VPIC (32 nodes)

  [Bar chart: wall-clock percent (axis 0 to 100) for CoMD and VPIC, bare metal vs. Charliecloud; bars annotated 0.7% and 4%]


Charliecloud vs. design goals

  1. Standard, reproducible workflow
    •  Docker: industry standard for reproducible builds
    •  or anything else you want to script up

  2. Work well on existing resources
    •  ch-run: minimal but sufficient isolation (mount & user namespaces)
    •  performance unchanged; direct access to everything
    •  scales using standard HPC tools

  3. Be very simple
    •  5 shell scripts, 2 C programs, 900 lines of code
    •  for comparison ...
        •  NsJail: 4,000 LOC
        •  Singularity: 11,000 LOC
        •  Shifter: 19,000 LOC
        •  Docker: 133,000 LOC


Machine Learning Software Stack


Charliecloud status

  1. Available now on LANL clusters

  2. Available now on any Linux box
    •  once you install it
    •  newer kernel needed, or setuid tolerance
    •  including cloud instances

  3. Soon: pre-installed VirtualBox image
    •  no root needed
    •  Mac, Windows, Linux

  4. Passes tests on Crays


The end

  1. ;login: article (USENIX magazine)
    •  “Linux containers for fun and profit in HPC”
    •  published September; ask me for the draft

  2. Supercomputing 2017 pre-print
    •  “Charliecloud: Unprivileged containers for user-defined software stacks in HPC”
    •  http://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-16-22370

  3. Documentation
    •  https://hpc.github.io/charliecloud
    •  includes detailed tutorials

  4. Source code
    •  https://github.com/hpc/charliecloud

  Reid Priedhorsky, HPC-ENV
  Tim Randles, HPC-DES


Additional Slides

World’s most useless root shell

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
       uid_t euid = geteuid();
       int fd;

       printf("outside userns, uid=%d\n", euid);
       /* create user namespace */
       if (unshare(CLONE_NEWUSER)) {
          perror("unshare");
          return 1;
       }
       /* 1-to-1 UID map; host side must be your EUID */
       fd = open("/proc/self/uid_map", O_WRONLY);
       dprintf(fd, "0 %d 1\n", euid);
       close(fd);
       /* really acting as yourself; all access checks use host EUID */
       printf("in userns, uid=%d\n", geteuid());
       execlp("/bin/bash", "bash", (char *)NULL);
       perror("execlp");
       return 1;
    }
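The same experiment can be tried without compiling anything, using unshare(1) from util-linux. This is a sketch under the assumption that `unshare` is installed and that the kernel permits unprivileged user namespaces; `userns_uid` is a hypothetical helper name, and it prints a fallback message instead of failing when either assumption does not hold.

```shell
#!/bin/sh
# Sketch: the same experiment with unshare(1) from util-linux instead of C.
# userns_uid is a hypothetical helper name.
userns_uid() {
    # --user creates a new user namespace; --map-root-user writes a UID map
    # that maps our own EUID to 0 inside it.
    unshare --user --map-root-user id -u 2>/dev/null \
        || echo "user namespaces unavailable"
}

echo "outside userns, uid=$(id -u)"
echo "in userns, uid=$(userns_uid)"
```

As with the C program, UID 0 inside the namespace is only a label: access checks against files on the host still use the invoking user's real identity.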


MPI test program: Report userns & rank; send & recv

    #define _GNU_SOURCE
    #include <limits.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #include <mpi.h>

    #define TAG 0

    int main(int argc, char ** argv)
    {
       int msg, rank, rank_ct;
       struct stat st;
       MPI_Status mstat;
       char hostname[HOST_NAME_MAX+1];

       /* inode of the user namespace identifies which namespace we are in */
       stat("/proc/self/ns/user", &st);

       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &rank_ct);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       gethostname(hostname, HOST_NAME_MAX+1);
       printf("%d: init ok %s, %d ranks, userns %lu\n",
              rank, hostname, rank_ct, st.st_ino);
       fflush(stdout);

       if (rank == 0) {
          /* rank 0 sends to, then receives from, every other rank */
          for (int i = 1; i < rank_ct; i++) {
             msg = i;
             MPI_Send(&msg, 1, MPI_INT, i, TAG, MPI_COMM_WORLD);
             MPI_Recv(&msg, 1, MPI_INT, i, TAG, MPI_COMM_WORLD, &mstat);
          }
       } else {
          MPI_Recv(&msg, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &mstat);
          msg = -rank;
          MPI_Send(&msg, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD);
       }

       if (rank == 0)
          printf("%d: send/receive ok\n", rank);

       MPI_Finalize();
       if (rank == 0)
          printf("%d: finalize ok\n", rank);
       return 0;
    }

Based on: http://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program
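The "userns" number the program prints is simply the inode of /proc/self/ns/user. A quick way to see the same value from a shell is sketched below, assuming a Linux host and a `stat` that accepts the GNU coreutils `-c` format option; `userns_inode` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the inode number that identifies the current user namespace,
# the same value the C program reports as "userns".
# userns_inode is a hypothetical helper name; stat -c is GNU coreutils syntax.
userns_inode() {
    # -L follows the /proc symlink, as stat(2) does in the C program.
    stat -L -c %i /proc/self/ns/user
}

userns_inode
```

Running the same `stat` command through `ch-run` on an unpacked image should print a different number, since the runtime places the process in a new user namespace.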


Dockerfile

    FROM debian:jessie

    # OS packages needed to build OpenMPI.
    RUN apt-get update && apt-get install -y g++ gcc make wget \
        && rm -rf /var/lib/apt/lists/*

    # Compile OpenMPI.
    ENV VERSION 1.10.5
    RUN wget -nv https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-${VERSION}.tar.gz
    RUN tar xf openmpi-${VERSION}.tar.gz
    RUN    cd openmpi-${VERSION} \
        && CFLAGS=-O3 CXXFLAGS=-O3 \
           ./configure --prefix=/usr --sysconfdir=/mnt/0 \
                       --disable-pty-support --disable-mpi-cxx --disable-mpi-fortran \
        && make -j$(getconf _NPROCESSORS_ONLN) install
    RUN rm -Rf openmpi-${VERSION}*

    # This example
    COPY /examples/mpi/mpihello /hello
    WORKDIR /hello
    RUN make clean && make


Step 1: Build Docker image

  sndbx$ cd ~/charliecloud/examples/mpi/mpihello
  sndbx$ ls
  Dockerfile hello.c Makefile README slurm.sh test.bats
  sndbx$ ch-build -t mpihello ~/charliecloud
  [sudo] password for reidpr:
  Sending build context to Docker daemon 15.19 MB
  Step 1/4 : FROM debian8openmpi
  ---> 8a2a63f43554
  [...]
  Step 4/4 : RUN make clean && make
  ---> Using cache
  ---> d028357f85e5
  Successfully built d028357f85e5

  privileged


Step 2: Dump Docker image to tarball

  sndbx$ ch-docker2tar mpihello /var/tmp
  146M /var/tmp/mpihello.tar.gz
  sndbx$ tar tvf /var/tmp/mpihello.tar.gz | head
  -rwxr-xr-x 0/0       0 2017-08-16 16:29 .dockerenv
  drwxr-xr-x 0/0       0 2017-08-03 16:30 bin/
  -rwxr-xr-x 0/0 1029624 2016-11-05 15:22 bin/bash
  -rwxr-xr-x 0/0   51912 2015-03-14 09:47 bin/cat
  -rwxr-xr-x 0/0   14560 2014-09-08 01:01 bin/chacl
  -rwxr-xr-x 0/0   60072 2015-03-14 09:47 bin/chgrp
  -rwxr-xr-x 0/0   55944 2015-03-14 09:47 bin/chmod
  -rwxr-xr-x 0/0   64168 2015-03-14 09:47 bin/chown
  -rwxr-xr-x 0/0  150824 2015-03-14 09:47 bin/cp
  -rwxr-xr-x 0/0  125400 2014-11-08 06:49 bin/dash

  privileged


Step 3: Tarball to (in this case) Woodchuck

  sndbx$ scp /var/tmp/mpihello.tar.gz tfta:/users/reidpr
  mpihello.tar.gz 100% 145MB 76.6MB/s 00:01
  sndbx$ ssh wtrw
  wtrw $ ssh wc-fe
  wc-fe$ ls -lh mpihello.tar.gz
  -rw-r----- 1 reidpr reidpr 146M Aug 16 16:41 mpihello.tar.gz

  unprivileged


Step 4: Start job & unpack image

  wc-fe$ salloc -N32 -p charliecloud
  wc003$ module load openmpi/1.10.5
  wc003$ module load sandbox
  wc003$ module load charliecloud
  wc003$ srun ch-tar2dir ~/mpihello.tar.gz /var/tmp/mpihello
  creating new image /var/tmp/mpihello [32 times]
  /var/tmp/mpihello unpacked ok [32 times]
  wc003$ ls /var/tmp/mpihello
  WEIRD_AL_YANKOVIC  dev    home   media    opt   run   sys  var
  bin                etc    lib    mnt      proc  sbin  tmp
  boot               hello  lib64  oldroot  root  srv   usr
  wc003$ du -sh /var/tmp/mpihello
  362M /var/tmp/mpihello
  wc003$ ls -R /var/tmp/mpihello | wc -l
  18009

  unprivileged


Step 5: Configure your stuff

  Nothing needed — we’ll use host’s SLURM/MPI configuration.

  Some things (like Spark) need SLURM stuff translated to config files.
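As an illustration of translating Slurm information into a config file, the sketch below writes a one-host-per-line file from the job's node list. It is only a sketch: `write_host_file` and the output path are hypothetical, and which file a given framework actually expects, and where it belongs inside the unpacked image, is not covered by these slides. The only external command assumed is Slurm's `scontrol show hostnames`, which expands a compact node list into individual names.

```shell
#!/bin/sh
# Sketch: turn a list of node names into a one-host-per-line config file.
# write_host_file and the output path are hypothetical.
write_host_file() {
    # $1: output file; node names arrive on stdin, one per line.
    sort -u > "$1"
}

# Inside a Slurm allocation, scontrol expands the compact node list into
# individual names; outside an allocation this step is skipped.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    scontrol show hostnames "$SLURM_JOB_NODELIST" \
        | write_host_file "${TMPDIR:-/tmp}/hosts.txt"
fi
```

The helper can also be fed by hand for testing, for example `printf 'wc003\nwc004\n' | write_host_file hosts.txt`.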

  unprivileged

Step 6: Run your code!!!

  wc003$ mpirun ch-run /var/tmp/mpihello -- /hello/hello
  [...]
  0: init ok wc003.localdomain, 512 ranks, userns 4026561620
  [...]
  511: init ok wc034.localdomain, 512 ranks, userns 4026561650
  [...]
  0: send/receive ok
  0: finalize ok

  unprivileged