ContainerCon 2016 - Jimenez, Arya
© 2016 Mesosphere, Inc. All Rights Reserved.

Page 1: Process Migration in the Orchestration World

Isabel Jimenez (@ijimene)
Distributed Systems Engineer
DC/OS Security Team & Apache Mesos Contributor

Kapil Arya (@karya0)
Distributed Systems Engineer
Apache Mesos Committer & DMTCP Developer

Page 2: Overview

➢ Motivation
➢ Process Migration
➢ Apache Mesos
➢ Process/Container Migration for Mesos
➢ Demo

Page 3: Motivation

Page 4: Stateless vs. Stateful Applications

● Stateless applications:
  ○ No local state
  ○ Start from a (relatively) vanilla state
  ○ Perform transaction(s)
  ○ Kill when no longer needed

● Stateful applications:
  ○ Some local state
  ○ Start from vanilla state and compute “work” state
  ○ Non-graceful shutdown results in loss of compute time

Page 5: Scheduling Stateless vs. Stateful Applications

● Stateless applications:
  ○ Scale up: “on-demand” deployment by launching clones as needed
  ○ Scale down: kill unused instances without loss of computation time
  ○ Making room for high-priority tasks without significant penalty

● Stateful applications:
  ○ Scale up: longer initialization times for new instances
  ○ Scale down: wait for instances to reach a “safe” state to preserve compute cycles
  ○ Making room for high-priority tasks results in a significant compute-time penalty

The same applies to moving applications from one node/cluster to another!

Page 6: Scheduling Stateless vs. Stateful Applications

Modern container orchestration tools are optimized for stateless applications!

Page 7: How to Better Schedule Stateful Applications?

Make them stateless!

● How?
  ○ Rewrite ‘em!

● Alternatively
  ○ Use process/container checkpointing and migration!

Page 8: Process Migration

Page 9: Terminology

● Process Migration
  ○ Move a running process from one node to another

● Container Migration
  ○ Move a running container from one node to another

● Virtual machine migration (e.g., vMotion)
  ○ Move a running virtual machine from one node to another

Page 10: How to Migrate a Process/Container/VM?

1. Pause the running process/container/VM
2. Take a snapshot of the current state (a.k.a. checkpointing)
3. Move the snapshot to the target node
4. Restart from the snapshot on the target node

Do this transparently to the outside world!

● Ensure minimal downtime
  ○ Reduce time required for stages (2) and (3)
  ○ Ideally on the order of milliseconds!
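The four stages can be sketched with a toy "process" whose entire state is a small dictionary. This is only an illustration under loud assumptions: the nodes are plain Python dicts, the "transfer" is an in-memory byte copy, and the names `step` and `migrate` are hypothetical, not part of any tool mentioned in the talk.

```python
import pickle


def step(proc):
    """Advance the toy process by one unit of work."""
    if proc["paused"]:
        raise RuntimeError("process is paused")
    proc["value"] += 1


def migrate(name, source, target):
    """Move process `name` from node `source` to node `target` (both plain dicts)."""
    proc = source[name]
    proc["paused"] = True                # 1. pause
    snapshot = pickle.dumps(proc)        # 2. checkpoint (serialize the state)
    received = bytes(snapshot)           # 3. transfer (stand-in for a network copy)
    del source[name]                     #    the source node drops the process
    restored = pickle.loads(received)
    restored["paused"] = False           # 4. restart on the target node
    target[name] = restored
    return restored


node_a = {"job": {"value": 0, "paused": False}}
node_b = {}
for _ in range(3):
    step(node_a["job"])

job = migrate("job", node_a, node_b)
step(job)                                # work continues where it left off
print(job["value"], "job" in node_a, "job" in node_b)
```

In this sketch the downtime is the time between setting `paused` and clearing it, which is why the slide singles out stages (2) and (3) as the ones to shorten.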

Page 11: What is Checkpointing?

Checkpoint-restart is the ability to save a set of running processes to a checkpoint image on disk, and to later restart them from disk.

● A quick demo!

Page 12: Checkpointing Use Cases

● Fault tolerance
● Scheduling and process migration
● Debugging (an executable bug report)
● Faster startup times (checkpoint after initialization)
● Save/restore workspace (for interactive sessions)
● Speculative execution (what-if scenarios)
● Managing long tails (single thread continues to run after other threads have exited)

Page 13: Stateful Applications with Checkpointing

Stateful Application + Checkpointing ≈ Stateless Application

● Scale up: start from pre-initialized snapshot
● Scale down: checkpoint and kill
● Migrate: checkpoint, kill, and restart
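As a rough illustration of "scale up: start from pre-initialized snapshot", the sketch below pays the initialization cost once, serializes the warm state, and starts every further instance from that snapshot. The `expensive_init` workload and the `scale_up` helper are hypothetical stand-ins, and an in-memory pickle takes the place of a real checkpoint image.

```python
import pickle

stats = {"init_calls": 0}


def expensive_init():
    """Stand-in for slow start-up work (here: building a lookup table)."""
    stats["init_calls"] += 1
    return {"table": [n * n for n in range(1000)], "requests_served": 0}


# Initialize once, then checkpoint the warm state.
snapshot = pickle.dumps(expensive_init())


def scale_up(count):
    """Start `count` instances from the snapshot instead of re-initializing."""
    return [pickle.loads(snapshot) for _ in range(count)]


instances = scale_up(3)
instances[0]["requests_served"] += 1     # instances do not share state
print(stats["init_calls"], len(instances), instances[1]["requests_served"])
```

Each instance gets its own copy of the state, so scaling down is just discarding an instance (after checkpointing it again if its work must be kept).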

Page 14: How to Checkpoint/Restart a Process?

Checkpoint-restart involves saving and restoring:

● all of user-space memory
● state of all threads
● kernel state
● network state
● …

All this while ensuring that the state doesn’t change while the checkpoint is being taken!

● Quiesce the process(es) before saving the state!
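Why quiescing matters can be shown with a small threaded sketch. It assumes a toy state with an invariant (`debits == credits`) and uses a single lock as the quiesce mechanism; real checkpointers stop threads by other means (signals, ptrace, the freezer cgroup), so treat the lock purely as an illustration.

```python
import threading

state = {"debits": 0, "credits": 0}      # invariant: debits == credits
state_lock = threading.Lock()
snapshots = []


def worker(iterations):
    for _ in range(iterations):
        with state_lock:                 # an update is never half-applied
            state["debits"] += 1
            state["credits"] += 1


def checkpoint():
    """Quiesce the workers (hold the lock), then copy the state."""
    with state_lock:
        snapshots.append(dict(state))


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for _ in range(10):
    checkpoint()                         # checkpoints taken while workers run
for t in threads:
    t.join()
checkpoint()                             # final checkpoint after all work is done

print(all(s["debits"] == s["credits"] for s in snapshots), snapshots[-1])
```

Without the lock in `checkpoint`, a snapshot could capture the state between the two increments and violate the invariant on restart.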

Page 15: Different Types of Checkpointing

● Application-level
  ○ Embed checkpointing code inside the application itself
  ○ Optimal
  ○ Burden on the application developer

● Virtual machine level
  ○ Complete state
  ○ Higher cost

● System-level
  ○ No modification to application source/binary
  ○ Can be done at the kernel level or in user space
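A minimal sketch of the application-level approach, where the checkpointing code lives inside the application itself. The workload (summing integers), the JSON file format, and the `stop_after` crash simulation are all assumptions made for the example.

```python
import json
import os
import tempfile


def run(total_steps, ckpt_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress to ckpt_path after every step.

    `stop_after` simulates a crash after that many steps in this run.
    """
    if os.path.exists(ckpt_path):                 # restart: resume from checkpoint
        with open(ckpt_path) as f:
            state = json.load(f)
    else:                                         # fresh start
        state = {"next_step": 0, "total": 0}

    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run == stop_after:
            return None                           # simulated non-graceful shutdown
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:            # write a temp file, then rename,
            json.dump(state, f)                   # so readers see the old or the
        os.replace(tmp_path, ckpt_path)           # new checkpoint, never a partial one
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "app.ckpt")
    first = run(10, path, stop_after=4)           # "crashes" after 4 steps
    second = run(10, path)                        # resumes at step 4
    print(first, second)
```

The application decides exactly what to save, which keeps the checkpoint small, but every application has to carry code like this; that is the developer burden the slide refers to.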

Page 16: Modern Checkpointing Systems

● CRIU (Checkpoint/Restore In Userspace)
  ○ Single-node checkpointing
  ○ Recent kernels (3.9+)
  ○ Container-level
  ○ http://criu.org/

● DMTCP (Distributed MultiThreaded CheckPointing)
  ○ User-space libraries with LD_PRELOAD
  ○ Distributed processes across multiple nodes
  ○ http://dmtcp.sourceforge.net
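For orientation, the sketch below only builds the command lines typically used to drive the two tools; it does not run them. The PID, paths, and program name are hypothetical, and the flags shown (`criu dump --tree/--images-dir/--shell-job`, `dmtcp_launch`, `dmtcp_command --checkpoint`, `dmtcp_restart`) should be checked against the `--help` output of the installed versions.

```python
import shlex


def criu_dump(pid, images_dir):
    """Checkpoint the process tree rooted at `pid` into `images_dir`."""
    return ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir, "--shell-job"]


def criu_restore(images_dir):
    """Restore the process tree saved in `images_dir`."""
    return ["criu", "restore", "--images-dir", images_dir, "--shell-job"]


def dmtcp_launch(program_argv):
    """Start a program under DMTCP control."""
    return ["dmtcp_launch"] + list(program_argv)


def dmtcp_checkpoint():
    """Ask the DMTCP coordinator to checkpoint all attached processes."""
    return ["dmtcp_command", "--checkpoint"]


def dmtcp_restart(image_paths):
    """Restart from one or more DMTCP checkpoint images."""
    return ["dmtcp_restart"] + list(image_paths)


def show(argv):
    """Render an argv list as a shell-safe string."""
    return " ".join(shlex.quote(arg) for arg in argv)


# Hypothetical PID, directory, and program, for illustration only.
for argv in (
    criu_dump(1234, "/tmp/ckpt"),
    criu_restore("/tmp/ckpt"),
    dmtcp_launch(["./my_app", "--port", "8080"]),
    dmtcp_checkpoint(),
    dmtcp_restart(["ckpt_my_app.dmtcp"]),
):
    print(show(argv))
```

To actually execute one of these, the list could be passed to `subprocess.run(argv, check=True)`; CRIU generally needs elevated privileges.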

Page 17: Apache Mesos: The datacenter kernel

Page 18: Why?

Why can’t we run applications on our datacenters just like we run applications on our mobile phones?

We’re all building distributed systems.

Page 19: The datacenter abstraction

Page 20: The datacenter computer needs an operating system

Operating system

“a collection of software that manages the computer hardware resources and provides common services for computer programs”
- Wikipedia

Page 21: Resource offers

Offer-based model

Mesos can’t run applications on its own.

A Mesos framework is a distributed system that has a scheduler.

Schedulers like Marathon keep your application running. A bit like a distributed “init.d”.

Page 22: High utilization

[Figure: Apache Mesos; axis label: time]

Page 23: Mesos mechanics

[Diagram: master, agent, scheduler]

RESOURCES (cpu, mem, disk, etc.)

Page 24: Mesos mechanics

[Diagram: master, agent, scheduler]

OFFER (cpu, mem, disk, etc.)

Page 25: Mesos mechanics

[Diagram: master, agent, scheduler]

Application definition shown next to the scheduler:

{
  "container": {
    "docker": { "image": "busybox" },
    "type": "DOCKER"
  },
  "cpus": 0.1,
  "id": "demo",
  "instances": 1,
  "mem": 128
}

Page 26: Mesos mechanics

[Diagram: master, agent, scheduler]

ACCEPT OFFER (cpu, mem, disk, etc.)

Page 27: Mesos mechanics

[Diagram: master, agent, scheduler]

LAUNCH TASK

Page 28: Mesos mechanics

[Diagram: master, agent, scheduler]

UPDATE STATE (STAGING, RUNNING, etc.)

Page 29: Mesos mechanics

[Diagram: master, agent, scheduler]

UPDATE STATE (STAGING, FAILED, etc.)
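The sequence on these pages (resources, offer, accept, launch, state updates) can be approximated with a small single-process simulation. It is a deliberately simplified sketch: there is no networking, offers are matched first-fit, and the function names and data layout are assumptions for illustration rather than the Mesos API. The state names `TASK_STAGING` and `TASK_RUNNING` follow Mesos naming.

```python
def fits(offer, task):
    """True if the offer has enough of every resource the task asks for."""
    return all(offer.get(name, 0) >= amount for name, amount in task["resources"].items())


def offer_cycle(agent_resources, pending_tasks):
    """One pass of the offer loop; returns the status updates the scheduler sees."""
    updates = []
    # RESOURCES: agents report what they have; the master turns that into offers.
    offers = [{"agent": agent, **res} for agent, res in agent_resources.items()]
    for task in pending_tasks:
        # OFFER / ACCEPT OFFER: the scheduler takes the first offer that fits.
        offer = next((o for o in offers if fits(o, task)), None)
        if offer is None:
            continue                      # nothing suitable; the task stays pending
        for name, amount in task["resources"].items():
            offer[name] -= amount         # the accepted share is no longer on offer
        # LAUNCH TASK, then UPDATE STATE flows back to the scheduler.
        for state in ("TASK_STAGING", "TASK_RUNNING"):
            updates.append((task["id"], offer["agent"], state))
    return updates


agents = {
    "agent-1": {"cpus": 0.5, "mem": 256},
    "agent-2": {"cpus": 4.0, "mem": 4096},
}
tasks = [
    {"id": "demo", "resources": {"cpus": 0.1, "mem": 128}},   # matches the slide's app
    {"id": "big", "resources": {"cpus": 2.0, "mem": 1024}},
    {"id": "huge", "resources": {"cpus": 64.0, "mem": 128}},  # fits nowhere
]
for update in offer_cycle(agents, tasks):
    print(update)
```

The point of the offer model is that placement decisions stay with the scheduler: the master only says what is available, and the scheduler chooses what to accept.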

Page 30: Mesos mechanics: Custom executor

[Diagram: master, agent, scheduler]

Application definition shown next to the scheduler:

{
  "container": {
    "docker": { "image": "busybox" },
    "type": "DOCKER"
  },
  "cpus": 0.1,
  "id": "demo",
  "executor": "demo-executor",
  "mem": 128
}

Page 31: Mesos mechanics

[Diagram: Agent, Executor, Task]

LAUNCH TASK

Page 32: Mesos mechanics

[Diagram: Agent, Executor, Task]

LAUNCH TASK

Page 33: Mesos mechanics

[Diagram: Agent, Executor, Task]

TASK STATE

Page 34: Mesos mechanics

[Diagram: Agent, Executor, Task]

UPDATE STATE

Page 35: Mesos mechanics

[Diagram: Agent, Executor, Task]

UPDATE STATE

Page 36: Mesos mechanics

[Diagram: Agent, Executor, Task]

ISOLATION

Page 37: Mesos mechanics are fair

[Diagram: one master, schedulers A to E, five agents]

Page 38: Mesos mechanics are HA

[Diagram: masters 1 to 3, ZooKeeper, schedulers A to E, five agents]

Page 39: APACHE MESOS: Putting it all together

[Diagram: ZooKeeper, masters m 1 to m 9, a large grid of agents (a), and several groups of schedulers A to E]

Page 40: Mesos Container Migration

Page 41: RUNC

● OCI specification
● Well integrated with CRIU
● Lightweight universal runtime container
● Compatible with Docker
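Because runC delegates checkpointing to CRIU, a container migration can be outlined as three commands: checkpoint on the source, copy the image directory, restore on the target. The sketch below only builds those command lines. The container ID, paths, and host name are hypothetical, the use of `rsync` for the copy is an assumption, and the `runc checkpoint` / `runc restore` flags should be verified against the installed runC version.

```python
def runc_checkpoint(container_id, image_path, leave_running=False):
    """Command line to checkpoint a runC container into `image_path`."""
    argv = ["runc", "checkpoint", "--image-path", image_path]
    if leave_running:
        argv.append("--leave-running")   # keep the container alive after the dump
    return argv + [container_id]


def runc_restore(container_id, image_path, bundle):
    """Command line to restore a container from `image_path` using an OCI bundle."""
    return ["runc", "restore", "--image-path", image_path, "--bundle", bundle, container_id]


def migration_plan(container_id, image_path, bundle, target_host):
    """Ordered (where, argv) steps. The OCI bundle must already exist on the target."""
    return [
        ("source", runc_checkpoint(container_id, image_path)),
        ("source", ["rsync", "-a", image_path + "/", f"{target_host}:{image_path}/"]),
        ("target", runc_restore(container_id, image_path, bundle)),
    ]


# Hypothetical identifiers, for illustration only.
for where, argv in migration_plan("demo", "/var/lib/ckpt/demo", "/containers/demo", "node-b"):
    print(f"[{where}] {' '.join(argv)}")
```

In the setup described in this talk, issuing such commands would be the job of the custom executor rather than of an operator at a shell.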

Page 42: Mesos mechanics: Custom executor

[Diagram: Mesos, agent, Volt Scheduler, Volt Executor]

Application definition shown next to the Volt Scheduler:

{
  "container": {
    "docker": { "image": "busybox" },
    "type": "DOCKER"
  },
  "cpus": 0.1,
  "id": "demo",
  "executor": "volt-executor",
  "mem": 128
}

Page 43: Mesos mechanics

[Diagram: Agent, VOLT Executor, one RunC instance]

LAUNCH TASK

Page 44: Mesos mechanics

[Diagram: Agent, VOLT Executor, two RunC instances]

LAUNCH TASK

Page 45: Mesos mechanics

[Diagram: Agent, VOLT Executor, three RunC instances]

LAUNCH TASK

Page 46: Demo!

Page 47: Future Work: Checkpointing as a Service

First-class integration with Mesos

○ Transparent to the scheduler and executor
○ New task states (CHECKPOINTED, RESTORING, etc.)
○ Support multiple checkpoint-service providers (DMTCP, CRIU, etc.)
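One way to picture "multiple checkpoint-service providers" is a small provider interface plus the extra task states. Everything in this sketch is hypothetical: the class names, the method signatures, and the in-memory provider are invented for illustration and do not describe an existing or planned Mesos API.

```python
from abc import ABC, abstractmethod
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    CHECKPOINTED = "CHECKPOINTED"        # proposed state from the slide
    RESTORING = "RESTORING"              # proposed state from the slide


class CheckpointProvider(ABC):
    """Hypothetical interface a DMTCP- or CRIU-backed provider could implement."""

    @abstractmethod
    def checkpoint(self, task_id):
        """Save the task and return a handle to the checkpoint image."""

    @abstractmethod
    def restore(self, image):
        """Restart a task from a checkpoint image and return its task ID."""


class InMemoryProvider(CheckpointProvider):
    """Toy provider for illustration; a real one would wrap DMTCP or CRIU."""

    def __init__(self):
        self.images = {}

    def checkpoint(self, task_id):
        image = f"image-of-{task_id}"
        self.images[image] = task_id
        return image

    def restore(self, image):
        return self.images[image]


def migrate(task_id, provider):
    """Walk a task through the state sequence a migration would produce."""
    states = [TaskState.RUNNING]
    image = provider.checkpoint(task_id)
    states.append(TaskState.CHECKPOINTED)
    states.append(TaskState.RESTORING)
    restored = provider.restore(image)
    states.append(TaskState.RUNNING)
    return restored, states


restored, states = migrate("demo", InMemoryProvider())
print(restored, [s.value for s in states])
```

Keeping the provider behind an interface is what would let the scheduler and executor stay unaware of which checkpointing tool is actually in use.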

Page 48: THANK YOU!