Big Data in Containers: Hadoop & Spark in Docker and Mesos


1

Big Data in Containers
Heiko Loewe, @loeweh
Meetup Big Data Hadoop & Spark NRW, 08/24/2016

2

Why
• Fast deployment
• Test/dev clusters
• Better hardware utilization
• Learn to manage Hadoop
• Test new versions
• An appliance for continuous integration / API testing

3

Design

Master Container
- Name Node
- Secondary Name Node
- Yarn

Slave Containers (×4)
- Node Manager
- Data Node

4

More than one Host needs an Overlay Network

- The docker0 interface is not routed beyond its host.
- A single-host configuration is (almost) no problem.
- For two or more hosts we need an overlay network.
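For the Weave Net overlay used later in this deck, joining two hosts takes one command per host (host name is illustrative):

host1$ sudo weave launch
host2$ sudo weave launch host1.example.com
host2$ sudo weave status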

5

Choice of the Overlay Network Implementation

Docker Multi-Host Networking
• Backend: VXLAN.
• Fallback: none.
• Control plane: built-in; uses Zookeeper, Consul or Etcd for shared state.

Weave Net
• Backend: VXLAN via OVS.
• Fallback: custom UDP-based tunneling called "sleeve".
• Control plane: built-in.

CoreOS Flannel
• Backend: VXLAN, AWS, GCE.
• Fallback: custom UDP-based tunneling.
• Control plane: built-in; uses Etcd for shared state.
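For comparison with the Weave-based setup used in the rest of this deck, a minimal sketch of the Docker multi-host variant, assuming an Etcd endpoint at etcd-host:2379 (daemon flags per the Docker 1.9 to 1.12 era syntax):

# every Docker daemon joins a shared KV store for the control plane
dockerd --cluster-store=etcd://etcd-host:2379 --cluster-advertise=eth0:2376
# create the overlay once; it becomes visible on all hosts
docker network create --driver overlay multihost
# attach containers to it at run time
docker run -itd --net=multihost --name c1 ubuntu:14.04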

6

WEAVE NET

The normal mode of operation is called FDP (fast data path), which works via OVS's datapath kernel module (mainline since Linux 3.12). It is just another VXLAN implementation.

It has a sleeve fallback mode, which works in user space via pcap.

Sleeve supports full encryption.

Weaveworks also has Weave DNS, Weave Scope and Weave Flux, providing introspection, service discovery and routing capabilities on top of Weave Net.
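Which mode a given link ended up in can be verified after launch; a sketch (output abbreviated):

$ sudo weave status connections
-> 192.168.1.12:6783   established fastdp
# "sleeve" in place of "fastdp" would indicate the pcap-based fallback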

7

Docker Adaptation (Fedora/CentOS/RHEL)

# /etc/sudoers, at the end:
vuser ALL=(ALL) NOPASSWD: ALL
# secure_path: append /usr/local/bin for weave
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin

sudo groupadd docker
sudo gpasswd -a ${USER} docker
sudo chgrp docker /var/run/docker.sock
alias docker="sudo /usr/bin/docker"

8

Weave Problems on Fedora/CentOS/RHEL

WARNING: existing iptables rule
'-A FORWARD -j REJECT --reject-with icmp-host-prohibited'
will block name resolution via weaveDNS - please reconfigure your firewall.

Either disable the firewall entirely:
sudo systemctl stop firewalld
sudo systemctl disable firewalld

or delete the offending rules:
/sbin/iptables -D FORWARD -j REJECT --reject-with icmp-host-prohibited
/sbin/iptables -D INPUT -j REJECT --reject-with icmp-host-prohibited
iptables-save
reboot

9

Weave Run

Interfaces before launch:
[vuser@linux ~]$ ifconfig | grep -v "^ "
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536

[vuser@linux ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[vuser@linux ~]$ sudo weave launch
[vuser@linux ~]$ eval $(sudo weave env)
[vuser@linux ~]$ sudo weave --local expose
10.32.0.6

WEAVE containers:
[vuser@linux ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0fd6ab928d96 weaveworks/plugin:1.6.1 "/home/weave/plugin" 11 seconds ago Up 8 seconds weaveplugin
4b24e5802fcc weaveworks/weaveexec:1.6.1 "/home/weave/weavepro" 13 seconds ago Up 10 seconds weaveproxy
c4882326398a weaveworks/weave:1.6.1 "/home/weave/weaver -" 18 seconds ago Up 15 seconds weave

WEAVE interfaces:
[vuser@linux ~]$ ifconfig | grep -v "^ "
datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
vethwe-bridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
vethwe-datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
vxlan-6784: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65485
weave: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410

10

Hadoop Container Dockerfile
https://github.com/kiwenlau/hadoop-cluster-docker/blob/master/Dockerfile

FROM ubuntu:14.04
# install openssh-server, openjdk and wget
# install hadoop 2.7.2
# set environment variables
# ssh without key
# set up Hadoop directories
# copy config files from local
# make Hadoop start scripts executable
# format namenode
# standard run command
CMD [ "sh", "-c", "service ssh start; bash"]

$ docker build -t loewe/hadoop:latest .
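A sketch of how those steps could be filled in, loosely following the kiwenlau Dockerfile linked above; exact versions, paths and config file names are assumptions:

FROM ubuntu:14.04
# install openssh-server, openjdk and wget
RUN apt-get update && apt-get install -y openssh-server openjdk-7-jdk wget
# install hadoop 2.7.2
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz && \
    tar -xzf hadoop-2.7.2.tar.gz -C /usr/local && \
    ln -s /usr/local/hadoop-2.7.2 /usr/local/hadoop && \
    rm hadoop-2.7.2.tar.gz
# set environment variables
ENV JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 \
    HADOOP_HOME=/usr/local/hadoop \
    PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
# ssh without key: passphrase-less key pair for intra-cluster logins
RUN mkdir -p ~/.ssh && ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# set up Hadoop directories
RUN mkdir -p ~/hdfs/namenode ~/hdfs/datanode $HADOOP_HOME/logs
# copy config files from local (core-site.xml, hdfs-site.xml, ... assumed to sit in ./config)
COPY config/ $HADOOP_HOME/etc/hadoop/
# make Hadoop start scripts executable
RUN chmod +x $HADOOP_HOME/sbin/*.sh
# format namenode
RUN $HADOOP_HOME/bin/hdfs namenode -format
# standard run command
CMD [ "sh", "-c", "service ssh start; bash"]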

11

Start Hadoop Containers

Host 1
• Master
$ sudo weave run -itd -p 8088:8088 -p 50070:50070 --name hadoop-master loewe/hadoop
• Slaves 1, 2
$ sudo weave run -itd --name hadoop-slave1 loewe/hadoop
$ sudo weave run -itd --name hadoop-slave2 loewe/hadoop

Host 2
• Slaves 3, 4
$ sudo weave run -itd --name hadoop-slave3 loewe/hadoop
$ sudo weave run -itd --name hadoop-slave4 loewe/hadoop

root@boot2docker:~# weave status dns
hadoop-master 10.32.0.1 6a4db5f52340 92:64:f5:c5:57:a7
hadoop-slave1 10.32.0.2 34e0a7de1105 92:64:f5:c5:57:a7
hadoop-slave2 10.32.0.3 d879f077cf4e 92:64:f5:c5:57:a7
hadoop-slave3 10.44.0.0 6ca7ddb9daf8 92:56:f4:98:36:b0
hadoop-slave4 10.44.0.1 c1ed48630b1c 92:56:f4:98:36:b0
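Once the containers are running, HDFS and YARN still have to be started inside the master; a sketch assuming the stock Hadoop layout under /usr/local/hadoop:

$ docker exec -it hadoop-master bash
root@hadoop-master:~# /usr/local/hadoop/sbin/start-dfs.sh
root@hadoop-master:~# /usr/local/hadoop/sbin/start-yarn.sh

The web UIs are then reachable on the ports published above: 50070 (NameNode) and 8088 (YARN ResourceManager).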

12

Hadoop Cluster / 2 Hosts / 5 Nodes

13


Persistent Volumes for HDFS

14

The Problem

• Containers (like Docker) are the foundation of agile software development
• The initial container design was stateless (12-factor apps)
• Use cases have grown in the last few months (NoSQL, stateful apps)
• Persistence for containers is not easy

15

Docker Volume Manager API

• Enables persistence of Docker volumes
• Enables the implementation of
  - Fast bytes (performance)
  - Data services (protection / snapshots)
  - Data mobility
  - Availability
• Operations:
  - Create, Remove, Mount, Path, Unmount
  - Additional options can be passed to the volume driver
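As a usage sketch, with REX-Ray as one example of a driver implementing this API (the driver name and the size option are driver-specific assumptions):

$ docker volume create --driver rexray --name hdfs-data1 -o size=100
$ docker run -itd -v hdfs-data1:/mnt/ContainerData loewe/hadoop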

16

Persistent Volumes for Containers

[Diagram: containers on a Docker host (container OS) mount storage exposed at /mnt/PersistentData into the container via
-v /mnt/PersistentData:/mnt/ContainerData
Open question on the slide: automation??]


18

Persistent Volumes for Containers

Storage platforms: AWS EC2 (EBS), OpenStack (Cinder), EMC Isilon, EMC ScaleIO, EMC VMAX, EMC XtremIO, Google Compute Engine (GCE), VirtualBox

Operating systems: Ubuntu, Debian, RedHat, CentOS, CoreOS, OSX, TinyLinux (boot2docker)

Integration points: Docker Volume API, Mesos Isolator, ...

19

Hadoop + Persistent Volumes

[Diagram: Hadoop containers on Host A with their HDFS data on persistent volumes, making the Hadoop containers themselves ephemeral.]

20

Stretch Hadoop with Persistent Volumes

[Diagram: Hosts A and B connected by the overlay network, each running Hadoop containers backed by persistent volumes.]

Easily stretch and shrink a cluster without losing the data.
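A slave started with its DataNode directory on such a persistent volume might look like this (volume name and container path are assumptions):

$ sudo weave run -itd --name hadoop-slave1 -v hdfs-data1:/root/hdfs/datanode loewe/hadoop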

21

Other Similar Projects

• Apache Bigtop Provisioner / Apache Foundation
  https://github.com/apache/bigtop/tree/master/provisioner/docker
• Building Hortonworks HDP on Docker
  http://henning.kropponline.de/2015/07/19/building-hdp-on-docker/
  https://hub.docker.com/r/hortonworks/ambari-server/
  https://hub.docker.com/r/hortonworks/ambari-agent/
• Building Cloudera CDH on Docker
  http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera/
  https://hub.docker.com/r/cloudera/quickstart/
  Watch out for overlay network topics

22

Apache Myriad

23

Myriad Overview
• Mesos framework for Apache YARN
• Mesos manages the data center, YARN manages Hadoop
• Coarse- and fine-grained resource sharing
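Fine-grained sharing is driven through Myriad's REST API, which flexes the number of NodeManagers up and down; a sketch of a flex-up call (port, endpoint path and payload fields are assumptions based on the Myriad docs):

$ curl -X PUT http://myriad-host:8192/api/cluster/flexup \
       -H 'Content-Type: application/json' \
       -d '{ "instances": 2, "profile": "medium" }'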

24

Situation without Integration

25

Yarn/Mesos Integration

26

How it works (simplified)

Myriad = Control Plane

27

Myriad Container

28

29

30

31

What about the Data? Myriad only cares about the compute.

[Diagram: the cluster from the design slide, i.e. a Master Container (Name Node, Secondary Name Node, Yarn) and four Slave Containers (Node Manager, Data Node). Myriad/Mesos cares about the Node Managers; the Name Node and the Data Nodes have to be provided outside of Myriad/Mesos.]

32

What about the Data

• Myriad only cares about compute / MapReduce
• HDFS has to be provided in other ways

Big Data New Realities

Traditional assumptions: Bare metal; Data locality; Data on local disks
New realities: Containers and VMs; Compute and storage separation; In-place access on remote data stores
New benefits and value: Big-Data-as-a-Service; Agility and cost savings; Faster time-to-insights

33

Options for the HDFS Data Layer

• Pure HDFS cluster (only Data Nodes running)
  - Bare metal
  - Containerized
  - Mesos-based
• Enterprise HDFS array
  - EMC Isilon

34

Myriad, Mesos, EMC Isilon for HDFS

35

EMC Isilon Advantages over Classic Hadoop HDFS

• Multi-tenancy
• Multiple HDFS environments sharing the same storage
• Quotas possible on HDFS environments
• Snapshots of HDFS environments possible
• Remote replication
• WORM option for HDFS
• Highly available HDFS infrastructure (distributed Name Nodes and Data Nodes)
• Storage efficient (usable/raw ratio of about 0.8 versus about 0.33 with Hadoop's default 3x replication)
• Shared access via HDFS / CIFS / NFS / SFTP possible
• Maintenance equals enterprise array standards
• All major distributions supported
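Using Isilon as the HDFS layer is, from the Hadoop side, a configuration change: fs.defaultFS points at the Isilon SmartConnect zone instead of a NameNode (the host name below is illustrative):

<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://isilon-smartconnect.example.com:8020</value>
</property>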

36


Spark on Mesos

37

Most Common Spark Deployment Environments (Cluster Managers)

• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%

Source: Spark Survey Report, 2015 (Databricks)

Common Deployment Patterns

38

Spark Cluster – Standalone Mode

[Diagram: a Spark Client talks to the Spark Master; Spark Slaves, on bare metal or in virtual machines, each run a set of tasks. Data is provided outside the cluster.]
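A job submission against a standalone master, as a sketch (host name and the examples jar path are illustrative and vary by Spark version):

$ spark-submit --master spark://spark-master.example.com:7077 \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.0.jar 100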

39

Spark Cluster – Hadoop YARN

[Diagram: the Spark Client / Spark Master talks to the YARN Resource Manager; Node Managers host Spark Executors that run the tasks. Data is provided by the Hadoop cluster.]
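Submitting to YARN only changes the master URL; the cluster location comes from the Hadoop client configuration (HADOOP_CONF_DIR must point at it):

$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ spark-submit --master yarn --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.0.jar 100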

40

Spark Cluster – Mesos

[Diagram: the Spark Client and Spark Scheduler talk to the Mesos Master; Mesos Slaves host Spark Executors that run the tasks. Data is provided outside the cluster.]
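For Mesos the master URL points at the Mesos master, and the executors additionally need a Spark distribution, e.g. via spark.executor.uri (both URLs are illustrative):

$ spark-submit --master mesos://mesos-master.example.com:5050 \
      --conf spark.executor.uri=http://repo.example.com/spark-2.0.0-bin-hadoop2.7.tgz \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.0.jar 100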

41

Spark + Mesos + EMC Isilon to solve the HDFS data layer

42

Thank You
Follow me on Twitter: @loeweh