Apache Spark in the Cloud - Amazon S3 · 13 Apache Spark in the Cloud | Zbyněk Roubalík Apache...

Post on 20-May-2020

23 views 0 download

Transcript of Apache Spark in the Cloud - Amazon S3 · 13 Apache Spark in the Cloud | Zbyněk Roubalík Apache...

Apache Spark in the Cloud

Zbyněk RoubalíkSenior Quality Engineer, Red Hat

February 15 2018

Apache Spark in the Cloud | Zbyněk Roubalík2

Technologies

● Apache Spark

● Docker

● Kubernetes

● OpenShift

Apache Spark in the Cloud | Zbyněk Roubalík3

Apache Spark in the Cloud

aka

How to create and deploy Apache Spark

applications to cloud native environments like

OpenShift

Apache Spark in the Cloud | Zbyněk Roubalík4

What is cloud native?

● Containerized● Dynamically orchestrated● Microservice oriented

● www.cncf.io/about/faq

Apache Spark in the Cloud | Zbyněk Roubalík5

Containers

● A container image is a lightweight, stand-alone,

executable package of a piece of software that

includes everything needed to run it: code, runtime,

system tools, system libraries, settings.

● https://www.docker.com/what-container

Apache Spark in the Cloud | Zbyněk Roubalík6

VM vs Containers

Apache Spark in the Cloud | Zbyněk Roubalík7

Containers

● Cloud vs standard deployment model

● Pets vs Cattle

● Developers + Operations (Admins) → DevOps

● Docker

Apache Spark in the Cloud | Zbyněk Roubalík8

Kubernetes

● Container cluster manager

Apache Spark in the Cloud | Zbyněk Roubalík9

Kubernetes

● Based on etcd – distributed clustered key value store● Smallest deployable unit is Pod

Apache Spark in the Cloud | Zbyněk Roubalík10

OpenShift

● Open Source Container Application Platform● Focused on application (not just containers as a

concept) and developer experience

Apache Spark in the Cloud | Zbyněk Roubalík11

OpenShift

● Sits on the top of Kubernetes● Source code, builds and deployments management● S2I - Source to Image● Application lifecycle management (CI/CD)● Service catalog (Language runtimes, Middleware,

Databases)● Security

Apache Spark in the Cloud | Zbyněk Roubalík12

OpenShift architecture

Apache Spark in the Cloud | Zbyněk Roubalík13

Apache Spark

● Fast and general engine for large-scale data processing

● Distributed computation system

● Provides high-level APIs in Java, Scala, Python and R

● Supports a rich set of tools for Big Data, AI, ML● Spark SQL for SQL and structured data processing● MLlib for machine learning● GraphX for graph processing● Spark Streaming● ...

Apache Spark in the Cloud | Zbyněk Roubalík14

General Spark architecture

Apache Spark in the Cloud | Zbyněk Roubalík15

How to interact with Spark

● Run an application

● Start a REPL● Scala

● Python

● R

Apache Spark in the Cloud | Zbyněk Roubalík16

The fundamental Spark abstraction

Resilient distributed dataset (RDD)

● are partitioned, lazy and immutable homogenous collections

● partitioned● lazy● immutable

Apache Spark in the Cloud | Zbyněk Roubalík17

Resilient distributed dataset in action

Apache Spark in the Cloud | Zbyněk Roubalík18

Resilient distributed dataset in action

Apache Spark in the Cloud | Zbyněk Roubalík19

Resilient distributed dataset in action

Apache Spark in the Cloud | Zbyněk Roubalík20

Resilient distributed dataset in action

Apache Spark in the Cloud | Zbyněk Roubalík21

Resilient distributed dataset in action

Apache Spark in the Cloud | Zbyněk Roubalík22

Resilient distributed dataset in action

Apache Spark in the Cloud | Zbyněk Roubalík23

Resilient distributed dataset in action

Apache Spark in the Cloud | Zbyněk Roubalík24

What is Spark application?

Apache Spark in the Cloud | Zbyněk Roubalík25

simple.py application

● Even numbers count

Apache Spark in the Cloud | Zbyněk Roubalík26

Apache Spark in the Cloud | Zbyněk Roubalík27

A little more complex application

Apache Spark in the Cloud | Zbyněk Roubalík28

Designing a Spark microservice

Apache Spark in the Cloud | Zbyněk Roubalík29

On demand batch processing

Apache Spark in the Cloud | Zbyněk Roubalík30

Continuous batch processing

Apache Spark in the Cloud | Zbyněk Roubalík31

Stream processing

Apache Spark in the Cloud | Zbyněk Roubalík32

OpenShift architecture - recall

Apache Spark in the Cloud | Zbyněk Roubalík33

Spark on OpenShift

Apache Spark in the Cloud | Zbyněk Roubalík34

Oshinko - Integrating Spark and OpenShift

Apache Spark in the Cloud | Zbyněk Roubalík35

Oshinko - Integrating Spark and OpenShift

Apache Spark in the Cloud | Zbyněk Roubalík36

Demo time

Apache Spark in the Cloud | Zbyněk Roubalík37

Takeaways

● Containers

● Kubernetes

● OpenShift

● Apache Spark

● Oshinko tooling

Apache Spark in the Cloud | Zbyněk Roubalík38

Спасибі!

www.github.com/radanalyticsio

zroubali@redhat.com