
  • Apache Spark in the Cloud

    Zbyněk Roubalík, Senior Quality Engineer, Red Hat

    February 15, 2018

  • Technologies

    ● Apache Spark

    ● Docker

    ● Kubernetes

    ● OpenShift

  • Apache Spark in the Cloud

    aka how to create and deploy Apache Spark applications to cloud-native environments like OpenShift

  • What is cloud native?

    ● Containerized

    ● Dynamically orchestrated

    ● Microservice oriented

    ● http://www.cncf.io/about/faq

  • Containers

    ● A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings.

    ● https://www.docker.com/what-container

  • VM vs Containers

  • Containers

    ● Cloud vs standard deployment model

    ● Pets vs Cattle

    ● Developers + Operations (Admins) → DevOps

    ● Docker

  • Kubernetes

    ● Container cluster manager

    ● Stores cluster state in etcd, a distributed, clustered key-value store

    ● The smallest deployable unit is a Pod

  • OpenShift

    ● Open source container application platform

    ● Focused on applications (not just containers as a concept) and on developer experience

    ● Sits on top of Kubernetes

    ● Source code, build, and deployment management

    ● S2I (Source-to-Image)

    ● Application lifecycle management (CI/CD)

    ● Service catalog (language runtimes, middleware, databases)

    ● Security

  • OpenShift architecture

  • Apache Spark

    ● Fast and general engine for large-scale data processing

    ● Distributed computation system

    ● Provides high-level APIs in Java, Scala, Python and R

    ● Supports a rich set of tools for Big Data, AI, ML

      ● Spark SQL for SQL and structured data processing

      ● MLlib for machine learning

      ● GraphX for graph processing

      ● Spark Streaming

      ● ...

  • General Spark architecture

  • How to interact with Spark

    ● Run an application

    ● Start a REPL

      ● Scala

      ● Python

      ● R

  • The fundamental Spark abstraction

    Resilient distributed datasets (RDDs)

    ● are partitioned, lazy, and immutable homogeneous collections

      ● partitioned

      ● lazy

      ● immutable
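
The three RDD properties can be illustrated without a cluster. What follows is a minimal single-process sketch in plain Python, not the Spark API: `ToyRDD`, its methods, and the `square` helper are hypothetical names invented for this illustration. Real RDDs spread their partitions across cluster workers and can recompute lost partitions, which this toy does not attempt.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


class ToyRDD:
    """Hypothetical stand-in for an RDD, for illustration only (not Spark)."""

    def __init__(self, partitions: Tuple[tuple, ...],
                 pipeline: Callable[[Iterator], Iterator] = iter) -> None:
        self._partitions = partitions  # partitioned: data is split into chunks
        self._pipeline = pipeline      # lazy: pending work, applied per partition

    @classmethod
    def parallelize(cls, data: Iterable, num_partitions: int = 2) -> "ToyRDD":
        items = tuple(data)
        size = max(1, -(-len(items) // num_partitions))  # ceiling division
        chunks = tuple(items[i:i + size] for i in range(0, len(items), size))
        return cls(chunks)

    def map(self, fn: Callable) -> "ToyRDD":
        # immutable: a transformation returns a new ToyRDD and leaves self as-is
        prev = self._pipeline
        return ToyRDD(self._partitions, lambda part: (fn(x) for x in prev(part)))

    def filter(self, pred: Callable) -> "ToyRDD":
        prev = self._pipeline
        return ToyRDD(self._partitions,
                      lambda part: (x for x in prev(part) if pred(x)))

    def collect(self) -> List:
        # action: only here is the recorded pipeline actually executed
        out: List = []
        for part in self._partitions:
            out.extend(self._pipeline(iter(part)))
        return out


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record each invocation to make laziness observable
    return x * x


numbers = ToyRDD.parallelize(range(6), num_partitions=3)
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = even_squares.collect()    # the action triggers the computation
print(calls_before_action, result, numbers.collect())
```

Building `even_squares` runs no user code; `square` is first called inside `collect()`, and `numbers` still holds its original elements afterwards.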

  • Resilient distributed dataset in action

  • What is a Spark application?

  • simple.py application

    ● Even numbers count
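
The code shown on the slide is not captured in this transcript. As a rough sketch of what an even-numbers count might look like in PySpark: the input range of 1 to 1000, the application name, and the helper names `is_even` and `count_even` are assumptions made for illustration. `SparkSession`, `parallelize`, `filter`, and `count` are standard PySpark calls.

```python
"""simple.py: count the even numbers in a range (illustrative sketch)."""


def is_even(n: int) -> bool:
    return n % 2 == 0


def count_even(rdd) -> int:
    # filter() is a lazy transformation; count() is the action that runs the job
    return rdd.filter(is_even).count()


def main() -> None:
    # Imported here so the helpers above stay usable without Spark installed.
    from pyspark.sql import SparkSession

    # The app name is an arbitrary, assumed value.
    spark = SparkSession.builder.appName("simple-even-count").getOrCreate()
    try:
        # Assumed input: the integers 1..1000, distributed as an RDD.
        numbers = spark.sparkContext.parallelize(range(1, 1001))
        print("Even numbers:", count_even(numbers))
    finally:
        spark.stop()


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("pyspark is not installed; run this with spark-submit instead")
```

With Spark available, the script can be launched with `spark-submit simple.py`; under the assumed input it reports 500 even numbers.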

  • A little more complex application

  • Designing a Spark microservice

  • On demand batch processing

  • Continuous batch processing

  • Stream processing

  • OpenShift architecture - recall

  • Spark on OpenShift

  • Oshinko - Integrating Spark and OpenShift

  • Demo time

  • Takeaways

    ● Containers

    ● Kubernetes

    ● OpenShift

    ● Apache Spark

    ● Oshinko tooling

  • Thank you!

    www.github.com/radanalyticsio

    zroubali@redhat.com
