Spark on Dataproc - Israel Spark Meetup at taboola


  • Vadim Solovey (vadim@doit-intl.com)

    Google Cloud Dataproc: Spark and Hadoop with super-fast start-up, easy management, and billing by the minute.

  • Google Developer Expert & Trainer

    CTO of DoIT International

  • Agenda

    01 Google Dataproc Overview

    02 Features

    03 Demo

    04 Roadmap

    05 Q&A

    06 Try Google Dataproc

  • Cloud Dataproc

    Google Cloud Dataproc is a fast, easy-to-use, low-cost, fully managed service that lets you run Spark and Hadoop on Google Cloud Platform.

  • Google Cloud Platform

    Management

    Mobile

    Services

    Compute

    Big Data

    Storage

    Developer Tools

  • Dataproc 101

    Easy to Use: easily create and scale clusters to run native Spark, PySpark, Spark SQL, MapReduce, Hive, Pig, and more via Initialization Actions (see the sketch after this list).

    Integrated: integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

    Low Cost: low-cost data processing with a low, fixed price; minute-by-minute billing; fast cluster provisioning, execution, and removal; the ability to manually scale clusters based on need; and preemptible instances.
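
    Creating such a cluster is a single CLI call. A minimal sketch, assuming the gcloud SDK is installed and authenticated; the cluster name, zone, and sizes below are placeholder values:

    # Hypothetical example: a cluster with 4 regular and 10 preemptible workers.
    gcloud dataproc clusters create demo-cluster \
        --zone us-central1-a \
        --master-machine-type n1-standard-4 \
        --worker-machine-type n1-highmem-16 \
        --num-workers 4 \
        --num-preemptible-workers 10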

  • Competitive Highlights

    Product Characteristic | Cloud Dataproc | Amazon EMR | Customer Impact

    Cluster start time (elapsed time from cluster creation until it is ready) | < 90 seconds | ~360 seconds | Faster data processing workflows, because less time is spent waiting for clusters to provision and start executing applications.

    Billing unit of measure (increment used to bill the service when active) | Minute | Hourly | Reduced costs for running Spark and Hadoop, because you pay for what you actually use rather than a cost rounded up to the hour.

    Preemptible VMs (clusters can utilize preemptible VMs) | Yes | Kind of :-) | Lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptibles.

    Job output & cancellation (job output is easy to find, and jobs are cancelable without SSH) | Yes | No | Higher productivity, because reading job output does not require digging through log files and canceling a job does not require SSH (see the sketch below).
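
    The last row can be seen from the CLI alone. A sketch of submitting and cancelling a job without SSH-ing into the cluster; the cluster name and job ID are placeholders, and the examples jar path may vary by image:

    # Driver output streams straight to the terminal.
    gcloud dataproc jobs submit spark --cluster demo-cluster \
        --class org.apache.spark.examples.SparkPi \
        --jars file:///usr/lib/spark/lib/spark-examples.jar -- 1000

    # Cancel a running job by its ID, again without SSH.
    gcloud dataproc jobs kill <job-id>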

  • 02 Features

  • Packaging & Versioning

    Spark 1.5.2 (with PySpark & Spark SQL)

    Hadoop 2.7.1

    Pig 0.15

    Hive 1.2.1

    YARN Resource Manager

    Debian 8-based OS

    Google connectors for Cloud Storage, BigQuery, Bigtable, etc.

  • Features

    Integrated: integrated with Cloud Storage, Cloud Logging, BigQuery, and more.

    Anytime Scaling: manually scale clusters up or down based on need, even while jobs are running (see the sketch after this list).

    Tools: UI, API & CLI for rapid development, including Initialization Actions & the job output driver.

    Global Availability: available in every Google Cloud zone in the United States, Europe, and Asia.
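
    A sketch of "Anytime Scaling" from the CLI, with placeholder names and sizes:

    # Grow the regular worker pool from 4 to 10 nodes; running jobs continue.
    gcloud dataproc clusters update demo-cluster --num-workers 10

    # Preemptible workers scale independently.
    gcloud dataproc clusters update demo-cluster --num-preemptible-workers 20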

  • Initialization Actions: installing IPython Notebook on the master node

    #!/bin/bash
    # Only run on the master node
    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
    if [[ "${ROLE}" == 'Master' ]]; then
        apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev \
            libxft-dev pkg-config python-matplotlib python-requests
        curl https://bootstrap.pypa.io/get-pip.py | python

        mkdir IPythonNB
        pip install "ipython[notebook]"
        ipython profile create default

        echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
        echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py

        # Setup script for IPython Notebook so it uses the cluster's Spark
        # (body of the startup file continues beyond this slide)
        cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py
    fi
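
    To wire a script like this into a new cluster, stage it in Cloud Storage and reference it at creation time. A sketch; the bucket and file names are placeholders:

    # Stage the initialization action, then point the cluster at it.
    gsutil cp ipython-notebook.sh gs://my-bucket/init/ipython-notebook.sh
    gcloud dataproc clusters create demo-cluster \
        --initialization-actions gs://my-bucket/init/ipython-notebook.sh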

  • Off-the-Shelf Initialization Actions

    https://github.com/GoogleCloudPlatform/dataproc-initialization-actions

    Jupyter, Facebook Presto, Zeppelin, Kafka, Zookeeper

    Pull Requests are Welcome!
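
    The repository's scripts are also published in a public Cloud Storage bucket, so an off-the-shelf action is one flag away. A sketch; treat the exact object path as an assumption and verify it against the repository's README:

    # Hypothetical example: a cluster with the community Zeppelin action.
    gcloud dataproc clusters create demo-cluster \
        --initialization-actions gs://dataproc-initialization-actions/zeppelin/zeppelin.sh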

  • Available Datastores

    BigQuery, Bigtable, Cloud SQL, Datastore

    Cloud Storage, Nearline
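
    Because the Cloud Storage connector is pre-installed, ordinary Hadoop tooling on the cluster addresses gs:// paths directly; the bucket and path names below are placeholders:

    # List a Cloud Storage bucket with standard Hadoop commands.
    hadoop fs -ls gs://my-bucket/data/

    # Copy a dataset from HDFS into Cloud Storage.
    hadoop distcp hdfs:///user/demo/output gs://my-bucket/output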

  • GCS Connector Performance (I): Recommendation Engine Use Case (1 file, 500 GB)

  • GCS Connector Performance (II): Sessionization Use Case (14,800 files, 1 GB each)

  • GCS Connector Performance (III): Document Clustering Use Case (31,000 files, 250 MB each)

  • Additional Integrations

    Cloud Logging, Cloud Monitoring

  • Spark & BigQuery Integration Example

    val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
    val fullyQualifiedOutputTableId = ""
    val outputTableSchema = "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
    val jobName = "wordcount"

    // Set the job-level projectId.
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

    // Use the systemBucket for temporary BigQuery export data used by the InputFormat.
    val systemBucket = conf.get("fs.gs.system.bucket")
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

    // Configure input and output for BigQuery access.
    BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
    BigQueryConfiguration.configureBigQueryOutput(conf, fullyQualifiedOutputTableId, outputTableSchema)

    val fieldName = "word"

    val tableData = sc.newAPIHadoopRDD(
      conf,
      classOf[GsonBigQueryInputFormat],
      classOf[LongWritable],
      classOf[JsonObject])
    tableData.cache()
    tableData.count()
    tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
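
    The snippet is written for an interactive session ("sc" is the shell's SparkContext and "conf" its Hadoop configuration, sc.hadoopConfiguration). One way to try it, as a sketch with a placeholder cluster name, is a spark-shell on the master node, where the BigQuery connector is already on the classpath:

    # Dataproc names the master node <cluster>-m.
    gcloud compute ssh demo-cluster-m
    spark-shell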

  • 03 Demo

  • Pricing Example

    A 35-minute Spark job running on 14 x 16-core workers (224 cores), crunching a 3 TB TeraSort.

  • Pricing

    Pricing Example

    Function | Machine Type | # in Cluster | vCPUs | Instances Price | Dataproc Price
    Master Node | n1-standard-4 | 1 | 4 | $0.20 | $0.04
    Worker Nodes | n1-highmem-16 | 4 | 64 | $4.032 | $0.64
    Worker Nodes (Preemptible) | n1-highmem-16 | 10 | 160 | $3.80 | $1.60
    Cluster Total | n/a | 15 | 224 | $4.88

    Pricing Details: per Compute Engine vCPU (any machine type), the Dataproc price is $0.01 per hour (USD).

    35% to 300% less than AWS EMR (c3.2xlarge | m2.4xlarge)
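
    The Dataproc Price column follows directly from the per-vCPU rate in Pricing Details; checking the arithmetic row by row:

    \[
    0.01 \times 4 = \$0.04, \qquad 0.01 \times 64 = \$0.64, \qquad 0.01 \times 160 = \$1.60
    \]

    Summed, the Dataproc premium is $2.28 per cluster-hour across the 228 vCPUs in the cluster (224 worker vCPUs plus 4 on the master), on top of the Compute Engine instance prices.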

  • 04 Roadmap

  • Roadmap (Q1 2016)

    More pre-installed engines, frameworks & tools (via initialization scripts): Mahout, Hue, Cloudera, MapR, and others

    Performance: further improve the performance of jobs that run directly against Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and to deliver 2x the performance of local HDFS (when not using local SSD).

    More native datastores: Spanner, Google ML

  • 06 Try Google Dataproc in 2015

  • AWS EMR Customer?

    Get $1,000 to test Google Dataproc

  • Not an AWS EMR Customer?

    Get $1,000* to test Google Dataproc

  • * Agree to a 1-hour meeting @ Google Tel-Aviv to discuss your Big Data needs

  • goo.gl/mFwCYa (promo code: 1K-Dataproc)

  • 05 Q&A

    goo.gl/mFwCYa