Spark on Dataproc - Israel Spark Meetup at taboola

Vadim <[email protected]>
Google Cloud Dataproc
Spark and Hadoop with super-fast start-up, easy management, and per-minute billing.

Copyright 2015 Google Inc
Google Developer Expert & Trainer
CTO of DoIT International

Agenda
01 Google Dataproc Overview
02 Features
03 Demo
04 Roadmap
05 Q&A
06 Try Google Dataproc

Cloud Dataproc
Google Cloud Dataproc is a fast, easy-to-use, low-cost, and fully managed service that lets you run Spark and Hadoop on Google Cloud Platform.

Google Cloud Platform product areas: Compute, Storage, Big Data, Services, Management, Mobile, and Developer Tools.

Dataproc 101

Easy to Use
Easily create and scale clusters (see the CLI sketch after this slide) to run native:
• Spark
• PySpark
• Spark SQL
• MapReduce
• Hive
• Pig
• More via initialization actions

Integrated
Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

Low Cost
Low-cost data processing with:
• Low and fixed price
• Minute-by-minute billing
• Fast cluster provisioning, execution, and removal
• Ability to manually scale clusters based on needs
• Preemptible instances
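To make "easy to use" concrete, here is a minimal sketch (not from the slides; the cluster name, zone, machine type, and worker counts are illustrative, and exact flags may differ by gcloud version) of creating and later resizing a cluster from the CLI:

# Create a small cluster (example names and sizes)
gcloud dataproc clusters create demo-cluster \
    --zone us-central1-a \
    --num-workers 3 \
    --worker-machine-type n1-standard-4

# Manually scale it up later, even while jobs are running
gcloud dataproc clusters update demo-cluster --num-workers 10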

Competitive Highlights

Cluster start time (elapsed time from cluster creation until it is ready)
  Cloud Dataproc: < 90 seconds  |  Amazon EMR: ~360 seconds
  Faster data processing workflows because less time is spent waiting for clusters to provision and start executing applications.

Billing unit of measure (increment used for billing while the service is active)
  Cloud Dataproc: minute  |  Amazon EMR: hourly
  Reduced costs for running Spark and Hadoop because you pay for what you actually use, not a cost rounded up to the next hour.

Preemptible VMs (clusters can utilize preemptible VMs)
  Cloud Dataproc: yes  |  Amazon EMR: kind of :-)
  Lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptible VMs.

Job output & cancellation (job output is easy to find and jobs can be canceled without SSH)
  Cloud Dataproc: yes  |  Amazon EMR: no
  Higher productivity because finding job output does not require digging through log files and canceling jobs does not require SSH.

02 Features

Packaging & Versioning
● Spark 1.5.2 with PySpark & Spark SQL
● Hadoop 2.7.1
● Pig 0.15
● Hive 1.2.1
● YARN Resource Manager
● Debian 8-based OS
● Google connectors for Cloud Storage, BigQuery, Bigtable, etc.

Features

Integrated
Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.

Anytime Scaling
Manually scale clusters up or down based on need, even when jobs are running.

Tools
UI, API & CLI for rapid development, including Initialization Actions & the job output driver (see the job-submission sketch after this slide).

Global Availability
Available in every Google Cloud zone in the United States, Europe, and Asia.
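As an illustration of the CLI and job-output driver (a sketch with example names; the sample jar path is the one commonly shipped with Spark 1.x on Dataproc, so verify it on your image), a Spark job can be submitted, monitored, and cancelled without ever SSHing into the cluster:

# Submit a Spark job; the driver output streams back to your terminal
gcloud dataproc jobs submit spark \
    --cluster demo-cluster \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/lib/spark-examples.jar \
    -- 1000

# List jobs and cancel one by ID, no SSH required
gcloud dataproc jobs list --cluster demo-cluster
gcloud dataproc jobs kill <job-id>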

Initialization Action Example

# Only run on the master node
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # Install Python build dependencies, matplotlib, requests, and pip
  apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev \
      libxft-dev pkg-config python-matplotlib python-requests
  curl https://bootstrap.pypa.io/get-pip.py | python

  # Install IPython Notebook and create a default profile
  mkdir IPythonNB
  pip install "ipython[notebook]"
  ipython profile create default

  # Listen on all interfaces
  echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
  echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py

  # Setup script for IPython Notebook so it uses the cluster's Spark
  cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
import os
import sys
spark_home = '/usr/lib/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
_EOF

  # Start the notebook server on port 8123 in the background
  nohup ipython notebook --no-browser --ip='*' --port=8123 > /var/log/python_notebook.log &
fi
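To use a script like this one, upload it to Cloud Storage and reference it when creating the cluster; the bucket and file names below are placeholders (a sketch, not from the slides):

# Stage the script and attach it as an initialization action
gsutil cp ipython-notebook-init.sh gs://<your-bucket>/ipython-notebook-init.sh
gcloud dataproc clusters create demo-cluster \
    --initialization-actions gs://<your-bucket>/ipython-notebook-init.sh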

Off-the-Shelf Initialization Actions
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
Jupyter, Facebook Presto, Zeppelin, Kafka, Zookeeper
Pull requests are welcome!
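The repository's scripts can be attached the same way; for example (the public bucket path shown is the one commonly documented for these actions, so verify it against the repo):

gcloud dataproc clusters create demo-cluster \
    --initialization-actions gs://dataproc-initialization-actions/zeppelin/zeppelin.sh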

Available Datastores
BigQuery, Bigtable, Cloud SQL, Datastore, Cloud Storage, Nearline

GCS Connector Performance (I): Recommendation Engine Use Case (1 file, 500 GB)

GCS Connector Performance (II): Sessionization Use Case (14,800 files, 1 GB each)

GCS Connector Performance (III): Document Clustering Use Case (31,000 files, 250 MB each)

Additional Integrations
Cloud Logging, Cloud Monitoring

Spark & BigQuery Integration Example

// Imports and SparkContext setup assumed (e.g. in spark-shell); not shown on the original slide
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val conf = sc.hadoopConfiguration
val projectId = conf.get("fs.gs.project.id")

val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
  "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the system bucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf, fullyQualifiedOutputTableId, outputTableSchema)

val fieldName = "word"

// Read the table as an RDD of (row id, JSON record) pairs
val tableData = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
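The snippet above assumes a live SparkContext (sc), as in spark-shell; since the BigQuery connector ships on Dataproc clusters, one simple way to try it (an assumption about workflow, with example names) is to SSH into the cluster master, which Dataproc names <cluster>-m, and paste it into spark-shell:

gcloud compute ssh demo-cluster-m --zone us-central1-a
spark-shell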

03 Demo

Pricing Example
A 35-minute Spark job running on 14x 16-core workers (224 cores)
[ Crunching a 3 TB TeraSort ]

Pricing Example

Function                     Machine Type    # in Cluster   vCPUs   Instances Price   Dataproc Price
Master Node                  n1-standard-4   1              4       $0.20             $0.04
Worker Nodes                 n1-highmem-16   4              64      $4.032            $0.64
Worker Nodes (Preemptible)   n1-highmem-16   10             160     $3.80             $1.60
Cluster Total                n/a             15             224                       $4.88

Pricing Details
Dataproc price per Compute Engine vCPU (any machine type): $0.01 per hour (USD)
35% to 300% less than AWS EMR (c3.2xlarge | m2.4xlarge)
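A quick reading of these numbers (my arithmetic, not on the slide): the $0.01 per-vCPU-hour Dataproc premium works out to

$0.04 + $0.64 + $1.60 = $2.28 per hour

on top of the Compute Engine instance prices, and because billing is per minute, a 35-minute run is charged roughly 35/60 of the hourly rates rather than a rounded-up full hour.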

04 Roadmap

Roadmap (Q1 2015)

More Pre-Installed Engines, Frameworks & Tools (via Initialization Scripts)
Mahout, Hue, Cloudera, MapR, and others

Performance
Further improve the performance of jobs running directly on Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and provide 2x the performance of local HDFS (when not using local SSD).

More Native Datastores
Spanner, Google ML

06 Try Google Dataproc in 2015

AWS EMR customer?
Get $1,000 to test Google Dataproc

Not an AWS EMR customer?
Get $1,000* to test Google Dataproc

* Agree to a 1-hour meeting at Google Tel-Aviv to discuss your Big Data needs.

goo.gl/mFwCYa
Promo code: "1K-Dataproc"

05 Q&A
goo.gl/mFwCYa