Spark on Dataproc - Israel Spark Meetup at taboola

Vadim <[email protected]>
Google Cloud Dataproc
Spark and Hadoop with super-fast start-up, easy management, and per-minute billing.

Copyright 2015 Google Inc
Google Developer Expert & Trainer
CTO of DoIT International

Agenda
01 Google Dataproc Overview
02 Features
03 Demo
04 Roadmap
05 Q&A
06 Try Google Dataproc

Cloud Dataproc
Google Cloud Dataproc is a fast, easy-to-use, low-cost, and fully managed service that lets you run Spark and Hadoop on Google Cloud Platform.

Google Cloud Platform product areas: Compute, Storage, Big Data, Services, Management, Mobile, and Developer Tools.

Dataproc 101

Easy to Use
Easily create and scale clusters (see the CLI sketch after this slide) to run native:
• Spark
• PySpark
• Spark SQL
• MapReduce
• Hive
• Pig
• More via initialization actions

Integrated
Integration with Cloud Platform provides immense scalability, ease of use, and multiple channels for cluster interaction and management.

Low Cost
Low-cost data processing with:
• Low and fixed price
• Minute-by-minute billing
• Fast cluster provisioning, execution, and removal
• Ability to manually scale clusters based on needs
• Preemptible instances
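To make "easy to use" concrete, here is a minimal sketch (not from the slides; the cluster name, zone, machine type, and worker counts are illustrative, and exact flags may differ by gcloud version) of creating and later resizing a cluster from the CLI:

# Create a small cluster (example names and sizes)
gcloud dataproc clusters create demo-cluster \
    --zone us-central1-a \
    --num-workers 3 \
    --worker-machine-type n1-standard-4

# Manually scale it up later, even while jobs are running
gcloud dataproc clusters update demo-cluster --num-workers 10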

Competitive Highlights

Cluster start time (elapsed time from cluster creation until it is ready)
  Cloud Dataproc: < 90 seconds  |  Amazon EMR: ~360 seconds
  Faster data processing workflows because less time is spent waiting for clusters to provision and start executing applications.

Billing unit of measure (increment used for billing while the service is active)
  Cloud Dataproc: minute  |  Amazon EMR: hourly
  Reduced costs for running Spark and Hadoop because you pay for what you actually use, not a cost rounded up to the next hour.

Preemptible VMs (clusters can utilize preemptible VMs)
  Cloud Dataproc: yes  |  Amazon EMR: kind of :-)
  Lower total operating costs for Spark and Hadoop processing by leveraging the cost benefits of preemptible VMs.

Job output & cancellation (job output is easy to find and jobs can be canceled without SSH)
  Cloud Dataproc: yes  |  Amazon EMR: no
  Higher productivity because finding job output does not require digging through log files and canceling jobs does not require SSH.

02 Features

Packaging & Versioning
● Spark 1.5.2 with PySpark & Spark SQL
● Hadoop 2.7.1
● Pig 0.15
● Hive 1.2.1
● YARN Resource Manager
● Debian 8-based OS
● Google connectors for Cloud Storage, BigQuery, Bigtable, etc.

Features

Integrated
Integrated with Cloud Storage, Cloud Logging, BigQuery, and more.

Anytime Scaling
Manually scale clusters up or down based on need, even when jobs are running.

Tools
UI, API & CLI for rapid development, including Initialization Actions & the job output driver (see the job-submission sketch after this slide).

Global Availability
Available in every Google Cloud zone in the United States, Europe, and Asia.
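As an illustration of the CLI and job-output driver (a sketch with example names; the sample jar path is the one commonly shipped with Spark 1.x on Dataproc, so verify it on your image), a Spark job can be submitted, monitored, and cancelled without ever SSHing into the cluster:

# Submit a Spark job; the driver output streams back to your terminal
gcloud dataproc jobs submit spark \
    --cluster demo-cluster \
    --class org.apache.spark.examples.SparkPi \
    --jars file:///usr/lib/spark/lib/spark-examples.jar \
    -- 1000

# List jobs and cancel one by ID, no SSH required
gcloud dataproc jobs list --cluster demo-cluster
gcloud dataproc jobs kill <job-id>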

Initialization Action Example

# Only run on the master node
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  # Install Python build dependencies, matplotlib, requests, and pip
  apt-get install -y build-essential python-dev libpng-dev libfreetype6-dev \
      libxft-dev pkg-config python-matplotlib python-requests
  curl https://bootstrap.pypa.io/get-pip.py | python

  # Install IPython Notebook and create a default profile
  mkdir IPythonNB
  pip install "ipython[notebook]"
  ipython profile create default

  # Listen on all interfaces
  echo "c = get_config()" > /root/.ipython/profile_default/ipython_notebook_config.py
  echo "c.NotebookApp.ip = '*'" >> /root/.ipython/profile_default/ipython_notebook_config.py

  # Setup script for IPython Notebook so it uses the cluster's Spark
  cat > /root/.ipython/profile_default/startup/00-pyspark-setup.py <<'_EOF'
import os
import sys
spark_home = '/usr/lib/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
_EOF

  # Start the notebook server on port 8123 in the background
  nohup ipython notebook --no-browser --ip='*' --port=8123 > /var/log/python_notebook.log &
fi
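To use a script like this one, upload it to Cloud Storage and reference it when creating the cluster; the bucket and file names below are placeholders (a sketch, not from the slides):

# Stage the script and attach it as an initialization action
gsutil cp ipython-notebook-init.sh gs://<your-bucket>/ipython-notebook-init.sh
gcloud dataproc clusters create demo-cluster \
    --initialization-actions gs://<your-bucket>/ipython-notebook-init.sh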

Off-the-Shelf Initialization Actions
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
Jupyter, Facebook Presto, Zeppelin, Kafka, Zookeeper
Pull requests are welcome!
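The repository's scripts can be attached the same way; for example (the public bucket path shown is the one commonly documented for these actions, so verify it against the repo):

gcloud dataproc clusters create demo-cluster \
    --initialization-actions gs://dataproc-initialization-actions/zeppelin/zeppelin.sh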

Available Datastores
BigQuery, Bigtable, Cloud SQL, Datastore, Cloud Storage, Nearline

GCS Connector Performance (I): Recommendation Engine Use Case (1 file, 500 GB)

GCS Connector Performance (II): Sessionization Use Case (14,800 files, 1 GB each)

GCS Connector Performance (III): Document Clustering Use Case (31,000 files, 250 MB each)

Additional Integrations
Cloud Logging, Cloud Monitoring

Spark & BigQuery Integration Example

// Imports and SparkContext setup assumed (e.g. in spark-shell); not shown on the original slide
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable

val conf = sc.hadoopConfiguration
val projectId = conf.get("fs.gs.project.id")

val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
  "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the system bucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf, fullyQualifiedOutputTableId, outputTableSchema)

val fieldName = "word"

// Read the table as an RDD of (row id, JSON record) pairs
val tableData = sc.newAPIHadoopRDD(
  conf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(), entry._2.toString())).take(10)
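The snippet above assumes a live SparkContext (sc), as in spark-shell; since the BigQuery connector ships on Dataproc clusters, one simple way to try it (an assumption about workflow, with example names) is to SSH into the cluster master, which Dataproc names <cluster>-m, and paste it into spark-shell:

gcloud compute ssh demo-cluster-m --zone us-central1-a
spark-shell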

03 Demo

Pricing Example
A 35-minute Spark job running on 14x 16-core workers (224 cores)
[ Crunching a 3 TB TeraSort ]

Pricing Example

Function                     Machine Type    # in Cluster   vCPUs   Instances Price   Dataproc Price
Master Node                  n1-standard-4   1              4       $0.20             $0.04
Worker Nodes                 n1-highmem-16   4              64      $4.032            $0.64
Worker Nodes (Preemptible)   n1-highmem-16   10             160     $3.80             $1.60
Cluster Total                n/a             15             224                       $4.88

Pricing Details
Dataproc price per Compute Engine vCPU (any machine type): $0.01 per hour (USD)
35% to 300% less than AWS EMR (c3.2xlarge | m2.4xlarge)
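A quick reading of these numbers (my arithmetic, not on the slide): the $0.01 per-vCPU-hour Dataproc premium works out to

$0.04 + $0.64 + $1.60 = $2.28 per hour

on top of the Compute Engine instance prices, and because billing is per minute, a 35-minute run is charged roughly 35/60 of the hourly rates rather than a rounded-up full hour.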

04 Roadmap

Roadmap (Q1 2015)

More Pre-Installed Engines, Frameworks & Tools (via Initialization Scripts)
Mahout, Hue, Cloudera, MapR, and others

Performance
Further improve the performance of jobs running directly on Google Cloud Storage. The ultimate goal is to make GCS the default storage for Dataproc and provide 2x the performance of local HDFS (when not using local SSD).

More Native Datastores
Spanner, Google ML

06 Try Google Dataproc in 2015

AWS EMR customer?
Get $1,000 to test Google Dataproc

Not an AWS EMR customer?
Get $1,000* to test Google Dataproc

* Agree to a 1-hour meeting at Google Tel-Aviv to discuss your Big Data needs.

goo.gl/mFwCYa
Promo code: "1K-Dataproc"

05 Q&A
goo.gl/mFwCYa