Apache Spark - Meetupfiles.meetup.com/18532292/Apache Spark.pdf · What is Spark? A distributed...
Apache Spark: Sreeram Nudurupati May 2015

What is Spark?

A distributed computing platform designed to be

Fast

Fast to develop distributed applications

Fast to run distributed applications

General Purpose

A single framework to handle a variety of workloads

Batch, interactive, iterative, streaming, SQL

Spark Architecture

How to run Spark?

Local

Not really distributed computing

Cluster

Standalone Scheduler + Shared File System

YARN

Mesos

Amazon EC2 + S3

Google Compute Engine + Mesosphere

Databricks Cloud

Spark Cluster

• Mesos

• YARN

• Standalone

Spark Basics

RDD - Resilient Distributed Datasets

Spark’s primary abstraction

A distributed collection of items called elements

Can be created from a variety of sources

Immutable

RDD Visualized

[Figure: three RDDs (RDD 1, RDD 2, RDD 3), each divided into partitions, with the partitions spread across four nodes (Node 1 to Node 4).]
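
As a rough illustration of how an RDD's partitions might be spread across nodes, the following plain-Python sketch splits a collection into partitions and places them on nodes. It does not use Spark. The names `make_partitions` and `assign_to_nodes`, the node names, and the round-robin placement are assumptions made for illustration only; Spark's real partitioning and scheduling are more involved.

```python
def make_partitions(items, num_partitions):
    """Split items into num_partitions contiguous chunks of near-equal size."""
    items = list(items)
    size, extra = divmod(len(items), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        # Tuples are used because a partition's contents never change.
        partitions.append(tuple(items[start:end]))
        start = end
    return partitions


def assign_to_nodes(partitions, nodes):
    """Place partitions on nodes round-robin (a hypothetical placement policy)."""
    placement = {node: [] for node in nodes}
    for index, partition in enumerate(partitions):
        placement[nodes[index % len(nodes)]].append(partition)
    return placement


partitions = make_partitions(range(10), 4)
placement = assign_to_nodes(partitions, ["node1", "node2", "node3", "node4"])
```

Each node ends up holding only part of the data, which is what lets work on different partitions proceed in parallel.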

RDD Operations

Transformations

Operate on an RDD and return a new RDD

Are lazily evaluated

Actions

Return a value after running a computation on an RDD

Lazy Evaluation

Evaluation happens only when an action is called

Deferring decisions for better runtime optimization

[Figure: Transformation 1 (map) and Transformation 2 (filter) are followed by an Action (collect), which sends data back to the Driver.]
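
To make the difference between transformations and actions concrete, here is a minimal single-machine sketch of lazy evaluation. `ToyRDD` is a hypothetical class written for this illustration and is not Spark's implementation or API; it only borrows the method names `map`, `filter`, and `collect` from the slide.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not Spark's implementation)."""

    def __init__(self, source, steps=()):
        self._source = source   # the original data, never modified
        self._steps = steps     # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: record the step and return a new ToyRDD.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def double(x):
    calls.append(x)  # record that the function really ran
    return x * 2


numbers = ToyRDD([1, 2, 3, 4, 5])
doubled = numbers.map(double)             # nothing runs yet
large = doubled.filter(lambda x: x > 4)   # still nothing runs
calls_before_action = len(calls)          # no work has been done so far
result = large.collect()                  # the action triggers evaluation
```

Because each transformation returns a new object and leaves the original untouched, the sketch also reflects the immutability of RDDs.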

DataFrames

Extension of the RDD API and a Spark SQL abstraction

Distributed collection of data with named columns

Equivalent to RDBMS tables or data frames in R/Pandas

Can be built from a variety of structured data sources

Hive tables, JSON, Databases, RDDs etc.

Why DataFrame?

Many data formats are structured

Schema-on-read

Data has inherent structure, and that structure is needed to make sense of it

RDD programming with structured data is not intuitive

SchemaRDD = RDD(ROW) + Schema

Write SQL queries

Use Domain Specific Language (DSL)
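
The idea behind "RDD(Row) + Schema" can be sketched in plain Python without Spark. The records, the column names, and the `Row` type below are all hypothetical and exist only for illustration; the point is to contrast positional access on untyped tuples with access through named columns.

```python
from collections import namedtuple

# Plain RDD-style records: positional tuples with no column names.
raw_records = [("alice", 34, "NY"), ("bob", 45, "CA"), ("carol", 29, "NY")]

# Without a schema, fields are addressed by position, which is easy to get wrong.
ages_by_position = [record[1] for record in raw_records]

# A schema gives every field a name (and here, a type).
schema = [("name", str), ("age", int), ("state", str)]
Row = namedtuple("Row", [column for column, _ in schema])

# "RDD(Row) + Schema": the same data, now with named columns.
rows = [Row(*record) for record in raw_records]
ages_by_name = [row.age for row in rows]
new_yorkers = [row.name for row in rows if row.state == "NY"]
```

With names attached, a query such as "names of everyone in NY" reads close to its SQL or DSL equivalent, which is the intuition the slide is pointing at.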

RDD vs DataFrame

DataFrame

Inbuilt support for a variety of data formats

A more feature-rich DSL

Memory management with Java objects is challenging

GC-free managed memory is planned for the future

Execution optimized by Catalyst

JVM bytecode generated for any/all APIs

RDD vs DataFrame

DataFrame Ops

printSchema prints the schema

show(N) shows N rows

join joins two DFs

apply returns the selected column

select returns new DF with selected columns

selectExpr selects using SQL expressions

filter same as where

groupBy groups using specified columns

SaveAs(JSON/Parquet/Table)

saveAsTable saves to a Hive table

createJDBCTable saves to a JDBC database
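
What `select`, `filter`/`where`, and `groupBy` mean for rows with named columns can be sketched in plain Python. This is not Spark code: the helper functions `select`, `where`, and `group_by_count` and the sample rows are hypothetical stand-ins that mimic the semantics of the operations listed above on a local list of dictionaries.

```python
from collections import Counter

# Hypothetical sample rows; each dict maps a column name to a value.
people = [
    {"name": "alice", "age": 34, "state": "NY"},
    {"name": "bob", "age": 45, "state": "CA"},
    {"name": "carol", "age": 29, "state": "NY"},
]


def select(rows, *columns):
    """Return new rows that contain only the requested columns."""
    return [{column: row[column] for column in columns} for row in rows]


def where(rows, predicate):
    """Return the rows for which predicate(row) is true (same idea as filter)."""
    return [row for row in rows if predicate(row)]


def group_by_count(rows, column):
    """Group rows by one column and count the rows in each group."""
    return dict(Counter(row[column] for row in rows))


names = select(people, "name")
over_30 = where(people, lambda row: row["age"] > 30)
per_state = group_by_count(people, "state")
```

Each helper returns a new collection and leaves `people` unchanged, mirroring how DataFrame operations return new DataFrames.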

SQLContext Ops

parquetFile loads a Parquet file into a DF

jsonFile loads a JSON file into a DF

load creates a DF from a source file

createExternalTable creates a Hive external table

jdbc loads a table from a JDBC database into a DF

sql executes SQL query

table returns the specified table as a DF

cacheTable caches the table in memory

Demo

What Next?

Spark Community: spark.apache.org/community.html

Worldwide Events: goo.gl/2YqJZK

Video, presentation archives: spark-summit.org

Dev resources: databricks.com/spark/developer-resources

Workshops: databricks.com/services/spark-training

Books: Learning Spark, Advanced Analytics with Apache Spark

Github: https://github.com/snudurupati/spark_training