Spark DataFrames for Data Munging


Transcript of Spark DataFrames for Data Munging

© 2016 Mesosphere, Inc. All Rights Reserved.

SPARK DATAFRAMES FOR DATA MUNGING

Susan X. Huynh, Scala by the Bay, Nov. 2016

OUTLINE

Motivation

Spark DataFrame API

Demo

Beyond Data Munging

MOTIVATION

Your job: analyze 100 GB of log data:

{"created_at":"Tue Sep 13 19:54:43 +0000 2016","id":775784797046124544,"id_str":"775784797046124544","text":"@abcd4321 ur icon is good","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":1180432963,"in_reply_to_user_id_str":"1180432963","in_reply_to_screen_name":"fakejoshler","user":{"id":4795786058,"id_str":"4795786058","name":"maggie","screen_name":"wxyz1234","location":"she her - gabby, mily","url":"http:\/\/666gutz.tumblr.com","description":"one too many skeletons","protected":false,"verified":false,"followers_count":2168,"friends_count":84,"listed_count":67,"favourites_count":22298,"statuses_count":29769,"created_at":"Fri Jan 22 00:04:30 +0000 2016","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,

WHAT DO YOU MEAN BY “ANALYZE”?

AKA: data munging, ETL, data cleaning. Acronym: PETS (or PEST? :)

Parse

Explore

Transform

Summarize

Data pipeline

Motivation

"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"…

BEST TOOL FOR THE JOB?

DataFrame

Pandas (Python)

R

Big data + SQL

Hive, Impala

DataFrame + Big data / SQL

Spark DataFrame

Motivation

https://flic.kr/p/fnCVbL

WHY SPARK?

Open source

Scalable

Fast ad-hoc queries

Motivation

WHY SPARK DATAFRAME?

Parse: Easy to read structured, semi-structured (JSON) formats

Explore: DataFrame

Transform / Summarize:

SQL queries + procedural processing

Utilities for math, string, date / time manipulation

Scala

Motivation

PARSE: READING JSON DATA

> spark
res4: org.apache.spark.sql.SparkSession@3fc09112

> val df = spark.read.json("/path/to/mydata.json")
df: org.apache.spark.sql.DataFrame = [contributors: string ... 33 more fields]

DataFrame: a table with rows and columns (fields)
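The read step above can be tried end to end on a small in-memory sample. A minimal sketch, assuming Spark 2.2+ on the classpath and a local session (the file path in the slide stays elided, so a couple of hypothetical JSON records stand in for the tweet data):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParseSketch {
  // Local session for experimentation; the talk runs the same API on a cluster.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("parse-sketch")
    .getOrCreate()

  // Parse a few in-memory JSON records (a hypothetical stand-in for the
  // tweet file); spark.read.json infers the schema from the data itself.
  def parseSample(): DataFrame = {
    import spark.implicits._
    val lines = Seq(
      """{"lang":"en","text":"hello"}""",
      """{"lang":"es","text":"hola"}"""
    ).toDS()
    spark.read.json(lines)
  }
}
```

Schema inference is what makes the semi-structured case easy: no schema needs to be declared up front.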

Spark DataFrame API

"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"…

EXPLORE

> df.printSchema() // lists the columns in a DataFrame
root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- delete: struct (nullable = true)
 |    |-- status: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
…

Spark DataFrame API

EXPLORE (CONT’D)

> df.filter(col("coordinates").isNotNull) // filters on rows, with given condition
    .select("coordinates", "created_at")  // filters on columns
    .show()

+------------------------------------------------+------------------------------+
|coordinates                                     |created_at                    |
+------------------------------------------------+------------------------------+
|[WrappedArray(104.86544034, 15.23611896),Point] |Thu Sep 15 02:00:00 +0000 2016|
|[WrappedArray(-43.301755, -22.990065),Point]    |Thu Sep 15 02:00:03 +0000 2016|
|[WrappedArray(100.3833729, 6.13822131),Point]   |Thu Sep 15 02:00:30 +0000 2016|
|[WrappedArray(-122.286, 47.5592),Point]         |Thu Sep 15 02:00:38 +0000 2016|
|[WrappedArray(110.823004, -6.80342),Point]      |Thu Sep 15 02:00:42 +0000 2016|

Other DataFrame ops: count(), describe(), create new columns, …
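The extra ops named above can be sketched on a toy DataFrame. This is a hedged illustration with made-up column names, not the talk's data; it assumes a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ExploreSketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("explore-sketch")
    .getOrCreate()

  // Count rows, summarize a numeric column, and derive a new column.
  def run(): Long = {
    import spark.implicits._
    val df = Seq(("en", 3), ("es", 1), (null, 2)).toDF("lang", "n")

    df.count()              // total number of rows, nulls included
    df.describe("n").show() // count / mean / stddev / min / max of "n"

    // withColumn creates a new column from an expression over existing ones
    val flagged = df.withColumn("has_lang", col("lang").isNotNull)
    flagged.filter(col("has_lang")).count() // rows that carry a language
  }
}
```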

Spark DataFrame API

TRANSFORM/SUMMARIZE: SQL QUERIES + PROCEDURAL PROC.

> val langCount = df.select("lang")
    .where(col("lang").isNotNull)
    .groupBy("lang")
    .count()
    .orderBy(col("count").desc)

+----+-----+
|lang|count|
+----+-----+

|  en|61644|
|  es|22937|
|  pt|21610|
|  ja|19160|
| und|10376|

Also: joins

> val result = langCount.map { row: Row => … } // or flatMap, filter, …
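The slide only names joins; a minimal sketch under assumed data (a small langCount like the one above, plus a hypothetical lookup table of language names) shows a join feeding straight into procedural map processing:

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("join-sketch")
    .getOrCreate()

  // Join per-language counts with a (hypothetical) lookup table of
  // language names, then switch from SQL-style ops to map.
  def run(): Seq[String] = {
    import spark.implicits._
    val langCount = Seq(("en", 61644L), ("es", 22937L)).toDF("lang", "count")
    val langNames = Seq(("en", "English"), ("es", "Spanish")).toDF("lang", "name")

    langCount.join(langNames, "lang") // equi-join on the shared "lang" column
      .map(row => s"${row.getAs[String]("name")}: ${row.getAs[Long]("count")}")
      .collect()
      .sorted
      .toSeq
  }
}
```

The same DataFrame flows through both styles: declarative join, then an arbitrary Scala function per row.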

Spark DataFrame API

SQL

PROCEDURAL

MATH, STRING, DATE / TIME FUNCTIONS

> df.select("created_at")
    .withColumn("day_of_week", col("created_at").substr(0, 3))
    .show()

+--------------------+-----------+
|          created_at|day_of_week|
+--------------------+-----------+

|                null|       null|
|Thu Sep 15 01:59:...|        Thu|
|Thu Sep 15 01:59:...|        Thu|
|Thu Sep 15 01:59:...|        Thu|
|                null|       null|
|Thu Sep 15 01:59:...|        Thu|

Also: sin, cos, exp, log, pow, toDegrees, toRadians, ceil, floor, round, concat, format_string, lower, regexp_extract, split, trim, upper, current_timestamp, datediff, from_unixtime, …
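A couple of the listed utilities in action, as a hedged sketch on a single hypothetical tweet-style timestamp string (substr plus regexp_extract and lower):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, regexp_extract}

object FunctionsSketch {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("functions-sketch")
    .getOrCreate()

  // Pull the weekday and month out of a tweet-style timestamp string.
  def run(): (String, String) = {
    import spark.implicits._
    val df = Seq("Thu Sep 15 01:59:03 +0000 2016").toDF("created_at")
    df.select(
        col("created_at").substr(1, 3).as("day_of_week"), // 1-based positions
        // capture group 1 is the second word, i.e. the month abbreviation
        lower(regexp_extract(col("created_at"), "^\\w+ (\\w+)", 1)).as("month"))
      .as[(String, String)]
      .first()
  }
}
```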

Spark DataFrame API

DEMO

Spark 2.0

Zeppelin notebook 0.6.1

8 GB JSON-formatted public Tweet data

BEYOND DATA MUNGING

Machine learning

Data pipeline in production

Streaming data

BEYOND DATA MUNGING

Machine learning => DataFrame-based ML API

Data pipeline in production => Dataset API, with type safety

Streaming data => Structured Streaming API, based on DataFrame

Spark 2.0
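The type-safety point can be made concrete. A minimal sketch, assuming a hypothetical Tweet case class covering just the fields a pipeline needs: with a Dataset[Tweet], field access is checked at compile time, unlike the string column names of the untyped DataFrame API.

```scala
import org.apache.spark.sql.SparkSession

object DatasetSketch {
  // Hypothetical case class for the fields a production pipeline cares about.
  final case class Tweet(lang: String, text: String)

  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("dataset-sketch")
    .getOrCreate()

  // _.lang is an ordinary field access: a typo such as _.laang fails to
  // compile, instead of surfacing as a runtime analysis error mid-job.
  def run(): Long = {
    import spark.implicits._
    val tweets = Seq(Tweet("en", "hi"), Tweet("es", "hola")).toDS()
    tweets.filter(_.lang == "en").count()
  }
}
```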

RECAP

Spark DataFrames combine the “data frame” abstraction with Big Data and SQL

Spark DataFrames simplify data munging tasks (“PETS”):

Parse => structured and semi-structured formats (JSON)

Explore => DataFrame: printSchema, filter by row / column, show

Transform / Summarize => SQL + procedural processing, math / string / date-time utility functions

All in Scala

REFERENCES

Spark SQL and DataFrames Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html

Spark DataFrame API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

Overview of Spark DataFrames: http://xinhstechblog.blogspot.com/2016/05/overview-of-spark-dataframe-api.html

DataFrame Internals: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf

THANK YOU!