Spark DataFrames for Data Munging
Posted 18-Jan-2017 | Category: Software
2016 Mesosphere, Inc. All Rights Reserved.
SPARK DATAFRAMES FOR DATA MUNGING
1
Susan X. Huynh, Scala by the Bay, Nov. 2016
OUTLINE
2
Motivation
Spark DataFrame API
Demo
Beyond Data Munging
MOTIVATION
3
Your job: Analyze 100 GB of log data:
{"created_at":"Tue Sep 13 19:54:43 +0000 2016","id":775784797046124544,"id_str":"775784797046124544","text":"@abcd4321 ur icon is good","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":1180432963,"in_reply_to_user_id_str":"1180432963","in_reply_to_screen_name":"fakejoshler","user":{"id":4795786058,"id_str":"4795786058","name":"maggie","screen_name":"wxyz1234","location":"she her - gabby, mily","url":"http:\/\/666gutz.tumblr.com","description":"one too many skeletons","protected":false,"verified":false,"followers_count":2168,"friends_count":84,"listed_count":67,"favourites_count":22298,"statuses_count":29769,"created_at":"Fri Jan 22 00:04:30 +0000 2016","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,
WHAT DO YOU MEAN BY ANALYZE?
4
AKA: data munging, ETL, data cleaning, acronym: PETS (or PEST? :)
Parse
Explore
Transform
Summarize
Data pipeline
Motivation
"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\nofollow\
BEST TOOL FOR THE JOB?
5
DataFrame
Pandas (Python)
R
Big data + SQL
Hive, Impala
DataFrame + Big data / SQL
Spark DataFrame
Motivation
https://flic.kr/p/fnCVbL
WHY SPARK?
6
Open source
Scalable
Fast ad-hoc queries
Motivation
WHY SPARK DATAFRAME?
7
Parse: Easy to read structured, semi-structured (JSON) formats
Explore: DataFrame
Transform / Summarize:
SQL queries + procedural processing
Utilities for math, string, date / time manipulation
Scala
Motivation
PARSE: READING JSON DATA
8
> spark
res4: org.apache.spark.sql.SparkSession@3fc09112

> val df = spark.read.json("/path/to/mydata.json")
df: org.apache.spark.sql.DataFrame = [contributors: string ... 33 more fields]
DataFrame: a table with rows and columns (fields)
Spark DataFrame API
"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\nofollow\
EXPLORE
9
> df.printSchema() // lists the columns in a DataFrame
root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- delete: struct (nullable = true)
 |    |-- status: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |-- lang: string (nullable = true)
Spark DataFrame API
EXPLORE (CONT'D)
10
> df.filter(col("coordinates").isNotNull) // keeps rows matching the condition
    .select("coordinates", "created_at")  // keeps only these columns
    .show()

+------------------------------------------------+------------------------------+
|coordinates                                     |created_at                    |
+------------------------------------------------+------------------------------+
|[WrappedArray(104.86544034, 15.23611896),Point] |Thu Sep 15 02:00:00 +0000 2016|
|[WrappedArray(-43.301755, -22.990065),Point]    |Thu Sep 15 02:00:03 +0000 2016|
|[WrappedArray(100.3833729, 6.13822131),Point]   |Thu Sep 15 02:00:30 +0000 2016|
|[WrappedArray(-122.286, 47.5592),Point]         |Thu Sep 15 02:00:38 +0000 2016|
|[WrappedArray(110.823004, -6.80342),Point]      |Thu Sep 15 02:00:42 +0000 2016|
Other DataFrame ops: count(), describe(), creating new columns, ...
Spark DataFrame API
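A hedged sketch of those other exploration ops against a toy stand-in for the tweet data (Spark on the classpath assumed; the column names and values here are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExploreOpsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("explore").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the tweet DataFrame
    val df = Seq(("en", 2168L), ("es", 84L), ("en", 67L))
      .toDF("lang", "followers_count")

    println(df.count())                   // number of rows
    df.describe("followers_count").show() // count / mean / stddev / min / max

    // withColumn derives a new column from existing ones
    df.withColumn("lots_of_followers", col("followers_count") > 100).show()
    spark.stop()
  }
}
```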
TRANSFORM/SUMMARIZE: SQL QUERIES + PROCEDURAL PROC.
11
> val langCount = df.select("lang")
    .where(col("lang").isNotNull)
    .groupBy("lang")
    .count()
    .orderBy(col("count").desc)

+----+-----+
|lang|count|
+----+-----+
| en|61644|
| es|22937|
| pt|21610|
| ja|19160|
| und|10376|
Also: joins
> val result = langCount.map { row: Row => ... } // or flatMap, filter, ...
Spark DataFrame API
SQL
PROCEDURAL
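The procedural side above can be sketched concretely. Mapping over a DataFrame needs an Encoder for the result type, which `spark.implicits._` supplies for common types (the object name and two-row sample are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object ProceduralSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("proc").getOrCreate()
    import spark.implicits._ // supplies the Encoder for map's result type

    // Stand-in for the langCount result on the slide
    val langCount = Seq(("en", 61644L), ("es", 22937L)).toDF("lang", "count")

    // A DataFrame is a Dataset[Row]; map gives back a typed Dataset
    val result = langCount.map { row =>
      s"${row.getString(0)}: ${row.getLong(1)}"
    }
    result.show(truncate = false)
    spark.stop()
  }
}
```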
MATH, STRING, DATE / TIME FUNCTIONS
12
> df.select("created_at")
    .withColumn("day_of_week", col("created_at").substr(0, 3))
    .show()

+--------------------+-----------+
| created_at|day_of_week|
+--------------------+-----------+
| null| null|
|Thu Sep 15 01:59:...| Thu|
|Thu Sep 15 01:59:...| Thu|
|Thu Sep 15 01:59:...| Thu|
| null| null|
|Thu Sep 15 01:59:...| Thu|
Also: sin, cos, exp, log, pow, toDegrees, toRadians, ceil, floor, round, concat, format_string, lower, regexp_extract, split, trim, upper, current_timestamp, datediff, from_unixtime, ...
Spark DataFrame API
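A few of those utility functions in action, as a hedged sketch on a single made-up timestamp string (all names besides the `functions` imports are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ColumnFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("funcs").getOrCreate()
    import spark.implicits._

    val df = Seq("Thu Sep 15 01:59:00 +0000 2016").toDF("created_at")

    df.select(
      upper(col("created_at").substr(0, 3)).as("day"),            // string ops compose
      regexp_extract(col("created_at"), "\\d{4}$", 0).as("year"), // regex extraction
      round(lit(3.14159), 2).as("pi")                             // math helper
    ).show()
    spark.stop()
  }
}
```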
DEMO
13
Spark 2.0
Zeppelin notebook 0.6.1
8 GB JSON-formatted public Tweet data
BEYOND DATA MUNGING
14
Machine learning
Data pipeline in production
Streaming data
BEYOND DATA MUNGING
15
Machine learning => DataFrame-based ML API
Data pipeline in production => Dataset API, with type safety
Streaming data => Structured Streaming API, based on DataFrame
Spark 2.0
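The Dataset API's type safety can be sketched with a case class: `as[T]` checks column names against the class at analysis time, and downstream code gets typed field access instead of `row.getString(i)`. A minimal sketch, with `Tweet` as a toy stand-in for a real record type:

```scala
import org.apache.spark.sql.SparkSession

// Field names must match the DataFrame's column names
case class Tweet(lang: String, text: String)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("ds").getOrCreate()
    import spark.implicits._

    val df = Seq(("en", "hello"), ("es", "hola")).toDF("lang", "text")

    // DataFrame -> typed Dataset; misspelled columns fail fast here
    val tweets = df.as[Tweet]
    val langs = tweets.map(_.lang) // compile-time field access
    langs.show()
    spark.stop()
  }
}
```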
RECAP
16
Spark DataFrames combine the data frame abstraction with Big Data and SQL
Spark DataFrames simplify data munging tasks (PETS):
Parse => structured and semi-structured formats (JSON)
Explore => DataFrame: printSchema, filter by row / column, show
Transform / Summarize => SQL + procedural processing; math / string / date-time utility functions
All in Scala
REFERENCES
17
Spark SQL and DataFrames Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html
Spark DataFrame API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
Overview of Spark DataFrames: http://xinhstechblog.blogspot.com/2016/05/overview-of-spark-dataframe-api.html
DataFrame Internals: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
THANK YOU!
18