Spark DataFrames for Data Munging


  © 2016 Mesosphere, Inc. All Rights Reserved.

    SPARK DATAFRAMES FOR DATA MUNGING


    Susan X. Huynh, Scala by the Bay, Nov. 2016


    OUTLINE


    Motivation

    Spark DataFrame API

    Demo

    Beyond Data Munging


    MOTIVATION


    Your job: Analyze 100 GB of log data:

    {"created_at":"Tue Sep 13 19:54:43 +0000 2016","id":775784797046124544,"id_str":"775784797046124544","text":"@abcd4321 ur icon is good","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":1180432963,"in_reply_to_user_id_str":"1180432963","in_reply_to_screen_name":"fakejoshler","user":{"id":4795786058,"id_str":"4795786058","name":"maggie","screen_name":"wxyz1234","location":"she her - gabby, mily","url":"http:\/\/666gutz.tumblr.com","description":"one too many skeletons","protected":false,"verified":false,"followers_count":2168,"friends_count":84,"listed_count":67,"favourites_count":22298,"statuses_count":29769,"created_at":"Fri Jan 22 00:04:30 +0000 2016","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"000000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,


    WHAT DO YOU MEAN BY ANALYZE?


    Also known as: data munging, ETL, data cleaning. Acronym: PETS (or PEST? :)

    Parse

    Explore

    Transform

    Summarize

    Data pipeline


    "source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"


    BEST TOOL FOR THE JOB?


    DataFrame

    Pandas (Python)

    R

    Big data + SQL

    Hive, Impala

    DataFrame + Big data / SQL

    Spark DataFrame


    https://flic.kr/p/fnCVbL



    WHY SPARK?


    Open source

    Scalable

    Fast ad-hoc queries



    WHY SPARK DATAFRAME?


    Parse: easy to read structured and semi-structured (JSON) formats

    Explore: DataFrame

    Transform / Summarize:

    SQL queries + procedural processing

    Utilities for math, string, date / time manipulation

    Scala



    PARSE: READING JSON DATA


    > spark
    res4: org.apache.spark.sql.SparkSession@3fc09112

    > val df = spark.read.json("/path/to/mydata.json")
    df: org.apache.spark.sql.DataFrame = [contributors: string ... 33 more fields]

    DataFrame: a table with rows and columns (fields)


    "source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"


    EXPLORE


    > df.printSchema() // lists the columns in a DataFrame
    root
     |-- contributors: string (nullable = true)
     |-- coordinates: struct (nullable = true)
     |    |-- coordinates: array (nullable = true)
     |    |    |-- element: double (containsNull = true)
     |    |-- type: string (nullable = true)
     |-- created_at: string (nullable = true)
     |-- delete: struct (nullable = true)
     |    |-- status: struct (nullable = true)
     |    |    |-- id: long (nullable = true)
     |-- lang: string (nullable = true)



    EXPLORE (CONTD)


    > df.filter(col("coordinates").isNotNull) // filters rows, with the given condition
        .select("coordinates", "created_at")  // filters on columns
        .show()

    +------------------------------------------------+------------------------------+
    |coordinates                                     |created_at                    |
    +------------------------------------------------+------------------------------+
    |[WrappedArray(104.86544034, 15.23611896),Point] |Thu Sep 15 02:00:00 +0000 2016|
    |[WrappedArray(-43.301755, -22.990065),Point]    |Thu Sep 15 02:00:03 +0000 2016|
    |[WrappedArray(100.3833729, 6.13822131),Point]   |Thu Sep 15 02:00:30 +0000 2016|
    |[WrappedArray(-122.286, 47.5592),Point]         |Thu Sep 15 02:00:38 +0000 2016|
    |[WrappedArray(110.823004, -6.80342),Point]      |Thu Sep 15 02:00:42 +0000 2016|

    Other DataFrame ops: count(), describe(), creating new columns, and more

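The Parse and Explore steps above can be sketched as one small, self-contained local-mode program. The object name and records below are hypothetical stand-ins for the real log data, and the sketch assumes Spark 2.2+ (where spark.read.json accepts a Dataset[String]; the slides target Spark 2.0, where you would read from a path or RDD instead):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExploreDemo {
  // Keep only rows that carry coordinates, projecting the two columns of interest.
  def withCoordinates(df: DataFrame): DataFrame =
    df.filter(col("coordinates").isNotNull)
      .select("coordinates", "created_at")

  def main(args: Array[String]): Unit = {
    // Local-mode session for experimentation; on a cluster the master
    // would come from spark-submit instead.
    val spark = SparkSession.builder()
      .appName("explore-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Two hypothetical tweet-like records standing in for the 100 GB log.
    val raw = Seq(
      """{"created_at":"Thu Sep 15 02:00:00 +0000 2016","lang":"en","coordinates":{"coordinates":[104.86,15.23],"type":"Point"}}""",
      """{"created_at":"Thu Sep 15 02:00:03 +0000 2016","lang":"pt"}"""
    ).toDS()

    val df = spark.read.json(raw) // schema is inferred from the JSON
    df.printSchema()
    withCoordinates(df).show(truncate = false)
    spark.stop()
  }
}
```

Only the first record has a coordinates field, so the filtered output contains a single row; the second record's missing field is inferred as null.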


    TRANSFORM/SUMMARIZE: SQL QUERIES + PROCEDURAL PROC.


    > val langCount = df.select("lang")    // SQL-style aggregation
        .where(col("lang").isNotNull)
        .groupBy("lang")
        .count()
        .orderBy(col("count").desc)

    +----+-----+
    |lang|count|
    +----+-----+
    |  en|61644|
    |  es|22937|
    |  pt|21610|
    |  ja|19160|
    | und|10376|

    Also: joins

    > val result = langCount.map { row: Row => ... } // procedural: or flatMap, filter, ...


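A runnable sketch of this slide, combining the SQL-style aggregation with a procedural map over the result. The object name and toy data are hypothetical, and the same local-mode assumptions apply:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LangCount {
  // SQL-style aggregation: tweets per language, most frequent first.
  def topLanguages(df: DataFrame): DataFrame =
    df.select("lang")
      .where(col("lang").isNotNull)
      .groupBy("lang")
      .count()
      .orderBy(col("count").desc)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lang-count").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the tweet DataFrame.
    val df = Seq("en", "en", "pt", "ja", "en", "pt").toDF("lang")
    val langCount = topLanguages(df)
    langCount.show()

    // Procedural follow-up: map each Row to a formatted string
    // (yields a typed Dataset[String] via the implicit encoder).
    val formatted = langCount.map(row => s"${row.getString(0)}: ${row.getLong(1)}")
    formatted.show(truncate = false)
    spark.stop()
  }
}
```

The switch from relational operators to map/flatMap/filter is seamless because both run on the same Dataset abstraction.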


    MATH, STRING, DATE / TIME FUNCTIONS


    > df.select("created_at")
        .withColumn("day_of_week", col("created_at").substr(0, 3))
        .show()

    +--------------------+-----------+
    |          created_at|day_of_week|
    +--------------------+-----------+
    |                null|       null|
    |Thu Sep 15 01:59:...|        Thu|
    |Thu Sep 15 01:59:...|        Thu|
    |Thu Sep 15 01:59:...|        Thu|
    |                null|       null|
    |Thu Sep 15 01:59:...|        Thu|

    Also: sin, cos, exp, log, pow, toDegrees, toRadians, ceil, floor, round, concat, format_string, lower, regexp_extract, split, trim, upper, current_timestamp, datediff, from_unixtime, and more

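The withColumn step above can be sketched as a standalone program (hypothetical object name and toy rows; same local-mode assumptions). One detail worth noting: Column.substr uses 1-based positions in Spark SQL, and the slide's substr(0, 3) happens to behave like substr(1, 3) because position 0 is treated as the start of the string:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower}

object DayOfWeek {
  // Derive a day_of_week column from the first three characters of created_at
  // (positions are 1-based in Spark SQL's substring semantics).
  def addDayOfWeek(df: DataFrame): DataFrame =
    df.withColumn("day_of_week", col("created_at").substr(1, 3))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("day-of-week").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy created_at values in the Twitter timestamp format.
    val df = Seq(
      "Thu Sep 15 01:59:59 +0000 2016",
      "Fri Sep 16 10:00:00 +0000 2016"
    ).toDF("created_at")

    addDayOfWeek(df)
      .withColumn("day_lower", lower(col("day_of_week"))) // one of the string utilities
      .show(truncate = false)
    spark.stop()
  }
}
```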


    DEMO


    Spark 2.0

    Zeppelin notebook 0.6.1

    8 GB JSON-formatted public Tweet data


    BEYOND DATA MUNGING


    Machine learning

    Data pipeline in production

    Streaming data


    BEYOND DATA MUNGING


    Machine learning => DataFrame-based ML API

    Data pipeline in production => Dataset API, with type safety

    Streaming data => Structured Streaming API, based on DataFrame

    Spark 2.0


    RECAP


    Spark DataFrames combine the data frame abstraction with Big Data and SQL

    Spark DataFrames simplify data munging tasks (PETS):

    Parse => structured and semi-structured formats (JSON)

    Explore => DataFrame: printSchema, filter by row / column, show

    Transform / Summarize => SQL + procedural processing, math / string / date-time utility functions

    All in Scala


    REFERENCES


    Spark SQL and DataFrames Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html

    Spark DataFrame API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

    Overview of Spark DataFrames: http://xinhstechblog.blogspot.com/2016/05/overview-of-spark-dataframe-api.html

    DataFrame Internals: https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf



    THANK YOU!
