Download - DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Transcript
Page 1: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup)

Page 2: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

2

Year of the lamb, goat, sheep, and ram …?

Page 3: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

A slide from 2013 …

3

Page 4: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

From MapReduce to Spark public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WorkdCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }

val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")

Page 5: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Spark’s Growth

5

Google Trends for “Apache Spark”

Page 6: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Beyond Hadoop Users

6

Early adopters

Data Scientists Statisticians R users … PyData

Users

Understands MapReduce

& functional APIs

Page 7: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

RDD API

• Most data is structured (JSON, CSV, Avro, Parquet, Hive …) –  Programming RDDs inevitably ends up with a lot of tuples (_1, _2, …)

• Functional transformations (e.g. map/reduce) are not as intuitive

7

Page 8: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

8

Page 9: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

DataFrames in Spark

• Distributed collection of data grouped into named columns (i.e. RDD with schema) • Domain-specific functions designed for common tasks

– Metadata –  Sampling –  Project, filter, aggregation, join, … –  UDFs

• Available in Python, Scala, Java, and R (via SparkR)

9

Page 10: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

10

0 2 4 6 8 10

RDD Scala

RDD Python

Spark Scala DF

Spark Python DF

Runtime performance of aggregating 10 million int pairs (secs)

Page 11: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Agenda

•  Introduction • Learn by demo • Design & internals

–  API design –  Plan optimization –  Integration with data sources

11

Page 12: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Learn by Demo (in a Databricks Cloud Notebook)

• Creation • Project • Filter • Aggregations • Join • SQL • UDFs • Pandas

12

For the purpose of distributing the slides online, I’m attaching screenshots of the notebooks.

Page 13: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

13

Page 14: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

14

Page 15: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

15

Page 16: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

16

Page 17: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

17

Page 18: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

18

Page 19: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

19

Page 20: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

20

Page 21: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

21

Page 22: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

22

Page 23: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

23

Page 24: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

24

Page 25: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

25

Page 26: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

26

Page 27: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Machine Learning Integration

27

tokenizer = Tokenizer(inputCol="text", outputCol="words”)hashingTF = HashingTF(inputCol="words", outputCol="features”)lr = LogisticRegression(maxIter=10, regParam=0.01)pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = context.load("/path/to/data")model = pipeline.fit(df)

Page 28: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Design Philosophy

Simple tasks easy -  DSL for common operations -  Infer schema automatically (CSV,

Parquet, JSON, …) -  MLlib pipeline integration Performance -  Catalyst optimizer -  Code generation

Complex tasks possible -  RDD API -  Full expression library

Interoperability -  Various data sources and formats -  Pandas, R, Hive …

28

Page 29: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

DataFrame Internals

• Represented internally as a “logical plan”

• Execution is lazy, allowing it to be optimized by Catalyst

29

Page 30: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Plan Optimization & Execution

30

SQL  AST  

DataFrame  

Unresolved  Logical  Plan   Logical  Plan   Op;mized  

Logical  Plan  Physical  Plans  Physical  Plans   RDDs  Selected  

Physical  Plan  

Analysis   Logical  Op;miza;on  

Physical  Planning  

Cost  M

odel  

Physical  Plans  

Code  Genera;on  

Catalog  

DataFrames and SQL share the same optimization/execution pipeline

Page 31: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

31

Page 32: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

32

joined = users.join(events, users.id == events.uid)filtered = joined.filter(events.date >= ”2015-01-01”)

logical plan

filter

join

scan (users)

scan (events)

physical plan

join

scan (users) filter

scan (events)

this join is expensive à

Page 33: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Data Sources supported by DataFrames

33

{ JSON }

built-in external

JDBC

and more …

Page 34: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

More Than Naïve Scans

• Data Sources API can automatically prune columns and push filters to the source –  Parquet: skip irrelevant columns and blocks of data; turn

string comparison into integer comparisons for dictionary encoded data

–  JDBC: Rewrite queries to push predicates down

34

Page 35: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

35

joined = users.join(events, users.id == events.uid)filtered = joined.filter(events.date > ”2015-01-01”)

logical plan

filter

join

scan (users)

scan (events)

optimized plan

join

scan (users) filter

scan (events)

optimized plan with intelligent data sources

join

scan (users)

filter scan (events)

Page 36: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

DataFrames in Spark

• APIs in Python, Java, Scala, and R (via SparkR)

• For new users: make it easier to program Big Data

• For existing users: make Spark programs simpler & easier to understand, while improving performance

• Experimental API in Spark 1.3 (early March)

36

Page 37: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Our Vision

37

Page 38: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

Thank you! Questions?

Page 39: DataFrames for Large-scale Data Science - GitHub PagesFeb 17, 2015  · DataFrames for Large-scale Data Science Reynold Xin @rxin Feb 17, 2015 (Spark User Meetup) 2 ... Spark Scala

More Information

Blog post introducing DataFrames: http://tinyurl.com/spark-dataframes Build from source: http://github.com/apache/spark (branch-1.3)