Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process...

35
Python API for Spark Josh Rosen UC BERKELEY

Transcript of Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process...

Page 1: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Python API for Spark

Josh  RosenUC#BERKELEY#

Page 2: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

What is Spark?

Page 3: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Fast  and  expressive  cluster  computing  system

Compatible  with  Hadoop-­‐supported  file  systems  and  data  formats  (HDFS,  S3,  SequenceFile,  ...)

Page 4: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Improves efficiency through in-memory computing primitives and general computation graphs

As much as 30x faster

Page 5: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Improves usability through rich APIs in Scala, Python, and Java, and an interactive shell

Often 2-10x less code

Page 6: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

RDDsResilient Distributed Datasets

Immutable, partitioned collections of objects

Page 7: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Transformations

Actions

mapfiltergroupByjoin...

countcollectsave...

Page 8: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

val  lines  =  spark.textFile(“hdfs://...”)  val  errors  =  lines.filter(_.startsWith(“ERROR”))  val  messages  =  errors.map(_.split(‘\t’)(2))

messages.filter(_.contains(“foo”)).count  

Example: Log Mining

Page 9: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

What is PySpark?

Page 10: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

PySpark at a Glance

Write  Spark  jobs  in  Python

Run  interactive  jobs  in  the  shell

Supports  C  extensions

Page 11: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Previewed at AMP Camp 2012

Available now in 0.7 release

Page 12: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Example: Word Countfrom  pyspark.context  import  SparkContext

sc  =  SparkContext(...)lines  =  sc.textFile(sys.argv[2],  1)counts  =  lines.flatMap(lambda  x:  x.split('  '))  \                                                                    .map(lambda  x:  (x,  1))  \                                                                                        .reduceByKey(lambda  x,  y:  x  +  y)

for  (word,  count)  in  counts.collect():   print  "%s  :  %i"  %  (word,  count)  

Page 13: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Demo

Page 14: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Implementation

Page 15: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Built on top of Java API

Spark

LocalMode Mesos Standalone YARN

Java API

PySpark

Page 16: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Process data in Python

and persist / transfer it in Java

Page 17: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

schedulingbroadcastcheckpointingnetworkingfault-recoveryHDFS access

Re-uses Spark’s

Page 18: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

< 2K lines, including comments

PySpark has a small codebase:-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐File                                                                                                                                                blank                comment                      code-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐python/pyspark/rdd.py                                                                                                                  115                        345                        302core/src/main/scala/spark/api/python/PythonRDD.scala                                                      33                          45                        231python/pyspark/context.py                                                                                                            32                        101                        133python/pyspark/tests.py                                                                                                                26                          11                          84python/pyspark/accumulators.py                                                                                                  37                          91                          70python/pyspark/serializers.py                                                                                                    21                            7                          55python/pyspark/join.py                                                                                                                  15                          27                          50python/pyspark/worker.py                                                                                                                8                            7                          44core/src/main/scala/spark/api/python/PythonPartitioner.scala                                        5                            9                          34pyspark                                                                                                                                                  9                            8                          27python/pyspark/java_gateway.py                                                                                                    5                            7                          26python/pyspark/files.py                                                                                                                  7                          14                          17python/pyspark/broadcast.py                                                                                                          8                          16                          15python/pyspark/shell.py                                                                                                                  4                            6                            8python/pyspark/__init__.py                                                                                                            6                          14                            7-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐SUM:                                                                                                                                                    331                        708                      1103-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐

Page 19: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Data Flow

Python JVM

Local Cluster

Page 20: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Data Flow

SparkContext

Python JVM

Local Cluster

Page 21: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Data Flow

SparkContext

Python JVM

Py4J

Socket

Local Cluster

Spark Context

Page 22: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Data Flow

SparkContext

Python JVM

Py4J

Socket

LocalFS

Local Cluster

Spark Context

Page 23: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Data Flow

SparkContext

Python JVM

Py4J

Socket

LocalFS

Local Cluster

Spark Worker

Spark Worker

Spark Context

Page 24: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Data Flow

SparkContext

Python JVM

Py4J

Socket

LocalFS

Local Cluster

Spark Worker

Python

Python

Python

Pipe

Spark Worker

Python

Python

Python

Spark Context

Page 25: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Data is stored as Pickled objects in RDD[Array[Byte]]

Page 26: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Storing batches of Python objects in one Scala object reduces overhead

Page 27: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

When possible, RDD transformations are pipelined:lines.flatMap(lambda x: x.split(' ')) \ .map(lambda x: (x, 1))

MappedRDDfunc(x) = x.split(‘’)

MappedRDDfunc(x) = (x, 1)

MappedRDDfunc(x) = (x.split(‘’), 1)

Page 28: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Python functions and closures are serialized using PiCloud’s CloudPickle module

Page 29: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Roadmap

Page 30: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Available in Spark 0.7

Page 31: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Thanks!

Page 32: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Bonus Slides

Page 33: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

>>>  x  =  ["Hello",  "World!"]>>>  pickletools.dis(cPickle.dumps(x,  2))        0:  \x80  PROTO            2        2:  ]        EMPTY_LIST        3:  q        BINPUT          1        5:  (        MARK        6:  U                SHORT_BINSTRING  'Hello'      13:  q                BINPUT          2      15:  U                SHORT_BINSTRING  'World!'      23:  q                BINPUT          3      25:  e                APPENDS        (MARK  at  5)      26:  .        STOPhighest  protocol  among  opcodes  =  2

Pickle is a miniature stack language

Page 34: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

You can do crazy stuff, like converting a collection of pickled objects into a pickled collection.

https://gist.github.com/JoshRosen/3384191

Page 35: Python API for Spark - Meetup Meetup Talk.pdf · Mesos Standalone YARN Java API PySpark. Process data in Python and persist / transfer it in Java. scheduling broadcast checkpointing

Bulk depickling can be faster even if it involves Pickle opcode manipulation:

https://gist.github.com/JoshRosen/3401373

10000  integers:Bulk  depickle  (chunk  size  =  2):  0.266709804535Bulk  depickle  (chunk  size  =  10):  0.0797798633575Bulk  depickle  (chunk  size  =  100):  0.0388460159302Bulk  depickle  (chunk  size  =  1000):  0.0333180427551Individual  depickle:  0.0540158748627

10000  dicts  (dict([  (str(n),  n)  for  n  in  range(100)  ])):Bulk  depickle  (chunk  size  =  2):  2.70617198944Bulk  depickle  (chunk  size  =  10):  2.30310201645Bulk  depickle  (chunk  size  =  100):  2.22087192535Bulk  depickle  (chunk  size  =  1000):  2.22118020058Individual  depickle:  2.44124102592