Python and Big Data - An Introduction to Spark (PySpark)


Hitesh Dharmdasani

About me

• Security Researcher, Malware Reversing Engineer, Developer

• GIT > GMU > Berkeley > FireEye > On Stage

• Bootstrapping a few ideas

• Hiring!

[Venn diagram: Information Security, Big Data, Machine Learning, with "Me" at the intersection]

What we will talk about

• What is Spark?
• How does Spark do things?
• PySpark and data processing primitives
• Example demo: playing with network logs
• Streaming and machine learning in Spark
• When to use Spark

http://bit.do/PyBelgaumSpark

http://tinyurl.com/PyBelgaumSpark

What will we NOT talk about

• Writing production level jobs

• Fine Tuning Spark

• Integrating Spark with Kafka and the like

• Nooks and crannies of Spark

• But glad to talk about it offline

The Common Scenario

[Diagram: Some Data (NTFS, NFS, HDFS, Amazon S3 …) → Python → Process 1, Process 2, Process 3, Process 4, Process 5 …]

You write 1 job, then chunk, cut, slice and dice the data yourself.

Compute where the data is

• Paradigm shift in computing

• Don't load all the data into one place and do operations

• State your operations and send code to the machine

• Sending code to machine >>> Getting data over network
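A rough illustration of the idea, assuming the interactive pyspark shell where sc already exists (the path and the "ERROR" filter are made up for this sketch): Spark splits the file into partitions and ships the function to the nodes holding each partition, so there is no hand-written chunking.

# Hypothetical example: count error lines in a large log file
log_lines = sc.textFile("hdfs:///some/path/access.log")   # placeholder path

def is_error(line):
    return "ERROR" in line

print(log_lines.filter(is_error).count())   # the filter runs where the blocks live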

MapReduce

public static class MyFirstMapper  { public void map(...)    { . . . } }
public static class MyFirstReducer { public void reduce(...) { . . . } }

public static class MySecondMapper  { public void map(...)    { . . . } }
public static class MySecondReducer { public void reduce(...) { . . . } }

Job job = new Job(conf, "First");
job.setMapperClass(MyFirstMapper.class);
job.setReducerClass(MyFirstReducer.class);

/* Job 1 writes its output to disk */

if (job.isSuccessful()) {
    Job job2 = new Job(conf, "Second");
    job2.setMapperClass(MySecondMapper.class);
    job2.setReducerClass(MySecondReducer.class);
}

This also looks ugly if you ask me!
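For contrast, a rough PySpark sketch of the same chained two-stage pipeline. The function names (first_map, first_reduce, second_map, second_reduce) and paths are placeholders, not from the talk; the point is that the intermediate result stays in the cluster as an RDD instead of being written to disk between jobs.

rdd = sc.textFile("hdfs:///input")                             # placeholder path

stage1 = rdd.map(first_map).reduceByKey(first_reduce)          # "Job 1", no disk round-trip
stage2 = stage1.map(second_map).reduceByKey(second_reduce)     # "Job 2" chained directly

stage2.saveAsTextFile("hdfs:///output")                        # placeholder path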

What is Spark?

• Open-source, lightning-fast cluster computing

• Focus on Speed and Scale

• Developed at AMP Lab, UC Berkeley by Matei Zaharia

• Most active Apache Project in 2014 (Even more than Hadoop)

• Recently beat MapReduce in sorting 100TB of data by being 3X faster and using 10X fewer machines

What is Spark?

[Stack diagram: Spark core on top of Some Data (NTFS, NFS, HDFS, Amazon S3 …); language APIs: Java, Python, Scala; libraries: MLlib, Streaming, ETL, SQL, GraphX …]

What is Spark?

[Diagram: Spark running directly on top of Some Data (NTFS, NFS, HDFS, Amazon S3 …)]

• Inherently distributed
• Computation happens where the data resides

What is different from MapReduce

• Uses main memory for caching

• Dataset is partitioned and stored in RAM/Disk for iterative queries

• Large speedups for iterative operations when in-memory caching is used
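A minimal sketch of that caching behaviour, assuming the pyspark shell (the path and filter strings are made up): persist an RDD once and repeated queries over it skip re-reading the source.

logs = sc.textFile("hdfs:///some/path/access.log")    # placeholder path
logs.cache()                                          # keep partitions in memory once computed

print(logs.filter(lambda l: "ERROR" in l).count())    # first action reads from storage
print(logs.filter(lambda l: "WARN" in l).count())     # later actions reuse the cached partitions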

Spark Internals: The Init

• Creating a SparkContext
• It is Spark's gateway to access the cluster
• In interactive mode, the SparkContext is created for you as 'sc'

$ pyspark
...
...
SparkContext available as sc.

>>> sc
<pyspark.context.SparkContext at 0xdeadbeef>
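Outside the interactive shell, the SparkContext is built explicitly. A minimal sketch for a standalone script; the app name and master URL are placeholders:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("PyBelgaumDemo").setMaster("local[*]")   # placeholder settings
sc = SparkContext(conf=conf)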

Spark Internals: The Key Idea

Resilient Distributed Datasets

• The basic unit of abstraction for data
• Immutable
• Persistence

>>> data = [90, 14, 20, 86, 43, 55, 30, 94]
>>> distData = sc.parallelize(data)
>>> distData
ParallelCollectionRDD[13] at parallelize at PythonRDD.scala:364

Spark Internals

Operations on RDDs - Transformations & Actions

Spark Internals

[Diagram: the SparkContext loads a File/Collection into an RDD; Transformations produce new RDDs]

Spark Internals

Lazy Evaluation

Now what?
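Transformations are lazy: calling map or filter only records the operation, and nothing runs until an action asks for a result. A tiny illustration, assuming the distData RDD created earlier:

>>> doubled = distData.map(lambda x: x * 2)   # returns immediately, nothing computed yet
>>> doubled.count()                           # the action triggers the actual work
8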

Spark Internals

[Diagram: the SparkContext loads a File/Collection into an RDD; Transformations build new RDDs, and Actions return results]

Spark Internals: Transformation Operations on RDDs

Map

def mapFunc(x):
    return x + 1

rdd_2 = rdd_1.map(mapFunc)

Filter

def filterFunc(x):
    if x % 2 == 0:
        return True
    else:
        return False

rdd_2 = rdd_1.filter(filterFunc)

Spark Internals: Transformation Operations on RDDs

• map
• filter
• flatMap
• mapPartitions
• mapPartitionsWithIndex
• sample
• union
• intersection
• distinct
• groupByKey
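A couple of the transformations above in action, as a small sketch in the pyspark shell (the sample data is made up):

>>> lines = sc.parallelize(["spark is fast", "spark is fun"])
>>> words = lines.flatMap(lambda line: line.split())   # one element per word
>>> words.distinct().count()
4
>>> sorted(words.map(lambda w: (w, 1)).groupByKey().mapValues(list).collect())
[('fast', [1]), ('fun', [1]), ('is', [1, 1]), ('spark', [1, 1])]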

Spark Internals

>>> increment_rdd = distData.map(mapFunc)
>>> increment_rdd.collect()
[91, 15, 21, 87, 44, 56, 31, 95]
>>>
>>> increment_rdd.filter(filterFunc).collect()
[44, 56]

OR

>>> distData.map(mapFunc).filter(filterFunc).collect()
[44, 56]

Spark Internals

Fault Tolerance and Lineage
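Each RDD remembers the chain of transformations that produced it (its lineage), so a lost partition can be recomputed from its parents instead of being replicated up front. The lineage can be inspected; a quick look, assuming the RDDs and functions from the earlier slides:

>>> chained = distData.map(mapFunc).filter(filterFunc)
>>> print(chained.toDebugString())   # prints the chain of parent RDDs this result depends on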

Moving to the Terminal

Spark Streaming

[Diagram: input sources (Kafka, Flume, HDFS, Twitter, ZeroMQ) → Spark Streaming → sinks (HDFS, Cassandra, NFS, text files); the live stream is processed as a sequence of small RDDs]
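A minimal streaming sketch, not from the talk, that counts words arriving on a local socket in 1-second micro-batches (the host and port are placeholders):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                         # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)       # placeholder source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                       # print each batch's counts

ssc.start()
ssc.awaitTermination()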

MLlib

• Machine Learning Primitives in Spark

• Provides training and classification at scale

• Exploits Spark's ability for iterative computation (Linear Regression, Random Forest)

• Currently the most active area of work within Spark
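As a rough, illustrative sketch (not from the talk): training a logistic regression classifier on a few hand-made labeled points with MLlib and scoring a new one.

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

# Made-up training data: label 1.0 for "bad", 0.0 for "good"
training = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [0.5, 0.8]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(1.0, [0.9, 0.1]),
])

model = LogisticRegressionWithSGD.train(training, iterations=100)
print(model.predict([0.8, 0.2]))   # expected to fall on the "bad" (1) side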

How can I use all this?

[Pipeline diagram: historical tweets are loaded from HDFS into Spark + MLlib to train a "bad tweets" model; live tweets arrive through Spark Streaming, are classified against the model as good or bad, and the bad ones are reported back to Twitter]

To Spark or not to Spark

• Iterative computations

• “Don't fix something that is not broken”

• Lower learning barrier

• Large one-time compute

• Single Map Reduce Operation