Spark Meetup TensorFrames


TensorFrames: Google TensorFlow on Apache Spark

Tim Hunter, Spark Summit 2016 Meetup

How familiar are you with Spark?

1. What is Apache Spark?

2. I have used Spark

3. I am using Spark in production or I contribute to its development

2

How familiar are you with TensorFlow?

1. What is TensorFlow?

2. I have heard about it

3. I am training my own neural networks

3

About Databricks

Founded by the team who created Apache Spark

Offers a hosted service:

• Apache Spark in the cloud

• Notebooks

• Cluster management

• Production environment

4

About me

Software engineer at Databricks

Apache Spark contributor

Ph.D. in Machine Learning from UC Berkeley

(and Spark user since Spark 0.2)

5

Outline

• Numerical computing with Apache Spark

• Using GPUs with Spark and TensorFlow

• Performance details

• The future

6

Numerical computing for Data Science

• Queries are data-heavy

• However, algorithms are computation-heavy

• They operate on simple data types: integers, floats, doubles, vectors, matrices

7

The case for speed

• Numerical bottlenecks are good targets for optimization

• Let data scientists get faster results

• Faster turnaround for experiments

• How can we run these numerical algorithms faster?

8

Evolution of computing power

9

[Figure: scale up vs. scale out; "failure is not an option: it is a fact"; GPGPU, "when you can afford your dedicated chip"]

Evolution of computing power

10

[Figure: the same landscape with frameworks such as NLTK and Theano; today's talk: Spark + TensorFlow]

Evolution of computing power

• Processor speed cannot keep up with memory and network improvements

• Access to the processor is the new bottleneck

• Project Tungsten in Spark: leverage the processor’s heuristics for executing code and fetching memory

• Does not account for the fact that the problem is numerical

11

Asynchronous vs. synchronous

• Asynchronous algorithms perform updates concurrently

• Spark uses a synchronous model; deep learning frameworks are usually asynchronous

• A large number of ML computations are synchronous

• Even deep learning may benefit from synchronous updates

12

Outline

• Numerical computing with Apache Spark

• Using GPUs with Spark and TensorFlow

• Performance details

• The future

13

GPGPUs

14

• Graphics Processing Units for General Purpose computations

[Charts: theoretical peak throughput and theoretical peak bandwidth, GPU vs. CPU]

Google TensorFlow

• Library for writing “machine intelligence” algorithms

• Very popular for deep learning and neural networks

• Can also be used for general purpose numerical computations

• Interface in C++ and Python

15

Numerical dataflow with TensorFlow

16

import tensorflow as tf

x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")

session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})

[Graph: placeholders x:int32 and y:int32; y flows through a "mul 3" node and is added to x, producing z]

Numerical dataflow with Spark

import tensorflow as tf
import tensorframes as tfs

df = sqlContext.createDataFrame(…)

x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")

output_df = tfs.map_rows(output, df)
output_df.collect()

df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]

[Graph: the same dataflow (x:int32, y:int32, mul 3, z), now applied to the columns of the DataFrame]
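Besides the row-wise map_rows shown above, TensorFrames also exposes a block-wise API (tfs.block and tfs.map_blocks) that the kernel density example later in this talk relies on. A minimal sketch, assuming the same df with an integer column x and the imports above; the graph then runs once per block of rows rather than once per row:

import tensorflow as tf
import tensorframes as tfs

# Placeholder standing for a whole block (partition) of the column "x".
x = tfs.block(df, "x")
# The same kind of arithmetic as above, evaluated once per block.
z = tf.add(x, 3, name="z")
output_df = tfs.map_blocks(z, df)
output_df.collect()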

Outline

• Numerical computing with Apache Spark

• Using GPUs with Spark and TensorFlow

• Performance details

• The future

18

19

It is a communication problem

[Diagram: Spark worker process and Python worker process. Each row goes from the Tungsten binary format to a Java object, is pickled, crosses into the Python worker process, is unpickled, and is finally copied into a C++ buffer.]

20
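To make the overhead concrete, here is a hedged sketch of the conventional PySpark route that the diagram describes: a Python UDF forces every row through the Java object and Python pickle conversions before any numerical work happens. The column names and the +3 computation are illustrative only:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Each row is read from the Tungsten binary format as a Java object,
# pickled, shipped to the Python worker process, unpickled, evaluated,
# and serialized back the same way.
plus_three = udf(lambda x: x + 3, IntegerType())
slow_df = df.withColumn("z", plus_three(df["x"]))
slow_df.collect()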

TensorFrames: native embedding of TensorFlow

[Diagram: Spark worker process only. Data goes from the Tungsten binary format to a Java object and straight into a C++ buffer, with no Python pickling.]

An example: kernel density scoring

• Estimation of a distribution from samples

• Non-parametric

• Unknown bandwidth parameter

• Can be evaluated with goodness of fit

21

An example: kernel density scoring

• In practice, compute:

  score(x) = log( sum_k exp( -(x - z_k)^2 / (2 b^2) ) ) - log(b * N)

  with z_1, ..., z_N the observed sample points and b the bandwidth

• In a nutshell: a complex numerical function

22
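As a plain reference for the benchmarks that follow, here is a minimal NumPy sketch of the score function above (not part of the benchmark); the names points, b and N mirror the Scala and TensorFlow code on the next slides, and the log-sum-exp shift uses the maximum, as the TensorFlow version does:

import numpy as np

def score(x, points, b):
    # Log of the kernel density estimate at x (up to the Gaussian constant),
    # computed with the log-sum-exp trick for numerical stability.
    N = len(points)
    dis = -(x - points) ** 2 / (2 * b * b)
    m = dis.max()
    return m + np.log(np.exp(dis - m).sum()) - np.log(b * N)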

23

Speedup

[Chart: runtime in seconds (0 to 180) for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]

def score(x: Double): Double = {
  val dis = points.map { z_k =>
    - (x - z_k) * (x - z_k) / (2 * b * b)
  }
  val minDis = dis.min
  val exps = dis.map(d => math.exp(d - minDis))
  minDis - math.log(b * N) + math.log(exps.sum)
}

val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()

24

Speedup

[Chart: runtime in seconds (0 to 180) for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]

def score(x: Double): Double = {
  val dis = new Array[Double](N)
  var idx = 0
  while (idx < N) {
    val z_k = points(idx)
    dis(idx) = - (x - z_k) * (x - z_k) / (2 * b * b)
    idx += 1
  }
  val minDis = dis.min
  var expSum = 0.0
  idx = 0
  while (idx < N) {
    expSum += math.exp(dis(idx) - minDis)
    idx += 1
  }
  minDis - math.log(b * N) + math.log(expSum)
}

val scoreUDF = sqlContext.udf.register("scoreUDF", score _)
sql("select sum(scoreUDF(sample)) from samples").collect()

25

Speedup

[Chart: runtime in seconds (0 to 180) for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]

from tensorflow import constant, exp, identity, log, reduce_max, reduce_sum, square

def cost_fun(block, bandwidth):
    # X holds the observed sample points and N their count (defined elsewhere)
    distances = - square(constant(X) - block) / (2 * bandwidth * bandwidth)
    m = reduce_max(distances, 0)
    x = log(reduce_sum(exp(distances - m), 0))
    return identity(x + m - log(bandwidth * N), name="score")

sample = tfs.block(df, "sample")
score = cost_fun(sample, bandwidth=0.5)
df.agg(sum(tfs.map_blocks(score, df))).collect()

26

Speedup

[Chart: runtime in seconds (0 to 180) for Scala UDF, Scala UDF (optimized), TensorFrames, TensorFrames + GPU]

from tensorflow import constant, device, exp, identity, log, reduce_max, reduce_sum, square

def cost_fun(block, bandwidth):
    # X holds the observed sample points and N their count (defined elsewhere)
    distances = - square(constant(X) - block) / (2 * bandwidth * bandwidth)
    m = reduce_max(distances, 0)
    x = log(reduce_sum(exp(distances - m), 0))
    return identity(x + m - log(bandwidth * N), name="score")

with device("/gpu"):
    sample = tfs.block(df, "sample")
    score = cost_fun(sample, bandwidth=0.5)
    df.agg(sum(tfs.map_blocks(score, df))).collect()

Demo: Deep dreams

27

Demo: Deep dreams

28

Outline

• Numerical computing with Apache Spark

• Using GPUs with Spark and TensorFlow

• Performance details

• The future

29

30

Improving communication

[Diagram: Spark worker process. Tungsten binary format to Java object to C++ buffer, improved with a direct memory copy and columnar storage.]

The future

• Integration with Tungsten:
  • Direct memory copy
  • Columnar storage

• Better integration with MLlib data types

• GPU instances in Databricks: official support coming this summer

31

Recap

• Spark: an efficient framework for running computations on thousands of computers

• TensorFlow: high-performance numerical framework

• Get the best of both with TensorFrames:
  • Simple API for distributed numerical computing
  • Can leverage the hardware of the cluster

32

Try these demos yourself

• TensorFrames source code and documentation:
  github.com/tjhunter/tensorframes
  spark-packages.org/package/tjhunter/tensorframes

• Demo available later on Databricks

• The official TensorFlow website: www.tensorflow.org

• More questions and attending the Spark Summit? We will hold office hours at the Databricks booth.

33

Thank you.