TensorFrames: Google Tensorflow on Apache Spark

TensorFrames: Google Tensorflow on Apache Spark

Tim HunterMeetup 08/2016 - Salesforce

How familiar are you with Spark?

1. What is Apache Spark?

2. I have used Spark

3. I am using Spark in production or I contribute to its development

2

How familiar are you with TensorFlow?

1. What is TensorFlow?

2. I have heard about it

3. I am training my own neural networks

3

Founded by the team who created Apache Spark

Offers a hosted service: - Apache Spark in the cloud - Notebooks - Cluster management - Production environment

About Databricks

4

Software engineer at Databricks

Apache Spark contributor

Ph.D. UC Berkeley in Machine Learning

(and Spark user since Spark 0.5)

About me

5

Outline•Numerical computing with Apache Spark

•Using GPUs with Spark and TensorFlow

•Performance details

•The future

6

Numerical computing for Data Science

•Queries are data-heavy

•However algorithms are computation-heavy

•They operate on simple data types: integers, floats, doubles, vectors, matrices

7

The case for speed•Numerical bottlenecks are good targets for

optimization

• Let data scientists get faster results

• Faster turnaround for experimentations

•How can we run these numerical algorithms faster?

8

Evolution of computing power

9

Failure is not an option: it is a fact

When you can afford your dedicated chip

GPGPU

Scale out

Scal

e up

Evolution of computing power

10

NLTKTheano

Today’s talk:Spark + TensorFlow

Evolution of computing power• Processor speed cannot keep up with memory and

network improvements

• Access to the processor is the new bottleneck

• Project Tungsten in Spark: leverage the processor’s heuristics for executing code and fetching memory

• Does not account for the fact that the problem is numerical

11

Asynchronous vs. synchronous

• Asynchronous algorithms perform updates concurrently• Spark is synchronous model, deep learning frameworks

usually asynchronous• A large number of ML computations are synchronous• Even deep learning may benefit from synchronous updates

12

http://arxiv.org/abs/1604.00981




•The future

13

GPGPUs

14

•Graphics Processing Units for General Purpose computations

Series1

6000

Theoretical peakthroughput

GPU CPU

Series1

Theoretical peakbandwidth

GPU CPU

• Library for writing “machine intelligence” algorithms

• Very popular for deep learning and neural networks

•Can also be used for general purpose numerical computations

• Interface in C++ and Python

15

Google TensorFlow

Numerical dataflow with Tensorflow

16

x = tf.placeholder(tf.int32, name=“x”)y = tf.placeholder(tf.int32, name=“y”)output = tf.add(x, 3 * y, name=“z”)

session = tf.Session()output_value = session.run(output, {x: 3, y: 5})

x:int32

y:int32

mul 3

z

Numerical dataflow with Spark

df = sqlContext.createDataFrame(…)

x = tf.placeholder(tf.int32, name=“x”)y = tf.placeholder(tf.int32, name=“y”)output = tf.add(x, 3 * y, name=“z”)

output_df = tfs.map_rows(output, df)

output_df.collect()

df: DataFrame[x: int, y: int]

output_df: DataFrame[x: int, y: int, z: int]

x:int32

y:int32

mul 3

z

Demo

18




•The future

19

20

It is a communication problem

Spark worker process Worker python process

C++buffer

Python pickle

Tungsten binary format

Python pickle

Javaobject

21

TensorFrames: native embedding of TensorFlow

Spark worker process

C++buffer


Javaobject

• Estimation of distribution from samples•Non-parametric•Unknown bandwidth

parameter•Can be evaluated with

goodness of fit

An example: kernel density scoring

22

• In practice, compute:

with:

• In a nutshell: a complex numerical function

An example: kernel density scoring

23

24

Speedup

Scala UDF Scala UDF (optimized) TensorFrames TensorFrames + GPU0

60

120

180

Run

time

(sec

)

def score(x: Double): Double = { val dis = points.map { z_k => - (x - z_k) * (x - z_k) / ( 2 * b * b) } val minDis = dis.min val exps = dis.map(d => math.exp(d - minDis)) minDis - math.log(b * N) + math.log(exps.sum)}

val scoreUDF = sqlContext.udf.register("scoreUDF", score _)sql("select sum(scoreUDF(sample)) from samples").collect()

25

Speedup


60

120

180

Run

time

(sec

)def score(x: Double): Double = { val dis = new Array[Double](N) var idx = 0 while(idx < N) { val z_k = points(idx) dis(idx) = - (x - z_k) * (x - z_k) / ( 2 * b * b) idx += 1 } val minDis = dis.min var expSum = 0.0 idx = 0 while(idx < N) { expSum += math.exp(dis(idx) - minDis) idx += 1 } minDis - math.log(b * N) + math.log(expSum)}

val scoreUDF = sqlContext.udf.register("scoreUDF", score _)sql("select sum(scoreUDF(sample)) from samples").collect()

26

Speedup


60

120

180

Run

time

(sec

)def cost_fun(block, bandwidth): distances = - square(constant(X) - sample) / (2 * b * b) m = reduce_max(distances, 0) x = log(reduce_sum(exp(distances - m), 0)) return identity(x + m - log(b * N), name="score”)

sample = tfs.block(df, "sample")score = cost_fun(sample, bandwidth=0.5)df.agg(sum(tfs.map_blocks(score, df))).collect()

27

Speedup


60

120

180

Run

time

(sec

)def cost_fun(block, bandwidth): distances = - square(constant(X) - sample) / (2 * b * b) m = reduce_max(distances, 0) x = log(reduce_sum(exp(distances - m), 0)) return identity(x + m - log(b * N), name="score”)

with device("/gpu"): sample = tfs.block(df, "sample") score = cost_fun(sample, bandwidth=0.5)df.agg(sum(tfs.map_blocks(score, df))).collect()

Demo: Deep dreams

28

Demo: Deep dreams

29




•The future

30

31

Improving communication

Spark worker process

C++buffer


Javaobject

Direct memory copy

Columnarstorage

The future• Integration with Tungsten:• Direct memory copy• Columnar storage

•Better integration with MLlib data types

•GPU instances in Databricks: Official support coming this fall

32

Recap•Spark: an efficient framework for running

computations on thousands of computers

•TensorFlow: high-performance numerical framework

•Get the best of both with TensorFrames:• Simple API for distributed numerical computing• Can leverage the hardware of the cluster

33

Try these demos yourself•TensorFrames source code and documentation:

github.com/databricks/tensorframesspark-packages.org/package/databricks/tensorframes

•Demo notebooks available on Databricks

•The official TensorFlow website: www.tensorflow.org

34

https://github.com/tjhunter/tensorframes

https://spark-packages.org/package/tjhunter/tensorframes

http://www.tensorflow.org/

Spark Summit EU 2016 15% Discount Code: DatabricksEU16

35

Thank you.

TensorFrames: Google Tensorflow on Apache Spark

Software

Transcript of TensorFrames: Google Tensorflow on Apache Spark