Transcript of Atlanta Hadoop Users Meetup 09 21 2016

Page 1

TENSORFLOW + SPARK DATAFRAMES =

TENSORFRAMES

Atlanta Hadoop Users Group, Sept 21, 2016

Chris Fregly, Research Scientist @ http://pipeline.io

Thank You for Hosting, HashMap!!

Page 2

WHO AM I

Chris Fregly

• Currently

Research Scientist @ PipelineIO (http://pipeline.io)

Contributor @ Apache Spark

Committer @ Netflix Open Source

Founder @ Advanced Spark and TensorFlow Meetup

Author @ Advanced Spark (http://advancedspark.com)

Creator @ PANCAKE STACK (http://pancake-stack.com)

• Previously

Streaming Data Engineer @ Netflix, Databricks, IBM Spark

Page 3

ADVANCED SPARK AND TENSORFLOW MEETUP

4,400+ Members!

Top 4 Spark Meetup!!

(Charts: Github Repo Stars + Forks; DockerHub Repo Pulls)

Page 4

SPARK MEETUP TOMORROW NIGHT!

Page 5

MLCONF 2 DAYS FROM NOW!!

Page 6

CURRENT PIPELINE.IO RESEARCH

• Model Deploying and Testing

• Model Scaling and Serving

• Online Model Training

• Dynamic Model Optimizing

Page 7

PIPELINE.IO DELIVERABLES

• 100% Open Source!!

• Github: https://github.com/fluxcapacitor/

• DockerHub: https://hub.docker.com/r/fluxcapacitor

Page 8

PIPELINE.IO WORKSHOPS

Page 9

AGENDA

• Neural Networks

• GPUs

• Tensorflow

• TensorFrames

Page 10

WHAT ARE NEURAL NETWORKS?

• Like All Machine Learning, the Goal is to Minimize Loss (Error)

• Mostly Supervised Learning Classification

• Many labeled training samples exist

• Training (see the sketch after this list)

• Step 1: Start with Random Guesses for Input Weights

• Step 2: Calculate Error Against Labeled Data

• Step 3: Determine Gradient Amount and Direction (+ or -)

• Step 4: Back-propagate Gradient to Update Each Input Weight

• Step 5: Repeat Steps 2-4 until Convergence or Max Epochs Reached
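
A minimal TF 1.x-style sketch of this training loop (toy linear-regression data; the names and hyperparameters are invented for illustration, not from the talk):

```python
import numpy as np
import tensorflow as tf  # 2016-era TF 1.x API

# Toy labeled training samples: y = 3x + 1 plus noise (made up).
x_train = np.random.rand(100).astype(np.float32)
y_train = 3.0 * x_train + 1.0 + np.random.normal(0, 0.1, 100).astype(np.float32)

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

# Step 1: start with random guesses for the input weights.
w = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))

# Step 2: calculate error (loss) against the labeled data.
loss = tf.reduce_mean(tf.square(w * x + b - y))

# Steps 3-4: the optimizer computes gradients and back-propagates them.
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Step 5: repeat until convergence or max epochs reached.
    for epoch in range(200):
        sess.run(train_op, feed_dict={x: x_train, y: y_train})
    print(sess.run([w, b]))  # should approach [3.0] and [1.0]
```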

Page 11

BACK PROPAGATION

http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html

Chain Rule
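
To make the chain rule concrete, here is a tiny hand-rolled back-propagation through one sigmoid neuron (plain NumPy; all values are made up for the example):

```python
import numpy as np

# Forward pass through a single sigmoid neuron: y = sigmoid(w*x + b).
x, w, b, target = 0.5, 0.8, 0.1, 1.0
z = w * x + b
y = 1.0 / (1.0 + np.exp(-z))
loss = 0.5 * (y - target) ** 2

# Backward pass: the chain rule multiplies local gradients step by step.
dloss_dy = y - target            # d(loss)/dy
dy_dz = y * (1.0 - y)            # sigmoid derivative: dy/dz
dloss_dw = dloss_dy * dy_dz * x  # full chain: d(loss)/dw, since dz/dw = x
dloss_db = dloss_dy * dy_dz      # dz/db = 1

# Gradient descent update (Step 4 from the earlier slide).
learning_rate = 0.1
w -= learning_rate * dloss_dw
b -= learning_rate * dloss_db
```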

Page 12

CONVOLUTIONAL NEURAL NETWORKS

• Apply Many Layers (a.k.a. Filters) to Input (see the sketch after this list)

• Each Layer/Filter Picks up on Features

• Features not necessarily human-grokkable

• Brute Force: Try Different numLayers & layerSizes

• Filter Examples

• 3 Color Filters: RGB

• Moving AVG for Time Series
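
A minimal sketch of one convolutional layer in 2016-era TF 1.x code; the shapes (28x28 RGB images, 16 filters of size 5x5) are assumptions for illustration:

```python
import tensorflow as tf  # TF 1.x-era API

# A batch of 32 RGB images: 28x28 pixels, 3 color channels (R, G, B).
images = tf.placeholder(tf.float32, shape=[32, 28, 28, 3])

# One layer of 16 learned 5x5 filters; each filter picks up on a feature.
filters = tf.Variable(tf.truncated_normal([5, 5, 3, 16], stddev=0.1))

# Slide every filter across the input; the output has 16 feature maps.
feature_maps = tf.nn.conv2d(images, filters,
                            strides=[1, 1, 1, 1], padding='SAME')
activations = tf.nn.relu(feature_maps)
```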

Page 13

MY FAVORITE USE CASE – STITCH FIX

StitchFix, Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!

Page 14

RECURRENT NEURAL NETWORKS

Maintain State

Enables Learning of Sequential Patterns

Useful for Text/NLP Prediction (see the sketch below)
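
A minimal NumPy sketch of the state an RNN carries across time steps (sizes, names, and the toy sequence are invented for illustration):

```python
import numpy as np

hidden_size, vocab_size = 8, 4  # made-up sizes
W_xh = np.random.randn(hidden_size, vocab_size) * 0.01   # input  -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
b_h = np.zeros(hidden_size)

sequence = [np.eye(vocab_size)[i] for i in [0, 1, 2, 1]]  # toy one-hot inputs

h = np.zeros(hidden_size)  # the state, carried from step to step
for x_t in sequence:
    # The new state depends on the current input AND the previous state,
    # which is what lets the network learn sequential patterns.
    h = np.tanh(W_xh.dot(x_t) + W_hh.dot(h) + b_h)
```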

Page 15

CHARACTER RNNS

Preserving state differentiates between the 1st and 2nd 'l' to improve prediction

Page 16

AGENDA

• Neural Networks

• GPUs

• Tensorflow

• TensorFrames

Page 17

CPU VS GPU

• GPUs are Fundamentally Different from CPUs

• Therefore, GPU/CUDA Programming is Fundamentally Different

Page 18

SINGLE INSTRUCTION, MULTIPLE DATA (SIMD)

Page 19

MINIMIZE DATA DEPENDENCIES

• More natural for structured, independent data

• Tasks perform identical instructions in parallel on same-structured data

• Reduce data dependencies as they limit parallelism

(Diagram: typical data dependencies are on a previous instruction or a previous loop iteration; see the sketch below)
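
A small illustration (plain Python/NumPy, invented for this point) of a loop-carried dependency versus independent, same-instruction work:

```python
import numpy as np

data = np.arange(1000000, dtype=np.float64)

# Loop-carried dependency: each iteration needs the previous result,
# so the work cannot be spread across thousands of GPU cores.
running = np.empty_like(data)
acc = 0.0
for i, v in enumerate(data):
    acc = acc * 0.9 + v       # depends on the previous loop iteration
    running[i] = acc

# Independent elements: the same instruction applies to every element,
# which maps naturally onto SIMD-style hardware.
scaled = data * 2.0 + 1.0     # every element can be computed in parallel
```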

Page 20

MEMORY AND CORES

Page 21

EXPLORE YOUR SURROUNDINGS

`nvidia-smi`

Page 22

AGENDA

• Neural Networks

• GPUs

• Tensorflow

• TensorFrames

Page 23

WHAT IS TENSORFLOW?

• Google Open Source, General-Purpose Numerical Computation Engine

• Happens to be Good for Neural Networks!

• Tooling: Tensorboard (port 6006 == `goog` upside down!)

• DAG-Based, like Spark!

• Computation graph is the logical plan (see the sketch after this list)

• Stored as Protobufs

• Tensorflow converts the logical plan to a physical plan

• Lots of Libraries: TFLearn (Tensorflow's Scikit-learn Impl), Tensorflow Serving (Prediction Layer)

• Distributed and GPU-Optimized (CUDA/cuDNN)
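
A minimal TF 1.x-style sketch of building a computation graph (the logical plan) and inspecting its protobuf form; the op names are invented for illustration:

```python
import tensorflow as tf  # TF 1.x-era API

graph = tf.Graph()
with graph.as_default():
    a = tf.constant(3.0, name='a')
    b = tf.constant(4.0, name='b')
    c = tf.add(a, b, name='c')   # nothing executes yet: this is just a DAG

# The logical plan is stored as a protobuf (GraphDef).
print(graph.as_graph_def())

# Execution (the physical plan) happens only inside a session.
with tf.Session(graph=graph) as sess:
    print(sess.run(c))  # 7.0
```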

Page 24

DEMO!

Tensorflow Fundamentals

Page 25

DEMO!

AWS + GPU + Docker + Tensorflow

Page 26

DEMO!

Tensorflow Serving

Page 27

AGENDA

• Neural Networks

• GPUs

• Tensorflow

• TensorFrames

Page 28

WHAT ARE TENSORFRAMES?

• Bridge between Spark (JVM) and Tensorflow (C++)

• Python and Scala Bindings for Application Code

• Uses JavaCPP for JNI-level Integration

• Must Install TensorFrames C++ Runtime Libs on All Spark Workers

• Developed by a Former Co-worker @ Databricks, Tim Hunter

• PhD in Tensors: He's "Mr. Tensor"
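
A minimal TensorFrames sketch in Python, modeled on the style of the project's README (the toy DataFrame is invented, exact APIs may differ by version, and `sqlContext` is the ambient SQLContext of a 2016-era PySpark shell):

```python
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

# A Spark DataFrame with one numeric column 'x' (toy data).
df = sqlContext.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # A Tensorflow placeholder bound to the DataFrame column 'x'.
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name='z')   # the Tensorflow computation
    df2 = tfs.map_blocks(z, df)  # run it over every partition

df2.show()  # new column 'z' = x + 3
```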

Page 29

WHY TENSORFRAMES?

• Why Not?!

• Best of Both Worlds: Legacy Spark Support + Tensorflow

• Mix and Match Spark ML + Tensorflow AI on Same Data

• Tensorflow is DAG-based Similar to Spark

• Enables Data-Parallel Model Training

Page 30

DATA-PARALLEL MODEL TRAINING

• Large Datasets are Partitioned Across the HDFS Cluster

• Computation Graph (Logical Plan) Passed to Spark Workers

• Workers Train on Each Data Partition in Parallel

• Workers Periodically Aggregate (i.e., AVG) Results

• Aggregations happen in “Parameter Server”

• Spark Master/Driver is Parameter Server

The Computation Graph (logical plan) is passed to every Spark Worker
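
A toy, framework-free sketch of this pattern (all names and data invented; plain Python stands in for the Spark workers, and the driver plays the parameter server):

```python
import numpy as np

def train_on_partition(partition, weights):
    """Each worker runs local SGD on its own data partition."""
    w = weights.copy()
    for x, y in partition:
        grad = 2 * (w.dot(x) - y) * x   # gradient of squared error
        w -= 0.01 * grad                # local update
    return w

rng = np.random.RandomState(0)
partitions = [[(rng.rand(3), rng.rand()) for _ in range(100)]
              for _ in range(4)]        # 4 toy "HDFS partitions"

# Driver (parameter server) broadcasts weights, then averages the results.
weights = np.zeros(3)
for _ in range(10):                     # periodic aggregation rounds
    worker_weights = [train_on_partition(p, weights) for p in partitions]
    weights = np.mean(worker_weights, axis=0)  # the AVG aggregation
```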

Page 31

TENSORFLOW + MULTIPLE HOSTS/GPUS

Multi-GPU, Data-Parallel Training

Step 1: CPU transfers model replica and (initial) gradients to each GPU

Step 2: CPU synchronizes and waits for all GPUs to process batch

Step 3: CPU copies all training results (gradients) back from GPU

Step 4: CPU averages gradients from all GPUs

Step 5: Repeat Step 1 with (new) gradients

Code

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py
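
In the spirit of the linked CIFAR-10 example, a simplified sketch of Step 4 (averaging the per-GPU gradients); this is an illustration, not the file's exact code:

```python
import tensorflow as tf  # TF 1.x-era API

def average_gradients(tower_grads):
    """tower_grads: one list of (gradient, variable) pairs per GPU."""
    averaged = []
    # zip(*...) groups the gradients for the same variable across GPUs.
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        avg = tf.add_n(grads) / float(len(grads))   # Step 4: CPU averages
        averaged.append((avg, grads_and_vars[0][1]))
    return averaged
```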

Page 32

TENSORFRAME PERFORMANCE

• Depends on Algorithm and Dataset, of course!

• TensorFrames Requires Extra Serialization Between the JVM and C++

• What about Python Serialization from Python Bindings?

• Should be minimal unless using Python UDFs

• PySpark keeps small logical plan in Python layer

• Physical operations happen in JVM (except Python UDFs!)
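
A short PySpark illustration of the difference; it assumes an existing DataFrame `df` with a numeric column "x" (names invented for the example):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Built-in expression: the physical work stays in the JVM.
df.select(F.col("x") + 1)

# Python UDF: every row round-trips JVM -> Python -> JVM (extra serialization).
plus_one = F.udf(lambda v: v + 1.0, DoubleType())
df.select(plus_one(F.col("x")))
```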

Page 33

DEMO!

TensorFrames in Python and Scala

Page 34

THANK YOU!!

Chris Fregly, Research Scientist @ PipelineIO

• LinkedIn: https://linkedin.com/in/cfregly

• Twitter: @cfregly

http://pipeline.io