Wrangle 2016: (Lightning Talk) FizzBuzz in TensorFlow


Transcript of Wrangle 2016: (Lightning Talk) FizzBuzz in TensorFlow

Fizz buzz in tensorflow
Joel Grus

Research Engineer, AI2
@joelgrus

About me

Research engineer at AI2
we're hiring!

(in Seattle)

(where normal people can afford to buy a house)

(sort of)

Previously SWE at Google, data science at VoloMetrix, Decide, Farecast/Microsoft
Wrote a book ------->

Fizz Buzz, in case you're not familiar

Write a program that prints the numbers 1 to 100, except that
if the number is divisible by 3, instead print "fizz"
if the number is divisible by 5, instead print "buzz"
if the number is divisible by 15, instead print "fizzbuzz"

weed-out problem

the backstory

Saw an online discussion about the stupidest way to solve fizz buzz

Thought, "I bet I can come up with a stupider way"Came up with a stupider wayBlog post went viralSort of a frivolous thing to use up my 15 minutes of fame

on, but so be it

super simple solution

fizzBuzz :: Integer -> String
fizzBuzz i | i `mod` 15 == 0 = "fizzbuzz"
           | i `mod` 5 == 0  = "buzz"
           | i `mod` 3 == 0  = "fizz"
           | otherwise       = show i

mapM_ (putStrLn . fizzBuzz) [1..100]

ok, then python

def fizz_buzz(i):
    if i % 15 == 0: return "fizzbuzz"
    elif i % 5 == 0: return "buzz"
    elif i % 3 == 0: return "fizz"
    else: return str(i)

for i in range(1, 101): print(fizz_buzz(i))

taking on fizz buzz as a machine learning problem

outputs

given a number, there are four mutually exclusive cases
1. output the number itself
2. output "fizz"
3. output "buzz"
4. output "fizzbuzz"

so one natural representation of the output is a vector of length 4 representing the predicted probability of each case

ground truth

def fizz_buzz_encode(i):
    if i % 15 == 0: return np.array([0, 0, 0, 1])
    elif i % 5 == 0: return np.array([0, 0, 1, 0])
    elif i % 3 == 0: return np.array([0, 1, 0, 0])
    else: return np.array([1, 0, 0, 0])

feature selection - Cheating

feature selection - cheating clever

def x(i): return np.array([1, i % 3 == 0, i % 5 == 0])

def predict(x): return np.dot(x, np.array([[ 1, 0, 0, -1], [-1, 1, -1, 1], [-1, -1, 1, 1]]))

for i in range(1, 101):
    prediction = np.argmax(predict(x(i)))
    print([i, "fizz", "buzz", "fizzbuzz"][prediction])
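
To see why that weight matrix works, here's a quick hand check (my own, not from the slides):

# i = 15: x(15) == [1, True, True], which numpy treats as [1, 1, 1]
# np.dot([1, 1, 1], matrix) == [1 - 1 - 1, 0 + 1 - 1, 0 - 1 + 1, -1 + 1 + 1] == [-1, 0, 0, 1]
# argmax is 3, so the printed label is "fizzbuzz"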

It's hard to imagine an interviewer who wouldn't be impressed by even this simple solution.

feature selection - cheating clever

[2x2 grid: divisible by 3 / not divisible by 3 across the top, divisible by 5 / not divisible by 5 down the side]

what if we aren't that clever?

binary encoding, say 10 digits (up to 1023)
1 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

2 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

3 -> [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

and so on
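
For reference, a minimal encoder along these lines (my sketch; the name binary_encode and the default of 10 digits are assumptions):

import numpy as np

def binary_encode(i, num_digits=10):
    # least-significant bit first: binary_encode(3) -> [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    return np.array([(i >> d) & 1 for d in range(num_digits)])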

in comments, someone suggested one-hot decimal encoding the digits, say up to 999
315 -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

and so on
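
A sketch of that encoding too (my own; the name decimal_onehot_encode is made up), one-hot encoding the hundreds, tens, and ones digits as three blocks of 10:

import numpy as np

def decimal_onehot_encode(i):
    # e.g. 315 -> one-hot blocks for the digits 3, 1, 5
    digits = [i // 100 % 10, i // 10 % 10, i % 10]
    encoding = np.zeros(30, dtype=int)
    for block, digit in enumerate(digits):
        encoding[10 * block + digit] = 1
    return encoding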

training data

need to generate fizz buzz for 1 to 100, so don't want to train on those

binary: train on 101 - 1023
one-hot decimal digits: train on 101 - 999

then use 1 to 100 as the test data
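
Putting the training data together for the binary case (a sketch, assuming the binary_encode and fizz_buzz_encode helpers above):

# train on 101..1023; hold out 1..100 as the test set
trX = np.array([binary_encode(i, 10) for i in range(101, 1024)])
trY = np.array([fizz_buzz_encode(i) for i in range(101, 1024)])
teX = np.array([binary_encode(i, 10) for i in range(1, 101)])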

tensorflow in one slide

import numpy as np
import tensorflow as tf

X = tf.placeholder("float", [None, input_dim])
Y = tf.placeholder("float", [None, output_dim])

beta = tf.Variable(tf.random_normal(beta_shape, stddev=0.01))

def model(X, beta): # some function of X and beta

p_yx = model(X, beta)

cost = some_cost_function(p_yx, Y)
train_op = tf.train.SomeOptimizer.minimize(cost)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(num_epochs):
        sess.run(train_op, feed_dict={X: trX, Y: trY})

the extent of what I know about TensorFlow

standard imports

placeholders for our data

parameters to learn

some parametric model applied to the symbolic variables

train by minimizing some cost function
create session and initialize variables

train using data

Visualizing the results (a hard problem by itself)

[chart of predictions vs. actuals over 1-100; black + red = predictions, black + tan = actuals; callouts mark a correct "11", a correct "fizz", an incorrect "buzz" where the actual is "fizzbuzz", and a predicted "fizz" where the actual is "buzz"]

[[30, 11, 6, 2], [12, 8, 4, 1], [ 4, 3, 2, 3], [ 4, 2, 0, 0]]

linear regression

def model(X, w, b): return tf.matmul(X, w) + b

py_x = model(data.X, w, b)

cost = tf.reduce_mean(tf.pow(py_x - data.Y, 2))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)

binary
[[54, 27, 14, 6], [ 0, 0, 0, 0], [ 0, 0, 0, 0], [ 0, 0, 0, 0]]

decimal
[[54, 27, 0, 0], [ 0, 0, 0, 0], [ 0, 0, 14, 6], [ 0, 0, 0, 0]]

black + red = predictions; black + tan = actuals

logistic regression

def model(X, w, b): return tf.matmul(X, w) + b

py_x = model(data.X, w, b)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(py_x, data.Y))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)

binary
[[54, 27, 14, 6], [ 0, 0, 0, 0], [ 0, 0, 0, 0], [ 0, 0, 0, 0]]

decimal
[[54, 27, 0, 0], [ 0, 0, 0, 0], [ 0, 0, 14, 6], [ 0, 0, 0, 0]]

black + red = predictions; black + tan = actuals

multilayer perceptron

def model(X, w_h, w_o, b_h, b_o):
    h = tf.nn.relu(tf.matmul(X, w_h) + b_h)  # 1 hidden layer with ReLU activation
    return tf.matmul(h, w_o) + b_o

py_x = model(data.X, w_h, w_o, b_h, b_o)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(py_x, data.Y))
train_op = tf.train.RMSPropOptimizer(learning_rate=0.0003, decay=0.8, momentum=0.4).minimize(cost)
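
The slide omits the parameter definitions; with 10 binary input digits and 4 output classes they would look roughly like this (my sketch; the zero-initialized biases are an assumption):

num_hidden = 100  # the slides vary this: 5, 10, 25, 50, 100, 200

w_h = tf.Variable(tf.random_normal([10, num_hidden], stddev=0.01))
b_h = tf.Variable(tf.zeros([num_hidden]))
w_o = tf.Variable(tf.random_normal([num_hidden, 4], stddev=0.01))
b_o = tf.Variable(tf.zeros([4]))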

from here on, no more decimal encoding, it's really good at "divisible by 5" and really bad at everything else

by # of hidden units (after 1000s of epochs)

[grid of confusion matrices by # of hidden units: 5, 10, 25, 50, 100, 200]

[[52, 2, 1, 0], [ 0, 25, 0, 0], [ 1, 0, 13, 0], [ 0, 0, 0, 6]]

[[45, 16, 3, 0], [ 8, 11, 1, 0], [ 0, 0, 10, 0], [ 0, 0, 0, 6]]

black + red = predictions; black + tan = actuals

deep learning

def model(X, w_h1, w_h2, w_o, b_h1, b_h2, b_o, keep_prob):
    h1 = tf.nn.dropout(tf.nn.relu(tf.matmul(X, w_h1) + b_h1), keep_prob)
    h2 = tf.nn.relu(tf.matmul(h1, w_h2) + b_h2)
    return tf.matmul(h2, w_o) + b_o

def py_x(keep_prob): return model(data.X, w_h1, w_h2, w_o, b_h1, b_h2, b_o, keep_prob)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(py_x(keep_prob=0.5), data.Y))

train_op = tf.train.RMSPropOptimizer(learning_rate=0.0003, decay=0.8, momentum=0.4).minimize(cost)

predict_op = tf.argmax(py_x(keep_prob=1.0), 1)
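
A rough sketch of how the trained net might then actually play fizz buzz on 1-100, assuming the placeholder/session skeleton from "tensorflow in one slide", the binary_encode helper above, and a hypothetical fizz_buzz_decode:

def fizz_buzz_decode(i, prediction):
    # map the predicted class index back to the printed label
    return [str(i), "fizz", "buzz", "fizzbuzz"][prediction]

teX = np.array([binary_encode(i, 10) for i in range(1, 101)])
predictions = sess.run(predict_op, feed_dict={X: teX})  # inside the training session
for i, p in zip(range(1, 101), predictions):
    print(fizz_buzz_decode(i, p))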

HIDDEN LAYERS (50% dropout in 1st hidden layer)

[100, 100]

will sometimes get it 100% right, but not reliably

[2000, 2000]

seems to get it exactly right every time (in ~200 epochs)

black + red = predictions; black + tan = actuals

But how does it work?

25-hidden-neuron shallow net was simplest interesting model

in particular, it gets all the "divisible by 15" cases exactly right
not obvious to me how to learn "divisible by 15" from binary

[[45, 16, 3, 0], [ 8, 11, 1, 0], [ 0, 0, 10, 0], [ 0, 0, 0, 6]]

black + red = predictions; black + tan = actuals

which inputs produce largest "fizz buzz" values?

(120, array([ -4.51552565, -11.66495565, -17.10086776,   0.32237191]))
(240, array([ -5.04136949, -12.02974626, -17.35017639,   0.07112655]))
(90,  array([ -4.52364648, -11.48799399, -16.91179542,  -0.20747044]))
(465, array([ -4.95231711, -11.88604214, -17.5155363 ,  -0.34996536]))
(210, array([ -5.04364677, -11.85627498, -17.17183826,  -0.4049097 ]))
(720, array([ -4.98066528, -11.68684173, -17.01117473,  -0.46671827]))
(345, array([ -4.49738021, -11.34621705, -16.88004503,  -0.4713167 ]))
(600, array([ -4.48999048, -11.30909995, -16.70980522,  -0.53889132]))
(360, array([ -9.32991992, -15.18924931, -17.8993147 ,  -4.35817601]))
(480, array([ -9.79430086, -15.72038142, -18.51560547,  -4.38727747]))
(450, array([ -9.80194752, -15.54985676, -18.32664509,  -4.89815184]))
(330, array([ -9.34660544, -15.01537882, -17.69651957,  -4.95658813]))
(960, array([ -9.74109305, -15.37921101, -18.16552369,  -4.95677615]))
(840, array([ -9.31266483, -14.83212949, -17.49181923,  -5.26606825]))
(105, array([ -8.73320381, -11.08279653,  -9.31921242,  -5.52620068]))
(225, array([ -9.22702329, -11.50045288,  -9.64725618,  -5.76014854]))
(585, array([ -8.62907369, -10.84616688,  -9.23592859,  -5.79517941]))
(705, array([ -9.12030976, -11.2651869 ,  -9.56738927,  -6.02974533]))

last column only needs to be larger than the other columns but in this case it works out -- these are all divisible by 15

notice that they cluster into similar outputs

notice also that we have pairs of numbers that differ by 120

a stray observation

If two numbers differ by a multiple of 15, they have the same fizz buzz output
If a network could ignore differences that are multiples of 15 (or 30, or 45, and so on), that could be a good start
Then it only has to learn the correct output for each equivalence class
Very few "fizz buzz" equivalence classes (see the quick check below)
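
A quick sanity check of that observation (my own snippet, not from the slides):

def fizz_buzz_class(i):
    # 0 = the number itself, 1 = "fizz", 2 = "buzz", 3 = "fizzbuzz"
    if i % 15 == 0: return 3
    elif i % 5 == 0: return 2
    elif i % 3 == 0: return 1
    else: return 0

# numbers that differ by a multiple of 15 always land in the same class,
# so there are only 15 equivalence classes (one per residue mod 15)
assert all(fizz_buzz_class(i) == fizz_buzz_class(i + 15 * k)
           for i in range(1, 101) for k in range(1, 10))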

two-bit SWAPS that are congruent mod 15

-8 +128 = +120
120 [0 0 0 1 1 1 1 0 0 0]
240 [0 0 0 0 1 1 1 1 0 0]

+2 -32 = -30 (from 120/240)
90 [0 1 0 1 1 0 1 0 0 0]
210 [0 1 0 0 1 0 1 1 0 0]

-32 +512 = +480 (from 120/240)
600 [0 0 0 1 1 0 1 0 0 1]
720 [0 0 0 0 1 0 1 1 0 1]

+1 -256 = -255 (from 600/720)
345 [1 0 0 1 1 0 1 0 1 0]
465 [1 0 0 0 1 0 1 1 1 0]

-8 +128
360 [0 0 0 1 0 1 1 0 1 0]
480 [0 0 0 0 0 1 1 1 1 0]

330 [0 1 0 1 0 0 1 0 1 0]
450 [0 1 0 0 0 0 1 1 1 0]

840 [0 0 0 1 0 0 1 0 1 1]
960 [0 0 0 0 0 0 1 1 1 1]

105 [1 0 0 1 0 1 1 0 0 0]
225 [1 0 0 0 0 1 1 1 0 0]

-32 +512
585 [1 0 0 1 0 0 1 0 0 1]
705 [1 0 0 0 0 0 1 1 0 1]

any neuron with the same weight on those two inputs will produce the same outcome if they're swapped
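
Concretely (my own toy check, assuming the binary_encode sketch above): give a neuron equal weights on bit 3 and bit 7 and its pre-activation can't tell 120 from 240:

import numpy as np

w = np.random.randn(10)
w[7] = w[3]  # equal weight on the two bits that get swapped (-8, +128)

assert np.isclose(np.dot(w, binary_encode(120, 10)),   # [0 0 0 1 1 1 1 0 0 0]
                  np.dot(w, binary_encode(240, 10)))   # [0 0 0 0 1 1 1 1 0 0]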

if you want to drive yourself mad, spend a few hours staring at the neuron weights themselves!

lessons learned

It's hard to turn a joke blog post into a talk
Feature selection is important (we already knew that)
Stupid problems sometimes contain really interesting subtleties
Sometimes "black box" models actually reveal those subtleties if you look at them the right way

sorry for not being just a joke talk!

thanks!

code: github.com/joelgrus
blog: joelgrus.com
twitter: @joelgrus (will tweet out link to slides, so go follow!)

book: --------------------------->
(might add a chapter about slides, so go buy just in case!)