Deep Learning Applications (dadada2017)
Deep Learning Applications (in industries and elsewhere)
Abhishek Thakur (@abhi1thakur)
About me
● Chief Data Scientist @ Boost AI
● Machine learning enthusiast
● Kaggle junkie (highest world rank #3)
● Interested in:
  ○ Automatic machine learning
  ○ Large-scale classification of text data
  ○ Chatbots

"I like big data and I cannot lie"
Agenda
● Brief introduction to deep learning
● Implementation of deepnets
● Fine-tuning of pre-trained networks
● 4 different industrial use cases
● No maths!
What is deep learning?
● A buzzword
● Neural networks
● Removes manual feature-extraction steps
● Not a black box
How have convnets evolved?
(architecture diagrams: 1989, 2012, 2014)
What can deep learning do?
● Natural language processing
● Speech processing
● Computer vision
● And more and more
How can I implement my own DeepNets?
● Implement them on your own
  ○ Decompose into smaller parts
  ○ Implement layers
  ○ Start training
● Save yourself some time and finetune
  ○ Convert data
  ○ Define net
  ○ Define solver
  ○ Train
● Caffe (caffe.berkeleyvision.org)
● Keras (www.keras.io)
Caffe
● Speed
● Openness
● Modularity
● Expression - no coding knowledge? No problem!
● Community
What do you need for Caffe?
● Convert data
● Define a network (prototxt)
● Define a solver (prototxt)
● Train the network (with or without pre-trained weights)
Prototxt
● solver.prototxt
● train.prototxt
● train_val.prototxt
Training a net using Caffe

/PATH_TO_CAFFE/caffe train --solver=solver.prototxt
Fine Tuning!
● Fine-tuning using GoogLeNet
● Why?
  ○ It has Google in its name
  ○ It won ILSVRC 2014
  ○ It’s complicated and I wanted to play with it
● The Caffe Model Zoo offers a lot of pretrained nets, including GoogLeNet
● Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
Honey Bee vs. Bumble Bee
● The Metis Challenge: Naive Bees Classifier @ DrivenData.org
An initial model
(model architecture diagram)
Steps to finetune
● Create training and test files
● Get the prototxt files from the Model Zoo
● Modify them
● Run the Caffe solver
Generating training and validation sets
Changes in train_val.prototxt
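The edited prototxt from the slides is not legible here. The standard fine-tuning edits are: point the data layers at your own train/val files, then rename the final classifier layer and set num_output to your class count (2 for honey bee vs. bumble bee), so Caffe initializes it fresh instead of copying the 1000-way ImageNet weights. Roughly (layer and blob names as in BVLC GoogLeNet; lr_mult values illustrative):

```protobuf
layer {
  name: "loss3/classifier_bees"   # renamed from "loss3/classifier"
  type: "InnerProduct"
  bottom: "pool5/7x7_s1"
  top: "loss3/classifier_bees"
  param { lr_mult: 10 decay_mult: 1 }   # learn faster than the copied layers
  param { lr_mult: 20 decay_mult: 0 }
  inner_product_param {
    num_output: 2                 # was 1000 (ImageNet classes)
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}
```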
Changes in solver.prototxt

And that’s all!
Finetune your network

/PATH_TO_CAFFE/caffe train -solver ./solver.prototxt -weights ./models/bvlc_googlenet.caffemodel
Did the net learn something new?
(learned-filter visualizations)
Breaking down the various layers of GoogLeNet
(filter visualizations: random vs. pretrained vs. finetuned)
● inception_3a
● inception_3b
● inception_4a
● inception_4b
● inception_4c
● inception_4d
● inception_4e
● inception_5a
● inception_5b
Why finetune?
● It is faster
● It is better (most of the time)
● Why reinvent the wheel?
Tell me how to train a deepnet in Python!
● Caffe has a Python interface
● Tensorflow
● Theano
● Lasagne
● Keras
● Neon
● And lots more…
Classifying Search Queries

Why classify search queries?
● For businesses
  ○ Find out user intent
  ○ Track keywords according to the user’s transactional buying cycle
  ○ Optimize website content and focus on a smaller keyword set
● For data scientists
  ○ 100s of millions of unlabeled keywords to play with
  ○ Why not!
Word2Vec in Search Queries
(embedding visualizations)
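The embedding plots from the slides are lost in this transcript. As a minimal illustration of representing a query with word2vec, here is a sketch that averages per-word vectors, using a made-up toy vocabulary (a real pipeline would train word2vec on a large query/title corpus):

```python
import numpy as np

# Toy 4-dimensional "word2vec" embeddings -- stand-ins for vectors a
# real word2vec model would learn; values are invented.
embeddings = {
    "white": np.array([0.2, 0.8, 0.1, 0.0]),
    "house": np.array([0.3, 0.7, 0.0, 0.1]),
    "apple": np.array([0.9, 0.0, 0.4, 0.2]),
    "juice": np.array([0.8, 0.1, 0.5, 0.3]),
}

def query_vector(query, embeddings, dim=4):
    """Represent a search query as the average of its word vectors,
    skipping out-of-vocabulary words."""
    vecs = [embeddings[w] for w in query.lower().split() if w in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

v = query_vector("the white house", embeddings)
print(v.shape)  # (4,)
```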
Feeding Data to LSTMs
● Example query: “the white house”
● Each word is mapped to its word2vec vector, giving a sequence for the LSTM
● Related terms in the embedding space:
  ❖ United States
  ❖ President
  ❖ Politician
  ❖ Washington
  ❖ Lawyer
  ❖ Secretary
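The step above can be sketched as follows: turn each query into a fixed-shape array of word vectors, zero-padded so variable-length queries can be batched for an LSTM. The `pad_sequences` helper and the toy embeddings are illustrative, not from the slides:

```python
import numpy as np

def pad_sequences(queries, embeddings, max_len=5, dim=4):
    """Map each query to a (max_len, dim) array of word vectors,
    left-padded with zeros, ready to batch into an LSTM."""
    batch = np.zeros((len(queries), max_len, dim))
    for i, q in enumerate(queries):
        vecs = [embeddings.get(w, np.zeros(dim)) for w in q.lower().split()]
        vecs = vecs[-max_len:]                  # truncate long queries
        batch[i, max_len - len(vecs):] = vecs   # left-pad with zeros
    return batch

# toy embeddings: "the" is out-of-vocabulary and maps to zeros
embeddings = {"white": np.ones(4), "house": np.ones(4) * 2}
X = pad_sequences(["the white house"], embeddings)
print(X.shape)  # (1, 5, 4)
```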
Performance of the Network
(results chart: navigational, transactional, and informational queries across the awareness, evaluation, decision, and retention stages)
Representing Queries as Images
● Word2Vec representations of the top search-result titles, rendered as images
● Example queries: “David Villa”, “Apple juice”, “Irish”

I don’t see much difference!
● “Guild Wars” or “Apple Juice”?
Machine Learning Models
● Boosted trees
  ○ Word2vec embeddings
  ○ Titles from top results
  ○ Additional features of the SERP page
  ○ TF-IDF
  ○ XGBoost! (https://github.com/dmlc/xgboost)
Machine Learning Models
● Convolutional Neural Networks:
  ○ Using images directly
  ○ Using random crops from the image
Neural Networks with Keras
● https://github.com/fchollet/keras
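The Keras model code from the slides is not legible in this transcript. A small CNN in the same spirit might look like the following; the input size, layer widths, and the 2-class output are made-up placeholders:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small image classifier in Keras -- sizes are illustrative,
# not the network from the slides.
model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),        # e.g. a query-as-image input
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),  # e.g. transactional vs. not
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```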
Approaching “any” ML problem
(framework flowchart)

AutoCompete: A Framework for Machine Learning Competitions, A. Thakur and A. Krohn-Grimberghe, ICML AutoML Workshop, 2015
Optimizing neural networks

AutoML Challenge: Rules for tuning Neural Networks, A. Thakur, ICML AutoML Workshop, System Description Track, 2016
Selecting NNet Architecture
● Always use SGD or Adam (for fast convergence)
● Start low:
  ○ Single layer with 120-500 neurons
  ○ Batch normalization + ReLU
  ○ Dropout: 10-20%
● Add new layer:
  ○ 1200-1500 neurons
  ○ High dropout: 40-50%
● Very big network:
  ○ 8000-10000 neurons in each layer
  ○ 60-80% dropout
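The recipe above can be sketched in Keras. This is only the "start low" stage, with illustrative sizes picked from inside the suggested ranges; `start_low` is a hypothetical helper, not code from the talk:

```python
from tensorflow import keras
from tensorflow.keras import layers

def start_low(input_dim, n_classes, neurons=256, dropout=0.2):
    """'Start low': one hidden layer of 120-500 neurons,
    batch norm + ReLU, 10-20% dropout, Adam for fast convergence."""
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(neurons),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

model = start_low(input_dim=100, n_classes=10)
```

Growing the net then means repeating the Dense/BatchNorm/ReLU/Dropout block with 1200-1500 neurons and 40-50% dropout, and so on.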
The AutoML Challenge

Some Results
(leaderboards: AutoML Final1, AutoML Final4, and AutoML GPU Track)
@abhi1thakur
10 Things You Didn’t Know About Clickbaits!
![Page 131: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/131.jpg)
What are clickbaits?
● 10 things Apple didn’t tell you about the new iPhone
● What happened next will surprise you
● This is what the actor/actress from the 90s looks like now
● What did Donald Trump just say about Obama and Clinton
● 9 things you must have to be a good data scientist
![Page 132: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/132.jpg)
What are clickbaits?
![Page 133: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/133.jpg)
What are clickbaits?
● Interesting titles
● Frustrating titles
● Seldom good enough content
● Google penalizes clickbait content
● Facebook does the same
![Page 134: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/134.jpg)
The data
● Crawl Buzzfeed, Clickhole
● Crawl The New York Times, CNN
● ~10000 titles
○ Clickbaits: Buzzfeed, Clickhole
○ Non-clickbaits: The New York Times, CNN
○ ~5000 from each category
![Page 135: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/135.jpg)
Good old TF-IDF
● Very powerful
● Used both character and word analyzers
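Combining character and word analyzers can be sketched with scikit-learn; the titles and parameter choices below are illustrative assumptions, not the original setup:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "10 things Apple didn't tell you about the new iPhone",
    "What happened next will surprise you",
    "Senate passes budget bill after lengthy debate",
    "Scientists report new findings on climate change",
]

# Word-level and character-level TF-IDF, concatenated side by side
word_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
char_tfidf = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))

features = hstack([word_tfidf.fit_transform(titles),
                   char_tfidf.fit_transform(titles)])
print(features.shape)  # one row per title
```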
![Page 136: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/136.jpg)
Some interesting words
![Page 137: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/137.jpg)
Some interesting words
![Page 138: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/138.jpg)
Let’s build some models
![Page 139: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/139.jpg)
Logistic Regression
● ROC AUC Score = 0.987319021551
● Precision Score = 0.950326797386
● Recall Score = 0.939276485788
● F1 Score = 0.944769330734
![Page 140: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/140.jpg)
XGBoost
● ROC AUC Score = 0.969700677962
● Precision Score = 0.95756718529
● Recall Score = 0.874677002584
● F1 Score = 0.914247130317
![Page 141: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/141.jpg)
Is that it?
● No! Model predictions:
○ “Donald Trump”: 15% clickbait
○ “Barack Obama”: 80% clickbait
● Something was very wrong!
● TF-IDF didn’t capture meaning
![Page 142: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/142.jpg)
Word2Vec
● Shallow neural networks
● Generates a high-dimensional vector for every word
● Every word gets a position in space
● Similar words cluster together
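The “similar words cluster together” idea can be illustrated with cosine similarity on toy vectors; the 3-d vectors below are made up, whereas real word2vec embeddings are learned from a corpus (typically 100-300 dimensions):

```python
import numpy as np

# Hypothetical toy embeddings, chosen so that 'king' and 'queen' point the same way
vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.85, 0.75, 0.2]),
    'apple': np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words should score higher than unrelated ones
print(cosine_similarity(vectors['king'], vectors['queen']))
print(cosine_similarity(vectors['king'], vectors['apple']))
```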
![Page 143: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/143.jpg)
Word2Vec
![Page 144: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/144.jpg)
XGBoost + W2V
● ROC AUC Score = 0.981312768055
● Precision Score = 0.939947780679
● Recall Score = 0.93023255814
● F1 Score = 0.935064935065
![Page 145: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/145.jpg)
Performance
● Fast to train
● Good results
![Page 146: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/146.jpg)
![Page 147: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/147.jpg)
Does word2vec capture everything?
Do we have all we need only from titles?
What if content of website isn’t clickbait-y?
![Page 148: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/148.jpg)
The data
● Crawl Buzzfeed, NYT, CNN, Clickhole, etc.
● Too much work
● Simple models
● Doubts about results
● Crawl public Facebook pages:
○ Buzzfeed
○ CNN
○ The New York Times
○ Clickhole
○ StopClickBaitOfficial
○ Upworthy
○ Wikinews
Facebook page scraper is available here: https://github.com/minimaxir/facebook-page-post-scraper
![Page 149: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/149.jpg)
The data
● link_name (the title of the URL shared)
● status_type (whether it’s a link, photo, or video)
● status_link (the actual URL)
![Page 150: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/150.jpg)
![Page 151: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/151.jpg)
Data Processing
● Get the HTML content too
● Clean the mess up!
![Page 152: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/152.jpg)
Feature Generation
● Size of the HTML (in bytes)
● Length of HTML
● Total number of links
● Total number of buttons
● Total number of inputs
● Total number of unordered lists
● Total number of ordered lists
● Total number of lists (ordered + unordered)
● Total number of H1 tags
● Total number of H2 tags
● Full length of all text in all H1 tags that were found
● Full length of all text in all H2 tags that were found
● Total number of images
● Total number of HTML tags
● Number of unique HTML tags
![Page 153: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/153.jpg)
More Features
● All H1 text
● All H2 text
● Meta description
![Page 154: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/154.jpg)
Feature Generation
![Page 155: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/155.jpg)
Number of lists
![Page 156: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/156.jpg)
Number of links
![Page 157: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/157.jpg)
Number of images
![Page 158: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/158.jpg)
Number of buttons
![Page 159: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/159.jpg)
Customary word clouds
Clickbaits Non-Clickbaits
![Page 160: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/160.jpg)
Final Features
![Page 161: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/161.jpg)
Deep Learning Models
● Simple LSTM
● Two dense layers
● Dropout + Batch Normalization
● Softmax Activation
![Page 162: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/162.jpg)
Deep Learning Models
![Page 163: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/163.jpg)
Deep Learning Models
![Page 164: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/164.jpg)
Deep Learning Models
![Page 165: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/165.jpg)
Results
![Page 166: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/166.jpg)
Detecting Duplicates in Quora Questions
![Page 167: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/167.jpg)
The Problem
➢ ~13 million questions (as of March 2017)
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
➢ First public data release: 24th January 2017
![Page 168: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/168.jpg)
Duplicate Questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…

➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?

➢ Why did Trump win the Presidency?
➢ How did Donald Trump win the 2016 Presidential Election?
![Page 169: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/169.jpg)
Non-Duplicate Questions
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view? "swift or grand i10". My first priority is safety?

➢ Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic?
➢ What mistakes are made when depicting hacking in "Mr. Robot" compared to real-life cybersecurity breaches or just a regular use of technologies?

➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big E-Commerce website?
![Page 170: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/170.jpg)
The Data
➢ 400,000+ pairs of questions
➢ Initially data was very skewed
➢ Negative samples from related questions
➢ Not real distribution on Quora’s website
➢ Noise exists (as usual)

https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
![Page 171: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/171.jpg)
The Data
➢ 255045 negative samples (non-duplicates)
➢ 149306 positive samples (duplicates)
➢ ~37% positive samples
![Page 172: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/172.jpg)
The Data
➢ Average number of characters in question1: 59.57
➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623

➢ Average number of characters in question2: 60.14
➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169
![Page 173: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/173.jpg)
Basic Feature Engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
![Page 174: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/174.jpg)
Basic Feature Engineering
➢ Basic feature set: fs-1
data['len_q1'] = data.question1.apply(lambda x: len(str(x)))

data['len_q2'] = data.question2.apply(lambda x: len(str(x)))

data['diff_len'] = data.len_q1 - data.len_q2

# character length without spaces
data['len_char_q1'] = data.question1.apply(lambda x: len(str(x).replace(' ', '')))

data['len_char_q2'] = data.question2.apply(lambda x: len(str(x).replace(' ', '')))

data['len_word_q1'] = data.question1.apply(lambda x: len(str(x).split()))

data['len_word_q2'] = data.question2.apply(lambda x: len(str(x).split()))

data['common_words'] = data.apply(lambda x: len(set(str(x['question1']).lower().split()).intersection(set(str(x['question2']).lower().split()))), axis=1)
![Page 175: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/175.jpg)
Fuzzy Features
➢ pip install fuzzywuzzy
➢ Uses Levenshtein distance
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
➢ etc. etc. etc.
https://github.com/seatgeek/fuzzywuzzy
![Page 176: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/176.jpg)
Fuzzy Features
➢ Fuzzy feature set: fs-2
data['fuzz_qratio'] = data.apply(lambda x: fuzz.QRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_WRatio'] = data.apply(lambda x: fuzz.WRatio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_ratio'] = data.apply(lambda x: fuzz.partial_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_partial_token_set_ratio'] = data.apply(lambda x: fuzz.partial_token_set_ratio(str(x['question1']), str(x['question2'])),
axis=1)
data['fuzz_partial_token_sort_ratio'] = data.apply(lambda x: fuzz.partial_token_sort_ratio(str(x['question1']),
str(x['question2'])), axis=1)
data['fuzz_token_set_ratio'] = data.apply(lambda x: fuzz.token_set_ratio(str(x['question1']), str(x['question2'])), axis=1)
data['fuzz_token_sort_ratio'] = data.apply(lambda x: fuzz.token_sort_ratio(str(x['question1']), str(x['question2'])), axis=1)
![Page 177: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/177.jpg)
TF-IDF
➢ TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
➢ IDF(t) = log(total number of documents / number of documents containing term t)
➢ TF-IDF(t) = TF(t) * IDF(t)
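These formulas can be checked by hand on a toy corpus; a minimal sketch (two documents, term “cat”):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

def tf(term, doc):
    # term frequency within a single document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # inverse document frequency across the corpus
    n_containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'cat' appears once among 6 words, in 1 of 2 documents
print(tf_idf('cat', docs[0], docs))
# 'the' appears in every document, so its IDF (and TF-IDF) is 0
print(tf_idf('the', docs[0], docs))
```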
tfidf = TfidfVectorizer(min_df=3, max_features=None,
strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1,
stop_words = 'english')
![Page 178: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/178.jpg)
SVD
➢ Latent semantic analysis
➢ scikit-learn version of SVD
➢ 120 components
svd = decomposition.TruncatedSVD(n_components=120)
xtrain_svd = svd.fit_transform(xtrain)
xtest_svd = svd.transform(xtest)
![Page 179: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/179.jpg)
Fuzzy Features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert one string into the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
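The Levenshtein distance underlying these ratios is a short dynamic program; a minimal sketch:

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (free on match)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # classic example: 3 edits
```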
![Page 180: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/180.jpg)
A Combination of TF-IDF & SVD➢ TF-IDF features: fs3-1
![Page 181: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/181.jpg)
A Combination of TF-IDF & SVD➢ TF-IDF features: fs3-2
![Page 182: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/182.jpg)
A Combination of TF-IDF & SVD➢ TF-IDF + SVD features: fs3-3
![Page 183: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/183.jpg)
A Combination of TF-IDF & SVD➢ TF-IDF + SVD features: fs3-4
![Page 184: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/184.jpg)
A Combination of TF-IDF & SVD➢ TF-IDF + SVD features: fs3-5
![Page 185: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/185.jpg)
Word2Vec Features
➢ Multi-dimensional vector for all the words in any dictionary
➢ Always great insights
➢ Very popular in natural language processing tasks
➢ Google News vectors, 300d
![Page 186: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/186.jpg)
Word2Vec Features
➢ Representing words
➢ Representing sentences

def sent2vec(s):
    words = word_tokenize(str(s).lower())
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        if w in model:  # skip words missing from the word2vec vocabulary
            M.append(model[w])
    if not M:
        return np.zeros(300)
    M = np.array(M)
    v = M.sum(axis=0)
    # normalized sum of the word vectors
    return v / np.sqrt((v ** 2).sum())
![Page 187: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/187.jpg)
W2V Features: WMD
Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K.. (2015). From Word Embeddings To Document Distances.
![Page 188: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/188.jpg)
W2V Features: Skew
➢ Skew = 0 for normal distribution
➢ Skew > 0: longer tail on the right (mass concentrated on the left)
![Page 189: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/189.jpg)
W2V Features: Kurtosis
➢ 4th central moment over the square of variance
➢ Types:
○ Pearson
○ Fisher: subtract 3.0 from the result so that it is 0 for a normal distribution
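Both statistics are available in scipy.stats (Fisher kurtosis is scipy's default); a quick check on a small symmetric sample:

```python
import numpy as np
from scipy.stats import kurtosis, skew

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # symmetric around its mean

print(skew(data))                     # symmetric data -> 0
print(kurtosis(data))                 # Fisher definition (0 for a normal distribution)
print(kurtosis(data, fisher=False))   # Pearson definition (3 for a normal distribution)
```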
![Page 190: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/190.jpg)
W2V Features
➢ Word2Vec feature set: fs-4
➢ Distances between the two sentence vectors (scipy.spatial.distance): minkowski, jaccard, manhattan, braycurtis, euclidean, cosine, canberra
➢ Statistics of the vectors (scipy.stats): skew, kurtosis
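A sketch of how such distance features might be computed for a question pair, using two hypothetical sentence vectors in place of real sent2vec outputs:

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import kurtosis, skew

# two made-up sentence vectors (in practice: normalized sums of word2vec vectors)
v1 = np.array([0.1, 0.3, 0.5, 0.1])
v2 = np.array([0.2, 0.2, 0.4, 0.2])

features = {
    'cosine':     distance.cosine(v1, v2),
    'euclidean':  distance.euclidean(v1, v2),
    'manhattan':  distance.cityblock(v1, v2),   # scipy calls manhattan "cityblock"
    'canberra':   distance.canberra(v1, v2),
    'braycurtis': distance.braycurtis(v1, v2),
    'minkowski':  distance.minkowski(v1, v2, 3),
    'skew_q1':    skew(v1),
    'kurtosis_q1': kurtosis(v1),
}
print(features)
```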
![Page 191: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/191.jpg)
Raw Word2Vec Vectors
https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne
➢ Raw W2V feature set: fs-5
![Page 192: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/192.jpg)
Features Snapshot
![Page 193: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/193.jpg)
Feature Snapshot
![Page 194: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/194.jpg)
Machine Learning Models
![Page 195: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/195.jpg)
Machine Learning Models
➢ Logistic regression
➢ Xgboost
➢ 5-fold cross-validation
➢ Accuracy as a comparison metric (also precision + recall)
➢ Why accuracy?
![Page 196: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/196.jpg)
Results
![Page 197: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/197.jpg)
Deep Learning
![Page 198: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/198.jpg)
LSTM
➢ Long short-term memory
➢ A type of RNN
➢ Learns long-term dependencies
➢ Used two LSTM layers
![Page 199: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/199.jpg)
1D CNN
➢ One-dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:

for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        if i - j >= 0:  # stay inside the input signal
            y[i] += x[i - j] * h[j]
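The loop above (with the index kept inside the input) matches the leading terms of numpy's full convolution; a runnable sketch:

```python
import numpy as np

def conv1d(x, h):
    # y[i] = sum_j x[i - j] * h[j], i.e. a causal 1-D convolution
    y = [0.0] * len(x)
    for i in range(len(x)):
        for j in range(len(h)):
            if i - j >= 0:
                y[i] += x[i - j] * h[j]
    return y

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.5]  # a simple two-tap smoothing kernel

print(conv1d(x, h))
# same as the first len(x) terms of numpy's full convolution
print(np.convolve(x, h)[:len(x)])
```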
![Page 200: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/200.jpg)
Embedding Layers
➢ Simple layer
➢ Converts indexes to vectors
➢ [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
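An embedding layer is essentially a lookup table; a numpy sketch of the index-to-vector mapping above (the weight values are made up to mirror the slide's example):

```python
import numpy as np

# embedding matrix: one row per vocabulary index (here vocab size 32, dimension 2)
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(32, 2))
embedding_matrix[4] = [0.25, 0.1]    # pin two rows to the slide's example values
embedding_matrix[20] = [0.6, -0.2]

indexes = np.array([4, 20])
vectors = embedding_matrix[indexes]  # the whole "layer" is just row indexing
print(vectors)
```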
![Page 201: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/201.jpg)
Time Distributed Dense Layer
➢ TimeDistributed wrapper around a dense layer
➢ TimeDistributed applies the layer to every temporal slice of the input
➢ Followed by a Lambda layer
➢ Implements the “translation” layer used by Stephen Merity (keras snli model)
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
300,
weights=[embedding_matrix],
input_length=40,
trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
![Page 202: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/202.jpg)
GloVe Embeddings
➢ Count-based model
➢ Dimensionality reduction on a co-occurrence counts matrix
➢ word-context matrix -> word-feature matrix
➢ Common Crawl
○ 840B tokens, 2.2M vocab, 300d vectors
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation
![Page 203: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/203.jpg)
Basis of Deep Learning Model➢ Keras-snli model: https://github.com/Smerity/keras_snli
![Page 204: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/204.jpg)
Before Training DeepNets
➢ Tokenize data
➢ Convert text data to sequences
tk = text.Tokenizer(nb_words=200000)
max_len = 40
tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))
x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)
word_index = tk.word_index
![Page 205: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/205.jpg)
Before Training DeepNets
➢ Initialize GloVe embeddings
embeddings_index = {}
f = open('data/glove.840B.300d.txt')
for line in tqdm(f):
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
f.close()
![Page 206: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/206.jpg)
Before Training DeepNets
➢ Create the embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
embedding_matrix[i] = embedding_vector
![Page 207: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/207.jpg)
Final Deep Learning Model
![Page 208: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/208.jpg)
![Page 209: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/209.jpg)
Final Deep Learning Model
![Page 210: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/210.jpg)
Model 1 and Model 2
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
300,
weights=[embedding_matrix],
input_length=40,
trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1),
output_shape=(300,)))
model2 = Sequential()
model2.add(Embedding(len(word_index) + 1,
300,
weights=[embedding_matrix],
input_length=40,
trainable=False))
model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1),
output_shape=(300,)))
![Page 211: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/211.jpg)
Final Deep Learning Model
![Page 212: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/212.jpg)
Model 3 and Model 4
![Page 213: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/213.jpg)
Model 3 and Model 4model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
300,
weights=[embedding_matrix],
input_length=40,
trainable=False))
model3.add(Convolution1D(nb_filter=nb_filter,
filter_length=filter_length,
border_mode='valid',
activation='relu',
subsample_length=1))
model3.add(Dropout(0.2))
.
.
.
model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())
![Page 214: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/214.jpg)
Final Deep Learning Model
![Page 215: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/215.jpg)
Model 5 and Model 6
model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40,
dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40,
dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))
![Page 216: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/216.jpg)
Final Deep Learning Model
![Page 217: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/217.jpg)
Merged Model
![Page 218: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/218.jpg)
Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
![Page 219: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/219.jpg)
![Page 220: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/220.jpg)
Combined Results
The deep network was trained on an NVIDIA Titan X; each epoch took approximately 300 seconds, and training took 10-15 hours in total. The network achieved an accuracy of 0.848 (~0.85).
![Page 221: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/221.jpg)
Improving Further
➢ Cleaning the text data, e.g. correcting misspellings
➢ POS tagging
➢ Entity recognition
➢ Combining the deepnet with traditional ML models
![Page 222: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/222.jpg)
Conclusion & References
➢ The deepnet gives a near state-of-the-art result
➢ BiMPM model accuracy: 88%

Some references:
➢ Zhiguo Wang, Wael Hamza and Radu Florian. "Bilateral Multi-Perspective Matching for Natural Language Sentences" (BiMPM)
➢ Matthew Honnibal. "Deep text-pair classification with Quora's 2017 question dataset," 13 February 2017. Retrieved from https://explosion.ai/blog/quora-deep-text-pair-classification
➢ Bradley Pallen’s work: https://github.com/bradleypallen/keras-quora-question-pairs
![Page 223: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/223.jpg)
![Page 224: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/224.jpg)
[Chatbot architecture diagram: text (or speech) input flows through Natural Language Processing — pre-trained domain knowledge, classification of intent, identification of entities (extracting information) — then on to APIs, analytics, delegation to customer support, or delegation to back-end robots; instant processing and end-to-end automation, with monitoring and AI training, presented through a chat avatar.]
![Page 225: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/225.jpg)
Conversation without API
Enquiry → pre-processing of the enquiry (misspellings algorithm, stemming, cross-language) → intent classification (1. Insurance, 2. Vehicle, 3. Car, 4. Rules for practice driving) → pre-defined reply

User: “Hey you, do you knoww if my car insruacne covers practice driving??”
Bot: “You don’t need to adjust your car insurance when practise driving with a learner’s permit. In case of damage it’s the supervisor with a full driver’s license that shall write and sign the insurance claim.”
![Page 226: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/226.jpg)
Conversation with API
Redirect to API: Weather (required value: location; optional value: date)

User: “Hi James, what’s the weather in Berlin on Thursday?”
Bot: “Thursday’s forecast for Berlin is partly sunny and mostly clouds.”
![Page 228: Deep Learning Applications (dadada2017)](https://reader033.fdocuments.net/reader033/viewer/2022042611/5a6778217f8b9a656a8b5521/html5/thumbnails/228.jpg)
Thank you!
Questions / Comments?
All The Code:
❖ github.com/abhishekkrthakur
Get in touch:
➢ E-mail: [email protected]
➢ LinkedIn: bit.ly/thakurabhishek
➢ Kaggle: kaggle.com/abhishek
➢ Twitter: @abhi1thakur
If everything fails, use Xgboost