Atlanta MLconf Machine Learning Conference 09-23-2016
MLconf ATL!
Sept 23rd, 2016
Chris Fregly
Research Scientist @ PipelineIO
Who am I?
Chris Fregly, Research Scientist @ PipelineIO, San Francisco
Previously, Engineer @ Netflix, Databricks, and IBM Spark
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow Meetup
Author @ Advanced Spark (advancedspark.com)
Advanced Spark and TensorFlow Meetup
ATL Spark Meetup (9/22)
http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016
ATL Hadoop Meetup (9/21)
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
Confession #1
I Failed Linguistics in College!
Chose Pass/Fail Option
(90 (mid-term) + 70 (final)) / 2 = 80 = C+
How did a C+ turn into an F?
ZER0 (0) CLASS PARTICIPATION?!
Confession #2
I Hated Statistics in College
2 Degrees: Mechanical + Manufacturing Engineering
Approximations were Bad!
I Wasn’t a Fluffy Physics Major
Though, I Kinda Wish I Was!
Wait… Please Don’t Leave!
I’m Older and Wiser Now
Approximate is the New Exact
Computational Linguistics and NLP are My Jam!
Agenda
TensorFlow + Neural Nets
NLP Fundamentals
NLP Models
What is TensorFlow?
General Purpose Numerical Computation Engine
  Happens to be good for neural nets!
Tooling
  TensorBoard (port 6006 == `goog`)
DAG-based like Spark!
  Computation graph is logical plan
  Stored in Protobufs
  TF converts logical -> physical plan
Lots of Libraries
  TFLearn (TensorFlow’s Scikit-learn Impl)
  TensorFlow Serving (Prediction Layer)
Distributed and GPU-Optimized
What are Neural Networks?
Like All ML, Goal is to Minimize Loss (Error)
  Error relative to known outcome of labeled data
Mostly Supervised Learning Classification
  Labeled training data
Training Steps
  Step 1: Randomly Guess Input Weights
  Step 2: Calculate Error Against Labeled Data
  Step 3: Determine Gradient Value, +/- Direction
  Step 4: Back-propagate Gradient to Update Each Input Weight
  Step 5: Repeat from Step 2 with New Weights until Convergence
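The training steps above can be sketched on the smallest possible model: a single weight `w` fitting `y = w * x` under mean squared error. The data, learning rate, and stopping tolerance are illustrative assumptions, not values from the talk. With only one weight there is no hidden layer to back-propagate through, so Step 4 reduces to a direct update.

```python
import random

def train(data, learning_rate=0.01, tolerance=1e-9, max_steps=10_000, seed=0):
    """Fit y ~ w * x by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # Step 1: randomly guess the weight
    for _ in range(max_steps):
        # Step 2: error of the current predictions against the labels
        errors = [w * x - y for x, y in data]
        # Step 3: gradient of the mean squared error with respect to w
        grad = 2.0 * sum(e * x for e, (x, _) in zip(errors, data)) / len(data)
        # Step 4: move the weight against the gradient
        new_w = w - learning_rate * grad
        # Step 5: repeat with the new weight until updates become negligible
        if abs(new_w - w) < tolerance:
            return new_w
        w = new_w
    return w

# Hypothetical labeled data that follows y = 3x
labeled = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
weight = train(labeled)
print(round(weight, 4))
```

Note that the random guess happens once; every later pass reuses the updated weight.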
Activation Function
Activation Functions
Goal: Learn and Train a Model on Input Data
  Non-Linear Functions Find Non-Linear Fit of Input Data
Common Activation Functions
  Sigmoid Function (sigmoid): output in (0, 1)
  Hyperbolic Tangent (tanh): output in (-1, 1)
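A quick way to see the two output ranges is to evaluate both functions at a few points. This is a minimal sketch using only the Python standard library; the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic function; output lies strictly between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, this form avoids overflow in exp()
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    """Hyperbolic tangent; output lies strictly between -1 and 1."""
    return math.tanh(x)

for x in (-4.0, 0.0, 4.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4))
```

The two functions are closely related: tanh(x) equals 2 * sigmoid(2x) - 1, so tanh is a rescaled sigmoid centered on zero.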
Back Propagation
http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Gradients Calculated by Comparing to Known Label
Use Gradients to Adjust Input Weights
Chain Rule
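To make the chain rule concrete, here is a sketch of back-propagation through a hypothetical two-weight network (`h = sigmoid(w1 * x)`, `y = w2 * h`, squared-error loss). The network shape and the numbers are assumptions chosen for illustration. The analytic gradients are checked against finite differences.

```python
import math

def forward(w1, w2, x, target):
    h = 1.0 / (1.0 + math.exp(-w1 * x))  # hidden activation (sigmoid)
    y = w2 * h                            # linear output
    loss = (y - target) ** 2              # squared error against the label
    return h, y, loss

def backward(w1, w2, x, target):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)
    grad_w2 = dloss_dy * h                    # dL/dw2 = dL/dy * dy/dw2
    dloss_dh = dloss_dy * w2                  # dL/dh  = dL/dy * dy/dh
    grad_w1 = dloss_dh * h * (1.0 - h) * x    # sigmoid'(z) = h * (1 - h)
    return grad_w1, grad_w2

def numeric_grads(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    g1 = (forward(w1 + eps, w2, x, target)[2] - forward(w1 - eps, w2, x, target)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, target)[2] - forward(w1, w2 - eps, x, target)[2]) / (2 * eps)
    return g1, g2

analytic = backward(0.5, -1.2, 2.0, 1.0)
numeric = numeric_grads(0.5, -1.2, 2.0, 1.0)
print(analytic, numeric)
```

Auto-differentiation in a framework such as TensorFlow applies this same rule across the whole computation graph.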
Loss/Error Optimizers
Gradient Descent
  Batch (entire dataset)
  Per-record (don’t do this!)
  Mini-batch (empirically 16 -> 512)
  Stochastic (approximation)
  Momentum (optimization)
AdaGrad
  SGD with adaptive learning rates per feature
  Set initial learning rate
  More likely to incorrectly converge on local minima
http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
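The "adaptive learning rates per feature" idea in AdaGrad can be sketched in a few lines: each weight divides its step by the root of its own accumulated squared gradients. The loss function, initial learning rate, and step count below are illustrative assumptions.

```python
import math

def adagrad_step(weights, grads, grad_sq_sums, learning_rate=0.5, eps=1e-8):
    """One AdaGrad update; weights and grad_sq_sums are modified in place."""
    for i, g in enumerate(grads):
        grad_sq_sums[i] += g * g  # per-feature history of squared gradients
        weights[i] -= learning_rate * g / (math.sqrt(grad_sq_sums[i]) + eps)
    return weights

def loss_grads(w):
    """Gradient of a hypothetical loss f(w) = 10 * w0^2 + 0.1 * w1^2."""
    return [20.0 * w[0], 0.2 * w[1]]

weights = [1.0, 1.0]
sums = [0.0, 0.0]
for _ in range(50):
    adagrad_step(weights, loss_grads(weights), sums)
print(weights)
```

Although the two features differ in curvature by a factor of 100, the per-feature scaling makes them converge at nearly the same pace, which plain gradient descent with one shared learning rate would not do.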
The Math
Linear Algebra
  Matrix Multiplication
  Very Parallelizable
Calculus
  Derivatives
  Chain Rule
Convolutional Neural Networks
Feed-forward
  Do not form a cycle
Apply Many Layers (aka Filters) to Input
Each Layer/Filter Picks up on Features
  Features not necessarily human-grokkable
Examples of Human-grokkable Filters
  3 color filters: RGB
  Moving AVG for time series
Brute Force
  Try Diff numLayers & layerSizes
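The moving-average example of a human-grokkable filter can be written as a one-dimensional sliding dot product. This is a sketch with a hand-picked kernel and a made-up series; in a trained CNN the kernel values would be learned rather than fixed.

```python
def convolve1d(signal, kernel):
    """Slide the kernel over the signal and take a dot product at each
    position ("valid" mode: no padding, so the output is shorter)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

window = 3
moving_avg_kernel = [1.0 / window] * window  # equal weights average the window
series = [3.0, 6.0, 9.0, 6.0, 3.0, 0.0]       # hypothetical time series
smoothed = convolve1d(series, moving_avg_kernel)
print(smoothed)
```

Swapping in a different kernel, for example `[-1.0, 0.0, 1.0]`, turns the same code into a crude change detector, which is the sense in which each filter picks up on a different feature.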
CNN Use Case: Stitch Fix
Stitch Fix Also Uses NLP to Analyze Return/Reject Comments
Stitch Fix @ Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!
Recurrent Neural Networks
Forms a Cycle (vs. Feed-forward)
Maintains State over Time
  Keep track of context
  Learns sequential patterns
  Decay over time
Use Cases
  Speech
  Text/NLP Prediction
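A scalar sketch of a vanilla RNN step shows how state carries context forward and decays over time. The weights here are fixed, hypothetical values chosen for illustration; a real network would learn them.

```python
import math

def rnn_step(x, h_prev, w_x=1.0, w_h=0.5, b=0.0):
    """New state mixes the current input with the previous state (the cycle)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, h0=0.0):
    states = []
    h = h0
    for x in inputs:
        h = rnn_step(x, h)  # state from the previous step feeds the next one
        states.append(h)
    return states

# One input "pulse" followed by silence: the state remembers it, then fades
states = run_rnn([1.0, 0.0, 0.0, 0.0])
print([round(s, 4) for s in states])
```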
RNN Sequences
Input: Image, Output: Classification
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Input: Image, Output: Text (Captions)
Input: Text, Output: Class (Sentiment)
Input: Text (English), Output: Text (Spanish)
Diagram labels: Input Layer, Hidden Layer, Output Layer
Character-based RNNs
Tokens are Characters vs. Words/Phrases
  Microsoft trains every 3 characters
Fewer Combinations of Possible Neighbors
  Only 26 alpha character tokens vs. millions of word tokens
Preserving state between the 1st and 2nd ‘l’ improves prediction
Long Short Term Memory (LSTM)
More Complex State Update Function than Vanilla RNN
LSTM State Update
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Cell State
Forget Gate Layer (Sigmoid)
Input Gate Layer (Sigmoid)
Candidate Gate Layer (tanh)
Output Layer
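The gate layers listed above combine into a single state update. Below is a scalar sketch following the standard LSTM formulation described in the linked colah post; the parameter values and the `params` layout are hypothetical, and a real cell uses weight matrices over vectors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. params maps each gate name to (w_h, w_x, bias)."""
    def gate(name, squash):
        w_h, w_x, b = params[name]
        return squash(w_h * h_prev + w_x * x + b)

    f = gate("forget", sigmoid)        # how much old cell state to keep
    i = gate("input", sigmoid)         # how much new candidate to admit
    g = gate("candidate", math.tanh)   # proposed new cell content
    o = gate("output", sigmoid)        # how much cell state to expose
    c = f * c_prev + i * g             # cell state update
    h = o * math.tanh(c)               # new hidden state / output
    return h, c

# Hypothetical parameters, identical for every gate
params = {name: (0.5, 1.0, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(1.0, 0.0, 0.0, params)
print(round(h, 4), round(c, 4))
```

The additive cell update `c = f * c_prev + i * g` is the main difference from the vanilla RNN step, where the whole state is squashed at every step.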
Transfer Learning
Agenda
TensorFlow + Neural Nets
NLP Fundamentals
NLP Models
Use Cases
Document Summary
  TextRank: TF/IDF + PageRank
Article Classification and Similarity
  LDA: calculate top `k` topic distribution
Machine Translation
  word2vec: compare word embedding vectors
Must Convert Text to Numbers!
Core Concepts
Corpus
  Collection of text, e.g. documents, articles, genetic codes
Embeddings
  Tokens represented/embedded in vector space
  Learned, hidden features (~PCA, SVD)
  Similar tokens cluster together, analogies cluster apart
k-skip-gram
  Skip k neighbors when defining tokens
n-gram
  Treat n consecutive tokens as a single token
Composable: 1-skip, bi-gram (every other word)
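The n-gram and k-skip-gram definitions above compose into one small function. This sketch follows the slide's simplified definition (skip exactly `k` tokens between members); other references define k-skip-grams as allowing up to `k` skips. The function name and sample sentence are illustrative.

```python
def skip_ngrams(tokens, n=2, k=0):
    """Group n tokens into one tuple, skipping exactly k tokens between members.

    k=0 gives ordinary n-grams; n=2, k=1 pairs every other word.
    """
    stride = k + 1
    span = (n - 1) * stride  # distance from first to last member
    return [
        tuple(tokens[i + j * stride] for j in range(n))
        for i in range(len(tokens) - span)
    ]

tokens = "the quick brown fox jumps".split()
bigrams = skip_ngrams(tokens, n=2, k=0)
one_skip_bigrams = skip_ngrams(tokens, n=2, k=1)
print(bigrams)
print(one_skip_bigrams)
```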
Parsers and POS Taggers
Describe grammatical sentence structure
Requires context of entire sentence
Helps reason about sentence
80% obvious, simple token neighbors
Major bottleneck in NLP pipeline!
Pre-trained Parsers and Taggers
Penn Treebank
  Parser and Part-of-Speech Tagger
  Human-annotated (!)
  Trained on 4.5 million words
Parsey McParseface
  Trained by SyntaxNet
Feature Engineering
Lower-case
  Preserve proper nouns using caret (`^`)
  “MLconf” => “^m^lconf”
  “Varsity” => “^varsity”
Encode Common N-grams (Phrases)
  Create a single token using underscore (`_`)
  “Senior Developer” => “senior_developer”
Stemming and Lemmatization
  Try to avoid: let the neural network figure this out
  Can preserve part of speech (POS) using “_noun”, “_verb”
  “banking” => “banking_verb”
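The caret and underscore encodings above are simple string transforms. Here is a minimal sketch that reproduces the slide's examples; the function names and the phrase set are hypothetical, and the phrase matcher only handles two-word phrases.

```python
def encode_case(token):
    """Lower-case a token, marking each original capital with a caret."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in token)

def join_phrases(tokens, phrases):
    """Lower-case tokens and merge known two-word phrases with an underscore.

    phrases is a set of lower-cased (first, second) tuples.
    """
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            out.append("_".join(pair))
            i += 2  # consume both words of the phrase
        else:
            out.append(tokens[i].lower())
            i += 1
    return out

print(encode_case("MLconf"), encode_case("Varsity"))
print(join_phrases(["Senior", "Developer", "wanted"], {("senior", "developer")}))
```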
Agenda
TensorFlow + Neural Nets
NLP Fundamentals
NLP Models
Count-based Models
Goal: Convert Text to Vector of Neighbor Co-occurrences
Bag of Words (BOW)
  Simple hashmap with word counts
  Loses neighbor context
Term Frequency / Inverse Document Frequency (TF/IDF)
  Normalizes based on token frequency
GloVe
  Matrix factorization on co-occurrence matrix
  Highly parallelizable, reduce dimensions, capture global co-occurrence stats
  Log smoothing of probability ratios
  Stores word vector diffs for fast analogy lookups
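The first two count-based models fit in a few lines of standard-library Python. This sketch uses the plain, unsmoothed TF/IDF variant (term frequency times `log(N / document frequency)`); libraries often apply smoothing or other weighting, so treat the exact formula as an assumption. The toy documents are made up.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """Word counts only; word order and neighbor context are discarded."""
    return Counter(tokens)

def tf_idf(docs):
    """Return one {token: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        counts = bag_of_words(doc)
        scores.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return scores

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
scores = tf_idf(docs)
print(scores[0])
```

A token that appears in every document ("the") scores zero, while a token unique to one document ("cat") scores highest, which is the normalization the slide refers to.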
Neural-based Predictive Models
Goal: Predict Text using Learned Embedding Vectors
word2vec
  Shallow neural network
  Local: nearby words predict each other
  Fixed word embedding vector size (e.g. 300)
  Optimizer: Mini-batch Stochastic Gradient Descent (SGD)
SyntaxNet
  Deep(er) neural network
  Global(er)
  Not a Recurrent Neural Net (RNN)!
  Can combine with BOW-based models (e.g. word2vec CBOW)
word2vec
CBOW word2vec
  Predict target word from source context
  A single source context is an observation
  Loses useful distribution information
  Good for small datasets
Skip-gram word2vec (Inverse of CBOW)
  Predict source context words from target word
  Each (source context, target word) tuple is observation
  Better for large datasets
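The difference between the two word2vec variants shows up in how training observations are generated from the same text. This sketch only builds the observations, not the network; the window size and function names are illustrative assumptions.

```python
def context_window(tokens, i, window):
    """Neighbors within `window` positions of index i, excluding token i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window=1):
    """One observation per position: (whole context, target word)."""
    return [
        (tuple(context_window(tokens, i, window)), tokens[i])
        for i in range(len(tokens))
    ]

def skipgram_examples(tokens, window=1):
    """One observation per (target word, single context word) pair."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]

tokens = ["the", "quick", "brown", "fox"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
print(cbow)
print(skipgram)
```

Skip-gram yields more, finer-grained observations from the same sentence, which is consistent with the slide's note that it suits larger datasets.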
word2vec Libraries
gensim
  Python only
  Most popular
Spark ML
  Python + Java/Scala
  Supports only synonyms
*2vec
lda2vec
  LDA (global) + word2vec (local)
  From Chris Moody @ Stitch Fix
like2vec
  Embedding-based Recommender
word2vec vs. GloVe
Both are Fundamentally Similar
  Capture local co-occurrence statistics (neighbors)
  Capture distance between embedding vectors (analogies)
GloVe
  Count-based
  Also captures global co-occurrence statistics
  Requires upfront pass through entire dataset
SyntaxNet POS Tagging
Determine coarse-grained grammatical role of each word
Multiple contexts, multiple roles
Neural Net Inputs: stack, buffer
Results: POS probability distro
Already Tagged
SyntaxNet Dependency Parser
Determine fine-grained roles using grammatical relationships
“Transition-based”, Incremental Dependency Parser
Globally Normalized using Beam Search with Early Update
Parsey McParseface: Pre-trained Parser/Tagger avail in 40 langs
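To illustrate what "transition-based" means, here is a sketch of the stack-and-buffer mechanics using an arc-standard transition set (an assumption for illustration; this is not SyntaxNet's API). The transitions are hand-written here, whereas a real parser would have a model score candidate transitions at each step, with beam search keeping several candidate sequences alive.

```python
def parse(words, transitions):
    """Apply arc-standard transitions; return (head, dependent) arcs in order."""
    stack = []
    buffer = list(words)
    arcs = []
    for action in transitions:
        if action == "SHIFT":
            # Move the next unread word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":
            # Second-from-top becomes a dependent of the top word
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT_ARC":
            # Top word becomes a dependent of the second-from-top
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs

# Hypothetical sentence and a hand-picked transition sequence
sentence = ["I", "saw", "her"]
steps = ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "RIGHT_ARC"]
arcs = parse(sentence, steps)
print(arcs)
```

Each decision depends only on the current stack and buffer, which is why those two structures are the natural inputs to the neural net.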
Fine-grained
Coarse-grained
SyntaxNet Use Case: Nutrition
Nutrition and Health Startup in SF (Stealth)
Using Google’s SyntaxNet
Rate Recipes and Menus by Nutritional Value
Correct
Incorrect
Model Validation
Unsupervised Learning Requires Validation
Google has Published Analogy Tests for Model Validation
Thanks, Google!
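An analogy test asks "a is to b as c is to ?" and checks whether the nearest embedding to `b - a + c` is the expected word. The sketch below scores a model this way; the three-dimensional vectors are hand-made toy values, not learned embeddings, and the question list is hypothetical rather than taken from Google's published test set.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(vectors, a, b, c):
    """Return the word nearest to b - a + c, excluding the three inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)

# Toy embeddings for illustration only
toy = {
    "man": [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.0],
}
questions = [("man", "woman", "king", "queen"), ("king", "queen", "man", "woman")]
accuracy = analogy_accuracy(toy, questions)
print(accuracy)
```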
Thank You, Atlanta!
Chris Fregly, Research Scientist @ PipelineIO
All Source Code, Demos, and Docker Images @ pipeline.io
Join the Global Meetup for all Slides and Videos @ advancedspark.com