Knitting boar - Toronto and Boston HUGs - Nov 2012
KNITTING BOAR: Machine Learning, Mahout, and Parallel Iterative Algorithms
Josh Patterson, Principal Solutions Architect
Hello
✛ Josh Patterson
> Master’s Thesis: self-organizing mesh networks
∗ Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
> Conceived, built, and led Hadoop integration for the openPDC project at the Tennessee Valley Authority (TVA)
> Twitter: @jpatanooga
> Email: [email protected]
Outline
✛ Introduction to Machine Learning
✛ Mahout
✛ Knitting Boar and YARN
✛ Parting Thoughts
Introduction to MACHINE LEARNING
Basic Concepts
✛ What is Data Mining?
> “the process of extracting patterns from data”
✛ Why are we interested in Data Mining?
> Raw data essentially useless
∗ Data is simply recorded facts
∗ Information is the patterns underlying the data
✛ Machine Learning
> Algorithms for acquiring structural descriptions from data “examples”
∗ Process of learning “concepts”
Shades of Gray
✛ Information Retrieval
> information science, information architecture, cognitive psychology, linguistics, and statistics
✛ Natural Language Processing
> grounded in machine learning, especially statistical machine learning
✛ Statistics
> Math and stuff
✛ Machine Learning
> Considered a branch of artificial intelligence
Hadoop in Traditional Enterprises Today
✛ ETL
✛ Joining multiple disparate data sources
✛ Filtering data
✛ Aggregation
✛ Cube materialization
“Descriptive Statistics”
Hadoop All The Time?
✛ Don’t always assume you need “scale” and parallelization
> Try it out on a single machine first
> See if it becomes a bottleneck!
✛ Will the data fit in memory on a beefy machine?
✛ We can always use the constructed model back in MapReduce to score a ton of new data
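The last point, training once on a single machine and then scoring a ton of new data in parallel, can be sketched in a few lines. This is a minimal pure-Python illustration, assuming a toy linear model and made-up feature vectors; it is not Knitting Boar or Mahout code:

```python
# Train once on a single machine, then apply the fixed model to many
# partitions of new data: the embarrassingly parallel "scoring" step
# that fits MapReduce well. The trainer here is a deliberately toy one.

def train(examples):
    """Fit a trivial model on one machine: label-weighted feature sums."""
    weights = [0.0] * len(examples[0][0])
    for features, label in examples:
        for i, x in enumerate(features):
            weights[i] += x * label
    n = len(examples)
    return [w / n for w in weights]

def score(weights, features):
    """Map-side scoring: a dot product against the broadcast model."""
    return sum(w * x for w, x in zip(weights, features))

# One machine builds the model...
model = train([([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)])

# ...then each map task scores its own partition independently.
partition = [[1.0, 0.0], [0.0, 1.0]]
scores = [score(model, f) for f in partition]
```

Scoring needs no coordination between tasks, which is why it parallelizes cleanly even when training did not.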
Twitter Pipeline
✛ http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
> Looks to study data with descriptive statistics in the hopes of building models for predictive analytics
✛ Does the majority of ML work via custom Pig integrations
> Pipeline is very “Pig-centric”
> Example: https://github.com/tdunning/pig-vector
> They mostly use SGD and ensemble methods, which are conducive to large-scale data mining
✛ Questions they try to answer
> Is this tweet spam?
> What star rating might this user give this movie?
Typical Pipeline for Cloudera Customer
✛ Data collection performed with Flume
✛ Data cleansing / ETL performed with Hive or Pig
✛ ML work performed with
> SAS
> SPSS
> R
> Mahout
Introduction to MAHOUT
Algorithm Groups in Apache Mahout
Copyright 2010 Cloudera Inc. All rights reserved
✛ Classification
> “Fraud detection”
✛ Recommendation
> “Collaborative Filtering”
✛ Clustering
> “Segmentation”
✛ Frequent Itemset Mining
Classification
✛ Stochastic Gradient Descent
> Single process
> Logistic Regression Model Construction
✛ Naïve Bayes
> MapReduce-based
> Text Classification
✛ Random Forests
> MapReduce-based
What Are Recommenders?
✛ An algorithm that looks at a user’s past actions and suggests
> Products
> Services
> People
✛ Advertisement
> Cloudera has a great Data Science training course on this topic
> http://university.cloudera.com/training/data_science/introduction_to_data_science_-_building_recommender_systems.html
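As a rough illustration of the idea (not Mahout’s recommender API), a user-based collaborative filter needs only a similarity measure and a similarity-weighted vote over unseen items. All names and ratings below are made up:

```python
import math

# Minimal user-based collaborative filtering: recommend the item that
# similar users rated highly but the target user hasn't seen yet.

ratings = {
    "alice": {"matrix": 5.0, "inception": 4.0},
    "bob":   {"matrix": 5.0, "inception": 4.0, "memento": 5.0},
    "carol": {"cars": 4.0},
}

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    shared = set(u) & set(v)
    num = sum(u[i] * v[i] for i in shared)
    du = math.sqrt(sum(x * x for x in u.values()))
    dv = math.sqrt(sum(x * x for x in v.values()))
    return num / (du * dv) if du and dv else 0.0

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    seen = ratings[user]
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, theirs)
        for item, r in theirs.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * r
    return max(scores, key=scores.get) if scores else None

# Alice's history looks like Bob's, so she gets Bob's other favorite.
best = recommend("alice")
```

Production recommenders add normalization, neighborhood pruning, and item-based variants, but the weighted-vote core is the same shape.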
Clustering: Topic Modeling
✛ Cluster words across docs to identify topics
✛ Latent Dirichlet Allocation
Taking a Breath For a Minute
✛ Why Machine Learning?
> Growing interest in predictive modeling
✛ Linear Models are Simple, Useful
> Stochastic Gradient Descent is a very popular tool for building linear models like Logistic Regression
✛ Building Models Is Still Time Consuming
> The “need for speed”
> “More data beats a cleverer algorithm”
Introducing KNITTING BOAR
Goals
✛ Parallelize Mahout’s Stochastic Gradient Descent
> With as few extra dependencies as possible
✛ Wanted to explore parallel iterative algorithms using YARN
> Wanted a first-class Hadoop-YARN citizen
> Work through dev progressions towards a stable state
> Worry about “frameworks” later
Stochastic Gradient Descent
✛ Training
> Simple gradient descent procedure
> Loss function needs to be convex
✛ Prediction
> Logistic Regression:
∗ Sigmoid function using parameter vector (dot) example as exponential parameter
(Diagram: Training Data → SGD → Model)
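The training and prediction steps above can be sketched from scratch. This is a minimal illustration of SGD on the logistic loss, assuming a plain list-of-floats parameter vector and tiny made-up data; it is not Mahout’s OnlineLogisticRegression implementation:

```python
import math

def sigmoid(z):
    """Logistic function: squashes the dot product into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Prediction: sigmoid of parameter vector (dot) example."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sgd_step(w, x, y, lr=0.5):
    """One stochastic step on the (convex) logistic loss."""
    err = predict(w, x) - y  # gradient of log-loss w.r.t. the logit
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

def train(examples, epochs=50):
    """Plain sequential SGD: sweep the examples, one update each."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            w = sgd_step(w, x, y)
    return w

# Tiny linearly separable set: label is 1 when the first feature fires.
data = [([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0)]
w = train(data)
```

Because each update touches one example at a time, this loop is inherently sequential, which is exactly the limitation the next slides address.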
Current Limitations
✛ Sequential algorithms on a single node only go so far
✛ The “Data Deluge”
> Presents algorithmic challenges when combined with large data sets
> Need to design algorithms that are able to perform in a distributed fashion
✛ MapReduce only fits certain types of algorithms
Distributed Learning Strategies
✛ Langford, 2007
> Vowpal Wabbit
✛ McDonald, 2010
> Distributed Training Strategies for the Structured Perceptron
✛ Dekel, 2010
> Optimal Distributed Online Prediction Using Mini-Batches
MapReduce vs. Parallel Iterative
(Diagram: MapReduce as a one-pass pipeline, Input → Map, Map, Map → Reduce, Reduce → Output, vs. the parallel iterative pattern, where a row of Processors advances together through Superstep 1, Superstep 2, . . .)
Why Stay on Hadoop?
“Are the gains gotten from using X worth the integration costs incurred in building the end-to-end solution?
If no, then operationally, we can consider the Hadoop stack …
there are substantial costs in knitting together a patchwork of different frameworks, programming models, etc.”
–– Lin, 2012
The Boar
✛ Parallel Iterative implementation of SGD on YARN
✛ Workers work on partitions of the data
✛ Master keeps global copy of merged parameter vector
Worker
✛ Each worker is given a split of the total dataset
> Similar to a map task
✛ Uses a modified OLR
> Processes N samples in an epoch (subset of split)
✛ Local parameter vector sent to master node
> Master averages all workers’ vectors together
Master
✛ Gathers and averages worker parameter vectors
> From worker OLR runs
✛ Produces new global parameter vector
> By averaging workers’ vectors
✛ Sends update to all workers
> Workers replace local parameter vector with new global parameter vector
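The master’s merge step described above is just element-wise vector averaging. A minimal sketch, assuming each worker’s parameter vector arrives as a plain list of floats (the numbers are made up):

```python
# The master's merge step: average the workers' local parameter
# vectors into one new global vector, then broadcast it back.

def average_vectors(worker_vectors):
    """Element-wise mean across all worker parameter vectors."""
    n = len(worker_vectors)
    dim = len(worker_vectors[0])
    return [sum(v[i] for v in worker_vectors) / n for i in range(dim)]

workers = [
    [0.5, -0.25],   # worker 1's local vector after an epoch
    [1.5, -0.75],   # worker 2
    [1.0, -0.50],   # worker 3
]
global_vector = average_vectors(workers)  # sent back to every worker
```

Averaging keeps the master cheap: its cost grows with the vector dimension and worker count, not with the size of the training data.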
IterativeReduce
✛ ComputableMaster
> Setup()
> Compute()
> Complete()
✛ ComputableWorker
> Setup()
> Compute()
(Diagram: groups of Workers feed results to a Master; the pattern repeats each superstep, . . .)
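Putting the two interfaces together suggests a simple superstep driver. Below is a hedged Python analogue, not IterativeReduce’s actual Java code: the method names follow the slide, while the class bodies, the toy update rule, and the driver loop are illustrative assumptions:

```python
# Skeleton of an IterativeReduce-style superstep loop: each superstep,
# every worker computes on its split, the master merges the results,
# and the merged state is broadcast back to the workers.

class ComputableWorker:
    def setup(self, split):
        self.split = split

    def compute(self, global_state):
        # Toy "local work": add this split's mean to the global state.
        return global_state + sum(self.split) / len(self.split)

class ComputableMaster:
    def setup(self):
        self.global_state = 0.0

    def compute(self, worker_results):
        # Merge step: average the workers' results into the new state.
        self.global_state = sum(worker_results) / len(worker_results)
        return self.global_state

    def complete(self):
        # Final model/state returned when all supersteps are done.
        return self.global_state

# Driver: three workers with fixed splits, two supersteps.
splits = [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]]
workers = [ComputableWorker() for _ in splits]
for worker, split in zip(workers, splits):
    worker.setup(split)
master = ComputableMaster()
master.setup()
state = master.global_state
for _ in range(2):
    results = [w.compute(state) for w in workers]
    state = master.compute(results)
final = master.complete()
```

The framework’s job is everything the driver glosses over here: launching the worker containers on YARN, shipping vectors over the wire, and handling stragglers and failures.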
Comparison: OLR vs POLR
(Diagram: OnlineLogisticRegression trains one Model from the Training Data; Knitting Boar’s POLR gives Splits 1 . . . N to Workers 1 . . . N, each producing a Partial Model that the Master merges into a Global Model)
(Chart: 20 Newsgroups, input size vs. processing time, comparing OLR and POLR)
Knitting Boar: PARTING THOUGHTS
Knitting Boar Lessons Learned
✛ Parallel SGD
> The Boar is temperamental, experimental
∗ Linear speedup (roughly)
✛ Developing YARN Applications
> More complex than just MapReduce
> Requires lots of “plumbing”
✛ IterativeReduce
> Great native-Hadoop way to implement algorithms
> Easy to use and well integrated
Bits
✛ Knitting Boar
> https://github.com/jpatanooga/KnittingBoar
> 100% Java
> ASF 2.0 Licensed
> Quick Start
∗ https://github.com/jpatanooga/KnittingBoar/wiki/Quick-Start
✛ IterativeReduce
> https://github.com/emsixteeen/IterativeReduce
> 100% Java
> ASF 2.0 Licensed
✛ Machine Learning is hard
> Don’t believe the hype
> Do the work
✛ Model development takes time
> Lots of iterations
> Speed is key here
Picture: http://evertrek.files.wordpress.com/2011/06/everestsign.jpg
References
✛ Strata / Hadoop World 2012 Slides
> http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
✛ Mahout’s SGD implementation
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
✛ MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail!
> http://arxiv.org/pdf/1209.2191v1.pdf
References
✛ Langford
> http://hunch.net/~vw/
✛ McDonald, 2010
> http://dl.acm.org/citation.cfm?id=1858068
Photo Credits
✛ http://eteamjournal.files.wordpress.com/2011/03/photos-of-mount-everest-pictures.jpg
✛ http://images.fineartamerica.com/images-medium-large/-say-hello-to-my-little-friend--luis-ludzska.jpg
✛ http://freewallpaper.in/wallpaper2/2202-2-2001_space_odyssey_-_5.jpg