“Scalable” Machine Learning
Mikio L. Braun Recommender Stammtisch
June 26, 2014
Mikio L. Braun Recommender Stammtisch “Scalable” Machine Learning June 26, 2014 1 / 29
Hm... scalable machine learning
Why scalable machine learning?
Lots of data (literally, 100 GBs of log data)
Many many tasks (say, one per user)
Proper model selection over features/parameters takes a lot of time!
But the truth is, core ML methods don’t scale very well...
But is it really necessary?
Size ≠ Complexity
A complex data set
Learning curve checkers data set
[Figure: test error (%) versus data set size (10^2 to 10^5) for BumpBoost500, BumpBoost1000, BumpBoost2000, KRR, and SVM]
What is enough data?
So how does complexity occur?
A lot of information
Many, many combinations of things are indicative
Whatever would actually require memorizing a lot of things
High-dimensional data, unresolved invariances
For example, text: “He liked the book very much”
“like”: “enjoy”, “found interesting”, “was captivated by”
“very much”: “greatly”, “massively”
More degrees of freedom: “He thought the book was a keeper”
“He had read his share of books over the past years, in fact, he had come to consider himself quite a lover of books. But this book, which his aunt had so mysteriously sent him just when he most needed it, was surprisingly captivating beyond his wildest expectations.”
Map-Reduce for Machine Learning On Multicore
Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminant Analysis, k-Means, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines
One example
Locally Weighted Linear Regression
Solves $A\theta = b$ with $A = \sum_{i=1}^{m} w_i (x_i x_i^T)$, $b = \sum_{i=1}^{m} w_i (x_i y_i)$.
Compute sums partially in Map step, combine sums in Reduce step to get A and b.
Solve for θ single threaded.
But...
Only works when number of dimensions is small.
... in which case the problem doesn’t require many examples anyway.
... same approach (more or less) for GDA, PCA, ICA
... Naive Bayes even simpler (just compute conditional probabilities)
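The map/reduce split for locally weighted linear regression described above can be sketched as follows. This is a minimal illustration in plain Scala without any Map-Reduce framework: the names `LwlrMapReduce`, `Partial`, `mapChunk`, and `fit` are made up for this sketch, the weights $w_i$ are assumed to be given with the data, and the single-threaded solve uses plain Gaussian elimination.

```scala
object LwlrMapReduce {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Partial sums of A = sum_i w_i x_i x_i^T and b = sum_i w_i x_i y_i.
  final case class Partial(a: Mat, b: Vec) {
    def +(that: Partial): Partial = Partial(
      a.zip(that.a).map { case (r, s) => r.zip(s).map { case (u, v) => u + v } },
      b.zip(that.b).map { case (u, v) => u + v })
  }

  // Map step: one chunk of (weight, x, y) triples -> partial sums.
  def mapChunk(chunk: Seq[(Double, Vec, Double)], d: Int): Partial = {
    val a = Array.fill(d, d)(0.0)
    val b = Array.fill(d)(0.0)
    for ((w, x, y) <- chunk; i <- 0 until d) {
      b(i) += w * x(i) * y
      for (j <- 0 until d) a(i)(j) += w * x(i) * x(j)
    }
    Partial(a, b)
  }

  // Single-threaded solve of A theta = b (Gaussian elimination, partial pivoting).
  def solve(a0: Mat, b0: Vec): Vec = {
    val n = b0.length
    val a = a0.map(_.clone())
    val b = b0.clone()
    for (k <- 0 until n) {
      val p = (k until n).maxBy(i => math.abs(a(i)(k)))
      val tr = a(k); a(k) = a(p); a(p) = tr
      val tb = b(k); b(k) = b(p); b(p) = tb
      for (i <- k + 1 until n) {
        val f = a(i)(k) / a(k)(k)
        for (j <- k until n) a(i)(j) -= f * a(k)(j)
        b(i) -= f * b(k)
      }
    }
    val x = Array.fill(n)(0.0)
    for (i <- n - 1 to 0 by -1) {
      val s = (i + 1 until n).map(j => a(i)(j) * x(j)).sum
      x(i) = (b(i) - s) / a(i)(i)
    }
    x
  }

  // Map over chunks, reduce by adding partial sums, then solve once.
  def fit(chunks: Seq[Seq[(Double, Vec, Double)]], d: Int): Vec = {
    val total = chunks.map(mapChunk(_, d)).reduce(_ + _)
    solve(total.a, total.b)
  }
}
```

The reduce step only ever handles a d × d matrix and a d-vector, which is why the approach is limited to a small number of dimensions.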
k-Means Clustering
Parallelize computation of all distances
Iteration, updates done sequentially
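A rough sketch of that split, assuming the data is already divided into chunks: the distance computations run in parallel per chunk (here with standard-library `Future`s standing in for a cluster), while the center update and the outer iteration stay sequential. The names `KMeansSketch`, `assign`, `update`, and `run` are illustrative only.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object KMeansSketch {
  type Vec = Array[Double]

  def sqDist(a: Vec, b: Vec): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the nearest center for one point.
  def nearest(p: Vec, centers: Seq[Vec]): Int =
    centers.indices.minBy(k => sqDist(p, centers(k)))

  // Parallel part: each chunk computes its assignments independently.
  def assign(chunks: Seq[Seq[Vec]], centers: Seq[Vec]): Seq[(Vec, Int)] = {
    val jobs = chunks.map(c => Future(c.map(p => (p, nearest(p, centers)))))
    Await.result(Future.sequence(jobs), Duration.Inf).flatten
  }

  // Sequential part: recompute each center as the mean of its points.
  def update(assigned: Seq[(Vec, Int)], centers: Seq[Vec]): Seq[Vec] =
    centers.indices.map { k =>
      val pts = assigned.filter(_._2 == k).map(_._1)
      if (pts.isEmpty) centers(k) // keep an empty cluster's center unchanged
      else Array.tabulate(centers(k).length)(j => pts.map(p => p(j)).sum / pts.size)
    }

  // Outer loop: every iteration needs the centers from the previous one.
  def run(chunks: Seq[Seq[Vec]], init: Seq[Vec], iterations: Int): Seq[Vec] =
    (1 to iterations).foldLeft(init)((cs, _) => update(assign(chunks, cs), cs))
}
```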
Batch updates
Logistic Regression, Neural Networks, SVMs with stochastic gradient descent.
BUT the gradient is computed over the whole data set, so this is really batch gradient descent.
Batch vs. true stochastic gradient descent
Lots of open questions, though
Mostly iterative algorithms
Sometimes, questionable implementations, for example, microbatches.
What does scalable mean anyway?
Scalable Inference for Logistic-Normal Topic Models
... This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation ...
A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
... With [...] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices [...] on a single machine in a matter of hours ...
Scalable Inference of Overlapping Communities
... Our algorithm is based on stochastic variational inference in the mixed-membership stochastic blockmodel ...
Scalable Influence Estimation in Continuous-Time Diffusion Networks
... In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks ...
Scalable imputation of genetic data with a discrete fragmentation-coagulation process
... Our model can be thought of as a discrete time analogue of continuous time fragmentation-coagulation processes, preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable ...
Scalable kernels for graphs with continuous attributes
... In this paper, we present a class of path kernels with computational complexity $O(n^2(m + \delta^2))$ ...
Custom made algorithms & implementations!
(http://papers.nips.cc/search/?q=scalable)
Large Scale Learning
Stochastic Gradient Descent
Higher order descent, just a few steps
Stochastic Gradient Descent
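Doing stochastic gradient descent on

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i(\langle w, x_i \rangle + b)\bigr)$$

one example $(x_t, y_t)$ at a time (taking $C = 1$) yields

$$w \leftarrow \begin{cases} w - \eta_t w & \text{if } y_t(\langle w, x_t \rangle + b) \ge 1 \\ w - \eta_t (w - y_t x_t) & \text{else} \end{cases}$$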
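The update rule can be sketched in a few lines of Scala. Several things here are assumptions made for illustration and not part of the slide: the names `SvmSgd`, `step`, `train`, and `predict`, the step size $\eta_t = 1/t$, and the bias update $b \leftarrow b + \eta_t y_t$ on margin violations (the slide only gives the update for $w$).

```scala
object SvmSgd {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.indices.map(i => a(i) * b(i)).sum

  // One stochastic step on example (x, y), y in {-1, +1}, with step size eta.
  def step(w: Vec, b: Double, x: Vec, y: Double, eta: Double): (Vec, Double) =
    if (y * (dot(w, x) + b) >= 1.0)
      // Margin satisfied: only the regularizer 1/2 ||w||^2 contributes.
      (w.map(wi => wi - eta * wi), b)
    else
      // Margin violated: the hinge term contributes -y x (and -y for the bias, an assumption).
      (w.indices.map(i => w(i) - eta * (w(i) - y * x(i))).toArray, b + eta * y)

  // Stream over the data `epochs` times with eta_t = 1 / t (an assumed schedule).
  def train(data: Seq[(Vec, Double)], dim: Int, epochs: Int): (Vec, Double) = {
    var w = Array.fill(dim)(0.0)
    var b = 0.0
    var t = 0
    for (_ <- 1 to epochs; (x, y) <- data) {
      t += 1
      val (w2, b2) = step(w, b, x, y, 1.0 / t)
      w = w2
      b = b2
    }
    (w, b)
  }

  def predict(w: Vec, b: Double, x: Vec): Double = math.signum(dot(w, x) + b)
}
```

The only state carried between examples is `w` (plus `b` and the step counter), so memory use does not grow with the size of the data set.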
Parallelization potential for SGD
Memory footprint: w
Main problem: streaming the data by fast enough
Cross-validation/feature extraction issues
Google’s DistBelief
Data shards, model replicas, common parameter set.
(Dean, Corrado, Monga et al., Large Scale Distributed Deep Networks, NIPS 2012, http://research.google.com/archive/large_deep_networks_nips2012.html)
The future: Spark and Stratosphere/Flink
The problem with Hadoop:
Just one Map / Reduce step, but many algorithms are iterative
Disk based → long startup times
Spark
(http://spark.apache.org)
In-memory
Much larger set of operations (groupBy, joins, etc.)
resilience by storing how data was generated
caching of results on disk
micro-batch streaming, too
Apache Spark
// run inside spark-shell, where sc is the predefined SparkContext
// open file as collection of lines
val textFile = sc.textFile("README.md")
// count lines in file
textFile.count()
// get lines containing the word Spark
val linesWithSpark = textFile
.filter(line => line.contains("Spark"))
// count those lines
linesWithSpark.count()
Stratosphere/Flink - Big Data by Query Optimization
(http://stratosphere.eu/)
Databases: Query → Relational Algebra → Algorithms
Stratosphere: The same for Big Data
In-Memory, more operations, but also iterations!
Optimizing operations, including reshuffles of data.
Stream Mining
Stream Mining — mid 2000s
Process potentially infinite stream of data
Stream query:
How often have I seen item i?
What are the most frequent items?
How many distinct items are there?
Approximate results with bounded resources
Getting rid of exactness
Heavy hitters
Task: find most frequent items in a data set.
Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
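The cited paper describes the Space-Saving algorithm, which keeps a fixed number of counters and, when a new item arrives and all counters are taken, replaces the item with the smallest count. The following is a simplified sketch of that idea, not the paper's data structure: the names `SpaceSaving`, `Summary`, `offer`, and `top` are made up, and the minimum is found by a linear scan where the paper uses a dedicated structure for constant-time updates.

```scala
object SpaceSaving {
  // Counters for at most `capacity` items: item -> (estimated count, max overestimation).
  final class Summary[T](capacity: Int) {
    private val counters = scala.collection.mutable.Map.empty[T, (Long, Long)]

    def offer(item: T): Unit = counters.get(item) match {
      case Some((count, err)) =>
        counters(item) = (count + 1, err)
      case None if counters.size < capacity =>
        counters(item) = (1L, 0L)
      case None =>
        // Evict the item with the smallest count; the newcomer inherits that count.
        val (victim, (minCount, _)) = counters.minBy(_._2._1)
        counters.remove(victim)
        counters(item) = (minCount + 1, minCount)
    }

    // Top-n items by estimated count; estimates may overcount but never undercount.
    def top(n: Int): Seq[(T, Long)] =
      counters.toSeq.map { case (k, (c, _)) => (k, c) }.sortBy(-_._2).take(n)
  }
}
```

Memory use is bounded by `capacity` regardless of how many distinct items appear in the stream, at the price of approximate counts.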
Exponential Decay
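One common way to realize exponential decay on a counter is to store a score together with the time of its last update and to apply the decay lazily whenever the counter is read or hit. The sketch below assumes a half-life parameterization and the hypothetical names `DecayedCounter`, `Counter`, `valueAt`, and `hit`; the slide itself does not specify these details.

```scala
object DecayedCounter {
  // A score that halves every `halfLife` time units and grows by 1 per event.
  final case class Counter(score: Double, lastUpdate: Double, halfLife: Double) {
    // Decayed value at time t (t is assumed to be >= lastUpdate).
    def valueAt(t: Double): Double =
      score * math.pow(0.5, (t - lastUpdate) / halfLife)

    // Register one event at time t: decay first, then add 1.
    def hit(t: Double): Counter =
      copy(score = valueAt(t) + 1.0, lastUpdate = t)
  }
}
```

Because decay is applied on access, counters that receive no events cost nothing to maintain.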
Indices
page         referrer  agent      score
/index.html  google    mozilla    12.3
/about.html  facebook  iexplorer  10.2
/index.html  twitter   safari     8.5
/post/123    google    mozilla    5.5
/about.html  twitter   safari     3.2
Trend for “referrer = google” → (/index.html, google, mozilla, 12.3), (/post/123, google, mozilla, 5.5)
Trend for “agent = safari” → (/index.html, twitter, safari, 8.5), (/about.html, twitter, safari, 3.2)
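The lookups above can be sketched as one index per column that maps a value to the matching entries in descending score order. The names `TrendIndex`, `Entry`, `indexBy`, `byReferrer`, and `byAgent` are illustrative and do not reflect streamdrill's actual API; the data is the table from this slide.

```scala
object TrendIndex {
  final case class Entry(page: String, referrer: String, agent: String, score: Double)

  val entries: Seq[Entry] = Seq(
    Entry("/index.html", "google", "mozilla", 12.3),
    Entry("/about.html", "facebook", "iexplorer", 10.2),
    Entry("/index.html", "twitter", "safari", 8.5),
    Entry("/post/123", "google", "mozilla", 5.5),
    Entry("/about.html", "twitter", "safari", 3.2))

  // Index for one column: value -> entries, each group in descending score order.
  def indexBy(key: Entry => String): Map[String, Seq[Entry]] =
    entries.sortBy(-_.score).groupBy(key)

  val byReferrer: Map[String, Seq[Entry]] = indexBy(_.referrer)
  val byAgent: Map[String, Seq[Entry]] = indexBy(_.agent)
}
```

For example, `TrendIndex.byReferrer("google")` yields the `/index.html` and `/post/123` entries, in that order.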
streamdrill
In-memory, stream mining driven realtime analytics engine
written in Scala
Main tool: Trends (Top K + indices + exponential decay)
Process up to 20k events per second
Track about 1M entries per GB
Upcoming: Profiling, Recommendation, etc.
Download demo jar at streamdrill.com
Summary
Complexity ≠ Size
Big Data best used for inherently parallel steps like
preprocessing
feature extraction
cross validation
applying predictions
Else, you’re stuck with algorithm design!
Large scale learning by stochastic gradient descent
Spark/Stratosphere making it easier!
Stream Mining!