Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)


Transcript of Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Page 1: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Scaling Up Practical Learning Algorithms

Misha Bilenko

ALMADA Summer School, Moscow 2013

Page 2: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Preliminaries: ML-in-four-slides

• ML: mapping observations to predictions that minimize error
• Predictor: f : X → Y, ŷ = f(x)
  – x ∈ X: observations, each consisting of d features
    • Numbers, strings, attributes – typically mapped to vectors
  – y ∈ Y: predictions (labels); assume a true y for a given x
    • Binary, numeric, ordinal, structured…
  – L(y, f(x)): loss function that quantifies error
    • 0-1 loss: 1[y ≠ f(x)];  L1: |y − f(x)|;  L2: (y − f(x))²
  – F: the function class from which a predictor is learned
    • E.g., “linear model (feature weights)” or “1000 decision trees with 32 leaves each”

• Supervised learning: training set {(x_i, y_i)}, i = 1…n
• Regularization: prevents overfitting to the training set:
  min over f ∈ F of  Σ_i L(y_i, f(x_i)) + λ·R(f)
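To make these definitions concrete, here is a minimal numpy sketch (mine, not from the lecture) of a linear predictor, the three losses above, and an L2-regularized training objective; the synthetic data and all variable names are illustrative.

```python
import numpy as np

def predict_linear(w, X):
    """Linear predictor f(x) = w·x applied to an (n, d) matrix of examples."""
    return X @ w

def zero_one_loss(y, y_hat):
    """0-1 loss for binary labels in {-1, +1}: fraction of sign disagreements."""
    return np.mean(np.sign(y_hat) != y)

def l1_loss(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def l2_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def regularized_objective(w, X, y, lam):
    """Empirical risk (L2 loss here) plus the regularizer lam * ||w||^2."""
    return l2_loss(y, predict_linear(w, X)) + lam * np.dot(w, w)

# Tiny synthetic training set {(x_i, y_i)}: 100 examples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]))
w = np.zeros(5)
print(zero_one_loss(y, predict_linear(w, X)), regularized_objective(w, X, y, lam=0.01))
```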

Page 3: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

ML Examples

• Email spam
  – x: header/body words, client- and server-side statistics (sender, recipient, etc.)
  – y: binary label (spam vs. not spam)
  – L: cost-sensitive: false positives (good mail in Junk) vs. false negatives (Inbox spam)

• Click prediction (ads, search results, recommendations, …)
  – x: attributes of context (e.g., query), item (e.g., ad text), user (e.g., location), …
  – y: probability of a click, p(click | x)
  – L: loss on the predicted probability, e.g., log loss

Page 4: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Key Prediction Models

• Important function classes
  – Linear predictors (logistic regression, linear SVMs, …)
    • f is a hyperplane (feature weights): f(x) = w·x, possibly passed through a link such as the sigmoid for logistic regression
  – Tree ensembles (boosting, random forests)
    • f is the set of each tree’s splits and leaf outputs
  – Non-linear parametric predictors (neural nets, Bayes nets)
    • f is a set of parameters (weights for hidden units, distribution parameters)
  – Non-parametric predictors (k-NN, kernel SVMs, Gaussian Processes)
    • f is a “remembered” subset of training examples with corresponding parameters

Page 5: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Learning: Training Predictors

• Two key algorithm patterns
  – Iteratively updating the predictor’s parameters to reduce training loss
    • Gradient descent (stochastic, coordinate/sub-gradient, quasi-Newton, …)
    • Boosting: each subsequent ensemble member reduces error (functional gradient descent)
    • Active-subset SVM training: iterative improvement over support vectors
  – Averaging multiple models to reduce variance
    • Bagging/random forests: models learned on subsets of data/features
    • Parameter mixtures: explicitly average weights learned from different data subsets

Page 6: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Big Learning: Large Datasets… and Beyond

• (1) Large training sets: many examples, valuable only if they improve accuracy
• (2) Large models: many features, ensembles, “deep” nets
• (3) Model selection: hyper-parameter tuning, statistical significance
• (4) Fast inference: structured prediction (e.g., speech)

• Fundamental differences across settings
  – Learning vs. inference, input complexity vs. model complexity
  – Dataflow/computation and bottlenecks are highly algorithm- and task-specific
  – Rest of this talk: practical algorithm nuggets for (1) and (2)

Page 7: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Dealing with Large Training Sets (I): SGD

• Online learning: Stochastic Gradient Descent
  – “averaged perceptron”, “Pegasos”, etc.

• For t = 1, …, n:  w ← w − η_t ∇_w L(y_t, w·x_t)
  – E.g., with hinge loss L = max(0, 1 − y_t·w·x_t) and L2 regularization (λ/2)‖w‖²:
  – w ← (1 − η_t·λ)·w + η_t·y_t·x_t if y_t·w·x_t < 1 (an error / margin violation); no update from the loss term otherwise.
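A minimal sketch of this update in Python, assuming dense numpy feature vectors, labels in {−1, +1}, and a Pegasos-style 1/(λt) step size; this is an illustration, not the lecture’s code.

```python
import numpy as np

def sgd_hinge(examples, d, lam=1e-4, epochs=1):
    """Pegasos-style SGD for the L2-regularized hinge loss.
    examples: list of (x, y) pairs, x a length-d numpy array, y in {-1, +1}."""
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for x, y in examples:
            t += 1
            eta = 1.0 / (lam * t)           # step-size schedule eta_t
            w *= (1.0 - eta * lam)          # shrinkage from the regularizer
            if y * np.dot(w, x) < 1.0:      # margin violation ("error"): apply the loss gradient
                w += eta * y * x
    return w
```

In practice the example stream is read straight from disk, which is why this loop tends to run at I/O speed.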

• Algorithm is fundamentally iterative: no “clean” parallelization
  – Literature: mini-batches, averaging, async updates, …

• …but why? The algorithm runs at disk I/O speed!
  – Parallelize I/O. For truly enormous datasets, average parameters/models.

• RCV1: 1M documents (~1GB): <12s on this laptop!

Page 8: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Dealing with Large Training Sets (II): L-BFGS

• Regularized logistic regression: min_w Σ_i log(1 + exp(−y_i·w·x_i)) + (λ/2)‖w‖²

• L-BFGS: batch quasi-Newton method using a quadratic approximation
  – Update is w ← w − η_t·H_t⁻¹·∇ℓ(w), where H_t approximates the Hessian
  – Limited-memory trick: keep a buffer of recent (Δw, Δ∇ℓ) pairs to approximate H_t⁻¹

• Parallelizes well on multi-core:
  – Each core takes a batch of examples and computes its share of the gradient
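A sketch of this setup using scipy’s L-BFGS implementation; the chunked gradient loop below stands in for the per-core batches, and the dataset, λ, and chunk count are illustrative choices rather than anything from the lecture.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def logistic_objective(w, X, y, lam, n_chunks=4):
    """L2-regularized logistic loss and gradient, y in {-1, +1}.
    The gradient is accumulated chunk by chunk, mimicking per-core batches."""
    f = 0.5 * lam * np.dot(w, w)
    g = lam * w
    for Xc, yc in zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)):
        m = yc * (Xc @ w)                    # margins y_i * w·x_i
        f += np.sum(np.logaddexp(0.0, -m))   # log(1 + exp(-m)), numerically stable
        g -= Xc.T @ (yc * expit(-m))         # -sum_i y_i x_i sigmoid(-m_i)
    return f, g

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = np.sign(X @ rng.normal(size=20))
res = minimize(logistic_objective, np.zeros(20), args=(X, y, 1.0),
               jac=True, method='L-BFGS-B')
print(res.fun, res.nit)
```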

• Multi-node
  – Poor fit for MapReduce: global weight update, (Δw, Δ∇ℓ) history, many iterations
  – Alternative: ADMM (first-order, but a better rate than SGD’s O(1/√t))

Page 9: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Dealing with Large Datasets (II): Trees

• Rule-based prediction is natural and powerful (non-linear)
  – Play outside: if no rain and not too hot, or if snowing but not windy.

• Trees hierarchically encode rule-based prediction
  – Nodes test features and split
  – Leaves produce predictions
  – Regression trees: numeric outputs

• Ensembles combine tree predictions

[Figures: a classification tree for the “play outside” example, with splits on Precipitation, Temp < 90, Temp < 32, and Wind < 10 leading to +/− leaves; and a regression-tree ensemble with splits such as Temp < 30, Temp > 10, and Wind < 25, whose leaf outputs (0.05, 0.1, 0.2, 0.6, 0.01, 0.7, …) are summed across trees.]
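To make the tree representation concrete, here is a small illustrative sketch (not from the lecture) of regression trees stored as nested tuples, with leaf values loosely matching the figure, and an additive ensemble that sums per-tree outputs.

```python
# A node is either a leaf value (a number) or a tuple
# (feature, threshold, left_subtree, right_subtree); left is taken when x[feature] < threshold.
tree1 = ("Precipitation", 0.5,
         ("Temp", 90, 0.6, 0.2),       # no precipitation: split on temperature
         ("Wind", 10, 0.1, 0.05))      # precipitation: split on wind
tree2 = ("Temp", 30, 0.0, ("Wind", 25, 0.7, 0.01))

def eval_tree(node, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node

def eval_ensemble(trees, x):
    """Additive ensemble: the prediction is the sum of the trees' leaf outputs."""
    return sum(eval_tree(t, x) for t in trees)

x = {"Precipitation": 0.0, "Temp": 75, "Wind": 20}
print(eval_ensemble([tree1, tree2], x))   # 0.6 + 0.7
```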

Page 10: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Tree Ensemble Zoo

• Different models can define different types of:
  – Combiner function: voting vs. weighting
  – Leaf prediction models: constant vs. regression
  – Split conditions: single vs. multiple features

• Examples (small biased sample, some are not tree-specific)
  – Boosting: AdaBoost, LogitBoost, GBM/MART, BrownBoost, Transform Regression
  – Random Forests: Random Subspaces, Bagging, Additive Groves, BagBoo
  – Beyond regression and binary classification: RankBoost, abc-mart, GBRank, LambdaMART, MatrixNet

Page 11: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Tree Ensembles Are Rightfully Popular

• State-of-the-art accuracy: web, vision, CRM, bio, …

• Efficient at prediction time
  – Multithread evaluation of individual trees; optimize/short-circuit

• Principled: extensively studied in statistics and learning theory

• Practical
  – Naturally handle mixed, missing, (un)transformed data
  – Feature selection embedded in the algorithm
  – Well-understood parameter sweeps
  – Scalable to extremely large datasets: rest of this section

Page 12: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Naturally Parallel Tree Ensembles

• No interaction when learning individual trees
  – Bagging: each tree trained on a bootstrap sample of the data
  – Random forests: bootstrap plus subsample features at each split
  – For large datasets, local data replaces the bootstrap → embarrassingly parallel

[Diagrams: bagging tree construction – Train() is called on each tree’s own data sample; random forest tree construction – each Split() additionally draws a random subset of features.]
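A minimal sketch of this embarrassingly parallel pattern, assuming sklearn’s DecisionTreeRegressor and a process pool; the depth, tree count, and synthetic data are my own illustrative choices, not the lecture’s.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.tree import DecisionTreeRegressor

def train_one_tree(args):
    """Train one tree on a bootstrap sample (at scale, a worker's local shard instead)."""
    X, y, seed = args
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample
    # For a random forest, also pass max_features="sqrt" to subsample features at each split.
    return DecisionTreeRegressor(max_depth=6, random_state=seed).fit(X[idx], y[idx])

def train_bagged_ensemble(X, y, n_trees=8):
    with ProcessPoolExecutor() as pool:               # one tree per process, no interaction
        return list(pool.map(train_one_tree, [(X, y, s) for s in range(n_trees)]))

def predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)   # bagging averages predictions

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = 2.0 * X[:, 0] + np.sin(X[:, 1])
    trees = train_bagged_ensemble(X, y)
    print(predict(trees, X[:3]))
```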

Page 13: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Boosting: Iterative Tree Construction

“Best off-the-shelf classifier in the world” – Breiman

• Numerically: gradient descent in function space
  – Each subsequent tree approximates a step in the negative gradient direction −∂L/∂F(x)
  – Recompute target labels (residuals) r_i before fitting each new tree
  – Logistic loss: r_i = y_i − 1/(1 + e^(−F(x_i)))  (for y ∈ {0, 1})
  – Squared loss: r_i = y_i − F(x_i)

• Reweight examples for each subsequent tree to focus on errors

[Diagram: Train() → Train() → Train() → … – trees are built sequentially, each one fit to the residuals of the current ensemble.]
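A compact sketch of this functional-gradient loop for squared loss (each tree fit to the current residuals), again assuming sklearn regression trees; the learning rate and depth are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Squared-loss gradient boosting: each tree is fit to the current residuals."""
    F = np.zeros(len(y))                 # current ensemble prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - F                # negative gradient of squared loss
        # For logistic loss with y in {0, 1}: residuals = y - 1.0 / (1.0 + np.exp(-F))
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def predict(trees, X, learning_rate=0.1):
    return learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
```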

Page 14: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Efficient Tree Construction

• Boosting is iterative: scaling up = parallelizing tree construction

• For every node: pick the best feature to split
  – For every feature: pick the best split point
    • For every potential split point: compute the gain
      – For every example in the current node, add its gain contribution for the given split

• Key efficiency: limiting + ordering the set of considered split points
  – Continuous features: discretize into bins; splits = bin boundaries
  – Allows computing split values in a single pass over the data


Page 15: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Binned Split Evaluation

• Each feature’s range is split into bins. Per-bin statistics are aggregated in a single pass.

• For each tree node, a two-stage procedure:
  (1) Pass through the dataset, aggregating node-feature-bin statistics
  (2) Select the split among all (feature, bin) options

[Table: per-(feature, bin) accumulators aggregated over the node’s examples, per-feature totals, and the resulting split gains.]
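A sketch of the two-stage procedure from the last two slides for squared-error regression: one pass accumulates per-(feature, bin) statistics, then prefix sums over bins score every candidate split; the function and variable names are mine, not the lecture’s.

```python
import numpy as np

def best_split(X_binned, residuals, n_bins):
    """X_binned: (n, d) integer bin indices for the node's examples; residuals: (n,) targets.
    Stage 1: a single pass accumulates per-(feature, bin) sums and counts.
    Stage 2: prefix sums over bins score every candidate (feature, bin) split."""
    n, d = X_binned.shape
    sums = np.zeros((d, n_bins))
    counts = np.zeros((d, n_bins))
    for i in range(n):                           # single pass over the node's data
        for j in range(d):
            b = X_binned[i, j]
            sums[j, b] += residuals[i]
            counts[j, b] += 1

    total_sum, total_cnt = residuals.sum(), float(n)
    best = (-np.inf, None, None)                 # (gain, feature, bin boundary)
    for j in range(d):
        left_sum = np.cumsum(sums[j])[:-1]       # bins <= b go left
        left_cnt = np.cumsum(counts[j])[:-1]
        right_sum, right_cnt = total_sum - left_sum, total_cnt - left_cnt
        valid = (left_cnt > 0) & (right_cnt > 0)
        # Variance-reduction gain: S_L^2/n_L + S_R^2/n_R - S^2/n
        gain = np.where(valid,
                        left_sum ** 2 / np.maximum(left_cnt, 1)
                        + right_sum ** 2 / np.maximum(right_cnt, 1)
                        - total_sum ** 2 / total_cnt,
                        -np.inf)
        b = int(np.argmax(gain))
        if gain[b] > best[0]:
            best = (float(gain[b]), j, b)
    return best
```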

Page 16: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Tree Construction Visualized

• Observation 1: a single pass is sufficient per tree level
• Observation 2: the data pass can iterate by instance or by feature
  – Supports horizontally or vertically partitioned data

[Diagrams: the same per-(feature, bin) statistics tables being filled either by iterating over instances (rows of a horizontally partitioned dataset) or by iterating over features (columns of a vertically partitioned dataset), for two example nodes A and B.]

Page 17: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Data-Distributed Tree Construction

• Master
  1. Send workers the current model and the set of nodes to expand
  2. Wait to receive local split histograms from workers
  3. Aggregate local split histograms, select the best split for every node

• Worker
  2a. Pass through local data, aggregating split histograms
  2b. Send completed local histograms to the master
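A toy sketch of this protocol with worker histograms simulated as numpy arrays (no real networking): each worker aggregates histograms over its data shard, and the master sums them and picks the best (feature, bin) split; all names are illustrative.

```python
import numpy as np

def worker_histograms(X_binned, residuals, n_bins):
    """Worker steps 2a/2b: aggregate per-(feature, bin) sums and counts over the local shard."""
    d = X_binned.shape[1]
    sums = np.zeros((d, n_bins))
    counts = np.zeros((d, n_bins))
    for j in range(d):
        sums[j] = np.bincount(X_binned[:, j], weights=residuals, minlength=n_bins)
        counts[j] = np.bincount(X_binned[:, j], minlength=n_bins)
    return sums, counts

def master_select_split(local_histograms):
    """Master step 3: sum the workers' histograms, then pick the best (feature, bin) split."""
    sums = sum(h[0] for h in local_histograms)     # elementwise sum across workers
    counts = sum(h[1] for h in local_histograms)
    S = sums.sum(axis=1, keepdims=True)
    N = counts.sum(axis=1, keepdims=True)
    ls, lc = np.cumsum(sums, axis=1)[:, :-1], np.cumsum(counts, axis=1)[:, :-1]
    rs, rc = S - ls, N - lc
    gain = np.where((lc > 0) & (rc > 0),
                    ls ** 2 / np.maximum(lc, 1) + rs ** 2 / np.maximum(rc, 1) - S ** 2 / N,
                    -np.inf)
    feature, b = np.unravel_index(np.argmax(gain), gain.shape)
    return int(feature), int(b), float(gain[feature, b])
```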


Page 18: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Feature-Distributed Tree Construction

• Workers maintain a per-instance index of current residuals and previous splits

• Master
  1. Request workers to expand a set of nodes
  2. Wait to receive best per-feature splits from workers
  3. Select the best feature-split for every node
  4. Request the best splits’ workers to broadcast per-instance assignments and residuals

• Worker
  2a. Pass through all instances for local features, aggregating split histograms for each node
  2b. Select local features’ best splits for each node, send them to the master
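A toy sketch of the feature-distributed variant, where each worker owns a block of feature columns and proposes its best local split; the master then keeps the global winner. Step 4 (broadcasting instance assignments) is omitted, and all names are illustrative.

```python
import numpy as np

def worker_best_split(X_cols, residuals, n_bins, feature_offset):
    """Steps 2a/2b: histogram only this worker's feature columns and return its best local split."""
    best = (-np.inf, None, None)                  # (gain, global feature id, bin boundary)
    S, N = residuals.sum(), float(len(residuals))
    for j in range(X_cols.shape[1]):
        sums = np.bincount(X_cols[:, j], weights=residuals, minlength=n_bins)
        counts = np.bincount(X_cols[:, j], minlength=n_bins)
        ls, lc = np.cumsum(sums)[:-1], np.cumsum(counts)[:-1]
        rs, rc = S - ls, N - lc
        gain = np.where((lc > 0) & (rc > 0),
                        ls ** 2 / np.maximum(lc, 1) + rs ** 2 / np.maximum(rc, 1) - S ** 2 / N,
                        -np.inf)
        b = int(np.argmax(gain))
        if gain[b] > best[0]:
            best = (float(gain[b]), feature_offset + j, b)
    return best

def master_pick(worker_answers):
    """Step 3: keep the best of the per-worker proposals (each already the worker's best feature)."""
    return max(worker_answers, key=lambda split: split[0])
```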


Page 19: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Learning with Many Features

• How many is “many”? At least billions.

• Exhibit A: English n-grams
  – Unigrams: 13 million
  – Bigrams: 315 million
  – Trigrams: 977 million
  – Fourgrams: 1.3 billion
  – Fivegrams: 1.2 billion

• Exhibit B: search ads, 3 months
  – User IDs: hundreds of millions
  – Listing IDs: hundreds of millions
  – Queries: tens to hundreds of millions
  – User × Listing × Query: billions

• Can we scale up linear learners? Yes, but there are limits:
  – Retraining: ideally real-time, definitely not more than a couple hours
  – Modularity: ideally fit in memory, definitely decompose elastically

Page 20: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Towards Infinite Features: Feature Hashing

• Basic idea: elastic, medium-dimensional projection
• Classic low-d projections: storage, cost, updates hard
• Solution: mapping defined by a hashing function

+ Effortless elasticity, sparsity preserved
− Compression is random (not driven by error reduction)

[Weinberger et al. 2009]; trick first described in [Moody 1989].
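A minimal sketch of the hashing trick: each feature name is hashed to one of d buckets, with one hash bit supplying a sign as in [Weinberger et al. 2009]; the choice of md5 here is purely illustrative.

```python
import hashlib

def hash_features(example, d=2 ** 20):
    """Map a sparse example {feature_name: value} to a sparse {bucket: weight} vector in d dims."""
    x = {}
    for name, value in example.items():
        h = int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:8], "little")
        index = h % d                                 # bucket in the fixed-size weight vector
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0    # one hash bit supplies the sign
        x[index] = x.get(index, 0.0) + sign * value   # collisions simply add up
    return x

# Words, n-grams, user/listing IDs, and their crosses all hash into the same d-dimensional space.
print(hash_features({"subject:free": 1.0, "sender_id:12345": 1.0, "query x listing:42": 1.0}))
```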

Page 21: Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)

Scaling up ML: Concluding Thoughts

• Learner parallelization is highly algorithm-dependent

• High-level parallelization (MapReduce)
  – Less work, but convenience comes at a penalty
  – Limits on communication and control can be algorithm-killing

• Low-level parallelization (multicore, GPUs, …)
  – Harder to implement/debug
  – Successes are architecture-vs-algorithm specific: e.g., GPUs are great if matrix multiplication is the core operation (NNs)
  – Typical trade-off: memory/IO latency/contention vs. update complexity