SGD cont'd; AdaGrad
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, April 2nd, 2015
©Emily Fox 2015
Case Study 1: Estimating Click Probabilities

Learning Problem for Click Prediction
• Prediction task:
• Features:
• Data:
  – Batch:
  – Online:
• Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, …)
  – Focus on logistic regression; captures main concepts, ideas generalize to other approaches
Standard v. Regularized Updates
• Maximum conditional likelihood estimate
• Regularized maximum conditional likelihood estimate
\[ w^{*} = \arg\max_{w} \ln\left[ \prod_{j} P(y^{j} \mid x^{j}, w) \right] - \lambda \sum_{i>0} \frac{w_i^2}{2} \]
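The regularized objective above can be evaluated directly. A minimal NumPy sketch, assuming a design matrix whose first column is the constant intercept feature, so that the sum over i > 0 leaves w_0 unpenalized; the function name and conventions are illustrative:

```python
import numpy as np

def regularized_cond_log_likelihood(w, X, y, lam):
    """Regularized conditional log-likelihood for logistic regression.

    X: (N, d) feature matrix (first column assumed to be the constant
    intercept feature), y: (N,) labels in {0, 1}, lam: regularization
    strength lambda. The intercept w[0] is not penalized, matching the
    sum over i > 0 in the objective.
    """
    z = X @ w
    # ln P(y | x, w) for a logistic model, written stably as
    # y*z - log(1 + exp(z)) via logaddexp
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    penalty = (lam / 2.0) * np.sum(w[1:] ** 2)
    return log_lik - penalty
```

Gradient ascent on this function recovers the regularized maximum conditional likelihood estimate.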
Challenge 1: Complexity of Computing Gradients
• What's the cost of a gradient update step for LR???
Challenge 2: Data is Streaming
• Assumption thus far: Batch data
• But, click prediction is a streaming data task:
  – User enters query, and ad must be selected:
    • Observe x_j, and must predict y_j
  – User either clicks or doesn't click on ad:
    • Label y_j is revealed afterwards
  – Google gets a reward if user clicks on ad
  – Weights must be updated for next time:
SGD: Stochastic Gradient Ascent (or Descent)
• "True" gradient:
• Sample-based approximation:
• What if we estimate gradient with just one sample???
  – Unbiased estimate of gradient
  – Very noisy!
  – Called stochastic gradient ascent (or descent)
    • Among many other names
  – VERY useful in practice!!!
\[ \nabla \ell(w) = E_x\left[ \nabla \ell(w, x) \right] \]
Stochastic Gradient Ascent: General Case
• Given a stochastic function of parameters:
  – Want to find maximum
• Start from w^{(0)}
• Repeat until convergence:
  – Get a sample data point x_t
  – Update parameters:
• Works in the online learning setting!
• Complexity of each gradient step is constant in number of examples!
• In general, step size changes with iterations
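The loop above can be sketched as follows. The step-size schedule η_t = η_0/√t is one common choice satisfying the usual decay conditions, an assumption here rather than something the slide fixes:

```python
import numpy as np

def sgd_ascent(grad_sample, w0, data, eta0=0.5):
    """Generic stochastic gradient ascent (sketch).

    grad_sample(w, x) returns an unbiased estimate of the gradient
    from a single sample x. Step size eta_t = eta0 / sqrt(t) decays
    with the iteration t (an assumed schedule for illustration).
    """
    w = w0.copy()
    for t, x in enumerate(data, start=1):
        eta_t = eta0 / np.sqrt(t)
        # one update: constant cost in the number of examples
        w = w + eta_t * grad_sample(w, x)
    return w
```

For example, with `grad_sample = lambda w, x: x - w` (gradient of -(w-x)^2/2), the iterates drift toward the sample mean, illustrating the noisy but unbiased updates.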
Stochastic Gradient Ascent for Logistic Regression
• Logistic loss as a stochastic function:
• Batch gradient ascent updates:
• Stochastic gradient ascent updates:
  – Online setting:
Logistic loss as a stochastic function:
\[ E_x[\ell(w, x)] = E_x\left[ \ln P(y \mid x, w) - \frac{\lambda}{2} \|w\|_2^2 \right] \]
Batch gradient ascent update:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \left\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^{N} x_i^{(j)} \left[ y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)}) \right] \right\} \]
Stochastic gradient ascent update:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \right] \right\} \]
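The stochastic (online) update can be written out for one sample. A sketch assuming labels in {0, 1} and the logistic model P(Y = 1 | x, w) = 1/(1 + e^{-<x,w>}); the helper names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_logistic_step(w, x_t, y_t, lam, eta_t):
    """One stochastic gradient ascent step for regularized logistic
    regression, mirroring the slide's update:
      w_i <- w_i + eta_t * ( -lam * w_i + x_i * [ y - P(Y=1|x, w) ] )
    x_t: (d,) feature vector, y_t in {0, 1}.
    """
    p = sigmoid(x_t @ w)                   # P(Y = 1 | x^{(t)}, w^{(t)})
    return w + eta_t * (-lam * w + x_t * (y_t - p))
```

Each step touches one example, so the cost per update is O(number of nonzero features), independent of N.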
Convergence Rate of SGD
• Theorem:
  – (see Nemirovski et al. '09 from readings)
  – Let ℓ be a strongly convex stochastic function
  – Assume gradient of ℓ is Lipschitz continuous and bounded
  – Then, for step sizes:
  – The expected loss decreases as O(1/t):
Convergence Rates for Gradient Descent/Ascent vs. SGD
• Number of iterations to get to accuracy:
• Gradient descent:
  – If func is strongly convex: O(ln(1/ϵ)) iterations
• Stochastic gradient descent:
  – If func is strongly convex: O(1/ϵ) iterations
• Seems exponentially worse, but much more subtle:
  – Total running time, e.g., for logistic regression:
    • Gradient descent:
    • SGD:
  – SGD can win when we have a lot of data
  – See readings for more details
\[ \ell(w^{*}) - \ell(w) \le \epsilon \]
Constrained SGD: Projected Gradient
• Consider an arbitrary restricted feature space \( w \in \mathcal{W} \)
• Optimization objective:
• If \( w \in \mathcal{W} \), can use projected gradient for (sub)gradient descent:
\[ w^{(t+1)} = \]
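One concrete instance of a projected gradient step, taking W to be an L2 ball. The choice of W is an assumption for illustration; any convex set with a computable Euclidean projection works the same way:

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto W = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def projected_gradient_step(w, g, eta, radius=1.0):
    """One projected (sub)gradient descent step: take the plain
    gradient step, then project back onto the feasible set W."""
    return project_l2_ball(w - eta * g, radius)
```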
Motivating AdaGrad (Duchi, Hazan, Singer 2011)
• Assuming \( w \in \mathbb{R}^d \), standard stochastic (sub)gradient descent updates are of the form:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_t\, g_{t,i} \]
• Should all features share the same learning rate?
• Often have high-dimensional feature spaces
  – Many features are irrelevant
  – Rare features are often very informative
• AdaGrad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations
Why Adapt to Geometry?
Examples from Duchi et al. ISMP 2012 slides (Adaptive Subgradient Methods):

 y_t | φ_{t,1} | φ_{t,2} | φ_{t,3}
  1  |   1     |   0     |   0
 -1  |   .5    |   0     |   1
  1  |  -.5    |   1     |   0
 -1  |   0     |   0     |   0
  1  |   .5    |   0     |   0
 -1  |   1     |   0     |   0
  1  |  -1     |   1     |   0
 -1  |  -.5    |   0     |   1

Feature 1: frequent, irrelevant. Features 2 and 3: infrequent, predictive.
Not All Features are Created Equal
• Examples (from Duchi et al. ISMP 2012 slides):
  – Text data: "The most unsung birthday in American business and technological history this year may be the 50th anniversary of the Xerox 914 photocopier." (The Atlantic, July/August 2010)
  – High-dimensional image features
  – Other motivation: selecting advertisements in online advertising, document ranking, problems with parameterizations of many magnitudes…
Visualizing Effect
Credit: http://imgur.com/a/Hqolp
Regret Minimization
• How do we assess the performance of an online algorithm?
• Algorithm iteratively predicts \( w^{(t)} \)
• Incur loss \( \ell_t(w^{(t)}) \)
• Regret: what is the total incurred loss of the algorithm relative to the best choice of \( w \) that could have been made retrospectively?
\[ R(T) = \sum_{t=1}^{T} \ell_t(w^{(t)}) - \inf_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(w) \]
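Regret as defined above can be estimated empirically. A sketch in which a finite comparator grid stands in for the infimum over W; since the grid may miss the best fixed w, this under-estimates the true regret:

```python
def regret(losses, iterates, comparators):
    """Empirical regret of an online algorithm (sketch):
      R(T) = sum_t l_t(w^{(t)}) - min over comparators of sum_t l_t(w)
    losses: sequence of loss functions l_t, iterates: the algorithm's
    predictions w^{(t)}, comparators: a finite grid standing in for W
    (the true definition takes an infimum over all of W).
    """
    # total loss actually incurred by the algorithm
    alg_loss = sum(l(w) for l, w in zip(losses, iterates))
    # best single fixed choice in hindsight, over the grid
    best_fixed = min(sum(l(w) for l in losses) for w in comparators)
    return alg_loss - best_fixed
```

With squared losses centered at +1 and -1, for instance, the constant prediction 0 is the hindsight-optimal fixed point and achieves zero empirical regret.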
Regret Bounds for Standard SGD
• Standard projected gradient stochastic updates:
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta g_t)\|_2^2 \]
• Standard regret bound:
\[ \sum_{t=1}^{T} \ell_t(w^{(t)}) - \ell_t(w^{*}) \le \frac{1}{2\eta} \|w^{(1)} - w^{*}\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|g_t\|_2^2 \]
Projected Gradient using Mahalanobis
• Standard projected gradient stochastic updates:
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta g_t)\|_2^2 \]
• What if instead of an L2 metric for projection, we considered the Mahalanobis norm?
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta A^{-1} g_t)\|_A^2 \]
Mahalanobis Regret Bounds
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta A^{-1} g_t)\|_A^2 \]
• What A to choose?
• Regret bound now:
\[ \sum_{t=1}^{T} \ell_t(w^{(t)}) - \ell_t(w^{*}) \le \frac{1}{2\eta} \|w^{(1)} - w^{*}\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|g_t\|_{A^{-1}}^2 \]
• What if we minimize upper bound on regret w.r.t. A in hindsight?
\[ \min_A \sum_{t=1}^{T} g_t^{T} A^{-1} g_t \]
Mahalanobis Regret Minimization
• Objective:
\[ \min_A \sum_{t=1}^{T} g_t^{T} A^{-1} g_t \quad \text{subject to } A \succeq 0, \; \mathrm{tr}(A) \le C \]
• Solution:
\[ A = c \left( \sum_{t=1}^{T} g_t g_t^{T} \right)^{1/2} \]
For proof, see Appendix E, Lemma 15 of Duchi et al. 2011. Uses "trace trick" and Lagrangian.
• A defines the norm of the metric space we should be operating in
AdaGrad Algorithm
• At time t, estimate optimal (sub)gradient modification A by
\[ A_t = \left( \sum_{\tau=1}^{t} g_\tau g_\tau^{T} \right)^{1/2} \]
• For d large, A_t is computationally intensive to compute. Instead, use \( \mathrm{diag}(A_t) \)
• Then, algorithm is a simple modification of normal updates:
\[ w^{(t+1)} = \arg\min_{w \in \mathcal{W}} \|w - (w^{(t)} - \eta\, \mathrm{diag}(A_t)^{-1} g_t)\|_{\mathrm{diag}(A_t)}^2 \]
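For W = R^d the diagonal update reduces to a simple per-coordinate loop. A sketch; the small `delta` guard against division by zero before a feature is first seen is an implementation detail, not part of the slide's derivation:

```python
import numpy as np

def adagrad(grad, w0, data, eta=0.1, delta=1e-8):
    """Diagonal AdaGrad for W = R^d (sketch):
      G_{t,i} = sum_{tau <= t} g_{tau,i}^2
      w_i <- w_i - eta / sqrt(G_{t,i}) * g_{t,i}
    grad(w, x) returns the (sub)gradient at sample x; delta guards
    against dividing by zero before a coordinate is first seen.
    """
    w = w0.copy()
    G = np.zeros_like(w0)          # running sum of squared gradients
    for x in data:
        g = grad(w, x)
        G += g ** 2
        # per-coordinate step: rarely-seen features keep large steps
        w -= eta * g / (np.sqrt(G) + delta)
    return w
```

Coordinates with large accumulated squared gradients get small steps, while rare but informative features retain large ones, which is exactly the geometry-adaptation argument above.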
AdaGrad in Euclidean Space
• For \( \mathcal{W} = \mathbb{R}^d \),
• For each feature dimension,
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_{t,i}\, g_{t,i} \]
where
\[ \eta_{t,i} = \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \]
• That is,
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i} \]
• Each feature dimension has its own learning rate!
  – Adapts with t
  – Takes geometry of the past observations into account
  – Primary role of η is determining rate the first time a feature is encountered
AdaGrad Theoretical Guarantees
• AdaGrad regret bound:
\[ \sum_{t=1}^{T} \ell_t(w^{(t)}) - \ell_t(w^{*}) \le 2 R_\infty \sum_{i=1}^{d} \|g_{1:T,i}\|_2, \qquad R_\infty := \max_t \|w^{(t)} - w^{*}\|_\infty \]
  – In stochastic setting:
\[ E\left[ \ell\left( \frac{1}{T} \sum_{t=1}^{T} w^{(t)} \right) \right] - \ell(w^{*}) \le \frac{2 R_\infty}{T} \sum_{i=1}^{d} E\left[ \|g_{1:T,i}\|_2 \right] \]
• This really is used in practice!
• Many cool examples. Let's just examine one…
AdaGrad Theoretical Example
• Expect to out-perform when gradient vectors are sparse
• SVM hinge loss example:
\[ \ell_t(w) = \left[ 1 - y_t \langle x_t, w \rangle \right]_+, \qquad x_t \in \{-1, 0, 1\}^d \]
• If \( x_{t,j} \ne 0 \) with probability \( \propto j^{-\alpha}, \; \alpha > 1 \):
\[ E\left[ \ell\left( \frac{1}{T} \sum_{t=1}^{T} w^{(t)} \right) \right] - \ell(w^{*}) = O\left( \frac{\|w^{*}\|_\infty}{\sqrt{T}} \cdot \max\{\log d,\, d^{1-\alpha/2}\} \right) \]
• Previously best known method:
\[ E\left[ \ell\left( \frac{1}{T} \sum_{t=1}^{T} w^{(t)} \right) \right] - \ell(w^{*}) = O\left( \frac{\|w^{*}\|_\infty}{\sqrt{T}} \cdot \sqrt{d} \right) \]
Neural Network Learning
• Very non-convex problem, but use SGD methods anyway

Wildly non-convex problem (from Duchi et al. ISMP 2012 slides):
\[ \ell(w, x) = \log\left( 1 + \exp\left( \langle [p(\langle w_1, x_1 \rangle) \cdots p(\langle w_k, x_k \rangle)],\, x_0 \rangle \right) \right), \qquad p(\alpha) = \frac{1}{1 + \exp(\alpha)} \]
Idea: Use stochastic gradient methods to solve it anyway.

[Figure: "Accuracy on Test Set" — Average Frame Accuracy (%) vs. Time (hours) for SGD GPU, Downpour SGD, Downpour SGD w/AdaGrad, and Sandblaster L-BFGS (Dean et al. 2012). Distributed, d = 1.7·10^9 parameters; SGD and AdaGrad use 80 machines (1000 cores), L-BFGS uses 800 (10,000 cores).]
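The loss on this slide can be evaluated directly for a one-hidden-layer network. A sketch that keeps the slide's form of p(α) as printed; the simplification that every hidden unit sees the same input vector x, and all names, are illustrative assumptions:

```python
import numpy as np

def p(alpha):
    # the slide's unit nonlinearity, as printed
    return 1.0 / (1.0 + np.exp(alpha))

def nn_loss(W, x0, x):
    """Loss from the slide for a one-hidden-layer network:
      l(w, x) = log(1 + exp(< [p(<w_1,x>) ... p(<w_k,x>)], x_0 >))
    W: (k, d) hidden-layer weights (rows w_1..w_k), x0: (k,) output
    weights, x: (d,) input vector.
    """
    hidden = p(W @ x)                   # p(<w_j, x>) for each hidden unit
    return np.log1p(np.exp(hidden @ x0))
```

Composing the nonlinearity inside the log-loss is what makes the objective non-convex in W, which is why only the stochastic gradient machinery (not the convex regret theory) carries over.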
What you should know about Logistic Regression (LR) and Click Prediction
• Click prediction problem:
  – Estimate probability of clicking
  – Can be modeled as logistic regression
• Logistic regression model: Linear model
• Gradient ascent to optimize conditional likelihood
• Overfitting + regularization
• Regularized optimization
  – Convergence rates and stopping criterion
• Stochastic gradient ascent for large/streaming data
  – Convergence rates of SGD
• AdaGrad motivation, derivation, and algorithm