
Page 1: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Efficient Logistic Regression with Stochastic Gradient Descent

William Cohen

Page 2: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Admin
• Phrase assignment extension
  – We forgot about discarding phrases with stop words
  – Aside: why do stop words hurt performance?
• How many map-reduce assignments does it take to screw in a lightbulb?
  – ?
• AWS mechanics
  – I will circulate keys for $100 of time … soon
  – You'll need to sign up for an account with a credit/debit card anyway as a guarantee; let me know if this is a major problem
  – $100 should be plenty; let us know if you run out

Page 3: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Reminder: Your map-reduce assignments are mostly done

Old NB learning
• Stream & sort
• Stream – sort – aggregate
  – Counter-update "message"
  – Optimization: in-memory hash, periodically emptied
  – Sort the messages
  – Logic to aggregate them
• Workflow is on one machine

New NB learning
• Map-reduce
• Map – shuffle – reduce
  – Counter-update Map
  – Combiner
  – (Hidden) shuffle & sort
  – Sum-reduce
• Workflow is done in parallel

Page 4: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

NB implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

train.dat looks like this:

id1 w1,1 w1,2 w1,3 …. w1,k1
id2 w2,1 w2,2 w2,3 ….
id3 w3,1 w3,2 ….
id4 w4,1 w4,2 …
id5 w5,1 w5,2 …
...

counts.dat looks like this:

X=w1^Y=sports      5245
X=w1^Y=worldNews   1054
X=..               2120
X=w2^Y=…           373
X=…                …

Map to generate counter updates + Sum combiner + Sum reducer
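For concreteness, here is a small Hadoop-streaming-style sketch of that counting phase in Python. It is not the course's Java code (CountForNB etc. above are the real interface), and the train.dat line format with an explicit label field is an assumption, chosen so the emitted keys match the X=w^Y=label keys shown in counts.dat.

# Sketch of the counter-update Map (with in-memory combining) plus the sum Reduce.
# The mapper reads labeled documents and emits "X=w^Y=label<TAB>count";
# the reducer reads the sorted stream and sums the counts for each key.
import sys
from collections import defaultdict

def count_mapper(lines, flush_every=100000):
    buf = defaultdict(int)                       # in-memory hash, periodically emptied
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        _doc_id, label, words = parts[0], parts[1], parts[2:]   # assumed line format
        for w in words:
            buf["X=%s^Y=%s" % (w, label)] += 1
        if len(buf) >= flush_every:              # flush so memory stays bounded
            for key, n in buf.items():
                print("%s\t%d" % (key, n))
            buf.clear()
    for key, n in buf.items():
        print("%s\t%d" % (key, n))

def sum_reducer(sorted_lines):
    last, total = None, 0
    for line in sorted_lines:
        key, n = line.rstrip("\n").rsplit("\t", 1)
        if last is not None and key != last:     # key changed: emit the finished count
            print("%s\t%d" % (last, total))
            total = 0
        last, total = key, total + int(n)
    if last is not None:
        print("%s\t%d" % (last, total))

if __name__ == "__main__":
    (count_mapper if sys.argv[1:2] == ["map"] else sum_reducer)(sys.stdin)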

Page 5: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

words.dat

w Counts associated with W

aardvark C[w^Y=sports]=2

agent C[w^Y=sports]=1027,C[w^Y=worldNews]=564

… …

zynga C[w^Y=sports]=21,C[w^Y=worldNews]=4464

Map

Reduce

Page 6: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

Input looks like this (the two sorted streams feeding the join):

words.dat records:
w          Counts
aardvark   C[w^Y=sports]=2
agent      …
zynga      …

requests generated from test.dat:
found      ~ctr to id1
aardvark   ~ctr to id2
…
today      ~ctr to idi

Output looks like this (after the shuffle/sort, each word's counts record comes first, then its requests):
w          Counts / requests
aardvark   C[w^Y=sports]=2
aardvark   ~ctr to id1
agent      C[w^Y=sports]=…
agent      ~ctr to id345
agent      ~ctr to id9854
…          ~ctr to id345
agent      ~ctr to id34742
zynga      C[…]
zynga      ~ctr to id1

Map + IdentityReduce; save output in temp files

Identity Map with two sets of inputs; custom secondary sort (used in shuffle/sort)

Reduce, output to temp
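That reduce is simple once the custom secondary sort guarantees that, within each word, the counts record arrives before the requests. A hypothetical stand-in for answerWordCountRequests (not the actual course code; tab-separated key/value lines are an assumption) could look like this:

# Sketch of the "answer word-count requests" reducer.
# Assumes the merged, sorted stream shown above: for each word, a line
#   "<word>\tC[...]=..."      (the counts record) comes before any
#   "<word>\t~ctr to <id>"    (request) lines, thanks to the secondary sort.
import sys

def answer_requests(sorted_lines):
    current_word, current_counts = None, None
    for line in sorted_lines:
        word, payload = line.rstrip("\n").split("\t", 1)
        if payload.startswith("~ctr to "):
            doc_id = payload[len("~ctr to "):]
            counts = current_counts if word == current_word else "C[unseen word]"
            # route the answer to the requesting document id
            print("%s\t~ctr for %s is %s" % (doc_id, word, counts))
        else:
            current_word, current_counts = word, payload   # remember the counts record

if __name__ == "__main__":
    answer_requests(sys.stdin)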

Page 7: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

Output looks like this:

id1 ~ctr for aardvark is C[w^Y=sports]=2
…
id1 ~ctr for zynga is …
…

test.dat looks like this:

id1 found an aardvark in zynga's farmville today!
id2 …
id3 …
id4 …
id5 …

Identity Map with two sets of inputs; custom secondary sort

Page 8: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | tee words.dat | sort | java answerWordCountRequests | tee test.dat | sort | testNBUsingRequests

Input looks like this:

Key   Value
id1   found aardvark zynga farmville today
      ~ctr for aardvark is C[w^Y=sports]=2
      ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564
id2   w2,1 w2,2 w2,3 ….
      ~ctr for w2,1 is …
…     …

Reduce

Page 9: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Reminder: The map-reduce assignments are mostly done

Old NB testing
• Request-answer
  – Produce requests
  – Sort requests and records together
  – Send answers to requestors
  – Sort answers and sending entities together
  – …
  – Reprocess input with request answers
• Workflows are isomorphic

New NB testing
• Two map-reduces
  – Produce requests w/ Map
  – (Hidden) shuffle & sort with custom secondary sort
  – Reduce, which sees records first, then requests
  – Identity map with two inputs
  – Custom secondary sort
  – Reduce that sees old input first, then answers
• Workflows are isomorphic

Page 10: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Outline

• Reminder about early/next assignments
• Logistic regression and SGD
  – Learning as optimization
  – Logistic regression: a linear classifier optimizing P(y|x)
  – Stochastic gradient descent: "streaming optimization" for ML problems
  – Regularized logistic regression
  – Sparse regularized logistic regression
  – Memory-saving logistic regression

Page 11: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization: warmup
• Goal: Learn the parameter θ of a binomial
• Dataset: D={x1,…,xn}, xi is 0 or 1
• MLE estimate of θ = Pr(xi=1):
  – #[flips where xi=1] / #[flips xi] = k/n

• Now: reformulate as optimization…

Page 12: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization: warmup
Goal: Learn the parameter θ of a binomial
Dataset: D={x1,…,xn}, xi is 0 or 1, k of them are 1

$$\Pr(D\mid\theta) = \theta^{k}(1-\theta)^{n-k}$$

Easier to optimize the log:

$$\log\Pr(D\mid\theta) = k\log\theta + (n-k)\log(1-\theta)$$

Page 13: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization: warmup
Goal: Learn the parameter θ of a binomial
Dataset: D={x1,…,xn}, xi is 0 or 1, k are 1

Page 14: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization: warmup
Goal: Learn the parameter θ of a binomial
Dataset: D={x1,…,xn}, xi is 0 or 1, k of them are 1

Set the derivative of the log-likelihood to zero (the endpoints θ=0 and θ=1 aside):

$$\frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \;\Rightarrow\; k - k\theta - n\theta + k\theta = 0 \;\Rightarrow\; n\theta = k \;\Rightarrow\; \theta = \frac{k}{n}$$

Page 15: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization: general procedure

• Goal: Learn the parameter θ of a classifier
  – probably θ is a vector
• Dataset: D={x1,…,xn}
• Write down Pr(D|θ) as a function of θ
• Maximize by differentiating and setting to zero

Page 16: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization: general procedure

• Goal: Learn the parameter θ of …
• Dataset: D={x1,…,xn}
  – or maybe D={(x1,y1),…,(xn,yn)}
• Write down Pr(D|θ) as a function of θ
  – Maybe easier to work with log Pr(D|θ)
• Maximize by differentiating and setting to zero
  – Or, use gradient descent: repeatedly take a small step in the direction of the gradient

Page 17: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization: general procedure for SGD

• Goal: Learn the parameter θ of …
• Dataset: D={x1,…,xn}
  – or maybe D={(x1,y1),…,(xn,yn)}
• Write down Pr(D|θ) as a function of θ
• Big-data problem: we don't want to load all the data D into memory, and the gradient depends on all the data
• Solution:
  – pick a small subset of examples B, with |B| << |D|
  – approximate the gradient using them ("on average" this is the right direction)
  – take a step in that direction
  – repeat…
• B = one example is a very popular choice (see the sketch below)
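As a concrete (hypothetical) illustration of that loop with B = one example, a minimal SGD driver might look like this; grad_one_example stands in for whatever per-example gradient of log Pr(y|x, θ) the model defines, and lam is the step size:

# Minimal stochastic-gradient-descent loop, B = one example per step.
import random

def sgd(data, grad_one_example, theta, lam=0.1, passes=5):
    for _ in range(passes):
        random.shuffle(data)              # process examples in random order
        for x, y in data:
            g = grad_one_example(theta, x, y)
            # take a small step in the (approximate) gradient direction
            theta = [t + lam * gj for t, gj in zip(theta, g)]
    return theta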

Page 18: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for logistic regression

• Goal: Learn the parameter θ of a classifier
  – Which classifier?
  – We've seen y = sign(x · w), but sign is not continuous…
  – Convenient alternative: replace sign with the logistic function

Page 19: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for logistic regression

• Goal: Learn the parameter w of the classifier
• Probability of a single example (y,x) would be: …
• Or with logs: …
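The equations on this slide did not survive the transcript; for reference, the standard forms they correspond to, with p = σ(x·w) and y ∈ {0,1}, are:

$$p = \sigma(x\cdot w) = \frac{1}{1+e^{-x\cdot w}}, \qquad \Pr(y\mid x,w) = p^{\,y}(1-p)^{\,1-y}$$

$$\log\Pr(y\mid x,w) = y\log p + (1-y)\log(1-p)$$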

Page 20: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.
Page 21: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

[Figure: plot of the logistic output p]

Page 22: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.
Page 23: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

An observation

Key computational point:
• if xj=0 then the gradient of wj is zero
• so when processing an example you only need to update weights for the non-zero features of an example.
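This follows from the gradient of the per-example log conditional likelihood (filling in the standard formula that the missing equations on slides 20–23 derive):

$$\frac{\partial}{\partial w_j}\log\Pr(y\mid x,w) = (y-p)\,x_j, \qquad p=\frac{1}{1+e^{-x\cdot w}}$$

so the SGD step for wj is λ(y − p)xj, which is zero whenever xj = 0.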

Page 24: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for logistic regression

• Final algorithm: … — do this in random order
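The algorithm body is an image in the transcript, but its update matches the one spelled out on the later regularized-LR slides (W[j] = W[j] + λ(yi − pi)xj for the non-zero features). A rough Python sketch of this plain, unregularized version, assuming each example is a sparse {feature: value} dict with a 0/1 label:

# Sparse SGD for unregularized logistic regression (a sketch, not the slide's code).
# examples: list of (x, y) with x a dict {feature_name: value} and y in {0, 1}.
import math, random
from collections import defaultdict

def train_lr_sgd(examples, lam=0.1, passes=5):
    W = defaultdict(float)                        # hashtable of weights
    for _ in range(passes):
        random.shuffle(examples)                  # do this in random order
        for x, y in examples:
            score = sum(W[j] * xj for j, xj in x.items())
            score = max(min(score, 30.0), -30.0)  # avoid overflow in exp()
            p = 1.0 / (1.0 + math.exp(-score))
            for j, xj in x.items():               # only non-zero features change
                W[j] += lam * (y - p) * xj
    return W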

Page 25: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Another observation

• Consider averaging the gradient over all the examples D={(x1,y1)…,(xn , yn)}
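The averaged-gradient equation is also an image; written out with the per-example gradient above, it is

$$\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w_j}\log\Pr(y_i\mid x_i,w) \;=\; \frac{1}{n}\sum_{i=1}^{n}(y_i-p_i)\,x_{ij},$$

i.e., for a binary feature j it compares the number of times j appears in positive examples with the number the current model expects.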

Page 26: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for logistic regression

• Goal: Learn the parameter θ of a classifier
  – Which classifier?
  – We've seen y = sign(x · w), but sign is not continuous…
  – Convenient alternative: replace sign with the logistic function
• Practical problem: this overfits badly with sparse features
  – e.g., if feature j occurs only in positive examples, the gradient of wj is always positive!

Page 27: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Outline

• [other stuff]
• Logistic regression and SGD
  – Learning as optimization
  – Logistic regression: a linear classifier optimizing P(y|x)
  – Stochastic gradient descent: "streaming optimization" for ML problems
  – Regularized logistic regression
  – Sparse regularized logistic regression
  – Memory-saving logistic regression

Page 28: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Regularized logistic regression
• Replace LCL …
• … with LCL + a penalty for large weights, e.g. …
• So: …
• becomes: …

Page 29: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Regularized logistic regression
• Replace LCL …
• … with LCL + a penalty for large weights, e.g. …
• So the update for wj becomes: …
• Or: …
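The penalty and the two update forms are images in the transcript. With the squared (L2) penalty implied by the (1 − λ2μ) shrink factor in the pseudocode on the next slides, they would be:

$$\text{maximize}\;\; \sum_i \log\Pr(y_i\mid x_i,w)\;-\;\mu\sum_j w_j^2$$

$$w_j \leftarrow w_j + \lambda\big((y-p)x_j - 2\mu w_j\big) \qquad\text{or, equivalently,}\qquad w_j \leftarrow (1-2\lambda\mu)\,w_j + \lambda(y-p)x_j$$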

Page 30: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for logistic regression

• Algorithm: … — do this in random order

Page 31: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Algorithm: …

Time goes from O(nT) to O(nmT), where
• n = number of non-zero entries,
• m = number of features,
• T = number of passes over the data

Page 32: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Final algorithm:
• Initialize hashtable W
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • pi = …
    • For each feature W[j]
      – W[j] = W[j] - λ2μW[j]
      – If xij > 0 then
        » W[j] = W[j] + λ(yi - pi)xj

Page 33: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Final algorithm:
• Initialize hashtable W
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • pi = …
    • For each feature W[j]
      – W[j] *= (1 - λ2μ)
      – If xij > 0 then
        » W[j] = W[j] + λ(yi - pi)xj

Page 34: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Final algorithm:
• Initialize hashtable W
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • pi = …
    • For each feature W[j]
      – If xij > 0 then
        » W[j] *= (1 - λ2μ)^A
        » W[j] = W[j] + λ(yi - pi)xj

A is number of examples seen since the last time we did an x>0 update on W[j]

Page 35: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Final algorithm:
• Initialize hashtables W, A and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • pi = … ; k++
    • For each feature W[j]
      – If xij > 0 then
        » W[j] *= (1 - λ2μ)^(k-A[j])
        » W[j] = W[j] + λ(yi - pi)xj
        » A[j] = k

k-A[j] is number of examples seen since the last time we did an x>0 update on W[j]

Page 36: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Final algorithm:
• Initialize hashtables W, A and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • pi = … ; k++
    • For each feature W[j]
      – If xij > 0 then
        » W[j] *= (1 - λ2μ)^(k-A[j])
        » W[j] = W[j] + λ(yi - pi)xj
        » A[j] = k

Time goes from O(nmT) back to O(nT), where
• n = number of non-zero entries,
• m = number of features,
• T = number of passes over the data
• memory use doubles (m → 2m)
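To make the k and A[j] bookkeeping concrete, here is a hypothetical sketch of slides 35–36 in runnable form (same sparse-example convention as the earlier sketch; the "pi = …" fill-in is the logistic score):

# Sparse regularized logistic regression with lazy L2 updates (slides 35-36).
# W[j] holds the weight; A[j] holds the value of the example counter k the
# last time feature j was touched, so the skipped shrink steps can be
# applied all at once as (1 - 2*lam*mu)**(k - A[j]).
import math, random
from collections import defaultdict

def train_lr_sgd_lazy(examples, lam=0.1, mu=0.01, passes=5):
    W, A, k = defaultdict(float), defaultdict(int), 0
    shrink = 1.0 - 2.0 * lam * mu
    for _ in range(passes):
        random.shuffle(examples)
        for x, y in examples:
            k += 1
            # catch up the shrinkage for this example's features before scoring
            for j in x:
                W[j] *= shrink ** (k - A[j])
                A[j] = k
            score = max(min(sum(W[j] * xj for j, xj in x.items()), 30.0), -30.0)
            p = 1.0 / (1.0 + math.exp(-score))
            for j, xj in x.items():
                W[j] += lam * (y - p) * xj
    # apply any outstanding shrinkage before returning the weights
    for j in W:
        W[j] *= shrink ** (k - A[j])
    return dict(W)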

Page 37: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Outline

• [other stuff]
• Logistic regression and SGD
  – Learning as optimization
  – Logistic regression: a linear classifier optimizing P(y|x)
  – Stochastic gradient descent: "streaming optimization" for ML problems
  – Regularized logistic regression
  – Sparse regularized logistic regression
  – Memory-saving logistic regression

Page 38: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Pop quiz

• In text classification, most words are
  a. rare
  b. not correlated with any class
  c. given low weights in the LR classifier
  d. unlikely to affect classification
  e. not very interesting

Page 39: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Pop quiz

• In text classification, most bigrams are
  a. rare
  b. not correlated with any class
  c. given low weights in the LR classifier
  d. unlikely to affect classification
  e. not very interesting

Page 40: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Pop quiz

• Most of the weights in a classifier are
  – important
  – not important

Page 41: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

How can we exploit this?
• One idea: combine uncommon words together randomly
• Examples:
  – replace all occurrences of "humanitarianism" or "biopsy" with "humanitarianismOrBiopsy"
  – replace all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"
  – replace all occurrences of "gynecologist" or "constrictor" with "gynecologistOrConstrictor"
  – …
• For Naïve Bayes this breaks independence assumptions
  – it's not obviously a problem for logistic regression, though
• I could combine
  – two low-weight words (won't matter much)
  – a low-weight and a high-weight word (won't matter much)
  – two high-weight words (not very likely to happen)
• How much of this can I get away with?
  – certainly a little
  – is it enough to make a difference?

Page 42: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

How can we exploit this?
• Another observation:
  – the values in my hash table are weights
  – the keys in my hash table are strings for the feature names
    • We need them to avoid collisions
• But maybe we don't care about collisions?
  – Allowing "schizoid" & "duchy" to collide is equivalent to replacing all occurrences of "schizoid" or "duchy" with "schizoidOrDuchy"

Page 43: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Algorithm:
• Initialize hashtables W, A and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • pi = … ; k++
    • For each feature j with xij > 0:
      » W[j] *= (1 - λ2μ)^(k-A[j])
      » W[j] = W[j] + λ(yi - pi)xj
      » A[j] = k

Page 44: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Algorithm:
• Initialize arrays W, A of size R and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • Let V be a hash table so that …
    • pi = … ; k++
    • For each hash value h with V[h] > 0:
      » W[h] *= (1 - λ2μ)^(k-A[h])
      » W[h] = W[h] + λ(yi - pi)V[h]
      » A[h] = k

Page 45: Efficient Logistic Regression with Stochastic Gradient Descent William Cohen.

Learning as optimization for regularized logistic regression

• Algorithm:
• Initialize arrays W, A of size R and set k=0
• For each iteration t=1,…,T
  – For each example (xi, yi)
    • Let V be a hash table so that …
    • pi = … ; k++
    • For each hash value h with V[h] > 0:
      » W[h] *= (1 - λ2μ)^(k-A[h])
      » W[h] = W[h] + λ(yi - pi)V[h]
      » A[h] = k

???
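To make the "Let V be a hash table so that …" step concrete, here is a hypothetical sketch of the hashed version of the algorithm. The missing definition is assumed to be the usual hash-kernel construction, V[hash(j) mod R] summed over the example's feature values:

# Sketch of slides 44-45: regularized LR with the hash trick.
# Feature names are never stored; weights live in a fixed-size array W of
# size R, and V[h] aggregates the feature values that hash to slot h.
import math, random
from collections import defaultdict

def train_lr_sgd_hashed(examples, R=2**20, lam=0.1, mu=0.01, passes=5):
    W = [0.0] * R                        # weights, indexed by hash value
    A = [0] * R                          # last-update counter per slot
    k = 0
    shrink = 1.0 - 2.0 * lam * mu
    for _ in range(passes):
        random.shuffle(examples)
        for x, y in examples:
            k += 1
            V = defaultdict(float)       # "Let V be a hash table so that ..."
            for j, xj in x.items():
                V[hash(j) % R] += xj     # assumed hash-kernel construction
            for h in V:                  # lazy shrinkage, as on slide 35
                W[h] *= shrink ** (k - A[h])
                A[h] = k
            score = max(min(sum(W[h] * vh for h, vh in V.items()), 30.0), -30.0)
            p = 1.0 / (1.0 + math.exp(-score))
            for h, vh in V.items():
                W[h] += lam * (y - p) * vh
    return W

Collisions in W are exactly the random feature-merging described on slides 41–42, which is why a modest table size R can still work.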