Machine Learning on Big Data - ClassIntro


Machine Learning on Big Data using Map Reduce

By Michael Bowles, PhD
Winter, 2012


Where Does Big Data Come From?

- Web data (web logs, click histories)
- E-commerce applications (purchase histories)
- Retail purchase histories (Walmart)
- Bank and credit card transactions


What is Data Mining?

- What page will the visitor visit next? Given:
  - Visitor's browsing history
  - Visitor's demographics
- Should the card company approve the transaction that's waiting? Given:
  - User's usage history
  - Item being purchased
  - Location of merchant
- What isn't data mining?
  - What pages did visitors view most often?
  - What products are most popular?

Data mining tells us something that isn't in the data or isn't a simple summary.


Approaches for Data Mining Large Data

- Data mine a sample
  - Take a manageable subset (fits in memory, runs in reasonable time)
  - Develop models
- Limitations of this method?
  - Generally, more data supports finer-grained models
  - e.g. making specific purchase recommendations: "customers who bought ..." requires much more data than "the top ten most popular are ..."


Side Note on Large Data:

- Rhine's paradox
  - ESP experiment in the 1950s
  - 1 in 1000 subjects can correctly identify the color (red or blue) of 10 cards they can't see
  - Do they have ESP? (Guessing 10 binary outcomes correctly by pure chance has probability 1/2^10, about 1 in 1000, so this is exactly the rate chance predicts.)
- Bonferroni's principle: given enough data, any combination of outcomes can be found
- Is this a reason to avoid large data sets? No.
  It's a reason not to draw conclusions that the data don't support (and this is true no matter how large the data set).


Use Multiple Processor Cores

- What if we want to use the full data set? How can we devote more computational power to our job?
- There are always performance limits with a single processor core => use multiple cores simultaneously.
- Traditional approach: add structure to the programming language (C++, Java)
- Issues with this approach:
  - High communication costs (if data must be distributed over a network)
  - Difficult to deal with CPU failures at this level (processor failures are inevitable as scale increases)
- To deal with these issues Google developed the Map-Reduce paradigm


What is Map-Reduce?

- Arrangement of compute tasks enabling relatively easy scaling.
- Includes:
  - Hardware arrangement: racks of CPUs with direct access to local disk, networked to one another
  - File system: distributed storage across multiple disks, with redundancy
- Software processes running on the various CPUs in the assembly:
  - Controller: manages mapper and reducer tasks, fault detection and recovery
  - Mapper: identical tasks assigned to multiple CPUs, each to run over its local data
  - Reducer: aggregates output from several mappers to form the end product
- Programmer only needs to author the mapper and reducer. The rest of the structure is provided.

Dean, Jeff and Ghemawat, Sanjay. "MapReduce: Simplified Data Processing on Large Clusters", http://labs.google.com/papers/mapreduce-osdi04.pdf

Simple Code Example (using mrJob) (this is running code)

from mrjob.job import MRJob
from math import sqrt
import json

class mrMeanVar(MRJob):
    DEFAULT_PROTOCOL = 'json'

    def mapper(self, key, line):
        # each input line holds one number; emit it and its square
        # under a single common key so one reducer sees everything
        num = json.loads(line)
        var = [num, num * num]
        yield 1, var

    def reducer(self, n, vars):
        # accumulate the count, the sum, and the sum of squares
        N = 0.0
        sum = 0.0
        sumsq = 0.0
        for x in vars:
            N += 1
            sum += x[0]
            sumsq += x[1]
        # mean and standard deviation from the accumulated sums
        mean = sum / N
        sd = sqrt(sumsq / N - mean * mean)
        results = [mean, sd]
        yield 1, results

if __name__ == '__main__':
    mrMeanVar.run()
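To try it locally (assuming mrjob is installed and an input file, here called numbers.txt, holding one JSON number per line; neither is shown on the slide), run: python mrMeanVar.py numbers.txt. The same script runs on a Hadoop cluster or on Elastic MapReduce via mrjob's -r hadoop and -r emr runner flags.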


Example: Sum Values from a Large Data Set

- Data set D divided into d1, d2, ..., dn (each a list of things that can be summed, say real numbers or vectors)
- Mappers running on CPU 1, CPU 2, ..., CPU n
- Each mapper forms a sum over its piece of the data and emits the sums s1, s2, ..., sn.

[Diagram: the mapper on CPU 1 sums the elements of d1, the mapper on CPU 2 sums d2, ..., the mapper on CPU n sums dn; the sums s1, s2, ..., sn feed a single reducer, which sums the mapper outputs.]
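A minimal mrjob sketch of this arrangement (assuming, as in the earlier example, one JSON number per input line; the class name is illustrative). Here the mapper emits individual values; the per-mapper partial sums s1 ... sn of the diagram would come from a combiner, shown on the "More Map Reduce Detail" slide below.

from mrjob.job import MRJob
import json

class mrSum(MRJob):
    def mapper(self, key, line):
        # emit each number under a common key; the framework groups
        # everything with the same key into a single reducer call
        yield 1, json.loads(line)

    def reducer(self, key, values):
        # the single reducer totals the mapper outputs
        yield key, sum(values)

if __name__ == '__main__':
    mrSum.run()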


Machine Learning w. Map Reduce

- Statistical Query Model (see ref below): a two-step process
  1. Compute sufficient statistics by summing some functions over the data
  2. Perform a calculation on the sums to yield the data mining model
  (Conceptually similar to the "sum" example above.)
- Consider ordinary least squares regression
  - Given m outputs y_i (also called labels, observations, etc.)
  - And m corresponding attribute vectors x_i (also called regressors, predictors, etc.)
  - Fit a model of the form y = βᵀx by solving β* = argmin_β Σ_i (y_i − βᵀx_i)²
  - The solution is β* = A⁻¹b, where A = Σ_i x_i x_iᵀ and b = Σ_i x_i y_i
- See the natural division into mapper and reducer? (Hint: look for a Σ)

"Map-Reduce for Machine Learning on Multicore", http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

Machine Learning with Map Reduce

- With OLS the mappers compute partial sums of the form Σ x_i x_iᵀ and Σ x_i y_i. The reducer aggregates the partial sums into totals and completes the calculation β* = A⁻¹b.

[Diagram: the mapper on CPU 1 sums x_i x_iᵀ and x_i y_i for the x_i in d1, the mapper on CPU 2 does the same for d2, ..., the mapper on CPU n for dn; the reducer aggregates the mapper outputs into A = Σ x_i x_iᵀ and b = Σ x_i y_i and calculates β* = A⁻¹b.]
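A sketch of this OLS computation in plain Python with numpy (the function names and the explicit row-block partitioning stand in for the mappers' local data; both are illustrative, not from the slides):

import numpy as np

def ols_mapper(X_block, y_block):
    # partial sums over this mapper's block: sum of x x^T and sum of x y
    A_part = X_block.T @ X_block          # sum_i x_i x_i^T
    b_part = X_block.T @ y_block          # sum_i x_i y_i
    return A_part, b_part

def ols_reducer(partials):
    # aggregate the partial sums into totals, then finish the calculation
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)          # beta* = A^{-1} b

# usage: split the rows across "mappers", reduce once
X, y = np.random.randn(1000, 5), np.random.randn(1000)
parts = [ols_mapper(X[i:i + 250], y[i:i + 250]) for i in range(0, 1000, 250)]
beta = ols_reducer(parts)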


Machine Learning w. Map Reduce

- The referenced paper demonstrates that the following algorithms can all be arranged in this Statistical Query Model form:
  - Locally Weighted Linear Regression
  - Naïve Bayes
  - Gaussian Discriminative Analysis
  - k-Means
  - Neural Networks
  - Principal Component Analysis
  - Independent Component Analysis
  - Expectation Maximization
  - Support Vector Machines
- In some cases, iteration is required. Each iterative step involves a map-reduce sequence.
- Other machine learning algorithms can be arranged for map reduce but not in Statistical Query Model form (e.g. canopy clustering or binary decision trees).


More Map Reduce Detail

- Mappers emit key-value pairs
  - The controller sorts key-value pairs by key
  - The reducer gets pairs grouped by key
- Mapper can be a two-step process
  - With OLS, for example, we might have had the mapper emit each x_i x_iᵀ and x_i y_i, instead of emitting sums of these quantities
  - Post-processing the mapper output (e.g. forming Σ x_i x_iᵀ) is done by a mapper-side function called a "combiner" (see the sketch below)
  - This reduces network traffic
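A sketch of a combiner in mrjob, extending the earlier sum example (the class name is illustrative; combiner is a standard MRJob method):

from mrjob.job import MRJob
import json

class mrSumWithCombiner(MRJob):
    def mapper(self, key, line):
        # fine-grained output: one pair per input record
        yield 1, json.loads(line)

    def combiner(self, key, values):
        # runs on the mapper's machine, collapsing its local output
        # into one partial sum before anything crosses the network
        yield key, sum(values)

    def reducer(self, key, values):
        # totals the partial sums arriving from the combiners
        yield key, sum(values)

if __name__ == '__main__':
    mrSumWithCombiner.run()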


Some Algorithms

- Canopy Clustering
- K-means
- EM algorithm for Gaussian Mixture Model
- Glmnet
- SVM


Canopy Clustering

- Usually used as a rough clustering to reduce computation (e.g. search the zip code for the closest pizza versus search the world)
- Also finds well-distributed initial conditions for other clustering methods (e.g. k-means)
- Algorithm (sketched in code below):
  Given a set of points P and a distance measure d(·,·)
  Pick two distance thresholds T1 > T2 > 0
  Step 1: Find cluster centers
    1. Initialize the set of centers C = ∅
    2. Iterate over points p_i ∈ P:
       If there isn't a c ∈ C s.t. d(c, p_i) < T2, add p_i to C
       Get next p_i
  Step 2: Assign points to clusters
    For p_i ∈ P, assign p_i to {c ∈ C : d(p_i, c) < T1}
- Notice that points generally get assigned to more than one cluster.

McCallum, Nigam, and Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching", http://www.kamalnigam.com/papers/canopy-kdd00.pdf
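A compact single-machine sketch of the two steps above in plain Python (the tuple point representation and the Euclidean choice for d(·,·) are illustrative assumptions):

import math

def canopy(points, T1, T2):
    """Return {center: [points within T1 of it]}; requires T1 > T2 > 0."""
    d = lambda a, b: math.dist(a, b)   # Euclidean distance as d(.,.)
    # Step 1: a point becomes a new center if no existing center
    # is within the tight threshold T2
    centers = []
    for p in points:
        if not any(d(c, p) < T2 for c in centers):
            centers.append(p)
    # Step 2: assign each point to every center within the loose
    # threshold T1 (so points can land in several canopies)
    return {c: [p for p in points if d(p, c) < T1] for c in centers}

clusters = canopy([(0, 0), (1, 0), (9, 9), (10, 9)], T1=3.0, T2=1.5)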

Canopy Clustering Picture

[Figure: points marked "x" in the plane, covered by overlapping canopies labeled Cluster 1 through Cluster 4.]


Canopy Clustering with Map Reduce

- 1st pass: find centers
  - Mappers: run canopy clustering on a subset; pass the centers to the reducer
  - Reducer: run canopy clustering on the centers from the mappers
- 2nd pass: make cluster assignments (if necessary)
  - Mappers: compare points p_i to the centers to form the set c_i = {c ∈ C | d(p_i, c) < T1}; emit <c, p_i> for each c ∈ c_i
  - Reducer: since the reducer input is sorted on key value (here, that's cluster center), the reducer input will be a list of all the points assigned to a given center
- One small problem is that the centers picked by the reducer may not cover all the points of the combined original set. Pick T1 > 2·T2, or use a larger T2 in the reducer, in order to ensure that all points are covered.

The Apache Mahout project has a lot of great algorithms and documentation. https://cwiki.apache.org/MAHOUT/canopy-clustering.html

K-Means Clustering

- The K-means algorithm seeks to partition a data set into K disjoint sets, such that the sum of the within-set variances is minimized.
- Using Euclidean distance, the within-set variance is the sum of squared distances from the set's centroid.
- Lloyd's algorithm for K-means goes as follows:
  Initialize: pick K starting guesses for centroids (at random)
  Iterate:
    Assign points to the cluster whose centroid is closest
    Recalculate the cluster centroids
- Here's a sequence from the Wikipedia page on k-means clustering: http://en.wikipedia.org/wiki/K-means_clustering

K-means in Map Reduce

- Mapper: given the K initial guesses, run through the local data and for each point determine which centroid is closest; accumulate the vector sum of the points closest to each centroid; the combiner emits <i, (vector sum, count)> for each of the i centroids.
- Reducer: for each old centroid i, aggregate the sums and counts from all the mappers and calculate the new centroid.
- This map-reduce pair completes an iteration of Lloyd's algorithm.
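One iteration of this map-reduce pair, sketched in plain Python with numpy (the function names and the in-memory blocks standing in for each mapper's local data are illustrative):

import numpy as np

def kmeans_mapper(block, centroids):
    # for each local point, find the closest centroid and accumulate
    # (vector sum, count) per centroid; this is the combiner output
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids), dtype=int)
    for x in block:
        i = np.argmin(((centroids - x) ** 2).sum(axis=1))
        sums[i] += x
        counts[i] += 1
    return sums, counts

def kmeans_reducer(mapper_outputs, old_centroids):
    # aggregate sums and counts from all mappers, then recompute centroids
    total = sum(s for s, _ in mapper_outputs)
    n = sum(c for _, c in mapper_outputs)
    new = old_centroids.copy()
    nonempty = n > 0
    new[nonempty] = total[nonempty] / n[nonempty, None]
    return new

# one Lloyd iteration over two "mapper" blocks
data = np.random.randn(200, 2)
centroids = data[:3].copy()
outs = [kmeans_mapper(b, centroids) for b in (data[:100], data[100:])]
centroids = kmeans_reducer(outs, centroids)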


Support Vector Machine - SVM

- For classification, SVM finds a separating hyperplane
- H3 does not separate the classes. H1 separates, but not with the max margin. H2 separates with the max margin.

Figure from http://en.wikipedia.org/wiki/Support_vector_machine

SVM as Mathematical Optimization Problem

- Given a training set S = {(x_i, y_i)}
- Where x_i ∈ Rⁿ and y_i ∈ {+1, -1}
- We can write any linear functional of x as wᵀx + b, where w ∈ Rⁿ and b is a scalar
- Find the weight vector w and constant b to minimize

    min over (w, b) of (λ/2)‖w‖² + (1/m) Σ_{(x,y)∈S} l(w, b; (x, y))

  where

    l(w, b; (x, y)) = max{0, 1 − y(wᵀx + b)}
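Evaluating this objective in code (a numpy sketch; lam stands for the regularization weight λ, which the slide leaves symbolic):

import numpy as np

def hinge_loss(w, b, X, y):
    # l(w,b;(x,y)) = max{0, 1 - y (w^T x + b)} for each training pair
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def svm_objective(w, b, X, y, lam):
    # (lam/2) ||w||^2 + (1/m) * sum of hinge losses
    m = len(y)
    return 0.5 * lam * (w @ w) + hinge_loss(w, b, X, y).sum() / m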


Solution by Batch Gradient Descent

- We can solve this by using batch gradient descent. Take a derivative with respect to w and b.
- Notice that if 1 − y(wᵀx + b) < 0, then l(·) = 0 and ∇l(·) = 0.
- Denote S+ = {(x, y) ∈ S | 1 − y(wᵀx + b) > 0}. Then the gradient with respect to w is

    λw + (1/m) Σ_{(x,y)∈S+} (−y x)

  and with respect to b it is

    (1/m) Σ_{(x,y)∈S+} (−y)

For reference see "Map Reduce for Machine Learning on Multicore" mentioned earlier. Also see Shalev-Shwartz et al., "Pegasos: Primal Estimated sub-Gradient Solver for SVM".


Calculating a Gradient Step

- What's important about what we just developed? Several things:
  1. In the equation we wound up with terms like λw + (1/m) Σ_{(x,y)∈S+} (−y x); only the term inside the sum is data dependent.
  2. The data-dependent term Σ(−y x) is summed over the points in the input data where the constraints are active.
- Let's summarize by drawing up the mapper and reducer functions.
  Initialize w and b, the constants (regularization weight λ, step size η), and m (# of instances).
  Iterate:
    Mapper: each mapper has access to a subset Sm of S. For points (x, y) ∈ Sm check 1 − y(wᵀx + b) > 0. If yes, accumulate −y and −yx. Emit the accumulated sums. We'll put in a dummy key value, but all the output gets summarized in a single (vector) quantity to be processed by a single reducer.
    Reducer: accumulate −y and −yx from the mappers and update the estimates of w and b using

      w_new = w_old − η (λ w_old + (1/m) Σ (−y x))
      b_new = b_old − η (1/m) Σ (−y)
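The corresponding iteration, sketched in numpy (eta and lam stand for the step size η and regularization weight λ; the row blocks stand in for each mapper's local subset Sm; all names are illustrative):

import numpy as np

def svm_mapper(Xb, yb, w, b):
    # active set: points with 1 - y (w^T x + b) > 0
    active = 1.0 - yb * (Xb @ w + b) > 0
    # emit the data-dependent partial sums: sum(-y x) and sum(-y)
    return -(yb[active] @ Xb[active]), -yb[active].sum()

def svm_reducer(partials, w, b, lam, eta, m):
    gw = sum(p[0] for p in partials)   # total sum of -y x
    gb = sum(p[1] for p in partials)   # total sum of -y
    w_new = w - eta * (lam * w + gw / m)
    b_new = b - eta * (gb / m)
    return w_new, b_new

# one gradient step over two "mapper" blocks
X = np.random.randn(200, 4)
y = np.where(np.random.randn(200) > 0, 1.0, -1.0)
w, b = np.zeros(4), 0.0
parts = [svm_mapper(X[:100], y[:100], w, b), svm_mapper(X[100:], y[100:], w, b)]
w, b = svm_reducer(parts, w, b, lam=0.1, eta=0.1, m=200)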


GLMNet Algorithm

- Regularized regression
- To avoid over-fitting, we have to exert control over the degrees of freedom in the regression
  - Cut back on attributes: subset selection
  - Penalize regression coefficients: coefficient shrinkage, ridge regression, lasso regression
- With coefficient shrinkage, different penalties give solutions with different properties

"Regularization Paths for Generalized Linear Models via Coordinate Descent", Friedman, Hastie and Tibshirani, http://www.stanford.edu/~hastie/Papers/glmnet.pdf

Regularizing Regression with Coefficient Shrinkage

- Start with the OLS problem formulation and add a penalty to the OLS error
- Suppose y ∈ R and a vector of predictors x ∈ Rⁿ
- As with OLS, seek a fit for y of the form y ≈ β_0 + xᵀβ
- Assume that the x_i have been standardized to mean zero and unit variance
- Find

    min over (β_0, β) of (1/2m) Σ_i (y_i − β_0 − x_iᵀβ)² + λ P(β)

- The first part (the sum) is the ordinary least squares penalty. The bit at the end (λ P(β)) is new.
- Notice that the minimizing set of β's is a function of the parameter λ. If λ = 0 we get the OLS coefficients.

Penalty Term

- The coefficient penalty term is given by

    P(β) = Σ_j [ (1−α)/2 · β_j² + α·|β_j| ]

- This is called the "elastic net" penalty (see ref below).
- α = 1 gives the l1 penalty (sum of absolute values)
- α = 0 gives the squared l2 penalty (sum of squares)
- Why is this important? The choice of penalty influences the nature of the solutions. l1 gives sparse solutions; it ignores attributes. l2 tends to average correlated attributes.
- Consider the coefficient paths as functions of the parameter λ (a code sketch of the penalty follows).
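The penalty as a short numpy sketch (alpha stands for the mixing parameter α; the function name is illustrative):

import numpy as np

def elastic_net_penalty(beta, alpha):
    # P(beta) = sum_j [ (1-alpha)/2 * beta_j^2 + alpha * |beta_j| ]
    return np.sum((1 - alpha) / 2 * beta ** 2 + alpha * np.abs(beta))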

H. Zou and T. Hastie. "Regularization and variable selection via the elastic net." J. Royal Stat. Soc. B, 67(2):301-320, 2005.


Coefficient Trajectories

- Here's Figure 1 from the Friedman et al. paper.


Pathwise Solution

- This algorithm works as follows:
  - Initialize with a value of λ large enough that all the β's are 0.
  - Decrease λ slightly and update the β's by taking a gradient step from the old β based on the new value of λ.
  - Small changes in λ mean that the update converges quickly.
  - In many cases it is faster to generate the entire coefficient trajectory with this algorithm than to generate a point solution.
- Friedman et al. show that the elements of the coefficient vector satisfy the coordinate-wise update

    β_j = S( (1/m) Σ_i x_ij (y_i − ỹ_i^(j)), λα ) / ( 1 + λ(1−α) )

  where ỹ_i^(j) is the fitted value excluding the contribution of x_ij, and the function S(·) is given by

    S(z, γ) = z − γ   if z > 0 and γ < |z|
            = z + γ   if z < 0 and γ < |z|
            = 0       if γ ≥ |z|

- The point is:
  - This algorithm fits the Statistical Query Model.
  - The sum inside the function S(·) can be spread over any number of mappers (see the sketch below).
  - This algorithm handles elastic-net regularized regressions, logistic regression and multiclass logistic regression.
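The soft-threshold function and the coordinate update, sketched in numpy (variable names are illustrative; y_partial stands for the fitted values ỹ^(j) excluding coordinate j):

import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma): shrink z toward zero by gamma; zero inside [-gamma, gamma]
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_update(Xj, y, y_partial, lam, alpha):
    # the inner sum (1/m) sum_i x_ij (y_i - y~_i) is a dot product,
    # so it can be split across mappers and totaled by a reducer
    m = len(y)
    z = Xj @ (y - y_partial) / m
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))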


Summary

- Here's what we covered:
  1. Where do big-data machine learning problems arise?
     - e-commerce, retail, narrow targeting, banks and credit cards
  2. What ways are there to deal with big-data machine learning problems?
     - sampling, language-level parallelization, map-reduce
  3. What is map-reduce?
     - a programming formalism that isolates the programming task to mapper and reducer functions
  4. Application of map-reduce to some familiar machine learning algorithms

Hope you learned something from the class. To get into more detail, come to the big-data class.