2nd Hivemall meetup 20151020


Page 1: 2nd Hivemall meetup 20151020

Introduction to Hivemall and its new features in v0.4

Research Engineer Makoto YUI @myui

2015/10/20 Hivemall meetup #2

Tweet w/ #hivemallmtup

http://eventdots.jp/event/571107

Page 2: 2nd Hivemall meetup 20151020

Who am I?

➢ 2015.04 Joined Treasure Data, Inc. as the 1st Research Engineer at Treasure Data. My mission at TD is developing ML-as-a-Service.

➢ 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale machine learning project and parallel databases.

➢ 2009.03 Ph.D. in Computer Science from NAIST

➢ Super programmer award from the MITOU Foundation

Page 3: 2nd Hivemall meetup 20151020

Agenda

1. What is Hivemall

2. How to use Hivemall

3. New Features in Hivemall v0.4
   1. Random Forest
   2. Factorization Machine

4. Development Roadmap of Hivemall

Page 4: 2nd Hivemall meetup 20151020

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

https://github.com/myui/hivemall

Page 5: 2nd Hivemall meetup 20151020

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

[Figure: software stack, top layer first]
Machine Learning: Hivemall
Query Processing: Hive / PIG
Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing), MR v2
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS

Page 6: 2nd Hivemall meetup 20151020

Hivemall's Vision: ML on SQL

Classification with Mahout

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop

Page 7: 2nd Hivemall meetup 20151020

List of Features in Hivemall v0.3.2

Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA

kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization

Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion

Anomaly Detection
✓ Local Outlier Factor

Treasure Data supports Hivemall v0.3.2-3

Page 8: 2nd Hivemall meetup 20151020

Industry use cases of Hivemall

➢ CTR prediction of ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc. and more

➢ Gender prediction of ad click logs
  • Algorithm: Classification
  • Scaleout Inc.

➢ Churn detection
  • Algorithm: Regression
  • OISIX and more

➢ Item/User recommendation
  • Algorithm: Recommendation (Matrix Factorization / kNN)
  • Adtech companies, ISP portal, and more

➢ Value prediction of real estate
  • Algorithm: Regression
  • Livesense

Page 9: 2nd Hivemall meetup 20151020

How to use Hivemall

[Figure: machine learning workflow. Training: Feature Vector + Label → Prediction Model. Prediction: Feature Vector + Prediction Model → Label.]

Data preparation

Page 10: 2nd Hivemall meetup 20151020

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
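A minimal Python sketch of how one line of such a text file maps onto the three columns, assuming the tab and comma delimiters declared in the table definition. The helper name `parse_row` and the sample line are hypothetical and are not taken from the E2006-tfidf dataset.

```python
def parse_row(line):
    """Split one tab-delimited line into (rowid, label, features)."""
    rowid, label, features = line.rstrip("\n").split("\t")
    # Collection items (the ARRAY<STRING> column) are comma-separated.
    return int(rowid), float(label), features.split(",")


# Hypothetical sample line: rowid, label, then "index:value" feature strings.
row = parse_row("1\t-3.5\t101:0.25,205:0.5\n")
print(row)
```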

Page 11: 2nd Hivemall meetup 20151020

How to use Hivemall

[Figure: machine learning workflow. Training: Feature Vector + Label → Prediction Model. Prediction: Feature Vector + Prediction Model → Label.]

Feature Engineering

Page 12: 2nd Hivemall meetup 20151020

How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization

create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;

Transforming a label value to a value between 0.0 and 1.0
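The min-max transformation itself is a one-line formula. The Python sketch below illustrates it; it is not Hivemall's source code, and returning 0.5 when the minimum equals the maximum is an assumption made here to avoid a division by zero.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # assumption: degenerate range maps to the midpoint
    return (value - min_value) / (max_value - min_value)


print(rescale(5.0, 0.0, 10.0))
```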

Page 13: 2nd Hivemall meetup 20151020

How to use Hivemall

[Figure: machine learning workflow. Training: Feature Vector + Label → Prediction Model. Prediction: Feature Vector + Prediction Model → Label.]

Training

Page 14: 2nd Hivemall meetup 20151020

How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature

map-only task to learn a prediction model

Shuffle map-outputs to reducers by feature

Reducers perform model averaging in parallel
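The averaging step can be pictured outside Hive as follows. This Python sketch assumes each map task emits its own list of (feature, weight) pairs; the function name and the sample weights are hypothetical.

```python
from collections import defaultdict


def average_models(mapper_outputs):
    """Average the weights learned by independent map tasks, per feature."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for output in mapper_outputs:        # one list of pairs per map task
        for feature, weight in output:   # "GROUP BY feature"
            sums[feature] += weight
            counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}  # avg(weight)


mapper_outputs = [
    [("price", 0.4), ("size", 1.0)],
    [("price", 0.6), ("age", -0.2)],
]
model = average_models(mapper_outputs)
print(model)
```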

Page 15: 2nd Hivemall meetup 20151020

How to use Hivemall - Training

Training of Confidence Weighted Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features, label) as (feature, weight)
  FROM
    news20b_train
) t
GROUP BY feature

Vote to use negative or positive weights for avg

+0.7, +0.3, +0.2, -0.1, +0.7

Training for the CW classifier
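A Python sketch of the voting idea, using the five weights from the slide. It illustrates the concept only: the tie-breaking rule (negatives win a tie) and the 0.0 result for an empty vote are assumptions, not confirmed Hivemall behavior.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: on a tie the negative side is used.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0  # assumption: no non-zero weights to vote on
    return sum(winners) / len(winners)


# Four positive votes against one negative: the -0.1 is left out of the average.
print(voted_avg([0.7, 0.3, 0.2, -0.1, 0.7]))
```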

Page 16: 2nd Hivemall meetup 20151020

Ensemble learning for stable prediction performance

Just stack prediction models by union all

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
) t
group by label, feature;

Page 17: 2nd Hivemall meetup 20151020

How to use Hivemall

[Figure: machine learning workflow. Training: Feature Vector + Label → Prediction Model. Prediction: Feature Vector + Prediction Model → Label.]

Prediction

Page 18: 2nd Hivemall meetup 20151020

How to use Hivemall - Prediction

CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory
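The join-then-aggregate logic can be sketched in Python as below. The dictionary lookup stands in for the LEFT OUTER JOIN and the sample model is hypothetical. One simplification is assumed: a row with no matching feature at all gets probability 0.5 here, whereas the SQL query would produce NULL.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: feature -> weight."""
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        # Features missing from the model contribute nothing to the sum.
        totals[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}


model = {"a": 1.0, "b": -1.0}
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")]
probs = predict(rows, model)
print(probs)
```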

Page 19: 2nd Hivemall meetup 20151020

How to use Hivemall

[Figure: machine learning workflow. Batch Training on Hadoop: Feature Vector + Label → Prediction Model. Online Prediction on RDBMS: Feature Vector + Prediction Model → Label.]

Export prediction models

Page 20: 2nd Hivemall meetup 20151020

Online Prediction on MySQL (RDBMS)

Quick (msec) response on an RDBMS by adding an index to the feature column

bit.ly/hivemall-mysql
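A runnable sketch of the same idea, using Python's built-in SQLite module in place of MySQL so that it needs no server. The table layout mirrors the `lr_model` table from the training slides; the index name, the sample weights, and the helper `predict` are hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lr_model (feature TEXT, weight REAL)")
# The index on the feature column is what keeps lookups fast on a large model.
conn.execute("CREATE INDEX idx_lr_model_feature ON lr_model (feature)")
conn.executemany(
    "INSERT INTO lr_model VALUES (?, ?)",
    [("f1", 0.8), ("f2", -0.3), ("f3", 1.5)],
)


def predict(conn, features):
    """Sum the weights of the given features and apply the sigmoid."""
    if not features:
        return 0.5  # assumption: no evidence means probability 0.5
    placeholders = ",".join("?" for _ in features)
    query = (
        "SELECT COALESCE(SUM(weight), 0.0) FROM lr_model "
        f"WHERE feature IN ({placeholders})"
    )
    (total,) = conn.execute(query, list(features)).fetchone()
    return 1.0 / (1.0 + math.exp(-total))


prob = predict(conn, ["f1", "f2", "unknown"])
print(prob)
```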

Page 21: 2nd Hivemall meetup 20151020

Agenda

1. What is Hivemall

2. How to use Hivemall

3. New Features in Hivemall v0.4
   1. Random Forest
   2. Factorization Machine

4. Development Roadmap of Hivemall

Page 22: 2nd Hivemall meetup 20151020

Features to be supported in Hivemall v0.4

1. Random Forest
   • classification, regression
   • Based on Smile github.com/haifengl/smile

2. Factorization Machine
   • classification, regression (factorization)

Planned to release v0.4 in Oct.

Factorization Machines are often used by data science competition winners (Criteo/Avazu CTR prediction)

Page 23: 2nd Hivemall meetup 20151020

Random Forest in Hivemall v0.4

Ensemble of Decision Trees

Already available on a development (smile) branch, and its usage is explained in the project wiki

Bagging
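Bagging means each tree is trained on a bootstrap sample (rows drawn with replacement), and the trees' predictions are combined by voting. The Python sketch below shows only the sampling and voting steps, not tree training; the function names are hypothetical and this is not how Hivemall or Smile implement it internally.

```python
import random
from collections import Counter


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
in_bag, out_of_bag = bootstrap_sample(10, rng)
print(in_bag, out_of_bag)
print(majority_vote(["spam", "ham", "spam"]))
```

Rows that land out-of-bag for a given tree were never seen by it, which is what makes out-of-bag testing possible without a separate validation set.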

Page 24: 2nd Hivemall meetup 20151020

Training of Random Forest

Page 25: 2nd Hivemall meetup 20151020

Out-of-bag tests and Variable Importance

Page 26: 2nd Hivemall meetup 20151020

Prediction of Random Forest

Page 27: 2nd Hivemall meetup 20151020

Random Forest

DEMO

http://bit.ly/hivemall-rf

Page 28: 2nd Hivemall meetup 20151020

Factorization Machine

Matrix Factorization

Page 29: 2nd Hivemall meetup 20151020

Factorization Machine

Context information (e.g., time) can be considered

Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

Page 30: 2nd Hivemall meetup 20151020

Factorization Machine

Factorization Model with degree=2 (2-way interaction)

[Equation not captured in the transcript; its annotated terms are:]
• Global Bias
• Regression coefficient of j-th variable
• Pairwise Interaction
• Factorization
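The annotated terms correspond to the degree-2 model in the Rendle paper cited as the source: y(x) = w0 + Σ_j w_j x_j + Σ_{j<j'} <v_j, v_j'> x_j x_j'. The Python sketch below evaluates that formula directly with the naive double loop; the parameter names and sample values are hypothetical, and this is not Hivemall's implementation.

```python
def fm_predict(w0, w, V, x):
    """Degree-2 factorization machine prediction.

    w0: global bias
    w:  regression coefficient of each variable
    V:  one k-dimensional factor vector per variable
    x:  input feature vector
    """
    n = len(x)
    linear = sum(w[j] * x[j] for j in range(n))
    pairwise = 0.0
    for j in range(n):
        for jp in range(j + 1, n):
            # The interaction weight is factorized as a dot product.
            dot = sum(a * b for a, b in zip(V[j], V[jp]))
            pairwise += dot * x[j] * x[jp]
    return w0 + linear + pairwise


y = fm_predict(0.5, [1.0, 2.0], [[1.0, 0.0], [0.5, 2.0]], [1.0, 2.0])
print(y)
```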

Page 31: 2nd Hivemall meetup 20151020

Factorization Machine

Factorization Machine ≈ Polynomial Regression + Factorization

For a feature [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

bit.ly/hivemall-poly
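The expansion for [a, b] can be reproduced with a few lines of Python (3.8 or later for `math.prod`). The function name `polynomial_features` is hypothetical and is not a Hivemall function.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(x, degree=2):
    """All monomials of the inputs up to the given degree, constant term first."""
    features = []
    for d in range(degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            features.append(prod(x[i] for i in idx))
    return features


# With a = 2 and b = 3 the order is [1, a, b, a^2, ab, b^2].
print(polynomial_features([2, 3]))
```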

Page 32: 2nd Hivemall meetup 20151020

Factorization Machine

DEMO

Page 33: 2nd Hivemall meetup 20151020

Agenda

1. What is Hivemall

2. How to use Hivemall

3. New Features in Hivemall v0.4
   1. Random Forest
   2. Factorization Machine

4. Development Roadmap of Hivemall

Page 34: 2nd Hivemall meetup 20151020

Features to be supported in Hivemall v0.4.1

1. Gradient Tree Boosting
   • classifier, regression

2. Field-aware Factorization Machine
   • classification, regression (factorization)
   • The existing implementation, i.e., LibFFM, can only be applied to classification

Planned to release v0.4.1 in Nov/Dec.

Page 35: 2nd Hivemall meetup 20151020

Gradient Tree Boosting (or Gradient Boosting Trees)

RF ≈ Bagging + Decision Trees: parallel execution of decision trees

GBT ≈ Boosting + Decision Trees: sequential execution of decision trees
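The sequential nature of boosting shows up clearly in code: each new tree is fitted to the residuals left by the trees before it, so the rounds cannot run independently. The Python sketch below assumes squared loss, one-dimensional inputs, and depth-1 trees (stumps); all names and the sample data are hypothetical, and it is not the planned Hivemall implementation.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # the largest value cannot split the data
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):  # sequential: round t needs the output of round t-1
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
print(mse(model, xs, ys))
```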

Page 36: 2nd Hivemall meetup 20151020

Gradient Tree Boosting

Page 37: 2nd Hivemall meetup 20151020

Features to be supported in Hivemall v0.4.2

1. Online LDA
   • topic modeling, clustering

2. Mix server on Apache YARN
   • Service for parameter sharing among workers
   • working w/ @maropu

Planned to release v0.4.2 in Dec/Jan.

Page 38: 2nd Hivemall meetup 20151020

What's MixServer?

External service to share parameters among distributed training processes in the middle of training

[Figure: classifiers exchange model updates with Mix servers]
• Model updates: Async add
• Piggyback if…
• AVG/Argmin KLD accumulator
• hash(feature) % N
• Non-blocking Channel (single shared TCP connection w/ TCP keepalive)

Computation/training is not being blocked

Taking advantage of asynchronous non-blocking I/O is the core idea behind Hivemall's MIX protocol
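Two pieces of the figure can be sketched in Python: the `hash(feature) % N` rule that picks a server for each feature, and an averaging accumulator on the server side. The CRC32 hash is an assumption chosen so the result is stable across runs; the slides do not say which hash function the MIX protocol uses, and the class and function names are hypothetical.

```python
import zlib


def mix_server_for(feature, n_servers):
    """Pick the server responsible for a feature: hash(feature) % N."""
    # Assumption: CRC32 stands in for whatever hash the real protocol uses.
    return zlib.crc32(feature.encode("utf-8")) % n_servers


class AverageAccumulator:
    """Keeps a running average of the weights reported for each feature."""

    def __init__(self):
        self.total = {}
        self.count = {}

    def add(self, feature, weight):
        self.total[feature] = self.total.get(feature, 0.0) + weight
        self.count[feature] = self.count.get(feature, 0) + 1
        return self.total[feature] / self.count[feature]


acc = AverageAccumulator()
print(mix_server_for("price", 3))
print(acc.add("price", 1.0), acc.add("price", 3.0))
```

Because every worker applies the same rule, all updates for a given feature reach the same server, which is what lets that server average them.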

Page 39: 2nd Hivemall meetup 20151020

How to use MixServer

create table kdd10a_pa1_model1 as
select
  feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_pa1(addBias(features), label, "-mix host01,host02,host03") as (feature, weight)
  from
    kdd10a_train_x3
) t
group by feature;

Page 40: 2nd Hivemall meetup 20151020

Conclusion and Takeaway

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs

The latest version of Hivemall is available on Treasure Data and used by several companies, including OISIX, Livesense, Scaleout, and Freakout.

New features in v0.4
• Random Forest
• Factorization Machine

More will follow in v0.4.1

Next Actions
• Propose Hivemall to Apache Incubator
• New Hivemall Logo

Page 41: 2nd Hivemall meetup 20151020

Beyond Query-as-a-Service!

We Open-source! We invented..

We are hiring machine learning engineers!

Page 42: 2nd Hivemall meetup 20151020

Additional slides

Page 43: 2nd Hivemall meetup 20151020

Recommendation

Rating prediction of a Matrix

Can be applied for user/item recommendation

Page 44: 2nd Hivemall meetup 20151020

Matrix Factorization

Factorize a matrix into a product of matrices having k latent factors

Page 45: 2nd Hivemall meetup 20151020

Criteria of Biased MF

[Equation not captured in the transcript; its annotated terms are:]
• Mean Rating
• Bias for each user/item
• Matrix Factorization (Factorization)
• Regularization
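The annotated terms match the usual biased matrix factorization model, in which a rating is predicted as mean rating + user bias + item bias + dot product of the latent factors, and training minimizes the squared error plus a regularization penalty. The Python sketch below writes that out for a single rating. It is the textbook formulation, assumed here because the slide's own equation is missing, and all names and values are hypothetical.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """mu: mean rating, b_u/b_i: user/item bias, p_u/q_i: latent factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def regularized_squared_error(rating, mu, b_u, b_i, p_u, q_i, lam):
    """Per-rating training criterion: squared error plus an L2 penalty."""
    error = rating - predict_rating(mu, b_u, b_i, p_u, q_i)
    penalty = b_u ** 2 + b_i ** 2
    penalty += sum(p * p for p in p_u) + sum(q * q for q in q_i)
    return error ** 2 + lam * penalty


r_hat = predict_rating(3.5, 0.2, -0.1, [1.0, 0.5], [0.4, 0.2])
print(r_hat)
```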

Page 46: 2nd Hivemall meetup 20151020

Training of Matrix Factorization

Support iterative training using local disk cache

Page 47: 2nd Hivemall meetup 20151020

Prediction of Matrix Factorization

Page 48: 2nd Hivemall meetup 20151020

Comparison to Spark MLlib

➢ Algorithm is different
  Spark: ALS-WR (considers regularization)
  Hivemall: Biased-MF (considers regularization and biases)

➢ Usability
  Spark: 100+ lines of Scala coding
  Hivemall: SQL (would be easier to use)

➢ Prediction Accuracy
  Almost the same for the MovieLens 10M dataset

Page 49: 2nd Hivemall meetup 20151020

Unsupervised Learning: Anomaly Detection

Sensor data etc.

rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]

Anomaly detection runs on a series of SQL queries

Page 50: 2nd Hivemall meetup 20151020

Anomalies in sensor data

Source: https://codeiq.jp/q/207

Page 51: 2nd Hivemall meetup 20151020

Local Outlier Factor (LOF)

Basic idea of LOF: comparing the local density of a point with the densities of its neighbors

Image source: https://en.wikipedia.org/wiki/Local_outlier_factor
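The density comparison can be made concrete with a small in-memory Python sketch (3.8 or later for `math.dist`). It follows the standard LOF definition (k-distance, reachability distance, local reachability density) on a handful of made-up 2-D points. It is not the SQL-based procedure used in Hivemall, and it assumes no duplicate points.

```python
import math


def knn(points, i, k):
    """The k nearest neighbors of point i as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return dists[:k]


def local_outlier_factor(points, k):
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    def lrd(i):
        # Reachability distance from neighbor j: max(k_distance(j), d(i, j)).
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF: average neighbor density divided by the point's own density.
    return [
        sum(lrds[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrds[i])
        for i in range(n)
    ]


# Four points on a unit square plus one far-away point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = local_outlier_factor(points, k=2)
print(scores)
```

A score near 1 means a point is about as dense as its neighbors; a score well above 1 marks a point in a much sparser region, which is the anomaly signal.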

Page 52: 2nd Hivemall meetup 20151020

DEMO: Local Outlier Factor

rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]