2nd Hivemall meetup 20151020

Post on 07-Jan-2017


Transcript of 2nd Hivemall meetup 20151020

Introduction to Hivemall and its new features in v0.4

Research Engineer Makoto YUI @myui

2015/10/20 Hivemall meetup #2

Tweet w/ #hivemallmtup

http://eventdots.jp/event/571107

Who am I?

Ø 2015.04: Joined Treasure Data, Inc. as the 1st Research Engineer in Treasure Data. My mission at TD is developing ML-as-a-Service.

Ø 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale Machine Learning project and Parallel Databases.

Ø 2009.03: Ph.D. in Computer Science from NAIST

Ø Super programmer award from the MITOU Foundation

Agenda

1. What is Hivemall

2. How to use Hivemall

3. New Features in Hivemall v0.4
   1. Random Forest
   2. Factorization Machine

4. Development Roadmap of Hivemall

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

https://github.com/myui/hivemall

What is Hivemall

[Figure: software stack, top to bottom]
Machine Learning: Hivemall
Query Processing: Hive / Pig
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), MRv2
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS

Hivemall’s Vision: ML on SQL

Classification with Mahout

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop

List of Features in Hivemall v0.3.2

Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA

kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
✓ Matrix Factorization

Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion

Anomaly Detection
✓ Local Outlier Factor

Treasure Data supports Hivemall v0.3.2-3

Industry use cases of Hivemall

Ø CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc. and more

Ø Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.

Ø Churn Detection
• Algorithm: Regression
• OISIX and more

Ø Item/User recommendation
• Algorithm: Recommendation (Matrix Factorization / kNN)
• Adtech Companies, ISP portal, and more

Ø Value prediction of Real estates
• Algorithm: Regression
• Livesense

How to use Hivemall

[Figure: machine learning workflow. Training takes feature vectors and labels and produces a prediction model; prediction applies the model to feature vectors to produce labels. Highlighted step: data preparation.]

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall

[Figure: machine learning workflow (training and prediction). Highlighted step: feature engineering.]

How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization

create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;

Transforming a label value to a value between 0.0 and 1.0
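As a rough illustration of what min-max normalization computes, here is a minimal Python sketch. It is not Hivemall's code: the function mirrors the arithmetic the slide describes (mapping a value into the range 0.0 to 1.0 given a known minimum and maximum), and the label values and the handling of a zero-width range are assumptions made for the example.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalization: map [min_value, max_value] linearly onto [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range; returning the midpoint is an arbitrary choice of this sketch.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Made-up label values standing in for the label column.
labels = [-7.9, -3.2, -0.5]
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
```

In the query above, `${min_label}` and `${max_label}` play the role of `min_label` and `max_label` here; they would be computed beforehand, for example with a separate `select min(label), max(label)` query.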

How to use Hivemall

[Figure: machine learning workflow (training and prediction). Highlighted step: training.]

How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature

• The inner SELECT is a map-only task to learn a prediction model
• GROUP BY shuffles map-outputs to reducers by feature
• Reducers perform model averaging in parallel
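The three steps above (independent learners, shuffle by feature, average per feature) can be sketched in plain Python. This is only an illustration of the dataflow under simplifying assumptions: `train_shard` is a hypothetical stand-in for the training function (a plain SGD logistic regression, not Hivemall's implementation), and the two tiny shards are made-up data.

```python
import math
from collections import defaultdict


def train_shard(rows, eta=0.1, epochs=20):
    """One 'mapper': SGD logistic regression over its own shard only.

    rows: list of (features, label), features as {name: value}, label in {0, 1}.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return list(w.items())  # emitted as (feature, weight) pairs


def average_by_feature(pairs):
    """The 'reducers': GROUP BY feature, then avg(weight)."""
    total, count = defaultdict(float), defaultdict(int)
    for feature, weight in pairs:
        total[feature] += weight
        count[feature] += 1
    return {f: total[f] / count[f] for f in total}


# Two shards of made-up training data, one per mapper.
shards = [
    [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -1.5}, 0)],
    [({"bias": 1.0, "x": 1.0}, 1), ({"bias": 1.0, "x": -2.0}, 0)],
]
emitted = [pair for shard in shards for pair in train_shard(shard)]
model = average_by_feature(emitted)
```

Each mapper never sees the other shard, which is why the training stage needs no coordination; the only communication is the shuffle of `(feature, weight)` pairs.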

How to use Hivemall - Training

Training of a Confidence Weighted classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features, label) as (feature, weight)
  FROM
    news20b_train
) t
GROUP BY feature

Vote to use negative or positive weights for avg, e.g., +0.7, +0.3, +0.2, -0.1, +0.7
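One way to read the voting rule is: count positive and negative weights, then average only the weights on the majority side. The Python sketch below follows that reading; it is an interpretation of the slide, not Hivemall's source, and the tie and all-zero handling are arbitrary choices made for the example.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if len(positives) > len(negatives):
        return sum(positives) / len(positives)
    if negatives:
        # Ties fall through to the negative side in this sketch.
        return sum(negatives) / len(negatives)
    return 0.0


# The weights from the slide: four positive votes against one negative vote.
result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

With these inputs the single negative weight is outvoted, so the result is the mean of the four positive weights rather than the mean of all five.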

Ensemble learning for stable prediction performance

Just stack prediction models by union all

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
) t
group by label, feature;

How to use Hivemall

[Figure: machine learning workflow (training and prediction). Highlighted step: prediction.]

How to use Hivemall - Prediction

CREATE TABLE lr_predict AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory
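The join-then-aggregate shape of the prediction query can be mimicked in a few lines of Python. This is a sketch with made-up data: the dictionary lookup stands in for the join (so, unlike the SQL version, it does keep the model in memory), and the feature names and weights are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# testing_exploded: one (rowid, feature) pair per row, as in the query.
testing_exploded = [(1, "a"), (1, "b"), (2, "b"), (2, "unseen")]
# lr_model: feature -> weight (made-up values).
lr_model = {"a": 1.2, "b": -0.4}


def predict(rows, model):
    total = defaultdict(float)
    for rowid, feature in rows:
        # LEFT OUTER JOIN: a feature absent from the model yields NULL,
        # which sum() ignores; adding 0.0 has the same effect here.
        total[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(s) for rowid, s in total.items()}


probs = predict(testing_exploded, lr_model)
```

Row 2 still gets a prediction even though one of its features never appeared in training, which is the reason for using an outer join rather than an inner join.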

How to use Hivemall

[Figure: machine learning workflow with batch training on Hadoop and online prediction on an RDBMS. Highlighted step: export prediction models.]

Online Prediction on MySQL (RDBMS)

Quick (msec) response on an RDBMS by adding an index to the feature column

bit.ly/hivemall-mysql


Features to be supported in Hivemall v0.4

1. Random Forest
• classification, regression
• Based on Smile (github.com/haifengl/smile)

2. Factorization Machine
• classification, regression (factorization)

Planned to release v0.4 in Oct.

Factorization Machines are often used by data science competition winners (Criteo/Avazu CTR prediction)

Random Forest in Hivemall v0.4

Ensemble of Decision Trees

Bagging

Already available on a development (smile) branch, and its usage is explained in the project wiki

Training of Random Forest

Out-of-bag tests and Variable Importance

Prediction of Random Forest

Random Forest

DEMO

http://bit.ly/hivemall-rf

Factorization Machine

Matrix Factorization

Factorization Machine

Context information (e.g., time) can be considered

Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

Factorization Machine

Factorization Model with degree=2 (2-way interaction)

[Figure: model equation, annotated with its terms: global bias; regression coefficient of the j-th variable; pairwise interaction; factorization.]
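For reference, the degree-2 model from the cited Rendle paper combines a global bias `w0`, a per-variable coefficient `w[j]`, and pairwise interactions whose strength is the dot product of two k-dimensional factor vectors `V[j]` and `V[j']`. The Python sketch below evaluates that model in two ways: directly over all pairs, and with the linear-time rewriting given in the paper. The variable names and the sample numbers are assumptions for illustration, not Hivemall code.

```python
def fm_predict_naive(x, w0, w, V):
    """y = w0 + sum_j w[j]*x[j] + sum_{j<j'} <V[j], V[j']> * x[j] * x[j']"""
    n = len(x)
    y = w0 + sum(w[j] * x[j] for j in range(n))
    for j in range(n):
        for jp in range(j + 1, n):
            dot = sum(a * b for a, b in zip(V[j], V[jp]))
            y += dot * x[j] * x[jp]
    return y


def fm_predict(x, w0, w, V):
    """Same model in O(k*n): the pairwise term equals
    0.5 * sum_f ((sum_j V[j][f]*x[j])**2 - sum_j (V[j][f]*x[j])**2)."""
    n, k = len(x), len(V[0])
    linear = sum(w[j] * x[j] for j in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[j][f] * x[j] for j in range(n))
        s_sq = sum((V[j][f] * x[j]) ** 2 for j in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Made-up parameters: 3 variables, k = 2 latent factors.
x = [1.0, 0.0, 2.0]
w0 = 0.1
w = [0.2, -0.3, 0.5]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
y = fm_predict(x, w0, w, V)
```

Because the interaction weight between two variables is derived from their factor vectors rather than stored separately, the model can score pairs that never co-occurred in training.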

Factorization Machine

Factorization Machine ≈ Polynomial Regression + Factorization

For a feature [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

bit.ly/hivemall-poly
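The expansion of [a, b] into degree-2 polynomial features can be reproduced with a short Python function. This is a generic sketch of polynomial expansion, not the Hivemall function behind the link; the function name is hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(x, degree=2):
    """All monomials of the inputs up to the given degree, lowest degree first."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(x, d):
            # The empty combination (d = 0) has product 1, the constant term.
            features.append(prod(combo))
    return features


# With a = 2 and b = 3 the order is [1, a, b, a^2, ab, b^2].
expanded = polynomial_features([2, 3])
```

Polynomial regression learns one weight per expanded feature, so the number of weights grows quadratically with the input size; a Factorization Machine replaces the weight of each cross term such as `ab` with a dot product of factor vectors.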

Factorization Machine

DEMO


Features to be supported in Hivemall v0.4.1

1. Gradient Tree Boosting
• classifier, regression

2. Field-aware Factorization Machine
• classification, regression (factorization)
• The existing implementation, i.e., LibFFM, can only be applied to classification

Planned to release v0.4.1 in Nov/Dec.

Gradient Tree Boosting (or Gradient Boosting Trees)

RF ≈ Bagging + Decision Trees
Parallel execution of decision trees

GBT ≈ Boosting + Decision Trees
Sequential execution of decision trees
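The difference between the two ensembles can be seen in a toy Python sketch that uses one-split decision stumps on 1-D data as the trees. Everything here is an assumption made for illustration (the stump learner, the learning rate, the data); it is not how Hivemall or Smile build trees.

```python
import random


def fit_stump(xs, ys):
    """Least-squares decision stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right) if right else lm
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]


def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def bagging(xs, ys, n_trees, rng):
    """Independent trees, each fit on its own bootstrap sample (parallelizable)."""
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(stump_predict(s, x) for s in trees) / len(trees)


def boosting(xs, ys, n_trees, lr=0.5):
    """Sequential trees, each fit to the residuals left by the previous ones."""
    trees, pred = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        pred = [p + lr * stump_predict(stump, x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * stump_predict(s, x) for s in trees)


# Made-up step-function data.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
rf = bagging(xs, ys, n_trees=25, rng=random.Random(0))
gbt = boosting(xs, ys, n_trees=10)
```

The loop in `bagging` has no dependency between iterations, while each iteration of `boosting` needs the predictions of all earlier trees; that dependency is why boosted trees are trained sequentially.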

Features to be supported in Hivemall v0.4.2

1. Online LDA
• topic modeling, clustering

2. Mix server on Apache YARN
• Service for parameter sharing among workers
• working w/ @maropu

Planned to release v0.4.2 in Dec/Jan.

External service to share parameters among distributed training processes in the middle of training

What’s Mix Server?

[Figure: classifiers send model updates (async add) to Mix servers, each holding an AVG/Argmin KLD accumulator. Features are routed by hash(feature) % N over a non-blocking channel (single shared TCP connection w/ TCP keepalive). Piggyback if…]

Computation/training is not being blocked

Taking advantage of asynchronous non-blocking I/O is the core idea behind Hivemall’s MIX protocol
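The routing rule hash(feature) % N and the averaging accumulator can be sketched in Python as follows. This is a single-process toy that leaves out the asynchronous network I/O entirely; the class and function names are hypothetical, and `zlib.crc32` is this sketch's choice of a hash that is stable across processes (Python's built-in `hash()` for strings is randomized per process).

```python
import zlib
from collections import defaultdict


def server_for(feature: str, n_servers: int) -> int:
    """hash(feature) % N: every worker maps a given feature to the same server."""
    return zlib.crc32(feature.encode("utf-8")) % n_servers


class AvgAccumulator:
    """Running average of the weights reported for one feature."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, weight: float) -> float:
        self.total += weight
        self.count += 1
        return self.total / self.count  # mixed value handed back to the worker


# Three in-memory "servers", each mapping feature -> accumulator.
servers = [defaultdict(AvgAccumulator) for _ in range(3)]


def send_update(feature: str, weight: float) -> float:
    shard = servers[server_for(feature, len(servers))]
    return shard[feature].add(weight)


send_update("price", 0.4)          # update from worker A
mixed = send_update("price", 0.8)  # worker B gets back the average so far
```

Because the feature name alone decides the destination, workers need no lookup table, and all updates for one feature meet in a single accumulator.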

How to use Mix Server

create table kdd10a_pa1_model1 as
select
  feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_pa1(addBias(features), label, "-mix host01,host02,host03") as (feature, weight)
  from
    kdd10a_train_x3
) t
group by feature;

Conclusion and Takeaway

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs. The latest version of Hivemall is available on Treasure Data and used by several companies, including OISIX, Livesense, Scaleout, and Freakout.

New features in v0.4
• Random Forest
• Factorization Machine

More will follow in v0.4.1

Next Actions
• Propose Hivemall to Apache Incubator
• New Hivemall Logo

Beyond Query-as-a-Service!

We Open-source! We invented..

We are hiring machine learning engineers!

Additional slides

Recommendation

Rating prediction of a Matrix

Can be applied for user/item Recommendation

Matrix Factorization

Factorize a matrix into a product of matrices having k latent factors

Matrix Factorization

Criteria of Biased MF

[Figure: objective function, annotated with its terms: mean rating; bias for each user/item; factorization; regularization.]
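The four annotated terms correspond to the usual biased matrix factorization model, in which a rating is predicted as the mean rating plus a user bias, an item bias, and the dot product of a user and an item factor vector, with L2 regularization on the learned parameters. The Python sketch below shows that prediction and one stochastic gradient step; the learning rate, regularization strength, and starting values are made up, and this is not Hivemall's training code.

```python
def predict(mu, bu, bi, pu, qi):
    """r_hat = mu + b_u + b_i + p_u . q_i"""
    return mu + bu + bi + sum(p * q for p, q in zip(pu, qi))


def sgd_step(r, mu, bu, bi, pu, qi, eta=0.01, lam=0.05):
    """One SGD update for a single observed rating r.

    Minimizes (r - r_hat)^2 + lam * (b_u^2 + b_i^2 + |p_u|^2 + |q_i|^2).
    Returns the updated (b_u, b_i, p_u, q_i); mu is the fixed mean rating.
    """
    err = r - predict(mu, bu, bi, pu, qi)
    new_bu = bu + eta * (err - lam * bu)
    new_bi = bi + eta * (err - lam * bi)
    new_pu = [p + eta * (err * q - lam * p) for p, q in zip(pu, qi)]
    new_qi = [q + eta * (err * p - lam * q) for p, q in zip(pu, qi)]
    return new_bu, new_bi, new_pu, new_qi


# Made-up example: one user, one item, k = 2 latent factors.
mu, r = 3.5, 5.0
bu, bi, pu, qi = 0.0, 0.0, [0.1, 0.1], [0.1, 0.1]
initial_error = abs(r - predict(mu, bu, bi, pu, qi))
for _ in range(200):
    bu, bi, pu, qi = sgd_step(r, mu, bu, bi, pu, qi)
final_error = abs(r - predict(mu, bu, bi, pu, qi))
```

The regularization term keeps the error from reaching exactly zero on this single rating, which is the intended trade-off against overfitting.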

Training of Matrix Factorization

Supports iterative training using a local disk cache

Prediction of Matrix Factorization

Comparison to Spark MLlib

Ø Algorithm is different
Spark: ALS-WR (considers regularization)
Hivemall: Biased-MF (considers regularization and biases)

Ø Usability
Spark: 100+ lines of Scala coding
Hivemall: SQL (would be easier to use)

Ø Prediction Accuracy
Almost the same for MovieLens 10M datasets

Unsupervised Learning: Anomaly Detection

Sensor data etc.

rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]

Anomaly detection runs on a series of SQL queries

Anomalies in Sensor Data

Source:https://codeiq.jp/q/207

Local Outlier Factor (LOF)

Basic idea of LOF: comparing the local density of a point with the densities of its neighbors

Image Source: https://en.wikipedia.org/wiki/Local_outlier_factor
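To make the density comparison concrete, here is a compact in-memory Python sketch of LOF. It simplifies the textbook definition by taking exactly k neighbours per point (ties broken by index) and assumes no duplicate points; the sample points are made up, and this is not the SQL formulation used by Hivemall.

```python
import math


def lof_scores(points, k=2):
    """Local Outlier Factor of every point, using exactly k nearest neighbours."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point, excluding the point itself.
    knn = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]

    def local_reachability_density(i):
        # Reachability distance to neighbour j is max(k_dist[j], dist[i][j]).
        reach = [max(k_dist[j], dist[i][j]) for j in knn[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF: average neighbour density divided by the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]


# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

Points whose density matches that of their neighbours score close to 1, while the isolated point at (5, 5) scores well above 1, which is the signal used to flag anomalies.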

DEMO: Local Outlier Factor
