Introduction to New features and Use cases of Hivemall

Introduction to New features and Use cases of Hivemall
Research Engineer Makoto YUI @myui
2016/03/30 Treasure Data Techtalk
http://eventdots.jp/event/583226

Transcript of Introduction to New features and Use cases of Hivemall


Who am I?

• 2015.04: Joined Treasure Data, Inc. as the 1st Research Engineer in Treasure Data
• 2010.04-2015.03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan
• 2009.03: Ph.D. in Computer Science from NAIST
• Head of the TD mountaineering club
  • 3 members (one of whom is a ghost member)


Announcement

We finally replaced the logo of Hivemall :)

Story of the Hivemall Logo

Logos of Hadoop-related Products

[Figure: logos of Hadoop, Hive, and Hivemall]

We Open Source

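Treasure Data supports ML-as-a-Service

[Figure: Treasure Data platform overview. Labels: data sources (SQL Server, CRM, RDBMS, ERP, app logs, web logs, sensors); Treasure Data Collectors (Treasure Agent, Embulk, mobile SDK, JS SDK, embedded); batch analytics; ad hoc analytics; Machine Learning; API; ODBC/JDBC; PUSH; analytics tool integration; integration with other products; data visualization and sharing]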

Agenda

1. Introduction to Hivemall
2. Industrial use cases
3. How to use Hivemall
4. Development roadmap

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

https://github.com/myui/hivemall

[Figure: Hivemall in the Hadoop stack, from top to bottom]

• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing), MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS

Hivemall’s Vision: ML on SQL

[Figure: classification with Mahout (code listing)]

    CREATE TABLE lr_model AS
    SELECT
      feature,               -- reducers perform model averaging in parallel
      avg(weight) AS weight
    FROM (
      SELECT logress(features, label, ..) AS (feature, weight)
      FROM train
    ) t                      -- map-only task
    GROUP BY feature;        -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop
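The query above trains logistic regression independently in each map task and lets the reducers average the weights per feature. The following is a minimal Python sketch of that train-then-average pattern. The helper names (`train_partition`, `average_models`), the learning rate, the epoch count, and the toy data are all illustrative assumptions; this is not Hivemall's implementation of `logress`.

```python
import math
from collections import defaultdict


def train_partition(rows, eta=0.1, epochs=20):
    """Train logistic regression with SGD on one partition (one "mapper").

    rows: list of (features, label); features is a dict {name: value}
    and label is 0 or 1. Returns a dict {feature: weight}.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            for f, v in features.items():
                w[f] += eta * (label - p) * v  # gradient step on the log loss
    return dict(w)


def average_models(models):
    """Average the weights per feature, like avg(weight) ... GROUP BY feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model in models:
        for f, weight in model.items():
            sums[f] += weight
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Two toy partitions standing in for the inputs of two map tasks (assumed data).
partitions = [
    [({"x": 1.0, "bias": 1.0}, 1), ({"x": -1.0, "bias": 1.0}, 0)],
    [({"x": 2.0, "bias": 1.0}, 1), ({"x": -2.0, "bias": 1.0}, 0)],
]
model = average_models([train_partition(p) for p in partitions])
```

Each partition is trained without any communication, which is why the inner query can run as a map-only task; only the (feature, weight) pairs are shuffled.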

List of Features in Hivemall v0.3.x

Classification (both binary- and multi-class)
✓ Perceptron
✓ Passive Aggressive (PA)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad
✓ AdaDELTA

kNN and Recommendation
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity search using k-NN (Euclid / Cosine / Jaccard / Angular)
✓ Matrix Factorization

Feature engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion

Anomaly Detection
✓ Local Outlier Factor

Top-k query processing
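As an illustration of the feature engineering items listed above, here is a small Python sketch of min-max normalization, z-score standardization, and feature hashing. The function names, the bucket count, and the use of CRC32 as the hash are assumptions made for this example; they are not Hivemall's UDFs or its hash function.

```python
import zlib
from statistics import mean, pstdev


def min_max_scale(value, min_value, max_value):
    """Min-max normalization: map value into [0, 1]."""
    if max_value == min_value:
        return 0.0  # constant column: nothing to scale
    return (value - min_value) / (max_value - min_value)


def z_score(value, mu, sigma):
    """Standardize value to zero mean and unit variance."""
    if sigma == 0:
        return 0.0
    return (value - mu) / sigma


def hash_feature(name, num_buckets=2 ** 20):
    """Feature hashing: map a feature name to an index in [0, num_buckets)."""
    return zlib.crc32(name.encode("utf-8")) % num_buckets


ages = [20.0, 30.0, 40.0]  # assumed toy column
scaled = [min_max_scale(a, min(ages), max(ages)) for a in ages]
standardized = [z_score(a, mean(ages), pstdev(ages)) for a in ages]
index = hash_feature("country=JP")
```

Hashing trades a small chance of collisions for a fixed-size feature space, which avoids building a dictionary of feature names before training.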

Features supported in Hivemall v0.4.0

1. RandomForest
  • classification, regression
2. Factorization Machine
  • classification, regression (factorization)

Features supported in Hivemall v0.4.1-alpha

1. NLP Tokenizer (morphological analysis)
  • Kuromoji
2. Mini-batch Gradient Descent
3. RandomForest scalability improvements

Treasure Data is operating Hivemall v0.4.1-alpha.6

The above features are already supported

2. Industrial use cases

Industry use cases of Hivemall

• CTR prediction of ad click logs
  • Freakout Inc. and more
  • Replaced Spark MLlib w/ Hivemall at company X

http://www.slideshare.net/masakazusano75/sano-hmm-20150512

Industry use cases of Hivemall

• Gender prediction of ad click logs
  • Scaleout Inc.

http://eventdots.jp/eventreport/458208

Industry use cases of Hivemall

• Value prediction of real estate
  • Livesense

http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall

Source: http://itnp.net/article/2016/02/18/2286.html

Industry use cases of Hivemall

• Churn detection
  • OISIX

http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix

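Churn prediction for a membership service

• Subscription purchases by 100,000 members drive the company's overall revenue and profit, but there was no way to identify members at risk of churning in advance and prevent them from leaving

• Machine learning without statistical expertise
• Granting points to members on the predicted-churn list halved the churn rate
• Surfaced the campaigns and events that carry churn risk, and made it possible to identify behaviors characteristic of members who do not churn
• Also enabled indirect service improvements, such as changing the UI according to the level of risk

• Machine learning is used to build a list of customers likely to churn in the next month, based on data from the past month
• Specifically, each step (create the training table -> normalize -> build the model -> logistic regression) is implemented simply as queries using TD + Hivemall

Data used for prediction: Web, Mobile; attribute information, behavior logs, complaint information, referral source, service usage information

Direct measures: granting points, care calls

Indirect measures: guidance toward a successful experience, UI changes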

Industry use cases of Hivemall

• Recommendation
  • Portal site

3. How to use Hivemall

RandomForest in Hivemall v0.4

Ensemble of Decision Trees

Training of RandomForest

Prediction of RandomForest

Out-of-bag tests and Variable Importance
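For readers unfamiliar with out-of-bag testing, the sketch below shows the general idea in Python: each tree is trained on a bootstrap sample (its "bag"), the rows that were never drawn serve as a held-out test set for that tree, and per-tree predictions are combined by majority vote. All names and the toy sizes are assumptions for illustration; the tree training itself and variable importance are omitted, and this is not Hivemall's RandomForest code.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Sample n row indices with replacement (the bag for one tree)."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, bag):
    """Rows never drawn into the bag; usable as a test set for that tree."""
    drawn = set(bag)
    return [i for i in range(n) if i not in drawn]


def majority_vote(labels):
    """Combine per-tree class predictions into the ensemble prediction."""
    return Counter(labels).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed so the example is repeatable
n_rows = 10
bag = bootstrap_indices(n_rows, rng)
oob = out_of_bag(n_rows, bag)
```

Because the out-of-bag rows were not seen by the tree, their error rate gives an estimate of generalization error without a separate validation split.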

Recommendation

Rating prediction of a matrix

Can be applied to user/item recommendation

Matrix Factorization

Factorize a matrix into a product of matrices having k latent factors

Training of Matrix Factorization

Supports iterative training using a local disk cache

Prediction of Matrix Factorization
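To make the idea of factorizing a rating matrix into k latent factors concrete, here is a minimal Python sketch that learns user and item factor vectors with SGD and predicts a rating as their dot product. The function names, hyperparameters, and toy ratings are assumptions for illustration; the sketch leaves out bias terms and is not Hivemall's training or prediction function.

```python
import random


def train_mf(ratings, k=2, eta=0.02, lam=0.01, epochs=1000, seed=0):
    """Learn k-dimensional user factors P and item factors Q with SGD.

    ratings: list of (user, item, rating) triples.
    """
    rng = random.Random(seed)
    P, Q = {}, {}
    for user, item, _ in ratings:
        P.setdefault(user, [rng.uniform(-0.1, 0.1) for _ in range(k)])
        Q.setdefault(item, [rng.uniform(-0.1, 0.1) for _ in range(k)])
    for _ in range(epochs):
        for user, item, rating in ratings:
            p, q = P[user], Q[item]
            err = rating - sum(pf * qf for pf, qf in zip(p, q))
            for f in range(k):
                pf, qf = p[f], q[f]  # keep old values for a simultaneous update
                p[f] += eta * (err * qf - lam * pf)
                q[f] += eta * (err * pf - lam * qf)
    return P, Q


def predict(P, Q, user, item):
    """Predicted rating: dot product of the user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[user], Q[item]))


# Assumed toy ratings: two users, two items.
ratings = [("u1", "i1", 5.0), ("u1", "i2", 1.0),
           ("u2", "i1", 4.0), ("u2", "i2", 1.0)]
P, Q = train_mf(ratings)
```

Once `P` and `Q` are learned, a rating for any (user, item) pair seen during training can be estimated with `predict`, including pairs whose rating was missing from the matrix.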

Factorization Machines

Matrix Factorization

Context information (e.g., time) can be considered

Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

Training data for Factorization Machines

Each feature takes a LibSVM-like format <feature[:weight]>

Training of Factorization Machines

Prediction of Factorization Machines
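The sketch below illustrates, in Python, how features in the `<feature[:weight]>` format could be parsed and scored with the second-order factorization machine model from the Rendle paper cited above. Treating a missing weight as 1.0 is an assumption of this example, as are the function names and the toy parameters; this is not Hivemall's FM code.

```python
def parse_feature(token):
    """Parse '<feature[:weight]>'; the weight is assumed to default to 1.0."""
    name, sep, weight = token.rpartition(":")
    if not sep:
        return token, 1.0
    return name, float(weight)


def fm_predict(x, w0, w, V):
    """Second-order factorization machine score.

    x: {feature: value}, w0: global bias, w: {feature: linear weight},
    V: {feature: list of k latent factors}.
    Uses the identity
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2].
    """
    linear = sum(w.get(f, 0.0) * v for f, v in x.items())
    k = len(next(iter(V.values()))) if V else 0
    pairwise = 0.0
    for j in range(k):
        s = sum(V[f][j] * v for f, v in x.items() if f in V)
        s2 = sum((V[f][j] * v) ** 2 for f, v in x.items() if f in V)
        pairwise += 0.5 * (s * s - s2)
    return w0 + linear + pairwise


# Assumed toy row: a user, an item, and a context feature with a weight.
x = dict(parse_feature(t) for t in ["u1", "i3", "hour:0.5"])
w0 = 0.1
w = {"u1": 0.2, "i3": 0.3, "hour": 0.4}
V = {"u1": [1.0, 0.0], "i3": [0.5, 1.0], "hour": [2.0, 2.0]}
score = fm_predict(x, w0, w, V)
```

The context feature `hour` interacts with both the user and the item through its latent factors, which is how extra context enters the prediction.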

Feature Engineering functions

4. Development roadmap

Features to be supported in Hivemall v0.4.1

1. NLP Tokenizer (morphological analysis)
  • Kuromoji integration was requested by Company R
2. Mini-batch Gradient Descent
3. RandomForest scalability improvements
4. Recommendation for implicit feedback datasets
  • Useful where only positive feedback is available
  • BPR: Bayesian Personalized Ranking from Implicit Feedback, Proc. UAI, 2009.

Planned to release v0.4.1 in April.

Features to be supported in Hivemall v0.4.2

1. Gradient Tree Boosting
  • classifier, regression
  • based on Smile https://github.com/haifengl/smile/
2. Field-aware Factorization Machine
  • classification, regression (factorization)

Planned to release v0.4.2 in June

Features to be supported in Hivemall v0.5

1. Mix server on Apache YARN
  • Service for parameter sharing among workers
2. Online LDA
  • topic modeling, clustering
3. XGBoost integration
4. Generalized Linear Model
  • Ridge / Elastic net / Lasso regularization
  • Supports various loss functions
5. Alternating Direction Method of Multipliers (ADMM) convex optimization
6. t-SNE dimension reduction

[Figure (Mix server): data-parallel training. Labels: learner 1, learner 2, ..., learner N; partitioned training examples; parameter exchange; learned model]

Analytics Workflow

Machine learning workflows can be simplified using our new workflow engine, named Digdag

    +main:
      +prepare:
        _parallel: true

        +train:
          td>: ./tasks/train_join.sql

        +test:
          td>: ./tasks/test_join.sql

      +quantify:
        td>: ./tasks/train_quantify.sql

      +model_test_quantify:
        _parallel: true

        +model:
          td>: ./tasks/make_model.sql

        +test_quantify:
          td>: ./tasks/test_quantify.sql

      +pred:
        td>: ./tasks/prediction.sql

CLI version will be released soon. Stay tuned!

Conclusion and Takeaway

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs

Hivemall’s Positioning
• For SQL users that need ML
• Ease of use and scalability in mind

Treasure Data provides ML-as-a-Service using Hivemall

Major development leaps in v0.4
• RandomForest
• Factorization Machine

More will follow in v0.4.1 and later

Blog article about Hivemall

http://blog-jp.treasuredata.com/

A series on solving Kaggle tasks using TD, Hivemall, Jupyter, and Pandas-TD

We support machine learning in the Cloud

Any feature requests or questions?