3rd Hivemall meetup

Recent progress and future roadmap of Hivemall. Research Engineer Makoto YUI @myui. #hivemallmtup, 2016/09/08, 3rd Hivemall meetup

Transcript of 3rd Hivemall meetup

Page 1: 3rd Hivemall meetup

Recent progress and future roadmap of Hivemall

Research Engineer Makoto YUI @myui

#hivemallmtup

2016/09/08 3rd Hivemall meetup

Page 2: 3rd Hivemall meetup

Agenda

1. Short Introduction to Hivemall
   ✓ Hivemall use-cases
2. Recent Updates
3. Roadmap of Hivemall
   ✓ coming new features

Page 3: 3rd Hivemall meetup

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

https://github.com/myui/hivemall

Thanks to everyone who contributed to the project!

Page 4: 3rd Hivemall meetup

What is Hivemall

[Figure: where Hivemall sits in the Hadoop ecosystem stack]
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, Mesos
Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3

Page 5: 3rd Hivemall meetup

Hivemall’s Vision: ML on SQL

Classification with Mahout

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop
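The query on this slide reads as a map step (train per partition) followed by a reduce step (average weights per feature). The following Python sketch mirrors that flow under stated assumptions: each mapper runs plain SGD logistic regression, and the names `train_partition` and `average_models` are hypothetical helpers for illustration, not Hivemall functions.

```python
import math
from collections import defaultdict

def train_partition(rows, eta=0.1, epochs=20):
    """Map side: fit logistic regression by SGD on one partition.

    rows is a list of (features, label) pairs, with features given as
    {name: value} and label in {0, 1}. Returns (feature, weight) pairs,
    analogous to the rows emitted by the inner SELECT.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            margin = sum(weights[f] * v for f, v in features.items())
            predicted = 1.0 / (1.0 + math.exp(-margin))
            for f, v in features.items():
                weights[f] += eta * (label - predicted) * v
    return list(weights.items())

def average_models(emitted):
    """Reduce side: group by feature and average the weights."""
    grouped = defaultdict(list)
    for feature, weight in emitted:
        grouped[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}

# Two toy partitions standing in for the splits of the `train` table.
partitions = [
    [({"bias": 1.0, "x": 2.0}, 1), ({"bias": 1.0, "x": -2.0}, 0)],
    [({"bias": 1.0, "x": 1.5}, 1), ({"bias": 1.0, "x": -1.0}, 0)],
]
emitted = [pair for part in partitions for pair in train_partition(part)]
model = average_models(emitted)
```

Each partition is trained independently, so the map side needs no coordination; only the small (feature, weight) pairs are shuffled to the reducers.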

Page 6: 3rd Hivemall meetup

Industry use cases of Hivemall

• CTR prediction of Ad click logs
  • Freakout Inc., Fan Communications, and more
  • Replaced Spark MLlib w/ Hivemall at company X

http://www.slideshare.net/masakazusano75/sano-hmm-20150512

Page 7: 3rd Hivemall meetup

Industry use cases of Hivemall

• Gender prediction of Ad click logs
  • Scaleout Inc. and Fan Communications

http://eventdots.jp/eventreport/458208

Page 8: 3rd Hivemall meetup

Industry use cases of Hivemall

• Value prediction of Real estates
  • Livesense

http://www.slideshare.net/y-ken/real-estate-tech-with-hivemall

Page 9: 3rd Hivemall meetup

Industry use cases of Hivemall

• Churn Detection
  • OISIX

http://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix

Page 10: 3rd Hivemall meetup

Agenda

1. Short Introduction to Hivemall
   ✓ Hivemall use-cases
2. Recent Updates
3. Roadmap of Hivemall
   ✓ coming new features

Page 11: 3rd Hivemall meetup

Recent Releases

v0.4.2-rc.2
• Released on 2016/06/28
• Minor hotfixes
• The latest release

Page 12: 3rd Hivemall meetup

Recent Releases

v0.4.2-rc.1
• Released on 2016/06/07
• Hivemall on Spark v1.6
  • Kudos to @maropu
• BPR-MF (Matrix Factorization for Implicit Feedback)

Page 13: 3rd Hivemall meetup

Hivemall on Apache Spark

Installation is very easy, as follows:

$ spark-shell --packages maropu:hivemall-spark:0.0.6

Page 14: 3rd Hivemall meetup

Feature Hashing

Frequently used technique to deal with high-dimensional data

[Figure: high-dimensional feature space mapped to a low-dimensional one]
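As a rough illustration of feature hashing, the Python sketch below maps arbitrary feature names into a fixed number of buckets. The slide does not name a hash function, so CRC32 from the standard library is used purely as a stand-in, and `hash_feature` and `hash_features` are hypothetical helper names.

```python
import zlib

def hash_feature(name, num_buckets=2 ** 10):
    """Map a feature name to a bucket index in [0, num_buckets)."""
    # CRC32 is an assumption made for this sketch, not Hivemall's choice.
    return zlib.crc32(name.encode("utf-8")) % num_buckets

def hash_features(features, num_buckets=2 ** 10):
    """Turn {name: value} into a sparse {index: value} vector.

    Features that collide are summed into the same bucket.
    """
    hashed = {}
    for name, value in features.items():
        index = hash_feature(name, num_buckets)
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed

row = {"user_id#12345": 1.0, "ad_id#987": 1.0, "price": 0.75}
vector = hash_features(row, num_buckets=16)
```

The model size is bounded by `num_buckets` regardless of how many distinct feature names appear; the cost is that colliding features share one weight.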

Page 15: 3rd Hivemall meetup

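Kernel trick

[Figure: Input Feature Space (low-dimensional) mapped to Mapped Feature Space (high-dimensional)]

Map the input into a higher-dimensional space. Drawing a hyperplane in the higher-dimensional space yields a nonlinear separation in the original low-dimensional space.

For two-dimensional features [a, b], the degree-2 polynomial features are [(1,) a, b, a^2, ab, b^2].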

Page 16: 3rd Hivemall meetup

Polynomial Expansion

Page 17: 3rd Hivemall meetup

Polynomial Expansion

b^b:1.0 and b^b^b:1.0 are omitted w/ the truncate option
a^a:0.25 and c^c:0.09 are omitted w/ the interaction-only option
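The two options on this slide can be sketched in Python as follows. The values a=0.5, b=1.0, c=0.3 are inferred from the products shown on the slide. The function `expand_polynomial` is a hypothetical helper, and its truncation rule (skip any product that contains a factor equal to 1.0) is an assumption: the slide only confirms that b^b and b^b^b are dropped.

```python
from itertools import combinations, combinations_with_replacement

def expand_polynomial(features, degree=2, interaction_only=False, truncate=True):
    """Expand {name: value} with product terms up to the given degree."""
    names = sorted(features)
    expanded = dict(features)
    # interaction_only forbids repeating a feature within a term (no a^a).
    pick = combinations if interaction_only else combinations_with_replacement
    for d in range(2, degree + 1):
        for combo in pick(names, d):
            values = [features[n] for n in combo]
            # Assumed rule: multiplying by 1.0 adds no information, so skip.
            if truncate and any(v == 1.0 for v in values):
                continue
            product = 1.0
            for v in values:
                product *= v
            expanded["^".join(combo)] = product
    return expanded

features = {"a": 0.5, "b": 1.0, "c": 0.3}
full = expand_polynomial(features, degree=3, truncate=False)
truncated = expand_polynomial(features, degree=3, truncate=True)
interactions = expand_polynomial(features, degree=2, interaction_only=True, truncate=False)
```

With these inputs, `full` contains b^b and b^b^b, `truncated` does not, and `interactions` keeps a^c but has neither a^a nor c^c.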

Page 18: 3rd Hivemall meetup

Feature Vector formatter Functions

Quantitative variables are formatted as "column_name:value" and categorical variables as "column_name#value". Features that are null or have a weight of 0.0 are not emitted.
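The formatting rule on this slide is small enough to sketch directly. The Python helpers below (`quantitative_feature`, `categorical_feature`, and `to_feature_vector` are hypothetical names for this illustration) produce the two string formats and skip null or zero-weight features.

```python
def quantitative_feature(column, value):
    """Format a numeric column as "column:value"; skip null and 0.0."""
    if value is None or float(value) == 0.0:
        return None
    return f"{column}:{float(value)}"

def categorical_feature(column, value):
    """Format a categorical column as "column#value"; skip null."""
    if value is None:
        return None
    return f"{column}#{value}"

def to_feature_vector(row, quantitative, categorical):
    """Build a feature vector from a row given as {column: value}."""
    features = [quantitative_feature(c, row.get(c)) for c in quantitative]
    features += [categorical_feature(c, row.get(c)) for c in categorical]
    return [f for f in features if f is not None]

row = {"age": 32, "height": 0.0, "gender": "F", "country": None}
vector = to_feature_vector(row, ["age", "height"], ["gender", "country"])
# vector == ["age:32.0", "gender#F"]
```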

Page 19: 3rd Hivemall meetup

Mini-batch Gradient Descent

Caution: Mini-batch generally requires more iterations than SGD
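To make the caution concrete, here is a minimal Python sketch on a toy one-parameter least-squares problem (y = 2x). The helper `gradient_descent` is hypothetical; with `batch_size=1` it behaves as SGD, and with a larger batch it averages the gradient before each update.

```python
def gradient_descent(xs, ys, batch_size, eta=0.1, epochs=1):
    """Fit y = w * x by (mini-batch) gradient descent on squared error.

    Returns the learned weight and the number of updates performed.
    """
    w = 0.0
    updates = 0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Average gradient of 0.5 * (w * x - y)^2 over the batch.
            grad = sum((w * x - y) * x for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= eta * grad
            updates += 1
    return w, updates

xs = [0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x for x in xs]

w_sgd, n_sgd = gradient_descent(xs, ys, batch_size=1)   # 4 updates per pass
w_mb, n_mb = gradient_descent(xs, ys, batch_size=4)     # 1 update per pass
w_long, _ = gradient_descent(xs, ys, batch_size=4, epochs=100)
```

After one pass over the data, SGD has made four updates and is closer to w = 2 than the mini-batch run, which has made only one; the mini-batch run still converges when given more passes.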

Page 20: 3rd Hivemall meetup

Japanese Tokenizer using Kuromoji

This feature was requested by a Treasure Data customer

Thanks for providing a reference implementation to us (company R)

Page 21: 3rd Hivemall meetup

Agenda

1. Short Introduction to Hivemall
   ✓ Hivemall use-cases
2. Recent Updates
3. Roadmap of Hivemall
   ✓ coming new features

Page 22: 3rd Hivemall meetup

Important Announcement

Hivemall will become Apache Hivemall (?) Now on voting though..

Page 23: 3rd Hivemall meetup

Apache Incubation status

Page 24: 3rd Hivemall meetup

Initial committers

• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  • Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  • Hivemall on Apache Pig
  • Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  • Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Page 25: 3rd Hivemall meetup

Project mentors

Nominated Mentors
• Reynold Xin <Databricks, ASF member> Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member> Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member> Apache Spark PMC member

Champion
• Roman Shaposhnik <Pivotal, ASF member> Apache Bigtop/Incubator PMC member

Page 26: 3rd Hivemall meetup

Roadmap

• Possibly enter Apache Incubator in Sept, 2016
• Sept to Nov
  • IP clearance and project/repository site setup
  • Contribution guideline
  • Create a "who uses Hivemall" list
  • More documentation!
• Initial Apache Release in Dec (or late Nov?)
  • v0.5
• Non-Apache release of v0.5-beta.xx will be released on GitHub in Oct

Page 27: 3rd Hivemall meetup

Coming New Features - already merged in Master

✓ Hivemall on Spark 2.0 w/ DataFrame support
  • Kudos to @maropu
✓ ChangeFinder
  • Change Point and Anomaly Detection
  • Kudos to @L3sota @takuti
  • PR#333
✓ XGBoost support
  • Kudos to @maropu

Page 28: 3rd Hivemall meetup

Coming New Features - already merged in Master

✓ ChangeFinder

cf_detect(array<double> x [, const string options])

J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
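The cited paper scores each point in two stages: an outlier score from a sequentially discounted model, then a change-point score from a second model fitted to the smoothed outlier scores. The Python sketch below follows that two-stage shape but is deliberately simplified: it replaces the paper's SDAR autoregressive model with a discounted Gaussian (no autoregressive terms), and the names `DiscountedGaussian` and `change_scores` are hypothetical. It is not the implementation behind `cf_detect`.

```python
import math
from collections import deque

class DiscountedGaussian:
    """Gaussian whose mean and variance forget old data at rate r."""

    def __init__(self, r):
        self.r = r
        self.mean = 0.0
        self.var = 1.0

    def score_and_update(self, x):
        # Logarithmic loss of x under the model learned so far.
        var = max(self.var, 1e-6)
        loss = 0.5 * math.log(2 * math.pi * var) + (x - self.mean) ** 2 / (2 * var)
        # Sequentially discounted update of mean and variance.
        self.mean = (1 - self.r) * self.mean + self.r * x
        self.var = (1 - self.r) * self.var + self.r * (x - self.mean) ** 2
        return loss

def change_scores(series, r=0.05, window=5):
    """Return one change-point score per element of series."""
    stage1, stage2 = DiscountedGaussian(r), DiscountedGaussian(r)
    recent1, recent2 = deque(maxlen=window), deque(maxlen=window)
    scores = []
    for x in series:
        recent1.append(stage1.score_and_update(x))         # outlier score
        smoothed = sum(recent1) / len(recent1)              # first smoothing
        recent2.append(stage2.score_and_update(smoothed))   # second-stage loss
        scores.append(sum(recent2) / len(recent2))          # change-point score
    return scores

# Toy series: the level shifts from about 0 to about 5 at t = 50.
series = [(0.0 if t < 50 else 5.0) + (0.1 if t % 2 == 0 else -0.1) for t in range(100)]
scores = change_scores(series)
peak = max(range(len(scores)), key=scores.__getitem__)
```

On this toy series the highest change-point score falls within the smoothing window that starts at the level shift.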

Page 29: 3rd Hivemall meetup

Coming New Features - already merged in Master

✓ ChangeFinder

cf_detect(array<double> x [, const string options])

Page 30: 3rd Hivemall meetup

Coming New Features - already merged in Master

✓ Various Evaluation Metrics
  • Kudos to @takuti; R2 and logloss are also credited to Fan-cs, sakai-san
  • PR#326
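The two metrics named on this slide have standard textbook definitions, sketched below in Python for reference. The function names `r2_score` and `log_loss` are chosen for this illustration and are not taken from the slide; the clipping constant `eps` is likewise an assumption to keep the logarithm finite.

```python
import math

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood for binary labels in {0, 1}."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # exact predictions
baseline = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])   # always predict the mean
coin_flip = log_loss([1, 0], [0.5, 0.5])                # uninformative classifier
```

An R2 of 1 means exact predictions and 0 means no better than predicting the mean; a log loss of ln 2 corresponds to always predicting 0.5.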

Page 31: 3rd Hivemall meetup

Coming New Features - already merged in Master

✓ Feature Binning
  • Kudos to @amaya382 on PR#382
  • Maps quantitative variables to bins

Age (quantitative variable) is mapped into a meaningful bin (categorical variable) based on quantiles
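Quantile-based binning of the kind described here can be sketched in a few lines of Python. The helpers `quantile_boundaries` and `assign_bin` are hypothetical, and the boundary rule (take the sorted value at each k/num_bins position) is one simple choice among several quantile definitions, not necessarily the one Hivemall uses.

```python
from bisect import bisect_right

def quantile_boundaries(values, num_bins):
    """Pick num_bins - 1 cut points so bins hold roughly equal counts."""
    ordered = sorted(values)
    return [ordered[(k * len(ordered)) // num_bins] for k in range(1, num_bins)]

def assign_bin(value, boundaries):
    """Return the index of the bin that value falls into."""
    return bisect_right(boundaries, value)

ages = [18, 22, 25, 31, 38, 45, 52, 67]
boundaries = quantile_boundaries(ages, num_bins=4)
bins = [assign_bin(age, boundaries) for age in ages]
# boundaries == [25, 38, 52]
# bins == [0, 0, 1, 1, 2, 2, 3, 3]
```

The resulting bin index can then be treated as a categorical variable, for example formatted as "age#2".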

Page 32: 3rd Hivemall meetup

Other new features in progress

• v0.5-beta{1,2} release (Oct-Nov)
  ✓ System test framework
    • Kudos to @amaya382
  ✓ One-hot encoding
    • Kudos to @kai
  ✓ Field-aware Factorization Machines
  ✓ Kernelized Passive Aggressive
    • Kudos to @L3sota
  ✓ Generalized Linear Model
    • Optimizer framework including ADAM
    • L1/L2 regularization
    • Kudos to @maropu
  ✓ Disk-based iteration support
    • To avoid overly large amplify (data amplification)
  ✓ Gradient Tree Boosting
  ✓ Online LDA

Page 33: 3rd Hivemall meetup

We support machine learning in Cloud

Any feature request? Or, questions?

bit.ly/td-wants-you