Introduction to Hivemall

Hivemall: Scalable Machine Learning Library for Apache Hive
Research Engineer Makoto YUI, @myui <[email protected]>
bit.ly/hivemall


Page 4: Introduction to Hivemall

[Figure: Treasure Data Cloud Service overview. Labels on the slide: data sources (App log, Sensor, Apache log, CRM, ERP, RDBMS); Treasure Data Collectors (Treasure Agent, Embulk, Mobile SDK, JS SDK, Embedded); processing (Hive for batch, Presto for ad hoc queries, Machine Learning); access and output (API, ODBC/JDBC, PUSH, BI tools, Data analysis, External Integrations, SQL Server).]

900,000 records stored per sec.

Page 5: Introduction to Hivemall

Agenda

1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall

Page 6: Introduction to Hivemall

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

[Figure: where Hivemall sits in the Hadoop stack, by layer]
Machine Learning: Hivemall, MLlib
Query Processing: Hive, Pig, SparkSQL
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
Resource Management: Apache YARN, MESOS
Distributed File System: Hadoop HDFS

Page 7: Introduction to Hivemall

Won IDG’s InfoWorld 2014 Bossie Awards

Bossie Awards 2014: The best open source big data tools

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)

bit.ly/hivemall-award

Page 8: Introduction to Hivemall

List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression

Page 9: Introduction to Hivemall

List of supported Algorithms (same Classification and Regression lists as on the previous slide, with these notes)

SCW is a good first choice. Try RandomForest if SCW does not work.

Logistic regression is good for getting the probability of a positive class.

Factorization Machines are good where features are sparse and categorical.

Page 10: Introduction to Hivemall

List of Algorithms for Recommendation

K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items
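To make the top-k idea concrete, here is a minimal Python sketch of "keep the k best-scoring items per group". It only illustrates the concept; it is not Hivemall's implementation and does not reproduce the each_top_k signature. The function name top_k_per_group and the sample users and items are hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Return the k highest-scoring (score, item) pairs for each group.

    rows: iterable of (group, score, item) tuples, e.g. (user, score, item).
    """
    grouped = defaultdict(list)
    for group, score, item in rows:
        grouped[group].append((score, item))
    # heapq.nlargest returns pairs ordered from highest to lowest score.
    return {g: heapq.nlargest(k, pairs) for g, pairs in grouped.items()}


# Hypothetical scores, e.g. predicted ratings per (user, item).
scores = [
    ("alice", 0.9, "item_a"),
    ("alice", 0.4, "item_b"),
    ("alice", 0.7, "item_c"),
    ("bob", 0.2, "item_a"),
    ("bob", 0.8, "item_d"),
]
recommendations = top_k_per_group(scores, k=2)
```

Each user ends up with at most k items, which is the shape a top-k recommendation list needs.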

Page 11: Introduction to Hivemall

Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LoF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)

Page 12: Introduction to Hivemall

Industry use cases of Hivemall

Ø CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc. and more

Ø Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.

Ø Churn Detection
  • Algorithm: Regression
  • OISIX and more

Ø Item/User recommendation
  • Algorithm: Recommendation (Matrix Factorization / kNN)
  • Adtech Companies, ISP portal, and more

Ø Value prediction of real estate
  • Algorithm: Regression
  • Livesense


Page 14: Introduction to Hivemall

Why Hivemall

1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.

2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.

That’s why I built Hivemall.

Page 15: Introduction to Hivemall

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: raw data on HDFS/S3 goes through Extract-Transform-Load into a feature-vector file (e.g. height: 173cm, weight: 60kg, age: 34, gender: man, …), which is then used for machine learning.]

Page 16: Introduction to Hivemall

How I used to do ML projects before Hivemall (continued)

Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)

Page 17: Introduction to Hivemall

How I used to do ML projects before Hivemall (continued)

The ML tools do not scale, and you have to learn R/Python APIs

Page 18: Introduction to Hivemall

How I used to do ML before Hivemall (continued)

Does not meet my needs in terms of its scalability, ML algorithms, and usability

I ❤ scalable SQL query

Page 19: Introduction to Hivemall

Survey on existing ML frameworks

Framework: User interface
Mahout: Java API programming
Spark MLlib/MLI: Scala API programming, Scala Shell (REPL)
H2O: R programming, GUI
Cloudera Oryx: HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming): C++ API programming, Command Line

Existing distributed machine learning frameworks are NOT easy to use

Page 20: Introduction to Hivemall

Hivemall’s Vision: ML on SQL

[Figure: classification with Mahout, shown for comparison]

With Hivemall, training is a single SQL query:

  CREATE TABLE lr_model AS
  SELECT
    feature, -- reducers perform model averaging in parallel
    avg(weight) as weight
  FROM (
    SELECT logress(features, label, ..) as (feature, weight)
    FROM train
  ) t -- map-only task
  GROUP BY feature; -- shuffled to reducers

This SQL query automatically runs in parallel on Hadoop

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction

Page 21: Introduction to Hivemall

Hivemall on Apache Spark

Installation is very easy as follows:

  $ spark-shell --packages maropu:hivemall-spark:0.0.6


Page 23: Introduction to Hivemall

How Hivemall works in training

Machine learning algorithms are implemented as User-Defined Table generating Functions (UDTFs). A UDTF is a function that returns a relation.

[Figure: the training table, a relation of tuple<label, array<features>> such as (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>), is processed by parallel train UDTF instances. Each emits tuple<feature, weights>; the outputs are shuffled by feature and mixed (param-mix) into the prediction model, a Relation<feature, weights>.]

● The resulting prediction model is a relation of features and their weights
● The numbers of mappers and reducers are configurable

Parallelism is Powerful
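The "shuffle by feature, then mix" step can be sketched in a few lines of Python. This is only an illustration of the dataflow under simple assumptions (plain averaging, in-memory lists standing in for mapper outputs); the function name average_by_feature and the sample weights are hypothetical, and this is not Hivemall's code.

```python
from collections import defaultdict


def average_by_feature(mapper_outputs):
    """Merge per-trainer (feature, weight) pairs into one model by averaging."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pairs in mapper_outputs:       # one list per parallel trainer
        for feature, weight in pairs:  # grouping by key plays the role of the shuffle
            sums[feature] += weight
            counts[feature] += 1
    # The model is a relation of feature -> averaged weight.
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical outputs of two trainers that saw different parts of the data.
mapper_a = [(1, 0.5), (2, -0.25), (9, 1.0)]
mapper_b = [(1, 1.5), (3, 0.75), (9, 0.0)]
model = average_by_feature([mapper_a, mapper_b])
```

Features seen by only one trainer keep that trainer's weight, while shared features get the mean, so each trainer can work independently until the final grouping.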

Page 24: Introduction to Hivemall

Alternative Approach in Hivemall

Hivemall provides the amplify UDTF to emulate the effect of training iterations in machine learning without several MapReduce steps

  SET hivevar:xtimes=3;

  CREATE VIEW training_x3
  as
  SELECT *
  FROM (
    SELECT amplify(${xtimes}, *) as (rowid, label, features)
    FROM training
  ) t
  CLUSTER BY rand()
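As a rough illustration of what the view above produces, the following Python sketch repeats every training row xtimes and then shuffles the result. It is an approximation under stated assumptions: random.shuffle stands in for CLUSTER BY rand(), the seed parameter is added only to make the sketch repeatable, and the sample rows are hypothetical.

```python
import random


def amplify(xtimes, rows):
    """Yield every input row xtimes times."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def amplified_and_shuffled(xtimes, rows, seed=0):
    """Amplify the rows, then shuffle them so the copies are spread out."""
    out = list(amplify(xtimes, rows))
    random.Random(seed).shuffle(out)
    return out


# Hypothetical (rowid, label, features) rows.
training = [(1, 1, ["1", "2"]), (2, -1, ["1", "3", "9"])]
training_x3 = amplified_and_shuffled(3, training)
```

An online learner that reads training_x3 once sees each example three times in mixed order, which is the effect that would otherwise require several passes over the data.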


Page 26: Introduction to Hivemall

How to use Hivemall

[Figure: training and prediction workflow. Training: labels and feature vectors → machine learning → prediction model. Prediction: feature vector → prediction model → label.]

Data preparation

Page 27: Introduction to Hivemall

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

  Create external table e2006tfidf_train (
    rowid int,
    label float,
    features ARRAY<STRING>
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ","
  STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
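The table definition implies a simple text layout: one row per line, fields separated by tabs, and the features array separated by commas. A minimal Python sketch of reading one such line follows. The sample line and the "index:value" shape of each feature string are assumptions made for illustration; the slide does not show the file contents.

```python
def parse_row(line):
    """Split one text line into (rowid, label, features).

    Fields are tab-separated; the features field is a comma-separated list.
    """
    rowid, label, features = line.rstrip("\n").split("\t")
    return int(rowid), float(label), features.split(",")


# Hypothetical line; the "index:value" feature strings are an assumption.
row = parse_row("1\t-3.58\t101:0.5,205:0.25\n")
```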

Page 28: Introduction to Hivemall

How to use Hivemall

[Figure: training and prediction workflow. Training: labels and feature vectors → machine learning → prediction model. Prediction: feature vector → prediction model → label.]

Feature Engineering

Page 29: Introduction to Hivemall

How to use Hivemall - Feature Engineering

Applying Min-Max Normalization

  create view e2006tfidf_train_scaled
  as
  select
    rowid,
    rescale(label, ${min_label}, ${max_label}) as label,
    features
  from
    e2006tfidf_train;

Transforming a label value to a value between 0.0 and 1.0
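Min-max normalization maps a value v to (v - min) / (max - min). The Python sketch below shows the arithmetic on a few hypothetical labels. Raising an error when min equals max is a choice made for this sketch only; it is not a statement about how Hivemall's rescale handles that case.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: refuse a degenerate range.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values; min and max are taken from the data itself.
labels = [-4.0, -2.0, 0.0]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
```

In the query above, the minimum and maximum are passed in as the Hive variables min_label and max_label, so they would be computed over the training labels beforehand.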

Page 30: Introduction to Hivemall

How to use Hivemall

[Figure: training and prediction workflow. Training: labels and feature vectors → machine learning → prediction model. Prediction: feature vector → prediction model → label.]

Training

Page 31: Introduction to Hivemall

How to use Hivemall - Training

Training by logistic regression

  CREATE TABLE lr_model AS
  SELECT
    feature,
    avg(weight) as weight -- reducers perform model averaging in parallel
  FROM (
    SELECT logress(features, label, ..) as (feature, weight)
    FROM train
  ) t -- map-only task to learn a prediction model
  GROUP BY feature -- shuffle map-outputs to reducers by feature
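To show what a single map task might do in the inner query, here is a minimal Python sketch of one pass of stochastic gradient descent for logistic regression that emits (feature, weight) pairs. It assumes binary features (each listed feature has value 1), a fixed learning rate eta, and labels between 0.0 and 1.0. It is a textbook SGD update, not Hivemall's logress implementation, and the names and sample rows are hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_logistic_sgd(rows, eta=0.1):
    """One SGD pass over (features, label) rows; returns (feature, weight) pairs."""
    weights = defaultdict(float)
    for features, label in rows:
        predicted = sigmoid(sum(weights[f] for f in features))
        gradient = label - predicted
        for f in features:
            # Each present feature has value 1, so the update is eta * gradient.
            weights[f] += eta * gradient
    return sorted(weights.items())


# Hypothetical training rows: (features, label).
rows = [(["a", "b"], 1.0), (["b", "c"], 0.0)]
pairs = train_logistic_sgd(rows)
learned = dict(pairs)
```

Because the output is a list of (feature, weight) pairs, results from many such tasks can be grouped by feature and averaged, which is what the outer query expresses.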

Page 32: Introduction to Hivemall

How to use Hivemall - Training

Training of a Confidence Weighted classifier

  CREATE TABLE news20b_cw_model1 AS
  SELECT
    feature,
    voted_avg(weight) as weight
  FROM (
    SELECT train_cw(features, label) as (feature, weight)
    FROM news20b_train
  ) t
  GROUP BY feature

Vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
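A small Python sketch of the voting idea, using the weights from the slide: count positive and negative weights, let the majority sign win, and average only the winning side. Tie-breaking and the treatment of zero weights are assumptions of this sketch, not documented Hivemall behavior.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: on a tie the negative side is used; zeros do not vote.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# Weights from the slide: four positive votes, one negative vote.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

With these inputs the positive side wins, so the result is the mean of 0.7, 0.3, 0.2 and 0.7, and the lone -0.1 is ignored rather than dragging the average down.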

Page 33: Introduction to Hivemall

How to use Hivemall

[Figure: training and prediction workflow. Training: labels and feature vectors → machine learning → prediction model. Prediction: feature vector → prediction model → label.]

Prediction

Page 34: Introduction to Hivemall

How to use Hivemall - Prediction

  CREATE TABLE lr_predict
  as
  SELECT
    t.rowid,
    sigmoid(sum(m.weight)) as prob
  FROM
    testing_exploded t LEFT OUTER JOIN
    lr_model m ON (t.feature = m.feature)
  GROUP BY
    t.rowid

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory
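The join-based prediction can be mimicked in Python with a dictionary lookup per (rowid, feature) pair. This is a sketch of the query's semantics under simple assumptions: the model fits in a dict here only for brevity, None stands in for SQL NULL, and the sample model and test rows are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """testing_exploded: (rowid, feature) pairs; model: feature -> weight."""
    totals = {}
    for rowid, feature in testing_exploded:
        totals.setdefault(rowid, None)
        weight = model.get(feature)  # None when the join finds no match
        if weight is not None:       # like SQL sum(), skip NULL weights
            totals[rowid] = (totals[rowid] or 0.0) + weight
    return {
        rowid: None if total is None else sigmoid(total)
        for rowid, total in totals.items()
    }


# Hypothetical model and test rows (one row per feature of each example).
model = {"a": 2.0, "b": -2.0}
testing_exploded = [(1, "a"), (1, "b"), (2, "a"), (2, "zzz"), (3, "zzz")]
probs = predict(testing_exploded, model)
```

Features missing from the model simply contribute nothing, and a row with no known features keeps a None probability, mirroring how the outer join preserves every test row.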

Page 35: Introduction to Hivemall

Real-time prediction

[Figure: batch training on Hadoop (labels and feature vectors → machine learning → prediction model); prediction models are exported for online prediction on an RDBMS (feature vector → prediction model → label).]

bit.ly/hivemall-rtp

Page 36: Introduction to Hivemall

Conclusion

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs

Hivemall’s Positioning

Ø For SQL users that need ML
Ø For those already using Hive
Ø Ease of use and scalability in mind

Does not require coding, packaging, compiling, or introducing a new programming language or APIs.

Page 37: Introduction to Hivemall

Thank you!

Makoto YUI - Research engineer / Treasure Data

twitter: @myui

Download Hivemall from bit.ly/hivemall