Introduction to Hivemall

Hivemall:ScalableMachineLearningLibraryforApacheHive

ResearchEngineerMakotoYUI@myui

<[email protected]>

1

bit.ly/hivemall

ExternalIntegrations

SQL

Server

CRM

RDBMS

App log

Sensor

Apache log

ERP

HiveBatch

AdhocPresto

API

ODBCJDBC

PUSH

Treasure Agent

BI tools

Data analysis

Treasure Data Collectors

Embedded

Embulk

Mobile SDK

JS SDK

Treasure Data Cloud Service

Machine Learning

900,000Records stored

per sec.

1. What is Hivemall (short intro.)

2. Why Hivemall (motivations etc.)

3. Hivemall Internals

4. How to use Hivemall

Agenda

What is HivemallScalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

HadoopHDFS

MapReduce(MRv1)

Hivemall

ApacheYARN

ApacheTezDAGprocessing

Machine Learning

Query Processing

Parallel Data Processing Framework

Resource Management

Distributed File System

SparkSQL

ApacheSpark

MESOS

Hive Pig

MLlib

WonIDG’sInfoWorld2014Bossie Awards 2014: The best open source big data tools

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem(awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)

bit.ly/hivemall-award

Classification✓ Perceptron✓ PassiveAggressive(PA,PA1,PA2)✓ ConfidenceWeighted(CW)✓ AdaptiveRegularizationofWeightVectors(AROW)✓ SoftConfidenceWeighted(SCW)✓ AdaGrad+RDA✓ FactorizationMachines✓ RandomForestClassification

8

Regression✓LogisticRegression(SGD)✓PARegression✓AROWRegression✓AdaGrad (logisticloss)✓AdaDELTA (logisticloss)✓FactorizationMachines✓RandomForestRegression

List of supported Algorithms

List of supported AlgorithmsClassification✓ Perceptron✓ PassiveAggressive(PA,PA1,PA2)✓ ConfidenceWeighted(CW)✓ AdaptiveRegularizationofWeightVectors(AROW)✓ SoftConfidenceWeighted(SCW)✓ AdaGrad+RDA✓ FactorizationMachines✓ RandomForestClassification

9

Regression✓LogisticRegression(SGD)✓AdaGrad (logisticloss)✓AdaDELTA (logisticloss)✓PARegression✓AROWRegression✓FactorizationMachines✓RandomForestRegression

SCW is a good first choiceTry RandomForest if SCW does not work

Logistic regression is good for getting a probability of a positive class

Factorization Machines is good where features are sparse and categorical ones

List of Algorithms for Recommendation

10

K-NearestNeighbor✓ Minhash andb-BitMinhash

(LSHvariant)✓ SimilaritySearchonVectorSpace

(Euclid/Cosine/Jaccard/Angular)

MatrixCompletion✓MatrixFactorization✓ FactorizationMachines(regression)

each_top_k functionofHivemallisusefulforrecommendingtop-kitems

Other Supported Algorithms

11

AnomalyDetection✓ LocalOutlierFactor(LoF)

FeatureEngineering✓FeatureHashing✓FeatureScaling

(normalization,z-score)✓ TF-IDFvectorizer✓ PolynomialExpansion

(FeaturePairing)✓ Amplifier

NLP✓BasicEnglist textTokenizer✓JapaneseTokenizer(Kuromoji)

Ø CTR prediction of Ad click logs• Algorithm: Logistic regression• Freakout Inc. and more

Ø Gender prediction of Ad click logs• Algorithm: Classification• Scaleout Inc.

Ø Churn Detection• Algorithm: Regression• OISIX and more

Ø Item/User recommendation• Algorithm: Recommendation (Matrix Factorization / kNN) • Adtech Companies, ISP portal, and more

Ø Value prediction of Real estates• Algorithm: Regression• Livesense

Industry use cases of Hivemall

12

1. What is Hivemall (short intro.)

2. Why Hivemall (motivations etc.)



Agenda

WhyHivemall

1. InmyexperienceworkingonML,IusedHiveforpreprocessingandPython(scikit-learnetc.)forML.ThiswasINEFFICIENTandANNOYING.Also,PythonisnotasscalableasHive.

2. WhynotrunMLalgorithmsinsideHive?Lesscomponentstomanageandmorescalable.

That’swhyIbuildHivemall.

HowIusedtodoMLprojectsbeforeHivemall

GivenrawdatastoredonHadoopHDFS

RawData

HDFSS3 FeatureVector

height:173cmweight:60kgage:34gender: man…

Extract-Transform-Load

MachineLearning

file



RawData




file

Need to do expensive data preprocessing

(Joins, Filtering, and Formatting of Data that does not fit in memory)

MachineLearning



RawData




file

Do not scaleHave to learn R/Python APIs

HowIusedtodoMLbeforeHivemallGivenrawdatastoredonHadoopHDFS

RawData




Does not meet my needsIn terms of its scalability, ML algorithms, and usability

I ❤ scalableSQL query

Framework UserinterfaceMahout JavaAPIProgrammingSparkMLlib/MLI ScalaAPIprogramming

ScalaShell(REPL)H2O Rprogramming

GUIClouderaOryx HttpRESTAPIprogrammingVowpalWabbit(w/Hadoopstreaming)

C++APIprogrammingCommandLine

SurveyonexistingMLframeworks

ExistingdistributedmachinelearningframeworksareNOTeasytouse

Hivemall’s Vision:MLonSQL

ClassificationwithMahout

CREATETABLElr_modelASSELECTfeature,-- reducersperformmodelaveraginginparallelavg(weight)asweightFROM(SELECTlogress(features,label,..)as(feature,weight)FROMtrain)t-- map-onlytaskGROUPBYfeature;-- shuffledtoreducers

✓MachineLearningmadeeasyforSQLdevelopers(MLfortherestofus)

✓InteractiveandStableAPIsw/ SQLabstraction

ThisSQLqueryautomaticallyrunsinparallelonHadoop

21

HivemallonApacheSpark

Installationisveryeasyasfollows:$spark-shell--packagesmaropu:hivemall-spark:0.0.6

1. What is Hivemall

2. Why Hivemall



Agenda

ImplementedmachinelearningalgorithmsasUser-DefinedTablegeneratingFunctions(UDTFs)

HowHivemallworksintraining

+1,<1,2>..+1,<1,7,9>

-1,<1,3,9>..+1,<3,8>

tuple<label,array<features>>

tuple<feature,weights>

Predictionmodel

UDTF

Relation<feature,weights>

param-mix param-mix

Trainingtable

Shufflebyfeature

train train

● Resulting prediction model is a relation of feature and its weight

● # of mapper and reducers are configurable

UDTF is a function that returns a relation

ParallelismisPowerful

AlternativeApproachinHivemallHivemallprovidesthe amplify UDTFtoenumerateiterationeffectsinmachinelearningwithoutseveralMapReduce steps

SET hivevar:xtimes=3;

CREATE VIEW training_x3asSELECT*

FROM (SELECTamplify(${xtimes}, *) as (rowid, label, features)

FROMtraining

) tCLUSTER BY rand()

1. What is Hivemall

2. Why Hivemall



Agenda

HowtouseHivemall

MachineLearning

Training

Prediction

PredictionModel Label

FeatureVector

FeatureVector

Label

Datapreparation 26

Create external table e2006tfidf_train (rowid int,label float,features ARRAY<STRING>

) ROW FORMAT DELIMITED

FIELDS TERMINATED BY '¥t' COLLECTION ITEMS TERMINATED BY ",“

STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

HowtouseHivemall- Datapreparation

DefineaHivetablefortraining/testingdata

27

HowtouseHivemall

MachineLearning

Training

Prediction


FeatureVector

FeatureVector

Label

FeatureEngineering

28

create view e2006tfidf_train_scaled asselect rowid,rescale(target,${min_label},${max_label}) as label,

featuresfrom e2006tfidf_train;

Applying a Min-Max Feature Normalization

HowtouseHivemall- FeatureEngineering

Transformingalabelvaluetoavaluebetween0.0and1.0

29

HowtouseHivemall

MachineLearning

Training

Prediction


FeatureVector

FeatureVector

Label

Training

30

HowtouseHivemall- Training

CREATE TABLE lr_model ASSELECTfeature,avg(weight) as weight

FROM (SELECT logress(features,label,..)

as (feature,weight)FROM train

) tGROUP BY feature

Trainingbylogisticregression

map-onlytasktolearnapredictionmodel

Shufflemap-outputstoreducesbyfeature

Reducersperformmodelaveraginginparallel

31

HowtouseHivemall- Training

CREATE TABLE news20b_cw_model1 ASSELECT

feature,voted_avg(weight) as weight

FROM(SELECT

train_cw(features,label) as (feature,weight)

FROMnews20b_train

) t GROUP BY feature

TrainingofConfidenceWeightedClassifier

Votetousenegativeorpositiveweightsforavg

+0.7,+0.3,+0.2,-0.1,+0.7

TrainingfortheCWclassifier

32

HowtouseHivemall

MachineLearning

Training

Prediction


FeatureVector

FeatureVector

Label

Prediction

33

HowtouseHivemall- Prediction

CREATE TABLE lr_predictasSELECTt.rowid, sigmoid(sum(m.weight)) as prob

FROMtesting_exploded t LEFT OUTER JOINlr_model m ON (t.feature = m.feature)

GROUP BY t.rowid

PredictionisdonebyLEFTOUTERJOINbetweentestdataandpredictionmodel

Noneedtoloadtheentiremodelintomemory

34

Real-timeprediction

MachineLearning

Batch Training on Hadoop

Online Prediction on RDBMS


FeatureVector

FeatureVector

Label

Exportpredictionmodels

35

bit.ly/hivemall-rtp

Conclusion

HivemallprovidesacollectionofmachinelearningalgorithmsasHiveUDFs/UDTFs

36

Ø ForSQLusersthatneedMLØ ForwhomalreadyusingHiveØ Easy-of-useandscalabilityinmind

Do not require coding, packaging, compiling or introducing a new programming language or APIs.

Hivemall’s Positioning

Thank you!MakotoYUI- Researchengineer/TreasureData

twitter:@myui

37

Download Hivemall from bit.ly/hivemall

Introduction to Hivemall

Data & Analytics

Transcript of Introduction to Hivemall