Talk about Hivemall at Data Scientist Organization on 2015/09/17

Introduction to Machine Learning using Hivemall

Research Engineer Makoto YUI @myui

<myui@treasure-data.com>

2014/09/17 Talk@Japan DataScientist Society 1

Ø 2015.04 Joined Treasure Data, Inc.
  1st Research Engineer in Treasure Data
  My mission in TD is developing ML-as-a-Service

Ø 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale Machine Learning project and Parallel Databases

Ø 2009.03 Ph.D. in Computer Science from NAIST

Ø Super programmer award from the MITOU Foundation
  Super creators in TD: Sada Furuhashi, Keisuke Nishida

Who am I ?


Agenda

1. What is Hivemall

2. Why Hivemall (motivations etc.)

3. Hivemall Internals

4. How to use Hivemall
   • Logistic regression (RDBMS integration)
   • Matrix Factorization
   • Anomaly Detection (demo)
   • Random Forest (demo)


What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2


https://github.com/myui/hivemall

What is Hivemall

Hivemall in the Hadoop stack (top to bottom):

• Machine Learning: Hivemall
• Query Processing: Hive / PIG
• Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez DAG processing, MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS

MapReduce and DAG engine

[Figure: task graphs of map (M) and reduce (R) tasks with HDFS reads/writes, comparing MapReduce with a DAG engine (Tez / Spark)]

No intermediate DFS reads/writes!

Won IDG’s InfoWorld 2014 Bossie Awards 2014: The best open source big data tools

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem

bit.ly/hivemall-award

List of Features in Hivemall v0.3.2

Classification (both binary- and multi-class)
• Perceptron
• Passive Aggressive (PA)
• Confidence Weighted (CW)
• Adaptive Regularization of Weight Vectors (AROW)
• Soft Confidence Weighted (SCW)
• AdaGrad+RDA

Regression
• Logistic Regression (SGD)
• PA Regression
• AROW Regression
• AdaGrad
• AdaDELTA

kNN and Recommendation
• Minhash and b-Bit Minhash (LSH variant)
• Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
• Matrix Factorization

Feature engineering
• Feature Hashing
• Feature Scaling (normalization, z-score)
• TF-IDF vectorizer
• Polynomial Expansion

Anomaly Detection
• Local Outlier Factor

Treasure Data supports Hivemall v0.3.2-3


Algorithm                                  News20.binary Classification Accuracy
Perceptron                                 0.9460
Passive-Aggressive (a.k.a. Online-SVM)     0.9604
LibLinear                                  0.9636
LibSVM/TinySVM                             0.9643
Confidence Weighted (CW)                   0.9656
AROW [1]                                   0.9660
SCW [2]                                    0.9662

(Better toward the bottom of the table)

CW variants are very smart online ML algorithms

Hivemall supports the state-of-the-art online learning algorithms (for classification and regression)


List of Features in Hivemall

Why are CW variants so good?

Suppose a binary classification setting to classify sentences as positive or negative
→ learn the weight for each word (each word is a feature)

Label      Feature Vector
Positive   I like this author
Negative   I like this author, but found this book dull

A naïve update will reduce both W_like and W_dull at the same rate

CW variants adjust weights at different rates
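The different update rates can be illustrated with a small sketch. It uses an AROW-style update with a diagonal covariance, which is a simplification chosen here for illustration and not necessarily how Hivemall implements its CW variants. The function name `arow_update`, the regularization value `r=1.0`, and the bag-of-words encoding are assumptions.

```python
def arow_update(w, sigma, words, y, r=1.0):
    """One AROW-style step with a diagonal covariance (illustrative sketch).

    w: word -> weight, sigma: word -> variance (lower means more confident),
    words: tokens of one sentence, y: +1 or -1.
    """
    feats = set(words)
    margin = y * sum(w.get(f, 0.0) for f in feats)
    if margin >= 1.0:
        return  # classified with enough margin, nothing to do
    variance = sum(sigma.get(f, 1.0) for f in feats)
    beta = 1.0 / (variance + r)
    alpha = (1.0 - margin) * beta
    for f in feats:
        s = sigma.get(f, 1.0)
        w[f] = w.get(f, 0.0) + alpha * y * s  # low-confidence words move more
        sigma[f] = s - beta * s * s           # each update raises confidence


w, sigma = {}, {}
positive = "i like this author".split()
negative = "i like this author but found this book dull".split()

# "like" is seen several times in positive sentences, so its variance shrinks.
for _ in range(5):
    arow_update(w, sigma, positive, +1)

before = dict(w)
arow_update(w, sigma, negative, -1)
delta_like = w["like"] - before["like"]
delta_dull = w["dull"] - before.get("dull", 0.0)
```

After the negative sentence, the weight of the rarely seen word "dull" moves further than the weight of the frequently seen word "like", which is the behavior the slide describes.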

Why are CW variants so good?

[Figure: two plots of a weight. Left: adjust a weight. Right: adjust a weight & confidence (covariance), with the callout "At this confidence, the weight is 0.5". Tick values shown: 0.5, 0.6, 0.8]


Features to be supported from Hivemall v0.4


1. RandomForest
   • classification, regression

2. Gradient Tree Boosting
   • classifier, regression

3. Factorization Machine
   • classification, regression (factorization)

4. Online LDA
   • topic modeling, clustering

Planned to release v0.4 in Oct.

Gradient Boosting and Factorization Machine are often used by data science competition winners (very important for practitioners)


Factorization Machine

Matrix Factorization



Context information (e.g., time) can be considered

Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf



Factorization Model with degree=2 (2-way interaction)

Terms of the model equation:
• Global bias
• Regression coefficient of the j-th variable
• Pairwise interaction
• Factorization
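The terms above can be put together in a short sketch of a degree-2 factorization machine score, following the model in the cited Rendle paper: global bias, plus the linear terms, plus pairwise interactions whose strengths are inner products of per-variable factor vectors. The function names and the toy numbers are illustrative assumptions, and this is not Hivemall code.

```python
def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine score for a dense feature list x.

    w0: global bias, w[j]: coefficient of the j-th variable,
    V[j]: k-dimensional factor vector of the j-th variable.
    Uses the linear-time rewriting of the pairwise term.
    """
    n, k = len(x), len(V[0])
    linear = sum(w[j] * x[j] for j in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[j][f] * x[j] for j in range(n))
        s_sq = sum((V[j][f] * x[j]) ** 2 for j in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score, summing <V[i], V[j]> * x[i] * x[j] over all pairs i < j."""
    n = len(x)
    score = w0 + sum(w[j] * x[j] for j in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score


x = [1.0, 0.0, 2.0]
w0, w = 0.1, [0.2, -0.3, 0.4]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.5]]
fast, naive = fm_predict(x, w0, w, V), fm_predict_naive(x, w0, w, V)
```

Both functions give the same score; the first avoids the double loop over variable pairs.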

Ø CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc. and more

Ø Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.

Ø Churn Detection
  • Algorithm: Regression
  • OISIX and more

Ø Item/User recommendation
  • Algorithm: Recommendation (Matrix Factorization / kNN)
  • Wish.com, Adtech Company, Real-estate Portal, and more

Ø Value prediction of Real estates
  • Algorithm: Regression
  • Livesense

Industry use cases of Hivemall


Why Hivemall

1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.

2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.

That’s why I built Hivemall.


Data Moving in Data Analytics

[Figure: data pipeline from event data to insights and decisions. Stages: Data Collection, Data Lake, Data Processing, Data Mart, Data Analysis. Services shown: Amazon S3, Amazon EMR, Redshift, Amazon RDS. Roles: Data Engineer, Data Scientist, Data Engineer]


What Data Scientists Actually Do / What Data Scientists Should Do

Data Moving in Data Analytics

Hive is a great data preprocessing tool due to its ease of use, efficiency, and scalability for joins, filtering, and selection (data preprocessing)

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Figure: raw data on HDFS/S3 → Extract-Transform-Load → feature vectors (height:173cm, weight:60kg, age:34, gender:man, …) in a file → Machine Learning]




[Figure repeated: raw data → feature vectors (age:34, gender:man, …) → file → Machine Learning]

Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)



[Figure repeated: raw data → feature vectors (age:34, gender:man, …) → file]

Do not scale
Have to learn R/Python APIs


How I used to do ML before Hivemall

Given raw data stored on Hadoop HDFS

[Figure repeated: raw data → feature vectors (age:34, gender:man, …)]

Does not meet my needs in terms of its scalability, ML algorithms, and usability

I ♥ scalable SQL query


Framework                              User interface
Mahout                                 Java API programming
Spark MLlib/MLI                        Scala API programming, Scala Shell (REPL)
H2O                                    R programming, GUI
Cloudera Oryx                          HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)    C++ API programming, Command Line

Survey on existing ML frameworks

Existing distributed machine learning frameworks are NOT easy to use



Motivation: Machine Learning needs to be easier for developers (esp. data engineers)!

People are saying that ..

Hivemall’s Vision: ML on SQL

Classification with Mahout

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

Machine Learning made easy for SQL developers (ML for the rest of us)

Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop


Implemented machine learning algorithms as User-Defined Table generating Functions (UDTFs)

How Hivemall works in training

[Figure: the training table holds rows of tuple<label, array<features>>, e.g. (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>). Parallel train tasks (UDTF) each emit tuple<feature, weights>; the output is shuffled by feature and merged by param-mix into the prediction model, a Relation<feature, weights>]

Resulting prediction model is a relation of feature and its weight

The numbers of mappers and reducers are configurable

UDTF is a function that returns a relation
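The flow described above (train in parallel, emit a relation of feature and weight, then merge by feature) can be mimicked in plain Python. This is only a sketch under assumptions: `train_partition` is a hypothetical stand-in for one map task using a simple SGD logistic regression with 0/1 labels, and `average_by_feature` stands in for the reducers' GROUP BY feature with avg(weight). It is not Hivemall's implementation.

```python
import math
from collections import defaultdict


def train_partition(rows, eta=0.1):
    """Stand-in for one map task: SGD logistic regression over its own rows.

    rows: iterable of (label in {0, 1}, list of feature ids).
    Returns (feature, weight) pairs, i.e. the relation emitted by the task.
    """
    w = defaultdict(float)
    for label, features in rows:
        z = sum(w[f] for f in features)
        p = 1.0 / (1.0 + math.exp(-z))
        for f in features:
            w[f] += eta * (label - p)
    return list(w.items())


def average_by_feature(emitted):
    """Stand-in for the reducers: GROUP BY feature, avg(weight)."""
    total, count = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        total[feature] += weight
        count[feature] += 1
    return {f: total[f] / count[f] for f in total}


# Two partitions of the training table, one per (hypothetical) map task.
partitions = [
    [(1, [1, 2]), (1, [1, 7, 9])],
    [(0, [1, 3, 9]), (1, [3, 8])],
]
emitted = [pair for part in partitions for pair in train_partition(part)]
model = average_by_feature(emitted)
```

Feature 1 occurs in both partitions, so its final weight is the average of two independently learned weights; feature 2 occurs in one partition only and keeps that partition's weight.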

Parallelism is Powerful


Why not UDAF

Machine learning as an aggregate function

[Figure: a merge tree over the training table, rows of tuple<label, array<features>>. Four train tasks each emit array<weight> (4 ops in parallel); two merge steps combine array<sum of weight>, array<count> (2 ops in parallel); a single final merge produces the prediction model (no parallelism)]

• Bottleneck in the final merge: throughput is limited by its fan out
• Memory consumption grows
• Parallelism decreases


Problem that I faced: Iterations

Iterations are mandatory to get a good prediction model
• However, MapReduce is not suited for iterations because IN/OUT of MR job is through HDFS
• Spark avoids it by in-memory computation

[Figure: with MapReduce, each iteration (iter. 1, iter. 2, ...) reads its input from HDFS and writes its output back to HDFS; with Spark, the input is read once and iter. 1, iter. 2 run on it]


Training with Iterations in Spark

val data = spark.textFile(...).map(readPoint).cache()

for (i <- 1 to ITERATIONS) {
  // one MapReduce step per iteration: map computes per-point gradients, reduce sums them
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

Repeated MapReduce steps to do gradient descent

For each node, loads data in memory once

This is just a toy example! Why?

Logistic Regression example of Spark

Input to the gradient computation should be shuffled for each iteration (without it, more iterations are required)


What does MLlib actually do?

val data = ..

for (i <- 1 to numIterations) {
  val sampled = ..   // sample a subset of the data (partitioned RDD)
  val gradient = ..  // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}

Mini-batch Gradient Descent with Sampling

Iterations are mandatory for convergence because each iteration uses only a small fraction of the data

GradientDescent.scala
bit.ly/spark-gd
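The idea of mini-batch gradient descent with sampling can be sketched on a single machine as follows. This is an illustration of the technique only, not MLlib's code: the function name `minibatch_gd`, the step size, the sampling fraction, and the toy data are all assumptions.

```python
import math
import random


def minibatch_gd(data, dim, num_iterations=200, fraction=0.2, step=0.5, seed=0):
    """Logistic regression by mini-batch gradient descent with sampling.

    data: list of (x, y) with x a list of floats and y in {0, 1}.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    batch_size = max(1, int(len(data) * fraction))
    for _ in range(num_iterations):
        sampled = rng.sample(data, batch_size)  # sample a subset of the data
        grad = [0.0] * dim
        for x, y in sampled:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(dim):
                grad[j] += (p - y) * x[j]
        for j in range(dim):  # average the gradients over the sample, then step
            w[j] -= step * grad[j] / batch_size
    return w


# Toy data: x = [bias, v], label 1 when v is positive.
data = [([1.0, float(v)], 1 if v > 0 else 0) for v in range(-5, 6) if v != 0]
w = minibatch_gd(data, dim=2)
```

Each iteration looks at only two of the ten points here, which is why many iterations are needed before the weights settle.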


Alternative Approach in Hivemall

Hivemall provides the amplify UDTF to emulate iteration effects in machine learning without several MapReduce steps

SET hivevar:xtimes=3;

CREATE VIEW training_x3
as
SELECT
  *
FROM (
  SELECT
    amplify(${xtimes}, *) as (rowid, label, features)
  FROM
    training
) t
CLUSTER BY rand()


Map-only shuffling and amplifying

rand_amplify UDTF randomly shuffles the input rows for each Map task

CREATE VIEW training_x3
as
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
FROM
  training;
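What the two UDTFs do to a stream of rows can be sketched as Python generators. This is a guess at the semantics from the descriptions above: `amplify` repeats every row, and `rand_amplify` additionally shuffles rows inside a bounded buffer local to one task. The exact buffering strategy in Hivemall may differ, and the `seed` parameter is an addition for reproducibility.

```python
import random


def amplify(xtimes, rows):
    """Emit every input row xtimes times, in input order."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def rand_amplify(xtimes, buffer_size, rows, seed=0):
    """Amplify rows and shuffle them inside a bounded in-memory buffer."""
    rng = random.Random(seed)
    buffer = []
    for row in amplify(xtimes, rows):
        buffer.append(row)
        if len(buffer) >= buffer_size:
            rng.shuffle(buffer)
            yield from buffer
            buffer = []
    rng.shuffle(buffer)  # flush whatever is left at the end of the input
    yield from buffer


rows = [(1, +1, ["1", "2"]), (2, -1, ["1", "3", "9"]), (3, +1, ["3", "8"])]
amplified = list(amplify(3, rows))
shuffled = list(rand_amplify(3, 4, rows))
```

Both variants produce three copies of each row; only the order differs. The buffered shuffle needs no extra MapReduce stage, which matches the "map-only" label on the slide.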


Detailed plan w/ map-local shuffle

[Figure: query plan.
Map task (several in parallel): Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write
Reduce task (several in parallel): Merge → Aggregate → Reduce write]

Scanned entries are amplified and then shuffled. Note this is a pipeline op.

The Rand Amplifier operator is interleaved between the table scan and the training operator

Shuffle (distributed by feature)


Method                                            Elapsed time (sec)   AUC
Plain                                             89.718               0.734805
amplifier + clustered by (a.k.a. global shuffle)  479.855              0.746214
rand_amplifier (a.k.a. map-local shuffle)         116.424              0.743392

Performance effects of amplifiers

With the map-local shuffle, prediction accuracy improved with an acceptable overhead


How to use Hivemall

[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Current step: Data preparation]

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  target float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

How to use Hivemall - Data preparation

Define a Hive table for training/testing data


How to use Hivemall

[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Current step: Feature Engineering]


create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;

Applying a Min-Max Feature Normalization

How to use Hivemall - Feature Engineering

Transforming a label value to a value between 0.0 and 1.0
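Min-max normalization itself is a one-line formula. The sketch below shows it in Python with made-up target values; the handling of a constant column (returning 0.5) is an assumption made here, not a statement about Hivemall's rescale function.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        return 0.5  # assumption: a constant column maps to the midpoint
    return (value - min_value) / (max_value - min_value)


targets = [-7.9, -3.6, -0.5]  # illustrative target values
lo, hi = min(targets), max(targets)
labels = [rescale(t, lo, hi) for t in targets]
```

The smallest target maps to 0.0, the largest to 1.0, and everything else falls in between.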


How to use Hivemall

[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Current step: Training]


How to use Hivemall - Training

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature

Training by logistic regression

• Inner SELECT: map-only task to learn a prediction model
• GROUP BY feature: shuffle map outputs to reducers by feature
• avg(weight): reducers perform model averaging in parallel


How to use Hivemall - Training

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT
    train_cw(features,label) as (feature,weight)
  FROM
    news20b_train
) t
GROUP BY feature

Training of Confidence Weighted Classifier

Vote to use negative or positive weights for avg

+0.7, +0.3, +0.2, -0.1, +0.7
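One plausible reading of "vote to use negative or positive weights for avg" is: count how many weights are positive and how many are negative, then average only the majority side. The sketch below implements that reading; it is an assumption about the semantics of voted_avg, and the tie-breaking rule (negatives win a tie) is also assumed.

```python
def voted_avg(weights):
    """Average only the weights on the majority sign (assumed semantics)."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0


result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

For the five weights on the slide, four are positive and one is negative, so under this reading the result is the mean of the four positive values.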

Training for the CW classifier


create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features),label) as (label,feature,weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features),label) as (label,feature,weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features),label) as (label,feature,weight)
  from
    news20mc_train_x3
) t
group by label, feature;

Ensemble learning for stable prediction performance

Just stack prediction models by union all


How to use Hivemall

[Figure: workflow. Training: Label + Feature Vector → Machine Learning → Prediction Model. Prediction: Feature Vector + Prediction Model → Label. Current step: Prediction]


How to use Hivemall - Prediction

CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory
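The logic of the prediction query can be traced in a few lines of Python. Note the difference in mechanics: in Hive the join is executed by the query engine, so the model stays a table, whereas this sketch uses a dictionary purely as a stand-in for the join. Treating an unmatched feature as contributing 0.0 is an assumption of the sketch, and the table contents are made up.

```python
import math
from collections import defaultdict


def predict(testing_exploded, lr_model):
    """testing_exploded: (rowid, feature) pairs; lr_model: (feature, weight) pairs."""
    weight_of = dict(lr_model)  # stand-in for the join against the model table
    total = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature missing from the model adds nothing here
        total[rowid] += weight_of.get(feature, 0.0)
    return {rowid: 1.0 / (1.0 + math.exp(-s)) for rowid, s in total.items()}


testing_exploded = [(1, "a"), (1, "b"), (2, "zzz"), (3, "a")]
lr_model = [("a", 1.0), ("b", -1.0)]
probs = predict(testing_exploded, lr_model)
```

Each test row is stored as one (rowid, feature) pair per feature, so summing the matched weights per rowid and applying the sigmoid gives one probability per row.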


How to use Hivemall

[Figure: workflow. Batch Training on Hadoop: Label + Feature Vector → Machine Learning → prediction model. Export prediction models. Online Prediction on RDBMS: Feature Vector → Label]

Real-time Prediction on Treasure Data

• Run batch training job periodically
• Periodical export
• Real-time prediction on a RDBMS


Supervised Learning: Recommendation

Rating prediction of a Matrix

Can be applied for user/Item Recommendation


Factorize a matrix into a product of matrices having k latent factors

Mean Rating


Regularization

Bias for each user/item

Criteria of Biased MF


Factorization
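The ingredients named above (mean rating, a bias for each user and item, the factorization, and regularization) fit together as in the following sketch of biased matrix factorization trained by SGD. A rating is predicted as mean + user bias + item bias + inner product of the user and item factor vectors. This is a textbook-style illustration with assumed hyperparameters and toy ratings, not Hivemall's implementation.

```python
import random


def train_biased_mf(ratings, k=2, eta=0.05, lam=0.02, epochs=200, seed=0):
    """ratings: list of (user, item, rating). Returns a predict(user, item) function."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # mean rating
    bu, bi, P, Q = {}, {}, {}, {}
    for u, i, _ in ratings:
        bu.setdefault(u, 0.0)  # bias for each user
        bi.setdefault(i, 0.0)  # bias for each item
        P.setdefault(u, [rng.gauss(0, 0.1) for _ in range(k)])
        Q.setdefault(i, [rng.gauss(0, 0.1) for _ in range(k)])

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))

    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # lam * (...) terms are the regularization part of the criterion
            bu[u] += eta * (err - lam * bu[u])
            bi[i] += eta * (err - lam * bi[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += eta * (err * qi - lam * pu)
                Q[i][f] += eta * (err * pu - lam * qi)
    return predict


ratings = [("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0), ("u2", "i3", 1.0)]
predict = train_biased_mf(ratings)
rmse = (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5
```

The returned `predict` function can also score user/item pairs that were not rated together, for example `predict("u1", "i3")`, which is how the model is used for recommendation.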

Training of Matrix Factorization

Support iterative training using local disk cache

Prediction of Matrix Factorization


Ø Algorithm is different
  Spark: ALS-WR (considers regularization)
  Hivemall: Biased-MF (considers regularization and biases)

Ø Usability
  Spark: 100+ lines of Scala code
  Hivemall: SQL

Ø Prediction Accuracy
  Almost the same on the MovieLens 10M dataset


Comparison to Spark MLlib

rowid features

1 ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]



Unsupervised Learning: Anomaly Detection

Sensor data etc.

Anomaly detection runs on a series of SQL queries



Anomalies in Sensor Data

Source: https://codeiq.jp/q/207

Image Source: https://en.wikipedia.org/wiki/Local_outlier_factor

Local Outlier Factor (LOF)

Basic idea of LOF: comparing the local density of a point with the densities of its neighbors
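The density comparison can be made concrete with a small in-memory sketch. It is a simplified LOF that uses exactly k nearest neighbours per point (the standard definition also admits ties at the k-distance), it assumes no duplicate points, and it is written in Python for illustration rather than as the series of SQL queries used in the demo. It requires Python 3.8+ for `math.dist`.

```python
import math


def local_outlier_factor(points, k=2):
    """Simplified LOF: ratio of the neighbours' local density to a point's own."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point (excluding itself) and its k-distance
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
           for i in range(n)]
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]

    def lrd(i):
        # local reachability density: inverse of the mean reachability distance
        reach = [max(k_dist[j], dist[i][j]) for j in knn[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    return [sum(density[j] for j in knn[i]) / (k * density[i]) for i in range(n)]


# Four points on a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = local_outlier_factor(points)
```

Points inside the cluster score about 1 because their density matches that of their neighbours, while the isolated point scores well above 1.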


DEMO: Local Outlier Factor

rowid features





RandomForest in Hivemall v0.4

Ensemble of Decision Trees

Already available on a development (smile) branch, and its usage is explained in the project wiki


Training of RandomForest

Out-of-bag tests and Variable Importance



Prediction of RandomForest


Jupyter Integration

DEMO

Conclusion and Takeaway

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs

Ø For SQL users that need ML
Ø For those already using Hive
Ø Ease of use and scalability in mind

Does not require coding, packaging, compiling, or introducing a new programming language or APIs.

Hivemall’s Positioning


v0.4 will make a developmental leap

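At the 1st meetup on 5/12, Freakout and Scaleout presented their use cases

At the 2nd meetup on Tue. 10/20, OISIX and Livesense will present their use cases

Registration opens soon on dots


Announcement: Hivemall meetup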


Beyond Query-as-a-Service!

We Open-source! We invented ..

We are hiring machine learning engineers!
