Talk about Hivemall at Data Scientist Organization on 2015/09/17


Transcript of Talk about Hivemall at Data Scientist Organization on 2015/09/17

Page 1: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Introduction to Machine Learning using Hivemall

Research Engineer Makoto YUI @myui

<[email protected]>

Page 2: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Who am I?

Ø 2015.04 Joined Treasure Data, Inc. 1st Research Engineer in Treasure Data. My mission in TD is developing ML-as-a-Service.

Ø 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale Machine Learning project and Parallel Databases.

Ø 2009.03 Ph.D. in Computer Science from NAIST.

Ø Super programmer award from the MITOU Foundation. Super creators in TD: Sada Furuhashi, Keisuke Nishida.

Page 3: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Agenda

1. What is Hivemall

2. Why Hivemall (motivations etc.)

3. Hivemall Internals

4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)

Page 4: Talk about Hivemall at Data Scientist Organization on 2015/09/17

What is Hivemall

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

https://github.com/myui/hivemall
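For readers who want to try it, a minimal sketch of registering Hivemall in a Hive session, assuming the hivemall-with-dependencies jar and the define-all.hive script distributed with the release (paths are illustrative):

  -- load the Hivemall UDFs/UDTFs into the current Hive session
  add jar ./hivemall-with-dependencies.jar;
  source ./define-all.hive;

  -- sanity check
  select hivemall_version();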

Page 5: Talk about Hivemall at Data Scientist Organization on 2015/09/17

What is Hivemall

Machine Learning: Hivemall
Query Processing: Hive / Pig
Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing, MR v2)
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS

Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2

Page 6: Talk about Hivemall at Data Scientist Organization on 2015/09/17

MapReduce vs. DAG engine (Tez / Spark)

[Figure: In MapReduce, every map/reduce stage reads its input from and writes its output to HDFS; a DAG engine (Tez / Spark) chains the stages directly, with no intermediate DFS reads/writes!]

Page 7: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Won IDG's InfoWorld Bossie Awards 2014: The best open source big data tools (InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem).

bit.ly/hivemall-award

Page 8: Talk about Hivemall at Data Scientist Organization on 2015/09/17

List of Features in Hivemall v0.3.2

Classification (both binary- and multi-class)
• Perceptron
• Passive Aggressive (PA)
• Confidence Weighted (CW)
• Adaptive Regularization of Weight Vectors (AROW)
• Soft Confidence Weighted (SCW)
• AdaGrad+RDA

Regression
• Logistic Regression (SGD)
• PA Regression
• AROW Regression
• AdaGrad
• AdaDELTA

kNN and Recommendation
• Minhash and b-Bit Minhash (LSH variant)
• Similarity Search using k-NN (Euclid/Cosine/Jaccard/Angular)
• Matrix Factorization

Feature engineering
• Feature Hashing
• Feature Scaling (normalization, z-score)
• TF-IDF vectorizer
• Polynomial Expansion

Anomaly Detection
• Local Outlier Factor

Treasure Data supports Hivemall v0.3.2-3

Page 9: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Algorithms and classification accuracy on news20.binary (higher is better):

Perceptron                                0.9460
Passive-Aggressive (a.k.a. Online-SVM)    0.9604
LibLinear                                 0.9636
LibSVM/TinySVM                            0.9643
Confidence Weighted (CW)                  0.9656
AROW [1]                                  0.9660
SCW [2]                                   0.9662

CW variants are very smart online ML algorithms.

Hivemall supports state-of-the-art online learning algorithms (for classification and regression).

Page 10: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Why are CW variants so good?

Suppose a binary classification setting to classify sentences as positive or negative → learn a weight for each word (each word is a feature).

Label: Positive / Feature Vector: "I like this author"
Label: Negative / Feature Vector: "I like this author, but found this book dull"

A naïve update will reduce both W_like and W_dull at the same rate.
CW variants adjust the weights at different rates.

Page 11: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Why are CW variants so good?

[Figure: A conventional learner adjusts only a weight; a CW learner adjusts both the weight and its confidence (covariance), so the same observation can move a low-confidence weight a lot (e.g., 0.6 → 0.8) while a high-confidence weight stays near 0.5.]
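For reference, a sketch of the AROW update from Crammer et al. (one of the CW variants listed earlier): the model keeps a mean weight vector \mu and a covariance \Sigma (implementations typically keep only its diagonal, i.e., one variance per feature), and for an example (\mathbf{x}_t, y_t) that violates the margin it applies

\beta_t = \frac{1}{\mathbf{x}_t^{\top} \Sigma_{t-1} \mathbf{x}_t + r}, \qquad
\alpha_t = \max\left(0,\; 1 - y_t\, \mathbf{x}_t^{\top} \boldsymbol{\mu}_{t-1}\right) \beta_t

\boldsymbol{\mu}_t = \boldsymbol{\mu}_{t-1} + \alpha_t\, y_t\, \Sigma_{t-1} \mathbf{x}_t, \qquad
\Sigma_t = \Sigma_{t-1} - \beta_t\, \Sigma_{t-1} \mathbf{x}_t \mathbf{x}_t^{\top} \Sigma_{t-1}

so a frequently seen feature (small variance, high confidence) moves little, while a rarely seen feature (large variance) such as W_dull gets a large correction.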

Page 12: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Features to be supported from Hivemall v0.4

1. RandomForest
• classification, regression

2. Gradient Tree Boosting
• classification, regression

3. Factorization Machine
• classification, regression (factorization)

4. Online LDA
• topic modeling, clustering

v0.4 is planned for release in October.

Gradient Boosting and Factorization Machines are often used by data science competition winners (very important for practitioners).

Page 13: Talk about Hivemall at Data Scientist Organization on 2015/09/17


Factorization Machine

Matrix Factorization

Page 14: Talk about Hivemall at Data Scientist Organization on 2015/09/17


Factorization Machine

Context information (e.g., time) can be considered

Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf

Page 15: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Factorization Machine

Factorization model with degree = 2 (2-way interactions):

\hat{y}(\mathbf{x}) = w_0 + \sum_{j=1}^{n} w_j x_j + \sum_{j=1}^{n} \sum_{j'=j+1}^{n} \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle \, x_j x_{j'}

where w_0 is the global bias, w_j is the regression coefficient of the j-th variable, and each pairwise interaction is modeled by the factorized term \langle \mathbf{v}_j, \mathbf{v}_{j'} \rangle, the inner product of k-dimensional latent vectors.

Page 16: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Industry use cases of Hivemall

Ø CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc. and more

Ø Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.

Ø Churn Detection
• Algorithm: Regression
• OISIX and more

Ø Item/User recommendation
• Algorithm: Recommendation (Matrix Factorization / kNN)
• Wish.com, an ad-tech company, a real-estate portal, and more

Ø Value prediction of real estates
• Algorithm: Regression
• Livesense

Page 17: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Agenda

1. What is Hivemall

2. Why Hivemall (motivations etc.)

3. Hivemall Internals

4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)

Page 18: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Why Hivemall

1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.

2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.

That's why I built Hivemall.

Page 19: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Data Moving in Data Analytics

Data Collection Data Lake Data Processing Data Mart

Amazon S3Amazon EMR

Redshift

Amazon RDS

Event D

ata

Insig

hts and De

cisions

Data Analysis

Data Engineer Data Scientist Data Engineer2014/09/17 Talk@Japan DataScientist Society 19

Page 20: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Data Moving in Data Analytics

What Data Scientists actually do vs. what Data Scientists should do

Hive is a great data preprocessing tool due to its ease of use, efficiency, and scalability for joins, filtering, and selection (data preprocessing).

Page 21: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS:

[Figure: Raw data on HDFS/S3 → Extract-Transform-Load → feature-vector file (e.g., height:173cm, weight:60kg, age:34, gender:man) → Machine Learning]

Page 22: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS:

[Figure: the same ETL pipeline as the previous slide]

The ETL step requires expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory).

Page 23: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS:

[Figure: the same ETL pipeline]

The ML tools do not scale, and you have to learn R/Python APIs.

Page 24: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How I used to do ML before Hivemall

Given raw data stored on Hadoop HDFS:

[Figure: the same ETL pipeline]

This workflow did not meet my needs in terms of scalability, ML algorithms, and usability. What I wanted: a scalable SQL query for ML.

Page 25: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Survey of existing ML frameworks

Framework: User interface
Mahout: Java API programming
Spark MLlib/MLI: Scala API programming, Scala shell (REPL)
H2O: R programming, GUI
Cloudera Oryx: HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming): C++ API programming, command line

Existing distributed machine learning frameworks are NOT easy to use.

Page 26: Talk about Hivemall at Data Scientist Organization on 2015/09/17

People are saying that ..

Motivation: Machine learning needs to be easier for developers (esp. data engineers)!

Page 27: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Hivemall's Vision: ML on SQL

Classification with Mahout requires Java API programming; in Hivemall it is a single SQL query:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight            -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train                       -- map-only task
) t
GROUP BY feature;                  -- shuffled to reducers

Machine Learning made easy for SQL developers (ML for the rest of us). Interactive and stable APIs with SQL abstraction.

This SQL query automatically runs in parallel on Hadoop.

Page 28: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Agenda

1. What is Hivemall

2. Why Hivemall (motivations etc.)

3. Hivemall Internals

4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)

Page 29: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How Hivemall works in training

Machine learning algorithms are implemented as User-Defined Table generating Functions (UDTFs). A UDTF is a function that returns a relation.

[Figure: rows of the training table, tuples of <label, array<features>> such as (+1, <1,2>) and (-1, <1,3,9>), are fed to parallel train UDTF tasks; their outputs, tuples of <feature, weight>, are shuffled by feature and parameter-mixed into the prediction model, a relation of <feature, weight>.]

The resulting prediction model is a relation of each feature and its weight. The number of mappers and reducers is configurable. Parallelism is powerful.

Page 30: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Why not UDAF?

Machine learning could also be written as an aggregate function: partial results (array<weight>, plus array<sum of weight> and array<count> for averaging) from parallel train operators are merged in a tree, e.g. 4 train ops in parallel, then 2 merge ops in parallel, then a final merge with no parallelism.

[Figure: training table → parallel train tasks → merge → final merge → prediction model]

This creates a bottleneck in the final merge: throughput is limited by its fan-out, memory consumption grows, and parallelism decreases.

Page 31: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Problem that I faced: Iterations

Iterations are mandatory to get a good prediction model.
• However, MapReduce is not suited for iterations because the IN/OUT of each MR job goes through HDFS.
• Spark avoids this by in-memory computation.

[Figure: In MapReduce, each iteration reads its input from and writes its output to HDFS; in Spark, the input is loaded once and iterations 1, 2, ... run in memory.]

Page 32: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Training with Iterations in Spark

Logistic Regression example of Spark:

val data = spark.textFile(...).map(readPoint).cache()   // each node loads its data into memory once

for (i <- 1 to ITERATIONS) {
  // repeated MapReduce steps to do gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

This is just a toy example! Why? The input to the gradient computation should be shuffled for each iteration (without it, more iterations are required).

Page 33: Talk about Hivemall at Data Scientist Organization on 2015/09/17

What does MLlib actually do?

Mini-batch Gradient Descent with Sampling (GradientDescent.scala, bit.ly/spark-gd):

val data = ..

for (i <- 1 to numIterations) {
  val sampled = ..  // sample a subset of the data (partitioned RDD)
  val gradient = .. // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}

Iterations are mandatory for convergence because each iteration uses only a small fraction of the data.

Page 34: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Alternative Approach in Hivemall

Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps.

SET hivevar:xtimes=3;

CREATE VIEW training_x3
AS
SELECT *
FROM (
  SELECT amplify(${xtimes}, *) as (rowid, label, features)
  FROM training
) t
CLUSTER BY rand();
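The amplified view can then be fed to the same training pattern used elsewhere in this talk; a sketch reusing logress and model averaging on training_x3:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label) as (feature, weight)
  FROM training_x3
) t
GROUP BY feature;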

Page 35: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Map-only shuffling and amplifying

The rand_amplify UDTF randomly shuffles the input rows within each map task.

CREATE VIEW training_x3
AS
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
FROM
  training;

Page 36: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Detailed plan w/ map-local shuffle

[Figure: each map task runs Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write; the map outputs are shuffled (distributed by feature) to reduce tasks that run Merge → Aggregate → Reduce write.]

Scanned entries are amplified and then shuffled. Note this is a pipelined operation: the Rand Amplifier operator is interleaved between the table scan and the training operator.

Page 37: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Performance effects of amplifiers

Method                                            Elapsed time (sec)   AUC
Plain                                             89.718               0.734805
amplifier + clustered by (a.k.a. global shuffle)  479.855              0.746214
rand_amplifier (a.k.a. map-local shuffle)         116.424              0.743392

With the map-local shuffle, prediction accuracy improved with an acceptable overhead.

Page 38: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Agenda

1. What is Hivemall

2. Why Hivemall (motivations etc.)

3. Hivemall Internals

4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)

Page 39: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall

[Figure: the machine-learning workflow: feature vectors and labels → Training → Prediction Model → Prediction → labels. This slide highlights the data preparation step.]

Page 40: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall - Data preparation

Define a Hive table for training/testing data:

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/dataset/E2006-tfidf/train';
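For illustration, each row of such a table is a row id, a label, and an array of "index:value" feature strings; a hypothetical example row (the values are made up):

rowid  label   features
1      -3.89   ["10:0.047","29:0.113","1165:0.251"]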

Page 41: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall

[Figure: the same workflow; this slide highlights the feature engineering step.]

Page 42: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall - Feature Engineering

Applying min-max feature normalization: transforming the label value to a value between 0.0 and 1.0.

create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
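The ${min_label} and ${max_label} variables can be obtained with an ordinary aggregate beforehand; a sketch (the numbers below are illustrative placeholders, not real values):

select min(target), max(target) from e2006tfidf_train;

set hivevar:min_label=-7.9;   -- illustrative value taken from the query above
set hivevar:max_label=-0.5;   -- illustrative value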

Page 43: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall

[Figure: the same workflow; this slide highlights the training step.]

Page 44: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall - Training

Training by logistic regression:

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight            -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train                       -- map-only task to learn a prediction model
) t
GROUP BY feature;                  -- map outputs are shuffled to reducers by feature
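Because the resulting model is an ordinary Hive table, it can be inspected with plain SQL; for example, a sketch listing the features with the largest positive weights:

SELECT feature, weight
FROM lr_model
ORDER BY weight DESC
LIMIT 10;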

Page 45: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall - Training

Training a Confidence Weighted classifier:

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight        -- vote whether to use negative or positive weights for averaging
                                     -- (e.g., +0.7, +0.3, +0.2, -0.1, +0.7)
FROM (
  SELECT
    train_cw(features, label) as (feature, weight)   -- training for the CW classifier
  FROM
    news20b_train
) t
GROUP BY feature;

Page 46: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Ensemble learning for stable prediction performance

Just stack prediction models by UNION ALL:

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features), label) as (label, feature, weight)
  from
    news20mc_train_x3
) t
group by label, feature;

Page 47: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall

[Figure: the same workflow; this slide highlights the prediction step.]

Page 48: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall - Prediction

Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model. There is no need to load the entire model into memory.

CREATE TABLE lr_predict
AS
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid;
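The testing_exploded relation above is just the test table with its features array flattened to one row per (rowid, feature); a sketch assuming Hivemall's extract_feature/extract_weight helpers (for weighted "index:value" features the join would also multiply m.weight by this value):

CREATE VIEW testing_exploded
AS
SELECT
  t.rowid,
  extract_feature(fv) as feature,
  extract_weight(fv) as value
FROM
  testing t LATERAL VIEW explode(features) t2 AS fv;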

Page 49: Talk about Hivemall at Data Scientist Organization on 2015/09/17

How to use Hivemall

[Figure: the same workflow, split across systems: batch training on Hadoop produces a prediction model, which is exported for online prediction on an RDBMS.]

Page 50: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Real-­‐time Prediction on Treasure Data

[Figure: a batch training job runs periodically on Treasure Data; the resulting model is periodically exported to an RDBMS, which serves real-time predictions.]
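On the RDBMS side the same join can be written in vanilla SQL; a minimal sketch, assuming the exported model lives in a MySQL table lr_model(feature, weight) and the features of one input record are in a table input_features(feature):

SELECT 1.0 / (1.0 + EXP(-SUM(m.weight))) AS prob
FROM input_features f
LEFT OUTER JOIN lr_model m ON (f.feature = m.feature);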

Page 51: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Agenda

1. What is Hivemall

2. Why Hivemall (motivations etc.)

3. Hivemall Internals

4. How to use Hivemall
• Logistic regression (RDBMS integration)
• Matrix Factorization
• Anomaly Detection (demo)
• Random Forest (demo)

Page 52: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Supervised Learning: Recommendation

Rating prediction of a matrix

Can be applied for user/item recommendation

Page 53: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Matrix Factorization

Factorize a matrix into a product of matrices having k latent factors

Page 54: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Matrix Factorization

Criteria of Biased MF:

\min_{P, Q, b} \sum_{(u,i) \in R} \left( r_{u,i} - \mu - b_u - b_i - \mathbf{p}_u^{\top} \mathbf{q}_i \right)^2 + \lambda \left( \lVert \mathbf{p}_u \rVert^2 + \lVert \mathbf{q}_i \rVert^2 + b_u^2 + b_i^2 \right)

where \mu is the mean rating, b_u and b_i are the biases for each user/item, \mathbf{p}_u^{\top} \mathbf{q}_i is the factorization term, and the \lambda term is the regularization.

Page 55: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Training of Matrix Factorization

Iterative training is supported using a local disk cache.
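The training query on this slide is roughly of the following form (a sketch based on the Hivemall MovieLens example; the train_mf_sgd function, its option string, and the output column names are assumptions that may differ between versions):

CREATE TABLE mf_model AS
SELECT
  idx,
  array_avg(u_rank) as Pu,
  array_avg(m_rank) as Qi,
  avg(u_bias) as Bu,
  avg(m_bias) as Bi
FROM (
  SELECT
    train_mf_sgd(userid, itemid, rating, '-factor 10')
      as (idx, u_rank, m_rank, u_bias, m_bias)
  FROM
    training
) t
GROUP BY idx;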

Page 56: Talk about Hivemall at Data Scientist Organization on 2015/09/17


Prediction of Matrix Factorization

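Prediction then joins the model twice, once for the user side and once for the item side; a sketch under the same assumptions (mf_predict combines the latent factors, the biases, and the mean rating ${mu}):

SELECT
  t.userid,
  t.itemid,
  mf_predict(p.Pu, q.Qi, p.Bu, q.Bi, ${mu}) as predicted_rating
FROM
  testing t
  LEFT OUTER JOIN mf_model p ON (t.userid = p.idx)
  LEFT OUTER JOIN mf_model q ON (t.itemid = q.idx);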

Page 57: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Comparison to Spark MLlib

Ø Algorithm is different
Spark: ALS-WR (considers regularization)
Hivemall: Biased-MF (considers regularization and biases)

Ø Usability
Spark: 100+ lines of Scala coding
Hivemall: SQL

Ø Prediction accuracy
Almost the same for the MovieLens 10M dataset

Page 58: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Unsupervised Learning: Anomaly Detection

Sensor data etc.:

rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]

Anomaly detection is implemented as a series of SQL queries.

Page 59: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Anomalies in Sensor Data

Source: https://codeiq.jp/q/207

Page 60: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Local Outlier Factor (LOF)

Basic idea of LOF: compare the local density of a point with the densities of its neighbors.

Image source: https://en.wikipedia.org/wiki/Local_outlier_factor
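For reference, the quantities being computed are the standard ones (k = number of neighbors, N_k(A) = the k-nearest neighbors of A):

\mathrm{reach\text{-}dist}_k(A, B) = \max\{\, k\text{-distance}(B),\; d(A, B) \,\}

\mathrm{lrd}_k(A) = \left( \frac{\sum_{B \in N_k(A)} \mathrm{reach\text{-}dist}_k(A, B)}{|N_k(A)|} \right)^{-1}

\mathrm{LOF}_k(A) = \frac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \frac{\mathrm{lrd}_k(B)}{\mathrm{lrd}_k(A)}

A score near 1 means the point is about as dense as its neighbors; a score well above 1 flags an outlier.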

Page 61: Talk about Hivemall at Data Scientist Organization on 2015/09/17

DEMO: Local Outlier Factor

rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]

Page 62: Talk about Hivemall at Data Scientist Organization on 2015/09/17


RandomForest in Hivemall v0.4

Ensemble of Decision Trees

Already available on a development (smile) branch, and its usage is explained in the project wiki.

Page 63: Talk about Hivemall at Data Scientist Organization on 2015/09/17


Training of RandomForest
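The training query shown on this slide is roughly of the following form (a sketch; train_randomforest_classifier, its '-trees 50' option, and the output column names follow the project wiki mentioned above and are assumptions here):

CREATE TABLE rf_model AS
SELECT
  train_randomforest_classifier(features, label, '-trees 50')
    as (model_id, model_type, pred_model, var_importance, oob_errors, oob_tests)
FROM
  training;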

Page 64: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Out-of-bag tests and Variable Importance

Page 65: Talk about Hivemall at Data Scientist Organization on 2015/09/17


Prediction of RandomForest
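Prediction aggregates the per-tree votes; a sketch under the same assumptions (tree_predict scores one row with one tree, and rf_ensemble combines the per-tree results by voting):

CREATE TABLE rf_predicted AS
SELECT
  rowid,
  rf_ensemble(predicted) as predicted
FROM (
  SELECT
    t.rowid,
    tree_predict(m.model_id, m.model_type, m.pred_model, t.features, true) as predicted
  FROM
    rf_model m CROSS JOIN testing t
) t1
GROUP BY rowid;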

Page 66: Talk about Hivemall at Data Scientist Organization on 2015/09/17


Jupyter Integration

DEMO

Page 67: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Conclusion and Takeaway

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs.

Ø For SQL users that need ML
Ø For those already using Hive
Ø Designed with ease of use and scalability in mind

It does not require coding, packaging, compiling, or introducing a new programming language or APIs.

Hivemall's Positioning

v0.4 will make a developmental leap

Page 68: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Announcement: Hivemall meetup

At the first meetup on 5/12, Freakout and Scaleout presented their use cases.

At the second meetup on Tuesday 10/20, OISIX and Livesense will present their use cases.

Registration will open soon on dots.

Page 69: Talk about Hivemall at Data Scientist Organization on 2015/09/17

Beyond Query-as-a-Service!

We open-source! We invented ..

We are hiring machine learning engineers!