Talk about Hivemall at Data Scientist Organization on 2015/09/17
Introduction to Machine Learning using Hivemall
Research Engineer Makoto YUI @myui
<myui@treasure-data.com>
2014/09/17 Talk@Japan DataScientist Society 1
Who am I?
• 2015.04 Joined Treasure Data, Inc.
  1st Research Engineer in Treasure Data
  My mission in TD is developing ML-as-a-Service
• 2010.04-2015.03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan.
  Worked on a large-scale Machine Learning project and Parallel Databases
• 2009.03 Ph.D. in Computer Science from NAIST
• Super programmer award from the MITOU Foundation
  Super creators in TD: Sada Furuhashi, Keisuke Nishida
Agenda
1. What is Hivemall
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
  • Logistic regression (RDBMS integration)
  • Matrix Factorization
  • Anomaly Detection (demo)
  • Random Forest (demo)
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
https://github.com/myui/hivemall
What is Hivemall (software stack, top to bottom)
• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MR v1), Apache Tez (DAG processing), MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS
MapReduce and DAG engine
[Figure: a chain of MapReduce jobs, whose map (M) and reduce (R) stages read from and write to HDFS between jobs, compared with a DAG engine (Tez / Spark)]
DAG engine (Tez / Spark): No intermediate DFS reads/writes!
Won IDG’s InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem
bit.ly/hivemall-award
List of Features in Hivemall v0.3.2
Classification (both binary- and multi-class)
  • Perceptron
  • Passive Aggressive (PA)
  • Confidence Weighted (CW)
  • Adaptive Regularization of Weight Vectors (AROW)
  • Soft Confidence Weighted (SCW)
  • AdaGrad+RDA
Regression
  • Logistic Regression (SGD)
  • PA Regression
  • AROW Regression
  • AdaGrad
  • AdaDELTA
kNN and Recommendation
  • Minhash and b-Bit Minhash (LSH variant)
  • Similarity Search using K-NN (Euclid/Cosine/Jaccard/Angular)
  • Matrix Factorization
Feature engineering
  • Feature Hashing
  • Feature Scaling (normalization, z-score)
  • TF-IDF vectorizer
  • Polynomial Expansion
Anomaly Detection
  • Local Outlier Factor
Treasure Data supports Hivemall v0.3.2-3
List of Features in Hivemall
Classification accuracy on News20.binary (higher is better)
Algorithm                                 Accuracy
Perceptron                                0.9460
Passive-Aggressive (a.k.a. Online-SVM)    0.9604
LibLinear                                 0.9636
LibSVM/TinySVM                            0.9643
Confidence Weighted (CW)                  0.9656
AROW [1]                                  0.9660
SCW [2]                                   0.9662
CW variants are very smart online ML algorithms
Hivemall supports the state-of-the-art online learning algorithms (for classification and regression)
Why are CW variants so good?
Suppose a binary classification setting that classifies sentences as positive or negative
→ learn the weight for each word (each word is a feature)
Label      Feature Vector
Positive   I like this author
Negative   I like this author, but found this book dull
A naïve update will reduce both W_like and W_dull at the same rate
CW variants adjust weights at different rates
Why are CW variants so good?
[Figure: a conventional online learner only adjusts a weight (values shown: 0.6, 0.8), whereas CW variants adjust a weight and its confidence (covariance); annotation: "At this confidence, the weight is 0.5"]
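The contrast between a naïve update and a confidence-aware update can be sketched in Python. This is only an illustration under stated assumptions, not Hivemall's implementation: it uses an AROW-style update with a diagonal covariance, and the function names, the step size `eta`, and the regularization parameter `r` are chosen for the example.

```python
import numpy as np

def naive_update(w, x, y, eta=0.1):
    """Perceptron-style update: every active feature moves by the same step."""
    if y * np.dot(w, x) <= 0.0:
        w = w + eta * y * x
    return w

def arow_update(w, sigma, x, y, r=1.0):
    """AROW-style update; sigma holds one variance (inverse confidence) per feature."""
    margin = y * np.dot(w, x)
    if margin >= 1.0:
        return w, sigma
    v = np.dot(sigma * x, x)                  # x^T Sigma x for a diagonal Sigma
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta
    w_new = w + alpha * y * sigma * x         # confident (low-variance) weights move less
    sigma_new = sigma - beta * (sigma * x) ** 2
    return w_new, sigma_new

# Feature index 0 = "like", index 1 = "dull" (toy vocabulary from the slide).
positive = np.array([1.0, 0.0])   # "I like this author"
negative = np.array([1.0, 1.0])   # "... like ... dull"

w_naive = np.zeros(2)
w_arow, sigma = np.zeros(2), np.ones(2)
for _ in range(5):                # "like" is seen several times in positive sentences
    w_naive = naive_update(w_naive, positive, +1)
    w_arow, sigma = arow_update(w_arow, sigma, positive, +1)

# Weight changes caused by the single negative sentence.
naive_delta = naive_update(w_naive, negative, -1) - w_naive
arow_delta = arow_update(w_arow, sigma, negative, -1)[0] - w_arow
```

In this toy run the naïve learner lowers both weights by the same amount, while the AROW-style learner lowers the weight of the rarely seen word "dull" much more than the weight of the frequently seen word "like".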
Features to be supported from Hivemall v0.4
1. RandomForest
  • classification, regression
2. Gradient Tree Boosting
  • classifier, regression
3. Factorization Machine
  • classification, regression (factorization)
4. Online LDA
  • topic modeling, clustering
Planned to release v0.4 in Oct.
Gradient Boosting and Factorization Machine are often used by data science competition winners (very important for practitioners)
Factorization Machine
[Figure: Matrix Factorization]
Factorization Machine
Context information (e.g., time) can be considered
Source: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
Factorization Machine
Factorization Model with degree=2 (2-way interaction)
[Equation with labeled terms: global bias; regression coefficient of the j-th variable; pairwise interaction; factorization]
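The degree-2 model named above can be sketched in Python. The slide's equation did not survive extraction, so this follows the standard formulation in the cited Rendle (2010) paper: prediction = w0 + sum_j w_j x_j + sum_{j<l} <v_j, v_l> x_j x_l. The function and variable names are assumptions for illustration and do not reflect Hivemall's API.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Degree-2 factorization machine, written directly from the formula.

    w0: global bias, w[j]: regression coefficient of the j-th variable,
    V[j]: k-dimensional factor vector of the j-th variable.
    """
    y = w0 + np.dot(w, x)
    n = len(x)
    for j in range(n):
        for l in range(j + 1, n):
            y += np.dot(V[j], V[l]) * x[j] * x[l]   # pairwise interaction
    return y

def fm_predict(x, w0, w, V):
    """Same prediction using the O(k*n) reformulation of the pairwise term."""
    xv = V.T @ x                                    # shape (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + np.dot(w, x) + pairwise

rng = np.random.default_rng(0)
n_features, k = 6, 3
x = rng.random(n_features)
w0, w, V = 0.1, rng.normal(size=n_features), rng.normal(size=(n_features, k))
y_naive = fm_predict_naive(x, w0, w, V)
y_fast = fm_predict(x, w0, w, V)
```

Both functions compute the same value; the second avoids the double loop, which matters when feature vectors are wide.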
Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc. and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Churn Detection
  • Algorithm: Regression
  • OISIX and more
• Item/User recommendation
  • Algorithm: Recommendation (Matrix Factorization / kNN)
  • Wish.com, Adtech Company, Real-estate Portal, and more
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
Why Hivemall
1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
2. Why not run ML algorithms inside Hive? Fewer components to manage, and more scalable.
That’s why I built Hivemall.
Data Moving in Data Analytics
[Figure: Event Data → Data Collection → Data Lake → Data Processing → Data Mart → Data Analysis → Insights and Decisions; example services shown: Amazon S3, Amazon EMR, Redshift, Amazon RDS; roles shown: Data Engineer, Data Scientist, Data Engineer]
Data Moving in Data Analytics
[Figure: What Data Scientists Actually Do vs. What Data Scientists Should Do]
Hive is a great data preprocessing tool due to its ease of use, efficiency, and scalability for joins, filtering, and selection
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Figure: Raw Data (HDFS / S3) → Extract-Transform-Load → Feature Vector file (height:173cm, weight:60kg, age:34, gender: man, …) → Machine Learning]
• Extract-Transform-Load: need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)
• Machine Learning: does not scale; have to learn R/Python APIs
• Does not meet my needs in terms of scalability, ML algorithms, and usability
• I [symbol lost in extraction] scalable SQL query
Survey on existing ML frameworks
Framework                              User interface
Mahout                                 Java API programming
Spark MLlib/MLI                        Scala API programming, Scala Shell (REPL)
H2O                                    R programming, GUI
Cloudera Oryx                          HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)    C++ API programming, Command Line
Existing distributed machine learning frameworks are NOT easy to use
Motivation: Machine learning needs to be easier for developers (esp. data engineers)!
People are saying that ..
Hivemall’s Vision: ML on SQL
[Figure: Classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
Machine Learning made easy for SQL developers (ML for the rest of us)
Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
How Hivemall works in training
Implemented machine learning algorithms as User-Defined Table-generating Functions (UDTFs)
[Figure: the training table holds rows of tuple<label, array<features>>, e.g. (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>); parallel train tasks (UDTF) each emit tuple<feature, weights>; the output is shuffled by feature to param-mix tasks, which produce the prediction model as Relation<feature, weights>]
• The resulting prediction model is a relation of features and their weights
• The numbers of mappers and reducers are configurable
• A UDTF is a function that returns a relation
Parallelism is Powerful
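The data flow described for training (independent train tasks, a shuffle by feature, then mixing of the weights) can be sketched in Python. This is a toy model under stated assumptions: `train_partition` is a hypothetical perceptron-like stand-in for a training UDTF, and `mix_by_feature` averages weights per feature the way a GROUP BY feature with avg(weight) would; neither is Hivemall code.

```python
def train_partition(rows, eta=0.1):
    """Hypothetical stand-in for a training UDTF running in one map task.

    rows: iterable of (label, features); returns a list of (feature, weight).
    """
    w = {}
    for label, features in rows:
        score = sum(w.get(f, 0.0) for f in features)
        if label * score <= 0:                      # misclassified: update
            for f in features:
                w[f] = w.get(f, 0.0) + eta * label
    return list(w.items())

def mix_by_feature(partial_models):
    """Shuffle by feature and average the weights emitted by all tasks."""
    sums, counts = {}, {}
    for model in partial_models:
        for feature, weight in model:
            sums[feature] = sums.get(feature, 0.0) + weight
            counts[feature] = counts.get(feature, 0) + 1
    return {f: sums[f] / counts[f] for f in sums}

# Two partitions of the training table, using the rows shown on the slide.
partition1 = [(+1, [1, 2]), (+1, [1, 7, 9])]
partition2 = [(-1, [1, 3, 9]), (+1, [3, 8])]
model = mix_by_feature([train_partition(partition1), train_partition(partition2)])
```

Each partition is trained without talking to the others, so the train step parallelizes freely, and the mixing step parallelizes per feature.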
Why not UDAF
Machine learning as an aggregate function
[Figure: train tasks read rows of tuple<label, array<features>> from the training table and each emit array<weight>; merge steps combine them as array<sum of weight>, array<count>; a final merge produces the prediction model; parallelism shrinks from 4 ops in parallel, to 2 ops in parallel, to no parallelism]
• Bottleneck in the final merge: throughput is limited by its fan-out
• Memory consumption grows
• Parallelism decreases
Problem that I faced: Iterations
Iterations are mandatory to get a good prediction model
• However, MapReduce is not suited for iterations because the input and output of each MR job go through HDFS
• Spark avoids this by in-memory computation
[Figure: MapReduce: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .; Spark: Input → iter. 1 → iter. 2]
Training with Iterations in Spark
Logistic Regression example of Spark
// Each node loads the data into memory once
val data = spark.textFile(...).map(readPoint).cache()
// Repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
This is just a toy example! Why?
Input to the gradient computation should be shuffled for each iteration (without it, more iterations are required)
What does MLlib actually do?
Mini-batch Gradient Descent with Sampling
val data = ..
for (i <- 1 to numIterations) {
  val sampled =   // sample a subset of the data (partitioned RDD)
  val gradient =  // average the subgradients over the sampled data using Spark MapReduce
  w -= gradient
}
Iterations are mandatory for convergence because each iteration uses only a small fraction of the data
GradientDescent.scala
bit.ly/spark-gd
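Mini-batch gradient descent with sampling, as outlined above, can be sketched in Python with NumPy. This is a single-machine illustration of the idea and not MLlib's code; the function names, step size, and sampling fraction are assumptions for the example, and the gradient is the same logistic-loss gradient used in the Spark toy example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, w):
    """Mean logistic loss for labels in {-1, +1}."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def minibatch_gd(X, y, num_iterations=100, step=0.5, fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_iterations):
        mask = rng.random(len(y)) < fraction        # sample a subset of the data
        if not mask.any():
            continue
        Xs, ys = X[mask], y[mask]
        # average the per-example gradients over the sampled rows
        grad = ((sigmoid(ys * (Xs @ w)) - 1.0) * ys) @ Xs / mask.sum()
        w -= step * grad
    return w

# Toy data: the label is the sign of x0 + x1.
data_rng = np.random.default_rng(42)
X = data_rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(X.sum(axis=1) > 0, 1.0, -1.0)
w = minibatch_gd(X, y)
```

Because every iteration sees only about a fifth of the rows, a single pass is far from enough; the loop has to run many times before the loss settles.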
Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT
  *
FROM (
  SELECT
    amplify(${xtimes}, *) as (rowid, label, features)
  FROM
    training
) t
CLUSTER BY rand();
Map-only shuffling and amplifying
rand_amplify UDTF randomly shuffles the input rows for each Map task
CREATE VIEW training_x3
as
SELECT
  rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
FROM
  training;
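The effect of amplifying and map-local shuffling can be sketched in Python. This is an illustration of the behavior described on the slides, not Hivemall's implementation; in particular, the buffering policy (shuffle and flush whenever the buffer is full) is an assumption.

```python
import random

def amplify(rows, xtimes):
    """Emit every input row xtimes times, in input order."""
    for row in rows:
        for _ in range(xtimes):
            yield row

def rand_amplify(rows, xtimes, buffer_size, seed=0):
    """Amplify rows and shuffle them inside a bounded buffer (one map task)."""
    rng = random.Random(seed)
    buf = []
    for row in amplify(rows, xtimes):
        buf.append(row)
        if len(buf) >= buffer_size:
            rng.shuffle(buf)
            yield from buf
            buf.clear()
    rng.shuffle(buf)            # flush whatever is left at the end
    yield from buf

rows = list(range(5))
plain = list(amplify(rows, 3))
shuffled = list(rand_amplify(rows, 3, buffer_size=4))
```

The shuffled stream contains exactly the same rows as the amplified one, only in a different order, and no cluster-wide shuffle is needed to produce it.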
Detailed plan w/ map-local shuffle
[Figure: each Map task runs Table scan → Rand Amplifier → Logress UDTF → Partial aggregate → Map write; the map output is shuffled (distributed by feature) to Reduce tasks, which run Merge → Aggregate → Reduce write]
• Scanned entries are amplified and then shuffled. Note this is a pipeline op.
• The Rand Amplifier operator is interleaved between the table scan and the training operator
Performance effects of amplifiers
Method                                             Elapsed time (sec)   AUC
Plain                                              89.718               0.734805
amplifier + clustered by (a.k.a. global shuffle)   479.855              0.746214
rand_amplifier (a.k.a. map-local shuffle)          116.424              0.743392
With the map-local shuffle, prediction accuracy improved with an acceptable overhead
How to use Hivemall
[Figure: workflow. Training: labels and feature vectors are fed to machine learning, which outputs a prediction model. Prediction: the prediction model takes feature vectors and outputs labels. Highlighted step: Data preparation]
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall
[Figure: the same training/prediction workflow. Highlighted step: Feature Engineering]
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
Transforming a label value to a value between 0.0 and 1.0
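Min-max normalization maps a value v with known bounds to (v - min) / (max - min). A minimal Python sketch of that arithmetic follows; it mirrors what the slide says rescale does, while the handling of the degenerate case min == max is an assumption made for this example.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: maps [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: reject a degenerate range.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)

scaled = [rescale(v, -4.0, 6.0) for v in (-4.0, 1.0, 6.0)]
```

The minimum maps to 0.0, the maximum to 1.0, and the midpoint to 0.5.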
How to use Hivemall
[Figure: the same training/prediction workflow. Highlighted step: Training]
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..) as (feature,weight) -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature; -- shuffle map outputs to reducers by feature
How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight -- vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
FROM (
  SELECT
    train_cw(features,label) as (feature,weight) -- training for the CW classifier
  FROM
    news20b_train
) t
GROUP BY feature;
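The voting idea behind the weight aggregation can be sketched in Python. This follows only the slide's one-line description (vote on the sign, then average the weights with the winning sign); the tie-breaking rule and the treatment of zero weights are assumptions, and the actual voted_avg UDAF may differ in those details.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption for this sketch: ties go to the positive side.
    winners = positives if len(positives) >= len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0

# Weights of one feature as emitted by five training tasks (values from the slide).
mixed = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

For the slide's example the four positive weights win the vote, so the result is (0.7 + 0.3 + 0.2 + 0.7) / 4 = 0.475, and the single negative outlier is ignored.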
Ensemble learning for stable prediction performance
Just stack prediction models by union all
create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select
    train_multiclass_cw(addBias(features),label) as (label,feature,weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_arow(addBias(features),label) as (label,feature,weight)
  from
    news20mc_train_x3
  union all
  select
    train_multiclass_scw(addBias(features),label) as (label,feature,weight)
  from
    news20mc_train_x3
) t
group by label, feature;
How to use Hivemall
[Figure: the same training/prediction workflow. Highlighted step: Prediction]
How to use Hivemall - Prediction
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid;
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
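What the join-based prediction computes can be sketched in Python with plain dictionaries. This is an illustration only: `testing_exploded` is modeled as (rowid, feature) pairs and the model as a feature-to-weight mapping. One simplification is an assumption of this sketch: features missing from the model contribute 0.0, whereas in SQL an unmatched row yields NULL, which sum() skips.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(testing_exploded, model):
    """Join test rows to the model by feature, then sum weights per rowid."""
    totals = {}
    for rowid, feature in testing_exploded:
        totals[rowid] = totals.get(rowid, 0.0) + model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in totals.items()}

lr_model = {"a": 1.0, "b": -1.0}
testing_exploded = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen")]
probs = predict(testing_exploded, lr_model)
```

Row 1 sums to 0.0 and gets probability 0.5; row 2 sums to 1.0 and gets sigmoid(1.0), roughly 0.73.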
How to use Hivemall
[Figure: the same workflow with batch training on Hadoop and online prediction on an RDBMS. Highlighted step: Export prediction models]
Real-time Prediction on Treasure Data
[Figure: run a batch training job periodically → periodic export of the prediction model → real-time prediction on an RDBMS]
Supervised Learning: Recommendation
Rating prediction of a Matrix
Can be applied to user/item recommendation
Matrix Factorization
Factorize a matrix into a product of matrices having k latent factors
Matrix Factorization
Criteria of Biased MF
[Equation with labeled terms: mean rating; bias for each user/item; factorization; regularization]
Training of Matrix Factorization
Support iterative training using local disk cache
Prediction of Matrix Factorization
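Biased matrix factorization, with the terms named above (mean rating, user and item biases, latent factors, regularization), can be sketched in Python. The slide's equations did not survive extraction, so this uses the standard formulation: predicted rating = mu + b_u + b_i + p_u · q_i, trained by SGD on the regularized squared error. All names and hyperparameters are assumptions for illustration, not Hivemall's API or defaults.

```python
import numpy as np

def mf_predict(mu, bu, bi, P, Q, u, i):
    """Mean rating + user bias + item bias + inner product of latent factors."""
    return mu + bu[u] + bi[i] + P[u] @ Q[i]

def train_biased_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01,
                    epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):                  # iterative training over the same ratings
        for u, i, r in ratings:
            err = r - mf_predict(mu, bu, bi, P, Q, u, i)
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()                 # keep the old value for the Q update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def rmse(ratings, mu, bu, bi, P, Q):
    errs = [r - mf_predict(mu, bu, bi, P, Q, u, i) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# (user, item, rating) triples: a tiny made-up example.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 2.0)]
mean_rating = float(np.mean([r for _, _, r in ratings]))
baseline = rmse(ratings, mean_rating, np.zeros(3), np.zeros(2),
                np.zeros((3, 2)), np.zeros((2, 2)))
trained = rmse(ratings, *train_biased_mf(ratings, n_users=3, n_items=2))
```

Here `baseline` is the error of predicting the mean rating everywhere; the trained model fits the observed ratings more closely.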
Comparison to Spark MLlib
• Algorithm is different
  Spark: ALS-WR (considers regularization)
  Hivemall: Biased-MF (considers regularization and biases)
• Usability
  Spark: 100+ lines of Scala coding
  Hivemall: SQL
• Prediction Accuracy
  Almost the same for the MovieLens 10M dataset
Unsupervised Learning: Anomaly Detection
Sensor data etc.
rowid  features
1      ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2      ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3      ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
Anomaly detection runs on a series of SQL queries
Anomalies in Sensor Data
Source: https://codeiq.jp/q/207
Local Outlier Factor (LOF)
Basic idea of LOF: comparing the local density of a point with the densities of its neighbors
Image Source: https://en.wikipedia.org/wiki/Local_outlier_factor
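The basic idea of LOF can be sketched in Python with NumPy. This is a small in-memory illustration of the textbook definition (k-distance, reachability distance, local reachability density), not the series of SQL queries used with Hivemall; as a simplification, it assumes exactly k neighbors per point and ignores ties at the k-th distance.

```python
import numpy as np

def local_outlier_factor(points, k):
    """Return the LOF score of every point; values well above 1 suggest outliers."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    knn = np.argsort(dist, axis=1)[:, :k]        # indices of the k nearest neighbors
    k_dist = dist[np.arange(n), knn[:, -1]]      # distance to the k-th neighbor
    lrd = np.empty(n)                            # local reachability density
    for i in range(n):
        reach = np.maximum(k_dist[knn[i]], dist[i, knn[i]])
        lrd[i] = 1.0 / reach.mean()
    # LOF: average density of the neighbors relative to the point's own density
    return np.array([lrd[knn[i]].mean() / lrd[i] for i in range(n)])

# Four points on a unit square plus one point far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
lof = local_outlier_factor(points, k=2)
```

The four square corners have the same density as their neighbors and score 1.0, while the distant point sits in a much sparser region and scores far above 1.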
DEMO: Local Outlier Factor
RandomForest in Hivemall v0.4
Ensemble of Decision Trees
Already available on a development (smile) branch, and its usage is explained in the project wiki
Training of RandomForest
Out-of-bag tests and Variable Importance
Prediction of RandomForest
Jupyter Integration
DEMO
Conclusion and Takeaway
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
• For SQL users who need ML
• For those already using Hive
• Designed with ease of use and scalability in mind
Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
Hivemall’s Positioning
v0.4 will make a developmental leap
Announcement: Hivemall meetup
At the first meetup on 5/12, Freakout and Scaleout presented their use cases
At the second meetup on 10/20 (Tue), OISIX and Livesense will present their use cases
Registration will open soon on dots
Beyond Query-as-a-Service!
We Open-source! We invented ..
We are hiring machine learning engineers!