Db tech show - hivemall
Copyright ©2015 Treasure Data. All Rights Reserved.
Makoto YUI (@myui), Research Engineer, Treasure Data Inc.
2015/06/12, db tech showcase
http://myui.github.io/
Hivemall: Machine Learning Made Easy with SQL
Who am I?
➢ 2015/04 Joined Treasure Data, Inc.
  ➢ 1st Research Engineer in Treasure Data
  ➢ My mission in TD is developing ML-as-a-Service (MLaaS)
➢ 2010/04-2015/03 Senior Researcher at National Institute of Advanced Industrial Science and Technology, Japan
  ➢ Worked on a large-scale Machine Learning project and Parallel Databases
➢ 2009/03 Ph.D. in Computer Science from NAIST
  ➢ XML native database and Parallel Database systems
➢ Super programmer award from the MITOU Foundation (Government funded program for finding young and talented programmers)
  ➢ Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida
Treasure Data by the numbers (October 2014)
[Figure: imported records by month, Aug 2012 to Oct 2014; y-axis 0 to 12,000, unit: billion records. Annotated milestones: service launch, Series A Funding, 100 customers, selected as a Gartner "Cool Vendor in Big Data", 5 trillion records, 10 trillion records.]
400,000: records imported per second
10 trillion+: records imported in total
12 billion: records sent per day by a single ad-tech customer
Treasure Data today, by the numbers
100+: customers in Japan
18 trillion: records stored
4,000: largest number of servers owned by a single customer
600,000: records stored per second
Companies using Treasure Data
Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
What is Hivemall
Scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2
Check http://github.com/myui/hivemall
Software stack (top to bottom):
Machine Learning: Hivemall
Query Processing: Hive / Pig
Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), MR v2
Resource Management: Apache YARN
Distributed File System: Hadoop HDFS
Awarded in IDG's InfoWorld 2014
Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem
bit.ly/hivemall-award
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
Diagram: raw data (HDFS, S3), then Extract-Transform-Load into a feature vector file, then machine learning
Feature vector example: height:173cm, weight:60kg, age:34, gender:man, …
Need to do expensive data preprocessing
(Joins, Filtering, and Formatting of Data that does not fit in memory)
Do not scale
Have to learn R/Python APIs
Does not meet my needs in terms of its scalability, ML algorithms, and usability
I ❤ scalable SQL query
Survey on existing ML frameworks
Framework                            User interface
Mahout                               Java API programming
Spark MLlib/MLI                      Scala API programming, Scala Shell (REPL)
H2O                                  R programming, GUI
Cloudera Oryx                        HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)  C++ API programming, Command Line
Existing distributed machine learning frameworks are NOT easy to use
The key characteristic of Hivemall
Very easy to use; Machine Learning on SQL
Classification with Mahout: 100+ lines of code
Classification with Hivemall:
CREATE TABLE lr_model AS
SELECT
  feature,                -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t                       -- map-only task
GROUP BY feature;         -- shuffled to reducers
This SQL query automatically runs in parallel on Hadoop
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ APIs are very stable because of SQL abstraction
List of functions in Hivemall v0.3
• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA
• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
• kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN
  ✓ Matrix Factorization
• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer
• Anomaly Detection
  ✓ Local Outlier Factor (LOF)
Treasure Data will support Hivemall v0.3.2 in the next biweekly release
Supervised Learning in a Nutshell
Diagram: a feature vector x (Explanatory Variable 1, Explanatory Variable 2, Explanatory Variable 3) paired with a Response Variable; "?" marks response values that are not yet known
Supervised Learning: Classification
Diagram: a feature vector x (Explanatory Variable 1, Explanatory Variable 2, Explanatory Variable 3) paired with a Label; "?" marks labels that are not yet known
Response Variable is a non-numeric value
Positive / Negative
Clicked / Not Clicked
Man / Woman (not Man)
Sunny / Rainy / Cloudy
Sports / Entertainment / Politics
Supervised Learning: Regression
Diagram: a feature vector x (Explanatory Variable 1, Explanatory Variable 2, Explanatory Variable 3) paired with a Target value; "?" marks target values that are not yet known
Response Variable is a numeric value
Temperature: 25℃
Monthly house rent: $1,000
Weekly payment of a user: ¥1,000
Supervised Learning: Recommendation
Rating prediction of a Matrix
Can be applied to Item Recommendation
rowid features
1 ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2 ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3 ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]
Unsupervised Learning: Outlier Detection
Sensor data etc.
Anomaly detection runs on a series of SQL queries
Industry use cases of Hivemall
• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
• Anomaly Detection from Sensor Data
  • Anomaly Detection
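How to use Hivemall
Workflow diagram:
Training: feature vectors and labels go into machine learning, which produces a prediction model
Prediction: the prediction model takes a feature vector and outputs a label
Step: Data preparation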
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
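To make the text layout behind this table concrete, here is a minimal Python sketch (not Hivemall code) that splits one line the way the table definition describes: tab-separated fields, with comma-separated items in the last field. The helper names `parse_row` and `parse_feature` and the sample line are hypothetical; the "name:value" item form is assumed from the feature examples shown elsewhere in the deck.

```python
def parse_row(line):
    """Split one tab-delimited line into (rowid, label, features)."""
    rowid, label, features = line.rstrip("\n").split("\t")
    # Collection items (the features array) are comma-separated.
    return int(rowid), float(label), features.split(",")


def parse_feature(item):
    """Split a 'name:value' feature string into (name, value)."""
    name, value = item.rsplit(":", 1)
    return name, float(value)


# Hypothetical sample line, made up for illustration.
row = parse_row("1\t-3.58\t12:0.25,87:0.5\n")
pairs = [parse_feature(item) for item in row[2]]
```

Each feature stays a string in the table, so the same column can carry either numeric indexes or readable names.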
How to use Hivemall - Feature Engineering
Applying Min-Max Feature Normalization
create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;
Transforming a label value to a value between 0.0 and 1.0
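The min-max transformation itself is a one-line formula. The following Python sketch (not Hivemall code) shows the arithmetic, assuming that `min_label` and `max_label` are the smallest and largest label values in the training table; the handling of a zero-width range is an arbitrary choice made for this illustration.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range; returning 0.5 is an assumption of this sketch.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label range, made up for illustration.
scaled = [rescale(v, 0.0, 10.0) for v in (0.0, 5.0, 10.0)]
```

Since logistic regression outputs a value between 0.0 and 1.0, scaling the label into the same range lets a regression target be learned by it.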
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight     -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features,label,..) as (feature,weight)   -- map-only task to learn a prediction model
  FROM train
) t
GROUP BY feature            -- shuffle map outputs to reducers by feature
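The train-then-average pattern in this query can be sketched in plain Python. This is an illustration under stated assumptions, not the internals of `logress`: each partition is fitted independently with plain SGD (the map-only step), the emitted (feature, weight) pairs are grouped by feature, and the weights are averaged (the reduce step). The names `train_partition` and `average_models`, the learning rate, the epoch count, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_partition(rows, eta=0.1, epochs=20):
    """Map-side step: fit weights on one data partition with plain SGD."""
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:  # features: {name: value}
            z = sum(w[f] * v for f, v in features.items())
            err = label - sigmoid(z)
            for f, v in features.items():
                w[f] += eta * err * v
    return list(w.items())  # emitted as (feature, weight) pairs


def average_models(emitted):
    """Reduce-side step: GROUP BY feature, then avg(weight)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, weight in emitted:
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}


# Two toy partitions standing in for the input splits of the train table.
partitions = [
    [({"a": 1.0}, 1.0), ({"b": 1.0}, 0.0)],
    [({"a": 1.0, "c": 1.0}, 1.0), ({"b": 1.0}, 0.0)],
]
emitted = [pair for part in partitions for pair in train_partition(part)]
model = average_models(emitted)
```

Because each partition is trained without talking to the others, the only cross-node traffic is the shuffle of (feature, weight) pairs.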
How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight   -- vote to use negative or positive weights for avg
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature
Example weights for one feature: +0.7, +0.3, +0.2, -0.1, +0.7
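One reading of the voting step, sketched in Python rather than taken from Hivemall's source: count the positive and the negative weights for a feature, then average only the side that wins the vote. The tie rule (falling back to the negative side) and the 0.0 result for an empty input are assumptions of this sketch.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Tie handling is an assumption: the negative side is used on a tie.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners) if winners else 0.0


# The five weights shown on the slide: four positive, one negative.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

With the slide's weights, the four positive values win the vote and the single -0.1 is ignored, so the result is about 0.475 instead of the plain mean of 0.36.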
How to use Hivemall - Prediction
CREATE TABLE lr_predict as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid
Prediction is done by a LEFT OUTER JOIN between test data and the prediction model
No need to load the entire model into memory
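The join-based prediction can be mimicked in Python to show what each row of `testing_exploded` contributes. This is a sketch under assumptions, not Hivemall code: `explode` is a hypothetical helper that turns each test row into one (rowid, feature) row per feature, and features missing from the model add 0.0 to the sum.

```python
import math
from collections import defaultdict


def explode(test_rows):
    """Turn (rowid, [feature, ...]) into one (rowid, feature) row per feature."""
    return [(rowid, f) for rowid, features in test_rows for f in features]


def predict(exploded, model):
    """Join on feature, then sigmoid(sum(weight)) grouped by rowid."""
    sums = defaultdict(float)
    for rowid, feature in exploded:
        # A left outer join keeps rows with no match; here they add 0.0.
        # (In SQL a row with no matching feature at all would sum to NULL.)
        sums[rowid] += model.get(feature, 0.0)
    return {rowid: 1.0 / (1.0 + math.exp(-s)) for rowid, s in sums.items()}


# Hypothetical model and test rows, made up for illustration.
model = {"a": 2.0, "b": -1.0}
test_rows = [(1, ["a", "b"]), (2, ["zzz"])]
probs = predict(explode(test_rows), model)
```

Only the weights of features that occur in a test row are ever touched, which is why the full model never has to sit in one process's memory.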
How to use Hivemall - Export prediction models
Batch Training on Hadoop
Online Prediction on RDBMS
Export Prediction Model to an RDBMS
TD export to any RDBMS
Periodical export is very easy in Treasure Data
Prediction model (feature, weight):
103  -0.4896543622016907
104  -0.0955817922949791
105  0.12560302019119263
106  0.09214721620082855
Real-time Prediction on MySQL
Diagram: a feature vector is looked up in the prediction model to produce a label
SIGMOID(x) = 1.0 / (1.0 + exp(-x))
SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t
  LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
Online prediction on MySQL
Index lookups are very efficient in RDBMSs
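The same query can be tried locally with Python's built-in `sqlite3` module. In this sketch SQLite stands in for MySQL, which is an assumption made only so the example is self-contained. The model rows are the four feature/weight pairs shown on the export slide, the request features are hypothetical, and `sigmoid` is registered as a user-defined function because SQLite does not ship one.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# The PRIMARY KEY gives the model table an index on feature.
conn.execute("CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [("103", -0.4896543622016907), ("104", -0.0955817922949791),
     ("105", 0.12560302019119263), ("106", 0.09214721620082855)],
)

# Register SIGMOID(x) = 1.0 / (1.0 + exp(-x)); NULL stays NULL.
conn.create_function(
    "sigmoid", 1,
    lambda x: None if x is None else 1.0 / (1.0 + math.exp(-x)),
)

# One hypothetical request; feature "999" is not in the model.
conn.execute("CREATE TABLE testing_exploded (feature TEXT, value REAL)")
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [("103", 1.0), ("105", 2.0), ("999", 1.0)],
)

(prob,) = conn.execute(
    "SELECT sigmoid(sum(t.value * m.weight)) AS prob "
    "FROM testing_exploded t "
    "LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)"
).fetchone()
```

The unmatched feature yields a NULL product, which `sum` skips, so the prediction uses only the two features the model knows.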
Cost of Amazon Machine Learning
Amazon-ML is suspected to be based on Vowpal Wabbit (single process)
Data Analysis and Model Building Fees: $0.42/Instance per Hour
Batch Prediction: $0.1/1000 requests
Real-time Prediction: $0.0001 per request
Pay-per-request is apparently not suitable for doing prediction for each web request (e.g. online CTR prediction)
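A rough calculation shows why per-request pricing adds up quickly. The fee is the real-time prediction price quoted on the slide; the traffic figure of 1,000 web requests per second is a hypothetical assumption chosen only for illustration.

```python
REALTIME_FEE_PER_REQUEST = 0.0001  # USD per request, as quoted on the slide

requests_per_second = 1_000        # hypothetical traffic level
requests_per_day = requests_per_second * 60 * 60 * 24

daily_cost = requests_per_day * REALTIME_FEE_PER_REQUEST
monthly_cost = daily_cost * 30
```

Under that assumption the bill comes to roughly $8,640 per day, or about $259,200 over 30 days, for prediction alone.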
Real-time Prediction on Treasure Data
Run batch training job periodically
Periodical export
Real-time prediction on an RDBMS
Matrix Factorization
Approximate the rating matrix R by the product of P and Q
Biased Matrix Factorization
Terms annotated in the formula: mean rating value, rating bias of user and item, matrix factorization, regularization
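The slide's formula did not survive extraction, so the following Python sketch uses the standard textbook form of biased matrix factorization, which may differ in detail from Hivemall's implementation. A rating is predicted as the mean rating plus a user bias, an item bias, and the dot product of the user and item factor vectors; one regularized SGD step then nudges each of them. The function names, learning rate `eta`, regularization strength `lam`, and starting values are hypothetical.

```python
def predict(mu, b_user, b_item, p_u, q_i):
    """Predicted rating: mean + user bias + item bias + dot(p_u, q_i)."""
    return mu + b_user + b_item + sum(p * q for p, q in zip(p_u, q_i))


def sgd_step(rating, mu, b_user, b_item, p_u, q_i, eta=0.01, lam=0.03):
    """One regularized SGD update for a single observed rating."""
    err = rating - predict(mu, b_user, b_item, p_u, q_i)
    new_b_user = b_user + eta * (err - lam * b_user)
    new_b_item = b_item + eta * (err - lam * b_item)
    new_p = [p + eta * (err * q - lam * p) for p, q in zip(p_u, q_i)]
    new_q = [q + eta * (err * p - lam * q) for p, q in zip(p_u, q_i)]
    return new_b_user, new_b_item, new_p, new_q


# Toy setup: mean rating 3.5, one user/item pair with an observed rating of 5.0.
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]
before = abs(5.0 - predict(mu, b_u, b_i, p_u, q_i))
for _ in range(100):
    b_u, b_i, p_u, q_i = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = abs(5.0 - predict(mu, b_u, b_i, p_u, q_i))
```

The regularization term shrinks biases and factors toward zero at every step, which keeps sparse users and items from overfitting their few ratings.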
Training of Matrix Factorization
Prediction and Evaluation
Outlier Detection
Image taken from: http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259
Beyond Query-as-a-Service!
We ❤ Open-source! We invented ..