Db tech show - hivemall

Hivemall: Machine Learning Made Easy with SQL. Makoto YUI (@myui), Research Engineer, Treasure Data Inc. db tech showcase, 2015/06/12. http://myui.github.io/ Copyright ©2015 Treasure Data. All Rights Reserved.

Transcript of Db tech show - hivemall

Page 1: Db tech show - hivemall

Hivemall: Machine Learning Made Easy with SQL

Makoto YUI (@myui), Research Engineer, Treasure Data Inc.

db tech showcase, 2015/06/12

http://myui.github.io/

Page 2: Db tech show - hivemall

Who am I?

➢ 2015/04: Joined Treasure Data, Inc.
  ➢ 1st Research Engineer in Treasure Data
  ➢ My mission in TD is developing ML-as-a-Service (MLaaS)
➢ 2010/04-2015/03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan
  ➢ Worked on a large-scale machine learning project and parallel databases
➢ 2009/03: Ph.D. in Computer Science from NAIST
  ➢ XML native database and parallel database systems
➢ Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers)
  ➢ Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida

Page 3: Db tech show - hivemall

Treasure Data by the numbers

[Chart: number of records (unit: 1 billion records) by month, Aug 2012 to Oct 2014; y-axis from 0 to 12,000. Annotated milestones: service launch; Series A funding; adopted by 100 companies; selected as a Gartner "Cool Vendor in Big Data"; 5 trillion records; 10 trillion records.]

Treasure Data by the numbers (October 2014):
• 400,000 records imported per second
• More than 10 trillion records imported in total
• 12 billion records sent every day by a single customer in the ad-tech industry

Page 4: Db tech show - hivemall

Treasure Data today, by the numbers

• 100+: customers in Japan
• 18 trillion: records stored
• 4,000: largest number of servers owned by a single customer
• 600,000: records stored per second

Page 5: Db tech show - hivemall

Companies that have adopted Treasure Data

Page 6: Db tech show - hivemall

Plan of the Talk

1. Brief introduction to Hivemall

2. How to use Hivemall

Page 7: Db tech show - hivemall

What is Hivemall

A scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2

Check http://github.com/myui/hivemall

[Diagram: software stack, top to bottom]
• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MRv1); Apache Tez (DAG processing); MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS

Page 8: Db tech show - hivemall

Awarded in IDG's InfoWorld 2014

Bossie Awards 2014: The best open source big data tools

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem

bit.ly/hivemall-award

Page 9: Db tech show - hivemall

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector (file) → Machine Learning]

Feature vector example: height: 173cm, weight: 60kg, age: 34, gender: man, …

Page 10: Db tech show - hivemall

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector (file) → Machine Learning]

Feature vector example: height: 173cm, weight: 60kg, age: 34, gender: man, …

Need to do expensive data preprocessing (joins, filtering, and formatting of data that does not fit in memory)

Page 11: Db tech show - hivemall

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector (file)]

Feature vector example: height: 173cm, weight: 60kg, age: 34, gender: man, …

• Does not scale
• Have to learn R/Python APIs

Page 12: Db tech show - hivemall

How I used to do ML projects before Hivemall

Given raw data stored on Hadoop HDFS

[Diagram: Raw Data (HDFS, S3) → Extract-Transform-Load → Feature Vector]

Feature vector example: height: 173cm, weight: 60kg, age: 34, gender: man, …

Does not meet my needs in terms of its scalability, ML algorithms, and usability

I ❤ scalable SQL query

Page 13: Db tech show - hivemall

Survey on existing ML frameworks

Framework | User interface
Mahout | Java API programming
Spark MLlib/MLI | Scala API programming; Scala shell (REPL)
H2O | R programming; GUI
Cloudera Oryx | HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming) | C++ API programming; command line

Existing distributed machine learning frameworks are NOT easy to use

Page 14: Db tech show - hivemall

The key characteristic of Hivemall

Very easy to use; Machine Learning on SQL

Classification with Mahout: 100+ lines of code

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

This SQL query automatically runs in parallel on Hadoop

✓ Machine Learning made easy for SQL developers (ML for the rest of us)

✓ APIs are very stable because of SQL abstraction

Page 15: Db tech show - hivemall

List of functions in Hivemall v0.3

• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA
• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
• kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN
  ✓ Matrix Factorization
• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer
• Anomaly Detection
  ✓ Local Outlier Factor (LOF)

Treasure Data will support Hivemall v0.3.2 in the next biweekly release

Page 16: Db tech show - hivemall

Supervised Learning in a Nutshell

[Diagram: a feature vector x (Explanatory Variable 1, Explanatory Variable 2, Explanatory Variable 3, each marked "?") is mapped to a Response Variable]

Page 17: Db tech show - hivemall

Supervised Learning: Classification

[Diagram: a feature vector x (Explanatory Variable 1, Explanatory Variable 2, Explanatory Variable 3, each marked "?") is mapped to a Label]

The response variable is a non-numeric value:
• Positive / Negative
• Clicked / Not Clicked
• Man / Woman (not Man)
• Sunny / Rainy / Cloudy
• Sports / Entertainment / Politics

Page 18: Db tech show - hivemall

Supervised Learning: Regression

[Diagram: a feature vector x (Explanatory Variable 1, Explanatory Variable 2, Explanatory Variable 3, each marked "?") is mapped to a Target value]

The response variable is a numeric value:
• Temperature: 25℃
• Monthly house rent: $1,000
• Weekly payment of a user: ¥1,000

Page 19: Db tech show - hivemall

Supervised Learning: Recommendation

Rating prediction of a matrix

Can be applied to item recommendation

Page 20: Db tech show - hivemall

Unsupervised Learning: Outlier Detection

Sensor data etc.

rowid | features
1 | ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2 | ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3 | ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]

Anomaly detection runs on a series of SQL queries

Page 21: Db tech show - hivemall

Industry use cases of Hivemall

• CTR prediction of ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com
• Value prediction of real estate
  • Algorithm: Regression
  • Livesense
• Anomaly detection from sensor data
  • Anomaly Detection

Page 22: Db tech show - hivemall

Plan of the Talk

1. Brief introduction to Hivemall

2. How to use Hivemall

Page 23: Db tech show - hivemall

How to use Hivemall

[Diagram: Training: Feature Vector + Label → Machine Learning → Prediction Model. Prediction: Feature Vector → Prediction Model → Label.]

Highlighted step: Data preparation

Page 24: Db tech show - hivemall

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  target float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
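To make the row layout behind this table definition concrete, here is a minimal Python sketch that parses one text line with the same delimiters: tab between fields, comma between items of the `features` array. The helper name `parse_row`, the `name:value` item format (borrowed from the feature strings shown elsewhere in the deck), and the sample line are all assumptions for illustration; this is not Hive or Hivemall code.

```python
def parse_row(line):
    """Split one text line into (rowid, value, features).

    Fields are separated by tabs; the third field is a comma-separated
    list of "name:value" items (an assumed item format for this sketch).
    """
    rowid, value, features = line.rstrip("\n").split("\t")
    parsed = {}
    for item in features.split(","):
        name, weight = item.rsplit(":", 1)
        parsed[name] = float(weight)
    return int(rowid), float(value), parsed


# Hypothetical sample line, not taken from the E2006-tfidf dataset
sample = "1\t-3.58\t101:0.25,205:0.5"
row = parse_row(sample)
```

The second field is the numeric value that the later slides rescale and use as the label.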

Page 25: Db tech show - hivemall

How to use Hivemall

[Diagram: Training: Feature Vector + Label → Machine Learning → Prediction Model. Prediction: Feature Vector → Prediction Model → Label.]

Highlighted step: Feature Engineering

Page 26: Db tech show - hivemall

How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization

create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(target, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;

Transforming a label value to a value between 0.0 and 1.0
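The arithmetic behind min-max normalization can be sketched in a few lines of Python. This illustrates the general technique only, not Hivemall's implementation of `rescale`; the handling of a zero-width range and the sample values are assumptions made for the sketch.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range: returning 0.5 is an arbitrary choice for this sketch
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values
labels = [-7.9, -3.6, -0.5]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
```

The smallest label maps to 0.0, the largest to 1.0, and everything else falls in between, which is what the `${min_label}` and `${max_label}` parameters in the view provide.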

Page 27: Db tech show - hivemall

How to use Hivemall

[Diagram: Training: Feature Vector + Label → Machine Learning → Prediction Model. Prediction: Feature Vector → Prediction Model → Label.]

Highlighted step: Training

Page 28: Db tech show - hivemall

How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature

• Inner SELECT: map-only task to learn a prediction model
• GROUP BY feature: shuffle map outputs to reducers by feature
• avg(weight): reducers perform model averaging in parallel
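The data flow of this query (independent learners per partition, then a per-feature average) can be imitated in plain Python. The sketch below is an illustration under assumptions: `train_partition` is a hypothetical stand-in for one map task running SGD logistic regression, the learning rate and epoch count are arbitrary, and the toy data is made up. It is not the algorithm inside `logress`.

```python
import math
from collections import defaultdict


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_partition(rows, eta=0.1, epochs=20):
    """Hypothetical stand-in for one map task: SGD logistic regression.

    rows: list of (features, label), features is a dict name -> value,
    label is 0 or 1. Returns a dict feature -> weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[f] * v for f, v in features.items())
            error = label - sigmoid(z)
            for f, v in features.items():
                weights[f] += eta * error * v
    return dict(weights)


def average_models(models):
    """Stand-in for GROUP BY feature with avg(weight)."""
    grouped = defaultdict(list)
    for model in models:
        for feature, weight in model.items():
            grouped[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}


# Two made-up partitions, as if the training table were split across two mappers
partitions = [
    [({"a": 1.0}, 1), ({"b": 1.0}, 0)],
    [({"a": 1.0}, 1), ({"b": 1.0, "c": 1.0}, 0)],
]
model = average_models(train_partition(p) for p in partitions)
```

Each partition is trained without seeing the others, so the only coordination point is the grouping by feature, which is what lets the query run as an ordinary map and reduce job.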

Page 29: Db tech show - hivemall

How to use Hivemall - Training

Training of Confidence Weighted Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature

Vote to use negative or positive weights for avg

Example weights: +0.7, +0.3, +0.2, -0.1, +0.7
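One plausible reading of "vote to use negative or positive weights for avg" is: count how many weights are positive and how many are negative, then average only the majority side. The Python sketch below follows that reading. The tie-breaking rule and the treatment of zero weights are assumptions, and the function is an illustration rather than Hivemall's `voted_avg` source.

```python
def voted_avg(weights):
    """Average only the weights whose sign is in the majority."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    # Ties go to the positive side: an arbitrary choice for this sketch
    majority = positives if len(positives) >= len(negatives) else negatives
    return sum(majority) / len(majority)


# The five weights shown on the slide
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
```

With the slide's weights, four of five are positive, so the single negative weight is discarded and the result is the mean of the positive ones, about 0.475.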

Page 30: Db tech show - hivemall

How to use Hivemall

[Diagram: Training: Feature Vector + Label → Machine Learning → Prediction Model. Prediction: Feature Vector → Prediction Model → Label.]

Highlighted step: Prediction

Page 31: Db tech show - hivemall

How to use Hivemall - Prediction

CREATE TABLE lr_predict as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory
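The join-then-aggregate shape of this query can be mimicked in Python to show what each output row contains. This is a sketch under assumptions: the model is held in a dict purely for brevity (the point of the SQL version is that it does not need to be), the row and weight values are made up, and a row with no matching feature yields 0.5 here, whereas in SQL the sum over only NULL weights would itself be NULL.

```python
import math
from collections import defaultdict


def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature) pairs, one per feature.
    model: dict feature -> weight.

    Features absent from the model add nothing to the sum, similar to
    NULL weights being skipped by SUM after a LEFT OUTER JOIN.
    """
    totals = defaultdict(float)
    for rowid, feature in testing_exploded:
        totals[rowid] += model.get(feature, 0.0)
    return {rowid: 1.0 / (1.0 + math.exp(-total)) for rowid, total in totals.items()}


# Made-up test rows and model weights
rows = [(1, "a"), (1, "b"), (2, "unseen")]
weights = {"a": 2.0, "b": -0.5}
probs = predict(rows, weights)
```

Row 1 sums the weights of its two features before the sigmoid is applied; row 2 has no known feature and ends up at the neutral probability.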

Page 32: Db tech show - hivemall

How to use Hivemall

[Diagram: Batch Training on Hadoop: Feature Vector + Label → Machine Learning → Prediction Model. Online Prediction on RDBMS: Feature Vector → Prediction Model → Label.]

Highlighted step: Export prediction models

Page 33: Db tech show - hivemall

Export Prediction Model to an RDBMS

[Diagram: Prediction Model → TD export → any RDBMS]

Periodical export is very easy in Treasure Data

Exported prediction model rows:
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855

Page 34: Db tech show - hivemall

Real-time Prediction on MySQL

[Diagram: Feature Vector → Prediction Model → Label]

SIGMOID(x) = 1.0 / (1.0 + exp(-x))

Online prediction on MySQL:

SELECT sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  prediction_model m ON (t.feature = m.feature)

Index lookups are very efficient in RDBMSs
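The same idea can be tried locally without a MySQL server. The sketch below uses Python's built-in `sqlite3` module as a stand-in RDBMS: it registers a `sigmoid` function, stores a tiny model keyed by feature (the primary key provides the index), and runs a query of the same shape as the one on the slide. The table contents are made up, SQLite is a substitution for MySQL, and the `COALESCE` guard for requests with no known feature is an addition of this sketch.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# User-defined SQL function implementing SIGMOID(x) = 1 / (1 + exp(-x))
conn.create_function("sigmoid", 1, lambda x: 1.0 / (1.0 + math.exp(-x)))

# The primary key on feature gives the join an index to look up
conn.execute("CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL)")
conn.execute("CREATE TABLE testing_exploded (feature TEXT, value REAL)")

# Made-up model weights and one incoming request's features
conn.executemany("INSERT INTO prediction_model VALUES (?, ?)",
                 [("a", 0.5), ("b", -0.25)])
conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)",
                 [("a", 2.0), ("unseen", 1.0)])

(prob,) = conn.execute(
    """
    SELECT sigmoid(COALESCE(SUM(t.value * m.weight), 0.0)) AS prob
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
    """
).fetchone()
```

Only the rows of `prediction_model` that match the request's features are touched, which is why the model never has to be loaded into application memory.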

Page 35: Db tech show - hivemall

Cost of Amazon Machine Learning

Amazon-ML is suspected to be based on Vowpal Wabbit (single process)

• Data Analysis and Model Building Fees: $0.42/instance per hour
• Batch Prediction: $0.1/1000 requests
• Real-time Prediction: $0.0001 per request

Pay-per-request is apparently not suitable for doing prediction for each web request (e.g. online CTR prediction)
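A quick calculation shows why per-request pricing adds up at web scale. The sketch below uses the real-time price quoted on the slide; the traffic figure of 100 million requests per day is a made-up example, not a number from the talk.

```python
from decimal import Decimal

# Real-time prediction price quoted on the slide, in dollars per request
PRICE_PER_REQUEST = Decimal("0.0001")


def daily_cost(requests_per_day):
    """Dollars per day when every request triggers one real-time prediction."""
    return PRICE_PER_REQUEST * requests_per_day


# Hypothetical traffic level
cost = daily_cost(100_000_000)
```

At that hypothetical volume the bill comes to $10,000 per day for prediction alone.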

Page 36: Db tech show - hivemall

Real-time Prediction on Treasure Data

[Diagram: Run batch training job periodically → Periodical export → Real-time prediction on an RDBMS]

Page 37: Db tech show - hivemall

Matrix Factorization

Approximate the rating matrix R by the product of P and Q
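As a rough sketch of what approximating R by the product of P and Q means in practice, the Python below fits two small factor matrices to a handful of observed ratings with stochastic gradient descent. Everything here is an assumption for illustration: the function name `train_mf`, the hyperparameters, the initialization, and the toy ratings. It shows the general technique, not Hivemall's training function.

```python
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of user factors P[u] and item factors Q[i]."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    return (sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


def train_mf(ratings, n_users, n_items, k=2, eta=0.01, lam=0.01, epochs=2000, seed=0):
    """Fit P (n_users x k) and Q (n_items x k) to observed (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    initial_error = rmse(P, Q, ratings)
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization
                P[u][f] += eta * (err * qi - lam * pu)
                Q[i][f] += eta * (err * pu - lam * qi)
    return P, Q, initial_error


# Made-up ratings: (user, item, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 5.0)]
P, Q, initial_error = train_mf(ratings, n_users=3, n_items=3)
final_error = rmse(P, Q, ratings)
```

Once P and Q are fitted, `predict` can be called for user and item pairs that have no observed rating, which is how the factorization is used for recommendation.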

Page 38: Db tech show - hivemall

Biased Matrix Factorization

[Equation with annotated terms: mean rating value; matrix factorization; regularization; rating bias of user and item]
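The slide's equation did not survive extraction. For orientation, the commonly used textbook form of biased matrix factorization is written out below; it is an assumption that the slide used this formulation, and the symbols ($\mu$ for the mean rating, $b_u$ and $b_i$ for user and item biases, $p_u$ and $q_i$ for factor vectors, $\lambda$ for the regularization strength, $\mathcal{K}$ for the set of observed ratings) are chosen here rather than taken from the source.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + p_u^{\top} q_i

\min_{P,\,Q,\,b} \sum_{(u,i) \in \mathcal{K}}
  \left( r_{ui} - \mu - b_u - b_i - p_u^{\top} q_i \right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2 \right)
```

The four annotations on the slide map onto this form as follows: $\mu$ is the mean rating value, $p_u^{\top} q_i$ is the matrix factorization term, $b_u + b_i$ is the rating bias of user and item, and the $\lambda$ term is the regularization.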

Page 39: Db tech show - hivemall

Training of Matrix Factorization

Page 40: Db tech show - hivemall

Prediction and Evaluation

Page 41: Db tech show - hivemall

Unsupervised Learning: Outlier Detection

Sensor data etc.

rowid | features
1 | ["reflectance:0.5252967","specific_heat:0.19863537","weight:0.0"]
2 | ["reflectance:0.6797837","specific_heat:0.12567581","weight:0.13255163"]
3 | ["reflectance:0.5950446","specific_heat:0.09166764","weight:0.052084323"]

Anomaly detection runs on a series of SQL queries

Page 42: Db tech show - hivemall

Outlier Detection

Image taken from: http://next.rikunabi.com/tech/docs/ct_s03600.jsp?p=002259

Page 43: Db tech show - hivemall

Beyond Query-as-a-Service!

We ❤ Open-source! We invented ..