Introduction to Hivemall v0.3 Features @ 1st Hivemall meetup

Treasure Data Inc. Research Engineer 油井 誠 (Makoto Yui) @myui. 2015/05/12, Hivemall meetup #1. Introduction to Hivemall v0.3 Features. http://myui.github.io/ Copyright ©2015 Treasure Data. All Rights Reserved.

Transcript of Introduction to Hivemall v0.3 Features @ 1st Hivemall meetup

1. Treasure Data Inc. Research Engineer @myui. 2015/05/12, Hivemall meetup #1. Hivemall v0.3. http://myui.github.io/
2. 2015/04; ML as a Service (MLaaS); 2015/03; 2009/03; NAIST; XML; H141.
3. [Chart: time series from Aug-12 to Oct-14, y-axis 0 to 12000. Annotations: Series A Funding; Gartner Cool Vendor in Big Data. Remaining labels were lost in extraction.]
4. [Figures: 100+; 15; 4,000; 500,000. Labels were lost in extraction.]
5. Agenda: Hivemall; How to use Hivemall; Real-time prediction w/ Hivemall and RDBMS; Hivemall v0.3 (Matrix Factorization, AdaGrad/AdaDelta, Mix Server (Parameter Mixing)); Hivemall Feature Requests.
6. Hivemall runs on Apache Hadoop (Apache license v2). [Stack diagram: Hadoop HDFS; MapReduce (MRv1); Apache YARN; Apache Tez (DAG); MR v2; Hive/PIG; Hivemall.] github.com/myui/hivemall. Note: MapReduce, Tez, YARN.
7. MapReduce and DAG engine. [Diagram: map (M) and reduce (R) tasks reading from and writing to HDFS, shown for MapReduce and for a DAG engine (Tez/Spark).]
8. SQL; Hivemall; Mahout.
   CREATE TABLE lr_model AS
   SELECT
     feature,               -- reducers perform model averaging in parallel
     avg(weight) as weight
   FROM (
     SELECT logress(features,label,..) as (feature,weight)
     FROM train
   ) t                      -- map-only task
   GROUP BY feature;        -- shuffled to reducers
   API: HiveQL; API: stable (Spark: unstable). Hadoop.
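The query on slide 8 trains in map-only tasks and lets the reducers average the weights per feature. The following is a minimal Python sketch of that averaging step only, assuming each map task emits a list of (feature, weight) pairs. The function name `average_models` and the sample weights are hypothetical, and the code illustrates the idea rather than Hivemall's implementation.

```python
from collections import defaultdict


def average_models(mapper_outputs):
    """Average weights per feature across independently trained models.

    mapper_outputs: iterable of lists of (feature, weight) pairs, one list
    per map task. This mirrors `avg(weight) ... GROUP BY feature`.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pairs in mapper_outputs:
        for feature, weight in pairs:
            sums[feature] += weight
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical outputs of three map tasks (not taken from the slides).
mappers = [
    [(1, 0.5), (2, -1.0)],
    [(1, 1.5), (3, 2.0)],
    [(2, -3.0)],
]
model = average_models(mappers)
print(model)
```

A feature seen by only one map task keeps that task's weight, since the average is taken over the models that actually emitted it.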
9. Hivemall v0.3 features.
   Classification (binary/multiclass): Perceptron; Passive Aggressive (PA); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA.
   Regression: PA Regression; AROW Regression; AdaGrad; AdaDELTA.
   K-nearest neighbors and recommendation: Minhash and b-Bit Minhash (LSH variant); Matrix Factorization.
   Feature engineering: Feature hashing; Feature scaling (normalization, z-score); TF-IDF vectorizer.
10. Hivemall on Apache Pig. Contribution from Daniel Dai (Pig PMC) of Hortonworks. To be supported from Pig 0.15.
11. Hivemall on Apache Spark. Ongoing work by Takeshi Yamamuro: https://github.com/maropu/hivemall-spark. Spark is not a foe but a friend of Hivemall. Supports hyperparameter optimization and model selection on Spark through Spark ML Pipeline. More to be introduced by @maropu.
12. Agenda (as on slide 5).
13. How to use Hivemall: Data preparation. [Diagram: Machine Learning. Training takes feature vectors and labels and produces a prediction model; Prediction applies the model to a feature vector and outputs a label.]
14. How to use Hivemall. Define a table over files on HDFS (Hive SERDE):
   CREATE EXTERNAL TABLE e2006tfidf_train (
     rowid int,
     label float,
     features ARRAY<STRING>
   )
   ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '\t'
     COLLECTION ITEMS TERMINATED BY ","
   STORED AS TEXTFILE
   LOCATION '/dataset/E2006-tfidf/train';
15. How to use Hivemall: Feature Engineering. [Same training/prediction diagram as slide 13.]
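The feature list above mentions feature scaling (normalization, z-score). Here is a minimal Python sketch of the two transformations, written from their textbook definitions. The function bodies, the handling of zero ranges, and the sample values are assumptions for illustration, not Hivemall's code.

```python
import statistics


def rescale(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0, 1]."""
    if max_value == min_value:
        return 0.5  # assumption: arbitrary convention for a zero-width range
    return (value - min_value) / (max_value - min_value)


def zscore(value, mean, stddev):
    """Standardization: distance from the mean in units of standard deviation."""
    if stddev == 0:
        return 0.0  # assumption: arbitrary convention for constant columns
    return (value - mean) / stddev


# Hypothetical column values.
values = [2.0, 4.0, 6.0]
lo, hi = min(values), max(values)
mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation

scaled = [rescale(v, lo, hi) for v in values]
standardized = [zscore(v, mean, std) for v in values]
print(scaled)
print(standardized)
```

Min-max scaling needs the column's minimum and maximum up front, which is why a query using it has to be given those two values as parameters.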
16. How to use Hivemall - Feature Engineering. Min-Max normalization: scale the target to the range 0 to 1.
   create view e2006tfidf_train_scaled as
   select
     rowid,
     rescale(target,${min_label},${max_label}) as label,
     features
   from e2006tfidf_train;
17. How to use Hivemall: Training. [Same training/prediction diagram as slide 13.]
18. How to use Hivemall - Training.
   CREATE TABLE lr_model AS
   SELECT
     feature,
     avg(weight) as weight
   FROM (
     SELECT logress(features,label,..) as (feature,weight)
     FROM train
   ) t
   GROUP BY feature
   Annotations: training runs as a map-only task; map outputs are shuffled to reducers by feature; reducer.
19. How to use Hivemall - Training. Confidence Weighted (CW).
   CREATE TABLE news20b_cw_model1 AS
   SELECT
     feature,
     voted_avg(weight) as weight
   FROM (
     SELECT train_cw(features,label) as (feature,weight)
     FROM news20b_train
   ) t
   GROUP BY feature
   Annotation: Positive or Negative. Example weights: +0.7, +0.3, +0.2, -0.1, +0.7.
20. How to use Hivemall - Training.
   hive> desc news20b_cw_model1;
   feature int
   weight double
   hive> select * from a9a_model1 limit 10;
   0    -0.5761121511459351
   1    -1.5259535312652588
   10   0.21053194999694824
   100  -0.017715860158205032
   101  0.007558753248304129
   102  -0.277366042137146
   103  -0.4896543622016907
   104  -0.0955817922949791
   105  0.12560302019119263
   106  0.09214721620082855
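The CW training query above aggregates weights with `voted_avg` instead of `avg`, and the slide annotates it with "Positive or Negative" and five sample weights. The sketch below shows one plausible reading of that annotation: the sign that occurs more often wins the vote, and only weights with the winning sign are averaged. The function body and the tie-breaking rule are assumptions and are not taken from Hivemall's source.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins a majority vote.

    Assumption: zeros do not vote, and a tie goes to the negative side.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# The five sample weights shown on the slide.
sample = [0.7, 0.3, 0.2, -0.1, 0.7]
voted = voted_avg(sample)
plain = sum(sample) / len(sample)
print(voted, plain)
```

Under this reading the single negative weight is outvoted and ignored, so the voted average (about 0.475) is larger than the plain average (about 0.36).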
21. Ensemble of models combined with UNION ALL, for stable prediction performance.
   create table news20mc_ensemble_model1 as
   select
     label,
     cast(feature as int) as feature,
     cast(voted_avg(weight) as float) as weight
   from (
     select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
     from news20mc_train_x3
     union all
     select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
     from news20mc_train_x3
     union all
     select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
     from news20mc_train_x3
   ) t
   group by label, feature;
22. How to use Hivemall: Prediction. [Same training/prediction diagram as slide 13.]
23. How to use Hivemall - Prediction. The test data is joined to the model with a LEFT OUTER JOIN.
   CREATE TABLE lr_predict as
   SELECT
     t.rowid,
     sigmoid(sum(m.weight)) as prob
   FROM testing_exploded t
   LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
   GROUP BY t.rowid
24. Agenda (as on slide 5).
25. MLCT, @tokoroten: http://www.slideshare.net/TokorotenNakayama/mlct/12
26. How to use Hivemall. [Diagram: Machine Learning. Batch Training on Hadoop produces a prediction model; the prediction model is exported; Online Prediction on RDBMS applies it to a feature vector and outputs a label.]
27. #1: Export the prediction model to any RDBMS (TD export; TD (SQL) export).
   hive> desc news20b_cw_model1;
   feature int
   weight double
   103  -0.4896543622016907
   104  -0.0955817922949791
   105  0.12560302019119263
   106  0.09214721620082855
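Once the model table has been exported, the join-and-sigmoid prediction query can run on an ordinary RDBMS. Below is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in RDBMS. The table layout, the `row_id` column name, and the test rows are hypothetical; the three weights are copied from the model listing above. The sigmoid is applied in Python because a given SQLite build may not ship math functions.

```python
import math
import sqlite3


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL)")
conn.execute("CREATE TABLE testing_exploded (row_id INTEGER, feature TEXT, value REAL)")

# Exported model rows (weights taken from the listing above).
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [("103", -0.4896543622016907),
     ("104", -0.0955817922949791),
     ("105", 0.12560302019119263)],
)
# Hypothetical test rows; feature "999" is absent from the model.
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?, ?)",
    [(1, "103", 1.0), (1, "105", 2.0), (1, "999", 1.0), (2, "999", 1.0)],
)

rows = conn.execute(
    """
    SELECT t.row_id, SUM(t.value * m.weight) AS score
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON t.feature = m.feature
    GROUP BY t.row_id
    ORDER BY t.row_id
    """
).fetchall()

# SUM skips NULLs; a row whose features are all unknown yields NULL (None).
probs = {row_id: sigmoid(score if score is not None else 0.0) for row_id, score in rows}
print(probs)
```

The LEFT OUTER JOIN keeps test rows whose features are missing from the model, so every row still receives a prediction; a row with no known features falls back to a probability of 0.5.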
28. #2: Prepare a feature/value view of the test data.
   hive> desc testing_exploded;
   feature string
   value float
   SIGMOID(x) = 1.0 / (1.0 + exp(-x))
   [Diagram: the prediction model applied to a feature vector outputs a label.]
   #3: Predict with a SELECT (explode; subquery (WITH); model; feature).
   SELECT
     sigmoid(sum(t.value * m.weight)) as prob
   FROM testing_exploded t
   LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
29. Amazon Machine Learning (Vowpal Wabbit?). Prices on the slide: $0.42/...; $0.1/1000 ...; $0.1/1000+ ... (!?) [Units were lost in extraction.]
30. Hivemall: Real-time prediction on an RDBMS.
31. Agenda (as on slide 5).
32. Matrix Factorization: k; P, Q.
33. Matrix Factorization: Biased MF; SGD; AdaGrad.
34. Matrix Factorization.
35. Matrix Factorization.
36. http://bit.ly/hivemall-mf. fold; VIEW.
37. Spark matrix factorization. Movielens 10M. Qiita (Hivemall, Qiita/Matrix Factorization). Spark: 100+ ... Scala. http://bit.ly/spark-mf
38. AdaGrad (SGD); AdaGrad; AdaDelta, AdaGrad; t.
39. [Diagram: 1, 2, ..., N.]
40. MIX Server. Training with Mix servers:
   create table kdd10a_pa1_model1 as
   select
     feature,
     cast(voted_avg(weight) as float) as weight
   from (
     select train_pa1(addBias(features),label,"-mix host01,host02,host03") as (feature,weight)
     from kdd10a_train_x3
   ) t
   group by feature;
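The `-mix` option above names three mix server hosts, and the MIX Server slide mentions `hash(feature) % N` together with an AVG accumulator. Here is a minimal, synchronous Python sketch of those two ideas: a feature is routed to one of N servers by hashing, and that server keeps a running average of the weights it receives. The `MixServer` class, the `route` function, and the use of CRC32 as the hash are assumptions for illustration; the real mix protocol is asynchronous and its details are not given in the slides.

```python
import zlib


class MixServer:
    """Hypothetical in-process stand-in for one mix server."""

    def __init__(self, host):
        self.host = host
        self.sums = {}
        self.counts = {}

    def add(self, feature, weight):
        """Accumulate a weight and return the running average for the feature."""
        self.sums[feature] = self.sums.get(feature, 0.0) + weight
        self.counts[feature] = self.counts.get(feature, 0) + 1
        return self.sums[feature] / self.counts[feature]


def route(feature, servers):
    """Pick a server by hash(feature) % N, using CRC32 as a stable hash."""
    index = zlib.crc32(str(feature).encode("utf-8")) % len(servers)
    return servers[index]


servers = [MixServer(host) for host in ("host01", "host02", "host03")]

# Three hypothetical trainers report their local weight for the same feature.
mixed = None
for local_weight in (0.2, 0.4, 0.6):
    mixed = route("feature_42", servers).add("feature_42", local_weight)
print(mixed)
```

Because the routing depends only on the feature, every trainer sends updates for a given feature to the same server, which is what allows that server to average them.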
41. MIX Server. [Diagram: classifiers send model updates to the Mix servers (async add); each Mix server keeps an AVG/Argmin KLD accumulator; features are partitioned by hash(feature) % N; non-blocking channel (single shared TCP connection w/ TCP keepalive).] Computation/training is not being blocked.
42. Agenda (as on slide 5).
43. Feature requests to Hivemall.
44. Treasure Data is hiring Kaggle Masters/Data Scientists! @myui http://bit.ly/gmo0512