Introduction to Hivemall
Hivemall: Scalable Machine Learning Library for Apache Hive
Research Engineer Makoto YUI @myui
bit.ly/hivemall
[Figure: Treasure Data Cloud Service. Data sources (CRM, RDBMS, ERP, App log, Sensor, Apache log) feed the Treasure Data Collectors (Treasure Agent, Embedded, Embulk, Mobile SDK, JS SDK). The service provides Batch (Hive), Adhoc (Presto), and Machine Learning. Results reach BI tools and data analysis via API and ODBC/JDBC, and are PUSHed to External Integrations (SQL Server). 900,000 records stored per sec.]
Agenda
1. What is Hivemall (short intro.)
2. Why Hivemall (motivations etc.)
3. Hivemall Internals
4. How to use Hivemall
What is Hivemall
Scalable machine learning library built as a collection of Hive UDFs, licensed under the Apache License v2
[Figure: Hivemall in the Hadoop ecosystem stack. Distributed File System: Hadoop HDFS. Resource Management: Apache YARN, MESOS. Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez DAG processing, Apache Spark. Query Processing: Hive, Pig, SparkSQL. Machine Learning: Hivemall, MLlib.]
Won IDG's InfoWorld 2014 Bossie Awards 2014: The best open source big data tools
InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem (awarded along w/ Spark, Tez, Jupyter notebook, Pandas, Impala, Kafka)
bit.ly/hivemall-award
List of supported Algorithms
Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification
Regression
✓ Logistic Regression (SGD)
✓ PA Regression
✓ AROW Regression
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ Factorization Machines
✓ RandomForest Regression
SCW is a good first choice
Try RandomForest if SCW does not work
Logistic regression is good for getting the probability of the positive class
Factorization Machines are a good fit where features are sparse and categorical
List of Algorithms for Recommendation
K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
The each_top_k function of Hivemall is useful for recommending top-k items
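The idea behind a per-group top-k can be sketched in plain Python. This is only an illustration of the concept, not Hivemall's each_top_k implementation or its argument order; the function name top_k_per_group and the sample ratings are hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Return the k highest-scoring (score, item) pairs for each group.

    rows is an iterable of (group, score, item) tuples, e.g. one row per
    (user, predicted rating, item). Results are ordered best first.
    """
    grouped = defaultdict(list)
    for group, score, item in rows:
        grouped[group].append((score, item))
    # heapq.nlargest avoids fully sorting each group's candidates
    return {group: heapq.nlargest(k, pairs) for group, pairs in grouped.items()}


# Hypothetical predicted ratings: (user, score, item)
ratings = [
    ("user1", 0.9, "itemA"),
    ("user1", 0.2, "itemB"),
    ("user1", 0.7, "itemC"),
    ("user2", 0.4, "itemA"),
    ("user2", 0.8, "itemB"),
]
recommendations = top_k_per_group(ratings, k=2)
```

Each user ends up with at most k candidate items, which is the shape a recommendation list usually needs.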
Other Supported Algorithms
Anomaly Detection
✓ Local Outlier Factor (LoF)
Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier
NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
Industry use cases of Hivemall
➢ CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc. and more
➢ Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.
➢ Churn Detection
  • Algorithm: Regression
  • OISIX and more
➢ Item/User recommendation
  • Algorithm: Recommendation (Matrix Factorization / kNN)
  • Adtech Companies, ISP portal, and more
➢ Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
Why Hivemall
1. In my experience working on ML, I used Hive for preprocessing and Python (scikit-learn etc.) for ML. This was INEFFICIENT and ANNOYING. Also, Python is not as scalable as Hive.
2. Why not run ML algorithms inside Hive? Fewer components to manage and more scalable.
That's why I built Hivemall.
How I used to do ML projects before Hivemall
Given raw data stored on Hadoop HDFS
[Figure: Raw Data (HDFS/S3) → Extract-Transform-Load → Feature Vector file (height: 173cm, weight: 60kg, age: 34, gender: man, …) → Machine Learning]
Need to do expensive data preprocessing (Joins, Filtering, and Formatting of Data that does not fit in memory)
Do not scale
Have to learn R/Python APIs
Does not meet my needs in terms of its scalability, ML algorithms, and usability
I ❤ scalable SQL query
Survey on existing ML frameworks
Framework                              User interface
Mahout                                 Java API programming
Spark MLlib/MLI                        Scala API programming, Scala Shell (REPL)
H2O                                    R programming, GUI
Cloudera Oryx                          HTTP REST API programming
Vowpal Wabbit (w/ Hadoop streaming)    C++ API programming, Command Line
Existing distributed machine learning frameworks are NOT easy to use
Hivemall's Vision: ML on SQL
[Figure: Classification with Mahout]
CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ Interactive and Stable APIs w/ SQL abstraction
This SQL query automatically runs in parallel on Hadoop
Hivemall on Apache Spark
Installation is very easy as follows:
$ spark-shell --packages maropu:hivemall-spark:0.0.6
Hivemall Internals
How Hivemall works in training
Implemented machine learning algorithms as User-Defined Table generating Functions (UDTFs)
[Figure: The training table holds rows of tuple<label, array<features>>, e.g. (+1, <1,2>), (+1, <1,7,9>), (-1, <1,3,9>), (+1, <3,8>). Parallel train UDTF tasks each read part of the table and emit tuple<feature, weights>. The outputs are shuffled by feature and combined (param-mix) into the prediction model, a Relation<feature, weights>.]
● Resulting prediction model is a relation of feature and its weight
● # of mappers and reducers are configurable
UDTF is a function that returns a relation
Parallelism is Powerful
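A function that consumes rows and returns a relation maps naturally onto a Python generator. The sketch below uses a toy perceptron (one of the listed algorithms) to show the input and output shapes only; it is an assumption-laden illustration, not Hivemall's training code, and train_perceptron is a hypothetical name.

```python
def train_perceptron(rows):
    """Consume (label, features) rows and emit (feature, weight) rows.

    Mimics the shape of a table-generating function: many input rows in,
    a relation of (feature, weight) out. Features are treated as binary
    indicators and the update rule is the classic perceptron one.
    """
    weights = {}
    for label, features in rows:
        score = sum(weights.get(f, 0.0) for f in features)
        if label * score <= 0:  # misclassified (or undecided): update
            for f in features:
                weights[f] = weights.get(f, 0.0) + label
    # Emit the learned model as one row per feature
    yield from weights.items()


# The example rows shown on the slide: tuple<label, array<features>>
training_rows = [
    (+1, [1, 2]),
    (+1, [1, 7, 9]),
    (-1, [1, 3, 9]),
    (+1, [3, 8]),
]
model_rows = list(train_perceptron(training_rows))
```

Running several such trainers on different partitions of the table yields several partial models, which is why a combining step keyed by feature is needed afterwards.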
Alternative Approach in Hivemall
Hivemall provides the amplify UDTF to emulate the effect of iterations in machine learning without several MapReduce steps
SET hivevar:xtimes=3;
CREATE VIEW training_x3
as
SELECT *
FROM (
  SELECT amplify(${xtimes}, *) as (rowid, label, features)
  FROM training
) t
CLUSTER BY rand()
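What the view above does to the training data can be sketched in Python: every row is repeated xtimes and the result is shuffled, so a single pass over the amplified data sees each example several times in random order. This is a sketch of the data transformation only; the function names and the fixed seed are assumptions for illustration.

```python
import random


def amplify(xtimes, rows):
    """Yield every input row xtimes times."""
    for row in rows:
        for _ in range(xtimes):
            yield row


def amplify_and_shuffle(xtimes, rows, seed=0):
    """Repeat rows xtimes, then shuffle (the role CLUSTER BY rand() plays)."""
    out = list(amplify(xtimes, rows))
    random.Random(seed).shuffle(out)  # seeded only to keep the example repeatable
    return out


# Hypothetical (rowid, label, features) rows
training = [(1, 1.0, ["a", "b"]), (2, 0.0, ["c"])]
training_x3 = amplify_and_shuffle(3, training)
```

Shuffling matters because online learners are sensitive to example order; seeing three identical copies back to back would add little.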
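How to use Hivemall
[Figure: Machine learning workflow. Training takes feature vectors and labels and produces a prediction model; prediction applies the prediction model to a feature vector to produce a label. Highlighted step: Data preparation.]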
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall - Feature Engineering
Transforming a label value to a value between 0.0 and 1.0
Applying a Min-Max Feature Normalization
create view e2006tfidf_train_scaled
as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from
  e2006tfidf_train;
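Min-max normalization maps a value to (value - min) / (max - min). A minimal Python sketch follows; the handling of the degenerate case where min equals max is an assumption made here, and the sample labels are hypothetical.

```python
def rescale(value, min_value, max_value):
    """Min-max normalize value into the range [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption for this sketch: a constant column maps to the midpoint
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical label values; min and max are computed once over the data
labels = [-4.0, -2.0, 0.0]
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
```

The minimum and maximum have to be computed over the training data beforehand, which is why the query takes them as the variables ${min_label} and ${max_label}.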
How to use Hivemall - Training
Training by logistic regression
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t
GROUP BY feature
Inner SELECT: map-only task to learn a prediction model
GROUP BY feature: shuffle map outputs to reducers by feature
avg(weight): reducers perform model averaging in parallel
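The GROUP BY feature / avg(weight) step amounts to averaging the partial models that the parallel training tasks emit. A small Python sketch, with hypothetical partial models standing in for mapper outputs:

```python
from collections import defaultdict


def average_models(partial_models):
    """Average weights per feature across partial models.

    partial_models is an iterable of models, each an iterable of
    (feature, weight) rows as emitted by one training task.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model:
            sums[feature] += weight
            counts[feature] += 1
    # A feature is averaged only over the tasks that actually emitted it
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical outputs of two map-only training tasks
mapper_1 = [("height", 1.0), ("age", 2.0)]
mapper_2 = [("height", 3.0)]
lr_model = average_models([mapper_1, mapper_2])
```

Because each feature is averaged independently, the work splits cleanly across reducers once rows are shuffled by feature.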
How to use Hivemall - Training
Training of Confidence Weighted Classifier
CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features, label) as (feature, weight)
  FROM news20b_train
) t
GROUP BY feature
Vote to use negative or positive weights for avg
+0.7, +0.3, +0.2, -0.1, +0.7
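One plausible reading of "vote to use negative or positive weights for avg" is: count the positive and negative weights, then average only the majority sign. The Python sketch below implements that reading; it is not taken from Hivemall's source, and the tie-breaking rule (ties go to the negative side) is an assumption made here.

```python
def voted_avg(weights):
    """Average the weights whose sign is in the majority.

    Assumptions for this sketch: zeros are ignored in the vote, a tie goes
    to the negative side, and an input with no signed weights returns 0.0.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# The weights shown on the slide: four positive votes, one negative
weight = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
```

For the slide's example the positive side wins four to one, so the single negative weight is dropped and the four positive weights are averaged.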
How to use Hivemall - Prediction
Prediction is done by LEFT OUTER JOIN between test data and prediction model
No need to load the entire model into memory
CREATE TABLE lr_predict
as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY
  t.rowid
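The join-based prediction can be tried out locally with Python's built-in sqlite3 module standing in for Hive. This is a sketch under several assumptions: the column is named row_id to avoid SQLite's special rowid, sigmoid is registered as a user-defined function, COALESCE is added so rows with no matching feature get a sum of 0.0, and the sample model and test rows are hypothetical.

```python
import math
import sqlite3


def predict(model_rows, test_rows):
    """Join exploded test rows with the model and return (row_id, prob) pairs.

    model_rows: iterable of (feature, weight)
    test_rows:  iterable of (row_id, feature), one row per active feature
    """
    conn = sqlite3.connect(":memory:")
    # SQLite has no built-in sigmoid, so register one for this connection
    conn.create_function("sigmoid", 1, lambda x: 1.0 / (1.0 + math.exp(-x)))
    conn.execute("CREATE TABLE lr_model (feature TEXT, weight REAL)")
    conn.execute("CREATE TABLE testing_exploded (row_id INTEGER, feature TEXT)")
    conn.executemany("INSERT INTO lr_model VALUES (?, ?)", model_rows)
    conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)", test_rows)
    result = conn.execute(
        """
        SELECT t.row_id, sigmoid(COALESCE(SUM(m.weight), 0.0)) AS prob
        FROM testing_exploded t
        LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
        GROUP BY t.row_id
        ORDER BY t.row_id
        """
    ).fetchall()
    conn.close()
    return result


# Hypothetical model and exploded test data
model = [("a", 1.0), ("b", -1.0)]
testing = [(1, "a"), (1, "b"), (2, "a"), (2, "unseen"), (3, "unseen")]
predictions = predict(model, testing)
```

The LEFT OUTER JOIN keeps test rows whose features are missing from the model, so every test row still receives a prediction.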
Real-time prediction
[Figure: Machine learning workflow split into Batch Training on Hadoop and Online Prediction on RDBMS. Prediction models are exported from Hadoop to the RDBMS, where a feature vector is turned into a label.]
bit.ly/hivemall-rtp
Conclusion
Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs
Hivemall's Positioning
➢ For SQL users that need ML
➢ For those already using Hive
➢ Ease of use and scalability in mind
Does not require coding, packaging, compiling, or introducing a new programming language or APIs.
Thank you!
Makoto YUI - Research engineer / Treasure Data
twitter: @myui
Download Hivemall from bit.ly/hivemall