Post on 22-Jan-2018
ApacheHivemall:ScalablemachinelearninglibraryforApacheHive/Spark/Pig
1).ResearchEngineer,TreasureDataMakotoYUI@myui
2017/5/16ApacheBigData NorthAmerica'17,Miami
2).ResearchEngineer,NTTTakashiYamamuro@maropu
@ApacheHivemall
Planofthetalk
1. IntroductiontoHivemall
2. HivemallonSpark
2017/5/16ApacheBigDataNorthAmerica'17,Miami
2017/5/16ApacheBigDataNorthAmerica'17,Miami
HivemallenteredApacheIncubatoronSept13,2016🎉
Sincethen,weinvited2contributorsasnewcommitters(acommitterhasbeenvotedasPPMC). Currently,weareworkingtowardthefirstApachereleaseonthisQ2.
hivemall.incubator.apache.org
WhatisApacheHivemall
ScalablemachinelearninglibrarybuiltasacollectionofHiveUDFs
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Multi/Crossplatform VersatileScalableEase-of-use
Hivemalliseasyandscalable…MLmadeeasyforSQLdevelopers
Borntobeparallelandscalable
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Ease-of-use
Scalable
CREATETABLElr_model ASSELECTfeature,-- reducersperformmodelaveraginginparallelavg(weight)asweightFROM(SELECTlogress(features,label,..)as(feature,weight)FROMtrain)t-- map-onlytaskGROUPBYfeature;-- shuffledtoreducers
ThisqueryautomaticallyrunsinparallelonHadoop
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Hivemallisamulti/cross-platformMLlibrary
HiveQL SparkSQL/Dataframe API PigLatin
HivemallisMulti/Crossplatform..Multi/Crossplatform
predictionmodelsbuiltbyHivecanbeusedfromSpark,andconversely,predictionmodelsbuildbySparkcanbeusedfromHive
HadoopHDFS
MapReduce(MRv1)
Hivemall
ApacheYARN
ApacheTezDAGprocessing
Machine Learning
Query Processing
Parallel Data Processing Framework
Resource Management
Distributed File SystemCloud Storage
SparkSQL
ApacheSpark
MESOS
Hive Pig
MLlib
Hivemall’s TechnologyStack
2017/5/16ApacheBigDataNorthAmerica'17,Miami
AmazonS3
2017/5/16ApacheBigDataNorthAmerica'17,Miami
HivemallonApacheHive
2017/5/16ApacheBigDataNorthAmerica'17,Miami
HivemallonApacheSparkDataframe
2017/5/16ApacheBigDataNorthAmerica'17,Miami
HivemallonSparkSQL
2017/5/16ApacheBigDataNorthAmerica'17,Miami
HivemallonApachePig
2017/5/16ApacheBigDataNorthAmerica'17,Miami
OnlinePredictionbyApacheStreaming
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Versatile
HivemallisaVersatilelibrary..
ü NotonlyforMachineLearningü providesabunchofgenericutilityfunctions
EachorganizationhasownsetsofUDFsfordatapreprocessing
Don’tRepeatYourself!Don’tRepeatYourself!
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Hivemallgenericfunctions
ArrayandMap Bitandcompress StringandNLP
WewelcomecontributingyourgenericUDFstoHivemall!
GeoSpatial
Top-kprocessing
> TF/IDF
> TILE> MAP_URL
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Maptilingfunctions
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n
Maptilingfunctions
Zoom=10
Zoom=15
2017/5/16ApacheBigDataNorthAmerica'17,Miami
student class score1 b 702 a 803 a 904 b 505 a 706 b 60
Top-kqueryprocessing
student class score3 a 902 a 801 b 706 b 60
Listtop-2studentsforeachclass
2017/5/16ApacheBigDataNorthAmerica'17,Miami
student class score1 b 702 a 803 a 904 b 505 a 706 b 60
Top-kqueryprocessing
Listtop-2studentsforeachclass
SELECT*FROM(SELECT*,rank()over(partitionbyclassorderbyscoredesc)asrank
FROMtable)tWHERErank<=2
RANKover()querydoesnotfinishesin24hoursLwhere20millionMOOCsclassesandavg1,000studentsineachclasses
2017/5/16ApacheBigDataNorthAmerica'17,Miami
student class score1 b 702 a 803 a 904 b 505 a 706 b 60
Top-kqueryprocessing
Listtop-2studentsforeachclass
SELECTeach_top_k(2,class,score,class,student
)as(rank,score,class,student)FROM (SELECT*FROMtableDISTRIBUTEBYclassSORTBYclass
)t
EACH_TOP_Kfinishesin2hoursJ
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Top-kqueryprocessingbyRANKOVER()
partitionbyclass
Node1
Sortbyclass,score
rankover()
rank>=2
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Top-kqueryprocessingbyEACH_TOP_K
distributedbyclass
Node1
Sortbyclass
each_top_k
OUTPUTonlyKitems
2017/5/16ApacheBigDataNorthAmerica'17,Miami
ComparisonbetweenRANKandEACH_TOP_K
distributedbyclass
Sortbyclass
each_top_k
Sortbyclass,score
rankover()
rank>=2
SORTING ISHEAVY
NEEDTOPROCESSALL
OUTPUTonlyKitems
Each_top_k isveryefficientwherethenumberofclassislarge
BoundedPriorityQueueisutilized
ListofSupportedAlgorithmsClassification✓ Perceptron✓ PassiveAggressive(PA,PA1,PA2)✓ ConfidenceWeighted(CW)✓ AdaptiveRegularizationofWeightVectors(AROW)✓ SoftConfidenceWeighted(SCW)✓ AdaGrad+RDA✓ FactorizationMachines✓ RandomForestClassification
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Regression✓LogisticRegression(SGD)✓AdaGrad (logisticloss)✓AdaDELTA (logisticloss)✓PARegression✓AROWRegression✓FactorizationMachines✓RandomForestRegression
SCW is a good first choiceTry RandomForest if SCW does not work
Logistic regression is good for getting a probability of a positive class
Factorization Machines is good where features are sparse and categorical ones
RandomForestinHivemall
EnsembleofDecisionTrees
2017/5/16ApacheBigDataNorthAmerica'17,Miami
TrainingofRandomForest
2017/5/16ApacheBigDataNorthAmerica'17,Miami
PredictionofRandomForest
2017/5/16ApacheBigDataNorthAmerica'17,Miami
SupportedAlgorithmsforRecommendation
2017/5/16ApacheBigDataNorthAmerica'17,Miami
K-NearestNeighbor✓ Minhash andb-BitMinhash
(LSHvariant)✓ SimilaritySearchonVectorSpace
(Euclid/Cosine/Jaccard/Angular)
MatrixCompletion✓MatrixFactorization✓ FactorizationMachines(regression)
each_top_k functionofHivemallisusefulforrecommendingtop-kitems
OtherSupportedAlgorithms
2017/5/16ApacheBigDataNorthAmerica'17,Miami
FeatureEngineering✓FeatureHashing✓FeatureScaling
(normalization,z-score)✓ FeatureBinning✓ TF-IDFvectorizer✓ PolynomialExpansion✓ Amplifier
NLP✓BasicEnglist textTokenizer✓JapaneseTokenizer
Evaluationmetrics✓AUC,nDCG,logloss,precisionrecall@K,andetc
2017/5/16ApacheBigDataNorthAmerica'17,Miami
FeatureEngineering– FeatureHashing
2017/5/16ApacheBigDataNorthAmerica'17,Miami
FeatureEngineering– FeatureHashing
2017/5/16ApacheBigDataNorthAmerica'17,Miami
FeatureEngineering– FeatureBinning
Mapsquantitativevariablestofixednumberofbinsbasedonquantiles/distribution
MapAgesinto3bins
2017/5/16ApacheBigDataNorthAmerica'17,Miami
FeatureSelection– SignalNoiseRatio
2017/5/16ApacheBigDataNorthAmerica'17,Miami
EvaluationMetrics
OtherSupportedFeatures
2017/5/16ApacheBigDataNorthAmerica'17,Miami
AnomalyDetection✓LocalOutlierFactor(LoF)✓ChangeFinder
Clustering/Topicmodels✓Onlinemini-batchLDA✓Onlinemini-batchPLSA
ChangePointDetection✓ChangeFinder✓SingularSpectrumTransformation
Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data
2017/5/16ApacheBigDataNorthAmerica'17,Miami
J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.
Anomaly/Change-pointDetectionbyChangeFinder
Takethis…
Anomaly/Change-pointDetectionbyChangeFinder
2017/5/16ApacheBigDataNorthAmerica'17,Miami
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Anomaly/Change-pointDetectionbyChangeFinder
…anddothis!
Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Anomaly/Change-pointDetectionbyChangeFinder
J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.
Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Anomaly/Change-pointDetectionbyChangeFinder
J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.
2017/5/16ApacheBigDataNorthAmerica'17,Miami
• T.IdeandK.Inoue,"KnowledgeDiscoveryfromHeterogeneousDynamicSystemsusingChange-PointCorrelations",Proc.SDM,2005T.
• T.IdeandK.Tsuda,"Change-pointdetectionusingKrylovsubspacelearning",Proc.SDM,2007.
Change-pointdetectionbySingularSpectrumTransformation
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Onlinemini-batchLDA
ü XGBoost Integrationü SparsevectorsupportinRandomForestsü Dockersupport(forevaluation)ü Field-awareFactorizationMachinesü GeneralizedLinearModel
• OptimizerframeworkincludingADAM• L1/L2regularization
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Othernewfeaturesindevelopment
Copyright©2016 NTT corp. All Rights Reserved.
Hivemall on
44Copyright©2016 NTT corp. All Rights Reserved.
Who am I• Takeshi Yamamuro• NTT corp. in Japan• OSS activities
• Apache Hivemall PPMC• Apache Spark contributer
45Copyright©2016 NTT corp. All Rights Reserved.
Whatʼs Spark?• Distributed data analytics engine, generalizing Map Reduce
46Copyright©2016 NTT corp. All Rights Reserved.
Whatʼs Spark?• 1. Unified Engine
• support end-to-end APs, e.g., MLlib and Streaming
• 2. High-level APIs• easy-to-use, rich optimization
• 3. Integrate broadly• storages, libraries, ...
47Copyright©2016 NTT corp. All Rights Reserved.
• Hivemall wrapper for Spark• Wrapper implementations for DataFrame/SQL• + some utilities for easy-to-use in Spark
• The wrapper makes you...• run most of Hivemall functions in Spark• try examples easily in your laptop• improve some function performance in Spark
Whatʼs Hivemall on Spark?
48Copyright©2016 NTT corp. All Rights Reserved.
• Hivemall already has many fascinating ML algorithms and useful utilities
• + High barriers to add newer algorithms in MLlib
Whyʼs Hivemall on Spark?
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
49Copyright©2016 NTT corp. All Rights Reserved.
• Most of Hivemall functions supported in Spark v2.0 and v2.1
Current Status
- To compile for Spark v2.1
$ git clone https://github.com/apache/incubator-hivemall$ cd incubator-hivemall$ mvn package –Pspark-2.1–DskipTests$ ls target/*spark*target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar...
50Copyright©2016 NTT corp. All Rights Reserved.
• Most of Hivemall functions supported in Spark v2.0 and v2.1
Current Status
- To compile for Spark v2.0
$ git clone https://github.com/apache/incubator-hivemall$ cd incubator-hivemall$ mvn package –Pspark-2.0–DskipTests$ ls target/*spark*target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar...
51Copyright©2016 NTT corp. All Rights Reserved.
• 1. Fetch training and test data• 2. Load these data in Spark• 3. Build a model• 4. Do predictions
4 Step Example
52Copyright©2016 NTT corp. All Rights Reserved.
• E2006 tfidf regression dataset• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
1. Fetch training and test data
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
53Copyright©2016 NTT corp. All Rights Reserved.
2. Load data in Spark
// Download Spark-v2.1 and launch a spark-shell with Hivemall$ <HIVEMALL_HOME>/bin/spark-shell
// Create DataFrame from the bzipʼd libsvm-formatted filescala> val trainDf = spark.read.format("libsvm”).load(“E2006.train.bz2")
scala> trainDf.printSchemaroot|-- label: double (nullable = true)|-- features: vector (nullable = true)
54Copyright©2016 NTT corp. All Rights Reserved.
2. Load data in Spark (Detailed)
0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…
trainDf
Partition1 Partition2 Partition3 PartitionN…
…
…
Load in parallel becausebzip2 is splittable
55Copyright©2016 NTT corp. All Rights Reserved.
3. Build a model - DataFrame
scala> paste:val modelDf = trainDf.train_logregr($"features", $"label")
.groupBy("feature”)
.agg("weight" -> "avg”)
56Copyright©2016 NTT corp. All Rights Reserved.
3. Build a model - SQL
scala> trainDf.createOrReplaceTempView("TrainTable")scala> paste:val modelDf = sql("""
| SELECT feature, AVG(weight) AS weight| FROM (| SELECT train_logregr(features, label)| AS (feature, weight)| FROM TrainTable| )| GROUP BY feature
""".stripMargin)
57Copyright©2016 NTT corp. All Rights Reserved.
4. Do predictions - DataFrame
scala> paste:val df = testDf.select(rowid(), $"features").explode_vector($"features").cache
# Do predictionsdf.join(modelDf, df("feature") === model("feature"), "LEFT_OUTER").groupBy("rowid").avg(sigmoid(sum($"weight" * $"value")))
58Copyright©2016 NTT corp. All Rights Reserved.
4. Do predictions - SQL
scala> modelDf.createOrReplaceTempView(”ModelTable")scala> df.createOrReplaceTempView(”TestTable”)scala> paste:sql("""
| SELECT rowid, sigmoid(value * weight) AS predicted| FROM TrainTable t| LEFT OUTER JOIN ModelTable m| ON t.feature = m.feature| GROUP BY rowid
""".stripMargin)
59Copyright©2016 NTT corp. All Rights Reserved.
• Top-K Join• Join two relations and compute top-K for each group
Add New Features in Spark
scala> paste:val topkDf = leftDf.join(rightDf, “group” :: Nil, “INNER”).select(leftDf("group"), (leftDf(“x”) + rightDf(“y”)).as(“score”).withColumn("rank",
rank().over(partitionBy($"group”).orderBy($"score".desc))).where($"rank" <= topK)
Use Spark vanilla APIs
60Copyright©2016 NTT corp. All Rights Reserved.
• Top-K Join• Join two relations and compute top-K for each group
Add New Features in Spark
scala> paste:val topkDf = leftDf.join(rightDf, “group” :: Nil, “INNER”).select(leftDf("group"), (leftDf(“x”) + rightDf(“y”)).as(“score”).withColumn("rank", rank().over(partitionBy($"group”).orderBy($"score".desc))
).where($"rank" <= topK)
1. Join leftDf with rightDf
2. Sort data in each group
3. Filter top-K data
Use Spark vanilla APIs
61Copyright©2016 NTT corp. All Rights Reserved.
• Top-K Join• Join two relations and compute top-K for each group
Add New Features in Spark
scala> paste:val topkDf = leftDf.top_k_join(lit(topK), rightDf, leftDf(“group”) === rightDf(“group”),(leftDf(“x”) + rightDf(“y”)).as(“score”)
)
Use a fused API implemented in Hivemall
62Copyright©2016 NTT corp. All Rights Reserved.
• How top_k_join works?
Add New Features in Spark
group xgroup y
leftDf
rightDf
・・・
63Copyright©2016 NTT corp. All Rights Reserved.
• How top_k_join works?
Add New Features in Spark
group xgroup y
・・・
・・・
K-lengthpriority queue
Compute top-K rows by using a priority queue
leftDf
rightDf
64Copyright©2016 NTT corp. All Rights Reserved.
• How top_k_join works?
Add New Features in Spark
group xgroup y
・・・
・・・
K-lengthpriority queue
Compute top-K rows by using a priority queue
Only join top-K rowsleftDf
rightDf
65Copyright©2016 NTT corp. All Rights Reserved.
• Codegenʼd top_k_join for fast processing• Spark internally generates Java code from a built
physical plan, and compiles/executes it
Add New Features in Spark
Spark Planner (Catalyst) Overview
66Copyright©2016 NTT corp. All Rights Reserved.
• Codegenʼd top_k_join for fast processing• Spark internally generates Java code from a built
physical plan, and compiles/executes it
Add New Features in Spark
Codegen
A Physical Plan
67Copyright©2016 NTT corp. All Rights Reserved.
• Codegenʼd top_k_join for fast processing• Spark internally generates Java code from a built
physical plan, and compiles/executes it
Add New Features in Spark
scala> topkDf.explain== Physical Plan ==*ShuffledHashJoinTopK 1, [group#10], [group#27]:- Exchange hashpartitioning(group#10, 200): +- LocalTableScan [group#10, x#11]+- Exchange hashpartitioning(group#27, 200)
+- LocalTableScan [group#27, y#28]
ʻ*ʼ in the head means a codegenʼd plan
68Copyright©2016 NTT corp. All Rights Reserved.
• Benchmark Result
Add New Features in Spark
~13 times faster than vanilla APIs!!
69Copyright©2016 NTT corp. All Rights Reserved.
• Support more useful functions• Spark only implements naive and basic functions in
terms of usability and maintainability• e.g.,)
• f latten - f latten a nested schema into f lat one• from_csv/to_csv – interconversion of CSV str ings and structured data with schemas• ...
• See more in the Hivemall user guide• https://hivemal l . incubator.apache.org /userguide /spark /misc /misc.html
Add New Features in Spark
70Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))scala> val sparkUdf = functions.udf(sigmoidFunc)scala> df.select(sparkUdf($“value”))
71Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
scala> val hiveUdf = HivemallOps.sigmoidscala> df.select(hiveUdf($“value”))
72Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them
• ex.1) Compute a sigmoid function
Improve Some Functions in Spark
73Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
Improve Some Functions in Spark
scala> paste:df.withColumn(“rank”,rank().over(Window.partitionBy($"key").orderBy($"score".desc)
) .where($"rank" <= topK)
74Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
• Fixed the overhead issue for each_top_k• See pr#353: “Implement EachTopK as a generator expression“ in Spark
Improve Some Functions in Spark
scala> df.each_top_k(topK, “key”, “score”, “value”)
75Copyright©2016 NTT corp. All Rights Reserved.
• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them
• ex.2) Compute top-k for each key group
Improve Some Functions in Spark
~4 times faster than rank!!
76Copyright©2016 NTT corp. All Rights Reserved.
• supports under development• fast implementation of the gradient tree boosting• widely used in Kaggle competitions
• This integration will make you...• load built models and predict in parallel• build multiple models in parallel
for cross-validation
Support 3rd Party Library in Spark
ConclusionandTakeawayHivemallisamulti/cross-platformMLlibraryprovidingacollectionofmachinelearningalgorithmsasHiveUDFs/UDTFs
2017/5/16ApacheBigDataNorthAmerica'17,Miami
WewelcomeyourcontributionstoApacheHivemallJ
HiveQL SparkSQL/Dataframe API PigLatin
2017/5/16ApacheBigDataNorthAmerica'17,Miami
Anyfeaturerequestorquestions?