Apache Hivemall @ Apache BigData '17, Miami


Apache Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig

1). Makoto YUI (@myui), Research Engineer, Treasure Data

2017/5/16, Apache BigData North America '17, Miami

2). Takeshi Yamamuro (@maropu), Research Engineer, NTT

@ApacheHivemall

Plan of the talk

1. Introduction to Hivemall

2. Hivemall on Spark

HivemallenteredApacheIncubatoronSept13,2016🎉

Sincethen,weinvited2contributorsasnewcommitters(acommitterhasbeenvotedasPPMC). Currently,weareworkingtowardthefirstApachereleaseonthisQ2.

hivemall.incubator.apache.org

What is Apache Hivemall

Scalable machine learning library built as a collection of Hive UDFs

Multi/Cross platform · Versatile · Scalable · Ease-of-use

Hivemall is easy and scalable … ML made easy for SQL developers

Born to be parallel and scalable

Ease-of-use

Scalable

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

This query automatically runs in parallel on Hadoop
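For completeness, a minimal hedged sketch of the prediction side (it assumes the test table has already been exploded into (rowid, feature, value) rows; sigmoid() is a Hivemall function, while the table names are illustrative):

SELECT
  t.rowid,
  sigmoid(sum(m.weight * t.value)) AS prob   -- dot product of model weights and feature values
FROM test_exploded t
LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid;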


Hivemall is a multi/cross-platform ML library

HiveQL · Spark SQL / DataFrame API · Pig Latin

Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive.

Hivemall's Technology Stack

• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, Spark SQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, Mesos
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3

Hivemall on Apache Hive

Hivemall on Apache Spark DataFrame

Hivemall on Spark SQL

Hivemall on Apache Pig

Online Prediction by Apache Streaming

Versatile

Hivemall is a Versatile library ..

✓ Not only for Machine Learning
✓ Provides a bunch of generic utility functions

Each organization has its own set of UDFs for data preprocessing

Don't Repeat Yourself!

Hivemall generic functions

Array and Map · Bit and compress · String and NLP · Geo-Spatial · Top-k processing
> TF/IDF
> TILE
> MAP_URL

We welcome contributions of your generic UDFs to Hivemall!

Map tiling functions

tile(lat, lon, zoom) = xtile(lon, zoom) + ytile(lat, zoom) * 2^n,  where n = zoom level
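As a hedged illustration (the points table and its lat/lon columns are hypothetical; tile() is one of the Hivemall geospatial UDFs listed above), tile ids can be used to aggregate points per map tile:

-- count points per tile at zoom level 15
SELECT
  tile(lat, lon, 15) AS tile_id,   -- = xtile(lon, 15) + ytile(lat, 15) * 2^15
  count(1) AS cnt
FROM points
GROUP BY tile(lat, lon, 15);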

Map tiling functions

(Figure: the same map rendered at Zoom = 10 and Zoom = 15)

Top-k query processing: list top-2 students for each class

Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60

Result:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60


Top-k query processing

List top-2 students for each class

SELECT * FROM (
  SELECT *,
    rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2

The RANK OVER() query does not finish in 24 hours 🙁 with 20 million MOOC classes and 1,000 students per class on average.


Top-k query processing

List top-2 students for each class

SELECT
  each_top_k(2, class, score, class, student)
    as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t

EACH_TOP_K finishes in 2 hours 🙂

Top-k query processing by RANK OVER()

On each node (e.g., Node 1): partition by class → sort by (class, score) → rank over() → filter rank <= 2

Top-k query processing by EACH_TOP_K

On each node (e.g., Node 1): distribute by class → sort by class → each_top_k → OUTPUT only K items

Comparison between RANK and EACH_TOP_K

RANK OVER(): sort by (class, score) → rank over() → filter
  • SORTING IS HEAVY
  • NEEDS TO PROCESS ALL rows before filtering

EACH_TOP_K: distribute by class → sort by class → each_top_k
  • OUTPUTS only K items per class

each_top_k is very efficient when the number of classes is large.

A bounded priority queue is utilized internally.

List of Supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

SCW is a good first choice. Try RandomForest if SCW does not work.

Logistic regression is good for getting the probability of the positive class.

Factorization Machines are a good choice when features are sparse and mostly categorical.
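As a hedged sketch of how SCW training looks in practice (it follows the same map-then-aggregate pattern as the logistic regression query earlier; train_scw and argmin_kld are Hivemall functions, but the exact output columns and options should be checked against the user guide):

CREATE TABLE scw_model AS
SELECT
  feature,
  argmin_kld(weight, covar) AS weight   -- covariance-aware averaging for CW-family models
FROM (
  SELECT train_scw(features, label) AS (feature, weight, covar)
  FROM train
) t
GROUP BY feature;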

RandomForest in Hivemall

Ensemble of Decision Trees

Training of RandomForest

Prediction of RandomForest
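A rough, hedged sketch of RandomForest training and prediction (train_randomforest_classifier, tree_predict, and rf_ensemble are Hivemall functions, but their arguments and output columns vary across versions; the training/testing tables are illustrative):

-- train an ensemble of 50 trees
CREATE TABLE rf_model AS
SELECT train_randomforest_classifier(features, label, '-trees 50')
FROM training;

-- apply every tree to every test row, then take the majority vote per row
SELECT
  p.rowid,
  rf_ensemble(predicted) AS predicted
FROM (
  SELECT t.rowid,
         tree_predict(m.model_id, m.model, t.features, '-classification') AS predicted
  FROM rf_model m
  CROSS JOIN testing t
) p
GROUP BY p.rowid;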

Supported Algorithms for Recommendation

K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items
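A hedged sketch of that pattern (cosine_similarity and each_top_k are Hivemall functions; the item_features(item_id, features) table and the self-join shape are illustrative):

-- recommend the 10 most similar items for each item
SELECT
  each_top_k(10, item_id, similarity, item_id, other_item_id)
    AS (rank, similarity, item_id, other_item_id)
FROM (
  SELECT
    t1.item_id,
    t2.item_id AS other_item_id,
    cosine_similarity(t1.features, t2.features) AS similarity
  FROM item_features t1
  CROSS JOIN item_features t2
  WHERE t1.item_id != t2.item_id
  DISTRIBUTE BY t1.item_id SORT BY t1.item_id
) t;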

Other Supported Algorithms

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ Feature Binning
✓ TF-IDF vectorizer
✓ Polynomial Expansion
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer

Evaluation metrics
✓ AUC, nDCG, logloss, precision/recall@K, etc.
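For instance, a hedged sketch of computing AUC (auc() is a Hivemall UDAF over (score, label) pairs; the predictions table is hypothetical, and rows typically need to be fed in descending score order):

SELECT auc(prob, label) AS auc
FROM (
  SELECT prob, label
  FROM predictions
  ORDER BY prob DESC
) t;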


Feature Engineering – Feature Hashing
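A hedged sketch of feature hashing (feature_hashing and mhash are Hivemall UDFs; the train table and its columns are illustrative):

-- replace high-cardinality feature names with bounded integer hash indices
SELECT feature_hashing(features) AS features, label
FROM train;

-- hash a single feature name
SELECT mhash('user#4505:item#102');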


Feature Engineering – Feature Binning

Maps quantitative variables to a fixed number of bins based on quantiles/distribution

Example: map ages into 3 bins
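A hedged sketch of that (build_bins and feature_binning are Hivemall functions, though their exact signatures should be checked against the user guide; the users(age) table is hypothetical):

-- derive 3 quantile-based bins for age, then map each row to its bin
WITH bins AS (
  SELECT build_bins(age, 3) AS quantiles FROM users
)
SELECT u.age, feature_binning(u.age, b.quantiles) AS age_bin
FROM users u
CROSS JOIN bins b;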


Feature Selection – Signal Noise Ratio

Evaluation Metrics

Other Supported Features

Anomaly Detection
✓ Local Outlier Factor (LOF)
✓ ChangeFinder

Clustering / Topic models
✓ Online mini-batch LDA
✓ Online mini-batch PLSA

Change Point Detection
✓ ChangeFinder
✓ Singular Spectrum Transformation

An efficient algorithm for finding change points and outliers in time-series data


J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.

Anomaly/Change-point Detection by ChangeFinder

Take this…


…and do this!

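A hedged sketch of applying ChangeFinder in HiveQL (changefinder() is the Hivemall UDF; its options and the exact shape of its output scores should be checked against the user guide, and the timeseries(ts, value) table is hypothetical):

SELECT
  ts,
  value,
  changefinder(value) AS scores   -- yields outlier and change-point scores per row
FROM (
  SELECT ts, value
  FROM timeseries
  DISTRIBUTE BY 1 SORT BY ts ASC  -- the detector is stateful, so feed rows in time order
) t;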

Change-point detection by Singular Spectrum Transformation

• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.

Online mini-batch LDA

Other new features in development

✓ XGBoost Integration
✓ Sparse vector support in RandomForests
✓ Docker support (for evaluation)
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization

Hivemall on Apache Spark


Who am I

• Takeshi Yamamuro
• NTT corp. in Japan
• OSS activities
  • Apache Hivemall PPMC
  • Apache Spark contributor

What's Spark?

• Distributed data analytics engine, generalizing MapReduce

What's Spark?

• 1. Unified Engine
  • supports end-to-end applications, e.g., MLlib and Streaming

• 2. High-level APIs
  • easy to use, rich optimization

• 3. Integrates broadly
  • storage systems, libraries, ...

What's Hivemall on Spark?

• Hivemall wrapper for Spark
  • Wrapper implementations for DataFrame/SQL
  • + some utilities for ease of use in Spark

• The wrapper lets you...
  • run most of the Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark

Why Hivemall on Spark?

• Hivemall already has many fascinating ML algorithms and useful utilities

• + High barriers to adding newer algorithms in MLlib
  https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Current Status

• Most of the Hivemall functions are supported in Spark v2.0 and v2.1

• To compile for Spark v2.1:

$ git clone https://github.com/apache/incubator-hivemall
$ cd incubator-hivemall
$ mvn package -Pspark-2.1 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar
...

• To compile for Spark v2.0:

$ git clone https://github.com/apache/incubator-hivemall
$ cd incubator-hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...

4 Step Example

• 1. Fetch training and test data
• 2. Load these data in Spark
• 3. Build a model
• 4. Do predictions

1. Fetch training and test data

• E2006 tfidf regression dataset
  http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

2. Load data in Spark

// Download Spark v2.1 and launch a spark-shell with Hivemall
$ <HIVEMALL_HOME>/bin/spark-shell

// Create a DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = spark.read.format("libsvm").load("E2006.train.bz2")

scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

2. Load data in Spark (Detailed)

An input line in the libsvm format looks like:

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf is split into Partition1, Partition2, ..., PartitionN and loaded in parallel because bzip2 is splittable.

3. Build a model - DataFrame

scala> :paste
val modelDf = trainDf
  .train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")

3. Build a model - SQL

scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  | FROM (
  |   SELECT train_logregr(features, label)
  |     AS (feature, weight)
  |   FROM TrainTable
  | )
  | GROUP BY feature
""".stripMargin)

4. Do predictions - DataFrame

scala> :paste
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache

// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sigmoid(sum($"weight" * $"value")))

4. Do predictions - SQL

scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT rowid, sigmoid(sum(value * weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  |   ON t.feature = m.feature
  | GROUP BY rowid
""".stripMargin)

Add New Features in Spark

• Top-K Join
  • Join two relations and compute top-K for each group

• Using Spark vanilla APIs:

scala> :paste
// assumes org.apache.spark.sql.functions._ and org.apache.spark.sql.expressions.Window are imported
val topkDf = leftDf.join(rightDf, "group" :: Nil, "INNER")
  .select(leftDf("group"), (leftDf("x") + rightDf("y")).as("score"))
  .withColumn("rank",
    rank().over(Window.partitionBy($"group").orderBy($"score".desc)))
  .where($"rank" <= topK)

The same vanilla-API query, step by step:
  1. Join leftDf with rightDf
  2. Sort data in each group
  3. Filter top-K data

• Top-K Join with a fused API implemented in Hivemall:

scala> :paste
val topkDf = leftDf.top_k_join(
  lit(topK),
  rightDf,
  leftDf("group") === rightDf("group"),
  (leftDf("x") + rightDf("y")).as("score")
)

• How does top_k_join work?
  • leftDf rows are grouped by key (e.g., group x, group y) and matched with the corresponding rightDf rows
  • the top-K rows of each group are computed with a K-length priority queue
  • only the top-K rows are then joined

• Codegen'd top_k_join for fast processing
  • Spark internally generates Java code from a built physical plan, and compiles/executes it
  (Figures: Spark Planner (Catalyst) overview; a physical plan going through Codegen)

scala> topkDf.explain
== Physical Plan ==
*ShuffledHashJoinTopK 1, [group#10], [group#27]
:- Exchange hashpartitioning(group#10, 200)
:  +- LocalTableScan [group#10, x#11]
+- Exchange hashpartitioning(group#27, 200)
   +- LocalTableScan [group#27, y#28]

'*' at the head of an operator means a codegen'd plan

• Benchmark Result
  • top_k_join is ~13 times faster than the vanilla APIs!!

• Support more useful functions
  • Spark only implements naive and basic functions in terms of usability and maintainability
  • e.g.,
    • flatten - flatten a nested schema into a flat one
    • from_csv/to_csv - interconversion of CSV strings and structured data with schemas
    • ...
  • See more in the Hivemall user guide:
    https://hivemall.incubator.apache.org/userguide/spark/misc/misc.html

Improve Some Functions in Spark

• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them

• ex.1) Compute a sigmoid function

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))

The same computation using the sigmoid function exposed by Hivemall on Spark:

scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))


• ex.2) Compute top-k for each key group

scala> :paste
df.withColumn("rank",
    rank().over(Window.partitionBy($"key").orderBy($"score".desc)))
  .where($"rank" <= topK)

• Fixed the overhead issue for each_top_k
  • See pr#353: "Implement EachTopK as a generator expression" in Spark

scala> df.each_top_k(topK, "key", "score", "value")

each_top_k is ~4 times faster than rank!!

Support 3rd Party Library in Spark

• XGBoost support is under development
  • fast implementation of gradient tree boosting
  • widely used in Kaggle competitions

• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation

Conclusion and Takeaway

Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs

We welcome your contributions to Apache Hivemall 🙂

HiveQL · Spark SQL / DataFrame API · Pig Latin

Any feature requests or questions?