Apache Hivemall @ Apache BigData '17, Miami

ApacheHivemall:ScalablemachinelearninglibraryforApacheHive/Spark/Pig

1).ResearchEngineer,TreasureDataMakotoYUI@myui

2017/5/16ApacheBigData NorthAmerica'17,Miami

2).ResearchEngineer,NTTTakashiYamamuro@maropu

@ApacheHivemall

Planofthetalk

1. IntroductiontoHivemall

2. HivemallonSpark

2017/5/16ApacheBigDataNorthAmerica'17,Miami

HivemallenteredApacheIncubatoronSept13,2016🎉

Sincethen,weinvited2contributorsasnewcommitters(acommitterhasbeenvotedasPPMC). Currently,weareworkingtowardthefirstApachereleaseonthisQ2.

hivemall.incubator.apache.org

WhatisApacheHivemall

ScalablemachinelearninglibrarybuiltasacollectionofHiveUDFs

Multi/Crossplatform VersatileScalableEase-of-use

Hivemalliseasyandscalable…MLmadeeasyforSQLdevelopers

Borntobeparallelandscalable

Ease-of-use

Scalable

CREATETABLElr_model ASSELECTfeature,-- reducersperformmodelaveraginginparallelavg(weight)asweightFROM(SELECTlogress(features,label,..)as(feature,weight)FROMtrain)t-- map-onlytaskGROUPBYfeature;-- shuffledtoreducers

ThisqueryautomaticallyrunsinparallelonHadoop

Hivemallisamulti/cross-platformMLlibrary

HiveQL SparkSQL/Dataframe API PigLatin

HivemallisMulti/Crossplatform..Multi/Crossplatform

predictionmodelsbuiltbyHivecanbeusedfromSpark,andconversely,predictionmodelsbuildbySparkcanbeusedfromHive

HadoopHDFS

MapReduce(MRv1)

Hivemall

ApacheYARN

ApacheTezDAGprocessing

Machine Learning

Query Processing

Parallel Data Processing Framework

Resource Management

Distributed File SystemCloud Storage

SparkSQL

ApacheSpark

Hive Pig

Hivemall’s TechnologyStack

AmazonS3

HivemallonApacheHive

HivemallonApacheSparkDataframe

HivemallonSparkSQL

HivemallonApachePig

OnlinePredictionbyApacheStreaming

Versatile

HivemallisaVersatilelibrary..

ü NotonlyforMachineLearningü providesabunchofgenericutilityfunctions

EachorganizationhasownsetsofUDFsfordatapreprocessing

Don’tRepeatYourself!Don’tRepeatYourself!

Hivemallgenericfunctions

ArrayandMap Bitandcompress StringandNLP

WewelcomecontributingyourgenericUDFstoHivemall!

GeoSpatial

Top-kprocessing

> TF/IDF

> TILE> MAP_URL

Maptilingfunctions

Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n

Maptilingfunctions

Zoom=10

Zoom=15

student class score1 b 702 a 803 a 904 b 505 a 706 b 60

Top-kqueryprocessing

student class score3 a 902 a 801 b 706 b 60

Listtop-2studentsforeachclass

SELECT*FROM(SELECT*,rank()over(partitionbyclassorderbyscoredesc)asrank

FROMtable)tWHERErank<=2

RANKover()querydoesnotfinishesin24hoursLwhere20millionMOOCsclassesandavg1,000studentsineachclasses

SELECTeach_top_k(2,class,score,class,student

)as(rank,score,class,student)FROM (SELECT*FROMtableDISTRIBUTEBYclassSORTBYclass

EACH_TOP_Kfinishesin2hoursJ

Top-kqueryprocessingbyRANKOVER()

partitionbyclass

Sortbyclass,score

rankover()

rank>=2

Top-kqueryprocessingbyEACH_TOP_K

distributedbyclass

Sortbyclass

each_top_k

OUTPUTonlyKitems

ComparisonbetweenRANKandEACH_TOP_K

distributedbyclass

Sortbyclass

each_top_k

Sortbyclass,score

rankover()

rank>=2

SORTING ISHEAVY

NEEDTOPROCESSALL

OUTPUTonlyKitems

Each_top_k isveryefficientwherethenumberofclassislarge

BoundedPriorityQueueisutilized

ListofSupportedAlgorithmsClassification✓ Perceptron✓ PassiveAggressive(PA,PA1,PA2)✓ ConfidenceWeighted(CW)✓ AdaptiveRegularizationofWeightVectors(AROW)✓ SoftConfidenceWeighted(SCW)✓ AdaGrad+RDA✓ FactorizationMachines✓ RandomForestClassification

Regression✓LogisticRegression(SGD)✓AdaGrad (logisticloss)✓AdaDELTA (logisticloss)✓PARegression✓AROWRegression✓FactorizationMachines✓RandomForestRegression

SCW is a good first choiceTry RandomForest if SCW does not work

Logistic regression is good for getting a probability of a positive class

Factorization Machines is good where features are sparse and categorical ones

RandomForestinHivemall

EnsembleofDecisionTrees

TrainingofRandomForest

PredictionofRandomForest

SupportedAlgorithmsforRecommendation

K-NearestNeighbor✓ Minhash andb-BitMinhash

(LSHvariant)✓ SimilaritySearchonVectorSpace

(Euclid/Cosine/Jaccard/Angular)

MatrixCompletion✓MatrixFactorization✓ FactorizationMachines(regression)

each_top_k functionofHivemallisusefulforrecommendingtop-kitems

OtherSupportedAlgorithms

FeatureEngineering✓FeatureHashing✓FeatureScaling

(normalization,z-score)✓ FeatureBinning✓ TF-IDFvectorizer✓ PolynomialExpansion✓ Amplifier

NLP✓BasicEnglist textTokenizer✓JapaneseTokenizer

Evaluationmetrics✓AUC,nDCG,logloss,precisionrecall@K,andetc

FeatureEngineering– FeatureHashing

FeatureEngineering– FeatureBinning

Mapsquantitativevariablestofixednumberofbinsbasedonquantiles/distribution

MapAgesinto3bins

FeatureSelection– SignalNoiseRatio

EvaluationMetrics

OtherSupportedFeatures

AnomalyDetection✓LocalOutlierFactor(LoF)✓ChangeFinder

Clustering/Topicmodels✓Onlinemini-batchLDA✓Onlinemini-batchPLSA

ChangePointDetection✓ChangeFinder✓SingularSpectrumTransformation

Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data

J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.

Anomaly/Change-pointDetectionbyChangeFinder

Takethis…

…anddothis!

• T.IdeandK.Inoue,"KnowledgeDiscoveryfromHeterogeneousDynamicSystemsusingChange-PointCorrelations",Proc.SDM,2005T.

• T.IdeandK.Tsuda,"Change-pointdetectionusingKrylovsubspacelearning",Proc.SDM,2007.

Change-pointdetectionbySingularSpectrumTransformation

Onlinemini-batchLDA

ü XGBoost Integrationü SparsevectorsupportinRandomForestsü Dockersupport(forevaluation)ü Field-awareFactorizationMachinesü GeneralizedLinearModel

• OptimizerframeworkincludingADAM• L1/L2regularization

Othernewfeaturesindevelopment

Hivemall on

Who am I• Takeshi Yamamuro• NTT corp. in Japan• OSS activities

• Apache Hivemall PPMC• Apache Spark contributer

Whatʼs Spark?• Distributed data analytics engine, generalizing Map Reduce

Whatʼs Spark?• 1. Unified Engine

• support end-to-end APs, e.g., MLlib and Streaming

• 2. High-level APIs• easy-to-use, rich optimization

• 3. Integrate broadly• storages, libraries, ...

• Hivemall wrapper for Spark• Wrapper implementations for DataFrame/SQL• + some utilities for easy-to-use in Spark

• The wrapper makes you...• run most of Hivemall functions in Spark• try examples easily in your laptop• improve some function performance in Spark

Whatʼs Hivemall on Spark?

• Hivemall already has many fascinating ML algorithms and useful utilities

• + High barriers to add newer algorithms in MLlib

Whyʼs Hivemall on Spark?

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

• Most of Hivemall functions supported in Spark v2.0 and v2.1

Current Status

- To compile for Spark v2.1

$ git clone https://github.com/apache/incubator-hivemall$ cd incubator-hivemall$ mvn package –Pspark-2.1–DskipTests$ ls target/*spark*target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar...

• Most of Hivemall functions supported in Spark v2.0 and v2.1

Current Status

- To compile for Spark v2.0

$ git clone https://github.com/apache/incubator-hivemall$ cd incubator-hivemall$ mvn package –Pspark-2.0–DskipTests$ ls target/*spark*target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar...

• 1. Fetch training and test data• 2. Load these data in Spark• 3. Build a model• 4. Do predictions

4 Step Example

• E2006 tfidf regression dataset• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

1. Fetch training and test data

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

2. Load data in Spark

// Download Spark-v2.1 and launch a spark-shell with Hivemall$ <HIVEMALL_HOME>/bin/spark-shell

// Create DataFrame from the bzipʼd libsvm-formatted filescala> val trainDf = spark.read.format("libsvm”).load(“E2006.train.bz2")

scala> trainDf.printSchemaroot|-- label: double (nullable = true)|-- features: vector (nullable = true)

2. Load data in Spark (Detailed)

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf

Partition1 Partition2 Partition3 PartitionN…

Load in parallel becausebzip2 is splittable

3. Build a model - DataFrame

scala> paste:val modelDf = trainDf.train_logregr($"features", $"label")

.groupBy("feature”)

.agg("weight" -> "avg”)

3. Build a model - SQL

scala> trainDf.createOrReplaceTempView("TrainTable")scala> paste:val modelDf = sql("""

""".stripMargin)

4. Do predictions - DataFrame

scala> paste:val df = testDf.select(rowid(), $"features").explode_vector($"features").cache

# Do predictionsdf.join(modelDf, df("feature") === model("feature"), "LEFT_OUTER").groupBy("rowid").avg(sigmoid(sum($"weight" * $"value")))

4. Do predictions - SQL

scala> modelDf.createOrReplaceTempView(”ModelTable")scala> df.createOrReplaceTempView(”TestTable”)scala> paste:sql("""

""".stripMargin)

• Top-K Join• Join two relations and compute top-K for each group

Add New Features in Spark

scala> paste:val topkDf = leftDf.join(rightDf, “group” :: Nil, “INNER”).select(leftDf("group"), (leftDf(“x”) + rightDf(“y”)).as(“score”).withColumn("rank",

rank().over(partitionBy($"group”).orderBy($"score".desc))).where($"rank" <= topK)

Use Spark vanilla APIs

scala> paste:val topkDf = leftDf.join(rightDf, “group” :: Nil, “INNER”).select(leftDf("group"), (leftDf(“x”) + rightDf(“y”)).as(“score”).withColumn("rank", rank().over(partitionBy($"group”).orderBy($"score".desc))

).where($"rank" <= topK)

1. Join leftDf with rightDf

2. Sort data in each group

3. Filter top-K data

Use Spark vanilla APIs

scala> paste:val topkDf = leftDf.top_k_join(lit(topK), rightDf, leftDf(“group”) === rightDf(“group”),(leftDf(“x”) + rightDf(“y”)).as(“score”)

Use a fused API implemented in Hivemall

• How top_k_join works?

group xgroup y

leftDf

rightDf

・・・

group xgroup y

・・・

K-lengthpriority queue

Compute top-K rows by using a priority queue

leftDf

rightDf

group xgroup y

・・・

K-lengthpriority queue

Compute top-K rows by using a priority queue

Only join top-K rowsleftDf

rightDf

• Codegenʼd top_k_join for fast processing• Spark internally generates Java code from a built

physical plan, and compiles/executes it

Spark Planner (Catalyst) Overview

Codegen

A Physical Plan

scala> topkDf.explain== Physical Plan ==*ShuffledHashJoinTopK 1, [group#10], [group#27]:- Exchange hashpartitioning(group#10, 200): +- LocalTableScan [group#10, x#11]+- Exchange hashpartitioning(group#27, 200)

+- LocalTableScan [group#27, y#28]

ʻ*ʼ in the head means a codegenʼd plan

• Benchmark Result

~13 times faster than vanilla APIs!!

• Support more useful functions• Spark only implements naive and basic functions in

terms of usability and maintainability• e.g.,)

• f latten - f latten a nested schema into f lat one• from_csv/to_csv – interconversion of CSV str ings and structured data with schemas• ...

• See more in the Hivemall user guide• https://hivemal l . incubator.apache.org /userguide /spark /misc /misc.html

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.1) Compute a sigmoid function

Improve Some Functions in Spark

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))scala> val sparkUdf = functions.udf(sigmoidFunc)scala> df.select(sparkUdf($“value”))

scala> val hiveUdf = HivemallOps.sigmoidscala> df.select(hiveUdf($“value”))

• ex.2) Compute top-k for each key group

scala> paste:df.withColumn(“rank”,rank().over(Window.partitionBy($"key").orderBy($"score".desc)

) .where($"rank" <= topK)

• Fixed the overhead issue for each_top_k• See pr#353: “Implement EachTopK as a generator expression“ in Spark

scala> df.each_top_k(topK, “key”, “score”, “value”)

~4 times faster than rank!!

• supports under development• fast implementation of the gradient tree boosting• widely used in Kaggle competitions

• This integration will make you...• load built models and predict in parallel• build multiple models in parallel

for cross-validation

Support 3rd Party Library in Spark

ConclusionandTakeawayHivemallisamulti/cross-platformMLlibraryprovidingacollectionofmachinelearningalgorithmsasHiveUDFs/UDTFs

WewelcomeyourcontributionstoApacheHivemallJ

HiveQL SparkSQL/Dataframe API PigLatin

Anyfeaturerequestorquestions?

Apache Hivemall @ Apache BigData '17, Miami

Data & Analytics

Transcript of Apache Hivemall @ Apache BigData '17, Miami

Datacenter Computing with Apache Mesos - BigData DC

Recommendation 101 using Hivemall

Apache Hivemall: Scalable machine learning library for ... · Apache Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig 1). Research Engineer, Treasure Data Makoto

Professional Training for BigData and Apache Hadoophadoopexam.com/BigData_Hadoop_Training_Brochure.pdf · Professional Training for BigData and Apache Hadoop ... This training will

· Apache Spark Hortonworks Data Platform - Operación y Admiistración MapR Apache Ambari Bases de Datos BigData Apache Cassandra / Apache HBasse

3rd Hivemall meetup

TVM & THE APACHE SOFTWARE FOUNDATION · 䡧 Mentor for mxnet, HiveMall, Nemo Here to help the TVM community get into The Apache Software Foundation. AGENDA Intro about the ASF The

SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Hivemall tech talk at Redwood, CA

20160908 hivemall meetup

Hivemall dbtechshowcase 20160713 #dbts2016

Hivemall LT @ Machine Learning Casual Talks #3

Apache Cassandra As A BigData Platform Matthew F. Dennis ...€¦ · Apache Cassandra As A BigData Platform Matthew F. Dennis // @mdennis October 1-3, 2012 gotocon.com

Stream Processing (with Storm, Spark, Flink) Lecture BigData … · 7 Apache Flink 8 Summary Julian M. Kunkel Lecture BigData Analytics, WiSe 17/18 2/59. Overview Spark Streaming

Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Professional Training for BigData and Apache Hadoop Features of ...

Data Flow Languages & Apache Pig Lecture BigData Analytics€¦ · Data Flow Languages & Apache Pig Lecture BigData Analytics Julian M. Kunkel julian.kunkel@googlemail.com University

Apache hadoop bigdata-in-banking

HPC Meets BigData: Accelerating Apache Hadoop, Spark, and ... · HPC Meets BigData: Accelerating Apache Hadoop, Spark, and Memcached with HPC Technologies Dhabaleswar K. (DK) Panda

ScyllaDB @ Apache BigData, may 2016