Apache Hivemall @ Apache BigData '17, Miami

78
Apache Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig 1). Research Engineer, Treasure Data Makoto YUI @myui 2017/5/16 Apache BigData North America '17, Miami 2). Research Engineer, NTT Takashi Yamamuro @maropu @ApacheHivemall

Transcript of Apache Hivemall @ Apache BigData '17, Miami

Page 1: Apache Hivemall @ Apache BigData '17, Miami

ApacheHivemall:ScalablemachinelearninglibraryforApacheHive/Spark/Pig

1).ResearchEngineer,TreasureDataMakotoYUI@myui

2017/5/16ApacheBigData NorthAmerica'17,Miami

2).ResearchEngineer,NTTTakashiYamamuro@maropu

@ApacheHivemall

Page 2: Apache Hivemall @ Apache BigData '17, Miami

Planofthetalk

1. IntroductiontoHivemall

2. HivemallonSpark

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Page 3: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

HivemallenteredApacheIncubatoronSept13,2016🎉

Sincethen,weinvited2contributorsasnewcommitters(acommitterhasbeenvotedasPPMC). Currently,weareworkingtowardthefirstApachereleaseonthisQ2.

hivemall.incubator.apache.org

Page 4: Apache Hivemall @ Apache BigData '17, Miami

WhatisApacheHivemall

ScalablemachinelearninglibrarybuiltasacollectionofHiveUDFs

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Multi/Crossplatform VersatileScalableEase-of-use

Page 5: Apache Hivemall @ Apache BigData '17, Miami

Hivemalliseasyandscalable…MLmadeeasyforSQLdevelopers

Borntobeparallelandscalable

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Ease-of-use

Scalable

CREATETABLElr_model ASSELECTfeature,-- reducersperformmodelaveraginginparallelavg(weight)asweightFROM(SELECTlogress(features,label,..)as(feature,weight)FROMtrain)t-- map-onlytaskGROUPBYfeature;-- shuffledtoreducers

ThisqueryautomaticallyrunsinparallelonHadoop

Page 6: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Hivemallisamulti/cross-platformMLlibrary

HiveQL SparkSQL/Dataframe API PigLatin

HivemallisMulti/Crossplatform..Multi/Crossplatform

predictionmodelsbuiltbyHivecanbeusedfromSpark,andconversely,predictionmodelsbuildbySparkcanbeusedfromHive

Page 7: Apache Hivemall @ Apache BigData '17, Miami

HadoopHDFS

MapReduce(MRv1)

Hivemall

ApacheYARN

ApacheTezDAGprocessing

Machine Learning

Query Processing

Parallel Data Processing Framework

Resource Management

Distributed File SystemCloud Storage

SparkSQL

ApacheSpark

MESOS

Hive Pig

MLlib

Hivemall’s TechnologyStack

2017/5/16ApacheBigDataNorthAmerica'17,Miami

AmazonS3

Page 8: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

HivemallonApacheHive

Page 9: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

HivemallonApacheSparkDataframe

Page 10: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

HivemallonSparkSQL

Page 11: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

HivemallonApachePig

Page 12: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

OnlinePredictionbyApacheStreaming

Page 13: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Versatile

HivemallisaVersatilelibrary..

ü NotonlyforMachineLearningü providesabunchofgenericutilityfunctions

EachorganizationhasownsetsofUDFsfordatapreprocessing

Don’tRepeatYourself!Don’tRepeatYourself!

Page 14: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Hivemallgenericfunctions

ArrayandMap Bitandcompress StringandNLP

WewelcomecontributingyourgenericUDFstoHivemall!

GeoSpatial

Top-kprocessing

> TF/IDF

> TILE> MAP_URL

Page 15: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Maptilingfunctions

Page 16: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^n

Maptilingfunctions

Zoom=10

Zoom=15

Page 17: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

student class score1 b 702 a 803 a 904 b 505 a 706 b 60

Top-kqueryprocessing

student class score3 a 902 a 801 b 706 b 60

Listtop-2studentsforeachclass

Page 18: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

student class score1 b 702 a 803 a 904 b 505 a 706 b 60

Top-kqueryprocessing

Listtop-2studentsforeachclass

SELECT*FROM(SELECT*,rank()over(partitionbyclassorderbyscoredesc)asrank

FROMtable)tWHERErank<=2

RANKover()querydoesnotfinishesin24hoursLwhere20millionMOOCsclassesandavg1,000studentsineachclasses

Page 19: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

student class score1 b 702 a 803 a 904 b 505 a 706 b 60

Top-kqueryprocessing

Listtop-2studentsforeachclass

SELECTeach_top_k(2,class,score,class,student

)as(rank,score,class,student)FROM (SELECT*FROMtableDISTRIBUTEBYclassSORTBYclass

)t

EACH_TOP_Kfinishesin2hoursJ

Page 20: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Top-kqueryprocessingbyRANKOVER()

partitionbyclass

Node1

Sortbyclass,score

rankover()

rank>=2

Page 21: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Top-kqueryprocessingbyEACH_TOP_K

distributedbyclass

Node1

Sortbyclass

each_top_k

OUTPUTonlyKitems

Page 22: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

ComparisonbetweenRANKandEACH_TOP_K

distributedbyclass

Sortbyclass

each_top_k

Sortbyclass,score

rankover()

rank>=2

SORTING ISHEAVY

NEEDTOPROCESSALL

OUTPUTonlyKitems

Each_top_k isveryefficientwherethenumberofclassislarge

BoundedPriorityQueueisutilized

Page 23: Apache Hivemall @ Apache BigData '17, Miami

ListofSupportedAlgorithmsClassification✓ Perceptron✓ PassiveAggressive(PA,PA1,PA2)✓ ConfidenceWeighted(CW)✓ AdaptiveRegularizationofWeightVectors(AROW)✓ SoftConfidenceWeighted(SCW)✓ AdaGrad+RDA✓ FactorizationMachines✓ RandomForestClassification

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Regression✓LogisticRegression(SGD)✓AdaGrad (logisticloss)✓AdaDELTA (logisticloss)✓PARegression✓AROWRegression✓FactorizationMachines✓RandomForestRegression

SCW is a good first choiceTry RandomForest if SCW does not work

Logistic regression is good for getting a probability of a positive class

Factorization Machines is good where features are sparse and categorical ones

Page 24: Apache Hivemall @ Apache BigData '17, Miami

RandomForestinHivemall

EnsembleofDecisionTrees

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Page 25: Apache Hivemall @ Apache BigData '17, Miami

TrainingofRandomForest

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Page 26: Apache Hivemall @ Apache BigData '17, Miami

PredictionofRandomForest

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Page 27: Apache Hivemall @ Apache BigData '17, Miami

SupportedAlgorithmsforRecommendation

2017/5/16ApacheBigDataNorthAmerica'17,Miami

K-NearestNeighbor✓ Minhash andb-BitMinhash

(LSHvariant)✓ SimilaritySearchonVectorSpace

(Euclid/Cosine/Jaccard/Angular)

MatrixCompletion✓MatrixFactorization✓ FactorizationMachines(regression)

each_top_k functionofHivemallisusefulforrecommendingtop-kitems

Page 28: Apache Hivemall @ Apache BigData '17, Miami

OtherSupportedAlgorithms

2017/5/16ApacheBigDataNorthAmerica'17,Miami

FeatureEngineering✓FeatureHashing✓FeatureScaling

(normalization,z-score)✓ FeatureBinning✓ TF-IDFvectorizer✓ PolynomialExpansion✓ Amplifier

NLP✓BasicEnglist textTokenizer✓JapaneseTokenizer

Evaluationmetrics✓AUC,nDCG,logloss,precisionrecall@K,andetc

Page 29: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

FeatureEngineering– FeatureHashing

Page 30: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

FeatureEngineering– FeatureHashing

Page 31: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

FeatureEngineering– FeatureBinning

Mapsquantitativevariablestofixednumberofbinsbasedonquantiles/distribution

MapAgesinto3bins

Page 32: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

FeatureSelection– SignalNoiseRatio

Page 33: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

EvaluationMetrics

Page 34: Apache Hivemall @ Apache BigData '17, Miami

OtherSupportedFeatures

2017/5/16ApacheBigDataNorthAmerica'17,Miami

AnomalyDetection✓LocalOutlierFactor(LoF)✓ChangeFinder

Clustering/Topicmodels✓Onlinemini-batchLDA✓Onlinemini-batchPLSA

ChangePointDetection✓ChangeFinder✓SingularSpectrumTransformation

Page 35: Apache Hivemall @ Apache BigData '17, Miami

Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data

2017/5/16ApacheBigDataNorthAmerica'17,Miami

J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.

Anomaly/Change-pointDetectionbyChangeFinder

Page 36: Apache Hivemall @ Apache BigData '17, Miami

Takethis…

Anomaly/Change-pointDetectionbyChangeFinder

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Page 37: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Anomaly/Change-pointDetectionbyChangeFinder

…anddothis!

Page 38: Apache Hivemall @ Apache BigData '17, Miami

Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Anomaly/Change-pointDetectionbyChangeFinder

J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.

Page 39: Apache Hivemall @ Apache BigData '17, Miami

Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Anomaly/Change-pointDetectionbyChangeFinder

J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.

Page 40: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

• T.IdeandK.Inoue,"KnowledgeDiscoveryfromHeterogeneousDynamicSystemsusingChange-PointCorrelations",Proc.SDM,2005T.

• T.IdeandK.Tsuda,"Change-pointdetectionusingKrylovsubspacelearning",Proc.SDM,2007.

Change-pointdetectionbySingularSpectrumTransformation

Page 41: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Onlinemini-batchLDA

Page 42: Apache Hivemall @ Apache BigData '17, Miami

ü XGBoost Integrationü SparsevectorsupportinRandomForestsü Dockersupport(forevaluation)ü Field-awareFactorizationMachinesü GeneralizedLinearModel

• OptimizerframeworkincludingADAM• L1/L2regularization

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Othernewfeaturesindevelopment

Page 43: Apache Hivemall @ Apache BigData '17, Miami

Copyright©2016 NTT corp. All Rights Reserved.

Hivemall on

Page 44: Apache Hivemall @ Apache BigData '17, Miami

44Copyright©2016 NTT corp. All Rights Reserved.

Who am I• Takeshi Yamamuro• NTT corp. in Japan• OSS activities

• Apache Hivemall PPMC• Apache Spark contributer

Page 45: Apache Hivemall @ Apache BigData '17, Miami

45Copyright©2016 NTT corp. All Rights Reserved.

Whatʼs Spark?• Distributed data analytics engine, generalizing Map Reduce

Page 46: Apache Hivemall @ Apache BigData '17, Miami

46Copyright©2016 NTT corp. All Rights Reserved.

Whatʼs Spark?• 1. Unified Engine

• support end-to-end APs, e.g., MLlib and Streaming

• 2. High-level APIs• easy-to-use, rich optimization

• 3. Integrate broadly• storages, libraries, ...

Page 47: Apache Hivemall @ Apache BigData '17, Miami

47Copyright©2016 NTT corp. All Rights Reserved.

• Hivemall wrapper for Spark• Wrapper implementations for DataFrame/SQL• + some utilities for easy-to-use in Spark

• The wrapper makes you...• run most of Hivemall functions in Spark• try examples easily in your laptop• improve some function performance in Spark

Whatʼs Hivemall on Spark?

Page 48: Apache Hivemall @ Apache BigData '17, Miami

48Copyright©2016 NTT corp. All Rights Reserved.

• Hivemall already has many fascinating ML algorithms and useful utilities

• + High barriers to add newer algorithms in MLlib

Whyʼs Hivemall on Spark?

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Page 49: Apache Hivemall @ Apache BigData '17, Miami

49Copyright©2016 NTT corp. All Rights Reserved.

• Most of Hivemall functions supported in Spark v2.0 and v2.1

Current Status

- To compile for Spark v2.1

$ git clone https://github.com/apache/incubator-hivemall$ cd incubator-hivemall$ mvn package –Pspark-2.1–DskipTests$ ls target/*spark*target/hivemall-spark-2.1_2.11-XXX-with-dependencies.jar...

Page 50: Apache Hivemall @ Apache BigData '17, Miami

50Copyright©2016 NTT corp. All Rights Reserved.

• Most of Hivemall functions supported in Spark v2.0 and v2.1

Current Status

- To compile for Spark v2.0

$ git clone https://github.com/apache/incubator-hivemall$ cd incubator-hivemall$ mvn package –Pspark-2.0–DskipTests$ ls target/*spark*target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar...

Page 51: Apache Hivemall @ Apache BigData '17, Miami

51Copyright©2016 NTT corp. All Rights Reserved.

• 1. Fetch training and test data• 2. Load these data in Spark• 3. Build a model• 4. Do predictions

4 Step Example

Page 52: Apache Hivemall @ Apache BigData '17, Miami

52Copyright©2016 NTT corp. All Rights Reserved.

• E2006 tfidf regression dataset• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

1. Fetch training and test data

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

Page 53: Apache Hivemall @ Apache BigData '17, Miami

53Copyright©2016 NTT corp. All Rights Reserved.

2. Load data in Spark

// Download Spark-v2.1 and launch a spark-shell with Hivemall$ <HIVEMALL_HOME>/bin/spark-shell

// Create DataFrame from the bzipʼd libsvm-formatted filescala> val trainDf = spark.read.format("libsvm”).load(“E2006.train.bz2")

scala> trainDf.printSchemaroot|-- label: double (nullable = true)|-- features: vector (nullable = true)

Page 54: Apache Hivemall @ Apache BigData '17, Miami

54Copyright©2016 NTT corp. All Rights Reserved.

2. Load data in Spark (Detailed)

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf

Partition1 Partition2 Partition3 PartitionN…

Load in parallel becausebzip2 is splittable

Page 55: Apache Hivemall @ Apache BigData '17, Miami

55Copyright©2016 NTT corp. All Rights Reserved.

3. Build a model - DataFrame

scala> paste:val modelDf = trainDf.train_logregr($"features", $"label")

.groupBy("feature”)

.agg("weight" -> "avg”)

Page 56: Apache Hivemall @ Apache BigData '17, Miami

56Copyright©2016 NTT corp. All Rights Reserved.

3. Build a model - SQL

scala> trainDf.createOrReplaceTempView("TrainTable")scala> paste:val modelDf = sql("""

| SELECT feature, AVG(weight) AS weight| FROM (| SELECT train_logregr(features, label)| AS (feature, weight)| FROM TrainTable| )| GROUP BY feature

""".stripMargin)

Page 57: Apache Hivemall @ Apache BigData '17, Miami

57Copyright©2016 NTT corp. All Rights Reserved.

4. Do predictions - DataFrame

scala> paste:val df = testDf.select(rowid(), $"features").explode_vector($"features").cache

# Do predictionsdf.join(modelDf, df("feature") === model("feature"), "LEFT_OUTER").groupBy("rowid").avg(sigmoid(sum($"weight" * $"value")))

Page 58: Apache Hivemall @ Apache BigData '17, Miami

58Copyright©2016 NTT corp. All Rights Reserved.

4. Do predictions - SQL

scala> modelDf.createOrReplaceTempView(”ModelTable")scala> df.createOrReplaceTempView(”TestTable”)scala> paste:sql("""

| SELECT rowid, sigmoid(value * weight) AS predicted| FROM TrainTable t| LEFT OUTER JOIN ModelTable m| ON t.feature = m.feature| GROUP BY rowid

""".stripMargin)

Page 59: Apache Hivemall @ Apache BigData '17, Miami

59Copyright©2016 NTT corp. All Rights Reserved.

• Top-K Join• Join two relations and compute top-K for each group

Add New Features in Spark

scala> paste:val topkDf = leftDf.join(rightDf, “group” :: Nil, “INNER”).select(leftDf("group"), (leftDf(“x”) + rightDf(“y”)).as(“score”).withColumn("rank",

rank().over(partitionBy($"group”).orderBy($"score".desc))).where($"rank" <= topK)

Use Spark vanilla APIs

Page 60: Apache Hivemall @ Apache BigData '17, Miami

60Copyright©2016 NTT corp. All Rights Reserved.

• Top-K Join• Join two relations and compute top-K for each group

Add New Features in Spark

scala> paste:val topkDf = leftDf.join(rightDf, “group” :: Nil, “INNER”).select(leftDf("group"), (leftDf(“x”) + rightDf(“y”)).as(“score”).withColumn("rank", rank().over(partitionBy($"group”).orderBy($"score".desc))

).where($"rank" <= topK)

1. Join leftDf with rightDf

2. Sort data in each group

3. Filter top-K data

Use Spark vanilla APIs

Page 61: Apache Hivemall @ Apache BigData '17, Miami

61Copyright©2016 NTT corp. All Rights Reserved.

• Top-K Join• Join two relations and compute top-K for each group

Add New Features in Spark

scala> paste:val topkDf = leftDf.top_k_join(lit(topK), rightDf, leftDf(“group”) === rightDf(“group”),(leftDf(“x”) + rightDf(“y”)).as(“score”)

)

Use a fused API implemented in Hivemall

Page 62: Apache Hivemall @ Apache BigData '17, Miami

62Copyright©2016 NTT corp. All Rights Reserved.

• How top_k_join works?

Add New Features in Spark

group xgroup y

leftDf

rightDf

・・・

Page 63: Apache Hivemall @ Apache BigData '17, Miami

63Copyright©2016 NTT corp. All Rights Reserved.

• How top_k_join works?

Add New Features in Spark

group xgroup y

・・・

・・・

K-lengthpriority queue

Compute top-K rows by using a priority queue

leftDf

rightDf

Page 64: Apache Hivemall @ Apache BigData '17, Miami

64Copyright©2016 NTT corp. All Rights Reserved.

• How top_k_join works?

Add New Features in Spark

group xgroup y

・・・

・・・

K-lengthpriority queue

Compute top-K rows by using a priority queue

Only join top-K rowsleftDf

rightDf

Page 65: Apache Hivemall @ Apache BigData '17, Miami

65Copyright©2016 NTT corp. All Rights Reserved.

• Codegenʼd top_k_join for fast processing• Spark internally generates Java code from a built

physical plan, and compiles/executes it

Add New Features in Spark

Spark Planner (Catalyst) Overview

Page 66: Apache Hivemall @ Apache BigData '17, Miami

66Copyright©2016 NTT corp. All Rights Reserved.

• Codegenʼd top_k_join for fast processing• Spark internally generates Java code from a built

physical plan, and compiles/executes it

Add New Features in Spark

Codegen

A Physical Plan

Page 67: Apache Hivemall @ Apache BigData '17, Miami

67Copyright©2016 NTT corp. All Rights Reserved.

• Codegenʼd top_k_join for fast processing• Spark internally generates Java code from a built

physical plan, and compiles/executes it

Add New Features in Spark

scala> topkDf.explain== Physical Plan ==*ShuffledHashJoinTopK 1, [group#10], [group#27]:- Exchange hashpartitioning(group#10, 200): +- LocalTableScan [group#10, x#11]+- Exchange hashpartitioning(group#27, 200)

+- LocalTableScan [group#27, y#28]

ʻ*ʼ in the head means a codegenʼd plan

Page 68: Apache Hivemall @ Apache BigData '17, Miami

68Copyright©2016 NTT corp. All Rights Reserved.

• Benchmark Result

Add New Features in Spark

~13 times faster than vanilla APIs!!

Page 69: Apache Hivemall @ Apache BigData '17, Miami

69Copyright©2016 NTT corp. All Rights Reserved.

• Support more useful functions• Spark only implements naive and basic functions in

terms of usability and maintainability• e.g.,)

• f latten - f latten a nested schema into f lat one• from_csv/to_csv – interconversion of CSV str ings and structured data with schemas• ...

• See more in the Hivemall user guide• https://hivemal l . incubator.apache.org /userguide /spark /misc /misc.html

Add New Features in Spark

Page 70: Apache Hivemall @ Apache BigData '17, Miami

70Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.1) Compute a sigmoid function

Improve Some Functions in Spark

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))scala> val sparkUdf = functions.udf(sigmoidFunc)scala> df.select(sparkUdf($“value”))

Page 71: Apache Hivemall @ Apache BigData '17, Miami

71Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.1) Compute a sigmoid function

Improve Some Functions in Spark

scala> val hiveUdf = HivemallOps.sigmoidscala> df.select(hiveUdf($“value”))

Page 72: Apache Hivemall @ Apache BigData '17, Miami

72Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.1) Compute a sigmoid function

Improve Some Functions in Spark

Page 73: Apache Hivemall @ Apache BigData '17, Miami

73Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.2) Compute top-k for each key group

Improve Some Functions in Spark

scala> paste:df.withColumn(“rank”,rank().over(Window.partitionBy($"key").orderBy($"score".desc)

) .where($"rank" <= topK)

Page 74: Apache Hivemall @ Apache BigData '17, Miami

74Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.2) Compute top-k for each key group

• Fixed the overhead issue for each_top_k• See pr#353: “Implement EachTopK as a generator expression“ in Spark

Improve Some Functions in Spark

scala> df.each_top_k(topK, “key”, “score”, “value”)

Page 75: Apache Hivemall @ Apache BigData '17, Miami

75Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.2) Compute top-k for each key group

Improve Some Functions in Spark

~4 times faster than rank!!

Page 76: Apache Hivemall @ Apache BigData '17, Miami

76Copyright©2016 NTT corp. All Rights Reserved.

• supports under development• fast implementation of the gradient tree boosting• widely used in Kaggle competitions

• This integration will make you...• load built models and predict in parallel• build multiple models in parallel

for cross-validation

Support 3rd Party Library in Spark

Page 77: Apache Hivemall @ Apache BigData '17, Miami

ConclusionandTakeawayHivemallisamulti/cross-platformMLlibraryprovidingacollectionofmachinelearningalgorithmsasHiveUDFs/UDTFs

2017/5/16ApacheBigDataNorthAmerica'17,Miami

WewelcomeyourcontributionstoApacheHivemallJ

HiveQL SparkSQL/Dataframe API PigLatin

Page 78: Apache Hivemall @ Apache BigData '17, Miami

2017/5/16ApacheBigDataNorthAmerica'17,Miami

Anyfeaturerequestorquestions?