Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig


1) Makoto YUI (@myui), Research Engineer, Treasure Data

2016/10/26 Hadoop Summit '16, Tokyo

2) Takeshi Yamamuro (@maropu), Research Engineer, NTT

#hivemall

Plan of the talk

1. Introduction to Hivemall

2. Hivemall on Spark


Hivemall entered the Apache Incubator on Sept 13, 2016 🎉

http://incubator.apache.org/projects/hivemall.html

@ApacheHivemall

Initial committers, champion, and nominated mentors

• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>, Hivemall on Apache Spark
• Daniel Dai <Hortonworks>, Hivemall on Apache Pig, Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>, Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>


Project mentors

• Reynold Xin <Databricks, ASF member>, Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>, Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>, Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>, Apache Bigtop/Incubator PMC member

What is Apache Hivemall

Scalable machine learning library built as a collection of Hive UDFs


github.com/myui/hivemall

hivemall.incubator.apache.org

Hivemall’s Vision: ML on SQL


Classification with Mahout takes 100+ lines of code. In Hivemall, the same task is a single query:

CREATE TABLE lr_model AS
SELECT
  feature,
  -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

✓ Machine Learning made easy for SQL developers
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop.
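Prediction with the resulting model follows the same SQL pattern. A minimal sketch, assuming the test features have already been exploded into (rowid, feature, value) rows in a table named test_exploded (an illustrative name); sigmoid is Hivemall's sigmoid UDF:

-- join test features with the learned weights and apply
-- the sigmoid to the weighted sum per row
SELECT
  t.rowid,
  sigmoid(sum(m.weight * t.value)) AS prob
FROM test_exploded t
LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid;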

Hivemall's Technology Stack

• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, Mesos
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3

List of Supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

SCW is a good first choice; try RandomForest if SCW does not work well.

Logistic regression is good for getting the probability of a positive class.

Factorization Machines are a good fit when features are sparse and largely categorical.
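Switching learners is largely a matter of swapping the trainer UDF inside the query skeleton shown earlier. A minimal sketch for SCW, assuming the trainer is exposed as train_scw(features, label) and that simple weight averaging is acceptable (check the Hivemall function reference for the exact trainer name and the recommended aggregation for CW-family learners):

CREATE TABLE scw_model AS
SELECT
  feature,
  avg(weight) as weight -- model averaging, as in the logistic regression example
FROM (
  SELECT train_scw(features, label) as (feature, weight)
  FROM train
) t
GROUP BY feature;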

List of Algorithms for Recommendation


K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items, as in the sketch below.
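A minimal item-to-item sketch, assuming an item_features table of (itemid, features) vectors and that the similarity UDF is exposed as cosine_similarity (table and function names are assumptions, and the all-pairs join only scales to small catalogs):

WITH scored AS (
  SELECT
    t1.itemid,
    t2.itemid AS other,
    cosine_similarity(t1.features, t2.features) AS score
  FROM item_features t1
  CROSS JOIN item_features t2
  WHERE t1.itemid != t2.itemid
)
SELECT
  -- keep the 10 most similar items per item
  each_top_k(10, itemid, score, itemid, other) AS (rank, score, itemid, other)
FROM (
  SELECT * FROM scored
  DISTRIBUTE BY itemid SORT BY itemid
) t;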


Top-k query processing: list the top-2 students for each class.

Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60

Expected output:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60


Using RANK OVER():

SELECT * FROM (
  SELECT
    *,
    rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2


Using EACH_TOP_K:

SELECT
  each_top_k(2, class, score, class, student) as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t


Top-k query processing by RANK OVER(): each node partitions rows by class, sorts by (class, score), applies rank over(), and then filters rank <= 2.

Top-k query processing by EACH_TOP_K: each node distributes rows by class, sorts by class only, and each_top_k outputs only the top-k items per class.

Comparison between RANK and EACH_TOP_K

• rank over(): sort by (class, score), compute rank over(), then filter rank <= 2. Sorting is heavy, and all rows must be processed.
• each_top_k: distribute by class, sort by class, and output only k items per class. A bounded priority queue is utilized.

each_top_k is very efficient when the number of classes is large.

Performance reported by a TD customer

• 1,000 students in each class
• 20 million classes

The RANK over() query does not finish in 24 hours, while EACH_TOP_K finishes in 2 hours.

For details, see https://speakerdeck.com/kaky0922/hivemall-meetup-20160908

Other Supported Algorithms


Anomaly Detection
✓ Local Outlier Factor (LOF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (feature pairing)
✓ Amplifier

NLP
✓ Basic English text tokenizer
✓ Japanese tokenizer (Kuromoji)
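As a small illustration of combining the NLP and feature-engineering utilities above, the sketch below tokenizes English text and hashes the tokens into features; the documents table and the exact UDF signatures (tokenize, feature_hashing) are assumptions based on Hivemall's usual naming:

-- turn raw text into hashed bag-of-words features
SELECT
  docid,
  feature_hashing(tokenize(contents, true)) AS features
FROM documents;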

Industry use cases of Hivemall

• CTR prediction of ad click logs
  Algorithm: logistic regression
  Freakout Inc., Smartnews, and more

• Gender prediction of ad click logs
  Algorithm: classification
  Scaleout Inc.


http://www.slideshare.net/eventdotsjp/hivemall

Industry use cases of Hivemall

• Item/User recommendation
  Algorithm: recommendation
  Wish.com, GMO Pepabo


Problem: recommendation based on hot items is hard in a hand-crafted goods marketplace (minne.com) because each creator sells only a few of each item, which soon goes out of stock.

Industry use cases of Hivemall

• Value prediction of real estate
  Algorithm: regression
  Livesense


Industry use cases of Hivemall

• User score calculation
  Algorithm: regression
  Klout


bit.ly/klout-hivemall

Influencer marketing

klout.com


Anomaly/Change-point Detection by ChangeFinder

An efficient algorithm for finding change points and outliers in time-series data.

J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.


Change-point detection by Singular Spectrum Transformation

• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations," Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning," Proc. SDM, 2007.


Evaluation Metrics


Feature Engineering – Feature Binning

Maps quantitative variables to a fixed number of bins based on quantiles/distribution, e.g., mapping ages into 3 bins.
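A plain HiveQL illustration of the idea using the standard ntile() window function rather than Hivemall's own binning UDFs; the users table and columns are made up:

-- assign each user to one of 3 equal-frequency age bins
SELECT
  userid,
  age,
  ntile(3) OVER (ORDER BY age) AS age_bin
FROM users;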


Feature Selection – Signal Noise Ratio


Feature Selection – Chi-Square


Feature Transformation – Onehot encoding

Maps each distinct value of a categorical variable to a unique index starting from 1.
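A plain HiveQL sketch of the indexing step, independent of Hivemall's own encoder; the users table and country column are illustrative:

-- build a category -> index dictionary starting from 1
CREATE TABLE country_index AS
SELECT
  country,
  row_number() OVER (ORDER BY country) AS country_idx
FROM (
  SELECT DISTINCT country FROM users
) t;

-- join it back to replace the categorical column with its index
SELECT
  u.userid,
  c.country_idx
FROM users u
JOIN country_index c ON (u.country = c.country);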

Other new features to come

• Spark 2.0 support
• XGBoost integration
• Field-aware Factorization Machines
• Generalized Linear Model
• Optimizer framework including ADAM
• L1/L2 regularization


Hivemall on Spark
Takeshi Yamamuro @ NTT
Hadoop Summit 2016, Tokyo, Japan


Who am I?


What's Spark?

1. Unified engine
   • supports end-to-end applications, e.g., MLlib and Streaming
2. High-level APIs
   • easy to use, with rich optimization
3. Integrates broadly
   • storage systems, libraries, ...


What's Hivemall on Spark?

• A Hivemall wrapper for Spark
  • wrapper implementations for DataFrame/SQL
  • plus some utilities for ease of use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark


Why Hivemall on Spark?

• Hivemall already has many fascinating ML algorithms and useful utilities
• There are high barriers to adding new algorithms to MLlib
  https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark


Current Status

• Most Hivemall functions are supported in Spark v1.6 and v2.0

For Spark v2.0:

$ git clone https://github.com/myui/hivemall
$ cd hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...

Use -Pspark-1.6 for Spark v1.6.


Running an Example

1. Download a Spark binary
2. Fetch training and test data
3. Load the data in Spark
4. Build a model
5. Do predictions


1. Download a Spark binary

http://spark.apache.org/downloads.html


2. Fetch training and test data

• E2006 tfidf regression dataset
  http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2


3. Load data in Spark

$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar

// Create a DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = sqlContext.sparkSession.read.format("libsvm")
         .load("E2006.train.bz2")

scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)


Each record is a label followed by sparse feature:value pairs:

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf is loaded in parallel into partitions (Partition1 ... PartitionN) because bzip2 is splittable.


4. Build a model - DataFrame

scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")


4. Build a model - SQL

scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  | FROM (
  |   SELECT train_logregr(features, label) AS (feature, weight)
  |   FROM TrainTable
  | ) t
  | GROUP BY feature
  """.stripMargin)


5. Do predictions - DataFrame

scala> :paste
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache

// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .avg(sigmoid(sum($"weight" * $"value")))


5. Do predictions - SQL

scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT t.rowid, sigmoid(sum(value * weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  |   ON t.feature = m.feature
  | GROUP BY t.rowid
  """.stripMargin)


Improve Some Functions in Spark

• Spark has overheads when calling Hive UD*Fs, and Hivemall heavily depends on them
• ex.1) Compute a sigmoid function with a plain Spark UDF:

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))


• The same sigmoid through the Hivemall wrapper:

scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))



• ex.2) Compute top-k for each key group with a window function:

scala> :paste
df.withColumn(
    "rank",
    rank().over(Window.partitionBy($"key").orderBy($"score".desc)))
  .where($"rank" <= topK)


• With the Hivemall wrapper, the same top-k computation is a one-liner:

scala> df.each_top_k(topK, "key", "score", "value")

• The Hive UD*F overhead issue was fixed for each_top_k
  (see pr#353: "Implement EachTopK as a generator expression" in Spark)


Result: each_top_k is ~4 times faster than rank!!


3rd-Party Library Integration

• XGBoost support is under development
  • a fast implementation of gradient tree boosting
  • widely used in Kaggle competitions
• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation

Conclusion and Takeaway

Hivemall provides a collection of machine learning algorithms as Hive UDFs/UDTFs.


We welcome your contributions to Apache Hivemall


Any feature requests or questions?