Scalable Regression Tree Learning on Hadoop using OpenPlanet
Wei Yin
Contributions
• We implement OpenPlanet, an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework.
• We tune and analyze the impact of two parameters, the HDFS block size and the threshold value for switching between the ExpandNode and InMemoryWeka tasks, to improve OpenPlanet's default performance.
Motivation for large-scale Machine Learning
• Models operate on large data sets
• Large numbers of forecasting models must be trained
• New data arrives constantly, creating real-time training requirements
Regression Tree
• A regression tree maps features → a numeric target variable (the prediction)
• The model uses a binary tree structure
• Each non-leaf node is a binary decision on one numeric or categorical feature, routing a data point left or right in the tree
• Leaf nodes contain a regression function or a single prediction value
• Intuitive for domain users to understand; shows the effect of each feature
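The traversal described above can be sketched in plain Java. This is a minimal illustration, not OpenPlanet's actual classes; the `Node` structure and `predict` method are assumptions for exposition, and the thresholds mirror the example tree shown later in the slides.

```java
// Minimal sketch (illustrative, not OpenPlanet's code) of how a regression
// tree routes a feature vector down to a leaf prediction.
import java.util.Map;

public class TreeSketch {
    // A node is either an internal split on one numeric feature,
    // or a leaf holding a single prediction value.
    static class Node {
        String feature;      // feature tested at this node (null for leaves)
        double threshold;    // go left if value < threshold
        Node left, right;
        double prediction;   // used only at leaves

        static Node leaf(double p) { Node n = new Node(); n.prediction = p; return n; }
        static Node split(String f, double t, Node l, Node r) {
            Node n = new Node(); n.feature = f; n.threshold = t; n.left = l; n.right = r; return n;
        }
    }

    static double predict(Node root, Map<String, Double> row) {
        Node n = root;
        while (n.feature != null) {                      // descend until a leaf
            n = row.get(n.feature) < n.threshold ? n.left : n.right;
        }
        return n.prediction;
    }

    public static void main(String[] args) {
        // Toy tree: if F1 < 27 predict 10; else if F2 < 43 predict 50, else 95.
        Node tree = Node.split("F1", 27,
                Node.leaf(10),
                Node.split("F2", 43, Node.leaf(50), Node.leaf(95)));
        System.out.println(predict(tree, Map.of("F1", 30.0, "F2", 50.0))); // 95.0
    }
}
```

In OpenPlanet's hybrid model, a leaf may hold a full Weka model instead of a single value; this sketch keeps only the scalar case.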
USC DR Technical Forum
Google’s PLANET Algorithm
• Uses distributed worker nodes coordinated by a master node to build the regression tree
[Figure: a master node coordinating six worker nodes]
21-Sep-11
OpenPlanet
• An introduction to OpenPlanet
• Differences between OpenPlanet and PLANET
• Specific re-implementation details
[Figure: the Controller coordinates three MapReduce task types — InitHistogram, ExpandNode, and InMemoryWeka — reading and updating the Model File; a threshold value (60,000) decides which task handles a node]
Controller pseudocode:

Controller {
    /* Read user-defined parameters: input file path, test data file, model output file, etc. */
    ReadParameters(arguments[]);
    /* Initialize three job sets -- MRExpandSet, MRInMemWekaSet, CompletionSet --
       each holding the nodes awaiting that kind of processing */
    JobSetsInit(ExpandSet, InMemWekaSet, CompletionSet);
    /* Initialize a Model File instance containing a regression tree with a root node only */
    InitModelFile(modelfile);
    do {
        /* Populate each set from the current model file */
        populateSets(modelfile, ExpandSet, InMemWekaSet, CompletionSet);
        if (ExpandSet is not empty) {
            processing_nodes <- all nodes in ExpandSet;
            TaskRunner(InitHistogram, processing_nodes);
            CandidatePoints <- collect reducers' results;
            TaskRunner(ExpandNodes, processing_nodes);
            globalOptimalSplitPoint <- collect reducers' results;
        }
        if (InMemWekaSet is not empty) {
            processing_nodes <- all nodes in InMemWekaSet;
            TaskRunner(InMemWeka, processing_nodes);
        }
        UpdateTreeModel(results);
    } while (any of ExpandSet, InMemWekaSet, CompletionSet is not empty);
    Output(modelfile);
}
[Flowchart: Start → Initialization → issue MRInitial task → while the queues are not empty: if MRExpandQueue is not empty, issue an MRExpandNode task; if MRInMemQueue is not empty, issue an MR-InMemGrow task; update the model and repopulate the queues → End]
ModelFile
An object that holds the regression model and supports the relevant operations, such as adding a node or checking a node's status.
Advantages:
• More convenient for updating the model and predicting target values than parsing an XML file
• Loading and writing the model file are simply serializing and de-serializing a Java object
[Figure: an example Model File instance — a regression tree whose internal nodes test conditions such as F1 < 27, F2 < 43, F1 < 90, F3 ∈ {M,W,F}, F4 < 16, and whose leaves hold either a Weka model or a single predicted value (e.g., 95); the object exposes the regression tree model plus functions such as Update( ) and CurrentLeaveNode( )]
InitHistogram
• A pre-processing step that finds candidate split points for ExpandNodes
• Numerical features: select a small set of candidate points from the huge data at the expense of a small loss in accuracy (e.g., feat1, feat2)
• Categorical features: every distinct value is a candidate (e.g., feat3)
• Input: a node (or the data subset reaching it)
• The ExpandNodes task then only needs to evaluate the points in the candidate set, without consulting any other resource
[Figure: MapReduce dataflow for InitHistogram — HDFS blocks feed Map instances, whose output is shuffled to Reduce instances]
Example: Feature 1 values {10,2,1,8,3,6,9,4,6,5,7} are sampled down to candidate points {1,3,5,7,9} using the boundaries of an equal-depth histogram (computed with Colt, a high-performance Java library):
f1: 1, 3, 5, 7, 9
f2: 30, 40, 50, 60, 70
f3: 1, 2, 3, 4, 5, 6, 7 {Monday -> Friday}
Mapper behavior for an input node (e.g., node 3) with numeric features Feat1, Feat2 and categorical feature Feat3:
• Filtering: only data points belonging to node 3 are of interest
• Routing: emit key-value pairs (featureID, value)
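The equal-depth sampling idea can be sketched in standalone Java. The slides compute these boundaries with the Colt library; this version, and its particular boundary convention, are illustrative assumptions, so the exact points it picks can differ from the slide's example.

```java
// Sketch of the InitHistogram idea (assumed logic, not OpenPlanet's code):
// sort a numeric feature's values and take the boundaries of equal-depth
// buckets as the candidate split points.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HistogramSketch {
    // Return k candidate split points: the boundaries of k+1 equal-depth buckets.
    static List<Double> candidates(double[] values, int k) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        List<Double> points = new ArrayList<>();
        for (int i = 1; i <= k; i++) {
            // index of the i-th boundary among k+1 equal-depth buckets
            int idx = i * sorted.length / (k + 1);
            points.add(sorted[idx]);
        }
        return points;
    }

    public static void main(String[] args) {
        double[] feature1 = {10, 2, 1, 8, 3, 6, 9, 4, 6, 5, 7}; // slide's example values
        System.out.println(candidates(feature1, 5)); // [2.0, 4.0, 6.0, 7.0, 9.0]
    }
}
```

The payoff is that ExpandNode only evaluates this small fixed set per feature, regardless of how many distinct values the raw data contains.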
ExpandNode
Input: a node (or the data subset reaching it).
[Figure: MapReduce dataflow for ExpandNode — HDFS blocks feed Map instances; each Reduce evaluates the candidate points, deriving the right side as D_right = D_total - D_left so only the left side must be accumulated]
• Mappers again perform filtering (keep only the expanding node's data, e.g. node 3) and routing
• Each reducer returns a local optimal split point to the Controller, e.g. sp1 (value = 23) and sp2 (value = 26)
• The Controller selects the global optimal split point, here sp1 (value = 23)
• The expanding node is then updated, e.g. node 3 becomes the split f2 < 23, with subsets D_left and D_right flowing to its children
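The reducer-side evaluation can be sketched as follows. PLANET-style splits minimize the summed squared error of the two sides (equivalently, maximize variance reduction); that criterion, and all names here, are assumptions for illustration. The key point from the slides is that right-side statistics are derived as total minus left, mirroring D_right = D_total - D_left.

```java
// Sketch (assumed logic) of picking the best candidate split point for one
// numeric feature by variance reduction. Right-side statistics are derived
// as total minus left, mirroring D_right = D_total - D_left.
public class SplitSketch {
    // Returns the candidate split value minimizing the summed squared error
    // of the two sides.
    static double bestSplit(double[] x, double[] y, double[] candidates) {
        int n = x.length;
        double totalSum = 0, totalSq = 0;
        for (double v : y) { totalSum += v; totalSq += v * v; }

        double bestErr = Double.POSITIVE_INFINITY, bestPoint = candidates[0];
        for (double c : candidates) {
            double leftSum = 0, leftSq = 0;
            int leftN = 0;
            for (int i = 0; i < n; i++) {
                if (x[i] < c) { leftSum += y[i]; leftSq += y[i] * y[i]; leftN++; }
            }
            int rightN = n - leftN;                  // D_right = D_total - D_left
            if (leftN == 0 || rightN == 0) continue; // degenerate split
            double rightSum = totalSum - leftSum;
            double rightSq = totalSq - leftSq;
            // SSE of a side = sum(y^2) - (sum(y))^2 / count
            double err = (leftSq - leftSum * leftSum / leftN)
                       + (rightSq - rightSum * rightSum / rightN);
            if (err < bestErr) { bestErr = err; bestPoint = c; }
        }
        return bestPoint;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 10, 11, 12};
        double[] y = {5, 5, 5, 50, 50, 50};          // clean break between x=3 and x=10
        System.out.println(bestSplit(x, y, new double[]{2, 5, 11})); // 5.0
    }
}
```

In OpenPlanet each reducer would run this over its share of candidates and report only its local optimum; the Controller compares those to choose the global one.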
MRInMemWeka
Input: a node (or the data subset reaching it).
[Figure: MapReduce dataflow for MRInMemWeka — HDFS blocks feed Map instances; the shuffle routes data to reducers handling node 4 and node 5]
• Mappers perform filtering and routing, emitting key-value pairs (NodeID, data point)
• Each reducer (1) collects the data points for its node, e.g. node 4, and (2) calls Weka's REPTree (or any other model, such as M5P) to build a model on them
• The reducers report the locations of the Weka models to the Controller, which updates the tree nodes (node 4 -> Weka model, node 5 -> Weka model, ...)
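The filter-and-route step above can be sketched without Hadoop or Weka. This standalone version (all names and the toy routing rule are assumptions) simulates what the shuffle achieves: grouping every row under the in-memory tree node it reaches, so one reducer can hand a node's whole subset to a local learner such as REPTree.

```java
// Sketch (assumed structure) of MRInMemWeka's filter-and-route step:
// mappers emit (nodeID, dataPoint); the shuffle groups rows per node.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RouteSketch {
    // Simulates the shuffle: group each row under the tree node it reaches.
    static Map<Integer, List<double[]>> route(double[][] rows) {
        Map<Integer, List<double[]>> byNode = new HashMap<>();
        for (double[] row : rows) {
            // Toy routing rule standing in for real tree traversal:
            // first feature < 27 goes to node 4, otherwise node 5.
            int nodeId = row[0] < 27 ? 4 : 5;
            byNode.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(row);
        }
        return byNode;
    }

    public static void main(String[] args) {
        double[][] rows = {{10, 1.0}, {30, 2.0}, {20, 3.0}};
        Map<Integer, List<double[]>> grouped = route(rows);
        // Each grouped subset would be passed to a local learner here.
        System.out.println(grouped.get(4).size() + " rows for node 4"); // 2 rows for node 4
    }
}
```

The threshold mentioned earlier (default 60,000 tuples) decides whether a node's subset is small enough to be trained this way in one reducer's memory.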
Distinctions between OpenPlanet and PLANET:
(1) Sampling MapReduce method: InitHistogram
(2) Broadcast (BC) function
(3) Hybrid model
[Figure: a tree whose leaf nodes each contain a Weka model]
Broadcast pseudocode:

BC_Key.Set(BC);
for (i : num_reducers) {
    BC_Key.partitionID = i;
    send(BC_Key, BC_info);
}

Partitioning:

ID of partition = key.hashCode() % numReduceTasks

Key.hashCode() {
    if (key_type == BC) return partitionID;
    else return key.hashCode;
}
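The pseudocode above can be made concrete in plain Java. This is a self-contained sketch of the assumed semantics, not OpenPlanet's actual key class: a broadcast key carries an explicit partition ID and returns it from hashCode(), so the standard key.hashCode() % numReduceTasks partitioning delivers one copy of the broadcast record to every reducer.

```java
// Sketch (assumed semantics) of the broadcast trick: a key can pin itself
// to a chosen partition by returning that partition ID from hashCode().
public class BroadcastKey {
    enum Type { NORMAL, BC }

    final Type type;
    final String name;       // ordinary key payload
    final int partitionID;   // used only when type == BC

    BroadcastKey(Type type, String name, int partitionID) {
        this.type = type; this.name = name; this.partitionID = partitionID;
    }

    @Override public int hashCode() {
        // Broadcast keys pin themselves to their chosen partition;
        // normal keys hash on their payload as usual.
        return type == Type.BC ? partitionID : name.hashCode();
    }

    // The default hash partitioning rule from the slides.
    static int partition(BroadcastKey key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        // Sending the same broadcast payload once per reducer:
        for (int i = 0; i < numReducers; i++) {
            BroadcastKey k = new BroadcastKey(Type.BC, "bc-info", i);
            System.out.println("BC copy -> partition " + partition(k, numReducers));
        }
    }
}
```

In real Hadoop this logic would live in a custom Partitioner (or a key's hashCode used by the default HashPartitioner); masking with Integer.MAX_VALUE keeps the modulo non-negative.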
Performance Analysis and Tuning Method
Baseline for Weka, MATLAB, and OpenPlanet (single machine)
Parallel performance for OpenPlanet with default settings
Questions:
1. On the 17-million-tuple data set, there is very little difference between the 2x8 case and the 8x8 case.
2. There is not much performance improvement, especially compared to the Weka baseline: only a 1.58x speed-up on the 17M data set, and no speed-up on small data sets (where Weka has no memory overhead).
Question 1: Why is performance similar between the 2x8 case and the 8x8 case?
[Charts: OpenPlanet time per stage (MRInMemWeka, MRExTotal, MRInitTotal) — average training time (sec) vs. iteration number — for 2x8 cores and 8x8 cores with default settings]
[Charts: mapper slot utilization (%) vs. running time (sec) — for 2x8 cores (machines M1, M2: map and reducer slot usage) and 8x8 cores (machines M1-M8) with default settings]
Answer to question 1:
1. In HDFS, the basic unit is the block (default 64 MB).
2. Each Map instance processes one block at a time.
3. Therefore, with N blocks, only N Map instances can run in parallel.
For our problem:
4. Size of training data: 17 million tuples = 842 MB
5. Default block size = 64 MB
6. Number of blocks = 842 / 64 ≈ 13
7. For the 2x8 case, 13 Maps run in parallel: utilization = 13/16 = 81%
8. For the 8x8 case, 13 Maps run in parallel: utilization = 13/64 = 20%
9. In both cases only 13 Maps run in parallel, which explains the similar performance.
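The arithmetic behind this answer is easy to check in code. This small sketch (my own helper, using the slides' rounding, which counts 842 MB / 64 MB as 13 blocks) computes the block count and the resulting map-slot utilization.

```java
// Quick check of the block-count and utilization arithmetic from the slides.
public class BlockMath {
    // Number of HDFS blocks, rounded down as in the slides (842/64 -> 13).
    static int numBlocks(int dataMB, int blockMB) { return dataMB / blockMB; }

    // Fraction of map slots that can be busy when only `blocks` Maps
    // can run in parallel.
    static double utilization(int blocks, int mapSlots) {
        return Math.min(blocks, mapSlots) / (double) mapSlots;
    }

    public static void main(String[] args) {
        int blocks = numBlocks(842, 64);                                   // 13
        System.out.printf("2x8: %.0f%%%n", 100 * utilization(blocks, 16)); // 81%
        System.out.printf("8x8: %.0f%%%n", 100 * utilization(blocks, 64)); // 20%
    }
}
```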
Solution: tune the block size so that the number of blocks matches the number of computing cores.
=> What if the number of blocks >> the number of computing cores? That does not necessarily improve performance, because of the network bandwidth limitation.
Improvement:
[Charts: OpenPlanet time per stage (MRInMemWeka, MRExTotal, MRInitTotal) vs. iteration number, and mapper slot utilization (%) vs. running time (machines M1-M8), for 8x8 cores with the optimized block size]
Optimized block size per cluster size and data size:

Nodes | 3.5M tuples / 170MB | 17M tuples / 840MB | 35M tuples / 1.7GB
2x8   | 20 MB               | 80 MB              | 128 MB
4x8   | 8 MB                | 32 MB              | 64 MB
8x8   | 4 MB                | 16 MB              | 32 MB
1. We tune the block size to 16 MB.
2. This gives 842/16 ≈ 52 blocks.
3. Total running time = 4,300,457 sec, versus 5,712,154 sec for the original version: a 1.33x speed-up.
[Chart: average training time (sec) vs. training data size (3.5, 17.5, and 35.1 million rows) for 2x8, 4x8, and 8x8 cores, each with the default, optimized, and 16 MB block sizes]
Question 2:
1. Weka works better when there is no memory overhead.
2. This is visible in the time-per-stage chart.
[Chart: OpenPlanet time per stage using 8x8 cores (optimized block size) — the area under MRInMemWeka is small while the area under MRExpand is large]
What about balancing those two areas while still avoiding memory overhead in Weka?
Solution: increase the threshold value that switches between the ExpandNodes task and InMemWeka.
By experiment, when the reducer JVM has 1 GB of memory, the maximum workable threshold value is 2,000,000 tuples.
Performance Improvement
[Chart: OpenPlanet time per stage (MRInMemTotal, MRExTotal, MRInitTotal) vs. iteration number using 8x8 cores, with the optimized block size and the 2M threshold value]
[Chart, for comparison: OpenPlanet time per stage using 8x8 cores with the optimized block size and the default threshold]
1. Total running time = 1,835,430 sec vs. 4,300,457 sec
2. The two areas are balanced
3. The number of iterations decreased
4. Speed-up = 4,300,457 / 1,835,430 = 2.34x
Average total speed-up on the 17M data set using 8x8 cores:
• vs. Weka: 4.93x
• vs. MATLAB: 14.3x
Average accuracy (CV-RMSE):
• Weka: 10.09%
• MATLAB: 10.46%
• OpenPlanet: 10.35%
Summary:
• OpenPlanet is an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework.
• We tune and analyze the impact of parameters such as the HDFS block size and the threshold for in-memory handoff to improve OpenPlanet's default performance.
Future work:
(1) Parallel execution of MRExpand and MRInMemWeka within each iteration
(2) Running multiple OpenPlanet instances for different uses, to increase slot utilization
(3) Optimal block size selection
(4) Real-time model training
(5) Moving to a Cloud platform and analyzing performance there