Scalable Regression Tree Learning on Hadoop using OpenPlanet
Wei Yin
Contributions
• We implement OpenPlanet, an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework.
• We tune and analyze the impact of two parameters, the HDFS block size and the threshold value for switching between the ExpandNode and InMemoryWeka tasks, to improve OpenPlanet's default performance.
Motivation for large-scale Machine Learning
• Models operate on large data sets
• Large numbers of forecasting models must be trained
• New data arrives constantly, creating real-time training requirements
Regression Tree
• A regression tree maps features → a numeric target variable (the prediction)
• The model uses a binary tree structure
• Each non-leaf node is a binary decision on one numeric or categorical feature, routing a data point left or right in the tree
• Leaf nodes contain a regression function or a single prediction value
• Intuitive for domain users to understand; shows the effect of each feature
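The traversal described above can be sketched in plain Java. This is a minimal illustration, not OpenPlanet's actual classes; the `Node` structure and `predict` method are assumptions for exposition, and the thresholds mirror the example tree shown later in the slides.

```java
// Minimal sketch (illustrative, not OpenPlanet's code) of how a regression
// tree routes a feature vector down to a leaf prediction.
import java.util.Map;

public class TreeSketch {
    // A node is either an internal split on one numeric feature,
    // or a leaf holding a single prediction value.
    static class Node {
        String feature;      // feature tested at this node (null for leaves)
        double threshold;    // go left if value < threshold
        Node left, right;
        double prediction;   // used only at leaves

        static Node leaf(double p) { Node n = new Node(); n.prediction = p; return n; }
        static Node split(String f, double t, Node l, Node r) {
            Node n = new Node(); n.feature = f; n.threshold = t; n.left = l; n.right = r; return n;
        }
    }

    static double predict(Node root, Map<String, Double> row) {
        Node n = root;
        while (n.feature != null) {                      // descend until a leaf
            n = row.get(n.feature) < n.threshold ? n.left : n.right;
        }
        return n.prediction;
    }

    public static void main(String[] args) {
        // Toy tree: if F1 < 27 predict 10; else if F2 < 43 predict 50, else 95.
        Node tree = Node.split("F1", 27,
                Node.leaf(10),
                Node.split("F2", 43, Node.leaf(50), Node.leaf(95)));
        System.out.println(predict(tree, Map.of("F1", 30.0, "F2", 50.0))); // 95.0
    }
}
```

In OpenPlanet's hybrid model, a leaf may hold a full Weka model instead of a single value; this sketch keeps only the scalar case.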
USC DR Technical Forum
Google’s PLANET Algorithm
• Uses distributed worker nodes coordinated by a master node to build the regression tree
[Figure: a master node coordinating six worker nodes]
21-Sep-11
OpenPlanet
• An introduction to OpenPlanet
• Differences between OpenPlanet and PLANET
• Specific re-implementation details
[Figure: the Controller coordinates three MapReduce task types — InitHistogram, ExpandNode, and InMemoryWeka — reading and updating the Model File; a threshold value (60,000) decides which task handles a node]
Controller pseudocode:

Controller {
    /* Read user-defined parameters: input file path, test data file, model output file, etc. */
    ReadParameters(arguments[]);
    /* Initialize three job sets -- MRExpandSet, MRInMemWekaSet, CompletionSet --
       each holding the nodes awaiting that kind of processing */
    JobSetsInit(ExpandSet, InMemWekaSet, CompletionSet);
    /* Initialize a Model File instance containing a regression tree with a root node only */
    InitModelFile(modelfile);
    do {
        /* Populate each set from the current model file */
        populateSets(modelfile, ExpandSet, InMemWekaSet, CompletionSet);
        if (ExpandSet is not empty) {
            processing_nodes <- all nodes in ExpandSet;
            TaskRunner(InitHistogram, processing_nodes);
            CandidatePoints <- collect reducers' results;
            TaskRunner(ExpandNodes, processing_nodes);
            globalOptimalSplitPoint <- collect reducers' results;
        }
        if (InMemWekaSet is not empty) {
            processing_nodes <- all nodes in InMemWekaSet;
            TaskRunner(InMemWeka, processing_nodes);
        }
        UpdateTreeModel(results);
    } while (any of ExpandSet, InMemWekaSet, CompletionSet is not empty);
    Output(modelfile);
}
[Flowchart: Start → Initialization → issue MRInitial task → while the queues are not empty: if MRExpandQueue is not empty, issue an MRExpandNode task; if MRInMemQueue is not empty, issue an MR-InMemGrow task; update the model and repopulate the queues → End]
ModelFile
An object that holds the regression model and supports the relevant operations, such as adding a node or checking a node's status.
Advantages:
• More convenient for updating the model and predicting target values than parsing an XML file
• Loading and writing the model file are simply serializing and de-serializing a Java object
[Figure: an example Model File instance — a regression tree whose internal nodes test conditions such as F1 < 27, F2 < 43, F1 < 90, F3 ∈ {M,W,F}, F4 < 16, and whose leaves hold either a Weka model or a single predicted value (e.g., 95); the object exposes the regression tree model plus functions such as Update( ) and CurrentLeaveNode( )]
InitHistogram
• A pre-processing step that finds candidate split points for ExpandNodes
• Numerical features: select a small set of candidate points from the huge data at the expense of a small loss in accuracy (e.g., feat1, feat2)
• Categorical features: every distinct value is a candidate (e.g., feat3)
• Input: a node (or the data subset reaching it)
• The ExpandNodes task then only needs to evaluate the points in the candidate set, without consulting any other resource
[Figure: MapReduce dataflow for InitHistogram — HDFS blocks feed Map instances, whose output is shuffled to Reduce instances]
Example: Feature 1 values {10,2,1,8,3,6,9,4,6,5,7} are sampled down to candidate points {1,3,5,7,9} using the boundaries of an equal-depth histogram (computed with Colt, a high-performance Java library):
f1: 1, 3, 5, 7, 9
f2: 30, 40, 50, 60, 70
f3: 1, 2, 3, 4, 5, 6, 7 {Monday -> Friday}
Mapper behavior for an input node (e.g., node 3) with numeric features Feat1, Feat2 and categorical feature Feat3:
• Filtering: only data points belonging to node 3 are of interest
• Routing: emit key-value pairs (featureID, value)
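The equal-depth sampling idea can be sketched in standalone Java. The slides compute these boundaries with the Colt library; this version, and its particular boundary convention, are illustrative assumptions, so the exact points it picks can differ from the slide's example.

```java
// Sketch of the InitHistogram idea (assumed logic, not OpenPlanet's code):
// sort a numeric feature's values and take the boundaries of equal-depth
// buckets as the candidate split points.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HistogramSketch {
    // Return k candidate split points: the boundaries of k+1 equal-depth buckets.
    static List<Double> candidates(double[] values, int k) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        List<Double> points = new ArrayList<>();
        for (int i = 1; i <= k; i++) {
            // index of the i-th boundary among k+1 equal-depth buckets
            int idx = i * sorted.length / (k + 1);
            points.add(sorted[idx]);
        }
        return points;
    }

    public static void main(String[] args) {
        double[] feature1 = {10, 2, 1, 8, 3, 6, 9, 4, 6, 5, 7}; // slide's example values
        System.out.println(candidates(feature1, 5)); // [2.0, 4.0, 6.0, 7.0, 9.0]
    }
}
```

The payoff is that ExpandNode only evaluates this small fixed set per feature, regardless of how many distinct values the raw data contains.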
ExpandNode
Input: a node (or the data subset reaching it).
[Figure: MapReduce dataflow for ExpandNode — HDFS blocks feed Map instances; each Reduce evaluates the candidate points, deriving the right side as D_right = D_total - D_left so only the left side must be accumulated]
• Mappers again perform filtering (keep only the expanding node's data, e.g. node 3) and routing
• Each reducer returns a local optimal split point to the Controller, e.g. sp1 (value = 23) and sp2 (value = 26)
• The Controller selects the global optimal split point, here sp1 (value = 23)
• The expanding node is then updated, e.g. node 3 becomes the split f2 < 23, with subsets D_left and D_right flowing to its children
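The reducer-side evaluation can be sketched as follows. PLANET-style splits minimize the summed squared error of the two sides (equivalently, maximize variance reduction); that criterion, and all names here, are assumptions for illustration. The key point from the slides is that right-side statistics are derived as total minus left, mirroring D_right = D_total - D_left.

```java
// Sketch (assumed logic) of picking the best candidate split point for one
// numeric feature by variance reduction. Right-side statistics are derived
// as total minus left, mirroring D_right = D_total - D_left.
public class SplitSketch {
    // Returns the candidate split value minimizing the summed squared error
    // of the two sides.
    static double bestSplit(double[] x, double[] y, double[] candidates) {
        int n = x.length;
        double totalSum = 0, totalSq = 0;
        for (double v : y) { totalSum += v; totalSq += v * v; }

        double bestErr = Double.POSITIVE_INFINITY, bestPoint = candidates[0];
        for (double c : candidates) {
            double leftSum = 0, leftSq = 0;
            int leftN = 0;
            for (int i = 0; i < n; i++) {
                if (x[i] < c) { leftSum += y[i]; leftSq += y[i] * y[i]; leftN++; }
            }
            int rightN = n - leftN;                  // D_right = D_total - D_left
            if (leftN == 0 || rightN == 0) continue; // degenerate split
            double rightSum = totalSum - leftSum;
            double rightSq = totalSq - leftSq;
            // SSE of a side = sum(y^2) - (sum(y))^2 / count
            double err = (leftSq - leftSum * leftSum / leftN)
                       + (rightSq - rightSum * rightSum / rightN);
            if (err < bestErr) { bestErr = err; bestPoint = c; }
        }
        return bestPoint;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 10, 11, 12};
        double[] y = {5, 5, 5, 50, 50, 50};          // clean break between x=3 and x=10
        System.out.println(bestSplit(x, y, new double[]{2, 5, 11})); // 5.0
    }
}
```

In OpenPlanet each reducer would run this over its share of candidates and report only its local optimum; the Controller compares those to choose the global one.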
MRInMemWeka
Input: a node (or the data subset reaching it).
[Figure: MapReduce dataflow for MRInMemWeka — HDFS blocks feed Map instances; the shuffle routes data to reducers handling node 4 and node 5]
• Mappers perform filtering and routing, emitting key-value pairs (NodeID, data point)
• Each reducer (1) collects the data points for its node, e.g. node 4, and (2) calls Weka's REPTree (or any other model, such as M5P) to build a model on them
• The reducers report the locations of the Weka models to the Controller, which updates the tree nodes (node 4 -> Weka model, node 5 -> Weka model, ...)
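The filter-and-route step above can be sketched without Hadoop or Weka. This standalone version (all names and the toy routing rule are assumptions) simulates what the shuffle achieves: grouping every row under the in-memory tree node it reaches, so one reducer can hand a node's whole subset to a local learner such as REPTree.

```java
// Sketch (assumed structure) of MRInMemWeka's filter-and-route step:
// mappers emit (nodeID, dataPoint); the shuffle groups rows per node.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RouteSketch {
    // Simulates the shuffle: group each row under the tree node it reaches.
    static Map<Integer, List<double[]>> route(double[][] rows) {
        Map<Integer, List<double[]>> byNode = new HashMap<>();
        for (double[] row : rows) {
            // Toy routing rule standing in for real tree traversal:
            // first feature < 27 goes to node 4, otherwise node 5.
            int nodeId = row[0] < 27 ? 4 : 5;
            byNode.computeIfAbsent(nodeId, k -> new ArrayList<>()).add(row);
        }
        return byNode;
    }

    public static void main(String[] args) {
        double[][] rows = {{10, 1.0}, {30, 2.0}, {20, 3.0}};
        Map<Integer, List<double[]>> grouped = route(rows);
        // Each grouped subset would be passed to a local learner here.
        System.out.println(grouped.get(4).size() + " rows for node 4"); // 2 rows for node 4
    }
}
```

The threshold mentioned earlier (default 60,000 tuples) decides whether a node's subset is small enough to be trained this way in one reducer's memory.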
Distinctions between OpenPlanet and PLANET:
(1) Sampling MapReduce method: InitHistogram
(2) Broadcast (BC) function
(3) Hybrid model
[Figure: a tree whose leaf nodes each contain a Weka model]
Broadcast pseudocode:

BC_Key.Set(BC);
for (i : num_reducers) {
    BC_Key.partitionID = i;
    send(BC_Key, BC_info);
}

Partitioning:

ID of partition = key.hashCode() % numReduceTasks

Key.hashCode() {
    if (key_type == BC) return partitionID;
    else return key.hashCode;
}
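The pseudocode above can be made concrete in plain Java. This is a self-contained sketch of the assumed semantics, not OpenPlanet's actual key class: a broadcast key carries an explicit partition ID and returns it from hashCode(), so the standard key.hashCode() % numReduceTasks partitioning delivers one copy of the broadcast record to every reducer.

```java
// Sketch (assumed semantics) of the broadcast trick: a key can pin itself
// to a chosen partition by returning that partition ID from hashCode().
public class BroadcastKey {
    enum Type { NORMAL, BC }

    final Type type;
    final String name;       // ordinary key payload
    final int partitionID;   // used only when type == BC

    BroadcastKey(Type type, String name, int partitionID) {
        this.type = type; this.name = name; this.partitionID = partitionID;
    }

    @Override public int hashCode() {
        // Broadcast keys pin themselves to their chosen partition;
        // normal keys hash on their payload as usual.
        return type == Type.BC ? partitionID : name.hashCode();
    }

    // The default hash partitioning rule from the slides.
    static int partition(BroadcastKey key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        // Sending the same broadcast payload once per reducer:
        for (int i = 0; i < numReducers; i++) {
            BroadcastKey k = new BroadcastKey(Type.BC, "bc-info", i);
            System.out.println("BC copy -> partition " + partition(k, numReducers));
        }
    }
}
```

In real Hadoop this logic would live in a custom Partitioner (or a key's hashCode used by the default HashPartitioner); masking with Integer.MAX_VALUE keeps the modulo non-negative.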
Performance Analysis and Tuning Method
Baseline for Weka, MATLAB, and OpenPlanet (single machine)
Parallel performance for OpenPlanet with default settings
Questions:
1. On the 17-million-tuple data set, there is very little difference between the 2x8 case and the 8x8 case.
2. There is not much performance improvement, especially compared to the Weka baseline: only a 1.58x speed-up on the 17M data set, and no speed-up on small data sets (where Weka has no memory overhead).
Question 1: Why is performance similar between the 2x8 case and the 8x8 case?
[Charts: OpenPlanet time per stage (MRInMemWeka, MRExTotal, MRInitTotal) — average training time (sec) vs. iteration number — for 2x8 cores and 8x8 cores with default settings]
[Charts: mapper slot utilization (%) vs. running time (sec) — for 2x8 cores (machines M1, M2: map and reducer slot usage) and 8x8 cores (machines M1-M8) with default settings]
Answer to question 1:
1. In HDFS, the basic unit is the block (default 64 MB).
2. Each Map instance processes one block at a time.
3. Therefore, with N blocks, only N Map instances can run in parallel.
For our problem:
4. Size of training data: 17 million tuples = 842 MB
5. Default block size = 64 MB
6. Number of blocks = 842 / 64 ≈ 13
7. For the 2x8 case, 13 Maps run in parallel: utilization = 13/16 = 81%
8. For the 8x8 case, 13 Maps run in parallel: utilization = 13/64 = 20%
9. In both cases only 13 Maps run in parallel, which explains the similar performance.
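The arithmetic behind this answer is easy to check in code. This small sketch (my own helper, using the slides' rounding, which counts 842 MB / 64 MB as 13 blocks) computes the block count and the resulting map-slot utilization.

```java
// Quick check of the block-count and utilization arithmetic from the slides.
public class BlockMath {
    // Number of HDFS blocks, rounded down as in the slides (842/64 -> 13).
    static int numBlocks(int dataMB, int blockMB) { return dataMB / blockMB; }

    // Fraction of map slots that can be busy when only `blocks` Maps
    // can run in parallel.
    static double utilization(int blocks, int mapSlots) {
        return Math.min(blocks, mapSlots) / (double) mapSlots;
    }

    public static void main(String[] args) {
        int blocks = numBlocks(842, 64);                                   // 13
        System.out.printf("2x8: %.0f%%%n", 100 * utilization(blocks, 16)); // 81%
        System.out.printf("8x8: %.0f%%%n", 100 * utilization(blocks, 64)); // 20%
    }
}
```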
Solution: tune the block size so that the number of blocks matches the number of computing cores.
=> What if the number of blocks >> the number of computing cores? That does not necessarily improve performance, because of the network bandwidth limitation.
Improvement:
[Charts: OpenPlanet time per stage (MRInMemWeka, MRExTotal, MRInitTotal) vs. iteration number, and mapper slot utilization (%) vs. running time (machines M1-M8), for 8x8 cores with the optimized block size]
Optimized block size per cluster size and data size:

Nodes | 3.5M tuples / 170MB | 17M tuples / 840MB | 35M tuples / 1.7GB
2x8   | 20 MB               | 80 MB              | 128 MB
4x8   | 8 MB                | 32 MB              | 64 MB
8x8   | 4 MB                | 16 MB              | 32 MB
1. We tune the block size to 16 MB.
2. This gives 842/16 ≈ 52 blocks.
3. Total running time = 4,300,457 sec, versus 5,712,154 sec for the original version: a 1.33x speed-up.
[Chart: average training time (sec) vs. training data size (3.5, 17.5, and 35.1 million rows) for 2x8, 4x8, and 8x8 cores, each with the default, optimized, and 16 MB block sizes]
Question 2:
1. Weka works better when there is no memory overhead.
2. This is visible in the time-per-stage chart.
[Chart: OpenPlanet time per stage using 8x8 cores (optimized block size) — the area under MRInMemWeka is small while the area under MRExpand is large]
What about balancing those two areas while still avoiding memory overhead in Weka?
Solution: increase the threshold value that switches between the ExpandNodes task and InMemWeka.
By experiment, when the reducer JVM has 1 GB of memory, the maximum workable threshold value is 2,000,000 tuples.
Performance Improvement
[Chart: OpenPlanet time per stage (MRInMemTotal, MRExTotal, MRInitTotal) vs. iteration number using 8x8 cores, with the optimized block size and the 2M threshold value]
[Chart, for comparison: OpenPlanet time per stage using 8x8 cores with the optimized block size and the default threshold]
1. Total running time = 1,835,430 sec vs. 4,300,457 sec
2. The two areas are balanced
3. The number of iterations decreased
4. Speed-up = 4,300,457 / 1,835,430 = 2.34x
Average total speed-up on the 17M data set using 8x8 cores:
• vs. Weka: 4.93x
• vs. MATLAB: 14.3x
Average accuracy (CV-RMSE):
• Weka: 10.09%
• MATLAB: 10.46%
• OpenPlanet: 10.35%
Summary:
• OpenPlanet is an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework.
• We tune and analyze the impact of parameters such as the HDFS block size and the threshold for in-memory handoff to improve OpenPlanet's default performance.
Future work:
(1) Parallel execution of MRExpand and MRInMemWeka within each iteration
(2) Running multiple OpenPlanet instances for different uses, to increase slot utilization
(3) Optimal block size selection
(4) Real-time model training
(5) Moving to a Cloud platform and analyzing performance there