MapReduce based SVM


Description: Training SVM in Cloud Computing Systems with MapReduce

Transcript of MapReduce based SVM

Page 1: MapReduce based SVM

CloudSVM: Training an SVM Classifier in Cloud Computing Systems

F. Ozgur CATAK (1), M. Erdal BALABAN (2)

(1) TUBITAK - National Research Institute of Electronics and Cryptology (UEKAE)

(2) Istanbul University, Faculty of Business Administration, Department of Quantitative Methods

ICPCA / SWS 2012

28 Nov 2012

Outline: 1. Introduction, 2. Support Vector Machine, 3. MapReduce, 4. Development of System Model, 5. Simulation Results, 6. Conclusion

Page 2: MapReduce based SVM

Motivation

Our Research Focus

Overcome the high space and time complexity of the Support Vector Machine algorithm

Train SVMs in cloud systems with MapReduce

Use the HDFS file system

Find a global classifier function

Page 3: MapReduce based SVM

Contents

1. Introduction
   1.1 Support Vector Machine
   1.2 SVM Solutions

2. Support Vector Machine
   2.1 Definition
   2.3 Optimization Problem
   2.4 Lagrange Multiplier

3. MapReduce
   3.1 MapReduce - Cloud Computing Algorithm
   3.2 Schematic View of MapReduce

4. Development of System Model
   4.1 Overview
   4.2 CloudSVM Architecture Schematic View
   4.3 CloudSVM Algorithm MapReduce Function

5. Simulation Results
   5.1 Method
   5.2 UCI Dataset Results
   5.3 Convergence of CloudSVM

6. Conclusion
   Conclusion & Recommendation
   References


Page 5: MapReduce based SVM

1. INTRODUCTION

Support Vector Machine - SVM

Developed from Statistical Learning Theory (Vapnik & Chervonenkis)

Supervised learning method in statistics and computer science

Analyzes data and recognizes patterns; used for classification and regression analysis

Aims for maximum generalization accuracy while avoiding overfitting

Issues

Computationally expensive to train

The quadratic optimization problem has $O(m^3)$ time and $O(m^2)$ space complexity, where $m$ is the training set size
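To make the quoted complexity figures concrete, the following sketch does the arithmetic for a few training set sizes. It assumes a dense kernel matrix of 8-byte floats, which is an illustrative assumption and not a statement about any particular solver (practical solvers often avoid materializing the full matrix). The function names are hypothetical.

```python
def kernel_matrix_bytes(m, bytes_per_entry=8):
    """Memory for a dense m x m kernel matrix (assumed 8-byte floats)."""
    return m * m * bytes_per_entry


def relative_training_cost(m, m_ref):
    """Growth factor of an O(m^3) computation when moving from m_ref to m samples."""
    return (m / m_ref) ** 3


for m in (1_000, 10_000, 100_000):
    gib = kernel_matrix_bytes(m) / 2**30
    factor = relative_training_cost(m, 1_000)
    print(f"m={m:>7}: kernel matrix ~ {gib:.2f} GiB, cost x{factor:,.0f} vs m=1000")
```

A tenfold increase in training set size multiplies the memory estimate by 100 and the time estimate by 1000, which is the scaling problem the slides motivate.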

Page 6: MapReduce based SVM


1. INTRODUCTION

Solution - Feature Reduction

Singular Value Decomposition (SVD)

Principal Component Analysis (PCA)

Independent Component Analysis (ICA)

Correlation Based Feature Selection (CFS)

Solution - Distributed Computing

Conventional distributed machine learning methods are complicated

Pre-Configured Intranet/Internet Environments

Costly



Page 8: MapReduce based SVM

2. SUPPORT VECTOR MACHINE

Support Vector Machine

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.

Page 9: MapReduce based SVM

2. SUPPORT VECTOR MACHINE

Let $D$ be a set of $n$ points of the form

$$D = \{(\vec{x}_i, y_i) \mid \vec{x}_i \in \mathbb{R}^m,\ y_i \in \{-1, 1\}\}_{i=1}^{n}$$

For each $\vec{x}_i$ in the data set $D$:

$$\vec{w} \cdot \vec{x}_i - b \ge 1 \quad \text{if } y_i = 1 \tag{1}$$

$$\vec{w} \cdot \vec{x}_i - b \le -1 \quad \text{if } y_i = -1 \tag{2}$$

Or equivalently

$$y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1, \quad \forall (\vec{x}_i, y_i) \in D \tag{3}$$

The distance from a point $\vec{x}_i$ to the separating hyperplane is $|F(\vec{x}_i)| / \|\vec{w}\|$, with $F(\vec{x}) = \vec{w} \cdot \vec{x} - b$; for points on the two margin hyperplanes this equals $1 / \|\vec{w}\|$. Maximizing the distance between these two hyperplanes is therefore equivalent to:

$$\text{Minimize: } P(\vec{w}, b) = \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to: } y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1 \tag{4}$$
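The margin constraint in (3) can be checked numerically for a candidate hyperplane. The sketch below uses the decision function w·x − b from (3); the toy data set, the chosen w and b, and the function names are illustrative assumptions.

```python
import math


def decision(w, b, x):
    """Decision value w . x - b, the sign convention of constraint (3)."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def satisfies_margin(w, b, data):
    """True if y_i (w . x_i - b) >= 1 holds for every sample."""
    return all(y * decision(w, b, x) >= 1 for x, y in data)


def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical linearly separable toy data: (features, label)
data = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((-2.0, 0.0), -1), ((-3.0, -1.0), -1)]
w, b = (0.5, 0.0), 0.0

print(satisfies_margin(w, b, data))  # every constraint holds for this w, b
print(margin_width(w))               # width of the gap between the hyperplanes
```

Shrinking w widens the margin but eventually violates the constraints, which is why the problem minimizes the norm of w subject to (3).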

Page 10: MapReduce based SVM

2. SUPPORT VECTOR MACHINE

Optimization Problem

$$\text{Minimize: } P(\vec{w}, b) = \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to: } y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1 \tag{5}$$

Lagrange Multipliers

By introducing Lagrange multipliers $\alpha_i \ge 0$, the previous linearly constrained problem can be expressed as

$$J(\vec{w}, b, \alpha) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i(\vec{w} \cdot \vec{x}_i - b) - 1 \right) \tag{6}$$

Page 11: MapReduce based SVM

2. SUPPORT VECTOR MACHINE

Lagrange Multiplier Solution

Minimize the Lagrange function $J(\vec{w}, b, \alpha)$ with respect to $\vec{w}$ and $b$. At the saddle point:

$$\text{Condition 1: } \frac{\partial J(\vec{w}, b, \alpha)}{\partial \vec{w}} = 0$$

$$\text{Condition 2: } \frac{\partial J(\vec{w}, b, \alpha)}{\partial b} = 0$$

Solving conditions 1 and 2 gives

$$\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{7}$$

New Optimization Problem

$$\text{Maximize: } Q(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \vec{x}_i \cdot \vec{x}_j$$

$$\text{subject to: } \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0 \tag{8}$$
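As a small numeric check of (7) and (8), the sketch below evaluates the dual objective on a two-point toy problem and recovers w from the multipliers. The data set and the multiplier values are illustrative assumptions; for this symmetric problem, setting both multipliers to 0.5 maximizes Q, and the dual value then matches the primal objective.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_objective(alpha, data):
    """Q(alpha) from (8): sum of alpha_i minus half the double sum."""
    n = len(data)
    quad = sum(
        alpha[i] * alpha[j] * data[i][1] * data[j][1] * dot(data[i][0], data[j][0])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def weights_from_alpha(alpha, data):
    """w = sum_i alpha_i y_i x_i, the first condition in (7)."""
    dim = len(data[0][0])
    return [sum(a * y * x[k] for a, (x, y) in zip(alpha, data)) for k in range(dim)]


# Hypothetical toy problem: one positive and one negative sample
data = [((1.0, 0.0), 1), ((-1.0, 0.0), -1)]
alpha = [0.5, 0.5]

w = weights_from_alpha(alpha, data)
balance = sum(a * y for a, (_, y) in zip(alpha, data))  # second condition in (7)
primal = 0.5 * dot(w, w)
dual = dual_objective(alpha, data)
print(w, balance, primal, dual)
```

Only the inner products between samples enter Q, which is the property kernel methods rely on.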


Page 13: MapReduce based SVM

3. MapReduce - Cloud Computing Algorithm

MapReduce Overview

Breaks a large problem into smaller parts, solves them in parallel, and combines the results.

The programmer specifies the map and reduce functions.

Transparent scaling: the same code runs on MBs locally or on TBs across thousands of machines.

Page 14: MapReduce based SVM

3. MapReduce - Cloud Computing Algorithm

MapReduce Overview

One of the most popular cloud computing models

Elastic framework for software developers building parallel and distributed applications

Input and output files reside on a distributed file system.

map(key1, value1) ⇒ list(key2, value2)

reduce(key2, list(value2)) ⇒ list(value3)

Figure: Overview of MapReduce
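The two signatures above can be illustrated with a word count executed in a single process. This is a minimal sketch of the programming model only: `run_mapreduce` is a hypothetical in-memory driver standing in for the framework's shuffle phase, not Hadoop's API.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(key1, value1) => list(key2, value2): emit (word, 1) for each word."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """reduce(key2, list(value2)) => list(value3): total count for one word."""
    return [sum(values)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map every record, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)  # stands in for the shuffle/sort phase
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


# Input records as (line number, line text) pairs
records = [(0, "map reduce map"), (1, "reduce shuffle")]
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

In a real cluster the map calls run on different machines and the grouping is done by the framework, but the functions the programmer writes keep this shape.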


Page 16: MapReduce based SVM

4. Development of System Model

CloudSVM

A new technique for training SVMs in the cloud with MapReduce

The training data set is uploaded to HDFS

Classifier functions are found with this approach for data sets stored in HDFS

What’s new in CloudSVM

SVM training has $O(m^3)$ time complexity and $O(m^2)$ space complexity, where $m$ is the data set size. This cost is a major obstacle for large-scale data sets and Big Data.

Page 17: MapReduce based SVM

4. Development of System Model

Figure: CloudSVM Architecture Schematic View.

Page 18: MapReduce based SVM

4. Development of System Model

CloudSVM Algorithm Map Function

SV_Global = ∅  {Empty global support vector set}
while h^t ≠ h^(t−1) do
    for l ∈ L do  {For each subset loop}
        D_l^t ← D_l^t ∪ SV_Global^t
    end for
end while

Page 19: MapReduce based SVM

4. Development of System Model

CloudSVM Algorithm Reduce Function

while h^t ≠ h^(t−1) do
    for l ∈ L do
        SV_l, h^t ← svm(D_l)  {Train merged data set to obtain support vectors and hypothesis}
    end for
    for l ∈ L do
        SV_Global ← SV_Global ∪ SV_l
    end for
end while
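The map and reduce steps above can be sketched as a single sequential loop. This is an illustration under several assumptions, not the authors' implementation: it runs in one process instead of on Hadoop, the trainer is a hypothetical exact hard-margin solver for separable 1-D data (standing in for LibSVM), and the hypothesis h^t is taken to be the classifier trained on the global support vector set, which the slides do not specify.

```python
def train_1d_svm(samples):
    """Stand-in trainer (assumption): exact hard-margin SVM for separable 1-D
    data whose negatives lie left of its positives.
    Returns (support vectors, decision threshold)."""
    nearest_pos = min(x for x, y in samples if y == 1)
    nearest_neg = max(x for x, y in samples if y == -1)
    threshold = (nearest_pos + nearest_neg) / 2
    return {(nearest_neg, -1), (nearest_pos, 1)}, threshold


def cloud_svm(partitions, train=train_1d_svm, max_iter=20):
    """Iterate the map and reduce steps until the hypothesis stops changing."""
    sv_global = set()  # empty global support vector set
    h_prev = None
    for iteration in range(1, max_iter + 1):
        # Map step: merge every subset with the current global support vectors
        merged = [set(part) | sv_global for part in partitions]
        # Reduce step: train on each merged subset and collect its support vectors
        for subset in merged:
            sv_local, _ = train(subset)
            sv_global |= sv_local
        # Assumption: h^t is the classifier trained on the global SV set
        _, h = train(sv_global)
        if h == h_prev:
            return h, sv_global, iteration
        h_prev = h
    return h_prev, sv_global, max_iter


# Hypothetical data split into two subsets; each holds both classes
partitions = [
    [(-5, -1), (-1, -1), (4, 1), (6, 1)],
    [(-4, -1), (-2, -1), (2, 1), (7, 1)],
]
threshold, support_vectors, iterations = cloud_svm(partitions)
_, full_data_threshold = train_1d_svm([s for part in partitions for s in part])
print(threshold, full_data_threshold, iterations)
```

On this toy data neither subset alone yields the full-data classifier, but after the support vectors are exchanged the distributed threshold agrees with the one obtained by training on all samples at once.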


Page 21: MapReduce based SVM

5. SIMULATION RESULTS

Method

We used 10-fold cross-validation, dividing the set of samples at random into 10 approximately equal-size parts. We used the hinge loss to test the models trained with the CloudSVM algorithm. The empirical risk can be computed with an approximation:

$$l(f(\vec{x}), y) = \max\{0,\ 1 - y \cdot f(\vec{x})\} \tag{9}$$

$$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} l(h(\vec{x}_i), y_i) \tag{10}$$

According to the empirical risk minimization principle, the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk:

$$\hat{h} = \arg\min_{h \in H} R_{emp}(h) \tag{11}$$
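Equations (9) and (10) translate directly into code. The sketch below evaluates them for a hypothetical 1-D hypothesis and four made-up samples; the hypothesis and the data are assumptions chosen for illustration.

```python
def hinge_loss(score, y):
    """Equation (9): max{0, 1 - y * f(x)} for a real-valued score f(x)."""
    return max(0.0, 1.0 - y * score)


def empirical_risk(h, samples):
    """Equation (10): mean hinge loss of hypothesis h over the samples."""
    return sum(hinge_loss(h(x), y) for x, y in samples) / len(samples)


def h(x):
    """Hypothetical 1-D linear hypothesis with decision threshold 0.5."""
    return x - 0.5


# Made-up samples (x, y): two lie outside the margin, two incur a loss
samples = [(2.0, 1), (0.5, 1), (-1.0, -1), (1.0, -1)]
risk = empirical_risk(h, samples)
print(risk)
```

Samples classified correctly with a score of at least 1 in magnitude contribute nothing, so the risk is driven by points inside the margin or on the wrong side of it.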

Page 22: MapReduce based SVM

5. SIMULATION RESULTS

Software & Development Environment

Hadoop 0.23

Python 2.7

SciPy, NumPy (scientific and numeric Python libraries)

Python(x,y) (scientific-oriented Python distribution based on Qt and Spyder)

mrjob 0.3.5 (Hadoop Streaming)

LibSVM

CentOS 6.2 64-bit

Page 23: MapReduce based SVM

5. SIMULATION RESULTS

Table: Various UCI Datasets

Dataset      Rows   Features   γ     C   Iterations   SVs    Accuracy   Kernel Type
German       1000   24         100   1   5            606    0.7728     Linear
Heart        270    13         100   1   3            137    0.8259     Linear
Ionosphere   351    34         108   1   3            160    0.8423     Linear
Satellite    4435   36         100   1   2            1384   0.9064     Linear

Page 24: MapReduce based SVM

5. SIMULATION RESULTS

Table: Data set prediction accuracy with iterations

German & Heart datasets: loss values and the number of support vectors converge smoothly.

Page 25: MapReduce based SVM

5. SIMULATION RESULTS

Table: Data set prediction accuracy with iterations

Ionosphere & Satellite datasets: loss values and the number of support vectors converge smoothly.


Page 27: MapReduce based SVM

6. CONCLUSION & RECOMMENDATION

Conclusion

We presented the simulation results

Stable and high generalization property

Independent of network and computer infrastructure (cloud computing based)

Recommendation

Multiclass classification

Application to real data sets

Into how many parts can the data set be divided?

Page 28: MapReduce based SVM

6. REFERENCES

Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, NY (1995)

Chang, E.Y., Zhu, K., Wang, H., Bai, H., Li, J., Qiu, Z., Cui, H.: PSVM: Parallelizing Support Vector Machines on Distributed Computers. Advances in Neural Information Processing Systems 20 (2007)

Lu, Y., Roychowdhury, V., Vandenberghe, L.: Distributed parallel support vector machines in strongly connected networks. IEEE Trans. Neural Networks 19, 1167-1178 (2008)

Graf, H.P., Cosatto, E., Bottou, L., Durdanovic, I., Vapnik, V.: Parallel support vector machines: The cascade SVM. In: Proceedings of the Eighteenth Annual Conference on Neural Information Processing Systems (NIPS), pp. 521-528. MIT Press, Vancouver (2004)

Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation (OSDI), pp. 10-10. USENIX Association, Berkeley (2004)