DP-IV presentation - ashutosh
Transcript of DP-IV presentation - ashutosh
Performance analysis of C-means Clustering on Big Data using Hadoop
Fuzzy C-means
Guided by Prof. A. J. Umbarkar
Presented by A. S. Sathe
BROAD AREA: DISTRIBUTED COMPUTING, DATA MINING
SUB AREA: CLUSTERING ALGORITHMS, DATA CLUSTERING
Presentation Agenda
• Literature Survey
• Problem Statement
• Objectives achieved
• Results
• Future Scope
• References
Data Growth Rate[7]
Relevance
• Data Clustering: Classification of a data set into groups of similar items based on some criteria
• Big Data: Data too large to process with traditional database and software techniques
• Hadoop: A distributed computing framework based on the MapReduce architecture
• Document Clustering
  • Text-based data stored in files or in an unstructured format
  • Based on text properties such as word frequency or supplied keywords
  • Text properties serve as the similarity criteria
  • Documents are differentiated based on these similarity criteria
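As an illustration of using word frequency as a similarity criterion, the following is a minimal sketch, not the method used in this work. It assumes a simple lowercase word tokenizer and cosine similarity; the function names and the toy documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Return a Counter mapping each lowercase word to its frequency."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two sparse term-frequency vectors."""
    # Counter returns 0 for words missing from tf_b.
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, for illustration only.
docs = {
    "d1": "hadoop cluster runs mapreduce jobs",
    "d2": "mapreduce jobs run on a hadoop cluster",
    "d3": "fuzzy membership of a data point",
}
vectors = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])  # four shared words
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])  # no shared words
```

Documents that share many words score close to 1 and documents with no words in common score 0, which is the kind of signal a clustering algorithm can group on.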
Relevance
• Need for data clustering
  • Data mining is used for Knowledge Discovery from Data (KDD).
  • It is based on historical data.
  • Historical data may be Big Data.
  • Processing Big Data is a very tedious task.
  • Data clustering is a preprocessing step for Big Data processing.
  • The processed data is then used for data mining.
  • Clustered data gives better results than randomly placed data.
Relevance
• Why text clustering
  • Text is a type of unstructured data
  • Free from any database constraints
  • Files can be very large without any restrictions
  • Text clustering in real-world scenarios
    • Retrieve, filter, and categorize documents
    • Information retrieval
  • Clustered data is useful for Knowledge Data Retrieval
Relevance
• Why Hadoop
  • Distributed framework
  • Can use processor capacity on the fly
  • Made for Big Data processing
Problem Statement
• “Performance Analysis of C-means Clustering on Big Data using Hadoop.”
Objectives achieved
• Design of a processing model of the Fuzzy C-Means algorithm for Map-Reduce
• Implementation of the C-means algorithm on Map-Reduce
• Testing and performance analysis of the above algorithm with Big Data on Map-Reduce
• Comparison of C-means with other equivalent works
Fuzzy C-means Clustering
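For reference, the standard fuzzy C-means formulation can be written as follows. The symbols N (number of data points), C (number of clusters), x_i (data point), c_j (centroid), u_ij (membership of point i in cluster j), and m (fuzzifier, m > 1) are notation assumed here rather than taken from the slides.

```latex
% Objective minimized by fuzzy C-means
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2},
\qquad \text{subject to } \sum_{j=1}^{C} u_{ij} = 1 \text{ for every } i

% Membership update
u_{ij} = \frac{1}{\displaystyle\sum_{k=1}^{C}
         \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}

% Centroid update
c_j = \frac{\displaystyle\sum_{i=1}^{N} u_{ij}^{m} \, x_i}
           {\displaystyle\sum_{i=1}^{N} u_{ij}^{m}}
```

The algorithm alternates between the membership update and the centroid update until the memberships change by less than a chosen tolerance or a fixed number of iterations is reached.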
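• For example: the initial centroids are 3 and 11 (with m = 2)
• For node 2 (1st element):
U11 = the membership of the first node in the first cluster
\[ U_{11} = \frac{1}{\left(\frac{|2-3|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-3|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{1 + \frac{1}{81}} = \frac{81}{82} = 98.78\% \]
U12 = the membership of the first node in the second cluster
\[ U_{12} = \frac{1}{\left(\frac{|2-11|}{|2-3|}\right)^{\frac{2}{2-1}} + \left(\frac{|2-11|}{|2-11|}\right)^{\frac{2}{2-1}}} = \frac{1}{81 + 1} = \frac{1}{82} = 1.22\% \]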
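The membership calculation for node 2 with centroids 3 and 11 and m = 2 can be checked with a short script. This is a minimal sketch for one-dimensional data, assuming the standard FCM membership update; the function name `fcm_memberships` is hypothetical.

```python
def fcm_memberships(x, centroids, m=2.0):
    """Membership of scalar point x in each cluster (standard FCM update)."""
    distances = [abs(x - c) for c in centroids]
    # A point that coincides with a centroid belongs fully to that cluster.
    if 0.0 in distances:
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


# Values from the example: node 2, initial centroids 3 and 11, m = 2.
u11, u12 = fcm_memberships(2, [3, 11], m=2)
print(f"U11 = {u11:.2%}, U12 = {u12:.2%}")  # prints "U11 = 98.78%, U12 = 1.22%"
```

The two memberships sum to 1, as required for every data point in fuzzy C-means.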
Dataset Conversion
Hadoop based K-Means on Documents
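One common way to express a K-Means iteration in MapReduce terms is sketched below. It is an in-process simulation in plain Python, not Hadoop code and not necessarily the design used in this work: the map step assigns each point to its nearest centroid, and the reduce step averages the points of each cluster. All function names and sample values are hypothetical.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points assigned to one centroid."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for point, n in values:
        for d in range(dims):
            total[d] += point[d]
        count += n
    return [t / count for t in total]


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round in-process (no Hadoop involved)."""
    grouped = defaultdict(list)  # stands in for Hadoop's shuffle phase
    for point in points:
        key, value = kmeans_map(point, centroids)
        grouped[key].append(value)
    # Keep the old centroid if no point was assigned to it.
    return [
        kmeans_reduce(grouped[j]) if j in grouped else list(centroids[j])
        for j in range(len(centroids))
    ]


# Hypothetical 2-D document vectors and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
new_centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
# new_centroids == [[1.0, 1.5], [8.5, 8.0]]
```

On a real cluster each mapper would process one input split, and the driver would rerun the job with the new centroids until the iteration limit is reached.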
Fuzzy C-Means on Documents
Hadoop based Fuzzy C-Means on Documents
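A fuzzy C-means iteration can be split into map and reduce steps in a similar spirit. The sketch below is an in-process simulation for one-dimensional data under the standard FCM update rules. It is an assumption about how such a split could look, not the processing model designed in this work, and it assumes the centroids are distinct. Function names and sample values are hypothetical.

```python
from collections import defaultdict


def fcm_map(x, centroids, m=2.0):
    """Map step: emit (cluster j, (u^m * x, u^m)) for every cluster j."""
    distances = [abs(x - c) for c in centroids]
    exponent = 2.0 / (m - 1.0)
    pairs = []
    for j, d_j in enumerate(distances):
        if 0.0 in distances:
            u = 1.0 if d_j == 0 else 0.0  # point sits exactly on a centroid
        else:
            u = 1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        weight = u ** m
        pairs.append((j, (weight * x, weight)))
    return pairs


def fcm_reduce(values):
    """Reduce step: new centroid = sum(u^m * x) / sum(u^m)."""
    numerator = sum(v[0] for v in values)
    denominator = sum(v[1] for v in values)
    return numerator / denominator


def fcm_iteration(points, centroids, m=2.0):
    """Run one map/shuffle/reduce round in-process (no Hadoop involved)."""
    grouped = defaultdict(list)  # stands in for Hadoop's shuffle phase
    for x in points:
        for j, value in fcm_map(x, centroids, m):
            grouped[j].append(value)
    return [fcm_reduce(grouped[j]) for j in range(len(centroids))]


# Hypothetical 1-D data; the starting centroids 3 and 11 are reused for illustration.
points = [2.0, 4.0, 10.0, 12.0]
new_centroids = fcm_iteration(points, [3.0, 11.0])
```

Unlike K-Means, every point contributes to every cluster, so the map step emits one pair per cluster for each point. This larger intermediate output is one plausible reason for FCM taking longer than K-Means on the same data.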
Results
Experimental Setup

Algorithm                    Split            3 Centroids     4 Centroids     5 Centroids     6 Centroids
                                              4 Itr   6 Itr   4 Itr   6 Itr   4 Itr   6 Itr   4 Itr   6 Itr
Classical K-Means            Not Applicable   √       √       √       √       √       √       √       √
Hadoop Based K-Means         4 MB             √       √       √       √       √       √       √       √
                             8 MB             √       √       √       √       √       √       √       √
                             16 MB
                             32 MB            √       √       √       √       √       √       √       √
Classical Fuzzy C-Means      Not Applicable   √       √       √       √       √       √       √       √
Hadoop Based Fuzzy C-Means   4 MB             √       √       √       √       √       √       √       √
                             8 MB             √       √       √       √       √       √       √       √
                             16 MB
                             32 MB            √       √       √       √       √       √       √       √
[Figure: Execution time in seconds (axis 0 to 1000) of Classical K-Means and 2-, 4-, and 8-Node K-Means, for 3, 4, 5, and 6 centroids]
[Figure: Execution time in seconds (axis 0 to 9000) of Classical FCM and 2-, 4-, and 8-Node FCM, for 3, 4, 5, and 6 centroids]
[Figure: 4 MB Split KM Performance. Speedup vs. number of nodes (2, 4, 8) for 4 and 6 iterations. Speedup comparison of KM w.r.t. HKM]
[Figure: 4 MB Split FCM Performance. Speedup vs. number of nodes (2, 4, 8) for 4 and 6 iterations. Speedup comparison of FCM w.r.t. HFCM]
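The slides do not state how speedup is computed. A minimal sketch, assuming the usual definition (classical single-machine time divided by Hadoop time), is shown below. The timings are hypothetical and are not taken from the results.

```python
def speedup(classical_seconds, hadoop_seconds):
    """Speedup of the Hadoop run relative to the classical single-machine run."""
    if hadoop_seconds <= 0:
        raise ValueError("hadoop_seconds must be positive")
    return classical_seconds / hadoop_seconds


# Hypothetical timings in seconds, not taken from the slides.
classical = 900.0
timings = {"2 Node": 450.0, "4 Node": 300.0, "8 Node": 180.0}
speedups = {nodes: speedup(classical, t) for nodes, t in timings.items()}
# speedups == {"2 Node": 2.0, "4 Node": 3.0, "8 Node": 5.0}
```

Under this definition a value above 1 means the Hadoop run finished faster than the classical run.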
[Figure: 8 MB Split HKM Performance. Speedup vs. number of nodes (2, 4, 8) for 4 and 6 iterations. Speedup comparison of KM w.r.t. HKM]
[Figure: 8 MB Split HFCM Performance. Speedup vs. number of nodes (2, 4, 8) for 4 and 6 iterations. Speedup comparison of FCM w.r.t. HFCM]
HKM and HFCM speedup performances and comparison
[Figure, 4 iterations: Speedup of HKM and HFCM with 4 MB, 8 MB, and 32 MB splits on 2, 4, and 8 nodes]
[Figure, 6 iterations: Speedup of HKM and HFCM with 4 MB, 8 MB, and 32 MB splits on 2, 4, and 8 nodes]
Analysis based on cluster sizes
[Figure: Average KM and HKM time consumption w.r.t. cluster sizes. Time (axis 0 to 12000) for KM and 2-, 4-, and 8-Node HKM with 3, 4, 5, and 6 centroids]
[Figure: Average FCM and HFCM time consumption w.r.t. cluster sizes. Time (axis 0 to 100000) for FCM and 2-, 4-, and 8-Node HFCM with 3, 4, 5, and 6 centroids]
Future Scope
Paper publication
• Submitted to IEEE CONECCT 2015
Tools and Platform Required
1. Text Dataset
4. Hadoop 1.2.1
5. JDK 1.6
6. O.S. Ubuntu 14.04
References
1. Cui, Xiaoli et al. "Optimized big data K-means clustering using MapReduce." The Journal of Supercomputing, Vol. 70, pp. 1249-1259, 2014.
2. Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. "Data clustering: a review." ACM Computing Surveys (CSUR), Vol. 31, pp. 264-323, 1999. DOI:10.1145/331499.331504
3. Zhao, Weizhong et al. "Parallel k-means clustering based on MapReduce." In Cloud Computing, Springer Berlin Heidelberg, Vol. 5931, pp. 674-679, 2009.
4. Xie, Jiong, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin. "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters." In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, pp. 1-9. IEEE, 2010. DOI:10.1109/IPDPSW.2010.5470880
References (cont.)
5. J. Dean, S. Ghemawat. "MapReduce." Commun. ACM, Vol. 51(1), p. 107, Jan. 2008.
6. A. Asuncion and D. J. Newman. UCI Machine Learning Repository. Available: http://archive.ics.uci.edu/ml/ (accessed: 07-Jan-2015)
7. https://www.linkedin.com/pulse/big-data-whats-deal-debarchan-sarkar [Used on Apr 9, 2015]
Questions?
Thank You