Big data clustering
![Page 1: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/1.jpg)
Workshop on Parallel, Cluster and
Cloud Computing on Multi-core & GPU
(PCCCMG - 2015)
Workshop conducted by Computer Society of India in
association with Dept. of CSE, VNIT and
Persistent Systems Ltd, Nagpur
4th – 6th Sep’15
![Page 2: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/2.jpg)
Big-Data Cluster Computing
Advanced tools & technologies
Jagadeesan A S
Software Engineer
Persistent Systems Limited
www.github.com/jagadeesanas2
www.linkedin.com/in/jagadeesanas2
![Page 3: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/3.jpg)
Content
Overview of Big Data
• Data clustering concepts
• Clustering vs Classification
• Data Journey
Advanced tools and technologies
• Apache Hadoop
• Apache Spark
Future of analytics
• Demo - Spark RDD in IntelliJ IDEA
![Page 4: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/4.jpg)
What is Big-Data?
Big-Data is similar to Small-Data, but bigger in size and complexity.
Definition from Wikipedia:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
![Page 5: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/5.jpg)
Characterization of Big Data: 4V’s
Veracity
![Page 6: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/6.jpg)
Characterization of Big Data: 4V’s
![Page 7: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/7.jpg)
Now the big question: why do we need Big Data?
What do we do with all that data?
![Page 8: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/8.jpg)
And the answer is very clear…!!
![Page 9: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/9.jpg)
What is a Cluster?
A group of the same or similar elements gathered or occurring closely together.
Clustering is the key to the Big Data problem
• Not feasible to “label” a large collection of objects
• No prior knowledge of the number and nature of groups (clusters) in data
• Clusters may evolve over time
• Clustering provides efficient browsing, search, recommendation and organization of data
![Page 10: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/10.jpg)
Difference between Clustering & Classification
![Page 11: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/11.jpg)
Clustering data on
![Page 12: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/12.jpg)
Clustering videos on
![Page 13: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/13.jpg)
Clustering Algorithms
Hundreds of clustering algorithms are available.
• K-Means
• Kernel K-means
• Nearest neighbour
• Gaussian mixture
• Fuzzy Clustering
• OPTICS algorithm
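To make the first algorithm in the list concrete, here is a minimal sketch of K-Means in plain Python. It is an illustration only, not the implementation used by any particular library; the function name `kmeans`, the sample points and the choice of starting centroids are all assumptions made for this example.

```python
def kmeans(points, centroids, max_iterations=20):
    """Cluster 2-D points around the given starting centroids (toy K-Means)."""
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two visually obvious groups of unlabeled points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
# Assumption: start from one point of each group to keep the run deterministic.
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

No labels are supplied anywhere: the groups emerge from distances alone, which is what separates clustering from classification.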
![Page 14: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/14.jpg)
Data Journey
![Page 15: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/15.jpg)
Advanced tools &
Technologies
![Page 16: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/16.jpg)
Large-Scale Data Analytics
MapReduce computing paradigm vs. traditional database systems
Many enterprises are turning to Hadoop, especially for applications that generate big data: Web applications, social networks, scientific applications
![Page 17: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/17.jpg)
APACHE HADOOP (Disk Based Computing)
Open-source software framework written in Java for distributed storage and distributed processing
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware: a large number of low-end cheap machines working in parallel to solve a computing problem, in contrast to a small number of high-end expensive machines
![Page 18: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/18.jpg)
Hadoop cluster architecture
A Hadoop cluster can be divided into two abstract entities:
a MapReduce engine and a distributed file system
![Page 19: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/19.jpg)
APACHE SPARK
Open-source cluster computing framework
• What is Spark
• Why Spark
• How to configure Spark
![Page 20: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/20.jpg)
APACHE SPARK (Memory Based Computing)
Open-source cluster computing framework, written mainly in Scala, for distributed processing
• Fast cluster computing system for large-scale data processing, compatible with Apache Hadoop
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell
Up to 100× faster than Hadoop MapReduce for some in-memory workloads
Often 2-10× less code
![Page 21: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/21.jpg)
Spark Overview
Spark Shell
• Interactive shell for learning or data exploration
• Python or Scala
• It provides a preconfigured Spark context called sc.
Spark applications
• For large scale data processing
• Python, Java, Scala and R
• Every Spark application requires a Spark context. It is the main entry point to the Spark API.
[Screenshots: Scala interactive shell, Python interactive shell]
![Page 22: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/22.jpg)
Spark Overview
Resilient distributed datasets (RDDs)
• Immutable collections of objects spread across a cluster
• Built through parallel transformations (map, filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM) for reuse
Shared variables that can be used in parallel operations
Work with distributed collections as we would with local ones
![Page 23: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/23.jpg)
Resilient Distributed Datasets (RDDs)
Two types of RDD operation
• Transformation: defines a new RDD based on the current one. Example: filter, map
• Action: returns a value. Example: count, take(n), reduce
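The split between the two kinds of operation can be imitated in a few lines of plain Python. The class below is a toy stand-in, not the Spark API: `LazyDataset` and its internals are hypothetical names invented for this sketch. It only shows the idea that transformations are recorded and nothing runs until an action asks for a value.

```python
import itertools


class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection, e.g. a list or range
        self._steps = tuple(steps)   # pending transformations, applied in order

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Chain the recorded steps into one lazy iterator over the source.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a value.
    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))


calls = []  # records every element that actually gets processed


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 11))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)   # still empty: no action has run yet
first_two = even_squares.take(2)    # action: processes only as much as needed
total = even_squares.count()        # action: processes the whole source
print(calls_before_action, first_two, total)
```

Each transformation returns a new object and leaves the original untouched, mirroring the immutability of RDDs described earlier.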
![Page 24: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/24.jpg)
Resilient Distributed Datasets (RDDs)
File: movie.txt
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
RDD: mydata
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
![Page 25: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/25.jpg)
Resilient Distributed Datasets (RDDs)
map and filter Transformation
Input:
I have never seen the horror movies.
I never hope to see one;
But I can tell you, anyhow,
I had rather see than be one.
After map:
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
BUT I CAN TELL YOU, ANYHOW,
I HAD RATHER SEE THAN BE ONE.
After filter:
I HAVE NEVER SEEN THE HORROR MOVIES.
I NEVER HOPE TO SEE ONE;
I HAD RATHER SEE THAN BE ONE.
Python:
map(lambda line: line.upper())
filter(lambda line: line.startswith('I'))
Scala:
map(line => line.toUpperCase())
filter(line => line.startsWith("I"))
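The same two steps can be tried without a cluster. This sketch uses Python's built-in `map` and `filter` on an ordinary list as a local stand-in for the RDD; the variable names are assumptions for illustration, and the lambdas are the ones shown on the slide.

```python
# The four lines of movie.txt, held in a local list instead of an RDD.
mydata = [
    "I have never seen the horror movies.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I had rather see than be one.",
]

# map: apply the function to every line, producing a new collection.
upper_lines = list(map(lambda line: line.upper(), mydata))

# filter: keep only the lines for which the predicate is true.
i_lines = list(filter(lambda line: line.startswith("I"), upper_lines))

for line in i_lines:
    print(line)
```

The original list `mydata` is unchanged after both steps, just as an RDD is never modified by a transformation.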
![Page 26: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/26.jpg)
Spark Stack
• Spark SQL: SQL and structured data processing
• Spark Streaming: stream processing of live data streams
• MLlib: machine learning algorithms
• GraphX: graph processing
![Page 27: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/27.jpg)
Why Spark?
• Core engine with SQL, streaming, machine learning and graph processing modules
• Can run today’s most advanced algorithms
• Alternative to MapReduce for certain applications
• APIs in Java, Scala and Python
• Interactive shells in Scala and Python
• Runs on YARN, Mesos and standalone
![Page 28: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/28.jpg)
Spark’s major use cases over Hadoop
• Iterative algorithms in machine learning
• Interactive data mining and data processing
• Data warehousing: Hive-compatible SQL on Spark, which is claimed to run up to 100x faster than Hive
• Stream processing: log processing and fraud detection in live streams for alerts, aggregates and analysis
• Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful as they are easy and fast to process
![Page 29: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/29.jpg)
MapReduce Example: Word Count
![Page 30: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/30.jpg)
MapReduce Example: Word Count
![Page 31: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/31.jpg)
MapReduce Example: Word Count
![Page 32: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/32.jpg)
MapReduce Example: Word Count
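Since the word-count slides are figures, here is a rough single-machine sketch of the three phases in plain Python. The function names and the sample documents are assumptions made for illustration; a real Hadoop job would distribute each phase across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: collapse the grouped values for one word into a total."""
    return word, sum(counts)


# Hypothetical input: each string plays the role of one input split.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

mapped = [pair for line in documents for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(word_counts)
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets the framework run them in parallel on different machines.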
![Page 33: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/33.jpg)
Example: PageRank
A way of analyzing websites based on their link relationships
• Good example of a more complex algorithm
  • Multiple stages of map & reduce
• Benefits from Spark’s in-memory caching
  • Multiple iterations over the same data
Basic Idea
Give pages ranks (scores) based on links to them
• Links from many pages → high rank
• Link from a high-rank page → high rank
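The basic idea can be sketched as a small iterative loop in plain Python. This is a simplified illustration, not the code from the demo: the link graph is made up, and the update rule with a damping factor of 0.85 is an assumption (a commonly used simplification), since the slides do not give a formula.

```python
def pagerank(links, iterations=50, damping=0.85):
    """Iteratively score pages from a dict of page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Each page splits its current rank evenly among the pages it links to.
        contributions = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                contributions[target] += ranks[page] / len(targets)
        # Assumed update rule: a fixed base score plus the damped contributions.
        ranks = {page: (1 - damping) + damping * contributions[page] for page in pages}
    return ranks


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Every iteration rereads the same `links` data, which is why keeping it cached in memory pays off for this kind of algorithm.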
![Page 34: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/34.jpg)
PageRank Performance
[Bar chart: iteration time (s) vs. number of machines, Hadoop vs. Spark]
• 30 machines: Hadoop 171 s, Spark 23 s
• 60 machines: Hadoop 80 s, Spark 14 s
NOTE: Lower iteration time means higher performance
![Page 35: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/35.jpg)
Other Iterative Algorithms
[Bar charts: time per iteration (s), Hadoop vs. Spark]
• Logistic Regression: Hadoop 110 s, Spark 0.96 s
• K-Means Clustering: Hadoop 155 s, Spark 4.1 s
NOTE: Lower iteration time means higher performance
![Page 36: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/36.jpg)
Spark Installation (For end-user side)
Download a Spark distribution pre-built for Hadoop 2.4 or later from https://spark.apache.org/downloads.html
![Page 37: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/37.jpg)
Spark Installation (For developer side)
Clone the Apache Spark GitHub repository: https://github.com/apache/spark
![Page 38: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/38.jpg)
Spark Installation (continued)
Build the source code using Maven with the YARN and Hadoop profiles:
<SPARK_HOME># build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
![Page 39: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/39.jpg)
How to run Spark? (Standalone mode)
Once the build is complete, open a terminal, go to the bin directory inside the Spark home directory, and invoke the Spark shell:
<SPARK_HOME>/bin# ./spark-shell
![Page 40: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/40.jpg)
To start all of Spark’s master and slave nodes:
Run the following command in a terminal from the sbin directory inside the Spark home directory.
<SPARK_HOME>/sbin# ./start-all.sh
![Page 41: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/41.jpg)
Spark Master web UI (browser view):
localhost:8080
![Page 42: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/42.jpg)
To stop all of Spark’s master and slave nodes:
Run the following command in a terminal from the sbin directory inside the Spark home directory.
<SPARK_HOME>/sbin# ./stop-all.sh
![Page 43: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/43.jpg)
Future of analytics
![Page 44: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/44.jpg)
Analytics in the Cloud
![Page 45: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/45.jpg)
https://www.youtube.com/watch?v=JfqJTQnVZvA
• IBM is making Spark available as a cloud service on its Bluemix cloud platform.
• 3,500 IBM researchers and developers are to work on Spark-related projects at more than a dozen labs worldwide.
![Page 46: Big data clustering](https://reader036.fdocuments.net/reader036/viewer/2022062901/58f1bf3b1a28ab627c8b45c3/html5/thumbnails/46.jpg)
Demo - Spark RDD in IntelliJ IDEA