Spark Driven Big Data Analytics
Inosh Goonewardena, Associate Technical Lead - WSO2, Inc.
Agenda
What is Big Data?
Big Data Analytics
Introduction to Apache Spark
Apache Spark Components & Architecture
Writing Spark Analytic Applications
What is Big Data?
Big data is a term for data sets that are extremely large and complex in nature
Comprises structured, semi-structured and unstructured data
Big Data cannot easily be managed by traditional RDBMS or statistics tools
Characteristics of Big Data - The 3Vs: Volume, Velocity and Variety
Sources of Big Data
Banking transactions
Social Media Content
Results of scientific experiments
Financial market data
Mobile-phone call detail records
Machine data captured by sensors connected to IoT devices
Traditional vs. Big Data

Attribute         | Traditional Data            | Big Data
------------------|-----------------------------|--------------------------------------------
Volume            | Gigabytes to Terabytes      | Petabytes to Zettabytes
Organization      | Centralized                 | Distributed
Structure         | Structured                  | Structured, Semi-structured & Unstructured
Data Model        | Strict schema based         | Flat schema
Data Relationship | Complex interrelationships  | Almost flat with few relationships
Big Data Analytics
Process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information
Analytical findings can lead to better, more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits
4 Types of Analytics
Batch, Real-time, Interactive and Predictive
Challenges of Big Data Analytics
Traditional RDBMS fail to handle Big Data
Big Data cannot fit in the memory of a single computer
Processing of Big Data in a single computer will take a lot of time
Scaling with traditional RDBMS is expensive
Traditional Large-Scale Computation
Traditionally, computation has been processor-bound
Relatively small amounts of data
Significant amount of complex processing performed on that data
For decades, the primary push was to increase the computing power of a single machine
Faster processors, more RAM
Hadoop
Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment
The Hadoop Distributed File System - HDFS
Responsible for storing data on the cluster
Data files are split into blocks and distributed across multiple nodes in the cluster
Each block is replicated multiple times
Default is to replicate each block three times
Replicas are stored on different nodes
This ensures both reliability and availability
MapReduce
MapReduce is the system used to process data in the Hadoop cluster
A method for distributing a task across multiple nodes
Each node processes data stored on that node - Where possible
Consists of two phases:
Map - processes the input data and creates several small chunks of data
Reduce - processes the data that comes from the mapper and produces a new set of output
Scalable, Flexible, Fault-tolerant & Cost effective
MapReduce - Example
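As a conceptual sketch of the two phases, here is the canonical word count example written with plain Scala collections. This illustrates the map and reduce steps only; a real Hadoop job would implement Mapper and Reducer classes against the Hadoop Java API.

```scala
// Conceptual word count, the canonical MapReduce example.
// Plain Scala collections stand in for the distributed framework.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the lazy dog")

    // Map phase: emit a (word, 1) pair for every word in the input.
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle: the framework groups pairs by key before reducing.
    val grouped = mapped.groupBy(_._1)

    // Reduce phase: sum the counts for each word.
    val counts = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    counts.foreach(println) // (the,2), (quick,1), (brown,1), ...
  }
}
```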
Limitations of MapReduce
Slow due to replication, serialization, and disk IO
Inefficient for:
Iterative algorithms (Machine Learning, Graphs & Network Analysis)
Interactive Data Mining (R, Excel, Ad-hoc Reporting, Searching)
Apache Spark
Apache Spark is a cluster computing platform designed to be fast and general-purpose
Extends the Hadoop MapReduce model to efficiently support more types of computations, including interactive queries and stream processing
Provides in-memory cluster computing that increases the processing speed of an application
Designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries and streaming
Features of Spark
Speed - Spark enables applications to run in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk
Supports multiple languages - Spark provides built-in APIs in Java, Scala and Python
Advanced Analytics - Spark supports more than just map and reduce: it also supports SQL queries, streaming data, machine learning (ML) and graph algorithms
Components of Spark
Apache Spark Core - The underlying general execution engine for the Spark platform that all other functionality is built upon. Provides in-memory computing and referencing of datasets in external storage systems.
Spark SQL - Component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming - Leverages Spark Core's fast scheduling capability to perform streaming analytics. Ingests data in mini-batches and performs RDD transformations on those mini-batches of data (see the sketch below).
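As an illustrative sketch of the mini-batch model, the following example counts words arriving on a network socket in 5-second batches; the host and port are placeholders, not part of the original slides.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of Spark Streaming's mini-batch model: incoming text is cut
// into 5-second batches and each batch is processed with RDD-style
// transformations. The host and port below are placeholders.
object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing.
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second mini-batches

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // prints each mini-batch's word counts

    ssc.start()
    ssc.awaitTermination()
  }
}
```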
Components of Spark (contd.)
MLlib - Distributed machine learning framework above Spark. Provides multiple types of machine learning algorithms, including binary classification, regression, clustering and collaborative filtering, as well as supporting functionality such as model evaluation and data import (a brief sketch follows after this list).
GraphX - Distributed graph-processing framework on top of Spark. Provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API.
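As a sketch of the MLlib clustering support mentioned above, here is a minimal k-means example against the RDD-based MLlib API of the Spark 1.x era; the toy dataset is invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Minimal k-means clustering with the RDD-based MLlib API (Spark 1.x
// era). The four 2-D points below form two obvious clusters.
object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("MLlibSketch").setMaster("local[*]"))

    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // Train a model with k = 2 clusters and up to 20 iterations.
    val model = KMeans.train(data, k = 2, maxIterations = 20)
    model.clusterCenters.foreach(println) // roughly (0.05,0.05) and (9.05,9.05)

    sc.stop()
  }
}
```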
Why a New Programming Model?
MapReduce simplified big data analysis
But users quickly wanted more:
More complex, multi-pass analytics (e.g. ML, graph)
More interactive ad-hoc queries
More real-time stream processing
All 3 need faster data sharing in parallel apps
Data Sharing in MapReduce
Iterative Operations on MapReduce
Interactive Operations on MapReduce
Data Sharing using Spark RDD
Iterative Operations on Spark RDD
Interactive Operations on Spark RDD
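The difference shows up directly in code: calling cache() on an RDD keeps it in memory after the first pass, so later iterations read from RAM instead of re-reading from disk as MapReduce would. A minimal sketch, assuming a hypothetical numeric input file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of in-memory data sharing across iterations. The input file
// "numbers.txt" (one number per line) is a placeholder.
object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

    // cache() keeps the parsed numbers in memory after the first action.
    val points = sc.textFile("numbers.txt").map(_.toDouble).cache()

    // Ten passes over the same data; only the first one touches disk,
    // the rest reuse the cached in-memory partitions.
    for (i <- 1 to 10) {
      val above = points.filter(_ > i).count()
      println(s"Values greater than $i: $above")
    }

    sc.stop()
  }
}
```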
Execution Flow
1. A standalone application starts and instantiates a SparkContext instance. Once the SparkContext is initiated, the application is called the driver.
2. The driver program asks the cluster manager for resources to launch executors.
3. The cluster manager launches executors.
4. The driver process runs through the user application. Depending on the actions and transformations over RDDs, tasks are sent to executors.
5. Executors run the tasks and save the results.
6. If any worker crashes, its tasks will be sent to different executors to be completed.
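A minimal standalone application, annotated with the steps above; the master URL and HDFS paths are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal standalone application annotated with the execution-flow
// steps above. Master URL and HDFS paths are placeholders.
object ExecutionFlowSketch {
  def main(args: Array[String]): Unit = {
    // Step 1: instantiating a SparkContext makes this process the driver.
    val conf = new SparkConf()
      .setAppName("ExecutionFlowSketch")
      .setMaster("spark://master:7077") // Steps 2-3: executors are requested from and launched by the cluster manager
    val sc = new SparkContext(conf)

    // Step 4: transformations only describe the computation; no tasks run yet.
    val counts = sc.textFile("hdfs:///input/data.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Step 5: the action triggers tasks, which executors run and save.
    counts.saveAsTextFile("hdfs:///output/counts")

    sc.stop()
  }
}
```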
Terminology
Application - User program built on Spark. Consists of a driver program and executors on the cluster.
Application Jar - A jar containing the user's Spark application and its dependencies, except the Hadoop & Spark jars
Driver Program - The process where the main method of the program runs. Runs the user code that creates a SparkContext, creates RDDs, and performs actions and transformations.
Terminology (contd.)
SparkContext - Represents the connection to a Spark cluster. Driver programs access Spark through a SparkContext object. Can be used to create RDDs, accumulators and broadcast variables on that cluster (a brief sketch follows below).
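A short sketch of those three capabilities, using the Spark 1.x accumulator API; the lookup table and counts are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of what a SparkContext can create: an RDD, an accumulator
// (Spark 1.x API) and a broadcast variable. The data is made up.
object SparkContextSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SparkContextSketch").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 100)                      // an RDD
    val missing = sc.accumulator(0)                         // an accumulator
    val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))  // a broadcast variable

    // Executors read the broadcast value and add to the accumulator;
    // the driver reads the accumulated result after the action.
    rdd.foreach { n =>
      if (!lookup.value.contains(n)) missing += 1
    }
    println(s"Values missing from the lookup table: ${missing.value}")

    sc.stop()
  }
}
```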
Cluster Manager - An external service to manage resources on the cluster (standalone manager, YARN, Apache Mesos)
Terminology (contd.)
Deploy Mode
cluster - the driver runs inside the cluster
client - the driver runs outside the cluster
Worker Node - Any node that can run application code in the cluster
Executor - A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Terminology (contd.)
Task - A unit of work that will be sent to one executor
Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
Stage - A smaller set of tasks that each job is divided into. Stages are sequential and depend on each other (see the sketch below).
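These terms map onto code as follows (a hypothetical example): the collect() action spawns one job; the shuffle introduced by reduceByKey splits that job into two stages; each stage runs one task per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example relating task, job and stage to code.
object JobStageTaskSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("JobStageTaskSketch").setMaster("local[*]"))

    val counts = sc.parallelize(Seq("a", "b", "a"), numSlices = 2) // 2 partitions => 2 tasks per stage
      .map((_, 1))        // stage 1: map-side work
      .reduceByKey(_ + _) // shuffle boundary => stage 2

    // The action: spawns one job made of the two stages above.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```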
Spark Pillars
Two main abstractions of Spark:
RDD - Resilient Distributed Dataset
DAG - Directed Acyclic Graph
RDD (Resilient Distributed Dataset)
Fundamental data structure of Spark
Immutable distributed collection of objects
The data is partitioned across machines in the cluster so that it can be operated on in parallel (a brief sketch follows below)
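A brief sketch of the two usual ways an RDD is created, and of immutability; the HDFS path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of RDD creation and immutability. The HDFS path is a placeholder.
object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("RddSketch").setMaster("local[*]"))

    // 1. Parallelize an existing collection in the driver program.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Load an external dataset (HDFS, S3, local files, ...).
    // val logLines = sc.textFile("hdfs:///logs/app.log")

    // Transformations return new RDDs; `numbers` itself never changes.
    val doubled = numbers.map(_ * 2)
    println(doubled.reduce(_ + _)) // 30

    sc.stop()
  }
}
```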