Spark Driven Big Data Analytics


  • Spark Driven Big Data Analytics

    Inosh Goonewardena, Associate Technical Lead - WSO2, Inc.

  • Agenda

    What is Big Data?

    Big Data Analytics

    Introduction to Apache Spark

    Apache Spark Components & Architecture

    Writing Spark Analytic Applications

  • What is Big Data?

    Big data is a term for data sets that are very large and complex in nature

    Consists of structured, semi-structured and unstructured data

    Big Data cannot easily be managed by traditional RDBMS or statistics tools

  • Characteristics of Big Data - The 3Vs

  • Sources of Big Data

    Banking transactions

    Social Media Content

    Results of scientific experiments

    GPS trails

    Financial market data

    Mobile-phone call detail records

    Machine data captured by sensors connected to IoT devices


  • Traditional Vs Big Data

    Attribute          Traditional Data             Big Data
    Volume             Gigabytes to Terabytes       Petabytes to Zettabytes
    Organization       Centralized                  Distributed
    Structure          Structured                   Structured, Semi-structured & Unstructured
    Data Model         Strict schema based          Flat schema
    Data Relationship  Complex interrelationships   Almost flat with few relationships

  • Big Data Analytics

    The process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.

    Analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits.

  • 4 types of Analytics

    Batch

  • Challenges of Big Data Analytics

    Traditional RDBMS fail to handle Big Data

    Big Data cannot fit in the memory of a single computer

    Processing Big Data on a single computer takes a lot of time

    Scaling with traditional RDBMS is expensive

  • Traditional Large-Scale Computation

    Traditionally, computation has been processor-bound
      Relatively small amounts of data
      Significant amount of complex processing performed on that data

    For decades, the primary push was to increase the computing power of a single machine
      Faster processor, more RAM

  • Hadoop

    Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment

  • The Hadoop Distributed File System - HDFS

    Responsible for storing data on the cluster

    Data files are split into blocks and distributed across multiple nodes in the cluster

    Each block is replicated multiple times
      Default is to replicate each block three times
      Replicas are stored on different nodes
      This ensures both reliability and availability
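
    The block splitting and replication described above can be sketched in plain Python. This is only an illustration, not HDFS code: the 4-byte block size, the node names and the round-robin placement are assumptions made for the example (real HDFS uses much larger blocks and its own placement policy).

```python
BLOCK_SIZE = 4    # bytes per block; deliberately tiny for the example
REPLICATION = 3   # default replication factor mentioned in the slide


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (hypothetical round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(b"hello big data!")
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

# Simulate losing one node: every block should still have live replicas.
survivors = {
    block_id: [n for n in nodes if n != "node2"]
    for block_id, nodes in placement.items()
}
```

    Because each block lives on three different nodes, losing any single node still leaves at least two copies of every block, which is the reliability and availability property the slide refers to.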

  • MapReduce

    MapReduce is the system used to process data in the Hadoop cluster

    A method for distributing a task across multiple nodes

    Each node processes data stored on that node, where possible

    Consists of two phases:
      Map - processes the input data and creates several small chunks of data
      Reduce - processes the data that comes from the mapper and produces a new set of output

    Scalable, Flexible, Fault-tolerant & Cost effective
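
    As a rough illustration of the two phases, here is a word count written in plain Python. It runs on one machine and is not Hadoop code; the function names and the explicit `shuffle` step (grouping mapper output by key before reducing) are choices made for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into small (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group mapper output by key so each reducer sees all values for one word."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for a key into one output record."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data big analytics", "spark big data"])
```

    In a real cluster the map calls would run on the nodes that hold the input blocks, and the grouped pairs would be sent over the network to the reducers.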

  • MapReduce - Example

  • Limitations of MapReduce

    Slow due to replication, serialization, and disk IO

    Inefficient for:
      Iterative algorithms (Machine Learning, Graphs & Network Analysis)
      Interactive Data Mining (R, Excel, Adhoc Reporting, Searching)

  • Apache Spark

    Apache Spark is a cluster computing platform designed to be fast and general-purpose

    Extends the Hadoop MapReduce model to efficiently support more types of computations, including interactive queries and stream processing

    Provides in-memory cluster computing that increases the processing speed of an application

    Designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries and streaming

  • Features of Spark

    Speed
      Spark can run an application on a Hadoop cluster up to 100 times faster than Hadoop MapReduce when the data is in memory, and up to 10 times faster when running on disk.

    Supports multiple languages
      Spark provides built-in APIs in Java, Scala, and Python.

    Advanced Analytics
      Spark supports more than Map and Reduce. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.

  • Spark Stack

  • Components of Spark

    Apache Spark Core
      The underlying general execution engine for the Spark platform, on which all other functionality is built. Provides in-memory computing and the ability to reference datasets in external storage systems.

    Spark SQL
      A component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame as of Spark 1.3), which provides support for structured and semi-structured data.

    Spark Streaming
      Leverages Spark Core's fast scheduling capability to perform streaming analytics. Ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
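
    The mini-batch idea behind Spark Streaming can be sketched without Spark. In this plain-Python illustration, timestamped events are grouped into fixed-length intervals and the same computation is then applied to every batch. The `micro_batches` helper, the sample events and the one-second interval are all hypothetical.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // interval), []).append(value)
    return [batches[key] for key in sorted(batches)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "c")]
batches = micro_batches(events, interval=1.0)

# Apply the same batch computation to every mini-batch, here a simple count.
per_batch_counts = [len(batch) for batch in batches]
```

    Treating a stream as a sequence of small batches is what lets the streaming component reuse the batch engine's transformations.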

  • Components of Spark (contd.)

    MLlib
      Distributed machine learning framework on top of Spark. Provides multiple types of machine learning algorithms, including binary classification, regression, clustering and collaborative filtering, as well as supporting functionality such as model evaluation and data import.

    GraphX
      Distributed graph-processing framework on top of Spark. Provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction.

  • Why a New Programming Model?

    MapReduce simplified big data analysis.

    But users quickly wanted more:
      More complex, multi-pass analytics (e.g. ML, graph)
      More interactive ad-hoc queries
      More real-time stream processing

    All 3 need faster data sharing in parallel apps

  • Data Sharing in MapReduce

    Iterative Operations on MapReduce

    Interactive Operations on MapReduce

  • Data Sharing using Spark RDD

    Iterative Operations on Spark RDD

    Interactive Operations on Spark RDD

  • Execution Flow

  • Execution Flow (contd.)

    1. A standalone application starts and instantiates a SparkContext instance. Once the SparkContext is initiated, the application is called the driver.

    2. The driver program asks the cluster manager for resources to launch executors.

    3. The cluster manager launches executors.

    4. The driver process runs through the user application. Depending on the actions and transformations over RDDs, tasks are sent to executors.

    5. Executors run the tasks and save the results.

    6. If any worker crashes, its tasks will be sent to different executors to be processed again.
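
    Steps 4 to 6 can be mimicked with a small driver-side loop. This is a toy model in plain Python, not Spark's scheduler: the `run_job` function, the executor names and the round-robin choice of executor are assumptions made for the example.

```python
def run_job(tasks, executors, crashed):
    """Send each task to an executor; if that executor has crashed, resend it to another one."""
    results = {}
    for index, task in enumerate(tasks):
        for attempt in range(len(executors)):
            executor = executors[(index + attempt) % len(executors)]
            if executor in crashed:
                continue  # this executor is gone, try the next one
            results[index] = (executor, task())
            break
        else:
            raise RuntimeError("no live executors left")
    return results


# Four independent tasks, each squaring one number.
tasks = [lambda n=n: n * n for n in range(4)]
results = run_job(tasks, ["exec-1", "exec-2", "exec-3"], crashed={"exec-2"})
```

    Task 1 would normally go to `exec-2`; since that executor is marked as crashed, the loop hands the task to `exec-3` instead, and the job still completes.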

  • Terminology

    Application
      User program built on Spark. Consists of a driver program and executors on the cluster.

    Application Jar
      A jar containing the user's Spark application and its dependencies, except the Hadoop and Spark jars.

    Driver Program
      The process where the main method of the program runs.
      Runs the user code that creates a SparkContext, creates RDDs, and performs actions and transformations.

  • Terminology (contd.)

    SparkContext
      Represents the connection to a Spark cluster.
      Driver programs access Spark through a SparkContext object.
      Can be used to create RDDs, accumulators and broadcast variables on that cluster.

    Cluster Manager
      An external service to manage resources on the cluster (standalone manager, YARN, Apache Mesos).

  • Terminology (contd.)

    Deploy Mode
      cluster - driver inside the cluster
      client - driver outside the cluster

    Worker node
      Any node that can run application code in the cluster.

    Executor
      A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

  • Terminology (contd.)

    Task
      A unit of work that will be sent to one executor.

    Job
      A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).

    Stage
      A smaller set of tasks into which each job is divided. Stages run sequentially and depend on each other.
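
    The relationship between jobs, stages and tasks can be illustrated with a toy model in plain Python. Here a job is a list of stages, each stage runs one task per partition, and a stage starts only after the previous one has finished. The helper names are hypothetical, and the thread pool merely stands in for executors.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(task_fn, partitions):
    """One task per partition; tasks within a stage are independent and may run in parallel."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task_fn, partitions))


def run_job(stages, partitions):
    """Stages run one after another; each consumes the previous stage's output."""
    for task_fn in stages:
        partitions = run_stage(task_fn, partitions)
    return partitions


stages = [
    lambda part: [x * 2 for x in part],  # stage 1: double every element
    lambda part: sum(part),              # stage 2: sum each partition
]
result = run_job(stages, [[1, 2], [3, 4], [5]])
```

    The second stage cannot start until every task of the first stage has produced its output, which is the sequential dependency described above.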

  • Spark Pillars

    Two main abstractions of Spark:

    RDD - Resilient Distributed Dataset

    DAG - Directed Acyclic Graph

  • RDD (Resilient Distributed Dataset)

    Fundamental data structure of Spark

    Immutable distributed collection of objects

    The data is partitioned across machines in the cluster so that it can be operated on in parallel
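
    To make "immutable" and "partitioned" concrete, here is a toy stand-in for an RDD in plain Python. `MiniRDD` is a hypothetical class written for this illustration and is not part of Spark. Unlike a real RDD, it evaluates eagerly and keeps all partitions in one process.

```python
class MiniRDD:
    """Toy stand-in for an RDD: an immutable, partitioned collection."""

    def __init__(self, partitions):
        # Tuples cannot be modified in place, which models immutability.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Distribute the elements over `num_partitions` partitions."""
        data = list(data)
        return cls(data[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        """Return a NEW MiniRDD; each partition is transformed independently."""
        return MiniRDD(tuple(fn(x) for x in p) for p in self._partitions)

    def filter(self, pred):
        """Return a NEW MiniRDD holding only the elements that satisfy `pred`."""
        return MiniRDD(tuple(x for x in p if pred(x)) for p in self._partitions)

    def collect(self):
        """Gather all partitions into a single list."""
        return [x for p in self._partitions for x in p]


numbers = MiniRDD.parallelize(range(1, 7), num_partitions=3)
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
```

    Each transformation returns a new collection and leaves `numbers` untouched. Because every partition is processed on its own, the per-partition work could be handed to different machines.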