
  • Introduction to Apache Spark

    Thomas Ropars

    thomas.ropars@univ-grenoble-alpes.fr

    http://tropars.github.io/

    2018

    1

  • References

    The content of this lecture is inspired by:

    • The lecture notes of Yann Vernaz.
    • The lecture notes of Vincent Leroy.
    • The lecture notes of Renaud Lachaize.
    • The lecture notes of Henggang Cui.

    2

  • Goals of the lecture

    • Present the main challenges associated with distributed computing

    • Review the MapReduce programming model for distributed computing
      – Discuss the limitations of Hadoop MapReduce

    • Learn about Apache Spark and its internals

    • Start programming with PySpark

    3

  • Agenda

    Computing at large scale

    Programming distributed systems

    MapReduce

    Introduction to Apache Spark

    Spark internals

    Programming with PySpark

    4

  • Agenda

    Computing at large scale

    Programming distributed systems

    MapReduce

    Introduction to Apache Spark

    Spark internals

    Programming with PySpark

    5

  • Distributed computing: Definition

    A distributed computing system is a system made up of several computational entities where:

    • Each entity has its own local memory

    • All entities communicate by message passing over a network

    Each entity of the system is called a node.

    6

  • Distributed computing: Motivation

    There are several reasons why one may want to distribute data and processing:

    • Scalability
      – The data do not fit in the memory/storage of one node
      – The processing power of more processors can reduce the time to solution

    • Fault tolerance / availability
      – Continue delivering a service despite node crashes

    • Latency
      – Put computing resources close to the users to decrease latency

    7

  • Increasing the processing power

    Goals

    • Increasing the amount of data that can be processed (weak scaling)

    • Decreasing the time needed to process a given amount of data (strong scaling)

    Two solutions

    • Scaling up
    • Scaling out

    8
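
    To make weak and strong scaling concrete, here is a toy computation with hypothetical timings (all numbers are invented purely for illustration):

    # Strong scaling: fixed problem size, more nodes -> shorter time to solution.
    t_1_node, t_8_nodes = 240.0, 40.0        # hypothetical seconds for the same input
    strong_speedup = t_1_node / t_8_nodes    # 6.0 (the ideal would be 8.0)

    # Weak scaling: grow the input with the number of nodes -> constant time.
    gb_1_node, gb_8_nodes = 10, 80           # hypothetical GB processed in 240 seconds
    weak_efficiency = (gb_8_nodes / gb_1_node) / 8   # 1.0 means perfect weak scaling

    print(strong_speedup, weak_efficiency)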

  • Vertical scaling (scaling up)

    Idea: increase the processing power by adding resources to existing nodes:

    • Upgrade the processor (more cores, higher frequency)
    • Increase memory capacity
    • Increase storage capacity

    Pros and Cons

    + Performance improvement without modifying the application
    − Limited scalability (bounded by the capabilities of the hardware)
    − Expensive (non-linear costs)

    9

  • Horizontal scaling (scaling out)

    Idea: increase the processing power by adding more nodes to the system

    • Cluster of commodity servers

    Pros and Cons

    − Often requires modifying applications
    + Less expensive (nodes can be turned off when not needed)
    + Near-unlimited scalability

    Main focus of this lecture

    10

  • Large scale infrastructures

    Figure: Google data center

    Figure: Amazon data center

    Figure: Barcelona Supercomputing Center

    11

  • Programming for large-scale infrastructures

    Challenges

    • Performance
      – How to take full advantage of the available resources?
      – Moving data is costly: how to maximize the ratio between computation and communication?

    • Scalability
      – How to take advantage of a large number of distributed resources?

    • Fault tolerance
      – The more resources, the higher the probability of failure
      – MTBF (Mean Time Between Failures)
        • MTBF of one server = 3 years
        • MTBF of 1000 servers ≈ 26 hours (beware: over-simplified computation, checked in the snippet below)

    12
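
    A back-of-the-envelope check of the over-simplified MTBF computation above, assuming server failures are independent so that the cluster MTBF is simply the per-server MTBF divided by the number of servers:

    # With N servers failing independently, some server fails N times as often,
    # so MTBF_cluster = MTBF_server / N (a deliberately over-simplified model).
    mtbf_server_hours = 3 * 365 * 24         # 3 years expressed in hours
    n_servers = 1000
    mtbf_cluster_hours = mtbf_server_hours / n_servers
    print(f"{mtbf_cluster_hours:.1f} hours") # ~26.3 hours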

  • Programming in the Clouds

    Cloud computing

    • A service provider gives access to computing resources through an internet connection.

    Pros and Cons

    + Pay only for the resources you use
    + Get access to a large amount of resources
      – Amazon Web Services features millions of servers

    − Volatility
      – Low control over the resources
      – Example: access to resources based on bidding
      – See "The Netflix Simian Army"

    − Performance variability
      – Physical resources shared with other users

    13

  • Architecture of a data center (simplified)

    Figure: simplified data center architecture (racks of nodes connected through switches; each node combines a processor, memory, and storage)

    14

  • Architecture of a data center

    A shared-nothing architecture

    • Horizontal scaling
    • No specific hardware

    A hierarchical infrastructure

    • Resources clustered in racks
    • Communication inside a rack is more efficient than between racks

    • Resources can even be geographically distributed over several datacenters

    15

  • A warning about distributed computing

    You can have a second computer once you’ve shown you know how to use the first one. (P. Barham)

    Horizontal scaling is very popular.

    • But it is not always the most efficient solution (in both time and cost)

    Examples

    • Processing a few tens of GB of data is often more efficient on a single machine than on a cluster of machines

    • Sometimes a single-threaded program outperforms a cluster of machines (F. McSherry et al., "Scalability! But at what COST?", 2015)

    16

  • Agenda

    Computing at large scale

    Programming distributed systems

    MapReduce

    Introduction to Apache Spark

    Spark internals

    Programming with PySpark

    17

  • Summary of the challenges

    Context of execution

    • Large number of resources
    • Resources can crash (or disappear)
      – Failure is the norm rather than the exception

    • Resources can be slow

    Objectives

    • Run until completion
      – And obtain a correct result :-)

    • Run fast

    18

  • Shared memory and message passing

    Two paradigms for communicating between computing entities:

    • Shared memory
    • Message passing

    19

  • Shared memory

    • Entities share a global memory
    • Communication by reading and writing to the globally shared memory
    • Examples: Pthreads, OpenMP, etc. (a sketch follows below)

    20
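
    A minimal shared-memory sketch in Python: threads within one process share the same address space, which mirrors the Pthreads/OpenMP model cited above (the counter and thread count are arbitrary):

    import threading

    counter = 0                  # shared state, visible to all threads
    lock = threading.Lock()      # synchronizes access to the shared memory

    def worker(increments):
        global counter
        for _ in range(increments):
            with lock:           # communication happens through reads and
                counter += 1     # writes to the same memory location

    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # 400000: every thread observed and updated the same variable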

  • Message passing

    • Entities have their own private memory
    • Communication by sending/receiving messages over a network
    • Example: MPI (a sketch follows below)

    21
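
    A minimal message-passing sketch using Python's standard multiprocessing module as a stand-in for MPI: each process has its own private memory, and the only way to communicate is to exchange explicit messages over a channel:

    from multiprocessing import Pipe, Process

    def worker(conn):
        value = conn.recv()      # blocking receive: wait for a message
        conn.send(value * 2)     # reply with a new message
        conn.close()

    if __name__ == "__main__":
        parent_end, child_end = Pipe()                # bidirectional channel
        p = Process(target=worker, args=(child_end,))
        p.start()
        parent_end.send(21)      # no shared variables: data moves in messages
        print(parent_end.recv()) # prints 42
        p.join()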

  • Dealing with failures: Checkpointing

    Checkpointing

    Figure: timeline of an application taking periodic checkpoints (ckpt 1, ckpt 2, ckpt 3, ckpt 4)

    • Saving the complete state of the application periodically
    • Restart from the most recent checkpoint in the event of a failure (a sketch follows below)

    22
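
    A minimal checkpoint/restart sketch in Python, assuming the complete application state fits in a single picklable object (the file name, checkpoint interval, and loop body are all hypothetical):

    import os
    import pickle

    CKPT_FILE = "app.ckpt"   # hypothetical checkpoint location on stable storage

    def save_checkpoint(state):
        # Save the complete application state.
        with open(CKPT_FILE, "wb") as f:
            pickle.dump(state, f)

    def load_checkpoint():
        # On (re)start, resume from the most recent checkpoint if one exists.
        if os.path.exists(CKPT_FILE):
            with open(CKPT_FILE, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "total": 0}               # initial state

    state = load_checkpoint()
    for step in range(state["step"], 10):            # skip already-completed work
        state["total"] += step                       # one unit of computation
        state["step"] = step + 1
        if state["step"] % 3 == 0:                   # checkpoint periodically
            save_checkpoint(state)
    print(state["total"])                            # 45 on a failure-free run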

  • About checkpointing

    Main solution when processes can apply fine-grained modifications to the data (Pthreads or MPI)

    • A process can modify any single byte independently
    • Impossible to log all modifications

    Limits

    • Performance cost
    • Difficult to implement
    • The alternatives (passive or active replication) are even more costly