Distributed Systems

Distributed computation with Spark

Abraham Bernstein, Ph.D.

Course material based on:
- Slides by Reynold Xin, Tudor Lapusan
- Some additions by Johannes Schneider


Distributed Systems, Univ. of Zurich, Fall 2009 31/10/16

How good is Map/Reduce?

• Abstraction
  • Simple?

• Automatic distribution of (data and) tasks

• Be platform agnostic

• Performance


Map/Reduce is not so simple…

• Not easy to program directly in Map/Reduce

• Most real applications require multiple steps...
  • Iterative algorithms (e.g. PageRank): tens of steps
  • Analytics queries (e.g. count & top K): 2-5 steps

⇒ Each step needs its own map and reduce class

⇒ Boilerplate code, spaghetti-like…


Higher level frameworks

• Simpler to use than Map/Reduce

• Examples
  • HiveQL, Pig, Spark

• Built on top of Hadoop
  • Use at least some parts of Hadoop
  • Can often generate Map/Reduce jobs


Spark

• Simpler to program
  • Nicer syntax: no explicit map/reduce

• Faster execution

• How? Two key points:
  • Generalized directed acyclic graphs for computation
  • Faster data sharing
    • Don’t write intermediate results to disk

• How to achieve fault tolerance if data is in RAM? ⇒ RDD


Spark Ecosystem

• Under development (Spark 1.0 released 2014)

• This course: Spark Core Engine only


Resilient Distributed Dataset (RDD)

• Collection of (data) elements
  • Held on disk or in RAM
  • Can be distributed across different nodes

• Programmer can “persist/cache” RDDs
  • Kept in memory for faster access
  • System can evict them from RAM if it needs space

• RDDs are immutable
  • Transformations: create new RDDs from old ones


Operations on RDD

• Transformations
  • f(RDD) => RDD
  • Lazy evaluation: not computed immediately

• Actions
  • Trigger computation
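
The difference between a lazy transformation and an action can be tried out without a cluster. The sketch below is only an analogy: it uses plain `java.util.stream` as a stand-in and is not the Spark API. `filter` plays the role of a transformation and `count` the role of an action. The class name `LazyEvalSketch` and the `evaluations` counter are hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyEvalSketch {
    // Counts how often the filter predicate has actually been evaluated.
    public static final AtomicInteger evaluations = new AtomicInteger();

    // "Transformation": only describes the computation, nothing runs yet.
    public static Stream<String> errorLines(List<String> lines) {
        return lines.stream().filter(s -> {
            evaluations.incrementAndGet();
            return s.contains("Error");
        });
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "10:00 | Error SQL Syntax | Task 1 done",
                "11:02 | Worker added | Error php 12",
                "11:04 | Task 3 done");

        Stream<String> pending = errorLines(lines);
        System.out.println("Evaluations before the action: " + evaluations.get());

        // "Action": the terminal operation triggers the computation.
        long n = pending.count();
        System.out.println("Lines with Error: " + n);
        System.out.println("Evaluations after the action: " + evaluations.get());
    }
}
```

The predicate is not evaluated when the pipeline is built, only once `count()` runs. Unlike an RDD, a Java stream can be consumed only once, so the analogy covers laziness but not reuse.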


Transformations and Actions

Type T to Type U


Reminder: Java Syntax

• Assign a function to a variable
• Pass functions as parameters
• Functional interfaces

// T is the argument type, R the element type of the returned Iterable
interface FlatMapFunction<T, R> {
    Iterable<R> call(T t);
}

// Assign a function to a variable
FlatMapFunction<String, String> myFunc = new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
};

myFunc.call("This is first.");  // => Iterable over "This", "is", "first."

// Pass a function as a parameter
public void flatMapSet(FlatMapFunction<String, String> mapper) {…};
flatMapSet(myFunc);


Example: Count lines containing the word “Error” in a file

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the log file as an RDD with one element per line
        JavaRDD<String> lines = sc.textFile("YOUR_SPARK_HOME/log.txt");

        // Transformation (lazy): keep only the lines that contain "Error"
        JavaRDD<String> linesWithError = lines.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("Error"); }
        });

        // Action: triggers the computation and returns the number of lines
        long nLines = linesWithError.count();
        System.out.println("Lines with Errors: " + nLines);
    }
}

Example: log.txt
10:00 | Error SQL Syntax | Task 1 done
11:02 | Worker added | Error php 12
11:04 | Task 3 done


Wordcount in Spark

• The map function operates per data item (“line”)
  • flatMap merges the per-line results into one list

Example input (two lines): “This is first.” and “This second.”

After flatMap: This,is,first.,This,second.

After mapping each word to a pair: (This,1),(is,1),(first.,1),(This,1),(second.,1)

After reducing by key: (This,2),(is,1),(first.,1),(second.,1)
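
The three word-count steps can be traced on the example input with a small local sketch. It assumes plain Java streams as a stand-in for the RDD operations and is not Spark code: `flatMap` merges the per-line words into one stream, and the collector combines the pairing and the reduce-by-key step. `WordCountSketch` and `countWords` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    public static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // flatMap: split every line and merge the pieces into one stream of words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // pair each word with 1 and sum the 1s per word ("reduce by key");
                // a LinkedHashMap keeps the words in order of first appearance
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // prints {This=2, is=1, first.=1, second.=1}
        System.out.println(countWords(List.of("This is first.", "This second.")));
    }
}
```

Splitting on a single space keeps the punctuation attached, which is why “first.” and “second.” appear with their periods, as in the slide.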


RDD creation

• Create the initial RDD from some data
  • E.g. from HDFS: “hdfs://myFile.txt”

JavaRDD<String> lines = sc.textFile("hdfs://myFile.txt");


RDDs during computation

lines = sc.textFile(...)

linesWithError = lines.filter(new Function<String, Boolean>() {…});

linesWithError.count();


Example: Count error messages with “SQL”, “php”,…

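JavaRDD<String> linesWithError = lines.filter(new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("Error"); }
});

// Split each line into its "|"-separated fields. flatMap (rather than map) yields
// one flat RDD of fields, so the filters below operate on Strings.
// split() takes a regex, so the pipe must be escaped.
// Note: from Spark 2.x on, FlatMapFunction.call returns an Iterator instead of an Iterable.
JavaRDD<String> messages = linesWithError.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split("\\|")); }
});
messages.cache();  // reused by both filters below

JavaRDD<String> msgsSQL = messages.filter(…s.contains("SQL")…);
long nSQLMsgs = msgsSQL.count();
JavaRDD<String> msgsPHP = messages.filter(…s.contains("php")…);
long nPHPMsgs = msgsPHP.count();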


Example: Count error messages with “SQL”, “php”,…

lines = sc.textFile("hdfs://...")

RDD 1 (lines):
10:00 | Error SQL Syntax | Task 1 done
11:02 | Worker added | Error php 12
11:04 | Task 3 done

RDD 2 (linesWithError):
10:00 | Error SQL Syntax | Task 1 done
11:02 | Worker added | Error php 12

RDD 3 (messages):
10:00
Error SQL Syntax
Task 1 done
11:02
Worker added
Error php 12

RDD 4 (msgsSQL):
Error SQL Syntax

RDD 5 (msgsPHP):
Error php 12
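
The contents of RDD 2 to RDD 5 for this log can be reproduced locally. The following is a minimal sketch that assumes plain Java collections as a stand-in for RDDs (it is not Spark code); `ErrorMessagesSketch`, `errorFields` and `countContaining` are hypothetical names. Materializing the intermediate list once plays the role of `cache()`, since both counts reuse it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ErrorMessagesSketch {
    // Mirrors RDD 2 and RDD 3: keep lines with "Error", then split them into fields.
    public static List<String> errorFields(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("Error"))
                // String.split takes a regex, so the pipe has to be escaped
                .flatMap(line -> Arrays.stream(line.split("\\|")))
                // drop the spaces that surround each field
                .map(String::trim)
                .collect(Collectors.toList());
    }

    // Mirrors RDD 4 and RDD 5: count the fields that contain the keyword.
    public static long countContaining(List<String> fields, String keyword) {
        return fields.stream().filter(f -> f.contains(keyword)).count();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "10:00 | Error SQL Syntax | Task 1 done",
                "11:02 | Worker added | Error php 12",
                "11:04 | Task 3 done");
        List<String> messages = errorFields(lines);
        System.out.println("SQL messages: " + countContaining(messages, "SQL"));
        System.out.println("php messages: " + countContaining(messages, "php"));
    }
}
```

For the three log lines above, `errorFields` yields six fields and each keyword matches exactly one of them.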


Example: Directed Acyclic Graph

• Dependencies among RDDs

RDD 1 (all log lines) → RDD 2 (lines containing “Error”) → RDD 3 (fields split on “|”)

RDD 3 → RDD 4 (fields containing “SQL”)

RDD 3 → RDD 5 (fields containing “php”)


Directed Acyclic Graph: Map/Reduce vs Spark

• Dependencies
  • Map/Reduce: among map/reduce results
  • Spark: among RDDs


RDD Recreation

• Automatically recompute (parts of) an RDD if lost
  • Due to eviction of the RDD by the system (to free RAM)
  • Due to a fault, e.g. crash of a machine

• Track the transformations and the (parts of) RDDs used in each transformation
  • Start from the last RDD stored on disk (checkpoint)
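
The recomputation idea can be sketched as a toy model. This is an illustrative assumption, not how Spark implements lineage: each dataset remembers its parent and the function that produced it, so lost data can be rebuilt by walking back to the nearest dataset that is still available (here, a checkpoint that is never evicted). All names (`LineageSketch`, `Dataset`, `derive`, `evict`, `computations`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class LineageSketch {
    /** A toy dataset that remembers its parent and the transformation that produced it. */
    public static class Dataset {
        private final Dataset parent;                           // null for a checkpoint
        private final Function<List<String>, List<String>> op;  // null for a checkpoint
        private List<String> data;                              // null while not in memory
        public int computations = 0;                            // how often data was (re)built

        // Checkpoint: data that is always available (stands in for data on disk).
        public Dataset(List<String> checkpointData) {
            this.parent = null;
            this.op = null;
            this.data = checkpointData;
        }

        private Dataset(Dataset parent, Function<List<String>, List<String>> op) {
            this.parent = parent;
            this.op = op;
        }

        // Transformation: creates a new dataset, the old one is left unchanged.
        public Dataset derive(Function<List<String>, List<String>> f) {
            return new Dataset(this, f);
        }

        // Simulates the system dropping in-memory data, or a machine crash.
        public void evict() {
            if (parent != null) data = null;  // checkpoints are never dropped
        }

        // Returns the data, rebuilding it from the lineage if it is missing.
        public List<String> get() {
            if (data == null) {
                data = op.apply(parent.get());
                computations++;
            }
            return data;
        }
    }

    public static List<String> keepErrors(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String s : in) {
            if (s.contains("Error")) out.add(s);
        }
        return out;
    }

    public static void main(String[] args) {
        Dataset lines = new Dataset(List.of("Error a", "ok", "Error b"));
        Dataset errors = lines.derive(LineageSketch::keepErrors);
        System.out.println(errors.get());  // first computation
        errors.evict();                    // data is lost
        System.out.println(errors.get());  // rebuilt from the checkpoint
        System.out.println("Computations: " + errors.computations);
    }
}
```

The sketch recomputes a whole dataset at once. The slide's point that only the lost parts of an RDD need to be recomputed is not modeled here.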