
Distributed Systems

Distributed computation with Spark

Abraham Bernstein, Ph.D.

Course material:
- Based on slides by Reynold Xin, Tudor Lapusan
- Some additions by Johannes Schneider

Distributed Systems, Univ. of Zurich, Fall 2009 31/10/16

How good is Map/Reduce?

• Abstraction
  • Simple?

• Automatic distribution of (data and) tasks

• Be platform agnostic

• Performance



Map/Reduce is not so simple…

• Not easy to program directly in Map/Reduce

• Most real applications require multiple steps...
  • Iterative algorithms (e.g. PageRank): tens of steps

  • Analytics queries (e.g. count & top K): 2-5 steps

=> Each step needs one map class and one reduce class

=> Boilerplate code, spaghetti-like…


Higher level frameworks

• Simpler to use than Map/Reduce

• Examples
  • HiveQL, Pig, Spark

• Built on top of Hadoop
  • Use at least some parts of Hadoop

  • Can often generate Map/Reduce jobs


Spark

• Simpler to program
  • Nicer syntax: no explicit map/reduce

• Faster execution

• How? Two key points:
  • Generalized directed acyclic graphs for computation

  • Faster data sharing
    • Don't write intermediate results to disk

• How to achieve fault tolerance if data is in RAM? => RDD


Spark Ecosystem

• Under development (Spark released 2014)

• This course: Spark Core Engine only


Resilient Distributed Dataset (RDD)

• Collection of (data) elements
  • Held on disk or in RAM

  • Can be distributed across different nodes

• Programmer can "persist/cache" RDDs
  • Kept in memory for faster access

  • System can remove (delete) them from RAM if it needs space

• RDDs are immutable
  • Transformations: create new RDDs from old ones


Operations on RDD

• Transformations
  • f(RDD) => RDD

  • Lazy evaluation: not computed immediately

• Actions
  • Trigger computation
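The split between lazy transformations and computation-triggering actions can be tried out without a cluster. The following is a minimal sketch that uses plain Java streams as an analogy only (it is not the Spark API): `filter` plays the role of a transformation and `count` the role of an action. The class name `LazySketch` and its members are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class LazySketch {
    // Records every element the filter predicate actually inspects.
    static final List<String> inspected = new ArrayList<>();

    // "Transformation": only describes the computation, nothing runs yet.
    static Stream<String> errorLines(List<String> lines) {
        return lines.stream().filter(s -> {
            inspected.add(s);
            return s.contains("Error");
        });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Error a", "ok", "Error b");
        Stream<String> pending = errorLines(lines);
        // Nothing has been inspected so far: the filter is lazy.
        System.out.println("inspected before action: " + inspected.size());
        // "Action": count() forces the filter to run over all elements.
        long n = pending.count();
        System.out.println("errors: " + n + ", inspected after action: " + inspected.size());
    }
}
```

Running `main` should report zero inspected elements before the call to `count()` and three afterwards.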


Transformations and Actions

Type T to Type U


Reminder: Java Syntax

• Assign a function to a variable
• Pass functions as parameters
• Functional interfaces

interface FlatMapFunction<T, R> {   // T: argument type, R: element type of the return value
    Iterable<R> call(T t);
}

FlatMapFunction<String, String> myFunc = new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
    }
};

myFunc.call("This is first."); => Iterable => "This", "is", "first."

public void flatMapSet(FlatMapFunction<String, String> mapper) {…};
flatMapSet(myFunc);
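For comparison, here is a small self-contained sketch of the same idea: a function stored in a variable and passed as a parameter, written once as an anonymous class and once as a Java 8 lambda. The interface `FlatMapFn` is a hypothetical stand-in modelled on the slide, not Spark's own class, and `countWords` is likewise an illustrative helper.

```java
import java.util.Arrays;

public class FunctionalSketch {
    // Hypothetical interface modelled on the slide; not Spark's own class.
    @FunctionalInterface
    interface FlatMapFn<T, R> {
        Iterable<R> call(T t);
    }

    // Function assigned to a variable, as an anonymous inner class.
    static final FlatMapFn<String, String> ANON = new FlatMapFn<String, String>() {
        @Override
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
        }
    };

    // Same behaviour written as a Java 8 lambda.
    static final FlatMapFn<String, String> LAMBDA = s -> Arrays.asList(s.split(" "));

    // A method that receives a function as a parameter and applies it.
    static int countWords(FlatMapFn<String, String> mapper, String line) {
        int n = 0;
        for (String word : mapper.call(line)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(ANON.call("This is first."));           // [This, is, first.]
        System.out.println(countWords(LAMBDA, "This is first."));  // 3
    }
}
```

Any interface with a single abstract method can be the target of a lambda, which is why both variants can be passed to `countWords` interchangeably.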


Example: Count lines with the word "Error" in a file

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the log file as an RDD of lines
        JavaRDD<String> lines = sc.textFile("YOUR_SPARK_HOME/log.txt");

        // Transformation (lazy): keep only the lines containing "Error"
        JavaRDD<String> linesWithError = lines.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("Error"); }
        });

        // Action: triggers the computation
        long nLines = linesWithError.count();
        System.out.println("Lines with Errors: " + nLines);
    }
}

Example: log.txt
10:00 | Error SQL Syntax | Task 1 done
11:02 | Worker added | Error php 12
11:04 | Task 3 done


Wordcount in Spark

• mapFunction operates per data item/"line"
  • flatMap unifies the per-line results into one list

Example:
This is first.
This second.

This,is,first.,This,second.

(This,1),(is,1),(first.,1),(This,1),(second.,1)

(This,2),(is,1),(first.,1), (second.,1)
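The three stages above (split into words, pair each word with 1, sum per word) can be reproduced locally. This is a sketch in plain Java streams, used here only as a stand-in for the Spark operations; `WordCountSketch` and `wordCount` are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
            // flatMap: each line yields several words, merged into one stream
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // pair each word with 1 and sum the 1s per word;
            // LinkedHashMap keeps the order of first appearance
            .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // prints {This=2, is=1, first.=1, second.=1}
        System.out.println(wordCount(Arrays.asList("This is first.", "This second.")));
    }
}
```

Note that punctuation stays attached to the words ("first.", "second."), exactly as in the slide's example, because the split is on spaces only.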




RDD creation

• Create the initial RDD from some data
  • E.g. from HDFS: "hdfs://myFile.txt"

lines = sc.textFile("hdfs://myFile.txt")


RDDs during computation

lines = sc.textFile(...)

linesWithError = lines.filter(new Function<String, Boolean>() {…});

linesWithError.count();


Example: Count error messages with "SQL", "php", …

JavaRDD<String> linesWithError = lines.filter(new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("Error"); }
});

// Split each error line into its "|"-separated fields.
// split() takes a regular expression, so the "|" must be escaped.
JavaRDD<List<String>> messages = linesWithError.map(new Function<String, List<String>>() {
    public List<String> call(String s) { return Arrays.asList(s.split("\\|")); }
});
messages.cache();  // reused by both counts below

JavaRDD<List<String>> msgsSQL = messages.filter(…s.contains("SQL")…);
long nSQLMsgs = msgsSQL.count();
JavaRDD<List<String>> msgsPHP = messages.filter(…s.contains("php")…);
long nPHPMsgs = msgsPHP.count();
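Two details of this example are easy to get wrong: `String.split` interprets its argument as a regular expression, so a literal "|" has to be escaped, and the parsed messages are computed once and then reused for both counts. The sketch below replays the pipeline on the sample log in plain Java (no Spark); the three log lines follow one reading of the sample file, and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MessageCountSketch {
    // Split a log line into trimmed fields; "\\|" escapes the regex alternation operator.
    static List<String> fields(String line) {
        return Arrays.stream(line.split("\\|"))
            .map(String::trim)
            .collect(Collectors.toList());
    }

    // Parse the error lines once; the result is reused by every count
    // (the role cache() plays for the RDD in the slide).
    static List<List<String>> errorMessages(List<String> lines) {
        return lines.stream()
            .filter(s -> s.contains("Error"))
            .map(MessageCountSketch::fields)
            .collect(Collectors.toList());
    }

    // Count messages in which at least one field contains the keyword.
    static long countWith(List<List<String>> messages, String keyword) {
        return messages.stream()
            .filter(m -> m.stream().anyMatch(f -> f.contains(keyword)))
            .count();
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "10:00 | Error SQL Syntax | Task 1 done",
            "11:02 | Worker added | Error php 12",
            "11:04 | Task 3 done");
        List<List<String>> messages = errorMessages(log);
        System.out.println("SQL: " + countWith(messages, "SQL"));
        System.out.println("php: " + countWith(messages, "php"));
    }
}
```

With an unescaped `split("|")` the pattern matches the empty string, so the line would be cut between individual characters rather than at the separators.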


Directed acyclic graph: Map/Reduce vs Spark

• Dependencies
  • of map/reduce results …. RDDs


RDD Recreation

• Automatically recompute (parts of) an RDD if lost
  • Due to deletion/removal of the RDD by the system (to free RAM)

  • Due to a fault, e.g. crash of a machine

• Track transformations and the (parts of) RDDs used in each transformation
  • Start from the last RDD stored on disk (checkpoint)
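The recovery idea can be illustrated with a toy model: each dataset remembers the recipe that produced it from its parent (its lineage), so a lost in-memory copy can be rebuilt on demand. This is a deliberately simplified single-machine sketch, not how Spark is implemented; `ToyDataset`, `lose` and the other names are hypothetical, and unlike Spark the toy keeps every computed result in memory until it is explicitly dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

public class LineageSketch {
    // Toy dataset: stores how to compute itself plus an optional in-memory copy.
    static class ToyDataset<T> {
        private final Supplier<List<T>> recipe;  // lineage: rebuilds the data from the parent
        private List<T> cached;                  // null when not yet computed or lost
        int computations = 0;                    // how often the recipe was actually run

        ToyDataset(Supplier<List<T>> recipe) {
            this.recipe = recipe;
        }

        // Transformation: creates a new dataset and leaves this one untouched.
        <U> ToyDataset<U> map(Function<T, U> f) {
            return new ToyDataset<>(() -> {
                List<U> out = new ArrayList<>();
                for (T t : collect()) {   // collect() of the parent dataset
                    out.add(f.apply(t));
                }
                return out;
            });
        }

        // Action: returns the data, recomputing from the lineage if the copy is gone.
        List<T> collect() {
            if (cached == null) {
                cached = recipe.get();
                computations++;
            }
            return cached;
        }

        // Simulates eviction or a machine crash: the data is lost, the lineage is not.
        void lose() {
            cached = null;
        }
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        source.add(1);
        source.add(2);
        source.add(3);
        ToyDataset<Integer> base = new ToyDataset<>(() -> source);
        ToyDataset<Integer> doubled = base.map(x -> x * 2);
        System.out.println(doubled.collect());   // [2, 4, 6]
        doubled.lose();                           // in-memory copy is gone
        System.out.println(doubled.collect() + " after " + doubled.computations + " computations");
    }
}
```

In the toy, `base` still holds its data when `doubled` is lost, so only the last step is repeated; if `base` were lost as well, the recomputation would walk further back, which is the reason for starting from the most recent checkpoint on disk.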
