Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.1: Hardware
Dr. Peter Tröger + Teaching Team
Summary: Week 4
• Accelerators enable major speedup for data parallelism
  ◦ SIMD execution model (no branching)
  ◦ Memory latency managed with many light-weight threads
• Tackle diversity with OpenCL
  ◦ Loop parallelism with index ranges
  ◦ Kernels in C, compiled at runtime
  ◦ Complex memory hierarchy supported
• Getting fast is easy, getting faster is hard
  ◦ Best practices for accelerators
  ◦ Hardware knowledge needed
What if my computational problem still demands more power?
Parallelism for
• Speed: compute faster
• Throughput: compute more in the same time
• Scalability: compute faster / more with additional resources
  ◦ Huge scalability only with shared-nothing systems
  ◦ Still also depends on application characteristics
[Figure: scaling up (more processing elements sharing one main memory) vs. scaling out (adding machines, each with its own main memory)]
Parallel Hardware
• Shared memory system
  ◦ Typically a single machine, common address space for tasks
  ◦ Hardware scaling is limited (power / memory wall)
• Shared-nothing (distributed memory) system
  ◦ Tasks on multiple machines, can only access local memory
  ◦ Global task coordination by explicit messaging
  ◦ Easy scale-out by adding machines to the network
[Figure: shared memory system (tasks on processing elements with caches, one shared memory) vs. shared-nothing system (tasks with local memory per machine, exchanging messages)]
Parallel Hardware
• Shared memory system: a collection of processors
  ◦ Integrated machine for capacity computing
  ◦ Prepared for a large variety of problems
• Shared-nothing system: a collection of computers
  ◦ Clusters and supercomputers for capability computing
  ◦ Installation to solve few problems in the best way
  ◦ Parallel software must be able to leverage multiple machines at the same time
  ◦ Difference to distributed systems (Internet, Cloud):
    • Single organizational domain, managed as a whole
    • Single parallel application at a time, no separation of client and server application
    • Hybrids are possible (e.g. HPC in the Amazon AWS cloud)
Shared Nothing: Clusters
• Collection of stand-alone machines connected by a local network
  ◦ Cost-effective technique for building a large-scale parallel computer
  ◦ Users are builders, have control over their system
  ◦ Synchronization much slower than in shared memory
  ◦ Task granularity becomes an issue
[Figure: tasks with local memory on separate machines, exchanging messages over the network]
Shared Nothing: Supercomputers
• Supercomputers / Massively Parallel Processing (MPP) systems
  ◦ (Hierarchical) cluster with a lot of processors
  ◦ Still standard hardware, but specialized setup
  ◦ High-performance interconnection network
  ◦ For massive data-parallel applications, mostly simulations (weapons, climate, earthquakes, airplanes, car crashes, ...)
• Examples (Nov 2013)
  ◦ Blue Gene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
  ◦ Tianhe-2, 3.1 million cores, 1 PB memory, 17.8 MW power, 33.86 PFlops (quadrillions of calculations per second)
• Annual ranking with the TOP500 list (www.top500.org)
Example: Blue Gene/Q (© 2011 IBM Corporation)
1. Chip: 16+2 cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for water cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s
Sustained single-node performance: 10x P, 20x L; MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria). Software and hardware support for programming models for exploitation of node hardware concurrency.
Interconnection Networks
• Bus systems
  ◦ Static approach, low costs
  ◦ Shared communication path, broadcasting of information
  ◦ Scalability issues with the shared bus
• Completely connected networks
  ◦ Static approach, high costs
  ◦ Only direct links, optimal performance
• Star-connected networks
  ◦ Static approach with a central switch
  ◦ Fewer links, still very good performance
  ◦ Scalability depends on the central switch
[Figure: processing elements on a shared bus; a completely connected network; a star network with a central switch]
Interconnection Networks
• Crossbar switch
  ◦ Dynamic switch-based network
  ◦ Supports multiple parallel direct connections without collisions
  ◦ Fewer edges than a completely connected network, but still scalability issues
• Fat tree
  ◦ Wider links in the higher parts of the interconnect tree
  ◦ Combines the advantages of a tree design with a solution for root node scalability
  ◦ Communication distance between any two nodes is no more than 2 log(#PEs)
[Figure: crossbar switch connecting PE1..PEn; a fat tree of switches above the processing elements]
Interconnection Networks
• Linear array
• Ring
  ◦ Linear array with connected endings
• N-way D-dimensional mesh
  ◦ Matrix of processing elements
  ◦ Not more than N neighbor links
  ◦ Structured in D dimensions
• N-way D-dimensional torus
  ◦ Mesh with wrap-around connections
[Figure: 4-way 2D mesh, 8-way 2D mesh, and 4-way 2D torus topologies]
Example: Blue Gene/Q 5D Torus
• 5D torus interconnect in the Blue Gene/Q supercomputer
  ◦ 2 GB/s on all 10 links, 80 ns latency to direct neighbors
  ◦ Additional link for communication with I/O nodes
[IBM]
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.2: Granularity and Task Mapping
Dr. Peter Tröger + Teaching Team
Workload
• Last week showed that task granularity may be flexible
  ◦ Example: OpenCL work group size
• But: communication overhead becomes significant now
  ◦ What is the right level of task granularity?
Surface-To-Volume Effect
• Envision the work to be done (in parallel) as a sliced 3D cube
  ◦ Not a demand on the application data, just a representation
  ◦ Slicing represents splitting into tasks
• Computational work of a task
  ◦ Proportional to the volume of the cube slice
  ◦ Represents the granularity of the decomposition
• Communication requirements of a task
  ◦ Proportional to the surface of the cube slice
• Communication-to-computation ratio
  ◦ Fine granularity: communication high, computation low
  ◦ Coarse granularity: communication low, computation high
Surface-To-Volume Effect
[nicerweb.com]
• Fine-grained decomposition for using all processing elements?
• Coarse-grained decomposition to reduce communication overhead?
• A tradeoff question!
Surface-To-Volume Effect
• Heatmap example with 64 data cells
• Version (a): 64 tasks
  ◦ 64 x 4 = 256 messages, 256 data values
  ◦ 64 processing elements used in parallel
• Version (b): 4 tasks
  ◦ 16 messages, 64 data values
  ◦ 4 processing elements used in parallel
[Foster]
Surface-To-Volume Effect
• Rule of thumb
  ◦ Agglomerate tasks to avoid communication
  ◦ Stop when parallelism is no longer exploited well enough
  ◦ Agglomerate in all dimensions at the same time
• Influencing factors
  ◦ Communication technology + topology
  ◦ Serial performance per processing element
  ◦ Degree of application parallelism
• Task communication vs. network topology
  ◦ The resulting task graph must be mapped to the network topology
  ◦ Task-to-task communication may need multiple hops
[Foster]
The Task Mapping Problem
• Given
  ◦ a number of homogeneous processing elements with performance characteristics,
  ◦ some interconnection topology of the processing elements with performance characteristics,
  ◦ an application dividable into parallel tasks.
• Questions:
  ◦ What is the optimal task granularity?
  ◦ How should the tasks be placed on processing elements?
  ◦ Do we still get speedup / scale-up by this parallelization?
• Task mapping is still research; today it is mostly manual tuning
• More options with configurable networks / dynamic routing
  ◦ Reconfiguration of hardware communication paths
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.3: Programming with MPI
Dr. Peter Tröger + Teaching Team
Message Passing
• Parallel programming paradigm for shared-nothing environments
  ◦ Implementations for shared memory are available, but typically not the best approach
• Users submit their message passing program & data as a job
• The cluster management system creates the program instances
[Figure: the submission host hands the job (application + data) to the cluster management software, which starts instances 0..3 on the execution hosts]
Single Program Multiple Data (SPMD)
[Figure: the same SPMD program (the MPI ring-communication code shown later in this unit) is started as instances 0 to 4; every instance runs identical code on its share of the input data, with behavior depending only on its rank]
Message Passing Interface (MPI)
• Many optimized messaging libraries for shared-nothing environments, developed by networking hardware vendors
• Need for a standardized API solution: the Message Passing Interface
  ◦ Definition of API syntax and semantics
  ◦ Enables source code portability, not interoperability
  ◦ Software independent from hardware concepts
• Fixed number of process instances, defined on startup
  ◦ Point-to-point and collective communication
• Focus on efficiency of communication and memory usage
• MPI Forum standard
  ◦ Consortium of industry and academia
  ◦ MPI 1.0 (1994), 2.0 (1997), 3.0 (2012)
MPI Communicators
• Each application instance (process) has a rank, starting at zero
• Communicator: handle for a group of processes
  ◦ Unique rank numbers inside the communicator group
  ◦ An instance can determine the communicator size and its own rank
  ◦ Default communicator: MPI_COMM_WORLD
  ◦ An instance may be in multiple communicator groups
[Figure: a communicator of size 4, containing ranks 0 to 3]
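A minimal sketch (my own illustration, not from the slides) of how every MPI program frames its work: each instance initializes the library, queries its rank and the communicator size, and shuts down again.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, comm_size;
    MPI_Init(&argc, &argv);                    /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* own rank within the default communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size); /* number of instances in the group */
    printf("Instance %d of %d\n", rank, comm_size);
    MPI_Finalize();
    return 0;
}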
Communication
• Point-to-point communication between instances:

  int MPI_Send(void* buf, int count, MPI_Datatype type,
               int destRank, int tag, MPI_Comm com);
  int MPI_Recv(void* buf, int count, MPI_Datatype type,
               int sourceRank, int tag, MPI_Comm com,
               MPI_Status* status);

• Parameters
  ◦ Send / receive buffer + size + data type
  ◦ Sender provides the receiver rank, receiver provides the sender rank
  ◦ Arbitrary message tag
• Source / destination identified by a [tag, rank, communicator] tuple
• Default send / receive will block until the match occurs
• Useful constants: MPI_ANY_TAG, MPI_ANY_SOURCE, MPI_ANY_DEST
• Variations in the API for different buffering behavior
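A minimal usage sketch (my own illustration, assuming rank has already been determined as in the ring example below): rank 0 sends one integer to rank 1, which blocks in MPI_Recv until the matching message with the same tag arrives.

int value = 42;
if (rank == 0) {
    /* tag 7 is arbitrary; the receiver must request the same tag (or MPI_ANY_TAG) */
    MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
} else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Rank 1 received %d\n", value);
}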
Example: Ring communication
// (determine rank and comm_size)
int token;
if (rank != 0) {
    // Receive from your left neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your right neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size,
         0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}
[mpitutorial.com]
Deadlocks
Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send is blocking, there is a deadlock.

int MPI_Send(void* buf, int count, MPI_Datatype type,
             int destRank, int tag, MPI_Comm com);
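One way out (a sketch of my own, not from the slides) is to let rank 1 receive in the same order in which rank 0 sends, so each blocking send finds its matching receive; MPI_Sendrecv offers a combined operation for such symmetric exchanges.

else if (myrank == 1) {
    /* receive tag 1 first, then tag 2 - mirrors the send order of rank 0 */
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
}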
Collective Communication
• Point-to-point communication vs. collective communication
• Use cases: synchronization, data distribution & gathering
• All processes in a (communicator) group communicate together
  ◦ One sender with multiple receivers (one-to-all)
  ◦ Multiple senders with one receiver (all-to-one)
  ◦ Multiple senders and multiple receivers (all-to-all)
• Typical pattern in supercomputer applications
• Participants continue once the group communication is done
  ◦ Always a blocking operation
  ◦ Must be executed by all processes in the group
  ◦ No assumptions on the state of the other participants on return
Barrier
• Communicator members block until everybody reaches the barrier

[Figure: each process calls MPI_Barrier(comm) and proceeds only after the last one has arrived]
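A small usage sketch (my own illustration; do_local_setup is a hypothetical helper): all ranks finish their local setup, synchronize, and only then start a timed phase together.

do_local_setup(rank);             /* hypothetical local work per rank */
MPI_Barrier(MPI_COMM_WORLD);      /* every rank must reach this call, or the others wait forever */
double t0 = MPI_Wtime();          /* timing now starts synchronized across ranks */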
Broadcast
• int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
                int rootRank, MPI_Comm comm)
  ◦ rootRank is the rank of the chosen root process
  ◦ The root process broadcasts the data in buffer to all other processes, itself included
  ◦ On return, all processes have the same data in their buffer
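A short usage sketch (my own illustration): rank 0 determines a parameter and broadcasts it, so every rank sees the same value afterwards.

int steps;
if (rank == 0) {
    steps = 1000;    /* e.g. read from a configuration file on the root */
}
/* every rank calls MPI_Bcast; afterwards 'steps' holds 1000 on all ranks */
MPI_Bcast(&steps, 1, MPI_INT, 0, MPI_COMM_WORLD);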
[Figure: broadcast copies the root's data item D0 into the buffer of every process]
Scatter
• int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                  void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                  int rootRank, MPI_Comm comm)
  ◦ The sendbuf buffer on the root process is divided, and the parts are sent to all processes, including the root
  ◦ MPI_Scatterv allows a varying count of data per rank
[Figure: scatter distributes the root's items D0..D5 across the processes; gather is the inverse operation]
Gather
• int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                 int rootRank, MPI_Comm comm)
  ◦ Each process (including the root process) sends the data in its sendbuf buffer to the root process
  ◦ Incoming data in recvbuf is stored in rank order
  ◦ The recvbuf parameter is ignored for all non-root processes
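A usage sketch (my own illustration, assuming rank and comm_size are already known): every rank contributes one integer, and rank 0 collects them in rank order.

int my_value = rank * rank;   /* something computed locally */
int all[64];                  /* receive buffer, only used on the root (assumes comm_size <= 64) */
MPI_Gather(&my_value, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* on rank 0, all[i] now holds the value sent by rank i */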
Reduction
• int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype datatype, MPI_Op op,
                 int rootRank, MPI_Comm comm)
  ◦ Similar to MPI_Gather
  ◦ Additional reduction operation op to aggregate the received data: maximum, minimum, sum, product, boolean operators, max-loc, min-loc
• The MPI implementation can overlap communication and reduction calculation for faster results
[Figure: reduce with op = + combines the processes' items D0A, D0B, D0C into D0A + D0B + D0C on the root]
Example: MPI_Scatter + MPI_Reduce
/* -- E. van den Berg 07/10/2001 -- */
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[]) {
    int data[] = {1, 2, 3, 4, 5, 6, 7};   // Size must be >= #processors
    int rank, i = -1, j = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Scatter((void *)data, 1, MPI_INT, (void *)&i,
                1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("[%d] Received i = %d\n", rank, i);
    MPI_Reduce((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD,
               0, MPI_COMM_WORLD);
    printf("[%d] j = %d\n", rank, j);
    MPI_Finalize();
    return 0;
}
What Else
• Variations: MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, ...
• Definition of virtual topologies for better task mapping
• Complex data types
• Packing / unpacking (in the style of sprintf / sscanf)
• Group / communicator management
• Error handling
• Profiling interface
• Several implementations available
  ◦ MPICH - Argonne National Laboratory
  ◦ OpenMPI - consortium of universities and industry
  ◦ ...
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.4: Programming with Channels
Dr. Peter Tröger + Teaching Team
Communicating Sequential Processes
• Formal process algebra to describe concurrent systems
  ◦ Developed by Tony Hoare at the University of Oxford (1977)
    • Also the inventor of Quicksort and Hoare logic
  ◦ Computer systems act and interact with their environment
  ◦ Decomposition into subsystems (processes) that operate concurrently inside the system
  ◦ Processes interact with other processes, or with the environment
• Book: C. A. R. Hoare, Communicating Sequential Processes, 1985
• A mathematical theory, described with algebraic laws
• The CSP channel concept is available in many programming languages for shared-nothing systems
• The complete approach is implemented in the Occam language
CSP: Processes
• The behavior of real-world objects can be described through their interaction with other objects
  ◦ Leave out internal implementation details
  ◦ The interface of a process is described as a set of atomic events
• Example: ATM and user, both modeled as processes
  ◦ card - insertion of a credit card into the ATM card slot
  ◦ money - extraction of money from the ATM dispenser
• Alphabet - the set of relevant events for an object description
  ◦ Events outside the alphabet may never happen; interaction is restricted to these events
  ◦ αATM = αUser = {card, money}
• A CSP process is the behavior of an object, described with its alphabet
Communication in CSP
• Special class of event: communication
  ◦ Modeled as a unidirectional channel between processes
  ◦ The channel name is a member of the alphabets of both processes
  ◦ Send activity is described by multiple c.v events (channel c, value v)
• The channel approach assumes rendezvous behavior
  ◦ Sender and receiver block on the channel operation until the message is transmitted
  ◦ Implicit barrier based on communication
• With this formal foundation, mathematical proofs are possible
  ◦ When two concurrent processes communicate with each other only over a single channel, they cannot deadlock.
  ◦ A network of non-stopping processes that is free of cycles cannot deadlock.
  ◦ ...
What's the Deal?
• Any possible system can be modeled through event chains
  ◦ Enables mathematical proofs of deadlock freedom, based on the basic assumptions of the formalism (e.g. the single channel assumption)
• Some tools available (check the readings page)
• CSP was the formal base for the Occam language
  ◦ Language constructs follow the formalism
  ◦ Mathematical reasoning about the behavior of written code
• Still active research (Welsh University); the channel concept is frequently adopted
  ◦ CSP channel implementations for Java, MPI, Go, C, Python
  ◦ Other formalisms based on CSP, e.g. the Task/Channel model
Channels in Scala
Scope-based channel sharing:

actor {
  var out: OutputChannel[String] = null
  val child = actor {
    react {
      case "go" => out ! "hello"
    }
  }
  val channel = new Channel[String]
  out = channel
  child ! "go"
  channel.receive {
    case msg => println(msg.length)
  }
}

Sending channels in messages:

case class ReplyTo(out: OutputChannel[String])

val child = actor {
  react {
    case ReplyTo(out) => out ! "hello"
  }
}

actor {
  val channel = new Channel[String]
  child ! ReplyTo(channel)
  channel.receive {
    case msg => println(msg.length)
  }
}
Channels in Go
package main

import "fmt"

func sayHello(ch1 chan string) {
	ch1 <- "hello" // send into the channel; blocks until someone receives
}

func main() {
	ch1 := make(chan string) // unbuffered channel
	go sayHello(ch1)         // run the sender concurrently
	fmt.Println(<-ch1)       // receive blocks until a value arrives
}
Channels in Go
• The select concept allows switching between the available channels
  ◦ All channels are evaluated
  ◦ If multiple can proceed, one is chosen randomly
  ◦ Default clause if no channel is available
• Channels are typically first-class language constructs
  ◦ Example: the client provides a response channel in the request
• Popular solution to get deterministic behavior
Example (with two assumed channels ch1 and ch2):

select {
case v := <-ch1:
	fmt.Println("received from ch1:", v)
case v := <-ch2:
	fmt.Println("received from ch2:", v)
default:
	fmt.Println("no channel ready")
}
Task/Channel Model
• Computational model for multi-computers, by Ian Foster
• Similar concepts to CSP
• A parallel computation consists of one or more tasks
  ◦ Tasks execute concurrently
  ◦ The number of tasks can vary during execution
  ◦ Task: a serial program with local memory
  ◦ A task has in-ports and out-ports as its interface to the environment
  ◦ Basic actions: read / write local memory, send a message on an out-port, receive a message on an in-port, create a new task, terminate
Task/Channel Model
• Out-port / in-port pairs are connected by channels
  ◦ Channels can be created and deleted
  ◦ Channels can be referenced as ports, which can be part of a message
  ◦ The send operation is non-blocking
  ◦ The receive operation is blocking
  ◦ Messages in a channel stay in order
• Tasks are mapped to physical processors by the execution environment
  ◦ Multiple tasks can be mapped to one processor
• Data locality is an explicit part of the model
• Channels can model control and data dependencies
Programming With Channels
• Channel-only parallel programs have advantages
  ◦ Performance optimization does not influence the semantics
    • Example: shared-memory channels for some parts
  ◦ Task mapping does not influence the semantics
    • Align the number of tasks with the problem, not with the execution environment
    • Improves scalability of the implementation
  ◦ Modular design with well-defined interfaces
• Communication should be balanced between tasks
• Each task should only communicate with a small group of neighbors
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.5: Programming with Actors
Dr. Peter Tröger + Teaching Team
Actor Model
• Carl Hewitt, Peter Bishop and Richard Steiger: "A Universal Modular Actor Formalism for Artificial Intelligence", IJCAI 1973
  ◦ Mathematical model for concurrent computation
  ◦ The actor as the computational primitive
    • Makes local decisions, concurrently sends / receives messages
    • Has a mailbox for incoming messages
    • Concurrently creates more actors
  ◦ Asynchronous one-way message sending
  ◦ Changing topology allowed, typically no order guarantees
  ◦ The recipient is identified by a mailing address
    • Actors can send their own identity to other actors
• Available as a programming language extension or library in many environments
Erlang - Ericsson Language
• Functional language with actor support
• Designed for large-scale concurrency
  ◦ First version in 1986 by Joe Armstrong, Ericsson Labs
  ◦ Available as open source since 1998
• Language goals driven by Ericsson product development
  ◦ Scalable distributed execution of phone call handling software with a large number of concurrent activities
  ◦ Fault-tolerant operation under timing constraints
  ◦ Online software update
• Users
  ◦ Amazon EC2 SimpleDB, Delicious, Facebook chat, T-Mobile SMS and authentication, Motorola call processing, Ericsson GPRS and 3G mobile network products, CouchDB, ejabberd, ...
Concurrency in Erlang
• Concurrency Oriented Programming
  ◦ Actor processes are completely independent (shared nothing)
  ◦ Synchronization and data exchange with message passing
  ◦ Each actor process has an unforgeable name
  ◦ If you know the name, you can send a message
  ◦ Default approach is fire-and-forget
  ◦ You can monitor remote actor processes
• Using this gives you
  ◦ The opportunity for massive parallelism
  ◦ No additional penalty for distribution, despite latency issues
  ◦ Easier fault tolerance capabilities
  ◦ Concurrency by default
Actors in Erlang
• Communication via message passing is part of the language
• Send never fails and works asynchronously (PID ! Message)
• Actors have mailbox functionality
  ◦ Queue of received messages, selective fetching
  ◦ Only messages from the same source arrive in order
  ◦ receive statement with a set of clauses, pattern matching
  ◦ The process is suspended in the receive operation until a match occurs

receive
    Pattern1 when Guard1 -> expr1, expr2, ..., expr_n;
    Pattern2 when Guard2 -> expr1, expr2, ..., expr_n;
    Other -> expr1, expr2, ..., expr_n
end
Erlang Example: Ping Pong Actors
[erlang.org]
-module(tut15).
%% Functions exported + number of arguments
-export([test/0, ping/2, pong/0]).

%% Ping actor, sending messages to Pong
ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("Ping finished~n", []);
ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
    %% Blocking receive, scanning the mailbox
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

%% Pong actor: blocking recursive receive, sending messages back to Ping
pong() ->
    receive
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    end.

%% Start the Ping and Pong actors
test() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).
Actors in Scala
• Actor-based concurrency in Scala, similar to Erlang
• Concurrency abstraction on top of threads or processes
• Communication by a non-blocking send operation and a blocking receive operation with matching functionality

actor {
  var sum = 0
  loop {
    receive {
      case Data(bytes)       => sum += hash(bytes)
      case GetSum(requester) => requester ! sum
    }
  }
}

• All constructs are library functions (actor, loop, receive, ...)
• Alternative self.receiveWithin() call with a timeout
• Case classes act as the message type representation
Scala Example: Counter Actor
import scala.actors.Actor
import scala.actors.Actor._

// Case classes, acting as message types
case class Inc(amount: Int)
case class Value

class Counter extends Actor {
  var counter: Int = 0
  def act() = {
    while (true) {
      receive {
        case Inc(amount) =>
          counter += amount
        case Value =>
          println("Value is " + counter)
          exit()
      }
    }
  }
}

object ActorTest extends Application {
  // Start the counter actor
  val counter = new Counter
  counter.start()
  for (i <- 1 to 1000)
    counter ! Inc(1)
  counter ! Value
}
Actor Deadlocks
• A synchronous send operator !? is available in Scala
  ◦ Sends a message and blocks in a receive afterwards
  ◦ Intended for the request-response pattern
• The original asynchronous send makes deadlocks less probable
[http://savanne.be/articles/concurrency-in-erlang-scala/]

Deadlock with synchronous sends:

// actorA
actorB !? Msg1(value) match {
  case Response1(r) => // ...
}
receive {
  case Msg2(value) => reply(Response2(value))
}

// actorB
actorA !? Msg2(value) match {
  case Response2(r) => // ...
}
receive {
  case Msg1(value) => reply(Response1(value))
}

No deadlock with asynchronous sends:

// actorA
actorB ! Msg1(value)
while (true) {
  receive {
    case Msg2(value) => reply(Response2(value))
    case Response1(r) => // ...
  }
}

// actorB
actorA ! Msg2(value)
while (true) {
  receive {
    case Msg1(value) => reply(Response1(value))
    case Response2(r) => // ...
  }
}
Parallel Programming Concepts
OpenHPI Course
Week 5 : Distributed Memory Parallelism
Unit 5.6: Programming with MapReduce
Dr. Peter Tröger + Teaching Team
MapReduce
• Programming model for parallel processing of large data sets
  ◦ Inspired by map() and reduce() in functional programming
  ◦ Intended for best scalability in data parallelism
• Huge interest started with the Google Research publication
  ◦ Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"
  ◦ Google products rely on an internal implementation
• Apache Hadoop: widely known open source implementation
  ◦ Scales to thousands of nodes
  ◦ Has been shown to process petabytes of data
  ◦ Cluster infrastructure with a custom file system (HDFS)
• Parallel programming on a very high abstraction level
MapReduce Concept
• Map step
  ◦ Converts input tuples [key, value] with a map() function into one or multiple intermediate tuples [key2, value2] per input
• Shuffle step: collects all intermediate tuples with the same key
• Reduce step
  ◦ Combines all intermediate tuples with the same key into one result per key, using some reduce() function
• The developer just defines the stateless map() and reduce() functions
• The framework automatically ensures parallelization
• A persistence layer is needed for input and output only
[developers.google.com]
Example: Character Counting
Java Example: Hadoop Word Count
public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  ...
}
[hadoop.apache.org]
MapReduce Data Flow
[developer.yahoo.com]
Advantages
• The developer never implements communication or synchronization; it is done implicitly by the framework
  ◦ Allows transparent fault tolerance and optimization
• Running map and reduce tasks are stateless
  ◦ They only rely on their input and produce their own output
  ◦ Repeated execution in case of failing nodes
  ◦ Redundant execution to compensate for nodes with different performance characteristics
• Scale-out is only limited by
  ◦ Distributed file system performance (input / output data)
  ◦ Shuffle step communication performance
• Chaining of map/reduce tasks is very common in practice
• But: demands an embarrassingly parallel problem
Summary: Week 5
• Shared-nothing systems provide very good scalability
  ◦ Adding new processing elements is not limited by the power / memory walls
  ◦ Different options for interconnect technology
• Task granularity is essential
  ◦ Surface-to-volume effect
  ◦ Task mapping problem
• The de-facto standard is MPI programming
• Higher-level abstractions with
  ◦ Channels
  ◦ Actors
  ◦ MapReduce
What steps / strategy would you apply to parallelize a given compute-intense program?