Iterative MapReduce Enabling HPC-Cloud Interoperability
SALSA HPC Group http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University
Workshop on Petascale Data Analytics: Challenges and Opportunities, SC11
Twister: Bingjing Zhang, Richard Teng
Twister4Azure: Thilina Gunarathne
High-Performance Visualization Algorithms for Data-Intensive Analysis: Seung-Hee Bae and Jong Youl Choi
DryadLINQ CTP Evaluation: Hui Li, Yang Ruan, and Yuduo Zhou
Cloud Storage, FutureGrid: Xiaoming Gao, Stephen Wu
Million Sequence Challenge: Saliya Ekanayake, Adam Hughs, Yang Ruan
Cyberinfrastructure for Remote Sensing of Ice Sheets: Jerome Mitchell
Intel's Application Stack
(Iterative) MapReduce in Context
Applications (Support Scientific Simulations, Data Mining and Data Analysis): Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topographic Mapping
Security, Provenance, Portal
High Level Language
Services and Workflow
Programming Model: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
Runtime / Storage: Distributed File Systems, Data Parallel File System, Object Store
Infrastructure: Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Azure Cloud, Grid Appliance; Virtualization
Hardware: CPU Nodes, GPU Nodes
What are the challenges?
Providing both cost effectiveness and a powerful parallel programming paradigm capable of handling the enormous increases in dataset sizes (large-scale data analysis for data-intensive applications).
Research issues: portability between HPC and Cloud systems; scaling performance; fault tolerance.
These challenges must be met for both computation and storage. If computation and storage are separated, it’s not possible to bring computing to data.
Data locality: its impact on performance; the factors that affect data locality; the maximum degree of data locality that can be achieved.
Factors beyond data locality that improve performance: achieving the best data locality is not always the optimal scheduling decision. For instance, if the node storing a task's input data is overloaded, running the task there degrades performance (see the sketch below).
Task granularity and load balance: in MapReduce, task granularity is fixed. This mechanism has two drawbacks: 1) a limited degree of concurrency and 2) load imbalance resulting from variation in task execution time.
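The locality-versus-load trade-off can be made concrete with a simple cost heuristic. The following Java fragment is an illustrative sketch only: NodeStatus, the scheduler class, and the REMOTE_READ_PENALTY constant are assumptions for this example, not part of Twister or Hadoop.

// Illustrative locality-vs-load scheduling heuristic (hypothetical types, not a real runtime API).
import java.util.Comparator;
import java.util.List;

public class LocalityAwareScheduler {

    public static class NodeStatus {
        final String host;
        final double load;           // fraction of busy task slots, 0.0 to 1.0
        final boolean hasLocalData;  // true if this node stores the task's input block

        public NodeStatus(String host, double load, boolean hasLocalData) {
            this.host = host;
            this.load = load;
            this.hasLocalData = hasLocalData;
        }
    }

    // Assumed cost penalty for reading input over the network instead of from local disk.
    private static final double REMOTE_READ_PENALTY = 0.3;

    // Pick the node with the lowest estimated cost: current load plus a penalty when data is remote.
    // Best locality is not always best overall: an overloaded local node can lose to an idle remote one.
    public String pickNode(List<NodeStatus> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble(
                        n -> n.load + (n.hasLocalData ? 0.0 : REMOTE_READ_PENALTY)))
                .map(n -> n.host)
                .orElseThrow(() -> new IllegalStateException("no candidate nodes"));
    }
}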
Programming Models and Tools
MapReduce in Heterogeneous Environment
Motivation
Data Deluge (experienced in many domains)
MapReduce (data centered, QoS)
Classic Parallel Runtimes (MPI) (efficient and proven techniques)
[Diagram: Map Only (Input, map, Output); Classic MapReduce (Input, map, reduce); Iterative MapReduce (Input, map, reduce with iterations, Pij)]
Expand the applicability of MapReduce to more classes of applications: Map Only, MapReduce, Iterative MapReduce, More Extensions
Twister v0.9
Distinction between static and variable data
Configurable long-running (cacheable) map/reduce tasks
Pub/sub messaging based communication/data transfers
Broker network for facilitating communication
New Infrastructure for Iterative MapReduce Programming
configureMaps(..)
configureReduce(..)
while(condition){
    runMapReduce(..)
    updateCondition()
} //end while
close()
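As a concrete illustration of this driver pattern, here is a minimal Java sketch of an iterative job (K-means style) structured around the calls above. TwisterDriver, JobConf, and the method signatures are assumptions taken from the pseudocode, not the exact Twister API, so treat this as a sketch rather than the actual Twister code.

// Minimal sketch of an iterative Twister-style driver, following the pseudocode above.
// TwisterDriver, JobConf, and the method signatures are assumed for illustration.
public class IterativeKMeansDriver {
    public static void main(String[] args) throws Exception {
        TwisterDriver driver = new TwisterDriver(new JobConf("kmeans"));
        driver.configureMaps("partition_file");   // cache static input data in long-running map tasks
        driver.configureReduce();                  // configure long-running (cacheable) reduce tasks

        double[][] centroids = loadInitialCentroids(args[0]);
        boolean converged = false;
        while (!converged) {                       // while(condition)
            // broadcast the variable data (current centroids) and run one MapReduce iteration
            double[][] newCentroids = driver.runMapReduce(centroids);
            converged = hasConverged(centroids, newCentroids);   // updateCondition()
            centroids = newCentroids;
        }                                          // end while
        driver.close();                            // release cached tasks and broker connections
    }
    // helpers omitted: loadInitialCentroids(...), hasConverged(...)
}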
[Diagram: Map(), Reduce(), and the Combine() operation run on worker nodes as cacheable map/reduce tasks with local disk; communications/data transfers go via the pub-sub broker network and direct TCP; tasks may send <Key,Value> pairs directly across iterations]
• Main program may contain many MapReduce invocations or iterative MapReduce invocations
Main program’s process space
[Diagram: the Master Node runs the Twister Driver and the Main Program, connected through the Pub/sub Broker Network (brokers B) to Worker Nodes, each with a Twister Daemon, a Worker Pool, and a Local Disk]
Scripts perform: data distribution, data collection, and partition file creation
Cacheable map/reduce tasks
One broker serves several Twister daemons
Components of Twister
Twister4Azure
Azure Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.
Iterative MapReduce for Azure
[Diagram: Job Start, then Map/Combine tasks (backed by the Data Cache), Reduce, Merge, and an "Add Iteration?" decision; Yes leads to hybrid scheduling of the new iteration, No leads to Job Finish]
Programming model extensions to support broadcast data
Merge step
In-memory caching of static data
Cache-aware hybrid scheduling using Queues, a bulletin board (a special table), and execution histories (sketched below)
Hybrid intermediate data transfer
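To make the cache-aware hybrid scheduling idea more concrete, here is a small illustrative sketch in Java: a worker first looks for tasks whose static input it already cached, advertised through the bulletin board, and only then falls back to the global scheduling queue. All type and method names (BulletinBoard, SchedulingQueue, InMemoryCache, MapTask) are assumptions for illustration; Twister4Azure itself is built on Azure services, not this code.

// Illustrative cache-aware hybrid scheduling loop (assumed types, not the Twister4Azure implementation).
public class CacheAwareWorker {
    private final BulletinBoard bulletinBoard;  // assumed wrapper around the "special table"
    private final SchedulingQueue globalQueue;  // assumed wrapper around the scheduling queue
    private final InMemoryCache dataCache;      // caches static (loop-invariant) input data

    public CacheAwareWorker(BulletinBoard board, SchedulingQueue queue, InMemoryCache cache) {
        this.bulletinBoard = board;
        this.globalQueue = queue;
        this.dataCache = cache;
    }

    public void runIteration(int iteration) {
        // 1. Prefer tasks whose static data this worker already cached in earlier iterations.
        for (MapTask task : bulletinBoard.tasksForIteration(iteration)) {
            if (dataCache.contains(task.inputId()) && bulletinBoard.tryClaim(task)) {
                execute(task, dataCache.get(task.inputId()));
            }
        }
        // 2. Fall back to the global queue for leftover tasks that were not claimed via the board.
        MapTask task;
        while ((task = globalQueue.poll()) != null) {
            Object input;
            if (dataCache.contains(task.inputId())) {
                input = dataCache.get(task.inputId());
            } else {
                input = task.downloadInputFromBlob();     // fetch from blob storage once
                dataCache.put(task.inputId(), input);     // cache it for later iterations
            }
            execute(task, input);
        }
    }

    private void execute(MapTask task, Object input) {
        // run map + combine on the input and record the outcome in the execution history (omitted)
    }
}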
[Diagram: Worker Role with Map Workers (Map 1 ... Map n) and Reduce Workers (Red 1 ... Red n), an In-Memory Data Cache and a Map Task Meta Data Cache; a Scheduling Queue and a Job Bulletin Board + In-Memory Cache + Execution History drive new jobs and new iterations; leftover tasks that did not get scheduled through the bulletin board]
Twister4Azure
• Distributed, highly scalable and highly available cloud services as the building blocks.
• Utilize eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.
• Decentralized architecture with global queue based dynamic task scheduling.
• Minimal management and maintenance overhead.
• Supports dynamically scaling up and down of the compute resources.
• MapReduce fault tolerance.
Performance Comparisons
[Charts: parallel efficiency (50% to 100%) vs. number of cores x number of files for Twister4Azure, Amazon EMR, and Apache Hadoop on Cap3 Sequence Assembly, Smith-Waterman Sequence Alignment, and BLAST Sequence Search]
Performance – Kmeans Clustering
[Charts: Number of Executing Map Tasks histogram; Strong Scaling with 128M Data Points; Weak Scaling; Task Execution Time histogram]
Performance – Multi Dimensional Scaling
[Charts: Azure Instance Type Study; Number of Executing Map Tasks histogram; Weak Scaling; Data Size Scaling]
Twister-MDS Demo
This demo provides real-time visualization of the multidimensional scaling (MDS) calculation process.
We use Twister for the parallel calculation inside the cluster and PlotViz to show the intermediate results on the user's client computer.
The process of computation and monitoring is automated by the program.
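As an illustration of how intermediate MDS results could be pushed to the PlotViz-side monitor through the broker, the sketch below uses the standard JMS API with ActiveMQ. The broker URL, topic name, and payload encoding are assumptions; the actual Twister-MDS monitor may be wired differently.

// Publish one iteration's MDS coordinates to a pub/sub topic that a PlotViz-side monitor subscribes to.
// Broker URL, topic name, and payload encoding are illustrative assumptions.
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class MdsResultPublisher {
    public static void publish(String coordinatesAsText) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://broker-host:61616");   // assumed broker URL
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("twister.mds.intermediate");  // assumed topic name
            MessageProducer producer = session.createProducer(topic);
            TextMessage message = session.createTextMessage(coordinatesAsText);
            producer.send(message);   // the monitor receives this and forwards the points to PlotViz
        } finally {
            connection.close();
        }
    }
}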
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children’s Research Institute
Twister-MDS Output
Twister-MDS Work Flow
[Diagram: the Master Node (Twister Driver, Twister-MDS) and the Client Node (MDS Monitor, PlotViz, MDS Output Monitoring Interface) connect through an ActiveMQ broker in the Pub/Sub Broker Network; I. send message to start the job, II. send intermediate results]
Twister-MDS Structure
[Diagram: the Master Node (Twister Driver, Twister-MDS) drives Worker Nodes (Twister Daemon, Worker Pool) through two map/reduce jobs per iteration: calculateBC and calculateStress]
Map-Collective Model
[Diagram: User Program, Input, an Initial Collective Step over the network of brokers, iterations of map/reduce, and a Final Collective Step over the network of brokers]
New Network of Brokers
[Diagram legend: Twister Driver Node, Twister Daemon Node, ActiveMQ Broker Node; Broker-Daemon, Broker-Broker, and Broker-Driver connections]
7 brokers and 32 computing nodes in total
A. Full Mesh Network
B. Hierarchical Sending
C. Streaming
5 brokers and 4 computing nodes in total
Performance Improvement
Twister-MDS Execution Time (100 iterations, 40 nodes, under different input data sizes; total execution time in seconds)
Number of Data Points | Original Execution Time (1 broker only) | Current Execution Time (7 brokers, the best broker number)
38400  | 189.288  | 148.805
51200  | 359.625  | 303.432
76800  | 816.364  | 737.073
102400 | 1508.487 | 1404.431
Broadcasting on 40 Nodes (in Method C, centroids are split into 160 blocks, sent through 40 brokers in 4 rounds)
Broadcasting time (seconds) by data size
Data Size | Method C | Method B
400M | 13.07 | 46.19
600M | 18.79 | 70.56
800M | 24.50 | 93.14
Twister New Architecture
[Diagram: the Master Node (Twister Driver; Configure Mapper, Add to MemCache) and Worker Nodes (Twister Daemon) are connected through Brokers; cacheable map/reduce/merge tasks; a map broadcasting chain and a reduce collection chain]
Chain/Ring Broadcasting
[Diagram: Twister Driver Node and Twister Daemon Nodes connected in a chain]
• Driver sender: send broadcasting data; get acknowledgement; send next broadcasting data; ...
• Daemon sender: receive data from the last daemon (or driver); cache data to the daemon; send data to the next daemon (wait for ACK); send acknowledgement to the last daemon
Chain Broadcasting Protocol
[Sequence diagram: the Driver sends a data block to Daemon 0 and waits for its ack; each daemon (Daemon 0, 1, 2, ...) receives the block, handles the data, sends it to the next daemon, waits for that daemon's ack, and then acks its predecessor; markers indicate "I know this is the end of the daemon chain" and "I know this is the end of the cache block"]
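The chain protocol above can be sketched as a simple pipelined relay: each daemon receives a block, caches it, forwards it to the next daemon, waits for that daemon's ACK, and then acknowledges its predecessor. The Java fragment below is illustrative only; the message framing and the DataCache type are assumptions, not the Twister implementation.

// Illustrative chain-broadcast relay for one daemon (assumed framing; DataCache is an assumed type).
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;

public class ChainBroadcastDaemon {
    // Receive blocks from the previous hop (driver or daemon), cache them, relay them, then ACK back.
    public void relay(Socket fromPrev, Socket toNext, DataCache cache) throws Exception {
        DataInputStream in = new DataInputStream(fromPrev.getInputStream());
        DataOutputStream prevAck = new DataOutputStream(fromPrev.getOutputStream());
        DataOutputStream out = (toNext == null) ? null : new DataOutputStream(toNext.getOutputStream());
        DataInputStream nextAck = (toNext == null) ? null : new DataInputStream(toNext.getInputStream());

        int length;
        while ((length = in.readInt()) > 0) {      // a length of 0 marks the end of the cache block
            byte[] block = new byte[length];
            in.readFully(block);
            cache.add(block);                      // cache the data on this daemon
            if (out != null) {
                out.writeInt(length);              // forward the block down the chain
                out.write(block);
                out.flush();
                nextAck.readBoolean();             // wait for the next daemon's ACK
            }
            prevAck.writeBoolean(true);            // acknowledge to the previous hop
            prevAck.flush();
        }
        if (out != null) {                         // propagate the end marker; the last daemon stops here
            out.writeInt(0);
            out.flush();
            nextAck.readBoolean();
        }
        prevAck.writeBoolean(true);                // final ACK: end of the daemon chain reached
        prevAck.flush();
    }
}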
Broadcasting Time Comparison
[Chart: Broadcasting Time Comparison on 80 nodes, 600 MB data, 160 pieces; broadcasting time in seconds (0 to 30) over 50 executions, comparing Chain Broadcasting with All-to-All Broadcasting using 40 brokers]
Applications & Different Interconnection Patterns
Map Only: CAP3 analysis; document conversion (PDF -> HTML); brute force searches in cryptography; parametric sweeps. Examples: CAP3 gene assembly, PolarGrid Matlab data analysis.
Classic MapReduce: High Energy Physics (HEP) histograms; SWG gene alignment; distributed search; distributed sorting; information retrieval. Examples: information retrieval, HEP data analysis, calculation of pairwise distances for ALU sequences.
Iterative Reductions (Twister): expectation maximization algorithms; clustering; linear algebra. Examples: Kmeans, deterministic annealing clustering, multidimensional scaling (MDS).
Loosely Synchronous: many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions. Examples: solving differential equations, particle dynamics with short-range forces.
[Diagram: Map Only (Input, map, Output); Classic MapReduce (Input, map, reduce); Iterative MapReduce (Input, map, reduce with iterations, Pij); MPI. The domain of MapReduce and its iterative extensions is contrasted with MPI]
Scheduling vs. Computation of Dryad in a Heterogeneous Environment
Runtime Issues
Twister Futures
Development of a library of collectives to use at the Reduce phase
Broadcast and Gather needed by current applications
Discover other important ones
Implement efficiently on each platform, especially Azure
Better software message routing with broker networks using asynchronous I/O with communication fault tolerance
Support nearby location of data and computing using data parallel file systems
Clearer application fault tolerance model based on implicit synchronization points at iteration end points (see the sketch after this list)
Later: investigate GPU support
Later: run time for data parallel languages like Sawzall, Pig Latin, LINQ
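As a sketch of what fault tolerance anchored at iteration boundaries could look like, the following Java fragment checkpoints the loop-variable data once per iteration, so a failed iteration can simply be re-run from the last synchronization point. CheckpointStore, TwisterDriver, and IterationFailedException are assumptions for illustration, not an existing Twister feature.

// Illustrative iteration-boundary fault tolerance (assumed CheckpointStore and driver types).
public class CheckpointedIterativeJob {
    public static void run(TwisterDriver driver, CheckpointStore store) throws Exception {
        int iteration = store.lastCompletedIteration();       // -1 if starting fresh
        double[][] state = (iteration < 0) ? initialState() : store.load(iteration);
        boolean converged = false;
        while (!converged) {
            try {
                double[][] next = driver.runMapReduce(state); // one iteration = one implicit sync point
                iteration++;
                store.save(iteration, next);                  // checkpoint at the iteration end point
                converged = hasConverged(state, next);
                state = next;
            } catch (IterationFailedException e) {
                // partial work inside the failed iteration is discarded;
                // the loop retries from the last checkpointed state
            }
        }
        driver.close();
    }
    // helpers omitted: initialState(), hasConverged(...)
}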
Convergence is Happening
Multicore
Clouds
Data Intensive Paradigms
Data-intensive applications (three basic activities): capture, curation, and analysis (visualization)
Cloud infrastructure and runtime
Parallel threading and processes
FutureGrid: a Grid Testbed
• IU Cray operational; IU IBM (iDataPlex) completed stability test May 6
• UCSD IBM operational; UF IBM stability test completes ~May 12
• Network, NID, and PU HTC system operational
• UC IBM stability test completes ~May 27; TACC Dell awaiting delivery of components
NID: Network Impairment Device. [Map: private/public FG network]
• Switchable clusters on the same hardware (~5 minutes between different OSes, such as Linux+Xen to Windows+HPCS)
• Support for virtual clusters
• SW-G: Smith-Waterman-Gotoh dissimilarity computation as a pleasingly parallel problem suitable for MapReduce-style applications
Dynamic Cluster Architecture
[Diagram: a Monitoring & Control Infrastructure (Pub/Sub Broker Network with Summarizer, Switcher, and Monitoring Interface) manages virtual/physical clusters on iDataplex bare-metal nodes (32 nodes) via XCAT infrastructure; the clusters run SW-G using Hadoop on a Linux bare-system, SW-G using Hadoop on Linux on Xen, and SW-G using DryadLINQ on a Windows Server 2008 bare-system]
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds on FutureGrid
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
• Top: 3 clusters are switching applications on a fixed environment. Takes approximately 30 seconds.
• Bottom: a cluster is switching between environments: Linux; Linux + Xen; Windows + HPCS. Takes approximately 7 minutes.
• SALSAHPC Demo at SC09. This demonstrates the concept of Science on Clouds using a FutureGrid iDataPlex.
Demonstrate the concept of Science on Clouds using a FutureGrid cluster
Education and Broader Impact
We devote a lot of effort to guiding students who are interested in computing
Education
We offer classes with emerging new topics
Together with tutorials on the most popular cloud computing tools
Hosting workshops and spreading our technology across the nation
Giving students unforgettable research experience
Broader Impact
Acknowledgement
SALSA HPC Group, Indiana University
http://salsahpc.indiana.edu