Iterative MapReduce Enabling HPC-Cloud Interoperability
SALSA HPC Group http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University
Workshop on Petascale Data Analytics: Challenges and Opportunities, SC11
Twister: Bingjing Zhang, Richard Teng
Twister4Azure: Thilina Gunarathne
High-Performance Visualization Algorithms for Data-Intensive Analysis: Seung-Hee Bae and Jong Youl Choi
DryadLINQ CTP Evaluation: Hui Li, Yang Ruan, and Yuduo Zhou
Cloud Storage, FutureGrid: Xiaoming Gao, Stephen Wu
Million Sequence Challenge: Saliya Ekanayake, Adam Hughs, Yang Ruan
Cyberinfrastructure for Remote Sensing of Ice Sheets: Jerome Mitchell
Intel's Application Stack
(Iterative) MapReduce in Context
Applications (Support Scientific Simulations, Data Mining and Data Analysis): Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topographic Mapping
Security, Provenance, Portal
High Level Language
Services and Workflow
Programming Model: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
Runtime / Storage: Distributed File Systems, Data Parallel File System, Object Store
Infrastructure: Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Azure Cloud, Grid Appliance; Virtualization
Hardware: CPU Nodes, GPU Nodes
What are the challenges?
Providing both cost effectiveness and a powerful parallel programming paradigm capable of handling the enormous increases in dataset sizes (large-scale data analysis for data-intensive applications).
Research issues: portability between HPC and Cloud systems; scaling performance; fault tolerance.
These challenges must be met for both computation and storage. If computation and storage are separated, it’s not possible to bring computing to data.
Data locality: its impact on performance; the factors that affect data locality; the maximum degree of data locality that can be achieved.
Factors beyond data locality that improve performance: achieving the best data locality is not always the optimal scheduling decision. For instance, if the node storing a task's input data is overloaded, running the task there degrades performance (see the sketch below).
Task granularity and load balance: in MapReduce, task granularity is fixed. This mechanism has two drawbacks: 1) a limited degree of concurrency and 2) load imbalance resulting from variation in task execution time.
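The locality-versus-load trade-off can be made concrete with a simple cost heuristic. The following Java fragment is an illustrative sketch only: NodeStatus, the scheduler class, and the REMOTE_READ_PENALTY constant are assumptions for this example, not part of Twister or Hadoop.

// Illustrative locality-vs-load scheduling heuristic (hypothetical types, not a real runtime API).
import java.util.Comparator;
import java.util.List;

public class LocalityAwareScheduler {

    public static class NodeStatus {
        final String host;
        final double load;           // fraction of busy task slots, 0.0 to 1.0
        final boolean hasLocalData;  // true if this node stores the task's input block

        public NodeStatus(String host, double load, boolean hasLocalData) {
            this.host = host;
            this.load = load;
            this.hasLocalData = hasLocalData;
        }
    }

    // Assumed cost penalty for reading input over the network instead of from local disk.
    private static final double REMOTE_READ_PENALTY = 0.3;

    // Pick the node with the lowest estimated cost: current load plus a penalty when data is remote.
    // Best locality is not always best overall: an overloaded local node can lose to an idle remote one.
    public String pickNode(List<NodeStatus> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble(
                        n -> n.load + (n.hasLocalData ? 0.0 : REMOTE_READ_PENALTY)))
                .map(n -> n.host)
                .orElseThrow(() -> new IllegalStateException("no candidate nodes"));
    }
}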
Programming Models and Tools
MapReduce in Heterogeneous Environment
Motivation
Data Deluge (experienced in many domains)
MapReduce (data centered, QoS)
Classic Parallel Runtimes (MPI) (efficient and proven techniques)
[Diagram: Map Only (Input, map, Output); Classic MapReduce (Input, map, reduce); Iterative MapReduce (Input, map, reduce with iterations, Pij)]
Expand the applicability of MapReduce to more classes of applications: Map Only, MapReduce, Iterative MapReduce, More Extensions
Twister v0.9
Distinction between static and variable data
Configurable long-running (cacheable) map/reduce tasks
Pub/sub messaging based communication/data transfers
Broker network for facilitating communication
New Infrastructure for Iterative MapReduce Programming
configureMaps(..)
configureReduce(..)
while(condition){
    runMapReduce(..)
    updateCondition()
} //end while
close()
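As a concrete illustration of this driver pattern, here is a minimal Java sketch of an iterative job (K-means style) structured around the calls above. TwisterDriver, JobConf, and the method signatures are assumptions taken from the pseudocode, not the exact Twister API, so treat this as a sketch rather than the actual Twister code.

// Minimal sketch of an iterative Twister-style driver, following the pseudocode above.
// TwisterDriver, JobConf, and the method signatures are assumed for illustration.
public class IterativeKMeansDriver {
    public static void main(String[] args) throws Exception {
        TwisterDriver driver = new TwisterDriver(new JobConf("kmeans"));
        driver.configureMaps("partition_file");   // cache static input data in long-running map tasks
        driver.configureReduce();                  // configure long-running (cacheable) reduce tasks

        double[][] centroids = loadInitialCentroids(args[0]);
        boolean converged = false;
        while (!converged) {                       // while(condition)
            // broadcast the variable data (current centroids) and run one MapReduce iteration
            double[][] newCentroids = driver.runMapReduce(centroids);
            converged = hasConverged(centroids, newCentroids);   // updateCondition()
            centroids = newCentroids;
        }                                          // end while
        driver.close();                            // release cached tasks and broker connections
    }
    // helpers omitted: loadInitialCentroids(...), hasConverged(...)
}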
[Diagram: Map(), Reduce(), and the Combine() operation run on worker nodes as cacheable map/reduce tasks with local disk; communications/data transfers go via the pub-sub broker network and direct TCP; tasks may send <Key,Value> pairs directly across iterations]
• Main program may contain many MapReduce invocations or iterative MapReduce invocations
Main program’s process space
[Diagram: the Master Node runs the Twister Driver and the Main Program, connected through the Pub/sub Broker Network (brokers B) to Worker Nodes, each with a Twister Daemon, a Worker Pool, and a Local Disk]
Scripts perform: data distribution, data collection, and partition file creation
Cacheable map/reduce tasks
One broker serves several Twister daemons
Components of Twister
Twister4Azure
Azure Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.
Iterative MapReduce for Azure
[Diagram: Job Start, then Map/Combine tasks (backed by the Data Cache), Reduce, Merge, and an "Add Iteration?" decision; Yes leads to hybrid scheduling of the new iteration, No leads to Job Finish]
Programming model extensions to support broadcast data
Merge step
In-memory caching of static data
Cache-aware hybrid scheduling using Queues, a bulletin board (a special table), and execution histories (sketched below)
Hybrid intermediate data transfer
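To make the cache-aware hybrid scheduling idea more concrete, here is a small illustrative sketch in Java: a worker first looks for tasks whose static input it already cached, advertised through the bulletin board, and only then falls back to the global scheduling queue. All type and method names (BulletinBoard, SchedulingQueue, InMemoryCache, MapTask) are assumptions for illustration; Twister4Azure itself is built on Azure services, not this code.

// Illustrative cache-aware hybrid scheduling loop (assumed types, not the Twister4Azure implementation).
public class CacheAwareWorker {
    private final BulletinBoard bulletinBoard;  // assumed wrapper around the "special table"
    private final SchedulingQueue globalQueue;  // assumed wrapper around the scheduling queue
    private final InMemoryCache dataCache;      // caches static (loop-invariant) input data

    public CacheAwareWorker(BulletinBoard board, SchedulingQueue queue, InMemoryCache cache) {
        this.bulletinBoard = board;
        this.globalQueue = queue;
        this.dataCache = cache;
    }

    public void runIteration(int iteration) {
        // 1. Prefer tasks whose static data this worker already cached in earlier iterations.
        for (MapTask task : bulletinBoard.tasksForIteration(iteration)) {
            if (dataCache.contains(task.inputId()) && bulletinBoard.tryClaim(task)) {
                execute(task, dataCache.get(task.inputId()));
            }
        }
        // 2. Fall back to the global queue for leftover tasks that were not claimed via the board.
        MapTask task;
        while ((task = globalQueue.poll()) != null) {
            Object input;
            if (dataCache.contains(task.inputId())) {
                input = dataCache.get(task.inputId());
            } else {
                input = task.downloadInputFromBlob();     // fetch from blob storage once
                dataCache.put(task.inputId(), input);     // cache it for later iterations
            }
            execute(task, input);
        }
    }

    private void execute(MapTask task, Object input) {
        // run map + combine on the input and record the outcome in the execution history (omitted)
    }
}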
[Diagram: Worker Role with Map Workers (Map 1 ... Map n) and Reduce Workers (Red 1 ... Red n), an In-Memory Data Cache and a Map Task Meta Data Cache; a Scheduling Queue and a Job Bulletin Board + In-Memory Cache + Execution History drive new jobs and new iterations; leftover tasks that did not get scheduled through the bulletin board]
Twister4Azure
• Distributed, highly scalable and highly available cloud services as the building blocks.
• Utilize eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.
• Decentralized architecture with global queue based dynamic task scheduling.
• Minimal management and maintenance overhead.
• Supports dynamically scaling up and down of the compute resources.
• MapReduce fault tolerance.
Performance Comparisons
[Charts: parallel efficiency (50% to 100%) vs. number of cores x number of files for Twister4Azure, Amazon EMR, and Apache Hadoop on Cap3 Sequence Assembly, Smith-Waterman Sequence Alignment, and BLAST Sequence Search]
Performance – Kmeans Clustering
[Charts: Number of Executing Map Tasks histogram; Strong Scaling with 128M Data Points; Weak Scaling; Task Execution Time histogram]
Performance – Multi Dimensional Scaling
[Charts: Azure Instance Type Study; Number of Executing Map Tasks histogram; Weak Scaling; Data Size Scaling]
Twister-MDS Demo
This demo provides real-time visualization of the multidimensional scaling (MDS) calculation process.
We use Twister for the parallel calculation inside the cluster and PlotViz to show the intermediate results on the user's client computer.
The process of computation and monitoring is automated by the program.
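As an illustration of how intermediate MDS results could be pushed to the PlotViz-side monitor through the broker, the sketch below uses the standard JMS API with ActiveMQ. The broker URL, topic name, and payload encoding are assumptions; the actual Twister-MDS monitor may be wired differently.

// Publish one iteration's MDS coordinates to a pub/sub topic that a PlotViz-side monitor subscribes to.
// Broker URL, topic name, and payload encoding are illustrative assumptions.
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class MdsResultPublisher {
    public static void publish(String coordinatesAsText) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://broker-host:61616");   // assumed broker URL
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("twister.mds.intermediate");  // assumed topic name
            MessageProducer producer = session.createProducer(topic);
            TextMessage message = session.createTextMessage(coordinatesAsText);
            producer.send(message);   // the monitor receives this and forwards the points to PlotViz
        } finally {
            connection.close();
        }
    }
}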
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children’s Research Institute
Twister-MDS Output
Twister-MDS Work Flow
[Diagram: the Master Node (Twister Driver, Twister-MDS) and the Client Node (MDS Monitor, PlotViz, MDS Output Monitoring Interface) connect through an ActiveMQ broker in the Pub/Sub Broker Network; I. send message to start the job, II. send intermediate results]
Twister-MDS Structure
[Diagram: the Master Node (Twister Driver, Twister-MDS) drives Worker Nodes (Twister Daemon, Worker Pool) through two map/reduce jobs per iteration: calculateBC and calculateStress]
Map-Collective Model
[Diagram: User Program, Input, an Initial Collective Step over the network of brokers, iterations of map/reduce, and a Final Collective Step over the network of brokers]
New Network of Brokers
[Diagram legend: Twister Driver Node, Twister Daemon Node, ActiveMQ Broker Node; Broker-Daemon, Broker-Broker, and Broker-Driver connections]
7 brokers and 32 computing nodes in total
A. Full Mesh Network
B. Hierarchical Sending
C. Streaming
5 brokers and 4 computing nodes in total
Performance Improvement
Twister-MDS Execution Time (100 iterations, 40 nodes, under different input data sizes; total execution time in seconds)
Number of Data Points | Original Execution Time (1 broker only) | Current Execution Time (7 brokers, the best broker number)
38400  | 189.288  | 148.805
51200  | 359.625  | 303.432
76800  | 816.364  | 737.073
102400 | 1508.487 | 1404.431
Broadcasting on 40 Nodes (in Method C, centroids are split into 160 blocks, sent through 40 brokers in 4 rounds)
Broadcasting time (seconds) by data size
Data Size | Method C | Method B
400M | 13.07 | 46.19
600M | 18.79 | 70.56
800M | 24.50 | 93.14
Twister New Architecture
[Diagram: the Master Node (Twister Driver; Configure Mapper, Add to MemCache) and Worker Nodes (Twister Daemon) are connected through Brokers; cacheable map/reduce/merge tasks; a map broadcasting chain and a reduce collection chain]
Chain/Ring Broadcasting
[Diagram: Twister Driver Node and Twister Daemon Nodes connected in a chain]
• Driver sender: send broadcasting data; get acknowledgement; send next broadcasting data; ...
• Daemon sender: receive data from the last daemon (or driver); cache data to the daemon; send data to the next daemon (wait for ACK); send acknowledgement to the last daemon
Chain Broadcasting Protocol
[Sequence diagram: the Driver sends a data block to Daemon 0 and waits for its ack; each daemon (Daemon 0, 1, 2, ...) receives the block, handles the data, sends it to the next daemon, waits for that daemon's ack, and then acks its predecessor; markers indicate "I know this is the end of the daemon chain" and "I know this is the end of the cache block"]
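The chain protocol above can be sketched as a simple pipelined relay: each daemon receives a block, caches it, forwards it to the next daemon, waits for that daemon's ACK, and then acknowledges its predecessor. The Java fragment below is illustrative only; the message framing and the DataCache type are assumptions, not the Twister implementation.

// Illustrative chain-broadcast relay for one daemon (assumed framing; DataCache is an assumed type).
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;

public class ChainBroadcastDaemon {
    // Receive blocks from the previous hop (driver or daemon), cache them, relay them, then ACK back.
    public void relay(Socket fromPrev, Socket toNext, DataCache cache) throws Exception {
        DataInputStream in = new DataInputStream(fromPrev.getInputStream());
        DataOutputStream prevAck = new DataOutputStream(fromPrev.getOutputStream());
        DataOutputStream out = (toNext == null) ? null : new DataOutputStream(toNext.getOutputStream());
        DataInputStream nextAck = (toNext == null) ? null : new DataInputStream(toNext.getInputStream());

        int length;
        while ((length = in.readInt()) > 0) {      // a length of 0 marks the end of the cache block
            byte[] block = new byte[length];
            in.readFully(block);
            cache.add(block);                      // cache the data on this daemon
            if (out != null) {
                out.writeInt(length);              // forward the block down the chain
                out.write(block);
                out.flush();
                nextAck.readBoolean();             // wait for the next daemon's ACK
            }
            prevAck.writeBoolean(true);            // acknowledge to the previous hop
            prevAck.flush();
        }
        if (out != null) {                         // propagate the end marker; the last daemon stops here
            out.writeInt(0);
            out.flush();
            nextAck.readBoolean();
        }
        prevAck.writeBoolean(true);                // final ACK: end of the daemon chain reached
        prevAck.flush();
    }
}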
Broadcasting Time Comparison
[Chart: Broadcasting Time Comparison on 80 nodes, 600 MB data, 160 pieces; broadcasting time in seconds (0 to 30) over 50 executions, comparing Chain Broadcasting with All-to-All Broadcasting using 40 brokers]
Applications & Different Interconnection Patterns
Map Only: CAP3 analysis; document conversion (PDF -> HTML); brute force searches in cryptography; parametric sweeps. Examples: CAP3 gene assembly, PolarGrid Matlab data analysis.
Classic MapReduce: High Energy Physics (HEP) histograms; SWG gene alignment; distributed search; distributed sorting; information retrieval. Examples: information retrieval, HEP data analysis, calculation of pairwise distances for ALU sequences.
Iterative Reductions (Twister): expectation maximization algorithms; clustering; linear algebra. Examples: Kmeans, deterministic annealing clustering, multidimensional scaling (MDS).
Loosely Synchronous: many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions. Examples: solving differential equations, particle dynamics with short-range forces.
[Diagram: Map Only (Input, map, Output); Classic MapReduce (Input, map, reduce); Iterative MapReduce (Input, map, reduce with iterations, Pij); MPI. The domain of MapReduce and its iterative extensions is contrasted with MPI]
Scheduling vs. Computation of Dryad in a Heterogeneous Environment
Runtime Issues
Twister Futures
Development of a library of collectives to use at the Reduce phase
Broadcast and Gather needed by current applications
Discover other important ones
Implement efficiently on each platform, especially Azure
Better software message routing with broker networks using asynchronous I/O with communication fault tolerance
Support nearby location of data and computing using data parallel file systems
Clearer application fault tolerance model based on implicit synchronization points at iteration end points (see the sketch after this list)
Later: investigate GPU support
Later: run time for data parallel languages like Sawzall, Pig Latin, LINQ
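As a sketch of what fault tolerance anchored at iteration boundaries could look like, the following Java fragment checkpoints the loop-variable data once per iteration, so a failed iteration can simply be re-run from the last synchronization point. CheckpointStore, TwisterDriver, and IterationFailedException are assumptions for illustration, not an existing Twister feature.

// Illustrative iteration-boundary fault tolerance (assumed CheckpointStore and driver types).
public class CheckpointedIterativeJob {
    public static void run(TwisterDriver driver, CheckpointStore store) throws Exception {
        int iteration = store.lastCompletedIteration();       // -1 if starting fresh
        double[][] state = (iteration < 0) ? initialState() : store.load(iteration);
        boolean converged = false;
        while (!converged) {
            try {
                double[][] next = driver.runMapReduce(state); // one iteration = one implicit sync point
                iteration++;
                store.save(iteration, next);                  // checkpoint at the iteration end point
                converged = hasConverged(state, next);
                state = next;
            } catch (IterationFailedException e) {
                // partial work inside the failed iteration is discarded;
                // the loop retries from the last checkpointed state
            }
        }
        driver.close();
    }
    // helpers omitted: initialState(), hasConverged(...)
}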
Convergence is Happening
Multicore
Clouds
Data Intensive Paradigms
Data-intensive applications (three basic activities): capture, curation, and analysis (visualization)
Cloud infrastructure and runtime
Parallel threading and processes
FutureGrid: a Grid Testbed
• IU Cray operational; IU IBM (iDataPlex) completed stability test May 6
• UCSD IBM operational; UF IBM stability test completes ~May 12
• Network, NID, and PU HTC system operational
• UC IBM stability test completes ~May 27; TACC Dell awaiting delivery of components
NID: Network Impairment Device. [Map: private/public FG network]
• Switchable clusters on the same hardware (~5 minutes between different OSes, such as Linux+Xen to Windows+HPCS)
• Support for virtual clusters
• SW-G: Smith-Waterman-Gotoh dissimilarity computation as a pleasingly parallel problem suitable for MapReduce-style applications
Dynamic Cluster Architecture
[Diagram: a Monitoring & Control Infrastructure (Pub/Sub Broker Network with Summarizer, Switcher, and Monitoring Interface) manages virtual/physical clusters on iDataplex bare-metal nodes (32 nodes) via XCAT infrastructure; the clusters run SW-G using Hadoop on a Linux bare-system, SW-G using Hadoop on Linux on Xen, and SW-G using DryadLINQ on a Windows Server 2008 bare-system]
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds on FutureGrid
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
• Top: 3 clusters are switching applications on a fixed environment. Takes approximately 30 seconds.
• Bottom: a cluster is switching between environments: Linux; Linux + Xen; Windows + HPCS. Takes approximately 7 minutes.
• SALSAHPC Demo at SC09. This demonstrates the concept of Science on Clouds using a FutureGrid iDataPlex.
Demonstrate the concept of Science on Clouds using a FutureGrid cluster
Education and Broader Impact
We devote a lot of effort to guiding students who are interested in computing
Education
We offer classes with emerging new topics
Together with tutorials on the most popular cloud computing tools
Hosting workshops and spreading our technology across the nation
Giving students unforgettable research experience
Broader Impact
Acknowledgement
SALSA HPC Group, Indiana University
http://salsahpc.indiana.edu