Cloud Computing for HPC - University of Oklahoma · 2009. 10. 7. · • Cloud is happening due to...
-
Cloud Computing for HPC
William Lu
Platform Computing
-
Trends
• Increasing demand for compute power in simulation, analysis, and research
– More Data, More Compute, More Analysis
• Organizations are searching for an integrated management stack to further improve user service levels and simplify HPC operations
-
Plans to Deploy Private Cloud in 2009
Source: International Supercomputing Conference 2009 attendee survey
• Yes: 28%
• No: 72%
-
Motivations for Private Cloud
• Improving efficiency: 41%
• Improving IT responsiveness: 9%
• Cutting costs: 17%
• Experimenting with cloud …: 15%
• Resource scalability: 18%
Source: International Supercomputing Conference 2009 attendee survey
-
Barriers to Private Cloud
• Organisational culture: 37%
• Security: 21%
• Upfront costs: 8%
• Complexity of managing: 26%
• Application software licensing: 8%
Source: International Supercomputing Conference 2009 attendee survey
-
Challenges in HPC grid/clusters
• Job interference
– Memory leaks
– Runtime environment (OS + tools)
• Security
• “One size fits all” scheduler
– Fair share with history vs. current resource allocation
-
Reality Behind Resource Allocation
• Static (advance) Allocation vs. Real Workload
• Realities
– Lack of visibility into the real workload
– Static allocation delivers SLA/QoS
– But… real utilization is low
[Figure: resource vs. time chart comparing advance allocation, real workload, backfilled allocation, and a backfilled job that overruns.]
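The gap between an advance allocation and the real workload can be put into numbers. The following is a minimal sketch, not taken from the talk: the function names and the hourly slot counts are invented for illustration. It computes how much of a static allocation is actually used and how much idle capacity a backfill scheduler could lend to other jobs.

```python
def utilization(allocated, used):
    """Fraction of allocated slot-hours that the real workload consumed."""
    total_allocated = sum(allocated)
    if total_allocated == 0:
        return 0.0
    # Usage beyond the allocation does not count toward this allocation.
    consumed = sum(min(a, u) for a, u in zip(allocated, used))
    return consumed / total_allocated


def backfill_capacity(allocated, used):
    """Idle slots per interval that could be lent to backfilled jobs."""
    return [max(a - u, 0) for a, u in zip(allocated, used)]


# Hypothetical numbers: 100 slots reserved in advance for six hours,
# while the real workload fluctuates below that level.
static_allocation = [100] * 6
real_workload = [20, 35, 90, 60, 15, 20]

print(f"utilization: {utilization(static_allocation, real_workload):.0%}")
print("backfillable slots per hour:",
      backfill_capacity(static_allocation, real_workload))
```

With these invented numbers the reservation is only 40% utilized, which is the kind of idle capacity that backfilling tries to recover.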
-
Elastic vs. Traditional Scheduling
[Figure: two resource vs. time charts, each placing jobs 1 to 4, one under traditional scheduling and one under elastic scheduling.]
Elastic scheduling
• Better SLA
• Higher utilization
-
Resource Scavenging
• Scavenging resources from desktops, labs, external data centers, public cloud…
[Figure: resource vs. time chart placing jobs 1 to 4 across resources in the HPC center and resources outside of the HPC center.]
-
Different job environment
[Figure: resource vs. time chart with jobs that need different runtime environments: SLES 10, RHEL, Scientific Linux, SLES 11.]
• Job runtime environment needs to be dynamically scheduled together with the job
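One way to picture scheduling the runtime environment together with the job is sketched below. This is an illustrative assumption, not Platform's implementation: `Host`, `Job`, and `place` are hypothetical names, and re-imaging is reduced to changing a string where a real system would boot a VM or provision an OS image.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    os_image: str      # environment currently installed on the host
    busy: bool = False


@dataclass
class Job:
    job_id: int
    required_os: str   # environment the job needs, e.g. "RHEL"


def place(job: Job, hosts: List[Host]) -> Optional[Host]:
    """Pick a free host for the job, re-imaging one if no free host matches."""
    free = [h for h in hosts if not h.busy]
    if not free:
        return None
    # Prefer a host that already runs the required environment.
    host = next((h for h in free if h.os_image == job.required_os), free[0])
    if host.os_image != job.required_os:
        # Stand-in for a provisioning step (hypothetical).
        host.os_image = job.required_os
    host.busy = True
    return host


hosts = [Host("node1", "SLES 10"), Host("node2", "RHEL")]
for job in [Job(1, "RHEL"), Job(2, "Scientific Linux"), Job(3, "SLES 11")]:
    chosen = place(job, hosts)
    print(job.job_id, chosen.name if chosen else "queued")
```

Preferring an already matching host keeps provisioning to the cases where it is unavoidable; a job that finds no free host simply waits.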
-
Moving from Cluster/Grid to Cloud
• Cluster: single app
• Grid: multiple apps / multiple clusters
• Cloud: dynamic and elastic
-
Management of Private Cloud for HPC
[Architecture diagram. Components:]
• HPC Applications
• Workload Management
• Private Cloud Management
• Service Delivery
• Allocation Engine
• Resource Integration
• VM management, Provisioning
• Servers, Storage, Network
• Internal data center
• External data center / public cloud
[Diagram annotations:]
• Application & workload integrations
• Match supply to demand
• Service creation, access, delivery, accounting
• Create a shared pool of physical & virtual resources
-
Software Stack from Platform Computing
[Architecture diagram. Components:]
• Platform LSF
• Platform MPI
• Platform ISF Platform ISF HPC Cluster
• VM management, Provisioning
• Servers, Storage, Network
• Internal data center
• External data center / public cloud
-
Large Research Institute
Profile
• HPC center with thousands of servers used by multiple user groups for data analysis
Problems
• Application environment setup takes too long
• Jobs fail due to application environment conflicts on the same server
• Scheduling policy conflicts across multiple user groups cause user dissatisfaction
Solution
• Elastic and autonomous Platform LSF clusters scheduled by Platform ISF for each user group
• When workload changes, Platform ISF dynamically adds and removes resources from individual clusters
• VMs are used as job containers to reduce job interference
Results
• User support incidents: 15/day before, 2/day after
• Job failures due to app environment conflict: 5/day before, ~0 after
• App environment setup time: 1 week before, 1 day after
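The idea of growing and shrinking per-group clusters as workload changes can be sketched as a simple rebalancing step. The code below is an assumption for illustration only and is not Platform ISF's actual algorithm: `rebalance` is a hypothetical function that splits a shared pool of hosts across clusters in proportion to their current demand, using the largest-remainder method to hand out leftover hosts.

```python
def rebalance(total_hosts, demand):
    """Split a shared pool across clusters according to current demand.

    demand maps a cluster name to the number of hosts it could use now.
    """
    total_demand = sum(demand.values())
    if total_demand <= total_hosts:
        # Enough capacity: every cluster gets exactly what it asks for.
        return dict(demand)
    # Oversubscribed: start from the proportional share, rounded down.
    shares = {name: d * total_hosts / total_demand for name, d in demand.items()}
    alloc = {name: int(share) for name, share in shares.items()}
    # Hand the remaining hosts to the clusters with the largest remainders.
    leftover = total_hosts - sum(alloc.values())
    by_remainder = sorted(demand, key=lambda n: shares[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical user groups sharing a 10-host pool.
print(rebalance(10, {"genomics": 2, "physics": 3}))
print(rebalance(10, {"genomics": 12, "physics": 6}))
```

Calling `rebalance` again whenever demand changes yields the new target size for each cluster; the difference from the previous result is the set of hosts to add or remove.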
-
Reservation & On-Demand Allocations
• On-Demand: First-come, first-served capacity
– Pay-per-use for non-critical or burst workloads
• Guaranteed Reservation: Reserve resources for critical workloads
– Optimized utilization, guaranteed capacity
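The two allocation modes can be illustrated with a small capacity pool. This is a sketch under stated assumptions, not the product's behavior: `CapacityPool` and its methods are hypothetical, critical requests draw only on the guaranteed reservation, and all other requests are served first-come, first-served from the remaining on-demand capacity.

```python
class CapacityPool:
    """Slots split into a guaranteed reservation and an on-demand remainder."""

    def __init__(self, total_slots, reserved_slots):
        if not 0 <= reserved_slots <= total_slots:
            raise ValueError("reserved_slots must be between 0 and total_slots")
        self.reserved_free = reserved_slots
        self.on_demand_free = total_slots - reserved_slots

    def request(self, slots, critical=False):
        """Grant the request if capacity is available; return True on success."""
        if critical and slots <= self.reserved_free:
            self.reserved_free -= slots
            return True
        if not critical and slots <= self.on_demand_free:
            self.on_demand_free -= slots
            return True
        return False


pool = CapacityPool(total_slots=100, reserved_slots=40)
print(pool.request(30, critical=True))  # critical job draws on the reservation
print(pool.request(50))                 # burst job takes on-demand capacity
print(pool.request(20))                 # only 10 on-demand slots remain, refused
```

Keeping the two counters separate is what makes the reservation a guarantee: burst workloads can never consume slots set aside for critical work.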
-
Resource-Aware Allocation Policies
• Packing: Pack workload onto the fewest physical servers
– Maximizes usable capacity, reduces fragmentation, reduces energy consumption
• Striping: Spread workload across as many physical servers as possible
– Reduces impact of host failures, higher application performance
• Load-Aware: Allocate physical servers with the lowest load to new workloads
– Higher application performance
• HA-Aware: Allocate HA-enabled resources to critical workloads
– Matches availability levels to service requirements and costs
• Energy-Aware: Place workload according to energy indices and datacenter hot spots
– Reduces energy consumption
• Affinity-Aware: Place workload close to critical resources such as storage
– Higher application performance
• Server Model-Aware: Allocate resources to workload according to model types
– Maximizes utilization of higher-performing and more expensive resources
• Topology-Aware: Allocate resources on the same interconnect to the same application
– Improves application performance
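Three of these policies (packing, striping, load-aware) differ only in how they rank candidate servers, which the sketch below makes explicit. It is an illustration under assumptions, not Platform ISF code: `Server`, `choose`, and `place_all` are hypothetical names, each workload occupies one slot, and `load` stands for whatever load metric a site measures.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    capacity: int      # total slots
    used: int = 0      # slots currently occupied
    load: float = 0.0  # hypothetical measured load


def choose(servers, policy):
    """Return the server the given policy would pick, or None if all are full."""
    candidates = [s for s in servers if s.used < s.capacity]
    if not candidates:
        return None
    if policy == "packing":
        # Fullest server that still has room, so fewer servers are in use.
        return max(candidates, key=lambda s: s.used)
    if policy == "striping":
        # Emptiest server, so workloads spread across many servers.
        return min(candidates, key=lambda s: s.used)
    if policy == "load-aware":
        return min(candidates, key=lambda s: s.load)
    raise ValueError(f"unknown policy: {policy}")


def place_all(servers, count, policy):
    """Place `count` one-slot workloads and return the chosen server names."""
    placement = []
    for _ in range(count):
        server = choose(servers, policy)
        if server is None:
            break
        server.used += 1
        placement.append(server.name)
    return placement


print(place_all([Server("a", 4), Server("b", 4), Server("c", 4)], 4, "packing"))
print(place_all([Server("a", 4), Server("b", 4), Server("c", 4)], 4, "striping"))
```

Packing fills server `a` before touching the others, while striping rotates across `a`, `b`, and `c`; the remaining policies on the slide would plug in as further ranking keys.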
-
Reporting and Accounting
10/9/2009
-
User Access Point – LSF Web Portal
App Centric Interface
Job notification
-
User Access Point – LSF Web Portal
Job data management
Customize without HTML programming
-
Summary
• Cloud is happening due to its superior SLA
• Cloud adds dynamism and elasticity to grid and cluster environments
• Platform ISF delivers elastic logical clusters in a shared HPC infrastructure. It transforms cluster and grid to cloud
-
Thank You
www.platform.com