AWSUGIT: Hybrid cloud in support of scientific computing
Hybrid cloud in support of scientific computing
Claudio Pontili
www.vexpert.it
21/02/2015 | Claudio Pontili | Hybrid cloud in support of scientific computing
it.linkedin.com/in/claudiopontili
@vexpert_it
Fermilab
• Fermilab is America’s premier national laboratory for particle
physics research, funded by the U.S. Department of Energy.
Thousands of scientists from universities and laboratories
around the world collaborate at Fermilab on experiments at
the frontiers of discovery.
Introduction: Fermilab Experiment Schedule
• Measurements at all frontiers: electroweak physics, neutrino oscillations, muon g-2, dark energy, dark matter
• 8 major experiments in 3 frontiers running simultaneously in 2016
• Sharing both beam and computing resources
• Impressive breadth of experiments at FNAL
[Figure: Fermilab experiment schedule, FY16]
Introduction: Slots at Fermilab, OSG, and Clouds
• Current full capacity of FermiGrid ~30k slots
• Full capacity of OSG (Open Science Grid) ~85k slots
• Additional OSG opportunistic slots 15k – 30k
• Additional pay-per-use slots at commercial clouds
Hybrid cloud using GlideinWMS and HTCondor
• The purpose of GlideinWMS is to provide a simple way to access the resources.
• The user submits jobs; HTCondor downloads the input data and algorithms, then waits until the user downloads the output.
• Resources: private cloud, Grid resources, commercial cloud (AWS, Google Compute, Azure, etc.)
Running AWS NOvA jobs as a function of time, Oct 23, 2014
[Figure: number of running jobs (y-axis "Jobs", 0 to 1200) versus time (x-axis "Time", 2:21 to 7:50 in 7-minute ticks), with a "1 hour" interval marked]
• 3300 jobs
• 525 m3.large
• Total cost $449
• Prices are coming down
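The three figures on this slide can be turned into per-unit numbers. The sketch below only divides the quoted totals; the variable and function names are illustrative, and the averages assume the cost and the jobs were spread evenly across instances, which the slide does not state.

```python
# Totals quoted on the slide for the Oct 23, 2014 NOvA run on AWS.
TOTAL_JOBS = 3300
INSTANCES = 525            # m3.large instances
TOTAL_COST_USD = 449.0


def per_unit(total: float, units: int) -> float:
    """Average of `total` over `units` (assumes an even spread)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total / units


cost_per_job = per_unit(TOTAL_COST_USD, TOTAL_JOBS)
cost_per_instance = per_unit(TOTAL_COST_USD, INSTANCES)
jobs_per_instance = per_unit(TOTAL_JOBS, INSTANCES)

print(f"average cost per job:      ${cost_per_job:.3f}")       # $0.136
print(f"average cost per instance: ${cost_per_instance:.3f}")  # $0.855
print(f"average jobs per instance: {jobs_per_instance:.2f}")   # 6.29
```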
Autoscaling the backend architecture inside the cloud
• The bottleneck is in the backend architecture inside the cloud
Task A: Moving software and data to commercial cloud
using Scalable Squid Servers
• Need to transport software and data to the Cloud
• One m3.large with SSD delivers 500 Mbit/s, so 10 m3.large deliver 5 Gbit/s
• Auto-scalable Squid servers, deployed and destroyed in 30 seconds using a CloudFormation script
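The bandwidth figures above suggest a simple sizing rule. The sketch below assumes that throughput adds up linearly across Squid servers, as the 10 × 500 Mbit/s = 5 Gbit/s figure implies; the function names are hypothetical and not taken from the talk.

```python
import math

# Per-server throughput quoted on the slide for one m3.large with SSD.
PER_SERVER_MBIT_S = 500


def squid_servers_needed(target_mbit_s: float,
                         per_server_mbit_s: float = PER_SERVER_MBIT_S) -> int:
    """Smallest number of servers whose combined throughput meets the target.

    Assumes linear scaling, i.e. no shared bottleneck upstream of the servers.
    """
    if target_mbit_s <= 0:
        return 0
    return math.ceil(target_mbit_s / per_server_mbit_s)


def aggregate_mbit_s(servers: int,
                     per_server_mbit_s: float = PER_SERVER_MBIT_S) -> float:
    """Combined throughput of `servers` identical Squid servers."""
    return servers * per_server_mbit_s


print(squid_servers_needed(5000))   # 10 servers for 5 Gbit/s
print(aggregate_mbit_s(10))         # 5000 Mbit/s
```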
Task A: Scalable Squid Servers with CloudFormation
• Different environments (Development, Stage, Production)
• Different Regions (Oregon, N. Virginia, etc.)
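One way to support several environments and regions is to generate the same template with different parameters. The sketch below builds a minimal CloudFormation template as a Python dictionary. It is not the script used in the talk: the resource names, the size limits, and the `Environment` tag are assumptions made for illustration, and it uses a launch configuration, the mechanism available at the time.

```python
import json

ENVIRONMENTS = ("Development", "Stage", "Production")
# AWS region codes for the two regions named on the slide.
REGIONS = {"Oregon": "us-west-2", "N. Virginia": "us-east-1"}


def squid_template(environment: str, min_size: int = 1, max_size: int = 10) -> dict:
    """Return a minimal template for an auto-scaled group of Squid servers.

    Resource names and the tag are hypothetical; the AMI is passed in as a
    stack parameter because it differs per region.
    """
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Squid cache servers ({environment})",
        "Parameters": {"ImageId": {"Type": "AWS::EC2::Image::Id"}},
        "Resources": {
            "SquidLaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": {"Ref": "ImageId"},
                    "InstanceType": "m3.large",
                },
            },
            "SquidGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "LaunchConfigurationName": {"Ref": "SquidLaunchConfig"},
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    "Tags": [{
                        "Key": "Environment",
                        "Value": environment,
                        "PropagateAtLaunch": True,
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(squid_template("Stage"), indent=2))
```

The resulting JSON could then be deployed once per region, for example with `aws cloudformation create-stack` and a `--region` value taken from `REGIONS`.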
Task B: Auto-Scaling GlideinWMS adding/removing
HTCondor using Amazon Web Services
• New resources are made available through the WMS
(HTCondor)
• The system is designed to scale by adding servers
Task B: Auto-Scaling HTCondor using AWS - 2
• Problem: the submission system is a stateful service
– Easy to scale up
– Hard to scale down
• Solution? Manage the lifecycle of each server using AWS lifecycle hooks and Standby (released July 30th, 2014)
• State diagram states: Pending, Pending:Wait (lifecycle hook), InService, Standby, Terminating:Wait (lifecycle hook), Terminated
• At least 7 different Amazon services
• 2 different programming languages (Java and bash scripting)
• Role-based authentication: auto-generating and auto-rotating logins and passwords
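The server lifecycle can be modelled as a small state machine. The sketch below uses only the six states named in the state diagram; the allowed transitions are a simplified assumption (the real Auto Scaling lifecycle has additional intermediate states), and the `Instance` class is hypothetical.

```python
# Allowed transitions between the states named on the slide.
# Simplified assumption: intermediate AWS states are left out.
ALLOWED = {
    "Pending": {"Pending:Wait"},
    "Pending:Wait": {"InService"},            # launch hook completed
    "InService": {"Standby", "Terminating:Wait"},
    "Standby": {"InService", "Terminating:Wait"},
    "Terminating:Wait": {"Terminated"},       # termination hook completed
    "Terminated": set(),
}


class Instance:
    """Tracks the lifecycle state of one server (illustrative only)."""

    def __init__(self, instance_id: str) -> None:
        self.instance_id = instance_id
        self.state = "Pending"

    def move_to(self, new_state: str) -> None:
        """Change state, rejecting transitions not listed in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state


server = Instance("i-example")
for step in ("Pending:Wait", "InService", "Standby", "InService"):
    server.move_to(step)
print(server.state)  # InService
```

Standby is what makes scaling down safe for a stateful service: the server leaves the group without being destroyed, so it can return to InService later.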
Task C: Hybrid cloud – FermiCloud and Amazon Web Services
[Figure: architecture diagram; the recoverable labels are listed below]
• Custom metrics (idle and running jobs), AWS CLI
• S3: logs and init scripts
• Auto Scaling group and ELB controlled by custom metrics
• Scale-up event: instance launched, lifecycle hook, hook queue, SNS; custom action (Java): looking for a standby instance instead of creating a new one; instance attached to the Auto Scaling group and ELB
• Scale-down event: queue, SNS; custom action (Java): finding the right VM to terminate and changing its ASG state to standby; instance removed from the Auto Scaling group and ELB; standby instance (it will be terminated after 5 days)
• Role-based authentication, permissions
• No change for the final user, who just wants to deploy a job in the same old way and compute it as fast as possible
• Now we can handle spikes of traffic using commercial cloud
• We pay AWS only during spikes
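The two custom actions can be sketched as plain functions over a pool of servers. The talk implemented them in Java against AWS services; the Python below is a simplified model, not that implementation. The `Server` class, the function names, and the rule of picking the server with the fewest running jobs are assumptions; only the reuse of standby instances and the 5-day limit come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

STANDBY_TTL = timedelta(days=5)  # standby instances are terminated after 5 days


@dataclass
class Server:
    server_id: str
    state: str = "InService"            # "InService" or "Standby"
    running_jobs: int = 0
    standby_since: Optional[datetime] = None


def on_scale_up(pool: List[Server], new_id: str) -> Server:
    """Reuse a standby server if one exists, otherwise add a new one."""
    for server in pool:
        if server.state == "Standby":
            server.state = "InService"
            server.standby_since = None
            return server
    server = Server(new_id)
    pool.append(server)
    return server


def on_scale_down(pool: List[Server], now: datetime) -> Optional[Server]:
    """Move one in-service server to standby.

    Assumption: the "right VM" is the one with the fewest running jobs.
    """
    candidates = [s for s in pool if s.state == "InService"]
    if not candidates:
        return None
    victim = min(candidates, key=lambda s: s.running_jobs)
    victim.state = "Standby"
    victim.standby_since = now
    return victim


def expire_standby(pool: List[Server], now: datetime) -> List[Server]:
    """Remove and return servers that have been in standby for 5 days or more."""
    expired = [s for s in pool
               if s.state == "Standby" and s.standby_since is not None
               and now - s.standby_since >= STANDBY_TTL]
    for server in expired:
        pool.remove(server)
    return expired
```

Keeping a server in standby rather than terminating it at once preserves its state, which is what makes scaling down a stateful submission service manageable.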
Conclusions
• Experiments have an increased need for computing
resources with an increased diversity of requirements
• Managing these needs (especially peak demand) is a major
focus
• Experiments are being enabled to use a diverse set of
resources: Local, Grid, and Cloud
• This work demonstrated the scaling of on-demand services in
support of scientific workflows using native Amazon Services