AWSUGIT: Hybrid cloud in support of scientific computing


Hybrid cloud in support of scientific computing

Claudio Pontili

[email protected]

www.vexpert.it

Acknowledgements

21/02/2015Claudio Pontili | Hybrid cloud a supporto del calcolo scientifico 2

Claudio Pontili ... a few logos


it.linkedin.com/in/claudiopontili

@vexpert_it

FermiLab

• Fermilab is America’s premier national laboratory for particle physics research, funded by the U.S. Department of Energy. Thousands of scientists from universities and laboratories around the world collaborate at Fermilab on experiments at the frontiers of discovery.


Introduction: Fermilab Experiment Schedule

• Measurements at all frontiers: electroweak physics, neutrino oscillations, muon g-2, dark energy, dark matter

• 8 major experiments in 3 frontiers running simultaneously in 2016

• Sharing both beam and computing resources

• Impressive breadth of experiments at FNAL

[Figure: Fermilab experiment schedule, FY16]

Introduction: Slots at Fermilab, OSG, and Clouds

• Current full capacity of FermiGrid ~30k slots

• Full capacity of OSG (Open Science Grid) ~85k slots

• Additional OSG opportunistic slots 15k – 30k

• Additional pay-per-use slots at commercial clouds


Hybrid cloud using GlideinWMS and HTCondor

• The purpose of GlideinWMS is to provide a simple way to access the resources.

• The user submits jobs; HTCondor downloads the input data and algorithms, then waits until the user downloads the output.

• Resources: private cloud, Grid resources, commercial cloud (AWS, Google Compute, Azure, etc.)
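The submission step described on this slide can be illustrated with a minimal sketch that renders an HTCondor submit description file. The commands used (`executable`, `transfer_input_files`, `should_transfer_files`, `when_to_transfer_output`, `queue`) are standard HTCondor submit commands; the script name, input file names, and the helper `render_submit_description` are hypothetical, and this is not the submission tooling used at Fermilab.

```python
def render_submit_description(executable, input_files, n_jobs=1):
    """Return the text of an HTCondor submit description file.

    HTCondor ships the listed input files to the worker node and brings
    the output back when the job exits. All names passed in are
    illustrative assumptions, not taken from the talk.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # $(Process) is the index of the job within the submitted cluster.
        "arguments = $(Process)",
        "should_transfer_files = YES",
        f"transfer_input_files = {', '.join(input_files)}",
        "when_to_transfer_output = ON_EXIT",
        "output = job.$(Cluster).$(Process).out",
        "error = job.$(Cluster).$(Process).err",
        "log = job.$(Cluster).log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical analysis script and input files.
    print(render_submit_description("analyze.sh", ["input.dat", "config.txt"], n_jobs=3))
```

The rendered text would normally be saved to a file and passed to `condor_submit`; with GlideinWMS in front, the same description can land on private, Grid, or commercial cloud slots without changes on the user side.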


Running AWS NOvA jobs as a function of time, Oct 23, 2014

[Figure: number of running jobs (y-axis, 0 to 1200) versus time (x-axis, 2:21 to 7:50), with a 1 hour scale marker]

• 3300 jobs

• 525 m3.large

• Total cost $449

• Prices are coming down
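The figures on this slide allow a quick back-of-the-envelope calculation. The sketch below assumes that the $449 total covers all 3300 jobs and all 525 m3.large instances of this run; the slide does not say whether storage or data transfer are included, so the averages are indicative only.

```python
# Figures quoted on the slide.
TOTAL_COST_USD = 449.0
JOBS = 3300
INSTANCES = 525


def average_costs(total_cost, jobs, instances):
    """Return (cost per job, cost per instance) as simple averages."""
    return total_cost / jobs, total_cost / instances


cost_per_job, cost_per_instance = average_costs(TOTAL_COST_USD, JOBS, INSTANCES)
jobs_per_instance = JOBS / INSTANCES

if __name__ == "__main__":
    print(f"cost per job:      ${cost_per_job:.3f}")
    print(f"cost per instance: ${cost_per_instance:.3f}")
    print(f"jobs per instance: {jobs_per_instance:.2f}")
```

Under these assumptions the run works out to roughly $0.14 per job and $0.86 per instance, with about 6.3 jobs per instance.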

Autoscaling the backend architecture inside the cloud

• The bottleneck is the backend architecture inside the cloud


Task A: Moving software and data to commercial cloud using Scalable Squid Servers

• Need to transport software and data to the Cloud

• One m3.large with SSD delivers 500 Mbit/s, so 10 m3.large deliver 5 Gbit/s

• Auto-scalable squid servers, deployed and destroyed in 30 seconds using a CloudFormation script
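The sizing rule on this slide (500 Mbit/s per m3.large, 5 Gbit/s for 10 of them) can be written as a small calculation. This sketch assumes throughput scales linearly with the number of squid servers, as the slide's figures imply; the function names are illustrative.

```python
import math

# Per-server throughput quoted on the slide for one m3.large with SSD.
PER_SERVER_MBIT_S = 500


def aggregate_mbit_s(n_servers, per_server_mbit_s=PER_SERVER_MBIT_S):
    """Aggregate throughput of n squid servers, assuming linear scaling."""
    return n_servers * per_server_mbit_s


def squid_servers_needed(target_mbit_s, per_server_mbit_s=PER_SERVER_MBIT_S):
    """Smallest number of servers whose aggregate throughput meets the target."""
    if target_mbit_s <= 0:
        return 0
    return math.ceil(target_mbit_s / per_server_mbit_s)


if __name__ == "__main__":
    print(aggregate_mbit_s(10))        # throughput of 10 servers in Mbit/s
    print(squid_servers_needed(5000))  # servers needed for 5 Gbit/s
```

A target that is not a multiple of 500 Mbit/s rounds up, so 5.2 Gbit/s would call for 11 servers under the same assumption.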


Task A: Scalable Squid Servers with CloudFormation

• Different environments (Development, Stage, Production)

• Different Regions (Oregon, N. Virginia, etc.)
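One way to support several environments and regions from a single CloudFormation script is to make the environment a template parameter and to look up the machine image per region. The sketch below builds such a template as a Python dictionary and serializes it to JSON. It is an illustration, not the template used in the talk: the AMI IDs are placeholders, the resource names are hypothetical, and a launch configuration is used because it was the common mechanism at the time (newer stacks would typically use a launch template).

```python
import json

ENVIRONMENTS = ["Development", "Stage", "Production"]

# Placeholder AMI IDs (hypothetical): replace with real squid images per region.
REGION_AMIS = {
    "us-west-2": {"AMI": "ami-PLACEHOLDER-OREGON"},
    "us-east-1": {"AMI": "ami-PLACEHOLDER-NVIRGINIA"},
}


def squid_template(min_size=1, max_size=10):
    """Build a CloudFormation template for an auto-scaled squid tier."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: auto-scalable squid servers",
        "Parameters": {
            "Environment": {
                "Type": "String",
                "AllowedValues": ENVIRONMENTS,
                "Default": "Development",
            }
        },
        "Mappings": {"RegionAmi": REGION_AMIS},
        "Resources": {
            "SquidLaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    # Pick the image for whichever region the stack runs in.
                    "ImageId": {
                        "Fn::FindInMap": ["RegionAmi", {"Ref": "AWS::Region"}, "AMI"]
                    },
                    "InstanceType": "m3.large",
                },
            },
            "SquidGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "LaunchConfigurationName": {"Ref": "SquidLaunchConfig"},
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    "Tags": [
                        {
                            "Key": "Environment",
                            "Value": {"Ref": "Environment"},
                            "PropagateAtLaunch": True,
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(squid_template(), indent=2))
```

The same JSON can then be deployed once per environment and region, which keeps Development, Stage, and Production stacks structurally identical.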


Task B: Auto-Scaling GlideinWMS adding/removing HTCondor using Amazon Web Services

• New resources are made available through the WMS (HTCondor)

• The system is designed to scale by adding servers


Task B: Auto-Scaling HTCondor using AWS - 2

• Problem: the submission system is a stateful service

– Easy to scale up

– Hard to scale down

• Solution? Manage the lifecycle of each server using AWS Hooks and Standby (released July 30th 2014)

• State diagram (states shown): Pending, Pending:Wait (lifecycle hook), InService, Standby, Terminating:Wait (lifecycle hook), Terminated

• At least 7 different Amazon services

• 2 different programming languages (Java and bash scripting)

• Role-based authentication: auto-generating and auto-rotating logins and passwords
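The lifecycle states named on this slide can be modeled as a small state machine. The sketch below is a simplification: the extracted slide preserves only the state names, so the allowed transitions are assumptions, and the real Auto Scaling lifecycle has additional intermediate states (for example Pending:Proceed, EnteringStandby, and Terminating:Proceed) that are omitted here.

```python
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    PENDING_WAIT = "Pending:Wait"
    IN_SERVICE = "InService"
    STANDBY = "Standby"
    TERMINATING_WAIT = "Terminating:Wait"
    TERMINATED = "Terminated"


# Assumed transitions between the states shown on the slide.
ALLOWED = {
    State.PENDING: {State.PENDING_WAIT},
    State.PENDING_WAIT: {State.IN_SERVICE},        # launch hook completed
    State.IN_SERVICE: {State.STANDBY, State.TERMINATING_WAIT},
    State.STANDBY: {State.IN_SERVICE, State.TERMINATING_WAIT},
    State.TERMINATING_WAIT: {State.TERMINATED},    # terminate hook completed
    State.TERMINATED: set(),
}


def transition(current, target):
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def walk(path):
    """Follow a sequence of states, validating every step."""
    current = path[0]
    for target in path[1:]:
        current = transition(current, target)
    return current


if __name__ == "__main__":
    scale_up = [State.PENDING, State.PENDING_WAIT, State.IN_SERVICE]
    print(walk(scale_up).value)
```

The two Wait states are where a lifecycle hook pauses the instance, which gives a stateful service time to register on the way in and to drain on the way out.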


Task C: Hybrid cloud – Fermicloud and Amazon Web Services

• No change for the end user, who just wants to submit a job in the same old way and have it computed as fast as possible

• Now we can handle traffic spikes using the commercial cloud

• We pay AWS only during spikes

[Architecture diagram; labels as extracted]

• Custom Metrics (idle and running jobs)

• AWS CLI

• S3: logs and init scripts

• Auto Scaling Group and ELB controlled by Custom Metrics

• Role-based authentication; permissions

Scale-up event:

• Instance launched

• Lifecycle hook

• Hook queue, SNS

• Custom action (Java): look for a standby instance instead of creating a new one

• Instance attached to the Auto Scaling Group and ELB

Scale-down event:

• Instance removed from the Auto Scaling Group and ELB

• Standby instance (it will be terminated after 5 days)

• Queue, SNS

• Custom action (Java): find the right VM to terminate and change its ASG state to standby
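The two custom actions named in the diagram can be sketched in a few lines. The talk's implementation was written in Java against AWS services; this Python sketch only illustrates the decision logic on an in-memory list. The `Worker` type and both function names are hypothetical, and choosing the worker with the fewest running jobs is an assumption, since the slide does not say how the "right VM" is selected.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    instance_id: str
    state: str            # "InService" or "Standby"
    running_jobs: int = 0


def scale_up(workers, launch_new):
    """Reuse a standby worker if one exists; otherwise launch a new one."""
    for worker in workers:
        if worker.state == "Standby":
            worker.state = "InService"
            return worker
    new_worker = launch_new()
    workers.append(new_worker)
    return new_worker


def scale_down(workers):
    """Move the least busy in-service worker to standby (assumed policy)."""
    candidates = [w for w in workers if w.state == "InService"]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda w: w.running_jobs)
    chosen.state = "Standby"
    return chosen


if __name__ == "__main__":
    pool = [Worker("i-a", "InService", 5), Worker("i-b", "InService", 0)]
    print(scale_down(pool).instance_id)
    print(scale_up(pool, lambda: Worker("i-new", "InService")).instance_id)
```

Parking a worker in standby rather than terminating it lets running jobs finish and makes the next scale-up cheaper, which matches the diagram's note that standby instances are only terminated after 5 days.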


Conclusions

• Experiments have an increased need for computing resources, with an increased diversity of requirements

• Managing these needs (especially peak demand) is a major focus

• Experiments are being enabled to use a diverse set of resources: Local, Grid, and Cloud

• This work demonstrated the scaling of on-demand services in support of scientific workflows using native Amazon services


Thank you

• Questions?
