Evolving Data Logistics Needs of the LHC Experiments

30
Evolving Data Logistics Needs of the LHC Experiments Paul Sheldon Vanderbilt University

description

Paul Sheldon Vanderbilt University. Evolving Data Logistics Needs of the LHC Experiments. Outline. To meet its computing and storage needs (and socio/political and funding needs), the LHC experiments deployed a globally distributed “grid” computing model. - PowerPoint PPT Presentation

Transcript of Evolving Data Logistics Needs of the LHC Experiments

Page 1: Evolving Data Logistics Needs of the LHC Experiments

Evolving Data Logistics Needs of the

LHC Experiments

Paul SheldonVanderbilt University

Page 2: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 2

Outline To meet its computing and storage needs

(and socio/political and funding needs), the LHC experiments deployed a globally distributed “grid” computing model.

Successfully demonstrated grid concepts in production at an extremely large scale – enabled significant science!

Short term (2 years) and longer term (10 years) projects will require them to significantly evolve their computing model. Wide area distributed storage Agile, intellegent network based solutions: DYNES

and ANSE

Thanks to Artur Barczyk, Ken Bloom, Eric Boyd, Oli Gutsche, Ian Fisk, Harvey Newman, Jim Shank, Jason Zurawski, and others for slides and ideas.

September 5, 2013

Page 3: Evolving Data Logistics Needs of the LHC Experiments

3

A High Profile Success

September 5, 2013Emerging Data Logistics Needs of the LHC Expts

Page 4: Evolving Data Logistics Needs of the LHC Experiments

4

A High Profile Success

September 5, 2013Emerging Data Logistics Needs of the LHC Expts

The LHC would not have been successful without

grid computing, data logistics,

and robust high performance

networks

Page 5: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 5

Distributed Compute Resources and People

September 5, 2013

CMS Collaboration: 38 Countries, >163 Institutes

Page 6: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 6

Distributed Compute Resources and People

LHC Experiments decided early on that it was not possible or even desirable to put all or even most computing resources at CERN: Sociology, Politics, Funding

Physics is done in small groups, which are themselves geographically distributed

To maximize the quality and rate of scientific discovery, all physicists must have equal ability to access and analyze the experiment's data

September 5, 2013

Page 7: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 7

Distributed Compute Resources and People

LHC Experiments decided early on that it was not possible or even desirable to put all or even most computing resources at CERN: Sociology, Politics, Funding

Physics is done in small groups, which are themselves geographically distributed

To maximize the quality and rate of scientific discovery, all physicists must have equal ability to access and analyze the experiment's data

September 5, 2013

Adopt a grid-like computing model, use those pieces of grid concepts and tools that were

“ready” and needed.

Page 8: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 8

Tiered Computing Model

September 5, 2013

Tier 0

LHC raw data logged to the “Tier 0” compute center at CERN. T0 does initial processing, verification and data quality tests.

Page 9: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 9

Tiered Computing Model

September 5, 2013

Tier 1

Tier 1 Tier 1’s

Tier 1’s

Tier 0

Data is then distributed to Tier 1 centers, the top level centers for each region of the world (US CMS Tier 1 is at Fermilab).

These process the data to create “data products” that physicists will use

Not all Tier 1’s are shown

20 Gbs dedicated network

Page 10: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 10

Tier 2 and Tier 3 Centers

8 Regional Tier 2’s in the US, 53 worldwide.

Tier 3’s serve individual Universities

Tier 2 and 3’s provide the computing and storage used by physicists to analyze the data & produce physics.

Physicists work in small groups, but those groups may be globally distributed.

September 5, 2013

DataFlow

Page 11: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 11

Analysis Projects are Mostly Self-Organized

Worker nodes at each T2/3 site present a batch system

Access through grid interface/certs (no local access)

User analysis workflows built and submitted by “grid enabled” software framework (CRAB, PanDA) breaks tasks into jobs – decides where to

run them – monitors them – retrieves the output when they are done.

Data tethering: originally jobs could only run at a site if they were hosting the required data locally Evolving model (more later) is relaxing this

requirement

September 5, 2013

Page 12: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 12

How

is it

W

orki

ng?

200K – 300K LHC jobs launched/day

September 5, 2013

CMSTier 1

CMSTier 2

CPU

Hou

rs (H

S06h

)C

PU H

ours

(HS0

6h)

1 H

S06h

~ 1

GFL

OP

hour

http

://w

3.he

pix.

org/

benc

hmar

ks/d

oku.

php/

Page 13: Evolving Data Logistics Needs of the LHC Experiments

13

Data Movement Some data is strategically placed based on

high level “management” decisions Users can also request datasets be placed at

“friendly sites” where resources have been made available to their analysis group. (They can whitelist those sites in CRAB/PanDA.)

System for Organized Transfers (PhEDEx in CMS)… Destination based transfer requests (transfer

dataset X to site A) System picks out source sites based on link quality

and load (several source sites might be used) PhEDEx schedules GridFTP transfers between sites

through several layers of software Software layers protect sites, enable multiple

streams and take care of authentication (Globus tools, SRM, FTS, etc.)

September 5, 2013Emerging Data Logistics Needs of the LHC Expts

Page 14: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 14

How

is it

W

orki

ng?

September 5, 2013

ATLAS

CMS2012

Thro

ughp

ut (M

B/S

)Th

roug

hput

(MB

/S)

2012

AVG: ~5 GB/s

AVG: ~1 GB/s

Combined:180 PB/year

Aggregate Bandwidth on all Tier N Links

Page 15: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 15

Model Has Evolved Originally strictly

hierarchical In addition to strategic

pre-placement of data sets…

Meshed data flows: Any site can use any other site to get data

Dynamic data caching: Analysis sites will pull datasets from other sites “on demand”

Remote data access: Jobs executing locally access data at remote sites in quasi-real time

September 5, 2013

Page 16: Evolving Data Logistics Needs of the LHC Experiments

16

Near Term Data Requirments: LHC Run

2 LHC Run 2 (begins 2015): Data xfer volume 2-

4 times larger Energy increase 8 to 13 TeV – more complex events,

larger event sizes Trigger rate increase from 300-400 Hz to 1 kHz

Dynamic data placement and cache release replicate samples in high demand automatically Release cache of replicas automatically when

demand decreases Establish CMS data federation for all CMS

files on disk Allow remote access of files with local caching as job

runs processing can start before samples arrive at

destination site Higher demand for sustained network bandwidth, fill

gaps btwn bursts Increase diversity of resources: get access to

more resources

September 5, 2013Emerging Data Logistics Needs of the LHC Expts

Page 17: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 17

10 Year Projections Requirements in 10 years? Look 10

years back…

September 5, 2013

Sour

ce: I

an F

isk

and

Jim

Sha

nk

Page 18: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 18

Fisk and Shank 10 Year Projections

The processing has increased by a factor of 30 in capacity

This is essentially what would be expected from a Moore’s law increase with a 2 year cycle Says we spent similar amounts

Storage and networking have both increased by factor of 100 10 times trigger and 10 times event size

In their judgment, future programs will likely show similar growth

“How to make better use of resources as the technology changes?” September 5, 2013

Fisk and Shank, Snowmass Planning Exercise for High Energy Physics

Page 19: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 19

Fisk/Shank View of the Future

September 5, 2013

Page 20: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 20

…Fisk/Shank View of Future

September 5, 2013

Page 21: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 21

…Fisk/Shank View of Future

September 5, 2013

Page 22: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 22

DYNES: Addressing Growing/Evolving Network Needs

NSF MRI-R2: DYnamic NEtwork System (DYNES)

A nationwide cyber-instrument spanning ~40 US universities and ~14 Internet2 connectors Extends Internet2/ESnet Dynamic Circuit tools (OSCARS, …)

Who is it? Project Team: Internet2, Caltech, Univ. of Michigan, and Vanderbilt Univ. Community of regional networks and campuses LHC, astrophysics community, OSG, WLCG, other virtual organizations

What are the goals? Support large, long-distance scientific data flows in the LHC, other data

intensive science (LIGO, Virtual Observatory, LSST,…), and the broader scientific community

Build a distributed virtual instrument at sites of interest to the LHC but available to R&E community generally September 5, 2013

Page 23: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 23

Why DYNES? LHC exemplifies growing demand for research

bandwidth LHC “Tiered” computing infrastructure

already encompasses 140+ sites moving data at an average rate of ~5 GB/s

LHC data volumes and transfer rates are expected to expand by an order of magnitude over the next few years US LHCNet, for example, is expected to reach 280-

400 Gbps by 2014 between its points of presence in NYC, Chicago, CERN and Amsterdam

Network usage on this scale can only be accommodated with planning, an appropriate architecture, and nationwide community involvement By the LHC groups at universities and labs By campuses, regional and state networks

connecting to Internet2 By ESnet, US LHCNet, NSF/IRNC, major networks in

US & Europe

September 5, 2013

Page 24: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 24

DYNES Mission/Goals

Provide access to Internet2’s dynamic circuit network in a manageable way, with fair-sharing Based on a ‘hybrid’ network architecture

utilizing both routed and circuit based paths.

Enable such circuits w/ bandwidth guarantees across multiple US network domains and to Europe Will require scheduling services at some

stage Build a community with high

throughput capability using standardized, common methods

September 5, 2013

Page 25: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 25

Why Dynamic Circuits?

Internet2’s and ESnet’s dynamic circuits services provide: Increased effective bandwidth capacity, and

reliability of network access, by isolating traffic from standard IP “many small flows” traffic

Guaranteed bandwidth as a service by building a system to automatically schedule and implement virtual circuits

Improved ability of scientists to access network measurement data for network segments end-to-end via perfSONAR monitoring

Static “nailed-up” circuits will not scale. GP network firewalls incompatible with

enabling large-scale science network dataflows

Implementing many high capacity ports on traditional routers would be expensive

September 5, 2013

Page 26: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 26

DYN

ES

Dat

aflow

September 5, 2013

user requests via a DYNES Transfer URL

DYNES provided equipment at end sites and regionals (Inter-Domain Controllers, Switches, FDT Servers)

DYNES Agent will provide the functionality to request a circuit instantiation, initiate and manage the data transfer, and terminate the dynamically provisioned resources.

Site A

Site Z

Dynamic Circuit

Page 27: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 27

DYNES Topology

September 5, 2013

Based on Applications Rcvd from Potential Sites

Page 28: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 28

ANSE: Integrating DYNES into LHC

Computing Frameworks NSF CC-NIE: Advanced Network Services for

Expts (ANSE) Project Team: Caltech, Michigan, Texas-Arlington, & Vanderbilt Goal: Enable strategic workflow planning including network

capacity as well as CPU and storage as a co-scheduled resource Path Forward:

Integrate advanced network-aware tools with the mainstream production workflows of ATLAS and CMS

Use in-depth monitoring and network provisioning Complex workflows: a natural match and a challenge for SDN

Exploit state of the art progress in high throughput long distance data transport, network monitoring and control

September 5, 2013

Page 29: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 29

ANSE Methodology Use agile, managed bandwidth for tasks with

levels of priority along with CPU and disk storage allocation. define goals for time-to-completion, with reasonable

chance of success define metrics of success, such as the rate of work

completion, with reasonable resource usage efficiency

define and achieve “consistent” workflow Dynamic circuits a natural match Process-Oriented Approach

Measure resource usage and job/task progress in real-time

If resource use or rate of progress is not as requested/planned, diagnose, analyze and decide if and when task re-planning is needed

Classes of work: defined by resources required, estimated time to complete, priority, etc.

September 5, 2013

Page 30: Evolving Data Logistics Needs of the LHC Experiments

Emerging Data Logistics Needs of the LHC Expts 30

Summary and Conclusions

Successful globally distributed computing model has been extremely important for LHC’s success

Have intentionally not been bleeding-edge, but have certainly validated the distributed/grid model

For the future, CPU may scale sufficiently but we can’t afford to grow storage by a factor of 100 Agile, large, intelligent network-based

systems are key Leads to less bursty, more sustained use of

networks Projects such as DYNES and ANSE are

essential first steps as the LHC evolves to what Harvey Newman calls a data intensive Content Delivery Network

September 5, 2013