Software & GRID Computing for LHC (Computing Work in a Big Group)


Transcript of Software & GRID Computing for LHC (Computing Work in a Big Group)

Page 1: Software & GRID Computing for LHC (Computing Work in a Big Group)

Software & GRID Computing for LHC
(Computing Work in a Big Group)

Frederick Luehring
[email protected]
February 10, 2011

P410/P609 Class Presentation

Page 2: Software & GRID Computing for LHC (Computing Work in a Big Group)

Outline
• This talk will have three main sections:
  – A short introduction to the LHC and ATLAS
    • LHC/ATLAS serves as an example of large-scale scientific computing
  – Writing software in a large collaboration
  – Processing a large data stream


Page 3: Software & GRID Computing for LHC (Computing Work in a Big Group)

Hal's Motivation for this Talk
• Hal wanted me to tell you about how a large scientific collaboration deals with data processing, to give you that perspective on scientific computing.
• Of course, even a large experiment has individuals writing individual algorithms, just as a small experiment does.
• However, in the end, for a large collaboration some process has to glue all of those individual algorithms together and use the resulting code to analyze data.


Page 4: Software & GRID Computing for LHC (Computing Work in a Big Group)

A Little About LHC & ATLAS


Page 5: Software & GRID Computing for LHC (Computing Work in a Big Group)

What are LHC & ATLAS?
• LHC is the "Large Hadron Collider", which is the largest scientific experiment ever built.
  – This is the accelerator that produces the energetic proton beams that we use to look for new physics discoveries.
• ATLAS (A Toroidal LHC ApparatuS) is one of four experiments at the LHC and one of the two large general-purpose experiments at CERN.
  – The ATLAS detector will produce petabytes of data each year it runs.

• Billions of Swiss Francs, Euros, Dollars, Yen, etc. have been spent to build LHC and ATLAS.


Page 6: Software & GRID Computing for LHC (Computing Work in a Big Group)

LHC Project
• ATLAS is one of two general-purpose detectors at the LHC (CMS is the other detector).


Page 7: Software & GRID Computing for LHC (Computing Work in a Big Group)

ATLAS Detector
[Detector diagram; the labels give per-subsystem electronic channel counts: 1.45×10⁸, 1.2×10⁶, 190,000, 10,000, and 3,600.]


Page 8: Software & GRID Computing for LHC (Computing Work in a Big Group)

ATLAS Detector Construction
[Photo; notice the size of the man for scale.]


Page 9: Software & GRID Computing for LHC (Computing Work in a Big Group)

Data Producer
• Trigger chain (rates and approximate data volumes):
  – Beam crossings: 40 MHz (80 TB/sec)
  – Level 1 (hardware trigger): O(75) kHz (150 GB/sec)
  – Level 2 (RoI analysis): O(2) kHz (4 GB/sec)
  – Level 3 (Event Filter): 200 Hz (O(400) MB/sec written out)

• RAW – raw readout data – 1.6 MB/event
• ESD – Event Summary Data – 0.5 MB/event
• AOD – Analysis Object Data – 0.1 MB/event
• TAG – Event-level metadata tags – 1 kB/event
• DPD – Derived Physics Data – n-tuple type, user based
(A rough sizing sketch follows below.)
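To make these sizes concrete, here is a rough back-of-the-envelope sketch in Python (my own illustration, not ATLAS code), assuming the per-event sizes above, the ~200 Hz output rate of the Event Filter, and the ~10^7 s of beam per year quoted later in this talk:

    # Back-of-the-envelope yearly volume per data format (illustrative assumptions only).
    EVENT_SIZES_MB = {"RAW": 1.6, "ESD": 0.5, "AOD": 0.1, "TAG": 0.001}
    RATE_HZ = 200          # events written out per second by the Event Filter
    LIVE_SECONDS = 1e7     # approximate beam time per year

    for fmt, size_mb in EVENT_SIZES_MB.items():
        volume_pb = size_mb * RATE_HZ * LIVE_SECONDS / 1e9   # MB -> PB (decimal prefixes)
        print(f"{fmt}: ~{volume_pb:.2f} PB/year")

Running this gives ~3.2 PB/year of RAW, ~1 PB/year of ESD, ~0.2 PB/year of AOD, and only a few TB/year of TAG data.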


Page 10: Software & GRID Computing for LHC (Computing Work in a Big Group)

Writing Software in a Collaboration


Page 11: Software & GRID Computing for LHC (Computing Work in a Big Group)

Collaborative SW Building in ATLAS
• There are just over 800 people registered and about 400 active ATLAS software developers (3/2009).
  – These developers are located in ~40 countries worldwide and, using their code, ATLAS must build timely, working software releases.
• Relatively few computing professionals – most people are physicists.
  – Democratic coordination of effort.
    • No way to fire anyone…
• Central SVN code repository at CERN (US mirror).
  – There are ~7 million lines of code in ~3000 packages, written mainly in C++ and Python (also Fortran, Java, Perl, SQL).
• The main platform is Linux/gcc (SLC5, which is basically RHEL5).
  – Had to close parts of our computing infrastructure to the world (much info no longer findable using Google).


Page 12: Software & GRID Computing for LHC (Computing Work in a Big Group)

Collaborative SW Building in ATLAS
• The ATLAS Software Infrastructure Team (SIT) provides the infrastructure used by these developers to collaboratively produce many versions of the ATLAS software.
  – There have been 16(!) major releases of the ATLAS SW since May of 2000 and hundreds(!) of smaller releases.
• Note: ATLAS shares lots of code between the online (Trigger/DAQ) software and the offline software.
  – This talk is mainly about the offline portion of the software.
    • Online code is the software used to select which 200-400 beam crossings per second will be written out for further analysis (out of 40 million bunch crossings per second).
  – This sharing of offline and online code is unusual and results from the design of the ATLAS Trigger. Much of the trigger is run on commodity PCs instead of specially built electronics.


Page 13: Software & GRID Computing for LHC (Computing Work in a Big Group)

Membership of the SIT
• The ATLAS Software Infrastructure Team is made up of ~30 people.
  – There are only a few people working full time on SIT activities; most people are volunteers working part time.
  – Many of the SIT members are computing professionals, with a few physicists & technical physicists included.
    • Those ~30 people amount to ~12 Full Time Equivalents (FTEs).
  – Just like the developers, the SIT members are spread out worldwide with a core of about 4-5 people at CERN.
  – Most SIT members receive only service work credit.
  – Quarterly meetings at CERN, bi-weekly phone calls, and huge numbers of emails are used to coordinate the work.


Page 14: Software & GRID Computing for LHC (Computing Work in a Big Group)

Software Versions/Diversity
• Beyond the large number of developers and the large amount of code, the SIT is supporting many versions of the software.
  – Each community (e.g. Simulation, Reconstruction, Cosmic Running, Monte Carlo production) can & do request builds.
  – This puts a large stress on the SIT because often (usually!) priorities conflict between the various groups.
  – Various meetings are held to define the priorities.
• To aid in managing this diversity, the software is partitioned into ~10 offline SW projects (light blue on the diagram). This allows the SIT to separate work flows and to control dependencies.


Page 15: Software & GRID Computing for LHC (Computing Work in a Big Group)

A Few Words about the Online Code
• The online code must be written to a high standard.
  – If the online code is wrong, there is a risk of not recording data about the physics events of interest. Events not selected by the trigger are lost forever and cannot be recovered.
  – The trigger has multiple levels and the lowest level is executed millions of times per second. While the first use of offline-developed code actually begins at the second trigger level, which "only" runs at 75 kHz, even a tiny memory leak per execution is an enormous problem (see the illustration below).
    • When memory is exhausted, the trigger algorithm must be restarted and this takes a significant amount of time.
• There is a very strict testing and certification process for the online code and this has a direct effect on any offline code that is used in the online.
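As a quick illustration of the memory-leak point above (my own numbers, chosen only for illustration):

    # Hypothetical leak of 100 bytes per event at the ~75 kHz Level 2 input rate.
    leak_bytes_per_event = 100
    rate_hz = 75_000

    leak_mb_per_s = leak_bytes_per_event * rate_hz / 1e6
    print(f"~{leak_mb_per_s:.1f} MB leaked per second")             # ~7.5 MB/s
    print(f"~{leak_mb_per_s * 3600 / 1e3:.0f} GB leaked per hour")  # ~27 GB/hour

Even a 100-byte leak per event exhausts a node's memory within hours, forcing the costly restart described above.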


Page 16: Software & GRID Computing for LHC (Computing Work in a Big Group)

Offline Code Release Strategy
• We have had to significantly modify our code release building strategy.
  – Previously we had nightly builds, development builds every month, production releases every six months, and bug fix production releases as needed.
    • Typically there were at most two releases operating at one time: one for fixing bugs in the current production release and one that was developing new functionality for the next production release.
  – Now there are ~35 releases (many built opt & debug).
    • As mentioned before, these releases are for simulation, central data reprocessing, migration to new external software, etc.
    • The main development, production, and bug fix releases are copies of nightly builds.
    • There are also several releases producing patches. These patch releases replace buggy code on the fly as the software starts up.
    • 44 servers (220 cores) are used to build the releases and do basic tests every night!


Page 17: Software & GRID Computing for LHC (Computing Work in a Big Group)

Multiple Releases and Coordinators
• The needs of the many user communities and the need to validate new versions of external software (e.g. ROOT, GEANT) have led to an intricate release coordination system to manage the multiple releases.
  – Each release has its own release coordinator who:
    • Controls what tags (new versions of SW packages) are allowed in the release
    • Checks the relevant validation tests each day
    • Moves new code from a "validation" (trial) state to a "development" (candidate) state to being in the final release (see the sketch after this list)
    • Tracks problems along the way with bug reports
  – There are also two release builders on shift to build the releases once the OK to build a release is given.
    • For important releases, many people may have to sign off, while for minor releases the release coordinator just asks the shifters to build it.
• We hold weekly global release coordination meetings to track issues needing resolution before the pending releases can be built.
  – These meetings also look at the relative priority of the pending releases.
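A minimal sketch of the tag promotion flow mentioned above (illustrative Python only; the real bookkeeping is done in the ATLAS Tag Collector and the state names here are simplified):

    # Illustrative only: package tags move validation -> development -> release.
    ALLOWED_TRANSITIONS = {
        "validation": "development",   # promoted once the nightly validation tests pass
        "development": "release",      # promoted once the release coordinator signs off
    }

    def promote(tag_state: str, tests_passed: bool) -> str:
        """Advance a tag one state if its tests passed, otherwise leave it where it is."""
        if tests_passed and tag_state in ALLOWED_TRANSITIONS:
            return ALLOWED_TRANSITIONS[tag_state]
        return tag_state

    print(promote("validation", tests_passed=True))    # -> development
    print(promote("development", tests_passed=False))  # stays in development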


Page 18: Software & GRID Computing for LHC (Computing Work in a Big Group)

Multiple Releases and Coordinators
• Even with aggressive release coordination and extensive validation, it takes time and multiple iterations of the building cycle to get a final production version of the software.
• Two mechanisms exist for adding adjustments and corrections to stable releases:
  1. Patch releases that allow the software to patch itself at runtime (does not require a full rebuild of the release).
  2. Bug fix releases that are full releases built by rolling up the patches.
• It is a constant struggle to meet the release schedule.
  – This is our toughest issue.
  – In the end you can't cut a software release until all of the bugs are fixed and the nightly test suite passes.


Page 19: Software & GRID Computing for LHC (Computing Work in a Big Group)

Coding
• ATLAS does have a set of coding standards.
  – Originally there was an ATLAS-developed system to automatically check compliance with the coding rules.
  – This system was discarded as being too hard to maintain.
    • Constant tweaking was required to get it to work correctly.
• ATLAS still has a naming convention for packages and publicly viewable objects.
  – Even though this is essentially enforced by an honor system, most developers comply with it.
• The release coordinators manually find compiler warnings and get the developers to fix them.
  – A large fraction of compiler warnings indicate real problems.
    • Floating point problems and unchecked status codes generally lead to fatal errors that terminate the program (see the sketch below).
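A sketch of the status-code discipline this refers to (illustrative Python; the real ATLAS framework uses a C++ StatusCode class, and the names below are made up):

    # Illustrative only: every call returns a status that the caller must check.
    from enum import Enum

    class Status(Enum):
        SUCCESS = 0
        FAILURE = 1

    def calibrate_channel(raw_value: float) -> tuple:
        """Return (status, calibrated value); the caller must inspect the status."""
        if raw_value < 0:                    # e.g. a dead or mis-read channel
            return Status.FAILURE, 0.0
        return Status.SUCCESS, raw_value * 1.02

    status, value = calibrate_channel(-5.0)
    if status is not Status.SUCCESS:         # the check that an unchecked-status warning is about
        raise RuntimeError("calibration failed; refusing to continue with a bogus value")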


Page 20: Software & GRID Computing for LHC (Computing Work in a Big Group)

User Coding
• The final layer of code is generally written by the end-user (i.e. physicists) and is often written to a lower coding standard.
  – This is a real problem because this code is used to make the results tables and plots that go into journal articles.
  – Often this coding is done under extreme time pressure because high energy physics is very competitive.
  – To help protect against problems, a tool set (the Physics Analysis Tools) has been developed. Not everyone uses the official tools and some user code is still required even if you are using the official tools.
  – There is a guidebook ("The Analysis Workbook") that contains numerous examples of how to write good analysis code.
• A lengthy period of review and interrogation of any result is required before it can be published.


Page 21: Software & GRID Computing for LHC (Computing Work in a Big Group)

Software Release Tools
• The SIT uses a number of tools to build, test, and distribute the SW:
  1. SVN – the code repository that holds the code submitted by the developers
  2. Tag Collector (TC) – manages which software versions are used in the release
  3. CMT – manages software configuration, build, and use & generates the makefile
  4. NICOS – drives nightly builds of the ATLAS software, making use of a variety of tools
  5. Selected nightlies become official releases
  6. Four automatic systems validate the software builds
  7. CMT, CMT-based packaging tools, Pacman – create & install the software distribution kit
• In addition to the above tools, we also use Doxygen to document our code automatically and 3 workbooks to teach developers/users how to use the software.
[Diagram labels: "Bedrock" base allowing the management of many developers and releases; the control framework of the software building engine; automatic validation; easy to use distribution tools.]


Page 22: Software & GRID Computing for LHC (Computing Work in a Big Group)

SW Validation
• ATLAS uses several tools to validate the building and the installation of its software.
  – AtNight (ATN) testing is part of the nightly builds and runs over 300 "smoke" tests each night.
  – Full Chain Testing (FCT) checks the entire software chain.
  – Run Time Tester (RTT) runs a long list of user-defined tests against all of the software projects for many builds daily.
  – Tier 0 Chain Testing (TCT) is used to check that the ATLAS software is installed and running correctly during data taking.
  – Kit Validation (KV) is a standalone package that is used to validate that the ATLAS software is installed and functioning correctly on all ATLAS production sites (Tier 1/2/3).
• The release coordinators actively monitor each day's test results to see if new SW can be promoted to candidate production releases and check that the candidate production release is working. (A minimal smoke-test sketch follows below.)
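For flavor, a minimal sketch of what a single "smoke" test amounts to (illustrative only; the real ATN/RTT frameworks are far more elaborate, and the job command shown is hypothetical):

    # Illustrative only: run one short job and decide pass/fail from exit code and log.
    import subprocess

    def run_smoke_test(command, logfile="smoke.log"):
        """Run a test job, capture its log, and fail on a bad exit code or ERROR/FATAL lines."""
        with open(logfile, "w") as log:
            result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
        if result.returncode != 0:
            return False
        with open(logfile) as log:
            return not any("ERROR" in line or "FATAL" in line for line in log)

    # Hypothetical job: reconstruct a handful of events and make sure nothing blows up.
    ok = run_smoke_test(["python", "run_reco.py", "--nevents", "10"])
    print("PASS" if ok else "FAIL")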


Page 23: Software & GRID Computing for LHC (Computing Work in a Big Group)

SW Validation: Use of RTT
• The RTT is the primary ATLAS SW validation system.
  – The RTT runs a significant number of validation tests against 10 software builds each day.
  – The RTT runs between 10 & 500 test jobs against a release.
    • The test jobs can run for several hours and test a package intensively.
    • Currently the RTT runs ~1300 test jobs each day.
  – The RTT uses a pool of 12 cores to schedule the test jobs and 368 cores to run the test jobs.
    • Additional scheduling cores are currently being brought online.
    • The RTT also needs a small disk pool of 1.4 TB.
• The RTT system is highly scalable: if more tests are needed or more releases need testing, the RTT team can add (and has added!) more nodes to the pool of servers used by the RTT system.


Page 24: Software & GRID Computing for LHC (Computing Work in a Big Group)

ATLAS Code Distribution
• ATLAS has used the Pacman open source product to install the offline software for the last several years.
  – The installation uses Pacman caches and mirrors.
• Recently we have begun using "pacballs" to install the software. A pacball is:
  – A self-installing executable.
  – Stamped with an md5sum in its file name (see the sketch below).
• The pacballs come in two sizes:
  1. Large (~3 GB) for full releases (dev, production, & bug fix).
  2. Small (~100 MB) for patches that are applied to full releases.
• Pacballs are used to install the software everywhere from individual laptops to large production clusters.
• ATLAS will now probably go to an RPM-based install.
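A sketch of the integrity check that the md5sum-in-the-file-name convention makes possible (illustrative only; the file name used here is made up):

    # Illustrative only: verify that a downloaded pacball matches the checksum in its name.
    import hashlib
    import re

    def verify_pacball(path):
        """Expect a name like 'SomeRelease-<32-hex-md5sum>.sh' and check the payload against it."""
        match = re.search(r"-([0-9a-f]{32})\.", path)
        if not match:
            raise ValueError(f"no md5sum found in file name: {path}")
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):   # read in 1 MB chunks
                digest.update(chunk)
        return digest.hexdigest() == match.group(1)

    # verify_pacball("AtlasProduction-0123456789abcdef0123456789abcdef.sh")  # hypothetical name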


Page 25: Software & GRID Computing for LHC (Computing Work in a Big Group)

Documentation
• ATLAS documents its code using Doxygen.
  – All developers are expected to put normal comments and special Doxygen comments into their source code (see the example below).
    • Once a day the Doxygen processor formats the comments and header information about the classes and members into web pages.
  – Twiki pages are then used to organize the Doxygen-generated web pages into coherent content.
  – There are other ways to see the Doxygen pages, including an alphabetical class listing and links from within the Tag Collector.
• ATLAS now has three "workbooks":
  1. The original introductory (getting started) workbook.
  2. A software developers workbook.
  3. A physics analysis workbook.
• ATLAS has had three coding "stand-down" weeks to write documentation.
• Significant time/effort has been spent to review the documentation.
• In spite of this, the documentation still needs more work.
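For flavor, a tiny example of Doxygen-style special comments (shown here in Python using Doxygen's "##" comment blocks; in ATLAS most such comments live in the C++ headers, and the function below is made up):

    ## @brief Convert a raw ADC count into an energy in GeV.
    #  @param adc_count   raw ADC value read from one channel (hypothetical example)
    #  @param calibration channel calibration constant, in GeV per ADC count
    #  @return the energy in GeV
    #  The daily Doxygen run picks up blocks like this and renders them as web pages.
    def adc_to_energy(adc_count, calibration):
        return adc_count * calibration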


Page 26: Software & GRID Computing for LHC (Computing Work in a Big Group)

Summary of Collaborative SW Building
• The SIT has adapted our tools and methods to the growing demand to support more developers, releases, validation, and documentation.
  – Even though the SIT is small, it provides effective central support for a very large collaborative software project.
• Producing working code on a schedule has been difficult but we have learned what is required to do the job:
  – Strong tools to ease the handling of millions of lines of code.
  – Lots of testing and code validation.
  – Lots of careful (human) coordination.
• We continue to proactively look for ways to further improve the ATLAS software infrastructure.


Page 27: Software & GRID Computing for LHC (Computing Work in a Big Group)

Processing Petabytes of Data
• Hint:
  – 1 PB = 1,000 TB = 1,000,000 GB
    • Or in some cases: 1 PB = 1,024 TB = 1,048,576 GB


Page 28: Software & GRID Computing for LHC (Computing Work in a Big Group)

Large Scale Data Production
• Many scientific endeavors (physics, astronomy, biology, etc.) run large data processing systems to analyze the contents of enormous datasets.
  – In these cases it is generally no longer possible for a single site (even a large laboratory) to have enough computing power and enough disk storage to hold and process the datasets.
  – This leads to the use of multiple data centers (distributed data processing) and having to split the data over multiple data centers.
  – Currently many experiments pre-position the data split over various data centers and, once the data is positioned, "send the job to the data" instead of "sending the data to the job".
    • ATLAS discontinued most pre-positioning when it was discovered that much of the pre-positioned data was never used.


Page 29: Software & GRID Computing for LHC (Computing Work in a Big Group)

The Scale of LHC Data Production
• The LHC beams cross every 25 ns. Each bunch crossing will produce ~23 particle interactions when the design luminosity of 10³⁴ cm⁻²s⁻¹ is achieved.
  – Initial LHC operation is at much lower luminosity (~10³⁰-10³² cm⁻²s⁻¹).
• The ATLAS trigger selects 200 or even 400 bunch crossings per second for full read-out.
  – The readout data size ranges from 1.7 MB/event initially to about 3.4 MB/event at full luminosity.
• When the LHC is fully operational, the LHC beams will be available to the experiments for ~10⁷ s/year.
  – Simple arithmetic gives 3.4 to 6.8 PB of readout data a year (depending on the exact luminosity); the arithmetic is spelled out below.
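Spelling out that arithmetic (a sketch of the numbers quoted above):

    # Readout volume per year = event size x trigger rate x live seconds.
    rate_hz = 200          # bunch crossings written out per second
    live_seconds = 1e7     # beam available to the experiments per year

    for size_mb in (1.7, 3.4):   # initial vs. full-luminosity event size
        petabytes = size_mb * 1e6 * rate_hz * live_seconds / 1e15
        print(f"{size_mb} MB/event -> ~{petabytes:.1f} PB/year")
    # prints 3.4 PB/year and 6.8 PB/year, matching the figure above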


Page 30: Software & GRID Computing for LHC (Computing Work in a Big Group)

Data Production Scale (continued)
• However, to produce publishable results, additional data is needed.
  – A Monte Carlo simulation dataset of comparable size to the readout (RAW + summary) data is required.
  – The ATLAS detector has a huge number of channels in each subsystem. Most of the detectors require substantial amounts of calibration data to produce precision results.
• All of this puts the amount of data well above 5 PB even in the first full years of running.


Page 31: Software & GRID Computing for LHC (Computing Work in a Big Group)

What is the Computing Model?
• The ATLAS computing model is the design of the ATLAS grid-based computing resources. The computing model contains details such as:
  – Where the computing resources will be located.
  – What tasks will be done at each computing resource.
  – How much CPU power is needed.
  – How much storage is needed.
  – How much network bandwidth is needed.
• All of these items must be specified as a function of time as the machine ramps up.
• Much of the computing model has had to be modified on the fly as problems have arisen during the initial running.


Page 32: Software & GRID Computing for LHC (Computing Work in a Big Group)

Multi-Tiered Computing Model
• The computing model used by the LHC experiments is a multi-tiered model that takes advantage of the large number of geographically distributed institutes involved and the availability of high speed networks.
  – The model maximizes the usage of resources from all over the world. Many of the resources are "free" to use.
  – The model makes it easier for the various funding agencies to contribute computing resources to the experiments.
  – ATLAS has ~240 institutes on 5 continents.
• The model is made possible by using the grid.
  – The grid allows the efficient use of distributed resources.
  – The grid also reflects the political reality that funding agencies like to build computer centers in their own country.


Page 33: Software & GRID Computing for LHC (Computing Work in a Big Group)

The ATLAS Data Flow

[Diagram (image supplied by Jim Shank; it is an old diagram, so the numbers are out of date): the Event Builder feeds the Event Filter (~1.5 MSI2k) at 4-8 GB/sec, and the Event Filter writes 400-800 MB/sec to the Tier 0 center at CERN (~5 MSI2k); an upstream label gives the raw detector output as ~80-160 TB/sec. Tier 0 sends ~300 MB/s per Tier 1 per experiment to the Tier 1 regional centers (e.g. the UK Regional Center at RAL, the US Regional Center at BNL, and the French and Italian Regional Centers), each ~2 MSI2k and ~2 PB/year. Tier 2 centers (~200 kSI2k and ~200 TB/year each; e.g. Sheffield, Manchester, Liverpool, Lancaster at ~0.25 TIPS) hold physics data caches, exchange 100-1000 MB/s with the higher tiers, send some data for calibration and monitoring to institutes, and calibrations flow back. Tier 3 desktops/workstations sit at the end of the chain. Total: ~9 PB/year, not counting simulation.]

Massive, distributed computing is required to extract rare signals of new physics from the ATLAS detector.

Page 34: Software & GRID Computing for LHC (Computing Work in a Big Group)

A Physicist's View of Grid Computing
• A physicist working at LHC starts from the point of view that we must process a lot of data.
  – Raw experimental data (data from the particle interactions at LHC): ~3-5 PB/year.
  – Simulated (Monte Carlo) data needed to understand the experimental data: more PB/year.
  – Physicists and computing resources are all over the world.
• We are not Walmart and our funding is small on the scale of the data. This means:
  – We must use all of our own computing resources efficiently.
  – We must take advantage of anyone else's resources that we can use: Opportunistic Use (e.g. IU's own Quarry).
  – Funding agencies like grids because they use resources efficiently.


Page 35: Software & GRID Computing for LHC (Computing Work in a Big Group)

Grid Middleware
• Grid middleware is the "glue" that binds together the computing resources which are spread all over the world.
  – Middleware is the software between the OS and the application software and computing resources.
• Careful authentication and authorization are the key to running a grid (a toy sketch follows below).
  – Authentication is knowing for sure who is asking to use a resource and is done with an X.509 certificate.
  – Authorization is knowing that an authenticated user is allowed to use your resources.
• A number of projects write grid middleware.
  – Perhaps the best known is the Globus project.
  – Grid middleware is often packaged by projects like VDT.
    • VDT stands for Virtual Data Toolkit.
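A toy sketch of the authentication/authorization distinction (illustrative only; real grid sites validate full X.509 certificate chains and query VO membership services rather than a hard-coded list, and the DNs below are made up):

    # Illustrative only: the DN (distinguished name) would come from a validated X.509
    # certificate; here it is just a string, and the allow-list is invented.
    AUTHORIZED_DNS = {
        "/DC=org/DC=example/OU=People/CN=Some Physicist": ["atlas"],
    }

    def authorize(dn, vo):
        """Authentication established who the user is (their DN); this decides what they may use."""
        return vo in AUTHORIZED_DNS.get(dn, [])

    print(authorize("/DC=org/DC=example/OU=People/CN=Some Physicist", "atlas"))  # True
    print(authorize("/DC=org/DC=example/OU=People/CN=Someone Else", "atlas"))    # False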


Page 36: Software & GRID Computing for LHC (Computing Work in a Big Group)

Physical Parts of a Grid
• Compute Elements (CEs)
  – Usually quoted in numbers of "cores" now that a CPU chip has more than one processor on it.
• Storage Elements (SEs)
  – Mostly disk, but lots of tape for preserving archival copies of event and Monte Carlo data.
• Lots of high speed networks
  – Data is distributed exclusively using networks. Shipping physical containers such as data tapes to a site never happens (at least for ATLAS).


Page 37: Software & GRID Computing for LHC (Computing Work in a Big Group)

Grid Use at ATLAS
• ATLAS uses three grids:
  1. EGEE/LCG, based in Europe
  2. Open Science Grid (OSG), based in the US
  3. NorduGrid, based in Scandinavia
• Resource scale available to ATLAS currently:
  – ~30,000 CPUs (cores in our jargon)
  – Several PB of disk and tape
  – High speed networking (up to 10 Gbps links)
  – Everything runs on Linux
  – Things will need to grow to keep up with data production
• The resources are arranged hierarchically in tiers:
  – CERN is Tier 0, national resources are Tier 1, regional resources are Tier 2, and individual institutes are Tier 3.


Page 38: Software & GRID Computing for LHC (Computing Work in a Big Group)

The Result of the Process
• In the end we filter the huge data set down to data from a few interesting particle collisions, and make plots and print scientific papers.
  – There are multiple levels of data "distillation" and filtering to get down to a data set that a physicist can process on his laptop or local clusters.
  – At all stages of filtering the worry is always that you will accidentally remove the interesting data.
  – This is data mining on a grand scale.


Page 39: Software & GRID Computing for LHC (Computing Work in a Big Group)

ATLAS Production System
• ATLAS uses its own system, called Panda, to run jobs (both production and user analysis) on the grid.
  – Panda uses a "pilot" system where test jobs are periodically sent to each site to look for unused cores. When an available core is found, the input data is loaded onto a local disk of the selected server and the pilot starts the actual job (see the conceptual sketch below).
    • The input data must be present on the site for the job to run. This has been found to greatly reduce the number of failures compared to transferring the data from another site after the job starts.
• Panda has a central database, operates using central servers, and has a monitoring system so that users can follow the progress of their jobs.
• Human input plus software algorithms are used to set the relative priorities of jobs.
• A large team of people creates & submits the jobs, monitors the system for hardware & software failures, and assists users.
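A conceptual sketch of the pilot idea (illustrative Python; the function names and the job description are made up and this is not the real Panda pilot API):

    # Illustrative only: what a pilot conceptually does once it lands on a free core.
    import os
    import subprocess

    def fetch_job_from_server():
        """Stand-in for asking the central server for a queued job (entirely made up)."""
        return {"input_file": "/localdisk/data/AOD.pool.root", "command": ["python", "analysis.py"]}

    def report(status):
        print(f"pilot status: {status}")              # the real pilot reports back to central servers

    def run_pilot(max_cycles=1):
        """Poll for work a few times; a real pilot keeps going until its batch slot expires."""
        for _ in range(max_cycles):
            job = fetch_job_from_server()
            if job is None:                            # nothing queued for this site right now
                continue
            if not os.path.exists(job["input_file"]):  # the data must already be at the site
                report("failed: input data not staged locally")
                continue
            result = subprocess.run(job["command"])    # run the actual payload
            report("finished" if result.returncode == 0 else "failed")

    run_pilot()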


Page 40: Software & GRID Computing for LHC (Computing Work in a Big Group)

Scale of ATLAS Data Production
• ATLAS processes data throughout the world.
  – All sites also contribute personnel to the effort.
  – In addition there are central teams in most countries and at CERN that contribute personnel.
  – There are perhaps 20,000 cores worldwide assigned to run ATLAS jobs.
  – On an average day, ATLAS runs ~100,000 - ~150,000 jobs, of which a few percent fail.
• Keeping this machinery running and up to date with current, validated software is a real challenge.
  – Occasionally we run out of work while waiting for new software versions.
• Typically Tier 0 & 1 sites run event data analysis, Tier 2 sites run Monte Carlo, and all levels 0-3 run user analysis jobs.


Page 41: Software & GRID Computing for LHC (Computing Work in a Big Group)

User Analysis Scenario
• It is possible to work efficiently at a Tier 3 or on a laptop:
  – Subscribe an interesting dataset to the Tier 2
  – Use data retrieval to get the sample data onto the Tier 3
  – Develop/test your software package on the Tier 3
  – Run your Monte Carlo locally on the Tier 3 (if needed!)
  – Then use pathena (Panda Athena) to send large numbers of data analysis jobs to a Tier 2 or Tier 1 (wherever the data is)
  – Process D3PDs and ntuples on the Tier 3 or laptop.
• High energy physics uses the data analysis tool ROOT (see the sketch below):
  – Used interactively and in batch to process final ntuples into final plots.
  – Used throughout the HEP community.
  – Based on an object-oriented (OO) design and interface.
  – Notice we are back to working as a single individual.
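A minimal flavor of that final, individual step (a sketch using ROOT's Python bindings; the file, tree, and branch names are made up):

    # Illustrative only: read a final ntuple with PyROOT and make one plot.
    import ROOT

    f = ROOT.TFile.Open("my_analysis_ntuple.root")        # hypothetical DPD/ntuple file
    tree = f.Get("physics")                               # hypothetical tree name

    hist = ROOT.TH1F("h_pt", "Lepton p_{T};p_{T} [GeV];Events", 50, 0.0, 200.0)
    tree.Draw("lep_pt/1000.0 >> h_pt", "lep_pt > 20000")  # fill with a simple selection (MeV -> GeV)

    canvas = ROOT.TCanvas("c", "c", 800, 600)
    hist.Draw()
    canvas.SaveAs("lepton_pt.png")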


Page 42: Software & GRID Computing for LHC (Computing Work in a Big Group)

Summing Up Data Processing
• Datasets are too large to process at one site.
• The Grid is used to process the data over a large number of geographically distributed data sites.
• There is a hierarchy of tiered sites used to process the data.
• The processing of the data requires the participation of lots of people spread throughout the world.
• After a lot of work, the distributed data processing system is working reasonably reliably.
• Users are now able to effectively do data analysis in this environment.
• No doubt further refinements will be required.


Page 43: Software & GRID Computing for LHC (Computing Work in a Big Group)

The Future of Data Processing
• The current hot buzzwords are:
  – "Cloud", which is a sort of commercial computing grid run by the big internet companies: Amazon, Yahoo, Google, etc. The cloud is a sort of black box for computing.
    • This has the potential to be much cheaper than the grid, but issues remain about whether the cloud can deal with the enormous datasets required by some physics calculations.
    • The advantage of the grid is that many (but not all) of the servers are owned by physics/academic projects and can be tuned as needed.
  – "Virtual Machine", which is a software/middleware layer that allows a server running one OS version to run jobs written for another OS or version of the OS.
    • This virtualization layer used to extract a huge overhead of wasted CPU cycles, but modern virtual machine software uses only a few percent of the CPU time on the target machine.
    • An obvious result of the use of virtualization is that it is now possible to run physics codes written in C++/gcc/Linux on Windows servers.


Page 44: Software & GRID Computing for LHC (Computing Work in a Big Group)

The Future (continued)
  – "GPUs" or graphical processing units (essentially graphics cards) can be used to process certain types of physics data. This amounts to using a graphics card as a computer.
    • People have also played with using gaming consoles because they are sold below their fabrication cost (the money for the console makers is in the game licensing fees and game sales).
  – "SSDs" or Solid State Drives are under consideration to replace hard (disk) drives because they provide significant improvements in the speed of access to stored data and are physically small.
    • Various technical and cost issues have to be worked out before SSDs become practical.
    • Personal experience with putting heavily used databases on SSDs shows great potential.
