The Open Science Grid 1
The Open Science Grid
Fermilab
The Open Science Grid 2
The Vision
Practical support for end-to-end community systems in a heterogeneous global environment, to transform compute- and data-intensive science through a national cyberinfrastructure that includes organizations from the smallest to the largest.
The Open Science Grid 4
Goals of OSG
Enable scientists to use and share a greater % of available compute cycles.
Help scientists use distributed systems (storage, processors and software) with less effort.
Enable more sharing and reuse of software, and reduce duplication of effort, by providing effort for integration and extensions.
Establish an "open-source" community that works together to communicate knowledge and experience, and that lowers the overheads for new participants.
The Open Science Grid 5
The History
[Timeline, 1999-2009: PPDG (DOE), GriPhyN (NSF) and iVDGL (NSF) converge through the Trillium and Grid3 efforts into OSG (DOE+NSF), joined by campus and regional grids; the period spans LHC construction, preparation and operations, LIGO preparation and operation, and the European Grid / Worldwide LHC Computing Grid.]
The Open Science Grid 6
The Leaders
High Energy & Nuclear Physics (HENP) Collaborations - global communities with large distributed systems in Europe as well as the US.
Condor Project - distributed computing across diverse clusters.
Globus - Grid security, data movement and information services software.
Laser Interferometer Gravitational Wave Observatory (LIGO) - legacy data grid with large data collections.
DOE HENP facilities, university groups and researchers.
The Open Science Grid 8
Institutions Involved
Project Staff at:
Boston U
Brookhaven National Lab+
CalTech
(Clemson)
Columbia
Cornell
FermiLab+
ISI, University of Southern California
Indiana U
LBNL+
(Nebraska)
RENCI
SLAC+
UCSD
U of Chicago
U of Florida
U of Illinois Urbana-Champaign / NCSA
U of Wisconsin Madison
Sites on OSG: many with more than one resource; 46 separate institutions (* - no physics):
Academia Sinica, Boston U., Brookhaven National Lab, Caltech, Cinvestav (Mexico City), Clemson U.*, Dartmouth U.*, Fermilab, Florida International U., Florida State U., Hampton U., Indiana University, Iowa State, Kansas State, LBNL, Lehigh University*, Louisiana Tech*, Louisiana University, McGill U., MIT, Nebraska, Notre Dame, Oklahoma U., Penn State U., Purdue U., Rice U., SLAC, Southern Methodist U., TTU, U. of Arkansas*, U. California at Riverside, UCSD, UERJ (Brazil), U. of Chicago, U. of Florida, U. Illinois Chicago, U. of Iowa, U. of Michigan, U. New Mexico, U. of Sao Paulo, U. Texas at Arlington, U. Virginia, U. Wisconsin Madison, U. Wisconsin Milwaukee, Vanderbilt U., Wayne State U.
The Open Science Grid 10
The Value Proposition
Increased usage of CPUs and infrastructure alone (i.e. the cost of processing cycles) is not the persuading cost-benefit value.
The benefits come from reducing risk in, and sharing support for, large, complex systems which must be run for many years with a short-lifetime workforce:
Savings in effort for integration, system and software support.
Opportunity and flexibility to distribute load and address peak needs.
Maintenance of an experienced workforce in a common system.
Lowering the cost of entry for new contributors.
Enabling new computational opportunities for communities that would not otherwise have access to such resources.
The Open Science Grid 11
OSG Does
Release, deploy and support software.
Integrate and test new software at the system level.
Support operations and Grid-wide services.
Provide security operations and policy.
Troubleshoot end-to-end user and system problems.
Engage and help new communities.
Extend capability and scale.
The Open Science Grid 12
And OSG Does Training
Grid Schools train students, teachers and new entrants to use grids: 2-3 days of training with hands-on workshops and a core curriculum (based on the iVDGL annual week-long schools).
3 held already; several more this year (2 scheduled), some as participants in international schools.
20-60 in each class; each class regionally based with a broad catchment area.
Gathering an online repository of training material.
End-to-end application training in collaboration with user communities.
The Open Science Grid 13
Participants in a recent Open Science Grid workshop held in Argentina. Image courtesy of Carolina Leon Carri, University of Buenos Aires.
The Open Science Grid 14
Virtual Organizations
A Virtual Organization (VO) is a collection of people (VO members).
A VO has responsibilities to manage its members and the services it runs on their behalf.
A VO may own resources and be prepared to share in their use.
The Open Science Grid 15
VOs
Self-operated research VOs: 15
Collider Detector at Fermilab (CDF)
Compact Muon Solenoid (CMS)
CompBioGrid (CompBioGrid)
D0 Experiment at Fermilab (DZero)
Dark Energy Survey (DES)
Functional Magnetic Resonance Imaging (fMRI)
Geant4 Software Toolkit (geant4)
Genome Analysis and Database Update (GADU)
International Linear Collider (ILC)
Laser Interferometer Gravitational-Wave Observatory (LIGO)
nanoHUB Network for Computational Nanotechnology (NCN) (nanoHUB)
Sloan Digital Sky Survey (SDSS)
Solenoidal Tracker at RHIC (STAR)
Structural Biology Grid (SBGrid)
United States ATLAS Collaboration (USATLAS)
Campus Grids: 5.
Georgetown University Grid (GUGrid)
Grid Laboratory of Wisconsin (GLOW)
Grid Research and Education Group at Iowa (GROW)
State University of New York at Buffalo (GRASE)
Fermi National Accelerator Center (Fermilab)
Regional Grids: 4
NYSGRID
Distributed Organization for Scientific and Academic Research (DOSAR)
Great Plains Network (GPN)
Northwest Indiana Computational Grid (NWICG)
OSG Operated VOs: 4
Engagement (Engage)
Open Science Grid (OSG)
OSG Education Activity (OSGEDU)
OSG Monitoring & Operations
The Open Science Grid 17
Sites
A Site is a collection of commonly administered computing and/or storage resources and services.
Resources can be owned by, and shared among, VOs.
The Open Science Grid 18
A Compute Element
Processing farms accessed through Condor-G submissions to a Globus GRAM interface, which supports many different local batch systems (a minimal submit sketch follows below).
Priorities and policies through assignment of VO Roles mapped to accounts and batch queue priorities, modified by Site policies and priorities.
[Diagram: from ~20-CPU department computers to 10,000-CPU supercomputers; jobs run under any local batch system behind an OSG gateway machine plus services, connected to the network and other OSG resources.]
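As an illustration of how a VO client typically reaches such a gateway, here is a minimal sketch in Python that writes a Condor-G submit description targeting a pre-WS GRAM gatekeeper and hands it to condor_submit. The gatekeeper host, jobmanager name and executable are placeholders, not a real OSG site; a valid grid proxy is assumed to already exist.

```python
import subprocess
import textwrap

# Hypothetical gatekeeper contact string; a real OSG site publishes its own.
GATEKEEPER = "gateway.example.edu/jobmanager-condor"

submit = textwrap.dedent(f"""\
    # Condor-G submit description: route the job through the site's
    # Globus GRAM gateway (pre-WS GRAM, "gt2", in this sketch).
    universe      = grid
    grid_resource = gt2 {GATEKEEPER}
    executable    = /bin/hostname
    output        = job.out
    error         = job.err
    log           = job.log
    queue
    """)

with open("osg_job.sub", "w") as f:
    f.write(submit)

# condor_submit is part of the Condor (HTCondor) client installation.
subprocess.run(["condor_submit", "osg_job.sub"], check=True)
```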
The Open Science Grid 19
Storage Element
Storage Services - access storage through the Storage Resource Manager (SRM) interface and GridFTP (see the transfer sketch below).
Allocation of shared storage through agreements between the Site and VO(s), facilitated by OSG.
[Diagram: from 20-GByte disk caches to 4-Petabyte robotic tape systems; any shared storage behind an OSG SE gateway, connected to the network and other OSG resources.]
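As a sketch of moving a file to such a Storage Element over GridFTP, the snippet below calls the globus-url-copy client shipped in the VDT. The source and destination URLs are placeholders for whatever storage area the VO has negotiated with the site, and a valid grid proxy is assumed.

```python
import subprocess

# Placeholder endpoints; a real VO would use the storage area agreed with the site.
SRC = "file:///data/local/results.tar"
DST = "gsiftp://se.example.edu/data/myvo/results.tar"

# globus-url-copy is the GridFTP client; it authenticates with the user's grid proxy.
subprocess.run(["globus-url-copy", SRC, DST], check=True)
```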
The Open Science Grid 20
How are VOs supported?
Virtual Organization Management services (VOMS) allow registration, administration and control of members of the group (see the proxy sketch below).
Facilities trust and authorize VOs, not individual users.
Storage and Compute Services prioritize according to VO group.
[Diagram: resources that trust the VO, the VO Management Service, VO middleware & applications, and the network & other OSG resources.]
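As a sketch of the user-side workflow, the commands below (wrapped in Python for consistency with the other examples) obtain a VOMS proxy carrying the VO membership and role attributes, then display them. The VO name and role are placeholders; sites map these attributes to local accounts and priorities rather than authorizing individual users.

```python
import subprocess

# Obtain a proxy certificate extended with VOMS attributes for a placeholder VO/role.
subprocess.run(["voms-proxy-init", "-voms", "myvo:/myvo/Role=analysis"], check=True)

# Show the proxy and its VO attributes; these are what the site's
# authorization services inspect when mapping the job to an account.
subprocess.run(["voms-proxy-info", "-all"], check=True)
```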
The Open Science Grid 21
Running Jobs
Condor-G client; pre-WS or WS GRAM as the Site gateway.
Priority through VO role and policy, moderated by Site policy.
Pilot jobs submitted through the regular gateway can then bring down multiple user jobs until the batch slot resources are used up (see the pilot sketch below).
glexec, modelled on Apache suexec, allows jobs to run under the user's identity.
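The pilot-job pattern above can be sketched roughly as follows. This is not OSG's actual pilot implementation, only an illustration of a pilot that pulls user jobs from a stand-in VO queue until its batch slot is exhausted; in production each payload would be handed to glexec so it runs under the end user's identity.

```python
import subprocess
import time

WALL_LIMIT = 6 * 3600  # seconds the batch slot is expected to remain available

# Stand-in for the VO's own task queue; a real pilot would contact a VO service.
pending = [["echo", "user job 1"], ["echo", "user job 2"]]

def fetch_next_job():
    """Return the next user job (as a command) or None when no work is waiting."""
    return pending.pop(0) if pending else None

start = time.time()
while time.time() - start < WALL_LIMIT:
    job = fetch_next_job()
    if job is None:
        break  # no more user work: release the batch slot
    # A production pilot would wrap this in glexec so the payload runs
    # under the end user's identity rather than the pilot's.
    subprocess.run(job, check=False)
```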
The Open Science Grid 22
Data and Storage
GridFTP data transfer.
Storage Resource Manager to manage shared and common storage.
Environment variables on the site let VOs know where to put and leave files (see the sketch below).
dCache - large-scale, high-I/O disk caching system for large sites.
DRM - NFS-based disk management system for small sites.
? NFS v4 ? GPFS ?
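A small sketch of how a running job can discover the site's storage layout from those environment variables; $OSG_APP, $OSG_DATA and $OSG_WN_TMP are the conventional names, and the fallback paths here are illustrative only.

```python
import os

# OSG sites advertise their storage layout to jobs through environment variables:
# OSG_APP    - shared application installation area
# OSG_DATA   - shared data area visible to the whole site
# OSG_WN_TMP - local scratch space on the worker node
app_dir = os.environ.get("OSG_APP", "/tmp")
data_dir = os.environ.get("OSG_DATA", "/tmp")
scratch = os.environ.get("OSG_WN_TMP", "/tmp")

print(f"applications: {app_dir}")
print(f"shared data:  {data_dir}")
print(f"scratch:      {scratch}")

# A job would typically stage its output into the shared data area
# so it can later be retrieved through the site's Storage Element.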
The Open Science Grid 23
CPU Hours/Day on OSG during 2007
[Chart: CPU hours per day delivered on OSG, 1 Jan - 28 May 2007, y-axis 0-160,000, broken down by site: AGLT2, ASGC_OSG, BNL_OSG, BNL_PANDA, CIT_CMS_T2, FIU-PG, FNAL_CDFOSG_1, FNAL_CDFOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, GRASE-GENESEO-OSG, GROW-PROD, HEPGRID_UERJ, IPAS_OSG, Lehigh Coral, MIT_CMS, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OCHEP_SWT2, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-Lear, Purdue-RCAC, SPRACE, STAR-BNL, STAR-WSU, TTU-ANTAEUS, UC_ATLAS_MWT2, UCRHEP, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE, USCMS-FNAL-WC1-CE2, UTA_SWT2, UTA-DPCC, UWMilwaukee, Vanderbilt.]
The Open Science Grid 24
Resource Management
Many resources are owned by, or statically allocated to, one user community. The institutions which own resources typically have ongoing relationships with (a few) particular user communities (VOs).
The remainder of an organization's available resources can be "used by everyone or anyone else". Organizations can decide against supporting particular VOs. OSG staff are responsible for monitoring and, if needed, managing this usage.
Our challenge is to maximize good - successful - output from the whole system.
The Open Science Grid 25
An Example of Opportunistic Use
D0's own resources are committed to the processing of newly acquired data and analysis of the processed datasets.
In Nov '06, D0 asked to use 1500-2000 CPUs for 2-4 months for re-processing of an existing dataset (~500 million events), for science results for the summer conferences in July '07.
The Executive Board estimated there were currently sufficient opportunistically available resources on OSG to meet the request; we also looked into the local storage and I/O needs.
The Council members agreed to contribute resources to meet this request.
The Open Science Grid 27
How did D0 Reprocessing Go?
D0 had 2-3 months of smooth production running using >1,000 CPUs and met their goal by the end of May.
To achieve this, D0's testing of the integrated software system took until February. OSG staff and D0 then worked closely together as a team to reach the needed throughput goals, facing and solving problems with:
sites - hardware, connectivity, software configurations;
application software - performance, error recovery;
scheduling of jobs to a changing mix of available resources.
The Open Science Grid 28
D0 OSG CPU Hours/Week
[Chart: CPU hours per week used by D0 on OSG, weeks 1-23 of 2007, y-axis 0-160,000, broken down by site: CIT_CMS_T2, FNAL_DZEROOSG_2, FNAL_FERMIGRID, FNAL_GPFARM, GLOW, GRASE-CCR-U2, MIT_CMS, MWT2_IU, Nebraska, NERSC-PDSF, OSG_LIGO_PSU, OU_OSCER_ATLAS, OU_OSCER_CONDOR, Purdue-RCAC, SPRACE, UCSDT2, UFlorida-IHEPA, UFlorida-PG, USCMS-FNAL-WC1-CE.]
The Open Science Grid 29
What did this teach us?
Consortium members contributed significant opportunistic resources as promised.
VOs can use a significant number of sites they "don't own" to achieve a large effective throughput.
Combined teams make large production runs effective.
How does this scale? How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.
The Open Science Grid 30
Use by non-Physics
Rosetta@Kuhlman lab: in production across ~15 sites since April.
Weather Research and Forecasting (WRF): MPI job running on 1 OSG site; more to come.
CHARMM molecular dynamics simulation applied to the problem of water penetration in staphylococcal nuclease.
Genome Analysis and Database Update system (GADU): portal across OSG & TeraGrid. Runs BLAST.
nanoHUB at Purdue: BioMOCA and nanowire production.
The Open Science Grid 31
Rosetta user decided to submit jobs...
[Chart: approximately 3,000 jobs.]
The Open Science Grid 32
Scale needed in 2008/2009 (e.g. for a single experiment):
20-30 Petabytes of tertiary automated tape storage at 12 centers world-wide for physics and other scientific collaborations.
High availability (365x24x7) and high data access rates (1 GByte/sec) locally and remotely.
Evolving and scaling smoothly to meet evolving requirements.
The Open Science Grid 34
Software: Infrastructure and Applications
User Science Codes and Interfaces.
VO Middleware: HEP - data and workflow management etc.; Biology - portals, databases etc.; Astrophysics - data replication etc.
OSG Release Cache: OSG-specific configurations, utilities etc.
Virtual Data Toolkit (VDT): core technologies plus software needed by stakeholders; many components shared with EGEE.
Core grid technology distributions: Condor, Globus, MyProxy; shared with TeraGrid and others.
Existing operating systems, batch systems and utilities.
The Open Science Grid 35
Horizontal and Vertical Integrations
[Diagram: the same layered stack, from user science codes and interfaces and VO middleware (HEP - data and workflow management; Biology - portals, databases; Astrophysics - data replication) down through the infrastructure layers, showing horizontal and vertical integration.]
The Open Science Grid 36
The Virtual Data Toolkit Software
Pre-built, integrated and packaged set of software which is easy to download, install and use to access OSG. Client, Server, Storage and Service versions.
Automated build and test: integration and regression testing.
Software included: grid software (Condor, Globus, dCache), AuthZ (VOMS/PRIMA/GUMS), accounting (Gratia); utilities for monitoring, authorization and configuration; common components, e.g. Apache. Built for >10 flavors/versions of Linux.
Support structure. Software acceptance structure.
The Open Science Grid 37
How We Get to a Production Software Stack
Input from stakeholders and OSG directors → VDT Release → test on the OSG Validation Testbed → OSG Integration Testbed Release → OSG Production Release.
The Open Science Grid 38
How We Get to a Production Software Stack
Validation/integration takes months and is the result of the work of many people.
The Open Science Grid 39
How We Get to a Production Software Stack
The VDT is used by others than OSG: TeraGrid, Enabling Grids for E-sciencE (Europe), APAC, ...
The Open Science Grid 40
[Chart: number of major components in the VDT, January 2002 - January 2007 (y-axis 0-50), across releases VDT 1.1.x, 1.2.x, 1.3.x, 1.4.0, 1.5.x and 1.6.x, plus more dev releases; software was both added and removed. Annotated milestones: VDT 1.0 (Globus 2.0b, Condor-G 6.3.1); VDT 1.1.8 adopted by LCG; VDT 1.1.11 for Grid2003; VDT 1.2.0; VDT 1.3.0; VDT 1.3.6 for OSG 0.2; VDT 1.3.9 for OSG 0.4; VDT 1.6.1 for OSG 0.6.0.]
The Open Science Grid 41
Security
Operational security is a priority: incident response; signed agreements and template policies; auditing, assessment and training.
Parity of Sites and VOs: a Site trusts the VOs that use it; a VO trusts the Sites it runs on; VOs trust their users.
Infrastructure: X509 certificate based, with extended attributes for authorization.
The Open Science Grid 42
Illustrative Example of the Trust Model
[Diagram: User → VO → Site. Jobs flow through the VO infrastructure to the Site's Compute Element, data/storage and worker nodes. Trust statements along the chain: "I trust it is the VO (or agent)"; "I trust it is the user"; "I trust it is the user's job"; "I trust the job is for the VO".]
The Open Science Grid 43
Operations & Troubleshooting & Support
Well-established Grid Operations Center at Indiana University.
User support is distributed, and includes osg-general@opensciencegrid community support.
A site coordinator supports the team of sites.
Accounting and Site Validation are required services of sites.
Troubleshooting looks at targeted end-to-end problems.
Partnering with LBNL on troubleshooting work for auditing and forensics.
The Open Science Grid 44
Campus Grids
Sharing across compute clusters is a change and a challenge for many universities.
OSG, TeraGrid, Internet2 and Educause are working together on CI Days: working with CIOs, faculty and IT organizations on a one-day meeting where we all come and talk about the needs, the ideas and, yes, the next steps.
The Open Science Grid 45
OSG and TeraGrid: Complementary and Interoperating Infrastructures
TeraGrid: Networks supercomputer centers. OSG: Includes small to large clusters and organizations.
TeraGrid: Based on the Condor & Globus software stack built at the Wisconsin Build and Test facility. OSG: Based on the same versions of Condor & Globus in the Virtual Data Toolkit.
TeraGrid: Development of user portals / science gateways. OSG: Supports jobs/data from TeraGrid science gateways.
TeraGrid: Currently relies mainly on remote login. OSG: No login access; many sites expect VO attributes in the proxy certificate.
Training covers both OSG and TeraGrid usage.