The Sam-Grid project Gabriele Garzoglio ODS, Computing Division, Fermilab PPDG, DOE SciDAC ACAT...
The Sam-Grid project
Gabriele Garzoglio
ODS, Computing Division, Fermilab
PPDG, DOE SciDAC
ACAT 2002, Moscow, Russia
June 26, 2002
Outline
• The SAM-Grid Project
• The SAM & JIM Architecture
– SAM: the Data Handling System
– JIM: the Job Management Infrastructure
– JIM: the Information and Monitoring System
• The Current Grid Infrastructure
• Milestones of the Deliverables
• Conclusions
The scope of the project
• Enable fully distributed computing for DZero and CDF, by enhancing the distributed data handling system of the experiments (SAM), incorporating standard Grid tools and protocols, and developing new solutions for Grid computing, in a secure and accountable environment.
• The SAM ‘grid-ification’ is funded by PPDG and GridPP: we are working with both computer scientists, like the Condor Team, and physicists, like UTA and Imperial College.
• We are collaborating with other groups working on Grid technologies as well (EDG, DataTAG among them).
• Warm cooperation between Fermilab CD Departments and the Project (e.g. ISD for the SAM/DCache integration)
• We promote interoperability and code reuse (via modularization and standardization).
• CDF and DZero are running now! Short-term deliverables are due at the end of the Summer; long-term in 2 yrs.
Why a Job and Data Handling infrastructure?
• Increases the productivity of physics results
• A high level of transparency to the user: maximize time spent by the physicist doing physics
• Enable worldwide analysis of the data
• Efficient utilization of the resources: disks, mass storage systems, processing nodes, network…
• Automatic bookkeeping: reproducibility + accountability
• Extensibility to new standardized services and protocols via modularization and “plug-in” mechanisms
High Level Components
(Diagram: the three high-level components of the architecture: Information and Monitoring, Data Handling, Job Management.)
The Data Handling: SAM
(Diagram: the Data Handling component, with DH Resource Management and Data Delivery and Caching, implemented by SAM; shown next to Information and Monitoring and Job Management. Legend: Principal Component, Service, Implementation or Library, Information.)
History
• SAM is Sequential data Access via Meta-data
• Joint project between D0 and Computing Division started in 1997 to meet the Run II data handling needs
• SAM is integrated into DZero at all levels.
• SAM is in commissioning phase for CDF
• http://d0db.fnal.gov/sam
• http://runIIcomputing.fnal.gov
SAM as a Distributed System
(Diagram: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log server, Station 1 … Station n Servers, Mass Storage System(s). Legend: Shared Globally, Local to Site, Shared Locally. Arrows indicate control and data flow.)
• A Station is a collection of resources controlled by the SAM system.
• SAM services can be accessed to monitor the status of the system.
• The central Database Server has proven to be robust and reliable.
Components of a SAM Station
• SAM is a distributed data movement and management service: data replication is achieved by the use of disk caches during file routing.
• SAM is a fully functional meta-data catalog.
(Diagram: Station & Cache Manager, File Storage Server, File Stager(s), Project Managers, Producers/Consumers, eworkers, File Storage Clients, Cache Disk, Temp Disk, with an MSS or other Station on either side. Legend: Data flow, Control.)
Accessibility of the Fabric via SAM Services
(Diagram: MSS1, MSS2, Local Station 1 Cache 1, Local Station 1 Cache 2, Local Station 2 Cache 1, Remote Station Cache 1, Remote Station Cache 2.)
• A station can access a remote resource via the services offered by other connected stations
• Service connectivity does not in general correspond to network connectivity
• Requests are routed from the originator to the destination
• File caching during routing leads to file replication
More in Igor Terekhov’s Talk: “Meta-Computing at DØ”
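The routing and caching behaviour described on this slide can be illustrated with a small sketch. It is not SAM code: the `Station` class, the `find_route` and `fetch` helpers, and the file name are all hypothetical, and the breadth-first search merely stands in for whatever routing policy SAM actually applies. The point shown is only that a file copied hop by hop through intermediate station caches ends up replicated along the route.

```python
from collections import deque


class Station:
    """Hypothetical stand-in for a SAM station: a name, a disk cache, service links."""

    def __init__(self, name, files=()):
        self.name = name
        self.cache = set(files)
        # Stations reachable through services; this need not mirror network links.
        self.peers = []

    def connect(self, other):
        self.peers.append(other)
        other.peers.append(self)


def find_route(origin, filename):
    """Breadth-first search over service links for the nearest station holding the file."""
    queue = deque([[origin]])
    seen = {origin.name}
    while queue:
        path = queue.popleft()
        if filename in path[-1].cache:
            return path
        for peer in path[-1].peers:
            if peer.name not in seen:
                seen.add(peer.name)
                queue.append(path + [peer])
    return None


def fetch(origin, filename):
    """Route the request, then cache the file at every station on the way back."""
    path = find_route(origin, filename)
    if path is None:
        raise LookupError(f"{filename} is not reachable from {origin.name}")
    for station in path:
        station.cache.add(filename)  # caching during routing produces replicas
    return [station.name for station in path]


local1 = Station("local1")
local2 = Station("local2")
remote = Station("remote", files={"run42.raw"})
local1.connect(local2)   # local1 has no direct service link to remote
local2.connect(remote)

route = fetch(local1, "run42.raw")
print(route)
print("run42.raw" in local2.cache)
```

After the first request the intermediate station holds a replica, so a later request from any station linked to it is served without reaching the remote site.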
Current Developments of SAM
• Site Autonomy: the goal is enabling site installations of SAM and JIM to work even if disconnected from the network. The distribution of the Replica and Meta-data Catalogs is a prerequisite for this to happen.
• Opportunistic deployment: to let SAM and JIM operate at full efficiency in a dynamic environment like the Grid, we are investigating the automatic deployment of stations at resources that are momentarily available.
The Job Management
(Diagram: the Job Management component added next to Data Handling (DH Resource Management, Data Delivery and Caching, SAM) and Information and Monitoring. Job Management elements: Request Broker, Compute Element Resource, Site Gatekeeper, Job Scheduler, JH Client, Batch System, Condor-G, Condor MMS, GRAM, Grid sensors, (All) Job Status Updates. Legend: Principal Component, Service, Implementation or Library, Information.)
The Job Description Language
• User interface: the Job Description Language must be expressive enough to fully characterize the structure of the job (Monte Carlo and Analysis)
• We are collaborating with the University of Texas Arlington to define the structure of a DZero (CDF) job.
The Request Broker
• The Brokering Service is implemented using the Condor Match Making Service
• The idea is to use a stable technology in a new way
• Because of the collaboration with the Condor Team under PPDG, 2 features have been added to make this possible:
– Runtime selection of the remote execution site
– Execution of external code when negotiating the matches
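As an illustration of the two added features, the sketch below picks the execution site at match time and ranks candidates by calling an external function. It is plain Python, not the Condor matchmaking API: `match`, `cached_fraction`, the site dictionaries and their attribute names are all assumptions made for the example.

```python
def match(job, sites, external_rank):
    """Return the name of the best site for the job, or None if no site qualifies."""
    candidates = [site for site in sites if job["requirements"](site)]
    if not candidates:
        return None
    # The ranking is delegated to code supplied from outside the matchmaker.
    best = max(candidates, key=lambda site: external_rank(job, site))
    return best["name"]


def cached_fraction(job, site):
    """Hypothetical external hook: share of the job's input already cached at the site."""
    wanted = job["input_files"]
    if not wanted:
        return 0.0
    return len(wanted & site["cached_files"]) / len(wanted)


sites = [
    {"name": "fnal", "free_cpus": 4, "cached_files": {"a", "b"}},
    {"name": "uta", "free_cpus": 16, "cached_files": {"a"}},
    {"name": "ic", "free_cpus": 0, "cached_files": {"a", "b", "c"}},
]

job = {
    "input_files": {"a", "b", "c"},
    "requirements": lambda site: site["free_cpus"] > 0,
}

best_site = match(job, sites, cached_fraction)
print(best_site)
```

In this toy data the site holding every input file is excluded because it has no free CPUs, so the choice falls on the remaining site with the most cached input. A data-aware rank of this kind is one plausible use of external code during negotiation; the slides do not say which ranking the project uses.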
The Job Submission Service
• The job submission service relies on standard Condor technologies
• It implements a high level of robustness to service failures and loss of connectivity
The Job Submission Mechanism (I)
• Physical job dispatch is achieved via the GRAM protocol from the Globus Toolkit
• When applicable, executables, configuration files, stdio and stderr are transported via GASS servers
• Gatekeepers deployed at each site serve client requests for job submission
The Job Submission Mechanism (II)
• A Gatekeeper authenticates and authorizes the client via the Globus Security Infrastructure
• After authentication and authorization (AA), the Gatekeeper spawns a Job Manager that submits the job to the local batch system, reports the status to the submission client (Condor-G), and cleans up after job termination.
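The sequence on this slide (authenticate, authorize, spawn a job manager, submit, report, clean up) can be sketched as a small simulation. Nothing here calls Globus: the `Gatekeeper` class, the `job_manager` function, the boolean that stands in for certificate validation, and the dictionary that stands in for the authorization map are all hypothetical.

```python
class Gatekeeper:
    """Toy gatekeeper: checks the client, then hands the job to a job manager."""

    def __init__(self, authorization_map, batch_submit):
        self.authorization_map = authorization_map  # certificate subject -> local account
        self.batch_submit = batch_submit            # callable standing in for the batch system

    def handle(self, subject, credential_valid, job, report):
        if not credential_valid:  # stands in for GSI certificate validation
            report("REJECTED: authentication failed")
            return None
        account = self.authorization_map.get(subject)
        if account is None:
            report("REJECTED: not authorized")
            return None
        return job_manager(account, job, self.batch_submit, report)


def job_manager(account, job, batch_submit, report):
    """Submit to the local batch system, report status to the client, clean up."""
    scratch = list(job["files"])  # staged input files
    report("SUBMITTED")
    try:
        exit_code = batch_submit(account, job["command"])
        report("DONE" if exit_code == 0 else "FAILED")
        return exit_code
    finally:
        scratch.clear()  # clean up after job termination, even on error
        report("CLEANED")


status_log = []
gatekeeper = Gatekeeper(
    {"/O=Grid/CN=Jane Doe": "d0user"},
    batch_submit=lambda account, command: 0,  # pretend every job succeeds
)
exit_code = gatekeeper.handle(
    "/O=Grid/CN=Jane Doe",
    True,
    {"command": "run_mc", "files": ["card.txt"]},
    status_log.append,
)
print(exit_code, status_log)
```

The `report` callback plays the role of the status channel back to the submission client; in the architecture described here that client is Condor-G.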
The Fabric (I)
• Among the Batch systems currently supported by the Gatekeeper are LSF, PBS, Condor, FBS
• In our architecture Grid Sensors are deployed at the compute elements as well as the local submission nodes.
• The Sensors report static and small-size dynamic states to the Information and Monitoring System.
The Fabric (II)
• What attributes best describe resources is still a research topic. The choice of such a schema has implications for the semantics of the JDL.
• We are collaborating with DataTAG and EDG to find a common Glue Schema in order to enable interoperability of EU and US Grids.
Information Flow
(Diagram, built up over several animation steps. Components shown: User Interface, Parser (JDL to ClassAd), Condor Schedd, Condor Collector, Condor Negotiator, External Code (Cin/Cout), Condor Grid Manager, Condor-G, GRAM, Gatekeeper, Batch System, Grid Sensors, Compute Resource, Execution Site, Information and Monitoring.)
Monitoring and Information: the glue
(Diagram: the Monitoring and Information component added next to Data Handling and Job Management. Monitoring and Information elements: Logging and Bookkeeping, Info Processor and Converter, Replica Catalog, Resource Info, AAA, GSI, MDS-2, Condor Class Ads, Grid RC. Legend: Principal Component, Service, Implementation or Library, Information.)
Status Monitor
• Resource and Information Service implementations:
– Meta Directory Service from the Globus Toolkit (LDAP protocol)
– Condor Components (ClassAds)
• MDS automatically discards old information and pulls the new information from information providers.
• Well suited for the run-time monitoring of the system.
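The "discard old information, pull new information" behaviour can be sketched as a cache with a time-to-live. This is not the MDS implementation or its LDAP interface: `SoftStateDirectory`, its methods and the 30-second lifetime are assumptions chosen for the example, and the clock is injected so the expiry can be shown without waiting.

```python
import time


class SoftStateDirectory:
    """Toy information service: entries expire and are re-pulled from their provider."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.providers = {}  # key -> callable returning the current value
        self.entries = {}    # key -> (value, time the value was fetched)

    def register(self, key, provider):
        self.providers[key] = provider

    def lookup(self, key):
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0]             # still fresh: answer from the cache
        value = self.providers[key]()    # stale or missing: pull from the provider
        self.entries[key] = (value, now)
        return value


# A fake clock and a provider that counts how often it is asked.
fake_now = [0.0]
pull_count = [0]


def free_cpu_provider():
    pull_count[0] += 1
    return {"free_cpus": 8, "pull": pull_count[0]}


directory = SoftStateDirectory(ttl_seconds=30, clock=lambda: fake_now[0])
directory.register("fnal/node_1", free_cpu_provider)

first = directory.lookup("fnal/node_1")
fake_now[0] = 10.0
second = directory.lookup("fnal/node_1")   # within the lifetime: no new pull
fake_now[0] = 31.0
third = directory.lookup("fnal/node_1")    # expired: the provider is asked again
print(first["pull"], second["pull"], third["pull"])
```

Expiring entries keeps answers reasonably current without the providers having to push every change, which is why this style suits run-time monitoring.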
Logging and Bookkeeping
• SAM provides a UDP-based message logger. Persistency is implemented via a pluggable back-end module.
• SAM servers already use the logger, which results in a valuable debugging tool. We are going to extend the use of this service to JIM.
• Messages will be stored in XML format.
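A minimal sketch of such a logger, assuming a message layout of our own invention: one XML element per datagram with `source` and `level` attributes. The SAM wire format is not given in the slides, so `encode_message`, `decode_message`, `UdpLogServer` and `ListBackend` are all hypothetical names. The back end is any object with a `store` method, which is what makes persistency pluggable in this sketch.

```python
import socket
import xml.etree.ElementTree as ET


def encode_message(source, level, text):
    """Serialize one log record as a small XML document (layout is an assumption)."""
    element = ET.Element("log", source=source, level=level)
    element.text = text
    return ET.tostring(element, encoding="utf-8")


def decode_message(data):
    element = ET.fromstring(data)
    return {"source": element.get("source"), "level": element.get("level"), "text": element.text}


class ListBackend:
    """Simplest possible back end: keep records in memory."""

    def __init__(self):
        self.records = []

    def store(self, record):
        self.records.append(record)


class UdpLogServer:
    def __init__(self, backend, host="127.0.0.1", port=0):
        self.backend = backend
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind((host, port))   # port 0 lets the OS pick a free port
        self.sock.settimeout(2.0)      # do not block forever if a datagram is lost
        self.address = self.sock.getsockname()

    def handle_one(self):
        data, _sender = self.sock.recvfrom(65535)
        self.backend.store(decode_message(data))

    def close(self):
        self.sock.close()


def send_log(address, source, level, text):
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_message(source, level, text), address)


def demo_roundtrip():
    """Send one message over the loopback interface and return what the back end stored."""
    backend = ListBackend()
    server = UdpLogServer(backend)
    try:
        send_log(server.address, "station-1", "INFO", "file delivered")
        server.handle_one()
    finally:
        server.close()
    return backend.records


if __name__ == "__main__":
    print(demo_roundtrip())
```

UDP keeps the cost to the sending server low, at the price of possible message loss; swapping `ListBackend` for a file or database writer would change where records persist without touching the senders.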
The Replica Catalog
• The Replica Catalog is currently implemented with SAM
• We plan to migrate to the Grid Replica Catalog, in order to allow distribution of the service and a set of standardized interfaces to external services
Information Conversion and Accessibility
• A translation service is responsible for converting between the 3 protocols used when needed: LDAP, ClassAd, XML.
• We are evaluating web portal frameworks to enable access to the system from the internet
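To make the idea of a translation service concrete, the sketch below renders one set of resource attributes in three textual forms: a ClassAd-style record, an LDIF-style entry of the kind an LDAP directory exchanges, and XML. The attribute names, the distinguished name and the XML layout are invented for the example, the LDIF output is simplified (a real entry would also need object classes), and none of this is the project's converter.

```python
import xml.etree.ElementTree as ET


def to_classad(attrs):
    """Render attributes as a bracketed ClassAd-style record."""

    def literal(value):
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, (int, float)):
            return str(value)
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    body = "; ".join(f"{key} = {literal(value)}" for key, value in attrs.items())
    return f"[ {body} ]"


def to_ldif(dn, attrs):
    """Render attributes as a simplified LDIF entry (no object classes)."""
    lines = [f"dn: {dn}"] + [f"{key}: {value}" for key, value in attrs.items()]
    return "\n".join(lines) + "\n"


def to_xml(name, attrs):
    """Render attributes as XML, one <attribute> element per value."""
    root = ET.Element("resource", name=name)
    for key, value in attrs.items():
        ET.SubElement(root, "attribute", name=key).text = str(value)
    return ET.tostring(root, encoding="unicode")


resource = {"Name": "fnal-node-1", "FreeCpus": 4, "BatchSystem": "PBS"}

classad_text = to_classad(resource)
ldif_text = to_ldif("cn=fnal-node-1,o=grid", resource)
xml_text = to_xml("fnal-node-1", resource)

print(classad_text)
print(ldif_text)
print(xml_text)
```

Keeping one neutral in-memory representation and writing a renderer per format means a new format needs one new function rather than a converter for every pair.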
Site AAA
• The Job Management Infrastructure and the Monitoring and Information System are built on top of standard grid tools and adopt the GSI security mechanisms.
• The full integration of the Data Handling System with GSI is work in progress…
• Open issue: the management of the AA map files
The Current Grid Infrastructure
(Diagram: test-bed nodes at FNAL, IC and UTA. Labels shown: Node_1 to Node_4, each with GRAM; local job managers Fork, PBS, Condor and LSF; Condor-G submission points; client; Info. The assignment of nodes to sites is not recoverable from the transcript.)
The Organization: a Collaborative Effort
• We hold weekly meetings to coordinate efforts on the DZero/CDF SAM Grid Project.
• Participants are from UK institutions, NIKHEF, INFN and US institutions.
• We discuss deliverables, design, implementation.
• The real pressure comes from the experiments that are taking data now!
The Short Term Project Goals
• Deployment of JIM to enable execution of unstructured Monte Carlo jobs with basic brokering (end of Summer)
• Status Monitoring of unstructured jobs (end of Summer)
• Basic System Monitoring (end of Summer)
• Execution of unstructured SAM analysis jobs with basic brokering (end of the year)
The 2yr-Term Project Goals
• Reliable Execution of structured, locally distributed Monte Carlo and SAM analysis jobs with basic brokering.
• Scheduling criteria for data-intensive jobs, full Job Handling – Data Handling interaction.
• Fully Distributed Monitoring and Information Services for Structured Jobs and Data Handling.
The Milestones Dependencies
(Dependency chart along a timeline labelled Now, 6 Mo and 9-19 Mo. Boxes shown:)
• Study: JDLs, Use Cases, Condor, GMA, MDS, GSI, SAM
• Documents: Job Def Doc, Tech. Rev. doc., UC doc, Arch. Doc, JDL
• Test beds: MDS TestBed, CondorG TestBed
• GSI in SAM, Condor in SAM, Basic SAM Res Info Service, SAM Grid-ready
• Toy Grid with JSS, basic Monitoring
• Prototype Grid with RB, JSS, GMA-based MIS
• Execute User-routed MC Jobs
• Execute unstructured MC and SAM analysis jobs with basic brokering
• Execute unstructured SAM analysis jobs
• Status Mon-ing of unstructured jobs
• Basic System Mon-ing
• Reliable Execution of structured, locally distributed MC and SAM analysis jobs with basic brokering
• Scheduling criteria for data-intensive jobs, JH-DH interaction design
• Monitoring of structured jobs
• DH Mon-ing
• JH, MIS fully distributed
Conclusions
• SAM is the Data Handling System of the DZero experiment and in phase of commissioning for CDF.
• The SAM-Grid project has the goal of integrating SAM with standard grid technologies to enable fully distributed computing for DZero and CDF.
• The Brokering service of the Grid Architecture of the project is based on the Condor Match Making Service.
• We are funded by PPDG and GridPP and we collaborate with Grid groups in US and EU to best tailor and develop the technologies for the experiments.
• We are deploying a test bed in US and EU to develop and test SAM and JIM.
• The experiments are running now! Closest delivery milestones at the end of the Summer and at the end of the year.
• http://www-d0.fnal.gov/computing/grid/