Grid Job, Information and Data Management for the Run II Experiments at FNAL

Transcript of Grid Job, Information and Data Management for the Run II Experiments at FNAL

Page 1: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Grid Job, Information and Data Management for the Run II Experiments at FNAL

Igor Terekhov et al. (see next slide), FNAL/CD/CCF, D0, CDF, Condor team, UTA, ICL

Page 2: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Authors

A. Baranovskii, G. Garzoglio, H. Kouteniemi, A. Kreymer, L. Lueking, V. Murthi, P. Mhashikar, S. Patil, A. Rana, F. Ratnikov, A. Roy, T. Rockwell, S. Stonjek, T. Tannenbaum, R. Walker, F. Wuerthwein

Page 3: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Plan of Attack

Brief History: D0 and CDF computing, data handling

Grid Jobs and Information Management:
Architecture
Job management
Information management
JIM project status and plans

Globally distributed data handling in SAM and beyond

Summary

Page 4: Grid Job, Information and Data Management for the Run II Experiments at FNAL

History

Run II: CDF and D0, the two largest currently running collider experiments
Each experiment to accumulate ~1 PB of raw, reconstructed, analyzed data by 2007. Get the Higgs jointly.
Real data acquisition – 5 /wk, 25 MB/s, 1 TB/day, plus MC

Page 5: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Globally Distributed Computing and Grid

D0 – 78 institutions, 18 countries. CDF – 60 institutions, 12 countries.
Many institutions have computing (including storage) resources, dozens for each of D0, CDF
Some of these are actually shared, regionally or experiment-wide

Sharing is good:
A way for an institution to contribute to the collaboration while keeping its resources local
The recent Grid trend (and its funding) encourages it

Page 6: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Goals of Globally Distributed Computing in Run II

To distribute data to processing centers – SAM is a way, see later slide
To benefit from the pool of distributed resources – maximize job turnaround, yet keep a single interface
To facilitate and automate decision making on job/data placement:

Submit to the cyberspace, choose best resource

To reliably execute jobs spread across multiple resources
To provide an aggregate view of the system and its activities, and keep track of what's happening
To maintain security
Finally, to learn and prepare for the LHC computing

Page 7: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Data Distribution – SAM

SAM is Sequential data Access via Meta-data.

http://{d0,cdf}db.fnal.gov/sam
Presented numerous times, previous CHEPs
Core features: meta-data cataloguing, global data replication and routing, co-allocation of compute and data resources
Global data distribution:

MC import from remote sites
Off-site analysis centers
Off-site reconstruction (D0)

Page 8: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Now that the Data's Distributed: JIM

Grid Jobs and Information Management
Owes to the D0 Grid funding – PPDG (the FNAL team), UK GridPP (Rod Walker, ICL)
Very young – started 2001
Actively explore, adopt, enhance, and develop new Grid technologies
Collaborate with the Condor team from the University of Wisconsin on job management
JIM with SAM is also called the SAMGrid

Page 9: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Page 10: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Job Management Strategies

We distinguish grid-level (global) job scheduling (selection of a cluster to run on) from local scheduling (distribution of the job within the cluster)
We distinguish structured jobs from unstructured.

Structured jobs have their details known to Grid middleware. Unstructured jobs are mapped as a whole onto a cluster

In the first phase, we want reasonably intelligent scheduling and reliable execution of unstructured data-intensive jobs.
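To make the distinction concrete, here is a toy sketch (hypothetical types, not JIM code; the first phase targets the unstructured kind):

```python
# Toy illustration of structured vs. unstructured jobs (hypothetical types).
from dataclasses import dataclass, field

@dataclass
class UnstructuredJob:
    # Opaque to the Grid middleware: mapped as a whole onto one cluster,
    # which then distributes it internally via local scheduling.
    executable: str
    input_dataset: str

@dataclass
class StructuredJob:
    # Internal structure is known to the middleware, so each step could
    # in principle be brokered and placed separately.
    steps: list = field(default_factory=list)  # e.g. [UnstructuredJob(...), ...]

job = UnstructuredJob("d0_reco.sh", "raw-run-167000")
# Grid-level scheduling picks a cluster for `job` as a whole; the cluster's
# local scheduler then spreads it over worker nodes.
```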

Page 11: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Job Management Highlights

We seek to provide automated resource selection (brokering) at the global level, with final scheduling done locally (environments like CDF CAF, Frank's talk)
Focus on data-intensive jobs:

Execution time is composed of:
• Time to retrieve any missing input data
• Time to process the data
• Time to store output data

In the leading order, we rank sites by the amount of data cached at the site (minimize missing input data)
The scheduler is interfaced with the data handling system
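A minimal sketch of this leading-order ranking (names and sizes are illustrative; the real brokering runs inside the Condor matchmaking framework described on the next slides):

```python
# Sketch of leading-order site ranking for a data-intensive job:
# prefer the site that already caches the most input data, since that
# minimizes the time to retrieve missing input. Names are illustrative.

def rank_site(job_inputs, site):
    """Bytes of the job's input already cached at the site."""
    return sum(size for name, size in job_inputs.items()
               if name in site["cached_files"])

def choose_site(job_inputs, sites):
    """Leading-order brokering: maximize cached input data."""
    return max(sites, key=lambda s: rank_site(job_inputs, s))

job_inputs = {"raw_001.dat": 500, "raw_002.dat": 700}  # file -> MB
sites = [
    {"name": "fnal", "cached_files": {"raw_001.dat", "raw_002.dat"}},
    {"name": "lyon", "cached_files": {"raw_001.dat"}},
]
print(choose_site(job_inputs, sites)["name"])  # -> fnal
```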

Page 12: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Job Management – Distinct JIM Features

Decision making is based on both:
Information existing irrespective of jobs (resource description)
Functions of (job, resource)

Decision making is interfaced with the data handling middleware, rather than with individual SEs or a replica catalog alone: this allows incorporation of data handling considerations
Decision making is entirely within the Condor framework (no resource broker of our own) – strong promotion of standards and interoperability

Page 13: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Condor Framework and Enhancements We Drove

Initial Condor-G:
Personal Grid agent helping a user run a job on a cluster of his/her choice

JIM: True grid service for accepting and placing jobs from all users

Added MMS for Grid job brokering

JIM: from 2-tier to 3-tier architecture
Decouple the queuing/spooling/scheduling machine from the user machine
Security delegation, proper std* spooling, etc.

Will move into standard Condor

Page 14: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Condor Framework and Enhancements We Drove

Classic Matchmaking Service (MMS):
Clusters advertise their availability; jobs are matched with clusters
Cluster (resource) description exists irrespective of jobs

JIM: Ranking expressions contain functions that are evaluated at run-time

Helps rank a job by a function(job, resource)
Now: query participating sites for data cached. Future: estimates of when data for the job can arrive, etc.
This feature is now in standard Condor-G
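A toy Python stand-in for this mechanism (ClassAd-style matchmaking with a rank expression evaluated at match time; all names, including query_cached_bytes, are illustrative):

```python
# Toy matchmaker: requirements use static resource descriptions, while the
# rank expression is a function(job, resource) evaluated at match time, so it
# can use live information such as the data currently cached at each site.

def query_cached_bytes(site_ad, dataset):
    # Hypothetical run-time callout to the site's data handling system.
    return site_ad["dh_cache"].get(dataset, 0)

def match(job, resource_ads):
    best, best_rank = None, float("-inf")
    for ad in resource_ads:
        if not ad["requirements"](job):   # static part: resource description
            continue
        rank = ad["rank"](job, ad)        # dynamic part: function(job, resource)
        if rank > best_rank:
            best, best_rank = ad, rank
    return best

ads = [
    {"name": "site-a", "dh_cache": {"thumbnail": 800},
     "requirements": lambda job: True,
     "rank": lambda job, ad: query_cached_bytes(ad, job["dataset"])},
    {"name": "site-b", "dh_cache": {"thumbnail": 200},
     "requirements": lambda job: True,
     "rank": lambda job, ad: query_cached_bytes(ad, job["dataset"])},
]
print(match({"dataset": "thumbnail"}, ads)["name"])  # -> site-a
```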

Page 15: Grid Job, Information and Data Management for the Run II Experiments at FNAL

[Architecture diagram: job submission clients (user interface + queuing system) pass jobs to the broker/Match Making Service, which consults an Information Collector fed by Grid Sensors; matched jobs run at execution sites #1 … #n, each with a Computing Element, Queuing System, Storage Elements, and the Data Handling System.]

Page 16: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Monitoring Highlights

Sites (resources) and jobs
Distributed knowledge about jobs, etc.
Incremental knowledge building
GMA for current-state inquiries, logging for recent-history studies
All Web-based

Page 17: Grid Job, Information and Data Management for the Run II Experiments at FNAL

[Monitoring diagram (JIM Monitoring): Web browsers query Web servers 1 … N, each backed by the corresponding site information system (Site 1 … N), connected via the "IP" components shown in the diagram.]

Page 18: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Information Management – Implementation and Technology Choices

XML for representation of site configuration and (almost) all other information
XQuery and XSLT for information processing
Xindice and other native XML databases for database semantics
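As a small illustration of the schema-free XML approach (the site-configuration fragment is hypothetical, and Python's standard library stands in for XQuery/Xindice here):

```python
# Hypothetical site-configuration fragment in XML; JIM processed such
# documents with XQuery/XSLT and stored them in Xindice. Python's stdlib
# is used here only to illustrate schema-free navigation.
import xml.etree.ElementTree as ET

site_config = """
<site name="ccin2p3">
  <cluster name="analysis">
    <computing-element batch="bqs" nodes="64"/>
    <storage-element protocol="gridftp" path="/sam/cache"/>
  </cluster>
</site>
"""

root = ET.fromstring(site_config)
# No fixed schema: just pull out whatever elements happen to be present.
for se in root.iter("storage-element"):
    print(root.get("name"), se.get("protocol"), se.get("path"))
# -> ccin2p3 gridftp /sam/cache
```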

Page 19: Grid Job, Information and Data Management for the Run II Experiments at FNAL

[Schema diagram: a meta-schema ties together the main site/cluster configuration schema, resource advertisement, the monitoring schema, data handling, and the hosting environment.]

Page 20: Grid Job, Information and Data Management for the Run II Experiments at FNAL

JIM Project Status

Delivered prototype for D0, Oct 10, 2002:

Remote job submission
Brokering based on data cached
Web-based monitoring

SC-2002 demo – 11 sites (D0, CDF), big success
April 2003 – production deployment of V1 (Grid analysis in production a reality as of April 1)
Post V1 – OGSA, Web services, logging service

Page 21: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Grid Data Handling

We define GDH as a middleware service which:

Brokers storage requests
Maintains economical knowledge about the costs of access to different SEs
Replicates data as needed (not only as driven by admins)

Generalizes or replaces some of the services of the Data Management part of SAM
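A sketch of the kind of cost-based brokering this implies (the cost model and storage-element names are illustrative, not SAM's actual implementation):

```python
# Sketch of cost-based storage brokering: route a read request to the
# cheapest storage element holding a replica. Cost model is illustrative.

def access_cost(se, nbytes):
    """'Economical knowledge': estimated seconds to read nbytes from an SE."""
    return se["latency_s"] + nbytes / se["bandwidth_Bps"]

def broker_read(replicas, nbytes):
    """Broker the request to the cheapest SE currently holding the data."""
    return min(replicas, key=lambda se: access_cost(se, nbytes))

replicas = [
    {"name": "tape-se", "latency_s": 120.0, "bandwidth_Bps": 30e6},
    {"name": "disk-se", "latency_s":   0.5, "bandwidth_Bps": 100e6},
]
print(broker_read(replicas, 2e9)["name"])  # -> disk-se (2 GB read)
```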

Page 22: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Grid Data Handling, Initial Thoughts

Page 23: Grid Job, Information and Data Management for the Run II Experiments at FNAL

The Necessary (Almost) Final Slide

Run II experiments' computing is highly distributed; the Grid trend is very relevant
The JIM (Jobs and Information Management) part of the SAMGrid addresses the needs of global and grid computing at Run II
We use Condor and Globus middleware to schedule jobs globally (based on data) and provide Web-based monitoring
Demo available – see me or Gabriele

Page 24: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Acks

PPDG project, its management, for making it possible
GridPP project in the UK, for its funding
Jae Yu and others of UTA-Texas, FNAL CD management, for continuing support for student internship programs
Other members of the Condor team for fruitful discussions

Page 25: Grid Job, Information and Data Management for the Run II Experiments at FNAL

P.S. – Related Talks

F. Wuerthwein, CAF (Cluster Analysis Facility) – job management on a cluster and interface to JIM/Grid
F. Ratnikov, monitoring on CAF and interface to JIM/Grid
S. Stonjek, SAMGrid deployment experiences
L. Lueking, G. Garzoglio – SAM-related

Page 26: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Backup Slides

Page 27: Grid Job, Information and Data Management for the Run II Experiments at FNAL

Information Management

In JIM's view, this includes both:
Resource description for job brokering
Infrastructure for monitoring (core project area)

GT MDS is not sufficient:
Need (persistent) info representation that's independent of LDIF or other such formats
Need maximum flexibility in information structure – no fixed schema
Need configuration tools, push operation, etc.