Production Coordination Staff Retreat July 21, 2010 Dan Fraser – Production Coordinator.

Production WBS
Identify & resolve OSG issues
  Prod calls, look at usage patterns, metrics …
T3 Liaison
  Docs, Security, Xrootd, WLCG client, Coord w/ Hiro, Asoka…
Manage relationships & raise issues with the ET when necessary
  T1/T2/T3 Admins – identify requirements
Assist the ET team w/ Operations, Sites, VOs, & Education/training
Lead & coordinate plans to improve the OSG facility
  Glide-in effort (paper, VO transitions, testing format)

Some Production Examples…
Effort from the entire team
  SE-only solution for Atlas T3s
  ITIL-like processes for Operations
  Updated process for CA testing prior to production
  CEMON issues (hanging the BDII)
  CERN BDII not reporting (RG data limit exceeded)
  CERN BDII not a high priority (pushed but no movement)
  Transitioning sites to use the new Gratia collector address (in progress)
  Urgent security updates for sites running Condor/Gratia
Addressing VO Issues:
  LIGO Production running well (Rob E.)
    Between Rank #1 & #2 on OSG
  SBGRID reaching new peaks (~7000 parallel jobs)

A View from the Production Coordinator
What are the biggest problems in OSG?
Supporting VOs is difficult
  Lowering the barrier for scaling across sites
    Site differences often require site-by-site investigation
    Effort in progress to understand this (Dan, Abhishek)
    Big win possible with Glide-ins
    New paper comparing job submission strategies
  How to get opportunistic storage
    Current method is to talk to each site…
    New strategies being explored (Tanya, Brian, Dan)

OSG Health Monitoring
All links now on the production page
https://twiki.grid.iu.edu/bin/view/Production/WebHome
  Usage Charts
  Weekly Calls
  OSG Data movement
  Job/Error ratios
  DOE display showing last 24 hours
  and much more …

Solving Production Problems
Solving problems is a TEAM sport
The weekly production call has key people from all the teams that are needed to solve problems
  CMS, Atlas, LIGO, VOs, Engage, Integration, Sites, STG, Security, Operations, Metrics
Problems accurately prioritized and channeled to the correct avenue
  Sometimes solved on the call.
Forewarning to prepare for upcoming issues.

Example Problems
Handling of job pre-emption (LIGO / D0)
VO Package Validation probe needed
  GIP “truth in advertising”
LIGO switch to GT2 and also Condor-G job submission
Condor scaling limits in GridMon (Atlas)
Globus LSF gatekeeper bug (D0/CMS)
Security Drill successes (for T1)
Gratia probe introduction & ITB testing

Example Issues cont.
STEP09 monitoring (partially successful)
IceCube management of opportunistic storage
Gratia file transfer data catch up
Transition from VORS to myOSG
New location for RSV probes and ability to update from the “production” cache
  Also, ensure that config_OSG does not update the probes automatically
Root Cause Analysis of CMS BDII outage

Example Issues cont.
Plan to localize data transfer information and upload summary transfer packets.
Globus memory leak was causing frequent reboots at BNL.
Site name mapping problem to enable different names internal to OSG.
OIM display difference (http vs https)
Site admin meeting & materials prep to help sites upgrade to OSG 1.2.

Example Issues cont.
Condor problem with directory creation in a multiple gateway scenario. (Nebraska)
Gratia collector problem with handling records that accumulate faster than they can be processed.
LIGO/Pegasus transition to use BDII data instead of central probe data.