ATLAS DDM Operations - II
Monitoring and Daily Tasks
Jiří Chudoba
ATLAS meeting, 25.9.2007, CNAF
INFSO-RI-508833, Enabling Grids for E-sciencE, www.eu-egee.org


Cloud Status

• Scheduled and unscheduled downtimes
  – direct emails from sites
  – EGEE broadcasts
  – GOCDB: https://goc.gridops.org/site/
• ARDA Dashboard pages
  – T0 to T1 transfers: http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site
  – all other transfers: http://dashb-atlas-data.cern.ch/dashboard/request.py/site

VOBoxes at CERN

• https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedDataManagementARDAMachines

• separate machines for db services and site services
• CNAF:
  – dq2db-cnaf: db services
  – dq2cnaf: site services for CNAF and its T2s
• Access via the account ddmusr02
  – limited possibilities, check /tmp/dq2.log
• Account ddmusr01 restricted to developers
  – why ???
• Installation done by developers

Panda Monitoring

• Panda pages
  – DS (datasets) on sites: http://gridui01.usatlas.bnl.gov:25880/server/pandamon/query?overview=dslist
  – AOD: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listAODReplications
  – aborted DS: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listAbortedDatasets
  – M4: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listM4

More Monitoring

• Stephane’s overview of disk occupancy:
  http://lapp.in2p3.fr/atlas/Informatique/Offline/monitor_files_sites/all_sites/list_sites.html
• Per data type version, DE cloud:
  http://www.etp.physik.uni-muenchen.de/ddm/DE/summary.html
• Site status monitored by GOC (gstat), for example:
  http://goc.grid.sinica.edu.tw/gstat/RU-Protvino-IHEP/GIISQuery_Usage_store_.html

FTS monitoring

• FTS 1.5
  – DE cloud: http://grid.fzk.de/monitoring/fts/transfers.html
  – SARA: http://winnetou.matrix.sara.nl/monitoring/datatransfer/
  – glite-transfer commands:
    glite-transfer-channel-list -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement

Typical tasks

• Errors spotted via monitoring
  – check reasons
  – contact site
  – possibly close the FTS channel
  – verify when corrected
  – open FTS channel

Deletion of Aborted DS

• Mail sent to the people responsible for each T1 cloud (usually one per week)
• Different procedures in different clouds
  – FZK: Cedric’s script delete_dataset_aborted.py
      run regularly from a crontab
      uses: dq2.deleteDatasetReplicas, dq2.deleteDatasetSubscription, dq2.listFilesInDataset, lcg-del, lcg-uf
      list of DS read from a file
      part of MyFrameWork: /afs/cern.ch/user/s/serfon/public/ddm/Myframework
      will be published on Thursday

Deletion of Aborted DS II

• SARA cloud: wrappers around dq2_cleanup
• dq2_delete_aborted.sh

    #!/bin/sh

    # delete aborted DS using dq2_cleanup
    # start 1 dq2_cleanup instance per site

    # input via parameter.
    # Parameter 1: list of aborted datasets and sites
    # example:
    # ideal0_mc12.007042.singlepart_gamma_Et60.simul.HITS.v12003103_tid010675 ITEP

    # tested from lxplus, when grid and dq2 environment was set and
    # production proxy obtained like this:
    #
    # source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
    # voms-proxy-init -voms atlas:/atlas/Role=production -valid 96:0
    # source /afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2.sh

    SITES="SARADISK SARATAPE NIKHEF ITEP IHEP JINR SINP"

    DSLIST=$1

    for SITE in $SITES ; do
      dq2_delete_aborted_site.sh $DSLIST $SITE &
    done
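The wrapper starts one background job per site and then returns without waiting for them. A minimal sketch of the same fan-out that also waits for every job and counts the failures could look like this; `run_per_site` and the worker command passed to it are hypothetical names for illustration, not part of dq2_cleanup or of the original scripts.

```shell
#!/bin/sh
# Sketch only: run one background worker per site, then wait for all of them.
# run_per_site and its arguments are illustrative assumptions.

# usage: run_per_site WORKER DSLIST SITE...
# Starts "WORKER DSLIST SITE" in the background for each site and
# returns the number of workers that exited with a non-zero status.
run_per_site() {
    worker=$1
    dslist=$2
    shift 2
    pids=""
    for site in "$@"; do
        "$worker" "$dslist" "$site" &
        pids="$pids $!"          # remember the PID of each background job
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}
```

With the variables of the script above, a call might read `run_per_site dq2_delete_aborted_site.sh "$DSLIST" $SITES`, so that the calling shell only continues once all per-site logs are complete.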

Deletion of Aborted DS II

• dq2_delete_aborted_site.sh

    #!/bin/sh

    # delete aborted DS from a site using dq2_cleanup
    #
    # Input
    # parameter 1: list of aborted DS
    # parameter 2: SITENAME

    DSLIST=$1
    SITE=$2

    DQ2_CLEANUP=/afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2_cleanup

    LOG="${SITE}_${DSLIST}_`date +%Y%m%d_%H%M`.log"
    touch $LOG
    grep $SITE $DSLIST | while read DS ; do
      $DQ2_CLEANUP $DS >>$LOG 2>&1
    done
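The `grep $SITE` selection matches the site name anywhere on the line, so a dataset whose name happens to contain a site name would also be picked up. A minimal sketch of a stricter selection, assuming the two-column "dataset site" list format from the example in the wrapper script; `select_site_lines` is a hypothetical helper name and the dataset names in any test input are made up.

```shell
#!/bin/sh
# Sketch only: select list lines whose second column is exactly the site name.
# select_site_lines is an illustrative helper, not part of the original scripts.

# usage: select_site_lines LISTFILE SITE
# Prints every "dataset site" line whose site column equals SITE.
select_site_lines() {
    awk -v site="$2" '$2 == site { print $0 }' "$1"
}
```

The output keeps the same "dataset site" lines that the original loop reads, so it could replace the `grep` in the pipeline without changing what is passed on.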

Integrity checks

• Cedric’s script
  – http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Production/swing/scripts/ddm/integrity_check.py?view=log
  – some assumptions (/pnfs access)
• Simple comparison of dumps:

    #!/bin/bash

    # read files from a DPM dump and match them with an LFC dump
    # DPM dump obtained by
    #   select name from Cns_file_metadata where gid=1307 and filesize > 0;

    DPM_DUMP=$1
    LFC_DUMP=$2

    FOUND=$1.found
    MISS=$1.miss

    cat $DPM_DUMP | while read FN FILEID; do
      grep -q $FN $LFC_DUMP
      if [ $? == 0 ] ; then
        echo "$FN $FILEID" >> $FOUND
      else
        echo "$FN $FILEID" >> $MISS
      fi
    done
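The loop runs one `grep` over the whole LFC dump for every DPM entry, which becomes slow for large dumps. A minimal sketch of a single-pass variant follows. It assumes that the first field of each LFC dump line is exactly the same name string as the first field of the DPM dump, which is stricter than the substring match done by `grep` and may not hold for real dumps; `split_by_catalog` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: split a DPM dump into .found and .miss files in one pass.
# Assumes exact equality of the first field in both dumps (an assumption,
# the original loop uses a substring/regex match instead).

# usage: split_by_catalog DPM_DUMP LFC_DUMP
# Writes DPM_DUMP.found and DPM_DUMP.miss (both are truncated first,
# whereas the original loop appends).
split_by_catalog() {
    dpm=$1
    lfc=$2
    : > "$dpm.found"
    : > "$dpm.miss"
    awk -v found="$dpm.found" -v miss="$dpm.miss" '
        FILENAME == ARGV[1] { in_lfc[$1] = 1; next }  # first file: LFC names
        ($1 in in_lfc)      { print > found; next }   # DPM entry known to LFC
                            { print > miss }          # DPM entry not in LFC
    ' "$lfc" "$dpm"
}
```

Each dump is read once and the LFC names are held in memory, so the cost grows with the sum of the dump sizes rather than their product.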

Data loss

• https://twiki.cern.ch/twiki/bin/view/Main/AtlasDDMLostFiles

• Only production files are treated
• Get the list of lost files (provided by a sysadmin)
• Remove information about lost files from the SE db (must be done by a sysadmin) – see later talk
• Delete lost entries from the LFC catalogue
• Locate replicas of lost files. If they exist, consider replication to the affected SE. If they do not exist, remove the lost files from datasets (DQ2 db) and pass the list of really lost files to the prodsys group.
• DB of lost files – will be part of DQ2
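The replica step of this procedure is a two-way split of the lost-file list. A minimal sketch, assuming two plain-text inputs with one file name per line: the lost files, and the subset of names that still have a replica elsewhere. How those lists are produced (LFC and DQ2 queries) is not covered here, and `classify_lost` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: split lost files into "recover" (a replica exists elsewhere)
# and "lost" (no replica left). Input formats are assumptions.

# usage: classify_lost LOST_LIST REPLICATED_LIST
# Prints "recover NAME" or "lost NAME" for every name in LOST_LIST.
classify_lost() {
    awk '
        FILENAME == ARGV[1] { has_replica[$1] = 1; next }  # replicated names
        ($1 in has_replica) { print "recover", $1; next }
                            { print "lost", $1 }
    ' "$2" "$1"
}
```

Names tagged `recover` are candidates for replication back to the affected SE; names tagged `lost` are the ones to remove from the datasets and to report to the prodsys group.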

T2 cleaning

• remove_t2_in_t1.py by Stephane
  – A file is deleted if it fulfills all of the following conditions:
      The file in the T2 is replicated on the T1DISK of the same cloud
      The file belongs to a dataset which is not complete at the site
      The file belongs to a dataset (with _tid) which is not subscribed to the T2 site (Be careful: during the DDM migration to 0.3, all subscriptions are removed. You might delete too many files until the subscriptions are put back.)
  – Since v1.4, you can provide a restricted list of datasets to be deleted (even if subscribed)
  – It first scans the LFC catalog at the Tier1 (it is possible to use a local dump of the LFC catalog), then scans the T2 entries in the LFC and deletes duplicated files on the T2 (using lcg-del).
    To run:
      python remove_t2_in_t1.py LAPP LPC
    or
      python remove_t2_in_t1.py LAPP LPC dataset1 dataset2
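The deletion rule is a conjunction of three conditions, which can be illustrated on a small table. This is only a sketch of the rule as stated on the slide, not Stephane's script: the four-column input format (name, replicated in T1, dataset complete at the site, dataset subscribed to the T2, each flag `yes` or `no`) and the name `deletable_files` are assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: apply the three deletion conditions to a pre-computed table.
# Column layout (hypothetical): name in_t1 complete subscribed

# usage: deletable_files TABLE
# Prints the names of files that meet all three conditions:
# replicated in the T1, dataset not complete at the site, dataset not subscribed.
deletable_files() {
    awk '$2 == "yes" && $3 == "no" && $4 == "no" { print $1 }' "$1"
}
```

A file that fails any one condition is kept, which is why removing all subscriptions (as during the migration to 0.3) makes many more files eligible at once.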

More scripts

• https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationsScripts
• Framework in preparation