INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
ATLAS DDM Operations - II
Monitoring and Daily Tasks
Jiří Chudoba
ATLAS meeting, 25.9.2007, CNAF
Cloud Status
• Scheduled and unscheduled downtimes
  – direct emails from sites
  – EGEE broadcasts
  – GOCDB: https://goc.gridops.org/site/
• ARDA Dashboard pages
  – T0 to T1 transfers: http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site
  – all other transfers: http://dashb-atlas-data.cern.ch/dashboard/request.py/site
VOBoxes at CERN
• https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedDataManagementARDAMachines
• separate machines for db services and site services
• CNAF:
  – dq2db-cnaf – db services
  – dq2cnaf – site services for CNAF and T2s
• Access via the account ddmusr02
  – limited possibilities, check /tmp/dq2.log
• Account ddmusr01 restricted to developers
  – why ???
• Installation done by developers
Panda Monitoring
• panda pages
  – DS on sites: http://gridui01.usatlas.bnl.gov:25880/server/pandamon/query?overview=dslist
  – AOD: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listAODReplications
  – aborted DS: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listAbortedDatasets
  – M4: http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query?mode=listM4
More Monitoring
• Stephane’s overview of disk occupancies
http://lapp.in2p3.fr/atlas/Informatique/Offline/monitor_files_sites/all_sites/list_sites.html
• Per data type version, DE cloud:
http://www.etp.physik.uni-muenchen.de/ddm/DE/summary.html
• Site status monitored by GOC – gstat
  – http://goc.grid.sinica.edu.tw/gstat/RU-Protvino-IHEP/GIISQuery_Usage_store_.html
FTS monitoring
• FTS 1.5
  – DE cloud: http://grid.fzk.de/monitoring/fts/transfers.html
  – SARA: http://winnetou.matrix.sara.nl/monitoring/datatransfer/
  – glite-transfer commands:
    glite-transfer-channel-list -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement
Typical tasks
• Errors spotted via monitoring
  – check reasons
  – contact site
  – possibly close the FTS channel
  – verify when corrected
  – open FTS channel
Deletion of Aborted DS
• Mail sent to T1 cloud responsibles (usually 1 per week)
• Different procedures in different clouds
  – FZK
      Cedric’s script delete_dataset_aborted.py
      run regularly from a crontab
      uses: dq2.deleteDatasetReplicas, dq2.deleteDatasetSubscription, dq2.listFilesInDataset, lcg-del, lcg-uf
      list of DS from a file
      part of MyFrameWork: /afs/cern.ch/user/s/serfon/public/ddm/Myframework
      will be published on Thursday
Deletion of Aborted DS II
• SARA cloud: wrappers around dq2_cleanup:
• dq2_delete_aborted.sh
#!/bin/sh
# delete aborted DS using dq2_cleanup
# start 1 dq2_cleanup instance per site
# input via parameter.
# Parameter 1: list of aborted datasets and sites
# example:
# ideal0_mc12.007042.singlepart_gamma_Et60.simul.HITS.v12003103_tid010675 ITEP
# tested from lxplus, when grid and dq2 environment was set and
# production proxy obtained like this:
#
# source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
# voms-proxy-init -voms atlas:/atlas/Role=production -valid 96:0
# source /afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2.sh

SITES="SARADISK SARATAPE NIKHEF ITEP IHEP JINR SINP"
DSLIST=$1
# one background cleanup job per site, all reading the same list
for SITE in $SITES ; do
  dq2_delete_aborted_site.sh "$DSLIST" "$SITE" &
done
Deletion of Aborted DS II
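• dq2_delete_aborted_site.sh
#!/bin/sh
# delete aborted DS from a site using dq2_cleanup
#
# Input
# parameter 1: list of aborted DS
# parameter 2: SITENAME

DSLIST=$1
SITE=$2
DQ2_CLEANUP=/afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2_cleanup
# log name is built from the list name, so pass the list as a plain file name (no path)
LOG="${SITE}_${DSLIST}_$(date +%Y%m%d_%H%M).log"
touch "$LOG"
# DS holds the whole matching line (dataset name and site);
# it is left unquoted so that it is split into separate arguments
grep "$SITE" "$DSLIST" | while read DS ; do
  $DQ2_CLEANUP $DS >>"$LOG" 2>&1
done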
Integrity checks
• Cedric’s script
  – http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Production/swing/scripts/ddm/integrity_check.py?view=log
  – some assumptions (/pnfs access)
• Simple compare of dumps:
#!/bin/bash
# read files from a DPM dump and match them with an LFC dump
# DPM dump obtained by
#   select name from Cns_file_metadata where gid=1307 and filesize > 0;

DPM_DUMP=$1
LFC_DUMP=$2
FOUND=$1.found
MISS=$1.miss
while read -r FN FILEID; do
  # -F: treat the file name as a fixed string, not a regular expression
  if grep -qF -- "$FN" "$LFC_DUMP" ; then
    echo "$FN $FILEID" >> "$FOUND"
  else
    echo "$FN $FILEID" >> "$MISS"
  fi
done < "$DPM_DUMP"
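The loop above rescans the whole LFC dump once per DPM entry, which gets slow for large dumps. A possible single-pass variant is sketched here with sort and comm. It is only a sketch under a strong assumption: both dumps contain exactly one file name per line in the same form, so whole lines can be compared (the loop above matches substrings instead, so real dumps may need to be normalized first). The function name compare_dumps is hypothetical.

```shell
# compare_dumps (hypothetical helper): split the names in a DPM dump into
# those present in an LFC dump (.found) and those missing from it (.miss).
# Assumption: one file name per line, written identically in both dumps.
compare_dumps() {
    dpm_dump="$1"
    lfc_dump="$2"
    tmpdir=$(mktemp -d) || return 1
    # comm needs sorted input; sort -u also drops duplicate lines
    LC_ALL=C sort -u "$dpm_dump" > "$tmpdir/dpm"
    LC_ALL=C sort -u "$lfc_dump" > "$tmpdir/lfc"
    # -12 keeps lines common to both files, -23 keeps lines only in the first
    LC_ALL=C comm -12 "$tmpdir/dpm" "$tmpdir/lfc" > "$dpm_dump.found"
    LC_ALL=C comm -23 "$tmpdir/dpm" "$tmpdir/lfc" > "$dpm_dump.miss"
    rm -rf "$tmpdir"
}
```

Called as compare_dumps dpm.txt lfc.txt, it writes dpm.txt.found and dpm.txt.miss, mirroring the output names of the script above.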
Data loss
• https://twiki.cern.ch/twiki/bin/view/Main/AtlasDDMLostFiles
• Only production files are treated
• Get list of lost files (provided by a sysadmin)
• Remove information about lost files from the SE db (must be done by a sysadmin) – see later talk
• Delete lost entries from an LFC catalogue
• Locate replicas of lost files. If they exist, consider replication to the affected SE. If they do not exist, remove lost files from datasets (DQ2 db) and pass the list of really lost files to prodsys group.
• DB of lost files – will be part of DQ2
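The replica step in the list above amounts to splitting the lost-file list into two groups. A minimal sketch, assuming two plain-text inputs with one file name per line: the sysadmin's list of lost files, and a list of files known to have a replica elsewhere (how that second list is obtained is not covered here). The function name and the output suffixes are hypothetical.

```shell
# classify_lost_files (hypothetical helper)
#   $1: list of lost files, one name per line
#   $2: list of files that still have a replica elsewhere, one name per line
# Writes $1.recoverable (candidates for re-replication to the affected SE)
# and $1.really_lost (to be removed from datasets and reported to prodsys).
classify_lost_files() {
    lost_list="$1"
    replica_list="$2"
    # -F fixed strings, -x whole-line match, -f read patterns from a file;
    # grep exits non-zero when nothing matches, which is not an error here
    grep -Fxf "$replica_list" "$lost_list" > "$lost_list.recoverable" || :
    grep -vFxf "$replica_list" "$lost_list" > "$lost_list.really_lost" || :
}
```

Only the second output file should go to the prodsys group; the first one lists files for which replication to the affected SE can be considered.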
T2 cleaning
• remove_t2_in_t1.py by Stephane
  – A file is deleted if it fulfills all of the following conditions:
      The file in the T2 is replicated in the T1DISK of the same cloud
      The file belongs to a dataset which is not complete at the site
      The file belongs to a dataset (with _tid) which is not subscribed to the T2 site (Be careful: during DDM migration to 0.3, all subscriptions are removed. You might delete too many files until subscriptions are put back.)
  – Since v1.4, you can provide a list of restricted datasets to be deleted (even if subscribed)
  – It first scans the LFC catalog at the Tier1 (it is possible to use a local dump of the LFC catalog), then scans the T2 entries in the LFC and deletes duplicated files on the T2 (using lcg-del).
    To run: python remove_t2_in_t1.py LAPP LPC
    or: python remove_t2_in_t1.py LAPP LPC dataset1 dataset2
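The three deletion conditions above combine with a logical AND. The sketch below shows only that decision and is not Stephane's code: the function name and the yes/no flags are hypothetical, the real script derives these facts from the LFC and DQ2, and the v1.4 dataset-list option is not modelled.

```shell
# should_delete_t2_file (hypothetical): succeeds (exit 0) only if all three
# conditions hold. Arguments are "yes"/"no" flags:
#   $1 file is replicated in the T1DISK of the same cloud
#   $2 dataset is complete at the site
#   $3 dataset is subscribed to the T2 site
should_delete_t2_file() {
    in_t1disk="$1"
    ds_complete="$2"
    ds_subscribed="$3"
    [ "$in_t1disk" = "yes" ] && [ "$ds_complete" = "no" ] && [ "$ds_subscribed" = "no" ]
}

# usage: the file is a deletion candidate only when every condition is met
if should_delete_t2_file yes no no ; then
    echo "delete"
else
    echo "keep"
fi
```

While subscriptions are removed during the migration to 0.3, the third flag is "no" for every dataset, so only the first two conditions protect a file. That is the situation the warning above describes.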
More scripts
• https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationsScripts
• Framework in preparation