S4PA Deployment M. Hegde Science Systems & Applications, Inc April 26, 2006.
2
Introduction
S4PA dependencies
Installing S4PA
Creating an S4PA instance
Monitoring an S4PA instance
Instructions available in S4PA Wiki at http://discette.gsfc.nasa.gov/mwiki/index.php/S4PA
3
S4PA Dependencies
Perl 5.8.x
S4P 5.28.1+
XML::LibXML, XML::LibXSLT, XML::Simple, XML::Twig
Net::FTP, Net::Netrc, Net::SSH2
MLDBM, DB_File, Storable, Data::Dumper
SOAP::Lite, URI::URL
HTTP_service_URL, Clavis
Compilers/libraries needed for metadata extractors and Giovanni pre-processors.
4
It is good to know…
Editing XML. Using XML schema is preferred.
.netrc setup.
Setting up SSH key exchange if needed.
Perl regular expressions for data polling.
XPath for complex granule replacement logic.
5
S4PA Directory Structure
S4PA Root (e.g., /vol1/OPS/s4pa/):
  receiving/
    <provider>/            (stations: Poller: PDR, Poller: Data, ReceiveData)
    polling/
      pdr/
  storage/
    <data_class>/
      <dataset>/
        granule.db
        data→              (symbolic link into the Active File System)
      dataset.cfg
      store_<data_class>/  (StoreData)
      check_<data_class>/  (CheckData)
      delete_<data_class>/ (DeleteData)
  publish_echo/            (PublishECHO; pending_publish/, pending_delete/)
  publish_mirador/         (PublishMirador; pending_publish/, pending_delete/)
  publish_whom/            (PublishWHOM; pending_publish/, pending_delete/)
  subscribe/               (SubscribeData; pending/, s4pa_subscription.cfg)
  postoffice/              (PostOffice)

FTP Root (e.g., /ftp), on RAID data storage:
  .<provider>/<nnn>/       (file systems; active_fs→ marks the Active File System)
  Storage Directory (e.g., data/s4pa/):
    <data group>/
      <dataset>/
        <yyyy>/
          <ddd>/
            x.hdf→, x.xml→, x.jpg→, x.png→  (symbolic links to archived x.hdf, x.xml, x.jpg, x.png)
    doc/
      <dataset>/

Restricted data sub-web (defined in server config):
  <http root>/
    groups, users DBM files
    .htaccess (opt.)

Legend: directory/, dataset, symbolic link→, station_directory/Station Name
6
S4PA Terminology
S4PA terms:
Dataset is the equivalent of a data product in WHOM and an ESDT in ECS.
Data Class is a logical group of datasets for S4PA's internal use. Generally, datasets are grouped by common methods in S4PA. It is not visible to data users.
Data Group is a logical group of datasets from the data user's perspective.
Active File System is the file system where S4PA is currently writing data.
Storage Directory is the root directory for data access in S4PA. Its sub-directories are Data Groups.
Data Provider is the label given to a data provider. All datasets belonging to a provider end up on the same active file system.
7
Architecting an S4PA Instance
Create a user account for operating S4PA (e.g., s4paops) and a group (e.g., s4pa) to share resources.
Estimate disk space requirements and divide the RAID into file systems whose size equals that of a backup tape or disk.
Name file systems as /ftp/.<provider>/<nnnn>, where <provider> is the data provider's label and <nnnn> is the 3- or 4-digit label of the file system. Ex: /ftp/.trmm/001
Create the Storage Directory, generally in an FTP area. Ex: /ftp/data/s4pa
8
Architecting - continued
Determine Datasets, Data Classes, and Data Groups supported by the instance.
Determine Data Providers supported by the instance.
Identify metadata extractors, if any, and get their synopses.
Identify publication requirements; prepare the GCMD DIF and collection README document.
Add entries to .netrc for all hosts with which S4PA will interact, including cases where SSH/SFTP is used.
Set up SSH key exchange if necessary.
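For reference, a .netrc entry follows the standard one-machine-per-stanza format; the host below is taken from the deck's examples and the credentials are placeholders:

```
machine s4pt.ecs.nasa.gov
login anonymous
password someuser@example.gov
```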
9
Obtaining S4PA Distributions
Generally, an S4PA instance depends on a core and an instance specific distribution.
The distributions are available as gzipped tar files from ftp://s4pt.ecs.nasa.gov/software/s4pa/
The core distribution is named S4PA-X.Y.Z.tar.gz where X, Y and Z are major, minor and patch release numbers.
The project specific distribution is named S4PA_<ProjectName>-X.Y.Z.tar.gz.
The S4PA core and instance-specific distributions are stored in a CVS repository under the project names S4PA and S4PA_<ProjectName> (e.g., S4PA_ACDISC, S4PA_TRMM).
10
Installing S4PA
Identify and obtain the necessary S4PA distributions. Decompress and un-tar the distribution files. S4PA projects use MakeMaker for installation. Use the following steps to install a project:
  Change directory to the root of the un-tarred directory.
  perl Makefile.PL PREFIX=/tools/gdaac/TS2 (substitute mode as necessary)
  make
  make pure_site_install
Save the un-tarred area of the instance-specific distribution. It may contain configuration files needed later in the process. Ex: ./doc/xsd/S4paDescriptor.xsd, ./doc/xsd/S4paSubscription.xsd
11
Creating an S4PA Instance
S4PA provides a tool, s4pa_deploy.pl, to create the necessary S4PA stations, directories, and symbolic links. It can:
  Create station directories and the necessary station configuration files for all stations in S4PA.
  Create directories for datasets in the Storage Directory.
  Create a symbolic link to the Active File System.
  Transfer the README document to its dataset storage directory.
It will not:
  Set up file systems in the Active File System area.
  Set up any configuration file needed by metadata extractors or Giovanni pre-processors.
12
Creating an S4PA Instance
An S4PA instance is described in a file called the deployment descriptor.
Suggested name: descriptor_<InstanceName>.xml
The deployment descriptor is in XML and is based on the schema ftp://s4pt.ecs.nasa.gov/software/s4pa/S4paDescriptor.xsd
Once created, the deployment descriptor is stored in the cfg directory under the instance-specific CVS project.
Run command
s4pa_deploy.pl -f <Descriptor> -s <DescriptorSchema>
13
Deployment Descriptor
Notation: Words in bold indicate XML elements. Italicized words with @ as a superscript indicate attributes.
The root element of the descriptor is s4pa; its NAME@ attribute indicates the S4PA instance name.
s4pa contains a root, storageDir, tempDir, documentLocation, urlRoot, logger, and one or more providers.
s4pa also contains optional auxiliaryBackUpArea, project, protocol, reconciliation, publication, subscription, deletionDelay, postoffice, and houseKeeper elements.
s4pa Element
The content of root is the root directory of S4PA stations. Ex: <root>/vol1/OPS/s4pa/</root>
The content of storageDir is the root directory of data archive’s public view. Ex: <storageDir>/ftp/data/s4pa/</storageDir>
The content of tempDir is the global directory that serves as the root working directory for filters. Ex: <tempDir>/var/tmp</tempDir>
The content of documentLocation is the URL for storage of README documents for datasets. Ex: <documentLocation>http://discette.gsfc.nasa.gov/uploads </documentLocation>
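Assembled from the elements above, a minimal descriptor skeleton might look like the following sketch; the instance name is hypothetical and the paths are the deck's own examples:

```xml
<!-- Hypothetical minimal deployment descriptor skeleton -->
<s4pa NAME="TS2_EXAMPLE">
  <root>/vol1/OPS/s4pa/</root>
  <storageDir>/ftp/data/s4pa/</storageDir>
  <tempDir>/var/tmp</tempDir>
  <documentLocation>http://discette.gsfc.nasa.gov/uploads</documentLocation>
  <!-- urlRoot, logger, and one or more provider elements are also required -->
</s4pa>
```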
s4pa Element
The content of urlRoot is the root URLs for accessing data, specified as element attributes (FTP and HTTP).
logger has DIR@ and LEVEL@.
DIR@ indicates the directory for storing log files; LEVEL@ indicates the logging level (DEBUG or INFO).
Optional project contains one or more locations, which indicate the location of the project sandbox for metadata extractors.
Optional subscription has an INTERVAL@ that indicates the subscribe station's polling interval; defaults to 86400 (1 day).
15
s4pa Element
Optional deletionDelay has INTER_VERSION@ and INTRA_VERSION@.
  INTER_VERSION@ indicates the retention period for inter-version deletion; defaults to 86400*180 (6 months).
  INTRA_VERSION@ indicates the retention period for intra-version deletion; defaults to 86400 (1 day).
Optional postOffice has:
  INTERVAL@, the postoffice station's polling interval; defaults to 10.
  MAX_JOBS@, the postoffice station's max_children; defaults to 1.
  MAX_ATTEMP@, the postoffice station's maximum number of retries before job failure; defaults to 1.
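As a sketch, these optional elements could be written as follows; the values simply restate the defaults listed above (86400*180 = 15552000) and are not taken from a real instance:

```xml
<!-- Hypothetical sketch of the optional housekeeping elements -->
<deletionDelay INTER_VERSION="15552000" INTRA_VERSION="86400"/>
<postOffice INTERVAL="10" MAX_JOBS="1" MAX_ATTEMP="1"/>
```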
16
17
protocol Element
protocol has a NAME@. Valid values are FILE, FTP, SFTP, and HTTP. If unspecified, FTP is used for a host.
protocol contains one or more host elements. host indicates the name of the host for which the specified protocol is to be used. Ex:
<protocol NAME="FTP">
  <host>discette.gsfc.nasa.gov</host>
</protocol>
<protocol NAME="SFTP">
  <host>tads1.ecs.nasa.gov</host>
  <host>auraraw1.ecs.nasa.gov</host>
</protocol>
reconciliation Element
reconciliation is the holder of partner data reconciliation information; it contains optional echo, mirador, and dotchart elements.
echo has required URL@, USERNAME@, PASSWORD@, PUSH_USER@, PUSH_PWD@, and ENDPOINT_URI@ attributes, and optional MAX_GRANULE_COUNT@, LOCAL_DIR@, CHROOT_DIR@, STAGING_DIR@, DATA_HOST@, and MIN_INTERVAL@ attributes.
mirador and dotchart each have one required ENDPOINT_URI@ attribute, and the same set of optional attributes as the echo element plus an extra PULL_TIMEOUT@.
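The attributes above can be sketched in a reconciliation element like this; every value is a placeholder, since the deck gives no concrete reconciliation example:

```xml
<!-- Hypothetical reconciliation sketch; all attribute values are placeholders -->
<reconciliation>
  <echo URL="..." USERNAME="..." PASSWORD="..."
        PUSH_USER="..." PUSH_PWD="..." ENDPOINT_URI="..."/>
  <mirador ENDPOINT_URI="..."/>
  <dotchart ENDPOINT_URI="..."/>
</reconciliation>
```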
18
publication and echo Element
publication is the holder of metadata publication related information; it contains optional echo, mirador, giovanni, and dotchart elements.
echo contains granuleInsert, granuleDelete, browseInsert, browseDelete, and collectionInsert.
echo has HOST@, VERSION@, and MAX_GRANULE_COUNT@.
Ex: <echo HOST="ingest.echo.nasa.gov" VERSION="10">
<granuleInsert DIR="/data/granule"/>
<granuleDelete DIR="/data/granule"/>
<browseInsert DIR="/data/browse"/>
<browseDelete DIR="/data/browse"/>
<collectionInsert DIR="/data/collection"/>
</echo>
19
publication - mirador Element
mirador contains granuleInsert, granuleDelete, and productDocument, and has a HOST@.
granuleInsert and granuleDelete each have HOST@ and DIR@.
productDocument has HOST@, DIR@, and CMS_TEMPLATE@. Ex:
<mirador HOST="invenio.gsfc.nasa.gov">
  <granuleInsert DIR="/ftp/private/Mirador/agdisc/Inserts"/>
  <granuleDelete DIR="/ftp/private/Mirador/agdisc/Deletes"/>
  <productDocument DIR="/ftp/private/Mirador/agdisc/ProdDocs"
      CMS_TEMPLATE="/home/s4pa/mirador_L2_RCW.dwt"/>
</mirador>
20
publication - giovanni Element
giovanni contains granuleInsert and granuleDelete, and has a HOST@.
granuleInsert and granuleDelete each have HOST@ and DIR@. Ex:
<giovanni HOST="gdata1.sci.gsfc.nasa.gov">
  <granuleInsert DIR="/ftp/private/Giovanni/agdisc/Inserts"/>
  <granuleDelete DIR="/ftp/private/Giovanni/agdisc/Deletes"/>
</giovanni>
21
publication - dotChart Element
dotChart contains granuleInsert, granuleDelete, dbExport, and collectionInsert, and has a HOST@.
granuleInsert, granuleDelete, and collectionInsert each have HOST@ and DIR@.
dbExport has HOST@ and DIR@. Ex:
<dotChart HOST="tads1.ecs.nasa.gov">
  <granuleInsert DIR="/ftp/private/Dotchart/pending_insert"/>
  <granuleDelete DIR="/ftp/private/Dotchart/pending_delete"/>
  <dbExport DIR="/ftp/private/Dotchart/dbExport"/>
  <collectionInsert DIR="/ftp/private/Dotchart/pending_dif"/>
</dotChart>
22
houseKeeper Element
houseKeeper is the holder for user-defined housekeeping jobs.
houseKeeper contains one or more job elements; the content of each job is the customized script command.
job has NAME@ and DOWNSTREAM@.
NAME@ indicates the job title. DOWNSTREAM@ indicates the downstream station for the output work order. Ex:
<job NAME="CLEAN_UP">./my_clean_up_job.sh</job>
<job NAME="REQUEST_DATA" DOWNSTREAM="other/machine_search">./my_auto_request.sh</job>
23
24
provider Element
provider has a NAME@. It contains an activeFileSystem, a poller, a pan, and one or more dataClass elements.
The content of activeFileSystem is the location of the current file system being written to. Ex: /ftp/.trmm/001/
activeFileSystem has:
MAX@: fraction (0-1) of maximum usable disk space.
FILE_SIZE_MARGIN@: marginal size needed for every file during ingest. For example, ((1 + FILE_SIZE_MARGIN) * File Size) is the size allocated for a file.
provider Element
NOTIFY_ON_FULL@: email address(es) to alert for volume backup when a volume is filled.
CONFIGURED_VOLUMES@: optional configuration file specifying the allocated volumes for a rolling archive and non-continuous volume partitions.
LOW_VOLUME_THRESHOLD@: optional fraction of the configured volume that triggers an anomaly when the free space left on the current volume passes this threshold with no new volume configured in the line-up.
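Putting the provider pieces together, a sketch could look like this; the provider name and path reuse the deck's /ftp/.trmm/001/ example, while the attribute values and email address are hypothetical:

```xml
<!-- Hypothetical provider sketch -->
<provider NAME="trmm">
  <activeFileSystem MAX="0.95" FILE_SIZE_MARGIN="0.05"
      NOTIFY_ON_FULL="ops@example.gov">/ftp/.trmm/001/</activeFileSystem>
  <!-- a poller, a pan, and one or more dataClass elements follow -->
</provider>
```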
25
poller and pdrPoller Element
poller is a complex element holding pdrPollers and dataPollers.
pdrPoller has an INTERVAL@, MAX_THREAD@ and MAX_FAILURE@.
INTERVAL@ is the polling interval in seconds. Default value is 600 seconds.
MAX_THREAD@ is the maximum number of threads allowed for PDR poller. Default value is 1.
MAX_FAILURE@ is the maximum number of failures allowed for pdr poller. Default value is 1.
pdrPoller contains pdrFilter and one or more jobs. pdrFilter has a PATTERN@.
26
pdrPoller - job Element
job contains exclude and pdrFilter. job has NAME@, HOST@, DIR@, IGNORE_HISTORY@, MERGE_PAN@, PATTERN@, and TYPE@.
NAME@ is the name of the poller (must be unique).
HOST@ and DIR@ are the host and directory being polled.
IGNORE_HISTORY@ has a boolean value indicating whether to ignore polling history; defaults to "false".
MERGE_PAN@ has a boolean value indicating whether PAN merging is required; defaults to "false".
PATTERN@ is the PDR filename pattern; defaults to "\.PDR$".
TYPE@ specifies whether the PDR is "EDOS" type; default is non-EDOS.
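Combining the attributes and defaults above, a pdrPoller sketch might read as follows; the job name and directory are hypothetical, and the host is borrowed from the deck's other examples:

```xml
<!-- Hypothetical pdrPoller sketch using the stated defaults -->
<pdrPoller INTERVAL="600" MAX_THREAD="1" MAX_FAILURE="1">
  <pdrFilter PATTERN="\.PDR$"/>
  <job NAME="example_pdr_poller" HOST="s4pt.ecs.nasa.gov"
       DIR="/ftp/private/TS2/PDR" PATTERN="\.PDR$"/>
</pdrPoller>
```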
27
28
dataPoller Element
dataPoller has INTERVAL@, MAX_THREAD@, and MAX_FAILURE@. They have the same definitions as in a pdrPoller.
dataPoller has one or more jobs. A dataPoller's job has the following attributes:
NAME@ is the unique name of the poller.
HOST@ and DIR@ are the host and directory being polled.
PROTOCOL@ is the protocol to be used for polling. Valid values are "FTP", "FILE", and "HTTP". Default value is FTP.
EXTERNAL_API@ is an API to supply the list of remote files (in 'URL|size' format) for an HTTP protocol poller.
RECURSIVE@ is a boolean (true or false) value indicating recursive polling of a directory for an FTP protocol poller. Default value is "false".
29
dataPoller – job Element
MAX_DEPTH@ is the maximum directory depth for FILE and HTTP protocol polling.
ORIGINATING_SYSTEM@ is the label to be used in PDRs. It is meaningful to the pan element discussed later. Default value is "S4PA".
IGNORE_HISTORY@ has a boolean value indicating whether to ignore polling history; defaults to "false".
MAX_FILE_GROUP@ indicates the maximum number of FileGroups in a resulting PDR for the downstream receiving station. Default is unlimited.
MINIMUM_FILE_SIZE@ is the minimum file size of the polled data file. Default value is 0.
REPOLL_PAUSE@ is the sleep time in seconds before re-polling to confirm the polled file size. Default is no pause.
SUB_DIR_PATTERN@ is the sub-directory pattern, in Linux 'date' command format, for an FTP poller to limit scanning to directories matching the pattern.
30
dataPoller – job Element
LATENCY@ is the number of days prior to the current date for which a matching sub-directory name pattern is polled by an FTP protocol poller.
Ex: <job NAME="test_poller" HOST="s4pt.ecs.nasa.gov" DIR="/ftp/private/TS2" PROTOCOL="FTP" RECURSIVE="true" MAX_FILE_GROUP="20" SUB_DIR_PATTERN="%Y/%Y%m%d" LATENCY="31">
A dataPoller's job has one or more datasets. A dataPoller dataset has NAME@, VERSION@, and ALIAS@.
NAME@ is the name of the dataset being polled. VERSION@ is the dataset's version. ALIAS@ is the pattern for renaming the polled files.
31
dataPoller – job Element
dataset contains the Perl regular expression used against file names to detect files belonging to the dataset. Ex:
<dataset NAME="GDAS1" ALIAS="GDAS1.$1.00z">gdas1.PGrbF00\.(\d{6})\.00z$</dataset>
dataset can also contain one file and zero or more associateFile elements.
file and associateFile have PATTERN@ and ALIAS@ for multiple-file granule polling. Ex:
<dataset NAME="P3L2TRGB" VERSION="001">
  <file PATTERN="(P3L2TRGB\d{6}\w)D$"/>
  <associateFile PATTERN="$1L"/>
</dataset>
32
pan Element
A pan contains a local and an optional remote element.
local contains the local directory name for storing PANs.
remote contains one or more originating_system elements. originating_system has a NAME@, a HOST@, a DIR@, and a NOTIFY@.
NAME@ is the value of the field by the same name in PDRs encountered by S4PA.
HOST@ and DIR@ are the host name and directory where the PAN for the originating system will be pushed.
NOTIFY@ is the email address to which the PAN is sent.
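The pan structure above can be sketched as follows; the directory, host path, and email address are hypothetical illustrations:

```xml
<!-- Hypothetical pan sketch -->
<pan>
  <local>/vol1/OPS/s4pa/pan</local>
  <remote>
    <originating_system NAME="S4PA" HOST="s4pt.ecs.nasa.gov"
        DIR="/ftp/private/TS2/PAN" NOTIFY="ops@example.gov"/>
  </remote>
</pan>
```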
33
dataClass Element
dataClass has NAME@, GROUP@, FREQUENCY@, ACCESS@, TIME_MARGIN@, PUBLISH_ECHO@, PUBLISH_MIRADOR@, PUBLISH_GIOVANNI@, EXPIRY@, and DOC@.
NAME@ is the data class name.
GROUP@ is the default data group name for the datasets in the data class.
FREQUENCY@ is the temporal frequency for the dataset. Valid values are "yearly", "monthly", "daily", and "none" (for climatology datasets). Default is daily.
ACCESS@ is the access type for the dataset. Valid values are "public", "restricted", and "hidden". Default is public.
TIME_MARGIN@ is the time difference in seconds for identifying replacement granules. Default is zero.
34
dataClass Element
PUBLISH_ECHO@, PUBLISH_MIRADOR@, and PUBLISH_GIOVANNI@ are boolean values indicating the publication requirement for each partner. Default is true.
EXPIRY@ is the number of days granules are retained for a rolling-archive dataset. Default is no expiration.
DOC@ is the filename of the README file for the dataset.
dataClass contains an optional method and one or more dataset(s). Ex:
<dataClass NAME="GLDAS" GROUP="GLDAS_MONTHLY" FREQUENCY="yearly" ACCESS="public" PUBLISH_ECHO="false" PUBLISH_MIRADOR="true" PUBLISH_GIOVANNI="false">
  <method>/home/s4paops/bin/s4pa_get_gldas_metadata.pl</method>
  <dataset NAME="GLDAS_MOS10_M"></dataset>
</dataClass>
35
method Element
method contains metadata, compression, decompression and giovanniPreprocess elements.
metadata contains the complete command for metadata extraction.
compression is a complex element containing command, tmpfile and output.
decompression is a complex element containing command, tmpfile and output.
giovanniPreprocess contains the complete command for Giovanni pre-processing .
36
compression Element
command specifies the compression command to be used. S4PA replaces any string/substring specified as "INFILE" with the name of the file being processed. Ex: hrepack -t 'l3m_data:GZIP 1' -i INFILE -o INFILE.tmp
tmpfile is the file name of the command output. You can specify it in terms of "INFILE". Ex: INFILE.tmp
output is the desired file name after compression. You can specify it in terms of "INFILE". Ex: INFILE
37
decompression Element
command specifies the decompression command to be used. S4PA replaces any string/substring specified as "INFILE" with the name of the file being processed. Ex: bunzip2 -f INFILE
tmpfile contains an anonymous Perl subroutine that is supplied with the filename as an argument. It returns the name of the file produced by the decompression command. Ex: sub {my($a) = @_; $a=~s/\.bz2$//; return $a;}
output contains an anonymous Perl subroutine that is supplied with the filename as an argument. It returns the desired file name after decompressing the data file. Ex: sub {my($a) = @_; $a=~s/\.bz2$//; return $a;}
38
dataset Element
dataset has NAME@, GROUP@, FREQUENCY@, ACCESS@, TIME_MARGIN@, DIF_ENTRY_ID@, PUBLISH_ECHO@, PUBLISH_MIRADOR@, PUBLISH_GIOVANNI@, EXPIRY@, and DOC@. They override the corresponding values defined for a dataClass.
DIF_ENTRY_ID@ indicates the ID of the GCMD DIF.
dataset contains a method and optional ignoreCondition(s), uniqueAttribute(s), associateData(s), and dataVersion(s).
A method in a dataset has the same definition as in a dataClass.
39
dataset – ignoreCondition
ignoreCondition is the location, as an XPath expression, of a metadata (XML) attribute used to decide when an incoming granule should be ignored after comparing that attribute with the existing granule's. It has an optional OPERATOR@.
OPERATOR@ is the comparison operation. Valid values are "EQ", "NE", "GT", "GE", "LT", "LE". Default value is EQ.
Ex:
<ignoreCondition OPERATOR="LE">//DataGranule/SizeBytesDataGranule</ignoreCondition>
-- This avoids the existing granule being replaced by a smaller incoming granule covering the same RangeDateTime.
40
dataset – uniqueAttribute
uniqueAttribute is the location, specified in XPath, of a metadata (XML) attribute used to determine the uniqueness of a granule. If the value of the XPath expression matches in all cases, the incoming granule is deemed a valid replacement. Otherwise, it is treated as a new granule. It also has an optional OPERATOR@.
OPERATOR@ is the comparison operation. Valid values are "EQ", "NE", "GT", "GE", "LT", "LE". Default value is EQ.
Ex:
<uniqueAttribute>//DataGranule/GranuleID</uniqueAttribute>
-- This avoids the existing granule being replaced by an incoming granule with a different GranuleID covering the same RangeDateTime.
41
dataset – associateData
associateData is used to associate a data granule with its browse file from a different dataset.
associateData has a NAME@, a VERSION@, and a TYPE@.
NAME@ is the associated dataset name.
VERSION@ is the optional associated dataset version. Default is versionless.
TYPE@ is the association type, currently "Browse" only.
Ex:
<dataset NAME="TRMM_2A21">
  <associateData NAME="TRMM_2A21_BR" TYPE="Browse"/>
</dataset>
42
dataset – dataVersion
dataVersion has LABEL@, FREQUENCY@, ACCESS@, TIME_MARGIN@, DIF_ENTRY_ID@, PUBLISH_ECHO@, PUBLISH_MIRADOR@, PUBLISH_GIOVANNI@, EXPIRY@, and DOC@. They override the corresponding values defined for a dataset.
LABEL@ can be an empty string for a versionless dataset or a non-whitespace string for a versioned dataset.
dataVersion contains optional ignoreCondition(s), uniqueAttribute(s), and associateData(s).
All attributes and elements in a dataVersion have the same definitions as in a dataset.
43
Creating S4PA Subscriptions
Subscriptions for an S4PA instance are described in a file called the subscription configuration.
Suggested name: subscription_<InstanceName>.xml
Two types of subscription are supported: Pull (the user initiates the download) and Push (S4PA pushes files to users).
The subscription descriptor is based on the schema ftp://s4pt.ecs.nasa.gov/software/s4pa/S4paSubscription.xsd
Run command:
s4pa_update_subscription.pl -f <SubscriptionConfiguration> -d <DescriptorSchema> -s <SubscriptionSchema>
44
Subscription Descriptor
The root element of the descriptor is subscription; it has a NOTICE_SUBJECT@, an HTTP_ROOT@, and an FTP_ROOT@.
NOTICE_SUBJECT@ indicates the general email delivery notice (DN) subject.
HTTP_ROOT@ specifies the root URL for accessing restricted data.
FTP_ROOT@ specifies the root URL for accessing public data.
subscription contains one or more pushSubscription and pullSubscription elements.
45
pushSubscription Element
pushSubscription contains notification, destination, and one or more dataset(s).
pushSubscription has ID@, LABEL@, FTP_ROOT@, HTTP_ROOT@, MAX_GRANULE_COUNT@, USER@, INCLUDE_BROWSE@, and VERIFY@.
ID@ is a unique identification string across all subscriptions.
LABEL@ indicates any user-specific string that will be included in the DN.
MAX_GRANULE_COUNT@ sets the maximum number of granules in each subscription.
USER@ specifies the username for the Machine Request Interface.
46
pushSubscription Element
INCLUDE_BROWSE@ sets the inclusion of browse files in the subscription.
VERIFY@ confirms the existence of the pushed files on the remote site.
notification specifies the address and the format of the subscription delivery notice. It contains an optional filter.
The content of the filter specifies the user-provided script to create a special format of the delivery notice (e.g., XML formatted). It is only needed when the attribute FORMAT@ is specified as "USER-DEFINED".
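A pushSubscription can be sketched as below; the ID, label, and addresses are hypothetical, while the destination host and dataset name reuse values from the deck's own examples:

```xml
<!-- Hypothetical pushSubscription skeleton -->
<pushSubscription ID="PUSH_001" LABEL="Example user" MAX_GRANULE_COUNT="50">
  <notification FORMAT="S4PA" PROTOCOL="mailto" ADDRESS="user@example.gov"/>
  <destination PROTOCOL="ftp"
      ADDRESS="s4pt.ecs.nasa.gov/ftp/private/TS2/push"/>
  <dataset NAME="D5OIXMET"/>
</pushSubscription>
```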
47
notification Element
notification has FORMAT@, PROTOCOL@, ADDRESS@, NOTICE_SUFFIX@, and NOTICE_SUBJECT@.
FORMAT@ indicates the format of the notice. Valid values are "S4PA", "LEGACY", "PDR", and "USER-DEFINED".
PROTOCOL@ indicates the protocol to be used to send the notice. Valid values are "mailto", "ftp", "sftp", "file".
ADDRESS@ indicates the destination of the notice. It can be the subscriber's email address for the "mailto" protocol or "<remote_host>/<remote_directory>" for other protocols.
NOTICE_SUFFIX@ and NOTICE_SUBJECT@ specify the DN file extension and a special email notification subject. Default suffix is none, and the default subject is:
"GES DISC Order Notification Order ID: DN<xxx>-<xxx>"
48
destination Element
destination specifies the destination for subscribed data.
destination has PROTOCOL@ and ADDRESS@.
PROTOCOL@ indicates the protocol to be used to push the data. Valid values are "mailto", "ftp", "sftp", "file".
ADDRESS@ indicates the destination of the data. It can be the subscriber's email address for the "mailto" protocol or "<remote_host>/<remote_directory>" for other protocols.
Ex:
<notification FORMAT="S4PA" PROTOCOL="mailto" ADDRESS="[email protected]"/>
<destination PROTOCOL="ftp" ADDRESS="s4pt.ecs.nasa.gov/ftp/private/TS2/push"/>
49
subscription - dataset Element
dataset contains optional validator and filter elements. dataset has a NAME@ and an optional VERSION@.
NAME@ is the dataset name for the subscription.
VERSION@ is the version label for the subscription. Defaults to all versions under the specified dataset.
validator is used to determine whether the incoming granule triggers the subscription. The content should be a boolean value (true or false) or a script that returns a boolean value. Specify "false" for a Machine Request Interface (MRI)-only subscription, which disables triggering from ingest. The default content is "true".
50
subscription - dataset Element
filter specifies the user-provided script to convert the pattern-matched file and deliver the output to the subscriber.
filter has a PATTERN@ to specify the file pattern to which the filtering scheme applies. Ex:
<dataset NAME="D5OIXMET" VERSION="5.1.0">
  <validator>s4pa_sub_check.pl -b '2007-01-01' -e '2009-12-31'</validator>
  <filter PATTERN="xml">s4pa_extract_ODL.pl -o /var/tmp</filter>
</dataset>
51
pullSubscription Element
pullSubscription contains notification, optional destination, and one or more dataset(s).
pullSubscription has ID@, LABEL@, FTP_ROOT@, HTTP_ROOT@, MAX_GRANULE_COUNT@, USER@, INCLUDE_BROWSE@.
All elements and attributes have the same definition as those in pushSubscription.
destination (if specified) indicates that the subscribed data has to be pushed via PROTOCOL@ to an intermediate destination (ADDRESS@) from where the data will be pulled by the subscriber. It is only used to support the legacy ECS users.
52
pullSubscription Element
destination has an extra URL_ROOT@ to replace the original FTP_ROOT@ or HTTP_ROOT@ with the new URL on the intermediate address for user to pull from.
dataset contains an extra service element to provide on-the-fly services to the files downloaded via the HTTP protocol.
service has NAME@, CHANNELS@, CHNUMBERS@, WVNUMBERS@, VARIABLES@, BBOX@, FORMAT@,
COMPRESS_ID@, and REASON@. The converted HTTP service URL will be included in the delivery notice.
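A pullSubscription with an on-the-fly service could be sketched like this; the ID, label, address, service name, and service attribute values are hypothetical, and the dataset reuses the deck's earlier example:

```xml
<!-- Hypothetical pullSubscription sketch with an on-the-fly service -->
<pullSubscription ID="PULL_001" LABEL="Example user">
  <notification FORMAT="S4PA" PROTOCOL="mailto" ADDRESS="user@example.gov"/>
  <dataset NAME="D5OIXMET" VERSION="5.1.0">
    <service NAME="subset" VARIABLES="Temperature" FORMAT="netCDF"/>
  </dataset>
</pullSubscription>
```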
53
Deploying Using Descriptor
Deploy the S4PA instance by running: s4pa_deploy.pl -f <Descriptor> -s <DescriptorSchema>
Copy the metadata extractor configuration files in the S4PA_<ProjectName> distribution's "cfg" directory to <S4PA_ROOT>/receiving/<provider>. Generally, these configuration files are named *.table, *.metTemplate, *.xml, etc.
<S4PA_ROOT> is the root directory of S4PA stations (root element under s4pa in the descriptor).
54
Deploying Using CVS
Deploy the S4PA instance by running: s4pa_deploy.pl -i <InstanceName> [-p <ProjectName>]
Both the descriptor and the subscription configuration need to be in the S4PA_CONFIG repository and named:
descriptor_<InstanceName>.xml
subscription_<InstanceName>.xml
All required metadata extractor templates need to be in the <ProjectName> repository under its cfg directory.
Once deployed, a copy of the descriptor, subscription configuration, and schemas can be found under the <S4PA_ROOT>/config directory.
55
Coming in future
dataPoller with SFTP protocol. Giovanni reconciliation. HDF4 map file creation on ingest and archive.
56
Viewing S4PA Instance
Set the PERLLIB environment variable to /tools/gdaac/OPS/lib/perl5/site_perl/<version>/, where <version> is the Perl version number. Currently, on our hosts, it is 5.8.8.
Set the PATH variable to include /tools/gdaac/OPS/bin.
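The two settings above can be done in a Bourne-style shell as follows; the paths are the ones given on the slide, with 5.8.8 substituted for <version> as the slide suggests:

```shell
# Point Perl at the S4P/S4PA modules (adjust 5.8.8 to your local Perl version)
export PERLLIB=/tools/gdaac/OPS/lib/perl5/site_perl/5.8.8
# Make tkstat.pl and the other S4PA tools findable
export PATH="$PATH:/tools/gdaac/OPS/bin"
```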
57
Viewing S4PA Instance
Run:
tkstat.pl <S4PA_ROOT>/receiving/polling/* <S4PA_ROOT>/receiving/<provider> <S4PA_ROOT>/storage/*/store* <S4PA_ROOT>/storage/*/check* <S4PA_ROOT>/storage/*/delete* <S4PA_ROOT>/subscribe <S4PA_ROOT>/publish* <S4PA_ROOT>/other/* <S4PA_ROOT>/postoffice &
<S4PA_ROOT> is the root directory of S4PA stations (root element under s4pa in descriptor).
59
Housekeeping
Incremental backups are needed for the S4PA Root (/vol1/OPS/s4pa) and the S4PA Storage Directory (/ftp/data/s4pa).
S4PA jobs, mostly polling and rarely ReceiveData, fail. Have a cron job that monitors these station directories and resubmits the jobs.
S4P/S4PA logs pile up. Currently, they are manually cleaned; eventually there will be a script to trim them.