S4PA Deployment M. Hegde Science Systems & Applications, Inc April 26, 2006.
2
Introduction
S4PA dependencies
Installing S4PA
Creating an S4PA instance
Monitoring an S4PA instance
Instructions available in S4PA Wiki at http://discette.gsfc.nasa.gov/mwiki/index.php/S4PA
3
S4PA Dependencies
Perl 5.8.x
S4P 5.28.1+
XML::LibXML, XML::LibXSLT, XML::Simple, XML::Twig
Net::FTP, Net::Netrc, Net::SSH2
MLDBM, DB_File, Storable, Data::Dumper
SOAP::Lite, URI::URL
HTTP_service_URL, Clavis
Compilers/libraries needed for metadata extractors and Giovanni pre-processors.
4
It is good to know…
Editing XML. Using XML schema is preferred.
.netrc setup.
Setting up SSH key exchange if needed.
Perl regular expressions for data polling.
XPath for complex granule replacement logic.
5
S4PA Directory Structure
S4PA Root (e.g., /vol1/OPS/s4pa/):
  receiving/
    <provider>/            (stations: Poller: PDR, Poller: Data, ReceiveData)
    polling/
      pdr/
  storage/
    <data_class>/
      <dataset>/
        granule.db
        data→              (symbolic link into the Active File System)
      dataset.cfg
      store_<data_class>/  (StoreData)
      check_<data_class>/  (CheckData)
      delete_<data_class>/ (DeleteData)
  publish_echo/            (PublishECHO; pending_publish/, pending_delete/)
  publish_mirador/         (PublishMirador; pending_publish/, pending_delete/)
  publish_whom/            (PublishWHOM; pending_publish/, pending_delete/)
  subscribe/               (SubscribeData; pending/, s4pa_subscription.cfg)
  postoffice/              (PostOffice)

FTP Root (e.g., /ftp), on RAID data storage:
  .<provider>/<nnn>/       (file systems; active_fs→ marks the Active File System)
  Storage Directory (e.g., data/s4pa/):
    <data group>/
      <dataset>/
        <yyyy>/
          <ddd>/
            x.hdf→, x.xml→, x.jpg→, x.png→  (symbolic links to archived x.hdf, x.xml, x.jpg, x.png)
    doc/
      <dataset>/

Restricted data sub-web (defined in server config):
  <http root>/
    groups, users DBM files
    .htaccess (opt.)

Legend: directory/, dataset, symbolic link→, station_directory/Station Name
6
S4PA Terminology
S4PA terms:
Dataset is the equivalent of a data product in WHOM and an ESDT in ECS.
Data Class is a logical group of datasets for S4PA's internal use. Generally, datasets are grouped by common methods in S4PA. It is not visible to data users.
Data Group is a logical group of datasets from the data user's perspective.
Active File System is the file system where S4PA is currently writing data.
Storage Directory is the root directory for data access in S4PA. Its sub-directories are Data Groups.
Data Provider is the label given to a data provider. All datasets belonging to a provider end up on the same active file system.
7
Architecting an S4PA Instance
Create a user account for operating S4PA (e.g., s4paops) and a group (e.g., s4pa) to share resources.
Estimate disk space requirements and divide the RAID into file systems whose size equals that of a backup tape or disk.
Name file systems as /ftp/.<provider>/<nnnn>, where <provider> is the data provider's label and <nnnn> is the 3- or 4-digit label of the file system. Ex: /ftp/.trmm/001
Create the Storage Directory, generally in an FTP area. Ex: /ftp/data/s4pa
8
Architecting - continued
Determine Datasets, Data Classes, and Data Groups supported by the instance.
Determine Data Providers supported by the instance.
Identify metadata extractors, if any, and get their synopses.
Identify publication requirements; prepare the GCMD DIF and collection README document.
Add entries to .netrc for all hosts with which S4PA will interact, including cases where SSH/SFTP is used.
Set up SSH key exchange if necessary.
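For reference, a .netrc entry follows the standard one-machine-per-stanza format; the host below is taken from the deck's examples and the credentials are placeholders:

```
machine s4pt.ecs.nasa.gov
login anonymous
password someuser@example.gov
```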
9
Obtaining S4PA Distributions
Generally, an S4PA instance depends on a core and an instance specific distribution.
The distributions are available as gzipped tar files from ftp://s4pt.ecs.nasa.gov/software/s4pa/
The core distribution is named S4PA-X.Y.Z.tar.gz where X, Y and Z are major, minor and patch release numbers.
The project specific distribution is named S4PA_<ProjectName>-X.Y.Z.tar.gz.
The S4PA core and instance-specific distributions are stored in a CVS repository under the project names S4PA and S4PA_<ProjectName> (e.g., S4PA_ACDISC, S4PA_TRMM).
10
Installing S4PA
Identify and obtain the necessary S4PA distributions. Decompress and un-tar the distribution files. S4PA projects use MakeMaker for installation. Use the following steps to install a project:
  Change directory to the root of the un-tarred directory.
  perl Makefile.PL PREFIX=/tools/gdaac/TS2 (substitute mode as necessary)
  make
  make pure_site_install
Save the un-tarred area of the instance-specific distribution. It may contain configuration files needed later in the process. Ex: ./doc/xsd/S4paDescriptor.xsd, ./doc/xsd/S4paSubscription.xsd
11
Creating an S4PA Instance
S4PA provides a tool, s4pa_deploy.pl, to create the necessary S4PA stations, directories, and symbolic links. It can:
  Create station directories and the necessary station configuration files for all stations in S4PA.
  Create directories for datasets in the Storage Directory.
  Create a symbolic link to the Active File System.
  Transfer the README document to its dataset storage directory.
It will not:
  Set up file systems in the Active File System area.
  Set up any configuration file needed by metadata extractors or Giovanni pre-processors.
12
Creating an S4PA Instance
An S4PA instance is described in a file called the deployment descriptor.
Suggested name: descriptor_<InstanceName>.xml
The deployment descriptor is in XML and is based on the schema ftp://s4pt.ecs.nasa.gov/software/s4pa/S4paDescriptor.xsd
Once created, the deployment descriptor is stored in the cfg directory under the instance-specific CVS project.
Run command
s4pa_deploy.pl -f <Descriptor> -s <DescriptorSchema>
13
Deployment Descriptor
Notation: Words in bold indicate XML elements. Italicized words with @ as a superscript indicate attributes.
The root element of the descriptor is s4pa; its NAME@ attribute indicates the S4PA instance name.
s4pa contains a root, storageDir, tempDir, documentLocation, urlRoot, logger, and one or more providers.
s4pa also contains optional auxiliaryBackUpArea, project, protocol, reconciliation, publication, subscription, deletionDelay, postoffice, and houseKeeper elements.
s4pa Element
The content of root is the root directory of S4PA stations. Ex: <root>/vol1/OPS/s4pa/</root>
The content of storageDir is the root directory of data archive’s public view. Ex: <storageDir>/ftp/data/s4pa/</storageDir>
The content of tempDir is the global directory that serves as the root working directory for filters. Ex: <tempDir>/var/tmp</tempDir>
The content of documentLocation is the URL for storage of README documents for datasets. Ex: <documentLocation>http://discette.gsfc.nasa.gov/uploads </documentLocation>
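Assembled from the elements above, a minimal descriptor skeleton might look like the following sketch; the instance name is hypothetical and the paths are the deck's own examples:

```xml
<!-- Hypothetical minimal deployment descriptor skeleton -->
<s4pa NAME="TS2_EXAMPLE">
  <root>/vol1/OPS/s4pa/</root>
  <storageDir>/ftp/data/s4pa/</storageDir>
  <tempDir>/var/tmp</tempDir>
  <documentLocation>http://discette.gsfc.nasa.gov/uploads</documentLocation>
  <!-- urlRoot, logger, and one or more provider elements are also required -->
</s4pa>
```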
s4pa Element
The content of urlRoot is the root URLs for accessing data, specified as element attributes (FTP and HTTP).
logger has DIR@ and LEVEL@.
DIR@ indicates the directory for storing log files; LEVEL@ indicates the logging level (DEBUG or INFO).
Optional project contains one or more locations, which indicate the location of the project sandbox for metadata extractors.
Optional subscription has an INTERVAL@ that indicates the subscribe station's polling interval; defaults to 86400 (1 day).
15
s4pa Element
Optional deletionDelay has INTER_VERSION@ and INTRA_VERSION@.
  INTER_VERSION@ indicates the retention period for inter-version deletion; defaults to 86400*180 (6 months).
  INTRA_VERSION@ indicates the retention period for intra-version deletion; defaults to 86400 (1 day).
Optional postOffice has:
  INTERVAL@, the postoffice station's polling interval; defaults to 10.
  MAX_JOBS@, the postoffice station's max_children; defaults to 1.
  MAX_ATTEMP@, the postoffice station's maximum number of retries before job failure; defaults to 1.
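As a sketch, these optional elements could be written as follows; the values simply restate the defaults listed above (86400*180 = 15552000) and are not taken from a real instance:

```xml
<!-- Hypothetical sketch of the optional housekeeping elements -->
<deletionDelay INTER_VERSION="15552000" INTRA_VERSION="86400"/>
<postOffice INTERVAL="10" MAX_JOBS="1" MAX_ATTEMP="1"/>
```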
16
17
protocol Element
protocol has a NAME@. Valid values are FILE, FTP, SFTP, and HTTP. If unspecified, FTP is used for a host.
protocol contains one or more host elements. host indicates the name of the host for which the specified protocol is to be used. Ex:
<protocol NAME="FTP">
  <host>discette.gsfc.nasa.gov</host>
</protocol>
<protocol NAME="SFTP">
  <host>tads1.ecs.nasa.gov</host>
  <host>auraraw1.ecs.nasa.gov</host>
</protocol>
reconciliation Element
reconciliation is the holder of partner data reconciliation information; it contains optional echo, mirador, and dotchart elements.
echo has required URL@, USERNAME@, PASSWORD@, PUSH_USER@, PUSH_PWD@, and ENDPOINT_URI@ attributes, and optional MAX_GRANULE_COUNT@, LOCAL_DIR@, CHROOT_DIR@, STAGING_DIR@, DATA_HOST@, and MIN_INTERVAL@ attributes.
mirador and dotchart each have one required ENDPOINT_URI@ attribute, and the same set of optional attributes as the echo element plus an extra PULL_TIMEOUT@.
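The attributes above can be sketched in a reconciliation element like this; every value is a placeholder, since the deck gives no concrete reconciliation example:

```xml
<!-- Hypothetical reconciliation sketch; all attribute values are placeholders -->
<reconciliation>
  <echo URL="..." USERNAME="..." PASSWORD="..."
        PUSH_USER="..." PUSH_PWD="..." ENDPOINT_URI="..."/>
  <mirador ENDPOINT_URI="..."/>
  <dotchart ENDPOINT_URI="..."/>
</reconciliation>
```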
18
publication and echo Element
publication is the holder of metadata publication related information; it contains optional echo, mirador, giovanni, and dotchart elements.
echo contains granuleInsert, granuleDelete, browseInsert, browseDelete, and collectionInsert.
echo has HOST@, VERSION@, and MAX_GRANULE_COUNT@.
Ex: <echo HOST="ingest.echo.nasa.gov" VERSION="10">
<granuleInsert DIR="/data/granule"/>
<granuleDelete DIR="/data/granule"/>
<browseInsert DIR="/data/browse"/>
<browseDelete DIR="/data/browse"/>
<collectionInsert DIR="/data/collection"/>
</echo>
19
publication - mirador Element
mirador contains granuleInsert, granuleDelete, and productDocument, and has a HOST@.
granuleInsert and granuleDelete each have HOST@ and DIR@.
productDocument has HOST@, DIR@, and CMS_TEMPLATE@. Ex:
<mirador HOST="invenio.gsfc.nasa.gov">
  <granuleInsert DIR="/ftp/private/Mirador/agdisc/Inserts"/>
  <granuleDelete DIR="/ftp/private/Mirador/agdisc/Deletes"/>
  <productDocument DIR="/ftp/private/Mirador/agdisc/ProdDocs"
      CMS_TEMPLATE="/home/s4pa/mirador_L2_RCW.dwt"/>
</mirador>
20
publication - giovanni Element
giovanni contains granuleInsert and granuleDelete, and has a HOST@.
granuleInsert and granuleDelete each have HOST@ and DIR@. Ex:
<giovanni HOST="gdata1.sci.gsfc.nasa.gov">
  <granuleInsert DIR="/ftp/private/Giovanni/agdisc/Inserts"/>
  <granuleDelete DIR="/ftp/private/Giovanni/agdisc/Deletes"/>
</giovanni>
21
publication - dotChart Element
dotChart contains granuleInsert, granuleDelete, dbExport, and collectionInsert, and has a HOST@.
granuleInsert, granuleDelete, and collectionInsert each have HOST@ and DIR@.
dbExport has HOST@ and DIR@. Ex:
<dotChart HOST="tads1.ecs.nasa.gov">
  <granuleInsert DIR="/ftp/private/Dotchart/pending_insert"/>
  <granuleDelete DIR="/ftp/private/Dotchart/pending_delete"/>
  <dbExport DIR="/ftp/private/Dotchart/dbExport"/>
  <collectionInsert DIR="/ftp/private/Dotchart/pending_dif"/>
</dotChart>
22
houseKeeper Element
houseKeeper is the holder for user-defined housekeeping jobs.
houseKeeper contains one or more job elements; the content of each job is the customized script command.
job has NAME@ and DOWNSTREAM@.
NAME@ indicates the job title. DOWNSTREAM@ indicates the downstream station for the output work order. Ex:
<job NAME="CLEAN_UP">./my_clean_up_job.sh</job>
<job NAME="REQUEST_DATA" DOWNSTREAM="other/machine_search">./my_auto_request.sh</job>
23
24
provider Element
provider has a NAME@. It contains an activeFileSystem, a poller, a pan, and one or more dataClass elements.
The content of activeFileSystem is the location of the current file system being written to. Ex: /ftp/.trmm/001/
activeFileSystem has:
MAX@: fraction (0-1) of maximum usable disk space.
FILE_SIZE_MARGIN@: marginal size needed for every file during ingest. For example, ((1 + FILE_SIZE_MARGIN) * File Size) is the size allocated for a file.
provider Element
NOTIFY_ON_FULL@: email address(es) to alert for volume backup when a volume is filled.
CONFIGURED_VOLUMES@: optional configuration file specifying the allocated volumes for a rolling archive and non-continuous volume partitions.
LOW_VOLUME_THRESHOLD@: optional fraction of the configured volume that triggers an anomaly when the free space left on the current volume passes this threshold with no new volume configured in the line-up.
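Putting the provider pieces together, a sketch could look like this; the provider name and path reuse the deck's /ftp/.trmm/001/ example, while the attribute values and email address are hypothetical:

```xml
<!-- Hypothetical provider sketch -->
<provider NAME="trmm">
  <activeFileSystem MAX="0.95" FILE_SIZE_MARGIN="0.05"
      NOTIFY_ON_FULL="ops@example.gov">/ftp/.trmm/001/</activeFileSystem>
  <!-- a poller, a pan, and one or more dataClass elements follow -->
</provider>
```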
25
poller and pdrPoller Element
poller is a complex element holding pdrPollers and dataPollers.
pdrPoller has an INTERVAL@, MAX_THREAD@ and MAX_FAILURE@.
INTERVAL@ is the polling interval in seconds. Default value is 600 seconds.
MAX_THREAD@ is the maximum number of threads allowed for PDR poller. Default value is 1.
MAX_FAILURE@ is the maximum number of failures allowed for pdr poller. Default value is 1.
pdrPoller contains pdrFilter and one or more jobs. pdrFilter has a PATTERN@.
26
pdrPoller - job Element
job contains exclude and pdrFilter. job has NAME@, HOST@, DIR@, IGNORE_HISTORY@, MERGE_PAN@, PATTERN@, and TYPE@.
NAME@ is the name of the poller (must be unique).
HOST@ and DIR@ are the host and directory being polled.
IGNORE_HISTORY@ has a boolean value indicating whether to ignore polling history; defaults to "false".
MERGE_PAN@ has a boolean value indicating whether PAN merging is required; defaults to "false".
PATTERN@ is the PDR filename pattern; defaults to "\.PDR$".
TYPE@ specifies whether the PDR is "EDOS" type; default is non-EDOS.
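Combining the attributes and defaults above, a pdrPoller sketch might read as follows; the job name and directory are hypothetical, and the host is borrowed from the deck's other examples:

```xml
<!-- Hypothetical pdrPoller sketch using the stated defaults -->
<pdrPoller INTERVAL="600" MAX_THREAD="1" MAX_FAILURE="1">
  <pdrFilter PATTERN="\.PDR$"/>
  <job NAME="example_pdr_poller" HOST="s4pt.ecs.nasa.gov"
       DIR="/ftp/private/TS2/PDR" PATTERN="\.PDR$"/>
</pdrPoller>
```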
27
28
dataPoller Element
dataPoller has INTERVAL@, MAX_THREAD@, and MAX_FAILURE@. They have the same definitions as in a pdrPoller.
dataPoller has one or more jobs. A dataPoller's job has the following attributes:
NAME@ is the unique name of the poller.
HOST@ and DIR@ are the host and directory being polled.
PROTOCOL@ is the protocol to be used for polling. Valid values are "FTP", "FILE", and "HTTP". Default value is FTP.
EXTERNAL_API@ is an API to supply the list of remote files (in 'URL|size' format) for an HTTP protocol poller.
RECURSIVE@ is a boolean (true or false) value indicating recursive polling of a directory for an FTP protocol poller. Default value is "false".
29
dataPoller – job Element
MAX_DEPTH@ is the maximum directory depth for FILE and HTTP protocol polling.
ORIGINATING_SYSTEM@ is the label to be used in PDRs. It is meaningful to the pan element discussed later. Default value is "S4PA".
IGNORE_HISTORY@ has a boolean value indicating whether to ignore polling history; defaults to "false".
MAX_FILE_GROUP@ indicates the maximum number of FileGroups in a resulting PDR for the downstream receiving station. Default is unlimited.
MINIMUM_FILE_SIZE@ is the minimum file size of the polled data file. Default value is 0.
REPOLL_PAUSE@ is the sleep time in seconds before re-polling to confirm the polled file size. Default is no pause.
SUB_DIR_PATTERN@ is the sub-directory pattern, in Linux 'date' command format, for an FTP poller to limit scanning to directories matching the pattern.
30
dataPoller – job Element
LATENCY@ is the number of days prior to the current date for which a matching sub-directory name pattern is polled by an FTP protocol poller.
Ex: <job NAME="test_poller" HOST="s4pt.ecs.nasa.gov" DIR="/ftp/private/TS2" PROTOCOL="FTP" RECURSIVE="true" MAX_FILE_GROUP="20" SUB_DIR_PATTERN="%Y/%Y%m%d" LATENCY="31">
A dataPoller's job has one or more datasets. A dataPoller dataset has NAME@, VERSION@, and ALIAS@.
NAME@ is the name of the dataset being polled. VERSION@ is the dataset's version. ALIAS@ is the pattern for renaming the polled files.
31
dataPoller – job Element
dataset contains the Perl regular expression used against file names to detect files belonging to the dataset. Ex:
<dataset NAME="GDAS1" ALIAS="GDAS1.$1.00z">gdas1.PGrbF00\.(\d{6})\.00z$</dataset>
dataset can also contain one file and zero or more associateFile elements.
file and associateFile have PATTERN@ and ALIAS@ for multiple-file granule polling. Ex:
<dataset NAME="P3L2TRGB" VERSION="001">
  <file PATTERN="(P3L2TRGB\d{6}\w)D$"/>
  <associateFile PATTERN="$1L"/>
</dataset>
32
pan Element
A pan contains a local and an optional remote element.
local contains the local directory name for storing PANs.
remote contains one or more originating_system elements. originating_system has a NAME@, a HOST@, a DIR@, and a NOTIFY@.
NAME@ is the value of the field by the same name in PDRs encountered by S4PA.
HOST@ and DIR@ are the host name and directory where the PAN for the originating system will be pushed.
NOTIFY@ is the email address to which the PAN is sent.
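The pan structure above can be sketched as follows; the directory, host path, and email address are hypothetical illustrations:

```xml
<!-- Hypothetical pan sketch -->
<pan>
  <local>/vol1/OPS/s4pa/pan</local>
  <remote>
    <originating_system NAME="S4PA" HOST="s4pt.ecs.nasa.gov"
        DIR="/ftp/private/TS2/PAN" NOTIFY="ops@example.gov"/>
  </remote>
</pan>
```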
33
dataClass Element
dataClass has NAME@, GROUP@, FREQUENCY@, ACCESS@, TIME_MARGIN@, PUBLISH_ECHO@, PUBLISH_MIRADOR@, PUBLISH_GIOVANNI@, EXPIRY@, and DOC@.
NAME@ is the data class name.
GROUP@ is the default data group name for the datasets in the data class.
FREQUENCY@ is the temporal frequency for the dataset. Valid values are "yearly", "monthly", "daily", and "none" (for climatology datasets). Default is daily.
ACCESS@ is the access type for the dataset. Valid values are "public", "restricted", and "hidden". Default is public.
TIME_MARGIN@ is the time difference in seconds for identifying replacement granules. Default is zero.
34
dataClass Element
PUBLISH_ECHO@, PUBLISH_MIRADOR@, and PUBLISH_GIOVANNI@ are boolean values indicating the publication requirement for each partner. Default is true.
EXPIRY@ is the number of days granules are retained for a rolling-archive dataset. Default is no expiration.
DOC@ is the filename of the README file for the dataset.
dataClass contains an optional method and one or more dataset(s). Ex:
<dataClass NAME="GLDAS" GROUP="GLDAS_MONTHLY" FREQUENCY="yearly" ACCESS="public" PUBLISH_ECHO="false" PUBLISH_MIRADOR="true" PUBLISH_GIOVANNI="false">
  <method>/home/s4paops/bin/s4pa_get_gldas_metadata.pl</method>
  <dataset NAME="GLDAS_MOS10_M"></dataset>
</dataClass>
35
method Element
method contains metadata, compression, decompression and giovanniPreprocess elements.
metadata contains the complete command for metadata extraction.
compression is a complex element containing command, tmpfile and output.
decompression is a complex element containing command, tmpfile and output.
giovanniPreprocess contains the complete command for Giovanni pre-processing .
36
compression Element
command specifies the compression command to be used. S4PA replaces any string/substring specified as "INFILE" with the name of the file being processed. Ex: hrepack -t 'l3m_data:GZIP 1' -i INFILE -o INFILE.tmp
tmpfile is the file name of the command output. You can specify it in terms of "INFILE". Ex: INFILE.tmp
output is the desired file name after compression. You can specify it in terms of "INFILE". Ex: INFILE
37
decompression Element
command specifies the decompression command to be used. S4PA replaces any string/substring specified as "INFILE" with the name of the file being processed. Ex: bunzip2 -f INFILE
tmpfile contains an anonymous Perl subroutine that is supplied with the filename as an argument. It returns the name of the file produced by the decompression command. Ex: sub {my($a) = @_; $a=~s/\.bz2$//; return $a;}
output contains an anonymous Perl subroutine that is supplied with the filename as an argument. It returns the desired file name after decompressing the data file. Ex: sub {my($a) = @_; $a=~s/\.bz2$//; return $a;}
38
dataset Element
dataset has NAME@, GROUP@, FREQUENCY@, ACCESS@, TIME_MARGIN@, DIF_ENTRY_ID@, PUBLISH_ECHO@, PUBLISH_MIRADOR@, PUBLISH_GIOVANNI@, EXPIRY@, and DOC@. They override the corresponding values defined for a dataClass.
DIF_ENTRY_ID@ indicates the ID of the GCMD DIF.
dataset contains a method and optional ignoreCondition(s), uniqueAttribute(s), associateData(s), and dataVersion(s).
A method in a dataset has the same definition as in a dataClass.
39
dataset – ignoreCondition
ignoreCondition is the location, as an XPath expression, of a metadata (XML) attribute used to decide when an incoming granule should be ignored after comparing that attribute with the existing granule's. It has an optional OPERATOR@.
OPERATOR@ is the comparison operation. Valid values are "EQ", "NE", "GT", "GE", "LT", "LE". Default value is EQ.
Ex:
<ignoreCondition OPERATOR="LE">//DataGranule/SizeBytesDataGranule</ignoreCondition>
-- This avoids the existing granule being replaced by a smaller incoming granule covering the same RangeDateTime.
40
dataset – uniqueAttribute
uniqueAttribute is the location, specified in XPath, of a metadata (XML) attribute used to determine the uniqueness of a granule. If the value of the XPath expression matches in all cases, the incoming granule is deemed a valid replacement. Otherwise, it is treated as a new granule. It also has an optional OPERATOR@.
OPERATOR@ is the comparison operation. Valid values are "EQ", "NE", "GT", "GE", "LT", "LE". Default value is EQ.
Ex:
<uniqueAttribute>//DataGranule/GranuleID</uniqueAttribute>
-- This avoids the existing granule being replaced by an incoming granule with a different GranuleID covering the same RangeDateTime.
41
dataset – associateData
associateData is used to associate a data granule with its browse file from a different dataset.
associateData has a NAME@, a VERSION@, and a TYPE@.
NAME@ is the associated dataset name.
VERSION@ is the optional associated dataset version. Default is versionless.
TYPE@ is the association type, currently "Browse" only.
Ex:
<dataset NAME="TRMM_2A21">
  <associateData NAME="TRMM_2A21_BR" TYPE="Browse"/>
</dataset>
42
dataset – dataVersion
dataVersion has LABEL@, FREQUENCY@, ACCESS@, TIME_MARGIN@, DIF_ENTRY_ID@, PUBLISH_ECHO@, PUBLISH_MIRADOR@, PUBLISH_GIOVANNI@, EXPIRY@, and DOC@. They override the corresponding values defined for a dataset.
LABEL@ can be an empty string for a versionless dataset or a non-whitespace string for a versioned dataset.
dataVersion contains optional ignoreCondition(s), uniqueAttribute(s), and associateData(s).
All attributes and elements in a dataVersion have the same definitions as in a dataset.
43
Creating S4PA Subscriptions
Subscriptions for an S4PA instance are described in a file called the subscription configuration.
Suggested name: subscription_<InstanceName>.xml
Two types of subscription are supported: Pull (the user initiates the download) and Push (S4PA pushes files to users).
The subscription descriptor is based on the schema ftp://s4pt.ecs.nasa.gov/software/s4pa/S4paSubscription.xsd
Run command:
s4pa_update_subscription.pl -f <SubscriptionConfiguration> -d <DescriptorSchema> -s <SubscriptionSchema>
44
Subscription Descriptor
The root element of the descriptor is subscription; it has a NOTICE_SUBJECT@, an HTTP_ROOT@, and an FTP_ROOT@.
NOTICE_SUBJECT@ indicates the general email delivery notice (DN) subject.
HTTP_ROOT@ specifies the root URL for accessing restricted data.
FTP_ROOT@ specifies the root URL for accessing public data.
subscription contains one or more pushSubscription and pullSubscription elements.
45
pushSubscription Element
pushSubscription contains notification, destination, and one or more dataset(s).
pushSubscription has ID@, LABEL@, FTP_ROOT@, HTTP_ROOT@, MAX_GRANULE_COUNT@, USER@, INCLUDE_BROWSE@, and VERIFY@.
ID@ is a unique identification string across all subscriptions.
LABEL@ indicates any user-specific string that will be included in the DN.
MAX_GRANULE_COUNT@ sets the maximum number of granules in each subscription.
USER@ specifies the username for the Machine Request Interface.
46
pushSubscription Element
INCLUDE_BROWSE@ sets the inclusion of browse files in the subscription.
VERIFY@ confirms the existence of the pushed files on the remote site.
notification specifies the address and the format of the subscription delivery notice. It contains an optional filter.
The content of the filter specifies the user-provided script to create a special format of the delivery notice (e.g., XML formatted). It is only needed when the attribute FORMAT@ is specified as "USER-DEFINED".
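A pushSubscription can be sketched as below; the ID, label, and addresses are hypothetical, while the destination host and dataset name reuse values from the deck's own examples:

```xml
<!-- Hypothetical pushSubscription skeleton -->
<pushSubscription ID="PUSH_001" LABEL="Example user" MAX_GRANULE_COUNT="50">
  <notification FORMAT="S4PA" PROTOCOL="mailto" ADDRESS="user@example.gov"/>
  <destination PROTOCOL="ftp"
      ADDRESS="s4pt.ecs.nasa.gov/ftp/private/TS2/push"/>
  <dataset NAME="D5OIXMET"/>
</pushSubscription>
```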
47
notification Element
notification has FORMAT@, PROTOCOL@, ADDRESS@, NOTICE_SUFFIX@, and NOTICE_SUBJECT@.
FORMAT@ indicates the format of the notice. Valid values are "S4PA", "LEGACY", "PDR", and "USER-DEFINED".
PROTOCOL@ indicates the protocol to be used to send the notice. Valid values are "mailto", "ftp", "sftp", "file".
ADDRESS@ indicates the destination of the notice. It can be the subscriber's email address for the "mailto" protocol or "<remote_host>/<remote_directory>" for other protocols.
NOTICE_SUFFIX@ and NOTICE_SUBJECT@ specify the DN file extension and a special email notification subject. Default suffix is none, and the default subject is:
"GES DISC Order Notification Order ID: DN<xxx>-<xxx>"
48
destination Element
destination specifies the destination for subscribed data.
destination has PROTOCOL@ and ADDRESS@.
PROTOCOL@ indicates the protocol to be used to push the data. Valid values are "mailto", "ftp", "sftp", "file".
ADDRESS@ indicates the destination of the data. It can be the subscriber's email address for the "mailto" protocol or "<remote_host>/<remote_directory>" for other protocols.
Ex:
<notification FORMAT="S4PA" PROTOCOL="mailto" ADDRESS="[email protected]"/>
<destination PROTOCOL="ftp" ADDRESS="s4pt.ecs.nasa.gov/ftp/private/TS2/push"/>
49
subscription - dataset Element
dataset contains optional validator and filter elements. dataset has a NAME@ and an optional VERSION@.
NAME@ is the dataset name for the subscription.
VERSION@ is the version label for the subscription. Defaults to all versions under the specified dataset.
validator is used to determine whether the incoming granule triggers the subscription. The content should be a boolean value (true or false) or a script that returns a boolean value. Specify "false" for a Machine Request Interface (MRI)-only subscription, which disables triggering from ingest. The default content is "true".
50
subscription - dataset Element
filter specifies the user-provided script to convert the pattern-matched file and deliver the output to the subscriber.
filter has a PATTERN@ to specify the file pattern to which the filtering scheme applies. Ex:
<dataset NAME="D5OIXMET" VERSION="5.1.0">
  <validator>s4pa_sub_check.pl -b '2007-01-01' -e '2009-12-31'</validator>
  <filter PATTERN="xml">s4pa_extract_ODL.pl -o /var/tmp</filter>
</dataset>
51
pullSubscription Element
pullSubscription contains notification, optional destination, and one or more dataset(s).
pullSubscription has ID@, LABEL@, FTP_ROOT@, HTTP_ROOT@, MAX_GRANULE_COUNT@, USER@, INCLUDE_BROWSE@.
All elements and attributes have the same definition as those in pushSubscription.
destination (if specified) indicates that the subscribed data has to be pushed via PROTOCOL@ to an intermediate destination (ADDRESS@) from where the data will be pulled by the subscriber. It is only used to support the legacy ECS users.
52
pullSubscription Element
destination has an extra URL_ROOT@ to replace the original FTP_ROOT@ or HTTP_ROOT@ with the new URL on the intermediate address for user to pull from.
dataset contains an extra service element to provide on-the-fly services to the files downloaded via the HTTP protocol.
service has NAME@, CHANNELS@, CHNUMBERS@, WVNUMBERS@, VARIABLES@, BBOX@, FORMAT@,
COMPRESS_ID@, and REASON@. The converted HTTP service URL will be included in the delivery notice.
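A pullSubscription with an on-the-fly service could be sketched like this; the ID, label, address, service name, and service attribute values are hypothetical, and the dataset reuses the deck's earlier example:

```xml
<!-- Hypothetical pullSubscription sketch with an on-the-fly service -->
<pullSubscription ID="PULL_001" LABEL="Example user">
  <notification FORMAT="S4PA" PROTOCOL="mailto" ADDRESS="user@example.gov"/>
  <dataset NAME="D5OIXMET" VERSION="5.1.0">
    <service NAME="subset" VARIABLES="Temperature" FORMAT="netCDF"/>
  </dataset>
</pullSubscription>
```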
53
Deploying Using Descriptor
Deploy the S4PA instance by running: s4pa_deploy.pl -f <Descriptor> -s <DescriptorSchema>
Copy the metadata extractor configuration files in the S4PA_<ProjectName> distribution's "cfg" directory to <S4PA_ROOT>/receiving/<provider>. Generally, these configuration files are named *.table, *.metTemplate, *.xml, etc.
<S4PA_ROOT> is the root directory of S4PA stations (root element under s4pa in the descriptor).
54
Deploying Using CVS
Deploy the S4PA instance by running: s4pa_deploy.pl -i <InstanceName> [-p <ProjectName>]
Both the descriptor and the subscription configuration need to be in the S4PA_CONFIG repository and named:
descriptor_<InstanceName>.xml
subscription_<InstanceName>.xml
All required metadata extractor templates need to be in the <ProjectName> repository under its cfg directory.
Once deployed, a copy of the descriptor, subscription configuration, and schemas can be found under the <S4PA_ROOT>/config directory.
55
Coming in future
dataPoller with SFTP protocol. Giovanni reconciliation. HDF4 map file creation on ingest and archive.
56
Viewing S4PA Instance
Set the PERLLIB environment variable to /tools/gdaac/OPS/lib/perl5/site_perl/<version>/, where <version> is the Perl version number. Currently, on our hosts, it is 5.8.8.
Set the PATH variable to include /tools/gdaac/OPS/bin.
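The two settings above can be done in a Bourne-style shell as follows; the paths are the ones given on the slide, with 5.8.8 substituted for <version> as the slide suggests:

```shell
# Point Perl at the S4P/S4PA modules (adjust 5.8.8 to your local Perl version)
export PERLLIB=/tools/gdaac/OPS/lib/perl5/site_perl/5.8.8
# Make tkstat.pl and the other S4PA tools findable
export PATH="$PATH:/tools/gdaac/OPS/bin"
```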
57
Viewing S4PA Instance
Run:
tkstat.pl <S4PA_ROOT>/receiving/polling/* <S4PA_ROOT>/receiving/<provider> <S4PA_ROOT>/storage/*/store* <S4PA_ROOT>/storage/*/check* <S4PA_ROOT>/storage/*/delete* <S4PA_ROOT>/subscribe <S4PA_ROOT>/publish* <S4PA_ROOT>/other/* <S4PA_ROOT>/postoffice &
<S4PA_ROOT> is the root directory of S4PA stations (root element under s4pa in descriptor).
59
Housekeeping
Incremental backups are needed for the S4PA Root (/vol1/OPS/s4pa) and the S4PA Storage Directory (/ftp/data/s4pa).
S4PA jobs, mostly polling and rarely ReceiveData, fail. Have a cron job that monitors these station directories and resubmits the jobs.
S4P/S4PA logs pile up. Currently, they are manually cleaned; eventually there will be a script to trim them.