Preserving Scientific Data on Our Physical Universe
A New Strategy for Archiving the Nation's Scientific Information Resources
Steering Committee for the Study on the Long-term Retention of Selected Scientific and Technical Records of the Federal Government
Commission on Physical Sciences, Mathematics, and Applications
National Research Council
NATIONAL ACADEMY PRESS Washington, D.C. 1995
title: Preserving Scientific Data on Our Physical Universe: A New Strategy for Archiving the Nation's Scientific Information Resources
author:
publisher: National Academies Press
isbn10 | asin: 030905186X
print isbn13: 9780309051866
ebook isbn13: 9780585022888
language: English
subject: Communication in science--Government policy--United States, Science--United States--Data processing, Technology--United States--Data processing, Information storage and retrieval systems--Science.
publication date: 1995
lcc: Q224.3.U6N37 1995eb
ddc: 353.00819
NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine. The members of the committee responsible for the report were chosen for their special competences and with regard for appropriate balance.
This report has been reviewed by a group other than the authors according to procedures approved by a Report Review Committee consisting of members of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.
The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce Alberts is president of the National Academy of Sciences.
The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievements of engineers. Dr. Robert M. White is president of the National Academy of Engineering.
The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of the Institute of Medicine.
The National Research Council was established by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy's purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce Alberts and Dr. Robert M. White are chairman and vice chairman, respectively, of the National Research Council.
Support for this project was provided by the National Archives and Records Administration (under Contract No. NAMA-S-92-0019), the National Oceanic and Atmospheric Administration (under Contract No. 50-DGNE-3-00105), and the National Aeronautics and Space Administration (under Contract No. S-54040-Z). The views expressed in this report are those of the authors and do not necessarily reflect the views of the sponsoring agencies or subagencies.
Library of Congress Catalog Card Number 94-68991
International Standard Book Number 0-309-05186-X
Additional copies of this report are available from:
National Academy Press
2101 Constitution Ave., NW
Box 285
Washington, DC 20055
800-624-6242
202-334-3313 (in the Washington Metropolitan Area)
B-499
Copyright 1995 by the National Academy of Sciences. All rights reserved.
Printed in the United States of America
Steering Committee For The Study On The Long-Term Retention Of Selected Scientific And Technical Records Of The Federal Government
JEFF DOZIER, University of California, Santa Barbara, Chair
SHELTON ALEXANDER, Pennsylvania State University
MARJORIE COURAIN, Consultant (deceased, January 14, 1994)
JOHN A. DUTTON, Pennsylvania State University
WILLIAM EMERY, University of Colorado
BRUCE GRITTON, Monterey Bay Aquarium Research Institute
ROY JENNE, National Center for Atmospheric Research
WILLIAM KURTH, University of Iowa
DAVID LIDE, Consultant, Gaithersburg, Maryland
B.K. RICHARD, TRW
JOAN WARNOW-BLEWETT, American Institute of Physics
National Research Council Staff
Paul F. Uhlir, Associate Executive Director, Commission on Physical Sciences, Mathematics, and Applications
Mark David Handel, Program Officer, Board on Atmospheric Sciences and Climate
Alice Killian, Research Associate, Commission on Geosciences, Environment, and Resources
James E. Mallory, Staff Officer, Computer Science and Telecommunications Board
Scott T. Weidman, Senior Program Officer, Board on Chemical Sciences and Technology
Julie M. Esanu, Research Assistant, Commission on Physical Sciences, Mathematics, and Applications
David J. Baskin, Project Assistant, Commission on Physical Sciences, Mathematics, and Applications
Commission On Physical Sciences, Mathematics, And Applications
RICHARD N. ZARE, Stanford University, Chair
RICHARD S. NICHOLSON, American Association for the Advancement of Science, Vice Chair
STEPHEN L. ADLER, Institute for Advanced Study
SYLVIA T. CEYER, Massachusetts Institute of Technology
SUSAN L. GRAHAM, University of California at Berkeley
ROBERT J. HERMANN, United Technologies Corporation
RHONDA J. HUGHES, Bryn Mawr College
SHIRLEY A. JACKSON, Department of Physics
KENNETH I. KELLERMANN, National Radio Astronomy Observatory
HANS MARK, University of Texas at Austin
THOMAS A. PRINCE, California Institute of Technology
JEROME SACKS, National Institute of Statistical Sciences
L.E. SCRIVEN, University of Minnesota
A. RICHARD SEEBASS III, University of Colorado
LEON T. SILVER, California Institute of Technology
CHARLES P. SLICHTER, University of Illinois at Urbana-Champaign
ALVIN W. TRIVELPIECE, Oak Ridge National Laboratory
SHMUEL WINOGRAD, IBM T.J. Watson Research Center
CHARLES A. ZRAKET, MITRE Corporation (retired)
NORMAN METZGER, Executive Director
PAUL F. UHLIR, Associate Executive Director
Preface
In January 1992 the National Archives and Records Administration (NARA) sponsored a three-day planning meeting at the National Research Council (NRC) to review the issues related to the long-term retention of the federal government's scientific and technical data in the physical sciences. The planning meeting was organized by the NRC's Commission on Physical Sciences, Mathematics, and Applications and provided the basis for this study, which was initiated in the fall of 1992 at the request of NARA. The National Oceanic and Atmospheric Administration (NOAA) and the National Aeronautics and Space Administration (NASA) subsequently provided additional support.
The study's steering committee, in consultation with the sponsors, developed the following charge to guide the writing of this report:
Describe the status and plans for the government's archiving of observational and experimental data in the physical sciences.
Identify the principal scientific, technical, information management, and institutional issues regarding the permanent archiving of such data.
Assess the commonalities and differences among the case studies provided by the panels organized under this study (see below) in order to determine the extent to which common long-term retention policies and appraisal guidelines can be applied to disciplines that collect observational and experimental data in the physical sciences.
Establish a set of goals, principles, and priorities, as well as generic retention criteria and appraisal guidelines that NARA can incorporate into its mission, program, and budget planning.
Suggest mechanisms and processes for NARA and NOAA to use in implementing a program of data appraisal, retention, and preservation, and later in evaluating the effectiveness of the program.
Provide a summary of findings, conclusions, and recommendations.
The steering committee formed five panels in space sciences, atmospheric sciences, ocean sciences, geosciences, and physics, chemistry, and materials sciences to provide their views on the key data retention issues from different disciplinary perspectives in the physical sciences. These panels each met twice and produced a set of working papers, which are published separately in Study on the Long-term Retention of Selected Scientific and Technical Records of the Federal Government: Working Papers (National Academy Press, Washington, D.C., 1995). The work of the panels was invaluable to the steering committee in framing the issues, informing its conclusions and recommendations, and in producing its final report.
There are several aspects regarding the scope and focus of this report that should be mentioned. The committee devoted most of its attention to data stored on electronic media, rather than on paper or on other media. Almost all data are now acquired, stored, and distributed electronically. Thus, the preponderance of data archiving problems and their solutions must be considered in this context. Nevertheless, much of the advice offered here is equally relevant to data in other formats.
The principal focus of this report is on the long-term retention of data in the physical sciences. Much of the discussion, however, includes near-term data management issues, because effective archiving begins when the plans for acquiring a data set are made and extends throughout the life cycle of the data. Although the focus is exclusively on data in the physical sciences, the committee believes that the distinctions it has drawn between the experimental and the observational data, as well as the data management principles it has provided, are broadly applicable to most data in the other natural sciences. In addition, the strategic approach adopted by the committee necessarily involves all federal agencies that acquire and manage physical science data, and not simply the three agencies that sponsored this study.
Finally, it is necessary to point out that the committee was unable to achieve consensus on one major recommendation of the study, namely, the proposal to establish the National Scientific Information Resource (NSIR) Federation. Appendix B contains the minority opinion of the dissenting committee member, Roy Jenne. The rest of the committee members, who strongly support the NSIR Federation recommendation, are disappointed by this lack of unanimity and consider many of the assertions in the minority opinion to be based on an erroneous interpretation of what the report actually states or recommends. We leave that to the reader to judge. Nevertheless, we believe that the minority opinion can perhaps serve a useful purpose by drawing greater attention to these issues and by broadening the discussion of them among the sponsors of the study, the other science agencies, and the research community.
In conclusion, the committee hopes that its advice will help bring about the changes necessary to effectively preserve the valuable scientific data on our physical universe.
Jeff Dozier
Steering Committee Chair
Paul F. Uhlir
Study Director
Acknowledgments
The steering committee is very grateful to the many individuals who played a significant role in the completion of this study, including the members of the five ad hoc panels that provided conclusions and recommendations on data archiving from the different physical science disciplines; the individuals who briefed the steering committee and panels; and members of the National Research Council (NRC) staff who worked on various aspects of this study. The steering committee also extends its thanks to Trudy Peterson and Kenneth Thibodeau of the National Archives and Records Administration (NARA), William Turnbull and Helen Wood of the National Oceanic and Atmospheric Administration (NOAA), and Joseph King of the National Aeronautics and Space Administration (NASA), from the study's sponsoring agencies.
Gerd Rosenblatt, of Lawrence Berkeley Laboratory, chaired the Physics, Chemistry, and Materials Sciences Data Panel. The members were R. Stephen Berry, University of Chicago; Edward Galvin, The Aerospace Corporation; J.G. Kaufman, The Aluminum Association; Kirby Kemper, Florida State University; David R. Lide, Jr., consultant; and Edgar Westrum, Jr., University of Michigan. The steering committee gratefully acknowledges the detailed briefings and information provided to this panel by Donald Alderson, Department of Defense Nuclear Information Analysis Center; Frank Biggs, Sandia National Laboratories; Robert Billingsley, Defense Technical Information Center; Mark Conrad, NARA; Suzanne Leech, Bionetics, Inc.; Victoria McLane, Brookhaven National Laboratory; and Patricia Schuette, Battelle Pacific Northwest Laboratory.
The Space Sciences Data Panel was chaired by Christopher Russell of the University of California at Los Angeles. The panel members were Giuseppina Fabbiano, Harvard-Smithsonian Center for Astrophysics; Sarah Kadec, consultant; William Kurth, University of Iowa; Steven Lee, University of Colorado; and R. Stephen Saunders, Jet Propulsion Laboratory. The steering committee extends its thanks for the assistance of the following individuals, who provided briefings and other information to the Space Sciences Data Panel: Joe Allen, National Geophysical Data Center; Steven Blair, Los Alamos National Laboratory; Joseph Bredekamp, NASA; Dean Bundy, Naval Research Laboratory; David de Young, National Optical Astronomy Observatories; Robert Frederick, Air Force Space Forecast Center; Joseph King, National Space Science Data Center; Knox Long, Space Telescope Science Institute; Guenther Riegler, NASA Astrophysics Division; Thomas Smith and Jud Stailey, Air Force Environmental Technical Applications Center; Earl Tech, Los Alamos National Laboratory; Raymond Walker, University of California at Los Angeles; and James Willet, NASA Space Physics Division.
Werner Baum, of Florida State University, was the chair of the Atmospheric Sciences Data Panel. The members were Marjorie Courain, consultant (deceased, January 14, 1994); William Haggard, Climatological Consulting Corporation; Roy Jenne, National Center for Atmospheric Research; Kelly Redmond, Desert Research Institute; and Thomas Vonder Haar, Colorado State University. The steering committee gratefully acknowledges the diverse and substantial inputs provided by the following individuals to the Atmospheric Sciences Data Panel: Larry Baume, NARA; Thomas Boden, Carbon Dioxide Information and Analysis Center; Dean Bundy, Naval Research Laboratory; Donald Collins, NASA; Richard Davis, National Climatic Data Center; P.C. Hariharan, Johns Hopkins University; and Gerald Stokes, Pacific Northwest Laboratories.
The Ocean Sciences Data Panel was chaired by Bruce Gritton, Monterey Bay Aquarium Research Institute. The members were Richard Dugdale, University of Southern California; Thomas Duncan, University of California at Berkeley; Robert Evans, Rosenstiel School of Marine and Atmospheric Science; Terrence Joyce, Woods Hole Oceanographic Institution; and Victor Zlotnicki, Jet Propulsion Laboratory. The steering committee extends its thanks for the briefings and other information provided to the Ocean Sciences Data Panel by Larry Baume, NARA; Donald Collins and Susan Digby, Jet Propulsion Laboratory; Ronald Fauquet, NOAA; Ted Tsui, Naval Research Laboratory; and R.S. Winokur, Office of Naval Research.
The Geosciences Data Panel was chaired by Theodore Albert, a private consultant. The members were Shelton Alexander, Pennsylvania State University; Sara Graves, University of Alabama in Huntsville; David Landgrebe, Purdue University; and Soroosh Sorooshian, University of Arizona. The steering committee gratefully acknowledges the information provided at the meetings of the Geosciences Data Panel by the following individuals: Roger Barry, National Snow and Ice Data Center; Daniel Cavanaugh, U.S. Geological Survey; Donald Collins, Jet Propulsion Laboratory; Katrin Douglass, Southern California Earthquake Center Data Center; William Draegar, U.S. Geological Survey; John Dwyer, NARA; Claire Henson, National Snow and Ice Data Center; Herb Meyers, National Geophysical Data Center; Ron Weaver, National Snow and Ice Data Center; and Thomas Yorke, U.S. Geological Survey.
Finally, the steering committee is grateful to the staff of the National Research Council: Paul F. Uhlir, associate executive director of the Commission on Physical Sciences, Mathematics, and Applications, who served as study director; Mark David Handel and Theresa Fisher (Board on Atmospheric Sciences and Climate), Alice Killian (Commission on Geosciences, Environment, and Resources), James E. Mallory (Computer Science and Telecommunications Board), and Scott T. Weidman and Taña Spencer (Board on Chemical Sciences and Technology), who provided staff support for the five panels; Julie M. Esanu, for the program assistance provided to the steering committee and panels and for the preparation of the final manuscript; David Baskin, for his work on preparing the final manuscript; Liz Panos, for coordinating the report review; and Roseanne Price, who edited the final manuscript.
Contents
SUMMARY
1 INTRODUCTION
  Imperatives for Preserving Data on Our Physical Universe
  A New Future for Scientific Data
2 THE CHALLENGE: PRESERVATION AND USE OF SCIENTIFIC DATA
  Experimental Laboratory Data
  Observational Data in the Physical Sciences
  Summary of Major Issues
3 RETENTION CRITERIA AND THE APPRAISAL PROCESS
  Retention Criteria
  Other Elements of the Appraisal Process
  Recommendations
4 THE OPPORTUNITIES: THE RELATIONSHIP OF TECHNOLOGICAL ADVANCES TO NEW DATA USE AND RETENTION STRATEGIES
  Enabling Technologies and Related Developments
  Opportunities for New Organizational Structures
5 A NEW STRATEGY FOR ARCHIVING THE NATION'S SCIENTIFIC AND TECHNICAL DATA
  Fundamental Principles for Long-term Data Retention
  The Proposed National Scientific Information Resource Federation
  Recommendations for the Creation of the NSIR Federation
  Recommendations Specifically for NARA
  Recommendations Specifically for NOAA
REFERENCES
APPENDIX A List of Acronyms
APPENDIX B Minority Opinion
This study is dedicated in fond memory of Marjorie Courain.
Summary
Scientific data reflect both the organization and the chaos of the natural world. They stimulate us to develop concepts, theories, and models to make sense of the patterns they represent. The resulting abstractions are the formal and systematic ideas that constitute the understanding of relationships between causes and consequences, and perhaps may enable prediction of future sequences of events. Because scientists transform data from the material world into ideas, the observations of objects and processes in the physical world are the stimuli of scientific thought. Data are thus the seeds of scientific ideas.
There are strong motivations for preserving scientific observations:
Many observations about the natural world are a record of events that will never be repeated exactly. Examples include observations of an atmospheric storm, a deep ocean current, a volcanic eruption, and the energy emitted by a supernova. Once lost, such records can never be replaced.
Observed data provide a baseline for determining rates of change and for computing the frequency of occurrence of unusual events. They specify the observed envelope of variability. The longer the record, the greater our confidence in the conclusions we draw from it.
A data record may have more than one life. As scientific ideas advance, new concepts may emerge in the same or entirely different disciplines from study of observations that led earlier to different kinds of insights. New computing technologies for storing and analyzing data enhance the possibilities for finding or verifying new perspectives through reanalysis of existing data records. Thus, the relative importance of data, both current and historical, can change dramatically, often in entirely unanticipated directions.
The substantial investments made to acquire data records justify their preservation. The cost of preservation will almost always be small in comparison with the cost of observation. Because we cannot predict which data will yield the most scientific benefit in years ahead, the data we discard today may be the data that would have been invaluable tomorrow.
The assembled record of observational data thus has dual value: it is simultaneously a history of events in the natural world and a record of human accomplishment. The history of the physical world is an essential part of our accumulating knowledge, and the underlying data form a significant part of that heritage. They also portray a history of our scientific and technological development.
There are numerous socioeconomic reasons, in addition to the compelling scientific and historical motivations, for the long-term retention of observational, as well as certain types of experimental, data. For example, historical climate data have had well-documented uses in a broad range of applications in the manufacturing, energy, agriculture, transportation, communications, engineering, construction, insurance, and entertainment sectors. Such applications are common as well for other types of observational data on the Earth's environment. Experimental data in the physical sciences also have many industrial and other practical uses.
Today we can foresee the possibility of using the national resource of scientific data more advantageously than ever before as technological advances open new vistas for managing scientific information. Advances in data storage technologies make the long-term retention of virtually all data both feasible and affordable. The existence of the Internet and of the emerging National Information Infrastructure (NII) enables nationwide sharing and application of data that reside in appropriately configured databases.
Our new power to store, distribute, and access data and information is changing the way we work and think. However, the communities involved in the creation, retention, and use of scientific data about the physical world are not optimally organized. They commonly work toward disparate goals, are not well connected, and do not take full advantage of technological and conceptual advances in data management and communication. An entirely new approach to the long-term preservation of scientific data is now both feasible and essential. It must take advantage of advancing technology and of distributed communications and management structures to empower both the creators and the users of such data.
This study, performed at the request of the National Archives and Records Administration (NARA), and partially supported by the National Oceanic and Atmospheric Administration (NOAA) and the National Aeronautics and Space Administration (NASA), identifies the major issues regarding efforts to archive and use data in the physical sciences, establishes retention criteria and appraisal guidelines for those data, reviews important technological advances and related opportunities, and proposes a new strategy to help ensure access to the data by future generations.
The Challenge Of Effective Preservation And Use Of Scientific Data
The results of scientific research are disseminated in this country through a hybrid system that includes professional society and other not-for-profit publishers, the commercial sector, and the government. The formal journals are published largely by the professional society and commercial sectors, while government agencies manage less formal reports (gray literature). Secondary abstracting and indexing services provide access to this literature, increasingly by electronic means. While there are strains in this system because of rising costs, increasing workload, and issues related to the protection of intellectual property, it has served U.S. science well and has been an invaluable link in the process of translating scientific advances into further advances, useful technology, and economic benefits.
The current system, however, is not well suited to handle the scientific and technical electronic databases that are the focus of this study. The cost of maintaining these databases is typically too great to be covered by user fees; instead these databases must be considered part of the national scientific heritage. Some government agencies have accepted responsibility for maintaining and disseminating the data resulting from their research and development. In some cases, this system is working reasonably well, but in others there are problems even with providing current access. Archiving for the long term raises questions in all cases, however.
A general problem prevalent among all scientific disciplines is the low priority attached to data management and preservation by most agencies. Experience indicates that new research projects tend to get much more attention than the handling of data from old ones, even though the payoff from optimal utilization of existing data may be greater.
With regard to laboratory data, government programs have existed since the 1960s to compile results from the world scientific literature, to check the data carefully, and to prepare databases of critically evaluated data. Despite chronic underfunding, these programs have produced databases of lasting value to the nation, and the government investment in creating and maintaining these databases has been repaid many times over.
In the area of observational databases, the situation is mixed. Federal agencies collect large amounts of observational data, which in many cases are continuously added to the available record of Earth and space processes. The data sets resulting from these activities are sometimes well-documented and maintained in readily accessible form; in many other cases, however, while the data are saved, they are exceedingly difficult or impossible to access or use, and thus are effectively unavailable.
The most important deficiencies are in the documentation, access, and long-term preservation of data in usable form. Insufficient documentation is a generic problem that affects, in varying degrees, all the classes of data addressed in this study. Furthermore, few of the federal data centers can give adequate attention to long-term archiving because they are stretched thin by current demands and inadequate resources. Even the data that are archived may become inaccessible because they are not regularly migrated to new storage media as the hardware and software used to access the data become obsolete or inoperable.
Another major problem inhibiting access to data is the lack of directories that describe what data sets exist, where they are located, and how users can access them. In many cases the existence of the data is unknown outside the original scientific groups, and even if known, there frequently is not enough information for a potential user to assess their relevance and usefulness. The lack of adequate directories adversely affects the exploitation of our national data resources and leads to unnecessary duplication of effort.
A significant fraction of the archived scientific data is held by the federal agencies that collected the data as part of their mission. However, a large amount of valuable scientific data gathered with federal funds is never archived or made accessible to anyone other than the original investigators, many of whom are not government employees. In many instances, the organizations and individuals that receive government contracts or grants for scientific investigations are under no obligation to retain the data collected, or to place them in an accessible archive at the conclusion of the project. Thus, data sets that commonly are gathered at great expense and effort are not broadly available and ultimately may be lost, squandering valuable scientific resources and much of the public investment spent in acquiring them. Clearly, there is a great need for the agencies to get more return on their investment in science by the simple expedient of making the data collected under their auspices accessible to others.
Finally, the holdings of scientific and technical data by NARA in electronic or any other form are very small in comparison with the data holdings of the federal agencies and the organizations supported by them. Moreover, NARA's budget for its Center for Electronic Records, which has the formal responsibility for archiving all types of federal electronic records, was only $2.5 million in FY 1994, a budget lower than that of many of the individual agency data centers reviewed by the committee in this study. Given NARA's current and projected level of effort for archiving electronic scientific data, it is obvious that NARA will be unable to take custody of the vast majority of these scientific data sets. Therefore, a coordinated effort involving NARA, other federal agencies, certain nonfederal entities, and the scientific community is needed to preserve the most valuable data and ensure that they will remain available in usable form indefinitely. The challenge is to develop data management and archiving procedures that can handle the rapid increases in the volumes of scientific data, and at the same time maintain older archived data in an easily accessible, usable form. An important part of this challenge is to persuade policymakers that scientific data and information are indeed a precious national resource that should be preserved and used broadly to advance science and to benefit society.
Retention Criteria And The Appraisal Process
The National Archives and Records Administration appraises records on the basis of their informational and evidential value. It is concerned with records of long-term value, those records that will probably have value long after they cease to have immediate, or primary, uses. The value of scientific and technical data is primarily informational and is based on the scientific content of the records, rather than on the evidence they provide concerning the activities of the agency that collected or created them.
Recommendations
The recommendations below regarding the retention criteria and appraisal process should be applied by those responsible for stewardship to all physical science data. Similar criteria and appraisal guidelines must be developed for data in other disciplines. This is a topic of primary concern not only to NARA, NOAA, and NASA, but to all scientists, data managers, and archivists who work with such records.
As a general rule, all observational data that are nonredundant, useful, and documented well enough for most primary uses should be permanently maintained. Laboratory data sets are candidates for long-term preservation if there is no realistic chance of repeating the experiment, or if the cost and intellectual effort required to collect and validate the data were so great that long-term retention is clearly justified. For both observational and experimental data, the following retention criteria should be used to determine whether a data set should be saved: uniqueness, adequacy of documentation (metadata), availability of hardware to read the data records, cost of replacement, and evaluation by peer review. Complete metadata should define the content, format or representation, structure, and context of a data set.
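As an illustration only, the five retention criteria named above could be recorded as a simple checklist for each data set. The Python sketch below is not part of the committee's recommendations: the class name, field names, and helper function are hypothetical, and the sketch deliberately reports which criteria are unmet rather than computing a keep-or-discard decision, because the report does not say how the criteria should be weighed against one another.

```python
from dataclasses import dataclass, fields


@dataclass
class RetentionAppraisal:
    """Hypothetical appraisal record for one data set (field names are illustrative)."""
    unique: bool                 # no redundant or equivalent record exists
    documented: bool             # metadata cover content, format, structure, and context
    readable: bool               # hardware to read the data records is available
    costly_to_replace: bool      # re-collection would be expensive or impossible
    peer_review_favorable: bool  # peer reviewers judged the data worth keeping


def unmet_criteria(appraisal: RetentionAppraisal) -> list[str]:
    """Return the names of the criteria this data set does not currently satisfy."""
    return [f.name for f in fields(appraisal) if not getattr(appraisal, f.name)]


# Example: a one-time storm observation whose documentation is incomplete.
storm_record = RetentionAppraisal(
    unique=True,
    documented=False,
    readable=True,
    costly_to_replace=True,
    peer_review_favorable=True,
)
print(unmet_criteria(storm_record))  # ['documented']
```

A list of unmet criteria, rather than a single score, keeps the judgment with the people appraising the data and shows where remedial work (here, better documentation) would make a data set a stronger candidate for retention.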
The appraisal process must apply the established criteria while allowing for the evolution of criteria and priorities and must be able to respond to special events, such as when the survival of data sets is threatened. All stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad overarching decisions regarding each class of data. The appraisal of individual data sets, however, should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources professional to assist with issues of long-term retention.
Classified data must be evaluated according to the same retention criteria as unclassified data in anticipation of their long-term value when eventually declassified. Evaluation of the utility of classified data for unclassified uses needs to be done by stakeholders with the requisite clearances to access such data.
Opportunities Created By Technological Advances For New Data Use And Retention Strategies
Rapid progress in information technology continually alters both the quantity and the quality of scientific information and periodically stimulates fundamental modification of data management and archiving strategies. Recent technological advances have enabled new methods and strategies for data storage and retrieval and have created better ways of connecting users to data resources and to each other. Moreover, the evolving technologies are catalysts for revising organizational structures to manage distributed scientific data archives much more effectively.
Table S.1 provides a summary of new technologies and related developments that enable a new strategy for the management of scientific and technical data. These advances in information technologies and data management support the creation of a highly distributed, federated management structure for our nation's scientific information resources.
TABLE S.1 New Technologies and Related Developments That Enable a New Strategy for the Management of Scientific and Technical Data
New Technology Trends and Related Developments | Key Features | What Is Enabled?
High-performance computer networks | Distributed functions; rapid delivery of large data volumes | Location of databases and archives where best managed; collaborative work; distributed organizations; distributed responsibility
Low and declining cost of storage | Inexpensive backup; continually declining cost; ease of migration | Deferral of archiving decisions; trust in distributed management due to safe storage backup
Advanced data management | Ability to rigorously and formally manage diverse data types | More complex data structures (other than "flat files") handled in archives, with great potential advantages
Changing requirements for information technology professionals | Ability of personnel with lower technical skills to succeed in data management roles | Ability to entrust scientific data management in a distributed environment
High reliability of technology components | Availability of better components and connections; reduced procurement and operations costs | Reduced cost and effort in data migration; trusted connections for communication and collaboration
Development and acceptance of standards | Agreement on terms, interfaces, media, procedures | Reduced effort to communicate and apply results of others; ability to concentrate on mission issues and not on technology support
A New Strategy For Archiving The Nation's Scientific And Technical Data
In order to respond adequately to the imperatives for preserving data about the physical universe and to take advantage of the technological advances described above, the federal government should create an integrated and adaptive infrastructure and related processes for providing ready access to the national resource of scientific and technical data and related information. Such an effort must support the needs of data originators, users, and custodians across all phases of the data life cycle, from origin to use by future generations. The committee believes that the following principles should guide the effort of the government agencies in the long-term retention of scientific and technical data:
Data are the lifeblood of science and the key to understanding this and other worlds. As such, data acquired in federal or federally funded endeavors, which meet established retention criteria, are a critical national resource and must be protected, preserved, and made accessible to all people for all time.
The value of scientific data lies in their use. Meaningful access to data, therefore, merits as much attention as acquisition and preservation.
Adequate explanatory documentation, or metadata, can eliminate one of today's greatest barriers to use of scientific data.
A successful archive is affordable, durable, extensible, evolvable, and readily accessible.
The only effective and affordable archiving strategy is based on distributed archives managed by those most knowledgeable about the data.
Planning activities at the point of data origin must include long-term data management and archiving.
The Proposed National Scientific Information Resource Federation
The committee believes that the federal government should create a National Scientific Information Resource Federation, an evolutionary and collaborative network of scientific and technical data centers and archives, to take on the challenge of providing effective access to and preservation of important data and related information. Such an initiative would begin to exploit fully our nation's significant investment in the physical (and other) sciences and the data acquired with that investment. Several critical concepts must govern any federated management structure for it to function properly (Handy, 1992):
Subsidiarity: the power is assumed to lie with the subordinate units of an organization. Power can be relinquished, but not taken away. The subordinate units typically are best qualified to make operational decisions that directly affect them and that they will be implementing. The central management is allowed only those powers needed to ensure that the subordinates do not damage the organization. It is clear that the strengths of the current system for managing scientific and technical data and information in the United States are distributed among a number of diverse data centers and archives, both within and outside the government. A successful federation of these existing institutions would recognize that they are the locations of expertise on their respective data holdings. Thus the central organization should be small and should not micromanage the day-to-day operations of the subsidiary organizations.
Pluralism: the members are interdependent. In a federation, the individual subsidiary organizations recognize the advantages of belonging to the federation, because of products or services that can be obtained from other elements in the federation. The existence of many specialized data centers and archives, as well as the possibility of creating new ones in a networked environment, can offer significant economies of scale and improved sharing of ideas and expertise. What is good for the subsidiary element also should be good for the whole. Pluralism, coupled with subsidiarity, guarantees a measure of democracy in the federation.
Standardization: interdependence requires compatible languages, communications, basic rules of conduct, and units of measurement. These elements may be summarized as technical and procedural standardization. Standards that are developed by consensus of the subsidiary elements (e.g., the participating data centers, archives, and researchers) are widely recognized as essential to the successful management of data.
Separation of powers (responsibilities): a system of checks and balances is necessary to ensure that the central authority does not take on unnecessary power. This principle must be incorporated into the federation's organizational structure.
Strong leadership: the central coordinating element or executive office must act as the standard bearer, promoting the federation's established goals and objectives while reminding the subsidiary organizations of the importance of carrying out their responsibilities.
A federated data management system would be consistent with the goal of the National Information Infrastructure to distribute information resources broadly throughout our society. The technology is available to make a fully networked, but highly distributed system of data centers and archives both feasible and desirable. Such a system would be efficient in providing access to scientific data and information to a large number of potential users and would maximize the government's return on the very large investment that initially went into acquiring those data. From an organizational standpoint, a federated management structure would allow the disparate elements to continue to specialize in what they each do best and to fulfill their individual organizational mandates, while providing some efficiencies of scale and political leverage in addressing the most pressing issues. The committee believes this approach is especially timely and important in an era of federal government budget reductions.
Recommendations
The committee thus recommends that the federal government take the following steps for adequately preserving and providing access to data about our physical universe:
Adopt the National Scientific Information Resource (NSIR) Federation concept as an integral part of the National Information Infrastructure (NII). This concept must encompass not only an electronic network, but also individuals, organizations, communities, data resources, procedures, guidelines, and associated activities of data generation, management, custodianship, and use. The NSIR Federation thus should provide the means for defining a coherent approach to managing the life cycle of scientific data. This approach should be developed and implemented through consensus of collaborating organizations with diverse and autonomous missions. The interagency Global Change Data and Information System is an example of a prototype NSIR Federation, focused on data for a specific set of interdisciplinary science problems. The NSIR Federation would build on such efforts, providing for better coordination and interaction among them, and would help organize fledgling efforts to preserve and provide broad access to data in other disciplines.
The administration should take the steps necessary to fully define and create the NSIR Federation. There are at least two potential focal points within the administration for planning such an activity. These are the interagency Information Infrastructure Task Force for the NII and the National Science and Technology Council. A convocation of representatives from the scientific, data and information management, and archiving communities would be a good way to help define and inaugurate this initiative.
Following the formal authorization by the federal government for creating the NSIR Federation, the principal parties, including NARA and NOAA, should conclude agreements for the implementation of a distributed archive system. The system should involve all relevant institutions, including nongovernmental entities that are funded by the federal government or that maintain data that were acquired with federal funds. As a general principle, data collected by an agency should remain with that agency indefinitely. The committee recognizes that this recommendation may require significant operational changes for agencies other than NOAA, and even some changes with respect to NOAA's data activities. Furthermore, the associated agencies in the NSIR Federation must work together, under the lead of a small executive office with the expertise to establish data management guidelines and minimum criteria for adequate metadata that could be applied across the entire Federation. The executive office could be either a high-level interagency coordinating committee or a new office at an appropriate federal agency, such as the National Science Foundation, which has a broad scientific and technical as well as communication mandate. In any case, the executive office should resist the typical tendency toward bureaucratic accretion of power, personnel, and resources, as well as the tendency to consolidate and centralize data holdings. A management council consisting of representatives of the member organizations should be created to help ensure that the executive office function remains fully responsive to all members of the federation.
Data access and preservation services should be implemented on the most cost-effective basis possible for the Federation. For example, one institution should provide a service to one or more other institutions in order to exploit potential economies of scale and focal points of expertise. This measure might increase the cost to the providing institution, but would decrease the overall cost to the federation, the government, and the taxpayer.
The institutions belonging to the NSIR Federation should develop a process for collaborating effectively on specific initiatives. This process should provide a mechanism to define and prioritize data management and preservation initiatives, to establish the required agreements between collaborating organizations, and to secure funding for each initiative. Each participating organization would contribute to the federation according to its particular strengths and in a manner consistent with the founding charter. In addition, an independent advisory board consisting of experts from user groups should be formed in support of each initiative.
The NSIR Federation should develop a national resource of information technology that is consistent with its chartered objectives and that can be effectively distributed to institutions that must manage data. These technologies would include complete products, designs, guidelines, standards, and methodologies. A related long-term technology strategy, or "technology navigation" function, should be developed to help guide these efforts.
The NSIR Federation should institute an independently managed process for awarding NSIR certification to member scientific institutions and their data and information systems on the basis of well-defined criteria and standards. The certification process should be managed by a nongovernmental, not-for-profit organization, which would receive technical guidance from the participating federal agencies. The certification needs to have credibility in the community, so that nonmember institutions will aspire to attain certification and have it tagged to their products. The certification also should be something that commercial value-added providers seek to increase the credibility of their products.
It also is important for the committee to state what the NSIR Federation should not be. It should not become an expensive bureaucratic entity. The executive office must not impose any standards or information technologies from above that have not been validated through a consensus process of the member organizations. Finally, the executive office must not attempt to micromanage the operations of the participants, nor should it have any direct control over their budgets and funding allocations.
Recommendations Specifically for NARA
Although NARA has a legislative mandate to preserve federal records, it cannot today, nor will it likely ever be able to, act as the custodian of most physical science data. The data volume is too great in relation to the very low funding appropriated to NARA, the NARA staff do not have the specialized scientific knowledge, the interagency linkages are not in place, and a huge infrastructure similar to that which already exists at other agencies would need to be duplicated by NARA. In addition, the designation of a federal record is sometimes irrelevant to the archival process for scientific and technical data, and many data of long-term interest do not meet the existing definition of a federal record.* Hence, NARA has a special role as a partner in the archiving process for scientific and technical data sets that is different from its traditional role as the nation's archives.
* "'[Federal] records' includes all books, papers, maps, photographs, machine readable materials, or other documentary materials, regardless of physical form or characteristics, made or received by an agency of the United States Government under Federal law or in connection with the transaction of public business and preserved or appropriate for preservation by that agency or its legitimate successor as evidence of the organization, function, policies, decisions, procedures, operations, or other activities of the Government or because of the informational value of the data in them" (44 U.S.C. 3301).
The committee makes the following specific recommendations to NARA in addition to those made elsewhere in this report:
NARA should strengthen its liaison with each federal agency that produces scientific and technical data to ensure that appropriate attention is devoted to their long-term retention in a distributed storage environment.
NARA should form standing advisory committees with managers of scientific data, historians, and scientific researchers to address the retention and appraisal of scientific and technical data collections and related issues.
NARA should collaborate with other agencies that maintain long-term custody of data to develop an effective access mechanism to these distributed archives. The initial step should focus on locator systems and evolve toward a transparent access system.
Finally, NARA should work with the scientific community and potential sources of scientific data to develop adaptable performance criteria for data formats and media, rather than mandating narrow and inflexible product standards.
Recommendations Specifically for NOAA
As the largest holder of earth sciences data in the United States, NOAA has a vast amount of scientific data stored at a number of facilities across the country. NOAA thus has an especially important role in the preservation of our nation's observational data on the physical environment. The committee makes the following specific recommendations to NOAA:
NOAA should place a higher priority on documenting and establishing directories of its data holdings.
NOAA, with the active cooperation of NARA, should lead efforts to better define technology-independent standards for archiving, storing, and transmitting the data within its purview.
Finally, NOAA, as well as every other federal science agency, should ensure that: all its data are shared and readily available; it fulfills its responsibility for quality control, metadata structures, documentation, and creation of data products; it participates in electronic networks that enable access, sharing, and transfer of data; and it expressly incorporates the long-term view in planning and carrying out its data management responsibilities.
The creation of the committee's proposed NSIR Federation would help provide a collaborative mechanism and more sustained peer pressure to meet these objectives, and thus enhance the value of scientific and technical data and information resources to the nation.
1 Introduction
Standing at the intersection of past and future, we humans are fascinated with the events of yesteryear and intrigued with what tomorrow will bring. Our prehistoric ancestors began the process of recording aspects of the environment that were important to them (Marshack, 1985; Boorstin, 1992). Today we are curious about many more worlds, ranging from those of atomic size to those of cosmic scale. With instruments on Earth and in space, we seek to capture views of reality that will help us understand nature and our relationship to it.
Scientific data reflect both the organization and the chaos of the natural world. They stimulate us to develop concepts, theories, and models to make sense of the patterns they represent. The resulting abstractions are the product of scientific endeavor, the goal being to develop the formal and systematic ideas that constitute the understanding of relationships between causes and consequences and perhaps may enable prediction of future sequences of events. Because scientists transform data from the material world into ideas, the observations of objects and processes in the physical world are the stimuli of scientific thought. Data are thus the seeds of scientific ideas.
Science generally works by proceeding from data to understanding through a process of organizing the data and analyzing their implications. The following definitions, adapted from Setting Priorities for Space Research: Opportunities and Imperatives (NRC, 1992a), indicate how the process works:
Data are numerical quantities or other factual attributes derived from observation, experiment, or calculation.
Information is a collection of data and associated explanations, interpretations, or other textual material concerning a particular object, event, or process.
Knowledge is information organized, synthesized, or summarized to enhance comprehension, awareness, or understanding.
Understanding is the possession of a clear and complete idea of the nature, significance, or explanation of something; it is the power to render experience intelligible by ordering particulars under broad concepts.
This process is cyclical. New data confirm or refute existing theories and stimulate new understanding, which generates new and deeper questions that often need entirely new sets of observations to begin the process of answering them. New understanding also leads to increased technological capability, and that in turn makes new observations possible and again allows us to contemplate more sophisticated questions.
Thus observations and scientific progress are intertwined; data from the physical world ensure that science is founded on reality as we try to answer the unending "how" and "why" questions that are part of being human. The answers become understanding that enables us to develop schemes for predicting or not being surprised by future events. And understanding, we hope, ultimately leads to wisdom about our interactions with the world around us.
Imperatives For Preserving Data On Our Physical Universe
The scientific reasons for preserving data derive from the fact that observations, knowledge, and understanding are cumulative. Thus we believe that the more complete the record, the more we can extract from it.
Many observations about the natural world are a record of events that will never be repeated exactly. Examples include observations of an atmospheric storm, a deep ocean current, a volcanic eruption, and the energy emitted by a supernova. Once lost, such records can never be replaced.
Observed data provide a baseline for determining rates of change and for computing the frequency of occurrence of unusual events. The longer the record, the greater our confidence in the conclusions we draw from it. Our traditional observational records have portrayed frozen instants of reality. If preserved, they will continue to provide insights, but if neglected, they will melt away.
A data record is also worth preserving because it may have more than one life. As scientific ideas advance, new concepts emerge in the same or entirely different disciplines from study of observations that led earlier to different kinds of insights. New computing technologies for storing and analyzing data enhance the possibilities for finding or verifying new perspectives through reanalysis of existing data records. Thus, the relative importance of data, both current and historical, can change dramatically, often in entirely unanticipated directions. This means that the reanalysis of data, even in the distant future, may bring new understanding, which will again increase the value of those data over that which we might have assigned to them at the time of their archiving. Finally, the substantial investments made to acquire data records usually justify their preservation. The cost of preservation will almost always be small in comparison with the cost of observation. Because we cannot predict which data will yield the most scientific benefit in years ahead, the data we discard today may be the data that would have been invaluable tomorrow.
The assembled record of observational data thus has dual value: it is simultaneously a history of events in the natural world and a record of human accomplishment. The history of the physical world is an essential part of our accumulating knowledge, and the underlying data form a significant part of that heritage. They also portray a history of our scientific and technological development.
With appropriate explanatory documentation, often referred to as metadata, the data demonstrate the increasing sophistication of our attempts to understand our natural surroundings and the technological capabilities we apply to the task. Preserved for study by future generations, the data will speak across the years about what we tried to do, where we succeeded, and where we failed. With increasing capabilities for analyzing and conceptualizing patterns in data, those who follow may find in our archived data important clues that we could not or did not see. At the same time, our descendants will be grateful that we preserved a sufficiently long history of their world that they can make important decisions about their own future.
There are numerous socioeconomic reasons, in addition to the compelling scientific and historical motivations, for the long-term retention of observational, as well as certain types of experimental, data. For example, historical climate data have had well-documented uses in a broad range of applications in manufacturing, energy, agriculture, transportation, communications, engineering, construction, insurance, and entertainment (OTA, 1994). Such applications are common for other types of observational data on the Earth's environment. Experimental data in the physical sciences also have many industrial and other practical uses. Additional examples of the long-term uses of the various physical science data are provided in the next chapter.
A New Future For Scientific Data
The collections of scientific data acquired with government and private support are the foundation for our understanding of the physical world and for our capabilities to predict changes in that world. In the years ahead, the volumes of those collections of data will increase dramatically. They will stimulate advances in our scientific understanding and in our applications of that understanding to pursue important national goals. The scientific data in federal, state, and private databases thus constitute a critical national resource, one whose value increases as the data become more readily and broadly available.
Today, we can foresee the possibility of using the national resource of scientific data more advantageously than ever before, as technological advances open new vistas for managing and accessing scientific information. Growing computational power enables new approaches to the analysis, management, and application of data. Advances in data storage technologies make the long-term retention of virtually all data both feasible and affordable. The existence of the Internet and of the emerging National Information Infrastructure (NII) enables unprecedented nationwide sharing and application of data that reside in appropriately configured databases. Automatic search procedures, file transfer capabilities, and the accelerating use of the World Wide Web functions on the Internet illustrate the power of the contemporary technology. It is important to note that these enabling technologies have emerged in a short time span; equally rapid advances can be anticipated in the years ahead, which will further facilitate the search for and access to the nation's data resources.
Our new power to store and distribute data and information is changing the way we work and think. However, the communities involved in the creation, retention, and use of scientific data about the physical world are not optimally organized. They commonly work toward disparate goals, are not well connected, and do not take full advantage of technological and conceptual advances in data management and communication. An entirely new approach to the long-term preservation of scientific data is now both feasible and essential. It must take advantage of advancing technology and of distributed communications and management structures to empower both the creators and the users of such data.
This study identifies the major issues regarding existing efforts to archive and use data in the physical sciences, establishes retention criteria and appraisal guidelines for those data, reviews important technological advances and related opportunities, and proposes a new strategy to ensure access to the data by future generations.
2 The Challenge: Preservation and Use of Scientific Data
We advance our understanding of the physical universe by building on current and past studies in individual disciplines, by collecting and analyzing new types of data, and by using past observations in entirely new ways not envisioned when the data were initially collected. The more complete the record of scientific data and information, the more new understanding and knowledge we can extract from it. Observations of natural phenomena typically represent a record of events that will never be repeated in a dynamic universe that continually changes in time and varies in space. New scientific advances have had significant, sometimes profound, societal and economic impacts and may be expected to be equally important in the future. Scientific data and information are at the heart of these advances and are essential for new discoveries. Therefore, they constitute a precious national resource.
The sections that follow describe briefly the two major types of data that are of critical importance in the physical sciences: experimental laboratory data in physics, chemistry, and materials sciences, and observational data in the earth and space sciences. In each of these broad areas the progress that has been made to date in terms of long-term preservation and accessibility is characterized, and the key issues identified. More comprehensive descriptions of the status of long-term data retention in the various physical science discipline areas are in the volume of working papers prepared as background for this report (NRC, 1995).
Experimental Laboratory Data
The experimental sciences have progressed over the centuries by building on the concepts, theories, and factual information resulting from each generation of scientific inquiry. The observations of Tycho Brahe were used by Kepler to develop his laws of planetary orbits, and Newton's formulation of mechanics drew upon the previous work of Galileo, Kepler, and others. A century of measurements on properties of the chemical elements provided the raw material needed for Mendeleev to construct his periodic table. The history of science is rich in examples where the introduction of new, often revolutionary, concepts rested on data that had been preserved from previous scientific investigations. Furthermore, the technology of tomorrow is often based on the laboratory data of today or yesterday.
The explosive growth of science in this century provides many other examples of the key role of data from previous experiments. When Townes and Schawlow published their landmark 1958 paper that demonstrated the theoretical possibility of building a laser, intensive efforts were started to find a real physical system that would meet the necessary requirements. Data on atomic spectra, some of them 60 to 70 years old, provided the key to creation of the first working gas laser. If it had been necessary to make new measurements on every conceivable system in order to select the most promising for trial, the invention of the laser and all the new technology and economic benefits that it has brought would have been delayed for many years.
The crash program to improve rocket propulsion systems following the launch of the first Soviet Sputnik provides another example. Data on the thermodynamic properties of a wide range of substances were essential to the efforts to optimize rocket engine performance. A concerted government program was started to build a database of thermodynamic properties for rocket engine design. Although some new laboratory measurements were required, many of the needed data were in the scientific literature, some published as early as 1880. The availability of these older data significantly aided the rocket engine program.
Data generated by scientists and engineers in the fields of physics, chemistry, and materials science have traditionally been published in research journals, which serve both a current dissemination and an archival function. This journal system has served science well for 300 years. Many scientific libraries throughout the country provide access to these journals. Because back volumes are kept in libraries in many different places, there is little danger of irreparable loss from a natural catastrophe. Many scientific societies also have depository systems that allow authors to submit voluminous data sets that cannot be published in the journals because of lack of space. The societies maintain these archives, generally on microfilm, and supply copies on request.
While the growing use of electronic recording and storage techniques is already affecting the traditional journal system, we can expect publishers to take advantage of the new technology to meet new needs. Scientific societies are beginning to implement electronic archives for preserving data that are too voluminous to publish in paper formats. For example, the American Chemical Society recently began to make data from papers in its leading journal (Journal of the American Chemical Society) available on the Internet. It is a natural step from the paper and microfilm archives that such societies now maintain to the electronic archives of the future. Clearly, these private sector archives must be an integral part of the overall concept of a "National Scientific Information Resource."
Electronically recorded data in the laboratory physical sciences are of two forms, original experimental measurements and evaluated compilations of published data. These are examined here in turn.
Original Experimental Measurements
Recent decades have seen significant changes in the form of "original data." A raw experimental result was, in the past, typically a measured value such as a voltage or distance. The investigator read these measurements from instruments, wrote them in a notebook, treated them arithmetically to obtain the desired scientific variable from the raw measurement, and interpreted them. The original measurements were eventually discarded in most cases. Today, many raw data are acquired and processed electronically as soon as they are entered into the computer, so that only the processed data exist long enough for anyone to look at. With rapid, automated data acquisition and manipulation, the option exists to keep electronic data and reanalyze them as required. However, automated data collection often results in large volumes of insignificant data, so that in many experiments the data stream is screened and most of the data are discarded in real time by a computer program or by the experimenter. For example, spectroscopists used to keep, at least temporarily, the photographic plates or recorder charts from which they had taken measurements. Now the spectral features may be analyzed electronically immediately upon measurement, and only the attributes of relevant features are recorded. The fraction of the raw data that is saved after initial processing may be small, sometimes less than one part in 10,000. In virtually all cases, there is no justification for preserving the raw data, because the experiment can be repeated in those rare instances in which an unanticipated future interest appears.
When considering laboratory data of this kind, it is usually best to recognize that no one knows as much about the original data as the original experimenter. If the experimenter does not find the raw data worth preserving (and worth documenting), then the data are probably not going to be of use to anyone else. Because the number of stages of processing (e.g., replication, averaging, coordinate transformations, applying corrections, and so on) differ for every type of measurement and undergo continual evolution as new techniques are introduced, it would be fruitless to try to formulate generic retention criteria for all types of laboratory data.
However, there are certain classes of laboratory data (where "laboratory" is used in a broad sense) that should be candidates for preservation if properly documented, because it would be impossible or impractical to reproduce the measurements. Some of the data taken in large plasma physics facilities fall in this category, because reproduction of the facilities would be extremely costly. A more striking example is the spectroscopic and other measurements from nuclear tests in the atmosphere, which it is hoped will never be reproduced. On a more mundane level, properties of engineering materials, measured as a part of large government research and development programs, provide many data of possible interest in the future. Such data are acquired as a small step in a larger program and usually are not published in the scientific literature or disseminated by the usual channels. They would be costly to reproduce because many of the materials were specially prepared with unique fabrication technology. Examples include polymer and sensor data from the Strategic Defense Initiative, engineering data from the National Aeronautics and Space Administration (NASA), and the superconducting materials measurements carried out to develop magnet fabrication techniques for the canceled Superconducting Super Collider. Even though this project will not be completed, the materials measurements should be saved, because they may well be applicable to future engineering projects.
Evaluated Compilations
Compilations resulting from the critical analysis of a large body of data from the scientific literature are a separate area for consideration. Well-known examples include thermodynamic property compilations such as the National Institute of Standards and Technology's Joint Army-Navy-Air Force (JANAF) tables and the thermophysical properties disseminated by the Department of Defense's Center for Information and Data Analysis and Synthesis at Purdue University (see the Physics, Chemistry, and Materials Sciences Data Panel report in the NRC (1995) report for a detailed discussion of these examples). The Department of Energy operates several data evaluation centers in nuclear physics and chemistry. In such centers, the data and backup documentation are not impossible to replace; they simply represent so much effort and exercise of specialized scientific judgment that it would be extremely costly to redo the work. The cost of not having the data available, although usually difficult to measure other than anecdotally, can be much higher than the cost of preserving them. In particular, if it becomes necessary in the future to expand or extend the compilation, the full documentation (e.g., data extracted from references, fitting programs, notes on the analysis techniques, and the like) will provide a valuable base for the new work. A major concern in considering these data collections is how the data and the underlying documentation can be preserved and made accessible if the centers producing them lose their funding or expert personnel. This concern increases as government agencies downsize their activities.
Observational Data In The Physical Sciences
Over the past two decades, the National Research Council and other groups have issued numerous reports that have addressed data management issues, including long-term retention requirements, for digital observational data in the earth and space sciences (NRC, 1982, 1984, 1986a,b, 1988a,b, 1990, 1992b, 1993; GAO, 1990a,b; Haas et al., 1985; NAPA, 1991). Most of these reports have focused quite narrowly on the data management or archiving problems of specific disciplines or agencies, and none has addressed comprehensively the issues associated with the long-term retention of observational and experimental data in the physical sciences.
Major Characteristics of Observational Data
Observational data sets, like laboratory data, include digital information (in both written and electronic form), graphical records, and verbal descriptions. The records exist as ink on paper, punched paper, film (including microforms), magnetic tape of many types (including videotape), magnetic disk, and digital optical media (including CD-ROM). Over the past three decades, however, the dominant form of data collection and storage has been electronic.
Observational data can be characterized by the collection and management practices applied throughout the life cycle of their existence. One might characterize two major practices driven by the funding models for conducting the underlying science. The "big science" funding model creates a funding umbrella for multiple individuals and institutions to conduct coordinated data acquisition, investigation, and publication. Often, these large programs adopt a standard approach for life-cycle data management. However, there is usually little standardization among the big science programs. Examples of such programs include the World Ocean Circulation Experiment, the World Climate Research Program, and NASA's Mission to Planet Earth (CENR, 1994). The other funding model, "small science," funds individuals or small groups of individuals to conduct independent data acquisition, analysis, and publication. Typically, these investigators plan, design, and implement their own data management strategy with little interaction with the rest of the scientific community. The data generated under both models have long-term value, both for science and for the broader interests of the nation.
Specific subdisciplines also impose different requirements on long-term data management. For instance, while there is general agreement within the physical oceanography community on the definition of standard observation variables and the processes of measuring those variables, the same cannot be said for biological oceanography. Because of differences in measuring techniques, lack of community agreement on naming standards, and the scientific process by which biology progresses, data management for biological data sets is inherently more complex than in physical oceanography. The data from these two subdisciplines will have to accommodate multiple naming schemes and alternate taxonomies. Therefore, data managers and archivists have to deal with differing approaches and vocabularies among disciplines, evolution of discipline research paradigms over time, and diverging concepts and methods within a discipline.
Scientific research leads to the creation of data that can be processed and interpreted at different levels of complexity. Typically, each level of processing adds value to the original (level-0) data by summarizing the original product, synthesizing a new product, or providing an interpretation of the original data. The processing of data leads to an inherent paradox that may not be readily apparent. The original unprocessed, or minimally processed, data are usually the most difficult to understand or use by anyone other than the expert primary user. With every successive level of processing, the data tend to become more understandable and often better documented for the nonexpert user. One might therefore assume that it is the most highly processed data products that have the greatest value for long-term preservation, because they are more easily understood by a broader spectrum of potential users. In fact, just the opposite is usually the case for observational data, for it is only with the original unprocessed data that it will be possible to recreate all other levels of processed data and data products. To do so, however, requires preservation of the necessary information about processing steps and ancillary data.
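The point that higher-level products can be regenerated only from level-0 data plus a preserved record of the processing steps and ancillary values can be illustrated with a small sketch. Everything in it is hypothetical and chosen for illustration: the step names, the calibration gain and offset, and the sample counts do not come from the report or from any real mission.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ProcessingStep:
    """One recorded processing step: its name plus the ancillary parameters used."""
    name: str
    params: Dict[str, float] = field(default_factory=dict)


def calibrate(values: List[float], gain: float, offset: float) -> List[float]:
    # Convert raw instrument counts to physical units (hypothetical linear calibration).
    return [gain * v + offset for v in values]


def block_mean(values: List[float], size: float) -> List[float]:
    # Summarize the series by averaging consecutive, non-overlapping blocks.
    n = int(size)
    return [sum(values[i:i + n]) / n for i in range(0, len(values) - n + 1, n)]


# Registry of known steps; an archive would need to preserve the algorithms as well.
STEP_REGISTRY = {"calibrate": calibrate, "block_mean": block_mean}


def reprocess(level0: List[float], steps: List[ProcessingStep]) -> List[float]:
    """Rebuild a higher-level product by replaying the recorded steps on level-0 data."""
    data = list(level0)
    for step in steps:
        data = STEP_REGISTRY[step.name](data, **step.params)
    return data


level0_counts = [10, 12, 14, 16]  # hypothetical raw counts
recorded_steps = [
    ProcessingStep("calibrate", {"gain": 0.5, "offset": 1.0}),
    ProcessingStep("block_mean", {"size": 2}),
]
print(reprocess(level0_counts, recorded_steps))  # [6.5, 8.5]
```

If either the level-0 counts or the recorded calibration parameters were lost, the averaged product could not be rebuilt or revised, which is the paradox described above.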
Another important characteristic of observational data is their volume. In this respect, observational data can be divided into two different classes: small-volume and large-volume data sets. The majority of traditional ground-based, in situ observations form small-volume data sets because they are based on individually conducted measurements or sample collections. Satellite and other remotely sensed observations generally form large-volume data sets.
The committee defines small-volume data sets as those with volumes that are small in relation to the capacity of low-cost, widely available storage media and related hardware. The hardware and software to write and produce CD-ROMs are now generally available for less than $10,000, and personal computers capable of reading CD-ROMs are being marketed as home-use, consumer items. For example, the total volume of the small-volume oceanographic data is projected to be less than 50 gigabytes by 1995, and thus the entire historical data set for all observations could be stored on fewer than 100 CD-ROMs. This is fewer discs than many people have in their compact disc music collections.
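The CD-ROM estimate above can be checked with a short calculation. The sketch below assumes a capacity of 650 megabytes per disc, a figure the report does not state, and uses the decimal unit definition 1 GB = 10^9 bytes.

```python
# Assumed capacity of a single CD-ROM; the report itself gives no figure.
CD_ROM_CAPACITY_BYTES = 650 * 10**6


def discs_needed(dataset_bytes: int, disc_bytes: int = CD_ROM_CAPACITY_BYTES) -> int:
    """Return the number of discs needed to hold a data set (ceiling division)."""
    return -(-dataset_bytes // disc_bytes)


oceanographic_bytes = 50 * 10**9  # 50 gigabytes, the projected upper bound for 1995
print(discs_needed(oceanographic_bytes))  # 77
```

Under that assumption the 50-gigabyte holding needs 77 discs, consistent with the statement that fewer than 100 CD-ROMs would suffice.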
Issues such as archiving cost, longevity of media, and maintenance of the data holdings are not the dominant considerations with regard to retaining small-volume data sets. Rather, the major issue with respect to this class of data is the completeness of the descriptive information, or metadata. If a data set has been properly prepared and documented, the operations required to migrate the data should be amenable to significant automation and therefore pose only a minor challenge to the long-term maintenance of the archive. Further, these data may be widely distributed with simple replication of the media. For example, the various NOAA and NASA data centers have provided copies of their data sets to many users for a number of years.
A different problem is posed by large-volume data sets. The biggest data sets typically come from Earth observation satellite sensors and space science missions, and are challenging to some contemporary storage devices. However, it is clear that for the data set to exist at all, an adequate storage medium capable of capturing and maintaining the data for some time period must exist when the data are generated. Further, the time period for reliable, initial storage should at least cover the lifetime of the data set at the organization acquiring and using the data before the records need to be migrated to new media or transferred to another organization, such as NOAA or NARA. In addition, during the initial storage period, there are likely to be major increases in the density of mass storage accompanied by significant decreases in the cost of storage of the data. Thus, data sets that are challenging today will gradually be transformed to "small-volume" status in the future, as advancing technology increases the capacity and lowers the cost of storage devices. Nevertheless, it is important to note that the largest data sets (e.g., larger than one terabyte) can present significant organizational and management problems that require special analysis of the data flow, volume, access, and timing characteristics.
Observational Data in the Space and Earth Sciences
Astronomy and Astrophysics Data
Astronomy and astrophysics are observational sciences; that is, they are based on what the sky provides and we collect. Therefore, in many astronomical investigations there is no such thing as "repeating an experiment" with the expectation of getting the same results. Many objects have properties that change with time either because of their intrinsic nature (e.g., variable stars), evolution (e.g., stars going supernova), or reasons yet unknown. It happens quite frequently that a highly variable object is found in satellite data, and subsequent archival research in optical plates allows its identification as a given type of star.
Astronomy and astrophysics data are acquired by both ground-based and space-based observatories. Ground-based observatories, which are operated by universities or other nonprofit organizations (e.g., Association of Universities for Research in Astronomy, the Smithsonian Institution) and funded by these organizations or by the National Science Foundation (NSF), have traditionally been used to study the sky at visible wavelengths. Since the Second World War, astronomers have used improving technologies to observe at radio and infrared wavelengths. Consortia of universities, including both U.S. and foreign institutions, are constructing new telescopes, which use advanced technology to build larger mirrors that will allow us to look deeper into the universe. Radio observatories range from smaller ones operated by universities to larger national facilities, such as the National Radio Astronomy Observatory, funded by NSF. Most telescopes are for individual observing programs, but some are dedicated to systematic sky surveys.
Data from ground observations have traditionally been the property of the observer; therefore, observatories have no standard policies for data archiving. The exceptions are some big projects, such as the Palomar Sky Survey, where data either are made public and sold or are archived within the university or observatory. Some centers, such as the National Radio Astronomy Observatory, the National Optical Astronomy Observatories, and the Harvard-Smithsonian Center for Astrophysics, have begun to archive most data obtained from major telescopes. These data are valued and used broadly by astronomers. Nevertheless, archival activities remain of generally low priority.
Although the older astronomical data consist of photographic plates and other analog data, virtually all data today are collected digitally. There also have been major efforts to digitize old photographic data to allow their analysis by computer. An example of this is the digitization of a whole-sky survey by the Space Telescope Science Institute; this survey is now available for sale on CD-ROM from the Astronomical Society of the Pacific. Recently, the astronomical community adopted a standard format, the Flexible Image Transport System (FITS), for transfers of digital files. With the advent of digital data, there also has been an evolution from individual data analysis packages to a few widely distributed packages (e.g., IRAF, AIPS, VISTA, XANADU), which provide standard tools for baseline analysis.
Because of the filtering and distortion produced by the Earth's atmosphere, the amount of energy emitted by celestial bodies that can be detected on the ground is limited significantly. Observations from space above the atmosphere remove such limitations. From their inception, space astronomy and astrophysics have been mostly under NASA's purview, although some important experiments have been financed by the Department of Defense. The data are collected through telescopes and detectors placed on airborne devices (balloons or planes), rockets, NASA's Space Shuttle, and orbiting satellites. The largest volume of data is collected by satellites, and most of these missions are international collaborations. The U.S. portion has always been handled by NASA.
Within NASA, space astronomy and astrophysics are organized into different wavelength-based disciplines, reflecting the organization in the scientific community. These disciplines include the infrared, whose main data center is the Infrared Processing and Analysis Center in Pasadena, California, where the data from the Infrared Astronomical Satellite mission are archived; the optical and ultraviolet, with data centers at the Space Telescope Science Institute in Baltimore, Maryland, where the Hubble Space Telescope data are archived, and at the NASA Goddard Space Flight Center in Greenbelt, Maryland, where the International Ultraviolet Explorer archive resides; and high-energy astrophysics, which maintains x-ray data at the Einstein Observatory Data Center in Cambridge, Massachusetts.
Table 2.1 provides a representative sample of NASA Astrophysics Archives. The earlier NASA astrophysics projects were so-called "principal investigator" missions, where a contract was awarded to a group of principal investigators, who built the hardware, received the data from the experiments, and analyzed and interpreted them. These principal investigators had no clearly stated guidelines to prepare data for archiving, other than to deliver the reduced data to the NASA data depository at the National Space Science Data Center (NSSDC) at the NASA Goddard Space Flight Center. Documentation generally was minimal, and the data, which often were not well-documented or well-organized, were difficult to retrieve for scientific use, even if they were adequately physically preserved.
It has become fully apparent, however, that the uniqueness and high acquisition cost of these space data make their effective preservation and archiving a high priority. Even after the active operation of a space observatory has ended, the data typically are retrieved and used by scientists for many more years. As a result, the situation has improved considerably at the NSSDC in recent years. Moreover, NASA now funds wavelength-specific scientific data centers to process the data, eliminate anomalies in the data, and provide software for scientific analysis.
TABLE 2.1 A Representative Sample of NASA Astrophysics Archives, by Satellite Mission

Satellite mission | Data type | Year of launch | Duration | Total data volume (gigabytes) | Data center
High Energy Astrophysical Observatory 2 | X-ray data | 1978 | 2.5 years | ~100 | Einstein Observatory Data Center, Cambridge, Massachusetts
International Ultraviolet Explorer | Ultraviolet data | 1978 | Ongoing | ~100 | National Space Science Data Center, Greenbelt, Maryland
Infrared Astronomical Satellite | Infrared data | 1983 | 300 days | ~150 | Infrared Processing and Analysis Center, Pasadena, California
Hubble Space Telescope | Optical/Ultraviolet data | 1990 | Ongoing | ~5500 by year 2005 | Space Telescope Science Institute, Baltimore, Maryland
Planetary Science Data
Planetary data also are acquired by both ground-based and space-based observations. Planetary data include observations of the entire physical system and forces affecting a planet or other body, including the geology and geophysics, atmosphere, rings, and fields. The sensors used collect data across much of the electromagnetic spectrum. Currently, most planetary observations are supported by NASA, either as the direct result of planetary missions or as ground-based observations that support a mission. Over the past three decades, NASA has sent robotic spacecraft to every planet in the solar system except Pluto, to two asteroids, and to a comet. Men have walked on the Moon, performed experiments there, and returned samples. The knowledge we have about the bodies in the solar system, with the exception of our own planet, comes mostly from space missions. In some cases, such as the gas giants Jupiter, Saturn, Uranus, and Neptune, robotic space probes have provided most of our current knowledge. Many of the satellites of the other planets were no more than points of light with minimal spectral and light-curve measurements before the Voyager mission. Now each is recognized as a separate world with highly individual characteristics.
The scientific and historical importance of space-based planetary observations, the realization that additional missions cannot replicate the original observations, and the expense of planetary missions all prompted NASA to create the Planetary Data System (PDS) to improve the acquisition, archiving, and distribution of planetary data. The developers and current staff of the PDS recognize that the data from planetary missions make up the scientific capital of the agency's planetary exploration program and that these data are a national resource. The PDS tries to acquire all existing planetary data from NASA's missions and even from international ventures, in order to have a complete archive of our exploration of the solar system. In addition to the space-based measurements, the PDS accepts relevant ground-based observations and laboratory measurements that support planetary missions by providing baseline or calibration data. A basic condition for acceptance is that the data set must be properly documented and include all relevant ancillary data, including planet and spacecraft ephemerides, calibration tables, and experimenter notes about the shortcomings of the data. Members of the PDS scientific staff and scientists in the community who have expertise within the relevant disciplines peer-review each data set.
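The acceptance condition described above amounts to a completeness checklist applied before peer review. The following sketch shows one way such a check might look; the item names and the functions are hypothetical illustrations based only on the items listed in the text and are not part of any actual PDS software or standard.

```python
from typing import Iterable, List

# Items the text names as required for acceptance (identifiers are hypothetical).
REQUIRED_ITEMS = (
    "documentation",
    "planet_ephemeris",
    "spacecraft_ephemeris",
    "calibration_tables",
    "experimenter_notes",
)


def missing_items(submitted: Iterable[str]) -> List[str]:
    """Return the required items absent from a submission, in checklist order."""
    present = set(submitted)
    return [item for item in REQUIRED_ITEMS if item not in present]


def ready_for_peer_review(submitted: Iterable[str]) -> bool:
    """A data set proceeds to peer review only when nothing is missing."""
    return not missing_items(submitted)


submission = ["documentation", "calibration_tables", "spacecraft_ephemeris"]
print(missing_items(submission))  # ['planet_ephemeris', 'experimenter_notes']
print(ready_for_peer_review(submission))  # False
```

A check of this kind only confirms that the items exist; judging whether the documentation is adequate remains the job of the peer reviewers.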
One of the more important contributions of the PDS, especially with regard to the ongoing preservation of data in a useful form, is the electronic "publication" of the majority of the data from many planetary missions in the form of CD-ROMs. These include not only the data, but also documentation, format specifications, ancillary data, and even, in some cases, display and analysis tools.
Space Physics Data
Space physics involves the study of the largest structures in the solar system: the plasma environments of the planets and other bodies and the solar wind. Those environments consist of plasmas ranging from low energies (the thermal component) to charged particles of high energies, including cosmic rays accelerated by galactic processes. They also consist of the magnetic fields (if they exist) of planets or the Sun, as well as electrostatic and electromagnetic fields generated from natural instabilities in plasmas and charged-particle populations. Furthermore, in many locales, such as comets and the Earth's ionosphere, dust and neutral gases play an important role in mediating the behavior of plasmas and electromagnetic fields. As a consequence, the field of space physics requires a broad array of sensors and instruments at all levels of complexity.
Many instruments make in situ observations, but novel techniques enable remote sensing of various plasma regimes. Because some of the most apparent manifestations of space physics processes result in the northern lights and in planetary-scale modifications of the terrestrial magnetic field (and subsequent catastrophic effects on power grids and communications), space physics relies heavily on a wide array of ground-based observations, including magnetometers, ionospheric sounders, incoherent scatter radar facilities, all-sky cameras, and photometers. In addition, a broad range of ground-based and space-based solar monitors has become crucial to study the correlations between various disruptions in the terrestrial plasma environment and solar activity, including sunspots, flares, and prominences.
For many reasons, it is essential to preserve space physics data for long periods of time. The Sun drives solar-terrestrial relationships, and many studies require observations over 22-year solar cycles. During this cycle the Sun reverses its magnetic polarity twice and goes through periods of increased activity with sunspots and associated flares. At solar activity minimum, flare and sunspot activity decreases, but expanded coronal holes appear. Long intervals of records are required because each solar cycle is different from previous ones and because there are long-term deviations, such as the Maunder minimum, from "normal" patterns. From the terrestrial point of view, there are motions of the magnetic dipole and even magnetic field reversals on time scales of thousands of years.
Because many space physics observations are taken in situ, models of the magnetosphere need data collected by many spacecraft, having different kinds of orbits and trajectories. To make sense out of data from one of these missions, it is important to be able to examine what another spacecraft in a different orbit found. Only by preserving the data from numerous missions do we acquire a sufficient archive.
Space physics has generated about 50 gigabytes of data per year over the last 30 years. The field has enjoyed this extraordinary productivity primarily because most missions were in Earth orbit and were tracked continuously for years. Many of these data sets were "archived" by sending the tapes and sometimes the relevant documentation to the NSSDC. Copies of the data on microfilm or on other media were sent there as well. Unfortunately, for every well-prepared, thoroughly documented space physics data set at the NSSDC, there are several poorly prepared and improperly documented data sets. For the earliest space missions, the archiving techniques were undeveloped, and archiving was not deemed a high priority. Thus, there are many data at the NSSDC that most scientists would find difficult to use with only the information originally supplied. Given the recent emphasis on the proper preservation of data and the importance of archiving, prompted in part by two General Accounting Office reports (1990a,b) and also by a heightened awareness of and desire for high-quality archives in the community, many recently archived data sets are in better condition than their predecessors. Even though the Space Physics Data System has been in existence only since 1993, the more advanced data activities in other disciplines have influenced the space physics community favorably. Hence, it is becoming more likely that the data now being submitted are of a higher quality, have more adequate documentation, and are more complete than earlier data sets.
NOAA, NSF, the Department of Defense, private and educational institutions, and foreign organizations typically support the ground-based observations. Most of these data, not managed by NASA, eventually come under the purview of the National Geophysical Data Center, operated by NOAA at Boulder, Colorado. The center's holdings consist of over 300 digital and analog databases, some of which are very large. However, many important data sets still reside solely in the hands of the original investigators, the military, or foreign sources.
Atmospheric Science Data
Atmospheric science data sets are diverse and present a variety of problems for distribution, archiving, and later interpretation. Some data sets on the atmosphere stand out as the largest in any scientific discipline, particularly those from remote sensing by satellite or radar; others consist of contributions from thousands of individuals all over the world, and the provenance of those data is sometimes uncertain. Many data sets span decades, and a few span more than a century, with accompanying problems due to lack of homogeneity in measurement techniques and sampling strategies. The largest atmospheric science data holdings in the United States are those of the federal government. However, significant amounts of material are available only from state or private sources.
Not all atmospheric data sets are large and conspicuous; many are small. There are hundreds of data sets of only a few megabytes or less. There are also many medium-sized data sets that range from perhaps 100 megabytes to tens of gigabytes, as well as very large data sets, many terabytes in volume. Table 2.2 provides a sampling of some of the larger data sets. Data volume does not drive the cost of archiving small-sized and medium-sized data sets if proper technical choices are made. Rather, it is the labor-intensive process of readying a data set for indefinite preservation that can be costly.
Many atmospheric data sets are dynamic, continually growing or being otherwise modified. Because weather keeps occurring, observational time series from operational meteorological activities are never "complete." In contrast, field programs usually have finite extent, and the resulting data sets have a definite end. However, many recent large, complex field programs have spawned associated monitoring activities that have continued after the initial phases of the project. Despite the frequent usage of the term "experiment" to denote field programs, these intensive efforts are observational, rather than experimental, exercises. Some truly experimental data exist, including a few data sets that include the results from such work as sensor development and tests, fluid dynamics experiments, thermodynamic measurements, and laboratory chemical studies. Nevertheless, the vast majority of atmospheric science data describe observations of ever-changing phenomena, and thus they are unique, valuable, and irreplaceable.
For much meteorological and climate research, as well as for many applications, it is essential to have archives of global data. This goal has been largely achieved in the United States, although older data sets still need to be digitized. Collectively, U.S. archives have the best sets of global data of any nation, particularly for data since the early 1950s. However, many valuable data stored in other nations are inaccessible to U.S. scientists (and in some cases are inaccessible to those nations' scientists as well).
Meteorological and other atmospheric data are used for varying purposes on different time scales. It is convenient to delineate three: (1) real-time or current, (2) recent past or short-term retrospective, and (3) distant past or retrospective. Meteorological data are probably used by a wider segment of the U.S. population than the data of any other scientific discipline, because they relate directly to practical, daily concerns. There is a large lay audience for weather and climate information.
The real-time or current use of most data sets usually motivates decisions on collection strategies and therefore quality. For example, the primary reason for collecting most meteorological data is for operational weather forecasting and warning, including forecasting for aviation operations. These data are perishable, and timeliness and spatial resolution are more important than absolute accuracy and continuity.
There are many recent past or short-term retrospective uses of meteorological data that can be of great significance. In this context, short term typically means from yesterday to a few weeks, or occasionally a few months, ago. A good example of such usage of data is in monitoring the development of a drought, a significant function for predicting crop yields. The transportation industry uses past data for verification of weather conditions for delay claims.
Most retrospective uses require data ranging from several months old through the traditional (though now suspect) 30-year averaging periods used for climate normals. The National Climatic Data Center handles over 100,000 data requests per year. The state climatologists and regional climate centers together process about as many. Legal proceedings and insurance claims often require accurate meteorological records for corroboration of witness testimony, criminal investigations, and validations of weather claims related to accidents and property damage. Farmers and agronomists need data covering months to years for studies of pesticide residue and toxicology, decisions about pesticide spraying, planning of fertilizer usage, and crop selection. Architects and building engineers require site-specific data on heating and cooling needs, wind stresses, snow loads, and solar availability. Airport designers need prevailing wind patterns. Utility planners need aggregate heating and cooling loads for their areas.
Long-term retrospective uses of atmospheric data are the primary concern in this study. These uses are highly diverse and difficult to predict, and they make great demands on the data and their associated metadata.
TABLE 2.2 Volume of Selected Data Sets in Atmospheric Sciences

Type of Data Set | Comments | Dates | Years | Volume

Atmospheric In Situ Observations
World upper air | Two times per day, 1,000 stations | 1962-1993 | 32 | 25 GB
World land surface | Every 3 hours, 7,500 stations | 1967-1993 | 27 | 60 GB
World ocean surface | Every 3 hours (40,000 observations per day) | 1854-1993 | 139 | 15 GB
World observations during First GARP Global Experiment | Surface and aloft, but not satellite | 1978-1979 | 1 | 10 GB
U.S. surface | Daily, now 9,000 stations | 1900-1993 | 94 | 15 GB

Selected Analyses (mostly global)
Main National Meteorological Center analyses | Two times per day, increasing at 4 GB/year | 1945-1993 | 48 | 50 GB
National Meteorological Center advanced analyses | Four times per day, increasing at 19 GB/year | 1990-1993 | 4 | 58 GB
National Center for Atmospheric Research's ocean observations and analyses | Thirty-eight data sets | | | 8 GB
European Center for Medium Range Weather Forecasting advanced analyses | Four times per day, increasing at 8 GB/year | 1985-1993 | 9 | 76 GB

Selected Satellites
NOAA geostationary satellites | Half-hour, visible and infrared | 1978-1993 | 16 | 130 TB
NOAA polar orbiting satellites | | 1978-1993 | 15 |
Sounders (TIROS Operational Vertical Sounder) | | | 15 | 720 GB
Advanced Very High Resolution Radiometer (4-km coverage, 5 channel) | | | 15 | 5 TB
NASA Earth Observing Satellite-AM | In development, 88 TB/year, level-1 data | 1998- | |

U.S. Radar Data
Domains of 30 to 60 km | | 1973-1991 | 19 | 1 GB
Next Generation Radar System (NEXRAD) (a) | 650 GB per radar each year, 104 TB/year for 160-site system | 1997- | | 100s TB

Notes: Many other atmospheric data sets have volumes of only 1 to 500 MB.
1 MB (megabyte) = 10^6 bytes; 1 GB (gigabyte) = 10^9 bytes; 1 TB (terabyte) = 10^12 bytes.
(a) First radars were deployed in 1993.
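The volume figures in Table 2.2 can be cross-checked with simple arithmetic using the unit definitions in the table notes. The sketch below confirms the NEXRAD system total and projects one of the growing analysis archives forward; the linear projection is an assumption made for illustration (the table gives only a current volume and an annual increase), and the function names are hypothetical.

```python
GB = 10**9   # 1 gigabyte, as defined in the table notes
TB = 10**12  # 1 terabyte, as defined in the table notes


def system_volume_per_year(per_site_bytes: int, sites: int) -> int:
    """Annual volume for a network of identical sites, in bytes."""
    return per_site_bytes * sites


def projected_volume_gb(current_gb: float, growth_gb_per_year: float, years: int) -> float:
    """Linear projection; assumes the quoted annual increase stays constant."""
    return current_gb + growth_gb_per_year * years


# NEXRAD: 650 GB per radar each year across a 160-site system.
nexrad_bytes = system_volume_per_year(650 * GB, 160)
print(nexrad_bytes / TB)  # 104.0

# European Center advanced analyses: 76 GB in 1993, growing at 8 GB/year.
print(projected_volume_gb(76, 8, 2000 - 1993))  # 132
```

The first result reproduces the 104 TB/year quoted in the table; the second is only as good as the constant-growth assumption.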
Most of the uses discussed above do not need data covering more than a few decades. Several of these applications, however, require the longest time series we can provide.
When technology advances and alters the method of data collection, there is a strong impetus to scrap the data collected by "obsolete" technology. However, these old data may become critical in the future. A notable example involves upper air wind profiles. These were originally collected by kites and later by radiosondes carried on balloons. With the onset of the space program, there was an urgent need for detailed low-altitude wind data for analysis of stresses on rockets at launch. Appropriate data could not be obtained from radiosondes, because of their high ascent rate, but older kite-based data, which had been scheduled for disposal, were available. Fortunately, they had not yet been destroyed when they were again needed.
There have been dramatic retrospective uses for military purposes (e.g., Jacobs, 1947). Planning for the D-day invasion of France, bombing runs over Japan, and the recent desert war in Iraq all required detailed climatic information, some long thought useless but not yet discarded. Such unexpected uses require the retention of many types of data from many places for a long time. Since the first flights of meteorological satellites in 1959, we already have had several examples of important retrospective uses of satellite data sets. For instance, a combination of reprocessed Nimbus-7 satellite data and old data from the Dobson network helped to confirm the recurring seasonal loss of stratospheric ozone over the Antarctic in the early 1980s.
If meteorologists are to study past weather events, such as severe hurricanes, damaging winter storms, or outbreaks of tornadoes, they must have at their disposal all data for the periods of time and geographical areas involved. Hurricane track records spanning more than a century are still regularly used for both research and operational purposes.
An increasingly significant use of meteorological data is the monitoring of the climate of the planet. Although barely two decades ago the study of climate was not a very high priority, today climate research issues are prominent; some of the nation's leading scientists specialize in climate studies, and policymakers seek information on likely climatic conditions of the future. The importance of old atmospheric data has become clear, but the reanalysis of these old data in the search for trends has often found them inadequate and poorly documented. The growing interest in global climate change and the difficulties with historical data that it helped uncover have strongly motivated earth scientists to take a serious interest in the long-term preservation of atmospheric data. Similarly, studies of long-term water and land usage require time series of many decades, or more. Such data needs also apply to planning aquifer usage and studies on deforestation and desertification.
Some historians examine connections between environmental conditions and human events. The time scales studied can range from the immediate, such as the influence of weather on battles, to the very long term, such as the rise or decline of a civilization affected by water availability. Workers in this field often search through the oldest existing data and have even provided meteorological information to atmospheric scientists from unconventional sources such as diaries and agricultural records.
Contemporary arrangements for the storage and archiving of atmospheric data are diverse and complex, and they present many problems. Some of these arrangements could be improved. Atmospheric data are in many locations, and they have a broad range of life cycles. Difficult problems arise in preparing metadata, packaging data for extended archiving, motivating researchers to prepare their data for use by others, and simply dealing with the large size of some of the atmospheric data sets. Criteria for identifying data sets to save indefinitely are not necessarily obvious. Finally, any proposed solutions must be made in full recognition of their impact on budgets and other resources.
Geoscience Data
Spatially, the domain covered by the geosciences extends from the Earth's core to the surface and into space. Temporally, it covers broad trends from the remote origins of the Earth to possible future scenarios, but it also is concerned with rapidly varying, often short-lived phenomena. Data in the geosciences fall into two broad categories. One is the observation and description of unique events, such as earthquakes, volcanic eruptions, and floods. In most cases, such data need to be archived for a long time period, regardless of their quality. The other category consists of observations of quantities continuous in space and time, such as gravity and the Earth's magnetism and structure, seismic sampling, and groundwater distribution.
The volume of geoscience data obtained with public funding has increased dramatically over the past few decades. This increase is the result of several converging factors, including the extremely varied types of observational data collected by the scientific community; the large volumes available through better measurement techniques, more sophisticated instrumentation, and advancing computer technology; and increasing demand from not only the scientific community but also the general public, including engineers, lawyers, and statisticians. Nongovernmental and commercial institutions also are major collectors and sources of pertinent data.
Two examples, the Landsat database and the nation's holdings of seismic data, illustrate many of the characteristics and issues inherent in the long-term archiving of geoscience data. Other examples are provided in the working paper of the Geoscience Data Panel (NRC, 1995).
The Landsat database consists of multispectral images of the Earth's surface, which have been accumulating since the launch of Landsat 1 in July 1972. The archive includes digital tapes of multispectral image data in several formats, black-and-white film, and false-color composites of synoptic views of the Earth's surface, all from 700 km in space. This database thus constitutes an important record of the evolving characteristics of the Earth's land surface, including that of the United States, its territories, and possessions. The record documents not only the results of various federal government policies and programs, but also those of many state and local governments and private programs and activities. It further provides documentation of the impact of various large-scale episodic events, such as floods, storms, and volcanic eruptions, and is of great value to both current and future public and private activities.
Landsat data are currently available in either image or digital form from the Earth Resources Observing System (EROS) Data Center in Sioux Falls, South Dakota. The Landsat satellites were originally under the control of NASA. However, in 1980 they became the responsibility of NOAA. The currently operational Landsat 4 and 5 spacecraft were placed under control of the EOSAT Company in 1985. Under EOSAT's control, the data are not in the public domain, are significantly more expensive, and carry proprietary restrictions on their use. Beginning with the launch of Landsat 7, responsibility for the Landsat system will pass back to NASA, which will build and launch the satellite in the late 1990s. NASA will operate the systems and deliver the data to the EROS Data Center for distribution. The data will once again be in the public domain, although the EROS Data Center still plans to charge more than the marginal cost of reproduction in fulfilling user requests. It is now widely recognized that the shift to private control of the Landsat system significantly reduced the access to and use of the data.
As of January 1993 the Landsat database contained more than 100,000 tapes of varying density and formats, and over 2,850,000 frames of hard copy imagery. Digital Landsat data are usually delivered to users as magnetic tapes. Other media, such as CD-ROMs and streaming tapes, also may soon be used. Data requests occur most frequently in reference to a particular geographic location, commonly expressed as latitude and longitude, for a particular time of the year, and meeting certain cloud cover limitations.
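The typical request described above (a location, a time of year, and a cloud cover limit) maps naturally onto a filter over scene metadata. The sketch below is illustrative only: the Scene record, its field names, and the sample catalog are hypothetical and do not represent the EROS Data Center's actual catalog or ordering system. It also assumes scene footprints are simple latitude/longitude boxes that do not cross the 180-degree meridian.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Scene:
    """Hypothetical metadata record for one archived image."""
    scene_id: str
    south: float            # southern edge of footprint, degrees latitude
    north: float            # northern edge of footprint, degrees latitude
    west: float             # western edge of footprint, degrees longitude
    east: float             # eastern edge of footprint, degrees longitude
    acquired: date
    cloud_cover_pct: float


def find_scenes(scenes: Iterable[Scene], lat: float, lon: float,
                months: Set[int], max_cloud_pct: float) -> List[Scene]:
    """Select scenes covering a point, in the given months, under a cloud limit."""
    return [
        s for s in scenes
        if s.south <= lat <= s.north
        and s.west <= lon <= s.east
        and s.acquired.month in months
        and s.cloud_cover_pct <= max_cloud_pct
    ]


catalog = [
    Scene("A", 43.0, 45.0, -98.0, -96.0, date(1975, 7, 14), 10.0),
    Scene("B", 43.0, 45.0, -98.0, -96.0, date(1985, 1, 3), 5.0),
    Scene("C", 43.0, 45.0, -98.0, -96.0, date(1992, 8, 2), 60.0),
    Scene("D", 30.0, 32.0, -91.0, -89.0, date(1992, 7, 20), 0.0),
]
summer = find_scenes(catalog, lat=43.5, lon=-96.7, months={6, 7, 8}, max_cloud_pct=20.0)
print([s.scene_id for s in summer])  # ['A']
```

Filtering on month rather than on a full date range reflects the "time of the year" wording: a user studying slow change wants comparable seasons across many years.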
Landsat data are used widely across the spectrum of geoscience applications in both civilian and military operations and research. These include such applications as the impact of human activities on the environment, land-use planning and resource-allocation decisions, disaster assessment, measurement and assessment of renewable and nonrenewable resources, and many others. They are used also by the general public in any context where views of the Earth's surface are needed. Examples include such diverse applications as visual aids in elementary and secondary education, background for highway maps, and illustrations for magazine articles about various regions of the world.
The Landsat database is unique because data from any given area may be available at sampled instants over a period of more than 20 years, thus making possible for the first time the study of slowly varying phenomena on Earth. Even though data from the early 1970s may now have a low frequency of use, their potential value remains high and they represent a significant archival record.
In contrast to the Landsat database, seismic data are broadly distributed rather than concentrated in one data center or system. This example focuses primarily on seismic data from earthquakes and explosions, both nuclear and chemical. Some federal agencies, notably the U.S. Geological Survey (USGS) and NOAA's National Geophysical Data Center, collect and archive important seismic exploration data. In addition, the Department of Defense (DOD), Department of Energy (DOE), U.S. Nuclear Regulatory Commission (USNRC), USGS, and NOAA have been and continue to be engaged in the collection and archiving of earthquake and explosion data. These agency programs are carried out independently of one another, with the result that each agency has its own data management and archiving policies and practices. Consequently, these data holdings are greatly distributed among the agencies in fundamentally different forms and formats.
Global earthquake data have been acquired systematically since the early 1960s, when the U.S. Coast and Geodetic Survey of the Department of Commerce deployed a global seismic network of about 130 stations called the World-Wide Standardized Seismographic Network (WWSSN) and produced an archive of photographic film "chips" of the 24-hour/day recordings at all stations. Researchers and other users could obtain copies of these analog data at modest cost. The success of this precursor to today's global digital network cannot be overestimated, because the availability of a global data set in standard format from well-calibrated instruments permitted previously impossible studies of global seismicity patterns, earthquake source mechanisms, and the Earth's structure. These studies have led to a vastly improved understanding of the dynamics of the Earth as a whole, including tectonic plate movements, generation of new ocean floor, evolution of the Earth's crust, and occurrences of destructive earthquakes and volcanic eruptions.
The USNRC has funded the operation of regional seismic networks over much of the United States, some since the early 1970s, in support of programs for the siting and safety of nuclear power plants. USGS also has co-funded or separately funded regional networks for earthquake hazard assessments in seismogenic areas of the United States. However, changes in the funding priorities of USGS and USNRC in recent years have resulted in the interruption or discontinuation of some of these networks, particularly in the eastern United States. This has adversely affected data flow and seismic research. Seismic data have been archived in a broadly distributed, nonuniform mode by the organizations (mostly universities) that collected the data from the various networks. Many of these data have long-term value for characterizing in detail the tectonic activity of seismogenic areas in the United States.
In addition to the federal agencies, several private sector organizations now collect, distribute, and archive seismic data sets of long-term significance. The Incorporated Research Institutions for Seismology (IRIS), a not-for-profit consortium of universities and private research organizations, is engaged in a major development of a global digital seismic network of about 100 continuously recording stations (the Global Seismic Network) in cooperation with USGS. The project also includes a versatile, portable digital seismic array of up to 1,000 stations that can be deployed for various time intervals for special seismological studies. Data sets from the global and portable array are being permanently archived at the IRIS Data Management Center (DMC) in Seattle, Washington. The DMC also serves as the International Federation of Digital Networks' center for continuous digital data, which adds observations from many additional stations to the archive. IRIS funding for this activity comes primarily from NSF and DOD. Finally, individual universities, such as the California Institute of Technology, the University of California at Berkeley, the University of Alaska, the University of Washington, Columbia University, Memphis State University, and St. Louis University, also maintain archives of the seismic data that they collect.
The volume of digital data currently held and anticipated to be acquired by the IRIS DMC is summarized in Table 2.3. Although some data sets have been completed because they are project- or program-specific, most of the current operations continue to add large amounts of new data and implement new technology for recording, storage, retrieval, and distribution, thereby creating a dynamic, highly distributed archive whose holdings and access protocols change with time. For example, the IRIS DMC recently began providing both archived and near-real-time data on the Internet, thereby greatly facilitating rapid access.

TABLE 2.3 Summary of Actual and Projected Data Volumes Archived in the IRIS Data Management Center

                 Number of      Projected Data Volumes (gigabytes/year)
                 Instruments    1994     1995     1996     1997     1998     1999
GSN                  100        1,159    2,359    3,959    6,003    8,047   10,091
FDSN                 146          370      670    1,070    1,530    2,050    2,670
JSP arrays             5        1,095    2,190    3,650    5,475    7,300    9,125
OSN                   30            0        0       15       58      218      498
PASSCAL-BB           500        1,318    2,277    3,556    5,154    7,073    9,312
PASSCAL-RR           500          542      885    1,341    1,912    2,597    3,397
Regional-Trig        500          150      290      490      730    1,030    1,390
Total              1,781        4,634    8,671   14,081   20,862   28,315   36,483

Note: Abbreviations are as follows:
GSN              Global Seismic Network (IRIS)
FDSN             Federation of Digital Seismic Networks
JSP              Joint Seismic Program (with the former Soviet Union) (IRIS)
OSN              Ocean Seismic Network
PASSCAL-BB       Program for Array Studies of the Continental Lithosphere Broadband (IRIS)
PASSCAL-RR       Program for Array Studies of the Continental Lithosphere Regional Recordings (IRIS)
Regional-Trig    Regional Triggered Recordings
a Projected numbers by year 2000.
Source: IRIS Data Management Center, private communication, 1994.
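The figures in Table 2.3 can be checked with a few lines of arithmetic. The sketch below re-adds each yearly column and compares it with the published Total row, then computes how much the projected total grows between 1994 and 1999. The values are transcribed from the table; the variable names are illustrative only.

```python
# Yearly data volumes from Table 2.3, in gigabytes, for 1994 through 1999.
YEARS = (1994, 1995, 1996, 1997, 1998, 1999)
VOLUMES_GB = {
    "GSN":           (1159, 2359, 3959, 6003, 8047, 10091),
    "FDSN":          (370, 670, 1070, 1530, 2050, 2670),
    "JSP arrays":    (1095, 2190, 3650, 5475, 7300, 9125),
    "OSN":           (0, 0, 15, 58, 218, 498),
    "PASSCAL-BB":    (1318, 2277, 3556, 5154, 7073, 9312),
    "PASSCAL-RR":    (542, 885, 1341, 1912, 2597, 3397),
    "Regional-Trig": (150, 290, 490, 730, 1030, 1390),
}
PUBLISHED_TOTALS = (4634, 8671, 14081, 20862, 28315, 36483)

# Sum each yearly column across all programs.
computed_totals = tuple(sum(column) for column in zip(*VOLUMES_GB.values()))

# Ratio of the 1999 projection to the 1994 figure.
growth_factor = computed_totals[-1] / computed_totals[0]

for year, total in zip(YEARS, computed_totals):
    print(year, total)
print(f"1999 total is {growth_factor:.1f} times the 1994 total")
```

The computed column sums agree with the published Total row, and the projected 1999 figure is roughly eight times the 1994 figure.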
Significant volumes of exploratory seismic data obtained by geophysical contractors are held by the Department of the Interior. These data are used by the federal government and by petroleum companies in preparing for oil and gas exploration activities. There are, however, various proprietary restrictions on access to these data by other users.
In summary, the sources of seismic data are diverse, the archiving is highly distributed, and the data are in many different formats with different metadata structures. Moreover, data sets with long-term scientific and historical value reside in both federal and nongovernmental organizations, although in most of the latter cases federal funds have paid at least in part for their acquisition, archiving, and distribution.
The users of seismic data are many and diverse as well. They include federal and state government agencies, universities, and private industry, particularly the petroleum industry. Thousands of individuals are direct or indirect users of seismic data. Certainly, the public as a whole is an end user of historical seismic data and information, including the location, magnitude, and damage associated with earthquakes around the world.
Most seismic data sets have been or are now used both for operational purposes and for research, although for operational activities the data are used primarily immediately following their collection. Examples of their use for operational activities include tsunami warning and the rapid determination of the magnitude, location, and fault mechanism of destructive earthquakes and their aftershocks, both to inform the public and to assist in emergency response and special monitoring. On a longer time scale the data are used for hazard reduction and seismic safety in seismogenic regions, including local zoning decisions for future development, and siting and safety of critical facilities such as nuclear power plants. Data are obtained and used for continuous global monitoring of earthquake activity and of threshold or comprehensive test bans on underground nuclear explosions. Of course, there also is a broad spectrum of research that uses historical seismic data, including studies of the physics of earthquake and explosive sources, propagation effects on seismic signals, imaging of the Earth's structures at all scales, seismicity patterns, and earthquake prediction or hazard estimation. Older data are important and are commonly used for most of these types of research. For example, establishing the recurrence rate for larger-magnitude earthquakes requires decades to centuries of observations, even in the most seismically active areas.
In conclusion, most of the seismic data have long-term value for scientific research, disaster mitigation, and various socioeconomic uses. The data are archived in a broadly distributed manner. However, only a fraction of the archived data are under the direct control of federal government agencies, and it appears that many of these data sets are not considered official federal records. Except for most commercial exploratory seismic data, federal funds have paid for much of the instrumentation, station operation and maintenance, collection, storage, and distribution of seismic data. These important seismic data sets should be kept indefinitely in a form accessible to both the scientific community and other users.
Ocean Science Data
The oceans and atmosphere are turbulent fluids, constantly changing over many spatial and temporal scales. The numerous types of data that describe the oceans are often unrelated to one another, and even those that are related frequently have nonlinear and poorly understood interactions. For example, temperature data from a specific point and time in the North Atlantic cannot be accurately predicted from data collected in the same place the year before, or even the week before, or from data collected at the same time 1,000 kilometers or even 100 kilometers away, or from salinity data collected at the same place and time. Each datum contributes unique information as long as it is accurate, corresponds to a different physical quantity, is obtained from a different time and place, and cannot be accurately computed from other existing data.
One source of oceanographic data is the field program. Large and small field programs conducted in support of specific research projects are the prime contributors of in situ and in vitro observational data sets for all the ocean disciplines. In situ data sets are those that are derived by processing the measurements from sensors immersed directly into the ocean environment. Processing of in situ data is largely automated, and so the data sets are relatively dense. In vitro data sets are produced by laboratory analyses of samples collected from the ocean environment. These laboratory analyses combine sophisticated measurement equipment with labor- and time-intensive procedures. Therefore, in vitro data are typically sparse. Remotely sensed observations also may be associated with field program data by synchronizing in situ sampling with the use of remote sensing platforms.
The harsh and remote nature of the world ocean environment has inhibited the establishment of a routine data collection system. Although several remote sensing platforms do provide daily monitoring of ocean surface conditions on a global basis, continuous measurement of subsurface conditions with adequate time and space resolution for effective monitoring is not a reality. The lack of continuous and comprehensive oceanographic data may contribute most to the inconsistent data management practices and lack of community-wide standards for data reporting and exchange in the ocean disciplines. Because of the need for daily global prediction, such standards and practices are much more highly developed in the atmospheric community. The establishment of the Global Ocean Observation System presents an opportunity to engage the ocean community in the identification and implementation of appropriate standards.
Like other observational data, oceanographic data extend beyond directly or remotely measured observations of the environment. The data products based on the analyses, interpretations, and presentations of aggregates of observations also must be considered in the design, implementation, and maintenance of any data management and archiving mechanism. The more traditional products, such as parameter grids and output from ocean models, will surely be supplemented from innovative sources likely to emerge from the interactive scientific collaboration and value-added services that are becoming increasingly available through electronic networks.
The principal federal agency ocean data holdings are at the NOAA National Oceanographic Data Center (NODC), the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC) at the Jet Propulsion Laboratory, and at several Navy centers, which hold mostly classified data sets. In addition, significant amounts of data are held by the universities.
Located in Washington, D.C., the NODC archives physical, chemical, and biological oceanographic data collected by other federal agencies, including data collected by principal investigators under grants from the National Science Foundation; state and local government agencies; universities and research institutions; and private industry. The center also obtains foreign data through bilateral exchanges with other nations and through the facilities of World Data Center A for Oceanography, which is operated by the NODC under the auspices of the National Academy of Sciences. The NODC provides a broad range of oceanographic data and information products and services to thousands of users worldwide, and increasingly, these data are being distributed on CD-ROMs and on the Internet. Table 2.4 presents a summary of the NODC's data holdings.
The PO.DAAC is a major federally sponsored oceanographic data center, which is operated by the California Institute of Technology's Jet Propulsion Laboratory in Pasadena, California. As one element of the NASA Earth Observing System Data and Information System, the mission of the PO.DAAC is to archive and distribute data on the physical state of the oceans. Unlike the data at the NODC, most of the data sets at the PO.DAAC are derived from satellite observations. Data products include sea-surface height, surface-wind vector, surface-wind stress vector, surface-wind speed, integrated water vapor, atmospheric liquid water, sea-surface temperature, sea-ice extent and concentration, heat flux, and in situ data that are related to the satellite data. The satellite missions that have produced these data include the NASA Ocean Topography Experiment (TOPEX/Poseidon, done in cooperation with France), Geos-3, Nimbus-7, and Seasat; the NOAA Polar-Orbiting Operational Environmental Satellite series; and the DOD's Geosat and Defense Meteorological Satellite Program.
Summary of Major Issues
The results of scientific research are disseminated in this country through a hybrid system that includes professional society and other not-for-profit publishers, the commercial sector, and the government. The formal journals are published largely by the professional society and commercial sectors, while government agencies manage less formal reports (gray literature). Secondary services, such as abstracting and indexing, provide access to this literature, increasingly by electronic means. While there are strains in this system because of rising costs, increasing workload, and issues related to the protection of intellectual property, it has served U.S. science well and has been an invaluable link in the process of translating scientific advances into further advances, useful technology, and economic benefits.
The current system, however, is not well suited to handle the scientific electronic databases that are the focus of this study. The costs of maintaining these databases are typically too great to be covered by user fees; instead, these databases must be considered part of the national scientific heritage. Some government agencies have accepted responsibility for maintaining and disseminating data resulting from their own research and development. In some cases, this system is working reasonably well, but in others there are problems even with providing current access. Archiving for the long term raises questions in all cases, however.
A general problem common to all scientific disciplines is the low priority attached to data management and preservation. Experience indicates that new experiments tend to get much more attention than the handling of data from old ones, even though the payoff from optimal utilization of existing data may be greater. For instance, according to figures supplied by NOAA, NOAA's budget for its National Data Centers in FY 1980 was $24.6 million, and their total data volume was approximately one terabyte. In FY 1994, the budget was only $22.0 million (not adjusted for inflation), while the volume of their combined data holdings was about 220 terabytes! During this same period, the overall NOAA budget increased from $827.5 million to $1.86 billion.

TABLE 2.4 National Oceanographic Data Center Data Holdings (as of October 1994)

Discipline                                  Volume (megabytes)

Physical/Chemical Data
  Master data files
    Buoy data (wind/waves)                       9,679
    Currents                                     4,290
    Ocean stations                               1,645
    Salinity/temperature/depth                   1,557
    BT temperature profiles                        872
    Sea level                                      125
    Marine chemistry/marine pollutants              89
    Other                                           68
    Subtotal                                    18,325
  Individual data sets, for example
    Geosat data sets                            12,841
    CoastWatch data                             60,000
    Levitus Ocean Atlas 1994 data sets           4,743
    Other (estimated)                           11,000
    Subtotal                                    88,584
  Total Physical/Chemical                      106,909

Marine Biological Data
  Master data files
    Fish/shellfish                                 115
    Benthic organisms                               69
    Intertidal/subtidal organisms                   30
    Plankton                                        32
    Marine mammal sighting/census                   21
    Primary productivity                             7
    Subtotal                                       274
  Individual data sets, for example
    Marine bird data sets                           52
    Marine mammal data sets                          4
    Marine pathology data sets                       4
    Other (estimated)                              200
    Subtotal                                       260
  Total Biological                                 534

Total Data Holdings                            107,443

Source: NOAA, private communication, 1994.
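The NOAA figures quoted above imply a steep drop in funding per unit of data held. The short calculation below works this out from the numbers in the text. All amounts are nominal dollars, as in the source, and the variable names are illustrative.

```python
# Figures quoted in the text (nominal dollars, not adjusted for inflation).
data_center_budget = {1980: 24.6e6, 1994: 22.0e6}   # National Data Centers
holdings_tb = {1980: 1.0, 1994: 220.0}              # approximate terabytes
noaa_budget = {1980: 827.5e6, 1994: 1.86e9}         # overall NOAA budget

# Dollars available per terabyte held in each year.
dollars_per_tb = {
    year: data_center_budget[year] / holdings_tb[year]
    for year in (1980, 1994)
}

# Data center budget as a share of the overall NOAA budget.
share_of_noaa = {
    year: data_center_budget[year] / noaa_budget[year]
    for year in (1980, 1994)
}

drop_factor = dollars_per_tb[1980] / dollars_per_tb[1994]

for year in (1980, 1994):
    print(
        f"FY{year}: ${dollars_per_tb[year]:,.0f} per terabyte, "
        f"{share_of_noaa[year]:.1%} of the NOAA budget"
    )
print(f"Funding per terabyte fell by a factor of about {drop_factor:.0f}")
```

On these figures, funding per terabyte fell from $24.6 million to about $100,000, a factor of roughly 246, while the data centers' share of the NOAA budget fell from about 3.0 percent to about 1.2 percent.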
With regard to laboratory data, government programs have existed since the 1960s to compile results from the world scientific literature, to check the data carefully, and to prepare databases of critically evaluated data. For instance, the National Institute of Standards and Technology operates its Standard Reference Data Program, which covers a broad range of data in physics, chemistry, and materials science. The Department of Energy also supports a number of data centers of this type. Despite chronic underfunding, these programs have produced databases of lasting value to the nation. To cite one example, the Mass Spectral Database managed by the National Institute of Standards and Technology, the National Institutes of Health, and the Environmental Protection Agency contains spectra of over 60,000 compounds. It has been installed in many thousands of mass spectrometers that are being used for monitoring environmental pollution, designing drugs, characterizing new materials, and many other applications. The government investment in creating and maintaining this database has been repaid many times over.
In the area of observational databases, the situation is mixed. Federal agencies collect large amounts of observational data, which in many cases are continuously added to the available record of Earth and space processes. The data sets resulting from these activities sometimes are well documented and maintained in readily accessible form; but in many other cases, they are exceedingly difficult or impossible to access or use, and thus are effectively unavailable. In general, the agencies and other organizations do a good job of making data and information available to the scientists (primary users) during the active stages of projects and for some time afterward. Examples of notable successes include the NASA Planetary Data System, where the premise has been that the data have long-term value and must be accessible indefinitely into the future, and the NOAA National Data Centers, where the policy is to migrate archived data to new media every 10 years.
Technological advances have kept pace with the large growth in data volumes in scientific disciplines such that the long-term retention of all or nearly all of the data collected is feasible. Indeed, in most fields the entire collection of data from the past is not large in comparison with the current and anticipated data volumes that will be collected during only a year or two. However, significant fractions of the older data are difficult or in some cases impossible to access, because they have not been transferred to new storage media. This transfer often has received low priority because many data management and data retention activities are chronically underfunded and just handling the current data flow uses nearly all of the available resources. Thus, many valuable data sets are stored on low-density round tapes or on specialized magnetic tape media requiring hardware that is now obsolete or inoperable. For example, a large volume of the early Landsat coverage of the Earth resides on tapes that cannot be read by any existing hardware. Recent data-rescue efforts have been successful in getting older data into accessible form, but these efforts are time-consuming and costly. The reason these efforts have been undertaken, particularly in the observational sciences, is the recognition that retrospective data are vital to understanding long-term changes in natural phenomena. Given the extraordinarily rapid advances in computing and storage technology in recent years, planned periodic migration of data to new media will be increasingly important in all scientific disciplines to ensure long-term access to our scientific data resources.
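One step in a planned migration can be sketched as a copy followed by an integrity check. The example below is an illustration under stated assumptions, not a description of any agency's procedure: it treats the old and new media as ordinary file paths, and the choice of SHA-256 as the checksum and the function names `sha256_of` and `migrate` are assumptions introduced here.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source, destination):
    """Copy a file to new storage and confirm the copy is bit-identical.

    Returns the checksum so it can be recorded with the archive's
    documentation; raises OSError if the copy does not match.
    """
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"checksum mismatch after copying {source}")
    return after


# Demonstration with temporary files standing in for old and new media.
with tempfile.TemporaryDirectory() as workdir:
    old_copy = Path(workdir) / "old_medium.dat"
    new_copy = Path(workdir) / "new_medium.dat"
    old_copy.write_bytes(b"station,temp_c\nA,12.5\n")
    digest = migrate(old_copy, new_copy)
    migrated_ok = new_copy.read_bytes() == old_copy.read_bytes()
```

Recording the checksum at each migration gives later custodians a way to confirm that the data they hold are the data that were originally archived.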
It is axiomatic that a database has limited utility unless the auxiliary information required to understand and use it correctly (the metadata) is included in the record. An unambiguous description of the storage format is obviously essential for interpretation of an electronic database. The requirement is even more stringent to support meaningful access to data over the long term, because the hardware, software, and even the language by which formats are described will likely be different decades and centuries from now. The same is true regarding the scientific details of the content of the data. Auxiliary information such as environmental conditions (e.g., temperature and pressure), method of calibrating the instruments, and data analysis techniques must be given to be able to fully and correctly use the data. Providing this information is time consuming and costly if done retrospectively, but much less so if it is prepared at the time the data are collected. Documentation that is inadequate for understanding and using the data greatly diminishes the value of the data, particularly for secondary and tertiary users.
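The kinds of auxiliary information named above can be captured in a simple record that is filled in when the data are collected and checked for gaps before archiving. The sketch below is hypothetical: the `DatasetMetadata` class and its field names are introduced here for illustration and do not correspond to any existing metadata standard.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetMetadata:
    """Hypothetical minimum documentation to accompany a data set."""
    storage_format: str = ""            # unambiguous description of the layout
    environmental_conditions: str = ""  # e.g., temperature and pressure
    calibration_method: str = ""        # how the instruments were calibrated
    analysis_techniques: str = ""       # processing applied to the raw readings


def missing_fields(metadata):
    """Return the names of fields left empty or containing only whitespace."""
    return [
        f.name for f in fields(metadata)
        if not getattr(metadata, f.name).strip()
    ]


# Example record with two items still undocumented.
record = DatasetMetadata(
    storage_format="ASCII, one reading per line: time (UTC), value",
    calibration_method="two-point calibration before deployment",
)
gaps = missing_fields(record)
print("Undocumented items:", ", ".join(gaps) or "none")
```

Running such a check at collection time, rather than years later, reflects the point made above that documentation is far cheaper to prepare when the data are first gathered.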
Another major problem inhibiting access to data is the lack of directories that describe what data sets exist, where they are located, and how users can access them. This, too, is especially a problem for potential secondary and tertiary users. In many cases the existence of the data is unknown outside the primary user groups, and even if known, there frequently is not enough information for a potential user to assess their relevance and usefulness. This realization has resulted in an interagency effort, led by NASA, to build a Master Directory of Global Change Data and Information. This Master Directory is intended to inform users of where data sets of potential interest reside and how to access them. Similar directories are needed in other scientific disciplines, as well as across all disciplines. The lack of adequate directories adversely affects the exploitation of our national data resources and commonly leads to unnecessary duplication of effort.
A significant fraction of the archived scientific data is held by the federal agencies that collected the data as part of their mission. However, a large amount of valuable scientific data gathered with federal funds is never archived or made accessible to anyone other than the original investigators, many of whom are not government employees. In many instances, the organizations and individuals that receive government contracts or grants for scientific investigations are under no obligation to retain the data collected, or to place them in a publicly accessible archive at the conclusion of the project. At best, scientists in the same field may be able to obtain desired data sets on an ad hoc basis by contacting the original investigators directly; secondary and tertiary users typically are unaware of the existence of the data and have no mechanism (other than personal contact) to access the data. Thus, data sets that commonly are gathered at great expense and effort are not broadly available and ultimately may be lost, squandering valuable scientific resources and much of the public investment spent in acquiring them. Clearly, there is a great need for the agencies to get more return on their investment in science by the simple expedient of making the data collected under their auspices accessible to others.
As seen from the discussion in earlier sections and addressed in detail in the individual discipline panel reports (NRC, 1995), there is a large and diverse collection of scientific data and information extant in federal agencies and nonfederal organizations, including state and local agencies, universities, not-for-profit institutions, and the private sector. At a minimum, those data that are acquired with the support of federal funding should be regarded as part of the National Scientific Information Resource.
Finally, NARA's holdings of scientific and technical data in electronic or any other form are very small in comparison to the data holdings of these other organizations. Moreover, NARA's budget for its Center for Electronic Records, which has formal responsibility for archiving all types of federal electronic records, was only $2.5 million in FY 1994, a budget lower than that of many of the individual agency data centers reviewed by the committee in this study. Given NARA's current and projected level of effort for archiving electronic scientific data, it is obvious that NARA will be unable to take custody of the vast majority of the scientific data sets that require archiving. Therefore, a coordinated effort involving NARA, other federal agencies, certain nonfederal entities, and the scientific community is needed to preserve the most valuable data and ensure that they will remain available in usable form indefinitely. The challenge is to develop data management and archiving infrastructure and procedures that can handle the rapid increases in the volumes of scientific data, and at the same time maintain older archived data in an easily accessible, usable form. An important part of this challenge is to persuade policymakers that scientific data and information are indeed a precious national resource that should be preserved and used broadly to advance science and to benefit society.
3 Retention Criteria and the Appraisal Process
The National Archives and Records Administration appraises and retains records on the basis of their informational and evidential value. It is concerned with records of long-term value, that is, those records that will probably have value long after they cease to have immediate, or primary, uses. Although scientific databases can provide evidence of the research conducted by an agency, their value is primarily informational; it is based on the content of the records rather than on their description of activities by the agency that collected or created them.
Special problems arise in appraising scientific data for their long-term value, particularly beyond the community of research scientists working in the specific field to which the measurements refer. Scientific data are voluminous, constantly increasing, and often difficult for those in other fields to use in their original formats. The data typically are expensive to collect, provide baselines for future observations, enhance understanding of other data, and are of immense importance for advancing scientific knowledge and for educating new scientists. The data also are important to an understanding of the world in which we live; the data (or the conclusions drawn from them) may be important to economists, historians, statisticians, politicians, and the general public. At the same time, it is difficult to predict the full value of the data to researchers and other users decades or centuries from now, although past experience has shown that scientific data collected many years ago provide unique contributions to new understanding of our physical universe.
Retention Criteria
The criteria that follow are to be used during the appraisal process to determine retention of physical science data. They should be applied by those responsible for stewardship to all physical science data, whether created by small individual projects or in the course of large-scale research programs. Similar criteria and guidelines must be developed for data in other disciplines. This is a topic of primary concern not only to NARA, NOAA, and NASA, but to all scientists, data managers, and archivists who work with such records, and was provided in the charge to the committee as a central issue. Although the committee found that many retention criteria apply to both the observational and the laboratory sciences, significant differences are noted below. The metadata requirements, which tend to be either poorly understood or ignored, are given particular emphasis. Additional details and distinctions are discussed in the working papers of the discipline panels (NRC, 1995).
Criteria Common to Both Observational and Laboratory Sciences
Uniqueness of data. Do other authenticated copies of the data under consideration already exist in an accessible repository that meets NARA standards of permanence and security? If so, are they adequately backed up? If the answers are yes, the data set need not necessarily be retained.
Accessibility: adequacy of documentation. Though we might wish that all data sets were of high quality and accompanied by detailed metadata, that is not always the case. At a minimum, the metadata should be sufficient for a scientist working in the discipline to make use of the data set. If documentation is lacking or is so poor that a data set is not likely to be of value to someone interested in data of that type, or the data are more likely to mislead than to inform, that data set should have a low priority for archiving, or perhaps should not be archived even if resources are available. Nevertheless, the committee does not believe that many data sets should be purged because they lack sufficient documentation. The vast majority of data sets now meet minimum standards of documentation, which means that a skilled user either is given sufficient information or can figure it out. Adequacy of documentation is thus but one criterion to consider in the appraisal of data for long-term retention. Metadata requirements are discussed in greater detail below.
Accessibility: availability of hardware. Is the hardware needed to access the data obsolescent, inoperable, or otherwise unavailable? If so, the data are not usable. Decisions on whether to keep such data should be based on the feasibility of building or acquiring the necessary hardware, the usability of the data if they were accessible, and the nature of the data set, if known. To avoid this situation, migration of data to current storage media should be part of the normal routine to maintain the archive.
Cost of replacement. Could the data be reacquired if a future national need for the data were to arise? If so, would reacquisition of the data be more costly than their preservation? For the observational sciences, the answer is almost always that the data cannot be reacquired. The exception is with a data set in a discipline in which the changes of nature are so slow that the data could be recaptured at another time. For example, data on the fossil record of evolution contained in stratigraphic rock units could be reacquired. The laboratory sciences generate data that can, in principle, be reacquired. The question is whether the data can be reproduced at an acceptable cost. Data sets in the laboratory sciences that are candidates for long-term preservation can be classified into three generic types: (1) massive records and data from an original experiment, particularly a costly "mega-experiment," that there is no realistic chance of replicating (e.g., data obtained from expensive facilities such as plasma fusion devices, or data of interest in physics and chemistry derived from special events such as nuclear tests); (2) unique, perhaps sample-dependent or environment-dependent, engineering data, many of which never reach the published literature; and (3) critically evaluated compilations of data from a large number of original sources, together with the backup data and documentation on selection of recommended values, that represent tremendous accumulated effort.
Peer review. Has the data set undergone a formal peer review to certify its integrity and completeness, or is there documented evidence of use of the data set in publications in peer-reviewed journals? Have expert users provided evidence that this data set is as described in the documentation? Formal review of data sets is not now common. It should be encouraged, however, especially in the observational sciences. A good model is the peer review system for NASA's Planetary Data System. In the laboratory sciences, the critically evaluated compilations of data referred to in Chapter 2 have undergone extensive peer review.
Differences Between the Observational and the Laboratory Sciences
Data derived from laboratory experiments, such as the hardness of steel produced in a particular melt, differ from data based on observations of transient natural phenomena, such as the records of the 1993 midwestern floods. Thus, they stimulate different questions related to data preservation issues. As has already been noted, one difference arises from the fact that transient natural phenomena are not reproducible; the fact that the resulting observational data are "snapshots in time" sometimes means that the data have historical or evidential value in addition to their informational value. Observational data sets that provide a continuous time-series record of the physical universe, or of human impact upon it, are important to future generations for comparison and the identification of trends. In addition, many observational data sets represent major engineering or worker-intensive collection activities that warrant documentation and could not feasibly be carried out again.
Experimenters have good reason to believe that if and when their data are recreated in the future, instruments will be better. In many experiments, raw data (e.g., the initial sensor readings before any transformations, conversions, averaging, or corrections are made) may exist only for a fleeting instant before they are discarded or further processed. Even when raw (level-0) data are acquired and saved, principal investigators frequently fail to provide appropriate documentation because they do not expect anyone else to use these data. Instead, the processed data sets are more likely to have adequate metadata and meet the committee's other criteria for retention.
Quite the opposite situation seems to prevail for the observational sciences, where many secondary scientific users feel they need to be able to get back to the level-0 data and are becoming more active in demanding that the collectors of the data provide adequate metadata.
Special Issues in the Retention of Observational Data
All observational data that are nonredundant, reliable, and usable by most primary users should be permanently maintained. This judgment is based on the committee's belief that advancing technologies and better data management practices make it possible to stay ahead of the growing data volumes, as discussed in Chapter 4. It also is likely that it will be more expensive to reappraise data sets than simply to keep them. If the committee is wrong on these two counts, it may be possible that the volume of the data can be reduced through sampling techniques and through intelligent selection of the data sets of highest priority, as explained below.
Data sampling issues arise in measurement systems and in considering archival strategies to provide ready user access. Even before a data manager faces archiving decisions, many sampling rate decisions already have been made. For example, in the atmospheric sciences, we could easily sample temperature sensors and wind gauges 100 times per minute, but that frequency is unnecessary for nearly all uses. In general, it is necessary to keep only data properly sampled in time and space; that is, the sampling interval must be such that the most-rapidly-varying component is not aliased. At least two samples per cycle are required according to the Sampling Theorem. Thus reduction of oversampled data to the minimum sampling rate needed, coupled with lossless data compression, can significantly reduce data volumes with no loss of scientific content. However, if the phenomena of interest are slowly varying, then more rapid fluctuations, which might have value for other purposes, can be filtered out and the data reduced to retain the desired data unaliased; this technique can further reduce the data volume at the expense of losing higher-frequency data. The archiving of only "representative" subsets of our largest data sets is often suggested, but the notion raises difficult issues in statistics, data management philosophy, and budgeting. In concept, there may be acceptable procedures for the long-term archiving of representative subsets of large data sets, but no effective methodology exists today to choose those that would satisfy the needs of future users.
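The sampling argument above can be made concrete with a small sketch. It assumes a sensor sampled 100 times per minute whose fastest component of interest varies at one cycle per minute, applies the two-samples-per-cycle rule to find the largest allowable reduction, and then compresses the reduced record losslessly. The block average used here is a crude stand-in for a proper low-pass (anti-aliasing) filter, and the signal, rates, and function names are all illustrative assumptions.

```python
import math
import struct
import zlib


def min_sampling_rate(f_max):
    """Minimum rate under the two-samples-per-cycle rule (same units as f_max)."""
    return 2.0 * f_max


def max_decimation_factor(current_rate, f_max):
    """Largest integer reduction that still meets the minimum sampling rate."""
    return max(1, math.floor(current_rate / min_sampling_rate(f_max)))


def block_average(samples, factor):
    """Average consecutive blocks of `factor` samples.

    This is a crude low-pass filter plus decimation; a production system
    would use a properly designed anti-aliasing filter instead.
    """
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


# Assumed case: 100 samples per minute, fastest component 1 cycle per minute.
rate_per_min = 100
factor = max_decimation_factor(rate_per_min, f_max=1.0)

# Twenty minutes of a slowly varying signal (10-minute period), oversampled.
raw = [
    math.sin(2 * math.pi * n / (10 * rate_per_min))
    for n in range(20 * rate_per_min)
]
reduced = block_average(raw, factor)

# Lossless compression: the reduced record is recovered exactly.
packed = struct.pack(f"<{len(reduced)}d", *reduced)
compressed = zlib.compress(packed)
restored = list(struct.unpack(f"<{len(reduced)}d", zlib.decompress(compressed)))

print(f"decimation factor: {factor}")
print(f"samples kept: {len(reduced)} of {len(raw)}")
```

In this assumed case the record shrinks by a factor of 50 before compression, and the compression step changes nothing in the stored values. The reduction itself is lossy with respect to any fluctuations faster than the chosen limit, which is the trade-off the paragraph above describes.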
An example of the approach to deciding which observational data sets to retain comes from the atmospheric sciences. In this field the value of a data set as part of a long time series is an important criterion for archiving decisions. The temperature record for a given year from a station operating over a century is much more valuable than a similar record from a nearby station with a shorter lifetime. Studies of climate change and other types of environmental change find long time series to be essential. For example, confirmation of the seasonal stratospheric ozone depletion over the Antarctic in the 1980s required reference back to the Dobson column ozone data from the first half of this century for comparative purposes. The U.S. Historical Climate Network data are a high priority for archiving because they represent a long time series of high-quality data, with excellent metadata; this combination of attributes of data of a common type makes the overall data set exceptionally valuable.
Metadata Issues

The committee has arrived at several related conclusions concerning the importance of documentation, or metadata, to the effective archiving of scientific data. These include the following:

Effective archiving needs to begin whenever a decision to collect data is made.

Originators of data should prepare them initially so they can be archived or passed on without significant additional processing.

The greatest barrier to contemporary and future use of scientific data by other researchers, policymakers, educators, and the general public is lack of adequate documentation.

A data set without metadata, or with metadata that do not support effective access and assessment of data lineage and quality, has little long-term use.

For data sets of modest volume, the major problem is completeness of the metadata, rather than archiving cost, longevity of media, or maintenance of data holdings.

Lack of effective policies, procedures, and technical infrastructure rather than technology is the primary constraint in establishing an effective metadata mechanism.

This suite of conclusions led the committee to recommend that "adequacy of documentation" be a critical evaluation criterion for data set retention. The following discussion illuminates the multiple perspectives of metadata, the essence of the problem, and important elements of any metadata solution.
Perspectives on Metadata

The term metadata often is used to denote "data about data," that is, the auxiliary information needed to use the actual data in a database properly and to avoid possible misinterpretation of those data. The term is used in many scientific disciplines, but not always with precisely the same meaning. Some comments on different types of metadata may be helpful.

The most basic class of metadata comprises the information that is essential to any use of the data. An obvious example is the units in which physical quantities are expressed. If units are not specified, the numbers are ambiguous; at best, the user must attempt to deduce the units by comparison with other data sources. In dealing with observational data, the coordinates and the coordinate system (spatial and temporal) obviously must be specified. Laboratory data are often sensitive functions of some environmental condition such as temperature or pressure. For example, the boiling point of a liquid varies with pressure, so that a boiling point value has no meaning unless the pressure is specified. Although this is well known, many mistakes occur when a user assumes a value taken from a compilation to be a boiling point at normal atmospheric pressure, while it actually refers to a reduced pressure.
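The boiling point example suggests one way a data record can make its essential metadata impossible to omit. The following is a minimal sketch, not a prescribed format: the class name, field names, and accepted unit labels are all hypothetical, and the numerical values are rough illustrative figures rather than reference data.

```python
from dataclasses import dataclass

NORMAL_PRESSURE_KPA = 101.325  # one standard atmosphere


@dataclass(frozen=True)
class BoilingPoint:
    """A boiling point that cannot be stored without its unit and pressure."""

    substance: str
    value: float
    unit: str            # temperature unit, stated explicitly
    pressure_kpa: float  # pressure at which the value applies

    def __post_init__(self) -> None:
        # Reject records whose essential metadata are missing or unusable.
        if self.unit not in {"K", "degC"}:
            raise ValueError(f"missing or unrecognized temperature unit: {self.unit!r}")
        if self.pressure_kpa <= 0:
            raise ValueError("pressure must be stated and positive")

    def at_normal_pressure(self, tolerance_kpa: float = 0.5) -> bool:
        """True only if the value refers to normal atmospheric pressure."""
        return abs(self.pressure_kpa - NORMAL_PRESSURE_KPA) <= tolerance_kpa


# Illustrative values only: the same substance at normal and reduced pressure.
normal = BoilingPoint("water", 100.0, "degC", 101.325)
reduced = BoilingPoint("water", 90.0, "degC", 70.0)
```

A user of such a record has to ask whether `at_normal_pressure()` holds before treating the number as a normal boiling point, which is exactly the assumption the text warns against making silently.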
A significant problem in planning a long-term data archive is simple carelessness on the part of the creators and custodians of the data. Current practitioners in a scientific field may implicitly understand what the units or environmental conditions are. Shortcuts are taken by the authors that cause no problem in communicating with their contemporary colleagues (although they may be confusing to those in a different discipline), but practices and language can change over a generation or two. For a long-term archive, even the most obvious metadata should be specified in detail.
Beyond this basic type of metadata, there is auxiliary information that is not needed by the majority of users (present or future), but is of interest to a few specialists. Included here are the parameters that have only a slight influence on the data in question, so that most users do not need to know about them. For example, the typical user of a database of atomic spectra is concerned only with the wavelength and a rough value of the intensity of each spectral line. However, a few users who are trying to extract further information from the data may want to know the conditions under which the spectrum was recorded, such as the current density, type of electrode, and gas pressure. Referring to the JANAF Thermochemical Tables, which are discussed in the Physics, Chemistry, and Materials Sciences Data Panel report (NRC, 1995), most users are perfectly content with the values given (along with the confidence that the compilers did a good job of selecting the most reliable values). A minority of users, however, will want more details on how the data were analyzed, such as whether the heat capacity values were fitted to a fifth-degree polynomial or a cubic spline, and so forth.
Perhaps the most pervasive form of metadata is the accuracy of the values. To a purist, no number has meaning unless it is accompanied by an estimate of uncertainty. Specifying the uncertainty of each data point increases the size and complexity of the database, but sometimes may be necessary. At a minimum, the metadata should include general comments on the maximum expected errors, even if a quantitative measure such as standard deviation cannot be given.

Finally, the term metadata is sometimes understood to encompass the full documentation necessary to trace the pedigree of the database. For laboratory data, this includes citations to all the primary research papers relevant to the database. A critical evaluation of especially important quantities (such as the fundamental physical constants or key thermodynamic values) may end up with only a few hundred data points, but include massive documentation and citations to a hundred years of literature. In such cases the metadata occupy far more space than the data themselves.
From this discussion, it is evident that metadata can span the range from a few simple statements about the data to very extensive (and expensive) documentation. It is difficult to give general guidelines on the amount of metadata needed; each case must be considered in the context of how future users may use the data and what auxiliary information they will need. Some guidance may be obtained from formal efforts to set metadata standards for experimenters to follow in preserving their data. In chemistry, for example, many organizations have developed detailed recommendations on reporting data from specific subfields. These have been collected in a recent book, Reporting Experimental Data (ACS, 1993). The American Society for Testing and Materials Committee E49 on Computerization of Material Property Data has an ambitious program to develop consensus standards for metadata requirements for databases of properties of engineering materials. These documents emphasize that metadata requirements must be approached on a case-by-case basis and must involve experts in each field.

The conclusion is that metadata, whatever the particular form, are crucial to the use of almost every data set and must be included in any archiving plan. The necessary metadata usually add very little to the storage requirements, but may require considerable intellectual effort to prepare, especially if they are assembled retrospectively rather than when the data are first collected.
The preceding discussion defines metadata from the perspective of the research scientist. An additional, and somewhat overlapping, perspective is provided by the computer science community. In this community, the term metadata refers to the specification of electronic representation of individual data items, the logical structure of groups of data items, and the physical access and storage media and formats that hold the data. To the computer scientist or database administrator, the contextual data that the research scientist refers to as metadata encompass other data entities. In fact, divergence can exist even among research scientists as to the differences between data and metadata. What is metadata for one may be data for the other.

In view of this confusion, the committee has chosen to keep the term metadata and to explicitly define its fundamental components. As such, the committee views metadata as representing information that supports the effective use of data from creation through long-term use. It spans four ancillary realms: content, format or representation, structure, and context. The content realm identifies, defines, and describes primary data items including units, acceptable values, and so forth. The representation realm specifies the physical representation of each value domain, often technology dependent, and the physical storage structure of aggregated data items, often arbitrary. The structure realm defines the logical aggregation of items into a meaningful concept. The context realm typically supplies the lineage and quality assessment of the primary data. It includes all ancillary information associated with the collection, processing, and use of the primary data. On the basis of this explicit definition, the following section describes metadata objectives, implementation issues, and potential for defining a standardized framework.
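The four realms can be pictured as four sections of a single metadata record. The sketch below is one possible reading, offered only as an illustration: the class, the function, and the example keys are hypothetical and do not come from the committee's report. It also shows how a simple completeness check against the four realms might look.

```python
from dataclasses import dataclass, field
from typing import Dict, List

REALMS = ("content", "representation", "structure", "context")


@dataclass
class DatasetMetadata:
    """One record holding the four realms named in the text."""

    content: Dict[str, object] = field(default_factory=dict)         # items, units, acceptable values
    representation: Dict[str, object] = field(default_factory=dict)  # value encodings, storage layout
    structure: Dict[str, object] = field(default_factory=dict)       # logical aggregation of items
    context: Dict[str, object] = field(default_factory=dict)         # lineage, quality, processing history


def missing_realms(metadata: DatasetMetadata) -> List[str]:
    """Return the realms that carry no information at all, in canonical order."""
    return [name for name in REALMS if not getattr(metadata, name)]


# Assumed example: a record that describes its content and lineage
# but says nothing about encoding or logical structure.
record = DatasetMetadata(
    content={"air_temperature": {"unit": "K", "valid_range": (150.0, 350.0)}},
    context={"lineage": "station logbook, transcribed"},
)
gaps = missing_realms(record)
```

A check of this kind only detects empty realms; judging whether the information within a realm is adequate still requires the expertise the report describes.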
Analysis of Metadata: From Challenge to Solution

The problem of data set documentation is receiving increased attention in the context of scientific data management. In the earth sciences, global climate change research and general environmental concerns have ignited interest in a more interdisciplinary and long-term approach to conducting science. Interdisciplinary collaboration requires more effective sharing of data and information among individual researchers, disciplines, programs, and institutions, all of which may operate under different paradigms or have different terminology for similar concepts (NRC, in press). Further, long-term research requires that researchers be able to access and compare data sets that were created by past researchers and collected in different contexts by different technologies. Therefore, to support the interdisciplinary sharing and long-term usefulness of data, adequate metadata must be included within a framework that accomplishes the following objectives: provides meaningful selection criteria for accessing pertinent data; supports the translation of logical concepts and terminology among communities; supports the exchange of data stored in differing physical formats; and enhances the assessment of data sets by consumers.
A critical question is how to motivate the user community to participate in the process of metadata preparation and standardization. The issue of motivation is best addressed by the value system of the community itself. It may be argued that the problem will not be solved until the production of verified data sets and their provision to scientific colleagues become more highly valued activities. Developments such as the peer-reviewed publication of data sets should contribute to this shift in values. However, until these activities are assimilated into the fabric of career advancement, such as being incorporated into criteria for tenure in academic institutions, progress will continue to be slow and uneven.

Nevertheless, there are a number of specific actions that can be taken to promote the preparation and standardization of metadata. Funding agencies could help facilitate change by requiring and enforcing minimal documentation of data sets created under their grants (as well as other desirable data management and archiving practices discussed elsewhere in this report). This will not be an effective mechanism, however, unless the minimal standards for consistency and completeness are provided as a target for grantees and as a measuring stick for the funding agent. To be effective, these standards must be created through the collaboration of researchers, data managers, librarians, archivists, and policymakers.
Individuals and institutions in the scientific community could contribute by recognizing that data management and the provision of appropriate documentation of data are an essential science infrastructure function spanning all disciplines. Greater cost-effectiveness, consistency, and quality can be achieved if the many diverse data management activities are better coordinated. The essential requirement for making these value system changes and developing effective solutions is the recognition that all segments of the scientific community need to be educated on this issue. Funding agencies and the scientific community thus must move forward together in the development of a coherent strategy for end-to-end management that focuses on metadata requirements as a major element.
The ultimate solution for metadata handling will include an approach that not only supports the documentation of a data set throughout its life cycle, but also supports evolutionary documentation requirements. For example, early in the development and use of an instrument system, the scientific community may not be able to specify completely what metadata will be important for the effective use of the observations produced by this system. In this case, some of the documentation may include free-form narratives without the benefit of controlled vocabularies. Documentation of this nature is useful only to a limited audience that understands the specialized vocabulary of the source instrument, project, discipline, or institution. In addition, it is still difficult to make these descriptions useful to an automated agent performing a search on behalf of a user. As instrument use becomes more routine, this documentation could evolve to a more structured, but not cumbersome, form. One potentially useful approach constrains the textual descriptions to a well-defined, controlled vocabulary. If the vocabulary is clearly specified and made easily available with the data and associated documentation, users beyond those closely associated with the creation of the data set may be able to use this information to assess its relevance, significance, and reliability. Eventually, this more structured alternative will evolve into the specification of structured records with appropriately defined fields, standard value domains, and relationships with data set records. The committee also expects that improvements in software for natural language understanding will enable the automatic translation of free-form narratives into easily searched metadata fields.
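The intermediate stage described above, in which free-form descriptions are constrained to a controlled vocabulary, can be sketched in a few lines. The vocabulary and function below are hypothetical placeholders; a real vocabulary would be defined and published by the relevant scientific community. The sketch separates keywords that match the vocabulary, which an automated search agent could use, from those that remain free-form.

```python
from typing import Iterable, List, Tuple

# Hypothetical controlled vocabulary, for illustration only.
VOCABULARY = ("ozone", "temperature", "wind speed")


def split_terms(
    keywords: Iterable[str],
    vocabulary: Iterable[str] = VOCABULARY,
) -> Tuple[List[str], List[str]]:
    """Partition keywords into controlled terms and free-form leftovers."""
    canonical = {term.lower(): term for term in vocabulary}
    controlled: List[str] = []
    free_form: List[str] = []
    for keyword in keywords:
        cleaned = keyword.strip()
        match = canonical.get(cleaned.lower())
        if match is not None:
            controlled.append(match)   # store the canonical spelling
        else:
            free_form.append(cleaned)  # keep for human readers or later review
    return controlled, free_form


controlled, free_form = split_terms(["Ozone", " wind speed ", "balloon sonde notes"])
```

Keeping the unmatched terms rather than rejecting them reflects the report's caution that documentation requirements should evolve without becoming cumbersome for the data provider.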
An equally important component of the metadata solution is the identification and detailed definition of classes of information that are critical to the complete and consistent documentation of data sets. Information modeling techniques can be used to develop these classes of information, some of which will have clear, concise definitions and a set of defined attributes, while others will be identified but will not have clearly defined attributes or boundaries with other classes. The resulting information model should present a technology-independent description of metadata entities and their relationships with the primary data. The model should identify metadata that may be generalized across all classifications of data sets and usage patterns, as well as accommodate specialized needs. Such a model should provide the basis for intelligent information policies, data management practices, and metadata standards. The information policies, however, must not saddle data providers with long, cumbersome "forms" to fill out. That would discourage the contribution of the data themselves, and the committee recognizes that data with incomplete documentation are better than no data at all. Nevertheless, appropriately established metadata standards do not necessarily need to be difficult or costly to apply, and therefore need not be onerous to the data provider. An example of a generalized metadata framework in the observational sciences is presented in the working paper of the Ocean Sciences Data Panel (NRC, 1995).
Other Elements of the Appraisal Process

A data management plan should be created for any new research project or mission plan, consistent with the requirements of OMB (1994) Circular A-130. A good example of this is the Project Data Management Plan of the NASA National Space Science Data Center (NASA, 1992). At a minimum, those individuals who have responsibility for implementing the data management plan and ensuring accessibility and maintenance of the data should play a key role in the subsequent appraisal process.
Most individual investigators and peer reviewers do not recognize their roles as appraisers for archival purposes, but the views of these experts should weigh heavily in the decisions relating to long-term value or permanency of the data obtained. The principal investigators and project managers who collect and analyze the data clearly have the best sense of how long the data will be valuable for their own scientific purposes. Primary users also can provide a detailed understanding regarding the uses of the data for their own discipline, but they may not comprehend the long-term value of the data for application to other research or national problems. Because such primary users and other data collectors sometimes do not think beyond their own needs, the agencies should work with NARA to provide good documentation at the inception of scientific projects, especially documentation that would be useful to secondary and tertiary users. Although providing more extensive documentation often may be viewed as an extra burden by the principal investigators and data managers, the labor and expense can be minimized if it is planned at the inception of a project, whereas it is extremely difficult after the project is completed. Proper data management practices can be promoted by considering data management in the evaluation of an investigator's past performance.

Because many scientific endeavors require participation by a number of agencies and organizations, it is important to coordinate data management activities and assign responsibilities for the maintenance of the data during periods of primary use. NARA is currently responsible for the final appraisal of federal records and the determination of their value as accessions to the permanent national collection under its statutory mandate. However, NARA should take advantage of the expertise of the other participants involved throughout the life cycle of the data.
The committee believes that all stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad, overarching decisions regarding each class of data. The appraisal of individual data sets, however, should be seen as an ongoing, informal process associated with the active research use of the data, and therefore should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources manager to help with issues of long-term retention. Although the committee believes that formal appraisals should be kept to a minimum, appraisals should be performed according to the data management plan established for each project.
Although the committee was not expressly charged with advising on classified data, there is an obvious need to save classified scientific data as well. The complete records of the atmospheric atomic bomb tests are a clear example. It is more difficult to provide and assess metadata for a classified data set, and it costs more to maintain classified data. Also, there is a trade-off between the value of the data for national security, the risk to national security if the data are declassified, and the potential value to society of having the data declassified. Thus, it is highly beneficial and cost-effective to have mechanisms in place that consider these issues periodically for any given classified data set and that promote declassification when appropriate.
Recommendations

The committee makes the following recommendations regarding the retention criteria and appraisal process for physical science data:

As a general rule, all observational data that are nonredundant, useful, and documented well enough for most primary uses should be permanently maintained. Laboratory data sets are candidates for long-term preservation if there is no realistic chance of repeating the experiment, or if the cost and intellectual effort required to collect and validate the data were so great that the long-term retention is clearly justified. For both observational and experimental data, the following retention criteria should be used to determine whether a data set should be saved: uniqueness, adequacy of documentation (metadata), availability of hardware to read the data records, cost of replacement, and evaluation by peer review. Complete metadata should define the content, format or representation, structure, and context of a data set.
The appraisal process must apply the established criteria while allowing for the evolution of criteria and priorities, and be able to respond to special events, such as when the survival of data sets is threatened. All stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad, overarching decisions regarding each class of data. The appraisal of individual data sets, however, should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources professional to assist with issues of long-term retention.
Classified data must be evaluated according to the same retention criteria as unclassified data in anticipation of their long-term value when eventually declassified. Evaluation of the utility of classified data for unclassified uses needs to be done by stakeholders with the requisite clearances to access such data.
4 The Opportunities: The Relationship of Technological Advances to New Data Use and Retention Strategies

Rapid progress in information technology continually alters both the quantity and the quality of scientific information and periodically stimulates fundamental modification of data management and archiving strategies. Recent technological advances have enabled new methods and strategies for data storage and retrieval and have created better ways of connecting users to data resources and to each other. Moreover, the evolving technologies are catalysts for revising organizational structures to manage scientific data archives much more effectively in a distributed manner.

Assumptions about effective management of scientific data that have been long and firmly held are being directly challenged by new information technology. These assumptions have been based on experience with management of paper records, generally in domains outside of science. Some of the outdated assumptions that are rapidly losing their relevance include the following:

Physical possession of the data is essential to their management and archiving. This principle has outlived its usefulness in the context of electronic physical science data and has made access difficult for legitimate users. Electronic information is easily copied and disseminated. This feature removes constraints imposed by the limited physical access. Because most government physical science data are considered to be in the public domain, the constraints of copyright and fee collection to the free movement of data are removed as well.

Cost of an archive increases in proportion to collection size and use. Physical archive cost is a function of space, as well as cataloging, repair, and access efforts. Improved inventory technology has eased some of the cost burden over the last several years, but, fundamentally, archives with large physical holdings operate in traditional ways with linearly scaling costs. Such costs actually discourage use, since physical handling of items scales with use, whereas budgets reflect usage indirectly. In contrast, electronic information storage and management costs have declined as rapidly as the costs of computer technology and processing over the last 30 years. There is no foreseeable end to this process. Storing and using the next byte will be cheaper than storing and using the most recent byte for a long time to come.

Only archivists and librarians have the capabilities to manage archived data. While librarians and archivists are important advisors and participants in scientific data management, the dominant management responsibility falls to the scientific community and its designated scientific data managers (who are a blend of scientist, computer scientist, and librarian/archivist). If practicing scientists do not participate in the management of scientific information, such data will fall into obscurity or obsolescence.

The locator information (catalog) about the managed objects is simple and compact. Finding relevant scientific information often requires searching the full content and this content generally is not in the conveniently compressed form of text. For example, to search for all data sets where the stratospheric ozone concentration is less than some ad hoc threshold in some region, one would need to execute a complex algorithm on every data sample covering the region in question. Queries such as this become even more complex if the region of interest is determined after retrieval (e.g., how many days in a row was the areal extent of the ozone hole over open ocean greater than 5,000 square kilometers?). The selection and use of scientific data to solve complex problems can be simplified through the use of the concept of browsing information based on content. Browsing often involves examination of large numbers of samples and data volumes. Specialized "browsing products" can be defined to locate records of interest. For the query examples above, low-resolution ozone maps could be used to find candidate data sets with high probability of relevance. Information about the processes (including sensor characteristics, computer program capabilities, and calibration points) used to develop the data set is needed for its proper use. Such information increases the size and complexity of the locator service.
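The idea of a browsing product can be illustrated with a small sketch of the ozone threshold query. It rests on one assumed design choice that the report does not specify: the low-resolution map stores the minimum of each tile of the full-resolution grid, so a "less than threshold" search over the browse map can never miss a data set that actually contains such a value. The function names and the grid values are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def browse_minimum(grid: Grid, block: int) -> Grid:
    """Reduce a grid to low resolution by keeping the minimum of each block x block tile."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            min(
                grid[r][c]
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            for c0 in range(0, cols, block)
        ]
        for r0 in range(0, rows, block)
    ]


def may_contain_below(browse: Grid, threshold: float) -> bool:
    """Cheap pre-filter: does any tile minimum fall below the threshold?"""
    return any(value < threshold for row in browse for value in row)


# Made-up column ozone values for one day, on a 4 x 4 grid.
day = [
    [300.0, 310.0, 305.0, 299.0],
    [301.0, 298.0, 307.0, 302.0],
    [296.0, 180.0, 304.0, 300.0],
    [299.0, 297.0, 303.0, 301.0],
]
browse = browse_minimum(day, 2)            # 2 x 2 browse product
candidate = may_contain_below(browse, 220.0)
```

Only the data sets flagged as candidates would then need to be retrieved at full resolution. A browse product built from tile averages instead would be smaller in dynamic range but could hide a localized minimum, which is why the minimum is used in this sketch.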
The remainder of this chapter describes how advancing information technologies enable the data manager, librarian, and archivist to deal with the challenges of scientific data management in a collaborative fashion with the scientific user community.

Enabling Technologies and Related Developments

Table 4.1 provides a summary of aspects of scientific data management changed by new technologies and related developments. These six areas are discussed in more detail below.

High-Performance Computer Networks
The rapid expansion of computer networks and their use for electronic mail and database access have obviated the need for researchers and other users of scientific and technical data to be in physical proximity to colleagues, information resources, and even advanced technical facilities. This has presented a menu of choices about the best means to distribute data and the responsibility of managing them.

A worldwide, "virtual" library is being created on the Internet. Application programs such as Mosaic are demonstrating the power of free and simple navigation across an ocean of available resources. Improving network capacity, reliability, performance, and security measures are helping to make these resources more widely accessible and useful.

High-performance networks also support movement of information for new applications (e.g., for producing safely managed backup copies, "profiling" information for individual user's needs, or staging data through a number of refinement steps in different locations for focused research). Networks support collaborative work and research projects that span traditional research boundaries. Such work requires easy access to a variety of data sources at once.
High-performance networks enable scientific data resources to be widely distributed and managed by groups of scientists. Users thus are freed to concentrate on the most effective use of the data, rather than on their own data management issues. Networks can provide a vehicle for regularly distributing backup copies of data and metadata to ensure safe storage. Distribution of data to users can be done via the network in addition to, or instead of, via physical media such as tapes and CD-ROMs. Data can be linked together to help users navigate among related items. This kind of linking is at the heart of the World Wide Web concept and brought to users by Mosaic. The population of information providers (e.g., people who can contribute to the knowledge base) has now grown to include all networked members of a user population. Such contributions can be as simple as an annotation on an existing item, or as complex as a fully processed and peer-reviewed new item. Most profoundly, the evolving network infrastructure enables new concepts for distribution of functions and responsibility in organizations (NRC, 1994).

TABLE 4.1 New Technologies and Related Developments That Enable a New Strategy for the Management of Scientific and Technical Data

| New Technology Trends and Related Developments | Key Features | What Is Enabled? |
| --- | --- | --- |
| High-performance computer networks | Distributed functions; rapid delivery of large data volumes | Location of databases and archives where best managed; collaborative work; distributed organizations; distributed responsibility |
| Low and declining cost of storage | Inexpensive backup; continually declining cost; ease of migration | Deferral of archiving decisions; trust in distributed management due to safe storage backup |
| Advanced data management | Ability to rigorously and formally manage diverse data types | More complex data structures (other than "flat files") handled in archives, with great potential advantages |
| Changing requirements for information technology professionals | Ability of personnel with lower technical skills to succeed in data management roles | Ability to entrust scientific data management in a distributed environment |
| High reliability of technology components | Availability of better components and connections; reduced procurement and operations costs | Reduced cost and effort in data migration; trusted connections for communication and collaboration |
| Development and acceptance of standards | Agreement on terms, interfaces, media, procedures | Reduced effort to communicate and apply results of others; ability to concentrate on mission issues and not on technology support |
Although networks can provide a quick and easy means to distribute data, it must be noted that CD-ROMs have been used to distribute data for several years and have been very successful. CD-ROMs not only permit users to have a huge local library of data, but they often come with a better set of data access tools than are normally available. Some data sets are large enough that the most cost-effective method to deliver them is on media such as Exabyte tapes (8 mm).
Low and Declining Cost of Storage

As for most aspects of computer hardware, the cost of storage has declined continuously and rapidly for the 30 years of the modern computer age. New storage technology is also increasingly compact and supports ever greater access speeds (Gelsinger et al., 1989). The historical trends are expected to continue for up to 20 years. Already, laboratory engineering results confirm this projection for at least the next decade. The most significant implication is that the decisions about sampling or discarding scientific data can generally be deferred, particularly for data sets for which the necessary metadata exist and whose quality has been certified. For relatively smaller data sets, the deliberation regarding long-term retention may well cost more than the recurring acts of migration. The cost of storage is small in relation to overall mission or investigation costs and therefore should not be a decision driver. Experience suggests, however, that the funds to meet these costs need to receive special protection in the annual agency budget cycles. The support for the data management aspects of scientific missions has typically had a lower priority than the data collection aspects. The low cost of storage also implies that the incremental cost of supporting a remote safe copy of data and metadata also will be small, except for the very large data sets. Therefore, over the next few decades, data received and stored may be expected to be cheaply and quickly migrated to new technologies when storage media reach their nominal limits of reliability or for convenience of improved access.

It is important not to expect a perpetual advantage from this technological discontinuity. The fact that data require significant time periods for their migration must be considered. The cost decay trend will slow down at some point in the future, causing the overall cost of storage to return to something closer to the linear relationship to volume. We also must be realistic and expect that funds will not always be available to save and back up every data set. Decisions on retention or sampling will have to be made.
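The two cost regimes described above, unit costs that keep falling and unit costs that have leveled off, can be compared with a simple geometric-series sketch. The numbers are assumptions chosen for illustration, not figures from the report: a fixed data set is re-stored once per migration cycle, and the cost of doing so is multiplied by a constant `decline_factor` each cycle.

```python
def cumulative_cost(first_cycle_cost: float, decline_factor: float, cycles: int) -> float:
    """Total cost of storing the same data over `cycles` migration cycles.

    decline_factor is the ratio of one cycle's cost to the previous cycle's:
    0.5 means the cost halves each cycle, 1.0 means it no longer declines.
    """
    if not 0 < decline_factor <= 1:
        raise ValueError("decline_factor must be in (0, 1]")
    if decline_factor == 1:
        return first_cycle_cost * cycles  # flat unit cost: total grows linearly
    # Sum of the geometric series c + c*r + ... + c*r**(cycles - 1).
    return first_cycle_cost * (1 - decline_factor ** cycles) / (1 - decline_factor)


# Assumed example: the first cycle costs 100 units.
halving = cumulative_cost(100.0, 0.5, 3)   # 100 + 50 + 25
flat = cumulative_cost(100.0, 1.0, 3)      # 100 + 100 + 100
```

While the decline continues, the total never exceeds `first_cycle_cost / (1 - decline_factor)`, which is 200 units in this example however many cycles pass. Once the decline stops, the total grows without bound in proportion to the number of cycles, which is the return to a linear relationship that the text cautions about.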
Nevertheless, the already low and continually declining cost of storage allows a priori decisions to be made in certain circumstances to keep scientific data sets indefinitely. Backup or safe storage copies of data are becoming more affordable as data migration becomes less expensive with smaller, faster, and cheaper storage devices. Reliability also is improving with new software-based archive systems (including migration and backup features). However, there is an enhanced need for ongoing technology monitoring by an appropriate body for media, standards, and migration automation. Such monitoring should be incorporated in any scientific data management and archiving strategy.
The rapid change of storage technologies suggests that efforts to protect today's scientific data legacy must be accelerated. The obsolescence of media types and recorders/players is occurring within shorter and shorter time periods. This implies that "salvage" activities will be increasingly difficult for data left out of migrations to new media. This "join or be left behind" by-product of rapid technological change intensifies short-term budget pressures on archives. It demands in response a strong management commitment to provide resources and save important data sets.

If digital data are to survive, it is of fundamental importance to manage and constrain the costs of archive maintenance. The problem is that new data will be coming in, old data will need to be migrated to new media, the building will need to be repaired, and there usually will not be a lot of extra money for new equipment or added staff. To avoid problems, the data migration process in the system design must be almost totally automated. This refinement often has not been achieved, and it can cause unnecessary budget difficulties. Finally, it is essential for agencies to preserve all the hardware and software necessary to access all their data until the data have been successfully migrated or otherwise disposed of.
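One building block of an automated migration process is a copy step that verifies itself, so that the old copy is released only after the new one is known to be intact. The sketch below is an illustration under stated assumptions, not a description of any agency's system: it treats migration as a file copy between two directories standing in for old and new media, and it uses a SHA-256 checksum as the integrity test. The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so that large data sets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm the copy is bit-identical.

    Returns the checksum so it can be recorded with the data set's metadata.
    The source is left untouched; deleting it is a separate, later decision.
    """
    expected = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    if sha256_of(destination) != expected:
        destination.unlink()  # do not leave a corrupt copy behind
        raise OSError(f"checksum mismatch while migrating {source}")
    return expected
```

Recording the returned checksum alongside the data set gives later migrations a reference value to verify against, which supports the text's point that access to the old copy should be preserved until migration is known to have succeeded.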
AdvancedDataManagement
Therearesignsthatdatamanagementtechnologyisbeginningtoaddressand,perhaps,tocatchupwiththecomplexitiesoftheverylargevolumesofscientificdata.Improvementshaveoccurredindatabasemanagementsystems,hierarchicalfilesystems,datarepresentationstandards,queryoptimizers,datadistributiontechniques,specializedaccessmethods,anddatasecuritytools(Silberschatzetal.,1991).Further,investmentinstandardsandcooperativeapproachesisaccelerating,fueledinpartbythedemandsofmedicine,education,entertainment,journalism,financialservices,andothercommercialapplications.Whilecompetingapproachesandinconsistentvocabularycreatenear-termconfusion,theattentionand
investmentlevelsbodewellforthelonger-termcapabilitytogobeyond"flatfile"representationsofdatathatneedtobearchived.Thenewtoolsandtechniquesaremoredescriptiveofthedata,theirheritage,theprocessesthathaveworkeduponthedata,andtherelationshipsofdatatoeachother.
New data management technology will enable easier representation of more diverse types of scientific data. Because of the rigor that new techniques require (e.g., for self-documentation or for precise definition of access methods), long-term archives will benefit from data structures other than flat files. The new technology also implies that the creation of a richer set of metadata will be easier to implement and that these data will be of high scientific value for content-based retrievals. To realize the potential of this enabled facility with metadata, the scientific community will have to accept and support efforts to develop and apply new metadata requirements.
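To make the idea of a richer, self-documenting record concrete, the sketch below defines a small metadata structure and a check for gaps. The field names are assumptions chosen to echo the qualities named in this section (the data, their heritage, the processes applied to them, and their relationships to other data); they are not a requirement set out by the report or by any standard.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetMetadata:
    """Hypothetical minimum descriptive record for an archived data set."""
    title: str = ""
    originator: str = ""          # heritage: who collected the data
    collection_method: str = ""   # how the data were acquired
    processing_history: str = ""  # processes that have worked upon the data
    related_datasets: str = ""    # relationships to other data


def missing_fields(record: DatasetMetadata) -> list:
    """Return the names of fields left empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]
```

A check of this kind could run when a data set is submitted, so that gaps in documentation are found while the originators are still available to fill them.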
The Changing Requirements for Information Technology Professionals
Information technology professionals with high skill levels can now be found in all parts of the United States and around the world. But as they bring the information technology industry to higher levels of maturity, the effect is to reduce the complexity of major tasks in managing information. Such tasks previously required their skilled use of sophisticated assembly language or job control language (JCL) programming. JCL programming refers to the steps once used at the system console to get programs to run, attach the right files, print to the right printer, and perform similar functions. Today, much of this work is masked, made automatic, and controlled through icons and other means. These tasks can now be performed by competent scientists or professionals with lower technical skills, rather than by highly trained specialists. Because more functions can be completely handled by machines, management of the data can be greatly automated and operated by less skilled individuals. The data themselves can be widely distributed without fear of loss, particularly with a backup copy in safe storage.
Over the next 5 to 10 years, the costs for information technology professionals at individual scientific data centers and archives can be dramatically reduced. The reasons for the reduction in costs include more automatic processes for storage management, rudimentary learning capability in systems, services performed by end users based on their preferences, improved systems management, higher component reliability, improved application of standards, and vendor consistency with standards.
Although the dominant trend will be for a smaller, less technically skilled staff to manage the physical aspects of the archive, there will be a pressing demand for a small number of highly skilled people who blend the skills of physical scientist, computer scientist, and archivist. These people must be able to handle the intellectual challenges of bridging these disciplines while providing the coaching and direction to help develop data and operations standards for scientific communities.
High Reliability of Technology Components
Microprocessors, new storage media technologies, mature software, error correction capabilities, improved packaging, and reduced power consumption have all made significant contributions to the reliability of computer systems and networks. What was recently considered unreliable, requiring constant attention and expensive repair, is now regarded as reliable and not worthy of effort to repair. Although precautions have always been taken to protect against loss of valuable data, many of these precautions are now built into the base of mature software or are increasingly familiar parts of facilities' operating procedures.
High reliability of technology supports a capacity for high levels of trust and the ability to widely distribute functions and databases. These distributed systems can achieve the same levels of quality and trust as centralized archives through the use of the same underlying hardware and software technology, operating procedures, safe storage of copies, and high-quality (error-corrected) telecommunication connections. High reliability has enabled new applications such as the World Wide Web, in which context switching from one machine to the next on a worldwide basis is readily accomplished. Increased reliability also has allowed computing technology to be put into the hands of business managers, consumers, and shop clerks. Without such reliability, maintenance effort would outweigh productivity benefit. As a result, powerful organizational or operational frameworks can be built, much as new materials enable new architecture or new machines.
Development and Acceptance of Standards
The development of effective standards has been pivotal to promoting the widespread use of electronic information. Communication protocols such as TCP/IP have fueled the growth of the Internet. Other format standards for documents support their interchange. For example, the Standard Generalized Markup Language (SGML) provides a uniform way of formatting textual documents so that they can be read by different document processing tools. The HyperText Markup Language (HTML) is a standard used to represent and link documents; it is used to describe pages viewed with Internet viewers such as Mosaic. Hardware and software standards such as the instruction set architectures for microprocessor-based computers, modem protocols, media formats, and query languages also have played critical roles.
Standards can simplify many of the traditional data management jobs. For example, the time that would be used to decipher a tape format is saved and the job of installing a new application is facilitated. Having effective standards in place reduces the level of tedious, nonproductive effort and frees up time for new tasks for the archivist. Standards determined now will typically be in effect for long periods of time, perhaps a decade or more, with some small evolutionary augmentations. This means that a baseline of appropriate standards can be selected for a body of information with some reasonable expectation that they will not be quickly replaced. When it appears that the existing standards baseline needs to be updated, the information can then be migrated to a new one. A deliberate data migration strategy based on standards tracking is possible.
The role of standards certainly is not limited to the general computing community. Scientific teams and discipline groups continuously work to codify best practices, definitions, and algorithms. These are propagated as community standards. Standards developed by the scientific community are often the most important to promote and apply. If properly promulgated, they can enable improved understanding, broader collaboration, and facilitation of the data management and related research.
Finally, it should be emphasized that standards and guidelines to support long-term archiving must not inhibit innovation, or the evolution of information systems and technology. Often the best standards and guidelines are those that are independent of technology.
Opportunities For New Organizational Structures
With rapid technological improvements and newly enabled capabilities, it is sometimes easy to forget the importance of long-term commitment by managers to policy and resource requirements. No technological changes will by themselves replace the basic, unsung efforts of high-quality scientific data management. In fact, although technology itself can improve the availability of data, truly accessible and useful scientific information will be achieved only through such management commitment. This commitment must be based on a coherent strategy for life-cycle management of data, including technology acquisition, data and information management practices, and technology-independent standards to ensure that the minimum levels of data content and consistency for research uses are met. Further, such a comprehensive strategy will be successful only with the active and committed involvement of the scientific community itself. The level of effort and change that may be required to achieve this community involvement should not be underestimated, and fundamental change to the value system of the community may be required.
Nevertheless, as discussed above, technological advances allow the creation of new infrastructure, challenging existing organizational assumptions. Effective organizational designs based on new allocations of responsibility are enabled. For scientific data management, the technological changes support organizations with the following attributes:
Widely distributed responsibility. New telecommunications, data management, and standards technology allows for high levels of trust in distributed data management. Physical possession of data by archivists is no longer essential. The wide availability of information technology professionals and other skilled data managers (along with the lower technical skill levels actually needed) enhances the ability to distribute the data more broadly and increase user participation. Such distribution of data and their ownership (whether actual or implied) by user groups improves the utility of the data and helps create important support for long-term retention.
High-value peer-to-peer communication. With access to data and to people online, a variety of new collaborative relationships can develop. Information can be broadcast to interested individuals in a timely fashion. Data can be provided directly to field researchers to focus new data collection. Physical proximity and formal lines of communication are no longer vital to effective organizational operation. Indeed, closed, highly structured organizations often will be uncompetitive or fail to take full advantage of innovation.
Specialized data centers. Distribution of resources implies that some specific locations can specialize and yet still contribute effectively to all. Specialized groups or institutions could be created in a scientific discipline or in some aspect of data management, archives, or standards. Designation of such specialized centers, in addition to those already in existence, is a significant mechanism for achieving economies of scale, reducing overall costs while enhancing the effectiveness of certain functions for the benefit of all.
Explicit long-term (technology) strategies. A long-term technology strategy needs to be developed. The rapidly changing base of technology requires that a deliberate sequence of phases be selected, through which data and data management will migrate. The constant evolution of information technologies demands that an organizational element take on this "technology navigation" function.
Measurement as a vital tool. In a fast-paced, and, perhaps, widely distributed effort, metrics are important to clearly communicate expectations of performance, register results, and help in detecting weak spots for corrective action. In particular, metrics could be established to determine data set use and to support archiving strategy decisions. Metrics also could be developed to help ensure high-quality service and proper data protection.
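One way to picture a metric for data set use is a simple tally of accesses per data set, together with a list of holdings that have seen little recent use. The sketch below assumes an access log made of `(dataset_id, access_date)` pairs; that log format, the function names, and the threshold are all hypothetical, and a low count should be read as a prompt for review rather than as a disposal decision.

```python
from collections import Counter
from datetime import date


def usage_counts(access_log):
    """Count accesses per data set from (dataset_id, access_date) pairs."""
    return Counter(dataset_id for dataset_id, _ in access_log)


def rarely_used(access_log, holdings, since, threshold=1):
    """Return holdings with fewer than `threshold` accesses on or after `since`."""
    recent = usage_counts(
        (dataset_id, when) for dataset_id, when in access_log if when >= since
    )
    # Counter returns 0 for data sets that never appear in the log.
    return sorted(d for d in holdings if recent[d] < threshold)
```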
5 A New Strategy for Archiving the Nation's Scientific and Technical Data
The scientific and technical data held by federal government agencies and by other institutions supported by federal funds constitute an extremely valuable national resource. Unfortunately, in many cases this resource can be exploited only with great difficulty because key elements of the infrastructure for broad and easy access to it are incomplete or missing.
Currently, the most important development within the federal government for improving the management and long-term retention of scientific and technical data is the National Information Infrastructure (NII) initiative. The NII focuses on the application of public, private, and academic resources to define, implement, and maintain an evolving network of knowledge resources (IITF, 1993). This infrastructure will be the foundation for information-centered enterprises of the next century (NRC, 1994). The scientific community, whose lifeblood is widely available data and information, must become fully engaged in this national effort. A coherent strategy needs to be defined and implemented, to combine new technological capability with a new way of doing business throughout all phases of the scientific information life cycle (observation, measurement, analysis, interpretation, application, dissemination, and education).
An effective information infrastructure must build on enabling technologies to create an integrated and adaptive system that is easily accessible to all potential users. Each user community will have its own view of what the NII means to its enterprise and how the NII can best serve its users because the NII will be made up of many separate "enterprise information infrastructures." The existing scientific and technical data centers and archives already constitute a separate enterprise information infrastructure, which must become fully integrated into the NII.
In the discussion that follows, the committee lays out a three-part strategy for the long-term retention of scientific and technical data. The elements of this strategy are based on the technological advances outlined in Chapter 4 and on the issues raised in Chapter 2, which provide the context and the need for action.
The strategy begins with a set of fundamental principles for the long-term retention of scientific and technical data. The second major element outlines the committee's proposal to form a National Scientific Information Resource Federation, which would provide a coordination mechanism for end-to-end management of networked scientific and technical data facilities. The final sections highlight some specific recommendations for NARA and NOAA in their long-term retention of scientific and technical data.
Fundamental Principles For Long-Term Data Retention
In order to respond adequately to the imperatives for preserving data about the physical universe and eventually to create an integrated, adaptive, and accessible infrastructure, the federal government should help establish effective and affordable processes for providing ready access to the vast national resource of scientific and technical data and related information. The process must support the needs of data originators, users, and custodians across all phases of the data life cycle, from origin to use by future generations. The committee believes that the following principles should guide the effort of the government agencies in the long-term retention of scientific and technical data:
Data are the lifeblood of science and the key to understanding this and other worlds. As such, data acquired in federal or federally funded endeavors, which meet established retention criteria, are a critical national resource and must be protected, preserved, and made accessible to all people for all time. The original collection and analysis of scientific and technical data traditionally have been used primarily to support the scholarly publication of scientific interpretation by individual investigators. The availability of complete and consistent data sets for broader uses, both within and outside the scientific community, would significantly increase the return on the investment made in obtaining those data and provide insights not attainable if the original data were lost or unusable.
The value of scientific data lies in their use. Meaningful access to data, therefore, merits as much attention as acquisition and preservation. Technology can make data available through fast computers, large-bandwidth networks, massive storage capabilities, and portable media. However, if the paths to data are obscure, or there is no way for a user to determine what is significant and relevant, then the data become inaccessible and are effectively lost.
Adequate explanatory documentation, or metadata, can eliminate one of today's greatest barriers to use of scientific data. The problem of inadequate metadata is amplified when users are removed from the point of origin by being in a different discipline, by having a different level of expertise, or by time. Addressing this problem comprehensively will make data useful in the broadest possible context.
A successful archive is affordable, durable, extensible, evolvable, and readily accessible. These terms may appear to be vague targets, but they imply basic goals. The costs of developing, operating, and using an archive must not be excessive. The archive must endure the ravages of long-term use, and it must be able to extend broadly the services it offers and the records it manages. It must evolve to support the assimilation of new technology, policies, procedures, and uses. Finally, an archive is not effective if a broad population of users cannot use it. The archiving system thus should provide multiple levels of access to any subset of its holdings, although holdings not accessed often may not require a sophisticated access mechanism.
The only effective and affordable archiving strategy is based on distributed archives managed by those most knowledgeable about the data. Archive centers generally should be at the agencies or institutions that collect the data, and they should be responsible for archiving and providing access to the data as long as the agency's or institution's mission and scientific competence continue to encompass the subject field. Physical transfers of the data should be avoided if possible, so agencies and institutions will need to allocate adequate resources to the entire life cycle of their data holdings.
Planning activities at the point of data origin must include long-term data management and archiving. This principle is recognized in the Office of Management and Budget Circular A-130 on the "Management of Federal Information Resources" (OMB, 1994). The scientific information management spectrum spans data collected from a sensor to the scholarly publications that report scientists' interpretations of the data. Scientists, information technology professionals, data managers, librarians, and archivists must unify their expertise in the establishment of a coherent strategy for end-to-end data and information management. Although these communities traditionally have not worked closely together, their combined knowledge and effort are now required. The benefit of incorporating planning at the point of origin is that it is cheaper and more effective to plan for retention than to reconstruct data sets later.
The Proposed National Scientific Information Resource Federation
The committee believes that the federal government should create a National Scientific Information Resource Federation, an evolutionary and collaborative network of scientific and technical data centers and archives, to take on the challenge of providing effective access to and preservation of important scientific and technical data and related information. Such an initiative would begin to exploit more fully our nation's significant investment in the physical (and other) sciences and the data acquired with that investment. In the discussion that follows, the committee reviews the basic elements of a federated management structure, describes some notable examples of existing federal government organizations for large-scale distributed data management, and outlines the most important aspects of the proposed National Scientific Information Resource Federation.
Elements of a Federated Management Structure
Several critical concepts must govern any federated management structure for it to function properly. These include the notions of subsidiarity, pluralism, standardization, the separation of powers, and strong leadership at all levels (Handy, 1992).
Subsidiarity means that power is assumed to lie with the subordinate units of an organization and can be relinquished, but not taken away. The subordinate units typically are best qualified to make operational decisions that directly affect them and that they will be implementing. The central management is allowed only those powers needed to ensure that the subordinates do not damage the organization. For example, the Constitution of the United States reserves only specified powers for the federal government, with any unstated powers belonging to the states. Applied to the situation at hand, it is clear that the strengths of the current system for managing scientific and technical data and information in the United States are distributed among a number of diverse data centers and archives, both within and outside the government. A successful federation of these existing institutions would recognize that they are the locations of expertise on their respective data holdings. Thus the central organization should be small and should not micromanage the day-to-day operations of the subsidiary organizations.
Pluralism may be defined as interdependence of the members. In a federation, the individual subsidiary organizations recognize the advantages of belonging to the federation, because of products or services that can be obtained from other elements in the federation. As noted in the previous chapter, the existence of many specialized data centers and archives, as well as the possibility of creating new ones in a networked environment, can offer significant economies of scale and improved sharing of ideas and expertise. What is good for the subsidiary element also should be good for the whole. Pluralism, coupled with subsidiarity, guarantees a measure of democracy in the federation.
Interdependence, in turn, requires standardization of languages, communications, basic rules of conduct, and units of measurement. These elements may be summarized as technical and procedural standardization. This too was discussed in Chapter 4, regarding the development of standards in software, hardware, and data management. Standards that are developed by consensus of the subsidiary elements (e.g., the participating data centers, archives, and researchers) are widely recognized as essential to the successful management of data.
A separation of powers (responsibilities), with a system of checks and balances, is necessary to ensure that the central authority does not take on unnecessary power. This principle must be incorporated into the federation's organizational structure.
Finally, a federation requires strong leadership that is effective, yet not overbearing. The central coordinating element or executive office must act as the standard bearer, promoting the federation's established goals and objectives while reminding the subsidiary organizations of the importance of carrying out their responsibilities.
Examples of Distributed Data Management Organizations
Successful examples of a federated management structure are numerous in the private sector (Handy, 1992). More specifically, however, there already are two large-scale, federal government, distributed data management groups that embody many, though not all, of the federated management attributes outlined above. These are the Interagency Working Group on Data Management for Global Change and the Federal Geographic Data Committee.
Interagency Working Group on Data Management for Global Change
In 1990, Congress formally established the U.S. Global Change Research Program (GCRP), "aimed at understanding and responding to global change, including the cumulative effects of human activities and natural processes on the environment, [and] to promote discussions toward international protocols in global change research" (CENR, 1994). The activities of the GCRP are coordinated by the Committee on Environment and Natural Resources (CENR), under the President's National Science and Technology Council.
The timely availability of a broad spectrum of scientific data and information, from both governmental and nongovernmental sources, is fundamental to meeting the goals of this program. A Global Change Data and Information System (GCDIS) is being created to facilitate access to and use of the data and information necessary to support global change research. The federal organizations involved in the GCDIS planning include the Departments of Agriculture, Commerce, Defense, Energy, Interior, and State, as well as the Environmental Protection Agency, the National Aeronautics and Space Administration, and the National Science Foundation.
According to The U.S. Global Change Data and Information System Draft Implementation Plan (CENR, in press), the GCDIS is building on the resources and responsibilities of each participating agency, linking the data and information services of the agencies to each other and to the users. The system thus is composed largely of the separately funded components contributed by the participating agencies. It is supplemented by a minimal amount of crosscutting new infrastructure through the use of standards, common management approaches, technology sharing, and data policy coordination. Neither a lead agency nor a separately funded budget for the GCDIS is planned; rather, implementation of the system is being coordinated through the Interagency Working Group on Data Management for Global Change (IWGDMGC). Decision making, therefore, is done through a consensus process based on the common interests of all participants.
Plans for the GCDIS recognize that the global change data must be available for a very long time, regardless of the changing interests of the researcher, group, or agency that originally collected and analyzed the observations. Although each agency participating in the GCDIS is expected to manage, store, and maintain the data sets under its purview, the plan does allow an agency to designate another GCDIS agency to archive some of its data. The participating agencies are expected to adhere to government standards for media, storage, and handling as prescribed by NARA and the National Institute of Standards and Technology. The agency archives associated with the GCDIS access system will be staffed by professionals who understand the data and their sources. The IWGDMGC expects to develop guidelines for preparing data sets and associated documentation for long-term retention at the participating agencies. Ideally, the GCDIS archives also will be associated with research groups, both within and outside government, who, as principal users of those data, will verify quality and documentation of the data.
The GCDIS plan gives each agency responsibility for its own data-purging policies, although interagency coordination procedures will be developed to prevent the loss of important data sets. Before any data sets are purged, however, an agency will be required to notify the IWGDMGC of its plans at least one year in advance, and to allow other GCDIS agencies to indicate their requirements for those data, or to agree to assume responsibility for the archiving of those data. In the event that no agreement can be reached on the disposition of a data set identified for purging, existing NARA procedures will apply (CENR, in press).
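The one-year notice rule described above can be expressed as a small date check. The sketch below is a simplified reading of that rule, not the plan's actual procedure: the function names are hypothetical, the treatment of a notice dated 29 February is an assumption, and any claim by another agency is treated simply as blocking the purge, since the text says such cases go to interagency agreement or to existing NARA procedures.

```python
from datetime import date


def earliest_purge_date(notice_date: date) -> date:
    """Return the first date a purge could proceed: one year after notice."""
    try:
        return notice_date.replace(year=notice_date.year + 1)
    except ValueError:
        # Notice was dated 29 February and the next year has no such day;
        # rolling forward to 1 March is an assumption of this sketch.
        return date(notice_date.year + 1, 3, 1)


def may_purge(notice_date: date, today: date, claiming_agencies: list) -> bool:
    """True only if a full year has passed and no other agency has claimed the data."""
    return today >= earliest_purge_date(notice_date) and not claiming_agencies
```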
Federal Geographic Data Committee
The other major federal data coordination entity important to the long-term management of observational data (including some data from the biological and social sciences) is the Federal Geographic Data Committee (FGDC). The Office of Management and Budget (OMB) established the FGDC in 1990 to develop a National Spatial Data Infrastructure (NSDI) to work toward the coordinated development, use, sharing, and dissemination of geographic data (OMB, 1990). Participating government organizations include the Departments of Agriculture, Commerce, Defense, Energy, Housing and Urban Development, Interior, State, and Transportation, as well as the Environmental Protection Agency, Federal Emergency Management Agency, Library of Congress, National Aeronautics and Space Administration, National Archives and Records Administration, and Tennessee Valley Authority. In fulfilling its mandate, the FGDC carries out the following activities, among others: promotes the development, maintenance, and management of distributed database systems that are national in scope for geographic data; encourages the development and implementation of standards, exchange formats, specifications, procedures, and guidelines; promotes technology development, transfer, and exchange; and promotes interaction with other existing federal coordinating mechanisms that have interest in the generation, collection, use, and transfer of spatial data (FGDC, 1994).
The FGDC has received authority and some limited funding to pursue these objectives. Specifically, Executive Order 12906 on "Coordinating Geographic Data Acquisition and Access: The National Spatial Data Infrastructure," assigns to the FGDC the responsibility to coordinate the federal government's development of the NSDI. That Executive Order also instructs the FGDC to involve state and local governments in its NSDI activities, and to use the expertise of academia, professional societies, the private sector, and others as necessary to assist the FGDC.
The FGDC has established a matrix of subcommittees and working groups according to discipline-related data categories and interests. The working group issues include a framework for data, a clearinghouse for data, standards, technology, and data archiving. The FGDC plans for data archiving are still being developed, however.
Creation of the National Scientific Information Resource Federation
The two examples cited above indicate that a federated management structure for highly distributed scientific data can be created. In fact, between these two groups, the life-cycle management of many of the data that are the topic of this report is beginning to be systematically approached. Nevertheless, as discussed in this report and in the volume of working papers (NRC, 1995), many important gaps and inadequacies remain in the management and retention of our nation's scientific data and related information. The committee believes that these deficiencies can best be addressed by a comprehensive federated system, a National Scientific Information Resource (NSIR) Federation, that builds on the successes of the existing groups and helps coordinate them with other data management entities that still need improvement.
There are many reasons why it is now propitious to establish a system of federated data management, with an emphasis on long-term retention. From a policy perspective, it would be consistent with the goal of the National Information Infrastructure to distribute information resources broadly throughout our society, with the federal government acting as facilitator for such activities. The technology is available to make a fully networked, but highly distributed, system of data centers and archives both feasible and desirable. Such a system would be efficient in providing access to scientific data and information to a large number of potential users and would maximize the government's return on the significant investment that initially went into acquiring those data. From an organizational standpoint, a federated management structure would allow the disparate elements to continue to specialize in what they each do best and to fulfill their individual organizational mandates, while providing some efficiencies of scale and political leverage in addressing the most pressing issues. Moreover, this type of approach is especially timely and important in an era of federal government budget reductions. The committee therefore envisions a broadly networked organization, which would be implemented through the collaboration of the federal government's scientific and technical agencies as well as commercial and noncommercial organizations outside the government, and integrated into the emerging National Information Infrastructure.
Most of the elements of the NSIR Federation are already in place. These include the data centers and field archives run by several of the federal agencies that are among the primary generators and collectors of the nation's scientific data and information. In addition to holding data, these centers and archives have highly skilled staff with the requisite expertise. The organizations are widely distributed, both geographically and by discipline.
The existing data centers and field archives, however, do not approach the federated organizational model for several reasons. There is no unifying organization among the various elements, there is wide disparity in the quality and depth of service provided, and few of them have a charter to preserve data "permanently." Although NARA has the statutory charter to preserve federal records in perpetuity, its current and projected holdings of electronic scientific records are very small. While the committee does not believe that NARA's archives of scientific data should increase substantially, it found little evidence of activity within the scientific and technical agencies that would indicate that their ability to provide for long-term retention and access to their data would improve without some restructuring.
A fundamental precept is that those most familiar with scientific data, the scientists themselves, are in the best position to oversee the management of those data (NRC, 1982). In light of the volume and diversity of scientific data, a distributed approach that maintains the data closest to the primary user community is the most effective method for managing them. As mentioned above, several agencies have adopted an approach of caring for their data in systems of field archives or discipline data centers. Although these agencies have devoted significant attention to the preservation of data, their concern is limited to providing immediate service to primary users of the data for their originally intended purpose. Little thought has been given to the perpetual archiving of the data within most agencies, with the notable exceptions of NARA and NOAA, which already have a statutory mandate that allows them to preserve data collected by the federal government. Because it is not possible to be sure that any data center will exist in perpetuity, some mechanism must be in place to ensure that the data will be retained by an appropriate organization within or outside the government in the event that the continued existence of a data center is jeopardized.
If a lead agency can be determined for a subject matter, then it should take responsibility for coordination of scientific data on that subject, no matter which agency has physical ownership or custody of those data. The committee recognizes, however, that some data sets are largely of interest at the boundaries of disciplines or agency charters and that consequently these may be more difficult to manage or document properly. Large data sets that are of an interdisciplinary nature cause special problems in this regard. For these complex situations, no simple rule will take the place of negotiations among the involved agencies to make the necessary arrangements for long-term archiving. Indeed, every agency should assume the obligation to keep its holdings of scientific data in usable form, even if the data are not in active use, until agreeing on disposition of those data with NARA or another agency.
In addition to the agency-administered data centers, there are educational or private concerns that hold and administer data important to one or more agencies, such as the archived data from the NOAA Geostationary Operational Environmental Satellites at the University of Wisconsin or the seismic data held by the Incorporated Research Institutions for Seismology. While some of these nonfederal archives are firmly associated with one or more federal agencies through contractual and funding relationships, in other cases a one-to-one association is less clear. It follows that a well-defined chain of responsibility must be established for all data that are to be preserved. This decision should be made by the individuals and institutions most closely associated with and interested in those data, and it should be made with due consideration for cost efficiency, appropriate expertise, scientific interest, and convenience, among other factors. Establishing a clear connection between a field archive and an agency should in no way limit the community of users served by the archive, but should ensure an orderly and secure path of responsibility for the data.
The structure of the nation's scientific and technical organizations continues to change. In some instances, institutions or even agencies will merge, while in other cases, organizations may disappear. When such changes occur, it is likely that the scientific interests formerly represented by those organizations will be subsumed by existing or new agencies or organizations. The general topology of the NSIR Federation, however, would not change.
The committee does not anticipate that the creation and implementation of the Federation will require much additional funding, if any, because it will consist primarily of improving linkages and coordination among existing data centers, archives, and related organizations within a highly decentralized management structure. Moreover, any costs incurred in this process should be more than offset by the improvements in efficiency and access to the data and related information resources.
RecommendationsForTheCreationOfTheNSIRFederation
The committee thus recommends that the federal government take the following steps for adequately preserving and providing access to data about our physical universe:
Adopt the National Scientific Information Resource (NSIR) Federation concept as an integral part of the National Information Infrastructure (NII). This concept must encompass not only an electronic network, but also individuals, organizations, communities, data resources, procedures, guidelines, and associated activities of data generation, management, custodianship, and use. The NSIR Federation should provide the foundation for defining a coherent approach to management of the life cycle of scientific data, with the goal of providing broad and effective access to all potential users as cost effectively as possible. The Federation should be developed and implemented through consensus of collaborating organizations with diverse and autonomous missions. The GCDIS, in particular, is an example of a prototype NSIR, focused on data for a specific set of interdisciplinary science problems. The NSIR Federation would build on such efforts, providing for better coordination and interaction among them, and would help organize fledgling efforts to preserve and provide access to data in other disciplines.
The administration should take the steps necessary to fully define and create the NSIR Federation. There are at least two potential focal points within the administration for planning such an activity. These are the interagency Information Infrastructure Task Force for the NII and the National Science and Technology Council. The NSIR Federation could be created in a manner similar to the creation of the Federal Geographic Data Committee and its National Spatial Data Infrastructure (e.g., through an Office of Management and Budget Circular and Executive Order), or of the Interagency Working Group on Data Management for Global Change and its Global Change Data and Information System (e.g., through legislation in cooperation with the administration). A convocation of representatives from the scientific, data and information management, and archiving communities would be a good way to define and inaugurate this initiative, focusing on the most significant issues and problems identified at the end of Chapter 2.
Following the formal authorization by the federal government for creating the NSIR Federation, the principal parties, including NARA and NOAA, should conclude agreements for the implementation of a distributed archive system. The system should involve all relevant institutions, including nongovernmental entities that are funded by the federal government or that maintain data that were acquired with federal funds. As a general principle, data collected by an agency should remain with that agency indefinitely. The committee recognizes that this recommendation may require significant operational changes for agencies other than NOAA, and even some changes with respect to NOAA's data activities. In addition, NARA should consider concluding interagency agreements to give formal recognition of this process as appropriate. Furthermore, the associated agencies in the NSIR Federation must work together, under the lead of a small, coordinating executive office with the expertise to establish data management guidelines and minimum criteria for adequate metadata that could be applied across the entire Federation. The executive office could be either a high-level interagency coordinating committee, similar to the FGDC, or a new office at an appropriate federal agency, such as the National Science Foundation, which has a broad scientific and technical as well as communication mandate. In any case, the executive office should resist the typical tendency toward bureaucratic accretion of power, personnel, and resources, and the tendency to consolidate and centralize data holdings. A management council consisting of representatives of the member organizations should be created to help ensure that the central executive function remains fully responsive to all members of the Federation.
Data access and preservation services should be implemented on the most cost-effective basis possible for the Federation. For example, one institution may provide a service to one or more other institutions in order to exploit potential economies of scale and focal points of expertise (e.g., the specialized data centers suggested in Chapter 4). This measure might increase the cost to the providing institution, but would decrease the overall cost to the Federation, the government, and the taxpayer. An example of this is the method by which backup copies of data might be kept. NARA may have at any given time the most cost-effective "vault" in which to keep physically separate backup copies of data for all agencies, and, hence, the federal government would save money by increasing NARA's budget to provide this service for the other agencies. On the other hand, if cost trade-off studies were to find that a single large "vault" is not as cost-effective as distributed facilities, then each agency would be responsible for its own backup. In all NSIR Federation activities, emphasis should be placed on control of costs, with the most successful methods used by individual members identified and shared with all other members.
The institutions belonging to the NSIR Federation should develop a process for collaborating effectively on specific initiatives. This process should provide a mechanism to define and prioritize data management and preservation initiatives, to establish the required agreements between collaborating organizations, and to secure funding for each initiative. Each participating organization would contribute to the Federation according to its particular strengths and in a manner consistent with the founding charter. In addition, an independent advisory body consisting of experts from user groups should be formed in support of each initiative.
The NSIR Federation should develop a national resource of information technology that is consistent with its chartered objectives and that can be effectively distributed to institutions that must manage data. These technologies would include complete products, designs, guidelines, standards, and methodologies. A related long-term technology strategy, or "technology navigation" function, should be developed, as suggested in Chapter 4.
The NSIR Federation should institute an independently managed process for awarding NSIR certification to member scientific institutions and their data and information systems on the basis of well-defined criteria and standards. The certification process should be managed by a nongovernmental, not-for-profit organization, which would receive technical guidance from the participating federal agencies. The certification needs to have credibility in the community so that nonmember institutions will aspire to attain certification and have it tagged to their products. The certification also should be something that commercial value-added providers will seek to increase the credibility of their products.
It also is important for the committee to state what the NSIR Federation should not be. It should not become an expensive bureaucratic entity. The executive office must not impose any standards or information technologies from above that have not been validated through a consensus process of the member organizations. Finally, the executive office must not attempt to micromanage the operations of the participants, nor should it have any direct control over their budgets and funding allocations.
Recommendations Specifically for NARA
To better fulfill its responsibilities in the long-term retention of scientific and technical data, the committee recommends that NARA strengthen its liaison with each federal agency that produces such data to ensure that appropriate attention is devoted to long-term data retention in a distributed storage environment.
As shown earlier in this report, NARA cannot today, nor will it likely ever be able to, act as the custodian of most physical science data. The data volume is too great in relation to the funding appropriated to NARA, the NARA staff do not have the necessary specialized scientific knowledge, the interagency linkages are not in place, and a huge infrastructure similar to that which already exists at other agencies would need to be duplicated at NARA. The agencies closest to the data sets and best equipped to deal with them are themselves already struggling with these issues. However, NARA does have great expertise in issues involving the long-term storage of data and the packaging requirements for data to be of value to future users.
The committee therefore believes that NARA's role should be primarily advisory or consultative, to help ensure that the agencies that are the actual custodians of data at the working level follow all the relevant federal laws and guidelines in taking care of the data. The committee suggests that scientific data and related information should go to NARA's physical possession only as a last resort, when the agency that collected the data can no longer provide access for the user community. As has already been noted, scientific data are best maintained by the agency that originally acquired those data as long as there is any regular active use. The holding agencies should collect, analyze, store, and make available the maximum feasible amount of relevant physical science data, consistent with the principles and goals set forth for the NSIR Federation and with the retention criteria and appraisal guidelines discussed above.
Currently, agencies inform NARA of their intentions for their federal records, including scientific data, through various schedules. All agencies are required to schedule records when they reach 30 years of age, although they are encouraged to do so earlier. The National Climatic Data Center even provides schedules for data that it plans to hold indefinitely, noting that intention. For most types of records, the pressure to schedule provides the useful function of preventing an agency from simply warehousing continually increasing volumes of unused records without examination. For data that an agency does not wish to destroy, but that are not frequently accessed, NARA makes available storage space without taking ownership. If NARA did not provide some worthiness test for records before agreeing to provide storage for another agency, the Federal Records Centers could become inundated with records of little value or potential for future use.
As discussed in this report, we are heading increasingly toward a system of distributed archives for electronic records. Data sets are distributed among various physical locations, and the expertise to interpret these data sets is likewise already distributed and becoming more so. The rapid increase in computer networks within the United States and in the rest of the world is beginning to significantly affect the way people access information. There is a lessening need for data users and providers to physically possess the data they need or distribute, and users are increasingly unaware of the source location(s) of the data they are accessing. NARA therefore should continue to study arrangements regarding the physical custody of electronic records, the relationship between NARA and other agencies, and how these will and should be affected by the expansion of electronic networks.
During the course of this study, the committee found that with the exception of some staff members at government data centers, many government scientists and most nongovernment scientists are not aware of the requirements of the Records Disposal Act (44 U.S.C. 3301 et seq.). Even some of those entrusted with large quantities of valuable data were largely unaware of NARA and its related responsibilities until contacted by the committee, or by its panels. This may be partially because scientists, even those within the federal government, sometimes do not respond to the bureaucratic requirements of their own institutions. The committee is encouraged that NARA is working to address this problem. Nevertheless, many panel visitors and members observed that the NARA brochures have an authoritarian and legalistic tone and are not conducive to establishing productive partnerships with NARA. NARA's future effectiveness in overseeing and advising on the archiving of scientific and technical data requires that it improve its relations with other agencies and institutions.
As a corollary, none of the committee's suggestions should be construed to imply that NARA should issue additional proclamations or regulations. The goal should be to present more carrots than sticks. For example, NARA should consider providing rewards and recognition to researchers, managers, and funders for developing and implementing successful data retention plans, with appropriate metadata. With better communications and greater sensitivity to the needs of the scientific community, NARA can play the role of a "service provider" and "appraisal consultant." For instance, NARA is already working with the DOD Legacy Resource Management Program to identify and preserve cultural resources under DOD jurisdiction. NARA and this DOD program together have sponsored a conference to assist military contractors in preserving their documentary heritage. The committee suggests that NARA pursue other such collaborations in the same spirit of partnership.
As a matter of formal responsibility and training, NARA staff are more concerned with long-term archiving issues than most staff at other agencies. NARA therefore can serve an essential role in reminding agencies of the long-term value of data and should regularly provide advice to agencies that keep scientific data on hand for extended periods of time. NARA also should conduct continuous research on retention and appraisal issues to remain well-informed. The committee recommends that NARA form standing advisory committees with managers of scientific data, historians, and scientific researchers to address the retention and appraisal of scientific and technical data collections, and related issues.
Unfortunately, NARA has almost no scientific expertise within its ranks (except related to physical records preservation). Despite the large amounts of scientific information within some federal records, NARA officials have indicated that they do not believe that they could keep a scientist on the staff interested in the work and do not plan to hire any permanent scientific personnel. Nevertheless, NARA will continue to be faced with difficult issues involving the archiving of scientific data. In the interim, the committee suggests that NARA should arrange for temporary staff assignments from the active scientific ranks of the federal government on a frequent as-needed basis. Given the great challenges that NARA will face from scientific data and the proven ability of other agencies to hold scientifically trained personnel in data management positions, NARA should rethink its position and consider creating a cadre of permanent staff with scientific expertise.
NARA also might consider setting up an in-house database to track federal holdings, especially to anticipate problems with data sets housed in other agencies that may eventually need NARA protection or other help from NARA. To do this effectively would require establishing a set of contacts in other agencies with people who understand the databases in the agency collections.
This brings us to the need for a more general locator function, or "directory of directories," for the NSIR Federation's network of networks. Archives must not be viewed or managed as data cemeteries, with only rare and dwindling visits after the deposition of data. The provision of broad access to data must be part of archive design and construction, and thus some sort of broad locator is much needed. The committee is encouraged by the recent interagency efforts, organized by the Office of Management and Budget, to develop a Government Information Locator Service. Nevertheless, there is a need for a NARA-maintained directory of archived data within its own system. This should include archived records maintained by other government agencies and federally funded institutions that are recognized as part of a distributed archive system overseen broadly by NARA. The committee recommends that NARA collaborate with other agencies that maintain long-term custody of data to develop an effective access mechanism to these distributed archives. The initial step should focus on locator systems and evolve toward a transparent access system.
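The locator idea described above can be illustrated with a minimal sketch. It assumes each member archive publishes a simple list of data set titles; the function name `locate`, the archive labels, and the titles are hypothetical and are not drawn from any existing NARA or GILS system.

```python
from typing import Dict, List


def locate(query: str, directories: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Search every member directory and report which archive holds matching data sets.

    `directories` maps an archive name to the titles it publishes (hypothetical layout).
    The result maps each archive with at least one match to its matching titles,
    so the user learns where the data are held without needing to know in advance.
    """
    needle = query.lower()
    hits: Dict[str, List[str]] = {}
    for archive, titles in directories.items():
        matched = [title for title in titles if needle in title.lower()]
        if matched:
            hits[archive] = matched
    return hits
```

A real locator would search structured metadata rather than titles alone; the point of the sketch is only that the locator returns the holding archive together with the match, which is the first step toward the transparent access system the committee describes.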
Finally, with regard to its requirements for accession of data, NARA should work with the scientific community and potential sources of scientific data to develop adaptable performance criteria for data formats and media, rather than mandating narrow and inflexible product standards. The goal would be to meet NARA's basic need to ensure long-term usability while also enabling accession of data, such as images and structures, that cannot be accommodated by NARA's current restrictive file-format and media standards.
Recommendations Specifically for NOAA
As the largest holder of earth sciences data in the United States, NOAA has a vast amount of scientific data stored at many facilities across the country. The primary storage sites are the National Data Centers, which include the National Climatic Data Center (NCDC), the National Oceanographic Data Center (NODC), and the National Geophysical Data Center (NGDC). Each of these data centers now has its own on-line information service. The data centers are accessible through common nodes, for example through NOAA's web server or NASA's Master Directory server. Thus a user who understands the structure of NOAA's data holdings can navigate through the different data centers, look for data of interest in each center's holdings, and retrieve the data over the Internet. However, it is not possible to search NOAA's data holdings with the same precision and accuracy with which one can search for bibliographic data, through, for example, the Current Contents or INSPEC databases. The diversity and volume of data that the National Data Centers hold and regularly receive make it difficult to produce an overall directory for all of NOAA's data holdings. In particular, NCDC receives daily all of the weather information for the United States. Without such a general directory it is difficult for users to query across NOAA archives to locate and integrate diverse data. Moreover, once the user finds data, the variety of storage formats and data types makes access cumbersome. Thus, the committee encourages NOAA to be ambitious. Development of a new comprehensive directory covering all NOAA's holdings of geoscience data would set the standard for other agencies and would make the data much more accessible to the public.
This directory may incorporate capabilities of the many different on-line directory services currently in use at the National Data Centers, but the emphasis should be on connectivity, data access, and information. For this reason, NOAA should concentrate first on the more recent digital data that can most easily be incorporated into such a directory system. Efforts to get older analog data digitized should continue, although some data may have to remain in their original format. An important facet of this directory is to list, along with the directory entry, how to locate and access the data. Once they have located the data of interest, most users want mainly to retrieve the data in a form that they can use for further analysis.
Thus, the directory should specify the actual location of the data, as well as the methods by which the data can be acquired. Under the present NOAA system, acquisition involves a formal ordering procedure and the transfer of funds, at least for any data that must be transferred via tape or hard copy. Experimental NOAA systems (NOAA's Satellite Active Archive) make it possible to order limited satellite imagery over the network at no cost. For those orders requiring the transfer of funds, the directory service should be able to estimate the cost of the data order so that the user can factor cost into the decision to order.
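As a rough illustration of such a directory entry, the sketch below records where a data set is held, the media on which it can be delivered, and a cost estimate for each medium. The class names, field names, and fee structure (a flat fee plus a per-megabyte charge) are assumptions made for the example; they do not describe NOAA's actual ordering system or prices.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AccessMethod:
    medium: str              # e.g. "network", "tape", "hardcopy"
    base_fee: float          # flat handling fee per order (hypothetical)
    fee_per_megabyte: float  # volume-dependent charge (hypothetical)


@dataclass
class DirectoryEntry:
    title: str
    holding_center: str      # which data center has custody
    location: str            # where the data physically or logically reside
    size_megabytes: float
    access_methods: List[AccessMethod] = field(default_factory=list)

    def estimate_cost(self, medium: str) -> float:
        """Return the estimated cost of ordering the whole data set on `medium`."""
        for method in self.access_methods:
            if method.medium == medium:
                total = method.base_fee + method.fee_per_megabyte * self.size_megabytes
                return round(total, 2)
        raise ValueError(f"{self.title!r} is not available via {medium!r}")
```

With an entry of this kind, a user could compare a free network transfer against a tape order before committing to the formal ordering procedure.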
This interconnected NOAA directory service also would assist the NOAA data centers in their management of data. By having access to tools and techniques developed at other NOAA data centers and elsewhere in the data storage community, the NOAA data centers would be better able to stay abreast of new developments and to incorporate them into their data access systems. Similarities among various earth science data and the emerging need for interdisciplinary research make it necessary to implement such an overall directory for managing NOAA data, for both data location and access. As noted earlier, NOAA already has started to develop data directories, on-line data systems, and data access.
NOAA and NASA have made progress in data rescue and in deriving better products from old data. Since 1990, NCDC has copied thousands of tapes of satellite data that were at the end of their useful shelf life. The NOAA/NASA Pathfinder program was established to make the satellite data more generally available to researchers and to calculate new products; it has been an effective program. Although the committee supports activities to preserve old data, rescued data (including data moved to better media and analog data that have been digitized) are of little value if they cannot be accessed or retrieved. The committee advocates more emphasis on improving access to data for interested users.
Most federal agencies are now aware that storage and retrieval of data are important. Problems arise because each agency, and sometimes even different parts of the same agency, sets up data centers and facilities, and each of these establishes its own type of system. In addition, because the technology for storing data changes frequently, it is difficult if not impossible to decide just what hardware and software system should be used. This uniqueness of systems often hinders system portability and the exchange of data among systems.
There are some approaches and procedures that are designed to be technology-independent and therefore can be used to avoid some of these problems. Moreover, the technological and portability requirements for archiving, storage, and transmission are different, so a "universal" format will not work. An archival format must be utterly portable and self-describing, on the assumption that, apart from the transcription device, neither the software nor the hardware that wrote the data will be available when the data are read. A storage format should be optimized for retrieving any addressable subset of a data set. A secondary, but important, consideration is the ease with which the storage format may be cast into a transmission format. A transmission format should be optimized for ease of conversion to other formats, accommodation of both data and metadata in a single data stream, portability, and extensibility (i.e., accommodating data and metadata types and structures not yet invented). Because both NOAA and NARA have a long-term archival problem, the committee suggests that they work together to locate and test hardware and software units that can be used for this technology-independent approach. By locating the simplest common technologies, it should be possible to set up systems that are sufficiently capable, yet are able to interact with each other. Once a few of these "standards" are set up and operating, it is likely that other users will want to run this suite of software. Ideally, this type of project would be best carried out under the auspices of the NSIR Federation.
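To make the notion of a self-describing format with data and metadata in a single stream more concrete, the following sketch packs a plain-text header that names and types every column ahead of the data rows, so that a reader needs nothing beyond the stream itself. The layout (one JSON header line followed by comma-separated rows) and the function names `pack` and `unpack` are illustrative assumptions, not a format proposed in this report.

```python
import csv
import io
import json
from typing import Dict, List, Tuple

# Converters the reader applies, selected by the type names stored in the header.
_CASTS = {"float": float, "int": int, "str": str}


def pack(metadata: Dict, columns: List[Dict], rows: List[List]) -> str:
    """Write metadata and data into one text stream: a JSON header line, then CSV rows."""
    header = json.dumps({"metadata": metadata, "columns": columns}, sort_keys=True)
    buffer = io.StringIO()
    buffer.write(header + "\n")
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def unpack(stream: str) -> Tuple[Dict, List[Dict]]:
    """Recover metadata and typed records using only the description in the stream."""
    header_line, _, body = stream.partition("\n")
    header = json.loads(header_line)
    records = []
    for row in csv.reader(io.StringIO(body)):
        if not row:
            continue  # skip blank lines
        records.append(
            {col["name"]: _CASTS[col["type"]](value)
             for col, value in zip(header["columns"], row)}
        )
    return header["metadata"], records
```

Because the header is ordinary text, new column attributes (units, instrument, processing level) can be added without breaking older readers, which is one simple way to approach the extensibility requirement noted above.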
Considering the foregoing discussion, the committee makes the following recommendations:
NOAA should place a higher priority on documenting and establishing directories of its data holdings.
Furthermore, NOAA, with the active cooperation of NARA, should lead efforts to better define technology-independent standards for archiving, storing, and transmitting the data within its purview.
Finally, NOAA, as well as every other federal science agency, should ensure that all its data are shared and readily available; that it fulfills its responsibility for quality control, metadata structures, documentation, and creation of data products; that it participates in electronic networks that enable access, sharing, and transfer of data; and that it expressly incorporates the long-term view in planning and carrying out its data management responsibilities.
The creation of the committee's proposed NSIR Federation would help provide a collaborative mechanism and more sustained peer pressure to meet these objectives, and thus enhance the value of scientific and technical data and information resources to the nation.
Appendix A
List of Acronyms
CD-ROM  Compact Disk-Read Only Memory
CENR  Committee on Environment and Natural Resources
DMC  Data Management Center
DOD  Department of Defense
DOE  Department of Energy
EROS  Earth Resources Observing System
ESDM  Earth Science Data Management
FGDC  Federal Geographic Data Committee
FITS  Flexible Image Transport System
GARP  Global Atmospheric Research Program
GCDIS  Global Change Data and Information System
GCRP  Global Change Research Program
GILS  Government Information Locator Service
HTML  HyperText Markup Language
IRIS  Incorporated Research Institutions for Seismology
IWGDMGC  Interagency Working Group on Data Management for Global Change
JANAF  Joint Army-Navy-Air Force
JCL  Job Control Language
NARA  National Archives and Records Administration
NCDC  National Climatic Data Center
NGDC  National Geophysical Data Center
NII  National Information Infrastructure
NOAA  National Oceanic and Atmospheric Administration
NODC  National Oceanographic Data Center
NRC  National Research Council
NSDI  National Spatial Data Infrastructure
NSF  National Science Foundation
NSIR  National Scientific Information Resource
NSSDC  National Space Science Data Center
OMB  Office of Management and Budget
PDS  Planetary Data System
PO.DAAC  Physical Oceanography Distributed Active Archive Center
SGML  Standard Generalized Markup Language
TCP-IP  Transmission Control Protocol-Internet Protocol
USGS  United States Geological Survey
USNRC  United States Nuclear Regulatory Commission
WWSSN  World-Wide Standardized Seismographic Network
Appendix B
Minority Opinion
This report has a wealth of good material in it, but I feel that I must write a minority opinion on one main issue, the committee's recommendation to create the NSIR Federation. I think that the exact functions of the NSIR Federation are still not clear enough to immediately form it, especially since mechanisms to coordinate data activities already exist.
A group such as the NSIR Federation would not be a good method to set the hardware standards that are used in data systems (networks, tapes, etc.). The coordinated part of data directory efforts can be built around present interagency work. It is reasonable that NARA should request lists of data sets intended for long-term archival, but most of the process of evaluating data sets needs to be kept close to the working level. The discussion of standardization in the report should not be interpreted to mean that all agencies and archives should be forced to adopt certain standards and rework their data holdings into a common form and format. There are other concerns for which an analysis of the issues could be useful, but I believe that the NSIR Federation requires a better description of tasks and more debate before such a new body is established. Otherwise we may have more coordination, more systems, more cost, and less data.
Consider the important task of developing information about data. Information about data sets is needed in at least two or three levels of detail. At the highest level of information, the Master Directory methods that are in place for the GCDIS can be adopted (or simplified even further) to describe the data sets. This interagency Directory Interchange Format (DIF) is used nationally and internationally. We need to keep it simple enough so that people will submit the information. Some agency-level catalog efforts for data sets have existed since about 1968, and became more serious in the late 1970s. We should build on the GCDIS catalog efforts, and certainly not invent more complicated systems. Other data information efforts are needed, but they will be based on a bottom-up flow of ideas, on workshops, and the like. Each data system does not have to do exactly the same thing, but they must be easy to use. It is not clear that a formal NSIR Federation is needed to coordinate this.
How does the NSIR Federation relate to other data coordinating mechanisms? The Interagency Working Group on Data Management for Global Change (IWGDMGC) meets regularly to help coordinate data issues across many "global change" disciplines, which include air, water, ice, rocks, soils, and some biology. It seems to me that the IWGDMGC and the proposed NSIR Federation are mainly trying to do the same thing. They cover much of the same turf in terms of disciplines. They both want information about data, access to data, and data that will exist for more than 20 years. If we create separate organizations doing roughly the same thing, then it becomes even less likely that key agency