#aacdata
Always Already Computational: Library Collections as Data
National Forum Position Statements
March 2017
Image Credit - Library by Nikita Rozin

Jefferson Bailey, Alexandra Chassanoff, Tanya Clement, P. Gabrielle Foreman, Dan Fowler, Harriett Green, Jennifer Guiliano, Juliet L. Hardesty, Christina Harlow, Greg Jansen, Matthew Lincoln, Alan Liu, Richard Marciano, Matthew Miller, Labanya Mookerjee, Anna Neatrour, Miriam Posner, Sheila Rabun, Mia Ridge, Hannah Skates Kettler, Ben Schmidt, David Seubert, Laila Shereen Sakr, Tim Sherratt, Timothy St. Onge, Santi Thompson, Kate Zwaard


Always Already Computational: Library Collection as Data, March 2017

The position statements that follow were prepared in advance of the Institute of Museum and Library Services supported Always Already Computational: Library Collections as Data national forum. Participants were asked to respond to the following prompt:

Leading up to the forum, [we] ask that you write a brief position statement derived from direct or related experience salient to the scope of work described in Always Already Computational. We welcome bridging, divergence, and provocation. Is there something concrete or conceptual we are missing? Are there projects and initiatives this work should be connected to? Are there questions and communities we aren’t currently considering? This is an opportunity to highlight aspects of your experience that relate to the project and will to some extent help stage interaction at the face-to-face meeting - and beyond - as the project team works to iteratively refine forum outputs in a range of professional and disciplinary communities.

Perspectives represented in the position statements highlight the many directions collections as data work could go. The statements will certainly inform the work of the forum, and consequently the iterative community based development of project outcomes.

Thomas Padilla, Laurie Allen, Stewart Varner, Sarah Potvin, Elizabeth Russey Roke, Hannah Frost


Pseudodoxia Data: our ends are as obscure as our beginnings
Jefferson Bailey, Internet Archive

Experiencing Library Collections as Data
Alexandra Chassanoff, Massachusetts Institute of Technology

Unsolved Problems in the Humanities Data Generation Workflow: Digitization Complexities, Undiscoverable Audiovisual Materials, and Limited Training for Information Professionals
Tanya Clement, University of Texas Austin

Computing in the Dark: Spreadsheets, Data Collection and DH’s Racist Inheritance
P. Gabrielle Foreman and Labanya Mookerjee, University of Delaware

Frictionless Collections Data
Dan Fowler, Open Knowledge Foundation

Bookcarts of Data: Usability and Access of Digital Content from Library Collections
Harriett Green, University of Illinois at Urbana-Champaign

Historical Complications of/for Open Access Computational Data
Jennifer Guiliano, Indiana University–Purdue University Indianapolis

Identifying Use Cases for Usable and Inclusive Library Collections as Data
Juliet L. Hardesty, Indiana University

Emerging Memory Institution Data Infrastructure in the Service of Computational Research
Christina Harlow, Cornell University

On the Computational Turn in Archives & Libraries and the Notion of Levels of Computational Services
Greg Jansen and Richard Marciano, University of Maryland

Partnership Recommended – The case of curating research data collections
Lisa Johnston, University of Minnesota Libraries

Ways of Forgetting: The Librarian, The Historian, and the Machine
Matthew Lincoln, Getty Research Institute

Assessing Data Workflows for Common Data 'Moves' Across Disciplines
Alan Liu, University of California Santa Barbara

At the intersection of institution and data
Matthew Miller, New York Public Library

Metadata and Digital Repository Accessibility Issues for Library Collections as Data
Anna Neatrour, University of Utah


Actually Useful Collection Data: Some Infrastructure Suggestions
Miriam Posner, University of California Los Angeles

Interoperability and Community Building
Sheila Rabun, International Image Interoperability Framework (IIIF) Consortium

From libraries as patchwork to datasets as assemblages?
Mia Ridge, British Library

Maintaining the ‘why’ in Data: Consider user interaction and consumption of library collections
Hannah Skates Kettler, University of Iowa

People and machines both need new ways to access digitized artifacts nonconsumptively
Ben Schmidt, Northeastern University

Repurposing Discographic Metadata and Digitized Sound Recordings as Data for Analysis
David Seubert, University of California Santa Barbara

The Library as Virtual Reality: A Worldbuilding Approach
Laila Shereen Sakr, University of California Santa Barbara

The struggle for access
Tim Sherratt, University of Canberra

Implications for the Map in a 'Collections as Data' Framework
Tim St. Onge, Library of Congress

Considering the user
Santi Thompson, University of Houston

Building Institutional and National Capacity for Collections as Data
Kate Zwaard, Library of Congress


Pseudodoxia Data: our ends are as obscure as our beginnings

Jefferson Bailey, Internet Archive

In his meditation on oblivion and regeneration, W.G. Sebald writes, “on every new thing there lies already the shadow of annihilation.” Contemplating collections as data evokes a similar correlation -- one where transformation (“this as that”) is less a process of alteration and more one of extraction of key, but possibly opaque, preexistent characteristics (“these from those”). When we consider the computational availability of collections, we begin from a perspective in which collections are an amalgamation of fragmentary elements -- and their decomposition is neither affordance nor flaw, but instead a natural state of flux that allows them to be contextualized anew through a continual state of reconstitution and derivation. This prevailing logic of decomposition distinguishes collections not as data but instead as pieces and processes, with attendant opportunities and entanglements -- collections and data become inseparable, commingled not in operation but instead via a type of consanguinity. Likewise, our services supporting computational access to data should match this latent consanguinity.

As a large-scale, online digital library that is also a mission-driven, nonprofit technology developer, the Internet Archive has long approached collections as data. Being fully online, with no physical reference collections other than those intended for digitization, collections and data are so intertwined as to be indivisible, either in concept, technology, or use. The Internet Archive’s collections include more than 30 petabytes of unique data, and the Archive has supported computational use of these collections since its beginning, from projects as wide-ranging as semantic analysis of television closed-caption transcripts to network graph study of linking behavior of hundreds of terabytes of web data. In addition, and as a self-sustaining non-profit, the Internet Archive has facilitated this type of research through a service-oriented and sustainable program development approach.

Developing data-driven approaches to access and binding them to scalable, sustainable programs has elucidated many of the obstacles and potential solutions that emerge from this work. Questions that have emerged:

● How can computational research services create better pathways to interpretation through tools and methods for the smooth traversal between “reduction and abstraction” inherent in derivation and aggregation?

● How can new access models help researchers have greater comfort with technical mediation at multiple levels and with an increasing distance between the granularity and totality of the object(s) of study?

● How can programs address the challenges still inherent, even with derived datasets, of limited technical proficiency and local infrastructure?

In testing multiple models internally, and surveying and collaborating with similar efforts in the community, we developed a loose typology of program models for research services, oriented towards, but not exclusive to, very large born-digital collections such as web archives.

● Bulk Data Model: The totality of a domain, global-scale crawl, or large born-digital collection is transferred to researchers via data shipped on drives. Analysis takes place locally, usually in a researcher’s own high-performance computing environment.

● Cyberinfrastructure Model: A custodial/archival institution provides free/subsidized access to its own computing environment that is pre-loaded with data, VMs, and other tooling. Researchers can do analysis in this remote environment and export results.

● Roll Your Own Model: Researchers receive support, generally in the form of funded or sponsored services, to create their own tools and leverage existing data platforms for candidate collection building and analysis.

● Programming Support Model: Researchers, generally non-technical, are given time with specialized technical support staff (engineers) to collaboratively build or aggregate datasets and perform analysis.

● Middleware Model: The creation of specific tools and platforms that operate between data hosted with a custodian and advanced analytics tools maintained externally.

● Derivative Model: Provide pre-defined datasets that contain key extracted, derived, or pre-analyzed data culled from specific resources. The derived datasets support specific research questions, are fungible, and align data and delivery with researcher need.

While the Internet Archive has pursued many of these models, the most flexible and scalable has proven to be the derivative model, in which key elements are extracted from primary resources and packaged in simple, easy-to-use datasets. This preference was the result of many lessons learned in working to support computational use of extremely large digital collections.

● Services for computational access are more successful when built on top of, or expanded from, pre-existing internal systems, processes, and infrastructure. Modular, generalized, and interoperable services are preferred, and boutique services don’t scale.

● Research services should be flexible and, most importantly, content delivered should be disposable to the providing institution and be able to be recreated by existing, ongoing pipelines or frameworks.

● Focus on derivation (extract desired data from origin), portability (processes should work on multiple content types or in many areas of the workflow), and access (ease of transfer of data to recipient and ease of use by the recipient).

● Focus on scalable partnerships & decentralization in research service support.

● Researcher expectations often are not aligned with available custodial resources or services, and research methodologies (conceptual, practical, technical) often are not aligned with target data characteristics, acquisition methods, or management tools.

● Service models must be self-sustaining and scale. No “grant then gone.”

● Continually orient towards mutually reinforcing work, be it with collaborators or researchers, and always allow for generality, in partners, technologies, and models.

Discovering how these lessons and approaches match, contest, or augment the findings of other efforts will be a particularly informative result of the “Collections as Data” forum.
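As an illustration of the derivative model described in this statement, the following is a minimal Python sketch, not the Internet Archive's actual pipeline. It reduces a set of web pages to a small, disposable dataset of host-to-host links that could be regenerated from the primary resources at any time. The sample pages, the function names (`derive_host_links`, `write_csv`), and the choice of a two-column CSV are all assumptions made for the example.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def derive_host_links(pages):
    """Reduce (url, html) pairs to sorted, de-duplicated (source_host, target_host) rows."""
    edges = set()
    for url, html in pages:
        parser = LinkExtractor()
        parser.feed(html)
        source_host = urlsplit(url).hostname
        for href in parser.hrefs:
            # Resolve relative links against the page URL before reading the host.
            target_host = urlsplit(urljoin(url, href)).hostname
            # Skip links without a host (e.g. mailto:) and links within the same host.
            if target_host and target_host != source_host:
                edges.add((source_host, target_host))
    return sorted(edges)


def write_csv(rows):
    """Package the derived rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_host", "target_host"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical stand-in for primary resources such as archived web captures.
SAMPLE_PAGES = [
    ("http://example.org/index.html",
     '<a href="http://example.com/a">A</a> <a href="/local">B</a>'),
    ("http://example.com/a",
     '<a href="http://example.org/">home</a> <a href="mailto:x@example.net">mail</a>'),
]

if __name__ == "__main__":
    print(write_csv(derive_host_links(SAMPLE_PAGES)), end="")
```

Because the derived file is rebuilt entirely from the source pages, it can be discarded and recreated by the same pipeline, which is the property the lessons above ask of delivered content.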


Experiencing Library Collections as Data

Alexandra Chassanoff, Massachusetts Institute of Technology

Recent empirical research has confirmed that digital tools and technologies are fundamentally changing how scholars work.[1] Yet the inverse of this relationship has received little attention – how is infrastructure changing to support emergent scholarly practice?[2] As you note in your grant narrative, “Predominant digital collection development focuses on replicating traditional ways of interacting with objects in a digital space.” Indeed, much of the research examining how scholars find, access, and use materials in digital collections has paid little attention to qualitative factors about the interaction between collection users and environmental aspects.[3]

My doctoral research focused on this problem – exploring how scholars were searching for, accessing, and using digitized archival photographs as forms of historical evidence. An underlying objective of my research was to explore the interpretive and evaluative practices that scholars bring to bear on non-textual objects of humanistic inquiry. The intent was to think about how digitized photographs can function as data, and to provide a perspective on what makes interactions meaningful for scholars working with digital materials.

In my role as the project manager on the BitCurator and BitCurator Access projects, I worked with scholars and archivists to develop approaches and methodologies for accessing and using born-digital materials. At the close of each project, I recall thinking that technology was hardly the difficult part of our work. Rather, the challenges we faced seemed to be conceptual in nature. How might we envision ways to access born-digital materials? Relatedly, how might we use born-digital materials in our research? What kinds of questions could be asked and answered from examination of contents of the so-called black box?
It seems that we face a similar challenge in considering library collections as data. I am grateful that this forum is explicitly seeking to address this gap, particularly through the enlistment of a diversity of players in the cultural heritage community. Technologists, librarians, museum professionals, archivists, and scholars will contribute important and unique perspectives to this conversation. Strategic approaches that facilitate access to, and preservation of, library collections as data will need to consider the constant and shifting interplay between infrastructure and emergent scholarly practices. For example, recent research has shown that scholars are using Google Image Search to locate archival photographs. Traditional archival design approaches may not accommodate the serendipitous possibilities of digital space.

In thinking about ways to facilitate use and reuse, I hope to draw on my current research as a CLIR/DLF Software Curation Postdoctoral Fellow. Since October, I have been working at the MIT Libraries to investigate and make recommendations for how institutions can manage software as complex digital objects across generations of technology. Software is another type of “data”, albeit one with implicit constraints for access, use and reuse. Researchers rely on software for a variety of research activities – as a subject of research itself, a way to operationalize methods, or to reproduce and validate previous results. Institutions are increasingly tasked with activities related to the active management of software: from creation through use, dissemination, preservation and reuse. Institutional approaches to software collection development must consider software in a variety of contexts: at an intellectual level (e.g. selection and appraisal); in planning for and designing repositories, platforms, services; and in developing staff competencies.

How can we accommodate the fluid and rapidly changing practices which characterize the current scholarly landscape? The results of my dissertation research suggest that one part of the puzzle might be to develop an understanding of the factors and qualities that make experiences meaningful in different kinds of interactions. For example, what is it about the experience of (digitized) oral histories that makes them accessible and usable? Rather than focusing on delivery mechanisms or crafting explicit methodological approaches, we might do well to consider the myriad ways in which specific types of materials in digital library collections can be experienced.

Works Cited
[1] Alexandra Chassanoff, “Historians and the Use of Primary Source Materials in the Digital Age,” The American Archivist 76, no. 2 (2013): 458-480; Jennifer Rutner and Roger C. Schonfeld, Supporting the Changing Research Practices of Historians, Final Report from ITHAKA S+R (2012), 11.
[2] The important relationship between infrastructure, technology, and scholarship is explored in Christine Borgman’s Scholarship in the Digital Age: Information, Infrastructure and the Internet (Cambridge: MIT Press, 2007).
[3] Two notable exceptions in the field of Library and Information Science (LIS) are: Marcia Bates, “The Cascade of Interactions in the Digital Library Interface,” Information Processing and Management 38, no. 3, 2003; Christopher A. Lee, “Digital Curation as Communication Mediation,” in Handbook of Technical Communication, ed. Alexander Mehler, Laurent Romary, and Dafydd Gibbon (Berlin: Mouton De Gruyter, 2012), 507-530.


Unsolved Problems in the Humanities Data Generation Workflow: Digitization Complexities, Undiscoverable Audiovisual Materials, and Limited Training for Information Professionals

Tanya Clement, University of Texas Austin

Digital Humanities has changed rapidly from a field in which we primarily build and create access to resources in the humanities to a field in which we deploy analytics on those resources in accordance with a general move to data analytics. The Always Already Computational initiative is taking an essential step towards bridging the first activity (digitization) to the second (analytics) by focusing on how we structure, bundle, and disseminate digitized or born digital collections and metadata on such collections. This is important and much needed work, but there are three main areas of concern or “unsolved problems” that I would like to introduce into the conversation for the consideration of the group: (1) digitization workflows; (2) AV metadata; and (3) pedagogy in terms of training information professionals about data science, data analytics, and data visualization.

Digitization workflows are where much library collections “data” such as descriptive or technical metadata are born, but these workflows are complicated processes that include selecting collections; establishing performance goals based on standardized measurement protocols; developing efficient test plans; and taking corrective action to maintain quality. Even as cultural heritage institutions continue to rapidly digitize and refine these workflows, our knowledge about new approaches to digitization standards, to schemas for the semantic web, and to increasing our regard for issues of diversity and inclusivity in the digitization of cultural heritage artifacts continues to evolve. Newly issued guidelines from FADGI[1] – an initiative incorporating many entities at the Library of Congress – challenge librarians and archivists to improve image quality precisely when pressures to digitize everything including collections that embody inclusivity are building. Consequently, much of the metadata that we may use in a data framework has been generated during an evolving and complex digitization process, which is often a time of increased one-time funding for the specific digitization job.
To what extent will the guidelines that we generate during Always Already Computational take digitization workflows into account? Can we advise libraries and archives on how an understanding of an eventual data framework can be integrated into these workflows such that when requests for funding are made our colleagues can anticipate generating the kinds of data that we will need for a data access environment?

Second, and a case in point for the first “unsolved” problem, audiovisual materials are notoriously underrepresented in digital humanities precisely because they often lack the detailed data (or metadata) that supports their effective discovery, identification, and use by researchers, students, instructors, or collections staff. In recent years, increased concern over the longevity of physical AV formats due to issues of media degradation and obsolescence, combined with the decreasing cost of digital storage, has led libraries and archives to digitize recordings for purposes of long-term preservation and improved access. However, unlike textual materials, for which some degree of discovery may be provided through full-text indexing, AV materials that lack detailed metadata cannot be found, understood, or consumed. Most open source and commercial efforts that attempt to generate computationally-assisted metadata and to facilitate improved discovery are narrow in focus, non-scalable, developed as standalone tools, and do not address the rights and permissions that collections staff must consider for creating access. Because of the complicated morass of technical and social issues that limit AV discovery, descriptive access to audiovisual objects at scale would require a variety of mechanisms for analysis that would need to be linked together with tasks involving human labor in a recursive and reflexive workflow platform that could eventually facilitate compiling, refining, synthesizing, and delivering metadata. Colleagues from Indiana University and AVPreserve and a team of researchers at UT including myself are in the process of developing such a workflow platform, which would allow libraries and archives to bring together and use task-appropriate tools in a production setting. This work is in direct conversation with the kind of framework that Always Already Computational is proposing, but we believe that AV needs, which include generating data about AV materials as a solitary means of providing access to materials that may never (because of privacy and copyright concerns) be publicly accessible, are distinct from, though complementary with, those needs that correspond to generating data for text collections.

Third, while information literacy is today a routine goal of library instruction, data work that includes enabling data discovery and retrieval, maintaining data quality, adding value, and providing for re-use lags as a topic.[2] If the library is the laboratory of the humanities, this lag impacts how the digital collections that librarians curate are used in the humanities. Rigorous data work requires data “carpentry” knowledge that considers validity, reliability, and usability as well as critical literacies more generally such as data quality, authenticity, and lineage, but humanists and librarians are not traditionally trained on evaluating these aspects of data. The corresponding difficulty of training students and professional academic librarians lies in the ever-evolving nature of data work, which must respond to changing standards and needs in the context of increasing data in the humanities and of changing infrastructures in libraries. There is work being done in this space including the Data Science Curriculum Project, which is meeting just after the Always Already Computational meeting in Washington DC with representatives from the American Statistical Association (ASA), the ASA Business-Higher Education Forum (BHEF), the Association for Computers and the Humanities (ACH), the Association for Computing Machinery (ACM), the Association for Information Systems (AIS), the IEEE Computer Society (IEEE-CS), INFORMS, the iCaucus, EDISON, and the American Association for the Advancement of Science (AAAS). As well, many programs in Data Science have emerged in recent years at many universities and in many iSchools, but there are few programs of study that focus specifically on teaching students with concerns shaped by the humanities in the context of humanities collections. Conversations on data science pedagogy are needed to ensure the integration of up-to-date resources, theories, and practices in data work in a curriculum that will be geared towards inclusivity and teaching the next generation of our digital workforce about data preparation and analysis in the humanities. Again, this work is directly relevant to the Always Already Computational conversation since the data framework proposed requires practitioners who also have some training in data work.

Works Cited
[1] Federal Agencies Digitization Guidelines Initiative. Technical Guidelines for the Still Image Digitization of Cultural Heritage Materials. September 2016. http://www.digitizationguidelines.gov/.
[2] Association of College and Research Libraries. Working Group on Intersections of Scholarly Communication and Information Literacy. Intersections of Scholarly Communication and Information Literacy: Creating Strategic Collaborations for a Changing Academic Environment. Chicago, IL: Association of College and Research Libraries, 2013.


Computing in the Dark: Spreadsheets, Data Collection and DH’s Racist Inheritance

P. Gabrielle Foreman and Labanya Mookerjee, University of Delaware

Living in a nation of people who decided that their world view would combine agendas for individual freedom and mechanisms for devastating racial oppression presents a singular landscape.
- Toni Morrison, Playing in the Dark

Early on in the “Always Already Computational” abstract this assertion appears, underscoring a central assumption of the project: “predominant digital collection development focuses on replicating traditional ways of interacting with objects in a digital space. This approach does not meet the needs of the researcher, the student, the journalist, and others who would like to leverage computational methods and tools to treat digital library collections as data.” Not only do the protocols and development of digital collections, of interacting with objects, not meet the needs of various users (let’s call them people or communities) who interact with “objects in digital spaces,” the lexicon itself reproduces particularly freighted ideas for Black communities of researchers and students, many of whose ancestors entered the West as chattel property, as people who were both called objects and “leveraged,” that is bartered, mortgaged, sold and listed as such. In the US, this is true for the almost 250 years of municipal, census, and other records which make up collections and archives during slavery, for records that document the debt peonage that characterizes Jim Crow, and, one might argue, for ways in which Black people are accounted for in a prison industrial complex that again treats members of communities as things to be categorized, as surveilled and recorded objects.
The lexicon of digital collections extends the freighted, fretted, relation of categorization and data collection, to Black subjects and Black subjectivity. The term “item,” like “object,” again recalls the ways in which Black people appear/ed in public records: as items on manifests, as “losses” on insurance claims, and again as items for sale in newspapers or to be distributed in probate. “Fortune” was an 18th-century Connecticut enslaved man whose very name announces his relation to the capital production, the wealth and fortune, he was meant to produce for his enslaver, Dr. Preserved Porter (this is not a typo). When the doctor died not long after he did, Fortune appears in probate records as a skeleton the doctor made from his body, claiming him in death as in life, and literally transforming him into both material object and intellectual prop and property. Fortune’s own wife, Dinah, still enslaved by the family, was worth less as a living, sentient, being in those records than her husband’s skeleton, a skeleton she may have had to dust or clean, the bones of a husband she could not bury.

Likewise, the spreadsheet opens up complex analogies to the ledger, as Labanya Mookerjee, a former exhibits committee co-chair for the Colored Conventions Project, writes in her “Disrupting Data Viz. & the Colored Conventions Project: Interrogating Data Management Methods through Disability Studies,” a piece she wrote and published on tumblr for a graduate seminar led by P. Gabrielle Foreman. Storing data in spreadsheets powered by programs such as Microsoft Excel introduces an additional layer of complications; spreadsheets, as bookkeepers of capitalism, can be traced directly to the history of slave trader ledgers. The violence of this history runs the risk of being replicated if we continue to use conventional methods of storing data. As many DH critics have now pointed out, the institutional power invested in the process of data collection (the prelude to data visualization) can be discussed alongside conversations on the power in the production of the archive. Computational activity “is contingent on the availability of collections that are tuned for computational work (Hughes 2014),” as the Always Already Computational abstract asserts. “Suitability is predicated on form, integrity, and method of access (Padilla 2016).” This points us to the hegemonic logic guiding the selective operations in knowledge production that has been interrogated through studies on the archives (Trouillot) and in data visualization (Drucker). Both Trouillot and Drucker make a DH community (attuned to archive production as well as archive availability) aware of the need to name the difference between “capta” and “data” and to challenge and counter the institutional powers that authorize “credibility” or “suitability” (Padilla).

Datasets, when constructed using conventional methods of data collection and organization, run a similar risk of activating institutional power and defining “credibility,” especially when the data is procured from traditional archival sources that too often excise, anonymize and erase certain subjects, transmogrifying them in turn into (almost invisible, ghosting) “objects” and “items.” Two examples from the Colored Conventions movement obtain. First is the challenge of including Black women whose names and participation are excised when we use traditional methods of collecting and naming data (from the lists of thousands of delegates over seven decades). Curating a dataset that is reflective of the actual history of women’s involvement has prompted CCP to revisit the logic used to develop the parameters of what qualifies as “participations,” extending the definition of participation from appearing in the minutes, to attendance at the gatherings, and to hosting and curating conversations (following Psyche Williams-Forson) at boarding houses, eateries etc. where women’s presences or imprints appear. A second example is the work that Jim Casey, co-founder of CCP, has done on social network analyses and data visualization between Colored Conventions and The Underground Railroad showing a surprising lack of overlap and co-attendance. “All of this data is vexed,” asserts Casey, “shaped by centuries of decisions based on racial hierarchies about what to record, store, and reproduce.” Casey uses Siebert’s “Directory of the [3000] Names of Underground Railroad Operators” included in his Underground Railroad (1898), and Boston Public Library’s Anti-Slavery Collection Data. These sources hew to a historical imaginary that places whites at the center of the UGR and that excises Black leadership and involvement; a corrective has just begun to appear in recent scholarship and has not yet produced a directory. Based on racially hegemonic raw data, the co-attendance visualizations don’t capture Black UGR involvement by default.

This leads us to this set of questions. How do we account for (new, collective) data collection that accounts for haunting imprints and outright absences in the archives upon which we depend? What are the implications of a lexicon and set of practices/tools that rely upon and reproduce a colonial language of power and entitlement in the digital humanities as we think collectively about best practices to “leverage computational methods and tools to treat digital library collections as data”?


Frictionless Collections Data

Dan Fowler, Open Knowledge Foundation

Data Package is a containerization format for all kinds of data. It provides a framework for “frictionless” data transport by specifying useful metadata that allows for greater automation in data processing workflows. The aim is to provide the minimum amount of information necessary to transfer data from one researcher to another, and, likewise, one data analysis platform to another. After several years developing these specs for general use, it is worth directly examining the extent to which library and museum collections data are amenable to this approach.

New approaches to publishing library and museum collections data are necessary. Such data, released on the Internet under open licenses, can provide an opportunity for researchers to create a new lens onto our cultural and artistic history by sparking imaginative re-use and analysis. For organizations like museums and libraries that serve the public interest, it is important that data are provided in ways that enable the maximum number of users to easily process them. Unfortunately, there are not always clear standards for publishing such data, and the diversity of publishing options can cause unnecessary overhead when researchers are not trained in data access/cleaning techniques.

One approach for publishing collections data is via an API (Application Programming Interface) on a record-by-record basis. This approach has its advantages: the data is likely structured and well described. However, these services may not map directly to the types of queries or analyses researchers need to run. Further, for both the researcher and publisher, it can be tedious and costly to provide large amounts of collections data delivered record-by-record. For certain use cases, it is preferable to publish data in bulk format in open standards like CSV or JSON. The Metropolitan Museum of Art and Tate Gallery, for instance, have released their collections data as sets of text-based files on GitHub. In this approach, associated documentation is provided via files named by convention, for example, “README” or “LICENSE”. This method of publishing allows users to load data into their own tools without the overhead of programming against an API.
Documentation for data published in bulk is often ad hoc. There is often no clear or rigorous documentation of the fields (what types of data are in each column). Reading such data into data analysis programs using the built-in CSV ingest mechanisms yields data divorced from context: common date and boolean (“TRUE/FALSE”) columns must be explicitly assigned as such, numeric identifiers may be incorrectly loaded as integers, etc. These datasets are often exported from in-house collections database software, and small errors in the translation of these often large datasets may go unnoticed.

Data Packages for Collections

Frictionless Data, developed in the open by Open Knowledge International and members of the open data community, is an ideal framework for publishing this type of bulk data. The Data Package format, requiring only the addition of a descriptor file called datapackage.json, provides a minimally invasive, but standardized way to provide clear and machine-readable metadata. Datasets created as Data Packages can later be easily exposed as APIs given the wealth of metadata provided.

As an example, the Carnegie Museum of Art in Pittsburgh, Pennsylvania has provided its collections data as a downloadable Data Package. Providing the data in this format yields several benefits:

1. Users are provided with useful metadata to allow for easy import into their preferred analysis tool. These explicitly defined column types and metadata can eliminate some of the tedious work involved in “wrangling” a dataset.

2. Publishers can use tooling like GoodTables to automatically validate data.

3. Basic documentation for how to use the dataset (e.g. what columns mean) can be automatically created from structured metadata.

4. Collections data can be licensed in a machine-readable manner.

5. In the absence of Data-Package-aware tooling, the original data can be read/written as usual.
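To make the first and fourth benefits concrete, the following is a minimal sketch of a descriptor and of how explicit column types can drive a CSV import. It is not the official Frictionless Data tooling: the sample columns, the dataset name, the license value, and the `load_rows` helper are all hypothetical, and the casting rules are a deliberately simplified stand-in for what the specification defines.

```python
import csv
import io
import json
from datetime import date

# Hypothetical sample data; the column names are invented for illustration.
CSV_TEXT = """object_id,title,acquired,on_view
000123,Untitled (Study),1975-03-02,TRUE
004560,Landscape,1982-11-15,FALSE
"""

# A minimal descriptor, roughly as it might appear in datapackage.json.
# The name and license are assumptions, not taken from any real package.
DESCRIPTOR = {
    "name": "example-collection",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "objects",
            "path": "objects.csv",
            "schema": {
                "fields": [
                    {"name": "object_id", "type": "string"},
                    {"name": "title", "type": "string"},
                    {"name": "acquired", "type": "date"},
                    {"name": "on_view", "type": "boolean"},
                ]
            },
        }
    ],
}

# Simplified casting rules for this sketch only.
CASTERS = {
    "string": str,
    "integer": int,
    "date": date.fromisoformat,
    "boolean": lambda value: value.strip().lower() == "true",
}


def load_rows(csv_text, descriptor, resource_name):
    """Parse CSV text, casting each column by the type declared in the descriptor."""
    resource = next(r for r in descriptor["resources"] if r["name"] == resource_name)
    types = {f["name"]: f["type"] for f in resource["schema"]["fields"]}
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({col: CASTERS[types[col]](val) for col, val in raw.items()})
    return rows


rows = load_rows(CSV_TEXT, DESCRIPTOR, "objects")
descriptor_json = json.dumps(DESCRIPTOR, indent=2)  # what would be saved alongside the CSV
```

Because `object_id` is declared as a string, its leading zeros survive the import, and the date and boolean columns arrive as typed values rather than raw text.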

Over the course of this year, with the continued support of a grant from the Sloan Foundation, we are looking to work with researchers and institutions across a variety of fields to pilot the use of the specifications. This may involve building tools and writing guides to analyse, validate, and/or visualize collections data. Through this process we hope to improve the specifications more generally while also providing useful tooling for researchers in digital humanities.


Bookcarts of Data: Usability and Access of Digital Content from Library Collections

Harriett Green, University of Illinois at Urbana-Champaign

Not all of the data we create or purchase for Library collections comes in neat multi-gigabyte packages of ordered files: We recently discovered that datasets we had purchased as part of a database licensing negotiation were more shelf ready than machine ready: They currently exist as stacks of hard drives, discs, and other bewildering formats sitting on a book cart. How do we provide access to these data collections?

In my extensive work with research teams, graduate students, and faculty members to obtain, generate, and transform data derived from collections in the University of Illinois Library and far beyond, the question of access and usability consistently rises to the fore. Thus, I would ask, how can we conceptualize the full spectrum of data usability? It is not enough for us to digitize the collection materials and for the data to exist on someone’s server: Usability ranges from data formats and tool interoperability to the negotiated permissions and rights for researchers to share and manipulate data as they engage in analytic workflows.

Data usability means developing data models that take into account the actions that will be performed on our data. In determining the different types of data models that we can build and implement into our collections, we must consider how humanists and social scientists effectively work with data in their research and teaching.

My work with the HathiTrust Digital Library and HathiTrust Research Center has seen this practice: The HTRC has attempted to meet various expertise levels and needs of users in enabling access to the data: On the newcomer end of the spectrum, we provide fully guided access to gathering and using data through our Workset Builder and the Portal with its pre-set algorithms. But researchers frequently express the need for larger-scale data that is more pliable and manipulatable, so the HTRC developed the Extracted Features datasets that allow researchers to generate highly customized and curated datasets. But the barriers to accessing this data can be high in terms of the skill sets needed to both access and use the data.
My research explorations on scholarly research practices also have shown me that data usability is critical:

Our research for the HTRC’s Workset Creation for Scholarly Analysis project examined researcher requirements for textual corpora to be useable for research (Fenlon et al. 2015, Green et al. 2014). Our interviews with scholars revealed that the core areas of concern for researchers included the conceptualization of collections as reusable datasets and resources for scholarly communications; the ability to break apart collections into various levels of granularity to generate diverse objects of analysis; and the need for enriched metadata. We proposed building out the data model of the “workset,” the HTRC-specific term for textual corpora that researchers build.

Our subsequent user study for HTRC User Requirements (Green and Dickson, 2016) gave further insights on how researchers used textual corpora and the scholarly practices that shape their needs for being able to work effectively with text collections in the HathiTrust Digital Library, as well as overall. We learned that scholarly practices and notable challenges when working with our textual collections included the ability to acquire and structure the data; the need for a space to work with various tools and generate results; the ability to share data for research collaborations; and the role of data in teaching and training.

And my recently concluded research study for Emblematica Online explored how scholars engaged with the digitized emblem books drawn from leading rare book collections at Illinois, HAB Wolfenbuettel, University of Glasgow, Duke, and the Getty Institute. In my examination of how scholars engaged with these multi-institutional collections, their metadata, and the interlinked digital content through interviews and usability testing sessions, we found that the expectations of users when exploring digital collections are complex: They range from the basic need for high-quality reproductions, which Emblematica was praised for by all participants; to advanced scholarly concerns such as the ability to distinguish between the types of archival content they are perusing (emblem books versus emblems themselves) and the historical particularities of this specialized genre of emblem studies. Respondents frequently expressed the need for context, annotated content, and other functionalities that would allow them to fully engage with the emblem books as an archival source and scholarly area. We considered that this may reveal the needs of interdisciplinary scholarship as researchers take advantage of easy access to vast digital collections of content: The scholarly knowledge base with which users approach digital collections varies widely, and an effective digital collection must welcome all levels and inculcate them into the scholarly domain of the collection.

These are some of the findings from my work to examine what researchers’ needs are as they engage with our Library collections in digital formats and make use of these materials as data. This Forum’s discussion can provide critical new avenues for exploring how collections can be accessible, browseable, and extensible for addressing a diversity of emergent uses in research and teaching.

Works Cited

Fenlon, K., Senseney, M., Green, H., Bhattacharyya, S., Willis, C. and Downie, J. S. (2014). Scholar-built collections: A study of user requirements for research in large-scale digital libraries. Proceedings of the American Society for Information Science & Technology 51(1), 1–10. doi:10.1002/meet.2014.14505101047

Green, H. E., Fenlon, K., Senseney, M., Bhattacharyya, S., Willis, C., Organisciak, P., Downie, J. S., Cole, T., and Plale, B. (2014). Using Collections and Worksets in Large-Scale Corpora: Preliminary Findings from the Workset Creation for Scholarly Analysis Prototyping Project. Poster presented at iConference 2014, Berlin, Germany.

Green, Harriett, Eleanor Dickson, and Sayan Bhattacharyya. “Scholarly Requirements for Large Scale Text Analysis: A User Needs Assessment for the HathiTrust Research Center.” Digital Humanities 2016 Proceedings, Krakow, Poland, July 11–15, 2016.

Green, Harriett, Mara Wade, Timothy Cole, and Myung-Ja Han. 2015. “User Engagement with Digital Archives: A Case Study of Emblematica Online.” In Creating Sustainable Community: The Proceedings of the ACRL 2015 Conference, edited by Dawn Mueller, 177–187. Chicago, IL: Association for College and Research Libraries.


Historical Complications of/for Open Access Computational Data

Jennifer Guiliano, Indiana University–Purdue University Indianapolis

Always Already Computational seeks to support the “development of a strategic approach to developing, describing, providing access to, and encouraging reuse of library collections that support computationally-driven research and teaching.” Historically, data in the digital collections sphere has most often been expressed as homogenous datasets falling into one of three primary types: textual, visual, or audio. “Scholars” or “researchers” use large scale textual information derived from digitized volumes or the extraction of text only from hypertextual and multimedia environments, or they mine hundreds or even thousands of hours of video or audio materials to extract and analyze subsets. Due to the dominance of datasets like those derived from the Google Books corpus or through web scraping tools that cull text, image, or audio, large or dense cultural datasets are the norm in digital humanities, and are not only homogenous in type but rarely imagine interactions as led by or with intervention from individuals not holding the role of scholar or researcher.

More simply, I am suggesting that the question of creating computationally-accessible datasets is not just the deployment of an ecosystem for development, description, access, and reuse but a recognition that there are potentially multiple ecosystems of research and teaching that must exist simultaneously and be treated as relational computational data. To illustrate this principle, I’ll provide a brief synopsis of the work of Edward Curtis and how the open access images that are currently available as computationally-accessible data through the Library of Congress present a complicated consideration of computational data. Beginning in 1868, Edward S. Curtis embarked on a thirty-year career documenting over eighty native communities.
Participating as part of scientific expeditions and anthropological excursions, he produced roughly 20 volumes of information on Native and Indigenous life that were accompanied by photographic images as part of his The North American Indian series. Created primarily as silver-gelatin photographic prints, this series has long held a place of prominence in historical analysis as the images are not only noted for their rarity but for the limited dissemination and reuse throughout the twentieth century as full sets of materials. Only 300 sets of the 20 volume series were sold; however, these images as individual objects have seen significant dissemination and reuse since their acquisition by the Library of Congress. More than 2,400 silver-gelatin photographic prints (of a projected total of 40,000) were acquired by the Library of Congress through copyright deposit from about 1900 through 1930. About two-thirds (1,608) of these images were not published in Curtis's multi-volume work, The North American Indian. The collection includes individual and group portraits, as well as photographs of indigenous housing, occupations, arts and crafts, religious and ceremonial rites, and social rituals (meals, dancing, games, etc). More than 1,000 of the photographs have been digitized and individually described and are available through the Library of Congress API as well as via manual download of both jpeg and tiff file formats.

Using strategies common to anthropologists working in indigenous communities at the turn of the 20th century, Curtis modified the images he produced to remove signs of modernity and contemporary life. This included providing specific forms of dress that were perceived as being “more traditional” as well as stronger interventionist strategies like removing objects that would signal integration with 20th century Euro-American society. When viewing an image of a Piegan lodge on the LOC website, what the API provides is the unretouched negative: an image of two Piegan men situated in their lodge with a clock centered between them. A computational dataset would expose the existence of this image, which could allow scholars to run object based visual analysis algorithms to identify the clock in the image and potentially find other images of modernity using shape-segmentation, leading to some conclusions about the interventionism of technology in indigenous life: how widespread has technology embedded itself into indigenous life? But in current thinking about computationally-accessible data, what would not be revealed is that this original negative shows an alarm clock between two seated men in a Piegan lodge, not the published, retouched image that American audiences would have viewed in The North American Indian. Curtis physically cut the clock out of the negative. He then retouched the image for publication in The North American Indian. It is important for accuracy purposes for the dataset to reflect not just the original photographic negatives but also relational data derived from what was actually published by Curtis. Otherwise, researchers might conclude that Americans were familiar with signs of modernity in indigenous life when, in fact, that conclusion is relatively recent historiographically. Other examples of this type of relational computational data are available with Curtis: he depicted a Crow war party on horses, even though there had been no Crow war parties for years, and he used techniques of focus and duration to induce hue saturation that romanticized images.

More problematically, for our computational dataset, Curtis was also known to photograph religious rituals as part of his excursions. The [Oraibi snake dance] image depicts Hopi natives that were part of the Snake and Antelope societies participating in a communal ceremony. Performed in August to ensure abundant rainfall to help corn growth, the ritual was the most widely photographed ceremony in the Southwest Pueblos by non-native observers.
In current computationally-accessible form, there are a number of issues to confront: 1) there is no notation that this image is of a religious ritual that is now prohibited from viewing by the non-Hopi public (and thus should be pulled from view for reasons of cultural sensitivity); 2) when subjected to computer vision techniques, the derivative images rely on segmentation of physical bodies, a form of disembodied violence that reflects colonial practices where Natives are treated as less than human through segmented image representation (e.g. scalps, severed limbs, etc). More holistically, this case illustrates one of the long-term challenges of computationally-enabled access: computers cannot identify culturally-sensitive data nor is there an efficient means to retrieve culturally-sensitive data once it has been distributed in computational form. While data might be displayed in an integrated manner, when it comes to the processing or analysis of our data, computational analysis has largely existed at a segmented level rather than as an integrated structural process for research and teaching purposes. Complex humanities systems for data are often artificially layered representations that rely on augmentation of 'found' datasets such as traditional and web archives. Often, human intervention is needed to verify the results of these computational processes, which have a habit of very quickly highlighting contradictions at the level of both object and corpora. An integrated data ecosystem posits that computational analysis is important not only for the core activities of development, description, access, and reuse, but also for the return of data to its originating collection through data correction and relational derivatives.
More simply, what is needed is an integrated humanities data ecosystem that recognizes approaches to computationally-accessible data and relies on important characteristics of humanities research data and humanities research practices: 1) humanists tend to create data, not just gather data; 2) some of this data is inherently structured, but most is not; 3) the resulting data is often highly interpretative, which has implications for sharing and re-use; 4) data creation is often iterative and layered with implications for copyright, versioning and active working spaces; and 5) the process is as important as the product. And, significantly, to envision the broadest potential intervention of computationally-accessible datasets, we cannot envision that the terms “scholar” and “researcher” belong to the academic or archival communities. We must understand that the communities of origin should be the initiating point for considering development, deployment, access, etc.

Works Cited

[1] Portions of this response appeared in an earlier form in the Introduction to “The Future of Digital Methods for Complex Datasets”, an International Journal of Humanities and Arts Computing (IJHAC) special edition, and as a contribution to a Digital Library Federation panel on Humanities Data issues. Jennifer Guiliano and Mia Ridge, International Journal of Humanities and Arts Computing, Volume 10 Issue 1, Page 1-7. DOI: http://dx.doi.org/10.3366/ijhac.2016.0155.


Identifying Use Cases for Usable and Inclusive Library Collections as Data

Juliet L. Hardesty, Indiana University

A grounded, practical approach to digital projects often centers around concerns of how will the project be useful, how can the project realistically be completed, and what information is necessary to make this project (or the items in a digital project) discoverable and accessible? Based on this approach, there are two sides to making library collections useful as computational data: the collection-holding library has to be able to release the data in a way that allows for computation and researchers have to be able to find out about this data and do something with it. Putting data out there does not mean it will be used and offering a computational interface does not mean it will fit all research needs.

The grant references the HathiTrust Research Center (HTRC) as an example of a computational interface for researchers. It also references Hydra-in-a-Box as an example of an application that could benefit from computational functionality. This generated the thought of an HTRC-in-a-Box that could work for libraries to set up their own computational interface for their collections. Open government data efforts like Code for America or data.gov and ckan.org show how various groups and individuals can come together around a common goal of providing access to computational data and provide ways to access, analyze, and offer data. It would be useful to examine those models when discussing approaches to treating library collections as data.
This project is concerned with all types of digital objects. Text, images, audio, video, born-digital, 3-dimensional, all have unique aspects to them that are sometimes computationally available but often are not. Sometimes the only way to know about segments on a video or the contents of an image is to have textual description available. That requires metadata generation or metadata enhancement. This work can be manually intensive but can also be aided by software. Efforts such as AVPreserve’s plan to enhance metadata in stages for Indiana University’s Media Digitization and Preservation Initiative move gradually toward more advanced technologies to identify aspects such as people’s faces, beats per minute, and speaker identification in video and audio for the purpose of producing metadata that can then be discovered by researchers.[1] Another project to watch will be Wikimedia Commons’ Structured Data project to “develop storage information for media files in a structured way on Wikimedia Commons, so they are easier to view, translate, search, edit, curate and use.”[2] This process will not always be just about putting the data out there or making it possible for researchers to access the data; it will also involve producing data about different types of objects than has traditionally been the case in digital libraries. Recommendations, tools, and workflows for metadata enhancement will be necessary to create usable computational data.

Michelle Dalmau, Head of Digital Collections Services at Indiana University, correctly points out that different use cases are needed for library collections as data.[2] At Indiana University, several digital collections are available as datasets,[3] largely based on researcher requests. Tracking use in the wild is challenging, but datasets are used in the classroom (Charles W. Cushman Photograph Collection) and for research (Wright American Fiction).
Looking at how data is used for research compared to how it is used pedagogically for instruction might lead to insights on qualities of data that make collections better suited for teaching versus research. Being able to reliably trace the ways in which these data sets are used will demonstrate impact to stakeholders. Using metadata about digital collections versus using the collection items themselves for content analysis is something else to consider. The British Library offers image collections for analysis separate from bibliographic datasets about their archival holdings. Indiana University’s Cushman dataset offers only the metadata about the images, not the images themselves.

A final point to bring up concerns diversity and inclusion. Not only should this project make sure the collections considered for use cases are diverse in format, content, and source, but the project itself needs to have a broad and deep representation of voices and perspectives on computational data. These are not data that are only useful in the academic realm. Access to computational data or workflows and tools to allow others to provide access to computational data will be ever more important in the world, particularly if national governments continue to trend toward populism, nationalism, and privatization.

Works Cited

[1] Rudersdorf, Amy and Juliet L. Hardesty. (2016). “AV Description with AVPreserve and IU: Strategies and tools to describe audiovisual materials at scale for Indiana University’s Media Digitization and Preservation Initiative.” Digital Library Federation Forum, Milwaukee, Wisconsin. https://osf.io/gfazc/

[2] Juliet L. Hardesty interviewed Michelle Dalmau regarding library collections as data in February 2017.

[3] https://commons.wikimedia.org/wiki/Commons:Structured_data

[4] British Library. Collection guides: Datasets for image analysis. http://www.bl.uk/collection-guides/datasets-for-image-analysis


Emerging Memory Institution Data Infrastructure in the Service of Computational Research

Christina Harlow, Cornell University

In my opinion, the Always Already Computational Forum work area rests at the intersection of the understood functionalities of memory institutions’ collection platforms and the needs of researchers working with large-scale or computational data analysis techniques. In thinking about this Forum’s scope and my own work, I am struck by possible collaborations not leveraged or mentioned. I would like to explore if my work approach to a facet of a larger data problem could expand and, in turn, be expanded by the Forum’s discussion and deliverables on computational research needs and memory institution data practices.

My position for this upcoming Forum will mostly fall along these points:

● If library collections, including but not limited to those of digital repository platforms, are considered (primarily digital repositories are targeted in the proposal), there is a wealth of data and metadata (*data) that already exists. Better yet, memory institutions already work with this *data at scale using traditional and emerging technologies that underpin and are hidden by delivery and discovery interfaces. How can this underlying ecosystem be better leveraged for computational data analysis by researchers? I.e., do we just need to make access to a Solr index publicly available? Can we plug into our library data ETL systems a public Hadoop integration point? Do we need to better document and expose to new communities our existing data APIs or data exchange protocols?

● I would like to surface the functional needs of the research areas alluded to in the proposal, then see where they overlap with existing *data operations work areas in memory institutions. A strategic partnership here means we can strengthen the cases for, collaboration on, and support of the technological, procedural, and organizational frameworks emerging. These are already being built and used to support efforts of memory institutions and their data partners.

● Computational or large-scale *data work requires transparency and agreement on a number of points to make it statistically relevant and publicly reliable. These agreement points include but are not limited to:

o Machines should be able to understand the models or entities represented by the data;

o This requires having shared specifications around *data representation and contextual meaning of models, datum, types, etc.;

o We need to build and maintain consistent data exposure services, points or methods so that computational work can be reproducible, iterated, or distributed as needed (for scalability);

o Recognize that technological frameworks for computational analysis (for example, Hadoop) often require significant hardware, software, and maintenance to support. Stability of how data is exposed and data provenance can mitigate the technological burden by offering consistency on which multiple partners can build and coordinate efforts on the frameworks;

o And what is the responsibility of the originating memory institution to support capture of that computational data output for sake of archiving, reproducibility, discoverability, and expanded *data services?

My positions come from my own work on metadata operations within a large and well-funded academic library system. My work focuses on building an efficient and coordinated *data ecosystem among sources including but not limited to:

● A traditional MARC21 Catalog with about 9 million bibliographic records, managed in an ILS (Integrated Library System), a few Oracle databases, a Perl-based metadata reporting and management interface, and other batch job management and metadata exposure services (APIs and data exchange protocols like Z39.50 or SRU);

● A locally-developed metadata integration layer that takes multiple data representations of authority, bibliographic and other metadata retrieved via APIs, merges them, and indexes into a number of Solr indexes;

● Multiple (~8 depending on the definition) digital repository applications and services for delivery of data and metadata to user interfaces. These repositories span technology and resource types from lone Fedora 4 instances for object persistence of primarily text-focused digital surrogates to more traditional DSpace installations for user-generated scholarly output type resources;

● A locally-managed authorities and entities interface that deals with both local vocabularies and enhanced representations of currently 3 large (>1 million resources) external metadata sets;

● And *data from archives, preservation, digitization, and many other workflows and systems.

In building a coherent ecosystem for this *data, I work with enterprise data tooling and approaches that perhaps also can support the computational data analysis needs to be surfaced in the Always Already Computational Forum. In particular, I am leveraging ETL and distributed data management systems that then interact with (and coordinate) existing memory institution *data standards, applications, specifications, and exchange protocols. Due to the computational support of the selected distributed data systems, I run a number of processes that parallel some computational data approaches, but for different ends. I would like to outline how we could reuse or expand these existing approaches and services to support the researchers (and their respective areas) who take part in this Forum.


On the Computational Turn in Archives & Libraries and the Notion of Levels of Computational Services

Greg Jansen and Richard Marciano, University of Maryland

1. The Computational Turn in Archives & Libraries

The University of Maryland iSchool’s Digital Curation Innovation Center (DCIC) is pursuing a strategic initiative to understand and contribute to the computational turn in archives and libraries. The foundational paper (with partners from UBC, KCL, TACC, and NARA) calls for re-envisioning training for MLIS students in the “Age of Big Data”. See: “Archival Records and Training in the Age of Big Data”. We argue for a new Computational Archival Science (CAS) inter-discipline, with motivating case studies on: (1) evolutionary prototyping and computational linguistics, (2) graph analytics, digital humanities and archival representation, (3) computational finding aids, (4) digital curation, (5) public engagement / interaction with archival content, (6) authenticity, and (7) confluences between archival theory and computational practices: cyberinfrastructure and the records continuum.

Deeper experimentation with these new cultural computational approaches is urgently needed and the DCIC is developing a CAS curriculum that brings together faculty from Computer Science, Archival & Library Science, and Data Science. We conduct experiential projects with teams of students to help them: gain digital skills, conduct interdisciplinary research, and explore professional development opportunities at the intersection of archives, big data, and analytics. These projects leverage unique types of archival collections: refugee narratives, community displacement, racial zoning, movement of people, citizen internment, and cyberinfrastructure for digital curation. See “Practical Digital Curation Skills for Archivists in the 21st Century” (Lee, Kendig, Marciano, Jansen), MARAC 2016. Two workshops on the interplay of computational and archival thinking were held in April 2016 and December 2016, and a pop-up session at SAA 2016 discussed archival records in the age of big data.
Finally, the DCIC is developing new cyberinfrastructure, called DRAS-TIC (see Nov. 2016 CNI talk), that facilitates computational treatment of cultural data. DRAS-TIC stands for Digital Repository at Scale that Invites Computation (To Improve Collections), and blends hierarchical archival organization principles with the power and scalability of distributed databases.

Our position statement builds on these CAS investigations by suggesting a framework for “Levels of Computational Service” to better describe the emerging ecosystem and identify gaps and opportunities.

2. Levels of Computational Service

Journalists, researchers, planners, and other user patrons support their investigations with new methods of computational analysis. Libraries, archives, museums, and scientific data repositories hold data that will inform their disciplines. It is far easier today to analyze Twitter behavior than it is to investigate public life using public data from public institutions, such as government records, cultural heritage, and science data. We strive to make our public data and cultural memory as open to research as Twitter.

Computational analysis happens in various technical environments: on a single server; in distributed clusters; on cloud services. The tools we use have unique requirements, configurations, and hardware. It is said that a data stewardship organization cannot anticipate the uses for their data, but it is equally true that they cannot anticipate the tools used for analysis. Organizations need a service strategy that serves a range of users, from the most technically innovative, to the most time and resources constrained. We describe a range of services for collections as data without losing sight of core services. This is a “maturity model” for stewardship organizations, with levels of computational services that show a clear progression toward full service.

2.1. Core Service Level

Shipping datasets into the researcher compute environment remains the critical use case, maximizing flexibility and allowing researchers to link many datasets into one corpus. Researchers need to discover, scope, ship and make reference to datasets. Though we may also move computational work across them, boundaries are an important place to define stable conditions, such as custody, provenance, security, and concise technical contracts. Even the most advanced repository must establish these boundary conditions.

● Define license terms: how can we use the data?

● Define provenance:

○ Who produced the data and why?

○ How did it arrive here?

○ Do versions exist elsewhere?

● Define dataset scope:

○ What makes the corpus complete?

○ Is it complete?

○ Is it growing? What is the update history?

● Transfer methods with integrity verification and resume from failure

● Persistently citable datasets
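One way to read "integrity verification and resume from failure" is a checksum manifest that travels with the dataset. The sketch below assumes a hypothetical manifest mapping file names to SHA-256 digests; the function names and the demonstration files are illustrative, not part of any system described here. After an interrupted transfer, only the files that are missing or fail verification need to be fetched again.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_needing_transfer(manifest, local_dir):
    """Return manifest entries that are missing locally or fail verification."""
    pending = []
    for name, expected in manifest.items():
        target = Path(local_dir) / name
        if not target.is_file() or sha256_of(target) != expected:
            pending.append(name)
    return pending


# Demonstration with invented files: one complete, one truncated, one missing.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_bytes(b"first file")
    (Path(tmp) / "b.txt").write_bytes(b"truncated")  # simulates an interrupted copy
    manifest = {
        "a.txt": hashlib.sha256(b"first file").hexdigest(),
        "b.txt": hashlib.sha256(b"second file, complete").hexdigest(),
        "c.txt": hashlib.sha256(b"third file").hexdigest(),
    }
    pending = files_needing_transfer(manifest, tmp)
```

In this demonstration `pending` lists only `b.txt` and `c.txt`, so a resumed transfer can skip the file that already verified.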

2.2. Protocols Service Level

● File-by-file transfer through HTTP API (instead of batch downloads, like ZIPs)

● Define citable subsets through custom queries or functions.

● Check for updates to any dataset or subset (via HTTP API).

● HTTP API for navigation of structured collections:

○ Static site (Apache or Nginx auto-index of files)

○ Cloud Data Management Interface (CDMI)

○ Linked Data Platform (and Fedora API)

● Delivery to cloud and cloud-hosted, public datasets
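As one possible reading of "check for updates via HTTP API", a client can use standard HTTP conditional requests: it sends the entity tag it saw last time and the server answers 304 Not Modified if nothing changed. The sketch below assumes the repository's server emits `ETag` headers, which this text does not specify; the URL and function names are hypothetical.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, known_etag=None):
    """Build a HEAD request that asks the server to skip unchanged content."""
    request = urllib.request.Request(url, method="HEAD")
    if known_etag:
        request.add_header("If-None-Match", known_etag)
    return request


def check_for_update(url, known_etag=None, timeout=30):
    """Return (changed, etag). Performs a network call; not run in this sketch."""
    request = build_conditional_request(url, known_etag)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the stored copy is still current
            return False, known_etag
        raise


# Hypothetical dataset URL and previously stored entity tag.
request = build_conditional_request(
    "https://example.org/datasets/demo/objects.csv", '"abc123"'
)
```

Calling `check_for_update` with the stored tag would then tell a researcher whether a cited dataset or subset has changed since it was last retrieved.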


2.3. Enhanced Service Level

● Derived data available as subsets:
  ○ plain text for documents and images
  ○ normalized file formats
  ○ tabular data for table-like sources
  ○ linked data for graph-like sources
● Machine-readable provenance records
● Crowd-sourcing of metadata
● Named entity indexing and subsetting (people, places, organizations, dates, events)
● Geospatial indexing and subsetting
● Consistent and citable random sample subsets (add random seeds to each observation)
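
One way to read the last item is sketched below. Instead of storing a random seed with each observation, this variation derives a stable score for each item from a hash of a published seed and the item identifier, which has the same effect: anyone who knows the seed and the fraction can regenerate exactly the same subset. The function names and identifier format are assumptions made for illustration.

```python
import hashlib


def in_sample(item_id, seed, fraction):
    """Decide deterministically whether an item belongs to the sample."""
    digest = hashlib.sha256(f"{seed}:{item_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    score = int.from_bytes(digest[:8], "big") / 2**64
    return score < fraction


def sample(item_ids, seed, fraction):
    """Return the items selected for a given (seed, fraction) pair."""
    return [item_id for item_id in item_ids if in_sample(item_id, seed, fraction)]
```

Because each item's score is fixed, a 10% sample is always contained in the 50% sample drawn with the same seed, and the subset can be cited simply as the pair (seed, fraction).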

2.4. Computer Room Service Level

Container technologies, such as Docker, ship a custom compute environment to the dataset location. A hosted database can be opened up for queries or distributed compute jobs. While not as flexible as the researcher environment, computer room services provide rapid and cost-effective analysis. Journalists on deadline benefit most from computer room services.

There are also growing calls, beyond the physical sciences, for analysis of big collections data in journalism and humanities scholarship. The sheer scale of big data makes transfer prohibitive, as it does provisioning enough storage to host an entire corpus. At the Digital Curation Innovation Center at the University of Maryland’s iSchool, we are actively developing the DRAS-TIC repository (Digital Repository at Scale that Invites Computation). Through DRAS-TIC we aim to deliver computer room-style services over heterogeneous digital collections and remove the limits of scale.

● Run an Apache Spark job on a defined dataset
● Host a compute container with a dataset mounted locally
● SPARQL query service
● Use techniques above to produce a new subset for transfer

3. Provisioning the Researcher Environment

From code notebooks to deployment scripts that provision clusters, it becomes easier to create and share compute environments. Research that aims towards publication will also need to track the workflow of research steps. Through machine-readable scripts and provenance, we can aim to reproduce an analysis at a different time and place, starting from the cited datasets and well-described methods. The curation activities performed by a stewardship organization and the steps taken by the researcher can form an unbroken chain of events leading to a reproducible product.

Summary

For verifiable results in scholarship, or public trust in an independent press, we need to provide relevant datasets and services that make it straightforward to trace findings back to their source in the public record. We must confront a rightly skeptical reader, who faces increasingly high-flying visualizations and claims made from them. They are correct to demand links to the underlying evidence and methods. By providing these we enrich public understanding and trust. At the Digital Curation Innovation Center (DCIC) we have committed to this agenda and pursue it through our research projects, scholarly activities, the active development of the DRAS-TIC software project, and the building of a computational archival community.

Partnership Recommended – The case of curating research data collections [1]

Lisa Johnston, University of Minnesota Libraries

Digitization alone is not enough to support large-scale computational analysis of library collections. Rather, the more difficult steps of digital curation will be necessary to prepare our collections for appropriate reuse. Partnership may be the key.

Take for example the problem of analog data. The extraction of historical climate data from tables and charts and other artifacts (e.g., Zooniverse's Old Weather project) is an ambitious and important undertaking, as these data are undeniably valuable and temporally unique. Yet the digitization of data points from the written page is just the first step toward a greater integration of their meaning in modern and future research. In order for computation of these collections to be successful, the digital surrogate must be curated in a number of ways. The data may be transformed, cleaned, normalized, described, and contextualized, and quality assurance measures may be put in place to ensure trust and track provenance of the work, to name a few. Data curation activities prepare and maintain research data in ways that make it findable, accessible, interoperable, and reusable (FAIR).

In our work, the Data Curation Network project has taken steps to better understand the data curation activities mentioned above and identify ways to harness the domain and file format expertise needed to curate research data across a network of partner institutions.[2] We represent academic library data repository programs that are staffed with curation experts for a range of data domains and data file formats. Our goals are to develop practical and transparent workflows and infrastructure for data curation; to promote data curation practices across the profession in order to build an innovative community that enriches capacities for data curation writ large; and, most importantly, to develop a shared staffing model that enables institutions to better support research by collectively curating research data in ways that scale beyond what any single institution might accomplish individually.

We are not alone in this desire to partner on data curation skills, staff, and infrastructure. National examples of data curation include the Portage Network (https://portagenetwork.ca), developed by the Canadian Association of Research Libraries (CARL), which aims to support library-based data management consultation and curation services across a broader network, and the JISC-funded Research Data Management Shared Service Project, which aims to develop a lightweight service framework that can scale to all UK institutions and result in efficiencies by “relieving burden from institutional IT and procurement staff.” In the US, partnerships on technological infrastructure are booming. Project Hydra’s Sufia platform (https://projecthydra.org), which builds on the DuraSpace Fedora framework, has been co-developed by numerous institutions that seek to build a better digital repository infrastructure for data. And the Hydra-in-a-Box project (led in part by another partnership success story for disseminating archival materials, the Digital Public Library of America) aims to provide a networked platform for repository services that will scale for institutions big and small. Another inspiring example is the Research Data Alliance, which provides an incubator for collaboration around a range of data-related topics. RDA projects to track include the Publishing Data Workflows working group and the newly formed Research Data Repository Interoperability working group.

Partnerships do not necessarily need to start at the national level. Several smaller-scale partnerships are underway for sharing curation staff expertise across institutions. They include the Digital Liberal Arts Exchange, which facilitates data-related problem solving and communication amongst peers and provides hosting services that allow digital humanities projects to be run on shared infrastructure, and the DataQ Project, which provides a virtual online forum for expert data staff to discuss and provide solutions for data issues in a collaborative way.

By partnering on data curation efforts like these we may move beyond individualized digital curation strategies toward what I hope will become a robust “network” of digital collections that are computational, but also trusted. And as partners in this effort we may continue a shared dialogue and collectively develop new and improved processes for curating research data and other digital objects. Finally, our networked research collections will demonstrate the continuing and important role that libraries and archives have to play in the broader scholarly process.

Works Cited

[1] Portions of this statement were also published in “Concluding Remarks” by Lisa R. Johnston in Curating Research Data Volume 2: A Handbook of Current Practice (ACRL, 2017), available as an open access ebook at http://www.ala.org/acrl/publications/booksanddigitalresources/booksmonographs/catalog/publications.

[2] Currently in our planning phase, the Data Curation Network aims to expand into a sustainable entity that grows beyond our initial six partner institutions, which are led by the University of Minnesota and also include the University of Illinois, Cornell University, the University of Michigan, Penn State University, and Washington University in St. Louis.

Ways of Forgetting: The Librarian, The Historian, and the Machine

Matthew Lincoln, Getty Research Institute

Jorge Luis Borges tells us of Funes, the Memorious: a man distinguished by his extraordinary recall. So precise and complete were Funes' memories, though, that it was impossible for him to abstract from the near-infinity of recalled specifics he possessed to general principles for understanding the world:

Locke, in the seventeenth century, postulated (and rejected) an impossible idiom in which each individual object, each stone, each bird and branch had an individual name. Funes had once projected an analogous idiom, but he had renounced it as being too general, too ambiguous. In effect, Funes not only remembered every leaf on every tree of every wood, but even every one of the times he had perceived or imagined it... He was, let us not forget, almost incapable of general, platonic ideas... he was not very capable of thought. To think is to forget a difference, to generalize, to abstract. In the overly replete world of Funes there were nothing but details, almost contiguous details. (Borges 1962, 27)

Attending to Drucker's admonition that all "data" are properly understood as "capta", the story of Funes is a potent reminder that it is not only inevitable that we will be selective when capturing datasets from our collections, but that it is actually necessary to be selective (Drucker 2014). A dataset that aims for perfect specificity does so at the expense of allowing any generalizations to be made through grouping, aggregating, or linking to other datasets. For our data to be useful in drawing broad conclusions, it is an imperative to forget.

However, in considering library and museum collections as data, we must grapple with several different frameworks of remembering, forgetting, and abstracting: that of the librarian, the historian, and the machine. These frameworks will often be at cross-purposes:

● The librarian favors data that is standard: forgetting enough specifics about the collection in order to produce data that references the same vocabularies and thesauri as other collection datasets. The librarian's generalization aims to support access by many different communities of practice.

● The historian favors data that is rich: replete with enough specifics that they may operationalize that data in pursuit of their research goals, while forgetting anything irrelevant to those goals. The historian's generalization aims to identify guiding principles or exceptional cases within a historical context. (No two historians, of course, will agree on what that context should be.)

● The machine favors data that is structured: amenable to computation because it is produced in a regularized format (whether as a documented corpus of text, a series of relational tables, a semantic graph, or a store of image files with metadata). In a statistical learning context, the machine seeks generalizations that reduce error in a given classification task, forgetting enough to be able to perform well on new data without over-fitting to the training set.

At the Getty Research Institute, our project to remodel the Getty Provenance Index® as Linked Open Data is compelling us to balance each of these perspectives against the labor required to support them. Our legacy data is filled with a mix of transcriptions of sales catalogs, archival inventories, and dealer stock books, paired with editorial annotations that index some of those fields against authorities or other controlled vocabularies. Originally designed to support the generation of printed volumes, and then later a web-based interface for lookup of individual records, these legacy data speak mostly about documents of provenance events, and do so for an audience of human readers. To make these data linkable to museums that are producing their own Linked Open Data (following the general CIDOC-CRM principles of defining objects, people, places, and concepts through their event-based relationships), we are transforming these data into statements about those provenance events themselves. In so doing, we are standardizing the terms referenced, enriching fields by turning them from transcribed strings into URIs of things, and explicitly structuring the relationships between these data as an RDF graph.
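
The move from transcribed strings to URIs can be illustrated with a small sketch. Everything named here is hypothetical: the lookup table, the example.org URIs, and the two predicates stand in for a real authority file and a real CIDOC-CRM-based model, and plain tuples stand in for an RDF library. The point is only that one statement remembers the dealer's wording while a second forgets it in favor of a shared term.

```python
# Hypothetical authority file: normalized transcriptions -> placeholder URIs.
AUTHORITY = {
    "oil on canvas": "http://example.org/vocab/oil-on-canvas",
    "huile sur toile": "http://example.org/vocab/oil-on-canvas",
}

# Hypothetical predicates, not taken from CIDOC-CRM or any published model.
TRANSCRIBED = "http://example.org/prop/materials_transcription"
MADE_OF = "http://example.org/prop/made_of"


def reconcile(record):
    """Turn one transcribed record into (subject, predicate, object) statements."""
    subject = f"http://example.org/object/{record['id']}"
    transcription = record["materials"]
    # Always keep the literal wording of the source document.
    triples = [(subject, TRANSCRIBED, transcription)]
    # Add a standardized statement only when the term can be reconciled.
    uri = AUTHORITY.get(transcription.strip().lower())
    if uri is not None:
        triples.append((subject, MADE_OF, uri))
    return triples
```

Records whose terms are missing from the lookup table keep only their transcription, which makes the cost of each unreconciled term visible.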

All this work requires dedicated labor. This leads to hard questions about priorities.

To what extent do we preserve the literal content of these documents, versus standardizing the way that we express the ideas those documents communicate (insofar as we, as modern-day interpreters, can correctly identify those ideas)? To maintain (to remember) plain text notes about, say, an object's materials as recorded by an art dealer, is to grant the possibility of perfect specificity about what our documents say. But not aligning descriptions with authoritative terms for different types of materials and processes forecloses the possibility of generalizing about the history of those materials and processes across hundreds of thousands of objects. Remember too much, in other words, and we become Funes: incapable of synthetic thought.

Capacious collections data must remember enough and forget enough to be useful. For which terms will we expend the effort to do this reconciliation? Which edge cases will we try to capture in an ever-more-complex data model? Opinions on how to draw that line will frequently set the librarian, the historian, and the machine at cross purposes. Outlining the competencies a collections data production team needs, and the key questions it must ask, in order to navigate these perspectives must therefore be a crucial output of this forum.

Works Cited

Borges, Jorge Luis. 1962. “Funes, the Memorious.” In Ficciones, edited by Anthony Kerrigan, 107–15. New York: Grove Press.

Drucker, Johanna. 2014. Graphesis: Performative Approaches to Graphical Forms of Knowledge Production in the Humanities. Cambridge: Harvard University Press.

[Figure 1]

Assessing Data Workflows for Common Data 'Moves' Across Disciplines

Alan Liu, University of California Santa Barbara

In considering how library collections can serve as data for a variety of data ingest, transformation, analysis, replication, presentation, and circulation purposes, it may be useful to compare examples of data workflows across disciplines to identify common data "moves" as well as points in the data trajectory that are especially in need of library support because they are for a variety of reasons brittle.

We might take a page from current research on scientific workflows in conjunction with research on data provenance in such workflows. Scientific workflow management is now a whole ecosystem that includes integrated systems and tools for creating, visualizing, manipulating, and sharing workflows (e.g., Wings, Apache Taverna, Kepler, etc.). At the front end, such systems typically model workflows as directed, acyclic network graphs whose nodes represent entities (including datasets and results), activities, processes, algorithms, etc. at many levels of granularity, and whose edges represent causal or logical dependencies (e.g., source, output, derivation, generation, transformation, etc.) (see fig. 1). Data provenance (or "data lineage" as it has also been called in relation to workflows) complements that ecosystem through standards, frameworks, and tools, including the Open Provenance Model (OPM), the W3C's PROV model, ProvONE, etc. Linked-data provenance models have also been proposed for understanding data-creation and -access histories of relations between "actors, executions, and artifacts."[1] In the digital humanities, the in-progress "Manifest" workflow management system combines workflow management and provenance systems.[2]

The most advanced research on scientific workflow and provenance now goes beyond the mission of practical implementation to meta-level analyses of workflow and provenance. The most interesting instance I am aware of is a study by Daniel Garijo et al. that analyzes 177 workflows recorded in the Wings and Taverna systems to identify high-level, abstract patterns in the workflows.[3] The study catalogs these patterns as data-oriented motifs (common steps or designs of data retrieval, preparation, movement, cleaning/curation, analysis, visualization, etc.) and workflow-oriented motifs (common steps or designs of "stateful/asynchronous" and "stateless/synchronous" processes, "internal macros," "human interactions versus computational steps," "composite workflows," etc.). Then, the study quantitatively compares the proportions of these motifs in the workflows of different scientific disciplines. For instance, data sorting is much more prevalent in drug discovery research than in other fields, whereas data-input augmentation is overwhelmingly important in astronomy.

Since this usage of the word motifs is unfamiliar, we might use the more common, etymologically related word moves to speak of "data moves" or "workflow moves." A move connotes a combination of step and design. That is, it is a step implemented not just in any way but in some common way or form. In this regard, the Russian word motiv for "motif," used by the Russian Formalists and Vladimir Propp, nicely backs up the choice of the word move to mean a commonplace data step/design. Indeed, Propp's diagrammatic analyses of folk narratives (see fig. 2) look a lot like scientific workflows. We might even generalize the idea of "workflows" in an interdisciplinary way and say, in the spirit of Propp, that they are actually narratives. Scientists, social scientists, and humanists do not just process data; they are telling data stories, some of which influence the shape of their final narrative (argument, interpretation, conclusion).

[Figure 2]

The takeaway from all the above is that a comparative study of data workflow and provenance across disciplines (including sciences, social sciences, humanities, arts) conducted using workflow modeling tools could help identify high-priority "data moves" (nodes in the workflow graphs) for a library-based "always already computational" framework.

One kind of high priority is likely to be very common data moves. For example, imagine that a comparative study showed that in a sample of in silico or data analysis projects across several disciplines over 40% of the data moves involved R-based or Python-based processing using common packages in similar sequences (perhaps concatenated in Jupyter notebooks); and, moreover, that among this number 60% were common across disciplinary sectors (e.g., science, social science, digital humanities). Then these are clearly data moves to prioritize in planning "always already computational" frameworks and standards.
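
A toy version of such a comparative study might look like the sketch below. The two workflows, their step names, and the discipline labels are invented for illustration; each workflow is a directed acyclic graph that maps a step to the steps it depends on, and the functions count which moves recur across every workflow in the sample.

```python
from collections import Counter
from graphlib import TopologicalSorter

# Hypothetical workflows: each step maps to the set of steps it depends on.
WORKFLOWS = {
    "astronomy": {
        "retrieve": set(),
        "augment": {"retrieve"},
        "analyze": {"augment"},
        "visualize": {"analyze"},
    },
    "literary_studies": {
        "retrieve": set(),
        "clean": {"retrieve"},
        "analyze": {"clean"},
        "visualize": {"analyze"},
    },
}


def move_frequencies(workflows):
    """Count in how many workflows each data move appears."""
    counts = Counter()
    for graph in workflows.values():
        # static_order() yields every node and raises CycleError
        # if a workflow graph turns out not to be acyclic.
        counts.update(set(TopologicalSorter(graph).static_order()))
    return counts


def common_moves(workflows):
    """Return the moves shared by every workflow, in alphabetical order."""
    counts = move_frequencies(workflows)
    return sorted(move for move, n in counts.items() if n == len(workflows))
```

Here `common_moves(WORKFLOWS)` picks out retrieval, analysis, and visualization as the shared moves, while augmentation and cleaning remain discipline-specific. A real study would of course need a shared vocabulary of move types before counts like these mean anything.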

Another kind of high priority may be data moves that involve a lot of friction in projects or in the movement of data between projects. One simple example pertains to researchers at different universities ingesting data from the "same" proprietary database who are prevented from standardizing live references to the original data because links generated through their different institutions' access to the databases are different. Friction points of this kind identified through a comparative workflow study are also high-value targets for "always already computational" frameworks and standards.

Finally, one other kind of high-priority data move deserves attention for a combination of practical and sensitive issues. Many scenarios of data research involve the generation of transient data products (i.e., data that has been transformed at one or more steps of remove from the original data set). A comparative workflow study would identify common kinds of transient data forms that require holding for reasons of replication or as supporting evidence for research publications. In addition, because some data sets cannot safely be held because of intellectual property or IRB issues, transformed datasets (e.g., converted into "bags of words," extracted features, anonymized, aggregated, etc.) take on special importance as holdings. A comparative workflow study could help identify high-value kinds of such holdings that could be supported by "always already computational" frameworks and standards.
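
To make the "bags of words" case concrete, here is a minimal sketch of that transformation. The tokenization rule (lowercased runs of letters and apostrophes) is an assumption chosen for brevity; the relevant property is that word order, and with it the readable text, is discarded while the counts survive for analysis.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Reduce a text to unordered word counts, discarding word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))
```

Two texts that use the same words in a different order produce identical bags, which is why this kind of derived dataset can sometimes be held and shared when the original cannot.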

Works Cited

[1] Hartig, Olaf. "Provenance Information in the Web of Data." In Proceedings of the Linked Data on the Web Workshop at WWW, edited by Christian Bizer, Tom Heath, Tim Berners-Lee, and Kingsley Idehen, April 20, 2009. http://ceur-ws.org/Vol-538/ldow2009_paper18.pdf.

[2] Kleinman, Scott. Draft Manifest schema. WhatEvery1Says (WE1S) Project, 4Humanities.org.

[3] Garijo, Daniel, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, and Carole Goble. "Common Motifs in Scientific Workflows: An Empirical Analysis." 2012 IEEE 8th International Conference on E-Science (e-Science), 2012: 1–8. doi:10.1109/eScience.2012.6404427.

At the intersection of institution and data

Matthew Miller, New York Public Library

Libraries are awash in data, from the large reservoirs of bibliographic metadata that power discovery and access systems, to boutique datasets created from the documents themselves and even the ephemeral data exhaust produced by staff and patrons conducting research. Below are some observations and questions around description, distribution, and access that emerge from practical day-to-day work with this type of data and that could benefit from closer examination.

The most potentially kinetic, computationally amenable data comes from the conversion and processing of documents themselves. Transforming documents into data at the New York Public Library took the form of small projects that converted special collection materials into datasets through the power of algorithms, staff, and the crowd. The result was often a domain-specific dataset with a necessarily unique data model. Taking stock of the growing number of these datasets, we theorized about their possible integration with our traditional metadata systems. Would it be possible to go beyond simply linking to the dataset as a digital asset? If we were to build an RDF metadata system from the ground up, could we begin thinking of it as an open-world assumption system where the contents of these datasets could exist alongside traditional bibliographic metadata? As more cultural heritage organizations continue to produce similar datasets, we need to consider how they shape the next generation of our metadata and discovery platforms.

Stepping back from this larger question, when thinking about these resources as discrete datasets, what work could be done to improve their use and interoperability? W3C standards such as the VoID Vocabulary provide the means to describe the metadata about datasets. By leveraging such standards and establishing best practices and preferred authorities, could we increase access across humanities datasets? How much work and what sort of resources are required to accomplish this at the dataset level, and perhaps at the data level as well? One example would be using common non-bibliographic authorities such as Wikidata URIs in the data to facilitate interoperability across datasets and even institutions.

Publishing data for others means balancing access to the data in a format that presents the least friction for adoption and use against how knowledge organization systems work within a cultural heritage institution. This often requires preprocessing of library metadata, turning it into a more accessible form that does not require extensive domain knowledge. For example, when releasing the metadata for NYPL’s public domain images we did not publish the MODS XML metadata, the format in which it is natively stored in our systems. Instead we opted to publish it as JSON and also as simple CSV files along with extensive documentation. Reducing the complexity of the format reduced the complexity of the tools and skills needed to work with it.
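
The kind of preprocessing described above can be sketched as follows. This is not the pipeline NYPL used; the sample record is invented, the three chosen columns are an assumption, and only the MODS namespace and element names come from the MODS schema itself. The sketch flattens each MODS record into one CSV row using the Python standard library.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample record, for illustration only.
SAMPLE = """<mods:modsCollection xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:mods>
    <mods:titleInfo><mods:title>View of the harbor</mods:title></mods:titleInfo>
    <mods:name><mods:namePart>Example, Photographer</mods:namePart></mods:name>
    <mods:originInfo><mods:dateIssued>1905</mods:dateIssued></mods:originInfo>
  </mods:mods>
</mods:modsCollection>"""


def mods_to_csv(xml_text):
    """Flatten each MODS record into one CSV row of title, creator, and date."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "creator", "date_issued"])
    for record in root.findall("mods:mods", NS):
        writer.writerow([
            # Missing elements become empty cells rather than errors.
            record.findtext("mods:titleInfo/mods:title", default="", namespaces=NS),
            record.findtext("mods:name/mods:namePart", default="", namespaces=NS),
            record.findtext("mods:originInfo/mods:dateIssued", default="", namespaces=NS),
        ])
    return out.getvalue()
```

Choosing which elements become columns is itself an editorial decision: repeated names, multiple dates, and nested subjects all have to be collapsed or dropped, which is part of why accompanying documentation matters.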

Another example, taking this approach a step further, is the Linked Jazz project, in which we provided access to the data in the form of a SPARQL endpoint. The data, which is stored as RDF statements, represents a social network of jazz musicians. This dataset lends itself to network analysis using popular tools such as Gephi. To make the application of such a tool as simple as possible, we added a Gephi file export API allowing anyone to quickly download a GEXF file of part or all of the network to import into the software. This sort of scholarly API is geared for delivering the resources needed to begin utilizing the data immediately, as opposed to just providing access to the underlying data store.

The topic of preprocessing introduces the question of best practices and standards that could be followed to ensure the broadest access to our datasets. What are some additional use cases that could drive shared best practices or tools for releasing cultural heritage data? Is there more advanced preprocessing that could be done to some of the common archetypical data formats found in libraries, archives, and museums? And what sort of resources are required in an organization to process datasets for public consumption?

As institutions increasingly produce and release datasets, establishing some best practices around description, distribution, and access can facilitate collaboration between organizations and ensure productive use of these resources by patrons.

Metadata and Digital Repository Accessibility Issues for Library Collections as Data

Anna Neatrour, University of Utah

In thinking of ways to use library collections as data, I was struck by the theme of accessibility. Are researchers genuinely invited to engage with library collections as data? I’m going to focus on this narrowly, looking mainly at aspects of metadata and technical infrastructure in digital repositories.

Metadata as invitation to computation

Encouraging usage of library collections as data could be embedded in digital collections metadata by including a statement that metadata is free to reuse, providing a CC0 license, or stating that metadata is open as a policy. One example of this is seen in the Harvard policy on open metadata. Many institutions have agreed that their metadata is in the public domain, which is a condition for harvest by DPLA, but there is often no metadata reuse statement available at the item or collection level in the source digital repositories for these shared collections. Making it clear that we expect metadata to be reused and repurposed improves the accessibility of digital library collections as data. Providing an easy way for researchers to download metadata in addition to a digital image might also encourage more research engagement with digital collections metadata. An example of this can be found in the University of Hull’s repository, where records are easily downloaded in MODS or Dublin Core. In addition, highlighting investigations undertaken by repurposing library metadata within the digital repository itself could spark additional ideas for research from people who might be encountering this possibility for the first time.

Make digital repositories more welcoming

While offering access to digital collections via an API may be an effective way of showing that computation is possible with digital collections, it doesn’t provide a welcoming environment for students or researchers who are at the initial stages of their research and who might not yet have the technical expertise to utilize an API. Providing a portal to a suite of sample apps created with an API, as DPLA does, along with the search interface for a digital repository creates a signal that application development and computation utilizing a digital library is both possible and desired.

With libraries everywhere continually being asked to do more with less, curating all digital collections for computational purposes may be impossible. However, developing easy ways of bulk download for both images and metadata outside of an API may open up windows for researchers. Providing clear methods to download digital objects across different collections, or to interact with images across repositories through a framework like IIIF, could be yet another method for enabling researchers to interact with library collections as data.

Digital collection managers may be able to curate new local or regional corpora by thinking creatively about digital items they already own. For example, in my own library at the University of Utah, I’ve wondered about the possibility of making our typewritten oral history transcripts available to researchers. These oral histories were scanned as PDFs, and I expect the OCR would be decent enough to support text-based topic modeling. Figuring out how to make these resources accessible to researchers by packaging them in a way that would encourage computational use is a goal of mine.

What does a digital collections as data repository look like?

Providing additional layers and portals that bring computational exploration to existing collections might serve as an intermediate step. Imagine if text-based digital collections also had a Voyant-like layer built into the digital repository itself that researchers could use, along with prepopulated queries and visualizations so people at the beginning stages of inquiry could see examples of text analysis. This could support an introductory approach to exploring collections as data in the classroom. Many digital library repositories leverage visual possibilities for geospatial visualization and browsing, as in the Open Parks Network Map that shows thumbnail images of digital items along with map locations. Could an interface be built into a digital repository that would enable researchers to easily mash up digital items into a personalized portal that would support geospatial visualization without the need to download metadata, enhance information with coordinate data, and then create a more static map in an external system from that exported data? Could our digital repositories provide a mechanism for researchers to curate their own research collections, providing a space where digital library objects could be combined with researcher-supplied data? Any approach will have to blend what is pragmatically possible with support for experimentation within the existing infrastructure for our digital repositories. Keeping in mind the idea of accessibility for researchers and library users at all stages of inquiry will hopefully result in an effective blend of solutions for interacting with library collections as data.

I’d like to thank Jeremy Myntti and Jim McGrath for providing feedback on a draft of this position statement.

ActuallyUsefulCollectionData:SomeInfrastructureSuggestions

MiriamPosner

Libraries and archives are increasinglymaking theirmaterials available online, but, as a general rule,thesematerials aren’t ofmuch use for computational purposes. For themost part, institutions havesoughttoreplicateascloselyaspossibletheexperienceofbeing inareadingroomwithan individualobject. We see this in artifacts like skeumorphic “swishes” on digital page-turns, mammoth lists ofbrowsabletopics,and,whatconcernsmemosthere,theinabilitytodownloadlargequantitiesofobjectmetadata.Many of us have learned the basics of webscraping precisely to get around this problem,laboriouslywritingscriptstoharvestmetadatathatweknowmustalreadyexistsomewhere,asdata,inarepository. Therearemanygoodreasonsculturalinstitutionsimposetheselimitationsontheirmetadata.Foronething, it’s not at all clear howmany people actuallywant to treat collections as data.Most patronsaren’taccustomedtoencounteringdatainaculturalinstitution.Soperhapsarchivesarejustbeinggoodstewardsof limitedresourcesby focusingtheirattentiononsimplymakingdigital facsimilesavailable.But the lackof collectiondataalso limitsotherpeople’s imaginationsaboutwhat theymightdowithcollections’materials. I’ve also been told by various institutions that they don’t have the rightmetadata for researchers toworkwith--thattheirdescriptiveinformationisoftenschematic,high-level,andmeantforsearchanddiscovery,notforvisualizationandanalysis.Iagreethatthisisaconcernthatweneedtotakeseriously,but I contend that even themost basicmetadata is oftenmoreuseful for understanding a collectionthanmanylibrariansimagine.Simplyhavingauthororcreatorinformation,orlanguageinformation,canbe very helpful.My impression is thatmany institutions are holding onto their data tightly,with thehope of cleaning and improving it in the future. But researchers canworkwith imperfect data, if itslimitationsarediscussedfrankly.Wecanalsocontributeimproveddatabacktotheinstitution. 
Going forward, I imagine multiple pieces of infrastructure that could help make the data of cultural institutions as widely usable -- and widely used -- as possible:

A workable humanities data repository or registry. A good many open data repositories already exist. Most of them are designed to hold scientific data, although this need not disqualify them for humanities data. Humanists are actively contributing data (albeit on a relatively small scale) to general-use data repositories such as FigShare and Zenodo. The more troublesome problem is that a) consensus hasn’t built around one particular repository; and b) absent a central repository, no substitute, such as a data registry, gathers lists of cultural data in one place. What cultural data exists is stored, for the most part, on GitHub, which is fine for downloading, versioning, and contributing data, but a terrible way to discover new datasets. We need a better way to find cultural data.

Consideration of APIs versus “data dumps.” Many cultural institutions, reasonably enough, offer APIs as a means of accessing their data. This makes sense for a lot of different reasons, including access to the most recent data and the ability to retrieve institutions’ data in many different ways. The problem here is that many humanists can work with structured data, but not with APIs. Many common visualization tools require no programming, and so it’s possible for humanists to work with data, even in sophisticated, thoughtful ways, without necessarily knowing how to program. Developers at cultural institutions may feel that learning an API is trivial, but for many people, the availability of simple flat files can be the difference between using and not using a dataset. I therefore hope that cultural institutions will consider the possibility of providing unglamorous flat files, in addition to API access to their data.

Really lowbrow thought about data formats. Very simply, my students can work with CSVs, but not XML or JSON. Visualizing and analyzing the latter two formats takes programming knowledge, while even non-coders can import CSVs into Excel and create graphs and charts. Obviously, one can convert XML and JSON to CSVs, but doing this requires some knowledge of these formats, and sometimes some programming (or at least command-line) ability.

Case studies. It may seem unlikely, given the recent proliferation of digital humanities journals, but it’s relatively difficult to find vetted, A-to-Z, soup-to-nuts examples of how to build visualizations and analysis from datasets. The aggregation of a number of fairly simple examples would, I believe, go far in demonstrating how people might use datasets in their own work, and would certainly be of great utility in the classroom. The key here would be to keep the examples quite simple, so that people can replicate and build on them with relative ease.
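To make the JSON-to-CSV conversion mentioned above concrete, here is a minimal sketch in Python. The sample records, field names, and the `flatten` helper are invented for illustration and do not describe any particular institution's data; the point is only that nested and repeated fields have to be squeezed into one value per column before a spreadsheet can open them.

```python
import csv
import io
import json

# Hypothetical collection metadata, shaped like something an API might return.
SAMPLE_JSON = """
[
  {"id": "obj-1", "title": "Letter to a friend",
   "creator": {"name": "A. Writer"}, "languages": ["eng"]},
  {"id": "obj-2", "title": "Carte de visite",
   "creator": {"name": "B. Photographer"}, "languages": ["fra", "eng"]},
  {"id": "obj-3", "title": "Untitled sketch", "languages": []}
]
"""

FIELDS = ["id", "title", "creator", "languages"]


def flatten(record):
    """Turn one nested record into a flat row with one value per column."""
    return {
        "id": record.get("id", ""),
        "title": record.get("title", ""),
        # Nested object: keep only the name; blank when no creator is recorded.
        "creator": record.get("creator", {}).get("name", ""),
        # Repeated field: join the values so they fit in a single cell.
        "languages": "; ".join(record.get("languages", [])),
    }


def json_to_csv(json_text):
    """Return CSV text for a JSON array of records."""
    rows = [flatten(record) for record in json.loads(json_text)]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


csv_text = json_to_csv(SAMPLE_JSON)
print(csv_text)
```

Even in this small case the conversion involves choices (which part of a nested creator to keep, how to join multiple languages) that someone has to make and ideally document alongside the flat file.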

Interoperability and Community Building

Sheila Rabun, International Image Interoperability Framework (IIIF) Consortium

I am coming from a non-traditional background, with a Master’s in interdisciplinary folklore studies, having gained the majority of my experience in libraries as the digital project manager and subsequently the interim director of the University of Oregon (UO) Libraries’ Digital Scholarship Center. Among many digital projects, I was responsible for the Oregon Digital Newspaper Program, where we made large sets of newspaper OCR data and images available to the public online, following the Library of Congress’ Chronicling America site and open API. While digital newspaper data has been used to create visualizations and other computational projects (for example, the Mapping Texts collaboration between the University of North Texas and Stanford University), the learning curve for scholars to find, harvest, and use the data provided remains a challenge. Students and faculty from all subject areas are increasingly looking to library and information professionals for guidance on where to find accessible data resources, how to use them, and recommendations on platforms for sharing their work. In addition to determining best practices for making collections available as data, comprehensive training materials and documentation for end users will be key to lowering the barrier to entry, making it easier for researchers to get started working with data on their own and encouraging wider re-use and experimentation.

Over the past 7 months I have shifted my focus slightly, as the Community and Communications Officer for the International Image Interoperability Framework (IIIF) Consortium, to improve digital image repository maintenance and sustainability as well as access and functionality for end users. As a community-driven initiative including national and state libraries, museums, research institutions, software firms, and other organizations across the globe, IIIF provides specifications for publishing digital image collection data to allow for interoperability across repositories. IIIF specifically addresses the “data silo” problem that has been plaguing the digital repository community, particularly by using existing standards and models such as JSON-LD and Web Annotation that make sharing and re-use easy. A growing number of digital image repositories are adopting IIIF, and the IIIF Consortium has grown to include 40 institutional members since it was formed in 2015.

The IIIF community and specifications are especially relevant to the goals of the Always Already Computational (AAC) work, particularly regarding digital images. IIIF has laid a groundwork for the creation of library collections as data, as an internationally agreed-upon best practice for making digital image data shareable and more usable for study. IIIF utilizes JSON-LD manifests (representations of a physical object such as a book, as described in the IIIF Presentation API) to encourage sharing, parsing, and re-use of data regardless of differing metadata schemas across collections and repositories. The IIIF community has built the specifications specifically around use cases to solve real problems, so far primarily focusing on the needs of those both using and making available digitized manuscripts, newspapers, and museum collections.

We are currently working on extending the IIIF specifications to include interoperability for Audio and/or Visual materials (with 3D materials further along the roadmap), as well as improved discovery of IIIF-compatible resources on the web. Collaboration with the existing community that has formed around IIIF will be essential for the work of AAC, and we welcome new interested parties to get involved, inform and provide feedback on approaches for discovery, and stay informed about new innovations. Libraries and museums have been the primary adopters so far, but we have plans to do more outreach to scholars and researchers in all disciplines, STEM imaging providers, publishers, and the commercial sector. Vendors like CONTENTdm and LUNA have incorporated IIIF into their products, and IIIF is gaining speed in open source efforts like the Hydra-in-a-box repository product, which is IIIF-compatible.

The goals of IIIF and AAC are in alignment, and there is an exciting potential to work more closely together, leveraging the existing IIIF community network and technical framework to create and build upon best practices.
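As a rough illustration of what parsing a manifest can look like, the sketch below reads a small, invented manifest laid out in the style of version 2 of the IIIF Presentation API (a manifest containing sequences, canvases, and painted images) and lists each page label with its image URL. The example.org identifiers and the `list_pages` helper are assumptions made for this example; a real manifest carries many more properties, and other versions of the specification arrange them differently.

```python
import json

# Invented manifest, pared down to the properties this sketch reads.
SAMPLE_MANIFEST = """
{
  "@context": "http://iiif.io/api/presentation/2/context.json",
  "@id": "https://example.org/iiif/book1/manifest",
  "@type": "sc:Manifest",
  "label": "Example Book",
  "sequences": [
    {
      "@type": "sc:Sequence",
      "canvases": [
        {
          "@id": "https://example.org/iiif/book1/canvas/p1",
          "@type": "sc:Canvas",
          "label": "p. 1",
          "width": 1500,
          "height": 2000,
          "images": [
            {
              "@type": "oa:Annotation",
              "motivation": "sc:painting",
              "resource": {
                "@id": "https://example.org/iiif/book1/p1/full/full/0/default.jpg",
                "@type": "dctypes:Image"
              },
              "on": "https://example.org/iiif/book1/canvas/p1"
            }
          ]
        },
        {
          "@id": "https://example.org/iiif/book1/canvas/p2",
          "@type": "sc:Canvas",
          "label": "p. 2",
          "width": 1500,
          "height": 2000,
          "images": [
            {
              "@type": "oa:Annotation",
              "motivation": "sc:painting",
              "resource": {
                "@id": "https://example.org/iiif/book1/p2/full/full/0/default.jpg",
                "@type": "dctypes:Image"
              },
              "on": "https://example.org/iiif/book1/canvas/p2"
            }
          ]
        }
      ]
    }
  ]
}
"""


def list_pages(manifest_text):
    """Return the manifest label and a list of (canvas label, image URL) pairs."""
    manifest = json.loads(manifest_text)
    pages = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                pages.append((canvas.get("label", ""), image["resource"]["@id"]))
    return manifest.get("label", ""), pages


title, pages = list_pages(SAMPLE_MANIFEST)
print(title)
for label, url in pages:
    print(label, url)
```

Because the page structure sits in the same place whatever descriptive metadata schema a repository uses, the same few lines could in principle be pointed at manifests from different institutions.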

From libraries as patchwork to datasets as assemblages?

Mia Ridge, British Library

The British Library's collections are vast, and vastly varied, with 180-200 million items in most known languages. Within that, there are important, growing collections of manuscript and sound archives, printed materials and websites, each with its own collecting history and cataloguing practices. Perhaps 1-2% of these collections have been digitised, a process spanning many years and many distinct digitisation projects, and an ensuing patchwork of imaging and cataloguing standards and licences. This paper represents my own perspective on the challenges of providing access to these collections and others I've worked with over the years.

Many of the challenges relate to the volume and variety of the collections. The BL is working to rationalise the patchwork of legacy metadata systems into a smaller number of strategic systems.[1] Other projects are ingesting masses of previously digitised items into a central system, from which they can be displayed in IIIF-compatible players.[2]

The BL has had an 'open metadata' strategy since 2010, and published a significant collection of metadata, the British National Bibliography, as linked open data in 2011.[3] Some digitised items have been posted to Wikimedia Commons,[4] and individual items can be downloaded from the new IIIF player (where rights statements allow). The BL launched a data portal, https://data.bl.uk/, in 2016. It's a work in progress (many more collections are still to be loaded, and the descriptions and site navigation could be improved) but it represents a significant milestone many years in the making. The BL has particularly benefitted from the work of the BL Labs team in finding digitised collections and undertaking the paperwork required to make them freely available. The BL Labs Awards have helped gather examples of creative, scholarly and entrepreneurial re-use of digitised collections, and BL Labs Competitions have led to individual case studies in digital scholarship while helping the BL understand the needs of potential users.[5] Most recently, the BL has been working with the BBC's Research and Education Space project,[6] adding linked open data descriptions about articles to its website so they can be indexed and shared by the RES project.

In various guises, the BL has spent centuries optimising the process of delivering collection items on request to the reading room. Digitisation projects are challenging for systems designed around the 'deliverable item': the digital user may wish to access or annotate a specific region of a page of a particular item, but the manuscript itself may be catalogued (and therefore addressable) only at the archive box or bound volume level. The visibility of research activities with items in the reading rooms is not easily achieved for offsite research with digitised collections. Staff often respond better to discussions of the transformational effect of digital scholarship in terms of scale (e.g. it's faster and easier to access resources) than to discussions of newer methods like distant reading and data science.

The challenges the BL faces are not unique. The cultural heritage technology community has been discussing the issues around publishing open cultural data for years,[7] in part because making collections usable as 'data' requires cooperation, resources and knowledge from many departments within an institution. Some tensions are unavoidable in enhancing records for use externally: for example, curators may be reluctant or short of the time required to pin down their 'probable' provenance or date range, let alone guess at the intentions of an earlier cataloguer or learn how to apply modern ontologies in order to assign an external identifier to a person or date field.

While publishing data 'as is' in CSV files exported from a collections management system might have very little overhead, the results may not be easily comprehensible, or may require so much cleaning to remove missing, undocumented or fuzzy values that the resulting dataset barely resembles the original. Publishing data benefits from workflows that allow suitably cleaned or enhanced records to be re-ingested, and export processes that can regularly update published datasets (allowing errors to be corrected and enhancements shared), but these are all too rare. Dataset documentation may mention the technical protocols required but fail to describe how the collection came to be formed or what was excluded from digitisation or from the publishing process, let alone mention the backlog of items without digital catalogue records or digitised images. Finally, users who expect beautifully described datasets with high quality images may be disappointed when their download contains digitised microfiche images and sparse metadata.
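One modest way to make an 'as is' export more comprehensible is to publish a count of blank or placeholder values alongside it. The sketch below is a hypothetical illustration in Python: the sample rows, the column names, and the list of placeholder strings are all assumptions, and a real export would need its own list of values that stand for 'not recorded'.

```python
import csv
import io

# Invented export with the kinds of gaps and fuzzy values described above.
SAMPLE_CSV = """id,title,date,creator
1,Map of the coast,1850?,
2,,c. 1900,Unknown
3,Diary,,J. Smith
"""


def profile_missing(csv_text, placeholders=("", "unknown", "n/a")):
    """Count, per column, the cells that are blank or hold a placeholder value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in reader.fieldnames:
            # Short rows yield None for absent cells; treat those as blank.
            value = (row[name] or "").strip().lower()
            if value in placeholders:
                counts[name] += 1
    return total, counts


total_rows, missing = profile_missing(SAMPLE_CSV)
print(total_rows, missing)
```

Fuzzy values such as '1850?' or 'c. 1900' are deliberately left alone here: counting them as missing would discard exactly the kind of curatorial uncertainty that documentation should explain.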

Rendering collections as datasets benefits from an understanding of the intangible and uncertain benefits of releasing collections as data and of the barriers to uptake, ideally grounded in conversations with or prototypes for potential users. Libraries not used to thinking of developers as 'users' or lacking the technical understanding to translate their work into benefits for more traditional audiences may find this challenging. My hope is that events like this will help us deal with these shared challenges.

Works Cited

[1] The British Library, ‘Unlocking The Value: The British Library’s Collection Metadata Strategy 2015-2018’.

[2] The International Image Interoperability Framework (IIIF) standard supports interoperability between image repositories. Ridge, ‘There’s a New Viewer for Digitised Items in the British Library’s Collections’.

[3] Deliot et al., ‘The British National Bibliography: Who Uses Our Linked Data?’

[4] https://commons.wikimedia.org/wiki/Commons:British_Library

[5] http://www.bl.uk/projects/british-library-labs, http://labs.bl.uk/Ideas+for+Labs

[6] https://bbcarchdev.github.io/res/

[7] For example, the 'Museum API' wiki page listing machine-readable sources of open cultural data was begun in 2009, following discussion at museum technology events and on mailing lists: http://museum-api.pbworks.com/w/page/21933420/Museum%C2%A0APIs

Maintaining the ‘why’ in Data: Consider user interaction and consumption of library collections

Hannah Skates Kettler, University of Iowa

Always Already Computational represents the next hurdle for libraries, archives and museums. Now that the profession is comfortable with the notion of digitization, and has reaped the rewards of greater and broader impact (Proffitt and Schaffner, 2008), it has turned its focus towards born digital materials. It's not that born digital materials, in 2017, are a new notion; they are a concept the profession has been aware of but has been hesitant to tackle. As a Digital Humanities professional, I deal with the use and creation of born digital materials every day and adapt to the multiplicitous ways library collections are created and made available, especially in the Humanities.

I therefore approach the questions in Always Already Computational with these concepts in mind:

Relational Datasets:

No library collection is an island. Library collections are not simply a list of ones and zeros that wait to be consumed and reused, then spat out again as something different. At least, not when we want to be able to cite them. Data (which henceforth will be a stand-in for 'library collections') must be persistent in order to be effectively accessible and reused for research. In order to amalgamate various datasets, an immense amount of time is spent standardizing the data into something that can be cross-referenced and used computationally. Understanding that our data are unique, it does not necessarily follow that access should be as unique and idiosyncratic. What Linked Data has provided is a framework to link disparate ideas to each other relationally. I am particularly interested in the possibilities of Linked Data as it applies to datasets that would allow one to describe contextual relationships between the data, relationships which typically are entirely use and user based. The aim is to generalize data in a way that is useful in multiple contexts, by creating a framework that is flexible enough to accommodate data's multiplicity.

Association of Paradata:

Pulling from experience with 3D collections, functioning without standards for how to make born digital materials more usable makes interfacing with other datasets much more difficult than with other, more traditional data. For example, visual materials are much more reliant on supplemental contextual data than text. That is not to say there is no context within textual data, but the aforementioned data could include context within it. Visual data usually lacks this packaged approach. Visuals are associated with text in order to provide that context. Beyond catalogues, visual data's supplemental material is separated from, and unintentionally disassociated from, the visual (think of a search result in an image database). Few image datasets are accompanied by an account of why the image was created. True, one can make inferences based on the basic metadata included with the object, but without intent, it is much more difficult to make a judgement about why the dataset (as generated by an API, for instance) is included and why others were not. It also makes it easier to fake or misrepresent library data/collections.

Cultural Constructs of Data:

Compounding the narrowed context of textual and numerical datasets, problematic visual datasets, and even mixed datasets, you have the social constructs that support data. This aligns very well with the work that I and a group of librarians and museum professionals are doing in association with the Digital Library Federation. As was mentioned in the October 2004 Information Bulletin from the Library of Congress, "Because there is no analog (physical) version of materials created solely in digital formats, these so-called 'born-digital' materials are at much greater risk of either being lost and no longer available as historical resources, or of being altered, preventing future researchers from studying them in their original form." Their particular focus for this remark was the preservation of born-digital data. Now that the profession, to some extent, has the ability and focus for preservation of born-digital, it is time to turn our eye to interoperability (like Always Already Computational) and the cultural context of the data itself. Consider the book The Intersectional Internet: Race, Sex, Class, and Culture Online by Safiya Noble and Brendesha Tynes (2016), which underscores "how representation to hardware, software, computer code, and infrastructures might be implicated in global economic, political, and social systems of control." Data without context is meaningless. Data with context but without social awareness is deceptively meaningless. With that deception comes, in the worst case, the use and articulation of argument founded on a lack of understanding and awareness of perpetuating ideas that are intrinsically linked to the creation and curation of said data. A question for this group would be: how do we attempt to preserve that context without overwhelming the user?

The Always Already Computational group can hopefully come together to attempt to solve this andotherconcernsregardingdigitalaggregatedata.

References

"'Born Digital': Eight institutions and their partners received awards totaling almost $15 million from the Library to collect and preserve digital materials as part of the National Digital Information Infrastructure and Preservation Program". 2004. Library of Congress Information Bulletin. 63(10): 202-203.

Noble, Safiya Umoja, and Brendesha M. Tynes. 2016. The intersectional Internet: race, sex, class and culture online. ISBN: 978-1-4331-3000-7.

Proffitt and Schaffner. 2008. The Impact of Digitizing Special Collections on Teaching and Scholarship: Reflections on a Symposium about Digitization and the Humanities. Report produced by OCLC Programs and Research. Published online at: www.oclc.org/programs/reports/200804.pdf

People and machines both need new ways to access digitized artifacts nonconsumptively

Ben Schmidt, Northeastern University

How can we integrate generations of high-quality, professionally-created metadata with electronic versions of the object itself? Particularly when copyright comes into play, we can't simply hope for openness; and there's a steep trade-off between the thoroughness of a well-thought-out standard and a simplicity of conception that makes a digital resource useful for (for instance) a graduate student just beginning to get interested in working with large collections.

When we digital humanities researchers say that we're working with the "full text" of a scanned book, it's usually more posturing than truth. In fact, what datasets like the HathiTrust Research Center's Extracted Features really do is just radically transform the amount of metadata we have; instead of knowing 10 or 20 things from a MARC record (e.g. the language, four or five subject headings, the author, the publisher), we just add on an additional several thousand (how many times does it use the word "aardvark"? "aardvarks"? "abacus"? ...). All the rest of the information (even simple stuff like syntax, word order, negation) is thrown out. It's great that organizations like JSTOR and Hathi are starting to release this computationally-derived metadata. But there's no clear way to incorporate this computational metadata into a traditional library catalog. The technical demands of even downloading something like the HTRC EF set exceed both the technical competencies and computing infrastructure of most humanists -- I've literally spent several weeks recently restarting downloads and identifying missing files as I try to fill up a RAID array with several terabytes of data. Processing these files into the raw material of research is even harder.
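The point about word order being thrown out can be shown with a toy example. The sketch below is not the HTRC Extracted Features format or pipeline; it is a minimal, assumed stand-in in Python that reduces a text to unordered word counts, using an invented `extract_features` helper and two invented sentences.

```python
import re
from collections import Counter


def extract_features(text):
    """Reduce a text to unordered, lower-cased word counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two sentences that say opposite things but use exactly the same words.
a = extract_features("The aardvark did not eat the abacus.")
b = extract_features("The abacus did not eat the aardvark.")

print(a["aardvark"], a["abacus"], a["the"])
print(a == b)
```

Once reduced to counts, the two sentences are indistinguishable, which is both what makes such features safe to distribute and what limits the questions they can answer.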

So how do we make collections accessible for work? There are two ways that libraries can take more of the burden onto themselves, and distribute (non-copyright-violating) distillations of texts that provide an on-ramp for digital analysis within the reach of mere mortals.

Visual Exploration

One useful and important way to work with this metadata and full text is by exposing it through visualization; this is what projects like the Google Ngrams viewer and the Hathi+Bookworm project, which I've helped work on under an NEH grant, do. Patrons are able to use this combination of full text and catalog metadata to explore the shapes and contours of vast digital libraries. Since they know (sort of!) what any given word means, they can use it to understand how vocabulary changes; find anomalous, interesting, or misclassified items; or understand the limits and constraints of an entire collection, a sorely-needed form of information literacy. We've built the Bookworm platform so the advances we're making with Hathi can be used on any smaller (or larger) library, and we hope others will be interested in using it to explore their texts in the context of their metadata.

[Figure: HathiTrust Bookworm browser]

Low-dimensional embeddings

I'd also like to put on the radar a farther-out-there idea that extrapolates from the current trends in the world of machine learning: the idea of a shared embedding for digital items that would allow machines to compare items across various collections, times, and artifacts. The basic idea of an embedding is to associate a long list of numbers (maybe a few hundred) with a digital object so that items that are similar have similar lists of numbers. These are sort of the inverse of the checksums that libraries frequently associate with digital artifacts now, which are designed so that even the slightest change makes a file get a completely different number. A good embedding will do the opposite: allow users and software to find similar items. In a single collection like Hathi, I've found with even a simple embedding that it's possible to, for instance, look in the neighborhood of a book like "Huckleberry Finn" and find, in the immediate neighborhood, dozens of titles like "Collected Works of Mark Twain, vol. 8" that lack proper titles that would identify them; and in the extended neighborhood other novels about American boys on riverboats.
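The contrast between a checksum and an embedding can be sketched in a few lines of Python. Everything here is an illustrative assumption: the three short texts are invented, and the "embedding" is just a vector of word counts over a shared vocabulary, a crude stand-in for the learned embeddings discussed above rather than the method used on Hathi.

```python
import hashlib
import math
import re
from collections import Counter


def checksum(text):
    """SHA-256 digest: any change to the text gives an unrelated value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def word_counts(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))


def toy_embeddings(texts):
    """Map each text to a list of word counts over the shared vocabulary."""
    counts = [word_counts(text) for text in texts]
    vocabulary = sorted(set().union(*counts))
    return [[float(c[word]) for word in vocabulary] for c in counts]


def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for no overlap."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


original = "a boy and a runaway drift down the river on a raft"
variant = "a boy and a runaway drift down the river on a raft."  # one added full stop
unrelated = "quarterly report on wheat prices and railway tariffs"

vec_original, vec_variant, vec_unrelated = toy_embeddings(
    [original, variant, unrelated]
)

print(checksum(original) == checksum(variant))
print(cosine(vec_original, vec_variant))
print(cosine(vec_original, vec_unrelated))
```

The near-duplicate gets a different checksum but an essentially identical vector, while the unrelated text scores low because it shares only two common words with the original.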

Inside a collection, this makes it possible to find works with improbable metadata. (It's sadly common for the wrong scan to be associated with metadata, and this can be extremely hard to catch.) Across collections, this makes it possible to engage in the work of comparison, duplicate detection,

Perhaps the most interesting thing about embeddings of digital files is that they're not restricted to textual features. Image embeddings are just as possible as textual embeddings, as in this landscape visualization of artworks that Google recently produced.

When Google recently released half a million hours of video, they did it not as image stills but as vectorized features read by a neural network.

These features -- essentially, a computer's rough summary of an artifact into a few hundred numbers -- could make it possible for researchers and students to immediately engage in computational analysis without having to wade through the preparatory steps. If done according to shared standards, they could make collections interoperable in striking ways even when texts or images can't be distributed. It's probably a few years too early to set a specific embedding for different types of documents, but it is time now to contemplate what it would mean to distribute not documents themselves, but a useful digital shadow of them.

Repurposing Discographic Metadata and Digitized Sound Recordings as Data for Analysis

David Seubert, University of California Santa Barbara

Use of sound recordings for research has been slow to develop due to bias against sound recordings as historical documents by textual scholars, lack of descriptive data (discography), and lack of access because of restrictive copyright laws that make it difficult to digitize and provide access to collections. The use of digitized sound recordings or the discographic metadata about sound recordings as data to study is underdeveloped. The UCSB Library wants to encourage scholarship of this kind using the data from the American Discography Project.

The American Discography Project, presently based at the UCSB Library with funding from the Packard Humanities Institute, was originally conceived as the Encyclopedic Discography of Victor Recordings by two record collectors in the early 1960s. They began a project to document every classical recording by the Victor Talking Machine Company, but eventually broadened their goal to include every Victor recording session for 78 rpm discs. In 1966 they were granted liberal access to the recording files held by RCA Victor Records (now Sony Music Entertainment) and devoted many thousands of hours to compiling lists of the tens of thousands of Victor master recording sessions from around the world.

The American Discography Project and its principal product, the Discography of American Historical Recordings (DAHR), is now a research, publication, and digitization program based at the UCSB Library with a goal of documenting disc recordings made during the standard groove era (1900-1950s) by American record companies and digitizing as many as possible for online access. Much of the data about a recording (who, what, where, when) is not documented on the recordings themselves, and can only be determined by consulting a published discography or primary source documents like company recording ledgers.

Now in its fifth decade, the project has expanded beyond Victor to incorporate other published discographies and includes data on recordings made by five early 20th century record companies (Berliner, Victor, Zonophone, Columbia and Okeh) with three more large labels (Brunswick, Decca, Edison) and several smaller ones in the pipeline.

The sheer amount of data documented in the online database is significant. DAHR currently contains over 6.5 million data points documenting systematically and comprehensively the first 45 years of American recording history including:

● 146,524 recording sessions
● 417,428 recording events (takes)
● 107,784 physical manifestations (discs)
● 36,767 names of performers, authors, composers
● 90 languages
● 393 recording locations

The initial project design was to document these recordings in a systematic fashion for the purposes of identification, cataloging by libraries and archives, collectors, and others: a bibliography of sound recordings. One of the further goals of the project is to encourage use of sound recordings as primary source documents by scholars in fields beyond the study of music, and as the project has grown, we have had growing success in this area. Systematically adding audio to the database has allowed scholars to study the recordings in context with authoritative data about their creation.

Sound recordings and the metadata associated with them have not been mined and analyzed the way textual archives have. As the Discography of American Historical Recordings grows in size, it is a prime candidate for manipulation and analysis as data, as it contains standardized elements including language, dates, geographic information (recording locations), genres, names, and titles.

Since the project was designed from the outset to be structured data, including authority control and standardized vocabularies for many elements, a potential and as yet unrealized reuse of the metadata as data is now possible. As a participant in the National Forum, we hope to be able to further conceptualize how this can be best realized.

The Library as Virtual Reality: A Worldbuilding Approach

Laila Shereen Sakr, University of California Santa Barbara

The process of considering digital library collections as data points relies on similar logics foundational to the development of virtual reality (VR). Imagine the library as a VR film or as a computer, temporally and spatially. If the goal of the “Always Already Computational: Library Collections as Data” project is to find a common framework among librarians, curators, and researchers that makes digitally-born scholarship possible, I would like to suggest considering speculative design methodologies, or what Alex McDowell has described as worldbuilding.

Alex McDowell, a deeply influential designer, has shifted how we think about design by fundamentally changing the role design plays in the creative process, potentially altering audiences’ expectations of creative work that ranges from architecture to computer games. Drawing on the literary metaphor “worldbuilding” to explain his approach to design, McDowell’s methods represent a cultural shift in his industry’s production process. Speculating about what the world “might” look like in the future is easy. More challenging, though, is realizing that speculative vision through the design process. McDowell’s work realizing a future-world inspired by Philip K. Dick’s novella in the 2002 film Minority Report is emblematic of a transformation in design process that is made possible through the use of computational media. On Minority Report, McDowell led his production design team, which began as a largely analog art department, through a transition in which they became the first fully-digital art department in the film industry: an example that many other design departments would soon follow and that foreshadowed a broader cultural shift in creative process.

Most of the film’s audience will probably remember the gestural interface of the 3D screens used by the agents in the department: speculative designs that, in turn, have influenced actual technologies ranging from Apple’s iPad to Microsoft’s Kinect. However, Minority Report’s influence in design reached an even wider array of design cultures, including biometrics (particularly retinal scanning), through other imagined technologies woven throughout the film’s environment and plot.

In other words, McDowell’s worldbuilding integrates interdisciplinary humanistic, scientific, and design inquiry with emerging forms of computational media to fundamentally alter the film production process, blurring boundaries between physical and virtual environments and the distinctions between film and other media forms. In the digitally designed world of Minority Report, props could be modeled first as two-dimensional images and later as three-dimensional physical objects. Then, through computer-controlled milling, those models could be used to create final props by sculpting and mold-making. Bringing direction, cinematography, and design together in the virtual space of the pre-visualization stage, props, actors, and the created world interacted throughout the production process. As a result, Minority Report and McDowell’s worldbuilding process signaled a transformation in design culture that has not yet fully played out.

One approach to worldbuilding builds upon a procedure of information design that moves from archiving, to visualizing, to rationalizing, and then to governing. This process must take into account matters of scale. Taking from both information design and game design, worldbuilding relies on several distinct visual perspectives. One involves drawing a complete world map and filling in as much information as possible, then running the game and letting the players explore that world. This visual perspective operates on a large scale. Another perspective begins within a specific town/city/place/room, and as the players explore, more and more of the world is revealed. These are some basic guidelines to consider as one conceptualizes building a virtual world of data.

Applying this theoretical framework to a process of speculative design for future library collections could yield interesting results. The practice and ideas of worldbuilding, in McDowell’s definition, are a clear example of interdisciplinary work connecting the arts, design, media-focused computer science, and elements of the humanities and social sciences. Worldbuilding is both the creation of media and a design research practice, and in neither case is its interdisciplinarity a luxury, because the work simply must engage multiple disciplines in order to achieve a coherent vision and to push many fields forward.

The struggle for access

Tim Sherratt, University of Canberra

For me, exposing cultural heritage collections to computational methods raises difficult, important, and interesting questions about the nature of ‘access’ itself. So while we can and should develop best-practice guidelines, I think we should also admit that we will never be, should never be, satisfied with what cultural institutions deliver. We will always want something more. And that’s a good thing.

I’ve spent far too much of my life hacking the web interfaces of libraries and archives in the pursuit of useful data. But while I would gladly take the time back, I recognise the value of the struggle. Processes such as screen-scraping and normalisation are often frustrating, but they do at least make you think about the processes by which the data was created, managed, and shared.

So for me, one of the key questions is how we expose data to facilitate the use of computational methods while preserving some of the difficulties and irregularities – the chisel marks in the smooth worked surface – that remind us of its history and humanity.

I’m not sure whether this is a metadata question, or a matter of how we frame the relationship between researcher and institution. If we think of machine-actionable data as a product or service delivered by institutions, then researchers are cast as clients or consumers. But if each dataset is not a product, but a problem, then we open up new spaces for collaboration and critique.

I’ve started to realise that I have very little interest in statistics, or even data visualisation as I understand it. I use computational methods to manipulate the contexts of cultural heritage collections. Sometimes this results in useful tools or interfaces, sometimes it’s more akin to art. I’m motivated by the simple desire to see things differently – to poke at the boundaries and limits of systems in the hope that something interesting happens.

What seems to happen fairly regularly is that I find where the systems are broken. For example, while harvesting debates from the Australian parliament’s online database, I discovered about 100 sitting days were missing. This sort of thing happens with complex systems, and the staff at the Parliamentary Library have now fixed the problems. For me, it’s an example of the fact that we can never simply accept what we’re given – search interfaces lie, and datasets have holes. But it also shows that once you open up channels for the transmission of data, information flows both ways.
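
Holes of this kind tend to surface only when a harvest is checked against some independent record of what ought to exist. The following is a minimal sketch of that check, not the method actually used in the harvest described above. It assumes a separately sourced list of expected sitting days is available, and the function name and all dates are made up for illustration.

```python
from datetime import date


def find_missing_days(expected_days, harvested_days):
    """Return the expected days that are absent from the harvest, in date order."""
    return sorted(set(expected_days) - set(harvested_days))


# Hypothetical data: days listed in an independently published sitting calendar...
expected = [date(2010, 2, 2), date(2010, 2, 3), date(2010, 2, 4), date(2010, 2, 8)]
# ...and days for which the harvester actually retrieved debates.
harvested = [date(2010, 2, 2), date(2010, 2, 4)]

missing = find_missing_days(expected, harvested)
for day in missing:
    print(day.isoformat())
```

The comparison is trivial. The hard part in practice is finding a trustworthy list of what should be there, which is itself a question about how the data was created and managed.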

We can’t talk about the need for institutions to provide computation-ready data without considering what they might get in return. The struggle for access might not always be comfortable, but it can be productive. If data is a problem to be engaged with, rather than a service to be consumed, then we can see how researchers might help institutions to see their own structures differently. On a practical level, how might we make it easier for institutions to re-ingest the features and derivative structures identified through use?

I’m also a bit suspicious of scale. Big solutions aren’t always best. Large data dumps are great for researchers with adequate computing power and resources, but APIs support rapid experimentation and light-weight interventions. Similarly, while articulating best-practice for computation-ready data we shouldn’t lose sight of other ways data can be exposed. I want hackable websites as well as downloadable CSVs – all that basic stuff like persistent URLs, semantic HTML, and maybe a sprinkle of RDFa or JSON-LD, enables data to be discovered everywhere, not just in a designated repository.
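
To make the ‘hackable website’ idea concrete, here is a rough sketch of how a script might pick up JSON-LD embedded in an ordinary item page, using only the Python standard library. The page markup, the item, and the example.org URL are all invented for illustration, and the class name is an assumption rather than part of any institution’s tooling.

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Hypothetical item page with a small block of embedded metadata.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Photograph",
 "name": "Example item", "url": "https://example.org/items/123"}
</script>
</head><body><h1>Example item</h1></body></html>
"""

parser = JSONLDExtractor()
parser.feed(PAGE)
parser.close()
records = parser.items
print(records[0]["name"], records[0]["url"])
```

Nothing here depends on a dedicated data service: any page that carries this kind of markup at a persistent URL becomes a small, discoverable dataset.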

As I said, we will always want more. Access will never be open and the job will never be done. We need systems, protocols, guidelines, and collaborations that remind us there is always more to do, and offer the support to continue.

Implications for the Map in a 'Collections as Data' Framework

Tim St. Onge, Library of Congress

I am arriving at the challenge of developing computationally amenable digital library collections from the perspective of a digital cartographer and geospatial analyst. My work for the Library of Congress as a cartographer primarily involves digital map-making and the analysis of born-digital and made-digital geographic information and maps to serve Congressional research requests. My academic and professional backgrounds are based in geographic information science (GIS) rather than in library science. However, I am often thinking about how the Library of Congress can best serve our collections to meet the research and access needs of geographers in a digital age.

All of this is to say that my initial thoughts on developing a “library collections as data” framework are largely shaped by the implications for one type of collection material in particular: the map.

There is enormous potential for the computational analysis of historic maps en masse, with methods that are both text-based (e.g. extracting written text to create gazetteers of place names from certain time periods, cultures, languages, etc.) and image-based (e.g. extracting map features based on groupings of image pixel values of similar color) (Chiang, Leyk & Knoblock 2014). For the full integration of historic maps into Geographic Information Systems, processes like georeferencing and feature digitization, which have achieved varying levels of automation potential, must be completed. It is my view that georeferenced versions of scanned maps in library collections are highly appreciated among researchers and should be more standard “collections as data” offerings from libraries. The georeferenced map viewer created by the National Library of Scotland (2017) demonstrates the tremendous value of this type of data offering.

Given the unique challenges of offering historic maps as computationally amenable collections, I admire the objective of Always Already Computational to conceive of a “collections as data” framework that is multimedia in scope and not only concerned with text analysis of written works (as critically important and valuable as this is).

In my reading of the “Statement of Need” from the Always Already Computational scope of work document, I interpret four major current problems of computationally amenable collections to be (1) the lack of a common collections-transformation framework across institutions, (2) a lack of solutions for non-text media, (3) technical inadequacies in providing collections in large scale, and (4) no data reuse paradigm for collections.

In addressing the first and second problems, I look forward to hearing more on the needs of computational researchers who are working with image-based collections, including, but not exclusively, scanned and digitized maps. In this needs assessment more broadly, in an abstract way, I imagine a hierarchy of use cases and analysis tools. Towards the top are elements that are most readily shared among all kinds of library collections (e.g. all collection items have metadata files in standard format; all text-based, text-extracted items could undergo analyses like frequency visualization or topic modeling). Towards the bottom are more medium-specific elements (e.g. only scanned maps are concerned with georeferencing and geographic projections). In laying out the strongest commonalities among researcher needs in working with library collections, perhaps a framework can be developed that addresses the greatest, unifying needs of collection patrons across diverse uses in the digital humanities and other disciplines. Furthermore, I hope that this framework highlights the unique and worthy challenges of devising solutions for researchers of non-text media.

The third problem of providing collections on a large scale is certainly a critical concern to computational research. If access to collection items is limited to one-by-one downloads or deliveries of physical DVDs of data, simply the “data acquisition” phase can be sufficiently burdensome to slow or stop computational analyses before they even begin. The challenges of large-scale collection access appear to be technological and, as is often the case for libraries and the digital humanities, budgetary. The methods of access detailed in the Always Already Computational scope of work document demonstrate the wide variability among different institutions. I am interested to hear from project participants on the merits of these methods from their experience and what technical and budgetary considerations should be made in the process of developing best practices on this issue.

On the fourth problem of the data reuse paradigm, I believe this issue involves not only technological hurdles, but policy ones as well. Simply put, when researchers or patrons more broadly want to give back to libraries, libraries should trust them. For example, this can take the form of an online-based crowdsourced georeferencing tool that allows users to georeference scanned maps from a library collection and share them back to the library, which thereby shares that resource universally as a GIS-ready raster image (Fleet, Kowal, & Přidal 2012). Another example would be for libraries to host hackathons and other events that invite researchers to interrogate their collections as data and present on their findings, thereby allowing libraries to learn lessons about the kinds of computational research that can (or cannot) work with their collections.
I believe the Archives Unleashed series, which focuses on web archive research, is a great model for this kind of project (Weber 2016). Any frameworks arising from Always Already Computational should encourage these kinds of “data sandbox” projects that allow for experimentation that reveals new insights into the computational analysis of collections as data and provide derived content and research directly back to libraries.

I look forward to learning from the diverse array of participants and contributing my insights to the Always Already Computational initiative.

Works Cited

Chiang, Y., Leyk, S., & Knoblock, C. A. (2014) A survey of digital map processing techniques. ACM Computing Surveys, 47(1), Article 1 (April 2014), 44 pages. Retrieved from http://usc-isi-i2.github.io/papers/chiang14-acm.pdf.

Fleet, C., Kowal, K. C., & Přidal, P. (2012) Georeferencer: Crowdsourced Georeferencing for Map Library Collections. D-Lib Magazine, 18(11/12). Retrieved from http://www.dlib.org/dlib/november12/fleet/11fleet.html.

National Library of Scotland (2017) View maps overlaid on a modern map/satellite image. Retrieved from http://maps.nls.uk/geo/explore/.

Weber, M. S. (2016) Archives Unleashed! Collections as Data | September 27, 2016 | Library of Congress. Retrieved from http://digitalpreservation.gov/meetings/documents/dcs16/3_Weber_ArchivesUnleashed.pdf.

Considering the user

Santi Thompson, University of Houston

As the forum unfolds, I would encourage participants to question and expand our assumptions of those who (re-)use computational library collection data. In my mind, the identities of users and their motivations for coming to the digital library are just as important to understand as the technical requirements needed to re-use data in interoperable and collaborative ways. Knowing your users helps cultural heritage professionals, among other things, to better select content for the future, market the resources and collections available to them, and understand how to describe and make content available to others.[1]

I was pleased to see that the proposal for Always Already Computational acknowledges the user to some degree, noting that current digital library infrastructure and digital collection paradigms do “not meet the needs of the researcher, the student, the journalist, and others who would like to leverage computational methods and tools to treat digital library collections as data.” As such, part of our forum objectives will be to draft potential user stories and “to apply [data definitions and concepts] to a range of potential user communities.” I find this to be incredibly important because libraries (and most likely other cultural heritage organization types) have not spent a vast amount of time asking and publishing on “who is a digital library user.”

My own research has focused in some narrow ways on better understanding digital library users. My collaboration with other members of the DLF Assessment Interest Group’s User Studies Working Group has found that the assessment of digital library reuse is complicated for a whole host of reasons, including the profession’s inability to systematically identify and understand digital library users.[2] Additional research I have done with a co-author suggests that digital library users (note: NOT users of computational data) are more frequently (1) from outside of academia and (2) reusing digital library content for a wide array of non-scholarly pursuits.[3]

I find Always Already Computational to be an exciting opportunity to address major gaps in our current understanding of what a digital library collection is and how it is being used by targeted audiences. While I recognize that demystifying the digital library user is not the primary pursuit of this national forum, I look forward to discussing this as well as other important aspects of the grant with a deeply knowledgeable and inspiring group of participants. I appreciate the opportunity to contribute to such a discussion.

Works Cited

[1] For more on how understanding users and reuses can inform digital library management, see my work with Michele Reilly: “Understanding Ultimate Use Data and Its Implication for Digital Library Management: A Case Study,” The Journal of Web Librarianship 8(2) (2014): 196-213. DOI: http://dx.doi.org/10.1080/19322909.2014.901211.

[2] In 2015 the User Studies Working Group drafted a white paper, “Surveying the Landscape: Use and Usability Assessment of Digital Libraries,” that explored the state of research around three assessment topics: user/usability studies, return on investment, and content reuse. A copy can be found here: https://osf.io/uc8b3/.

[3] See Reilly and Thompson, “Understanding Ultimate Use,” and Michele Reilly and Santi Thompson, “Reverse Image Lookup: Assessing Digital Library Users and Reuses,” The Journal of Web Librarianship (2016): 1-13. DOI: http://dx.doi.org/10.1080/19322909.2016.1223573.

Building Institutional and National Capacity for Collections as Data

Kate Zwaard, Library of Congress

About a year ago, the Library of Congress created a new division, National Digital Initiatives, which I am proud to lead. Our mission is to maximize the benefit of the digital collection, to incubate innovation, and to encourage national capacity for digital cultural memory.

In a recent New Yorker article, the Librarian of Congress said she wants the Library of Congress “to get to the point where there’ll still be a specialness, but I don’t want it to be an exclusiveness. It should feel very special because it is very special. But it should be very familiar.”[1] We in NDI take that message to heart. We believe that an important step in getting users to engage with the Library’s digital material and staff is to provoke, explore, tell stories, and invite.

Our vision is for NDI to help libraries and patrons explore the edges of possibility. To try things ourselves and share with the profession. To help highlight the treasures we have – here at the Library of Congress and in our nation’s cultural heritage institutions – and spark people’s imagination around the potential uses of digitized or born-digital collection objects. To encourage the curious and help them get answers.

To help people understand what a library is.

Upon our founding, the director of National and International Outreach said, “It’s not enough anymore to just open the doors of this building and invite people in. We have to open the knowledge itself for people to explore and use.”[2]

A few things we’ve been working on:

● We organized “Collections as Data,”[2] a conference devoted to exploring what’s possible using computation with digital collections.

● We hosted an Archives Unleashed hackathon, bringing together programmers, librarians, and scholars looking at computational analysis of web archives collections.[4]

● We performed a digital lab proof of concept along with a report exploring how to deliver Library of Congress digital collections as data to on-site researchers.[5]

● We hosted a Software Carpentry Workshop[6] to help teach Library of Congress librarians and others in the neighborhood how to use code to manage and analyze digital collections.

● We’ve started a series of sample code notebooks to help people work with Library of Congress data.[7]

My background is in software development. Before this job, I ran the Repository Development group[8] at the Library of Congress and before that I worked on creating digital preservation software solutions for the Government Publishing Office. My perspective is on the very practical. Institutions have spent a lot of time, effort, and money on digitizing collections and establishing policies and infrastructures around the model of access that mimics analog models. Transforming the technology, staff, and practice to accommodate data analysis is a second paradigm shift that will be just as difficult. For many knowledge institutions, funding is decreasing and becoming less secure while the volume and complexity of digital information is multiplying and the commitment to analog collections remains. In my view, the only way forward is together:

● Leverage connections with physical sciences, social sciences, and journalism. Work together on tooling and training.

● Highlight digital scholarship projects with easy-to-understand outcomes to make the case beyond academia.

● Support distributed fellowship models (NDSR) for building digital stewardship curation skills and building skills for doing digital research.

● Create train-the-trainer programs to help scholars understand what’s possible using computation.

● Get content, methodologies, and tools to K-12 educational audiences.

● Explore legal, cultural and privacy review models to guide researchers using novel digital content, like a light-weight IRB.

● Provide space and time for experimentation.

The Library of Congress “preserves and provides access to a rich, diverse and enduring source of knowledge to inform, inspire and engage you in your intellectual and creative endeavors.”[9] We are thrilled to be a part of this exciting conversation, and look forward to working together.

Works Cited

[1] “The Librarian of Congress and the Greatness of Humility” by Sarah Larson. The New Yorker. February 19, 2017 http://www.newyorker.com/culture/sarah-larson/the-librarian-of-congress-and-the-greatness-of-humility

[2] “Data and Humanism Shape Library of Congress Conference” by Mike Ashenfelder. The Signal. October 21, 2016 http://blogs.loc.gov/thesignal/2016/10/data-and-humanism-shape-library-of-congress-conference/

[3] “Collections as Data Report Summary” by Jaime Mears. The Signal. February 15, 2017 http://blogs.loc.gov/thesignal/2017/02/read-collections-as-data-report-summary/

[4] “Co-Hosting a Datathon at the Library of Congress” by Jaime Mears. The Signal. July 21, 2016 http://blogs.loc.gov/thesignal/2016/07/co-hosting-a-datathon-at-the-library-of-congress/?loclr=blogsig

[5] “Library of Congress Lab: Library of Congress Digital Scholars Lab Pilot Project Report” by Michelle Gallinger and Daniel Chudnov. December 21, 2016 http://digitalpreservation.gov/meetings/dcs16/DChudnov-MGallinger_LCLabReport.pdf

[6] Software Carpentry at the Library of Congress https://oulib-swc.github.io/2017-02-15-loc/

[7] data-exploration GitHub page https://github.com/LibraryOfCongress/data-exploration

[8] “Yes, the Library of Congress Develops Lots of Software Tools” by Leslie Johnston. August 16, 2011 https://blogs.loc.gov/thesignal/2011/08/yes-the-library-of-congress-develops-lots-of-software-tools/

[9] “About the Library” https://www.loc.gov/about/