Metadata Demystified

ISBN 1-880124-59-9

Metadata Demystified: A Guide for Publishers

Table of Contents

What Metadata Is
What Metadata Isn’t
  XML
  Identifiers
Why Metadata Is Important
  What Metadata Means to the Publisher
  What Metadata Means to the Reader
Book-Oriented Metadata Practices
  ONIX
Journal-Oriented Metadata Practices
  ONIX for Serials
  JWP On the Exchange of Serials Subscription Information
  CrossRef
  The Open Archives Initiative
Conclusion
Where To Go From Here
Compendium of Cited Resources
About the Authors and Publishers

Published by: The Sheridan Press & NISO Press

Contributing Editors: Pat Harris, Susan Parente, Kevin Pirkey, Greg Suprock, Mark Witkowski

Authors: Amy Brand, Frank Daly, Barbara Meyers

Copyright 2003, The Sheridan Press and NISO Press

Printed July 2003

This guide presents an overview of evolving metadata conventions in publishing, as well as related initiatives designed to standardize how metadata is structured and disseminated online. Focusing on strategic rather than technical considerations in the business of publishing, this guide offers insight into how book and journal publishers can streamline the various metadata-based operations at work in their companies and leverage that metadata for added exposure through digital media such as the Web. This exposure is an additional way of sharing information about content. It benefits not only publishers, but also potential readers who seek access to published products and the resource discovery environment more generally.

Publishers work with metadata on a daily basis. It is in the manuscript tracking process, in internal reports and content management systems, in marketing copy, and in the information transmitted to the supply chain. Whenever publishers complete copyright registration forms or supply promotional and library cataloging information during the editorial/production process, they create metadata. Similarly, whenever authors cite other publications, or libraries record their holdings, they create metadata.

What Metadata Is

The term metadata refers to information about information or, equivalently, data about data. In current practice, the term has come to mean structured information that feeds into automated processes, and this is currently the most useful way to think about metadata. This definition holds whether the publication that the metadata describes is in print or electronic form. While metadata in publishing can be classified according to a variety of specific functions, such as technical metadata for technical processes, rights metadata for rights resolution, and preservation metadata for digital archiving, this guide focuses on descriptive metadata, or metadata that characterizes the content itself.

Occurrences of metadata vary tremendously in richness; that is, how much or how little of the entity being described is actually captured in the metadata record. The strategic decisions publishers make about metadata often concern how much to expose. The answer to this question depends on the application at hand. In order to enable reference linking across publisher platforms, for instance, the number of metadata elements required is minimal, often less than what occurs in a typical citation. The CrossRef metadata set, which we will look at in section 5, contains only a handful of required elements. For electronic bookselling, where one role of metadata is to approximate the experience of perusing a physical book in a bookstore, the richer the metadata record, the better. Hence, the Online Information Exchange (ONIX) standard for books specifies over 200 elements.

To illustrate what metadata is, let’s look at a simple metadata standard called Dublin Core. The Dublin Core Metadata Initiative (DCMI) got underway in 1995 as a joint effort among professionals from the publishing, library, and academic communities. One outcome of this effort was the Dublin Core Metadata Element Set, which became a NISO standard in 2001 (ANSI/NISO Z39.85-2001) and an international standard (ISO 15836) in 2003.

The DCMI standard includes fifteen optional metadata elements for describing cross-genre, cross-disciplinary information resources. These elements are: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights. Some of these elements relate to the content of the item, some to the item as intellectual property, and others to the particular instantiation, or version of the item.

The Dublin Core website (http://dublincore.org) uses its own metadata scheme to display document information. Table 1 shows a three-element Dublin Core record.

Title                    Overview of Documentation for DCMI Metadata Terms
Identifier               http://dublincore.org/usage/documents/overview
Description of Document  This page provides an overview of official documentation of all DCMI metadata terms.

Table 1. Dublin Core Record

The left-hand column lists element types, and the right-hand column assigns element values for this particular document. Dublin Core has been mapped to several other metadata formats, including the Machine Readable Cataloging (MARC) 21 bibliographic format for representation and exchange of bibliographic information that most library catalogs use today. See http://www.loc.gov/marc for more information.

Metadata in the publishing and communication cycle is not new. What is relatively new to the broader publishing community, and crucial for interoperability in the digital age, is standardization. This is the process of building consensus around best practices in the formatting and use of metadata for specific applications, so that machines can interpret and exchange this information efficiently. In recent years, clear standards have emerged to define metadata elements and the record layout for transmitting those elements.

Standards-building is an ongoing, collaborative process in which book and journal publishers should participate. Despite the fact that a much greater proportion of journal content than book content is digitized, publisher-driven standardization initiatives in book publishing are more advanced than in journal publishing. Book publishers have been driven toward standardization in order to capitalize on aggregated bookselling (traditionally via wholesalers and now through the Internet), which has required them to conform to standards for supplying promotional metadata. Even existing standards have a routine review process to incorporate new features, and publishers can take part via organizations such as the National Information Standards Organization (NISO, http://www.niso.org), in order to have input on how both current and new standards take shape.

The remainder of this document is structured as follows: In the next section, we will refine our operational definition of metadata by explaining its relationship to Extensible Markup Language (XML) and to identifiers. Then we will look at the internal and external roles of metadata in today’s publishing companies, and why metadata has become a strategic issue. Next, we will turn to metadata practices and trends in book publishing. In the final section, we will discuss evolving standards in journal publishing.

Along the way, we will provide pointers to tools and resources that publishers should be
familiar with as they embark on integrating automated metadata processes into their content management, production, and marketing/supply systems. A handful of sample metadata records will be displayed, but these are not intended to replace implementation guidelines for the various standards they illustrate, nor do they reflect the full range of metadata schemes, standards, and initiatives presently in use across the information industry.

What Metadata Isn’t

The term metadata has come to refer to standardized, structured information that machines can interpret and use. The boundaries of this definition often overlap with, yet are not to be confused with, two related sets of conventions: XML, a widely adopted standard for structuring and exchanging data, and identifiers, which are standards for uniquely naming a piece of content or intellectual property. In this section we take a brief look at XML and identifiers to explain their relation to metadata.

XML

Although not a programming language per se, XML is a language for expressing rules that give structure to any kind of textual data, including but not limited to metadata. One way to think about XML in this context is as the information “wrapper” or container of choice for your metadata. XML has been widely adopted because it was designed for precisely the kind of data transfer that comprehensive electronic publishing requires. It also provides an application-independent method for sharing data, and because it is free to license, XML can save publishers money through the use of inexpensive, off-the-shelf tools. A large part of its power comes from the nearly universal support it receives from product vendors, standards bodies, academia, and the open source community.

XML syntax. XML uses a simple syntax that both people and machines can easily process. The syntax consists of matching start and end tags, such as <journal> and </journal>, to mark up information elements. These tags can also be associated with attributes, also known as name-value pairs (e.g., type="print").

Document Type Definition (DTD). An XML DTD provides a description (actually expressed in Standard Generalized Markup Language, or SGML) of the building blocks of any type of XML document, whether that document is a list, a metadata record, a journal article, or a whole book. It includes what to call different types of elements, how they should be ordered, and how they interrelate. Some DTDs are proprietary (created by a company for its internal use), while others are standardized and freely available. The latter include the metadata formats we will discuss in sections 4 and 5.

XML schema. An XML schema (also called an XSD file) is itself an XML document and is an alternative to the DTD that provides developers with enhanced validation capabilities and more refined tools for structuring their own XML-based formats. Whereas DTDs only allow for relatively simple data types, a schema has a set of powerful, flexible semantics for defining what an XML file can contain.

XML workflow. This is not a technical term, but a way of describing the infrastructure that publishers put in place in order to capture data in XML format as needed and streamline processes for creating, re-purposing, and disseminating that data.

For additional information on XML, go to http://www.w3.org/XML
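As a rough illustration of XML as a “wrapper” for metadata, the sketch below serializes the three-element Dublin Core record from Table 1 and reads it back. It is a minimal sketch, not an official DCMI encoding: the <record> container, the helper names to_xml and from_xml, and the mapping of the “Description of Document” row in Table 1 to the Dublin Core description element are assumptions made for the example. The namespace URI is the one DCMI publishes for the fifteen-element set.

```python
import xml.etree.ElementTree as ET

# Namespace DCMI publishes for the fifteen-element Dublin Core set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The three element/value pairs shown in Table 1.
TABLE_1 = {
    "title": "Overview of Documentation for DCMI Metadata Terms",
    "identifier": "http://dublincore.org/usage/documents/overview",
    "description": ("This page provides an overview of official "
                    "documentation of all DCMI metadata terms."),
}


def to_xml(record):
    """Wrap element/value pairs in a hypothetical <record> container."""
    root = ET.Element("record")
    for name, value in record.items():
        # Matching start and end tags are generated for each element.
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Recover the element/value pairs from the XML wrapper."""
    root = ET.fromstring(xml_text)
    prefix = f"{{{DC_NS}}}"
    return {child.tag[len(prefix):]: child.text for child in root}


xml_text = to_xml(TABLE_1)
print(xml_text)
print(from_xml(xml_text)["title"])
```

The round trip is the point of the example: any application that understands the tags can recover the same element values, whatever system produced the file.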

Identifiers

Identifiers are names or strings adhering to certain conventions that, if properly employed, ensure uniqueness. While standard identifiers for publications have been in use for decades, unambiguous identification of content entities has become especially important for electronic publishing and e-commerce platforms. Identifiers and metadata are not one and the same, yet identifiers are most useful in association with metadata.

Identifiers for book and journal publishing can be characterized according to six parameters or features:

• Whether the identifier itself is transparent (derivable) or opaque (not inherently meaningful or interpretable).
• Whether the entity that the identifier points to is a work (an abstraction, not tied to any particular physical medium) or a manifestation (an exemplar in a particular physical medium of a work, such as the online version of a journal that exists in multiple media).
• Whether or not the identifier is actionable in the electronic environment, so that clicking on it takes you directly to the thing being named; for example, the URL identifier.
• If actionable, whether or not the identifier is truly persistent: that is, designed to withstand changes in the online location of content identified. URLs are actionable but not persistent.
• Who drives or regulates the identifier registration process (e.g., the author, publisher, or library community).
• Whether descriptive metadata is registered in association with the identifier.

The following are the most familiar identifiers for books and journals, described in terms of these properties.

ISBN. The International Standard Book Number (ISBN/ISO 2108) is a ten-digit numeric string (e.g., ISBN 0-500-27664-1) that uniquely identifies each manifestation of a book or non-serial publication. Although sub-parts of the ISBN identify country (or language area) and publisher, ISBN strings as a whole are opaque (“dumb numbers”), non-actionable at present, publisher-driven, and not currently associated with a metadata registry. The ISBN standard is now undergoing a revision process that will increase the ISBN string to thirteen digits; in addition, publishers will be encouraged to deposit a core set of metadata as part of the registration process.

For more information on the ISBN, go to http://www.isbn.org/standards/home/index.asp

ISTC. The International Standard Text Code (ISTC/ISO 21047) is a new numbering scheme, mainly but not exclusively for books, under development for unique identification of textual works, as opposed to manifestations. It is intended to be opaque, actionable, and persistent, and potentially to be assigned as soon as a work is conceived by a creator or author. It may potentially be used as an overarching identifier to tie together the various related identifiers registered at the manifestation level. As part of the ISTC registration process, descriptive metadata will be captured by the ISTC registration agency and will include, at minimum, a title, the name of the author or contributor, a unique identifier for the ISTC registrant, registration date, and whether the work identified is derived or original.

For more information on the ISTC, go to http://www.nlc-bnc.ca/iso/tc46sc9/istc.htm

ISSN. The International Standard Serial Number (ISO 3297) is an opaque eight-digit number (e.g., 1234-1231) for unique
identification of journals and other serial resources; the same serial in a different physical medium is assigned a different ISSN, and title changes to serials frequently call for new ISSNs as well. ISSN assignment is a regulated process. ISSNs are assigned by ISSN national centers; publishers should contact their national ISSN center to request an ISSN assignment. The National Serials Data Program (NSDP) at the Library of Congress coordinates the U.S. ISSN program.

Each ISSN assigned to a serial publication is registered in an international database (the ISSN Register), along with a relatively rich metadata record. Among the bibliographic elements in these records are ISSN, key title, abbreviated key title, frequency of publication, language, other forms of the title, place of publication, publisher, former title(s), pointers to other language editions and other media editions, and URLs. ISSN records are available in MARC-compatible format.

For more information on the ISSN, go to http://www.issn.org

SICI. The Serial Item and Contribution Identifier standard (SICI) (e.g. SICI: 0002-8231(199412)45:10<>1.1.TX;2-M) is a NISO standard (ANSI/NISO Z39.56-1996) for the unique identification of a serials issue or article, regardless of the distribution medium. SICI is designed to be dynamically constructed and in that sense is a transparent identifier. It is neither actionable nor associated with a metadata record in its current implementation. Due to their strict, derivable format, SICIs can be created and used by anyone involved in serials management, and automated SICI generators have been created for this purpose.

For more information on the SICI, go to http://sunsite.berkeley.edu/SICI

DOI. The Digital Object Identifier (DOI) syntax is a more recent open standard (ANSI/NISO Z39.84-2000). The DOI system is a complete system for implementing persistent identifiers, and DOIs themselves are variable-length alphanumeric strings (e.g., doi:10.1101/gr.10.12.1841) assigned by publishers at any level of granularity. Although the DOI is designed to be opaque, the DOI suffix can incorporate other existing identifiers as an option. The DOI is also actionable; one click on a properly implemented DOI gets the user to the location of the content being identified. The DOI is persistent because it is paired with the content object’s electronic address, or URL, in an updateable central directory and published in place of the URL; this avoids broken links while allowing the content to move as needed. Although the content that it is linked to may take the form of a manifestation (the electronic version of an article, for instance), the DOI can function as a work-level identifier when it is associated with a rich set of metadata elements describing a work.

Declaration of kernel metadata is in the process of becoming mandatory for all DOIs in the global DOI directory; the requisite metadata follows a carefully designed scheme based on indecs (http://www.indecs.org) to maximize interoperability. This way, the DOI can support a range of applications for electronic content, such as e-commerce, management of rights and permissions, and the creation of learning objects. For example, among the official DOI registration agencies of the International DOI Foundation, Learning Objects Network (http://www.learningobjectsnetwork.com) is applying DOI functionality to SCORM-compliant learning objects. Sharable Content Object Reference Model (SCORM) consists of metadata specifications for a range of e-learning content applications. See http://www.adlnet.org for more information.
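The identifier strings described in this section can be checked and acted on with very little code. The sketch below is illustrative rather than normative: the function names are hypothetical, the check-digit rule (a weighted sum divisible by 11, used by both the ten-digit ISBN and the ISSN) comes from the respective ISO standards rather than from this guide, and https://doi.org/ is assumed as the resolver address that turns a DOI into a clickable link.

```python
def _mod11_ok(chars, top_weight):
    """Weighted-sum test shared by the ten-digit ISBN and the ISSN.

    Weights run from top_weight down to 1; a final 'X' stands for 10.
    """
    total = 0
    for offset, ch in enumerate(chars):
        if ch in "Xx" and offset == len(chars) - 1:
            value = 10
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += (top_weight - offset) * value
    return total % 11 == 0


def is_valid_isbn10(isbn):
    """Check a ten-digit ISBN such as 'ISBN 0-500-27664-1'."""
    chars = isbn.upper().replace("ISBN", "").replace("-", "").replace(" ", "")
    return len(chars) == 10 and _mod11_ok(chars, 10)


def is_valid_issn(issn):
    """Check an eight-character ISSN such as '1234-1231'."""
    chars = issn.replace("-", "").strip()
    return len(chars) == 8 and _mod11_ok(chars, 8)


def doi_to_url(doi, resolver="https://doi.org/"):
    """Pair a DOI with a resolver address (an assumption) to make it actionable."""
    name = doi.strip()
    if name.lower().startswith("doi:"):
        name = name[4:]
    prefix, sep, suffix = name.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    return resolver + name


print(is_valid_isbn10("ISBN 0-500-27664-1"))    # True
print(is_valid_issn("1234-1231"))               # True
print(doi_to_url("doi:10.1101/gr.10.12.1841"))  # https://doi.org/10.1101/gr.10.12.1841
```

The contrast mirrors the text: the ISBN and ISSN can be verified locally but lead nowhere on their own, while the DOI becomes a working link as soon as it is handed to a resolver.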

For more information on the DOI, go to http://www.doi.org

This selection of identifier standards currently in use in book and journal publishing indicates a clear trend toward identifiers with the following properties: actionability, persistence, opacity, and association with metadata. Identifiers with these characteristics best meet the demands of the digital medium. While the ISBN, ISSN, and SICI are not currently actionable, they could well be in the future. Actionable, persistent identifiers add value to publications because they enable new functionality and work reliably in the Web environment. Identifiers do not need to be transparent or inherently meaningful if they are associated with descriptive metadata and primarily interpreted by machines. Finally, registration of an identifier along with metadata lays the groundwork for constructing other automated services around the content being identified.

Why Metadata Is Important

Metadata can take many forms, and metadata records can vary tremendously in richness, creating an array of content management and economic models.

What metadata means to the publisher

Publishers benefit in many ways from automating and streamlining their internal metadata practices. In book publishing, it is still common to see employees in different departments re-keying the same descriptive information for different purposes; for instance, when a new contract is logged, when that same manuscript is launched for editing, when its marketing and catalog copy is created, etc. With appropriate back-office tools and procedures in place, a publisher can set up a database of metadata elements compiled from the various departments. This database can feed multiple metadata templates corresponding to the formats required for different purposes, both internal and external. Given such a system, responsibility for validating the data can be easily shared across departments. At the same time, any update to an information element, such as the title or the price, is automatically propagated to all outputs.

While supplying structured metadata according to several formats may seem like a huge task, the web of mappings among common metadata standards continues to grow, and there are many shared elements across the different standards. For example, the data elements currently proposed for the new ISBN kernel were developed as a subset of ONIX and are a subset of Dublin Core. All of the standards a publisher now encounters are likely to be tagged in XML and to function across several formats.

The benefits of structuring and tagging text apply not only to metadata but to full-text content. Full-text mark-up of books and journals allows them to be readily re-purposed for course-packs, in derivative works requiring a subset or re-ordering of the original content, and as input to emerging archival standards. The key to successful metadata usage is to develop the systems and procedures necessary to maintain and disseminate metadata as an integral part of the publication process. Creating structured metadata as a normal part of the production workflow allows a publisher to provide consistent information about products to all the communities using that information.

What metadata means to the reader

Many of the advantages that publishers reap from effective use of metadata turn out to benefit the reader and research communities as well. For example, the online aggregation
of book metadata brought about by centralized Internet bookselling was a boon for publishers, who saw an unprecedented surge in sales of the backlist titles that they no longer promoted through established channels; it was equally an advantage for scholars seeking out those obscure backlist titles. Readers, for the first time, had at their disposal an easy way to search across a comprehensive, cross-publisher database of available books and complete a purchase. The digital medium has made published information easier to disseminate, search, and sell, and metadata plays a critical role in these advantages.

In the publication of journals, cross-publisher metadata has traditionally been aggregated by intermediaries, or secondary publishers, who create sophisticated tools and services (e.g., citation indexing and resource discovery) around subject-based databases of bibliographic information and journal article abstracts. The process of compiling this metadata has been substantially automated, although there are still some manual components, such as selecting content for inclusion, classification of content, and writing abstracts where they are not already available.

Abstracting and indexing (A&I) services have been a source of income for publishers who sell metadata or have their own secondary divisions. Publishers also earn income from aggregators that license full-text content or link back to publishers and thereby drive article sales and journal subscriptions. These business models are currently in flux. Many publishers now have their own journal websites where they freely provide bibliographic information, abstracts, tables of contents, and other resources that they may previously have considered proprietary.

From the end-user perspective, metadata and its innovative use by publishers have already transformed the research process. One important metadata-driven trend is toward virtual, or distributed, aggregation of information resources. Researchers who have long relied on specialist databases to mine authoritative information resources in their fields now turn to powerful search engines that index, but do not aggregate, those resources. The more robust the metadata that publishers expose for this purpose, the more they will benefit from this trend. Interlinking of resources is another example of distributed integration in e-journal publishing. Both publisher and researcher benefit from initiatives that use metadata and identifier registration to enable cross-publisher linking without aggregation of any proprietary content. (The term distributed integration is attributed to Brian Schottlaender; see Schottlaender, B., “Portals for Integration and Collaboration” presented at the AAP/PSP Annual Conference, Washington, DC, February 2003.)

Publishers are now cooperating directly with one another, with some exposing not only their metadata but also their full text for search and navigation purposes. In addition, automated tools for the intelligent classification of content have become more available. As a result of these trends, there will be less of a need for manual aggregation of subject-based resources in the future. As publisher-supplied metadata grows to include more semantic information about a publication, conceptually based research tools will also evolve. As standards emerge for capturing metadata associated with the individual user (e.g., access rights profiles and personal preferences), frameworks will be required for structuring how that kind of metadata interacts with the metadata for information resources. A number of initiatives are currently underway to specify, at a high level, how metadata standards for different domains (publications, individuals, e-commerce applications) should interoperate.
(See, for example, http://www.indecs.org, http://www.cores-eu.net, and http://www.w3.org/RDF).

Metadata is thus both a marketing tool and a way to add functionality to electronic publications. It allows publishers to “open up” their proprietary content for e-commerce and resource discovery applications such as indexing, search, and linking, while maintaining control over their own trading practices.

Book-Oriented Metadata Practices

An explanation of how book wholesalers, retailers, and libraries use metadata will clarify why metadata is becoming critical to the overall success of every publisher. Historically, wholesalers obtained their information about forthcoming titles from visits by publisher sales representatives and/or catalogs. The wholesaler used the information in a publisher’s catalog to update their in-house inventory database manually, re-keying the data elements their system needed to track customer demand and order a title.

This inventory database was often the source of the catalogs and selection lists the wholesaler created and mailed to their customers (mainly libraries for research and scholarly titles). As wholesalers expanded the number of book titles they stocked or were willing to obtain for their customers, the cost of this re-keying of data increased. At the same time, the shift in technology from microfiche to CD-ROM and then to the Web increased the amount of information that publishers provided on each title.

These factors led wholesalers to seek an electronic means of obtaining title information from publishers. The earliest standard was the Book Industry Systems Advisory Committee (BISAC) Title Status format. This format was eventually superseded by the BISAC X12 832 transaction. Both of these formats are now obsolete, although the push to adopt ONIX as a standard method of communicating metadata has not entirely replaced them.

Ultimately, wholesalers created web-based applications for their customers that require the detailed range of data included in an ONIX record. At present, wholesalers are using a combination of publisher-provided electronic files and manual keying of data to maintain these applications. ONIX is fast becoming the method wholesalers use to update their web products, and the same wholesalers have been licensing their databases to Internet booksellers for use on bookseller sites. The data that publishers provide to wholesalers, therefore, not only updates their internal inventory file but also feeds the wholesalers’ websites and often those of several Internet booksellers. Like wholesalers, booksellers require detailed information about titles to decide on the initial buy. They also require an easy method of placing basic information in their inventory management system, and those with websites require rich metadata for their promotional web pages.

Many library suppliers have developed web-based search and order applications that resemble an Internet retailer’s site, with jacket image, table of contents, first chapter, and so on. These sites allow librarians to access considerably more information about a title than could be provided in a catalog. Librarians are also licensing wholesaler and bibliographic databases such as Bowker’s Books in Print for use on their internal acquisitions systems and Online Public Access Catalogs (OPACs), and beginning to use portions of publishers’ ONIX records to enhance their MARC record data.

ONIX

The ONIX initiative got underway in 1999, with the Association of American Publishers (AAP) bringing together the major publishers, wholesalers, online retailers, and book information services personnel to create a universal, international format in which all trading partners, regardless of their size, could exchange information about books. The working group released ONIX 1.0 in January 2000. Release 2.1 of ONIX is currently in development.

ONIX is now published and maintained by EDItEUR in association with the Book Industry Study Group (BISG, http://www.bisg.org) in the U.S. and the Book Industry Communication (BIC) in the U.K., and has become the international standard for book-trade metadata. In addition to the United States and United Kingdom, France, Germany, and Korea have set up national implementation groups; the ONIX DTD has been extended to accommodate the trading practices in these countries.

ONIX comprises both a content specification and an XML DTD. The content specification includes a comprehensive set of carefully defined data elements, code lists, and XML tags that can be either short codes (e.g. <b012>) or text labels (e.g. <ProductForm>). XML schemas have also been defined for trial purposes.

ONIX was originally designed for books and other non-serial materials, such as audio and point of sale materials produced by book publishers. Its scope has now grown to cover serials (see below), and a version of ONIX has been developed for the video/DVD sector.

ONIX data elements include structured tables of contents, text items (e.g. descriptions, reviews, extracts, author biographies), images (e.g. jackets, author pictures, double page spreads), links to video, audio or websites, territorial rights information, price and availability in different markets, and promotional information, as well as comprehensive bibliographic information.

The following examples show part of the same ONIX sample record, in the first excerpt using plain text “reference names” in XML, and in the second using short tags:

<ProductIdentifier>
  <ProductIDType>02</ProductIDType>
  <IDValue>0816016356</IDValue>
</ProductIdentifier>
<ProductForm>BB</ProductForm>
<Title>
  <TitleType>01</TitleType>
  <TitleText textcase="02">British English, A to Zed</TitleText>
</Title>
<Contributor>
  <SequenceNumber>1</SequenceNumber>
  <ContributorRole>A01</ContributorRole>
  <PersonNameInverted>Schur, Norman W</PersonNameInverted>
  <BiographicalNote>A Harvard graduate in Latin and Italian literature, Norman Schur attended the University of Rome and the Sorbonne before returning to the United States to study law at Harvard and Columbia Law Schools. Now retired from legal practise, Mr. Schur is a fluent speaker and writer of both British and American English</BiographicalNote>
</Contributor>

<productidentifier>
  <b221>02</b221>
  <b244>0816016356</b244>
</productidentifier>
<b012>BB</b012>
<title>
  <b202>01</b202>
  <b203 textcase="02">British English, A to Zed</b203>
</title>
<contributor>
  <b035>A01</b035>
  <b037>Schur, Norman W</b037>
  <b044>A Harvard graduate in Latin and Italian literature, Norman Schur attended the University of Rome and the Sorbonne before returning to the United States to study law at Harvard and Columbia Law Schools. Now retired from legal practise, Mr. Schur is a fluent speaker and writer of both British and American English</b044>
</contributor>
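Because the two sample excerpts in this section differ only in their tag names, one form can be converted to the other mechanically. The sketch below renames short tags to reference names. It is an illustration under stated assumptions, not an ONIX tool: the mapping table is limited to the tags visible in the sample record (a real conversion would take the full list from the ONIX specification), and the <product> wrapper element and the function name to_reference_names are assumptions added so that the fragment parses as a single XML document.

```python
import xml.etree.ElementTree as ET

# Short tag -> reference name, read off the two sample excerpts only.
SHORT_TO_REFERENCE = {
    "productidentifier": "ProductIdentifier",
    "b221": "ProductIDType",
    "b244": "IDValue",
    "b012": "ProductForm",
    "title": "Title",
    "b202": "TitleType",
    "b203": "TitleText",
    "contributor": "Contributor",
    "b035": "ContributorRole",
    "b037": "PersonNameInverted",
    "b044": "BiographicalNote",
}

# Part of the short-tag sample, inside an assumed <product> wrapper.
SHORT_SAMPLE = """
<product>
  <productidentifier><b221>02</b221><b244>0816016356</b244></productidentifier>
  <b012>BB</b012>
  <title><b202>01</b202><b203 textcase="02">British English, A to Zed</b203></title>
  <contributor><b035>A01</b035><b037>Schur, Norman W</b037></contributor>
</product>
"""


def to_reference_names(xml_text):
    """Return the same record with short tags replaced by reference names."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags missing from the table (here, the wrapper) are left alone.
        element.tag = SHORT_TO_REFERENCE.get(element.tag, element.tag)
    return root


record = to_reference_names(SHORT_SAMPLE)
print(ET.tostring(record, encoding="unicode"))
print(record.findtext("Title/TitleText"))
```

Element values and attributes such as textcase pass through untouched; only the labels change, which is why trading partners can accept either style.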

focused their energies on their own proprietary journal platforms and formats. This approach is changing as libraries, publishers, and third parties exchange an increasing amount of catalog information, serials subscription data, and other structured data at multiple bibliographic levels (journal, volume, issue, article). It is in this environment that the developers of ONIX have undertaken efforts to extend ONIX to serials.

ONIX for serials

There are three new ONIX records specific to serials that are currently under review: the Serial Title Record, the Serial Item Record, and the Subscription Package Record. The Serial Title Record is the proposed ONIX format for exchanges of rich catalog information. It provides a readily extensible framework for the description of a journal as a bibliographic item, including such details as the cost of an individual subscription item. The Serial Item Record is the ONIX format for alerting, shipping, library check-in functions, and structured multilevel bibliographic description of serial parts. The Subscription Package Record is the ONIX format for communicating a publisher’s or agent’s product catalog information about subscription packages, along with the Serial Title Record, which carries product catalog information about individual serials.

A Serial Title Record file is linkable to an accompanying Serial Item Record file when more complex price information is required, such as the ability to specify “off-the-shelf” or tailored subscription packages of the kind increasingly being offered by academic journal publishers. This linkage could prove invaluable for sales of journals to consortia.

JWP on the exchange of serials subscription information

Taking ONIX for serials as a starting point, NISO and EDItEUR have recently launched a

Creating an ONIX message involves two basic steps: organizing the data into ONIX-specified fields and storing it in a database; and using an XML software application and the ONIX DTD to organize and tag that data. A single ONIX message may contain data about multiple titles. An ONIX message is transmitted across networks and the Internet the same way that other data is transferred; for instance, as an email attachment or via FTP. Once an online retailer receives an ONIX message, the same tools (an XML software application and the ONIX DTD) are used to validate the data. From that point, the retailer translates the delivered data into what is seen on a web page.
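The two steps described above can be sketched roughly as follows. This is a minimal illustration and not the ONIX format itself: the element names (ONIXMessage, Product, ISBN, Title, Author) are simplified assumptions that have not been checked against the ONIX DTD, the second title is a placeholder, and Python's standard XML library only checks that the message is well formed; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Step 1: data organized into fields, as it might come out of a database.
# The second record is a made-up placeholder.
titles = [
    {"isbn": "1-880124-59-9", "title": "Metadata Demystified", "author": "Amy Brand"},
    {"isbn": "0-000000-00-0", "title": "A Second Title", "author": "A. N. Author"},
]


def build_message(records):
    """Step 2: tag each record and wrap all of them in a single message."""
    message = ET.Element("ONIXMessage")  # assumed root element name
    for record in records:
        product = ET.SubElement(message, "Product")  # one entry per title
        ET.SubElement(product, "ISBN").text = record["isbn"]
        ET.SubElement(product, "Title").text = record["title"]
        ET.SubElement(product, "Author").text = record["author"]
    return message


# Serialize the message to text, ready to send as a file or attachment.
xml_text = ET.tostring(build_message(titles), encoding="unicode")

# The receiving side parses the message again before using the data.
received = ET.fromstring(xml_text)
print(len(received.findall("Product")), "titles in one message")
```

A real exchange would add a DTD-aware validation step on both sides, which is the part this sketch leaves out.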

ONIX differs from other metadata standards in that it is a very rich record with over 200 data elements, some optional and some required. For example, ISBN, author name, and title are required elements; book reviews and cover image remain optional. In contrast, DCMI uses only fifteen repeatable, optional elements. A full ONIX record loaded onto a website provides a searching experience similar to that of browsing the physical book. Just as book retailers and wholesalers came to require an ISBN and a bar code, they will soon require an ONIX record for every new title. Several publishers are already delivering ONIX data feeds to their trading partners.
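The distinction between required and optional elements lends itself to a simple completeness check before a record is sent. The sketch below assumes a record is held as a plain dictionary; the field names are illustrative labels for the elements named in the text, not ONIX tags.

```python
REQUIRED = ("isbn", "author", "title")   # named as required in the text
OPTIONAL = ("reviews", "cover_image")    # named as optional in the text


def missing_required(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED if not record.get(field)]


complete = {
    "isbn": "1-880124-59-9",
    "author": "Amy Brand",
    "title": "Metadata Demystified",
}
incomplete = {"title": "Metadata Demystified", "cover_image": "jacket.jpg"}

print(missing_required(complete))    # nothing missing
print(missing_required(incomplete))  # lists the required fields still needed
```

Optional fields are simply ignored by the check, so a sparse record and a rich one pass alike as long as the required elements are present.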

For more information on ONIX, go to http://www.editeur.org/onix.html

Journal-Oriented Metadata Practices

Journal publishers have been slower to converge on their own metadata standards than book publishers, in part due to a business environment in which metadata was largely the purview of other parties, such as subscription agents, aggregators, and libraries. Although electronic publishing has taken a firm hold in journals publishing, most publishers have


Publisher members of CrossRef initially deposit a record for a content item that consists of minimal bibliographic metadata: journal title, ISSN, first author, year, volume, issue, page number, DOI and URL. Depositing metadata with CrossRef involves creating a file formatted according to an XML schema. The following example illustrates an abbreviated metadata record containing both journal-level and article-level elements:

<journal>
  <full_title>Applied Physics Letters</full_title>
  <abbrev_title>Appl. Phys. Lett.</abbrev_title>
  <issn media_type="print">00036951</issn>
  <issn media_type="electronic">10773118</issn>
  <doi_data>
    <doi>10.1063/aplo</doi>
    <resource>http://ojps.aip.org/aplo/</resource>
  </doi_data>
</journal_metadata>
…
<contributors>
  <person_name sequence="first" contributor_role="author">
    <given_name>Ann P.</given_name>
    <surname>Shirakawa</surname>
  </person_name>
</contributors>
<publication_date media_type="print">
  <year>1999</year>
</publication_date>
<pages>
  <first_page>2268</first_page>
</pages>
<doi_data>
  <doi>10.1063/1.123820</doi>
  <timestamp>19990628123304</timestamp>
  <resource>http://ojps.aip.org/link/?apl/74/2268/ab</resource>
</doi_data>
</journal_article>
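As a rough illustration of how such a record can be read by software, the sketch below pulls the article-level fields out of the example. It assumes the article-level elements sit inside an opening <journal_article> tag, which the abbreviated record above leaves out, and it uses only Python's standard XML library rather than any CrossRef tool.

```python
import xml.etree.ElementTree as ET

# Article-level part of the example record. The opening <journal_article>
# tag is an assumption; the abbreviated example shows only the closing tag.
ARTICLE_XML = """
<journal_article>
  <contributors>
    <person_name sequence="first" contributor_role="author">
      <given_name>Ann P.</given_name>
      <surname>Shirakawa</surname>
    </person_name>
  </contributors>
  <publication_date media_type="print"><year>1999</year></publication_date>
  <pages><first_page>2268</first_page></pages>
  <doi_data>
    <doi>10.1063/1.123820</doi>
    <timestamp>19990628123304</timestamp>
    <resource>http://ojps.aip.org/link/?apl/74/2268/ab</resource>
  </doi_data>
</journal_article>
"""


def summarize(article_xml):
    """Extract the minimal bibliographic fields plus the DOI-URL pair."""
    article = ET.fromstring(article_xml.strip())
    return {
        "first_author": article.findtext("contributors/person_name/surname"),
        "year": article.findtext("publication_date/year"),
        "first_page": article.findtext("pages/first_page"),
        "doi": article.findtext("doi_data/doi"),
        "url": article.findtext("doi_data/resource"),
    }


summary = summarize(ARTICLE_XML)
for field, value in summary.items():
    print(f"{field}: {value}")
```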

After a publisher deposits a record, CrossRef registers the DOI-URL pair in the central DOI directory and maintains the full metadata set in its metadata database (MDDB). In a separate process, the publisher submits the citations contained in each deposited article to the Reference Resolver, the front-end component of the MDDB that allows for the retrieval of DOIs. By using this method, the publisher can, as part of its electronic production process, add outbound hyperlinks to any of an article’s citations that point to content already registered in the CrossRef system.

Joint Working Party (JWP) to explore the creation of standard formats for the exchange of serials subscription information. At the present time, most such exchanges make use of variable, proprietary formats, except where formats appropriate to a given exchange already exist, such as use of the MARC 21 bibliographic format for library holdings data. In the future, there will probably be more pressure on publishers and others to exchange this information in an accurate, efficient, and secure manner. Development of these guidelines also requires standard identifiers for the key elements in the exchange, including parties to the exchange, aggregations, subscription packages, and the journals themselves.

The JWP is currently functioning as three subgroups: one on identifiers, another on publisher-to-library exchanges, and a third on PAMS (Publication Access Management Service)-to-Library exchanges. The immediate goals of the JWP are to implement pilot programs in these three areas during 2003 and ultimately recommend specific enhancements to the ONIX for serials schema.

For more information on the JWP, go to http://www.niso.org/news/SerialsExchange.html

CrossRef

CrossRef is a DOI-based system for the persistent identification of scholarly content and cross-publisher reference linking to the full text of a journal. CrossRef DOIs link to publisher response pages, which include the full bibliographic citation and abstract, as well as providing full-text access as determined by the publisher. The publisher response page often includes other linking options, such as pay-per-view access, journal table of contents and homepage, and associated resources. CrossRef has recently begun adding books and conference proceedings to its linking network.


interlibrary loan (ILL) services, databases, search engines, etc. For the user working in an institutional context, it is often useful to be directed to resources outside the publisher’s site. For example, the institution may not subscribe to the e-journal itself but may still be able to offer the user access to the desired article through an aggregated database or print holdings. In addition, the library may wish to provide a range of linking options beyond what is available at the publisher’s website.

Information providers are beginning to implement the OpenURL to enable optimal integration with library linking systems. This has caused some confusion among primary and secondary publishers who use the CrossRef/DOI system for cross-publisher links to full text, because of the mistaken perception that the OpenURL and the DOI are competing standards; they are not. CrossRef and the DOI provide persistent identification of scholarly content and centralized linking to the full text and other resources designated by the publisher. The OpenURL is designed for localized linking and enables library-controlled links to a multiplicity of resources related to a citation.

The OpenURL and DOI work together in several ways. First, the DOI directory itself (where link resolution occurs in the CrossRef system) is OpenURL-enabled. This means that it can recognize a user with access to a local resolver. When such a user clicks on a DOI, the CrossRef system redirects that DOI back to the user’s local resolver and at the same time allows the DOI to be used as a key to pull metadata out of the CrossRef database, metadata that is needed to create the OpenURL that targets the local link resolver. As a result, the institutional user clicking on a DOI is directed to appropriate resources.
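The last step in that sequence, turning metadata retrieved by DOI into an OpenURL aimed at a local resolver, might look something like the sketch below. The resolver address is hypothetical, and the parameter names (genre, issn, date, spage, aulast, id) are assumptions about the key-value style of the draft OpenURL syntax; they should be checked against the OpenURL documentation before use.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical base URL of an institution's link resolver.
RESOLVER_BASE = "http://resolver.example.edu/openurl"


def make_openurl(base, metadata):
    """Carry citation metadata and the DOI as query parameters (assumed names)."""
    params = {
        "genre": "article",
        "issn": metadata["issn"],
        "date": metadata["year"],
        "spage": metadata["first_page"],
        "aulast": metadata["surname"],
        "id": "doi:" + metadata["doi"],
    }
    return base + "?" + urlencode(params)


# Metadata as it might be pulled from the database using the DOI as a key;
# the values come from the sample deposit record in this guide.
metadata = {
    "issn": "0003-6951",
    "year": "1999",
    "first_page": "2268",
    "surname": "Shirakawa",
    "doi": "10.1063/1.123820",
}

openurl = make_openurl(RESOLVER_BASE, metadata)
print(openurl)

# The local resolver reads the same parameters back out of the URL.
received = parse_qs(urlsplit(openurl).query)
print(received["id"][0])
```

Because both the identifier and the descriptive metadata travel in the link, the resolver can choose whichever is more useful for the resources its library holds.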

By using the CrossRef/DOI system to identify their content, publishers can make their

If the identified content migrates from one production system to another (e.g., pre-print to post-print), or moves from one publisher to another when a journal (or the publisher itself) changes ownership, the publisher need only update the URL in one place in order for the DOI to persist. In all these cases the DOI never changes, which means that all the links to that content that have already been made will still function.
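The persistence argument comes down to one level of indirection: links store the DOI, and only the directory stores the URL. The toy class below is a stand-in written for this illustration, not CrossRef's directory, and the new URL is hypothetical.

```python
class DoiDirectory:
    """Toy stand-in for a central DOI directory: one current URL per DOI."""

    def __init__(self):
        self._urls = {}

    def register(self, doi, url):
        self._urls[doi] = url

    def update(self, doi, new_url):
        # Only an already registered DOI can be pointed somewhere new.
        if doi not in self._urls:
            raise KeyError(f"DOI not registered: {doi}")
        self._urls[doi] = new_url

    def resolve(self, doi):
        return self._urls[doi]


directory = DoiDirectory()
directory.register("10.1063/1.123820", "http://ojps.aip.org/link/?apl/74/2268/ab")

# A citing article stores only the DOI, never the URL.
citing_link = "10.1063/1.123820"
before = directory.resolve(citing_link)

# The content moves; the publisher updates one directory entry.
# The new address is a made-up example.
directory.update("10.1063/1.123820", "http://new-platform.example.org/apl/74/2268")
after = directory.resolve(citing_link)

print(before)
print(after)
```

The citing link is untouched by the move, which is why links made before the change keep working.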

The CrossRef Reference Resolver accepts bibliographic metadata and returns the corresponding DOI. Queries are formatted in a pipe-delimited format containing ten fields for queries against journal holdings and twelve fields for queries against books and conference proceeding holdings. These queries are submitted interactively through a Web browser interface or programmatically via the system’s HTTP interface. The resolver will also accept a DOI as input and return the associated metadata. When a query result is returned, the metadata can be presented in either the same pipe-delimited format or as XML.
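A pipe-delimited journal query can be assembled and read back with a few lines of code. The text gives only the number of fields, so the field names and their order below are assumptions made for illustration and should be checked against the CrossRef documentation; no resolver address is shown because none is given here.

```python
# Assumed names and order for the ten journal query fields.
JOURNAL_FIELDS = (
    "issn", "title", "author", "volume", "issue",
    "page", "year", "type", "key", "doi",
)


def format_query(**known):
    """Build one query line; fields that are not supplied stay empty."""
    unknown = set(known) - set(JOURNAL_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return "|".join(str(known.get(field, "")) for field in JOURNAL_FIELDS)


def parse_result(line):
    """Split a ten-field pipe-delimited line back into named fields."""
    values = line.split("|")
    if len(values) != len(JOURNAL_FIELDS):
        raise ValueError(f"expected {len(JOURNAL_FIELDS)} fields, got {len(values)}")
    return dict(zip(JOURNAL_FIELDS, values))


query = format_query(issn="00036951", author="Shirakawa", page="2268", year="1999")
print(query)

# A result line in the same format, with the DOI field filled in by hand
# to show the round trip.
result = parse_result("00036951||Shirakawa|||2268|1999|||10.1063/1.123820")
print(result["doi"])
```

Keeping the field list in one place means the same code could handle the twelve-field book and proceedings format by swapping in a different tuple.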

For more information on CrossRef, go to http://www.crossref.org

OpenURL and CrossRef. The OpenURL is a mechanism for transporting metadata and identifiers describing a publication for the purpose of context-sensitive linking. The OpenURL is currently on the path toward NISO approval.

A link resolver is a system for linking within an institutional context that can interpret incoming OpenURLs, take the local holdings and access privileges of that institution (usually a library) into account, and display links to appropriate resources. A link resolver allows the library to provide a range of library-configured links and services, including links to the full text, a local catalog to check print holdings, document delivery or


For more information, go to http://www.openarchives.org

Conclusion

Metadata has become an essential part of the publication process. Whether an information resource is published in book or journal form, in print or electronic format, metadata is how the content creator or producer advertises its existence. The richer the metadata record, the greater the possibilities.

As the sea of information grows, locating, discovering, linking to, searching on, re-purposing, integrating, tracking, exchanging, or selling a given information resource all tend to become more complex processes. Good metadata practices reduce some of this complexity and help publishers harness the new opportunities that new technologies will bring.

Where To Go From Here

Without recommending specific products or vendors, the following list provides some information resources on electronic publishing that serve as good starting points:

Since 1997, Sheridan Press has published a series of white papers on information technology and publishing, available at http://www.sheridanpress.com/whitepapers.htm.

NISO standards and guides are available to the public without charge from the NISO website: http://www.niso.org. NISO offers workshops and programs throughout the year focusing on standards and good publishing practices.

Both the Society for Scholarly Publishing (http://www.sspnet.org) and the Council of Science Editors (http://www.councilofscienceeditors.org) offer tutorials on electronic publishing topics.

products OpenURL-aware. Since DOIs can streamline linking and data management processes for publishers, many publishers are beginning to require that the DOI be used as the primary mechanism for linking to full text; link resolvers can then use the CrossRef system to retrieve the DOI if the DOI is not already available from the source (or citing) document.

For more information on the OpenURL, go to http://library.caltech.edu/openurl

The Open Archives Initiative

Although the Open Archives Initiative (OAI) got underway as a means of supporting distributed e-print archives with tools for interoperability, a growing number of publishers now recognize its value as a tool for disseminating publisher metadata. The OAI framework for exposing metadata through the OAI Protocol for Metadata Harvesting (OAI-PMH) is entirely independent of the type of underlying content and the economic models surrounding that content.

OAI-PMH defines an easy-to-implement tool for harvesting XML-formatted metadata from content repositories, or servers. Participation can take one of two forms: data providers use OAI-PMH to expose metadata, while service providers use metadata harvested via the OAI-PMH to build new services. To quote Clifford Lynch, Executive Director of the Coalition for Networked Information (CNI), OAI-PMH is “simply an interface that a networked server (not necessarily an e-print server) can employ to make metadata describing objects housed at that server available to external applications that wish to collect this metadata.” (See Lynch, C., ARL Bimonthly Report 217 titled “Metadata Harvesting and the Open Archives Initiative” available at http://www.arl.org/newsltr/217/mhp.html.)
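From the service provider's side, a harvest is an ordinary HTTP request followed by XML parsing. The sketch below builds a ListRecords request for Dublin Core metadata and reads titles from a small hand-written response, so it runs without a network connection. The repository address, record identifier, and sample response are made up for illustration; the verb, the metadataPrefix parameter, and the namespaces follow the OAI-PMH specification as the editor understands it.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://repository.example.org/oai"  # hypothetical data provider

# Namespaces used in OAI-PMH responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written stand-in for what a data provider might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Metadata Demystified</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base, metadata_prefix="oai_dc"):
    """Build the request a service provider would send to a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base + "?" + query


def harvest_titles(response_xml):
    """Collect the Dublin Core titles from every record in a response."""
    root = ET.fromstring(response_xml)
    titles = []
    for record in root.iterfind(".//oai:record", NS):
        for title in record.iterfind(".//dc:title", NS):
            titles.append(title.text)
    return titles


request_url = list_records_url(BASE_URL)
titles = harvest_titles(SAMPLE_RESPONSE)
print(request_url)
print(titles)
```

Nothing in the request depends on what kind of content the server holds, which matches the point that the protocol is independent of the underlying content.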


Compendium of Cited Web Resources

Book Industry Study Group (BISG) http://www.bisg.org
Coalition for Networked Information (CNI) http://www.cni.org
Columbia Guide to Digital Publishing http://www.digitalpublishingguide.com
CORES Forum on Shared Metadata Vocabularies http://www.cores-eu.net
Council of Science Editors (CSE) http://www.councilofscienceeditors.org
CrossRef http://www.crossref.org
DCLNews http://www.dclab.com/DCLNews.asp
Digital Object Identifier (DOI) http://www.doi.org
Dublin Core Metadata Initiative (DCMI) http://dublincore.org
Extensible Markup Language (XML) http://www.w3.org/XML
International Standard Book Number (ISBN) http://www.isbn.org/standards/home/index.asp
International Standard Serial Number (ISSN) http://www.issn.org
International Standard Text Code (ISTC) http://www.nlc-bnc.ca/iso/tc46sc9/istc.htm
Interoperability of Data in E-Commerce Systems (INDECS) http://www.indecs.org
Learning Objects Network (LON) http://www.learningobjectsnetwork.com
Machine Readable Catalog (MARC) http://www.loc.gov/marc
National Information Standards Organization (NISO) http://www.niso.org
NISO-EDItEUR Joint Working Party on the Exchange of Serials Subscription Information http://www.niso.org/news/SerialsExchange.html
NYU Center for Publishing http://www.scps.nyu.edu/departments/index.jsp
Online Information Exchange (ONIX) http://www.editeur.org/onix.html
Open Archives Initiative (OAI) http://www.openarchives.org
OpenURL http://library.caltech.edu/openurl
Serial Item and Contribution Identifier Standard (SICI) http://sunsite.berkeley.edu/SICI
Seybold Reports http://www.seyboldreports.com
Sharable Content Object Reference Model (SCORM) http://www.adlnet.org
Sheridan Press White Papers http://www.sheridanpress.com/whitepapers.htm
Society for Scholarly Publishing (SSP) http://www.sspnet.org

The NYU Center for Publishing, part of the School of Continuing and Professional Studies (http://www.scps.nyu.edu/departments/index.jsp), offers classes on ONIX and technology and publishing.

The Columbia Guide to Digital Publishing, edited by William Kasdorf, is available for online browsing at http://www.digitalpublishingguide.com, and is an excellent, up-to-date resource on XML, content management, and related workflow issues.

Data Conversion Labs publishes a newsletter called DCLNews at http://www.dclab.com/DCLNews.asp that is available via free subscription. It offers in-depth reports, news briefs, and other information about current educational opportunities and resources in electronic publishing.


About the Authors and Publishers

Amy Brand joined CrossRef as Director of Business Development in April 2001. Her career spans electronic publishing, book publishing, and academia. She has previously held positions at Ingenta, LEA Inc., the University of Pennsylvania, and The MIT Press, where she was an executive editor from 1994 to 2000. She received her doctorate in cognitive science from MIT in 1989. Contact Information: CrossRef, 40 Salem Street, Lynnfield, MA 01940. V: 781-295-0072; F: 781-295-0077; E: [email protected].

Frank Daly, until recently, was Executive Director of the Book Industry Study Group. For more than twenty years, Frank was with Baker & Taylor, Inc. During that time he served in a variety of roles, including Director of Marketing, Public & School Libraries, and Vice President, Business Development. Frank is on the advisory boards of Clarion University, NYU’s Center for Publishing, and KnowledgeMax, a corporate intranet provider. He is past President of The American Wholesale Booksellers Association. Frank received his MBA from Fordham University and his BBA from the University of Massachusetts. Contact Information: 30 Tiberon Drive, Holmdel, NJ 07733. V: 732-817-1774; F: 732-817-1774; E: [email protected].

Barbara Meyers, president of Meyers Consulting Services (est. 1983), provides expert advice and experienced operational support to professional societies, scholarly publishers, and their supplier communities in the areas of management, marketing, planning, and research. One of the founders of the Society for Scholarly Publishing (SSP), Barbara currently serves on the SSP Board of Directors and is a past president of the Council of Science Editors (CSE). She holds two degrees from George Washington University: her bachelor’s in science journalism and her master’s in science, technology, and public policy with a specialization in technology assessment. For additional information on Barbara’s background, services, and client list, please visit the MCS web site at http://www.MCSone.com. Contact Information: Meyers Consulting Services, 1836 Metzerott Road, Suite 1003, Adelphi, MD 20783-3448. V: 301-434-6249; F: 301-434-0126; E: [email protected].

NISO Press is the publishing program of the National Information Standards Organization (NISO). NISO, a nonprofit association accredited by the American National Standards Institute (ANSI), identifies, develops, maintains, and publishes technical standards to manage information in our changing and ever-more digital environment. NISO standards apply both traditional and new technologies to the full range of information-related needs, including retrieval, re-purposing, storage, metadata, and preservation. Contact Information: NISO, 4733 Bethesda Avenue, Suite 300, Bethesda, MD 20814. V: 301-654-2512; F: 301-654-1721; E: [email protected]. Website: http://www.niso.org.

The Sheridan Press provides a full range of printing and publishing services and technology innovations to associations, publishers, and university presses within the scientific, technical, and medical journal markets. Contact Information: The Sheridan Press, 450 Fame Avenue, Hanover, PA 17331. V: 717-632-3535; F: 717-633-8900. Website: http://www.sheridanpress.com.


Printing and Publishing Services
450 Fame Avenue
Hanover, Pennsylvania 17331

For more information about The Sheridan Press or to request additional copies of the Metadata Demystified White Paper, call Prudi Showers at 717-632-3535, contact her by e-mail at [email protected], or fax this form to 717-633-8900.

Name Title

Company

Address

Phone Number Fax Number

E-Mail Address

The Sheridan Press Publications and Literature

I am interested in additional copies of the following White Papers:

_____ Metadata Demystified (in collaboration with NISO Press) (7/03)

_____ Member Recruitment (4/03)

_____ Digital Art (5/02)

_____ Implementing Information Technology Systems (1/02)

_____ Marketing Reprints (10/01)

_____ Marketing Scholarly Journals (5/01)

_____ Digital Archiving in the New Millennium: Developing an Infrastructure (11/00)

_____ Improving Journal Quality with Process Improvement Methods (5/00)

_____ Digital Workflow: Managing the Process Electronically (3/00)

_____ How to Make the Most of Reprints (5/99)

_____ The Future of the Print Journal (2/99)

_____ Outsourcing (6/98)

_____ Archiving (9/97)

I am interested in more information about:

_____ The Sheridan Press

_____ Sheridan Reprints

_____ The Sheridan Group
