Copyright 2011 Inera Incorporated. All Rights Reserved

Variations in XML Reference Tagging in Scholarly Publication

Presented by

Bruce D. Rosenblum

CEO

Inera Incorporated

Journal Article Tag Suite Conference, 26 September 2011

Why Are References Hard?

Beck, J. (2011). NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs? Journal of Electronic Publishing, 14(1). http://dx.doi.org/10.3998/3336451.0014.106

<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Beck</surname><given-names>J.</given-names></name></person-group> (<year>2011</year>). <article-title>NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs?</article-title> <source>Journal of Electronic Publishing</source>, <volume>14</volume>(<issue>1</issue>). <pub-id pub-id-type="doi">10.3998/3336451.0014.106</pub-id></mixed-citation>

First Online Publication

Beck J. (2011.). NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs? Journal of Electronic Publishing, 14(1). 10.3998/3336451.0014.106.

The Conversation, Part 1

Bruce: “Your template is adding an extra period after the year. Perhaps I need to grab a screen shot and add a new section to my paper about errant reference templates?”

Jeff: “That's what happens when you send non-NLM style citations to the NLM. :) It’s fixed now - no problemo”

Second Publication

Beck J. (2011). NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs? Journal of Electronic Publishing, 14(1). 10.3998/3336451.0014.106.

The Conversation, Part 2

Bruce: “The DOIs aren't live links in my references. Can you map pub-id appropriately or do I need to violate best practices and make them ext-link?”

Jeff: “We have flip-flopping policies on how we handle pub-id links outside of NCBI resources. I'll build the links here so you and your XML can stay pure. ;)”
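
The mapping Bruce asks for can be sketched in a few lines of Python. This is only an illustration, not PMC's rendering code: the function name `doi_links` and the resolver prefix `https://doi.org/` are assumptions made for the example (the slides themselves use the older `http://dx.doi.org/` form).

```python
import xml.etree.ElementTree as ET

# Assumed resolver prefix; the 2011 slides use http://dx.doi.org/ instead.
DOI_RESOLVER = "https://doi.org/"

def doi_links(citation_xml: str) -> list[str]:
    """Return a resolver URL for every <pub-id pub-id-type="doi"> in a citation."""
    root = ET.fromstring(citation_xml)
    links = []
    for pub_id in root.iter("pub-id"):
        if pub_id.get("pub-id-type") == "doi" and pub_id.text:
            links.append(DOI_RESOLVER + pub_id.text.strip())
    return links

# Abbreviated version of the Beck reference from the opening slide.
example = (
    '<mixed-citation publication-type="journal">'
    '<source>Journal of Electronic Publishing</source>, '
    '<volume>14</volume>(<issue>1</issue>). '
    '<pub-id pub-id-type="doi">10.3998/3336451.0014.106</pub-id>'
    '</mixed-citation>'
)
print(doi_links(example))
```

Because the link is derived from the semantic `pub-id` element at render time, the XML does not need an `ext-link` just to get a live hyperlink.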

Third Publication

Beck J. (2011). NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs? Journal of Electronic Publishing, 14(1). 10.3998/3336451.0014.106.

The Conversation, Continued

Bruce: “I understand that there may be some reticence to link to non-NCBI resources. However DOI is so widely accepted by libraries (including by NLM) that I think it would be helpful for NCBI to build links for DOIs any time they show up correctly tagged in XML.”

Jeff: “It is not the outside that is the problem. The problem we've seen is that so many of them are not correct that we get complaints when they don't resolve.”

The Failings of Links

Bruce: “What percent of DOIs that you get do you think fail? Do you have any metrics on that?”

Jeff: “No. But that would be interesting. My guess is that they are hopefully getting better. But when they are created for print and not checked by a person, there will be problems. Add to that hard hyphens when they break on the page (certainly we can't just remove hyphens from dois).”

Bruce: “We've also found an average of 10% to 15% failure rate for URLs (and higher when extracted from a PDF file), so by the logic you've suggested (users complaining about dead links), you shouldn't make those live hyperlinks either...”

What is the Purpose of References?

… to give an indication of how authoritative the source is — in order for the reader to decide whether he should bother to pursue the source in the first place.

… the minimal set of metadata that can unambiguously identify the source… so that the reader can go to a shelf… write an ILL request, or… send sufficient metadata set to CrossRef, so that its matching algorithm could return the source's DOI to the requester (Knox and Schwarzman, 2004).

References Gone Wrong

Corrupt or missing references can be a source of minor irritation or major inconvenience — misquoted references increase the probability that a citation index such as Web of Science will not be able to link the citations to the source article. In today’s metric-driven world, not receiving credit, in terms of citations, for the work that one has published can actually make a difference in terms of promotions, tenure, grant funding, etc. (Wates and Campbell, 2007)

Variations in Reference Style

References have many, many editorial styles
  AMA, APA, CSE, MLA, etc.
  Journal-specific styles
  Author “invented” styles

And XML has almost as many variations to tag them

Online Resources

Online references exacerbate the problem

Dickens, Charles. A tale of two cities [Internet]. Charlottesville (VA): University of Virginia Library, Electronic Text Center; 1994; c1999 [updated 1996 May; cited 2002 Apr 29]. 820K bytes. Available from: http://etext.lib.virginia.edu/toc/modeng/public/DicTale.html

(for extra credit, try tagging the above with nlm-citation and re-rendering as shown)

It’s a Whole New World

“All the rules we've spent years developing are out the window”

Karen Patrias, National Library of Medicine

“The Internet made a lot of things very simple. Bibliographies aren't among them”

Wall Street Journal, 2 May 2002, p. A1

In The Beginning…

12083 <citation> model

<!ELEMENT citation - O %m.bib; >
<!ENTITY % m.bib "(no?, title, (%bib;)*)" >
<!ENTITY % bib "author|corpauth|msn|sertitle|location|date|pages|subject|othinfo" >

PMC 1.0 bibl

<!ELEMENT refgrp (st?, bibl+)>

<!ELEMENT bibl (title?, edg?, (aug | insg)*, firstauaff?, ang?, source?, issn?, publisher?, pubdate?, ((volume?, edition?, issue?, fpage?, lpage?, exlnk?) | inpress)?, p?, xrefbib?)>

<!ELEMENT fm (doctopic*, dochead?, docsubj*, supptitle?, sertitle*, sertext?, addart?, bibl, suppmat?, history?, com?, con?, cor*, cpyrt?, relart*, shortabs?, abs?, kwdg?)>

bibl Used in Front Matter

<fm><dochead>Research article</dochead>
<bibl><title><p>Analysis of stress- and host ...</p></title>
<aug>
<au ca="yes"><snm>Triccas</snm><mi>A</mi><fnm>James</fnm><insr iid="I1"/><insr iid="I2"/></au>
<au ca="no"><snm>Gicquel</snm><fnm>Brigitte</fnm><insr iid="I1"/></au></aug>
<insg><ins id="I1"><p>Unite de Genetique Mycobacterienne</p></ins>
<ins id="I2"><p>Centenary Institute of Cancer Medicine and Cell Biology</p></ins></insg>
<source>BMC Microbiology</source><issn>1471-2180</issn><pubdate>2001</pubdate>
<volume>1</volume><issue>1</issue><fpage>3</fpage></bibl>
<abs><sec><st><p>Abstract</p></st><p>The gene encoding the inorganic...</p></sec></abs></fm>

Bibl used in refgrp

<bibl id="B1">

<title><p>Dissecting the biology of a pathogen during infection.</p></title>

<aug>

<au ca="no"><snm>Heithoff</snm><fnm>DM</fnm></au>

<au ca="no"><snm>Conner</snm><fnm>CP</fnm></au>

<au ca="no"><snm>Mahan</snm><fnm>MJ</fnm></au></aug>

<source>Trends Microbiol</source><pubdate>1997</pubdate>

<volume>5</volume><fpage>509</fpage><lpage>513</lpage>

</bibl>

Element reuse is good; element overloading is not

Green DTD 1.0

Return of <citation> (at least the name)
  Designed for archive requirements
  Markup of journal and non-journal references
  Inclusion of boilerplate text (allows PCDATA between elements)
  Permits element placement in presentation order

(Well… almost. Author name elements in <citation> were in a prescribed order in version 1.0 and did not permit boilerplate text)

Blue DTD 1.0

Adds <nlm-citation>

Structured citation model to assist users creating “new” content; the model loosely reflects the NLM’s style in that it allows the tagging of all “legal” NLM citations and enforces the sequence in which content must appear if it is present

nlm-citation vs. citation
  Prescribed order
  Most elements not repeatable
  No boilerplate text allowed

The Problem with nlm-citation

Fukumoto Y (1972b) Study on the behaviour of stabilization piles for landslides. Soil and Foundation 12(2), 61–73 [in Japanese].

<nlm-citation citation-type="journal"><person-group><name><surname>Fukumoto</surname><given-names>Y</given-names></name></person-group><article-title>Study on the behaviour of stabilization piles for landslides.</article-title><source>Soil and Foundation</source><year>1972</year><volume>12</volume><issue>2</issue><fpage>61</fpage><lpage>73</lpage><comment>b</comment><comment>[in Japanese]</comment></nlm-citation>

NLM DTD 3.0

<mixed-citation> replaces <citation>
<element-citation> replaces <nlm-citation>
  Element order is no longer prescribed
  All elements can be repeated
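
One way to see what the relaxed ordering buys is to list a citation's leaf elements in document order and compare that with the printed reference. The sketch below is illustrative only: `leaf_sequence` is a hypothetical helper, not part of any NLM or JATS tooling, and the two input strings are abbreviated versions of the Fukumoto reference.

```python
import xml.etree.ElementTree as ET

def leaf_sequence(citation_xml: str) -> list[str]:
    """Return the tag names of all leaf elements, in document order."""
    root = ET.fromstring(citation_xml)
    return [el.tag for el in root.iter() if len(el) == 0]

# nlm-citation as tagged on the slide: the year letter "b" ends up in a
# <comment> after the volume, far from the <year> it belongs to.
nlm = (
    '<nlm-citation citation-type="journal">'
    '<article-title>Study on the behaviour of stabilization piles for landslides.</article-title>'
    '<source>Soil and Foundation</source><year>1972</year><volume>12</volume>'
    '<comment>b</comment><comment>[in Japanese]</comment>'
    '</nlm-citation>'
)

# element-citation (NLM 3.0): elements can follow the printed order,
# so the "b" comment sits directly after the year.
element = (
    '<element-citation publication-type="journal">'
    '<year>1972</year><comment>b</comment>'
    '<article-title>Study on the behaviour of stabilization piles for landslides.</article-title>'
    '<source>Soil and Foundation</source><volume>12</volume>'
    '<comment>[in Japanese]</comment>'
    '</element-citation>'
)

print(leaf_sequence(nlm))
print(leaf_sequence(element))
```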

Fixed with element-citation

Fukumoto Y (1972b) Study on the behaviour of stabilization piles for landslides. Soil and Foundation 12(2), 61–73 [in Japanese].

<element-citation publication-type="journal"><person-group><name><surname>Fukumoto</surname><given-names>Y</given-names></name></person-group><year>1972</year><comment>b</comment><article-title>Study on the behaviour of stabilization piles for landslides.</article-title><source>Soil and Foundation</source><volume>12</volume><issue>2</issue><fpage>61</fpage><lpage>73</lpage><comment>[in Japanese]</comment></element-citation>

NLM 1.x and 2.x Citation Attributes

Single citation-type attribute

How to describe an online book published by a government entity?
  citation-type="book|gov|eref"

NLM 3.0 Citation Attributes

New 3.0 attributes
  publication-type
  publisher-type
  publication-format

How to describe an online book published by a government entity?

<mixed-citation publication-type="book" publisher-type="gov" publication-format="online">

person-group PCDATA

person-group and boilerplate text
  1.x: No PCDATA; prescribed name element order
  2.x Green: string-name and x added
  3.0 Blue: string-name added
  JATS 0.4 Blue: PCDATA added

JATS allows full retention of boilerplate text in references

Non-journal References

Books, book chapters, conference proceedings, reports, working papers, standards, web pages, legal citations, etc.
  Harder to tag
  More variation than journal references

What Publishers Tag

Many publishers tag only journal references
  Non-journal references are a minority in scientific literature
  Journal references follow relatively “normal” patterns
  Automated tagging of non-journal references is more challenging
    Missing information (e.g., publishers and locations in book references)
      Meissonnier, Juste-Aurèle, Livre de légumes (c. 1732).
    Extra information (e.g., reprint information in humanities book references)
      Dezallier d’Argenville, Abrégé de la vie des plus fameux peintres, 4 vols (Paris, 1762; repr. Geneva: Minkoff, 1972).
  Most non-journal content cannot easily be linked
    PubMed indexes journals almost exclusively
    CrossRef has mostly journal references

Abbreviated Reference Styles

Short form references
  No article title
  No last page
  No repeated authors

Vestiges of print publication
  Attempt to save pages
  But does it help the reader?

Missing Article Title

M. G. Banwell, M. D. McLeod, Chem. Commun. 1998, 1991

Advantages
  Style saves paper in print

Disadvantages
  How can the reader determine if the article is worth reading when they don’t know the title?
  CrossRef linking is less accurate

CrossRef Fuzzy Limits

Consider these references:
  X. Wang, et al., Appl. Phys. Lett. 72, 3255 (1998) doi:10.1063/1.121615
  X. Wang, et al., Appl. Phys. Lett. 72, 3264 (1998) doi:10.1063/1.121618

What if the page is mistyped as “3265”:
  X. Wang, et al., Appl. Phys. Lett. 72, 3265 (1998)

CrossRef gives us 15 DOIs:
  10.1063/1.121618  10.1063/1.121225
  10.1063/1.121331  10.1063/1.121396
  10.1063/1.120857  10.1063/1.120753
  10.1063/1.121615  10.1063/1.121493
  10.1063/1.121500  10.1063/1.120942
  10.1063/1.121016  10.1063/1.120657
  10.1063/1.120619  10.1063/1.120660
  10.1063/1.120604

Missing Last Pages

M. G. Banwell, M. D. McLeod, Chem. Commun. 1998, 1991

Which number is the year?
  In some reference styles, year is last
  In OCR input, bold may not be available
  Can lead to incorrect tagging

And how does the reader know the article length?

Excluded Repeated Authors

Bornstein, Eli. “The Crystal in the Rock,” The Structurist no. 2 (1961-62): 5-18.
---, “Creation/Destruction/Creation in Art and Nature,” The Structurist no. 7 (1967): 55-63.

Tagging “Excluded” Authors

How to tag with <element-citation>?
  1. <surname>---</surname>
  2. <name content-type="repeated-author"><surname>Bornstein</surname><given-names>Eli</given-names></name>

Do not use option 1
  Semantically incorrect
  Lowers chance of CrossRef link

Apply attribute to name, not person-group
Or better yet, don’t exclude authors…
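
The second option can be sketched as a small Python routine. It assumes the author names have already been parsed out of the reference list, with `None` standing in wherever the printed reference shows only a dash; the function name `tag_authors` and that input shape are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def tag_authors(entries):
    """Build one <name> element per reference.

    entries: list of (surname, given_names) tuples, or None where the printed
    reference shows only a dash for a repeated author.
    """
    tagged, previous = [], None
    for entry in entries:
        repeated = entry is None
        if repeated:
            if previous is None:
                raise ValueError("the first reference cannot repeat an author")
            entry = previous  # carry the real name forward instead of tagging "---"
        surname, given = entry
        name = ET.Element("name")
        if repeated:
            # The attribute goes on <name>, not on <person-group>.
            name.set("content-type", "repeated-author")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given
        tagged.append(ET.tostring(name, encoding="unicode"))
        previous = entry
    return tagged

# The two Bornstein references: the second one prints "---" for the author.
names = tag_authors([("Bornstein", "Eli"), None])
for n in names:
    print(n)
```

The real surname stays available for CrossRef matching, while the attribute lets a rendering template print the dash again if the journal style calls for it.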

To Boilerplate or Not

mixed-citation with boilerplate text permits
  Retention of “extra” text not easily tagged
  Reduced work in template development
    PDF, HTML

mixed-citation to element-citation conversion easier than reverse
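
The easy direction can be sketched in Python: rename the root and discard the untagged text between elements. This is a minimal sketch under stated assumptions, not a production converter. The function name `mixed_to_element` and the `CONTAINERS` set are hypothetical, and any untagged content (a year letter, "[updated 1996 May]") is simply lost, which is why the reverse conversion is the hard one.

```python
import xml.etree.ElementTree as ET

# Assumption for this sketch: only these elements act as containers whose
# direct text is boilerplate. Text inside leaf elements such as
# <article-title> is real content and is left untouched.
CONTAINERS = {"element-citation", "person-group", "name"}

def mixed_to_element(mixed_xml: str) -> str:
    """Convert a <mixed-citation> to an <element-citation> by dropping boilerplate."""
    root = ET.fromstring(mixed_xml)
    if root.tag != "mixed-citation":
        raise ValueError("expected a <mixed-citation> root element")
    root.tag = "element-citation"
    for parent in root.iter():
        if parent.tag in CONTAINERS:
            parent.text = None          # text before the first child
            for child in parent:
                child.tail = None       # punctuation and spacing after each child
    return ET.tostring(root, encoding="unicode")

mixed = (
    '<mixed-citation publication-type="journal">'
    '<person-group person-group-type="author"><name><surname>Beck</surname>'
    '<given-names>J.</given-names></name></person-group> (<year>2011</year>). '
    '<source>Journal of Electronic Publishing</source>, '
    '<volume>14</volume>(<issue>1</issue>).'
    '</mixed-citation>'
)
print(mixed_to_element(mixed))
```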

Survey: Reference Boilerplate Text

Organization    Ref Boilerplate Text
Publisher 1     DROP
Publisher 2     DROP
Publisher 3     KEEP
Publisher 4     KEEP
Publisher 5     KEEP
Publisher 6     KEEP
Publisher 7     DROP
Publisher 8     KEEP
Publisher 9     DROP
Publisher 10    DROP
Publisher 11    DROP
Publisher 12    KEEP
Publisher 13    KEEP
Publisher 14    KEEP
Publisher 15    DROP
Publisher 16    KEEP
Publisher 17    KEEP
Publisher 18    KEEP
Publisher 19    NA
Publisher 20    KEEP
Publisher 21    KEEP
JATS-con        KEEP
Supplier 1      KEEP
Supplier 2      KEEP
Supplier 3      KEEP

Most publishers keep boilerplate text

All suppliers keep boilerplate text

Boilerplate Advantages

Boilerplate text allows
  Appropriate semantic markup of challenging references
    For computer-to-computer automation such as linking
    With full retention of extra information to aid the reader's use of that item
  Less work in template setup, especially for unusual cases

<mixed-citation><person-group><name><surname>Dickens</surname>, <given-names>Charles</given-names></name></person-group>. <source>A tale of two cities</source> <comment>[Internet]</comment>. <publisher-loc>Charlottesville (VA)</publisher-loc>: <publisher-name>University of Virginia Library, Electronic Text Center</publisher-name>; <year>1994</year>; c1999 [updated 1996 May; cited 2002 Apr 29]. <size units="kb">820K bytes</size>. Available from: <ext-link>http://etext.lib.virginia.edu/toc/modeng/public/DicTale.html</ext-link></mixed-citation>

XML is about structure, not formatting
  But not at unnecessary extra expense for the sake of purity
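
The "less work in template setup" point can be illustrated with a short Python sketch: when boilerplate is retained, rendering a reference amounts to concatenating its text in document order, with no per-style punctuation rules. The helper name `display_text` is an assumption for this example, and the input is an abbreviated form of the Dickens reference.

```python
import xml.etree.ElementTree as ET

def display_text(citation_xml: str) -> str:
    """Reassemble the printed reference from tagged content plus boilerplate."""
    return "".join(ET.fromstring(citation_xml).itertext())

dickens = (
    '<mixed-citation><source>A tale of two cities</source> '
    '<comment>[Internet]</comment>. <year>1994</year>; '
    'c1999 [updated 1996 May; cited 2002 Apr 29].</mixed-citation>'
)
print(display_text(dickens))
```

The untagged tail ("c1999 [updated 1996 May; cited 2002 Apr 29].") survives in the output even though no element describes it.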

Tag Abuse

<volume>Vol.6</volume>, <issue>No.3, (September)</issue>

Problems
  Semantically incorrect
  Causes link failures

Corrected: Vol.<volume>6</volume>, No.<issue>3</issue>, (September)
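
A quality check for this kind of tag abuse could look like the Python sketch below. The heuristics are assumptions chosen for illustration (a label such as "Vol." or "No." at the start of the value, or parentheses and commas inside it); a real check would need tuning against a publisher's own content. The name `find_tag_abuse` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed heuristic: a leading label word suggests boilerplate inside the tag.
LABEL = re.compile(r"^\s*(?:vol|volume|no|number|issue)\b", re.IGNORECASE)
# Assumed heuristic: parentheses or commas suggest extra text was swept in.
EXTRA = re.compile(r"[(),]")

def find_tag_abuse(citation_xml: str) -> list[tuple[str, str]]:
    """Return (tag, text) for each <volume>/<issue> that looks over-stuffed."""
    root = ET.fromstring(citation_xml)
    problems = []
    for el in root.iter():
        if el.tag in ("volume", "issue"):
            text = "".join(el.itertext())
            if LABEL.search(text) or EXTRA.search(text):
                problems.append((el.tag, text))
    return problems

abused = (
    '<mixed-citation><volume>Vol.6</volume>, '
    '<issue>No.3, (September)</issue></mixed-citation>'
)
corrected = (
    '<mixed-citation>Vol.<volume>6</volume>, '
    'No.<issue>3</issue>, (September)</mixed-citation>'
)
print(find_tag_abuse(abused))
print(find_tag_abuse(corrected))
```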

Conclusions

Journal references using consistent editorial style are relatively easy to JATS-tag
  Exception cases can be more challenging
Tagging of non-journal references presents additional difficulties
  Consider mixed-citation if many non-journal references
Accurate tagging efforts result in
  Higher quality publications
  Greater online linking success
  Best reader experience

Questions?

Bruce Rosenblum
Inera Incorporated
+1 (617) 932-1932

[email protected]

A Few More Details

Getting it right
  Learn the markup models
  Work carefully with unusual cases
    Study Groups
    Name/Date Year Letters
    Unusual Page Ranges
  Avoid “Tag Abuse”

Study Groups

<collab> and <on-behalf-of> elements
  “Coudray C, Roussel AM, Arnaud J, Favier A, and the EVA Study Group”
    <collab>EVA Study Group</collab>
  “Coudray C, Roussel AM, Arnaud J, Favier A, for the EVA Study Group”
    <on-behalf-of>EVA Study Group</on-behalf-of>

<on-behalf-of> is a specialized form of <role>

Name/Date Year Letters

Citation “Smith, 2000a, b”

Semantically incorrect
  <year>2000b</year>

Element-citation
  <year>2000</year><comment>b</comment>

Mixed-citation
  <year>2000</year>b
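
The two recommended forms can be generated from a printed year with a short Python sketch. The function names `split_year` and `tag_year`, and the restriction to a four-digit year followed by at most one lowercase letter, are assumptions made for this example.

```python
import re

# Assumed shape: four digits, optionally followed by one lowercase letter.
YEAR_WITH_LETTER = re.compile(r"^(\d{4})([a-z]?)$")

def split_year(printed: str) -> tuple[str, str]:
    """Split a printed year such as '2000b' into ('2000', 'b')."""
    match = YEAR_WITH_LETTER.match(printed.strip())
    if match is None:
        raise ValueError(f"unrecognized year: {printed!r}")
    return match.group(1), match.group(2)

def tag_year(printed: str, model: str = "mixed") -> str:
    """Tag a printed year for mixed-citation (default) or element-citation."""
    year, letter = split_year(printed)
    tagged = f"<year>{year}</year>"
    if not letter:
        return tagged
    if model == "element":
        # element-citation allows no loose text, so the letter needs an element.
        return tagged + f"<comment>{letter}</comment>"
    # mixed-citation: the letter stays as boilerplate after the year.
    return tagged + letter

print(tag_year("2000b"))
print(tag_year("2000b", model="element"))
```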

Unusual Page Ranges

Two or more page ranges or discrete pages
  “8–11, 14–19, 40”

Incorrect:
  <fpage>8</fpage>-<lpage>11, 14-19, 40</lpage>

Correct:
  <fpage>8</fpage><lpage>40</lpage><page-range>8-11, 14-19, 40</page-range>.
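
The correct form can be derived from the printed page string with a short Python sketch. The function name `tag_pages` is an assumption, and the sketch does not expand abbreviated ranges such as "509-13"; those would need extra handling before the last page is reliable.

```python
import re

# Hyphen or en dash between the first and last page of a range.
RANGE_SEPARATOR = re.compile(r"[-\u2013]")

def tag_pages(printed: str) -> str:
    """Tag first page, last page and, for discontinuous pages, the full range."""
    parts = [p.strip() for p in printed.split(",") if p.strip()]
    if not parts:
        raise ValueError("no page information given")
    first = RANGE_SEPARATOR.split(parts[0])[0].strip()
    last = RANGE_SEPARATOR.split(parts[-1])[-1].strip()
    tagged = f"<fpage>{first}</fpage><lpage>{last}</lpage>"
    if len(parts) > 1:
        # Keep the printed string so no page information is lost.
        tagged += f"<page-range>{printed.strip()}</page-range>"
    return tagged

print(tag_pages("8-11, 14-19, 40"))
print(tag_pages("61\u201373"))
```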