Copyright 2011 Inera Incorporated. All Rights Reserved
Variations in XML Reference Tagging in Scholarly Publication
Presented by
Bruce D. Rosenblum
CEO
Inera Incorporated
Journal Article Tag Suite Conference, 26 September 2011
Why Are References Hard?
Beck, J. (2011). NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs? Journal of Electronic Publishing, 14(1). http://dx.doi.org/10.3998/3336451.0014.106
<mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Beck</surname><given-names>J.</given-names></name></person-group> (<year>2011</year>). <article-title>NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs?</article-title> <source>Journal of Electronic Publishing</source>, <volume>14</volume>(<issue>1</issue>). <pub-id pub-id-type="doi">10.3998/3336451.0014.106</pub-id></mixed-citation>
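To make the payoff of this markup concrete, here is a minimal sketch (not part of the original slides) that reads the tagged reference above with Python's standard `xml.etree.ElementTree`. The function name `summarize_citation` and the `https://doi.org/` resolver prefix are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The tagged reference from the slide, split across lines for readability.
MIXED = (
    '<mixed-citation publication-type="journal">'
    '<person-group person-group-type="author"><name><surname>Beck</surname>'
    '<given-names>J.</given-names></name></person-group> (<year>2011</year>). '
    '<article-title>NISO Z39.96 The Journal Article Tag Suite (JATS): '
    'What Happened to the NLM DTDs?</article-title> '
    '<source>Journal of Electronic Publishing</source>, '
    '<volume>14</volume>(<issue>1</issue>). '
    '<pub-id pub-id-type="doi">10.3998/3336451.0014.106</pub-id>'
    '</mixed-citation>'
)


def summarize_citation(xml_text):
    """Return the rendered text plus a few machine-readable fields."""
    ref = ET.fromstring(xml_text)
    # Joining every text node keeps the punctuation stored between elements.
    rendered = "".join(ref.itertext())
    doi = ref.findtext("pub-id[@pub-id-type='doi']")
    # Assumed resolver prefix; the slides do not prescribe one.
    link = "https://doi.org/" + doi if doi else None
    return {
        "rendered": rendered,
        "year": ref.findtext("year"),
        "volume": ref.findtext("volume"),
        "doi_link": link,
    }


info = summarize_citation(MIXED)
```

Because the punctuation lives between the elements, joining the text nodes reproduces the styled reference text, while the year, volume, and DOI stay individually addressable for linking.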
First Online Publication
Beck J. (2011.). NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs? Journal of Electronic Publishing, 14(1). 10.3998/3336451.0014.106.
The Conversation, Part 1
Bruce: “Your template is adding an extra period after the year. Perhaps I need to grab a screen shot and add a new section to my paper about errant reference templates?”
Jeff: “That's what happens when you send non-NLM style citations to the NLM. :) It’s fixed now - no problemo”
Second Publication
Beck J. (2011). NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs? Journal of Electronic Publishing, 14(1). 10.3998/3336451.0014.106.
The Conversation, Part 2
Bruce: “The DOIs aren't live links in my references. Can you map pub-id appropriately or do I need to violate best practices and make them ext-link?”
Jeff: “We have flip-flopping policies on how we handle pub-id links outside of NCBI resources. I'll build the links here so you and your XML can stay pure. ;)”
Third Publication
Beck J. (2011). NISO Z39.96 The Journal Article Tag Suite (JATS): What Happened to the NLM DTDs? Journal of Electronic Publishing, 14(1). 10.3998/3336451.0014.106.
The Conversation, Continued
Bruce: “I understand that there may be some reticence to link to non-NCBI resources. However DOI is so widely accepted by libraries (including by NLM) that I think it would be helpful for NCBI to build links for DOIs any time they show up correctly tagged in XML.”
Jeff: “It is not the outside that is the problem. The problem we've seen is that so many of them are not correct that we get complaints when they don't resolve.”
The Failings of Links
Bruce: “What percent of DOIs that you get do you think fail? Do you have any metrics on that?”
Jeff: “No. But that would be interesting. My guess is that they are hopefully getting better. But when they are created for print and not checked by a person, there will be problems. Add to that hard hyphens when they break on the page (certainly we can't just remove hyphens from dois).”
Bruce: “We've also found an average of 10% to 15% failure rate for URLs (and higher when extracted from a PDF file), so by the logic you've suggested (users complaining about dead links), you shouldn't make those live hyperlinks either...”
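One partial safeguard against the failures discussed above is a shape check before a link is built. The sketch below is an illustration only: the pattern is a rough heuristic for the shape of a DOI rather than the formal DOI syntax, the name `looks_like_doi` is hypothetical, and passing the check says nothing about whether the DOI actually resolves.

```python
import re

# Heuristic shape of a DOI: "10.", a 4-9 digit registrant code, "/", a suffix.
# This is an assumption for illustration, not the formal DOI syntax.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(candidate):
    """Return True when the string has the rough shape of a DOI."""
    return bool(DOI_SHAPE.match(candidate.strip()))


checks = {
    "10.3998/3336451.0014.106": looks_like_doi("10.3998/3336451.0014.106"),
    "doi:10.1063/1.121615": looks_like_doi("doi:10.1063/1.121615"),
    "10.1063/1.12 1615": looks_like_doi("10.1063/1.12 1615"),
}
```

A check like this catches stray prefixes and embedded spaces, but a hyphen introduced by a line break still yields a well-shaped string, which is why Jeff's point about unverified print-derived DOIs stands.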
What is the Purpose of References?
… to give an indication of how authoritative the source is — in order for the reader to decide whether he should bother to pursue the source in the first place.
… the minimal set of metadata that can unambiguously identify the source… so that the reader can go to a shelf… write an ILL request, or… send sufficient metadata set to CrossRef, so that its matching algorithm could return the source's DOI to the requester (Knox and Schwarzman, 2004).
References Gone Wrong
Corrupt or missing references can be a source of minor irritation or major inconvenience — misquoted references increase the probability that a citation index such as Web of Science will not be able to link the citations to the source article. In today’s metric-driven world, not receiving credit, in terms of citations, for the work that one has published can actually make a difference in terms of promotions, tenure, grant funding, etc. (Wates and Campbell, 2007)
Variations in Reference Style
References have many, many editorial styles
  AMA, APA, CSE, MLA, etc.
  Journal-specific styles
  Author “invented” styles
And XML has almost as many variations to tag them
Online Resources
Online references exacerbate the problem
Dickens, Charles. A tale of two cities [Internet]. Charlottesville (VA): University of Virginia Library, Electronic Text Center; 1994; c1999 [updated 1996 May; cited 2002 Apr 29]. 820K bytes. Available from: http://etext.lib.virginia.edu/toc/modeng/public/DicTale.html
(for extra credit, try tagging the above with nlm-citation and re-rendering as shown)
It’s a Whole New World
“All the rules we've spent years developing are out the window”
  Karen Patrias, National Library of Medicine
“The Internet made a lot of things very simple. Bibliographies aren't among them”
  Wall Street Journal, 2 May 2002, p. A1
In The Beginning…
12083 <citation> model
<!ELEMENT citation - O %m.bib; >
<!ENTITY % m.bib "(no?, title, (%bib;)*)" >
<!ENTITY % bib "author|corpauth|msn|sertitle|location|date|pages|subject|othinfo" >
PMC 1.0 bibl
<!ELEMENT refgrp (st?, bibl+)>
<!ELEMENT bibl (title?, edg?, (aug | insg)*, firstauaff?, ang?, source?, issn?, publisher?, pubdate?, ((volume?, edition?, issue?, fpage?, lpage?, exlnk?) | inpress)?, p?, xrefbib?)>
<!ELEMENT fm (doctopic*, dochead?, docsubj*, supptitle?, sertitle*, sertext?, addart?, bibl, suppmat?, history?, com?, con?, cor*, cpyrt?, relart*, shortabs?, abs?, kwdg?)>
bibl Used in Front Matter
<fm>
<dochead>Research article</dochead>
<bibl>
<title><p>Analysis of stress- and host ...</p></title>
<aug>
<au ca="yes"><snm>Triccas</snm><mi>A</mi><fnm>James</fnm><insr iid="I1"/><insr iid="I2"/></au>
<au ca="no"><snm>Gicquel</snm><fnm>Brigitte</fnm><insr iid="I1"/></au>
</aug>
<insg>
<ins id="I1"><p>Unite de Genetique Mycobacterienne</p></ins>
<ins id="I2"><p>Centenary Institute of Cancer Medicine and Cell Biology</p></ins>
</insg>
<source>BMC Microbiology</source><issn>1471-2180</issn><pubdate>2001</pubdate>
<volume>1</volume><issue>1</issue><fpage>3</fpage>
</bibl>
<abs><sec><st><p>Abstract</p></st><p>The gene encoding the inorganic...</p></sec></abs>
</fm>
bibl Used in refgrp
<bibl id="B1">
<title><p>Dissecting the biology of a pathogen during infection.</p></title>
<aug>
<au ca="no"><snm>Heithoff</snm><fnm>DM</fnm></au>
<au ca="no"><snm>Conner</snm><fnm>CP</fnm></au>
<au ca="no"><snm>Mahan</snm><fnm>MJ</fnm></au></aug>
<source>Trends Microbiol</source><pubdate>1997</pubdate>
<volume>5</volume><fpage>509</fpage><lpage>513</lpage>
</bibl>
Element reuse is good; element overloading is not
Green DTD 1.0
Return of <citation> (at least the name)
  Designed for archive requirements
  Markup of journal and non-journal references
  Inclusion of boilerplate text (allows PCDATA between elements)
  Permits element placement in presentation order
(Well… almost. Author name elements in <citation> were in a prescribed order in version 1.0 and did not permit boilerplate text)
Blue DTD 1.0
Adds <nlm-citation>
  Structured citation model to assist users creating “new” content; the model loosely reflects the NLM’s style in that it allows the tagging of all “legal” NLM citations and enforces the sequence in which content must appear if it is present
nlm-citation vs. citation
  Prescribed order
  Most elements not repeatable
  No boilerplate text allowed
The Problem with nlm-citation
Fukumoto Y (1972b) Study on the behaviour of stabilization piles for landslides. Soil and Foundation 12(2), 61–73 [in Japanese].
<nlm-citation citation-type="journal">
<person-group><name><surname>Fukumoto</surname><given-names>Y</given-names></name></person-group><article-title>Study on the behaviour of stabilization piles for landslides.</article-title><source>Soil and Foundation</source><year>1972</year><volume>12</volume><issue>2</issue><fpage>61</fpage><lpage>73</lpage><comment>b</comment><comment>[in Japanese]</comment></nlm-citation>
NLM DTD 3.0
<mixed-citation> replaces <citation>
<element-citation> replaces <nlm-citation>
  Element order is no longer prescribed
  All elements can be repeated
Fixed with element-citation
Fukumoto Y (1972b) Study on the behaviour of stabilization piles for landslides. Soil and Foundation 12(2), 61–73 [in Japanese].
<element-citation publication-type="journal">
<person-group><name><surname>Fukumoto</surname><given-names>Y</given-names></name></person-group><year>1972</year><comment>b</comment><article-title>Study on the behaviour of stabilization piles for landslides.</article-title><source>Soil and Foundation</source><volume>12</volume><issue>2</issue><fpage>61</fpage><lpage>73</lpage><comment>[in Japanese]</comment></element-citation>
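The difference between the two taggings of the Fukumoto reference can be checked mechanically. The sketch below is an illustration under assumptions: the helper `year_label` is hypothetical, and its rule (a one-letter `<comment>` directly after `<year>` is a year letter) is a simplification invented for this example, not something the DTDs define.

```python
import xml.etree.ElementTree as ET

# The two taggings from the slides, split across lines for readability.
NLM_CITATION = (
    '<nlm-citation citation-type="journal">'
    '<person-group><name><surname>Fukumoto</surname>'
    '<given-names>Y</given-names></name></person-group>'
    '<article-title>Study on the behaviour of stabilization piles for '
    'landslides.</article-title><source>Soil and Foundation</source>'
    '<year>1972</year><volume>12</volume><issue>2</issue>'
    '<fpage>61</fpage><lpage>73</lpage>'
    '<comment>b</comment><comment>[in Japanese]</comment></nlm-citation>'
)
ELEMENT_CITATION = (
    '<element-citation publication-type="journal">'
    '<person-group><name><surname>Fukumoto</surname>'
    '<given-names>Y</given-names></name></person-group>'
    '<year>1972</year><comment>b</comment>'
    '<article-title>Study on the behaviour of stabilization piles for '
    'landslides.</article-title><source>Soil and Foundation</source>'
    '<volume>12</volume><issue>2</issue><fpage>61</fpage><lpage>73</lpage>'
    '<comment>[in Japanese]</comment></element-citation>'
)


def year_label(xml_text):
    """Return the year, with its letter when a one-letter comment follows it."""
    children = list(ET.fromstring(xml_text))
    for index, child in enumerate(children):
        if child.tag != "year":
            continue
        following = children[index + 1] if index + 1 < len(children) else None
        if following is not None and following.tag == "comment":
            letter = (following.text or "").strip()
            if len(letter) == 1 and letter.isalpha():
                return child.text + letter
        return child.text
    return None


labels = (year_label(NLM_CITATION), year_label(ELEMENT_CITATION))
```

With the fixed element order of nlm-citation, the letter ends up far from the year and a renderer cannot tell what it belongs to; with free ordering in element-citation, the letter sits next to the year, so "1972b" can be rebuilt.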
NLM 1.x and 2.x Citation Attributes
Single citation-type attribute
How to describe an online book published by a government entity?
  citation-type="book|gov|eref"
NLM 3.0 Citation Attributes
New 3.0 attributes
  publication-type
  publisher-type
  publication-format
How to describe an online book published by a government entity?
<mixed-citation publication-type="book" publisher-type="gov" publication-format="online">
person-group PCDATA
person-group and boilerplate text
  1.x: No PCDATA; prescribed name element order
  2.x Green: string-name and x added
  3.0 Blue: string-name added
  JATS 0.4 Blue: PCDATA added
JATS allows full retention of boilerplate text in references
Non-journal References
Books, book chapters, conference proceedings, reports, working papers, standards, web pages, legal citations, etc.
  Harder to tag
  More variation than journal references
What Publishers Tag
Many publishers tag only journal references
  Non-journal references are a minority in scientific literature
  Journal references follow relatively “normal” patterns
  Automated tagging of non-journal references is more challenging
    Missing information (e.g., publishers and locations in book references)
      Meissonnier, Juste-Aurèle, Livre de légumes (c. 1732).
    Extra information (e.g., reprint information in humanities book references)
      Dezallier d’Argenville, Abrégé de la vie des plus fameux peintres, 4 vols (Paris, 1762; repr. Geneva: Minkoff, 1972).
  Most non-journal content cannot easily be linked
    PubMed indexes journals almost exclusively
    CrossRef has mostly journal references
Abbreviated Reference Styles
Short form references
  No article title
  No last page
  No repeated authors
Vestiges of print publication
  Attempt to save pages
  But does it help the reader?
Missing Article Title
M. G. Banwell, M. D. McLeod, Chem. Commun. 1998, 1991
Advantages
  Style saves paper in print
Disadvantages
  How can the reader determine if the article is worth reading when they don’t know the title?
  CrossRef linking is less accurate
CrossRef Fuzzy Limits
Consider these references:
  X. Wang, et al., Appl. Phys. Lett. 72, 3255 (1998) doi:10.1063/1.121615
  X. Wang, et al., Appl. Phys. Lett. 72, 3264 (1998) doi:10.1063/1.121618
What if the page is mis-typed as “3265”:
  X. Wang, et al., Appl. Phys. Lett. 72, 3265 (1998)
CrossRef gives us 15 DOIs:
10.1063/1.121618 10.1063/1.121225
10.1063/1.121331 10.1063/1.121396
10.1063/1.120857 10.1063/1.120753
10.1063/1.121615 10.1063/1.121493
10.1063/1.121500 10.1063/1.120942
10.1063/1.121016 10.1063/1.120657
10.1063/1.120619 10.1063/1.120660
10.1063/1.120604
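The effect can be reproduced in miniature. The toy matcher below is emphatically not CrossRef's algorithm: it is a hypothetical sketch, with an invented function name `candidate_dois` and an index holding only the two references from this slide, showing how a reference with no article title leaves a matcher nothing but journal, volume, and year once the page number is wrong.

```python
# Toy index containing only the two references shown on the slide.
KNOWN = [
    {"journal": "Appl. Phys. Lett.", "volume": "72", "fpage": "3255",
     "year": "1998", "doi": "10.1063/1.121615"},
    {"journal": "Appl. Phys. Lett.", "volume": "72", "fpage": "3264",
     "year": "1998", "doi": "10.1063/1.121618"},
]


def candidate_dois(journal, volume, fpage, year, records=KNOWN):
    """Return exact page matches, else every record from the same volume."""
    same_volume = [
        r for r in records
        if (r["journal"], r["volume"], r["year"]) == (journal, volume, year)
    ]
    exact = [r["doi"] for r in same_volume if r["fpage"] == fpage]
    # Without a page match there is no title or last page to narrow the field.
    return exact if exact else [r["doi"] for r in same_volume]


correct = candidate_dois("Appl. Phys. Lett.", "72", "3264", "1998")
mistyped = candidate_dois("Appl. Phys. Lett.", "72", "3265", "1998")
```

Here the correct page yields a single DOI, while the mistyped page returns every candidate in the volume; an article title in the reference would give the matcher another field to disambiguate with.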
Missing Last Pages
M. G. Banwell, M. D. McLeod, Chem. Commun. 1998, 1991
Which number is the year?
  In some reference styles, year is last
  In OCR input, bold may not be available
  Can lead to incorrect tagging
And how does the reader know the article length?
Excluded Repeated Authors
Bornstein, Eli. “The Crystal in the Rock,” The Structurist no. 2 (1961-62): 5-18.
---, “Creation/Destruction/Creation in Art and Nature,” The Structurist no. 7 (1967): 55-63.
Tagging “Excluded” Authors
How to tag with <element-citation>?
  Option 1: <surname>---</surname>
  Option 2: <name content-type="repeated-author"><surname>Bornstein</surname><given-names>Eli</given-names></name>
Do not use option 1
  Semantically incorrect
  Lowers chance of CrossRef link
Apply attribute to name, not person-group
Or better yet, don’t exclude authors…
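If references have already been tagged with a dash placeholder, the recommended form can be restored automatically. The following is a minimal sketch under assumptions: the input XML is a hypothetical tagging of the two Bornstein references, and the function names and the dash-detection rule are invented for this example.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical input: the second reference uses the discouraged placeholder.
REF_LIST = (
    "<ref-list>"
    "<ref><element-citation><person-group><name><surname>Bornstein</surname>"
    "<given-names>Eli</given-names></name></person-group>"
    "<source>The Structurist</source><issue>2</issue></element-citation></ref>"
    "<ref><element-citation><person-group><name><surname>---</surname></name>"
    "</person-group><source>The Structurist</source><issue>7</issue>"
    "</element-citation></ref>"
    "</ref-list>"
)


def is_dash_placeholder(name):
    """True when the surname consists only of hyphens or dashes."""
    text = (name.findtext("surname") or "").strip()
    return bool(text) and all(ch in "-\u2013\u2014" for ch in text)


def restore_repeated_authors(ref_list):
    """Replace dash placeholders with copies of the previous reference's names."""
    previous_names = []
    for group in list(ref_list.iter("person-group")):
        names = group.findall("name")
        if len(names) == 1 and is_dash_placeholder(names[0]) and previous_names:
            group.remove(names[0])
            for name in previous_names:
                restored = copy.deepcopy(name)
                # The attribute goes on name, not person-group, as advised above.
                restored.set("content-type", "repeated-author")
                group.append(restored)
        else:
            previous_names = names
    return ref_list


fixed = restore_repeated_authors(ET.fromstring(REF_LIST))
```

Marking the restored names with `content-type="repeated-author"` keeps the real author data available for linking while still letting a print template render the dash.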
To Boilerplate or Not
mixed-citation with boilerplate text permits
  Retention of “extra” text not easily tagged
  Reduced work in template development
    PDF, HTML
mixed-citation to element-citation conversion easier than reverse
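The asymmetry is easy to see in code: going from mixed-citation to element-citation only discards text, whereas the reverse has to regenerate punctuation from a style template. The sketch below is an illustration under assumptions, with a hypothetical function name and an abridged copy of the Beck reference as sample input.

```python
import copy
import xml.etree.ElementTree as ET


def to_element_citation(mixed):
    """Return an element-citation copy of a mixed-citation, minus boilerplate."""
    element = copy.deepcopy(mixed)
    element.tag = "element-citation"
    # Boilerplate can sit directly in the citation or inside a person-group.
    for container in [element] + element.findall(".//person-group"):
        container.text = None
        for child in container:
            child.tail = None
    return element


# Abridged from the tagged Beck reference shown earlier in the deck.
mixed = ET.fromstring(
    '<mixed-citation publication-type="journal">'
    '<person-group person-group-type="author"><name><surname>Beck</surname>'
    '<given-names>J.</given-names></name></person-group> (<year>2011</year>). '
    '<source>Journal of Electronic Publishing</source>, '
    '<volume>14</volume>(<issue>1</issue>).</mixed-citation>'
)
converted = to_element_citation(mixed)
```

Nothing in `converted` records that the year was parenthesized or that a comma followed the journal title, so a conversion back needs rules for every reference type and style.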
Survey: Reference Boilerplate Text
Organization Ref Boilerplate Text
Publisher 1 DROP
Publisher 2 DROP
Publisher 3 KEEP
Publisher 4 KEEP
Publisher 5 KEEP
Publisher 6 KEEP
Publisher 7 DROP
Publisher 8 KEEP
Publisher 9 DROP
Publisher 10 DROP
Publisher 11 DROP
Publisher 12 KEEP
Publisher 13 KEEP
Publisher 14 KEEP
Publisher 15 DROP
Publisher 16 KEEP
Publisher 17 KEEP
Publisher 18 KEEP
Publisher 19 NA
Publisher 20 KEEP
Publisher 21 KEEP
JATS-con KEEP
Supplier 1 KEEP
Supplier 2 KEEP
Supplier 3 KEEP
Most publishers keep boilerplate text
All suppliers keep boilerplate text
Boilerplate Advantages
Boilerplate text allows
  Appropriate semantic markup of challenging references
    For computer-to-computer automation such as linking
    With full retention of extra information to aid the reader's use of that item
  Less work in template setup, especially for unusual cases
<mixed-citation>
<person-group><name><surname>Dickens</surname>, <given-names>Charles</given-names></name></person-group>. <source>A tale of two cities</source> <comment>[Internet]</comment>. <publisher-loc>Charlottesville (VA)</publisher-loc>: <publisher-name>University of Virginia Library, Electronic Text Center</publisher-name>; <year>1994</year>; c1999 [updated 1996 May; cited 2002 Apr 29]. <size units="kb">820K bytes</size>. Available from: <ext-link>http://etext.lib.virginia.edu/toc/modeng/public/DicTale.html</ext-link></mixed-citation>
XML is about structure, not formatting
But not at unnecessary extra expense for the sake of purity
Tag Abuse
<volume>Vol.6</volume>, <issue>No.3, (September)</issue>
Problems
  Semantically incorrect
  Causes link failures
Corrected: Vol.<volume>6</volume>, No.<issue>3</issue>, (September)
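Tag abuse of this kind can be flagged before it causes link failures. The check below is a deliberately strict sketch, not a production rule: it assumes that `<volume>` and `<issue>` should hold digits only, which would need loosening for supplements or combined issues, and the name `find_tag_abuse` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Elements this sketch expects to contain a bare number (an assumption).
NUMBER_TAGS = ("volume", "issue")

ABUSED = (
    "<mixed-citation><volume>Vol.6</volume>, "
    "<issue>No.3, (September)</issue></mixed-citation>"
)
CORRECTED = (
    "<mixed-citation>Vol.<volume>6</volume>, "
    "No.<issue>3</issue>, (September)</mixed-citation>"
)


def find_tag_abuse(xml_text):
    """List (tag, text) pairs whose content is more than a bare number."""
    citation = ET.fromstring(xml_text)
    return [
        (el.tag, el.text)
        for el in citation.iter()
        if el.tag in NUMBER_TAGS
        and not re.fullmatch(r"\d+", (el.text or "").strip())
    ]


problems = find_tag_abuse(ABUSED)
clean = find_tag_abuse(CORRECTED)
```

In the corrected form the labels "Vol." and "No." survive as boilerplate text between elements, so nothing is lost for the reader while the element content stays usable for matching.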
Conclusions
Journal references using consistent editorial style are relatively easy to JATS-tag
Exception cases can be more challenging
Tagging of non-journal references presents additional difficulties
  Consider mixed-citation if many non-journal references
Accurate tagging efforts result in
  Higher quality publications
  Greater online linking success
  Best reader experience
Questions?
Bruce Rosenblum
Inera Incorporated
+1 (617) 932 - 1932
A Few More Details
Getting it right
  Learn the markup models
  Work carefully with unusual cases
    Study Groups
    Name/Date Year Letters
    Unusual Page Ranges
  Avoid “Tag Abuse”
Study Groups
<collab> and <on-behalf-of> elements
  "Coudray C, Roussel AM, Arnaud J, Favier A, and the EVA Study Group"
    <collab>EVA Study Group</collab>
  "Coudray C, Roussel AM, Arnaud J, Favier A, for the EVA Study Group"
    <on-behalf-of>EVA Study Group</on-behalf-of>
<on-behalf-of> is a specialized form of <role>
Name/Date Year Letters
Citation “Smith, 2000a, b”
Semantically incorrect
  <year>2000b</year>
Element-citation
  <year>2000</year><comment>b</comment>
Mixed-citation
  <year>2000</year>b
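The two recommended forms can be produced from a raw year string with a small helper. This is a sketch under assumptions: the function name `tag_year` is hypothetical, and the pattern only handles a four-digit year with at most one lowercase letter.

```python
import re

# Four-digit year, optionally followed by one disambiguating letter.
YEAR_WITH_LETTER = re.compile(r"^(\d{4})([a-z]?)$")


def tag_year(raw, mixed=True):
    """Tag a year such as '2000b' in mixed-citation or element-citation form."""
    match = YEAR_WITH_LETTER.match(raw.strip())
    if match is None:
        raise ValueError("unrecognized year string: " + repr(raw))
    year, letter = match.groups()
    tagged = "<year>" + year + "</year>"
    if not letter:
        return tagged
    if mixed:
        # mixed-citation: the letter stays as untagged boilerplate text.
        return tagged + letter
    # element-citation: no PCDATA allowed, so the letter goes in a comment.
    return tagged + "<comment>" + letter + "</comment>"


mixed_form = tag_year("2000b", mixed=True)
element_form = tag_year("2000b", mixed=False)
```

Either way `<year>` holds only the numeric year, which is the value a linking service needs.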