Code as a Research Product
Open Source for Open Science
Dan Gezelter @gezelter
OpenScience.org (also: The University of Notre Dame)
Suppose your colleague sends you an email that says, “I’ve found something amazing. I don’t have time to tell you exactly what it is, or how I found it, but here’s proof that I discovered it:”
smaismrmilmepoetaleumibunenugttauiras
On the 25th of July in 1610, Galileo discovered that Saturn was apparently situated between two smaller companions that always moved together. Wanting to establish his priority of discovery, he sent to Kepler (and others) the following anagram, which he informed them was a coded description of his latest discovery:
smaismrmilmepoetaleumibunenugttauiras
Altissimum planetam tergeminum observavi
I have observed the highest of the planets [Saturn] three-formed
Sadly, this kind of scientific communication was common at the time. Newton, Huygens, Hooke, and Leonardo all used similar devices to hide their discoveries and methods from each other.
In 1665, the Philosophical Transactions (one of the earliest scientific journals) was founded by Henry Oldenburg.
The Royal Society is collaborating with JSTOR to digitize, preserve, and extend access to Philosophical Transactions (1665-1678).
www.jstor.org®
Although the importance of reproducibility is as old as scientific inquiry, the practice of sharing scientific methodology was adopted slowly.
Over the next 200 years, publishing and sharing methodology became commonplace. In August of 1867, the chemist William Crookes wrote an obituary for his mentor, Michael Faraday. He recounted Faraday’s advice to his students:
“The secret,” said he, “is comprised in three words — Work, Finish, Publish.”
It must be confessed that young chemists of the present day follow this advice, carefully omitting the second word.
Surely science has continued to evolve since 1867…
Today, thousands of scientific papers report on computations that cannot be reproduced without access to secret software.
The inner workings of this secret software are hidden from skeptics and other researchers.
If you try to reproduce the capabilities of the secret software in another code, the entity that owns this software bans the university where you work.
Think I’m exaggerating? BannedByGaussian.org
Science has been Open since 1665. We just need to remind ourselves of this fact every few years…
What is Open Science?
• Open Source
• Open Notebook
• Open Data
• Open Metadata
• Open Access
Transparency in experimental methodology, observation, and collection of data.
Public availability and re-use of scientific data.
Public accessibility and transparency of scientific communication.
Open Science is the idea that scientific knowledge of all kinds should be shared publicly as early as is practical in the discovery process.
Reproducibility
Reproducibility of experiments is one of the foundations of science.
We expect universality from the results of empirical tests. Independent scientists should be able to subject theories to similar tests in different locations, on different equipment, and at different times and get similar answers.
Reproducible Computational Science
• For simple models and small data sets, calculations are reproducible in principle and in practice.
• As simulations become more complex and data sets become larger, calculations that are reproducible in principle are no longer reproducible in practice without access to the code, data, and meta-data.
• Reproducibility now requires public access to code, data, and meta-data.
Reproducible Computational Science
Reports of numerical experiments should include:
1. All source code needed to reproduce the calculation
2. All input data used to perform the calculation
3. All meta-data required to allow other codes to use the input data
These are equivalent to the methodology section of an experimental paper. This standard requires Open Source, Open Data, and Open Meta-data for reproducible computational science.
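One way to make the three items above concrete is to publish a manifest alongside a paper that lists every source, input, and meta-data file with a checksum, so that a reader can confirm they hold the same bytes the authors used. The following is a minimal sketch under that assumption; the function names, the manifest layout, and the choice of SHA-256 are illustrative and are not part of any standard described in this talk.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(paths):
    """Record the path, size, and checksum of each file."""
    return [
        {"path": str(p), "bytes": Path(p).stat().st_size, "sha256": sha256_of(p)}
        for p in paths
    ]


def build_manifest(source_files, input_files, metadata_files):
    """Group files under the three headings used in the list above (hypothetical layout)."""
    return {
        "source_code": describe(source_files),
        "input_data": describe(input_files),
        "metadata": describe(metadata_files),
    }


def write_manifest(manifest, destination):
    """Write the manifest as sorted, indented JSON so that diffs stay readable."""
    Path(destination).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

A reader who downloads the deposited files can rebuild the manifest and compare it with the published one; any mismatch in a checksum points at the file that differs.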
Reproducible Research Standard
1. Release media components (text, figures) under CC BY.
2. Release code components under MIT license or similar.
3. Attribution license on selection and arrangement of data.
4. Release data under CC0.
V. Stodden, “Enabling reproducible research: Licensing for scientific innovation,” International Journal of Communications Law & Policy 13, 1 (2009).
Why aren’t all scientific programs open source?
Two Open Source Science Codes
                                Jmol                                      OpenMD
Started:                        1998                                      2004
Purpose:                        Molecular Visualization                   Molecular Dynamics
Languages:                      Java                                      C++, Python
Developers:                     38                                        17 (graduate students)
Lead Developers:                7                                         1
Code base:                      472,956 lines                             92,308 lines
Person-Years:                   125                                       23
Estimated Development Costs:    $6,848,949                                $629,389
Explicitly-funded Costs:        $0                                        $0
Downloads:                      Over 831,656 at SourceForge alone         5,472
                                (possibly millions more)
External Citations:             221                                       21
Citations to lead developers:   157                                       21
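The Person-Years row above comes from the source code analysis cited in Sources. As a rough cross-check, the sketch below applies the Basic COCOMO organic-mode formula, effort = 2.4 × KLOC^1.05 person-months, to the two line counts. Using this model and these coefficients is an assumption made here for illustration; the slides do not say how the estimates were derived, and the function and variable names are hypothetical.

```python
def cocomo_person_years(lines_of_code, a=2.4, b=1.05):
    """Basic COCOMO (organic mode): effort in person-months = a * KLOC**b."""
    kloc = lines_of_code / 1000.0
    person_months = a * kloc ** b
    return person_months / 12.0


# Line counts taken from the "Code base" row of the table.
CODE_BASES = {"Jmol": 472_956, "OpenMD": 92_308}

for name, lines in CODE_BASES.items():
    print(f"{name}: roughly {cocomo_person_years(lines):.0f} person-years")
```

Under these assumptions the estimates land within a few percent of the 125 and 23 person-years in the table. Dollar costs are left out because they depend on an assumed salary that the slides do not give.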
Comparative History
[Figure: development history of the two codes. Recoverable labels: Post-doc, Pharma researcher, Graphics guru, Informatics Post-doc, Graduate Student, Academic; Grad Student 1, Grad Student 2, Grad Student 3, Advisor, Group Code, Other Groups, Code Re-use.]
Jmol is a useful tool
• Filled a void created by the death of a closed-source tool.
• Developed by a series of project leads and their geographically-distributed teams. The lead developers hand off the code when they become too busy.
• Application focus changed dramatically over 16 years.
• External users of the code tend to run the application rather than re-use algorithms.
• Jmol became the standard tool for embedding chemical structures in web pages:
  • RCSB Protein Data Bank (PDB)
  • Inorganic Crystal Structure Database
  • Viewer for Folding@Home projects
  • Can be directly included in Sakai, Moodle, and WebAssign sites
Jmol is part of other scientific tools
• Bioclipse (integrated environment for biomolecule investigation)
• CaGe
• ChemPad (3D models calculated on-the-fly from a formula sketched by hand on a tablet PC)
• iBabel (a GUI for Open Babel)
• Janocchio (calculates NMR coupling constants and NOEs)
• Molecular Workbench
• PFAAT (Protein Family Alignment Annotation Tool)
• ProteinGlimpse
• Spice
• STING Millennium
• STRAP
• Taverna
Jmol helps disseminate data
Jmol provides structure visualization for:
• ACS Chemical Biology
• Biochemical Journal
• Chemical Reviews (ACS)
• Crystallography Journals Online (IUCr)
• Molecular BioSystems (Royal Soc. Chem.)
• Nature Chemical Biology
• Nature Structural & Molecular Biology
• Inorganic Chemistry (ACS)
• JACS
• Journal of Chemical Education
• Journal of Molecular Biology
• Journal of Natural Products
• Organic Letters
OpenMD
• Merged student codes that carried out similar tasks.
• Development was done within one research group and was piggy-backed on other funded projects.
• A journal article outlines the code’s capabilities, and attribution is requested in the license.
• Application development preserved group memory.
• External users of the code tend to re-use algorithms rather than run the application.
OpenMD
• Tight coupling of data & meta-data
• Code versioning information stored in generated data
• Reproduce simulations easily, but reproducibility is not the same as replicability.
• In parallel architectures, replicability may not be possible.
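One reason bitwise-identical reruns can be out of reach on parallel machines, offered here as an illustration rather than as the argument made in the talk, is that floating-point addition is not associative. A parallel reduction that splits a sum across workers adds the same numbers in a different order than a serial loop does. The sketch below imitates that with contiguous chunks; the function names and the sample values are hypothetical.

```python
import math


def naive_sum(values):
    """Add values left to right, the way a serial loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


def chunked_sum(values, n_chunks):
    """Imitate a parallel reduction: sum each contiguous chunk, then sum the partials."""
    size = max(1, math.ceil(len(values) / n_chunks))
    partials = [naive_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return naive_sum(partials)


# Hypothetical contributions whose exact sum is 2.0.
CONTRIBUTIONS = [1e16, 1.0, -1e16, 1.0]

print(naive_sum(CONTRIBUTIONS))       # serial order: 1.0
print(chunked_sum(CONTRIBUTIONS, 2))  # two "workers": 0.0
print(math.fsum(CONTRIBUTIONS))       # correctly rounded reference: 2.0
```

The inputs are identical in all three calls; only the order of the additions changes. In a long simulation, differences of this kind can grow until two runs of the same code on different processor counts follow different trajectories while remaining statistically equivalent.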
<OpenMD version=2>
  <MetaData>
molecule{
  name = "D";
  atom[0]{
    type = "D";
    position(0.0, 0.0, 0.0);
    orientation(0.0, 0.0, 0.0);
  }
}
component{
  type = "D";
  nMol = 3456;
}
ensemble = NVE;
forceField = "Multipole";
cutoffMethod = "shifted_force";
electrostaticScreeningMethod = "damped";
cutoffRadius = 12.0;
dampingAlpha = 0.14;
dt = 1.0;
runTime = 1e3;
sampleTime = 10.0;
statusTime = 1.0;
seed = 8675309;
## Last run using OpenMD Version: 2.1 Revision: 1972
  </MetaData>
  <Snapshot>
…
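Because the run parameters and the code version travel inside the data file, a reader can recover them without the original input script. The sketch below shows one way to pull them out of a file shaped like the example above. It is not OpenMD's own parser: the regular expressions, the function names, and the shortened sample (closed with end tags so it stands alone) are assumptions made for illustration.

```python
import re

# Shortened, hypothetical sample modeled on the example above.
SAMPLE = """<OpenMD version=2>
  <MetaData>
molecule{
  name = "D";
  atom[0]{
    type = "D";
    position(0.0, 0.0, 0.0);
  }
}
component{
  type = "D";
  nMol = 3456;
}
ensemble = NVE;
forceField = "Multipole";
cutoffRadius = 12.0;
seed = 8675309;
## Last run using OpenMD Version: 2.1 Revision: 1972
  </MetaData>
</OpenMD>
"""


def metadata_block(text):
    """Return the text between the <MetaData> tags."""
    match = re.search(r"<MetaData>(.*?)</MetaData>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <MetaData> block found")
    return match.group(1)


def top_level_text(block):
    """Drop everything inside braces so only top-level statements remain."""
    depth = 0
    kept = []
    for char in block:
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
        elif depth == 0:
            kept.append(char)
    return "".join(kept)


def run_parameters(text):
    """Collect top-level `key = value;` pairs as strings."""
    flat = top_level_text(metadata_block(text))
    pairs = re.findall(r"(\w+)\s*=\s*([^;]+);", flat)
    return {key: value.strip().strip('"') for key, value in pairs}


def code_version(text):
    """Return (version, revision) from the trailing comment, or None if absent."""
    match = re.search(
        r"##\s*Last run using OpenMD Version:\s*(\S+)\s+Revision:\s*(\S+)",
        metadata_block(text),
    )
    return match.groups() if match else None


print(run_parameters(SAMPLE))
print(code_version(SAMPLE))
```

Values are kept as strings on purpose, since converting them would require knowing each parameter's type, which this sketch does not assume.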
Why aren’t all scientific programs open source?
Every discussion in science ends up in a discussion on tenure and grants.
Once a year, academic scientists do this:
College of Science 2013 Accomplishments - Department of Chemistry and Biochemistry
My graduate students have done excellent research that is well respected in my community, and after their degrees were awarded, they have all gone on to positions in which they can contribute to science and to society in meaningful ways. We are deeply engaged in the university’s goal to become a pre-eminent research university. This past year, our work in the OpenScience movement was recognized nationally and internationally at the White House Champions of Change event. That recognition contributed directly to communication with the external constituents of the university (Goal 5).
Generate Personal “Citation Report”
1. Go to ISI Web of Knowledge: http://www.webofknowledge.com
2. Select the Web of Science tab at the top
3. In 2nd box type in:
   • Last name and first initial (no commas). Use all variations, i.e. Doe J OR Doe JR
   • Make sure that Author appears in the small box on the right
4. Under Current Limits at the bottom of the page, select:
   • All Years (normally the default)
   • Click on Search above
5. Under Refine Results in the column on the left side of the page, you may be able to perform refinements that will help to limit the list to only your citations.
6. Click on Refine button
7. Click on Create Citation Report (graph icon/upper right)
8. After generating the citation report, remove any citations that are not yours by checking the left-hand box on that citation
9. Rerun Citation Report
10. Copy and paste both green plots into the space below (highlight both or right-click and copy each one).
11. Please also copy and paste your citation metrics (h-index, citations etc.), which appear to the right of the plots.
You are done! (Don’t forget to Log Out)
* If you need help in generating these citation plots, please contact Thurston Miller [email protected] - He will be happy to assist you! *
Copy and paste Citation Report bar graphs (2) into the space below (please submit as PC viewable graphs, which QuickTime often are not). Please also copy and paste your citation metrics in this space.
• Scientists stay alive professionally by publishing
  • Paper count
  • Citation count
  • h-index
• Time spent on open science projects reduces publication rates
• Scientific software tools are often not cited
• Even if they were cited, how would that citation get tied to a researcher?
• How can a scientist show her institution the value of her project?
Attribution metrics should (but don’t) take into account:
1. Effort to maintain a useful resource
2. Importance to the scientific community
3. Externalities beyond the scientific community
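For readers unfamiliar with the h-index named in the list above, it is the largest number h such that h of a researcher's papers each have at least h citations. The short sketch below computes it from a list of per-paper citation counts; the function name and the sample numbers are hypothetical. Note what the input is: citations to papers. Software that is used but never cited contributes nothing.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical researcher with five papers.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

A code base with hundreds of thousands of downloads and no citable paper would leave this number unchanged.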
Recognition & Attribution
Until recently, there was no way to measure open products of research (outside of traditional publications) with a simple metric that can be used by institutions. This is starting to change, with tools such as ImpactStory and fidgit DOI lookups.
What we need is institutional recognition of alternative metrics.
And a pony
There is no drive to make these changes in the academic world.
There’s no drive for this in the for-profit or society journals. The new PLOS One sharing policy is a refreshing change.
The funding agencies may be our best hope for recognizing code & data as primary research products.
Sustainability
Academic scientists know almost nothing about good coding practices:
• Source version control systems (cvs, svn, git, Hg)
• Agile (or any other) development models
• Design patterns
• Object-oriented languages
• Strong typing
• Public source repositories (SourceForge, GitHub)
• Differences among open source licenses
• Modern build & testing systems
• Unit testing
• Bug & issue tracking
• Designing for usability and usability testing
• UI design
• Error handling
• Introductory user documentation
Because they aren’t forced into good practices, scientists often create code that is impossible to maintain effectively. This does not lead to sustainable open science.
Why not employ professional programmers to do scientific coding?
Computer scientists often know little about the domain sciences
• It is significantly faster for me to train a computational chemist in good coding practices than it is to train even an accomplished programmer in the various disciplines we use.
Resources are also necessary for sustainable Open Science
NIH, DOE & DARPA fund specific kinds of science. There is little room for projects that enrich the overall scientific enterprise but don’t constitute novel research themselves. Tools are rarely funded.
Funding agencies should require delivery of primary research products:
• code in a public repository
• data in a public repository
Depositions should be made a part of the reporting structure for funded grants.
Outlook
Open source is essential for reproducible, open science.
There are no easy solutions to problems of Recognition, Attribution, and Sustainability.
That doesn’t mean we get to step away from open source. To do so would be to go back to 1611:
Haec immatura a me jam frustra leguntur oy
Cynthiae figuras aemulatur mater amorum
“The mother of love imitates the shape of Cynthia” (Venus imitates the phases of the moon)
Acknowledgments
• The Alfred P. Sloan Foundation: startup funding for the Open Science Project
• Brian Glanz & the Open Science Federation: supporters and friends of Open Science
• Michael Nielsen, author of Reinventing Discovery: for making us aware of the Galileo & Faraday stories
• Victoria Stodden, Columbia University: developer of the Reproducible Research Standard
• The National Science Foundation: OpenMD was indirectly supported under grant CHE-0848243
Sources
• E. A. Partridge and H. C. Whitaker, “Galileo’s Work on Saturn’s Rings - A Historical Correction,” Popular Astronomy 3, 408-414 (1896).
• Henry Oldenburg, “The Introduction,” Philosophical Transactions 1, 1-3 (1665) doi:10.1098/rstl.1665.0002
• William Crookes, “Faraday,” The Chemical News XVI(404), 110-111 (1867)
• C. Hempel, Philosophy of Natural Science 49 (1966).
• The distinction between verifiable in principle and verifiable in practice was originally made in: A. J. Ayer, Language, Truth and Logic, (New York: Dover, 1946) p. 32.
• E. Sober, Philosophy of Biology (Boulder: Westview Press, 2000), pp. 50-51.
• J. Lett, Science, Reason and Anthropology, The Principles of Rational Inquiry (Oxford: Rowman & Littlefield, 1997), p. 47
• The Yale Law School Roundtable on Data and Code Sharing, “Reproducible Research: Addressing the Need for Data and Code Sharing in Computational Science,” Computing in Science & Engineering 12(5), 8-12 (2010) doi: 10.1109/MCSE.2010.113
• V. Stodden, “The Legal Framework for Reproducible Scientific Research: Licensing and Copyright,” Computing in Science & Engineering 11(1), 35-40 (2009) doi: 10.1109/MCSE.2009.19
• V. Stodden, “Enabling reproducible research: Licensing for scientific innovation,” International Journal of Communications Law & Policy 13, 1 (2009).
• Jmol is available at jmol.sf.net
• OpenMD is available at openmd.org
• Source code analysis and cost estimates were done at ohloh.net; reference counts are from webofknowledge.com