Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
ACS 248th Paper 108 NIST-IUPAC Solubility Data
A REST API for the IUPAC Solubility Data Series:
A ‘Skunkworks’ Project
Stuart J. Chalk
Department of Chemistry
University of North Florida
2014 Fall ACS Meeting
Motivation
What is Website ‘Scraping’?
What are REST and API?
Project Process
NIST Website Analysis
Database Definition
Data Ingestion
Project Website Design
Using the Website
Future Plans
Conclusion
Outline
Linked Open Data (LOD) (1) is important for science
Defining a process for grabbing high quality science data and making it semantically available is useful
Providing a REST API makes information easy to find
Providing unique REST URLs for data allows linking
A semantic description of data makes it more useful
Increase value added -> link data to other available data
Solubility Data Series (SDS) data is fundamentally important to chemistry
Motivation
(1) http://en.wikipedia.org/wiki/Linked_data
Data in web pages is available for users to copy/paste
When the amount of available data is large, automation via scripts is necessary
‘Scraping’ is the processing of web page data using a scripting language
Data can be captured and stored in any format
It is most useful to capture data in a relational database so that it can be repurposed at another website
This is usually done without the permission of the authors of the ‘scraped’ web page(s)
What is Website Scraping?
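As an illustration of the idea, here is a minimal scraping sketch in Python (not the language used in this project). The HTML fragment, the `TableScraper` class, and `scrape_rows` are all hypothetical; a real script would first download the page and would need to match that page's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td>first item</td></tr>
  <tr><td>2</td><td>second item</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def scrape_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_rows(SAMPLE_HTML)
```

Each entry in `rows` is one table row as a list of cell strings, ready to be written to a file or a database table.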
Representational State Transfer (REST) is “a software architectural style consisting of a coordinated set of architectural constraints applied to components, connectors, and data elements, within a distributed hypermedia system” (2)
REST is applied to websites as a style for providing URL access to information in a structured, human-readable way
An Application Programming Interface (API) (3) is a standardized way for one computer/software system to talk to another. For REST this is a set of remote (HTTP-based) calls to pre-defined URLs
What are REST and API?
(2) http://en.wikipedia.org/wiki/Representational_state_transfer
(3) http://en.wikipedia.org/wiki/API
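To make "calls to pre-defined URLs" concrete, the following Python sketch resolves a REST-style path to a JSON response. It is only an illustration under stated assumptions: the `records` resource, the `RECORDS` data, and `handle_get` are hypothetical and are not taken from the project's code.

```python
import json

# Hypothetical in-memory data; a real API would query a database instead.
RECORDS = {"1": {"id": "1", "title": "example record"}}


def handle_get(path):
    """Resolve a REST-style path to a (status code, JSON body) pair.

    Two URL patterns are pre-defined here:
      /records/index      -> list of all record ids
      /records/view/<id>  -> one record
    """
    parts = [p for p in path.split("/") if p]
    if parts == ["records", "index"]:
        return 200, json.dumps(sorted(RECORDS))
    if len(parts) == 3 and parts[:2] == ["records", "view"]:
        record = RECORDS.get(parts[2])
        if record is not None:
            return 200, json.dumps(record)
    # Anything else is not part of the API.
    return 404, json.dumps({"error": "not found"})
```

Because every record has its own stable URL, another site or program can link to it directly.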
Analysis of current NIST Solubility Database website
Definition of database tables needed
Code generation to automate data scraping
Data cleanup
REST API definition and description
REST API development
Output file format generation
Addition of bells and whistles (if there’s time)
Project Process
http://srdata.nist.gov/solubility/dataSeries.aspx
  contains links to all the volumes that are available => volID
http://srdata.nist.gov/solubility/sys_category.aspx
  contains all the system types as part of a select list => typeID
http://srdata.nist.gov/solubility/sol_sys_lst.aspx?sysID=<typeID>&FROM=SSN
  contains the different datasets for a specific system type => sysID
http://srdata.nist.gov/solubility/sol_detail.aspx?sysID=<sysID>
  contains details of the system: citation, data tables, refs, preparer, etc.
http://srdata.nist.gov/solubility/sol_2casno.aspx?STR1=<CASRN1>&STR2=<CASRN2>&OPTION=CASNO
  allows searching by chemical CASRN (also by name (OPTION=CHEM) or formula (OPTION=MOL))
http://srdata.nist.gov/solubility/citation_detail.aspx?REF_NO=<?REFNO?>
  allows searching system data by paper
NIST Website Analysis
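The URL patterns above can be assembled programmatically. This Python sketch uses only the page names and query parameters listed on this slide; the function names and the ID values used to call them are illustrative, and nothing here contacts the NIST server.

```python
from urllib.parse import urlencode

BASE = "http://srdata.nist.gov/solubility/"


def system_list_url(type_id):
    """Listing page for one system type (the site calls the parameter sysID)."""
    return BASE + "sol_sys_lst.aspx?" + urlencode({"sysID": type_id, "FROM": "SSN"})


def system_detail_url(sys_id):
    """Detail page for one system."""
    return BASE + "sol_detail.aspx?" + urlencode({"sysID": sys_id})


def casrn_search_url(casrn1, casrn2):
    """Search page for a pair of chemicals identified by CASRN."""
    query = {"STR1": casrn1, "STR2": casrn2, "OPTION": "CASNO"}
    return BASE + "sol_2casno.aspx?" + urlencode(query)
```

For example, `casrn_search_url("7732-18-5", "7647-14-5")` builds the search URL for water and sodium chloride.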
What types of data are available and how should it be organized?
  By Volume => volID
  By System Type => typeID
  By System => sysID
  By Chemical => CASRN, name, formula
  By Citation => refNO
  By Author (new)
Also added Tables and Variables during development
Note: the actual NIST site uses the sysID parameter both for a system type and for a particular set of data about a system type
Database Definition
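One possible way to lay out such tables is sketched below in Python with SQLite (the project itself used MySQL). Every table name, column name, and foreign-key relationship here is an assumption made for illustration; the actual schema was not given in the talk.

```python
import sqlite3

# Assumed table and column names, one table per way of organizing the data.
SCHEMA = """
CREATE TABLE volumes      (vol_id  TEXT PRIMARY KEY, title TEXT);
CREATE TABLE system_types (type_id TEXT PRIMARY KEY, name  TEXT);
CREATE TABLE citations    (ref_no  TEXT PRIMARY KEY, doi   TEXT);
CREATE TABLE systems (
    sys_id  TEXT PRIMARY KEY,
    type_id TEXT REFERENCES system_types(type_id),
    vol_id  TEXT REFERENCES volumes(vol_id),
    ref_no  TEXT REFERENCES citations(ref_no)
);
CREATE TABLE chemicals (casrn TEXT PRIMARY KEY, name TEXT, formula TEXT);
CREATE TABLE authors   (author_id INTEGER PRIMARY KEY, name TEXT);
"""


def create_database():
    """Create an empty in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
```

Keeping separate typeID and sysID columns avoids the ambiguity noted above, where the source site reuses one parameter name for both.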
Data was imported into MySQL either from a tab-delimited text file or by insertion via PHP scripts
The volume IDs were scraped from http://srdata.nist.gov/solubility/dataSeries.aspx; the HTML was cleaned up to generate a tab-delimited text file => 18 rows
Similarly, the system types were scraped from http://srdata.nist.gov/solubility/sys_category.aspx into a tab-delimited text file => 2564 rows
Data Ingestion
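A rough Python sketch of turning a select list into a tab-delimited file follows. The markup in `SAMPLE_SELECT` is hypothetical (the real page's HTML was not shown), and `options_to_tsv` is an illustrative name; the talk did not say how the HTML cleanup was actually done.

```python
import csv
import io
import re

# Hypothetical select-list markup; the real page may differ.
SAMPLE_SELECT = """
<select name="types">
  <option value="101">first system type</option>
  <option value="102">second system type</option>
</select>
"""

# Capture the value attribute and the visible label of each option.
OPTION_RE = re.compile(
    r'<option\s+value="([^"]*)"\s*>(.*?)</option>', re.IGNORECASE | re.DOTALL
)


def options_to_tsv(html):
    """Return one 'value<TAB>label' line per <option> element."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for value, label in OPTION_RE.findall(html):
        writer.writerow([value, label.strip()])
    return out.getvalue()


tsv_text = options_to_tsv(SAMPLE_SELECT)
```

A file in this form can then be bulk-loaded into a database table, for example with MySQL's `LOAD DATA INFILE` statement.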
Individual systems with data were scraped using a PHP script, which involved
  Lookup of system type and retrieval of typeID
  Construction of system type page URL
    http://srdata.nist.gov/solubility/sol_sys_lst.aspx?sysID=<typeID>&FROM=SSN
  Retrieval of the page content (HTML) into a PHP variable
  PCRE regular expression match for the sysID of each system
  Creation of a new entry in the system database table
=> 4817 rows
Data Ingestion
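The regular expression step can be sketched as follows, in Python rather than PHP. It assumes that the listing page links to each system through a `sol_detail.aspx?sysID=...` URL and that a sysID consists of letters, digits, and underscores; the sample HTML and the function names are hypothetical.

```python
import re

# Hypothetical fragment of a system-type listing page.
SAMPLE_LIST_HTML = """
<a href="sol_detail.aspx?sysID=11_1">system A</a>
<a href="sol_detail.aspx?sysID=11_2">system B</a>
<a href="sol_detail.aspx?sysID=11_1">system A (repeated link)</a>
"""

# Assumed link pattern; the captured group is the sysID.
SYSID_RE = re.compile(r"sol_detail\.aspx\?sysID=([A-Za-z0-9_]+)")


def extract_sys_ids(html):
    """Return each sysID once, in order of first appearance."""
    seen = []
    for sys_id in SYSID_RE.findall(html):
        if sys_id not in seen:
            seen.append(sys_id)
    return seen


def build_system_rows(type_id, html):
    """Build one row per system, ready for insertion into a systems table."""
    return [{"type_id": type_id, "sys_id": s} for s in extract_sys_ids(html)]


system_rows = build_system_rows("11", SAMPLE_LIST_HTML)
```

Dropping repeated matches keeps a system from being inserted twice when a page links to it more than once.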
System details were scraped using a PHP script by
  Lookup of system and retrieval of sysID
  Construction of system detail page URL
    http://srdata.nist.gov/solubility/sol_detail.aspx?sysID=<sysID>
  Retrieval of the page content (HTML) into a PHP variable
  Processing of HTML to retrieve citation, variables, data analysis and tables, method, source, errors, references
  Saving of details to systems table and related tables
Data Ingestion
In addition to data extraction
  Chemical InChI strings were retrieved from NIH CIR (1)
  Citation DOIs were retrieved from CrossRef (2) and saved (article titles and full author names were also added)
  Data tables were converted to JSON format for storage and reproduction
  Table notes, sources, and additional refs were converted to JSON for storage
Data Ingestion
(1) http://cactus.nci.nih.gov/chemical/structure
(2) http://www.crossref.org
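Two of these enrichment steps can be sketched in Python. The first function builds a CIR lookup URL using that service's `/<identifier>/stdinchi` pattern without contacting it; the second stores a table as JSON. The JSON layout (`headers`, `rows`, `notes`) and the sample values are assumptions for illustration, not the format the project used.

```python
import json
from urllib.parse import quote

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_inchi_url(identifier):
    """URL asking CIR for the standard InChI of a chemical name or CASRN."""
    return f"{CIR_BASE}/{quote(identifier, safe='')}/stdinchi"


def table_to_json(headers, rows, notes=None):
    """Serialize a data table and its notes to a JSON string (assumed layout)."""
    return json.dumps({"headers": headers, "rows": rows, "notes": notes or []})


# Placeholder values only, not data from the database.
encoded = table_to_json(["T/K", "x1"], [[298.15, 0.001]], notes=["placeholder note"])
decoded = json.loads(encoded)
```

Storing each table as a single JSON string lets it be reproduced on a page exactly as scraped, at the cost of not being searchable by value.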
Database
Constructed using the CakePHP framework (PHP)
Index (listing) and view pages for each of
  Authors
  Chemicals
  Citations
  Systems
  System Types
  Volumes
Search functionality provided via the homepage
Example URL
  http://chalk.coas.unf.edu/solubility/systems/view/20_135
Project Website Design
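A client could build and take apart such URLs as in the Python sketch below. Only the `systems/view/<id>` form appears in the example URL; the `index` path follows the usual CakePHP controller/action convention and is an assumption, as are the function names.

```python
from urllib.parse import urlparse

SITE = "http://chalk.coas.unf.edu/solubility"


def view_url(controller, record_id):
    """URL of the view page for one record, e.g. controller 'systems'."""
    return f"{SITE}/{controller}/view/{record_id}"


def index_url(controller):
    """URL of the listing page for a controller (assumed path)."""
    return f"{SITE}/{controller}/index"


def parse_view_url(url):
    """Split a view URL into (controller, record_id); raise if it does not fit."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[-2] != "view":
        raise ValueError(f"not a view URL: {url}")
    return parts[-3], parts[-1]
```

For instance, `parse_view_url` applied to the example URL yields the controller `systems` and the record id `20_135`.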
Get this project funded
Clean up references and link to DOIs
Clean up authors and link to ORCIDs
Add procedural references
Convert table data into searchable/linked format
Add measurement type, unit, error, and variables
Provide searching and plotting of data
Automated calculation of additional parameters, e.g. solubility in different units, mole ratio
Create solubility ontology => add RDF + searching
Add microdata (1) to each web page
Next phase? => Add the other volumes
Future Plans
(1) http://www.w3.org/TR/microdata/
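As an example of what such automated calculations might look like, the Python sketch below converts a solute mole fraction in a binary mixture to a mole ratio and to molality. These are the standard definitions, but the function names and the choice of conversions are illustrative; the talk did not specify which parameters would be computed.

```python
def _check_fraction(x_solute):
    """Reject mole fractions outside the range where the formulas are defined."""
    if not 0.0 <= x_solute < 1.0:
        raise ValueError("mole fraction must satisfy 0 <= x < 1")


def mole_ratio(x_solute):
    """Moles of solute per mole of solvent in a binary mixture."""
    _check_fraction(x_solute)
    return x_solute / (1.0 - x_solute)


def molality(x_solute, solvent_molar_mass):
    """Moles of solute per kg of solvent; molar mass given in g/mol."""
    _check_fraction(x_solute)
    return 1000.0 * x_solute / ((1.0 - x_solute) * solvent_molar_mass)


def mole_fraction_from_molality(m, solvent_molar_mass):
    """Inverse of molality(): solute mole fraction from mol/kg."""
    n_solvent = 1000.0 / solvent_molar_mass  # moles of solvent in 1 kg
    return m / (m + n_solvent)
```

Calling `molality(x, 18.015)` would give the molality of an aqueous solution, taking 18.015 g/mol as the molar mass of water.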
A RESTful version of the IUPAC-NIST Solubility Series Database was successfully created and made available
Metrics
  20 Volumes
  2564 System Types
  4817 Systems
  1484 Chemicals
  1247 References
  1968 Authors
  11 MB size of database
One week’s worth of work
Conclusion
[email protected]
Phone: 904-620-5311
Skype: stuartchalk
LinkedIn/SlideShare: https://www.linkedin.com/in/stuchalk
ORCID: http://orcid.org/0000-0002-0703-7776
ResearcherID: http://www.researcherid.com/rid/D-8577-2013
Questions?