Improving the chemistry content of Wikipedia using workflow tools

Post on 08-Apr-2017


Using Automated Workflow Tools to Improve Wikipedia

MITCH MILLER

SCIENTIFIC THINKING

VERMONT CODE CAMP 2016

SEPTEMBER 17, 2016

Disclaimer

This talk represents my opinion and personal experience using software systems developed by third parties.

The software systems shown are very complex and have hundreds of components. I have only worked with a small number.

Every task shown today can be accomplished in multiple ways. I’m only showing some of those ways.

Overview

Introduction: how are we improving Wikipedia? Why are we doing this?

The list of information we need to compile
First method of generating the list
Second method of generating the list
Third method of generating the list

What chemistry does Wikipedia contain?

9,736 articles with the Chembox; 5,656 with the Drug box (15,392 total) [source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]

Chembox? Drug box? Both are infobox templates that present selected content within Wikipedia articles

Contents of Chembox:
Molecular structure image
Name (systematically assigned name + synonyms)
Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG, PubChem, SMILES, UNII…
Key properties

Chemical identifiers

Different specific databases
Individual IDs have strengths and weaknesses

The UNII is a non-proprietary, free, unique, unambiguous, non-semantic, alphanumeric identifier based on a substance’s composition and/or descriptive information.
http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/

UNIIs contain 9 randomly generated alphanumeric characters with a tenth check alphanumeric character
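The ten-character shape described above can be checked mechanically before a value goes into a report. The following is a minimal sketch, assuming UNIIs are written with uppercase letters and digits only. It checks the shape only and does not verify the tenth (check) character, since the check algorithm is not covered in this talk. The function name `looks_like_unii` is hypothetical.

```python
import re

# Assumption: a UNII is exactly ten characters drawn from A-Z and 0-9.
_UNII_SHAPE = re.compile(r"[A-Z0-9]{10}")


def looks_like_unii(value: str) -> bool:
    """Return True if value has the 10-character alphanumeric shape of a UNII.

    Surrounding whitespace is ignored and lowercase input is upper-cased first.
    The check character itself is NOT validated here.
    """
    return bool(_UNII_SHAPE.fullmatch(value.strip().upper()))
```

A shape check like this is only useful for catching truncated or mistyped values; it cannot tell whether a well-formed string is actually an assigned UNII.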

When two samples have the same UNII, “they represent the same molecular entity or elements upon which the definition is based.”

SRS group goal

Manages Substance Registration System (SRS)
Assure uniformity of UNII assignments across internet resources that reference UNIIs

The assignment

Generate a report of all chemicals and drugs in Wikipedia
Name, UNII (when present), CAS (when present), Wikipedia URL
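One way to picture the report is as a four-column table with empty cells where an identifier is absent. The sketch below assumes CSV as the output format, which the talk does not specify; the names `ReportRow`, `write_report` and `report_as_text` are hypothetical.

```python
import csv
import io
from typing import Iterable, NamedTuple, Optional


class ReportRow(NamedTuple):
    name: str
    unii: Optional[str]  # None when the article carries no UNII
    cas: Optional[str]   # None when the article carries no CAS number
    url: str


def write_report(rows: Iterable[ReportRow], out) -> None:
    """Write a header line plus one CSV line per row to the open text file `out`."""
    writer = csv.writer(out)
    writer.writerow(["Name", "UNII", "CAS", "Wikipedia URL"])
    for row in rows:
        # Missing identifiers are written as empty cells.
        writer.writerow([row.name, row.unii or "", row.cas or "", row.url])


def report_as_text(rows: Iterable[ReportRow]) -> str:
    """Convenience wrapper that returns the report as a string."""
    buffer = io.StringIO(newline="")
    write_report(rows, buffer)
    return buffer.getvalue()
```

Keeping missing values as empty cells, rather than dropping the row, lets reviewers see at a glance which articles still need a UNII.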

Idea: subject matter experts will review list and correct assignments, add new UNIIs to Wikipedia as needed

Result: more accurate Wikipedia that links to the FDA’s Substance Registration System unambiguously
https://fdasis.nlm.nih.gov/srs/srs.jsp

Development tool: KNIME

Graphic, component-based programming environment
Drag functional components from palette onto canvas to create program
Configure most components by setting parameters
Connect components to route data from one to another
Run and observe data traveling down the lines

KNIME stands for KoNstanz Information MinEr
Pronounced “Nighm”

Originally a production of the University of Konstanz, Germany, 2004
Currently produced by KNIME.com AG, a company in Zurich, Switzerland
Free version available for download

Windows, Linux, Mac

First method of report generation

Read list of pages with each infobox
E.g., https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Chembox&limit=50000&from=16225610&back=0
Retrieve each individual page mentioned in the list
Parse HTML
Use XPath to get Name, CAS, UNII

The Infobox templates lead to pages with defined structure – straightforward to parse

Format data for output
Write to a file
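The talk builds these steps in KNIME. As a rough illustration of the parse-and-XPath step only, here is a Python sketch using the standard library's limited XPath support. The HTML fragment is a hypothetical, heavily simplified stand-in for a rendered infobox: real Wikipedia pages are not well-formed XML, use different markup, and would need a lenient HTML parser. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: label in the first cell, value in the second.
SAMPLE = """
<html><body>
  <h1 id="firstHeading">Ethanol</h1>
  <table class="infobox">
    <tr><td>CAS Number</td><td>64-17-5</td></tr>
    <tr><td>UNII</td><td>3K9958V90M</td></tr>
  </table>
</body></html>
"""


def infobox_value(root, label):
    """Return the text of the cell next to the cell whose text equals `label`."""
    # ElementTree XPath subset: tr[td='X'] selects rows having a td child with text X.
    row = root.find(f".//table[@class='infobox']/tr[td='{label}']")
    if row is None:
        return None
    cells = row.findall("td")
    return cells[1].text.strip() if len(cells) > 1 and cells[1].text else None


def extract_record(html_text):
    """Pull Name, CAS and UNII out of one (well-formed) page."""
    root = ET.fromstring(html_text)
    heading = root.find(".//h1[@id='firstHeading']")
    name = heading.text.strip() if heading is not None and heading.text else None
    return {
        "name": name,
        "cas": infobox_value(root, "CAS Number"),
        "unii": infobox_value(root, "UNII"),
    }
```

Calling `extract_record(SAMPLE)` returns a dictionary with the three fields; a label that is absent from the page comes back as `None` rather than raising an error.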

First method: pluses/minuses

Plus: it works
Minus: had to run in batches to get all records
Minus: XPath parsing was more cumbersome than expected
Minus: misses some data

The Semantic Web

A connected set of data resources that can be understood by machines

Data encoded in a standard way that allows unattended processors to traverse links from one entity to another across organizational and geographic boundaries

[Standard WWW is a web of documents meant to be understood by humans]

Tim Berners-Lee has a great TED talk on the semantic web
https://www.youtube.com/watch?v=OM6XIICm_qo

Understand Semantic Web in comparison to WWW

Compare pages on same subject:
Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol
Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153

Technological foundations of Semantic Web

RDF – Resource Description Framework – organizing facts as Subject – Predicate – Object
Conceptual example:
[Ethanol] [has a boiling point] [173 degrees Fahrenheit]
Coded example:
wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .
Represented in Turtle (Terse RDF Triple Language)

SPARQL

Query language for RDF data
SPARQL Protocol and RDF Query Language
Similar to SQL
Syntax based on the RDF triple
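To make "syntax based on the RDF triple" concrete, the sketch below matches a single triple pattern against a tiny in-memory list of triples, binding terms that start with `?` as variables. It is a toy illustration of the idea only, not how a SPARQL engine is implemented. The `ex:` triple and the function name `match` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

TRIPLES = [
    # Taken from the coded example in this talk.
    ("wd:Q153", "wdt:P2102", "173±1 degree Fahrenheit"),
    # Hypothetical triple with made-up identifiers, for contrast.
    ("ex:water", "ex:boilingPoint", "212 degree Fahrenheit"),
]


def match(pattern: Triple, triples: Iterable[Triple]) -> Iterator[Dict[str, str]]:
    """Yield one dict of variable bindings per triple that fits the pattern.

    Terms starting with '?' are variables; any other term must match exactly.
    """
    for triple in triples:
        bindings: Dict[str, str] = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if bindings.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            yield bindings
```

For example, `list(match(("?s", "wdt:P2102", "?bp"), TRIPLES))` yields one binding, for the ethanol triple, in the same way a SPARQL pattern `?s wdt:P2102 ?bp .` selects every subject that has a value for that property.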

Wikidata

Conceptually: semantic web version of Wikipedia
Add grain of salt

“Free and open knowledge base that can be read and edited by both humans and machines.”

Designed as ‘central storage’ for Wikipedia and other Wikimedia projects

Approximately: programmatic interface to Wikipedia
See https://query.wikidata.org/
Run the example queries

Second method

Search Wikidata programmatically for chemical information
Wikidata SPARQL interface
Format list
Write file

SPARQL for chemical and pharmaceutical compounds

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

# All chemicals with, optionally, CAS registry numbers and UNIIs in Wikidata
SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P274 ?formula . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
  OPTIONAL { ?compound wdt:P662 ?pubchem . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
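The talk runs this query from KNIME. As an alternative illustration, the sketch below sends a shortened version of it to the public endpoint at https://query.wikidata.org/sparql from Python and flattens the standard SPARQL JSON result into plain dictionaries. The `LIMIT 10`, the reduced column list, the User-Agent string and the function names are assumptions added for the example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Shortened form of the query above; LIMIT 10 is added only to keep a trial run small.
QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
SELECT DISTINCT ?compound ?compoundLabel ?unii ?cas WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def run_query(query, endpoint=ENDPOINT):
    """Send the query over HTTP GET and return the decoded JSON result."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "chem-report-sketch/0.1 (example script)",  # hypothetical identifier
    })
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def flatten_bindings(result):
    """Turn SPARQL JSON results into plain dicts; unbound OPTIONAL values become None."""
    variables = result["head"]["vars"]
    return [
        {var: binding[var]["value"] if var in binding else None for var in variables}
        for binding in result["results"]["bindings"]
    ]
```

With network access, `flatten_bindings(run_query(QUERY))` returns one dictionary per compound. Rows where an `OPTIONAL` clause found nothing carry `None` for that field, which maps directly onto the "when present" columns of the report.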

Second method: pluses/minuses

Plus: fast and easy!
Plus: data arrives in a format we can use – no parsing!
Minus: *some* Wikidata data does not match up with Wikipedia!

Third method

Hybrid approach
Use Wikidata SPARQL query to get list of chemicals
Query Wikipedia for individual items to compare values
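The comparison step of the hybrid approach can be sketched as a field-by-field diff between the record obtained from Wikidata and the one parsed from Wikipedia. The dictionary layout, the normalization rules (trimming whitespace, ignoring case) and the function names below are assumptions for illustration; the talk does not describe how values were compared.

```python
def _normalize(value):
    """Trim and upper-case a value; treat empty strings and None alike."""
    if value is None:
        return None
    value = value.strip().upper()
    return value or None


def compare_records(wikidata, wikipedia, fields=("cas", "unii")):
    """Return (field, wikidata_value, wikipedia_value) for every field that differs."""
    differences = []
    for field in fields:
        left = _normalize(wikidata.get(field))
        right = _normalize(wikipedia.get(field))
        if left != right:
            differences.append((field, left, right))
    return differences
```

An empty result means the two sources agree on the compared fields; any tuple in the list is a candidate for review by a subject matter expert.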

Conclusion

Using Wikidata, Wikipedia and KNIME we compiled a list of chemicals with the required data

Subject matter experts are in the process of updating Wikipedia
Semantic web technology made the job easier!
Thank you!

References

Scholarly article on KNIME and Pipeline Pilot: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/

KNIME: www.knime.org

Wikipedia:
https://en.wikipedia.org/wiki/Template:Chembox
https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox

Wikidata: https://query.wikidata.org

Who is your speaker?

Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience
Independent consultant: Scientific Thinking, LLC
mitch.miller@thinkscience.us
Some recent projects:
Ongoing custodian of one chemical database implementation for ChemIDplus project within the National Library of Medicine
Reporting systems
Web service to link collaborative object management system to reporting system
Import wizard for chemical array designer
Merged a set of chemical databases and harmonized data