Improving the chemistry content of Wikipedia using workflow tools

Using Automated Workflow Tools to Improve Wikipedia MITCH MILLER SCIENTIFIC THINKING VERMONT CODE CAMP 2016 SEPTEMBER 17, 2016


Page 1: Improving the chemistry content of Wikipedia using workflow tools

Using Automated Workflow Tools to Improve Wikipedia

MITCH MILLER

SCIENTIFIC THINKING

VERMONT CODE CAMP 2016

SEPTEMBER 17, 2016

Page 2: Improving the chemistry content of Wikipedia using workflow tools

Disclaimer

This talk represents my opinion and personal experience using software systems developed by third parties

The software systems shown are very complex and have hundreds of components. I have only worked with a small number.

Every task shown today can be accomplished in multiple ways. I’m only showing some of those ways.

Page 3: Improving the chemistry content of Wikipedia using workflow tools

Overview

Introduction: how are we improving Wikipedia? Why are we doing this?

The list of information we need to compile

First method of generating the list

Second method of generating the list

Third method of generating the list

Page 4: Improving the chemistry content of Wikipedia using workflow tools

What chemistry does Wikipedia contain?

9,736 articles with the Chembox; 5,656 with the Drug box (15,392 total) [source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]

Chembox? Drug box? Templates of selected content within Wikipedia articles

Contents of Chembox:

Molecular structure image

Name (systematically assigned name + synonyms)

Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG, PubChem, SMILES, UNII…

Key properties

Page 5: Improving the chemistry content of Wikipedia using workflow tools

Chemical identifiers

Different identifiers come from specific databases

Individual IDs have strengths and weaknesses

The UNII is a non-proprietary, free, unique, unambiguous, non-semantic, alphanumeric identifier based on a substance’s composition and/or descriptive information. http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/

UNIIs contain 9 randomly generated alphanumeric characters with a tenth check alphanumeric character

When two samples have the same UNII, “they represent the same molecular entity or elements upon which the definition is based.”
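A minimal sketch of a shape check for UNII strings, assuming only what the slide states: ten alphanumeric characters in total. The rule for computing the tenth check character is not given here, so this sketch does not verify it; the function name and the sample value are illustrative.

```python
import re

# Assumption: a UNII is written as exactly 10 uppercase letters or digits.
# The check-character rule is not described in the talk, so it is NOT verified.
UNII_SHAPE = re.compile(r"[A-Z0-9]{10}")


def looks_like_unii(value: str) -> bool:
    """Return True when value has the 10-character alphanumeric UNII shape."""
    return UNII_SHAPE.fullmatch(value.strip().upper()) is not None


if __name__ == "__main__":
    for candidate in ["3K9958V90M", "3K9958V90", "3K9958-90M"]:
        print(candidate, looks_like_unii(candidate))
```

A check like this only catches malformed entries (wrong length, stray punctuation); deciding whether a well-formed UNII is the right one for an article still needs a subject matter expert.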

Page 6: Improving the chemistry content of Wikipedia using workflow tools

SRS group goal

Manages the Substance Registration System (SRS)

Assure uniformity of UNII assignments across internet resources that reference UNIIs

Page 7: Improving the chemistry content of Wikipedia using workflow tools

The assignment

Generate a report of all chemicals and drugs in Wikipedia

Name, UNII (when present), CAS (when present), Wikipedia URL
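The report layout can be sketched as a small record type written out as CSV. This is only an illustration of the four columns named on the slide; the class name, function name, and the choice of CSV are assumptions, since the talk does not say which output format was used.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Optional, TextIO


@dataclass
class ReportRow:
    """One line of the report; UNII and CAS may be absent from an article."""
    name: str
    unii: Optional[str]
    cas: Optional[str]
    wikipedia_url: str


def write_report(rows: Iterable[ReportRow], out: TextIO) -> None:
    """Write a header line followed by one CSV line per chemical."""
    writer = csv.writer(out)
    writer.writerow(["Name", "UNII", "CAS", "Wikipedia URL"])
    for row in rows:
        # Missing identifiers become empty cells so reviewers can spot gaps.
        writer.writerow([row.name, row.unii or "", row.cas or "", row.wikipedia_url])


buffer = io.StringIO()
write_report(
    [ReportRow("Ethanol", None, "64-17-5", "https://en.wikipedia.org/wiki/Ethanol")],
    buffer,
)
report_text = buffer.getvalue()
```

Leaving a missing UNII as an empty cell, rather than dropping the row, keeps every article visible to the reviewers who will add new UNIIs.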

Idea: subject matter experts will review list and correct assignments, add new UNIIs to Wikipedia as needed

Result: more accurate Wikipedia that links to the FDA’s Substance Registration System unambiguously https://fdasis.nlm.nih.gov/srs/srs.jsp

Page 8: Improving the chemistry content of Wikipedia using workflow tools

Development tool: KNIME

Graphical, component-based programming environment

Drag functional components from palette onto canvas to create program

Configure most components by setting parameters

Connect components to route data from one to another

Run and observe data traveling down the lines

KNIME stands for KoNstanz Information MinEr

Pronounced “Nighm”

Originally developed at the University of Konstanz, Germany (2004)

Currently produced by KNIME.com AG, a company in Zurich, Switzerland

Free version available for download: Windows, Linux, Mac

Page 9: Improving the chemistry content of Wikipedia using workflow tools

First method of report generation

Read list of pages with each infobox

E.g., https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Chembox&limit=50000&from=16225610&back=0

Retrieve each individual page mentioned in the list

Parse HTML

Use XPath to get Name, CAS, UNII

The Infobox templates lead to pages with defined structure – straightforward to parse

Format data for output

Write to a file
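Outside KNIME, the parse step can be sketched in Python with the standard library's limited XPath support. The markup below is a hypothetical, heavily simplified infobox with sample values; real Wikipedia HTML is messier and is not guaranteed to be well-formed XML, so a production version would likely need a tolerant HTML parser. The function name and the row layout (label cell, value cell) are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page markup with sample values (an assumption,
# not a copy of any real Wikipedia page).
SAMPLE_PAGE = """
<html><body>
<h1 id="firstHeading">Ethanol</h1>
<table class="infobox">
  <tr><td>CAS Number</td><td>64-17-5</td></tr>
  <tr><td>UNII</td><td>3K9958V90M</td></tr>
  <tr><td colspan="2">Properties</td></tr>
</table>
</body></html>
"""


def extract_infobox(page_html: str) -> dict:
    """Collect the page name plus every two-cell label/value row of the infobox."""
    root = ET.fromstring(page_html.strip())
    record = {"Name": root.findtext(".//h1[@id='firstHeading']", default="").strip()}
    for row in root.iterfind(".//table[@class='infobox']/tr"):
        cells = row.findall("td")
        if len(cells) == 2:  # skip section-heading rows with a single cell
            label = "".join(cells[0].itertext()).strip()
            value = "".join(cells[1].itertext()).strip()
            record[label] = value
    return record


record = extract_infobox(SAMPLE_PAGE)
print(record)
```

Each record can then be formatted and written to the output file, one line per page.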

Page 10: Improving the chemistry content of Wikipedia using workflow tools

First method: pluses/minuses

Plus: it works

Minus: had to run in batches to get all records

Minus: XPath parsing was more cumbersome than expected

Minus: misses some data

Page 11: Improving the chemistry content of Wikipedia using workflow tools

The Semantic Web

A connected set of data resources that can be understood by machines

Data encoded in a standard way that allows unattended processors to traverse links from one entity to another across organizational and geographic boundaries

[Standard WWW is a web of documents meant to be understood by humans]

Tim Berners-Lee has a great TED talk on the semantic web https://www.youtube.com/watch?v=OM6XIICm_qo

Page 12: Improving the chemistry content of Wikipedia using workflow tools

Understand Semantic Web in comparison to WWW

Compare pages on same subject:

Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol

Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153

Page 13: Improving the chemistry content of Wikipedia using workflow tools

Technological foundations of Semantic Web

RDF – Resource Description Framework – organizing facts as Subject – Predicate – Object

Conceptual example:

[Ethanol] [has a boiling point] [173 degrees Fahrenheit]

Coded example:

wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .

Represented in Turtle - Terse RDF Triple Language
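The subject, predicate, object idea can be sketched in a few lines of Python. This is only an illustration of the data shape: the `Triple` type and `to_turtle` helper are hypothetical names, the literal is the simplified one from the slide, and the `wd:` and `wdt:` prefixes are assumed to be declared elsewhere in the Turtle document.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF fact: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def to_turtle(triple: Triple) -> str:
    """Render one triple as a single Turtle statement ending in ' .'."""
    return f"{triple.subject} {triple.predicate} {triple.obj} ."


# [Ethanol] [has a boiling point] [173 degrees Fahrenheit]
boiling_point = Triple("wd:Q153", "wdt:P2102", '"173±1 degree Fahrenheit"')
line = to_turtle(boiling_point)
print(line)
```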

Page 14: Improving the chemistry content of Wikipedia using workflow tools

SPARQL

Query language for RDF data

SPARQL Protocol and RDF Query Language

Similar to SQL

Syntax based on the RDF triple
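To show what "syntax based on the RDF triple" means, here is a toy matcher, not a SPARQL engine: a pattern is itself a triple in which terms starting with `?` are variables. The `match` function and the sample triples are made up for illustration and are not actual Wikidata content.

```python
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def match(pattern: Triple, triples: List[Triple]) -> Iterator[Dict[str, str]]:
    """Yield variable bindings for every triple that fits the pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for triple in triples:
        bindings: Dict[str, str] = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable used twice in one pattern must bind to the same value.
                if bindings.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            yield bindings


# Toy data (an assumption for illustration only).
DATA: List[Triple] = [
    ("wd:Q153", "wdt:P31", "wd:Q11173"),
    ("wd:Q153", "wdt:P231", '"64-17-5"'),
    ("wd:Q283", "wdt:P31", "wd:Q11173"),
]

compounds = [b["?compound"] for b in match(("?compound", "wdt:P31", "wd:Q11173"), DATA)]
print(compounds)
```

A real SPARQL query joins several such patterns on their shared variables, which is roughly what a `WHERE { ... }` block with multiple lines does.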

Page 15: Improving the chemistry content of Wikipedia using workflow tools

Wikidata

Conceptually: semantic web version of Wikipedia

Add grain of salt

“Free and open knowledge base that can be read and edited by both humans and machines.”

Designed as ‘central storage’ for Wikipedia and other Wikimedia projects

Approximately: programmatic interface to Wikipedia

See https://query.wikidata.org/ and run the example queries

Page 16: Improving the chemistry content of Wikipedia using workflow tools

Second method

Search Wikidata programmatically for chemical information

Wikidata SPARQL interface

Format list

Write file

Page 17: Improving the chemistry content of Wikipedia using workflow tools

SPARQL for chemical and pharmaceutical compounds

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

# All chemicals with, optionally, CAS registry numbers and UNIIs in Wikidata
SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P274 ?formula . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
  OPTIONAL { ?compound wdt:P662 ?pubchem . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
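The talk ran this query from a KNIME workflow; as a rough equivalent, the sketch below sends a shortened version of it to the public endpoint at https://query.wikidata.org/sparql using only the Python standard library and flattens the standard SPARQL JSON results. The `LIMIT 10`, the User-Agent string, and the helper names are assumptions added for the example, and the sample payload is hand-written so the parsing step can be tried without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

ENDPOINT = "https://query.wikidata.org/sparql"

# Shortened form of the query on the slide; LIMIT 10 is an addition for testing.
QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT DISTINCT ?compound ?unii ?cas WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
} LIMIT 10
"""


def build_request(query: str) -> urllib.request.Request:
    """Build a GET request that asks the endpoint for JSON results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent; replace with one that identifies your own tool.
    return urllib.request.Request(url, headers={"User-Agent": "chem-report-sketch/0.1"})


def flatten_bindings(payload: dict) -> List[Dict[str, str]]:
    """Turn SPARQL JSON results into plain dicts of variable name -> value."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query: str) -> List[Dict[str, str]]:
    """Execute the query over the network (not called in this sketch)."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return flatten_bindings(json.load(response))


# Hand-written sample in the SPARQL JSON results shape, for offline use.
SAMPLE = {
    "head": {"vars": ["compound", "unii", "cas"]},
    "results": {"bindings": [
        {"compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q153"},
         "cas": {"type": "literal", "value": "64-17-5"}},
    ]},
}
rows = flatten_bindings(SAMPLE)
print(rows)
```

Variables inside `OPTIONAL` blocks are simply missing from a binding when Wikidata has no value, so downstream code should use `row.get("unii")` rather than `row["unii"]`.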

Page 18: Improving the chemistry content of Wikipedia using workflow tools

Second method: pluses/minuses

Plus: fast and easy!

Plus: data arrives in a format we can use – no parsing!

Minus: *some* Wikidata data does not match up with Wikipedia!

Page 19: Improving the chemistry content of Wikipedia using workflow tools

Third method

Hybrid approach

Use Wikidata SPARQL query to get list of chemicals

Query Wikipedia for individual items to compare values
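The comparison step of the hybrid approach can be sketched as a field-by-field diff between the value Wikidata returned and the value read from the Wikipedia article. The record layout, field names, and sample values are assumptions for illustration; the talk does not describe how the comparison was implemented in KNIME.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]
Mismatch = Tuple[str, Optional[str], Optional[str]]


def find_mismatches(
    wikidata: Record,
    wikipedia: Record,
    fields: Sequence[str] = ("unii", "cas"),
) -> List[Mismatch]:
    """Return (field, wikidata_value, wikipedia_value) for every disagreement.

    A value missing on one side counts as a mismatch so reviewers see the gap.
    """
    mismatches: List[Mismatch] = []
    for field in fields:
        left, right = wikidata.get(field), wikipedia.get(field)
        if left != right:
            mismatches.append((field, left, right))
    return mismatches


# Sample values for illustration only.
from_wikidata = {"unii": "3K9958V90M", "cas": "64-17-5"}
from_wikipedia = {"unii": None, "cas": "64-17-5"}
differences = find_mismatches(from_wikidata, from_wikipedia)
print(differences)
```

The list of differences is what would go to the subject matter experts, since the code cannot tell which of the two sources is correct.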

Page 20: Improving the chemistry content of Wikipedia using workflow tools

Conclusion

Using Wikidata, Wikipedia and KNIME we compiled a list of chemicals with the required data

Subject matter experts are in the process of updating Wikipedia

Semantic web technology made the job easier!

Thank you!

Page 21: Improving the chemistry content of Wikipedia using workflow tools

References

Scholarly article on KNIME and Pipeline Pilot https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/

KNIME www.knime.org

Wikipedia https://en.wikipedia.org/wiki/Template:Chembox

Wikipedia https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox

Wikidata: https://query.wikidata.org

Page 22: Improving the chemistry content of Wikipedia using workflow tools

Who is your speaker?

Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience

Independent consultant: Scientific Thinking, LLC

Some recent projects:

Ongoing custodian of one chemical database implementation for ChemIDplus project within the National Library of Medicine

Reporting systems

Web service to link collaborative object management system to reporting system

Import wizard for chemical array designer

Merged a set of chemical databases and harmonized data