Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation

Post on 10-May-2015

1.393 views 1 download

Tags:

description

Science has evolved from the isolated individual tinkering in the lab, through the era of the “gentleman scientist” with his or her assistant(s), to group-based then expansive collaboration and now to an opportunity to collaborate with the world. With the advent of the internet the opportunity for crowd-sourced contribution and large-scale collaboration has exploded and, as a result, scientific discovery has been further enabled. The contributions of enormous open data sets, liberal licensing policies and innovative technologies for mining and linking these data has given rise to platforms that are beginning to deliver on the promise of semantic technologies and nanopublications, facilitated by the unprecedented computational resources available today, especially the increasing capabilities of handheld devices. The speaker will provide an overview of his experiences in developing a crowdsourced platform for chemists allowing for data deposition, annotation and validation. The challenges of mapping chemical and pharmacological data, especially in regards to data quality, will be discussed. The promise of distributed participation in data analysis is already in place.

Transcript of Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation

Facilitating Scientific Discovery through Crowdsourcing and Distributed Participation

Antony Williams

NETTAB

October 17th 2013

If it was not just about me…

• Together we might:• build an encyclopedia• …and rate restaurants• …share book reviews • …and movie reviews• …and reviews of service providers• …organize sit-ins and social action• …and more data might just be Open

If it was not just about me…

• Together we might:• build an encyclopedia• …and rate restaurants• …share book reviews • …and movie reviews• …and reviews of service providers• …organize sit-ins and social action• …and more data might just be Open

Crowdsource the galaxy

Various ways to contribute

Where Am I From?

What can be done with Big Data

Patients Like Me

Patients Like Me

Let’s Change the World

• Let’s map together all historical chemistry data and build systems to integrate new data

• Heck, let’s integrate chemistry and biology data and add in disease data too

• Lets model the data and see if we can extract new relationships – quantitative and qualitative

• Let’s make it all available on the web

That’s a BIG Request

What About Something Smaller?

• We’re going to map the world

• We’re going to take photos of as many places as we can and link them together

• We’ll let people annotate and curate the map

• Then let’s make it available free on the web

• We’ll make it available for decision making

• Put it on Mobile Devices, Give it Away

I’m from here…

Wikipedia

Wikipedia

Moel-Y-Parc

I care…I want to contribute…

The Power of Contribution

How do you spell Afonwen?

And the Welsh know!

Whoa…

• So the world can be mapped…

• We can enter a 3D environment within the map

• We can add annotations

• We can use the data, we can reference it, we can extract it, we can make decisions with it

• And we can do it on our lap, in our hands

• Let’s crowdsource chemistry

Chemistry Data is Everywhere

In a Galaxy far, far away…

• Build a structure-centric hub

• Aggregate structure-based data and integrate

• Link to additional data sources• Patents• Publications• Vendors• Models

A LITTLE Chemistry First

Structural Diagrams

Structural Diagrams

Analytical Data

Does Stereochemistry Matter?

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

Structural Representations

The InChI Standard

InChIKeysSearch the Web by Structure

The Quality of Chemical Data OnlineWhat is the Structure of Vitamin K?

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria & synthetic naphthoquinone provitamins, VITAMIN K3 (menadione).

What is the Structure of Vitamin K1?

What is the Structure of Vitamin K1?

Chemical Abstracts Service

Wikipedia

Wolfram Alpha

DailyMed

People Use Trusted Resources…

We can all help clean it up…

How will it improve?

Participation

and

contribution

Building a chemistry hub ..issuesALL Different, ALL “Domoic Acids”

ALL Different, ALL “Domoic Acids”

ONE is “correct”

The EXPERTS must get it right?!

Web 2.0 Contribution

• We have been contributing to the web for a along time already – but how much in chemistry?

• A few blogs, an increasing amount of tweeting but what about data sharing in chemistry?

Challenging a PublicationThis story is only 11 years ago!

Oops…

>2 Years to Resolution

The New Way of Challenging

Challenging Science…

Collaboration towards completion

Detailed constructive dialog

Oxidation by Sodium Hydride?

The Blogosphere Analyzes…

How much is in the archives?

Data Mine the Archives

• Imagine the power of data-mining the archives of the publishers

• Generate nanopublications from the articles and make available to the reasoning engines

• Now imagine extracting all the “data” and freeing it up to the semantic web.

• MORE LATER ON THAT!

Open Notebook Science Analysis

Oxidation by Sodium Hydride?

THIS IS A COP OUT!!!

The Blogosphere “Discusses”…

What is real, what is fake?

http://www.youtube.com/watch?v=hMpAoC-h5SA

Question Everything Online: www.dhmo.org

Chemistry is Dangerous!

http://tinyurl.com/cl2awnj

Chemistry is Dangerous

Florida DJs May Face Felony for April Fools' Water Joke

“… told their listeners that "dihydrogen monoxide" was coming out of the taps

throughout the Fort Myers area.”

How do you recognize good vs bad?

Is this real?

Junk vs Real

• “We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding”

• “..identified as the new protein Wai So Dim”

What is real, what is fake?

Helping to change science

• Participation and contribution

• Immediacy of action

• Platforms for contribution

• Openness…whatever that is

Getting Called Out in Public…Rules for Licensing Data

Challenged in the Twittersphere

Annotating Articles Today…

Attribution to me…

Back to Chemistry….

• ANYBODY can annotate a record on ChemSpider• Registered users can deposit new data

• Compounds• Reactions• Links• Analytical Data• Images• Movies,• Data sets

• Registered users can validate existing data

CURATION Search “Vitamin H”

“Curate” Identifiers

“Curate” Identifiers

Data Validation is Exacting Work

It is so difficult to navigate…

What’s the structure?What’s the structure?

Are they in our file?

Are they in our file?

What’s similar?What’s

similar?

What’s the target?

What’s the target?Pharmacology

data?Pharmacology

data?

Known Pathways?

Known Pathways?

Working On Now?

Working On Now?Connections

to disease?Connections to disease?

Expressed in right cell type?Expressed in

right cell type?

Competitors?Competitors?

IP?IP?

• Open PHACTS Project• Develop a set of robust standards• Integrate Chemistry and Biology data by implementing the

standards in a semantic integration hub• Deliver services to support drug discovery programs in

pharma and public domain• INITIALLY 22 partners, 8 pharmaceutical companies, 3

biotechs

Guiding principle is open access, open usage, open source- Key to standards adoption -

Guiding principle is open access, open usage, open source- Key to standards adoption -

ChemSpider serving RDF and the semantic web

• Using RDF permalinks• http://www.chemspider.com/Chemical-Structure.7787

.rdf

• Using a Search Term• http://www.chemspider.com/rdf.ashx?q=cyclohexane• http://rdf.chemspider.com/cyclohexane

RDF and the semantic web

The issue of identifiers

Aspirin on ChemSpider

PubChemDrugbankChemSpider

Imatinib

Mesylate

What is Gleevec

Strict Relaxed

Analysing Browsing

LinkSet#1 { chemspider:gleevec hasParent imatinib ... drugbank:gleevec exactMatch imatinib ...}

chemspider:gleevecchemspider:gleevec drugbank:gleevecdrugbank:gleevec

Dynamic Equality

CVSP : chemical validation

Free chemistry validation platform that performs:•Structure validation

• Atoms• Bonds• Valence• Stereo• If aromatic - check that uniquely dearomatized• Strongest acid not ionized first in partially-ionized system

•Cross-matching of SDF fields• synonyms• InChIs• Smiles

But what about Data Quality???

Validation

• Custom processing let’s user to put together workflow from pre-defined standardization modules list

Standardization

Data Review

DrugBank dataset (6516 records)

~60 records that can’t be dearomatized unambiguously

DB04283 DB04462

DrugBank

~30 records with bonds that do not make sense

DB04283

DDB04009

Nonsensical bonds

~40 records where InChIs did not match the structure

DrugBank ID: DB00755InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+

DruGBank ID: DB00614

Mismatches – which is correct?

DB08128

J. Brechner, IUPACGraphical Representation of stereochem. configurationsSection: ST-1.1.10

DB06287

7 records with 2 stereo bonds at chiral atoms

Stereobond issues

Chemistry Validation and Standardization Platform (CVSP)

at cvsp.chemspider.com•Validation•Standardization•Parent generation

RDF Export

Data

Chemistry Data out to OPS

Data is being imported from ChemSpider to

Open PHACTS in RDF/turtle

– VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets.

• skos:exactMatch (Simple Knowledge Organisation System)e.g. To link compounds in OPS with compounds in ChEBI.• skos:closeMatch e.g. To link Stereo Insensitive Parents to their Children within OPS.• skos:relatedMatch e.g. To link Parent compounds that contain others as Fragments.

– Recommendations on VoID specified by Manchester Uni. here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/

RDF/VOID

RDF Export

Ongoing Updates to Linksets

• Ongoing data checking activities on ChemSpider

• Does Crowdsourcing work?

• But WHY would people validate ChemSpider?

• What can be done to encourage participation?

• How will we deliver on Barend’s Million Minds approach – he has ideas…see later

Crowdsourcing – does it work?

• >200 people EVER have deposited or curated data

• Database hosts make the largest contributions

• ChemSpider staff do the most curation

Contributions

Curation Activities

• 2009 – 8255 curations by 43 people

• 2010 – 10014 curations by 66 people

• 2011 – 16025 curations by 116 people

• 2012 – 13127 curations by 74 people

Depositors

• 2009       91 unique depositors

• 2010       120 unique depositors

• 2011       99 unique depositors

• 2012       120 unique depositors

• “The crowd is small – very small”

Rewards and Recognition

• The badgesonomy culture of recognition is growing.

• Badges are commonplace• FourSquare • Klout

Rewards and Recognition

• Rewards and Recognition integrated across all ChemSpider related projects

• Including paths to expose such recognition on AltMetrics platforms

• MAY work for young scientists

The Alt-Metrics Manifesto

http://altmetrics.org/manifesto/

ImpactStory

Future Recognition on

Scientists AltMetrics

Usage, Citations, Social Media, etc

Detailed Usage Statistics

What makes a Scientist Notable?What will it be in the future???

We Are All Being Quantified…

How I am Quantified…

Likely Enabled by

• Persistent unique digital identifier

• Integrates to workflows such as manuscript and grant submission

• Supports automated linkages with your professional activities

Summary

• A grand vision for semantic data linking is coming to fruition. We have a lot do.

• Data quality enhancements can come through crowdsourcing (and intelligent robots)

• Participation may be driven by new approaches to rewards and recognition

• Publishers are understanding the value of data, but still guard it, and generally don’t have systems to handle it. It’s changing.

• There is so much value in the historical data

Inside our Publication Archive

• How much data is in RSC archive, in the publications and in the supplementary info?

• How many compounds for ChemSpider?• How many syntheses for ChemSpider

reactions?• How many characterization

measurements?• Property Data• Spectral Data

• Graphs and charts to be used for modeling?

What if we could capture it all?Digitally Enhancing the RSC Archive

Start with data in publications

Turn “Figures” Into Data

Spectral FIGURE

Extracted Spectrum

Text Mining

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .

The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue

ChemSpider Reactions

Thank YouEmail: williamsa@rsc.org

Twitter: ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams