ChemSpider – disseminating data and enabling an abundance of chemistry platforms

Antony Williams, Valery Tkachenko, Ken Karapetyan, Alexey Pshenichnov, Dmitry Ivanov, Colin Batchelor, Jon Steele and David Sharpe

ACS New Orleans, April 2013


ChemSpider

• >28.5 million unique chemicals from >400 data sources

• Focus on improving data quality, enhancing functionality, integrating and enabling

Some usage statistics

• ca. 200 visitors at any one time, ~30,000 visits per day
• Mar 4-Apr 3, 2013
– Visits = 731,656
– Unique Visitors = 527,008

• Independent servers to support other projects
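As a rough cross-check of the usage figures above, the period totals can be turned into per-day and per-visitor averages. This is a minimal sketch, assuming the Mar 4-Apr 3 window includes both end dates; the variable names are illustrative only.

```python
from datetime import date

# Figures quoted on the slide for Mar 4-Apr 3, 2013.
visits = 731_656
unique_visitors = 527_008

# Assumption: the reporting window includes both end dates.
days = (date(2013, 4, 3) - date(2013, 3, 4)).days + 1

visits_per_day = visits / days
visits_per_visitor = visits / unique_visitors

print(f"{days} days, {visits_per_day:,.0f} visits/day on average")
print(f"{visits_per_visitor:.2f} visits per unique visitor")
```

The period average need not match the ~30,000 visits per day quoted above, which may describe a typical busy day rather than a monthly mean; that reading is an assumption.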

Access ChemSpider

• APIs
– Programmatic access used by mobile apps, funded consortia projects, many academic groups

• Widgets
– UI components for embedding in other websites

• Data
– Data access, downloads, reuse, licensing

Supporting the Semantic Web

rdf.chemspider.com/CSID
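The address pattern above puts the ChemSpider ID (CSID) at the end of the URL. The following is a minimal sketch of building and parsing such addresses; the `http` scheme, the helper names `rdf_url` and `csid_from_url`, and the example ID are assumptions for illustration and are not taken from the slides.

```python
from urllib.parse import urlsplit

# Assumption: the pattern is rdf.chemspider.com/<CSID>, served over HTTP.
RDF_BASE = "http://rdf.chemspider.com"


def rdf_url(csid: int) -> str:
    """Build the RDF address for a ChemSpider ID (hypothetical helper)."""
    if csid <= 0:
        raise ValueError("CSID must be a positive integer")
    return f"{RDF_BASE}/{csid}"


def csid_from_url(url: str) -> int:
    """Recover the CSID from an address built by rdf_url."""
    return int(urlsplit(url).path.strip("/"))


url = rdf_url(784)  # 784 is only an illustrative ID
print(url, csid_from_url(url))
```

The sketch only constructs the address; fetching and parsing the returned RDF is left out because the response format is not described here.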

ChemSpider Resources for Chemistry

From this… to this

Simplified interface

ChemSpider Audiences

Substance Pages

It is so difficult to navigate…

• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?

• 3-year knowledge management IMI project

• Integrating chemistry and biology data and delivering using semantic web technologies

• Open source code, open data and open standards

• Academics, Pharma companies, Publishers….

ChemSpider Contributions

• The host of the chemistry services
– Supplier of “standardized” chemical data files
– Chemistry searching (structure, substructure etc.)
– Provider of data in RDF format
– Curator and data quality checking

• Now building the Open PHACTS chemical registration system

ChemSpider Contributions

• Supplier of chemistry UI components
• “Quality Police” for data checking
• Chemical Validation and Standardization Platform
• Nanopublications from RSC publications

• FP7 Initiative. PharmaSea: increasing value and flow in the marine biodiscovery pipeline

PharmaSea

• Dereplication via ChemSpider
• Segregation of natural products datasets
• Analytical data algorithms & integration
– Mass spec searching – predicted fragmentation
– NMR feature searching – NMR prediction
– Computer-assisted structure elucidation
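Dereplication by mass can be illustrated with a small lookup: compute the monoisotopic mass of each known compound and report those within a ppm tolerance of an observed mass. This is a sketch under stated assumptions and not the ChemSpider implementation; the `KNOWN` table is a hypothetical stand-in for a compound database, and the function names and the 10 ppm default are illustrative.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def monoisotopic_mass(formula: str) -> float:
    """Sum element masses for a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * int(count or 1)
    return total


# Hypothetical stand-in for a database of known compounds.
KNOWN = {
    "caffeine": "C8H10N4O2",
    "theophylline": "C7H8N4O2",
    "theobromine": "C7H8N4O2",
    "aspirin": "C9H8O4",
}


def dereplicate(observed_mass: float, tolerance_ppm: float = 10.0) -> list[str]:
    """Return names of known compounds whose mass is within the tolerance."""
    hits = []
    for name, formula in KNOWN.items():
        delta = abs(monoisotopic_mass(formula) - observed_mass)
        if delta / observed_mass * 1e6 <= tolerance_ppm:
            hits.append(name)
    return sorted(hits)


print(dereplicate(180.0647))
```

Theophylline and theobromine share a molecular formula, so a mass match alone returns both. Separating such isomers is where fragmentation or NMR data would come in.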

Integrate to instruments and software

• Integration to analytical instrumentation vendors already in place – Agilent, Bruker, Thermo, Waters

• Also, cheminformatics vendors link to ChemSpider – Accelrys, ACD/Labs, ChemAxon, iChemLabs, and…

Natural Products Updates

• Names hard, Structures “Obvious”

• New content based on monthly updates of the database

• Click through to the Natural Products Updates entry

National Chemical Database Service

• National Chemical Database Service for UK Academics

• Integrating Commercial Databases and Services

• Chemicals, analytical data, prediction algorithms

• Development of data repository

Retrosynthetic Analysis

Publications - a summary of work

• Scientific publications are a summary of work
– Is all work reported?
– How much science is lost to pruning?
– What of value sits in notebooks and is lost?

• How much data is lost?
– How many compounds never reported?
– How many syntheses fail or succeed?
– How many characterization measurements?

Community Repository for Data

• Funding agencies encourage sharing of data
• Increasing availability of “Open Data”
• Institutional repositories offer no specific domain support
• Develop a community repository for chemistry data – private, public, embargoed
• Provides data to develop models/algorithms

Community Repository for Data

• Automated depositions of data
• DOI’ed data objects for citation purposes
• A database of reference data, but validated by the community
• National services feeding the repository – crystallography, mass spectrometry
• Integrate to blogging tools for chemistry
• Integrate to Electronic Lab Notebooks as feeds

Model Building with Community Data

• Community data as a basis of model building
– Consume data from available databases, community data and new publications, and build predictive algorithms for the community
– How many algorithms are reported and lost? How much repeat work is done in the domain of algorithmic development?

Recognition on Data

IC50 Measurements for 62 substituted benzoxazoles
ChemSpider Data Repository: DOI: 10.1356/CSID784.4

Integrate to electronic lab notebooks

E-Lab Notebooks

• Previous work with IDBS and University of Cambridge

• Working on LabTrove integration with U. Southampton

• Integration between ELNs and:
– ChemSpider
– ChemSpider Reactions
– CDS Repository

• Publish data from ELNs, issue DOIs
• Data aggregated into fully indexed ESI format for publication

Support for Chemical Reactions

• Integrating mined reaction data from patents (Daniel Lowe)

• Will also incorporate and integrate: Methods of Organic Synthesis, Catalysts and Catalyzed Reactions and…

Micro-publishing Chemical Reactions

ChemSpider SyntheticPages

Retrosynthetic Analysis

Inside our Publication Archive

• How much data is in the archive, in the publications and in the supplementary info?
– How many compounds for ChemSpider?
– How many syntheses for ChemSpider Reactions?
– How many characterization measurements?
• Property Data
• Spectral Data
• Graphs and charts to be used for modeling?

What if we could capture it all?

Digitally Enhancing the RSC Archive

Start with data in publications

Recent Work

Comparison of Spectra

Data Validation and Curation Required

CVSP: Validation and Standardization

Data Validation and Curation Required

Encouraging Participation with Rewards and RECOGNITION

Manual Curation

• Integrated commenting, curating and validation platform across ALL eScience and publishing platforms

• All integrated to a central RSC profile and feeding the AltMetrics tools

Structure Review

Where we are now…

Rewards and Recognition

Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.

The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.

Future Recognition in AltMetrics?

ChemSpider

Why is ChemSpider “different”

• Interfaces for integration
• Sharing of data – and increasingly open
• Open for community participation
– Deposition
– Annotation
– Curation

• We are clear…the world is changing

Internet Data

The Future

• Commercial Software
• Pre-competitive Data
• Open Science
• Open Data
• Publishers
• Educators
• Open Databases
• Chemical Vendors

• Small organic molecules
• Undefined materials
• Organometallics
• Nanomaterials
• Polymers
• Minerals
• Particle bound
• Links to Biologicals

Acknowledgments

• The RSC eScience and infrastructure teams
• Our data providers, depositors, collaborators and curators
• Daniel Lowe for Reaction Data
• William Brouwer, Penn State
• Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)

Thank you

Email: [email protected]
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams