Strategies towards improving the utility of scientific big data Evan Bolton, PhD National Center for...

Strategies towards improving the utility of scientific big data

Evan Bolton, PhD
National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)
National Institutes of Health (NIH)

Sep. 4, 2014

http://www.nlm.nih.gov/

U.S. National Center for Biotechnology Information

https://www.ncbi.nlm.nih.gov/

https://pubchem.ncbi.nlm.nih.gov/

PubChem website


PubChem primary goal

… to be an on-line resource providing comprehensive information on the biological activities of substances,

where “substance” means any biologically testable entity:

small molecules, RNAs, carbohydrates, peptides, plant extracts, etc.

PubChem data growth over ten years

Contributors, Chemicals, Biological Assays, Bioactivity Results, Tested Chemicals, Protein Targets

+280 substance contributors, +60 assay contributors, +150M substances, +50M compounds, +1.0M bioassays, +6.1T protein targets, +2.9M tested substances, +2.0M tested compounds, +225M bioactivity result sets

[M=millions, T=thousands, MLP = Molecular Libraries Program]

CAVEAT! All data has “errors”

Big data has “big errors”

Hypothetical

If your average data error rate is 1 in 1,000,000, you have 99.9999% data accuracy

If you have one trillion facts (10^12), can you accept one million errors (10^6)?
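The arithmetic behind this hypothetical can be checked directly. The short sketch below uses only the two numbers stated on the slide (an error rate of one in a million and one trillion facts); the variable names are illustrative.

```python
from fractions import Fraction

# Numbers taken from the hypothetical above.
error_rate = Fraction(1, 1_000_000)   # one error per million facts
n_facts = 10**12                      # one trillion facts

# Accuracy is the complement of the error rate, expressed as a percentage.
accuracy_percent = float((1 - error_rate) * 100)

# Expected number of erroneous facts at that rate.
expected_errors = int(error_rate * n_facts)

print(f"accuracy: {accuracy_percent:.4f}%")
print(f"expected errors: {expected_errors:,}")
```

Exact fractions are used so that the result does not depend on floating-point rounding: an error rate of 10^-6 applied to 10^12 facts leaves 10^6 expected errors.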

Strategies to mitigate errors?

Manual curation has its limits (accuracy, cost, time)

So … what do you do?

Error suppression strategies for scientific big data

1. Identify quality {un}known known/unknowns: use to formulate an error suppression strategy

2. Perform data normalization: improves utility by helping to refine identification

3. “Trust but verify”: cross-compare authoritative and curated data

4. Consistency filtering: improves precision by removal of outliers

5. Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists



1. Identify quality {un}known known/unknowns: use to formulate an error suppression strategy

“There are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.”

Donald Rumsfeld, Feb. 2002 news briefing

Image credit: http://en.wikipedia.org/wiki/Donald_Rumsfeld

Tautomers and resonance forms of the same chemical structure are prolific

(+)-Iridodial: defense chemicals from abdominal glands of 13 rove beetle species of subtribe Staphylinina

Ring closed / Ring open

Salt-form drawing variations are common

Chemical meaning of a substance may change with context

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=85086529&loc=es_rss

























2. Perform data normalization: improves utility by helping to refine identification

• Verify chemical content
  – Atoms defined/real
  – Implicit hydrogen
  – Functional group
  – Atom valence sanity

• Normalize representation
  – Tautomer invariance
  – Aromaticity detection
  – Stereochemistry
  – Explicit hydrogen

• Calculate
  – Coordinates
  – Properties
  – Descriptors

• Detect components
  – Isolate covalent units
  – Neutralize (+/- proton)
  – Reprocess
  – Detect unique
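As a small illustration of the "detect components" step only, the sketch below splits a SMILES string into its disconnected (non-covalently bonded) fragments and keeps the distinct ones. It is not PubChem's pipeline: the function names are hypothetical, the input is a made-up example, and comparison is by exact string, so a real workflow would first canonicalize each fragment with a cheminformatics toolkit.

```python
def split_components(smiles: str) -> list:
    """Split a SMILES string into disconnected components.

    In SMILES notation, '.' separates fragments that are not
    covalently bonded to each other (for example, a salt's ions).
    """
    return [part for part in smiles.split(".") if part]


def unique_components(smiles: str) -> list:
    """Return the distinct components in first-seen order.

    Comparison is by exact string, so two different drawings of the
    same fragment are NOT merged here; canonicalization would be
    needed for that.
    """
    seen = set()
    unique = []
    for part in split_components(smiles):
        if part not in seen:
            seen.add(part)
            unique.append(part)
    return unique


# Hypothetical input: calcium acetate drawn as two acetate ions and one calcium ion.
salt = "CC(=O)[O-].CC(=O)[O-].[Ca+2]"
components = unique_components(salt)
print(components)
```

Isolating and de-duplicating the covalent units first means that later steps, such as neutralization or descriptor calculation, only need to run once per distinct fragment.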



3. “Trust but verify”: cross-compare authoritative and curated data

Доверяй, но проверяй (doveryai, no proveryai; “Trust, but verify”)

Russian proverb used extensively by Ronald Reagan when discussing relations with the Soviet Union

Image credit: http://en.wikipedia.org/wiki/Ronald_Reagan

or John Kerry’s more recent adaptation of the phrase when discussing Syria’s chemical weapons disposal: “Verify and verify”

Image credit: http://en.wikipedia.org/wiki/John_Kerry

Cross concept count (%)

       CTD    HDO    KEG    MED    NDF    ORD
CTD  100.0   14.3   79.1   40.7   49.7   35.8
HDO   26.0  100.0   38.7   52.4   48.3   26.2
KEG   24.8    6.7  100.0   10.7    6.4   25.2
MED   97.2   68.9   81.6  100.0   93.8   79.6
NDF   30.4   16.3   12.5   24.0  100.0   10.8
ORD   31.9   12.8   71.6   29.7   15.7  100.0

Cross-reference overlaps between various disease resources: Human Disease Ontology (HDO), NCBI MedGen (MED), CTD MEDIC (CTD), KEGG Disease (KEG), NDF-RT (NDF), and OrphaNet (ORD) using NLM Medical Subject Headings (MeSH) as the basis of comparison.
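One plausible way to build such an overlap table is to map each resource to the set of shared identifiers it references and compare the sets pairwise. The sketch below assumes that approach; the resource names and identifiers are made up, and the choice of denominator (the first resource of each pair) is an assumption, since the slide does not say whether rows or columns are the base.

```python
def overlap_percent(xrefs):
    """For each ordered pair (a, b), the % of a's concepts also found in b."""
    table = {}
    for a, ids_a in xrefs.items():
        table[a] = {}
        for b, ids_b in xrefs.items():
            shared = len(ids_a & ids_b)
            # Guard against an empty resource to avoid dividing by zero.
            table[a][b] = round(100.0 * shared / len(ids_a), 1) if ids_a else 0.0
    return table


# Made-up identifiers for three hypothetical resources, for illustration only.
xrefs = {
    "A": {"D001", "D002", "D003", "D004"},
    "B": {"D002", "D003"},
    "C": {"D003", "D005"},
}

table = overlap_percent(xrefs)
for name, row in table.items():
    print(name, row)
```

The result is asymmetric by construction: a small resource can be fully covered by a large one while covering only a fraction of it in return, which is the same pattern the percentages above show.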



Keep consensus, remove the rest

Image credit: http://withfriendship.com/images/c/11229/Accuracy-and-precision-picture.png

[Figure: Histogram of Fate of CID-MNID Pairs. Categories: Original, Total, Added, Removed, Same. Series: Many votes, 70%; Many votes, 60%; One vote, 70%; One vote, 60%. Count axis ticks up to 120,000.]

[Figure: Histogram of MNIDs per CID. Category axis: 1 to 6. Count axis ticks at powers of ten from 1 to 1,000,000. Series: Original; Many votes, 70%; Many votes, 60%; One vote, 70%; One vote, 60%.]
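The "keep consensus, remove the rest" idea can be sketched as threshold voting over record-to-concept pairs. The slides do not define the "many votes" and "one vote" schemes, so the version below is only an assumption: every assertion counts as one vote, and a pair is kept when it receives at least a given share of the votes cast for that record. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_pairs(assertions, threshold=0.7):
    """Keep (record, concept) pairs backed by at least `threshold`
    of the votes cast for that record.

    `assertions` is an iterable of (source, record_id, concept_id)
    tuples; each tuple counts as one vote.
    """
    votes = defaultdict(Counter)
    for _source, record_id, concept_id in assertions:
        votes[record_id][concept_id] += 1

    kept = set()
    for record_id, counter in votes.items():
        total = sum(counter.values())
        for concept_id, n in counter.items():
            if n / total >= threshold:
                kept.add((record_id, concept_id))
    return kept


# Made-up assertions: record 1 has a clear majority, record 2 is split.
assertions = [
    ("src1", 1, "X"), ("src2", 1, "X"), ("src3", 1, "X"), ("src4", 1, "Y"),
    ("src1", 2, "Z"), ("src2", 2, "W"),
]

kept_70 = consensus_pairs(assertions, threshold=0.7)
kept_50 = consensus_pairs(assertions, threshold=0.5)
print(sorted(kept_70))
print(sorted(kept_50))
```

A higher threshold removes more outliers but also drops records where the sources are evenly split, which is the precision trade-off the consistency filtering step accepts.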



5. Address error feedback loops: use “is”, “can be”, and, if all else fails, “is not” lists

Prevent error proliferation at the data source, when possible




Okay … now what?

… you have cleaned up your data … but it is huge, unwieldy, unstructured

How can it be made more useful?

Data organization strategies for scientific big data

1. Crosslink and annotate data: provides context and identifies associated concepts

2. Establish similarity schemes: enables identification of related records

3. Associate to concept hierarchies: improves navigation between related records

4. Perform data reduction: suppresses “redundant” information

5. Be succinct: simplifies presentation by hiding details



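1. Crosslink and annotate data: provides context and identifies associated concepts

[Figure: crosslink diagram relating Compound, Substance, Protein, Gene, Drug, Publication, Patent, Disease, and Pathway records, with edges labeled cites, inhibit, encode, ingredient, treat, associates, and participates]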



Vioxx

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=12560











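3. Associate to concept hierarchies: improves navigation between related records

[Figure labels: Match to concept; Independent hierarchy; = Organized records. Record types shown: chemical, protein, gene, patent, publication, pathway, …]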




“subject-predicate-object”: “atorvastatin may treat hypercholesterolemia”

subject: atorvastatin; predicate: may treat; object: hypercholesterolemia

Provenance information:

Evidence citation (PMID)

From whom? (Data Source)
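A statement of this kind, together with its provenance, can be held in a small record type. The sketch below is one possible shape, not PubChem's data model: the class name, field names, and the source label are hypothetical, and the PMID is left empty because the slide does not give one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    """A subject-predicate-object assertion plus its provenance."""
    subject: str
    predicate: str
    obj: str
    source: str                  # from whom? (data source)
    pmid: Optional[int] = None   # evidence citation, if known

    def as_text(self) -> str:
        """Render the assertion as a readable sentence fragment."""
        return f"{self.subject} {self.predicate} {self.obj}"


# The example triple from the slide; the source label is a placeholder.
stmt = Statement(
    subject="atorvastatin",
    predicate="may treat",
    obj="hypercholesterolemia",
    source="ExampleSource",
)
print(stmt.as_text())
```

Keeping the data source and the evidence citation on every statement makes it possible to trace an error back to where it was introduced instead of only correcting it downstream.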






Concluding remarks

Scientific “big data” …

… contains an amazing amount of information

… provides opportunities to make discoveries

… benefits from strategies to massage it

PubChem is doing its part …

… making chemical substance data broadly accessible

… cross-integrating it to key scientific resources

… suppressing errors and their propagation

… organizing the data and making it available

https://pubchem.ncbi.nlm.nih.gov


PubChem Crew …

Steve Bryant

Tiejun Chen

Gang Fu

Lewis Geer

Renata Geer

Asta Gindulyte

Volker Hahnke

Lianyi Han

Jane He

Siqian He

Sunghwan Kim

Ben Shoemaker

Paul Thiessen

Jiyao Wang

Yanli Wang

Bo Yu

Jian Zhang

Special thanks to the NCBI Help Desk, especially Rana Morris

Any questions?

If you think of one later, email me:

[email protected]
