Strategies towards improving the utility of scientific big data
Evan Bolton, PhD
National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)
National Institutes of Health (NIH)
Sep. 4, 2014
http://www.nlm.nih.gov/
U.S. National Center for Biotechnology Information
https://www.ncbi.nlm.nih.gov/
PubChem primary goal
… to be an on-line resource providing comprehensive information on the biological activities of substances
where “substance” means any biologically testable entity
Small molecules, RNAs, carbohydrates, peptides, plant extracts, etc.
PubChem data growth over ten years
[Figure panels: Contributors, Chemicals, Biological Assays, Bioactivity Results, Tested Chemicals, Protein Targets]
+280 substance contributors, +60 assay contributors, +150M substances, +50M compounds, +1.0M bioassays, +6.1T protein targets, +2.9M tested substances, +2.0M tested compounds, +225M bioactivity result sets
[M=millions, T=thousands, MLP = Molecular Libraries Program]
CAVEAT! All data has “errors”
Big data has “big errors”
Hypothetical
If your average data error rate is 1 in 1,000,000, you have 99.9999% data accuracy
If you have one trillion facts (10^12), can you accept one million errors (10^6)?
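The arithmetic behind this hypothetical can be checked in a few lines. This is only a sketch using exact fractions; the function names are illustrative and not part of any PubChem tooling.

```python
from fractions import Fraction

def expected_errors(n_facts, error_rate):
    """Expected number of wrong facts at a given average error rate."""
    return n_facts * error_rate

def accuracy_percent(error_rate):
    """Share of facts expected to be correct, as a percentage."""
    return 100 * (1 - error_rate)

rate = Fraction(1, 1_000_000)           # 1 error per million facts
print(float(accuracy_percent(rate)))     # 99.9999
print(expected_errors(10**12, rate))     # 1000000
```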
Strategies to mitigate errors?
Manual curation has its limits (accuracy, cost, time)
So … what do you do?
Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns
   use to formulate an error suppression strategy
2. Perform data normalization
   improves utility by helping to refine identification
3. “Trust but verify”
   cross compare authoritative and curated data
4. Consistency filtering
   improves precision by removal of outliers
5. Address error feedback loops
   use “is”, “can be”, and, if all else fails, “is not” lists
Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns
   use to formulate an error suppression strategy
“… there are known knowns; there are things that we know that we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.”
Donald Rumsfeld, Feb. 2002 news briefing
Image credit: http://en.wikipedia.org/wiki/Donald_Rumsfeld
Tautomers and resonance forms of same chemical structure are prolific
(+)-Iridodial
Defense chemicals from abdominal glands of 13 rove beetle species of subtribe Staphylinina
[Structure drawings: ring-closed form, ring-open form]
Salt-form drawing variations are common
Chemical meaning of a substance may change with context
Error suppression strategies for scientific big data
2. Perform data normalization
   improves utility by helping to refine identification
• Verify chemical content
  – Atoms defined/real
  – Implicit hydrogen
  – Functional group
  – Atom valence sanity
• Normalize representation
  – Tautomer invariance
  – Aromaticity detection
  – Stereochemistry
  – Explicit hydrogen
• Calculate
  – Coordinates
  – Properties
  – Descriptors
• Detect components
  – Isolate covalent units
  – Neutralize (+/- proton)
  – Reprocess
  – Detect unique
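Two of the checks listed above, "atom valence sanity" and "isolate covalent units", can be illustrated on a toy molecular graph. The sketch below is not PubChem's pipeline: the valence table, the atom/bond representation, and the function names are all assumptions made for illustration, and a real system would rely on a cheminformatics toolkit.

```python
from collections import defaultdict

# Assumed maximum total bond order for a few neutral elements (illustrative only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "Cl": 1, "Na": 1}

def valence_problems(atoms, bonds):
    """Return indices of atoms that are unknown or exceed their allowed valence.

    atoms: list of element symbols; bonds: list of (i, j, bond_order).
    """
    used = defaultdict(int)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [i for i, element in enumerate(atoms)
            if element not in MAX_VALENCE or used[i] > MAX_VALENCE[element]]

def covalent_units(atoms, bonds):
    """Split the atom list into covalently connected components."""
    neighbors = defaultdict(set)
    for a, b, _ in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, units = set(), []
    for start in range(len(atoms)):
        if start in seen:
            continue
        seen.add(start)
        stack, unit = [start], []
        while stack:
            atom = stack.pop()
            unit.append(atom)
            for nxt in neighbors[atom]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        units.append(sorted(unit))
    return units

# Heavy atoms only: a C-C-O fragment drawn alongside separate Na and Cl atoms.
atoms = ["C", "C", "O", "Na", "Cl"]
bonds = [(0, 1, 1), (1, 2, 1)]
print(covalent_units(atoms, bonds))    # [[0, 1, 2], [3], [4]]
print(valence_problems(atoms, bonds))  # []
```

Once the covalent units are isolated, each one can be processed separately, which is the point of the "Reprocess" and "Detect unique" steps.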
Error suppression strategies for scientific big data
3. “Trust but verify”
   cross compare authoritative and curated data
or John Kerry’s more recent adaptation of the phrase when discussing Syria’s chemical weapons disposal:
“Verify and verify”
Image credit: http://en.wikipedia.org/wiki/John_Kerry
Доверяй, но проверяй (doveryai, no proveryai)
Russian proverb used extensively by Ronald Reagan when discussing relations with the Soviet Union
Image credit: http://en.wikipedia.org/wiki/Ronald_Reagan
Cross concept count %
        CTD    HDO    KEG    MED    NDF    ORD
CTD   100.0   14.3   79.1   40.7   49.7   35.8
HDO    26.0  100.0   38.7   52.4   48.3   26.2
KEG    24.8    6.7  100.0   10.7    6.4   25.2
MED    97.2   68.9   81.6  100.0   93.8   79.6
NDF    30.4   16.3   12.5   24.0  100.0   10.8
ORD    31.9   12.8   71.6   29.7   15.7  100.0
Cross-reference overlaps between various disease resources: Human Disease Ontology (HDO), NCBI MedGen (MED), CTD MEDIC (CTD), KEGG Disease (KEG), NDF-RT (NDF), and OrphaNet (ORD) using NLM Medical Subject Headings (MeSH) as the basis of comparison.
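A cross-comparison of this kind can be sketched with set intersections over a shared identifier space. The slide does not say whether rows or columns serve as the denominator, so the sketch below assumes the first resource of each pair is the denominator; the resource names and identifiers are made up for illustration.

```python
def overlap_percent(resources):
    """Map (a, b) to the percentage of a's concepts that also occur in b."""
    table = {}
    for a, concepts_a in resources.items():
        for b, concepts_b in resources.items():
            if concepts_a:
                shared = len(concepts_a & concepts_b)
                table[(a, b)] = 100.0 * shared / len(concepts_a)
            else:
                table[(a, b)] = 0.0
    return table

# Made-up MeSH-style identifiers, for illustration only.
resources = {
    "ResourceA": {"D001", "D002", "D003", "D004"},
    "ResourceB": {"D001", "D002"},
}
table = overlap_percent(resources)
print(table[("ResourceA", "ResourceB")])  # 50.0
print(table[("ResourceB", "ResourceA")])  # 100.0
```

The result is asymmetric by construction, which matches the asymmetric values in the table: a small resource can be almost fully covered by a large one without the reverse being true.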
Error suppression strategies for scientific big data
4. Consistency filtering
   improves precision by removal of outliers
Keep consensus, remove the rest
Image credit: http://withfriendship.com/images/c/11229/Accuracy-and-precision-picture.png
[Figure: Histogram of Fate of CID-MNID Pairs. Categories: Original, Total, Added, Removed, Same. Series: Many votes, 70%; Many votes, 60%; One Vote, 70%; One Vote, 60%. Count axis from 0 to 120,000.]
[Figure: Histogram of MNIDs per CID. Horizontal axis: 1 to 6. Count axis ticks from 1 to 1,000,000 in powers of ten. Series: Original; Many votes, 70%; Many votes, 60%; One Vote, 70%; One Vote, 60%.]
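"Keep consensus, remove the rest" can be sketched as a voting filter. The slides do not define the "many votes" and "one vote" schemes, so the sketch below makes its own assumption: each source casts at most one vote per name-identifier pair, and a pair survives only if its share of the votes for that name meets a threshold such as 60% or 70%. All names and identifiers are illustrative.

```python
from collections import Counter, defaultdict

def consensus_filter(assertions, threshold=0.6):
    """Keep (name, cid) pairs whose vote share for that name meets the threshold.

    assertions: iterable of (source, name, cid) tuples.
    """
    votes = defaultdict(Counter)
    seen = set()
    for source, name, cid in assertions:
        if (source, name, cid) in seen:
            continue  # assumed rule: one vote per source per pair
        seen.add((source, name, cid))
        votes[name][cid] += 1
    kept = set()
    for name, counter in votes.items():
        total = sum(counter.values())
        cid, count = counter.most_common(1)[0]
        if count / total >= threshold:
            kept.add((name, cid))
    return kept

# Made-up sources and identifiers, for illustration only.
assertions = [
    ("source-1", "name-a", 101), ("source-2", "name-a", 101),
    ("source-3", "name-a", 101), ("source-4", "name-a", 999),
    ("source-1", "name-b", 201), ("source-2", "name-b", 202),
]
print(consensus_filter(assertions, threshold=0.7))  # {('name-a', 101)}
```

Raising the threshold trades recall for precision: more pairs are removed, but those that remain have stronger agreement behind them.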
Error suppression strategies for scientific big data
5. Address error feedback loops
   use “is”, “can be”, and, if all else fails, “is not” lists
Prevent error proliferation at the data source, when possible
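The slides do not spell out how the three lists are applied. One possible reading, offered only as an assumption, is that an "is" list pins a name to a single record, a "can be" list allows several acceptable records, and an "is not" list blocks known-bad pairings before they propagate. The function and data below are hypothetical.

```python
def check_assignment(name, candidate, is_list, can_be_list, is_not_list):
    """Classify a proposed name-to-record assignment as accept, reject, or review."""
    if (name, candidate) in is_not_list:
        return "reject"          # explicitly known to be wrong
    if name in is_list:
        # An authoritative assignment exists; anything else is rejected.
        return "accept" if is_list[name] == candidate else "reject"
    if candidate in can_be_list.get(name, set()):
        return "accept"          # one of several acceptable meanings
    return "review"              # nothing known; leave for a curator

# Made-up names and record identifiers, for illustration only.
is_list = {"name-a": 101}
can_be_list = {"name-b": {201, 202}}
is_not_list = {("name-c", 301)}

print(check_assignment("name-a", 101, is_list, can_be_list, is_not_list))  # accept
print(check_assignment("name-c", 301, is_list, can_be_list, is_not_list))  # reject
```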
Error suppression strategies for scientific big data
1. Identify quality {un}known known/unknowns
   use to formulate an error suppression strategy
2. Perform data normalization
   improves utility by helping to refine identification
3. “Trust but verify”
   cross compare authoritative and curated data
4. Consistency filtering
   improves precision by removal of outliers
5. Address error feedback loops
   use “is”, “can be”, and, if all else fails, “is not” lists
Okay … now what?
… you have cleaned up your data … but it is huge, unwieldy, unstructured
How can it be made more useful?
Data organization strategies for scientific big data
1. Crosslink and annotate data
   provides context and identifies associated concepts
2. Establish similarity schemes
   enables identification of related records
3. Associate to concept hierarchies
   improves navigation between related records
4. Perform data reduction
   suppresses “redundant” information
5. Be succinct
   simplifies presentation by hiding details
Data organization strategies for scientific big data
1. Crosslink and annotate data
   provides context and identifies associated concepts
[Diagram: record types Compound, Substance, Protein, Gene, Drug, Publication, Patent, Disease, and Pathway, connected by links labeled cites, inhibit, encode, ingredient, treat, associates, and participates.]
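Crosslinks like these can be held in a small typed link index. The sketch below is an illustration only: the class, the record identifiers, and the choice of which record types each relation connects are assumptions, since the transcript preserves the labels but not the endpoints of each link.

```python
from collections import defaultdict

class LinkIndex:
    """Stores typed links between records and answers neighbor queries."""

    def __init__(self):
        self._out = defaultdict(set)

    def add(self, source, relation, target):
        self._out[source].add((relation, target))

    def neighbors(self, source, relation=None):
        """Targets linked from source, optionally restricted to one relation."""
        links = self._out.get(source, set())
        return sorted(t for r, t in links if relation is None or r == relation)

# Hypothetical records and links, for illustration only.
index = LinkIndex()
index.add("compound:X", "inhibit", "protein:P")
index.add("gene:G", "encode", "protein:P")
index.add("publication:1", "cites", "compound:X")
index.add("compound:X", "treat", "disease:D")

print(index.neighbors("compound:X"))           # ['disease:D', 'protein:P']
print(index.neighbors("compound:X", "treat"))  # ['disease:D']
```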
Data organization strategies for scientific big data
2. Establish similarity schemes
   enables identification of related records
Vioxx
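The slides do not name a particular similarity measure. One common choice for chemical structures is the Tanimoto coefficient over fingerprint bits, sketched below under that assumption; the fingerprints are made-up bit sets, not real structure fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared bits divided by all bits set in either."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

def most_similar(query, library, minimum=0.5):
    """Records in library at or above the minimum similarity, best first."""
    scored = [(tanimoto(query, bits), name) for name, bits in library.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= minimum]

# Made-up fingerprints, for illustration only.
library = {"record-1": {1, 2, 3, 4}, "record-2": {2, 3, 9}, "record-3": {7, 8}}
print(tanimoto({1, 2, 3}, {2, 3, 4}))      # 0.5
print(most_similar({1, 2, 3}, library))    # ['record-1', 'record-2']
```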
Data organization strategies for scientific big data
3. Associate to concept hierarchies
   improves navigation between related records
[Diagram: records (chemical, protein, gene, patent, publication, pathway, …) are matched to concepts in an independent hierarchy, yielding organized records.]
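Matching records to concepts in an independent hierarchy makes it possible to collect everything filed under a broad concept. The sketch below assumes a simple parent-to-children mapping; the toy hierarchy, the record names, and the function names are illustrative only.

```python
def descendants(children, root):
    """Return root plus every concept reachable below it."""
    found, stack = set(), [root]
    while stack:
        concept = stack.pop()
        if concept not in found:
            found.add(concept)
            stack.extend(children.get(concept, ()))
    return found

def records_under(concept, children, record_concepts):
    """Records matched to the concept or to any concept beneath it."""
    wanted = descendants(children, concept)
    return sorted(r for r, c in record_concepts.items() if c in wanted)

# Toy hierarchy and record-to-concept matches, for illustration only.
children = {
    "disease": ["metabolic disease", "infection"],
    "metabolic disease": ["hypercholesterolemia"],
}
record_concepts = {
    "record-1": "hypercholesterolemia",
    "record-2": "metabolic disease",
    "record-3": "infection",
}
print(records_under("metabolic disease", children, record_concepts))
# ['record-1', 'record-2']
```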
Data organization strategies for scientific big data
4. Perform data reduction
   suppresses “redundant” information
5. Be succinct
   simplifies presentation by hiding details
“subject-predicate-object”: “atorvastatin may treat hypercholesterolemia”
subject: atorvastatin; predicate: may treat; object: hypercholesterolemia
Provenance information: evidence citation (PMID) and data source (from whom?)
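Data reduction and succinct presentation can be combined: repeated statements collapse into one subject-predicate-object triple, and the provenance is kept behind it rather than shown up front. The sketch below is one way to do that; the source names and citation identifiers are placeholders, not real data sources or PMIDs.

```python
from collections import defaultdict

def reduce_statements(rows):
    """Merge duplicate triples, collecting their sources and citations.

    rows: iterable of (subject, predicate, obj, source, pmid).
    """
    merged = defaultdict(lambda: {"sources": set(), "pmids": set()})
    for subject, predicate, obj, source, pmid in rows:
        entry = merged[(subject, predicate, obj)]
        entry["sources"].add(source)
        entry["pmids"].add(pmid)
    return dict(merged)

def succinct(triple):
    """Short display form that hides the provenance details."""
    return " ".join(triple)

# Placeholder sources and citation identifiers, for illustration only.
rows = [
    ("atorvastatin", "may treat", "hypercholesterolemia", "source-1", "pmid-placeholder-1"),
    ("atorvastatin", "may treat", "hypercholesterolemia", "source-2", "pmid-placeholder-2"),
]
merged = reduce_statements(rows)
for triple, provenance in merged.items():
    print(succinct(triple), "|", len(provenance["sources"]), "sources")
# atorvastatin may treat hypercholesterolemia | 2 sources
```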
Data organization strategies for scientific big data
1. Crosslink and annotate data
   provides context and identifies associated concepts
2. Establish similarity schemes
   enables identification of related records
3. Associate to concept hierarchies
   improves navigation between related records
4. Perform data reduction
   suppresses “redundant” information
5. Be succinct
   simplifies presentation by hiding details
Concluding remarks
Scientific “big data” …
… contains an amazing amount of information
… provides opportunities to make discoveries
… benefits from strategies to massage it
PubChem is doing its part …
… making chemical substance data broadly accessible
… cross-integrating it to key scientific resources
… suppressing errors and their propagation
… organizing the data and making it available
https://pubchem.ncbi.nlm.nih.gov
PubChem Crew …
Steve Bryant
Tiejun Chen
Gang Fu
Lewis Geer
Renata Geer
Asta Gindulyte
Volker Hahnke
Lianyi Han
Jane He
Siqian He
Sunghwan Kim
Ben Shoemaker
Paul Thiessen
Jiyao Wang
Yanli Wang
Bo Yu
Jian Zhang
Special thanks to the NCBI Help Desk, especially Rana Morris