A Deep Survey of the Digital Resource Landscape:
Perspectives from the Neuroscience Information Framework
Maryann E. Martone, Ph.D., University of California, San Diego
• NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
The Neuroscience Information Framework
• NIF has developed a production technology platform for researchers to:
– Discover
– Share
– Analyze
– Integrate neuroscience-relevant information
• Since 2008, NIF has assembled the largest searchable catalog of neuroscience data and resources on the web
• Cost-effective and innovative strategy for managing data assets
“This unique data depository serves as a model for other Web sites to provide research data.” - Choice Reviews Online
NIF is poised to capitalize on the new tools and emphasis on big data and open science
http://neuinfo.org
June 10, 2013, dkCOIN Investigator's Retreat
The Neuroscience Information Framework: Discovery and utilization of web-based resources for neuroscience
• A portal for finding and using neuroscience resources
• A consistent framework for describing resources
• Provides simultaneous search of multiple types of information, organized by category
• Supported by an expansive ontology for neuroscience
• Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature
Database Federation
Registry
Part 1: Surveying the resource landscape
• NIF Registry: A catalog of neuroscience-relevant resources
• > 6000 currently listed
• > 2200 databases
• And we are finding more every day
How do resources get added to the NIF Registry?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry
• Requires no special skills
• Site map available for local hosting
NIF Data Federation
• DISCO interop
• Requires some programming skill
Bandrowski et al., 2012
NIF Registry
• Extended over time
– Parent resource
– Supporting agency
– Grant numbers
– Accessibility
– Related to
– Organism
– Disease or condition
– Last updated
First catalog: SFN Neuroscience Database Gateway (~2003), then NIF 0.5 (2006) and NIF 1.0+ (2008)
Simple metadata model: name, description, type, URL, other names, keywords, unique identifier
Resource Curation
• NIF Registry is hosted on the Semantic MediaWiki platform Neurolex
– Community can add, review, edit without special privileges
– Searchable by Google
– Integrated with NIF ontologies
– Graph structure
http://neurolex.org
The resource graph
NIF is creating the linked data graph of resources
Keeping the Registry Current
– NIF employs an automated link checker
– Last analysis: 478/6100 invalid URLs (~8%)
– 199 could not be located at another university or location and are out of service (~3%)
– Bigger issue: Many resources are no longer updated or maintained
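The link-checking step above can be sketched as follows. This is a minimal illustration, not NIF's actual checker: the function names and the three labels are hypothetical, and the network probe is passed in as a parameter so the tallying logic can be read and tested without live requests.

```python
from typing import Callable, Dict, Iterable, Optional


def classify(status: Optional[int]) -> str:
    """Map an HTTP status code (None means no response) to a coarse label."""
    if status is None:
        return "unreachable"
    if 200 <= status < 400:
        return "ok"
    return "invalid"


def check_links(urls: Iterable[str],
                probe: Callable[[str], Optional[int]]) -> Dict[str, str]:
    """Label every registry URL using a caller-supplied probe function.

    `probe` is an assumption of this sketch: any callable that returns an
    HTTP status code for a URL, or None when the host does not answer.
    """
    return {url: classify(probe(url)) for url in urls}


def invalid_fraction(results: Dict[str, str]) -> float:
    """Fraction of checked URLs whose label is anything other than 'ok'."""
    if not results:
        return 0.0
    bad = sum(1 for label in results.values() if label != "ok")
    return bad / len(results)


# The slide's own figures: 478 invalid URLs out of 6100 entries.
print(f"{478 / 6100:.1%}")  # 7.8%
```

A checker of this kind only reports whether a URL still resolves; it cannot tell whether the resource behind it is still maintained, which is the bigger issue noted above.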
[Figure: resources added and last updated, by year, 1996 to 2014]
• Automated text mining is used to look for “web page last updated” or copyright dates
– Identified for 570 resources; manual review suggested that the results
were accurate although we can’t guarantee that the date itself is
accurate
– 373 were not updated within the last 2 years (65%)
• Manual review of ~200 resources identified by 3DVC for their catalog
– 38 not updated within the past 2 years (~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
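The text-mining step for "last updated" and copyright dates could look roughly like the sketch below. The regular expressions, the 40- and 30-character search windows, and the function names are all assumptions made for illustration; the actual pipeline is not described on the slides.

```python
import re
from typing import Optional

# A four-digit year between 1900 and 2099.
YEAR = re.compile(r"\b((?:19|20)\d{2})\b")
# Text shortly after a "last updated" / "last modified" phrase.
UPDATED = re.compile(r"last\s+(?:updated|modified)(.{0,40})",
                     re.IGNORECASE | re.DOTALL)
# Text shortly after a copyright marker.
COPYRIGHT = re.compile(r"(?:©|\(c\)|copyright)(.{0,30})",
                       re.IGNORECASE | re.DOTALL)


def last_updated_year(text: str) -> Optional[int]:
    """Return the most plausible 'last updated' year found in page text.

    An explicit 'last updated' phrase is preferred; a copyright year is the
    fallback. For a range such as 2005-2010 the later year is returned.
    """
    for pattern in (UPDATED, COPYRIGHT):
        for match in pattern.finditer(text):
            years = [int(y) for y in YEAR.findall(match.group(1))]
            if years:
                return max(years)
    return None


def is_stale(year: int, as_of_year: int, window: int = 2) -> bool:
    """True when the resource was not updated within `window` years."""
    return as_of_year - year > window


print(last_updated_year("Page last updated: March 3, 2011"))  # 2011
print(is_stale(2010, as_of_year=2013))                        # True
```

As the slide cautions, a date found this way is only as reliable as the page that states it, so results of this kind still call for manual review.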
Tracking the fate of digital resources
Yuling Li, Paul Sternberg, Cal Tech
Keeping content up to date
Connectome
Tractography
Epigenetics
• New tags come into existence
• New resource types come into existence, e.g., mobile apps
• Resources add new types of content
• Change name
• Change scope
• > 7000 updates to the registry last year
It’s a challenge to keep the registry up to date; sitemaps, curation, ontologies, community review
Ontology provides a human-centric model for search and data integration
Last updated...
• Some neglected resources are still valuable
– Complete data sets
– Rare data
• Software may still be usable
• Some databases, however, may only be of historical interest
– “all metalloproteins found in PDB”
Are all databases and data sets equally valuable?
• The NIF Registry has created a linked data graph of web-accessible resources
• Maintained on a community wiki platform
• Provides data on the fluidity of the resource landscape
– New resources continue to be created and found
– Relatively few disappear altogether
– Many more grow stale, although their value may still be significant
– Maintaining up to date curation requires frequent updating
Summary
NIF Registry provides insight into the state of digital resources on the web
Part 2: Surveying the data landscape
• The NIF data federation performs deep search over the content of over 200 databases
• New databases are added at a rate of 25-40 per year
• Latest update: Open Source Brain; ingest completed in 2 hours
• Databases chosen on a variety of criteria:
– Early: testing different types of resources
– Thematic areas
– Volunteers
[Figure: Data Federation Growth. Number of federated databases and number of federated records (millions, log scale), Jun-08 to May-13]
NIF searches the largest collation of neuroscience-relevant data on the web
DISCO
Data Ingestion Architecture
Current / Planned
DISCO Dashboard Functions
• Ingest Script Manager
• Public Script Repository
• Data & Event Tracker
• Versioning System
• Curator Tool
• Data Transformer Manager
Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
DISCO Dashboard
• Management of registry resources through a single administrative dashboard
• Associated discovery pipeline
• Tools to manage data updates
• Change tracking
• Globally unique identifier creation
Luis Marenco, Rixin Wang, Perry Miller, Gordon Shepherd, Yale University
NIF data federation
NIF was designed to be populated rapidly with progressive refinement
What are the connections of the hippocampus?
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: synonyms and related concepts
Boolean queries
Data sources categorized by “data type” and level of nervous system
Common views across multiple sources
Tutorials for using the full resource when getting there from NIF
Link back to record in original source
Results are organized within a common framework
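The synonym-based query expansion shown on this slide can be illustrated with a short sketch. The synonym table here is a hand-written stand-in for an ontology lookup, and `expand_query` is a hypothetical name; how NIF builds its expanded queries internally is not shown in the slides.

```python
from typing import Dict, List

# Toy synonym table standing in for an ontology lookup (assumption).
SYNONYMS: Dict[str, List[str]] = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they search as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> str:
    """Expand a search term into a Boolean OR query over its synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(v) for v in variants)


print(expand_query("Hippocampus"))
# Hippocampus OR "Cornu Ammonis" OR "Ammon's horn"
```

A term with no entry in the table passes through unchanged, so plain string search still works for vocabulary the ontology does not yet cover.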
Connects to
Synapsed with
Synapsed by
Input region
Innervates
Axon innervates
Projects to
Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related, model; in many cases the systems are complex and difficult to learn
NIF Semantic Framework: NIFSTD ontology
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
[Figure: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular Structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
Use of Ontologies
• Controlled vocabulary for describing type of resource and content
– Database, Image, Diabetes
• Entity-mapping of database and data content
• Data integration across sources
• Search: Mixture of mapped content and string-based search
– Different parts of the infrastructure use the vocabularies in different ways
– Utilize synonyms, parents, children to refine search
– Increasing use of other relationships and logical inferencing
• Generation of semantic content (i.e., RDF, Linked Data)
NIF Concept Mapper
Aligns sources to the NIF semantic framework
Column level mapping: Reducing false positives
The scourge of neuroanatomical nomenclature: Importance of NIF semantic framework
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
– Brain Architecture Management System (rodent)
– Temporal lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal database (Human fMRI)
– Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
– Number of exact terms used in > 1 database: 42
– Number of synonym matches: 99
– Number of 1st order partonomy matches: 385
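The exact-match and synonym-match counts above come from comparing term lists across databases. A minimal sketch of that comparison follows; the function names, the normalization rule, and the toy data in the usage lines are all illustrative assumptions and are not taken from the seven databases.

```python
from collections import Counter
from typing import Dict, Set


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def shared_exact_terms(sources: Dict[str, Set[str]]) -> Set[str]:
    """Terms that appear, after normalization, in more than one source."""
    counts: Counter = Counter()
    for terms in sources.values():
        counts.update({normalize(t) for t in terms})
    return {term for term, n in counts.items() if n > 1}


def shared_concepts(sources: Dict[str, Set[str]],
                    preferred: Dict[str, str]) -> Set[str]:
    """Concepts shared by more than one source once synonyms are mapped.

    `preferred` maps a normalized synonym to its preferred label; terms
    without an entry stand for themselves.
    """
    counts: Counter = Counter()
    for terms in sources.values():
        concepts = {preferred.get(normalize(t), normalize(t)) for t in terms}
        counts.update(concepts)
    return {concept for concept, n in counts.items() if n > 1}


# Toy data, invented for illustration only.
toy_sources = {
    "db_a": {"Hippocampus", "CA1"},
    "db_b": {"hippocampus", "Ammon's horn"},
    "db_c": {"Cornu Ammonis"},
}
toy_synonyms = {"ammon's horn": "cornu ammonis"}
print(sorted(shared_exact_terms(toy_sources)))             # ['hippocampus']
print(sorted(shared_concepts(toy_sources, toy_synonyms)))  # ['cornu ammonis', 'hippocampus']
```

The gap between the two results is the point of the slide: without a shared semantic framework, only the small set of exactly matching strings lines up across sources.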
Content Annotation – Google Refine
Resource Provider Services - Linkout
What have we learned: Grabbing the long tail of small data
• NIF can be used to survey the data landscape
• Analysis of NIF shows multiple databases with similar scope and content
• Many contain partially overlapping data
• Data “flows” from one resource to the next
– Data is reinterpreted, reanalyzed or added to
• Is duplication good or bad?
What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
– E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
NIF Analytics: The Neuroscience Landscape
NIF is in a unique position to answer questions about the neuroscience landscape
Where are the data?
[Figure: data by brain region (Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain) and data source]
Vadim Astakhov, Kepler Workflow Engine
Whither neuroscience information?
∞
What is easily machine processable and accessible
What is potentially knowable
What is known: literature, images, human knowledge
Unstructured; natural language processing, entity recognition, image processing and analysis; communication
Open world meets closed world
We know a lot about some things and less about others; some of NIF’s sources are comprehensive; others are highly biased
But...NIF has > 900,000 antibodies, 250,000 model organisms, and 3 million microarray records
Diseases of nervous system
What drives discovery?
The combination of ontologies, diverse data and analytics lets us look at the current landscape in interesting ways
[Figure: nervous system disease categories (Neurodegenerative, Seizure disorders, Neoplastic disease of nervous system) in NIH Reporter and in NIF data federated sources]
Embracing duplication: Data Mash ups
• NIF queries across 3 of approximately 10 fMRI databases
• Two resources, Brede and SUMSdb, curated activation foci from the literature
• ~300 PMIDs were common between Brede and SUMSdb
• PMID serves as a unique identifier for an article
• Same information; value added
Data is additive
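Because the PMID identifies an article uniquely, records from two curated sources can be combined with a simple keyed merge. The sketch below assumes each record is a dictionary with a `pmid` field; that record shape, the function names, and the PMIDs in the usage lines are invented for illustration and do not reflect the actual Brede or SUMSdb schemas.

```python
from collections import defaultdict
from typing import Dict, List, Set


def merge_by_pmid(sources: Dict[str, List[dict]]) -> Dict[str, Dict[str, List[dict]]]:
    """Group records from several sources under their shared PMID.

    Returns {pmid: {source_name: [records...]}} so that each source's
    contribution to an article stays visible side by side.
    """
    merged: Dict[str, Dict[str, List[dict]]] = defaultdict(lambda: defaultdict(list))
    for name, records in sources.items():
        for record in records:
            merged[str(record["pmid"])][name].append(record)
    return {pmid: dict(by_source) for pmid, by_source in merged.items()}


def common_pmids(merged: Dict[str, Dict[str, List[dict]]]) -> Set[str]:
    """PMIDs that are covered by more than one source."""
    return {pmid for pmid, by_source in merged.items() if len(by_source) > 1}


# Invented records; the PMIDs and fields are placeholders.
toy = {
    "source_a": [{"pmid": 1, "foci": 4}, {"pmid": 2, "foci": 7}],
    "source_b": [{"pmid": "2", "foci": 6}, {"pmid": 3, "foci": 2}],
}
print(sorted(common_pmids(merge_by_pmid(toy))))  # ['2']
```

Keeping both sources' records under one key, instead of discarding one as a duplicate, is what lets overlapping curation add value.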
Same data: different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
• Analysis:
– 1370 statements from Gemma regarding gene expression as a function of chronic morphine
– 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
– Results for 1 gene were opposite in DRG and Gemma
– 45 did not have enough information provided in the paper to make a judgment
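Comparing the two databases requires putting their claims on a common baseline first. The sketch below shows that one step under strong assumptions: genes have already been mapped to a shared identifier (the slide notes the two sources use different ones), each claim is reduced to "up" or "down", and the function and label names are hypothetical.

```python
from typing import Dict

OPPOSITE = {"up": "down", "down": "up"}


def align_to_saline(direction: str) -> str:
    """Flip a direction reported against the chronic-morphine baseline."""
    return OPPOSITE[direction]


def compare_claims(gemma: Dict[str, str], drg: Dict[str, str]) -> Dict[str, str]:
    """Label each Gemma claim after aligning it to the DRG baseline."""
    verdicts: Dict[str, str] = {}
    for gene, direction in gemma.items():
        if gene not in drg:
            verdicts[gene] = "not comparable"
        elif align_to_saline(direction) == drg[gene]:
            verdicts[gene] = "consistent"
        else:
            verdicts[gene] = "opposite"
    return verdicts


# The slide's own figures: 617 of 1370 statements were consistent.
print(f"{617 / 1370:.0%}")  # 45%
```

If both sources stated their baseline and identifier scheme in a standard way, the flip and the identifier mapping would not have to be reconstructed by hand.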
Relatively simple standards would make life easier
Phases of NIF
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
– NIF Registry vs NIF data federation
– Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
– Effective search across semantically diverse sources
– NIFSTD ontologies
• 2009-2011: Strategy for data integration
– Unified views across common sources
– Mapping of content to NIF vocabularies
• 2011-present: Data analytics
– Uniform external data references
• 2012-present: SciCrunch: unified biomedical resource services
NIF provides a strategy and set of tools applicable to all biomedical science
Where is the Neuroscience in NIF?
• Search semantics
• Ranking
• Resources supported by NIH Blueprint Institutes are more thoroughly covered
• Data types, e.g., Brain activation foci
Building a Uniform Resource Layer
Discoverability
Accessibility
Web of Data
Data specified via simple semantics; data in a usable form; semantically-enabled search
Enhanced semantics; standardized representation; Linked Open Data (RDF)
Data resources simply described; automated data harvesting technologies; common resource registry
A production data (resource) catalog and underlying technology platform for researchers to discover, share, access, analyze, and integrate biomedical information
Community Built Uniform Resource Layer
SciCrunch
NIF
Neuroscience
MONARCH
Animal Models
Community Services
dkCOIN
Shared Resources
Undiagnosed Disease Program
Phenotype RCN
3D Virtual Cell
National Institute on Aging
One Mind for Research
BIRN
International Neuroinformatics
Coordinating Facility
Model Organism Databases
Community Outreach
DELSA
Varied
(not just a data catalog)
Each project shares resources and adds unique value to the resource layer
• 3DVC: Focus on models and simulation
• Gene Ontology: Focus on bioinformatics tools
• National Institute on Aging: Aging-related data sets
• Monarch: Phenotype-Genotype; deep semantic data integration
• One Mind for Research: Biospecimen repositories
• NeuroGateway: Computational resources
• FORCE11: Tools for next-gen publishing and e-scholarship
SciCrunch
SciCrunch is actively supporting multiple communities; multiple communities are enriching and improving SciCrunch
Customized portals and rankings
dkCOIN Ontology
SciCrunch
Shared Resources
Community database: beginning
Community database: end
Register your resource to NIF!
“How do I share my data/tool?”
“There is no database for my data”
Institutional repositories
Cloud
INCF: Global infrastructure
Government
Education
Industry
NIF is designed to leverage existing investments in resources and infrastructure
Tool repositories
Collaboration, competition, coordination, cooperation
• The diversity and dynamism of biomedical data will always make data integration challenging
• The overall data space is vast: no one group or individual can do everything
– Cooperation and coordination are essential
• Creating a core resource registry and data catalog allows the entire community to track resources, work together to keep it updated, promote cross-fertilization, and build better resources
NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11