The Neuroscience Information Framework: A Scalable … Summer Institute: Discover Big Data, August 5...
Transcript of The Neuroscience Information Framework: A Scalable … Summer Institute: Discover Big Data, August 5...
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
NIF is an initiative of the NIH Blueprint consortium of institutes What types of resources (data, tools, materials, services) are available to the neuroscience community? How many are there? What domains do they cover? What domains do they not cover? Where are they?
Web sites Databases Literature Supplementary material
Who uses them? Who creates them? How can we find them? How can we make them better in the future?
http://neuinfo.org
• PDF files • Desk drawers
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California http://neuinfo.org June10, 2013 dkCOIN Investigator's Retreat 3
A portal for finding and using neuroscience resources
A consistent framework for describing resources
Provides simultaneous search of multiple types of information, organized by category
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Literature
Database Federation
Registry
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
How do we create an information infrastructure that is able to connect a person or a community with the resources they need to accomplish their task at hand? Resource
Anything that is tangible and accessible a product, a person, an institution, a piece of data, a connection …
Information Infrastructure Enables the entire life cycle of information from acquisition to (potential)archival Allows people to find, access, understand and work with information
A domain-specific example:
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Credit: Bhaskar Ghosh, LinkedIn
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Credit: Bhaskar Ghosh, LinkedIn
Oracle, MySQL, Voldemort, Espresso
Zoei, Bobo, Sensei, GraphDB
Kafka, Databus
Hadoop, Teradata, Azkaban
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Oracle or Espresso
Search Index
Graph Index
Read Replica
Updates
Standardization
Data Change Events
Credit: Hien Luu, Sid Anand, LinkedIn
Oracle or Espresso
Updates
Databus
Web Servers
Teradata
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Where is data about X? How does Y relate to Z? Accumulate and Analyze Compare X and Y Subscribe to topic T Recommend Resource Funding reports Search and Explore News
Resource Promotion
Utilization Search
Cross-Utilization
Experiential Services
Researcher Activity Resource Activity
Data
Data Infrastructure
Analysis & Science
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Where is data about X? How does Y relate to Z? Accumulate and Analyze Compare X and Y Subscribe to topic T Recommend Resource Funding reports Search and Explore News
Resource Promotion
Utilization Search
Cross-Utilization
Experiential Services
Researcher Activity Resource Activity
Data
Data Infrastructure
Analysis & Science
The problem starts here
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Given an infinite number of web accessible resources, which are relevant for Neuroscience? Easy Case
A resource assigned by a trusted source Reasonably Easy Case
A resource recommended by a potentially trustable source
Not-so-easy Case “Mine” for resources from literature/crawler Auto-filter by semantic classification Fully validate by curatorial staff/community
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Is this a potential neuroscience resource? Two-pronged classification problem
If it belongs to the class, a reasonable portion of the document term vector will align with a neuroscience vocabulary
Necessary but not sufficient The “spread” of the document term vector with respect to a reasonable domain ontology will be limited
Pragmatic problems What additional (recognizable) descriptors does the resource have? Is the resource “current”? Are there “other ways” of getting to the content of the site?
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
DISCO – NIF’s Ingestion Manager and Data Tracker
“Relationalizes” incoming data when needed Feeds the curators’ dashboard for ingestion, update and index management Executes automatic updates per schedule Keeps track of chains of derived views defined by curators Maintains annotations on data and its views Propagates data updates to all derived views through curator notification
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
The data come From too many disparate sources
6000+ neuroscience resources In too many different formats and models
Relational, XML, RDF, Text, domain-specific, … Having all too diverse semantics
“GRM1”: a string, a gene, a chromosomal region, a list of interesting SNPs in mice?
There is a massive data integration problem because only integration of data will lead to insight
What possible drugs might be repurposed for human inclusion body myopathy (HIBM)? Data about/from the following to be integrated
Organisms, diseases, cross-organism anatomy, phenotypes, genes, proteins, interactions, pathways, genomic variations, pharmaceutical compounds, assays and publications
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Data integration using schema mappings for similar resources Semantics-based integration
Using ontologies as the unifying structure Using vocabularies as the connecting substrate
Using linked-data graph where applicable Link inference Link prediction
Using machine learning Term association using active learning with conditional random fields
When possible
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Commercial solutions exist They are not very scalable as number of schemas increase
Many groups working on it Data Tamer – an MIT-Intel partnership project on Big Data
Matches schema where possible Crowd-sources ambiguous cases Performs entity-consolidation by data clustering
Our approach Curators use small scale schema mappings using ontology and Google Refine
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Gather the language of the domain Terms, their variations and term properties “Verbs”, relationships, their linguistic variations and their properties Constraints that hold in the domain
Sources Ontologies Folksonomies
image tags, text annotations, … Literature
figure and table captions Data (structured or semi-structured)
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
What is an ontology? • represents a domain terminology • a node and edge labeled graph • edge labels are associated with rules specified in Description Logic •Additional constraints can be specified with some subset of FOL • embeds a spanning DAG under the is-a relationship over non-literal nodes •represented using OWL (Web Ontology Language) or RDF
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
A graph query engine for ontological graphs Accepts OWL, RDFS, RDF Ingestors for special cases can be developed
MESH XML DTD Has a service API
1. Get all human phenotypes http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/concepts/search/HP_ 2. Find superclasses of "exocrine pancreas" First: http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/concepts/term/exocrine+pancreas Then, from the result of the above: http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/superclasses/UBERON_0000017 3. Find all direct properties of UBERON_0000017 http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/children/UBERON_0000017 COMBINED WITH http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/parents/UBERON_0000017
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Which skeletal structures in the zebrafish develop from the mesenchyme?
Return $x where ($x (subclassOf)* ‘skeletal element’) and ($x develops_from* ‘ZFA:mesenchyme’) Return $y where
($y subclassOf* $x) Query Rewriting Return $x where
($x develops_from* `mesenchyme’) and ($x has_ontology $o) ($x equivalent_to $z) ($z has_ontology ‘ZFA’) and ($x (subclassOf)* ‘skeletal element’)
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
“develops_from” http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/edge-relation/id/RO_0002202
Subclasses of “skeletal element” (incl. its equivalenceClasses) in ZFA http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/subclasses/term/skeletal%20element?level=3
<relationship> <subject InternalId="778440-1" id="ZFA_0001635">intramembranous bone</subject><property InternalId="5389-15" id="subClassOf">subClassOf</property><object InternalId="777874-1" id="ZFA_0001514">bone element</object> </relationship>
“equivalenceClass” http://nif-services-stage.neuinfo.org/ontoquest-lamhdi/rel/children/term/skeletal%20element?level=1
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Tagging of Antibody Records using a machine learning technique with Conditional Random Fields
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
A high-level statement The ontology O is a graph of concepts and inter-concept relationships The data is a semistructured object S with groupings A mapping structure is a graph of mapping edges from S to O + intra-source map edges + intra-ontology map edges Table: protein(ID, symbol, name, pathway)
Mapping: pathway mapsTo ontoID(Pathway). (exhibitedIn(ontoID(Pathway), ontoID(mammal)) AND occursIn(value(ID), value(pathway)))
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
What are the top 10 genes you have most information about?
Which sources have information about “genes”? Find the ontology term for gene CHEBI_23367 Call http://cm.neuinfo.org:8080/cm_services/column/mapping/ontoterm?ontologyTermId=CHEBI_23367
http://cm.neuinfo.org:8080/cm_services/column/valuefreqs?srcNifId=nlx_146253&tableNifId=nlx_146253-1&columnName=transgenic_line
Now group and rank merge in parallel if needed
Get top 10
For each table and column thus found, call
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Lucene Query Language FIND:(image video) hippocampus anatomy:hippocampus component:"plasma membrane“ anatomy::organism:human anatomy:hippocampus[::organism:human] RELATED:(Tenascin rabbit) RELATED::measuredBy:(“cell signaling” cytometry)
+
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
FIND:(image video) hippocampus Find the ontological class of the query term “hippocampus”
Ans: anatomical_entity Does “hippocampus” have a non-empty has_part tree underneath?
Every node in the ontology keeps an approximate statistics of descendant counts across various edge labels
Rewrite query to: FIND: (image video) has_part*(synonyms(hippocampus))
Issue: A cell is a part of any brain region.
Should the expansion include the cells of hippocampus? No, because “cell” is a different module of the ontology whose top-level is “cell”. Partonomic expansion stays within module of the ontology
Final rewrite: FIND: (image video) anatomy::(has_part*(synonyms(hippocampus)))
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
The PDSP Ki database is a unique resource in the public domain which provides information on the abilities of drugs to interact with an expanding number of molecular targets. The Ki database serves as a data warehouse for published and internally-derived Ki, or affinity, values for a large number of drugs and drug candidates at an expanding number of G-protein coupled receptors, ion channels, transporters and enzymes.
Which marijuana related genes are of interest to NCI? Let’s just use PDSP as an example.
Which marijuana related genes are of interest to NCI? Let’s just use PDSP as an example.
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
public static final String THC_QUERY = "cannabis thc marijuana"; public void demonstrateFederation() throws Exception { final String kiDatabaseId = "nif-0000-01866-1"; FederationQuery query = FederationQuery.builder(kiDatabaseId, THC_QUERY).get(); for (Facets facets: searcher.getFacets(query, 10, 0, 1)) { // Find all receptor (gene) facets if (!facets.getCategory().equals("Receptor")) { continue; } for (Facet facet: facets.getFacets()) { // Get grants related to these genes from NCI query = FederationQuery.builder("nif-0000-10319-1", "\"" + facet.getFacet() +"\"") .facet("Funding Institute", "national cancer institute") .exportType(ExportType.data) .rows(1000).get(); TableData data = searcher.getTableData(query); for (FederationModelData model: data.getResult()) { System.out.println(facet.getFacet() + "," + model.get("project_number") + "," + model.get("project_title")); } } } }
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Which marijuana related genes are of interest to NCI?
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
Running on Gordon (with no tweaking)
Configuration • 1 server
• 98 GB RAM • 24 core Intel Xeon CPU @2.8GHz
• RAID 5 SSD • 2 Solr instances serving the • Federation and literature cores.
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
The Neuroscience Information Framework is not really dependent on Neuroscience Applying it to
Diabetes and kidney diseases Model organisms Earthcube for geo-science data (Hopefully) social science data for economists
We need More scalability Improved complex query handling A distribution framework