Integrated Querying Across Disparate Data Sources José Luis Ambite & Gully APC Burns Information...
-
Upload
buddy-lang -
Category
Documents
-
view
220 -
download
1
Transcript of Integrated Querying Across Disparate Data Sources José Luis Ambite & Gully APC Burns Information...
Integrated Querying Across Disparate Data Sources
José Luis Ambite & Gully APC Burns
Information Sciences InstituteUniversity of Southern California
2
Team
Information Integration Infrastructure: Jose Luis Ambite, Craig Knoblock, Maria Muslea, Gowri Kumaraguruparan, Kristina Lerman (USC/ISI)
Domain Collaborators:FBIRN: Naveen Ashish (UCI), Jessica Turner (UCI), Karl Helmer (MGH),
Tim Olsen (WUSTL), Dingying Wei (UCI)
NHPRC: John Nylander, Dave Brink, Liz Moran (NHPRC)
CVRG: Naveen Ashish (UCI), Steve Granite (JHU)
Security: Rachana Ananthakrishnan (UC), Laura Pearlman (USC/ISI)
Data Management: Robert Schuler, Ann Chervenak (USC/ISI)
Knowledge Engineering: Gully Burns, Tom Russ (USC/ISI), Naveen Ashish, Jessica Turner (UCI)
User Interfaces: Naveen Ashish (UCI), Jose Luis Ambite, Pedro Szekely, Craig Rogers, Gowri Kumaraguruparan, Maria Muslea (USC/ISI)
3
Information Integration
Mediator: uniform structured query access to heterogeneous sources
Challenges:Syntactic (Access/Format) heterogeneity: Wrappers
Structured Sources: DBMS, XML/XQuery DBsSemi-structured Sources: HTML, text, pdfWeb services XML, SOAP, WSDL
Semantic heterogeneity MediatorSchema Source modelingData Record Linkage
Scalability:MediationSecuritySource AdditionRecord LinkageEfficient Query Execution
Decision Support
Application Programs
Mediator
KnowledgeBases
Databases Computer Programs
The Web
4
Information Mediator• Virtual Integration Architecture:
– Virtual organization: community of data providers and consumers that want to share data for specific purpose
– Autonomous sources: data, control remains at sources; no change to access methods, schemas; data accessed real-time in response to user queries
– Mediator: integrator defines domain schema and describes source contents• Domain schema: agreed upon view of the domain preferred by the virtual
organization• Source descriptions: logical formulas relating source and domain schemas
• Easy to add new sources• Query Answering
– User writes query in domain schema– Mediator:
• Determines sources relevant to user query• Rewrites query in sources schemas• Breaks query into sub-queries for sources• Optimizes query evaluation plan• Combines answers from sources
– Efficient query evaluation• Streaming dataflow
Mediator
DomainSchema
User queries
Reformulation
Optimizer
Execution Engine
DataSource
Data Source Data
Source
Wrapper WrapperSources schemas
Logical SourceDescriptions
5
BIRN Grid-based Virtual Data Integration Architecture
Relational DBEx: HID
XML DBEx: eXist-db
Web Portal
BIRN Gateway
BIRN Gateway BIRN Gateway BIRN Gateway
OGSA-DQP/DAI
Internet
Internal to Organization
Internet
Internal to Organization
Grid Security Infrastructure(TLS + PKI)
OGSA-DAI OGSA-DAI OGSA-DAI
ISI-Mediator
Web ServiceEx: XNAT
Logical SourceDescriptions
Reconcile Semantics/
Query RewritingClient Program
QueryOptimization/
Execution
SourceWrappers
Security: Encryption,Authentication, Authorization,
HID@MRN
6
FBIRN Data Integration Use Case:
HID and XNAT
HID@UCI
Human Imaging Database(s)Oracle DB
XNAT
EXtensible Neuroimaging Archive Toolkit Web service API
BIRN MediatorSQL query XML
query
User query: find all male
patients over 50 with t1 scans
Results integratedfrom XNAT and HID
HIDresults
XNAT results(XML)
…
Domain query Integrated resultsLogical Source
descriptions
HID@MRN
7
FBIRN Data Integration Use Case:
HID and XNAT
HID@UCI
Human Imaging Database(s)Oracle DB
XNAT
EXtensible Neuroimaging Archive Toolkit Web service API
BIRN MediatorSQL query XML
query
User query: find all male
patients over 50 with t1 scans
Results integratedfrom XNAT and HID
HIDresults
XNAT results(XML)
…
Domain query Integrated resultsLogical Source
descriptions
ECG_Mesa(MySQL DB)
CardioVascular Research Grid
BIRN Mediator
Integrated results
Logical Sourcedescriptions
ChesnokovAnalysis(eXistDB
XML DBMS)
Image Metadatadcm4che PACS
(MySQL DB)
WaveformDB(eXistDB
XML DBMS)
DICOM Image Files(file system)
Waveform Files(file system)
Domain query
Same BIRN mediatorJust plug in CVRG sourcedescriptions
and additional wrapper for eXistDB (XML/XQuery database)
ECG_Mesa(MySQL DB)
CardioVascular Research Grid
BIRN Mediator
Integrated results
Logical Sourcedescriptions
ChesnokovAnalysis(eXistDB
XML DBMS)
Image Metadatadcm4che PACS
(MySQL DB)
WaveformDB(eXistDB
XML DBMS)
DICOM Image Files(file system)
Waveform Files(file system)
Domain query
Use mediator to identify subjectsand files of interest
10
BIRN NHPRC Data Integration Use Case
• Provide data integration infrastructure for NHPRC:– Colony management, genetics, pathology, …
• BIRN NHPRC Activities: – BIRN/ISI demonstrated Colony Management integration
prototype– BIRN/ISI released data integration system to NHPRC team – NHPRC team developing DNA Banking application using
mediator– Deployment to NPRC Centers in progress
BIRN and Biomedical Ontologies
Ontology development challenges: • Modeling complex domains is challenging and requires
specialized expertise • The community of ontology development efforts is
large and somewhat daunting to navigate
Our goal: to provide an ontology development process that
• leverages existing ontology development• creates effective dialog between domain users and
biomedical ontologists • informs and documents the design of domain models
for integration • publishes curated domain ontologies to the community
Domain/Ontology Engineering Strategy
Variables and Data Values are the basis of scientific assertions
e.g., ‘CDNF protects nigral dopaminergic neurons in-vivo’
This statistically-significant effect is the experimental basis for the findings of this study.
Our ontology engineering approach is based on experimental variables
from Lindholm, P. et al. (2007), Nature, 448(7149): p. 73-7
Ontology of Experimental Variables and Values (‘OoEVV’)• Capture semantics of experimental variables
– Based on Measurement Scales (nominal, ordinal, interval, ratio, etc.1)
• Publish to standardized ontology repositories (NCBO, OBO Foundry etc.) – Expressed in RDF / OWL.
• Serve as a focal point of interaction with other ontology curation activities– i.e., Ontology of Biomedical Investigations
(OBI, http://obi-ontology.org/)
1 Stevens, S. S. (1946). "On the theory of scales of measurement." Science 103(2684): 677-680.
OoEVV Basic Design
Characteristic
MeasurementScale
Measurement Value
measuresvalue
usesMeasurementScale
onMeasurementScale
Variable
External Ontologies
Example: Simple (‘nominal’) Handedness
Handedness
Characteristic
BirnLexOntology: birnlex_2178
Variable
Handedness Variable
Nominal Handedness
Variable
MeasurementValue
Nominal Measurement
Value
Nominal Handedness
Value
lefthanded
ambidextrous
righthanded
Measurement Scale
Nominal Measurement
Scale
Nominal Handedness
Scale
Example: Edinburgh Handedness
Inventory
Handedness
Characteristic
BirnLexOntology: birnlex_2178
Variable
Handedness Variable
Edinburgh Handedness
Variable
MeasurementValue
Ordinal Measurement
Value
Edinburgh Handedness
Value
float values[-100.0, +100.0]
Measurement Scale
Ordinal Measurement
Scale
Edinburgh Handedness
Inventory
OoEVV elements for FBIRN
• FBIRN (HID/XNAT) domain model – 193 attributes 83 OoEVV variables – Mainly clinical assessments – e.g., Structured Clinical Interview for
DSM Disorders ‘SCID’, Mini-Mental State Examination, Clinical Dementia Rating, etc.)
Integrated Querying Across Disparate Data Sources
• General information integration infrastructure– Mediators
• bridge semantics across data sources• provide integrated data for analysis and visualization
– Domain model development and curation process• Balance bottom-up/top-down domain/ontology
development and reuse– Security and user data access control built-in
• Approach– Engage research communities – User-lead integration: NHPRC, FBIRN– Build applications incrementally– Enhance capabilities while providing useful tools