Arcot Rajasekar , Reagan Moore, Bertram Ludäscher, Ilya Zaslavsky [email protected]

The GRID Adventures: SDSC's Storage Resource Broker and Web Services in Digital Library Applications Arcot Rajasekar, Reagan Moore, Bertram Ludäscher, Ilya Zaslavsky [email protected] San Diego Supercomputer Center University of California, San


Page 1: Arcot Rajasekar , Reagan Moore,  Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU

The GRID Adventures: SDSC's Storage Resource Broker

and Web Services in Digital Library Applications

Arcot Rajasekar, Reagan Moore, Bertram Ludäscher, Ilya Zaslavsky

ZASLAVSK@SDSC.EDU

San Diego Supercomputer Center
University of California, San Diego


2 RCDL’02, Dubna, October 15-17 2002

Data and Knowledge Systems

Staff
• Reagan Moore
• Chaitan Baru

• Data Mining Lab (Tony Fountain)
• Advanced Query Processing Lab (Amarnath Gupta)
• Knowledge-Based Integration Lab (Bertram Ludäscher)
• Data Grid Lab (Arcot Rajasekar)
• Spatial Information Systems Lab (Ilya Zaslavsky)

+ 2-3 programmers in each lab, + graduate and undergraduate students

Now: connecting research with production databases and data grid solutions


Overview

• Intro
– SDSC and NPACI
• Part I: technologies
– What is Data Grid
– Data, Information, and Knowledge Infrastructures at SDSC/DICE
– SDSC Storage Resource Broker, with examples
– MIX (Mediation of Information Using XML), and Knowledge-Based Mediation
• Part II: case studies
– BIRN: the First Operational Data Grid
– Web Services Demos
– Persistent Archives at SDSC
• Summary


A Distributed National Laboratory for Computational Science and Engineering


1st Teraflops System for US Academia

• 1 TFLOPs IBM SP
– 144 8-processor compute nodes
– 12 2-processor service nodes
– 1,176 Power3 processors at 222 MHz
– Initially > 640 GB memory (4 GB/node), upgrade to > 1 TB later
– 6.8 TB switch-attached disk storage
• Largest SP with 8-way nodes
• High-performance access to HPSS


Bioinformatics Infrastructure for Large-Scale Analyses

• Next-generation tools for accessing, manipulating, and analyzing biological data
– Biology, Stanford University
– DICE, SDSC

• Analysis of Protein Data Bank, GenBank and other databases

• Accelerate key discoveries for health and medicine

• Supporting and leveraging new data grid projects, such as BIRN in biology


Part I: technologies

What is Data Grid
Data, Information, and Knowledge Infrastructures at SDSC/DICE
SDSC Storage Resource Broker (SRB)
MIX (Mediation of Information Using XML), and Knowledge-Based Mediation


What are Data Grids?

• Power Grid Analogy
– Multiple power generators
– Complex transmission networks with switching
– Simple Usage Interface – plug and play
– Guaranteed Supply - Meeting of demands (peak and lull)
– Complex cost function
• More than one data provider
• Best movement of data across computer networks
• Seamless Access to Data with good ‘Finding Aids’
• Guarantee of Data Access
• Access Control, Quotas & Complex Usage Costing


Data Grids

Data Grid - linking multiple data collections
– Separate name spaces
– Separate schema
– Separate administration domains
– Heterogeneous database instances

[Diagram: Database A and Database B linked through the data grid]

The data grid is itself a collection that provides mechanisms to hide latency and manage semantics


Federated Digital Libraries

Virtual Data Grid - linking multiple data collections
– Ability to execute processes to recreate derived data

[Diagram: Database A with Services and Database B with Services linked through the Virtual Data Grid]

The virtual data grid integrates data grid and digital library technology to manage processes


Why Data Grids: Data Handling Problems

• Large Datasets; Large Number of Datasets; Scaling
• Distributed, Heterogeneous Storage
• Virtualization & Transparency
• Collaboration, Access Control, Authentication, Security
• Replication, Coherency, Synchronization
• Fault Tolerance and Load Distribution
• Scheduling, Caching & Data Placements
• Data Migration over Time & Space
• Data/Collection Curation
• Uniform Name Space
• Handling Legacy Data and Data/Resource Evolution
• User-friendly Interfaces – foster collaborations


Why Data Grids: Metadata Problems

• Types of Metadata – Relational to XML to unstructured
• Standardized to User-defined Metadata
• Large Number of Attributes
• Large Size; Scaling
• Federation - integration over space
• Evolution - integration over time
• Evolution - integration over contexts
• Discovery and Search
• Presentation – user friendly
• Extraction and Maintenance


DAKS Data Management Hierarchy

• Model-Based Information Management
– Rule-based ontology mapping, conceptual-level mediation - CMIX
• Information Mediation
– Data federation across multiple libraries - MIX
• Digital Library
– Interoperable services for information discovery and presentation - SDLIP
• Data Collection
– Tools for managing data set collections on databases - MCAT
• Data Handling
– Systems for data retrieval from remote storage - SRB
• Persistent Archives
– Storage of data collections for 30+ years


SRB as a Solution

[Diagram labels: Application; SRB Server; MCAT; Distributed Storage Resources (database systems, archival storage systems, file systems, ftp, http, …). Resource types shown: HRM; DB2, Oracle, Illustra, ObjectStore; HPSS, ADSM, UniTree; UNIX, NTFS, HTTP, FTP]

• The Storage Resource Broker is middleware
• It virtualizes resource access
• It mediates access to distributed heterogeneous resources
• It uses a MetaCATalog (MCAT) to facilitate the brokering
• It integrates data and metadata
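The brokering idea in the bullets above can be illustrated with a small sketch. It is not the SRB API: the `Catalog`, `Replica`, and `Broker` classes, the resource names, and the paths are all hypothetical, chosen only to show how a catalog lookup can hide which physical resource serves a logical name.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str  # hypothetical resource name, e.g. "archive" or "disk"
    path: str      # physical location on that resource


@dataclass
class Catalog:
    """Toy stand-in for a metadata catalog: logical name -> replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, replica):
        self.entries.setdefault(logical_name, []).append(replica)

    def lookup(self, logical_name):
        return self.entries.get(logical_name, [])


class Broker:
    """Resolves a logical name to bytes via whichever driver can serve it."""

    def __init__(self, catalog, drivers):
        self.catalog = catalog
        self.drivers = drivers  # resource name -> callable(path) -> bytes

    def read(self, logical_name):
        for replica in self.catalog.lookup(logical_name):
            driver = self.drivers.get(replica.resource)
            if driver is None:
                continue  # no driver for this resource; try the next replica
            try:
                return driver(replica.path)
            except OSError:
                continue  # resource unavailable; fall back to another replica
        raise FileNotFoundError(logical_name)


def offline(path):
    # Simulates an archive that is currently unreachable.
    raise OSError("archive offline")


catalog = Catalog()
catalog.register("/demo/image.fits", Replica("archive", "/tape/0042"))
catalog.register("/demo/image.fits", Replica("disk", "/cache/image.fits"))
broker = Broker(catalog, {"archive": offline, "disk": lambda p: b"pixels"})
data = broker.read("/demo/image.fits")  # served from the disk replica
```

The client only ever names `/demo/image.fits`; which replica answers is decided by the catalog and the drivers, which is the sense in which access is "virtualized" in this sketch.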


SDSC Storage Resource Broker & Meta-data Catalog

[Diagram: the SRB, together with MCAT and HRM, sits between applications and storage. Recoverable labels:
– Application interfaces: C, C++, Linux I/O; Unix Shell; Java, NT Browsers; Web; Prolog Predicate
– Metadata: Dublin Core; Resource, Mthd, User; User Defined; Application Meta-data
– Services: Remote Proxies; DataCutter; Metadata Extraction
– Archives: HPSS, ADSM, UniTree, DMF
– Databases: DB2, Oracle, Sybase
– File Systems: Unix, NT, Mac OSX]


SRB Space

[Diagram: many Clients, a network of SRB servers, data repositories (DR), digital libraries (DL), and one meta catalog (MC)]

DR - Data Repository
DL - Dig Library
MC - Meta Catalog


MySRB: Web-based Access to the SRB

• Browse in Hierarchical Collections
• Registration of (remote) Legacy Files & Directories
• Registration of SQL Objects
• Registration of URLs
• Data Movement Operations
– Ingest & Re-Ingest, Delete, Unlink
– Replicate, Copy, Move, S-Link
• Access Control Operations
– Read, Write, Own, Curate, Annotate, …
– Ticket-based Access
• Version Control Operations
– Read Lock, Write Lock, Unlock
– Check In, Check Out


Metadata Management in MySRB

• Types of Metadata
– System-level Metadata
   • Size, resource, owner, date, access control, …
– User-defined Metadata
   • for data & collections
   • <name,value,unit> triples
   • No limit on the number of metadata attributes
   • Support for Collection-level schemas
      – Comments, default values, drop-down lists
   • Support for Standardized Schemas
      – (e.g. Dublin Core)
– Annotations
   • Supports textual annotations
   • Annotator, date, context also registered
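The user-defined `<name,value,unit>` triples can be pictured with a minimal sketch. The `MetadataStore` class, the object names, and the attribute names below are hypothetical illustrations, not part of MySRB or MCAT.

```python
from collections import defaultdict


class MetadataStore:
    """Toy store of <name, value, unit> triples attached to named objects."""

    def __init__(self):
        self._triples = defaultdict(list)  # object name -> [(name, value, unit)]

    def annotate(self, obj, name, value, unit=""):
        # Any number of triples may be attached to one object.
        self._triples[obj].append((name, value, unit))

    def find(self, name, predicate):
        # Return objects having at least one triple with this attribute
        # name whose value satisfies the predicate.
        return sorted(
            obj
            for obj, triples in self._triples.items()
            if any(n == name and predicate(v) for n, v, _ in triples)
        )


store = MetadataStore()
store.annotate("scan-001", "exposure", 30, "s")
store.annotate("scan-001", "band", "J")
store.annotate("scan-002", "exposure", 90, "s")
long_exposures = store.find("exposure", lambda v: v > 60)
```

Because triples are schema-free, a collection-level schema (as mentioned above) would amount to an agreed list of attribute names and defaults layered on top of such a store.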


SRB Projects

• Digital Libraries
– UCB, Umich, UCSB, Stanford, CDL
– NSF NSDL - UCAR / DLESE
• NASA Information Power Grid
• DOE ASCI Data Visualization Corridor
• Astronomy
– National Virtual Observatory
– 2MASS Project (2 Micron All Sky Survey)
• Particle Physics
– Particle Physics Data Grid (DOE)
– GriPhyN
– SLAC Synchrotron Data Repository
• Medicine
– Visible Embryo (NLM)
• Earth Systems Sciences
– ESIPS
– LTER
• Persistent Archives
– NARA
– LOC
• Neuro Science & Molecular Science
– TeleScience, Brain Images, BIRN
– JCSG (SSRL/SLAC), AfCS, …


Large Data Project Examples

• Astronomy:
– National Virtual Observatory
   • Integrate 18 sky surveys (ITR prop)
– 2MASS Project (2 Micron All Sky Survey)
   • 10 TB; 5 million files
   • Co-locate Images for Spatial Access
   • Data Mining across entire collection
   • Replicate to CalTech HPSS
• Particle Physics:
– Particle Physics Data Grid (DOE)
– GriPhyN (NSF ITR proj)
   • CERN LHC 1 PB/yr (1 billion obj)
   • Multi-Lab integration
– SLAC Synchrotron Data Repository


National Virtual Observatory Data Grid

[Diagram: layered architecture, numbered 1 to 7. Recoverable labels:
– 1. Portals and Workbenches (Bulk Data Analysis, Catalog Analysis, Metadata View, Data View)
– 2. Knowledge & Resource Management
– 4. Grid (Security, Caching, Replication, Backup, Scheduling)
– 7. Derived Collections
– Layers 3, 5, and 6 (numbering not recoverable): Standard APIs and Protocols, Concept space; Information Discovery, Metadata delivery, Data Discovery, Data Delivery; Catalog Mediator, Data mediator; Standard Metadata format, Data model, Wire format; Catalog/Image Specific Access
– Resources: Compute Resources, Catalogs, Data Archives]


Digital Sky Data Ingestion

[Diagram labels: input tapes from telescopes; SRB; Informix (SUN), star catalog; SUN E10K; Data Cache (800 GB); HPSS (10 TB); sites: SDSC, IPAC Caltech]


Digital Sky Data Ingestion

• The input data was on tapes in a random (temporal…) order.
• Ingestion took nearly 1.5 years of almost continuous operation (24*7*365), using 4 parallel streams (4 MB/sec per stream).
• Total 10+ TB: 5 million 2 MB images in 147,000 containers.
• SRB performed a spatial sort on data insertion (scientists view/analyze data by neighborhood). The disk cache (800 GB) for the HPSS containers was utilized.
• Ingestion speed was limited by input tape reads
– Only two tapes per day can be read
• Workflow incorporated persistent features to deal with network outages and other failures.
• The C API was utilized for fine-grained control and to be able to manipulate and insert metadata into the Informix catalog at IPAC Caltech.
– http://www.ipac.caltech.edu/2mass
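A back-of-the-envelope check of these figures helps show why tape reads, rather than the ingestion streams, set the pace. The sketch assumes decimal units (1 MB = 10^6 bytes) and exactly 1.5 years of ingestion; neither assumption is stated on the slide.

```python
MB = 10**6
TB = 10**12

images = 5_000_000
image_size = 2 * MB
containers = 147_000
streams = 4
stream_rate = 4 * MB  # bytes per second per stream

total_bytes = images * image_size                # 10 TB
images_per_container = images / containers       # about 34
peak_rate = streams * stream_rate                # 16 MB/s
days_at_peak = total_bytes / peak_rate / 86_400  # about 7.2 days
avg_rate = total_bytes / (1.5 * 365 * 86_400)    # about 0.21 MB/s
```

Under these assumptions the four streams could have moved the whole collection in roughly a week, while the observed average rate over 1.5 years is around 0.2 MB/s, which is consistent with the statement that input tape reads were the bottleneck.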


DigSky Conclusion

• SRB can handle a large number of files
• Metadata access is still less than ½ sec delay
• Replication of large collections
• Single command for geographical replication
• On-the-fly sorting (out-of-tape sorting)
• Availability of data otherwise not possible
• Near-line access to 5 million files (10 TB)
• Successfully used in web-access & large scale analysis (daily)


Demonstration

• goto mySRB
• For Additional Information:
– http://www.npaci.edu/dice/
– [email protected]


MIX: Mediation of Information using XML


Mediation of Information using XML (MIX)

[Diagram: each data source (e.g. home ads, a native XML database, a legacy source) is exposed, where needed through a Wrapper, as XML View Document(s). XML queries go in and XML comes back.]

Export:
• Schema & Metadata (DTD, RDF, …)
• Capabilities


A Typical Mediation Scenario

[Diagram: the User Interface sends a Query to the Mediator (integrated views over heterogeneous sources) and receives Results. The Mediator sends a Query “fragment” to each Wrapper, one per source (SQL Database, GIS, HTML). Each Wrapper converts the incoming query and the outgoing data.]


The Home Buyer Scenario

[Diagram: a Web Client sends an XMAS Query to the MIXm Mediator and receives Results (XML). Sub-mediators and the data they integrate:
– “Homes” mediator: Home info (real estate)
– “Neighborhood” mediator: N’hood info (demographics), Community info (name, ZIP), Crime info (ZIP, stats)
– “Schools” mediator: Schools info (address, size), School district info (scores, spending, ZIP), National test scores
Data sources shown: www.realtor.com, www.homeadvisor.msn.com, www.sandag.cog.ca.us, www.sannet.gov, www.asd.com]


Home Buyer GUI


An XML Query (XMAS)

Query:
  <folder> $C $S for $S </folder> for $C

  $C: <*.condo> <address zip=$Z/> </condo> AT www.condo.com
  AND
  $S: <*.school type=elementary> <address zip=$Z/> </school> AT schools.org

Source document (fragment):
  ...
  <RealEstateAgent>
    <name>J. Smith</name>
    <condos>
      <condo>
        <address ... zip=92037>
        <price>$170k OBO</price>
        <bedrooms>2</bedrooms>
      </condo>
    </condos>
  </RealEstateAgent>

Answer document (fragment):
  <condosAndSchools>
    <folder>
      <condo>
        <address ... zip=92037>
        <price>$170k OBO</price>
        <bedrooms>2</bedrooms>
      </condo>
      <school>
        <name>La Jolla High</name>
        <address … zip=92037>
      </school>
      <school>…</school>
    </folder>
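The effect of the query above (one folder per condo, holding the elementary schools that share its ZIP code) can be approximated in a few lines of Python. This is only an illustration of the join and grouping, not an XMAS implementation; the sample documents, the school names, and the choice to skip condos without a matching school are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, well-formed so that it can be parsed.
CONDOS = """<RealEstateAgent><name>J. Smith</name><condos>
  <condo><address zip="92037"/><price>$170k OBO</price><bedrooms>2</bedrooms></condo>
  <condo><address zip="99999"/><price>$90k</price><bedrooms>1</bedrooms></condo>
</condos></RealEstateAgent>"""

SCHOOLS = """<schools>
  <school type="elementary"><name>Example Elementary</name><address zip="92037"/></school>
  <school type="elementary"><name>Other Elementary</name><address zip="92101"/></school>
  <school type="high"><name>Example High</name><address zip="92037"/></school>
</schools>"""


def condos_and_schools(condo_xml, school_xml):
    """Group each condo with the elementary schools in the same ZIP code."""
    schools = [
        s for s in ET.fromstring(school_xml).iter("school")
        if s.get("type") == "elementary"
    ]
    answer = ET.Element("condosAndSchools")
    for condo in ET.fromstring(condo_xml).iter("condo"):
        zip_code = condo.find("address").get("zip")
        matches = [s for s in schools if s.find("address").get("zip") == zip_code]
        if not matches:
            continue  # join semantics assumed: no matching school, no folder
        folder = ET.SubElement(answer, "folder")
        folder.append(condo)
        folder.extend(matches)
    return answer


answer = condos_and_schools(CONDOS, SCHOOLS)
folders = answer.findall("folder")
```

In a mediator the two inputs would come from different wrapped sources rather than from in-memory strings; the join on the shared `zip` variable is the part this sketch is meant to show.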


Home Buyer GUI (Answers)

[Screenshot labels: Generated XMAS Query; XML Answer Document]


Our Research

• In what query language does the user pose a query?

• How does the query engine of the mediator rewrite the query?

• How does the mediator combine/restructure/post-process partial results?

• What data model and query transformation scheme should the wrappers use for different source types?

For details: http://www.npaci.edu/DICE/MIX

[Diagram: a User Query (XMAS) goes to the Mediator, which returns XML; the Mediator reaches sources S1, S2, S3 through wrappers W1, W2, W3]


New MIX Challenges from Scientific Applications

• Complex Data
– SDSC’s Scientific Data Applications (current/planned, e.g. Neurosciences: NCMIR, NIH BIRN; Earth sciences: GEON, GeoGrid, ...) show that syntactic/structural integration is insufficient for ...

Complex Multiple-World Mediation Problems:
– complex, disjoint, seemingly unrelated data
– “hidden semantics” in complex, indirect relationships

=> Semantic (aka Model/Knowledge-Based) Mediation
– lift mediation to the level of conceptual models (CMs)
– use domain experts’ knowledge formalized as rules over CMs

=> Specialized Extensions
• temporal, geospatial, statistical, DQ/accuracy... operations

=> Extend Mediation Scope and Power via Deductive Rules


INFORMATION MEDIATION WITH DOMAIN MAPS


An Unresolved Challenge

How do nerve cells change as we learn and remember?

A multi-resolution study of the rat hippocampus at Boston University


Dendritic spine morphology and its variations

Reconstructions from the Synapse Lab, Boston University

density = #spines/length


Hypothesis

• Distribution of spines changes with learning
• Each spine type performs a different task in information transmission

Observations

• Spine density, size, shape and PSD vary with maturity
• Spine neck geometry controls peak Calcium amount
• Calcium flow parameters depend on the different subclasses of spines

Next Questions

• Does anyone else have corroborative evidence for these observations?
• Are these observations true in other comparable parts of the brain?
• Is this consistent with the distribution of Calcium-binding proteins?


Example for Formalizing Domain Knowledge: Domain Map for SYNAPSE and NCMIR

A domain map comprises
• Description Logic facts ...
– concepts ("classes")
– roles ("associations")
• derived properties ...
• ... expressed as logic rules
– (e.g. F-logic)

Domain expert knowledge:
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).

[Figure: the domain map, shown with the equivalent Description Logic facts]


Extended Mediator Architecture for Semantic Mediation

[Diagram, recoverable labels:
– USER/Client, exchanging CM Queries & Results (in XML) with the mediator
– Mediator Engine: FL rule proc., LP rule proc., Graph proc.; XSB Engine
– Domain Map (DM); Integrated View Definition (IVD); CM (Integrated View)
– Logic API (capabilities); CM Plug-In
– Sources S1, S2, S3, each with an XML-Wrapper and a CM-Wrapper, each exporting a conceptual model in GCM (GCM CM S1, GCM CM S2, GCM CM S3)]


Comparison & Summary: Semantic Mediation

 | (Complex) Single World / Simple Multiple World | Complex Multiple World
Integration target | global schema (common / shared) | 1..n shared domain maps
Example scenario | suppliers’ catalogs / home buyer | complex scientific data (neuroscience, geoscience, …)
Schema level overlap | large / small | none … small
Instance level overlap | large / none | none
Source correlation | direct, instance / schema level | indirect, conceptual (knowledge) level
Techniques | schema transformations, schema integration: “structural” integration | domain maps, formalized domain knowledge (“semantic bridges”) => model-based (“semantic”) mediation
Integration languages, expressiveness | relational, semistructured, queries & transformations (e.g., SQL, XQuery, XSLT) | conceptual (description logics), object-oriented, deductive features (e.g., GCM, F-logic)
Integrators | DB expert | domain expert + KRDB expert


Part II: case studies

BIRN
Web Services
Persistent Archives


NIH is Funding a Brain Imaging Federated Repository

National Partnership for Advanced Computational Infrastructure

Part of the UCSD CRBS Center for Research on Biological Structure

Biomedical Informatics Research Network (BIRN)

NIH Plans to Expand to Other Organs and Many Laboratories


Infrastructure for Sharing Neuroscience Data

[Image labels: CCB, Montana SU; Surface atlas, Van Essen Lab; NCMIR, UCSD; stereotaxic atlas, LONI; MCell, CNL, Salk]

SOURCES:
• NCMIR, U.C. San Diego
• Caltech Neuroimaging
• Center for Imaging Science, Johns Hopkins
• Center for Computational Biology, Montana State
• Laboratory of Neuro Imaging (LONI), UCLA
• Computational Neurobiology Laboratory, Salk Inst.
• Van Essen Laboratory, Washington University
• …

Data Management Infrastructure (DAKS/NPACI)
• MIX Mediation in XML
• MCAT information discovery
• SRB data handling
• HPSS storage
• ...

Knowledge-based GRID infrastructure: ? ? ? ?

Data Management Infrastructure (“Data Grid”): GTOMO, Telemicroscopy, Globus, SRB/MCAT, HPSS


The Need for Semantic Integration

[Diagram: sources for protein localization, morphometry, and neurotransmission, plus Web sources (CaBP, Expasy), each behind a Wrapper, feeding “??? Mediator ???” with “??? Integrated View ???” and “??? Integrated View Definition ???”]

Example cross-source queries:
• What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity?
• How about other rodents?

Annotations on the diagram:
• Data, relationships, constraints are modeled (CMs)
• Cross-source relationships are modeled
• Semantic (knowledge-based) mediation services
• Cross-source queries


Hidden Semantics: Protein Localization

<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>

[Image labels: Molecular layer of Cerebellar Cortex; Purkinje Cell layer of Cerebellar Cortex; Fragment of dendrite]


Mediation Services: Source Registration (System Issues)

• Source Data Type: table, tree, file
• Access Protocol: SRB, HTTP, JDBC
• Query Capability: SQL, XMLQL, DOOD, ARC; Selections, SPJ
• Result Delivery: Tuple-at-a-time, Set-at-a-time, Stream, Binary for Viewer


Mediation Services: Source Registration (Semantics Issues)

• Domain Map Registration
– provide concept space/ontology
   • … as a private object (“myANATOM”)
   • … merge with others (give “semantic bridges”)
   • … and check for conflicts
• Conceptual Model Registration
– schema: classes, associations, attributes
– domain constraints
– “put data into context” (linking data to the domain map)


Mediation Services: Integrated View Definition

DERIVE
  protein_distribution(Protein, Organism, Brain_region, Feature_name, Anatom, Value)
FROM
  I:protein_label_image[proteins ->> {Protein}; organism -> Organism;
    anatomical_structures ->> {AS:anatomical_structure[name -> Anatom]}],   % from PROLAB
  NAE:neuro_anatomic_entity[name -> Anatom;                                 % from ANATOM
    located_in ->> {Brain_region}],
  AS..segments..features[name -> Feature_name; value -> Value].

• provided by the domain expert and mediation engineer
• declarative language (here: Frame-logic)
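Read procedurally, the view definition above joins image records from one source with anatomical entities from another on the shared structure name, then flattens the nested segments and features. The following sketch mirrors that reading in plain Python; the dictionary layout and the sample values are hypothetical and stand in for the wrapped PROLAB and ANATOM sources.

```python
def protein_distribution(images, entities):
    """Yield (protein, organism, brain_region, feature_name, anatom, value)."""
    # Index anatomical entities by name so the join on Anatom is a lookup.
    regions = {}
    for entity in entities:
        regions.setdefault(entity["name"], []).extend(entity["located_in"])

    for image in images:
        for protein in image["proteins"]:
            for structure in image["anatomical_structures"]:
                anatom = structure["name"]
                for region in regions.get(anatom, []):
                    for segment in structure["segments"]:
                        for feature in segment["features"]:
                            yield (protein, image["organism"], region,
                                   feature["name"], anatom, feature["value"])


# Hypothetical sample data for illustration only.
images = [{
    "proteins": ["RyR"],
    "organism": "rat",
    "anatomical_structures": [{
        "name": "spine",
        "segments": [{"features": [{"name": "amount", "value": 0}]}],
    }],
}]
entities = [{"name": "spine", "located_in": ["cerebellar cortex"]}]

rows = list(protein_distribution(images, entities))
```

The declarative rule leaves the evaluation order to the engine; the nested loops here are just one possible order and say nothing about how the actual mediator evaluates the view.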


Mediation Services: Semantic Annotation Tools

line drawing annotation (spatial) database for mediation


Part II: case studies

Web Services


Web Services Demo 1

Query: Find school districts in San Diego where computer ownership rates among residents are over 80%

[Diagram, recoverable labels:
– Clients: AxioMap, Polexis
– XML Mediator (Enosys), queried with an XML query (XCQL), returning XML
– Spatial Mediator (Java Servlet)
– Two sources, each an Oracle DBMS behind Java Servlets and a Web Server, exposed via SOAP and described by WSDL:
   • San Diego Digital Divide Survey (Sociology Workbench)
   • Boundaries of municipalities and school districts]


Web Services Demo 2

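[Diagram, recoverable labels:
– Coordinate Conversion Service: ESRI ArcObjects behind a Web Server, exposed via SOAP, described by WSDL, exchanging XML
– EPA Envirofacts Website, behind an XML Wrapper
– Local Pollution Data, behind an XML Wrapper
– Spatial Mediator (Java Servlet)]

Web spatial source, EPA data
ArcObjects spatial service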


Web Services Demo 3

GIS source, UCR/FBI data
WSDL: for spatial analysis, survey data analysis, DBMS query
Process flow across Web services

[Diagram: services described by WSDL, combined in one process flow. Recoverable labels:
– Counties crossed by an interstate
– Counties with decrease in victims of firearms over … %, 1993-99
– Counties with decrease in homicide rates over … %, 1993-99
– Sources: UCR summaries, Oracle; Victim data, SWB; Spatial Query, ArcIMS/ArcObjects]


Part II: case studies

Persistent Archives


Persistent Archives

• NARA project
• Store & Recover Data after 400 years
• 5 million emails
• 33 million web pages
• 90 million personnel records


Persistent Archives

• Challenges: each of the software and hardware systems may become obsolete
– the storage media may degrade
– the storage system may become obsolete
– the database backups may become obsolete, with no way to recover the collection (structure)
– the digital object formats may become obsolete, with no helper application that can read them
• Persistent archive is a migration mechanism
– support for automatic migration to new technology; automatic ingestion, management, access, catalog discovery
• Infrastructure independence
– Non-proprietary formatting -- Collection management -- Data set access -- Authentication -- Presentation
• Persistent archive is an interoperability system
– XML as a (meta-) information markup language


Persistent Archive

Persistent archive
– Describe archived data as collections
– Describe processes used to create collections
– Manage evolution of technology

[Diagram: Database A (today) and Database A (tomorrow) linked through the Virtual Data Grid]

The persistent archive is itself a virtual data grid that provides mechanisms to manage migration to new technology


Information Hierarchy (Simplest Definitions)

• Data
– digital object, i.e., the object representation as a bit stream
• Information
– any tagged data, where tags are treated as information attributes
– attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
• Knowledge
– higher-order concepts and relationships between attributes
– relationships can be procedural, temporal, structural, spatial, functional, ... and described in a Logic formalism (semantic networks, description logics, conceptual graphs, ...) which is often rule-based (e.g. Datalog, Frame-Logic)


What Types of Interoperability are Needed?

• Data management (digital objects)
– ability to work with multiple types of storage systems, across separate administration domains
• Information management (attributes)
– ability to define a collection independent of database choice
– ability to migrate collection onto new databases
• Knowledge management (relationships)
– ability to manage relationships and high-level domain concepts
– ability to map concepts to collection attributes


From XML-Based to Knowledge-Based Archives

• Collection-based archival with XML: save data "as is" plus...
– ... separate content from presentation
– ... tag your data (take a lift in the info hierarchy)
– ... use a self-describing, semistructured data format (XML)
• Knowledge-based archival: now add ...
– ... conceptual level information
– ... integrity constraints
– ... explanations/derivation rules:
   • archiving only results y=f(x) vs. archiving the rules/function "f" (e.g. f = “the Florida procedure”...)

=> employ knowledge representation languages


Knowledge-Based Persistent Archive

[Diagram: three levels (Knowledge, Information, Data) crossed with three service columns (Ingest Services, Management, Access Services). Recoverable labels:
– Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse
– Information: Attributes, Semantics; Information Repository; Attribute-based Query
– Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query
– Interface labels between cells: MCAT/HDF, Grids, XML DTD, SDLIP, XTM DTD, Rules - KQL
– (Topic Maps / Model-based Access)
– (Data Handling System - SRB / FTP / HTTP)]


Knowledge-Based Archival: Senate Example

Data provider says: “Please archive all records of legislative activities of the 106th Senate!”

Integrity constraints, e.g.:
(1) {senators_with_file} = UNION (sponsor, cosponsors, submitted_by)
(2) {senators} = {sponsors} = {co-sponsors}

Violation:
– the rhs is a SUPERSET of the lhs!

Exceptions:
– (Chafee, John), (Gramm, Phil), (Miller, Zell)

(Possible) explanations:
– senators who joined (Miller), passed away (Chafee), were forgotten (Gramm)!?

Checking ICs:
IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))

IF condition THEN action
Action = LOG, WARN, ABORT, ...
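
The IF condition THEN action rule above can be sketched in Python. This is a minimal illustration, not the archive's actual rule engine: the record layout, the function name `check_sponsors`, the `Action` enum, and the sample data are all assumptions made for the example. Only the rule itself and the action names LOG, WARN, ABORT come from the slide.

```python
from enum import Enum


class Action(Enum):
    LOG = "log"
    WARN = "warn"
    ABORT = "abort"


class IntegrityError(Exception):
    """Raised when a violated constraint is configured to ABORT."""


def check_sponsors(senators, records, action=Action.LOG):
    """IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))."""
    exception_log = []
    for record in records:
        # Everyone named on a file, in any role (the UNION in constraint (1)).
        named = {record["sponsor"], *record["cosponsors"], record["submitted_by"]}
        for name in sorted(named - senators):
            entry = ("missing_senator_info", name)
            if action is Action.ABORT:
                raise IntegrityError(entry)
            if action is Action.WARN:
                print(f"warning: {name} has a file but no senator record")
            if entry not in exception_log:
                exception_log.append(entry)
    return exception_log


# Hypothetical sample data: placeholder senators plus two of the slide's exceptions.
senators = {"Doe, Jane", "Roe, Richard"}
records = [
    {"sponsor": "Doe, Jane", "cosponsors": ["Chafee, John"], "submitted_by": "Roe, Richard"},
    {"sponsor": "Miller, Zell", "cosponsors": [], "submitted_by": "Miller, Zell"},
]
log = check_sponsors(senators, records)
```

With `Action.LOG` the violations are collected and ingest continues; switching to `Action.ABORT` stops at the first senator who has a file but no senator record.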

Page 66: Arcot Rajasekar , Reagan Moore,  Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU


NARA Herbicides Collection: Introduction

Page 67: Arcot Rajasekar , Reagan Moore,  Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU


The Herbicides Collection - input

From EBCDIC tapes:

6507213207565 260404040 040000{0000D0000000{048{ {0000000{0000000{0000000{0000000{
6507243207565 260606060 060000{0000D0000000{072{ {0000000{0000000{0000000{0000000{
6507253207565 260606060 060000{0000D0000000{072{ {0000000{0000000{0000000{0000000{
6507263207565 260606060 060000{0000D0000000{072{ {0000000{0000000{0000000{0000000{
6507273207565 260606060 060000{0000D0000000{072{ {0000000{0000000{0000000{0000000{
6507283207565 260505050 050000{0000D0000000{060{ {0000000{0000000{0000000{0000000{
6507293207565 260404040 040000{0000D0000000{048{ {0000000{0000000{0000000{0000000{
6508022022365 060202020 010000{0000C0000000{012{ {0000000{0000000{0000000{0000000{1A AS890255 000{000{
6508022022365 1B AS940140 000{000{
6508042022365 060202020 006000{0000C0000000{007B {0000000{0000000{0000000{0000000{1A AS925205 000{000{
6508042022365 1B AS970065 000{000{
6508062022365 060202020 004000{0000C0000000{004H {0000000{0000000{0000000{0000000{1A BS290320 000{000{
6508062022365 1B BS275298 000{000{
6508073207565 260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{1A YT080110 000{000{
6508073207565 1B YT110060 000{000{
6508113207565 260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{
6508123207565 260202020 020000{0000D0000000{024{ {0000000{0000000{0000000{0000000{
6508151022465 020202020 008000{0000C0000000{009F {0000000{0000000{0000000{0000000{1A YD350155 000{000{
6508151022465 1B YD450150
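
Fields in the records above such as 048{, 007B and 004H end in a brace or a letter. That pattern is typical of EBCDIC signed zoned decimals ("overpunch"), where the last character carries both the final digit and the sign. The slide does not say which encoding was used, so the sketch below is an assumption about the data, and the function name `decode_overpunch` is hypothetical.

```python
# Standard overpunch table: position in the string is the final digit.
POSITIVE = "{ABCDEFGHI"  # +0 .. +9
NEGATIVE = "}JKLMNOPQR"  # -0 .. -9
DIGITS = "0123456789"


def decode_overpunch(field: str) -> int:
    """Decode a signed zoned-decimal field, e.g. '007B' -> 72."""
    if not field:
        raise ValueError("empty field")
    body, last = field[:-1], field[-1]
    if any(ch not in DIGITS for ch in body):
        raise ValueError(f"non-digit in field body: {field!r}")
    if last in POSITIVE:
        sign, digit = 1, POSITIVE.index(last)
    elif last in NEGATIVE:
        sign, digit = -1, NEGATIVE.index(last)
    elif last in DIGITS:
        sign, digit = 1, int(last)  # unsigned field
    else:
        raise ValueError(f"unknown sign character: {last!r}")
    return sign * int(body + str(digit))


# Python ships an EBCDIC codec (cp037); the zoned bytes F0 F4 F8 C0
# decode to the same text seen on the slide.
raw = bytes([0xF0, 0xF4, 0xF8, 0xC0])
text = raw.decode("cp037")
value = decode_overpunch(text)
```

Whether a decoded value such as 480 carries an implied decimal point depends on the record layout, which the slide does not give.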

Page 68: Arcot Rajasekar , Reagan Moore,  Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU


The Herbicides Collection - preservation

Converted to XML:

<YEAR><yearnum>66</yearnum>
  <MONTH><monthnum>01</monthnum>
    <DATE><datenum>01</datenum>
      <MISSION><num>206866</num>
        <RUN>
          <code>A</code><ctz>3</ctz><multi></multi><prov>27</prov>
          <aircrafts><scheduled>02</scheduled><airborne>02</airborne><productive>02</productive></aircrafts>
          <agent>O</agent><gal>02000</gal><hits>0</hits>
          <aborts><maintenance>0</maintenance><weather>0</weather><battle_damage>0</battle_damage><other>0</other></aborts>
          <type>D</type><area>024</area><rsult></rsult>
          <UTM><utmid>1A</utmid><utm_coor>YS240780</utm_coor></UTM>
          <UTM><utmid>1B</utmid><utm_coor>YS290630</utm_coor></UTM>
        </RUN>
        <RUN>
          <code>B</code><ctz>3</ctz><multi></multi><prov>27</prov>
          <aircrafts><scheduled>02</scheduled><airborne>02</airborne><productive>02</productive></aircrafts>
          <agent>O</agent><gal>02000</gal><hits>0A</hits>
          <aborts><maintenance>0</maintenance><weather>0</weather><battle_damage>0</battle_damage><other>0</other></aborts>
          <type>D</type><area>024</area><rsult></rsult>

MAPPING
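
Once the records are tagged, they can be read with any XML parser. The following sketch uses Python's standard `xml.etree.ElementTree` on a shortened MISSION fragment. The fragment on the slide is cut off, so the closing tags in `SAMPLE` are added here; the function name `summarize_runs` and the reading of `<gal>` as gallons are assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened, closed version of the slide's fragment (closing tags added).
SAMPLE = """<MISSION><num>206866</num>
<RUN><code>A</code><agent>O</agent><gal>02000</gal><area>024</area>
<UTM><utmid>1A</utmid><utm_coor>YS240780</utm_coor></UTM>
<UTM><utmid>1B</utmid><utm_coor>YS290630</utm_coor></UTM>
</RUN>
</MISSION>"""


def summarize_runs(mission_xml: str) -> list:
    """Return one dictionary per RUN with its agent, quantity and UTM points."""
    mission = ET.fromstring(mission_xml)
    runs = []
    for run in mission.findall("RUN"):
        runs.append({
            "mission": mission.findtext("num"),
            "run": run.findtext("code"),
            "agent": run.findtext("agent"),
            "gallons": int(run.findtext("gal")),  # assumed meaning of <gal>
            "utm": {u.findtext("utmid"): u.findtext("utm_coor")
                    for u in run.findall("UTM")},
        })
    return runs


runs = summarize_runs(SAMPLE)
```

Because the format is self-describing, this reader needs no knowledge of the column offsets of the original tape records.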

Page 69: Arcot Rajasekar , Reagan Moore,  Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU


From Geography Markup to Rendering

<?xml version="1.0" encoding="iso-8859-1"?>
<rs>
<r><name>Horton Plaza</name><URL></URL><labelpos>41.46,77.51</labelpos><c>5076,1540 4986,1540 4895,1539 4803,1539 4715,1539 4622,1539 4534,1538 4534,1641 4534,1745 4534,1856 4622,1856 4711,1856 4800,1856 4893,1855 4984,1855 5075,1854 5075,1749 5076,1646 </c></r>
<r><name>Gaslamp</name><URL></URL><labelpos>44.60,83.00</labelpos><c>5162,1013 5084,1057 5083,1116 5081,1222 5079,1326 5079,1433 5076,1540 5076,1646 5075,1749 5075,1854 5167,1854 5257,1855 5257,1750 5259,1647 5260,1541 5262,1434 5262,1328 5263,1222 5263,1013 </c></r>
. . .

XML encoding of geographic features (such as GML)

<?xml version="1.0"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20000303 Stylable//EN" "http://www.w3c.org/2000/svg10-20000303-stylable" [<!ENTITY base "fill:#ff0000;stroke:#000000;stroke-width:1;">]>
<svg width="100%" height="100%" viewBox="0 0 11590 7547" style="shape-rendering:geometricPrecision; text-rendering:optimizeLegibility">
<g id="karta" transform="scale(1, -1) translate(0, -7547)">
<g id="base" style="&base;">
<path id="a1" title="Horton Plaza" style="fill:#00ff00;" d="M5076,1540L 4986,1540 4895,1539 4803,1539 4715,1539 4622,1539 4534,1538 4534,1641 4534,1745 4534,1856 4622,1856 4711,1856 4800,1856 4893,1855 4984,1855 5075,1854 5075,1749 5076,1646 5076,1540z"/>
<path id="a2" title="Gaslamp" style="fill:#ffff00;" d="M5162,1013L 5084,1057 5083,1116 5081,1222 5079,1326 5079,1433 5076,1540 5076,1646 5075,1749 5075,1854 5167,1854 5257,1855 5257,1750 5259,1647 5260,1541 5262,1434 5262,1328 5263,1222 5263,1013 5162,1013z"/>
</g>
</g>
</svg>

VML or SVG or…

SVG
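
The step from the feature markup to SVG can be sketched as follows. Comparing the two listings, each `<c>` coordinate list becomes a path that starts with `M`, continues with `L`, repeats the first point and ends with `z`. The code below reproduces that pattern. It is an illustration only: the slides do not show the transformation that was actually used (it may well have been a stylesheet), and the function names and the default fill colour are assumptions.

```python
import xml.etree.ElementTree as ET


def coords_to_path(coord_text: str) -> str:
    """Turn 'x1,y1 x2,y2 ...' into a closed SVG path: 'Mx1,y1L x2,y2 ... x1,y1z'."""
    points = coord_text.split()
    if len(points) < 3:
        raise ValueError("a polygon needs at least three points")
    first, rest = points[0], points[1:]
    return "M" + first + "L " + " ".join(rest + [first]) + "z"


def regions_to_svg_paths(rs_xml: str, fill: str = "#cccccc") -> list:
    """Build one serialized <path> element per <r> region in the <rs> document."""
    root = ET.fromstring(rs_xml)
    paths = []
    for i, region in enumerate(root.findall("r"), start=1):
        path = ET.Element("path", {
            "id": f"a{i}",
            "title": region.findtext("name", default=""),
            "style": f"fill:{fill};",
            "d": coords_to_path(region.findtext("c", default="")),
        })
        paths.append(ET.tostring(path, encoding="unicode"))
    return paths


# Hypothetical three-point region in the same <rs>/<r>/<c> layout as the slide.
SAMPLE = "<rs><r><name>Demo</name><URL></URL><c>0,0 10,0 10,10 </c></r></rs>"
svg_paths = regions_to_svg_paths(SAMPLE)
```

The `scale(1, -1) translate(0, -7547)` transform on the enclosing group in the SVG listing flips the y axis, so the coordinates themselves can be copied through unchanged.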

Page 70: Arcot Rajasekar , Reagan Moore,  Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU


XML Map Viewer for the Herbicides Collection

Page 71: Arcot Rajasekar , Reagan Moore,  Bertram Ludäscher, Ilya Zaslavsky ZASLAVSK@SDSC.EDU


Conclusion

• Necessity & requirements of a virtual data grid
• SRB – a proven solution
– It is existing middleware
– Field-tested in multiple projects
– Proven scalability: users, data & resources
• New element of data grid: knowledge management
• Working solutions
– BIRN: the first real data grid complete with knowledge management and cross-ontology bridges
– Web services, to expose grid functionality in a uniform way
– Archiving data, information and knowledge as a grid activity
• www.npaci.edu/DICE/