BODHI, A Bio-diversity Database Pla(n)tform

Jayant Haritsa
Database Systems Lab, Supercomputer Education and Research Centre, Indian Institute of Science


Page 1: BODHI, A Bio-diversity Database Pla(n)tform

BODHI 1

BODHI, A Bio-diversity Database Pla(n)tform

Jayant Haritsa

Database Systems Lab

Supercomputer Education and Research Centre

Indian Institute of Science

Page 2: BODHI, A Bio-diversity Database Pla(n)tform

Team

B. J. Srikanta (next talk)

Prof. Madhav Gadgil and Prof. V. Nanjundiah (Centre for Ecological Sciences, IISc)

Several Masters students

Funded by DBT

Page 3: BODHI, A Bio-diversity Database Pla(n)tform

Motivation

GATT – Patent Laws
  To be in place by 2005

Loss
  Neem
  Basmati (estimated export value: Rs. 1,198 crore)
  Turmeric

Global and local efforts
  GBIF (Global Biodiversity Information Facility)
  Karnataka Bio-diversity Board [Deccan Herald - Aug 26 2000]

Page 4: BODHI, A Bio-diversity Database Pla(n)tform

Bio-diversity Data

Taxonomy of species
  Phenetic (physical) characteristics
  Phylogenetic (evolutionary) characteristics

Habitat / Spatial distribution
  Political Layout
  Geographic Layout
  Biospheres

Genetic information
  Bio-molecular sequences
  Structural information

Page 5: BODHI, A Bio-diversity Database Pla(n)tform

MULTI-DOMAIN QUERY

Retrieve all plant species that share a common habitat, have identical Inflorescence characteristics, and have a DNA sequence within BLAST score of 80, with respect to “Michelia-champa”.
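
A minimal sketch of how such a query spans the three domains, written in plain Python rather than in BODHI's own query language. Everything here is an assumption made for illustration: the `Species` record, the toy data, and `similarity_score`, which is only a stand-in for a real BLAST score. The sketch also reads "within BLAST score of 80" as "score of at least 80".

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Species:
    """Hypothetical record holding one attribute from each domain."""
    name: str
    habitats: frozenset   # spatial domain: regions where the species occurs
    inflorescence: str    # taxonomy domain: a phenetic characteristic
    dna: str              # genomic domain: a toy DNA sequence


def similarity_score(a: str, b: str) -> float:
    """Stand-in for a BLAST score (NOT BLAST): string similarity scaled to 0..100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def multi_domain_query(db, target_name, min_score=80.0):
    """Return names of species matching the target in all three domains."""
    target = next(s for s in db if s.name == target_name)
    return [
        s.name
        for s in db
        if s.name != target_name
        and s.habitats & target.habitats                      # shares a habitat
        and s.inflorescence == target.inflorescence          # identical inflorescence
        and similarity_score(s.dna, target.dna) >= min_score  # similar sequence
    ]


# Toy data, invented for this example only.
db = [
    Species("Michelia-champa", frozenset({"Western Ghats"}), "solitary", "ACGTACGTAC"),
    Species("species-A", frozenset({"Western Ghats", "Nilgiris"}), "solitary", "ACGTACGTAA"),
    Species("species-B", frozenset({"Western Ghats"}), "raceme", "ACGTACGTAC"),
    Species("species-C", frozenset({"Thar"}), "solitary", "ACGTACGTAC"),
    Species("species-D", frozenset({"Western Ghats"}), "solitary", "TTTTGGGGCC"),
]

matches = multi_domain_query(db, "Michelia-champa")
print(matches)
```

Each of the three predicates touches a different kind of data (spatial, taxonomic, genomic), which is why a single system that handles all three is attractive. The sequence comparison is the expensive step in this linear scan.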

Page 6: BODHI, A Bio-diversity Database Pla(n)tform

Difficulties:

Complex range of data types
  sets, hierarchies, aggregations, sequences, geometries, maps, audio, images …

Multidimensional data
  spatial (latitude, longitude, elevation) to proteins (hundreds of coordinates)

Computationally-intensive operators
  species relationships, spatial distributions, sequence alignments, ...

Page 7: BODHI, A Bio-diversity Database Pla(n)tform

Current Solutions

Small-Scale
  MS-Access / FoxPro / Excel / ...
  Pentium PCs

Large-Scale
  RDBMS: Oracle / DB2 / Informix / Sybase / …
  Unix servers: Sun / SGI / IBM / HP / ...

Page 8: BODHI, A Bio-diversity Database Pla(n)tform

Limitations:

RDBMS approach of “the world is a flat collection of tables with simple attributes” suits financial applications, NOT scientific (biological) applications

In particular, taxonomic / spatial / sequence / multimedia data modeling and processing are very cumbersome and coarse

Page 9: BODHI, A Bio-diversity Database Pla(n)tform

Limitations (contd)

Spatial and other applications are not within the database kernel but are connected externally.
  E.g. many GIS systems have ArcInfo and MS-Access hooked up in a “black-box” manner.
  Or, BLAST/FASTA utilizing sequence files generated from Oracle.

Problem: Slow and ugly!

Page 10: BODHI, A Bio-diversity Database Pla(n)tform

Is there Hope?

Object-Oriented DBMS
  “Natural” for biological applications

High-performance data access methods
  Path Dictionary Index, Multi-key Type Index, Pyramid Tree, ...

High-performance specialized operators
  spatial join, data mining, sequence processing, …

XML = HTML + Semantics

Page 11: BODHI, A Bio-diversity Database Pla(n)tform

Goals of BODHI

Seamless integration of taxonomic, spatial and genomic data using OO technology

Latest access methods and operators for all three types of data

Utilize XML for data exchange

Low-cost (ideally, free!)
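
To make the XML goal concrete, here is a small sketch of exchanging one species record as XML using Python's standard library. The element and attribute names (`species`, `family`, `habitats`, `region`) are hypothetical, chosen only for illustration. They are not BODHI's actual exchange schema, and the record values are illustrative.

```python
import xml.etree.ElementTree as ET


def species_to_xml(name, family, habitats):
    """Serialize a species record to an XML string (hypothetical schema)."""
    root = ET.Element("species", attrib={"name": name})
    ET.SubElement(root, "family").text = family
    habitats_el = ET.SubElement(root, "habitats")
    for region in habitats:
        ET.SubElement(habitats_el, "region").text = region
    return ET.tostring(root, encoding="unicode")


def species_from_xml(text):
    """Parse the XML string back into a plain dictionary."""
    root = ET.fromstring(text)
    return {
        "name": root.get("name"),
        "family": root.findtext("family"),
        "habitats": [r.text for r in root.iter("region")],
    }


xml_text = species_to_xml("Michelia-champa", "Magnoliaceae", ["Western Ghats"])
record = species_from_xml(xml_text)
print(xml_text)
print(record)
```

The tags carry the meaning of each value, so a receiving system can rebuild the record without knowing how the sender stores it.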

Page 12: BODHI, A Bio-diversity Database Pla(n)tform

Architecture of BODHI

[Architecture diagram. Labelled components:]

The Internet

Client Interface Framework

Query Processor

Taxonomy Model, Spatial Model, Genome Model

Object Operations, Spatial Operations, Genome Operations

Object Indexes, Spatial Indexes, Genome Indexes

Object Services, Spatial Services, Sequence Services

OBJECT STORAGE MANAGER

Page 13: BODHI, A Bio-diversity Database Pla(n)tform

Implementation of BODHI

[Implementation diagram: the architecture components filled in with concrete choices. Labelled components:]

The Internet

Client Interface Framework

λ-DB

Taxonomy: Species, Genera, Family, Order
  Operations: Inheritance, Aggregation
  Indexes: Multi-Key Type, Path-Dictionary

Spatial: Country, State, City, River, Road
  Operations: Overlaps, Contains, Closest, Within
  Indexes: R*-tree, Hilbert R-tree

Genome: DNA, Protein
  Operations: Alignment, BLAST, FASTA
  Indexes: ??? (next talk)

Spatial Services, Object Services, Sequence Services

Basic Types (Point, Line, Polygon, Sets, Sequences, ...)

SHORE MICRO-KERNEL
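
As an illustration of the spatial operators named on this slide, the sketch below implements Overlaps, Contains, Within and Closest on axis-aligned rectangles. It makes several assumptions: the `Rect` type and all coordinates are hypothetical, "Within" is read as the inverse of "Contains" (it could also mean "within a given distance"), and Closest is a linear scan, whereas an index such as an R*-tree would prune candidates first. This is not BODHI's code.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned bounding rectangle."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def overlaps(self, other):
        """True if the two rectangles share at least one point."""
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)

    def contains(self, other):
        """True if `other` lies entirely inside this rectangle."""
        return (self.xmin <= other.xmin and other.xmax <= self.xmax
                and self.ymin <= other.ymin and other.ymax <= self.ymax)

    def within(self, other):
        """True if this rectangle lies entirely inside `other`."""
        return other.contains(self)

    def distance(self, other):
        """Shortest distance between the rectangles (0 if they overlap)."""
        dx = max(other.xmin - self.xmax, self.xmin - other.xmax, 0.0)
        dy = max(other.ymin - self.ymax, self.ymin - other.ymax, 0.0)
        return hypot(dx, dy)


def closest(query, candidates):
    """Return the candidate nearest to `query` by a linear scan."""
    return min(candidates, key=query.distance)


# Made-up coordinates standing in for a state, a city, a river and a remote site.
state = Rect(0, 0, 10, 10)
city = Rect(2, 2, 3, 3)
river = Rect(8, 5, 15, 6)
remote = Rect(20, 20, 21, 21)

print(state.contains(city), city.within(state), state.overlaps(river))
print(closest(city, [river, remote]))
```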

Page 14: BODHI, A Bio-diversity Database Pla(n)tform

Query Flow

Page 15: BODHI, A Bio-diversity Database Pla(n)tform

Project Status

Prototype (minus Client Interface Framework) has been operational since last month!

Platform: PIII-700MHz running Red Hat Linux.

For code, contact “[email protected]”

Page 16: BODHI, A Bio-diversity Database Pla(n)tform

Performance Evaluation

SEQUOIA 2000 spatial benchmark: Competitive with Paradise GIS from Wisconsin

Taxonomy + Spatial Queries: Reasonably fast

But Genomics slows things down a lot due to absence of indexes (next talk)

Page 17: BODHI, A Bio-diversity Database Pla(n)tform

More details

“Design and Implementation of a Biodiversity Information System”, Proc. of Intl. Conf. on Management of Data (COMAD), Pune, December 2000

“The Building of BODHI, A Bio-diversity Database System”, TechRep-2001-02, DSL/SERC, IISc

Available at http://dsl.serc.iisc.ernet.in

Page 18: BODHI, A Bio-diversity Database Pla(n)tform

End of Talk