Discovery Net : A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services

Discovery Net : A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services Patrick Wendel Imperial College, London Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003


Page 1: Discovery Net :  A UK e-Science Pilot Project  for Grid-based Knowledge Discovery Services

Discovery Net: A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services

Patrick Wendel, Imperial College, London

Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003

Page 2

Why Discovery Net?

Data Challenge:
• Distributed, heterogeneous & large scale data sets
• Novel and real-time data sources

Resource Challenge:
• Novel specialised data analysis components/services continually being published/made available
• Computational resources provided

Information Challenge:
• Data cleaning, normalisation & calibration
• New data needs to be related to existing data

Knowledge Challenge:
• Collaborative, interactive & people-intensive
• Result interpretation & validation in relation to existing knowledge
• Knowledge sharing is key

Page 3

What is Discovery Net

Goal: Construct an infrastructure for global knowledge discovery services

Key Technologies:
• Grid and Distributed Computing
• Workflow and service composition
• Data Mining & Visualisation
• Data Access & Information Structuring
• High Throughput Screening Devices: real-time

Page 4

Discovery Net: Unifying the World’s Knowledge

Data Integration: Dynamic Real Time Construction of “Data Grids”

Application Integration: Component and Service-based Integration

People Integration: Global Discovery Groupware

Knowledge Integration: Multi-subject and Multi-modality Integrative Analysis to Cross-Validate and Annotate Related Discovery Work

Page 5

What is Discovery Net

• Using Distributed Resources
• Real Time Integration
• Dynamic Application Integration
• Workflow Construction
• Interactive Visual Analysis

Page 6

Discovery Net Layer Model (Life Science Application)

D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities

D-Net Middleware: Provides execution logic for distributed knowledge discovery and access to distributed resources

Computation & Data Resources: Distributed databases, compute servers and scientific devices.

[Figure labels: High Performance and Grid-enabled Transfer Protocol (GSI-FTP, DSTP..); Grid-enabled Infrastructure (GSI); Deployment: Web/Grid Services, OGSA]

Page 7

A Knowledge Grid based on D-Net Servers

[Figure: a knowledge grid of interconnected DNet servers. Each DNet server comprises Data access & Storage, InfoGrid, Components, Computation and Deployment behind the DNet API. Servers connect over the Internet to DNet clients, DNet participating clients and web clients, and to computational services and data sources (WWW, RDBMS). Other labels: DPML, XML, knowledge discovery services.]

• Several types of clients for different usage (from thin web client to participating client)
• Current implementation based on Java distributed objects (EJB), moving towards Web/Grid services
• But deployment and API access through standard Web/Grid services
• Goal: Plug & Play Data Sources, Analysis Components and Knowledge Discovery Processes

Page 8

Discovery Process Management

• Workflow based service composition
• Data-flow approach fits Knowledge Discovery process
• Allows scientists to develop processes.

Towards a Standard Workflow Representation for Discovery Informatics: Discovery Process Markup Language (DPML):
• Contains component data-flow graphs, but also
• Records collaboration information (user, changes)
• Records execution constraints (location, parameterisation)
• Becomes a key intellectual property: Discovery Processes can be stored, reused, audited, refined and deployed in various forms

D-Net Workflow for Genome Annotation: 16 services executing across Internet
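
The kinds of information this slide says a DPML document records (a component data-flow graph, collaboration records, execution constraints) can be pictured as a small data structure. The following is a minimal Python sketch under stated assumptions, not the DPML schema: the classes `Node` and `Workflow`, their fields, and the component names `fetch`, `mask` and `blast` are all hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass
class Node:
    """One component in the data-flow graph (hypothetical structure)."""
    name: str
    inputs: list = field(default_factory=list)       # names of upstream nodes
    parameters: dict = field(default_factory=dict)   # parameterisation
    location: Optional[str] = None                   # execution constraint, if any


@dataclass
class Workflow:
    """Data-flow graph plus a record of who changed what."""
    nodes: dict = field(default_factory=dict)
    history: list = field(default_factory=list)      # (user, change) pairs

    def add(self, node: Node, user: str) -> None:
        self.nodes[node.name] = node
        self.history.append((user, f"added {node.name}"))

    def execution_order(self) -> list:
        """Return node names so that every node comes after its inputs."""
        graph = {n.name: set(n.inputs) for n in self.nodes.values()}
        return list(TopologicalSorter(graph).static_order())


# Illustrative three-step chain; names are placeholders, not real services.
wf = Workflow()
wf.add(Node("fetch"), user="alice")
wf.add(Node("mask", inputs=["fetch"]), user="alice")
wf.add(Node("blast", inputs=["mask"], location="server-a"), user="bob")
order = wf.execution_order()
```

Keeping the change history next to the graph is what would make such a process description auditable and reusable in the sense the slide describes.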

Page 9

InfoGrid: Dynamic Data Integration

[Figure: Integrative Analysis across data domains: Chemistry, Gene, Protein / Targets, Biological Screening, Clinical, Journals. Data types shown include sequence, structure, location, function, expression, activity, protocols, toxicology, metabolic pathways, structures, libraries, catalogues, synthetic pathways, journals, project reports, patents, trials, patients.]

Dynamic Data Integration = On-demand access to heterogeneous data sources + information structuring

Towards a Dynamic Information Integration Methodology:
• Specialised Information Source Access: InfoGrid allows users to register, locate and connect to various specialised information sources.
• On-the-fly Integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC/HTTP-XML-XPath/Web Service).
• Easy Maintenance: Wrappers/Drivers to new data sources can be added through a clean API
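
The register / locate / connect pattern and on-the-fly integration described above can be illustrated with a toy registry. This is a hedged sketch, not the InfoGrid API: `SourceRegistry`, `integrate`, the tag scheme and the two in-memory "sources" are assumptions made for illustration, with plain functions standing in for real wrappers (JDBC, HTTP/XML, Web Service).

```python
from typing import Callable, Iterable

# A "wrapper" here is any function that takes a query string and yields rows.
Wrapper = Callable[[str], Iterable[dict]]


class SourceRegistry:
    """Hypothetical registry: register, locate and connect to data sources."""

    def __init__(self) -> None:
        self._sources = {}  # name -> (tags, wrapper)

    def register(self, name: str, tags: set, wrapper: Wrapper) -> None:
        self._sources[name] = (tags, wrapper)

    def locate(self, tag: str) -> list:
        """Return the sorted names of all sources carrying the given tag."""
        return sorted(n for n, (tags, _) in self._sources.items() if tag in tags)

    def connect(self, name: str) -> Wrapper:
        return self._sources[name][1]


def integrate(registry: SourceRegistry, tag: str, query: str) -> list:
    """Query every source with the tag and merge the rows, noting provenance."""
    rows = []
    for name in registry.locate(tag):
        for row in registry.connect(name)(query):
            rows.append({**row, "source": name})
    return rows


# Two in-memory stand-ins for real sources.
registry = SourceRegistry()
registry.register("genes_db", {"gene"}, lambda q: [{"id": q, "kind": "sequence"}])
registry.register("papers", {"gene", "journal"}, lambda q: [{"id": q, "kind": "abstract"}])
merged = integrate(registry, "gene", "BRCA1")
```

Adding a new source in this sketch means writing one wrapper function and registering it, which is the "easy maintenance" property the slide claims for wrappers and drivers.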

Page 10

Dynamic Application Integration Services

Dynamic Application Integration = On-demand access and composition of remote analysis components

Towards a Dynamic Component Integration:

• Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type).
• Execution service: allows users to control the execution of components in distributed environments
• Easy Maintenance: New components can be added through a clean API

[Figure: example components behind the D-NET API: Regression, Clustering, Classification, Gene function prediction, Homology Search, Promoter Prediction]
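
On-demand composition of registered components can be sketched in a few lines. The example below is an in-process stand-in under stated assumptions: `ComponentService` and `compose` are hypothetical names, and a real component service would dispatch to a remote Java component or Web Service rather than call a local function.

```python
from typing import Any, Callable


class ComponentService:
    """Hypothetical in-process stand-in for a remote component service."""

    def __init__(self) -> None:
        self._components = {}  # name -> callable

    def register(self, name: str, component: Callable[..., Any]) -> None:
        self._components[name] = component

    def locate(self, prefix: str) -> list:
        """Return the sorted names of components whose name starts with prefix."""
        return sorted(n for n in self._components if n.startswith(prefix))

    def execute(self, name: str, *args: Any) -> Any:
        # A real service would dispatch to a remote host; here we call locally.
        return self._components[name](*args)


def compose(service: ComponentService, names: list) -> Callable[[Any], Any]:
    """Chain components so each one consumes the previous one's output."""
    def pipeline(value: Any) -> Any:
        for name in names:
            value = service.execute(name, value)
        return value
    return pipeline


# Toy string components in place of real analysis services.
service = ComponentService()
service.register("text.upper", str.upper)
service.register("text.reverse", lambda s: s[::-1])
result = compose(service, ["text.upper", "text.reverse"])("acgt")
```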

Page 11

Discovery Deployment

[Figure: a Discovery Process in DPML deployed as a Discovery Service, Batch processing, Report or Discovery Component]

Discovery Deployment = On-demand rapid application construction and publishing

Towards a Dynamic Deployment of Knowledge Discovery Procedures:
• Deployment Engine: allows users to build and publish applications based on DPML code that coordinates remotely executed components, as a Web page, a Web/Grid Service or a command line tool.
• Easy Maintenance: New discovery procedures are described in DPML, a standardised representation of “composed” discovery procedures
• Storage & Reporting Servers: allow users to share DPML procedures and to generate workflow audit reports.

Page 12

Knowledge Integration & Interpretation

Dynamic Knowledge Interpretation = cross-reference and verify analysis results against background knowledge

Towards a Knowledge Integration Framework: Multi-subject data analysis

• Specialised Client Interfaces: Interactive Analysis and dynamic component interaction
• Result Annotation, Structuring and Storage: Information source query, result browsing, sharing and markup

[Figure: life science example application combining Sequence Analysis, Text Mining, Genetic Analysis and Pathway Analysis]

Page 13

Workflow execution

Component execution location resolution
• User list of known resources
• A component can require explicitly to be executed on a particular resource
• A component can choose from a set of resources proposed (and could use Grid resource information systems and network weather information to determine where to go)

For unconstrained components, simple “near the data” execution policy:
• If single input data location then execute there
• Otherwise fallback to original execution location

Allows usual DPKD workflows to be designed

Handles data management and transfer (serialisation, Java based, FTP based)
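
The location resolution rules on this slide amount to a short decision procedure. Below is a minimal Python sketch under stated assumptions: the function name `resolve_location` and its parameters are hypothetical, "single input data location" is read as "all inputs live on the same resource", and the case where a component picks from a proposed set of resources is left out.

```python
from typing import Optional, Sequence


def resolve_location(
    input_locations: Sequence[str],
    origin: str,
    required: Optional[str] = None,
) -> str:
    """Pick where a component should run.

    input_locations: where each input data set currently lives.
    origin: where the workflow execution was started.
    required: resource the component explicitly demands, if any.
    """
    if required is not None:
        # Explicit constraint set by the component wins.
        return required
    distinct = set(input_locations)
    if len(distinct) == 1:
        # All input data sits in one place: run next to it.
        return distinct.pop()
    # No inputs, or inputs spread over several resources: fall back.
    return origin


near_data = resolve_location(["db1"], origin="client")
fallback = resolve_location(["db1", "db2"], origin="client")
pinned = resolve_location(["db1"], origin="client", required="cluster")
```

Running near a single data location avoids moving the data at all; when inputs are spread out, this simple policy does not try to minimise transfer and just returns to the original location.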

Page 14

Discovery Net and Grid technologies

Cluster/Campus Grid level:
• Partial or complete workflow execution on Condor / SGE
• Task farming on subset of the workflow

Global Grid:
• GSI integration (Java CoG Kit)
• GSI-FTP transfer functionality (Java CoG Kit)
• OGSA Grid Service access to functionalities (GT3)
• Potential use of GRIS or NWS in component implementation

Globus scheduler? Unicore? SRB?

Page 15

Discovery Net Application Testbeds

Life Science Testbed: Gene sequencing, Protein Chips
• High Throughput real-time genome annotation testbed: analyse and interpret new sequences using existing distributed bioinformatics tools and databases

Environmental Modelling: Pollution Sensors (GUSTO): SO2, Benzene, ..
• High Throughput real-time pollution monitoring testbed: analyse, interpret time-resolved correlations among remote stations, and with other environmental data sets

Geo-hazard Prediction: Multi-spectral, multi-temporal satellite imagery
• Real-time geo-hazard prediction testbed: analyse, interpret satellite images with other data sets to generate thematic knowledge

[Figure: GUSTO units with wireless connectivity]

Page 16

Case Study: SC2002 HPC Challenge

D-Net based Global Collaborative Real-Time Genome Annotation (15 DBs, 21 Applications)

[Figure, reconstructed from slide labels: sequences from High Throughput Sequencers pass through three levels of annotation.]

Nucleotide-level Annotation
• Tools: blast, genscan, RepeatMasker, grail, E-PCR
• Identify: genes, gene markers, tRNAs, rRNAs, non-translated RNAs, regulatory regions, repetitive elements, segmental duplication, SNP variations, literature references, …

Protein-level Annotation
• Tools: 3D-PSSM, blast, Motif Search, PFAM, DSC, predator, InterPro, SMART, SWISS-PROT
• Identify: functional characterisation, homologues, domain 3-D structure, fold prediction, secondary structure, literature references, …
• Classify proteins into protein families

Process-level Annotation
• Tools: Ontologies, Pathway Maps, GeneMaps, AmiGO, GenNav, virtual chip
• Relate: cell cycle, metabolism, drugs, biological process, cell death, embryogenesis, literature references, …

Other labels: Identify organism, chromosomes, organism’s DNA

Databases: NCBI, EMBL, TIGR, SNP, GO, CSNDB, GK, KEGG

Page 17

Nucleotide Annotation Workflows

How It Works
• Download sequence from Reference Server
• Save to Distributed Annotation Server
• Interactive Editor & Visualisation
• Execute distributed annotation workflow

Databases shown: NCBI, EMBL, TIGR, SNP, InterPro, SMART, SWISS-PROT, GO, KEGG

1800 clicks, 500 Web accesses, 200 copy/pastes and 3 weeks of work, in 1 workflow and a few seconds of execution

Page 18

Conclusion and Future Work

Towards an open integration platform that enables scientists to conduct their KD activities
• Several levels of integration required
• Enable use of available resources

Evolution towards
• Cost model integration (performance, value, QoS)
• Semantic based service retrieval and composition
• Other useful standards? (OGSA-DAI?)