Grids for Chemical Informatics Chemistry, IU Bloomington Oct. 21 2005 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 [email protected] http://www.infomall.org


Page 1: Grids for Chemical Informatics

Chemistry, IU Bloomington

Oct. 21 2005

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University Bloomington IN 47401

[email protected]

http://www.infomall.org

Page 2: Why are Grids Important

• Grids are important for Chemistry because they support key functionalities that grow in importance as we are deluged with data from instruments and simulations
• Grids provide information access, storage and management
• Grids manage multiple simulations with different defining parameters
• Grids allow complex workflows with data flowing between filters
• Grids define models for portals
• Grids are built on top of commodity web service technology with broad industry support – the next generation information technology
• Grids are used in multiple NIH and other life science/chemistry projects across the world (BIRN, caBIG, myGrid, Comb-e-Chem)

Page 3: Internet Scale Distributed Services

• Grids use Internet technology and are distinguished by managing or organizing sets of network connected resources
  – Classic Web allows independent one-to-one access to individual resources
  – Grids integrate together and manage multiple Internet-connected resources: people, sensors, computers, data systems
• Organization can be explicit as in
  – TeraGrid which federates many supercomputers;
  – Deep Web Technologies IR Grid which federates multiple data resources;
  – CrisisGrid which federates first responders, commanders, sensors, GIS, (Tsunami) simulations, science/public data
• Organization can be implicit as in Internet resources such as curated databases and simulation resources that “harmonize a community”

Page 4: Different Visions of the Grid

• Grid just refers to the technologies
  – Or Grids represent the full system/Applications
• DoD’s vision of Network Centric Computing can be considered a Grid (linking sensors, warfighters, commanders, backend resources) and they are building the GiG (Global Information Grid)
• Utility Computing or X-on-demand (X=data, computer ..) is a major computer industry interest in Grids and this is a key part of enterprise or campus Grids
• e-Science or Cyberinfrastructure are virtual organization Grids supporting global distributed science (note sensors, instruments and people are all distributed)
• Skype (Kazaa) VOIP system is a Peer-to-peer Grid (and VRVS/GlobalMMCS-like Internet A/V conferencing systems are Collaboration Grids)
• Commercial 3G cell phones and the DoD ad-hoc network initiative are forming mobile Grids

Page 5: Types of Computing Grids

• Running “Pleasingly Parallel Jobs” as in United Devices, Entropia (Desktop Grid) “cycle stealing systems”
• Can be managed (“inside” the enterprise as in Condor) or more informal (as in SETI@Home)
• Computing-on-demand in Industry where jobs spawned are perhaps very large (SAP, Oracle …)
• Support distributed file systems as in Legion (Avaki), Globus with (web-enhanced) UNIX programming paradigm
  – Particle Physics will run some 30,000 simultaneous jobs
• Distributed Simulation HLA/RTI style Grids
• Linking Supercomputers as in TeraGrid
• Pipelined applications linking data/instruments, compute, visualization
• Seamless Access where Grid portals allow one to choose one of multiple resources with a common interface
• Parallel Computing typically NOT suited for a Grid (latency)
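A minimal sketch of the “pleasingly parallel” pattern, assuming each job depends only on its own parameter. The function names and the toy energy formula are illustrative and not from the talk; a local thread pool stands in for the desktops or cluster nodes a real Grid would use.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(temperature):
    """Stand-in for one independent simulation; returns a toy 'energy'."""
    return temperature, 0.5 * temperature ** 2


def sweep(parameters, max_workers=4):
    """Run every job independently; no job communicates with any other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_job, parameters))


results = sweep([100, 200, 300])
```

Because the jobs never exchange data, network latency between workers does not matter, which is why this style suits a Grid while tightly coupled parallel computing does not.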

Page 6: Old Style Metacomputing Grid

[Figure labels: Imaging Instruments; Computational Resources (Large Scale Parallel Computers); Large-Scale Databases (Large Disks); Data Acquisition, Analysis; Advanced Visualization; Analysis and Visualization]

• Original: Spread a single large problem over multiple supercomputers
• Now-1: Control multiple smallish jobs each on independent computers
• Now-2: Choose which of a few supercomputers to use

Page 7: Towards an International Compute Grid Infrastructure

[Figure: network map. Labels: Computation; US TeraGrid; UK NGS; Starlight (Chicago); Netherlight (Amsterdam); UKLight; sites PSC, SDSC, NCSA, Leeds, Manchester, Oxford, RAL, UCL; SC05; Network PoP; Service Registry; Steering clients]

• Local laptops in Seattle and UK
• All sites connected by production network (not all shown)

Page 8: Information/Knowledge Grids

• Distributed (10’s to 1000’s) of data sources (instruments, file systems, curated databases …)
• Data Deluge: 1 (now) to 100’s petabytes/year (2012)
  – Moore’s law for Sensors
• Possible filters assigned dynamically (on-demand)
  – Run image processing algorithm on telescope image
  – Run Gene sequencing algorithm on compiled data
• Needs decision support front end with “what-if” simulations
• Metadata (provenance) critical to annotate data
• Integrate across experiments as in multi-wavelength astronomy
• Data Deluge comes from pixels/year available
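One way to picture “filters assigned dynamically” is a lookup table that picks the filter from the kind of data that arrives. This is only a sketch: the registry, the two toy filters and the data kinds are hypothetical stand-ins for real image processing or sequence analysis services.

```python
FILTERS = {}


def register(kind):
    """Decorator that records a filter under the data kind it handles."""
    def wrap(fn):
        FILTERS[kind] = fn
        return fn
    return wrap


@register("telescope_image")
def threshold_pixels(pixels):
    # Toy image filter: keep only pixels brighter than 10.
    return [p for p in pixels if p > 10]


@register("sequence")
def gc_fraction(seq):
    # Toy sequence filter: fraction of G and C bases.
    return sum(base in "GC" for base in seq) / len(seq)


def apply_filter(kind, data):
    """Choose the filter on demand from the kind of incoming data."""
    return FILTERS[kind](data)
```

New filters can be added at run time by registering another function, without touching the dispatch code.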

Page 9: Data Deluged Science

• Now particle physics will get 100 petabytes from CERN using around 30,000 CPU’s simultaneously 24X7
• Exponential growth in data and compare to:
  – The Bible = 5 Megabytes
  – Annual refereed papers = 1 Terabyte
  – Library of Congress = 20 Terabytes
  – Internet Archive (1996 – 2002) = 100 Terabytes
• Weather, climate, solid earth (EarthScope)
• Bioinformatics curated databases (Biocomplexity only 1000’s of data points at present)
• Virtual Observatory and SkyServer in Astronomy
• Environmental Sensor nets
• In the past, HPCC community worried about data in the form of parallel I/O or MPI-IO, but we didn’t consider it as an enabler of new science and new ways of computing
• Data assimilation was not central to HPCC
• DoE ASCI set up because didn’t want test data!
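The size comparison on this slide can be worked through directly. The sketch below assumes decimal units (1 Megabyte = 10^6 bytes, 1 Terabyte = 10^12, 1 Petabyte = 10^15) and uses only the figures quoted above.

```python
MB, TB, PB = 10**6, 10**12, 10**15

# Sizes quoted on the slide, in bytes.
sizes = {
    "The Bible": 5 * MB,
    "Annual refereed papers": 1 * TB,
    "Library of Congress": 20 * TB,
    "Internet Archive (1996-2002)": 100 * TB,
}
cern = 100 * PB

# How many copies of each collection fit in the CERN data volume.
ratios = {name: cern // size for name, size in sizes.items()}
```

Under these assumptions the CERN volume is 5,000 times the Library of Congress figure and 1,000 times the Internet Archive figure.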

Page 10: Virtual Observatory Astronomy Grid

Integrate Experiments

[Figure panels: Radio; Far-Infrared; Visible; Visible + X-ray; Dust Map; Galaxy Density Map]

Page 11: International Virtual Observatory Alliance

• Reached international agreements on Astronomical Data Query Language, VOTable 1.1, UCD 1+, Resource Metadata Schema

• Image Access Protocol, Spectral Access Protocol and Spectral Data Model, Space-Time Coordinates definitions and schema

• Interoperable registries by Jan 2005 (NVO, AstroGrid, AVO, JVO) using OAI publishing and harvesting

• So each Community of Interest builds data AND service standards that build on GS-* and WS-*

Page 12: myGrid Project

• Imminent ‘deluge’ of data
• Highly heterogeneous
• Highly complex and inter-related
• Convergence of data and literature archives

Page 13: The Williams Workflows

[Figure: three workflows labelled A, B and C]

• A: Identification of overlapping sequence
• B: Characterisation of nucleotide sequence
• C: Characterisation of protein sequence

Page 14: Web services

• Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on the SOA (service oriented architecture) principles.
• Web Services interact by exchanging messages in SOAP format
• The contracts for the message exchanges that implement those interactions are described via WSDL interfaces.

[Figure labels: resources (Databases, Humans, Programs, Computational resources, Devices); service logic (BPEL, Java, .NET); message processing (SOAP and WSDL); SOAP messages]

<env:Envelope> <env:Header> ... </env:Header> <env:Body> ... </env:Body> </env:Envelope>
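A hedged sketch of what a receiving service does with such an envelope, using only the Python standard library. The SOAP 1.2 namespace URI is the standard one; the header and body contents (`requestId`, `getMolecule`, `name`) are invented for illustration.

```python
import xml.etree.ElementTree as ET

ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace

# Hypothetical message: header and body contents are made up for the example.
message = f"""<env:Envelope xmlns:env="{ENV}">
  <env:Header><requestId>42</requestId></env:Header>
  <env:Body><getMolecule><name>benzene</name></getMolecule></env:Body>
</env:Envelope>"""

root = ET.fromstring(message)
header = root.find(f"{{{ENV}}}Header")
body = root.find(f"{{{ENV}}}Body")

# The first child of the Body names the operation being requested.
operation = body[0].tag
argument = body[0].findtext("name")
```

The service logic only needs `operation` and `argument`; everything about transport and metadata stays in the envelope and header.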

Page 15: A typical Web Service

• In principle, services can be in any language (Fortran .. Java .. Perl .. Python) and the interfaces can be method calls, Java RMI Messages, CGI Web invocations, totally compiled away (inlining)
• The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python

[Figure labels: Portal Service; Security; Catalog; Payment (Credit Card); Warehouse (Shipping control); Web Services connected through WSDL interfaces]

Page 16: Two-level Programming I

• The Web Service (Grid) paradigm implicitly assumes a two-level Programming Model
• We make a Service (same as a “distributed object” or “computer program” running on a remote computer) using conventional technologies
  – C++, Java or Fortran Monte Carlo module
  – Data streaming from a sensor or Satellite
  – Specialized (JDBC) database access
• Such services accept and produce data from users, files and databases
• The Grid is built by coordinating such services, assuming we have solved the problem of programming the service

[Figure: Service exchanging Data]
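To make the lower level concrete, here is a sketch of a conventional Monte Carlo module wrapped behind a message-in, message-out interface. JSON text stands in for the XML messages a real Web Service would exchange, and the function names and request fields are assumptions made for this example.

```python
import json
import random


def estimate_pi(samples, seed):
    """Conventional Monte Carlo module: estimate pi from random points."""
    rng = random.Random(seed)
    inside = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
                 for _ in range(samples))
    return 4.0 * inside / samples


def pi_service(request_text):
    """Service wrapper: accepts a request message, returns a reply message."""
    request = json.loads(request_text)
    value = estimate_pi(request["samples"], request["seed"])
    return json.dumps({"pi": value})


reply = json.loads(pi_service('{"samples": 20000, "seed": 1}'))
```

The caller never sees `estimate_pi`; it only sees the message contract, which is the point of the two-level model.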

Page 17: Two-level Programming II

• The Grid is discussing the composition of distributed services with the runtime interfaces to Grid as opposed to UNIX pipes/data streams
• Familiar from use of UNIX Shell, PERL or Python scripts to produce real applications from core programs
• Such interpretative environments are the single processor analog of Grid Programming
• Some projects like GrADS from Rice University are looking at integration between service and composition levels but dominant effort looks at each level separately

[Figure: Service1, Service2, Service3 and Service4 linked together]
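The upper level, composition, can be sketched in the same single-processor spirit the slide mentions: a script that chains services the way a shell chains programs with pipes. The three toy services below are hypothetical; in a Grid each would be a remote call.

```python
from functools import reduce


def compose(*services):
    """Chain services left to right, like programs joined by UNIX pipes."""
    return lambda data: reduce(lambda value, service: service(value),
                               services, data)


def clean(values):
    # Drop missing readings.
    return [v for v in values if v is not None]


def scale(values):
    # Apply a calibration factor of 2.
    return [2 * v for v in values]


def summarize(values):
    return {"count": len(values), "total": sum(values)}


workflow = compose(clean, scale, summarize)
result = workflow([1, None, 3])
```

Swapping or reordering stages changes the application without touching the code inside any service.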

Page 18: Grid of Grids: Research Grid and Education Grid

[Figure: SERVOGrid. Labels: Sensors; Streaming Data; Field Trip Data; Database; Repositories, Federated Databases; Data Filter Services; Research Simulations; Discovery Services; Customization Services; Analysis and Visualization Portal; Research; Education; From Research to Education; Education Grid; Computer Farm; GIS Grid; Sensor Grid; Database Grid; Compute Grid]

Page 19: SERVOGrid Requirements

• Seamless Access to Data repositories and large scale computers
• Integration of multiple data sources including sensors, databases, file systems with analysis system
  – Including filtered OGSA-DAI (Grid database access)
• Rich meta-data generation and access with SERVOGrid specific Schema extending openGIS (Geography as a Web service) standards and using Semantic Grid
• Portals with component model for user interfaces and web control of all capabilities
• Collaboration to support world-wide work
• Basic Grid tools: workflow and notification
• NOT metacomputing

Page 20: SERVOGrid Portal Screen Shots

Page 21:

[Figure: layered Grid architecture; labels follow]

Key: “n: Service” refers to core services identified by DoD; CoI = Community of Interest; GIS = Geographical Information System

• Physical Network
• Core Low Level Grid Services
• Numbered services: 1: Management; 2: Security; 3: Messaging; 4: Discovery; 5: Mediation; 6: Collaboration Grid; 7: Portals; 8: Data Access/Storage; 9: Application Services; 10: Policy (ECS); 11: Metadata
• Information Grid; Sensor Grid; Compute Grid; GIS Grid
• CoI Specific Grids/Services: Earthquake Grid, DoD NCOW Grid, …
• Earthquake Data & Simulation Service; ServoIS
• C2 (JBI CEE etc.); NCOW-IS Services

Page 22:

[Figure: layered Grid architecture for Bio and Chemical Informatics; labels follow]

Key: M(B,C)IS = Molecular (Bio, Chem) Information System

• Physical Network
• Core Low Level Grid Services
• Numbered services: 1: Management; 2: Security; 3: Messaging; 4: Discovery; 5: Workflow; 6: Collaboration Grid; 7: Portals; 8: Data Access/Storage; 9: Application Services; 10: Policy; 11: Metadata
• Information Grid; Instrument Grid; Compute Grid; MIS Grid
• Domain Specific Grids/Services: BioInformatics Grid, Chemical Informatics Grid, …
• HTS Tools; Quantum Calculations; CIS
• Sequencing Tools; Biocomplexity Simulations; BIS

Page 23: GIS Grid with WMS, WFS, data sources and GML

[Figure: a WMS client sends GetFeature requests to a WFS Server and receives a FeatureCollection. The WFS Server issues SQL queries against data sources for Railroads [a-b], Rivers [a-d], Bridges [1-5] and Interstate Highways [12-18].]

<gml:featureMember>
  <fault>
    <name>Northridge2</name>
    <segment>Northridge2</segment>
    <author>Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates>-118.72,34.243 -118.591,34.176</gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>

GML becomes CML, CellML, SBML
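As a sketch of how a client might read the feature shown on this slide, the fragment can be parsed with the Python standard library. The slide's fragment does not declare the `gml` prefix, so the example adds the usual OGC GML namespace URI; that declaration is an assumption, not part of the original.

```python
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"  # assumed namespace; not shown on the slide

doc = f"""<gml:featureMember xmlns:gml="{GML}">
  <fault>
    <name>Northridge2</name>
    <segment>Northridge2</segment>
    <author>Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates>-118.72,34.243 -118.591,34.176</gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>"""

member = ET.fromstring(doc)
fault = member.find("fault")

# Coordinates are space-separated "lon,lat" pairs.
text = fault.find(f".//{{{GML}}}coordinates").text
points = [tuple(float(x) for x in pair.split(",")) for pair in text.split()]
```

The same pattern would apply to a chemistry dialect such as CML: the transport and parsing machinery stays the same and only the element vocabulary changes.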

Page 24: Electric Power and Natural Gas data from LANL Interdependent Critical Infrastructure Simulations

[Screenshot: map viewer with controls: Zoom-in; Zoom-out; FeatureInfo mode; Measure distance mode; Clear Distance; Drag and Drop mode; Refresh to initial map]

Page 25: Integrating Archived Web Feature Services and Google Maps

Google maps can be integrated with Web Feature Service Archives to filter and browse seismic records.

Page 26: What is Happening?

• Grid ideas are being developed in (at least) four communities
  – Web Service – W3C, OASIS, (DMTF)
  – Grid Forum (High Performance Computing, e-Science)
  – Enterprise Grid Alliance (Commercial “Grid Forum” with a near term focus)
• Service Standards are being debated
• Grid Operational Infrastructure is being deployed
• Grid Architecture and core software being developed
  – Apache has several important projects as do academia; large and small companies
• Particular System Services are being developed “centrally” – OGSA or GS-* framework for this in GGF; WS-* for OASIS/W3C/Microsoft-IBM
• Lots of fields are setting domain specific standards and building domain specific services
• USA started but now Europe is probably in the lead and Asia will soon catch USA if momentum (roughly zero for USA) continues

Page 27: The Grid and Web Service Institutional Hierarchy

• 1: Container and Run Time (Hosting) Environment
• 2: System Services and Features: Handlers like WS-RM, Security, Programming Models like BPEL or Registries like UDDI
• 3: Generally Useful Services and Features: such as “Access a Database” or “Submit a Job” or “Manage Cluster” or “Support a Portal” or “Collaborative Visualization”
• 4: Application or Community of Interest Specific Services such as “Run BLAST” or “Look at Houses for sale”

[Side labels: OGSA GS-* and some WS-* (GGF/W3C/….); WS-* from OASIS/W3C/Industry; Apache Axis, .NET etc.]

Must set standards to get interoperability

Page 28: Location of software for Grid Projects in Community Grids Laboratory

• http://www.naradabrokering.org provides Web service (and JMS) compliant distributed publish-subscribe messaging (software overlay network)
• http://www.globlmmcs.org is a service oriented (Grid) collaboration environment (audio-video conferencing)
• http://www.crisisgrid.org is an OGC (Open Geospatial Consortium) compliant Geographical Information System (GIS) and Sensor Grid (with POLIS center)
• http://www.opengrids.org has WS-Context, Extended UDDI etc.
• The work is still in progress but NaradaBrokering is quite mature
• All software is open source and freely available

Page 29: Project Goals

• Establish Requirements from stakeholders
  – Research
  – Pharmaceutical Industry
  – Government
• Consider educational implications
  – e-Science v Bio/Chem/Molecular Informatics
• Consider other national and international projects to ensure we either lead or use best practice
• Design a Grid architecture and staged implementation
• Start pilot projects led by Chemistry/Chemical Informatics
• Evaluate and iterate
• Design and implement ?(Chem, Life Science, Science, Molecular) Informatics educational program that will attract students
• Write winning center grant in 2006-7

Page 30: Web Services Introduction

• What are “Web Services”?
  – A distributed invocation system built on Grid computing
    • Independent of platform and programming language
    • Built on existing Web standards
  – A service oriented architecture with
    • Interfaces based on Internet protocols
    • Messages in XML (except for binary data attachments)

Page 31: Web Services Introduction

• A web-based architecture providing for interoperability among resources
  – Centralized service registry
  – Solves problems associated with finding, using, and combining online resources
• Employ standard Internet protocols for:
  – Communication with resources
  – Automated discovery using centralized registries
• Communicate with devices, people, and each other with the protocols and computer languages

Page 32: Service Oriented Architecture (SOA)

• Goal is to achieve loose coupling among interacting software agents

• Define service: a unit of work done by a service provider to achieve desired end results for a service consumer

• Both provider and consumer are roles played by software agents on behalf of their owners.

Page 33: How does SOA work?

• Two architectural constraints are employed
  – Small set of simple and ubiquitous interfaces to all participating software agents
  – Descriptive messages constrained by an extensible schema delivered through the interfaces
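The two constraints can be sketched as one generic `handle(message)` entry point plus a small schema that every message must satisfy. The schema fields, the `Agent` class and the `molecular_weight` action are hypothetical; “extensible” is modelled by allowing extra fields beyond the required ones.

```python
# Required fields and their types; extra fields are permitted (extensible).
SCHEMA = {"action": str, "payload": dict}


def validate(message, schema=SCHEMA):
    """Reject messages that lack a required field or use the wrong type."""
    for field, kind in schema.items():
        if not isinstance(message.get(field), kind):
            raise ValueError(f"bad or missing field: {field}")
    return message


class Agent:
    """Every agent exposes the same single interface: handle(message)."""

    def __init__(self, handlers):
        self.handlers = handlers

    def handle(self, message):
        validate(message)
        return self.handlers[message["action"]](message["payload"])


agent = Agent({"molecular_weight": lambda p: {"weight": sum(p["masses"])}})
reply = agent.handle({
    "action": "molecular_weight",
    "payload": {"masses": [12.0, 16.0, 16.0]},
    "note": "extra fields are allowed",
})
```

All application-specific meaning lives in the message, not in the interface, which is what keeps the coupling loose.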

Page 34: Web Services Architectures

• Individual services are registered globally
  – Broken down into individual services with inputs and outputs specified
• Services are published
• Services are requested
• Open registry, publishing, and requesting
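A toy in-memory registry illustrates the publish and request steps. It is a sketch only: the class, the service names and the example.org endpoints are invented, and a real deployment would use a registry standard such as UDDI rather than a Python dictionary.

```python
class Registry:
    """In-memory stand-in for a global service registry."""

    def __init__(self):
        self._entries = {}

    def publish(self, name, inputs, outputs, endpoint):
        """Register a service together with its declared inputs and outputs."""
        self._entries[name] = {
            "inputs": set(inputs),
            "outputs": set(outputs),
            "endpoint": endpoint,
        }

    def request(self, have, want):
        """Names of services we can feed with `have` that produce `want`."""
        return sorted(
            name for name, entry in self._entries.items()
            if entry["inputs"] <= set(have) and set(want) <= entry["outputs"]
        )


registry = Registry()
registry.publish("smiles_to_inchi", ["smiles"], ["inchi"],
                 "http://example.org/convert")
registry.publish("dock", ["structure", "target"], ["score"],
                 "http://example.org/dock")
matches = registry.request(have=["smiles"], want=["inchi"])
```

Because inputs and outputs are declared at publication time, a requester can discover suitable services without knowing their names in advance.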

Page 35: Service-Oriented Architecture

• From Curcin et al. DDT, 2005, 10(12),867

Page 36: Web Services for Science

• Invisible Services, Semantic Web, and Grid
• Easy-to-use tools for any scientist
• High throughput, resource intensive computing done for low cost/resources
• Shared community
  – Collaborations between labs and fields
  – Shared data
  – Shared tools

Page 37: e-Science and the Grid 1

• e-Science: Major UK Program
  – global collaboration in key areas of science and the next generation of infrastructure that will enable it
  – reflects growing importance of international laboratories, satellites and sensors and their integrated analysis by distributed teams
  – total investment of some £200M over the five-year period from 2001 to 2006
• CyberInfrastructure: the analogous US initiative
• Grid Technology: supports e-Science & Cyberinfrastructure

Page 38: Basic Architectures: Servlets/CGI and Web Services

[Figure, Servlets/CGI: Browser; HTTP GET/POST; Web Server; JDBC; DB or MPI Appl.]

[Figure, Web Services: Browser and GUI Client; Web Server; SOAP exchanged through WSDL interfaces; Web Server; JDBC; DB or MPI Appl.]

Page 39: Importance of Web Services

• Building a true science community

• Enabling interoperability between tools and the integration of data

• Less time coding, more time for science

• Change the way scientists work by achieving new levels of integration

Page 40: When To Use Web Services?

• Applications do not have severe restrictions on reliability and speed.

• Two or more organizations need to cooperate.– One needs to write an application that uses another’s service.

• Services can be upgraded independently of clients.

• Services can be easily expressed with simple request/response semantics and simple state.

Page 41: Web Services Benefits

• Web services provide a clean separation between a capability and its user interface.

• Increase in productivity

• Increase in flexibility

• Rapid return on investment

• Integration across multiple applications

Page 42: Web Services Advantages

• Output in human- and computer-readable formats

• I/O formats based on standard Internet protocols

• Resources accessible server to server allow automated I/O

• Integration based on specific services: you select services or data needed without downloading the entire data set

Page 43: Web Services Advantages

• Description protocols provide details of service provided and interface components

• Semantic Web standards increase efficiency

• Use a central registry and standardized description of services

• Quality and status of the information is dynamically available

Page 44: Web Services Drawbacks

• Based on new technologies

• Time and commitment required to learn

• Standards still in a state of rapid flux

• Issues with quality of data, (and for chemistry, quantity of open data), security, and privacy

Page 45: Components of Web Services

• Protocols
  – SOAP
  – WSDL
  – UDDI
• XML as a basis for the protocols
• Ontologies
  – OWL: Web Ontology Language
• Semantic Web

Page 46: Components of the Semantic Web for Chemistry

• XML – eXtensible Markup Language
• RDF – Resource Description Framework
• RSS – Rich Site Summary
• Dublin Core – allows metadata-based newsfeeds
• OWL – for ontologies
• BPEL4WS – for workflow and web services
  – Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.

Page 47: SOAP: Simple Object Access Protocol

• Flexible protocol to communicate information between server and server or client and server using XML

• Supports Remote Procedure Calls

• Allows layers (security, authentication, transactions) over the basic SOAP elements
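The layering idea can be sketched by building an envelope whose Header carries extra blocks while the Body carries the call itself. The SOAP 1.2 namespace is standard; the `authToken` block, the `getSpectrum` call and the token value are hypothetical and do not follow any particular security specification.

```python
import xml.etree.ElementTree as ET

ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
ET.register_namespace("env", ENV)


def make_envelope(body_element, header_blocks=()):
    """Wrap a call in an Envelope; header blocks carry the layered extras."""
    envelope = ET.Element(f"{{{ENV}}}Envelope")
    header = ET.SubElement(envelope, f"{{{ENV}}}Header")
    for block in header_blocks:
        header.append(block)
    body = ET.SubElement(envelope, f"{{{ENV}}}Body")
    body.append(body_element)
    return envelope


# Hypothetical remote procedure call and a made-up security header block.
call = ET.Element("getSpectrum")
ET.SubElement(call, "compound").text = "caffeine"
token = ET.Element("authToken")
token.text = "hypothetical-token"

envelope = make_envelope(call, [token])
xml_text = ET.tostring(envelope, encoding="unicode")
```

Adding a transaction or authentication layer means adding another header block; the body, and the service logic that reads it, is unchanged.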

Page 48: WSDL: Web Services Description Language

• Describes a service’s interface to clients
• Services register themselves with Web Services
• WSDL describes how to contact and interact with services
  – I/O, operations and messages to aid interaction with client

Page 49: WSDL Overview

• An XML-based Interface Definition Language.
  – You can define the APIs for all of your services in WSDL.
• WSDL docs are broken into five major parts:
  – Data definitions (in XML) for custom types
  – Abstract message definitions (request, response)
  – Organization of messages into “ports” and “operations” (classes and methods).
  – Protocol bindings (to SOAP, for example)
  – Service point locations (URLs)
• Some interesting features
  – A single WSDL document can describe several versions of an interface.
  – A single WSDL doc can describe several related services.
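The five parts map onto the top-level elements of a WSDL 1.1 document: `types`, `message`, `portType`, `binding` and `service`. The skeleton below is deliberately incomplete (it would not pass schema validation) and its service and operation names are invented; it only shows where each part lives.

```python
import xml.etree.ElementTree as ET

WSDL = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace

# Skeleton only: element bodies are omitted and all names are hypothetical.
skeleton = f"""<definitions xmlns="{WSDL}" name="MoleculeService">
  <types/>
  <message name="getMoleculeRequest"/>
  <message name="getMoleculeResponse"/>
  <portType name="MoleculePort"><operation name="getMolecule"/></portType>
  <binding name="MoleculeSoapBinding" type="MoleculePort"/>
  <service name="MoleculeService"/>
</definitions>"""

root = ET.fromstring(skeleton)

# Strip the namespace to list the major parts in document order.
parts = [child.tag.split("}")[1] for child in root]
operations = [op.get("name") for op in root.iter(f"{{{WSDL}}}operation")]
```

A client toolkit reads these same sections to generate stubs: types and messages define the data, portType the methods, binding the wire protocol and service the URL.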

Page 50: UDDI: Universal Description, Discovery, and Integration

• Provides ways for clients and services to interact with other services
• Uses XML
• Defines the means of access, e.g.,
  – URL
  – E-Mail
• Defines services hosted by an entity
• Business-oriented tags
• Uses SOAP for communicating

Page 51: XML: eXtensible Markup Language

• Allows definitions of types of documents

• Tags are used to specify components of documents

• Allows specification of namespaces to differentiate between identical tag names

• Tag names do not provide semantics other than simple hierarchical relations
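A short sketch of how namespaces keep identical tag names apart. The two namespace URIs and the `title` elements are made up for the example; the parsing calls are from the Python standard library.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define <title>; the namespace tells them apart.
doc = """<record xmlns:chem="http://example.org/chem"
                 xmlns:pub="http://example.org/pub">
  <chem:title>Benzene</chem:title>
  <pub:title>Annual Report</pub:title>
</record>"""

root = ET.fromstring(doc)
ns = {"chem": "http://example.org/chem", "pub": "http://example.org/pub"}

chem_title = root.find("chem:title", ns).text
pub_title = root.find("pub:title", ns).text
```

The parser can tell the two `title` elements apart, but nothing in the XML says what either one means, which is the gap that ontologies such as OWL are meant to fill.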

Page 52: XML Overview

• A language for building languages
• Basic rules: be well formed and be valid
• Particular XML “dialects” are defined by XML schemas.
  – XML itself is defined by its own schema.
• Extensible via namespaces
• Many non-Web services dialects
  – RDF, SVG, GML, CML, XForms, XHTML
• Many basic tools available: parsers, XPath and XQuery for searching/querying, etc.
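Two of the basic tools can be tried with the Python standard library: a parser to test well-formedness and a path query to search a document. Note the assumptions: `xml.etree.ElementTree` supports only a subset of XPath and does not check validity against a schema, and the `molecules` catalog is invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the text parses as XML; says nothing about schema validity."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


catalog = (
    "<molecules>"
    "<molecule id='m1'><name>water</name></molecule>"
    "<molecule id='m2'><name>ethanol</name></molecule>"
    "</molecules>"
)

# XPath-style query: select the molecule whose id attribute is 'm2'.
hits = ET.fromstring(catalog).findall("./molecule[@id='m2']")
names = [m.findtext("name") for m in hits]
```

Checking validity, the second basic rule, needs a schema-aware validator, which the standard library does not provide.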

Page 53: XML and Web services

• XML lends itself to distributed computing:
  – It’s just a data description.
  – Platform, programming language independent
• Web Services Description Language (WSDL)
  – Describes how to invoke a service
  – Can bind to SOAP, other protocols for actual invocation
• Simple Object Access Protocol (SOAP)
  – Wire protocol extension for conveying RPC calls
  – Can be carried over HTTP, SMTP
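Carrying SOAP over HTTP amounts to an HTTP POST whose body is the envelope. The sketch below builds such a request in the SOAP 1.1 style (`text/xml` plus a `SOAPAction` header) without sending it; the URL, action name and placeholder envelope are hypothetical.

```python
import urllib.request


def build_soap_request(url, action, envelope_xml):
    """Build (but do not send) an HTTP POST that carries a SOAP envelope."""
    return urllib.request.Request(
        url,
        data=envelope_xml.encode("utf-8"),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": f'"{action}"',
        },
        method="POST",
    )


# Hypothetical endpoint and a placeholder envelope.
request = build_soap_request(
    "http://example.org/services/molecule", "getMolecule", "<Envelope/>"
)
```

Passing `request` to `urllib.request.urlopen` would perform the call against a real endpoint; the same envelope could equally be placed in an e-mail body for an SMTP binding.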

Page 54: OWL: Web Ontology Language

• Builds on RDF and RDFS and adds a means for richer descriptions of properties and classes
  – Disjoint classes
  – Cardinality of classes
  – Characteristics of relations, like symmetry
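The three kinds of constraint can be illustrated with plain Python sets. This is not OWL syntax and not a reasoner; it is a sketch of what each constraint asserts, with a made-up `bonded` relation and element classes as data.

```python
def is_symmetric(pairs):
    """A relation is symmetric if (a, b) always implies (b, a)."""
    pairs = set(pairs)
    return all((b, a) in pairs for a, b in pairs)


def are_disjoint(class_a, class_b):
    """Disjoint classes share no members."""
    return not (set(class_a) & set(class_b))


def satisfies_cardinality(pairs, subject, exactly):
    """Check that `subject` has exactly the stated number of related values."""
    return sum(1 for s, _ in set(pairs) if s == subject) == exactly


# Hypothetical data: a 'bonded to' relation and two element classes.
bonded = {("C1", "O1"), ("O1", "C1"), ("C1", "O2"), ("O2", "C1")}
metals = {"Fe", "Cu"}
noble_gases = {"He", "Ne"}
```

In OWL these statements are declared once in the ontology and a reasoner checks or infers them; here each check is run by hand.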

Page 55: Standards for Web Services

• Business Process Execution Language for Web Services (BPEL4WS)

• Ontology Web Language Semantics (OWL-S)

• Web Service Modeling Ontology (WSMO)

Page 56: Standards Setting Boards

• OASIS: Organization for the Advancement of Structured Information Standards
  – ebXML: e-business XML
  – UDDI: Universal Description, Discovery and Integration
• Global Grid Forum
  – community of users, developers, and vendors leading the global standardization effort for grid computing


Standards Setting Boards

• W3C: World Wide Web Consortium
  – OWL: Web Ontology Language
  – RDF/RDFS: Resource Description Framework/Schema
  – SOAP: Simple Object Access Protocol
  – URI/URL/URN: Uniform Resource Identifier/Locator/Name
  – WSDL: Web Services Description Language
  – XML: eXtensible Markup Language


SWWS: Semantic Web-Enabled Web Services

• Main objectives:
  – Provide a comprehensive Web Service description framework
  – Define a Web Service discovery framework
  – Provide a scalable Web Service mediation middleware
• A program of the European Commission to run 2002-2005
  – http://swws.semanticweb.org


Web Services Integration Projects: Biosciences

• myGrid
  – http://www.mygrid.org.uk/
• BIOPIPE
  – http://biopipe.org/
• BioMOBY
  – http://biomoby.org/


Web Services for Chemistry: Problems

• Performance and scalability
• Proprietary data
• Competition from high-performance desktop applications
-- Geoff Hutchison, it’s a puzzle blog, 2005-01-05
• ALSO:
  – Lack of a substantial body of trustworthy Open Access databases
  – Non-standard chemical data formats (over 40 in regular use and requiring normalization to one another)


Missing Ingredients in Chemistry

• Chemical communities to assemble Open Access databases
  – Well-defined quality assurance procedures performed by distributed peer-review systems
  – Software underlying the databases needs to be open source.


Chemistry Databases on the Web

• Marc Nicklaus lists 37 databases as of October 2001
  – Must have structure searching and at least 100 molecules
  – http://cactus.nci.nih.gov/ncidb2/chem_www.html
• SoaringBear’s List has 15 databases
  – http://geocities.com/soaringbear/biomed/chem.html


Institutional Repositories

• NARSTO Quality Systems Science Center
  – http://cdiac.esd.ornl.gov/programs/NARSTO/
  – Pollutant species in the troposphere over North America
  – Part of the Carbon Dioxide Information Analysis Center at ORNL
  – NARSTO Data and Information Sharing Tool
    • http://mercury.ornl.gov/narsto/


Public Data Repositories

• Developmental Therapeutics Program/NCI
  – Some assay data for download
  – Structures for over 200,000 compounds
    • http://dtp.nci.nih.gov/docs/dtp_search.html

• ZINC and other screening databases

• NIST computational chemistry database

• Environmental fate and exposure databases


Other Public Repositories 1

• ChemExper Chemical Directory
  – > 200,000 substances; > 10,000 IR spectra
  – http://chemexper.com/
• HIC-Up; Hetero-Compound Information Centre – Uppsala
  – 5384 substances as of 1/15/05
  – http://xray.bmc.uu.se/hicup/
• Chemicals with Pharmaceutical Activity; a 3D Structural Database
  – 400 3D structures
  – http://www.chem.ox.ac.uk/mom/chemical-database/


Other Public Repositories 2

• Cheminformatics.org
  – 41 data sets in 9 categories as of 8/18/05
  – http://www.cheminformatics.org/
• WebReactions
  – http://webreactions.net/


Other Public Repositories 3

• MolTable
  – http://www.moltable.org/
• MatWeb Materials Property Data
  – http://www.matweb.com/index.asp?ckck=1
• Spectral Database for Organic Compounds (SDBS)
  – Over 32,000 compounds
  – Has EI-MS, FT-IR, 1H NMR, 13C NMR, Raman, ESR
  – http://www.aist.go.jp/RIODB/SDBS/cgi-bin/cre_index.cgi
• NMRShiftDB (Christoph Steinbeck)
  – 14,753 structures as of 8/19/05
  – Features peer-reviewed submission of data sets
  – http://www.nmrshiftdb.org/


Other Public Repositories: Commercial Teasers

• FTIRsearch.com (Thermo Electron)
  – Demo file of 575 spectra from 87,000 in the full database
  – https://ftirsearch.com/default3.htm
• ChemACX
  – Catalog data for 30 of more than 350 suppliers
  – http://chemacx.cambridgesoft.com/chemacx/index.asp
• Sunset Molecular Discovery, LLC
  – Wombat (World of Molecular BioAcTivity)
    • 117,007 entries with over 230,000 biological activities
  – Wombat PK
    • Database for Clinical Pharmacokinetics: 643 substances with 4668 measurements
  – Three sample files from Wombat containing 341 Histamine-1 receptor antagonists
  – http://www.sunsetmolecular.com/


BlueObelisk.org

• A group of chemists, programmers, and informaticians working collaboratively on projects such as:
  – Chemistry Development Kit (CDK)
  – JChemPaint
  – Jmol
  – JUMBO
  – NMRShiftDB
  – Octet
  – Open Babel
  – QSAR
  – World Wide Molecular Matrix (WWMM)


Indiana University Existing Projects

• System for the Integration of Bioinformatics Services (SIBIOS)
  – http://sibios.engr.iupui.edu
• PlatCom: A Platform for Computational Comparative Genomics
  – http://bio.informatics.indiana.edu/sunkim/Platcom/
• Reciprocal Net
  – http://www.reciprocalnet.org/index.html


Indiana University Planned Projects

• Design of a Grid-based distributed data architecture

• Development of tools for HTS data analysis and virtual screening

• Database for quantum mechanical simulation data

• Chemical prototype projects
  – Novel routes to enzymatic reaction mechanisms
  – Mechanism-based drug design
  – Data-inquiry-based development of new methods in natural product synthesis


Web Services Future

• Depends on
  – Adoption of standards
  – Incorporation of WS in current and newly developed applications
  – Security, privacy, quality of data issues
  – Development of WS tools and resources for e-Science