EMBL-EBI: the European Macromolecular Structure Database (EMSD)
EMBL-EBI: the European Macromolecular Structure Database (EMSD)
http://www.ebi.ac.uk/msd/education/Tutorial.html
http://www.ebi.ac.uk/msd/roadshow.html
MSD Roadshow Co-ordinator: Janet Copeland
2nd November 2005, Oxford University

EMBL-EBI

USERNAME: cal

PASSWORD: warthog

Introduction to MSD and to Quaternary Structures/Assemblies as Basis of MSD database

SSM Fold recognition

PISA Surface and assembly toolkit

MSDchem Chemistry reference data

MSDlite/MSDpro generalised search systems

MSDsite Active sites

MSDmotif small structural motifs

Visualisation and Patterns

Integration Projects with Sequence and Domain data

Validation/Deposition

Clustering methods used at MSD

MSDmine – generalised data access to the MSD

PIMS – Protein Information System

Targets – Workflow for Target selection tools

NMR – NMR tools and data at MSD

Data Mining and an example MSDtemplates

Databases at MSD, including data warehouse technologies

Database Replication

http://www.ebi.ac.uk/msd

Genomes

Hypotheses and in silico models

Bioinformatics

Expression-profiling

Comparative genomics

Mutant/RNAi data

Metabolic data

Literature

Proteome data

Biochemistry

Bioinformatics

Role of Bioinformatics

To Support Experimental Biology
To Collect and Archive Data
To provide Framework and Integration
To give Easy Access to Data
To make New Discoveries through Data Analysis

http://www.wwpdb.org/

WHAT IS THE PDB?

Databanks and Databases

The PDB Archive is a “databank”: a series of flat files that have a format originally designed for Fortran card readers.

The MSD provides “databases”: collections of data (1000s of attributes) organized into relational tables and held with an RDBMS.

PQS biological assemblies

MSDchem ligand data

Electron Density Visualisation

AstexViewer

MSDPro, MSDlite

SSM fold matching

Surface Matching

MSDsite Active sites

Linking to Domain data, eFamily

Sequence Mapping, SIFTS

Data & information

ATOM 2567 N   PHE B 175  7.821 -25.530 -22.848 1.00  8.71
ATOM 2568 CA  PHE B 175  8.845 -25.172 -21.877 1.00  9.41
ATOM 2569 C   PHE B 175  9.449 -23.798 -22.169 1.00 10.02
ATOM 2570 O   PHE B 175 10.664 -23.613 -22.103 1.00 10.37
ATOM 2571 CB  PHE B 175  9.928 -26.251 -21.848 1.00  9.53
ATOM 2572 CG  PHE B 175 10.969 -26.137 -22.982 1.00 10.03
ATOM 2573 CD1 PHE B 175 12.356 -25.819 -22.988 1.00 10.51
ATOM 2574 CD2 PHE B 175 11.725 -27.211 -23.402 1.00 10.25
ATOM 2575 CE1 PHE B 175 11.821 -27.095 -22.869 1.00 11.17
ATOM 2576 CE2 PHE B 175 12.282 -26.086 -24.008 1.00 10.95
ATOM 2577 CZ  PHE B 175 10.953 -26.335 -23.622 1.00 11.38
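
Records like these only become information once each field is given a meaning. The following is a minimal sketch in Python of that step. It assumes whitespace-separated fields in the order shown on the slide (serial, atom name, residue name, chain, residue number, x, y, z, occupancy, B-factor). Real PDB files are fixed-column, so a production parser would slice by column position instead. The names `Atom` and `parse_atom` are hypothetical and used only for illustration.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """One ATOM record with named, typed fields (hypothetical helper type)."""
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record.

    Assumption: fields are separated by whitespace, as on the slide.
    Real PDB files use fixed columns and may omit the separating spaces.
    """
    fields = line.split()
    if len(fields) != 11 or fields[0] != "ATOM":
        raise ValueError(f"not a whitespace-separated ATOM record: {line!r}")
    _, serial, name, res_name, chain, res_seq, x, y, z, occ, b = fields
    return Atom(int(serial), name, res_name, chain, int(res_seq),
                float(x), float(y), float(z), float(occ), float(b))


records = [
    "ATOM 2567 N PHE B 175 7.821 -25.530 -22.848 1.00 8.71",
    "ATOM 2568 CA PHE B 175 8.845 -25.172 -21.877 1.00 9.41",
    "ATOM 2569 C PHE B 175 9.449 -23.798 -22.169 1.00 10.02",
]
atoms = [parse_atom(r) for r in records]

# Once the fields are typed, simple questions become one-liners,
# for example the mean B-factor of the parsed atoms.
mean_b = sum(a.b_factor for a in atoms) / len(atoms)
```

With typed fields in hand, the same records can be loaded into relational tables rather than re-read as text each time.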

MSD service provider

We provide a service to the scientific community 24/7 (almost): parallel DB with fail-over, etc.

Service “ping” baseline check several times/day
Data is incremented with new data weekly
Systems are extensible

Query capabilities

Browsing (click and read)

Simple search: select records with some constraints

More elaborate search: select specific fields of some records with constraints on some fields

Complex querying: ability to return an answer that results from a "live" computation, and was not part of any record of the database
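
The difference between these query levels can be sketched with a toy relational table. This is an illustration only: it uses an in-memory SQLite database and a hypothetical `atom` table, not the MSD schema, and the "live" computation (an interatomic distance) is done in Python rather than inside the database.

```python
import math
import sqlite3

# Hypothetical toy table; this is NOT the MSD schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE atom (serial INTEGER PRIMARY KEY, name TEXT, res_name TEXT, "
    "chain TEXT, res_seq INTEGER, x REAL, y REAL, z REAL, b_factor REAL)"
)
rows = [
    (2567, "N", "PHE", "B", 175, 7.821, -25.530, -22.848, 8.71),
    (2568, "CA", "PHE", "B", 175, 8.845, -25.172, -21.877, 9.41),
    (2569, "C", "PHE", "B", 175, 9.449, -23.798, -22.169, 10.02),
    (2570, "O", "PHE", "B", 175, 10.664, -23.613, -22.103, 10.37),
]
conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Simple search: select whole records with some constraints.
simple = conn.execute(
    "SELECT * FROM atom WHERE b_factor > 10 ORDER BY serial"
).fetchall()

# More elaborate search: select specific fields of some records,
# with constraints on other fields.
elaborate = conn.execute(
    "SELECT name, b_factor FROM atom "
    "WHERE chain = 'B' AND res_seq = 175 AND b_factor < 10 ORDER BY serial"
).fetchall()

# Complex query: the answer comes from a live computation and is not
# stored in any record. Here: the N-CA distance within one residue.
coords = {
    name: (x, y, z)
    for name, x, y, z in conn.execute(
        "SELECT name, x, y, z FROM atom WHERE chain = 'B' AND res_seq = 175"
    )
}
n_ca_distance = math.dist(coords["N"], coords["CA"])
```

The first two queries only filter and project what is already stored. The third returns a value that exists nowhere in the table until it is computed.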

What we cannot do well

“Give us sequence, we do rest”

What is the function of this structure?

What is the function of this sequence?

What is the function of this motif?

The fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions. Knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level.

Complication – Multiprotein Complexes

1H8E
(ADP.ALF4)2(ADP.SO4) BOVINE F1-ATPASE (ALL THREE CATALYTIC SITES OCCUPIED)
MENZ, R.I., WALKER, J.E., LESLIE, A.G.W.

ATPase

Ground rules for bioinformatics

Don't always believe what programs tell you: they're often misleading & sometimes wrong!

Don't always believe what databases tell you: they're often misleading & sometimes wrong!

Don't always believe what lecturers tell you: they're often misleading & sometimes wrong!

In short, don't be a naive user. When computers are applied to biology, it is vital to understand the difference between mathematical & biological significance.

Computers don’t do biology: they do sums quickly!

General Evaluation Criteria

Be sceptical and cynical!

When you are searching for information you need to judge its quality and suitability.

Think critically about each piece of information you find and how you found it.

Relevance: Does the information you have found adequately support your research? Does it answer the question, or support one of your arguments? How general or specific is the information about the topic?

http://harvester.embl.de/

“Harvester” collects information from selected public databases

Appreciate how difficult it is to draw a complex 3-D object and appreciate the complexity of the requirements for storing sequence and structural information of molecules in a database.

There are a lot of interrelated pieces of information about a biomolecule, such as:

sequence similarities
genome location
protein structure
expression
chemistry

Some of the obstacles to searching databases are:

Data formats are not standard.
The nomenclature is not standard.
There is more than one database offering the same information (data redundancy).
Links between databases may not be easy to follow.
The number of databases available makes it confusing to choose among them.

Accuracy or Validity: You need to determine whether the information is reliable or not.

Quality Control Issues

The quality of archived data is no better than the data determined in the contributing laboratories.

Curation of the data can help to identify errors.

Disagreement between duplicate determinations is a clear warning of an error in one or the other.

Similarly, results that disagree with established principles may contain errors.

It is useful, for instance, to flag deviations from expected stereochemistry in protein structures, but such “outliers” are not necessarily wrong.
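
Flagging deviations from expected stereochemistry can be sketched as a simple z-score check. This is a minimal illustration, not the validation procedure used at MSD. The function name `flag_outliers`, the target bond length, its spread, the threshold, and the observed values are all illustrative placeholders.

```python
def flag_outliers(observed, expected, sigma, threshold=4.0):
    """Return (label, z-score) for values more than `threshold` sigmas from `expected`.

    A flag marks a value for inspection; it does not prove the value is wrong.
    """
    flagged = []
    for label, value in observed.items():
        z = (value - expected) / sigma
        if abs(z) > threshold:
            flagged.append((label, round(z, 2)))
    return flagged


# Hypothetical N-CA bond lengths in angstroms, keyed by residue label.
observed_bonds = {
    "PHE B 175": 1.456,
    "GLY B 176": 1.462,
    "ALA B 177": 1.560,
}

# Placeholder target value and spread; a real check would take these
# from a restraint library.
flags = flag_outliers(observed_bonds, expected=1.458, sigma=0.019)
```

Only the third bond is flagged here. As the slide notes, a curator would still need to look at the underlying data before concluding that it is an error.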

Data quality

Data consistency
Data models
Reliability
Evidence? Level of confidence?
Assignment of function by similarity: a recursive process, with propagation of errors
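
Why recursive assignment by similarity propagates errors can be shown with a little arithmetic. The sketch below assumes, purely for illustration, that each transfer of an annotation from one entry to the next is correct with the same probability and that the errors are independent. The function name and the value 0.9 are hypothetical.

```python
def chained_confidence(per_step_reliability: float, steps: int) -> float:
    """Probability that an annotation is still correct after `steps`
    successive transfers, assuming independent errors at each step."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_reliability ** steps


# An annotation copied along a chain of five similarity-based
# assignments, each assumed 90% reliable.
confidence_by_step = {
    steps: round(chained_confidence(0.9, steps), 3) for steps in range(1, 6)
}
```

Under these assumptions, confidence falls below 60% after five transfers, even though every individual step looked reasonably safe.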

Data quality

It’s hard to judge whether something “makes sense”.

The lack of labeling on many web pages makes it hard to know the source.

Calculations based on databases are even harder to deal with

Logical deductions may be worse.

“tacR gene regulates the human nervous system”

“tacQ gene is similar to tacR but is found in E. coli”

“so tacQ gene regulates the E. coli nervous system”

E. coli nervous system

Who spotted it?

Significance

Appreciating that mathematical & biological significance are different is crucial

Important in understanding the limitations of:
database search algorithms
multiple sequence alignment algorithms
pattern recognition techniques
functional site & structure prediction tools

Contrary to popular opinion, there is currently still:
no biologically-reliable automatic multiple alignment algorithm
no infallible pattern-recognition technique
no reliable gene, function or structure prediction algorithm

As a result, we will have to give up the “safe” idea of a stable databank composed of entries that are correct when they are first distributed in mature form and stay fixed thereafter.

Databanks are dynamic in information content, growing in size, and maturing in quality.

Maintaining local copies, largely by “top up”, is not sufficient.

Proliferation of various copies in various states with out-of-date linkages

New Problems