Conceptual basis for critical thinking, data analysis and problem solving
EMBL-EBI
(and I don’t know what this is either !)
STRATEGY
Challenges for bioinformatics
With the sequence/structure deficit, the challenges are to:
  rationalise the mass of sequence data
  derive more efficient means of data storage
  design more reliable analysis tools
Imperative - to convert sequence information into biochemical & biophysical knowledge
What we cannot do well
“Give us sequence, we do rest”
What is the function of this structure?
What is the function of this sequence?
What is the function of this motif?
The fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions. Knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level.
Complication – Multiprotein Complexes
1H8E (ADP.ALF4)2(ADP.SO4) BOVINE F1-ATPASE (ALL THREE CATALYTIC SITES OCCUPIED)
MENZ, R.I., WALKER, J.E., LESLIE, A.G.W.
ATPase
1NT9 COMPLETE 12-SUBUNIT RNA POLYMERASE II
ARMACHE, K.-J., KETTENBERGER, H., CRAMER, P.
Multiprotein transcription complexes - RNA Polymerase
Science 288, 640 (2000) P. Cramer et al.
STRING: a database of predicted functional associations between proteins. von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B
http://string.embl.de/
Prolinks: a database of protein functional linkages derived from coevolution. P.M. Bowers, M. Pellegrini, M.J. Thompson, J. Fierro, T.O. Yeates, D. Eisenberg
http://dip.doe-mbi.ucla.edu/pronav (?)
Ground rules for bioinformatics
Don't always believe what programs tell you
  they're often misleading & sometimes wrong!
Don't always believe what databases tell you
  they're often misleading & sometimes wrong!
Don't always believe what lecturers tell you
  they're often misleading & sometimes wrong!
In short, don't be a naive user
  when computers are applied to biology, it is vital to understand the difference between mathematical & biological significance
  computers don’t do biology - they do sums quickly!
General Evaluation Criteria
Be sceptical and cynical!
When you are searching for information you need to judge its quality and suitability.
Think critically about each piece of information you find and how you found it.
Relevance: Does the information you have found adequately support your research? Does it answer the question, or support one of your arguments? How general or specific is the information about the topic?
Building a search protocol
The usual starting point
  searching the primary data sources: NRDB, SPTR, etc.
Pattern recognition methods
  searching the secondary sources: patterns, profiles, blocks, fingerprints & HMMs
Estimating significance
  when do we believe a result?
A central goal is to predict protein function from sequence
Given a sequence, we want to know
  what is my protein?
  to what family does it belong?
  what is its function?
  how can we explain its function in structural terms?
By searching pattern dbs & fold libraries, we may recognise patterns that allow us to infer relationships with previously-characterised families & folds
Given the variety of dbs to search, how do we use them to build a sensible search protocol?
Planning a database Search
To find various aspects of your query sequence, you may have to search a number of databases
1. Identify the sequence
Search for a matching or similar sequence using a 'BLAST' program.
2. Find related sequences
(a) For a protein sequence, find the mRNA sequence that encodes the protein, and the DNA sequence that codes for the mRNA.
(b) For an mRNA sequence, find the protein it produces, and the DNA sequence that codes for the mRNA.
(c) For a DNA sequence, find the mRNA it is transcribed into, and the protein that the mRNA produces.
3. If the sequence is from a protein, find a structural image.
4. Research the functionality of the sequence:
(a) What is its function in different tissues (homology)?
(b) What is its function in different organisms (phylogeny)?
(c) Are there any mutations, and what are their consequences?
(d) What is the role of the protein in cell function?
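The relationships in step 2 (DNA, mRNA, protein) can be sketched in a few lines of Python. This is an illustration only, not part of the original slides: the function names are hypothetical, the codon table is deliberately partial (just the codons used in the example), and real work would use the full genetic code from an established library.

```python
# Partial standard genetic code: only the codons used in the example below.
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "UAA": "*",  # stop
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA in frame 0 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # Raises KeyError for any codon missing from the partial table.
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


dna = "ATGTTTGGCTAA"  # invented example sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

Going the other way (protein back to mRNA or DNA) is not a simple inversion, because several codons encode the same residue; that is one reason step 2 is a database lookup rather than a calculation.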
Protein sequence database identity search
  e.g., for short fragments - pinpoints identical matches to probe - may identify correct reading frame
Protein sequence database similarity search
  e.g., nrdb, OWL, SP+SPTrEMBL - identifies homologues to probe
Protein pattern database search
  e.g., PROSITE, profiles, PRINTS, BLOCKS, Pfam - identifies family relationships or pinpoints key structural or functional sites
Known structure: structure classification database query
  e.g., scop, CATH, FSSP - provides details of ligand-binding, etc.
Unknown structure: protein fold pattern library search
  e.g., threading - identifies compatible fold or structural class
iGAP
http://eol.sdsc.edu
[Figure: iGAP annotation pipeline, Steps 1 to 6. Inputs: sequence info (NR, PFAM) and structure info (SCOP, PDB).]
Protein sequences
Prediction of:
  signal peptides (SignalP, PSORT)
  transmembrane (TMHMM, PSORT)
  coiled coils (COILS)
  low complexity regions (SEG)
Structural assignment of domains by PSI-BLAST profiles on FOLDLIB
Structural assignment of domains by 123D on FOLDLIB
Structural assignment of domains by WU-BLAST
Functional assignment by PFAM, NR assignments
Domain location prediction by sequence
Data Warehouse
Building FOLDLIB:
  PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP
  90% sequence non-identical, minimum size 25 aa, coverage (90%, gaps <30, ends <30)
http://harvester.embl.de/
“Harvester” collects information from selected public databases
Similarity searching
Whether or not an identity search finds a match, the next step is to look for similar sequences
  e.g., you may wish to know if a wider family exists
The most rapid option is to use BLAST & variants and look for
  high scores with low P-values (unlikely to be random)
  clusters of high scores at the top of the hitlist (a family?)
  trends in the type of sequences matched
Use a composite database
  e.g., UniProt
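As a rough sketch of "look for high scores that are unlikely to be random", the snippet below filters a hit list by significance and sorts it by score. It assumes the hits are in BLAST+ tabular format (-outfmt 6) with the default twelve columns; current BLAST reports E-values rather than P-values, so the E-value column is used. The hit lines are invented and the 1e-5 cut-off is an arbitrary choice for illustration, not a recommendation from the slides.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Invented hit lines, for illustration only.
EXAMPLE = (
    "query1\thitA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-110\t390\n"
    "query1\thitB\t45.0\t180\t90\t4\t5\t184\t10\t189\t2e-30\t120\n"
    "query1\thitC\t28.0\t60\t40\t2\t50\t109\t300\t359\t4.5\t30.1\n"
)


def significant_hits(handle, max_evalue=1e-5):
    """Return hits with E-value <= max_evalue, best bit score first."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    hits = []
    for row in reader:
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] <= max_evalue:
            hits.append(row)
    return sorted(hits, key=lambda hit: hit["bitscore"], reverse=True)


hits = significant_hits(io.StringIO(EXAMPLE))
for hit in hits:
    print(hit["sseqid"], hit["evalue"], hit["bitscore"])
```

A filter like this only removes statistically weak hits; deciding whether the survivors form a biologically meaningful family still needs a person to look at the trends in what was matched.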
Structural & functional interpretation
db searches often do little more than identify a protein family
  this only scratches the surface - we still want to know what our protein does & what it might look like
The first step is to examine the detailed family in InterPro
  may help to elucidate function
The next step is to examine the fold classification & structure summary resources
  e.g., SCOP, CATH
Gene prediction, structure & function prediction are non-trivial
  structure & function prediction tools are, at best, 70% accurate
What are the lessons for sequence analysis?
  when searching for distant homologues, several dbs should be searched
  different methods provide different perspectives
  dbs aren’t complete & their contents don’t fully overlap
The more dbs searched, the more difficult it can be to interpret results
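One simple way to keep results from several databases interpretable is to tabulate them side by side before drawing a conclusion. The sketch below is an assumption-laden illustration: the database names come from the slides, but the family labels and the results are invented, and the `summarise` helper is hypothetical.

```python
from collections import Counter

# Hypothetical top family assignment per pattern database
# (None means no match). All values are invented for illustration.
assignments = {
    "PROSITE": "family_A",
    "PRINTS": "family_A",
    "Pfam": "family_B",
    "BLOCKS": None,
}


def summarise(assignments):
    """Count how many databases support each family and list the ones with no match."""
    votes = Counter(family for family in assignments.values() if family)
    silent = sorted(db for db, family in assignments.items() if not family)
    return votes, silent


votes, silent = summarise(assignments)
print(votes.most_common())
print("no match in:", silent)
```

A tally like this does not settle the question: agreement between methods raises confidence, but a disagreement or a missing match may simply reflect incomplete databases, which is the point the slide makes.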
Thinking about your Topic
Can you identify what you already know about the topic, and identify what you do not know?
Can you create questions based on these knowledge 'gaps', that is, can you identify your information needs?
What do you need to know about your protein sequence?
Develop a concept map to organise your ideas and structure your approach to the topic.
Discuss your topic with others.
Identifying the Type of Information you need
As well as thinking about your topic, you need to consider the type of information you will need.
Which information tools are best suited to your inquiry?
How much information do you need - to what degree of detail?
Appreciate how difficult it is to draw a complex 3-D object and appreciate the complexity of the requirements for storing sequence and structural information of molecules in a database.
There are a lot of interrelated pieces of information about a biomolecule, such as:
  sequence similarities
  genome location
  protein structure
  expression
  chemistry
Not all information on a molecule or sequence will be found in one record, nor even in one database.
Be prepared to search in several databases for information on your query sequence.
As different organisations create databases to suit their own purposes, there will not be a great deal of similarity between these databases.
Some of the obstacles of searching databases are:
  Data formats are not standard.
  The nomenclature is not standard.
  There is more than one database offering the same information (data redundancy).
  Links between databases may not be easy to follow.
  The number of databases available makes it confusing to choose between them.
Once you have found some information on your query sequence, you will find a new focus for your research from this information.
Through exploring any linked text in the databases, you can ask:
What function does the protein/mRNA/DNA have?
Do mutations occur and what are their effects?
Does it play a role in disease?
Homologies: Does it have the same function in different tissues?
Phylogenies: Does it have the same function in different organisms?
What role does structure play in the protein's function?
Does it have a similar function to other molecules with similar structure?
Pitfalls of searching databases
Remember that you are looking for information about a molecule, not database records.
Duplication of information (even within the same database)
Links that are not always intuitive (or self-explanatory)
Nomenclature that is not always standard
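The advice to look for the molecule rather than the record can be made concrete by grouping records on the sequence itself instead of on names or identifiers. The sketch below is illustrative only: the identifiers, names and sequences are invented, and exact matching after case normalisation is a simplifying assumption (real redundancy removal usually works with similarity thresholds).

```python
from collections import defaultdict

# Invented records: (identifier, name, sequence). All values are hypothetical.
records = [
    ("DB1:0001", "Kinase X", "MKTAYIAKQR"),
    ("DB1:0002", "kinase-x", "mktayiakqr"),
    ("DB2:A7", "Protein kinase X", "MKTAYIAKQR"),
    ("DB2:B3", "Unrelated protein", "MSLLTEVETY"),
]


def group_by_sequence(records):
    """Group record identifiers by their case-normalised sequence."""
    groups = defaultdict(list)
    for identifier, _name, sequence in records:
        groups[sequence.strip().upper()].append(identifier)
    return dict(groups)


groups = group_by_sequence(records)
for sequence, identifiers in groups.items():
    print(sequence, identifiers)
```

Here three differently named records collapse to one molecule, which is the situation the duplication and nomenclature pitfalls describe.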
Accuracy or Validity
You need to determine whether the information is reliable or not.
Quality Control Issues
The quality of archived data is no better than the data determined in the contributing laboratories.
Curation of the data can help to identify errors. Disagreement between duplicate determinations is a clear warning of an error in one or the other. Similarly, results that disagree with established principles may contain errors. It is useful, for instance, to flag deviations from expected stereochemistry in protein structures, but such "outliers" are not necessarily wrong.
Data quality
Data consistency
Data models
Reliability
  Evidence? Level of confidence?
Assignment of function by similarity
  recursive process
  propagation of errors
Data quality
It’s hard to judge whether something “makes sense”.
The lack of labeling on many web pages makes it hard to know the source.
Calculations based on databases are even harder to deal with
Logical deductions may be worse.
“tacR gene regulates the human nervous system”
“tacQ gene is similar to tacR but is found in E. coli”
“so tacQ gene regulates the E. coli nervous system”
E. coli nervous system
Who spotted it?
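The tacR/tacQ deduction goes wrong because the annotation was copied by similarity without recording where it came from. A minimal sketch of tracking that provenance is shown below; it assumes a toy data model in which each annotation stores the entry it was transferred from. tacR and tacQ come from the slide, while tacP, the dictionary layout and the `evidence_chain` helper are hypothetical.

```python
# Each annotation records its source: None means direct experimental evidence,
# otherwise the name of the entry it was copied from by similarity.
# tacR and tacQ come from the slide; tacP is a hypothetical third entry.
annotations = {
    "tacR": {"function": "regulates the nervous system", "source": None},
    "tacQ": {"function": "regulates the nervous system", "source": "tacR"},
    "tacP": {"function": "regulates the nervous system", "source": "tacQ"},
}


def evidence_chain(name, annotations):
    """Follow 'by similarity' links back to the experimentally supported entry."""
    chain = [name]
    while annotations[name]["source"] is not None:
        name = annotations[name]["source"]
        if name in chain:  # guard against circular annotation
            raise ValueError("circular evidence chain: " + " -> ".join(chain + [name]))
        chain.append(name)
    return chain


for name in annotations:
    chain = evidence_chain(name, annotations)
    print(name, "transfer steps:", len(chain) - 1, "via", " -> ".join(chain))
```

The longer the chain, the further an annotation sits from any experiment, and an error at the root is inherited by every entry downstream. That is the recursive propagation of errors the data quality slide warns about.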
Evaluating database records
In order for your research to be reliable, you must use reliable sources of information.
It is important to evaluate the information you find in databases as you would any other type of information.
In the case of sequencing research, however, peer review does not necessarily happen prior to publication.
Significance
Appreciating that mathematical & biological significance are different is crucial
Important in understanding the limitations of
  database search algorithms
  multiple sequence alignment algorithms
  pattern recognition techniques
  functional site & structure prediction tools
Contrary to popular opinion, there is currently still
  no biologically-reliable automatic multiple alignment algorithm
  no infallible pattern-recognition technique
  no reliable gene, function or structure prediction algorithm
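One way to see that mathematical significance is not a property of the biology: the same alignment score becomes more or less "significant" depending only on how much sequence was searched. The sketch below uses the standard relation between a BLAST bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the total database length. This formula is brought in for illustration and does not appear in the slides; the numbers are invented.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance hits scoring at least bit_score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same alignment, same score, two hypothetical database sizes (in residues).
small = expected_hits(40.0, 250, 10**6)
large = expected_hits(40.0, 250, 10**9)
print(small, large)
```

The alignment and the proteins are identical in both cases; only the search space changed, and with it the E-value, by a factor of a thousand. Whether the match means anything biologically has to be judged on other grounds.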
Summary
Difficult questions on big data
Data and Information
Database and Databanks
Organise the data to provide a service
Visualization and Rendering
Keep it up-to-date
Provide a means to ask questions
Provide a useful service to a large and diverse scientific field
Data & Information
Data: a collection of facts
  e.g. X-coordinate, B-value, sequence
Information: acquired knowledge
  Data within a scientific “context”
  Meaning of the data
Sequence/structure alignment
Databases & Databanks
Databank
  A (usually large) collection of data
Database
  A (usually large) set of data organized to allow rapid retrieval of information.
  Organized for a reason
  Rapid retrieval: human short term memory is ~5 seconds information
WHAT IS THE PDB?
Databanks and Databases
The PDB Archive is a “databank”
  A series of flat files that have a format originally designed for Fortran card readers
The MSD, RCSB, and PDBj provide “databases”
  Collections of data (1000s of attributes) organized into relational tables and held with an RDBMS.
Data & information
ATOM   2567  N   PHE B 175       7.821 -25.530 -22.848  1.00  8.71
ATOM   2568  CA  PHE B 175       8.845 -25.172 -21.877  1.00  9.41
ATOM   2569  C   PHE B 175       9.449 -23.798 -22.169  1.00 10.02
ATOM   2570  O   PHE B 175      10.664 -23.613 -22.103  1.00 10.37
ATOM   2571  CB  PHE B 175       9.928 -26.251 -21.848  1.00  9.53
ATOM   2572  CG  PHE B 175      10.969 -26.137 -22.982  1.00 10.03
ATOM   2573  CD1 PHE B 175      12.356 -25.819 -22.988  1.00 10.51
ATOM   2574  CD2 PHE B 175      11.725 -27.211 -23.402  1.00 10.25
ATOM   2575  CE1 PHE B 175      11.821 -27.095 -22.869  1.00 11.17
ATOM   2576  CE2 PHE B 175      12.282 -26.086 -24.008  1.00 10.95
ATOM   2577  CZ  PHE B 175      10.953 -26.335 -23.622  1.00 11.38
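To illustrate the step from data to information, the sketch below reads two of the ATOM records shown above and computes the distance between the backbone N and CA atoms of PHE B 175. It is a simplification: fields are split on whitespace, which works for this excerpt, whereas the PDB format is strictly fixed-column and fields can run together in real files. The field names in `parse_atom` are hypothetical labels chosen for the example.

```python
import math

# Two of the ATOM records from the excerpt (whitespace-separated here).
RECORDS = """\
ATOM 2567 N PHE B 175 7.821 -25.530 -22.848 1.00 8.71
ATOM 2568 CA PHE B 175 8.845 -25.172 -21.877 1.00 9.41
"""


def parse_atom(line):
    """Parse one whitespace-separated ATOM record into a dict (simplified)."""
    _tag, serial, name, res_name, chain, res_seq, x, y, z, occupancy, b_value = line.split()
    return {
        "serial": int(serial),
        "name": name,
        "res_name": res_name,
        "chain": chain,
        "res_seq": int(res_seq),
        "xyz": (float(x), float(y), float(z)),
        "occupancy": float(occupancy),
        "b_value": float(b_value),
    }


atoms = {}
for line in RECORDS.splitlines():
    atom = parse_atom(line)
    atoms[atom["name"]] = atom

# Euclidean distance between the N and CA coordinates, in angstroms.
n_ca_distance = math.dist(atoms["N"]["xyz"], atoms["CA"]["xyz"])
print(round(n_ca_distance, 3))
```

The raw numbers are data; the resulting distance of roughly 1.46 angstroms, which can be compared with the expected N-CA bond length, is information in the sense of the slide.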
http://oca.ebi.ac.uk/oca-docs/oca-home.html
http://srs.ebi.ac.uk/
http://www.rcsb.org/pdb/
http://www.ebi.ac.uk/msd/
http://www.pdbj.org/
wwPDB are service providers
We provide a service to the scientific community
24/7 (almost): parallel DB with fail-over, etc.
Service “ping” baseline check several times/day
Data is incremented with new data weekly
Systems are extensible
Query capabilities
Browsing (click and read)
Simple search
  select records with some constraints
More elaborate search
  select specific fields of some records with constraints on some fields
Complex querying
  ability to return an answer that results from a "live" computation, and was not part of any record of the database
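The three levels of query capability can be sketched against a small relational table. This is an illustration using Python's built-in sqlite3 module with an in-memory database; the `atom` table and its columns are a hypothetical schema, not the schema of any wwPDB database, and the rows reuse a few values from the ATOM excerpt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table layout, chosen only for this example.
conn.execute(
    "CREATE TABLE atom (serial INTEGER, name TEXT, residue TEXT, chain TEXT, b_value REAL)"
)
conn.executemany(
    "INSERT INTO atom VALUES (?, ?, ?, ?, ?)",
    [
        (2567, "N", "PHE", "B", 8.71),
        (2568, "CA", "PHE", "B", 9.41),
        (2569, "C", "PHE", "B", 10.02),
        (2570, "O", "PHE", "B", 10.37),
    ],
)

# Simple search: select whole records with some constraints.
records = conn.execute(
    "SELECT * FROM atom WHERE b_value > 10 ORDER BY serial"
).fetchall()

# More elaborate search: specific fields, with constraints on other fields.
names = conn.execute(
    "SELECT name FROM atom WHERE chain = 'B' AND b_value < 10 ORDER BY serial"
).fetchall()

# Complex query: the answer is computed "live" and is not stored in any record.
(mean_b,) = conn.execute("SELECT AVG(b_value) FROM atom").fetchone()

print(records, names, mean_b)
conn.close()
```

The last query is the interesting one: the mean B-value exists in no record, so it can only be obtained from a system that organises the data for computation rather than from a flat-file databank read record by record.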
Interfaces
User interfaces
  user-friendly
  convenient browsing
  intuitive query forms
  visualization (graphical output)
Programmatic interfaces - communication with external programs:
  other databases (concept of distributed database)
  analysis tools
Annotation Issues
Annotation
Problem
  The flow of available data is increasing exponentially
Strategies
  internal curators
  selected external experts
  public submission
  computer-based extraction of information from biological texts
Annotation of the data
Annotation is a weak component of the enterprise.
Automation of annotation is possible only to a limited extent, and getting annotation right remains labor-intensive.
The importance of proper annotation, however, cannot be overestimated.
P. Bork has commented that, for people interested in analysing the protein sequences implicit in genome sequence information, errors in gene assignment corrupt the high quality of the sequence data.
Distributed Annotation
A possible solution is a distributed and dynamic error-correction and annotation process.
The workload must be distributed because databank staff have neither the time nor the expertise for the job; specialists will have to act as curators.
Progress in automation of annotation and error identification/correction will permit re-annotation of databanks.
New Problems
As a result, we will have to give up the "safe" idea of a stable databank composed of entries that are correct when they are first distributed in mature form and stay fixed thereafter.
Databanks are dynamic in information content, growing in size, and maturing in quality.
Maintaining local copies, largely by “top up”, is not sufficient.
Proliferation of various copies in various states with out-of-date linkages
The more computers are involved in automating genome annotation, the greater the need for collaboration with biologists
The more data we have to handle, the more rigorous we must be in our thinking (& writing) if we are to make sense of the complexities
We are still a long way from having reliable tools for deducing protein function from sequence
but with the right approach, there is hope
Conclusion
What can you do with bioinformatics?
Not much without intervention;
however, a lot if you know how to apply it right!
http://www.library.cqu.edu.au/chemcompass/index.htm
Terri Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/
Referencing - and Plagiarism
http://www.vts.rdn.ac.uk/tutorial/biores