
Helsinki University of Technology

DNA, RNA, Protein Structure Prediction

Laura Pombo Laboratory of Computational Engineering


DNA, RNA, Protein Structure Prediction 23.11.2005 laura.pombo@tkk.fi

INTRODUCTION:

Bioinformatics

DNA

RNA

Proteins


BIOINFORMATICS

Bioinformatics involves the integration of computers, software tools, and databases in an effort to address biological questions. Bioinformatics approaches are often used for major initiatives that generate large data sets.

Two important large-scale activities that use bioinformatics are genomics and proteomics.

Genomics refers to the analysis of genomes.

– A genome can be thought of as the complete set of DNA sequences that codes for the hereditary material that is passed on from generation to generation.

– Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.


Bioinformatics, continued

Proteomics, on the other hand, refers to the analysis of the complete set of proteins, or proteome.

In addition to genomics and proteomics, there are many more areas of biology where bioinformatics is being applied (e.g., metabolomics, transcriptomics). Each of these important areas in bioinformatics aims to understand complex biological systems. Many scientists today refer to the next wave in bioinformatics as systems biology, an approach to tackle new and complex biological questions. Systems biology involves the integration of genomics, proteomics, and bioinformatics information to create a whole system view of a biological entity.


Bioinformatics http://www.bioinformatics.ubc.ca/


Central Dogma

DNA

RNA

Protein


DNA to RNA

Portions of DNA Sequence Are Transcribed into RNA

The first step a cell takes in reading out its genetic instructions is to copy a particular portion of its DNA nucleotide sequence (= gene) into RNA.

Similarities:
– DNA and RNA are both linear polymers made of four different types of nucleotide subunits linked together by phosphodiester bonds
– DNA and RNA both contain the bases adenine (A), guanine (G) and cytosine (C)

Differences:
– In RNA the nucleotides are ribonucleotides (= contain the sugar ribose)
– RNA contains uracil (U) instead of thymine (T)

My summary from the book: Molecular Biology of THE CELL (Bruce Alberts, et al.)
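As a small illustration of the T-to-U difference listed above, the following Python sketch rewrites a DNA coding-strand sequence as the corresponding RNA sequence. The function name `transcribe` and the example sequence are illustrative assumptions, not part of the original slides, and the sketch ignores everything about real transcription except the base alphabet.

```python
def transcribe(dna_coding_strand: str) -> str:
    """Return the RNA sequence matching a DNA coding strand (T replaced by U)."""
    dna = dna_coding_strand.upper()
    # Reject anything that is not one of the four DNA bases.
    if set(dna) - set("ACGT"):
        raise ValueError("DNA sequence may only contain A, C, G and T")
    return dna.replace("T", "U")


print(transcribe("ATGGCCATTGTA"))  # AUGGCCAUUGUA
```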


Different RNAs

mRNAs – (messenger RNAs), code for proteins

rRNAs – (ribosomal RNAs), form the basic structure of the ribosome and catalyze protein synthesis

tRNAs – (transfer RNAs), central to protein synthesis as adaptors between mRNA and amino acids

snRNAs – (small nuclear RNAs), function in a variety of nuclear processes, including the splicing of pre-mRNA

snoRNAs – (small nucleolar RNAs), used to process and chemically modify rRNAs

Other noncoding RNAs – function in diverse cellular processes, including telomere synthesis, X-chromosome inactivation and the transport of proteins into the ER


RNA structure prediction


http://gibk26.bse.kyutech.ac.jp/jouhou/image/dna-protein/all/N3utr.gif


RNA is transcribed (or synthesized) in cells as single strands of (ribose) nucleic acids. However, these sequences are not simply long strands of nucleotides. Rather, intra-strand base pairing will produce structures.

In RNA, guanine and cytosine pair (GC) by forming three hydrogen bonds, and adenine and uracil pair (AU) by forming two hydrogen bonds; additionally, guanine and uracil can form a weaker "wobble" base pair (GU) held by two hydrogen bonds.
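The pairing rules above can be sketched in Python as a lookup of allowed pairs, followed by a scan for positions in a single strand that could pair with each other. The names `can_pair` and `candidate_pairs` are hypothetical, and the `min_loop=3` default (at least three unpaired bases between paired positions) is an assumption introduced for this sketch rather than something stated in the slides.

```python
# Allowed RNA base pairs: GC, AU, and the GU pair mentioned in the text.
ALLOWED_PAIRS = {frozenset("GC"), frozenset("AU"), frozenset("GU")}


def can_pair(a: str, b: str) -> bool:
    """True if bases a and b can form one of the allowed pairs."""
    return frozenset((a.upper(), b.upper())) in ALLOWED_PAIRS


def candidate_pairs(rna: str, min_loop: int = 3):
    """List all (i, j) positions that could pair within one strand.

    min_loop is the assumed minimum number of unpaired bases between i and j.
    """
    rna = rna.upper()
    return [
        (i, j)
        for i in range(len(rna))
        for j in range(i + min_loop + 1, len(rna))
        if can_pair(rna[i], rna[j])
    ]


print(candidate_pairs("GAAAC"))  # [(0, 4)]
```

This only enumerates possible pairs; choosing a consistent set of them is the job of the folding algorithms.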


RNA structure prediction

Vienna RNA Package (RNA Secondary Structure Prediction and Comparison) http://www.tbi.univie.ac.at/~ivo/RNA/

including a few precompiled binaries for download [under Windows] http://www.tbi.univie.ac.at/~ivo/RNA/windoze/

The Vienna RNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.
– RNA secondary structure prediction through energy minimization is the most used function in the package.
– The program provides three kinds of dynamic programming algorithms for structure prediction:
» the minimum free energy algorithm of (Zuker & Stiegler 1981) which yields a single optimal structure,
» the partition function algorithm of (McCaskill 1990) which calculates base pair probabilities in the thermodynamic ensemble, and
» the suboptimal folding algorithm of (Wuchty et al. 1999) which generates all suboptimal structures within a given energy range of the optimal energy.
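To give a feel for the dynamic programming idea behind these algorithms, here is a Python sketch of a much simpler relative: Nussinov-style base pair maximization. It is explicitly not the Zuker & Stiegler energy model and not code from the Vienna RNA Package; it counts pairs instead of minimizing free energy. The function name `max_base_pairs` and the `min_loop` parameter are assumptions made for illustration.

```python
def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov-style recursion).

    best[i][j] holds the best pair count for the subsequence rna[i..j].
    """
    rna = rna.upper()
    n = len(rna)
    pairs = {("G", "C"), ("C", "G"), ("A", "U"),
             ("U", "A"), ("G", "U"), ("U", "G")}
    best = [[0] * n for _ in range(n)]
    # Fill the table for increasingly long subsequences.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, splitting the problem
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


print(max_base_pairs("GGGAAACCC"))  # 3
```

The energy-based algorithms follow the same table-filling pattern but score loops and stacked pairs with measured thermodynamic parameters.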


RNAfold tool

RNAfold reads RNA sequences from stdin and calculates their minimum free energy (mfe) structure, partition function (pf) and base pairing probability matrix. It returns the mfe structure in bracket notation, its energy, the free energy of the thermodynamic ensemble and the frequency of the mfe structure in the ensemble to stdout. It also produces PostScript files with plots of the resulting secondary structure graph and a "dot plot" of the base pairing matrix. The dot plot shows a matrix of squares with area proportional to the pairing probability in the upper half, and one square for each pair in the minimum free energy structure in the lower half
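The bracket notation mentioned above encodes each base pair as a matching "(" and ")" and each unpaired base as ".". The following Python sketch turns such a string back into a list of paired positions. The function name `parse_dot_bracket` and the example string are illustrative assumptions; the example is not actual RNAfold output.

```python
def parse_dot_bracket(structure: str):
    """Return sorted (i, j) index pairs encoded by a bracket-notation string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)  # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))  # closes the most recent '('
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("(((...)))"))  # [(0, 8), (1, 7), (2, 6)]
```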


ALIDOT program

Detecting Conserved RNA Structures

The program alidot is designed to detect conserved RNA secondary structures in small data sets of related RNA sequences. The method is a combination of structure prediction and comparative sequence alignment.

http://www.tbi.univie.ac.at/~ivo/RNA/ALIDOT/


Image source: http://images.clinicaltools.com/images/gene/dna_versus_rna_reversed.jpg


DNA structure prediction


MEME

MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences.
– MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern.
– Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.

MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested.
– MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.
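A position-dependent letter-probability matrix of the kind described above can be illustrated with a short Python sketch. It only shows the representation, built from motif occurrences that are already aligned and of equal width; it does not reproduce how MEME discovers motifs. The function name `letter_probability_matrix`, the `pseudocount` parameter and the example sites are assumptions made for illustration.

```python
def letter_probability_matrix(sites, alphabet="ACGT", pseudocount=0.0):
    """Build one {letter: probability} dict per motif position.

    sites: equal-length, gap-free motif occurrences (already aligned).
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all motif occurrences must have the same width")
    matrix = []
    for col in range(width):
        counts = {letter: pseudocount for letter in alphabet}
        for site in sites:
            counts[site[col].upper()] += 1
        total = sum(counts.values())
        matrix.append({letter: counts[letter] / total for letter in alphabet})
    return matrix


sites = ["TATA", "TATT", "TACA", "TATA"]  # made-up example occurrences
for position, column in enumerate(letter_probability_matrix(sites), start=1):
    print(position, column)
```

Each row of the printed matrix sums to 1, so a column such as position 3 (three T, one C) reads as T with probability 0.75 and C with probability 0.25.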


DNA structure prediction

Other similar programs:

Cassandra http://www-hto.usc.edu/software/procrustes/cassandra/cass_frm.html

DNA Sequence Translation

GENEID, which predicts gene structure in query sequences (US)

GRAIL, GenHunt, Censor, Pythia, Entrez, Beauty, etc. For more, see: http://restools.sdsc.edu/biotools/biotools16.html


PROTEIN

Protein: A large molecule composed of one or more chains of amino acids in a specific order determined by the base sequence of nucleotides in the DNA coding for the protein.
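How the base sequence determines the order of amino acids can be sketched in Python using the standard genetic code: the coding sequence is read three bases (one codon) at a time. The function name `translate` and the example sequence are illustrative assumptions; the sketch reads a single frame from the first base and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# for the first, second and third base; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCATTGTATAA"))  # MAIV
```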

Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs. Each protein has unique functions. Proteins are essential components of muscles, skin, bones and the body as a whole.

Protein is one of the three types of nutrients used as energy sources by the body, the other two being carbohydrate and fat. Proteins and carbohydrates each provide about 4 kilocalories (food Calories) of energy per gram, while fats provide about 9 kilocalories per gram.

The word "protein" was introduced into science by the great Swedish physician and chemist Jöns Jacob Berzelius (1779-1848) who also determined the atomic and molecular weights of thousands of substances, discovered several elements including selenium, first isolated silicon and titanium, and created the present system of writing chemical symbols and reactions.


Tools for PROTEIN Structure Prediction: ExPASy

The ExPASy (Expert Protein Analysis System) proteomics server from the Swiss Institute of Bioinformatics (SIB) is dedicated to molecular biology with an emphasis on data relevant to proteins. It allows you to browse through a number of databases produced in Geneva, such as Swiss-Prot, PROSITE, SWISS-2DPAGE, SWISS-3DIMAGE, ENZYME, as well as other cross-referenced databases (such as EMBL/GenBank/DDBJ, OMIM, Medline, FlyBase, ProDom, SGD, SubtiList, etc).

It also allows access to many analytical tools for the identification of proteins, the analysis of their sequence and the prediction of their tertiary structure. ExPASy also offers many documents relevant to these fields of research, and the servers link to the most relevant sources of information across the Web. Swiss-2DService is a non-profit 2-D PAGE service to the scientific community.

ExPASy was created in August 1993 and was one of the first WWW servers for biological sciences. Since that date it has undergone constant modifications and improvements.


PROSITE Database of protein families and domains

PROSITE is a database of protein families and domains.

It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families.

Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution.

http://au.expasy.org/prosite/prosite_details.html


These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.

A pertinent analogy is the use of fingerprints by the police for identification purposes. A fingerprint is generally sufficient to identify a given individual. Similarly, a protein signature can be used to assign a newly sequenced protein to a specific family of proteins and thus to formulate hypotheses about its function.

PROSITE currently contains patterns and profiles specific for more than a thousand protein families or domains. Each of these signatures comes with documentation providing background information on the structure and function of these proteins.


(Protein) Structure Prediction


Experimental Data

Disulphide bonds

Spectroscopic data

Site directed mutagenesis studies

Knowledge of proteolytic cleavage sites

Etc.


Protein sequence data

Is your protein a transmembrane protein, or does it contain transmembrane segments? Methods for predicting these segments:
– TMAP (EMBL)
– PredictProtein (EMBL/Columbia)
– TMHMM (CBS, Denmark)
– TMpred (Baylor College)
– DAS (Stockholm)

Does your protein contain coiled-coils? Methods:
– At COILS server, the COILS program

Does your protein contain regions of low complexity? Methods:
– the program SEG
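To illustrate what "low complexity" means, the Python sketch below slides a window along a protein sequence and flags windows whose Shannon entropy is low, i.e. windows dominated by very few residue types. This is a simplified stand-in for the idea, not the SEG program's actual procedure. The function names and the `window` and `threshold` values are assumptions chosen for the example.

```python
from collections import Counter
from math import log2


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def low_complexity_windows(protein: str, window: int = 12, threshold: float = 2.2):
    """Start positions of windows whose entropy is at or below the threshold."""
    protein = protein.upper()
    return [
        start
        for start in range(len(protein) - window + 1)
        if shannon_entropy(protein[start:start + window]) <= threshold
    ]


# A poly-glutamine stretch scores as low complexity, a mixed segment does not.
print(low_complexity_windows("QQQQQQQQQQQQ"))
print(low_complexity_windows("ACDEFGHIKLMN"))
```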


Sequence database searching 1/2

Comparisons with sequence databases to find homologues, methods:
– the BLAST suite of programs
– National Center for Biotechnology Information (USA) Searches
– European Bioinformatics Institute (UK) Searches
– BLAST search through SBASE (domain database; ICGEB, Trieste)
– Other methods for comparing a single sequence to a database include:
» The FASTA suite (William Pearson, University of Virginia, USA)
» SCANPS (Geoff Barton, European Bioinformatics Institute, UK)
» BLITZ (Compugen's fast Smith Waterman search)

Multiple sequence information – building a profile from some kind of multiple sequence alignment. Methods:
» PSI-BLAST (NCBI, Washington)
» ProfileScan Server (ISREC, Geneva)
» HMMER Hidden Markov Model searching (Sean Eddy, Washington University)
» Wise package (Ewan Birney, Sanger Centre; this is for protein versus DNA comparisons)
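The Smith Waterman search mentioned in the list above is a local alignment by dynamic programming. A minimal Python sketch of the score computation follows. The scoring values (`match`, `mismatch`, `gap`) and the linear gap penalty are simplifying assumptions for illustration; real database search tools typically use amino acid substitution matrices and more elaborate gap costs, and this is not the code of any of the listed programs.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr.append(max(0, diagonal, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8
```

Comparing one query against every database entry with this function and ranking by score is the basic idea of a database search; heuristics such as BLAST and FASTA exist because the full table is slow to compute at database scale.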


Sequence database searching 2/2

Incorporating multiple sequence information – a MOTIF.
– describes the key residues that are conserved and define the family.
– Sometimes this is called a "signature". For example, "H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]" describes a family of DNA binding proteins.

Methods:
» PROSITE, ExPASy
» EBI
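A signature written in this style can be turned into a regular expression and searched for in a sequence. The Python sketch below handles only the elements that occur in the example (single residues, `[...]` alternatives, `x` wildcards and `(n)` repeats) plus `{...}` exclusions and `(n,m)` ranges; it is not a full implementation of the PROSITE pattern syntax. The function name `prosite_to_regex` and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


signature = "H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]"
regex = prosite_to_regex(signature)
print(regex)  # H[FW].[LIVM].G.{5}[LV]H.{3}[DE]

# Made-up sequence that contains the signature.
print(bool(re.search(regex, "MKHFALAGAAAAALHAAADK")))  # True
```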

Pre-prepared protein alignments, databases:
– SMART (Oxford/EMBL)
– PFAM (Sanger Centre/Wash-U/Karolinska Institutet)
– COGS (NCBI)
– PRINTS (UCL/Manchester)
– BLOCKS (Fred Hutchinson Cancer Research Centre, Seattle)
– SBASE (ICGEB, Trieste)


Multiple Sequence Alignment

Some methods and tools:
– EBI (UK) Clustalw Server
– IBCP (France) Multalin Server
– IBCP (France) Clustalw Server
– IBCP (France) Combined Multalin/Clustalw
– MSA (USA) Server
– BCM Multiple Sequence Alignment ClustalW Server (USA)

Alignments can provide:
– Information as to protein domain structure
– The location of residues likely to be involved in protein function
– Information on residues likely to be buried in the protein core or exposed to solvent
– More information than a single sequence for applications like homology modelling and secondary structure prediction.
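One simple way to read such information out of an alignment is to score how conserved each column is; highly conserved columns are candidates for functionally or structurally important residues. The Python sketch below uses the fraction of sequences that share the most common residue in a column. The function name `column_conservation`, this particular score and the toy alignment are assumptions for illustration, not a method taken from the listed servers.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences carrying the most common residue.

    alignment: equal-length aligned sequences, with "-" marking gaps.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] != "-"]
        if not residues:
            scores.append(0.0)  # column made of gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(alignment))
    return scores


toy_alignment = ["MKV-L", "MRVAL", "MKIAL"]  # made-up example
print([round(score, 2) for score in column_conservation(toy_alignment)])
```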


Secondary Structure Prediction methods and links

Methods and tools:
– PSI-pred (PSI-BLAST profiles used for prediction; David Jones, Warwick)
– JPRED Consensus prediction (includes many of the methods given below; Cuff & Barton, EBI)
– DSC King & Sternberg (this server)
– PREDATOR Frishman & Argos (EMBL)
– PHD home page Rost & Sander, EMBL, Germany
– ZPRED server Zvelebil et al., Ludwig, U.K.
– nnPredict Cohen et al., UCSF, USA.
– BMERC PSA Server Boston University, USA
– SSP (Nearest-neighbor) Solovyev and Salamov, Baylor College, USA.

If there is no homologue of known structure from which to make a 3D model:
– secondary structure prediction can provide the location of alpha helices and beta strands within a protein or protein family.


Fold recognition methods and links 1/2

Methods:
– 3D-pssm (this server)
– TOPITS (EMBL)
– UCLA-DOE Structure Prediction Server (UCLA)
– 123D
– UCSC HMM (UCSC)
– FAS (Burnham Institute)
– THREADER (Warwick)
– ProFIT CAME (Salzburg)

Even with no homologue of known 3D structure, it may be possible to find a suitable fold for your protein among known 3D structures by way of fold recognition methods. Prediction of protein 3D structures is not possible at present,
– and a general solution to the protein folding problem is not likely to be found in the near future.
– However, it has long been recognised that proteins often adopt similar folds despite no significant sequence or functional similarity.
– There are numerous protein structure classifications now available via the WWW:
» SCOP (MRC Cambridge)
» CATH (University College, London)
» FSSP (EBI, Cambridge)
» 3Dee (EBI, Cambridge)
» HOMSTRAD (Biochemistry, Cambridge)
» VAST (NCBI, USA)

The goal of fold recognition

Methods of protein fold recognition attempt to detect similarities between protein 3D structures that are not accompanied by any significant sequence similarity. There are many approaches, but the unifying theme is to try to find folds that are compatible with a particular sequence. Unlike sequence-only comparison, these methods take advantage of the extra information made available by 3D structure. In effect, they turn the protein folding problem on its head: rather than predicting how a sequence will fold, they predict how well a fold will fit a sequence.

The alignments output by the programs can be used as a starting point, but the best alignment of sequence on to tertiary structure is still likely to come from careful human intervention.


Fold recognition methods and links 2/2

The goal of fold recognition:
– to detect similarities between protein 3D structures that are not accompanied by any significant sequence similarity.
– to find folds that are compatible with a particular sequence.
– to use 3D structure information to predict how well a fold will fit a sequence.

The best alignment of sequence on to tertiary structure is still likely to come from human intervention.


Analysis of protein folds and alignment of secondary structure elements

To find out to which fold your protein belongs, methods:
– SCOP (MRC Cambridge)
– CATH (University College, London)
– FSSP (EBI, Cambridge)
– 3Dee (EBI, Cambridge)
– HOMSTRAD (Biochemistry, Cambridge)
– VAST (NCBI, USA)

If there is any functional similarity between your protein and any members of the fold, then you may be able to back up your prediction of fold.


Alignment of sequence to tertiary structure

Start with the alignment from the fold recognition method, and consider the alignment of secondary structures.

Proteins having similar three-dimensional structures with little or no sequence similarity can differ substantially with respect to the finer details of their structures (e.g. loops, precise orientation of side chains, orientation of secondary structures, etc.).


Comparative or Homology Modelling

If your protein has significant homology to another protein of known three-dimensional structure,
– a model of your protein's 3D structure can be obtained via homology modelling.

Models can also be built if you have found a suitable fold via fold recognition:
– models can be generated automatically using the very useful SWISSMODEL server;
» WHAT IF (G. Vriend, EMBL, Heidelberg)
» MODELLER (A. Sali, Rockefeller University)
» MODELLER Mirror FTP site

Once you have a three-dimensional model, it is useful to look at protein 3D structures. Methods:
– GRASP Anthony Nicholls, Columbia, USA.
– MolMol Reto Koradi, ETH, Zurich, CH.
– Prepi Suhail Islam, ICRF, U.K.
– RasMol Roger Sayle, Glaxo, U.K.
