1 Helsinki University of Technology DNA, RNA, Protein Structure Prediction Laura Pombo Laboratory of...

Helsinki University of Technology

DNA, RNA, Protein Structure Prediction

Laura Pombo Laboratory of Computational Engineering

DNA, RNA, Protein Structure Prediction, 23.11.2005, laura.pombo@tkk.fi

INTRODUCTION:

Bioinformatics

Proteins

BIOINFORMATICS

Bioinformatics involves the integration of computers, software tools, and databases in an effort to address biological questions. Bioinformatics approaches are often used for major initiatives that generate large data sets.

Two important large-scale activities that use bioinformatics are genomics and proteomics.

Genomics refers to the analysis of genomes.

– A genome can be thought of as the complete set of DNA sequences that codes for the hereditary material that is passed on from generation to generation.

– Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.

Bioinformatics, continued

Proteomics, on the other hand, refers to the analysis of the complete set of proteins or proteome.

In addition to genomics and proteomics, there are many more areas of biology where bioinformatics is being applied (e.g., metabolomics, transcriptomics). Each of these areas aims to understand complex biological systems. Many scientists today refer to the next wave in bioinformatics as systems biology, an approach to tackling new and complex biological questions. Systems biology involves the integration of genomics, proteomics, and bioinformatics information to create a whole-system view of a biological entity.

Bioinformatics: http://www.bioinformatics.ubc.ca/


Central Dogma

Protein

DNA to RNA

Portions of DNA Sequence Are Transcribed into RNA

The first step a cell takes in reading out its genetic instructions is to copy a particular portion of its DNA nucleotide sequence (= gene) into an RNA nucleotide sequence.

Similarities:

– DNA and RNA are both linear polymers made of four different types of nucleotide subunits linked together by phosphodiester bonds

– DNA and RNA both contain the bases adenine (A), guanine (G) and cytosine (C)

Differences:

– In RNA the nucleotides are ribonucleotides (= contain the sugar ribose rather than deoxyribose)

– RNA contains uracil (U) instead of thymine (T)
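The T-to-U difference can be illustrated with a minimal Python sketch. It assumes the input is the DNA coding strand written 5' to 3', so the RNA copy has the same letters with uracil in place of thymine. The function name transcribe is made up for this illustration and does not come from the book or any library.

```python
def transcribe(dna_coding_strand: str) -> str:
    """Return the RNA sequence for a DNA coding-strand sequence.

    Assumption: the input is the coding (sense) strand, so transcription
    amounts to replacing every thymine (T) with uracil (U).
    """
    dna = dna_coding_strand.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected letters in DNA sequence: {sorted(unexpected)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTTA"))
```

Starting from the template strand instead would also require taking the reverse complement, which this sketch leaves out.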

Summary based on the book Molecular Biology of the Cell (Bruce Alberts et al.)

Different RNAs

– mRNAs (messenger RNAs), code for proteins

– rRNAs (ribosomal RNAs), form the basic structure of the ribosome and catalyze protein synthesis

– tRNAs (transfer RNAs), central to protein synthesis as adaptors between mRNA and amino acids

– snRNAs (small nuclear RNAs), function in a variety of nuclear processes, including the splicing of pre-mRNA

– snoRNAs (small nucleolar RNAs), used to process and chemically modify rRNAs

– Other noncoding RNAs, function in diverse cellular processes, including telomere synthesis, X-chromosome inactivation and the transport of proteins into the ER

RNA structure prediction


http://gibk26.bse.kyutech.ac.jp/jouhou/image/dna-protein/all/N3utr.gif

RNA is transcribed (or synthesized) in cells as single strands of ribonucleic acid. However, these sequences are not simply long strands of nucleotides. Rather, intra-strand base pairing produces secondary structures.

In RNA, guanine and cytosine pair (GC) by forming three hydrogen bonds, and adenine and uracil pair (AU) by forming two hydrogen bonds; additionally, guanine and uracil can form a weaker wobble base pair (GU).
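The pairing rules above can be written down as a small lookup. This Python sketch classifies a pair of bases and lists every pair of positions in a sequence that could form a base pair. The names pair_type and candidate_pairs, and the minimum hairpin loop of three unpaired bases, are assumptions made for the illustration.

```python
# Allowed RNA base pairs, stored in both orientations.
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_type(a: str, b: str):
    """Return 'Watson-Crick', 'wobble', or None if the bases cannot pair."""
    pair = (a.upper(), b.upper())
    if pair in WATSON_CRICK:
        return "Watson-Crick"
    if pair in WOBBLE:
        return "wobble"
    return None


def candidate_pairs(seq: str, min_loop: int = 3):
    """List (i, j, type) for all positions i < j that could pair.

    Assumption: at least min_loop unpaired bases must lie between i and j,
    a common simplification for hairpin loops.
    """
    pairs = []
    for i in range(len(seq)):
        for j in range(i + min_loop + 1, len(seq)):
            kind = pair_type(seq[i], seq[j])
            if kind is not None:
                pairs.append((i, j, kind))
    return pairs


print(candidate_pairs("GAAAC"))
```

Listing candidate pairs is only the starting point: a real structure uses a consistent subset of them, which is what the prediction algorithms select.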

RNA structure prediction

Vienna RNA Package (RNA Secondary Structure Prediction and Comparison): http://www.tbi.univie.ac.at/~ivo/RNA/

A few precompiled binaries for Windows are available for download: http://www.tbi.univie.ac.at/~ivo/RNA/windoze/

The Vienna RNA Package consists of a C code library and several stand-alone programs for the prediction and comparison of RNA secondary structures.

– RNA secondary structure prediction through energy minimization is the most used function in the package.

– The program provides three kinds of dynamic programming algorithms for structure prediction:

» the minimum free energy algorithm of (Zuker & Stiegler 1981), which yields a single optimal structure,

» the partition function algorithm of (McCaskill 1990), which calculates base pair probabilities in the thermodynamic ensemble, and

» the suboptimal folding algorithm of (Wuchty et al. 1999), which generates all suboptimal structures within a given energy range of the optimal energy.
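To give a feel for dynamic programming over RNA structures, here is a Python sketch of the much simpler Nussinov-style base-pair maximization. It is not the Zuker & Stiegler energy model and not code from the Vienna RNA Package: it only counts nested base pairs instead of minimizing free energy. The names max_base_pairs and can_pair, and the minimum loop length of 3, are assumptions for this illustration.

```python
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def can_pair(a: str, b: str) -> bool:
    """True if the two bases can form a GC, AU or GU pair."""
    return (a, b) in PAIRS


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in seq (Nussinov-style DP).

    best[i][j] holds the best pair count for the subsequence seq[i..j].
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: position j stays unpaired
            for k in range(i, j - min_loop):    # case 2: position j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))
```

An energy-based method fills a table over subsequences in the same way, but scores loops and stacked pairs with measured thermodynamic parameters rather than adding 1 per pair.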

RNAFOLD tool

RNAfold reads RNA sequences from stdin and calculates their minimum free energy (mfe) structure, partition function (pf) and base pairing probability matrix. It returns the mfe structure in bracket notation, its energy, the free energy of the thermodynamic ensemble and the frequency of the mfe structure in the ensemble to stdout. It also produces PostScript files with plots of the resulting secondary structure graph and a "dot plot" of the base pairing matrix. The dot plot shows a matrix of squares with area proportional to the pairing probability in the upper half, and one square for each pair in the minimum free energy structure in the lower half.
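In bracket notation, each paired base is written as a matching "(" and ")", and each unpaired base as ".". The following Python sketch turns such a string into a list of paired positions. The function name parse_dot_bracket and the example string are made up for this illustration; the string is not real RNAfold output.

```python
def parse_dot_bracket(structure: str):
    """Return a sorted list of (i, j) index pairs for a dot-bracket string.

    '(' opens a base pair, ')' closes the most recently opened one,
    and '.' marks an unpaired position. Indices are 0-based.
    """
    open_positions = []
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("(((...)))"))
```

A stack is enough because the notation only encodes nested pairs.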

ALIDOT program

Detecting Conserved RNA Structures

The program alidot is designed to detect conserved RNA secondary structures in small data sets of related RNA sequences. The method is a combination of structure prediction and comparative sequence alignment.

http://www.tbi.univie.ac.at/~ivo/RNA/ALIDOT/

http://images.google.com/imgres?imgurl=http://images.clinicaltools.com/images/gene/dna_versus_rna_reversed.jpg&imgrefurl=http://www.geneticsolutions.com/PageReq%3Fid%3D1530:1873&h=461&w=405&sz=135&tbnid=R7LVIZO4g6cJ:&tbnh=125&tbnw=109&hl=en&start=2&prev=/images%3Fq%3DDNA%2Bto%2BRNA%26svnum%3D10%26hl%3Den%26lr%3D%26sa%3DG

DNA structure prediction


MEME

MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences.

– MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern.

– Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.

– MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested.

– MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.
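A position-dependent letter-probability matrix can be sketched in a few lines of Python. The example below builds one from motif occurrences that are already aligned and of equal width. It does not perform motif discovery and is not MEME's algorithm; the function name letter_probability_matrix, the optional pseudocount, and the four example sites are assumptions made for illustration.

```python
from collections import Counter


def letter_probability_matrix(sites, alphabet="ACGT", pseudocount=0.0):
    """Return one {letter: probability} dict per motif position.

    sites: equal-length, gap-free occurrences of the same motif.
    pseudocount: optional value added to every count to avoid zero probabilities.
    """
    if not sites:
        raise ValueError("need at least one site")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for position in range(width):
        counts = Counter(site[position] for site in sites)
        matrix.append({letter: (counts[letter] + pseudocount) / total
                       for letter in alphabet})
    return matrix


example_sites = ["ACGT", "ACGA", "ACCT", "AGGT"]  # hypothetical motif occurrences
for column in letter_probability_matrix(example_sites):
    print(column)
```

Each column sums to 1, and because the matrix has one column per motif position it cannot represent gaps, which matches the note above that gapped patterns are split into separate motifs.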

DNA structure prediction

Other similar programs:

– Cassandra: http://www-hto.usc.edu/software/procrustes/cassandra/cass_frm.html

– DNA Sequence Translation

– GENEID, which predicts gene structure in query sequences (US)

– GRAIL, GenHunt, Censor, Pythia, Entrez, Beauty, etc.

For more tools, have a look at: http://restools.sdsc.edu/biotools/biotools16.html

PROTEIN

Protein: A large molecule composed of one or more chains of amino acids in a specific order determined by the base sequence of nucleotides in the DNA coding for the protein.

Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs. Each protein has unique functions. Proteins are essential components of muscles, skin, bones and the body as a whole.

Protein is one of the three types of nutrients used as energy sources by the body, the other two being carbohydrate and fat. Proteins and carbohydrates each provide about 4 kilocalories (food calories) of energy per gram, while fats provide about 9 kilocalories per gram.

The word "protein" was introduced into science by the great Swedish physician and chemist Jöns Jacob Berzelius (1779-1848) who also determined the atomic and molecular weights of thousands of substances, discovered several elements including selenium, first isolated silicon and titanium, and created the present system of writing chemical symbols and reactions.

Tools for PROTEIN Structure Prediction: ExPASy

The ExPASy (Expert Protein Analysis System) proteomics server from the Swiss Institute of Bioinformatics (SIB) is dedicated to molecular biology with an emphasis on data relevant to proteins. It allows you to browse through a number of databases produced in Geneva, such as Swiss-Prot, PROSITE, SWISS-2DPAGE, SWISS-3DIMAGE, ENZYME, as well as other cross-referenced databases (such as EMBL/GenBank/DDBJ, OMIM, Medline, FlyBase, ProDom, SGD, SubtiList, etc).

It also allows access to many analytical tools for the identification of proteins, the analysis of their sequence and the prediction of their tertiary structure. ExPASy also offers many documents relevant to these fields of research, and the server links to the most relevant sources of information across the Web. Swiss-2DService is a non-profit 2-D PAGE service to the scientific community.

ExPASy was created in August 1993 and was one of the first WWW servers for biological sciences. Since then it has undergone constant modifications and improvements.

PROSITE Database of protein families and domains

PROSITE is a database of protein families and domains.

It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.

It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families.

Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution.

http://au.expasy.org/prosite/prosite_details.html

These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.

A pertinent analogy is the use of fingerprints by the police for identification purposes. A fingerprint is generally sufficient to identify a given individual. Similarly, a protein signature can be used to assign a newly sequenced protein to a specific family of proteins and thus to formulate hypotheses about its function.

PROSITE currently contains patterns and profiles specific for more than a thousand protein families or domains. Each of these signatures comes with documentation providing background information on the structure and function of these proteins.

(Protein) Structure Prediction

Experimental Data

– Disulphide bonds
– Spectroscopic data
– Site directed mutagenesis studies
– Knowledge of proteolytic cleavage sites
– Etc.

Protein sequence data

Is your protein a transmembrane protein, or does it contain transmembrane segments? Methods for predicting these segments:

– TMAP (EMBL)
– PredictProtein (EMBL/Columbia)
– TMHMM (CBS, Denmark)
– TMpred (Baylor College)
– DAS (Stockholm)

Does your protein contain coiled-coils? Methods:

– At COILS server, the COILS program

Does your protein contain regions of low complexity? Methods:

– the program SEG
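One rough way to illustrate what "low complexity" means is the Shannon entropy of the residue composition in a sliding window: windows dominated by one or two residue types score low. The Python sketch below is a simplified stand-in and should not be read as the SEG program's actual method. The window size, the threshold, the function names and the example sequence are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the letter composition of segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2):
    """Return start positions of windows whose entropy is at or below threshold."""
    return [start
            for start in range(len(seq) - window + 1)
            if shannon_entropy(seq[start:start + window]) <= threshold]


# Hypothetical sequence with a run of serines in the middle.
example = "ACDEFGHIKL" + "S" * 12 + "MNPQRTVWYA"
print(low_complexity_windows(example, window=12, threshold=1.0))
```

Flagging such regions before a database search matters because compositionally biased segments can produce high-scoring but biologically meaningless matches.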

Sequence database searching 1/2

Comparisons with sequence databases to find homologues, methods:

– the BLAST suite of programs.
– National Center for Biotechnology Information (USA) Searches
– European Bioinformatics Institute (UK) Searches
– BLAST search through SBASE (domain database; ICGEB, Trieste)
– Other methods for comparing a single sequence to a database include:

» The FASTA suite (William Pearson, University of Virginia, USA)
» SCANPS (Geoff Barton, European Bioinformatics Institute, UK)
» BLITZ (Compugen's fast Smith Waterman search)
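The Smith-Waterman search mentioned in the last item is a dynamic programming method for local alignment. The Python sketch below computes only the best local alignment score between two sequences. The simple match/mismatch values and the linear gap penalty are assumptions for illustration; real protein searches typically use a substitution matrix and affine gap penalties, and nothing here reflects how BLITZ itself is implemented.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0,                    # start a new local alignment
                        diagonal,             # align a[i-1] with b[j-1]
                        previous[j] + gap,    # gap in b
                        current[j - 1] + gap) # gap in a
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Resetting negative scores to zero is what makes the alignment local, so a short shared region can score well even when the flanking sequence is unrelated.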

Multiple sequence information – building a profile from some kind of multiple sequence alignment.

Methods:

» PSI-BLAST (NCBI, Washington)
» ProfileScan Server (ISREC, Geneva)
» HMMER Hidden Markov Model searching (Sean Eddy, Washington University)
» Wise package (Ewan Birney, Sanger Centre; this is for protein versus DNA comparisons)

Sequence database searching 2/2

Incorporating multiple sequence information – a MOTIF.

– describes the key residues that are conserved and define the family.
– Sometimes this is called a "signature". For example, "H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]" describes a family of DNA binding proteins.

Methods:

» PROSITE, ExPASy
» EBI
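A signature written in this style can be read as a regular expression: "x" is any residue, square brackets list the allowed residues, and a number in parentheses is a repeat count. The Python sketch below converts such a pattern to a regular expression and searches a sequence with it. The function name prosite_to_regex and the test sequence are made up for illustration, and only the common syntax elements are handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern to a Python regular expression.

    Handles: '-' separators, 'x' (any residue), [..] allowed residues,
    {..} forbidden residues, (n) and (n,m) repeats, '<' and '>' anchors.
    """
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")      # forbidden residues
    regex = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)   # repeat counts
    regex = regex.replace("x", ".")
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


signature = "H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]"
regex = prosite_to_regex(signature)
print(regex)

# Hypothetical sequence that contains one occurrence of the signature.
sequence = "MKHFALAGAAAAALHAAADKK"
match = re.search(regex, sequence)
print(match.start() if match else None)
```

A match only suggests family membership; as with fingerprints, the pattern's documentation and the context of the hit decide whether the assignment is plausible.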

Pre-prepared protein alignments, databases:

– SMART (Oxford/EMBL)
– PFAM (Sanger Centre/Wash-U/Karolinska Institutet)
– COGS (NCBI)
– PRINTS (UCL/Manchester)
– BLOCKS (Fred Hutchinson Cancer Research Centre, Seattle)
– SBASE (ICGEB, Trieste)

Multiple Sequence Alignment

Some methods and tools:

– EBI (UK) Clustalw Server
– IBCP (France) Multalin Server
– IBCP (France) Clustalw Server
– IBCP (France) Combined Multalin/Clustalw
– MSA (USA) Server
– BCM Multiple Sequence Alignment ClustalW Server (USA)

Alignments can provide:

– Information as to protein domain structure
– The location of residues likely to be involved in protein function
– Information on residues likely to be buried in the protein core or exposed to solvent
– More information than a single sequence for applications like homology modelling and secondary structure prediction.

Secondary Structure Prediction methods and links

Methods and tools:

– PSI-pred (PSI-BLAST profiles used for prediction; David Jones, Warwick)
– JPRED Consensus prediction (includes many of the methods given below; Cuff & Barton, EBI)
– DSC King & Sternberg (this server)
– PREDATOR Frishman & Argos (EMBL)
– PHD home page Rost & Sander, EMBL, Germany
– ZPRED server Zvelebil et al., Ludwig, U.K.
– nnPredict Cohen et al., UCSF, USA.
– BMERC PSA Server Boston University, USA
– SSP (Nearest-neighbor) Solovyev and Salamov, Baylor College, USA.

If there is no homologue of known structure from which to make a 3D model:

– predict secondary structure to provide the location of alpha helices and beta strands within a protein or protein family.

Fold recognition methods and links 1/2

Methods:

– 3D-pssm (this server)
– TOPITS (EMBL)
– UCLA-DOE Structure Prediction Server (UCLA)
– 123D
– UCSC HMM (UCSC)
– FAS (Burnham Institute)
– THREADER (Warwick)
– ProFIT CAME (Salzburg)

Even with no homologue of known 3D structure, it may be possible to find a suitable fold for your protein among known 3D structures by way of fold recognition methods.

Ab initio prediction of protein 3D structures is not possible at present,

– and a general solution to the protein folding problem is not likely to be found in the near future.
– However, it has long been recognised that proteins often adopt similar folds despite no significant sequence or functional similarity.
– There are numerous protein structure classifications now available via the WWW:

» SCOP (MRC Cambridge)
» CATH (University College, London)
» FSSP (EBI, Cambridge)
» 3 Dee (EBI, Cambridge)
» HOMSTRAD (Biochemistry, Cambridge)
» VAST (NCBI, USA)

The goal of fold recognition

Methods of protein fold recognition attempt to detect similarities between protein 3D structures that are not accompanied by any significant sequence similarity. There are many approaches, but the unifying theme is to try to find folds that are compatible with a particular sequence. Unlike sequence-only comparison, these methods take advantage of the extra information made available by 3D structure information. In effect, they turn the protein folding problem on its head: rather than predicting how a sequence will fold, they predict how well a fold will fit a sequence.

The alignments output by the programs can be used as a starting point, but the best alignment of sequence on to tertiary structure is still likely to come from careful human intervention.

Fold recognition methods and links 2/2

The goal of fold recognition:

– to detect similarities between protein 3D structures that are not accompanied by any significant sequence similarity.
– to find folds that are compatible with a particular sequence.
– to use 3D structure information to predict how well a fold will fit a sequence.

The best alignment of sequence on to tertiary structure is still likely to come from human intervention.

Analysis of protein folds and alignment of secondary structure elements

To find out to which fold your protein belongs, methods:

– SCOP (MRC Cambridge)
– CATH (University College, London)
– FSSP (EBI, Cambridge)
– 3 Dee (EBI, Cambridge)
– HOMSTRAD (Biochemistry, Cambridge)
– VAST (NCBI, USA)

If there is any functional similarity between your protein and any members of the fold, then you may be able to back up your prediction of fold.

Alignment of sequence to tertiary structure

Start with the alignment from the fold recognition method, and consider the alignment of secondary structures.

Proteins having similar three-dimensional structures with little or no sequence similarity can differ substantially with respect to the finer details of their structures (e.g. loops, precise orientation of side chains, orientation of secondary structures, etc.).

Comparative or Homology Modelling

If your protein has significant homology to another protein of known three-dimensional structure,

– a model of your protein's 3D structure can be obtained via homology modelling.

To build models, if you have found a suitable fold via fold recognition:

– generate models automatically using the very useful SWISSMODEL server;

» WHAT IF (G. Vriend, EMBL, Heidelberg)
» MODELLER (A. Sali, Rockefeller University)
» MODELLER Mirror FTP site

Once you have a three-dimensional model, it is useful to look at protein 3D structures. Methods:

– GRASP Anthony Nicholls, Columbia, USA.
– MolMol Reto Koradi, ETH, Zurich, CH.
– Prepi Suhail Islam, ICRF, U.K.
– RasMol Roger Sayle, Glaxo, U.K.
