Computational Methods for Protein Structure Prediction

Ying Xu

2010/1/19

Transcript of Computational Methods for Protein Structure...

Page 1: Computational Methods for Protein Structure Prediction · glycam.org/.../pdfs/bcmb8330_protein_structure.pdf

Computational Methods for Protein Structure Prediction

Ying Xu

2010/1/19

Page 2: Outline

• introduction to protein structures
• the problem of protein structure prediction
• protein secondary structure prediction
• protein tertiary structure prediction
  o ab initio folding
  o homology modeling
  o protein threading

Page 3: Protein Sequence, Structure and Function

Protein sequence

>1MBN:_ MYOGLOBIN (154 AA)

MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG

Protein structure

Protein function: oxygen storage

Page 4: Protein Structure

A protein sequence folds into a “unique” shape (“structure”) that minimizes its free potential energy

Page 5: Protein Structures

Primary sequence

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Secondary structure

• α-helix
• β-sheet
  o anti-parallel
  o parallel

Page 6: Protein Structures

Tertiary structure

Quaternary structure

Page 7: Protein Structures

Backbone versus all-atom structures

Backbone + sidechain = all-atom structure

Backbone structure == structural fold

Page 8: Protein Structures

Protein structure is generally compact

Soluble protein structure
• individual domains are generally globular
• they share various common characteristics, e.g. hydrophobic moment profile

Membrane protein structure
• most of the amino acid sidechains of transmembrane segments are non-polar
• polar groups of the polypeptide backbone of transmembrane segments generally participate in hydrogen bonds

Page 9: Protein Structure Prediction

Problem: Given the amino acid sequence of a protein, computationally predict its 3-dimensional shape.

MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG

?

Page 10: Secondary Structure Prediction

Eight Secondary Structure Types

H: α-helix (i, i+4 form hydrogen bond)
G: 3₁₀ helix (i, i+3 form hydrogen bond)
I: π-helix (i, i+5 form hydrogen bond)
E: β-strand
B: bridge
T: β-turn
S: bend
C: coil

Rough categories: H, E, C
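The reduction from eight types to the three rough categories can be sketched as a lookup table. The grouping below (H, G, I to H; E, B to E; T, S, C to C) is an assumption: it is one commonly used convention, and the slide does not say which grouping is intended. The names `EIGHT_TO_THREE` and `reduce_states` are hypothetical.

```python
# Assumed grouping (not stated on the slide): all helices -> H,
# strand and bridge -> E, everything else -> C.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}


def reduce_states(ss8: str) -> str:
    """Map a string of eight-state labels to the three rough categories."""
    return "".join(EIGHT_TO_THREE[s] for s in ss8)


if __name__ == "__main__":
    print(reduce_states("CCHHHHGGGTTEEEEBSC"))
```

Other groupings exist (for example, treating G or B as coil), so reported three-state accuracies are only comparable when the same reduction is used.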

Page 11: Secondary Structure Prediction

Given a protein sequence (primary structure), predict its secondary structure categories

GHWIATRGQLIREAYEDYRHFSSECPFIP

CEEEEEEEEEECCCCCEEEEEEEEEECCCCCCHHHHHHHHHHHHCCCCCC

E: β-strand
H: α-helix
C: coil

Assumption: short stretches of residues have a propensity to adopt a certain structural conformation

Page 12: Secondary Structure Prediction

Secondary structure propensities: calculate the propensity for a given amino acid to adopt a certain ss-type

$$P = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$

i: amino acid, α: secondary structure type

Example: a data set with 300 proteins containing #residues=20,000, #Ala=2,000, #helix=4,000, #Ala in helix=500

p(α, aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
P = (500/20,000) / ((4,000/20,000) × (2,000/20,000)) = 500 / (4,000/10) = 1.25
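The worked example can be checked with a few lines of code. This is a minimal sketch of the propensity ratio p(α, aa) / (p(α) p(aa)) computed from raw counts; the function and argument names are hypothetical.

```python
def propensity(n_joint: int, n_ss: int, n_aa: int, n_total: int) -> float:
    """Propensity of an amino acid for a secondary structure type.

    n_joint: residues of this amino acid found in this ss-type
    n_ss:    residues in this ss-type (any amino acid)
    n_aa:    residues of this amino acid (any ss-type)
    n_total: all residues in the data set
    """
    if min(n_ss, n_aa, n_total) <= 0:
        raise ValueError("counts must be positive")
    # (n_joint / n_total) / ((n_ss / n_total) * (n_aa / n_total)),
    # rearranged so that only one division is performed.
    return n_joint * n_total / (n_ss * n_aa)


if __name__ == "__main__":
    # Counts from the slide: Ala in helix = 500, helix = 4,000,
    # Ala = 2,000, all residues = 20,000.
    print(propensity(500, 4000, 2000, 20000))
```

A value above 1 means the amino acid occurs in that ss-type more often than expected by chance; a value below 1 means it is under-represented.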

Page 13: Secondary Structure Prediction

Page 14: Secondary Structure Prediction

Predict helix (and predict strand)

1. Scan each window of 6 residues; if score > 4 predict helix (if score > 3 predict strand)
2. Propagate in both directions until a 4-residue window with mean propensity < 1
3. Move forward and repeat

Resolving Conflicts
For overlapping regions, decide according to propensity parameters

Chou-Fasman designed the first prediction program
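The three steps above can be sketched for the helix case as follows. This is a simplified illustration, not the original Chou-Fasman program. Several things are assumptions: the slide does not define "score", so it is taken here to be the number of residues in the window with propensity above 1; the propensity table holds placeholder values, not published parameters; strand prediction and conflict resolution are left out. All names are hypothetical.

```python
# Placeholder propensities for illustration only (NOT published parameters).
TOY_HELIX_PROPENSITY = {"A": 1.5, "G": 0.4}


def predict_helix(seq, propensity, window=6, nucleation=4,
                  ext_window=4, cutoff=1.0):
    """Return a string with 'H' for predicted helix and '-' elsewhere."""
    # Residues missing from the table are treated as neutral (1.0).
    p = [propensity.get(aa, 1.0) for aa in seq]
    n = len(p)
    is_helix = [False] * n
    i = 0
    while i + window <= n:
        # Step 1: assumed score = number of helix-favouring residues.
        score = sum(1 for x in p[i:i + window] if x > 1.0)
        if score > nucleation:
            start, end = i, i + window  # half-open interval [start, end)
            # Step 2: extend right while the 4-residue window ending at
            # the candidate residue keeps a mean propensity >= cutoff.
            while end < n and sum(p[end - ext_window + 1:end + 1]) / ext_window >= cutoff:
                end += 1
            # Extend left with the window starting at the candidate residue.
            while start > 0 and sum(p[start - 1:start - 1 + ext_window]) / ext_window >= cutoff:
                start -= 1
            for k in range(start, end):
                is_helix[k] = True
            i = end  # Step 3: move forward and repeat.
        else:
            i += 1
    return "".join("H" if h else "-" for h in is_helix)


if __name__ == "__main__":
    print(predict_helix("GGGGAAAAAAGGGG", TOY_HELIX_PROPENSITY))
```

Because extension uses a window mean, the predicted segment can run one or two residues past the favourable stretch; a fuller implementation would trim the ends.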

Page 15: Secondary Structure Prediction

This simple algorithm had ~60% prediction accuracy and was the best prediction method for a long while

The next breakthrough came in 1999 when the PSI-PRED program was developed

Key ideas
• using sequence profiles, generated by PSI-BLAST, rather than individual sequences for secondary structure prediction
• combining multiple predictors using a neural network
• accuracy reaches ~76%

Page 16: Secondary Structure Prediction

Using larger training sets, the prediction accuracy can reach ~80%. So how much further can we push this?

• Non-locality. Secondary structure is influenced by long-range interactions
• Some segments can have multiple structure types (chameleon sequences)

There is some room for improvement but not much

Figure labels: Long Piccolo C2A, Short Piccolo C2A

Page 17: Methods for Tertiary Structure Prediction

ab initio
• use first principles to fold proteins
• does not require templates
• high computational complexity

homology modeling
• similar sequence → similar structures
• practically very useful, need homologues

protein threading
• many proteins share the same structural fold
• a folding problem becomes a fold recognition problem

Page 18: Ab initio Structure Prediction

An energy function to describe the protein
o bond energy
o bond angle energy
o dihedral angle energy
o van der Waals energy
o electrostatic energy

Need an algorithm to search the conformational space to find the structural conformation that minimizes the function.

Not practical in general
o computationally too expensive
o accuracy is poor
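For orientation, the five terms listed above are often written in a form like the one below. The slide does not give a formula, so this is only a typical molecular-mechanics form. The functional shapes (harmonic bonds and angles, a cosine dihedral term, Lennard-Jones, Coulomb) and all symbols ($k_b$, $b_0$, $k_\theta$, $\theta_0$, $k_\phi$, $n$, $\delta$, $\epsilon_{ij}$, $\sigma_{ij}$, $q_i$) are assumptions, not taken from the source.

```latex
E = \sum_{\text{bonds}} k_b\,(b - b_0)^2
  + \sum_{\text{angles}} k_\theta\,(\theta - \theta_0)^2
  + \sum_{\text{dihedrals}} k_\phi\,\bigl[1 + \cos(n\phi - \delta)\bigr]
  + \sum_{i<j} 4\epsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}
      - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]
  + \sum_{i<j} \frac{q_i\, q_j}{4\pi\epsilon_0\, r_{ij}}
```

The last two sums run over pairs of atoms, which is one reason the search over conformations becomes expensive for whole proteins.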

Page 19: Homology Modeling

Observation: proteins with similar sequences tend to fold into similar structures.

1. The target sequence is aligned with the sequence of a known structure (typically requiring that they share sequence identity of 30% or higher)
2. Superimpose the target sequence onto the template, replacing equivalent side-chain atoms where necessary
3. Refine the model by minimizing an energy function.

Programs:
Modeller http://salilab.org/modeller/
Swiss-Model http://swissmodel.expasy.org//SWISS-MODEL.html
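The 30% threshold in step 1 refers to sequence identity over an alignment. A minimal sketch of that calculation is given below, assuming two already-aligned strings of equal length with "-" for gaps. Identity is computed here over aligned (non-gap) columns; other denominators are also in use, and the slide does not say which one is meant. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    target = "ACDE-"
    template = "ACDFG"
    identity = percent_identity(target, template)
    print(identity, "usable template" if identity >= 30.0 else "too distant")
```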

Page 20: Protein Threading

Basic premise

• The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)
• Statistics from Protein Data Bank (~61,000 structures): 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
• Chances for a protein to have a native-like structural fold in PDB are quite good (estimated to be 60-70%)
• Proteins with similar structural folds could be homologues or analogues

Page 21: Protein Threading

The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB

MTYKLILN …. NGVDGEWTYTE

Energy function: knowledge (or statistics) based rather than physics based
• Should be able to distinguish correct structural folds from incorrect structural folds
• Should be able to distinguish correct sequence-fold alignment from incorrect sequence-fold alignments

Page 22: Protein Threading: four basic components

• Structure database
• Energy function
• Sequence-structure alignment algorithm
• Prediction reliability assessment

Page 23: Protein Threading: structure database

Build a template database

Page 24: Protein Threading: structure database

• Non-redundant representatives through structure-structure and/or sequence-sequence comparison

FSSP, Families of Structurally Similar Proteins (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)

SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)

PDB-Select (http://www.sander.embl-heidelberg.de/pdbsel/)

Pisces (http://www.fccc.edu/research/labs/dunbrack/pisces/)


Page 25: Protein Threading: energy function

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

how well a residue fits a structural environment: E_s

how preferable to put two particular residues nearby: E_p

alignment gap penalty: E_g

total energy: E_p + E_s + E_g

find a sequence-structure alignment to minimize the energy function

Page 26: Protein Threading: energy function

A singleton energy measures each residue’s preference for a specific structural environment
• secondary structure
• solvent accessibility

Compare actual occurrence against its “expected value” by chance

[equation image not preserved in the transcript]
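The slide's own formula is not preserved. A common way to turn "observed versus expected by chance" into an energy, offered here only as an assumed form and not as the author's definition, is the inverse-Boltzmann (log-odds) expression below. $N(a, s)$ is the number of times amino acid $a$ is observed in structural environment $s$, $N(a)$ and $N(s)$ are the marginal counts, $N$ is the total number of residues, and $k_B T$ is a scaling constant; all of these symbols are introduced here for illustration.

```latex
E_s(s, a) = -k_B T \,\ln \frac{N_{\mathrm{obs}}(a, s)}{N_{\mathrm{exp}}(a, s)},
\qquad
N_{\mathrm{exp}}(a, s) = \frac{N(a)\, N(s)}{N}
```

With this sign convention a negative value means the residue is found in that environment more often than chance would predict, which matches the direction of minimizing the total energy.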

Page 27: Protein Threading: energy function

Singleton energy term

A simple definition of structural environment
• secondary structure: alpha-helix, beta-strand, loop
• solvent accessibility: 0, 10, 20, …, 100% of accessibility
• each combination of secondary structure and solvent accessibility level defines a structural environment
  o E.g., (alpha-helix, 30%), (loop, 80%), …

E_s: a scoring matrix of 30 structural environments by 20 amino acids
• E.g., E_s((loop, 30%), A)
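One way to index the 30 structural environments is sketched below. The binning is an assumption: the slide lists accessibility levels 0, 10, …, 100%, which are read here as the boundaries of ten bins of width 10 (with 100% placed in the last bin), so that 3 secondary structure types × 10 bins = 30 environments. The names `SS_TYPES`, `N_BINS`, and `environment_index` are hypothetical.

```python
SS_TYPES = ("helix", "strand", "loop")
N_BINS = 10  # assumed: ten accessibility bins of width 10%


def environment_index(ss: str, accessibility_pct: float) -> int:
    """Map (secondary structure, % solvent accessibility) to 0..29."""
    if ss not in SS_TYPES:
        raise ValueError(f"unknown secondary structure type: {ss}")
    if not 0 <= accessibility_pct <= 100:
        raise ValueError("accessibility must be between 0 and 100")
    # 100% would fall into an eleventh bin, so clamp it into the last one.
    bin_idx = min(int(accessibility_pct // 10), N_BINS - 1)
    return SS_TYPES.index(ss) * N_BINS + bin_idx


if __name__ == "__main__":
    # The two environments named on the slide.
    print(environment_index("helix", 30), environment_index("loop", 80))
```

The returned index would select a row of the 30 × 20 scoring matrix E_s; the column is the amino acid.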

Page 28: Protein Threading: energy function

     Helix                   Sheet                   Loop
     Buried  Inter   Exposed Buried  Inter   Exposed Buried  Inter   Exposed
ALA  -0.578  -0.119  -0.160   0.010   0.583   0.921   0.023   0.218   0.368
ARG   0.997  -0.507  -0.488   1.267  -0.345  -0.580   0.930  -0.005  -0.032
ASN   0.819   0.090  -0.007   0.844   0.221   0.046   0.030  -0.322  -0.487
ASP   1.050   0.172  -0.426   1.145   0.322   0.061   0.308  -0.224  -0.541
CYS  -0.360   0.333   1.831  -0.671   0.003   1.216  -0.690  -0.225   1.216
GLN   1.047  -0.294  -0.939   1.452   0.139  -0.555   1.326   0.486  -0.244
GLU   0.670  -0.313  -0.721   0.999   0.031  -0.494   0.845   0.248  -0.144
GLY   0.414   0.932   0.969   0.177   0.565   0.989  -0.562  -0.299  -0.601
HIS   0.479  -0.223   0.136   0.306  -0.343  -0.014   0.019  -0.285   0.051
ILE  -0.551   0.087   1.248  -0.875  -0.182   0.500  -0.166   0.384   1.336
LEU  -0.744  -0.218   0.940  -0.411   0.179   0.900  -0.205   0.169   1.217
LYS   1.863  -0.045  -0.865   2.109  -0.017  -0.901   1.925   0.474  -0.498
MET  -0.641  -0.183   0.779  -0.269   0.197   0.658  -0.228   0.113   0.714
PHE  -0.491   0.057   1.364  -0.649  -0.200   0.776  -0.375  -0.001   1.251
PRO   1.090   0.705   0.236   1.249   0.695   0.145  -0.412  -0.491  -0.641
SER   0.350   0.260  -0.020   0.303   0.058  -0.075  -0.173  -0.210  -0.228
THR   0.291   0.215   0.304   0.156  -0.382  -0.584  -0.012  -0.103  -0.125
TRP  -0.379  -0.363   1.178  -0.270  -0.477   0.682  -0.220  -0.099   1.267
TYR  -0.111  -0.292   0.942  -0.267  -0.691   0.292  -0.015  -0.176   0.946
VAL  -0.374   0.236   1.144  -0.912  -0.334   0.089  -0.030   0.309   0.998
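A table like this is used by looking up one value per aligned residue and summing. The sketch below copies only the helix columns for ALA and LYS from the table; the data layout and the names `SINGLETON` and `singleton_energy` are hypothetical, and a real threading program would hold all 20 rows and every environment.

```python
# (residue, secondary structure, burial) -> energy; values copied from
# the ALA and LYS rows, helix columns, of the table above.
SINGLETON = {
    ("ALA", "helix", "buried"): -0.578,
    ("ALA", "helix", "inter"): -0.119,
    ("ALA", "helix", "exposed"): -0.160,
    ("LYS", "helix", "buried"): 1.863,
    ("LYS", "helix", "inter"): -0.045,
    ("LYS", "helix", "exposed"): -0.865,
}


def singleton_energy(placements):
    """Sum singleton energies for (residue, ss, burial) placements."""
    return sum(SINGLETON[p] for p in placements)


if __name__ == "__main__":
    good = [("ALA", "helix", "buried"), ("LYS", "helix", "exposed")]
    bad = [("ALA", "helix", "exposed"), ("LYS", "helix", "buried")]
    # Lower is better: burying ALA and exposing LYS scores lower than
    # the reverse placement.
    print(singleton_energy(good), singleton_energy(bad))
```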

Page 29: Protein Threading: energy function

Pair-wise interaction energy term

• It measures the preference of a pair of amino acids to be close in 3D space.
• Observed occurrence of a pair compared with its “expected” occurrence
  o uniform state model

Page 30: Protein Threading: energy function

ALA -140
ARG 268 -18
ASN 105 -85 -435
ASP 217 -616 -417 17
CYS 330 67 106 278 -1923
GLN 27 -60 -200 67 191 -115
GLU 122 -564 -136 140 122 10 68
GLY 11 -80 -103 -267 88 -72 -31 -288
HIS 58 -263 61 -454 190 272 -368 74 -448
ILE -114 110 351 318 154 243 294 179 294 -326
LEU -182 263 358 370 238 25 255 237 200 -160 -278
LYS 123 310 -201 -564 246 -184 -667 95 54 194 178 122
MET -74 304 314 211 50 32 141 13 -7 -12 -106 301 -494
PHE -65 62 201 284 34 72 235 114 158 -96 -195 -17 -272 -206
PRO 174 -33 -212 -28 105 -81 -102 -73 -65 369 218 -46 35 -21 -210
SER 169 -80 -223 -299 7 -163 -212 -186 -133 206 272 -58 193 114 -162 -177
THR 58 60 -231 -203 372 -151 -211 -73 -239 109 225 -16 158 283 -98 -215 -210
TRP 51 -150 -18 104 52 -12 157 -69 -212 -18 81 29 -5 31 -432 129 95 -20
TYR 53 -132 53 268 62 -90 269 58 34 -163 -93 -312 -173 -5 -81 104 163 -95 -6
VAL -105 171 298 431 196 180 235 202 204 -232 -218 269 -50 -42 46 267 73 101 107 -324

    ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
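Because the matrix is symmetric, only the lower triangle is listed, and a lookup has to try both orders of the pair. A minimal sketch follows, using only the first three rows of the matrix; the nested-dictionary layout and the names `LOWER_TRIANGLE` and `pair_energy` are hypothetical.

```python
# Row residue -> {column residue -> energy}; first three rows of the
# lower-triangular matrix above.
LOWER_TRIANGLE = {
    "ALA": {"ALA": -140},
    "ARG": {"ALA": 268, "ARG": -18},
    "ASN": {"ALA": 105, "ARG": -85, "ASN": -435},
}


def pair_energy(a: str, b: str) -> int:
    """Look up the pairwise energy regardless of argument order."""
    row = LOWER_TRIANGLE.get(a, {})
    if b in row:
        return row[b]
    # The entry is stored under the other order (raises KeyError if the
    # pair is absent from this partial table).
    return LOWER_TRIANGLE[b][a]


if __name__ == "__main__":
    print(pair_energy("ALA", "ARG"), pair_energy("ARG", "ALA"))
```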

Page 31: Protein Threading: energy function

Gap penalty term

w(k) = h + gk, k ≥ 1, w(0) = 0

h: opening gap penalty
g: extension gap penalty

FDSK---THRGHR
:.:    :: :::
FESYWTCTH-GHR

FDSK-T--HRGHR
:.:  :  : :::
FESYWTCTH-GHR
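The affine penalty w(k) = h + gk can be applied to the two example alignments to see why one long gap costs less than several short ones. The sketch below is illustrative: the values h = 10 and g = 1 are made up for the example, and the function names are hypothetical.

```python
import re


def gap_penalty(k: int, h: float, g: float) -> float:
    """Affine penalty w(k) = h + g*k for a gap of length k, w(0) = 0."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    return 0 if k == 0 else h + g * k


def total_gap_penalty(aligned_a: str, aligned_b: str, h: float, g: float) -> float:
    """Sum w(k) over every run of '-' in either aligned sequence."""
    total = 0
    for row in (aligned_a, aligned_b):
        for run in re.findall(r"-+", row):
            total += gap_penalty(len(run), h, g)
    return total


if __name__ == "__main__":
    h, g = 10, 1  # illustrative values, not from the slide
    # Two gaps in total: 2h + 4g.
    print(total_gap_penalty("FDSK---THRGHR", "FESYWTCTH-GHR", h, g))
    # Same number of gap positions split into three gaps: 3h + 4g.
    print(total_gap_penalty("FDSK-T--HRGHR", "FESYWTCTH-GHR", h, g))
```

Both alignments contain four gap positions, so the difference in penalty comes entirely from the extra gap opening in the second one.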

Page 32: Threading Parameter Optimization

How to determine the weights of the different energy terms?

$$E_{total} = \omega_s E_{singleton} + \omega_p E_{pairwise} + \omega_g E_{gap}$$

Select the weights to give the “best” threading performance on a training set (fold recognition and alignment accuracy)

Different weights for different classes? (superfamily, fold)
• pair-wise may contribute more for fold level threading
• mutation/profile terms dominate in superfamily level threading

Page 33: Protein Threading: algorithm

• Dynamic programming
• Heuristic algorithms for pair-wise interactions
  o Frozen approximation algorithm (A. Godzik et al.)
  o Double dynamic programming (D. Jones et al.)
  o Monte Carlo sampling (S.H. Bryant et al.)
• Rigorous algorithms for pair-wise interactions
  o Branch-and-bound (R.H. Lathrop and T.F. Smith)
  o Divide-and-conquer (Y. Xu et al.): PROSPECT
  o Linear programming (J. Xu et al.): RAPTOR
  o Tree decomposition (L. Cai et al.)
• Rigorous algorithm for treating backbone and side-chain simultaneously (Li et al.)

Page 34: Fold Recognition

MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Score = -1500    Score = -900    Score = -1120    Score = -720

Which one is the correct structural fold for the target sequence, if any?

The one with the lowest score?

Page 35: Fold Recognition

Query sequence: AAAA

Template #1: AATTAATACATTAATATAATAAAATTACTGA


Template #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA

Better template?


Which of these two sequences will have a better chance of a good match with the query sequence after randomly reshuffling them?

Page 36: Fold Recognition

• Different template structures may have different background score distributions, making direct comparison of threading scores against different templates invalid
• Comparison of threading results should be based on how much the score stands out in its background score distribution rather than on the threading scores directly

Page 37: Fold Recognition

Threading 100,000 sequences against a template structure provides the baseline information about the background scores of the template

By locating where the threading score with a particular query sequence falls in this background distribution, one can decide how significant the score, and hence the threading result, is!

Figure labels: not significant, significant

Page 38: Fold Recognition

$$Z\text{-score} = \frac{\text{score} - \text{average}}{\text{standard deviation}}$$

Randomly shuffle the query sequence and calculate the alignment score

The goal is to pick the predicted structure with high statistical significance
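The shuffling procedure can be sketched as follows. The scoring function `toy_score` is a hypothetical stand-in (the negative count of identical positions against a fixed string) so that the example is self-contained; a real threading program would score the shuffled sequence against the template structure with its energy function. The function names, the number of shuffles, and the use of the sample standard deviation are all assumptions.

```python
import random
import statistics


def z_score(score: float, background: list) -> float:
    """(score - average) / standard deviation of the background scores."""
    sigma = statistics.stdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (score - statistics.mean(background)) / sigma


def shuffled_background(query: str, score_fn, n_shuffles: int = 1000, seed: int = 0) -> list:
    """Score randomly shuffled copies of the query (same composition)."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    residues = list(query)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(score_fn("".join(residues)))
    return scores


def toy_score(seq: str, template: str = "MTYKLILNGK") -> int:
    """Hypothetical score: lower is better, as with threading energies."""
    return -sum(a == b for a, b in zip(seq, template))


if __name__ == "__main__":
    query = "MTYKLILNGK"
    background = shuffled_background(query, toy_score, n_shuffles=200, seed=1)
    print(z_score(toy_score(query), background))
```

Since lower scores are better here, a strongly negative Z-score marks a match that stands out from the template's background distribution.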

Page 39: State of the Art

• ~60% of the proteins encoded in any genome can probably have their structural folds predicted
• ~60% of these proteins can have their structures predicted accurately enough to be useful in guiding experimental designs