Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and...


Local Statistical Dependencies in Protein Structure: Discovery, Evaluation, Prediction and Applications

Advancement to Candidacy

Computer Science Department

by Rachel Karchin

Advisor: Kevin Karplus

2

Outline

Protein structure - primary, secondary, tertiary

Fold recognition, local and secondary structure

Alphabets of local structure

Designing and evaluating local structure alphabets

Improving fold recognition

3

Molecular structure of proteins

Proteins are large, organic molecules composed of smaller molecules called amino acids.

Ball-and-stick atomic model of crambin, a plant seed protein with 46 amino acids

threonine cysteine arginine

4

The amino acids

There are 20 kinds of amino acids found in natural proteins.

All share a common structure.

Biochemistry, Mathews, 3rd ed., Addison-Wesley

Figure labels: R side chain, carboxyl group, amine group, alpha carbon (with attached hydrogen)

5

Primary structure

Proteins consist of one or more polypeptide chains of amino acids connected by peptide bonds.

The sequence of linked amino acids along the chain is called the protein’s primary structure.

Phe-Leu-Ser-Cys . . .
FLSC . . .

Access Excellence NHGRI Graphics Gallery

6

Secondary structure

Symmetric patterns of hydrogen bonds between amino acids.

Anthony Day/Pace et al. 1996

Helix. H-bonds between residues close in primary sequence.

7

Secondary structure

Strand. H-bonds between residues not close in primary sequence.

Anthony Day/Pace et al. 1996

8

Protein Folding

In an aqueous environment (such as cell cytoplasm), polypeptide chains fold into 3D shapes (tertiary structure).

9

From primary to tertiary structure

A protein’s 3D shape is determined by its primary amino acid sequence (Anfinsen et al. 1963).

Predicting tertiary structure from amino acid sequence is an unsolved problem.

– Difficult to model the energies that stabilize a protein molecule.

– Conformational search space is enormous.

Laboratory of Molecular Biophysics, University of Oxford

10

Fold recognition

In nature, proteins are observed to assume on the order of a thousand shapes or “folds”.

Biochemistry, Mathews, 3rd ed., Addison-Wesley

11

Fold recognition

Given an amino acid sequence target:

– search a set of known folds by aligning target and a template fold representative

– predict the fold that gets the best scoring alignment

Target amino acid sequence: YLAADTYK

Template fold library:

Template amino acid sequence   FISSETCN   MEPSSYV   TGLIRKN
Target/template score          7          21        2
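The search described on this slide can be sketched as follows. This is a minimal illustration, assuming a plain Needleman-Wunsch global alignment with a match/mismatch/gap scheme; the scoring values and the names `global_alignment_score` and `recognize_fold` are hypothetical, and the scores it produces are not the ones shown on the slide.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The match/mismatch/gap values are illustrative assumptions.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def recognize_fold(target, library):
    """Return (template_name, score) for the best-scoring template."""
    scores = {name: global_alignment_score(target, seq)
              for name, seq in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Template names are hypothetical; the sequences are the slide's examples.
library = {"fold_1": "FISSETCN", "fold_2": "MEPSSYV", "fold_3": "TGLIRKN"}
best_fold, best_score = recognize_fold("YLAADTYK", library)
```

A practical system would score residue pairs with a substitution matrix rather than simple identity, but the control flow (score every template, keep the best) stays the same.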

12

Twilight zone sequence relationships

This method is very effective when target and template have > 30% sequence identity.

Approximately 1/3 of protein sequences can be assigned folds and modeled this way.

We would like to extend the method to sequences in the twilight zone (< 30% identity to any sequence of known structure).
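The 30% threshold can be made concrete with a small sketch. It assumes percent identity is counted over aligned, non-gap column pairs of a pairwise alignment; other denominators are in use, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) column pairs holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: columns with a gap are not counted
        aligned += 1
        if x == y:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def in_twilight_zone(aligned_a, aligned_b, threshold=30.0):
    """True when the pair falls below the identity threshold from the slide."""
    return percent_identity(aligned_a, aligned_b) < threshold
```

For example, `in_twilight_zone("YLAADTYK", "FISTE-HR")` is true under this definition, because none of the seven aligned pairs is identical.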

13

SAM-T98

Build a target HMM of amino acid frequencies from a multiple alignment of target plus homologs (SAM-T98).

Target amino acid sequence: YLAADTYK

Search protein database for homologs

Multiple alignment:

YLAADTYK
FISTE-HR
HVATD-H-
-ITA--HR
YLASDS-R

Target amino acid HMM

Courtesy of K. Karplus
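As a rough illustration of "amino acid frequencies from a multiple alignment", the sketch below estimates a per-column residue distribution with a flat pseudocount. It is only a simplification: a profile HMM also models insertions and deletions, and the uniform pseudocount and the name `column_profile` are assumptions made for this example rather than the SAM-T98 procedure.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, pseudocount=1.0):
    """Per-column amino acid probabilities from an aligned set of sequences.

    Gaps ("-") are ignored; a flat pseudocount (an assumption) avoids zeros.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile


alignment = ["YLAADTYK", "FISTE-HR", "HVATD-H-", "-ITA--HR", "YLASDS-R"]
profile = column_profile(alignment)
```

In the first column, Y is seen twice among four non-gap residues, so with a pseudocount of 1 its estimated probability is 3/24.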

14

SAM-T98

Amino acid HMM for target. Amino acid strings for templates.

Three-fold increase in recognizing twilight zone similarities (Park et al. 1998).

Target amino acid HMM

Template fold library:

Template amino acid sequence   FISSETCN   MEPSSYV   TGLIRKN
Target/template score          7          21        2

Courtesy of K. Karplus

15

SAM-T98 enhancements

Two-way scoring

Augment the method with secondary structure information.

16

Two-way SAM-T98

Also build amino acid HMMs for templates. Do 2-way scoring to strengthen recognition of twilight zone relationships.

Template fold library: template amino acid HMMs

Target amino acid sequence: YLAADTYK

Target/template score: 19 82 31

17

Secondary structure

DSSP alphabet (Kabsch and Sander 1983). Classifies the secondary structure of a residue using known tertiary structure.

Code   Class
H      alpha helix
E      beta strand
I      pi helix
G      3-10 helix
T      turn
S      bend
B      bridge
C      random coil

Groupings shown in the figure: basic patterns, repeating turns, repeating bridges, other.

Biochemistry, Mathews, 3rd ed., Addison-Wesley

18

Secondary structure

Alternatives to DSSP definitions.
– Collapse 8 classes to 3: H, E, C
– Other programs to automate assignment:
  • Richards and Kundrot (1988) Define
  • Sklenar (1989) P-Curve
  • Adzhubei and Sternberg (1993)
  • Frishman and Argos (1995) STRIDE
  • King and Johnson (1999) xlsstr
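Collapsing the 8 DSSP classes to 3 can be sketched as a lookup table. The slide does not say which reduction is meant, so the mapping below (helices G and I merged into H, bridge B merged into E, everything else into C) is one common convention used here as an assumption.

```python
# Assumed reduction: helix-like classes -> H, strand-like -> E, rest -> C.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}


def reduce_to_three_state(dssp_string):
    """Translate an 8-class DSSP string into the 3-letter H/E/C alphabet.

    Unknown symbols (for example a blank) are treated as coil.
    """
    return "".join(EIGHT_TO_THREE.get(symbol, "C") for symbol in dssp_string)
```

Other reductions keep only H and E and send G, I and B to coil; the choice changes both the class frequencies and the apparent prediction accuracy.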

19

Predicting secondary structure

Extensive research on predicting secondary structure from primary sequence.

Neural nets are the most successful approach.
– PHD (Rost and Sander 1996)
– Predict_2nd (Karplus and Barrett 1998)

Best methods are around 75-80% accurate.
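The accuracy figure can be read as a per-residue score. The sketch below assumes the quoted percentage is the fraction of residues whose predicted class matches the observed class (often called Q3 for a three-state alphabet); the slide does not name the measure, so this is an interpretation.

```python
def per_residue_accuracy(predicted, observed):
    """Percent of positions where the predicted class equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have equal length")
    if not observed:
        raise ValueError("strings must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For instance, `per_residue_accuracy("CCEECHHH", "CCEECHHC")` gives 87.5, since seven of the eight positions agree.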

20

Secondary structure and fold recognition

Predicted secondary structure shown useful for fold recognition (Russell et al. 1998).

Fold recognition accuracy correlated with secondary structure prediction accuracy (Di Francesco 1995, 1997, 1999).

Why?
– Structure more conserved than sequence.

– Proteins in the same fold family have similar topologies (secondary structure elements have similar lengths, spatial organization and connectivities).

21

Two-track SAM-T2K

Predicted probability vectors of secondary structure added to target HMM.

Target amino acid sequence: YLAADTYK

Multiple alignment:

YLAADTYK
FISTE-HR
HVATD-H-
-ITA--HR

Predicted secondary structure probability vectors:

     P(H)   P(E)   P(C)
Y    0.65   0.2    0.15
L    0.15   0.7    0.25
A    0.01   0.04   0.9
A    0.47   0.45   0.08
D    0.85   0.1    0.05
T    0.32   0.18   0.5
Y    0.81   0.09   0.1
K    0.5    0.25   0.15

Target two-track HMM

Courtesy of C. Barrett

Courtesy of K. Karplus

22

Two-track SAM-T2K

Search template library of sequence pairs with two-track target HMM.

Template fold library, each template a sequence pair:

Amino acid string            FISSETCN   MEPSSYV   TGLIRKN
Secondary structure string   CCEECHHH   HHHHCCE   EEECEEE
Target/template score        22         68        15

Target two-track HMM

Courtesy of K. Karplus

23

Motivation for alternatives to secondary structure classes

What’s wrong with secondary structure classes?
– The most widely used secondary structure alphabet (3-state DSSP) is crude (Helix, Strand, Coil).

– Secondary structure classes are ambiguous.
  • Automated assignment methods disagree.
  • 63% agreement between DSSP, Define and P-Curve (Colloc’h et al. 1993).

24

Local structure and fold recognition

What is local structure?
– describes environment of a residue
– a residue’s relationship to neighbors

Can use this information to predict fold from primary structure.

Requires comparing local structure of target and template. For the template it is known; for the target it must be predicted (easier than predicting 3D structure).

25

Low level descriptions of local structure

Lowest level representation of protein structure - atomic position vectors.

        Atom          Residue       Position vector
        No.   Type    Type   No.    X        Y        Z
ATOM     1    CA      THR     1      7.047   14.099    3.625
ATOM     2    C       THR     1     16.967   12.784    4.338
ATOM     3    O       THR     1     15.685   12.755    5.133
ATOM     4    N       SER     2     15.115   11.555    5.265
ATOM     5    CA      SER     2     13.856   11.469    6.066
ATOM     6    C       SER     2     14.164   10.785    7.379
ATOM     7    O       SER     2     14.993    9.862    7.443
ATOM     8    CB      SER     2     12.732   10.711    5.261
ATOM     9    N       CYS     3     13.488   11.241    8.417
ATOM    10    CA      CYS     3     13.660   10.707    9.787

Conformations of Biopolymers, IUPAC-IUB

26

Low level descriptions of local structure

“One level up”. From atomic position vectors can derive a list of properties that describe a residue’s local environment.

Conformations of Biopolymers, IUPAC-IUB

27

Dihedral and bond angles

Dihedral angles are defined by 4 atoms.

Bond angles are defined by 3 atoms.

Conformations of Biopolymers, IUPAC-IUB
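Both definitions can be computed directly from atomic position vectors. The sketch below uses the standard vector formulas on plain (x, y, z) tuples; the helper names are hypothetical, and the dihedral sign is chosen so that a clockwise rotation of the far bond, viewed along the central bond, is positive.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def bond_angle(p1, p2, p3):
    """Angle in degrees at p2 between the bonds p2-p1 and p2-p3."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_theta = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def dihedral_angle(p1, p2, p3, p4):
    """Signed dihedral in degrees about the p2-p3 bond, in (-180, 180]."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = _dot(b2, _cross(n1, n2))
    x = _norm(b2) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Feeding consecutive backbone atoms from coordinate records into `dihedral_angle` gives the backbone dihedrals, and consecutive triples into `bond_angle` give bond angles.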

28

Dihedral angles: Phi, Psi, Omega

The 6 atoms in each peptide unit lie in the same plane.

ω = 180° (trans) or 0° (cis)

φ and ψ are free to rotate

Biochemistry, Mathews, 3rd ed., Addison-Wesley

29

Dihedral angles: Phi, Psi, Omega

Result: a good approximation of the polypeptide backbone is a list of (φ, ψ) pairs (cis ω is rare).

(φ, ψ) pairs are often represented on a plane called the Ramachandran plot.

http://www.biochem.artizona.edu
Biochemistry 462A Lecture Notes

30

A small gallery of properties: the geometry of local structure

Kappa. Virtual bond angle between Cα of residues i-2, i, i+2

Alpha. Virtual dihedral angle between Cα of residues i-1, i, i+1, i+2

Tau. Virtual bond angle between Cα of residues i-1, i, i+1

Zeta. Dihedral angle between carbonyl bonds of residues i and i-1

31

Relationship of a residue to its neighbors

Density measures. How many residues are within a given distance?

Count of H-bond partners.

Figure example: 12 neighboring residues within 6 Å radius; 2 H-bond partners
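A density measure of this kind can be sketched as a neighbor count within a cutoff. The example assumes one representative point per residue (for instance the Cα position) and an inclusive 6 Å cutoff; both are illustrative choices, and `neighbor_counts` is a hypothetical name.

```python
import math


def neighbor_counts(residue_coords, radius=6.0):
    """For each residue, count the other residues within `radius` angstroms.

    `residue_coords` holds one (x, y, z) point per residue; which atom
    represents a residue is an assumption left to the caller.
    """
    counts = []
    for i, a in enumerate(residue_coords):
        n = sum(1 for j, b in enumerate(residue_coords)
                if j != i and math.dist(a, b) <= radius)
        counts.append(n)
    return counts
```

This brute-force version compares every pair of residues, which is adequate for a single chain; a spatial grid would be the usual refinement for large data sets.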

32

Existing local structure alphabets

Approximately 30 alphabets of local structure in the literature.

Can they be used to improve fold recognition?

33

Phi/psi alphabets

Classes based on partition of phi/psi space

Bystroff et al. 2000. 10 classes: B E b d e G H L I x

Kang et al. 1993. 1296 classes: uniform partitioning by 10°

Sun et al. 1996. DSSP H, E plus 5 phi/psi classes: a b e l t

Figure: Bystroff et al. 2000
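A uniform partition of phi/psi space is easy to sketch: 10° bins on each axis give 36 x 36 = 1296 classes. The bin origin at -180° and the row-major numbering below are assumptions, not necessarily the labeling used by Kang et al.

```python
def phi_psi_class(phi, psi, bin_width=10):
    """Map a (phi, psi) pair in degrees to a class index on a uniform grid."""
    bins_per_axis = 360 // bin_width

    def bin_index(angle):
        # Shift to [0, 360) so that -180 degrees falls in bin 0.
        index = int(((angle + 180.0) % 360.0) // bin_width)
        return min(index, bins_per_axis - 1)

    return bin_index(phi) * bins_per_axis + bin_index(psi)
```

With the default bin width the indices run from 0 to 1295, and a typical helical pair such as (-60, -45) lands in class 445.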

34

Backbone fragment alphabets

Classes based on clustering low-level properties of contiguous series of residues.

Unger et al. 1987. ~100 6-residue fragments

k-nearest neighbor clustering by RMSD of Cα atoms

Centroid of each cluster selected as building block

Figure: Unger et al. 1987

35

Backbone fragment alphabets

De Brevern et al. 2000. Protein Building Blocks (PBBs).

16 classes of 5-residue fragments. SOM clustering of vectors of 8 dihedral angles (φ and ψ).

Figure: De Brevern et al. 2000

36

Desired properties of local structural alphabets

For purposes of improving fold recognition:
– Predictable from primary sequence
– Conserved within a fold family

37

Comparison of existing local structure alphabets

Only a few of the alphabets have been tested for predictability.

None of the alphabets have been tested for conservation within fold families.

38

Designing a Local Structure Alphabet

Extract properties with respect to each residue in the dataset.

Selected property: TCO (figure labels: residues i-1 and i)

Selected PDB structures, after property extraction:

PDBNo   AA   TCO
1       M    -0.3
2       L    -0.34
3       S    0.91
4       P    0.935
5       E    -0.1
6       V    0.2
. .

39

Designing a Local Structure Alphabet

Partition the data into k populations.

Input to the unsupervised learning algorithm:

PDBNo   AA   TCO
1       M    -0.3
2       L    -0.34
3       S    0.91
4       P    0.935
5       E    -0.1
6       V    0.2
. .

Class A:

PDBNo   AA   TCO
1       M    -0.3
2       L    -0.34
5       E    -0.1

Class B:

PDBNo   AA   TCO
3       S    0.91
4       P    0.935
6       V    0.2

[Figure: TCO values on an axis from -1 to 1, marked X for Class A and O for Class B]
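Partitioning a single property into k populations can be sketched with a one-dimensional k-means. The slide does not name the unsupervised learning algorithm, so k-means, the evenly spaced initial centers and the name `kmeans_1d` are all assumptions made for illustration.

```python
def kmeans_1d(values, k=2, iterations=100):
    """Cluster scalar values into k groups; returns (labels, centers)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centers spread evenly between the extremes.
    if k > 1:
        centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    else:
        centers = [sum(values) / len(values)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members
                               else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


tco = [-0.3, -0.34, 0.91, 0.935, -0.1, 0.2]
labels, centers = kmeans_1d(tco, k=2)
```

On these six values the sketch groups 0.2 with the negative values, whereas the slide's illustration places it in Class B. The difference is a reminder that the resulting alphabet depends on the algorithm, its initialization and the amount of data.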

40

Designing a Local Structure Alphabet

Selected property: KJ descriptor vector*: [ζ, τ, d1, d2, d3]

ζ: ZETA
τ: TAU
d1 (dison3): H-bond length from O(i) to N(i+3)
d2 (dison4): H-bond length from O(i) to N(i+4)
d3 (discn3): length from C(i) to N(i+3)

* Descriptor vector of key geometric properties identified by King and Johnson 1999

41

Designing a Local Structure Alphabet

Extract properties with respect to each residue in the dataset.

Selected property: KJ descriptor vector: [ζ, τ, d1, d2, d3]

Selected PDB structures, after property extraction:

PDBNo   AA   KJDV
1       M    [13.6, 92.9, 3.7, 3.1, 4.1]
2       L    [14.4, 95.7, 4.9, 7.1, 4.9]
3       S    [19.8, 100.3, 7.2, 10.1, 6.9]
4       P    [18.1, 116.2, 6.7, 9.2, 6.9]
. . .

42

Designing a Local Structure Alphabet

Clustering multi-dimensional data points.

PDBNo   AA   KJDV
1       M    [13.6, 92.9, 3.7, 3.1, 4.1]
2       L    [14.4, 95.7, 4.9, 7.1, 4.9]
3       S    [19.8, 100.3, 7.2, 10.1, 6.9]
4       P    [18.1, 116.2, 6.7, 9.2, 6.9]
. . .

Components are in different units. Scale to the same range?

Very high-dimensional vectors require feature reduction.
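One possible answer to the scaling question is min-max scaling, sketched below. It maps every component to [0, 1] so that angles in degrees and distances in angstroms contribute comparably to a distance-based clustering; whether this or another normalization (for example z-scores) suits the data is an open design choice, and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(vectors):
    """Rescale each component of a list of equal-length vectors to [0, 1].

    A component that is constant across the data is mapped to 0.0.
    """
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    scaled = []
    for v in vectors:
        scaled.append([
            (v[d] - lows[d]) / (highs[d] - lows[d]) if highs[d] > lows[d] else 0.0
            for d in range(dims)
        ])
    return scaled
```

Min-max scaling is sensitive to outliers, since a single extreme value compresses the rest of the range.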

43

Evaluation protocol

Protocol is based on:
– testing candidate alphabets for their conservation within fold families.
– testing predictability of candidate alphabets.
– testing improvements in fold recognition when candidate alphabets are used.

44

Evaluation Protocol: string translation

Selected PDB structures:

>2abd
MDAAVKTG
>4eca
MELVIRSG
. . .

Selected alphabet

String builder

Position-equivalent strings in new alphabet:

>2abd
CAAABCAB
>4eca
ACBBABCA
. . .

45

Evaluation Protocol: alignment translation

Fold family alignments:

MD-AAVKTG
ME-LVIRSG
M-SAGCRDK
MEA-SC-E-

Position-equivalent strings in new alphabet

Alignment builder

Position-equivalent alignments in new alphabet:

CA-AABCAB
AC-BBABCA
C-AACCBBC
CCA-BB-A-

46

Evaluation Protocol: alphabet conservation

Position-equivalent alignments in new alphabet:

CA-AABCAB
AC-BBABCA
C-AACCBBC
CCA-BB-A-

Conserved?

Average entropy in columns of alignments.

Relative entropy of substitution matrix constructed from alignments (Altschul 1991).
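The first conservation measure can be sketched as the Shannon entropy of each alignment column, averaged over columns. Ignoring gap characters and using base-2 logarithms are assumptions made for this example.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy in bits of the letters in one alignment column."""
    letters = [c for c in column if c != gap]
    if not letters:
        return 0.0
    n = len(letters)
    counts = Counter(letters)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def average_column_entropy(alignment):
    """Mean column entropy over all columns of an alignment."""
    length = len(alignment[0])
    columns = ["".join(row[i] for row in alignment) for i in range(length)]
    return sum(column_entropy(col) for col in columns) / length


alignment = ["CA-AABCAB", "AC-BBABCA", "C-AACCBBC", "CCA-BB-A-"]
mean_entropy = average_column_entropy(alignment)
```

Lower average entropy means the letters in a column agree more often, that is, the alphabet is better conserved across the aligned family members. Comparing alphabets of different sizes this way needs care, because a larger alphabet can reach higher entropy.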

47

Evaluation Protocol: alphabet predictability

Test predictability with Predict_2nd neural net.

Improve on neural net performance with alternate methods.

Position-equivalent strings in new alphabet: predictable?

P(A) P(B) P(C)

Courtesy of C. Barrett

48

Evaluation Protocol: fold recognition

Build a fold library that incorporates the local structure alphabet and do fold recognition testing using this library.

49

Incorporating local structure alphabets into a fold library

Simplest. Use predicted local structure string for target and known local structure string for templates.

Target local structure string: ABBCACAB

Template fold library:

Template local structure string   CCABBBAC   AACBCAA   CAACBBB
Target/template score             7          21        2

PROBLEM! Wrong letter predicted.

50

Incorporating local structure information into a fold library

Use several strings (amino acid and local structure) for target and templates.

Target with string tuple:

YLAADTYK
ABBCACAB
WYTZTTVU

Template fold library, each template a string tuple:

Amino acid string          FISSETCN   MEPSSYV   TGLIRKN
Local structure string 1   CCABBBAC   AACBCAA   CAACBBB
Local structure string 2   YVUUTZVV   TTYUVWZ   YUUUVZW
Target/template score      6          23        5

PROBLEM! Wrong letters predicted.

51

Extending the SAM-T2K method with local structure information

Add tracks to the target HMM. Search template library of sequence tuples with multi-track target HMM.

Template fold library, each template a sequence tuple:

Amino acid string          FISSETCN   MEPSSYV   TGLIRKN
Local structure string 1   CCABBBAC   AACBCAA   CAACBBB
Local structure string 2   YVUUTZVV   TTYUVWZ   YUUUVZW
Target/template score      75         3         22

Target multi-track HMM

52

Extending the SAM-T2K method with local structure information

Adding local structure strings to the template HMM. Enable 2-way HMM scoring.

Template fold library: template amino acid HMMs plus local structure strings

Local structure string 1   CCABBBAC   AACBCAA   CAACBBB
Local structure string 2   YVUUTZVV   TTYUVWZ   YUUUVZW
Target/template score      8          24        49

Target:

YLAADTYK
ABBCACAB
WYTZTTVU

     A      B      C
Y    0.65   0.2    0.15
L    0.15   0.7    0.25
A    0.01   0.04   0.9
A    0.47   0.45   0.08
D    0.85   0.1    0.05
T    0.32   0.18   0.5
Y    0.81   0.09   0.1
K    0.5    0.25   0.15

53

Extending the SAM-T2K method with local structure information

Build multi-track HMMs for target and template.

Target multi-track HMM

Template fold library: template multi-track HMMs

Target/template score: 6 23 5

54

Evaluation Protocol: fold recognition

Fold classification database, reduced to a non-redundant fold test set:

119l    T4 Lysozyme
12asA   Asparagine Synthetase
153l    Goose Lysozyme
16pk    Phosphoglycerate Kinase
16vpA   VP16 regulatory protein
. . .

Target: 119l

Template fold library:

Templates               12asA   153l   16pk
Target/template score   12      2      71

55

Evaluation Protocol: fold recognition

[Figure: False Positives plotted against True Positives (+ = same fold) for old PSI-blast, PSI-blast, SAM-T2K, SAM-T2K EHL 50-50, SAM-T2K EBGHTL 50-50 and DALI. Courtesy of K. Karplus.]

56

Research Schedule

Year 1: Find a local structure alphabet that improves fold recognition. Build a fold library that uses the alphabet. Put up a webserver for public use of the library.

Summer 2002: CASP5

57

Research Schedule

Year 2: Design more alphabets. Compare and combine new and existing alphabets. Expand the methods to continuous-value predictions. Incorporate best combination into my fold library.

June 2003: Produce completed dissertation.

58

Conclusion

Focus of the work:
– Evaluate existing local structure alphabets
– Design and evaluate novel local structure alphabets

Evaluation protocol:
– conservation
– predictability
– fold recognition