Center for Computational Biology and Bioinformatics

Bioinformatics and Intrinsically Disordered Proteins (IDPs)

A. Keith Dunker

Biochemistry and Molecular Biology & Center for Computational Biology / Bioinformatics

Indiana University School of Medicine

Presented at:

October 22, 2010

Outline

• What are “Intrinsically Disordered Proteins”?

• Bioinformatics Applications to IDPs
– Why don’t IDPs form structure?
– Predicting IDPs from amino acid sequence
– Some important results from IDP prediction
– An improved order / disorder amino acid scale
– Predicting phosphorylation sites
– Disorder and function: two examples

• Importance of bioinformatics to IDP research

Definitions: Intrinsically Disordered Proteins (IDPs) and ID Regions (IDRs)

• Whole proteins and regions of proteins are intrinsically disordered if they lack stable 3D structure under physiological conditions,

• But exist instead as highly dynamic, rapidly interconverting ensembles without particular equilibrium values for their coordinates or bond angles and with non-cooperative conformational changes.

Why are IDPs / IDRs unstructured?

• From the 1950s to now, >> 1,000 IDPs / IDRs studied and characterized

• Visit: http://www.disprot.org

• Why do IDPs & IDRs lack structure?
– Lack a ligand or partner?
– Denatured during isolation?
– Folding requires conditions found inside cells?
– Lack of folding encoded by amino acid sequence?

Amino Acid Compositions

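[Figure: bar chart of (Disorder − Order) / Order for each residue. Residues on the x-axis appear in the order W C F I Y V L H M A T R G Q S N P D E K; the y-axis runs from −1.0 to 1.0. The residue axis is annotated "Buried" and "Surface". Three series by disordered-region length L: 4 aa ≤ L ≤ 14 aa (14579), 15 aa ≤ L ≤ 29 aa (10381), and 30 aa ≤ L (58147).]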

Why are IDPs / IDRs unstructured?

• To a first approximation, amino acid composition determines whether a protein folds or remains intrinsically disordered.

• Given a composition that favors folding, the sequence details determine which fold.

• Given a composition that favors not folding, the sequence details provide motifs for biological function.
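
The quantity plotted on the Amino Acid Compositions slide, (Disorder − Order) / Order, can be sketched in a few lines. This is only an illustration: the function names and the toy sequences are invented here, and the actual datasets behind the slide are not reproduced.

```python
from collections import Counter

# Residue order used on the Amino Acid Compositions slide.
AMINO_ACIDS = "WCFIYVLHMATRGQSNPDEK"

def composition(sequences):
    """Return the fractional amino acid composition of a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def relative_difference(disordered, ordered):
    """Compute (disorder - order) / order for every residue type."""
    d = composition(disordered)
    o = composition(ordered)
    # Residues absent from the ordered set are skipped to avoid dividing by zero.
    return {aa: (d[aa] - o[aa]) / o[aa] for aa in AMINO_ACIDS if o[aa] > 0}

# Toy sequences, invented for illustration only.
ordered_set = ["WCFIYVLHMATRGQSNPDEK", "WWFFIIVVLLCCYYAAGGTT"]
disordered_set = ["PPEEKKSSQQGGDDRRNNAA", "WCFIYVLHMATRGQSNPDEK"]
diff = relative_difference(disordered_set, ordered_set)
```

Negative values mark residues depleted in the disordered set (W in the toy data) and positive values mark enriched residues (P in the toy data).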

Prediction of Intrinsic Disorder

1. Ordered / Disordered Sequence Data
2. Attribute Selection or Extraction: Aromaticity, Hydropathy, Charge, Complexity
3. Separate Training and Testing Sets
4. Predictor Training: Neural Networks, SVMs, etc.
5. Predictor Validation on Out-of-Sample Data
6. Prediction
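
The attribute-extraction and data-splitting steps of this workflow can be sketched as follows. The slide names the attributes but not how they are computed, so the choices here are assumptions: Kyte-Doolittle values for hydropathy, Shannon entropy for complexity, a window of 21 residues, and a simple every-fifth-item split. All function names are hypothetical.

```python
import math
from collections import Counter

# Kyte-Doolittle hydropathy values (a standard published scale; its use here is an assumption).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def window_features(window):
    """Return [aromaticity, hydropathy, net charge, complexity] for one window."""
    n = len(window)
    counts = Counter(window)
    aromaticity = sum(counts[a] for a in "FWY") / n
    hydropathy = sum(KD[a] for a in window) / n
    net_charge = (counts["K"] + counts["R"] - counts["D"] - counts["E"]) / n
    # Shannon entropy of the window composition, in bits.
    complexity = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return [aromaticity, hydropathy, net_charge, complexity]

def sliding_windows(sequence, size=21):
    """Return every contiguous window of the given size."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def split_train_test(items, test_every=5):
    """Hold out every fifth item for testing; the rest are used for training."""
    train = [x for i, x in enumerate(items) if i % test_every != 0]
    test = [x for i, x in enumerate(items) if i % test_every == 0]
    return train, test

features = window_features("KKKKEEEE")
```

Splitting at the level of whole proteins, rather than individual windows, keeps overlapping windows from the same protein out of both sets at once.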

First Machine-Learning Predictor: SDR / MDR / LDR Predictors

1. Region classes, by number of missing AA:
– Short Disordered Regions (SDR): 7 – 21 missing AA
– Medium Disordered Regions (MDR): 22 – 44 missing AA
– Long Disordered Regions (LDR): 45 or more missing AA

2. SDR / MDR / LDR predictors: neural networks

3. Training dataset: proteins with missing AA
– SDR: 34 proteins, 11,050 AA, 38 IDR, 411 IDAA
– MDR: 20 proteins, 4,764 AA, 22 IDR, 464 IDAA
– LDR: 7 proteins, 2,069 AA, 7 IDR, 465 IDAA

4. Feature selection: standard sequential forward selection

5. Accuracy: 59 – 67%, estimated by 5-fold cross-validation

6. Better than chance; each predictor does better on its own region class ("self") than on the other classes ("not self")

Romero P, et al., Proc. IEEE International Conference on Neural Networks 1:90-95 (1997)
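
The SDR / MDR / LDR length classes above can be applied to a structure by finding runs of residues that lack coordinates. A minimal sketch, assuming the input is a per-residue list of booleans (True where the residue was observed); the function names are hypothetical.

```python
def classify_region(length):
    """Map a run of missing residues to the slide's length classes."""
    if length >= 45:
        return "LDR"
    if length >= 22:
        return "MDR"
    if length >= 7:
        return "SDR"
    return None  # runs shorter than 7 residues fall outside the three classes

def missing_regions(observed):
    """Return (start, length) for each run of unobserved residues."""
    regions, start = [], None
    for i, seen in enumerate(observed):
        if not seen and start is None:
            start = i                      # a missing run begins
        elif seen and start is not None:
            regions.append((start, i - start))
            start = None                   # the run has ended
    if start is not None:                  # run extends to the C-terminus
        regions.append((start, len(observed) - start))
    return regions

# Toy example: 10 missing residues in the middle and 50 at the C-terminus.
observed = [True] * 5 + [False] * 10 + [True] * 5 + [False] * 50
classes = [classify_region(length) for _, length in missing_regions(observed)]
```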

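Next: PONDR® VL-XT

[Diagram: VL-XT(2) combines three predictors along a sequence of N residues: XN(1) at the N-terminus, VL1(2) in the interior, and XC(1) at the C-terminus, with boundaries marked at residues 11 and 14 and at N-14 and N-11.]

XN, VL1, and XC: neural networks

Input features: XN: 8; VL1: 10; XC: 8

(1) Li X et al., Genome Informat. 9:201-213 (1999)
(2) Romero P et al., Proteins 42:38-48 (2001)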

Inputs for PONDR® VL-XT

XN (8 inputs): Coordination No.; V; VIYFW; M; N; H; D; PEVK
VL1 (10 inputs): Coordination No.; Net charge; WFY; W; Y; F; D; E; K; R
XC (8 inputs): Coordination No.; Hydropathy; VIYFW; M; T; H; PEVK; R

Accuracy (ACC) = (% Corr-O)/2 + (% Corr-D)/2

ACC (estimated by cross-validation) ~ 72 ± 4%

Li X et al., Genome Informat. 9:201-213 (1999)
Romero P et al., Proteins 42:38-48 (2001)
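
The accuracy measure above averages the percentage of correctly predicted ordered residues and the percentage of correctly predicted disordered residues, so a predictor cannot score well by always choosing the majority class. A minimal sketch, assuming per-residue labels encoded as "O" and "D" (an encoding chosen here for illustration):

```python
def balanced_accuracy(true_labels, predicted_labels):
    """ACC = (% correct on order)/2 + (% correct on disorder)/2."""
    correct = {"O": 0, "D": 0}
    total = {"O": 0, "D": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    pct_order = 100.0 * correct["O"] / total["O"]
    pct_disorder = 100.0 * correct["D"] / total["D"]
    return pct_order / 2 + pct_disorder / 2

# Predicting "ordered" everywhere is right for 8 of 10 residues,
# yet ACC is only 50% because every disordered residue is missed.
acc = balanced_accuracy("OOOOOOOODD", "OOOOOOOOOO")
```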

Disorder Prediction in CASP

• Critical Assessment of Structure Prediction

• http://predictioncenter.org

• CASP1 (1994) to CASP9 (2010)

• Experimentalists provide amino acid sequences as they are determining the structures of proteins

• Groups register and make structure predictions

• After the structures are determined, the predictions are evaluated

• Disorder predictions introduced in CASP5 (2002)

CASP PREDICTIONS ARE TRULY BLIND!!!


Disorder Prediction in CASP

[Figure, two panels plotted against CASP year (2002, 2004, 2006, 2008, 2010). Left panel: number of CASP predictors, y-axis 0 to 40. Right panel: area under ROC curve, y-axis 0.6 to 1.0, with points labeled VSL2, VSL2, and PreDisorder. Note: for CASP5 (2002), sensitivity replaced AUC.]

Our Performance in CASP

• Used VL-XT in CASP5: poor on short disordered regions, but very good on long disordered regions.

• VL was trained mainly on long disordered regions.

• Changed predictors for CASP6 and CASP7; the new predictors ranked #1. Big improvement!

• Did not participate in CASP8, but would not have ranked #1 with current predictors.

• What change led to the large improvement in CASP6?

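Predictors of Natural Disordered Regions: PONDR® VL-XT and PONDR® VSL2

[Diagram, VL-XT(2): predictors N(1), VL1(2), and C(1) are combined along a sequence of N residues, with boundaries marked at residues 11 and 14 and at N-14 and N-11.]

[Diagram, VSL2(3): VL2(3) gives output OL, VS2(3) gives output OS, and M1(3) gives output OM; OL is weighted by OM and OS by 1 − OM.]

VSL2 Score = OL × OM + OS × (1 − OM)

N, VL1, and C are neural networks
– N-term: 8 inputs
– VL1: 10 inputs
– C-term: 8 inputs

M1, VSL2-L (VL2), and VSL2-S (VS2) are support vector machines
– M1: 54 inputs
– VL2: 20 inputs
– VS2: 20 inputs

(1) Li X et al., Genome Informat. 9:201-213 (1999)
(2) Romero P et al., Proteins 42:38-48 (2001)
(3) Peng K et al., BMC Bioinfo. 7:208 (2006)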
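
The VSL2 score formula on this slide is a per-residue weighted blend of the long-region and short-region outputs, with the M1 output acting as the weight. A minimal sketch, assuming the three component outputs are already available as equal-length lists of values between 0 and 1 (the function and argument names are hypothetical):

```python
def vsl2_score(o_long, o_short, o_meta):
    """Combine outputs per residue: OL * OM + OS * (1 - OM)."""
    return [ol * om + os_ * (1.0 - om)
            for ol, os_, om in zip(o_long, o_short, o_meta)]

# Toy outputs for three residues, invented for illustration.
scores = vsl2_score([0.9, 0.9, 0.9], [0.1, 0.1, 0.1], [1.0, 0.0, 0.5])
```

When OM is 1 the long-region output is used unchanged, when OM is 0 the short-region output is used unchanged, and intermediate values interpolate between the two.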

Comparison on CASP 8 Dataset

Zhang P, et al. (unpublished results; not quite the same as the CASP evaluation)

ACC = 80%, where ACC = (% Corr-O)/2 + (% Corr-D)/2

AUC = 0.89, where AUC = Area Under Curve

PONDR® VL-XT, PONDR® VSL2B, and PreDisorder

[Figure: disorder score (0.0 to 1.0) versus residue index (axis ticks 0 to 250) for XPA, with curves for VL-XT, VSL2, and PreDisorder. Side labels: (+) Disordered, (–) Structured.]

Iakoucheva L et al., Protein Sci 3: 561-571 (2001)
Dunker AK et al., FEBS J 272: 5129-5148 (2005)
Deng X et al., BMC Bioinformatics 10:436 (2009)

Published Predictors of Disordered Proteins

[Figure: number of predictors of IDPs (y-axis 0 to 60) versus year, with ticks at 1979, 1997, 2000, 2001, and 2003 through 2009. Annotations on the plot: "# +,- / # phobics", "PONDRs", and CASP 5, 6, 7, 8.]

He B, et al., Cell Res 19: 929-949 (2009)

PONDRs:
– VSL1: Ranked #1 in CASP 6 (2004)
– VSL2: Ranked #1 in CASP 7 (2006)

How Abundant are IDRs/IDPs?

• To Estimate Abundance of IDPs/IDRs: predict on whole proteomes from many organisms.

ALERT!!

• The lack of membrane-protein-specific disorder predictors means that estimates of disorder will be too low by a small percentage.

VSL2 Prediction of Abundance** of Intrinsically Disordered Proteins

Organisms            # Orgs.   # Proteins      Avg. # Proteins   % Disordered AA   % Proteins IDR >30   % Proteins Natively Unfolded
Archaea              73        536 – 4234      2199              12.5 – 37.2%      0 – 60.0%            3.2 – 31.5%
Bacteria             951       182 – 9320      3331              12.0 – 36.1%      11.5 – 53.7%         2.7 – 29.2%
Single-cell Eukarya  58        1909 – 16365    9098              22.3 – 49.9%      17.0 – 76.8%         16.8 – 47.6%
Multi-cell Eukarya   51        1775 – 35942    11295             10.4 – 49.0%      4.4 – 66.5%          6.9 – 48.7%

**Are organism-specific predictors sometimes needed?
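
Two of the table's columns, % Disordered AA and % Proteins IDR >30, can be derived from per-residue disorder scores. The sketch below is an assumption-laden illustration: it treats a score of 0.5 or more as disordered and counts a protein when it has a run of more than 30 consecutive disordered residues. The slide does not state these cutoffs, and the function names are hypothetical. The natively unfolded column is not covered.

```python
def longest_disordered_run(scores, threshold=0.5):
    """Length of the longest run of consecutive residues scored as disordered."""
    longest = current = 0
    for s in scores:
        current = current + 1 if s >= threshold else 0
        longest = max(longest, current)
    return longest

def proteome_summary(proteome_scores, threshold=0.5, min_idr=30):
    """Summarize predicted disorder over a list of per-protein score lists."""
    total_residues = disordered_residues = proteins_with_long_idr = 0
    for scores in proteome_scores:
        total_residues += len(scores)
        disordered_residues += sum(1 for s in scores if s >= threshold)
        if longest_disordered_run(scores, threshold) > min_idr:
            proteins_with_long_idr += 1
    return {
        "pct_disordered_aa": 100.0 * disordered_residues / total_residues,
        "pct_proteins_idr_gt_30": 100.0 * proteins_with_long_idr / len(proteome_scores),
    }

# Toy proteome: one protein with a 40-residue disordered region, one fully ordered.
toy_proteome = [[0.9] * 40 + [0.1] * 60, [0.1] * 100]
summary = proteome_summary(toy_proteome)
```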

Archaea Phylogenetic Tree

[Figure: archaeal phylogenetic tree with groups labeled >30%, >21%, >17%, >14%, and <14%. Tree: Todd Lowe (http://archaea.ucsc.edu/)]

Predicted Disorder vs. Proteome Size

[Figure: average fraction of disordered residues (y-axis 0.0 to 0.6) versus proteome size (log x-axis, 10^2 to 10^5). Series: Bacteria, Archaea, SC eukaryotes, MC eukaryotes.]

Why So Much Disorder? Hypothesis: Disorder Used for Signaling

• Sequence → Structure → Function
– Catalysis
– Membrane transport
– Binding small molecules

• Sequence → Disordered Ensemble → Function
– Signaling
– Regulation
– Recognition
– Control

Dunker AK, et al., Biochemistry 41: 6573-6582 (2002)
Dunker AK, et al., Adv. Prot. Chem. 62: 25-49 (2002)
Xie H, et al., J. Proteome Res. 6: 1882-1932 (2007)

A New Order / Disorder AA Scale, Part 1

• Collect equal numbers of O and D windows of length 21.
• Calculate the value of attribute, x, for each window.
• For each interval of x, count how many windows are O and D; from this, determine P(O | x) and P(D | x).
• Plot P(O | x) and P(D | x) versus x.
• Determine the area between the two curves.
• Area Ratio Value (ARV) = (area between curves / total area)
• Apply to 517 aa scales: http://www.genome.jp/aaindex
• Rank scales from smallest to largest.

Campen A, et al., Protein Pept Lett 15: 956-963 (2008)
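
The Area Ratio Value steps listed above can be sketched as a short function. Several details are assumptions made here, since the slide does not give them: equal-width intervals, 20 intervals by default, and "total area" taken as the combined area under both curves. The function name and arguments are hypothetical.

```python
def area_ratio_value(ordered_x, disordered_x, n_bins=20):
    """Estimate ARV from attribute values of ordered and disordered windows."""
    lo = min(min(ordered_x), min(disordered_x))
    hi = max(max(ordered_x), max(disordered_x))
    if hi == lo:
        raise ValueError("attribute takes a single value; ARV is undefined")
    width = (hi - lo) / n_bins
    o_counts = [0] * n_bins
    d_counts = [0] * n_bins
    for values, counts in ((ordered_x, o_counts), (disordered_x, d_counts)):
        for x in values:
            idx = min(int((x - lo) / width), n_bins - 1)  # clamp x == hi into the last bin
            counts[idx] += 1
    between = total = 0.0
    for o, d in zip(o_counts, d_counts):
        if o + d == 0:
            continue  # no windows fall in this interval
        p_o = o / (o + d)  # P(O | x) for this interval
        p_d = d / (o + d)  # P(D | x) for this interval
        between += abs(p_o - p_d) * width
        total += (p_o + p_d) * width
    return between / total

# Toy attribute values: fully separated classes give 1.0, identical classes give 0.0.
arv_separated = area_ratio_value([0.0, 0.1, 0.2], [0.8, 0.9, 1.0])
arv_identical = area_ratio_value([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
```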


A New Order / Disorder AA Scale, Part 2

• Overall idea: make random changes to a scale, test for higher ARV, repeat until no larger value is found.

• Genetic Algorithm Pseudocode:
– Choose initial population
– Repeat
  – Evaluate the fitness of each individual
  – Select a certain portion of best-ranking individuals
  – Breed new population through crossover + mutation
– Until terminating condition

• ARV improved from 0.69 for the best of 517 scales to 0.76 for the new scale, called TOP-IDP

Campen A, et al., Protein Pept Lett 15: 956-963 (2008)
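
The genetic algorithm pseudocode above can be turned into a small runnable sketch. It is not the published procedure: the population size, selection fraction, single-point crossover, Gaussian mutation, fixed generation count, and elitism are all assumptions, and the toy fitness function merely stands in for the ARV of a candidate 20-value amino acid scale.

```python
import random

def evolve(fitness, n_genes=20, pop_size=30, generations=40,
           keep=0.3, mutation_sd=0.1, seed=0):
    """Maximize `fitness` over lists of n_genes floats with a simple genetic algorithm."""
    rng = random.Random(seed)
    # Choose initial population: random scales with values in [-1, 1].
    population = [[rng.uniform(-1, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # terminating condition: fixed number of generations
        # Evaluate fitness and select a portion of the best-ranking individuals.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, int(keep * pop_size))]
        # Elitism: the best individual is carried over unchanged.
        children = [list(parents[0])]
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)                       # single-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n_genes)] += rng.gauss(0, mutation_sd)  # mutation
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

# Toy fitness, invented for illustration: closeness to an arbitrary target scale.
target = [i / 19 for i in range(20)]

def toy_fitness(scale):
    return -sum((s - t) ** 2 for s, t in zip(scale, target))

best_scale = evolve(toy_fitness)
```

Because the best individual is always carried forward, the best fitness never decreases from one generation to the next.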

P(D | x) and P(O | x) Versus x Plots: Area Between Curves Used to Rank Attributes, x

[Figure, four panels of P(D | x) and P(O | x) versus x:
– Flexibility: ARV = 0.69, Rank = #1/517
– Positive Charge: ARV = 0.07, Rank = #517/517
– Extracellular Protein AA Composition: ARV = 0.36, Rank = #238/517
– TOP-IDP: ARV = 0.76]

Campen A et al., Protein & Peptide Lett 15: 956-963 (2008)

[Figure: analysis of the disorder propensity in p53 by TOP-IDP (A), PONDR® VL-XT (B), and PONDR® VSL1 (C).]

Chronology of Amino Acid Evolution: DISORDER TO ORDER, NON-LIFE TO LIFE

Di Mauro E, et al., in Genesis: Origin of Life on Earth and Other Planets (In press)

New Phosphorylation Predictor

[Flowchart of the predictor, summarized as steps:]

1. Data collection from high quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
2. Non-redundant datasets built by BLASTclust: phosphorylation sites, non-phosphorylation sites, and control data
3. Feature extraction (KNN scores, disorder scores, amino acid frequencies), giving features from the positive set and features from the negative set
4. Training data drawn as bootstrap samples 1 ... m, with one classifier trained per sample (Classifier 1 ... Classifier m)
5. Aggregating the classifiers into the phosphorylation prediction model, with specificity estimation
6. Making predictions on new data

Features:

• KNN: similarity to known sites (+ / -) of phosphorylation

• Disorder scores: used VSL2

• AA frequencies: at sequence positions before and after phosphorylation sites

Gao J et al., Mol and Cell Proteomics (In press)
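
The bootstrap-and-aggregate part of this predictor can be illustrated with a small bagging sketch. The slide does not say which classifier is trained on each bootstrap sample, so a nearest-centroid classifier is swapped in here purely as a stand-in; the two-value feature vectors (imagined as a KNN score and a disorder score) and all names are hypothetical.

```python
import random

def train_centroids(examples):
    """Stand-in base classifier: mean feature vector per label (1 = site, 0 = non-site)."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict_one(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=squared_distance)

def train_bagged(examples, m=25, seed=0):
    """Train m classifiers, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [train_centroids([rng.choice(examples) for _ in examples])
            for _ in range(m)]

def predict_bagged(models, features):
    """Aggregate by voting: fraction of classifiers that predict a site."""
    return sum(predict_one(model, features) for model in models) / len(models)

# Toy training data, invented for illustration.
positives = [([0.85 + 0.01 * i, 0.90], 1) for i in range(10)]
negatives = [([0.10 + 0.01 * i, 0.15], 0) for i in range(10)]
models = train_bagged(positives + negatives)
site_vote = predict_bagged(models, [0.95, 0.90])
non_site_vote = predict_bagged(models, [0.05, 0.10])
```

Thresholding the vote fraction gives a yes/no call; scoring control data with the same model is one way a specificity estimate could be obtained.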

Disorder Score vs. Phosphorylation

Gao J et al., Mol & Cell Proteomics 9 (Epub) (2010)

[Figure: histograms of occurrence versus disorder score (0 to 1) in six panels: (A) Phospho-S/T in H. sapiens; (B) Non-phospho-S/T in H. sapiens; (C) Phospho-S/T in A. thaliana; (D) Non-phospho-S/T in A. thaliana; (E) Phospho-Y in H. sapiens; (F) Non-phospho-Y in H. sapiens. Annotations on the figure: 91.3% > 0.5, 87.6% > 0.5, 54.9% > 0.5, 50.5% > 0.5. A further axis is labeled "Residue Positions", from -6 to +6.]

Signaling Example 1: Calcineurin and Calmodulin

[Figure: calcineurin structure with labels A-Subunit, B-Subunit, Autoinhibitory Peptide, and Active Site.]

Kissinger C et al., Nature 378:641-644 (1995)

Meador W et al., Science 257: 1251-1255 (1992)

Example 2: p27kip1: A Disordered Domain

[Figure: 3D structure of p27kip1 with Cyclin A and CDK; the disordered domain is marked from residue 25 to residue 93 (69 residues).]

3D Structure: Russo AA et al., Nature 382: 325-331 (1996)
DD: Tompa P et al., Bioessays 4: 328-340 (2008)

The p27kip1 Disordered Domain: Used for Signal Integration

[Diagram: p27 shown in successive states: Y88 / T187; pY88 / T187; pY88 / pT187 (with ATP); pY88 / pT187 with ubiquitination (Ub’n). The transitions are numbered 1 to 4.]

1. NRTK phosphorylation @ Y88, signal #1.
2. Intra-molecular phosphorylation @ T187, signal #2.
3. Ubiquitination @ several possible loci, signal #3.
4. Proteasome digestion of p27, then cell cycle progression.

Galea CA et al., J Mol Biol 376: 827-838 (2008)
Dunker AK & Uversky VN, Nat Chem Biol 4: 229-230 (2008)

Importance of Bioinformatics to IDP and Protein Research

• Thousands of IDPs and IDRs have been found.

• Not one IDP or IDR is discussed in any current biochemistry textbook!

• Why? IDPs and IDRs don’t fit: Sequence → Structure → Function

• New paradigm developed from bioinformatics: Sequence → Disordered Ensemble → Function

IDP prediction is changing fundamental views of structure-function relationships!

Thank You ! ! !

Collaborators

Indiana University: Bin Xue, Jake Chen, Bill Sullivan, Predrag Radivojac, Jennifer Chen, Pedro Romero, Marc Cortese, Derrick Johnson, Chris Oldfield, Amrita Mohan, Yunlong Liu, Ann Roman, Tom Hurley, Anna DePaoli-Roach, Yuro Tagaki, Siama Zaidi, Jingwei Meng, Wei-Lun Hsu, Hua Lu, Fei Huang, Vladimir Uversky

UCSD: Lilia Iakoucheva Sebat

Temple University: Zoran Obradovic, Slobodan Vucetic, Vladimir Vacic, Kang Peng, Hiongbo Xie, Siyuan Ren, Uros Midic

Enzyme Institute: Peter Tompa, Zsuzsanna Dosztanyi, Istvan Simon, Monika Fuxreiter

USU: Robert Williams

Harbin Engineering University: Bo He, Kejun Wang

University of Idaho: Celeste J. Brown, Chris Williams

Molecular Kinetics: Yugong Cheng, Tanguy LeGall, Aaron Santner

Plant and Food Research: Xaiolin Sun

USF: Gary Daughdrill

Wright State University: Oleg Paliy