Center for Computational Biology and Bioinformatics
Bioinformatics and Intrinsically Disordered Proteins (IDPs)
A. Keith Dunker
Biochemistry and Molecular Biology & Center for Computational Biology / Bioinformatics
Indiana University School of Medicine
Presented at:
October 22, 2010
Outline
• What are “Intrinsically Disordered Proteins”?
• Bioinformatics Applications to IDPs
  – Why don’t IDPs form structure?
  – Predicting IDPs from amino acid sequence
  – Some important results from IDP prediction
  – An improved order / disorder amino acid scale
  – Predicting phosphorylation sites
  – Disorder and function: two examples
• Importance of bioinformatics to IDP research
Definitions: Intrinsically Disordered Proteins (IDPs) and ID Regions (IDRs)
• Whole proteins and regions of proteins are intrinsically disordered if they lack stable 3D structure under physiological conditions.
• Instead, they exist as highly dynamic, rapidly interconverting ensembles, without particular equilibrium values for their coordinates or bond angles and with non-cooperative conformational changes.
Why are IDPs / IDRs unstructured?
• From the 1950s to now, far more than 1,000 IDPs / IDRs have been studied and characterized
• Visit: http://www.disprot.org
• Why do IDPs & IDRs lack structure?
  – Lack a ligand or partner?
  – Denatured during isolation?
  – Folding requires conditions found inside cells?
  – Lack of folding encoded by amino acid sequence?
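Amino Acid Compositions
[Figure: (Disorder − Order) / Order, on an axis from −1.0 to 1.0, for each residue in the order W C F I Y V L H M A T R G Q S N P D E K; the residue axis is annotated “Buried” and “Surface”. Series by disordered-region length L: 4 aa ≤ L ≤ 14 aa (14,579), 15 aa ≤ L ≤ 29 aa (10,381), 30 aa ≤ L (58,147).]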
Why are IDPs / IDRs unstructured?
• To a first approximation, amino acid composition determines whether a protein folds or remains intrinsically disordered.
• Given a composition that favors folding, the sequence details determine which fold.
• Given a composition that favors not folding, the sequence details provide motifs for biological function.
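The composition comparison behind these statements can be sketched in a few lines. This is a minimal illustration, assuming two non-empty sets of sequences (one from disordered regions, one from ordered proteins); the function names and the toy sequences in the usage note are hypothetical and not taken from the talk.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequences):
    """Fraction of each standard residue over a non-empty set of sequences."""
    counts = Counter(
        aa for seq in sequences for aa in seq.upper() if aa in AMINO_ACIDS
    )
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def relative_difference(disordered, ordered):
    """(Disorder - Order) / Order for each residue.

    Returns None for residues that never occur in the ordered set,
    because the ratio is undefined there.
    """
    d = composition(disordered)
    o = composition(ordered)
    return {
        aa: (d[aa] - o[aa]) / o[aa] if o[aa] > 0 else None
        for aa in AMINO_ACIDS
    }
```

With the toy input `relative_difference(["WLLL"], ["WWLL"])`, tryptophan is depleted and leucine is enriched in the "disordered" set; real use would need curated ordered and disordered datasets.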
Prediction of Intrinsic Disorder
[Workflow diagram]
• Ordered / Disordered Sequence Data
• Attribute Selection or Extraction: Aromaticity, Hydropathy, Charge, Complexity
• Separate Training and Testing Sets
• Predictor Training: Neural Networks, SVMs, etc.
• Predictor Validation on Out-of-Sample Data
• Prediction
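The attribute-extraction and data-splitting stages of this workflow can be sketched as follows. This is an illustration under stated assumptions, not the PONDR implementation: the window length of 21, the use of the Kyte-Doolittle hydropathy scale, the simple +1/−1 charge assignment, and the helper names are all choices made for the example, and the sequence-complexity attribute is left out for brevity.

```python
import random

# Kyte-Doolittle hydropathy values (assumed here as the hydropathy scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}
AROMATIC = set("FWY")


def window_features(sequence, window=21):
    """Per-residue (mean hydropathy, mean net charge, aromatic fraction).

    Each feature is averaged over a window centered on the residue;
    windows are truncated at the sequence ends.
    """
    half = window // 2
    features = []
    for i in range(len(sequence)):
        segment = sequence[max(0, i - half): i + half + 1]
        n = len(segment)
        features.append((
            sum(KYTE_DOOLITTLE.get(a, 0.0) for a in segment) / n,
            sum(CHARGE.get(a, 0) for a in segment) / n,
            sum(a in AROMATIC for a in segment) / n,
        ))
    return features


def split_by_protein(protein_ids, test_fraction=0.2, seed=0):
    """Split whole proteins into training and testing sets.

    Splitting by protein (not by residue) keeps windows from one
    protein out of both sets, so the test set stays out-of-sample.
    """
    ids = sorted(set(protein_ids))
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]
```

The feature tuples would then be fed to whatever learner is being trained (neural network, SVM, and so on), with accuracy reported only on the held-out proteins.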
First Machine-learning Predictor: SDR / MDR / LDR Predictors
1. Region classes, defined by the number of missing AA:
   – Short Disordered Regions (SDR): 7 – 21
   – Medium Disordered Regions (MDR): 22 – 44
   – Long Disordered Regions (LDR): 45 or more
2. SDR / MDR / LDR predictors: neural networks
3. Training dataset: proteins with missing AA
   – SDR: 34 proteins, 11,050 AA, 38 IDR, 411 IDAA
   – MDR: 20 proteins, 4,764 AA, 22 IDR, 464 IDAA
   – LDR: 7 proteins, 2,069 AA, 7 IDR, 465 IDAA
4. Feature selection: standard sequential forward selection
5. Accuracy: 59 – 67%, estimated by 5-fold cross-validation
6. Better than chance; better on self than on not-self
Romero P, et al., Proc. IEEE International Conference on Neural Networks 1:90-95 (1997)
Next: PONDR® VL-XT
[Diagram: the XN (1), VL1 (2), and XC (1) predictors are combined into VL-XT (2); sequence positions marked in the diagram: 11, 14, N-11, N-14.]
XN, VL1, and XC: neural networks
Input features: XN: 8, VL1: 10, XC: 8
(1) Li X et al., Genome Informat. 9:201-213 (1999)
(2) Romero P et al., Proteins 42:38-48 (2001)
Inputs for PONDR® VL-XT
XN (8 inputs):   Coordination No., V, VIYFW, M, N, H, D, PEVK
VL1 (10 inputs): Coordination No., Net charge, WFY, W, Y, F, D, E, K, R
XC (8 inputs):   Coordination No., Hydropathy, VIYFW, M, T, H, PEVK, R
Accuracy (ACC) = (%Corr-O)/2 + (%Corr-D)/2
ACC (estimated by cross-validation) ~ 72 ± 4%
Li X et al., Genome Informat. 9:201-213 (1999)
Romero P et al., Proteins 42:38-48 (2001)
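The ACC formula used on this slide averages the per-class percentages, so a predictor cannot score well just by favoring the majority class. A minimal sketch, assuming per-residue labels encoded as "O" (ordered) and "D" (disordered) and at least one residue of each class; the function name and label encoding are illustrative choices.

```python
def balanced_accuracy(true_labels, predicted_labels):
    """ACC = (%Corr-O)/2 + (%Corr-D)/2 over per-residue labels 'O' and 'D'."""
    correct = {"O": 0, "D": 0}
    total = {"O": 0, "D": 0}
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    pct_corr_o = 100.0 * correct["O"] / total["O"]
    pct_corr_d = 100.0 * correct["D"] / total["D"]
    return pct_corr_o / 2 + pct_corr_d / 2
```

For example, with four ordered residues of which three are predicted correctly (75%) and two disordered residues of which one is predicted correctly (50%), ACC is 62.5%.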
Disorder Prediction in CASP
• Critical Assessment of Structure Prediction
• http://predictioncenter.org
• CASP1 (1994) to CASP9 (2010)
• Experimentalists provide amino acid sequences as they are determining the structures of proteins
• Groups register and make structure predictions
• After the structures are determined, the predictions are evaluated
• Disorder predictions introduced in CASP5 (2002)
CASP PREDICTIONS ARE TRULY BLIND!!!
Disorder Prediction in CASP
[Figure, two panels plotted against year (2002 – 2010): number of CASP predictors (axis 0 – 40) and area under ROC curve (axis 0.6 – 1.0). Points annotated: VSL2, VSL2, PreDisorder. Note on the plot: CASP5 (2002), sensitivity replaced AUC.]
Our Performance in CASP
• In CASP5 we used VL-XT: it performed poorly on short disordered regions but very well on long disordered regions.
• VL was trained mainly on long disordered regions.
• We changed predictors for CASP6 and CASP7, and the new predictor ranked #1. A big improvement!
• We did not participate in CASP8, but would not have ranked #1 with our current predictors.
• What was the change that led to the large improvement in CASP6?
Predictors of Natural Disordered Regions: PONDR® VL-XT and PONDR® VSL2
[Diagram, VL-XT: the N (1), VL1 (2), and C (1) predictors are combined into VL-XT (2); sequence positions marked: 11, 14, N-11, N-14.]
[Diagram, VSL2: VL2 (3) gives output OL, VS2 (3) gives output OS, and M1 (3) gives the weights OM and 1-OM; these are combined into VSL2 (3).]
VSL2 Score = OL×OM + OS×(1-OM)
• N, VL1, and C are neural networks
  – N-term: 8 inputs
  – VL1: 10 inputs
  – C-term: 8 inputs
• M1, VSL2-L, and VSL2-S are support vector machines
  – M1: 54 inputs
  – VL2: 20 inputs
  – VS2: 20 inputs
(1) Li X et al., Genome Informat. 9:201-213 (1999)
(2) Romero P et al., Proteins 42:38-48 (2001)
(3) Peng K et al., BMC Bioinfo. 7:208 (2006)
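The VSL2 score formula on this slide is a per-residue weighted blend of the long-region and short-region predictor outputs, with the meta-predictor output as the weight. A minimal sketch of that arithmetic only, assuming the three component outputs are already available as numbers in [0, 1]; the function and argument names are hypothetical, and nothing here reproduces the underlying support vector machines.

```python
def vsl2_score(o_long, o_short, o_meta):
    """VSL2 Score = OL*OM + OS*(1 - OM) for a single residue.

    o_long  : output of the long-region predictor (OL)
    o_short : output of the short-region predictor (OS)
    o_meta  : output of the meta-predictor M1 (OM), used as the weight
    """
    return o_long * o_meta + o_short * (1.0 - o_meta)


def combine_profiles(long_scores, short_scores, meta_scores):
    """Apply the blend residue by residue along a sequence."""
    return [
        vsl2_score(ol, os_, om)
        for ol, os_, om in zip(long_scores, short_scores, meta_scores)
    ]
```

When OM is close to 1 the score follows the long-region predictor, and when OM is close to 0 it follows the short-region predictor.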
Comparison on CASP 8 Dataset
• ACC = 80%, where ACC = (%Corr-O)/2 + (%Corr-D)/2
• AUC = 0.89, where AUC = Area Under Curve
Zhang P, et al. (unpublished results; not quite the same as the CASP evaluation)
PONDR® VL-XT, PONDR® VSL2B and PreDisorder
[Figure: disorder score (0.0 – 1.0) versus residue index (0 – 250) for XPA, with curves for VL-XT, VSL2, and PreDisorder; regions annotated (+) Disordered and (–) Structured.]
Iakoucheva L et al., Protein Sci 3: 561-571 (2001)
Dunker AK et al., FEBS J 272: 5129-5148 (2005)
Deng X et al., BMC Bioinformatics 10:436 (2009)
Published Predictors of Disordered Proteins
[Figure: number of predictors of IDPs (axis 0 – 60) versus year (1979, 1997, 2000 – 2009). Annotations: “# +,- / # phobics”, PONDRs, and CASP 5, 6, 7, 8.]
PONDRs:
- VSL1: Ranked #1 in CASP 6 (2004);
- VSL2: Ranked #1 in CASP 7 (2006);
He B, et al., Cell Res 19: 929-949 (2009)
How Abundant are IDRs/IDPs?
• To estimate the abundance of IDPs / IDRs: predict on whole proteomes from many organisms.
• ALERT: the lack of membrane-protein-specific disorder predictors means that estimates of disorder will be too low by a small percentage.
VSL2 Prediction of Abundance** of Intrinsically Disordered Proteins

Organisms            #Orgs.  #Proteins      Avg. #Proteins  % Disordered AA  % Proteins IDR >30  % Proteins Natively Unfolded
Archaea              73      536 – 4234     2199            12.5 – 37.2%     0 – 60.0%           3.2 – 31.5%
Bacteria             951     182 – 9320     3331            12.0 – 36.1%     11.5 – 53.7%        2.7 – 29.2%
Single-cell Eukarya  58      1909 – 16365   9098            22.3 – 49.9%     17.0 – 76.8%        16.8 – 47.6%
Multi-cell Eukarya   51      1775 – 35942   11295           10.4 – 49.0%     4.4 – 66.5%         6.9 – 48.7%

**Are organism-specific predictors sometimes needed?
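Two of the proteome-level statistics reported here, the percentage of disordered residues and the percentage of proteins with a disordered region longer than 30 residues, can be derived from per-residue disorder scores. The sketch below assumes a score cutoff of 0.5 for calling a residue disordered and treats "IDR >30" as a run of more than 30 consecutive disordered residues; both are assumptions for illustration, and the criterion for "natively unfolded" proteins is not given on the slide, so it is not computed.

```python
def disordered_runs(scores, threshold=0.5):
    """Lengths of consecutive stretches with score >= threshold."""
    runs, current = [], 0
    for score in scores:
        if score >= threshold:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def proteome_summary(proteome_scores, threshold=0.5, min_idr=30):
    """Summarize a proteome given one list of per-residue scores per protein."""
    total_residues = sum(len(scores) for scores in proteome_scores)
    all_runs = [disordered_runs(s, threshold) for s in proteome_scores]
    disordered_residues = sum(sum(runs) for runs in all_runs)
    # Count proteins with at least one run strictly longer than min_idr.
    proteins_with_long_idr = sum(
        1 for runs in all_runs if any(r > min_idr for r in runs)
    )
    return {
        "pct_disordered_aa": 100.0 * disordered_residues / total_residues,
        "pct_proteins_long_idr":
            100.0 * proteins_with_long_idr / len(proteome_scores),
    }
```

For a toy proteome of two 100-residue proteins, one with a 40-residue disordered stretch and one fully ordered, this gives 20% disordered residues and 50% of proteins with a long IDR.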
Archaea Phylogenetic Tree
[Figure: archaeal phylogenetic tree annotated by predicted disorder class: >30%, >21%, >17%, >14%, <14%. Tree source: Todd Lowe (http://archaea.ucsc.edu/).]
Predicted Disorder vs. Proteome Size
[Figure: average fraction of disordered residues (0.0 – 0.6) versus proteome size (10^2 – 10^5) for Bacteria, Archaea, SC eukaryotes, and MC eukaryotes.]
Why So Much Disorder?
Hypothesis: Disorder Used for Signaling
• Sequence → Structure → Function
  – Catalysis,
  – Membrane transport,
  – Binding small molecules.
• Sequence → Disordered Ensemble → Function
  – Signaling,
  – Regulation,
  – Recognition,
  – Control.
Dunker AK, et al., Biochemistry 41: 6573-6582 (2002)
Dunker AK, et al., Adv. Prot. Chem. 62: 25-49 (2002)
Xie H, et al., Proteome Res. 6: 1882-1932 (2007)
A New Order / Disorder AA Scale, Part 1
• Collect equal numbers of O and D windows of length 21.
• Calculate the value of attribute, x, for each window.
• For each interval of x, count how many windows are O and D; from this, determine P(O | x) and P(D | x).
• Plot P(O | x) and P(D | x) versus x.
• Determine the area between the two curves.
• Area Ratio Value (ARV) = (area between curves / total area)
• Apply to 517 aa scales: http://www.genome.jp/aaindex
• Rank scales from smallest to largest
Campen A, et al., Protein Pept Lett 15: 956-963 (2008)
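The ARV procedure listed above can be sketched as a small function. This is one plausible reading of the steps, not the published implementation: the number of intervals, the equal-width binning, the handling of empty intervals (skipped), and the interpretation of "total area" as the area under P(O | x) + P(D | x) are all assumptions made for the example.

```python
def area_ratio_value(order_values, disorder_values, n_bins=10):
    """Area between the P(O|x) and P(D|x) curves, divided by the total area.

    order_values / disorder_values: attribute value x for each ordered /
    disordered window (the procedure calls for equal numbers of each).
    """
    lo = min(min(order_values), min(disorder_values))
    hi = max(max(order_values), max(disorder_values))
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all x are equal

    def bin_index(x):
        return min(int((x - lo) / width), n_bins - 1)

    o_counts = [0] * n_bins
    d_counts = [0] * n_bins
    for x in order_values:
        o_counts[bin_index(x)] += 1
    for x in disorder_values:
        d_counts[bin_index(x)] += 1

    between = 0.0
    total = 0.0
    for o, d in zip(o_counts, d_counts):
        n = o + d
        if n == 0:
            continue  # no windows in this interval: no estimate of P(.|x)
        p_o, p_d = o / n, d / n
        between += abs(p_o - p_d)
        total += p_o + p_d  # equals 1 for every populated interval
    return between / total
```

An attribute that separates ordered from disordered windows perfectly gives an ARV of 1, and one whose values are distributed identically in both classes gives 0.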
A New Order / Disorder AA Scale, Part 2
• Overall idea: make random changes to a scale, test for higher ARV, repeat until no larger value is found.
• Genetic Algorithm Pseudocode:
  – Choose initial population
  – Repeat
    – Evaluate the fitness of each individual
    – Select a certain portion of best-ranking individuals
    – Breed new population through crossover + mutation
  – Until terminating condition
• ARV improved from 0.69 for the best of the 517 scales to 0.76 for the new scale, called TOP-IDP
Campen A, et al., Protein Pept Lett 15: 956-963 (2008)
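The genetic algorithm pseudocode above can be turned into a short runnable sketch. Everything specific here is an assumption for illustration: the population size, the number of survivors, one-point crossover, Gaussian mutation of a single residue value, and a fixed number of generations as the terminating condition. The `fitness` argument is any function that scores a 20-value scale; in the work described it would be the ARV of that scale, which this sketch does not compute.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve_scale(fitness, generations=50, pop_size=30, keep=10,
                 sigma=0.1, seed=0):
    """Search for a 20-value amino acid scale that maximizes `fitness`."""
    rng = random.Random(seed)
    # Choose initial population: random scales with values in [-1, 1].
    population = [
        [rng.uniform(-1.0, 1.0) for _ in AMINO_ACIDS]
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Evaluate fitness and select the best-ranking individuals.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:keep]
        # Breed the rest of the new population: crossover + mutation.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(AMINO_ACIDS))
            child = a[:cut] + b[cut:]
            i = rng.randrange(len(child))
            child[i] += rng.gauss(0.0, sigma)
            children.append(child)
        # Parents are carried over unchanged, so the best score never drops.
        population = parents + children
    return max(population, key=fitness)
```

A call such as `evolve_scale(lambda scale: -sum(v * v for v in scale))` runs the loop on a toy objective; `dict(zip(AMINO_ACIDS, result))` maps the returned values back to residues.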
P(D | x) and P(O | x) Versus x Plots: Area Between Curves Used to Rank Attributes, x
[Figure, four panels: Flexibility; Positive Charge; Extracellular Protein AA Composition; TOP-IDP. Values shown for the three AAindex scales: ARV = 0.69, Rank = #1/517; ARV = 0.36, Rank = #238/517; ARV = 0.07, Rank = #517/517. TOP-IDP: ARV = 0.76.]
Campen A et al., Protein & Peptide Lett 15: 956-963 (2008)
[Figure: Analysis of the disorder propensity in p53 by TOP-IDP (A), PONDR® VL-XT (B) and PONDR® VSL1 (C).]
Chronology of Amino Acid Evolution DISORDER TO ORDER, NON-LIFE TO LIFE
Di Mauro E, et al., in Genesis: Origin of Life on Earth and Other Planets (In press)
New Phosphorylation Predictor
[Workflow diagram]
• Data collection from high quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
• Non-redundant datasets built by BLASTclust: phosphorylation sites and non-phosphorylation sites
• Feature extraction: KNN scores, disorder scores, amino acid frequencies
  – Features from positive set
  – Features from negative set
• Training data and control data
• Training: Bootstrap sample 1 ... Bootstrap sample m, giving Classifier 1 ... Classifier m
• Aggregating; specificity estimation
• Phosphorylation prediction model
• Making predictions on new data
Features:
• KNN: similarity to known sites (+ / -) of phosphorylation
• Disorder scores: used VSL2
• AA frequencies: at sequence positions before and after phosphorylation sites
Gao J et al., Mol and Cell Proteomics (In press)
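The bootstrap-and-aggregate part of this workflow can be illustrated with a small bagging sketch. It is deliberately simplified and should not be read as the published predictor: each site is reduced to a single numeric feature (a stand-in for the KNN, disorder, and frequency features), the base learner is a toy midpoint-threshold classifier, and aggregation is a plain average of the m votes. All names are hypothetical.

```python
import random


def train_threshold_classifier(samples):
    """Toy base learner on (feature, label) pairs with labels 1 or 0.

    Splits at the midpoint between the two class means; if a bootstrap
    sample happens to contain one class only, it always predicts that class.
    """
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    if not pos:
        return lambda x: 0
    if not neg:
        return lambda x: 1
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    if mean_pos >= mean_neg:
        return lambda x: 1 if x > threshold else 0
    return lambda x: 1 if x < threshold else 0


def bagged_predictor(samples, m=25, seed=0):
    """Train m classifiers on bootstrap samples and average their votes."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(m):
        # Bootstrap sample: draw len(samples) items with replacement.
        boot = [rng.choice(samples) for _ in samples]
        classifiers.append(train_threshold_classifier(boot))

    def predict(x):
        # Fraction of classifiers voting "phosphorylation site".
        return sum(clf(x) for clf in classifiers) / len(classifiers)

    return predict
```

The returned `predict` function gives a score between 0 and 1; a cutoff on that score could then be chosen to reach a target specificity on control data, as the "specificity estimation" step suggests.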
Disorder Score vs. Phosphorylation
[Figure, six panels of occurrence versus disorder score (0 – 1) at residue positions -6 to +6: (A) Phospho-S/T in H. sapiens; (B) Non-phospho-S/T in H. sapiens; (C) Phospho-S/T in A. thaliana; (D) Non-phospho-S/T in A. thaliana; (E) Phospho-Y in H. sapiens; (F) Non-phospho-Y in H. sapiens. Annotations on the plot: 91.3% > 0.5, 87.6% > 0.5, 54.9% > 0.5, 50.5% > 0.5.]
Gao J et al., Mol & Cell Proteomics 9 (Epub) (2010)
Signaling Example 1: Calcineurin and Calmodulin
[Figure labels: A-Subunit, B-Subunit, Autoinhibitory Peptide, Active Site]
Kissinger C et al., Nature 378:641-644 (1995)
Meador W et al., Science 257: 1251-1255 (1992)
Example 2: p27kip1: A Disordered Domain
[Figure: p27kip1 with Cyclin A and CDK; the segment shown spans residues 25 – 93 (69 residues).]
3D Structure: Russo AA et al., Nature 382: 325-331 (1996)
DD: Tompa P et al., Bioessays 4: 328-340 (2008)
The p27kip1 Disordered Domain: Used for Signal Integration
[Figure: p27kip1 states along the pathway, with steps labeled 1 – 4: Y88 / T187; pY88 / T187; pY88 / pT187 (ATP); pY88 / pT187 with ubiquitination (Ub’n).]
1. NRTK phosphorylation @ Y88, signal #1.
2. Intra-molecular phosphorylation @ T187, signal #2.
3. Ubiquitination @ several possible loci, signal #3.
4. Proteasome digestion of p27, then cell cycle progression.
Galea CA et al., J Mol Biol 376: 827-838 (2008)
Dunker AK & Uversky VN, Nat Chem Biol 4: 229-230 (2008)
Importance of Bioinformatics to IDP and Protein Research
• Thousands of IDPs and IDRs have been found.
• Not one IDP or IDR is discussed in any current biochemistry textbook!
• Why? IDPs and IDRs don’t fit the paradigm:
  Sequence → Structure → Function
• New paradigm developed from bioinformatics:
  Sequence → Disordered Ensemble → Function
IDP prediction is changing fundamental views of structure-function relationships!
Thank You ! ! !
Collaborators
Indiana University: Bin Xue, Jake Chen, Bill Sullivan, Predrag Radivojac, Jennifer Chen, Pedro Romero, Marc Cortese, Derrick Johnson, Chris Oldfield, Amrita Mohan, Yunlong Liu, Ann Roman, Tom Hurley, Anna DePaoli-Roach, Yuro Tagaki, Siama Zaidi, Jingwei Meng, Wei-Lun Hsu, Hua Lu, Fei Huang, Vladimir Uversky
UCSD: Lilia Iakoucheva Sebat
Temple University: Zoran Obradovic, Slobodan Vucetic, Vladimir Vacic, Kang Peng, Hiongbo Xie, Siyuan Ren, Uros Midic
Enzyme Institute: Peter Tompa, Zsuzsanna Dosztanyi, Istvan Simon, Monika Fuxreiter
USU: Robert Williams
Harbin Engineering University: Bo He, Kejun Wang
University of Idaho: Celeste J. Brown, Chris Williams
Molecular Kinetics: Yugong Cheng, Tanguy LeGall, Aaron Santner
Plant and Food Research: Xaiolin Sun
USF: Gary Daughdrill
Wright State University: Oleg Paliy