Quantifying Conformational Ensemble Changes in Proteins...
Transcript of Quantifying Conformational Ensemble Changes in Proteins...
Quantifying Conformational Ensemble Changes in Proteins Using Inverse Machine LearningMohsen Botlani, Ahnaf Siddiqui and Sameer Varma
Department of Cell Biology, Microbiolgy and Molecular BiologyUniversity of South Florida, FL-33620
Background: Protein activities are regulated tightly in biological environments. An understanding of their regulatory mechanisms entails assessment of their various states, including active and inactive states. For many proteins, their states can be distinguished based on their minimum-energy conformations since, the magnitudes of thermal fluctuations, or dynamics, are negligible compared to the differences in minimum-energy structures. This approximation, however, breaks down for several other proteins. The states of these proteins can only be distinguished categorically from each other when their finite-temperature conformational ensembles are considered alongside their minimum-energy structures. The list of such proteins has grown rapidly in the last decade, which now includes GPCRs, PDZ domains, nuclear transcription factors, heat shock proteins, T-cell receptors and viral attachment proteins. Applicability of molecular simulations toward understanding mechanisms in this latter category of proteins requires development of new methods that can deal with high-dimensional conformational ensemble data. Description: The traditional approach to compare protein conformational ensembles is to compare their respective summary statistics. However, if a subset of the summary statistics from the two ensembles is found to be identical, it does not imply that the remaining summary statistics will also be identical. The general problem of finding and choosing a feature that appropriately distinguishes ensembles can be overcome by comparing ensembles directly against each other and prior to any dimensionality reduction. We have developed a method to accomplish just that – it performs excellently for both Gaussian and non-Gaussian distributions. The difference between ensembles is computed by solving the inverse machine learning problem and in terms of a metric that satisfies the conditions set forth by the zeroth law of thermodynamics. Conclusions: Such a quantification permits statistical analyses and quantitative data mining necessary for establishing causality in protein functional regulation. We have applied this method to (a) quantitatively understand the effect of ligand binding on the structure and dynamics of a viral protein whose function is controlled by dynamic allostery; (b) understand the role of water in the inception of allosteric signals; (c) determine intersecting signaling pathways. This method is available under standard GNU license on SimTk.(https://simtk.org/projects/conf_ensembles).
1. Leighty RE and Varma S. J Chem. Theory and Comput, 9: 868-875, 2013.2. Varma S, Botlani M and Leighty RE. Proteins, 82: 3241-3254, 2014. 3. Dutta P, Botlani M and Varma S. J Phys. Chem. B, 118: 14795-14807, 2014.4. Dutta P, Siddiqui A, Botlani M and Varma S. Biophys. J, 2016, Under revision.
Acknowledgments: All simulations were carried out at the Research Computing center of the University of South Florida.
TextText
1)
2)
Intersecting signaling pathways:
Conformational sampling over 3 collective variables
Traditionally, a support vector machine (SVM) is used for binary classification. It is first trained on a set of instances for which their group identities are known, and then used for predicting the group identities of unclassified instances. In our approach, we train the SVM to recognize the difference of two n-particle conformational ensembles, but instead of using the trained SVM for predictive purposes, we utilize the mathematical properties of the underlying classification function to obtain a physically meaningful quantitative estimate for the difference between the ensembles. The method is trained on Gaussian distributions, and works excellently without need for any data fitting. From a theoretical standpoint, the method should also work for multi-Gaussian distributions, and by extension, for any distribution, because the overlap between two multi-Gaussian distributions is essentially a sum of overlaps between Gaussian distributions,
Residues that are close to the diagonal undergo shifts primarily in backbone positions. Residues that lie below the diagonal undergo changes in side chain orientations and/or conformational entropy. Residues that lie above the diagonal represent cases where backbone deviations are swamped by smaller changes in whole residue deviations.
Example: Effect of force field on ligand-induced conformational ensemble shifts. and are computed, respectively, from stochastic dynamics simulations in explicit and implicit solvent.
Biophysical Journal: Dutta et al. 5
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Etas
1−Overlap
−2 0 2 4 6 8 10 120
0.2
0.4
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Etas
1−Overlap
−2 0 2 4 6 80
0.2
0.4
0.55
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Etas
1−Overlap
−2 0 2 4 6 8 10 120
0.2
0.4
0.5
(a) Bimodal (b) Trimodal (c) Quadrimodal
MAE = 4.2% ρ = 0.99
MAE = 5.7% ρ = 0.98
MAE = 5.8% ρ = 0.97
Figure 2: Increase font sizes for the main plots. Increase font sizes for the insets. Add MAE and Corr values
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 10 0.2 0.4 0.6 0.8 1
0.2
0.4
0.6
0.8
1
0.2
0.4
0.6
0.8
1
0.2
0.4
0.6
0.8
1
Figure 3: Performance of ⌘ estimated from F (r) against its exact value (1� ||R\R0||). For each of the three types of multimodaldistributions, (a) bimodal distributions (R =
P2i=1 ci
fi
), (b) trimodal distributions (R =P3
i=1 ci
fi
), and (c) quadrimodaldistributions (R =
P4i=1 ci
fi
), we generate 400 random pairs (R, R0) by modulating the weighting coefficients c as well asthe attributes of Gaussian functions f . Representative distribution pairs are shown as insets, where the shaded portions indicatethe overlap (||R \ R0||) between the distributions. Performance is quantified using mean absolute errors (MAE) and Pearsoncorrelation coefficients (⇢).
We also note from Fig. 4a that while the two simulations of the ephrin bound state yield identical RBD-RBD orientations,the two simulations of the ephrin free state yield slightly different RBD-RBD orientations. To understand the latter, we visualizein Fig. 4b the RBD-RBD interfaces obtained from these simulations in the context of the position of the FAD. We note that theFAD will interact more extensively with the RBDs in the ephrin free state, as compared to the ephrin bound state. Therefore, thereason the two simulations of the ephrin free state produce slightly different RBD-RBD interfaces could be due to the absence ofthe RBD-FAD interface in our simulations. Nevertheless, the primary outcome of these simulation is that ephrin binding inducesa significant change in the RBD-RBD orientation.
0 50 100 150 2000
0.785
1.570.79
1.58
2.364.5
5.0
5.5
(nm
)
π/2
π/4
3π/4
π/2
π/4
Time (ns)
(rad)
(rad)
(a) (b)
Ephrin binding
Ephrin free stateEphrin bound state
0
FAD
RBD-I
RBD-II
RBD-IIRBD-I
Ephrin
θtilt
â
θroll
â
â′
â′
dCoM
Figure 4: (a) Time evolutions of collective variables that describe the interface between the two RBDs of a dimer. The two linesfor each of the ephrin free and ephrin bound states indicate two separate MD simulations. d
CoM
is the distance between thecenters of masses (CoM) of the backbone atoms of the two RBDs. ✓
tilt
is the angle between the central axes, a and a0, of thetwo RBDs. ✓
roll
is the angle of rotation of the RBD about its central axis. The geometrical definitions of ✓tilt
and ✓roll
areprovided in Fig. S5 in the Supporting Material. (b) Final snapshots of the RBD-RBD interface in MD simulations. Note thattwo superimposed structures are shown for the ephrin free state, as the two simulations in the ephrin free state produced slightlydifferent RBD-RBD geometries. The location of the FAD relative to the RBD-RBD dimer is depicted according to structure ofthe full length ectodomain proposed by Broder and coworkers (5), which was homology modeled on the X-ray structures of theG analogs in the Newcastle Disease Virus and the parainfluenza virus (4, 11, 12).
Biophysical Journal 00(00) 1–0
Inactive Active
InactiveActive
Ligand
0
0.2
0.4
0.6
0.8
1.0
0 0.2 0.4 0.6 0.8 1.0ηExplicit
η Implicit
ρ = 0.28
10 30
G-B2G-B3
Bio
physica
lJo
urn
al
Vo
lu
me
:0
0M
on
th
Ye
ar
0–
01
Wat
erdy
nam
ics
atpr
otei
n-pr
otei
nin
terf
aces
:Am
olec
ular
dyna
mic
sst
udy
ofvi
rus-
host
rece
ptor
com
plex
es
Priy
anka
Dut
ta,M
ohse
nB
otla
nian
dSa
mee
rVar
ma
Dep
artm
ento
fCel
lBio
logy
,Mic
robi
olog
yan
dM
olec
ular
Bio
logy
,Uni
vers
ityof
Sout
hFl
orid
a,42
02E.
Fow
lerA
ve.,
Tam
pa,F
L-33
620,
Uni
ted
Stat
esof
Am
eric
a
Ab
stract
The
dyna
mic
alpr
oper
ties
ofw
ater
atbi
olog
ical
inte
rfac
esar
edi
ffer
ent
from
thos
ein
bulk
wat
er.E
xper
imen
tsas
wel
las
sim
ulat
ions
indi
cate
that
wat
erdi
ffus
esan
dor
ient
sat
rate
sth
atde
pend
onbo
thth
ech
emis
tryas
wel
las
the
topo
logy
ofth
ein
terf
ace.
Her
ew
eut
ilize
mol
ecul
ardy
nam
ics
sim
ulat
ions
tode
term
ine
the
natu
rean
dex
tent
tow
hich
the
dyna
mic
alpr
oper
-tie
sof
wat
erar
esh
ifted
from
thei
rbul
kva
lues
whe
nth
eyoc
cupy
inte
rstit
ialr
egio
nsbe
twee
ntw
opr
otei
ns.W
eco
nsid
ertw
ona
tura
lpro
tein
-pro
tein
com
plex
es,o
nein
whi
chth
eN
ipah
viru
sG
prot
ein
bind
sto
cellu
lare
phrin
B2,
and
the
othe
rin
whi
chth
esa
me
Gpr
otei
nbi
nds
toep
hrin
B3.
Thes
epr
otei
n-pr
otei
nin
tera
ctio
nsco
nstit
ute
the
first
step
inN
ipah
infe
ctio
n.W
efin
dth
atde
spite
the
low
sequ
ence
iden
tity
of50
%be
twee
nep
hrin
sB2
and
B3,
the
dyna
mic
alpr
oper
tieso
fint
erst
itial
wat
ersi
nth
etw
oco
mpl
exes
are
sim
ilar.
Inbo
thca
ses,
we
find
that
the
inte
rstit
ialw
ater
sdi
ffus
ete
ntim
essl
ower
com
pare
dto
bulk
wat
er.
Inad
ditio
n,de
spite
thei
rre
solu
tion
incr
ysta
lstru
ctur
es,m
ore
than
95%
ofth
ew
ater
sin
the
inte
rstit
ialr
egio
nsex
chan
gew
ithth
ebu
lkw
ithin
150
ns.T
hein
ters
titia
lwat
ers
also
exhi
bitd
ipol
ere
laxa
tion
times
and
hydr
ogen
bond
lifet
imes
anor
der
inm
agni
tude
long
erth
anbu
lkw
ater
.The
sede
viat
ions
from
bulk
valu
esar
ege
nera
llym
uch
larg
erth
anth
ose
obse
rved
atpr
otei
n-w
ater
inte
rfac
es.T
oga
uge
the
func
tiona
lrel
evan
ceof
the
inte
rstit
ialw
ater
,we
exam
ine
quan
titat
ivel
yho
wim
plic
itso
lven
tmod
els
com
pare
agai
nste
xplic
itso
lven
tmod
els
inpr
oduc
ing
ephr
in-in
duce
dsh
ifts
inth
eG
confi
gura
tiona
lden
sity
.Ep
hrin
-indu
ced
shift
sin
the
Gco
nfigu
ratio
nald
ensi
tyar
ecr
itica
lto
the
allo
ster
icre
gula
tion
ofvi
ralf
usio
n.W
efin
dth
atth
etw
om
etho
dsyi
eld
strik
ingl
ydi
ffer
enti
nduc
edch
ange
sin
the
Gco
nfigu
ratio
nald
ensi
ty,w
hich
sugg
ests
that
the
inte
rstit
ial
wat
ers
may
also
cont
ribut
eto
the
allo
ster
icsi
gnal
ing,
and
ther
efor
e,ar
efu
nctio
nally
impo
rtant
.
Inse
rtR
ecei
ved
forp
ublic
atio
nD
ate
and
infin
alfo
rmD
ate.
Cor
resp
onda
nce:
svar
ma@
usf.e
du
In
tro
du
ctio
n
The
dyna
mic
alpr
oper
ties
ofw
ater
atbi
olog
ical
inte
rfac
esar
edi
ffer
entf
rom
thos
ein
bulk
wat
er(?
??
??
??
??
??
).H
owar
eth
eydi
ffer
ent?
Inge
nera
l,th
efu
ndam
enta
ltre
ndob
serv
edfr
omex
perim
ents
and
sim
ulat
ions
isth
atw
ater
diff
uses
,rel
axes
and
orie
nts
slow
erat
prot
ein-
wat
eran
dlip
id-w
ater
inte
rfac
es,a
sco
mpa
red
toin
the
bulk
.1.
Firs
thyd
ratio
nsh
ello
fpro
tein
sin
dens
erIM
PORT
AN
TFO
RM
ETH
OD
S:cr
ysta
lWAT
ERs
inB
2an
dno
tB3.
Sore
tain
ing
the
crys
talw
ater
sha
sno
teff
ecto
nth
eov
eral
lpro
perti
es.
——
——
Equa
tions
:⇢/
⇢ 0r
(A)
——
——
Prob
ing
the
fold
ing
and
unfo
ldin
gpr
oces
ses
ofpr
otei
nsas
afu
nctio
nof
tem
pera
ture
isa
maj
orch
alle
nge
inbi
ophy
sics
.H
ere
we
exam
ine
the
effe
cts
ofte
mpe
ratu
resp
ikes
that
heat
and
cool
prot
eins
with
inte
nsof
nano
seco
nds.
Our
resu
ltssh
owth
ese
spik
esar
eca
pabl
eof
caus
ing
irrev
ersi
ble
chan
ges
suffi
cien
tto
elim
inat
epr
otei
nac
tivity
.
©2
01
3T
he
Au
th
ors
0006
-349
5/08
/09/
2624
/12
$2.0
0do
i:10
.152
9/bi
ophy
sj.1
06.0
9094
4
Biophysical Journal Volume: 00 Month Year 0–0 1
Water dynamics at protein-protein interfaces: A molecular dynamicsstudy of virus-host receptor complexes
Priyanka Dutta, Mohsen Botlani and Sameer Varma
Department of Cell Biology, Microbiology and Molecular Biology, University of South Florida, 4202 E.Fowler Ave., Tampa, FL-33620, United States of America
Abstract
The dynamical properties of water at biological interfaces are different from those in bulk water. Experiments as well assimulations indicate that water diffuses and orients at rates that depend on both the chemistry as well as the topology of theinterface. Here we utilize molecular dynamics simulations to determine the nature and extent to which the dynamical proper-ties of water are shifted from their bulk values when they occupy interstitial regions between two proteins. We consider twonatural protein-protein complexes, one in which the Nipah virus G protein binds to cellular ephrin B2, and the other in whichthe same G protein binds to ephrin B3. These protein-protein interactions constitute the first step in Nipah infection. We findthat despite the low sequence identity of 50% between ephrins B2 and B3, the dynamical properties of interstitial waters in thetwo complexes are similar. In both cases, we find that the interstitial waters diffuse ten times slower compared to bulk water.In addition, despite their resolution in crystal structures, more than 95% of the waters in the interstitial regions exchangewith the bulk within 150 ns. The interstitial waters also exhibit dipole relaxation times and hydrogen bond lifetimes an orderin magnitude longer than bulk water. These deviations from bulk values are generally much larger than those observed atprotein-water interfaces. To gauge the functional relevance of the interstitial water, we examine quantitatively how implicitsolvent models compare against explicit solvent models in producing ephrin-induced shifts in the G configurational density.Ephrin-induced shifts in the G configurational density are critical to the allosteric regulation of viral fusion. We find that thetwo methods yield strikingly different induced changes in the G configurational density, which suggests that the interstitialwaters may also contribute to the allosteric signaling, and therefore, are functionally important.
Insert Received for publication Date and in final form Date.Correspondance: [email protected]
Introduction
The dynamical properties of water at biological interfaces are different from those in bulk water (? ? ? ? ? ? ? ? ? ? ? ). Howare they different?
In general, the fundamental trend observed from experiments and simulations is that water diffuses, relaxes and orientsslower at protein-water and lipid-water interfaces, as compared to in the bulk.
1. First hydration shell of proteins in denserIMPORTANT FOR METHODS: crystal WATERs in B2 and not B3. So retaining the crystal waters has not effect on the
overall properties.———— Equations:⇢/⇢0r (A)————Probing the folding and unfolding processes of proteins as a function of temperature is a major challenge in biophysics.
Here we examine the effects of temperature spikes that heat and cool proteins within tens of nanoseconds. Our results showthese spikes are capable of causing irreversible changes sufficient to eliminate protein activity.
© 2013 The Authors
0006-3495/08/09/2624/12 $2.00 doi: 10.1529/biophysj.106.090944
0.5
Interstitial water
(a) (b)
Biophysical Journal Volume: 00 Month Year 1–0 1
Water dynamics at protein-protein interfaces: A molecular dynamicsstudy of virus-host receptor complexes
Priyanka Dutta, Mohsen Botlani and Sameer Varma
Department of Cell Biology, Microbiology and Molecular Biology, University of South Florida, 4202 E.Fowler Ave., Tampa, FL-33620, United States of America
Abstract
The dynamical properties of water at biological interfaces are different from those in bulk water. Experiments as well assimulations indicate that water diffuses and orients at rates that depend on both the chemistry as well as the topology of theinterface. Here we utilize molecular dynamics simulations to determine the nature and extent to which the dynamical proper-ties of water are shifted from their bulk values when they occupy interstitial regions between two proteins. We consider twonatural protein-protein complexes, one in which the Nipah virus G protein binds to cellular ephrin B2, and the other in whichthe same G protein binds to ephrin B3. These protein-protein interactions constitute the first step in Nipah infection. We findthat despite the low sequence identity of 50% between ephrins B2 and B3, the dynamical properties of interstitial waters in thetwo complexes are similar. In both cases, we find that the interstitial waters diffuse ten times slower compared to bulk water.In addition, despite their resolution in crystal structures, more than 95% of the waters in the interstitial regions exchangewith the bulk within 150 ns. The interstitial waters also exhibit dipole relaxation times and hydrogen bond lifetimes an orderin magnitude longer than bulk water. These deviations from bulk values are generally much larger than those observed atprotein-water interfaces. To gauge the functional relevance of the interstitial water, we examine quantitatively how implicitsolvent models compare against explicit solvent models in producing ephrin-induced shifts in the G configurational density.Ephrin-induced shifts in the G configurational density are critical to the allosteric regulation of viral fusion. We find that thetwo methods yield strikingly different induced changes in the G configurational density, which suggests that the interstitialwaters may also contribute to the allosteric signaling, and therefore, are functionally important.
Insert Received for publication Date and in final form Date.Correspondance: [email protected]
Introduction
The dynamical properties of water at biological interfaces are different from those in bulk water (1–3, 6–13). How are theydifferent?
In general, the fundamental trend observed from experiments and simulations is that water diffuses, relaxes and orientsslower at protein-water and lipid-water interfaces, as compared to in the bulk.
1. First hydration shell of proteins in denserIMPORTANT FOR METHODS: crystal WATERs in B2 and not B3. So retaining the crystal waters has not effect on the
overall properties.———— Equations:⇢/⇢0r (A)r = 10 A————Probing the folding and unfolding processes of proteins as a function of temperature is a major challenge in biophysics.
Here we examine the effects of temperature spikes that heat and cool proteins within tens of nanoseconds. Our results showthese spikes are capable of causing irreversible changes sufficient to eliminate protein activity.
© 2013 The Authors
0006-3495/08/09/2624/12 $2.00 doi: 10.1529/biophysj.106.090944
G
B2
G
B2
0
0.2
0.4
0.6
0.8
1.0
0 0.2 0.4 0.6 0.8 1.0ηExplicit
η Implicit
ρ = 0.28
0
0.2
0.4
0.6
0.8
1.0
00.2
0.40.6
0.81.0
ηExplicit
ηImplicit
ρ=0.28
Inverse machine learning
Abstract Functional Regulation via small structural changes
(a) Comparison between two conformational ensembles
(b) Comparison between two conformational ensemble shifts
(c) Comparison between multiple conformational ensembles
References
an extended ensemble approach (38,39) and with a coupling constant of 1 ps.An extended ensemble approach is also used for maintaining pressure (40).Pressure is maintained at 1 bar using a coupling constant of 1 ps and acompressibility of 4:5! 10"5 bar"1. NaCl concentration is set at 150 mM,and there are extraNaþ ions compared toCl" ions tocompensate for the chargeon the protein. Electrostatic interactions are computed using the particle meshEwald scheme (41) with a Fourier grid spacing of 0.1 nm, a fourth-order inter-polation, and a direct space cutoff of 10 A. The van derWaals interactions arecomputed explicitly for interatomic distances%10 A. The bonds in proteinsand the geometries of water molecules are constrained (42,43), and conse-quently an integration time step of 2 fs is employed. The protein and ionsare described using OPLS-AA parameters (44), and the water molecules aredescribed using TIP4P parameters (45).We note thatwe do notmodel inducedeffects explicitly; however, such effects are generally more important fordescribing ionic interactions (46,47). Convergence is administered by trackingtime evolutions of conformational RMSDs, pressure, potential energies, and aset of collective variables that describe RBD-RBD interfaces.
Construction of RBD-RBD dimer models
While there are no experimental structures of the RBD-RBD dimer of Ni-pah G, there is sufficient experimental data to construct the initial dimermodel for carrying out MD simulations. Firstly, x-ray structures are avail-able for the isolated Nipah RBD as well as its complex with ephrin (25,26).Secondly, both the ephrin-free and ephrin-bound structures of Nipah RBDhave been subjected to MD at physiological temperature, and have beenfound to be stable (28,29). Thirdly, Bowden et al. (12) have proposed aRBD-RBD interface for the G protein of the Hendra virus (PDB: 2X9M).This interface serves as a suitable template to construct the initial modelof the RBD-RBD interface of Nipah G because 1) the G protein of Hendrais a closely related homolog of the Nipah G protein (89% sequence similar-ity; see Fig. S1 in the Supporting Material), and 2) x-ray structures of theephrin-free and ephrin-bound states of Hendra’s RBD closely match therespective x-ray structures of Nipah’s RBD (Fig. S2).
The RBD-RBD interface of Hendra’s G protein was proposed (12) byconsolidating data concerning the 1) packing interactions within crystals,2) conservation patterns within RBD-RBD interfaces of analogous receptorbinding proteins of other paramyxoviruses, and 3) distribution of N-linkedglycosylation sites on the RBD. In particular, the distribution of glycosyl-ation sites on the RBDs of Nipah and Hendra are such that they permitonly one specific face of the RBD to dimerize with an adjacent RBD—the remaining faces of the RBDs contain protruding glycosyl chains thatwill produce steric clashes. Therefore, there is absolutely no ambiguity con-cerning the dimerization face of the RBD. However, as Bowden et al. (12)also point out, there is ambiguity concerning the relative orientation be-tween the two RBDs. Nevertheless, the Nipah RBD-RBD model con-structed using Hendra template will serve as an excellent starting pointfor MD simulations, which we, as such, utilize to determine the relativeorientations between RBDs.
To construct the initial model of the RBD-RBD interface in the ephrin-free state, we take the final snapshot (640 ns) from our earlier simulationof the monomeric form of Nipah’s RBD (28), and geometrically fit twoof its copies individually onto the two RBDs of Hendra’s RBD-RBD dimer.Geometric fitting is conducted using the backbone Ca atoms. The two geo-metric fits produced identical least squared fit values, which are expectedbecause the RBD-RBD interface is symmetric. We also consider the fitsexcellent (RMSD < 2 A). The templated model is shown in Fig. S3. Weuse the same protocol to construct the initial model of the RBD-RBD dimerin the ephrin-bound state, but in this case we take the final snapshot (460 ns)of our simulation of Nipah’s ephrin-bound RBD monomer (28) (Fig. S3).Even in this case, we find that the geometric fits are excellent (RMSD <2 A). The reason that the structures of both the ephrin-free and ephrin-bound RBDs fit excellently on to the RBD of Hendra is because, as wenote in Fig. 2, the difference between the ephrin-free and ephrin-boundstructures of the RBD is small (25,26,28,29). Note that after fitting the
RBD of the ephrin-RBD complex to the RBD of the Hendra RBD-RBDtemplate, we apply the resulting rotational matrix to ephrin. Note alsothat we retain the water molecules sandwiched between ephrin and theRBD and apply the rotational matrix to these water molecules. We havefound these interstitial waters are critical to not only the structural integrityof the RBD-ephrin interface, but also to the inception of the ephrin bindingsignal at the RBD-ephrin interface (48). The two constructed RBD-RBD di-mers are energy minimized, solvated separately in salt solutions, and thensubjected to MD. The ephrin-free state is comprised of 356,770 particles,and the ephrin-bound state is comprised of 435,254 particles.
Comparison of conformational ensembles
The traditional approach to compare two conformational ensembles ofproteins, ℝ ¼ fr1; r2;.; rmg and ℝ0 ¼ fr01; r02;.; r0mg, where r denotes a3n-dimensional coordinate and m denotes the number of conformationsin the ensemble, is to compare their respective summary statistics, like cen-ters-of-mass (COMs) and root mean square fluctuations. However, if a sub-set of the summary statistics of the two ensembles is found to be identical, itdoes not imply that all of the 3n" 6 summary statistics of two ensembleswill also be identical. The general problem of finding and choosing afeature that appropriately distinguishes two ensembles can be overcomeby comparing ensembles directly against each other, and before any dimen-sionality reduction (28,29,48). A further advantage of comparing ensemblesdirectly against each other is that the resulting quantification naturally em-bodies differences in conformational fluctuations.
We compare ensembles directly against each other using amethodwe devel-oped recently (29). It quantifies the difference between two ensembles in termsof ametric, h, which satisfies two conditions: (1) hðℝ/ℝ0Þ ¼ hðℝ0/ℝÞ, and(2) if hðℝ/ℝ0Þ ¼ hðℝ0/ℝ00Þ, then it does not necessarily imply thathðℝ/ℝ0Þ ¼ hðℝ/ℝ00Þ. This metric is also universal in that it is not boundedby systemtype/size, and canbe used to examinedifferences in ensembles at anystructural hierarchy (functional groups, amino acids, or secondary structures).
Mathematically, h is a function of the geometrical overlap betweenconformational ensembles, ℝ and ℝ0,
h ¼ 1" kℝ X ℝ0 k : (1)
It is normalized, that is, h ˛ ½0; 1Þ, and it takes up a value closer to unity asthe difference between the ensembles increases. kℝ X ℝ0 k is estimatedby solving an inverse machine learning problem. In the traditional sense,machine learning is used for data classification (49–51)—the classificationfunction, or machine ðFðrÞÞ, is first trained on a set of instances withknown group identities, and then used for predicting the group identityof an unclassified instance. In principle, the conformational ensembles ℝand ℝ0 can also serve as training data to train a classification function,FðrÞ, which can, in turn, be used to predict whether an unseen conforma-tion belongs to ℝ or ℝ0. We have shown that if FðrÞ is constructed andtrained appropriately, then the overlap between ℝ and ℝ0 can be extractedfrom FðrÞ (28).
We have also demonstrated that this method works excellently andwithout need for any prior data fitting, provided we assume that the under-lying distributions are Gaussian—the mean absolute error (MAE) betweencomputed and analytical overlaps is 3.2% (29). The Gaussianity in a distri-bution, which is a corollary to the central limit theorem, is, however, a validassumption only in systems where particles do not interact with each other.Therefore, deviations can be expected for protein systems that evolve underthe influence of many-body interactions. Nevertheless, the overlap betweentwo multi-Gaussian distributions, ℝ ¼
Pcifi and ℝ0 ¼
Pc0if
0i , where fi are
Gaussians and ci are weighting coefficients, is essentially a sum of overlapsbetween Gaussian distributions, that is,
h ¼ 1" kX
i¼ 1
n
ci fi XX
j¼ 1
n
c0j f0j k¼ 1" k
X
i;j¼ 1
n
ci fi X c0j f0j k : (2)
Allosteric Stimulation of Nipah Host Binding Protein
Biophysical Journal 111, 1621–1630, October 18, 2016 1623