Statistical Force-Field for Structural Modeling Using ...

doi.org/10.26434/chemrxiv.6030563.v1

Statistical Force-Field for Structural Modeling Using Chemical Cross-Linking/mass Spectrometry Distance Constraints

Allan J. R. Ferrari, Fabio C. Gozzo, Leandro Martinez

Submitted date: 26/03/2018
Posted date: 27/03/2018
Licence: CC BY-NC-ND 4.0
Citation information: Ferrari, Allan J. R.; Gozzo, Fabio C.; Martinez, Leandro (2018): Statistical Force-Field for Structural Modeling Using Chemical Cross-Linking/mass Spectrometry Distance Constraints. ChemRxiv. Preprint.

Chemical cross-linking/Mass Spectrometry (XLMS) is an experimental method to obtain distance constraints between amino acid residues, which can be applied to structural modeling of tertiary and quaternary biomolecular structures. These constraints provide, in principle, only upper limits to the distance between amino acid residues along the surface of the biomolecule. In practice, attempts to use XLMS constraints for tertiary protein structure determination have not been widely successful. This indicates the need for specifically designed strategies for the representation of these constraints within modeling algorithms. Here, a force-field designed to represent XLMS-derived constraints is proposed. The potential energy functions are obtained by computing, in the database of known protein structures, the probability of satisfaction of a topological cross-linking distance as a function of the Euclidean distance between amino acid residues. The force-field can be easily incorporated into current modeling methods and software. In this work, the force-field was implemented within the Rosetta ab initio relax protocol. We show a significant improvement in the quality of the models obtained relative to current strategies for constraint representation. This force-field contributes to the long-desired goal of obtaining the tertiary structures of proteins using XLMS data.

Force-field parameters and usage instructions are freely available at http://m3g.iqm.unicamp.br/topolink/xlff

File list (2): ferrari_xlff_manuscript.pdf (0.98 MiB); ferrari_xlff_supporting_information.pdf (1.51 MiB)


Statistical force-field for structural modeling using chemical cross-linking/mass spectrometry distance constraints

Allan J R Ferrari¹, Fabio C Gozzo¹, Leandro Martínez¹,²*

¹Institute of Chemistry, University of Campinas, Campinas, SP, Brazil and ²Center for Computational Engineering & Sciences, University of Campinas, Campinas, SP, Brazil.

*[email protected]

Abstract

Motivation

Chemical cross-linking/Mass Spectrometry (XLMS) is an experimental method to obtain distance constraints between amino acid residues, which can be applied to structural modeling of tertiary and quaternary biomolecular structures. These constraints provide, in principle, only upper limits to the distance between amino acid residues along the surface of the biomolecule. In practice, attempts to use XLMS constraints for tertiary protein structure determination have not been widely successful. This indicates the need for specifically designed strategies for the representation of these constraints within modeling algorithms.

Results

Here, a force-field designed to represent XLMS-derived constraints is proposed. The potential energy functions are obtained by computing, in the database of known protein structures, the probability of satisfaction of a topological cross-linking distance as a function of the Euclidean distance between amino acid residues. The force-field can be easily incorporated into current modeling methods and software. In this work, the force-field was implemented within the Rosetta ab initio relax protocol. We show a significant improvement in the quality of the models obtained relative to current strategies for constraint representation. This force-field contributes to the long-desired goal of obtaining the tertiary structures of proteins using XLMS data.

Availability

Force-field parameters and usage instructions are freely available at http://m3g.iqm.unicamp.br/topolink/xlff

1. Introduction

The number of protein structures determined is much smaller than the number of proteins known at the sequence level (Bateman et al., 2017; Pundir et al., 2017; Berman et al., 2000; Bairoch and Apweiler, 2000). This discrepancy results from experimental limitations in obtaining high-resolution structures and, in parallel, from experimental advances in genome sequencing. In silico approaches for protein structure prediction have been applied to fill that gap. If the structure of homologous proteins has already been solved, modeling the target protein is relatively simple (Fiser, 2010; Eswar et al., 2006; Song et al., 2013). Without structural homologs, however, the determination of a protein fold is a major challenge in computational biology.

Many types of data can be used to improve structural modeling of proteins. Examples include sparse NMR data (Bowers et al., 2000; Tang et al., 2015; Thompson et al., 2012), Small Angle X-Ray Scattering (SAXS) (Schneidman-Duhovny et al., 2012), Cryo-electron Microscopy (Cryo-EM) (DiMaio et al., 2015), distance constraints derived from residue coevolution statistics (Ovchinnikov et al., 2016, 2017) and, more recently, distance constraints from chemical cross-linking/Mass Spectrometry (XLMS) (Brodie et al., 2017; Belsom et al., 2016; Santos et al., 2018). XLMS is attractive experimentally because it requires cheaper and more accessible instrumentation, simple sample handling and small amounts of sample. Furthermore, the results are tolerant to contaminants and, in principle, XLMS data can be obtained for almost any protein, as MS is a universally applicable technique.

In some sense, the information XLMS provides is similar to that obtained from NMR, that is, a list of distance constraints between atoms. Nevertheless, important differences exist: 1) the XLMS constraint is a distance along the surface of the protein; 2) the constraint is in principle associated only with the maximum linker reach, that is, it is only an upper bound to the distance between the residues; 3) the length of the linker can be of the order of several Angstroms and, thus, geometrically associates residues which are relatively far apart in the protein structure. Experimentally, XLMS presents its own limitations, which are a field of intense state-of-the-art research: the number of XLMS constraints is limited by the diversity of the reactivity of the linkers, and by the exposure of the residues to the protein surface. Also, the interpretation of XLMS spectra is still a complex task (Iacobucci and Sinz, 2017), requiring specific algorithms and software (Lima et al., 2015; Götze et al., 2012; Hoopmann et al., 2015; Kosinski et al., 2015; Sarpe et al., 2016), and possibly manual curation.

Because of the current limitations in experimental and modeling techniques, the use of chemical cross-links has been indisputably successful only for the determination of quaternary arrangements. Their use for tertiary structure modeling remains a challenge. For instance, in the CASP11 and CASP12 assisted competitions, no clear improvements in the quality of the models were observed from the use of experimental cross-linking constraints (Schneider et al., 2016; Tamò et al., 2017). Recently, we were able to model the tertiary structures of a variety of proteins with the support of XLMS distance constraints, but only in combination with distance constraints derived from amino acid coevolution analysis, which played a determinant role in obtaining models with fold-level accuracy (Santos et al., 2018).

The incorporation of XLMS constraints in structural modeling strategies is indeed a challenge. The distance constraints are along the protein surface, thus their precise evaluation depends on the model structure which, in principle, is not known. Furthermore, the evaluation of the surface-accessible distance between two residues requires specialized strategies and, of course, is much more computationally demanding than the evaluation of straight Euclidean distances. Therefore, these constraints have been implemented in the modeling process through Euclidean-distance-dependent energy functions that aim to constrain the maximum distance between residues observed to be cross-linked. The maximum distance is usually derived from the maximum cross-linker and side chain extensions, through simple geometrical arguments.

Here, we formulate a Euclidean-distance-dependent structure-based statistical force-field for cross-linking/mass spectrometry constraints, named XLFF. In summary, we compute from a database of non-redundant protein structures the probability of observing two residues at a surface-accessible cross-linking distance as a function of the Euclidean distance between their Cβ atoms. This probability curve is converted into a potential energy function assuming it obeys a Boltzmann distribution. The potential is dependent on the cross-linker length and on the nature of the residues involved, thus defining a residue- and linker-dependent force-field for structural modeling with XLMS distance constraints. We implemented the force-field in the Rosetta ab initio protocol (Simons et al., 1999; Bonneau et al., 2001, 2002; Bradley et al., 2005; Raman et al., 2009) and demonstrate that this statistical force-field significantly increases the probability of obtaining native-like tertiary structures compared to current approaches to represent the constraints. Although here we focus on the more challenging problem of tertiary protein structure determination, the principles described here find application in other structural modeling goals, including the determination of general protein assemblies.

2. Approach

Chemical cross-linking is an experimental method to obtain structural information from a chemical modification of the protein with a reagent called cross-linker, or simply linker. If a residue is found to attach to the linker, it means that the residue is accessible to the solvent in some significantly populated protein conformation in solution. If, additionally, the linker is found to be attached to a pair of residues, A and B, it follows that the reactive atoms Ax and By are closer to each other than the length of the cross-linker spacer arm, LXL. The linker works as a molecular ruler over the protein surface. Thus, when measuring the distance between Ax and By, d(Ax,By), one should consider the physical path between them, dtop(Ax,By), where the subscript top stands for “topological” distance, which we define here as the shortest path physically accessible to the linker connecting the reactive atoms (see Figure 1).

Usually, and for every case which will be discussed here, the linker reactivity is associated with a side-chain chemical group, for example, the amine group of Lysine residues or the carboxylate group of acidic side chains. Therefore, in principle, the observation of a cross-link is associated with these side-chain atoms being within the linker length, dtop(Ax,By) ≤ LXL. Since structural models are static and side chains exposed to solvent are frequently mobile, the distance between side-chain atoms in a model reflects poorly the possibility of a cross-link.

Backbone and Cβ atoms provide more stable reference positions for the introduction of constraining potentials. Here, we define a statistical force-field based on the probability of the reactive atoms of the side chain being at a cross-linkable topological distance given the Euclidean distance between corresponding Cβ atoms. This statistical force-field considers implicitly the flexibility of the side chains and, by being based on Euclidean distances, is practical to use.

The maximum topological distance between Cβ atoms consistent with the formation of a cross-link is Lmax=LXL+LA+LB, where LXL is the maximum linker length and LA and LB are the lengths of the side chains of the residues involved. If the topological distance between Cβ atoms, dtop(ACβ,BCβ), is smaller than Lmax, residues A and B may form a cross-link, since the side-chains can potentially fluctuate to assume conformations compatible with the topological path along the surface associated with dtop(ACβ,BCβ). To a first approximation, using dtop(ACβ,BCβ) and Lmax to constrain residue distances can be considered a strategy to incorporate the side-chain flexibility into the modeling procedure.

However, Lmax, when implemented as a Euclidean distance constraint, represents an unlikely scenario, in which the linker and both side chains are in their fully extended conformations. Intuitively, constraining Cβ atom distances to something smaller than Lmax should be a good strategy in most cases. In this work, we first propose that an effective Lmax can be assessed by statistical analysis of known protein structures. We compute the frequency distribution of dtop(ACβ,BCβ) in a protein database, under the condition that dtop(Ax,By) ≤ LXL. By eliminating the most unlikely (1%) scenarios, we define a more restrained statistical distance-cutoff, Lmax(0.99).
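The selection of this cutoff can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `statistical_limit`, the nearest-rank quantile convention, and the toy distances are hypothetical and are not taken from the manuscript, which does not specify how the 99% quantile was computed.

```python
import math

def statistical_limit(d_top_reactive, d_top_cbeta, linker_length, quantile=0.99):
    """Return the C-beta topological distance below which `quantile` of the
    cross-linkable pairs fall (nearest-rank quantile; an assumed convention).

    d_top_reactive: topological distances between reactive atoms, d_top(Ax,By)
    d_top_cbeta:    topological distances between the matching C-beta atoms
    linker_length:  spacer-arm length L_XL, in Angstroms
    """
    # Keep only pairs whose reactive atoms are within reach of the linker.
    selected = sorted(
        cb for rx, cb in zip(d_top_reactive, d_top_cbeta) if rx <= linker_length
    )
    if not selected:
        raise ValueError("no pair satisfies d_top(Ax,By) <= linker_length")
    # Nearest rank: smallest value with at least `quantile` of the data at or below it.
    rank = max(1, math.ceil(quantile * len(selected)))
    return selected[rank - 1]

# Toy data (hypothetical values, in Angstroms), not taken from any database.
reactive = [5.0, 9.0, 11.0, 14.0, 8.0]
cbeta = [9.0, 13.0, 16.5, 21.0, 12.0]
limit = statistical_limit(reactive, cbeta, linker_length=11.5)
```

With a real structural database, the two input lists would hold one entry per residue pair of the chosen types, and the returned value would play the role of Lmax(0.99).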

The statistical analysis above can be further refined for the establishment of a distance-dependent force field for XLMS constraints. Imagine that a pair of Cβ atoms from residues A and B are found at a Euclidean distance deuc(ACβ,BCβ). This distance is associated with a topological distance, dtop(ACβ,BCβ), as defined above. Given a database of known protein structures, we ask what is the probability that the topological distance dtop(ACβ,BCβ) is smaller than Lmax(0.99) given the Euclidean distance deuc(ACβ,BCβ), that is, p[(dtop(ACβ,BCβ)<Lmax(0.99))|deuc(ACβ,BCβ)]. The potential energy that would imply this probability distribution, assuming Boltzmann sampling, is

V(deuc) = -RT ln p[(dtop(ACβ,BCβ)<Lmax(0.99))|deuc(ACβ,BCβ)], (1)

where, at room temperature, RT=0.569 kcal mol-1. This potential can be directly incorporated into most modeling procedures, as it is dependent on the Euclidean distances between Cβ atoms and on the Lmax(0.99) from the structural database. Section 3.1 and Supporting Information S1 describe the details of the parameterization and implementation of this potential energy function.
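As a small worked illustration of Equation 1, the sketch below converts probabilities into energies. The function name `boltzmann_potential` and the treatment of a zero probability as an infinite energy are assumptions made for this example; the manuscript does not state how empty bins are handled.

```python
import math

RT_KCAL_MOL = 0.569  # value of RT quoted in the text, in kcal/mol

def boltzmann_potential(probability, rt=RT_KCAL_MOL):
    """Convert a probability into an energy, V = -RT ln p (Equation 1)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability == 0.0:
        # Assumption: an event never observed receives an infinite penalty.
        return math.inf
    return -rt * math.log(probability)

# Energies (kcal/mol) for a certain, an even-odds, and a rare cross-linkable pair.
energies = [boltzmann_potential(p) for p in (1.0, 0.5, 0.01)]
```

A probability of 1 gives zero energy, and the penalty grows logarithmically as the cross-linkable topological path becomes less likely.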

The statistical force-field was implemented in the Rosetta ab initio protocol and proved to be superior in terms of modeling quality to current state-of-the-art approaches for XLMS constraint representation, as we will show. Modeling details are available in Supporting Information S2. All modeling results, including input and output raw files, are available at http://m3g.iqm.unicamp.br/topolink/xlff. Each modeling round consisted of generating 5,000 models with Rosetta. We evaluated the quality of the models by the distribution of the structural similarity of the models to the crystallographic structure, as given by the TM-score metric computed with LovoAlign (Andreani et al., 2009). Structures with TM-scores greater than 0.5 relative to the crystallographic structure are considered to have roughly the correct fold (Xu and Zhang, 2010). Structures with TM-scores greater than 0.6 are likely winner candidates at the CASP modeling competitions.

3. Results

3.1. Parametrization of the statistical force field

Figures 2 and 3 exemplify the construction of the statistical force-field for a pair of reactive residues. In Figure 2A we display the frequency of observation of topological distances between Lys Nζ atoms in the CATH database of non-redundant domains. The subset of pairs of Lys residues for which the Nζ atoms are within the length of the linker molecule (11.5 Å) was shortlisted, and the distribution of Cβ distances for these pairs was obtained, as shown in Figure 2B. Within the subset of pairs of Lys residues for which the Nζ atoms are within 11.5 Å, 99% of the Cβ atom pairs are closer than 17.8 Å. Therefore, we consider 17.8 Å the maximum effective distance between Cβ atoms for this linker. This maximum effective distance will be named the Statistical Limit of the linker.

Then, we compute the probability that the Cβ atoms of the Lys (K) residues are within a topological distance of 17.8 Å, as a function of their Euclidean distance, as shown in Figure 3A. This probability shows, for example, that if the Euclidean distance between Cβ atoms of K residues is greater than ~14 Å, there is at most a 50% probability that a topological path connecting these residues exists within the reactive distance.
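The probability curve described above can be estimated by binning residue pairs according to their Euclidean distance. The sketch below is a minimal illustration: the function name `conditional_probability`, the 2 Å bin width, and the toy distances are assumptions for this example and do not reproduce the binning actually used for Figure 3A.

```python
from collections import defaultdict

def conditional_probability(d_euc, d_top, limit, bin_width=1.0):
    """Estimate p[d_top < limit | d_euc] on a grid of Euclidean-distance bins.

    Returns a dict mapping the lower edge of each populated bin to the fraction
    of pairs in that bin whose topological distance is below `limit`.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [satisfied, total]
    for euc, top in zip(d_euc, d_top):
        index = int(euc // bin_width)
        counts[index][1] += 1
        if top < limit:
            counts[index][0] += 1
    return {i * bin_width: ok / total for i, (ok, total) in sorted(counts.items())}

# Toy C-beta distances (hypothetical values, in Angstroms).
euclidean = [6.2, 6.8, 13.1, 13.5, 13.9, 14.4, 19.0]
topological = [7.0, 9.5, 15.0, 19.2, 17.0, 21.0, 30.0]
profile = conditional_probability(euclidean, topological, limit=17.8, bin_width=2.0)
```

Each value of `profile` can then be passed through Equation 1 to obtain the energy assigned to that distance bin.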

This probability distribution is translated, according to Equation 1, into a statistical potential, which is represented in Figure 3B. The potential energy increases with the Euclidean distance between Cβ atoms at all distances, and the increase is particularly noticeable above ~12 Å. Therefore, effectively, the force field penalizes distances which are greater than ~12 Å.

In Figure 3C, the potential energy profiles for the pairs KK, KS, and SS, obtained with the same protocol, are shown. As expected, for shorter side-chains, the potential energy increases at shorter Cβ-Cβ distances. This reflects the fact that, for example, linkers of the same length may bind Lys residues at larger Cβ-Cβ distances than Ser residues, as a result of the difference between side chain lengths. The exact profile of the potential is dependent not only on the lengths of the side chains, but also on the nature of their interactions with the surface of the proteins, and these are implicitly taken into account in the present approach because the profiles are obtained from the structural database. The profiles of the potential energies of all other reactive pairs of residues are shown in Figure S3 and available for download.

3.2 Modeling performance

3.2.1. Overview of previously attempted cross-linking representation strategies

Different interaction potentials have been proposed for the use of XLMS-derived constraints in protein modeling protocols. Kahraman and collaborators (Kahraman et al., 2013) have proposed using a flat harmonic potential to integrate cross-linking constraint data into de novo and comparative protein structure prediction. The flat harmonic potential penalizes models in which the Euclidean distance between two atoms exceeds an upper distance limit, UL (see Figure S4). In that work, the UL was chosen as 30 Å for all constraints associated with the same linker (DSS/BS3). Belsom and collaborators (Belsom et al., 2016) have proposed a Lorentz-like potential. As shown in Figure S4, instead of penalizing models having Euclidean distances above a certain UL, the Lorentz potential rewards models for which Euclidean distances are below a threshold, that is, in which it is believed that the cross-link should be satisfied. Above this limit, there is a progressive decrease of the energy bonus to zero. This potential is argued to be more tolerant to the presence of incorrect constraints. The Serum Albumin domains were modeled using this function, but in combination with contact prediction constraints from evolutionary information. Therefore, the specific role of the XLMS constraints in model quality was not addressed.

Merkley and collaborators (Merkley et al., 2014) have proposed a justification for this UL=30 Å for cross-links between Lysine pairs: the correlation between Cα Euclidean distances in a set of crystal structures and molecular dynamics simulations, and a set of experimental cross-linking data for cytochrome C, showed that the experimental information of cross-linking data is often not represented by a single conformational state. It is then proposed that using larger constraint distances would account for the conformational flexibility of the structure. This concept of adding some threshold to the extended conformation of the linker to account for structural variability has guided much of the XL modeling and validation strategies (Kahraman et al., 2013; Fritzsche et al., 2012; Kalisman et al., 2012; Herzog et al., 2012; Chavez et al., 2016, 2018). Alternatively, the structural variability can be addressed by the proposal of multiple models, without sacrificing the precision of the XLMS ruler (Degiacomi et al., 2017).

All these proposals were evaluated here in the context of modeling with Rosetta’s ab initio relax protocol (Simons et al., 1997, 1999; Bonneau et al., 2001, 2002; Bradley et al., 2005; Raman et al., 2009). Four different targets were chosen: SalBIII (Luhavaya et al., 2015), a 15.6 kDa protein with a low sequence similarity to other proteins in the Protein Data Bank; and the three domains of Albumin (Sugio et al., 1999) (ALB-D1, ALB-D2 and ALB-D3), which have been standard examples in cross-linking experiments (Belsom et al., 2016; Huang et al., 2004; Fischer et al., 2013). In this section, we will describe the modeling performed with ideal cross-linking data sets, computed from the crystallographic models with Topolink (Martinez et al., 2017). The SalBIII set contains 62 constraints compatible with the crystallographic structure. For ALB-D1, ALB-D2, and ALB-D3, the sets contained 125, 153 and 92 constraints, respectively. Section 3.3 describes the modeling results obtained with more limited experimental sets of constraints.

Modeling was performed without constraints and with three different penalization choices previously used to represent cross-linking constraints: the flat harmonic potential, the linear penalization, and the Lorentz-like potential. The upper limit distances considered were: (i) UL=25 Å between Cβ atoms, assuming that the UL incorporates the conformational diversity of the structure; (ii) a slightly more restrained UL=20 Å between Cβ atoms, which roughly represents the extended length of a link formed by the DSS linker bound to two Lys residues; (iii) UL=Extended length, in which each constraint has a UL that is the sum of the lengths of the side chains and of the linker, and which varies according to the linker and side chains involved; and, finally, (iv) UL=Lmax(0.99), the Statistical Limit, which is computed for each pair of residue types independently.
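As a rough illustration of how the choice of UL changes the penalty, the sketch below evaluates simplified flat-bottom harmonic and linear penalties. The functional forms, the `width` and `slope` parameters, and the 22 Å test distance are assumptions made for this example; they are not the exact functions used in Rosetta or in the cited works (see Figure S4 for those).

```python
def flat_harmonic_penalty(distance, upper_limit, width=1.0):
    """Zero up to `upper_limit`, quadratic beyond it (simplified form)."""
    excess = max(0.0, distance - upper_limit)
    return (excess / width) ** 2

def linear_penalty(distance, upper_limit, slope=1.0):
    """Zero up to `upper_limit`, linear beyond it (simplified form)."""
    return slope * max(0.0, distance - upper_limit)

# Upper limits (Angstroms) discussed in the text; 17.8 is the Lys-Lys Statistical Limit.
upper_limits = {"UL=25": 25.0, "UL=20": 20.0, "statistical": 17.8}
distance = 22.0  # hypothetical C-beta separation in a candidate model

harmonic = {name: flat_harmonic_penalty(distance, ul) for name, ul in upper_limits.items()}
linear = {name: linear_penalty(distance, ul) for name, ul in upper_limits.items()}
```

In this toy case the model at 22 Å is not penalized at all with UL=25 Å, while the tighter limits do penalize it, which is the kind of information loss that overly generous upper limits can cause.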

3.2.2. The statistical upper limit improves significantly the quality of the models

The distributions of model quality obtained using the different representations of the constraints and upper limits are shown in Figure 4 for the SalBIII protein. Distributions obtained without constraints are repeated in each graph as a reference. The left-side graphs display the quality distributions of all 5,000 models, while the right panels show the fraction of models with fold-level accuracy that are obtained for subsets of the models as classified by percentiles of their Rosetta energy scores. The results are summarized in Table 2.
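The percentile analysis described above amounts to ranking the models by energy and counting how many of the best-scored ones exceed the TM-score threshold. A minimal sketch follows, assuming that lower scores are better and using a hypothetical helper `native_like_fraction` with toy values that are not results from this work.

```python
def native_like_fraction(tm_scores, energies, best_fraction=1.0, tm_cutoff=0.5):
    """Fraction of models with TM-score > tm_cutoff among the `best_fraction`
    lowest-energy models (best_fraction=1.0 uses every model)."""
    if not tm_scores or len(tm_scores) != len(energies):
        raise ValueError("need two non-empty lists of equal length")
    if not 0.0 < best_fraction <= 1.0:
        raise ValueError("best_fraction must lie in (0, 1]")
    ranked = sorted(zip(energies, tm_scores))  # lowest (best) energy first
    keep = max(1, int(best_fraction * len(ranked)))
    selected = ranked[:keep]
    return sum(tm > tm_cutoff for _, tm in selected) / len(selected)

# Toy set of ten models (hypothetical TM-scores and energies).
tm = [0.62, 0.55, 0.31, 0.28, 0.48, 0.35, 0.22, 0.51, 0.30, 0.26]
energy = [-210, -205, -180, -150, -190, -170, -140, -200, -160, -155]
overall = native_like_fraction(tm, energy)
top_20_percent = native_like_fraction(tm, energy, best_fraction=0.2)
```

When the score correlates with model quality, as in this toy set, the native-like fraction rises as the selection is restricted to the best-scored models.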

None of the models obtained without constraints has a TM-score greater than 0.5, thus confirming that experimental constraints are essential for the modeling of this protein. The flat harmonic or linear energy functions with UL=25 Å perform as badly as the modeling without constraints. Decreasing the UL distance, however, dramatically improves the overall quality of the models. Using the statistical limits derived here, it is possible to obtain 10 and 14% of native-like structures using the linear and flat harmonic energy functions, respectively. Notably, by selecting the 10% best-scored models, the native-like populations increase to 63 and 72%. Therefore, using a more constrained potential significantly increases the quality of the models obtained. Here, this choice is justified by the statistical analysis of cross-linkable pairs in known protein structures. For example, we know that Lmax=17.8 Å is too restrictive for only 1% of the cross-links between Lysine residues.

The use of the Lorentz potential did not result in any improvement of the models relative to modeling without constraints, independently of the ULs used. This is likely a consequence of the Lorentz potential having a null gradient at almost every distance (see Figure S4). Therefore, in gradient-dependent modeling strategies, like the ones used by Rosetta, this potential could only affect the selection of the models by their final energy. In other words, this potential might be useful for modeling using Monte-Carlo sampling methods alone, but any advantage that might be gained by using gradient information is lost.

Similar results were obtained for all three Albumin domains, and are reported in Supplementary Information Figure S5 and Table S2.

3.2.3. The statistical force-field optimally weights constraint penalties

Finally, we used the complete statistical force-field (XLFF) to model protein domains. The statistical weights introduced markedly improved the modeling results: Figure 5A compares the quality of the models obtained with each of the energy functions described in the previous section with those obtained with the XLFF force-field. We compare the different representations of the penalty function using the best upper distance limit, defined by the statistical analysis described in the previous section. Using the statistical potential, 27% of all models are native-like structures, which is almost double and triple the fractions of native-like structures obtained with the flat harmonic and linear energy functions, respectively. Furthermore, 94% of the 10% best Rosetta-scored models are native-like structures (Figure 5B). For Albumin domains, the fraction of native-fold models increases from the best results obtained previously (using the flat harmonic potential with statistical limits): from 15% to 29% for ALB-D1, from 44% to 74% for ALB-D2, and from 40% to 70% for ALB-D3 (Supplementary Figure S6 and Table S2).

Finally, beyond being crucial to sampling, the potential energy introduced by the force-field helps to rank the models, as compared with the Rosetta score alone, as shown in Figure 5C. In fact, for SalBIII, the energy of constraints is almost as effective as the composed score function (Rosetta + XL score) to differentiate models.

In summary: (i) the extended linker lengths are unnecessarily large, leading to information loss, independently of the potential energy function used; (ii) restricting UL to statistically relevant distances, Lmax(0.99), improves significantly the probability of generating native-like structures; and (iii) the statistical force-field, V(Lmax, deuc), is the best strategy to implement XLMS-derived constraints in modeling.

3.3. Modeling with experimental XLMS constraints

In the previous sections, we evaluated the performance of protein tertiary structure modeling from the ideal perspective of having identified all potential cross-links consistent with the crystallographic structure.

We now discuss the modeling results obtained using experimental data. In this work, we consider the use of XLMS constraints only. Evidently, the present approach can be, and should be, combined with other sources of data if available.

We performed XLMS experiments using DSS and the Xplex chemistry (Fioramonte et al., 2018) on SalBIII and Serum Albumin domains (Supplementary Information S2) and analyzed the resulting MS raw files with SIM-XL (Borges et al., 2015; Lima et al., 2015) to obtain a list of cross-link candidates. The set of cross-links obtained experimentally which is consistent with the crystallographic structure contains 25 to 35% of the ideal cross-link set (Supplementary Information S2 and Table S1). This reduced set of cross-links was used for structural modeling.

Figure 6A shows, as expected, that reducing the number of constraints worsens the TM-score distribution of the models obtained by Rosetta (red) relative to the ideal set. Nevertheless, there is a significant improvement relative to modeling without constraints: 3% of all models have TM-scores greater than 0.5, and within the 10% best-scored models, 20% display native-like folds. These results are similar to those obtained for SalBIII using the ideal constraint set and the flat harmonic or linear penalization functions, except if the statistical limit we propose is used (Table 2). In other words, the improvement in modeling obtained with the XLFF force-field and experimental data is comparable to what would be obtained from an ideal experiment using the previous constraint representation strategies. Similar results for Albumin domains are shown in Supplementary Information Figure S7 and Table S2.
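The fractions quoted above (models with TM-score greater than 0.5, over all models or over the best-scored subset) can be tallied with a few lines of code. The following is a minimal sketch, assuming each model is available as a (score, TM-score) pair in which a lower score is better; the function name and the data layout are illustrative assumptions and are not part of the Rosetta protocol used in this work.

```python
def native_like_fraction(models, top_fraction=1.0, threshold=0.5):
    """Fraction of the best-scored models whose TM-score exceeds `threshold`.

    models: list of (score, tm_score) tuples; a lower score is assumed better.
    top_fraction: fraction of best-scored models to keep (1.0 keeps all).
    """
    if not models:
        raise ValueError("no models given")
    # Rank models from best (lowest) to worst (highest) score.
    ranked = sorted(models, key=lambda m: m[0])
    # Keep at least one model, even for very small fractions.
    n_keep = max(1, int(round(top_fraction * len(ranked))))
    subset = ranked[:n_keep]
    hits = sum(1 for _, tm in subset if tm > threshold)
    return hits / len(subset)
```

Calling the function with `top_fraction=1.0`, `0.5` and `0.1` gives the three kinds of fractions reported per row in Table 2.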

4. Conclusions

A force-field designed for modeling XLMS constraints was developed. The potential energy functions representing the constraints are obtained from the statistics of physically-accessible distances between residues in the database of known protein structures. The potential energy function depends on the Euclidean distance between residues and on the structural properties of the linker, and associates more favorable interactions with pairs of residues that are more likely to be connected by a valid cross-linking path. The force-field was implemented in the Rosetta modeling suite and substantially improves the quality of the models obtained. These results bring to reality the possibility of modeling, from XLMS constraints, the tertiary structures of proteins for which other structural data are not available or are insufficient for characterizing the protein fold.

Acknowledgements

We thank FAPESP (Grants 2010/16947-9, 2013/05475-7, 2013/08293-7, 2014/17264-3 and 2016/13195-2) and CNPq (Grant 470374/2013-6) for financial support.

References

Andreani,R. et al. (2009) Low Order-Value Optimization and applications. J. Glob. Optim., 43, 1–22.

Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.

Bateman,A. et al. (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169.

Belsom,A. et al. (2016) Serum Albumin Domain Structures in Human Blood Serum by Mass Spectrometry and Computational Biology. Mol. Cell. Proteomics MCP, 15, 1105–1116.

Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.

Bonneau,R. et al. (2002) De Novo Prediction of Three-dimensional Structures for Major Protein Families. J. Mol. Biol., 322, 65–78.

Bonneau,R. et al. (2001) Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins Struct. Funct. Bioinforma., 45, 119–126.

Borges,D. et al. (2015) Using SIM-XL to identify and annotate cross-linked peptides analyzed by mass spectrometry. Protoc. Exch.

Bowers,P.M. et al. (2000) De novo protein structure determination using sparse NMR data. J. Biomol. NMR, 18, 311–318.

Bradley,P. et al. (2005) Toward High-Resolution de Novo Structure Prediction for Small Proteins. Science, 309, 1868–1871.

Brodie,N.I. et al. (2017) Solving protein structures using short-distance cross-linking constraints as a guide for discrete molecular dynamics simulations. Sci. Adv., 3, e1700479.

Chavez,J.D. et al. (2018) Chemical Crosslinking Mass Spectrometry Analysis of Protein Conformations and Supercomplexes in Heart Tissue. Cell Syst., 6, 136–141.e5.

Chavez,J.D. et al. (2016) In Vivo Conformational Dynamics of Hsp90 and Its Interactors. Cell Chem. Biol., 23, 716–726.

Degiacomi,M.T. et al. (2017) Accommodating Protein Dynamics in the Modeling of Chemical Crosslinks. Structure, 25, 1751–1757.e5.

DiMaio,F. et al. (2015) Atomic-accuracy models from 4.5-Å cryo-electron microscopy data with density-guided iterative local refinement. Nat. Methods, 12, 361–365.

Eswar,N. et al. (2006) Comparative protein structure modeling using Modeller. Curr. Protoc. Bioinforma., Chapter 5, Unit-5.6.

Fioramonte,M. et al. (2018) XPlex: an effective, multiplex cross-linking chemistry for acidic residues. Anal. Chem.

Fischer,L. et al. (2013) Quantitative cross-linking/mass spectrometry using isotope-labelled cross-linkers. J. Proteomics, 88, 120–128.

Fiser,A. (2010) Template-based protein structure modeling. Methods Mol. Biol. Clifton NJ, 673, 73–94.

Fritzsche,R. et al. (2012) Optimizing the enrichment of cross-linked products for mass spectrometric protein analysis. Rapid Commun. Mass Spectrom., 26, 653–658.

Götze,M. et al. (2012) StavroX--a software for analyzing crosslinked products in protein interaction studies. J. Am. Soc. Mass Spectrom., 23, 76–87.

Herzog,F. et al. (2012) Structural Probing of a Protein Phosphatase 2A Network by Chemical Cross-Linking and Mass Spectrometry. Science, 337, 1348–1352.

Hoopmann,M.R. et al. (2015) Kojak: Efficient Analysis of Chemically Cross-Linked Protein Complexes. J. Proteome Res., 14, 2190–2198.

Huang,B.X. et al. (2004) Probing three-dimensional structure of bovine serum albumin by chemical cross-linking and mass spectrometry. J. Am. Soc. Mass Spectrom., 15, 1237–1247.

Iacobucci,C. and Sinz,A. (2017) To Be or Not to Be? Five Guidelines to Avoid Misassignments in Cross-Linking/Mass Spectrometry. Anal. Chem., 89, 7832–7835.

Kahraman,A. et al. (2013) Cross-Link Guided Molecular Modeling with ROSETTA. PLOS ONE, 8, e73411.

Kalisman,N. et al. (2012) Subunit order of eukaryotic TRiC/CCT chaperonin by cross-linking, mass spectrometry, and combinatorial homology modeling. Proc. Natl. Acad. Sci., 109, 2884–2889.

Kosinski,J. et al. (2015) Xlink Analyzer: Software for analysis and visualization of cross-linking data in the context of three-dimensional structures. J. Struct. Biol., 189, 177–183.

Lima,D.B. et al. (2015) SIM-XL: A powerful and user-friendly tool for peptide cross-linking analysis. J. Proteomics, 129, 51–55.

Luhavaya,H. et al. (2015) Enzymology of Pyran Ring A Formation in Salinomycin Biosynthesis. Angew. Chem. Int. Ed., 54, 13622–13625.

Martinez,L. et al. (2017) TopoLink: A software to validate structural models using chemical crosslinking constraints. Protoc. Exch.

Merkley,E.D. et al. (2014) Distance restraints from crosslinking mass spectrometry: Mining a molecular dynamics simulation database to evaluate lysine–lysine distances. Protein Sci. Publ. Protein Soc., 23, 747–759.

Ovchinnikov,S. et al. (2017) Protein structure determination using metagenome sequence data. Science, 355, 294–298.

Ovchinnikov,S. et al. (2016) Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta. Proteins Struct. Funct. Bioinforma., 84, 67–75.

Pundir,S. et al. (2017) UniProt Protein Knowledgebase. In, Protein Bioinformatics, Methods in Molecular Biology. Humana Press, New York, NY, pp. 41–55.

Raman,S. et al. (2009) Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins, 77, 89–99.

Santos,D. et al. (2018) Enhancing protein fold determination by exploring the complementary information of chemical cross-linking and coevolutionary signals. Bioinformatics, bty074.

Sarpe,V. et al. (2016) High Sensitivity Crosslink Detection Coupled With Integrative Structure Modeling in the Mass Spec Studio. Mol. Cell. Proteomics MCP, 15, 3071–3080.

Schneider,M. et al. (2016) Blind testing of cross-linking/mass spectrometry hybrid methods in CASP11. Proteins, 84, 152–163.

Schneidman-Duhovny,D. et al. (2012) Integrative structural modeling with small angle X-ray scattering profiles. BMC Struct. Biol., 12, 17–17.

Simons,K.T. et al. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J. Mol. Biol., 268, 209–225.

Simons,K.T. et al. (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins Struct. Funct. Bioinforma., 34, 82–95.

Song,Y. et al. (2013) High-Resolution Comparative Modeling with RosettaCM. Structure, 21, 1735–1742.

Sugio,S. et al. (1999) Crystal structure of human serum albumin at 2.5 Å resolution. Protein Eng. Des. Sel., 12, 439–446.

Tamò,G.E. et al. (2017) Assessment of data-assisted prediction by inclusion of crosslinking/mass-spectrometry and small angle X-ray scattering data in the 12th Critical Assessment of protein Structure Prediction experiment. Proteins.

Tang,Y. et al. (2015) Protein structure determination by combining sparse NMR data with evolutionary couplings. Nat. Methods, 12, 751–754.

Thompson,J.M. et al. (2012) Accurate protein structure modeling using sparse NMR data and homologous structure information. Proc. Natl. Acad. Sci., 109, 9875–9880.

Xu,J. and Zhang,Y. (2010) How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics, 26, 889–895.


Figure 1: Cross-linking information. The identification of a cross-link between two residues (yellow line) implies that the distance between reactive atoms, d(Ax,By), is smaller than the extended linker length, LXL. However, a fixed backbone structure cannot represent all cross-linkable side chain configurations. As indicated by the red lines connecting Cβ atoms, alternative side chain configurations for a single fixed backbone can potentially validate four other cross-links. Therefore, at least the variability of side chain orientations should be taken into account to define the effective maximum distance, Lmax, between residues that might be cross-linked.


Figure 2: Statistics-based definition of Lmax. (A) After computing the topological distance between Nζ atom pairs, we selected the subset of pairs with distances shorter than the linker length, 11.5 Å. (B) Next, we selected the subset of topological distances between Cβ atom pairs which had the corresponding reactive atoms in the previous subset. The topological distance distribution reveals that distances corresponding to side chains and linker in extended conformations (~22 Å) are never observed. After removing unlikely scenarios (1%), we define the cross-linkable distance for Lysine pairs and the DSS/BS3 cross-linker as Lmax(0.99) = 17.8 Å (vertical dashed line), increasing the restrictive role of the constraint by more than 4 Å. Similar profiles for other residue pairs are shown in Supplementary Figure S1.


Table 1: Extended and statistical (Lmax) distances for cross-linked residue pairs and linkers. The effective maximum distances that account for 99% of the possible cross-links are significantly more restrictive than the maximum linker lengths. Extended conformations are not frequently observed.

Linker               Cross-link ID   Extended distance / Å   Lmax(0.99)* / Å
BS3/DSS              KK              21.8                    17.8
                     KS              18.0                    15.8
                     SS              14.1                    13.4
1,6-hexanediamine    DD              14.1                    13.5
                     DE              15.4                    14.3
                     EE              16.7                    15.1
zero-length          KE              -                       10.5
                     KD              -                       9.7
                     SE              -                       7.7
                     SD              -                       7.0

*statistically-derived topological Cβ-Cβ distance / Å


Figure 3: Statistical force-field determination. (A) Probability that a topological distance is below Lmax(0.99) = 17.8 Å as a function of the Euclidean distance between Cβ atoms of Lys/Lys pairs, as a representation of the DSS/BS3 cross-linker. As the Euclidean distance approaches Lmax, the probability of satisfying the topological length decreases because fewer physical paths connecting the residues are possible. (B) A potential energy curve can be derived from (A) assuming that this probability distribution is a Boltzmann distribution. (C) For each pair of residue types a different energy function is derived. Here, we show the energy functions for KK, KS and SS pairs assuming a DSS/BS3 cross-linker. Potential energy profiles for other residue pairs and linkers are shown in Supplementary Figure S3.


Figure 4: Performance of cross-linking energy functions in the modeling of the SalBIII structure with the Rosetta ab initio relax protocol. For the linear and flat harmonic energy functions, the generation of native-like structures improves as more restrictive distance constraint limits are applied. The Lorentz energy function produces the same distributions as the modeling without constraints, our negative control. There is a significant improvement in the quality of the models obtained when using the statistical upper distance limit proposed here, which is justified because this limit is too restrictive for, on average, only 1% of the expected cross-links. Similar results for Albumin domains are shown in Figure S5.


Table 2: Evaluation of the population of SalBIII models with fold-level accuracy generated using the different energy functions available to represent XLMS constraints. The results obtained with the statistical upper limit and the full statistical potential (XLFF) developed in this work are highlighted. Using statistical upper limits significantly improves the quality of the models obtained, even when using the flat harmonic or linear penalization functions, but the proper statistical representation of the functional form of the constraint energy improves the quality of the models even further. The modeling obtained with XLFF and the experimental constraint set is comparable to that of the best previous energy functions using the ideal constraint set. Similar results for Albumin domains are shown in Supplementary Table S2.

Fraction of models with TM-score > 0.5

Constraint set   Energy function   UL                  All models   50% models with      10% models with
                                                                    best Rosetta score   best Rosetta score
-                no constraints    -                   0.00         0.00                 0.00
Ideal            flat harmonic     25 Å                0.00         0.00                 0.00
                                   20 Å                0.02         0.04                 0.12
                                   Extended length     0.04         0.07                 0.25
                                   Statistical limit   0.14         0.27                 0.72
                 linear            25 Å                0.00         0.00                 0.03
                                   20 Å                0.02         0.04                 0.16
                                   Extended length     0.03         0.07                 0.24
                                   Statistical limit   0.10         0.20                 0.63
                 Lorentz           25 Å                0.00         0.00                 0.00
                                   20 Å                0.00         0.00                 0.00
                                   Extended length     0.00         0.00                 0.00
                                   Statistical limit   0.00         0.00                 0.01
                 XLFF              Statistical limit   0.27         0.56                 0.94
Experimental     XLFF              Statistical limit   0.03         0.06                 0.20


Figure 5: The XL statistical force-field in comparison to other energy functions to model the SalBIII structure. The initial distribution of model quality (A) and the selection of best-scored models (B) show that the XLMS force field outperforms other functional forms, even if those use the statistical upper limit. (C) The XL constraint energy contributes to the classification of the models. Similar results for Albumin domains are shown in Supplementary Figure S6.


Figure 6: Modeling the SalBIII protein with the XLMS force field and experimental constraints. (A) Using experimental constraints, 3% of the models obtained using the current modeling protocol achieve fold-level accuracy. (B) If the 10% best-scored models are selected in terms of their combined XL and Rosetta energies, a subset of models in which ~20% have fold-level accuracy is obtained. This result is comparable with those obtained using the ideal constraint set and flat harmonic or linear penalization functions (Table 2). Similar results for Albumin domains are shown in Figure S7.


Supporting information

Statistical force-field for structural modeling using chemical cross-linking/mass spectrometry distance constraints

Allan J R Ferrari1, Fabio C Gozzo1, Leandro Martínez1,2

1Institute of Chemistry, University of Campinas, Campinas, SP, Brazil and 2Center for

Computational Engineering & Sciences, University of Campinas, Campinas, SP, Brazil.


S1. Computational Methods

We have recently developed a software package, TopoLink (http://m3g.iqm.unicamp.br/topolink) (Martinez et al., 2017), which evaluates the consistency of the experimental constraints derived from XLMS with structural models and computes, from the potential reactivity of the residues on the protein structure, the experimental constraints one should expect. TopoLink (version 17.332) was used to derive the statistics of possible cross-link formation in the non-redundant protein domains of the CATH database (Sillitoe et al., 2015; Lam et al., 2016) (S40 v4.1). For each structure, the topological and Euclidean distances between the Cβ atoms and between the reactive atoms were computed for every pair of residues of interest. Here, we considered two cross-linkers with different residue specificity, DSS/BS3 and 1,6-hexanediamine, both of which have spacer arms of approximately 11.5 Å. That is, the topological distance between the reactive atoms must be smaller than 11.5 Å for a residue pair to be considered a cross-linkable pair. Given the residue specificity of each cross-linker, the following six pairs of residues were considered: Lys/Lys, Lys/Ser, and Ser/Ser (for DSS/BS3), and Asp/Asp, Asp/Glu, and Glu/Glu (for 1,6-hexanediamine). Additionally, we examined the species formed from zero-length reactions, for which the reactive side chains become directly bonded: Lys/Asp, Lys/Glu, Ser/Asp and Ser/Glu. Here, we exemplify the results for the Lys/Lys pair. The statistics for DSS/BS3, 1,6-hexanediamine, the chemically analogous shorter linkers (BSG and 1,3-propanediamine) and zero-length cross-links, as well as the derived statistical force-field, are available at http://m3g.iqm.unicamp.br/topolink/xlff/.

S1.1 Determination of the maximum effective linker length Lmax

Figure 2 of the main manuscript shows the distribution of topological distances smaller than 11.5 Å (the extended linker length) between Nζ atoms of Lys/Lys pairs and the associated Cβ-Cβ topological distance distribution. For instance, if the Cβ-Cβ topological distance is greater than 22 Å, no side chain conformation allows the Nζ atoms to approach within 11.5 Å, as the sum of the lengths of the side chains is 10.5 Å. The most frequent Cβ-Cβ topological distance associated with Nζ-Nζ topological distances smaller than 11.5 Å is about 7.5 Å, corresponding to Cβ-Cβ topological distances of vicinal residues in α-helices.

The integrated distribution of Figure 2B illustrates the fraction of Nζ-Nζ pairs that can be found at a topological distance smaller than 11.5 Å as a function of the Cβ-Cβ topological distance. There is more than 99% probability that the topological distance between Cβ atoms is smaller than 17.8 Å if the Nζ-Nζ pair satisfies the 11.5 Å cutoff. This means that the effective reach of the linker and side chains is about 17.8 Å instead of the fully-extended 22 Å limit for this pair of residues. This distance is 4.2 Å (~20%) shorter than the extended limit, thus significantly increasing the precision of the structural information provided by the XLMS constraint. This effective limit is defined as Lmax for every pair of reactive residues and linker arm length, in this case that of the DSS/BS3 reagent. This additional restriction plays a key role in improving the modeling results. Figure S1 presents the distribution for all cross-links which have either DSS/BS3 or 1,6-hexanediamine as cross-linkers. A discussion on the zero-length Lmax definition is presented in Figure S2. Table 1 of the main manuscript summarizes the maximum extended linker lengths and the statistically derived limits for each cross-linkable pair considered.
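The selection described in this section can be summarized in a short script. The following is a minimal sketch, assuming the topological distances have already been computed (for example with TopoLink) and are supplied as pairs (reactive-atom distance, Cβ-Cβ distance) in Å; the function name and the input layout are illustrative assumptions, not the TopoLink interface.

```python
import math

def lmax_from_pairs(pairs, linker_length=11.5, coverage=0.99):
    """Smallest Cβ-Cβ topological distance covering `coverage` of cross-linkable pairs.

    pairs: iterable of (d_reactive_top, d_cbeta_top) topological distances in Å.
    A pair is cross-linkable if its reactive-atom distance is below `linker_length`.
    """
    # Keep only the Cβ-Cβ distances of pairs whose reactive atoms can be linked.
    selected = sorted(d_cb for d_react, d_cb in pairs if d_react < linker_length)
    if not selected:
        raise ValueError("no cross-linkable pairs found")
    # Index of the empirical quantile; the small offset guards against
    # floating-point noise in coverage * len(selected).
    k = max(1, math.ceil(coverage * len(selected) - 1e-9))
    return selected[k - 1]
```

With `coverage=0.99` the function discards the 1% longest Cβ-Cβ distances among cross-linkable pairs, which is the role played by Lmax(0.99) in the text.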

S1.2. Development of a statistical force-field

In the previous section, we have shown that an effective maximum linker length can be obtained by analysis of topological distances in a database of protein structures. The restricted maximum lengths significantly improve the precision of the distance constraint by excluding unlikely arrangements of side chains and linker molecules.

Now we will use this effective maximum linker length to create a potential energy

function that depends on the Euclidean distances between Cβ atoms, amenable for computational

modeling. The underlying problem is how to infer the existence of a physically possible

topological distance from the Euclidean distance measured.

If the topological distance between the Cβ atoms is smaller than the effective linker

reach, determined by the procedure above, the XLMS constraint can be satisfied by some side

chain conformation. Intuitively, the shorter the Euclidean distance between Cβ atoms, the greater

the probability that a topological distance will exist corresponding to a valid constraint.

Conversely, as the Cβ-Cβ Euclidean distance increases, the probability of a valid topological

distance between these atoms decreases since physical barriers for the linker are likely to be

present.

We compute, then, from the database of protein structures, the probability that the topological distance between two Cβ atoms is smaller than the effective linker reach, Lmax, as a function of their Euclidean distance: $p[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})]$. This computation consists of evaluating the topological (physically-accessible) distances between every pair of Cβ atoms in all protein structures of the CATH S40 non-homologous protein database and classifying these distances as a function of their Euclidean distances, which are trivial to compute.

Numerically, the probability distribution $p[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})]$ was computed starting from a distance of 3 Å, with 1 Å intervals. That means, for example, that we compute the probability that a topological distance is smaller than Lmax(0.99) given that the Euclidean distance between Cβ atoms is between 3 and 4 Å, and so on up to the maximum effective linker reach Lmax. Specifically, we compute

$$p[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})] = \frac{N[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})]}{N(A/B)},$$

where $N[d_{top}(A_{C\beta}/B_{C\beta}) < L_{max} \mid d_{euc}(A_{C\beta}/B_{C\beta})]$ is the number of topological distances found to be below Lmax(0.99) for each Euclidean distance in the database, and $N(A/B)$ is the number of solvent accessible pairs of residues of type A and B.
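The binning procedure can be sketched as follows. This is a minimal illustration, assuming the input is a list of (Euclidean distance, topological distance) pairs in Å for Cβ atoms of one residue-pair type, with the topological distance set to infinity when no physical path exists. The function name is hypothetical, and the normalization by the number of pairs in each Euclidean-distance bin is an assumption about how N(A/B) is counted.

```python
from collections import defaultdict

def crosslink_probability(pairs, lmax, d_start=3.0, bin_width=1.0):
    """Estimate p(d_top < lmax | d_euc) on bins of width `bin_width` (Å).

    pairs: iterable of (d_euc, d_top) for Cβ-Cβ pairs of one residue-pair type.
    Returns a dict mapping the lower edge of each populated bin to the
    fraction of pairs in that bin with a topological distance below `lmax`.
    """
    total = defaultdict(int)
    valid = defaultdict(int)
    for d_euc, d_top in pairs:
        # Only Euclidean distances from d_start up to lmax are binned.
        if d_euc < d_start or d_euc >= lmax:
            continue
        index = int((d_euc - d_start) // bin_width)
        total[index] += 1
        if d_top < lmax:
            valid[index] += 1
    return {d_start + i * bin_width: valid[i] / total[i] for i in sorted(total)}
```

Each key of the returned dictionary labels one Euclidean-distance interval, for example 3.0 for the 3 to 4 Å bin mentioned in the text.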

In Figure 3A (main manuscript) we show the probability that a topological distance between Cβ atoms is smaller than Lmax as a function of the Euclidean distance, for pairs of Lysine residues and the DSS/BS3 cross-linker (defining Lmax = 17.8 Å). The shorter the Euclidean distance measured between Cβ atoms, the higher the probability of having a cross-linkable physical path between them, as expected. The probability of finding valid topological distances decreases monotonically from roughly 6 Å on, and is about 20% at about 16 Å. This means that two Cβ atoms found at a Euclidean distance of 16 Å have a 20% statistical probability of being connected by a valid topological path. We propose, here, that the potential energy associated with this constraint should depend on this probability. That is, if a pair of Cβ atoms is found at a given Euclidean distance, the potential energy associated with the XLMS constraint should be smaller (more favorable) the greater the statistical probability that the atoms have a valid topological distance between them.

Two independent parameters must be set to complete the potential energy profile: 1) The minimum energy, associated with small distances, must be chosen, corresponding to the energy of the reference state in a thermodynamic potential. 2) A choice must be made concerning the behavior of the potential energy at distances close to and above Lmax, for which statistical data are not available.


Concerning the minimum energy of the constraint, we argue that an equal energy bonus

should be associated to any experimental information obtained from a cross-linking experiment,

independently of the residues involved. We chose that a minimum energy of -2.5 kcal mol-1 is to

be associated with probability 1.0 (small Euclidean distances). With this choice, the potential

energy profiles for cross-link constraints involving Lysine and Serine for LXL = 11.5Å are shown

in Figure 3C.

For distances greater than Lmax, one should expect the potential energy to increase quickly to infinity. This choice depends on the existence, or not, of possible false positives. A faster growth will strongly penalize constraints that are not satisfied. This might in general be desired, except if one wants to develop a force-field which is insensitive to false positives. We have described the potential above Lmax as a linear penalization function of the deviation from Lmax, as shown in Figure 3A. We examined the possibility of using faster increasing functions, but no significant differences in the results were obtained, such that we opted for the simplest functional form. We also examined the possibility of flattening the energy function after Lmax, such that the potential is insensitive to unsatisfied constraints, an alternative which was not effective in the present examples, but which might be useful if the constraint dataset is dominated by false positives. This choice might deserve further investigation depending on the nature of the constraint data available.
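The construction of the energy profile can be illustrated with a short sketch. It assumes a Boltzmann inversion of the binned probability, E = E_min - kT ln p, so that probability 1.0 maps to the minimum energy of -2.5 kcal mol-1, followed by a linear penalty above Lmax. The value of kT, the slope of the penalty, the probability floor and the function name are assumptions made for illustration only; the parameters actually distributed with the force-field should be taken from the XLFF files.

```python
import math

KT = 0.593  # kcal/mol, roughly room temperature (assumed value)

def xl_energy(d_euc, prob_by_bin, lmax, e_min=-2.5, slope=1.0,
              d_start=3.0, bin_width=1.0, p_floor=1e-3):
    """Constraint energy (kcal/mol) for a Cβ-Cβ Euclidean distance d_euc (Å).

    prob_by_bin: dict mapping the lower edge of each 1 Å bin to
                 p(d_top < lmax | d_euc), as estimated from a structure database.
    slope, p_floor: hypothetical parameters (penalty in kcal/mol per Å above
                    lmax, and a floor that avoids log(0) for empty bins).
    """
    def binned_energy(d):
        # Locate the bin containing d and invert its probability.
        edge = d_start + math.floor((d - d_start) / bin_width) * bin_width
        p = max(prob_by_bin.get(edge, p_floor), p_floor)
        return e_min - KT * math.log(p)

    if d_euc < d_start:
        # Very short distances receive the full energy bonus.
        return e_min
    if d_euc <= lmax:
        return binned_energy(d_euc)
    # Above lmax: linear penalization of the deviation from lmax.
    return binned_energy(lmax) + slope * (d_euc - lmax)
```

Replacing the last line by a constant would give the flattened variant discussed above, which ignores unsatisfied constraints.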

S2. Experimental Cross-links

S2.1. Materials

SalBIII protein was obtained as described previously (Luhavaya et al., 2015). The other materials were: Serum Albumin (Sigma Aldrich), 1,6-diaminohexane (Sigma Aldrich), suberic acid bis(N-hydroxysuccinimide ester) (DSS, Sigma Aldrich), 2-(N-morpholino)ethanesulfonic acid (MES, Sigma Aldrich), N-hydroxybenzotriazole (HOBt, Sigma Aldrich), 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide (EDC, Thermo Fisher Scientific), DL-dithiothreitol (DTT, Sigma Aldrich), iodoacetamide (IAA, Sigma Aldrich) and trypsin solution from porcine pancreas (Sigma Aldrich).

S2.2. Cross-linking reaction

S2.2.1 DSS cross-linking reaction

35 µL of a solution of DSS in DMF (1 mg mL-1) was added to 500 µL of protein solution (10 µM) in 200 mM PBS buffer, pH 7 (~200-fold molar excess of cross-linker to protein). The reaction was carried out for two hours at 25 °C and 300 rpm.

S2.2.2 Xplex cross-linking reaction

For the Xplex reaction, all reagents were prepared in 200 mM MES buffer, pH 6. The reaction was conducted by adding to 500 µL of protein solution (10 µM): (i) 10 µL of a solution (500 mM) of 1,6-diaminohexane (~1000-fold molar excess of cross-linker to protein), (ii) 5 µL of a HOBt solution (500 mM) (~500-fold molar excess to protein) and (iii) 20 µL of EDC solution (~2000-fold molar excess to protein). The reaction was carried out for two hours at 25 °C and 300 rpm. Reaction details were reported recently (Fioramonte et al., 2018).
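The molar excesses quoted for the Xplex reagents follow directly from the volumes and concentrations given above. The short sketch below reproduces the arithmetic for 1,6-diaminohexane and HOBt; the helper function is illustrative only.

```python
def molar_excess(reagent_volume_ul, reagent_conc_mm, protein_volume_ul, protein_conc_um):
    """Molar ratio of reagent to protein in the reaction mixture."""
    # µL x mM gives nmol of reagent.
    reagent_nmol = reagent_volume_ul * reagent_conc_mm
    # µL x µM gives pmol of protein; divide by 1000 to convert to nmol.
    protein_nmol = protein_volume_ul * protein_conc_um / 1000.0
    return reagent_nmol / protein_nmol

# 500 µL of 10 µM protein contains 5 nmol of protein.
diamine_excess = molar_excess(10, 500, 500, 10)  # 10 µL of 500 mM 1,6-diaminohexane
hobt_excess = molar_excess(5, 500, 500, 10)      # 5 µL of 500 mM HOBt
```

Here `diamine_excess` evaluates to 1000 and `hobt_excess` to 500, matching the ~1000-fold and ~500-fold excesses stated in the protocol.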

S2.2.3 Reaction follow-up

The samples were incubated at 80 °C for 0.5 h to promote thermal denaturation. After cooling, 20 µL of an IAA solution (1 mg mL-1) was added and the samples were kept in the dark for 0.5 h. Next, a buffer exchange to 100 mM sodium bicarbonate buffer, pH 8, was performed. Samples were then digested overnight by adding trypsin solution at a 1:50 protease/protein ratio (m/m).

S2.3. Protein cross-linking analysis

All cross-linking samples were analyzed in duplicate by nanoLC-nESI-MS/MS using a Dionex UltiMate™ 3000 RSLCnano system coupled online to a Q-Exactive Plus orbitrap mass spectrometer (Thermo Scientific) operating in positive ion mode. Tryptic peptides were desalted/concentrated on self-packed Poros 20 R2 (Applied Biosystems) tip columns, dried using a SpeedVac concentrator (Savant), solubilized in 1% formic acid and sonicated for 10 min in an ultrasonic bath. Peptide concentration was estimated from the absorbance at 280 nm measured on a NanoDrop 2000 spectrophotometer (Thermo Scientific) (1 Abs = 1 mg/mL, considering a 1 cm path length). The peptides (1-1.5 µg) were injected at 2 µL/min for 10 min onto a home-made trapping column (2 cm length, 100 µm internal diameter, 1-2 mm Kasil frit) packed with 5 µm 200 Å Magic C18AQ beads (Michrom Bioresources Inc). Sample fractionation was performed at room temperature using a flow rate of 200 nL/min and a laser-pulled fused-silica column (30 cm length, 75 µm internal diameter) packed with 1.9 µm ReproSil-Pur 120 C18-AQ (Dr. Maisch GmbH) and equilibrated with 0.1% formic acid in water containing 2% acetonitrile. After 10 min of elution under the initial equilibration condition, the acetonitrile concentration was ramped to 50% in 162 min, followed by a further increase to 80% in 4 min and a final washing step at that acetonitrile concentration for 2 min. Using data-dependent acquisition, up to the 6 most abundant precursor ions per MS survey scan (300 – 1800 m/z) were selected for MS/MS scans using the XCalibur software (Version 3.0.63, Thermo Scientific). MS1 data were acquired in profile mode with the following parameters: AGC target of 1E6, IT of 100 ms, 70,000 resolution (FWHM at 200 m/z) and 1 microscan. Precursor ions with z ≥ 3 were isolated using a 2 m/z window and an offset of 0.5 m/z, followed by fragmentation with Higher-energy Collisional Dissociation (HCD) applying the following Stepped Normalized Collision Energy (SNCE): 30-40 for samples cross-linked with DSS and 28-33 for those submitted to the XPlex chemistry. In an attempt to optimize the fragmentation of zero-length cross-linked peptides, these last samples were also analyzed using 30-40 SNCE. Ions selected for fragmentation were dynamically excluded for 60 s, with the “exclude isotopes” option. MS/MS scans were acquired in centroid mode with a resolution of 35,000, a fixed first mass of 200 m/z, AGC of 5E5, IT of 100 ms and the underfill ratio set to 10%. The spray voltage was set to 1.9 kV with no sheath or auxiliary gas flow and with a capillary temperature of 250 °C. The mass spectrometer was externally calibrated using a calibration mixture composed of caffeine, peptide MRFA and Ultramark 1621, as recommended by the instrument manufacturer.

S2.4. Cross-linking identification

Raw data were processed using SIM-XL software v. 1.3.2.0 with the following search parameters: cross-linker XPlex C6Ac2 (1,6-hexanediamine and zero-length) or DSS/BS3, 10 ppm error tolerance for precursors and fragments, fully specific trypsin digestion with up to 3 missed cleavages, and carbamidomethylation of cysteine as a fixed modification. Our list of candidates was obtained using parameters for non-restrictive identification of XLs (post-analysis filters): 10 ppm error tolerance for precursors and fragments, score = 2.5, spectral count = 1 and 1 peak matched per chain.

Figure S1: Distribution of topological distances for C-C atoms for DSS/BS3 and 1,6-hexanediamine cross-linkable pairs. The strategy consists of computing all topological distances between C atoms and between the corresponding reactive atoms. Next, the subset of C-C topological distances whose associated reactive-atom distances are below the linker length, LXL, is shortlisted. Excluding the most unlikely distances (here, 1%) returns the statistical upper limit cutoff, Lmax(0.99), which should be used to validate cross-link data (dashed line in each graph). Refer to Table 1 for each distance value.
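The selection described in the Figure S1 caption can be sketched in a few lines of Python. This is only a minimal illustration, not the TopoLink implementation: the function name, the input layout (pairs of C-C and reactive-atom topological distances) and the nearest-rank quantile rule are assumptions made here, and the excluded 1% is assumed to be the longest distances.

```python
import math

def statistical_upper_limit(pairs, linker_length, coverage=0.99):
    """Return a statistical upper limit, in the spirit of Lmax(coverage).

    pairs: iterable of (cc_distance, reactive_distance) topological
           distances for candidate residue pairs (hypothetical layout).
    linker_length: LXL, the maximum reactive-atom distance spanned
                   by the linker.
    """
    # Keep only C-C distances whose reactive atoms can be bridged by the linker.
    selected = sorted(cc for cc, reactive in pairs if reactive <= linker_length)
    if not selected:
        raise ValueError("no pair satisfies the linker length")
    # Nearest-rank quantile: smallest distance covering `coverage` of the pairs.
    # The small epsilon guards against floating-point round-up in the product.
    rank = max(1, math.ceil(coverage * len(selected) - 1e-9))
    return selected[rank - 1]
```

With 100 linkable pairs, for example, the function returns the 99th shortest C-C distance, so the single longest one is treated as an outlier.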

Figure S2: Zero-length Lmax definition. Zero-length species are amides or esters that result from covalent binding of the nitrogen of a Lys or an oxygen of a Ser to the carbonyl of an acidic Asp or Glu residue, with elimination of a water molecule. That is, the presence of a zero-length cross-link implies a more constrained system than the native protein. To evaluate the state just before zero-length formation, one needs to approximate the contact between the residues involved before the reaction. Here, we computed the topological distance between the atoms that would be connected through the covalent bond after the reaction. A maximum at ~4 Å is observed in the distributions (left, black curve), which is due to residues interacting through polar interactions before bond formation. A normal distribution (left, blue dashed curve) was fitted to the probability distribution for this condition, and the distance containing 99% of the area under the Gaussian curve (left, red curves) was chosen to represent the maximum distance allowed between the reactive atoms. This distance is treated as LXL (although there is no linker), and the determination of Lmax is analogous to that of other linkers (Figure S1). The dashed vertical lines on the right panels define the Lmax(0.99) values, which are reported in Table 1.
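As a rough illustration of the 99% criterion in the Figure S2 caption, the cutoff of a fitted Gaussian can be obtained from its inverse cumulative distribution. The sketch below assumes a one-sided reading (99% of the area lies below the cutoff); the function name and the width used in the example are placeholders, not fitted values from this work.

```python
from statistics import NormalDist

def zero_length_cutoff(mean, sigma, area=0.99):
    """Distance below which `area` of the Gaussian probability lies.

    mean, sigma: parameters of the fitted normal distribution (in Å).
    One-sided cutoff; this reading of "99% of the area" is an assumption.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return NormalDist(mu=mean, sigma=sigma).inv_cdf(area)

# Toy parameters: a peak near 4 Å as in the caption; sigma is a placeholder.
cutoff = zero_length_cutoff(mean=4.0, sigma=0.5)
print(f"{cutoff:.2f} Å")  # about 5.16 Å for these toy values
```

The resulting distance would then play the role of LXL in the selection of cross-linkable pairs.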

Figure S3: Potential energy curves for different cross-linking types. (A) DSS/BS3, (B) 1,6-hexanediamine and (C) zero-length.

Figure S4: Representation of cross-linking constraints. In each of the right panels, the red line indicates the upper limit (UL) above which the potential energy increases.
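To make the role of the UL concrete, two of the heuristic penalties compared in this work (linear and flat harmonic) can be written in generic form as below. This is a hedged sketch: the functional forms are the usual textbook ones, the `slope` and `sd` parameters are placeholders, and neither function reproduces the statistical force-field itself, which is tabulated from database statistics.

```python
def linear_penalty(distance, upper_limit, slope=1.0):
    """Zero up to the upper limit, then growing linearly with the excess."""
    return slope * max(0.0, distance - upper_limit)

def flat_harmonic_penalty(distance, upper_limit, sd=1.0):
    """Zero up to the upper limit, then growing quadratically with the excess.

    Only the upper bound is penalized here; `sd` sets the width of the
    harmonic wall (placeholder value).
    """
    excess = max(0.0, distance - upper_limit)
    return (excess / sd) ** 2

# Both penalties vanish for distances that satisfy the constraint.
for d in (15.0, 20.0, 25.0):
    print(d, linear_penalty(d, 20.0), flat_harmonic_penalty(d, 20.0))
```

Lowering the UL moves the onset of the penalty to shorter distances, which is why the choice of limit changes how restrictive a given constraint is.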

Figure S5: Albumin (ALB) domains (D1, D2 and D3) modelled without constraints and with the ideal set of constraints, using the linear penalization energy function with four UL values (25 Å, 20 Å, Extended length limit, and Statistical Limit). As observed for SalBIII, more restrictive upper limit distances improve the models obtained. Results comparing all heuristic energy functions are presented in Figure S6.

Figure S6: Use of the XLMS statistical force-field and its comparison to heuristic energy functions to model Albumin domain structures with the ideal set of constraints. The statistical force-field performs better than the linear and flat harmonic energy functions (with statistical limits) or the Lorentz energy function. Also, the energy function including the constraint energy from the statistical force-field (Rosetta + XL energy) discriminates models with fold-level accuracy (TM-score > 0.5) better than the Rosetta score alone.

Figure S7: Albumin domains modelled with experimental constraints. The number of experimental constraints for each domain corresponds to roughly 1/4 of the number of theoretical constraints applied previously (Table S1). Among the 10% best scored models, 11% of ALB-D1, 51% of ALB-D2 and 2% of ALB-D3 models have TM-score > 0.5 (Table S2).

Table S1: XLMS constraints in modelling SalBIII and Serum Albumin Domains

protein ID  sequence  no. of theoretical  no. of experimental  no. of experimental     experimental validated  experimental validated
            length    constraints*        constraints‡         validated constraints*  / experimental          / theoretical
SalBIII     141       62                  156                  22                      0.14                    0.35
ALB-D1      201       125                 163                  33                      0.20                    0.26
ALB-D2      189       153                 195                  40                      0.21                    0.26
ALB-D3      193       92                  98                   23                      0.23                    0.25

*as evaluated by Topolink with Lmax(0.99) defined in Table 1.

‡list of XLMS constraints candidates as defined in S2.3.

Table S2: Evaluation of the fold-level accuracy of Serum Albumin domain models generated using different energy functions to represent XLMS constraints. Distributions are shown in Figures S5, S6 and S7.

Values are the fraction of models with TM-score > 0.5 within each fraction of best scored models (1.0, 0.5 and 0.1).

Domain  Constraint set  Energy function  UL                 1.0   0.5   0.1
ALB-D1  none            -                -                  0.00  0.00  0.00
ALB-D1  Ideal           Linear           25 Å               0.02  0.03  0.05
ALB-D1  Ideal           Linear           20 Å               0.06  0.10  0.24
ALB-D1  Ideal           Linear           Extended length    0.12  0.20  0.38
ALB-D1  Ideal           Linear           Statistical Limit  0.15  0.26  0.50
ALB-D1  Ideal           flat harmonic    Statistical Limit  0.12  0.21  0.39
ALB-D1  Ideal           Lorentz          Statistical Limit  0.00  0.01  0.02
ALB-D1  Ideal           XLFF             Statistical Limit  0.29  0.43  0.67
ALB-D1  Experimental    XLFF             Statistical Limit  0.05  0.07  0.11
ALB-D2  none            -                -                  0.00  0.00  0.00
ALB-D2  Ideal           Linear           25 Å               0.02  0.03  0.06
ALB-D2  Ideal           Linear           20 Å               0.10  0.17  0.32
ALB-D2  Ideal           Linear           Extended length    0.29  0.45  0.66
ALB-D2  Ideal           Linear           Statistical Limit  0.39  0.60  0.80
ALB-D2  Ideal           flat harmonic    Statistical Limit  0.44  0.71  0.92
ALB-D2  Ideal           Lorentz          Statistical Limit  0.00  0.00  0.00
ALB-D2  Ideal           XLFF             Statistical Limit  0.74  0.95  0.99
ALB-D2  Experimental    XLFF             Statistical Limit  0.21  0.33  0.51
ALB-D3  none            -                -                  0.00  0.00  0.00
ALB-D3  Ideal           Linear           25 Å               0.01  0.03  0.07
ALB-D3  Ideal           Linear           20 Å               0.07  0.13  0.29
ALB-D3  Ideal           Linear           Extended length    0.26  0.47  0.80
ALB-D3  Ideal           Linear           Statistical Limit  0.36  0.66  0.92
ALB-D3  Ideal           flat harmonic    Statistical Limit  0.40  0.70  0.94
ALB-D3  Ideal           Lorentz          Statistical Limit  0.00  0.00  0.00
ALB-D3  Ideal           XLFF             Statistical Limit  0.70  0.96  0.99
ALB-D3  Experimental    XLFF             Statistical Limit  0.01  0.02  0.02
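The quantity tabulated in Table S2 can be computed from a list of models as sketched below. This is an illustrative helper, not the script used in this work: the function name and input layout are hypothetical, and it assumes that a lower score means a better model, as is the convention for Rosetta energies.

```python
def fold_level_fraction(models, best_fraction, tm_cutoff=0.5):
    """Fraction of the best scored models with TM-score above `tm_cutoff`.

    models: list of (score, tm_score) tuples; lower score = better model
            (assumed convention).
    best_fraction: fraction of best scored models to consider (e.g. 0.1).
    """
    if not models:
        raise ValueError("empty model list")
    ranked = sorted(models, key=lambda m: m[0])
    # Always keep at least one model, even for very small fractions.
    n_keep = max(1, int(best_fraction * len(ranked)))
    top = ranked[:n_keep]
    return sum(1 for _, tm in top if tm > tm_cutoff) / len(top)

# Toy example: four models, two of them with the correct fold.
toy = [(-100.0, 0.7), (-90.0, 0.6), (-80.0, 0.3), (-70.0, 0.2)]
print(fold_level_fraction(toy, 1.0))  # 0.5
print(fold_level_fraction(toy, 0.5))  # 1.0
```

A fraction that increases as `best_fraction` decreases indicates that the score used for ranking enriches fold-level models among the best scored ones.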

Protocol S1. Rosetta abinitio relax protocol

Structural modeling with Rosetta was performed using the ab initio relax protocol and the following flags. Each placeholder in braces stands for a required file. Fragment files for 3-mers and 9-mers were generated with the Robetta server (Kim et al., 2004) (http://robetta.bakerlab.org/fragmentsubmit.jsp). The fragments were generated excluding homologs. To generate models without constraints, the constraint files and weights must be omitted.

-abinitio
    -fastrelax
    -increase_cycles 1
    -rg_reweight 0.25
-in
    -file
        -fasta {fasta file}
        -frag3 {fragments3 file}
        -frag9 {fragments9 file}
    -path
        -database $rosetta_path/main/database/
-out
    -nstruct 5000
    -file
        -fullatom
        -silent {output silent file}
-constraints
    -cst_file {constraint file}
    -cst_weight 1
    -cst_fa_file {constraint file}
    -cst_fa_weight 1
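The flags above can be assembled programmatically so that the constraint options are omitted for unconstrained runs. The helper below is a hypothetical convenience sketch, not part of Rosetta or of this work; it writes the options in colon-qualified form (for example `-in:file:fasta`), which is assumed to be equivalent to the nested listing above, and all file names are supplied by the caller.

```python
def abinitio_relax_flags(fasta, frag3, frag9, database, silent_out,
                         cst_file=None, nstruct=5000):
    """Build the text of a flags file for the ab initio relax protocol.

    cst_file: path to the constraint file, or None to model without
              constraints (constraint files and weights are then omitted).
    """
    flags = [
        "-abinitio:fastrelax",
        "-abinitio:increase_cycles 1",
        "-abinitio:rg_reweight 0.25",
        f"-in:file:fasta {fasta}",
        f"-in:file:frag3 {frag3}",
        f"-in:file:frag9 {frag9}",
        f"-in:path:database {database}",
        f"-out:nstruct {nstruct}",
        "-out:file:fullatom",
        f"-out:file:silent {silent_out}",
    ]
    if cst_file is not None:
        # Same file and unit weight for the centroid and full-atom stages.
        flags += [
            f"-constraints:cst_file {cst_file}",
            "-constraints:cst_weight 1",
            f"-constraints:cst_fa_file {cst_file}",
            "-constraints:cst_fa_weight 1",
        ]
    return "\n".join(flags) + "\n"
```

Calling the function without `cst_file` yields the unconstrained setup, while passing a constraint file adds the four constraint options with the weights listed above.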

References

Fioramonte,M. et al. (2018) XPlex: an effective, multiplex cross-linking chemistry for acidic residues. Anal. Chem.

Kim,D.E. et al. (2004) Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res., 32, W526–W531.

Lam,S.D. et al. (2016) Gene3D: expanding the utility of domain assignments. Nucleic Acids Res., 44, D404–D409.

Luhavaya,H. et al. (2015) Enzymology of Pyran Ring A Formation in Salinomycin Biosynthesis. Angew. Chem. Int. Ed., 54, 13622–13625.

Martinez,L. et al. (2017) TopoLink: A software to validate structural models using chemical crosslinking constraints. Protoc. Exch. doi:10.1038/protex.2017.035

Sillitoe,I. et al. (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res., 43, D376–D381.