Supporting Information - PNAS...Supporting Information Micsonai et al. 10.1073/pnas.1500851112 SI...



Supporting Information
Micsonai et al. 10.1073/pnas.1500851112

SI Methods

CD Spectra and Reference Dataset. For the spectral reference dataset, the SP175 CD reference spectrum set (1) was downloaded from the Protein Circular Dichroism Data Bank (PCDDB) (2). The spectrum of subtilisin Carlsberg (PCDDB ID CD0000067000) was left out of the 71 protein CD spectra of SP175. In our experience, subtilisin Carlsberg shows rapid autodegradation at the high concentrations used in SRCD measurements, strongly affecting the CD spectrum. CD spectra of three additional proteins were added to the reference set: native β2-microglobulin, amyloid fibrils of the K3 peptide fragment of β2-microglobulin, and fibrils of Alzheimer’s amyloid-β (1–42) peptide, having PDB structures 2yxf (3), 2e8d (4), and 2beg (5), respectively. These spectra were recorded at the DISCO SRCD beamline of the SOLEIL French Synchrotron Radiation Facility (Gif-sur-Yvette, France) (6, 7) and deposited in PCDDB. We refer to this extended reference set as SP175+.

Performance of our BeStSel algorithm was verified on an extra CD spectrum test set. The list of the proteins included is shown in Table S3. Some of these proteins originate from the MP180 membrane protein CD dataset of PCDDB (8). Other spectra were recorded by the authors at the SOLEIL French Synchrotron Radiation Facility. The spectra were analyzed and treated with the CDTool package (9) (averaging, baseline subtraction, correction with camphorsulphonic acid (CSA), normalization) and, after verification by the ValiDichro software (10), were uploaded to PCDDB. For experimental conditions and details of data collection, please refer to the corresponding PCDDB record.

The CD spectrum of polyQ (11) was kindly provided by Ronald Wetzel (University of Pittsburgh, Pittsburgh, PA).

Note on CD Spectroscopy of Protein Aggregates and Amyloid Fibrils. Recording accurate CD spectra of protein aggregates and amyloid fibrils can be complicated and is sometimes not feasible. It is important to have a transparent, homogeneous solution without large insoluble precipitates. Aggregates can scatter the light, and amyloid fibrils sometimes become oriented in the cuvette, especially with short pathlengths, and then exhibit linear dichroism. To improve the quality of the sample, mild ultrasonication can be applied, which homogenizes the sample and breaks long fibrils into short pieces, decreasing both light scattering and the chance of orientation. It is recommended to place the sample cell close to the detector, which further decreases the effect of light scattering. SRCD has an advantage over conventional CD spectroscopy for such purposes.

Definition of the Secondary Structure Elements and the β-Sheet Twist. Secondary structure elements were determined from the PDB structures using the DSSP algorithm (12, 13). The secondary structures of DSSP (H, α-helix; G, 3₁₀ helix; I, π-helix; E, β-strand; B, β-bridge; S, bend; T, turn; and “ ”, irregular or loop) were reorganized as follows.

Helix components were defined similarly to Sreerama et al. (14), but including only residues within α-helices. Accordingly, the regular, middle parts of α-helices were assigned as Helix1, and the two residues at each end of α-helices were taken as distorted helix, namely Helix2. The overall β-sheet content was determined as the fraction of the residues assigned as β-strand by DSSP. Turn was assigned as in DSSP. The remaining components of DSSP were handled as “others.”

In DSSP, β-sheets are determined by hydrogen-bonded pairs of residues, which are oriented either antiparallel or parallel according to the relative direction of the host β-strands. β-Sheet content is divided into antiparallel and parallel by the ratio of the corresponding hydrogen-bonded pairs.

We adopted the definition of Ho and Curmi (15) for the twisting angle of the β-sheet (Fig. 2C). A twisting angle (ω) is determined for two adjacent residues of two neighboring β-strands in a β-sheet, briefly, as the angle between the two neighboring peptide backbones at the location of the residues. For parallel β-sheets, the negative of the calculated angle had to be taken for the correct handedness.

For antiparallel β-sheet, the twisting angle is defined as follows:

$$\omega = 180^\circ + \operatorname{sign}\big((\mathbf{b}_1 \times \mathbf{b}_2)\cdot \mathbf{d}_{21}\big)\cdot \arccos\!\left(\frac{\mathbf{b}_1 \cdot \mathbf{b}_2}{|\mathbf{b}_1|\,|\mathbf{b}_2|}\right). \qquad \text{[S1]}$$

For parallel β-sheet:

$$\omega = \operatorname{sign}\big((\mathbf{b}_1 \times \mathbf{b}_2)\cdot \mathbf{d}_{21}\big)\cdot \arccos\!\left(\frac{\mathbf{b}_1 \cdot \mathbf{b}_2}{|\mathbf{b}_1|\,|\mathbf{b}_2|}\right), \qquad \text{[S2]}$$

where the b vectors point from the midpoint of the N‒C bond of an amide group to the midpoint of the next amide N‒C bond in sequence (Fig. 2C). The d21 vector points from one Cα atom of one β-strand to the neighboring Cα atom of the other β-strand, between which the twist is calculated. The left and right direction of twist corresponds to the twisting direction of the two neighboring β-strands around each other (top of Fig. 2 A and B). Positive angle values represent right-hand twisted β-sheets, both for parallel and antiparallel β-sheets. This definition provides (n − 1) × r angle values in an ideal β-sheet of n strands with r residues in each strand.

To study the spectral contribution of β-sheet twist, we divided the twisting angle range of antiparallel β-sheets into regions and distinguished them as different secondary structure components. Basis spectra for the individual structural elements were calculated by the least-squares method applied at every wavelength point for the 71 CD spectra of the SP175 reference set (Optimization Procedure to Determine the Basis Spectra Sets). First, we made five regions of antiparallel β, with three 20°-wide ones in the middle. At the start, the centers of the three middle regions were −10°, 10°, and 30°. By shifting these three regions together gradually in 1° steps to reach the 20°, 40°, and 60° positions, we calculated the corresponding basis spectra, which reflect the spectral effect of the twist on the antiparallel β-sheets. To our knowledge, this is the first presentation of β-sheet basis spectra as a function of the twisting angle (the angle that belongs to the center of the window) (Fig. 2E).

The BeStSel algorithm divides the antiparallel β-sheet into three groups (anti1, anti2, and anti3) by using boundaries of +3° and +23° of twisting angle for separation, corresponding to left-hand twisted, relaxed (slightly right-hand twisted), and right-hand twisted β-sheets. In the case of parallel β-sheets, the distinction of different twists did not improve the accuracy of secondary structure estimation. Possible explanations are the low number of parallel β-sheet–containing reference proteins and the relatively more uniform spectral shape of parallel β-sheets.

The definition of turn is identical to that in DSSP. The remaining residues, including residues invisible in the 3D structures, are assigned as “others.” Altogether, the BeStSel algorithm uses eight basis components, which are presented in Fig. 3A in relation to the DSSP and SELCON algorithms.

For comparison with BeStSel, we evaluated the performance of several other algorithms including SELCON3 (14), CONTIN (16), CDSSTR (17), and K2D2 (18), among others. SELCON3, CONTIN, and CDSSTR used the same six basis components as defined by Sreerama et al. (14) (Fig. 3A). K2D2 (18), K2D3 (19), and CAPITO (20) resolve three basis components, i.e., α-helix, β-sheet, and the rest. For the definition of secondary structure basis components and secondary structure contents of the reference proteins, the VARSLC (21), LINCOMB (22), and CDNN (23) methods use data from AMSOM (24) instead of DSSP. Unfortunately, the content of the AMSOM database is limited and is no longer updated. For these latter algorithms, helix was defined as the sum of α-helix and 3₁₀ helix, antiparallel and parallel were defined as in BeStSel, turn as turn in DSSP, and the rest gave the “others.”

As a consequence of the different secondary structure definitions used by the various algorithms, the secondary structure composition of a particular protein with known X-ray structure will depend on the algorithm. For a correct comparison between methods, the reliability of the secondary structure content estimation by an algorithm has to be evaluated on the appropriate basis.
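The twist-angle definition of Eqs. S1 and S2 and the anti1/anti2/anti3 separation at +3° and +23° can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several points are assumptions made only for illustration: the function names, the use of the product of the vector lengths to normalize the dot product, the wrapping of the result into the range −180° to 180°, and the treatment of the boundary values (3° is counted as anti2 and 23° as anti3). The equations are applied exactly as written; the source does not state whether Eq. S2 already contains the sign change mentioned for parallel β-sheets.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def twist_angle(b1, b2, d21, antiparallel=True):
    """Twist angle in degrees from Eq. S1 (antiparallel) or Eq. S2 (parallel)."""
    cosine = _dot(b1, b2) / math.sqrt(_dot(b1, b1) * _dot(b2, b2))
    cosine = max(-1.0, min(1.0, cosine))       # guard against rounding errors
    angle = math.degrees(math.acos(cosine))    # unsigned angle, 0..180
    triple = _dot(_cross(b1, b2), d21)
    sign = (triple > 0) - (triple < 0)         # -1, 0, or +1
    omega = sign * angle + (180.0 if antiparallel else 0.0)
    # Assumption: wrap into [-180, 180) so that small twists stay near zero.
    return (omega + 180.0) % 360.0 - 180.0

def antiparallel_class(omega, low=3.0, high=23.0):
    """Assign an antiparallel twist angle to anti1, anti2, or anti3.

    Boundary handling (low counts as anti2, high as anti3) is an assumption.
    """
    if omega < low:
        return "anti1"
    if omega < high:
        return "anti2"
    return "anti3"

# Hypothetical backbone vectors: two nearly antiparallel strands, 10 degrees off.
b1 = (1.0, 0.0, 0.0)
b2 = (-math.cos(math.radians(10)), math.sin(math.radians(10)), 0.0)
omega = twist_angle(b1, b2, d21=(0.0, 0.0, -1.0), antiparallel=True)
print(round(omega, 6), antiparallel_class(omega))
```

Reversing the direction of the hypothetical d21 vector flips the sign of the resulting angle, which moves the residue pair from the anti2 group to the anti1 group.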

Optimization Procedure to Determine the Basis Spectra Sets. BeStSel uses eight secondary structure components. The basis spectra corresponding to the secondary structure basis components were calculated on the SP175+ reference spectrum set by optimizing with the linear least-squares method based on the following equation:

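$$\min\left(\tfrac{1}{2}\,\big\lVert SS_{\mathrm{ref}} \cdot b_{\lambda} - CD_{\mathrm{ref},\lambda} \big\rVert^{2}\right), \qquad \text{[S3]}$$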

for every λ (wavelength). SSref is an n-by-8 matrix containing the secondary structure contents of the n reference proteins, and CDref,λ is a column vector of n values containing the CD signals of the reference proteins at the corresponding wavelength λ. bλ is a vector of eight elements corresponding to the eight basis spectra at the wavelength λ for which the linear least-squares problem is solved. ‖x‖ is the Euclidean norm of vector x.

The elements of bλ as a function of λ provide the eight basis spectra corresponding to the eight secondary structure components. The eight basis spectra together give a matrix B.

In the next step, using the basis spectra, we estimated the secondary structure contents of the reference proteins from their CD spectra. The fitting of a CD spectrum with the basis spectra was realized by solving the following constrained linear least-squares problem:

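$$\min\left(\tfrac{1}{2}\,\big\lVert B \cdot SS_{\mathrm{calc}} - CD_{\mathrm{test}} \big\rVert^{2}\right), \qquad \text{[S4]}$$

$$\sum_{i=1}^{8} SS_{\mathrm{calc},i} = 1.00, \qquad \text{[S5]}$$

$$-0.01 \le SS_{\mathrm{calc},i} \le 1.00, \qquad \text{[S6]}$$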

using the reflective Newton method (25). B is the m-by-8 matrix containing the secondary structure basis spectra, and m is the number of wavelength values. CDtest is an m-element vector containing the CD spectrum of the protein at m wavelength values. SScalc is the vector of the secondary structure contents for which the least-squares problem is solved, satisfying the following criteria: the sum of the secondary structures is equal to 1 and none of the secondary structures can be less than −0.01 or higher than 1.

The calculation was carried out on all reference proteins, and the secondary structure results were compared with the real secondary structure contents derived from the X-ray structures. The reliability of the estimation for the ith secondary structure was characterized by the RMSD on all of the reference proteins:

$$\mathrm{RMSD}_{i} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(SS_{\mathrm{ref},j,i} - SS_{\mathrm{calc},j,i}\right)^{2}}, \qquad \text{[S7]}$$

where SSref,j,i is the ith secondary structure of the jth reference protein derived from its known X-ray structure and SScalc,j,i is the estimated ith secondary structure of the jth reference protein.

We carried out an optimization to improve the basis spectrum set by leaving out reference proteins and wavelength regions from the basis spectra preparation, keeping the choice that resulted in the lowest RMSD when fitting back all of the reference proteins in SP175+. Separate optimization procedures were carried out for each of the secondary structures, resulting in eight sets of the eight basis spectra. First, the wavelength range to leave out was investigated by varying the starting point and the width of the gap and carrying out the above calculations. Then, using the best wavelength ranges for calculation, one reference protein was left out in every variation, and the one leading to the best estimation on the entire reference dataset for the given secondary structure was found. After decreasing the number of reference proteins by that one, the wavelength range search and a further iteration with the next protein to be left out were carried out. Finally, to resolve the best secondary structure estimation on the whole SP175+, an optimized subset of reference protein spectra and a corresponding optimal wavelength range were found for each of the secondary structure elements. A steady improvement of the RMSD during each iteration of the selection procedure for the different secondary structure components was observed (Fig. S2A). Nevertheless, for the anti1 component, the optimization for the wavelength regions did not improve the RMSD; therefore, only the best reference protein subset optimization was carried out.

It is important to note that the performance of the overall algorithm was determined by estimating the secondary structure content of all of the reference proteins in a cross-validated way, i.e., the reference protein whose secondary structure was estimated from its CD spectrum was left out of the whole optimization and basis spectrum preparation procedure. Because we had separate optimized basis sets for each of the secondary structure elements, the fitting had to be carried out eight times, and from each fitting, the corresponding secondary structure content was taken. If a secondary structure content was less than zero, it was set to zero. The sum of the estimated secondary structure contents is not necessarily 1.00; thus, a final normalization was done. The cross-validated performance of the BeStSel algorithm is presented in Tables 1 and 2 and Table S2 in comparison with other algorithms.

In the optimization procedure to calculate the basis spectra sets for the general use of BeStSel, the following spectra were finally used (given with their PCDDB codes as CD000XXXX000): for Helix1: 0004–0006, 0010, 0014–0015, 0025, 0029, 0031, 0034, 0036–0037, 0039–0040, 0043–0044, 0049–0051, 0054–0056, 0058, 0062–0066, 0069, 3894, 3901; for Helix2: 0007, 0010, 0019, 0025, 0040–0041, 0043, 0046–0047, 0053, 0063–0064, 0066, 0071, 3900, 3901; for Anti1: 0004, 0006, 0008, 0014–0015, 0019, 0021, 0023, 0026, 0029, 0031, 0036–0037, 0041–0042, 0046, 0048–0054, 0057–0058, 0062–0064, 0067, 0069, 0070, 3900, 3901; for Anti2: 0004–0005, 0007–0009, 0011–0012, 0014–0017, 0019, 0022, 0023, 0026, 0029–0034, 0040–0041, 0045, 0049–0050, 0052–0053, 0056, 0058, 0060, 0062–0065, 0071, 3894, 3900, 3901; for Anti3: 0004, 0007, 0008, 0014, 0040–0041, 0064, 0070; for Parallel: 0003–0004, 0007–0008, 0014–0016, 0018, 0020–0023, 0026, 0028–0030, 0037, 0039–0045, 0051–0055, 0057–0058, 0061–0067, 0069, 0071, 3894, 3900, 3901; for Turn: 0003, 0005, 0012, 0014, 0021, 0023, 0030, 0033, 0039, 0041, 0043, 0048, 0052, 0066, 0069, 0070, 3894, 3901; for Others: 0002, 0004–0005, 0007–0008, 0010–0011, 0016, 0021, 0026–0027, 0033–0034, 0037, 0041, 0043, 0049, 0053, 0064–0065, 0068, 0070, 3894, 3900, 3901.
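The per-wavelength least-squares step of Eq. S3 and the RMSD of Eq. S7 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used for BeStSel: the unconstrained problem is solved here through the normal equations with a small Gaussian elimination routine, the constrained fit of Eqs. S4–S6 with the reflective Newton method is not reproduced, and the two-component, three-protein data set is hypothetical (the method described above uses eight components).

```python
def solve_linear(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def basis_at_wavelength(ss_ref, cd_ref):
    """Minimize ||SSref . b - CDref||^2 (Eq. S3) via the normal equations."""
    k = len(ss_ref[0])
    ata = [[sum(row[i] * row[j] for row in ss_ref) for j in range(k)]
           for i in range(k)]
    aty = [sum(row[i] * y for row, y in zip(ss_ref, cd_ref)) for i in range(k)]
    return solve_linear(ata, aty)

def basis_spectra(ss_ref, cd_spectra):
    """cd_spectra[j][l] is the CD of protein j at wavelength index l.

    Returns basis[l][i], the value of basis spectrum i at wavelength index l.
    """
    n_wavelengths = len(cd_spectra[0])
    return [basis_at_wavelength(ss_ref, [s[l] for s in cd_spectra])
            for l in range(n_wavelengths)]

def rmsd(ss_ref_i, ss_calc_i):
    """Eq. S7 for one secondary structure component over all proteins."""
    n = len(ss_ref_i)
    return (sum((r - c) ** 2 for r, c in zip(ss_ref_i, ss_calc_i)) / n) ** 0.5

# Hypothetical toy data: 3 proteins, 2 components, 2 wavelength points.
ss_ref = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
cd_spectra = [[10.0, -2.0], [-5.0, 4.0], [2.5, 1.0]]
print(basis_spectra(ss_ref, cd_spectra))
```

In this toy case the CD values were generated from the basis values (10, −5) and (−2, 4), so the fit recovers them up to rounding.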

Micsonai et al. www.pnas.org/cgi/content/short/1500851112 2 of 14


Secondary Structure Estimation of Proteins from Their CD Spectrum. To estimate the secondary structure composition of an unknown protein, the same final, fixed, eight sets of eight basis spectra optimized on SP175+ are always used. Fig. 3 B and C shows the eight final basis spectra for the eight secondary structure components as taken from their optimized sets. For a new experimental CD spectrum, eight independent fittings have to be carried out, resulting in 8 × 8 secondary structure contents. From each fitting, the secondary structure content for which the basis set was optimized is picked out. For example, to get the helix1 content, the basis spectrum set optimized for helix1 is fitted, and from the eight resulting values, the helix1 content is taken. After the eight fittings, negative secondary structure contents are set to zero and the eight values are finally normalized to give a sum of 1.00.
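The final combination step described above (pick one value per fitting, set negative values to zero, normalize) can be written as a short Python sketch. The function name and the layout of `fit_matrix` are assumptions made for illustration, and the 3 × 3 example is hypothetical; the procedure described here works with eight fittings of eight components.

```python
def combine_fits(fit_matrix):
    """Combine k independent fittings into one secondary structure estimate.

    fit_matrix[k][i] is the content of component i obtained from the fitting
    that used the basis set optimized for component k (assumed layout).
    """
    # From fitting k, keep only the component the basis set was optimized for.
    picked = [fit_matrix[k][k] for k in range(len(fit_matrix))]
    # Negative contents are set to zero.
    clipped = [max(0.0, value) for value in picked]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("all selected contents are zero; cannot normalize")
    # Final normalization so that the contents sum to 1.00.
    return [value / total for value in clipped]

# Hypothetical 3-component example.
fits = [
    [0.50, 0.30, 0.20],
    [0.40, -0.01, 0.61],
    [0.45, 0.05, 0.30],
]
print(combine_fits(fits))
```

Because each value comes from a different fitting, the selected values need not sum to 1.00 before normalization, which is why the last step is required.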

RMSD and NRMSD Between the Experimental and Fitted CD Spectra. The whole fitting procedure was verified using the average of the eight fittings, taking into account the corresponding wavelength regions used in the individual fittings. The spectral RMSD is defined as follows:

$$\mathrm{RMSD}_{CD} = \sqrt{\frac{1}{w}\sum_{i=1}^{w}\left(CD_{\mathrm{exp},i} - CD_{\mathrm{fit},i}\right)^{2}}. \qquad \text{[S8]}$$

This value does not characterize the reliability of the spectral fitting well because it does not take into account the relative value of the fitting error, as was shown by Mao et al. (26). For an improved measure, we introduced a normalized RMSD as follows:

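$$\mathrm{NRMSD}_{CD} = \sqrt{\frac{1}{w}\cdot\frac{1}{\max\left(CD_{\mathrm{exp}}\right) - \min\left(CD_{\mathrm{exp}}\right)}\cdot\sum_{i=1}^{w}\left(CD_{\mathrm{exp},i} - CD_{\mathrm{fit},i}\right)^{2}}. \qquad \text{[S9]}$$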

In general, the relation between the NRMSDCD of spectral fitting and the RMSDSS of the secondary structure estimation highly depends on the algorithm, the secondary structure composition of the given protein, and the wavelength range of the CD spectrum. As an example, the K2D2 algorithm often provides large fitting errors with relatively good secondary structure estimations, whereas the CDSSTR algorithm always fits perfectly, independently of the structure estimation error (27). Fig. S2B presents the structural RMSD vs. spectral NRMSD on the proteins of SP175+ using BeStSel in comparison with the SELCON3 algorithm in a cross-validated way.
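The two spectral measures of Eqs. S8 and S9 can be computed as in the following Python sketch. The function names and the four-point spectra are hypothetical. The normalized version follows Eq. S9 as printed, with the range of the experimental spectrum placed under the square root; this placement is taken from the equation text and should be treated as an assumption of the sketch.

```python
import math

def spectral_rmsd(cd_exp, cd_fit):
    """Eq. S8: root-mean-square deviation over the w wavelength points."""
    w = len(cd_exp)
    return math.sqrt(sum((e - f) ** 2 for e, f in zip(cd_exp, cd_fit)) / w)

def spectral_nrmsd(cd_exp, cd_fit):
    """Eq. S9 as printed: the squared error is also divided by the range
    max(CDexp) - min(CDexp) of the experimental spectrum (assumed placement)."""
    w = len(cd_exp)
    cd_range = max(cd_exp) - min(cd_exp)
    if cd_range == 0:
        raise ValueError("experimental spectrum is flat; range is zero")
    squared_error = sum((e - f) ** 2 for e, f in zip(cd_exp, cd_fit))
    return math.sqrt(squared_error / (w * cd_range))

# Hypothetical experimental and fitted CD values at four wavelengths.
cd_exp = [1.0, -1.0, 3.0, -3.0]
cd_fit = [1.0, 0.0, 3.0, -2.0]
print(spectral_rmsd(cd_exp, cd_fit), spectral_nrmsd(cd_exp, cd_fit))
```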

Comparative Statistics on the Performance of BeStSel. The reliability of the secondary structure estimation was characterized by two parameters, the RMSDSS and the Pearson’s correlation coefficient. Often, it is difficult to compare different algorithms because of the different definitions of the secondary structure components. A common ground for comparison is the helix (H), β-sheet (B), and others (O) contents, to which all of the outputs of the different algorithms can be summed up. To compare the antiparallel–parallel distinction of BeStSel to other algorithms, the helix, antiparallel, parallel, turn, and others (HAPTO) basis was appropriate. Tables 1 and 2 and Table S2 show the performance of BeStSel in comparison with the most popular and reliable methods. Unfortunately, the source codes of those algorithms are mostly unavailable, making the cross-validated evaluation impossible. Wallace and coworkers adapted SELCON to Matlab [SELMAT (28)] and provided its source code. SELCON is one of the most popular and improved algorithms and thus was a perfect choice for comparisons in our work. Where cross-validated statistics were impossible on SP175, we provided the statistics from the original publications. The performance of CONTIN and CDSSTR on the SP175 reference set was evaluated by using the original Fortran codes downloaded from the authors’ home pages (s-provencher.com/pages/contin-cd.shtml; biochem.science.oregonstate.edu/dr-johnsons-software-download-instructions). In the case of K2D3 and CAPITO algorithms, the secondary structures of these two proteins were estimated without cross-validation.

VARSLC and CONTIN failed to run through on some proteins, which we indicated in Tables 1 and 2. The statistics for these algorithms excluded those proteins, possibly resulting in better values.

The comparison of BeStSel to algorithms that distinguish antiparallel and parallel β-sheets is shown in Fig. 4 in the form of scatter plots for SP175+.

The performance of various algorithms and BeStSel was evaluated on an extra set of proteins that are rich in β-sheet or have rare structural composition. The list of these proteins is presented in Table S3.
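Summing the eight components to the common H/B/O basis and computing Pearson’s correlation coefficient can be sketched in Python as follows. The grouping shown is an assumption for illustration: the two helix components form H, the three antiparallel groups plus parallel form B, and turn is counted together with others in O. The component names and the example values are hypothetical.

```python
import math

def to_hbo(ss):
    """Collapse an eight-component composition (dict) to helix, sheet, others.

    Assumed grouping: turn is summed into the 'others' class.
    """
    helix = ss["helix1"] + ss["helix2"]
    sheet = ss["anti1"] + ss["anti2"] + ss["anti3"] + ss["parallel"]
    others = ss["turn"] + ss["others"]
    return helix, sheet, others

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical composition of one protein.
composition = {"helix1": 0.20, "helix2": 0.10, "anti1": 0.05, "anti2": 0.15,
               "anti3": 0.10, "parallel": 0.10, "turn": 0.10, "others": 0.20}
print(to_hbo(composition))

# Hypothetical reference vs. estimated helix contents of three proteins.
print(pearson([0.10, 0.40, 0.70], [0.15, 0.35, 0.75]))
```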

Database Construction for Fold Recognition. Using the secondary structure basis components as descriptors of the protein structure, the different folds might be separated. In this respect, we examined and compared the basis components of our BeStSel algorithm to those of DSSP, the SELCON3 algorithm (the latter is also common for the CONTIN and CDSSTR algorithms), and K2D. Any protein with known 3D structure can be characterized by its secondary structure composition. The proteins of the PDB can be projected as individual points into a multidimensional “fold space” defined by the secondary structure elements as dimensions. This way, the secondary structure basis components of BeStSel, SELCON, and K2D define an 8D, 6D, and 3D space, respectively. Similarly, DSSP also defines an 8D fold space with axes representing the α-helix, 3₁₀ helix, π-helix, β-strand, β-bridge, etc., contents of the proteins. The secondary structure compositions of 87,346 proteins of the PDB were calculated by means of all of the above secondary structure definitions, constructing a database. This database was extended by information on the protein folds.

For each polypeptide chain in the database, we linked the available CATH data using the 3.5 version of the CATH fold classification (29). We use the first four categories of the hierarchy of CATH: class, architecture, topology, and homology (referred to as superfamily in CATH 3.5).

A single-domain subset of the database was constructed as follows: polypeptide chains consisting of single, classified CATH domains having a resolution better than 3.0 Å were searched, and a filtering was applied for less than 90% sequential homology to avoid redundancy. Resolution and sequential filtering were carried out by the PISCES server (30). Out of the 225,799 chains of the 87,346 PDB structures, 85,280 single domains were found, representing four classes, 38 architectures, 804 topologies, and 1,532 homologies. After filtering, the final single-domain subset contains 10,433 chains, representing four classes, 38 architectures, 783 topologies, and 1,490 homologies.

Search for the Closest Structures in PDB. The most general search to obtain fold information from the secondary structure of proteins was carried out on the entire PDB. We searched for proteins with known X-ray structures having secondary structures similar to the investigated protein. The similarity is determined by the distance in the secondary structure space. This is especially useful in the case of multidomain proteins. For single-domain proteins, the filtered subset of PDB is of better use (see the next paragraph).

The distance between two points (x and x′) in the secondary structure space is defined by their Euclidean distance:

$$d = \sqrt{\sum_{i=1}^{8}\left(x_{i} - x'_{i}\right)^{2}}, \qquad \text{[S10]}$$

where xi is the content of the ith secondary structure of the protein. The search for the closest structures is carried out on the entire database representing 87,346 structures of PDB.
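The distance of Eq. S10 and the search for the closest entries can be sketched in Python as follows. The database layout (a list of identifier and composition pairs), the entry identifiers, and the three-component compositions are hypothetical and chosen only to keep the example short; the search described above uses eight components and the full database.

```python
import math

def distance(x, x_prime):
    """Eq. S10: Euclidean distance between two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))

def closest_structures(query, database, k=20):
    """Return the k database entries closest to the query composition.

    database is assumed to be a list of (identifier, composition) pairs.
    """
    ranked = sorted(database, key=lambda entry: distance(query, entry[1]))
    return ranked[:k]

# Hypothetical entries with three-component compositions.
database = [
    ("entry_a", (0.50, 0.10, 0.40)),
    ("entry_b", (0.10, 0.60, 0.30)),
    ("entry_c", (0.45, 0.15, 0.40)),
]
for identifier, composition in closest_structures((0.50, 0.10, 0.40), database, k=2):
    print(identifier, composition)
```

A full sort is sufficient for a sketch; for a database of tens of thousands of entries, a partial selection such as `heapq.nsmallest` would avoid sorting every entry.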

Search for the Closest Single CATH Domains. In the case of single-domain proteins, a prediction for CATH domains is made. One option is to search for the closest structures in the single-domain nonredundant structure set. The definition of the distance is presented in Eq. S10. We can assume that the closest structures, representing proteins with similar secondary structure contents, have the same fold and belong to the same CATH family. The reliability of this method for fold prediction depends on the definitions of the secondary structure components, i.e., how unambiguously they characterize the protein folds. This was tested on the whole single-domain filtered subset by withdrawing all of the proteins one by one from the subset and checking whether the closest structures in the remaining subset represent the same protein fold. The results give a theoretical maximum reliability and were compared for different secondary structure decompositions (BeStSel, DSSP, SELCON, K2D; Table 3). For this calculation, the real secondary structure contents derived from the X-ray structures were used. The method does not take into account the possible error of the secondary structure estimation from CD. An experimental reliability was calculated on the single-domain proteins of SP175 by estimating the secondary structure from the CD spectra and searching for the closest structures in the single-domain filtered database (Table 3).

Search for CATH Domains Within the Expected Error of BeStSel. To take into account the possible error of the secondary structure estimation from CD, we search the single-domain database for all of the chains that lie within a distance of 1.5 × RMSDi in each structural element from the estimated secondary structure. In other words, we searched for the structures in a box centered on the BeStSel result (Fig. 5). The size of the box is determined by the RMSD of BeStSel on the SP175 reference set. The hits in the box are sorted by class, architecture, topology, and homology. Lists ordered by frequency of occurrence are generated for the different CATH categories, providing suggestions for the fold of the studied protein. The theoretical performance of this method was evaluated on the single-domain dataset. All of the proteins were taken out one by one, and the surveys of structures and the corresponding CATH data in the box around them were carried out. We examined the percentage of cases in which the fold of the given protein matched the most frequent fold in the box or was among the first 5 or 10 on the list (Table 3). Different algorithms were compared with BeStSel. A comparison with DSSP was also carried out. For DSSP, the box size was determined by the SDs of the DSSP secondary structure contents on the single-domain dataset. We investigated the experimental performance of fold prediction by this “box” method on the cross-validated results of secondary structure estimation from CD of BeStSel on the single-domain proteins of SP175 and compared it with other algorithms. The results revealed (Table 3) that fold recognition using the BeStSel algorithm to estimate the secondary structure from CD, together with the new method for finding the fold in the secondary structure space defined by the eight components of BeStSel, outperforms any other methods and treatments of the problem.
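The “box” search can be sketched in Python as follows: every database entry whose composition lies within 1.5 × RMSDi of the estimate in each component counts as a hit, and the CATH labels of the hits are ranked by frequency. The database layout, the labels, the RMSD values, and the three-component compositions are hypothetical and serve only as an illustration.

```python
from collections import Counter

def in_box(query, candidate, rmsd, factor=1.5):
    """True if the candidate lies within factor * RMSD_i of the query
    in every secondary structure component."""
    return all(abs(q - c) <= factor * r
               for q, c, r in zip(query, candidate, rmsd))

def rank_folds(query, database, rmsd, factor=1.5):
    """Rank the fold labels of all hits in the box by frequency.

    database is assumed to be a list of (composition, label) pairs.
    """
    hits = [label for composition, label in database
            if in_box(query, composition, rmsd, factor)]
    return Counter(hits).most_common()

# Hypothetical three-component data and per-component RMSD values.
rmsd_per_component = (0.10, 0.10, 0.10)
database = [
    ((0.55, 0.25, 0.20), "alpha"),
    ((0.40, 0.30, 0.30), "alpha"),
    ((0.60, 0.10, 0.30), "beta"),
    ((0.10, 0.60, 0.30), "beta"),   # outside the box
]
print(rank_folds((0.50, 0.20, 0.30), database, rmsd_per_component))
```

The same ranking can be repeated separately for class, architecture, topology, and homology labels to obtain one list per CATH category.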

Short Description of the BeStSel Web Server. We constructed a web server (bestsel.elte.hu) to make the BeStSel method freely available for academic users. The home page provides three program blocks. The first one is the analysis of a single CD spectrum. The second block makes it possible to analyze a series of CD spectra simultaneously, which can be useful when recording CD spectra as a function of temperature, ligand concentration, etc. The third block calculates the secondary structure contents from any known PDB structure on the basis of various definitions for the secondary structure components, such as those of BeStSel, SELCON (CONTIN, CDSSTR), and DSSP. In the test period, the use of the analysis program is password protected. Please type the password “bestseltest” upon data uploading.

In the single spectrum analysis, one can upload the CD data either in normalized form, in Δε or mean residue ellipticity ([Θ]), or as the measured ellipticity. In the latter case, the protein concentration, pathlength, and number of residues are also required. Data can be uploaded as a file or pasted as columns into the BeStSel window. The data file should be in text or .gen format containing the wavelength and CD values in columns. If there is a header, it will be automatically recognized. The data step can be different from 1 nm; the program will automatically convert the data to 1-nm steps. After data loading, a verification page appears where the correct reading of the data can be checked. After clicking on the button “calculate secondary structure,” the BeStSel analysis is carried out automatically on the widest wavelength range, and the result is provided on a graphical page showing the secondary structure contents and the reliability of the fit. Thereafter, one can change the wavelength range, apply a scaling factor, and reanalyze the data. At the bottom of the page, the preferred format of the results can be set, including data in text format. After the secondary structure calculation, fold recognition can be started with a single click. The program carries out three types of searches. (i) Search on the entire PDB for structures that have similar overall secondary structure composition. The closest 20 structures are shown. (ii) Search on the single-domain PDB subset for the closest structures. The 10 closest single domains are displayed with their CATH classifications. (iii) Search on the single-domain PDB subset for the structures that lie within the expected error of the structure estimation. A list is given for the CATH classes with their frequency of occurrence, together with the 10 most frequent architectures and topologies. Links to the PDB entries of the closest structures and representative images for the most frequent CATH topologies are presented.

Throughout the use of the web server, information and help are provided.
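The normalization of a measured ellipticity and the conversion to 1-nm steps described for the single spectrum analysis can be sketched as follows. This is an illustration, not the server's code: the function names are hypothetical, the formulas are the standard textbook relations ([Θ] = θ/(10·c·l·n) with θ in mdeg, c in mol/L, l in cm and n residues, and Δε = [Θ]/3298.2), and linear interpolation is an assumption because the interpolation scheme used by the server is not stated.

```python
import numpy as np

def mdeg_to_delta_epsilon(theta_mdeg, conc_molar, path_cm, n_residues):
    """Convert measured ellipticity (mdeg) to per-residue delta epsilon.

    mean residue ellipticity = theta / (10 * c * l * n)  [deg cm^2 dmol^-1]
    delta epsilon            = mean residue ellipticity / 3298.2  [M^-1 cm^-1]
    """
    theta = np.asarray(theta_mdeg, dtype=float)
    mre = theta / (10.0 * conc_molar * path_cm * n_residues)
    return mre / 3298.2

def resample_to_1nm(wavelength_nm, signal):
    """Linearly interpolate a spectrum onto integer-nanometre wavelengths."""
    wavelength_nm = np.asarray(wavelength_nm, dtype=float)
    signal = np.asarray(signal, dtype=float)
    order = np.argsort(wavelength_nm)  # np.interp needs ascending x values
    grid = np.arange(np.ceil(wavelength_nm.min()),
                     np.floor(wavelength_nm.max()) + 1.0)
    return grid, np.interp(grid, wavelength_nm[order], signal[order])

# Toy spectrum recorded in 0.5-nm steps (hypothetical values).
grid, values = resample_to_1nm(
    [200.0, 200.5, 201.0, 201.5, 202.0],
    [400.0, 401.0, 402.0, 403.0, 404.0],
)
print(grid, values)
```

Sorting before interpolation lets the sketch accept spectra recorded from high to low wavelength, which is the usual scan direction.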

SI Results

Secondary Structure Estimation of K3 Amyloid Fibrils. The example of the amyloid fibrils of the K3 fragment of β2-microglobulin reveals the weakness of the presently available methods in characterizing β-sheet–rich structures, especially parallel β-sheets. The presently available algorithms, with the exception of ours (BeStSel), are unable to provide reliable estimates and usually falsely predict high helix content (Table S1 and Fig. S1). The structure of K3 amyloid fibrils has been solved by solid-state NMR spectroscopy (4) and is shown as a reference in Table S1 and Fig. S1B.

Improvement of the Secondary Structure Prediction During the Optimization Cycles and Final Detailed Performance of BeStSel Method. For each secondary structure component, a separate subset was generated by successively leaving out, one at a time, the protein whose removal most improved the prediction of that particular component on the entire reference set. Simultaneously, the wavelength range was optimized: a wavelength region was left out when its removal most improved the prediction of the particular secondary structure component on the entire reference set. Fig. S2A presents the improvement of the secondary structure prediction during the optimization cycles. Fig. S2B shows the structural RMSD vs. spectral NRMSD after optimization for proteins in SP175+.

Compared with the former algorithms, BeStSel provides extra information on the secondary structure by distinguishing parallel and antiparallel β-sheets and characterizing the twist of the antiparallel β-sheets. Table S2 shows the detailed performance of BeStSel for its eight secondary structure components on SP175+.

BeStSel Analysis of Selected Examples. We applied our method to selected protein samples that have unique secondary structure composition, including amyloid fibrils of various proteins associated with degenerative diseases. We collected most of these spectra by SRCD spectroscopy at the DISCO beamline of the SOLEIL French Synchrotron Facility. Besides BeStSel, the spectra were also analyzed by the most popular methods for comparison. Here, in SI Results, we focus on the performance of BeStSel and do not discuss in detail the usually lower performance of the other methods. Some of the spectra were also used as reference spectra in building up BeStSel for general use. In such cases, the basis spectra were optimized with that spectrum left out to provide unbiased results.

First, we tested our method on proteins with highly twisted β-sheet structures. BeStSel accurately estimates the secondary structure of such proteins. Moreover, information is also provided on the type of β-structure, i.e., that they are highly right-twisted (high Anti3 content). Ecotin (PDB 1ECZ) is a bacterial serine protease inhibitor. Its Ig-like fold is predicted successfully despite the fact that this fold is usually a planar β-sandwich, as in native β2m (Table S4 and Fig. S3 A and B vs. G and H). Its CATH classification is 2.60.40, with mainly β class, the most frequent class in the secondary structure space around the BeStSel result within the expected error. Its architecture, sandwich, is the fifth most frequent architecture, and its Ig-like topology is the fourth most frequent topology. dUTPase (PDB 1Q5U) is a nucleotidohydrolase. Its secondary structure is reliably estimated (Table S4 and Fig. S3 C and D). The class of its 2.70.40 CATH classification is correctly predicted as mainly β, and its distorted sandwich architecture and deoxyuridine 5′-triphosphate nucleotidohydrolase topology are both third on the predicted fold list.

Second, we determined the structure and predicted the fold of a mixed α/β protein, isopropylmalate dehydrogenase (IPMDH), which contains parallel β-sheet. Its fold is sparsely populated and is obscured by the overlapping and more populated Rossmann fold and TIM barrel. However, the third most populated fold in the secondary structure space around the BeStSel result is the IPMDH fold [CATH: 3.40.718; class: α-β, the most frequent class; architecture: three-layer (αβα) sandwich, the most frequent architecture; topology: isopropylmalate dehydrogenase, third most frequent topology].

Third, the secondary structure of various forms of β2-microglobulin was analyzed by SRCD. β2m is the light chain of the MHC-I. In long-term kidney dialysis, its clearance is defective, and the protein is deposited in the form of amyloid fibrils in the vasculo-skeletal system. In vitro, β2m forms amyloid fibrils at low pH, or in the presence of additives at neutral pH. The structure of the protein consists of β-sheets both in the native and in the aggregated states, but their CD spectra are clearly different (Fig. S3 G–J). The 3D structure of the native protein is known [2YXF (3)], consisting of two antiparallel β-sheets (Fig. S3H). The analysis of its CD spectrum with BeStSel accurately gives the overall β-sheet content and, moreover, indicates that the sheets are predominantly relaxed β-sheets (Table S5). Its Ig-like fold is predicted successfully (CATH: 2.60.40; class: mainly β, most frequent class; architecture: sandwich, most frequent architecture; topology: Ig-like, second most frequent topology). In contrast, BeStSel indicates that mature amyloid fibrils consist of parallel β-sheets, which is consistent with previous FTIR (31, 32), NMR, and cryo-EM results (32) (Table S5 and Fig. S3I). Note that other CD analysis methods predict high α-helical content, which is clearly incorrect. Worm-like (WL) protofibrils of β2m that are formed in the presence of high salt concentration show a distinct CD spectrum. By FTIR and VUV-CD, Hiramatsu et al. (33) estimated similar β-sheet contents for this form and the mature β2m fibrils. They explained the spectral and morphological differences by the different protonation levels of the carboxyl groups. Our CD analysis clearly shows that the structure of WL fibrils is dominantly antiparallel β-sheet, whereas mature fibrils contain parallel β-sheets. SRCD spectra of the various forms of β2m were collected at the DISCO beamline of SOLEIL.

Fourth, Aβ (1–42) amyloid fibrils exhibited a characteristic CD spectrum with a large minimum around 218 nm and a maximum at 194 nm. There is a structural model of Aβ (1–42) amyloid fibrils based on solid-state NMR (2BEG) (5), which is in agreement with our results (Table S6 and Fig. S3 K and L). BeStSel accurately estimates that these fibrils contain parallel β-sheet. Other algorithms falsely predict high α-helical content. VARSLC gives no solution.

Fifth, we analyzed the CD spectrum of polyglutamine amyloid fibrils. The aggregation of polyQ sequences is associated with neurodegenerative diseases such as Huntington’s disease. The type of the β-structure and the molecular architecture of the aggregated state of the polyQ sequences are still debated. The CD spectrum of the amyloid fibrils of the K2Q42K2 peptide (11) was provided by Ronald Wetzel (University of Pittsburgh, Pittsburgh, PA). The BeStSel analysis showed that this peptide forms antiparallel β-sheets with a main component of relaxed β-sheets (Anti2) (Table S6 and Fig. S3M). These results are consistent with and strengthen NMR (34) and FTIR data (35).

Sixth, the fibrils of the GNNQQNY peptide, the amyloidogenic fragment of the yeast prion Sup35, were analyzed by SRCD. The aggregated peptide sample showed a spectrum with an unexpected minimum around 200 nm (Fig. S3N). The morphology of the GNNQQNY aggregates was verified by EM, showing the presence of long fibrils. BeStSel revealed that the structure of GNNQQNY fibrils is dominated by a twisted antiparallel β-structure. In contrast to our analysis, the X-ray structure (2OMM) of the peptide determined by Sawaya et al. (36) indicates a parallel cross–β-sheet architecture. Our observation indicates that the structure of amyloid fibrils in solution can differ substantially from the structure in the crystal form of the peptide. In agreement with the conclusion of the solid-state NMR study of Lewandowski et al. (37), it indicates that caution is necessary when using crystal structures in the study of amyloid fibrils.

1. Lees JG, Miles AJ, Wien F, Wallace BA (2006) A reference database for circular dichroism spectroscopy covering fold and secondary structure space. Bioinformatics 22(16):1955–1962.

2. Whitmore L, et al. (2011) PCDDB: The Protein Circular Dichroism Data Bank, a repository for circular dichroism spectral and metadata. Nucleic Acids Res 39(Database issue):D480–D486.

3. Iwata K, Matsuura T, Sakurai K, Nakagawa A, Goto Y (2007) High-resolution crystal structure of beta2-microglobulin formed at pH 7.0. J Biochem 142(3):413–419.

4. Iwata K, et al. (2006) 3D structure of amyloid protofilaments of β2-microglobulin fragment probed by solid-state NMR. Proc Natl Acad Sci USA 103(48):18119–18124.

5. Lührs T, et al. (2005) 3D structure of Alzheimer’s amyloid-β(1-42) fibrils. Proc Natl Acad Sci USA 102(48):17342–17347.

6. Giuliani A, et al. (2009) DISCO: A low-energy multipurpose beamline at synchrotron SOLEIL. J Synchrotron Radiat 16(Pt 6):835–841.

7. Réfrégiers M, et al. (2012) DISCO synchrotron-radiation circular-dichroism endstation at SOLEIL. J Synchrotron Radiat 19(Pt 5):831–835.

8. Abdul-Gader A, Miles AJ, Wallace BA (2011) A reference dataset for the analyses of membrane protein secondary structures and transmembrane residues using circular dichroism spectroscopy. Bioinformatics 27(12):1630–1636.

9. Lees JG, Smith BR, Wien F, Miles AJ, Wallace BA (2004) CDtool—an integrated software package for circular dichroism spectroscopic data processing, analysis, and archiving. Anal Biochem 332(2):285–289.

10. Woollett B, Whitmore L, Janes RW, Wallace BA (2013) ValiDichro: A website for validating and quality control of protein circular dichroism spectra. Nucleic Acids Res 41(Web Server issue):W417–W421.

11. Chen S, Ferrone FA, Wetzel R (2002) Huntington’s disease age-of-onset linked to polyglutamine aggregation nucleation. Proc Natl Acad Sci USA 99(18):11884–11889.

12. Joosten RP, et al. (2011) A series of PDB related databases for everyday needs. Nucleic Acids Res 39(Database issue):D411–D419.

13. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12):2577–2637.

14. Sreerama N, Venyaminov SY, Woody RW (1999) Estimation of the number of alpha-helical and beta-strand segments in proteins using circular dichroism spectroscopy. Protein Sci 8(2):370–380.

15. Ho BK, Curmi PM (2002) Twist and shear in beta-sheets and beta-ribbons. J Mol Biol 317(2):291–308.

16. Provencher SW, Glöckner J (1981) Estimation of globular protein secondary structure from circular dichroism. Biochemistry 20(1):33–37.

17. Sreerama N, Woody RW (2000) Estimation of protein secondary structure from circular dichroism spectra: Comparison of CONTIN, SELCON, and CDSSTR methods with an expanded reference set. Anal Biochem 287(2):252–260.

18. Perez-Iratxeta C, Andrade-Navarro MA (2008) K2D2: Estimation of protein secondary structure from circular dichroism spectra. BMC Struct Biol 8:25.

19. Louis-Jeune C, Andrade-Navarro MA, Perez-Iratxeta C (2012) Prediction of protein secondary structure from circular dichroism using theoretically derived spectra. Proteins 80(2):374–381.

20. Wiedemann C, Bellstedt P, Görlach M (2013) CAPITO—a Web server-based analysis and plotting tool for circular dichroism data. Bioinformatics 29(14):1750–1757.

21. Manavalan P, Johnson WC, Jr (1987) Variable selection method improves the prediction of protein secondary structure from circular dichroism spectra. Anal Biochem 167(1):76–85.

22. Toumadje A, Alcorn SW, Johnson WC, Jr (1992) Extending CD spectra of proteins to 168 nm improves the analysis for secondary structures. Anal Biochem 200(2):321–331.

23. Böhm G, Muhr R, Jaenicke R (1992) Quantitative analysis of protein far UV circular dichroism spectra by neural networks. Protein Eng 5(3):191–195.

24. Feldmann RJ (1976) AMSOM: Atlas of Macromolecular Structure on Microfiche (Tracor Jitco, Rockville, MD).

25. Coleman TF, Li YY (1996) A reflective Newton method for minimizing a quadratic function subject to bounds on some of the variables. SIAM J Optim 6(4):1040–1058.

26. Mao D, Wachter E, Wallace BA (1982) Folding of the mitochondrial proton adenosinetriphosphatase proteolipid channel in phospholipid vesicles. Biochemistry 21(20):4960–4968.

27. Wallace BA (2000) Synchrotron radiation circular-dichroism spectroscopy as a tool for investigating protein structures. J Synchrotron Radiat 7(Pt 5):289–295.

28. Lees JG, Miles AJ, Janes RW, Wallace BA (2006) Novel methods for secondary structure determination using low wavelength (VUV) circular dichroism spectroscopic data. BMC Bioinformatics 7:507.

29. Sillitoe I, et al. (2013) New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 41(Database issue):D490–D498.

30. Wang G, Dunbrack RL, Jr (2003) PISCES: A protein sequence culling server. Bioinformatics 19(12):1589–1591.

31. Kardos J, et al. (2005) Structural studies reveal that the diverse morphology of beta(2)-microglobulin aggregates is a reflection of different molecular architectures. Biochim Biophys Acta 1753(1):108–120.

32. Fabian H, et al. (2013) IR spectroscopic analyses of amyloid fibril formation of β2-microglobulin using a simplified procedure for its in vitro generation at neutral pH. Biophys Chem 179:35–46.

33. Hiramatsu H, et al. (2010) Differences in the molecular structure of beta(2)-microglobulin between two morphologically different amyloid fibrils. Biochemistry 49(4):742–751.

34. Schneider R, et al. (2011) Structural characterization of polyglutamine fibrils by solid-state NMR spectroscopy. J Mol Biol 412(1):121–136.

35. Buchanan LE, et al. (2014) Structural motif of polyglutamine amyloid fibrils discerned with mixed-isotope infrared spectroscopy. Proc Natl Acad Sci USA 111(16):5796–5801.

36. Sawaya MR, et al. (2007) Atomic structures of amyloid cross-beta spines reveal varied steric zippers. Nature 447(7143):453–457.

37. Lewandowski JR, van der Wel PC, Rigney M, Grigorieff N, Griffin RG (2011) Structural complexity of a composite amyloid fibril. J Am Chem Soc 133(37):14686–14698.

Fig. S1. CD spectrum and fitting with various algorithms of the amyloid form of K3 peptide fragment of β2-microglobulin (A) and 2E8D structural model from solid-state NMR (4) (B).

Fig. S2. Optimization of the algorithm and structural RMSD vs. spectral NRMSD. (A) Improvement of the secondary structure prediction during the optimization cycles to get the basis spectra sets for the eight secondary structure components (175- to 250-nm range). RMSD between the calculated and the X-ray structure on the entire SP175+ with cross validation is shown. (B) Structural RMSD vs. spectral NRMSD for proteins in SP175+ as treated by BeStSel (red circles) and SELCON3 (blue circles).

Fig. S3. CD spectra and fitted curves using various CD analysis algorithms of selected examples. (A and B) SRCD spectrum and 3D structure of the highly twisted β-sheet containing ecotin and (C and D) human dUTPase. (E and F) SRCD spectrum and structure of the α-β protein Th. thermophilus isopropylmalate dehydrogenase (IPMDH), which contains parallel β-sheet. (G and H) SRCD spectrum and 3D structure of native β2-microglobulin; (I and J) SRCD spectra of LS (long straight or mature) and WL (worm-like or immature) fibrils of β2m, respectively. (K and L) Spectrum and structure of Aβ (1–42) amyloid fibrils; (M) CD spectrum of K2Q42K2 polyQ fibrils provided by R. Wetzel. (N–P) Spectrum, EM image, and X-ray structure of the GNNQQNY fragment of yeast prion protein Sup35, respectively. GNNQQNY fibrils were formed in water at 10 mg/mL concentration by incubation at 37 °C for 1 d. EM samples were applied directly to 300-mesh formvar/carbon-coated copper grids, allowed to adhere for 1 min, and stained for 40 s with 1% uranyl acetate. Grids were examined by a JEOL 100 CX II transmission electron microscope (JEOL) at an accelerating voltage of 60 kV. Analysis of the CD spectrum results in a secondary structure that is markedly different from X-ray structure 2OMM.

Table S1. Secondary structure analysis of the amyloid form of K3 peptide fragment of β2-microglobulin

Algorithm   Helix   β-sheet  Antiparallel β  Parallel β  Turn+Others  Turn   Others
SELCON      0.351   0.319                                0.303        0.087  0.216
CONTIN      0.306   0.444                                0.249        0.004  0.245
CDSSTR      0.600   0.140                                0.250        0.100  0.150
K2D         0.000   1.000                                0.000
K2D2        0.875   0.005                                0.120
K2D3        0.262   0.220                                0.518
CAPITO      0.660   0.980                                0.000
NMR(2E8D)   0.000   0.545                                0.455        0.011  0.444
LINCOMB     0.212   0.788    0.000           0.788       0.000        0.000  0.000
CDNN        0.158   0.486    0.011           0.475       0.566        0.028  0.538
VARSLC      Number of ref. protein combination matching the test protein: 0
BeStSel     0.041   0.557    0.000           0.557       0.402        0.039  0.363
NMR(2E8D)   0.000   0.545    0.000           0.545       0.455        0.011  0.444

The algorithms are grouped according to the varying definitions of the secondary structure elements. The appropriate secondary structure contents calculated from the solid-state NMR structure [2E8D (4)] are presented for each group. Although some algorithms distinguish more than one helical and β-sheet element (SI Methods), these are summed for easier comparison. The BeStSel method described in this manuscript gives an accurate estimation of the secondary structure content of K3 amyloid fibrils (cross-validated result).

Table S2. Performance indices for the eight structural components of BeStSel on SP175+

                     BeStSel*                                                    SELCON3_mod†
                     175–250 nm     180–250 nm     190–250 nm     200–250 nm     175–250 nm
Secondary structure  RMSD   Corr.   RMSD   Corr.   RMSD   Corr.   RMSD   Corr.   RMSD   Corr.
Helix1               0.028  0.98    0.037  0.97    0.037  0.97    0.029  0.98    0.055  0.93
Helix2               0.026  0.92    0.025  0.93    0.027  0.91    0.028  0.91    0.034  0.86
Anti1                0.017  0.85    0.017  0.85    0.019  0.80    0.023  0.69    0.025  0.63
Anti2                0.038  0.90    0.035  0.92    0.038  0.91    0.036  0.92    0.051  0.82
Anti3                0.043  0.87    0.045  0.85    0.038  0.89    0.048  0.84    0.051  0.80
Parallel             0.039  0.92    0.044  0.90    0.044  0.89    0.045  0.91    0.073  0.72
Turn                 0.036  0.63    0.038  0.60    0.037  0.59    0.034  0.65    0.044  0.32
Others               0.057  0.81    0.059  0.80    0.058  0.80    0.065  0.75    0.074  0.64
Helix                0.042  0.98    0.051  0.97    0.052  0.97    0.044  0.98    0.082  0.92
Antiparallel         0.067  0.93    0.068  0.93    0.068  0.93    0.075  0.91    0.103  0.82
β-sheet              0.060  0.93    0.060  0.93    0.056  0.94    0.071  0.91    0.088  0.85
Turn+Others          0.060  0.83    0.063  0.81    0.058  0.84    0.067  0.79    0.076  0.69

*RMSD and Pearson correlation for different wavelength ranges are provided.
†For comparison, SELMAT algorithm, which contains the SELCON3 method and is flexible regarding the reference database and structural components, was adapted to use the same structural components (SELCON3_mod).
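The two performance indices of Table S2 can be computed for one structural component as in the short Python sketch below. The function name is hypothetical, and the three value pairs (BeStSel estimate vs. PDB-derived β-sheet content of ecotin, dUTPase, and IPMDH from Table S4) serve only as a tiny illustration; the table itself is based on the whole SP175+ set with cross validation.

```python
import numpy as np

def performance_indices(estimated, reference):
    """Return (RMSD, Pearson correlation) between two sets of fractions."""
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    rmsd = float(np.sqrt(np.mean((estimated - reference) ** 2)))
    corr = float(np.corrcoef(estimated, reference)[0, 1])
    return rmsd, corr

# β-sheet content (column B of Table S4): BeStSel estimate vs. PDB structure.
estimated = [0.348, 0.338, 0.129]  # ecotin, dUTPase, IPMDH
reference = [0.373, 0.386, 0.169]
rmsd, corr = performance_indices(estimated, reference)
print(round(rmsd, 3), round(corr, 2))
```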

Table S3. List of the proteins of the extra test set

Protein                                              PDB*    Origin of CD†   PCDDB ID‡
Aβ                                                   2BEG    Own experiment  3900
α1-Antitrypsin                                       1QLP    Own experiment  3890
α-2-Macroglobulin                                    4ACQ    Own experiment  3893
Antithrombin                                         1SR5.A  Own experiment  3892
Avidin                                               1RAV    SP175           0008
β2-Microglobulin                                     2YXF    Own experiment  3894
ClC-ec1                                              1KPK    MP180           0104
Ecotin                                               1ECZ    Own experiment  3896
Ferrienterobactin receptor                           1FEP    MP180           0107
FhuA                                                 1FCP    MP180           0108
Human dUTPase                                        1Q5U    Own experiment  3897
Jacalin                                              1KU8    SP175           0041
K3                                                   2E8D    Own experiment  3901
Lactose permease                                     2CFQ    MP180           0112
Large-conductance mechanosensitive channel           2OAR    MP180           0115
Light harvesting protein                             1NKZ    MP180           0114
Na(+):neurotransmitter symporter [Snf (nss) family]  2A65    MP180           0113
Outer membrane lipoprotein Wza                       2J58    MP180           0128
Preprotein translocase subunit secY                  1RH5    MP180           0124
Reaction center protein                              1PCR    MP180           0121
Rhodopsin (dark)                                     1HZX    MP180           0123
Rhomboid protease glpG                               2NR9    MP180           0109
Sarcoplasmic/endoplasmic reticulum calcium ATPase 1  1T5S    MP180           0125
Sucrose porin                                        1A0S    MP180           0127
TraF protein                                         3JQO    MP180           0120

*PDB code of the atomic resolution structure.
†Origin of the CD spectrum. Some spectra were measured by SRCD at the DISCO beamline in SOLEIL by the authors of the present study and deposited to PCDDB; others were downloaded from the MP180 and SP175 reference dataset of PCDDB (2).
‡The full PCDDB codes are CD000xxxx000, where xxxx is the number in the PCDDB ID column.
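The rule in the last footnote for building the full PCDDB code can be written as a small Python helper. The function name `full_pcddb_code` is hypothetical and is used only to illustrate the CD000xxxx000 pattern.

```python
def full_pcddb_code(short_id):
    """Expand a four-digit PCDDB ID from Table S3 into the full code."""
    digits = str(short_id)
    if not digits.isdigit() or len(digits) > 4:
        raise ValueError(f"expected up to four digits, got {short_id!r}")
    # Pad to four digits so that, e.g., 8 and "0008" give the same code.
    return f"CD000{digits.zfill(4)}000"

print(full_pcddb_code("3900"))  # CD0003900000 (Aβ)
print(full_pcddb_code("0008"))  # CD0000008000 (avidin)
```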

Table S4. Secondary structure estimation of ecotin, human dUTPase, and isopropylmalate dehydrogenase by BeStSel and comparison with different methods

Ecotin

Algorithm   H      B      TO     H1     H2     A1     A2     A3     A      P      T      O
PDB: 1ECZ   0.000  0.373  0.627  0.000  0.000  0.000  0.054  0.287  0.338  0.036  0.077  0.549
BeStSel     0.016  0.348  0.636  0.000  0.016  0.000  0.064  0.230  0.294  0.054  0.062  0.574

Algorithm   H      B      TO     H1     H2     S1     S2     T      O
PDB: 1ECZ   0.035  0.373  0.592  0.000  0.035  0.211  0.162  0.134  0.458
SELCON      0.169  0.268  0.572  0.066  0.103  0.164  0.104  0.147  0.425
CONTIN      0.067  0.374  0.558  0.015  0.052  0.246  0.128  0.116  0.442
CDSSTR      0.080  0.380  0.550  0.030  0.050  0.250  0.130  0.130  0.420

Algorithm   H      B      TO     H      A      P      T      O
PDB: 1ECZ   0.035  0.373  0.592  0.035  0.337  0.036  0.127  0.465
VARSLC      Number of ref. protein combination matching the test protein: 0
LINCOMB     0.113  0.057  0.830  0.113  0.000  0.057  0.668  0.162
CDNN        0.029  0.230  0.785  0.029  0.158  0.072  0.311  0.474

Algorithm   H      B      TO
PDB: 1ECZ   0.000  0.373  0.626
K2D         0.280  0.330  0.400
K2D2        0.105  0.291  0.605
K2D3        0.062  0.084  0.854

Algorithm   H      B      TO
PDB: 1ECZ   0.035  0.412  0.553
CAPITO      0.050  0.060  0.740

Human dUTPase

Algorithm   H      B      TO     H1     H2     A1     A2     A3     A      P      T      O
PDB: 1Q5U   0.047  0.386  0.567  0.020  0.027  0.012  0.081  0.254  0.347  0.039  0.086  0.481
BeStSel     0.079  0.338  0.583  0.055  0.024  0.003  0.118  0.192  0.313  0.025  0.102  0.481

Algorithm   H      B      TO     H1     H2     S1     S2     T      O
PDB: 1Q5U   0.047  0.385  0.566  0.020  0.027  0.222  0.163  0.156  0.410
SELCON      0.118  0.380  0.503  0.052  0.066  0.249  0.131  0.125  0.378
CONTIN      0.060  0.390  0.550  0.004  0.056  0.251  0.139  0.135  0.415
CDSSTR      0.050  0.410  0.540  0.000  0.050  0.270  0.140  0.130  0.410

Algorithm   H      B      TO     H      A      P      T      O
PDB: 1Q5U   0.048  0.386  0.566  0.048  0.347  0.039  0.156  0.410
VARSLC      0.060  0.550  0.460  0.060  0.520  0.030  0.290  0.170
LINCOMB     0.049  0.388  0.563  0.049  0.388  0.000  0.563  0.000
CDNN        0.063  0.281  0.676  0.063  0.232  0.049  0.342  0.334

Algorithm   H      B      TO
PDB: 1Q5U   0.048  0.385  0.567
K2D         0.080  0.450  0.470
K2D2        0.076  0.303  0.621
K2D3        0.083  0.111  0.806

Algorithm   H      B      TO
PDB: 1Q5U   0.048  0.399  0.553
CAPITO      0.080  0.230  0.670

Isopropylmalate dehydrogenase

Algorithm   H      B      TO     H1     H2     A1     A2     A3     A      P      T      O
PDB: 2Y3Z   0.413  0.169  0.418  0.279  0.134  0.008  0.033  0.029  0.070  0.099  0.095  0.323
BeStSel     0.402  0.129  0.468  0.264  0.138  0.014  0.012  0.000  0.026  0.103  0.129  0.339

Algorithm   H      B      TO     H1     H2     S1     S2     T      O
PDB: 2Y3Z   0.471  0.170  0.359  0.279  0.192  0.103  0.067  0.153  0.206
SELCON      0.439  0.164  0.399  0.265  0.174  0.101  0.063  0.095  0.304
CONTIN      0.439  0.165  0.396  0.268  0.171  0.104  0.061  0.102  0.294
CDSSTR      0.450  0.160  0.390  0.270  0.180  0.100  0.060  0.110  0.280

Algorithm   H      B      TO     H      A      P      T      O
PDB: 2Y3Z   0.470  0.169  0.359  0.470  0.070  0.099  0.153  0.206
VARSLC      0.420  0.190  0.420  0.420  0.140  0.050  0.160  0.260
LINCOMB     0.444  0.221  0.335  0.444  0.009  0.211  0.093  0.242
CDNN        0.503  0.124  0.425  0.503  0.022  0.102  0.152  0.273

Algorithm   H      B      TO
PDB: 2Y3Z   0.412  0.170  0.418
K2D         0.610  0.060  0.330
K2D2        0.418  0.131  0.452

Table S4. Cont.

Isopropylmalate dehydrogenase

Algorithm H B TO H1 H2 A1 A2 A3 A P T O

K2D3        0.354  0.209  0.437

Algorithm   H      B      TO
PDB: 2Y3Z   0.470  0.170  0.360
CAPITO      0.290  0.240  0.370

Secondary structure components are abbreviated as follows: H, helix; B, β-sheet; TO, turn+others; H1, helix1; H2, helix2; A1, anti1; A2, anti2; A3, anti3; P, parallel β-sheet; T, turn; O, others; S1, strand1; S2, strand2; A, antiparallel β-sheet.
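As a reading aid for the BeStSel rows, the short Python sketch below shows how the summed columns appear to follow from the eight components (H = H1 + H2, A = A1 + A2 + A3, B = A + P, TO = T + O). These relations are inferred from the abbreviations and the tabulated numbers rather than stated explicitly, and the function name is hypothetical.

```python
def aggregate_components(c):
    """Sum the eight BeStSel components into the H, A, B, and TO columns."""
    helix = c["H1"] + c["H2"]
    anti = c["A1"] + c["A2"] + c["A3"]
    return {
        "H": round(helix, 3),
        "A": round(anti, 3),
        "B": round(anti + c["P"], 3),   # all β-sheet = antiparallel + parallel
        "TO": round(c["T"] + c["O"], 3),
    }

# BeStSel row for ecotin (Table S4).
ecotin = {"H1": 0.000, "H2": 0.016, "A1": 0.000, "A2": 0.064,
          "A3": 0.230, "P": 0.054, "T": 0.062, "O": 0.574}
summed = aggregate_components(ecotin)
print(summed)  # {'H': 0.016, 'A': 0.294, 'B': 0.348, 'TO': 0.636}
```

The summed values reproduce the H, A, B, and TO entries of the same table row, which is what makes the comparison with algorithms that report fewer components possible.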

Table S5. Secondary structure estimation of various forms of β2-microglobulin by BeStSel and comparison with different methods

Native β2-microglobulin

Algorithm   H      B      TO     H1     H2     A1     A2     A3     A      P      T      O
PDB: 2YXF   0.000  0.483  0.520  0.000  0.000  0.029  0.279  0.175  0.483  0.000  0.080  0.440
BeStSel     0.000  0.499  0.501  0.000  0.000  0.065  0.262  0.172  0.499  0.000  0.090  0.411

Algorithm   H      B      TO     H1     H2     S1     S2     T      O
PDB: 2YXF   0.030  0.480  0.490  0.000  0.030  0.320  0.160  0.160  0.330
SELCON      0.076  0.508  0.438  0.020  0.056  0.368  0.140  0.066  0.372
CONTIN      0.045  0.539  0.416  0.001  0.044  0.389  0.150  0.066  0.350
CDSSTR      0.000  0.530  0.440 −0.010  0.010  0.380  0.150  0.080  0.360

Algorithm   H      B      TO     H      A      P      T      O
PDB: 2YXF   0.030  0.483  0.487  0.030  0.483  0.000  0.080  0.407
VARSLC      0.050  0.430  0.620  0.050  0.300  0.130  0.130  0.490
LINCOMB     0.000  0.526  0.474  0.000  0.232  0.294  0.000  0.474
CDNN        0.032  0.623  0.437  0.032  0.521  0.102  0.102  0.335

Algorithm   H      B      TO
PDB: 2YXF   0.000  0.483  0.520
K2D         0.020  0.510  0.470
K2D2        0.031  0.493  0.476
K2D3        0.045  0.338  0.617

Algorithm   H      B      TO
PDB: 2YXF   0.030  0.500  0.470
CAPITO      0.000  0.830  0.370

β2-microglobulin LS (long straight or mature) fibrils

Algorithm   H      B      TO     H1     H2     A1     A2     A3     A      P      T      O
X-ray       —      —      —      —      —      —      —      —      —      —      —      —
BeStSel     0.025  0.414  0.561  0.000  0.025  0.000  0.000  0.000  0.000  0.414  0.072  0.489

Algorithm   H      B      TO     H1     H2     S1     S2     T      O
X-ray       —      —      —      —      —      —      —      —      —
SELCON      0.335  0.234  0.473  0.180  0.155  0.148  0.086  0.115  0.358
CONTIN      0.365  0.143  0.493  0.257  0.108  0.118  0.025  0.167  0.326
CDSSTR      0.510  0.210  0.280  0.360  0.150  0.150  0.060  0.100  0.180

Algorithm   H      B      TO     H      A      P      T      O
X-ray       —      —      —      —      —      —      —      —
VARSLC      Number of ref. protein combination matching the test protein: 0
LINCOMB     0.194  0.473  0.333  0.194  0.000  0.473  0.000  0.333
CDNN        0.065  0.512  0.616  0.065  0.154  0.358  0.022  0.594

Algorithm   H      B      TO
X-ray       —      —      —
K2D         0.000  1.000  0.000
K2D2        0.500  0.090  0.410
K2D3        0.132  0.334  0.535

Algorithm   H      B      TO
X-ray       —      —      —
CAPITO      0.590  0.580  0.160

β2-microglobulin WL (worm-like or immature) fibrils

Algorithm   H      B      TO     H1     H2     A1     A2     A3     A      P      T      O
X-ray       —      —      —      —      —      —      —      —      —      —      —      —
BeStSel     0.127  0.468  0.405  0.127  0.000  0.000  0.179  0.289  0.468  0.000  0.115  0.290

Algorithm   H      B      TO     H1     H2     S1     S2     T      O
X-ray       —      —      —      —      —      —      —      —      —
SELCON      0.183  0.341  0.479  0.079  0.104  0.229  0.112  0.116  0.363
CONTIN      0.168  0.353  0.478  0.072  0.096  0.238  0.115  0.119  0.359
CDSSTR      0.170  0.360  0.460  0.080  0.090  0.250  0.110  0.110  0.350

Algorithm   H      B      TO     H      A      P      T      O
X-ray       —      —      —      —      —      —      —      —
VARSLC      Number of ref. protein combination matching the test protein: 0
LINCOMB     0.318  0.137  0.5452 0.318  0.137  0.000  0.409  0.136
CDNN        0.282  0.188  0.558  0.282  0.109  0.079  0.221  0.337

Algorithm   H      B      TO
X-ray       —      —      —
K2D         0.270  0.160  0.560
K2D2        0.243  0.220  0.537

Table S5. Cont.

β2-Microglobulin WL (worm-like or immature) fibrils

Algorithm H B TO H1 H2 A1 A2 A3 A P T O

K2D3        0.289  0.097  0.614

Algorithm   H      B      TO
X-ray       —      —      —
CAPITO      0.130  0.120  0.610

Secondary structure components are abbreviated as follows: H, helix; B, β-sheet; TO, turn+others; H1, helix1; H2, helix2; A1, anti1; A2, anti2; A3, anti3; P, parallel β-sheet; T, turn; O, others; S1, strand1; S2, strand2; A, antiparallel β-sheet.

Table S6. Secondary structure estimation of different amyloid fibrils by BeStSel and comparison with differentmethods

Polyglutamine amyloid fibrils

Algorithm        H       B      TO      H1      H2      A1      A2      A3       A       P       T       O
X-ray            —       —       —       —       —       —       —       —       —       —       —       —
BeStSel      0.000   0.552   0.448   0.000   0.000   0.113   0.262   0.130   0.505   0.046   0.122   0.326

Algorithm        H       B      TO      H1      H2      S1      S2       T       O
X-ray            —       —       —       —       —       —       —       —       —
SELCON       0.010   0.512   0.464  −0.014   0.024   0.377   0.135   0.102   0.362
CONTIN       0.031   0.543   0.426   0.000   0.031   0.398   0.145   0.073   0.353
CDSSTR       0.000   0.520   0.450  −0.020   0.020   0.370   0.150   0.090   0.360

Algorithm        H       B      TO       H       A       P       T       O
X-ray            —       —       —       —       —       —       —       —
VARSLC       Number of ref. protein combination matching the test protein: 0
LINCOMB      0.000   0.706   0.294   0.000   0.108   0.598   0.000   0.294
CDNN         0.086   0.702   0.330   0.086   0.643   0.059   0.091   0.239

Algorithm        H       B      TO
X-ray            —       —       —
K2D          0.020   0.510   0.470
K2D2         0.084   0.455   0.461
K2D3         0.016   0.402   0.583

Algorithm        H       B      TO
X-ray            —       —       —
CAPITO       0.020   0.880   0.320

Aβ (1–42) amyloid fibrils

Algorithm        H       B      TO      H1      H2      A1      A2      A3       A       P       T       O
PDB: 2BEG    0.000   0.476   0.524   0.000   0.000   0.000   0.000   0.000   0.000   0.476   0.000   0.524
BeStSel      0.014   0.491   0.495   0.002   0.012   0.013   0.000   0.000   0.013   0.478   0.000   0.495

Algorithm        H       B      TO      H1      H2      S1      S2       T       O
PDB: 2BEG    0.000   0.476   0.524   0.000   0.000   0.095   0.381   0.048   0.476
SELCON       0.564   0.098   0.356   0.378   0.186   0.059   0.039   0.102   0.254
CONTIN       0.236   0.356   0.408   0.147   0.089   0.259   0.097   0.063   0.345
CDSSTR       0.690   0.100   0.200   0.500   0.190   0.070   0.030   0.070   0.130

Algorithm        H       B      TO       H       A       P       T       O
PDB: 2BEG    0.000   0.476   0.524   0.000   0.000   0.476   0.000   0.524
VARSLC       Number of ref. protein combination matching the test protein: 0
LINCOMB      0.466   0.358   0.175   0.466   0.000   0.358   0.000   0.175
CDNN         0.087   0.428   0.554   0.087   0.020   0.408   0.029   0.525

Algorithm        H       B      TO
PDB: 2BEG    0.000   0.476   0.524
K2D          0.820   0.010   0.170
K2D2         0.563   0.057   0.380
K2D3         0.566   0.047   0.387

Algorithm        H       B      TO
PDB: 2BEG    0.000   0.476   0.524
CAPITO       0.990   0.380   0.270

GNNQQNY amyloid fibrils

Algorithm        H       B      TO      H1      H2      A1      A2      A3       P       A       T       O
BeStSel      0.000   0.324   0.676   0.000   0.000   0.014   0.044   0.266   0.000   0.324   0.184   0.492

Algorithm        H       B      TO      H1      H2      S1      S2       T       O
SELCON       0.153   0.313   0.487   0.070   0.083   0.201   0.112   0.107   0.380
CONTIN       0.046   0.449   0.505   0.001   0.045   0.318   0.131   0.117   0.388
CDSSTR      −0.050   0.370   0.650  −0.020  −0.030   0.240   0.130   0.150   0.500

Algorithm        H       B      TO       H       A       P       T       O
VARSLC       0.110   0.180   0.820   0.110   0.180   0.000   0.520   0.300
LINCOMB      0.105   0.000   0.896   0.105   0.000   0.000   0.613   0.283
CDNN         0.043   0.130   0.919   0.043   0.098   0.032   0.365   0.554

Algorithm        H       B      TO
K2D          0.010   0.050   0.940
K2D2         0.082   0.287   0.631
K2D3         0.038   0.094   0.868

Algorithm        H       B      TO
CAPITO       0.050   0.350   0.860

Secondary structure components are abbreviated as follows: H, helix; B, β-sheet; TO, turn+others; H1, helix1; H2, helix2; A1, anti1; A2, anti2; A3, anti3; A, antiparallel β-sheet; P, parallel β-sheet; T, turn; O, others; S1, strand1; S2, strand2.
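For the Aβ (1–42) fibrils, where a PDB-derived reference (2BEG) is listed, the estimates can be compared numerically. The sketch below ranks the algorithms by the root-mean-square deviation (RMSD) of their H, B, and TO values from the 2BEG row. Restricting the comparison to these three coarse classes, and using RMSD as the measure, are illustrative choices made here because H, B, and TO are the only columns reported by every method; this is not a reproduction of the evaluation procedure of the main text. VARSLC is omitted because it returned no solution for this spectrum. The names `reference`, `estimates`, `rmsd`, and `ranked` are hypothetical.

```python
import math

# Reference fractions (H, B, TO) from the "PDB: 2BEG" row of Table S6
reference = (0.000, 0.476, 0.524)

# Estimated fractions (H, B, TO) for the Abeta (1-42) fibrils, copied from Table S6
estimates = {
    "BeStSel": (0.014, 0.491, 0.495),
    "SELCON": (0.564, 0.098, 0.356),
    "CONTIN": (0.236, 0.356, 0.408),
    "CDSSTR": (0.690, 0.100, 0.200),
    "LINCOMB": (0.466, 0.358, 0.175),
    "CDNN": (0.087, 0.428, 0.554),
    "K2D": (0.820, 0.010, 0.170),
    "K2D2": (0.563, 0.057, 0.380),
    "K2D3": (0.566, 0.047, 0.387),
    "CAPITO": (0.990, 0.380, 0.270),
}


def rmsd(estimate, ref):
    """Root-mean-square deviation between two equally long tuples of fractions."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, ref)) / len(ref))


# Sort algorithms from smallest to largest deviation from the reference
ranked = sorted(
    ((name, rmsd(values, reference)) for name, values in estimates.items()),
    key=lambda item: item[1],
)

for name, value in ranked:
    print(f"{name:8s} {value:.3f}")
```

With these table values, the BeStSel estimate has the smallest deviation from the 2BEG reference among the listed methods.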
