
Universal and accessible entropy estimation using a compression algorithm

Ram Avinery,∗ Micha Kornreich,† and Roy Beck‡

The Raymond and Beverly Sackler School of Physics and Astronomy, Tel Aviv University, Tel Aviv 69978, Israel

Entropy and free-energy estimation are key in thermodynamic characterization of simulated systems ranging from spin models through polymers, colloids, protein structure, and drug design. Current techniques suffer from being model specific, requiring abundant computation resources and simulation at conditions far from the studied realization. Here, we present a universal scheme to calculate entropy using lossless compression algorithms and validate it on simulated systems of increasing complexity. Our results show accurate entropy values compared to benchmark calculations while being computationally effective. In molecular-dynamics simulations of protein folding, we exhibit unmatched detection capability of the folded states by measuring previously undetectable entropy fluctuations along the simulation timeline. Such entropy evaluation opens a new window onto the dynamics of complex systems and allows efficient free-energy calculations.

Utilizing the exponentially growing power of computers enables in-silico experiments of complex and dynamic systems [1]. In these systems, entropy (S) and enthalpy (H) should be evaluated to appraise the system's thermodynamic properties. While enthalpy can be directly calculated from the interaction strength between the system's components, computing the entropy of an equilibrated canonical system essentially requires inferring the probabilities of all relevant microstates (i.e., specific configurations). Consequently, for large systems, contemporary computational capabilities struggle to simulate sufficient microstates for adequate mapping of the free-energy landscape. This fact limits the current ability to estimate thermodynamic properties of interesting systems and phenomena including, e.g., protein folding [1–3].

Present strategies to estimate entropy from simulations include density- or work-based methods [4]. These methods have proven useful, though they rely on plentiful computational power and on simulations away from the designated realization [5–7]. Notably, no single method for entropy and free-energy evaluation can be viewed as superior to the others, and in many cases the choice is system dependent [8]. As an alternative path, a reduced phase-space assignment can be used, as previously demonstrated in protein-folding simulations [1, 2, 9]. There, a priori knowledge, such as the experimental native protein structure, is used to attribute each frame to a specific state (e.g., folded or unfolded). A rough estimate of the system's free energy and thermodynamic properties is then attained from the respective state populations, although entropy values are not directly assigned.

Seminal papers in information theory by Shannon [10] and Kolmogorov [11] introduced measures of uncertainty which are mathematically identical to the statistical-mechanics definition of entropy in the large-dataset limit. Lossless compression algorithms are essentially practical implementations attempting to realize Kolmogorov complexity [12, 13]. Recognizing these relations has produced novel analytical methods for studying mutual information in sequences of symbols, for internet traffic analysis, and for redundancy anomaly detection in medical signal analysis (electroencephalography, electrocardiography, and more) [14–17]. Despite these important links, studies of physical systems using lossless compression are rather sparse. Exceptions include recent studies on thermodynamic phase transitions [18–21]. Additional details on previous studies involving compression algorithms for physical systems are given in the Supplemental Material [22].

Here, we present a framework for accessible and accurate asymptotic entropy (SA) calculation using a lossless compression algorithm. Conceptually, the redundancy of information stored in a recorded simulation is tightly related to the entropy of the physical system being simulated. At the foundation of our method, we use a lossless compression algorithm which is optimized to remove information redundancy by locating repeated patterns within a stream of data. Thus, the ability to compress a digital representation of a physical system is directly related to the entropy of that system [22, 23]. We note that other methods exist for the estimation of information entropy in a data stream [23]. Here we chose to utilize lossless compression due to its availability and ease of use.

As a proof of concept, we verify our entropy estimation on various model systems where the entropy is analytically calculated and compared. Later, the direct application of asymptotic entropy calculation is demonstrated on protein-folding simulations, where entropy estimation is challenging.

Most adopted lossless compression implementations derive from schemes introduced by Lempel and Ziv (LZ) [24–26]. LZ algorithms process an input sequence of symbols in a finite alphabet and produce a compressed output sequence by replacing short segments with a reference to a previous instance of the same segment. Fundamentally, the ratio of LZ-compressed to input sequence lengths has been proven to converge to Shannon's entropy definition [16, 27]. This convergence is guaranteed for an infinite sequence of symbols produced by an ergodic random source. A sequence of independent microstates sampled from a physical system in equilibrium is in accord with the required random source [22].
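This convergence can be checked numerically. The following short Python sketch (our own illustration; the paper itself used the 7-Zip LZMA implementation) compresses a long sequence drawn from a biased binary source and compares the compression ratio to the Shannon bound of that source:

import lzma
import math
import numpy as np

# Biased binary source: p(1) = 0.1; one symbol stored per byte.
rng = np.random.default_rng(0)
p = 0.1
seq = (rng.random(10**6) < p).astype(np.uint8).tobytes()

ratio = len(lzma.compress(seq)) / len(seq)
# Shannon entropy in bits/symbol, converted to compressed-bytes per input byte.
h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p)) / 8
print(f"LZMA ratio: {ratio:.3f}, Shannon bound: {h:.3f}")

The measured ratio approaches the Shannon bound from above as the sequence grows, which is the behavior discussed next.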

One expects LZ schemes to produce an upper bound on the physical entropy and to approach it asymptotically for large datasets [27]. In practice, our entropy estimation converges to within a few percent of expected values, even for relatively small datasets. This result, in combination with readily available enthalpy values from the simulations, allows us to construct enthalpy-entropy population diagrams for the complex and dynamic simulation of protein folding.

To calculate entropy using a compression-based algorithm, we must quantitatively map the information content (compressed length) to entropy on the proper scale. However, preliminary steps are required to eliminate spurious effects that result from the combination of translating physical systems into 1d datasets, the physical nature of the specific problem, and the algorithm's limitations.

Several physical systems are represented using continuous variables. There, each degree of freedom requires an enormous alphabet for its representation. This poses a difficulty for compression, since the least-significant digits are noisy and hence incompressible. Therefore, a preprocessing step is required to reduce the alphabet variability to a coarse-grained representation with ns values. For additional preprocessing details see [22].

Next, we take the discretized configurations and store them contiguously in a 1d file [22]. We denote the original and compressed file sizes, measured in bytes, by C and Cd, respectively. To properly evaluate the asymptotic entropy SA, we generate two additional datasets having the original dataset length. In the first, data over the entire phase space is replaced with a single repeating symbol (e.g., zero). In the second, the entire dataset is replaced with random symbols from the alphabet. The resulting two compressed dataset file sizes are denoted by C0 and C1, respectively. The ratios C0/C and C1/C converge at the large-dataset limit to values that depend on the size of the alphabet [22].

Since the degenerate and random datasets represent the extreme cases of minimal and maximal entropy, the compressed file size for the simulated state (Cd) lies between these two extremes. Therefore, we define the incompressibility by η = (Cd − C0)/(C1 − C0). For physical systems 0 ≤ η ≤ 1, and η converges to a constant at equilibrium with sufficient sampling.

Finally, mapping η to SA can be conducted in various ways, for example from prerequisite knowledge of specific entropy values. Alternatively, we recognize that for each of the D degrees of freedom in the system, represented with ns discrete values, the maximal entropy is given by kB log ns, where kB is the Boltzmann constant. Therefore, as a first-order approximation, we linearly map η to entropy, up to an additive constant, by taking SA/kB = ηD log ns (Figure 1) [22]. Below we demonstrate that this linear mapping asymptotically quantifies the entropy even with finite sampling and far from the large-dataset limit (e.g., number of microstates).
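As an illustrative numerical example (values chosen for illustration only, not taken from the data in this work): for an Ising lattice with D = 64² = 4096 spin degrees of freedom and ns = 2, the maximal entropy is D log ns = 4096 kB log 2 ≈ 2840 kB; an incompressibility of η = 0.30 would then map to SA ≈ 0.30 × 2840 kB ≈ 852 kB.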

FIG. 1. Schematic asymptotic entropy calculation. Simulations of physical systems are preprocessed and encoded into data files [22]. Entropy is directly calculated from the size of the compressed (Cd) and calibration (C0, C1) data, as well as the entropy range (Smin, Smax).


We are now ready to evaluate our scheme on several benchmark systems. Herein, we use the LZMA compression algorithm, although other algorithms produce qualitatively similar results [19]. We compare SA to analytical entropy calculations for five different systems [Figs. 2(a-f)]: finite energy levels (ε, 2ε, 70ε, 80ε) with an arbitrary energy scale (ε) simulated at different temperatures (T), a 2d Ising model on a square lattice, 2d ferromagnetic and antiferromagnetic (frustrated) Ising models on triangular lattices, and an ideal chain fluctuating in 2d with fixed end-to-end distance (R). The Ising models have exchange energy J, the ideal chain is simulated with monomer length b, and all systems are simulated using Monte-Carlo algorithms [22]. The results agree well with the theoretical calculations [Figs. 2(a-d)]. In fact, for the Ising model on a square lattice, maximal residuals from the analytical values are smaller than 0.04kB [Fig. 3(a)]. For the ideal chain simulation, our entropy estimation matches the known entropy dependence S(R) − S(0) = −R²/[b²(N − 1)], where N is the number of monomers, without any fitting parameters [Fig. 3(e)] [22]. Also, our results present a smooth trend and enable differentiating SA for specific-heat and critical-exponent derivations [Figs. 2(e-f)].

Since compression algorithms yield an upper bound for the entropy, we can evaluate and optimize different preprocessing protocols [22]. For example, a comparison between different 2d-to-1d transformations for the Ising model on a square lattice shows that the Hilbert scan [28] is slightly better than other, naive transformations [Fig. 3(a)]. Notably, we can use data compression to evaluate ergodicity and proper sampling intervals [Fig. 3(b)] [22, 27]. While the convergence of SA with increasing sampling interval is exponential, its convergence with additional sampling is logarithmic [Fig. 3(c)], as expected [29], and will level off as it approaches the actual value, similarly to trends in random data [Fig. S1] [22]. Moreover, for as few as 1000 frames the SA estimate is within a few percent of the analytically calculated values.


FIG. 2. Validation of the asymptotic entropy calculation against Monte-Carlo simulation data on benchmark model systems, comparing analytical entropy calculations (lines) with the compression-algorithm method (symbols). (a) Discrete energy levels; Ising models on (b) a square lattice and (c) a triangular lattice, either antiferromagnetic (J < 0, triangles) or ferromagnetic (J > 0, circles); (d) ideal chains held at varying end-to-end distance (R). (e) and (f) Entropy derivatives of the Ising model simulations [(b) and (c), respectively].


Next, we consider the case of continuous variables, which must be coarse-grained for further processing, and apply it to simulated lattice-free ideal chains [22]. The optimal discretization (ns) should depend on the correlations in the system and on the number of sampled configurations. Furthermore, the choice of a coordinate system representing the degrees of freedom can introduce or eliminate correlations. In our case, the ideal-chain simulation is recorded with 64-bit floating-point numbers for each Cartesian coordinate, but the analysis is applied to the 1d bond-angle representation [22]. We note that standard compression algorithms work best with short-range correlations. For physical systems exhibiting long-range correlations, more attention will be required, possibly by transformation to an alternative representation (e.g., a Fourier transform).

The ideal-chain example validates that optimal coarse-graining can be identified using our procedure [Fig. 3(d)]. At the low-ns limit significant information is lost, and the entropy estimate cannot be resolved well. On the other hand, pattern matching by the compression algorithm is hindered by finite sampling, and the estimate increases towards the maximal entropy at the high-ns limit. Indeed, the SA evaluation shows a shallow minimum that deepens as the chain is stretched [22].

Encouraged by our results, we test our entropy-estimation scheme where free-energy evaluation is a serious concern, namely in protein-folding simulation. There, entropy evaluation is currently limited when using the simulation data alone [30]. Specifically, we quantify entropy for the reversible protein folding of a Villin headpiece C-terminal fragment simulated by molecular dynamics (MD) [2]. The system is sampled at equilibrium and demonstrates short transition times between folded and unfolded states and a long lifetime at each given state [2]. Piana et al. [2] calculated the fragment's thermodynamic properties from the population ratio of folded to unfolded states via the transition-based assignment [9], aided by the experimental folded structure [31]. In particular, the difference in entropy between the folded and unfolded states (∆Sf) was estimated from the states' lifetimes (Table S5).

FIG. 3. (a) Residuals from the analytical entropy for various 1d reductions of the Ising model on a square lattice: row by row (squares), spiral (triangles), Hilbert (circles), Hilbert with 1 site/byte (pentagons), Hilbert with 3 sites/byte (diamonds). (b) and (c) Asymptotic entropy convergence for the Ising model on a square lattice. (b) Varying sample size (N) and fixed sampling interval (140L²). (c) Varying sampling interval with fixed sample size (10⁴), exhibiting exponential convergence (solid line). (d) Effect of coarse-graining level on SA for the ideal chain. Lines are a guide for the eye.


Using our compression framework, along with the abovementioned frame assignments into two ensembles, we attained the backbone's entropy values for the folded and unfolded states [22].

Moreover, the assignment to either of the two states can be done using a sliding entropy estimate from lossless compression, without any a priori experimental input. This scheme can potentially detect yet-unidentified, competing, low free-energy structures which have eluded experimental observation. In Fig. 4(a) we show SA evaluated for sequences of configurations within a sliding window of length τw through the timeline of a simulation. At each time point t, configurations sampled by the simulation between t and t + τw are preprocessed and compressed, to arrive at SA(t) [22]. We chose the window length τw = 0.4 µs as a reasonable compromise between convergence of SA and time resolution [22]. This choice limits our observations to dynamic processes slower than the chosen time window. Fig. 4(a) clearly demonstrates the correspondence between low SA and Piana's preassigned folded states (shaded areas).

Using these sliding-window SA values, we can now construct an enthalpy-entropy population diagram [Fig. 4(b)] [22]. A valley between clustered events in the SA − H plot allows us to assign the folded and unfolded states, without a priori experimental knowledge, with 95.3 – 96.3% agreement [22], allowing the low free-energy folded structure to be extracted from the simulation and compared to the experimental crystal structure [Fig. 4(c)]. In agreement with Piana's assignments, changes in the ratio between folded and unfolded populations are clearly revealed by the SA distributions as the simulation temperature is varied [Fig. 4(d), Table S5].

Following the assignment to the two folded/unfolded ensembles, we can now use our method to directly estimate the entropic difference between the ensembles. In order to reduce spurious effects resulting from time correlations between neighboring frames, we resampled each ensemble (folded/unfolded) separately [22]. We then optimized the number of coarse-grained dihedral angles, as described above, and estimated each ensemble's entropy value using lossless compression. The difference between the estimated values (∆SA) is given in Table S5. We note that ∆SA represents only the protein backbone's entropic contribution, as the information is taken solely from the dihedral angles. Further contributions, originating from the solvent, side chains, or other solutes, may contribute to the overall entropy difference as well. These contributions will be addressed in future work.

By construction, the successful operation of compression algorithms derives from identifying domains that repeat within 1d datasets. This is of great convenience for effectively 1d objects such as polymers and proteins. We show that lossless compression algorithms allow efficient estimation of entropy in a wide variety of physical systems, including protein-folding simulations, without any a priori knowledge about specific states.

FIG. 4. (a) Representative scan through the Villin headpiece simulation timeline, with mean-subtracted SA (blue bars), overlaying regions identified as folded using the transition-based assignment (gray area). (b) Enthalpy – entropy population diagram at T = 360 K. (c) Protein structure collected from randomly sampled low-SA states simulated at 360 K (red), overlaid with the protein fragment (2F4K) crystal structure (blue) [7, 31], generated with PyMOL [32]. (d) Distributions of sliding-window SA at three simulation temperatures.

Additionally, our framework can easily assess sufficient sampling, ergodicity, and coarse-graining optimality for many-body simulations. We expect that our methodology will be useful for experimental systems [21] and additional athermal models, where entropy estimation is hard or inaccessible.

Entropy is defined for equilibrated or almost-stationary systems. However, SA estimates can be useful also away from equilibrium, to detect divergent trends in information content and disorder [21]. Our results demonstrate that modern MD simulations have sufficient statistics to allow entropy estimation even for small fragments of simulated trajectories, and that lossless compression algorithms can be conveniently used for this estimation. The resulting observation of continuous entropy dynamics, including detection of transient ordered states (i.e., a protein's fold), opens a new avenue in characterizing the dynamics of complex systems.


We greatly appreciate fruitful discussions and comments from A. Aharony, D. Andelman, P. Chaikin, H. Diamant, E. Eisenberg, O. Farago, D. Frenkel, M. Goldstein, G. Jacoby, D. Levin, R. Lifshitz, Y. Messica, H. Orland, P. Pincus, Y. Roichman, Y. Shokef, and H. Suchowski. Special thanks to the David Shaw laboratory for sharing the MD data. The work is supported by the Israel Science Foundation (453/17, 550/15) and the United States - Israel Binational Science Foundation (201696).

∗ [email protected]
† Currently at Department of Physics, New York University, New York, New York 10003, USA.
‡ [email protected]

[1] R. O. Dror, R. M. Dirks, J. Grossman, H. Xu, and D. E. Shaw, Annual Review of Biophysics 41, 429 (2012).
[2] S. Piana, K. Lindorff-Larsen, and D. E. Shaw, Proceedings of the National Academy of Sciences 109, 17845 (2012).
[3] S. Piana, J. L. Klepeis, and D. E. Shaw, Current Opinion in Structural Biology 24, 98 (2014).
[4] D. A. Kofke, Fluid Phase Equilibria 228-229, 41 (2005).
[5] D. P. Landau and K. Binder, A Guide to Monte Carlo Simulations in Statistical Physics, 4th ed. (Cambridge University Press, 2014).
[6] D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications, Vol. 1 (Academic Press, 2001).
[7] J. Kubelka, T. K. Chiu, D. R. Davies, W. A. Eaton, and J. Hofrichter, Journal of Molecular Biology 359, 546 (2006).
[8] N. Hansen and W. F. van Gunsteren, Journal of Chemical Theory and Computation 10, 2632 (2014).
[9] N.-V. Buchete and G. Hummer, The Journal of Physical Chemistry B 112, 6057 (2008).
[10] C. E. Shannon, Bell System Technical Journal 27, 623 (1948).
[11] A. N. Kolmogorov, International Journal of Computer Mathematics 2, 157 (1968).
[12] T. Downarowicz, Entropy in Dynamical Systems, Vol. 18 (Cambridge University Press, 2011).
[13] W. Krieger, Transactions of the American Mathematical Society 149, 453 (1970).
[14] T. Henriques, H. Goncalves, L. Antunes, M. Matias, J. Bernardes, and C. Costa-Santos, Journal of Evaluation in Clinical Practice 19, 1101 (2013).
[15] M. Aboy, R. Hornero, D. Abasolo, and D. Alvarez, IEEE Transactions on Biomedical Engineering 53, 2282 (2006).
[16] D. Benedetto, E. Caglioti, and V. Loreto, Physical Review Letters 88, 4 (2002).
[17] J. M. Amigo, J. Szczepanski, E. Wajnryb, and M. V. Sanchez-Vives, Neural Computation 16, 717 (2004).
[18] O. Melchert and A. K. Hartmann, Physical Review E 91, 023306 (2015).
[19] E. Vogel, G. Saravia, and L. Cortez, Physica A: Statistical Mechanics and its Applications 391, 1591 (2012).
[20] E. E. Vogel, G. Saravia, and A. J. Ramirez-Pastor, Physical Review E 96, 062133 (2017).
[21] S. Martiniani, P. M. Chaikin, and D. Levine, Physical Review X 9, 011031 (2019).
[22] See Supplemental Material, which includes Refs. [33-36].
[23] T. M. Cover, Elements of Information Theory, 2nd ed. (Wiley-Interscience, Hoboken, NJ, 2006).
[24] J. Ziv and A. Lempel, IEEE Transactions on Information Theory 23, 337 (1977).
[25] J. Ziv and A. Lempel, IEEE Transactions on Information Theory 24, 530 (1978).
[26] I. M. Pu, Fundamental Data Compression (Butterworth-Heinemann, 2005), Chap. 7.
[27] A. Lesne, J.-L. Blanc, and L. Pezard, Physical Review E 79, 046208 (2009).
[28] B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz, IEEE Transactions on Knowledge and Data Engineering 13, 124 (2001).
[29] E. Plotnik, M. J. Weinberger, and J. Ziv, IEEE Transactions on Information Theory 38, 66 (1992).
[30] N. Singh and A. Warshel, Proteins: Structure, Function, and Bioinformatics 78, 1724 (2010).
[31] T. K. Chiu, J. Kubelka, R. Herbst-Irmer, W. A. Eaton, J. Hofrichter, and D. R. Davies, Proc. Natl. Acad. Sci. USA 102, 7517 (2005).
[32] Schrodinger, LLC, "The PyMOL molecular graphics system, version 2.1" (2015).
[33] U. Wolff, Physical Review Letters 62, 361 (1989).
[34] A. G. Schlijper and B. Smit, Journal of Statistical Physics 56, 247 (1989).
[35] L. Onsager, Physical Review 65, 117 (1944).
[36] G. H. Wannier, Physical Review 79, 357 (1950).


SUPPLEMENTAL MATERIAL FOR: UNIVERSAL AND ACCESSIBLE ENTROPY ESTIMATION USING A COMPRESSION ALGORITHM

Physical systems presented in this work are severely under-sampled compared to their available configurational space (e.g., 2⁴⁰⁹⁶ states for the Ising model on a square lattice with 64² sites). In the text, we discuss a proof of the convergence of the size of a sequence encoded by the Lempel-Ziv algorithm [24] to the Shannon entropy per symbol [25]. It is worth noting that the analogy made there is between physical microstates and symbols of the processed sequence. As a result, convergence is guaranteed only for an over-sampled sequence of the system's microstates. Using our method, the entropy estimate converges for a much smaller sample size [Figs. 2(a-d)]. This is in accordance with conventional simulation methods (Metropolis Monte Carlo or molecular dynamics) that produce reliable thermodynamic properties even when under-sampled [5, 6].

We offer the following concise description of the inner workings of the current method, which may shed light on the reasons it works and on its potential limits.

The probability pi of observing a system's i'th microstate, described as a set of variables {vk}, can be broken down into observation probabilities of sub-microstates (i.e., values of a subset of system variables) via the probability chain rule: pi = P(v1, ..., vD) = P(vk+1, ..., vD | v1, ..., vk) · P(v1, ..., vk), for an arbitrary index k. This decomposition can be continued to any arbitrary partitioning into subsets of variables. Note, however, that in all systems a correlation length (lc) can be defined such that pairs of non-overlapping sub-microstates of size lc have a negligible correlation. Finally, microstates may be decomposed into uncorrelated sub-microstates of size l ≥ lc; in this case we end up removing the conditions in the chain rule: pi = P(v1, ..., vD) = P(v1, ..., vl) · ... · P(vD−l+1, ..., vD), otherwise stated as: log pi = log P(v1, ..., vl) + ... + log P(vD−l+1, ..., vD).
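Written compactly (our restatement, assuming for simplicity that D is an integer multiple of the block size l):

log pi = Σ_{m=0}^{D/l − 1} log P(v_{ml+1}, ..., v_{(m+1)l}),   with l ≥ lc.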

Switching to the compression-algorithm perspective, practical implementations find patterns of sequential bytes; these patterns can be viewed as sub-microstates. At each point during the compression process, patterns are matched to the history of the data up to that point. A matched pattern is then replaced by a reference of size log d to data found a distance d back. If one assumes an underlying Poisson process, and therefore an exponential distribution of the distance to the next observation, then the average distance is ⟨d⟩ = p⁻¹, where p is the probability of observing the current sub-microstate. The average size of the compressed data amounts to ⟨Σ_p log d_p⟩ per sampled configuration, where p here indexes the current pattern within a conformation (i.e., sub-microstate) and d_p is the distance to the previous observation of that pattern. Empirically, due to the exponential distribution of d_p, one can replace ⟨log d_p⟩ ≈ log⟨d_p⟩. Additionally, assuming for the size of the patterns (l_p) that l_p ≥ lc, and using the statements above, we recover an average compressed size of −⟨log pi⟩ per sampled configuration.
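To make the bookkeeping explicit (our paraphrase of the argument above): a pattern observed with probability p recurs at a typical distance ⟨d⟩ = 1/p, so its back-reference costs about log⟨d⟩ = −log p. Summing over the uncorrelated sub-microstates of one configuration then gives

−[log P(v1, ..., vl) + ... + log P(vD−l+1, ..., vD)] = −log pi

per configuration, i.e., the Gibbs-entropy contribution of microstate i.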

A few previous studies used lossless compression on simulations of physical systems, taking different approaches from the one presented in this work [18–21]. For example, Melchert & Hartmann [18] used compression to characterize the divergent correlation time in the dynamics of an Ising model and commented on a potential generalization for the detection of order-disorder phase transitions. Their approach is useful for the characterization of divergent trends, as in the case of the correlation time near a 2nd-order phase transition. However, as stated in their work, that approach does not lead to a universal estimate of entropy.

All aforementioned studies reduced multi-dimensional datasets to compressible streams of data by representing every time point with one value. Melchert & Hartmann [18] compressed the time series of an arbitrary spin in the 2d Ising lattice, and Vogel et al. [20] compressed a time series of a nematic order parameter.

More recently, Martiniani et al. used lossless compression to qualitatively evaluate the time evolution of order in out-of-equilibrium systems [21]. There, the exact nature of the order is not easily defined, and lossless compression was used to detect a dynamic phase transition. Their detection is achieved by monitoring the compressed length of a full description of each configuration as the system evolves in time. These studies did not attempt to quantify entropy; rather, they use divergent trends in information content, quantified by lossless compression, to detect a change in the system's state.

Application of lossless compression to a single configuration for the estimation of entropy, as in the case of Martiniani et al. [21], implicitly assumes an equivalence between a large system and an ensemble of smaller similar systems. Such equivalence is valid in systems like the Ising model, or indeed any system with homogeneous degrees of freedom and periodic boundaries. However, simulating a large ensemble of complex systems such as protein dynamics, although in principle possible, is impractical. Instead, we treat a sequence of independently sampled configurations as an ensemble and apply compression to it in its entirety.

In this work, we are ultimately interested in systems which are heterogeneous, without any obvious internal similarity (e.g., symmetry). Examples of such systems include a stretched ideal chain and other linear chains such as proteins, DNA, or RNA. If one were to apply a lossless compression algorithm to a single configuration of these systems, the resulting entropy value would typically be over-estimated, since recurring patterns would be rare. Instead, in these systems, in order to achieve a valid estimate of entropy using lossless compression, we apply the compression to a set of many independent configurations of the system at once. This approach leads to pattern matches by the compression algorithm across the set of configurations. In our work, we demonstrate that, with the appropriate care, lossless compression algorithms can do more than detect order: they can yield the entropy values of a digitized physical realization.

General scheme

Practically, to quantify SA, each recorded simulation dataset was first encoded into a file as bytes. Encoding into bytes encompasses various choices of projection to lower dimensionality and of coarse-graining, which will be discussed below. The encoded file is compressed using the Lempel-Ziv-Markov chain algorithm (LZMA), as implemented in the open-source 7-Zip software; the resulting compressed size in bytes is plugged into the calculation of SA, as described in the main text.

Concisely, to extract an entropy value from a digitized dataset one should perform the following steps (a minimal code sketch follows the list):

a. Choose appropriate degrees-of-freedom

b. Reduce dimensionality (re-index coordinates)

c. Coarse-grain

d. Compress and rescale the resulting file size to extract SA

e. Evaluate preprocessing
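As a minimal sketch of steps (c) and (d), the following Python code uses the standard-library lzma module in place of the 7-Zip binary; the function names, the array layout (N configurations by D degrees of freedom), and the choice of one byte per coarse-grained value are our own illustrative assumptions:

import lzma
import numpy as np

def incompressibility(coded: np.ndarray, n_s: int, seed: int = 0) -> float:
    # coded: (N, D) integers in [0, n_s), stored contiguously, one byte per value.
    raw = coded.astype(np.uint8).tobytes()
    c_d = len(lzma.compress(raw))                  # simulated dataset, Cd
    c_0 = len(lzma.compress(bytes(len(raw))))      # degenerate all-zero dataset, C0
    rng = np.random.default_rng(seed)
    rand = rng.integers(0, n_s, size=coded.size, dtype=np.uint8).tobytes()
    c_1 = len(lzma.compress(rand))                 # maximal-entropy dataset, C1
    return (c_d - c_0) / (c_1 - c_0)               # eta

def entropy_estimate(coded: np.ndarray, n_s: int) -> float:
    # S_A / k_B = eta * D * log(n_s), up to an additive constant.
    return incompressibility(coded, n_s) * coded.shape[1] * np.log(n_s)

Note that lzma's default settings differ slightly from the 7-Zip parameters listed below; since C0 and C1 are produced with the same settings as Cd, the normalization in η absorbs most of this difference.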

Choose Appropriate Degrees-of-Freedom

We consider a physical system with D degrees of freedom, recorded at N independently sampled configurations. The original recorded dataset is defined as the set of variables {x_ti}, where t = 1...N and i = 1...D. Often the native coordinate system in which the data are recorded is not optimal for the convergence of SA under finite sampling. For example, trivial degrees of freedom, such as whole-body translation and/or rotation, typically do not affect configurational entropy (or free energy). In such cases, a transformation to an orientation-aligned center-of-mass frame should be performed.

Reduce Dimensionality

The LZMA algorithm, as well as any alternative, works by finding contiguous patterns of bytes which appeared previously within the file. As a result, the algorithm only detects one-dimensional correlations, while local correlations within the data might be better represented in higher dimensionality. This problem has been conveniently analyzed before, where a reduction to one dimension using a Hilbert space-filling curve (Hilbert scan) was found optimal for retaining clustered correlations [28]. In general, the choice of reduction from multi- to one-dimensional affects the convergence of SA. We have tested a multitude of schemes for this reduction and demonstrate several for the Ising model on a square lattice [Fig. 3(a)].
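For reference, the classic bit-manipulation form of the Hilbert-curve index (a standard published algorithm, not code from this work) can be used to order the sites of an L × L lattice, with L a power of two:

def hilbert_index(n: int, x: int, y: int) -> int:
    # Distance of cell (x, y) along the Hilbert curve filling an n x n grid
    # (n a power of two). Sorting sites by this index gives the Hilbert scan.
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Example: Hilbert ordering of a 64 x 64 lattice.
order = sorted(((x, y) for x in range(64) for y in range(64)),
               key=lambda xy: hilbert_index(64, *xy))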

Coarse-Grain

Compression algorithms are designed to minimally represent a dataset's alphabet (the finite set of symbols of which a sequence is composed). However, data is often recorded as continuous variables which contain insignificant digits that are effectively random and independent, due to noise or numerical inaccuracy. Such digits render the dataset's alphabet enormous. In principle, SA asymptotically converges for any alphabet size; however, for practical purposes, the required sample size increases dramatically. To treat the issue, we approximate a system's entropy by the entropy of a projected system with discretized degrees of freedom (i.e., coarse-graining), for which SA converges at a much smaller sample size.

Here again, many schemes for coarse-graining may be introduced which, after computation of SA, will produce an upper bound on the actual entropy. In this work, we coarse-grain the continuous degrees of freedom in the ideal-chain model, and in the molecular-dynamics trajectories, using the following scheme. For both systems, we have used a representation composed of angles. In the ideal chain these are bond angles relative to common axes, and in the protein these are dihedral angles, relative to the surrounding amino acids. In either system, the maximal range of values taken up by the coordinates is conveniently known in advance to be [−π, +π).

For each degree of freedom x_ti, we generate a new integer variable x̃_ti = ⌊ns(x_ti + ∆xi)/Ri⌋, which is effectively a rounding down to units of Ri/ns. Here, ∆xi and Ri are a shift and a range, respectively, for the i'th degree of freedom, chosen such that the mapped values lie in the range [0, ns). In the current case, the shift is +π and the range is preset to 2π, but in general these can be derived from the samples using, for example, the mean and some multiple of the standard deviation. The number of coarse-grained values ns is optimized for minimal calculated SA [Fig. 3(d)]. Whether we started off with discrete degrees of freedom or continuous ones, the assessed discrete system invariably has a maximal entropy range of D log ns, where D is the number of represented degrees of freedom.
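As a concrete illustration of this discretization for angular coordinates in [−π, +π) (our own code, following the formula above):

import numpy as np

def coarse_grain_angles(angles: np.ndarray, n_s: int) -> np.ndarray:
    # floor(n_s * (x + pi) / (2*pi)): maps [-pi, +pi) to integers in [0, n_s).
    idx = np.floor(n_s * (angles + np.pi) / (2.0 * np.pi))
    return np.clip(idx, 0, n_s - 1).astype(np.uint8)  # clip guards the x = +pi edge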


Compress and Rescale

For compression, LZMA version 16.04 was used (www.7-zip.org/download.html), with the following parameters: 3 literal context bits, 0 literal position bits, 2 position bits, 64 fast bytes, BT4 search tree, 4 hash bytes, and a dictionary size of 64 MB. These parameter values are the default compression settings of the 7-Zip software, set to "maximal compression". The exact code used in this work can be provided upon request. All data streams compressed in this work were smaller than the set dictionary size.

As described in the main text, the final entropy estimate is derived by relation to the known entropies of the maximal- and minimal-entropy states (even if these are not reachable in the given system). These converge to a typical compression ratio for each choice of ns, as demonstrated in Fig. S1. In principle, representations coarse-grained to ns values require log₂ ns bits to be described; hence the expected asymptotic compression ratio would be log₂ ns / log₂ 256 for representations occupying whole bytes (8 bits). However, due to practical limitations of the LZMA algorithm used here, the compression ratio converges to a value slightly larger than expected. This is normalized out in our definition of the incompressibility η = (Cd − C0)/(C1 − C0), as schematically shown in Figure 1.
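For example (our arithmetic, using the formula above): with ns = 11 values stored one per byte, the ideal asymptotic ratio is log₂ 11 / log₂ 256 ≈ 3.46/8 ≈ 0.43, so compressed random data should occupy roughly 43% of the original size, plus the small algorithmic overhead noted above.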

Practically, the maximal- and minimal-entropy states are derived by generating two additional datasets having the original dataset length. In the first, data over the entire phase space is replaced with a single repeating symbol (e.g., zero). The result of compressing this "zeros" dataset is defined as C0 and is given in Fig. S1 by ns = 1. In the second dataset, all the values are replaced with random symbols from the alphabet (ns) used to extract Cd. The latter results in the compressed file size denoted by C1. As described above, the entropy is calculated by SA/kB = ηD log ns.

Evaluate Preprocessing

Due to finite sampling, SA is guaranteed to be an upper bound on the physical entropy of the system. Nonetheless, alternative choices in the preprocessing steps, such as the level of coarse-graining and the choice of degrees of freedom, may eliminate important correlations or introduce spurious effects challenging the compression algorithm's efficiency. It is therefore recommended to evaluate the entropy over various representations and to choose the one with the minimal SA value.

FIG. S1. Convergence of the compression ratio for varying alphabet size (ns). Data points represent the ratio between the compressed length and the original length for sequences of randomly generated integers from a uniform distribution. For compression, these numbers are each stored in a single byte, which can hold an alphabet of at most 256 unique values. These compression ratios are used for the calibration of the entropy estimate in the maximal-entropy state for a given choice of the number of coarse-grained represented values (the single-valued alphabet is always used as the minimal-entropy state). Compression ratios for 10⁷ bytes are indicated to the right of each curve and, due to practical limitations of the LZMA algorithm, are slightly higher than the expected ratio log₂ ns / log₂ 256 for each ns. For reference, the length of 5000 concatenated configurations of an ideal chain with 100 monomers is 5·10⁵ bytes (1 byte per monomer per configuration).

Implicit assumptions and spurious effects

One should take note of the implicit physical assumptions made by a compression algorithm which processes bytes. Compression algorithms match patterns between arbitrary pairs of locations, which implies translational symmetry (at least in the 1d representation). Many systems, like a polymer, a protein, or indeed any finite system without a periodic boundary, do not have this symmetry. The implicit assumption of symmetries may lead to an under-estimation of entropy by SA due to spurious matching of patterns. We do not currently observe such an effect and can only assume that matches between independent regions in our tested systems are much less likely than matches between the same region in different instances.

Calculation of SA for the finite states system

States sampled from the finite-state system were laid out consecutively in a file, with each byte holding the index of the currently selected state. For the entropy estimate, D = 1 and ns is assigned the number of energy levels. We construct and compress the zero dataset (C0) as a file of identical size to the recorded files above, with all samples set to the value 0. A random dataset (C1) is created similarly, with every sample chosen randomly and uniformly from the available alphabet.

Calculation of SA for the Ising model

The Ising model consists of a ±1 value per site, which we represent by a single binary digit (a bit). Spin sites on the 2d square lattice are laid out in a 1d sequence either row by row, in a spiral (first row, then last column without the overlapping site, spiraling inwards in clockwise order), or ordered by a Hilbert space-filling curve [28]. Site values (bits) are laid out in a file either consecutively, filling every byte with the values of 8 sites (8 bits), or with 1 site value per byte. We also used an intermediate filling of bytes with 3 sites per byte, where the two additional values were taken from the sites above and to the right, regardless of the order in which the sites are laid out. For the triangular lattice, we use a Hilbert scan with 3 sites per byte, as described above. In practice, we implement the triangular lattice as a square lattice with two additional diagonal bonds, so scans regard the site positions as for the square lattice. For the final estimate, D is assigned the number of sites in the lattice and ns = 2. The zero dataset (C0) is generated with configurations of equal size and number to the sampled systems above, with all spin values set to -1. The random dataset (C1) is produced by setting spins uniformly and randomly to ±1 and representing them as described above.
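A minimal illustration (our own NumPy code) of the two packing extremes for a spin configuration:

import numpy as np

rng = np.random.default_rng(0)
spins = rng.choice(np.array([-1, 1], dtype=np.int8), size=(64, 64))

bits = (spins.ravel() > 0).astype(np.uint8)  # row-by-row 1d reduction
packed_8 = np.packbits(bits).tobytes()       # 8 sites per byte
packed_1 = bits.tobytes()                    # 1 site per byte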

Calculation of SA for the ideal chain

Our simulations of ideal chains consisting of D = 2·N coordinates were conducted and recorded with 64-bit-precision floating-point numbers. We tested several representations of the chain to be compressed, including a naive representation with Cartesian coordinates. For the ideal chain, a representation as a list of angles for the steps between monomers (i.e., bond angles) resulted in the lowest entropy estimates and was therefore used for analysis. This representation effectively decorrelates the monomers and therefore produces the lowest entropy estimates; it is also robust to large deviations from the optimal ns, as observed in the shallow minima [Fig. 3(d)]. The minimal SA is consistently around ns = 170 coarse-grained angles. The results presented here [Figs. 2(e), 3(d) and S2] were derived using this representation and also demonstrate that the compression algorithm effectively captures the entropic contributions of individual variables, even when uncorrelated.

FIG. S2. Entropy estimates for the ideal chain, normalized by the number of bonds. Data points from Fig. 2(e) of entropy estimates for an ideal chain of varying length (N) and constant bond length (b), with increasing fixed end-to-end distance (R), normalized by the number of bonds (N − 1). Data points collapse on the theoretically expected entropy S = −R²/[b²(N − 1)], depicted by the dashed line.

FIG. S3. Evaluation of entropy for a varying number of coarse-grained angles (ns) for the Villin headpiece. SA(ns) is calculated for 4,000 frames from either unfolded or folded configurations as assigned in this work, resampled every 100th frame (see main text). The lowest SA is calculated in both ensembles with ns = 11, indicated by the arrows. We notice that the difference in entropies between the folded and unfolded states remains almost constant over a large range of ns.

For calibration, we generate the zero (C0) and random (C1) files as before, with configurations equal in size and number to the recorded data. In Fig. S2 we demonstrate the collapse of all our data points onto a single curve described by S = −R²/[b²(N − 1)] when the x axis is normalized by 1/(N − 1).


FIG. S4. Convergence of SA with sampled window size τw. SA(τw) is calculated for varying window sizes and averaged over 100 windows preassigned by the transition-based assignment [2] as folded (circles) or unfolded (squares). The calculated SA(τw) points are fit with a double-exponential function a + b₁·exp(−τw/τ₁) + b₂·exp(−τw/τ₂), which results in τ₁ ≈ 0.02 µs for both, and τ₂ = 0.160 µs and 0.100 µs for the folded and unfolded frames, respectively (solid lines).

Calculation of SA for the Villin headpiece protein fragment

We treat the coordinates of each Cα carbon essentially as for the ideal chain. Here, however, we have two dihedral angles per site (φ, ψ; D = 66). We choose to lay out a sequence of concatenated pairs of angles (φ1, ψ1, φ2, ψ2, ...) rather than concatenating first one angle and then the other (φ1, φ2, ..., ψ1, ψ2, ...), as this results in slightly lower SA values. For the ∆SA values (see Table S5 in the main text), frames assigned to either the folded or the unfolded state are collected for processing. We tested for the correlation time between frames, using the convergence of the compression ratio [see, for example, Fig. 3(c)], and found correlation times in the range of 20-30 frames for collected frames of either folded or unfolded assignment. Therefore, every 100th frame was kept for further processing. We observe the lowest SA values for ns = 11 coarse-grained values per coordinate, in either the folded or the unfolded configurational ensemble (Fig. S3).
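The interleaved layout can be produced, e.g., as follows (illustrative NumPy code; phi and psi are assumed to be per-frame dihedral-angle arrays):

import numpy as np

def interleave(phi: np.ndarray, psi: np.ndarray) -> np.ndarray:
    # phi, psi: shape (n_frames, 33) -> (phi_1, psi_1, phi_2, psi_2, ...),
    # i.e., shape (n_frames, 66).
    return np.stack([phi, psi], axis=-1).reshape(phi.shape[0], -1)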

We generate the zero (C0) and random (C1) files as before, with configurations equal in size and number to the processed recorded data. We generate the random file from uniformly random ns dihedral angles per coordinate. For the sliding-window SA values (Figs. 4 and S4), we obtain a reasonable compromise between convergence of SA and time resolution with a window of 2,000 consecutive frames (τw = 0.4 µs, see Fig. S4), independent of the choice of ns. Our SA estimate is evaluated for discretized windows starting every 200 frames (90% overlap). Using ns = 24 results in the maximal discrepancy between SA values for the folded and unfolded states and was therefore used for the SA − H diagrams [Figs. 4(b) and S5], instead of ns = 11, which was used for the ensemble entropy estimates in Table S5. For these enthalpy-entropy diagrams, we averaged the per-frame enthalpy values provided with the recorded simulation over the same windows defined above.

For each simulated temperature, a line crossing the valley between clustered events was optimized as follows. We interpolate values of a two-dimensional SA − H histogram along a candidate line and minimize the line integral over the sampled values. The final optimized lines are given by SA = aH + b with the parameters (a = −4.84, b = 0), (a = −5.88, b = −0.01), and (a = −5.61, b = −14.9) for temperatures 360 K, 370 K, and 380 K, respectively. These optimized classification borders are shown in Fig. S5. After assigning windows above (below) the line as unfolded (folded), we compared to the per-frame assignments provided with the recorded simulation, obtained with the transition-based assignment [2] and averaged over the same windows. The agreement reached 96.3%, 95.6%, and 95.3% for temperatures 360 K, 370 K, and 380 K, respectively.

Finite states simulations

A discrete-state system was defined, consisting of the following four energy levels: ε, 2ε, 70ε, 80ε, with ε being an arbitrary energy scale. States were sampled from their Boltzmann distribution pi = exp[−εi/(εT)] / Σj exp[−εj/(εT)], with pi the probability of the i'th energy level, εi its energy, and T the dimensionless simulation temperature. For each temperature, 7·10⁶ states were sampled and recorded.
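A sketch of this sampling (our own illustrative code, not the original simulation script), with energies expressed in units of ε:

import numpy as np

levels = np.array([1.0, 2.0, 70.0, 80.0])  # energy levels in units of eps

def sample_states(T: float, n: int, seed: int = 0) -> np.ndarray:
    # Draw n state indices from p_i proportional to exp(-eps_i / (eps * T)).
    rng = np.random.default_rng(seed)
    w = np.exp(-levels / T)
    return rng.choice(len(levels), size=n, p=w / w.sum())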

Ising model simulations

Ising spin-half simulations were conducted using the Metropolis Monte-Carlo (MC) algorithm at various temperatures.

TABLE S5. Thermodynamics estimation derived from MD simulations of the Villin headpiece at three temperatures (T). The folding free-energy estimate was calculated from the ratio of folded and unfolded states, either pre-assigned by Piana et al. [2] (∆Gf) or assigned by our method (∆GA). The folding enthalpy (∆Hf) was calculated from the difference in average force-field energy in the pre-assigned data [2]. The entropy difference between the folded and unfolded ensembles is compared between the transition-based assignment method (T∆Sf = ∆Hf − ∆Gf) and the compression-based entropy estimation (∆SA), as detailed in the text. All energy terms are given in kcal/mol units.

T (K)   ∆Hf    ∆Gf    T∆Sf   ∆GA    T∆SA
360     -17.6  -0.6   17.0   -0.64  25.9
370     -18.0   0.0   18.0   -0.02  25.4
380     -21.2   0.7   21.9    0.81  25.9


FIG. S5. Compression-based entropy (SA) – enthalpy (H) diagrams. (a-c) Enthalpy – entropy population diagrams calculated from the enthalpy recorded during the simulations and the entropy assigned by the compression method. Two well-defined states (folded and unfolded), of high and low entropy, are clearly demonstrated. (d-f) Scatter plots of the same analysis, colored by preassigned folded (red) and unfolded (black) labels according to the transition-based assignment [2]. A dashed green line divides the reassigned folded and unfolded states, as explained in the methods. (a,d) Simulation at T = 360 K. (b,e) Simulation at T = 370 K. (c,f) Simulation at T = 380 K.

An interaction potential was defined between each nearest-neighboring pair of sites (i, j): −(1/2)Jσiσj, with σi a ±1-valued spin site, under periodic boundary conditions. Sites were put on either a square or a triangular lattice, with the number of sites L² = 64² for the ferromagnetic (J = +1) and 16² for the antiferromagnetic (J = −1) models. The MC step involves either a single-site flip or a cluster flip, depending on the temperature and the coupling parameter J. A single-site flip involves a randomly picked site which is flipped with probability exp(−∆E/kBT) for an energy difference ∆E > 0 caused by the flip, and with probability 1 for ∆E ≤ 0, T being the simulation temperature. The cluster flip involves randomly picking a site and repeatedly growing a cluster onto neighboring sites. Cluster growth is considered through each outward bond from sites on the cluster's interface, onto sites with an identical sign, with probability 1 − exp(−2J/kBT), as described by U. Wolff [33]. For the antiferromagnetic triangular lattice, only single-site flips were used.
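For orientation, a single-site Metropolis update in the standard convention E = −J Σ⟨ij⟩ σiσj, with kB = 1 (our own illustrative code; the per-pair convention above differs by the stated factor of 1/2):

import numpy as np

def metropolis_flip(spins: np.ndarray, J: float, T: float, rng) -> None:
    # One attempted single-site flip on an L x L lattice with periodic boundaries.
    L = spins.shape[0]
    i, j = rng.integers(0, L, size=2)
    nbrs = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
            + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
    dE = 2.0 * J * spins[i, j] * nbrs  # energy change if spin (i, j) flips
    if dE <= 0 or rng.random() < np.exp(-dE / T):
        spins[i, j] *= -1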

Correlation times were calculated by fitting the autocorrelated fluctuations of the mean spin value with an exponential function; this was done for each simulated temperature. The final sampling plan defined sampling intervals exceeding the correlation times 5-fold, at each temperature (Fig. S6). Based on the correlation times for either single or cluster flips, cluster flips were used with probability 0.5, below T = 2.6 J/kB for the ferromagnetic model on a square lattice and below T = 4.1 J/kB on a triangular lattice. Simulations were started in a random configuration and run consecutively at decreasing temperature, recording the full system microstate every sampling interval, until 5,000 configurations were sampled. The simulations were verified by comparing the entropy quantified using the standard cluster-variation method [34] to the analytical entropy derived by Onsager [35] for the square lattice and by Wannier [36] for the triangular lattice.

Two-dimensional ideal chain simulations

Ideal-chain simulations were conducted using MC at varying end-to-end distance R, in a lattice-free setup. The number of monomers (Nm = D/2) along the chain was set to 100, 200, 300, or 1000, and the bond length (b) was fixed at 1. At each MC step, two sites along the chain are randomly picked, and the chain in between is flipped about the line connecting the two sites.


FIG. S6. Correlation times and sampling plan for the Ising model on a 64 by 64 square lattice. Correlation times were calculated for the mean spin value fluctuations, with either single-flip steps (squares) or cluster-flip steps (circles). Sampling intervals are stated in L² steps for single flips and in single steps for cluster flips. The temperature of transition to cluster-flip steps (T = 2.6 J/kB) is plotted (dashed line). For each temperature, the simulation is sampled at intervals indicated by the sampling plan (green line, scaled by 1/5).

The minimal and maximal distances between the picked sites are set to 3 and Nm/2, respectively. The move is accepted with probability 1, since no potential is defined, and it preserves bond lengths. The simulation algorithm was verified by running with an additional step of randomly reorienting the edge bonds with probability 0.1, to obtain a free ideal chain; the observed end-to-end vector populations fit the expected Gaussian distribution.

We calculate correlation times by fitting an exponential function to the autocorrelated fluctuations of the radius of gyration, R_g² = Σi(ri − ⟨r⟩)², calculated over the chain coordinates ri; this is done for each simulated R. The final sampling plan defines sampling intervals exceeding the correlation times 5-fold, at each R. Simulations are started from a zig-zag configuration and equilibrated for 10,000 steps, and we record the chain's coordinates every sampling interval, storing 5,000 sampled configurations per choice of R.

Villin headpiece protein fragment

Recorded simulation data for the Villin headpiece protein fragment [2] was made available to us by the group of Dr. Shaw. The recorded simulation data processed in this work consist of a full-atom description of the 35 amino acids of the protein. The full-atom description contains hundreds of atoms, which are then reduced to a description of dihedral angles (as in Ramachandran plots). Since dihedral angles refer to orientation changes between amino acids, they are undefined at the first and last ones. Each protein configuration is thus reduced to 2 dihedral angles per each of the 33 non-edge amino acids (66 coordinates in total). The data used for analysis are from the Nle/Nle mutant, simulated at 360 K, 370 K, and 380 K.

We estimated the error range and uncertainties in our entropy calculation by repeating the compression-based assessment over 11 newly generated, uncorrelated datasets for several different systems. We define the uncertainty δSA(T) = ⟨SA(T)⟩ − S(T). Here, S(T) is the analytical entropy value for a given temperature (T), and ⟨SA(T)⟩ is the mean of the entropy estimates over the realizations. In addition, the error range is calculated as the standard deviation (SD) over the realizations. In Table S6, we show the average and maximum uncertainties, and the maximum SDs, over the entire temperature range under study. For the ideal-chain case, T is replaced with the end-to-end distance (R), and the analytical model S(R) = S0 − R²/(b²(N − 1)) was fitted for its constant S0.

When trying to evaluate, in the most general way, how accurate our estimation is, we notice the problem that the relative error (δSA/S) can be misleading when S → 0. There, even a very good estimate of the entropy will result in a divergence of the relative error. Moreover, for continuous variables, as in the cases of the ideal chain and protein folding, only the entropy difference from a reference should be used. However, the entropy error ranges and uncertainties are orders of magnitude smaller than the entropy scale of the problem, giving us confidence in using the compression-based scheme to evaluate entropy.


TABLE S6. Errors and uncertainties calculated from 11 independent entropy evaluations

Model                                 # configurations   Entropy scale (kB)   ⟨δSA(T)⟩_T (kB)   max{δSA(T)}_T (kB)   max{SD(SA)}_T (kB)
4-state model                         10⁷                log(4) = 1.6         0.013             0.06                 0.009
Ising on a square lattice (64 x 64)   10⁴                log(2) = 0.69        0.017             0.04                 0.0003
Ideal chain N = 100                   5000               R²/(b²N) = 9         0.03              0.09                 0.08
Ideal chain N = 1000                  5000               R²/(b²N) = 9         0.07              0.14                 0.14