Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better...

37
Linking data and model quality in macromolecular crystallography Kay Diederichs

Transcript of Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better...

Page 1: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Linking data and model quality in macromolecular crystallography

Kay Diederichs

Page 2: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Crystallography is highly successful

Can we do better?

Page 3: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Error in experimental data

Error = random + systematic

Random = counting + detector

Systematic = Radiation damage + absorption + non-linearities + vibrations + instabilities + …

“If you don’t have good data, then you have no data at all.” -Sung-Hou Kim

“If you don’t have good data, then you must learn statistics.” - James Holton

Multiplicity of n reduces the random error by √n

Multiplicity does not necessarily reduce the systematic error!

Page 4: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Crystallographic statistics - which indicators are being used?

• Data R-values:  Rpim < Rmerge=Rsym < Rmeas  

• Model R-values: Rwork/Rfree

• I/σ (for unmerged or merged data !)

Rmerge=∑hkl

∑i=1

n

∣I i (hkl )− I (hkl )∣

∑hkl

∑i=1

n

I i (hkl )

Rwork / free=∑hkl

∣F obs (hkl )−F calc (hkl )∣

∑hkl

Fobs (hkl )

Rmeas=∑hkl √

nn−1

∑i=1

n

∣I i (hkl )− I (hkl )∣

∑hkl

∑i=1

n

I i (hkl )

R pim=∑hkl

√1 /n−1∑i=1

n

∣I i (hkl )− I (hkl )∣

∑hkl

∑i=1

n

I i (hkl )

Page 5: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

In search for a proxy

Page 6: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Decisions and compromises

Which high-resolution cutoff for refinement?Higher resolution means better accuracy and mapsBut: high resolution yields high Rwork/Rfree!Basic questions: what to optimize? Is the data/model R-value a good predictor/indicator of model quality? How good need the data be; to what extent do the data influence the refinement result? What to refine, and when to stop? Overfitting?

Which datasets/frames to include into scaling? Optimize consistency (Rmeas) or quality of merged data (Rpim, <I/σ>)? Systematic differences/errors? Completeness versus radiation damage? Anomalous signal (CCano)? …

Reject negative observations or unique reflections?

The reason why this is difficult to answer “R-value questions” is that no proper mathematical theory exists that uses absolute differences; concerning the use of R-values, Crystallography is disconnected from mainstream Statistics

Page 7: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Conflicting views

„An appropriate choice of resolution cutoff is difficult andsometimes seems to be performed mainly to satisfy referees ... Ideally, we would determine the point at which adding the next shell of data is not adding any statistically significant information ... Rmerge is not a very useful criterion.“P. Evans (2011) An introduction to data reduction: space-group determination, scaling and intensity statistics. Acta Cryst. D67, 282 "At the highest resolution shell, the Rmerge can be allowed to reach 30–40% for low-symmetry crystals and up to 60% for high-symmetry crystals, since in the latter case the redundancy is usually higher."A. Wlodawer, W. Minor, Z. Dauter and M. Jaskolski (2008) Protein crystallography for non-crystallographers, or how to get the best (but not more) from published macromolecular structures. FEBS J. 275, 1

Page 8: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

2010 PDB depositions

Page 9: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

The asymptotic behaviour of model and data R-values is different at high resolution 

Data R-values (Rpim , Rmerge , Rmeas): with resolution, go to infinity since the numerator is constant (determined by background), and the denominator approaches zero

Model R-values (Rwork/free): the ratio of numerator and denominator approaches a constant  for a random or wrong model (Wilson 1950)

This means that at high resolution, a quantitative relation cannot exist between model and data R-values !

Page 10: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Crystallographic reasoning

1. Better data allow to obtain a better model

2. A better model has a lower Rfree

, and a lower

Rfree

-Rwork

gap

3. Comparison of model R-values is only meaningful when using the same data

4. Taken together, this leads to the „paired refinement technique“: compare models in terms of their R-values against the same data.

Page 11: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Example: Cysteine DiOxygenase (CDO; PDB 3ELN) re-refined against 15-fold weaker data

Rmerge ■

Rpim ●

Rfree

Rwork

I/sigma

Page 12: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Is there information beyond the conservative hi-res cutoff?

“Paired refinement technique“:

• refine at (e.g.) 2.0Å and at 1.9Å using the same starting model and refinement parameters• since it is meaningless to compare R-values at different resolutions, calculate the overall R-values of the 1.9Å model at 2.0Å (main.number_of_macro_cycles=1 strategy=None fix_rotamers=False ordered_solvent=False)

• ΔR=R1.9(2.0)-R2.0(2.0)

Page 13: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Do the maps really get better?

• cooperation with JCSG: Henry van den Bedem, Ashley M. Deacon, Abhinav Kumar

• identified 5 structures where re-processing of existing raw data permits to extend the resolution by another 0.2-0.3Å, with full completeness

• reprocessing with XDS/MOSFLM; refinement with refmac/phenix.refine

• calculate real-space CC of map and model, using EDSTATS (CCP4) or phenix.resolve

Page 14: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Example 1/5

Page 15: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Summary I

• Rmerge

should no longer be considered as useful for

deciding on a high-resolution cutoff

• The paired refinement technique can prove that data should be used to higher resolution than a (R

merge-based) conservative cutoff suggests

Page 16: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

In search for a proxy

Page 17: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

measuring data quality with a correlation coefficient

• Correlation coefficient has clear meaning and well-known statistical properties

• Significance of its value can be assessed by Student's t-test (e.g. CC>0.3 is significant at p=0.01 for n>100; CC>0.08 is significant at p=0.01 for n>1000)

• Apply this idea to crystallographic intensity data: use “random half-datasets” → CC1/2 (called CC_Imean by SCALA/aimless, now CC

1/2 )

• From CC1/2 , we can analytically estimate CC of the full dataset against the true (usually unmeasurable) intensities using

(Karplus and Diederichs (2012) Science 336, 1030)

CC*=√ 2CC1/ 21+CC 1/2

Page 18: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Data CCs

CC1/2 ◊

CC* Δ

I/sigma

Page 19: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

What is a suitable resolution cutoff when using CC

1/2?

Results from Student's t-test tell statistical significance (p=0.001 marked w/ * in CORRECT.LP/XSCALE.LP)

Data are usually insignificant if CC1/2

< 0.15

It depends on refinement program if it can make good use of weak data; in practice a bit higher cutoff is best (remember, for CC

anom a value of 0.3 proved useful)

Operational definition: resolution should be cut such that the best model results – use paired refinement!

Page 20: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Model CCs

• We can define CCwork, CCfree as CCs calculated on Fcalc2 of the

working and free set, against the experimental data• CCwork and CCfree can be directly compared with CC*

― CC*

Dashes: CCwork , CCfree against weak exp‘tl data

Dots: CC‘work , CC‘free against strong 3ELN data

Page 21: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Quantitative relation between data and model CCs

• Refinement should make CCwork converge towards CC*

(from lower values)• Inadequate model, or systematic errors in data

processing: CCwork remains < CC*

• If CCwork > CC*: the model is closer to the data, than the truth is to the data : “overfitting”

True data

Exp‘l data

Distance to truth

First model

Error

Page 22: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Summary II • CC1/2 assesses the statistical significance of data

• predicts the agreement of experimental data

and true data; CC* is the upper limit for the CCwork /CCfree

model quality indicators

• CC1/2

, CC*, CCwork

/CCfree

table can be obtained from 1.8.2

Phenix distribution; the routine is called phenix.cc_star

CC*=√ 2CC1/ 21+CC 1/2

Page 23: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Four new concepts for improving crystallographic procedures

Image courtesy of

P.A. Karplus

Page 24: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Further uses of the „paired refinement technique“

Page 25: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Does merging with weak data improve the resulting data?

dataset name CDO3 CDO4 CDO5 CDO3+4 CDO3+5 CDO4+5 CDO3+4+5

# observations 201160 (10837)

155771 (2389)

200117 (10838)

358117 (13657)

401270 (21717)

357787 (13655)

558273 (24518)

# unique reflections

29424 (2008) 26807 (1316)

27939 (1982) 29431 (2013)

29433 (2013) 28195 (1995)

29433 (2013)

completeness [%]

99.9 (98.7) 93.8 (79.3) 95.7 (98.0) 99.9 (98.8) 99.9 (98.8) 95.7 (98.0) 99.9 (98.8)

Rmeas

[%] 10.2 (294.0) 23.1 (431.1) 26.0 (395.6) 15.0 (314.7) 15.9 (332.6) 26.0 (401.4)

15.9 (339.8)

<I/σ> 16.24 (0.64) 10.68 (0.21) 9.88 (0.19) 13.63 (0.69) 14.12 (0.87) 13.76 (0.49)

14.60 (0.91)

CC1/2

in

highest shell; # of pairs

0.2079; 1986 0.0581; 842 0.1273; 1961 0.1752; 2006

0.2226; 2008 0.1540; 1992

0.2216; 2008

CC* in highest shell

0.5867 0.3314 0.4752 0.5460 0.6034 0.5166 0.6023

Page 26: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Refinement:which data give the best model?

dataset name CDO3 CDO4 CDO5 CDO3+4 CDO3+5 CDO4+5 CDO3+4+5

model

CDO3 0,1855, 0.2189

 

           

CDO4 0.2106, 0.2520

CDO5 0.1978, 0.2364

CDO3+4 0.1876, 0.2201

0.2268, 0.2624

0.1846, 0.2213

CDO3+5 0.1890, 0.2178

0.1852, 0.2161

CDO4+5 0.2154, 0.2533

0.2003, 0.2337

0.1989, 0.2374

CDO3+4+5 0.1922, 0.2210

0.2242, 0.2571

0.2112, 0.2403

0.1865, 0.2190

0.1862, 0.2189

0.2109, 0.2437

0.1863, 0.2205

Page 27: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Summary III

● Data R-values of merged data do not in any useful way predict the refinement outcome

● CC1/2

and CC* do

Page 28: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Does removal of negative reflections „improve“ the data?

All data

Only positive observations

Only positive unique reflections

Page 29: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Why reject negative observations or unique reflections?

● Negative observations/unique reflections increase the data R-value

● Assuming that data R-values are a predictor of data quality, rejecting them should improve model quality

● Make the data look better to reviewers

Page 30: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Does removal of negative reflections „improve“ the model?Results from paired refinement

Refinement against

R-value against

All reflections Only positive unique reflections

Only positive observations

All reflections 0.1856, 0.2189 0.1783, 0.2115 0.2036, 0.2326

Only positive unique

0.1870, 0.2232 0.1782, 0.2161 0.2036, 0.2346

Only positive observations

0.1873, 0.2268 0.1799, 0.2197 0.1992, 0.2392

Page 31: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Summary IV

Throwing away negative observations / unique reflections beautifies the data R-values and <I/σ>, but lowers CC

1/2

… and produces worse models as shown by the paired refinement technique!

Page 32: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

References

Andy Karplus, Oregon State

University (Corvallis, OR)

P.A. Karplus and K. Diederichs (2012) Linking Crystallographic Data with Model Quality. Science 336, 1030-1033.

K. Diederichs and P.A. Karplus (2013) Better models by discarding data? Acta Cryst D, in the press

see also: P.R. Evans (2012) Resolving Some Old Problems in Protein Crystallography. Science 336, 986-987.

Acknowledgement

Page 33: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

If you would like to obtain a PDF of the slides, send email to [email protected]

Thank you!

Page 34: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Checking your imagination!

Which data are better? Rmerge

=80% , multiplicity=4 or Rmerge

=100% ,

multiplicity=8 ?

Model A is refined at 2.0 Å, Rwork

/Rfree

=18.0/22%, model

B at 2.2 Å, Rwork

/Rfree

=16.0/20.5%. a) Which model is better? b) If you

don't know the answer: which procedure to decide?

Can one compare R-values that do not refer to the same sets of reflections?

Can one estimate model error by formulas that disregard experimental errors (sigmas)?

Can a model give Fcalc

2 that are closer to the true Fcalc

2 than the Iobs

are

against which it was refined?

Why is there a gap between CC* and CCwork

at low res (only)?

Page 35: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

General references

Data-collection strategiesZ. Dauter (1999) Acta Cryst. D55, 1703-1717

Integration of macromolecular diffraction dataA.G.W. Leslie (1999) Acta Cryst. D55, 1696-1702

An introduction to data reduction: space-group determination, scaling and intensity statisticsP. R. Evans (2011) Acta Cryst. D67, 282-292

Page 36: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

Outlook

● anisotropy: elliptic cutoff or use all data?● a refinement program that does not use

experimental sigmas suffers from noise amplification, and benefits from throwing away data in the „weak“ directions

● phenix.refine does not use the experimental sigmas; Refmac and CNS do.

Page 37: Linking data and model quality in macromolecular ... · 1. Better data allow to obtain a better model 2. A better model has a lower R free, and a lower R free-R work gap 3. Comparison

How to obtain good data, given a total dose

Old/classic/conservative

better

Statistics Rmerge

CC1/2

Exposure Reflections visible; tolerate some overloads

Low-dose, high multiplicity

Oscillation range 1° Pilatus: 0.1-0.2°CCD: 0.25-0.5°

Rotation range „strategy“, „xplan“ 180°/360° native/anomalous