Post on 05-Jan-2016
Accuracy, Reliability, and Validity of FreeSurfer Measurements
David H. Salat
salat@nmr.mgh.harvard.edu
Why Talk About This?
• This is not meant to imply that everything is perfect in FreeSurfer processing; it is a sample of the types of procedures that we and others have used to provide information about what works and what doesn’t, and to enhance confidence in our results.
• The information here should be used as a guide for how to assess the data in your own projects.
What is Accuracy?
• Accuracy: the degree of closeness of a measured or calculated quantity to its actual (true) value (e.g. a physical property such as length or thickness)
• MRI measures are indirect. We may be able to measure morphometry accurately given the contrast of the MR image; however, measurements based on this contrast may differ from measurements of the actual tissue.
What is Reliability?
• Measures obtained for the same individual on two different days, close together in time to avoid a biological influence on the reliability measure
– Reliability of a labeling procedure in the same scan
– Reliability of the labeling procedure on two different scans
– Reliability of the labeling procedure on two different scans collected on two different scanners
• The reliability of an overall effect can be assessed by replication of the experiment in an independent sample.
• This is a general principle that applies to all types of data: structural, functional, cognitive, etc.
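A test-retest comparison of this kind can be sketched in a few lines. The example below is a minimal illustration, not a procedure taken from the studies cited in these slides: the thickness values are hypothetical, and the two summary statistics (Pearson correlation and mean absolute percent difference between sessions) are just two common choices.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_abs_percent_diff(x, y):
    """Mean absolute session difference, as a percent of each subject's two-session mean."""
    diffs = [abs(a - b) / ((a + b) / 2) * 100 for a, b in zip(x, y)]
    return sum(diffs) / len(diffs)

# Hypothetical mean cortical thickness (mm) for five participants,
# scanned twice a short time apart (values invented for illustration).
session_1 = [2.50, 2.42, 2.61, 2.35, 2.48]
session_2 = [2.52, 2.40, 2.60, 2.37, 2.47]

r = pearson_r(session_1, session_2)
mapd = mean_abs_percent_diff(session_1, session_2)
print(f"test-retest r = {r:.3f}, mean abs. percent difference = {mapd:.2f}%")
```

Note that a correlation is insensitive to a constant offset between sessions, so it is worth reporting alongside a difference-based measure rather than on its own.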
What is Validity?
• Validity: the extent to which an indirect measurement is representative of what it is supposed to measure.
• For example, in fMRI we use blood flow as an indirect measure of neural activity. Is this a valid measure of neural activity?
Validity Examples
• Internal validity: What is the strength of the overall experimental design, study sample size, analysis procedures, etc.?
• External validity: Would the effect measured generalize to another sample? (replication)
• Ecological validity: Can the results be applied in the real world outside of the experimental setting? (clinical application)
• Construct validity: Does the totality of evidence support the validity of a single measure? (do the data fit with what is known?)
• Face validity: Does the measure seem to be a good measure?
• Convergent validity: How well does the measure correlate with other types of measures that it should theoretically be correlated with? (do the data correlate with ‘gold standards’)
• Discriminant validity: Is the measure not correlated with measures it should not be correlated with? (ICV/age)
One does not necessarily ensure the other
• A measure can be perfectly reliable (e.g. you get the same exact measure every time) but not accurate or valid.
• We can measure morphometry very precisely, but the validity of this measure depends on the quality of the input data.
• If an experiment is not reliable, then it is likely inaccurate and invalid.
Types of Error
• Random error: Unknown and unpredictable changes in the measurement
– Should be unbiased
– Accuracy, reliability, and validity all limited by error
• Systematic error: Predictable offset or scaling of data
– Typically comes from some aspect of the data acquisition/analysis
– Can be identified and corrected by analyzing standards that closely match the real sample (e.g. do you get the same values at 1.5T as at 3T?)
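The distinction between an offset or scaling and random scatter can be illustrated with a small sketch. It assumes the same participants were measured under two conditions (here labeled 1.5T and 3T) and that the systematic error is linear. All numbers are invented, and the scaling (1.04) and offset (+0.05 mm) are built into the toy data so that the fit can be checked against known values.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y ≈ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical thickness (mm) of the same five participants at 1.5T.
at_1_5t = [2.20, 2.35, 2.50, 2.65, 2.80]
# Small zero-mean scatter standing in for random error (invented values).
noise = [0.01, -0.01, 0.00, -0.01, 0.01]
# Toy 3T values: known systematic scaling and offset plus the random scatter.
at_3t = [1.04 * v + 0.05 + e for v, e in zip(at_1_5t, noise)]

slope, intercept = fit_line(at_1_5t, at_3t)

# Undo the estimated systematic error; what remains is the random part.
corrected = [(v - intercept) / slope for v in at_3t]
residuals = [c - a for c, a in zip(corrected, at_1_5t)]

print(f"estimated scaling = {slope:.3f}, estimated offset = {intercept:.3f} mm")
print("residuals after correction (mm):", [round(r, 4) for r in residuals])
```

In this toy case the correction removes the offset and scaling but leaves the random scatter, which no calibration can remove.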
How do poor reliability and validity affect your studies?
• Poor reliability increases variance across individuals and across timepoints.
• Validity is directly tied to interpretation. You may have a valid measure of ‘cortical thickness’, but ‘cortical thickness’ might not be a valid measure of degeneration
– E.g. normal variation, hydration
• Many studies would benefit from the ability to measure minute changes across time.
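The link between reliability and variance can be made concrete with the classical measurement model, in which an observed value is a true value plus independent error. That model is an assumption introduced here for illustration, not something stated in the slides. The sketch also assumes independent errors of equal size at two timepoints, and the standard deviations are hypothetical.

```python
from math import sqrt

def reliability(true_var, error_var):
    """Share of observed variance that reflects true differences (observed = true + error)."""
    return true_var / (true_var + error_var)

def change_error_sd(error_sd):
    """SD of the measurement error in a difference between two timepoints,
    assuming independent errors of equal size at each timepoint."""
    return sqrt(2) * error_sd

# Hypothetical values in mm (invented for illustration).
true_sd = 0.15        # true between-subject SD of thickness
low_error_sd = 0.05   # measurement error with good reliability
high_error_sd = 0.15  # measurement error with poor reliability

rel_low = reliability(true_sd ** 2, low_error_sd ** 2)
rel_high = reliability(true_sd ** 2, high_error_sd ** 2)
change_noise = change_error_sd(low_error_sd)

print(f"reliability with small error: {rel_low:.2f}")
print(f"reliability with large error: {rel_high:.2f}")
print(f"error SD of a two-timepoint change score: {change_noise:.3f} mm")
```

Under these assumptions the error in a change score is larger than the error in either measurement alone, which is why small longitudinal changes place the highest demands on reliability.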
Accuracy and Validity of Spherical Averaging for Labeling Structural and Functional Anatomy
Fischl et al., 1999
Anatomical Labeling
Fischl et al., 1999
Functional Labeling
Fischl et al., 1999
Enhanced Statistical Power
Fischl et al., 1999
Face Validity: Results fall within Expected Range
• Consistent with published findings:
– crowns of gyri are thicker than the fundi of sulci
– sensory areas are among the thinnest in the cortex.
Fischl et al., 1999
Validate against manual measurements of imaging data from another study
Fischl et al., 1999
Automated measures are similar in size and region to manual measures, and predict who will develop AD
Fischl et al., 2002
Comparison with Postmortem Measures
Rosas et al., 2002
Manual Measurements
• Can only be done in regions where folds are appropriate
• Calcarine also consistent across studies
Panels: Orbitofrontal, Calcarine
Kuperberg et al., 2003
Salat et al., 2004
Compared to Manually Labeled Data
• 1 volume and 2 surface based labeling schemes
• Percent of subjects labeled correctly at each location across the surface.
Fischl et al., 2004; Desikan et al., 2006
Panels: Volume Atlas, Surface Atlas, Surface Atlas 2
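The percent-correct measure described on this slide can be sketched as follows. This is a toy illustration rather than the procedure used in the cited papers. It assumes each subject's labels have already been brought into correspondence, so that index `i` refers to the same location in every subject. The function name, data layout, and label values are all hypothetical.

```python
def percent_correct_per_location(auto_labels, manual_labels):
    """For each location, the percent of subjects whose automated label
    matches the manual label.

    Both arguments are lists with one entry per subject; each entry is a
    list of label names, one per location, in the same order for all subjects.
    """
    n_subjects = len(auto_labels)
    n_locations = len(auto_labels[0])
    percents = []
    for loc in range(n_locations):
        matches = sum(
            1 for s in range(n_subjects)
            if auto_labels[s][loc] == manual_labels[s][loc]
        )
        percents.append(100.0 * matches / n_subjects)
    return percents

# Toy data: 4 subjects, 3 locations (labels invented for illustration).
manual = [
    ["precentral", "postcentral", "superiorfrontal"],
    ["precentral", "postcentral", "superiorfrontal"],
    ["precentral", "postcentral", "superiorfrontal"],
    ["precentral", "postcentral", "superiorfrontal"],
]
automated = [
    ["precentral", "postcentral", "superiorfrontal"],
    ["precentral", "postcentral", "precentral"],
    ["precentral", "precentral", "superiorfrontal"],
    ["precentral", "postcentral", "precentral"],
]

agreement = percent_correct_per_location(automated, manual)
print(agreement)  # one percentage per location
```

Displaying these percentages across the surface shows where the automated labels can be trusted and where boundaries between regions are less certain.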
Replication of Result: Split Sample
• Concordant results are likely not due to statistical error
• Current study with 5 samples used in prior literature
Salat et al., 2004
Cross Sequence Parameters
Fischl et al., 2004
Comparison across time, scanner, field strength, number of scans, sequence type, scanner upgrade, and scanner manufacturer
Han et al., 2006
Effects of Pulse Sequence, Voxel Geometry, and Parallel Imaging
Wonderlick et al., 2008
Replication of Effects in Same Participants Across Scanning Conditions
Dickerson et al., 2008
WMPARC: same subjects scanned at different times (test-retest)
Salat et al., 2008
Replicable results across sex and hemisphere
Panels: Men, Women
Salat et al., 2008
Consistent Findings Across 5 samples Used To Identify Regions with Predictive Validity
• Regional measures predict who will progress to AD.
Dickerson et al., 2008
Conclusions
• Any tool used for MR analysis should be rigorously tested for accuracy, reliability, and validity
• Most of the measures from FreeSurfer have good accuracy, reliability, and validity across a range of conditions
• These results are dependent on optimal input data and correct implementation
• These data provide confidence, but do not substitute for using similar procedures to check data from each new study