Post on 30-Mar-2018
Training a model for estimating leukocyte composition using whole blood DNA methylation and cell counts as reference
Jonathan A. Heiss,1 Lutz P. Breitling,1,2 Benjamin C. Lehne,3 Jaspal S. Kooner,4,5,6 John C. Chambers,3,4,5 Hermann Brenner1,7,8
1Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Heidelberg, Germany
2Pneumology and Respiratory Critical Care Medicine, Thorax Clinic, University of Heidelberg, Heidelberg, Germany
3Department of Epidemiology and Biostatistics, Imperial College London, London, UK
4Ealing Hospital NHS Trust, Middlesex, UK
5Imperial College Healthcare NHS Trust, London, UK
6National Heart and Lung Institute, Imperial College London, Hammersmith Hospital, London, UK
7Division of Preventive Oncology, National Center for Tumor Diseases (NCT) and German Cancer Research Center (DKFZ), Heidelberg, Germany
8German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany
Address correspondence to Jonathan Alexander Heiss, Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Im Neuenheimer Feld 581, 69120 Heidelberg, Germany. Telephone 49-6221-421304. Email: jonathan.heiss@dkfz-heidelberg.de
Abstract
Aims: Whole blood DNA methylation depends on the underlying leukocyte composition, and
confounding by it is a major concern in epigenome-wide association studies. Cell counts are
often missing or may not be feasible in large-scale studies. Computational approaches can
estimate leukocyte composition from DNA methylation based on reference datasets of purified
leukocytes. We explored the possibility of training such a model on whole blood DNA methylation
and cell counts only, without the need for purification.
Materials & methods: Using whole blood DNA methylation measurements and corresponding
5-part cell counts from 2,445 participants from the LOLIPOP study, a model was trained on a
subset of 175 subjects and evaluated on the remaining ones.
Results: Correlations between cell counts and estimated cell proportions in LOLIPOP were high
for neutrophils (0.85), eosinophils (0.88) and lymphocytes (0.84), moderate for monocytes (0.55) and close to zero for basophils (0.02), and
estimated cell proportions explained more variance in whole blood DNA methylation levels than
cell counts.
Keywords: Leukocyte composition, White blood cell distribution, Estimation of cell proportions, DNA methylation, Infinium 450K, LOLIPOP, KAROLA
Introduction
Blood samples are the most commonly available source of DNA in epidemiological studies, and
DNA extracted from whole blood samples is often used in epigenome-wide association studies
(EWAS). DNA methylation (DNAm) has been linked to lifestyle factors such as smoking [1],
chronic diseases such as diabetes [2], and was shown to predict all-cause and cardiovascular
mortality [3]. Leukocytes show subtype specific DNAm profiles and whole blood DNAm depends
on the underlying leukocyte composition (LC) [4]. A major concern in EWAS based on whole
blood DNAm is that discovered associations may not arise from genuine DNAm changes but may
rather reflect shifts in LC. For example, cases of ovarian and head-and-neck cancer differed in their LC
compared to cancer-free controls and adjusting for LC in regression models changed the
association between case/control status and whole blood DNAm [5]. Automated cell counting
requires fresh blood samples and is therefore not an option for large-scale studies with banked
samples collected over many years, especially cohort studies with thousands of participants and
blood samples possibly decades old.
Houseman et al. developed an algorithm to infer cell proportions of whole blood DNAm profiles
measured on the Illumina Infinium 27K platform based on a reference set of purified leukocyte
types [6]. The algorithm was adapted by Jaffe and Irizarry [7] to the newer 450K platform based
again on a dataset of purified cell types provided by Reinius et al. [4] and is implemented in the
minfi R package [8]. Recently, accuracy of the minfi model was improved by Koestler et al. by
using a new algorithm for the selection of markers included in the model [9]. For both 27K and
450K platforms it is common to use estimated cell proportions from these models to adjust for
confounding by LC in statistical analysis [2, 5, 10, 11].
However, reports about the accuracy of these estimates vary widely [9, 12-15]. Whole blood
DNAm levels can be thought of as a weighted average of leukocyte-specific methylation levels,
which is the foundation for using purified cells as reference. While this linear relation will hold
approximately for measurements of methylation levels from the mentioned platforms, which
are commonly referred to as β-values, it might be distorted by background noise, an issue
common to microarray technology: if we were to compare the β-values of a whole blood
sample with the weighted average of β-values of its cell fractions, they would differ due to
background noise. Training a model on whole blood β-values might better account for
background noise. Our aim was to explore the possibility of training a model for LC estimation on a
reference dataset of whole blood samples.
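The weighted-average assumption can be written compactly. The notation below is introduced here for illustration only and is not taken from the cited works: y_i denotes the whole blood β-value at probe i, p_c the proportion of cell type c, μ_{i,c} the methylation level of probe i in purified cells of type c, and ε_i a deviation term that absorbs, among other things, background noise.

```latex
y_i = \sum_{c} p_c \, \mu_{i,c} + \varepsilon_i,
\qquad p_c \ge 0, \qquad \sum_{c} p_c = 1
```

If ε_i has a systematic component, reference values μ_{i,c} measured on purified cells need not reproduce whole blood β-values exactly, which is the motivation for fitting the reference on whole blood samples.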
Methods
Study populations
DNA methylation profiles of whole blood samples and corresponding 5-part blood cell counts
from 2,445 participants from a nested case-control study within the London Life Sciences
Prospective Population Study (LOLIPOP) cohort were available. Details about this nested case-
control study have been reported elsewhere [2]. In brief, all participants in this study from
London, UK, were of Indian Asian descent, most were male (n=1,646, 67%) and mean age and
standard deviation at enrollment were 50.9±10.1 years. Blood samples were taken at baseline.
Cases with incident type 2 diabetes were identified at the 8-year follow-up and controls were
matched for sex and age. Written informed consent was obtained from all participants. A
second dataset of 37 subjects, recruited in the context of the KAROLA study, was used as external
validation. Details on the KAROLA study have been reported elsewhere [16]. In brief,
participants with stable coronary heart disease were recruited several weeks up to three
months after myocardial infarction or coronary artery revascularization at two cooperating
rehabilitation clinics in Southern Germany in 2009-2011. Participants were Caucasian, most
were male (n=31) and mean age and standard deviation at enrollment were 63.2±6.7 years.
Blood samples were taken at baseline. Written informed consent was obtained from all
participants.
Laboratory measurements
DNA methylation of whole blood samples was measured on the Illumina Infinium 450K platform
that queries the methylation levels of 485,512 CpG sites. Data were normalized using the R
package normalize450K [17] and methylation levels were expressed as β-values. Cell counts
included proportions of neutrophils (NE), eosinophils (EO), basophils (BA), lymphocytes (LY) and
monocytes (MO). Cell counts in LOLIPOP were performed using a Sysmex XE-2100 hematology
analyzer, cell counts in KAROLA were performed using a Beckman Coulter LH 750 or an Abbott
CELL-DYN Sapphire hematology analyzer. Average cell proportions and standard deviations
stratified by cell counting device are given in Table 1 and were rather similar in both study
populations, with neutrophils (close to 55%) and lymphocytes (30-35%) being by far the most
common cell types.
Statistical analyses
First, we wanted to estimate how many of the 485,512 probes on the 450K chip were associated
with LC. As cell proportions are compositional data (they represent relative quantities) [18], cell
types were not tested individually for their association with methylation levels, instead two
linear models were trained for each probe, one including only an intercept, sex and batch
(indicating the 96-well plate on which the sample was run) as independent variables (model A),
the other including also cell counts of EO, BA, LY and MO (model B). NE were not included as
this proportion can be calculated from the other cell types (proportions sum up to 1) and
including all cell types would falsely increase the degrees of freedom and hence underestimate
the number of probes associated with LC. As we were not interested in the regression
coefficients in this step it does not matter which cell type is left out, but there might be slight
differences due to rounding errors. NE were left out, as this resulted in a model specification
with the lowest variance inflation factors among the remaining variables, a useful property for
later analysis steps. ANOVA was used to test the significance of the gain in R² by including cell
counts. Based on the distribution of p-values from all 485,512 probes the fraction of probes that
were associated with LC was estimated using the function pi0.est from the siggenes package
[19, 20]. Of course, this estimate will reflect only associations with main cell types. Although the
included cell types can be divided further into subtypes that also show distinctive epigenetic
profiles [4], our estimate will still give an impression of the possible magnitude of confounding.
All 2,445 LOLIPOP samples were used in this step.
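The estimation of the fraction of associated probes from the p-value distribution can be sketched as follows. This is a simplified, fixed-threshold version of the estimator of Storey and Tibshirani [20], not the pi0.est routine itself; the threshold lam = 0.5, the function name and the toy p-values are assumptions made for illustration.

```python
def estimate_pi0(p_values, lam=0.5):
    """Fixed-threshold estimate of the fraction of true null hypotheses."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    # Null p-values are uniform on (0, 1), so roughly pi0 * m * (1 - lam)
    # of them are expected to exceed lam.
    n_above = sum(1 for p in p_values if p > lam)
    return min(1.0, n_above / (m * (1.0 - lam)))


# Toy data (hypothetical): 700 probes with very small p-values and
# 300 null probes spread evenly over (0, 1).
p_assoc = [1e-6] * 700
p_null = [(i + 0.5) / 300 for i in range(300)]
pi0 = estimate_pi0(p_assoc + p_null)
fraction_associated = 1.0 - pi0
```

The fraction of probes associated with LC is then one minus the estimated null fraction; in practice the p-values would come from the ANOVA comparison of models A and B.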
Next we tested if we could estimate the LC using whole blood DNAm and cell counts as
reference. Two 96-well plates with a total of 175 samples from LOLIPOP were used for model
training and the remaining samples as test set. Samples from the KAROLA study served as
external validation without refitting of the model. We chose a small size for the training set in
order to see if building a new reference set from scratch is a viable option. However, we also
trained a model on half the LOLIPOP samples to see if accuracy would differ. For all probes on
the 450K chip partial correlation coefficients of methylation levels with cell proportions were
computed for EO, BA, LY and MO, each coefficient adjusted for the other cell types and sex and
batch. NE proportions were not included as a covariate. For each of the four cell types the 10
probes with the highest absolute partial correlation coefficients were selected (a list of these
markers is provided in the supplement). Methylation levels of these 40 markers were regressed
on the same set of variables as before (EO, BA, LY, MO, sex and batch; deviation coding was
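used for the categorical variables) and intercepts $\alpha_i$ and regression coefficients
$\beta_i^{EO}$, $\beta_i^{BA}$, $\beta_i^{LY}$, $\beta_i^{MO}$ for $i = 1, \ldots, 40$ were recorded to construct a matrix as follows.

$$
M = \begin{bmatrix}
\alpha_1 & \beta_1^{EO} - \alpha_1 & \beta_1^{BA} - \alpha_1 & \beta_1^{LY} - \alpha_1 & \beta_1^{MO} - \alpha_1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
\alpha_{40} & \beta_{40}^{EO} - \alpha_{40} & \beta_{40}^{BA} - \alpha_{40} & \beta_{40}^{LY} - \alpha_{40} & \beta_{40}^{MO} - \alpha_{40}
\end{bmatrix}
$$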
Using M and the methylation levels of the 40 markers, quadratic programming was applied as
described by Houseman et al. [6], with the additional constraint that estimated proportions
must sum to 1, to estimate the LC in the test and validation set. We report Pearson correlation
coefficients of measured and estimated cell proportions as this is the most relevant metric
when these estimates are included as covariates in linear models. 95% confidence
intervals for correlations were obtained by bootstrapping.
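The constrained estimation step can be illustrated with a minimal sketch. It assumes a reference matrix M with one row per marker and one column per cell type, and it solves the least-squares problem under the two constraints named above (non-negative proportions that sum to 1). Quadratic programming is the method referred to in the text; the sketch swaps in projected gradient descent, a simpler solver for the same problem. All function names and the toy data are hypothetical.

```python
import random


def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) == 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k that passes the check defines the shift
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(M, y, n_iter=5000):
    """Minimise ||y - M p||^2 subject to p >= 0 and sum(p) == 1."""
    n_markers, n_types = len(M), len(M[0])
    # The squared Frobenius norm bounds the largest eigenvalue of M^T M,
    # so its reciprocal is a safe (if conservative) step size.
    step = 1.0 / sum(m_ij * m_ij for row in M for m_ij in row)
    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [sum(M[i][j] * p[j] for j in range(n_types)) - y[i]
                    for i in range(n_markers)]
        gradient = [sum(M[i][j] * residual[i] for i in range(n_markers))
                    for j in range(n_types)]
        p = project_to_simplex([p[j] - step * gradient[j]
                                for j in range(n_types)])
    return p


# Toy data (hypothetical): 40 markers, 5 cell types, noise-free mixture.
random.seed(1)
M = [[random.random() for _ in range(5)] for _ in range(40)]
true_p = [0.55, 0.04, 0.01, 0.33, 0.07]
y = [sum(m * p for m, p in zip(row, true_p)) for row in M]
estimate = estimate_proportions(M, y)
```

With noise-free toy data the estimate recovers the true proportions; with real β-values the fit is only approximate, and a dedicated quadratic programming solver would usually be preferred for speed.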
Furthermore we estimated the proportion of variance of whole blood DNAm levels that could
be explained by cell counts or cell proportion estimates. For each of the 485,512 probes we
fitted three linear models using the data from the LOLIPOP test set. The first two models had
the same specification as models A and B and in a third model cell counts in model B were
replaced by their corresponding estimates (model C). We computed the increase of R² from
model A to B and A to C, representing the proportion of variance that could be explained by cell
counts and cell proportion estimates, respectively.
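The comparison of explained variance can be illustrated on synthetic data. The sketch below assumes a single cell type and uses sex as the only covariate of model A (batch is omitted for brevity); the noise levels, variable names and helper functions are hypothetical and chosen only to show how the gain in R² is computed for one probe.

```python
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def r_squared(columns, y):
    """R^2 of a least-squares fit of y on an intercept plus the columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * x for c, x in zip(coef, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): the cell count is a noisier measurement of
# the true proportion than the methylation-based estimate.
random.seed(0)
n = 200
sex = [float(i % 2) for i in range(n)]
true_ly = [random.gauss(0.34, 0.08) for _ in range(n)]
counted_ly = [p + random.gauss(0.0, 0.05) for p in true_ly]
estimated_ly = [p + random.gauss(0.0, 0.01) for p in true_ly]
methylation = [0.5 - 0.6 * p + 0.02 * s + random.gauss(0.0, 0.01)
               for p, s in zip(true_ly, sex)]

r2_a = r_squared([sex], methylation)                              # model A
gain_counts = r_squared([sex, counted_ly], methylation) - r2_a    # A to B
gain_estimates = r_squared([sex, estimated_ly], methylation) - r2_a  # A to C
```

Under these assumptions the less noisy covariate yields the larger gain in R², which is the reasoning used later to interpret the difference between cell counts and estimates.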
Recently, Koestler et al. reported an improved accuracy of cell proportion estimates by using a
new algorithm for the selection of markers for their model [9]. We built a model according to
the description provided by Koestler et al. (using the list of 300 markers provided by Koestler et
al. and the data provided by Reinius et al. [4]) for comparison with our custom model. The
output from the Koestler model provides proportions for granulocytes (GR), CD8+ T-cells, CD4+
T-cells, natural killer cells, CD19+ B lymphocytes and monocytes. To compare measured and
estimated proportions, the estimated proportions for CD8+ T-cells, CD4+ T-cells, natural killer
cells and CD19+ B lymphocytes were collapsed into a lymphocyte type, and the measured
proportions for neutrophils, eosinophils and basophils were collapsed into a granulocyte type.
We also created a granulocyte type from estimated cell proportions from our custom model.
Again we computed Pearson correlation coefficients between measured and estimated cell
proportions.
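A minimal sketch of the evaluation metric follows: collapsing measured proportions into a composite granulocyte type, computing the Pearson correlation with the estimates, and attaching a bootstrap confidence interval. The percentile variant of the bootstrap, the number of resamples and the toy data are assumptions; the text does not specify which bootstrap interval was used.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        # Resample subjects (pairs) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lower = stats[int((1.0 - level) / 2.0 * n_boot)]
    upper = stats[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper


# Toy data (hypothetical): measured NE, EO and BA proportions per subject.
rng = random.Random(1)
measured = [{"NE": rng.gauss(0.54, 0.085),
             "EO": rng.gauss(0.04, 0.02),
             "BA": rng.gauss(0.01, 0.005)} for _ in range(100)]
# Collapse the three measured types into one granulocyte proportion.
measured_gr = [m["NE"] + m["EO"] + m["BA"] for m in measured]
# Stand-in for model output: the measured value plus estimation error.
estimated_gr = [g + rng.gauss(0.0, 0.03) for g in measured_gr]

r = pearson(measured_gr, estimated_gr)
lower, upper = bootstrap_ci(measured_gr, estimated_gr)
```

Resampling whole subjects keeps measured and estimated values paired, which is what the correlation requires.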
An implementation of our model can be found in the R package normalize450K. Exemplary R
code to perform parts of this analysis on a publicly available dataset (GEO Accession GSE53840)
is provided in the supplement and can be easily adapted to other datasets.
Results
Using the distribution of p-values from ANOVA comparisons of the two linear models with and
without cell counts as independent variables (see Figure 1), we estimated that ∼69% of the
probes on the 450K chip were associated with leukocyte composition.
Pearson correlation coefficients between measured and estimated cell proportions from our
custom model using 40 markers are listed in Table 2, stratified by the device used for cell
counting (additional scatterplots are provided in Figure S1). Correlations for neutrophils and
lymphocytes were high for all devices, but somewhat higher in KAROLA than in LOLIPOP.
Correlations for basophils were close to or even less than zero for all devices. Correlations for
eosinophils were high for the XE-2100 and CELL-DYN hematology analyzers, but only moderate
for LH 750, whereas in the case of monocytes a high correlation was found only for LH 750. Using
half the LOLIPOP samples to train the model did not improve results. Likewise, in another
sensitivity analysis, results for a model utilizing 100 markers for each cell type were overall very
similar, only correlations for monocytes were lower (Table S1). Table 3 lists corresponding
numbers for the Koestler model. Overall results were very similar to our model with the
exception of monocytes for CELL-DYN.
Figure 2 shows the proportion of variance of whole blood DNAm levels explained by cell counts
and cell proportion estimates, respectively, for the top 10,000 probes when ranked by the
former metric. Despite this unfair selection of probes, cell proportion estimates explained more
variance than cell counts in all instances (on average 16 percentage points more).
Discussion
We explored the possibility to build a model for estimating LC based on a reference dataset of
DNAm profiles of whole blood samples and cell counts. We found that such a model can provide
accurate estimates with performance on a par with a model trained on a dataset of purified cell
types. Our model also provides proportions for eosinophils, a cell type relevant for inflammation
that is not included in the minfi/Koestler model.
Estimating the LC based on whole blood DNAm has several advantages. It has virtually no costs
(assuming that DNA methylation data is already available) and provides estimates even for
blood samples stored under conditions that no longer allow cell counts by other means [21].
However, reports about the accuracy of these estimates vary widely. The Koestler model
provided accurate predictions for granulocytes, lymphocytes and monocytes in our datasets
(Table 3). Koestler et al. validated the original Houseman model [6] in a set of 94 peripheral
blood mononuclear cell samples and found a correlation of 0.61 and 0.60 between measured
and estimated proportions of lymphocytes and monocytes, respectively [12]. Yousefi et al.
tested the minfi model and found Spearman correlations ranging from low to high for 45 whole blood
samples from 12-year-old children (GR 0.77, LY 0.75, MO 0.26), but there was no correlation at
all for 111 samples from newborns (GR -0.05, LY -0.03, MO -0.01) [14]. This is likely due to the
age distribution of the study populations: the minfi model is trained on a sample of six males
with mean age 38±13.6 years [4], which is more similar to the sample of adults from LOLIPOP
and KAROLA than to the sample of children and newborns. This is of great importance, as such
LC estimates are already used in study populations of newborns [10, 11] (the most recent
release of the minfi package now supports a dedicated reference dataset for cord blood [22]). In
another dataset of whole blood samples from 22 pairs of monozygotic twins including counts of
minor cell types, predictions of the minfi model were again excellent (GR 0.75, CD4+ T cells 0.75,
CD8+ T cells 0.66, B lymphocytes 0.93, natural killer cells 0.82, MO 0.59), but little information
was provided about the study population [15].
We hypothesized that a model trained on whole blood samples might perform better. While
whole blood DNAm can be thought of as a weighted average of subtype-specific methylation
levels, this is not true for the background noise that is always present for microarray technology,
and a model trained on whole blood DNAm might account better for this issue. Another possible
issue with purification of cell types is that the process could distort DNAm. Our model provided
accurate predictions in LOLIPOP even for eosinophils; only for basophils, which account
for only approximately 1% of leukocytes, did it fail to provide reasonable predictions. Correlations
between measured and estimated cell proportions in LOLIPOP for the two most frequent cell
types neutrophils and lymphocytes (standard deviations of ±8.5% and ±7.9%) were lower than
for eosinophils (standard deviation of ±2.6%). This could point to imprecise cell counts, which
was even more evident in KAROLA, where correlations for eosinophils were close to 1 for one
cell counting device, but only moderate for the other. An explanation for this pattern might be
that correlations of measured with estimated cell proportions were limited by the correlations
of measured with true cell proportions, meaning that for some device/leukocyte combinations
the cell counts might be less precise than the estimates. This interpretation is further supported
by the observations that correlations for neutrophils and lymphocytes were lower in the
population from which the training set was sampled, than in the external validation (where
measurements for these two cell types would be more precise) and that in LOLIPOP more
variance in the 450K data could be explained by the estimated than by the measured cell
proportions. Therefore, both our custom and the Koestler model might provide for some cell
types more precise proportions than the 5-part blood cell counts from the hematology analyzers
used in this study. In contrast, much better precision has been reported for the Sysmex XE-2100
hematology analyzer [23] and other devices or methods also outperform the estimates based
on DNA methylation (see Fig. S8 and S9 in Accomando et al. [21]). In accordance with two
evaluations of the XE-2100 hematology analyzer, which found that eosinophil proportions were
more reproducible than monocyte proportions [24, 25], the correlation between estimates and
cell counts in LOLIPOP was higher for eosinophils than for monocytes.
LOLIPOP and KAROLA study populations included participants who developed type 2 diabetes or
had coronary heart disease. It is unlikely that these conditions had a direct impact on prediction
accuracy. For example, there were only 5 markers associated with incident type 2 diabetes in
the LOLIPOP dataset [2], and none of them was included in our or the Koestler model. Likewise,
normalization can be ruled out as a nuisance factor: all analyses used raw 450K data as a starting
point and test and reference datasets were normalized together. However, we cannot exclude
the possibility that there might be other clinical characteristics, or factors such as sample
preparation and storage [26], that could have influenced results.
To train models for the estimation of leukocyte composition from DNAm, a dedicated reference
dataset is required for each microarray platform. The 450K platform is now replaced by the
Illumina Infinium MethylationEPIC chip and eventually by whole-genome bisulfite sequencing.
Building a reference dataset of purified leukocyte types for every new platform or technology is
laborious and expensive. Our approach achieves results similar to those of the Koestler model, but does
not require purification of cell types. Instead we trained a model on DNAm data and cell counts
from whole blood samples. While our model likely suffers from the same drawback as the
minfi/Koestler model, namely that it has not been trained on a diverse study population, and
that performance in other groups might be far worse, it can easily be refitted to other study
populations, for example in the case that cell counts are available only for a subset of samples.
With cell counts that include lymphocyte subtypes such as CD8+ T cells or CD4+ T cells,
the model could be extended accordingly. A reference dataset heterogeneous in regard to sex,
age, race and prevalent diseases of subjects might improve generalizability. Finally, estimates
from both models might be combined, for example to additionally adjust for the proportion of
eosinophils in EWAS. Compared to reference-free methods, which do not require any kind of
reference data, our approach has the advantage of providing cell proportion estimates that have
a direct biological interpretation [27].
Despite their limitations it seems reasonable to assume that adjustment for LC estimates from
current models will reduce confounding in many situations. However, in case-control studies
investigating outcomes that are known to be associated with shifts in LC or studies including
very young subjects, accuracy of current models might still be insufficient. Highly accurate
models would be of great importance for the growing number of epigenome-wide association
studies and would have a large impact on the validity of findings.
Summary points
• Leukocyte composition is an important confounder in epigenome-wide association
studies investigating DNA methylation in whole blood samples, but cell counts are often
missing.
• Computational methods can estimate cell proportions based on whole blood DNA
methylation profiles, but existing models require reference datasets of purified
leukocyte subtypes.
• We trained a model for estimating cell proportions based on a reference dataset of
whole blood DNA methylation profiles and cell counts only, without the need for
purification of leukocyte subtypes.
• We show that the estimated cell proportions from our model explain more variance in
whole blood DNA methylation levels than cell counts from common hematology
analyzers, indicating that estimated cell proportions are more precise than those cell
counts.
• Our approach is flexible and our model can easily be trained on other datasets, where
existing models do not provide reasonable estimates, e.g. study populations of different
age or ethnic composition.
Acknowledgements
The LOLIPOP study is supported by the National Institute for Health Research (NIHR)
Comprehensive Biomedical Research Centre Imperial College Healthcare NHS Trust, the British
Heart Foundation (SP/04/002), the Medical Research Council (G0601966, G0700931), the
Wellcome Trust (084723/Z/08/Z), the NIHR (RP-PG-0407-10371), European Union FP7
(EpiMigrant, 279143) and Action on Hearing Loss (G51). We thank the participants and research
staff who made the study possible.
References
1. Zhang Y, Yang R, Burwinkel B, Breitling LP, Brenner H. F2RL3 methylation as a biomarker of current and lifetime smoking exposures. Environ. Health Perspect. 122(2), 131-137 (2014).
2. Chambers JC, Loh M, Lehne B et al. Epigenome-wide association of DNA methylation markers in peripheral blood from Indian Asians and Europeans with incident type 2 diabetes: a nested case-control study. Lancet Diabetes Endocrinol. 3(7), 526-534 (2015).
3. Zhang Y, Schöttker B, Florath I et al. Smoking-Associated DNA Methylation Biomarkers and Their Predictive Value for All-Cause and Cardiovascular Mortality. Environ. Health Perspect. 124(1), (2015).
4. Reinius LE, Acevedo N, Joerink M et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One 7(7), e41361 (2012).
5. Langevin SM, Houseman EA, Accomando WP et al. Leukocyte-adjusted epigenome-wide association studies of blood from solid tumor patients. Epigenetics 9(6), 884-895 (2014).
6. Houseman EA, Accomando WP, Koestler DC et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics 13 86 (2012).
7. Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 15(2), R31 (2014).
8. Aryee MJ, Jaffe AE, Corrada-Bravo H et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30(10), 1363-1369 (2014).
9. Koestler DC, Jones MJ, Usset J et al. Improving cell mixture deconvolution by identifying optimal DNA methylation libraries (IDOL). BMC Bioinformatics 17 120 (2016).
10. Küpers LK, Xu X, Jankipersadsing SA et al. DNA methylation mediates the effect of maternal smoking during pregnancy on birthweight of the offspring. Int. J. Epidemiol. 44(4), 1224-1237 (2015).
11. Sharp GC, Lawlor DA, Richmond RC et al. Maternal pre-pregnancy BMI and gestational weight gain, offspring DNA methylation and later offspring adiposity: findings from the Avon Longitudinal Study of Parents and Children. Int. J. Epidemiol. 44(4), 1288-1304 (2015).
12. Koestler DC, Christensen B, Karagas MR et al. Blood-based profiles of DNA methylation predict the underlying distribution of cell types: a validation analysis. Epigenetics 8(8), 816-826 (2013).
13. Lehne B, Drong AW, Loh M et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol. 16 37 (2015).
14. Yousefi P, Huen K, Quach H et al. Estimation of blood cellular heterogeneity in newborns and children for epigenome-wide association studies. Environ. Mol. Mutagen. 56(9), 751-758 (2015).
15. Waite LL, Weaver B, Day K et al. Estimation of Cell-Type Composition Including T and B Cell Subtypes for Whole Blood Methylation Microarray Data. Front. Genet. 7 (2016).
16. Zhang Q-L, Brenner H, Koenig W, Rothenbacher D. Prognostic value of chronic kidney disease in patients with coronary heart disease: Role of estimating equations. Atherosclerosis 211(1), 342-347 (2010).
17. Heiss JA, Brenner H. Between-array normalization for 450K data. Front. Genet. 6 (2015).
18. Aitchison J. The statistical analysis of compositional data. Blackburn Press, Caldwell, N.J. (2003).
19. Schwender H. siggenes: Multiple testing using SAM and Efron's empirical Bayes approaches. R package version 1.46.40 (2012).
20. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100(16), 9440-9445 (2003).
21. Accomando WP, Wiencke JK, Houseman E, Nelson HH, Kelsey KT. Quantitative reconstruction of leukocyte subsets using DNA methylation. Genome Biol. 15(3), R50 (2014).
22. Bakulski KM, Feinberg JI, Andrews SV et al. DNA methylation of cord blood cell types: Applications for mixed cell birth studies. Epigenetics 11(5), 354-362 (2016).
23. Park BG, Park C-J, Kim S et al. Comparison of the Cytodiff flow cytometric leucocyte differential count system with the Sysmex XE-2100 and Beckman Coulter UniCel DxH 800. Int. J. Lab. Hematol. 34(6), 584-593 (2012).
24. Tsuda I, Hino M, Takubo T et al. First basic performance evaluation of the XE-2100 haematology analyser. J. Auto. Meth. Manage. Chem. 21(4), 127-133 (1999).
25. Maciel TES, Comar SR, Beltrame MP. Performance evaluation of the Sysmex® XE-2100D automated hematology analyzer. J. Bras. Patol. Med. Lab. 50(1), 26-35 (2014).
26. Imeri F, Herklotz R, Risch L et al. Stability of hematological analytes depends on the hematology analyser used: A stability study with Bayer Advia 120, Beckman Coulter LH 750 and Sysmex XE 2100. Clin. Chim. Acta 397(1-2), 68-71 (2008).
27. Houseman EA, Kelsey KT, Wiencke JK, Marsit CJ. Cell-composition effects in the analysis of DNA methylation array data: a mathematical perspective. BMC Bioinformatics 16 95 (2015).
Figure legends
Figure 1 Histogram of p-values from all 485,512 probes on the 450K chip testing the association of whole blood DNA methylation levels with leukocyte composition.
Figure 2 Proportion of variance of whole blood DNA methylation levels explained by cell counts (black) and cell proportion estimates (red), respectively, for the top 10,000 probes when ranked by the former metric. Cell proportion estimates explain more variance than cell counts in all 10,000 instances.
Table 1 Average cell proportions and standard deviations stratified by cell counting device.
Study population   Device     n       NE (%)     EO (%)    BA (%)    LY (%)     MO (%)
LOLIPOP            XE-2100    2,270   54.3±8.5   3.8±2.6   1.2±0.7   34.4±7.9   6.2±2.0
KAROLA             LH 750     22      53.6±8.6   4.1±2.7   0.8±0.9   32.1±7.2   9.6±2.6
KAROLA             CELL-DYN   15      55.5±7.4   4.9±2.5   0.6±0.3   30.1±5.7   8.9±1.2
Abbreviated cell types: NE neutrophils, EO eosinophils, BA basophils, LY lymphocytes, MO monocytes.
Table 2 Pearson correlation coefficients between measured and estimated cell proportions.
Study population   Device     NE                 EO                  BA                   GR                 LY                 MO
LOLIPOP            XE-2100    0.85 (0.83,0.86)   0.88 (0.87,0.89)    0.02 (-0.02,0.06)    0.83 (0.81,0.85)   0.84 (0.82,0.86)   0.55 (0.51,0.58)
KAROLA             LH 750     0.90 (0.74,0.98)   0.47 (-0.03,0.91)   -0.14 (-0.47,0.35)   0.95 (0.90,0.98)   0.96 (0.91,0.99)   0.91 (0.86,0.97)
KAROLA             CELL-DYN   0.94 (0.87,0.98)   0.95 (0.82,0.99)    -0.08 (-0.50,0.42)   0.91 (0.80,0.97)   0.96 (0.89,0.98)   0.42 (0.06,0.74)
Pearson correlation coefficients (with 95% confidence intervals) between measured and estimated cell proportions from our custom model stratified by cell counting device. Correlations for GR are obtained by collapsing proportions of NE, EO and BA into one composite cell type. Abbreviated cell types: NE neutrophils, EO eosinophils, BA basophils, GR granulocytes, LY lymphocytes, MO monocytes.
Table 3 Pearson correlation coefficients between measured and estimated cell proportions from the Koestler model.
Study population   Device     GR                 LY                 MO
LOLIPOP            XE-2100    0.85 (0.83,0.87)   0.85 (0.83,0.87)   0.50 (0.46,0.54)
KAROLA             LH 750     0.96 (0.88,0.99)   0.96 (0.89,0.99)   0.91 (0.64,0.97)
KAROLA             CELL-DYN   0.97 (0.93,0.99)   0.97 (0.94,0.99)   0.69 (0.26,0.87)
Pearson correlation coefficients (with 95% confidence intervals) between measured and estimated cell proportions from the Koestler model stratified by cell counting device. Abbreviated cell types: GR granulocytes, LY lymphocytes, MO monocytes.
Table S1 Pearson correlation coefficients between measured and estimated cell proportions based on a model using 100 markers for each cell type.
Study population   Device     NE                 EO                  BA                   GR                 LY                 MO
LOLIPOP            XE-2100    0.85 (0.83,0.86)   0.88 (0.87,0.90)    0.01 (-0.03,0.05)    0.84 (0.82,0.86)   0.85 (0.83,0.87)   0.39 (0.35,0.43)
KAROLA             LH 750     0.90 (0.76,0.97)   0.52 (-0.02,0.94)   -0.25 (-0.66,0.02)   0.96 (0.89,0.99)   0.96 (0.88,0.99)   0.83 (0.65,0.94)
KAROLA             CELL-DYN   0.98 (0.94,0.99)   0.96 (0.87,0.99)    -0.05 (-0.56,0.54)   0.96 (0.88,0.99)   0.97 (0.92,0.99)   0.81 (0.38,0.95)
Pearson correlation coefficients (with 95% confidence intervals) between measured and estimated cell proportions from our custom model with 100 markers for each cell type (instead of 10 as in Table 2) stratified by cell counting device. Correlations for GR are obtained by collapsing proportions of NE, EO and BA into one composite cell type. Abbreviated cell types: NE neutrophils, EO eosinophils, BA basophils, GR granulocytes, LY lymphocytes, MO monocytes.