Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms
description
Transcript of Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms
1
Searching forPeriodic Gene Expression Patterns Using Lomb-Scargle Periodograms
http://research.stowers-institute.org/efg/2004/CAMDA
Critical Assessment of Microarray Data Analysis ConferenceNovember 11, 2004
Earl F. GlynnStowers Institute
Arcady R. MushegianStowers Institute &
Univ. of Kansas Medical Center
Jie ChenStowers Institute &Univ. of Missouri
Kansas City
2
Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms
• Periodic Patterns in Biology
• Introduction to Lomb-Scargle Periodogram
• Data Pipeline
• Application to Bozdech’s Plasmodium dataset
• Conclusions
3
Periodic Patterns in Biology
Photograph taken at Reptile Gardens, Rapid City, SDwww.reptile-gardens.com
A vertebrate’s body plan: a segmented pattern.Segmentation is established during somitogenesis.
4
Periodic Patterns in Biology
From Bozdech, et al, Fig. 1A, PLoS Biology, Vol 1, No 1, Oct 2003, p 3.
Intraerythrocytic Developmental Cycle of Plasmodium falciparum
Expression Ratio = RNA from parasitized red blood cells
RNA from all development cycles=
Cy5
Cy3
Values for Log2(Expression Ratio) are approximately normally distributed.Assume gene expression reflects observed biological periodicity.
5
Simple Periodic Gene Expression Model
Time
Exp
ress
ion
period(T)
period(T)
“On” “On” “On”
“Off” “Off”
frequency = 1
period
f = 1 T
Gene Expression = Constant Cosine(2f t)“Periodic” if only observed over a single cycle?
= angular frequency = 2f
6
Introduction to Lomb-Scargle Periodogram
• What is a Periodogram?• Why Lomb-Scargle Instead of Fourier?• Example Using Cosine Expression Model• Mathematical Details• Mathematical Experiments
- Single Dominant Frequency
- Multiple Frequencies- Mixtures: Signal and Noise
7
What is a Periodogram?
• A graph showing frequency “power” for a spectrum of frequencies
• “Peak” in periodogram indicates a frequency with significant periodicity
Time
Log2(Expression)
Periodic Signal Periodogram
Frequency
Spectral“Power”
Computation
8
Why Lomb-Scargle Instead of Fourier?
• Missing data handled naturally
• No data imputation needed
• Any number of points can be used
• No need for 2N data points like with FFT
• Lomb-Scargle periodogram has known statistical properties
Note: The Lomb-Scargle algorithm is NOT equivalentto the conventional periodogram analysis basedFourier analysis.
9
Lomb-Scargle PeriodogramExample Using Cosine Expression Model
T = 1 f
0 10 20 30 40
-1.0
0.0
0.5
1.0
Cosine Curve (N=48)
Time [hours]
Ex
pre
ss
ion
N = 48
0.00 0.05 0.10 0.15 0.20
05
10
20
Lomb-Scargle Periodogram
Frequency [1/hour]
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05p = 0.01
p = 0.001p = 1e-04p = 1e-05p = 1e-06
Period at Peak = 48 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.4
0.8
Peak Significance
Frequency [1/hour]
Pro
ba
bili
ty
p = 3.3e-009 at Peak
A small valuefor the false-alarm probability indicates
a highly significant periodic signal.
Evenly-spaced time points
10
Lomb-Scargle PeriodogramExample Using Noisy Cosine Expression Model
0 10 20 30 40
-1.0
0.0
1.0
Cosine Curve + Noise (N=48)
Time [hours]
Ex
pre
ss
ion
N = 48
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
02
46
8
0.00 0.05 0.10 0.15 0.20
05
10
20
Lomb-Scargle Periodogram
Frequency [1/hour]
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05p = 0.01
p = 0.001p = 1e-04p = 1e-05p = 1e-06
Period at Peak = 45.7 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.4
0.8
Peak Significance
Frequency [1/hour]
Pro
ba
bili
ty
p = 2.54e-007 at Peak
Unevenly-spaced time points
11
Lomb-Scargle PeriodogramExample Using Noise
0 10 20 30 40
-1.0
0.0
0.5
1.0
Noise (N=48)
Time [hours]
Ex
pre
ss
ion
N = 48
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
02
46
8
0.00 0.05 0.10 0.15 0.20
05
10
20
Lomb-Scargle Periodogram
Frequency [1/hour]
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05p = 0.01
p = 0.001p = 1e-04p = 1e-05p = 1e-06
Period at Peak = 7.4 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.4
0.8
Peak Significance
Frequency [1/hour]
Pro
ba
bili
ty
p = 0.973 at Peak
12
Lomb-Scargle Periodogram Mathematical Details
Source: Numerical Recipes in C (2nd Ed), p. 577
PN() has an exponential probability distribution with unit mean.
13
Mathematical Experiment: Single Dominant Frequency
0 10 20 30 40
-1.0
0.0
0.5
1.0
Cosine Curve (N=48)
Time [hours]
Ex
pre
ss
ion
N = 48
0.00 0.05 0.10 0.15 0.20
05
10
20
Lomb-Scargle Periodogram
Frequency [1/hour]
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05p = 0.01
p = 0.001p = 1e-04p = 1e-05p = 1e-06
Period at Peak = 24 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.4
0.8
Peak Significance
Frequency [1/hour]
Pro
ba
bili
ty
p = 3.3e-009 at Peak
Expression = Cosine(2t/24)
Single “peak” in periodogram. Single “valley” in significance curve.
14
Mathematical Experiment: Multiple Frequencies
0 10 20 30 40
-2-1
01
23
Sum of 3 Cosines (N=48)
Time [hours]
Ex
pre
ss
ion
N = 48
0.00 0.05 0.10 0.15 0.20
05
10
20
Lomb-Scargle Periodogram
Frequency [1/hour]
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05p = 0.01
p = 0.001p = 1e-04p = 1e-05p = 1e-06
Period at Peak = 21.8 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.4
0.8
Peak Significance
Frequency [1/hour]
Pro
ba
bili
ty
p = 0.00246 at Peak
Expression = Cosine(2t/48) + Cosine(2t/24) + Cosine(2t/ 8)
Multiple peaks in periodogram. Corresponding valleys in significance curve.
15
Mathematical Experiment: Multiple Frequencies
0 10 20 30 40
-20
24
Sum of 3 Cosines (N=48)
Time [hours]
Ex
pre
ss
ion
N = 48
0.00 0.05 0.10 0.15 0.20
05
10
20
Lomb-Scargle Periodogram
Frequency [1/hour]
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05p = 0.01
p = 0.001p = 1e-04p = 1e-05p = 1e-06
Period at Peak = 48 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.4
0.8
Peak Significance
Frequency [1/hour]
Pro
ba
bili
ty
p = 2.37e-007 at Peak
Expression = 3*Cosine(2t/48) + Cosine(2t/24) + Cosine(2t/ 8)
“Weaker” periodicities cannot always be resolved statistically.
16
Mathematical Experiment: Multiple Frequencies: “Duty Cycle”
0 10 20 30 40
0.0
0.2
0.4
0.6
0.8
1.0
duty cycle: 1/2
Time [hours]
Exp
ress
ion
N = 48
0.0 0.1 0.2 0.3 0.4 0.5
05
1015
2025
Lomb-Scargle Periodogram
Frequency [1/hour]
Nor
mal
ized
Pow
er S
pect
ral D
ensi
ty
p = 0.05p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 24 hours
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Frequency [1/hour]
Pro
babi
lity
p = 2.54e-007 at Peak
0 10 20 30 40
0.0
0.2
0.4
0.6
0.8
1.0
duty cycle: 2/3
Time [hours]
Exp
ress
ion
N = 48
0.0 0.1 0.2 0.3 0.4 0.5
05
1015
2025
Lomb-Scargle Periodogram
Frequency [1/hour]
Nor
mal
ized
Pow
er S
pect
ral D
ensi
ty
p = 0.05p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 24 hours
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Frequency [1/hour]
Pro
babi
lity
p = 5.06e-006 at Peak
50% 66.6% (e.g., human sleep cycle)
One peak with symmetric “duty cycle”. Multiple peaks with asymmetric cycle.
17
Mathematical Experiment: Mixtures: Periodic Signal Vs. Noise
'p' Histogram for 5000 Simulated Expresson Profiles (N= 48 )
log10(p)
Fre
quen
cy
-8 -6 -4 -2 0
050
100
150
p corresponding to max Periodogram Power Spectral Density100 % simulated periodic genes
'p' Histogram for 5000 Simulated Expresson Profiles (N= 48 )
log10(p)
Fre
quen
cy
-8 -6 -4 -2 0
050
010
0015
00 p corresponding to max Periodogram Power Spectral Density50 % simulated periodic genes
'p' Histogram for 5000 Simulated Expresson Profiles (N= 48 )
log10(p)
Fre
quen
cy
-8 -6 -4 -2 0
050
010
0015
0020
00
p corresponding to max Periodogram Power Spectral Density0 % simulated periodic genes
“p” histogram
100% periodic genes 50% periodic50% noise
100% noise
18
0 1000 2000 3000 4000 5000
-8-6
-4-2
0
Multiple Testing Correction Methods
Rank Order of Sorted p Values
Log1
0(p)
bonferroniholmhochbergfdrnone
50 % simulated periodic genes
Mathematical Experiment: Mixtures: Periodic Signal Vs. Noise
50% periodic, 50% noise
Multiple-Hypothesis Testing
Bonferroni
Holm
Hochberg
Benjamini &Hochberg FDR
None
More False Negatives
More False Positives
19
Data Pipeline to Apply to Bozdech’s Data
1. Apply quality control checks to data
2. Apply Lomb-Scargle algorithm to all expression profiles
3. Apply multiple hypothesis testing to define “significant” genes
4. Analyze biological significance of significant genes
20
Bozdech’s Plasmodium dataset:
1. Apply Quality Control Checks
Global views of experiment.Remove certain outliers.
21
Bozdech’s Plasmodium dataset:
1. Apply Quality Control Checks
Many missing data points require imputation for Fourier analysis.
22
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
0 10 20 30 40
-0.5
-0.4
-0.3
-0.2
-0.1
0.0
Mean Expression Profile
Time [hours]
Ex
pre
ss
ion
N = 46
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
01
02
03
04
0
0.00 0.05 0.10 0.15 0.20
05
10
15
20
25
Lomb-Scargle Periodogram
Frequency [1/hour]
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 27.4 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Frequency [1/hour]
Pro
ba
bili
ty
p = 0.0581 at Peak
Complete/06-MeanExpressionProf ile.pdf 2004-10-27 11:39
A weak diurnal period is visible in “mean” data profile.
23
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
0 10 20 30 40
-2-1
01
i3518_1
Time [hours]
Ex
pre
ss
ion
N = 46
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
01
02
03
04
0
0.00 0.05 0.10 0.15 0.20
05
10
15
20
25
Lomb-Scargle Periodogram
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05
p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 45.7 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Pro
ba
bili
ty
p = 1.48e-008 at Peak
Periodic Expression Patterns
0 10 20 30 40
-4-2
02
opfi17638
Time [hours]
Ex
pre
ss
ion
N = 46
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
01
02
03
04
0
0.00 0.05 0.10 0.15 0.20
05
10
15
20
25
Lomb-Scargle Periodogram
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05
p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 45.7 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Pro
ba
bili
ty
p = 1.19e-008 at Peak
Examples of highly-significant periodic expression profiles.
24
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
0 10 20 30 40
-0.5
0.0
0.5
1.0
j167_5
Time [hours]
Ex
pre
ss
ion
N = 35
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
05
10
15
20
25
0.00 0.05 0.10 0.15 0.20
05
10
15
20
25
Lomb-Scargle Periodogram
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05
p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 17.8 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Pro
ba
bili
ty
p = 0.998 at Peak
Aperiodic/Noise Expression Patterns
0 10 20 30 40
-1.0
-0.5
0.0
0.5
1.0
1.5
f35105_2
Time [hours]
Ex
pre
ss
ion
N = 45
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
01
02
03
04
0
0.00 0.05 0.10 0.15 0.20
05
10
15
20
25
Lomb-Scargle Periodogram
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05
p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 32 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Pro
ba
bili
ty
p = 0.516 at Peak
25
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle Algorithm
0 10 20 30 40
-1.5
-1.0
-0.5
0.0
0.5
1.0
1.5
f58149_1
Time [hours]
Ex
pre
ss
ion
N = 39
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
05
10
15
20
25
30
0.00 0.05 0.10 0.15 0.20
05
10
15
20
25
Lomb-Scargle Periodogram
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05
p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 48 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Pro
ba
bili
ty
p = 8.54e-006 at Peak
Small “N”
N=39
0 10 20 30 40
-3-2
-10
12
n170_1
Time [hours]
Ex
pre
ss
ion
N = 32
Time Interval Variability
log10(delta T)
Fre
qu
en
cy
-1.0 -0.5 0.0 0.5 1.0
05
10
15
20
25
30
0.00 0.05 0.10 0.15 0.20
05
10
15
20
25
Lomb-Scargle Periodogram
No
rma
lize
d P
ow
er
Sp
ec
tra
l D
en
sit
y
p = 0.05
p = 0.01
p = 0.001
p = 1e-04
p = 1e-05
p = 1e-06
Period at Peak = 64 hours
0.00 0.05 0.10 0.15 0.20
0.0
0.2
0.4
0.6
0.8
1.0
Peak Significance
Pro
ba
bili
ty
p = 2.74e-005 at Peak
N=32
26
Bozdech’s Plasmodium dataset:
2. Apply Lomb-Scargle AlgorithmSignal and Noise Mixture
'p' histogram
log10(p)
Num
ber
of P
robe
s
-8 -6 -4 -2 0
050
100
150
200
Complete Bozdech set of 6875 probes
Periodic Probes Aperiodic Probes or Noise
histogram-log10p.pdf 2004-11-06 10:26
27
Bozdech’s Plasmodium dataset:
3. Apply Multiple-Hypothesis Testing
= 1E-4
Bonferroni
Holm
Hochberg
Benjamini &Hochberg FDR
None
More False Negatives
More False Positives
0 1000 2000 3000 4000 5000 6000 7000
-8-6
-4-2
0
Multiple Testing Correction Methods
Rank Order of Sorted p Values
Log1
0(p)
bonferroniholmhochbergfdrnone
(Using R's p.adjust methods)
p-adjust.pdf 2004-11-06 10:12
Significance
28
Bozdech’s Plasmodium dataset:
3. Apply Multiple-Hypothesis Testing
p Adjustment
Method
Significance Level
0.05 0.01 0.001 0.0001 0.00001
Bonferroni 3707 3050 1461 13 0
Holm 3995 3351 1705 13 0
Hochberg 4009 3359 1723 15 0
Benjamini & Hochberg FDR
5618 5315 4906 4358 3584
None 5648 5351 4961 4456 3823A priori plan: Use Benjamini & Hochberg FDR level of 0.0001.
Observed number of periodic probes consistent with biological observation of ~60% of Plasmodium genome being transcriptionally active during the intraerythrocytic developmental cycle.
29
Bozdech’s Plasmodium dataset:
4. Analyze Biological SignificanceLomb-Scargle: 4358 Probes, = 1E-4 significance
Comparison with Bozdech’s Results
Dataset Ntime series points
Probes Lomb-ScarglePeriodic
%
Bozdech Complete
43 .. 46(Bozdech
Quality Control Dataset)
5080 4115 81.0%
32 .. 42 1795 243 13.5
Total 6875 4358 63.4While Lomb-Scargle identified 243 new low “N” periodic probes, thelow percentage in that group may indicate some other problem.
30
Bozdech’s Plasmodium dataset:
4. Analyze Biological SignificanceLomb-Scargle: 4358 Probes, = 1E-4 significance
Comparison with Bozdech’s Results
Unclear how to apply Bozdech’s ad hoc “Overview” criteria for use with Lomb-Scargle method: “70% power in max frequency with top 75% of max frequency magnitude.”
The best 3711 Lomb-Scargle “p” values contained 3449 (92.9%) of the Overview probes.
Dataset Probes Lomb-ScarglePeriodic
Bozdech Overview
3711 3611
31
Bozdech’s Plasmodium dataset:
4. Analyze Biological Significance
Lomb-Scargle Results 4358 Probes
“Phaseograms”
Time
Pro
bes
Ord
ered
by
Pha
se
TimeP
robe
s O
rder
ed b
y P
hase
Bozdech: “Overview” Dataset2714 genes, 3395 probes
32
Bozdech’s Plasmodium dataset:
4. Analyze Biological SignificanceLomb-Scargle: 4358 Probes, = 1E-4 significance
Periodogram Map
Frequency
Pro
bes
Ord
ered
by
Pea
k F
requ
ency • Shows periodograms,
not expression profiles
• Shows frequency space, not time
• Dominant frequency band corresponds to 48-hr period
•Are “weak” bands indicative of complex expression, perhaps a diurnal component, or an asymmetric “duty cycle”?Period
33
SummaryLomb-Scargle Method Fourier Method
Weights data points Weights frequency intervals
No special requirement Requires uniform spacing
No special processing Missing data imputed
No special requirement 2N points for FFT; 0 padding
Known statistical properties Permutation tests needed to assess statistical properties
Use “p” values Ad hoc scoring rules
Need estimate of number of “independent frequencies” but explore using continuum
Usually only look at “independent” Fourier frequencies
34
Conclusions• Lomb-Scargle periodogram is effective tool
to identify periodic gene expression profiles
• Results comparable with Fourier analysis
• Lomb-Scargle can help when data are missing or not evenly spaced
We wanted to validate the Lomb-Scargle method before applying to our somitogenesis problem, since the Fourier technique would be difficult to use. Scargle (1982): “surprising result is that the … spectrum of a process can be estimated … [with] only the order of the samples ...”
35
Conclusions
• Conclusions should not be drawn using the individual p-value calculated for each profile. A multiple comparison procedure False Discovery Rate (FDR) must be used to control the error rate.
• Expression profiles may be more complex than simple cosine curves
• Power spectra of non-sinusoid rhythms are more difficult to interpret
36
Supplementary Information
http://research.stowers-institute.org/efg/2004/CAMDA
37
Acknowledgements
Stowers Institute for Medical Research
Pourquie LabOlivier Pourquie
Mary-Lee Dequeant