
Post on 12-Jan-2016


Buried treasures Old statistics in new contexts

“If I have seen further it is by standing on the shoulders of giants”

- Isaac Newton

One form of the past effect

You are dealing with a statistical problem in a special context.

You solve it by realizing a new interpretation of an old, interesting, but uncelebrated result, which was developed in a completely different context.


Three vignettes

V1: Genomics meets sample surveys (methodology)

V2: Bootstrapping and rank statistics (theory)

V3: Cancer genetics and stochastic geometry (application)

John Tukey

V1: Genomics meets sample surveys

Context

Second-order gene-set enrichment analysis

Buried treasure

J.W. Tukey, 1950, Some sampling simplified. J. Amer. Statist. Assoc., 45, 501-519.

Context

D Pyeon, MA Newton, PF Lambert, JA den Boon, S Sengupta, CJ Marsit, CD Woodworth, JP Connor, TH Haugen, EM Smith, KT Kelsey, LP Turek and P Ahlquist (2007).

Fundamental Differences in Cell Cycle Deregulation in Human Papillomavirus Positive and Human Papillomavirus Negative Head/Neck and Cervical Cancers. Cancer Research, 67, 4605-4619.

MA Newton, X Ma, D Sarkar, D Pyeon, and P Ahlquist (2007).

Second order enrichment analysis of microarray expression data reveals gene sets with heterogeneous activation states. Submitted.

[Figure: Slice of expression data from Pyeon et al. 2007. Rows: genes (a few). Columns: tissue samples, HPV+ and HPV-.]

[Figure: Fold changes between HPV+ and HPV- (all genes). Density of log2 [ HPV+ / HPV- ], horizontal axis from -2 to 2.]

The post-processing problem

expression results + exogenous biology

Exogenous biology

B = { c : c = {genes with specific property} }

e.g.

- Gene Ontology (GO)

- Kyoto Encyclopedia of Genes and Genomes (KEGG)

In the HPV example, cell cycle may be an interesting gene set:

- Large sample variance (largest in KEGG, GO)

- Excess differential expression in both directions

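Expression results: $s = (s_1, s_2, \ldots, s_G)$

Gene set: $c \subset \{1, 2, \ldots, G\}$, $c \in B$

Gene set variance, for a set $c$ of $m$ genes with mean score $\bar{s}_c$:

$$u(s,c) = \frac{1}{m-1} \sum_{g \in c} (s_g - \bar{s}_c)^2$$

Standardized statistic:

$$z(s,c) = \frac{u(s,c) - E\{u(s,C)\}}{\sqrt{\mathrm{var}\{u(s,C)\}}}$$

Centering:

$$E\{u(s,C)\} = \frac{1}{G-1} \sum_{g=1}^{G} (s_g - \bar{s})^2$$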

Connection: C indexes a simple random sample of genes, i.e. finite population sampling.

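Scaling:

$$\mathrm{var}\{u(s,C)\} = \ ??$$

We get:

$$\mathrm{var}\{u(s,C)\} = \left( \frac{1}{m} - \frac{1}{G} \right) b_1^T \delta(s) + \left( \frac{2}{m-1} - \frac{2}{G-1} \right) b_2^T \delta(s)$$

following Tukey’s 1950 calculation involving “K” functions: set-level statistics whose expected value equals the same statistic computed on the whole population

where

$$\delta(s) = \begin{pmatrix} \frac{1}{G} \gamma_4 \\ \frac{1}{G(G-1)} \left( \gamma_2^2 - \gamma_4 \right) \\ \frac{1}{G(G-1)} \left( \gamma_1 \gamma_3 - \gamma_4 \right) \\ \frac{1}{G(G-1)(G-2)} \left( \gamma_1^2 \gamma_2 - 2 \gamma_1 \gamma_3 - \gamma_2^2 + 2 \gamma_4 \right) \\ \frac{1}{G(G-1)(G-2)(G-3)} \left( \gamma_1^4 + 8 \gamma_1 \gamma_3 + 3 \gamma_2^2 - 6 \gamma_1^2 \gamma_2 - 6 \gamma_4 \right) \end{pmatrix}, \qquad \begin{pmatrix} b_1 & b_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -3 & 1 \\ -4 & 0 \\ 12 & -2 \\ -6 & 1 \end{pmatrix}$$

and

$$\gamma_k = \sum_{g=1}^{G} s_g^k$$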
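The centering and scaling formulas can be checked numerically on a toy population. The following is a minimal sketch, not the authors' software: the function names (`gene_set_variance`, `delta`, `srs_moments`, `z_score`, `brute_force_moments`) are hypothetical, and the brute-force check simply enumerates every gene set of size m, which is only feasible for small G. It assumes G is at least 4 and m is between 2 and G.

```python
from itertools import combinations

import numpy as np

# Columns b1 and b2 from the slide, in the same row order as delta(s).
B1 = np.array([1, -3, -4, 12, -6])
B2 = np.array([0, 1, 0, -2, 1])


def gene_set_variance(s, c):
    """u(s, c): sample variance (divisor m - 1) of the scores in gene set c."""
    s = np.asarray(s, dtype=float)
    return float(np.var(s[list(c)], ddof=1))


def delta(s):
    """The vector delta(s), built from the power sums gamma_k = sum_g s_g**k."""
    s = np.asarray(s, dtype=float)
    G = s.size
    g1, g2, g3, g4 = (np.sum(s ** k) for k in (1, 2, 3, 4))
    return np.array([
        g4 / G,
        (g2 ** 2 - g4) / (G * (G - 1)),
        (g1 * g3 - g4) / (G * (G - 1)),
        (g1 ** 2 * g2 - 2 * g1 * g3 - g2 ** 2 + 2 * g4) / (G * (G - 1) * (G - 2)),
        (g1 ** 4 + 8 * g1 * g3 + 3 * g2 ** 2 - 6 * g1 ** 2 * g2 - 6 * g4)
        / (G * (G - 1) * (G - 2) * (G - 3)),
    ])


def srs_moments(s, m):
    """Mean and variance of u(s, C) when C is a simple random sample of size m."""
    s = np.asarray(s, dtype=float)
    G = s.size
    d = delta(s)
    mean = float(np.var(s, ddof=1))  # centering formula
    var = (1 / m - 1 / G) * (B1 @ d) + (2 / (m - 1) - 2 / (G - 1)) * (B2 @ d)
    return mean, float(var)


def z_score(s, c):
    """Standardized statistic z(s, c)."""
    mean, var = srs_moments(s, len(c))
    return (gene_set_variance(s, c) - mean) / np.sqrt(var)


def brute_force_moments(s, m):
    """Same moments obtained by enumerating all size-m gene sets (small G only)."""
    u = [gene_set_variance(s, c) for c in combinations(range(len(s)), m)]
    return float(np.mean(u)), float(np.var(u))


rng = np.random.default_rng(0)
scores = rng.normal(size=9)            # made-up scores, G = 9
print(srs_moments(scores, 4))          # closed form
print(brute_force_moments(scores, 4))  # enumeration over all 126 sets
```

The two printed pairs should agree up to floating-point error, which is a quick way to confirm that the reconstructed coefficients are consistent with each other.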


V2: Bootstrapping and rank statistics

Context

Mason and Newton, 1992, A rank statistics approach to the consistency of a general bootstrap. Ann. Statist., 20, 1611-24.

Buried treasure

J. Hajek, 1961, Some extensions of the Wald-Wolfowitz-Noether theorem. Ann. Math. Statist., 32, 506-523.

Jaroslav Hajek

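Data: $X = (X_1, X_2, \ldots)$ iid $(\mu, \sigma^2)$

CLT:

$$\frac{\sqrt{n} \, (\bar{X}_n - \mu)}{\sigma} \Rightarrow N[0,1]$$

Bootstrap mean, with multinomial counts $M_{n,i}$:

$$\bar{X}_n^* = \frac{1}{n} \sum_{i=1}^{n} M_{n,i} \, x_i$$

Bootstrap CLT:

$$\frac{\sqrt{n} \, (\bar{X}_n^* - \bar{x}_n)}{s_n} \Rightarrow N[0,1] \quad \text{a.s. } x$$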

Generalized bootstrap: exchangeable weights

$$\bar{X}_n^W = \frac{1}{n} \sum_{i=1}^{n} W_{n,i} \, x_i$$

Mason, Newton asked: What is the CLT for this case?
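To make the weighted form concrete, here is a small sketch that draws the weighted mean under two weight schemes. The multinomial weights reproduce the classical bootstrap from the slides. The Dirichlet weights (n times a flat Dirichlet vector, often called the Bayesian bootstrap) are an assumption chosen here only as one familiar example of exchangeable weights. All function names are hypothetical.

```python
import numpy as np


def weighted_mean(x, w):
    """(1/n) * sum_i w_i * x_i, the weighted bootstrap mean."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(w, x) / x.size)


def multinomial_weights(n, rng):
    """Classical bootstrap counts: Multinomial(n; 1/n, ..., 1/n)."""
    return rng.multinomial(n, np.full(n, 1.0 / n))


def dirichlet_weights(n, rng):
    """Illustrative exchangeable alternative (an assumption, not from the
    slides): n * Dirichlet(1, ..., 1), so the weights still sum to n."""
    return n * rng.dirichlet(np.ones(n))


rng = np.random.default_rng(0)
x = rng.exponential(size=200)      # made-up data
n = x.size
xbar, s_n = x.mean(), x.std()

for draw in (multinomial_weights, dirichlet_weights):
    # Standardized weighted means, conditional on the data x.
    t = np.array([
        np.sqrt(n) * (weighted_mean(x, draw(n, rng)) - xbar) / s_n
        for _ in range(2000)
    ])
    print(draw.__name__, round(t.mean(), 3), round(t.std(), 3))
```

For both schemes the standardized draws should have mean near 0 and standard deviation near 1, which is the behavior a bootstrap CLT would predict; the simulation only illustrates the question and proves nothing.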

Consider two triangular arrays of numbers

$$\{ a_{n,i} : i = 1, 2, \ldots, n \} \quad \text{and} \quad \{ b_{n,i} : i = 1, 2, \ldots, n \}$$

For a random permutation

$$(\pi_{n,1}, \pi_{n,2}, \ldots, \pi_{n,n})$$

And the sum

$$T_n = \sum_{i=1}^{n} a_{n,\pi_{n,i}} \, b_{n,i}$$

Notes about $T_n$:

- Linear rank statistic; studied in nonparametrics.

- Hajek 1961 gives weak conditions for asymptotic normality (AN)
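The permutation distribution of a linear rank statistic has simple exact first and second moments, which is what makes standardization possible before any limit theorem is invoked. The sketch below uses the standard identities E(T) = n * mean(a) * mean(b) and var(T) = sum((a - mean(a))^2) * sum((b - mean(b))^2) / (n - 1); these are textbook facts about uniform random permutations and are not taken from the slides. Function names are hypothetical, and the enumeration is only practical for very small n.

```python
from itertools import permutations

import numpy as np


def rank_stat_moments(a, b):
    """Exact mean and variance of T = sum_i a[pi_i] * b[i] for uniform pi."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.size
    mean = n * a.mean() * b.mean()
    var = np.sum((a - a.mean()) ** 2) * np.sum((b - b.mean()) ** 2) / (n - 1)
    return float(mean), float(var)


def enumerate_rank_stat(a, b):
    """All n! values of T, one per permutation (small n only)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.array([a[list(p)] @ b for p in permutations(range(a.size))])


a = [0.3, -1.2, 2.0, 0.7, 1.1]   # made-up scores
b = [1.0, 4.0, -2.0, 0.5, 3.0]   # made-up regression constants
t = enumerate_rank_stat(a, b)
print(rank_stat_moments(a, b))
print(float(t.mean()), float(t.var()))
```

The two printed pairs should match up to rounding, since the enumeration covers the whole permutation distribution.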

Back to the general bootstrap problem:

Key fact: for a random permutation $(\pi_{n,1}, \ldots, \pi_{n,n})$,

$$\bar{X}_n^W =_D \bar{X}_n^{W\pi} = \frac{1}{n} \sum_{i=1}^{n} W_{n,\pi_{n,i}} \, x_i$$

Now condition on both data $X = x$ and weights $W = w$:

$$T_n = \frac{1}{n} \sum_{i=1}^{n} w_{n,\pi_{n,i}} \, x_i$$

This is precisely a linear rank statistic, and Hajek (1961) gives general conditions for its asymptotic normality.


V3: Cancer genetics and stochastic geometry

Context

Cellular events during tumor initiation, intestinal cancer

Buried treasure

P. Armitage, 1949, An overlap problem arising in particle counting. Biometrika, 36, 257-266.

Peter Armitage

Context

AT Thliveris, RB Halberg, L Clipson, WF Dove, R Sullivan, MK Washington, S Stanhope, and MA Newton (2005).

Polyclonality of familial murine adenomas: Analyses of mouse chimeras with low tumor multiplicity suggest short-range interactions. PNAS, 102, 6960-6965.

MA Newton, L Clipson, AT Thliveris and RB Halberg (2006).

A statistical test of the hypothesis that polyclonal intestinal tumors arise by random collision of initiated clones. Biometrics, 62, 721-7.

MA Newton (2006).

On estimating the polyclonal fraction in lineage marker studies of tumor origin. Biostatistics, 7, 503-14.

Monoclonal theory of tumor origin

- genetic defect appears in a cell

- aberrant cell divides and persists

Aggregation chimeras provide data on clonality.

B6 Apc Min/+ Mom1 R/R <--> B6 Apc Min/+ Mom1 R/R Rosa26/+

Heterotypic tumor!

Summary count data

mouse id   % blue tissue   total # tumors   heterotypic   pure blue   pure white   ambiguous

1          20              19               5             5           6            3

2          85              24               3             13          6            2

3          20              9                2             2           5            0

4          60              19               3             2           10           4

5          30              24               2             0           21           1

6          50              9                2             2           3            2

7          40              8                5             0           3            0

totals                     112              22            24          54           12

∃ many heterotypic tumors … but why?

H0 : random collision

HA : clonal cooperation - recruitment; selection

Key parameters:

N = # initiated clones

δ = collision distance

Induced R.V.’s:

X1 = # isolated clones

X2 = # doublets

X3 = # triplets

# tumors (one mouse) = X1 + X2 + X3 + …

Intractable distribution!!

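But thanks to Armitage, 1949,

$$E(X_1) \approx m_1 = N \exp(-4\psi)$$

$$E(X_2) \approx m_2 = 2N \left( \psi - \frac{4\pi + 3\sqrt{3}}{\pi} \, \psi^2 \right)$$

$$E(X_3) \approx m_3 = N \left( \frac{4 (2\pi + 3\sqrt{3})}{3\pi} \right) \psi^2$$

where

$$\psi = \frac{\pi N \delta^2}{4A}$$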
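A short sketch of how these approximations might be evaluated and compared with a direct simulation of random collision. Several things here are assumptions rather than content of the slides: A is taken to be the area of the region in which clones fall, two clones are treated as colliding when their centers lie within distance δ, clumps are formed by chaining such collisions, and the square is wrapped as a torus to sidestep edge effects. The function names are hypothetical.

```python
import math

import numpy as np


def armitage_means(N, delta, A):
    """Approximate expected numbers of isolated clones, doublets and triplets."""
    psi = math.pi * N * delta ** 2 / (4 * A)
    m1 = N * math.exp(-4 * psi)
    m2 = 2 * N * (psi - (4 * math.pi + 3 * math.sqrt(3)) / math.pi * psi ** 2)
    m3 = N * 4 * (2 * math.pi + 3 * math.sqrt(3)) / (3 * math.pi) * psi ** 2
    return m1, m2, m3


def simulate_clumps(N, delta, rng, side=1.0):
    """Drop N clone centers uniformly on a side x side square (torus distance,
    an assumption) and return counts[k] = number of clumps of size k."""
    pts = rng.uniform(0.0, side, size=(N, 2))
    parent = list(range(N))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(N):
        d = np.abs(pts[i + 1:] - pts[i])
        d = np.minimum(d, side - d)            # wrap-around distance
        close = np.nonzero((d ** 2).sum(axis=1) < delta ** 2)[0]
        for j in close:
            ri, rj = find(i), find(i + 1 + int(j))
            if ri != rj:
                parent[rj] = ri                # merge the two clumps

    sizes = np.bincount([find(i) for i in range(N)])
    return np.bincount(sizes[sizes > 0])


rng = np.random.default_rng(0)
N, delta, A = 100, 0.02, 1.0                   # made-up parameter values
print(armitage_means(N, delta, A))
counts = np.zeros(4)
reps = 200
for _ in range(reps):
    c = simulate_clumps(N, delta, rng)[:4]
    counts[:c.size] += c
print(counts[1:] / reps)                       # simulated mean X1, X2, X3
```

One internal check on the formulas: every clone belongs to exactly one clump, so m1 + 2 m2 + 3 m3 should equal N up to terms of order ψ³, and the coefficients above do cancel through order ψ².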

Armitage was studying dust particles … not cancer

• Lineage marking

• Unknown N’s

• Extra Poisson variation

Closing the inference loop

Conditional predictive p-values

One form of the past effect

You are dealing with a statistical problem in a special context.

You solve it by realizing a new interpretation of an old, interesting, but uncelebrated result, which was developed in a completely different context.


John Tukey (1915-2000), Jaroslav Hajek (1926-1974), Peter Armitage (1924-present)

# citations of key paper: 8 943

# citations of a book: 2800 5300415

“I seem to have been only like a child playing on the seashore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me.”

- Isaac Newton

Peter Armitage (1924 - present) worked with George Barnard and worked for the Medical Research Council from 1947-61.

From 1961-76 he was Professor of Medical Statistics at the London School of Hygiene and Tropical Medicine.

He then moved to Oxford as Professor of Biomathematics and became Professor of Applied Statistics and head of the new Department of Statistics, retiring in 1990.

He was president of the Royal Statistical Society in 1982-4.