Summary of “A Spatial Scan Statistic” by M. Kulldorff. Presented by Gauri S. Datta, SAMSI, September 29, 2005

Slide 1
Summary of “A Spatial Scan Statistic” by M. Kulldorff. Presented by Gauri S. Datta, SAMSI, September 29, 2005

Slide 2: Background
Scan Statistic
A tool to detect clusters in a point process.
Naus (1965, JASA) studied it in one dimension, where it tests whether a one-dimensional point process is purely random.
Point Process
Consider a time interval $[a,b]$ and a window $A=[t,t+w]$ of fixed width $w$.
$\mu(A)$ = # of e-mails arriving in the time window $A$.
$n(A) = n_A$ = # of junk e-mails = number of points.
Arrival times of junk e-mails define a point process.

Slide 3: Main Idea in Scan Statistic
Move a window $[t,t+w]$ of size $w < b-a$ over a time interval $[a,b]$.
Over all possible values of $t$, record the maximum number of points in the window.
Compare this number with cut-off points under the hypothesis of a purely random Poisson process.

Slide 4: Building Block of Scan Test
Repeated use of tests of equality of two Binomial or Poisson populations.
The two populations are defined by the scanning window $A$ and its complement $A^c$.
As in multiple comparisons, these tests are dependent as one moves the scanning window.

Slide 5: Spatial Scan Statistic (SSS)
Kulldorff (1997) used the SSS to detect clusters in a spatial process.
The SSS can be used:
in a multi-dimensional point process;
with variable window size;
with a baseline process that is an inhomogeneous Poisson process or a Bernoulli process.

Slide 6: SSS (continued)
The scanning window can be any predefined shape.
The SSS is defined on a geographical space $G$ with a measure $\mu$.
In a traditional point process, $G$ is a line and $\mu$ is a uniform measure.
In two dimensions, $G$ is a plane and $\mu$ is Lebesgue measure.

Slide 7: More Examples
Forestry: spatial clustering of trees.
We want to look for clusters of a specific kind of tree after adjusting for the uneven spatial distribution of all trees.
$\mu(A)$ = total # of trees in region $A$.
$n_A$ = # of trees of the specific kind in $A$.

Slide 8: More Examples (continued)
Epidemiology: interest in detecting geographical clusters of disease.
Need to adjust for uneven population density (rural vs. urban population).
For data aggregated into census districts, the measure $\mu$ is concentrated at the central coordinates of the districts.

Slide 9: More Examples (continued)
If interest is in space-time clusters of a disease, the measure $\mu$ will still be concentrated in the geographical region, as in the prior example.
Adjusting for uneven population distribution is not always enough; confounding factors should be taken into account.
E.g., in epidemiology the measure $\mu$ can reflect the standardized expected incidence rate.

Slide 10: SS = LR statistic
For a fixed-size window, the scan statistic is the maximum # of points in the window at any given time or geographical region.
This test statistic is equivalent to the LR test statistic for testing $H_0: \theta_1 = \theta_2$ vs. $H_a: \theta_1 > \theta_2$, where $\theta_1$ and $\theta_2$ are the point intensities inside and outside the window.
The generalization to the LR test is important for a variable window.

Slide 11: Generalized SS: Notation
$G$ = geographical area / study space.
$A$ = window, $A \subset G$.
$N(A)$ = random # of points in $A$, a spatial point process.
Goal: find a zone $Z$, the prominent cluster.
$\xi = \{Z : Z \subset G\}$ = collection of zones over $G$.

Slide 12: Standard Models for SS
Two useful models for the point process: (a) Bernoulli model, (b) Poisson model.
For the Bernoulli model, the measure $\mu$ is such that $\mu(A)$ is an integer for all subsets $A$ of $G$.
Each unit is in one of two states (disease point or no disease).
The locations of the points define a point process.

Slide 13: More on Bernoulli model
There is exactly one zone $Z \subset G$ such that
$p$ = probability of a point in $Z$,
$q$ = probability of a point in $Z^c$.
Test $H_0: p = q$ vs. $H_a: p > q$, $Z \in \xi$.
Under $H_0$, $n(A) \sim \mathrm{Bin}(\mu(A), p)$ for all $A \subset G$.
Under $H_a$, $n(A) \sim \mathrm{Bin}(\mu(A), p)$ if $A \subset Z$, and $n(A) \sim \mathrm{Bin}(\mu(A), q)$ if $A \subset Z^c$.

Slide 14: Poisson Model
In the Poisson model, points are generated by an inhomogeneous Poisson process.
$n(A)$ = # of cases in zone $A$.
There is exactly one zone $Z \subset G$ such that $n(A) \sim \mathrm{Po}(p\,\mu(A \cap Z) + q\,\mu(A \cap Z^c))$ for all $A \subset G$.
Test $H_0: p = q$ vs. $H_a: p > q$, $Z \in \xi$.
Under $H_0$, $n(A) \sim \mathrm{Po}(p\,\mu(A))$ for all $A$, free from $Z$.

Slide 15: Choice of Zones
How is $\xi$ selected?
Possibilities:
(1) All circular subsets.
(2) All circles centered at any of several foci on a fixed grid, with a possible upper limit on size.
(3) Same as (2) but with a fixed size.
(4) All rectangles of fixed size and shape.
(5) If looking for space-time clusters, use cylinders scanning circular geographical areas over variable time intervals.

Slide 16: Bernoulli vs. Poisson Model
The choice between a Bernoulli and a Poisson model does not matter much if $n(G) \ll \mu(G)$, that is, when points are rare relative to the measure.

Slide 17: Likelihood Ratio Test
Bernoulli model likelihood:
$L(Z,p,q) = p^{n(Z)} (1-p)^{\mu(Z)-n(Z)} \, q^{n(G)-n(Z)} (1-q)^{\mu(G)-\mu(Z)-n(G)+n(Z)}$
$L(Z) = \sup\{L(Z,p,q) : p > q\}$
$\hat Z$ is such that $L(Z') \le L(\hat Z)$ for all $Z' \in \xi$.

Slide 18: LR Test (continued)
$L_0 = \sup\{L(Z,p,q) : p = q\} = \sup_p \, p^{n(G)} (1-p)^{\mu(G)-n(G)}$, free from $Z$.
Likelihood ratio: $\lambda = L(\hat Z)/L_0$.

Slide 19: A Useful Result
For the LR test under the Poisson model, see the paper.
The paper gives an important result on the most likely cluster under these models (Theorem 1): as long as the points within the zone constituting the most likely cluster stay where they are, $H_0$ will be rejected irrespective of the locations of the other points in $G$.
If a cluster is located in Seattle, the locations of the points on the east coast of the U.S. do not matter.

Slide 20: Application of SSS to SIDS
The Bernoulli and Poisson models are illustrated using the SIDS data from North Carolina (NC).
For the 100 counties in NC, the data give the total # of live births and the # of SIDS cases for 1974-84.
Live births range from 567 to 52345.
The locations of the county seats are the coordinates.
The measure $\mu$ is the # of live births in a county.

Slide 21: Application to SIDS (continued)
Zones for the scanning window are circles centered at a county coordinate point and including at most half of the total population.
Zones are circular only with respect to the aggregated data.
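The zone construction described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `circular_zones`, its parameters, and the tie-breaking rule (districts at equal distance are added one at a time in index order) are hypothetical and do not come from the slides or the paper. Each district is represented by one coordinate point, and a zone is the set of districts whose points fall inside a circle around a centre.

```python
import math


def circular_zones(coords, measure, max_fraction=0.5):
    """Enumerate candidate zones as sorted tuples of district indices.

    coords       -- list of (x, y) points, one per district (hypothetical input)
    measure      -- list of population counts mu, one per district
    max_fraction -- largest share of the total measure a zone may contain
    """
    total = sum(measure)
    zones = set()
    for cx, cy in coords:
        # Growing the circle around (cx, cy) adds districts nearest first.
        order = sorted(
            range(len(coords)),
            key=lambda j: math.hypot(coords[j][0] - cx, coords[j][1] - cy),
        )
        members, mass = [], 0
        for j in order:
            mass += measure[j]
            if mass > max_fraction * total:
                break  # the circle would now exceed the population cap
            members.append(j)
            zones.add(tuple(sorted(members)))
    return sorted(zones)
```

Because different centres can produce the same set of districts, the sketch stores zones in a set so that each distinct zone is evaluated once.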
As circles around a county seat are drawn, each other county is either entirely inside a zone or entirely outside it, depending on whether its county seat lies within the circle.

Slide 22: Bernoulli model for SIDS
The Bernoulli model is very natural: each birth can correspond to at most one SIDS case.
Table 1 summarizes the results of the analysis.
From Figure 1, the most likely cluster, A, consists of Bladen, Columbus, Hoke, Robeson, and Scotland counties.
Using a conservative test, a secondary cluster, B, consists of Halifax, Hertford, and Northampton counties.

Slide 23: Poisson model for SIDS
For a rare disease such as SIDS, the Poisson model gives a close approximation to the Bernoulli model.
Results are reported in Table 1.
Both models detect the same clusters.
The p-values for the primary cluster are the same under both models; the p-values for the secondary cluster are very close.

Slide 24: Application to SIDS (continued)

Slide 25: Two significant clusters based on SSS

Slide 26: SSS adjusted for Race
For SIDS, one useful covariate is race.
Race is related to SIDS through unobserved covariates such as quality of housing and access to health care.
The overall incidence of SIDS is 1.512 per 1000 for white children and 2.970 per 1000 for black children.

Slide 27: SSS: race-adjusted (continued)
The racial distribution differs widely among the counties in NC.
The race-adjusted analysis leads to the same primary cluster (see Figure 2).
The previous secondary cluster disappears, but a new secondary cluster, C, emerges.
Cluster C consists of a group of counties in the western part of the state.

Slide 28: Application to SIDS (continued)

Slide 29: SSS to SIDS adjusted for race
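The Bernoulli likelihood ratio from the LR test slides can be turned into a small Python sketch. With the maximum likelihood estimates $\hat p = n(Z)/\mu(Z)$ and $\hat q = (n(G)-n(Z))/(\mu(G)-\mu(Z))$ plugged in, the log of $L(Z)/L_0$ is positive when $\hat p > \hat q$ and is taken as zero otherwise. The function names, the toy inputs, and the way p-values are obtained are assumptions made for illustration: the slides report p-values without describing the procedure, so the sketch assumes Monte Carlo replication under $H_0$, conditioning on the total number of cases $n(G)$.

```python
import math
import random


def _xlogy(x, y):
    # Convention 0 * log(0) = 0, so boundary estimates do not raise errors.
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_log_lr(n_z, mu_z, n_g, mu_g):
    """Log of L(Z)/L_0 for the Bernoulli model, given counts for one zone Z."""
    if mu_z == 0 or mu_z == mu_g:
        return 0.0  # zone or its complement is empty: nothing to compare
    p = n_z / mu_z                    # MLE of p inside Z
    q = (n_g - n_z) / (mu_g - mu_z)   # MLE of q outside Z
    if p <= q:
        return 0.0  # no excess inside Z, so the ratio is taken as 1
    log_l_z = (_xlogy(n_z, p) + _xlogy(mu_z - n_z, 1 - p)
               + _xlogy(n_g - n_z, q)
               + _xlogy(mu_g - mu_z - n_g + n_z, 1 - q))
    r = n_g / mu_g                    # MLE of the common rate under H_0
    log_l_0 = _xlogy(n_g, r) + _xlogy(mu_g - n_g, 1 - r)
    return log_l_z - log_l_0


def scan(cases, measure, zones):
    """Return (most likely zone, its log likelihood ratio)."""
    n_g, mu_g = sum(cases), sum(measure)
    best_zone, best = None, 0.0
    for zone in zones:
        n_z = sum(cases[j] for j in zone)
        mu_z = sum(measure[j] for j in zone)
        llr = bernoulli_log_lr(n_z, mu_z, n_g, mu_g)
        if llr > best:
            best_zone, best = zone, llr
    return best_zone, best


def monte_carlo_p_value(cases, measure, zones, replications=999, seed=0):
    """Assumed procedure: redistribute n(G) cases at random among all units."""
    rng = random.Random(seed)
    observed = scan(cases, measure, zones)[1]
    # One entry per individual, labelled by district (fine for toy data only).
    units = [j for j, m in enumerate(measure) for _ in range(m)]
    n_g = sum(cases)
    exceed = 0
    for _ in range(replications):
        sim = [0] * len(measure)
        for j in rng.sample(units, n_g):
            sim[j] += 1
        if scan(sim, measure, zones)[1] >= observed:
            exceed += 1
    return (exceed + 1) / (replications + 1)
```

For example, `scan([8, 1, 1], [10, 10, 10], [(0,), (1,), (2,), (0, 1)])` picks the single district with 8 cases out of 10 units as the most likely cluster. Expanding the population into a list of individuals keeps the simulation easy to read, but it would be far too slow for data on the scale of the NC live births.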