
Copyright © 1994 by the Genetics Society of America

Detecting Selective Sweeps in Naturally Occurring Escherichia coli

David S. Guttman and Daniel E. Dykhuizen

Department of Ecology and Evolution, Division of Biological Sciences, State University of New York at Stony Brook, Stony Brook, New York 11794-5245

Manuscript received April 12, 1994

Accepted for publication August 25, 1994

ABSTRACT

The nucleotide sequences of the gapA and pabB genes (separated by approximately 32.5 kb) were determined in 12 natural isolates of Escherichia coli. Three analyses were performed on the data. First, the levels of polymorphism at the loci were compared within and between E. coli and Salmonella strains relative to their degrees of constraint. Second, the gapA and pabB loci were analyzed by the Hudson-Kreitman-Aguadé (HKA) test for selective neutrality. Four additional dispersed genes (crr, putP, trp and gnd) were added to the analysis to provide the necessary frame of reference. Finally, the gene genealogies of gapA and pabB were examined for topological consistency within and between the loci. These lines of evidence indicate that some evolutionary event has recently purged the variability in the region surrounding the gapA and pabB loci in E. coli. This can best be explained by the spread of a selected allele through the global E. coli population by directional selection and the resulting loss in variability in the surrounding regions due to genetic hitchhiking.

All evolutionary events affect the pattern and distribution of genetic variability found within and between species. Each specific type of event, whether it be mutation, recombination, or selection, leaves a characteristic pattern which can be revealed by molecular population genetic techniques and analyses. Statistical methods designed to recognize patterns of nucleotide variation that deviate from neutral expectations (HUDSON et al. 1987; TAJIMA 1989; MCDONALD and KREITMAN 1991; FU and LI 1993) have been especially successful in expanding our ability to decipher both current and historical selection events. These analyses have been most successful in identifying directional selection in regions of reduced recombination. The genetic hitchhiking associated with the tight linkage in these regions results in a reduction of the within-species polymorphism, relative to the between-species divergence, as the selected allele increases in frequency in the population. This process is commonly referred to as a selective sweep. Although, to date, these analyses have been exclusively applied to regions of restricted recombination in Drosophila species, such as the fourth (BERRY et al. 1991) and X (AGUADÉ et al. 1989; BEGUN and AQUADRO 1991; MARTIN-CAMPOS et al. 1992; LANGLEY et al. 1993) chromosomes, selective sweeps should occur in all genomes and genomic regions that have restricted recombination.

The common enteric bacterium Escherichia coli is perfectly suited to an analysis for selective sweeps. Its genome is a single, circular molecule of roughly 4.5 megabases. It experiences significantly less recombination than sexually reproducing organisms. It is well established as a model organism for both molecular and population genetic studies, and, most importantly, there is extensive documentation for the presence of selective sweeps. These events are commonly referred to as periodic selection events (ATWOOD et al. 1951a,b; LEVIN 1981; MAYNARD SMITH 1991; DYKHUIZEN 1992) and are frequently observed in continuous culture studies.

Genetics 138: 993-1003 (December, 1994)

The gapA gene, encoding glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.12) and located at minute 39.3 on the E. coli chromosome, has been shown to have one of the lowest recorded levels of electrophoretic variation in E. coli (WHITTAM and AKE 1993). In 1991, NELSON et al. published gapA data from E. coli and Salmonella which revealed the surprising finding of a 10-fold difference in the amount of polymorphism within the two species coupled with a consistent degree of constraint on the gapA locus (inferred from the degree of codon bias). These observations made gapA a logical candidate in the search for the molecular traces of a selective sweep. pabB, encoding p-aminobenzoate synthetase and located only 32.5 kb downstream from gapA at minute 39.9 (RUDD 1992), was included in the study based on its proximity to gapA, its known electrophoretic diversity, and its having been previously sequenced in E. coli K-12. The nucleotide sequences of the E. coli gapA and pabB loci were examined for nucleotide variability patterns consistent with a selective sweep using three different approaches.

The first approach was based on a comparison of levels of polymorphism and constraint (as revealed by the relative level of codon bias at the locus) within E. coli and Salmonella. Using the relationship between codon bias and selective constraint, a baseline could be established with which to compare levels of polymorphism between species.

The second approach used statistics developed to detect the patterns characteristic of selection events. These methods contrast intraspecific nucleotide polymorphism and interspecific nucleotide divergence at two or more loci. The associated test statistic represents the probability that regions being compared could not conform to the same evolutionary expectations, such as time to common ancestry. These results are usually interpreted to represent the probability that the loci have experienced different selective histories. Perhaps the most successful of these techniques, and certainly the most widely known, is what has come to be called the Hudson-Kreitman-Aguadé, or HKA, test (HUDSON et al. 1987).

The HKA test uses the neutral theory prediction that the relative rate of evolution of a locus observed between two closely related species (such as E. coli and Salmonella) should be proportionally reflected in the amount of polymorphism found within a species; in other words, “regions of the genome that evolve at high rates, as revealed by interspecific DNA sequence comparisons, will also exhibit high levels of polymorphism within species” (HUDSON et al. 1987). This contrast is expressed as a chi-squared statistic.

The predicted outcome of the HKA test under different evolutionary scenarios can most easily be illustrated by comparing intraspecies nucleotide polymorphism to interspecies nucleotide divergence. A selective sweep has the effect of purging polymorphism out of a population or species while not changing the degree of interspecies divergence; therefore, the polymorphism to divergence ratio will be low (BERRY et al. 1991; LANGLEY et al. 1993). Balancing selection, on the other hand, maintains variability in a population or species relative to the amount of interspecies divergence. Under this evolutionary scenario the polymorphism to divergence ratio will be high (HUDSON et al. 1987). A neutral locus will fall somewhere in between these two extremes. The description of these ratios as high and low is purposefully vague since the absolute values are heavily dependent upon coalescent times and effective population sizes. Therefore, comparisons must be placed in the context of the biology of the gene before any evolutionary scenario can be reliably supported.

The final method was based on the comparison of gene genealogical patterns and structure. Topological consistency, the maintenance of clusters of taxa across loci, and specific genealogical patterns provide information about the evolutionary forces that influenced the divergence of taxa from their common ancestor. This approach was used successfully by DYKHUIZEN and GREEN (1991) to study intergenic recombination, and is similar to the method proposed by SLATKIN (1989) and by SLATKIN and MADDISON (1989) for the study of small amounts of gene flow.

Our analyses indicated that a selective sweep occurred in the chromosomal region which includes the gapA and pabB loci. This finding represents the first identification of such an event in natural populations of microorganisms. We believe that this approach can be of general use for inferring historical events from the genomes of natural isolates of microorganisms.

TABLE 1

E. coli isolates

ECOR no. (a)   Other strain name   Source               ECOR group (b)
(like 2)       K12                 Human diarrhea       A
4              RM39A               Human stool          A
8              RM77C               Human stool          A
10             AN1                 Human stool          A
16             RM191F              Leopard stool        A
38             RM75A               Human stool          D
39             FN104               Human stool          D
40             P60                 Human urine          D
49             FN90                Human stool          D
50             P97                 Human urine          D
65             RM2021              Celebese ape stool   B2
68             RM224H              Giraffe stool        B1

(a) OCHMAN and SELANDER (1984).
(b) SELANDER et al. (1987); HERZER et al. (1990).

MATERIALS AND METHODS

Strains and genes: Strains were selected from the ECOR standard reference collection of natural E. coli isolates (OCHMAN and SELANDER 1984) by a mixed sampling strategy (Table 1). A compromise was needed between a purely random sampling strategy, as dictated by the coalescent analyses, and a non-random strategy which would ensure balanced ECOR group representation. ECOR4, ECOR8, ECOR16 (all from ECOR group A), ECOR38 (ECOR group D), ECOR65 (ECOR group B2) and ECOR68 (ECOR group B1) (SELANDER et al. 1987; HERZER et al. 1990) were selected at random (without regard to their phylogenetic relationships), while ECOR10 (ECOR group A), ECOR39, ECOR40, ECOR49 and ECOR50 (all from ECOR group D) were selected based on their phylogenetic relatedness as determined by multiple locus enzyme electrophoresis of 38 protein coding genes. The standard E. coli laboratory strain, K-12, which is electrotypically identical to strain ECOR2 (ECOR group A), was also included in the analysis for a total of 12 strains.

Salmonella (serovar typhimurium) gapA data were obtained from NELSON et al. (1991). E. coli gapA data from NELSON et al. (1991) have also been included in some of the analyses. When their data were used, all non-ECOR strains were excluded to avoid biasing the sample with strains selected solely on the basis of their rare electrophoretic mobility. Salmonella pabB data were obtained from GONCHAROFF and NICHOLS (1988). E. coli and Salmonella data for putP (located 785.6 kb upstream from gapA) came from NELSON and SELANDER (1992), and gnd (located 238 kb downstream from gapA) came from DYKHUIZEN and GREEN (1991) and BISERCIC et al. (1991). E. coli data for trpAB (located 549.6 kb upstream from gapA) and crr (located approximately 677 kb downstream from gapA) came from MILKMAN and STOLTZFUS (1988) and HALL and SHARP (1992), respectively, while the Salmonella data came from CRAWFORD et al. (1980) and NICHOLS and YANOFSKY (1979) for the trpAB locus, and NELSON et al. (1984) for the crr locus.

Polymerase chain reaction (PCR) amplification and nucleotide sequencing: Genomic DNA was extracted from each of the 12 strains (WILSON 1990) and PCR-amplified (SAIKI 1989) with two terminal primers designed from the published K-12 gapA sequence (BRANLANT and BRANLANT 1985) and pabB sequence (GONCHAROFF and NICHOLS 1984). The gapA PCR primers were: 5' primer, 5'-TGACTATCAAAGTAGGTATCAAGGG-3'; 3' primer, 5'-AGATGTGAGCGATCAGGTCCAGMG-3'. The pabB PCR primers were: 5' primer, 5'-TTTTACACTCCGGCTATGCCGATCA-3'; 3' primer, 5'-GCTGCGGTTCCAGTTCGTCGATAAT-3'. The amplified regions encompass 937 bases of the 990-bp coding sequence for gapA (beginning at base position 27), and 1009 bases of the 1359-bp coding sequence of pabB (beginning at base position 123).

DNA for sequencing was obtained from secondary amplifications from the genomic PCRs. Single-stranded DNA was generated by λ exonuclease treatment (HIGUCHI and OCHMAN 1989) and removal of PCR primers was carried out with Millipore Ultrafree MC 30K spin columns. Dideoxynucleotide chain termination sequencing was performed following the Sequenase II protocol (U.S. Biochemical Corp.). Six internal primers for each gene, in addition to the terminal PCR primers, were used to sequence the entire amplified region in both orientations.

Analysis: Each of the sequences was individually and manually read and recorded into ESEE (CABOT and BECKENBACH 1989). Neighbor-joining analysis was performed with the aid of NJBoot2 (TAMURA 1993), and parsimony analysis was carried out using PAUP (SWOFFORD 1993). All HKA analyses were performed on a Mathcad (MathSoft) template developed by the authors. All gene sequences have been submitted to GenBank. The gapA gene sequences are available through GenBank accession nos. U07750-U07754 and U07765-U07773. The pabB sequences are available through accession nos. U07748, U07749 and U07755-U07764.

The HKA tests were performed on both the complete sequence and on only synonymous site variation. The use of total sequence data is valid under the assumptions of a strict neutrality model, while synonymous site variation should be used if a mildly deleterious mutation model is assumed (R. R. HUDSON, personal communication). The synonymous site variation analysis used a sequence length of 0.75 times the total number of codons to approximate the number of sites capable of synonymous variation (silent sites). The divergence was determined based on the comparison of the Salmonella sequence to a single, randomly selected E. coli strain. A correction was used for unequal sample sizes (BERRY et al. 1991).
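The silent-site approximation described above can be illustrated with a short calculation. This is a minimal sketch and not the authors' Mathcad template; the function names are hypothetical. The pabB counts used here (50 variable sites of which 19 are nonsynonymous, and 186 synonymous differences from Salmonella, over 1009 bp) are those reported in the Results.

```python
def silent_sites(length_bp: float) -> float:
    """Approximate the number of silent sites as 0.75 x the number of codons."""
    codons = length_bp / 3.0
    return 0.75 * codons


def percent_of_silent_sites(count: int, length_bp: float) -> float:
    """Express a count of synonymous changes as a percentage of silent sites."""
    return 100.0 * count / silent_sites(length_bp)


# pabB: 1009 bp sequenced, 50 polymorphic sites of which 19 are nonsynonymous.
pabB_syn_polymorphism = percent_of_silent_sites(50 - 19, 1009)
# pabB: 186 synonymous differences between Salmonella and one E. coli strain.
pabB_syn_divergence = percent_of_silent_sites(186, 1009)

print(f"{pabB_syn_polymorphism:.2f}% {pabB_syn_divergence:.2f}%")
```

Under this approximation the two pabB values round to the 12.29% and 73.74% given in Table 4.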

RESULTS

Nucleotide polymorphism: Table 2 summarizes the polymorphism data for gapA, showing that it is virtually monomorphic across the twelve sequenced E. coli strains. Seven polymorphisms were found in the 937-bp region sequenced. All of these are third position synonymous substitutions, resulting in a synonymous substitution rate of 3.0%. All but one of the polymorphisms are transitions. The average pairwise number of nucleotide differences between strains for total data is 2.652 (0.28%), which agrees very well with the value of 0.2% calculated by NELSON et al. (1991) for 13 strains of E. coli. Since only synonymous substitutions were found, the average pairwise difference for amino acids is 0.00%, which also agrees with the NELSON et al. finding of 0.1%. The codon bias for E. coli K-12 at gapA is 0.862, as measured by a codon adaptation index (CAI) (SHARP and LI 1987). This is the highest CAI observed in any E. coli gene. The divergence between the species (as calculated by the comparison of the Salmonella sequence to a single randomly selected E. coli sequence) is 58 (6.19%) bases for the total sequence comparison, and 53 (22.62%) for only synonymous sites (using the approximation for the number of silent sites discussed above).

TABLE 2

gapA polymorphic sites

Strains:                 K12, 4, 8, 10, 16, 38, 39, 40, 49, 50, 65, 68
Nucleotide position:     306  309  339  345  597  621  717
Position in codon (a):   3    3    3    3    3    3    3
s/n (b):                 s    s    s    s    s    s    s

Base entries as extracted (row and column alignment not preserved):
T C T G C C T C
A T T C A T T C A T T C
C A T C C A T C
T C

(a) Nucleotide position in the codon.
(b) Synonymous vs. nonsynonymous change.
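The average pairwise number of nucleotide differences quoted above is the mean, over all pairs of strains, of the number of sites at which two aligned sequences differ. The sketch below shows one way to compute it; the function names and the toy alignment are hypothetical and are not the gapA data, and the only figures taken from the text are 2.652 differences over 937 bp.

```python
from itertools import combinations


def pairwise_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def average_pairwise_differences(sequences: list[str]) -> float:
    """Mean number of differences over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignment (not the gapA data): four sequences, eight sites.
toy_alignment = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "ACGAACGA"]
k = average_pairwise_differences(toy_alignment)

# Per-site percentage for gapA, from the reported 2.652 differences over 937 bp.
gapA_percent = 100.0 * 2.652 / 937
print(f"toy k = {k:.3f}, gapA = {gapA_percent:.2f}%")
```

Dividing the reported 2.652 differences by the 937 sequenced bases gives the 0.28% stated in the text.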

The pattern of nucleotide polymorphism is quite different in pabB. As is shown in Table 3, pabB is highly polymorphic, with 50 out of the 1009 sites being variable. Thirty-eight percent (19 sites) of the polymorphic sites are nonsynonymous substitutions, giving a synonymous polymorphism rate of 12.29%. The transition to transversion ratio is 1.8:1. The average pairwise number of nucleotide differences between strains is 19.758 (1.96%) and the average pairwise number of amino acid differences between strains is 6.348 (1.89%). The codon bias, expressed as CAI, is 0.33 for the K-12 pabB, which is lower than seen in other E. coli genes. The average CAI reported for 68 E. coli genes is 0.41 (SHARP 1991). When compared to the Salmonella pabB locus, 295 fixed sites are found with 172 being silent. The divergence for Salmonella and a single randomly selected strain of E. coli is 311 bp (30.83%) with 186 bp (73.74%) for synonymous variation. Thirty-eight percent nonsynonymous variation is a remarkably high amount of replacement substitutions. pabB is part of an inducible system coding for vitamin biosynthesis. These systems typically are only infrequently expressed and, therefore, are under less constraint. It is unclear if the nature of this system is sufficient to account for these findings. Table 4 summarizes the E. coli polymorphism and E. coli and Salmonella divergence data for gapA and pabB.

Selective history of the gapA and pabB genes: The HKA test (HUDSON et al. 1987) tests for departures from neutral expectations in DNA sequence data by comparing the ratio of intraspecific to interspecific nucleotide variation for two species. It calculates a statistic based upon the relative relationship between these two levels of variation found at two loci. This statistic expresses the probability that the two loci have experienced similar genealogical histories. The analysis can be performed with either total sequence data or only synonymous site variation (R. R. HUDSON, personal communication). Both approaches were used, and the results from each were qualitatively the same.

TABLE 3

pabB polymorphic sites

Nucleotide position (first 25 sites):
156 160 206 214 283 316 376 374 396 420 429 430 435 450 490 495 519 522 535 618 645 657 661 711 720

Strain entries as extracted (column alignment not preserved):
K12   G A G C A A G A G T C A C T C C G G A C A T A C A
4     T
8
10
16
38    C G G C G G T A A A C A G
39    C G G C G G T A A A C A G
      G
40    C G G C G G T A A A C A G G G
49    A G G C G T C G G T A C G A G A G
50    G A G G C G T C G G T A G A G A G G
65    T G C G C G G T A G T G
68    T G
      G

Position in codon (a):
3 1 2 1 1 3 1 3 3 3 3 1 3 3 1 3 3 3 1 3 3 3 1 3 3

s/n (b), as extracted:
n n n n n s n C s C s s s n s n s s s s n n s n n s s

(a) Nucleotide position in the codon.
(b) Synonymous versus nonsynonymous change.
(c) Phylogenetically inconsistent site.

TABLE 4

E. coli polymorphism and divergence from Salmonella

                                          gapA    pabB    gapA-pabB   trpAB   crr     putP    gnd
n                                         12      12      12          4       12      12      18
Total sequence length (bp)                937     1009    1946        1853    510     1467    771
E. coli synonymous polymorphism (%) (a)   3.00    12.29   7.81        26.77   14.12   27.54   99.10
Synonymous divergence (%) (b)             22.62   73.74   49.13       62.46   25.10   53.99   51.88
Polymorphism/divergence                   0.13    0.17    0.16        0.43    0.56    0.51    1.91

(a) Number of synonymous sites estimated as 0.75 times the number of codons.
(b) Divergence calculated from one randomly chosen sequence from each species.
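The polymorphism to divergence ratios in the last row of Table 4 follow directly from the two rows above it. As a quick check, the sketch below recomputes them from the tabulated synonymous percentages; the dictionary layout and variable names are illustrative only.

```python
# Synonymous polymorphism and synonymous divergence (%) as listed in Table 4.
table4 = {
    "gapA": (3.00, 22.62),
    "pabB": (12.29, 73.74),
    "gapA-pabB": (7.81, 49.13),
    "trpAB": (26.77, 62.46),
    "crr": (14.12, 25.10),
    "putP": (27.54, 53.99),
    "gnd": (99.10, 51.88),
}

# Ratio of within-species polymorphism to between-species divergence per locus.
ratios = {locus: poly / div for locus, (poly, div) in table4.items()}

# List loci from the lowest ratio (sweep-like) to the highest.
for locus, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{locus:10s} {ratio:.2f}")
```

Sorted this way, gapA, the pooled gapA-pabB data and pabB fall at the low end, the three reference loci (trpAB, putP, crr) in the middle, and gnd at the high end.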

The HKA analysis of the gapA and pabB data (Tables 4 and 5) indicated that the relative polymorphism to divergence ratios are approximately the same for these two genes, despite the almost 10-fold difference in their absolute levels of polymorphism. Since these genes are located only approximately 32.5 kb apart on the E. coli chromosome, the effect of linkage on the analysis should be taken into account. Linkage results in a positive correlation between loci with respect to their number of segregating sites and fixed differences. This positive correlation, in turn, makes the model slightly more conservative by shifting the distribution of chi-squared statistics toward smaller values (HUDSON et al. 1987). To take this into account, all additional HKA analyses with the gapA and pabB data sets were carried out by using the genes individually and with a pooled gapA-pabB data set.
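To make the logic of the gapA-pabB comparison concrete, the following is a minimal sketch of a two-locus HKA chi-squared calculation. It is not the authors' Mathcad template. It assumes the common simplified form of the test in which polymorphism is sampled from one species only, both species have equal population size, sample sizes are equal at the two loci, and no correction for unequal sample sizes is applied; the function names are hypothetical. The inputs are the total-data counts given in the Results (gapA: 7 polymorphic sites, 58 differences from Salmonella; pabB: 50 and 311). Because of these simplifications the P value it prints should not be expected to reproduce Table 5.

```python
import math


def harmonic_sums(n: int) -> tuple[float, float]:
    """Return a_n = sum(1/i) and b_n = sum(1/i^2) for i = 1..n-1."""
    a_n = sum(1.0 / i for i in range(1, n))
    b_n = sum(1.0 / i**2 for i in range(1, n))
    return a_n, b_n


def hka_two_loci(seg_sites, divergence, n):
    """Simplified HKA chi-square for two loci, polymorphism from one species.

    seg_sites  -- segregating sites at each locus (S_1, S_2)
    divergence -- differences from the outgroup sequence (D_1, D_2)
    n          -- sample size, assumed equal at both loci
    """
    a_n, b_n = harmonic_sums(n)
    theta_total = sum(seg_sites) / a_n        # sum of theta over loci
    t_plus_1 = sum(divergence) / theta_total  # scaled divergence time + 1
    chi2 = 0.0
    for s, d in zip(seg_sites, divergence):
        theta = (s + d) / (a_n + t_plus_1)    # per-locus theta estimate
        exp_s, var_s = theta * a_n, theta * a_n + theta**2 * b_n
        exp_d, var_d = theta * t_plus_1, theta * t_plus_1 + theta**2
        chi2 += (s - exp_s) ** 2 / var_s + (d - exp_d) ** 2 / var_d
    # Upper-tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# gapA: 7 polymorphic sites, 58 differences; pabB: 50 and 311 (total data).
chi2, p = hka_two_loci(seg_sites=(7, 50), divergence=(58, 311), n=12)
print(f"chi2 = {chi2:.3f}, P = {p:.2f}")
```

Even in this simplified form the statistic is small and the comparison is far from significant, in line with the observation that the two genes have approximately the same polymorphism to divergence ratio.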

Pairwise comparisons with crr, putP, trpAB and gnd were performed to determine how the region encompassing the gapA and pabB loci compared to other regions of the chromosome. Polymorphism and divergence data are presented in Table 4, and the results of the HKA analyses are presented in Table 5. Analyses of the codon bias, G + C content, and degree of divergence between E. coli and Salmonella for crr, putP and trp indicate that they approximate the “average” E. coli gene, and, therefore, should provide a good baseline for comparison with gapA and pabB. The HKA test comparing crr, putP and trp showed no departure from expectations for common genealogical history, but all comparisons between them and gapA, pabB, or the pooled gapA-pabB data set indicated that they had different genealogical histories (all comparisons were significant at the 5% level, with one degree of freedom, except the pabB-putP, gapA-crr, and gapA-putP comparisons, which had P values slightly above 0.05).

The HKA analyses between gnd and gapA, pabB, or the pooled gapA-pabB data sets clearly indicated that these loci experienced different genealogical histories. The gnd comparisons with putP, crr and trp were generally non-significant at the 5% level, although all of the analyses showed a strong trend toward excess polymorphism in the gnd locus. This failure to find consistent significant differences in the genealogical histories between gnd and crr, putP and trpAB may be due to the near complete saturation of synonymous polymorphism at gnd. gnd was expected to show excess polymorphism given that the rfb gene cluster is located within 200 bp of gnd. The rfb gene cluster codes for the O antigen, a surface polysaccharide subject to intense diversifying selection (BISERCIC et al. 1991; JIANG et al. 1991; WANG et al. 1992). The tight linkage of gnd to a gene complex undergoing extreme diversifying selection gives an a priori expectation of an atypical selective history. The addition of the NELSON et al. (1991) gapA data (using only the ECOR strains) to our data has no significant effect on the outcome of these analyses (results not shown).

TABLE 3 (continued)

Nucleotide position (remaining 25 sites):
734 747 750 751 755 774 787 804 848 872 873 884 890 891 900 903 921 943 960 994 1029 1040 1041 1060 1080

Strain entries as extracted (row and column alignment not preserved):
T G T G A G T C C T A A G G T C C T A G C A A G A A C T G
C C G A C A C T T C C G G A C C G A C A C T T C C G G A C C G A C A C T T C C G G A C C G C A G C C C C T C C C G C A G C C C C T C C A C A G T T A G
A C T G

Position in codon (a):
3 3 3 1 3 3 3 3 2 2 3 3 1 3 3 3 3 3 3 3 3 2 3 1 3

s/n (b), as extracted:
~ s s n s s s s n n s s n s s s s s s ~ s s n ~ n n s

TABLE 5

Hudson-Kreitman-Aguadé test P values

            trpAB    crr      gnd      putP     gapA     pabB     gapA-pabB
trpAB       -        0.789    0.376    0.709    0.030    0.034    0.026
crr         0.791    -        0.241    0.888    0.051    0.019    0.010
gnd         0.265    0.121    -        0.104    0.005    0.002    <0.001
putP        0.744    0.908    0.070    -        0.068    0.058    0.041
gapA        0.031    0.056    0.004    0.066    -        0.84     NA
pabB        0.040    0.020    <0.001   0.051    0.742    -        NA
gapA-pabB   0.032    0.012    <0.001   0.024    NA       NA       -

Synonymous site data are above diagonal and total data are below diagonal. Bolded numbers are significant at the P < 0.05 level and italicized numbers are significant at the 0.05 < P < 0.10 level. NA, not applicable.

Gene genealogical analysis of gapA and pabB: A gene genealogical analysis of the gapA and pabB genes was carried out using both maximum parsimony methods and neighbor-joining (SAITOU and NEI 1987). Midpoint rooted neighbor-joining trees of gapA and pabB are presented in Figure 1 along with bootstrap values (above a critical value of 50) at each node, which reflect the significance levels for the clusters of taxa.

The parsimony analysis of gapA found only one tree (not shown). This tree had a consistency index of 1.00 and was identical to the neighbor-joining tree. Despite the perfect consistency, relatively low bootstrap scores are anticipated due to the extremely low level of polymorphism. The high consistency and low polymorphism of the gapA gene result in all taxa being within three mutational steps of each other.

The parsimony analysis of pabB also found only one tree (not shown) and calculated a consistency index of 0.927. This tree was also identical to the one produced by the neighbor-joining algorithm. Only 4 of the 50 polymorphic sites were determined to be homoplastic, and two of these four are completely linked and located within the same codon, suggesting that they entered the strain via recombination at the same time. The bootstrap scores presented in Figure 1B indicate that the taxa clusters are very well supported.
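The consistency index reported for each tree is the ratio of the minimum number of character-state changes the data could require to the number of changes actually needed on the tree, so a value of 1.00 means no homoplasy. The sketch below illustrates the calculation; the function name and both per-site step counts are hypothetical examples, not values taken from the PAUP output.

```python
def consistency_index(min_steps, observed_steps):
    """CI = (sum of minimum possible steps) / (sum of steps observed on the tree)."""
    if len(min_steps) != len(observed_steps):
        raise ValueError("need one minimum and one observed count per site")
    return sum(min_steps) / sum(observed_steps)


# Hypothetical example: seven two-state sites, each changing exactly once on
# the tree, as would be the case for a data set with no homoplasy.
no_homoplasy = consistency_index([1] * 7, [1] * 7)

# Hypothetical example: ten two-state sites, one of which changes twice.
one_homoplastic_site = consistency_index([1] * 10, [1] * 9 + [2])

print(f"{no_homoplasy:.2f} {one_homoplastic_site:.3f}")
```

Each homoplastic site adds extra steps to the denominator only, which is why a handful of such sites in pabB pulls the index slightly below 1.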

FIGURE 1.-Midpoint rooted neighbor-joining trees. (A) The gapA locus; (B) the pabB locus. Bootstrap values from 1000 bootstrap replicates are presented at the nodes. Scale bar below represents a genetic distance of 0.002.

A comparison of the gapA and pabB gene genealogies indicates that they are almost identical in their clustering of taxa, despite a huge difference in their relative levels of polymorphism. With the exception of ECOR4, all group A strains (ECOR8, ECOR10, ECOR16 and K-12) are isosequential (have identical sequences) in both genes. The same holds true for strains ECOR38, ECOR39 and ECOR40 from group D, and nearly so for ECOR49 and ECOR50, also from group D (these strains are separated by one mutational step in pabB). The group B strains (ECOR65 and ECOR68) are more closely related to the group A strains, but their placement is not entirely consistent.

DISCUSSION

A selective sweep in the region encompassing the gapA and pabB loci of E. coli: The DNA sequence analysis of the closely linked E. coli gapA and pabB genes has revealed the traces of an important evolutionary event. Three lines of evidence indicate that an adaptive allele has swept through this region in the relatively recent evolutionary past, reshaping the genetic structure of the surrounding region in a characteristic manner as a result of the genetic hitchhiking brought about by limited recombination. These lines of evidence are the relationship between the levels of constraint and the polymorphism at the loci in E. coli and Salmonella, the selective history of the region as revealed by the HKA analysis, and the topological pattern and consistency of the gene genealogical analysis. Although none of these lines of evidence is completely independent, nor would any of them make the case for a selective sweep on its own, the combined picture overwhelmingly supports the assertion.

A central prediction of neutral theory is that the rate of molecular evolution will be negatively correlated with the degree of selective constraint. The gapA locus exhibits all of the characteristics of a highly constrained gene. It has one of the lowest levels of polymorphism of any E. coli gene at 0.28%. Of the eighteen ECOR strains of E. coli sequenced at the gapA locus, either by us or by NELSON et al. (1991), only two nonsynonymous substitutions were found, and both were located in the same codon. Additionally, gapA has the highest degree of codon bias found in any E. coli locus. The extent of codon bias is strongly correlated with levels of gene expression (SHARP and LI 1987) and negatively correlated with the rate of evolution. With its extremely slow rate of evolution and inferred high level of expression, the gapA gene must be under considerable selective constraint.

A striking comparison can be made between the gapA gene in E. coli and in Salmonella. Presumably, gapA is of equal importance in these two species and under the same degree of constraint. This is reflected in their showing approximately the same degree of codon bias. Despite this, when the average pairwise nucleotide polymorphism within the E. coli gapA gene is compared to that of the Salmonella gapA gene, a difference of over an order of magnitude is found. We calculated the average pairwise nucleotide divergence for our 12 strains of E. coli to be 0.28%, while NELSON et al. (1991) calculated a value of 3.8% for 16 strains of Salmonella. If E. coli and Salmonella gapA genes are under approximately the same selective pressures, then this discrepancy could be explained by a difference in the effective population sizes of the two species. Salmonella is thought to have an effective population size of roughly twice that of E. coli. When NELSON et al. (1992) examined 16 strains of Salmonella and 12 strains of E. coli at the putP locus, they found that the average pairwise nucleotide difference was only twice as great in Salmonella as in E. coli. Therefore, the difference in effective population size between the two species cannot account for the magnitude of the discrepancy. The most plausible explanation for our finding is that some relatively recent evolutionary event purged most of the variability from the E. coli gapA gene.
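The size of this discrepancy can be made explicit with a few lines of arithmetic. The sketch below uses only the percentages quoted above and the roughly twofold putP difference; the variable names are ours and purely illustrative.

```python
# Average pairwise nucleotide diversity at gapA, in percent (values quoted in the text).
ecoli_gapA = 0.28
salmonella_gapA = 3.8

# Fold difference between the species at gapA.
fold_gapA = salmonella_gapA / ecoli_gapA

# At putP the Salmonella value is reported as only about twice the E. coli value,
# which is taken here as the difference attributable to effective population size.
fold_putP = 2.0

# Portion of the gapA difference left unexplained by population size alone.
excess = fold_gapA / fold_putP

print(f"gapA fold difference: {fold_gapA:.1f}")
print(f"excess over the putP expectation: {excess:.1f}")
```

Under these figures the gapA difference is several times larger than the putP comparison would predict, which is the gap the argument attributes to a purging event.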

The second line of evidence for a selective sweep comes from the results of the HKA analyses. Pairwise comparisons of gapA and pabB were made with four other genes, crr, putP, trpAB and gnd, to look for departures from neutral expectations. crr, putP and trpAB were chosen as representative of the "average" E. coli gene. This was based upon their degree of codon bias and polymorphism within the species and divergence from Salmonella. gnd was selected for comparison with the a priori understanding that it does not reflect the average E. coli gene due to its proximity to the rfb gene cluster, which codes for the O antigen, a surface polysaccharide. This region is believed to be undergoing strong diversifying selection (BISERCIC et al. 1991; JIANG et al. 1991; WANG et al. 1992). gnd's extremely tight linkage to this region should result in the maintenance of excess polymorphism at that locus. It can therefore be used to establish the polarity of any departure from neutrality.

The first key to substantiating the occurrence of a recent selective sweep at the gapA locus is evidence that the linked nucleotide sequences surrounding gapA share the same pattern. Given that the rate of recombination in E. coli is low relative to what is found in sexually reproducing organisms, nonselected parts of the genome will be pulled through the population by the selected locus at the time of the selection event. WHITTAM et al. (1983) used electrophoretic markers to show that the amount of linkage disequilibrium was not correlated with map distance for genetic distances greater than about 1 minute or ~45 kb. Since this is the size of the average P1 transducing fragment, we propose that the average recombination fragment will be approximately this size. Therefore, since the pabB locus is roughly 0.5 minute away from gapA, it is likely to show the effects of the selective sweep as well. MILKMAN and STOLTZFUS (1988) and MILKMAN and BRIDGES (1990, 1993) have estimated the size of recombinant fragments in E. coli to be considerably smaller than this, but since their analysis focused on a region of less than 4.5 kb it would have missed fragments on the order of our estimate.

Upon initial inspection of the gapA and pabB loci it seems unlikely that they should share a common selective history and have undergone the same variation-purging selective sweep. pabB appears to be evolving faster than the average E. coli locus and is almost ten times more polymorphic than gapA, which is evolving considerably slower than the average locus. Despite this, the crucial comparison for understanding the selective history is the within-species polymorphism vs. the degree of divergence between species. From this perspective, gapA and pabB are nearly identical. gapA has a nucleotide polymorphism to divergence ratio (for synonymous substitutions only) of 0.13, compared to 0.17 for pabB. The HKA analysis for these two genes shows them to be not significantly different at the α = 0.84 level with 1 d.f. (Table 5). On the basis of these results and the genes' close proximity on the E. coli genome, we conclude that these two genes have shared a common selective history. This allows us to pool the gapA and pabB data sets for all further comparisons, alleviating the problem of nonindependence of the two genes, brought about by linkage, and relaxing the inflated conservativeness of the HKA test.

The trpAB sequences were found to share a common selective history with the putP gene and the crr gene by the HKA analysis. This result cannot be accounted for by linkage since they are spread over 1.3 megabases. When trpAB, crr and putP were individually compared to gapA, pabB and the pooled gapA-pabB data set, the analyses indicate that these two sets of genes do not share a common genealogical history. Although this further solidifies the case for a selective sweep around the gapA locus, it is not definitive. The HKA analysis is only capable of determining if the evolutionary processes that operated at each gene were inconsistent. In other words, it gives a statistical basis for determining if two genes experienced the same selective history. The test, by itself, is not capable of determining the polarity of the departure. In this case we can independently verify the polarity with the assistance of the gnd analyses. As has been discussed above, there is a priori reason to assume that an HKA analysis of the gnd locus will show a departure from neutral expectations in the direction of excess polymorphism since it is known to be linked to a gene complex which is believed to be under diversifying selection. Given this, we expect to find highly significant results when this locus is compared to the gapA and pabB loci, which we allege have experienced a selective sweep, and less extreme results when compared to the crr, putP and trpAB loci, which are presumed to be neutral. As Figure 2 and Table 5 show, these are exactly the results obtained. The gnd comparisons with gapA and pabB show highly significant differences in the genealogical histories. When gnd is compared with crr, putP and trpAB, there is a clear trend but only borderline significance to non-significance at the 5% level. The nonsignificant results may be partially explained by the unfortunately low sample number available for trpAB and the near complete saturation of the synonymous sites in gnd.
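To make the logic of a pairwise HKA comparison concrete, the following is a minimal sketch of a two-locus test statistic in the spirit of HUDSON et al. (1987). It is not the computation used to produce Table 5. It rests on simplifying assumptions of ours: polymorphism is sampled from one species only, both species have the same effective size (f = 1), both loci share the same sample size n, and counts are per locus. The function names and the example counts are hypothetical.

```python
import math

def harmonic(n, power=1):
    """Sum of 1/j**power for j = 1 .. n-1, where n is the number of sampled sequences."""
    return sum(1.0 / j ** power for j in range(1, n))

def hka_two_locus(S, D, n):
    """Return (chi-square, P) for a simplified two-locus HKA comparison.

    S: segregating sites within the species, one count per locus
    D: differences from the outgroup sequence, one count per locus
    n: number of sequences sampled within the species (same for both loci)
    """
    if len(S) != 2 or len(D) != 2:
        raise ValueError("this sketch handles exactly two loci")
    c1 = harmonic(n, 1)
    c2 = harmonic(n, 2)
    # Parameter estimates from the three constraint equations:
    #   sum(S) = c1 * sum(theta),  sum(D) = (T + 1) * sum(theta),
    #   S_i + D_i = theta_i * (T + 1 + c1)
    theta_sum = sum(S) / c1
    t_plus_1 = sum(D) / theta_sum
    thetas = [(s + d) / (t_plus_1 + c1) for s, d in zip(S, D)]
    stat = 0.0
    for s, d, th in zip(S, D, thetas):
        exp_s, exp_d = th * c1, th * t_plus_1
        var_s = exp_s + th ** 2 * c2      # variance of segregating sites
        var_d = exp_d + th ** 2           # variance of divergence when f = 1
        stat += (s - exp_s) ** 2 / var_s + (d - exp_d) ** 2 / var_d
    # Four observations minus three parameters leaves 1 d.f.;
    # for 1 d.f. the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: the second locus has exactly twice the polymorphism
# and twice the divergence of the first, so the ratios agree.
stat_same, p_same = hka_two_locus([10, 20], [30, 60], n=12)
# Hypothetical counts with very different polymorphism-to-divergence ratios.
stat_diff, p_diff = hka_two_locus([5, 40], [60, 60], n=12)
print(stat_same, p_same)
print(stat_diff, p_diff)
```

When the two loci have identical polymorphism-to-divergence ratios the statistic is zero, and it grows as the ratios diverge. As the text notes, a large value says only that the loci differ, not which one departs from neutrality.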

The nucleotide polymorphism to divergence ratios further illustrate the clustering and relative relatedness of the five genes examined by the HKA analysis. As discussed above, the polymorphism to divergence ratios for gapA and pabB, respectively, are 0.13 and 0.17. The same ratios for the crr, putP and trpAB loci are 0.56, 0.51 and 0.43, respectively. Finally, the ratio for the gnd gene is 1.91. Clearly, the gapA and pabB loci share the common feature of having too little polymorphism relative to their respective between-species divergence from Salmonella. gnd, on the other hand, has an excess of polymorphism relative to the amount of divergence. putP, crr and trp fall squarely in the middle with roughly twice as much interspecific divergence as polymorphism.


FIGURE 2.-Chart of results from HKA analyses. The HKA P values are presented for each set of gene comparisons (using total sequence data). The abscissa presents the genes in the order in which they are found along the E. coli chromosome and their distance, in kilobases, from the gapA locus: putP (-742.5 kb), trpAB (-526.5 kb), gapA, pabB (+32.5 kb), gnd (+211.5 kb) and crr (+580.5 kb).

These results can best be accounted for by a purging of variability at gapA and pabB, a maintenance of variability at gnd, and the approximate neutrality of putP, crr and trp.
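The three-way grouping can be read directly off the ratios quoted above. The short sketch below expresses each ratio relative to the mean of the three loci treated as approximately neutral; the dictionary layout and the choice of that mean as a baseline are ours, for illustration only.

```python
# Synonymous polymorphism-to-divergence ratios as quoted in the text.
ratios = {
    "gapA": 0.13, "pabB": 0.17,
    "crr": 0.56, "putP": 0.51, "trpAB": 0.43,
    "gnd": 1.91,
}

# Baseline: mean ratio of the loci presumed to be approximately neutral.
reference_loci = ("crr", "putP", "trpAB")
reference_mean = sum(ratios[g] for g in reference_loci) / len(reference_loci)

# Each locus expressed as a fraction of the baseline.
relative = {gene: value / reference_mean for gene, value in ratios.items()}

# Loci ordered from least to most polymorphism relative to divergence.
ordered = sorted(ratios, key=ratios.get)

for gene in ordered:
    print(f"{gene:6s} ratio={ratios[gene]:.2f} relative to baseline={relative[gene]:.2f}")
```

On this scale gapA and pabB sit at roughly a quarter to a third of the baseline and gnd at nearly four times it, matching the deficiency and excess described in the text.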

Differences in genealogical histories can also be evaluated on the basis of differences in the θ calculated for each locus. θ, or the per base heterozygosity, is equal to 2N_eμ for a haploid organism, where N_e is the effective population size and μ is the neutral mutation rate. It can be calculated based on the number of segregating sites found at a given locus. The set of six loci used in the HKA analysis were examined for heterogeneity in their θs by the method proposed by KREITMAN and HUDSON (1991). Table 6 shows that the θs for these loci are heterogeneous, indicating that the genes have experienced differences in their genealogical histories. In support of the earlier finding, most of the heterogeneity in θs is caused by a deficiency of polymorphism at gapA and an excess of polymorphism at gnd.
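As an illustration of estimating θ from segregating sites, the sketch below uses the standard segregating-sites (Watterson) estimator, θ = S / (a_n L), with a_n the sum of 1/i for i from 1 to n-1. That this is the estimator behind Table 6 is our assumption; it does reproduce the tabulated values for gapA and gnd when given their sample sizes, synonymous site counts and observed silent polymorphisms. The function name is ours.

```python
def watterson_theta(seg_sites, n_seqs, n_sites):
    """Per-site theta from the number of segregating sites.

    seg_sites: observed segregating (polymorphic) sites
    n_seqs:    number of sequences sampled
    n_sites:   number of sites surveyed (here, synonymous sites)
    """
    a_n = sum(1.0 / i for i in range(1, n_seqs))
    return seg_sites / (a_n * n_sites)

# Inputs taken from the Table 6 rows for gapA and gnd.
theta_gapA = watterson_theta(7, 12, 234.25)
theta_gnd = watterson_theta(191, 18, 192.75)

print(f"gapA theta = {theta_gapA:.4f}")
print(f"gnd  theta = {theta_gnd:.4f}")
```

The roughly thirtyfold spread between these two estimates is the heterogeneity that the test in Table 6 evaluates.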

TABLE 6

Heterogeneity of θs

Locus   n    Synonymous sites   Silent polymorphism Obs.   Silent polymorphism Exp.   θ        χ²
gapA    12   234.25             7                          64                         0.0099   4.053
pabB    12   252.25             31                         69                         0.0394   1.545
putP    12   366.75             101                        100                        0.0911   0.0009
crr     12   127.5              18                         35                         0.0467   1.108
trpAB   4    463.25             124                        59                         0.1460   1.458
gnd     18   192.75             191                        60                         0.2881   30.93

χ² = 39.09, P < 0.001, 5 d.f.

The final line of evidence for the occurrence of a selective sweep in the region encompassing the gapA and pabB loci is based upon the consistency and topological structure of the gene genealogies of the two loci. A comparison of the neighbor-joining trees from gapA and pabB (Figure 1) shows that the two loci have almost identical genealogical relationships. The pattern of strain clustering is consistent and significant (as defined by the bootstrap scores) across both genes. The only inconsistent strain clustering is that of ECOR68, which is found with the group A strains in gapA. Although this placement differs from that found in pabB, it is the result of a single polymorphism.

TABLE 7

gapA polymorphic sites from combined data set

Nucleotide position
Strain Group 127 128 306 309 339 345 444 597 621 717 891 930
4 A T A T C T G T C C C T C
32 B1
K12 A T
8 A T
10 A T
14 A T
16 A T
68 B1 T
65 B2 T
64 B2 T G
52 B2 A T G
38 D A T T
39 D A T T
40 D A T T
49 D C A T
50 D C A T
70 B1 A T
58 B1 T
pos(a) 1 2 3 3 3 3 3 3 3 3 3 3
s/n(b) n n s s s s s s s s s s

(a) Nucleotide position in the codon. (b) Synonymous vs. nonsynonymous change. (c) These strains appear in ECOR group C in SELANDER et al. (1987). Bolded strains are those sequenced by us; non-bold strains are those sequenced by NELSON et al. (1991).

An examination of the consistency index and tree topology of gapA also supports the claim of a selective sweep. The addition of the NELSON et al. (1991) gapA data facilitates the analysis by providing a larger set of ECOR strains from which to work. Table 7 is a compilation of our gapA data and the ECOR strains from NELSON et al. (1991). The parsimony analysis of the gapA data revealed a consistency index of 1.00, i.e., all informative sites support the same gene genealogy. Another view of the relationships between the strains is presented as an unrooted dendrogram in Figure 3. This perspective not only shows the consistency of the data, but also reveals that, even in this relatively small set of data, the dendrogram is almost complete with only one transitional strain missing (that connecting strains ECOR4 and ECOR32 to the ECOR49 and ECOR50 group and the ECOR38, ECOR39 and ECOR40 group). This of course assumes that the double mutations occurred simultaneously, a subject which will be discussed shortly. This figure also shows how similar the gapA genealogy is to a star phylogeny, i.e., all the taxa radiate out from a central hub. This is the genealogical pattern that would be expected if all of the taxa diverged from a relatively recent common ancestor.

A phylogenetic analysis of the pabB data revealed similar results. The consistency index for pabB was calculated to be 0.927. Identical clustering of taxa is seen when the gene genealogies of gapA and pabB are compared, despite the almost tenfold difference in the degree of divergence.
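For readers unfamiliar with the consistency index, it is the minimum number of character changes the data could require divided by the number of changes actually required on the tree, so a value of 1.00 means no homoplasy. The sketch below is a generic illustration, not the PAUP computation reported here. The worked case assumes, hypothetically, that all 50 pabB polymorphic sites are two-state and that each of the 4 homoplastic sites needs exactly one extra change.

```python
def consistency_index(min_changes, observed_changes):
    """Ensemble consistency index: total minimum changes / total observed changes.

    min_changes:      per-site minimum number of changes (states - 1)
    observed_changes: per-site number of changes required on the chosen tree
    """
    return sum(min_changes) / sum(observed_changes)

# Hypothetical reading of the pabB data: 50 two-state sites,
# 46 fitting the tree with one change each and 4 needing two changes.
min_changes = [1] * 50
observed_changes = [1] * 46 + [2] * 4
ci_example = consistency_index(min_changes, observed_changes)

# A data set with no homoplasy at all gives an index of exactly 1.
ci_perfect = consistency_index([1, 1, 1], [1, 1, 1])

print(f"illustrative index: {ci_example:.3f}")
print(f"no homoplasy:       {ci_perfect:.2f}")
```

Under these assumptions the index comes out close to, though not identical with, the reported 0.927; the exact value depends on details such as multistate sites and which sites are counted.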

FIGURE 3.-Schematic representation of relationships between ECOR strains derived from combined data of Table 7. Hash marks crossing lines connecting strains represent single mutations.

These remarkably high consistency indices are atypical for most gene genealogies. They are indicative of a pattern of variability found in a region that was recently purged of the many possible confounding factors that naturally occur during evolutionary divergence (e.g., homoplasies and recombination events). Simply put, some evolutionary event wiped the slate clean in this region so that all subsequent evolution has been clearly and unequivocally recorded. A selective sweep which globally replaced the region surrounding the gapA and pabB loci would have such an effect. This event would have presumably occurred relatively recently since there has not been sufficient time for the widespread accumulation of confounding factors, yet must have occurred before the formation of some of the major extant E. coli strain clusters that show significant amounts of clonal divergence and diversification (e.g., ECOR groups A and D). This analysis is unable to determine the exact focal point of the selective sweep. It clearly occurred in the region near gapA and pabB, but did not necessarily occur in either one of these loci. In fact, it may have been centered at any gene in the surrounding region with sufficient linkage to gapA and pabB. Likewise, the event should have also left its mark on the other linked loci in the region. dnaI, important in DNA biosynthesis, and rnd, coding for RNase D, are both located in the intervening region between gapA and pabB (BACHMAN 1990; RILEY 1993). A molecular population genetic analysis of these loci would be expected to reveal the same patterns observed in gapA and pabB.

Clustered polymorphisms: Upon visual inspection, the polymorphisms found in the E. coli gapA locus appear to be dispersed in a highly nonrandom fashion. Of the twelve polymorphic sites observed in the eighteen E. coli strains which make up the combined set of our and the NELSON et al. data, one pair of polymorphisms are in adjacent nucleotides, another set are in adjacent codons, and a final pair are separated by only a single codon. In other words, 6 of the 12 polymorphic sites are found in pairs of nucleotides separated by no more than 6 bp (Table 7). The paired substitutions are also exclusively linked: they are never found without their partner. Of the three pairs of clustered polymorphisms, two pairs are phylogenetically informative, and both of these are consistent with the reconstructed gene genealogy.
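The pairing described above can be checked mechanically from the nucleotide positions listed in Table 7. The sketch below simply scans neighboring polymorphic positions for gaps of at most 6 bp; the function name and the threshold parameter are ours.

```python
# Polymorphic nucleotide positions in gapA, as listed in Table 7.
positions = [127, 128, 306, 309, 339, 345, 444, 597, 621, 717, 891, 930]

def close_pairs(sites, max_gap):
    """Return neighboring site pairs whose positions differ by at most max_gap bp."""
    ordered = sorted(sites)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a <= max_gap]

pairs = close_pairs(positions, max_gap=6)
clustered_sites = {site for pair in pairs for site in pair}

print("clustered pairs:", pairs)
print(f"{len(clustered_sites)} of {len(positions)} sites fall in such pairs")
```

This identifies the three pairs and six sites referred to in the text. It does not by itself say how surprising that clustering is, which would require a null model for site placement.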

Phylogenetic consistency is the key to determining whether clustered polymorphisms are the result of recombination or are simply artifacts of the mutational process (DYKHUIZEN et al. 1993). A single recombination event can introduce multiple polymorphic sites into a sequence. Although these sites will be exclusively linked, it is extremely unlikely that they will be phylogenetically consistent or informative if the recombination event occurred after the divergence of the taxa being analyzed. Given that the clustered gapA polymorphisms are phylogenetically consistent, it is highly unlikely that they were generated by recombination.
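One simple way to screen for sites that cannot share a single tree without homoplasy or recombination is the four-gamete test on pairs of two-state sites. This is a related standard check that we offer as an illustration; it is not the parsimony analysis used in this study. The alignment columns below are hypothetical and each string gives one site across five sequences.

```python
from itertools import combinations

def four_gamete_compatible(site_a, site_b):
    """Two biallelic sites are compatible if fewer than four haplotype combinations occur."""
    return len(set(zip(site_a, site_b))) < 4

def incompatible_pairs(columns):
    """Return index pairs of sites that fail the four-gamete test."""
    return [
        (i, j)
        for i, j in combinations(range(len(columns)), 2)
        if not four_gamete_compatible(columns[i], columns[j])
    ]

# Hypothetical two-state sites scored in the same five sequences.
columns = [
    "AAGGG",  # site 0
    "CCTTT",  # site 1: same split of sequences as site 0, exclusively linked
    "AGAGA",  # site 2: conflicts with the split defined by sites 0 and 1
]

conflicts = incompatible_pairs(columns)
print("incompatible site pairs:", conflicts)
```

In this toy case sites 0 and 1 behave like the linked, phylogenetically consistent gapA pairs, while site 2 is the kind of conflicting site that would point toward recombination or repeated mutation.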

Clustered mutations at the gapA locus are then a phenomenon of the mutational process. If the probability of an error in replication is increased by the presence of a prior error (i.e., the polymerase has a tendency to stutter), clustering of polymorphisms would be common. This explanation is plausible and tantalizing, but the answer will have to be left for further study.

There is also one case of clustered polymorphisms in the E. coli pabB gene. Nucleotides 376 and 378 are located in the same codon and are also exclusively linked. They differ from the linked gapA sites in that they are phylogenetically inconsistent. Out of the 50 polymorphisms in the E. coli pabB gene, they are two out of only four inconsistent sites. Their complete linkage and phylogenetic inconsistency lend weight to their being introduced by recombination. This finding is not incompatible with the claim of a selective sweep in the region surrounding the gapA and pabB loci. A recombination event could have occurred after the selective sweep, between strains that had diverged only slightly since their recent common ancestor. Alternatively, recombination may have saved some variation at the pabB locus during the selective sweep. This is especially likely if the selection event was centered at the gapA locus, and pabB was strongly influenced strictly due to genetic hitchhiking brought about by its proximity to gapA.

This work was supported in part by U.S. Public Health Service grants GM30201 and R01-AI32454 from the National Institutes of Health. We would like to thank WALTER EANES, DOUGLAS FUTUYMA, KIMBERLYN NELSON, ALEXANDRA BELY, MICHAEL MCCARTNEY, GUILLERMO ORTI and two anonymous reviewers for their careful reading and suggestions to improve the manuscript, and JUNHYONG KIM and TIMOTHY MORTON for valuable discussions. This is contribution number 898 from the Department of Ecology and Evolution at the State University of New York at Stony Brook.

LITERATURE CITED

AGUADÉ, M., N. MIYASHITA and C. H. LANGLEY, 1989 Reduced variation in the yellow-achaete-scute region in natural populations of Drosophila melanogaster. Genetics 122: 607-615.

ATWOOD, K. C., L. K. SCHNEIDER and F. J. RYAN, 1951a Selective mechanisms in bacteria. Cold Spring Harbor Symp. Quant. Biol. 16: 345-355.

ATWOOD, K. C., L. K. SCHNEIDER and F. J. RYAN, 1951b Periodic selection in Escherichia coli. Proc. Natl. Acad. Sci. USA 37: 146-155.

BACHMAN, B. J., 1990 Linkage map of Escherichia coli K-12, Edition 8. Microbiol. Rev. 54: 130-197.

BEGUN, D. J., and C. F. AQUADRO, 1991 Molecular population genetics of the distal portion of the X chromosome in Drosophila: evidence for genetic hitchhiking of the yellow-achaete region. Genetics 129: 1147-1158.

BERRY, A. J., J. W. AJIOKA and M. KREITMAN, 1991 Lack of polymorphism on the Drosophila fourth chromosome resulting from selection. Genetics 129: 1111-1117.

BISERCIC, M., J. Y. FEUTRIER and P. R. REEVES, 1991 Nucleotide sequences of the gnd genes from nine natural isolates of Escherichia coli: Evidence of intragenic recombination as a contributing factor in the evolution of the polymorphic gnd locus. J. Bacteriol. 173: 3894-3900.

BRANLANT, G., and C. BRANLANT, 1985 Nucleotide sequence of the Escherichia coli gap gene. Eur. J. Biochem. 150: 61-66.

CABOT, E., and A. T. BECKINBACH, 1989 Simultaneous editing of multiple nucleic acid and protein sequences with ESEE. Comput. Appl. Biosci. 5: 233-234.

CRAWFORD, I. P., B. P. NICHOLS and C. YANOFSKY, 1980 Nucleotide sequence of the trpB gene in Escherichia coli and Salmonella typhimurium. J. Mol. Biol. 142: 489-502.

DYKHUIZEN, D., 1992 Periodic Selection, pp. 351-355 in Encyclopedia of Microbiology. Academic Press, New York.

DYKHUIZEN, D. E., and L. GREEN, 1991 Recombination in Escherichia coli and the definition of biological species. J. Bacteriol. 173: 7257-7268.

DYKHUIZEN, D. E., D. S. POLIN, J. J. DUNN, B. WILSKE, V. PREAC-MURSIC et al., 1993 Borrelia burgdorferi is clonal: Implications for taxonomy and vaccine development. Proc. Natl. Acad. Sci. USA 90: 10163-10167.

FU, Y.-X., and W. H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693-709.

GONCHAROFF, P., and B. P. NICHOLS, 1984 Nucleotide sequence of Escherichia coli pabB indicates a common evolutionary origin of p-aminobenzoate synthetase and anthranilate synthetase. J. Bacteriol. 159: 57-62.

GONCHAROFF, P., and B. P. NICHOLS, 1988 Evolution of aminobenzoate synthases: nucleotide sequences of Salmonella typhimurium and Klebsiella aerogenes pabB. Mol. Biol. Evol. 5: 531-548.

HALL, B. G., and P. M. SHARP, 1992 Molecular population genetics of Escherichia coli: DNA sequence diversity at the celC, crr, and gutB loci of natural isolates. Mol. Biol. Evol. 9: 654-665.

HERZER, P. J., S. INOUYE, M. INOUYE and T. S. WHITTAM, 1990 Phylogenetic distribution of branched RNA-linked multicopy single-stranded DNA among natural isolates of Escherichia coli. J. Bacteriol. 172: 6175-6181.

HIGUCHI, R. G., and H. OCHMAN, 1989 Production of single-stranded DNA templates by exonuclease digestion following polymerase chain reaction. Nucleic Acids Res. 17: 5865-5866.

HUDSON, R. R., M. KREITMAN and M. AGUADÉ, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-159.

JIANG, X.-M., B. NEAL, R. SANTIAGO, S. J. LEE, L. K. ROMANA et al., 1991 Structure and sequence of the rfb (O antigen) gene cluster of Salmonella serovar typhimurium (strain LT2). Mol. Microbiol. 5: 695-713.

KREITMAN, M., and R. R. HUDSON, 1991 Inferring the evolutionary histories of Adh and Adh-dup loci in Drosophila melanogaster from patterns of polymorphism and divergence. Genetics 127: 565-582.

LANGLEY, C. H., J. MACDONALD, N. MIYASHITA and M. AGUADÉ, 1993 Lack of correlation between interspecific divergence and intraspecific polymorphism at the suppressor of forked region in Drosophila melanogaster and Drosophila simulans. Proc. Natl. Acad. Sci. USA 90: 1800-1803.

LEVIN, B. R., 1981 Periodic selection, infectious gene exchange and the genetic structure of E. coli populations. Genetics 99: 1-23.

MARTIN-CAMPOS, J. M., J. M. COMERON, N. MIYASHITA and M. AGUADÉ, 1992 Intraspecific and interspecific variation at the y-ac-sc region of Drosophila simulans and Drosophila melanogaster. Genetics 130: 805-816.

MAYNARD SMITH, J., 1991 The population genetics of bacteria. Proc. R. Soc. Lond. Ser. B 245: 37-41.

MCDONALD, J. H., and M. KREITMAN, 1991 Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 642-654.

MILKMAN, R., and M. M. BRIDGES, 1990 Molecular evolution of the Escherichia coli chromosome. III. Clonal frames. Genetics 126: 505-517.

MILKMAN, R., and M. M. BRIDGES, 1993 Molecular evolution of the Escherichia coli chromosome. IV. Sequence comparisons. Genetics 133: 455-468.

MILKMAN, R., and A. STOLTZFUS, 1988 Molecular evolution of the Escherichia coli chromosome. II. Clonal segments. Genetics 120: 359-366.

NELSON, K., and R. K. SELANDER, 1992 Evolutionary genetics of the proline permease gene (putP) and the control region of the proline utilization operon in populations of Salmonella and Escherichia coli. J. Bacteriol. 174: 6886-6895.

NELSON, S. O., A. R. SCHUITEMA, R. BENNE, L. H. D. PLOEG, J. S. PLIJTER et al., 1984 Molecular cloning, sequencing, and expression of the crr gene: the structural gene for IIIGlc of the bacterial PEP:glucose phosphotransferase system. EMBO J. 3: 1587-1593.

NELSON, K., T. S. WHITTAM and R. K. SELANDER, 1991 Nucleotide polymorphism and evolution in the glyceraldehyde-3-phosphate dehydrogenase gene (gapA) in natural populations of Salmonella and Escherichia coli. Proc. Natl. Acad. Sci. USA 88: 6667-6671.

NICHOLS, B. P., and C. YANOFSKY, 1979 Nucleotide sequences of the trpA of Salmonella typhimurium and Escherichia coli: an evolutionary comparison. Proc. Natl. Acad. Sci. USA 76: 5244-5248.

OCHMAN, H., and R. K. SELANDER, 1984 Standard reference strains of Escherichia coli from natural populations. J. Bacteriol. 157: 690-693.

RILEY, M., 1993 Functions of the gene products of Escherichia coli. Microbiol. Rev. 57: 862-952.

RUDD, K. E., 1992 Alignment of E. coli DNA sequences to a revised, integrated genomic restriction map, pp. 2.3-2.43 in A Short Course in Bacterial Genetics: A Laboratory Manual and Handbook for Escherichia coli and Related Bacteria. Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.

SAIKI, R. K., 1989 The design and optimization of the PCR, pp. 7-16 in PCR Technology. M. Stockton Press, New York.

SAITOU, N., and M. NEI, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406-425.

SELANDER, R. K., D. A. CAUGANT and T. S. WHITTAM, 1987 Genetic structure and variation in natural populations of Escherichia coli, pp. 1625-1648 in Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology. American Society for Microbiology, Washington, D.C.

SHARP, P. M., 1991 Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position, and concerted evolution. J. Mol. Evol. 33: 23-33.

SHARP, P. M., and W. H. LI, 1987 The codon adaptation index-a measure of directional synonymous codon usage bias. Nucleic Acids Res. 15: 1281-1295.

SLATKIN, M., 1989 Detecting small amounts of gene flow from phylogenies of alleles. Genetics 121: 609-612.

SLATKIN, M., and W. P. MADDISON, 1989 A cladistic measure of gene flow inferred from the phylogenies of alleles. Genetics 123: 603-613.

SWOFFORD, D. L., 1993 PAUP: Phylogenetic Analysis Using Parsimony, Ver. 3.1. Illinois Natural History Survey, Champaign, Ill.

TAJIMA, F., 1989 Statistical methods for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595.

TAMURA, K., 1993 NJBoot2, Ver. 2. Pennsylvania State University, University Park, Pa.

WANG, L., L. K. ROMANA and P. R. REEVES, 1992 Molecular analysis of a Salmonella enterica group E1 rfb gene cluster: O antigen and the genetic basis of the major polymorphism. Genetics 130: 429-443.

WHITTAM, T. S., and S. E. AKE, 1993 Genetic polymorphisms and recombination in natural populations of Escherichia coli, pp. 223-246 in Mechanisms of Molecular Evolution. Sinauer Associates, Sunderland, Mass.

WHITTAM, T. S., H. OCHMAN and R. K. SELANDER, 1983 Multilocus genetic structure in natural populations of Escherichia coli. Proc. Natl. Acad. Sci. USA 80: 1751-1755.

WILSON, K., 1990 Isolation of bacterial genomic DNA, pp. 2.4.1-2.4.2 in Current Protocols in Molecular Biology. Wiley, New York.

Communicating editor: G. B. GOLDING