Supplementary data and methods - Cell  · Web viewSelf-targeting by CRISPR: gene regulation or...


Supplemental Material

Self-targeting by CRISPR: gene regulation or autoimmunity?

Adi Stern1†, Leeat Keren2†, Omri Wurtzel1, Gil Amitai1, Rotem Sorek1*

1Department of Molecular Genetics

2Department of Computer Science and Applied Mathematics

Weizmann Institute of Science, Rehovot, Israel

* Corresponding author: [email protected]

† These authors contributed equally

Supplemental Methods

Dataset construction

CRISPR arrays, spacers, and repeats were obtained from CRISPRdb [1]. Our study referred only to non-questionable arrays, as defined by CRISPRdb [1]. For each CRISPR-bearing organism, all Genbank files for all replicons in an organism were downloaded from the NCBI database (http://www.ncbi.nlm.nih.gov). Information regarding cas genes, genes overlapping self-targets and genomic neighborhoods was obtained from both the Genbank file and from IMG (Integrated Microbial Genomes) at JGI (http://img.jgi.doe.gov/cgi-bin/pub/main.cgi) [2]. When testing for COG enrichment, the background distribution of each COG family was obtained from ftp://ftp.ncbi.nih.gov/pub/COG/. We define here some general characteristics of CRISPR that are used later on in the study:

Repeat consensus: the consensus sequence of all repeats in a specific CRISPR array.

Associated cas operon of a CRISPR array: the closest cas operon that is up to 10,000 bases away from a given CRISPR array, or the sole cas operon in the replicon.

cas subtype: one of 9 possible cas operon subtypes, as defined by Haft et al. 2005 [3] and by TIGR [4]. A cas operon is considered to be of a certain subtype if it contains one of the known genes of that subtype. Thus, an array may belong to several subtypes. This is denoted in Tables 1 & S1 by a concatenation of the subtypes' names, separated by a period.
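As an illustration of the "associated cas operon" definition above, the following is a minimal sketch only. The interval representation, the function names, and the treatment of replicons as linear (ignoring wrap-around on circular replicons) are assumptions made for the example, not part of the original analysis.

```python
from typing import List, Optional, Tuple

# Assumed representation: (start, end) coordinates on one replicon, start <= end.
Interval = Tuple[int, int]

MAX_DISTANCE = 10_000  # distance threshold in bases, as stated in the definition


def gap(a: Interval, b: Interval) -> int:
    """Number of bases separating two intervals (0 if they overlap)."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def associated_cas_operon(array: Interval,
                          operons: List[Interval]) -> Optional[Interval]:
    """Return the cas operon associated with a CRISPR array, or None.

    A sole cas operon on the replicon is always associated; otherwise the
    closest operon is used, provided it lies within MAX_DISTANCE bases.
    """
    if not operons:
        return None
    if len(operons) == 1:
        return operons[0]
    closest = min(operons, key=lambda op: gap(array, op))
    return closest if gap(array, closest) <= MAX_DISTANCE else None
```

For example, an array at (1000, 2000) with operons at (2500, 6000) and (50000, 55000) would be associated with the first operon in this sketch, since it lies 500 bases away.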

Searching for self-targeting spacers


For each spacer, BLAST [5] was performed against all replicons in the cognate organism, using an E-value cutoff of 10^-4. All BLAST hits that resided within a CRISPR array were discarded. Hits were divided into two categories: full matches (spanning 100% identity over the entire spacer) and partial matches (all the rest). Five spacers were later manually discarded since they were found to reside in CRISPR arrays annotated by Genbank but not present in CRISPRdb. The results of the search are summarized in Table S2 below.

Table S2. Summary of self-targeting spacers

Category | No. of spacers | No. of CRISPR arrays (%) | No. of organisms (%)
Total analyzed | 23,550 | 973 | 330
Self-targeting spacers, full match | 100 (0.43%) | 73 (7.5%) | 59 (18%)
Self-targeting spacers, full match, non-mobile DNA target | 53 (0.2%) | 43 (4%) | 39 (12%)
Self-targeting spacers, full and partial match | 350 (1.5%) | 228 (23%) | 155 (47%)
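The filtering and classification of BLAST hits described in this section can be sketched as follows. This is an illustrative sketch, not the code used in the study: the `Hit` record layout, the function names, and the use of interval overlap to decide whether a hit "resides within" a CRISPR array are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hit:
    """One BLAST hit of a spacer against a replicon (hypothetical record layout)."""
    spacer_id: str
    spacer_len: int   # length of the query spacer
    replicon: str
    start: int        # hit coordinates on the replicon, start <= end
    end: int
    align_len: int    # length of the alignment
    identical: int    # number of identical positions in the alignment
    evalue: float


def in_any_array(hit: Hit, arrays: Dict[str, List[Tuple[int, int]]]) -> bool:
    """True if the hit overlaps a CRISPR array on the same replicon (assumed criterion)."""
    return any(hit.start <= a_end and a_start <= hit.end
               for a_start, a_end in arrays.get(hit.replicon, []))


def classify_hits(hits: List[Hit],
                  arrays: Dict[str, List[Tuple[int, int]]],
                  evalue_cutoff: float = 1e-4):
    """Drop weak hits and hits inside CRISPR arrays, then split full/partial."""
    full, partial = [], []
    for hit in hits:
        if hit.evalue > evalue_cutoff or in_any_array(hit, arrays):
            continue
        # Full match: 100% identity over the entire spacer; everything else is partial.
        is_full = (hit.align_len == hit.spacer_len
                   and hit.identical == hit.spacer_len)
        (full if is_full else partial).append(hit)
    return full, partial
```

Here `arrays` maps a replicon name to the coordinates of its CRISPR arrays, so that a spacer matching its own array is not counted as a self-target.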

Self-targeting spacer conservation

BLAST [5] was performed for each self-targeting spacer (both full and partial matches) against all other self-targeting spacers, with an E-value cutoff of 10^-2. This yielded 7 cases of similarity between self-targeting spacers, 5 of them from highly related organisms (the sum of branch lengths separating the two strains on the tree of life [6] was smaller than 0.01). However, all of these cases turned out to be spurious: in the first 5 cases, the entire CRISPR array was identical or almost identical between the two strains, suggesting that some of these strains might even be the same strain. In the remaining two cases the spacers partially matched transposons and prophages in related organisms, suggesting that similar mobile elements exert pressure on the organisms bearing these spacers.

Characterization of self-targets

First, we determined whether each of the matches found for a self-targeting spacer (termed target) overlaps an open reading frame, or resides in an intergenic location, by parsing the Genbank genomic file. Next, each target was assigned one of four categories: v (provirus), t (transposon), p (plasmid), or n (non-mobile DNA) (Table S1). The p category was assigned if the target resided in an endogenous plasmid of the organism. A v or t category was assigned if the target complied with one of the conditions listed below; otherwise the n category was assigned:

(a) Gene annotation: targets overlapping a gene containing the phrases "phage" or "transpos" within its gene description.

(b) Part of a known prophage: targets residing in the Prophinder database of prophage sequences [7].

(c) Gene neighborhood: targets overlapping a gene (termed the gene hit), where one of the neighbors of the target was annotated as a phage/transposon (as in (a)). Gene neighbors were defined as no more than 5 genes upstream and downstream of the gene hit, on the same strand, with no more than 300 base pairs separating two consecutive neighbor genes.

(d) Homology: targets overlapping a gene homologous to a phage or transposon gene, or with a gene neighbor (as defined in (c)) homologous to a phage or transposon gene. Homology was defined by performing BLASTP of the protein sequence against the 'nr' database, and gene annotation of the homologs was tested as in (a).

All targets of the fully matching spacers were further manually curated using the IMG database to ensure the validity of the above results.
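Criteria (a) and (c) above lend themselves to a short sketch. The `Gene` record, the function names, and in particular the choice to stop the neighbor walk at the first gene on the opposite strand (rather than skipping over it) are assumptions made for illustration; the text does not specify this detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    """Minimal gene record (hypothetical layout, e.g. parsed from a Genbank file)."""
    name: str
    start: int
    end: int
    strand: str            # "+" or "-"
    description: str = ""


MAX_NEIGHBORS = 5   # genes considered in each direction
MAX_GAP = 300       # maximal spacing (bp) between consecutive neighbor genes


def gene_neighbors(genes: List[Gene], hit_index: int) -> List[Gene]:
    """Collect neighbors of genes[hit_index]; `genes` must be sorted by start."""
    hit = genes[hit_index]
    neighbors = []
    for step in (-1, 1):  # walk towards lower coordinates, then higher
        prev, i = hit, hit_index + step
        while 0 <= i < len(genes) and abs(i - hit_index) <= MAX_NEIGHBORS:
            cur = genes[i]
            left, right = (cur, prev) if step == -1 else (prev, cur)
            # Assumption: an opposite-strand gene or a large gap ends the walk.
            if cur.strand != hit.strand or right.start - left.end > MAX_GAP:
                break
            neighbors.append(cur)
            prev, i = cur, i + step
    return neighbors


def is_mobile_annotation(description: str) -> bool:
    """Criterion (a): the description mentions "phage" or "transpos"."""
    text = description.lower()
    return "phage" in text or "transpos" in text
```

A target would then satisfy criterion (c) in this sketch if `is_mobile_annotation` holds for the description of any gene returned by `gene_neighbors`. Note that a plain substring test also matches unrelated words that contain "phage", which is one reason manual curation remains useful.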

Determining orientation of CRISPR arrays

The orientation of each CRISPR array was inferred according to the following criteria, ranked by their priority (i.e., only if the array was not successfully oriented according to one criterion, was it tested using the next). Note that per the general literature convention, spacers and repeats proximal to the leader sequence are denoted as the beginning of the array.

(1) Repeat clustering: Each repeat was compared to repeat clusters based on the study of Kunin et al. [8]. Thus, each repeat consensus of an array was BLASTed [5] against members of each repeat cluster, using a word size of 7 and an E-value cutoff of 10^-2. To correctly orientate each repeat cluster group, an "anchor" member from each cluster was orientated based on studies of RNA transcription of CRISPR, on bioinformatic analysis of related strains [9-16] (Wurtzel and Sorek, unpublished data), and/or on leader conservation [16, 17]. The original repeat cluster orientations [8] were found to be correct for all clusters except numbers 2 and 10, which were accordingly reversed. Notably, because of the palindromic nature of the repeats, the orientation of some arrays could not be determined even though their repeats showed similarity to a repeat cluster from Kunin et al. [8].

(2) Repeat degeneracy towards the array end: Previous studies have shown that repeats tend to degenerate towards the leader distal region [15, 18]. Thus, the number of repeats deviating from the repeat consensus was counted for each half of an array. If one half had more than two deviant repeats in excess of the other half, it was considered the leader distal side.

(3) Leader conservation: While leaders vary widely among different organisms, it has been shown that leader sequences may be highly conserved within a given organism with multiple CRISPR arrays [17]. This is most likely because the sequence may act as a promoter, and may also serve as a binding sequence that directs the addition of new spacers [17]. For each organism with multiple CRISPR arrays, two putative leader sequences were built from the 100 bases upstream of each orientation of each array. All pairs of such putative leader sequences from different arrays were compared using blast2seq pairwise alignment [19], which also outputs an E-value for the alignment. Alignments with an E-value < 10^-4 were considered significant. If only one orientation produced significant alignments for a given array, this orientation was considered the correct orientation.
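Criterion (2) above can be illustrated with a minimal sketch. The function names are hypothetical, and two details are assumptions: the middle repeat of an odd-length array is ignored, and "more than two" is read as an excess of at least three deviant repeats.

```python
from typing import List, Optional, Tuple


def deviant_counts(repeats: List[str], consensus: str) -> Tuple[int, int]:
    """Count repeats that differ from the consensus in each half of the array.

    Assumption: for an odd number of repeats, the middle repeat is ignored.
    """
    half = len(repeats) // 2
    first = sum(r != consensus for r in repeats[:half])
    second = sum(r != consensus for r in repeats[len(repeats) - half:])
    return first, second


def leader_distal_half(repeats: List[str], consensus: str,
                       min_excess: int = 3) -> Optional[str]:
    """Return "first" or "second" for the putative leader-distal half, else None.

    min_excess=3 encodes "more than two" additional deviant repeats.
    """
    first, second = deviant_counts(repeats, consensus)
    if first - second >= min_excess:
        return "first"
    if second - first >= min_excess:
        return "second"
    return None  # undetermined; fall through to the next criterion
```

A return value of `None` corresponds to the case in which the array could not be oriented by this criterion and was tested using criterion (3).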

Self-protection

A processed crRNA was shown to include eight bases of the upstream flanking repeat sequence [9, 13, 20]. For the mtube subtype, it has been shown that extended base-pairing at the three positions -4, -3, and -2 relative to the spacer (which correspond to bases of the repeat sequence) results in protection from CRISPR degradation [13]. To test whether self-targets are inherently protected from CRISPR degradation, we tested whether such base-pairing occurs. Since it is possible that different subtypes use different variants of this protection method, for each target we tested whether there are 3 (or more) consecutive base pairs that match the repeat sequence anywhere in the 8 repeat nucleotides upstream/downstream of the target (hereby termed putative protection). For 51 out of 53 non-mobile, and 58 of 60 mobile targets we could extract information regarding the orientation of the CRISPR array. For each such group (non-mobile and mobile) we tested how many targets displayed a 3 base-pair match between the repeat and the target somewhere along the 8 nucleotides upstream or downstream of the target. For the non-mobile group, 6 and 7 targets displayed such upstream or downstream matches, respectively. For the mobile category, 3 and 5 targets displayed such upstream or downstream matches, respectively. Assuming a random distribution of nucleotides (i.e., the probability of observing each nucleotide is 0.25), the probability of a match of 3 or more consecutive of 8 base pairs is given by:

Under a binomial distribution and a type I error (α) of 0.05, the cutoff for rejecting the null hypothesis that the base-pairing is random is 8 out of 51 matches for the non-mobile category and 8 out of 58 matches in the mobile category. Thus, we cannot rule out that the potential base-pairing between the sequences flanking the targets and the repeats is a random artifact. Notably, in three cases the target adhered to self-protection as defined originally [13]. In fact, in these three cases the base-pairing extended beyond the 3 base-pairs defining the minimal protection, suggesting that in these 3 cases, base-pairing is a built-in mechanism for self-protection. We refer to these cases as putative self-protection in Table 1 of the main text.
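The statistical reasoning in this section can be sketched numerically. The following is an illustrative calculation under the stated null model (each position pairs independently with probability 0.25), not necessarily the exact formula used in the original analysis; the function names and the exhaustive-enumeration approach are assumptions chosen for transparency.

```python
from itertools import product
from math import comb


def has_run(matches, min_run=3):
    """True if the boolean sequence contains >= min_run consecutive True values."""
    run = 0
    for m in matches:
        run = run + 1 if m else 0
        if run >= min_run:
            return True
    return False


def prob_run(window=8, min_run=3, p_match=0.25):
    """Probability of >= min_run consecutive matches within `window` positions.

    Computed exactly by enumerating all 2**window match/mismatch patterns.
    """
    total = 0.0
    for pattern in product((True, False), repeat=window):
        if has_run(pattern, min_run):
            k = sum(pattern)
            total += p_match ** k * (1 - p_match) ** (window - k)
    return total


def binomial_cutoff(n, p, alpha=0.05):
    """Smallest k such that P(X >= k) <= alpha for X ~ Binomial(n, p)."""
    tail = 1.0  # P(X >= 0)
    for k in range(n + 1):
        if tail <= alpha:
            return k
        tail -= comb(n, k) * p ** k * (1 - p) ** (n - k)
    return n + 1  # no count within 0..n reaches significance


p = prob_run()                      # per-target probability under the null model
cutoff_non_mobile = binomial_cutoff(51, p)
cutoff_mobile = binomial_cutoff(58, p)
```

Under these assumptions the enumeration gives a per-target probability of 4831/65536, roughly 0.074. The observed counts of matching targets in each group can then be compared against the cutoff returned by `binomial_cutoff` for the corresponding group size.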

PAM sequences

When defining whether a target displays an adjacent PAM, we tested the existence of a consensus sequence as defined by Mojica et al. [21] in association with CRISPR arrays of different repeat clusters (numbers 1, 2, 3, 4, 7, and 10). Thus, for spacers from CRISPR arrays that were not assigned to one of these groups, the presence or absence of a PAM could not be determined.


Supplemental text and figures

Enrichment of subtypes and target types of self-targeting CRISPRs

We explored whether self-targeting spacers display enrichment for different features. First, we found that CRISPR arrays containing self-targeting spacers against non-mobile elements are enriched for repeats from CRISPR cluster 10 [8] (P < 0.005; Fisher exact test, significant after Bonferroni correction for multiple testing). This cluster was originally found to be associated with CRISPR systems of subtype nmeni [3]. We also found that targeting occurs equally against the sense and antisense strands of the target gene across all CRISPR/Cas subtypes in our dataset, matching similar findings in viral targets [11, 12, 15, 22, 23]. Finally, we found that the non-mobile endogenous targets are enriched for two COG categories: nucleotide transport and metabolism (F) (P < 0.05; Fisher exact test), and amino acid transport and metabolism (E) (P < 0.01; Fisher exact test). The latter results do not hold when applying stringent Bonferroni correction for multiple testing, possibly due to the small sample size. Thus, certain subtypes of the CRISPR system may be more prone to acquisition of an endogenous spacer, and genes belonging to specific cellular pathways may also be more prone, or tolerant, to CRISPR targeting. With the future accumulation of more sequenced genomes of bacteria and archaea, additional data may allow determining whether this enrichment is statistically significant or not.
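For readers unfamiliar with the enrichment tests mentioned above, the following minimal sketch shows a one-sided Fisher exact test with a Bonferroni threshold. Whether the original analysis used a one-sided or two-sided test is not stated, so the one-sided form is an assumption, as are the function names and the 2x2 table layout.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact test for enrichment on the table [[a, b], [c, d]].

    Assumed layout, using repeat clusters as the example:
        a: self-targeting arrays in the cluster   b: self-targeting arrays elsewhere
        c: other arrays in the cluster            d: other arrays elsewhere
    Returns P(top-left cell >= a) with all margins fixed (hypergeometric tail).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favorable = sum(comb(col1, k) * comb(n - col1, row1 - k)
                    for k in range(a, upper + 1))
    return favorable / comb(n, row1)


def bonferroni_significant(p_value: float, n_tests: int,
                           alpha: float = 0.05) -> bool:
    """True if the p-value survives Bonferroni correction for n_tests tests."""
    return p_value <= alpha / n_tests
```

With this correction, a result at P < 0.01 survives only if fewer than about five categories are tested at a family-wise α of 0.05, which illustrates why enrichment among many COG categories can lose significance after correction.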

Previous evidence for CRISPR involvement in gene regulation

Although our results support the hypothesis that self-targeting spacers do not play a role in gene regulation, there have been several intriguing reports linking CRISPR with regulation of endogenous processes. In Pseudomonas aeruginosa PA14, lysogenic infection with phage DMS3, and the subsequent induction of the CRISPR region, led to changes in biofilm formation and swarming motility in the bacteria [24]. In Myxococcus xanthus DK 1622, inactivation of a cas operon (also known as the dev operon) was shown to inhibit fruiting body formation. In both these cases, it is still unclear by what mechanism CRISPR affects the cellular process, although in both cases the authors suggested a putative link with lysogenic phage infection. In a different vein, in Listeria monocytogenes EGDe, it was experimentally shown that a CRISPR (termed rliB) construct made of a repeat flanked by two spacers weakly base-pairs with an mRNA target [25]. It remains to be shown whether this base-pairing represents a distinct function, or whether it exemplifies a previous event of autoimmunity. Indeed, this strain of Listeria monocytogenes completely lacks cas genes.


[Figure S1 image: species tree. Labeled phyla: Archaea, Proteobacteria, Cyanobacteria, Firmicutes, Actinobacteria, Bacteroides/Chlorobi. Legend: CRISPR encoding; no CRISPR; endogenous spacer, non-mobile target; endogenous spacer, mobile target.]

Figure S1. Phylogenetic distribution of self-targeting spacers. Self-targeting spacers were projected on the species tree of fully sequenced prokaryotes, which was reconstructed based on multiple shared loci [6]. Spacers were divided into two categories: those that target mobile DNA (proviruses, transposons, or plasmids; yellow), and those that target non-mobile DNA (red). Overall 100 self-targeting spacers were found, distributed in 59 different organisms. The distribution of self-targeting spacers is clearly widespread, and no phylum appears to be over- or under- represented. Some of the major phyla are labeled for convenience. The tree is presented using the FigTree software (http://tree.bio.ed.ac.uk/software/figtree/).


Figure S2. Illustration of mutations in flanking repeats near self-targeting spacers. The first 20 spacers in the CRISPR array of Pelobacter propionicus DSM 2379 (crisprdb accession NC_008609_2) are presented, along with their flanking repeats. Dots represent nucleotides identical to the repeat consensus (shown at the bottom). The three self-targeting spacers are underlined in color: green, spacer with 100% match to a hydrophobe/amphiphile efflux-1 transporter; blue, spacer with 100% match to DNA topoisomerase I; red, spacer with 100% match to RND efflux system outer membrane lipoprotein. Mutations in adjacent repeats are encircled. Unmarked spacers do not target self genes.

References

1 Grissa, I., et al. (2007) The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats. BMC Bioinformatics 8, 172

2 Markowitz, V.M., et al. (2008) The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res 36, D528-533

3 Haft, D.H., et al. (2005) A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol 1, e60

4 Lee, Y., et al. (2005) The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res 33, D71-74

5 Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research 25, 3389

6 Dehal, P.S., et al. (2009) MicrobesOnline: an integrated portal for comparative and functional genomics. Nucleic Acids Res 38, D396-400


7 Lima-Mendez, G., et al. (2008) Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics 24, 863

8 Kunin, V., et al. (2007) Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biol 8, R61

9 Hale, C., et al. (2008) Prokaryotic silencing (psi)RNAs in Pyrococcus furiosus. RNA 14, 2572-2579

10 Agari, Y., et al. (2009) Transcription Profile of Thermus thermophilus CRISPR Systems after Phage Infection. Journal of Molecular Biology

11 Brouns, S.J., et al. (2008) Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321, 960-964

12 Barrangou, R., et al. (2007) CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709-1712

13 Marraffini, L.A. and Sontheimer, E.J. (2010) Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature 463, 568-571

14 Tang, T.H., et al. (2002) Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proceedings of the National Academy of Sciences 99, 7536

15 Pourcel, C., et al. (2005) CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 151, 653-663

16 Horvath, P., et al. (2009) Comparative analysis of CRISPR loci in lactic acid bacteria genomes. Int J Food Microbiol 131, 62-70

17 Sorek, R., et al. (2008) CRISPR--a widespread system that provides acquired resistance against phages in bacteria and archaea. Nat Rev Microbiol 6, 181-186

18 Horvath, P., et al. (2008) Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. J Bacteriol 190, 1401-1412

19 Tatusova, T.A. and Madden, T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS microbiology letters 174, 247-250

20 Hale, C.R., et al. (2009) RNA-guided RNA cleavage by a CRISPR RNA-Cas protein complex. Cell 139, 945-956

21 Mojica, F.J., et al. (2009) Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 155, 733-740

22 Shah, S.A., et al. (2009) Distribution of CRISPR spacer matches in viruses and plasmids of crenarchaeal acidothermophiles and implications for their inhibitory mechanism. Biochem Soc Trans 37, 23-28

23 Mojica, F.J., et al. (2005) Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 60, 174-182

24 Zegans, M.E., et al. (2009) Interaction between bacteriophage DMS3 and host CRISPR region inhibits group behaviors of Pseudomonas aeruginosa. J Bacteriol 191, 210-219

25 Mandin, P., et al. (2007) Identification of new noncoding RNAs in Listeria monocytogenes and prediction of mRNA targets. Nucleic Acids Res 35, 962-974
