LCS (Longest Common Substrings): A novel method for the prediction of regulatory elements in...






LCS (Longest Common Substrings): A novel method for the prediction of regulatory elements in poxviruses

Melissa Da Silva 1, Daniel Horspool 2 and Chris Upton 1

1 Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC
2 Department of Computer Science, University of Victoria, Victoria, BC

Abstract

Poxviruses are large dsDNA viruses that replicate in the cytoplasm and are capable of infecting a wide range of hosts including insects, birds and mammals; to date, the genomes of 45 poxviruses have been sequenced. Recently, during analysis of the Yaba monkey tumor virus (YMTV) genome, Brunetti et al. showed that a region consisting of 40 nucleotides in the YMTV genome was absolutely conserved in several distantly related poxvirus genomes. They suggested that a conserved sequence of this length was highly unusual and demonstrated that this region has the potential to be a late promoter, although other potential functions could not be ruled out. However, no data was presented to confirm the significance of this sequence conservation.

Our lab, challenged by the assertions made in this paper, developed a software tool to identify short identical regions of sequence that are common to a series of user-selected genomes. Termed LCS (Longest Common Substrings), this algorithm has been implemented in the latest version of our viral genome organizer (VGO; www.virology.ca).

Using Brunetti’s data set, LCS finds the same 40 nucleotide sequence; however, it also finds a series of other conserved regions that are only slightly shorter. The locations of these sequences are also suggestive of promoter elements; most are between divergently transcribed genes. Questions to be answered:

- Do these sequences contain promoters?
- If so, are these promoters more highly conserved than others?
- Are these sequences significantly more conserved than surrounding gene sequences?
- Are there any other relationships between these short sequences?
- Are single orthologous early or late poxvirus promoters more conserved than the sets of early and late promoters within a single virus?

Seven regions, ranging in size from 21 to 40 nucleotides, were found to be conserved in all 6 viruses used in the dataset. Interestingly, each hit maps to the equivalent location in each of the 6 virus genomes investigated. Hit 1 maps to the upstream region of a gene encoding a cytoplasmic protein of unknown function and was found by Brunetti et al. to be a late promoter. Hit 4 is located 700 nucleotides downstream of the 5’ translation start site of the large subunit of RNA polymerase (RPO147). Hits 2, 3, 5, and 7 are all potential bidirectional promoters, as each hit can be mapped to the upstream regions of genes that are transcribed in opposite directions (see Figure 2). We will focus on characterizing these potential promoters, in particular hit 7, which is located upstream of the DNA helicase and intracellular mature virion membrane protein genes.

Upon alignment of this hit region with all chordopoxvirus genomes, the conserved region was found to be 1 nucleotide longer than the hit reported by LCS with Brunetti’s dataset. This is because the swinepox virus sequence differs by 1 nucleotide from the other 5 viruses used in the LCS search. Since LCS is only capable of finding 100% identical hits, the hit that was reported was 32 nucleotides in length.

The LCS hit is 100% conserved in all orthopoxviruses and is 93.94% identical in the molluscipoxvirus molluscum contagiosum virus (MOCV). The parapoxviruses show the least amount of sequence conservation with orf virus having 32.26% identity to the LCS hit. The avipoxviruses, fowlpox virus and canarypox virus, have 84.85% and 90.91% sequence identity to the LCS hit respectively. This region is not found in any entomopoxviruses since they do not encode the intracellular mature virion membrane protein gene.
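The identity figures above can be illustrated with a small worked example. The poster does not state how the percentages were computed; the sketch below assumes the simplest definition, matching columns divided by the number of aligned columns, and the function name `percent_identity` and the toy sequences are hypothetical. Under that assumption, values such as 93.94%, 90.91% and 84.85% are consistent with 31, 30 and 28 matches over 33 aligned positions, although this is an inference and not something the poster states.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns at which two equal-length sequences agree.

    Assumption: identity = matching columns / total columns. The poster
    does not say which definition (or gap handling) was actually used.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Toy example (hypothetical sequences): 3 of 4 columns match.
print(percent_identity("ACGT", "ACGA"))

# Arithmetic check of the ratio mentioned in the lead-in.
print(round(100 * 31 / 33, 2))
```

A definition that excludes gap columns, or that divides by the length of the shorter sequence, would give different numbers, so the choice of denominator should be stated whenever such percentages are reported.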

Alignments for hits 2, 3, and 5 show that these regions are also highly conserved amongst all orthopoxviruses (data not shown).

Table 1. List of hits obtained from an LCS search using a cutoff of 20 nucleotides searching 6 distantly related genomes. Start and stop positions and sequence are listed as they appear for the Yaba monkey tumor virus (YMTV) genome. Hits marked with an asterisk (colored in blue on the original poster) may be potential bidirectional promoters.

Hit  Start position (bp)  Stop position (bp)  Length (bp)  Sequence
1    14746                14785               40           TTATTTATGTTATTAGCTATGATTTATGTTTCATTTTTAA
2*   42489                42516               28           ACTATCATTTACTAAGGAGTAAAATAGG
3*   46984                47004               21           AAAAAATAAAATGAGTCTTCG
4    56349                56371               23           GATCAAACTGCTAGATCTGTTAT
5*   92638                92660               23           TCATTTATTTAGTATTAAATGAC
6    95122                95143               22           TTATCGTCTACGAACATTTATA
7*   96835                96866               32           TTAAATAACTCATTTATATATTAAAAAATGTC

Figure 3. Multiple sequence alignment of LCS hit 7 with all chordopoxvirus genomes. The hit region is highlighted in grey.


Introduction

• Brunetti et al. located a 40 nucleotide region conserved in 6 distantly related poxviruses that also contained a late promoter

• We developed an algorithm (Longest Common Substrings; LCS) to automatically locate regions which share 100% identity between distantly related poxvirus genomes

• LCS uses a suffix tree to systematically identify regions on one genome that are identical to regions on each of the other genomes included in the comparison (Figure 1)

• When a 100% identical match is found, the length, position and nucleotide sequence are recorded

• Using LCS with a minimum cut-off of 20 nucleotides and Brunetti’s dataset, we identified the same 40 nucleotide region as well as 6 other identical regions ranging in size from 21 to 32 nucleotides

• We will show that:
  • 4 of the 7 hits (hits 2, 3, 5 and 7) may be bidirectional promoter regions
  • These regions are conserved in almost all poxviruses
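The search outlined in the bullets above can be sketched in a few lines of Python. This is not the LCS implementation in VGO: instead of a suffix tree it uses naive substring checks, which is far slower on whole genomes but easier to follow. The function name `find_common_hits`, its parameters and the toy sequences are hypothetical. The sketch scans one reference sequence and records the start, stop, length and sequence of every exact match of at least `min_len` nucleotides that occurs in all other sequences.

```python
def find_common_hits(reference, others, min_len=20):
    """Return (start, stop, length, sequence) for exact matches shared by all.

    Positions are 1-based and inclusive on `reference`, as in Table 1.
    Hits nested inside a previously reported longer hit are skipped, which
    may differ from how LCS itself reports overlapping substrings.
    """
    hits = []
    prev_end = 0  # exclusive end (0-based) of the last reported hit
    n = len(reference)
    for start in range(n):
        end = start
        # Extend the window while the substring still occurs in every other sequence.
        while end < n and all(reference[start:end + 1] in seq for seq in others):
            end += 1
        length = end - start
        if length >= min_len and end > prev_end:
            hits.append((start + 1, end, length, reference[start:end]))
            prev_end = end
    return hits


# Toy sequences (hypothetical, not poxvirus data) sharing the substring TAAC.
print(find_common_hits("ATTAACGT", ["CCTAACAA"], min_len=3))
```

For the toy input, the only reported hit is TAAC at positions 3 to 6 of the reference. Because the scan is quadratic or worse in sequence length, a suffix tree as described on the poster is the better choice for genome-scale input.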

Each of the four hits shown in Figure 2 is located upstream of two genes that are transcribed in opposite directions. For hit 2 (Figure 2A), the LCS hit is located 300 nucleotides upstream of the YMTV-52R (late transcription elongation factor) gene and 20 nucleotides upstream of the YMTV-50L (unknown function) gene. Hit 3 (Figure 2B) is located 10 nucleotides upstream of the YMTV-58R (late transcription factor) gene and 44 nucleotides upstream of the YMTV-57L (virion assembly) gene. Hit 5 (Figure 2C) is located 18 nucleotides upstream of the YMTV-102R (viral membrane formation) gene and 19 nucleotides upstream of the YMTV-101L (P4a precursor) gene. Hit 7 (Figure 2D) is located 27 nucleotides upstream of the YMTV-110R (DNA helicase) gene and 19 nucleotides upstream of the YMTV-109L (intracellular mature virion membrane protein) gene.
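The positional test behind "potential bidirectional promoter" can be expressed as a short check. The sketch below is an illustration only: the `Gene` class, the function names and all coordinates are hypothetical, and the poster does not say exactly how its upstream distances were measured. Here the distance is assumed to be the number of nucleotides between the hit and the gene's translation start, with leftward (L) genes on the "-" strand and rightward (R) genes on the "+" strand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    name: str
    start: int   # leftmost coordinate, 1-based
    end: int     # rightmost coordinate, 1-based inclusive
    strand: str  # "+" for rightward (R) genes, "-" for leftward (L) genes


def upstream_distance(hit_start: int, hit_end: int, gene: Gene) -> Optional[int]:
    """Nucleotides between the hit and the gene's translation start.

    Returns None when the hit does not lie entirely upstream of the gene.
    """
    if gene.strand == "+":
        # Translation starts at gene.start; upstream lies to the left.
        return gene.start - hit_end - 1 if hit_end < gene.start else None
    # Translation starts at gene.end; upstream lies to the right.
    return hit_start - gene.end - 1 if hit_start > gene.end else None


def is_between_divergent_genes(hit_start, hit_end, left_gene, right_gene):
    """True if the hit is upstream of both a leftward and a rightward gene."""
    return (
        left_gene.strand == "-"
        and right_gene.strand == "+"
        and upstream_distance(hit_start, hit_end, left_gene) is not None
        and upstream_distance(hit_start, hit_end, right_gene) is not None
    )


# Hypothetical coordinates, not taken from the YMTV annotation.
gene_l = Gene("example-L", 100, 400, "-")
gene_r = Gene("example-R", 460, 900, "+")
print(upstream_distance(421, 440, gene_l), upstream_distance(421, 440, gene_r))
print(is_between_divergent_genes(421, 440, gene_l, gene_r))
```

A positional check like this only flags candidates; whether a hit actually drives transcription in either direction still has to be shown experimentally.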

Conclusions and Future Studies

• LCS is a novel method for identifying highly conserved regions in distantly related poxvirus genomes

• Our analysis has shown that several of these regions have the potential to be bidirectional promoters

• Sequence alignments of all chordopoxvirus genomes indicate that in each case, the bidirectional promoter regions are highly conserved amongst almost all orthopoxvirus genomes with parapoxviruses exhibiting the least identity in these regions

• Further experimental analysis must be performed to confirm the stage of expression for two genes (YMTV-50L and YMTV-109L) listed in Table 2

• Comparisons must be made between the bidirectional promoters identified here, and other known early and late promoters to further show that these bidirectional promoter regions are significantly conserved

• The LCS algorithm will be modified in order to find common substrings that are not 100% identical matches

Figure 2. Graphical representation of each of the four possible bidirectional promoter regions identified through LCS. Black boxes represent the hit for each region and blue bars correspond to a particular gene in the YMTV genome. (A) Hit 2 (B) Hit 3 (C) Hit 5 (D) Hit 7


References

1. Brodie, R., A. J. Smith, et al. (2004). "Base-By-Base: single nucleotide-level analysis of whole viral genome alignments." BMC Bioinformatics 5(1): 96.

2. Brunetti, C. R., H. Amano, et al. (2003). "Complete genomic sequence and comparative analysis of the tumorigenic poxvirus Yaba monkey tumor virus." J Virol 77(24): 13335-47.

3. Edgar, R. C. (2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput." Nucleic Acids Res 32(5): 1792-7.

4. Ehlers, A., J. Osborne, et al. (2002). "Poxvirus Orthologous Clusters (POCs)." Bioinformatics 18(11): 1544-5.

5. Hedengren-Olcott, M., C. M. Byrd, et al. (2004). "The vaccinia virus G1L putative metalloproteinase is essential for viral replication in vivo." J Virol 78(18): 9947-53.

6. Heljasvaara, R., D. Rodriguez, et al. (2001). "The major core protein P4a (A10L gene) of vaccinia virus is essential for correct assembly of viral DNA into the nucleoprotein complex to form immature viral particles." J Virol 75(13): 5778-95.

7. Ramsey-Ewing, A. and B. Moss (1995). "Restriction of vaccinia virus replication in CHO cells occurs at the stage of viral intermediate protein synthesis." Virology 206(2): 984-93.

8. Resch, W., A. S. Weisberg, et al. (2005). "Vaccinia virus nonstructural protein encoded by the A11R gene is required for formation of the virion membrane." J Virol 79(11): 6598-609.

9. Simpson, D. A. and R. C. Condit (1994). "The vaccinia virus A18R protein plays a role in viral transcription during both the early and the late phases of infection." J Virol 68(6): 3642-9.

10. Szajner, P., H. Jaffe, et al. (2003). "Vaccinia virus G7L protein Interacts with the A30L protein and is required for association of viral membranes with dense viroplasm to form immature virions." J Virol 77(6): 3418-29.

11. Ukkonen, E. (1995). "On-line construction of suffix-trees." Algorithmica 14: 249-260.

12. Upton, C., D. Hogg, et al. (2000). "Viral genome organizer: a system for analyzing complete viral genomes." Virus Res 70(1-2): 55-64.

Acknowledgements

This work was performed under the supervision of Dr. Chris Upton and with the help of Angelika Ehlers and Marina Barsky.

Contacts

Dr. Chris Upton, Phone: (250) 721-6507
Melissa Da Silva, Phone: (250) 721-6506
Daniel Horspool, Phone: (250) 721-6506

Table 2 lists the experimentally determined stage of expression for each of the 8 genes under the control of possible bidirectional promoters. The stage of expression is as yet unknown for both YMTV-50L and YMTV-109L, although late promoters have been predicted for each of these genes. Interestingly, with the exception of YMTV-50L, each of the genes associated with the bidirectional promoters has been found to be essential to virus survival in the host cell. These promoter regions may be highly conserved in all poxviruses because of the essential nature of each of the genes controlled by the promoters.

Table 2. Stage of expression during the poxvirus lifecycle for the genes comprising each potential bidirectional promoter.

Hit  Gene       Putative function                             Stage of expression
2    YMTV-52R   Viral late transcription elongation factor    Late
2    YMTV-50L   Unknown                                       Late (predicted)
3    YMTV-58R   Viral late transcription factor               Intermediate
3    YMTV-57L   Virion assembly protein                       Late
5    YMTV-102R  Viral membrane formation                      Late
5    YMTV-101L  P4a precursor                                 Late
7    YMTV-110R  DNA helicase                                  Early and late
7    YMTV-109L  Intracellular mature virion membrane protein  Late (predicted)

Figure 1. An example LCS suffix tree. Red arrows correspond to sequence 1 and green arrows correspond to sequence 2.

The LCS algorithm uses a standard suffix tree created by iteratively adding suffixes of the sequence one at a time. New branches are added to the tree if the current substring has not been encountered before. Once the tree for the first sequence has been built the second sequence is then analyzed in the same way by adding branches for unseen substrings or marking a particular path if the substring has been seen before. Nodes that are common between all sequences imply that the path to them is shared by all sequences. In this example, the longest common substring between sequences 1 and 2 is TAAC. LCS can also find all common substrings of a user-defined minimum length. For instance, if a cut-off of 3 nucleotides is used, LCS will report two hits (TAAC and AAC).
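The idea of adding suffixes one at a time and marking shared paths can be illustrated with a deliberately simple Python sketch. It builds an uncompressed suffix trie rather than the compact suffix tree used by LCS (reference 11 describes the on-line construction of the latter), so its memory use grows quadratically and it is only suitable for toy input. The function names and the example sequences are hypothetical; the sequences are chosen only so that the longest shared substring is TAAC and are not the ones shown in Figure 1.

```python
def build_suffix_trie(sequences):
    """Insert every suffix of every sequence into one shared trie.

    Each node records which sequences pass through it ("seen"), which
    plays the role of the colored arrows in Figure 1.
    """
    root = {"children": {}, "seen": set()}
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for base in seq[start:]:
                # Add a new branch only if this path has not been seen before.
                node = node["children"].setdefault(
                    base, {"children": {}, "seen": set()}
                )
                node["seen"].add(seq_id)
    return root


def longest_common_substring(sequences):
    """Return the path to the deepest node visited by all sequences."""
    root = build_suffix_trie(sequences)
    everyone = set(range(len(sequences)))
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for base, child in node["children"].items():
            # Descend only along paths shared by every sequence.
            if child["seen"] == everyone:
                stack.append((child, path + base))
    return best


# Toy sequences (hypothetical) whose only shared 4-mer is TAAC.
print(longest_common_substring(["ATTAACGT", "CCTAACAA"]))
```

If several shared substrings tie for the greatest length, this sketch returns just one of them; a full implementation would report every hit at or above the user-defined cut-off together with its position in each genome.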