
ELUCIDATING the MECHANISMS of TRANSPOSABLE ELEMENTS using EXPERIMENTAL and

BIOINFORMATIC APPROACHES: the hAT SUPERFAMILY of TRANSPOSABLE ELEMENTS in the

GENOME of AEDES AEGYPTI and TE DISPLAYER

Rebecca Rooke

A thesis submitted in conformity with the requirements for the degree of Master of Science

Graduate Department of Cell and Systems Biology University of Toronto

Elucidating the Mechanisms of Transposable Elements using

Experimental and Bioinformatic Approaches: The hAT

Superfamily of Transposable Elements in the Genome of Aedes

aegypti and TE Displayer

Rebecca Rooke

Master of Science

Cell and Systems Biology University of Toronto

Abstract

Transposable elements (TEs) are found in nearly all eukaryotic genomes and are a

major driving force of genome evolution. The hAT superfamily of TEs is found in a

variety of organisms, including plants, fungi, insects and animals. To date, only 14 hAT

TEs in the Aedes aegypti genome have been annotated as having a hAT transposase

coding sequence. In this study, extensive bioinformatic approaches have been

employed to find hAT TEs that encode transposases in the A. aegypti genome. A total

of six newly-identified TEs belonging to the hAT superfamily were discovered in the A.

aegypti genome. Furthermore, a computer program called TE Displayer was developed

to analyze TEs in genome sequences. TE Displayer detects TE-derived polymorphisms

in genome datasets and presents the results on a virtual gel image. TE Displayer

enables researchers to compare TE profiles in silico and provides a reference profile for

experimental analyses.

Acknowledgments

First and foremost, I would like to thank my supervisor, Dr. Guojun Yang, for introducing

me to and guiding me through the exciting world of transposable elements. Your

constant enthusiasm about your research was nothing short of contagious. I appreciate

all the time and effort you gave me throughout these past two years to help me become

a better biologist.

I would also like to thank the members of my committee, Dr. George Espie and

Dr. Marla Sokolowski, for their valuable guidance and suggestions.

I could not have successfully completed my MSc without the academic, mental,

and emotional support of Amy Wong and Matt Janicki. You are both phenomenal people

who were always there to encourage and motivate me, laugh and joke with me, and you

provided me with a necessary fun and whacky world outside of the lab.

Lastly, I would like to thank my family for their support, motivation, and

encouragement. Thank you, Angela, for editing my thesis. You are my role model and

inspiration, not only in the world of academia, but in life as well. Thank you Mom and

Dad, for allowing me to choose my own path and for supporting me with every step I

Funding: Natural Sciences and Engineering Research Council (RGPIN371565 to G.Y.); Canada Foundation for Innovation (24456 to G.Y.); Ontario Research Fund; University of Toronto.

Table of Contents

Acknowledgments

Table of Contents

List of Tables

List of Figures

List of Appendices

Publications

Glossary

Chapter 1 Introduction to Transposable Elements

1 Transposable Elements (TEs)

1.1 TE Classification

1.2 Miniature Inverted Repeat Transposable Elements (MITEs)

1.3 Recently and Currently Active MITEs

1.4 Elucidating how MITEs Achieve High Copy Numbers

1.5 Significance of TEs

Chapter 2 Elucidating the Transposase Sources for the Transposition of hAT MITEs

2 Introduction to hAT TEs

3 Methods

3.1 Determining and Cloning hAT MITEs

3.2 Finding TEs using a Top-Down Approach

3.3 Determining Candidate Transposases for the Transposition of hAT MITEs

3.3.1 Retrieving All Putative hAT Transposases

3.3.2 Identifying Recently Active Putative Transposases

3.3.3 Linking hAT MITEs with Putative Transposases

3.3.4 Identifying Coding Sequences of Putative hAT Transposases

3.3.5 Phylogenetic and Conserved Domain Analysis of Known and Putative hAT Transposases

3.4 Synthesizing and Cloning of Transposases

3.5 Yeast Excision Assays

4 Results

4.1 Computational Analyses

4.1.1 Finding MITE Members Belonging to the hAT Superfamily of TEs

4.1.2 Finding TEs Encoding Putative hAT Transposases

4.1.3 Analysis of hATTPases and their copies in the A. aegypti genome

4.1.4 The Buster and Ac families of the hAT Superfamily

4.1.5 Conserved Domains in Known and Putative Transposase Sequences in the A. aegypti genome

4.1.6 Linking MITEs to Putative hAT Transposases

4.1.7 Finding TEs using a Top-Down Approach

4.2 Experimental Analyses

4.2.1 Cloning MITEs

4.2.2 Candidate hAT Transposase Analysis and Cloning

4.2.3 Yeast Excision Assays with the Putative hAT Transposase hATTPase16

5 Discussion

Chapter 3 TE Displayer for Post Genomic Analysis of TEs

6 Introduction to Transposon Display

7 Methods

7.1 Algorithm

7.2 Implementation

7.3 Output

7.4 Parameters Used for Testing TE Displayer

7.5 Genomic Database Sources

8 Results

9 Discussion

Chapter 4 Concluding Remarks

References

Appendix I: Supplementary Materials

List of Tables

Table 1: Summary of output retrieved from MAK’s Member function.

Table 2: A summary of the 23 hATTPases. Their accession and position in the A. aegypti genome is shown, along with their size in bps and TSD sequence.

Table 3: The number of individual hAT MITE sequences that were cloned into the donor plasmid for each hAT MITE family.

Table 4: hAT primer sequences and genomes used to generate output for hAT elements.

Supplementary Table 1: Consensus sequences of hAT MITE families from TEfam (http://tefam.biochem.vt.edu).

Supplementary Table 2: Primer sequences used to amplify hAT MITEs from A. aegypti genomic DNA.

Supplementary Table 3: Primer sequences of candidate transposase exons. Grey, six additional nucleotides; pink, restriction enzyme site.

Supplementary Table 4: Novel hAT TE families. Consensus sequences and size (bp) are shown, as well as the copy number for each family/subfamily.

List of Figures

Figure 1: Graphical representation of the transposition of Class I TEs. The TE is transcribed into RNA and then reverse-transcribed into cDNA. The cDNA is inserted into the genome at a different location than the original element.

Figure 2: Graphical representation of the transposition of Class II TEs. The TE is excised from its location and re-inserted elsewhere in the genome.

Figure 3: Illustration of donor plasmid. Amp, ampicillin resistance gene; ARS1, autonomously replicating sequence 1; OriEC, E. coli replication origin; CEN4, centromere of yeast chromosome 4. Illustration adapted from Yang et al. (2009).

Figure 4: An illustration of the primers designed for a hypothetical hAT transposase with two exons and one intron. Green arrows, primers corresponding to exon #1; orange arrows, primers corresponding to exon #2; TGATCA, SpeI site; GTCGAC, SalI site.

Figure 5: Illustration of transposase source plasmid. Amp, ampicillin resistance gene; ARS H4, autonomously replicating sequence of the H4 gene; CEN6, centromere of yeast chromosome 6; cyc1 ter, termination of yeast cyclin gene cyc1; OriEC, E. coli replication origin; Pgal1, yeast gal1 promoter. Illustration adapted from Yang et al. (2009).

Figure 6: A schematic representation of how the best candidate hAT transposase sequences were selected.

Figure 7: A neighbor-joining tree of the DNA sequences of hATTPases and their copies.

Figure 8: A maximum likelihood phylogenetic tree of the 23 hATTPase transposase amino acid sequences (50% majority rule consensus). Numbers next to the nodes show quartet puzzling reliability based on 10,000 puzzling steps, a measure of nodal support similar to bootstrapping that is produced by TREE-PUZZLE.

Figure 9: A maximum likelihood phylogenetic tree of amino acid transposase sequences from Arensburger et al. (2011) and amino acid sequences of annotated hATTPases (50% majority rule consensus). Numbers next to most nodes show quartet puzzling reliability based on 10,000 puzzling steps, a measure of nodal support similar to bootstrapping produced by TREE-PUZZLE.

Figure 10: Sequence frequency logos of the TSD sequences for hATTPases and their copies belonging to the Buster and Ac families.

Figure 11: A schematic representation of known intact hAT transposase sequences in A. aegypti (from TEfam) and annotated hATTPases that have conserved sequence domains. Grey lines, transposase sequence; blue, hAT family dimerization domain; red, zinc finger domain; green, DUF659 domain of unknown function.

Figure 12: Illustration of which hATTPase DNA sequences have ends that are similar in sequence to the ends of each MITE family. Red lines, match MITE family TF000722; blue lines, match MITE family TF000576; green lines, match MITE family TF000718; yellow lines, match MITE family TF000706; purple lines, match MITE family TF001275; grey lines, match MITE family TF000715.

Figure 13: Alignment of the end sequences of hAT MITE families that match best with the end sequences of the hATTPase DNA sequences.

Figure 14: Alignment of the 5’ and 3’ ends of the three TE families found from TopDown and the hATTPase-coding elements used to find them.

Figure 15: Example of yeast colonies growing on media lacking histidine and uracil. All transformation reactions that resulted in colony formation for all three conditions, as shown above, were plated on media lacking adenine.

Figure 16: Yeast on media lacking adenine. Plates were streaked with colonies incubated at 30ºC on media lacking histidine and uracil. Sections on plates are representative of a single streaked colony. Red arrow, colony.

Figure 17: Yeast on media lacking adenine. Plates were spread with yeast cells from colonies incubated at 25ºC in liquid media lacking histidine and uracil. Red arrow, colony.

Figure 18: Yeast on media lacking adenine. Plates were spread with yeast cells from colonies incubated at 30ºC in liquid media lacking histidine and uracil. Red arrow, colony.

Figure 19: A schematic representation of Transposon Display. (A) Genomic DNA is extracted; (B) DNA is digested with MseI and adapters are ligated to the ends; (C) Pre-amplification PCR is performed; (D) Selective PCR is performed; (E) Products are run on a polyacrylamide gel. Blue boxes, adapters; grey arrows, pre-amplification primers; black arrows, selective amplification primers.

Figure 20: Screen-shot of the bioinformatics program, TE Displayer.

Figure 21: Diagram of TE Displayer algorithm (see Methods: Implementation). Red arrowhead, pre-amplification primer; black arrowhead, selective-amplification primer. Adapted from Rooke & Yang (2010).

Figure 22: TE Displayer virtual gels. (A) hAT families in different species. Lane 1: A. thaliana; lane 2: C. elegans; lane 3: rice; lane 4: A. aegypti; lane 5: D. melanogaster. (B) mPing elements in rice. Lane 1: O. sativa var. indica; lane 2: O. sativa var. japonica. (C) TF000720 family in A. aegypti with different allowed primer mismatches. Lane 1: no mismatches; lane 2: 1 mismatch; lane 3: 2 mismatches. (D) TF000700 family in A. aegypti with different selective bases. Lane 1: no selective base; lane 2: A; lane 3: C; lane 4: T; lane 5: G. Adapted from Rooke & Yang (2010).

Supplementary Figure 1: The amino acid alignment of annotated hATTPases and transposase protein sequences from Arensburger et al. (2011). Alignments were generated from M-COFFEE.

Supplementary Figure 2: Sequence of the putative hAT transposase, hATTP16. Underlined sequence is the coding region. Insertion locations that were repaired are denoted by asterisks (*). Substitutions that were repaired are denoted by red residues. Grey background, intron; yellow background, TIRs.

List of Appendices

Appendix I: Supplementary Materials

Publications

Rooke, R. & G. Yang (2010) TE displayer for post genomic analysis of transposable elements. Bioinformatics, 27(2): 286-287

My contributions to this publication include: troubleshooting glitches in the

software; making the computer program more aesthetically-pleasing and easy to use;

inserting user-controlled options into the software, such as changing background and

font color; testing the program with numerous different databases; inspecting all output

to ensure the software is generating expected results. Furthermore, I wrote and

submitted the publication (with editing from Dr. Guojun Yang) and generated all figures

for the manuscript. Compared to the publication, the thesis contains an expanded

introduction.

Janicki, M., Rooke, R. & G. Yang. In press. Bioinformatics and genomic analysis of transposable elements in eukaryotic genomes. Chromosome Research. DOI 10.1007/s10577-011-9230-7

My contributions to this publication include thoroughly editing the manuscript prior

to submission. Following submission, the first author and I were responsible for

addressing the reviewers’ comments and suggestions and editing the manuscript

accordingly.

Glossary

hAT: named after the hobo, Activator, and Tam3 transposable elements

MAK: MITE Analysis Kit

MITE: Miniature inverted-repeat transposable element

TD: Transposon display

TE: Transposable element

TIR: Terminal inverted repeat

TSD: Target site duplication

Chapter 1 Introduction to Transposable Elements

1 Transposable Elements (TEs)

Barbara McClintock first described transposable elements (TEs) in the maize (Zea mays)

genome in the 1940s (1, 2). Since their discovery, TEs have been found in nearly every

eukaryotic and prokaryotic organism studied to date, with only a few exceptions

(Plasmodium falciparum and Bacillus subtilis) (3, 4). TEs are so abundant in some

genomes that they can comprise over 85% of the DNA (5). Furthermore, TEs are

estimated to have increased the size of the maize genome two- to five-fold (5, 6), where a single

class of TEs comprises approximately 50% of the total genome (7). Although the effect

of TEs on genome structure and function is continually being investigated, it is well-

accepted that TEs shape the size and structure of genomes and are significant players

in genome evolution (8). Therefore, understanding TEs—their transposition activity,

structure, and replication—is essential to elucidating how genomes have evolved both

structurally and functionally.

1.1 TE Classification

TEs can be divided into two major classes: class I (or retrotransposable elements) and

class II (or DNA transposable elements). The two classes of TEs differ with respect to

their mode of transposition. Class I TEs transpose via an RNA intermediate using a

mechanism commonly referred to as “copy-and-paste”. In comparison, class II TEs

transpose using a “cut-and-paste” mechanism with DNA as the only intermediate (9, 10).

Due to their different modes of transposition, class I elements are commonly found in


high copy numbers in their host genomes, whereas class II elements are often found in

low copy numbers (11).

Class I TEs contribute to the major repetitive portions of large genomes (12-14).

For example, a single family of class I TEs comprises nearly 35% of the human genome

(15). The transposition mechanism of class I elements begins with the synthesis of RNA

transcripts using the genomic TE copy as a template. The RNA transcripts are

subsequently reverse transcribed into DNA by a TE-encoded reverse transcriptase and

inserted into the genome at a different location (Figure 1). As a result of this “copy-and-

paste” transposition mechanism, each transposition event produces one additional copy

of the TE (10).

[Figure 1 diagram. Labels: donor DNA with class I element; transcription; reverse transcription; recipient DNA with class I element; donor DNA with class I element]

Figure 1: Graphical representation of the transposition of Class I TEs. The TE is transcribed into RNA and then reverse-transcribed into cDNA. The cDNA is inserted into the genome at a different location than the original element.

Class I elements are divided into five orders, based on their insertion mechanism

and overall organization and enzymology: LTR (long terminal repeats), DIRS

(Dictyostelium intermediate repeat sequence), PLE (Penelope-like elements), LINE

(long interspersed nuclear element), and SINE (short interspersed nuclear element).

These orders are further divided into superfamilies based on the sizes of their target site

duplications (TSDs)—a short direct repeat sequence generated upon TE insertion—and

their protein coding domains (10).

Class II TEs are found in most eukaryotes and are the major class of TEs in

prokaryotes. Most TEs belonging to this class have terminal inverted repeats (TIRs) that

range in size from 11 base pairs to several hundred base pairs (11). Many class II TEs

encode a transposase enzyme that recognizes and binds to the TIRs, excises the

TE from its existing location, and inserts it elsewhere in the genome (Figure 2). It

is estimated that sequences derived from class II TEs constitute at least 1% of the

human genome (16).

Due to the nonreplicative transposition mechanism of class II TEs, an increase in

copy number is achieved by utilizing the host machinery. In one instance, a class II TE

can be duplicated if a transposition event occurs during DNA replication. In this case, if

the class II TE transposes from a replicated chromatid to an unreplicated site, the

element will have duplicated itself in the genome. In another instance, a class II TE can

be duplicated by gap repair through homologous recombination if the TE is present on

the homologous chromosome or a sister chromatid. This results in the restoration of the

TE at its original site (17).

Class II elements can be divided into two subclasses based on the number of

DNA strands that are cut during transposition. Subclass I elements cut both DNA

strands, while elements belonging to subclass II only cut one of the DNA strands.

Subclass I elements are further divided into two orders: TIR and Crypton. Elements

belonging to the TIR order are characterized by their TIRs which vary in length. This

order is separated into nine superfamilies based on the size of their TSDs and the

sequence of their TIRs: Tc1-Mariner, hAT, Mutator, Merlin, Transib, P, PiggyBac, PIF-

Harbinger, and CACTA. The Crypton order only contains one superfamily of the same

name which contains elements that lack TIRs but generate TSDs upon insertion (10).

Subclass II elements are also divided into two orders: Helitron and

Maverick/Polintrons (10, 18). Both orders contain a single superfamily of the same

name. Elements in the superfamily Helitron are proposed to replicate via a rolling-circle

mechanism and do not generate TSDs (10). Alternatively, elements in the superfamily

Maverick/Polintron bear long TIRs and generate TSDs that are 6 bps in length (17-19).

[Figure 2 diagram. Labels: donor site with class II element; excision of TE; insertion into different site; donor site; recipient DNA with class II element]

Figure 2: Graphical representation of the transposition of Class II TEs. The TE is excised from its location and re-inserted elsewhere in the genome.

1.2 Miniature Inverted Repeat Transposable Elements (MITEs)

Both class I and class II TEs contain autonomous and nonautonomous elements.

Autonomous elements are elements that encode the enzyme(s) necessary for their

transposition, while nonautonomous elements do not. Despite their differences,

autonomous and nonautonomous elements within the same superfamily may have

strong sequence similarity and often contain the same crucial characteristics required

for transposition (i.e. TIRs) (10). Some nonautonomous elements, such as the Ds

element, are generated by point mutations or deletions from the autonomous element,

rendering their transposase gene inactive, but maintaining enough sequence similarity

to be recognized by the transposase produced by the autonomous element(s) (20).

Therefore, nonautonomous elements rely on transposases from autonomous TEs for

their transposition.

Miniature inverted repeat transposable elements (MITEs) are a type of

nonautonomous element that have TIRs and generate TSDs upon insertion. The first

MITE was discovered in maize while analyzing insertions in the waxy gene (21). The

MITE did not share sequence similarity with any known TE at the time and was present

in over 10 000 copies in the maize genome (22). MITEs are typically short (usually <500

bps in length), often located in or near genes (23-25) and are often found in high copy

numbers in the genomes in which they reside, despite lacking a transposase coding

sequence. Unlike other nonautonomous elements, the majority of MITEs are not

deletion derivatives of autonomous elements (26, 27). Two hypotheses exist to explain

the origin of MITEs: (a) a MITE arises from the fortuitous placement of TIR-like

sequences or solo TIRs that are recognized by an autonomous TE (28, 29) or (b) MITEs

are relics of past TEs whose autonomous elements have been degraded in the genome

or have not reached fixation within the population (27).
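
The structural hallmarks described above, TIRs at both ends of an element and a TSD flanking the insertion, lend themselves to a simple computational check. The following Python sketch is illustrative only: the function names, the toy sequences, and the 8 bp TSD length used in the example are assumptions chosen for demonstration and are not taken from the methods of this thesis. It measures the length of a perfect TIR by comparing the start of an element with the reverse complement of its end, and tests whether the bases immediately flanking the element are identical.

```python
# Hypothetical helpers for illustration; not part of any published tool.

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def tir_length(element):
    """Length of the perfect terminal inverted repeat of an element.

    Counts how many leading bases match the reverse complement of the
    trailing bases, stopping at the first mismatch. Only the first half
    of the element is examined so the two repeats cannot overlap.
    """
    element = element.upper()
    rc = reverse_complement(element)
    length = 0
    for left, right in zip(element[: len(element) // 2], rc):
        if left != right:
            break
        length += 1
    return length


def has_tsd(left_flank, right_flank, tsd_len):
    """True if the tsd_len bases directly flanking the element are identical."""
    if len(left_flank) < tsd_len or len(right_flank) < tsd_len:
        return False
    return left_flank[-tsd_len:].upper() == right_flank[:tsd_len].upper()


# Toy example (made-up sequences): an element with 8 bp TIRs
# inserted between two copies of the 8 bp sequence GTCTAGAC.
toy_element = "CAGGGTTC" + "ATATATGCGC" + "GAACCCTG"
toy_left_flank = "AC" + "GTCTAGAC"
toy_right_flank = "GTCTAGAC" + "TT"

toy_tir = tir_length(toy_element)
toy_tsd = has_tsd(toy_left_flank, toy_right_flank, 8)
```

Real elements often carry imperfect TIRs, so a practical search would tolerate mismatches; the exact-match rule here is kept only for brevity.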

To date, MITEs have been found in organisms spanning all five kingdoms. They

are found in a diverse range of species, including Arabidopsis thaliana (30), Xenopus

laevis (31), Caenorhabditis elegans (32), Aedes aegypti (33), teleost fish (34), archaea

species (35) and humans (16, 36). In some species, MITEs make up a significant

portion of the genome. For example, rice (Oryza sativa) has a genome composed of

approximately 4% MITEs and MITE-derived sequences (37) and MITEs constitute 1-2%

of the C. elegans genome (38). Furthermore, approximately 16% of the yellow fever

mosquito’s (Aedes aegypti) total genome is composed of MITEs, the highest genome

percentage known so far (39).

1.3 Recently and Currently Active MITEs

In 2003, the first active MITE, named mPing, was identified in natural rice plants (40),

tissue culture (24), and plants derived from anther calli (23). It was later discovered that

mPing is active in plants derived from seeds treated with hydrostatic pressure (41) and

in recombinant inbred lines (42). In transgenic Arabidopsis plants and introgressed rice

plants, the transposase from Ping and Pong were demonstrated to mobilize mPing (42,

43). Although mPing is a deletion derivative of Ping, Pong encodes similar proteins to

Ping and is able to transpose mPing elements via cross-mobilization (24).

Since the discovery of mPing’s transposition activity, other active MITEs have

been identified. The MITEs dTstu1 and dTstu1-2 were shown to be active in potato

when a somaclonal variant, called Java kids purple (JKP), was generated from leaf

protoplasts of the potato cultivar 72218. It was shown that dTstu1 excised from the

flavonoid 3’,5’-hydroxylase gene, thereby restoring the gene’s function and producing a

differently coloured tuber. Further investigation revealed that a dTstu1-like MITE,

dTstu1-2, was present in an allele in JKP, but was absent in every allele of the locus in

72218, indicating a new insertion event (44).

Similarly, the Arachis hypogaea MITE (AhMITE1) in the VL1 peanut mutant also

showed activity after its host was exposed to stress. When VL1 peanut mutants

were subjected to mutagenesis, the resulting plants differed phenotypically from VL1

mutants, in that they became resistant to late leaf spot (LLS) and susceptible to rust.

Molecular analysis showed that the phenotypic changes were due to the excision of

AhMITE1 from a pre-determined site. MITEs can be activated by mutagenesis (45) and

tissue culture stresses (23) and AhMITE1 follows this pattern in VL1 peanut mutant

plants.

Another known active MITE family, called mimp, was characterized in the

genome of the ascomycete fungus Fusarium oxysporum (46). The two subclasses of

mimp, referred to as mimp1 and mimp2, have 27 bp TIRs that share sequence similarity

to the autonomous element impala. Furthermore, both mimp and impala generate a

“TA” TSD upon insertion (47). Phenotypic assays that were performed to test the

functional link between impala and mimp showed that impala is responsible for mimp1

excision in different strains of F. oxysporum. Although the origin of mimp1 is still

unknown, it is speculated to either be a deletion derivative of impala or to have been

formed de novo (47).

Tc7 is a 921 bp MITE found in the genome of C. elegans (32). The terminal 38

bps of Tc7 have high sequence similarity to the terminal 38 bps of the autonomous

element Tc1. Like mimp and impala, Tc1 and Tc7 have the same TSDs (“TA”). Using

Southern blotting, it was determined that Tc7 actively transposes in the germline of

mutator strains. Further analyses revealed that Tc1 is responsible for the transposition

of Tc7 and that Tc7 is not a deletion derivative of any known Tc1 element in C. elegans.

It was determined that Tc1 and Tc7 have similar transposition efficiencies and it is still

unclear why Tc7 copy numbers have not increased in mutator lines when Tc1 copy

numbers have (48).

In addition to MITEs that have been shown to be active, there are also MITEs

that are presumed to be currently or recently active. Most of these MITEs were

discovered using computational means and are predicted to be recently or currently

active based mostly on length and sequence conservation amongst members in the

genome. Recently active MITEs are highly homogenous in length and sequence,

especially in the TIRs and TSDs, as they have not yet accumulated mutations (49, 50).

For example, Nehza is thought to have recently transposed in the genomes of

Anabaena variabilis and Nostoc sp. Nehza is a MITE that is 132-171 bps in length, has

18 bp TIRs, and generates 10 bp TSDs upon insertion. A total of eight copies of Nehza

in A. variabilis and two copies in Nostoc sp. are thought to have been recently active,

due to the highly conserved lengths and TIR sequences. Nehza is speculated to have

been cross-mobilized by the transposase ISNpu3 due to the fact that they share almost

identical TIR sequences (51).

Another family of MITEs, T2-MITEs, is speculated to be currently or recently

active in Xenopus tropicalis. TS clustering is a novel strategy that involves analyzing the

differences in short terminal sequences and can identify MITEs with weak TIR base-

matching. Using TS clustering, a total of 19 242 T2-MITEs were classified into 16 major

subfamilies. Analyses of subfamilies A1, B3 and C showed that they contained

members with highly conserved TSD sequences and contained completely identical

copies. Therefore, it was postulated that these subfamilies may be currently active or

recently active. However, no transposase source has been identified as being

potentially responsible for the transposition of T2-MITEs (52).

1.4 Elucidating how MITEs Achieve High Copy Numbers

Although MITEs do not encode a transposase enzyme, they are often found in high

copy numbers in the genomes in which they reside. For example, in some rice strains

mPing can be up to 1000-fold more abundant than its autonomous partner Ping (25). It is

well-known that the DNA structure of MITEs plays a key role in their transposition.

Studies have shown that TIRs are extremely important in transposition, as they are

recognized and bound by transposase enzymes (53–59). However, the mechanism

through which MITEs achieve such high copy numbers, despite lacking a transposase

coding region, was unknown until recently.

In 2009, a breakthrough study by Yang and colleagues suggested mechanisms

that may explain why MITEs are so successful in achieving high copy numbers in

genomes. Rice Mariner-like transposons, called Osmars, were predicted to be the

transposase source of Stowaway MITEs (called Ost5, Ost8, etc.) in rice due to similar

TIR sequences and the same TSD sequence. To test this, a yeast assay was performed

in which two plasmids were co-transformed into yeast cells. One plasmid contained the

transposase source, while the other plasmid contained an ade2 gene interrupted by a

MITE. Transposition of the MITE was detected based on the recovery of the ADE2 gene

when yeast cells were plated on media lacking adenine (60).

In this study, six of the seven Osmar transposases showed activity, with the

highest excision frequency occurring between the Osmar14 transposase (Osm14) and

the Stowaway MITE Ost35. Site-directed mutagenesis of the elements revealed that the

Ost35 MITE contains multiple motifs throughout its internal region that promote

excision by transposase. Surprisingly, the Osm14 3’ subterminal region contains a

repressive motif that dramatically decreases transposition efficiency (60).

It has been postulated that class II elements persist in genomes across

generations via the relaxation of transposase-DNA binding specificity, thereby softening

the effect of detrimental mutations (27, 61, 62). This theory is supported by the fact that

Osmar transposases are able to cross-mobilize distantly related elements and have

weak DNA-transposase binding specificity (60, 63). Therefore, MITEs may parasitize

these transposases and increase their copy numbers through internal enhancement

motifs, thereby ensuring their persistence in the genome.

1.5 Significance of TEs

In the past, TEs were considered to be “parasitic” DNA that invaded genomes through

transposition (64). However, continual analyses of genomes began to shed light on the

prevalence of TEs across multiple organisms and their influence in these genomes.

Despite the improved understanding of TEs since their discovery, it still remains unclear

to what extent they contribute to genome diversity, evolution, and complexity.

The fact that TEs were once considered parasitic is not surprising, considering

that TE proliferation and transposition have the potential to cause harmful effects on

genomes. TEs are capable of causing mutations either by inserting themselves into

genes, or by their imprecise excision from genic regions, leaving what is known as a TE

“footprint”. For example, the insertion of a P element and copia element into the white

locus in D. melanogaster resulted in a white eye phenotype, reflecting a lack of

pigmentation (65). Furthermore, TE transposition can affect the host at a genome-wide

level. For example, in D. melanogaster larvae, the excision of P elements can cause

massive chromosome breakage, thought to result in temperature-dependent lethality

and sterility (66). However, although mutations induced by TEs can be harmful, it has

also been suggested that these TE-induced mutations can benefit populations through

increased mutation rates, thereby enhancing adaptation to different environments (67).

Despite the harmful effects that TEs may have on their host, there are also

examples of TEs providing direct benefits to their hosts. In D. melanogaster, for

example, certain class I elements have adopted a role similar to that of telomerases.

The transposition of these class I elements, such as HET-A and TART, replaces

damaged chromosome ends thereby maintaining constant chromosome size (68–70). It

has also been suggested that endogenous class I elements may play a role in repairing

double-strand chromosome breaks through reverse transcriptase-mediated events (67,

71, 72).

In shaping the biological properties of the organisms that carry them, TEs can be

useful tools for biotechnological applications such as insertional mutagenesis,

transgenesis, and phylogenetic markers (6, 73–75). Even though TEs were discovered

over 60 years ago in the maize genome, active TEs are continually being discovered

and characterized. Active TEs are at the core of TE-derived genome evolution and can

result in an increase in genome size (76), chromosomal rearrangements (66, 77, 78),

and the disruption or alteration of gene expression (65, 79–86). Therefore, in-depth

investigations of TEs that are potentially and currently active at genome-wide scales

and the consequences of their activity are critical to understanding genome evolution.

Chapter 2 Elucidating the Transposase Sources for the Transposition of hAT

2 Introduction to hAT TEs

The first TE ever discovered was the Ac element in maize, which belongs to the hAT

superfamily of TEs (87). The class II hAT superfamily is so named after the hobo

elements in Drosophila melanogaster, Activator (Ac) elements in maize, and Tam3

elements in Antirrhinum majus (88–90). hAT TEs are present in the genomes of a

variety of organisms including plants, mammals, fungi, amphibians, nematodes and fish

[see (91) for review]. Furthermore, hAT TEs are also found in humans, where they are

the most abundant class II TE, comprising approximately 195 Mb of the human genome.

hAT TEs have also undergone molecular domestication, a process defined as a

TE-derived coding sequence resulting in a functional host protein (93). For example, a

gene in A. thaliana is derived from the transposase sequence of the hAT TE,

Daysleeper, and is speculated to act as a transcriptional regulator that is necessary for

plant development (94). Similarly, the DREF gene in D. melanogaster is a chimeric

gene that recruited a transposase DNA-binding domain from a hAT TE. The DREF gene

is involved in multiple cellular activities in D. melanogaster including DNA replication,

cell growth and differentiation (95, 96).

The elements in the hAT superfamily are characterized by generating 8 bp TSDs

upon insertion and having 5-27 bp TIRs, with limited interfamily sequence similarity (97).

Furthermore, both autonomous and nonautonomous elements are found in the hAT

superfamily. For autonomous hAT elements, the transposases have four amino acid

motifs: a zinc finger domain near the N-terminus; a DNA-binding domain; a catalytic

domain; and an insertion domain (88, 98–100). The end region of the catalytic domain is

often referred to as the hAT dimerization domain, as it is commonly conserved in hAT

transposases and plays a role in oligomerization. However, crystal structure analyses of

a hAT transposase suggest that multiple regions may be involved in oligomerization

(100).

Recent evidence suggests that the hAT superfamily can be divided into two

families of TEs based on transposase sequences and target-site selection: the Ac family

and the Buster family. The majority of members in the Ac family have a consensus TSD

sequence of 5’-nTnnnnAn-3’. In contrast, members of the Buster family have a TSD

consensus sequence of 5’-nnnTAnnn-3’. The most significant amino acid variation

between the two families lies in the DNA-binding and insertion domains (101).
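The two consensus sequences can be read as simple positional patterns. As a minimal sketch, an 8 bp TSD can be checked against each pattern as shown below. The function names and the two example sequences are illustrative assumptions, and a match to a consensus is not by itself a family assignment, since a sequence may satisfy both patterns or neither.

```python
import re

# Consensus TSD patterns quoted in the text; "n" stands for any base.
TSD_CONSENSUS = {
    "Ac": "nTnnnnAn",
    "Buster": "nnnTAnnn",
}


def consensus_to_regex(consensus):
    """Turn a consensus such as 'nTnnnnAn' into a compiled regular expression."""
    pattern = "".join("[ACGT]" if base == "n" else base for base in consensus)
    return re.compile(pattern)


def matching_families(tsd):
    """Return the names of the consensus patterns that an 8 bp TSD satisfies."""
    tsd = tsd.upper()
    return [
        name
        for name, consensus in TSD_CONSENSUS.items()
        if consensus_to_regex(consensus).fullmatch(tsd)
    ]


# Illustrative 8 bp sequences (assumed examples, not results).
print(matching_families("GTGCCAAA"))  # ['Ac']
print(matching_families("AGCTATTC"))  # ['Buster']
```

Because the patterns constrain only two positions each, such a check is best treated as a quick screen alongside the transposase phylogeny rather than as a replacement for it.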

TEs are a major contributing factor to the variability and biodiversity of insect

populations [see (102) for review]. The yellow fever mosquito, A. aegypti, is commonly

found in close proximity to human populations and is a major vector of yellow fever,

dengue fever, and chikungunya fever (103–105). Approximately 30,000 people die

every year in Africa and South America as a result of yellow fever (104). In 2007, the A.

aegypti genome was sequenced, revealing that approximately 47% of the genome is

composed of TEs (39). A total of 21 MITE families are present in the A. aegypti genome

related to the hAT superfamily (39). Uncovering which transposases are involved in the

activity of hAT MITE members could elucidate the mechanisms involved in the evolution

and biodiversification of the A. aegypti genome.

3 Methods

3.1 Determining and Cloning hAT MITEs

In order to identify hAT MITEs present in the A. aegypti genome, the bioinformatics tool

MAK was used (106). The Member function of MAK was run using the consensus

sequences of all 21 hAT MITE families from TEfam (http://tefam.biochem.vt.edu) as a

query database (39) (Supplementary Table 1). The output of the Member function

consisted of the nucleotide sequences of every member of each MITE family present in

the A. aegypti genome. A ClustalW alignment was performed for all the members of

each MITE family and a 90% consensus sequence was generated (107). Primers were

designed using the consensus sequence to amplify MITEs for each family

(Supplementary Table 2). Due to mutations in TIR sequences amongst members of

certain MITE families, more than one set of primers was often needed.
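A threshold consensus of this kind can be illustrated with a short sketch. The function below is hypothetical and is not the tool used in this study. It assumes that the sequences are already aligned to equal length, that gap characters are counted like any other symbol, and that columns falling below the threshold are written as N.

```python
from collections import Counter


def threshold_consensus(aligned, threshold=0.9, ambiguous="N"):
    """Column-wise consensus of equal-length aligned sequences.

    A column contributes its most common symbol only if that symbol is
    present in at least `threshold` of the sequences; otherwise the
    `ambiguous` character is written.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        consensus.append(symbol if count / len(column) >= threshold else ambiguous)
    return "".join(consensus)


# Toy alignment (assumed data): nine identical members and one variant.
toy_alignment = ["ACGT"] * 9 + ["ACGA"]
print(threshold_consensus(toy_alignment, threshold=0.9))
```

Ambiguous positions in the consensus, particularly in the TIRs, are the positions where more than one primer set may be required.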

The A. aegypti genomic DNA was extracted using the protocol described in

Rivero et al. (2004) with the following modifications: a fresh pupa was used instead of

an adult mosquito; samples were incubated at 55ºC for 4 hours after protease addition;

the suspensions were extracted with a single phenol-chloroform step; and no RNAse

was added (108). PCR was carried out using Pfu DNA polymerase (Fermentas Life

Sciences, Burlington, ON), each primer set, and A. aegypti genomic DNA as a template

[95˚C, 5 min.; 35X (95˚C, 30 sec.; 57˚C, 30 sec.; 72˚C, 1 min.); 72˚C, 5 min.]. MITEs

were phosphorylated using T4 Polynucleotide Kinase (NEB, Pickering, ON) and were

column-purified (Sigma-Aldrich, Oakville, ON). The donor plasmid used for cloning the

MITEs has an ade2 gene that contains an HpaI site (Figure 3). The plasmid was digested

with HpaI restriction endonuclease (NEB) at 37˚C for 3 hours, followed by

dephosphorylation using Antarctic Phosphatase (NEB) and column-purification (Sigma).

Ligation was performed using T4 DNA Ligase (NEB) and left overnight in an ice-water

bath. Following ligation, the plasmids containing MITEs were transformed into

Escherichia coli strain DH5α. The presence of the insert was verified through enzyme

digestion and sequencing.

3.2 Finding TEs using a Top-Down Approach

The Topdown function of MAK finds deletion derivative elements and MITE-sized

elements with similar TIR sequences and the same-sized TSDs to a query sequence.

Topdown was run using the DNA sequences of each hAT transposase-coding

candidate as a query sequence (E-value: 0.1)(106). To check for novel sequences, a

BLASTn search (E-value: 10) was performed using the output of Topdown as query

sequences and all known hAT MITE members in A. aegypti (see Section 3.1:

Determining and Cloning hAT MITEs) as the database sequence. The results were

manually inspected for sequence similarities.

Figure 3: Illustration of donor plasmid. Amp, ampicillin resistance gene; ARS1, autonomous replication sequence 1; OriEC, E. coli replication origin; CEN4, centromere of yeast chromosome 4. Illustration adapted from Yang et al. (2009).

The Topdown output from each hAT transposase-coding candidate (referred to

as a hATTE family) was aligned and output sequence files were separated into

subfamilies based on sequence similarity in the internal regions. Redundant sequences

were removed. A 51% consensus sequence was generated for each hATTE family and

subfamily. To determine whether the hATTE families have been previously described, a

BLASTn search (E-value: 10) was performed using the consensus sequences and the

DNA sequences of all known hAT TEs in A. aegypti (109). The results were manually

inspected for matching sequences.

To retrieve all the members of each hATTE family, MAK’s Member function was

run using the consensus sequences as query sequences and the A. aegypti genome as

the database (E-value: 0.1)(106). Furthermore, to check whether TE families contained

members that are deletion derivatives of other TEs, a TBLASTX was performed using

the consensus sequences as query sequences against the NCBI nr database (E-value:

10)(109). Sequences with results containing the words “transposon”, “transposable

element”, and/or “transposase” were manually inspected for sequence similarities to the

TE family/subfamily consensus sequence.

3.3 Determining Candidate Transposases for the Transposition of hAT MITEs

3.3.1 Retrieving All Putative hAT Transposases

The first step in elucidating whether transposases in the A. aegypti genome are

responsible for the transposition of hAT MITEs is to identify all putative transposase

sequences. To do this, the Anchor function in MAK (106) was run using a protein

database compiled from known hAT transposase amino acid sequences from Repbase

(110) and a DNA database compiled from known hAT MITE family consensus

sequences in the A. aegypti genome (http://tefam.biochem.vt.edu) as the query

sequences (39). The Anchor function searches for elements longer than the queried

MITE sequence that share TIR and subterminal sequence similarity to the MITE. The

resulting output contains putative autonomous elements of the queried MITE that have,

or had, coding capacity. The TP_TE function was also run using the same protein

database as described above. The TP_TE function searches for DNA sequences that

encode proteins that share sequence similarity to the queried transposase protein

database.

For each output sequence, both Anchor and TP_TE search for the flanking 8 bp

nucleotides and generate a difference value, reflecting the percentage of nucleotides

that are different between the 3’ and 5’ 8 bp flanking sequences for each output

sequence. To narrow down the output from Anchor and TP_TE, all output transposase

sequences that had a difference value higher than 50% and 12.5% were removed from the

output, respectively. Redundant copies of transposases in the output were also

removed.
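The difference value described above can be illustrated with a small calculation. This is a sketch only and is not the MAK implementation. The function names are hypothetical, and the filter assumes that sequences with larger difference values are the ones discarded, as the Results describe (sequences with high difference values were removed). With 8 bp flanks, a single mismatch corresponds to 12.5%.

```python
def flank_difference(five_prime, three_prime):
    """Percentage of positions that differ between two equal-length flanks."""
    if len(five_prime) != len(three_prime):
        raise ValueError("flanking sequences must have the same length")
    if not five_prime:
        raise ValueError("flanking sequences must not be empty")
    mismatches = sum(
        a != b for a, b in zip(five_prime.upper(), three_prime.upper())
    )
    return 100.0 * mismatches / len(five_prime)


def passes_cutoff(five_prime, three_prime, cutoff):
    """Keep an element only if its difference value does not exceed the cutoff."""
    return flank_difference(five_prime, three_prime) <= cutoff


# Assumed toy flanks: identical, one mismatch, and four mismatches out of 8.
print(flank_difference("GTGCCAAA", "GTGCCAAA"))  # 0.0
print(flank_difference("GTGCCAAA", "GTGCCAAT"))  # 12.5
print(flank_difference("GTGCCAAA", "GTGCTTTT"))  # 50.0
```

Under these assumptions, a 12.5% cutoff tolerates one mismatch between the two 8 bp flanks, whereas a 50% cutoff tolerates up to four.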

3.3.2 Identifying Recently Active Putative Transposases

The best candidate transposases responsible for cross-mobilizing recently active MITEs

are TEs that were also recently active. TEs that were recently active frequently have

highly conserved copies in a genome. Therefore, all output sequences were aligned

against each other using ClustalW (107). Afterwards, an unrooted neighbor-joining

phylogenetic tree was generated. Clades with highly conserved sequences were

identified from the tree and a representative from every clade that contained highly

conserved copies was selected with manual inspection.

3.3.3 Linking hAT MITEs with Putative Transposases

Since transposase proteins recognize and bind to the end sequences of TEs during

transposition (111), the DNA sequences of the transposase-coding elements whose

ends had the highest similarity to the MITE ends were also retrieved. To do this, 29 bps

from the 3’ and 5’ ends of each transposase-coding sequence in the TP_TE and Anchor

output, as well as from each MITE family (Supplementary Table 1) were isolated. These

end sequences were aligned using ClustalW and were manually inspected for high

sequence similarity between putative TEs encoding transposases and MITE families

(107).
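Isolating the terminal sequences before alignment is a simple slicing step. The sketch below uses hypothetical function and record names; it extracts a fixed number of bases from each end and writes them as FASTA records that could serve as input to an alignment program. Whether the 3’ ends were reverse-complemented before alignment is not stated in the text, so the sketch leaves them as they are.

```python
def terminal_ends(sequence, length=29):
    """Return the first and last `length` bases of a sequence."""
    if len(sequence) < length:
        raise ValueError("sequence is shorter than the requested end length")
    return sequence[:length], sequence[-length:]


def ends_as_fasta(records, length=29):
    """Format the 5' and 3' ends of each named sequence as FASTA text.

    `records` maps a sequence name to its nucleotide sequence.
    """
    lines = []
    for name, sequence in records.items():
        five_prime, three_prime = terminal_ends(sequence, length)
        lines.append(f">{name}_5prime")
        lines.append(five_prime)
        lines.append(f">{name}_3prime")
        lines.append(three_prime)
    return "\n".join(lines) + "\n"


# Assumed toy record; a real run would use length=29 on full elements.
print(ends_as_fasta({"demo_element": "AAAACCCCGGGGTTTT"}, length=4))
```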

3.3.4 Identifying Coding Sequences of Putative hAT Transposases

BLASTx was performed using the DNA sequences of putative TEs encoding hAT

transposases with highly conserved copies and those with similar ends to hAT MITE

families as the query sequences and the amino acid sequences of all known hAT

transposases as the search database (109, 110). The BLASTX results were manually

inspected for any conserved regions between the translated nucleotides and the hAT

proteins. The translated nucleotide query sequences with long matching stretches (>150

amino acids) to the hAT proteins were manually annotated as follows: (a) the matching

regions between the putative transposase and hAT protein were considered putative

exons; (b) DNA sequences between putative exons were inspected for a GT-AG

boundary, to define introns (112, 113); (c) the first and last exons were inspected for the

presence of start and stop codons, respectively; (d) finally, the putative transposase

sequences were inspected for mutations causing frame shifts and stop codons.

Sequences with mutations causing truncated protein-coding regions were repaired to

make a full-length coding sequence. The putative hAT transposase sequences with

coding sequences that had the least number of mutations and 100% identical 5’ and 3’

TSD sequences are referred to as hATTPases.
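Steps (b) to (d) of the manual annotation amount to a few sequence checks, which can be sketched as follows. The helper names are hypothetical, the standard stop codons are assumed, and the sketch is not a substitute for manual inspection; it only shows what each check looks for.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_gt_ag_boundary(intron):
    """True if a candidate intron starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def first_in_frame_stop(cds):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i
    return None


def looks_full_length(cds):
    """True if the joined exons start with ATG, keep the reading frame,
    and contain no stop codon before the final codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    return first_in_frame_stop(cds) == len(cds) - 3


# Assumed toy sequences.
print(has_gt_ag_boundary("GTAAGTTTTCAG"))  # True
print(looks_full_length("ATGAAATAA"))      # True
print(looks_full_length("ATGTAAAAATAA"))   # False: premature stop codon
```

A coding sequence that fails the last check contains a frame shift or a premature stop codon and would be a candidate for the repair step described above.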

3.3.5 Phylogenetic and Conserved Domain Analysis of Known and Putative hAT Transposases

The software used to search for conserved domains was Batch Web CD-Search Tool,

with default parameters (Search against database: CDD; E-value: 0.01; Maximum

number of hits: 500; Low complexity filter) (114).

The copies of putative transposases were found by performing a BLASTn search

against the whole-genome shotgun reads database, using the nucleotide sequence of

each transposase as the query sequence (109). The BLASTn results were manually

inspected and sequence hits with an E-value of 0.0 and with query coverage >85%

were chosen as copies. A neighbor-joining tree was generated using MEGA version 5

using default parameters (115).
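The selection of copies from the BLASTn hits follows two explicit thresholds, which can be expressed as a simple filter. In this sketch the hit records and their field names (`subject`, `evalue`, `query_coverage`) are assumptions made for illustration and do not correspond to a particular BLAST output format.

```python
def select_copies(hits, min_coverage=85.0):
    """Keep hits with an E-value of 0.0 and query coverage above the cutoff."""
    return [
        hit
        for hit in hits
        if hit["evalue"] == 0.0 and hit["query_coverage"] > min_coverage
    ]


# Assumed toy hits; subject names are placeholders.
toy_hits = [
    {"subject": "read_A", "evalue": 0.0, "query_coverage": 97.0},
    {"subject": "read_B", "evalue": 0.0, "query_coverage": 60.0},
    {"subject": "read_C", "evalue": 1e-50, "query_coverage": 99.0},
]
print([hit["subject"] for hit in select_copies(toy_hits)])  # ['read_A']
```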

Two databases were used to make the maximum-likelihood trees: (i) all amino

acid sequences of the annotated hATTPases and (ii) all amino acid sequences of the

annotated hATTPases and the amino acid sequences of the transposases used in

Arensburger et al. (2011) (see Supplementary Figure 1). Both databases were aligned

using the program M-Coffee (116). Each database was used as input for the program

ProtTest 3 (117) to identify the best amino acid substitution matrix for phylogenetic

analysis. ProtTest 3 produced the following results: Blosum62 with a gamma shape and

invariant sites; and WAG with a gamma shape, and among-site rate variation, for

databases (i) and (ii), respectively. The aligned databases were then used to make

phylogenetic trees using TREE-PUZZLE 5.2 based on the maximum-likelihood

optimality criterion (50% majority rule consensus) (118).

The 8 bp TSD sequences from each hATTPase and their copies were manually

isolated. Sequence frequency logos were generated using WebLogo (119).
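A sequence logo summarizes the base composition at each position of a set of equal-length sequences. The underlying tabulation can be sketched as below; this is not the WebLogo implementation, and the function name and toy sequences are assumptions made for illustration.

```python
from collections import Counter


def position_frequencies(sequences, alphabet="ACGT"):
    """Return, for each position, the relative frequency of each base."""
    if not sequences:
        raise ValueError("no sequences given")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    table = []
    for column in zip(*(seq.upper() for seq in sequences)):
        counts = Counter(column)
        table.append({base: counts.get(base, 0) / len(column) for base in alphabet})
    return table


# Assumed toy input: four 2 bp sequences instead of real 8 bp TSDs.
for position, freqs in enumerate(position_frequencies(["AT", "AC", "AT", "GT"]), 1):
    print(position, freqs)
```

Positions at which one base dominates, such as the T and A positions of the consensus TSDs discussed in the Introduction, are the ones that stand out in the resulting logo.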

3.4 Synthesizing and Cloning of Transposases

PCR-based gene synthesis methods were used to synthesize and clone the putative

transposases (120). Primers were designed to flank the coding regions of the candidate

transposases (Supplementary Table 3). An additional six nucleotides and a restriction

enzyme site were incorporated into the flanking regions of the outermost primers. For

transposases with multiple exons, primers were designed with an overlapping region

corresponding to neighboring exons (Figure 4).

PCR was carried out using iProof High-Fidelity DNA Polymerase (Bio-Rad,

Mississauga, ON) from A. aegypti genomic DNA [98˚C, 2 min; 35X (98˚C,10 sec.; 57˚C,

30 sec.; 72˚C, 45 sec.); 72˚C, 5 min.]. For transposases with multiple exons, each exon

was run on a 1% agarose gel via agarose gel-electrophoresis and gel-extracted

(Qiagen, Valencia, CA). Exons were joined using PCR [98˚C, 2 min.; 10X(98˚C ,10 sec.;

45˚C, 30 sec.; 72˚C, 30 sec.); 30X(98˚C, 10 sec.; 57˚C, 30 sec.; 72˚C for 45 sec-2 min.);

72˚C for 5 min.]

Once the full coding sequence of the transposase was synthesized, the

fragment was gel-extracted (Qiagen). The ends of the transposase, as well as the

transposase source plasmid, were digested using the appropriate restriction enzymes

for 3-4 hours at 37ºC (Figure 5). After digestion, the products were column-purified

(Sigma). Once the transposase coding sequence was cloned into the plasmid,

transformation, ligation and verification were performed as described above (Section

3.1: Determining and Cloning hAT MITEs). If the transposase sequence had single

nucleotide point mutations, it was repaired with PCR-based gene synthesis using

primers bearing the correct sequence.

Figure 4: An illustration of the primers designed for a hypothetical hAT transposase with two exons and one intron. Green arrows, primers corresponding to exon #1; orange arrows, primers corresponding to exon #2; TGATCA, SpeI site; GTCGAC, SalI site.

3.5 Yeast Excision Assays

To make yeast competent cells (Strain DG2523) for transformation, a yeast colony was

inoculated in 5 mL of YPD broth at 30ºC with shaking overnight until a 1:10 dilution of

the culture reached an OD600 of 0.2-0.4. The culture was transferred to 50 mL of YPD

with a starting OD600 of 0.2 and incubated with shaking at 30ºC until an OD600 of 0.5-0.8

was reached. Cells were pelleted by centrifugation for 5 min. at 4000 rpm. The

supernatant was discarded and the pellet was resuspended in 25 mL of sterile water.

Cells were pelleted by centrifugation for 5 min. at 4000 rpm. The supernatant was

discarded and the pellet was resuspended in 1 mL of 100 mM lithium acetate. Cells were

pelleted by centrifugation for 2 min. at 7000 rpm. The supernatant was discarded and

the cells were resuspended in 450 μL of 100 mM lithium acetate [adapted from (121)].

Figure 5: Illustration of transposase source plasmid. Amp, ampicillin resistance gene; ARS H4, autonomous replication sequence of H4 gene; CEN6, centromere of yeast chromosome 6; cyc1 ter, termination of yeast cyclin gene cyc1; OriEC, E. coli replication origin; Pgal1, yeast gal1 promoter. Illustration adapted from Yang et al. (2009).
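The volume of overnight culture needed to start the 50 mL culture at an OD600 of 0.2 follows from a simple dilution calculation. The sketch below is an assumption-laden illustration rather than part of the published protocol: it assumes that OD600 scales linearly with cell density over this range, that the overnight culture was read at a 1:10 dilution, and that 50 mL is the final volume. The function name is hypothetical.

```python
def inoculum_volume_ml(diluted_od600, dilution_factor=10,
                       target_od600=0.2, final_volume_ml=50.0):
    """Volume of overnight culture giving the target OD600 in the final volume.

    Uses C1 * V1 = C2 * V2, where C1 is the OD600 of the undiluted
    overnight culture estimated from a diluted reading.
    """
    if diluted_od600 <= 0:
        raise ValueError("OD600 reading must be positive")
    culture_od600 = diluted_od600 * dilution_factor
    return target_od600 * final_volume_ml / culture_od600


# Assumed readings at the two ends of the 0.2-0.4 range given in the text.
for reading in (0.2, 0.4):
    print(reading, round(inoculum_volume_ml(reading), 2), "mL")
```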

The co-transformation of yeast cells with transposase and MITE plasmids was

executed as follows: 25 μL of yeast competent cells were mixed with 2.9 μL of carrier

DNA (salmon sperm, 5 mg/mL), 60 ng of transposase vector, 60 ng of pooled MITE

vectors, and 200 μL of PEG buffer (40% PEG, 100 mM LiAc, 10 mM Tris-pH 8.0, 1 mM

EDTA). Tubes were incubated at 42ºC for 45 min. The cells were pelleted by

centrifugation at 7000 rpm for 20 sec. and the supernatant was discarded. Cells were

re-suspended in 50 μL of sterile water and were plated on media lacking histidine and

uracil [adapted from (121)]. Plates were incubated at 30˚C until colonies formed and

then placed at room temperature. After approximately 2 weeks, the colonies were

either: (i) streaked on media lacking adenine or (ii) inoculated in 2 mL of media lacking

histidine and uracil at 24ºC or 30ºC for approximately a week and plated on media

lacking adenine. The plates with media lacking adenine were incubated at 30ºC and

inspected regularly for colony formation.

For a positive control, yeast cells were co-transformed with two plasmids

(abbreviated pOst35 and pOsm14Tp) which contain the MITE Ost35 and the

transposase Osm14 from rice. These elements were previously shown to undergo

transposition in the same yeast assay (60). For the negative control, yeast cells were

co-transformed with the pOst35 and an empty transposase source plasmid.

4 Results

4.1 Computational Analyses

4.1.1 Finding MITE Members Belonging to the hAT Superfamily of TEs

A total of 5,026 members were retrieved from the MAK Member function. A summary of

each hAT MITE family is described in Table 1. The hAT MITE family TF000576 has the

most MITE members, with a total of 526 retrieved; the hAT MITE family TF000720 has

the least, with only five complete members retrieved. TF000715 has 39 clades with

identical members, the most of any hAT MITE family. The highest number of identical

sequences in a single clade ranges from two to six for most hAT MITE families.

However, the hAT MITE family TF000708 has a single clade with 25 identical members.
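Counts of this kind can be derived from the retrieved member sequences with a simple tally. The sketch below is illustrative only. It assumes that "identical" means an exact match over the full sequence and treats each set of identical sequences as one group, whereas the values reported here were obtained from clades in the alignment-based analysis. The function and key names are hypothetical.

```python
from collections import Counter


def identical_member_summary(members):
    """Summarize how many members are exact duplicates of one another."""
    counts = Counter(seq.upper() for seq in members)
    groups = [n for n in counts.values() if n > 1]
    return {
        "members": len(members),
        "groups_with_identical_members": len(groups),
        "largest_identical_group": max(groups, default=0),
    }


# Assumed toy family of six members.
toy_family = ["ACGT", "ACGT", "ACGA", "TTTT", "TTTT", "TTTT"]
print(identical_member_summary(toy_family))
```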

Table 1: Summary of output retrieved from MAK’s Member function.

MITE Family   # Members   # Clades with Identical Members   Highest # of Identical Sequences in a Single Clade
TF000722      5           1                                 2
TF000576      526         23                                3
TF000700      253         19                                6
TF000703      197         12                                6
TF000706      243         3                                 2
TF000708      287         23                                25
TF000714      256         2                                 2
TF000715      275         39                                4
TF000717      230         6                                 3
TF000718      280         16                                3
TF000719      234         8                                 3
TF000720      295         7                                 2
TF000724      250         3                                 3
TF000725      272         7                                 2
TF001258      240         2                                 2
TF001274      175         1                                 2
TF001275      249         4                                 4
TF001302      248         16                                5
TF001310      187         1                                 2
TF001312      68          3                                 2
TF001332      256         3                                 6

4.1.2 Finding TEs Encoding Putative hAT Transposases

Collectively, the Anchor and TP_TE functions produced an output of approximately

5,000 DNA sequences of putative TEs encoding transposases (Figure 6A). After the

output from Anchor and TP_TE was

processed by removing redundant sequences and sequences that had high difference

values (as described in Methods: Section 3.3.1), approximately 400 putative

transposase sequences remained (Figure 6B). These

results were further narrowed down by isolating the sequences with ends matching best

to known hAT MITE ends (Figure 6, i & ii) and by isolating

representative sequences with identical copies (Figure 6, 1 & 2).

To select the best candidate TEs for encoding a hAT transposase, a BLASTX

search was performed against known hAT transposases (Figure 6C).

The sequences that encoded amino acid sequences similar to known hAT

transposases were isolated, resulting in a total of 56 putative transposase sequences

(Figure 6D). The TSDs on the 5’ and 3’ ends of a

sequence are typically identical. To refine the search, sequences whose predicted

5’ and 3’ TSDs were not identical were removed, resulting in a total of 23 sequences

(Figure 6E), referred to hereafter as hATTPases. A summary of

the 23 hATTPases is shown in Table 2. Based on the

manual annotation of the hATTPases and their similarity to hAT MITE ends, 14

hATTPase sequences were selected as candidates for experimental analyses

(Figure 6F).

Figure 6: A schematic representation of how the best candidate hAT transposase sequences were selected. The flowchart starts from the Anchor and TP_TE output (~5000 sequences); redundant sequences and sequences with high flanking sequence difference values are removed (~400 sequences); putative transposases whose ends have the highest similarity to hAT MITE ends and candidates from clades with highly conserved sequences are isolated; all sequences are compared by BLASTX against known hAT transposases (56 sequences) and annotated; sequences whose TSDs are not 100% identical are removed (23 sequences, hATTPases); and the sequences with the best annotations and highest similarity to MITE ends are chosen for experimental analyses (14 sequences).

Table 2: A summary of the 23 hATTPases. Their accessions and positions in the A. aegypti genome are shown, along with their sizes in bps, TSD sequences, and TIR sequences.

Name Accession Position Size (bps) TSD Sequence TIR Sequence

hATTPase1 AAGE02006152 156325-159177 2853 GTGCCAAA TAGAGATGGGCAA

hATTPase2 AAGE02007824 63758-61110 2649 AGCTATTC CATAGGTTCCCAAACT

hATTPase3 AAGE02008188 229621-235662 8640 TGTAGATC CAGCGGTTCTCAACC

hATTPase4 AAGE02024413 240837-243635 2799 ATCTATGG CATAGATTCCCAAACT

hATTPase5 AAGE02020453 22912-17960 4953 CTGACACC CAGGGTTGCCACAT

hATTPase6 AAGE02025054 115298-110670 4629 TCTTGCAT CAGTGTTGCCACA

hATTPase7 AAGE02025133 41412-46897 5486 CCAGCGAC CAGGCATGGGAAAAATCA

hATTPase8 AAGE02027653 3714-1092 2623 TTTCCTTT TAGAGTTTTCAAT

hATTPase9 AAGE02008109 14028-20984 6957 GTGTCCAG TAGGGTGCCAATG

hATTPase10 AAGE02001552 30277-27718 3141 AAAACTGA TAGAGTGTCCA

hATTPase11 AAGE02015151 3178-5661 2484 AAAGATGA CAGAGGCGTCGCGT

hATTPase12 AAGE02025302 26720-29860 3141 GCCAGAGG CAGTGTTGCCACA

hATTPase13 AAGE02004541 52790-55376 2487 CTTTAGGG CAGTGTTTCCCAAA

hATTPase14 AAGE02005240 26023-29480 3558 AACAAGAA CAGGGTTGTTAACGTTAATCAACG

hATTPase15 AAGE02012529 17071-13588 3484 GTTTGTTC CAGAGGCGTCGCGT

hATTPase16 AAGE02009227 30394-27825 2570 GATTAGAC CAGTGTTTCCC

hATTPase17 AAGE02019027 5371-8562 3192 TGTAGAAC GCAGTGGTGCT

hATTPase18 AAGE02015586 48794-46395 2400 GTTTACAT CAGTGGTGCTCA

hATTPase19 AAGE02005137 18241-20625 2385 CTCTACCA CAGTGGTGCTCA

hATTPase20 AAGE02020255B 82631-85714 3084 TACCATTC CAGAGGCGTCGCGT

hATTPase21 AAGE02013260 122701-128987 6277 TTGTTTAT TAGGGTGCCAATGAAA

hATTPase22 AAGE02022847 30976-33727 2752 GTGTGCGC TAGGCCGTCCCTTATTTTTCCAAATTT

hATTPase23 AAGE02003553 22957-20517 2441 AAATAAAT CAGTGGTGCTCA

4.1.3 Analysis of hATTPases and their copies in the A. aegypti genome

To better understand the relationships between the hATTPases, their DNA sequence

copies were aligned and a neighbor-joining tree was generated (Figure 7). hATTPase10

has the most copies in the A. aegypti genome of an element that encodes a full or

partial hAT transposase, with a total of five. Some hATTPases are present in a single

copy in the A. aegypti genome, such as hATTPase5, hATTPase3, and hATTPase22.

Furthermore, a maximum-likelihood tree was generated using the amino acid

sequences of every annotated hATTPase (Figure 8). Interestingly, four distinct clades

are evident in the tree. Clades I and IV are highly supported, with node values of 95 and

96, respectively; while clades II and III have weaker support with node values of 73 and

69, respectively.

Figure 7: A neighbor-joining tree of the DNA sequences of hATTPases and their copies.

Figure 8: A maximum likelihood phylogenetic tree of the 23 hATTPase transposase amino acid sequences (50% majority rule consensus). Numbers next to the nodes show quartet puzzling reliability based on 10,000 puzzling steps, a measure of nodal support similar to bootstrapping that is produced by TREE-PUZZLE.

4.1.4 The Buster and Ac families of the hAT Superfamily

To examine which hATTPases belong to the Buster and Ac families of hAT TEs, a

maximum-likelihood tree was generated using all annotated hATTPase amino acid

sequences and all available TE protein sequences described in Arensburger et al.

(2011) (Figure 9). The amino acid alignment is shown in Supplementary Figure 1.

According to Arensburger et al. (2011), the hAT superfamily is divided into two families:

Buster and Ac; Tip TEs could not be placed in either family, nor into a third family, due

to the small sample size used in the study (101).

Based on Figure 9, hATTPase1, hATTPase7, hATTPase8, hATTPase10, and

hATTPase14 belong to the Ac family, while hATTPase2-4, hATTPase13, hATTPase16-

19, and hATTPase23 belong to the Buster family. The Tip proteins are clustered into a

separate clade with hATTPase5, hATTPase6, hATTPase11, hATTPase12,

hATTPase15, and hATTPase20, potentially indicating that these transposase

sequences represent a third, separate family in the hAT superfamily of TEs. Lastly,

hATTPase9, hATTPase21, and hATTPase22 cluster into a fourth, highly-supported

clade with no other known hAT transposase sequences.

Furthermore, it is clear that some hATTPases are distinctly different from, albeit

related to, any known hAT TE in the A. aegypti genome. For example, although

hATTPase5, hATTPase6 and hATTPase12 cluster with the Tip TEs, none show strong

sequence similarity to the Tip transposase in A. aegypti (with only 12%, 10%, and 10%

sequence identity to AeTip2, respectively). Furthermore, hATTPase9, hATTPase21, and

hATTPase22 are not clustered within the Ac or Buster family, nor are they clustered

with the Tip TEs (Figure 9).

The Buster and Ac families of hAT TEs have TSD consensus sequences (101).

To determine whether the TSDs of the hATTPases and their copies share the same

consensus sequences as their respective families, sequence frequency logos were

generated (119). The hATTPases were separated into Buster and Ac families based on

the clustering shown in Figure 9. As seen in the sequence frequency logos, the majority

of the TSD sequences belonging to the Buster family have a "T" at position 4 and "A" at

position 5, as expected. Furthermore, the majority of the TSD sequences belonging to

the Ac family have a "T" at position 2. However, although the majority of TSDs also

have an "A" at position 7, "G" also occurs frequently at that position (Figure 10).
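The per-position tally that underlies a sequence frequency logo can be sketched as follows. This is only an illustration, not the logo software cited above: the function name is hypothetical, and the input is limited to the 8 bp sequences listed in the table above for hATTPase16, 17, 18, 19 and 23 (assumed here to be their TSDs), without any of their copies.

```python
from collections import Counter

# 8 bp sequences listed for hATTPase16, 17, 18, 19 and 23 in the table above,
# assumed here to be their TSDs. Copies of these elements are not included.
BUSTER_TSDS = ["GATTAGAC", "TGTAGAAC", "GTTTACAT", "CTCTACCA", "AAATAAAT"]


def position_frequencies(tsds):
    """Return one dict per TSD position mapping base -> relative frequency."""
    length = len(tsds[0])
    if any(len(tsd) != length for tsd in tsds):
        raise ValueError("all TSDs must have the same length")
    freqs = []
    for column in zip(*tsds):  # iterate over positions, not sequences
        counts = Counter(column)
        freqs.append({base: n / len(tsds) for base, n in counts.items()})
    return freqs


freqs = position_frequencies(BUSTER_TSDS)
for pos, column in enumerate(freqs, start=1):
    top_base = max(column, key=column.get)
    print(pos, top_base, round(column[top_base], 2))
```

In this small subset, four of the five sequences carry "T" at position 4 and "A" at position 5, in line with the Buster pattern described above.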

Figure 9: A maximum likelihood phylogenetic tree of amino acid transposase sequences from Arensburger et al. (2011) and amino acid sequences of annotated hATTPases (50% majority rule consensus). Numbers next to most nodes show quartet puzzling reliability based on 10,000 puzzling steps, a measure of nodal support similar to bootstrapping produced by TREE-PUZZLE.

Figure 10: Sequence frequency logos of the TSD sequences for hATTPases and their copies belonging to the Buster and Ac families.

4.1.5 Conserved Domains in Known and Putative Transposase Sequences in the A. aegypti genome

There are currently 14 known hAT TEs in the A. aegypti genome that encode a

hAT transposase sequence, 10 of which encode intact transposase proteins

(http://tefam.biochem.vt.edu). The 10 intact sequences were analyzed for conserved

domains, to see which domains are common across known hAT transposase

sequences in the A. aegypti genome (Figure 11). Only 6 of the intact transposases have

hAT family dimerization domains and 4 have zinc finger domains. Two of the intact

transposases, AeHerves2 and AeHerves3, have a transposase domain of unknown

function called the DUF659 domain. This domain is also found in the harrow TE in

Drosophila (122).


The same conserved domain search was performed for all 23 annotated

hATTPase sequences. hATTPase1 has three domains: zinc finger, hAT family

dimerization and DUF659. hATTPase4, hATTPase7, hATTPase11, hATTPase12,

hATTPase18, and hATTPase23 have the hAT family dimerization domain while

hATTPase8, hATTPase10, hATTPase15 and hATTPase20 have the zinc finger domain.

The rest of the hATTPases (hATTPase2, hATTPase3, hATTPase5, hATTPase6,

hATTPase9, hATTPase13, hATTPase14, hATTPase16, hATTPase17, hATTPase19,

and hATTPase21) do not have any apparent conserved domains.

Figure 11: A schematic representation of known intact hAT transposase sequences in A. aegypti (from TEfam) and annotated hATTPases that have conserved sequence domains. Grey lines, transposase sequence; blue, hAT family dimerization domain; red, zinc finger domain; green, DUF659 domain of unknown function.

4.1.6 Linking MITEs to Putative hAT Transposases

The 23 hATTPases were manually annotated, and the DNA sequences with the fewest

mutations in the coding regions and/or those with terminal sequences similar

to those of hAT MITEs were chosen for experimental analyses. These include:

hATTPase1, hATTPase2, hATTPase4, hATTPase5, hATTPase7, hATTPase8,

hATTPase10, hATTPase13, hATTPase16, hATTPase18, hATTPase19, and

hATTPase23. Figure 12 illustrates which of the hATTPases chosen for experimental analyses

have terminal DNA sequences that best match the hAT MITE families; Figure 13

shows the alignments of the end sequences.

There are some MITE families that have multiple identical copies in the A.

aegypti genome. For example, the MITE family TF000708 has 25 identical copies

(Table 1); however, no element encoding a hATTPase has end sequences similar to those of

the TF000708 MITE family. In contrast to rice, where almost every

Stowaway MITE family has TIR sequence similarity to the autonomous Osmar TEs,

A. aegypti has 14 MITE families whose ends are not similar to those of any TE

encoding a hATTPase or of any autonomous hAT TE in the genome (123).

Figure 12: Diagram showing which hATTPase DNA sequences have ends that are similar in sequence to the ends of each MITE family. The diagram lists hATTPase1, 2, 4, 5, 7, 8, 10, 13, 15, 16, 18, 19, 20, and 23 alongside the hAT MITE families TF000576, TF000700, TF000703, TF000706, TF000708, TF000714, TF000715, TF000717, TF000718, TF000719, TF000720, TF000722, TF000724, TF000725, TF001258, TF001274, TF001275, TF001302, TF001310, TF001312, and TF001332. Red lines, match MITE family TF000722; blue line, match MITE family TF000576; green lines, match MITE family TF000718; yellow lines, match MITE family TF000706; purple lines, match MITE family TF001275; grey lines, match MITE family TF000715.

Figure 13: Alignment of the end sequences of hAT MITE families that match best with the end sequences of the hATTPase DNA sequences.

4.1.7 Finding TEs using a Top-Down Approach

To find nonautonomous TEs that have not yet been identified and that are potentially

cross-mobilized by the hATTPases, the Topdown function of MAK was run using all

hATTPases DNA sequences as query sequences. Extensive sequence similarity

analysis revealed the existence of TEs that have not been recognized or identified in the

A. aegypti genome. Supplementary Table 4 shows the consensus sequence for each

new TE family. A total of three new TE families were found, all of which generate 8 bp

TSDs.

The new TE families were named according to the hATTPase-coding elements

that were used as the query sequence to find them. An alignment of the end sequences

of each new TE family with their respective hATTPase-coding elements shows that the

hATTPase-coding elements have end sequences highly similar to those of the new TE families

(Figure 14). The hATTE1 family has the fewest members, with only 11. The hATTE2

family has 121 total members and is separated into 10 subfamilies. Translated

sequence searches revealed that all 10 subfamilies have amino acid sequence

similarities to the autonomous AeBuster1 TE. Specifically, all subfamilies of hATTE2

show amino acid sequence similarity to the end regions of the AeBuster1 transposase,

with E-values lower than 2e-11. The hATTE2G subfamily showed the most sequence

similarity with an overall query coverage of 34%. Lastly, the hATTE8 family is separated

into three subfamilies and has a total of 62 members.

4.2 Experimental Analyses

4.2.1 Cloning MITEs

A total of 41 MITE primer sets were designed for 21 hAT MITE families. Cloning was

attempted for members belonging to each MITE family; however, due to ligation

reaction difficulties, only certain MITEs were successfully cloned. Table 3 summarizes

how many individual MITE sequences were successfully cloned into the MITE plasmid.

Figure 14: Alignment of the 5’ and 3’ ends of the three TE families found using TopDown and the hATTPase-coding elements used to find them.

Table 3: The number of individual hAT MITE sequences that were cloned into the donor plasmid for each hAT MITE family.

MITE Family   # MITEs Cloned
TF00072       2
TF000576      6
TF000700      1
TF000703      8
TF000706      2
TF000708      2
TF000714      1
TF000715      1
TF000717      2
TF000718      1
TF000719      0
TF000720      1
TF000724      0
TF000725      0
TF001258      0
TF001274      0
TF001275      0
TF001302      1
TF001310      0
TF001312      0
TF001332      1

4.2.2 Candidate hAT Transposase Analysis and Cloning

Only one transposase coding sequence was successfully cloned and repaired:

hATTPase16. The hATTPase2 transposase was successfully cloned but could not

be repaired and joined. To repair the hATTPase16 transposase, two substitutions and

two insertions were corrected using PCR. Furthermore, the native transposase-coding

sequence had a single intron, flanked by the splice sites "GT" and "AG". The 2572 bp

long transposon has perfect 12 bp TIRs, composed of "CCAGTGTTTCCC". The TE

encodes a transposase that is 595 amino acids long and belongs to the Buster family

of TEs.

4.2.3 Yeast Excision Assays with the Putative hAT Transposase hATTPase16

All cloned hAT MITEs were pooled at equimolar concentrations for yeast excision

assays with the candidate transposase, hATTPase16. The plasmid used for cloning

MITEs contains a Ura3 gene, while the plasmid used for cloning transposases contains

a His3 gene. This enables yeast cells that contain both plasmids to grow on media

lacking histidine and uracil. An example of how plates containing media lacking histidine

and uracil appear after yeast cells are plated is shown in Figure 15. As expected, the

positive, negative and experimental conditions resulted in colony formation on media

lacking histidine and uracil. All transformation reactions that resulted in colony formation

on media lacking histidine and uracil for the positive, negative, and experimental

conditions (as seen in Figure 15) were streaked on media lacking adenine. When

colonies are streaked on media lacking adenine, only cells that have an intact ade2

gene (referred to as ade2 revertants) can grow.

The colonies that were incubated at 30ºC on media lacking histidine and uracil

were streaked on media lacking adenine. As expected, each yeast colony streaked from

the positive control produced ade2 revertants and no colonies grew on the negative

control. Furthermore, no colonies carrying the hATTPase16 and a MITE yielded any

ade2 revertants (Figure 16).

When colonies that were incubated in liquid media lacking histidine at 25ºC and

at 30ºC were subsequently plated on media lacking adenine (Figure 17 & Figure 18,

respectively), each colony streaked from the positive control produced ade2 revertants

and no colonies grew on the negative control. No colonies carrying the hATTPase16

and a MITE yielded any ade2 revertants.

Figure 15: Example of yeast colonies growing on media lacking histidine and uracil. All transformation reactions that resulted in colony formation for all three conditions, as shown above, were plated on media lacking adenine.

Figure 16: Yeast on media lacking adenine. Plates were streaked with colonies incubated at 30ºC on media lacking histidine and uracil. Sections on plates are representative of a single streaked colony. Red arrow, colony.

Figure 17: Yeast on media lacking adenine. Plates were spread with yeast cells from colonies incubated at 25ºC in liquid media lacking histidine and uracil. Red arrow, colony

Figure 18: Yeast on media lacking adenine. Plates were spread with yeast cells from colonies incubated at 30ºC in liquid media lacking histidine and uracil. Red arrow, colony

5 Discussion

Using bioinformatic approaches, TEs encoding hAT transposases can be predicted.

Furthermore, TEs that were recently active can be predicted based on the presence of

multiple conserved copies of that TE in the genome. Since transposases recognize and

bind to the terminal regions of TEs during transposition (111), a prediction of which

transposase protein is responsible for the transposition of which TE(s) can be made

based on the sequence similarity of the terminal regions between two elements.
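As a rough illustration of this kind of end-sequence comparison, the sketch below scores two terminal sequences by ungapped identity over their shared length. It is an assumption-laden simplification, not the comparison procedure used in this study: the function name is hypothetical, no gaps or reverse complements are considered, and the example sequences are the terminal sequences listed in the hATTPase table (assumed here to be TIRs).

```python
# Terminal sequences as listed in the hATTPase table (assumed to be TIRs).
TERMINI = {
    "hATTPase16": "CAGTGTTTCCC",
    "hATTPase18": "CAGTGGTGCTCA",
    "hATTPase19": "CAGTGGTGCTCA",
}


def terminal_identity(end_a, end_b):
    """Fraction of matching bases over the shared length of two termini (ungapped)."""
    shared = min(len(end_a), len(end_b))
    if shared == 0:
        raise ValueError("terminal sequences must not be empty")
    # zip() stops at the shorter sequence, so only the shared length is compared
    matches = sum(1 for a, b in zip(end_a.upper(), end_b.upper()) if a == b)
    return matches / shared


for name in ("hATTPase19", "hATTPase16"):
    score = terminal_identity(TERMINI["hATTPase18"], TERMINI[name])
    print("hATTPase18 vs", name, round(score, 2))
```

A score of this kind only ranks candidate pairings; whether a transposase actually binds a given end still has to be tested experimentally.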

In this study, a total of 23 TEs were selected as candidates to encode full or

partial transposases belonging to the hAT superfamily. Extensive computational

analysis of these transposases was performed and the copies of each candidate hAT

transposase were retrieved. It was discovered that the candidate hATTPase10 has the

highest copy number of any known hAT transposase in the A. aegypti genome (39).

Amino acid sequence analysis grouped the candidate transposases into four

distinct clades. To determine if any candidates belonged to the Ac and Buster families,

phylogenetic analysis was performed using the amino acid sequences of the annotated

hATTPases as well as hAT transposase sequences from multiple organisms [described

in (101)]. Similarly, four distinct clades were formed. As expected, the Buster and Ac

families formed two separate clades. However, the Tip TEs formed a separate third

clade with high support. This was also seen in the analyses done by Arensburger et al.

(2011) where the three Tip TEs used in their study remained separate from the Ac and

Buster families. In their study, it was concluded that there was an insufficient sample

size for the placement of the Tip TEs into either of the two families or into a separate

third family (101).

In this study, six hATTPases were also placed in the same clade as the Tip TEs.

Although the sample size is still small, these results indicate that the Tip TEs may form

a third family in the hAT superfamily of TEs. Another clade, formed by three

hATTPases, is separate from the Buster and Ac families, as well as the Tip TEs. These

three hATTPases cannot be placed into either the Ac or Buster family, into the Tip

clade, or into a separate clade due to the small sample size. Furthermore, a total of six

hATTPases do not have any strong sequence similarity to any known hAT transposase in

A. aegypti, making these sequences newly described as encoding full or partial

transposase proteins (hATTPase5, hATTPase6, hATTPase12, hATTPase9,

hATTPase21, and hATTPase22).

Conserved domain analysis was performed to detect the presence of any of the

four conserved domains in hAT TEs (88, 98–100). The analysis revealed that not all

known intact hAT transposases in the A. aegypti genome have strong conservation of

transposase domains. This was also true for the candidate transposases selected in this

study. Although the hAT dimerization domain is often found in hAT transposases, it is

thought that multiple regions of a transposase can play a role in oligomerization (100).

Therefore, a highly conserved hAT dimerization domain may not be necessary for

transposition. In fact, the only active class II TE in A. aegypti to date, AeBuster1, does not

show any strong sequence conservation for any transposase domain (101).

Using a top-down approach, three TE families were identified and described. The

members of these families all generate 8 bp TSDs upon insertion, likely making them

members of the hAT superfamily of TEs. Furthermore, all the hATTE families are

nonautonomous, as they do not encode any full length transposase. As expected, all

three families have strong end-sequence similarity with the hATTPase-coding elements

used to find them. Due to this strong end-sequence similarity, these hATTPases are

good candidates for the cross-mobilization of these TE families. Due to time constraints,

the activity of these MITE families has not yet been tested experimentally. Translated

sequence analysis revealed that the hATTE2 family shares amino acid sequence

similarity to the AeBuster1 transposase. Therefore, the hATTE2 family members are

likely deletion derivatives of the autonomous AeBuster1 TE. Because the

hATTE1 and hATTE8 families have no amino acid sequence similarity to any known

transposase in A. aegypti, it is likely that these two families are MITEs that have never

before been described.

In this study, a total of 14 hAT TEs were selected as candidates to encode a hAT

transposase that can cross-mobilize hAT MITEs. These predictions were based on

sequence similarity to known hAT transposases and end-sequence similarity to hAT

MITEs. Because 14 MITE families, some of which have multiple identical

copies, had no end-sequence similarity to any hATTPase, it is likely that there are hAT

TEs in the A. aegypti genome that are responsible for the cross-mobilization of these

MITEs and have not yet been identified. Only one candidate transposase, hATTPase16,

was successfully cloned and repaired. The candidate hATTPase2 was successfully

cloned, but after repairing the sequence, was unsuccessfully ligated back into the

vector. Although the coding regions of every other candidate transposase were

successfully amplified, the ligation reaction proved most difficult. Despite multiple

attempts and altering ligation conditions (e.g. temperature, reaction time, insert-vector

concentrations), no other transposase was cloned.

To test the activity of hATTPase16, it was co-transformed into yeast cells

containing hAT MITEs, and monitored for ade2 revertants at different conditions.

However, no ade2 revertants were observed in the experimental conditions. The most

likely reason for not observing a transposition event is that the hATTPase16

transposase enzyme does not recognize and bind to any of the hAT MITE ends. In this

study, a lenient approach was used to compare the similarity of MITE family and

hATTPase end sequences, in order to increase the probability of finding a hATTPase

that cross-mobilizes a MITE. Since transposase enzymes bind to terminal regions of

TEs as the first critical step in the transposition process, if the hATTPase16 transposase

cannot bind to the hAT MITEs, transposition will not occur. Furthermore, although the

terminal regions of the hATTPase16 TE show some sequence similarity to four hAT

MITE families, it has been shown that single nucleotide substitutions in TIRs can

severely decrease transformation efficiency (47). Therefore, it is likely that the terminal

regions of the MITE families were too dissimilar for the transposase to recognize and

bind.

Further studies are required to fully comprehend the activity and consequences

of hAT MITE transposition. For example, binding studies, such as electrophoretic

mobility shift assays (EMSAs), using purified transposase enzymes and terminal region

DNA fragments can be performed to test the ability of each transposase to bind to MITE

sequences in vitro. DNA deletion analyses can be performed on hAT MITEs shown to

be cross-mobilized by transposases in order to test which TIR regions are necessary for

transposase binding and transposition. This will allow researchers to make better

predictions of which hAT MITEs a transposase is capable of mobilizing.

Understanding TE activity and amplification is essential to understanding how A.

aegypti has evolved and diversified. Being a major vector of yellow fever, dengue fever,

and chikungunya fever, there is a great need to understand the driving-forces of A.

aegypti genome evolution (103–105). In doing so, there will be a better understanding of

the biological properties of A. aegypti and the potential to develop the tools necessary to

produce pathogen-resistant strains (124).

Chapter 3 TE Displayer for Post Genomic Analysis of TEs

6 Introduction to Transposon Display

Transposon Display (TD) is a commonly used experimental technique to study TEs. TD

is frequently used to study TE insertion polymorphisms within a genome or between

genomes and is a technique derived from Amplified Fragment Length Polymorphism

(AFLP) (125, 126). Using polymerase chain reaction (PCR)-based methods, TD allows

for the separation and visualization of specific TEs in a genome. The process starts with

the extraction and digestion of genomic DNA with a restriction enzyme to generate DNA

fragments of different sizes. Adapters are then ligated to the ends of the digested

genomic DNA fragments (Figure 19A-B). A pre-amplification

PCR is performed using a primer that is complementary to the adapter sequence and

another primer that is complementary to the TE sequence (Figure 19C).

A second, selective PCR is performed using the pre-amplification

products as templates with a nested primer set (Figure 19D).

Selective amplification PCR products are analyzed by polyacrylamide gel

or capillary electrophoresis. DNA fragments can be extracted and sequenced if desired.

When the TE family being analyzed has a high copy number in a genome, one or more

selective nucleotide(s) can be added to the 3’ end of the adapter primer used in the

selective amplification reaction to reduce the number of bands per lane (125). The

resulting products consist of DNA fragments containing part of the TE and a flanking

genomic region outside of the TE. These fragments are then resolved on a

polyacrylamide gel, where each band indicates a transposable element at a specific

location in the genome (Figure 19E). The copy number of

the TE family in a genome can be determined and an active TE can be revealed

through the detection of an insertion event within a genome.

Although TD is often used to study TEs experimentally, discovering and

analyzing TEs computationally has become common in TE research, and is made

possible by the abundance of genome sequencing efforts being performed on a variety

of different organisms. Furthermore, most TEs have recognizable structural signatures,

making their identification and annotation possible. In light of this, multiple computer

programs have been developed to find and analyze TEs in different genomes.

A novel bioinformatics tool has been developed that transforms TD into a

computational program. The program, called TE Displayer, was generated using

Practical Extraction and Report Language (PERL) with a Graphical User Interface and

runs on the Windows and Linux operating systems. Using TE Displayer, a user can

choose genome databases and define parameters including an adapter oligo length

(bp), a restriction enzyme recognition site sequence for genomic DNA digestion,

selective base(s), the sequence of the pre-amplification TE primer, and the sequence of

the selective amplification TE primer. In addition, a user can specify the allowed number

of mismatches for the pre-amplification PCR primer to anneal to its targets and choose

a DNA size ladder and color(s) for the virtual gel image. The output of TE Displayer

includes a detailed description of each fragment in text format and a graphical

representation of the fragments on a virtual gel image (Figure 20). TE Displayer was

tested using TEs in the Aedes aegypti, Drosophila melanogaster, Caenorhabditis

elegans, Arabidopsis thaliana, and Oryza sativa genomes and all of the output from

these analyses is consistent with the analysis through manual inspection.

Figure 19: A schematic representation of Transposon Display. (A) Genomic DNA is extracted; (B) DNA is digested with MseI and adapters are ligated to the ends; (C) Pre-amplification PCR is performed; (D) Selective PCR is performed; (E) Products are run on a polyacrylamide gel. Blue boxes-adaptors; grey arrows-pre-amplification primers; black arrows-selective amplification primers.

7 Methods

7.1 Algorithm

The algorithm was implemented in PERL. The BLAST search is performed with the

standalone program package 2.2.22 with an E-value cutoff of 10,000. The graphical interface

is implemented with Perl/Tk modules. The GD-2.43 module was used for the generation

of the virtual gel images. BioPerl modules such as Bio::Tools::Run::StandAloneBlast

and Bio::SearchIO are used to perform BLAST searches and parse the output. TE

Displayer has been tested on Linux and Windows (XP, Vista, Windows 7) standalone

systems and the SciNet high performance system (University of Toronto).

Figure 20: Screen-shot of the bioinformatics program, TE Displayer.

7.2 Implementation

The parameters required to perform TE Displayer include: a restriction enzyme site, a

pre-amplification primer sequence, a selective amplification primer sequence, and an

adaptor size. Parameters that are not required, but can be used, if desired, include a

selective base (A, T, C, or G) and different nucleotide mismatch values (up to a total of

five mismatches) between the pre-amplification primer and the genomic sequence.

When TE Displayer is implemented, a BLAST search is performed using the pre-

amplification primer as the query sequence and the genomic sequence as the subject

database (Figure 21, i). A 5 kb sequence flanking each pre-amplification primer match

is retrieved and searched for the restriction enzyme site (as specified by the user)

that is nearest to the pre-amplification primer (Figure 21, ii & iii).

Following this, the region between the pre-amplification primer and the closest

restriction enzyme site is scanned for the selective-amplification primer sequence

(Figure 21, iv). If the selective-amplification primer sequence is found (in the correct

orientation) in this region, the size of the conceptual amplicon is calculated as the size

of the selective-amplification primer, the size of the adaptor, and the region between

them (Figure 21, v). Every location that contains the target TE sequence is processed in

this manner and a conceptual amplicon is produced.
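The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TE Displayer source (which is written in PERL): exact string matching stands in for the BLAST search, only the forward strand is scanned, selective bases and mismatches are ignored, and the region is measured up to the start of the recognition site. The function name and the toy sequences are hypothetical.

```python
def virtual_amplicons(genome, pre_primer, sel_primer, site, adapter_len, flank=5000):
    """Return (primer_position, amplicon_size) for each conceptual amplicon.

    Assumptions of this sketch: exact forward-strand matches only, and the
    flanking window starts immediately after the pre-amplification primer.
    """
    amplicons = []
    start = genome.find(pre_primer)
    while start != -1:
        window_start = start + len(pre_primer)
        window = genome[window_start:window_start + flank]   # flanking sequence (5 kb by default)
        site_pos = window.find(site)                         # nearest restriction site
        if site_pos != -1:
            region = window[:site_pos]                       # primer-to-site region
            sel_pos = region.find(sel_primer)                # nested selective primer
            if sel_pos != -1:
                between = site_pos - (sel_pos + len(sel_primer))
                size = len(sel_primer) + between + adapter_len
                amplicons.append((start, size))
        start = genome.find(pre_primer, start + 1)
    return amplicons


# Toy sequence (hypothetical): pre-amplification primer, spacer, selective primer,
# 5 bp of flanking sequence, then an MseI site (TTAA).
TOY_GENOME = "GG" + "ACGTAC" + "CC" + "GATTC" + "GGGCC" + "TTAA" + "GG"
print(virtual_amplicons(TOY_GENOME, "ACGTAC", "GATTC", "TTAA", adapter_len=10))
```

For the toy sequence, the amplicon size is the 5 bp selective primer plus the 5 bp between it and the restriction site plus the 10 bp adapter.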

Figure 21: Diagram of TE Displayer algorithm (see Methods: Implementation). Red arrowhead, pre-amplification primer; Black arrowhead, selective-amplification primer. Adapted from Rooke & Yang (2010).

7.3 Output

The output of TE Displayer includes a text output that contains the genomic location,

amplicon size, selective base, and number of mismatches for each amplicon.

Furthermore, the amplicons are displayed as "bands" in a lane on a virtual gel image.

The migration of the virtual bands is calculated using the formula D1/D2 = S2/S1, where

D is distance and S is size. All output from TE Displayer is consistent with the output

from manual inspection.
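The migration formula can be read as placing each band relative to a reference band of known size, with distance inversely proportional to size. The sketch below assumes that reading; how TE Displayer chooses its reference band and distance units is not stated here, so the function name, the reference values, and the use of pixels are hypothetical.

```python
def migration_distance(size_bp, ref_size_bp, ref_distance):
    """Solve D1/D2 = S2/S1 for D1, given a reference band (S2, D2)."""
    if size_bp <= 0 or ref_size_bp <= 0:
        raise ValueError("fragment sizes must be positive")
    return ref_distance * ref_size_bp / size_bp


# Hypothetical reference: a 100 bp ladder band drawn 400 pixels from the well.
for size in (100, 200, 400):
    print(size, migration_distance(size, ref_size_bp=100, ref_distance=400.0))
```

Under this relation a fragment twice the size of the reference travels half as far, which preserves the size order of bands on the virtual gel.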

7.4 Parameters Used for Testing TE Displayer

The primer sequences used to find hAT TEs in different genomes are outlined in Table

4. Primers were generated from consensus sequences corresponding to each element

ID found on TEfam (http://tefam.biochem.vt.edu/tefam/) or Repbase

(http://www.girinst.org/repbase/index.html). The adapter size was 10 bp, the primer

mismatch value was 5, and the restriction enzyme used was MseI with a recognition site

of TTAA.

Table 4: hAT primer sequences and genomes used to generate output for hAT elements.

Element ID   Pre-Amplification Primer Sequence   Selective Amplification Primer Sequence   Genome
SIMPLEHAT2   CCCTAAACTCATTTGATTAT                GTTGAGTTGGGTTACCCATT                      Arabidopsis thaliana
HATN1_CE     ATTTGGATCGCGGCGTGAG                 GAGCGGCGTTTGAGCGACGC                      Caenorhabditis elegans
CRATA        TGGTGGAGTAACCTCCGACG                TCCCCGTTGCCATCTCTA                        Oryza sativa var. japonica
TF000700     TTGTATGGTTGTTTACATTTT               GCAATAAAGAGCCGCCAGTT                      Aedes aegypti

The primer sequences for mPing in different varieties of O. sativa were the same

as those described in Jiang et al. (2003). The length of the adapter was 19 bp, the

primer mismatch value was 0, and the restriction enzyme used was MseI.

For the TF000720 element in A. aegypti, the pre-amplification primer sequence

was 5’ GGCAAGCTGAAGTTATCTTG 3’ and the selective amplification primer

sequence was 5’ TTTCGTGTGTAGTATCT 3’. The adapter length was 21 bp and the

restriction enzyme used was MseI.

7.5 Genomic Database Sources

The genomic sequence of Aedes aegypti was downloaded from Vectorbase

(www.vectorbase.org), Arabidopsis thaliana from TAIR (www.arabidopsis.org), and

Caenorhabditis elegans from Wormbase (www.wormbase.org). The genome sequences

for O. sativa var. indica and var. japonica were obtained from Genomics

(ftp://ftp.genomics.org.cn/; release: 23/04/2008).

8 Results

TE Displayer was implemented using various genomes, including A. thaliana, C.

elegans, Oryza sativa, A. aegypti, and D. melanogaster. To compare different TE

profiles in these five genomes, primer sequences were developed for different hAT TEs

found in each organism. As expected, the banding pattern for each hAT family

differs between organisms, reflecting the different sizes and copy numbers of the TEs

in each genome (Figure 22A). The number and sizes of the bands

were consistent with manual inspection of the genomic sequences.

Since TE Displayer can be used to look at the same TE family in different

genomes, primer sequences were designed for the MITE element, mPing. TE Displayer

was implemented using mPing primer sequences and two different rice genome

sequences: O. sativa var. japonica and O. sativa var. indica. As shown on the virtual gel

image, O. sativa var. japonica has 33 virtual gel bands and O. sativa var. indica has 9

(Figure 22B). This is consistent with previous experimental data (25) showing

significantly more mPing elements in the japonica variety than in the indica variety.

Figure 22: TE Displayer virtual gels. (A) hAT families in different species. Lane 1: A. thaliana; lane 2: C. elegans; lane 3: rice; lane 4: A. aegypti; lane 5: D. melanogaster. (B) mPing elements in rice. Lane 1: O. sativa var. indica; lane 2: O. sativa var. japonica. (C) TF000720 family in A. aegypti with different allowed primer mismatches. Lane 1: no mismatches; lane 2: 1 mismatch; lane 3: 2 mismatches. (D) TF000700 family in A. aegypti with different selective bases. Lane 1: no selective base; lane 2: A; lane 3: C; lane 4: T; lane 5: G. Adapted from Rooke & Yang (2010).

To illustrate TE Displayer's ability to reduce the specificity of the pre-amplification primer by allowing mismatched nucleotide(s), a virtual gel image was generated for the TF000720 TE family using the A. aegypti genome as the database. When no mismatches were permitted, only four bands appeared on the virtual gel image. With one and two mismatches permitted, 33 and 37 bands appeared, respectively (Figure 22C). As expected, the more mismatches allowed, the less specific the pre-amplification primer search becomes and the more bands appear on the virtual gel.
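The effect of the mismatch setting can be illustrated with a minimal Python sketch. This is not TE Displayer's own search routine, which is not reproduced here. The function names `hamming` and `find_primer_sites`, the toy sequence, and the restriction to substitutions on the forward strand are all assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_primer_sites(genome: str, primer: str, max_mismatches: int = 0) -> list:
    """Return 0-based start positions where the primer matches the genome
    with at most `max_mismatches` substitutions (forward strand only)."""
    k = len(primer)
    return [
        i
        for i in range(len(genome) - k + 1)
        if hamming(genome[i:i + k], primer) <= max_mismatches
    ]


# Toy sequence and primer (hypothetical, not taken from any real genome).
genome = "TTACGTAGGCTTACGAAGGCTTTCGAAGGA"
primer = "ACGTAGG"

# Relaxing the mismatch threshold can only add candidate sites.
for allowed in (0, 1, 2):
    print(allowed, find_primer_sites(genome, primer, allowed))
```

On this toy sequence the number of candidate sites grows from one to three as the allowed mismatches increase from zero to two, which mirrors the trend seen in Figure 22C.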

Selective base(s) are a valuable tool to reduce the number of bands per lane on the virtual gel image. This is often necessary to resolve bands from genomes that have a high copy number of the TE family of interest. When no selective bases are used to search the A. aegypti genome for the TE family TF000700, a total of 78 bands appear on the virtual gel image. When A, C, T, or G is used as the selective base, a total of 23, 14, 22, and 19 bands appear, respectively (Figure 22D).
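A short Python sketch can illustrate how a selective base thins out a lane. It assumes, for illustration only, that each candidate band is stored together with the single nucleotide flanking its primer site and that the selective base is compared against that nucleotide. The record layout and the name `select_bands` are hypothetical and do not describe TE Displayer's internals. The sketch also checks the arithmetic of the reported counts: the four selective-base lanes sum to the 78 bands seen without a selective base.

```python
def select_bands(sites, selective_base=None):
    """Keep only sites whose flanking nucleotide equals `selective_base`.
    Passing None keeps every site (no selective base)."""
    if selective_base is None:
        return list(sites)
    return [site for site in sites if site[1] == selective_base]


# Hypothetical records: (band size in bp, flanking nucleotide).
sites = [(120, "A"), (185, "C"), (240, "A"), (310, "T"), (355, "G"), (410, "T")]

# One lane per selective base; together the lanes partition the full lane.
lanes = {base: select_bands(sites, base) for base in "ACTG"}
total = sum(len(bands) for bands in lanes.values())

# Band counts reported in the text for TF000700 (Figure 22D).
reported = {"A": 23, "C": 14, "T": 22, "G": 19}
reported_total = sum(reported.values())

print(total == len(sites))   # lanes of the toy data partition the full lane
print(reported_total)        # sum of the four reported lanes
```

Under this assumption each band falls into exactly one of the four lanes, so the per-base counts add up to the count obtained without a selective base.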

9 Discussion

With the ever-increasing number of whole-genome sequences becoming available in public databases, there is an increased need for bioinformatics tools that are capable of processing and analyzing the large amount of data. In TE research, bioinformatics tools capable of identifying, annotating, and analyzing TEs in genomic databases are advantageous, if not necessary.

Discrepancies between TE Displayer output and the banding pattern seen on a TD gel may be found. For example, the experimental and in silico TE profiles of mPing are similar, but not identical [see (24)]. Discrepancies may result from: (i) transposition activity of the TE of interest; (ii) incomplete genome sequences, resulting in fewer bands on the virtual gel image; (iii) sequencing and genome-assembly errors, resulting in incorrect band sizes; or (iv) non-specific amplification during experimental analysis, resulting in the appearance of non-target sequences.

TE Displayer enables a researcher to create a virtual gel image that mimics the experimental outcome of TD, and provides detailed text output about band sizes and genomic coordinates. Currently, TE transposition in an individual can be inferred from the appearance of novel bands on a TD gel (24, 127). TE Displayer can similarly be used to detect transposition events by comparing TE profiles across different individuals, tissues, or generations. In addition, TE Displayer allows researchers to compare computational TE profiles with experimental TE profiles, enabling them to detect genome assembly and sequencing errors, and gives them an initial idea of what to expect on a TD gel.
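The profile comparison described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not part of TE Displayer: a profile is taken to be a plain list of band sizes in base pairs, the names `unmatched_bands` and `compare_profiles` are hypothetical, and the optional `tolerance` parameter is an assumed allowance for sizing error on an experimental gel.

```python
def unmatched_bands(profile, other, tolerance=0):
    """Return the bands in `profile` that have no counterpart in `other`
    within `tolerance` base pairs."""
    return sorted(
        band
        for band in set(profile)
        if all(abs(band - ref) > tolerance for ref in other)
    )


def compare_profiles(reference, sample, tolerance=0):
    """Report bands that are novel in the sample and bands missing from it."""
    return {
        "novel": unmatched_bands(sample, reference, tolerance),
        "missing": unmatched_bands(reference, sample, tolerance),
    }


# Hypothetical band sizes (bp) for an in silico profile and a TD gel lane.
in_silico = [120, 185, 240, 310]
td_gel = [121, 185, 275, 310]

print(compare_profiles(in_silico, td_gel))
print(compare_profiles(in_silico, td_gel, tolerance=2))
```

Novel bands in the sample are candidates for new insertions, whereas missing bands may point to excision or to the assembly and sequencing issues listed earlier. A non-zero tolerance keeps small sizing differences from being reported as events.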

Chapter 4 Concluding Remarks

Often considered “parasitic”, TEs are now known to play a beneficial role, in some instances, for the genomes in which they reside. Multiple examples of molecular domestication illustrate how TEs and TE-derived sequences can become essential components of genomes, regulating gene expression and becoming crucial for proper host development (93). Furthermore, TEs have been described as major drivers of vertebrate diversity, and may play an important role in speciation (128). Although TE movement throughout a genome can cause mutations, either through direct insertion or through TE footprints left at the site of excision, these mutations have been credited with increasing genetic variation in populations (129).

TEs are a major driving force of genome evolution, despite the fact that the majority of TEs are not active. In studying genome sequences and sizes, researchers have revealed some interesting findings about eukaryotic genomes, including the fact that an organism’s morphological complexity and genome size are not correlated and that most eukaryotic DNA consists of non-coding regions [see (130) for review]. Moreover, it is becoming increasingly evident that TEs are the major contributor to eukaryotic genome size, with total TE content and genome size showing a strong positive correlation (76, 131). Even though TEs were discovered over 60 years ago in the maize genome, TEs continue to be discovered in a diversity of genomes. Therefore, revealing which TEs are potentially and currently active on a genome-wide scale, and what consequences arise from their transposition, is important for understanding genome evolution. Identifying the full complement of active TEs will provide a better understanding of genome structure and evolution.

References

1. McClintock B (1948) Mutable loci in maize. Carnegie Institute Washington Year Book

47:155-169.

2. McClintock B (1947) Cytogenetic studies of maize and Neurospora. Carnegie Institute

Washington Year Book 46:146-152.

3. Gardner MJ et al. (2002) Genome sequence of the human malaria parasite Plasmodium

falciparum. Nature 419:498-511.

4. Kunst F et al. (1997) The complete genome sequence of the gram-positive bacterium

Bacillus subtilis. Nature 390:249-256.

5. Schnable PS et al. (2009) The B73 maize genome: complexity, diversity, and dynamics.

Science 326:1112-1115.

6. SanMiguel P, Bennetzen JL (1998) Evidence that a recent increase in maize genome size

was caused by the massive amplification of intergene retrotransposons. Annals of Botany

82:37-44.

7. SanMiguel P et al. (1996) Nested retrotransposons in the intergenic regions of the maize

genome. Science 274:765-768.

8. Biemont C (2010) A brief history of the status of transposable elements: from junk DNA

to major players in evolution. Genetics 186:1085-1093.

9. Finnegan DJ (1989) Eukaryotic transposable elements and genome evolution. Trends in

Genetics 5:103-107.

10. Wicker T et al. (2007) A unified classification system for eukaryotic transposable

elements. Nature Reviews Genetics 8:973-982.

11. Bennetzen JL (2000) Transposable element contributions to plant gene and genome

evolution. Plant Molecular Biology 42:251-269.

12. Kumar A, Bennetzen JL (1999) Plant retrotransposons. Annual Review of Genetics

33:479-532.

13. Han JS, Boeke JD (2005) LINE-1 retrotransposons: modulators of quantity and quality of

mammalian gene expression? Bioessays 27:775-784.

14. Sabot F, Schulman AH (2006) Parasitism and the retrotransposon life cycle in plants: a

hitchhiker’s guide to the genome. Heredity 97:381-388.

15. Cordaux R, Batzer MA (2009) The impact of retrotransposons on human genome

evolution. Nature Reviews Genetics 10:691-703.

16. Smit AF, Riggs AD (1996) Tiggers and DNA transposon fossils in the human genome. Proceedings of the National Academy of Sciences 93:1443-1448.

17. Feschotte C, Pritham EJ (2005) Non-mammalian c-integrases are encoded by giant

transposable elements. Trends in Genetics 21:551-552.

18. Kapitonov VV, Jurka J (2006) Self-synthesizing DNA transposons in eukaryotes. Proceedings of the National Academy of Sciences 103:4540-4545.

19. Pritham EJ, Putliwala T, Feschotte C (2007) Mavericks, a novel class of giant transposable

elements widespread in eukaryotes and related to DNA viruses. Gene 390:3-17.

20. display.cgi?uids=1334917 Available at:

http://www.hubmed.org/display.cgi?uids=1334917 [Accessed August 16, 2011].

21. Bureau TE, Wessler SR (1992) Tourist: a large family of small inverted repeat elements

frequently associated with maize genes. Plant Cell 4:1283-1294.

22. Bureau TE, Wessler SR (1994) Mobile inverted-repeat elements of the Tourist family are

associated with the genes of many cereal grasses. Proceedings of the National Academy of

Sciences 91:1411-1415.

23. Kikuchi K, Terauchi K, Wada M, Hirano HY (2003) The plant MITE mPing is mobilized

in anther culture. Nature 421:167-170.

24. Jiang N et al. (2003) An active DNA transposon family in rice. Nature 421:163-167.

25. Naito K et al. (2006) Dramatic amplification of a rice transposable element during recent

domestication. Proceedings of the National Academy of Sciences 103:17620-17625.

26. Jiang N, Feschotte C, Zhang X, Wessler SR (2004) Using rice to understand the origin and

amplification of miniature inverted repeat transposable elements (MITEs). Current

Opinion in Plant Biology 7:115-119.

27. Feschotte C, Swamy L, Wessler SR (2003) Genome-wide analysis of mariner-like

transposable elements in rice reveals complex relationships with stowaway miniature

inverted repeat transposable elements (MITEs). Genetics 163:747-758.

28. MacRae AF, Clegg MT (1992) Evolution of Ac and Ds1 elements in select grasses (Poaceae). Genetica 86:55-66.

29. Tsubota SI, Huong DV (1991) Capture of flanking DNA by a P element in Drosophila melanogaster: creation of a transposable element. Proceedings of the National Academy of Sciences 88:693-697.

30. Surzycki SA, Belknap WR (1999) Characterization of Repetitive DNA Elements in Arabidopsis. Journal of Molecular Evolution 48:684-691.

31. Unsal K, Morgan GT (1995) A novel group of families of short interspersed repetitive

elements (SINEs) in Xenopus: evidence of a specific target site for DNA-mediated

transposition of inverted-repeat SINEs. Journal of Molecular Biology 248:812-823.

32. Oosumi T, Garlick B, Belknap WR (1996) Identification of putative nonautonomous

transposable elements associated with several transposon families in Caenorhabditis

elegans. Journal of Molecular Evolution 43:11-18.

33. Tu Z (1997) Three novel families of miniature inverted-repeat transposable elements are associated with genes of the yellow fever mosquito, Aedes aegypti. Proceedings of the National Academy of Sciences 94:7475-7480.

34. Izsvak Z et al. (1999) Short inverted-repeat transposable elements in teleost fish and

implications for a mechanism of their amplification. Journal of Molecular Evolution

48:13-21.

35. Brügger K et al. (2002) Mobile elements in archaeal genomes. FEMS Microbiology

Letters 206:131-141.

36. Morgan GT (1995) Identification in the human genome of mobile elements spread by

DNA-mediated transposition. Journal of Molecular Biology 254:1-5.

37. Oki N et al. (2008) A genome-wide view of miniature inverted-repeat transposable

elements (MITEs) in rice, Oryza sativa ssp. japonica. Genes & Genetic Systems 83:321-

38. Surzycki SA, Belknap WR (2000) Repetitive-DNA elements are similarly distributed on Caenorhabditis elegans autosomes. Proceedings of the National Academy of Sciences 97:245-249.

39. Nene V et al. (2007) Genome sequence of Aedes aegypti, a major arbovirus vector.

Science 316:1718-1723.

40. Nakazaki T et al. (2003) Mobilization of a transposon in the rice genome. Nature 421:170-

41. Lin X et al. (2006) In planta mobilization of mPing and its putative autonomous element Pong in rice by hydrostatic pressurization. Journal of Experimental Botany 57:2313-2323.

42. Shan X et al. (2005) Mobilization of the active MITE transposons mPing and Pong in rice

by introgression from wild rice (Zizania latifolia Griseb.). Molecular Biology and

Evolution 22:976-990.

43. Yang G, Zhang F, Hancock CN, Wessler SR (2007) Transposition of the Rice Miniature

Inverted Repeat Transposable Element mPing in Arabidopsis thaliana. Proceedings of the

National Academy of Sciences of the United States of America 104:10962-10967.

44. Momose M, Abe Y, Ozeki Y (2010) Miniature Inverted-Repeat Transposable Elements of

Stowaway Are Active in Potato. Genetics 186:59-66.

45. Patel M et al. (2004) High-oleate peanut mutants result from a MITE insertion into the

FAD2 gene. Theoretical and Applied Genetics 108:1492-1502.

46. Hua-Van A, Davière JM, Kaper F, Langin T, Daboussi MJ (2000) Genome organization in

Fusarium oxysporum: clusters of class II transposons. Current Genetics 37:339-347.

47. Dufresne M et al. (2007) Transposition of a fungal miniature inverted-repeat transposable

element through the action of a Tc1-like transposase. Genetics 175:441-452.

48. Rezsohazy R, van Luenen HGA, Durbin RM, Plasterk RHA (1997) Tc7, a Tc1-hitch

hiking transposon in Caenorhabditis elegans. Nucleic Acids Research 25:4048-4054.

49. Shirasu K, Schulman AH, Lahaye T, Schulze-Lefert P (2000) A contiguous 66-kb barley

DNA sequence provides evidence for reversible genome expansion. Genome Research

10:908-915.

50. Feschotte C (2004) Merlin, a new superfamily of DNA transposons identified in diverse

animal genomes and related to bacterial IS1016 insertion sequences. Molecular Biology

and Evolution 21:1769-1780.

51. Zhou F, Tran T, Xu Y (2008) Nezha, a novel active miniature inverted-repeat transposable

element in cyanobacteria. Biochemical and Biophysical Research Communications

365:790-794.

52. Hikosaka A, Kawahara A (2010) A systematic search and classification of T2 family

miniature inverted-repeat transposable elements (MITEs) in Xenopus tropicalis suggests

the existence of recently active MITE subfamilies. Molecular Genetics and Genomics

283:49-62.

53. Derbyshire KM, Kramer M, Grindley N (1990) Role of instability in the cis action of the

insertion sequence IS903 transposase. Proceedings of the National Academy of Sciences

87:4048-4052.

54. Huisman O, Errada PR, Signon L, Kleckner N (1989) Mutational analysis of IS10’s

outside end. EMBO Journal 8:2101-2109.

55. Makris JC, Nordmann PL, Reznikoff WS (1988) Mutational analysis of insertion sequence

50 (IS50) and transposon 5 (Tn5) ends. Proceedings of the National Academy of Sciences

85:2224-2228.

56. Mahillon J, Chandler M (1998) Insertion sequences. Microbiology and Molecular Biology

Reviews 62:725-774.

57. Derbyshire KM, Hwang L, Grindley ND (1987) Genetic analysis of the interaction of the

insertion sequence IS903 transposase with its terminal inverted repeats. Proceedings of the

National Academy of Sciences 84:8049-53.

58. Zerbib D et al. (1990) Functional organization of the ends of IS1: specific binding site for

an IS 1-encoded protein. Molecular Microbiology 4:1477-1486.

59. Johnson RC, Reznikoff WS (1983) DNA sequences at the ends of transposon Tn 5

required for transposition. Nature 304:280-282.

60. Yang G, Nagel DH, Feschotte C, Hancock CN, Wessler SR (2009) Tuned for

transposition: molecular determinants underlying the hyperactivity of a Stowaway MITE.

Science 325:1391-1394.

61. Hartl DL, Lozovskaya ER, Lawrence JG (1992) Nonautonomous transposable elements in

prokaryotes and eukaryotes. Genetica 86:47-53.

62. Lampe DJ, Walden KK, Robertson HM (2001) Loss of transposase-DNA interaction may

underlie the divergence of mariner family transposable elements and the ability of more

than one mariner to occupy the same genome. Molecular Biology and Evolution 18:954-

63. Feschotte C, Osterlund MT, Peeler R, Wessler SR (2005) DNA-binding specificity of rice

mariner-like transposases and interactions with Stowaway MITEs. Nucleic Acids Research

33:2153-2165.

64. Orgel LE, Crick FH (1980) Selfish DNA: the ultimate parasite. Nature 284:604-607.

65. Bingham PM, Kidwell MG, Rubin GM (1982) The molecular basis of P-M hybrid dysgenesis: the role of the P element, a P-strain-specific transposon family. Cell 29:995-1004.

66. Engels WR et al. (1987) Somatic effects of P element activity in Drosophila melanogaster:

pupal lethality. Genetics 117:745-757.

67. Kidwell MG, Lisch D (1997) Transposable elements as sources of variation in animals and

plants. Proceedings of the National Academy of Sciences 94:7704-7711.

68. Biessmann H et al. (1992) HeT-A, a transposable element specifically involved in

“healing” broken chromosome ends in Drosophila melanogaster. Molecular and Cellular

Biology 12:3910-8.

69. Pardue ML, Danilevskaya ON, Lowenhaupt K, Slot F, Traverse KL (1996) Drosophila

telomeres: new views on chromosome evolution. Trends in Genetics 12:48-52.

70. Pardue M-L, DeBaryshe PG (2008) Drosophila telomeres: A variation on the telomerase

theme. Fly 2:101-110.

71. Moore JK, Haber JE (1996) Capture of retrotransposon DNA at the sites of chromosomal

double-strand breaks. Nature 383:644-646.

72. Teng SC, Kim B, Gabriel A (1996) Retrotransposon reverse-transcriptase-mediated repair

of chromosomal breaks. Nature 383:641-644.

73. Cooley L, Kelley R, Spradling A (1988) Insertional mutagenesis of the Drosophila

genome with single P elements. Science 239:1121-1128.

74. Ivics Z, Izsvák Z (2004) in Mobile Genetic Elements (Humana Press, New Jersey), pp

255-276.

75. Ostertag EM, Madison BB, Kano H (2007) Mutagenesis in rodents using the L1

retrotransposon. Genome Biology 8:S16.

76. Kidwell MG (2002) Transposable elements and the evolution of genome size in

eukaryotes. Genetica 115:49-63.

77. Evgen’ev MB et al. (2000) Mobile elements and chromosomal evolution in the virilis group of Drosophila. Proceedings of the National Academy of Sciences 97:11337-11342.

78. Oliver KR, Greene WK (2009) Transposable elements: powerful facilitators of evolution.

BioEssays 31:703-714.

79. Belancio VP, Hedges DJ, Deininger P (2008) Mammalian non-LTR retrotransposons: For better or worse, in sickness and in health. Genome Research 18:343-358.

80. Bartolomé C, Maside X, Charlesworth B (2002) On the Abundance and Distribution of Transposable Elements in the Genome of Drosophila melanogaster. Molecular Biology and Evolution 19:926-937.

81. Callinan PA, Batzer MA (2006) in Genome and Disease (Karger Publishers, New York),

pp 104-115.

82. Wallace NA, Belancio VP, Deininger PL (2008) L1 mobile element expression causes

multiple types of toxicity. Gene 419:75-81.

83. Girard L, Freeling M (1999) Regulatory changes as a consequence of transposon insertion.

Developmental Genetics 25:291-296.

84. Medstrand P et al. (2005) Impact of transposable elements on the evolution of mammalian

gene regulation. Cytogenetic and Genome Research 110:342-352.

85. Kashkush K, Khasdan V (2007) Large-Scale Survey of Cytosine Methylation of Retrotransposons and the Impact of Readout Transcription From Long Terminal Repeats on Expression of Adjacent Rice Genes. Genetics 177:1975-1985.

86. Romanish MT, Lock WM, de Lagemaat LN van, Dunn CA, Mager DL (2007) Repeated

Recruitment of LTR Retrotransposons as Promoters by the Anti-Apoptotic Locus NAIP

during Mammalian Evolution. PLoS Genetics 3:e10.

87. McClintock B (1950) The origin and behavior of mutable loci in maize. Proceedings of

the National Academy of Sciences 36:344-355.

88. Calvi BR, Hong TJ, Findley SD, Gelbart WM (1991) Evidence for a common

evolutionary origin of inverted repeat transposons in Drosophila and plants: hobo,

Activator, and Tam3. Cell 66:465-71.

89. Feldmar S, Kunze R (1991) The ORFa protein, the putative transposase of maize

transposable element Ac, has a basic DNA binding domain. EMBO Journal 10:4003-4010.

90. Warren WD, Atkinson PW, O’Brochta DA (1995) The Australian bushfly Musca

vetustissima contains a sequence related to transposons of the hobo, Ac and Tam3 family.

Gene 154:133-134.

91. Kempken F, Windhofer F (2001) The hAT family: a versatile transposon group common

to plants, fungi, animals, and man. Chromosoma 110:1-9.

92. Lander ES et al. (2001) Initial sequencing and analysis of the human genome. Nature

409:860-921.

93. Sinzelle L, Izsvak Z, Ivics Z (2009) Molecular domestication of transposable elements:

from detrimental parasites to useful host genes. Cell and Molecular Life Sciences 66:1073-

94. Bundock P, Hooykaas P (2005) An Arabidopsis hAT-like transposase is essential for plant

development. Nature 436:282-284.

95. Hirose F et al. (2001) Ectopic expression of DREF induces DNA synthesis, apoptosis, and

unusual morphogenesis in the Drosophila eye imaginal disc: possible interaction with

Polycomb and trithorax group proteins. Molecular and Cellular Biology 21:7231-7242.

96. Hirose F et al. (1996) Isolation and Characterization of cDNA for DREF, a Promoter-activating Factor for Drosophila DNA Replication-related Genes. Journal of Biological Chemistry 271:3930-3937.

97. Kempken F, Kuck U (1996) restless, an active Ac-like transposon from the fungus

Tolypocladium inflatum: structure, expression, and alternative RNA splicing. Molecular

and Cellular Biology 16:6563-6572.

98. Essers L, Adolphs RH, Kunze R (2000) A highly conserved domain of the maize activator

transposase is involved in dimerization. Plant Cell 12:211-224.

99. Zhou L et al. (2004) Transposition of hAT elements links transposable elements and

V(D)J recombination. Nature 432:995-1001.

100. Hickman AB et al. (2005) Molecular architecture of a eukaryotic DNA transposase.

Nature Structural & Molecular Biology 12:715-721.

101. Arensburger P et al. (2011) Phylogenetic and Functional Characterization of the hAT Transposon Superfamily. Genetics 188:45-57.

102. Salvado J, Bensaadi-Merchermek N, Mouches C (1994) Transposable elements in

mosquitoes and other insect species. Comparative Biochemistry and Physiology Part B:

Comparative Biochemistry 109:531-544.

103. Gubler DJ (1998) Dengue and Dengue Hemorrhagic Fever. Clinical Microbiology

Reviews 11:480-496.

104. Tomori O (2004) Yellow Fever: The Recurring Plague. Critical Reviews in Clinical

Laboratory Sciences 41:391-427.

105. Ligon BL (2006) Infectious Diseases that Pose Specific Challenges After Natural

Disasters: A Review. Seminars in Pediatric Infectious Diseases 17:36-45.

106. Yang G, Hall TC (2003) MAK, a computational tool kit for automated MITE analysis.

Nucleic Acids Research 31:3659-3665.

107. Larkin MA et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23:2947 -

108. Rivero J et al. (2004) Optimization of Extraction Procedure for Mosquito DNA Suitable

for PCR-Based Techniques. International Journal of Tropical Insect Science 24:266-269.

109. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment

search tool. Journal of Molecular Biology 215:403-410.

110. Jurka J (2000) Repbase Update: a database and an electronic journal of repetitive

elements. Trends in Genetics 16:418-420.

111. Craig NL (2002) in Mobile DNA II (American Society for Microbiology Press,

Washington, D.C.), pp 3-11.

112. Levine A, Durbin R (2001) A computational scan for U12-dependent introns in the human

genome sequence. Nucleic Acids Research 29:4006-4013.

113. Sheth N et al. (2006) Comprehensive splice-site analysis using comparative genomics.

Nucleic Acids Research 34:3955-3967.

114. Marchler-Bauer A et al. (2011) CDD: a Conserved Domain Database for the functional

annotation of proteins. Nucleic Acids Research 39:D225-D229.

115. Tamura K et al. (2011) MEGA5: Molecular Evolutionary Genetics Analysis using

Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods.

Molecular Biology and Evolution In Press.

116. Wallace IM, O’Sullivan O, Higgins DG, Notredame C (2006) M-Coffee: combining

multiple sequence alignment methods with T-Coffee. Nucleic Acids Research 34:1692 -

117. Darriba D, Taboada GL, Doallo R, Posada D (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27:1164-1165.

118. Schmidt HA, Strimmer K, Vingron M, von Haeseler A (2002) TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502-504.

119. Crooks GE, Hon G, Chandonia J-M, Brenner SE (2004) WebLogo: A Sequence Logo Generator. Genome Research 14:1188-1190.

120. Dillon PJ, Rosen CA (1990) A rapid method for the construction of synthetic genes using

the polymerase chain reaction. Biotechniques 9:298-300.

121. Gietz RD, Schiestl RH (2007) Frozen competent yeast cells that can be transformed with

high efficiency using the LiAc/SS carrier DNA/PEG method. Nature Protocols 2:1-4.

122. Mota NR, Ludwig A, Da Silva Valente VL, Loreto ELS (2010) harrow: new Drosophila

hAT transposons involved in horizontal transfer. Insect Molecular Biology 19:217-228.

123. Feschotte C, Swamy L, Wessler SR (2003) Genome-Wide Analysis of mariner-Like

Transposable Elements in Rice Reveals Complex Relationships With Stowaway Miniature

Inverted Repeat Transposable Elements (MITEs). Genetics 163:747-758.

124. Adelman ZN, Jasinskiene N, James AA (2002) Development and applications of

transgenesis in the yellow fever mosquito, Aedes aegypti. Molecular and Biochemical

Parasitology 121:1-10.

125. Van den Broeck D et al. (1998) Transposon Display identifies individual transposable

elements in high copy number lines. Plant Journal 13:121-129.

126. Vos P et al. (1995) AFLP: a new technique for DNA fingerprinting. Nucleic Acids

Research 23:4407-4414.

127. Slotkin RK et al. (2009) Epigenetic reprogramming and small RNA silencing of

transposable elements in pollen. Cell 136:461-472.

128. Bohne A, Brunet F, Galiana-Arnoux D, Schultheis C, Volff J-N (2008) Transposable

elements as drivers of genomic and biological diversity in vertebrates. Chromosome

Research 16:203-215.

129. Kidwell MG, Lisch DR (2000) Transposable elements and host genome evolution. Trends

in Ecology & Evolution 15:95-99.

130. Gregory TR (2005) Synergy between sequence and size in large-scale genomics. Nature

Reviews Genetics 6:699-708.

131. Lynch M, Conery JS (2003) The Origins of Genome Complexity. Science 302:1401 -

Appendix I: Supplementary Materials

Supplementary Table 1: Consensus sequences of hAT MITE families from TEfam (http://tefam.biochem.vt.edu)

Family Name    Length (bp)
TF000576       371
TF000719       397
TF000720       470
TF00072        481
TF000724       441
TF000725       431
TF001258       1288
TF001274       562
TF001275       420
TF001302       562

Consensus Sequence (one multi-line block per family, in the order listed above)

CAGTGTTGTTCAGACTCATTTCCCCGAAAATAACATTTTCCGAACTACGAAGCCACGAAGTACTACAACCATGCTCTATGCTCCTAAGATAGAAAAGCAGATGGTTGAGCTTCAGTACTGCACACAAAATCGGA

TCAATTTGACCCAGAGAGCCACAATTCAAGTTTCGGTCAAATAAATGAGTGAGTGAGTGAGTTCGAATCATTTCACCGAAACTTGAATTACGGCCCACCGAGATCGAAACTGTCGGGTCCCATGTGCAACACTC

AAGCCCAACCACCTCCCCCTCCATCTCAGGAATACAACGCACAGTTATAACAGTTCATGGCTTCACAATTCGAATTTCGGGGAAATGAGTCTGGTCAACACTG

CTGAGAGAGAGGAGACAGCTGGCTTAGATTGCGAGCTCCCAAAAGTCATGGAAACCTGTCATCAACTGTCACTGAAAAACCGCACATTCGAGGGGCAGAGCGTGAAAAACCGTCAAAATTTGGCCAAACTAAA

ACTGGATGTGATTGAAATAGCAGGTTGCTTTCCTGTGCTCTCGCATATAAAATCGTTGAATATTTAAATATTGTCGATTGGACTAAAATCACTATTAAAGTTTTTCAGTTTTATGATCTTTTTGATAACAAATGGGA

TGTGAAAATGCTTTTGAACCGGTTTGTGCTCTCCGAACCATTGTGCACAACCCAAACATGAGAAGAAGACATCAAAGAAAGCAAAAAAACAAAGCCAGTTGTCAGTGAGCTGTCTCCTCTCTCTCAG

CTGGGAGGGAGAAACCAATACAAAATTTATGTACGTCATAGTGAGCCGGGATTCCGCTGTCACGAACTGTCACTGTGAGCCCAGATCTCAAACATTGGACTAACATGGAAAAAGCGCGTAACTGATTAAAACA

GATTTTCGCATTTCATCAAAAGTGACAAAAATCGAATGAAAATTAATTCATGCACACGATGCAAGTAATGTCACGTGAGCACCTTTCAATCTGATTGAATATAAAGCATTGAAAAACGCATCGTTTAACTTTTTC

CAGTGAAAATGAATATTTAGTGCATAGAAAACTGTGGCCCTTTTTCCATGTTTGTCAGGCAAGATCGAAAGTACACAACAAAAGAAAACTAGCGATTTTCCCCGAAGCCGGCCTTGATTGTTGTTGGCAAGCTG

AAGTTATCTTGCAGTTCAAACAAACGTTGTTCTATATTTCGTGTGTAGTATCTGTGCCTCCCTCCTAG

TAGAGTTACATATTTCTCAGGAGCACATAAATGTACTGACGAGCCTCCAACCAAACCGTCTCTCAGCTCATCGGAATCCTAGTGTGCTGCGCTAGACGGAAGCTCACGAAAATTTTGAAAGCGCGCGCTAAGTG

ATGTGCTCCCGGGGAGACTCGTCTCATGCGCTGTTCGGCGCTACGGCTCGCCTTCGGTATCCAAGTTGTGAAACAACTGCTTAGTAACGAACAGCCTCTACAACTATTTTCGCTTCATTATACAGGTGTCTAGAT

GGTCTACATGATATGGTCGAAGACTCGCGATCCAAGAGTTACCATTTTAATTTTAAACAGCACATGTGACGGAGCACATGAGACCGCAGTGATACACATCTCAAGAAAAATGAGGCGCACCGAAAAAGGTCTC

ACAATGTACAGCGCTCCCGTGCGCTTGCAAGCAATGAGGCTCCTGCACCTAAGGCTACAGCACGTGCACAGTAACTCTA

TAGCGTTGGGCAAATTTGTCTAAAACATCGATGTTAGTGAATCGATTCACATTTGAACTGCCGTATCGATTCACCGATTTAATCGAATCCTTTGAATCGATGTTCGTGAATCGATTCGAATCGATTAAATTTTTTTT

TTTTTTTTGATTTTGAAATGAGACCAAATGCATTTACATTGTATTTTAACATCTCTGAATTTAAATTAATTATTAAATTCAACAAGACAAATTTACTATTTCAAGGCTAGTTCAAAATTTGGTTTCTTTTCATCAGTT

CATTAAGAAATATTCATAAATTTTAACAAAATGAATCGAATCAAAAAATCGAATGAAGAAAATCGATTCACCTGATTTCAGATGCCCCAAACATCGATTCAAAAAATCGATTAATCGGAATGAAACATCGATTT

TCGGAACATCGATTCAAAATCGCCCAACGCTA

TAGTGATCCTTTATAGAGAGATAAGCGGAAGTAAAATGGAATGATTTGGCAACAGACCAGCAGTGCTGATTTTGCTCGGCATCGAGAATAATATTGTAACACGGACATGACTTCCTCATTGCTAAAAAGTGTATT

TTGTTTCGTTATAGATCTTTGAGCTACATATCCAAACTAATAAACAACCGAAAAAATCAACAACTAAAAGTCGCATTGACAAATATTTAAAAAAGAAAAATATTTCATTGTTTTGCTTTATGCAAAATTACTTTTA

CCGAAATAATTTCAAGCGGATTGCATTACTCCTCAAGAAAAGTTCAACACAAACATAACGCATGTTCGTTTGGCTCACACTTTGACAGCTCTCGCTGCTCGATTGGCTCACACTTTGACAGAAATGTCAGCTTCC

GCAAATCTCTTATAAAGGATTACTA

TAGGGTGCGGCTTATTTTTGAAAAGTTCTCAAAACCAAAAATTCGTGTGCTCTACTGAATTCAAATCACATTAAAAGAGAAACGTCAAAATTTGAGCCAAAAATATTAACATTTAGAGGTGGCGCAAGCGTCTTG

AAGTTGAATTTTCAGGTTATAAAAAATGACCTTCAGTAAAGTACACATAACTTTGTTATTTTTCAACCGATTTTAAAACTTTTAGCATCAATACCTTCAAAATTAAATTTGTTAAAACTTTGTTGAACAGCAAATTT

GTCTAAAATCAAAACAAGTTTTAGTTAAAAGTAATTTATTCCAAATTTTTCCTCATTTTACTAAAAATTTCAACTTTGTTTGACCATAACTTCATGATTACTCAACCGATTCCAAATCTTTTTACATGATTTTGATG

CTAATTTAATTGTTTTCAAACCATTCATATATCATATTTCACTAAAAAAATGTCTTGACCAAGTTATTCAGCAAAAACTGCACAAAAACATTATTTTATAACGAAAAATGCCAGTTTTGTCAAATCTTTGGCAGTT

AAATATTGTCTCTTAAGCTGAAACTGATTTTTATCATTTGTTAAACCTTTGGAAAGGACTTTGCAACAACTTTTAAATAATTTTGAAACAAAATTATGATTGCATATGTTAAAAATATATGCACTATTCAACTCGTA

ACTAAGGCAAGATGCTTTACATATTTAAGTTTAAAAAGAAAAAATGCTGTTTTCGGCAAAAAAACTTAAATTATCCAAAGAAATTTGATTTTTTCATTGAAAAATCATGTTTTTGGGTAGTTTTTGTTGAATACCTA

AGTCAAAAATATTTTTTAGAAAAAGATGTTGTATGTACAGTTTGAAAACAAATAAATTTTCTTCAAAAGTATGTGAAAAGTTTTGGAATCGATTGAGTTTTAATGAAGTTATGGTCAAACAAAAATGATTTTTTTAG

CAAAATGAGGAAAAATTTGGATATAAGTATTTTTAACTAAGACAAGTTTTGATTTTAGACAAATTTGATGTTCTACAAAATTTTTATAAATTTAATTTGAAAGAAAATGGTGCTAAAAGTTTTGAAATCGGTCTAA

AAATAACAAAGTTATGTTTACTTAACTGGAGGTAATTTTTTATAACTTGAAAATTCACCTTCAACACGCTTGCGCCACCTCTAAATGTTAATATTTTTGGCTCAGATTTTGAGGTTTCTCTTTTTTTGTGATTTGAAT

TCAGTAGAGCACACGAATTTTCGGTTTTGAGAACTTTTGAAAAATAAGCCGCACCCTA

CAGGCTTTGGAAAATTTAACGATCATTGACATACGAGCCATAATTTCATCTCATTTATCCGCCTGCATTCATACAACACGCATGACATTTATCGTTCGTGGAAGATAACGAGCCATGAACGAAAACAAGACAGA

CACACACCACCCACTGTTTGGCGCCGATCACTGTGTGTCAACACTTGTAACATTCGCTGACCCACTCTCATTCTGATCAAGAAACAGAGGCGAATATTTTCTATTTTCTGTGCTATGTTTGCAACCAAAGGGGTG

TTCAATGATCCAATAGATTATAATTAGGTTGTTAAAGGCTGCATTACCAGAAAATATGAAAGTTATTACCAATTTTATGCAAAACATAAATCACCCGAATGGCTCGATACTGTTGCAGTCTGTGAGAGTTGCTGC

TCTTGAACGCTCTCGTGCCACGTACCAAACGAGAGAGTGAATGTTTACATTCATAGGGCTCTCCGAATGCGATGTGCCACGAACTGTTATGGTTCGCGAACTGTAACACTCTCCGAATATGGTTCGATCGGCTTA

TTCGGATCAAAATCCAAACACTG

CAGGCTTGATAGAAACTCCTCTGCGATGCAAAGAGGAGGCGATTTTGCCGTTTTTCTGAGGAGAAAAAAATAAACTCAAAATTTTCAACACAGATCCTCAAGCGATTCGAGTGAGAGTGCAGAAAAATGCGAGG

ATTTTTTCAGACACTCTCTCTCTCTCGTTGCGATGCTCACCATCACTCACACTTCAAACGAACTCTCGAGTCTGCCGCCCGAACAGACAGTCCTCGGGGAGTTGTGAGAGCACCCTTTTTTGATGGAATTTTACT

CGGTTTTTTTTTGTGCTGCATCTTCCTCGCTCATTTTTCGTTGCCTCTCTGCTTAGCATTTTTTCGCTCCCTCGCAAAGAGGCAAAGGCAGCACGCTGCTCTAGAGTATGTCACCGAACTCTGATGATGTTGAGGA

GAATTATCAAGCCTG

CAGGCTTCCGAATTCTCACTCACTCTCACGATAATGGATTTATCCCTCACAGCAATTTTCAGCGATCAGCTGCCTTGGGATTTTATCCGTCCACAAAACAAAGCGAAAATAGATAATTGCACATTTCCGCAGCTG

CGAGAAGTTTGTTTTCTCTTTCTCCGATCCATGGAGAATTTTTCCTGTATCCATCGCCACGAGCAGCAGTTGCTTTTATGCACCAAATTAACCACTCCTTTCGTCAAGTTGGTGTGGTAGCATGGTTAGCGTGCAC

GCATCGCGGTTGTTTTTAGCAATGTTCTAGGTTCGAATCCCATCGCCGGCACTAGTTTTTGTTTATCCACATAGTGATAATTCTCCGGAGCAGTTGGTGAGCGAAAATATGATGGAGTGAAAATCAAATTATCCA

CAGAGTGGAGTGATTTTCGGTTATCCACCATCGTAGCTCCAGGGAAAATTAGTGTGAGCGATAAAAGATGTTCACGCGAGGAAGCGAAAAAAGCATTCTCCTTACGTGCGAAAAATCGTAGATTTCGCATCATT

GATACGAAAATTCAGAACCCTG