UNIVERSIDADE FEDERAL DE PELOTAS Programa de Pós-Graduação em Agronomia
Thesis
In silico characterization of microsatellites in the rice genome and comparative analysis to other plant species
Luciano Carlos da Maia
Pelotas, 2009
Luciano Carlos da Maia
Agricultural Engineer
In silico characterization of microsatellites in the rice genome and comparative analysis to other plant species
Advisor: Antônio Costa de Oliveira, PhD, FAEM/UFPel
Co-advisor: Fernando Irajá Félix de Carvalho, PhD, FAEM/UFPel
Pelotas, 2009
Thesis presented to the Programa de Pós-Graduação em Agronomia of the Universidade Federal de Pelotas in partial fulfillment of the requirements for the degree of Doctor of Science (field of knowledge: Plant Breeding).
Examining committee:
Professor Antonio Costa de Oliveira, PhD (UFPel), Chair
Professor Fernando Irajá Félix de Carvalho, PhD (UFPel)
Professor Antonio Vargas de Oliveira Figueira, PhD (ESALQ-USP)
Professor Odir Antonio Dellagostin, PhD (UFPel)
Professor Cesar Valmor Rombaldi, PhD (UFPel)
To my parents, Milton and Ivone.
To my sisters Lú, Binha, and Tamara, and to my brother Zico.
To everyone I love.
I dedicate this work.
Acknowledgments
-To God, for my life and for the clarity of my convictions.
-To Professor Antônio, for his guidance, his support in bioinformatics research,
and his friendship.
-To Professor Fernando Carvalho, for his tireless work in teaching plant breeding
and for the always welcome scoldings.
-To everyone with whom I shared these years at the Centro de Genômica e
Fitomelhoramento: friends and colleagues...
-To Grandpa Alcides, Grandma Maurisa, and Osmil.
-To my faithful friends (Batista, Walter, Galvão, Zé Siqueira, Denival, Cibalena,
Anderson, and Mané) and relatives from Rechan who, even after all these years
away, remain part of my life.
-To Éder Moreira (Jacupiranga), a friend through every hardship...
-To the friends with whom I shared a home during these years: Mano Lima (and
Darliane...), Julio, and Diego.
-To Dario Palmieri (UNESP), Mauricio Kopp (EMBRAPA), Velci (UFSM), and
Valmor (KSP), for their friendship and the constant professional discussions...
-To the couple Fernando Henning and Lili Mertz, great friends, for the hours of
pão de queijo, coffee, BLASTs, and many discussions...
-To Cris Sakashita...
-To everyone who contributed and helped me...
-To the agencies CAPES and CNPq, without which this dream could not have come
true.
-To all, a warm embrace!
Behold, the sower went out to sow. And as he sowed, some of the seed fell by the wayside,
and the birds came and ate it. Some fell on stony ground, where there was not much soil,
and it sprang up at once, because the soil was not deep; but when the sun rose, it was scorched
and withered, because it had no root. Some fell among thorns, and the thorns grew and choked it.
And some fell on good soil and bore fruit: one a hundredfold, another sixtyfold, and another thirtyfold.
(Matthew 13:3-8)
Summary
MAIA, Luciano Carlos da. In silico characterization of microsatellites in the rice genome and comparative analysis to other plant species. 2009. 83 pp. Thesis (Doctorate) - Programa de Pós-Graduação em Agronomia. Universidade Federal de Pelotas, Pelotas.
Molecular markers have been used successfully in genetic mapping and marker-assisted selection as an auxiliary tool for plant breeding and for transferring information between related species. In this context, understanding the occurrence of microsatellites in the genomes of different bred species such as wheat, rice, and maize can help improve basic knowledge of the grass species described as "orphans". Rice, following the completion of its genome sequencing, has been proposed as a genetic model among the grasses. Among the different types of molecular markers, microsatellites are indicated as the preferred class for these studies. In general, strategies for transferring molecular markers between species still face some difficulties and open questions regarding which microsatellite patterns are most conserved among plant species, genera, and families. The objective of this study was to use bioinformatics tools to characterize microsatellites from the genomes of rice and other species, making it possible to predict the microsatellite patterns most promising for transfer. Three studies were carried out. The first consisted of developing and validating a tool for locating microsatellites, designing primers, and simulating PCR. A database containing 28,469 fl-cDNA sequences from japonica rice was used. Of the 3,907 loci found, primer sets were designed for 3,329 and tested by PCR simulation, which showed that only 2,397 (72%) amplified specific regions.
In the second study, the occurrence of microsatellites in expressed regions of ten species from three different plant families was analyzed. The results indicated the frequency and patterns of microsatellites within and between the different families. In the third study, microsatellites were characterized in the complete rice genome. The results showed a conserved pattern of occurrence of the different microsatellites across the different chromosomes and which arrangements were the most abundant. Inferences about which elements allow the best genome coverage were discussed.
Keywords: Oryza sativa subsp. japonica. Simple Sequence Repeat. Microsatellites. Bioinformatics. Genomics. Plant breeding. Grasses.
Abstract
MAIA, Luciano Carlos da. In silico characterization of microsatellites in the rice genome and comparative analysis to other plant species. 2009. 83 pp. Thesis (Doctorate) - Programa de Pós-Graduação em Agronomia. Universidade Federal de Pelotas, Pelotas.
Molecular markers have been successfully applied in genetic mapping and marker-assisted selection as an auxiliary tool for plant breeding and for the transfer of genetic information among related species. In this sense, understanding the genome elements that occur in important crop species such as wheat, rice, and maize can improve basic knowledge of orphan grass species. Rice, after the completion of its genome sequence, has been proposed as a genetic model for the grasses. Among the different types of molecular markers, microsatellites have been indicated as the preferred class for such studies. In general, the strategy of transferring molecular markers between species still poses some questions and difficulties regarding the most conserved microsatellite patterns among plant species, genera, and families. The objective of this study was to use bioinformatics tools to characterize microsatellites from rice and other grass species of economic importance, enabling the prediction of the microsatellite patterns that are most promising in transfer strategies. Three studies were performed. The first concerned the development and validation of a microsatellite search tool with primer design and PCR simulation. A database containing 28,469 fl-cDNA sequences originating from the japonica rice genome was used. From a total of 3,907 microsatellite loci, 3,329 primer pairs were designed and tested using the simulated PCR feature, showing that only 2,397 (72%) of the pairs amplified specific regions. The second study aimed to describe the occurrence of microsatellites in expressed regions from ten species of three different plant families. The results indicated the frequency and patterns of occurrence of microsatellites within and between the different families. The third study aimed to characterize the complete occurrence of microsatellites in the rice genome. The results showed a different pattern of occurrence of microsatellites for the different chromosomes and which arrangements are most abundant. Inferences on which elements allow better genome coverage are discussed.
Keywords: Oryza sativa subsp. japonica. Simple Sequence Repeat. Microsatellites. Bioinformatics. Genomics. Plant Breeding. Grasses.
List of figures
2. SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation
Figure 1. Flow-chart showing the functional structure of SSR Locator. (A) Perl script to search SSRs; (B) text file where information from detected SSRs is stored; (C) module for the statistical calculations for SSR motif occurrence; (D) module that formats text files into standard Primer3 input files; (E) running of Primer3; (F) module for running Virtual-PCR (using a second sequence file as a template); (G) module performing global alignment between homologous amplicons; (H) identity and alignment score calculations between homologous amplicons; and (I) file containing SSR, primer, homologous amplicons, identity, and score information.
3. Tandem repeat distribution in gene transcripts of three plant families
Figure 1. Percentage of expressed sequences containing tandem repeat loci.
4. Distribution and patterns of microsatellite occurrence in the whole rice genome
Figure 1. Percentage occurrence of different microsatellite types (≥ 12 bp) in the chromosomes.
Figure 2. Percentage occurrence of different microsatellite types (≥ 20 bp) in the twelve chromosomes.
List of tables
2. SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation
Table 1. Distribution of SSR/minisatellite motifs according to the number of repeats.
Table 2. Distribution of SSR/minisatellite repeats in the rice cDNA collection.
Table 3. Distribution of amplicon alignments for specific and redundant amplicons with varying identity levels.
3. Tandem repeat distribution in gene transcripts of three plant families
Table 1. Overall distribution (amounts and percentage) of expressed sequences in translated and non-translated regions.
Table 2. Overall distribution of tandem repeat occurrences in translated and non-translated transcripts.
Table 3. Overall occurrence, in percentage, of microsatellite and minisatellite motifs in different regions of ten plant species.
Table 4. Distribution of di-, tri-, and tetramer motifs, percentage occurrence per species and average occurrence per family.
Table 5. Distribution of penta- to decamer motifs, percentage occurrence per species and average occurrence per family.
4. Distribution and patterns of microsatellite occurrence in the whole rice genome
Table 1. Total amounts of microsatellite types (≥ 12 bp)* in the twelve chromosomes.
Table 2. Distributions, percentage, and frequency of different microsatellite types within Classes I and II in the twelve chromosomes.
Table 3. Average locus size (bp) of different microsatellite types within Classes I and II for the twelve chromosomes.
Table 4. Average distances (kb) between different microsatellite loci within Class I and Class II in the chromosomes.
Contents
Summary
Abstract
List of figures
List of tables
Contents
1. General introduction
2. SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation
Abstract
1. Introduction
2. Material and Methods
3. Results
4. Conclusions
References
3. Tandem repeat distribution in gene transcripts of three plant families
ABSTRACT
INTRODUCTION
MATERIAL AND METHODS
RESULTS AND DISCUSSION
CONCLUSIONS
REFERENCES
4. Distribution and patterns of microsatellite occurrence in the whole rice genome
ABSTRACT
INTRODUCTION
MATERIAL AND METHODS
RESULTS AND DISCUSSION
CONCLUSION
REFERENCES
5. Final considerations
6. References for Section 1
Vitae
1. General introduction
The development of new varieties that meet the demand for greater genetic
potential for yield is the main goal of every breeding program, and the success of
such a program depends on a dynamic and efficient method for reaching its
objectives. Plant breeding therefore requires three fundamental steps to obtain
superior genotypes: the presence of genetic variability, efficient selection of the
most promising genotypes, and adjustment of the best genetic constitutions to the
growing environment (CARVALHO et al., 2003).
The identification of this variability has been the subject of many studies,
since, when the variability of a given trait is evaluated, its expression is often
masked by the effect of the environment or by allelic or gene interactions. These
facts complicate the breeder's selection work and in many cases require
experiments repeated over several years and distinct locations in order to
circumvent the action of the environment (CARVALHO et al., 2003).
Selecting an individual with the genetic potential for high yield thus becomes
one of the breeder's most arduous tasks, because a large number of alleles at
different loci must be replaced to achieve significant progress in the trait. The
difficulty lies in following the segregation of several alleles at several loci at the
same time (CARVALHO, 1982).
Nowadays, biotechnology techniques such as molecular markers have been
described as auxiliary strategies for overcoming these difficulties: once molecular
markers associated with genes of interest are identified, genotypes carrying the
best alleles can be identified without the influence of the environment (MAIA, 2007).
Among the different molecular markers known, microsatellites, or SSRs, are a
very promising class. This class of markers is powerful in a variety of applications
in plant genetics and breeding because of its reproducibility, multiallelic nature,
codominant character, and abundance in different genomes (TEMNYKH et al.,
2001; VARSHNEY et al., 2005).
First described as microsatellites by Litt and Luty (1989) and as SSRs (simple
sequence repeats) by Tautz et al. (1989), these DNA sequences consist, according
to Morgante and Olivieri (1993), of 1, 2, 3, 4, 5, or 6 nucleotides repeated in
tandem.
Regions of repetitive DNA (microsatellites) are more prone to forming loops or
structures known as hairpins, because in these stretches DNA polymerase
undergoes slippage during replication, causing the insertion or deletion of
nucleotides and thereby increasing or reducing the size of the repeat sequence
(WELL et al., 1998; IYER et al., 2000).
Currently, with the accumulation of data on expressed regions of different
genomes (ESTs and cDNAs), the characterization and development of
microsatellite markers derived from these regions, described as functional markers,
represent a promising strategy for plant breeding, since they offer advantages over
marker classes based on anonymous genomic regions (VARSHNEY et al., 2005).
In plants, although several studies have described the levels of occurrence of
microsatellites associated with transcribed regions (TEMNYKH et al., 2001;
MCCOUCH et al., 2002; MORGANTE et al., 2002; THIEL et al., 2003; NICOT et al.,
2004; LAWON and ZHANG, 2006; VARSHNEY et al., 2006; KASHI and KING, 2006;
ZHANG et al., 2006), comparative and/or descriptive approaches can still offer new
perspectives on the characteristics of these markers, because distinct groups of
plant species are frequently being sequenced, making it possible to re-evaluate
databases enlarged with new sequences that represent divergent evolutionary
groups and/or different genetic models.
The general objective of this work was to use bioinformatics to fully
characterize the occurrence of microsatellites in the rice genome, to verify the
occurrence of microsatellites in expressed regions of this genome, to identify
patterns of occurrence of these markers in different grass and dicotyledonous
species, and to predict which patterns of occurrence make the best markers for rice
and for the grasses, and which patterns increase the success rate of strategies for
transferring this class of markers between different species.
2. SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation
International Journal of Plant Genomics (ISSN 1687-5389)
Abstract
Microsatellites or SSRs (simple sequence repeats) are ubiquitous short
tandem duplications occurring in eukaryotic organisms. These sequences are among
the best marker technologies applied in plant genetics and breeding. The abundant
genomic, BAC, and EST sequences available in databases allow the survey
regarding presence and location of SSR loci. Additional information concerning
primer sequences is also the target of plant geneticists and breeders. In this paper,
we describe a utility that integrates SSR searches, frequency of occurrence of motifs
and arrangements, primer design, and PCR simulation against other databases. This
simulation allows the performance of global alignments and identity and homology
searches between different amplified sequences, that is, amplicons. In order to
validate the tool functions, SSR discovery searches were performed in a database
containing 28,469 nonredundant rice cDNA sequences.
1. Introduction
Microsatellites or SSRs (simple sequence repeats) are sequences in which
one or a few bases are tandemly repeated a varying number of times [1]. Variations
in SSR regions originate mostly from errors during the replication process, frequently
DNA polymerase slippage, which generates insertions or deletions of base pairs and
results, respectively, in larger or smaller regions [2, 3]. SSR assessments in the human
genome have shown that many diseases are caused by mutations in these sequences
[4].
SSRs can be found in different regions of genes, that is, coding sequences,
untranslated sequences (5′-UTR and 3′-UTR), and introns, where expansions
and/or contractions can lead to gain or loss of gene function [5]. There is also
evidence that the genomic distribution of SSRs is related to chromatin organization,
recombination, and DNA repair. SSRs are found throughout the genome, in both
protein-coding and noncoding regions. Genome fractions as low as 0.85%
(Arabidopsis thaliana), 0.37% (Zea mays), 0.21% (Caenorhabditis elegans), 0.30%
(Saccharomyces cerevisiae) and as high as 3.0% (Homo sapiens) and 3.21% (Fugu
rubripes) have been found. Some bias for defined genomic locations has also been
reported [6, 7]. This class of markers is broadly applied in genetics and plant
breeding because of its reproducibility, multiallelic and codominant nature, and
genomic abundance. Its use for integrating genetic maps, physical mapping, and
anchoring gives geneticists and plant breeders a pathway to link genotype and
phenotype variations [8].
Protocols for isolating SSR loci in a new species have always been very
labor-intensive. Currently, with the accumulation of biological data originating from
whole-genome sequencing initiatives, the use of bioinformatics tools helps to
maximize the identification of these sequences and, consequently, the number of
markers generated [9]. The first in silico studies of SSRs were carried out using the
FASTA [10] and BLAST [11] packages. Later, more specific algorithms were
developed, such as SPUTNIK [12], REPEATMASKER [13], TRF (Tandem Repeats
Finder) [14], TROLL [15], MISA [16], and SSRIT (Simple Sequence Repeat
Identification Tool) [17] [9].
SSR detection is generally followed by the use of another program for primer
design, with primers anchored on the flanking sequences. In some applications, a
third step using e-PCR [18] is added to check primer redundancy. The sequential
use of several programs is often called a pipeline. Building such a pipeline can be a
very difficult task for research groups unfamiliar with programming tools. In the
present work, a computing tool with an interface for Windows users, called SSR
Locator, was developed. The application integrates the following functions: (i)
detection and characterization of SSRs and minisatellite motifs between 1 and 10
base pairs; (ii) primer design for each locus found; (iii) simulation of PCR
(polymerase chain reaction), amplifying fragments with different primer pairs from a
given set of fasta files; (iv) global alignment between amplicons generated by the
same primer pair; and (v) estimation of global alignment scores and identities
between amplicons, generating information on primer specificity and redundancy.
The described tool is publicly available at the site
http://www.ufpel.edu.br/~lmaia.faem.
2. Material and Methods
2.1. Algorithms
The algorithms used for the searches, alignment, and homology estimates are
described separately.
2.2. SSR Search
The algorithm used for perfect and imperfect micro-/minisatellite searches was
written in Perl and consists of the generation of a matrix that mixes A(adenine),
T(thymine), C(cytosine), and G(guanine) in all possible composite arrangements
between 1 and 10 nucleotides. The script instructions perform readings on fasta files,
searching all possible arrangements in each database sequence.
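The search described above can be illustrated with a minimal Python sketch. It is not the Perl script used by SSRLocator: the function names `all_motifs` and `find_repeats`, the regular-expression approach, and the example sequence are assumptions introduced here for illustration only.

```python
import itertools
import re

def all_motifs(max_len):
    """Yield every arrangement of A, C, G, and T with length 1..max_len."""
    for k in range(1, max_len + 1):
        for combo in itertools.product("ACGT", repeat=k):
            yield "".join(combo)

def find_repeats(sequence, motif, min_repeats):
    """Return (start, end, copies) for each run of `motif` with >= min_repeats copies.

    Coordinates are 0-based and `end` is exclusive.
    """
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_repeats))
    hits = []
    for m in pattern.finditer(sequence.upper()):
        hits.append((m.start(), m.end(), (m.end() - m.start()) // len(motif)))
    return hits

# Hypothetical sequence: ten copies of AG between two short flanks.
seq = "GGCT" + "AG" * 10 + "TTCAACCC"
print(len(list(all_motifs(3))))     # 4 + 16 + 64 = 84 arrangements
print(find_repeats(seq, "AG", 10))  # [(4, 24, 10)]
```

A full search would call `find_repeats` for every motif and every sequence of the fasta file; this sketch does not remove redundant hits (for example, an AGAG 4-mer matching the same AG run), which a complete tool has to handle.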
Several instructions in the algorithm used in SSRLocator resemble those from
MISA [16] and SSRIT [17]. However, additional instructions have been inserted in
SSRLocator's code. MISA and SSRIT allow an overlap of a few nucleotides when two
SSRs are adjacent to each other and one of them is shorter than the minimum size
for its class; in SSRLocator, a module written in Delphi records the data and
eliminates such overlaps.
The SSR Locator software contains windows for selecting and configuring the
SSR and minisatellite types (mono- to 10-mers) and the minimum number of repeats
for each selected type. The algorithm calls a repeat perfect when the locus lies
more than 100 bp upstream or downstream of any adjacent locus.
The algorithm calls a repeat imperfect when the same motif is present on
both sides of a fragment containing up to 5 base pairs. It identifies a composite
locus when two or more adjacent loci are found at distances between 6 and
100 bp [16].
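The three distance rules above can be written as a small decision function. This is a sketch of the rules as stated in the text, not SSRLocator's Delphi code; the function name `classify_gap` and the "unspecified" label for a case the text does not cover are assumptions.

```python
def classify_gap(distance_bp, same_motif):
    """Classify two neighbouring repeat runs by the distance between them.

    distance_bp: number of base pairs separating the two runs.
    same_motif:  True when both runs are built from the same motif.
    """
    if distance_bp <= 5 and same_motif:
        return "imperfect"    # same motif on both sides of a fragment of up to 5 bp
    if 6 <= distance_bp <= 100:
        return "composite"    # adjacent loci 6 to 100 bp apart
    if distance_bp > 100:
        return "perfect"      # loci far enough apart to be scored independently
    return "unspecified"      # gap of up to 5 bp with different motifs: not covered in the text

print(classify_gap(3, True))     # imperfect
print(classify_gap(50, False))   # composite
print(classify_gap(250, True))   # perfect
```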
In this study, only "Class I" (≥20 bp) repeats are shown. These repeats have
been described as the most efficient loci for use as molecular markers [17].
SSRLocator was configured to locate SSRs of at least 20 bp: monomers (x20),
2-mers (x10), 3-mers (x7), 4-mers (x5), 5-mers (x4), and 6-mers (x4); and
minisatellites: 7-mers (x3), 8-mers (x3), 9-mers (x3), and 10-mers (x3).
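As a quick check on this configuration, the following Python sketch multiplies each motif length by its minimum number of repeats and confirms that every setting yields a locus of at least 20 bp. The dictionary name `MIN_REPEATS` and the helper `min_locus_length` are illustrative names, not part of the software.

```python
# Minimum number of repeats per motif length, as listed in the text.
MIN_REPEATS = {1: 20, 2: 10, 3: 7, 4: 5, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 3}

def min_locus_length(motif_len):
    """Shortest locus (bp) accepted for a motif of the given length."""
    return motif_len * MIN_REPEATS[motif_len]

for k in sorted(MIN_REPEATS):
    kind = "SSR" if k <= 6 else "minisatellite"
    print(f"{k:2d}-mer x{MIN_REPEATS[k]:<2d} -> {min_locus_length(k)} bp ({kind})")
```

Under these settings the shortest accepted loci range from 20 bp (monomers, 2-mers, 4-mers, and 5-mers) to 30 bp (10-mers).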
In order to validate the efficiency of SSRLocator in finding SSRs and
minisatellites, the same database was analyzed with MISA and SSRIT, using the
same parameters for minimum number of repeats.
2.3. Primer Design
An algorithm written in Delphi calls Primer3 [19], which performs the primer
design. The results are fed to a module that performs Virtual-PCRs and assigns to
each SSR locus an individual identifier, forward and reverse primer sequences,
and a sequence fragment corresponding to the region flanked by the primers
(original amplicon). A window allows the selection of Primer3 parameters, such as
the ranges of primer and amplicon sizes, the optimum primer size, the melting
temperature (Tm) range (minimum, maximum, and optimum), and GC content
(minimum and optimum). For primer searches, the software automatically searches
at a distance of five base pairs from both (5′ and 3′) SSR flanking sites. In this study,
the following parameters were used: amplicon size between 100 and 280 bp;
minimum, optimum, and maximum annealing temperature (Tm) of 45, 50, and 55°C,
respectively; and minimum, optimum, and maximum primer size of 15, 20, and
25 bp, respectively.
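The parameters above could be passed to Primer3 as a Boulder-IO record along the following lines. This is a hedged sketch: the tag names follow the Primer3 2.x input format, which may differ from the Primer3 version called by SSRLocator; the function name `primer3_record`, the sequence identifier, and the reading of the five-base-pair rule as a target region extended by 5 bp on each side are assumptions made here.

```python
def primer3_record(seq_id, template, ssr_start, ssr_length, flank=5):
    """Build one Boulder-IO record for an SSR locus (ssr_start is 1-based).

    The target is the SSR extended by `flank` bp on each side, so that
    primers are placed outside this region (an interpretation, see lead-in).
    """
    target_start = max(1, ssr_start - flank)
    target_end = min(len(template), ssr_start + ssr_length - 1 + flank)
    fields = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("SEQUENCE_TARGET", f"{target_start},{target_end - target_start + 1}"),
        ("PRIMER_FIRST_BASE_INDEX", 1),          # use 1-based coordinates
        ("PRIMER_PRODUCT_SIZE_RANGE", "100-280"),
        ("PRIMER_MIN_TM", 45), ("PRIMER_OPT_TM", 50), ("PRIMER_MAX_TM", 55),
        ("PRIMER_MIN_SIZE", 15), ("PRIMER_OPT_SIZE", 20), ("PRIMER_MAX_SIZE", 25),
    ]
    # A Boulder-IO record is a list of TAG=value lines closed by a single "=".
    return "\n".join(f"{tag}={value}" for tag, value in fields) + "\n=\n"

# Hypothetical 300 bp template with a 20 bp SSR starting at position 120.
print(primer3_record("example_locus_1", "ACGT" * 75, 120, 20))
```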
2.4. Virtual-PCR
The module used to simulate a PCR reaction was written in Delphi. The
algorithm reads the file generated by the previous module (SSR locus, forward and
reverse primers, and original amplicon) and then searches for sequences
containing primer annealing sites. When annealing sites are found for both primers,
the flanked region and the primer sequences are copied to a new variable called
"paralog amplicon."
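The idea of the Virtual-PCR step can be sketched in Python as follows. The sketch assumes exact primer matches on a single strand and reports only the first site found; the names `reverse_complement` and `virtual_pcr` and the toy sequences are hypothetical, and the Delphi module may apply different matching rules.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def virtual_pcr(template, forward, reverse):
    """Return the amplicon (primers included), or None if a site is missing."""
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so its site on the
    # template strand is its reverse complement.
    rev_site = reverse_complement(reverse.upper())
    end = template.find(rev_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]

# Toy template: forward site, a (GA)5 repeat, then the reverse-primer site.
template = "TTT" + "ACGTAC" + "GA" * 5 + "GGATTC" + "AAA"
print(virtual_pcr(template, "ACGTAC", "GAATCC"))  # ACGTACGAGAGAGAGAGGATTC
print(virtual_pcr(template, "CCCCCC", "GAATCC"))  # None
```

Running the same primer pair against every sequence of a second fasta file, and keeping each amplicon found, would give the set of candidate paralog amplicons that the next module aligns.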
2.5. Global Alignment
For the global alignment between paralog and original amplicon sequences
and score calculations (match, mismatch, gaps), a routine was written in Delphi
language using the algorithms of Needleman and Wunsch (1970) [20] and Smith and
Waterman (1981) [21]. Also, in the same module, amplicon identities were calculated
according to Waterman (1994) [22] and Vingron and Waterman (1994) [23].
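A compact Python sketch of a Needleman-Wunsch global alignment with an identity calculation is given below. The scoring values (match +1, mismatch -1, gap -2) and the definition of identity as matching columns divided by alignment length are assumptions chosen for illustration; the text does not state the values or the exact formula used in SSRLocator.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

def identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same base in both sequences."""
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)

s, al_a, al_b = needleman_wunsch("GATTACA", "GATACA")
print(s, al_a, al_b, round(identity(al_a, al_b), 2))
```

With these settings the two toy sequences align with one gap, six matching columns out of seven, and a score of 4.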
2.6. Implementation
The strategy of creating a hybrid program in two languages was adopted
because of (i) the higher speed achieved when handling large text files with Perl as
compared to Delphi, and (ii) the better suitability of Perl for generating the
combinatorial strings to be located. The Perl module was converted into an
executable file, making it unnecessary to install Perl libraries during program
installation. The graphical interface, which integrates the input and output windows
with the Windows operating system, was built with the Turbo Delphi suite; a menu
system executes calls to each of the previously described modules.
2.7. Sequences for Analysis
A total of 28,469 nonredundant full-length cDNA sequences of rice (Oryza
sativa ssp. japonica cv. Nipponbare), sequenced by The Rice Full-Length cDNA
Consortium and mapped on the databases derived from the sequencing of the
japonica (japonica draft genome, BAC/PAC clones-IRGSP) and indica (indica draft
genome) subspecies [24], were used for the analyses. These sequences are
deposited in NCBI as two groups, the first comprising accessions AK058203 to
AK074028 and the second comprising accessions AK98843 to AK111488. All
these sequences can also be found in KOME (Knowledge-based Oryza Molecular
Biological Encyclopedia).
A flow chart representing the different steps performed by the software is
shown in Figure 1.
3. Results
3.1. Program Validation
A total of 3,907 micro- and minisatellites were detected by SSRLocator in the
28,469 analyzed cDNA sequences. The same database searched with MISA and
SSRIT yielded 3,913 and 3,917 loci, respectively. The counts of mono-, 4-mer,
6-mer, 7-mer, 8-mer, 9-mer, and 10-mer repeats were identical for the three
programs. For 2-mer repeats, SSRLocator detected 594 elements, whereas MISA
and SSRIT each detected 596. For 3-mer repeats, SSRLocator found 1,990 loci and
the other two programs found 1,994. For 5-mer repeats, SSRLocator and MISA
found the same number (426), while SSRIT found 430.
3.2. Overall Distribution of SSR Types
The results obtained with SSRLocator indicate that, out of 28,469 cDNA
sequences, 3,765 (13.22%) contained one or more micro-/minisatellite loci. In other
studies, microsatellites were found in the following proportions of ESTs: 3% in
Arabidopsis [25], 4% in Rosaceae [26], 8.11% in barley [16], 2.9% in sugarcane [27],
and values ranging between 6–11% [28] and 1.5–4.7% [29] for cereals in general
(maize, barley, rye, sorghum, rice, and wheat).
Of the 3,765 fl-cDNA sequences, 3,632 contained a single micro-/minisatellite
locus (92.96% of all detected loci). Two loci were detected in 125 sequences, three
loci in seven sequences, and four loci in only one sequence, adding up to 3,907
occurrences. Among the types analyzed, SSRs (mono- to 6-mer repeats) and
minisatellites (7- to 10-mer repeats) comprised 96.98% and 3.02% of detected loci,
respectively.
The occurrences detected by SSRLocator consisted of 138 monomers, 594
2-mers, 1,990 3-mers, 251 4-mers, 426 5-mers, 390 6-mers, 82 7-mers, 6 8-mers,
25 9-mers, and 5 10-mers, corresponding to rates of 3.53%, 15.20%, 50.93%,
6.42%, 10.90%, 9.98%, 2.10%, 0.15%, 0.64%, and 0.13%, respectively (see
Table 1).
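The percentages reported above follow directly from the locus counts, as the short Python check below shows. The variable names are illustrative; the counts are those given in the text.

```python
# Locus counts per motif length (1 = monomer ... 10 = 10-mer), from the text.
counts = {1: 138, 2: 594, 3: 1990, 4: 251, 5: 426,
          6: 390, 7: 82, 8: 6, 9: 25, 10: 5}

total = sum(counts.values())
percent = {k: round(100 * n / total, 2) for k, n in counts.items()}

print(total)  # 3907
for k in sorted(percent):
    print(f"{k}-mer: {counts[k]} loci, {percent[k]:.2f}%")
```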
For the remaining SSRs, average percentage values have been reported as
between 17 and 40% for 2-mer, 54–78% for 3-mer, 2.6–6.6% for 4-mer, 0.4–1.3% for
5-mer, and less than 1% for 6-mer repeats [28], and as 26.5% for 2-mer, 65.4% for 3-mer,
6.8% for 4-mer, 0.77% for 5-mer, and 0.45% for 6-mer repeats [30], in barley, maize,
wheat, sorghum, rye, and rice. In nonredundant transcripts from the
TIGR database, 15.6% 2-mer, 61.6% 3-mer, 8.5% 4-mer, and 14.4% 5-mer repeats
were found in rice [31]. The frequency of micro-/minisatellite loci per
million nucleotides (loci/Mb) [6] in this study was 2.94, 12.64, 42.34, 5.34, 9.06,
8.30, 1.74, 0.13, 0.53, and 0.11 for mono- to 10-mer repeats, respectively. An overall
frequency of 83.13 loci/Mb was found (see Table 1). In other studies, different rates
were reported in analyses of EST databases, such as 133 loci/Mb (barley), 161
loci/Mb (wheat, sorghum and rye), and 256 loci/Mb for rice [28]. Also, for
nonredundant ESTs in rice, sorghum, barley, wheat, and Arabidopsis, frequencies of
277, 169, 112, 94 and 133 loci/Mb were found, respectively [30]. Frequencies closer
to those found in this study were described for CDS regions of Rosaceae species,
with an average of 40.9–78 loci/Mb for rose, almond and peach, while 39 loci/Mb
were found for Arabidopsis [26].
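The loci/Mb figures are densities: the number of loci divided by the total sequence length expressed in millions of nucleotides. The sketch below illustrates the calculation in Python. The total length of the cDNA collection is not stated in this section, so the value of 47,000,000 nucleotides used in the example is an assumption, back-calculated approximately from the reported counts and densities, and should not be read as a reported figure.

```python
def loci_per_mb(n_loci, total_bases):
    """Density of loci per million nucleotides (loci/Mb)."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return n_loci / (total_bases / 1_000_000)


if __name__ == "__main__":
    # ASSUMPTION: 47.0 Mb is an approximate, back-calculated collection size,
    # used here only to illustrate the formula.
    assumed_total_bases = 47_000_000
    print(round(loci_per_mb(3907, assumed_total_bases), 2))
```

Because the density depends on the denominator, comparisons between studies are only meaningful when the databases (redundant or nonredundant ESTs, fl-cDNAs) and the counting thresholds are comparable.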
3.3. Occurrence Patterns for Different SSR and Minisatellite Types and Motifs
Monomers, 2-Mers, 3-Mers, and 4-Mers
Table 2 shows the counts and percentage values for the different
micro-/minisatellite motifs. For monomer, 2-mer and 3-mer repeats, all possible
arrangements are shown, while for 4-mer to 10-mer repeats, only the ten most
frequent motifs are shown.
The A/T monomer repeats were found in 125 loci, with 111 (88.80%) and 14
(11.20%) loci formed by A and T nucleotides, respectively. The C/G motifs were
found in 13 loci, with ten (76.92%) and three (23.08%) loci formed by C and G,
respectively. A/T-containing SSRs were predominant and comprised 90.58% of
monomer loci. In the overall distribution, the monomers represent 3.53% of the 3907
detected loci. Motifs AG/CT and GA/TC were the most frequent and added up to
84.52% of 2-mer SSRs, representing 6.89% and 5.96% of all 3907 detected
occurrences, respectively. The motifs CT, GA, and TC were the most abundant, with
172, 143, and 90 loci, respectively. In maize, barley, rice, sorghum, and wheat ESTs,
the motif AG was described as the most frequent [6, 16, 28, 29, 31, 32]. However, in
some studies, the most frequent motif was GA [30, 33]. Repeats composed of guanine
and cytosine were the most abundant among trimers, with occurrences of 18.44%,
17.89%, and 10.60%, respectively, for the motifs CCG/CGG, CGC/GCG, and GCC/GGC,
adding up to 23.9% of the overall frequencies of micro-/minisatellites in the analysis.
The motifs CGC, CCG, and CGG were the most frequent comprising 218, 197,
and 170 loci, respectively. Many reports indicate the 3-mer CCG as the most frequent
in maize, barley, wheat, sorghum and rye [6, 16, 28, 32], sugarcane [27] and rice [29,
31]. Among 4-mers, 100 different arrangements were found, where the motifs GATC
(7.17%), ATTA/TAAT (6.77%), and ATCG/CGAT (5.98%) were the most frequent.
These motifs add up to 19.92% of 4-mer repeats found and represent 1.28% of the
overall content of micro-/minisatellites.
In barley ESTs, ACGT was reported as the most abundant motif [16, 28]. For
other species, AAAG/CTTT and AAGG/CCTT in Lolium perenne [34], AAAG/CTTT
and AAAC/GTTT in Arabidopsis UTRs [6, 35], and AAAT and AAAG in citrus [36, 37]
were described as most abundant.
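Table 2 reports each motif together with its reverse complement (for example AG/CT or CCG/CGG), since both describe the same locus read from opposite strands. The following Python sketch shows one way such pooling could be done. The function names are hypothetical and the code is only an illustration, not the SSRLocator implementation. Cyclic permutations (such as AG and GA) are kept as separate groups, as in Table 2.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA motif (upper-case ACGT only)."""
    return motif.translate(_COMPLEMENT)[::-1]


def motif_group(motif):
    """Label shared by a motif and its reverse complement, e.g. 'AG/CT'."""
    return "/".join(sorted({motif, reverse_complement(motif)}))


def pool_by_group(strand_counts):
    """Sum per-motif counts into motif/reverse-complement groups."""
    pooled = Counter()
    for motif, n in strand_counts.items():
        pooled[motif_group(motif)] += n
    return dict(pooled)


if __name__ == "__main__":
    # Per-motif 2-mer counts taken from Table 2.
    print(pool_by_group({"AG": 97, "CT": 172, "GA": 143, "TC": 90}))
```

Motifs that are their own reverse complement, such as GATC, AT, TA or CG, form a group with a single member, which is why they appear without a partner in Table 2.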
3.4. Remaining Repeats
Among 5-mers, 188 different arrangements were detected and the most
frequent were CTCCT, CTCTC, and CCTCC with 17, 17, and 12 occurrences,
respectively. In the analysis of CDS regions, the ACCCG motif was the most frequent
in Arabidopsis, AAAAG in S. cerevisiae and C. elegans, and AAAAC in different primates
[38]. Also, the motifs AAAAT, AAAAC, and AAAAG were described as the most
frequent in eukaryotes [39].
In rice, the motifs AGAGG and AGGGG were the most abundant [31]. Repeats
of type 6-mer were detected in 230 different arrangements, where CGCCTC and
TCGCCG were the most frequent, occurring in 12 and 10 loci, respectively. Other
studies have shown higher frequencies for the motifs AAGATG and AAAAAT in
Arabidopsis [35], AAAAAG in citrus [36], AACACG in S. cerevisiae, ACCAGG in C.
elegans and CCCCGG in primates [38]. For all remaining repeats (minisatellites), the
occurrences are widely distributed with low-percentage values for each arrangement.
For 7-mer, 8-mer, 9-mer, and 10-mer repeats, the totals of occurrences were 57, 5,
23, and 5, respectively.
3.5. Primer Design and PCR Simulation
The design of primers for the 3907 detected micro-/minisatellites resulted in
3329 primer pairs, covering 85.20% of loci. The “Virtual PCR” run generated a
total of 4610 amplicons. A module in SSRLocator checks for primer redundancy. A
total of 2397 primer pairs amplified only the fragment from their original locus (specific
amplicons) and 932 pairs amplified one or more regions besides the original locus.
Of these, 692 pairs amplified two fragments, one from the original site and a
second from another region (paralogous). In this case, 692 specific amplicons plus
692 redundant amplicons were detected. A total of 143, 90, 2, and 5 primer pairs
generated three (two redundancies), four (three redundancies), five (four
redundancies), and six (five redundancies) fragments, respectively. Overall, the
932 primer pairs with more than one anchoring region produced 932 specific
amplicons and 1281 redundant amplicons, adding up to 2213 fragments.
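The amplicon totals in this paragraph can be cross-checked with simple bookkeeping: a primer pair that yields k fragments contributes one specific amplicon and k - 1 redundant amplicons. The short Python sketch below (function and variable names are illustrative only) reproduces the totals from the counts given in the text.

```python
# Primer pairs that amplified only their original locus.
SPECIFIC_ONLY_PAIRS = 2397

# fragments per primer pair -> number of primer pairs, as reported in the text.
REDUNDANT_PAIRS = {2: 692, 3: 143, 4: 90, 5: 2, 6: 5}


def amplicon_totals(specific_only_pairs, redundant_pairs):
    """Return (redundant primer pairs, redundant amplicons, all amplicons)."""
    n_redundant_pairs = sum(redundant_pairs.values())
    # Each pair with k fragments has one specific and k - 1 redundant amplicons.
    redundant_amplicons = sum((k - 1) * n for k, n in redundant_pairs.items())
    total = specific_only_pairs + n_redundant_pairs + redundant_amplicons
    return n_redundant_pairs, redundant_amplicons, total


if __name__ == "__main__":
    print(amplicon_totals(SPECIFIC_ONLY_PAIRS, REDUNDANT_PAIRS))
```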
To investigate the ability of these primers in amplifying genomic sequences,
an extra experiment was performed against the whole rice genomic sequence
available at NCBI. The different groups of redundant and nonredundant primer sets,
that is, amplifying one, two, three, or more times in the cDNA database, were tested
against the genomic sequence.
Of the 2397 nonredundant primer pairs, only 924 amplified a locus in the
genomic sequence. This difference was expected because of difficulties in
amplifying genomic regions: if a primer anneals to a boundary region
between two exons in the cDNA, the presence of an intron makes this annealing
site unavailable in the genomic sequence. It is interesting to note that of the 924
amplicons detected, 914 (99%) amplified only one locus in the genomic region, in
agreement with the cDNA results.
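The exon-boundary effect described above can be illustrated with a toy in silico PCR. The Python sketch below is a deliberately simplified assumption: it requires exact primer matches, ignores product-size limits, and searches one strand only, which is not necessarily how the Virtual-PCR module of SSRLocator works. All sequences and names are invented for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def virtual_pcr(template, forward, reverse):
    """Return amplicons bounded by exact matches of both primers.

    The forward primer is searched as given; the reverse primer is searched
    as its reverse complement, downstream of the forward site.
    """
    reverse_site = reverse_complement(reverse)
    amplicons = []
    start = template.find(forward)
    while start != -1:
        end = template.find(reverse_site, start + len(forward))
        if end != -1:
            amplicons.append(template[start:end + len(reverse_site)])
        start = template.find(forward, start + 1)
    return amplicons


# Toy gene (hypothetical): two exons separated by one intron.
exon1 = "ATGGCCAAGCTT"
intron = "GTAAGTTTTTTCAG"
exon2 = "GGATCCGAGAGAGAGAGAGAGTTCACGTCA"
cdna = exon1 + exon2
genomic = exon1 + intron + exon2

# The forward primer spans the exon1/exon2 junction of the cDNA.
forward = exon1[-5:] + exon2[:5]
reverse = reverse_complement(exon2[-8:])

if __name__ == "__main__":
    print(len(virtual_pcr(cdna, forward, reverse)))     # amplicons on the cDNA
    print(len(virtual_pcr(genomic, forward, reverse)))  # amplicons on the genomic copy
```

In this toy case the primer pair yields a product on the cDNA but none on the genomic copy, because the intron interrupts the forward annealing site.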
When the primer sets that amplified two different cDNAs were run against the
genomic sequence, only 294/692 (42.5%) amplified, with 14.5% amplifying
two different loci. Only one primer set amplified more than two loci. These
results indicate that SSRLocator performance was consistent between the two
databases regarding the nonredundant loci, that is, the loci that could be
amplified in both databases kept their nonredundant status. The
changes observed for the redundant loci can be attributed to many causes,
including redundancy in the cDNA database, but also to biological reasons due to
primer positioning.
3.6. Identity between Specific and Redundant Amplicons
Results of global alignment between amplicons from original and redundant
sites are shown in Table 3. Among the 1281 redundant amplifications, 787 (61.44%)
resulted in a perfect alignment between both loci (identity equal to 100%). For
redundant amplicons with identity levels of 96–99% and 90–95%, 452 (35.28%) and
8 (0.62%) loci were found, respectively. Alignments with identity levels below 90%
were found in only 2.65% of cases. The fact that such a high percentage of
redundant loci show high identity is probably a consequence of the genome fraction
chosen, that is, expressed sequences. This fraction is under tight selection pressure
and should not accumulate variations such as substitutions or indels at a high rate.
As expected, comparisons to the whole genome generated a great deal of
polymorphism, due to the inclusion of intronic regions in the alignments (data not
shown).
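To make the notion of identity between two amplicons concrete, the following Python sketch performs a basic global alignment in the style of Needleman and Wunsch [20] and reports the percentage of identical columns. The scoring values (match +1, mismatch -1, linear gap -2) and the definition of identity as matches divided by alignment length are assumptions made for this illustration. The parameters actually used by SSRLocator are not given in this section.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity of two sequences under a simple global alignment."""
    n, m = len(a), len(b)
    # Dynamic-programming matrix of best scores for prefixes a[:i], b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical columns.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return 100.0 * matches / columns if columns else 100.0


if __name__ == "__main__":
    print(global_identity("ACGTACGT", "ACGTTCGT"))
```

With identities computed in this way, each redundant amplicon can be assigned to one of the identity classes of Table 3.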
4. Conclusions
The software SSRLocator was successfully implemented, adding steps for (1)
SSR discovery, (2) primer design, and (3) PCR simulation between the primers
obtained from original sequences and other fasta files. Also, the software produces
reports for frequency of occurrence, nucleotide arrangement, primer lists with all
standard information needed for PCR and global alignments. From the PCR
simulation, it was possible to point out which primer pairs were nonredundant,
suggesting that these primers are more appropriate for mapping purposes. In this
case, however, wet lab experiments should be performed to confirm the advantage of
nonredundant over redundant primers for mapping.
It is possible that the results for micro-/minisatellite frequencies (loci/Mb)
obtained in this study diverge from the results found in the literature. This can be
explained by the different databases used (redundant ESTs, nonredundant ESTs
and/or fl-cDNA), different algorithm configurations and minimum requirements set for
counting motifs. Another explanation for some contrasting results is the fact that only
“Class I” repeats were analyzed in our study.
The results showed that 932 (27.99%) primers presented amplifications in
more than one gene sequence. This could be mostly due to the fact that primer pairs
derived from a specific gene (cDNA) anchored in similar sites in other duplicated
genes, since 5,607/28,469 (19.70%) genes were described as paralogs in the
annotation of the database used [24]. Gene duplication along with polyploidy and
transposon amplification are the major driving forces in genome evolution [40].
It is therefore not surprising that so many loci have redundancy. Also, a
second possibility is that some primers were generated from protein domain regions
within the analyzed cDNAs. These domains could be found in protein families with
many genome copies, resulting in the observed redundancies. A validation of the
redundancies of cDNA results was obtained through a virtual-PCR against the whole
rice genome sequence. From the nonredundant primers that generated an amplicon,
ca. 99% were nonredundant.
Finally, this tool can be used successfully for data mining strategies to find
SSR primers in genomic or expressed sequences (ESTs/cDNAs). Also, this software
can be a tool for microsatellite discovery in databanks of related species, anchoring
primers in ortholog or paralog regions contained between databases from two
different species.
References
1. M. Morgante, M. Hanafey, and W. Powell, “Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes,” Nature Genetics, vol. 30, no. 2, pp. 194–200, 2002.
2. R. R. Iyer, A. Pluciennik, W. A. Rosche, R. R. Sinden, and R. D. Wells, “DNA polymerase III proofreading mutants enhance the expansion and deletion of triplet repeat sequences in Escherichia coli,” Journal of Biological Chemistry, vol. 275, no. 3, pp. 2174–2184, 2000.
3. H. Ellegren, “Microsatellites: simple sequences with complex evolution,” Nature Reviews Genetics, vol. 5, no. 6, pp. 435–445, 2004.
4. S. M. Mirkin, “DNA structures, repeat expansions and human hereditary disorders,” Current Opinion in Structural Biology, vol. 16, no. 3, pp. 351–358, 2006.
5. B. Li, Q. Xia, C. Lu, Z. Zhou, and Z. Xiang, “Analysis on frequency and density of microsatellites in coding sequences of several eukaryotic genomes,” Genomics Proteomics & Bioinformatics, vol. 2, no. 1, pp. 24–31, 2004.
6. M. Morgante, M. Hanafey, and W. Powell, “Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes,” Nature Genetics, vol. 30, no. 2, pp. 194–200, 2002.
7. S. Subramanian, R. K. Mishra, and L. Singh, “Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions,” Genome Biology, vol. 4, no. 2, p. R13, 2003.
8. R. K. Varshney, A. Graner, and M. E. Sorrells, “Genic microsatellite markers in plants: features and applications,” Trends in Biotechnology, vol. 23, no. 1, pp. 48–55, 2005.
9. M. Bilgen, M. Karaca, A. N. Onus, and A. G. Ince, “A software program combining sequence motif searches with keywords for finding repeats containing DNA sequences,” Bioinformatics, vol. 20, no. 18, pp. 3379–3386, 2004.
10. W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proceedings of the National Academy of Sciences of the United States of America, vol. 85, no. 8, pp. 2444–2448, 1988.
11. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
12. C. Abajian, SPUTNIK, 1994, http://www.abajian.com/sputnik.
13. A. F. A. Smit, R. Hubley, and P. Green, RepeatMasker Open-3.0, 1996, http://www.repeatmasker.org.
14. G. Benson, “Tandem repeats finder: a program to analyze DNA sequences,” Nucleic Acids Research, vol. 27, no. 2, pp. 573–580, 1999.
15. A. T. Castelo, W. Martins, and G. R. Gao, “TROLL: tandem repeat occurrence locator,” Bioinformatics, vol. 18, no. 4, pp. 634–636, 2002.
16. T. Thiel, W. Michalek, R. K. Varshney, and A. Graner, “Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.),” Theoretical and Applied Genetics, vol. 106, no. 3, pp. 411–422, 2003.
17. S. Temnykh, G. DeClerck, A. Lukashova, L. Lipovich, S. Cartinhour, and S. McCouch, “Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential,” Genome Research, vol. 11, no. 8, pp. 1441–1452, 2001.
18. G. D. Schuler, “Sequence mapping by electronic PCR,” Genome Research, vol. 7, no. 5, pp. 541–550, 1997.
19. S. Rozen and H. Skaletsky, “Primer3 on the WWW for general users and for biologist programmers,” Methods in Molecular Biology, vol. 132, part 3, pp. 365–386, 2000.
20. S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
21. T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
22. M. Waterman, “Estimating statistical significance of sequence alignments,” Philosophical Transactions of the Royal Society of London. Series B, vol. 344, no. 1310, pp. 383–390, 1994.
23. M. Vingron and M. S. Waterman, “Sequence alignment and penalty choice. Review of concepts, case studies and implications,” Journal of Molecular Biology, vol. 235, no. 1, pp. 1–12, 1994.
24. S. Kikuchi, K. Satoh, T. Nagata, et al., “Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice: the rice full-length cDNA consortium,” Science, vol. 301, no. 5631, pp. 376–379, 2003.
25. L. Cardle, L. Ramsay, D. Milbourne, M. Macaulay, D. Marshall, and R. Waugh, “Computational and experimental characterization of physically clustered simple sequence repeats in plants,” Genetics, vol. 156, no. 2, pp. 847–854, 2000.
26. S. Jung, A. Abbott, C. Jesudurai, J. Tomkins, and D. Main, “Frequency, type, distribution and annotation of simple sequence repeats in Rosaceae ESTs,” Functional & Integrative Genomics, vol. 5, no. 3, pp. 136–143, 2005.
27. G. M. Cordeiro, R. Casu, C. L. McIntyre, J. M. Manners, and R. J. Henry, “Microsatellite markers from sugarcane (Saccharum spp.) ESTs cross transferable to erianthus and sorghum,” Plant Science, vol. 160, no. 6, pp. 1115–1123, 2001.
28. R. K. Varshney, T. Thiel, N. Stein, P. Langridge, and A. Graner, “In silico analysis on frequency and distribution of microsatellites in ESTs of some cereal species,” Cellular & Molecular Biology Letters, vol. 7, no. 2A, pp. 537–546, 2002.
29. R. V. Kantety, M. La Rota, D. E. Matthews, and M. E. Sorrells, “Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat,” Plant Molecular Biology, vol. 48, no. 5-6, pp. 501–510, 2002.
30. S. K. Parida, K. Anand Raj Kumar, V. Dalal, N. K. Singh, and T. Mohapatra, “Unigene derived microsatellite markers for the cereal genomes,” Theoretical and Applied Genetics, vol. 112, no. 5, pp. 808–817, 2006.
31. M. La Rota, R. V. Kantety, J.-K. Yu, and M. E. Sorrells, “Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley,” BMC Genomics, vol. 6, article 23, 2005.
32. J.-K. Yu, T. M. Dake, S. Singh, et al., “Development and mapping of EST-derived simple sequence repeat markers for hexaploid wheat,” Genome, vol. 47, no. 5, pp. 805–818, 2004.
33. N. Nicot, V. Chiquet, B. Gandon, et al., “Study of simple sequence repeat (SSR) markers from wheat expressed sequence tags (ESTs),” Theoretical and Applied Genetics, vol. 109, no. 4, pp. 800–805, 2004.
34. T. Asp, U. K. Frei, T. Didion, K. K. Nielsen, and T. Lübberstedt, “Frequency, type, and distribution of EST-SSRs from three genotypes of Lolium perenne, and their conservation across orthologous sequences of Festuca arundinacea, Brachypodium distachyon, and Oryza sativa,” BMC Plant Biology, vol. 7, article 36, 2007.
35. L. Zhang, D. Yuan, S. Yu, et al., “Preference of simple sequence repeats in coding and non-coding regions of Arabidopsis thaliana,” Bioinformatics, vol. 20, no. 7, pp. 1081–1086, 2004.
36. D. Jiang, G.-Y. Zhong, and Q.-B. Hong, “Analysis of microsatellites in citrus unigenes,” Acta Genetica Sinica, vol. 33, no. 4, pp. 345–353, 2006.
37. D. A. Palmieri, V. M. Novelli, M. Bastianel, et al., “Frequency and distribution of microsatellites from ESTs of citrus,” Genetics and Molecular Biology, vol. 30, no. 3, supplement, pp. 1009–1018, 2007.
38. G. Tóth, Z. Gáspári, and J. Jurka, “Microsatellites in different eukaryotic genomes: surveys and analysis,” Genome Research, vol. 10, no. 7, pp. 967–981, 2000.
39. Y.-C. Li, A. B. Korol, T. Fahima, and E. Nevo, “Microsatellites within genes: structure, function, and evolution,” Molecular Biology and Evolution, vol. 21, no. 6, pp. 991–1007, 2004.
40. E. A. Kellogg and J. L. Bennetzen, “The evolution of nuclear genome structure in seed plants,” American Journal of Botany, vol. 91, no. 10, pp. 1709–1725, 2004.
Figure 1. Flow-chart showing the functional structure of SSR Locator. (A) Perl script to search SSRs; (B) text file where information
from detected SSRs is stored; (C) module for the statistical calculations for SSR motif occurrence; (D) module that formats text files
into standard Primer3 input files; (E) running of Primer3; (F) module for running Virtual-PCR (using a second sequence file as a
template); (G) module performing global alignment between homologous amplicons; (H) identity and alignment score calculations
between homologous amplicons; and (I) file containing SSR, primer, homologous amplicons, identity, and score information.
Table 1: Distribution of SSR/minisatellite motifs according to the number of repeats.
Repeats  Mono- (%)  2-mer (%)  3-mer (%)  4-mer (%)  5-mer (%)  6-mer (%)  7-mer (%)  8-mer (%)  9-mer (%)  10-mer (%)  Total (%)
3  0 -  0 -  0 -  0 -  0 -  0 -  78 95.12  6 100  24 96  5 100  113 2.89
4  0 -  0 -  0 -  0 -  348 81.69  323 82.82  4 4.88  0 0  1 4  0 0  676 17.30
5  0 -  0 -  0 -  181 72.11  69 16.20  45 11.54  0 0  0 0  0 0  0 0  295 7.55
6  0 -  0 -  0 -  41 16.33  7 1.64  13 3.33  0 0  0 0  0 0  0 0  61 1.56
7  0 -  0 -  1220 61.31  9 3.59  0 0  5 1.28  0 0  0 0  0 0  0 0  1234 31.58
8  0 -  0 -  441 22.16  9 3.59  1 0.23  1 0.26  0 0  0 0  0 0  0 0  452 11.57
9  0 -  0 -  173 8.69  4 1.59  0 0  1 0.26  0 0  0 0  0 0  0 0  178 4.56
10  0 -  125 21.04  68 3.42  1 0.40  0 0  2 0.51  0 0  0 0  0 0  0 0  196 5.02
11  0 -  82 13.80  32 1.61  3 1.20  0 0  0 0  0 0  0 0  0 0  0 0  117 2.99
12  0 -  76 12.79  18 0.90  1 0.40  0 0  0 0  0 0  0 0  0 0  0 0  95 2.43
13  0 -  71 11.95  5 0.25  1 0.40  0 0  0 0  0 0  0 0  0 0  0 0  77 1.97
14  0 -  39 6.57  2 0.10  0 0  0 0  0 0  0 0  0 0  0 0  0 0  41 1.05
15  0 -  44 7.41  5 0.25  0 0  1 0.23  0 0  0 0  0 0  0 0  0 0  50 1.28
16  0 -  30 5.05  2 0.10  0 0  0 0  0 0  0 0  0 0  0 0  0 0  32 0.82
17  0 -  33 5.56  1 0.05  0 0  0 0  0 0  0 0  0 0  0 0  0 0  34 0.87
18  0 -  15 2.53  3 0.15  0 0  0 0  0 0  0 0  0 0  0 0  0 0  18 0.46
19  0 -  17 2.86  1 0.05  1 0  0 0  0 0  0 0  0 0  0 0  0 0  19 0.49
20  21 15.22  14 2.36  2 0.10  0 0  0 0  0 0  0 0  0 0  0 0  0 0  37 0.95
21  19 13.77  8 1.35  2 0.10  0 0  0 0  0 0  0 0  0 0  0 0  0 0  29 0.74
22  15 10.87  6 1.01  3 0.15  0 0  0 0  0 0  0 0  0 0  0 0  0 0  24 0.61
23  8 5.80  7 1.18  3 0.15  0 0  0 0  0 0  0 0  0 0  0 0  0 0  18 0.46
24  3 2.17  5 0.84  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  8 0.20
25  9 6.52  5 0.84  1 0.05  0 0  0 0  0 0  0 0  0 0  0 0  0 0  15 0.38
26  5 3.62  4 0.67  2 0.10  0 0  0 0  0 0  0 0  0 0  0 0  0 0  11 0.28
27  3 2.17  1 0.17  1 0.05  0 0  0 0  0 0  0 0  0 0  0 0  0 0  5 0.13
28  1 0.72  3 0.51  3 0.15  0 0  0 0  0 0  0 0  0 0  0 0  0 0  7 0.18
29  4 2.90  0 0  1 0.05  0 0  0 0  0 0  0 0  0 0  0 0  0 0  5 0.13
30  2 1.45  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  2 0.05
31  9 6.52  2 0.34  1 0.05  0 0  0 0  0 0  0 0  0 0  0 0  0 0  12 0.31
32  3 2.17  3 0.51  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  6 0.15
33  3 2.17  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  3 0.08
34  1 0.72  1 0.17  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  2 0.05
35  6 4.35  1 0.17  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  7 0.18
36  1 0.72  1 0.17  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  2 0.05
37  1 0.72  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  1 0.03
38  4 2.90  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  4 0.10
39  0 0  1 0.17  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  1 0.03
40  1 0.72  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  1 0.03
41  1 0.72  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  1 0.03
42  2 1.45  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  2 0.05
43  2 1.45  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  2 0.05
44  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0.00
≥45  14 10.14  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  0 0  14 0.36
Total  138  594  1990  251  426  390  82  6  25  5  3907
(%)  3.53  15.20  50.93  6.42  10.90  9.98  2.10  0.15  0.64  0.13  100.00
Table 2: Distribution of SSR/minisatellite repeats in the rice cDNA collection.
Motif  Occur.(1)  (%)(1)  Occur.(2)  (%)(2)  Total  (%) Group  (%) Overall
Mono-
A/T  111  88.80  14  11.20  125  90.58  3.20
C/G  10  76.92  3  23.08  13  9.42  0.33
2-mer
AG/CT  97  36.06  172  63.94  269  45.29  6.89
GA/TC  143  61.37  90  38.63  233  39.23  5.96
CA/TG  10  35.71  18  64.29  28  4.71  0.72
AT  24  100.00  -  -  24  4.04  0.61
AC/GT  6  31.58  13  68.42  19  3.20  0.49
TA  19  100.00  -  -  19  3.20  0.49
CG  2  100.00  -  -  2  0.34  0.05
3-mer
CCG/CGG  197  53.68  170  46.32  367  18.44  9.39
CGC/GCG  218  61.24  138  38.76  356  17.89  9.11
GCC/GGC  112  53.08  99  46.92  211  10.60  5.40
CTC/GAG  73  42.69  98  57.31  171  8.59  4.38
AGG/CCT  34  30.91  76  69.09  110  5.53  2.82
GGA/TCC  60  62.50  36  37.50  96  4.82  2.46
CAG/CTG  58  76.32  18  23.68  76  3.82  1.95
AAG/CTT  34  50.75  33  49.25  67  3.37  1.71
CGA/TCG  33  54.10  28  45.90  61  3.07  1.56
AGC/GCT  36  62.07  22  37.93  58  2.91  1.48
GCA/TGC  47  83.93  9  16.07  56  2.81  1.43
AGA/TCT  33  62.26  20  37.74  53  2.66  1.36
CCA/TGG  39  75.00  13  25.00  52  2.61  1.33
ACC/GGT  22  48.89  23  51.11  45  2.26  1.15
GAA/TTC  28  63.64  16  36.36  44  2.21  1.13
CAC/GTG  28  65.12  15  34.88  43  2.16  1.10
GAC/GTC  18  54.55  15  45.45  33  1.66  0.84
ACG/CGT  11  42.31  15  57.69  26  1.31  0.67
ATC/GAT  5  45.45  6  54.55  11  0.55  0.28
TCA/TGA  5  50.00  5  50.00  10  0.50  0.26
CAA/TTG  4  50.00  4  50.00  8  0.40  0.20
ACT/AGT  3  42.86  4  57.14  7  0.35  0.18
TAA/TTA  1  14.29  6  85.71  7  0.35  0.18
CTA/TAG  4  66.67  2  33.33  6  0.30  0.15
AAT/ATT  1  20.00  4  80.00  5  0.25  0.13
CAT/ATG  4  100.00  -  -  4  0.20  0.10
AAC/GTT  3  75.00  1  25.00  4  0.20  0.10
ATA/TAT  1  50.00  1  50.00  2  0.10  0.05
GTA/TAC  1  100.00  -  -  1  0.05  0.03
4-mer
GATC  18  100.00  0  0  18  7.17  0.46
ATTA/TAAT  9  52.94  8  47.06  17  6.77  0.44
ATCG/CGAT  3  20.00  12  80.00  15  5.98  0.38
CATC/GATG  4  40.00  6  60.00  10  3.98  0.26
AGAA/TTCT  2  25.00  6  75.00  8  3.19  0.20
GCTA/TAGC  6  75.00  2  25.00  8  3.19  0.20
GATA/TATC  1  14.29  6  85.71  7  2.79  0.18
GCGA/TCGC  3  42.86  4  57.14  7  2.79  0.18
GCAC/GTGC  2  33.33  4  66.67  6  2.39  0.15
AGGG/CCCT  2  33.33  4  66.67  6  2.39  0.15
5-mer
AGGAG/CTCCT  3  15.00  17  85.00  20  4.69  0.51
CTCTC/GAGAG  17  89.47  2  10.53  19  4.46  0.49
GAGGA/TCCTC  9  56.25  7  43.75  16  3.76  0.41
CCTCC/GGAGG  12  80.00  3  20.00  15  3.52  0.38
AGAGG/CCTCT  4  26.67  11  73.33  15  3.52  0.38
GGAGA/TCTCC  2  18.18  9  81.82  11  2.58  0.28
CTCGC/GCGAG  7  77.78  2  22.22  9  2.11  0.23
AGCTA/TAGCT  4  44.44  5  55.56  9  2.11  0.23
GAAAA/TTTTC  2  25.00  6  75.00  8  1.88  0.20
AGGCG/CGCCT  2  25.00  6  75.00  8  1.88  0.20
6-mer
CGCCTC/GAGGCG  12  85.71  2  14.29  14  3.59  0.36
CGGCGA/TCGCCG  4  28.57  10  71.43  14  3.59  0.36
CCTCCG/CGGAGG  9  81.82  2  18.18  11  2.82  0.28
AGGCGG/CCGCCT  1  10.00  9  90.00  10  2.56  0.26
CCGTCG/CGACGG  4  44.44  5  55.56  9  2.31  0.23
CGTCGC/GCGACG  7  77.78  2  22.22  9  2.31  0.23
ACCGCC/GGCGGT  1  12.50  7  87.50  8  2.05  0.20
CCACCG/CGGTGG  6  85.71  1  14.29  7  1.79  0.18
GGCGGA/TCCGCC  5  71.43  2  28.57  7  1.79  0.18
CTCCAT/ATGGAG  6  100.00  0  0  6  1.54  0.15
7-mer
CCGCCGC/GCGGCGG  4  66.67  2  33.33  6  7.32  0.15
CTCTCTC/GAGAGAG  4  80.00  1  20.00  5  6.10  0.13
CCTCTCT/AGAGAGG  4  100.00  0  0  4  4.88  0.10
CTCTCTT/AAGAGAG  4  100.00  0  0  4  4.88  0.10
CCCAAAT/ATTTGGG  3  100.00  0  0  3  3.66  0.08
GCCGCCG/CGGCGGC  3  100.00  0  0  3  3.66  0.08
GCGGCGC/GCGCCGC  2  100.00  0  0  2  2.44  0.05
AATAAAA/TTTTATT  2  100.00  0  0  2  2.44  0.05
GTGTGCG/CGCACAC  2  100.00  0  0  2  2.44  0.05
CGCCGTC/GACGGCG  2  100.00  0  0  2  2.44  0.05
8-mer
TTGGTTTC/GAAACCAA  2  100.00  0  0  2  33.33  0.05
TGGGCTTG/CAAGCCCA  1  100.00  0  0  1  16.67  0.03
GCTTCTTG/CAAGAAGC  1  100.00  0  0  1  16.67  0.03
ACGGGCGA/TCGCCCGT  1  100.00  0  0  1  16.67  0.03
ATGATGTA/TACATCAT  1  100.00  0  0  1  16.67  0.03
9-mer
TCGGCGGCG/CGCCGCCGA  2  100.00  0  0  2  8.00  0.05
AGGTGGTGG/CCACCACCT  2  100.00  0  0  2  8.00  0.05
CCGGTGCGA/TCGCACCGG  1  100.00  0  0  1  4.00  0.03
ACGAGGAGG/CCTCCTCGT  1  100.00  0  0  1  4.00  0.03
TCCCTTTTC/GAAAAGGGA  1  100.00  0  0  1  4.00  0.03
CGGCATGAA/TTCATGCCG  1  100.00  0  0  1  4.00  0.03
CGGCAGCGA/TCGCTGCCG  1  100.00  0  0  1  4.00  0.03
ACCATCCCG/CGGGATGGT  1  100.00  0  0  1  4.00  0.03
ATGGGCGGC/GCCGCCCAT  1  100.00  0  0  1  4.00  0.03
ATGCAGGGT/ACCCTGCAT  1  100.00  0  0  1  4.00  0.03
10-mer
AGCCCCAACG/CGTTGGGGCT  1  50.00  1  50.00  2  40.00  0.05
TTTTTTTCTT/AAGAAAAAAA  1  100.00  0  0  1  20.00  0.03
CCTGCTTTGC/GCAAAGCAGG  1  100  0  0  1  20  0.03
ATCTCCGCCG/CGGCGGAGAT  1  100  0  0  1  20  0.03
Table 3: Distribution of amplicon alignments for specific and redundant amplicons
with varying identity levels.
Identity  100  99  98  97  96  95–90  89–80  79–70  69–60  ≤59  Total
Amplicons  787  261  151  29  11  8  8  6  5  15  1281
%  61.44  20.37  11.79  2.26  0.86  0.62  0.62  0.47  0.39  1.17  -
3. Tandem Repeat distribution in gene transcripts of three plant families
Genetics and Molecular Biology (ISSN 1415-4757)
ABSTRACT
Tandem Repeats (Microsatellites or SSRs) are molecular markers with great
potential for plant genetic studies. Modern strategies include the transfer of these
markers between widely studied and orphan species. In silico analyses make it possible
to study the distribution patterns of microsatellites and to predict which motifs would be
more amenable to interspecies transfer. Transcribed sequences (Unigene) from ten
species of three plant families were surveyed for the occurrence of micro and
minisatellites. Transcripts from different species displayed different rates of tandem
repeat occurrences, ranging from 1.47% to 11.28%. Similar as well as different
patterns were found within and between plant families. The results also indicate a
lack of association between genome size and tandem repeat fractions in expressed
regions. The conservation of motifs among species and its implications for
evolution and genome dynamics are discussed.
INTRODUCTION
Microsatellites or SSRs (simple sequence repeats) are DNA sequences
formed by the tandem arrangement of motifs of one to six base pairs,
widely distributed in prokaryote and eukaryote genomes (Morgante
and Olivieri, 1993; Tóth et al., 2000). Microsatellite regions tend to form loops or
hairpin structures, leading to slippage of DNA polymerase during replication and
causing the insertion or deletion of nucleotides (Iyer et al., 2000). Expansions
and/or contractions of microsatellites may lead to a gain or loss of gene function (Li et
al., 2002, 2004a). Initially, it was suggested that the occurrence and distribution of
microsatellites were the result of random processes. However, new evidence indicates
that the genomic distribution of these repeats originates from non-random
processes (Bell, 1996; Li et al., 2004b). Microsatellites have been reported to
correspond to 0.85% of Arabidopsis (Arabidopsis thaliana), 0.37% of maize (Zea
mays subsp. mays), 3.21% of fugu fish (Fugu rubripes), 0.21% of the nematode
Caenorhabditis elegans and 0.30% of yeast (Saccharomyces cerevisiae) genomes
(Morgante et al., 2002). They also make up 3.00% of the human genome
(Subramanian et al., 2003).
For microsatellites located in genic regions, 5´UTRs are the hotspot for the
presence of this type of repeat. It is known that contractions and/or expansions of
repeats found in 5´UTR regions alter the transcription and/or the translation of these
genes (Li et al., 2004b; Zhang et al., 2006a). Mutations in microsatellite loci found in
3´UTR regions are associated with gene silencing, changes in transcript export to the
cytosol and in splicing mechanisms, as well as changes in the expression levels of
flanking genes (Davis et al., 1997; Thornton et al., 1997; Philips et al., 1998; Conne
et al., 2000). For coding sequences (CDS), the impact of mutations has been described
as functional changes, loss of function and protein truncation (Li et al., 2004b). In
plants, although many studies have reported microsatellite frequencies in transcribed
regions (Temnykh et al., 2001; McCouch et al., 2002; Morgante et al., 2002; Thiel et al.,
2003; Nicot et al., 2004; Kashi and King, 2006; Lawon and Zhang, 2006; Varshney et
al., 2006; Zhang et al., 2006b), additional comparative or descriptive analyses can
offer novel perspectives on their use as molecular markers. The genomic abundance
of microsatellites and their association with many phenotypes make this class of
molecular markers a powerful tool for many applications in plant genetics. The
identification of microsatellite markers derived from ESTs and/or cDNAs, described as
functional markers, represents an even more useful possibility for these markers when
compared to markers based on anonymous regions (Varshney et
al., 2005, 2006).
In order to provide information regarding the patterns of microsatellite
occurrence and distribution on transcribed genome regions, non-redundant full-length
cDNAs (fl-cDNAs) and/or ESTs belonging to ten plant species from three different
families (Brassicaceae, Solanaceae and Poaceae) were used.
MATERIAL AND METHODS
Obtaining the expressed sequences
Files containing expressed sequences deposited in the NCBI Unigene database
were obtained for the following families/species: Brassicaceae (Arabidopsis thaliana
and Brassica napus), Solanaceae (Solanum lycopersicum and Solanum tuberosum) and
Poaceae (Oryza sativa, Sorghum bicolor, Triticum aestivum, Zea mays, Saccharum
officinarum and Hordeum vulgare). The non-redundant yet
representative sequences for all known genes in each species were selected. The
sequences used in the present study were downloaded from the Unigene database
in June 2008.
Distribution of sequences in different transcribed regions
Using scripts developed in Perl and based on the existing
annotation for each of the cDNA and/or EST sequences, the sequences were
categorized as CDS, upstream and downstream regions, partitioned into fasta
files, and named CDS, 5´UTR and 3´UTR for each species. Since the annotation of
introns was not part of the database, the repeats present in intronic regions were not
considered in this study.
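The partitioning step can be pictured as slicing each transcript at the annotated CDS boundaries. The Python sketch below is only an illustration of that idea: the original scripts were written in Perl and are not reproduced here, and the coordinate convention (1-based, inclusive CDS start and end) and all names are assumptions.

```python
def split_transcript(seq, cds_start, cds_end):
    """Split a transcript into 5'UTR, CDS and 3'UTR.

    ASSUMPTION: cds_start and cds_end are 1-based, inclusive coordinates.
    """
    if not 1 <= cds_start <= cds_end <= len(seq):
        raise ValueError("CDS coordinates fall outside the sequence")
    return {
        "5UTR": seq[:cds_start - 1],
        "CDS": seq[cds_start - 1:cds_end],
        "3UTR": seq[cds_end:],
    }


def to_fasta(seq_id, regions):
    """Format the non-empty regions of one transcript as FASTA records."""
    records = []
    for name, region in regions.items():
        if region:  # transcripts may lack one or both UTRs
            records.append(f">{seq_id}|{name}\n{region}")
    return "\n".join(records)


if __name__ == "__main__":
    parts = split_transcript("GGGATGAAATAGCCC", 4, 12)
    print(to_fasta("toy_transcript", parts))
```

Empty regions are skipped when writing records, reflecting the fact that not every deposited sequence contains both UTRs.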
Location of tandem repeats
To locate micro- and minisatellites, the SSRLocator software was used
(Maia et al., 2008). The software options were adjusted to locate monomers, dimers,
trimers, pentamers and hexamers containing a minimum of 10, 7, 5, 4 and 4 repeats,
respectively. For minisatellites, heptamers, octamers, nonamers and decamers
containing a minimum of 3, 3, 3 and 2 repeats, respectively, were selected.
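A threshold-based search of this kind can be sketched with regular expressions. The Python example below is a simplified illustration, not the SSRLocator algorithm. It uses the minimum repeat numbers listed above, leaves out tetramers because no threshold is listed for them here, and discards motifs that are themselves repeats of a shorter unit (for example "AA" or "ATAT"). Overlapping repeats of different motifs are not resolved. All names are hypothetical.

```python
import re

# Motif length -> minimum number of repeats, as listed in the text.
# Tetramers are omitted because the text gives no threshold for them.
MIN_REPEATS = {1: 10, 2: 7, 3: 5, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 2}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    size = len(motif)
    return not any(size % k == 0 and motif == motif[:k] * (size // k)
                   for k in range(1, size))


def find_tandem_repeats(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples; start is 0-based."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # A motif of `size` bases followed by at least min_rep - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


if __name__ == "__main__":
    toy = "TTGC" + "AG" * 8 + "CCATA" + "CCG" * 5 + "TTA"
    for start, motif, n in find_tandem_repeats(toy):
        print(start, motif, n)
```

Passing a different dictionary as `min_repeats` is enough to reproduce stricter settings, such as restricting the search to longer repeat tracts.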
RESULTS AND DISCUSSION
Distribution of sequences in UTRs and CDSs
The sequences, separated into coding regions (CDS) and untranslated
transcribed regions (5'UTR and 3'UTR), are summarized by number of sequences,
amount (Mb) and average size (bp) for the ten species in Table 1. Average sequence
lengths ranged between 560 and 893 bp in all species except A. thaliana and
O. sativa, whose databases held longer sequences, averaging 1,447 and 1,490 bp,
respectively. The Poaceae species Z. mays and O. sativa had the largest numbers of
sequences deposited in Unigene, with 57,447 and 40,259 sequences, respectively.
It must be taken into account that not all sequences deposited in this database
contain 5'UTR and 3'UTR regions: some sequences contain both region types and
others only one (i.e., 5'UTR or 3'UTR). The overall average sizes were 130 bp for
5'UTR, 873 bp for CDS and 270 bp for 3'UTR regions. The total nucleotides allocated
to each region were on average 0.9% for 5'UTR, 97.5% for CDS and 1.6% for 3'UTR
regions. The only species with contrasting values was Arabidopsis, where 6.8%,
82.6% and 10.7% of total nucleotides were allocated to 5'UTR, CDS and 3'UTR
regions, respectively.
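As a worked check of how the nucleotide shares relate to the columns of Table 1, the sketch below multiplies the number of sequences carrying a region by the mean region length and divides by the total expressed megabases. The function name `region_share` is hypothetical, and because the inputs in Table 1 are already rounded, the result may differ from the published percentages in the last digit.

```python
def region_share(n_seqs, mean_len_bp, total_mb):
    """Percentage of all expressed nucleotides that fall in one region."""
    region_bp = n_seqs * mean_len_bp
    return 100.0 * region_bp / (total_mb * 1_000_000)


# A. thaliana values taken from Table 1
total_mb = 43.3
print(round(region_share(16625, 176, total_mb), 1))   # 5'UTR share: 6.8
print(round(region_share(29918, 1195, total_mb), 1))  # CDS share: 82.6
```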
Percentage of expressed sequences with tandem repeats
On average, 3.55% of the analyzed sequences contain one or more loci with
tandem repeats. Figure 1 displays the percentage of tandem repeat-containing
sequences for each species. The highest value was found for rice (11.28%).
The smallest values were found for the Solanaceae species, i.e. 1.47% and 1.76%
for S. lycopersicum and S. tuberosum, respectively. The percentage found for
Arabidopsis (3.88%) agrees with other reports, which found between
3% and 5% of tandem repeat-containing sequences (Cardle et al., 2000; Kumpatla and
Mukhopadhyay, 2005). For B. napus, S. lycopersicum and S. tuberosum,
2.42%, 1.47% and 1.76% of sequences contained tandem repeats,
respectively, whereas different values (6.9%, 4.7% and 2.65%) have been reported for the
same species (Kumpatla and Mukhopadhyay, 2005). For Poaceae, the
comparison of the present results with former reports for H. vulgare (4.25% vs. 8.11%),
Z. mays (2.14% vs. 1.5%), O. sativa (11.28% vs. 4.7%), S. officinarum (2.13% vs.
2.9%) and T. aestivum (2.38% vs. 7.5%) shows a different range of values (Cordeiro
et al., 2001; Kantety et al., 2002; Thiel et al., 2003; Nicot et al., 2004; Asp et al.,
2007). However, all differences are within a roughly two- to three-fold range.
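The fold differences behind this comparison can be recomputed directly from the percentages quoted above. The sketch below is illustrative only; the helper name `fold_difference` is hypothetical, and the ratio is always taken as the larger value over the smaller one.

```python
def fold_difference(a, b):
    """Ratio of the larger value to the smaller one (always >= 1)."""
    if a <= 0 or b <= 0:
        raise ValueError("values must be positive")
    return max(a, b) / min(a, b)


# (present study, earlier report): percentages quoted in the text
comparisons = {
    "H. vulgare": (4.25, 8.11),
    "Z. mays": (2.14, 1.5),
    "O. sativa": (11.28, 4.7),
    "S. officinarum": (2.13, 2.9),
    "T. aestivum": (2.38, 7.5),
}
for species, (present, earlier) in comparisons.items():
    print(species, round(fold_difference(present, earlier), 2))
```

With these inputs the ratios run from about 1.4 (S. officinarum and Z. mays) to slightly above 3 (T. aestivum).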
The variation in the percentage values found in different reports is related
to the search strategy (software, repeat number and repeat type defined for the search)
used by the authors. However, there is overall agreement that microsatellite stretches
with a minimum size of 20 bp are present in approximately 2-5% of cereal EST sequences
(Varshney et al., 2005).
Frequency of tandem repeats in UTR and CDS regions
Results for total occurrences (total loci), percentage per region (number of loci
per region divided by the total number of loci) and frequencies (number of loci per
megabase) are shown separately for each species and genic region (5'UTR, CDS
and 3'UTR) in Table 2. Of all repeats found in the surveyed species (10,731 loci),
4.92% (529 loci) and 2.21% (237 loci) were located in 5'UTR and 3'UTR regions, with
average frequencies of 1.3 and 0.7 loci/Mb, respectively. In coding regions (CDS), a
higher occurrence of micro- and minisatellites was detected, reaching 92.86% of the total
loci found (9,965 occurrences), with an average frequency of 35.1 loci/Mb. The higher
percentage of repeats in CDS regions is a consequence of the trimers present
in this region. However, for Arabidopsis, large percentages of dimers (17.9%), trimers
(19.3%) and total (44.5%) microsatellites were found in UTR regions, in contrast with
the other species (Table 3). For Rosaceae, between 44.3% and 53.2% of
microsatellites were found in UTR regions (Jung et al., 2005). For Arabidopsis, 81%
and 26% of dimers and trimers, respectively, were found in UTR regions (Yu et al., 2004).
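The two derived quantities in Table 2 can be reproduced with a short sketch. It assumes, based on the A. thaliana row, that the loci/Mb values are obtained by dividing the loci in each region by the total expressed megabases of the species (Table 1) rather than by the megabases of the region itself; the function name `region_stats` is hypothetical.

```python
def region_stats(loci_by_region, total_mb):
    """Return {region: (percent of all loci, loci per Mb)} rounded to 0.1."""
    total_loci = sum(loci_by_region.values())
    stats = {}
    for region, loci in loci_by_region.items():
        percent = round(100.0 * loci / total_loci, 1)
        per_mb = round(loci / total_mb, 1)
        stats[region] = (percent, per_mb)
    return stats


# A. thaliana: loci counts from Table 2, total expressed Mb from Table 1
arabidopsis = {"5'UTR": 395, "CDS": 610, "3'UTR": 157}
print(region_stats(arabidopsis, 43.3))
# {"5'UTR": (34.0, 9.1), 'CDS': (52.5, 14.1), "3'UTR": (13.5, 3.6)}
```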
In the present study, the highest percentage of microsatellites in 5'UTR regions was
detected in Arabidopsis, with a frequency of 9.1 loci/Mb. These repeats represented
34% of all 1,162 repeats found in the 29,918 sequences analyzed in this species.
The species O. sativa and H. vulgare had the second and third highest
frequencies of repeats in these regions, with 1.3 and 1.0 loci/Mb,
respectively (Table 2).
Many studies indicate that UTRs are richer in microsatellites than CDS
regions (Morgante et al., 2002). In the present work, the location of 92.86% of
microsatellite loci in CDS regions is probably due to an annotation deficiency in
separating translated from non-translated fractions in the Unigene transcript database.
As observed for 5'UTRs, contrasting values were also found in 3'UTR regions.
Much higher values were found for Arabidopsis (average of 3.6 loci/Mb) when
compared to values of 0.6 loci/Mb or less found for the remaining species (Table 2).
Considering all 5'UTR, 3'UTR and CDS occurrences for all species, the
average frequency observed is 37 loci/Mb. The values range from 18 loci/Mb in
tomato to 76 loci/Mb in rice. Average frequency values per family are: 29.0 loci/Mb in
Brassicaceae, 19.9 loci/Mb in Solanaceae and 45.4 loci/Mb in Poaceae (Table 2).
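The family averages can be approximated from the published tables by averaging the per-species frequencies (total loci from Table 2 divided by total expressed Mb from Table 1). This is a sketch under that assumption; the names `DATA`, `FAMILIES` and `family_frequency` are hypothetical, and because the Mb values are rounded, the last digit may differ slightly from the values in the text.

```python
from statistics import mean

# species: (total loci from Table 2, total expressed Mb from Table 1)
DATA = {
    "A. thaliana": (1162, 43.3),
    "B. napus": (635, 20.3),
    "S. lycopersicum": (249, 14.0),
    "S. tuberosum": (344, 15.6),
    "O. sativa": (4540, 60.0),
    "S. bicolor": (508, 9.5),
    "T. aestivum": (820, 26.2),
    "Z. mays": (1230, 32.2),
    "S. officinarum": (332, 12.7),
    "H. vulgare": (911, 19.1),
}

FAMILIES = {
    "Brassicaceae": ["A. thaliana", "B. napus"],
    "Solanaceae": ["S. lycopersicum", "S. tuberosum"],
    "Poaceae": ["O. sativa", "S. bicolor", "T. aestivum",
                "Z. mays", "S. officinarum", "H. vulgare"],
}


def family_frequency(family):
    """Mean of the per-species loci/Mb values within one family."""
    return mean(DATA[s][0] / DATA[s][1] for s in FAMILIES[family])


for family in FAMILIES:
    print(family, round(family_frequency(family), 1))
```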
Many reports have shown higher values than those found in this study, i.e.,
112-133 loci/Mb in barley, 133 loci/Mb in maize, 94-161 loci/Mb in wheat, 158-169
loci/Mb in sorghum, 161 loci/Mb in rye, 256-277 loci/Mb in rice and 133 loci/Mb in
Arabidopsis (Varshney et al., 2002; Thiel et al., 2003; Parida et al., 2006). In Citrus
species, values as high as 507 loci/Mb have been described in EST sequences
(Palmieri et al., 2007). Values as high as 125 loci/Mb were also found in Brassica
rapa (Hong et al., 2007). Frequency values closer to those of our study have been reported
for CDS regions of Rosa chinensis (rose), Prunus dulcis (almond), Prunus persica
(peach) and Arabidopsis, with values ranging from 39 to 78 loci/Mb (Jung et al.,
2005).
Percentage occurrences of different microsatellite types in UTR and CDS regions
In Table 3, the detailed percentage values for each repeat type in the different
sections of a genic region are listed for each species. The average occurrence of
dimer microsatellites in all species was 21.9%, with the majority of these loci present in
CDS regions. For each family, the average percentage of dimer occurrence was
31.5% for Brassicaceae, 21.7% for Solanaceae and 18.8% for Poaceae species.
The percentage values for dimer microsatellites in CDS regions ranged from 4.0% for
Arabidopsis to 40.8% in B. napus. An interesting feature that seems to be particular
to the Arabidopsis genome is the high occurrence of dimer microsatellites in 5'UTR
(13.6%) and 3'UTR (4.3%) regions. Within the Poaceae, dimer microsatellites
ranged from 15.4% in barley to 27.3% in wheat (Table 3). Other studies indicate that
the highest rates of occurrence of dimers are generally associated with 5'UTR regions
(Morgante et al., 2002; Lawson and Zhang, 2006; Hong et al., 2007), but one should
keep in mind that the prevalence in CDS regions found here may reflect a deficient
annotation of the database. Trimer microsatellites accounted for 40.2% of loci,
with a high predominance in CDS regions. The species with the highest trimer values
were Arabidopsis, rice and tomato, with 58.0%, 54.7% and 41.4% of occurrences,
respectively. The average percentage of trimers within each family was 47.0% in
Brassicaceae, 37.8% in Solanaceae and 38.7% in Poaceae. Among the Poaceae
species, the highest percentage of trimer occurrence was found in rice (54.7%) and the
lowest in maize (34.6%). In Brassicaceae,
trimers were found more frequently in Arabidopsis (58.0%) and less frequently in B.
napus (36.1%) (Table 3).
Tetramers represented, on average, 8.2% of microsatellites, with average
frequencies of 3.4%, 4.4% and 11.0% for Brassicaceae, Solanaceae and Poaceae,
respectively. Among the Brassicaceae, a 1.5-fold difference in frequency
was observed between Arabidopsis (2.9%) and B. napus (4.4%). In Poaceae, a 2.7-fold
difference was found between rice (6.1%) and barley (16.5%).
Pentamers represented, on average, 10.36% of microsatellites, with average
frequencies of 4.5%, 6.6% and 13.6% for Brassicaceae, Solanaceae and Poaceae,
respectively (Table 3). Only small differences (below 1.1-fold) were found within the
Brassicaceae and Solanaceae. In Poaceae, however, a 1.7-fold difference was found
between rice (9.7%) and maize (16.4%).
Hexamers represented, on average, 13.8% of microsatellites, with average
frequencies of 8.1%, 19.1% and 13% for Brassicaceae, Solanaceae and Poaceae,
respectively. In Poaceae, a 2.4-fold difference was found between wheat (7.7%) and
sorghum (18.3%).
Minisatellite frequencies were also assessed in the data (Table 3).
Heptamers represented, on average, 4.5% of total (minisatellite plus microsatellite)
occurrences. These repeats were most common in the Solanaceae family
(9.6%). In Brassicaceae and Poaceae, the average frequencies of heptamers were
3.3% and 3.2%, respectively. Octamers were more frequent in the Brassicaceae
(0.8%) than in the Solanaceae (0.3%) and Poaceae (0.1%). Nonamers
were also more frequent in Brassicaceae (0.9%) than in Solanaceae
(0.6%) and Poaceae (0.5%). Decamers were less frequent than the other
minisatellites, reaching frequencies of 0.2%, 0.1% and zero in Brassicaceae,
Poaceae and Solanaceae, respectively (Table 3).
Many studies have reported EST sequences containing microsatellites. For
the Poaceae (rice, maize, sorghum, barley and wheat) frequencies ranging from 16.6
to 40% for dimers, 41 to 78% for trimers, 2.6 to 14% for tetramers, 0.4-18.9% for
pentamers and below 1% for hexamers (Varshney et al., 2002; Thiel et al., 2003; La
Rota et al., 2005; Parida et al., 2006) have been reported. In the case of Arabidopsis,
frequencies of dimers (36.5%), trimers (62.1%), tetramers (1.1%), pentamers (0.15%)
and hexamers (0.13%) have been reported (Parida et al., 2006).
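The percentage occurrence of each repeat type is a simple tally of loci by motif length. The sketch below illustrates this with an invented list of motifs (not data from this study), following the convention used here that motifs of 1 to 6 nucleotides are microsatellites and motifs of 7 to 10 nucleotides are minisatellites. The names `NAMES`, `repeat_class` and `class_percentages` are hypothetical.

```python
from collections import Counter

NAMES = {1: "monomer", 2: "dimer", 3: "trimer", 4: "tetramer",
         5: "pentamer", 6: "hexamer", 7: "heptamer", 8: "octamer",
         9: "nonamer", 10: "decamer"}


def repeat_class(motif):
    """Return (type name, 'microsatellite' or 'minisatellite') for a motif."""
    k = len(motif)
    if k not in NAMES:
        raise ValueError("motif length outside the 1-10 nt range")
    kind = "microsatellite" if k <= 6 else "minisatellite"
    return NAMES[k], kind


def class_percentages(motifs):
    """Percentage of loci per repeat type, rounded to one decimal."""
    counts = Counter(len(m) for m in motifs)
    total = sum(counts.values())
    return {NAMES[k]: round(100.0 * n / total, 1)
            for k, n in sorted(counts.items())}


# Invented example loci, one motif per locus
loci = ["AG", "GA", "CT", "CCG", "GAA", "CGC",
        "AAGA", "AAAAT", "AAGGAG", "AAGGAGC"]
print(class_percentages(loci))
print(repeat_class("AAGGAGC"))  # ('heptamer', 'minisatellite')
```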
Most frequent motifs
Dimers and trimers
In Tables 4 and 5 the motif frequencies per species and average frequency
per family are listed. For dimers, differences were observed within and between
families. For Brassicaceae, the dimer motifs AG/CT and GA/TC were the most
frequent, reaching 9.69% and 8.89% of observations in the family. A 6.9-fold
difference was found for AG/CT between Arabidopsis (2.46%) and B. napus
(16.93%). For the motif GA/TC, a nearly 10-fold difference was also found between
Arabidopsis (1.64%) and B. napus (16.14%). Other reports have shown that the
motifs AG and GA were the most frequent in Arabidopsis (Cardle et al., 2000;
Morgante et al., 2002; Lawson and Zhang, 2006; Parida et al., 2006) and AT/TA in B.
rapa (Hong et al., 2007). Among the Solanaceae, the motifs AT/AT and TA/TA were
the most frequent, with frequencies of 8.29% and 5.69%, respectively. In Solanaceae
ESTs, frequencies of 20-25% and 15-20% were found for the dimers AG and
AT, respectively (Kumpatla and Mukhopadhyay, 2005). In Poaceae, the most frequent
motifs were AG/CT and GA/TC, with average percentage values of 6.72% and
5.61%, respectively. In other studies, frequencies ranging from 38% to 50% were found
for the motif AG in maize, barley, rice, sorghum and wheat (Morgante et al., 2002;
Varshney et al., 2002; Kantety et al., 2002; Thiel et al., 2003; Yu et al., 2004; La Rota
et al., 2005) and frequencies of 50% for AC in barley (Varshney et al., 2002).
However, other reports have shown GA as the most abundant motif in grasses
(Temnykh et al., 2001; Kantety et al., 2002; Nicot et al., 2004; Parida et al., 2006).
In all species analysed in the present study, the lowest frequencies were
found for motifs formed by guanine and cytosine (CG/GC), which were even
absent in the Brassicaceae and Solanaceae species.
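Motif labels such as AG/CT or GA/TC pair a motif with its reverse complement, so that a repeat is counted in the same class whichever strand was sequenced. The sketch below shows one way to derive such labels; it is an assumption about the grouping, inferred from the labels in Table 4, and it orders the two members alphabetically, which differs from the tables for a few labels (for example it yields AC/GT where Table 4 prints GT/AC). The function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Reverse complement of a DNA motif."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def motif_class(motif):
    """Label a motif together with its reverse complement, e.g. 'AG/CT'."""
    motif = motif.upper()
    return "/".join(sorted([motif, reverse_complement(motif)]))


def motif_percentages(motifs):
    """Percentage of loci per motif class, rounded to two decimals."""
    counts = Counter(motif_class(m) for m in motifs)
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in counts.items()}


print(motif_class("CT"))   # AG/CT
print(motif_class("GA"))   # GA/TC
print(motif_class("AT"))   # AT/AT
print(motif_class("CGG"))  # CCG/CGG
```

Note that cyclic shifts of the same repeat (AG versus GA) stay in separate classes, as they do in Table 4.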
The data regarding trimer frequencies show, as already observed for dimers,
that motif patterns differ within as well as between families (Table 4). Among
the Brassicaceae, the motifs GAA/TTC and AAG/CTT were the most abundant,
reaching frequencies of 8.36% and 6.73%, respectively. Contrasting values were
found for GAA/TTC between Arabidopsis (12.13%) and B. napus (4.59%). The
motif AAG/CTT also showed contrasting values for Arabidopsis (9.51%) and B.
napus (3.96%). Other reports have indicated that AAG is the most frequent motif in
Arabidopsis and B. rapa (Morgante et al., 2002; Hong et al., 2007). In the
Solanaceae, the motifs GAA/TTC and AGA/TCT were the most frequent, showing
values of 4.75% and 4.60%, respectively. For both motifs, the frequency values were
higher in S. tuberosum. Similar results were obtained in Arabidopsis, B. napus, B.
rapa, S. lycopersicum and S. tuberosum (Kumpatla and Mukhopadhyay, 2005) and in
Citrus (Jiang et al., 2006), where the motifs AAG/AGA/GAA were the most frequent.
In the Poaceae, the trimers CCG/CGG, CGC/GCG and GCC/GGC were the most
frequent, corresponding to 5.89%, 5.85% and 5.06%, respectively, adding up to
16.80% of all microsatellites found. Within the family, different motifs were the most
common: for O. sativa, S. bicolor and H. vulgare, the motif CCG/CGG was
predominant; for T. aestivum and S. officinarum it was GCC/GGC; and for Z. mays it
was CGC/GCG. Other reports have shown predominance of the motif CCG in the grass
species Z. mays, H. vulgare, O. sativa, S. bicolor, T. aestivum, S. cereale and S.
officinarum (Cordeiro et al., 2001; Kantety et al., 2002; Morgante et al., 2002;
Varshney et al., 2002; Thiel et al., 2003; Nicot et al., 2004; Yu et al., 2004; La Rota
et al., 2005; Peng and Lapitan, 2005). These motifs (CCG/CGG, CGC/GCG and GCC/GGC)
seem to be less common in other families: compared with the value of about 16.8%
found for grasses, they reached frequencies of only 0.56% in Brassicaceae
and 0.36% in the Solanaceae.
Tetramers, pentamers and hexamers
For the loci formed by motifs longer than three nucleotides, only the ten highest
average percentage values for each family are shown (Tables 4 and 5).
In Brassicaceae, the tetramer motifs occurring at the highest frequencies were
AAGA/TCTT, AAAC/GTTT and GAAA/TTTC, adding up to 1.04% of all motifs found. Other
reports indicate that the motifs AAAG/AAAT were predominant in Arabidopsis and AAAT
in B. rapa (Cardle et al., 2000; Hong et al., 2007). For the 5'UTR/CDS and 3'UTR
regions of Arabidopsis, the predominant motifs reported were AAAG/CTTT and
AAAC/GTTT, respectively (Morgante et al., 2002; Zhang et al., 2004). For the
Solanaceae species, 1.96% of all motifs found were either TAAA/TTTA, TTAA/TTAA
or AAGA/TCTT. These results agree with EST data from 20 dicot species (Kumpatla
and Mukhopadhyay, 2005). Among the grasses, 0.85% of all motifs were either
CCTC/GAGG, AGGA/TCCT or CATC/GATG. Differences in the predominant
tetramer rates were found among the species (Table 4). Other reports have shown
ACGT as the most abundant motif in barley (Varshney et al., 2002; Thiel et al., 2003),
AAAG/CTTT and AAGG/CCTT in perennial ryegrass (Asp et al., 2007) and AAAG as
the most frequent motif in rice BACs (McCouch et al., 2002).
Pentamers present at rates of 0.80% (GAAAA/TTTTC, AAAAT/ATTTT and
AAAAC/GTTTT), 1.37% (AAAAT/ATTTT, AAAAG/CTTTT and AGAAG/CTTCT) and
0.83% (CTCTC/GAGAG, GAGGA/TCCTC and CTTCC/GGAAG) were predominant
in Brassicaceae, Solanaceae and Poaceae, respectively. The major difference
among plant families is the predominance of A/T in Brassicaceae and Solanaceae.
Also, reports on CDS regions of Arabidopsis, S. cerevisiae and C. elegans indicated
the predominance of ACCCG and AAAAG (Tóth et al., 2000). For eukaryotes in
general, AAAAT, AAAAC and AAAAG have been shown to be predominant (Li et al.,
2004a). On the other hand, 5'UTR and 3'UTR regions of Arabidopsis were shown to
be rich in AAGAG and AAAAC, respectively (Zhang et al., 2004). AAAAT (Hong et
al., 2007) and AAAAT/AAAAG (Jiang et al., 2006) were described as frequently
found in Rosaceae and Citrus, respectively.
In transcripts from the TIGR database, the motif AGAGG was predominant in rice,
AGGGG in barley and ACGAT in wheat (La Rota et al., 2005). Very little information
was found describing the preferential occurrence of pentamers in grasses, and the
information found for eukaryotes (Tóth et al., 2000; Li et al., 2004a), Citrus
(Palmieri et al., 2007; Jiang et al., 2006), Arabidopsis (Zhang et al., 2004) and
Rosaceae (Hong et al., 2007) showed variable results.
A pattern of occurrence of hexamers among and within the three analyzed
plant families was found (Table 5). Only one study reports results in agreement with the
present ones, regarding the predominance of AAGGAG hexamers in
Arabidopsis (Tóth et al., 2000). Other reports indicate that the most frequent
hexamers are AAGATG, AAAGAG and AAAAAT in Arabidopsis (Zhang et al., 2004),
AAAAAG in Citrus (Jiang et al., 2006), AACACG in S. cerevisiae, ACCAGG in C.
elegans, AAGGC in mammals and CCCCGG in primates (Tóth et al., 2000). The ten
most frequent heptamers, octamers, nonamers and decamers are presented
in Table 5. Occurrences are widely variable within and among families, making it
difficult to establish a pattern or base a discussion on similarities.
Genome dynamics is very complex regarding microsatellite motifs in plants. A
higher conservation of dimer motifs (AG/CT and GA/TC) seems to overcome
evolutionary distances such as those found between monocot and dicot
plants. However, within the dicots, this conservation may not hold. Unexpectedly,
Poaceae and Brassicaceae were closer when these motifs were analyzed. On the
other hand, trimer microsatellites, which are known to be predominant in coding regions,
follow the expected pattern of conservation, showing similar rates and predominant
motifs (GAA/TTC) between the two dicot families. Trimers present at higher
frequencies in the grasses tend to be formed by GC arrangements, in contrast to
dicot plants where GATC combinations are more frequently found. The higher
frequency of AT-rich repeats is also found in pentamer motifs in the dicot families.
Repeats of higher complexity did not show detectable conserved patterns in this
study.
CONCLUSIONS
The occurrence of micro- and minisatellites in rice sequences (11.28%) is
higher than in the other species; rice has roughly 2.7 to 7.7 times more sequences containing
these repetitive DNA loci. The fact that species having larger genomes (T. aestivum,
H. vulgare and S. officinarum) do not present a corresponding higher frequency of
repetitive loci suggests that there is no relationship between genome size and rates
of tandem repeat occurrence in functional regions. However, the lower coverage of
sequences present in databases for these species could also be a reason for the low
rates found in some species. For Arabidopsis and rice, the results obtained are closer
to reality because both are considered model species and have been studied at
deeper coverage.
The occurrence of micro- and minisatellites was higher in CDS regions for all
studied species. Also, microsatellites (97%) were more common than minisatellites
(3%). Per family, the predominant dimer motifs were the same for Brassicaceae and
Poaceae (AG/CT) and different for the Solanaceae (AT/AT). Trimers were the
predominant repeats, ranging between 34.3% and 58.0%, with different rates
depending on the family or species. For the Solanaceae, the predominant trimer
motifs were not the same for S. lycopersicum (ATA/TAT and AAT/ATT) and S.
tuberosum (GAA/TTC and AGA/TCT), which could be due to selection. Among the
grasses, trimers formed by C/G were the most abundant; however, the specific motifs
vary between species.
Disagreements between earlier reports and the results obtained in the present
work, where dimers were also frequent in CDS regions, could be due to the fact that
the Unigene database contains predominantly EST clusters. Therefore, there is a
tendency to under-represent the UTR regions in the annotated sequences present
in this database. This is true for all species except Arabidopsis. This could be
solved if the genes were manually curated to define the different regions; however, it
would take a community effort to accomplish such a task.
The obtained results shed light on the patterns of tandem repeat occurrence
within and between different plant families, facilitating the use of plant breeding
strategies based on the transfer of markers from model to orphan species.
ACKNOWLEDGMENTS:
The authors thank CNPq for fellowships and grants. The authors also thank
Dr. Dario Abel Palmieri (UNESP/Assis-SP) and Dr. Olivier Panaud (University of
Perpignan) for fruitful discussions.
REFERENCES:
Asp T, Frei UK, Didion T, Nielsen KK and Lübberstedt T (2007) Frequency, type, and
distribution of EST-SSRs from three genotypes of Lolium perenne, and their
conservation across orthologous sequences of Festuca arundinacea, Brachypodium
distachyon, and Oryza sativa. BMC Plant Biol, 7:36.
Bell GI (1996) Evolution of simple sequence repeats. Comput Chem, 20:41-48.
Cardle L, Ramsay L, Milbourne D, Macaulay M, Marshall D and Waugh R (2000)
Computational and experimental characterization of physically clustered simple
sequence repeats in plants. Genetics, 156:847-854.
Conne B, Stutz A and Vassalli JD (2000) The 3' untranslated region of messenger
RNA: A molecular 'hotspot' for pathology? Nat Med, 6:637-641.
Cordeiro GM, Casu R, McIntyre CL, Manners JM and Henry RJ (2001) Microsatellite
markers from sugarcane (Saccharum spp.) ESTs cross transferable to erianthus and
sorghum. Plant Sci, 160:1115-1123.
Davis BM, McCurrach ME, Taneja KL, Singer RH and Housman DE (1997)
Expansion of a CUG trinucleotide repeat in the 3' untranslated region of myotonic
dystrophy protein kinase transcripts results in nuclear retention of transcripts. Proc
Natl Acad Sci USA, 94:7388-7393.
Hong CP, Piao ZY, Kang TW, Batley J, Yang TJ, Hur YK, Bhak J, Park BS, Edwards
D and Lim YP (2007) Genomic distribution of simple sequence repeats in Brassica
rapa. Mol Cells, 23:349-356.
Iyer RR, Pluciennik A, Rosche WA, Sinden RR and Wells RD (2000) DNA
polymerase III proofreading mutants enhance the expansion and deletion of triplet
repeat sequences in Escherichia coli. J Biol Chem, 275: 2174-2184.
Jiang D, Zhong GY and Hong QB (2006) Analysis of microsatellites in citrus
unigenes. Yi chuan xue bao (Acta genetica Sinica), 33:345-353.
Jung S, Abbott A, Jesudurai C, Tomkins J and Main D (2005) Frequency, type,
distribution and annotation of simple sequence repeats in Rosaceae ESTs. Funct
Integr Genomics, 5:136-143.
Kantety RV, La Rota M, Matthews DE and Sorrells ME (2002) Data mining for simple
sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and
wheat. Plant Mol Biol, 48:501-510.
Kashi Y and King DG (2006) Simple sequence repeats as advantageous mutators in
evolution. Trends Genet, 22:253-259.
Kumpatla SP and Mukhopadhyay S (2005) Mining and survey of simple sequence
repeats in expressed sequence tags of dicotyledonous species. Genome, 48:985-
998.
La Rota M, Kantety RV, Yu JK and Sorrells ME (2005) Nonrandom distribution and
frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and
barley. BMC Genomics, 6:23.
Lawson MJ and Zhang L (2006) Distinct patterns of SSR distribution in the
Arabidopsis thaliana and rice genomes. Genome Biol, 7: R14.
Li YC, Korol AB, Fahima T, Beiles A and Nevo E (2002) Microsatellites: genomic
distribution, putative functions and mutational mechanisms: a review. Mol Ecol,
11:2453-2465.
Li YC, Korol AB, Fahima T and Nevo E (2004a) Microsatellites within genes:
structure, function, and evolution. Mol Biol Evol, 21: 991-1007.
Li B, Xia Q, Lu C, Zhou Z and Xiang Z (2004b) Analysis on frequency and density of
microsatellites in coding sequences of several eukaryotic genomes. Genomics
Proteomics Bioinformatics, 2:24-31.
Maia LC da, Palmieri DA, de Souza VQ, Kopp MM, de Carvalho FI and Costa de Oliveira
A (2008) SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with
Primer Design and PCR Simulation. Int J Plant Genomics, 2008:412696.
McCouch SR, Teytelman L, Xu Y, Lobos KB, Clare K, Walton M, Fu B, Maghirang R,
Li Z, Xing Y, Zhang Q, Kono I, Yano M, Fjellstrom R, DeClerck G, Schneider D,
Cartinhour S, Ware D and Stein L (2002) Development and mapping of 2240 new
SSR markers for rice (Oryza sativa L.). DNA Res, 9:199-207.
Morgante M, Hanafey M and Powell W (2002) Microsatellites are preferentially
associated with nonrepetitive DNA in plant genomes. Nat Genet, 30: 194-200.
Morgante M and Olivieri AM (1993) PCR-amplified microsatellites as markers in plant
genetics. Plant J, 3: 175-182.
Nicot N, Chiquet V, Gandon B, Amilhat L, Legeai F, Leroy P, Bernard M and Sourdille
P (2004) Study of simple sequence repeat (SSR) markers from wheat expressed
sequence tags (ESTs). Theor Appl Genet, 109: 800-805.
Palmieri DA, Novelli VM, Bastianel M, Cristofani M, Monge GA, Carlos EF, Oliveira
AC and Machado MA (2007) Frequency and distribution of microsatellites from ESTs
of citrus. Genet Mol Biol, 30: 1009-1018.
Parida SK, Anand Raj Kumar K, Dalal V, Singh NK and Mohapatra T (2006) Unigene
derived microsatellite markers for the cereal genomes. Theor Appl Genet, 112:808-
817.
Peng JH and Lapitan NL (2005) Characterization of EST-derived microsatellites in
the wheat genome and development of eSSR markers. Funct Integr Genomics, 5:
80-96.
Philips AV, Timchenko LT and Cooper TA (1998) Disruption of splicing regulated by a
CUG-binding protein in myotonic dystrophy. Science, 280: 737-741.
Subramanian S, Mishra RK and Singh L (2003) Genome-wide analysis of
microsatellite repeats in humans: their abundance and density in specific genomic
regions. Genome Biol, 4: R13.
Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S and McCouch S
(2001) Computational and experimental analysis of microsatellites in rice (Oryza
sativa L.): frequency, length variation, transposon associations, and genetic marker
potential. Genome Res, 11:1441-1452.
Thiel T, Michalek W, Varshney RK and Graner A (2003) Exploiting EST databases for
the development and characterization of gene-derived SSR-markers in barley
(Hordeum vulgare L.). Theor Appl Genet, 106:411-422.
Thornton CA, Wymer JP, Simmons Z, McClain C and Moxley RT (1997) Expansion
of the myotonic dystrophy CTG repeat reduces expression of the flanking DMAHP
gene. Nat Genet, 16: 407-409.
Tóth G, Gáspári Z and Jurka J (2000) Microsatellites in different eukaryotic genomes:
survey and analysis. Genome Res, 10:967-981.
Varshney RK, Graner A and Sorrells ME (2005) Genic microsatellite markers in
plants: features and applications. Trends Biotechnol, 23:48-55.
Varshney RK, Thiel T, Stein N, Langridge P and Graner A (2002) In silico analysis on
frequency and distribution of microsatellites in ESTs of some cereal species. Cell
Mol Biol Lett, 7:537-546.
Varshney RK, Hoisington DA and Tyagi AK (2006) Advances in cereal genomics and
applications in crop breeding. Trends Biotechnol, 24:490-499.
Yu JK, Dake TM, Singh S, Benscher D, Li W, Gill B and Sorrells ME (2004)
Development and mapping of EST-derived simple sequence repeat markers for
hexaploid wheat. Genome, 47:805-818.
Zhang L, Yuan D, Yu S, Li Z, Cao Y, Miao Z, Qian H and Tang K (2004) Preference
of simple sequence repeats in coding and non coding regions of Arabidopsis
thaliana. Bioinformatics, 20:1081-1086.
Zhang L, Zuo K, Zhang F, Cao Y, Wang J, Zhang Y, Sun X and Tang K (2006a)
Conservation of noncoding microsatellites in plants: implication for gene regulation.
BMC Genomics, 7:323.
Zhang L, Yu S, Cao Y, Wang J, Zuo K, Qin J and Tang K (2006b) Distributional
gradient of amino acid repeats in plant proteins. Genome, 49:900-905.
Table 1. Overall distribution (amounts and percentage) of expressed sequences in translated and non-translated regions.
                 Expressed sequences      5'UTR                   CDS                     3'UTR
Species            Seq.1     Mb1     bp1   Seq.2   % Mb2     bp2   Seq.3   % Mb3     bp3   Seq.4   % Mb4     bp4
A. thaliana       29,918    43.3   1,447  16,625     6.8     176  29,918    82.6   1,195  17,591    10.7     262
B. napus          26,285    20.3     773     216     0.1      74  26,285    99.7     770     242     0.2     204
S. lycopersicum   16,945    14.0     823     614     0.5     103  16,945    98.3     809     710     1.2     245
S. tuberosum      19,539    15.6     796     554     0.3      93  19,539    98.6     785     635     1.0     252
O. sativa         40,259    60.0   1,490   1,088     0.5     270  40,259    98.7   1,470   1,158     0.8     438
S. bicolor        13,547     9.5     699      68     0.1     115  13,547    99.7     697      82     0.2     244
T. aestivum       34,505    26.2     758     498     0.2      92  34,505    99.2     753     611     0.6     246
Z. mays           57,447    32.2     560     704     0.3     120  57,447    99.1     555     803     0.7     275
S. officinarum    15,586    12.7     815      48     0.1     160  15,586    99.8     813      54     0.1     273
H. vulgare        21,418    19.1     893     359     0.2     102  21,418    99.2     886     458     0.6     259
Average           27,545    25.3     905   2,077     0.9     130  27,545    97.5     873   2,234     1.6   269.8
Expressed sequences: Seq.1 (total number of cDNA sequences), Mb1 (total megabases: sum of base pairs of fl-cDNA sequences), bp1 (mean sequence size: sum of base pairs divided by the number of sequences, Mb1 / Seq.1). 5'UTR: Seq.2 (total sequences containing 5'UTR regions), % Mb2 (percentage of Mb1 contained in 5'UTR regions), bp2 (mean size of 5'UTR sequences: sum of base pairs in 5'UTR regions divided by Seq.2). CDS: Seq.3 (total sequences containing CDS regions), % Mb3 (percentage of Mb1 contained in CDS regions), bp3 (mean size of CDS sequences: sum of base pairs in CDS regions divided by Seq.3). 3'UTR: Seq.4 (total sequences containing 3'UTR regions), % Mb4 (percentage of Mb1 contained in 3'UTR regions), bp4 (mean size of 3'UTR sequences: sum of base pairs in 3'UTR regions divided by Seq.4).
Table 2. Overall distribution of tandem repeat occurrences in translated and non-translated transcripts. Occ. = occurrences (number of loci); SSR/Mb = loci per megabase.
                 5'UTR                   CDS                     3'UTR                   Total
Species             Occ.       %  SSR/Mb    Occ.       %  SSR/Mb    Occ.       %  SSR/Mb    Occ.  SSR/Mb
A. thaliana          395    34.0     9.1     610    52.5    14.1     157    13.5     3.6   1,162      27
B. napus               1     0.2     0.0     632    99.5    31.1       2     0.3     0.1     635      31
S. lycopersicum        6     2.4     0.4     234    94.0    16.8       9     3.6     0.6     249      18
S. tuberosum           4     1.2     0.3     336    97.7    21.6       4     1.2     0.3     344      22
O. sativa             78     1.7     1.3   4,433    97.6    73.9      29     0.6     0.5   4,540      76
S. bicolor             3     0.6     0.3     505    99.4    53.3       0     0.0     0.0     508      54
T. aestivum           11     1.3     0.4     795    97.0    30.4      14     1.7     0.5     820      31
Z. mays               12     1.0     0.4   1,205    98.0    37.4      13     1.1     0.4   1,230      38
S. officinarum         0     0.0     0.0     332   100.0    26.1       0     0.0     0.0     332      26
H. vulgare            19     2.1     1.0     883    96.9    46.2       9     1.0     0.5     911      48
Average              529     4.9     1.3   9,965    92.9    35.1     237     2.2     0.7  10,731      37
Table 3. Overall occurrence, in percentage, of microsatellite and minisatellite motifs in different regions of ten plant species.
Microsatellites
                 Dimer                   | Trimer                  | Tetramer                | Pentamer                | Hexamer
Species          5'UTR   CDS 3'UTR Total | 5'UTR   CDS 3'UTR Total | 5'UTR   CDS 3'UTR Total | 5'UTR   CDS 3'UTR Total | 5'UTR   CDS 3'UTR Total
A. thaliana       13.6   4.0   4.3  21.9 |  14.6  38.6   4.7  58.0 |   1.0   0.9   1.0   2.9 |   2.1   0.8   1.9   4.7 |   0.9   5.8   0.4   7.1
B. napus           0.2  40.8   0.2  41.1 |     -  35.9   0.2  36.1 |     -   4.4     -   4.4 |     -   4.3     -   4.3 |     -   9.1     -   9.1
S. lycopersicum    0.4  17.7   2.0  20.1 |   0.4  40.2   0.8  41.4 |     -   4.4     -   4.4 |     -   6.0   0.8   6.8 |   0.8  17.3     -  18.1
S. tuberosum       0.3  22.4   0.6  23.3 |   0.3  34.0     -  34.3 |     -   4.4     -   4.4 |     -   6.1   0.3   6.4 |     -  20.1     -  20.1
O. sativa          0.5  14.9   0.3  15.7 |   0.7  53.9   0.1  54.7 |   0.0   6.0   0.1   6.1 |   0.3   9.3   0.1   9.7 |   0.1  10.3   0.0  10.4
S. bicolor         0.2  18.5     -  18.7 |   0.2  35.2     -  35.4 |     -  10.2     -  10.2 |     -  14.6     -  14.6 |   0.2  18.1     -  18.3
T. aestivum        0.5  26.5   0.4  27.3 |   0.5  34.0   0.5  35.0 |   0.2  13.3   0.1  13.7 |   0.1  11.3   0.6  12.1 |     -   7.6   0.1   7.7
Z. mays            0.5  16.0   0.5  17.0 |   0.2  34.5     -  34.6 |   0.1  10.7   0.4  11.2 |   0.1  16.2   0.2  16.4 |   0.1  17.4     -  17.5
S. officinarum       -  18.7     -  18.7 |     -  36.4     -  36.4 |     -   8.4     -   8.4 |     -  14.5     -  14.5 |     -  16.9     -  16.9
H. vulgare         0.5  14.6   0.2  15.4 |   0.7  35.1   0.3  36.1 |   0.4  15.7   0.3  16.5 |   0.2  13.8   0.1  14.2 |   0.1  12.8     -  13.0
Average            1.7  19.4   0.8  21.9 |   1.7  37.8   0.7  40.2 |   0.2   7.8   0.2   8.2 |   0.3   9.7   0.4  10.4 |   0.2  13.5   0.1  13.8
Minisatellites
                 Heptamer                | Octamer                 | Nonamer                 | Decamer                 | Overall
Species          5'UTR   CDS 3'UTR Total | 5'UTR   CDS 3'UTR Total | 5'UTR   CDS 3'UTR Total | 5'UTR   CDS 3'UTR Total | 5'UTR   CDS 3'UTR Total
A. thaliana        1.0   0.9   0.8   2.8 |   0.6   0.3   0.3   1.2 |   0.1   1.0     -   1.1 |   0.1   0.2   0.1   0.3 |  34.0  52.5  13.5 100.0
B. napus             -   3.8     -   3.8 |     -   0.5     -   0.5 |     -   0.6     -   0.6 |     -   0.2     -   0.2 |   0.2  99.5   0.3 100.0
S. lycopersicum    0.8   8.4     -   9.2 |     -     -     -     - |     -     -     -     - |     -     -     -     - |   2.4  94.0   3.6 100.0
S. tuberosum       0.6   9.0   0.3   9.9 |     -   0.6     -   0.6 |     -   1.2     -   1.2 |     -     -     -     - |   1.2  97.7   1.2 100.0
O. sativa          0.1   2.0   0.0   2.1 |   0.0   0.2     -   0.2 |     -   0.7     -   0.7 |     -   0.3     -   0.3 |   1.7  97.6   0.6 100.0
S. bicolor           -   2.4     -   2.4 |     -     -     -     - |     -   0.4     -   0.4 |     -     -     -     - |   0.6  99.4     - 100.0
T. aestivum          -   3.5     -   3.5 |     -   0.2     -   0.2 |     -   0.4     -   0.4 |     -   0.1     -   0.1 |   1.3  97.0   1.7 100.0
Z. mays            0.1   3.0     -   3.1 |     -     -     -     - |     -   0.2     -   0.2 |     -     -     -     - |   1.0  98.0   1.1 100.0
S. officinarum       -   4.2     -   4.2 |     -     -     -     - |     -   0.6     -   0.6 |     -   0.3     -   0.3 |     - 100.0     - 100.0
H. vulgare         0.1   3.6     -   3.7 |     -   0.2     -   0.2 |     -   1.0     -   1.0 |     -     -     -     - |   2.1  96.9   1.0 100.0
Average            0.3   4.1   0.1   4.5 |   0.1   0.2   0.0   0.3 |   0.0   0.6     -   0.6 |   0.0   0.1   0.0   0.1 |   4.4  93.3   2.3 100.0
61
Table 4. Distribution of di-, tri- and tetramer motifs, percentage occurrence per species and average occurrence per family. Brassicaceae Solanaceae Poaceae
Ara Bra Average Lyc Sol Average Ory Sor Tri Zea Sac HorDimersAG/CT 2.46 16.93 9.69 AT/AT 8.55 8.04 8.29 AG/CT 6.38 5.15 9.06 6.56 6.63 6.57GA/TC 1.64 16.14 8.89 TA/TA 5.13 6.25 5.69 GA/TC 5.46 5.35 10.19 5.15 3.92 3.62AT/AT 1.80 4.11 2.96 GA/TC 1.71 4.76 3.24 AT/AT 1.31 1.39 1.01 1.83 2.71 1.25TA/TA 0.98 2.22 1.60 AG/CT 3.42 2.98 3.20 CA/TG 0.56 2.38 2.89 0.75 1.51 1.36GT/AC 0.49 0.79 0.64 GT/AC 0.00 0.60 0.30 GT/AC 0.59 2.38 2.77 0.50 1.20 1.36CA/TG 0.16 0.79 0.48 CA/TG 0.00 0.30 0.15 TA/TA 0.92 1.98 1.26 1.58 2.41 0.57GC/GC 0.00 0.00 0.00 GC/GC 0.00 0.00 0.00 GC/GC 0.00 0.00 0.00 0.00 0.30 0.23CG/CG 0.00 0.00 0.00 CG/CG 0.00 0.00 0.00 CG/CG 0.07 0.00 0.13 0.00 0.00 0.11TrimersGAA/TTC 12.13 4.59 8.36 GAA/TTC 3.85 5.65 4.75 CCG/CGG 11.41 5.15 2.52 5.81 4.22 6.23AAG/CTT 9.51 3.96 6.73 AGA/TCT 3.85 5.36 4.60 CGC/GCG 10.47 4.75 3.02 5.98 6.02 4.87AGA/TCT 8.85 4.59 6.72 ATA/TAT 5.13 3.57 4.35 GCC/GGC 6.11 4.95 3.27 5.81 6.93 3.28ATC/GAT 7.54 2.22 4.88 AAT/ATT 4.27 2.98 3.62 CAG/CTG 1.87 2.77 2.64 2.41 3.31 2.60TCA/TGA 4.59 2.37 3.48 AAG/CTT 3.42 3.57 3.50 GCA/TGC 1.47 2.77 2.01 2.16 1.81 2.83CAA/TTG 4.75 1.90 3.33 TAA/TTA 2.99 1.19 2.09 CTC/GAG 3.77 1.19 1.89 1.49 2.41 2.15ATG/CAT 4.43 1.74 3.08 CAA/TTG 2.14 1.19 1.66 AGC/GCT 1.47 2.18 1.26 1.16 2.41 2.27AAC/GTT 4.10 1.27 2.68 CTC/GAG 2.14 0.60 1.37 AGG/CCT 2.50 1.19 1.89 1.24 0.30 1.59ACA/TGT 3.93 1.11 2.52 CAG/CTG 2.14 0.60 1.37 GGA/TCC 2.57 0.99 1.13 1.74 1.20 0.79GGA/TCC 3.44 0.79 2.12 TCA/TGA 0.85 1.79 1.32 AAG/CTT 1.51 0.59 1.64 0.41 0.30 1.59AGG/CCT 1.31 2.06 1.68 ACA/TGT 1.71 0.89 1.30 CAA/TTG 0.29 0.40 3.02 0.41 1.20 0.68CTC/GAG 1.15 2.22 1.68 CAC/GTG 2.14 0.30 1.22 CCA/TGG 1.38 1.39 0.38 0.75 0.60 1.13ACC/GGT 2.13 0.63 1.38 ATC/GAT 1.71 0.60 1.15 CGA/TCG 1.58 0.99 0.38 1.58 0.30 0.34CCA/TGG 1.48 1.11 1.29 CCA/TGG 0.85 1.19 1.02 CAC/GTG 0.99 0.79 0.75 0.58 0.90 1.13CAC/GTG 1.31 0.32 0.81 CCG/CGG 1.71 0.30 1.00 GAC/GTC 0.99 0.40 0.50 1.00 1.20 0.68GCA/TGC 0.16 0.95 0.56 GGA/TCC 0.85 0.89 0.87 AGA/TCT 1.35 0.20 
1.01 0.33 0.60 0.79TAA/TTA 0.00 0.95 0.47 ACC/GGT 0.43 1.19 0.81 GAA/TTC 1.40 0.40 1.64 0.17 0.00 0.68ACT/AGT 0.66 0.16 0.41 GCA/TGC 0.43 0.89 0.66 ACC/GGT 1.29 0.40 0.88 0.33 0.60 0.23AAT/ATT 0.16 0.63 0.40 ATG/CAT 0.85 0.30 0.58 ACG/CGT 0.79 1.39 0.13 0.50 0.60 0.11CAG/CTG 0.33 0.32 0.32 AGC/GCT 0.43 0.30 0.36 ACA/TGT 0.14 0.20 1.89 0.08 0.60 0.45AGC/GCT 0.33 0.32 0.32 GTA/TAC 0.00 0.60 0.30 ATC/GAT 0.32 0.59 0.38 0.25 0.00 0.57GAC/GTC 0.33 0.32 0.32 GAC/GTC 0.43 0.00 0.21 TCA/TGA 0.32 0.20 0.38 0.00 0.30 0.57CCG/CGG 0.16 0.47 0.32 ACT/AGT 0.43 0.00 0.21 AAC/GTT 0.14 0.00 0.88 0.17 0.30 0.00GCC/GGC 0.00 0.47 0.24 CGC/GCG 0.00 0.30 0.15 ATG/CAT 0.25 0.20 0.25 0.08 0.00 0.57ATA/TAT 0.00 0.47 0.24 GCC/GGC 0.00 0.30 0.15 ATA/TAT 0.14 0.40 0.50 0.17 0.00 0.00GTA/TAC 0.33 0.00 0.16 AAC/GTT 0.00 0.30 0.15 AAT/ATT 0.25 0.00 0.13 0.17 0.30 0.11CTA/TAG 0.33 0.00 0.16 AGG/CCT 0.00 0.00 0.00 ACT/AGT 0.11 0.59 0.13 0.00 0.00 0.00CGA/TCG 0.16 0.16 0.16 CGA/TCG 0.00 0.00 0.00 TAA/TTA 0.18 0.00 0.13 0.41 0.00 0.00CGC/GCG 0.00 0.00 0.00 ACG/CGT 0.00 0.00 0.00 GTA/TAC 0.07 0.20 0.38 0.00 0.00 0.00ACG/CGT 0.00 0.00 0.00 CTA/TAG 0.00 0.00 0.00 CTA/TAG 0.09 0.20 0.13 0.00 0.00 0.00TetramersAAGA/TCTT 0.33 0.47 0.40 TAAA/TTTA 0.85 0.89 0.87 CCTC/GAGG 0.09 0.40 0.50 0.17 0.00 0.79AAAC/GTTT 0.33 0.32 0.32 TTAA/TTAA 0.85 0.30 0.58 AGGA/TCCT 0.14 0.00 0.13 0.17 0.60 0.57GAAA/TTTC 0.33 0.32 0.32 AAGA/TCTT 0.43 0.60 0.51 CATC/GATG 0.27 0.00 0.50 0.25 0.00 0.57AGGA/TCCT 0.16 0.16 0.16 AAAG/CTTT 0.00 0.60 0.30 CACG/CGTG 0.09 0.20 0.13 0.08 0.60 0.45CAAA/TTTG 0.16 0.16 0.16 AGAT/ATCT 0.00 0.60 0.30 AAAG/CTTT 0.14 0.20 0.00 0.08 0.90 0.23CATA/TATG 0.16 0.16 0.16 AAAT/ATTT 0.43 0.00 0.21 ATGC/GCAT 0.00 0.00 0.38 0.33 0.00 0.79AAAG/CTTT 0.00 0.32 0.16 AATT/AATT 0.43 0.00 0.21 CATA/TATG 0.14 0.00 0.50 0.41 0.30 0.11AACA/TGTT 0.00 0.32 0.16 ATTA/TAAT 0.43 0.00 0.21 TCCA/TGGA 0.11 0.00 0.50 0.50 0.00 0.34ACAA/TTGT 0.00 0.32 0.16 CCTC/GAGG 0.43 0.00 0.21 CTGC/GCAG 0.02 0.59 0.38 0.33 0.00 0.11
Ara (Arabidopsis thaliana), Bra (Brassica napus), Lyc (Solanum lycopersicum), Sol (Solanum tuberosum), Ory (Oryza sativa), Sor(Sorghum bicolor), Tri (Triticum aestivum), Zea (Zea mays), Sac (Saccharum officinarum) and Hor (Hordeum vulgare)
Table 5. Distribution of penta- to decamer motifs, percentage occurrence per species and average occurrence per family.
Brassicaceae Solanaceae PoaceaeAra Bra Average Lyc Sol Average Ory Sor Tri Zea Sac Hor Average
PentamersGAAAA/TTTTC 0.16 0.47 0.32 AAAAT/ATTTT 0.85 0.30 0.58 CTCTC/GAGAG 0.34 0.59 0.00 0.25 0.30 0.68 0.36AAAAT/ATTTT 0.16 0.32 0.24 AAAAG/CTTTT 0.85 0.00 0.43 GAGGA/TCCTC 0.32 0.00 0.38 0.17 0.00 0.57 0.24AAAAC/GTTTT 0.00 0.47 0.24 AGAAG/CTTCT 0.43 0.30 0.36 CTTCC/GGAAG 0.07 0.20 0.25 0.17 0.60 0.11 0.23CAAAA/TTTTG 0.33 0.00 0.16 ATAAA/TTTAT 0.43 0.30 0.36 GGAGA/TCTCC 0.25 0.20 0.13 0.33 0.00 0.34 0.21GAATC/GATTC 0.00 0.32 0.16 GAAAA/TTTTC 0.43 0.30 0.36 AGGAG/CTCCT 0.29 0.20 0.13 0.33 0.00 0.23 0.20AAATA/TATTT 0.16 0.00 0.08 CAAAC/GTTTG 0.00 0.60 0.30 AGAGG/CCTCT 0.32 0.00 0.25 0.17 0.00 0.34 0.18ACAAA/TTTGT 0.16 0.00 0.08 AAATA/TATTT 0.43 0.00 0.21 CTCCC/GGGAG 0.16 0.00 0.13 0.17 0.60 0.00 0.18ACAAC/GTTGT 0.16 0.00 0.08 AAATC/GATTT 0.43 0.00 0.21 CACCA/TGGTG 0.00 0.00 0.38 0.33 0.30 0.00 0.17ACTAG/CTAGT 0.16 0.00 0.08 AACTG/CAGTT 0.43 0.00 0.21 AGAAG/CTTCT 0.09 0.20 0.25 0.00 0.00 0.45 0.17TGTTC/GAACA 0.16 0.00 0.08 AATAA/TTATT 0.43 0.00 0.21 AGGGG/CCCCT 0.18 0.00 0.25 0.08 0.00 0.45 0.16HexamersGATGAA/TTCATC 0.33 0.16 0.24 GGTGGA/TCCACC 0.00 2.38 1.19 CGGCGA/TCGCCG 0.38 0.20 0.13 0.25 0.30 0.11 0.23AAAACA/TGTTTT 0.00 0.47 0.24 GAAGTA/TACTTC 0.85 0.60 0.72 GCACCA/TGGTGC 0.09 0.00 0.25 0.17 0.60 0.00 0.19AAGGAG/CTCCTT 0.33 0.00 0.16 AGCAGG/CCTGCT 0.85 0.30 0.58 AGGCGG/CCGCCT 0.25 0.20 0.13 0.25 0.00 0.23 0.17AGCCTC/GAGGCT 0.33 0.00 0.16 CAGCAA/TTGCTG 0.43 0.60 0.51 CCGACG/CGTCGG 0.09 0.00 0.00 0.17 0.60 0.11 0.16ATCACC/GGTGAT 0.33 0.00 0.16 CCAACA/TGTTGG 0.85 0.00 0.43 CCGTCG/CGACGG 0.18 0.00 0.13 0.17 0.30 0.11 0.15ATGAAG/CTTCAT 0.33 0.00 0.16 CCTATC/GATAGG 0.85 0.00 0.43 GCCTCC/GGAGGC 0.18 0.40 0.13 0.17 0.00 0.00 0.14CATCAC/GTGATG 0.33 0.00 0.16 GGATGA/TCATCC 0.85 0.00 0.43 GCCACC/GGTGGC 0.02 0.40 0.00 0.00 0.30 0.11 0.14CCTCCA/TGGAGG 0.33 0.00 0.16 AGGAAG/CTTCCT 0.43 0.30 0.36 CGGCGC/GCGCCG 0.05 0.59 0.00 0.17 0.00 0.00 0.13CCTGAG/CTCAGG 0.33 0.00 0.16 ATGAAG/CTTCAT 0.43 0.30 0.36 CGACGC/GCGTCG 0.07 0.40 0.00 0.33 0.00 0.00 0.13GAATCC/GGATTC 0.33 0.00 0.16 
CAACCT/AGGTTG 0.43 0.30 0.36 GGAGCC/GGCTCC 0.00 0.20 0.13 0.17 0.30 0.00 0.13HeptamersACACAAA/TTTGTGT 0.33 0.00 0.16 CTTCTCT/AGAGAAG 0.85 0.00 0.43 CCGCCGC/GCGGCGG 0.18 0.20 0.00 0.00 0.00 0.11 0.08GAGAGAA/TTCTCTC 0.16 0.16 0.16 GATCTCC/GGAGATC 0.85 0.00 0.43 CGCCGCC/GGCGGCG 0.02 0.20 0.25 0.00 0.00 0.00 0.08AGAGAGA/TCTCTCT 0.00 0.32 0.16 AAAAAAT/ATTTTTT 0.43 0.30 0.36 CCGGCGA/TCGCCGG 0.00 0.40 0.00 0.00 0.00 0.00 0.07AATTACA/TGTAATT 0.16 0.00 0.08 AAATTTA/TAAATTT 0.43 0.30 0.36 CCGCCGA/TCGGCGG 0.00 0.00 0.00 0.08 0.30 0.00 0.06ATGAGTG/CACTCAT 0.16 0.00 0.08 TCAACTA/TAGTTGA 0.00 0.60 0.30 CGGCAGG/CCTGCCG 0.02 0.00 0.00 0.00 0.30 0.00 0.05CAGCGAC/GTCGCTG 0.16 0.00 0.08 TTTTTTG/CAAAAAA 0.00 0.60 0.30 AAAATGA/TCATTTT 0.00 0.00 0.00 0.00 0.30 0.00 0.05CATTCAA/TTGAATG 0.16 0.00 0.08 AATTGAG/CTCAATT 0.43 0.00 0.21 ACGCAAG/CTTGCGT 0.00 0.00 0.00 0.00 0.30 0.00 0.05CCTCTCT/AGAGAGG 0.16 0.00 0.08 AGAAACA/TGTTTCT 0.43 0.00 0.21 AGCAGAG/CTCTGCT 0.00 0.00 0.00 0.00 0.30 0.00 0.05CTCAACT/AGTTGAG 0.16 0.00 0.08 ATCGCCG/CGGCGAT 0.43 0.00 0.21 CACGCCG/CGGCGTG 0.00 0.00 0.00 0.00 0.30 0.00 0.05TCTCAAA/TTTGAGA 0.16 0.00 0.08 ATGATTC/GAATCAT 0.43 0.00 0.21 CACTGCG/CGCAGTG 0.00 0.00 0.00 0.00 0.30 0.00 0.05OctamersATGTATGA/TCATACAT 0.16 0.00 0.08 AAGAAAAA/TTTTTCTT 0.00 0.30 0.15 GAAGTCAA/TTGACTTC 0.00 0.00 0.13 0.00 0.00 0.00 0.02CCCCTTCT/AGAAGGGG 0.16 0.00 0.08 TTTCTCTC/GAGAGAAA 0.00 0.30 0.15 GCGACCGA/TCGGTCGC 0.00 0.00 0.13 0.00 0.00 0.00 0.02CTTGTTCC/GGAACAAG 0.16 0.00 0.08 AAAAAAAC/GTTTTTTT 0.00 0.00 0.00 CCGCACGC/GCGTGCGG 0.00 0.00 0.00 0.00 0.00 0.11 0.02GAAGCAAG/CTTGCTTC 0.16 0.00 0.08 ACGGGCGA/TCGCCCGT 0.00 0.00 0.00 CCTATCTA/TAGATAGG 0.00 0.00 0.00 0.00 0.00 0.11 0.02AAAAAAAC/GTTTTTTT 0.00 0.16 0.08 AGAAAAAA/TTTTTTCT 0.00 0.00 0.00 CAAGAAGC/GCTTCTTG 0.05 0.00 0.00 0.00 0.00 0.00 0.01AGAAAAAA/TTTTTTCT 0.00 0.16 0.08 ATCAGGGA/TCCCTGAT 0.00 0.00 0.00 ACGGGCGA/TCGCCCGT 0.02 0.00 0.00 0.00 0.00 0.00 0.00TCTTTGTG/CACAAAGA 0.00 0.16 0.08 ATGATGTA/TACATCAT 0.00 0.00 0.00 
ATCAGGGA/TCCCTGAT 0.02 0.00 0.00 0.00 0.00 0.00 0.00AAGAAAAA/TTTTTCTT 0.00 0.00 0.00 ATGTATGA/TCATACAT 0.00 0.00 0.00 ATGATGTA/TACATCAT 0.02 0.00 0.00 0.00 0.00 0.00 0.00ACGGGCGA/TCGCCCGT 0.00 0.00 0.00 CAAGAAGC/GCTTCTTG 0.00 0.00 0.00 TCAAATTT/AAATTTGA 0.02 0.00 0.00 0.00 0.00 0.00 0.00ATCAGGGA/TCCCTGAT 0.00 0.00 0.00 CCCCTTCT/AGAAGGGG 0.00 0.00 0.00 TGGGCTTG/CAAGCCCA 0.02 0.00 0.00 0.00 0.00 0.00 0.00NonamersAAGATGAAG/CTTCATCTT 0.16 0.00 0.08 ACTCCTTCA/TGAAGGAGT 0.00 0.30 0.15 ACGACTACG/CGTAGTCGT 0.00 0.00 0.00 0.00 0.30 0.00 0.05AATGGGTGG/CCACCCATT 0.16 0.00 0.08 CAAATTACC/GGTAATTTG 0.00 0.30 0.15 AGCGAAGAA/TTCTTCGCT 0.00 0.00 0.00 0.00 0.30 0.00 0.05AGAAGGAAG/CTTCCTTCT 0.16 0.00 0.08 CAGACTATT/AATAGTCTG 0.00 0.30 0.15 AGCACCAGC/GCTGGTGCT 0.00 0.20 0.00 0.00 0.00 0.00 0.03ATGGGTGAC/GTCACCCAT 0.16 0.00 0.08 CTTCTTATC/GATAAGAAG 0.00 0.30 0.15 GGTGGTATG/CATACCACC 0.00 0.20 0.00 0.00 0.00 0.00 0.03GAAGGAGAA/TTCTCCTTC 0.16 0.00 0.08 AAAAAAAAC/GTTTTTTTT 0.00 0.00 0.00 ACCCTCTCC/GGAGAGGGT 0.00 0.00 0.13 0.00 0.00 0.00 0.02GAGAAGAAG/CTTCTTCTC 0.16 0.00 0.08 AACAGGAGA/TCTCCTGTT 0.00 0.00 0.00 CCGCTGGAT/ATCCAGCGG 0.00 0.00 0.13 0.00 0.00 0.00 0.02GAGGAAGAA/TTCTTCCTC 0.16 0.00 0.08 AAGATGAAG/CTTCATCTT 0.00 0.00 0.00 GCTGTGACC/GGTCACAGC 0.00 0.00 0.13 0.00 0.00 0.00 0.02GAGGAAGAG/CTCTTCCTC 0.16 0.00 0.08 AATGGGTGG/CCACCCATT 0.00 0.00 0.00 ACCACCAGC/GCTGGTGGT 0.00 0.00 0.00 0.00 0.00 0.11 0.02TATAATTCG/CGAATTATA 0.16 0.00 0.08 ACAGCAACA/TGTTGCTGT 0.00 0.00 0.00 ACCACGGAC/GTCCGTGGT 0.00 0.00 0.00 0.00 0.00 0.11 0.02TCTTCGTCT/AGACGAAGA 0.16 0.00 0.08 ACCACCAGC/GCTGGTGGT 0.00 0.00 0.00 CCATCCTTA/TAAGGATGG 0.00 0.00 0.00 0.00 0.00 0.11 0.02DecamersACTTTGAGTG/CACTCAAAGT 0.16 0.00 0.08 AAAAAGAAAA/TTTTCTTTTT 0.00 0.00 0.00 AAAAAGAAAA/TTTTCTTTTT 0.00 0.00 0.00 0.00 0.30 0.00 0.05CAAAGTCACT/AGTGACTTTG 0.16 0.00 0.08 ACTTTGAGTG/CACTCAAAGT 0.00 0.00 0.00 CCACGCGTCG/CGACGCGTGG 0.23 0.00 0.00 0.00 0.00 0.00 0.04TTTTTTTTCT/AGAAAAAAAA 0.00 0.16 0.08 AGCCCCAACG/CGTTGGGGCT 0.00 0.00 0.00 
TTTTTTTTCT/AGAAAAAAAA 0.00 0.00 0.13 0.00 0.00 0.00 0.02AAAAAGAAAA/TTTTCTTTTT 0.00 0.00 0.00 ATCTCCGCCG/CGGCGGAGAT 0.00 0.00 0.00 AGCCCCAACG/CGTTGGGGCT 0.05 0.00 0.00 0.00 0.00 0.00 0.01 Ara (Arabidopsis thaliana), Bra (Brassica napus), Lyc (Solanum lycopersicum), Sol (Solanum tuberosum), Ory (Oryza sativa), Sor(Sorghum bicolor), Tri (Triticum aestivum), Zea (Zea mays), Sac (Saccharum officinarum) and Hor (Hordeum vulgare)
4. Distribution and patterns of microsatellite occurrence in the whole rice
genome
Genetics and Molecular Biology (ISSN 1415-4757)
ABSTRACT
The objective of this work was to describe the abundance of microsatellites in
the complete sequence of the rice genome. Total occurrence and type distribution of
microsatellites per chromosome were evaluated. Our results indicate that the
occurrence of different loci on distinct chromosomes follows an apparent distribution
pattern. The results also indicate that, if only two-mers and three-mers are selected, it
is possible to position markers on average every 24.9 Kb.
INTRODUCTION
Microsatellites or SSRs (simple sequence repeats) are DNA sequences formed
by tandem repetitions of motifs between one and six base pairs long (Morgante and
Olivieri, 1993). Microsatellite regions arise as a consequence of loop or
hairpin structures formed during replication, and they can expand or contract
through errors in the functioning of the DNA polymerase (Iyer et al., 2000). Initially, some
authors attributed the origin of microsatellites to random processes. However,
many studies now describe the genomic distribution of microsatellites as a
non-random process (Li et al., 2002). This is based on evidence that these mutations
affect chromatin organization, regulation of gene activity, recombination and DNA
replication (Li et al., 2004). During the 1980s, microsatellites were primarily studied in humans
(Weber and May, 1989; Litt and Luty, 1989; Tautz, 1989) and were later identified in other
eukaryotic genomes, including plants (Morgante et al., 1993; Wang et al., 1994;
Taramino and Tingey, 1996; McCouch et al., 2001). Microsatellites are an important
class of molecular markers that are used to understand spatial relationships between
chromosome segments and can be useful for evaluating the temporal and evolutionary
relationships between species and genera (Kashi et al., 1997).
Among the different types of molecular markers, microsatellites are
very promising thanks to their multi-allelic nature, reproducibility, co-dominant
inheritance and abundant genomic distribution. These markers are useful for integrating
genetic and physical maps with sequenced genomic regions, and they provide breeders
and plant geneticists with an efficient tool to link phenotypic and genotypic variation
(Varshney et al., 2005). Several studies on microsatellites using information from
ESTs, cDNAs and BACs from the rice genome have been published (Temnykh
et al., 2001; McCouch et al., 1997, 2002; Morgante et al., 2002; Varshney et al.,
2002; La Rota et al., 2005; IRGSP, 2005; Parida et al., 2006). However, for most of these
studies, the complete sequence was not yet available. With the current availability of
the complete and ordered sequence, it is possible to obtain more precise statistics on
rice microsatellites.
In the present work, the abundance of microsatellites in the rice genome was
analyzed with the goal of clarifying the rates, frequencies and distribution of different
microsatellites on different chromosomes, as well as describing which loci are
best suited for use as molecular markers.
MATERIAL AND METHODS
FASTA files containing the pseudomolecules corresponding to the twelve rice
chromosomes (Oryza sativa ssp. japonica cv. Nipponbare) (IRGSP, 2005) were
obtained from the NCBI (National Center for Biotechnology Information)
database (http://www.ncbi.nlm.nih.gov/). Microsatellite searches were performed with
SSRLocator, a software developed in our lab (Maia et al., 2008)
(www.ufpel.edu.br/~lmaia.faem). The software was configured to locate Class I (≥ 20
bp) and Class II (≥ 12 and < 20 bp) microsatellites (Temnykh et al., 2001). It
was also configured to select only repeats with a minimum length of 12 bp, i.e., 12x for
monomers, 6x for two-mers, 4x for three-mers, 3x for four-mers and five-mers, and 2x for
six-mers. Class I and Class II repeats were then stored in separate files and
analyzed separately.
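The search criteria above can be illustrated with a short sketch. This is not the SSRLocator algorithm, whose internals are not described here; it is a minimal regular-expression search for perfect repeats that applies the minimum repeat counts and the Class I / Class II length limits stated in the text. The function names (`find_ssrs`, `is_primitive`) and the decision to skip motifs that are themselves repetitions of a shorter unit are assumptions made for the illustration.

```python
import re

# Minimum repeat counts per motif length, as stated in the text
# (12x monomers, 6x two-mers, 4x three-mers, 3x four-/five-mers, 2x six-mers).
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 2}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n))


def find_ssrs(sequence):
    """Return (start, motif, repeats, length_bp, ssr_class) for each perfect repeat."""
    sequence = sequence.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if not is_primitive(motif):
                continue  # assumption: 'ATAT' runs are reported once, as 'AT'
            length = match.end() - match.start()
            ssr_class = "I" if length >= 20 else "II"  # Class I: >= 20 bp
            hits.append((match.start(), motif, length // unit, length, ssr_class))
    return sorted(hits)


if __name__ == "__main__":
    demo = "GGG" + "AT" * 10 + "CCGT" + "CAG" * 4 + "T"
    for hit in find_ssrs(demo):
        print(hit)
```

In this toy sequence the 20 bp (AT)10 run falls in Class I and the 12 bp (CAG)4 run in Class II. A tool that also counts non-primitive motifs (for example, reporting every 12 bp run as a six-mer with two repeats) would give higher six-mer counts than this sketch.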
RESULTS AND DISCUSSION
Overall occurrence
Table 1 shows the size (Mb) and genome percentage covered by each rice
chromosome, followed by the total occurrence of each microsatellite type, class and
number of loci per million base pairs (loci/Mb) per chromosome.
Chromosome one is the largest (43.3 Mb) and represents 11.7% of total
genome size. The smallest genome fractions are found in chromosomes nine and ten,
both measuring 22.7 Mb and each representing 6.1% of the total genome. Total
occurrence of microsatellites was 484,613 loci, including both classes (I and II) and
all types of microsatellites. The most and least common types were six-mers
(290,360) and five-mers (15,924), respectively. The highest and lowest percentage
values were 59.9% (six-mers) and 3.3% (five-mers and monomers), respectively
(Table 1). The highest average was 780.2 loci/Mb (six-mers) and the lowest was 42.5
loci/Mb (five-mers). The overall average occurrence was 1,301 loci/Mb.
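As a small worked check of the density figures, the sketch below recomputes loci/Mb from the per-chromosome sizes and total loci listed in Table 1. The function name `density_summary` is hypothetical. Because the chromosome sizes are taken as rounded in the table, the result only approximates the reported values. It also shows that the reported overall average (about 1,301 loci/Mb) corresponds to the mean of the twelve per-chromosome frequencies, whereas dividing total loci by total size gives a slightly higher value (about 1,307 loci/Mb).

```python
# (size in Mb, total loci >= 12 bp) for chromosomes 1 to 12, as listed in Table 1.
CHROMOSOMES = [
    (43.3, 59579), (36.0, 49403), (36.2, 50227), (35.5, 44486),
    (29.7, 40246), (30.7, 41810), (29.6, 38753), (28.4, 37901),
    (22.7, 30017), (22.7, 29905), (28.4, 26869), (27.6, 35417),
]


def density_summary(chromosomes):
    """Return (mean of per-chromosome loci/Mb, genome-wide loci/Mb)."""
    per_chromosome = [loci / mb for mb, loci in chromosomes]
    mean_of_chromosomes = sum(per_chromosome) / len(per_chromosome)
    total_mb = sum(mb for mb, _ in chromosomes)
    total_loci = sum(loci for _, loci in chromosomes)
    return mean_of_chromosomes, total_loci / total_mb


if __name__ == "__main__":
    mean_density, genome_density = density_summary(CHROMOSOMES)
    print(f"mean of chromosomes: {mean_density:.1f} loci/Mb")
    print(f"genome-wide:         {genome_density:.1f} loci/Mb")
```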
Looking at some rice BACs, Morgante et al. (2002) found a minimum of 118.4
five-mers and a maximum of 321.3 three-mers per Mb, respectively. For gene-rich
maize BACs, the maximum frequency found was 267.5 loci/Mb (tetramers) and the
minimum frequency 108.4 was 48.0 loci/Mb (monomers). Gene-poor BAC results
were 15.4 monomer loci/Mb and 171.4 three-mer loci/Mb for minimum and maximum
frequency, respectively. A survey of rice genomic sequences (474 Mb) revealed a
range of 7.1 monomer loci to 807.8 six-mer loci per Mb. An Arabidopsis survey
indicated that microsatellite loci range from 32.0 to 733.0 per Mb, for five- and six-mers,
respectively. Another survey involving Medicago truncatula (77.1 Mb) genomic
clones also showed the highest (733.9 loci/Mb) and the lowest (32.0 loci/Mb)
frequencies for six- and five-mers, respectively (Mun et al., 2006). The
results found in the above-mentioned reports agree with the results obtained in the
present work. The major differences found between rice, M. truncatula and Arabidopsis were
for monomers, which were 2.5 to 4 times less frequent in rice, and for
three-mers, which were two times more frequent in rice.
The overall results obtained in this study indicate a common pattern of
occurrence, except for chromosomes 4 and 11, where lower frequencies were found
(Table 2).
Separate Class I and Class II occurrences
Table 2 shows the distribution and occurrence of Class I and Class II microsatellites
per chromosome, followed by their percentage values and locus
frequency/Mb. A total of 22,581 Class I and 462,032 Class II microsatellites were
detected, corresponding to 5% and 95% of total occurrences, respectively.
Class I microsatellites comprised 826 monomer, 10,542 two-mer, 4,949
three-mer, 2,345 four-mer, 2,654 five-mer and 1,265 six-mer loci. The most frequent
type was the two-mers, corresponding to 46.88% of overall Class I occurrence. The
least frequent type was the six-mers with 5.60% of overall Class I occurrence. The
density of Class I microsatellites ranged from 2.2 loci/Mb (monomers) to 28.4 loci/Mb
(two-mers), with an average of 60.6 loci/Mb. For Class II microsatellites, the density
ranged from 41.0 loci/Mb to 776.8 loci/Mb for monomers and six-mers,
respectively, with an average density of 1,240 loci/Mb.
Class I microsatellites showed a frequency range of 2.3 loci/Mb (monomers) to
29.8 loci/Mb (two-mers) for rice and a range of 1.4 loci/Mb (four-mers) to 21.4 (two-
mers) loci/Mb for Arabidopsis (Mun et al., 2006).
Figure 2 shows the average percentage of Class I microsatellites. The
least frequent types were monomers (3.53%) and the most frequent were two-mers
(47.00%). When both classes are considered, however, six-mers become the most
frequent repeat type (59.9%) (Figure 1).
Microsatellite sizes
Table 3 shows Class I and Class II microsatellite average sizes
individually for each chromosome. For Class I, average sizes ranged from
21.4 to 39.5 bp for five-mers and two-mers, respectively. For Class II microsatellites,
they ranged from 12.5 to 15 bp for four-mers and five-mers, respectively. The overall
average was 27.5 bp and 13.3 bp for Class I and Class II, respectively. Table
3 also shows that average microsatellite sizes vary little among chromosomes.
Considering that Class I microsatellites are reported as the most useful as
molecular markers, two-mer and three-mer loci are the best candidates, since they
presented longer repeat sequences on average.
Distances between microsatellites
Table 4 presents the average distances between microsatellites (Kb)
per class, type and chromosome. For both classes, distances between
microsatellites were shorter for three-mers (4.6 Kb) and longer for monomers (445.8
Kb).
The overall average distance between microsatellite loci was 11.5 Kb, while
for Class I this average was 189.9 Kb. Regarding the per-chromosome distribution, the
shortest average distance between microsatellites considering both classes was
found in chromosome 1 (10.5 Kb). The longest distance was 12.6 Kb and was found
in chromosome 4. When only Class I microsatellites are considered, the shortest and
longest distances were 162.8 and 215.9 Kb and were found in chromosomes 1 and
10, respectively.
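The text does not state how the average distances were computed. One plausible reading, shown below as a sketch under that assumption, is the mean gap between consecutive locus start positions on a chromosome. A rough cross-check from density alone is 1,000 divided by loci/Mb; for Class I two-mers (28.4 loci/Mb) this gives about 35.2 Kb, close to the 34.4 Kb reported in Table 4. Both function names are hypothetical.

```python
def mean_spacing_kb(starts_bp):
    """Average distance (Kb) between consecutive loci, given start positions in bp."""
    ordered = sorted(starts_bp)
    if len(ordered) < 2:
        raise ValueError("need at least two loci to measure a distance")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) / 1000.0


def spacing_from_density_kb(loci_per_mb):
    """Approximate spacing (Kb) implied by a density in loci/Mb."""
    return 1000.0 / loci_per_mb


if __name__ == "__main__":
    # Toy positions (bp), not real rice coordinates.
    print(mean_spacing_kb([1_000, 11_000, 31_000]))
    # Class I two-mer density reported in Table 2.
    print(round(spacing_from_density_kb(28.4), 1))
```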
CONCLUSION
The results provide a general view of the abundance and distribution patterns
of microsatellites in the rice genome. Previous reports have been based on genomic
samples or on unordered sequences from pseudomolecules representing rice
chromosomes.
In the initial analysis, where both Class I and II (≥12 bp) were considered, the
overall frequency of each microsatellite type in each chromosome was assessed. In
this analysis, six-mers were the most abundant type (59.9%). This could be due to
the fact that any six-mer locus with two or more repeats was detected. Still in this analysis,
the comparison with A. thaliana and M. truncatula indicated similar frequencies for
two-, four-, five- and six-mers and contrasting frequencies for mono- and three-mers
among the three species.
For the second part of the analysis, where only Class I repeats (≥20 bp) were
considered, the most abundant types were two-mers (28.4 loci/Mb) and three-mers
(13.2 loci/Mb), with average distances of 34.4 Kb between two-mers and 74.8 Kb
between three-mers. Besides being more frequent, these repeats were also the
longest, with average lengths of 39.5 bp and 28.5 bp for two-mers and three-mers,
respectively. Considering the use of both these types, one can reach an average
coverage of about 3 loci every 74.8 Kb, or one locus every 24.9 Kb. This coverage
represents an excellent supply of markers to saturate any targeted genomic region
during mapping studies.
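The coverage estimate can be reproduced with simple arithmetic. A 74.8 Kb window holds on average one three-mer and about two two-mers (74.8 / 34.4 ≈ 2.2); rounding this to 3 loci gives 74.8 / 3 ≈ 24.9 Kb. The sketch below, with hypothetical function names, compares that rounded figure with the value obtained by adding the two densities without rounding, which is slightly smaller (about 23.6 Kb). It assumes that the two marker types are positioned independently, so that their densities simply add.

```python
def combined_spacing_kb(*spacings_kb):
    """Spacing (Kb) of the pooled marker set, assuming the densities add."""
    density_per_kb = sum(1.0 / s for s in spacings_kb)
    return 1.0 / density_per_kb


def rounded_estimate_kb(window_kb, loci_in_window):
    """Spacing obtained by dividing a window by a whole number of loci."""
    return window_kb / loci_in_window


if __name__ == "__main__":
    # Class I average distances from Table 4: two-mers 34.4 Kb, three-mers 74.8 Kb.
    print(round(combined_spacing_kb(34.4, 74.8), 1))
    print(round(rounded_estimate_kb(74.8, 3), 1))
```

Either way, the pooled two-mer and three-mer Class I loci are spaced at roughly one marker every 24 Kb.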
Finally, the data on the loci found, the average distance between loci,
repeat size and the ratio of different types in the rice genome suggest a similar
distribution across the 12 chromosomes.
REFERENCES:
Iyer RR, Pluciennik A, Rosche WA, Sinden RR, Wells RD (2000) DNA polymerase III
proofreading mutants enhance the expansion and deletion of triplet repeat
sequences in Escherichia coli. Journal of Biological Chemistry, v.275, n.3,
p.2174-2184.
IRGSP. (2005) The map-based sequence of the rice genome. Nature.
11;436(7052):793-800.
Kashi Y, King D, Soller M. (1997) Simple sequence repeats as a source of
quantitative genetic variation. Trends Genet. 13(2):74-8.
La Rota M, Kantety RV, Yu JK, Sorrells ME (2005) Nonrandom distribution and
frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and
barley. BMC Genomics. 18;6(1):23.
Li YC, Korol AB, Fahima T, Beiles A, Nevo E. (2002) Microsatellites: genomic
distribution, putative functions and mutational mechanisms: a review. Mol Ecol.
11(12):2453-65.
Li YC, Korol AB, Fahima T, Nevo E (2004) Microsatellites within genes: structure,
function, and evolution. Mol Biol Evol. 21(6):991-1007.
McCouch SR, Chen X, Panaud O, Temnykh S, Xu Y, Cho YG, Huang N, Ishii T, Blair
M (1997). Microsatellite marker development, mapping and applications in rice
genetics and breeding. Plant Mol Biol. 35(1-2):89-99.
McCouch SR, Teytelman L, Xu Y, Lobos KB, Clare K, Walton M, Fu B, Maghirang R,
Li Z, Xing Y, Zhang Q, Kono I, Yano M, Fjellstrom R, DeClerck G, Schneider D,
Cartinhour S, Ware D, Stein L. (2002) Development and mapping of 2240 new SSR
markers for rice (Oryza sativa L.). DNA Res. 9(6):199-207.
Morgante M, Hanafey M, Powell W. (2002) Microsatellites are preferentially
associated with nonrepetitive DNA in plant genomes. Nat Genet. 30(2):194-200.
Morgante M, Olivieri AM (1993) PCR-amplified microsatellites as markers in plant
genetics. Plant J 1: 175-182.
Mun JH, Kim DJ, Choi HK, Gish J, Debellé F, Mudge J, Denny R, Endré G, Saurat O,
Dudez AM, Kiss GB, Roe B, Young ND, Cook DR. (2006) Distribution of
microsatellites in the genome of Medicago truncatula: a resource of genetic markers
that integrate genetic and physical maps. Genetics. 172(4):2541-55.
Parida SK, Anand Raj Kumar K, Dalal V, Singh NK, Mohapatra T. (2006) Unigene
derived microsatellite markers for the cereal genomes. Theor Appl Genet.
112(5):808-17.
Taramino G, Tingey S. (1996) Simple sequence repeats for germplasm analysis and
mapping in maize. Genome. 39(2):277-87.
Tautz D. (1989) Hypervariability of simple sequences as a general source for
polymorphic DNA markers. Nucleic Acids Res. 17(16):6463-71.
Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S, McCouch S. (2001)
Computational and experimental analysis of microsatellites in rice (Oryza sativa L.):
frequency, length variation, transposon associations, and genetic marker potential.
Genome Res. 11(8):1441-52.
Varshney RK, Graner A, Sorrells ME. (2005) Genic microsatellite markers in plants:
features and applications. Trends Biotechnol. 23(1):48-55.
Varshney RK, Thiel T, Stein N, Langridge P, Graner A. (2002) In silico analysis on
frequency and distribution of microsatellites in ESTs of some cereal species. Cell
Mol Biol Lett. 7(2A):537-46.
Wang Z, Weber JL, Zhong G, Tanksley SD (1994) Survey of plant short tandem DNA
repeats. Theor Appl Genet 88: 1-6.
Weber JL, May PE. (1989) Abundant class of human DNA polymorphisms which can
be typed using the polymerase chain reaction. Am J Hum Genet 44: 388-396.
Table 1. Total amounts of microsatellite types (≥ 12 bp)* in the twelve chromosomes. Frequency is given in loci/Mb.
                           Mono-              Di-              Tri-             Tetra-            Penta-             Hexa-             Total
Chr.        Mb     %   Amount    Freq.   Amount    Freq.   Amount    Freq.   Amount    Freq.   Amount    Freq.   Amount    Freq.   Amount    Freq.
1         43.3  11.7    2,183     50.5    4,326    100.0    9,472    218.9    5,943    137.4    2,118     49.0   35,537    821.4   59,579  1,377.2
2         36.0   9.7    1,774     49.3    3,442     95.7    8,178    227.5    4,966    138.1    1,716     47.7   29,327    815.7   49,403  1,374.0
3         36.2   9.8    1,785     49.3    3,504     96.8    8,337    230.4    5,073    140.2    1,698     46.9   29,830    824.2   50,227  1,387.8
4         35.5   9.6    1,514     42.6    3,137     88.4    7,077    199.4    4,196    118.2    1,336     37.6   27,226    767.0   44,486  1,253.2
5         29.7   8.0    1,331     44.8    2,847     95.7    7,012    235.8    3,842    129.2    1,346     45.3   23,868    802.6   40,246  1,353.4
6         30.7   8.3    1,368     44.5    3,154    102.6    7,085    230.5    4,004    130.3    1,399     45.5   24,800    807.0   41,810  1,360.5
7         29.6   8.0    1,284     43.3    2,681     90.4    6,231    210.2    3,851    129.9    1,372     46.3   23,334    787.1   38,753  1,307.3
8         28.4   7.7    1,251     44.0    2,795     98.3    6,255    220.0    3,743    131.6    1,169     41.1   22,688    797.9   37,901  1,332.9
9         22.7   6.1      958     42.2    2,277    100.3    4,958    218.4    2,967    130.7      888     39.1   17,969    791.7   30,017  1,322.5
10        22.7   6.1      881     38.8    2,268    100.0    4,942    217.8    2,778    122.5      962     42.4   18,074    796.7   29,905  1,318.2
11        28.4   7.7      768     27.1    2,083     73.4    4,041    142.4    2,684     94.6      802     28.3   16,491    580.9   26,869    946.5
12        27.6   7.4    1,133     41.1    2,797    101.5    5,682    206.1    3,471    125.9    1,118     40.6   21,216    769.6   35,417  1,284.8
Total    370.8     -   16,230        -   35,311        -   79,270        -   47,518        -   15,924        -  290,360        -  484,613        -
Average          8.3    1,353     43.1    2,943     95.3    6,606    213.1    3,960    127.4    1,327     42.5   24,197    780.2   40,384  1,301.5
%                         3.3              7.3             16.4              9.8              3.3             59.9                 -        -
* Class I microsatellites: ≥ 20 bp; Class II microsatellites: ≥ 12 bp and < 20 bp.
Table 2. Distributions, percentage and frequency of different microsatellite types within Classes I and II in the twelve chromosomes. Occur. = number of loci; Fraction = share of each class within the type; Freq. = loci/Mb.
                        Mono-              Di-              Tri-             Tetra-            Penta-             Hexa-             Total
Chr.    Row              I       II        I       II        I       II        I       II        I       II        I       II        I       II
1       Occur.         136    2,047    1,312    3,014      630    8,842      284    5,659      337    1,781      155   35,382    2,854   56,725
        Fraction      0.06     0.94     0.30     0.70     0.07     0.93     0.05     0.95     0.16     0.84     0.00     1.00     0.05     0.95
        Freq.          3.1     47.3     30.3     69.7     14.6    204.4      6.6    130.8      7.8     41.2      3.6    817.9     66.0  1,311.2
2       Occur.          99    1,675    1,066    2,376      567    7,611      268    4,698      287    1,429      125   29,202    2,412   46,991
        Fraction      0.06     0.94     0.31     0.69     0.07     0.93     0.05     0.95     0.17     0.83     0.00     1.00     0.05     0.95
        Freq.          2.8     46.6     29.6     66.1     15.8    211.7      7.5    130.7      8.0     39.7      3.5    812.2     67.1  1,306.9
3       Occur.          94    1,691    1,145    2,359      599    7,738      231    4,842      270    1,428      127   29,703    2,466   47,761
        Fraction      0.05     0.95     0.33     0.67     0.07     0.93     0.05     0.95     0.16     0.84     0.00     1.00     0.05     0.95
        Freq.          2.6     46.7     31.6     65.2     16.6    213.8      6.4    133.8      7.5     39.5      3.5    820.7     68.1  1,319.6
4       Occur.          68    1,446      840    2,297      408    6,669      166    4,030      220    1,116      118   27,108    1,820   42,666
        Fraction      0.04     0.96     0.27     0.73     0.06     0.94     0.04     0.96     0.16     0.84     0.00     1.00     0.04     0.96
        Freq.          1.9     40.7     23.7     64.7     11.5    187.9      4.7    113.5      6.2     31.4      3.3    763.6     51.3  1,201.9
5       Occur.          67    1,264      846    2,001      415    6,597      190    3,652      249    1,097      103   23,765    1,870   38,376
        Fraction      0.05     0.95     0.30     0.70     0.06     0.94     0.05     0.95     0.18     0.82     0.00     1.00     0.05     0.95
        Freq.          2.3     42.5     28.4     67.3     14.0    221.8      6.4    122.8      8.4     36.9      3.5    799.2     62.9  1,290.5
6       Occur.          62    1,306      882    2,272      424    6,661      185    3,819      243    1,156      116   24,684    1,912   39,898
        Fraction      0.05     0.95     0.28     0.72     0.06     0.94     0.05     0.95     0.17     0.83     0.00     1.00     0.05     0.95
        Freq.          2.0     42.5     28.7     73.9     13.8    216.7      6.0    124.3      7.9     37.6      3.8    803.2     62.2  1,298.3
7       Occur.          65    1,219      801    1,880      380    5,851      178    3,673      233    1,139      103   23,231    1,760   36,993
        Fraction      0.05     0.95     0.30     0.70     0.06     0.94     0.05     0.95     0.17     0.83     0.00     1.00     0.05     0.95
        Freq.          2.2     41.1     27.0     63.4     12.8    197.4      6.0    123.9      7.9     38.4      3.5    783.7     59.4  1,247.9
8       Occur.          65    1,186      820    1,975      406    5,849      183    3,560      191      978      116   22,572    1,781   36,120
        Fraction      0.05     0.95     0.29     0.71     0.06     0.94     0.05     0.95     0.16     0.84     0.01     0.99     0.05     0.95
        Freq.          2.3     41.7     28.8     69.5     14.3    205.7      6.4    125.2      6.7     34.4      4.1    793.8     62.6  1,270.3
9       Occur.          58      900      712    1,565      268    4,690      175    2,792      150      738       78   17,891    1,441   28,576
        Fraction      0.06     0.94     0.31     0.69     0.05     0.95     0.06     0.94     0.17     0.83     0.00     1.00     0.05     0.95
        Freq.          2.6     39.7     31.4     69.0     11.8    206.6      7.7    123.0      6.6     32.5      3.4    788.3     63.5  1,259.0
10      Occur.          37      844      649    1,619      292    4,650      125    2,653      156      806       85   17,989    1,344   28,561
        Fraction      0.04     0.96     0.29     0.71     0.06     0.94     0.04     0.96     0.16     0.84     0.00     1.00     0.04     0.96
        Freq.          1.6     37.2     28.6     71.4     12.9    205.0      5.5    116.9      6.9     35.5      3.7    793.0     59.2  1,259.0
11      Occur.          26      742      628    1,455      226    3,815      164    2,520      139      663       57   16,434    1,240   25,629
        Fraction      0.03     0.97     0.30     0.70     0.06     0.94     0.06     0.94     0.17     0.83     0.00     1.00     0.05     0.95
        Freq.          0.9     26.1     22.1     51.3      8.0    134.4      5.8     88.8      4.9     23.4      2.0    578.9     43.7    902.8
12      Occur.          49    1,084      841    1,956      334    5,348      196    3,275      179      939       82   21,134    1,681   33,736
        Fraction      0.04     0.96     0.30     0.70     0.06     0.94     0.06     0.94     0.16     0.84     0.00     1.00     0.05     0.95
        Freq.          1.8     39.3     30.5     71.0     12.1    194.0      7.1    118.8      6.5     34.1      3.0    766.6     61.0  1,223.8
Total   Occur.         826   15,404   10,542   24,769    4,949   74,321    2,345   45,173    2,654   13,270    1,265  289,095   22,581  462,032
Average Occur.        68.8  1,283.7    878.5  2,064.1    412.4  6,193.4    195.4  3,764.4    221.2  1,105.8    105.4 24,091.3  1,881.8 38,502.7
        Fraction      0.05     0.95     0.30     0.70     0.06     0.94     0.05     0.95     0.17     0.83     0.00     1.00     0.05     0.95
        Freq.          2.2     41.0     28.4     66.9     13.2    200.0      6.3    121.0      7.1     35.4      3.4    776.8     60.6  1,240.9
Table 3. Average locus size (bp) of different microsatellite types within Classes I and II for the twelve chromosomes.
             Mono-         Di-          Tri-         Tetra-        Penta-         Hexa-        Average
Chr.          I     II      I     II      I     II      I     II      I     II      I     II      I     II
1          22.7   13.6   37.7   13.7   28.6   13.3   25.3   12.5   21.3   15.0   26.9   12.2   27.1   13.4
2          22.8   13.6   39.2   13.8   28.5   13.3   25.6   12.5   21.3   15.0   26.3   12.2   27.3   13.4
3          23.2   13.6   39.7   13.7   28.2   13.3   24.5   12.5   21.5   15.0   25.6   12.2   27.1   13.4
4          24.0   13.5   36.6   13.6   28.0   13.2   26.1   12.5   21.2   15.0   29.6   12.2   27.6   13.3
5          23.5   13.4   38.9   13.7   26.8   13.3   25.7   12.5   21.6   15.0   25.4   12.2   27.0   13.3
6          23.0   13.5   41.6   13.7   28.6   13.2   27.8   12.5   21.3   15.0   25.8   12.2   28.0   13.3
7          23.4   13.5   38.0   13.7   29.2   13.2   29.7   12.5   21.4   15.0   25.0   12.2   27.8   13.3
8          23.0   13.4   40.6   13.7   27.6   13.2   26.3   12.5   21.6   15.0   25.3   12.2   27.4   13.3
9          22.9   13.6   40.7   13.6   28.1   13.2   27.5   12.5   21.2   15.0   25.5   12.2   27.6   13.3
10         24.3   13.5   41.0   13.6   27.8   13.2   23.2   12.4   20.9   15.0   25.0   12.2   27.0   13.3
11         24.7   13.4   40.4   13.8   29.1   13.1   25.6   12.5   21.2   15.0   25.1   12.2   27.7   13.3
12         23.8   13.5   39.6   13.7   30.8   13.2   26.2   12.5   22.2   15.0   25.4   12.2   28.0   13.3
Average    23.4   13.5   39.5   13.7   28.5   13.2   26.1   12.5   21.4   15.0   25.9   12.2   27.5   13.3
Table 4. Average distances (kb) between microsatellite loci of each type on the chromosomes, for Class I and Class II.
         Mono-         Di-           Tri-          Tetra-        Penta-        Hexa-         Average
Chr.     I-II   I      I-II   I      I-II   I      I-II   I      I-II   I      I-II   I      I-II   I
1        19.8   316.2  10.0   32.9   4.6    68.6   7.3    152.1  20.4   128.3  1.2    278.9  10.5   162.8
2        20.2   353.0  10.4   33.6   4.4    63.4   7.2    134.6  20.9   125.4  1.2    285.8  10.7   166.0
3        20.2   381.1  10.3   31.5   4.3    60.4   7.1    156.5  21.3   133.7  1.2    284.1  10.7   174.5
4        23.4   521.5  11.3   42.2   5.0    87.0   8.4    213.3  26.6   160.9  1.3    300.8  12.7   220.9
5        22.1   427.5  10.4   34.9   4.2    71.1   7.7    155.1  22.0   118.5  1.2    286.8  11.3   182.3
6        22.4   462.6  9.7    34.7   4.3    72.4   7.7    165.9  21.9   125.5  1.2    263.6  11.2   187.5
7        23.0   450.4  11.0   36.9   4.7    77.9   7.7    166.1  21.6   126.3  1.3    286.1  11.5   190.6
8        22.6   423.0  10.1   34.5   4.5    69.9   7.6    154.9  24.3   148.4  1.2    244.3  11.7   179.2
9        23.5   370.2  9.9    31.6   4.6    84.3   7.6    129.2  25.5   150.6  1.3    288.8  12.1   175.8
10       25.7   591.1  10.0   34.9   4.6    77.6   8.2    181.1  23.6   145.3  1.2    265.5  12.2   215.9
11       24.3   526.7  9.8    32.7   4.8    82.5   7.9    140.4  24.6   153.7  1.3    335.5  12.1   211.9
12       24.3   526.7  9.8    32.7   4.8    82.5   7.9    140.4  24.6   153.7  1.3    335.5  12.1   211.9
Average  22.6   445.8  10.2   34.4   4.6    74.8   7.7    157.5  23.1   139.2  1.2    288.0  11.6   189.9
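The Average row of Table 4 appears to be the unweighted mean of the twelve chromosome rows. The short Python check below reproduces it for the two Mono- columns, using the values exactly as printed (rows 11 and 12 carry identical values in the table). The variable and function names are illustrative and do not come from the thesis.

```python
from statistics import mean

# Mono- columns of Table 4 (kb), chromosomes 1 to 12, copied as printed.
mono_class_i_ii = [19.8, 20.2, 20.2, 23.4, 22.1, 22.4,
                   23.0, 22.6, 23.5, 25.7, 24.3, 24.3]
mono_class_i = [316.2, 353.0, 381.1, 521.5, 427.5, 462.6,
                450.4, 423.0, 370.2, 591.1, 526.7, 526.7]


def column_average(values):
    """Unweighted mean over chromosomes, rounded to one decimal as in the table."""
    return round(mean(values), 1)


print(column_average(mono_class_i_ii))  # 22.6, matching the Average row
print(column_average(mono_class_i))     # 445.8, matching the Average row
```

An unweighted mean treats every chromosome equally, regardless of its length; a length-weighted mean would give a different genome-wide figure.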
[Bar chart. X-axis: microsatellite types (Mono-mer, 2-mer, 3-mer, 4-mer, 5-mer, 6-mer); Y-axis: % occurrence (scale up to 70); one series per chromosome, Chr.1 to Chr.12.]
Figure 1. Percentage occurrence of different microsatellite types (≥ 12 bp)* in the chromosomes.
[Bar chart. X-axis: microsatellite types (Mono-mer, 2-mer, 3-mer, 4-mer, 5-mer, 6-mer); Y-axis: % occurrence (scale up to 60); one series per chromosome, Chr.1 to Chr.12.]
Figure 2. Percentage occurrence of different microsatellite types (≥ 20 bp) in the twelve chromosomes.
* Including both Class I and Class II.
5. Final Considerations
Molecular markers are currently a tool of great importance in support of plant
breeding; however, strategies are still needed to raise the success rate of their
application in genetic mapping and marker-assisted selection studies. To this end,
bioinformatics-based studies were carried out to examine the patterns and abundance
of different types and arrangements of microsatellite loci in the complete genome
sequence of rice and of other species.
Analysis of the complete rice genome sequence showed that microsatellite loci
formed by dimers and trimers are the most abundant, and that using these two types
of arrangement allows, on average, one marker to be positioned every 24,900
nucleotides (24.9 kb), resulting in excellent genome coverage.
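As a rough illustration of the kind of scan behind such counts, the Python sketch below locates perfect repeats with unit sizes of 1 to 6 nucleotides and a total length of at least 12 bp. It is not the program used in the thesis. The function names, the greedy left-to-right strategy, and the split into Class I (≥ 20 bp) and Class II (12 to 19 bp) are assumptions, the last one chosen to be consistent with the thresholds shown in Figures 1 and 2.

```python
def is_primitive(unit: str) -> bool:
    """True when the unit is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    return unit not in (unit + unit)[1:-1]


def find_ssrs(seq: str, min_len: int = 12, max_unit: int = 6):
    """Return (unit, start, length, ssr_class) for each perfect repeat found.

    Greedy scan: at each position the shortest repeat unit that reaches
    min_len wins, and the scan resumes after the end of that locus.
    """
    seq = seq.upper()
    loci, i = [], 0
    while i < len(seq):
        hit_end = None
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            # Skip truncated units, ambiguous bases, and units such as 'AA' or 'ATAT'.
            if len(unit) < k or set(unit) - set("ACGT") or not is_primitive(unit):
                continue
            j = i + k
            while seq[j:j + k] == unit:
                j += k
            if j - i >= min_len:
                ssr_class = "I" if j - i >= 20 else "II"  # assumed class thresholds
                loci.append((unit, i, j - i, ssr_class))
                hit_end = j
                break
        i = hit_end if hit_end is not None else i + 1
    return loci


# Hypothetical test sequence: one 20 bp dimer locus and one 12 bp trimer locus.
demo = "GGC" + "AG" * 10 + "TTGCC" + "AAT" * 4 + "C"
print(find_ssrs(demo))  # [('AG', 3, 20, 'I'), ('AAT', 28, 12, 'II')]
```

Dividing the length of the scanned sequence by the number of loci retained gives a mean spacing of the kind reported above. Imperfect and compound repeats are ignored in this sketch.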
In the study that analyzed the occurrence of microsatellites in 28,469 gene
sequences (fl-cDNA), a total of 3,907 mini- and microsatellite loci were found in
3,765 sequences (13.22%), and 3,329 primer sets were designed, corresponding to
85.20% of the loci (3,329 of 3,907). PCR simulation with the 3,329 primer sets
showed that 2,397 sets amplified only the original fragment, and that 932 sets
amplified redundant regions in addition to the original locus.
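The distinction between primer sets that amplify a single fragment and those that amplify redundant regions can be sketched with a simple exact-match in silico PCR in Python. This is an illustration only, not the simulation tool used in the thesis. It assumes perfect primer matches, considers only one strand orientation of each template, and uses hypothetical primer and template sequences, function names, and a hypothetical `max_product` limit.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text: str, pattern: str):
    """Start positions of every exact (possibly overlapping) occurrence."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def simulate_pcr(templates, forward, reverse, max_product=1000):
    """Return (template_id, start, end) for each predicted amplicon."""
    rev_site = reverse_complement(reverse)  # where the reverse primer anneals
    amplicons = []
    for name, seq in templates.items():
        for f in find_all(seq, forward):
            for r in find_all(seq, rev_site):
                end = r + len(rev_site)
                # The reverse site must lie downstream and within the size limit.
                if r >= f + len(forward) and end - f <= max_product:
                    amplicons.append((name, f, end))
    return amplicons


def classify(amplicons):
    """Label a primer set by the number of predicted products."""
    if not amplicons:
        return "no product"
    return "specific" if len(amplicons) == 1 else "redundant"


# Hypothetical primers and templates, for illustration only.
fwd, rev = "GATTACAGGC", "CCTTGAGTCA"
site = reverse_complement(rev)
templates = {
    "seq_A": "TT" + fwd + "AG" * 8 + site + "CC",
    "seq_B": "GG" + fwd + "CTT" * 5 + site + "AA",
    "seq_C": "ACGTACGTACGT",
}
print(classify(simulate_pcr(templates, fwd, rev)))  # redundant
```

In the terms of the study, 2,397 of the 3,329 primer sets (72.0%) would fall in the specific category and 932 (28.0%) in the redundant one.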
Comparisons among species of different families showed that AG/CT dimers were
predominant in the families Brassicaceae and Poaceae, and AT/AT in Solanaceae.
Among trimer microsatellites, the motifs ATA/TAT/AAT/TTA and GAA/TTC/AGA/TCT were
predominant in the brassicas and solanaceous species, whereas in the grasses the
most frequent trimers were those composed of C/G.
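Motif groups such as AG/CT or ATA/TAT/AAT/TTA arise because the same repeat can be read from either strand and from any starting position within the repeat unit. A minimal Python sketch of one common way to collapse such variants into a single label is given below. The choice of the alphabetically smallest form as the label, and the function names, are assumptions made for illustration rather than the convention used in the thesis.

```python
def reverse_complement(motif: str) -> str:
    """Reverse complement of an unambiguous DNA motif."""
    return motif.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_motif(motif: str) -> str:
    """Smallest rotation of the motif or of its reverse complement."""
    motif = motif.upper()
    candidates = (motif, reverse_complement(motif))
    rotations = [m[i:] + m[:i] for m in candidates for i in range(len(m))]
    return min(rotations)


for group in (["AG", "CT"], ["ATA", "TAT", "AAT", "TTA"], ["GAA", "TTC", "AGA", "TCT"]):
    print(group, "->", {canonical_motif(m) for m in group})
# ['AG', 'CT'] -> {'AG'}
# ['ATA', 'TAT', 'AAT', 'TTA'] -> {'AAT'}
# ['GAA', 'TTC', 'AGA', 'TCT'] -> {'AAG'}
```

Counting loci by canonical label avoids splitting one repeat class across several motif spellings when frequencies are compared between species.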
Finally, the overall result of the three studies indicated that microsatellite loci
provide good coverage of both genic and intergenic regions of rice, and that loci
formed by C/G trimers are the most suitable for transferring markers among genic
regions of the grasses. Transfer among species within a family is supported by the
patterns found within each family studied; among different families, however, only
a small set of shared locus patterns was evident, indicating low potential for
marker transfer between evolutionarily more distant species.
Future laboratory studies will be needed to validate the results obtained in
silico. Primer sets derived from the loci with the most abundant patterns among the
grasses should be tested for their actual ability to amplify DNA regions in the
different grasses studied, thereby confirming which marker patterns are best suited
for transfer to little-studied grasses. The second need for validation of the in
silico results concerns the primer sets that amplify specific and/or redundant loci
in rice. Primers with a real ability to amplify specific loci can be used in mapping
and marker-assisted selection strategies, and those able to amplify redundant
regions (several loci at once) can be used in studies of genetic variability and
diversity. After validation, both will yield auxiliary tools for plant breeding
programs.
6. Referencias bibliográficas do Item 1 CARVALHO, F.I.F. in: Trigo no Brasil. Fundação Cargill. Campinas-SP, Editora Ilus,
v.1, p.620, 1982.
CARVALHO, F.I.F.; LORENCETTI, C.; MARCHIORO, V.S.; SILVA, S.A. Condução de populações no melhoramento genético de plantas. Editora e Gráfica
Universitária - UFPel, 2003.
IYER, R.R.; PLUCIENNIK, A.; ROSCHE, W.A.; SINDEN, R.R.; WELLS, R.D. DNA
polymerase III proofreading mutants enhance the expansion and deletion of triplet
repeat sequences in Escherichia coli. Journal of Biological Chemistry, v.275, n.3,
p.2174-2184, 2000.
LAWSON, M.J.; ZHANG, L. Distinct patterns of SSR distribution in the Arabidopsis
thaliana and rice genomes. Genome Biology, v.7, n.2, 2006. LITT, M.; LUTY, J. A. A hypervariable microsatellite revealed by in vitro amplification
of a dinucleotide repeat within the cardiac muscle actin gene. American journal of human genetics. v.44, n.3, p.397-401. 1989.
MORGANTE, M.; OLIVIERI, A.M. PCR-amplified microsatellites as markers in plant
genetics. The Plant Journal. v.3, n.1, p.175-182, 1993.
MORGANTE, M.; HANAFEY, M.; POWELL, W. Microsatellites are preferentially
associated with nonrepetitive DNA in plant genomes. Nature Genetics, v.30, n.2,
p.194-200, 2002.
MCCOUCH, S.R; TEYTELMAN, L.; XU, Y.; LOBOS, K.B.; CLARE, K.; WALTON, M.;
FU, B.; MAGHIRANG, R.; LI, Z.; XING, Y.; ZHANG, Q.; KONO, I.; YANO, M.;
FJELLSTROM, R.; DECLERCK, G.; SCHNEIDER, D.; CARTINHOUR, S.; WARE, D.;
STEIN, L., Development and Mapping of 2240 New SSR Markers for Rice (Oryza
sativa L.). DNA Research, v.9, n.6, p.199-207, 2002.
NICOT, N.; CHIQUET, V.; GANDON, B.; AMILHAT, L.; LEGEAI, F.; LEROY, P.;
BERNARD, M.; SOURDILLE, P. Study of simple sequence repeat (SSR) markers
from wheat expressed sequence tags (ESTs). Theoretical and Applied Genetics,
82
v.109, n.4, p.800-805, 2004.
TAUTZ, D. Hypervariability of simple sequences as a general source for polymorphic
DNA markers. Nucleic Acids Research. v.17, n.16, p.6463�6471, 1989.
TEMNYKH S, DECLERCK G, LUKASHOVA A, LIPOVICH L, CARTINHOUR S,
MCCOUCH S. (2001) Computational and experimental analysis of microsatellites in
rice (Oryza sativa L.): frequency, length variation, transposon associations, and
genetic marker potential. Genome Research. 11(8):1441-52.
THIEL, T.; MICHALEK, W.; VARSHNEY, W.; GRANER, A. Exploiting EST databases
for the development and characterization of gene-derived SSR-markers in barley
(Hordeum vulgare L.). Theoretical and Applied Genetics, v.106, n.3, p.411-422,
2003.
VARSHNEY, R.K.; GRANER, A.; SORRELLS, M.E. Genic microsatellite markers in
plants: features and applications. Trends in Biotechnolgy. v.23, n.1, 48-55. 2005b.
VARSHNEY, R.K.; HOISINGTON, D.A.; TYAGI, A.K. Advances in cereal genomics
and applications in crop breeding. Trends in Biotechnology, v.24, n.11, p.490-499,
2006.
Kashi Y, King D, Soller M. (1997) Simple sequence repeats as a source of
quantitative genetic variation. Trends Genet. 13(2):74-8.
WELLS, R.D.; PARNIEWSKI, P.; PLUCIENNIK, A.; BACOLLA, A.; GELLIBOLIAN,
R.; JAWORSKI, A. Small slipped register genetic instabilities in Escherichia coli in
triplet repeat sequences associated with hereditary neurological diseases. J Biol Chem. v.273, n.31, p.19532-19541, 1998.
ZHANG L, ZUO K, ZHANG F, CAO Y, WANG J, ZHANG Y, SUN X, TANG K. (2006)
Conservation of noncoding microsatellites in plants: implication for gene regulation.
BMC Genomics. 25;7:323.
VITAE
Luciano Carlos da Maia was born on 13 July 1976 in Itapetininga-SP. He graduated
in Data Processing Technology in 1995 from the Associação de Ensino de
Itapetininga. From 1992 to 1999 he worked on the development of information systems
for agricultural and cost management in the livestock and citrus division of Grupo
Votorantim (Itapetininga-SP). He entered the Faculdade de Agronomia Eliseu Maciel
(FAEM) of the Universidade Federal de Pelotas (UFPel) in March 2000. From 2000 to
2002 he was an intern and undergraduate research fellow (CNPq) at the Bacteriology
Laboratory of the Departamento de Fitossanidade, FAEM/UFPel. In 2003 he began an
internship at the Centro de Genômica e Fitomelhoramento, holding a fellowship from
the Fundação Delfin Mendes until he completed the course in 2004. In March 2005 he
began a master's degree in Agronomy, in the area of Plant Breeding, at FAEM/UFPel,
under the supervision of Professors Antonio Costa de Oliveira and Fernando Irajá
Félix de Carvalho. In May 2006, having met the necessary requirements, he advanced
to the doctoral level. Throughout this period he has been developing bioinformatics
work in support of the breeding of rice, wheat and oats, as well as other genomics
and molecular biology studies of the species studied in the Plant Breeding group of
FAEM/UFPel.