Assignment Research Paper
Embed Size (px)
Transcript of Assignment Research Paper
8/13/2019 Assignment Research Paper
C BIOS Vol. 13 no. 3 1997Pages 263-270 Prediction of probable genes by Fourieranalysis of genomic sequences Shrish Tiwari1, S.Ramachandran2, Alok Bhattacharya3,Sudha Bhattacharya2 and Ramakrishna Ramaswamy 1 4
Abstract Motivation The major signal in coding regions of genomicsequences is a three-base periodicity. Our aim is to useFourier techniques to analyse this periodicity, and thereby todevelop a tool to recognize coding regions in genomic DNA.Result The three-base periodicity in the nucleotide arrange-ment is evidenced as a sha rp peak at frequency f — 1/3 in theFourier (or power) spectrum. From extensive spectral analysis of DNA sequences of total length over 5.5 millionbase pairs from a wide variety or organisms (including thehuman genome), and by separately examining coding andnon-coding sequences, we find that the relative height of thepeak at f = 1/3 in the Fourier spectrum is a good dis-criminator of coding potential. This feature is utilized by us todetect probable coding regions in DNA sequences, byexamining the local signal-to-noise ratio of the peak withina sliding window. While the overall accuracy is comparableto that of other techniques currently in use, the measure thatis presently proposed is independent of training sets orexisting databa se information, and can thus find gen eralapplication. Availability A computer program GeneScan which locatescoding open reading frames and exonic regions in genomicsequences has been developed, and is available on request.Contact E-mail: firstname.lastname@example.org. Introduction The gene identification problem (Fickett, 1996), namely theidentification of protein-coding genes in DNA sequencesthrough computational means, is of great current importance.The worldwide initiative on genome sequencing has necessi-tated the developm ent of new approach es to assess rapidly thepotential of a given nucleotide sequence in a functional con-text. Genome projects have given rise to an exponentiallygrowing amount of genetic information, much of which isnovel: even in a simple eukaryote like Saccharomycescerevisiae, less than half the number of potential genesequences were known prior to the recently completed YeastGenome Project.
School of Physical Sciences,School of Environmental Sciences andSchool of Life Sciences, Jawaharlal Nehru University, Nev-Delhi 110067, India To whom correspondence should be addressed
A number of methods have been proposed for genedetection, based on distinctive features of protein-codingsequences. These have recently been comprehensivelyreviewed (Fickett and Tung, 1992; Fickett, 1996). Thedifferent methods are based on a variety of contrastingcharacteristics of protein-coding DNA sequences and DNAsequences that do not encode proteins. These methodsemploy, for example, differences in the patterns of codonusage (Staden and McLachlan, 1982) or oligonucleotidefrequencies (Shulman etal., 1981, Fickett, 1982; Borodovskyet al., 1986, 1994), or train neural nets (Lapedes et al., 1990;Uberbacher and M ural, 1991; Farber et al., 1992; Xu et al.,1994; Snyder and Stormo, 1995) to recognize the distinctivefeatures of the two sets. Other techniques use linguisticmethods (Dong and Searls, 1994; Mantegna et al., 1994) andcorrelation functions (Li, 1992; Peng et al., 1993; Ossadniket al., 1994; Buldyrev et al., 1995). At the same time,comprehensive evaluations of the various methods suggestthat they cannot be expected to work equally well for allgenes (Burset and Guigo, 1996), and constant refinement isneeded to evolve better methodologies. There is also a need (Fickett, 1996) for new methods of gene prediction whichutilize features of gene structure that have not so far beenincorporated in programs already available.
In this paper, we investigate a Fourier technique based on adistinctive feature of protein-coding regions of DNAsequences, namely the existence of short-range correlationsin the nucleotide arrangem ent. The m ost prominent of these isa 1/3 periodicity, which has been shown to be present incoding sequences (Fickett, 1982). The signature of this (andindeed any other) periodicity can be seen most directlythrough the Fourier analysis (Tsonis et al., 1991; Voss, 1992)as a spectral peak. In the present work, we analyse genomicsequences from different organisms, and verify that such periodicity is universal for protein-coding sequences and isabsent in genomic sequences which do not code for proteins.The quantitative measure that we focus on is the relativestrength of this periodicity, which we then use in order todevelop a simple technique to predict genes (with and withoutintrons) in unknown genomic sequences of any organism.The present method, like other Fourier-based methods, offerssome advantages, namely that it is quite easy to apply andrequires no prior knowledge of the sequence to be analysed.
The origin of the period-3 signal in protein-codingsequences derives from the triplet nature of the codon. This
© Oxford University Press 26 3
8/13/2019 Assignment Research Paper
S.Tiwari et al.
Table 1. Summary Group
Fungus Fungus Insect virus Protozoa Bacteria Bacteria Nematode Mammal Various Various
of genomic sequences studied Species
S.cerevisiae III S.cerevtsiae VIII A. californ ica E.histolvtica1 A.vinelandtt H.influenzae C.elegans Humane Globins1 Actins8
G + C
38.6 38.2 40.7 34 2 62.4 38.2 35 6 51.2 49.6 36.8
216 267 154 26
6 1727- -
54 140 51 26
6 933 146 24 15 15
198 255 137 26
6 1667- _
51 139 49 26
6 927 146 24 15 15
"Total number of ORFs as reported in the literature.ORFs positively identified as coding for proteins through homology, detection of the corresponding c-DNAcThe G + C content quoted is the average over genes only. D ata from Sehgal et al. (1994) as well as GenBank. The G + C content is the average over thesequences analysed.H.K.Das, private communicationeThe human sequences used in this study are those used by Uberbacher and Mural (1991) for GRAIL.fThe GenBank locus names of the sequences studied are: ACAACTI, CELACI, DROACT2A, BOVACTI, BOVACT2, BOVACT2, RABACMAI,RABACMA2, CHKACACA . QULACASK, SLMACT15. SLMACT21, SOYACT1G, MUSACACM, MUSA CASA, HUMA CBPA1. The G + C content quotedis the average over all the actins from different species.8The GenBank accession numbers of sequences studied are: JOO 153. JOO 182, J00413, J03082. KO 1714, K03256. M17601, M17602, M17909. M61740, V00491,VOO5I3, XOO371. X00372, X00 373, X03248 and X04862 . The G + C content quoted is an average over all the globin sequences from different species. fact alone is, however, insufficient to explain why codingregions exclusively have the signal, while non-coding regionsoverwhelmingly do not, and the reasons for this distinction liein the unequal usage of codons (codon bias) in coding regions(Tsonis et al., 1991), as well as in the biased usage ofnucleotide triples in genomic DNA (the triplet bias). Thelatter bias comes in part from the unequal usage of the aminoacids in naturally occurring proteins, and is universally present. The former bias arises from the unequal usage of thecodons corresponding to a given amino acid, and is specific toa given organism. In order to explore the role of the two types
of compositional bias in generating the period-3 signal, wehave performed a variety of numerical experiments. Ourresults indicate that while codon bias does play a role, it is,however, not the primary reason for the periodicity. Algorithm The Fourier analysis described below has been performed on nucleotide sequences, of total length over 5.5 Mbases. Thesesequences (listed in Table I) include the complete sequencesof yeast (S. cerevisiae) chromosomes III (Oliver et al., 1992)
UUUI J I L . L
0.00 «.V.tn>.l- . I I L I 1 J L . I J I . J V .oY oh 0.3 0.4
S°°2 0.01 -
f f Fig. 1. Typical Founer spectra for (a) a coding stretch of DNA and (b) a non-coding stretch from S.cerevtsiae chromosome III.
http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/ http://bioinformatics.oxfordjournals.org/
8/13/2019 Assignment Research Paper
Gene prediction by Fourier analysis
and VIII (Johnson et al., 1994); the genome of Autographa californica nuclear polyhedrosis virus (Ac-NPV ) (Ayres et al.,1994); 2.2 Mb of