Post on 13-Dec-2015
PREETI MISRA
Advisor: Dr. HAIXU TANG
SCHOOL OF INFORMATICS - INDIANA UNIVERSITY
Computational method to analyze tandem repeats in eukaryote genomes
Overview
Background
Tandem repeats
Methodology
Results
Conclusions
References
Capstone Presentation 05/18/2007 2
Background
April 20, 2006
An array of consecutive repeats
Repeating pattern or consensus = 5
Total repeat length = 25
3 main types of tandem repeats
  Microsatellites -- 1-5 bp repeating pattern
  Minisatellites -- 6-50 bp repeating pattern
  Large tandem -- greater than 50 bp repeating pattern
GATCCGATCCGATCCGATCCGATCC
Background
Tandem repeats
Tandem Gene duplication
Methodology
Tandem Repeat Finder
Dating tandem repeats
Jukes-Cantor model
Results
Analysis
Conclusion
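The three size classes of tandem repeats listed earlier can be written as a small lookup. This is a minimal sketch: the function name and the returned labels are illustrative choices, and the thresholds are the 1-5 bp, 6-50 bp and greater-than-50 bp limits from the slide.

```python
def classify_tandem_repeat(period_bp: int) -> str:
    """Return the size class for a repeating pattern of `period_bp` base pairs.

    Thresholds follow the slide: 1-5 bp, 6-50 bp, and greater than 50 bp.
    """
    if period_bp < 1:
        raise ValueError("period length must be at least 1 bp")
    if period_bp <= 5:
        return "microsatellite"
    if period_bp <= 50:
        return "minisatellite"
    return "large tandem repeat"


# The example array GATCCGATCC... has a 5 bp repeating pattern.
print(classify_tandem_repeat(len("GATCC")))
```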
Significance
Use tandem repeats to determine whether 2 DNA samples belong to the same person or not
Uses:
  Forensic use
  Paternity testing
Image downloaded from www.egensburg.de/Fakultaeten/Medizin/Klinische_Chemie/lehre/vorlesung/Manuskript_Profiling%20.pdf
Mechanism of tandem duplication
Unequal recombination is the major known mechanism for the formation of large tandem repeats
Image has been downloaded from http://hc.ims.u okyo.ac.jp/JSBi/journal/GIW02/GIW02F010/GIW02F010.html
Tandem gene duplication
Benefits: New functions arise. Responsible for the evolution of gene clusters
Example: Zinc finger genes in mammalian genomes
http://www.steve.gb.com/images/molecules/nucleotides/zinc_finger_(side).jpg
Figure labels: 2 homologous genes; second gene = duplicated
Purpose
Large tandem repeats are commonly found in eukaryotes: they cover 1.684% of the human genome and 1.525% of the chimpanzee genome
To date the large tandem duplications and find the relationship between various characteristics of long tandem repeats and the corresponding evolutionary time
8 genomes were analyzed: 3 primates, 2 rodents, dog, chicken and puffer fish
Methodology
Identification
  Tandem repeat finder (TRF) for identification of large tandem repeats
Distance computation
  Jukes-Cantor distance model to find the distance between two repeats
Transformation
  Transform the above computed distance into evolutionary time
Tandem Repeat Finder
Tandem repeat detection programs: STRING, Mreps and TRF
TRAP: T.Jose, P. Sobreira, A.Durham and A.Gruber
TRF can be downloaded at http://tandem.bu.edu/trf/trf.html
TRF output includes:
  Starting and ending positions of each tandem repeat
  Number of repetitions
  A%, C%, G%, T%: percentage of each base in the tandem repeat
  Length of the consensus word (only the first 10 bases)
Tandem Repeat Finder
Tandem repeat finder outline:
The tandem repeat finder program has 2 main components: detection and analysis
  Detection - Finds candidate tandem repeats
  Analysis - Produces an alignment for each candidate and statistics about the alignment
Tandem Repeat Finder
Large tandem repeats were extracted
Results of TRF:
1 5 100 0 50 20 40 20 20 1.92 GATCC GATCCGATCCGATCCGATCCGATCC
GATCC - period or consensus
GATCCGATCCGATCCGATCCGATCC - repeat
1 - indices
5 - consensus or period size
100 - percent matches
0 - percent indels
50 - score
20 - % of A
40 - % of C
20 - % of G
20 - % of T
1.92 - entropy
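The annotated record can be read field by field, and its base percentages and entropy can be recomputed from the consensus as a sanity check. The sketch below assumes the column order shown in the slide's example, with the two unlabelled 20s taken to be %G and %T. The function names are illustrative, and the output files written by TRF itself may use a different column layout.

```python
import math
from collections import Counter


def base_composition(consensus: str) -> dict:
    """Percentage of each base (A, C, G, T) in the consensus pattern."""
    counts = Counter(consensus.upper())
    return {base: 100 * counts[base] / len(consensus) for base in "ACGT"}


def entropy_bits(consensus: str) -> float:
    """Shannon entropy, in bits, of the base composition."""
    freqs = [pct / 100 for pct in base_composition(consensus).values() if pct > 0]
    return -sum(f * math.log2(f) for f in freqs)


def parse_record(line: str) -> dict:
    """Split a record laid out like the slide's example (assumed layout)."""
    fields = line.split()
    names = ["index", "period_size", "percent_matches", "percent_indels",
             "score", "pct_a", "pct_c", "pct_g", "pct_t"]
    record = {name: int(value) for name, value in zip(names, fields[:9])}
    record["entropy"] = float(fields[9])
    record["consensus"] = fields[10]
    record["repeat"] = fields[11]
    return record


record = parse_record(
    "1 5 100 0 50 20 40 20 20 1.92 GATCC GATCCGATCCGATCCGATCCGATCC"
)
print(record["period_size"], len(record["repeat"]))
print(base_composition(record["consensus"]))
print(round(entropy_bits(record["consensus"]), 2))
```

Recomputing the composition of GATCC gives 20% A, 40% C, 20% G and 20% T, and an entropy that rounds to 1.92, which agrees with the values in the record.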
DNA Sequence Evolution Model For Dating
[Figure: tree of diverging DNA sequences on a timeline labelled 3 mil yrs, 2 mil yrs, 1 mil yrs, today. Ancestral sequence AAGACTT; then TGGACTT and AAGGCCT; then AGGGCAT, TAGCCCT and AGCACTT; present-day sequences TAGCCCA, TAGACTT, AGCGCTT, AGCACAA and AGGGCAT.]
Dating tandem duplications
Computing divergence of tandem repeating units:
Repeat identity - each repeat is compared with the other repeats and the maximum similarity/identity is considered
GATCC
GATCC|GATCC|GATCC|GATCC|GATCC
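The repeat-identity step can be sketched as follows. This is a simplified illustration, not the alignment-based comparison used in the study: it assumes all units have the same length and counts matching positions directly, with no gaps. The function names and the mutated example units are made up for the demonstration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length units agree."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares units of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def max_identities(units: list) -> list:
    """For each unit, the highest identity to any other unit in the array."""
    return [
        max(identity(unit, other) for j, other in enumerate(units) if j != i)
        for i, unit in enumerate(units)
    ]


# Illustrative array: two exact copies of GATCC and two copies with one change each.
units = "GATCC|GATCC|GATTC|GAACC".split("|")
print(max_identities(units))
```

Units of unequal length, or units that differ by insertions and deletions, would need a real pairwise alignment before identity is counted.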
Jukes-Cantor model
Computes the distance between 2 repeats
All bases occur with equal probability, i.e. p = 0.25 for A, T, G and C
All possible base substitutions are equally likely, as follows -
A ↔ G, A ↔ C, A ↔ T, G ↔ T, G ↔ C, C ↔ T
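Under these two assumptions the expected fraction of differing sites grows with the true distance and then saturates. As a sketch of the standard result, write m/n for the observed fraction of mismatching sites between two repeats and D for the number of substitutions per site; the relation and its inverse are:

```latex
\[
  \frac{m}{n} \;\approx\; \frac{3}{4}\left(1 - e^{-\frac{4}{3}D}\right)
  \quad\Longrightarrow\quad
  D \;=\; -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\cdot\frac{m}{n}\right)
\]
```

The factor 3/4 is the saturation level: two unrelated sequences with equal base frequencies still agree at about a quarter of their sites by chance, so D is only defined while m/n stays below 0.75.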
Jukes-Cantor model
m = no. of mutations
n = length of sequence
D = -(3/4) ln(1 - (4/3)(m/n))
D = Distance between two repeats
Ex- If mismatches are observed at 25% of the sites, the Jukes-Cantor model predicts that the distance between the two repeats is 0.304
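The formula and the worked example can be checked with a few lines of code. This is a minimal sketch; the function name and the input checks are illustrative additions, while the formula itself is the one given on the slide.

```python
import math


def jukes_cantor_distance(m: int, n: int) -> float:
    """D = -(3/4) * ln(1 - (4/3) * m/n) for m mismatches over n aligned sites."""
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("need n > 0 and 0 <= m <= n")
    p = m / n
    if p >= 0.75:
        # The logarithm's argument reaches zero at 75% mismatches.
        raise ValueError("distance is undefined when 75% or more of the sites differ")
    return -0.75 * math.log(1 - (4 / 3) * p)


# 25 mismatches in 100 sites, as in the slide's example.
print(round(jukes_cantor_distance(25, 100), 3))  # 0.304
```

The corrected distance (0.304) is larger than the raw mismatch fraction (0.25) because the model allows for sites that changed more than once.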
Estimating the evolutionary time
Transforming the computed distance (D) between two repeats into evolutionary time
Neutral mutation rate in mammals is nearly 1.25 * 10^-9 per year per site
Time (T) = D / (1.25 * 10^-9) years ago
Ex- D = 0.1
T = 0.1 / (1.25 * 10^-9) = 80 million years ago
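The conversion is a single division, shown here as a small sketch. The constant and function names are illustrative; the rate is the mammalian value quoted on the slide, and the formula is the slide's T = D / rate.

```python
# Neutral mutation rate quoted on the slide (per site per year, mammals).
NEUTRAL_RATE = 1.25e-9


def years_since_duplication(distance: float, rate: float = NEUTRAL_RATE) -> float:
    """Convert a Jukes-Cantor distance D into years, using T = D / rate."""
    if distance < 0 or rate <= 0:
        raise ValueError("distance must be non-negative and rate positive")
    return distance / rate


# The slide's example: D = 0.1
print(f"{years_since_duplication(0.1) / 1e6:.0f} million years ago")
```

The same rate is applied to every genome in this sketch; a lineage with a different neutral rate would need its own value passed as `rate`.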
Material and Method
Material
  The genome files were downloaded from the UCSC site http://hgdownload.cse.ucsc.edu/downloads.html
  The tandem repeat finder and stretcher software were downloaded
Procedure
  Extraction of large tandem repeats with the help of tandem repeat finder
  Calculation of similarities between tandem repeats using stretcher
  Computation of the distance using Jukes-Cantor model
  Transformation of distance to the evolutionary time
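The last two steps of the procedure can be chained once a similarity value is available. The sketch below assumes that a percent identity has already been read from the alignment output; it does not run or parse stretcher. The function name and the sample identities are illustrative, not values from the study.

```python
import math

NEUTRAL_RATE = 1.25e-9  # per site per year, the mammalian rate used in the talk


def age_from_identity(percent_identity: float, rate: float = NEUTRAL_RATE) -> float:
    """Percent identity -> Jukes-Cantor distance -> age in years."""
    mismatch = 1 - percent_identity / 100
    if not 0 <= mismatch < 0.75:
        raise ValueError("percent identity must be above 25 and at most 100")
    distance = -0.75 * math.log(1 - (4 / 3) * mismatch)
    return distance / rate


# Hypothetical identities, for illustration only.
for pid in (99.0, 95.0, 90.0):
    print(pid, round(age_from_identity(pid) / 1e6, 1), "million years")
```

Lower identity maps to an older duplication, and the mapping steepens as identity falls because of the logarithmic correction.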
Tree of life
[Figure: tree of life with node labels 500 million years ago, 225, 92, 75, 25, 6 and 12-24]
Recap: period & repeat
ATTCGATTCGATTCGGGATTCGACATTCG - REPEAT
ATTCG - PERIOD or CONSENSUS
Results
Genome | Chr#, longest repeat length | Chr#, highest total # of repeats | Chr#, longest period length | Total repeat | Total genome size | Chr#, highest % of repeat | Total coverage (% of repeat in genome)
HUMAN | Y, 217961 | 1, 6586 | X, 2000 | 51 MB | 3.1 GB | 19, 6.06 | 1.684
CHIMPANZEE | 8, 62142 | 1, 5943 | 8, 1985 | 48 MB | 2.97 GB | 19, 4.9463 | 1.525
MACAQUE | 13, 119145 | 19, 5399 | 19, 1969 | 44 MB | 2.87 GB | 19, 8.408 | 1.538
RAT | 18, 63266 | 1, 5156 | 1, 1989 | 20 MB | 2.75 GB | 12, 2.034 | 0.424
MOUSE | 7, 203136 | 5, 4214 | 10, 1983 | 19 MB | 2.61 GB | X, 1.242 | 0.263
DOG | X, 37449 | 1, 1613 | 18, 1852 | 6 MB | 2.40 GB | 1, 0.4902 | 0.748
CHICKEN | 1, 16680 | 1, 958 | 13, 1988 | 4 MB | 1.1 GB | 16, 7.779 | 0.735
PUFFER FISH | 2, 13045 | 2, 139 | 2, 1569 | 0.53 MB | 385 MB | 10, 0.51 | 0.245
Results
Total number of repeats
Total number of period or consensus
Results of repeat length
% Repeat results
[Chart labels: Fish, Human]
Dating tandem repeats
Tree of life
[Figure: tree of life with node labels 500 million years ago, 225, 92, 75, 25, 6 and 12-24]
Conclusions
Primates (human, chimpanzee and macaque) have the highest number of long tandem repeat duplications
Dating peak is prominent in human, chimpanzee and macaque, especially between 80-120 million years ago
Tandem repeat results follow a pattern which is similar to the divergence shown in the tree of life
Dog, rat and mouse show a steady increase in the number of tandem duplications, but the burst between 80-120 million years ago is negligible
Human has the highest number of duplications among all studied genomes
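A dating peak such as the 80-120 million year window can be located by counting estimated duplication ages per time bin. The sketch below is only an illustration of that counting step: the 40-million-year bin width is an arbitrary choice, the function names are hypothetical, and the ages are made-up values rather than results from the study.

```python
from collections import Counter


def bin_ages(ages_myr, bin_width=40):
    """Count ages (in million years) per bin, keyed by the bin's start."""
    counts = Counter(int(age // bin_width) * bin_width for age in ages_myr)
    return dict(sorted(counts.items()))


def peak_bin(ages_myr, bin_width=40):
    """Return (start, end) of the bin holding the most duplications."""
    counts = bin_ages(ages_myr, bin_width)
    start = max(counts, key=counts.get)
    return (start, start + bin_width)


# Made-up ages for illustration only.
ages = [12.0, 35.5, 82.0, 95.0, 101.3, 118.9, 140.0]
print(bin_ages(ages))
print(peak_bin(ages))
```

With real data, the choice of bin width affects how sharp a burst appears, so it is worth trying more than one width before reading a peak off the histogram.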
Acknowledgements
Advisor: Dr. Haixu Tang
School of Informatics
Members of Computational Omics Lab
Parents, Rajen & Rajeev
Prasanta
References
Methods for reconstructing the history of tandem repeats and their application to the human genome. Authors: Jaitly D, Kearney P, Lin G, Ma B
A Survey on Algorithmic Aspects of Tandem Repeats Evolution. Authors: E. Rivals
Topological Rearrangements and Local Search Method for Tandem Duplication Trees. Authors: Denis Bertrand and Olivier Gascuel
Greedy method for inferring tandem duplication history. Authors: Louxin Zhang, Bin Ma, Lusheng Wang and Ying Xu
A fast and accurate distance algorithm to reconstruct tandem duplication trees. Authors: Elemento O. and Gascuel O.
Tandem repeats finder: a program to analyze DNA sequences. Author: Gary Benson