PREETI MISRA Advisor: Dr. HAIXU TANG SCHOOL OF INFORMATICS - INDIANA UNIVERSITY Computational method...

Page 1: PREETI MISRA Advisor: Dr. HAIXU TANG SCHOOL OF INFORMATICS - INDIANA UNIVERSITY Computational method to analyze tandem repeats in eukaryote genomes.

PREETI MISRA

Advisor: Dr. HAIXU TANG

SCHOOL OF INFORMATICS - INDIANA UNIVERSITY

Computational method to analyze tandem repeats in eukaryote genomes

Page 2:

Overview

Background
Tandem repeats
Methodology
Results
Conclusions
References

Capstone Presentation 05/18/2007 2

Page 3:

Background

April 20, 2006

An array of consecutive repeats
Repeating pattern or consensus = 5
Total repeat length = 25

3 main types of tandem repeats
Microsatellites -- 1-5 bp repeating pattern
Minisatellites -- 6-50 bp repeating pattern
Large tandem -- greater than 50 bp repeating pattern

GATCCGATCCGATCCGATCCGATCC

Background

Tandem repeats

Tandem Gene duplication

Methodology

Tandem Repeat Finder

Dating tandem repeats

Jukes-Cantor model

Results

Analysis

Conclusion
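
As a minimal sketch, the three size classes listed on the Background slide can be written as a small Python function. The function name `classify_tandem_repeat` is hypothetical and not part of the presentation; only the 1-5 bp, 6-50 bp and greater-than-50 bp boundaries come from the slide.

```python
def classify_tandem_repeat(period_bp: int) -> str:
    """Return the size class for a repeating pattern of the given length in bp.

    Thresholds follow the Background slide; the function name is illustrative.
    """
    if period_bp < 1:
        raise ValueError("period length must be at least 1 bp")
    if period_bp <= 5:
        return "microsatellite"
    if period_bp <= 50:
        return "minisatellite"
    return "large tandem repeat"


# The slide's example has a 5 bp repeating pattern.
print(classify_tandem_repeat(5))
```

Under these thresholds the slide's GATCC example counts as a microsatellite, while the repeats studied in the rest of the deck fall in the last class.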

Page 4:

Significance

Use tandem repeats to determine whether 2 DNA samples belong to the same person or not

Uses –
Forensic use
Paternity testing

Image downloaded from www.egensburg.de/Fakultaeten/Medizin/Klinische_Chemie/lehre/vorlesung/Manuskript_Profiling%20.pdf

Page 5:

Mechanism of tandem duplication

Unequal recombination is the major known mechanism for the formation of large tandem repeats

Image has been downloaded from http://hc.ims.u okyo.ac.jp/JSBi/journal/GIW02/GIW02F010/GIW02F010.html

Page 6:

Tandem gene duplication

Benefits – New functions arise. Responsible for the evolution of gene clusters

Example – Zinc finger genes in mammalian genomes

http://www.steve.gb.com/images/molecules/nucleotides/zinc_finger_(side).jpg

2 homologous Genes

second gene = Duplicated

Page 7:

Purpose

Large tandem repeats are commonly found in eukaryotes – humans have 1.684% and chimpanzees have 1.525%

To date the large tandem duplications and find the relationship between various characteristics of long tandem repeats and the corresponding evolutionary time

8 genomes – 3 primates, 2 rodents, dog, chicken and puffer fish were analyzed

Page 8:

Methodology

Identification
Tandem repeat finder (TRF) for identification of large tandem repeats

Distance computation
Jukes-Cantor distance model to find the distance between two repeats

Transformation
Transform the above computed distance into evolutionary time

Page 9:

Tandem Repeat Finder

STRING, Mreps and TRF
TRAP: T. Jose, P. Sobreira, A. Durham and A. Gruber
TRF can be downloaded at http://tandem.bu.edu/trf/trf.html

Information present in the output:
Starting and ending positions of the tandem repeat
Number of repetitions
A%, C%, G%, T%: percentage of each base in the tandem repeat
Length of the consensus word (only the first 10 bases)

Page 10:

Tandem Repeat Finder

Tandem repeat finder outline:

Tandem repeat finder program has 2 main components – detection and analysis

Detection - Finds candidate tandem repeats

Analysis - Produces an alignment for each candidate and statistics about the alignment

Page 11:

Tandem Repeat Finder

Large tandem repeats were extracted
Results of TRF –
1 5 100 0 50 20 40 20 20 1.92 GATCC
GATCCGATCCGATCCGATCCGATCC

GATCC - period or consensus
GATCCGATCCGATCCGATCCGATCC - repeat

1 - indices
5 - consensus or period size
100 - percent matches
0 - percent indels
50 - score
20 - % of A
40 - % of C
1.92 - entropy
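
The composition and entropy values in this example can be checked by hand. The sketch below assumes that the entropy column is the Shannon entropy (in bits) of the base composition, and that the repeat is five copies of GATCC. Both function names are illustrative and are not part of TRF.

```python
import math
from collections import Counter


def base_composition(seq: str) -> dict:
    """Percentage of A, C, G and T in a sequence (other symbols are ignored)."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def shannon_entropy(composition: dict) -> float:
    """Shannon entropy in bits of a percentage composition (assumed definition)."""
    return -sum((pct / 100.0) * math.log2(pct / 100.0)
                for pct in composition.values() if pct > 0)


repeat = "GATCC" * 5  # assumed repeat: five copies of the 5 bp period
composition = base_composition(repeat)
entropy = shannon_entropy(composition)
print(composition, round(entropy, 2))
```

Under that assumption the code gives 20% A, 40% C, 20% G, 20% T and an entropy that rounds to 1.92, matching the slide.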

Page 12:

DNA Sequence Evolution Model For Dating

[Figure: example tree of DNA sequence evolution over time]
3 mil yrs: AAGACTT
2 mil yrs: AAGGCCT, TGGACTT
1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT
today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

Page 13:

Dating tandem duplications

Computing divergence of tandem repeating units –

Repeat identity - each repeat is compared with the other repeats and the maximum similarity/identity is considered

GATCC
GATCC|GATCC|GATCC|GATCC|GATCC
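
The repeat identity step can be sketched roughly as follows. This is a simplification: the presentation computes similarities with stretcher, a global alignment tool, whereas the code below swaps in an ungapped, position-by-position comparison of equal-length units. The diverged example sequence and all function names are hypothetical.

```python
def split_units(repeat: str, period: int) -> list:
    """Cut a tandem repeat into consecutive units of `period` bases."""
    return [repeat[i:i + period]
            for i in range(0, len(repeat) - period + 1, period)]


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length units."""
    if len(a) != len(b):
        raise ValueError("ungapped identity needs equal-length units")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def max_identity_per_unit(units: list) -> list:
    """For each unit, the highest identity to any other unit in the array."""
    if len(units) < 2:
        raise ValueError("need at least two units to compare")
    return [max(identity(u, v) for j, v in enumerate(units) if j != i)
            for i, u in enumerate(units)]


# Hypothetical diverged array: GATCC|GATCC|GATTC|GATCC|GTTCC
units = split_units("GATCCGATCCGATTCGATCCGTTCC", 5)
print(max_identity_per_unit(units))
```

Real repeat units can differ in length because of indels, which is why an alignment tool is needed in practice; the ungapped version only illustrates the "compare with all others, keep the maximum" idea.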

Page 14:

Jukes-Cantor model

Computes the distance between 2 repeats

All bases occur with equal probability, i.e. p = 0.25 for A, T, G and C

All possible base substitutions are equally likely, e.g. A ↔ G, A ↔ C, A ↔ T, G ↔ T

Page 15:

Jukes-Cantor model

m = no. of mutations
n = length of sequence
D = distance between two repeats

D = -(3/4) ln(1 - (4/3)(m/n))

Ex- If mismatches are observed at 25% of the sites, the Jukes-Cantor model predicts that the distance between the two repeats is 0.304
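
The formula is easy to check numerically. The following is a minimal sketch, assuming m counts mismatched sites and n counts compared sites; the function name `jukes_cantor_distance` is illustrative and not taken from the presentation.

```python
import math


def jukes_cantor_distance(mismatches: int, length: int) -> float:
    """Jukes-Cantor distance D = -(3/4) ln(1 - (4/3)(m/n))."""
    if length <= 0:
        raise ValueError("sequence length must be positive")
    p = mismatches / length
    if not 0 <= p < 0.75:
        # The logarithm is undefined once the mismatch fraction reaches 3/4.
        raise ValueError("mismatch fraction must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


# Slide example: mismatches at 25% of the sites.
print(round(jukes_cantor_distance(25, 100), 3))
```

The distance is larger than the raw mismatch fraction (0.304 against 0.25) because the model corrects for sites that may have mutated more than once.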

Page 16:

Estimating the evolutionary time

Transforming the computed distance (D) between two repeats into evolutionary time

Neutral mutation rate in mammals is nearly 1.25 * 10^-9 per year per site

Time (T) = D / (1.25 * 10^-9) years ago

Ex- D = 0.1
T = 0.1 / (1.25 * 10^-9) = 80 million years ago
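
The conversion is a single division. Here is a small sketch using the rate quoted on this slide; the constant and function names are hypothetical.

```python
# Neutral mutation rate quoted on the slide, per site per year.
NEUTRAL_RATE_PER_SITE_PER_YEAR = 1.25e-9


def distance_to_years(distance: float,
                      rate: float = NEUTRAL_RATE_PER_SITE_PER_YEAR) -> float:
    """Convert a distance D into an age in years using T = D / rate."""
    if rate <= 0:
        raise ValueError("mutation rate must be positive")
    return distance / rate


# Slide example: D = 0.1
years = distance_to_years(0.1)
print(f"{years / 1e6:.0f} million years ago")
```

Because the rate quoted here is for mammals, applying the same constant to chicken or puffer fish is an extra assumption.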

Page 17:

Material and Method

Material
The genome files were downloaded from UCSC site http://hgdownload.cse.ucsc.edu/downloads.html
The tandem repeat finder and stretcher software were downloaded

Procedure
Extraction of large tandem repeats with the help of tandem repeat finder
Calculation of similarities between tandem repeats using stretcher
Computation of the distance using Jukes-Cantor model
Transformation of distance to the evolutionary time

Page 18:

Tree of life

[Figure: tree of life; divergence times marked at 500, 225, 92, 75, 25, 12-24 and 6 million years ago]

Page 19:

Recap – period & repeat

ATTCGATTCGATTCGGGATTCGACATTATTCGATTCGATTCGGGATTCGACATTCGCG

ATTCATTCGG

REPEAT
PERIOD or CONSENSUS

Page 20:

Results

Genome | Chr#, longest repeat length | Chr#, highest total # of repeats | Chr#, longest period length | Total repeat | Total genome size | Chr#, highest % of repeat | Total coverage (% of repeat in genome)
HUMAN | Y, 217961 | 1, 6586 | X, 2000 | 51 MB | 3.1 GB | 19, 6.06 | 1.684
CHIMPANZEE | 8, 62142 | 1, 5943 | 8, 1985 | 48 MB | 2.97 GB | 19, 4.9463 | 1.525
MACAQUE | 13, 119145 | 19, 5399 | 19, 1969 | 44 MB | 2.87 GB | | 1.538
RAT | 18, 63266 | 1, 5156 | 1, 1989 | 20 MB | 2.75 GB | 12, 2.034 | 0.424
MOUSE | 7, 203136 | 5, 4214 | 10, 1983 | 19 MB | 2.61 GB | X, 1.242 | 0.263
DOG | X, 37449 | 1, 1613 | 18, 1852 | 6 MB | 2.40 GB | | 0.748
CHICKEN | 1, 16680 | 1, 958 | 13, 1988 | 4 MB | 1.1 GB | 16, 7.779 | 0.735
PUFFER FISH | 2, 13045 | 2, 139 | 2, 1569 | 0.53 MB | 385 MB | 10, 0.51 | 0.245

Entries not aligned to a row in the transcript (Chr#, highest % of repeat): 19, 8.408; 1, 0.4902

Page 21:

Results

Page 22:

Total number of repeats

Page 23:

Total number of period or consensus

Page 24:

Results of repeat length

Page 25:

% Repeat results

Fish

Human

Page 26:

Dating tandem repeats

Page 27:

Tree of life

[Figure: tree of life; divergence times marked at 500, 225, 92, 75, 25, 12-24 and 6 million years ago]

Page 28:

Conclusions

Primates (human, chimpanzee and macaque) have the highest number of long tandem repeat duplications

Dating peak is prominent in human, chimpanzee and macaque, especially between 80-120 million years ago

Tandem repeat results follow a pattern which is similar to the divergence as shown in the tree of life

Dog, rat and mouse show a steady increase in the number of tandem duplications, but the burst between 80-120 million years ago is negligible

Human has the highest number of duplications among all studied genomes

Page 29:

Acknowledgements

Advisor – Dr. Haixu Tang

School of Informatics

Members of Computational Omics Lab

Parents, Rajen & Rajeev

Prasanta

Page 30:

References

Methods for reconstructing the history of tandem repeats and their application to the human genome
Authors: Jaitly D, Kearney P, Lin G, Ma B

A Survey on Algorithmic Aspects of Tandem Repeats Evolution
Authors: E. Rivals

Topological Rearrangements and Local Search Method for Tandem Duplication Trees
Authors: Denis Bertrand and Olivier Gascuel

Greedy method for inferring tandem duplication history
Authors: Louxin Zhang, Bin Ma, Lusheng Wang and Ying Xu

A fast and accurate distance algorithm to reconstruct tandem duplication trees
Authors: Elemento O. and Gascuel O.

Tandem repeats finder: a program to analyze DNA sequences
Author: Gary Benson
