Genome Exploration in A-T G-C space (mk1)

Genome exploration in A-T G-C space: introducing Icarus, a DNA walking program. Jonathan Blakes, MSc Biotechnology and Computation, Department of Biosciences, Faculty of Science, Technology and Medical Studies

description

My master's presentation demonstrating Icarus and DNA walking

Transcript of Genome Exploration in A-T G-C space (mk1)

Page 1: Genome Exploration in A-T G-C space (mk1)

Genome exploration in A-T G-C space

introducing Icarus, a DNA walking program

Jonathan Blakes

MSc Biotechnology and Computation

Department of Biosciences, Faculty of Science, Technology and Medical Studies

Page 2: Genome Exploration in A-T G-C space (mk1)

Problem

too much information!

Page 3: Genome Exploration in A-T G-C space (mk1)

EnsEMBL

UCSC

Genome Browsers

Page 4: Genome Exploration in A-T G-C space (mk1)

Hypothesis

Can DNA sequences be plotted in such a way that long sequences can be easily interpreted by humans without a priori knowledge?

“It seems that the simplest method of visualizing some properties of genomes is to send a virtual walker for a genomic walk, ask "it" to talk about what it has seen and note its observations. If our walker doesn't move with a Brownian-like motion, it is possible to extract from its walk a lot of information.”

Stanislaw Cebrat, the principal Polish proponent of DNA walks

Assigning a cardinal direction (north, south, east or west) to each of the four nucleotide bases (A, T, G, C) and taking a step in that direction for each base as the sequence is read will produce a ‘walk’ of the sequence, in which repetitive DNA elements appear as repetitive 2-dimensional ‘structures’.

Page 5: Genome Exploration in A-T G-C space (mk1)

DNA walks are plots of DNA or RNA sequences in which each of the four nucleotide bases is assigned a direction and a distance. The sequence is read off one nucleotide at a time and, for each nucleotide, the virtual walker takes a step in the designated direction, creating a 'walk' of the sequence that reveals elements of structure in the nucleotide composition.
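The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in Icarus: the name `dna_walk`, the unit step length and the choice of which base points in which compass direction are all assumptions.

```python
# Hypothetical A-T G-C mapping: A opposes T, G opposes C.
# Which base points north, south, east or west is an assumption.
STEPS = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}

def dna_walk(sequence, steps=STEPS):
    """Return the list of (x, y) points visited, starting at the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in sequence.upper():
        # Symbols without a mapping (e.g. N or a gap) leave the walker in place.
        dx, dy = steps.get(base, (0, 0))
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

if __name__ == "__main__":
    # A repeated motif traces a repeated shape: here, a staircase.
    print(dna_walk("GAGAGA"))
```

Plotting the returned points as a connected line gives the picture of the walk.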

DNA walking

From Comparative Genometrics website, L'Université de Lausanne

Page 6: Genome Exploration in A-T G-C space (mk1)

Icarus Live Demonstration

Could someone please suggest a mammalian gene to walk?

Page 7: Genome Exploration in A-T G-C space (mk1)

Mapping

There are 24 possible assignments of the four bases to the four cardinal vectors: 4 rotations of each of the 3 mappings above, and 4 rotations of each of their reflections about the x or y axis.

Which 3 of those 24 are chosen as the ‘unique’ mappings is a matter of parsimony.
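The count can be checked by brute force. The sketch below enumerates every assignment of the four bases to the four compass directions, then groups them by which bases end up opposing each other. The helper name `opposing_pairs` is hypothetical.

```python
from itertools import permutations

DIRECTIONS = ["N", "E", "S", "W"]
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

def opposing_pairs(mapping):
    """Return the two pairs of bases that point in opposite directions."""
    pairs = set()
    for base, direction in mapping.items():
        for other, other_direction in mapping.items():
            if other != base and other_direction == OPPOSITE[direction]:
                pairs.add(frozenset((base, other)))
    return frozenset(pairs)

# Every way of giving A, T, G and C a distinct direction: 4! = 24.
all_mappings = [dict(zip("ATGC", p)) for p in permutations(DIRECTIONS)]

# Group mappings that differ only by a rotation or a reflection.
classes = {}
for mapping in all_mappings:
    classes.setdefault(opposing_pairs(mapping), []).append(mapping)

if __name__ == "__main__":
    print(len(all_mappings), "mappings in", len(classes), "classes")
    for pairs, members in classes.items():
        print(sorted("-".join(sorted(p)) for p in pairs), len(members))
```

Each class holds 8 mappings (4 rotations of a mapping plus 4 rotations of its reflection), and the three classes correspond to the pairings A-T G-C, A-G C-T and A-C G-T.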

Page 8: Genome Exploration in A-T G-C space (mk1)

A-T G-C

Page 9: Genome Exploration in A-T G-C space (mk1)

A-G C-T

Page 10: Genome Exploration in A-T G-C space (mk1)

A-C G-T

Page 11: Genome Exploration in A-T G-C space (mk1)

A-T G-C

Page 12: Genome Exploration in A-T G-C space (mk1)

A-T G-C is consistently smallest

Smaller pictures can contain more information in less space and are therefore more amenable to publication, hence Genome Exploration in A-T G-C space
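The slide does not define how size was measured. One plausible reading is the bounding box of the plotted walk, and the Python sketch below compares the three mappings on that basis. The function name `bounding_box` and the direction assigned to each base are assumptions; only the opposing pairs come from the slides.

```python
# Only the opposing pairs are taken from the slides; the orientation of
# each mapping is an arbitrary choice and does not affect the box area.
MAPPINGS = {
    "A-T G-C": {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)},
    "A-G C-T": {"A": (0, 1), "G": (0, -1), "C": (1, 0), "T": (-1, 0)},
    "A-C G-T": {"A": (0, 1), "C": (0, -1), "G": (1, 0), "T": (-1, 0)},
}

def bounding_box(sequence, steps):
    """Return (width, height) of the axis-aligned box enclosing the walk."""
    x = y = 0
    min_x = max_x = min_y = max_y = 0
    for base in sequence.upper():
        dx, dy = steps.get(base, (0, 0))  # unmapped symbols do not move
        x, y = x + dx, y + dy
        min_x, max_x = min(min_x, x), max(max_x, x)
        min_y, max_y = min(min_y, y), max(max_y, y)
    return max_x - min_x, max_y - min_y

if __name__ == "__main__":
    sequence = "AATTGGCCAATT"  # toy input, not real genomic data
    for name, steps in MAPPINGS.items():
        print(name, bounding_box(sequence, steps))
```

Running this over real sequences, rather than the toy string here, is what would be needed to reproduce the comparison on the slide.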

Page 13: Genome Exploration in A-T G-C space (mk1)

Duplications

exons introns

A 7-fold contiguous duplication in the Y chromosome. These are members of the TSPY (Testis-specific Y-encoded proteins) family, identified by Skaletsky et al. [1] using a combination of a whole-chromosome dotplot with a 2-kb window and a custom Perl script running BLAST alignments of all 5-kb sequence segments, in 2-kb steps, of the entire MSY (Male Specific Y).

In contrast, I stumbled upon this purely by accident.

1. Skaletsky et al. Nature. 2003. 423.

Page 14: Genome Exploration in A-T G-C space (mk1)

DNA walks for phylogenetics

Imagine a 1-dimensional textual DNA sequence. The distance from the first base to the last is simply the number of bases in the sequence.

A comparison of aligned sequences on the basis of spatial distance (a much simpler measure than the Jukes-Cantor definition of evolutionary distance) will be unable to discriminate between them.

But for a DNA walk the spatial distance between the first and last bases is a function of:

• the nucleotide composition of a sequence and a 2D mapping

• the order of the bases, since A might oppose T and C oppose G.

Seven previously aligned, 1798-nucleotide-long small ribosomal subunit sequences of Candida and Saccharomyces species, as detailed in Gilfillan et al. [1], were walked and their total Euclidean distances used to produce a phylogeny, which was compared to Gilfillan’s.

1. Gilfillan GD, et al. Microbiology. 1998. 144: 829-838.
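The slides do not spell out how the total Euclidean distance between two walks was computed. The Python sketch below assumes one possible reading: walk both aligned sequences in step and sum the Euclidean distance between the two walkers at every position. The names `walk`, `walk_distance` and `distance_matrix`, and the direction assignments, are hypothetical.

```python
import math

# Hypothetical A-T G-C mapping; alignment gaps ("-") leave the walker in place.
STEPS = {"A": (0, 1), "T": (0, -1), "G": (1, 0), "C": (-1, 0)}

def walk(sequence):
    """Return the (x, y) points visited by a walk from the origin."""
    x, y = 0, 0
    path = [(x, y)]
    for base in sequence.upper():
        dx, dy = STEPS.get(base, (0, 0))
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

def walk_distance(seq_a, seq_b):
    """Sum the walker-to-walker Euclidean distance over aligned positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(math.dist(p, q) for p, q in zip(walk(seq_a), walk(seq_b)))

def distance_matrix(sequences):
    """Return (names, square matrix) for a dict of name -> aligned sequence."""
    names = list(sequences)
    matrix = [[walk_distance(sequences[a], sequences[b]) for b in names]
              for a in names]
    return names, matrix

if __name__ == "__main__":
    toy = {"s1": "AGGT", "s2": "GAGT", "s3": "AGCT"}  # toy data only
    names, matrix = distance_matrix(toy)
    for name, row in zip(names, matrix):
        print(name, [round(d, 3) for d in row])
```

Under this reading the measure is sensitive to base order as well as composition: "AG" and "GA" end at the same point but are a non-zero distance apart.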

Page 15: Genome Exploration in A-T G-C space (mk1)

Phylogeny algorithms

neighbour joining

Icarus’ UPGMA

Distance Matrix
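For reference, UPGMA repeatedly merges the two closest clusters and replaces their distances with a size-weighted average. The Python sketch below is a textbook version written for illustration; it is not taken from Icarus, and the function name `upgma` and the three-decimal branch-length format are assumptions.

```python
def upgma(names, matrix):
    """Cluster by average linkage and return a Newick string."""
    # Each cluster id maps to (newick_text, number_of_leaves, node_height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller_id, larger_id).
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = d / 2  # the new node sits halfway between the two clusters
        text = (f"({text_a}:{height - height_a:.3f},"
                f"{text_b}:{height - height_b:.3f})")
        # Distance from the merged cluster to each remaining cluster.
        merged = {}
        for c in clusters:
            d_ac = dist[tuple(sorted((a, c)))]
            d_bc = dist[tuple(sorted((b, c)))]
            merged[(c, next_id)] = ((size_a * d_ac + size_b * d_bc)
                                    / (size_a + size_b))
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(merged)
        clusters[next_id] = (text, size_a + size_b, height)
        next_id += 1
    (text, _, _), = clusters.values()
    return text + ";"

if __name__ == "__main__":
    print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
```

UPGMA assumes a constant rate of change along every branch, which neighbour joining does not, so the two methods can give different trees from the same distance matrix.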

Page 16: Genome Exploration in A-T G-C space (mk1)

Phylogeny Demonstration

Page 17: Genome Exploration in A-T G-C space (mk1)

Newick format

Distance Matrix

Output

Newick format string representation of a tree:

(Bovine:0.69395, (Gibbon:0.36079, (Orang:0.33636, (Gorilla:0.17147, (Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939, Mouse:1.21460);
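In a Newick string, each pair of parentheses encloses the children of one internal node, and the number after each colon is the length of the branch leading to that node. The small recursive parser below, a sketch with the hypothetical names `parse_newick` and `leaves`, shows how the example string unpacks. It handles only the plain subset of the format used here: no quoted labels and no comments.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, length, children) tuples."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and final ";"
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]  # empty for unnamed internal nodes
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()

def leaves(tree):
    """Return leaf names in left-to-right order."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaves(child)]

if __name__ == "__main__":
    example = ("(Bovine:0.69395, (Gibbon:0.36079, (Orang:0.33636, "
               "(Gorilla:0.17147, (Chimp:0.19268, Human:0.11927):0.08386)"
               ":0.06124):0.15057):0.54939, Mouse:1.21460);")
    print(leaves(parse_newick(example)))
```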

Page 18: Genome Exploration in A-T G-C space (mk1)

Phylogenies with DNA walks

Page 19: Genome Exploration in A-T G-C space (mk1)

Does summing distances from 3 mappings eliminate bias and produce a better phylogeny?

NO.

A better distance measure is needed.

Page 20: Genome Exploration in A-T G-C space (mk1)

Conclusion

• Icarus is a DNA-walk-based genome browser that can retrieve sequences and annotate walks using Ensembl.

• DNA walks can demonstrate the existence of duplications in DNA to the untrained eye.

• Spatial distance measures can produce phylogenies, but a better measure is needed than Manhattan or Euclidean distance.

Page 21: Genome Exploration in A-T G-C space (mk1)

Acknowledgements

I would like to thank:

Dr. Gary Robinson

Dr. Colin Johnson

Dr. Anthony Baines

And everyone I have met during the Biotechnology and Computation MSc.