Species Identification through DNA String Analysis

Mark Vorster Supervisor: Prof Philip Machanick


Transcript of Species Identification through DNA String Analysis

Page 1: Species Identification through  DNA String Analysis

Mark Vorster
Supervisor: Prof Philip Machanick

Page 2: Species Identification through  DNA String Analysis

Research Overview

Goal
  Aid bioinformaticians in research by providing a tool which can identify similar DNA sequences, in a timely manner, in order to infer homogeneity.

Reasons for the problem
  Large data sets
  Days of processing
  No existing specific tools


Research Overview - Bioinformatics - String Matching - Discussion - Questions

Page 3: Species Identification through  DNA String Analysis

Bioinformatics

"Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."

Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta

"The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."

Oxford English Dictionary


Page 4: Species Identification through  DNA String Analysis

History of Bioinformatics and Genetics

1953 - Watson, Crick, Wilkins and Franklin.

Discrete abstraction

Adenine – Thymine
Guanine – Cytosine

[Figure: DNA double helix, labelled with the sugar-phosphate backbone, bases and hydrogen bonds; one helical turn = 3.4 nm. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]


Page 5: Species Identification through  DNA String Analysis

Sequence Analysis and Sequence Alignment

Sequence Alignment
  Global alignment is expensive
  Assumption: sequences are already globally aligned

Alignment Differences
               TGAGCACCT
  Insertion    TGACGCACCT
  Deletion     TGA_CACCT
  Replacement  TGATCACCT

Phylogenetic inference
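The three kinds of alignment difference can be illustrated with Python's standard difflib module. This is only a sketch for the slide's example strings, not the tool described in the talk; the function name describe_differences is hypothetical, and the deletion example is written without the "_" gap marker.

```python
from difflib import SequenceMatcher


def describe_differences(reference: str, other: str) -> list:
    """Return (kind, reference_part, other_part) for each non-matching span."""
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    return [
        (tag, reference[i1:i2], other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


reference = "TGAGCACCT"
print(describe_differences(reference, "TGACGCACCT"))  # [('insert', '', 'C')]
print(describe_differences(reference, "TGACACCT"))    # [('delete', 'G', '')]
print(describe_differences(reference, "TGATCACCT"))   # [('replace', 'G', 'T')]
```

Each variant differs from the first sequence by exactly one edit, which is the unit the string matching slides count.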


Page 6: Species Identification through  DNA String Analysis

FASTA File Format

Leading ‘>’
Sequence identifier
Description or comment
A number of lines of genetic code
Other symbols

>SequenceName description or comment
CCGGAATACCTAGGACGCCTTCATCCCCCGCCGGTCTGTGATGTCCCAATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGCTAGTCGGGATGATAACCAAGAATTTGTGTCTG

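A minimal sketch of reading this layout in Python, assuming the identifier is the first word after ‘>’ and that sequence data may be split over several lines. The function name parse_fasta is hypothetical, and the example uses shortened versions of the sequences above.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its concatenated sequence."""
    sequences: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word, the rest is a comment.
            header = line[1:].split(maxsplit=1)
            current_id = header[0] if header else ""
            sequences[current_id] = ""
        elif current_id is not None:
            # Sequence data may span several lines; join them.
            sequences[current_id] += line
    return sequences


example = """>SequenceName description or comment
CCGGAATACC
TAGGACGCCT
>NextSequence description or comment
ACGCCTGATT
""".splitlines()

records = parse_fasta(example)
print(records["SequenceName"])  # CCGGAATACCTAGGACGCCT
```

The sketch ignores the "other symbols" mentioned on the slide and keeps whatever characters appear on the sequence lines.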

Page 7: Species Identification through  DNA String Analysis

Approximate String Matching Algorithm

Nested loops are inefficient
Dynamic programming
  Takes all previous information into account
  Improved to O(n^2), where n is the number of bases in the shorter sequence

Goal: find the closest match between two strings, or the minimum number of differences


Page 8: Species Identification through  DNA String Analysis

Approximate String Matching Algorithm

D[i][j] is the minimum of:
  MatchCost = D[i-1][j-1], if p_i = t_j
  ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
  InsertCost = D[i-1][j] + 1
  DeleteCost = D[i][j-1] + 1

D[0][j] = 0 and D[i][0] = i

Cells used for each step:
  D[i-1][j-1]  D[i-1][j]
  D[i][j-1]    D[i][j]
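The recurrence can be sketched in Python as follows. This is an illustration under stated assumptions rather than the presenter's code: the function name match_table is hypothetical, row 0 and column 0 play the role of the NULL row and column, and the comparison is case-sensitive.

```python
from typing import List


def match_table(pattern: str, text: str) -> List[List[int]]:
    """Fill D, where D[i][j] is the fewest differences between
    pattern[:i] and some substring of text ending at position j."""
    rows, cols = len(pattern) + 1, len(text) + 1
    D = [[0] * cols for _ in range(rows)]    # D[0][j] = 0
    for i in range(1, rows):
        D[i][0] = i                           # D[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            same = pattern[i - 1] == text[j - 1]
            diagonal = D[i - 1][j - 1] + (0 if same else 1)  # match or revise
            insert = D[i - 1][j] + 1
            delete = D[i][j - 1] + 1
            D[i][j] = min(diagonal, insert, delete)
    return D


D = match_table("happy", "Have a hsppy day")
best = min(D[-1])          # fewest differences for the whole pattern
end = D[-1].index(best)    # text position where that match ends
print(best, end)           # 1 12
```

The smallest value in the last row is the number of differences in the closest match; here "happy" matches "hsppy" with one revision, ending at the twelfth character of the text.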


Page 9: Species Identification through  DNA String Analysis

Approximate String Matching Algorithm


        H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h    1
a    2
p    3
p    4
y    5

(_ marks a space character in the text)

Page 10: Species Identification through  DNA String Analysis

Approximate String Matching Algorithm


        H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h    1  1  1  1  1  1  1
a    2  2  1  2  2  2  1
p    3  3  2  2  3  3
p    4  4  3  3  3  4
y    5  5  4  4  4  4

(_ marks a space character in the text)

MatchCost = D[i-1][j-1], if p_i = t_j
ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
InsertCost = D[i-1][j] + 1
DeleteCost = D[i][j-1] + 1

[Diagram: the cell D[i][j] is computed from its neighbours D[i-1][j-1], D[i-1][j] and D[i][j-1]; i indexes the pattern character p_i and j indexes the text character t_j.]

Example:
  MatchCost = N/A
  ReviseCost = 3
  InsertCost = 2
  DeleteCost = 4
  -> Min = 2

Page 11: Species Identification through  DNA String Analysis

Approximate String Matching Algorithm


        H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h    1  1  1  1  1  1  1  1  0  1  1  1  1
a    2  2  1  2  2  2  1  2  1  1  2  2  2
p    3  3  2  2  3  3  2  2  2  2  1  2  3
p    4  4  3  3  3  4  3  3  3  3  2  1  2
y    5  5  4  4  4  4  4  4  4  4  3  2  1

(_ marks a space character in the text)

Page 12: Species Identification through  DNA String Analysis

Approximate String Matching Algorithm


Changes
  D[i][0] = i, if p_i = t_0
  D[i][0] = i + 1, if p_i ≠ t_0
  D[0][j] = j, if p_0 = t_j
  D[0][j] = j + 1, if p_0 ≠ t_j

Additional stop case for mismatch
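A sketch of how the changed boundary conditions might be applied, assuming the table no longer has a NULL row and column (so D[i][j] compares p_i with t_j directly) and that the interior cells keep the earlier recurrence. The function name distance_table is hypothetical, and the additional stop case is not modelled because the slides do not define it.

```python
from typing import List


def distance_table(p: str, t: str) -> List[List[int]]:
    """Fill D for two non-empty sequences assumed to start at the same place."""
    D = [[0] * len(t) for _ in p]
    for i in range(len(p)):
        D[i][0] = i if p[i] == t[0] else i + 1   # first column
    for j in range(len(t)):
        D[0][j] = j if p[0] == t[j] else j + 1   # first row
    for i in range(1, len(p)):
        for j in range(1, len(t)):
            diagonal = D[i - 1][j - 1] + (0 if p[i] == t[j] else 1)
            D[i][j] = min(diagonal, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


D = distance_table("TACGAAGGGA", "TACGGACGGT")
print(D[0])       # [0, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(D[9][:7])   # [10, 8, 7, 6, 5, 4, 5]
```

With the sequences TACGAAGGGA (rows) and TACGGACGGT (columns), the first row and the first seven entries of the last row come out as printed.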


Page 13: Species Identification through  DNA String Analysis

Approximate String Matching Algorithm


    T   A   C   G   G   A   C   G   G   T
T   0   2   3   4   5   6   7   8   9   9
A   2   0   1   2   3   4   5
C   3   1   0   1   2   3   4
G   4   2   1   0   1   2   3
A   5   3   2   1   1   1   2
A   6   4   3   2   2   1   2
G   7   5   4   3   2   2   2
G   8   6   5   4   3   3   3
G   9   7   6   5   4   4   4
A  10   8   7   6   5   4   5


Page 14: Species Identification through  DNA String Analysis

Discussion


Grouping Algorithm

Scale of the problem
  400 – 800 bases per sequence
  Tens of thousands of sequences

Assumptions:
  Sequences globally aligned
  Sequences begin at the same place

Page 15: Species Identification through  DNA String Analysis

Example Grouping


Seq[336] HK2QS7R01AXRJ6 Seq[218] Seq[38] Seq[235] Seq[89] …

Seq[382] HK2QS7R01BR4Q9 Seq[173]

Seq[180] HK2QS7R01ABFDP Seq[339] Seq[289] Seq[491] Seq[319] …

Seq[269] HK2QS7R01AZHD7 Seq[402] Seq[112] Seq[203] Seq[137] …

Seq[210] HK2QS7R01BMNQ4 Seq[364]

Seq[270] HK2QS7R01AZFOG Seq[388] Seq[441]

Seq[442] HK2QS7R01ADASO Seq[426] Seq[233] Seq[374] Seq[416] …

… …
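The slides do not spell out how groups are formed, so the following is only one possible reading: a greedy pass in which the first member of each group acts as its representative, as the layout above suggests. The names group_sequences, mismatches and max_differences are hypothetical, and the position-by-position mismatches count is a simple stand-in for the approximate string matching distance.

```python
from typing import Callable, Dict, List


def mismatches(a: str, b: str) -> int:
    """Stand-in distance: differing positions plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def group_sequences(
    sequences: Dict[str, str],
    max_differences: int,
    distance: Callable[[str, str], int] = mismatches,
) -> List[List[str]]:
    """Greedy grouping: the first member of each group is its representative."""
    groups: List[List[str]] = []
    for name, seq in sequences.items():
        for group in groups:
            representative = sequences[group[0]]
            if distance(representative, seq) <= max_differences:
                group.append(name)
                break
        else:
            groups.append([name])  # no close group found: start a new one
    return groups


demo = {
    "seqA": "TGAGCACCT",
    "seqB": "TGATCACCT",  # one replacement away from seqA
    "seqC": "ACGCCTGAT",
    "seqD": "ACGCCTGAA",  # one replacement away from seqC
}
print(group_sequences(demo, max_differences=1))
# [['seqA', 'seqB'], ['seqC', 'seqD']]
```

In the worst case, where no two sequences are similar, every sequence is compared with every earlier one, which is the quadratic behaviour reported in the results.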


Page 16: Species Identification through  DNA String Analysis

Results


O(n^2), where n is the number of sequences.

~1600 comparisons per second.

10000 sequences: ~8.6 hours (down from 10 days).

Comparisons for n sequences = (n-1)n/2
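The running-time estimate follows directly from the comparison count and the quoted rate. A small check of the arithmetic (the names pairwise_comparisons and RATE_PER_SECOND are introduced here for illustration):

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n sequences: (n - 1) * n / 2."""
    return (n - 1) * n // 2


RATE_PER_SECOND = 1600  # comparisons per second, as quoted on the slide
n = 10_000

comparisons = pairwise_comparisons(n)
hours = comparisons / RATE_PER_SECOND / 3600
print(comparisons)      # 49995000
print(round(hours, 1))  # 8.7
```

About 50 million comparisons at roughly 1600 per second comes to a little under 8.7 hours, in line with the slide's figure of about 8.6 hours.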


Page 17: Species Identification through  DNA String Analysis
