Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths &...

Graph Theory And Bioinformatics

Jason Wengert

Outline

Introduction to Graphs

Eulerian Paths & Hamiltonian Cycles

Interval Graph & Shape of Genes

Sequencing DNA

Shortest Superstring Problem

Sequencing by Hybridization (SBH)

SBH as Hamiltonian Cycle Problem

SBH as Eulerian Path Problem

Graph Theory

A graph is a set of vertices connected by edges

Edges connect two vertices and can be weighted, directed or undirected
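As a minimal sketch of these definitions, a graph can be stored as an adjacency list. The function name `build_graph` and the vertex labels are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def build_graph(edges, directed=False):
    """Return an adjacency list {vertex: [(neighbor, weight), ...]}."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        if not directed:
            # An undirected edge can be traversed in both directions.
            adj[v].append((u, w))
    return dict(adj)

# Hypothetical weighted edges, used only for illustration.
edges = [("A", "B", 1.0), ("B", "C", 2.5)]
undirected = build_graph(edges)
directed = build_graph(edges, directed=True)
```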

The Bridge Obsession Problem

Bridges of Königsberg

Find a tour crossing every bridge just once

Leonhard Euler, 1735

Eulerian Cycle Problem

Find a cycle that visits every edge exactly once

Solvable in linear time

A solution exists if the graph is connected and every vertex has even degree

For an Eulerian path, the start and end vertices may have odd degree

[Figure: a more complicated Königsberg]
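The degree condition can be checked directly on the original Königsberg layout. The sketch below assumes the usual modelling of four land masses joined by seven bridges. The labels N, S, I, E (north bank, south bank, island, east) and the function names are hypothetical.

```python
from collections import Counter

# Seven bridges between four land masses (labels are illustrative).
BRIDGES = [("N", "I"), ("N", "I"), ("S", "I"), ("S", "I"),
           ("N", "E"), ("S", "E"), ("I", "E")]

def odd_degree_vertices(edges):
    """Return the sorted vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)

def classify(edges):
    """Classify an undirected graph that is assumed to be connected."""
    odd = odd_degree_vertices(edges)
    if not odd:
        return "Eulerian cycle"
    if len(odd) == 2:
        return "Eulerian path"
    return "neither"

print(classify(BRIDGES))  # all four land masses have odd degree
```

All four vertices have odd degree here, so the bridges admit neither an Eulerian cycle nor an Eulerian path.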

Hamiltonian Cycle Problem

Find a cycle that visits every vertex exactly once

NP-complete

Closely related to the Travelling Salesperson Problem, which generalizes it to weighted edges

Game invented by Sir William Hamilton in 1857

Interval Graph

Vertices are segments on a line

Edges exist between two vertices if any part of their segments overlap
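A short sketch of this construction follows. It assumes closed intervals, so segments that merely touch at an endpoint count as overlapping. The names `interval_graph` and `segments` are illustrative.

```python
from itertools import combinations

def interval_graph(intervals):
    """Build the edge set of an interval graph.

    intervals: {name: (start, end)} with start <= end, treated as closed.
    Returns a set of frozensets, one per pair of overlapping segments.
    """
    edges = set()
    for (a, (s1, e1)), (b, (s2, e2)) in combinations(intervals.items(), 2):
        # Two closed intervals overlap when each starts before the other ends.
        if s1 <= e2 and s2 <= e1:
            edges.add(frozenset((a, b)))
    return edges

# Hypothetical segments: x and y overlap, z is separate.
segments = {"x": (0, 3), "y": (2, 5), "z": (6, 8)}
graph_edges = interval_graph(segments)
```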

Seymour Benzer and T4 phages

Benzer studied mutated bacteriophages

Non-mutated T4 phages kill bacteria

The mutants had continuous segments missing from their DNA

These mutants lost the ability to kill the bacteria

Two mutant strains with non-overlapping segments missing can kill bacteria together

Complementation between pairs of mutant T4 bacteriophages

Interval Graph: Linear Genes

Interval Graph: Branched Genes

Sequencing DNA

Sanger Method: Make copies of fragments of different lengths by nucleotide starvation during copying

Separate the fragments via electrophoresis

Each starvation experiment produces a ladder of fragments of various lengths

Generate several-hundred-nucleotide fragment sequences (reads) from these ladders

Sanger Method: Generating a Read

1. Start at primer (restriction site)

2. Grow DNA chain

3. Include ddNTPs

4. The reaction stops at all possible points

5. Separate products by length, using gel electrophoresis

Combining Reads

Need a method to take the reads and combine them into a genome

You don't know where the reads came from in the DNA

Shortest Superstring Problem

Problem: Given a set of strings, find a shortest string that contains all of them

Input: Strings s1, s2, …, sn

Output: A string s that contains all strings s1, s2, …, sn as substrings, such that the length of s is minimized

Complexity: NP-complete

Note: this formulation does not take into account sequencing errors

Shortest Superstring Problem: Example

Reducing SSP to TSP

Define overlap(si, sj) as the length of the longest prefix of sj that matches a suffix of si.

[Figure: overlap example using the string aaaggcatcaaatctaaaggcatcaaa]

Construct a graph with n vertices representing the n strings s1, s2, …, sn.

Insert a directed edge of length -overlap(si, sj) from vertex si to vertex sj, so that a larger overlap gives a shorter edge.

Find the shortest path which visits every vertex exactly once.

This is the Traveling Salesman Problem (TSP), which is also NP-complete.
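The overlap function and the search over vertex orderings can be sketched as follows. This is a brute-force illustration for tiny inputs only, not a practical assembler. It assumes no input string is a substring of another. The name `shortest_superstring` and the example reads are hypothetical.

```python
from itertools import permutations

def overlap(si, sj):
    """Length of the longest prefix of sj that matches a suffix of si."""
    for k in range(min(len(si), len(sj)), 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0

def shortest_superstring(strings):
    """Try every ordering of the strings and merge neighbors by overlap.

    Runs in factorial time; assumes no string contains another.
    """
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            # Append only the part of nxt not already covered by prev.
            merged += nxt[overlap(prev, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best

reads = ["TGC", "ATG", "GCA"]  # illustrative reads
print(shortest_superstring(reads))
```

Each ordering of the strings corresponds to a path visiting every vertex once, and the ordering with the greatest total overlap yields the shortest merged string.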

Sequencing By Hybridization

The presence of a target DNA sequence can be determined by creating a probe, a single-strand fragment with the complementary sequence

If the target sequence exists in the sample, it will hybridize with the probe

How SBH Works

Attach all possible DNA probes of length l to a flat surface, each probe at a distinct and known location. This set of probes is called the DNA array.

Apply a solution containing a fluorescently labeled DNA fragment to the array.

The DNA fragment hybridizes with those probes that are complementary to substrings of length l of the fragment.

How SBH Works (cont’d)

Using a spectroscopic detector, determine which probes hybridize to the DNA fragment to obtain the l-mer composition of the target DNA fragment.

Apply the combinatorial algorithm (below) to reconstruct the sequence of the target DNA fragment from the l-mer composition.

Hybridization on DNA Array

Spectrum

Spectrum(s,l) is the multiset of all l-mers that appear in s

Example: Spectrum(TATGGTGC, 3) = {TAT, ATG, TGG, GGT, GTG, TGC}
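A minimal sketch of the spectrum computation follows. The function name `spectrum` is an assumption, and the l-mers are returned as a list in order of appearance. A multiset has no order, so sort or count the list before comparing two spectra.

```python
def spectrum(s, l):
    """Return every l-mer of s, duplicates included, in order of appearance."""
    return [s[i:i + l] for i in range(len(s) - l + 1)]

print(spectrum("TATGGTGC", 3))
# ['TAT', 'ATG', 'TGG', 'GGT', 'GTG', 'TGC']
```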

The SBH Problem

Goal: Reconstruct a string from its l-mer composition

Input: A set S, representing all l-mers from an (unknown) string s

Output: String s such that Spectrum ( s,l ) = S

SBH As Hamiltonian Path

Vertices of the graph are every l-mer in the spectrum

A directed edge leads from vertex u to vertex v if the last l-1 characters of u equal the first l-1 characters of v.

S = { ATG TGG TGC GTG GGC GCA GCG CGT }

SBH: Hamiltonian Path Approach

S = { ATG TGG TGC GTG GGC GCA GCG CGT }

Path 1: ATGCGTGGCA

Path 2: ATGGCGTGCA
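The Hamiltonian path formulation can be illustrated by brute force on this small spectrum. The sketch below tries every ordering of the l-mers, so it is exponential and only suitable for toy inputs. It assumes the l-mers are distinct, and the function name is hypothetical.

```python
from itertools import permutations

def hamiltonian_reconstructions(lmers):
    """Return every string spelled by a Hamiltonian path through the l-mers.

    An edge u -> v exists when the last l-1 characters of u equal the
    first l-1 characters of v. Assumes the l-mers are distinct.
    """
    results = set()
    for order in permutations(lmers):
        if all(u[1:] == v[:-1] for u, v in zip(order, order[1:])):
            # Spell the string: the first l-mer plus one new character per step.
            results.add(order[0] + "".join(v[-1] for v in order[1:]))
    return sorted(results)

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(hamiltonian_reconstructions(S))
# ['ATGCGTGGCA', 'ATGGCGTGCA']
```

With eight l-mers this examines 8! = 40320 orderings, which shows why the approach does not scale.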

SBH As Eulerian Path

Vertices are (l-1)-mers

Edges represent the l-mers from the spectrum

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

[Figure: directed graph on the vertices AT, TG, GC, CA, GG, GT, CG]

SBH: Eulerian Path Approach

The graph on the vertices { AT, TG, GC, GG, GT, CA, CG } contains two different Eulerian paths, which spell:

ATGGCGTGCA

ATGCGTGGCA

[Figure: the two Eulerian paths drawn on the graph]
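The Eulerian formulation can be sketched by building the graph whose vertices are (l-1)-mers and whose edges are l-mers, then enumerating the Eulerian paths by backtracking. The example uses the eight 3-mers from the Hamiltonian path slides. The function names and the choice of "AT" as the start vertex are assumptions made for illustration.

```python
from collections import defaultdict

def de_bruijn(lmers):
    """Map each (l-1)-mer prefix to the (l-1)-mer suffixes it leads to."""
    graph = defaultdict(list)
    for lmer in lmers:
        graph[lmer[:-1]].append(lmer[1:])
    return dict(graph)

def eulerian_spellings(lmers, start):
    """Enumerate strings spelled by paths that use every l-mer edge once."""
    results = set()

    def walk(vertex, used, text):
        if len(used) == len(lmers):
            results.add(text)
            return
        for i, lmer in enumerate(lmers):
            # Follow any unused edge that leaves the current vertex.
            if i not in used and lmer[:-1] == vertex:
                walk(lmer[1:], used | {i}, text + lmer[-1])

    walk(start, frozenset(), start)
    return sorted(results)

S = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(de_bruijn(S))
print(eulerian_spellings(S, "AT"))
```

Backtracking is used here only to list every path. Finding a single Eulerian path does not require it.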

Euler Theorem

A graph is balanced if, for every vertex, the number of incoming edges equals the number of outgoing edges:

in(v) = out(v)

Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.
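The balance condition is easy to test on a list of directed edges. The sketch below checks only in(v) = out(v) and does not check connectivity, which the theorem also requires. The function name and the example edges are illustrative.

```python
from collections import Counter

def is_balanced(edges):
    """Return True if every vertex has equal in-degree and out-degree."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    # A Counter reports 0 for vertices it has never seen.
    return all(indeg[v] == outdeg[v] for v in set(indeg) | set(outdeg))

cycle_edges = [("a", "b"), ("b", "c"), ("c", "a")]
path_edges = [("AT", "TG"), ("TG", "GC"), ("GC", "CA")]
print(is_balanced(cycle_edges), is_balanced(path_edges))
```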

Constructing an Eulerian Cycle

1. Start with an arbitrary vertex and form an arbitrary cycle, following untraversed edges until you reach a dead end. Because the graph is Eulerian, this dead end will be the starting point.

2. If the resulting cycle is not Eulerian, select a vertex from the cycle with untraversed edges, and repeat step 1 starting there.

3. Combine the two cycles and repeat step 2.
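This procedure is essentially Hierholzer's algorithm. The sketch below is a common iterative variant, not necessarily the form intended on the slide: it uses a stack so that sub-cycles are spliced together implicitly rather than merged in a separate step. It assumes a directed graph that is connected and balanced. The function name and example graph are hypothetical.

```python
def eulerian_cycle(graph, start):
    """Return an Eulerian cycle as a list of vertices.

    graph: {vertex: [successor, ...]} for a connected, balanced digraph.
    """
    # Copy the successor lists so the caller's graph is left unchanged.
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            # Keep walking along an untraversed edge.
            stack.append(remaining[v].pop())
        else:
            # Dead end: this vertex is finished, so emit it.
            cycle.append(stack.pop())
    return cycle[::-1]

# Illustrative balanced graph: the cycles a-b-c-a and b-d-b share vertex b.
graph = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": ["b"]}
cycle = eulerian_cycle(graph, "a")
```

Each edge is pushed and popped once, which matches the linear running time claimed for the Eulerian cycle problem.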

Constructing an Eulerian Cycle

Some Difficulties with SBH

Fidelity of Hybridization: it is difficult to distinguish probes hybridized with perfect matches from those with 1 or 2 mismatches

Array Size: Effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology.

Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future

Practicality again: Although SBH is still impractical, it spearheaded expression analysis and SNP analysis techniques
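The exponential growth in array size can be made concrete with a quick calculation. It assumes the array holds one probe for every possible l-mer over the four-letter DNA alphabet, as described earlier. The function name is illustrative.

```python
def array_size(l):
    """Number of distinct probes of length l over the alphabet {A, C, G, T}."""
    return 4 ** l

for l in (8, 10, 12):
    print(l, array_size(l))
```

Each additional nucleotide in the probe length multiplies the number of probes by four.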

Summary

Introduction to Graphs

Eulerian Paths & Hamiltonian Cycles

Interval Graph & Shape of Genes

Sequencing DNA

Shortest Superstring Problem

Sequencing by Hybridization (SBH)

SBH as Hamiltonian Cycle Problem

SBH as Eulerian Path Problem

References

Jones, N. and Pevzner, P. An introduction to Bioinformatics Algorithms.

Some slides courtesy of the book's website at bix.ucsd.edu/bioalgorithms/slides.php