1 Bioinformatics Algorithms Lecture 2 Jeff Parker, 2009 A bacteriologist is a man whose...

1 Bioinformatics Algorithms Lecture 2 © Jeff Parker, 2009 A bacteriologist is a man whose conversation always starts with the germ of an idea.



1

Bioinformatics Algorithms
Lecture 2

© Jeff Parker, 2009

A bacteriologist is a man whose conversation always starts with the germ of an idea.


2

Outline
Review Website changes
    Any problems? BBoard?
Review your questions
Review the homework questions
Review Dynamic Programming and string match
New material
    Reading file formats in Python
    Open Reading Frames
    Turnpike Problem


3

Questions
What are your goals?
Will this course be offered next year?
What will the exams be like?
Can you go over the probability example again?
I didn't understand the Dynamic Programming example
Will this get me a job?


4

(short) Answers
What are your goals?
    See next slide
Will this course be offered next year?
    Yes
What will the exams be like?
    Solve this problem: suggest an algorithm
Can you go over the probability example again?
    Yes
I didn't understand the Dynamic Programming example
    I didn't explain it well, and combined two forms
    Let's try again tonight
Will this get me a job?
    Not right away


5

Course Goals
Introduce an interesting problem from Biology
Apply some Computer Science techniques
Our guide will be Jones and Pevzner
    I hope to cover all the chapters
    Introduce some topics that are not covered
    Add some problems in probability, which this book does not attempt to cover
My goal is to make the course accessible to students in the Biotechnology Program
    Less focus on programming, more on algorithms
    Programming projects will be smaller, more exploratory
Students will pick and hand in a final project on a topic of their choice
    Apply ideas discussed in class to a problem of their choice


6

Fibonacci numbers
1, 1, 2, 3, 5, 8, 13, …

    def fib(n):
        if (1 == n):
            return 1
        elif (2 == n):
            return 1
        else:
            return fib(n-1) + fib(n-2)

                fib(5)
          fib(4)      fib(3)
      fib(3)  fib(2)  fib(2)  fib(1)
  fib(2)  fib(1)


7

Ruminations

    def fib(n):
        if (1 == n):
            return 1
        elif (2 == n):
            return 1
        else:
            return fib(n-1) + fib(n-2)

Not well defined for zero or negative numbers (the recursion never reaches a base case)
How many calls to compute fib(10)?

                fib(5)
          fib(4)      fib(3)
      fib(3)  fib(2)  fib(2)  fib(1)
  fib(2)  fib(1)


8

Running time

    def fib(n):
        if (1 == n):
            return 1
        elif (2 == n):
            return 1
        else:
            return fib(n-1) + fib(n-2)

The function only ever returns the value 1 or a sum of two calls, so it must make one call for each symbol in the expanded sum
    fib(5) = 1 + 1 + 1 + 1 + 1
    fib(n) calls return a 1, and fib(n) - 1 calls perform a +
To compute fib(n) we need 2*fib(n) - 1 calls

                fib(5)
          fib(4)      fib(3)
      fib(3)  fib(2)  fib(2)  fib(1)
  fib(2)  fib(1)
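The call count can be checked directly. This is a small sketch, not part of the original slides: the helper name fib_calls is made up here, and it returns the number of calls alongside the value so the two can be compared.

```python
def fib_calls(n):
    """Return (fib(n), number of calls the naive recursion makes), for n >= 1."""
    if n == 1 or n == 2:
        return 1, 1                      # a leaf: value 1, one call
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    return a + b, calls_a + calls_b + 1  # +1 counts this call itself


value, calls = fib_calls(10)
print(value, calls)           # 55 109
print(calls == 2 * value - 1) # True
```

So fib(10) needs 109 calls, and the count grows as fast as the Fibonacci numbers themselves.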


9

Dynamic Programming

    def fib3(n):
        d = {}                       # table of values already computed
        for i in range(1, n+1):
            if (i < 3):
                d[i] = 1
            else:
                d[i] = d[i-1] + d[i-2]
        return d[n]

Running time?
Trade off?

1 1 2 3 5 8 13
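One way to look at the trade-off: the table buys linear time at the price of linear memory, but each entry only depends on the two before it. The sketch below (the name fib_iter is an assumption made here, not from the slides) keeps just those two values.

```python
def fib_iter(n):
    """Bottom-up Fibonacci for n >= 1, keeping only the last two values."""
    prev, curr = 1, 1                    # fib(1), fib(2)
    for _ in range(n - 2):               # no iterations when n is 1 or 2
        prev, curr = curr, prev + curr
    return curr


print([fib_iter(n) for n in range(1, 8)])   # [1, 1, 2, 3, 5, 8, 13]
```

The running time is still linear in n, with constant extra space. Keeping the full table is what matters later, when we need to trace back through it.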


10

Top Down DP
For some problems, we can assemble the information on the fly
We start with an empty dictionary, d

    def fib2(n, d):
        if (n in d):
            return d[n]
        elif ((1 == n) or (2 == n)):
            result = 1
        else:
            result = fib2(n-1, d) + fib2(n-2, d)
        d[n] = result
        return result

1 1 2 3 5 8 13


11

Top Down DP

    def fib2(n, d):
        print("fib2", n, d)
        if (n in d):
            return d[n]
        elif ((1 == n) or (2 == n)):
            result = 1
        else:
            result = fib2(n-1, d) + fib2(n-2, d)
        d[n] = result
        return result

Output of fib2(10, {}), with dictionary keys in insertion order:

    fib2 10 {}
    fib2 9 {}
    fib2 8 {}
    fib2 7 {}
    fib2 6 {}
    fib2 5 {}
    fib2 4 {}
    fib2 3 {}
    fib2 2 {}
    fib2 1 {2: 1}
    fib2 2 {2: 1, 1: 1, 3: 2}
    fib2 3 {2: 1, 1: 1, 3: 2, 4: 3}
    fib2 4 {2: 1, 1: 1, 3: 2, 4: 3, 5: 5}
    fib2 5 {2: 1, 1: 1, 3: 2, 4: 3, 5: 5, 6: 8}
    fib2 6 {2: 1, 1: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13}
    fib2 7 {2: 1, 1: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13, 8: 21}
    fib2 8 {2: 1, 1: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13, 8: 21, 9: 34}
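The same top-down idea is available in the Python standard library as functools.lru_cache, which keeps the dictionary for us. This is a sketch of an alternative, not the lecture's code; fib_memo is a name chosen here.

```python
from functools import lru_cache


@lru_cache(maxsize=None)      # unbounded cache: remember every result
def fib_memo(n):
    """Top-down Fibonacci for n >= 1; repeated arguments are served from the cache."""
    if n == 1 or n == 2:
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)


print(fib_memo(10))            # 55
print(fib_memo.cache_info())   # reports 10 misses (real work) and 7 hits (lookups)
```

The 10 misses and 7 hits correspond to the 17 lines of the trace: each value is computed once, and every later request is a lookup.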


12

Approximate Pattern Match
How can we define how far apart two sequences are?
We start with the Levenshtein distance or edit distance

    ATCGG-
    AT-GGA

    Smallest number of insertions, deletions, and substitutions required
    Charge 1 for each change
    Satisfies the properties of a metric, including the triangle inequality
        D(a, c) <= D(a, b) + D(b, c)
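A minimal sketch of the edit distance as a bottom-up table, charging 1 for each insertion, deletion, or substitution. The function name edit_distance and the test strings are choices made here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # d[i][j] = edit distance between the prefixes a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all i characters of a
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + substitution)
    return d[len(a)][len(b)]


print(edit_distance("ATGGA", "ATCGGA"))   # 1
print(edit_distance("ATCGG", "ATGGA"))    # 2

# Spot check of the triangle inequality on three sequences
a, b, c = "ATCGG", "ATGGA", "ATCGGA"
print(edit_distance(a, c) <= edit_distance(a, b) + edit_distance(b, c))   # True
```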


13

Recursive Solution
To find a match between ATGGA and ATCGGA
Try all possible actions on first characters, then compare the rest
    Match or replace first characters of each string
    Drop first char of text
    Drop first char of pattern
Try to match the remainder using recursion
At each step, at least one string is shorter.

    ATGGA          TGGA         TGGA          ATGGA
    ATCGGA   ->    TCGGA   or   ATCGGA   or   TCGGA
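The three choices translate almost word for word into a recursive function. This is a sketch that assumes a charge of 1 per change; the name edit_distance_rec is made up here. Like the naive fib, it recomputes the same subproblems many times, so it is only practical for short strings.

```python
def edit_distance_rec(pattern, text):
    """Edit distance by trying all three actions on the first characters."""
    if not pattern:
        return len(text)          # nothing left to match: drop the rest of text
    if not text:
        return len(pattern)       # drop the rest of pattern
    replace = 0 if pattern[0] == text[0] else 1
    return min(
        replace + edit_distance_rec(pattern[1:], text[1:]),  # match or replace
        1 + edit_distance_rec(pattern, text[1:]),            # drop first char of text
        1 + edit_distance_rec(pattern[1:], text),            # drop first char of pattern
    )


print(edit_distance_rec("ATGGA", "ATCGGA"))   # 1
```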


14

Dots
List the pattern and text as row and column headings.
Place a dot in each cell where row heading and column heading match.
We will use this idea in other ways…
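A small sketch of such a dot plot in text form. The function name dot_plot, the choice of pattern across the top and text down the side, and the use of * for a dot and . for an empty cell are all assumptions made for this example.

```python
def dot_plot(pattern, text):
    """Return a text grid with * wherever the row and column letters match."""
    lines = ["  " + " ".join(pattern)]               # column headings
    for t in text:
        cells = ["*" if t == p else "." for p in pattern]
        lines.append(t + " " + " ".join(cells))      # row heading, then cells
    return "\n".join(lines)


print(dot_plot("ATGGA", "ATCGGA"))
#   A T G G A
# A * . . . *
# T . * . . .
# C . . . . .
# G . . * * .
# G . . * * .
# A * . . . *
```

Runs of dots along a diagonal mark stretches where the two sequences agree.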


15

Needleman & Wunsch

        -            A            T            G            G            A
    -   -- 0         -A -1        -AT -2       -ATG -3      -ATGG -4     -ATGGA -5
    A   -A -1
    T   -AT -2
    C   -ATC -3
    G   -ATCG -4
    G   -ATCGG -5
    A   -ATCGGA -6

Global Match
Match all of two strings
Notice the labeling
Rules:
    Match +1
    Mismatch -1
    Insert/Delete -1


16

Needleman & Wunsch

        -            A            T            G            G            A
    -   -- 0         -A -1        -AT -2       -ATG -3      -ATGG -4     -ATGGA -5
    A   -A -1        AA 1         ATA 0        ATGA -1      ATGGA -2     ATGGAA -3
    T   -AT -2       AAT 0
    C   -ATC -3      AATC -1
    G   -ATCG -4     AATCG -2
    G   -ATCGG -5    AATCGG -3
    A   -ATCGGA -6   AATCGGA -4


17

Needleman & Wunsch

        -            A            T            G            G            A
    -   -- 0         A- -1        AT- -2       ATG- -3      ATGG- -4     ATGGA- -5
    A   -A -1        AA 1         ATA 0        ATGA -1      ATGGA -2     ATGGAA -3
    T   -AT -2       AAT 0        ATAT 2       ATGAT 1      ATGGAT 0     ATGGAAT -1
    C   -ATC -3      AATC -1      ATATC 1
    G   -ATCG -4     AATCG -2     ATATCG 0
    G   -ATCGG -5    AATCGG -3    ATATCGG -1
    A   -ATCGGA -6   AATCGGA -4   ATATCGGA -2


18

How we build the table
Consider filling in the blank spot in pink
We have three choices
    Build on pair above, deleting char C
        ATG_
        AT_C
        Cost: 1 - 1 = 0
    Build on pair on left, inserting char G
        AT_G
        ATC_
        Cost: 1 - 1 = 0
    Match or replace, using pair from upper left
        ATG
        ATC
        Cost: 2 - 1 = 1
We only display the winner

            T             G
    T   AT/AT 2       ATG/AT_ 1
    C   AT_/ATC 1     ATG/ATC 1


19

Key Idea
To compute the best match ending at location [i,j] we compute the three values below, pick the maximal value (with these scores, higher is better), and store it in d[i][j]
The costs for match, non-match may be varied to match the problem

    insertCost = d[i-1][j] - 1
    deleteCost = d[i][j-1] - 1
    if pattern[i] == text[j]:
        matchCost = d[i-1][j-1] + 1     # match
    else:
        matchCost = d[i-1][j-1] - 1     # revise (replace)
    d[i][j] = max(insertCost, deleteCost, matchCost)

            T             G
    T   AT/AT 2       ATG/AT_ 1
    C   AT_/ATC 1     ATG/ATC 1
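A minimal sketch of the full table fill, assuming the rules from the Needleman & Wunsch slide (match +1, mismatch -1, insert/delete -1). The name nw_table and the parameter names are choices made here; the layout follows the slides' tables, with the text down the rows and the pattern across the columns.

```python
def nw_table(pattern, text, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment table; d[i][j] scores text[:i] against pattern[:j]."""
    rows, cols = len(text) + 1, len(pattern) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j * gap                 # first row: pattern prefix against nothing
    for i in range(rows):
        d[i][0] = i * gap                 # first column: text prefix against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            same = text[i - 1] == pattern[j - 1]
            diagonal = d[i - 1][j - 1] + (match if same else mismatch)
            above = d[i - 1][j] + gap     # text character against a gap
            left = d[i][j - 1] + gap      # pattern character against a gap
            d[i][j] = max(diagonal, above, left)
    return d


for row in nw_table("ATGGA", "ATCGGA"):
    print(" ".join("%3d" % value for value in row))
```

The lower right cell holds the score of the best global match, 4 for these two strings.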


20

DP Approximate Pattern Match
(cells not yet filled in are shown as .)

          -   A   T   G   G   A
      -   0  -1  -2  -3  -4  -5
      A  -1   1   0  -1  -2  -3
      T  -2   0   2   1   0  -1
      C  -3  -1   1   .   .   .
      G  -4  -2   0   .   .   .
      G  -5  -3  -1   .   .   .
      A  -6  -4  -2   .   .   .


21

Needleman & Wunsch
(cells not yet filled in are shown as .)

          -   A   T   G   G   A
      -   0  -1  -2  -3  -4  -5
      A  -1   1   0  -1  -2  -3
      T  -2   0   2   1   0  -1
      C  -3  -1   1   1   0  -1
      G  -4  -2   0   2   .   .
      G  -5  -3  -1   1   .   .
      A  -6  -4  -2   0   .   .


22

Needleman & Wunsch

          -   A   T   G   G   A
      -   0  -1  -2  -3  -4  -5
      A  -1   1   0  -1  -2  -3
      T  -2   0   2   1   0  -1
      C  -3  -1   1   1   0  -1
      G  -4  -2   0   2   2   1
      G  -5  -3  -1   1   3   2
      A  -6  -4  -2   0   2   4


23

Trace back from lower right

          -   A   T   G   G   A
      -   0  -1  -2  -3  -4  -5
      A  -1   1   0  -1  -2  -3
      T  -2   0   2   1   0  -1
      C  -3  -1   1   1   0  -1
      G  -4  -2   0   2   2   1
      G  -5  -3  -1   1   3   2
      A  -6  -4  -2   0   2   4
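The trace back can be sketched as follows: starting at the lower right, step to whichever neighbor (upper left, above, or left) accounts for the current score, and record the column of the alignment that step represents. The function name align and the order in which ties are tried are assumptions of this sketch; the scoring is the +1 / -1 / -1 scheme used in the table.

```python
def align(pattern, text, match=1, mismatch=-1, gap=-1):
    """Return (aligned pattern, aligned text, score) for a best global alignment."""
    rows, cols = len(text) + 1, len(pattern) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j * gap
    for i in range(rows):
        d[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if text[i - 1] == pattern[j - 1] else mismatch
            d[i][j] = max(d[i - 1][j - 1] + s, d[i - 1][j] + gap, d[i][j - 1] + gap)

    # Trace back from the lower right corner to the upper left corner.
    i, j = rows - 1, cols - 1
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if text[i - 1] == pattern[j - 1] else mismatch
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + s:
            top.append(pattern[j - 1])       # match or replace
            bottom.append(text[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap:
            top.append("-")                  # text character against a gap
            bottom.append(text[i - 1])
            i -= 1
        else:
            top.append(pattern[j - 1])       # pattern character against a gap
            bottom.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), d[-1][-1]


aligned_pattern, aligned_text, score = align("ATGGA", "ATCGGA")
print(aligned_pattern)   # AT-GGA
print(aligned_text)      # ATCGGA
print(score)             # 4
```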


24

Other Pricing Schemes
We may decide that alternative pricing models are better
One common assumption is that the first deletion is rare (expensive) but it is much cheaper to continue to delete
Another model suggests that a Frame Shift (delete by non-multiple of 3) is more expensive

    ATC AT- --T GGT GTT

Can we deal with such functions?
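To get a feel for the first model, here is a sketch that scores an already aligned pair of strings, charging more to open a gap than to extend one. It does not find the best alignment under this pricing; it only evaluates a given one. The function name affine_score and the particular costs (-3 to open, -1 to extend) are illustrative assumptions, and a gap that switches from one string to the other is treated as a continuation for simplicity.

```python
def affine_score(top, bottom, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two aligned strings of equal length; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open   # first gap position costs more
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(affine_score("AT--GA", "ATCGGA"))   # one gap of length 2: 4 - 3 - 1 = 0
print(affine_score("A-T-GA", "ACTGGA"))   # two gaps of length 1: 4 - 3 - 3 = -2
```

Under this pricing one long gap beats two short ones, even though both alignments contain the same number of gap characters.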


25

Python File Handling
Reading a file

    """Frequency – count frequency of letters in sequence"""
    fileName = input("Enter file name: ")
    f = open(fileName, 'r')
    text = f.read()
    text = text.replace("\n", "")    # strip the newlines

    # print("Saw ", text)

    symbolCounts = {}
    # Go over each letter in the sequence
    for x in range(len(text)):


26

Python File Handling, part deux
Reading a large file in one operation may be a bad idea

    """Frequency – count frequency of letters in sequence"""
    fileName = input("Enter file name: ")
    f = open(fileName, 'r')
    line = f.readline()
    while (len(line) > 0):           # readline returns "" at end of file
        line = line.replace("\n", "")
        process(line)
        line = f.readline()


27

Processing
Remove \n and convert to Upper Case

    def countLetters(text, counts):
        """Count the letters in the sequence"""
        text = text.replace("\n", "")
        text = text.upper()
        for x in range(len(text)):
            ch = text[x]
            # Increment count
            if (ch in counts):
                counts[ch] = counts[ch] + 1
            else:
                counts[ch] = 1
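Putting the pieces together, here is a sketch that streams a file line by line and counts letters with collections.Counter from the standard library. The function name count_letters_in_file is made up here, and skipping lines that start with ">" assumes a FASTA-style header, which the slides do not specify. The demo writes a small temporary file so the example runs on its own.

```python
import os
import tempfile
from collections import Counter


def count_letters_in_file(path):
    """Count letters in a sequence file without reading it all at once."""
    counts = Counter()
    with open(path, "r") as f:            # the file is closed automatically
        for line in f:                    # one line in memory at a time
            line = line.strip()
            if line.startswith(">"):      # assumption: FASTA-style header line
                continue
            counts.update(line.upper())
    return counts


# Demo on a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">demo sequence\nATGga\nTTc\n")
    demo_path = tmp.name
demo_counts = count_letters_in_file(demo_path)
os.remove(demo_path)
print(sorted(demo_counts.items()))   # [('A', 2), ('C', 1), ('G', 2), ('T', 3)]
```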


28

Problems
Chameleons (2.20)
How long will this thing go on? (Probabilities and the stop codon)
Finding an Open Reading Frame (ORF)
    http://www.genome.gov/25020001
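For the stop codon question, a back-of-the-envelope sketch: it assumes the standard genetic code (3 stop codons out of 64) and, less realistically, that every codon is equally likely and independent of its neighbors. The names below are made up for this example.

```python
STOP_FRACTION = 3 / 64      # TAA, TAG, TGA out of 64 possible codons


def prob_no_stop(n_codons):
    """Probability that n random codons in a row contain no stop codon."""
    return (1 - STOP_FRACTION) ** n_codons


# Reading codons until the first stop is a geometric process,
# so the expected number of codons read (including the stop) is 1/p.
expected_codons = 1 / STOP_FRACTION
print(round(expected_codons, 2))        # 21.33

for n in (10, 50, 100):
    print(n, prob_no_stop(n))
```

Under these assumptions a stretch of 100 codons with no stop happens by chance less than 1% of the time, which is why long open reading frames are worth a second look.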


29

Open Reading Frame
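A rough sketch of an ORF finder, assuming the usual simplification that an ORF runs from an ATG start codon to the first in-frame stop codon. It scans only the three forward reading frames (not the reverse strand) and ignores introns. The name find_orfs and the min_codons parameter are illustrative choices, not from the lecture.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches in the 3 forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None                                  # index of the open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:    # codons before the stop
                    orfs.append((start, i + 3))       # end index includes the stop
                start = None
    return orfs


sequence = "CCATGAAATGACC"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])   # 2 11 ATGAAATGA
```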


30

Frame Shifts


31

Open Reading Frame


32

Introns and Exons


33

Alternative Splicing


34

Problem
Hands around the world
    Everyone in the world joins hands to form a circle round the globe
    Everyone remembers who was on their right hand and left hand
    Given a flat file with triples (RightId, Id, LeftId) recover the sequence
    There are 6 billion triples, so you will want to be efficient
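One possible approach, sketched under the assumption that the triples fit in memory (which 6 billion may well not): build a lookup table from each Id to its LeftId in one pass, then walk around the circle. That is linear in the number of triples, compared with quadratic if we searched the file for each neighbor. The function name recover_circle and the sample data are made up for this example.

```python
def recover_circle(triples):
    """Recover the order around the circle from (RightId, Id, LeftId) triples."""
    left_of = {}
    for right_id, person_id, left_id in triples:
        left_of[person_id] = left_id          # one dictionary entry per person
    start = triples[0][1]                     # any person will do as a starting point
    order = [start]
    current = left_of[start]
    while current != start:                   # stop once we are back at the start
        order.append(current)
        current = left_of[current]
    return order


# Four people in a circle A, B, C, D, listed in scrambled order
triples = [("B", "C", "D"), ("D", "A", "B"), ("A", "B", "C"), ("C", "D", "A")]
print(recover_circle(triples))   # ['C', 'D', 'A', 'B']
```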


35

Next Week
Partial Digest Problem


36

Turnpike distances

Exit      Description                   From NYS  From Exit 25
1         West Stockbridge, Route 41         2.9          47.4
2         Lee, US 20/Route 102              10.6          55.1
3         Westfield, Route 10/US 202        40.4          84.9
4         West Springfield, I-91/US 5       45.7          90.2
5         Chicopee, Route 33                49.0          93.5
6         Springfield, I-291                51.3          95.8
7         Ludlow, Route 21                  54.9          99.4
8         Palmer, Route 32                  62.8         107.3
9         Sturbridge, I-84                  78.5         123.0
10        Auburn, I-290/I-395               90.2         134.7
10A       Millbury, Route 146/US 20         94.1         138.6
11        Millbury, Route 122               96.5         141.0
11A       Hopkinton, I-495                 106.2         150.7
12        Framingham, Route 9              111.4         155.9
13        Framingham/Natick                116.8         161.3
14/15     Weston, I-95/Route 128           123.3         167.8
16        West Newton, Route 16            125.2         169.7
17        Newton, Washington/Galen         127.7         172.2
18/20     Allston/Brighton                 130.9         175.4
21        Back Bay, Mass Ave               132.9         177.4
22        Copley Square, MA 9              133.4         177.9
23        Theater District                 133.9         178.4
24A-B-C   South Station                    134.6         179.1
25        South Boston, Local streets      135.3         179.8
26        Airport, Logan Airport           137.3         181.8
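The easy direction of the Turnpike problem is going from exit positions to the list of all pairwise distances; the hard direction, recovering the positions from the distances, is the Partial Digest Problem. A sketch of the easy direction, using the first four "From NYS" values from the table, written in tenths so the arithmetic stays in whole numbers. The function name pairwise_distances is an assumption of this example.

```python
from itertools import combinations


def pairwise_distances(positions):
    """Return the sorted list of distances between every pair of positions."""
    ordered = sorted(positions)
    return sorted(b - a for a, b in combinations(ordered, 2))


# Exits 1 to 4, "From NYS" column, in tenths (2.9, 10.6, 40.4, 45.7)
exits_tenths = [29, 106, 404, 457]
print(pairwise_distances(exits_tenths))   # [53, 77, 298, 351, 375, 428]
```

With n exits there are n(n-1)/2 distances, so the 25 rows of the table would give 300 of them.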


37

Summary
There is a world of interesting problems in Biology
There is great interest in finding solutions
    Computer Science can help
Crucial to keep in touch with Biologists about solutions
    Not all simplifications are equally valid
    Not all matches are meaningful
Many Biologists use the new tools in their research
There is a need for those who understand the algorithms the tools use