Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer...

405
Introduction to Bioinformatics Esa Pitkänen [email protected] Autumn 2008, I period www.cs.helsinki.fi/mbi/courses/08-09/itb 582606 Introduction to Bioinformatics, Autumn 2008

Transcript of Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer...

Page 1: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

Introduction toBioinformatics

Esa Pitkä[email protected]

Autumn 2008, I periodwww.cs.helsinki.fi/mbi/courses/08-09/itb

582606 Introduction to Bioinformatics, Autumn 2008

Page 2: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

Introduction toBioinformatics

Lecture 1:Administrative issues

MBI Programme, Bioinformatics coursesWhat is bioinformatics?Molecular biology primer

Page 3: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

3

How to enrol for the course?p Use the registration system of the Computer

Science department: https://ilmo.cs.helsinki.fin You need your user account at the IT department (“cc

account”)

p If you cannot register yet, don’t worry: attend thelectures and exercises; just register when you areable to do so

Page 4: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

4

Teachersp Esa Pitkänen, Department of Computer Science,

University of Helsinkip Elja Arjas, Department of Mathematics and

Statistics, University of Helsinkip Sami Kaski, Department of Information and

Computer Science, Helsinki University ofTechnology

p Lauri Eronen, Department of Computer Science,University of Helsinki (exercises)

Page 5: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

5

Lectures and exercisesp Lectures: Tuesday and Friday 14.15-16.00

Exactum C221

p Exercises: Tuesday 16.15-18.00 ExactumC221n First exercise session on Tue 9 September

Page 6: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

6

Status & Prerequisitesp Advanced level course at the Department

of Computer Science, U. Helsinkip 4 creditsp Prerequisites:n Basic mathematics skills (probability calculus,

basic statistics)n Familiarity with computersn Basic programming skills recommendedn No biology background required

Page 7: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

7

Course contentsp What is bioinformatics?p Molecular biology primerp Biological wordsp Sequence assemblyp Sequence alignmentp Fast sequence alignment using FASTA and BLASTp Genome rearrangementsp Motif finding (tentative)p Phylogenetic treesp Gene expression analysis

Page 8: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

8

How to pass the course?p Recommended method:n Attend the lectures (not obligatory though)n Do the exercisesn Take the course exam

p Or:n Take a separate exam

Page 9: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

9

How to pass the course?p Exercises give you max. 12 points

n 0% completed assignments gives you 0 points,80% gives 12 points, the rest by linearinterpolation

n “A completed assignment” means thatp You are willing to present your solution in the

exercise session andp You return notes by e-mail to Lauri Eronen (see

course web page for contact info) describing the mainphases you took to solve the assignment

n Return notes at latest on Tuesdays 16.15

p Course exam gives you max. 48 points

Page 10: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

10

How to pass the course?p Grading: on the scale 0-5

n To get the lowest passing grade 1, you need to get atleast 30 points out of 60 maximum

p Course exam: Wed 15 October 16.00-19.00Exactum A111

p See course web page for separate examsp Note: if you take the first separate exam, the

best of the following options will be considered:n Exam gives you 48 points, exercises 12 pointsn Exam gives you 60 points

p In second and subsequent separate exams, onlythe 60 point option is in use

Page 11: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

11

Literaturep Deonier, Tavaré,

Waterman: ComputationalGenome Analysis, anIntroduction. Springer,2005

p Jones, Pevzner: AnIntroduction toBioinformatics Algorithms.MIT Press, 2004

p Slides for some lectureswill be available on thecourse web page

Page 12: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

12

Additional literaturep Gusfield: Algorithms on

strings, trees andsequences

p Griffiths et al: Introductionto genetic analysis

p Alberts et al.: Molecularbiology of the cell

p Lodish et al.: Molecular cellbiology

p Check the course web site

Page 13: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

13

Questions about administrative &practical stuff?

Page 14: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

14

Master's Degree Programme inBioinformatics (MBI)p Two-year MSc programmep Admission for 2009-2010 in January 2009

n You need to have your Bachelor’s degree ready byAugust 2009

www.cs.helsinki.fi/mbi

Page 15: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

15

MBI programme organizers

Department of Computer Science,Department of Mathematics and StatisticsFaculty of Science, Kumpula Campus, HY

Laboratory of Computer andInformation Science, Laboratory of

CS and Engineering,TKK

Faculty of Medicine, Meilahti Campus, HY

Faculty of BiosciencesFaculty of Agriculture and Forestry

Viikki Campus, HY

Page 16: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

16

TKK, Otaniemi

HY, MeilahtiHY, Kumpula

HY, Viikki

Four MBI campuses

Page 17: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

17

MBI highlightsp You can take courses from both HY and

TKKp Two biology courses tailored specifically

for MBIp Bioinformatics is a new exciting field, with

a high demand for experts in job market

p Go to www.cs.helsinki.fi/mbi/careers tofind out what a bioinformatician could dofor living

Page 18: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

18

Admissionp Admission requirements

n Bachelor’s degree in a suitable field (e.g., computerscience, mathematics, statistics, biology or medicine)

n At least 60 ECTS credits in total in computer science,mathematics and statistics

n Proficiency in English (standardized language test:TOEFL, IELTS)

p Admission period opens in late Autumn 2009 andcloses in 2 February 2009

p Details on admission will be posted inwww.cs.helsinki.fi/mbi during this autumn

Page 19: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

19

Bioinformatics courses in Helsinkiregion: 1st periodp Computational genomics (4-7 credits, TKK)p Seminar: Neuroinformatics (3 credits, Kumpula)p Seminar: Machine Learning in Bioinformatics (3

credits, Kumpula)p Signal processing in neuroinformatics (5 credits,

TKK)

Page 20: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

20

A good biology course for computerscientists and mathematicians?p Biology for methodological scientists (8 credits, Meilahti)

n Course organized by the Faculties of Bioscience and Medicinefor the MBI programme

n Introduction to basic concepts of microarrays, medical geneticsand developmental biology

n Study group + book exam in I period (2 cr)n Three lectured modules, 2 cr eachn Each module has an individual registration so you can

participate even if you missed the first modulen www.cs.helsinki.fi/mbi/courses/08-09/bfms/

Page 21: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

21

Bioinformatics courses in Helsinkiregion: 2nd periodp Bayesian paradigm in genetic bioinformatics (6

credits, Kumpula)p Biological Sequence Analysis (6 credits, Kumpula)p Modeling of biological networks (5-7 credits, TKK)p Statistical methods in genetics (6-8 credits,

Kumpula)

Page 22: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

22

Bioinformatics courses in Helsinkiregion: 3rd periodp Evolution and the theory of games (5 credits, Kumpula)p Genome-wide association mapping (6-8 credits, Kumpula)p High-Throughput Bioinformatics (5-7 credits, TKK)p Image Analysis in Neuroinformatics (5 credits, TKK)p Practical Course in Biodatabases (4-5 credits, Kumpula)p Seminar: Computational systems biology (3 credits,

Kumpula)p Spatial models in ecology and evolution (8 credits,

Kumpula)p Special course in bioinformatics I (3-7 credits, TKK)

Page 23: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

23

Bioinformatics courses in Helsinki region:4th periodp Metabolic Modeling (4 credits, Kumpula)p Phylogenetic data analyses (6-8 credits,

Kumpula)

Page 24: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

24

1. What is bioinformatics?

Page 25: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

25

What is bioinformatics?p Bioinformatics, n. The science of information

and information flow in biological systems,esp. of the use of computational methods ingenetics and genomics. (Oxford EnglishDictionary)

p "The mathematical, statistical and computingmethods that aim to solve biological problemsusing DNA and amino acid sequences andrelated information." -- Fredj Tekaia

Page 26: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

26

What is bioinformatics?p "I do not think all biological computing is

bioinformatics, e.g. mathematical modelling isnot bioinformatics, even when connected withbiology-related problems. In my opinion,bioinformatics has to do with management andthe subsequent use of biological information,particular genetic information."-- Richard Durbin

Page 27: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

27

What is not bioinformatics?p Biologically-inspired computation, e.g., genetic algorithms

and neural networksp However, application of neural networks to solve some

biological problem, could be called bioinformaticsp What about DNA computing?

http://www.wisdom.weizmann.ac.il/~lbn/new_pages/Visual_Presentation.html

Page 28: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

28

Computational biologyp Application of computing to biology (broad

definition)p Often used interchangeably with bioinformaticsp Or: Biology that is done with computational

means

Page 29: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

29

Biometry & biophysicsp Biometry: the statistical analysis of biological

datan Sometimes also the field of identification of individuals

using biological traits (a more recent definition)

p Biophysics: "an interdisciplinary field whichapplies techniques from the physical sciencesto understanding biological structure andfunction" -- British Biophysical Society

Page 30: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

30

Mathematical biologyp Mathematical biology

“tackles biologicalproblems, but the methodsit uses to tackle them neednot be numerical and neednot be implemented insoftware or hardware.”-- Damian Counsell

Alan Turing

Page 31: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

31

Turing on biological complexityp “It must be admitted that the biological examples which

it has been possible to give in the present paper are verylimited.

This can be ascribed quite simply to the fact thatbiological phenomena are usually very complicated.Taking this in combination with the relatively elementarymathematics used in this paper one could hardly expect tofind that many observed biological phenomena would becovered.

It is thought, however, that the imaginary biologicalsystems which have been treated, and the principles whichhave been discussed, should be of some help ininterpreting real biological forms.”

– Alan Turing, The Chemical Basis of Morphogenesis, 1952

Page 32: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

32

Related conceptsp Systems biology

n “Biology of networks”n Integrating different levels

of information tounderstand how biologicalsystems work

p Computational systems biology

Overview of metabolic pathways inKEGG database, www.genome.jp/kegg/

Page 33: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

33

Why is bioinformatics important?p New measurement techniques produce

huge quantities of biological datan Advanced data analysis methods are needed to

make sense of the datan Typical data sources produce noisy data with a

lot of missing values

p Paradigm shift in biology to utilisebioinformatics in research

Page 34: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

34

Bioinformatician’s skill setp Statistics, data analysis methodsn Lots of datan High noise levels, missing valuesn #attributes >> #data points

p Programming languagesn Scripting languages: Python, Perl, Ruby, …n Extensive use of text file formats: need

parsersn Integration of both data and tools

p Data structures, databases

Page 35: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

35

Bioinformatician’s skill setp Modellingn Discrete vs continuous domainsn -> Systems biology

p Scientific computation packagesn R, Matlab/Octave, …

p Communication skills!

Page 36: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

36

Communication skills: case 1Biologist presents a problemto computer scientists /mathematicians

?

”I am interested in finding what affects theregulation gene x during condition y and how

that relates to the organism’s phenotype.”

”Define input and output of the problem.”

Page 37: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

37

Communication skills: case 2Bioinformatician is a partof a group that consistsmostly of biologists.

Page 38: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

38

Communication skills: case 2

...biologist/bioinformatician ratio is important!

Page 39: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

39

Communication skills: case 3A group ofbioinformaticiansoffers their services tomore than one group

Page 40: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

40

Bioinformatician’s skill setp How much biology you should know?

Page 41: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

41

Computer Science• Programming• Databases• Algorithmics

Mathematics andstatistics• Calculus• Probability calculus• Linear algebra

Biology & Medicine• Basics in molecular andcell biology• Measurement techniques

Bioinformatics• Biological sequence analysis• Biological databases• Analysis of gene expression• Modeling protein structure andfunction• Gene, protein and metabolicnetworks• …

Bioinformatician’s skill set

Prof. Juho Rousu, 2006

Where would you be in this triangle?

Page 42: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

42

A problem involving bioinformatics?

- ”I found a fruit fly that is immune to all diseases!”

- ”It was one of these”

Pertti Jarla, http://www.hs.fi/fingerpori/

Page 43: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

43

Molecular biology primer

Molecular Biology Primer by Angela Brooks, Raymond Brown,Calvin Chen, Mike Daly, Hoa Dinh, Erinn Hama, Robert Hinman,Julio Ng, Michael Sneddon, Hoa Troung, Jerry Wang, Che FungYungEdited for Introduction to Bioinformatics (Autumn 2007, Summer2008, Autumn 2008) by Esa Pitkänen

Page 44: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

44

Molecular biology primerp Part 1: What is life made of?p Part 2: Where does the variation in

genomes come from?

Page 45: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

45

Life begins with Cell

p A cell is a smallest structural unit of anorganism that is capable of independentfunctioning

p All cells have some common features

Page 46: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

46

Cellsp Fundamental working units of every living system.p Every organism is composed of one of two radically different types of

cells:n prokaryotic cells orn eukaryotic cells.

p Prokaryotes and Eukaryotes are descended from the sameprimitive cell.n All prokaryotic and eukaryotic cells are the result of a total of 3.5

billion years of evolution.

Page 47: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

47

Two types of cells: Prokaryotes andEukaryotes

Page 48: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

48

Prokaryotes and Eukaryotesp According to the most

recent evidence, thereare three mainbranches to the tree oflife

p Prokaryotes includeArchaea (“ancientones”) and bacteria

p Eukaryotes arekingdom Eukarya andincludes plants,animals, fungi andcertain algae Lecture: Phylogenetic trees

Page 49: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

49

All Cells have common Cycles

p Born, eat, replicate, and die

Page 50: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

50

Common features of organismsp Chemical energy is stored in ATPp Genetic information is encoded by DNAp Information is transcribed into RNAp There is a common triplet genetic codep Translation into proteins involves ribosomesp Shared metabolic pathwaysp Similar proteins among diverse groups of

organisms

Page 51: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

51

All Life depends on 3 critical moleculesp DNAs (Deoxyribonucleic acid)

n Hold information on how cell works

p RNAs (Ribonucleic acid)n Act to transfer short pieces of information to different

parts of celln Provide templates to synthesize into protein

p Proteinsn Form enzymes that send signals to other cells and

regulate gene activityn Form body’s major components (e.g. hair, skin, etc.)n “Workhorses” of the cell

Page 52: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

52

DNA: The Code of Life

p The structure and the four genomic letters code for all livingorganisms

p Adenine, Guanine, Thymine, and Cytosine which pair A-T and C-Gon complimentary strands.

Lecture: Genome sequencingand assembly

Page 53: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

53

Discovery of the structure of DNAp 1952-1953 James D. Watson and Francis H. C. Crick

deduced the double helical structure of DNA from X-raydiffraction images by Rosalind Franklin and data on amountsof nucleotides in DNA

James Watson andFrancis Crick

RosalindFranklin

”Photo 51”

Page 54: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

54

DNA, continuedp DNA has a double helix

structure which iscomposed ofn sugar moleculen phosphate groupn and a base (A,C,G,T)

p By convention, we readDNA strings in direction oftranscription: from 5’ endto 3’ end5’ ATTTAGGCC 3’3’ TAAATCCGG 5’

Page 55: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

55

DNA is contained in chromosomes

http://en.wikipedia.org/wiki/Image:Chromatin_Structures.png

p In eukaryotes, DNA is packed into chromatidsn In metaphase, the “X” structure consists of two identical

chromatids

p In prokaryotes, DNA is usually contained in a single,circular chromosome

Page 56: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

56

Human chromosomesp Somatic cells in humans

have 2 pairs of 22chromosomes + XX(female) or XY (male) =total of 46 chromosomes

p Germline cells have 22chromosomes + either X orY = total of 23chromosomes

Karyogram of human male using Giemsa staining(http://en.wikipedia.org/wiki/Karyotype)

Page 57: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

57

Length of DNA and number of chromosomesOrganism #base pairs #chromosomes (germline)

ProkayoticEscherichia coli (bacterium) 4x106 1

EukaryoticSaccharomyces cerevisia (yeast) 1.35x107 17Drosophila melanogaster (insect) 1.65x108 4Homo sapiens (human) 2.9x109 23Zea mays (corn / maize) 5.0x109 10

Page 58: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

58

1 atgagccaag ttccgaacaa ggattcgcgg ggaggataga tcagcgcccg agaggggtga61 gtcggtaaag agcattggaa cgtcggagat acaactccca agaaggaaaa aagagaaagc

121 aagaagcgga tgaatttccc cataacgcca gtgaaactct aggaagggga aagagggaag181 gtggaagaga aggaggcggg cctcccgatc cgaggggccc ggcggccaag tttggaggac241 actccggccc gaagggttga gagtacccca gagggaggaa gccacacgga gtagaacaga301 gaaatcacct ccagaggacc ccttcagcga acagagagcg catcgcgaga gggagtagac361 catagcgata ggaggggatg ctaggagttg ggggagaccg aagcgaggag gaaagcaaag421 agagcagcgg ggctagcagg tgggtgttcc gccccccgag aggggacgag tgaggcttat481 cccggggaac tcgacttatc gtccccacat agcagactcc cggaccccct ttcaaagtga541 ccgagggggg tgactttgaa cattggggac cagtggagcc atgggatgct cctcccgatt601 ccgcccaagc tccttccccc caagggtcgc ccaggaatgg cgggacccca ctctgcaggg661 tccgcgttcc atcctttctt acctgatggc cggcatggtc ccagcctcct cgctggcgcc721 ggctgggcaa cattccgagg ggaccgtccc ctcggtaatg gcgaatggga cccacaaatc781 tctctagctt cccagagaga agcgagagaa aagtggctct cccttagcca tccgagtgga841 cgtgcgtcct ccttcggatg cccaggtcgg accgcgagga ggtggagatg ccatgccgac901 ccgaagagga aagaaggacg cgagacgcaa acctgcgagt ggaaacccgc tttattcact961 ggggtcgaca actctgggga gaggagggag ggtcggctgg gaagagtata tcctatggga

1021 atccctggct tccccttatg tccagtccct ccccggtccg agtaaagggg gactccggga1081 ctccttgcat gctggggacg aagccgcccc cgggcgctcc cctcgttcca ccttcgaggg1141 ggttcacacc cccaacctgc gggccggcta ttcttctttc ccttctctcg tcttcctcgg1201 tcaacctcct aagttcctct tcctcctcct tgctgaggtt ctttcccccc gccgatagct1261 gctttctctt gttctcgagg gccttccttc gtcggtgatc ctgcctctcc ttgtcggtga1321 atcctcccct ggaaggcctc ttcctaggtc cggagtctac ttccatctgg tccgttcggg1381 ccctcttcgc cgggggagcc ccctctccat ccttatcttt ctttccgaga attcctttga1441 tgtttcccag ccagggatgt tcatcctcaa gtttcttgat tttcttctta accttccgga1501 ggtctctctc gagttcctct aacttctttc ttccgctcac ccactgctcg agaacctctt1561 ctctcccccc gcggtttttc cttccttcgg gccggctcat cttcgactag aggcgacggt1621 cctcagtact cttactcttt tctgtaaaga ggagactgct ggccctgtcg cccaagttcg1681 ag

Hepatitis delta virus, complete genome

Page 59: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

59

RNAp RNA is similar to DNA chemically. It is usually only a

single strand. T(hyamine) is replaced by U(racil)p Several types of RNA exist for different functions in

the cell.

http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.giftRNA linear and 3D view:

Page 60: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

60

DNA, RNA, and the Flow ofInformation

TranslationTranscription

Replication ”The central dogma”

Is this true?

Denis Noble: The principles of Systems Biology illustrated using the virtual hearthttp://velblod.videolectures.net/2007/pascal/eccs07_dresden/noble_denis/eccs07_noble_psb_01.ppt

Page 61: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

61

Proteinsp Proteins are polypeptides

(strings of amino acidresidues)

p Represented using stringsof letters from an alphabetof 20: AEGLV…WKKLAG

p Typical length 50…1000residues

Urease enzyme from Helicobacter pylori

Page 62: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

62

Amino acids

http://upload.wikimedia.org/wikipedia/commons/c/c5/Amino_acids_2.png

Page 63: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

63

How DNA/RNA codes for protein?p DNA alphabet contains four

letters but must specifyprotein, or polypeptidesequence of 20 letters.

p Dinucleotides are notenough: 42 = 16 possibledinucleotides

p Trinucleotides (triplets)allow 43 = 64 possibletrinucleotides

p Triplets are also calledcodons

Page 64: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

64

How DNA/RNA codes for protein?p Three of the possible

triplets specify ”stoptranslation”

p Translation usually startsat triplet AUG (this codesfor methionine)

p Most amino acids may bespecified by more thantriplet

p How to find a gene? Lookfor start and stop codons(not that easy though)

Page 65: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

65

Proteins: Workhorses of the Cell

p 20 different amino acidsn different chemical properties cause the protein chains to fold

up into specific three-dimensional structures that define theirparticular functions in the cell.

p Proteins do all essential work for the celln build cellular structuresn digest nutrientsn execute metabolic functionsn mediate information flow within a cell and among cellular

communities.p Proteins work together with other proteins or nucleic acids

as "molecular machines"n structures that fit together and function in highly specific, lock-

and-key ways.

Lecture 8: Proteomics

Page 66: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

66

Genesp “A gene is a union of genomic sequences encoding a

coherent set of potentially overlapping functional products”--Gerstein et al.

p A DNA segment whose information is expressed either asan RNA molecule or protein

5’ 3’

3’ 5’

… a t g a g t g g a …

… t a c t c a c c t …

augagugga ...

(transcription)

(translation)MSG …

(folding)

http://fold.it

Page 67: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

67

FoldIt: Protein folding game

http://fold.it

Page 68: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

68

Genes & allelesp A gene can have different variantsp The variants of the same gene are called

alleles

5’

3’

… a t g a g t g g a …

… t a c t c a c c t …

augagugga ...

MSG …

5’

3’

… a t g a g t c g a …

… t a c t c a g c t …

augagucga ...

MSR …

Page 69: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

69

Genes can be found on both strands

3’

5’

5’

3’

Page 70: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

70

Exons and introns & splicing

3’

5’

5’

3’

Introns are removed from RNA after transcription

Exons

Exons are joined:

This process is called splicing

Page 71: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

71

Alternative splicing

A 3’

5’

5’

3’

B C

Different splice variants may be generated

A B C

B C

A C

Page 72: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

72

Where does the variation in genomes comefrom?p Prokaryotes are typically

haploid: they have a single(circular) chromosome

p DNA is usually inheritedvertically (parent todaughter)

p Inheritance is clonaln Descendants are faithful

copies of an ancestral DNAn Variation is introduced via

mutations, transposableelements, and horizontaltransfer of DNA

Chromosome map of S. dysenteriae, the nine ringsdescribe different properties of the genomehttp://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm

Page 73: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

73

Causes of variationp Mistakes in DNA replicationp Environmental agents (radiation, chemical

agents)p Transposable elements (transposons)

n A part of DNA is moved or copied to another location ingenome

p Horizontal transfer of DNAn Organism obtains genetic material from another

organism that is not its parentn Utilized in genetic engineering

Page 74: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

74

Biological string manipulationp Point mutation: substitution of a base

n …ACGGCT… => …ACGCCT…

p Deletion: removal of one or more contiguousbases (substring)n …TTGATCA… => …TTTCA…

p Insertion: insertion of a substringn …GGCTAG… => …GGTCAACTAG…

Lecture: Sequence alignmentLecture: Genome rearrangements

Page 75: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

75

Meiosisp Sexual organisms are usually

diploidn Germline cells (gametes)

contain N chromosomesn Somatic (body) cells have 2N

chromosomesp Meiosis: reduction of

chromosome number from2N to N during reproductivecyclen One chromosome doubling is

followed by two cell divisions

Major events in meiosis

http://en.wikipedia.org/wiki/Meiosis

http://www.ncbi.nlm.nih.gov/About/Primer

Page 76: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

76

Recombination and variationp Recap: Allele is a viable DNA

coding occupying a given locus(position in the genome)

p In recombination, alleles fromparents become suffled inoffspring individuals viachromosomal crossover over

p Allele combinations inoffspring are usually differentfrom combinations found inparents

p Recombination errors lead intoadditional variations

Chromosomal crossover as described byT. H. Morgan in 1916

Page 77: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

77

Mitosis

http://en.wikipedia.org/wiki/Image:Major_events_in_mitosis.svg

p Mitosis: growth and development of the organismn One chromosome doubling is followed by one cell

division

Page 78: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

78

Recombination frequency and linked genesp Genetic marker: some DNA sequence of interest

(e.g., gene or a part of a gene)

p Recombination is more likely to separate twodistant markers than two close ones

p Linked markers: ”tend” to be inherited together

p Marker distances measured in centimorgans: 1centimorgan corresponds to 1% chance that twomarkers are separated in recombination

Page 79: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

79

Biological databasesp Exponential growth of

biological datan New measurement

techniquesn Before we are able to use

the data, we need to storeit efficiently -> biologicaldatabases

n Published data issubmitted to databases

p General vs specialiseddatabases

p This topic is discussedextensively in Practicalcourse in biodatabases (IIIperiod)

Page 80: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

80

10 most important biodatabases… accordingto ”Bioinformatics for dummies”

p GenBank/DDJB/EMBL www.ncbi.nlm.nih.gov Nucleotide sequencesp Ensembl www.ensembl.org Human/mouse genomep PubMed www.ncbi.nlm.nih.gov Literature referencesp NR www.ncbi.nlm.nih.gov Protein sequencesp UniProt www.expasy.org Protein sequencesp InterPro www.ebi.ac.uk Protein domainsp OMIM www.ncbi.nlm.nih.gov Genetic diseasesp Enzymes www.expasy.org Enzymesp PDB www.rcsb.org/pdb/ Protein structuresp KEGG www.genome.ad.jp Metabolic pathways

Sophia Kossida, Introduction to Bioinformatics, Summer 2008

Page 81: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

81

FASTA formatp A simple format for DNA and protein sequence

data is FASTA

>Hepatitis delta virus, complete genomeatgagccaagttccgaacaaggattcgcggggaggatagatcagcgcccgagaggggtgagtcggtaaagagcattggaacgtcggagatacaactcccaagaaggaaaaaagagaaagcaagaagcggatgaatttccccataacgccagtgaaactctaggaaggggaaagagggaaggtggaagagaaggaggcgggcctcccgatccgaggggcccggcggccaagtttggaggacactccggcccgaagggttgagagtaccccagagggaggaagccacacggagtagaacagagaaatcacctccagaggaccccttcagcgaacagagagcgcatcgcgagagggagtagaccatagcgataggaggggatgctaggagttgggggagaccgaagcgaggaggaaagcaaagagagcagcggggctagcaggtgggtgttccgccccccgagaggggacgagtgaggcttatcccggggaactcgacttatcgtccccacatagcagactcccggaccccctttcaaagtga…

Header line,begins with >

Page 82: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

Introduction toBioinformatics

Biological words

Page 83: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

83

Recapp DNA codes information with alphabet of 4

letters: A, C, G, Tp In proteins, the alphabet size is 20p DNA -> RNA -> Protein (genetic code)n Three DNA bases (triplet, codon) code for one

amino acidn Redundancy, start and stop codons

Page 84: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

84

1 atgagccaag ttccgaacaa ggattcgcgg ggaggataga tcagcgcccg agaggggtga61 gtcggtaaag agcattggaa cgtcggagat acaactccca agaaggaaaa aagagaaagc

121 aagaagcgga tgaatttccc cataacgcca gtgaaactct aggaagggga aagagggaag181 gtggaagaga aggaggcggg cctcccgatc cgaggggccc ggcggccaag tttggaggac241 actccggccc gaagggttga gagtacccca gagggaggaa gccacacgga gtagaacaga301 gaaatcacct ccagaggacc ccttcagcga acagagagcg catcgcgaga gggagtagac361 catagcgata ggaggggatg ctaggagttg ggggagaccg aagcgaggag gaaagcaaag421 agagcagcgg ggctagcagg tgggtgttcc gccccccgag aggggacgag tgaggcttat481 cccggggaac tcgacttatc gtccccacat agcagactcc cggaccccct ttcaaagtga541 ccgagggggg tgactttgaa cattggggac cagtggagcc atgggatgct cctcccgatt601 ccgcccaagc tccttccccc caagggtcgc ccaggaatgg cgggacccca ctctgcaggg661 tccgcgttcc atcctttctt acctgatggc cggcatggtc ccagcctcct cgctggcgcc721 ggctgggcaa cattccgagg ggaccgtccc ctcggtaatg gcgaatggga cccacaaatc781 tctctagctt cccagagaga agcgagagaa aagtggctct cccttagcca tccgagtgga841 cgtgcgtcct ccttcggatg cccaggtcgg accgcgagga ggtggagatg ccatgccgac901 ccgaagagga aagaaggacg cgagacgcaa acctgcgagt ggaaacccgc tttattcact961 ggggtcgaca actctgggga gaggagggag ggtcggctgg gaagagtata tcctatggga

1021 atccctggct tccccttatg tccagtccct ccccggtccg agtaaagggg gactccggga1081 ctccttgcat gctggggacg aagccgcccc cgggcgctcc cctcgttcca ccttcgaggg1141 ggttcacacc cccaacctgc gggccggcta ttcttctttc ccttctctcg tcttcctcgg1201 tcaacctcct aagttcctct tcctcctcct tgctgaggtt ctttcccccc gccgatagct1261 gctttctctt gttctcgagg gccttccttc gtcggtgatc ctgcctctcc ttgtcggtga1321 atcctcccct ggaaggcctc ttcctaggtc cggagtctac ttccatctgg tccgttcggg1381 ccctcttcgc cgggggagcc ccctctccat ccttatcttt ctttccgaga attcctttga1441 tgtttcccag ccagggatgt tcatcctcaa gtttcttgat tttcttctta accttccgga1501 ggtctctctc gagttcctct aacttctttc ttccgctcac ccactgctcg agaacctctt1561 ctctcccccc gcggtttttc cttccttcgg gccggctcat cttcgactag aggcgacggt1621 cctcagtact cttactcttt tctgtaaaga ggagactgct ggccctgtcg cccaagttcg1681 ag

Given a DNA sequence, we might ask a number of questions

What sort of statistics should be used to describe the sequence?

What sort of organism did this sequence come from?

Does the description of this sequence differ fromthe description of other DNA in the organism?

What sort of sequence is this? What does it do?

Page 85: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

85

Biological wordsp We can try to answer questions like these

by considering the words in a sequencep A k-word (or a k-tuple) is a string of length

k drawn from some alphabetp A DNA k-word is a string of length k that

consists of letters A, C, G, Tn 1-words: individual nucleotides (bases)n 2-words: dinucleotides (AA, AC, AG, AT, CA, ...)n 3-words: codons (AAA, AAC, …)n 4-words and beyond

Page 86: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

86

1-words: base compositionp Typically DNA exists as duplex molecule

(two complementary strands)

5’-GGATCGAAGCTAAGGGCT-3’3’-CCTAGCTTCGATTCCCGA-5’

Top strand: 7 G, 3 C, 5 A, 3 TBottom strand: 3 G, 7 C, 3 A, 5 TDuplex molecule: 10 G, 10 C, 8 A, 8 TBase frequencies: 10/36 10/36 8/36 8/36

fr(G + C) = 20/36, fr(A + T) = 1 – fr(G + C) = 16/36

These are somethingwe can determineexperimentally.

Page 87: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

87

G+C contentp fr(G + C), or G+C content is a simple

statistics for describing genomesp Notice that one value is enough

characterise fr(A), fr(C), fr(G) and fr(T)for duplex DNA

p Is G+C content (= base composition) ableto tell the difference between genomes ofdifferent organisms?n Simple computational experiment, if we have

the genome sequences under study(-> exercises)

Page 88: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

88

G+C content and genome sizes (inmegabasepairs, Mb) for various organismsp Mycoplasma genitalium 31.6% 0.585p Escherichia coli K-12 50.7% 4.693p Pseudomonas aeruginosa PAO1 66.4% 6.264p Pyrococcus abyssi 44.6% 1.765p Thermoplasma volcanium 39.9% 1.585p Caenorhabditis elegans 36% 97p Arabidopsis thaliana 35% 125p Homo sapiens 41% 3080

Page 89: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

89

Base frequencies in duplex moleculesp Consider a DNA sequence generated randomly,

with probability of each letter being independentof position in sequence

p You could expect to find a uniform distribution ofbases in genomes…

p This is not, however, the case in genomes,especially in prokaryotesn This phenomena is called GC skew

5’-...GGATCGAAGCTAAGGGCT...-3’3’-...CCTAGCTTCGATTCCCGA...-5’

Page 90: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

90

DNA replication forkp When DNA is replicated,

the molecule takes thereplication fork form

p New complementaryDNA is synthesised atboth strands of the”fork”

p New strand in 5’-3’direction correspondingto replication forkmovement is calledleading strand and theother lagging strand

Leading strand

Lagging strand

Replication fork

Replication fork movement

Page 91: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

91

DNA replication forkp This process has

specific startingpoints in genome(origins ofreplication)

p Observation:Leading strandshave an excess of Gover C

p This can bedescribed by GCskew statistics Lagging strand

Replication fork

Leading strand

Replication fork movement

Page 92: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

92

GC skewp GC skew is defined as (#G - #C) / (#G +

#C)p It is calculated at successive positions in

intervals (windows) of specific width

5’-...GGATCGAAGCTAAGGGCT...-3’3’-...CCTAGCTTCGATTCCCGA...-5’

(3 – 2) / (3 + 2) = 1/5

(4 – 2) / (4 + 2) = 1/3

Page 93: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

93

p G-C content &GC skewstatistics can bedisplayed with acircular genomemap

Chromosome map of S. dysenteriae, the nine ringsdescribe different properties of the genomehttp://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm

G-C content & GC skew

G+C content

GC skew

(10kb window size)

Page 94: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

94

GC skewp GC skew

oftenchanges signat originsand terminiof replication

G+C content

GC skew

(10kb window size)

Nie et al., BMC Genomics, 2006

Page 95: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

95

2-words: dinucleotidesp Let’s consider a sequence L1,L2,...,Ln

where each letter Li is drawn from theDNA alphabet {A, C, G, T}

p We have 16 possible dinucleotides lili+1:AA, AC, AG, ..., TG, TT.

Page 96: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

96

i.i.d. model for nucleotidesp Assume that basesn occur independently of each othern bases at each position are identically

distributed

p Probability of the base A, C, G, T occuringis pA, pC, pG, pT, respectivelyn For example, we could use pA=pC=pG=pT=0.25

or estimate the values from known genomedata

p Probability of lili+1 is then PliPli+1n For example, P(TG) = pT pG

Page 97: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

97

2-words: is what we see surprising?p We can test whether a sequence is ”unexpected”,

for example, with a 2 test

p Test statistic for a particular dinucleotide r1r2 is2 = (O – E)2 / E wheren O is the observed number of dinucleotide r1r2

n E is the expected number of dinucleotide r1r2

n E = (n – 1)pr1pr2 under i.i.d. model

p Basic idea: high values of 2 indicate deviationfrom the modeln Actual procedure is more detailed -> basic statistics

courses

Page 98: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

98

Refining the i.i.d. modelp i.i.d. model describes some organisms well

(see Deonier’s book) but fails tocharacterise many others

p We can refine the model by having theDNA letter at some position depend onletters at preceding positions

…TCGTGACGCCG ?Sequence context toconsider

Page 99: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

99

First-order Markov chains

p Lets assume that in sequence X the letter atposition t, Xt, depends only on the previous letterXt-1 (first-order markov chain)

p Probability of letter j occuring at position t givenXt-1 = i: pij = P(Xt = j | Xt-1 = i)

p We consider homogeneous markov chains:probability pij is independent of position t

…TCGTGACGCCG ?

Xt

Xt-1

Page 100: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

100

Estimating pij

p We can estimate probabilities pij (”the probabilitythat j follows i”) from observed dinucleotidefrequencies

A C G TA pAA pAC pAG pATC pCA pCC pCG pCTG pGA pGC pGG pGTT pTA pTC pTG pTT

Frequencyof dinucleotide ATin sequence

…the values pAA, pAC, ..., pTG, pTT sum to 1

+ + + Base frequencyfr(C)

Page 101: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

101

Estimating pij

p pij = P(Xt = j | Xt-1 = i) = P(Xt = j, Xt-1 = i)P(Xt-1 = i)

Probability of transition i -> j

Dinucleotide frequency

Base frequency of nucleotide i,fr(i)

A C G TA 0.146 0.052 0.058 0.089

C 0.063 0.029 0.010 0.056

G 0.050 0.030 0.028 0.051

T 0.086 0.047 0.063 0.140

P(Xt = j, Xt-1 = i)

A C G TA 0.423 0.151 0.168 0.258

C 0.399 0.184 0.063 0.354

G 0.314 0.189 0.176 0.321

T 0.258 0.138 0.187 0.415

P(Xt = j | Xt-1 = i)

0.052 / 0.345 0.151

Page 102: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

102

Simulating a DNA sequencep From a transition matrix, it is easy to generate a

DNA sequence of length n:n First, choose the starting base randomly according to

the base frequency distributionn Then, choose next base according to the distribution

P(xt | xt-1) until n bases have been chosen

A C G TA 0.423 0.151 0.168 0.258

C 0.399 0.184 0.063 0.354

G 0.314 0.189 0.176 0.321

T 0.258 0.138 0.187 0.415

P(Xt = j | Xt-1 = i)

T T C T T C AA

Look for R code in Deonier’sbook

Page 103: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

103

Simulating a DNA sequence

ttcttcaaaataaggatagtgattcttattggcttaagggataacaatttagatcttttttcatgaatcatgtatgtcaacgttaaaagttgaactgcaataagttcttacacacgattgtttatctgcgtgcgaagcatttcactacatttgccgatgcagccaaaagtatttaacatttggtaaacaaattgacttaaatcgcgcacttagagtttgacgtttcatagttgatgcgtgtctaacaattacttttagttttttaaatgcgtttgtctacaatcattaatcagctctggaaaaacattaatgcatttaaaccacaatggataattagttacttattttaaaattcacaaagtaattattcgaatagtgccctaagagagtactggggttaatggcaaagaaaattactgtagtgaagattaagcctgttattatcacctgggtactctggtgaatgcacataagcaaatgctacttcagtgtcaaagcaaaaaaatttactgataggactaaaaaccctttatttttagaatttgtaaaaatgtgacctcttgcttataacatcatatttattgggtcgttctaggacactgtgattgccttctaactcttatttagcaaaaaattgtcatagctttgaggtcagacaaacaagtgaatggaagacagaaaaagctcagcctagaattagcatgttttgagtggggaattacttggttaactaaagtgttcatgactgttcagcatatgattgttggtgagcactacaaagatagaagagttaaactaggtagtggtgatttcgctaacacagttttcatacaagttctattttctcaatggttttggataagaaaacagcaaacaaatttagtattattttcctagtaaaaagcaaacatcaaggagaaattggaagctgcttgttcagtttgcattaaattaaaaatttatttgaagtattcgagcaatgttgacagtctgcgttcttcaaataagcagcaaatcccctcaaaattgggcaaaaacctaccctggcttctttttaaaaaaccaagaaaagtcctatataagcaacaaatttcaaaccttttgttaaaaattctgctgctgaataaataggcattacagcaatgcaattaggtgcaaaaaaggccatcctctttctttttttgtacaattgttcaagcaactttgaatttgcagattttaacccactgtctatatgggacttcgaattaaattgactggtctgcatcacaaatttcaactgcccaatgtaatcatattctagagtattaaaaatacaaaaagtacaattagttatgcccattggcctggcaatttatttactccactttccacgttttggggatattttaacttgaatagttcacaatcaaaacataggaaggatctactgctaaaagcaaaagcgtattggaatgataaaaaactttgatgtttaaaaaactacaaccttaatgaattaaagttgaaaaaatattcaaaaaaagaaattcagttcttggcgagtaatatttttgatgtttgagatcagggttacaaaataagtgcatgagattaactcttcaaatataaactgatttaagtgtatttgctaataacattttcgaaaaggaatattatggtaagaattcataaaaatgtttaatactgatacaactttcttttatatcctccatttggccagaatactgttgcacacaactaattggaaaaaaaatagaacgggtcaatctcagtgggaggagaagaaaaaagttggtgcaggaaatagtttctactaacctggtataaaaacatcaagtaacattcaaattgcaaatgaaaactaaccgatctaagcattgattgatttttctcatgcctttcgcctagttttaataaacgcgccccaactctcatcttcggttcaaatgatctattgtatttatgcactaacgtgcttttatgttagcatttttcaccctgaagttccgagtcattggcgtcactcacaaatgacattacaatttttctatgttttgttctgttgagtcaaagtgcatgcctacaattctttcttatatagaactagacaaaatagaaaaaggcacttttggagtctgaatgtcccttagtttcaaaaaggaaattgttgaattttttgtggttagttaaattttgaacaaactagtatagtggtgacaaacgatcaccttgagtcggtgactataaaagaaaaaggagattaaaaatacctgcggtgccacattttttgttacgggcatttaaggtttgcatgtgttgagcaattgaaacctacaactcaataagtcatgttaagtcacttctttgaaaaaaaaaaagaccctttaagcaagctc

p Now we can quickly generate sequences ofarbitrary length...

Page 104: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

104

Simulating a DNA sequence

aa 0.145 0.146ac 0.050 0.052ag 0.055 0.058at 0.092 0.089ca 0.065 0.063cc 0.028 0.029cg 0.011 0.010ct 0.058 0.056ga 0.048 0.050gc 0.032 0.030gg 0.029 0.028gt 0.050 0.051ta 0.084 0.086tc 0.052 0.047tg 0.064 0.063tt 0.138 0.0140

Dinucleotide frequenciesSimulated Observed

n = 10000

Page 105: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

105

Simulating a DNA sequence

ttcttcaaaataaggatagtgattcttattggcttaagggataacaatttagatcttttttcatgaatcatgtatgtcaacgttaaaagttgaactgcaataagttcttacacacgattgtttatctgcgtgcgaagcatttcactacatttgccgatgcagccaaaagtatttaacatttggtaaacaaattgacttaaatcgcgcacttagagtttgacgtttcatagttgatgcgtgtctaacaattacttttagttttttaaatgcgtttgtctacaatcattaatcagctctggaaaaacattaatgcatttaaaccacaatggataattagttacttattttaaaattcacaaagtaattattcgaatagtgccctaagagagtactggggttaatggcaaagaaaattactgtagtgaagattaagcctgttattatcacctgggtactctggtgaatgcacataagcaaatgctacttcagtgtcaaagcaaaaaaatttactgataggactaaaaaccctttatttttagaatttgtaaaaatgtgacctcttgcttataacatcatatttattgggtcgttctaggacactgtgattgccttctaactcttatttagcaaaaaattgtcatagctttgaggtcagacaaacaagtgaatggaagacagaaaaagctcagcctagaattagcatgttttgagtggggaattacttggttaactaaagtgttcatgactgttcagcatatgattgttggtgagcactacaaagatagaagagttaaactaggtagtggtgatttcgctaacacagttttcatacaagttctattttctcaatggttttggataagaaaacagcaaacaaatttagtattattttcctagtaaaaagcaaacatcaaggagaaattggaagctgcttgttcagtttgcattaaattaaaaatttatttgaagtattcgagcaatgttgacagtctgcgttcttcaaataagcagcaaatcccctcaaaattgggcaaaaacctaccctggcttctttttaaaaaaccaagaaaagtcctatataagcaacaaatttcaaaccttttgttaaaaattctgctgctgaataaataggcattacagcaatgcaattaggtgcaaaaaaggccatcctctttctttttttgtacaattgttcaagcaactttgaatttgcagattttaacccactgtctatatgggacttcgaattaaattgactggtctgcatcacaaatttcaactgcccaatgtaatcatattctagagtattaaaaatacaaaaagtacaattagttatgcccattggcctggcaatttatttactccactttccacgttttggggatattttaacttgaatagttcacaatcaaaacataggaaggatctactgctaaaagcaaaagcgtattggaatgataaaaaactttgatgtttaaaaaactacaaccttaatgaattaaagttgaaaaaatattcaaaaaaagaaattcagttcttggcgagtaatatttttgatgtttgagatcagggttacaaaataagtgcatgagattaactcttcaaatataaactgatttaagtgtatttgctaataacattttcgaaaaggaatattatggtaagaattcataaaaatgtttaatactgatacaactttcttttatatcctccatttggccagaatactgttgcacacaactaattggaaaaaaaatagaacgggtcaatctcagtgggaggagaagaaaaaagttggtgcaggaaatagtttctactaacctggtataaaaacatcaagtaacattcaaattgcaaatgaaaactaaccgatctaagcattgattgatttttctcatgcctttcgcctagttttaataaacgcgccccaactctcatcttcggttcaaatgatctattgtatttatgcactaacgtgcttttatgttagcatttttcaccctgaagttccgagtcattggcgtcactcacaaatgacattacaatttttctatgttttgttctgttgagtcaaagtgcatgcctacaattctttcttatatagaactagacaaaatagaaaaaggcacttttggagtctgaatgtcccttagtttcaaaaaggaaattgttgaattttttgtggttagttaaattttgaacaaactagtatagtggtgacaaacgatcaccttgagtcggtgactataaaagaaaaaggagattaaaaatacctgcggtgccacattttttgttacgggcatttaaggtttgcatgtgttgagcaattgaaacctacaactcaataagtcatgttaagtcacttctttgaaaaaaaaaaagaccctttaagcaagctc

p The model is able to generate correct proportionsof 1- and 2-words in genomes...

p ...but fails with k=3 and beyond.

Page 106: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

106

3-words: codonsp We can extend the previous method to 3-

wordsp k=3 is an important case in study of DNA

sequences because of genetic code

5’ 3’

3’ 5’

… a t g a g t g g a …

… t a c t c a c c t …

a u g a g u g g a ...

M S G …

Page 107: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

107

3-word probabilitiesp Let’s again assume a sequence L of

independent basesp Probability of 3-word r1r2r3 at position

i,i+1,i+2 in sequence L isP(Li = r1, Li+1 = r2, Li+2 = r3) =

P(Li = r1)P(Li+1 = r2)P(Li+2 = r3)

Page 108: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

108

3-words in Escherichia coli genome

AAA 108924 0.02348 0.01492AAC 82582 0.01780 0.01541AAG 63369 0.01366 0.01537AAT 82995 0.01789 0.01490ACA 58637 0.01264 0.01541ACC 74897 0.01614 0.01591ACG 73263 0.01579 0.01588ACT 49865 0.01075 0.01539AGA 56621 0.01220 0.01537AGC 80860 0.01743 0.01588AGG 50624 0.01091 0.01584AGT 49772 0.01073 0.01536ATA 63697 0.01373 0.01490ATC 86486 0.01864 0.01539ATG 76238 0.01643 0.01536ATT 83398 0.01797 0.01489

CAA 76614 0.01651 0.01541CAC 66751 0.01439 0.01591CAG 104799 0.02259 0.01588CAT 76985 0.01659 0.01539CCA 86436 0.01863 0.01591CCC 47775 0.01030 0.01643CCG 87036 0.01876 0.01640CCT 50426 0.01087 0.01589CGA 70938 0.01529 0.01588CGC 115695 0.02494 0.01640CGG 86877 0.01872 0.01636CGT 73160 0.01577 0.01586CTA 26764 0.00577 0.01539CTC 42733 0.00921 0.01589CTG 102909 0.02218 0.01586CTT 63655 0.01372 0.01537

Word Count Observed Expected Word Count Observed Expected

Page 109: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

109

3-words in Escherichia coli genome

GAA 83494 0.01800 0.01537GAC 54737 0.01180 0.01588GAG 42465 0.00915 0.01584GAT 86551 0.01865 0.01536GCA 96028 0.02070 0.01588GCC 92973 0.02004 0.01640GCG 114632 0.02471 0.01636GCT 80298 0.01731 0.01586GGA 56197 0.01211 0.01584GGC 92144 0.01986 0.01636GGG 47495 0.01024 0.01632GGT 74301 0.01601 0.01582GTA 52672 0.01135 0.01536GTC 54221 0.01169 0.01586GTG 66117 0.01425 0.01582GTT 82598 0.01780 0.01534

TAA 68838 0.01484 0.01490TAC 52592 0.01134 0.01539TAG 27243 0.00587 0.01536TAT 63288 0.01364 0.01489TCA 84048 0.01812 0.01539TCC 56028 0.01208 0.01589TCG 71739 0.01546 0.01586TCT 55472 0.01196 0.01537TGA 83491 0.01800 0.01536TGC 95232 0.02053 0.01586TGG 85141 0.01835 0.01582TGT 58375 0.01258 0.01534TTA 68828 0.01483 0.01489TTC 83848 0.01807 0.01537TTG 76975 0.01659 0.01534TTT 109831 0.02367 0.01487

Word Count Observed Expected Word Count Observed Expected

Page 110: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

110

2nd order Markov Chainsp Markov chains readily generalise to higher ordersp In 2nd order markov chain, position t depends on

positions t-1 and t-2p Transition matrix:

p Probabilistic models for DNA and amino acidsequences will be discussed in Biologicalsequence analysis course (II period)

A C G TAAACAGATCA...

Page 111: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

111

Codon Adaptation Index (CAI)p Observation: cells prefer certain codons

over others in highly expressed genesn Gene expression: DNA is transcribed into RNA

(and possibly translated into protein)

Phe TTT 0.493 0.551 0.291TTC 0.507 0.449 0.709

Ala GCT 0.246 0.145 0.275GCC 0.254 0.276 0.164GCA 0.246 0.196 0.240GCG 0.254 0.382 0.323

Asn AAT 0.493 0.409 0.172AAC 0.507 0.591 0.828

Aminoacid Codon Predicted Gene class I Gene class II

Highlyexpressed

Moderatelyexpressed

Codon frequencies for some genes in E. coli

Page 112: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

112

Codon Adaptation Index (CAI)p CAI is a statistic used to compare the

distribution of codons observed with thepreferred codons for highly expressedgenes

Page 113: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

113

Codon Adaptation Index (CAI)p Consider an amino acid sequence X = x1x2...xn

p Let pk be the probability that codon k is used inhighly expressed genes

p Let qk be the highest probability that a codoncoding for the same amino acid as codon k hasn For example, if codon k is ”GCC”, the

corresponding amino acid is Alanine (see geneticcode table; also GCT, GCA, GCG code for Alanine)

n Assume that pGCC = 0.164, pGCT = 0.275, pGCA =0.240, pGCG = 0.323

n Now qGCC = qGCT = qGCA = qGCG = 0.323

Page 114: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

114

Codon Adaptation Index (CAI)p CAI is defined as

p CAI can be given also in log-odds form:

log(CAI) = (1/n) log(pk / qk)

CAI = ( pk / qk )k=1

n 1/n

k=1

n

Page 115: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

115

CAI: example with an E. coli geneM A L T K A E M S E Y L …ATG GCG CTT ACA AAA GCT GAA ATG TCA GAA TAT CTG1.00 0.47 0.02 0.45 0.80 0.47 0.79 1.00 0.43 0.79 0.19 0.02

0.06 0.02 0.47 0.20 0.06 0.21 0.32 0.21 0.81 0.020.28 0.04 0.04 0.28 0.03 0.040.20 0.03 0.05 0.20 0.01 0.03

0.01 0.04 0.010.89 0.18 0.89

ATG GCT TTA ACT AAA GCT GAA ATG TCT GAA TAT TTAGCC TTG ACC AAG GCC GAG TCC GAG TAC TTGGCA CTT ACA GCA TCA CTTGCG CTC ACG GCG TCG CTC

CTA AGT CTACTG AGC CTG

1.00 0.20 0.04 0.04 0.80 0.47 0.79 1.00 0.03 0.79 0.19 0.89…1.00 0.47 0.89 0.47 0.80 0.47 0.79 1.00 0.43 0.79 0.81 0.89

1/n

qkpk

Page 116: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

116

CAI: propertiesp CAI = 1.0 : each codon was the most frequently

used codon in highly expressed genesp Log-odds used to avoid numerical problems

n What happens if you multiply many values <1.0together?

p In a sample of E.coli genes, CAI ranged from 0.2to 0.85

p CAI correlates with mRNA levels: can be used topredict high expression levels

Page 117: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

117

Biological words: summaryp Simple 1-, 2- and 3-word models can

describe interesting properties of DNAsequencesn GC skew can identify DNA replication originsn It can also reveal genome rearrangement

events and lateral transfer of DNAn GC content can be used to locate genes:

human genes are comparably GC-richn CAI predicts high gene expression levels

Page 118: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

118

Biological words: summaryp k=3 models can help to identify correct

reading framesn Reading frame starts from a start codon and

stops in a stop codonn Consider what happens to translation when a

single extra base is introduced in a readingframe

p Also word models for k > 3 have theiruses

Page 119: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

119

Next lecturep Genome sequencing & assembly – where

do we get sequence data?

Page 120: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

120

Note on programming languagesp Working with probability distributions is

straightforward with R, for examplen Deonier’s book contains many computational

examplesn You can use R in CS Linux systems

p Python works too!

Page 121: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

121

#!/usr/bin/env python

import sys, random

n = int(sys.argv[1])

tm = {'a' : {'a' : 0.423, 'c' : 0.151, 'g' : 0.168, 't' : 0.258},'c' : {'a' : 0.399, 'c' : 0.184, 'g' : 0.063, 't' : 0.354},'g' : {'a' : 0.314, 'c' : 0.189, 'g' : 0.176, 't' : 0.321},'t' : {'a' : 0.258, 'c' : 0.138, 'g' : 0.187, 't' : 0.415}}

pi = {'a' : 0.345, 'c' : 0.158, 'g' : 0.159, 't' : 0.337}

def choose(dist):r = random.random()sum = 0.0keys = dist.keys()for k in keys:

sum += dist[k]if sum > r:

return kreturn keys[-1]

c = choose(pi)for i in range(n - 1):

sys.stdout.write(c)c = choose(tm[c])

sys.stdout.write(c)sys.stdout.write("\n")

Example Python code for generatingDNA sequences with first-orderMarkov chains.

Function choose(), returns a key (here ’a’, ’c’, ’g’ or’t’) of the dictionary ’dist’ chosen randomlyaccording to probabilities in dictionary values.

Choose the first letter, then choosenext letter according to P(xt | xt-1).

Transition matrixtm and initialdistribution pi.

Initialisation: use packages ’sys’ and ’random’,read sequence length from input.

Page 122: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

Introduction toBioinformatics

Genome sequencing & assembly

Page 123: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

123

Genome sequencing & assemblyp DNA sequencing

n How do we obtain DNA sequence information fromorganisms?

p Genome assemblyn What is needed to put together DNA sequence

information from sequencing?

p First statement of sequence assembly problem(according to G. Myers):n Peltola, Söderlund, Tarhio, Ukkonen: Algorithms for

some string matching problems arising in moleculargenetics. Proc. 9th IFIP World Computer Congress, 1983

Page 124: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

124

?

Recovery of shredded newspaper

Page 125: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

125

DNA sequencingp DNA sequencing: resolving a nucleotide

sequence (whole-genome or less)p Many different methods developedn Maxam-Gilbert method (1977)n Sanger method (1977)n High-throughput methods

Page 126: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

126

Sanger sequencing: sequencing bysynthesisp A sequencing technique developed by Fred

Sangerp Also called dideoxy sequencing

Page 127: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

127http://en.wikipedia.org/wiki/DNA_polymerase

DNA polymerasep A DNA polymerase is an

enzyme that catalyzesDNA synthesis

p DNA polymerase needsa primern Synthesis proceeds

always in 5’->3’ direction

Page 128: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

128

Dideoxy sequencingp In Sanger sequencing, chain-terminating

dideoxynucleoside triphosphates (ddXTPs)are employedn ddATP, ddCTP, ddGTP, ddTTP lack the 3’-OH

tail of dXTPs

p A mixture of dXTPs with small amount ofddXTPs is given to DNA polymerase withDNA template and primer

p ddXTPs are given fluorescent labels

Page 129: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

129

Dideoxy sequencingp When DNA polymerase encounters a

ddXTP, the synthesis cannot proceedp The process yields copied sequences of

different lengthsp Each sequence is terminated by a labeled

ddXTP

Page 130: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

130

Determining the sequencep Sequences are sorted

according to length bycapillaryelectrophoresis

p Fluorescent signalscorresponding tolabels are registered

p Base calling:identifying which basecorresponds to eachposition in a readn Non-trivial problem!

Output sequences frombase calling are called reads

Page 131: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

131

Reads are short!p Modern Sanger sequencers can produce

quality reads up to ~750 bases1

n Instruments provide you with a quality file forbases in reads, in addition to actual sequencedata

p Compare the read length against the sizeof the human genome (2.9x109 bases)

p Reads have to be assembled!

1 Nature Methods - 5, 16 - 18 (2008)

Page 132: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

132

Problems with sequencingp Sanger sequencing error rate per base

varies from 1% to 3%1

p Repeats in DNAn For example, ~300 base Alu sequence

repeated is over million times in humangenome

n Repeats occur in different scales

p What happens if repeat length is longerthan read length?n We will get back to this problem later

1 Jones, Pevzner (2004)

Page 133: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

133

Shortest superstring problemp Find the shortest string that ”explains” the

readsp Given a set of strings (reads), find a

shortest string that contains all of them

Page 134: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

134

Example: Shortest superstring

Set of strings: {000, 001, 010, 011, 100, 101, 110, 111}

Concetenation of strings: 000001010011100101110111

010110

011000

Shortest superstring: 0001110100001

111101

100

Page 135: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

135

Shortest superstrings: issuesp NP-complete problem: unlike to have an

efficient (exact) algorithmp Reads may be from either strand of DNAp Is the shortest string necessarily the

correct assembly?p What about errors in reads?p Low coverage -> gaps in assemblyn Coverage: average number of times each base

occurs in the set of reads (e.g., 5x coverage)

Page 136: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

136

Sequence assembly and combinationlocksp What is common with sequence assembly

and opening keypad locks?

Page 137: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

137

Whole-genome shotgun sequencep Whole-genome shotgun sequence

assembly starts with a large sample ofgenomic DNA1. Sample is randomly partitioned into inserts of

length > 500 bases2. Inserts are multiplied by cloning them into a

vector which is used to infect bacteria3. DNA is collected from bacteria and sequenced4. Reads are assembled

Page 138: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

138

Assembly of reads with Overlap-Layout-Consensus algorithmp Overlapn Finding potentially overlapping reads

p Layoutn Finding the order of reads along DNA

p Consensus (Multiple alignment)n Deriving the DNA sequence from the layout

p Next, the method is described at a veryabstract level, skipping a lot of details

Kececioglu, J.D. and E.W. Myers. 1995. Combinatorial algorithms forDNA sequence assembly. Algorithmica 13: 7-51.

Page 139: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

139

Finding overlapsp First, pairwise overlap

alignment of reads isresolved

p Reads can be fromeither DNA strand:The reversecomplement r* ofeach read r has to beconsidered

acggagtccagtccgcgctt

5’ 3’

3’ 5’

… a t g a g t g g a …

… t a c t c a c c t …

r1

r2

r1: tgagt, r1*: actca

r2: tccac, r2*: gtgga

Page 140: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

140

Example sequence to assemble

p 20 reads:

5’ – CAGCGCGCTGCGTGACGAGTCTGACAAAGACGGTATGCGCATCGTGATTGAAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT - 3’

# Read Read*

1 CATCGTCA TCACGATG

2 CGGTGAAG CTTCACCG

3 TATGCGCA TGCGCATA

4 GACGAGTC GACTCGTC

5 CTGACAAA TTTGTCAG

6 ATGCGCAT ATGCGCAT

7 ATGCGGTC GACCGCAT

8 CTGCGTGA TCACGCAG

9 GCGTGACG CGTCACGC

10 GTCGGTGA TCACCGAC

# Read Read*

11 GGTCGGTG CACCGACC

12 ATCGTGAT ATCACGAT

13 GCGCTGCG CGCAGCGC

14 GCATCGTG CACGATGC

15 AGCGCGCT AGCGCGCT16 GAAGTTGT ACAACTTC

17 AGTGAAAC GTTTCACT

18 ACGCGATG CATCGCGT

19 GCGCATCG CGATGCGC

20 AAGTGAAA TTTCACTT

Page 141: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

141

Finding overlapsp Overlap between two reads

can be found with adynamic programmingalgorithmn Errors can be taken into

account

p Dynamic programming willbe discussed more on nextlecture

p Overlap scores stored intothe overlap matrixn Entries (i, j) below the

diagonal denote overlap ofread ri and rj

*

1 CATCGTCA

6 ATGCGCAT

12 ATCGTGAT

Overlap(1, 6) = 3

Overlap(1, 12) = 7

1

6 12

3 7

Page 142: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

142

Finding layout & consensusp Method extends the

assembly greedily bychoosing the bestoverlaps

p Both orientations areconsidered

p Sequence is extendedas far as possible

7* GACCGCAT6=6* ATGCGCAT14 GCATCGTG1 CATCGTGA12 ATCGTGAT19 GCGCATCG13* CGCAGCGC---------------------

CGCATCGTGAT

Ambiguous bases

Consensus sequence

Page 143: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

143

Finding layout & consensusp We move on to next best

overlaps and extend thesequence from there

p The method stops whenthere are no more overlapsto consider

p A number of contigs isproduced

p Contig stands forcontiguous sequence,resulting from mergingreads

2 CGGTGAAG10 GTCGGTGA11 GGTCGGTG7 ATGCGGTC---------------------

ATGCGGTCGGTGAAG

Page 144: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

144

Whole-genome shotgun sequencing:summary

p Ordering of the reads is initially unknownp Overlaps resolved by aligning the readsp In a 3x109 bp genome with 500 bp reads and 5x

coverage, there are ~107 reads and ~107(107-1)/2= ~5x1013 pairwise sequence comparisons

… …Original genome sequence

ReadsNon-overlappingread

Overlapping reads=> Contig

Page 145: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

145

Repeats in DNA and genome assembly

Pop, Salzberg, Shumway (2002)

Two instances of the same repeat

Page 146: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

146

Repeats in DNA cause problems insequence assemblyp Recap: if repeat length exceeds read

length, we might not get the correctassembly

p This is a problem especially in eukaryotesn ~3.1% of genome consists of repeats in

Drosophila, ~45% in human

p Possible solutions1. Increase read length – feasible?2. Divide genome into smaller parts, with known

order, and sequence parts individually

Page 147: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

147

”Divide and conquer” sequencingapproaches: BAC-by-BAC

Whole-genome shotgun sequencing

Divide-and-conquer

Genome

Genome

BAC library

Page 148: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

148

BAC-by-BAC sequencingp Each BAC (Bacterial Artificial

Chromosome) is about 150 kbpp Covering the human genome requires

~30000 BACsp BACs shotgun-sequenced separatelyn Number of repeats in each BAC is

significantly smaller than in the wholegenome...

n ...needs much more manual work comparedto whole-genome shotgun sequencing

Page 149: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

149

Hybrid methodp Divide-and-conquer and whole-genome

shotgun approaches can be combinedn Obtain high coverage from whole-genome

shotgun sequencing for short contigsn Generate of a set of BAC contigs with low

coveragen Use BAC contigs to ”bin” short contigs to

correct places

p This approach was used to sequence thebrown Norway rat genome in 2004

Page 150: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

150

Paired end sequencingp Paired end (or mate-pair) sequencing is

technique wheren both ends of an insert are sequencedn For each insert, we get two readsn We know the distance between reads, and that

they are in opposite orientation

n Typically read length < insert length

kRead 1 Read 2

Page 151: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

151

Paired end sequencingp The key idea of paired end sequencing:

n Both reads from an insert are unlikely to be in repeatregions

n If we know where the first read is, we know alsosecond’s location

p This technique helps to WGSS higher organisms

kRead 1 Read 2

Repeat region

Page 152: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

152

First whole-genome shotgun sequencingproject: Drosophila melanogaster

p Fruit fly is a commonmodel organism inbiological studies

p Whole-genomeassembly reported inEugene Myers, et al.,A Whole-GenomeAssembly ofDrosophila, Science24, 2000

p Genome size 120 Mbp

http://en.wikipedia.org/wiki/Drosophila_melanogaster

Page 153: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

153

Sequencing of the Human Genomep The (draft) human

genome was publishedin 2001

p Two efforts:n Human Genome Project

(public consortium)n Celera (private

company)

p HGP: BAC-by-BACapproach

p Celera: whole-genomeshotgun sequencing

HGP: Nature 15 February 2001Vol 409 Number 6822

Celera: Science 16 February 2001Vol 291, Issue 5507

Page 154: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

154

Genome assembly softwarep phrap (Phil’s revised assembly program)p AMOS (A Modular, Open-Source whole-

genome assembler)p CAP3 / PCAPp TIGR assembler

Page 155: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

155

Next generation sequencing techniquesp Sanger sequencing is the prominent first-

generation sequencing methodp Many new sequencing methods are

emerging

p See Lars Paulin’s slides (course web page)for details

Page 156: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

156

Next-gen sequencing: 454p Genome Sequencer FLX (454 Life Science

/ Roche)n >100 Mb / 7.5 h runn Read length 250-300 bpn >99.5% accuracy / base in a single runn >99.99% accuracy / base in consensus

Page 157: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

157

Next-gen sequencing: Illumina Solexap Illumina / Solexa Genome Analyzern Read length 35 - 50 bpn 1-2 Gb / 3-6 day runn > 98.5% accuracy / base in a single runn 99.99% accuracy / consensus with 3x

coverage

Page 158: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

158

Next-gen sequencing: SOLiDp SOLiDn Read length 25-30 bpn 1-2 Gb / 5-10 day runn >99.94% accuracy / basen >99.999% accuracy / consensus with 15x

coverage

Page 159: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

159

Next-gen sequencing: Helicosp Helicos: Single Molecule Sequencern No amplification of sequences neededn Read length up to 55 bp

p Accuracy does not decrease when read length isincreased

p Instead, throughput goes down

n 25-90 Mb / hn >2 Gb / day

Page 160: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

160

Next-gen sequencing: PacificBiosciencesp Pacific Biosciencesn Single-Molecule Real-Time (SMRT) DNA

sequencing technologyn Read length “thousands of nucleotides”

p Should overcome most problems with repeats

n Throughput estimate: 100 Gb / hourn First instruments in 2010?

Page 161: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

161

Exercise groupsp 1st group: Tuesdays 16.15-18.00 C221p 2nd group: Wednesday 14.15-16.00 B120

p You can choose freely which group youwant to attend to

p Send exercise notes before the 1st groupstarts (Tue 16.15), even if you go to the2nd group

Page 162: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

Introduction toBioinformatics

Lecture 3:Sequence alignment

Page 163: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

163

Sequence alignmentp The biological problemp Global alignmentp Local alignmentp Multiple alignment

Page 164: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

164

Background: comparative genomicsp Basic question in biology: what properties

are shared among organisms?p Genome sequencing allows comparison of

organisms at DNA and protein levelsp Comparisons can be used ton Find evolutionary relationships between

organismsn Identify functionally conserved sequencesn Identify corresponding genes in human and

model organisms: develop models for humandiseases

Page 165: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

165

Homologsp Two genes (sequences in

general) gB and gCevolved from the sameancestor gene gA arecalled homologs

p Homologs usually exhibitconserved functions

p Close evolutionaryrelationship => expect ahigh number of homologs

gB = agtgccgttaaagttgtacgtc

gC = ctgactgtttgtggttc

gA = agtgtccgttaagtgcgttc

Page 166: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

166

p We expect homologs to be ”similar” to each otherp Intuitively, similarity of two sequences refers to

the degree of match between correspondingpositions in sequence

p What about sequences that differ in length?

Sequence similarity

agtgccgttaaagttgtacgtc

ctgactgtttgtggttc

Page 167: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

167

Similarity vs homologyp Sequence similarity is not sequence

homologyn If the two sequences gB and gC have accumulated

enough mutations, the similarity between them is likelyto be low

Homology is more difficult to detect over greaterevolutionary distances.

0 agtgtccgttaagtgcgttc1 agtgtccgttatagtgcgttc2 agtgtccgcttatagtgcgttc4 agtgtccgcttaagggcgttc8 agtgtccgcttcaaggggcgt16 gggccgttcatgggggt32 gcagggcgtcactgagggct

64 acagtccgttcgggctattg128 cagagcactaccgc256 cacgagtaagatatagct512 taatcgtgata1024 acccttatctacttcctggagtt2048 agcgacctgcccaa4096 caaac

#mutations #mutations

Page 168: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

168

Similarity vs homology (2)p Sequence similarity can occur by chancen Similarity does not imply homology

p Consider comparing two short sequencesagainst each other

Page 169: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

169

Orthologs and paralogs

p We distinguish between two types of homologyn Orthologs: homologs from two different species,

separated by a speciation eventn Paralogs: homologs within a species, separated by a

gene duplication event

gA

gB gC

Organism B Organism C

gA

gA gA’

gB gC

Organism A

Gene duplication event

OrthologsParalogs

Page 170: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

170

Orthologs and paralogs (2)p Orthologs typically retain the original functionp In paralogs, one copy is free to mutate and

acquire new function (no selective pressure)

gA

gB gC

Organism B Organism C

gA

gA gA’

gB gC

Organism A

Page 171: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

171

Paralogy example: hemoglobinp Hemoglobin is a protein

complex which transportsoxygen

p In humans, hemoglobinconsists of four proteinsubunits and four non-protein heme groups

Hemoglobin A,www.rcsb.org/pdb/explore.do?structureId=1GZX

Sickle cell diseasesare caused by mutations

in hemoglobin genes

http://en.wikipedia.org/wiki/Image:Sicklecells.jpg

Page 172: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

172

Paralogy example: hemoglobinp In adults, three types are

normally presentn Hemoglobin A: 2 alpha and

2 beta subunitsn Hemoglobin A2: 2 alpha

and 2 delta subunitsn Hemoglobin F: 2 alpha and

2 gamma subunitsp Each type of subunit

(alpha, beta, gamma,delta) is encoded by aseparate gene

Hemoglobin A,www.rcsb.org/pdb/explore.do?structureId=1GZX

Page 173: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

173

Paralogy example: hemoglobinp The subunit genes are

paralogs of each other, i.e.,they have a common ancestorgene

p Exercise: hemoglobin humanparalogs in NCBI sequencedatabaseshttp://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotiden Find human hemoglobin alpha, beta,

gamma and deltan Compare sequences

Hemoglobin A,www.rcsb.org/pdb/explore.do?structureId=1GZX

Page 174: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

174

Orthology example: insulinp The genes coding for insulin in human

(Homo sapiens) and mouse (Musmusculus) are orthologs:n They have a common ancestor gene in the

ancestor species of human and mousen Exercise: find insulin orthologs from human

and mouse in NCBI sequence databases

Page 175: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

175

Sequence alignmentp Alignment specifies which positions in two

sequences match

acgtctag|||||

-actctag

5 matches2 mismatches1 not aligned

acgtctag||actctag-

2 matches5 mismatches1 not aligned

acgtctag|| |||||ac-tctag

7 matches0 mismatches1 not aligned

Page 176: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

176

Sequence alignmentp Maximum alignment length is the total length of

the two sequences

acgtctag-------

--------actctag

0 matches0 mismatches15 not aligned

-------acgtctag

actctag--------

0 matches0 mismatches15 not aligned

Page 177: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

177

Mutations: Insertions, deletions andsubstitutions

p Insertions and/or deletions are calledindelsn We can’t tell whether the ancestor sequence

had a base or not at indel position!

acgtctag|||||

-actctag

Indel: insertion ordeletion of a basewith respect to theancestor sequence

Mismatch: substitution(point mutation) ofa single base

Page 178: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

178

Problems

p What sorts of alignments should be considered?p How to score alignments?p How to find optimal or good scoring alignments?p How to evaluate the statistical significance of

scores?

In this course, we discuss each of these problemsbriefly.

Page 179: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

179

Sequence Alignment (chapter 6)p The biological problemp Global alignmentp Local alignmentp Multiple alignment

Page 180: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

180

Global alignmentp Problem: find optimal scoring alignment between

two sequences (Needleman & Wunsch 1970)p Every position in both sequences is included in

the alignmentp We give score for each position in alignment

n Identity (match) +1n Substitution (mismatch) -µn Indel -

p Total score: sum of position scores

Page 181: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

181

Scoring: Toy examplep Consider two sequences

with characters drawnfrom the Englishlanguage alphabet:WHAT, WHY

WHAT

||

WH-YS(WHAT/WH-Y) = 1 + 1 – – µ

WHAT

-WHYS(WHAT/-WHY) = – – µ – µ – µ

Page 182: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

182

Dynamic programmingp How to find the optimal alignment?p We use previous solutions for optimal

alignments of smaller subsequencesp This general approach is known as

dynamic programming

Page 183: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

183

Introduction to dynamic programming:the money change problemp Suppose you buy a pen for 4.23€ and pay for

with a 5€ notep You get 77 cents in change – what coins is the

cashier going to give you if he or she tries tominimise the number of coins?

p The usual algorithm: start with largest coin(denominator), proceed to smaller coins until nochange is left:n 50, 20, 5 and 2 cents

p This greedy algorithm is incorrect, in the sensethat it does not always give you the correctanswer

Page 184: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

184

The money change problemp How else to compute the

change?p We could consider all possible

ways to reduce the amount ofchange

p Suppose we have 77 centschange, and the followingcoins: 50, 20, 5 cents

p We can compute the changewith recursionn C(n) = min { C(n – 50) + 1,

C(n – 20) + 1, C(n – 5) + 1 }p Figure shows the recursion

tree for the example

77

7227 57

7 22 7 37 52 22 52 67…

50 20 5

p Many values are computed morethan once!

p This leads to a correct but veryinefficient algorithm

Page 185: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

185

The money change problemp We can speed the computation up by

solving the change problem for all i nn Example: solve the problem for 9 cents with

available coins being 1, 2 and 5 cents

p Solve the problem in steps, first for 1cent, then 2 cents, and so on

p In each step, utilise the solutions from theprevious steps

Page 186: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

186

The money change problem

0 1 2 3 4 5 6 7 8 9Amount ofchange left

p Algorithm runs in time proportional to Md, where M is theamount of change and d is the number of coin types

p The same technique of storing solutions of subproblems canbe utilised in aligning sequences

Number ofcoins used

0 1 1 2 2 1 2 2 3 3

Page 187: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

187

Representing alignments and scores

Y

H

W

-

TAHW-

WHAT

||

WH-Y

Alignments can berepresented in thefollowing tabular form.

Each alignmentcorresponds to a paththrough the table.

Page 188: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

188

Representing alignments and scores

Y

H

W

-

TAHW-

WHAT---

----WHY

WH-AT

||

WHY--

Page 189: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

189

Y

H

W

-

TAHW-WHAT

||

WH-Y

Global alignmentscore S3,4 = 2- -µ

2- -µ

2-2

1

0

Representing alignments and scores

Page 190: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

190

Filling the alignment matrix

Y

H

W

-

TAHW-

Case 1Case 2

Case 3

Consider the alignment processat shaded square.

Case 1. Align H against H(match)

Case 2. Align H in WHY against– (indel) in WHAT

Case 3. Align H in WHATagainst – (indel) in WHY

Page 191: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

191

Filling the alignment matrix (2)

Y

H

W

-

TAHW-

Case 1Case 2

Case 3

Scoring the alternatives.

Case 1. S2,2 = S1,1 + s(2, 2)

Case 2. S2,2 = S1,2

Case 3. S2,2 = S2,1

s(i, j) = 1 for matching positions,

s(i, j) = - µ for substitutions.

Choose the case (path) thatyields the maximum score.

Keep track of path choices.

Page 192: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

192

Global alignment: formaldevelopmentA = a1a2a3…an,B = b1b2b3…bm

a3

a2

a1

-

b4b3b2b1-

3

2

1

0

43210

b1 b2 b3 b4 -- -a1 a2 a3

l Any alignment can be writtenas a unique path through thematrix

l Score for aligning A and B upto positions i and j:

Si,j = S(a1a2a3…ai, b1b2b3…bj)

Page 193: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

193

Scoring partial alignmentsp Alignment of A = a1a2a3…ai with B = b1b2b3…bj

can be end in three possible waysn Case 1: (a1a2…ai-1) ai

(b1b2…bj-1) bj

n Case 2: (a1a2…ai-1) ai

(b1b2…bj) -n Case 3: (a1a2…ai) –

(b1b2…bj-1) bj

Page 194: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

194

Scoring alignmentsp Scores for each case:

n Case 1: (a1a2…ai-1) ai

(b1b2…bj-1) bj

n Case 2: (a1a2…ai-1) ai

(b1b2…bj) –

n Case 3: (a1a2…ai) –(b1b2…bj-1) bj

s(ai, bj) = { -µ otherwise

+1 if ai = bj

s(ai, -) = s(-, bj) = -

Page 195: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

195

Scoring alignments (2)p First row and first column

correspond to initial alignmentagainst indels:

S(i, 0) = -i S(0, j) = -j

p Optimal global alignment scoreS(A, B) = Sn,m

a3

a2

a1

-

b4b3b2b1-

-33

-22

-1

-4-3-2-00

43210

Page 196: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

196

Algorithm for global alignmentInput sequences A, B, n = |A|, m = |B|Set Si,0 := - i for all iSet S0,j := - j for all jfor i := 1 to n

for j := 1 to mSi,j := max{Si-1,j – , Si-1,j-1 + s(ai,bj), Si,j-1 – }

endend

Algorithm takes O(nm) time

Page 197: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

197

Global alignment: example

?-10T-8G-6C-4T-2A

-10-8-6-4-20-GTGGT-

µ = 1

= 2

Page 198: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

198

Global alignment: example

?-10T-8G-6C-4T-2A

-10-8-6-4-20-GTGGT-

µ = 1

= 2 -1 -3

Page 199: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

199

Global alignment: example (2)

-20-3-4-7-10T-4-3-1-2-5-8G-5-5-3-2-3-6C-6-4-4-2-1-4T-9-7-5-3-1-2A-10-8-6-4-20-GTGGT-

µ = 1

= 2

ATCGT-

| ||

-TGGTG

Page 200: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

200

Sequence Alignment (chapter 6)p The biological problemp Global alignmentp Local alignmentp Multiple alignment

Page 201: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

201

Local alignment: rationalep Otherwise dissimilar proteins may have local regions of

similarity-> Proteins may share a function

Human bonemorphogenic proteinreceptor type IIprecursor (left) has a300 aa region thatresembles 291 aaregion in TGF-receptor (right).

The shared functionhere is protein kinase.

Page 202: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

202

Local alignment: rationale

p Global alignment would be inadequatep Problem: find the highest scoring local alignment

between two sequencesp Previous algorithm with minor modifications solves this

problem (Smith & Waterman 1981)

A

BRegions ofsimilarity

Page 203: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

203

From global to local alignmentp Modifications to the global alignment

algorithmn Look for the highest-scoring path in the

alignment matrix (not necessarily through thematrix), or in other words:

n Allow preceding and trailing indels withoutpenalty

Page 204: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

204

Scoring local alignmentsA = a1a2a3…an, B = b1b2b3…bm

Let I and J be intervals (substrings) of A and B, respectively:

Best local alignment score:

where S(I, J) is the alignment score for substrings I and J.

Page 205: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

205

Allowing preceding and trailingindelsp First row and column

initialised to zero:Mi,0 = M0,j = 0

a3

a2

a1

-

b4b3b2b1-

03

02

01

000000

43210

b1 b2 b3- - a1

Page 206: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

206

Recursion for local alignmentp Mi,j = max {

Mi-1,j-1 + s(ai, bi),Mi-1,j – ,Mi,j-1 – ,0

}

020010T

101100G

000000C

010010T

000000A

000000-

GTGGT-

Allow alignment tostart anywhere insequences

Page 207: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

207

Finding best local alignmentp Optimal score is the highest

value in the matrix

= maxi,j Mi,j

p Best local alignment can befound by backtracking from thehighest value in M

p What is the best localalignment in this example? 020010T

101100G

000000C

010010T

000000A

000000-

GTGGT-

Page 208: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

208

Local alignment: example

0G80G70A60A50T40C30C20A1

00000000000-0ACTAACTCGG-

109876543210Mi,j = max {

Mi-1,j-1 + s(ai,bi),Mi-1,j ,Mi,j-1 ,0

}

0

Scoring (for example)Match: +2Mismatch: -1Indel: -2

Page 209: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

209

Local alignment: example

0G80G70A60A50T40C30C20A1

00000000000-0ACTAACTCGG-

109876543210Mi,j = max {

Mi-1,j-1 + s(ai,bi),Mi-1,j ,Mi,j-1 ,0

}

0 0 0 0 0 2

Scoring (for example)Match: +2Mismatch: -1Indel: -2

Page 210: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

210

C T – A AC T C A A

Local alignment: example

Scoring (for example)Match: +2Mismatch: -1Indel: -2

Optimal localalignment:

24321002420G813543000220G7

32465100000A631134320000A521201240000T413001212000C302110202000C220022000000A100000000000-0ACTAACTCGG-

109876543210

Page 211: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

211

Multiple optimal alignmentsNon-optimal, good-scoring alignments

24321002420G813543000220G7

32465100000A631134320000A521201240000T413001212000C302110202000C220022000000A100000000000-0ACTAACTCGG-

109876543210How can you find

1. Optimalalignments ifmore than oneexist?

2. Non-optimal,good-scoringalignments?

Page 212: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

212

Overlap alignmentp Overlap matrix used by Overlap-Layout-

Consensus algorithm can be computed withdynamic programming

p Initialization: Oi,0 = O0,j = 0 for all i, jp Recursion:Oi,j = max {

Oi-1,j-1 + s(ai, bi),Oi-1,j – ,Oi,j-1 – ,

}Best overlap: maximum value from rightmost

column and bottom row

Page 213: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

213

Non-uniform mismatch penaltiesp We used uniform penalty for mismatches:

s(’A’, ’C’) = s(’A’, ’G’) = … = s(’G’, ’T’) = µp Transition mutations (A->G, G->A, C->T, T->C) are

approximately twice as frequent than transversions (A->T,T->A, A->C, G->T)n use non-uniform mismatch

penalties collected into asubstitution matrix

1-1-0.5-1T-11-1-0.5G

-0.5-11-1C-1-0.5-11A

TGCA

Page 214: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

214

Gaps in alignmentp Gap is a succession of indels in alignment

p Previous model scored a length k gap asw(k) = -k

p Replication processes may produce longerstretches of insertions or deletionsn In coding regions, insertions or deletions of

codons may preserve functionality

C T – - - A AC T C G C A A

Page 215: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

215

Gap open and extension penalties (2)p We can design a score that allows the

penalty opening gap to be larger thanextending the gap:

w(k) = - – (k – 1)p Gap open cost , Gap extension costp Alignment algorithms can be extended to

use w(k) (not discussed on this course)

Page 216: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

216

Amino acid sequencesp We have discussed mainly DNA sequencesp Amino acid sequences can be aligned as

wellp However, the design of the substitution

matrix is more involved because of thelarger alphabet

p More on the topic in the course Biologicalsequence analysis

Page 217: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

217

Demonstration of the EBI web sitep European Bioinformatics Institute (EBI)

offers many biological databases andbioinformatics tools athttp://www.ebi.ac.uk/n Sequence alignment: Tools -> Sequence

Analysis -> Align

Page 218: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

218

Sequence Alignment (chapter 6)p The biological problemp Global alignmentp Local alignmentp Multiple alignment

Page 219: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

219

Multiple alignmentp Consider a set of n sequences

on the rightn Orthologous sequences from

different organismsn Paralogs from multiple

duplicationsp How can we study

relationships between thesesequences?

aggcgagctgcgagtgctacgttagattgacgctgacttccggctgcgacgacacggcgaacggaagtgtgcccgacgagcgaggacgcgggctgtgagcgctaaagcggcctgtgtgccctaatgctgctgccagtgtaagtcgagccccgagtgcagtccgagtccactcggtgc

Page 220: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

220

Optimal alignment of threesequencesp Alignment of A = a1a2…ai and B = b1b2…bj can

end either in (-, bj), (ai, bj) or (ai, -)p 22 – 1 = 3 alternativesp Alignment of A, B and C = c1c2…ck can end in 23 –

1 ways: (ai, -, -), (-, bj, -), (-, -, ck), (-, bj, ck),(ai, -, ck), (ai, bj, -) or (ai, bj, ck)

p Solve the recursion using three-dimensionaldynamic programming matrix: O(n3) time andspace

p Generalizes to n sequences but impractical witheven a moderate number of sequences

Page 221: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

221

Multiple alignment in practicep In practice, real-world multiple alignment

problems are usually solved with heuristicsp Progressive multiple alignment

n Choose two sequences and align themn Choose third sequence w.r.t. two previous sequences

and align the third against themn Repeat until all sequences have been alignedn Different options how to choose sequences and score

alignmentsn Note the similarity to Overlap-Layout-Consensus

Page 222: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

222

Multiple alignment in practicep Profile-based progressive multiple

alignment: CLUSTALWn Construct a distance matrix of all pairs of

sequences using dynamic programmingn Progressively align pairs in order of decreasing

similarityn CLUSTALW uses various heuristics to

contribute to accuracy

Page 223: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

223

Additional materialp R. Durbin, S. Eddy, A. Krogh, G.

Mitchison: Biological sequence analysisp N. C. Jones, P. A. Pevzner: An introduction

to bioinformatics algorithmsp Course Biological sequence analysis in

period II, 2008

Page 224: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

224

Rapid alignment methods: FASTA andBLASTp The biological problemp Search strategiesp FASTAp BLAST

Page 225: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

225

The biological problemp Global and local

alignment algoritms areslow in practice

p Consider the scenario ofaligning a querysequence against a largedatabase of sequencesn New sequence with

unknown function n NCBI GenBank size in January2007 was 65 369 091 950bases (61 132 599 sequences)

n Feb 2008: 85 759 586 764bases (82 853 685 sequences)

Page 226: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

226

Problem with large amount of sequencesp Exponential growth in both number and

total length of sequencesp Possible solution: Compare against model

organisms onlyp With large amount of sequences, chances

are that matches occur by randomn Need for statistical analysis

Page 227: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

227

Rapid alignment methods: FASTA andBLASTp The biological problemp Search strategiesp FASTAp BLAST

Page 228: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

228

FASTAp FASTA is a multistep algorithm for sequence

alignment (Wilbur and Lipman, 1983)p The sequence file format used by the FASTA

software is widely used by other sequenceanalysis software

p Main idea:n Choose regions of the two sequences (query and

database) that look promising (have some degree ofsimilarity)

n Compute local alignment using dynamic programming inthese regions

Page 229: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

229

FASTA outlinep FASTA algorithm has five steps:n 1. Identify common k-words between I and Jn 2. Score diagonals with k-word matches,

identify 10 best diagonalsn 3. Rescore initial regions with a substitution

score matrixn 4. Join initial regions using gaps, penalise for

gapsn 5. Perform dynamic programming to find final

alignments

Page 230: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

231

Analyzing the word contentp Example query string I: TGATGATGAAGACATCAGp For k = 8, the set of k-words (substring of length

k) of I is

TGATGATGGATGATGAATGATGAATGATGAAG

…GACATCAG

Page 231: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

232

Analyzing the word contentp There are n-k+1 k-words in a string of length n

p If at least one word of I is not found fromanother string J, we know that I differs from J

p Need to consider statistical significance: I and Jmight share words by chance only

p Let n=|I| and m=|J|

Page 232: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

233

Word lists and comparison by contentp The k-words of I can be arranged into a table of

word occurences Lw(I)p Consider the k-words when k=2 and

I=GCATCGGC:GC, CA, AT, TC, CG, GG, GCAT: 3CA: 2CG: 5GC: 1, 7GG: 6TC: 4

Start indecies of k-word GC in I

Building Lw(I) takes O(n) time

Page 233: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

234

Common k-wordsp Number of common k-words in I and J can

be computed using Lw(I) and Lw(J)

p For each word w in I, there are |Lw(J)|occurences in J

p Therefore I and J havecommon words

p This can be computed in O(n + m + 4k)timen O(n + m) time to build the listsn O(4k) time to calculate the sum (in DNA

strings)

Page 234: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

235

Common k-wordsp I = GCATCGGCp J = CCATCGCCATCG

Lw(J)AT: 3, 9CA: 2, 8CC: 1, 7CG: 5, 11GC: 6

TC: 4, 10

Lw(I)AT: 3CA: 2

CG: 5GC: 1, 7GG: 6TC: 4

Common words220220210 in total

Page 235: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

236

Properties of the common word listp Exact matches can be found using binary search

(e.g., where TCGT occurs in I?)n O(log 4k) time

p For large k, the table size is too large to computethe common word count in the previous fashion

p Instead, an approach based on merge sort can beutilised (details skipped)

p The common k-word technique can be combinedwith the local alignment algorithm to yield a rapidalignment approach

Page 236: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

237

FASTA outlinep FASTA algorithm has five steps:n 1. Identify common k-words between I and Jn 2. Score diagonals with k-word matches,

identify 10 best diagonalsn 3. Rescore initial regions with a substitution

score matrixn 4. Join initial regions using gaps, penalise for

gapsn 5. Perform dynamic programming to find final

alignments

Page 237: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

238

Dot matrix comparisonsp Word matches in two sequences I and J can be

represented as a dot matrixp Dot matrix element (i, j) has ”a dot”, if the word

starting at position i in I is identical to the wordstarting at position j in J

p The dot matrix can be plotted for various k

i

j

I = … ATCGGATCA …J = … TGGTGTCGC …

i

j

Page 238: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

239

k=1 k=4

k=8 k=16

Dot matrix (k=1,4,8,16)for two DNA sequencesX85973.1 (1875 bp)Y11931.1 (2013 bp)

Page 239: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

240

k=1 k=4

k=8 k=16

Dot matrix(k=1,4,8,16) for twoprotein sequencesCAB51201.1 (531 aa)CAA72681.1 (588 aa)

Shading indicatesnow the match scoreaccording to ascore matrix(Blosum62 here)

Page 240: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

241

Computing diagonal sumsp We would like to find high scoring diagonals of the dot

matrixp Lets index diagonals by the offset, l = i - j

C C A T C G C C A T C GG *C * *A * *T * *C * *GG *C

k=2

I

J

Diagonal l = i – j = -6

Page 241: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

242

Computing diagonal sumsp As an example, lets compute diagonal sums for

I = GCATCGGC, J = CCATCGCCATCG, k = 2p 1. Construct k-word list Lw(J)p 2. Diagonal sums Sl are computed into a table,

indexed with the offset and initialised to zero

l -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6

Sl 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Page 242: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

243

Computing diagonal sumsp 3. Go through k-words of I, look for matches in

Lw(J) and update diagonal sums

C C A T C G C C A T C GG *C * *A * *T * *C * *GG *C

I

J For the first 2-word in I,GC, LGC(J) = {6}.

We can then updatethe sum of diagonall = i – j = 1 – 6 = -5 toS-5 := S-5 + 1 = 0 + 1 = 1

Page 243: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

244

Computing diagonal sumsp 3. Go through k-words of I, look for matches in

Lw(J) and update diagonal sums

C C A T C G C C A T C GG *C * *A * *T * *C * *GG *C

I

J Next 2-word in I is CA,for which LCA(J) = {2, 8}.

Two diagonal sums areupdated:l = i – j = 2 – 2 = 0S0 := S0 + 1 = 0 + 1 = 1

I = i – j = 2 – 8 = -6S-6 := S-6 + 1 = 0 + 1 = 1

Page 244: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

245

Computing diagonal sumsp 3. Go through k-words of I, look for matches in

Lw(J) and update diagonal sums

C C A T C G C C A T C GG *C * *A * *T * *C * *GG *C

I

J Next 2-word in I is AT,for which LAT(J) = {3, 9}.

Two diagonal sums areupdated:l = i – j = 3 – 3 = 0S0 := S0 + 1 = 1 + 1 = 2

I = i – j = 3 – 9 = -6S-6 := S-6 + 1 = 1 + 1 = 2

Page 245: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

246

Computing diagonal sumsAfter going through the k-words of I, the result is:l -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6

Sl 0 0 0 0 4 1 0 0 0 0 4 1 0 0 0 0 0

C C A T C G C C A T C GG *C * *A * *T * *C * *GG *C

I

J

Page 246: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

247

Algorithm for computing diagonal sum of scores

Sl := 0 for all 1 – m l n – 1Compute Lw(J) for all words wfor i := 1 to n – k – 1 do

w := IiIi+1…Ii+k-1

for j Lw(J) dol := i – jSl := Sl + 1

endend

Match score is here 1

Page 247: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

248

FASTA outlinep FASTA algorithm has five steps:n 1. Identify common k-words between I and Jn 2. Score diagonals with k-word matches,

identify 10 best diagonalsn 3. Rescore initial regions with a substitution

score matrixn 4. Join initial regions using gaps, penalise for

gapsn 5. Perform dynamic programming to find final

alignments

Page 248: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

249

Rescoring initial regionsp Each high-scoring diagonal chosen in the

previous step is rescored according to a scorematrix

p This is done to find subregions with identitiesshorter than k

p Non-matching ends of the diagonal are trimmed

I: C C A T C G C C A T C GJ: C C A A C G C A A T C A

I’: C C A T C G C C A T C GJ’: A C A T C A A A T A A A

75% identity, no 4-word identities

33% identity, one 4-word identity

Page 249: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

250

Joining diagonalsp Two offset diagonals can be joined with a gap, if

the resulting alignment has a higher scorep Separate gap open and extension are usedp Find the best-scoring combination of diagonals

High-scoringdiagonals

Two diagonalsjoined by a gap

Page 250: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

251

FASTA outlinep FASTA algorithm has five steps:n 1. Identify common k-words between I and Jn 2. Score diagonals with k-word matches,

identify 10 best diagonalsn 3. Rescore initial regions with a substitution

score matrixn 4. Join initial regions using gaps, penalise for

gapsn 5. Perform dynamic programming to find final

alignments

Page 251: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

252

Local alignment in the highest-scoringregionp Last step of FASTA: perform local

alignment using dynamicprogramming around the highest-scoring

p Region to be aligned covers –w and+w offset diagonal to the highest-scoring diagonals

p With long sequences, this region istypically very small compared to thewhole n x m matrix w

w

Dynamic programming matrixM filled only for the green region

Page 252: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

253

Properties of FASTAp Fast compared to local alignment using dynamic

programming onlyn Only a narrow region of the full matrix is aligned

p Increasing parameter k decreases the number ofhits:n Increases specificityn Decreases sensitivityn Decreases running time

p FASTA can be very specific when identifying longregions of low similarityn Specific method does not find many incorrect resultsn Sensitive method finds many of the correct results

Page 253: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

254

Properties of FASTAp FASTA looks for initial exact matches to

query sequencen Two proteins can have very different amino

acid sequences and still be biologically similarn This may lead into a lack of sensitivity with

diverged sequences

Page 254: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

255

Demonstration of FASTA at EBIp http://www.ebi.ac.uk/fasta/p Note that parameter ktup in the software

corresponds to parameter k in lectures

Page 255: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

256

Note on sequences and alignmentmatrices in exercisesp Example solutions to alignment problems

will have sequences arranged like this:n Perform global alignment of the sequences

p s = AGCTGCGTACTp t = ATGAGCGTTA

AG

CTG

CG

TACT

ATGAGCGTTA

So if you want to be able tocompare your solution easilyagainst the example, use thisconvention.

Page 256: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

257

Rapid alignment methods: FASTA andBLASTp The biological problemp Search strategiesp FASTAp BLAST

Page 257: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

258

BLAST: Basic Local Alignment SearchToolp BLAST (Altschul et al., 1990) and its variants are

some of the most common sequence search toolsin use

p Roughly, the basic BLAST has three parts:n 1. Find segment pairs between the query sequence and

a database sequence above score threshold (”seed hits”)n 2. Extend seed hits into locally maximal segment pairsn 3. Calculate p-values and a rank ordering of the local

alignments

p Gapped BLAST introduced in 1997 allows for gapsin alignments

Page 258: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

259

Finding seed hitsp First, we generate a set of neighborhood

sequences for given k, match score matrix andthreshold T

p Neighborhood sequences of a k-word w includeall strings of length k that, when aligned againstw, have the alignment score at least T

p For instance, let I = GCATCGGC, J =CCATCGCCATCG and k = 5, match score be 1,mismatch score be 0 and T = 4

Page 259: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

260

Finding seed hitsp I = GCATCGGC, J = CCATCGCCATCG, k = 5,

match score 1, mismatch score 0, T = 4p This allows for one mismatch in each k-wordp The neighborhood of the first k-word of I, GCATC,

is GCATC and the 15 sequences

A A C A A

CCATC,G GATC,GC GTC,GCA CC,GCAT G

T T T G T

Page 260: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

261

Finding seed hitsp I = GCATCGGC has 4 k-words and thus 4x16 =

64 5-word patterns to locate in Jn Occurences of patterns in J are called seed hits

p Patterns can be found using exact search in timeproportional to the sum of pattern lengths +length of J + number of matches (Aho-Corasickalgorithm)n Methods for pattern matching are developed on course

58093 String processing algorithms

p Compare this approach to FASTA

Page 261: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

262

Extending seed hits: original BLASTp Initial seed hits are extended into

locally maximal segment pairsor High-scoring Segment Pairs(HSP)

p Extensions do not add gaps to thealignment

p Sequence is extended until thealignment score drops below themaximum attained score minus athreshold parameter value

p All statistically significant HSPsreported

AACCGTTCATTA| || || ||TAGCGATCTTTT

Initial seed hit

Extension

Altschul, S.F., Gish, W., Miller, W., Myers, E. W. andLipman, D. J., J. Mol. Biol., 215, 403-410, 1990

Page 262: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

263

Extending seed hits: gapped BLASTp In a later version of BLAST, two

seed hits have to be found on thesame diagonaln Hits have to be non-overlappingn If the hits are closer than A

(additional parameter), then theyare joined into a HSP

p Threshold value T is lowered toachieve comparable sensitivity

p If the resulting HSP achieves ascore at least Sg, a gappedextension is triggered

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, andLipman DJ, Nucleic Acids Res. 1;25(17), 3389-402, 1997

Page 263: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

264

Gapped extensions of HSPsp Local alignment is performed

starting from the HSPp Dynamic programming matrix

filled in ”forward” and”backward” directions (seefigure)

p Skip cells where value wouldbe Xg below the bestalignment score found so far

Region potentially searchedby the alignment algorithm

HSP

Region searched with scoreabove cutoff parameter

Page 264: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

265

Estimating the significance of resultsp In general, we have a score S(D, X) = s for a

sequence X found in database Dp BLAST rank-orders the sequences found by p-

valuesp The p-value for this hit is P(S(D, Y) s) where Y

is a random sequencen Measures the amount of ”surprise” of finding sequence X

p A smaller p-value indicates more significant hitn A p-value of 0.1 means that one-tenth of random

sequences would have as large score as our result

Page 265: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

266

Estimating the significance of resultsp In BLAST, p-values are computed roughly as

followsp There are nm places to begin an optimal

alignment in the n x m alignment matrixp Optimal alignment is preceded by a mismatch

and has t matching (identical) lettersn (Assume match score 1 and mismatch/indel score - )

p Let p = P(two random letters are equal)p The probability of having a mismatch and then t

matches is (1-p)pt

Page 266: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

267

Estimating the significance of resultsp We model this event by a Poisson distribution

(why?) with mean = nm(1-p)pt

p P(there is local alignment t or longer)1 – P(no such event)

= 1 – e- = 1 – exp(-nm(1-p)pt)p An equation of the same form is used in Blast:p E-value = P(S(D, Y) s) 1 – exp(-nm t) where

> 0 and 0 < < 1p Parameters and are estimated from data

Page 267: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

268

Scoring amino acidalignmentsp We need a way to compute the

score S(D, X) for aligning thesequence X against database D

p Scoring DNA alignments wasdiscussed previously

p Constructing a scoring model foramino acids is more challengingn 20 different amino acids vs. 4

basesp Figure shows the molecular

structures of the 20 amino acids

http://en.wikipedia.org/wiki/List_of_standard_amino_acids

Page 268: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

269

Scoring amino acidalignmentsp Substitutions between chemically

similar amino acids are morefrequent than between dissimilaramino acids

p We can check our scoring modelagainst this

http://en.wikipedia.org/wiki/List_of_standard_amino_acids

Page 269: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

270

Score matricesp Scores s = S(D, X) are obtained from score

matricesp Let A = A1a2…an and B = b1b2…bn be sequences

of equal length (no gaps allowed to simplifythings)

p To obtain a score for alignment of A and B, whereai is aligned against bi, we take the ratio of twoprobabilitiesn The probability of having A and B where the characters

match (match model M)n The probability that A and B were chosen randomly

(random model R)

Page 270: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

271

Score matrices: random modelp Under the random model, the probability

of having X and Y is

where qxi is the probability of occurence ofamino acid type xi

p Position where an amino acid occurs doesnot affect its type

Page 271: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

272

Score matrices: match modelp Let pab be the probability of having amino acids

of type a and b aligned against each other giventhey have evolved from the same ancestor c

p The probability is

Page 272: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

273

Score matrices: log-odds ratio scorep We obtain the score S by taking the ratio

of these two probabilities

and taking a logarithm of the ratio

Page 273: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

274

Score matrices: log-odds ratio score

p The score S is obtained by summing overcharacter pair-specific scores:

p The probabilities qa and pab are extractedfrom data

Page 274: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

275

Calculating score matrices for aminoacidsp Probabilities qa are in

principle easy to obtain:n Count relative frequencies of

every amino acid in a sequencedatabase

Page 275: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

276

p To calculate pab we can use aknown pool of aligned sequences

p BLOCKS is a database of highlyconserved regions for proteins

p It lists multiply aligned, ungappedand conserved protein segments

p Example from BLOCKS showsgenes related to human geneassociated with DNA-repairdefect xeroderma pigmentosum

Calculating score matrices for aminoacids

Block PR00851AID XRODRMPGMNTB; BLOCKAC PR00851A; distance from previous block=(52,131)DE Xeroderma pigmentosum group B protein signatureBL adapted; width=21; seqs=8; 99.5%=985; strength=1287XPB_HUMAN|P19447 ( 74) RPLWVAPDGHIFLEAFSPVYK 54XPB_MOUSE|P49135 ( 74) RPLWVAPDGHIFLEAFSPVYK 54P91579 ( 80) RPLYLAPDGHIFLESFSPVYK 67XPB_DROME|Q02870 ( 84) RPLWVAPNGHVFLESFSPVYK 79RA25_YEAST|Q00578 ( 131) PLWISPSDGRIILESFSPLAE 100Q38861 ( 52) RPLWACADGRIFLETFSPLYK 71O13768 ( 90) PLWINPIDGRIILEAFSPLAE 100O00835 ( 79) RPIWVCPDGHIFLETFSAIYK 86

http://blocks.fhcrc.org

Page 276: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

277

BLOSUM matrixp BLOSUM is a score matrix

for amino acid sequencesderived from BLOCKS data

p First, count pairwisematches fx,y for every aminoacid type pair (x, y)

p For example, for column 3and amino acids L and W,we find 8 pairwise matches:fL,W = fW,L = 8

RPLWVAPDRPLWVAPRRPLWVAPNPLWISPSDRPLWACADPLWINPIDRPIWVCPD

Page 277: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

278

p Probability pab is obtained bydividing fab with the totalnumber of pairs (notedifference with course book):

p We get probabilities qa by

RPLWVAPDRPLWVAPRRPLWVAPNPLWISPSDRPLWACADPLWINPIDRPIWVCPD

Creating a BLOSUM matrix

Page 278: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

279

Creating a BLOSUM matrixp The probabilities pab and qa can now be plugged

into

to get a 20 x 20 matrix of scores s(a, b).p Next slide presents the BLOSUM62 matrix

n Values scaled by factor of 2 and rounded to integersn Additional step required to take into account expected

evolutionary distancen Described in Deonier’s book in more detail

Page 279: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

280

BLOSUM62A R N D C Q E G H I L K M F P S T W Y V B Z X *

A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1

Page 280: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

281

Using BLOSUM62 matrixMQLEANADTSV

| | |

LQEQAEAQGEM

= 2 + 5 – 3 – 4 + 4 + 0 + 4 + 0 – 2 + 0 +1

= 7

Page 281: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

282

Demonstration of BLAST at NCBIp http://www.ncbi.nlm.nih.gov/BLAST/

Page 282: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

Introduction toBioinformatics

Lecture 4:Genome rearrangements

Page 283: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

284

Why study genome rearrangements?p Provide insight into evolution of speciesp Fun algorithmic problem!

p Structure of this lecture:n The biological phenomenonn How to computationally model it?n How to compute interesting things?n Studying the phenomenon using existing tools

(continued in exercises)

Page 284: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

285

Genome rearrangements as analgorithmic problem

Page 285: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

286

Backgroundp Genome sequencing enables us to

compare genomes of two or moredifferent speciesn -> Comparative genomics

p Basic observation:n Closely related species (such as human and

mouse) can be almost identical in terms ofgenome contents...

n ...but the order of genomic segments can bevery different between species

Page 286: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

287

Synteny blocks and segmentsp Synteny – derived from Greek ’on the

same ribbon’ – means genomic segmentslocated on the same chromosomen Genes, markers (any sequence)

p Synteny block (or syntenic block)n A set of genes or markers that co-occur

together in two species

p Synteny segment (or syntenic segment)n Syntenic block where the order of genes or

markers is preserved

Page 287: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

288

Synteny blocks and segmentsChromosome i, species B

Chromosome j, species C

Synteny segment

Synteny block

Homologsof the samegene

Page 288: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

289

Observations from sequencing1. Large chromosome inversions and

translocations (we’ll get to these shortly)are commonn ...Even between closely related species

2. Chromosome inversions are usuallysymmetric around the origin of DNAreplication

3. Inversions are less common withinspecies...

Page 289: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

290

What causes rearrangements?p RecA, Recombinase A,

is a protein used torepair chromosomaldamage

p It uses a duplicatecopy of the damagedsequence as template

p Template is usually ahomologous sequenceon a sisterchromosome

Diarmaid Hughes: Evaluating genome dynamics: the constraints onrearrangements within bacterial genomes, Genome Biology 2000, 1

Page 290: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

291

Chromosomes: recapp Linear chromosomesn Eukaryotes (mostly)

p Circular chromosomesn Prokaryotes (mostly)n Mitochondria

chromatid

centromere

gene 1

gene 3

gene 2

Also double-stranded: genes can befound on both strands (orientations)

Page 291: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

292

What effects does RecA have ongenome?p Repeated sequences cause RecA to fail to

choose correct recombination startposition

p This leads ton Tandem duplicationsn Translocationsn Inversions

Repeat 1 Repeat 2

RecA

?

Damaged sequence

Page 292: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

293Diarmaid Hughes: Evaluating genome dynamics: the constraints onrearrangements within bacterial genomes, Genome Biology 2000, 1

X, Y, Z and W are repeats ofthe same sequence.

a, b, c and d are sequences on genomebounded by repeats.

In a tandem duplication example,RecA recombines a sequence thatstarts from Y instead of Z after Z.

This leads to duplication of segment Y-Z.

Page 293: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

294Diarmaid Hughes: Evaluating genome dynamics: the constraints onrearrangements within bacterial genomes, Genome Biology 2000, 1

Recombination of tworepeat sequences in thesame chromosome canlead to a fragment translocation

Here sequence d is translocated

Page 294: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

295Diarmaid Hughes: Evaluating genome dynamics: the constraints onrearrangements within bacterial genomes, Genome Biology 2000, 1

Inversion happens when twosequences of opposite orientationsare recombined.

Page 295: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

296

Example: human vs mouse genomep Human and mouse genomes share

thousands of homologous genes, but theyaren Arranged in different ordern Located in different chromosomes

p Examplesn Human chromosome 6 contains elements from

six different mouse chromosomesn Analysis of X chromosome indicates that

rearrangements have happened primarilywithin chromosome

Page 296: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

297 Jones & Pevzner, 2004

Page 297: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

298

Page 298: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

299

Representing genome rearrangmentsp When comparing two genomes, we can

find homologous sequences in both usingBLAST, for example

p This gives us a map between sequences inboth genomes

Page 299: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

300

Representing genome rearrangmentsp We assign numbers 1,...,n to

the found homologoussequences

p By convention, we number thesequences in the first genomeby their order of appearancein chromosomes

p If the homolog of i is inreverse orientation, it receivesnumber –i (signed data)

p For example, consider humanvs mouse gene numbering onthe right

(il10)9

(at3)-6(pdc)8(lamc1)-7(lamc1)7(pdc)-8(at3)6(il10)-9(pklr)5(pax3)15(gba)4(fn1)14(ngfb)3(cd28)13(nras)2(inpp1)12(gnat2)1

MouseHuman

List order corresponds tophysical order on chromosomes!

Page 300: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

301

Permutationsp The basic data structure in the study of

genome rearrangements is permutationp A permutation of a sequence of n numbers

is a reordering of the sequencep For example, 4 1 3 2 5 is a permutation of

1 2 3 4 5

Page 301: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

302

Genome rearrangement problemp Given two genomes (set of markers), how

manyn duplications,n inversions andn translocations

do we need to do to transform the firstgenome to the second?

Minimum number of operations?What operations? Which order?

Page 302: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

303

Genome rearrangement problem

6 1 2 3 4 5 1 2 3 4 5 6

#duplications?#inversions?#translocations?

Page 303: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

304

Genome rearrangement problem

6 1 2 3 4 5 1 2 3 4 5 6

1 2 3 4 5 6

Keep in mind, that the two genomeshave been evolved from a commonancestor genome!

Permutationof 1,...,6

Page 304: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

305

Genome rearrangements using reversals(=inversions) onlyp Lets consider a simpler problem where we just

study reversals with unsigned datap A reversal p(i, j) reverses the order of the

segment i i+1 ... j-1 j (indexing starts from 1)p For example, given permutation

6 1 2 3 4 5 and reversal p(3, 5) we getpermutation 6 1 4 3 2 5

...note that we do not care about exact positions on the genome

p(3, 5)

Page 305: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

306

Reversal distance problemp Find the shortest series of reversals that, given

a permutation , transforms it to the identitypermutation (1, 2, ..., n)

p This quantity is denoted by d( )

p Reversal distance for a pair of chromosomes:n Find synteny blocks in bothn Number blocks in the first chromosome to identityn Set to correspond matching of second chromosome’s

blocks against the firstn Find reversal distance

Page 306: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

307

Reversal distance problem: discussionp If we can find the minimal series of

reversals for some pair of genomesn Is that what happened during evolution?n If not, is it the correct number of reversals?

p In any case, reversal distance gives us ameasure of evolutionary distance betweenthe two genomes and species

Page 307: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

308

Solving the problem by sortingp Our first approach to solve the reversal

distance problem:n Examine each position i of the permutationn At each position, if i i, do a reversal such

that i = i

p This is a greedy approach: we try tochoose the best option at each step

Page 308: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

309

Simple reversal sort: example

6 1 2 3 4 5 -> 1 6 2 3 4 5 -> 1 2 6 3 4 5 -> 1 2 3 4 6 5

-> 1 2 3 4 5 6

Reversal series: p(1,2), p(2,3), p(3,4), p(5,6)

Is d(6 1 2 3 4 5) then 4?

6 1 2 3 4 5 -> 5 4 3 2 1 6 -> 1 2 3 4 5 6

D(6 1 2 3 4 5) = 2

Page 309: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

310

Pancake flipping problemp No pancake made by

the chef is of thesame size

p Pancakes need to berearranged beforedelivery

p Flipping operation:take some from thetop and flip them over

p This corresponds toalways reversing thesequence prefix

1 2 3 6 4 5 -> 6 3 2 1 4 5 ->

5 4 1 2 3 6 -> 3 2 1 4 5 6 ->

1 2 3 4 5 6

Page 310: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

311

How good is simple reversal sort?p Not so good actuallyp It has to do at most n-1 reversals with

permutation of length np The algorithm can return a distance that is

as large as (n – 1)/2 times the correctresult d( )n For example, if n = 1001, result can be as bad

as 500 x d( )

Page 311: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

312

Estimating reversal distance by cycledecompositionp We can estimate d( ) by cycle

decompositionp Lets represent permutation = 1 2 4 5 3

with the following graph

where edges correspond to adjacencies(identity, permutation F)

1 2 4 5 30 6

Page 312: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

313

Estimating reversal distance by cycledecompositionp Cycle decomposition: a set of cycles thatn have edges with alternating colorsn do not share edges with other cycles (=cycles

are edge disjoint)

1 2 4 5 30 6

1 2 4 5

Page 313: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

314

Cycle decompositionsp Let c( ) the maximum number of alternating,

edge-disjoint cycles in the graph representationof

p The following formula allows estimation of d( )n d( ) n + 1 – c( ), where n is the permutation length

1 2 4 5 30 6

1 2 4 5d( ) 5 + 1 – 4 = 2

Claim in Deonier: equality holds for ”most of the usual andinteresting biological systems.

Page 314: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

315

Cycle decompositionsp Cycle decomposition is NP-completen We cannot solve the general problem exactly

for large instances

p However, with signed data the problembecomes easyn Before going into signed data, lets discuss

another algorithm for the general case

Page 315: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

316

Computing reversals with breakpointsp Lets investigate a better way to compute

reversal distancep First, some concepts related to

permutation 1 2,,, n-1 nn Breakpoint: two elements i and i+1 are a

breakpoint, if they are not consecutivenumbers

n Adjacency: if i and i+1 are consecutive, theyare called adjacency

Page 316: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

317

Breakpoints and adjacencies

2 1 3 4 5 8 7 6

This permutation containsfour breakpoints begin-2, 13, 58, 6-end andfive adjacencies 21, 34, 45, 87, 76

Breakpoints

Page 317: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

318

Breakpointsp Each breakpoint in permutation needs to be

removed to get to the identity permutation (=ourtarget)n Identity permutation does not contain any breakpoints

p First and last positions special casesp Note that each reversal can remove at most two

breakpoints

p Denote the number of breakpoints by b( )

2 1 3 4 5 8 7 6 b( ) = 4

Page 318: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

319

Breakpoint reversal sortp Idea: try to remove as many breakpoints

as possible (max 2) in every step

1. While b( ) > 02. Choose reversal p that removes most breakpoints3. Perform reversal p to 4. Output 5. return

Page 319: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

320

Breakpoint removal: example8 2 7 6 5 1 4 3 b( ) = 6

2 8 7 6 5 1 4 3 b( ) = 5

2 3 4 1 5 6 7 8 b( ) = 3

4 3 2 1 5 6 7 8 b( ) = 2

1 2 3 4 5 6 7 8 b( ) = 0

Page 320: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

321

Breakpoint removalp The previous algorithm needs refinement

to be correctp Consider the following permutation:

1 5 6 7 2 3 4 8

p There is no reversal that decreases thenumber of breakpoints!

p See Jones & Pevzner for detaileddescription on this

Page 321: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

322

Breakpoint removalp Reversal can only decrease breakpoint

count if permutation contains decreasingstrips

1 5 6 7 2 3 4 8

1 5 6 7 4 3 2 8

1 2 3 4 7 6 5 8

Increasing strip

Decreasing strip

Strip: maximal segment without breakpoints

Page 322: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

323

Improved breakpoint reversal sort1. While b( ) > 02. If has a decreasing strip3. Do reversal p that removes most BPs4. Else5. Reverse an increasing strip6. Output 7. return

Page 323: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

324

Is Improved BP removal enough?p The algorithm works pretty well:n It produces a result that is at most four times

worse than the optimal resultn ...is this good?

p We considered only reversalsp What about translocations & duplications?

Page 324: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

325

Translocations via reversals

1 2 3 4 5 6 7 8

1 5 6 7 8 2 3 4

1 4 3 2 8 7 6 5

1 2 3 4 8 7 6 5

1 2 3 4 5 6 7 8

Translocation of 2,3,4

p(2,8)

p(2,4)

p(5,8)

Page 325: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

326

Genome rearrangements with reversalsp With unsigned data, the problem of finding

minimum reversal distances is NP-completen Why is this so if sorting is easy?

p An algorithm has been developed thatachieves 1.375-approximation

p However, reversal distance in signed datacan be computed quickly!n It takes linear time w.r.t. the length of

permutation (Bader, Moret, Yan, 2001)

Page 326: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

327

Cycle decomposition with signed datap Consider the following two permutations

that include orientation of markersn J: +1 +5 -2 +3 +4n K: +1 -3 +2 +4 -5

p We modify this representation a bit toinclude both endpoints of each marker:n J’: 0 1a 1b 5a 5b 2b 2a 3a 3b 4a 4b 6n K’: 0 1a 1b 3b 3a 2a 2b 4a 4b 5b 5a 6

Page 327: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

328

Graph representation of J’ and K’p Drawn online in lecture!

Page 328: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

329

Multiple chromosomesp In unichromosomal genomes, inversion

(reversal) is the most common operationp In multichromosomal genomes,

inversions, translocations, fissions andfusions are most common

Page 329: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

330

Multiple chromosomesp Lets represent multichromosomal genome

as a set of permutations, with $ denotingthe boundary of a chromosome:

5 9 $1 3 2 8 $7 6 4 $

This notation is frequently used in softwareused to analyse genome rearrangements.

Chr 1

Chr 2

Chr 3

Page 330: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

331

Multiple chromosomesp Note that when dealing with multiple

chromosomes, you need to specifynumbering for elements on both genomes

Page 331: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

332

Reversals & translocationsp Reversal p( , i, j)p Translocation p( , , i, j)

i

j

Translocation

Page 332: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

333

Fusions & fissionsp Fusion: merging of two chromosomesp Fission: chromosome is split into two

chromosomesp Both events can be represented with a

translocation

Page 333: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

334

Fusionp Fusion by translocation p( , , n+1, 1)

i = n + 1

j = 1

Fusion

Page 334: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

335

Fissionp Fission by translocation p( , , i, 1)

i

Empty chromosome

Fission

Page 335: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

336

Algorithms for general genomic distanceproblemp Hannenhalli, Pevzner: Transforming Men into

Mice (polynomial algorithm for genomic distanceproblem), 36th Annual IEEE Symposium onFoundations of Computer Science, 1995

Page 336: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

337

Human & mouse revisitedp Human and mouse are separated by about

75-83 million years of evolutionary historyp Only a few hundred rearrangements have

happened after speciation from thecommon ancestory

p Pevzner & Tesler identified in 2003 for 281synteny blocks a rearrangement frommouse to human withn 149 inversionsn 93 translocationsn 9 fissions

Page 337: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

338

Discussionp Genome rearrangement events are very

rare compared to, e.g., point mutationsn We can study rearrangement events further

back in the evolutionary history

p Rearrangements are easier to detect incomparison to many other genomic events

p We cannot detect homologs 100%correctly so the input permutation cancontain errors

Page 338: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

339

Discussionp Genome rearrangement is to some degree

constrained by the number and size ofrepeats in a genomen Notice how the importance of genomic repeats

pops up once again

p Sequencing gives us (usually) signed dataso we can utilize faster algorithms

p What if there are more than one optimalsolution?

Page 339: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

340

Two different genome rearrangement scenariosgiving the same result.

Page 340: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

341

GRIMM demonstration

Glenn Tesler, GRIMM: genome rearrangements web server.Bioinformatics, 2002,

Page 341: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

342

GRIMM file format

# useful comment about first genome# another useful comment about it>name of first genome1 -4 2 $ # chromosome 1-3 5 6 # chromosome 2>name of second genome5 -3 $6 $2 -4 1 $

http://grimm.ucsd.edu/GRIMM/grimm_instr.html

GRIMM supports analysis ofone, two or more genomes

Page 342: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

Introduction toBioinformatics

Phylogenetic trees

Page 343: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

344

Inferring the Past: PhylogeneticTreesp The biological problemp Parsimony and distance methodsp Models for mutations and estimation of

distancesp Maximum likelihood methods

Page 344: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

345

Phylogenyp We want to study ancestor-

descendant relationships, orphylogeny, among groups oforganisms

p Groups are called taxa (singular:taxon)

p Organisms are usually calledoperational taxonomic units orOTUs in the context ofphylogeny

Page 345: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

346

Phylogenetic treesp Leaves (external nodes)

~ species, observed(OTUs)

p Internal nodes ~ancestralspecies/divergenceevents, not observed

p Unrooted tree does notspecify ancestor-descendant relationshipsbeyond the observation”leaves are notancestors”

1

2

3

4

5

6

78

Unrooted tree with 5leaves and 3 internalnodes.

Is node 7 ancestor of node6?

Page 346: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

347

Phylogenetic treesp Rooting a tree specifies all

ancestor-descendantrelationships in the tree

p Root is the ancestor to theother species

p There are n-1 ways to roota tree with n nodes

1

2

3

4

5

6

78

R1 R2

2 3 4 51

6

7

8

R1

2 3 451

6

78

R2

root(R

1)

root(R2 )

Page 347: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

348

Questionsp Can we enumerate all possible

phylogenetic trees for n species (orsequences?)

p How to score a phylogenetic tree withrespect to data?

p How to find the best phylogenetic treegiven data?

Page 348: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

349

Finding the best phylogenetic tree:naive methodp How can we find the phylogenetic tree

that best represents the data?p Naive method: enumerate all possible

treesp How many different trees are there of n

species?p Denote this number by bn

Page 349: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

350

Enumerating unordered treesp Start with the only

unordered tree with 3leaves (b3 = 1)

p Consider all ways to adda leaf node to this tree

p Fourth node can be addedto 3 different branches(edges), creating 1 newinternal branch

p Total number of branchesis n external and n – 3internal branches

p Unrooted tree with nleaves has 2n – 3 branches

1 2

3

1 2

3

4

1 2

3

4

1 2

34

Page 350: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

351

Enumerating unordered treesp Thus, we get the number of unrooted trees

bn = (2(n – 1) – 3)bn-1 = (2n – 5)bn-1

= (2n – 5) * (2n – 7) * …* 3 * 1= (2n – 5)! / ((n-3)!2n-3), n > 2

p Number of rooted trees b’n isb’n = (2n – 3)bn = (2n – 3)! / ((n-2)!2n-2),

n > 2

that is, the number of unrooted trees times thenumber of branches in the trees

Page 351: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

352

Number of possible rooted andunrooted trees

8.20E+0212.22E+020204.95E+0388.69E+03630

344594252027025102027025135135913513510395810395954794510561051551534313b’nBnn

Page 352: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

353

Too many trees?p We can’t construct and evaluate every

phylogenetic tree even for a smallishnumber of species

p Better alternative is ton Devise a way to evaluate an individual tree

against the datan Guide the search using the evaluation criteria

to reduce the search space

Page 353: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

354

Inferring the Past: PhylogeneticTrees (chapter 12)p The biological problemp Parsimony and distance methodsp Models for mutations and estimation of

distancesp Maximum likelihood methods

Page 354: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

355

Parsimony methodp The parsimony method finds the tree that

explains the observed sequences with aminimal number of substitutions

p Method has two stepsn Compute smallest number of substitutions for

a given tree with a parsimony algorithmn Search for the tree with the minimal number of

substitutions

Page 355: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

356

Parsimony: an examplep Consider the following short sequences

1 ACTTT2 ACATT3 AACGT4 AATGT5 AATTT

p There are 105 possible rooted trees for 5sequences

p Example: which of the following treesexplains the sequences with least numberof substitutions?

Page 356: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

357

3

AACGT

4

AATGT

5

AATTT

2

ACATT

1

ACTTT

6 AATGT

7 AATTT

8 ACTTT

9 AATTT

T->C

T->G

T->A

A->C

This tree explains the sequenceswith 4 substitutions

Page 357: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

358

3

AACGT

4

AATGT

5

AATTT

2

ACATT

1

ACTTT

6 AATGT

7 AATTT

8 ACTTT

9 AATTT

T->C

T->G

T->A

A->C

3

AACGT

4

AATGT

5

AATTT

2

ACATT

1

ACTTT

6 ACCTTC->T

7 AACGT8 AATGT

9 AATTT

G->TT->C

T->G

A->C

C->A

6substitutions…

First tree ismoreparsimonious!

4substitutions…

Page 358: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

359

Computing parsimonyp Parsimony treats each site (position in a

sequence) independentlyp Total parsimony cost is the sum of parsimony

costs (=required substitutions) of each sitep We can compute the minimal parsimony cost for

a given tree byn First finding out possible assignments at each node,

starting from leaves and proceeding towards the rootn Then, starting from the root, assign a letter at each

node, proceeding towards leaves

Page 359: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

360

Labelling tree nodesp An unrooted tree with n leaves contains 2n-1

nodes altogetherp Assign the following labels to nodes in a rooted

treen leaf nodes: 1, 2, …, nn internal nodes: n+1, n+2, …, 2n-1n root node: 2n-1

p The label of a child node is alwayssmaller than the label of theparent node

2 3 4 51

6

8

7

9

Page 360: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

361

Parsimony algorithm: first phasep Find out possible assignments at every node for each

site u independently. Denote site u in sequence i bysi,u.

For i := 1, …, n doFi := {si,u} % possible assignments at node iLi := 0 % number of substitutions up to node i

For i := n+1, …, 2n-1 doLet j and k be the children of node iIf Fj Fk =

then Li := Lj + Lk + 1, Fi := Fj Fkelse Li := Lj + Lk, Fi := Fj Fk

Page 361: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

362

Parsimony algorithm: first phase

3

AACGT

4

AATGT

5

AATTT

2

ACATT

1

ACTTT

Choose u = 3 (for example, in general we do this for all sites)F1 := {T}L1 := 0F2 := {A}L2 := 0

F3 := {C}, L3 := 0

F4 := {T}, L4 := 0

F5 := {T}, L5 := 0

6

7

8

9

Page 362: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

363

Parsimony algorithm: first phase

3

AACGT

4

AATGT

5

AATTT

2

ACATT

1

ACTTT

6 {C,T}

7 T

8 {A, T}

9 T

F8 := F1 F2 = {A, T}L8 := L1 + L2 + 1 = 1

F6 := F3 F4 = {C, T}

L6 := L3 + L4 + 1 = 1

F7 := F5 F6 = {T}L7 := L5 + L6 = 1

F9 := F7 F8 = {T}L9 := L7 + L8 = 2 Parsimony cost for site 3 is 2

Page 363: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

364

Parsimony algorithm: second phasep Backtrack from the root and assign x Fi

at each nodep If we assigned y at parent of node i and y

Fi, then assign yp Else assign x Fi by random

Page 364: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

365

Parsimony algorithm: second phase

3

AACGT

4

AATGT

5

AATTT

2

ACATT

1

ACTTT

6 {C,T}

7 T

8 {A,T}

9 T

At node 6, thealgorithm assigns Tbecause T wasassigned to parentnode 7 and T F6.

T is assigned to node 8for the same reason.

The other nodes haveonly one possible letterto assign

Page 365: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

366

Parsimony algorithm

3

AACGT

4

AATGT

5

AATTT

2

ACATT

1

ACTTT

6 T

7 T

8 T

9 T

First and second phase arerepeated for each site inthe sequences,summing the parsimonycosts at each site

Page 366: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

367

Properties of parsimony algorithmp Parsimony algorithm requires that the sequences

are of same lengthn First align the sequences against each other and,

optionally, remove indelsn Then compute parsimony for the resulting sequencesn Indels (if present) considered as characters

p Is the most parsimonious tree the correct tree?n Not necessarily but it explains the sequences with least

number of substitutionsn We can assume that the probability of having fewer

mutations is higher than having many mutations

Page 367: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

368

Finding the most parsimonious treep Parsimony algorithm calculates the

parsimony cost for a given tree…p …but we still have the problem of finding

the tree with the lowest costp Exhaustive search (enumerating all trees)

is in general impossiblep More efficient methods exist, for examplen Probabilistic searchn Branch and bound

Page 368: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

369

Branch and bound in parsimonyp We can exploit the fact that adding edges

to a tree can only increase the parsimonycost

1

AATGT

2

AATTT

3

AACGT

1

AATGT

2

AATTT

{T}{T}

{C, T}

cost 0 cost 1

Page 369: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

370

Branch and bound in parsimonyBranch and bound isa general searchstrategy where

p Each solution ispotentially generated

p Track is kept of thebest solution found

p If a partial solutioncannot achieve betterscore, we abandonthe current searchpath

In parsimony…p Start from a tree with 1

sequencep Add a sequence to the tree

and calculate parsimonycost

p If the tree is complete,check if found the best treeso far

p If tree is not complete andcost exceeds best treecost, do not continueadding edges to this tree

Page 370: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

371

Branch and bound example

1 1 2 31 2 31 2 4

Complete tree:compute parsimonycost

Example with 4 sequences

Partial tree:Compute parsimony costand compare against bestso far;Do not continue expansionif above cost of the best tree

31 2 421 3 4

21 3 4

1 321 3

Page 371: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

372

Distance methodsp The parsimony method works on sequence

(character string) datap We can also build phylogenetic trees in a

more general settingp Distance methods work on a set of

pairwise distances dij for the datap Distances can be obtained from

phenotypes as well as from genotypes(sequences)

Page 372: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

373

Distances in a phylogenetic treep Distance matrix D = (dij)

gives pairwise distancesfor leaves of thephylogenetic tree

p In addition, thephylogenetic tree willnow specify distancesbetween leaves andinternal nodesn Denote these with dij as

well

2 3 4 51

6

7

8

Distance dij states howfar apart species i and jare evolutionary (e.g.,number of mismatches inaligned sequences)

Page 373: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

374

Distances in evolutionary contextp Distances dij in evolutionary context

satisfy the following conditionsn Symmetry: dij = dji for each i, jn Distinguishability: dij 0 if and only if i jn Triangle inequality: dij dik + dkj for each i, j, k

p Distances satisfying these conditions are calledmetric

p In addition, evolutionary mechanisms mayimpose additional constraints on the distances

additive and ultrametric distances

Page 374: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

375

Additive treesp A tree is called additive, if the distance

between any pair of leaves (i, j) is thesum of the distances between the leavesand a node k on the shortest path from ito j in the tree

dij = dik + djk

p ”Follow the path from the leaf i to the leafj to find the exact distance dij between theleaves.”

Page 375: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

376

Additive trees: example

0244D

2044C

4402B

4420A

DCBAA

B

C

D

1

1

2 1

1

Page 376: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

377

Ultrametric treesp A rooted additive tree is called an ultrametric

tree, if the distances between any two leaves iand j, and their common ancestor k are equal

dik = djk

p Edge length dij corresponds to the time elapsedsince divergence of i and j from the commonparent

p In other words, edge lengths are measured by amolecular clock with a constant rate

Page 377: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

378

Identifying ultrametric datap We can identify distances to be ultrametric

by the three-point condition:D corresponds to an ultrametric tree ifand only if for any three species i, j andk, the distances satisfy dij max(dik, dkj)

p If we find out that the data is ultrametric, we canutilise a simple algorithm to find thecorresponding tree

Page 378: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

379

Ultrametric trees9

8

7

5 4 3 2 1

6

Observation time

Tim

e

Page 379: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

380

Ultrametric trees9

8

7

5 4 3 2 1

6

Observation time

Tim

e

Only vertical segments of thetree have correspondence tosome distance dij:

Horizontal segments act asconnectors.

d8,9

Page 380: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

381

Ultrametric trees9

8

7

5 4 3 2 1

6

Observation time

Tim

e

dik = djk for any two leavesi, j and any ancestor k ofi and j

Page 381: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

382

Ultrametric trees9

8

7

5 4 3 2 1

6

Observation time

Tim

e

Three-point condition: there areno leafs i, j for which dij > max(dik, djk)for some leaf k.

Page 382: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

383

UPGMA algorithmp UPGMA (unweighted pair group method

using arithmetic averages) constructs aphylogenetic tree via clustering

p The algorithm works by at the same timen Merging two clustersn Creating a new node on the tree

p The tree is built from leaves towards theroot

p UPGMA produces a ultrametric tree

Page 383: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

384

Cluster distancesp Let distance dij between clusters Ci and Cj

be

that is, the average distance betweenpoints (species) in the cluster.

Page 384: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

385

UPGMA algorithmp Initialisation

n Assign each point i to its own cluster Ci

n Define one leaf for each sequence, and place it at heightzero

p Iterationn Find clusters i and j for which dij is minimaln Define new cluster k by Ck = Ci Cj, and define dkl for all ln Define a node k with children i and j. Place k at height dij/2n Remove clusters i and j

p Termination:n When only two clusters i and j remain, place root at height dij/2

Page 385: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

386

1 2

3

4

5

Page 386: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

387

1 2

3

4

51 2

6

Page 387: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

388

1 2

3

4

51 2 4 5

6 7

Page 388: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

389

1 2

3

4

51 2 4 5

6 7

8

3

Page 389: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

390

1 2

3

4

51 2 4 5

6 7

8

3

9

Page 390: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

391

UPGMA implementationp In naive implementation, each iteration

takes O(n2) time with n sequences =>algorithm takes O(n3) time

p The algorithm can be implemented to takeonly O(n2) time (see Gronau & Moran,2006, for a survey)

Page 391: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

392

Problem solved?p We now have a simple algorithm which finds a

ultrametric treen If the data is ultrametric, then there is exactly one

ultrametric tree corresponding to the data (we skip theproof)

n The tree found is then the ”correct” solution to thephylogeny problem, if the assumptions hold

p Unfortunately, the data is not ultrametric inpracticen Measurement errors distort distancesn Basic assumption of a molecular clock does not hold

usually very well

Page 392: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

393

Incorrect reconstruction of non-ultrametric data by UPGMA

1

2 3

41 2 34

Tree which correspondsto non-ultrametricdistances

Incorrect ultrametric reconstructionby UPGMA algorithm

Page 393: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

394

Checking for additivityp How can we check if our data is additive?p Let i, j, k and l be four distinct species

p Compute 3 sums: dij + dkl, dik + djl, dil+ djk

Page 394: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

395

Four-point conditioni

j l

k i

j l

k i

j l

kdik

djl

dil

djk

dij dkl

p The sums are represented by the three figuresn Left and middle sum cover all edges, right sum does not

p Four-point condition: i, j, k and l satisfy the four-point condition if two of the sums dij + dkl, dik +djl, dil + djk are the same, and the third one issmaller than these two

Page 395: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

396

Checking for additivityp An n x n matrix D is additive if and only if

the four point condition holds for every 4distinct elements 1 i, j, k, l n

Page 396: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

397

Finding an additive phylogenetic treep Additive trees can be found with, for example,

the neighbor joining method (Saitou & Nei, 1987)p The neighbor joining method produces unrooted

trees, which have to be rooted by other meansn A common way to root the tree is to use an outgroupn Outgroup is a species that is known to be more distantly

related to every other species than they are to eachother

n Root node candidate: position where the outgroup wouldjoin the phylogenetic tree

p However, in real-world data, even additivityusually does not hold very well

Page 397: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

398

Neighbor joining algorithmp Neighbor joining works in a similar fashion

to UPGMAn Find clusters C1 and C2 that minimise a

function f(C1, C2)n Join the two clusters C1 and C2 into a new

cluster Cn Add a node to the tree corresponding to Cn Assign distances to the new branches

p Differences inn The choice of function f(C1, C2)n How to assign the distances

Page 398: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

399

Neighbor joining algorithmp Recall that the distance dij for clusters Ci and Cj

was

p Let u(Ci) be the separation of cluster Ci fromother clusters defined by

where n is the number of clusters.

Page 399: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

400

Neighbor joining algorithmp Instead of trying to choose the clusters Ci

and Cj closest to each other, neighborjoining at the same timen Minimises the distance between clusters Ci and

Cj andn Maximises the separation of both Ci and Cj

from other clusters

Page 400: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

401

Neighbor joining algorithmp Initialisation as in UPGMAp Iteration

n Find clusters i and j for which dij – u(Ci) – u(Cj) is minimaln Define new cluster k by Ck = Ci Cj, and define dkl for all ln Define a node k with edges to i and j. Remove clusters i and jn Assign length ½ dij + ½ (u(Ci) – u(Cj)) to the edge i -> kn Assign length ½ dij + ½ (u(Cj) – u(Ci)) to the edge j -> k

p Termination:n When only one cluster remains

Page 401: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

402

Neighbor joining algorithm: examplea b c d

a 0 6 7 5b 0 11 9c 0 6d 0

i u(i)a (6+7+5)/2 = 9b (6+11+9)/2 = 13c (7+11+6)/2 = 12d (5+9+6)/2 = 10

i,j dij – u(Ci) – u(Cj)a,b 6 - 9 - 13 = -16a,c 7 - 9 - 12 = -14a,d 5 - 9 - 10 = -14b,c 11 - 13 - 12 = -14b,d 9 - 13 - 10 = -14c,d 6 - 12 - 10 = -16

Choose either pairto join

Page 402: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

403

Neighbor joining algorithm: examplea b c d

a 0 6 7 5b 0 11 9c 0 6d 0

i u(i)a (6+7+5)/2 = 9b (6+11+9)/2 = 13c (7+11+6)/2 = 12d (5+9+6)/2 = 10

i,j dij – u(Ci) – u(Cj)a,b 6 - 9 - 13 = -16a,c 7 - 9 - 12 = -14a,d 5 - 9 - 10 = -14b,c 11 - 13 - 12 = -14b,d 9 - 13 - 10 = -14c,d 6 - 12 - 10 = -16

a b c d

e

dae = ½ 6 + ½ (9 – 13) = 1dbe = ½ 6 + ½ (13 – 9) = 5

dbedae

This is the first step only…

Page 403: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

404

Inferring the Past: PhylogeneticTrees (chapter 12)p The biological problemp Parsimony and distance methodsp Models for mutations and estimation of

distancesp Maximum likelihood methodsn These parts of the book is skipped on this

course (see slides of 2007 course for materialon these topics)

n No questions in exams on these topics!

Page 404: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

405

Problems with tree-buildingp Assumptionsn Sites evolve independently of one othern (Sites evolve according to the same stochastic

model; not really covered this year)n The tree is rootedn The sequences are alignedn Vertical inheritance

Page 405: Introduction to Bioinformatics - Computer Science · 20 A good biology course for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti)

406

Additional material on phylogenetictreesp Durbin, Eddy, Krogh, Mitchison: Biological

sequence analysisp Jones, Pevzner: An introduction to

bioinformatics algorithmsp Gusfield: Algorithms on strings, trees, and

sequences

p Course on phylogenetic analyses in Spring2009