Bioinformatics


Page 1: Bioinformatics

Bioinformatics

Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly

2010-2011

Lecture 9: Phylogenetic Prediction

Aleppo University, Faculty of Technical Engineering, Department of Biotechnology

Page 2: Bioinformatics

Phylogenetic Trees and Dissimilarity estimation

Page 3: Bioinformatics

Historical Note

• Until the mid-1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
• Since then, the focus has been on objective criteria for constructing phylogenetic trees
  – Thousands of articles in the last decades
• Important for many aspects of biology
  – Classification
  – Understanding biological mechanisms

Page 4: Bioinformatics

Morphological vs. Molecular

• Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.

• Modern biological methods make it possible to use molecular features
  – Gene sequences
  – Protein sequences
  – DNA markers

Page 5: Bioinformatics

From sequences to a phylogenetic tree

Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG

There are many possible types of sequences to use (e.g. mitochondrial vs. nuclear proteins).

Page 6: Bioinformatics

Basic Assumptions

Closely related organisms have more similar genomes.

Highly similar genes are homologous (have the same ancestor).

A phylogenetic relationship can be expressed by a dendrogram (a “tree”).

[Figure: dendrogram with leaves Aardvark, Bison, Chimp, Dog, Elephant]

Page 7: Bioinformatics

Dangers in Molecular Phylogenies

• Gene or protein sequences can be homologous for several different reasons:

• Orthologs -- genes in different species that evolved from a common ancestral gene via speciation

• Paralogs -- sequences that diverged after a duplication event

• Xenologs -- sequences that diverged after a horizontal transfer (e.g., by a virus)

Page 8: Bioinformatics

[Figure: a species phylogeny and the gene phylogenies inside it; labels: Speciation events, Gene Duplication; leaves 1A, 2A, 3A, 3B, 2B, 1B]

Phylogenies can be constructed to describe the evolution of genes.

Three species, termed 1, 2, 3. Two paralogous genes, A and B.

Page 9: Bioinformatics

Types of Trees

A natural model to consider is that of rooted trees.

[Figure: rooted tree; the root is labelled "Common Ancestor"]

Page 10: Bioinformatics

Types of Trees

An unrooted tree represents the same phylogeny without the root node.

Depending on the model, data from present-day species do not distinguish between different placements of the root.

Page 11: Bioinformatics

Distance-Based Method

Input: distance matrix between species

For two sequences s_i and s_j, perform a pairwise (global) alignment. Let f be the fraction of sites with different residues. Then

$$d_{ij} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}f\right) \quad \text{(Jukes-Cantor model)}$$

Outline:
• Cluster species together
• Initially clusters are singletons
• At each iteration combine the two “closest” clusters to get a new one
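A minimal Python sketch of the Jukes-Cantor correction d = -(3/4) ln(1 - (4/3) f) may help here. The function name `jukes_cantor` and the sample values of f are illustrative assumptions, not part of the lecture; the value 0.1594 is borrowed from the spreadsheet formula that appears later in the slides.

```python
import math


def jukes_cantor(f: float) -> float:
    """Jukes-Cantor distance from the fraction f of differing sites.

    The correction is only defined for f < 0.75, the difference
    expected between two unrelated random sequences.
    """
    if not 0.0 <= f < 0.75:
        raise ValueError("f must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * f)


# Illustrative fractions of differing sites (assumed values).
for f in (0.05, 0.1594, 0.5):
    print(f"f = {f:.4f}  ->  d = {jukes_cantor(f):.4f}")
```

The corrected distance is always at least as large as f, because it accounts for multiple substitutions at the same site that a simple count cannot see.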

Page 12: Bioinformatics
Page 13: Bioinformatics

Human, Chimp, Gorilla, Orangutan, and Gibbon

Page 14: Bioinformatics

UPGMA

Taxa   1 2 3 4 5 6 7
OTU-1  T G C G T A T
OTU-2  T G G G T A T
OTU-3  T G C G C T T
OTU-4  T G C T G T G
OTU-5  T A G T A G C

Step 1: Generate data (Sequence/ Genotype/ Morphological) for each OTU.

Page 15: Bioinformatics

Distance can be calculated using different substitution models:
1. Number of nucleotide differences
2. p-distance
3. JC distance
4. K2P distance
5. F81
6. HKY85
7. GTR, etc.

Step 2: Calculate the p-distance for all pairs of taxa

Taxa   1 2 3 4 5 6 7
OTU-1  T G C G T A T
OTU-2  T G G G T A T

OTU-1 and OTU-2 differ at 1 of 7 sites (site 3), so p = 1/7 = 0.142857143
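Step 2 can be sketched in a few lines of Python. The sequences are the five OTUs from Step 1; the dictionary name `otus` and the function name `p_distance` are illustrative choices, not names used in the lecture.

```python
from itertools import combinations

# The five OTUs from Step 1, one character per site.
otus = {
    "OTU-1": "TGCGTAT",
    "OTU-2": "TGGGTAT",
    "OTU-3": "TGCGCTT",
    "OTU-4": "TGCTGTG",
    "OTU-5": "TAGTAGC",
}


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(x != y for x, y in zip(a, b))
    return differences / len(a)


# Print the p-distance for every pair of OTUs.
for (name_a, seq_a), (name_b, seq_b) in combinations(otus.items(), 2):
    print(f"{name_a} vs {name_b}: {p_distance(seq_a, seq_b):.4f}")
```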

Page 16: Bioinformatics

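Step 3: Calculate the distance matrix for all pairs of taxa and select the pair of taxa with the minimum distance as a new OTU.

(Upper triangle: number of nucleotide differences; lower triangle: p-distance.)

Taxa  1       2       3       4       5
1     0       1       2       4       6
2     0.1428  0       3       5       5
3     0.2857  0.4285  0       3       6
4     0.5714  0.7142  0.4285  0       5
5     0.8571  0.7142  0.8571  0.7142  0

[Figure: OTU-1 and OTU-2 are joined; each branch has length 0.0714]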

Page 17: Bioinformatics

Step 4: Recalculate the distance matrix, treating OTU-1 and OTU-2 as one OTU.

d(1+2, 3) = (d(1,3) + d(2,3)) / 2 = (0.2857 + 0.4285) / 2 = 0.3571

taxa  1+2      3       4       5
1+2   0
3     0.35714  0
4     0.64285  0.4285  0
5     0.78571  0.8571  0.7142  0

Page 18: Bioinformatics

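Step 5: Select the pair of taxa with the minimum distance as a new OTU.

[Figure: tree with OTU-1 and OTU-2 (branch length 0.071 each), an internal branch of 0.107 above their common node, and OTU-3 (branch length 0.179)]

0.107 + 0.071 + 0.179 = 0.357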

Page 19: Bioinformatics

Step 6: Again select the pair of OTUs with the minimum distance as a new OTU and recalculate the distance matrix.

d((1+2)3, 4) = (d(1,4) + d(2,4) + d(3,4)) / 3 = (0.5714 + 0.7142 + 0.4285) / 3 = 0.5714

taxa    (1+2)3  4       5
(1+2)3  0
4       0.5714  0
5       0.8095  0.7142  0

Page 20: Bioinformatics

Step 7: Again select the pair of taxa with the minimum distance as a new OTU.

[Figure: tree with OTU-1 and OTU-2 (branch length 0.071 each), OTU-3 (0.179), OTU-4 (0.286), and two internal branches of 0.107 each]

0.107 + 0.107 + 0.071 + 0.286 = 0.571

Page 21: Bioinformatics

Step 8: Again select the pair of OTUs with the minimum distance as a new OTU and recalculate the distance matrix.

d(((1+2)3)4, 5) = (d(1,5) + d(2,5) + d(3,5) + d(4,5)) / 4 = (0.8571 + 0.7142 + 0.8571 + 0.7142) / 4 = 0.7857

taxa       ((1+2)3)4  5
((1+2)3)4  0
5          0.7857     0

Page 22: Bioinformatics

Step 9: Again select the pair of OTUs with the minimum distance as a new OTU and make the final rooted tree.

[Figure: final rooted tree with OTU-1 and OTU-2 (branch length 0.071 each), OTU-3 (0.179), OTU-4 (0.286), OTU-5 (0.393), and three internal branches of 0.107 each]

0.393 + 0.107 + 0.107 + 0.107 + 0.071 = 0.785
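Steps 3 to 9 can be reproduced with a short Python sketch of UPGMA. This is an illustration under stated assumptions rather than the lecture's own implementation: the function name `upgma`, the cluster labels, and the use of `frozenset` keys are choices made here, and the input distances are the difference counts from Step 3 divided by the 7 sites.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster with UPGMA; return a list of (merged_label, node_height).

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of OTUs
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2  # each side gets half the distance
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"({a}+{b})"
        # Size-weighted average = mean over all original OTU pairs.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append((merged, height))
    return merges


# Number of differing sites (out of 7) for each pair, from Step 3.
diff_counts = {
    (1, 2): 1, (1, 3): 2, (1, 4): 4, (1, 5): 6, (2, 3): 3,
    (2, 4): 5, (2, 5): 5, (3, 4): 3, (3, 5): 6, (4, 5): 5,
}
names = ["1", "2", "3", "4", "5"]
dist = {frozenset((str(i), str(j))): n / 7 for (i, j), n in diff_counts.items()}

merges = upgma(names, dist)
for label, height in merges:
    print(f"{label}: node height {height:.3f}")
```

The node heights correspond to the leaf branch lengths in the figures (0.071, 0.179, 0.286, 0.393), and each internal branch of 0.107 is the difference between two consecutive heights.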

Page 23: Bioinformatics

Jukes-Cantor distance

The rate of nucleotide substitution is the same for all pairs of the four nucleotides A, T, C, and G.

The 16 possible nucleotide pairs:

A A   A C   A G   A T
C A   C C   C G   C T
G A   G C   G G   G T
T A   T C   T G   T T

Only 4 of the 16 pairs match, so two random sequences are 25% similar (a distance of 0.75). A difference of 75% is what you expect with random assignment of nucleotides to a pair of taxa.

Page 24: Bioinformatics

The UPGMA method assumes a constant rate along the branches of the phylogenetic tree.

Jukes-Cantor distance for f = 0.1594, written as a spreadsheet formula:

=-(3/4)*LN(1-(((4/3)*0.1594)))

Page 25: Bioinformatics
Page 26: Bioinformatics
Page 27: Bioinformatics
Page 28: Bioinformatics

Neighbor-joining method

• The Fitch-Margoliash method does not rely on a constant rate along the branches of the phylogenetic tree, as the UPGMA method does.

• This method relies on identifying the closest pairs of the studied units with the smallest total branch length. A pair of neighbors can be defined as two units connected through a single unrooted node.

• Example: human and chimpanzee are joined in one unit, unlike human and gorilla. We therefore call the first unit (human and chimpanzee) a neighbor of gorilla. After examining the relationship between the first unit and gorilla, we look for the relationship with the remaining members of the studied set.

Page 29: Bioinformatics

Neighbor-joining method

• Example with eight studied individuals: we start the comparison as if all of them were connected to a single node. Then, once the link between 1 and 2 is established, the tree becomes [figure]

Page 30: Bioinformatics

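Neighbor-joining method

               A  B      C      D      E      r
A (human)      -  0.015  0.045  0.143  0.198  0.4010
B (chimp)         -      0.030  0.126  0.179  0.3500
C (gorilla)              -      0.092  0.179  0.3460
D (orangutan)                   -      0.179  0.5400
E (gibbon)                             -      0.7350

r is the sum of the distances from each taxon to all the others, e.g. r(A) = 0.015 + 0.045 + 0.143 + 0.198 = 0.4010.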

Page 31: Bioinformatics

Neighbor-joining method

(Upper triangle: distances d_ij; lower triangle: transformed values M_ij.)

               A        B        C        D        E
A (human)      -        0.0150   0.0450   0.1430   0.1980
B (chimp)      -0.3605  -        0.0300   0.1260   0.1790
C (gorilla)    -0.3285  -0.3180  -        0.0920   0.1790
D (orangutan)  -0.3275  -0.3190  -0.3510  -        0.1790
E (gibbon)     -0.3700  -0.3635  -0.3615  -0.4585  -

A:B = 0.015 - (0.4010 + 0.3500)/2 = -0.3605

Page 32: Bioinformatics

Example:

               A  B      C      D      E      r       r/3
A (human)      -  0.015  0.045  0.143  0.198  0.4010  0.1337
B (chimp)         -      0.030  0.126  0.179  0.3500  0.1167
C (gorilla)              -      0.092  0.179  0.3460  0.1153
D (orangutan)                   -      0.179  0.5400  0.1800
E (gibbon)                             -      0.7350  0.2450

Branch length from D to the new node = 0.179/2 + (0.18 - 0.245)/2 = 0.057

Branch length from E to the new node = 0.179 - 0.057 = 0.122
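The first neighbor-joining step on this primate matrix can be checked with a small Python sketch. The variable and function names are illustrative assumptions. The transformed values in the slides divide r_i + r_j by 2, whereas the usual textbook formulation of neighbor-joining divides by N - 2 (here 3); that second form is not stated in the slides and is included only for comparison. Both pick the same pair for this data.

```python
names = ["A", "B", "C", "D", "E"]  # human, chimp, gorilla, orangutan, gibbon
upper = {
    ("A", "B"): 0.015, ("A", "C"): 0.045, ("A", "D"): 0.143, ("A", "E"): 0.198,
    ("B", "C"): 0.030, ("B", "D"): 0.126, ("B", "E"): 0.179,
    ("C", "D"): 0.092, ("C", "E"): 0.179,
    ("D", "E"): 0.179,
}


def d(i, j):
    """Symmetric lookup into the upper-triangle distance table."""
    if i == j:
        return 0.0
    return upper.get((i, j), upper.get((j, i)))


n = len(names)
# Net divergence r: sum of distances from each taxon to all others.
r = {i: sum(d(i, j) for j in names) for i in names}


def transformed(i, j, divisor):
    """M_ij = d_ij - (r_i + r_j) / divisor."""
    return d(i, j) - (r[i] + r[j]) / divisor


pairs = list(upper)
best_slide = min(pairs, key=lambda p: transformed(*p, 2))         # as in the slides
best_standard = min(pairs, key=lambda p: transformed(*p, n - 2))  # textbook form

# Branch lengths from the chosen pair to their new internal node.
i, j = best_slide
branch_i = d(i, j) / 2 + (r[i] - r[j]) / (2 * (n - 2))
branch_j = d(i, j) - branch_i

print("pair joined first:", best_slide, best_standard)
print(f"branch {i}: {branch_i:.3f}, branch {j}: {branch_j:.3f}")
```

The pair with the smallest transformed value is joined and replaced by a new node, after which the matrix is reduced and the procedure repeats.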

Page 33: Bioinformatics

Human and chimpanzee have the smallest value of Mij and they are replaced by node 2.

Page 34: Bioinformatics
Page 35: Bioinformatics

d_ij, M_ij

Page 36: Bioinformatics

• PHYLIP (Phylogeny Inference Package)

• UPGMA

• Neighbor-joining (NJ)

[Figure: tree for A, B, C, D, E with internal nodes 1, 2, 3; branch lengths a = 0.016, b = -0.001, c = 0.006, d = 0.057, e = 0.122, 1' = 0.0403, 2' = 0.024]

Page 37: Bioinformatics
Page 38: Bioinformatics

Genetic distance

        Marker1  Marker2  Marker3  Marker4  Marker5  Marker6  Marker7
Plant1  1        0        1        1        0        1        1
Plant2  1        1        1        0        0        1        0

            Plant1 = 1  Plant1 = 0
Plant2 = 1  Fa = 3      Fb = 1
Plant2 = 0  Fc = 2      Fd = 1

N = Fa + Fb + Fc + Fd = 7

Simple Match distance = Fa/N = 3/7 = 0.43
Genetic distance (Jaccard) = Fa/(Fa+Fb+Fc) = 3/6 = 0.5
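The counts and the two ratios on this slide can be reproduced with a short Python sketch. The function name `contingency` is an illustrative assumption, and the formulas are the ones given on the slide. Two hedged remarks: the simple matching coefficient is more commonly defined as (Fa + Fd)/N, which also counts shared absences, and both ratios grow with similarity, so a dissimilarity would be one minus the value.

```python
plant1 = [1, 0, 1, 1, 0, 1, 1]
plant2 = [1, 1, 1, 0, 0, 1, 0]


def contingency(x, y):
    """Count marker patterns for two binary profiles.

    Returns (fa, fb, fc, fd):
      fa = present in both, fb = present only in y,
      fc = present only in x, fd = absent in both.
    """
    fa = sum(a == 1 and b == 1 for a, b in zip(x, y))
    fb = sum(a == 0 and b == 1 for a, b in zip(x, y))
    fc = sum(a == 1 and b == 0 for a, b in zip(x, y))
    fd = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return fa, fb, fc, fd


fa, fb, fc, fd = contingency(plant1, plant2)
n = fa + fb + fc + fd

slide_simple_match = fa / n        # formula as written on the slide
jaccard = fa / (fa + fb + fc)      # ignores markers absent in both plants

print(f"Fa={fa} Fb={fb} Fc={fc} Fd={fd} N={n}")
print(f"Fa/N = {slide_simple_match:.2f}, Jaccard = {jaccard:.2f}")
```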

Page 39: Bioinformatics
Page 40: Bioinformatics

Dissimilarity indices – Continuous

Euclidean distance

Euclidean distance is the most commonly used distance. In most cases, when people talk about distance they mean Euclidean distance. Euclidean distance, or simply 'distance', is the square root of the sum of squared differences between the coordinates of a pair of objects.

Page 41: Bioinformatics

Dissimilarity indices – Continuous

Euclidean distance

Example: Point A has coordinates (0, 3, 4, 5) and point B has coordinates (7, 6, 3, -1). The Euclidean distance between points A and B is

$$d_{AB} = \sqrt{(0-7)^2 + (3-6)^2 + (4-3)^2 + (5-(-1))^2} = \sqrt{95} \approx 9.747$$

Features  cost  time  weight  incentive
Plant A   0     3     4       5
Plant B   7     6     3       -1
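The Euclidean example for Plant A and Plant B can be checked with a minimal Python sketch; the function name `euclidean` is an illustrative choice.

```python
import math

plant_a = (0, 3, 4, 5)    # cost, time, weight, incentive
plant_b = (7, 6, 3, -1)


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(f"Euclidean distance A-B: {euclidean(plant_a, plant_b):.3f}")
```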

Page 42: Bioinformatics

Manhattan (City-Block)

Also known as Manhattan distance, boxcar distance, or absolute value distance. It is the sum of the absolute differences between the coordinates of a pair of objects.

Features  cost  time  weight  incentive
Plant A   0     3     4       5
Plant B   7     6     3       -1
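For the same two plants, the city-block distance can be sketched as follows; the function name `manhattan` is an illustrative choice.

```python
plant_a = (0, 3, 4, 5)    # cost, time, weight, incentive
plant_b = (7, 6, 3, -1)


def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(x, y))


# |0-7| + |3-6| + |4-3| + |5-(-1)| = 7 + 3 + 1 + 6
print("Manhattan distance A-B:", manhattan(plant_a, plant_b))
```

Unlike the Euclidean distance, a large difference in one feature is not squared, so a single outlying coordinate dominates the total less.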

Page 43: Bioinformatics

Thank you

Practical session: hands-on application with the PAST program