Genome Analysis II Comparative Genomics Jiangbo Miao Apr. 25, 2002 CISC889-02S: Bioinformatics.

Genome Analysis II

Comparative Genomics

Jiangbo Miao

Apr. 25, 2002

CISC889-02S: Bioinformatics

Why Comparative Genomics?

• It tells us what is common and what is unique between different species at the genome level.

• Genome comparison may be the surest and most reliable way to identify genes and predict their functions and interactions.

– e.g., to distinguish orthologs from paralogs

• The functions of human genes and other DNA regions can be revealed by studying their counterparts in lower organisms.

Outline

• All-against-all Self-comparison of Proteome

• Between-proteome Comparisons

• Family and Domain Analysis

• Ancient Conserved Regions (ACRs)

• Horizontal Gene Transfer

• Functional Classification of Genes

• Gene-order Comparisons

All-against-all Self-comparison

• How?

– Make a database of the proteome

– Use each protein as a query in a similarity search against the database (BLAST, WU-BLAST, or FASTA)

– Generate a matrix of alignment scores (P or E values), using a conservative cutoff E value of 10^-6

• Why?

– Number of Gene Families

This comparison distinguishes unique proteins from proteins that have arisen from gene duplication, and also reveals the number of gene families.

– Paralogs

Significantly matched pairs of protein sequences may be paralogs.
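The scoring step above can be sketched as follows. This is a minimal illustration, assuming the search results are available as tab-separated lines in the 12-column BLAST tabular layout (query ID in column 1, subject ID in column 2, E value in column 11). The function names and the toy protein IDs are hypothetical; the 10^-6 cutoff is the one quoted on the slide.

```python
from collections import defaultdict

E_CUTOFF = 1e-6  # conservative cutoff quoted on the slide


def significant_pairs(tabular_lines, cutoff=E_CUTOFF):
    """Return {(a, b): best E value} for distinct protein pairs with E <= cutoff."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject:
            continue  # a protein always matches itself; not informative
        pair = tuple(sorted((query, subject)))  # treat A-B and B-A as one pair
        if evalue <= cutoff and evalue < best.get(pair, float("inf")):
            best[pair] = evalue
    return best


def paralog_candidates(pairs):
    """Map each protein to the set of proteins it matches significantly."""
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    return dict(partners)


# Toy search output (hypothetical IDs and scores).
example = [
    "P1\tP1\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t600",
    "P1\tP2\t62.0\t290\t110\t2\t5\t294\t3\t290\t1e-40\t150",
    "P2\tP1\t62.0\t290\t110\t2\t3\t290\t5\t294\t3e-40\t149",
    "P3\tP1\t25.0\t80\t60\t1\t1\t80\t10\t89\t0.5\t30",
]
pairs = significant_pairs(example)
candidates = paralog_candidates(pairs)
print(pairs)
print(candidates)
```

Proteins that never appear in `candidates` (here `P3`) have no significant match within the proteome and would count as unique.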

All-against-all Comparison: Example

Cluster Analysis

• To sort out relationships among all of the proteins found to be related in the above search.

• Clustering organizes the proteins into groups by some objective criterion:

– P or E value (< 0.01-0.05)

– Distance between each pair of sequences in a multiple sequence alignment (# of amino acid changes between the aligned seqs.)

• Methods:

– By Making Sub-graphs

– By Single Linkage

Clustering by making subgraphs

• Each protein sequence is a vertex;

• Each matched pair of sequences with a significant score is joined by an edge

• The edges are weighted according to the P/E value

• Simple Algorithm: Remove weaker links (From the weakest one)

• Rubin et al. (2000)

– Edges with E value > 10^-6 are removed

– Remaining subgraphs comprise sequences that share a significant relationship to each other but not to other seq.

– Criterion: the group should mutually share >= 2/3 of all of the edges from this group to all proteins in the proteome

: This algorithm favors the selection of proteins with the same domain structure, which suggests that these proteins are most probably paralogs
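A minimal sketch of the subgraph idea, under stated assumptions: edges are given as a dictionary of protein pairs to E values, weak edges (E > 10^-6) are dropped, and the remaining connected components are reported as clusters. The helper `internal_edge_fraction` is one possible reading of the 2/3 criterion (edges inside the group divided by all edges touching the group); it is an illustration, not the procedure of Rubin et al. All names and toy values are hypothetical.

```python
def cluster_subgraphs(edges, cutoff=1e-6):
    """edges: {(a, b): evalue}. Drop edges with E > cutoff, return connected components."""
    neighbors = {}
    for (a, b), evalue in edges.items():
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if evalue <= cutoff:  # keep only significant links
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, clusters = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over the remaining edges
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


def internal_edge_fraction(group, edges):
    """Fraction of the edges touching `group` that stay entirely inside it."""
    members = set(group)
    touching = [pair for pair in edges if members & set(pair)]
    if not touching:
        return 0.0
    internal = [pair for pair in touching if set(pair) <= members]
    return len(internal) / len(touching)


# Toy graph (hypothetical proteins and E values).
edges = {("A", "B"): 1e-30, ("B", "C"): 1e-12, ("C", "D"): 1e-3, ("D", "E"): 1e-50}
clusters = cluster_subgraphs(edges)
fraction = internal_edge_fraction(clusters[0], edges)
print(clusters, fraction)
```

In the toy graph the weak C-D link is removed, which separates the proteins into two groups.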

Clustering by making subgraphs: Example

Clustering by single linkage

• Based on the distance criterion

• A group of related sequences found in the all-against-all proteome comparison is subjected to an MSA (CLUSTALW).

• A distance matrix is made

• This matrix is used to cluster the sequences by a neighbor-joining algorithm (the same procedure as that used to make a phylogenetic tree)

• Cluster representation: Tree or Dendrogram

• As smaller groups are chosen, the most strongly supported clusters are more likely to be made up of paralogs(?)
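The distance-based grouping can be sketched as below. Note the assumptions: the slide builds the tree with neighbor-joining, whereas this sketch uses plain single-linkage merging with a distance threshold and does not produce a tree. The distance is the number of differing aligned residues, the rows are assumed to come from an existing alignment, and all names, sequences, and the threshold are hypothetical.

```python
from itertools import combinations


def count_changes(seq_a, seq_b):
    """Aligned columns where both rows have a residue and the residues differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != "-" and b != "-" and a != b)


def distance_matrix(aligned):
    """aligned: {name: aligned row}. Returns {frozenset({a, b}): distance}."""
    return {
        frozenset((a, b)): count_changes(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }


def single_linkage(names, dist, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [{name} for name in names]
    while len(clusters) > 1:
        # Single linkage: cluster distance is the smallest member-to-member distance.
        d, i, j = min(
            (min(dist[frozenset((a, b))] for a in clusters[i] for b in clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if d > threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy alignment rows (hypothetical sequences).
aligned = {"s1": "MKTAYIAK", "s2": "MKTAYIAR", "s3": "MKSAYLAR", "s4": "MRSGFLCR"}
dist = distance_matrix(aligned)
loose = single_linkage(list(aligned), dist, threshold=2)
tight = single_linkage(list(aligned), dist, threshold=1)
print(loose, tight)
```

Lowering the threshold yields smaller, more strongly supported groups, which is the situation the last bullet above refers to.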

Clustering by single linkage: Example

Core Proteome

• All-against-all comparison reveals the # of protein/gene families in an organism.

• This number represents the core proteome of the organism from which all biological functions have diversified.

Organism                    # of genes   # of gene families   # of duplicated genes
H. influenzae (bacterium)   1709         1425                 284
S. cerevisiae (yeast)       6241         4383                 1858
C. elegans (worm)           18,424       9453                 8971
D. melanogaster (fly)       13,600       8065                 5536

* In Haemophilus, 1247 out of 1709 proteins do not have paralogs

* Core proteome of the multicellular organisms is only twice that of yeast

Between-Proteome Comparisons: Why?

• To identify orthologs, gene families, and domains

• Orthologs: proteins that share a common ancestry & function

– A pair of proteins in two organisms that align along most of their lengths with a highly significant alignment score.

– These proteins perform the core biological functions shared by the two organisms.

– Two matched sequences (X in A, Y in B) may not be orthologs (e.g., Y and Z are paralogs in B, and X and Z are the orthologs)

– Identify true orthologs:

(a) highest-scoring match (best hit)

(b) E value < 0.01

(c) > 60% alignment over both proteins
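The three criteria for orthologs can be sketched as a filter over search hits. This is a minimal illustration, assuming each hit records its score, E value, alignment length, and the lengths of both proteins; the `Hit` record, the function name, and the toy values are hypothetical, and a single alignment length is used for both proteins as a simplification.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str        # protein in organism A
    subject: str      # protein in organism B
    evalue: float
    score: float
    aligned_len: int  # residues in the alignment (simplified: same for both proteins)
    query_len: int
    subject_len: int


def candidate_orthologs(hits, max_evalue=0.01, min_coverage=0.60):
    """Return {query: subject} for best hits that pass the E value and coverage tests."""
    best = {}
    for hit in hits:
        # (a) keep only the highest-scoring match per query
        if hit.query not in best or hit.score > best[hit.query].score:
            best[hit.query] = hit

    result = {}
    for query, hit in best.items():
        # (c) more than 60% of both proteins must be in the alignment
        coverage_ok = (
            hit.aligned_len / hit.query_len > min_coverage
            and hit.aligned_len / hit.subject_len > min_coverage
        )
        # (b) E value below 0.01
        if hit.evalue < max_evalue and coverage_ok:
            result[query] = hit.subject
    return result


# Toy hits (hypothetical): X matches both Z and its paralog Y; W has a short local match.
hits = [
    Hit("X", "Z", 1e-80, 400.0, 280, 300, 310),
    Hit("X", "Y", 1e-60, 320.0, 270, 300, 305),
    Hit("W", "V", 1e-5, 60.0, 50, 300, 120),
]
orthologs = candidate_orthologs(hits)
print(orthologs)
```

In the toy data, X is paired with Z rather than with the lower-scoring paralog Y, and the W-V match is rejected because it covers too little of either protein.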

Between-Proteome Comparisons: How?

1. Choose a yeast protein and perform a database similarity search of the worm proteome (WU-BLAST): a yeast-versus-worm search

2. Group the worm seqs that match the yeast query seq with a highly significant P value (10^-10 to 10^-100); also include the yeast query seq in the group

3. From the group made in 2, choose a worm seq and make a search of the yeast proteome, using the same P limit

4. Add any matching yeast seq to the group made in 2

5. Repeat 3 & 4 for all initially matched seqs in the group

6. Repeat 1-5 for every yeast protein

7. As in 1-6, perform a comparable worm-versus-yeast search

8. Coalesce the groups of related seqs and remove any redundancies so that every sequence is represented only once

9. Eliminate any matched pairs in which less than 80% of each seq is in the alignment
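The net effect of steps 1-8 can be sketched with a union-find structure: every significant yeast-worm match joins two sequences, and the resulting connected sets are the coalesced groups in which each sequence appears once. This is an assumption-laden illustration, not the original pipeline: the inputs are taken to be match lists that have already passed the P limit and the 80% alignment requirement of step 9, and all names are hypothetical.

```python
def build_groups(yeast_vs_worm, worm_vs_yeast):
    """Inputs map a query protein to the set of proteins it matched in the other proteome.

    Returns a sorted list of groups; each group is a sorted list of (species, protein).
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Yeast-versus-worm search (steps 1-6).
    for yeast, worms in yeast_vs_worm.items():
        for worm in worms:
            union(("yeast", yeast), ("worm", worm))
    # Comparable worm-versus-yeast search (step 7).
    for worm, yeasts in worm_vs_yeast.items():
        for yeast in yeasts:
            union(("worm", worm), ("yeast", yeast))

    # Step 8: each sequence ends up in exactly one coalesced group.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Toy match lists (hypothetical protein IDs).
yeast_vs_worm = {"Y1": {"W1", "W2"}, "Y2": {"W3"}, "Y3": set()}
worm_vs_yeast = {"W2": {"Y1", "Y4"}, "W3": {"Y2"}}
groups = build_groups(yeast_vs_worm, worm_vs_yeast)
print(groups)
```

Y4 joins the first group only through the worm-versus-yeast direction, which is why the search is run both ways; Y3 has no match and stays out of every group.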

Between-Proteome Comparison: Result

Cut-off P value                                              < 10^-10    < 10^-20    < 10^-50    < 10^-100
# of seq groups                                              1171        984         552         236
# of groups with >2 members                                  560         442         230         79
# and % of all yeast proteins (6217) represented in groups   2697 (40)   1848 (30)   888 (14)    330 (5)
# and % of all worm proteins represented in groups           3653 (19)   2497 (13)   1094 (6)    370 (2)

* The sequences in each group also align over at least 80% of their lengths, so they represent highly conserved sets of genes

Clusters of Orthologous Groups (COG)

• Motivation

In the above database search, a protein seq will match not only the orthologous seq in the second proteome, but also the paralogs of that orthologous seq.

• Objective

To identify all matching proteins as an orthologous group related by both speciation (ortholog) and gene duplication (paralog) events.

• Meaning

COGs usually correspond to classes of metabolic function

• Application (example)

– Produce a COG database by analysis of microbial & yeast genomes

– Search a newly identified microbial protein in this database

– A significant match provides an indication of its metabolic function

Comparison of Proteome to EST database

• Why?

– For many organisms (eukaryotic), complete genome seq not available

– While a large collection of EST seqs is available

• An EST database of an organism can also be analyzed for the presence of gene families, orthologs, and paralogs.

– e.g. a protein from the yeast or fly proteome can be used as a query of a human EST database

– (translate EST seq in all six possible reading frames)

• Problem: EST seqs are usually short (the equivalent of 100-150 amino acids)

• Solution

– identify overlapping EST seqs: a longer alignment can be produced

– perform an exhaustive search for a protein family
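Translating an EST in all six reading frames, as mentioned above, can be sketched as follows. This is a minimal illustration using the standard genetic code; the function names and the toy sequence are hypothetical, and codons containing characters other than A, C, G, T are reported as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(est):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse complement)."""
    est = est.upper()
    reverse = est.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(est[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCCAAATAG")  # toy sequence, not a real EST
for name, protein in sorted(frames.items()):
    print(name, protein)
```

Each of the six translations can then be compared with the protein query; only one frame, at most, corresponds to the real coding sequence.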

Search for orthologs to a protein family in EST database

• [Retief et al. (1999)] FAST-PAN scans an EST database with multiple queries from a protein family, sorts the alignment scores, and produces charts and alignments of the matches found.

• Example

– Protein family: glutathione transferase proteins

– Mammalian EST database

– TFASTY3 search system

– Shown are matches of two mouse ESTs to a query seq

Search for orthologs to a protein family in EST database

• A large number of known glutathione transferase proteins was first subjected to MSA, and a phylogenetic tree was made to identify classes of proteins within the family

• The object was to choose class representatives

[Figures: class tree, flow chart, search, and result]

http://www.cis.udel.edu/~decker/courses/889d-bio/figure/gr_tree.jpeg

http://www.cis.udel.edu/~decker/courses/889d-bio/figure/gr-result.jpg

http://www.cis.udel.edu/~decker/courses/889d-bio/figure/gr.jpg.4f1

http://www.cis.udel.edu/~decker/courses/889d-bio/figure/gr-alignment.jpg

Family and Domain Analysis

• What is a domain?

– Proteins are modular & often comprise separate domains

– Domains represent modules of structure and function

• Domain Comparison

– Comparison of the domain content of a proteome with that of another proteome reveals the biological roles of diverse domains in different organisms.

• Example : an analysis of fly, worm, & yeast proteomes

– 744 families and domains were common to all three org.

– More than 2000 fly & worm proteins are multidomain proteins (about 1/3 as many in yeast)
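The domain comparison above amounts to set operations on per-proteome domain annotations. A minimal sketch, assuming each proteome is given as a mapping from protein to its list of domain families (such annotations would come from a separate domain search, which is not shown); the function names and toy annotations are hypothetical.

```python
def domain_families(proteins):
    """All domain families seen in one proteome. proteins: {protein: [families]}."""
    return {family for families in proteins.values() for family in families}


def shared_families(domain_content):
    """Families present in every organism. domain_content: {organism: {protein: [families]}}."""
    per_organism = [domain_families(proteins) for proteins in domain_content.values()]
    return set.intersection(*per_organism)


def multidomain_count(proteins):
    """Number of proteins annotated with more than one domain."""
    return sum(1 for families in proteins.values() if len(families) > 1)


# Toy annotations (hypothetical proteins; family labels are only placeholders).
toy = {
    "fly": {"f1": ["kinase", "SH2"], "f2": ["homeobox"]},
    "worm": {"w1": ["kinase"], "w2": ["SH2", "SH3"], "w3": ["homeobox"]},
    "yeast": {"y1": ["kinase"], "y2": ["zinc_finger"]},
}
common = shared_families(toy)
counts = {organism: multidomain_count(proteins) for organism, proteins in toy.items()}
print(common, counts)
```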

Ancient Conserved Regions (ACRs)

• What is an ACR?

In phylogenetically diverse groups of organisms, there are proteins or protein domains that have been conserved over long periods of evolutionary time.

• How to find ACRs?

– Database similarity search of the SwissProt database with human, worm, yeast and E. coli genes

– Identify matches with sequence from a different phylum than the query sequence

– The number of ACRs may be estimated by the proportion of genes that match database sequence of known function

e.g. 70% prokaryotic genomes contain ACRs

Horizontal Gene Transfer

• Horizontal Transfer (HT)

the acquisition of genetic material from a different organism; the transferred material then becomes a permanent addition to the recipient

(HT is a significant source of genome variation for bacteria)

• Comparisons of bacterial genomes reveal that they are mosaics of ancestral (vertical) and horizontally transferred seqs.

– 12.8% of the genome of E. coli is due to HT DNA (the highest level)

• How to detect HT?

– Fact: the genome of each bacterial species has a characteristic base composition

– HT can be detected as an island of seq with different composition

– If the amino acid composition of transferred genes is typical, these islands may be detected by a codon usage analysis

– The time of the transfer may be estimated by the degree of “blend”
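Detecting an island of atypical base composition can be sketched with a sliding window over GC content. This is a minimal illustration under stated assumptions: GC fraction stands in for "base composition", and the window size, step, and deviation threshold are arbitrary illustrative values rather than published settings. Codon usage analysis is not shown.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def atypical_windows(genome, window=1000, step=500, max_deviation=0.1):
    """Return (genome-wide GC, windows whose GC differs from it by more than max_deviation)."""
    overall = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - overall) > max_deviation:
            flagged.append((start, start + window, round(gc, 3)))
    return overall, flagged


# Toy genome: 50% GC background with a 1000 bp AT-only insert in the middle.
genome = "ATGC" * 500 + "ATAT" * 250 + "ATGC" * 500
overall, islands = atypical_windows(genome, window=1000, step=500, max_deviation=0.2)
print(overall, islands)
```

A recently acquired island stands out sharply, as in the toy genome; an older one would have drifted toward the host composition and would need a lower threshold to be seen.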

Functional Classification of Genes

• Genes that are significantly similar in an organism, i.e., paralogous seqs, frequently are found to have a related biological function.

• Classification Scheme

– Eight related groups of E. coli genes: enzymes, transport elements, regulators, membranes, structural elements, protein factors, leader peptides, and carriers.

90% of E. coli genes fell into these same broad categories

– Specialized commissions, e.g. the Enzyme Commission of the IUBMB, provide detailed classes of enzymes based on the biochemical reactions they catalyze

– Examine relationships among multiple enzymes that perform the same biochemical function in the same organism. (these enzymes showed variations in metabolic regulation of their activity)

Gene Order Comparison

• Observations about gene order

– Gene order is highly conserved in closely related species but becomes changed by rearrangements over evolutionary time

– Groups of genes that have a similar biological function tend to remain localized in a group or cluster

• Chromosomal Rearrangement

– Occasional chromosomal breaks (random chromosomal location)

– Random rejoining of the fragments by a DNA repair mechanism

• Rearrangement Analysis

– By comparing the location of orthologs

Chromosomal Rearrangement

Computational Analysis of Genome Rearrangements

• Challenges

– The number and types of rearrangements that have occurred

– When they occurred

• Example: a comparison of human and mouse chromosomes

• Computational Approach

– Genome alignment

– Alignment reduction : reconstruct the number and types of rearrangement

Computational Analysis of Genome Rearrangement

Human chromosomes were cut into > 100 pieces and reassembled into a reasonable facsimile of the mouse chromosome.

Computational Analysis of Gene Rearrangement

• Lines indicate homologous position

• The more rearrangements there are, the more intersections will occur

• [Sankoff & Goldstein(1989)] devised a shuffling model for estimating the # of rearrangements given the # of intersections.

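[Figure: lines joining homologous positions A and B in two genomes, with the order A, B in one and B, A in the other; circular chromosome case]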
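Counting the intersections between the lines that join homologous positions can be sketched as below. This is a minimal illustration for two linear gene orders only; it counts crossings and does not implement the Sankoff & Goldstein shuffling model or the circular case. The function name and gene labels are hypothetical.

```python
def count_intersections(order_a, order_b):
    """Number of crossing line pairs when homologous genes in two linear orders are joined."""
    position_in_b = {gene: index for index, gene in enumerate(order_b)}
    # Positions in genome B of the shared genes, listed in genome A order.
    shared = [position_in_b[gene] for gene in order_a if gene in position_in_b]
    crossings = 0
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            if shared[i] > shared[j]:  # the two lines cross
                crossings += 1
    return crossings


same = count_intersections(["A", "B", "C", "D"], ["A", "B", "C", "D"])
swapped = count_intersections(["A", "B", "C", "D"], ["B", "A", "C", "D"])
reversed_order = count_intersections(["A", "B", "C", "D"], ["D", "C", "B", "A"])
print(same, swapped, reversed_order)
```

Two lines cross exactly when the corresponding pair of genes appears in opposite relative order in the two genomes, so more rearrangement generally means more crossings.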

Computational Analysis of Gene Rearrangement

Assume that those rearrangements occurred through transposition or recombination events, and identify the rearrangements by “undoing” those events.

The goal is to minimize the number of rearrangements, which represents a genetic distance between the two genome sequences
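Finding a minimum number of "undo" steps can be illustrated with a brute-force search. This sketch makes strong simplifying assumptions: only unsigned segment reversals are allowed (no transpositions or translocations), both genomes carry the same genes once each on a single linear chromosome, and the breadth-first search is practical only for a handful of genes. The function name is hypothetical.

```python
from collections import deque


def reversal_distance(source, target):
    """Minimum number of segment reversals that turn source into target."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("both orders must contain the same genes")

    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        order, steps = queue.popleft()
        if order == target:
            return steps  # breadth-first search reaches target by a shortest path
        n = len(order)
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                # Reverse the segment order[i:j] (at least two genes long).
                candidate = order[:i] + order[i:j][::-1] + order[j:]
                if candidate not in seen:
                    seen.add(candidate)
                    queue.append((candidate, steps + 1))
    return None  # not reached when both orders hold the same genes


one_step = reversal_distance("ABCD", "ACBD")
two_steps = reversal_distance((1, 2, 3, 4, 5), (3, 2, 1, 5, 4))
print(one_step, two_steps)
```

The returned count plays the role of the genetic distance described above, restricted to the reversal-only model.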

Clusters of Genes on Chromosomes

• In a given organism, genes are found in a given order that is maintained on the chromosomes.

• On the other hand, genes with a related function are frequently found to be clustered at one chromosome location

• Example : tryptophan genes in different prokaryotic organisms

• Observation:

– At least some of the trp genes are also clustered together on the chromosomes of other species of Bacteria & Archaea

– The order of genes within the cluster is conserved within the first four species (bacteria)

– The order is much less conserved in the last three species (Archaea)

– Gene fusions occur, generating a new protein that performs both biochemical functions of the single-gene parent proteins.

Clusters of Genes on Chromosomes

Clusters of Genes on Chromosomes

• How to identify those clusters or coordinately regulated genes? [Overbeek et al. (1999)]

– Perform a full reciprocal search between the proteomes of two org.

– Protein pairs that were each other's best hit in the other genome and had an E value < 10^-5 were identified; each such pair is called a bidirectional best hit (BBH)

– Pairs of close BBH (PCBBH) that are within 300 bp of each other on the chromosomes of the respective organisms and that are transcribed from the same strand, i.e., are in a “typical” operon, were then identified

– A score for these pairs was formulated. When the # of organisms in which the pair is observed is greater and the phylogenetic distance between the organisms is larger, this score is higher

: 40% of these pairs with higher score correspond to proteins that are known to act in a common metabolic pathway.

A significant proportion of the pairs of PCBBH correspond to genes that have a related function and lie on the same pathway.
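The BBH and PCBBH steps can be sketched as below. This is a minimal illustration, assuming hits are (query, subject, E value) tuples and gene positions are (start, end, strand) records; the function names and toy data are hypothetical, and the scoring of pairs across many organisms is not implemented.

```python
def best_hits(hits, max_evalue=1e-5):
    """hits: iterable of (query, subject, evalue). Return {query: best subject}."""
    best = {}
    for query, subject, evalue in hits:
        if evalue < max_evalue and (query not in best or evalue < best[query][1]):
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b, max_evalue)
    b_best = best_hits(b_vs_a, max_evalue)
    return {(a, b) for a, b in a_best.items() if b_best.get(b) == a}


def close_bbh_pairs(bbh, genes_a, genes_b, max_gap=300):
    """Pairs of BBHs whose genes are within max_gap bp and on the same strand in both genomes.

    genes_a / genes_b: {gene: (start, end, strand)}.
    """
    def close(g1, g2, genes):
        start1, end1, strand1 = genes[g1]
        start2, end2, strand2 = genes[g2]
        gap = max(start1, start2) - min(end1, end2)  # negative if the genes overlap
        return strand1 == strand2 and gap <= max_gap

    ordered = sorted(bbh)
    result = []
    for index, (a1, b1) in enumerate(ordered):
        for a2, b2 in ordered[index + 1:]:
            if close(a1, a2, genes_a) and close(b1, b2, genes_b):
                result.append(((a1, b1), (a2, b2)))
    return result


# Toy search results and gene coordinates (hypothetical).
a_vs_b = [("a1", "b1", 1e-50), ("a1", "b2", 1e-20), ("a2", "b2", 1e-40), ("a3", "b3", 1e-3)]
b_vs_a = [("b1", "a1", 1e-48), ("b2", "a2", 1e-41), ("b2", "a1", 1e-19), ("b3", "a3", 1e-3)]
genes_a = {"a1": (100, 1000, "+"), "a2": (1150, 2000, "+")}
genes_b = {"b1": (5000, 5900, "-"), "b2": (6100, 7000, "-")}

bbh = bidirectional_best_hits(a_vs_b, b_vs_a)
pcbbh = close_bbh_pairs(bbh, genes_a, genes_b)
print(sorted(bbh), pcbbh)
```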
