IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES. L. Coin and R. Durbin, Wellcome Trust Sanger Institute. BIOINFORMATICS 2004. Presented by: Oscar Sanchez Plazas


Page 1: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

L. Coin and R. Durbin
Wellcome Trust Sanger Institute

BIOINFORMATICS 2004

Presented by: Oscar Sanchez Plazas

Page 2: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Outline

• Problem definition
• Previous works on pseudogene identification
• Proposed method
• Protein domain profile (Pfam)
• Algorithm
• Results and Discussion

Page 3: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Pseudogene Identification

Pseudogene: Remnants of genomic sequences of genes that are no longer translated into functional proteins.

Non-processed (duplicated):
• Product of genome duplication (paralogous)
• Loss of function at the transcription or translation level

Processed (~70%):
• Product of retro-transposition
• No introns, no promoter

(*) Plagiarized Errors and Molecular Genetics. Edward E. Max, M.D.

Page 4: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

(*) http://www.pseudogene.org/definition.html

Page 5: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Pseudogenes

Significance:
• Comparative genomics
• Evolution of DNA, new gene expression patterns
• Study of mechanisms for regulation of gene expression
• Verification of gene sequences in databases

Page 6: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Pseudogenes

Are they functional? (Why high conservation compared to prokaryotes?)

“Pseudogenes exhibit evolutionary conservation of gene sequence, reduced nucleotide variability, excess synonymous over nonsynonymous nucleotide polymorphism, and other features that are expected in genes or DNA sequences that have functional roles” (1)

(1) Pseudogenes: Are They “Junk” or Functional DNA? Evgeniy S. Balakirev, Francisco Ayala. 2003

- An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene. Hirotsune, S. et al. Nature. 2003

- The putatively functional Mkrn1-p1 pseudogene is neither expressed nor imprinted, nor does it regulate its source gene in trans. Gray TA, Wilson A, Fortin PJ, Nicholls RD. PNAS. 2006

(*) www.answersingenesis.org/tj/v17/i2/pseudogene.asp

Page 7: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Problem

Sometimes pseudogenes are mis-annotated in gene sequence databases as functional genes.

Key Insight: Employ an evolutionary constraint model derived from a functional characterization of the gene product.

Constrained vs. neutral model

Page 8: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Previous approaches

(*) Large-scale analysis of pseudogenes in the human genome. Zhaolei Zhang, Mark Gerstein

Presence of stop codons and frameshifts. Not very sensitive (only ~50% are detectable).
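
The disablement check can be sketched in a few lines. This is a minimal illustration, not the cited authors' pipeline: the function name `find_disablements` is hypothetical, the input is assumed to be a coding sequence read in frame from its first base and ending in a stop codon, and a length that is not a multiple of three is used as a crude stand-in for a frameshift.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_disablements(cds):
    """Return (premature_stop_codon_indices, length_off_frame) for a CDS.

    Assumption: `cds` starts in frame. The last complete codon is skipped
    because a terminal stop codon is expected there.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    premature_stops = [
        index for index, codon in enumerate(codons[:-1])
        if CODON_TABLE.get(codon) == "*"
    ]
    # Crude frameshift proxy: the sequence length is not a multiple of three.
    length_off_frame = len(cds) % 3 != 0
    return premature_stops, length_off_frame
```

A sequence with neither signal passes this check unflagged, which is the sensitivity limit the slide points out.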

Page 9: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Previous approaches

Ratio of non-synonymous to synonymous substitutions (dN/dS). Not very accurate: e.g. a gene under positive selection pressure.

(*) Genome-wide survey of human pseudogenes. Torrents,D., Suyama,M., Zdobnov,E. and Bork,P.
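
To make the synonymous/non-synonymous distinction concrete, the sketch below counts differing codon pairs between two aligned coding sequences. It is only a raw count under stated assumptions: the function name `count_codon_changes` is hypothetical, both sequences are assumed to be aligned in frame, and no normalization by the number of synonymous and non-synonymous sites is done, so the result is not a proper dN/dS estimate.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codon pairs."""
    synonymous = nonsynonymous = 0
    length = min(len(seq_a), len(seq_b))
    for i in range(0, length - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip codons containing gaps or ambiguous bases
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous
```

A high share of non-synonymous changes can come from relaxed constraint or from positive selection, which is why the ratio alone can mislead.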

Page 10: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Model Proposed

PSILC: Pseudogene inference from loss of constraint (log-odds score)

• Protein domain evolution (functional constraint)
- Null probability model (Pfam)
• Neutral nucleotide model
• Protein coding model

Page 11: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Domain Profile - HMM

Protein Domains: structural, functional and evolutionary units of proteins

HMM profiles: the most sensitive models for domains

• Every state has a particular emission distribution over {A,C,T,G}

(*) genome.nasa.gov/MediaLib/hmm_project_fig2.jpg

[Figure: profile HMM with match, deletion and insertion states]
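
As a small illustration of per-state emission distributions, the sketch below scores a sequence against match states only. Everything in it is hypothetical: the three-column profile, the uniform background, and the function name `match_log_odds`. Insertion and deletion states are left out, so the sequence must have exactly one symbol per match column.

```python
import math

# Hypothetical uniform background distribution over the four bases.
BACKGROUND = {base: 0.25 for base in "ACGT"}

# Hypothetical profile: one emission distribution per match state.
MATCH_EMISSIONS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},       # column strongly prefers A
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},       # column strongly prefers G
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # unconstrained column
]


def match_log_odds(sequence, emissions=MATCH_EMISSIONS, background=BACKGROUND):
    """Sum of log(emission / background) over match columns (no gaps)."""
    if len(sequence) != len(emissions):
        raise ValueError("sequence must have one symbol per match column")
    return sum(
        math.log(column[symbol] / background[symbol])
        for column, symbol in zip(emissions, sequence)
    )
```

A sequence following the preferred symbols scores above zero, while one that ignores them scores below zero.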

Page 12: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

(*) http://pfam.sanger.ac.uk//family/TAF

Page 13: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Model Proposed

Objective
• Look at the pattern of substitution in conserved protein domains

Algorithm

Input
• Alignment A
• Unrooted tree T
• Profile HMM D (aligned with A)

Output
• Score for a leaf of the tree, representing the belief that the node corresponds to a pseudogene.

Page 14: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Algorithm

Notation
• Xn. : row corresponding to leaf node n
• X.i : i-th column
• A\Xn. : alignment A excluding Xn.
• mj : j-th match column of the profile HMM
• pn : parent node of n
• bn : branch from pn to n
• T\bn : tree T excluding bn

Page 15: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Algorithm

Input: Unrooted tree T, alignment A, profile HMM D

Output: Log-odds scores:
• A neutral nucleotide model compared to a Pfam domain encoding model (PSILC-nuc/dom)
• A protein coding model compared to a Pfam domain encoding model (PSILC-prot/dom)

Evolutionary model

Page 16: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Algorithm

• Independence assumptions

• xni is independent of the other columns in its row, given A\xn.

• xni is independent of the other columns in A\xn., given x.i\xni

• Tree assumption: xni is independent of x.i\xni, given xpni
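
Under these independence assumptions the probability of row n factorizes over columns, so a log-odds score reduces to a sum of per-column log ratios. The sketch below shows only that summation. The function name `log_odds_score` is hypothetical, the per-column probabilities are assumed to have been computed elsewhere, and the sign convention (first model in the numerator, so a higher score favours it) is an assumption.

```python
import math


def log_odds_score(column_probs_model_a, column_probs_model_b):
    """Sum over columns of log P_a(x_ni | rest) - log P_b(x_ni | rest).

    Each argument is a list with one probability per alignment column,
    assumed to be precomputed under the respective model.
    """
    if len(column_probs_model_a) != len(column_probs_model_b):
        raise ValueError("both models need one probability per column")
    return sum(
        math.log(p_a) - math.log(p_b)
        for p_a, p_b in zip(column_probs_model_a, column_probs_model_b)
    )
```

With the neutral nucleotide model as model A and the domain encoding model as model B, this would correspond to a score in the style of PSILC-nuc/dom.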

Page 17: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Algorithm

Steps:
• Calculate the distribution at xpni given the evolutionary constraints on the other branches.
• For each residue/base at xpni, calculate the transition probability to xni given the evolutionary constraints.
• pn is set as the root of T
• Prior distribution: stationary distribution of Q
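
The two steps can be sketched as follows, for a single alignment column. This is an illustration under assumptions, not the paper's implementation: `parent_posterior` and `predict_leaf` are hypothetical names, and the conditional likelihoods of the rest of the tree given each parent state are assumed to have been computed already (for example by a pruning pass over T\bn).

```python
def parent_posterior(prior, subtree_likelihoods):
    """Step 1: distribution at the parent state x_pn,i.

    prior: stationary distribution of Q, one entry per state.
    subtree_likelihoods: list of vectors; each gives P(data on one of the
    other branches | parent state), assumed precomputed.
    """
    posterior = list(prior)
    for likelihood in subtree_likelihoods:
        posterior = [p * l for p, l in zip(posterior, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]


def predict_leaf(parent_distribution, transition):
    """Step 2: distribution of x_ni, summing over parent states.

    transition[a][b] is the probability of moving from parent state a
    to leaf state b along branch bn.
    """
    n_states = len(transition[0])
    return [
        sum(parent_distribution[a] * transition[a][b]
            for a in range(len(parent_distribution)))
        for b in range(n_states)
    ]
```

The probability of the observed xni is then read off the predicted distribution at the index of the observed state.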

Page 18: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Evolutionary Model

Instantaneous rate matrix (Q)*:
• DNA models: HKY model (π - uniform)
• Amino acid model: database estimates (WAG, π)

π - steady state distribution (vs. equilibrium):
• Alternative models: observed in A
• Null model: distribution of the state in the HMM

Parameters (ML):
• f: trade-off of mutation pressure (from-to)
• r: evolutionary rate
• κ: ratio transition/transversion

(*) A Novel Use of Equilibrium Frequencies in Models of Sequence Evolution. Nick Goldman and Simon Whelan
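
For the DNA case, a rate matrix of the HKY form and the transition probabilities it implies can be sketched as below. This is the textbook HKY parameterization, not necessarily the exact variant used in the paper: the function names are hypothetical, `kappa` stands for the transition/transversion ratio, the matrix is not rescaled to one expected substitution per unit time, and the f parameter from the slide is not modelled. The matrix exponential uses a plain Taylor series, which is adequate only for small rate-times-time values.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A, G) and pyrimidine<->pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(pi, kappa):
    """HKY-style Q: q[i][j] = kappa * pi[j] for transitions, pi[j] otherwise."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = pi[j] * (kappa if (a, b) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])  # each row of a rate matrix sums to zero
    return q


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, t, terms=40):
    """P(t) = exp(Q t), approximated by a truncated Taylor series."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result
```

With this construction the vector `pi` is the stationary distribution of Q, which matches its use as the prior at the root.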

Page 19: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Algorithm

Directionality of the calculation
• The score on an alignment of two transcripts x1., x2. is not symmetric (detailed balance).
• If base x1i is more likely than x2i at a particular match state, but the two are equally likely under the protein model, the score for x2. being a pseudogene is higher than the score for x1.
• dN/dS does not have this property (a third sequence should be used).
• Requires a Pfam model (independent)
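
A toy calculation shows why the score depends on which sequence is being tested. It swaps in a simple F81-style substitution model, which is an assumption made for brevity and not the model on the slides. The match-state distribution, the uniform comparison model, the branch length `t`, and the names `f81_transition` and `pseudogene_score` are all hypothetical, and the other sequence is treated as the conditioning state.

```python
import math


def f81_transition(parent, child, stationary, t):
    """F81-style probability of observing `child` given `parent` after time t."""
    stay = math.exp(-t)
    same = stay if parent == child else 0.0
    return same + (1.0 - stay) * stationary[child]


def pseudogene_score(target, other, match_state, t=0.5):
    """log P(target | other, uniform model) - log P(target | other, match state)."""
    uniform = {base: 0.25 for base in "ACGT"}
    p_uniform = f81_transition(other, target, uniform, t)
    p_domain = f81_transition(other, target, match_state, t)
    return math.log(p_uniform / p_domain)


# Hypothetical match state that strongly prefers A.
match_state = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}

# Column i holds A in x1 and C in x2.
score_x1 = pseudogene_score("A", "C", match_state)  # testing x1
score_x2 = pseudogene_score("C", "A", match_state)  # testing x2
```

The sequence carrying the base that the match state disfavours (x2 here) gets the higher pseudogene score, even though the pair differs by a single substitution.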

Page 20: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Results

Data:
• Chromosome 6, human genome
• Manually annotated (pseudo)genes
• Blast search-ENSEMBL e<10^-7 (>80%) (<99%)
• Multiple alignment: ClustalW
• Max. likelihood distance. Nearest neighbor tree.
• 598 (875) coding transcripts, 97 (158) pseudogenes

Page 21: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Results

ROC

Why is PSILC-prot/dom better than PSILC-nuc/dom?

Page 22: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Results

• Better discrimination

Page 23: IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

Question

What is the main difference between the HMMs previously studied (e.g. pairwise alignment) and the HMM profiles? Why are the latter HMMs important for the identification of pseudogenes?