The Role of Algorithmic Research in Computational Genomics

The Role of Algorithmic Research in Computational Genomics Richard M. Karp IEEE Computer Society Bioinformatics Conference August 14, 2003


Page 1: The Role of Algorithmic Research in Computational Genomics

The Role of Algorithmic Research in Computational Genomics

Richard M. Karp

IEEE Computer Society Bioinformatics Conference

August 14, 2003

Page 2: The Role of Algorithmic Research in Computational Genomics

Algorithmic Research in Computer Science

• Computer science is a "science of the artificial."

• Problems are precisely stated and are often generic rather than application-specific.

• The quality of an algorithm is measured by its worst-case time bound.

• Mathematical elegance is just as important as relevance to applications.

Page 3: The Role of Algorithmic Research in Computational Genomics

Algorithmic Research in Computational Genomics

• The goal is to understand ground truth.

• Problem statements are often fuzzy.

• Problems are often application-specific, and problem formulations must be faithful to those applications.

• The quality of an algorithm is measured by its performance on real data.

• Biological findings are more important than computational methods.

Page 4: The Role of Algorithmic Research in Computational Genomics

Genomics can Benefit from Algorithmic Research in C.S.

• Data structures such as suffix trees.

• Randomized algorithms and sampling techniques.

• Dynamic programming (sequence alignment, RNA folding, protein threading, haplotype block structure…)

• Network flows, graph theory, NP-completeness, integer programming, semidefinite programming.

Page 5: The Role of Algorithmic Research in Computational Genomics

Adapting to Genomics

• Choose problems that are fundamental, timely and relevant.

• Mathematical depth and elegance are highly desirable, but often simple mathematics, artfully applied, is the key to success.

• Avoid problems that will change when technology changes.

• Learn the biological background of your problem, the available sources of data and their noise characteristics.

Page 6: The Role of Algorithmic Research in Computational Genomics

Adapting to Genomics

• Work with an application-oriented team and don’t get typecast as an algorithms specialist.

• Benchmark your algorithms on real data, establish a user community and make your software available and easy to use.

Page 7: The Role of Algorithmic Research in Computational Genomics

Sequence Assembly

• Given many noisy "reads" of short substrings of a target string, identify the target string.

• The shortest superstring problem, an elegant but flawed abstraction: find a shortest string containing a set of given strings as substrings. The problem is NP-hard, and theoretical results focus on constant-factor approximation algorithms.
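To make the abstraction concrete, here is a minimal sketch of the classic greedy merging heuristic for shortest superstring: repeatedly merge the two strings with the largest suffix-prefix overlap. It is an illustration only, not an assembler and not a method from the talk; it assumes error-free reads, which is exactly the simplification the abstraction is criticized for. The function names are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    unique = set(reads)
    # Drop reads that are substrings of other reads; they add nothing.
    strings = sorted(s for s in unique
                     if not any(s != t and s in t for t in unique))
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = strings[i] + strings[j][k:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""
```

For example, `greedy_superstring(["ACGT", "GTAC", "TACG"])` returns a string of length 6 that contains all three reads. A single miscalled base in any read would break the exact overlaps this sketch relies on.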

Page 8: The Role of Algorithmic Research in Computational Genomics

Shortest Superstring Problem

The shortest superstring problem is only superficially related to the sequence assembly problem. Its difficulty stems from pathological examples that are unlikely to occur in practice. It does not take noisy reads into account, and admits solutions with an unreasonably large number of mutually overlapping reads.

Page 9: The Role of Algorithmic Research in Computational Genomics

Progress in Sequence Assembly Algorithms

• Phred provides highly accurate base-specific quality scores based on signal analysis of sequence traces.

• Celera assembler: realistic simulations based on the structure of repeats in genomic sequence suggested that full-genome sequence assembly would be possible using double-ended reads. A sophisticated heuristic assembly algorithm was constructed, leading to the successful assembly of the Drosophila, human and mouse genomes.

Page 10: The Role of Algorithmic Research in Computational Genomics

Physical Mapping

• Goal: determine the relative locations of sequence-tagged sites, restriction sites or clones on a target DNA molecule.

• Radiation hybrid mapping: fragment the target, recover random sets of fragments and detect the sequence-tagged sites within them.

Page 11: The Role of Algorithmic Research in Computational Genomics

Physical Mapping

• Optical mapping: directly image the restriction sites on many incomplete copies of the target.

• Clone-based mapping: generate a clone library together with a restriction-site or sequence-tagged-site fingerprint of each clone. Computationally infer the relative positions of the clones.

Page 12: The Role of Algorithmic Research in Computational Genomics

A Generic Subproblem

• X(i): distance in bases of site i from the 5' end of the target.

• Experimental data yield inequalities of the form a(i,j) ≤ X(i) - X(j) ≤ b(i,j).

• In nearly every case, no solution existed.

• The algorithm was then modified to find the minimal obstructions to a solution and pinpoint the places where the experimental data needed to be corrected.
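A system of two-sided bounds on differences X(i) - X(j) is a standard "difference constraints" problem, and one common way to test it is Bellman-Ford shortest paths: the system is feasible exactly when the constraint graph has no negative cycle. The sketch below is an assumption about how such a check could look, not the algorithm used in the projects described; the function name and the input format are hypothetical. When the data are inconsistent it returns one negative cycle, which is an obstruction but not necessarily a minimal one.

```python
def solve_difference_constraints(n, constraints):
    """constraints: list of (i, j, a, b) meaning a <= X[i] - X[j] <= b.

    Returns (positions, None) if the system is feasible, otherwise
    (None, cycle), where cycle lists the sites on a negative cycle,
    i.e. a set of mutually inconsistent measurements.
    """
    edges = []
    for i, j, a, b in constraints:
        edges.append((j, i, b))    # X[i] <= X[j] + b
        edges.append((i, j, -a))   # X[j] <= X[i] - a
    dist = [0] * n                 # implicit source at distance 0 from every site
    pred = [None] * n
    last = None
    for _ in range(n):
        last = None
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                last = v
        if last is None:           # a full pass without changes: feasible
            return dist, None
    # Still relaxing in the n-th pass, so a negative cycle exists.
    for _ in range(n):
        last = pred[last]          # step back until we are surely on the cycle
    cycle, v = [last], pred[last]
    while v != last:
        cycle.append(v)
        v = pred[v]
    return None, cycle
```

Only differences between positions are constrained, so the returned positions are relative and may be negative. With constraints `[(1, 0, 100, 200), (2, 1, 50, 80), (2, 0, 400, 500)]` the first two measurements cap X(2) - X(0) at 280, which contradicts the third, and the function reports a cycle through sites 0, 1 and 2.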

Page 13: The Role of Algorithmic Research in Computational Genomics

Why My Physical Mapping Projects Had Little Influence

• Some problem formulations were technology-dependent and hence of transient interest.

• Difficulty in infiltrating existing projects and acquiring test data.

• Implementations lacked good user interfaces.

• Whole-genome sequencing supplanted physical mapping to some extent.

Page 14: The Role of Algorithmic Research in Computational Genomics

Elegance vs. Realism: the Case of Probe Selection

• Probe Selection Problem: find a maximum number of DNA probes, such that each hybridizes strongly to its complement, but not to the complement of any other probe.

• For highly realistic models of hybridization there appears to be no method of solution short of brute force search.

• A reasonable simplified model has an elegant solution.

Page 15: The Role of Algorithmic Research in Computational Genomics

Simplified Model

• 2-4 rule: the melting temperature of a DNA sequence is twice the number of A’s and T’s within it, plus four times the number of C’s and G’s.

• Simplified problem: Find a maximum number of probes such that each has melting temperature a, but no sequence of melting temperature b occurs as a substring of two different probes.

• Open question: how to modify solution to the simplified problem to satisfy constraints of more realistic models.
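The simplified model is easy to state in code. The sketch below computes the 2-4 rule and checks whether a candidate probe set satisfies the two conditions of the simplified problem; it only verifies a given set and does not find a maximum one, so the elegant solution mentioned on the previous slide is not reproduced here. All function names are hypothetical.

```python
def melting_temperature(seq: str) -> int:
    """2-4 rule: 2 for each A or T, 4 for each C or G."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    cg = seq.count("C") + seq.count("G")
    if at + cg != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return 2 * at + 4 * cg


def substrings_with_temperature(seq: str, b: int) -> set[str]:
    """All substrings of seq whose 2-4 melting temperature equals b."""
    seq = seq.upper()
    found = set()
    for start in range(len(seq)):
        temp = 0
        for end in range(start, len(seq)):
            temp += 2 if seq[end] in "AT" else 4
            if temp == b:
                found.add(seq[start:end + 1])
            if temp >= b:          # longer substrings only get hotter
                break
    return found


def is_valid_probe_set(probes: list[str], a: int, b: int) -> bool:
    """Each probe has temperature a; no temperature-b substring is shared."""
    seen: set[str] = set()
    for probe in probes:
        if melting_temperature(probe) != a:
            return False
        subs = substrings_with_temperature(probe, b)
        if subs & seen:
            return False
        seen |= subs
    return True
```

For instance, `melting_temperature("ACGT")` is 12, and `["ACAC", "CACA"]` fails the check for a = 12, b = 6 because both probes contain the substring "AC".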

Page 16: The Role of Algorithmic Research in Computational Genomics

Principles for Designing Computational Strategies

• An organism is best understood in the light of its evolutionary relationship to other organisms.

• The use of diverse sources of data is often the key to success.

• Problems of finding structure within data should be framed within statistical models, so that significance can be attached to the structures that are found.

Page 17: The Role of Algorithmic Research in Computational Genomics

Fundamental Problems that Need Better Algorithms

• Multiple alignment

• Global alignment of multiple genomes

• Phylogeny construction

• Genome rearrangement

• Approximate string matching

• Clustering biological data

• Feature selection: finding small sets of input variables that most accurately predict a given output variable.

Page 18: The Role of Algorithmic Research in Computational Genomics

SNPs, Genotypes, and Haplotypes

• SNP: site where the two copies of a chromosome commonly contain different bases.

• Genotype: the pair of bases occurring at each SNP.

• Haplotype: designates which base lies on which copy.
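The gap between a genotype and a haplotype can be shown by enumeration: a genotype with h heterozygous SNPs is consistent with 2^(h-1) distinct haplotype pairs (for h ≥ 1), which is why haplotypes must be inferred rather than read off. The sketch below assumes a hypothetical encoding in which each SNP's genotype is a two-character string such as "AG".

```python
from itertools import product


def compatible_haplotype_pairs(genotype: list[str]) -> list[tuple[str, str]]:
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    genotype: one 2-character string per SNP, e.g. ["AG", "CC", "CT"].
    """
    het = [k for k, g in enumerate(genotype) if g[0] != g[1]]
    pairs = set()
    for flips in product([False, True], repeat=len(het)):
        flip_at = dict(zip(het, flips))
        # At a flipped site the two bases swap chromosome copies.
        h1 = "".join(g[1] if flip_at.get(k, False) else g[0]
                     for k, g in enumerate(genotype))
        h2 = "".join(g[0] if flip_at.get(k, False) else g[1]
                     for k, g in enumerate(genotype))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)
```

The genotype `["AG", "CC", "CT"]` has two heterozygous sites and therefore two possible phasings: ("ACC", "GCT") and ("ACT", "GCC").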

Page 19: The Role of Algorithmic Research in Computational Genomics

Haplotyping Problems

• Given the genotypes of a sample of individuals, determine:

– The common haplotypes and their frequencies

– The haplotype of each individual

– The influence of an individual’s haplotype on observable phenotypes such as disease.

Page 20: The Role of Algorithmic Research in Computational Genomics

Analysis of Gene Regulation

• Gene finding

• Breaking the cis-regulatory code (analysis of transcriptional regulation)

– Characterize the binding sites of transcription factors

– Find sets of transcription factors that work in combination to induce or repress many genes

• Analysis of signal transduction pathways and protein complexes using protein-protein interaction data.

Page 21: The Role of Algorithmic Research in Computational Genomics

Combinatorial Analysis of Transcriptional Regulation

• Goal: find sets of transcription factors whose binding sites co-occur frequently in the promoter regions of selected sets of genes and determine whether these transcription factors combine to activate or suppress transcription.

Page 22: The Role of Algorithmic Research in Computational Genomics

Databases and Tools

• TRANSFAC: binding site motifs for 414 TFs occurring in vertebrate genomes

• RefSeq: database of human genes

• LBNL alignment of human and mouse genomes

• rVista: tool for finding human-mouse conserved motif occurrences

• Expression data and phases for cell-cycle regulated genes

• Stress response genes in the GO database and their subcategories

Page 23: The Role of Algorithmic Research in Computational Genomics

General Mechanisms for Combining Diverse Data Sources

• Biclustering

– A. Tanay, R. Sharan, M. Kupiec, R. Shamir, manuscript (2003)

• Probabilistic graphical models

• Kernel-based data fusion

– G. Lanckriet, M. Deng, N. Cristianini, M. Jordan, W. Noble, Technical Report 645, UC Berkeley Department of Statistics (2003)

Page 24: The Role of Algorithmic Research in Computational Genomics

Biclustering

Given a (0,1) matrix in which the rows represent genes, the columns represent properties of genes (function, expression, association with diseases etc.) and a 1 in the (g,p) entry indicates that gene g has property p, find submatrices with an unusually high density of 1s.
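One simple way to quantify "unusually high density" is to compare the number of 1s in a submatrix with what the background density of the whole matrix would predict. The sketch below does this with a binomial tail probability; the independence assumption behind the binomial model is a simplification introduced here for illustration and is not the statistical model of the cited manuscript. Function names are hypothetical.

```python
from math import comb


def density(matrix, rows, cols) -> float:
    """Fraction of 1s in the submatrix given by the row and column indices."""
    ones = sum(matrix[r][c] for r in rows for c in cols)
    return ones / (len(rows) * len(cols))


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))


def submatrix_p_value(matrix, rows, cols) -> float:
    """Chance of seeing at least this many 1s if cells were independent
    draws at the background density of the full matrix (an assumption)."""
    all_rows = range(len(matrix))
    all_cols = range(len(matrix[0]))
    background = density(matrix, all_rows, all_cols)
    cells = len(rows) * len(cols)
    ones = sum(matrix[r][c] for r in rows for c in cols)
    return binomial_tail(cells, ones, background)
```

This only scores a submatrix that is already given; searching over all row and column subsets for high-scoring submatrices is the hard part of biclustering and is not attempted here.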

Page 25: The Role of Algorithmic Research in Computational Genomics

Probabilistic Graphical Model (PGM)

• Graph-theoretic representation of the probabilistic and deterministic relationships among a set of variables. Vertices correspond to variables and directed edges represent dependencies. Some variables are observed and others are hidden.

• A PGM provides an algorithm for generating samples from the joint distribution of its variables.

• Given the values of the observed variables, there is an automatic (but not necessarily efficient) procedure for inferring the most likely values of the hidden variables.
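The two properties above, sampling from the joint distribution and automatic but possibly inefficient inference, can be seen in a toy model. The sketch below uses a small chain in which each hidden binary state depends on the previous one and emits one observed nucleotide; the structure and every probability in it are made-up values for illustration. Inference is done by brute-force enumeration of all hidden assignments, which is automatic but takes time exponential in the chain length.

```python
import random
from itertools import product

# Hypothetical parameters, chosen only for illustration.
PRIOR = {0: 0.7, 1: 0.3}                                   # P(h_1)
TRANS = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}         # P(h_k | h_{k-1})
EMIT = {0: {"A": 0.9, "G": 0.1}, 1: {"A": 0.2, "G": 0.8}}  # P(x_k | h_k)


def draw(dist, rng):
    """Draw one value from a {value: probability} table."""
    r, acc = rng.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point rounding


def sample(length, rng):
    """Ancestral sampling: draw each variable after its parents."""
    hidden, observed = [], []
    for k in range(length):
        h = draw(PRIOR if k == 0 else TRANS[hidden[-1]], rng)
        hidden.append(h)
        observed.append(draw(EMIT[h], rng))
    return hidden, observed


def joint(hidden, observed):
    """Joint probability of one complete assignment."""
    p = 1.0
    for k, (h, x) in enumerate(zip(hidden, observed)):
        p *= PRIOR[h] if k == 0 else TRANS[hidden[k - 1]][h]
        p *= EMIT[h][x]
    return p


def most_likely_hidden(observed):
    """Brute force over all 2^n hidden assignments."""
    candidates = product([0, 1], repeat=len(observed))
    return max(candidates, key=lambda hs: joint(hs, observed))
```

For a chain like this one, dynamic programming would give the same answer efficiently; the enumeration is kept to show that the procedure follows mechanically from the model.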

Page 26: The Role of Algorithmic Research in Computational Genomics

Application: Finding Binding Site Motifs

• Observed variables: the genomic sequences within which the motifs occur

• Hidden variables: the locations of the motif occurrences, the nucleotide distributions at sites within and between motif occurrences, and meta-parameters governing these nucleotide distributions.

• Questions: How much data is required to train a PGM, and how rapidly will the inference algorithm converge?

Page 27: The Role of Algorithmic Research in Computational Genomics

Classification Using Diverse Sources of Data

Example: Classifying proteins based on five types of data:

(1) Their domain structures

(2) Protein-protein interactions

(3) Genetic interactions

(4) Co-participation in protein complexes

(5) Cell cycle gene expression measurements

Page 28: The Role of Algorithmic Research in Computational Genomics

Support Vector Machine (SVM)

• Input: a training set {p_1, p_2, …, p_n} of proteins, a class label (positive or negative) for each protein and an n × n positive-definite matrix S = (s_jk) giving the similarities between all pairs of proteins.

• The SVM algorithm produces a decision rule that achieves maximal separation between the positive and negative examples in the training set and can be used to classify additional proteins on the basis of their similarities to proteins in the training set.

Page 29: The Role of Algorithmic Research in Computational Genomics

Extension to Diverse Data Sources

• The t-th data source gives a positive-definite matrix S_t = (s^t_jk) of similarities between the proteins.

• Data fusion: Any positive linear combination of these matrices gives a positive-definite similarity matrix.

• The problem of choosing the linear combination giving the largest margin of separation between positive and negative examples can be solved by semidefinite programming.
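The closure property behind data fusion can be checked numerically. The sketch below forms a weighted sum of similarity matrices and tests positive-definiteness by attempting a Cholesky factorization, which succeeds exactly when a symmetric matrix is positive-definite. The matrices and weights in the usage example are made up, and choosing the weights that maximize the margin requires a semidefinite programming solver, which is not shown.

```python
def combine(matrices, weights):
    """Entrywise weighted sum of equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(w * m[j][k] for w, m in zip(weights, matrices))
             for k in range(n)] for j in range(n)]


def is_positive_definite(matrix, tol=1e-12):
    """Try a Cholesky factorization of a symmetric matrix.

    Returns False as soon as a pivot is not strictly positive.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j + 1):
            s = matrix[j][k] - sum(lower[j][m] * lower[k][m] for m in range(k))
            if j == k:
                if s <= tol:
                    return False
                lower[j][j] = s ** 0.5
            else:
                lower[j][k] = s / lower[k][k]
    return True
```

With `s1 = [[2, 1], [1, 2]]` and `s2 = [[1, 0.5], [0.5, 3]]`, both matrices pass the check, and so does `combine([s1, s2], [0.3, 0.7])`; by contrast `[[1, 2], [2, 1]]` fails it.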

Page 30: The Role of Algorithmic Research in Computational Genomics

Computer Science Paradigms from Biology and Genomics

• Living cells can adapt to environmental changes, but large computer programs are brittle. Does biology hold clues for software engineering?

• Genomics algorithms are required to perform well on real-life data, not on all possible data. Should theoretical computer science depart from worst-case analysis?

Page 31: The Role of Algorithmic Research in Computational Genomics

Computer Science Paradigms from Biology and Genomics

• The Celera whole-genome shotgun sequencing algorithm is an instance of a general approach to combinatorial puzzle solving in which constraints on the solution are enforced in an order determined by the strength of evidence for them. Should this approach be studied within theoretical computer science?