
Transcript of Conditional Graphical Models for Protein Structure Prediction

  • Conditional Graphical Models for Protein Structure Prediction

    Yan Liu, Language Technologies Institute

    School of Computer Science

    Carnegie Mellon University

    October 17, 2006

    Abstract

    It is widely believed that proteins' three-dimensional structures play key roles in determining their functions and activities. However, it is extremely labor-expensive and time-consuming to determine protein structures by lab experiments. In this thesis, we aim at designing computational methods to predict protein structures. Specifically, given a protein sequence, our goal is to predict what the secondary structure elements are, how they arrange themselves in three-dimensional space, and how they associate with multiple chains to form complexes.

    In this thesis, a framework of conditional graphical models is developed to predict protein structures. Specifically, a special kind of undirected graphical model is defined, whose nodes represent the states of structural components (e.g. an amino acid or a secondary structure element) and whose edges indicate either the local interactions between nodes that neighbor each other in sequence order or the long-range interactions between nodes with chemical bonding in three-dimensional structures. Following the idea of discriminative models, the conditional probability of the joint labels given the observed sequences is defined directly as an exponential function of the features. The major advantages of our conditional graphical models include: (1) the intuitive representation of protein structures via graphs; (2) the ability to model dependencies between segments in a non-Markovian way, so that chemical bonding between distant residues can be captured; (3) the ability to use any features that measure properties of segments or bonds that biologists have identified.

    Within the framework, we develop conditional random fields and kernel conditional random fields for protein secondary structure prediction; segmentation conditional random fields and a chain graph model for protein fold (or motif) recognition; and linked segmentation conditional random fields for quaternary fold (or motif) recognition. The thesis work enriches current graphical models for prediction problems with structured outputs, in particular in handling long-range interactions; in addition, it provides a better understanding of the mapping from protein sequences to structures, and we hope our prediction results will shed light on the functions of some protein folds and aid studies on drug design.


  • Glossary

    α-helix One type of protein secondary structure, a rod-shaped peptide chain coiled to form a helix

    β-sheet One type of protein secondary structure, in which two peptide strands are aligned in the same direction (parallel β-sheet) or opposite directions (antiparallel β-sheet) and stabilized by hydrogen bonds

    amino acid The unit component of proteins; a small molecule that contains an amino group (NH2), a carboxyl group (COOH), a hydrogen atom (H) and a side chain (R-group) attached to a central alpha carbon (Cα)

    coil One type of protein secondary structure with irregular regions

    domain Subsections of a protein that represent structurally independent regions

    fold An identifiable arrangement of secondary structure elements that appears repeatedly in different protein domains

    motif A unit made up of only a few secondary structure elements that appears repeatedly in different protein domains

    PDB Protein Data Bank, a worldwide repository for the processing and distribution of experimentally solved 3-D structure data

    primary structure The linear sequence of a protein

    protein Linear polymers of amino acids, responsible for the essential structures and functions of the cells

    quaternary structure The stable association of multiple polypeptide chains resulting in an active unit

    residue An amino acid within a protein chain, connected to its neighbors by peptide bonds formed through a reaction of their respective carboxyl and amino groups

    secondary structure The local conformation of the polypeptide chain; intuitively, the building blocks of its three-dimensional structure

    tertiary structure Global three-dimensional structure of one protein chain


  • 1 Introduction

    Proteins, as chains of amino acids, make up a large portion of living organisms and perform most of their important functions, such as catalysis of biochemical reactions, acting as transcription factors to guide the differentiation of the cell and its later responsiveness to signals, transport of materials in body fluids, serving as receptors for hormones and other signaling molecules, and formation of tissues and muscular fiber. It is widely believed that a protein's structure plays a key role in determining its functions. However, it is extremely labor-expensive and sometimes even impossible to experimentally determine the structure of every protein sequence. Since the distinct sequence of amino acids determines the ultimate three-dimensional structure of a protein, it is essential to design effective computational methods to predict protein structures, which is the main task of this thesis.

    In order to better characterize structural properties, biologists have defined a hierarchy of protein structures with four levels. The primary structure is simply the linear protein chain, i.e. a sequence of amino acids. The secondary structure is the local conformation of protein structures. There are three major types of secondary structures, known as alpha-helices, beta-sheets and coils (or loops). The tertiary structure is the global three-dimensional structure. Sometimes multiple protein chains unite and form hydrogen bonds with each other, resulting in the quaternary structure. There are many interesting and challenging topics in protein structure prediction. In this thesis, we focus on predicting protein structure topologies at all levels via machine learning approaches, ranging from protein secondary structure prediction, tertiary fold recognition and alignment prediction, to quaternary fold recognition and alignment prediction. In silico, given the protein sequence information only, our goal is to predict what the secondary structure elements are, how they arrange themselves in three-dimensional space, and how multiple chains associate into complexes.

    Traditional approaches to protein structure prediction are sequence-based, i.e. they search a database using PSI-BLAST [Altschul et al., 1997] or match against a hidden Markov model (HMM) profile built from sequences with similar structures [Durbin et al., 1998, Krogh et al., 1994, Karplus et al., 1998]. These methods work well for simple conserved structures with strong sequence similarities, but fail when the similarity across proteins is poor and/or there exist long-range interactions, such as those containing β-sheets or α-helical couples. These cases necessitate a more expressive model, one able to capture the structured properties (e.g. the long-range interactions). Therefore several graphical models beyond basic HMMs have been applied to protein structure prediction. Delcher et al. introduced probabilistic causal networks for protein secondary structure modeling [Delcher et al., 1993]; Schmidler et al. proposed a Bayesian segmentation model for protein secondary structure prediction [Schmidler et al., 2000]; Yanover and Weiss applied an undirected graphical model to side-chain prediction using various approximate inference algorithms [Yanover and Weiss, 2002]; and Chu et al. extended the segmental semi-Markov model under a Bayesian framework to predict secondary structures [Chu et al., 2004].

    The previous graphical models achieved success to a certain extent; however, they all belong to the family of generative models, which assume a particular data-generating process that is not true in most cases. From a computational perspective, the protein structure prediction problem falls into one type of general machine learning problem, known as the segmentation


  • and labeling of structured data. The goal of the task is to predict the labels of each node in a graph given observations with particular structures, for example, webpages with hyperlinks or grids of image pixels. Discriminative graphical models defined over undirected graphs, such as conditional random fields (CRFs), have proven to be among the most effective tools for this kind of problem [Kumar and Hebert, 2003, Pinto et al., 2003, Sha and Pereira, 2003].

    Motivated by previous work on structured data, a framework of conditional graphical models is developed to predict protein structures. Specifically, we define a special kind of undirected graph, i.e. a protein structure graph, whose nodes represent the elements of the concerned structures (either a residue or a secondary structure element) and whose edges indicate the state transitions between adjacent nodes or long-range interactions between distant nodes. Following the idea of CRFs, the conditional probability of the labels given the data is defined over the graph, in which the potential function is an exponential function of the features. In this way, our framework is able to incorporate long-range interaction information via the graph, conveniently use any kind of informative features, such as segment-level features and overlapping or long-range interaction features, and at the same time retain all the advantages of discriminative models.

    1.1 The Task

    There are different techniques to experimentally determine protein structures, such as X-ray crystallography, nuclear magnetic resonance, circular dichroism and cryo-electron microscopy. However, most of these methods are time-consuming and labor-expensive. Furthermore, some proteins, such as many viruses and membrane proteins, cannot be determined using currently available techniques. Therefore, the prediction of the three-dimensional structure of a protein from its primary sequence is a fundamental and well-studied area in structural bioinformatics [Bourne and Weissig, 2003].

    In the thesis, we focus on three important subtasks: protein secondary structure prediction, tertiary fold (or motif) recognition and quaternary fold (or motif) recognition. We choose these tasks specifically because (1) they provide important information for ultimately determining protein structures; (2) they are essential for predicting protein functions and protein-protein interactions and for suggesting structural and functional conservation during evolution; (3) they all involve long-range interactions, namely amino acids that are distant in sequence order may be close in three-dimensional structure and form chemical bonds; (4) the outputs of the tasks lie on different levels of the protein structure hierarchy, so that the generality of our model can be demonstrated.

    In protein secondary structure prediction, the input is a sequence of amino acids x = x1 . . . xN, and the output is the secondary structure assignment for each amino acid, y = y1 . . . yN, where yi ∈ {helix, sheet, coil}. It is widely believed that the secondary structures contribute valuable information to discerning how proteins fold in three dimensions. Therefore protein secondary structure prediction has been extensively studied for decades [Rost and Sander, 1993, King and Sternberg, 1996, D.T.Jones, 1999, Rost, 2001]. Recently the performance has been improved to as high as 78-79% accuracy in general and 80-85% for predicting helices and coils [Kim and Park, 2003]. The major bottleneck lies in β-sheet prediction, which involves long-range interactions between regions of the protein chain that are not necessarily consecutive in the primary sequence.


  • In protein fold recognition, our task starts with a target fold F that biologists are interested in. The fold can be either a tertiary fold (with one protein chain) or a quaternary fold (with multiple protein chains). All the proteins with resolved structures deposited in the PDB can be classified into two groups: those taking the target fold F and those not. These proteins together with their labels can be used as training data. The goal is to predict whether a testing protein, given the sequence information only, takes the fold F in nature; if it does, we also locate the starting and ending positions of the subsequence that takes the fold. It can be seen that the recognition task involves two subtasks. One is a classification problem: given a set of training sequences x1, x2, . . . , xL and their labels y1, y2, . . . , yL, with yi ∈ {“take fold F”, “not take fold F”}, predict the label of a testing sequence xtest. The other subtask is not as straightforward to describe in a mathematical setting. We can think of the target fold as a pattern (or motif, in bioinformatics terminology). Given a set of instances of the pattern, including both positive examples (subsequences with the pattern F) and negative examples (sequences without the pattern F), we want to predict whether the pattern appears in any subsequence of the testing proteins. The first question can be answered easily if we can solve the second one successfully. A key problem in the second task is how to represent the descriptive patterns (or motifs) in a mathematical formulation.

    The recognition problems addressed in the thesis fall within the general studies of protein fold (or motif) recognition, but differ in two aspects. First, the target fold comes directly from focused study and experiments by biologists (in our case, the collaborators we worked with have been studying a particular fold for a long time). The positive proteins with resolved structures are quite limited, although the biologists believe it is a common fold in nature. Second, our task is much more difficult because we do not have as many positive examples and they do not share high sequence similarities. In other words, the patterns that we are trying to identify are not clearly represented in the training data. This is the main motivation for developing a richer graphical model, rather than a simple classifier. Notice that our models can be used in the traditional fold recognition or threading setting; however, their advantages are most evident in predicting difficult protein folds.

    1.2 The Approach

    Protein structure prediction can be generalized as a prediction problem with structured outputs in machine learning, in which the observed data are sequential or have other simple structures while the outputs are interdependent, involving complex structures. This kind of problem finds applications in many domains, such as computer vision, information extraction and human-computer interaction.

    An intuitive representation of complex structures is via graphs. Therefore graphical models, as an elegant combination of probability theory and graph theory, are a natural solution to the prediction problem with structured outputs. In particular, conditional random fields (CRFs) have played an essential role in the recent development of the area [Lafferty et al., 2001]. A CRF is an undirected graphical model (also known as a random field) that defines the conditional probability of the labels given the observations directly as an exponential function. The major advantage of this definition is the convenient use of any informative features, such as overlapping or long-range interaction features, since it does not assume a


  • particular data-generating process as generative models do. CRFs have proven very effective in many applications with structured outputs, such as information extraction, image processing, parsing and so on [Pinto et al., 2003, Kumar and Hebert, 2003, Sha and Pereira, 2003].

    In the thesis, we develop a framework of conditional graphical models for protein structure prediction. It can be seen as a generalization of conditional random fields. Specifically, we address the following questions: (1) how to represent protein structures with a graph; (2) how to incorporate domain knowledge into the model to help prediction; (3) given the complex induced graph, how to make efficient inferences so that our model can be used in genome-wide applications. In the rest of the paper, we discuss the basic ideas of the solutions to the questions above. Full details as well as the interesting experimental results can be found in Chapters 4-8 of the thesis.

    2 A Framework of Conditional Graphical Models

    In contrast to the common independent and identically distributed (i.e. iid) assumptions in statistics and machine learning, one distinctive property of protein structures is that the residues at different positions are not independent. For example, neighboring residues in the sequence are connected by peptide bonds; some residues that are far away from each other in the primary structure might be close in 3-D and form hydrogen bonds or disulfide bonds. These chemical bonds are essential to the stability of the structures and directly determine the functionality of the protein. In order to model the long-range interactions explicitly and incorporate all our tasks into a unified framework, it is desirable to have a powerful model that can capture the interdependent structured properties of proteins. In this thesis, we develop a series of graphical models for protein structure prediction. These models can be generalized as a framework of conditional graphical models, which directly defines the probability distribution over the labels (i.e., the segmentation and the labeling of the delineated segments) underlying an observed protein sequence, rather than assuming a particular data-generating process as in generative models.

    2.1 Definition

    First, we define the “protein structure graph” (PSG), which is the basis of our framework for representing protein structures and defining probability distributions. Specifically, a PSG is an undirected graph G = {V, E}, where V is the set of nodes corresponding to the specificities of structural units such as secondary structure assignments, motifs or insertions in the super-secondary structure (which are unobserved and to be inferred), and the amino acid residues at each position (which are observed and to be conditioned on); E is the set of edges denoting dependencies between the objects represented by the nodes, such as local constraints and/or state transitions between adjacent nodes in the primary sequence, or long-range interactions between non-neighboring motifs and/or insertions (see Fig. 1). The latter type of dependency is unique to the protein structure graph and is the source of much of the difficulty in solving such graphical models.

    The random variables corresponding to the nodes in the PSG are as follows: M denotes the


  • Figure 1: The graphical model representation of protein fold models. Circles represent the state variables; edges represent couplings between the corresponding variables (in particular, long-range interactions between units are depicted by red arcs). The dashed box over the x's denotes the set of observed sequence variables. An edge from a box to a node is a simplification of dependencies between the non-boxed node and all the nodes in the box (and therefore results in a clique containing all x's).

    number of nodes in the PSG. Notice that M can be either a constant or a variable taking values from a discrete set {1, . . . , mmax}, where mmax is the maximal number of nodes allowed (usually defined by the biologists). Wi is the label of the i-th node, i.e. the starting and ending positions in the sequence and/or the state assignment, which completely determine the node according to its semantics defined in the PSG. Under this setup, a value instantiation of W = {M, {Wi}} defines a unique segmentation and annotation of the observed protein sequence x (see Fig. 1).

    Let CG denote the set of cliques in graph G, and let Wc represent an arbitrary clique c ∈ CG. Given a protein sequence x = x1x2 . . . xN, where each xi is an amino acid, and a PSG G, the probability distribution of the variables W given the observation x can be postulated using potential functions defined on the cliques in the graph [Hammersley and Clifford, 1971], i.e.

        P(W|x) = (1/Z) ∏_{c∈CG} Φ(x, Wc),    (1)

    where Z is a normalization factor and Φ(·) is the potential function defined over a clique. Following the idea of CRFs, the clique potential can be defined as an exponential function of the feature functions fk, i.e.

        P(W|x) = (1/Z) ∏_{c∈CG} exp( ∑_{k=1}^K λk fk(x, Wc) ),    (2)

    where K is the number of features. The definition of the feature functions fk varies, depending on the semantics of the nodes in the protein structure graph, and is described thoroughly in the sections about the specific models. The parameters λ = (λ1, . . . , λK) are computed by minimizing the regularized log-loss of the conditional probability of the training data, i.e.

        λ = argmax_λ { ∑_{j=1}^L log P(w^(j) | x^(j)) − Ω(‖λ‖) },    (3)


    where L is the number of training sequences. Notice that the conditional likelihood function is convex, so finding the global optimum is guaranteed. Given a testing protein, our goal is to seek the segmentation configuration with the highest conditional probability defined above, i.e.

        W^opt = argmax_W P(W|x).

    The major advantages of the conditional graphical model defined above include: (1) the intuitive representation of protein structures via graphs; (2) the ability to model dependencies between segments in a non-Markovian way, so that chemical bonding between distant residues (both inter-chain and intra-chain) can be captured; (3) the ability to use any features that measure properties of segments or bonds that biologists have identified.

    2.2 Efficient Inference and Learning

    The parameters λ = (λ1, . . . , λK) are computed by minimizing the regularized log-loss of the conditional probability of the training data, as shown in eq (3). The conditional likelihood is a convex function, so finding the global optimum is guaranteed. Since there is no closed-form solution to the optimization problem, we compute the first derivative of the right side of eq (3) with respect to λ and set it to zero, resulting in

        ∑_{j=1}^L fk(x^(j), Wc^(j)) − ∑_{j=1}^L E_{P(W|x^(j))}[fk(x^(j), Wc)] − ∇Ω(‖λ‖) = 0.    (4)

    The intuition of eq (4) is to seek the direction of λk in which the model expectation agrees with the empirical distribution. Given a testing protein, our goal is to seek the segmentation configuration with the highest conditional probability defined above, i.e.

        W^opt = argmax_W ∑_{c∈CG} ∑_{k=1}^K λk fk(x, Wc).    (5)

    It can be seen that we need to compute the expectations of the features over the model in eq (4) and search over all possible segmentation assignments to find the maximum in eq (5). A naive exhaustive search would be prohibitively expensive due to the complex graphs induced by the protein structures. In addition, there are millions of sequences in protein sequence databases. Such large-scale applications demand efficient inference and optimization algorithms. It is known that the complexity of the inference algorithm depends on the graph defined by the conditional graphical model. If it is a simple chain or tree structure, we can use exact inference algorithms, such as belief propagation. For complex graphs, since computing exact marginal distributions is in general infeasible, approximation algorithms have to be applied. There are three major approximation approaches for inference in graphical models: sampling, variational methods and loopy belief propagation. We survey all possible approximation algorithms for our conditional graphical models in Chapter 5 of the thesis.
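    As a sanity check on eq (4), the sketch below (again illustrative; the features and data are invented) computes the gradient of the unregularized log-likelihood for a single training pair as the empirical feature counts minus the model expectation, with the expectation obtained by brute-force enumeration over all labelings.

```python
import itertools
import math

LABELS = [0, 1]  # toy binary labels

def features(x, w):
    """Toy global feature vector f(x, W): counts aggregated over adjacent-pair cliques."""
    agree = sum(1.0 for i in range(len(w) - 1) if w[i] == w[i + 1])
    match = sum(1.0 for i in range(len(x)) if x[i] == 1 and w[i] == 1)
    return [agree, match]

def expected_features(x, lam):
    """Model expectation E_{P(W|x)}[f] of eq (4), by enumerating all labelings."""
    configs = list(itertools.product(LABELS, repeat=len(x)))
    scores = [math.exp(sum(l * f for l, f in zip(lam, features(x, w))))
              for w in configs]
    Z = sum(scores)
    return [sum(s / Z * features(x, w)[k] for s, w in zip(scores, configs))
            for k in range(len(lam))]

# Gradient of the (unregularized) log-likelihood for one training pair (x, w):
x, w = [1, 0, 1, 1], (1, 0, 1, 1)
lam = [0.3, 0.7]
emp, model = features(x, w), expected_features(x, lam)
grad = [e - m for e, m in zip(emp, model)]
print(grad)  # zero exactly when the model expectation matches the empirical counts
```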

    Here we give an example of a sampling algorithm, the reversible jump Markov chain Monte Carlo (MCMC) algorithm [Green, 1995], which we derive for the protein fold recognition task. Sampling has


    been widely used in the statistics community due to its simplicity. However, a problem arises if we use naive Gibbs sampling in cases where M (the number of segments) is also a variable, since the output variables Y = {M, {wi}} then have different dimensions in each sampling iteration. The reversible jump Markov chain Monte Carlo algorithm has been proposed to solve this problem [Green, 1995]. It has demonstrated success in various applications, such as mixture models, hidden Markov models for DNA sequence segmentation, and phylogenetic trees [Huelsenbeck et al., 2004, Boys and Henderson, 2001].

    Given a segmentation y = (M, {wi}), our goal is to propose a new move y*. To satisfy the detailed balance condition of the MCMC algorithm, auxiliary random variables v and v′ have to be introduced. The definitions of v and v′ should guarantee the dimension-matching requirement, i.e. dim(y) + dim(v) = dim(y*) + dim(v′), and there must be a one-to-one mapping from (y, v) to (y*, v′), i.e. there exists a function Ψ such that Ψ(y, v) = (y*, v′) and Ψ⁻¹(y*, v′) = (y, v). Then the acceptance rate for the proposed transition from y to y* is

        min{1, posterior ratio × proposal ratio × Jacobian} = min{ 1, (P(y*|x) / P(y|x)) · (P(v′) / P(v)) · |∂(y*, v′)/∂(y, v)| },

    where the last term is the determinant of the Jacobian matrix.

    To construct a Markov chain on the sequence of segmentations, we define four types of Metropolis operators [Green, 1995]; a sketch of these proposal moves follows the list:

    (1) State switching: given a segmentation y, sample a segment index j uniformly from [1, M] and set the state of segment j to a new random assignment.
    (2) Position switching: given a segmentation y, sample a segment index j uniformly from [1, M] and change its starting position to a number sampled from U[p_{j−1}, q_j].
    (3) Segment split: given a segmentation y, propose a move with M* = M + 1 segments by splitting the j-th segment, where j is sampled from U[1, M].
    (4) Segment merge: given a segmentation y, sample a segment index j uniformly from [1, M] and propose a move that merges the j-th and (j+1)-th segments.
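    The following sketch shows what the four proposal moves might look like on a simple list-of-segments representation; it is an illustration under assumed data structures (a segment is a (state, start, end) triple), not the thesis implementation, and it omits the acceptance test given by the ratio above.

```python
import random

def propose(seg):
    """One RJMCMC-style proposal over a segmentation.
    seg: list of (state, start, end) triples covering the sequence contiguously.
    Returns a candidate segmentation; the acceptance step is not shown."""
    seg = [list(s) for s in seg]
    move = random.choice(["state", "position", "split", "merge"])
    j = random.randrange(len(seg))
    if move == "state":
        seg[j][0] = random.choice(["motif", "insertion"])  # illustrative state set
    elif move == "position" and j > 0:
        # resample segment j's start inside the allowed window, shifting the boundary
        new_start = random.randint(seg[j - 1][1] + 1, seg[j][2])
        seg[j - 1][2], seg[j][1] = new_start - 1, new_start
    elif move == "split" and seg[j][2] > seg[j][1]:
        cut = random.randint(seg[j][1], seg[j][2] - 1)     # dimension grows: M* = M + 1
        seg[j:j + 1] = [[seg[j][0], seg[j][1], cut],
                        [seg[j][0], cut + 1, seg[j][2]]]
    elif move == "merge" and j + 1 < len(seg):
        seg[j:j + 2] = [[seg[j][0], seg[j][1], seg[j + 1][2]]]  # M* = M - 1
    return [tuple(s) for s in seg]

seg = [("motif", 0, 3), ("insertion", 4, 7), ("motif", 8, 11)]
print(propose(seg))
```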

    2.3 Specific Models

    To predict protein structures at different levels, we develop several specific graphical models within the conditional graphical model framework (see Table 1 for a summary), including:

    • Conditional random fields for combining scores from first-round prediction algorithms for protein secondary structure prediction and β-sheet prediction. We achieve encouraging improvements in prediction accuracy in cross-validation over non-homologous proteins with resolved structures. The prediction accuracy of β-sheets has been improved by 6-8%.

    • Kernel conditional random fields for protein secondary structure prediction. We achieve similar prediction accuracy but a 30-50% improvement in transition accuracy, compared with support vector machines using the same kernels.

    • Segmentation conditional random fields, which assign labels to segments (i.e. subsequences) rather than to individual amino acids, for supersecondary structure prediction


  • Table 1: Thesis work: conditional graphical models for protein structure prediction of all hierarchies

    Task                                 | Hierarchy  | Target proteins                        | Structural modules                   | Module length | Graphical model
    Secondary structure prediction       | Secondary  | globular proteins                      | amino acid                           | fixed         | CRFs, kCRFs
    Parallel/antiparallel β-sheet pred.  | Secondary  | globular proteins                      | amino acid                           | fixed         | CRFs
    Fold (motif) recognition             | Tertiary   | β-helix                                | secondary structure                  | variable      | SCRFs
    Fold with structural repeats         | Tertiary   | β-helix, leucine-rich repeats          | structural motifs/insertions         | variable      | chain graph model
    Quaternary fold (motif) recognition  | Quaternary | double barrel trimer, triple β-spiral  | secondary/super-secondary structures | variable      | linked SCRFs

    or motif recognition. As a result, the model can use any type of segment-level feature, such as the alignment score against some signal motif or the length of the segment. Most importantly, our model is able to capture segment-level associations, i.e. the interactions between secondary structure elements, which solves the long-range interaction problem for protein fold (or motif) recognition. We apply the model to predict the right-handed β-helix fold and outperform other state-of-the-art methods. We also hypothesize potential β-helix proteins, some of which have been crystallized recently and confirmed.

    • Chain graph model for predicting protein folds with structural repeats. The basic idea of the model is to decompose the complex graph into subgraphs, and then unite them via directed edges under the framework of chain graphs. The model can be seen as a semi-localized version of the SCRF model. It reduces the computational costs, and at the same time achieves a good approximation to the original SCRFs. In our experiments on the β-helix and leucine-rich repeats, it performs similarly to the SCRF model while the actual running time is reduced by a factor of 50.

    • Linked segmentation conditional random fields for quaternary fold (or motif) prediction. The model is able to capture both the inter-chain and intra-chain interactions within quaternary folds, which are essential in keeping the structures stable. The resulting graphs are extremely complex, involving multiple chains, so we derive reversible jump MCMC sampling algorithms for efficient inference. The experiments on triple β-spirals and double-barrel trimers demonstrate the effectiveness of our model.

    In the next three sections, we describe in detail the definitions of these models and how they can be applied to the protein structure prediction tasks.


  • 3 Protein Secondary Structure Prediction

    Protein secondary structure prediction assigns a secondary structure label, such as helix, sheet or coil, to each residue in the protein sequence. Therefore the nodes in the PSG represent the states of the secondary structure assignments, and the graph structure is simply a chain, as is the protein sequence. As a result, the model is the plain conditional random field (CRF) with a chain structure.

    3.1 Conditional random fields (CRFs)

    The graphical model representation of a chain-structured CRF is shown in Figure 2, in which we have one node for the state assignment of each residue in the sequence. Mapping back to the general framework in the previous section, we have M = N and Wi = yi ∈ {helix, sheet, coil}. The conditional probability P(W|x) = P(y|x) is defined as

        P(y|x) = (1/Z) ∏_{i=1}^N exp( ∑_{k=1}^K λk fk(x, i, y_{i−1}, y_i) ),    (6)

    where fk can be arbitrary features, including overlapping or long-range interaction features. As a special case, we can construct HMM-like features that factor into two parts: fk(x, i, y_{i−1}, y_i) = gk(x, i) δ(y_{i−1}, y_i), in which δ(y_{i−1}, y_i) is the indicator function over each pair of state assignments (similar to the transition probability in an HMM), and gk(x, i) is any feature defined over the observations (which mimics the emission probability without any particular assumptions about the data).
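    For concreteness, here is a small illustrative implementation of eq (6) (not the thesis code; the two features are invented HMM-like examples). The global normalizer Z is computed with the standard forward algorithm over the chain.

```python
import math

LABELS = ["helix", "sheet", "coil"]

def log_potential(x, i, y_prev, y, lam):
    """lambda . f(x, i, y_prev, y) with two toy HMM-like factored features."""
    f_trans = 1.0 if y_prev == y else 0.0                      # delta(y_{i-1}, y_i)-style
    f_emit = 1.0 if (x[i] in "VIF" and y == "sheet") else 0.0  # g_k(x, i)-style
    return lam[0] * f_trans + lam[1] * f_emit

def log_Z(x, lam):
    """Forward algorithm: the global normalizer Z of the chain CRF in eq (6)."""
    alpha = {y: log_potential(x, 0, None, y, lam) for y in LABELS}
    for i in range(1, len(x)):
        alpha = {y: math.log(sum(math.exp(alpha[yp] + log_potential(x, i, yp, y, lam))
                                 for yp in LABELS))
                 for y in LABELS}
    return math.log(sum(math.exp(a) for a in alpha.values()))

def log_prob(x, y, lam):
    """log P(y|x) for one labeling of the chain."""
    score = log_potential(x, 0, None, y[0], lam)
    score += sum(log_potential(x, i, y[i - 1], y[i], lam) for i in range(1, len(x)))
    return score - log_Z(x, lam)

x = "MVIFAL"
y = ["coil", "sheet", "sheet", "sheet", "coil", "coil"]
print(math.exp(log_prob(x, y, lam=[0.8, 1.5])))
```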

    CRFs take on a global normalizer Z over the whole sequence. This results in a series of nice properties, but at the same time introduces large computational costs. Maximum Entropy Markov Models (MEMMs) can be seen as a localized version of CRFs (see Fig. 2(B)) [McCallum et al., 2000]. The conditional probability in MEMMs is defined as

        P(y|x) = ∏_{i=1}^N (1/Zi) exp( ∑_{k=1}^K λk fk(x, i, y_{i−1}, y_i) ),    (7)

    where Zi is a normalizer over the i-th position. MEMMs reduce the computational costs dramatically, but suffer from the “label bias” problem, that is, the total probability “received”

    Figure 2: Graphical model representation of a simple HMM (A), an MEMM (B), and a chain-structured CRF (C)


  • Figure 3: Kernels for a structured graph: K((x, c, yc), (x′, c′, y′c)) = K((x, c), (x′, c′)) δ(yc, y′c).

    by y_{i−1} must be passed on to the labels y_i at time i even if x_i is completely incompatible with y_{i−1} [Lafferty et al., 2001]. Empirical results show that for most applications CRFs are able to outperform MEMMs with either slight or significant improvements. Detailed comparison results with application to protein secondary structure prediction are discussed in Section 6.3 of the thesis.

    3.2 Kernel conditional random fields (kCRFs)

    The original CRF model only allows linear combinations of features. For protein secondary structure prediction, the state-of-the-art method can achieve an accuracy of around 80% using SVMs with nonlinear kernels, which indicates that the current feature sets are not sufficient for a linear separation.

    Recent work in machine learning has shown kernel methods to be extremely effective in a wide variety of applications [Cristianini and Shawe-Taylor, 2000]. Kernel conditional random fields, as an extension of conditional random fields, permit the use of implicit feature spaces through Mercer kernels [Lafferty et al., 2004]. Similar to CRFs, the conditional probability is defined as

        P(y|x) = (1/Z) ∏_{c∈CG} exp f*(x, c, yc),

    where f*(·) is the kernel basis function, i.e. f*(·) = K(·, (x, c, yc)). One way to define a kernel over the structured graph is K((x, c, yc), (x′, c′, y′c)) = K((x, c), (x′, c′)) δ(yc, y′c), whose first term is a typical kernel defined for iid examples, and whose second term is the indicator function over each state pair (see Fig. 3). By the representer theorem, the minimizer of the regularized loss has the form

        f*(·) = ∑_{j=1}^L ∑_{c∈C_{G(j)}} ∑_{yc∈Y^{|c|}} λ^{(j)}_{yc} K(·, (x^(j), c, yc)).

    Notice that the dual parameters λ depend on all the clique label assignments, not just the true labels. The detailed algorithms and experimental results on protein secondary structure prediction are given in Section 6.5 of the thesis.
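    A minimal sketch of this kernel construction, assuming an RBF base kernel and an invented numeric encoding of a clique's residues (both are illustrative choices, not taken from the thesis):

```python
import math

def rbf(u, v, gamma=0.5):
    """Standard RBF kernel on fixed-length numeric encodings (iid-style base kernel)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def encode(x, c):
    """Illustrative clique encoding: hydrophobicity values of the residues in clique c."""
    hydro = {"A": 1.8, "V": 4.2, "L": 3.8, "G": -0.4, "S": -0.8}
    return [hydro.get(x[i], 0.0) for i in c]

def structured_kernel(x, c, yc, x2, c2, yc2):
    """K((x,c,yc), (x',c',y'c)) = K((x,c), (x',c')) * delta(yc, y'c), as in Fig. 3."""
    delta = 1.0 if yc == yc2 else 0.0
    return rbf(encode(x, c), encode(x2, c2)) * delta

# Cliques with matching label assignments contribute; mismatched labels give zero.
print(structured_kernel("AVLGS", (1, 2), ("sheet", "sheet"),
                        "GVLAS", (1, 2), ("sheet", "sheet")))
print(structured_kernel("AVLGS", (1, 2), ("sheet", "sheet"),
                        "GVLAS", (1, 2), ("sheet", "coil")))
```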


  • 4 Protein Tertiary Fold Recognition

    Protein folds or motifs are frequent arrangement patterns of several secondary structure elements. The chemical bonding physically exists at the atomic level on the side-chains of amino acids; however, the structural topology and interaction maps of the fold are conserved only at the secondary structure level, due to the many possible insertions or deletions in the protein sequence. Therefore it is natural to assign the state labels to segments (subsequences corresponding to one secondary structure element) rather than to individual amino acids, and then to connect the nodes with edges indicating their dependencies in the three-dimensional structure. Following this idea, a segmentation conditional random fields (SCRFs) model is developed for general protein fold recognition.

    4.1 Segmentation conditional random fields (SCRFs)

    For protein fold (or motif) recognition, we define a PSG G = ⟨V, E⟩, where V = U ∪ {I}, U is the set of nodes corresponding to the secondary structure elements within the fold, and I is the node representing the elements outside the fold. E is the set of edges between neighboring elements in the primary sequence or edges indicating potential long-range interactions between elements in the tertiary structure (see Figure 4). Given a protein sequence x = x1x2 . . . xN, we can have a possible segmentation of the sequence according to the graph G, i.e. W = {M, {Wi}}, where M is the number of segments, Wi = {si, pi, qi}, and si, pi, qi are the state, starting position and ending position of the i-th segment. Here the states are the set of labels distinguishing each structural component of the fold. The conditional probability of W given the observation x is defined as

        P(W|x) = (1/Z) ∏_{c∈CG} exp( ∑_{k=1}^K λk fk(x, Wc) ),    (8)

    where fk is the k-th feature defined over a clique c. As a special case, we can consider only the pairwise cliques, i.e.

        f(x, wi, wj) = g(x, pi, qi, pj, qj) δ(si, sj) δ(qi − pi) δ(qj − pj),

    where g is any feature defined over the two segments. Note that CG can be a huge set, and each Wc can include a large number of nodes due to the various levels of dependencies. Designing features for such cliques is non-trivial because one has to consider all the joint configurations of all the nodes in a clique.

    Figure 4: Graphical model representation of segmentation conditional random fields


  • Usually, the spatial ordering of most regular protein folds is known a priori, which leads to a deterministic state dependency between adjacent nodes wi and w_{i+1}. Thus we have a simplification of the “effective” clique sets (those that need to be parameterized) and of the relevant feature design. Essentially, only pairs of segment-specific cliques that are coupled need to be considered (e.g., those connected by the undirected red arcs in Fig. 4)¹, which results in the following formulation:

        P(W|x) = (1/Z) ∏_{i=1}^M exp( ∑_{k=1}^K λk fk(x, Wi, W_{πi}) ),    (9)

    where W_{πi} denotes the spatial predecessor (i.e., the one with the smaller position index) of Wi, as determined by a “long-range interaction arc”. The detailed inference algorithm, with application to regular fold recognition, is described in Section 7.2 of the thesis.
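    As an illustration of what a segment-pair feature g in eq (9) might look like, here is an invented scoring scheme (not taken from the thesis): it rewards hydrophobicity agreement between the two paired segments, as if they were stacked in 3-D, and penalizes length mismatch.

```python
def pairwise_segment_feature(x, seg_i, seg_j):
    """Toy g(x, p_i, q_i, p_j, q_j) over two interacting segments (inclusive bounds)."""
    hydro = {"A": 1.8, "V": 4.2, "L": 3.8, "I": 4.5, "G": -0.4, "S": -0.8, "N": -3.5}
    (p_i, q_i), (p_j, q_j) = seg_i, seg_j
    a, b = x[p_i:q_i + 1], x[p_j:q_j + 1]
    n = min(len(a), len(b))
    # residue-by-residue agreement of hydrophobic/hydrophilic character
    agreement = sum(1.0 for k in range(n)
                    if (hydro.get(a[k], 0) > 0) == (hydro.get(b[k], 0) > 0))
    length_penalty = abs(len(a) - len(b))  # interacting strands have similar lengths
    return agreement - 0.5 * length_penalty

x = "AVLIGSNAVLIG"
print(pairwise_segment_feature(x, (0, 3), (7, 10)))
```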

    4.2 Chain graph model

    SCRFs are a model for regular protein fold recognition. They can be seen as an exhaustive search over all possible segmentation configurations of the given protein and thus incur tremendous computational costs. To alleviate this problem, a chain graph model is proposed, designed for a special kind of structure: protein folds with structural repeats. These are defined as repetitive, structurally conserved secondary or supersecondary units, such as α-helices, β-strands and β-sheets, connected by insertions of variable lengths, which are mostly short loops and sometimes α-helices and/or β-sheets. These folds are believed to be prevalent in proteins and are involved in a wide spectrum of cellular activities.

    A chain graph is a graph consisting of both directed and undirected arcs associated with probabilistic semantics. It leads to a probability distribution bearing the properties of both Markov random fields (i.e., allowing potential-based local marginals that encode constraints rather than causal dependencies) and Bayesian networks (i.e., not having a hard-to-compute global partition function for normalization and allowing causal integration of subgraphs that can be either directed or undirected) [Lauritzen and Wermuth, 1989, Buntine, 1995].

    Returning to the protein structure graph, we propose a hierarchical segmentation of a protein sequence. On the top level, we define an envelope Ξi as a subgraph that corresponds to one repeat region in the fold, containing both motifs and insertions, or to a null region (structures outside the protein fold). It can be viewed as a mega-node in a chain graph defined on the entire protein sequence and its segmentation (Fig. 5). Analogous to the SCRF model, let M denote the number of envelopes in the sequence and T = {T1, . . . , TM}, where Ti ∈ {repeat, non-repeat} denotes the structural label of the i-th envelope. On the lower level, we decompose each envelope into a regular arrangement of several motifs and insertions, which can be modeled using one SCRF model. Let Ξi denote the internal segmentation of the i-th envelope (determined by the local SCRF), i.e. Ξi = {Mi, Yi}. Following the notational convention of the previous section, we use W_{i,j} to represent a segment-specific clique within envelope i that completely determines the configuration of the j-th segment in the i-th envelope. To capture the influence of neighboring repeats, we also introduce a motif indicator Qi for

    ¹ Technically, neighboring nodes must satisfy the constraints on the location indexes, i.e. q_{i−1} + 1 = pi. We omit this here for presentation clarity.


  • Figure 5: Chain graph model for predicting folds with repetitive structures

    each repeat i, which signals the presence or absence of sequence motifs therein, based on the sequence distribution profiles estimated from the previous repeat.

    Putting everything together, we arrive at the chain graph depicted in Figure 5. The conditional probability of a segmentation W given a sequence x can be defined as

        P(W|x) = P(M, Ξ, T|x) = P(M) ∏_{i=1}^M P(Ti | x, T_{i−1}, Ξ_{i−1}) P(Ξi | x, Ti, T_{i−1}, Ξ_{i−1}).    (10)

    P(M) is the prior distribution over the number of repeats in one protein, and P(Ti | x, T_{i−1}, Ξ_{i−1}) is the state transition probability, for which we use the structural motif as an indicator of the existence of a new repeat:

        P(Ti | x, T_{i−1}, Ξ_{i−1}) = ∑_{Qi=0}^{1} P(Ti | Qi) P(Qi | x, T_{i−1}, Ξ_{i−1}),

    where Qi is a random variable denoting whether there exists a motif in the i-th envelope, and P(Qi | x, T_{i−1}, Ξ_{i−1}) can be computed using any motif detection model. For the third term, an SCRF model is employed, i.e.

        P(Ξi | x, Ti, T_{i−1}, Ξ_{i−1}) = (1/Zi) exp( ∑_{j=1}^{Mi} ∑_{k=1}^K λk fk(x, W_{i,j}, W_{π_{i,j}}) ),    (11)

    where Zi is the normalizer over all configurations of Ξi, and W_{π_{i,j}} is the spatial predecessor of W_{i,j} defined by the long-range interactions. Similarly, the parameters λk can be estimated by minimizing the regularized negative log-likelihood of the training data.

    Compared with SCRFs, the chain graph model can effectively identify motifs by exploiting their structural conservation and at the same time take into account the long-range interactions between repeat units. In addition, the model takes on a local normalization, which reduces the computational costs dramatically. Since the effects of most chemical bonds are limited to a small range in 3-D space and do not pass through the whole sequence, this model can be seen as a reasonable approximation to the globally optimal solution of SCRFs. The detailed algorithm and experimental results are discussed in Section 7.3 of the thesis.
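    The decomposition in eq (10) can be sketched as follows. All the plug-in distributions here are hypothetical stand-ins supplied by the caller for P(M), the transition model, and the local SCRF of eq (11); the thesis models would be substituted in their place.

```python
import math

def chain_graph_log_prob(envelopes, prior_M, trans_model, local_scrf):
    """Minimal sketch of eq (10): log P(W|x) as a product over envelopes.
    envelopes: list of (T_i, Xi_i) pairs; the three callables are illustrative."""
    logp = math.log(prior_M(len(envelopes)))          # P(M): prior on number of repeats
    prev_T, prev_Xi = None, None
    for T, Xi in envelopes:
        logp += math.log(trans_model(T, prev_T, prev_Xi))     # P(T_i | x, T_{i-1}, Xi_{i-1})
        logp += math.log(local_scrf(Xi, T, prev_T, prev_Xi))  # P(Xi_i | ...), eq (11)
        prev_T, prev_Xi = T, Xi
    return logp

# Toy plug-ins: uniform prior over 1..10 repeats, sticky state transitions,
# and a dummy local SCRF that returns a fixed likelihood per envelope.
prior_M = lambda m: 0.1 if 1 <= m <= 10 else 1e-12
trans_model = lambda T, pT, pXi: 0.5 if pT is None else (0.8 if T == pT else 0.2)
local_scrf = lambda Xi, T, pT, pXi: 0.3

envs = [("repeat", "seg1"), ("repeat", "seg2"), ("non-repeat", "seg3")]
print(chain_graph_log_prob(envs, prior_M, trans_model, local_scrf))
```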


  • 5 Protein Quaternary Fold Recognition

    The quaternary structure is the stable association of multiple polypeptide chains via non-covalent bonds, resulting in a stable unit. Quaternary structures are stabilized mainly by the same non-covalent interactions as tertiary structures, such as hydrogen bonding, van der Waals interactions and ionic bonding. Unfortunately, previous work on fold recognition for single chains is not directly applicable, because the complexity is greatly increased both biologically and computationally when moving to quaternary multi-chain structures. Therefore we propose the linked SCRF model to handle protein folds consisting of multiple protein chains.

    5.1 Linked Segmentation Conditional Random Fields (l-SCRFs)

    The PSG for a quaternary fold can be derived similarly to the PSG for a tertiary fold: first construct a PSG for each component protein (or a component monomeric PSG for a homo-multimer), and then add edges between nodes from different chains if there are chemical bonds between them, forming a more complex but similarly structured quaternary PSG. Given a quaternary structure graph G with C chains, i.e. {xi | i = 1 . . . C}, we have a segmentation instantiation of each chain yi = (Mi, wi) defined by the PSG, where Mi is the number of segments in the i-th chain, wi,j = (si,j, pi,j, qi,j), and si,j, pi,j, qi,j are the state, starting position and ending position of the j-th segment. Following a similar idea as in the CRF model, we have

        P(y1, . . . , yC | x1, . . . , xC) = (1/Z) ∏_{c∈G} Φ(yc, x)    (12)

        = (1/Z) ∏_{wi,j∈VG} Φ(xi, wi,j) ∏_{⟨wa,u, wb,v⟩∈EG} Φ(xa, xb, wa,u, wb,v)    (13)

        = (1/Z) exp( ∑_{wi,j∈VG} ∑_{k=1}^{K1} θ1,k fk(xi, wi,j) + ∑_{⟨wa,u, wb,v⟩∈EG} ∑_{k=1}^{K2} θ2,k gk(xa, xb, wa,u, wb,v) ),    (14)

    where Z is the normalizer over all possible segmentation assignments of all component sequences (see Figure 6 for the graphical model representation). In eq (14), we decompose the potential function over the cliques, Φ(yc, x), as a product of unary and pairwise potentials, where fk and gk are features and θ1,k and θ2,k are the corresponding feature weights. Specifically, we factorize the features in the following way:

        fk(xi, wi,j) = f′k(xi, pi,j, qi,j) δ(wi,j)
                     = f′k(xi, pi,j, qi,j)  if si,j = s and qi,j − pi,j ∈ length_range(s), and 0 otherwise.

    Similarly, we factorize gk(xa, xb, wa,u, wb,v) = g′k(xa, xb, pa,u, qa,u, pb,v, qb,v) if qa,u − pa,u ∈ length_range(s) and qb,v − pb,v ∈ length_range(s′), and 0 otherwise. (A small sketch of this length-range gating appears at the end of this section.)

    The major advantages of the linked SCRF model include: (1) the ability to encode the

    output structures (both inter-chain and intra-chain chemical bonding) using the graph; (2) the ability to model dependencies between segments in a non-Markovian way, so that chemical bonding between


  • Figure 6: Graphical model representation of the l-SCRF model with multiple chains. Notice that there are long-range interactions (represented by red edges) both within a chain and between chains.

    distant amino acids can be captured; (3) it permits the convenient use of any features that measure the properties of segments that biologists have identified. On the other hand, the linked SCRF model differs significantly from the SCRF model in that quaternary folds with multiple chains introduce huge complexities for inference and learning. Therefore we develop efficient approximation algorithms that are able to find optimal or near-optimal solutions; these algorithms and their applications are described in Chapter 5 of the thesis.
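    Here is the promised sketch of the length-range gating in the factorized features above; all names, the length ranges, and the toy segment-level feature f′ are illustrative assumptions, not the thesis code.

```python
def unary_feature(x, seg, state, length_range, f_prime):
    """fk(xi, wi,j) of the factorization: a segment-level feature f' gated by the
    state assignment and by the allowed length range of that state."""
    s, p, q = seg  # (state, start, end), inclusive bounds
    lo, hi = length_range[state]
    if s == state and lo <= q - p <= hi:
        return f_prime(x, p, q)
    return 0.0

def pairwise_feature(xa, xb, seg_a, seg_b, length_range, g_prime):
    """gk gated on both segments' length ranges (an inter- or intra-chain edge)."""
    (sa, pa, qa), (sb, pb, qb) = seg_a, seg_b
    lo_a, hi_a = length_range[sa]
    lo_b, hi_b = length_range[sb]
    if lo_a <= qa - pa <= hi_a and lo_b <= qb - pb <= hi_b:
        return g_prime(xa, xb, pa, qa, pb, qb)
    return 0.0

ranges = {"strand": (2, 8), "loop": (1, 20)}
frac_V = lambda x, p, q: x[p:q + 1].count("V") / (q - p + 1)  # toy f'
print(unary_feature("AVVLGS", ("strand", 1, 3), "strand", ranges, frac_V))
```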

    6 Discussion

    In the previous discussion, we used the regularized log-loss as the objective function for estimating parameters, following the original definition of the CRF model. In addition to CRFs, several other discriminative methods have been proposed for the segmentation and labeling problem on structured data, such as max-margin Markov networks (M3N) [Taskar et al., 2003] and the Gaussian process sequence classifier (GPSC) [Altun et al., 2004]. As with classifiers for the classification problem, these models can be unified under the exponential model with different loss functions and regularization terms.

    6.1 Unified view via loss function analysis

    Classification, as a subfield of supervised learning, aims at assigning one or more discrete class labels to each example in a dataset. In recent years, various classifiers have been proposed and successfully applied in many applications, such as logistic regression, support vector machines, naive Bayes, k-nearest neighbors and so on [Hastie et al., 2001]. Discriminative classifiers, as opposed to generative models, compute the conditional probability directly and usually assume a linear decision boundary in the original feature space or in the corresponding Hilbert space defined by the kernel functions. Previous research indicates that loss function analysis can provide a comprehensible and unified view of classifiers with totally different mathematical formulations [Hastie et al., 2001].

    In the following discussion, we focus on the binary classification problem and concern ourselves with three specific classifiers: regularized logistic regression (with its extension to kernel logistic regression) [Zhu and Hastie, 2001], support vector machines (SVM)


  • [Vapnik, 1995] and Gaussian processes (GP) [Williams and Barber, 1998]. All three classifiers can be seen as linear classifiers that permit the use of kernels. Specifically, the decision function f(x) has the form

        f(x) = ∑_{i=1}^L λi K(xi, x),    (15)

    where λ are the parameters of the model and K is the kernel function. λ is learned by minimizing a regularized loss function, and the general form of the optimization problem can be written as

        λ = argmin_λ { ∑_{i=1}^L g(yi f(xi)) + Ω(‖f‖_F) },    (16)

    where the first term is the training set error, g is a specific loss function, and the second term is a complexity penalty or regularizer.

    The essence of the different classifiers can be revealed through their definitions of the loss function, as follows (a short sketch after the list evaluates these losses at a few margins):

    • Kernel logistic regression defines the loss function as the logistic loss, i.e. g(z) = log(1 + exp(−z)). In the model described in [Zhu and Hastie, 2001], a Gaussian prior with zero mean and a diagonal covariance matrix is applied, which is equivalent to an L2 regularizer.

    • Support vector machines use the hinge loss, i.e. g(z) = (1 − z)+, which results in the nice property of sparse parameters (most values are equal to 0). Similar to logistic regression, an L2 regularizer is employed.

    • Gaussian process classification can be seen as a logistic loss with a Gaussian prior defined over infinitely many dimensions of f. Since it is intractable to integrate out all the hidden variables, a maximum a posteriori (MAP) estimate has to be applied. This formulation has a loss function expression very similar to that of kernel logistic regression, except that it is more general in terms of the definition of the mean and variance of the Gaussian prior.
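    The sketch below tabulates the two losses at a few margins to make the comparison concrete (illustrative only):

```python
import math

def logistic_loss(z):
    """Kernel logistic regression / kCRF-style loss: log(1 + exp(-z))."""
    return math.log(1.0 + math.exp(-z))

def hinge_loss(z):
    """SVM / max-margin loss: (1 - z)_+, exactly zero once the margin exceeds 1."""
    return max(0.0, 1.0 - z)

# Margins z = y * f(x): negative means misclassified, large positive means confident.
for z in [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    print(f"z={z:+.1f}  logistic={logistic_loss(z):.3f}  hinge={hinge_loss(z):.3f}")
# Both penalize small or negative margins; the hinge loss vanishing beyond the
# margin is what yields the sparse dual parameters noted above.
```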

    The preceding analysis of loss functions provides a general view of different classifiers and helps us better understand the classification problem. For the prediction problem (segmentation and labeling) on structured data, a similar analysis can be derived accordingly. As discussed in Section 2.1, the conditional graphical models define the probability of the label sequence y given the observation x directly and use an exponential model to estimate the potential functions. The decision function has the form:

        f*(·) = ∑_{j=1}^L ∑_{c∈C_{G(j)}} ∑_{yc∈Y^{|c|}} λ^{(j)}_{yc} K(·, (x^(j), c, yc)),

    where λ denotes the model parameters, which can be learned by minimizing the loss over the training data. Similar to kernel logistic regression, kernel CRFs take on a logistic loss with an L2 regularizer. Max-margin Markov networks, like SVMs, employ a hinge loss. On the


  • other hand, the Gaussian process sequence classifier (GPSC) is motivated from the Gaussian process point of view; however, its final form is very close to kCRFs.

    In summary, although our work mostly focuses on the logistic loss, our models can be adapted to other loss functions and regularizers, depending on the tradeoff between complexity and effectiveness in specific applications.

    6.2 Related work

    From a machine learning perspective, our conditional graphical model framework is closely related to semi-Markov conditional random fields [Sarawagi and Cohen, 2004] and dynamic conditional random fields [Sutton et al., 2004] (see Chapter for detail). All three models are extensions of the CRF model; however, ours is more expressive in that it allows both semi-Markov assumptions, i.e. assigning a label to a segment (i.e. subsequence) instead of to an individual element, and complex graph structures involving multiple chains. Furthermore, our models are able to handle interactions or associations between nodes even on different chains, thanks to the flexible formulation and the efficient inference algorithms we developed.

    In structural biology, the conventional representation of a protein fold is a graph [Westhead et al., 1999], in which nodes represent the secondary structure components and edges indicate the inter- and intra-chain interactions between the components in the 3-D structure. Therefore the graph representation of protein structures is not novel from that perspective. However, there have been very few studies combining the graph representation with probability theory via graphical models for protein structure prediction. Furthermore, there has been no work on discriminative training of graphical models for this topic.

    7 Conclusion and Future Work

    In this thesis, we propose a framework of conditional graphical models for protein structure prediction. Specifically, we focus on structure topology prediction, in contrast to predicting 3-D coordinates. Based on the specific task, a corresponding segmentation can be defined over the protein sequences. Then a graphical model is applied to capture the interactions between the segments, which correspond to the chemical bonds essential to the stability of the protein structures. To the best of our knowledge, this is the first probabilistic framework to model the long-range interactions directly and conveniently.

    7.1 Thesis statement

    In the thesis proposal, it is hypothesized that conditional graphical models are effective for protein structure prediction. Specifically, they can provide an expressive framework for representing the patterns in protein structures or protein folds, and they make it convenient to incorporate any kind of informative features. They are able to solve the long-range


  • interaction problems in protein fold recognition and alignment prediction, given basic knowledge about the structure topology.

    In our exploration, we have demonstrated the effectiveness of the conditional graphical models for general secondary structure prediction on globular proteins; for tertiary fold (motif) recognition on two specific folds, i.e. the right-handed β-helix and leucine-rich repeats (mostly non-globular proteins); and for quaternary fold recognition on two specific folds, i.e. triple β-spirals and the double-barrel trimer. We thereby verified that the conditional graphical models are theoretically justified and empirically effective for protein structure prediction, independent of the structure hierarchy. Based on the thesis work, we conclude that the statement holds in general. Specifically, we make three strong claims and two weak claims as follows.

    Strong claims:

1. The conditional graphical models have the representation power to capture the structural properties needed for protein structure prediction.

2. The conditional graphical models make it convenient to incorporate any kind of informative features to improve protein structure prediction, including overlapping features, segment-level features, as well as long-range interaction features (a toy feature sketch follows these claims).

3. The complexity of the conditional graphical models grows exponentially with the effective tree-width of the induced graphs. It can be reduced to polynomial complexity using approximate inference algorithms or the chain graph model.

    Weak claims:

1. The conditional graphical models are able to solve the long-range interaction problem in protein fold recognition (either tertiary or quaternary) if the following prior knowledge is given: what the possible structural components are and how they are arranged in 3-D structures. Without this information, the models have only limited ability to handle the long-range interactions.

2. To the best of our knowledge, the conditional graphical models are the most expressive models available for protein structure prediction. They also have the ability to conveniently explore alternative feature spaces via kernels. However, the final prediction performance is limited by the prior knowledge we have about the structures.
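As a toy illustration of strong claim 2 above, the snippet below (Python; all names and the specific feature definitions are hypothetical, chosen only to show the three feature types side by side) sketches an overlapping window feature, a segment-level feature, and a long-range pairwise feature of the sort the model's potentials can consume:

```python
def overlapping_feature(seq, i):
    # Window-based feature at position i; windows at adjacent positions overlap.
    window = seq[max(0, i - 2):i + 3]
    return window.count('L') / max(len(window), 1)    # e.g. local leucine density

def segment_feature(seq, start, end):
    # Segment-level feature: a property of the whole candidate segment [start, end).
    return float(end - start)                         # e.g. segment length

def pairwise_feature(seq, seg_a, seg_b):
    # Long-range interaction feature between two (possibly distant) segments,
    # e.g. a crude identity score between paired strands of a fold.
    a = seq[seg_a[0]:seg_a[1]]
    b = seq[seg_b[0]:seg_b[1]]
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b), 1)
```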

7.2 Contributions

In this thesis, our primary goal is to develop effective machine learning algorithms for protein structure prediction. In addition, we aim at designing novel models that best capture the properties of protein structures, rather than naively applying existing algorithms, so that we can contribute both computationally and biologically. From a computational perspective,

1. We propose a series of conditional graphical models under a unified framework, which enriches current graphical models for structured data prediction, in particular to handle the long-range interactions common in various applications, such as information extraction and machine translation, and furthermore theoretically relaxes the i.i.d. assumption for data with inherent structure.


2. With millions of sequences in the protein databanks (Swiss-Prot or UniProt), efficient structure prediction algorithms are required. Therefore, for each of the graphical models within the framework, we develop different inference and learning algorithms accordingly. Our large-scale applications have demonstrated the efficiency of these inference algorithms and at the same time extended the reach of graphical models to genome-wide and other large-scale applications.

3. In protein structure prediction, we have to face data with various problems, such as very few positive examples, unreliable labels and features, and massive data processing (millions of sequences) with limited computational resources. By incorporating prior knowledge into the graphs via the graphical models, we are able to address these problems; this can be seen as one way of representing domain knowledge via statistical models. In fact, these problems are shared by the data in many real applications. Although our discussion is focused on protein structure prediction, the methodologies and experiences are valuable and applicable to other applications.

From a biological perspective,

1. We propose to use conditional random fields and kernel CRFs for protein secondary structure prediction, and achieve some improvement over the state-of-the-art window-based methods.

2. We propose the segmentation CRFs, the first probabilistic model to consider the long-range interactions in protein fold prediction. It has been demonstrated to be very effective in identifying complex protein folds where most traditional approaches fail. We also hypothesize potential proteins with β-helix structures, which we hope will provide guidance for the experiments of biologists in related areas.

3. We propose the linked SCRF model, the first probabilistic model designed specifically to solve the quaternary fold recognition problem. It is also one of the early models to successfully make predictions for virus proteins.

4. In general, our work provides a better understanding of the mapping from protein sequences to structures, and hopefully our prediction results will shed light on the functions of some protein folds and aid drug design.

7.3 Limitations

Until now, we have proved our thesis statement, i.e. that the conditional graphical models are an effective and graceful solution for protein structure prediction. However, there are also several limitations of the models we have developed:

1. The discriminative graphical models provide a powerful framework in which any kind of features can be used conveniently. However, there is no guideline on how to extract features from the sequences of amino acids automatically. Most of our features are based on domain knowledge, which requires devoted time and effort; we have not found an efficient way to generate features fully automatically. However, we do examine some possibilities of bootstrapping from the motif databases (such as Pfam or PROSITE), and this will be part of our future work.

2. The process of obtaining the ground truth for the potential member proteins of the target motifs requires lab experiments, which take a long time (1-5 years or more). Therefore part of our prediction results cannot be verified in the foreseeable future. However, we do get some verification from recently crystallized proteins that our predictions are reasonable and reliable.

3. Another limitation or concern is the complexity of the model (O(n^d), depending on the specific inference algorithm used), which is much higher than that of some similarity-based methods, such as PSI-BLAST or profile HMMs. However, the cost pays off in terms of prediction accuracy and sensitivity on difficult target folds (motifs). A natural solution is to use sequence-based methods on relatively simple folds, and apply our model to the complex folds.

    7.4 Future work

For future work, we would like to examine multiple directions for extending the thesis work, including:

Efficient inference algorithms In the thesis, we have examined many inference algorithms, such as belief propagation and MCMC sampling. There have been some recent developments in efficient inference algorithms, for example the preconditioner approximation and structural mean field approximation. The main theme of the thesis is developing appropriate models to solve important biological problems; as future work, it will be interesting to examine the efficiency and effectiveness of different approximation algorithms. On the one hand, we can find the best algorithm for fast training and testing of the graphical models we developed. On the other hand, since our applications involve quite complex graphs and millions of examples, they provide an outstanding test case for a thorough examination of inference algorithms.
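For reference, the sketch below (Python/NumPy; the potential arrays are hypothetical inputs, and this is textbook forward-backward rather than the thesis code) shows exact sum-product inference on a chain-structured model; loopy belief propagation on the cyclic graphs used for fold recognition iterates the same message updates until convergence instead of in a single pass:

```python
import numpy as np

def chain_marginals(node_pot, edge_pot):
    # Exact sum-product (forward-backward) on a chain of T discrete nodes.
    # node_pot: (T, K) unnormalized node potentials; edge_pot: (K, K) shared
    # pairwise potential. Returns a (T, K) array of posterior marginals.
    T, K = node_pot.shape
    fwd = np.zeros((T, K))
    bwd = np.zeros((T, K))
    fwd[0] = node_pot[0] / node_pot[0].sum()
    for t in range(1, T):
        fwd[t] = node_pot[t] * (fwd[t - 1] @ edge_pot)
        fwd[t] /= fwd[t].sum()            # rescale to avoid numerical underflow
    bwd[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        bwd[t] = edge_pot @ (node_pot[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()
    marginals = fwd * bwd
    return marginals / marginals.sum(axis=1, keepdims=True)
```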

Virus protein structure prediction Viruses are noncellular biological entities that can reproduce only within a host cell. Most viruses consist of nucleic acids covered by proteins, while some animal viruses are also surrounded by membranes. Inside the infected cell, the virus uses the synthetic capability of the host to produce progeny virus and attack the host cell. There are various kinds of viruses, either species-dependent or species-independent. For example, some famous viruses known to affect human life include the human immunodeficiency virus (HIV), tumor viruses, and the severe acute respiratory syndrome (SARS) virus.

The structures of virus proteins are very important for studying infection processes and designing drugs to stop infection. However, it is extremely difficult to acquire this information by lab experiments because the genes of a virus mutate rapidly and the structures of its proteins change accordingly. On the other hand, there has been very little research on predicting the structures of virus proteins using computational methods.


Many proteins that we work on in the thesis, such as the right-handed β-helices, triple β-spirals and double-barrel trimers, are virus proteins. The success in predicting these protein folds is a strong indication that our model may be effective for sequence analysis and structure prediction of other virus proteins. Computationally, it would be interesting to verify the generality of our model in this exciting area.

Protein function prediction It is widely believed that protein structures reveal important information about their functions. However, it is not straightforward to map sequences to specific functions, since most functional sites or active binding sites consist of only a few residues, which are distant in the primary structure. Previous work on protein function prediction can be summarized as two approaches. One approach focuses on combining information from different resources, such as databases of functional motifs, statistical analysis of microarray data, and protein dynamics. The other approach is motivated by the structural properties of functional sites. For example, TRILOGY searches all possible combinations of triplets in the proteins, and a subset of these patterns are selected as seeds for longer patterns [Bradley et al., 2002]. Both approaches have been investigated extensively; however, the current prediction results are still far from practical use.

In the thesis, we study the problem from the perspective of motifs or super-secondary structures. In some protein families with structural repeats, for example the leucine-rich repeats or the TIM barrel fold, the structures provide a stable basis on which multiple functions or bindings can occur. By predicting the segmentation of a test sequence against these folds, we can identify the regions of structural stability and the functionally active sites. This information, combined with other resources for function prediction, such as location, functional motif identification and microarray data, will provide key indicators for function identification.

Protein-protein interaction prediction Protein structures reveal important information about protein functions; for example, complex protein folds might serve as the structural basis for multiple functional sites or enzyme active sites, while a short super-secondary structure may be a functional site itself. Our current evaluation is based on prediction accuracy. One extension could be predicting protein functions or protein-protein interactions based on our work on tertiary or quaternary structures. If we can predict the functional sites or enzyme sites accurately, this can be seen as an indirect evaluation of our models. It would also address the problem mentioned in our second limitation.

Other applications Besides protein structure prediction, there are many applications with sequential data containing long-range interactions, such as information extraction and video segmentation. It would be interesting to apply our conditional graphical models to those applications and extend them for general structured data prediction, not limited to biological sequences. We are now collaborating with Wendy Chapman on applications in mining emergency room reports.


References

SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and DJ Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25(17):3389–402, 1997.

Yasemin Altun, Thomas Hofmann, and Alexander J. Smola. Gaussian process classification for segmenting and annotating sequences. In ICML '04: Twenty-First International Conference on Machine Learning, 2004.

Philip E. Bourne and Helge Weissig. Structural Bioinformatics: Methods of Biochemical Analysis. Wiley-Liss, 2003.

R. J. Boys and D. A. Henderson. A comparison of reversible jump MCMC algorithms for DNA sequence segmentation using hidden Markov models. Comp. Sci. and Statist., 33:35–49, 2001.

Phil Bradley, Peter S. Kim, and Bonnie Berger. TRILOGY: discovery of sequence-structure patterns across diverse proteins. In RECOMB '02: Proceedings of the Sixth Annual International Conference on Computational Biology, pages 77–88, 2002.

Wray L. Buntine. Chain graphs for learning. In Uncertainty in Artificial Intelligence, pages 46–54, 1995.

W. Chu, Z. Ghahramani, and D. L. Wild. A graphical model for protein secondary structure prediction. In Proc. of International Conference on Machine Learning (ICML-04), pages 161–168, 2004.

Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

A. Delcher, S. Kasif, H. Goldberg, and W. Hsu. Protein secondary-structure modeling with probabilistic networks. In International Conference on Intelligent Systems and Molecular Biology (ISMB'93), pages 109–117, 1993.

D. T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol., 292:195–202, 1999.

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732, 1995.

J. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.

Trevor Hastie, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.


JP Huelsenbeck, B Larget, and ME Alfaro. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol Biol. Evol., 6:1123–33, 2004.

K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846–56, 1998.

H. Kim and H. Park. Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng., 16:553–60, 2003.

R. D. King and M. J. Sternberg. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Sci., 5:2298–2310, 1996.

A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. J Mol Biol., 235(5):1501–31, 1994.

S. Kumar and M. Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In Proc. of ICCV'03, pages 1150–1159, 2003.

John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML'01, pages 282–289, 2001.

John Lafferty, Xiaojin Zhu, and Yan Liu. Kernel conditional random fields: representation and clique selection. In Proc. of International Conference on Machine Learning (ICML-04), 2004.

S. Lauritzen and N. Wermuth. Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics, 17:31–57, 1989.

Yan Liu, Jaime Carbonell, Peter Weigele, and Vanathi Gopalakrishnan. Segmentation conditional random fields (SCRFs): A new approach for protein fold recognition. In Proceedings of RECOMB'05, 2005.

Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. of International Conference on Machine Learning (ICML-00), pages 591–598, 2000.

David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. Table extraction using conditional random fields. In Proceedings of the 26th ACM SIGIR Conference, pages 235–242, 2003.

B. Rost. Review: protein secondary structure prediction continues to rise. J Struct Biol., 134:204–218, 2001.

B. Rost and C. Sander. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol., 232:584–599, 1993.


Sunita Sarawagi and William W. Cohen. Semi-Markov conditional random fields for information extraction. In Proc. of NIPS'2004, 2004.

SC Schmidler, JS Liu, and DL Brutlag. Bayesian segmentation of protein secondary structure. Journal of Computational Biology, 7:233–48, 2000.

F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology, NAACL 2003, 2003.

Charles A. Sutton, Khashayar Rohanimanesh, and Andrew McCallum. Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data. In ICML, 2004.

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Proc. of NIPS'03, 2003.

V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

DR Westhead, TW Slidel, TP Flores, and JM Thornton. Protein structural topology: Automated analysis and diagrammatic representation. Protein Sci., 8(4):897–904, 1999.

Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell., 20:1342–1351, 1998.

Chen Yanover and Yair Weiss. Approximate inference and protein-folding. In Neural Information Processing Systems (NIPS'02), 2002.

Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. In NIPS, pages 1081–1088, 2001.
