Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Mark Craven, Joseph Bockhorst*

Many sequential prediction tasks involve locating instances of pat- terns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally repre- sent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental re- sults from a genomics domain show that our models are more accu- rate at locating instances of overlapping patterns than are baseline models based on HMMs.

1 Introduction

Hidden Markov models (HMMs) and related probabilistic sequence models have been among the most accurate methods used for sequence-based prediction tasks in genomics, natural language processing and other problem domains. One key limitation of these models, however, is that they cannot represent general overlaps among sequence elements in a concise and natural manner. We present a novel approach to modeling and predicting overlapping sequence elements that is based on undirected Markov networks. Our work is motivated by the task of predicting DNA sequence elements involved in the regulation of gene expression in bacteria. Like HMM-based methods, our approach is able to represent and exploit relationships among different sequence elements of interest. In contrast to HMMs, however, our approach can naturally represent sequence elements that overlap in arbitrary ways.

We describe and evaluate our approach in the context of predicting a bacterial genome's genes and regulatory "signals" (together its regulatory elements). Part of the process of understanding a given genome is to assemble a "parts list", often using computational methods, of its regulatory elements. Predictions, in this case, entail specifying the start and end coordinates of subsequences of interest. It is common in bacterial genomes for these important sequence elements to overlap.

```
START END (a) (b) prom prom prom term 1 2 3 1 gene1 gene 2 prom gene term
```

Figure 1: (a) Example arrangement of two genes, three promoters and one terminator in a DNA sequence. (b) Topology of an HMM for predicting these elements. Large circles represent element-specific sub-models and small gray circles represent inter-element sub- models, one for each allowed pair of adjacent elements. Due to the overlapping elements, there is no path through the HMM consistent with the configuration in (a).

Our approach to predicting overlapping sequence elements, which is based on dis- criminatively trained undirected graphical models called conditional Markov net- works [5, 10] (also called conditional random fields), uses two key steps to make a set of predictions. In the first step, candidate elements are generated by having a set of models independently make predictions. In the second step, a Markov network is constructed to decide which candidate predictions to accept.

Consider the task of predicting gene, promoter, and terminator elements encoded in bacterial DNA. Figure 1(a) shows an example arrangement of these elements in a DNA sequence. Genes are DNA sequences that encode information for constructing proteins. Promoters and terminators are DNA sequences that regulate transcrip- tion, the first step in the synthesis of a protein from a gene. Transcription begins at a promoter, proceeds downstream (left-to-right in Figure 1(a)), and ends at a terminator. Regulatory elements often overlap each other, for example prom2 and prom3 or gene1 and prom2 in Figure 1.

One technique for predicting these elements is first to train a probabilistic sequence model for each element type (e.g. [2, 9]) and then to "scan" an input sequence with each model in turn. Although this approach can predict overlapping elements, it is limited since it ignores inter-element dependencies. Other methods, based on HMMs (e.g. [11, 1]), explicitly consider these dependencies. Figure 1(b) shows an example topology of such an HMM. Given an input sequence, this HMM defines a probability distribution over parses, partitionings of the sequence into subsequences corresponding to elements and the regions between them. These models are not nat- urally suited to representing overlapping elements. For the case shown in Figure 1(a) for example, even if the subsequences for gene1 and prom2 match their respective sub-models very well, since both elements cannot be in the same parse there is a competition between predictions of gene1 and prom2. One could expand the state set to include states for specific overlap situations however, the number of states in- creases exponentially with the number of overlap configurations. Alternatively, one could use the factorized state representation of factorial HMMs [4]. These models, however, assume a fixed number of loosely connected processes evolving in parallel, which is not a good match to our genomics domain.

Like HMMs, our method, called CMN-OP (conditional Markov networks for over- lapping patterns), employs element-specific sub-models and probabilistic constraints on neighboring elements qualitatively expressed in a graph. The key difference be- tween CMN-OP and HMMs is the probability distributions they define for an input sequence. While, as mentioned above, an HMM defines a probability distribution over partitions of the sequence, a CMN-OP defines a probability distribution over all possible joint arrangements of elements in an input sequence. Figure 2 illustrates this distinction.

(a) HMM (b) CMN-OP

```
predicted labels sample space predicted signals sample space
end position 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1
2
1 2 3 4 5 6 7 8 3
4
5
6 start position 7 8
```

Figure 2: An illustration of the difference in the sample spaces on which probability distributions over labelings are defined by (a) HMMs and (b) CMN-OP models. The left side of (a) shows a sequence of length eight for which an HMM has predicted that an element of interest occupies two subsequences, [1:3] and [6:7]. The darker subsequences, [4:5] and [8:8], represent sequence regions between predicted elements. The right side of (a) shows the corresponding event in the sample space of the HMM, which associates one label with each position. The left side of (b) shows four predicted elements made by a CMN-OP model. The right side of (b) illustrates the corresponding event in the CMN-OP sample space. Each square corresponds to a subsequence, and an event in this sample space assigns a (possibly empty) label to each sub-sequence.

Do not remove: This comment is monitored to verify that the site is working properly