{"title": "Slice Normalized Dynamic Markov Logic Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1907, "page_last": 1915, "abstract": "Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the domain of time points typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). 
Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks.", "full_text": "Slice Normalized Dynamic Markov Logic Networks\n\nTivadar Papai\n\nHenry Kautz\n\nDaniel Stefankovic\n\nDepartment of Computer Science\n\nUniversity of Rochester\nRochester, NY 14627\n\n{papai,kautz,stefanko}@cs.rochester.edu\n\nAbstract\n\nMarkov logic is a widely used tool in statistical relational learning, which uses\na weighted \ufb01rst-order logic knowledge base to specify a Markov random \ufb01eld\n(MRF) or a conditional random \ufb01eld (CRF). In many applications, a Markov logic\nnetwork (MLN) is trained in one domain, but used in a different one. This pa-\nper focuses on dynamic Markov logic networks, where the size of the discretized\ntime-domain typically varies between training and testing. It has been previously\npointed out that the marginal probabilities of truth assignments to ground atoms\ncan change if one extends or reduces the domains of predicates in an MLN. We\nshow that in addition to this problem, the standard way of unrolling a Markov logic\ntheory into a MRF may result in time-inhomogeneity of the underlying Markov\nchain. Furthermore, even if these representational problems are not signi\ufb01cant for\na given domain, we show that the more practical problem of generating samples\nin a sequential conditional random \ufb01eld for the next slice relying on the samples\nfrom the previous slice has high computational cost in the general case, due to the\nneed to estimate a normalization factor for each sample. We propose a new dis-\ncriminative model, slice normalized dynamic Markov logic networks (SN-DMLN),\nthat suffers from none of these issues. It supports ef\ufb01cient online inference, and\ncan directly model in\ufb02uences between variables within a time slice that do not\nhave a causal direction, in contrast with fully directed models (e.g., DBNs). 
Ex-\nperimental results show an improvement in accuracy over previous approaches to\nonline inference in dynamic Markov logic networks.\n\n1\n\nIntroduction\n\nMarkov logic [1] is a language for statistical relational learning, which employs weighted \ufb01rst-order\nlogic formulas to compactly represent a Markov random \ufb01eld (MRF) or a conditional random \ufb01eld\n(CRF). A Markov logic theory where each predicate can take an argument representing a time point\nis called a dynamic Markov logic network (DMLN). We will focus on two-slice dynamic Markov\nlogic networks, i.e., ones in which each quanti\ufb01ed temporal argument is of the form t or t + 1, in\nthe conditional (CRF) setting. DMLNs are the undirected analogue of dynamic Bayesian networks\n(DBN) [13] and akin to dynamic conditional random \ufb01elds [19].\nDMLNs have been shown useful for relational inference in complex dynamic domains; for example,\n[17] employed DMLNs for reasoning about the movements and strategies of 14-player games of\nCapture the Flag. The usual method for performing of\ufb02ine inference in a DMLN is to simply unroll\nit into a CRF and employ a general MLN or CRF inference algorithm. We will show, however, that\nthe standard unrolling approach has a number of undesirable properties.\nThe \ufb01rst two negative properties derive from the fact that MLNs are in general sensitive to the\nnumber of constants in each variable domain [6]; and so, in particular cases, unintuitive results can\noccur when the length of training and testing sequences differ. First, as one increases the number\nof time points in the domain, the marginals can \ufb02uctuate, even if the observations have little or no\nin\ufb02uence on the hidden variables. Second, the model can become time-inhomogeneous, even if the\nground weighted formulas between the time slices originate from the same weighted \ufb01rst-order logic\nformulas.\nThe third negative property is of greater practical concern. 
In domains where there are a large num-\nber of variables within each slice dynamic programming based exact inference cannot be used. When\n\n1\n\n\fthe number of time steps is high and/or online inference is required, unrolling the entire sequence\n(perhaps repeatedly) becomes prohibitively expensive. Kersting et al. [7] suggests reducing the cost\nby exploiting symmetries while Nath & Domingos [14] propose reusing previously sent messages\nwhile performing a loopy belief propagation. Both algorithms are restricted by the capabilities of\nloopy belief propagation, which can fail to converge to the correct distribution in MLNs. Geier &\nBiundo [2] provides a slice-by-slice approximate inference algorithm for DMLNs that can utilize\nany inference algorithm as a black box, but assumes that projecting the distribution over the random\nvariables at every time step to the product of their marginal distributions does not introduce a large\ndegree of error \u2014 an assumption that does not always hold. Sequential Monte Carlo methods, or\nparticle \ufb01lters, are perhaps the most popular methods for online inference in high-dimensional se-\nquential models. However, except for special cases such as, e.g., the Gaussian distributions used in\n[11], sampling from a two-slice CRF model can become expensive, due to the need to evaluate a\npartition function for each particle (see Sec. 3 for more details).\nAs a solution to all of these concerns, we propose a novel way of unrolling a Markov logic theory\nsuch that in the resulting probabilistic model a smaller CRF is embedded into a larger CRF mak-\ning the clique potentials between adjacent slices normalized. We call this model slice normalized\ndynamic Markov logic network (SN-DMLN). 
Because of the embedded CRF and the undirected\ncomponents in our proposed model, the distribution represented by a SN-DMLN cannot be com-\npactly captured by conventional chain graph [10], DBN or CRF graph representations, as we will\nexplain in Sec. 4. The SN-DMLN has none of the negative theoretical or practical properties out-\nlined above, and for accuracy and/or speed of inference matches or outperforms unrolled CRFs and\nthe slice-by-slice approximate inference methods. Finally, because the maximum likelihood param-\neter learning for an SN-DMLN can be a non-convex optimization problem, we provide an effective\nheuristic for weight learning, along with initial experimental results.\n2 Background\n\nProbabilistic graphical models compactly represent probability distributions using a graph struc-\nture that expresses conditional independences among the variables. Directed graphical models are\nmainly used in the generative setting, i.e., they model the joint distribution of the hidden variables\nand the observations, and during training the joint probability of the training data is maximized.\nHidden Markov models are the prototypical directed models used for sequential data with hidden\nand observable parts. It has been demonstrated that for classi\ufb01cation problems, discriminative mod-\nels, which model the conditional probability of the hidden variables given the observations, can\noutperform generative models [12]. The main justi\ufb01cations for their success are that complex de-\npendencies between observed variables do not have to be modeled explicitly, and the conditional\nprobability of the training data (which is maximized during parameter learning) is a better objective\nfunction if we eventually want to use our model for classi\ufb01cation. Markov random \ufb01elds (MRFs)\nand conditional random \ufb01elds (CRFs) belong to the class of undirected graphical models. MRFs\nare generative models, while CRFs are their discriminative version. 
(For a more detailed discussion of the relationships between these models see [8].) Markov logic [1] is a first-order probabilistic language that allows one to define template features that apply to whole classes of objects at once. A Markov logic network is a set of weighted first-order logic formulas and a finite set of constants C = {c1, c2, . . . , c|C|} which together define a Markov network ML,C that contains a binary node for each possible grounding of each predicate (ground atom) and a binary valued feature for each grounding of each first-order logic formula. We will also call the ground atoms variables (since they are random variables). In each truth assignment to the variables, each variable or feature (ground formula) evaluates to 1 (true) or 0 (false). In this paper we assume function-free clauses and Herbrand interpretations. Using the knowledge base we can create either an MRF or a CRF. If we instantiate the model as a CRF, the conditional probability of a truth assignment y to the hidden ground atoms (query atoms) in an MLN, given truth assignment x to the observable ground atoms (evidence atoms), is defined as:

\Pr(Y = y \mid X = x) = \frac{\exp\big(\sum_i w_i \sum_j f_{i,j}(x, y)\big)}{Z(x)},   (1)

where f_{i,j}(x, y) = 1 if the jth grounding of the ith formula is true under truth assignment {x, y}, and f_{i,j}(x, y) = 0 otherwise; w_i is the weight of the ith formula and Z(x) is the normalization factor. Ground formulas share the same weight if they are groundings of the same weighted first-order logic formula, and (1) could be expressed in terms of n_i(x, y) = \sum_j f_{i,j}(x, y). Instantiation as an MRF can be done similarly, with an empty set of evidence atoms. Dynamic MLNs [7] are MLNs with distinguished arguments in every predicate representing the flow of time or some other sequential quantity.
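For intuition, the conditional distribution in (1) can be computed by brute-force enumeration on a tiny ground network. The sketch below is purely illustrative (a hypothetical two-atom theory and interface of our own, not the paper's system):

```python
import itertools
import math

def mln_conditional(weights, features, n_hidden, x):
    """Pr(Y = y | X = x) as in eq. (1), by exhaustive enumeration.
    features[i](x, y) returns n_i(x, y), the number of true groundings
    of formula i (hypothetical toy interface)."""
    scores = {}
    for y in itertools.product([0, 1], repeat=n_hidden):
        scores[y] = math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
    z = sum(scores.values())  # Z(x): the normalizer depends on the evidence x
    return {y: s / z for y, s in scores.items()}

# Toy theory: formula 1 = "y0 v y1" (weight 1.0), formula 2 = "x => y0" (weight 2.0)
weights = [1.0, 2.0]
features = [lambda x, y: int(y[0] or y[1]),
            lambda x, y: int((not x) or y[0])]
dist = mln_conditional(weights, features, 2, x=1)
```

Real MLN inference replaces the enumeration with sampling (e.g., MC-SAT), since the number of assignments is exponential in the number of ground atoms.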
In our setting, Y_t and X_t will denote the sets of hidden and observable random variables, respectively, at time t, and Y_{1:t} and X_{1:t} the corresponding sets from time step 1 to t. Each set can contain many variables, and we should note that their distribution will be represented compactly by weighted first-order logic formulas. The formulas in the knowledge base can be partitioned into two sets. The transition part contains the formulas for which it is true that, for any grounding of each formula, there is a t such that the grounding shares variables only with Y_t and Y_{t+1}. The emission part represents the formulas which connect the hidden and observable variables, i.e., Y_t and X_t. We will use P̃(Y_t, Y_{t+1}) (or P̃(Y_{t:t+1})) and P̃(Y_t, X_t) to denote the products of the potentials corresponding to weighted ground formulas at time t of the transition and the observation formulas, respectively. Since some ground formulas may contain only variables from Y_t (i.e., they are defined over hidden variables within the same slice), in order to count the corresponding potentials exactly once, we always include their potentials in P̃(Y_t, Y_{t-1}), and for t = 1 we have a separate P̃(Y_1). Hence, the distribution defined in (1) in sequential domains can be factorized as:

\Pr(Y_{1:t} = y_{1:t} \mid X_{1:t} = x_{1:t}) = \frac{\tilde{P}_1(Y_1 = y_1) \prod_{i=2}^{t} \tilde{P}(Y_{i-1:i} = y_{i-1:i}) \prod_{i=1}^{t} \tilde{P}(Y_i = y_i, X_i = x_i)}{Z(x_{1:t})}   (2)

In the rest of the paper, we only allow the temporal domain to vary; the rest of the domains are fixed.

3 Unrolling MLNs into random fields in temporal domains

We now describe disadvantages of the standard definition of DMLNs, i.e., when the knowledge base is unrolled into a CRF:

1. As one increases the number of time points, the marginals can fluctuate, even if all the clique potentials P̃(Y_i = y_i, X_i = x_i) in (2) are uninformative.

2.
The transition probability Pr(Y_{i+1}|Y_i) can depend on i, even if every P̃(Y_i = y_i, X_i = x_i) is uninformative and the ground formulas covering the transitions between every i and i + 1 originate from the same weighted first-order logic formula.

3. Particle filtering is costly in general, i.e., if we have the marginal probabilities at time t, we cannot compute them at time t + 1 using particle filtering unless certain special conditions are satisfied.

Saying that P̃(Y_i = y_i, X_i = x_i) is uninformative is equivalent to saying that P̃(Y_i = y_i, X_i = x_i) is constant. (Note that if Y_i and X_i are independent, i.e., for some q and r, P̃(Y_i = y_i, X_i = x_i) = r(y_i)q(x_i), then q could be marginalized out and r(Y_i) could be absorbed into P̃(Y_i, Y_{i-1}) in (2).) To demonstrate Property 1, consider an unrolled MRF with the temporal domain T = {1, . . . , T}, with the single predicate P(t) (t ∈ T) and with the weighted formulas (+∞, P(t) ⇔ P(t + 1)) (hard constraint) and (w, P(t)) (soft constraint). Because of the hard constraint, only the sequences ∀t: P(t) and ∀t: ¬P(t) have non-zero probability. The soft weights imply that Pr(P(t)) = exp(wT) Pr(¬P(t)), i.e., Pr(P(t)) converges to 1, 0 or 0.5 with exponential rate, depending on the sign of w. But we are not always fortunate enough to have converging marginals: e.g., if we change the hard constraint to P(t) ⇔ ¬P(t + 1) and w ≠ 0, the marginals will diverge. If T is even, then for every t ∈ T, Pr(P(t)) = Pr(¬P(t)), since in both surviving sequences P(t) has the same number of true groundings. If T is odd, then for every odd t ∈ T: Pr(P(t)) = exp(w) Pr(¬P(t)). Consequently, we have diverging marginals as T → +∞.
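The parity dependence above is easy to reproduce by brute-force enumeration of the unrolled MRF; a minimal sketch (toy code of our own, not the paper's implementation):

```python
import itertools
import math

def marginal_p1(T, w):
    """Marginal Pr(P(1)) in the unrolled MRF with the hard constraint
    P(t) <=> not P(t+1) and the soft formula (w, P(t)), by enumeration."""
    num = den = 0.0
    for seq in itertools.product([0, 1], repeat=T):
        if any(seq[t] == seq[t + 1] for t in range(T - 1)):
            continue                      # violates the hard constraint
        weight = math.exp(w * sum(seq))   # soft formula fires once per true P(t)
        den += weight
        if seq[0]:
            num += weight
    return num / den

even, odd = marginal_p1(4, 1.0), marginal_p1(5, 1.0)
```

For even T the two surviving alternating sequences carry equal weight, so the marginal is 1/2; for odd T their weights differ by a factor of exp(w), so the marginal depends on the parity of T.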
This phenomenon not only makes the inference unreliable, but a weight learning algorithm that maximizes the log-likelihood of the data would also produce different weights depending on whether T is even or odd. A similar effect, arising from moving between domains of different sizes, is discussed in more detail in [6]. The related Property 2 (inhomogeneity) can be demonstrated similarly; consider, e.g., an MLN with a single first-order logic formula P(t) ∨ P(t + 1) with weight w. For the sake of simplicity, assume T = 3. The unrolled MRF defines a distribution where

\Pr(\neg P(3) \mid \neg P(2)) = \frac{1 + \exp(w)}{1 + 2\exp(w) + \exp(2w)},

which is not equal to

\Pr(\neg P(2) \mid \neg P(1)) = \frac{1 + \exp(w)}{1 + \exp(w) + 2\exp(2w)}

for an arbitrary choice of w.

The examples we just gave involved hard constraints. In fact, we can show that if there are no hard constraints, then as T increases the marginals converge and the system becomes homogeneous (except for a finite number of transitions). Consider the matrix Φ s.t. Φ_{i,j} = P̃(Y_t = a_j, Y_{t-1} = a_i), where a_i, i = 1, . . . , N is an enumeration of all the possible truth assignments within a slice and N is the number of possible truth assignments in the slice. Let

\Pr{}_T(Y_1 = y_1) = \frac{1}{Z(Y_{1:T})} \sum_{y_2, \ldots, y_T} \prod_{i=1}^{T-1} \tilde{P}(Y_i = y_i, Y_{i+1} = y_{i+1}), \quad \text{where} \quad Z(Y_{1:T}) = \sum_{y_1, \ldots, y_T} \prod_{i=1}^{T-1} \tilde{P}(Y_i = y_i, Y_{i+1} = y_{i+1}).

Proposition 1. lim_{t→∞} Pr_t(Y_1 = y) exists if Φ is a positive matrix, i.e., ∀i, j: Φ_{i,j} > 0.

Proof. Using Φ and the notation \vec{1} for the all-one vector and \vec{e}_i for the vector which has 1 at the ith component and 0 everywhere else, we can express Pr_t(Y_1 = a_i) as:

\Pr{}_t(Y_1 = a_i) = \frac{\vec{e}_i^{\,T}\, \Phi^{t-1}\, \vec{1}}{\vec{1}^{T}\, \Phi^{t-1}\, \vec{1}}   (3)

Since Φ is positive, we can apply theorem 8.2.8.
from [5], i.e., if the spectral radius of Φ is ρ(Φ) (which is always positive for positive matrices), then lim_{t→∞} (ρ^{-1}(Φ) Φ)^t = L, where L = x y^T, Φ x = ρ(Φ) x, Φ^T y = ρ(Φ) y, x > 0, y > 0 and x^T y = 1. Dividing both the numerator and the denominator by ρ^{t-1}(Φ) in (3) proves the convergence of Pr_t(Y_1 = y).

The issue of diverging marginals and time-inhomogeneity has not previously been recognized as a practical problem. However, the increasing interest in probabilistic models that contain large numbers of deterministic constraints (see, e.g., [4]) might bring these issues to the fore. This proposition can serve as an explanation of why, in practice, we do not encounter diverging marginals in linear-chain-type CRFs, and why, except for a finite number of transitions, the model becomes time-homogeneous.

A more significant practical challenge is described by Property 3: the problem of sampling from Pr(Y_t | X_{1:t} = x_{1:t}) using the previously drawn samples from Pr(Y_{t-1} | X_{1:t-1} = x_{1:t-1}). In a directed graphical model (e.g., in a hidden Markov model), following standard particle filter design, having sampled s_{1:t-1} ∼ Pr(Y_{1:t-1} = s_{1:t-1} | X_{1:t-1} = x_{1:t-1}), one would then use s_{1:t-1} to sample s_t ∼ Pr(Y_t | Y_{1:t-1} = s_{1:t-1}, X_{1:t-1}). Since

\Pr(Y_{1:t} = s_{1:t} \mid X_{1:t-1} = x_{1:t-1}) = \Pr(Y_t = s_t \mid Y_{t-1} = s_{t-1}) \Pr(Y_{1:t-1} = s_{1:t-1} \mid X_{1:t-1} = x_{1:t-1}),   (4)

we do not have any difficulty performing this sampling step, and all that is left is to re-sample the collection of s_{1:t} with importance weights Pr(Y_t = s_t | X_t = x_t).
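As an aside, the convergence guaranteed by Proposition 1 is easy to verify numerically: for any strictly positive Φ, the normalized vector in eq. (3) stabilizes geometrically. A small self-contained sketch (the 2×2 potentials are an arbitrary illustration of ours):

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pr_y1(phi, t):
    """Pr_t(Y1 = a_i) proportional to e_i^T Phi^(t-1) 1, as in eq. (3)."""
    n = len(phi)
    m = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(t - 1):
        m = matmul(m, phi)
        s = sum(sum(row) for row in m)
        m = [[v / s for v in row] for row in m]  # rescale to avoid overflow
    sums = [sum(row) for row in m]  # e_i^T Phi^(t-1) 1, up to a common factor
    z = sum(sums)
    return [v / z for v in sums]

phi = [[1.0, 2.0], [0.5, 1.5]]  # strictly positive, otherwise arbitrary
m40, m41 = pr_y1(phi, 40), pr_y1(phi, 41)
```

The gap between the two largest eigenvalues of Φ controls the convergence rate, exactly as the Perron–Frobenius argument in the proof suggests.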
The analogue of this process does not work in a CRF in general. If one first draws a sample s_{1:t-1} ∼ P̃(Y_1) P̃(Y_1, X_1 = x_1) \prod_{i=2}^{t-1} P̃(Y_i, Y_{i-1}) P̃(Y_i, X_i = x_i), and then draws s_t ∼ P̃(Y_t, Y_{t-1} = s_{t-1}), we end up sampling from:

s \sim \tilde{P}(Y_1)\, \tilde{P}(Y_1, X_1 = x_1) \prod_{i=2}^{t} \tilde{P}(Y_i, Y_{i-1})\, \tilde{P}(Y_i, X_i = x_i)\, \frac{1}{Z_{t-1}(y_{t-1})},   (5)

where Z_{t-1}(y_{t-1}) = \sum_{y_t} \tilde{P}(Y_t = y_t, Y_{t-1} = y_{t-1}). Unless Z_{t-1}(y_{t-1}) is the same for every y_{t-1}, it is necessary to approximate Z_{t-1}(s_{t-1}) for every s_{t-1}.¹ Although several algorithms have been proposed to estimate partition functions [16, 18], partition function estimation can significantly increase both the running time of the sampling algorithm and the error of its approximation. While there are restricted special cases where the normalization factor can be ignored [11], in general ignoring Z_{t-1}(y_{t-1}) can cause a large error in the computed marginals. Consider, e.g., the following three weighted formulas in the previously used toy domain: w: ¬P(Y_t) ∨ ¬P(Y_{t+1}), −w: P(Y_t) ∧ ¬P(Y_{t+1}) and w′: P(Y_t) ↔ ¬P(Y_{t+1}), where w > 0 and w′ < 0.
It can be proved that in this setting, using particle filtering in a CRF without accounting for Z_{t-1}(y_{t-1}) would result in lim_{t→∞} Pr(P(Y_t)) = 1/2, while the correct marginal in the CRF would be

\lim_{t \to \infty} \Pr(P(Y_t)) = 1 - \frac{\exp(w)}{1 + \exp(w)} \exp(w') + O(\exp(2w')),

which gets arbitrarily close to 1 as we decrease w′.

4 Slice normalized DMLNs

As we demonstrated in Section 3, the root cause of the weaknesses of an ordinarily unrolled CRF lies in the fact that P̃(Y_t = y_t, Y_{t-1} = y_{t-1}) is unnormalized, i.e., \sum_{y_t} P̃(Y_t = y_t, Y_{t-1} = y_{t-1}) ≠ 1 in general. One approach to introducing normalization could be to use maximum entropy Markov models (MEMM) [12]. In that case we would directly represent Pr(Y_t | X_t, Y_{t-1}); hence we could implement a sequential Monte Carlo algorithm by simply sampling s_t ∼ Pr(Y_t | X_t = x_t, Y_{t-1} = s_{t-1}) directly from slice to slice. However, it was pointed out in [9] that MEMMs suffer from the label-bias problem, as a solution to which CRFs were invented. Chain graphs (see, e.g., [10]) also have the advantage of mixing directed and undirected components, and would be a tempting choice, but they could only model the transition between slices by representing either (i) Pr(Y_t | X_t = x_t, Y_{t-1} = s_{t-1}), in which case the model would again suffer from the label-bias problem, or (ii) Pr(Y_t, X_t | Y_{t-1}), or (iii) Pr(X_t | Y_t) and Pr(Y_t | Y_{t-1}). The distributions defined in (ii) and (iii) do not give any advantage in performing the sampling step in (4), and, similarly to CRFs, would require the expensive computation of a normalization factor.

¹ Exploiting the inner structure of the graphical model within the slice would in the worst case still require computing the expensive partition function, or could result in a higher-variance estimator, in the same way as, e.g., using a uniform proposal distribution does.
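To make the obstacle concrete: the per-state normalizer Z_{t−1}(y_{t−1}) from eq. (5) genuinely varies with y_{t−1} even in tiny models. A hedged sketch with made-up potentials over a single binary atom (the weights are arbitrary choices of ours, loosely mirroring the toy formulas above):

```python
import math

# Illustrative transition weights:
#   w  : clause "not P(t) or not P(t+1)"
#   wp : biconditional "P(t) <-> not P(t+1)"
w, wp = 1.0, -2.0

def pot(y_next, y_prev):
    """Unnormalized two-slice potential P~(Y_t = y_next, Y_{t-1} = y_prev)."""
    s = w * int(not (y_prev and y_next))   # clause fires unless both are true
    s += wp * int(y_next != y_prev)        # biconditional fires when they differ
    return math.exp(s)

# Z_{t-1}(y_{t-1}) = sum over y_t of the potential; it differs per state,
# which is exactly why naive chained sampling in a CRF is biased.
z = {y_prev: sum(pot(y_next, y_prev) for y_next in (0, 1)) for y_prev in (0, 1)}
```

Since z[0] ≠ z[1], a sampler that chains slices without correcting for Z_{t−1} reweights the trajectories incorrectly, as eq. (5) shows.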
We propose a slice normalized dynamic Markov logic network (SN-DMLN) model, which consists of directed and undirected components at the high level, and can be thought of as a smaller CRF nested into a larger CRF, describing the transition probabilities constructed using weighted first-order logic formulas as templates. SN-DMLNs neither suffer from the label-bias problem, nor bear the disadvantageous properties presented in Section 3. The distribution defined by an unrolled SN-DMLN is as follows:

\Pr(Y_{1:t} = y_{1:t} \mid X_{1:t} = x_{1:t}) = \frac{1}{Z(x_{1:t})} P_1(Y_1 = y_1) \prod_{i=2}^{t} P(Y_i = y_i \mid Y_{i-1} = y_{i-1}) \prod_{i=1}^{t} \tilde{P}(Y_i = y_i, X_i = x_i),   (6)

where

P_1(Y_1 = y_1) = \frac{\tilde{P}(Y_1 = y_1)}{\sum_{y'_1} \tilde{P}(Y_1 = y'_1)}, \qquad P(Y_i = y_i \mid Y_{i-1} = y_{i-1}) = \frac{\tilde{P}(Y_i = y_i, Y_{i-1} = y_{i-1})}{\sum_{y'_i} \tilde{P}(Y_i = y'_i, Y_{i-1} = y_{i-1})},

and the partition function is defined by:

Z(x_{1:t}) = \sum_{y_1, \ldots, y_t} \Big( P_1(Y_1 = y_1) \prod_{i=2}^{t} P(Y_i = y_i \mid Y_{i-1} = y_{i-1}) \prod_{i=1}^{t} \tilde{P}(Y_i = y_i, X_i = x_i) \Big).

P(Y_t = y_t | Y_{t-1} = y_{t-1}) is defined by a two-slice Markov logic network (CRF), which describes the state transition probabilities in a compact way. If we hide the details of this nested CRF component and treat it as one potential, we could represent the distribution in (6) by regular chain graphs or CRFs; however, we would then lose the compactness the nested CRF provides for describing the distribution. Similarly, we could collapse the variables at every time slice into one and use a DBN (or again a chain graph), but it would need exponentially more entries in its conditional probability tables.
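The nested transition component of (6) is just a per-state normalization of the two-slice potentials, so each conditional can be tabulated (or sampled from) without any global partition function. A minimal illustrative sketch, with a made-up persistence potential (names and weights are ours):

```python
import math

def slice_normalized_transition(pot, states):
    """Build P(Y_i = y2 | Y_{i-1} = y) = P~(y2, y) / sum_{y3} P~(y3, y),
    the nested-CRF transition of eq. (6), for enumerable slice states."""
    table = {}
    for y in states:
        z = sum(pot(y2, y) for y2 in states)  # local, per-state normalizer
        for y2 in states:
            table[(y2, y)] = pot(y2, y) / z
    return table

# Toy potential: weight 1.5 on "the slice state persists" (arbitrary choice)
pot = lambda y2, y: math.exp(1.5 * int(y2 == y))
P = slice_normalized_transition(pot, (0, 1))
```

A particle filter can now draw s_t directly from P(· | Y_{t−1} = s_{t−1}); no per-particle partition-function estimate is needed, which is the point of the construction.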
If P̃(Y_i = y_i, X_i = x_i) does not have any information content, the probability distribution defined in (6) reduces to P_1(Y_1 = y_1) \prod_{i=2}^{t} P(Y_i = y_i | Y_{i-1} = y_{i-1}), which is a time-homogeneous Markov chain²; hence this model clearly has neither Property 1 nor Property 2, no matter what formulas are present in the knowledge base. Furthermore, we do not have to compute the partition function between the slices, because, as equation (5) shows, drawing a sample y_t ∼ P̃(Y_t, Y_{t-1} = y_{t-1}) while keeping the value y_{t-1} fixed is equivalent to sampling from P(Y_t | Y_{t-1} = y_{t-1}), the quantity present in equation (6). This means that using our model one can avoid estimating Z(y_{t-1}). To learn the parameters of the model we maximize the conditional log-likelihood (L) of the data. We use a modified version of a hill-climbing algorithm. The modification is needed because in our case L is not necessarily concave. We partition the weights (parameters) of our model based on whether they belong to the transition or to the emission part of the model. The gradient of L for a data sequence d = y_1, x_1, . . . , y_t, x_t w.r.t. an emission parameter w_e (to which feature n_e belongs) is:

\frac{\partial L_d}{\partial w_e} = \sum_{i=1}^{t} n_e(y_i, x_i) - \mathrm{E}_{\Pr(Y|X=x)}\Big[ \sum_{i=1}^{t} n_e(Y_i, x_i) \Big],   (7)

which is analogous to what one would expect for CRFs. However, for a transition parameter w_tr (belonging to feature n_tr) we get something different:

\frac{\partial L_d}{\partial w_{tr}} = \sum_{i=1}^{t-1} n_{tr}(y_{i+1}, y_i) - \sum_{i=1}^{t-1} \mathrm{E}_{P(Y_{i+1}|y_i)}\big[ n_{tr}(Y_{i+1}, Y_i = y_i) \big] - \mathrm{E}_{\Pr(Y|X=x)}\Big[ \sum_{i=1}^{t-1} n_{tr}(Y_{i+1}, Y_i) - \sum_{i=1}^{t-1} \mathrm{E}_{P(\tilde{Y}_{i+1}|Y_i)}\big[ n_{tr}(\tilde{Y}_{i+1}, Y_i) \big] \Big].   (8)

(Note that L_d is concave w.r.t. the emission parameters, i.e., when the transition parameters are kept fixed; allowing the transition parameters to vary makes L_d no longer concave.)
² Note that in the SN-DMLN model the uniformity of P̃(Y_i = y_i, X_i = x_i) is a stronger assumption than the independence of X_i and Y_i.

Table 1: Formulas in the knowledge base

friendships reflect people's similarity in smoking habits:
  Smokes(p1, t) ∧ ¬Smokes(p2, t) ∧ (p1 ≠ p2) ⊃ ¬Friends(p1, p2, t)
  Smokes(p1, t) ∧ Smokes(p2, t) ∧ (p1 ≠ p2) ⊃ Friends(p1, p2, t)
  ¬Smokes(p1, t) ∧ ¬Smokes(p2, t) ∧ (p1 ≠ p2) ⊃ Friends(p1, p2, t)
symmetry and reflexivity of friendship:
  ¬Friends(p1, p2, t) ⊃ ¬Friends(p2, p1, t)
  Friends(p1, p2, t) ⊃ Friends(p2, p1, t)
  Friends(p, p, t)
persistence of smoking:
  Smokes(p, t) ⊃ Smokes(p, t + 1)
  ¬Smokes(p, t) ⊃ ¬Smokes(p, t + 1)
people with different smoking habits hang out separately:
  Hangout(p1, g1, t) ∧ Hangout(p2, g2, t) ∧ Smokes(p1, t) ∧ (p1 ≠ p2) ∧ (g1 ≠ g2) ⊃ ¬Smokes(p2, t)
  Hangout(p1, g1, t) ∧ Hangout(p2, g2, t) ∧ ¬Smokes(p1, t) ∧ (p1 ≠ p2) ∧ (g1 ≠ g2) ⊃ Smokes(p2, t)

In (8) the first two and the last two terms can be grouped together. The first group represents the gradient in the case of uninformative observations, i.e., when the model simplifies to a Markov chain with a compactly represented transition probability distribution. The second group is the expected value of the expression in the first group. The first three terms correspond to the gradient of a concave function, while the fourth term corresponds to the gradient of a convex function, so the function as a whole is not guaranteed to be maximized by convex optimization techniques alone. Therefore, we chose a heuristic for our optimization algorithm which gradually increases the effect of the second group in the gradient. More precisely, we always compute the gradient w.r.t.
w_e according to (7), but w.r.t. w_tr we use:

\frac{\partial L_d}{\partial w_{tr}} = \sum_{i=1}^{t-1} n_{tr}(y_{i+1}, y_i) - \sum_{i=1}^{t-1} \mathrm{E}_{P(Y_{i+1}|y_i)}\big[ n_{tr}(Y_{i+1}, y_i) \big] - \alpha\, \mathrm{E}_{\Pr(Y|X=x)}\Big[ \sum_{i=1}^{t-1} n_{tr}(Y_{i+1}, Y_i) - \sum_{i=1}^{t-1} \mathrm{E}_{P(\tilde{Y}_{i+1}|Y_i)}\big[ n_{tr}(\tilde{Y}_{i+1}, Y_i) \big] \Big],   (9)

where α is kept at the value 0 until convergence, and is then gradually increased from 0 to 1 to converge to the nearest local optimum. In Section 5 we experimentally demonstrate that this heuristic provides reasonably good results; hence we did not turn to more sophisticated algorithms. The rationale behind our heuristic is that if P̃(Y_i = y_i, X_i = x_i) truly had no information content, then for α = 0 we would find the global optimum, and as we increase α we take into account, with increasing weight, that the observations are correlated with the hidden variables.

5 Experiments

For our experiments we extended the Probabilistic Consistency Engine (PCE) [3], a Markov logic implementation that has been used effectively in different problem domains. To estimate the gradients during training, we used 10000 samples for the unrolled CRF, and 100 particles and 100 samples for approximating the conditional expectations in (9) for the SN-DMLN. For inference we used 10000 samples for the CRF and 10000 particles for the mixed model. The sampling algorithm we relied on was MC-SAT [15]. Our training data set was a modified version of the dynamic social network example [7, 2]. The hidden predicates in our knowledge base were Smokes(person, time) and Friends(person1, person2, time), and the observable predicate was Hangout(person, group, time). The goal of inference was to predict which people could potentially be friends, based on the similarity of their smoking habits, which in turn could be inferred from the groups in which the individuals hang out.
We generated training and test data as follows:\nthere were two groups g1, g2, one for smokers and one for non-smokers. Initially 2 people were\nrandomly chosen to be smokers and 2 to be non-smokers. People with the same smoking habits\ncan become friends at any time step with probability 1 \u2212 0.05\u03b1, and a smoker and a non-smoker\ncan become friends with probability 0.05\u03b1. Every 5th time step (starting with t = 0) people hang\nout in groups and for each person the probability of joining one of the groups is 1 \u2212 0.05\u03b1. With\nprobability 1\u2212 0.05\u03b1, everyone spends time with the group re\ufb02ecting their smoking habits, and with\nprobability 0.05\u03b1 they go to hang out with the other group. The rest of the days people do not hang\nout. The smoking habits persist, i.e., a smoker stays a smoker and a non-smoker stays a non-smoker\nat the next time step with probability 1 \u2212 0.05\u03b1. In our two con\ufb01gurations we had \u03b1 = 0 (deter-\nministic case) and \u03b1 = 1 (non-deterministic case). The weights of the clauses we learned using the\nSN-DMLN and the CRF unrolled models are in Table 1.\nWe used chains with length 5, 10, 20 and 40 as training data, respectively. For each chain we had\n40, 20, 10 and 5 examples both for the training and for testing, respectively. 
In our experiments we compared three types of inference, and measured the prediction quality for the hidden predicate Friends by assigning true to every ground atom whose marginal probability was greater than 0.55, and false to every ground atom whose marginal probability was less than 0.45; otherwise we counted the atom as a misclassification.

Table 2: Accuracy and F-score results when models were trained and tested on chains with the same length

                    α = 0                                α = 1
           accuracy           f1               accuracy           f1
length  SN   MAR  MC-SAT  SN   MAR  MC-SAT  SN   MAR  MC-SAT  SN   MAR  MC-SAT
  5     1.0  0.40  1.0    1.0  0.14  1.0    0.84 0.36  0.81   0.75 0.10  0.69
 10     1.0  0.40  0.97   1.0  0.14  0.95   0.84 0.36  0.77   0.74 0.11  0.61
 20     1.0  0.40  0.67   1.0  0.14  0.49   0.92 0.55  0.66   0.85 0.32  0.47
 40     1.0  0.85  0.60   1.0  0.72  0.43   0.88 0.73  0.59   0.78 0.55  0.42

[Figure 1: F-score of models trained and tested on the same length of data; (a) α = 0, (b) α = 1.]

Prediction of Smokes was impossible in the generated data set, because the data generation was symmetric w.r.t. smoking and not smoking, and from the observations we could only tell that certain pairs of people have similar or different smoking habits, but not who smokes and who does not. The three methods we compared were: (i) particle filtering in the SN-DMLN model (SN); (ii) the approximate online inference algorithm of [2], which projects the inferred distribution of the random variables at the previous slice to the product of their marginals, and incorporates this information into a two-slice MLN to infer the probabilities at the next slice (we re-implemented the algorithm in PCE) (MAR); and (iii) a general inference algorithm (MC-SAT [15]) run on a CRF which is completely unrolled anew at every time step (UNR). In UNR and MAR the same CRF models were used.
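The thresholding rule used in the evaluation can be stated in a few lines; a small illustrative helper (the function and atom names are ours, not PCE's):

```python
def classify(marginals, hi=0.55, lo=0.45):
    """Threshold marginals as in the evaluation: > hi -> true, < lo -> false,
    anything in between counts as a misclassification regardless of the label."""
    out = {}
    for atom, p in marginals.items():
        out[atom] = True if p > hi else (False if p < lo else None)  # None = miss
    return out

preds = classify({"Friends(a,b,1)": 0.9,
                  "Friends(a,c,1)": 0.5,
                  "Friends(b,c,1)": 0.1})
```

Atoms mapped to None are counted against both accuracy and F-score, since they match no gold label.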
Training the SN-DMLN model took approximately 120 minutes in all the test cases, while training the CRF model took 120, 145, 175 and 240 minutes, respectively. Inference over the entire test set took approximately 6 minutes for SN and MAR in every test case, while UNR required 5, 8, 12 and 40 minutes for the different test cases. The accuracy and F-scores for the different test cases are summarized in Table 2, and the F-scores are plotted in Fig. 1.
SN outperforms MAR because, given the knowledge base, MAR can only conclude that people have the same or different smoking habits on the days when people hang out (every 5th time step), and the marginal distributions of Smokes do not carry enough information about which pairs of people have different smoking habits; hence the quality of MAR's prediction decreases on days when people do not hang out. The performance of SN and MAR stays the same as we increase the length of the chain, while the performance of UNR degrades. This is most pronounced in the deterministic case (α = 0), and can be explained by the fact that MC-SAT requires more sampling steps to maintain the same performance as the chain length increases.
To demonstrate that, using the same number of particles in SN as samples in UNR, the performance of SN stays approximately the same while the performance of UNR degrades over time, we trained both the CRF and the SN-DMLN on length-5 chains, on which SN and UNR performed equally well, and used test sets of lengths up to 150. The F-scores are plotted in Fig. 2.

Figure 2: F-score of models trained and tested on different lengths of data. Panels: (a) α = 0, (b) α = 1.

We see from Fig. 2 that SN outperforms both UNR and MAR as the chain length increases. Moreover, UNR's performance clearly decreases as the length of the chain increases.
6 Conclusion

In this paper, we explored the theoretical and practical questions of unrolling a sequential Markov logic knowledge base into different probabilistic models. The theoretical issues arising in a CRF-based MLN unrolling are a warning that unexpected results may occur if the observations are weakly correlated with the hidden variables. We gave a qualitative justification of why this phenomenon is more of a theoretical concern in domains lacking deterministic constraints. We demonstrated that the CRF-based unrolling can be outperformed by a model that mixes directed and undirected components; the proposed model suffers neither from the theoretical weaknesses above nor from the label-bias problem.
From a more practical point of view, we showed that our proposed model provides computational savings when the data has to be processed in a sequential manner. These savings arise because we do not have to unroll a new CRF at every time step, or estimate a partition function responsible for normalizing the product of clique potentials appearing in two consecutive slices. The previously used approximate inference methods in dynamic MLNs either relied on belief propagation or assumed that approximating the distribution at every time step by the product of the marginals would not cause any error. It is important to note that, although in this paper we focused on marginal inference, finding the most likely state sequence could be done using the generated particles. Although the conditional log-likelihood of the training data in our model may be non-concave, so that hill-climbing based approaches could fail to settle in a global maximum, we proposed a heuristic for weight learning and demonstrated that it could train our model to perform as well as conditional random fields.
Although training the mixed model might have a higher computational cost than training a conditional random field, this cost is amortized over time, since in applications inference is performed many times, while weight learning is done only once. Designing more scalable weight learning algorithms is among our future goals.
7 Acknowledgments

We thank Daniel Gildea for his insightful comments.
This research was supported by grants from ARO (W991NF-08-1-0242), ONR (N00014-11-10417), NSF (IIS-1012017), DOD (N00014-12-C-0263), and a gift from Intel.
References
[1] Pedro Domingos and Daniel Lowd. Markov Logic: An Interface Layer for Artificial Intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.
[2] Thomas Geier and Susanne Biundo. Approximate online inference for dynamic Markov logic networks. In Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on, pages 764–768, 2011.
[3] Shalini Ghosh, Natarajan Shankar, and Sam Owre. Machine reading using Markov logic networks for collective probabilistic inference. In Proceedings of ECML-CoLISD, 2011.
[4] Vibhav Gogate and Rina Dechter. SampleSearch: Importance sampling in presence of determinism. Artif. Intell., 175(2):694–729, 2011.
[5] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[6] Dominik Jain, Andreas Barthels, and Michael Beetz. Adaptive Markov logic networks: Learning statistical relational models with dynamic parameters. In 19th European Conference on Artificial Intelligence (ECAI), pages 937–942, 2010.
[7] K. Kersting, B. Ahmadi, and S. Natarajan. Counting belief propagation. In J. Bilmes and A. Ng, editors, Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI-09), Montreal, Canada, June 18–21, 2009.
[8] D. Koller and N. Friedman.
Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[9] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289. Morgan Kaufmann, 2001.
[10] Steffen Lauritzen and Thomas S. Richardson. Chain graph models and their causal interpretations. Journal of the Royal Statistical Society, Series B, 64:321–361, 2001.
[11] B. Limketkai, D. Fox, and Lin Liao. CRF-filters: Discriminative particle filters for sequential state estimation. In Robotics and Automation, 2007 IEEE International Conference on, pages 3142–3147, 2007.
[12] Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 591–598, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[13] Kevin Patrick Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, 2002. AAI3082340.
[14] Aniruddh Nath and Pedro Domingos. Efficient belief propagation for utility maximization and repeated inference. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[15] Hoifung Poon and Pedro Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, AAAI'06, pages 458–463. AAAI Press, 2006.
[16] G. Potamianos and J. Goutsias. Stochastic approximation algorithms for partition function estimation of Gibbs random fields. IEEE Transactions on Information Theory, 43(6):1948–1965, 1997.
[17] Adam Sadilek and Henry Kautz. Recognizing multi-agent activities from GPS data. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[18] R. Salakhutdinov. Learning and evaluating Boltzmann machines.
Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto, June 2008.
[19] Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. J. Mach. Learn. Res., 8:693–723, May 2007.