{"title": "Agreement-Based Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 913, "page_last": 920, "abstract": null, "full_text": "Agreement-Based Learning\n\nPercy Liang\n\nComputer Science Division\n\nUniversity of California\n\nBerkeley, CA 94720\n\nDan Klein\n\nComputer Science Division\n\nUniversity of California\n\nBerkeley, CA 94720\n\nMichael I. Jordan\n\nComputer Science Division\n\nUniversity of California\n\nBerkeley, CA 94720\n\npliang@cs.berkeley.edu\n\nklein@cs.berkeley.edu\n\njordan@cs.berkeley.edu\n\nAbstract\n\nThe learning of probabilistic models with many hidden variables and non-\ndecomposable dependencies is an important and challenging problem. In contrast\nto traditional approaches based on approximate inference in a single intractable\nmodel, our approach is to train a set of tractable submodels by encouraging them\nto agree on the hidden variables. This allows us to capture non-decomposable\naspects of the data while still maintaining tractability. We propose an objective\nfunction for our approach, derive EM-style algorithms for parameter estimation,\nand demonstrate their effectiveness on three challenging real-world learning tasks.\n\n1 Introduction\n\nMany problems in natural language, vision, and computational biology require the joint modeling of\nmany dependent variables. Such models often include hidden variables, which play an important role\nin unsupervised learning and general missing data problems. The focus of this paper is on models\nin which the hidden variables have natural problem domain interpretations and are the object of\ninference.\nStandard approaches for learning hidden-variable models involve integrating out the hidden vari-\nables and working with the resulting marginal likelihood. However, this marginalization can be in-\ntractable. An alternative is to develop procedures that merge the inference results of several tractable\nsubmodels. 
An early example of such an approach is the use of pseudolikelihood [1], which deals with many conditional models of single variables rather than a single joint model. More generally, composite likelihood permits a combination of the likelihoods of subsets of variables [7]. Another approach is piecewise training [10, 11], which has been applied successfully to several large-scale learning problems.\nAll of the above methods, however, focus on fully-observed models. In the current paper, we develop techniques in this spirit that work for hidden-variable models. The basic idea of our approach is to create several tractable submodels and train them jointly to agree on their hidden variables. We present an intuitive objective function and efficient EM-style algorithms for training a collection of submodels. We refer to this general approach as agreement-based learning.\nSections 2 and 3 present the general theory for agreement-based learning. In some applications, it is computationally infeasible to optimize the objective function; Section 4 provides two alternative objectives that lead to tractable algorithms. Section 5 demonstrates that our methods can be applied successfully to large datasets in three real-world problem domains\u2014grammar induction, word alignment, and phylogenetic hidden Markov modeling.\n\n2 Agreement-based learning of multiple submodels\n\nAssume we have M (sub)models pm(x, z; θm), m = 1, . . . , M, where each submodel specifies a distribution over the observed data x ∈ X and some hidden state z ∈ Z. The submodels could be parameterized in completely different ways as long as they are defined on the common event space X × Z. 
Intuitively, each submodel should capture a different aspect of the data in a tractable way.\nTo learn these submodels, the simplest approach is to train them independently by maximizing the sum of their log-likelihoods:\n\nOindep(θ) def= log Π_m Σ_z pm(x, z; θm) = Σ_m log pm(x; θm),\n\n(1)\n\nwhere θ = (θ1, . . . , θM) is the collective set of parameters and pm(x; θm) = Σ_z pm(x, z; θm) is the likelihood under submodel pm.1 Given an input x, we can then produce an output z by combining the posteriors pm(z | x; θm) of the trained submodels.\nIf we view each submodel as trying to solve the same task of producing the desired posterior over z, then it seems advantageous to train the submodels jointly to encourage \u201cagreement on z.\u201d We propose the following objective, which realizes this insight:\n\nOagree(θ) def= log Σ_z Π_m pm(x, z; θm) = Σ_m log pm(x; θm) + log Σ_z Π_m pm(z | x; θm).\n\n(2)\n\nThe last term rewards parameter values θ for which the submodels assign probability mass to the same z (conditioned on x); the summation over z reflects the fact that we do not know what z is.\nOagree has a natural probabilistic interpretation. Imagine defining a joint distribution over M independent copies of the data and hidden state, (x1, z1), . . . , (xM, zM), which are each generated by a different submodel: p((x1, z1), . . . , (xM, zM); θ) = Π_m pm(xm, zm; θm). Then Oagree is the probability that the submodels all generate the same observed data x and the same hidden state: p(x1 = · · · = xM = x, z1 = · · · = zM; θ).\nOagree is also related to the likelihood of a proper probabilistic model pnorm, obtained by normalizing the product of the submodels, as is done in [3]. 
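As a toy numerical illustration of Eq. (1) and Eq. (2) (all joint probability values below are invented for the example, not from the paper), the following sketch compares the independent and agreement objectives for two small submodels over a three-valued hidden state:

```python
import numpy as np

# Two hypothetical submodels over a single observation x with a discrete
# hidden state z in {0, 1, 2}. Each vector below lists p_m(x, z) for the
# fixed x, one entry per value of z (numbers are made up for illustration).
p1_xz = np.array([0.20, 0.10, 0.02])   # submodel 1: joint p1(x, z)
p2_xz = np.array([0.15, 0.12, 0.01])   # submodel 2: joint p2(x, z)

# Independent objective (Eq. 1): sum of log marginal likelihoods.
O_indep = np.log(p1_xz.sum()) + np.log(p2_xz.sum())

# Agreement objective (Eq. 2): log sum_z prod_m p_m(x, z), i.e. the log
# probability that both submodels generate the same x and the same z.
O_agree = np.log((p1_xz * p2_xz).sum())

# The last term of Eq. 2 is the "agreement bonus" on the posteriors:
post1 = p1_xz / p1_xz.sum()
post2 = p2_xz / p2_xz.sum()
agreement_bonus = np.log((post1 * post2).sum())

# Eq. 2's identity: O_agree decomposes as O_indep plus the bonus.
assert np.isclose(O_agree, O_indep + agreement_bonus)
```

Since the bonus term is the log of a quantity at most 1, the agreement objective is never larger than the independent one; it is maximized when the two posteriors concentrate on the same z.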
Our objective Oagree is then a lower bound on the likelihood under pnorm:\n\npnorm(x; θ) def= [Σ_z Π_m pm(x, z; θm)] / [Σ_{x,z} Π_m pm(x, z; θm)] ≥ [Σ_z Π_m pm(x, z; θm)] / [Π_m Σ_{x,z} pm(x, z; θm)] = Oagree(θ).\n\n(3)\n\nThe inequality holds because the denominator of the lower bound contains additional cross terms. The bound is generally loose, but becomes tighter as each pm becomes more deterministic. Note that pnorm is distinct from the product-of-experts model [3], in which each \u201cexpert\u201d model pm has its own set of (nuisance) hidden variables: ppoe(x) ∝ Π_m Σ_z pm(x, z; θm). In contrast, pnorm has one set of hidden variables z common to all submodels, which is what provides the mechanism for agreement-based learning.\n\n1To simplify notation, we consider one data point x. Extending to a set of i.i.d. points is straightforward.\n\n2.1 The product EM algorithm\n\nWe now derive the product EM algorithm to maximize Oagree. Product EM bears many striking similarities to EM: both are coordinate-wise ascent algorithms on an auxiliary function and both increase the original objective monotonically. By introducing an auxiliary distribution q(z) and applying Jensen\u2019s inequality, we can lower bound Oagree with an auxiliary function L:\n\nOagree(θ) = log Σ_z q(z) [Π_m pm(x, z; θm) / q(z)] ≥ E_{q(z)} log [Π_m pm(x, z; θm) / q(z)] def= L(θ, q).\n\n(4)\n\nThe product EM algorithm performs coordinate-wise ascent on L(θ, q). In the (product) E-step, we optimize L with respect to q. Simple algebra reveals that this optimization is equivalent to minimizing a KL-divergence: L(θ, q) = −KL(q(z) || Π_m pm(x, z; θm)) + constant, where the constant does not depend on q. 
This quantity is minimized by setting q(z) ∝ Π_m pm(x, z; θm). In the (product) M-step, we optimize L with respect to θ, which decomposes into M independent objectives: L(θ, q) = Σ_m E_q log pm(x, z; θm) + constant, where this constant does not depend on θ. Each term corresponds to an independent M-step, just as in EM for maximizing Oindep.\nThus, our product EM algorithm differs from independent EM only in the E-step, in which the submodels are multiplied together to produce one posterior over z rather than M separate ones. Assuming that there is an efficient EM algorithm for each submodel pm, there is no difficulty in performing the product M-step. In our applications (Section 5), each pm is composed of multinomial distributions, so the M-step simply involves computing ratios of expected counts. On the other hand, the product E-step can become intractable, and we must develop approximations (Section 4).\n\n3 Exponential family formulation\n\nThus far, we have placed no restrictions on the form of the submodels. To develop a richer understanding and provide a framework for making approximations, we now assume that each submodel pm is an exponential family distribution:\n\npm(x, z; θm) = exp{θm^T φm(x, z) − Am(θm)} for x ∈ X, z ∈ Zm, and 0 otherwise,\n\n(5)\n\nwhere φm are sufficient statistics (features) and Am(θm) = log Σ_{x∈X, z∈Zm} exp{θm^T φm(x, z)} is the log-partition function,2 defined on θm ∈ Θm ⊂ R^J. We can think of all the submodels pm as being defined on a common space Z∪ = ∪m Zm, but the support of q(z) as computed in the E-step is only the intersection Z∩ = ∩m Zm. Controlling this support will be essential in developing tractable approximations (Section 4.1).\nIn the general formulation, we required only that the submodels share the same event space X × Z. 
Now we make explicit the possibility of the submodels sharing features, which gives us more structure for deriving approximations. In particular, suppose each feature j of submodel pm can be decomposed into a part that depends on x (which is specific to that particular submodel) and a part that depends on z (which is the same for all submodels):\n\nφmj(x, z) = Σ_{i=1}^I φX_mji(x) φZ_i(z), or in matrix notation, φm(x, z) = φX_m(x) φZ(z),\n\n(6)\n\nwhere φX_m(x) is a J × I matrix and φZ(z) is an I × 1 vector. When z is discrete, such a decomposition always exists by defining φZ(z) to be a |Z∪|-dimensional indicator vector which is 1 on the component corresponding to z. Fortunately, we can usually obtain more compact representations of φZ(z). We can now express our objective L(θ, q) (4) using (5) and (6):\n\nL(θ, q) = (Σ_m θm^T φX_m(x)) (E_{q(z)} φZ(z)) + H(q) − Σ_m Am(θm) for q ∈ Q(Z∩),\n\n(7)\n\nwhere Q(Z′) def= {q : q(z) = 0 for z ∉ Z′} is the set of distributions with support Z′. For convenience, define bm^T = θm^T φX_m(x) and b = Σ_m bm, which summarize the parameters θ for the E-step. Note that for any θ, the q maximizing L always has the following exponential family form:\n\nq(z; β) = exp{β^T φZ(z) − A_{Z∩}(β)} for z ∈ Z∩ and 0 otherwise,\n\n(8)\n\nwhere A_{Z∩}(β) = log Σ_{z∈Z∩} exp{β^T φZ(z)} is the log-partition function. In a minor abuse of notation, we write L(θ, β) = L(θ, q(·; β)). Specifically, L(θ, β) is maximized by setting β = b.\nIt will be useful to express (7) using convex duality [12]. 
The key idea of convex duality is the existence of a mapping between the canonical exponential parameters β ∈ R^I of an exponential family distribution q(z; β) and the mean parameters defined by μ = E_{q(z;β)} φZ(z) ∈ M(Z∩) ⊂ R^I, where M(Z′) = {μ : ∃q ∈ Q(Z′) : E_q φZ(z) = μ} is the set of realizable mean parameters. The Fenchel-Legendre conjugate of the log-partition function A_{Z∩}(β) is\n\nA*_{Z∩}(μ) def= sup_{β∈R^I} {β^T μ − A_{Z∩}(β)} for μ ∈ M(Z∩),\n\n(9)\n\nwhich is also equal to −H(q(z; β)), the negative entropy of any distribution q(z; β) corresponding to μ. Substituting μ and A*_{Z∩}(μ) into (7), we obtain an objective in terms of the dual variables μ:\n\nL*(θ, μ) def= (Σ_m θm^T φX_m(x)) μ − A*_{Z∩}(μ) − Σ_m Am(θm) for μ ∈ M(Z∩).\n\n(10)\n\nNote that the two objectives are equivalent: sup_{β∈R^I} L(θ, β) = sup_{μ∈M(Z∩)} L*(θ, μ) for each θ. The mean parameters μ are exactly the z-specific expected sufficient statistics computed in the product E-step. The dual is an attractive representation because it allows us to form convex combinations of different μ, an operation that does not have a direct analogue in the primal formulation.\n\n2Our applications use directed graphical models, which correspond to curved exponential families where each Θm is defined by local normalization constraints and Am(θm) = 0.\n
The product EM algorithm is summarized below:\n\nProduct EM\nE-step: μ = argmax_{μ′∈M(Z∩)} {b^T μ′ − A*_{Z∩}(μ′)}\nM-step: θm = argmax_{θ′m∈Θm} {θ′m^T φX_m(x) μ − Am(θ′m)}\n\n4 Approximations\n\nThe product M-step is tractable provided that the M-step for each submodel is tractable, which is generally the case. The corresponding statement is not true for the E-step, which in general requires explicitly summing over all possible z ∈ Z∩, often an exponentially large set. We will thus consider alternative E-steps, so it will be convenient to succinctly characterize an E-step. An E-step is specified by a vector b′ (which depends on θ and x) and a set Z′ (which we sum z over):\n\nE(b′, Z′) computes μ = argmax_{μ′∈M(Z′)} {b′^T μ′ − A*_{Z′}(μ′)}.\n\n(11)\n\nUsing this notation, E(bm, Zm) is the E-step for training the m-th submodel independently using EM, and E(b, Z∩) is the E-step of product EM. Though we write E-steps in the dual formulation, in practice we compute μ as an expectation over all z ∈ Z′, perhaps leveraging dynamic programming. If E(bm, Zm) is tractable and all submodels have the same dynamic programming structure (e.g., if z is a tree and all features are local with respect to that tree), then E(b, Z∩) is also tractable: we can incorporate all the features into the same dynamic program and simply run product EM (see Section 5.1 for an example).\nHowever, E(b, Z∩) is intractable in general, owing to two complications: (1) we can sum over each Zm efficiently but not the intersection Z∩; and (2) each bm corresponds to a decomposable graphical model, but the combined b = Σ_m bm corresponds to a loopy graph. 
In the sequel, we describe two approximate objective functions addressing each complication, whose maximization can be carried out by performing M independent tractable E-steps.\n\n4.1 Domain-approximate product EM\n\nAssume that for each submodel pm, E(b, Zm) is tractable (see Section 5.2 for an example). We propose maximizing the following objective:\n\nL*_dom(θ, μ1, . . . , μM) def= Σ_m [((1/M) Σ_{m′} θ_{m′}^T φX_{m′}(x)) μm − A*_{Zm}(μm)] − Σ_m Am(θm),\n\n(12)\n\nwith each μm ∈ M(Zm). This objective can be maximized via coordinate-wise ascent:\n\nDomain-approximate product EM\nE-step: μm = argmax_{μ′m∈M(Zm)} {b^T μ′m − A*_{Zm}(μ′m)}   [E(b, Zm)]\nM-step: θm = argmax_{θ′m∈Θm} {θ′m^T φX_m(x) ((1/M) Σ_{m′} μ_{m′}) − Am(θ′m)}\n\nThe product E-step consists of M separate E-steps, which are each tractable because each involves the respective Zm instead of Z∩. The resulting expected sufficient statistics are averaged and used in the product M-step, which breaks down into M separate M-steps.\nWhile we have not yet established any relationship between our approximation L*_dom and the original objective L*, we can, however, relate L*_dom to L*_∪, which is defined as an analogue of L* by replacing Z∩ with Z∪ in (10).\nProposition 1. L*_dom(θ, μ1, . . . , μM) ≤ L*_∪(θ, μ̄) for all θ and μm ∈ M(Zm), where μ̄ = (1/M) Σ_m μm.\nProof. 
First, since M(Zm) ⊂ M(Z∪) and M(Z∪) is a convex set, μ̄ ∈ M(Z∪), so L*_∪(θ, μ̄) is well-defined. Subtracting the L*_∪ version of (10) from (12), we obtain L*_dom(θ, μ1, . . . , μM) − L*_∪(θ, μ̄) = A*_{Z∪}(μ̄) − (1/M) Σ_m A*_{Zm}(μm). It suffices to show A*_{Z∪}(μ̄) ≤ (1/M) Σ_m A*_{Z∪}(μm) ≤ (1/M) Σ_m A*_{Zm}(μm). The first inequality follows from convexity of A*_{Z∪}(·). For the second inequality: since Zm ⊂ Z∪, A_{Z∪}(β) ≥ A_{Zm}(β) for all β; by inspecting (9), it follows that A*_{Z∪}(μm) ≤ A*_{Zm}(μm).\n\n4.2 Parameter-approximate product EM\n\nNow suppose that for each submodel pm, E(bm, Z∩) is tractable (see Section 5.3 for an example). We propose maximizing the following objective:\n\nL*_par(θ, μ1, . . . , μM) def= (1/M) Σ_m [(M θm^T φX_m(x)) μm − A*_{Z∩}(μm)] − Σ_m Am(θm),\n\n(13)\n\nwith each μm ∈ M(Z∩). This objective can be maximized via coordinate-wise ascent, which again consists of M separate E-steps E(M bm, Z∩) and the same M-step as before:\n\nParameter-approximate product EM\nE-step: μm = argmax_{μ′m∈M(Z∩)} {(M bm)^T μ′m − A*_{Z∩}(μ′m)}   [E(M bm, Z∩)]\nM-step: θm = argmax_{θ′m∈Θm} {θ′m^T φX_m(x) ((1/M) Σ_{m′} μ_{m′}) − Am(θ′m)}\n\nWe can show that the maximum value of L*_par is at least that of L*, which leaves us maximizing an upper bound of L*. 
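The parameter-approximate E-step E(M bm, Z∩) can be pictured numerically: each submodel is run alone over the shared support, but with its canonical parameters scaled by M, which amounts to raising its joint probabilities to the M-th power before normalizing. A minimal sketch with M = 2 and invented joint tables (for the indicator-feature case, where μ is the posterior itself):

```python
import numpy as np

M = 2
# Hypothetical joint tables p_m(x, z) over a shared support Z∩ for a fixed x
# (three values of z; numbers are made up for illustration).
p = [np.array([0.30, 0.10, 0.05]), np.array([0.20, 0.15, 0.02])]

# E-step E(M*b_m, Z∩): exp{(M*b_m)^T φ(z)} is proportional to p_m(x, z)^M,
# so scaling the parameters by M sharpens each submodel's posterior.
mus = []
for pm in p:
    q = pm ** M
    q /= q.sum()
    mus.append(q)      # expected sufficient statistics under q(z)

# The product M-step then uses the average (1/M) * sum_m mu_m for every
# submodel, as in the coordinate-wise ascent boxes above.
mu_bar = sum(mus) / M
```

Raising to the M-th power makes each posterior more peaked than the corresponding unscaled one, which is consistent with the intuition that the approximation is best when the posteriors are low-variance and unimodal.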
Although maximizing an upper bound is less principled than maximizing a lower bound, we show in Section 5.3 that our approach is nonetheless a reasonable approximation which, importantly, is tractable.\nProposition 2. max_{μ1∈M(Z∩),...,μM∈M(Z∩)} L*_par(θ, μ1, . . . , μM) ≥ max_{μ∈M(Z∩)} L*(θ, μ).\nProof. From the definitions of L*_par (13) and L* (10), it is easy to see that L*_par(θ, μ, . . . , μ) = L*(θ, μ) for all μ ∈ M(Z∩). If we maximize L*_par with M distinct arguments, we cannot end up with a smaller value.\nThe product E-step could also be approximated by mean-field or loopy belief propagation variants. These methods and the two we propose all fall under the general variational framework for approximate inference [12]. The two approximations we developed have the advantage of permitting exact tractable solutions without resorting to expensive iterative methods which are only guaranteed to converge to a local optimum.\nWhile we still lack a complete theory relating our approximations L*_dom and L*_par to the original objective L*, we can give some intuitions. Since we are operating in the space of expected sufficient statistics μm, most of the information about the full posterior pm(z | x) must be captured in these statistics alone. 
Therefore, we expect our approximations to be accurate when each submodel has enough capacity to represent the posterior pm(z | x; θm) as a low-variance unimodal distribution.\n\n5 Applications\n\nWe now empirically validate our algorithms on three concrete applications: grammar induction using product EM (Section 5.1), unsupervised word alignment using domain-approximate product EM (Section 5.2), and prediction of missing nucleotides in DNA sequences using parameter-approximate product EM (Section 5.3).\n\nFigure 1: The two instances of IBM model 1 for word alignment are shown in (a) and (b). The graph shows gains from agreement-based learning.\n\n5.1 Grammar induction\n\nGrammar induction is the problem of inducing latent syntactic structures given a set of observed sentences. There are two common types of syntactic structure (one based on word dependencies and the other based on constituent phrases), each of which can be represented as a submodel. [5] proposed an algorithm to train these two submodels. Their algorithm is a special case of our product EM algorithm, although they did not state an objective function. Since the shared hidden state is a tree structure, product EM is tractable. They show that training the two submodels to agree significantly improves accuracy over independent training. See [5] for more details.\n\n5.2 Unsupervised word alignment\n\nWord alignment is an important component of machine translation systems. Suppose we have a set of sentence pairs. Each pair consists of two sentences, one in a source language (say, English) and its translation in a target language (say, French). The goal of unsupervised word alignment is to match the words in a source sentence to the words in the corresponding target sentence. Formally, let x = (e, f) be an observed pair of sentences, where e = (e1, . . . , e|e|) and f = (f1, . . . 
, f|f|); z is a set of alignment edges between positions in the English sentence and positions in the French sentence.\nClassical models for word alignment include IBM models 1 and 2 [2] and the HMM model [8]. These are asymmetric models, which means that they assign non-zero probability only to alignments in which each French word is aligned to at most one English word; we denote this set Z1. An element z ∈ Z1 can be parameterized by a vector a = (a1, . . . , a|f|), with aj ∈ {NULL, 1, . . . , |e|}, corresponding to the English word (if any) that French word fj is aligned to. We define the first submodel on X × Z1 as follows (specializing to IBM model 1 for simplicity):\n\np1(x, z; θ1) = p1(e, f, a; θ1) = p1(e) Π_{j=1}^{|f|} p1(aj) p1(fj | e_{aj}; θ1),\n\n(14)\n\nwhere p1(e) and p1(aj) are constant and the canonical exponential parameters θ1 are the translation log-probabilities {log t1;ef} for each English word e (including NULL) and French word f.\nWritten in exponential family form, φZ(z) is an (|e| + 1)(|f| + 1)-dimensional vector whose components are {φZ_ij(z) ∈ {0, 1} : i = NULL, 1, . . . , |e|, j = NULL, 1, . . . , |f|}. We have φZ_ij(z) = 1 if and only if English word ei is aligned to French word fj, and φZ_{NULL,j}(z) = 1 if and only if fj is not aligned to any English word. Also, φX_{ef;ij}(x) = 1 if and only if ei = e and fj = f. The mean parameters associated with an E-step are {μ1;ij}, the posterior probabilities of ei aligning to fj; these can be computed independently for each j. We can define a second submodel p2(x, z; θ2) on X × Z2 by reversing the roles of English and French. Figure 1(a)\u2013(b) shows the two models.\nWe cannot use the product EM algorithm to train p1 and p2 because summing over all alignments in Z∩ = Z1 ∩ Z2 is NP-hard. 
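The tractable alternative described next runs each asymmetric model's E-step with the product of the two translation tables. A small sketch (hypothetical 2-word-by-2-word translation tables; NULL alignments and distortion terms omitted for brevity):

```python
import numpy as np

# Hypothetical translation tables for a 2-word English / 2-word French pair.
# t1[i, j]: p1(f_j | e_i) from the English-to-French submodel;
# t2[i, j]: p2(e_i | f_j) from the reversed, French-to-English submodel.
t1 = np.array([[0.7, 0.3],
               [0.2, 0.8]])
t2 = np.array([[0.6, 0.1],
               [0.4, 0.9]])

# Agreement E-step restricted to Z1 (each French word aligns to at most one
# English word): the posterior over a_j uses the *product* of the two
# submodels' translation probabilities, normalized over i for each j
# (uniform alignment prior assumed, as in IBM model 1).
scores = t1 * t2
mu1 = scores / scores.sum(axis=0, keepdims=True)   # mu1[i, j] = p(a_j = i | x)

# The reversed E-step over Z2 normalizes over j instead; the M-step then
# averages the two posterior matrices.
mu2 = scores / scores.sum(axis=1, keepdims=True)
mu_bar = 0.5 * (mu1 + mu2)
```

Edges where both directional posteriors are large survive the averaging, which is the mechanism by which the two asymmetric models are pushed to agree.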
However, we can use domain-approximate product EM because E(b1 + b2, Zm) is tractable\u2014the tractability here does not depend on decomposability of b but on the asymmetric alignment structure of Zm. The concrete change from independent EM is slight: we need only change the E-step of each pm to use the product of translation probabilities t1;ef t2;fe and change the M-step to use the average of the edge posteriors obtained from the two E-steps.\n\nFigure 2: The two phylogenetic HMM models, one for the even slices, the other for the odd ones.\n\n[6] proposed an alternative method to train two models to agree. Their E-step computes μ1 = E(b1, Z1) and μ2 = E(b2, Z2), whereas our E-steps incorporate the parameters of both models in b1 + b2. Their M-step uses the elementwise product of μ1 and μ2, whereas we use the average (1/2)(μ1 + μ2). Finally, while their algorithm appears to be very stable and is observed to converge empirically, no objective function has been developed; in contrast, our algorithm maximizes (12). In practice, both algorithms perform comparably.\nWe conducted our experiments according to the setup of [6]. We used 100K unaligned sentences for training and 137 for testing from the English-French Hansards data of the NAACL 2003 Shared Task. Alignments are evaluated using alignment error rate (AER); see [6] for more details. We trained two instances of the HMM model [8] (English-to-French and French-to-English) using 10 iterations of domain-approximate product EM, initializing with independently trained IBM model 1 parameters. For prediction, we output alignment edges with sufficient posterior probability: {(i, j) : (1/2)(μ1;ij + μ2;ij) ≥ δ}. 
Figure 1 shows how agreement-based training improves the error rate over independent training for the HMM models.\n\n5.3 Phylogenetic HMM models\n\nSuppose we have a set of species s ∈ S arranged in a fixed phylogeny (i.e., S are the nodes of a directed tree). Each species s is associated with a length-L sequence of nucleotides ds = (ds1, . . . , dsL). Let d = {ds : s ∈ S} denote all the nucleotides, which consist of some observed ones x and unobserved ones z.\nA good phylogenetic model should take into consideration both the relationship between nucleotides of the different species at the same site and the relationship between adjacent nucleotides in the same species. However, such a model would have high tree-width and be intractable to train. Past work has focused on traditional variational inference in a single intractable model [9, 4]. Our approach is to instead create two tractable submodels and train them to agree. Define one submodel to be\n\np1(x, z; θ1) = p1(d; θ1) = Π_{j odd} Π_{s∈S} Π_{s′∈CH(s)} p1(d_{s′j} | d_{sj}; θ1) p1(d_{s′(j+1)} | d_{s′j}, d_{s(j+1)}; θ1),\n\n(15)\n\nwhere CH(s) is the set of children of s in the tree. The second submodel p2 is defined similarly, only with the product taken over j even. The parameters θm consist of first-order mutation log-probabilities and second-order mutation log-probabilities. Both submodels permit the same set of assignments of hidden nucleotides (Z∩ = Z1 = Z2). Figure 2(a)\u2013(b) shows the two submodels.\nExact product EM is not tractable since b = b1 + b2 corresponds to a graph with high tree-width. We can apply parameter-approximate product EM, in which the E-step only involves computing μm = E(2bm, Z∩). This can be done via dynamic programming along the tree for each two-nucleotide slice of the sequence. In the M-step, the average (1/2)(μ1 + μ2) is used for each model, which has a closed-form solution.\nOur experiments used a multiple alignment consisting of L = 20,000 consecutive sites belonging to the L1 transposons in the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene (chromosome 7). Eight eutherian species were arranged in the phylogeny shown in Figure 3. The data we used is the same as that of [9]. Some nucleotides in the sequences were already missing. In addition, we held out some fraction of the observed ones for evaluation. We trained two models using 30 iterations of parameter-approximate product EM.3 For prediction, the posteriors over heldout nucleotides under each model are averaged and the one with the highest posterior is chosen. Figure 3 shows the prediction accuracy. Though independent and agreement-based training eventually obtain the same accuracy, agreement-based training converges much faster. This gap grows as the amount of heldout data increases.\n\n3We initialized with a small amount of noise around uniform parameters plus a small bias towards identity mutations.\n\nFigure 3: The tree is the phylogeny topology used in experiments. The graphs show the prediction accuracy of independent versus agreement-based training (parameter-approximate product EM) when 20% and 50% of the observed nodes are held out.\n\n6 Conclusion\n\nWe have developed a general framework for agreement-based learning of multiple submodels. Viewing these submodels as components of an overall model, our framework permits the submodels to be trained jointly without paying the computational cost associated with an actual jointly-normalized probability model. We have presented an objective function for agreement-based learning and three EM-style algorithms that maximize this objective or approximations to this objective. 
We have also demonstrated the applicability of our approach to three important real-world tasks. For grammar induction, our approach yields the existing algorithm of [5], providing an objective function for that algorithm. For word alignment and phylogenetic HMMs, our approach provides entirely new algorithms.\n\nAcknowledgments We would like to thank Adam Siepel for providing the phylogenetic data and acknowledge the support of the Defense Advanced Research Projects Agency under contract NBCHD030010.\n\nReferences\n[1] J. Besag. The analysis of non-lattice data. The Statistician, 24:179\u2013195, 1975.\n[2] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263\u2013311, 1993.\n[3] G. Hinton. Products of experts. In International Conference on Artificial Neural Networks, 1999.\n[4] V. Jojic, N. Jojic, C. Meek, D. Geiger, A. Siepel, D. Haussler, and D. Heckerman. Efficient approximations for learning phylogenetic HMM models from data. Bioinformatics, 20:161\u2013168, 2004.\n[5] D. Klein and C. D. Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Association for Computational Linguistics (ACL), 2004.\n[6] P. Liang, B. Taskar, and D. Klein. Alignment by agreement. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL), 2006.\n[7] B. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80:221\u2013239, 1988.\n[8] H. Ney and S. Vogel. HMM-based word alignment in statistical translation. In International Conference on Computational Linguistics (COLING), 1996.\n[9] A. Siepel and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology, 11:413\u2013428, 2004.\n[10] C. Sutton and A. McCallum. Piecewise training of undirected models. 
In Uncertainty in Artificial Intelligence (UAI), 2005.\n[11] C. Sutton and A. McCallum. Piecewise pseudolikelihood for efficient CRF training. In International Conference on Machine Learning (ICML), 2007.\n[12] M. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical report, Department of Statistics, University of California at Berkeley, 2003.\n", "award": [], "sourceid": 592, "authors": [{"given_name": "Percy", "family_name": "Liang", "institution": null}, {"given_name": "Dan", "family_name": "Klein", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}