{"title": "Empirical Risk Minimization with Approximations of Probabilistic Grammars", "book": "Advances in Neural Information Processing Systems", "page_first": 424, "page_last": 432, "abstract": "Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting.", "full_text": "Empirical Risk Minimization with Approximations of Probabilistic Grammars\n\nShay B. Cohen\nLanguage Technologies Institute\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213, USA\nscohen@cs.cmu.edu\n\nNoah A. Smith\nLanguage Technologies Institute\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213, USA\nnasmith@cs.cmu.edu\n\nAbstract\n\nProbabilistic grammars are generative statistical models that are useful for compositional\nand sequential structures. We present a framework, reminiscent of structural\nrisk minimization, for empirical risk minimization of the parameters of a\nfixed probabilistic grammar using the log-loss. We derive sample complexity\nbounds in this framework that apply both to the supervised setting and the unsupervised\nsetting.\n\n1 Introduction\n\nProbabilistic grammars are an important statistical model family used in natural language processing\n[7], computer vision [16], computational biology [19] and more recently, in human activity analysis\n[12]. They are commonly estimated using maximum likelihood estimation or variants. Such estimation\ncan be viewed as minimizing empirical risk with the log-loss [21]. The log-loss is not bounded\nwhen applied to probabilistic grammars, and that makes it hard to obtain uniform convergence results. 
Such results would help in deriving sample complexity bounds, that is, bounds on the number\nof training examples required to obtain accurate estimation.\nTo overcome this problem, we derive distribution-dependent uniform convergence results for probabilistic\ngrammars. In that sense, our learning framework relates to previous work about learning in a\ndistribution-dependent setting [15] and structural risk minimization [21]. Our work is also related to\n[8], which discusses the statistical properties of estimation of parsing models in a distribution-free\nsetting. Based on the notion of bounded approximations [1, 9], we define a sequence of increasingly\nbetter approximations for probabilistic grammars, which we call \u201cproper approximations.\u201d We then\nderive sample complexity bounds in our framework, for both the supervised case and the unsupervised\ncase.\nOur results rely on an exponential decay in probabilities with respect to the length of the derivation\n(number of derivation steps the grammar takes when generating a structure). This means that most\nof the probability mass for such a distribution is concentrated on a small number of grammatical\nderivations. We formalize this notion, and use it in many of our results. For applications involving\nreal-world data of finite size (as in natural language processing, computational biology, and so on),\nwe believe this is a reasonable assumption.\nThe rest of the paper is organized as follows. \u00a72 gives an overview of probabilistic grammars. \u00a73\ngives an overview of the learning setting. \u00a74 presents proper approximations, which are approximate\nconcept spaces that permit the derivation of sample complexity bounds for probabilistic grammars.\n\u00a75 describes the main sample complexity results. 
We discuss our results in \u00a76 and conclude in \u00a77.\n\n2 Probabilistic Grammars\n\nA probabilistic grammar defines a probability distribution over grammatical derivations generated\nthrough a step-by-step process. For example, probabilistic context-free grammars (PCFGs) generate\nphrase-structure trees by recursively rewriting nonterminal symbols as sequences of \u201cchild\u201d symbols\naccording to a fixed set of production rules. Each rewrite of a PCFG is conditionally independent\nof previous ones given one PCFG state; this Markov property permits efficient inference for the\nprobability distribution defined by the probabilistic grammar.\nIn this paper, we will assume that any grammatical derivation z fully determines a string x, denoted\nyield(z). There may be many derivations z for a given string (perhaps infinitely many for some\nkinds of grammars; we assume that the number of derivations is finite). In general, a probabilistic\ngrammar defines the probability of a grammatical derivation z as:\n\n$$h_\\theta(z) = \\prod_{k=1}^{K} \\prod_{i=1}^{N_k} \\theta_{k,i}^{\\psi_{k,i}(z)} = \\exp \\sum_{k=1}^{K} \\sum_{i=1}^{N_k} \\psi_{k,i}(z) \\log \\theta_{k,i} \\qquad (1)$$\n\n\u03c8k,i is a function that \u201ccounts\u201d the number of times the kth distribution\u2019s ith event occurs in the\nderivation. The \u03b8 are a collection of K multinomials \u27e8\u03b81, ..., \u03b8K\u27e9, the kth of which includes Nk\nevents. We let $N = \\sum_{k=1}^{K} N_k$ denote the total number of derivation event types. D(G) denotes\nthe set of all possible derivations of G. We define Dx(G) = {z \u2208 D(G) | yield(z) = x}. We let\n|x| denote the length of the string x, and $|z| = \\sum_{k=1}^{K} \\sum_{i=1}^{N_k} \\psi_{k,i}(z)$ denote the \u201clength\u201d (number of\nevent tokens) of the derivation z.\nParameter estimation for probabilistic grammars means choosing \u03b8 from complete data (\u201csupervised\u201d)\nor incomplete data (\u201csemi-supervised\u201d or \u201cunsupervised,\u201d the latter usually implying that\nstrings x are evidence but all derivations z are missing). We can view parameter estimation as identifying\na hypothesis from H(G) = {h\u03b8(z) | \u03b8} or, equivalently, from F(G) = {\u2212 log h\u03b8(z) |\n\u03b8}. For simplicity of notation, we assume that there is a fixed grammar and use H to refer\nto H(G) and F to refer to F(G).\u00b9 For every f\u03b8 \u2208 F we have parameters \u03b8 such that\n\n$$f_\\theta(z) = -\\sum_{k=1}^{K} \\sum_{i=1}^{N_k} \\psi_{k,i}(z) \\log \\theta_{k,i}.$$\n\nWe will make a few assumptions about G and P(z), the distribution that generates derivations from\nD(G) (note that P does not have to be a probabilistic grammar):\n\u2022 Bounded derivation length: There is an \u03b1 \u2265 1 such that, for all z, |z| \u2264 \u03b1|yield(z)|. Further,\n|z| \u2265 |x|.\n\u2022 Exponential decay of derivations: There is a constant r < 1 and a constant L \u2265 0 such that\nP(z) \u2264 L r^{|z|}.\n\u2022 Exponential decay of strings: Let \u039b(k) = |{z \u2208 D(G) | |z| = k}| be the number of derivations\nof length k in G. Taking r as above, we assume there exists a constant q < 1 such that\n\u039b(k) r^k \u2264 q^k. This implies that the number of derivations of length k may be exponentially large\n(e.g., as with many PCFGs), but is bounded by (q/r)^k.\n\u2022 Bounded expectations of rules: There is a B < \u221e such that E[\u03c8k,i(z)] \u2264 B for all k and i.\nWe note that, for example, these assumptions must hold for any P whose support consists of a\nfinite set. 
These assumptions also hold in many cases when P itself is a probabilistic grammar.\nSee supplementary material for a note about these assumptions, their empirical justification and the\nrelationship to Tsybakov noise [20, 15].\n\n3 The Learning Setting\n\nIn the supervised learning setting, a set of grammatical derivations z1, . . . , zn is used to estimate \u03b8,\nimplying a choice of h \u2208 H that \u201cagrees\u201d with the training data. MLE chooses h\u2217 \u2208 H to maximize\nthe likelihood of the data:\n\n$$h^* = \\operatorname*{argmax}_{h \\in H} \\frac{1}{n} \\sum_{i=1}^{n} \\log h(z_i) = \\operatorname*{argmin}_{h \\in H} \\underbrace{\\sum_{z \\in D(G)} \\tilde{P}(z) \\left( -\\log h(z) \\right)}_{R_{\\mathrm{emp},n}(-\\log h)} \\qquad (2)$$\n\n\u00b9 Learning the rules in a grammar is another important problem that has received much attention [11].\n\nAs shown, this equates to minimizing the empirical risk, or the expected value of a particular loss\nfunction known as log-loss. The expected risk, under P, is the (unknowable) quantity\n\n$$R(-\\log h) = \\sum_{z \\in D(G)} P(z) \\left( -\\log h(z) \\right) = E_P[-\\log h]$$\n\nShowing convergence of the form sup_{h\u2208H} |Remp,n(\u2212 log h) \u2212 R(\u2212 log h)| \u2192 0 as n \u2192 \u221e (in probability)\nis referred to as double-sided uniform convergence. (We note that sup_{h\u2208H} |Remp,n(\u2212 log h) \u2212\nR(\u2212 log h)| = sup_{f\u2208F} |Remp,n(f ) \u2212 R(f )|.) This kind of uniform convergence is the driving\nforce in showing that the empirical risk minimizer is consistent, i.e., the minimized empirical risk\nconverges to the minimized expected risk. We assume familiarity with the relevant literature about\nempirical risk minimization; see [21].\n\n4 Proper Approximations\n\nThe log-loss is unbounded, so that there is no function F : D(G) \u2192 R such that, \u2200f \u2208 F, \u2200z \u2208\nD(G), f (z) \u2264 F (z); i.e., there is no envelope to uniformly bound F. 
This makes it difficult to\nobtain a uniform convergence result of sup_{f\u2208F} |Remp,n(f ) \u2212 R(f )|. Vapnik [21, page 93] shows\nthat we can still get consistency for the maximum likelihood estimator, if we bound from below and\nabove the family of probability distributions at hand.\nInstead of making this restriction, which is heavy for probabilistic grammars, we revise the learning\nmodel according to well-known results about the convergence of stochastic processes. The revision\napproximates the concept space using a sequence F1, F2, . . . and replaces two-sided uniform convergence\nwith convergence on the sequence of concept spaces. The concept spaces in the sequence\nvary as a function of the number of samples we have. We next construct the sequence of concept\nspaces, and in \u00a75 we return to the learning model. Our approximations are based on the concept of\nbounded approximations [1, 9].\nLet Fm (for m \u2208 {1, 2, . . .}) be a sequence of concept spaces contained in F. We will require that\nas m grows larger, Fm becomes a better approximation of the original concept space F. We say that\nthe sequence \u201cproperly approximates\u201d F if there exists a non-increasing function \u03b5tail(m) such that\n\u03b5tail(m) \u2192 0 as m \u2192 \u221e, a non-increasing function \u03b5bound(m) such that \u03b5bound(m) \u2192 0 as m \u2192 \u221e, and an operator\nCm: F \u2192 Fm such that for all m larger than some M:\n\nContainment: Fm \u2286 F\n\nBoundedness: \u2203Km \u2265 0, \u2200f \u2208 Fm, E[|f| \u00d7 I(|f| \u2265 Km)] \u2264 \u03b5bound(m)\n\nTightness: $P\\left( \\bigcup_{f \\in F} \\{ z \\mid C_m(f)(z) - f(z) \\geq \\epsilon_{\\mathrm{tail}}(m) \\} \\right) \\leq \\epsilon_{\\mathrm{tail}}(m)$\n\nThe second requirement bounds the expected values of Fm on values larger than Km. This is\nrequired to obtain uniform convergence results in the revised model [18]. Note that Km can grow\narbitrarily large. 
The third requirement ensures that our approximation actually converges to the\noriginal concept space F. We will show in \u00a74.2 this is actually a well-motivated characterization of\nconvergence for probabilistic grammars in the supervised setting.\nWe note that a good approximation would have Km increasing fast as a function of m and \u03b5tail(m)\nand \u03b5bound(m) decreasing fast as a function of m. As we will see in \u00a75, we cannot have an arbitrarily\nfast convergence rate (by, for example, taking a subsequence of Fm), because the size of Km has a\ngreat effect on the number of samples required to obtain accurate estimation.\n\n4.1 Constructing Proper Approximations for Probabilistic Grammars\n\nWe now focus on constructing proper approximations for probabilistic grammars. We make an assumption\nabout the probabilistic grammar that \u2200k, Nk = 2. For most common grammar formalisms,\nthis does not change the expressive power: any grammar that can be expressed using Nk > 2 can be\nexpressed using a grammar that has Nk \u2264 2. See supplementary material and [6].\n\nWe now construct Fm. For each f \u2208 F we define the transformation T (f, \u03b3) that shifts every\n\u03b8k = \u27e8\u03b8k,1, \u03b8k,2\u27e9 in the probabilistic grammar by \u03b3:\n\n$$\\langle \\theta_{k,1}, \\theta_{k,2} \\rangle \\leftarrow \\begin{cases} \\langle \\gamma, 1 - \\gamma \\rangle & \\text{if } \\theta_{k,1} < \\gamma \\\\\\\\ \\langle 1 - \\gamma, \\gamma \\rangle & \\text{if } \\theta_{k,1} > 1 - \\gamma \\\\\\\\ \\langle \\theta_{k,1}, \\theta_{k,2} \\rangle & \\text{otherwise} \\end{cases} \\qquad (3)$$\n\nNote that T (f, \u03b3) \u2208 F for any \u03b3 \u2264 1/2. Fix a constant p > 1. For each m \u2208 N, define Fm =\n{T (f, m^{\u2212p}) | f \u2208 F}.\nProposition 4.1. There exists a constant \u03b2 = \u03b2(L, q, p, N ) > 0 such that Fm has the boundedness\nproperty with Km = pN log^3 m and \u03b5bound(m) = m^{\u2212\u03b2 log m}.\nProof. Let f \u2208 Fm. Let Z(m) = {z | |z| \u2264 log^2 m}. Then, for all z \u2208 Z(m) we have\n\n$$f(z) = -\\sum_{i,k} \\psi_{k,i}(z) \\log \\theta_{k,i} \\leq \\sum_{i,k} \\psi_{k,i}(z) (p \\log m) \\leq pN \\log^3 m = K_m,$$\n\nwhere the first inequality\nfollows from f \u2208 Fm (\u03b8k,i \u2265 m^{\u2212p}) and the second from |z| \u2264 log^2 m. In addition, from the\nrequirements on P we have:\n\n$$E\\left[ |f| \\times I(|f| \\geq K_m) \\right] \\leq pN \\log m \\left( \\sum_{k > \\log^2 m} L \\Lambda(k) r^k k \\right) \\leq \\kappa \\log m \\left( q^{\\log^2 m} \\right)$$\n\nfor some constant \u03ba > 0. Finally, for some \u03b2(L, q, p, N ) = \u03b2 > 0 and some constant M, if m > M\nthen $\\kappa \\log m \\left( q^{\\log^2 m} \\right) \\leq m^{-\\beta \\log m}$.\n\nWe show now that Fm is tight with respect to F with $\\epsilon_{\\mathrm{tail}}(m) = \\frac{N \\log^2 m}{m^p - 1}$:\nProposition 4.2. There exists an M such that for any m > M we have:\n\n$$P\\left( \\bigcup_{f \\in F} \\{ z \\mid C_m(f)(z) - f(z) \\geq \\epsilon_{\\mathrm{tail}}(m) \\} \\right) \\leq \\epsilon_{\\mathrm{tail}}(m)$$\n\nfor $\\epsilon_{\\mathrm{tail}}(m) = \\frac{N \\log^2 m}{m^p - 1}$ and Cm(f ) = T (f, m^{\u2212p}).\n\nProof. See supplementary material.\n\nWe now have proper approximations for probabilistic grammars. From this point, we use Fm to\ndenote the proper approximation constructed for G. We use \u03b5bound(m) and \u03b5tail(m) as in Proposition\n4.1 and Proposition 4.2, and assume that p > 1 is fixed, for the rest of the paper.\n\n4.2 Asymptotic Empirical Risk Minimization\n\nIt would be compelling to know that the empirical risk minimizer over Fn is an asymptotic empirical\nrisk minimizer (in the log-loss case, this means it converges to the maximum likelihood estimate).\nAs a conclusion to this section about proper approximations, we motivate the three requirements\nthat we posed on proper approximations by showing that this is indeed true. We now unify n, the\nnumber of samples, and m, the index of the approximation of the concept space F. 
Let $f_n^*$ be the\nminimizer of the empirical risk over F ($f_n^*$ = argmin_{f\u2208F} Remp,n(f )) and let gn be the minimizer\nof the empirical risk over Fn (gn = argmin_{f\u2208Fn} Remp,n(f )).\nLet D = {z1, ..., zn} be a sample from P(z). The operator (gn =) argmin_{f\u2208Fn} Remp,n(f ) is an\nasymptotic empirical risk minimizer if $E[R_{\\mathrm{emp},n}(g_n) - R_{\\mathrm{emp},n}(f_n^*)] \\to 0$. Then, we have the\nfollowing:\nProposition 4.3. Let D = {z1, ..., zn} be a sample of derivations for G. Then gn =\nargmin_{f\u2208Fn} Remp,n(f ) is an asymptotic empirical risk minimizer.\n\nLemma 4.4. Denote by $Z_{\\epsilon,n}$ the set $\\bigcup_{f \\in F} \\{ z \\mid C_n(f)(z) - f(z) \\geq \\epsilon \\}$. Denote by $A_{\\epsilon,n}$ the event\n\u201cone of zi \u2208 D is in $Z_{\\epsilon,n}$.\u201d If Fn properly approximates F, then:\n\n$$E[R_{\\mathrm{emp},n}(g_n) - R_{\\mathrm{emp},n}(f_n^*)] \\leq \\left| E[R_{\\mathrm{emp},n}(C_n(f_n^*)) \\mid A_{\\epsilon,n}] \\right| P(A_{\\epsilon,n}) + \\left| E[R_{\\mathrm{emp},n}(f_n^*) \\mid A_{\\epsilon,n}] \\right| P(A_{\\epsilon,n}) + \\epsilon_{\\mathrm{tail}}(n) \\qquad (4)$$\n\nwhere the expectations are taken with respect to the dataset D. (See the supplementary material for\na proof.)\n\nProof of Proposition 4.3. Let f0 \u2208 F be the concept that puts uniform weights over \u03b8, i.e., \u03b8k =\n\u27e81/2, 1/2\u27e9 for all k. Note that\n\n$$\\left| E[R_{\\mathrm{emp},n}(f_n^*) \\mid A_{\\epsilon,n}] \\right| P(A_{\\epsilon,n}) \\leq \\left| E[R_{\\mathrm{emp},n}(f_0) \\mid A_{\\epsilon,n}] \\right| P(A_{\\epsilon,n}) = \\frac{\\log 2}{n} \\sum_{l=1}^{n} \\sum_{k,i} E[\\psi_{k,i}(z_l) \\mid A_{\\epsilon,n}] P(A_{\\epsilon,n}) \\qquad (5)$$\n\nLet $A_{j,\\epsilon,n}$ for j \u2208 {1, . . . , n} be the event \u201c$z_j \\in Z_{\\epsilon,n}$\u201d. Then $A_{\\epsilon,n} = \\bigcup_j A_{j,\\epsilon,n}$. We have that:\n\n$$E[\\psi_{k,i}(z_l) \\mid A_{\\epsilon,n}] P(A_{\\epsilon,n}) \\leq \\sum_{j} \\sum_{z_l} P(z_l, A_{j,\\epsilon,n}) |z_l| \\qquad (6)$$\n$$\\leq \\sum_{j \\neq l} \\sum_{z_l} P(z_l) P(A_{j,\\epsilon,n}) |z_l| + \\sum_{z_l} P(z_l, A_{l,\\epsilon,n}) |z_l| \\qquad (7)$$\n$$\\leq \\Big( \\sum_{j \\neq l} P(A_{j,\\epsilon,n}) \\Big) B + E[\\psi_{k,i}(z) \\mid z \\in Z_{\\epsilon,n}] P(z \\in Z_{\\epsilon,n}) \\qquad (8)$$\n$$\\leq (n - 1) B P(z \\in Z_{\\epsilon,n}) + E[\\psi_{k,i}(z) \\mid z \\in Z_{\\epsilon,n}] P(z \\in Z_{\\epsilon,n}) \\qquad (9)$$\n\nwhere Eq. 7 comes from zl being independent and B is the constant from \u00a72. Therefore, we have:\n\n$$\\frac{1}{n} \\sum_{l=1}^{n} \\sum_{k,i} E[\\psi_{k,i}(z_l) \\mid A_{\\epsilon,n}] P(A_{\\epsilon,n}) \\leq \\sum_{k,i} \\left( E[\\psi_{k,i}(z) \\mid z \\in Z_{\\epsilon,n}] P(z \\in Z_{\\epsilon,n}) + n^2 B P(z \\in Z_{\\epsilon,n}) \\right) \\qquad (10)$$\n\nFrom the construction of our proper approximations (Proposition 4.2), we know that only derivations\nof length log^2 n or greater can be in $Z_{\\epsilon,n}$. Therefore:\n\n$$E[\\psi_{k,i} \\mid Z_{\\epsilon,n}] P(Z_{\\epsilon,n}) \\leq \\sum_{z : |z| > \\log^2 n} P(z) \\psi_{k,i}(z) \\leq \\sum_{l > \\log^2 n}^{\\infty} L \\Lambda(l) r^l l \\leq \\kappa q^{\\log^2 n} = o(1) \\qquad (11)$$\n\nwhere \u03ba > 0 is a constant. Similarly, we have $P(z \\in Z_{\\epsilon,n}) = o(n^{-2})$. This means that\n$\\left| E[R_{\\mathrm{emp},n}(f_n^*) \\mid A_{\\epsilon,n}] \\right| P(A_{\\epsilon,n}) \\to 0$ as n \u2192 \u221e. In addition, it can be shown that $\\left| E[R_{\\mathrm{emp},n}(C_n(f_n^*)) \\mid A_{\\epsilon,n}] \\right| P(A_{\\epsilon,n}) \\to 0$\nas n \u2192 \u221e using the same proof technique we used above, while relying on the fact\nthat $C_n(f_n^*) \\in F_n$, and therefore $C_n(f_n^*)(z) \\leq pN |z| \\log n$.\n\n5 Sample Complexity Results\n\nWe now give our main sample complexity results for probabilistic grammars. These results hinge\non the convergence of sup_{f\u2208Fn} |Remp,n(f ) \u2212 R(f )|. The rate of this convergence can be fast, if the\ncovering numbers for Fn do not grow too fast.\nWe next give a brief overview of covering numbers. A cover gives a way to reduce a class of\nfunctions to a much smaller (finite, in fact) representative class such that each function in the original\nclass is represented using a function in the smaller class. Let G be a class of functions. 
Let d(f, g)\nbe a distance measure between two functions f, g from G. An \u03b5-cover is a subset of G, denoted by\nG\u2032, such that for every f \u2208 G there exists an f\u2032 \u2208 G\u2032 such that d(f, f\u2032) < \u03b5. The covering number\nN(\u03b5, G, d) is the size of the smallest \u03b5-cover of G with respect to the distance measure d.\nWe will be interested in a specific distance measure that is dependent on the empirical distribution\n$\\tilde{P}$ that describes the data z1, ..., zn. Let f, g \u2208 G. We will use:\n\n$$d_{\\tilde{P}}(f, g) = E_{\\tilde{P}}[|f - g|] = \\sum_{z \\in D(G)} |f(z) - g(z)| \\tilde{P}(z) = \\frac{1}{n} \\sum_{i=1}^{n} |f(z_i) - g(z_i)| \\qquad (12)$$\n\nInstead of using $N(\\epsilon, G, d_{\\tilde{P}})$ directly, we are going to bound this quantity with $N(\\epsilon, G) =\n\\sup_{\\tilde{P}} N(\\epsilon, G, d_{\\tilde{P}})$, where we consider all possible samples (yielding $\\tilde{P}$). The following is the key\nresult about the connection between covering numbers and the double-sided convergence of the\nempirical process sup_{f\u2208Fn} |Remp,n(f ) \u2212 R(f )| as n \u2192 \u221e:\nLemma 5.1. Let Fn be a permissible class\u00b2 of functions such that for every f \u2208 Fn we have\nE[|f| I(|f| \u2265 Kn)] \u2264 \u03b5bound(n). Let Ftruncated,n = {f \u00d7 I(f \u2264 Kn) | f \u2208 Fn}, i.e., the set of\nfunctions from Fn after being truncated by Kn. Then for \u03b5 > 0 we have,\n\n$$P\\left( \\sup_{f \\in F_n} |R_{\\mathrm{emp},n}(f) - R(f)| > 2\\epsilon \\right) \\leq 8 N(\\epsilon/8, F_{\\mathrm{truncated},n}) \\exp\\left( -\\frac{1}{128} n \\epsilon^2 / K_n^2 \\right) + 2 \\epsilon_{\\mathrm{bound}}(n) / \\epsilon \\qquad (13)$$\n\nprovided $n \\geq K_n^2 / 4\\epsilon^2$.\n\n\u00b2 The \u201cpermissible class\u201d requirement is a mild regularity condition about measurability that holds for proper\napproximations. We refer the reader to [18] for more details.\n\nProof. See [18] (chapter 2, pages 30\u201331). See supplementary material for an explanation.\n\nCovering numbers are rather complex combinatorial quantities that are hard to compute directly.\nFortunately, they can be bounded by using the pseudo dimension [3], a generalization of VC dimension\nfor real functions. In the case of our \u201cbinomialized\u201d probabilistic grammars, the pseudo\ndimension of Fn is bounded by N, because we have Fn \u2286 F, and the functions in F are linear with\nN parameters. Hence, Ftruncated,n also has pseudo dimension that is at most N. We have:\nLemma 5.2. (From [18, 13].) Let Fn be the proper approximations for probabilistic grammars. For\nany 0 < \u03b5 < Kn we have:\n\n$$N(\\epsilon, F_{\\mathrm{truncated},n}) < 2 \\left( \\frac{2 e K_n}{\\epsilon} \\log \\frac{2 e K_n}{\\epsilon} \\right)^{N}$$\n\n5.1 Supervised Case\n\nLemmas 5.1 and 5.2 can be combined to get our main sample complexity result:\nTheorem 5.3. Let G be a grammar. Let Fn be a proper approximation for the corresponding family\nof probabilistic grammars. Let P(x, z) be a distribution over derivations that satisfies the requirements\nin \u00a72. Let z1, ..., zn be a sample of derivations. Then there exists a constant \u03b2(L, q, p, N ) and\nconstant M such that for any 0 < \u03b4 < 1 and 0 < \u03b5 < 1 and any n > M and if\n\n$$n \\geq \\max\\left\\{ \\frac{128 K_n^2}{\\epsilon^2} \\left( 2N \\log(16 e K_n / \\epsilon) + \\log \\frac{32}{\\delta} \\right), \\frac{\\log 4/\\delta + \\log 1/\\epsilon}{\\beta(L, q, p, N)} \\right\\} \\qquad (14)$$\n\nthen we have\n\n$$P\\left( \\sup_{f \\in F_n} |R_{\\mathrm{emp},n}(f) - R(f)| \\leq 2\\epsilon \\right) \\geq 1 - \\delta \\qquad (15)$$\n\nwhere Kn = pN log^3 n.\n\nProof. Omitted for space. \u03b2(L, q, p, N ) is the constant from Proposition 4.1. The proof is based on\nsimple algebraic manipulation of the right side of Eq. 
13 while relying on Lemma 5.2.\n\n5.2 Unsupervised Case\n\nIn the unsupervised setting, we have n yields of derivations from the grammar, x1, ..., xn, and our\ngoal again is to identify grammar parameters \u03b8 from these yields. Our concept classes are now the\nsets of log marginalized distributions from Fn. For each f\u03b8 \u2208 Fn, we define $f'_\\theta$ as:\n\n$$f'_\\theta(x) = -\\log \\sum_{z \\in D_x(G)} \\exp(-f_\\theta(z)) = -\\log \\sum_{z \\in D_x(G)} \\exp\\left( \\sum_{k=1}^{K} \\sum_{i=1}^{N_k} \\psi_{k,i}(z) \\log \\theta_{k,i} \\right) \\qquad (16)$$\n\nWe denote the set of {$f'_\\theta$} by $F'_n$. We define analogously F\u2032. Note that we also need to define\nthe operator $C'_n(f')$ as a first step towards defining $F'_n$ as proper approximations (for F\u2032) in the\nunsupervised setting. Let f\u2032 \u2208 F\u2032. Let f be the concept in F such that $f'(x) = \\sum_z f(z, x)$. Then\nwe define $C'_n(f')(x) = \\sum_z C_n(f)(x, z)$.\nIt is not immediate to show that $F'_n$ is a proper approximation for F\u2032. It is not hard to show that the\nboundedness property is satisfied with the same Kn and the same form of \u03b5bound(n) as in Proposition\n4.1 (we would have $\\epsilon'_{\\mathrm{bound}}(m) = m^{-\\beta' \\log m}$ for some \u03b2\u2032(L, q, p, N ) = \u03b2\u2032 > 0). This relies\non the property of bounded derivation length of P. See the supplementary material for a proof. The\nfollowing result shows that we have tightness as well:\nProposition 5.4. There exists an M such that for any n > M we have:\n\n$$P\\left( \\bigcup_{f' \\in F'} \\{ x \\mid C'_n(f')(x) - f'(x) \\geq \\epsilon_{\\mathrm{tail}}(n) \\} \\right) \\leq \\epsilon_{\\mathrm{tail}}(n)$$\n\nfor $\\epsilon_{\\mathrm{tail}}(n) = \\frac{N \\log^2 n}{n^p - 1}$ and the operator $C'_n(f')$ as defined above.\n\nUtility Lemma 5.5. For ai, bi \u2265 0, if $-\\log \\sum_i a_i + \\log \\sum_i b_i \\geq \\epsilon$ then there exists an i such that\n\u2212 log ai + log bi \u2265 \u03b5.\n\nSketch of proof of Proposition 5.4. From Utility Lemma 5.5 we have:\n\n$$P\\left( \\bigcup_{f' \\in F'} \\{ x \\mid C'_n(f')(x) - f'(x) \\geq \\epsilon_{\\mathrm{tail}}(n) \\} \\right) \\leq P\\left( \\bigcup_{f \\in F} \\{ x \\mid \\exists z \\; C_n(f)(z) - f(z) \\geq \\epsilon_{\\mathrm{tail}}(n) \\} \\right) \\qquad (17)$$\n\nDefine X(n) to be all x such that there exists a z with yield(z) = x and |z| \u2265 log^2 n. From the\nproof of Proposition 4.2 and the requirements on P, we know that there exists an \u03b1 \u2265 1 such that\n\n$$P\\left( \\bigcup_{f \\in F} \\{ x \\mid \\exists z \\text{ s.t. } C_n(f)(z) - f(z) \\geq \\epsilon_{\\mathrm{tail}}(n) \\} \\right) \\leq \\sum_{x \\in X(n)} P(x) \\leq \\sum_{x : |x| \\geq \\log^2 n / \\alpha} P(x) \\leq \\sum_{k = \\lfloor \\log^2 n / \\alpha \\rfloor}^{\\infty} L \\Lambda(k) r^k \\leq \\epsilon_{\\mathrm{tail}}(n) \\qquad (18)$$\n\nwhere the last inequality happens for some n larger than a fixed M.\nComputing either the covering number or the pseudo dimension of $F'_n$ is a hard task, because the\nfunctions in these classes include the \u201clog-sum-exp.\u201d In [9], Dasgupta overcomes this problem for\nBayesian networks with fixed structure by giving a bound on the covering number for (his respective)\nF\u2032 that depends on the covering number of F.\nUnfortunately, we cannot fully adopt this approach, because the derivations of a probabilistic grammar\ncan be arbitrarily large. We overcome this problem using the following restriction. We assume\nthat |Dx(G)| < d(n), where d is a function mapping n, the size of our sample, to a real number.\nThe more samples we have, the more permissive (for large derivation set) the grammar can be. 
On\nthe other hand, the more accuracy we desire, the more restricted we are in choosing grammars that\nhave a large derivation set. We refer to this restriction as the \u201cderivational condition.\u201d With the\nderivational condition, we can show the following result:\nProposition 5.6. (Hidden Variable Rule for Probabilistic Grammars) Under the derivational condition,\n$N(\\epsilon, F'_{\\mathrm{truncated},n}) \\leq N(\\epsilon / d(n), F_{\\mathrm{truncated},n})$.\n\nThe proof of Proposition 5.6 is almost identical to the proof of the hidden variable rule in [9]. For\nthe unsupervised case, then, we get the following sample complexity result:\nTheorem 5.7. Let G be a grammar. Let $F'_n$ be a proper approximation for the corresponding\nfamily of probabilistic grammars. Let P(x, z) be a distribution over derivations that satisfies the\nrequirements in \u00a72. Let x1, ..., xn be a sample of strings from P(x). Then there exists a constant\n\u03b2\u2032(L, q, p, N ) and constant M such that for any 0 < \u03b4 < 1 and 0 < \u03b5 < 1 and any n > M and if\n\n$$n \\geq \\max\\left\\{ \\frac{128 K_n^2}{\\epsilon^2} \\left( 2N \\log(16 e K_n d(n) / \\epsilon) + \\log \\frac{32}{\\delta} \\right), \\frac{\\log 4/\\delta + \\log 1/\\epsilon}{\\beta'(L, q, p, N)} \\right\\} \\qquad (19)$$\n\nand |Dx(G)| < d(n), we have that\n\n$$P\\left( \\sup_{f \\in F'_n} |R_{\\mathrm{emp},n}(f) - R(f)| \\leq 2\\epsilon \\right) \\geq 1 - \\delta \\qquad (20)$$\n\nwhere Kn = pN log^3 n.\n\nFor this sample complexity bound to be non-trivial, for example, we can restrict Dx(G), through\nd(n), to have a polynomial size in the number of our samples. Enlarging d(n) is possible even to an\nexponential function of n^\u03c1 for \u03c1 < 1, e.g. d(n) = 2^{\u221an}.\n\ncriterion | as Kn increases . . . | as d(n) increases . . . | as p increases . . .\ntightness of proper approximation | improves | no effect | improves\nsample complexity bound | degrades | degrades | degrades\n\nTable 1: Trade-off between quantities in our learning model and effectiveness of different criteria.\nd(n) is the function that gives the derivational condition, i.e., |Dx(G)| \u2264 d(n).\n\n6 Discussion\n\nOur framework can be specialized to improve the two main criteria that have a trade-off: the tightness\nof the proper approximation and the sample complexity. For example, we can improve the\ntightness of our proper approximations by taking a subsequence of Fn. However, this will make the\nsample complexity bound degrade, because Kn will grow faster. Table 1 gives the different trade-offs\nbetween parameters in our model and the effectiveness of learning. In general, we would want\nthe derivational condition to be removed (choose d(n) = \u221e, or at least allow d(n) = \u2126(t^n) for\nsome t, for small samples), but in that case our sample complexity bounds become trivial.\nIn the supervised case, our result states that the number of samples we require (as an upper bound)\ngrows mostly because of a term that behaves as O(N^3 log N ) (for a fixed \u03b4 and \u03b5). If our grammar, for\nexample, is a PCFG, then N depends on the total number of rules. When the PCFG is in Chomsky\nnormal form and lexicalized [10, 7], then N grows by an order of V^2, where V is the vocabulary size.\nThis means that the bound grows by an order of O(V^6 log V ). This is consistent with conventional\nwisdom that lexicalized grammars require much more data for accurate learning.\nThe dependence of the bound on N suggests that it is easier to learn models with a smaller grammar\nsize. This may help explain the success of recent advances in supervised parsing [4, 22, 17] that\nhave \u201ccoarse\u201d models (with a much smaller number of nonterminals) as a first pass. 
Those models are\neasier to learn and require less data to be accurate, and can serve as base models for later phases.\nThe sample complexity bound for the unsupervised case suggests that we need log d(n) times as\nmuch data to achieve estimates as good as those for supervised learning. Interestingly, with unsupervised\ngrammar learning, available training sentences longer than a maximum length (e.g., 10) are\noften ignored; see [14].\nWe note that sample complexity is not the only measure for the complexity of estimating probabilistic\ngrammars. In the unsupervised setting, for example, the computational complexity of ERM is\nNP-hard for PCFGs [5] or probabilistic automata [2].\n\n7 Conclusion\n\nWe presented a framework for learning the parameters of a probabilistic grammar under the log-loss\nand derived sample complexity bounds for it. We motivated this framework by showing that the\nempirical risk minimizer for our approximate framework is an asymptotic empirical risk minimizer.\nOur framework uses a sequence of approximations to a family of probabilistic grammars, which\nimproves as we have more data, to give distribution-dependent sample complexity bounds in the\nsupervised and unsupervised settings.\n\nAcknowledgements\n\nWe thank the anonymous reviewers for their comments and Avrim Blum, Steve Hanneke, and Dan\nRoth for useful conversations. This research was supported by NSF grant IIS-0915187.\n\nReferences\n\n[1] N. Abe, J. Takeuchi, and M. Warmuth. Polynomial learnability of probabilistic concepts with\nrespect to the Kullback-Leibler divergence. In ACM Conference on Computational Learning\nTheory, 1990.\n\n[2] N. Abe and M. Warmuth. On the computational complexity of approximating distributions by\nprobabilistic automata. Machine Learning, 2:205\u2013260, 1992.\n\n[3] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge\nUniversity Press, 1999.\n\n[4] E. Charniak and M. Johnson. 
Coarse-to-fine n-best parsing and maxent discriminative reranking.\nIn Proc. of ACL, 2005.\n\n[5] S. B. Cohen and N. A. Smith. Viterbi training for PCFGs: Hardness results and competitiveness\nof uniform initialization. In Proceedings of ACL, 2010.\n\n[6] S. B. Cohen and N. A. Smith. Empirical risk minimization for probabilistic grammars: Sample\ncomplexity and hardness of learning, in preparation.\n\n[7] M. Collins. Head-driven statistical models for natural language processing. Computational\nLinguistics, 29:589\u2013637, 2003.\n\n[8] M. Collins. Parameter estimation for statistical parsing models: theory and practice of\ndistribution-free methods. Text, Speech and Language Technology (new developments in parsing\ntechnology), pages 19\u201355, 2004.\n\n[9] S. Dasgupta. The sample complexity of learning fixed-structure Bayesian networks. Machine\nLearning, 29(2-3):165\u2013180, 1997.\n\n[10] J. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proc. of\nCOLING, 1996.\n\n[11] E. M. Gold. Language identification in the limit. Information and Control, 10(5):447\u2013474,\n1967.\n\n[12] G. Guerra and Y. Aloimonos. Discovering a language for human activity. In AAAI Workshop\non Anticipation in Cognitive Systems, 2005.\n\n[13] D. Haussler. Decision-theoretic generalizations of the PAC model for neural net and other\nlearning applications. Information and Computation, 100:78\u2013150, 1992.\n\n[14] D. Klein and C. D. Manning. Corpus-based induction of syntactic structure: Models of dependency\nand constituency. In Proc. of ACL, 2004.\n\n[15] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization.\nThe Annals of Statistics, 34(6):2593\u20132656, 2006.\n\n[16] L. Lin, T. Wu, J. Porway, and Z. Xu. A stochastic graph grammar for compositional object\nrepresentation and recognition. Pattern Recognition, 8, 2009.\n\n[17] S. Petrov and D. Klein. 
Improved inference for unlexicalized parsing. In Proc. of HLT-NAACL,\n2007.\n\n[18] D. Pollard. Convergence of Stochastic Processes. New York: Springer-Verlag, 1984.\n\n[19] Y. Sakakibara, M. Brown, R. Hughey, S. Mian, K. Sj\u00f6lander, R. C. Underwood, and D. Haussler.\nStochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22, 1994.\n\n[20] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics,\n32(1):135\u2013166, 2004.\n\n[21] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.\n\n[22] D. Weiss and B. Taskar. Structured prediction cascades. In Proceedings of AISTATS, 2010.\n", "award": [], "sourceid": 803, "authors": [{"given_name": "Noah", "family_name": "Smith", "institution": null}, {"given_name": "Shay", "family_name": "Cohen", "institution": null}]}