{"title": "Classification with Hybrid Generative/Discriminative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 545, "page_last": 552, "abstract": "", "full_text": "Classification with Hybrid Generative/Discriminative Models

Rajat Raina, Yirong Shen, Andrew Y. Ng
Computer Science Department
Stanford University
Stanford, CA 94305

Andrew McCallum
Department of Computer Science
University of Massachusetts
Amherst, MA 01003

Abstract

Although discriminatively trained classifiers are usually more accurate when labeled training data is abundant, previous work has shown that when training data is limited, generative classifiers can outperform them. This paper describes a hybrid model in which a high-dimensional subset of the parameters are trained to maximize generative likelihood, and another, small, subset of parameters are discriminatively trained to maximize conditional likelihood. We give a sample complexity bound showing that in order to fit the discriminative parameters well, the number of training examples required depends only on the logarithm of the number of feature occurrences and feature set size. Experimental results show that hybrid models can provide lower test error and can produce better accuracy/coverage curves than either their purely generative or purely discriminative counterparts. We also discuss several advantages of hybrid models, and advocate further work in this area.

1 Introduction

Generative classifiers learn a model of the joint probability, p(x, y), of the inputs x and the label y, and make their predictions by using Bayes rule to calculate p(y|x), and then picking the most likely label y. In contrast, discriminative classifiers model the posterior p(y|x) directly. 
It has often been argued that for many application domains, discriminative classifiers often achieve higher test set accuracy than generative classifiers (e.g., [6, 4, 14]). Nonetheless, generative classifiers also have several advantages, among them straightforward EM methods for handling missing data, and often better performance when training set sizes are small. Specifically, it has been shown that a simple generative classifier (naive Bayes) outperforms its conditionally-trained, discriminative counterpart (logistic regression) when the amount of available labeled training data is small [11].

In an effort to obtain the best of both worlds, this paper explores a class of hybrid models for supervised learning that are partly generative and partly discriminative. In these models, a large subset of the parameters are trained to maximize the generative, joint probability of the inputs and outputs of the supervised learning task; another, much smaller, subset of the parameters are discriminatively trained to maximize the conditional probability of the outputs given the inputs.

Motivated by an application in text classification as well as a desire to begin by exploring a simple, pure form of hybrid classification, we describe and give results with a "generative-discriminative" pair [11] formed by naive Bayes and logistic regression, and a hybrid algorithm based on both. We also give two natural by-products of the hybrid model: First, a scheme for allowing different partitions of the variables to contribute more or less strongly to the classification decision (for an email classification example, modeling the text in the subject line and message body separately, with learned weights for the relative contributions). 
Second, a method for improving accuracy/coverage curves of models that make incorrect independence assumptions, such as naive Bayes.

We also prove a sample complexity result showing that the number of training examples needed to fit the discriminative parameters depends only on the logarithm of the vocabulary size and document length. In experimental results, we show that the hybrid model achieves significantly more accurate classification than either its purely generative or purely discriminative counterparts. We also demonstrate that the hybrid model produces class posterior probabilities that better reflect empirical error rates, and as a result produces improved accuracy/coverage curves.

2 The Model

We begin by briefly reviewing the multinomial naive Bayes classifier applied to text categorization [10], and then describe our hybrid model and its relation to logistic regression.

Let Y = {0, 1} be the set of possible labels for a document classification task, and let W = {w_1, w_2, ..., w_|W|} be a dictionary of words. A document of N words is represented by a vector X = (X_1, X_2, ..., X_N) of length N. The ith word in the document is X_i ∈ W. Note that N can vary for different documents. The multinomial naive Bayes model assumes that the label Y is chosen from some prior distribution P(Y = ·), the length N is drawn from some distribution P(N = ·) independently of the label, and each word X_i is drawn independently from some distribution P(W = ·|Y) over the dictionary. Thus, we have:1

P(X = x, Y = y) = P(Y = y) P(N = n) ∏_{i=1}^{n} P(W = x_i | Y = y)    (1)

Since the length n of the document does not depend on the label and therefore does not play a significant role, we leave it out of our subsequent derivations.

The parameters in the naive Bayes model are P̂(Y) and P̂(W|Y) (our estimates of P(Y) and P(W|Y)). They are set to maximize the joint (penalized) log-likelihood of the x and y pairs in a labeled training set, M = {(x^(i), y^(i))}_{i=1}^{m}. Let n^(i) be the length of document x^(i). Specifically, for any k ∈ {0, 1}, we have:

P̂(Y = k) = (1/m) ∑_{i=1}^{m} 1{y^(i) = k}    (2)

P̂(W = w_l | Y = k) = ( ∑_{i=1}^{m} ∑_{j=1}^{n^(i)} 1{x_j^(i) = w_l, y^(i) = k} + 1 ) / ( ∑_{i=1}^{m} n^(i) 1{y^(i) = k} + |W| ),    (3)

where 1{·} is the indicator function (1{True} = 1, 1{False} = 0), and we have applied Laplace (add-one) smoothing in obtaining the estimates of the word probabilities. Using Bayes rule, we obtain the estimated class posterior probabilities for a new document x as:

P̂(Y = 1 | X = x) = P̂(X = x | Y = 1) P̂(Y = 1) / ∑_{y∈Y} P̂(X = x | Y = y) P̂(Y = y),  where  P̂(X = x | Y = y) = ∏_{i=1}^{n} P̂(W = x_i | Y = y)    (4)

The predicted class for the new document is then simply arg max_{y∈Y} P̂(Y = y | X = x).

In many text classification applications, the documents involved consist of several disjoint regions that may have different dependencies with the document label. 
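As a concrete illustration of Equations (2) and (3), the estimates can be computed directly from token counts. The sketch below is ours, not the authors' code; the names (`fit_naive_bayes`, `p_word`) and the toy documents are hypothetical.

```python
from collections import Counter

def fit_naive_bayes(docs, labels, vocab):
    """Multinomial naive Bayes with Laplace (add-one) smoothing,
    following Equations (2) and (3). `docs` is a list of token lists."""
    m = len(docs)
    # Equation (2): class priors are empirical label frequencies.
    prior = {k: sum(1 for y in labels if y == k) / m for k in (0, 1)}
    word_counts = {0: Counter(), 1: Counter()}  # per-class word counts
    total = {0: 0, 1: 0}                        # per-class total words
    for x, y in zip(docs, labels):
        word_counts[y].update(x)
        total[y] += len(x)
    def p_word(w, k):
        # Equation (3): (count of w in class-k docs + 1) / (class-k words + |W|)
        return (word_counts[k][w] + 1) / (total[k] + len(vocab))
    return prior, p_word

docs = [["ball", "game"], ["game", "score"], ["stock", "market"]]
labels = [1, 1, 0]
vocab = {"ball", "game", "score", "stock", "market"}
prior, p_word = fit_naive_bayes(docs, labels, vocab)
```

By construction, the smoothed word probabilities for each class sum to one over the vocabulary.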
For example, a USENET news posting includes both a subject region and a message body region.2 Because of the strong assumptions used by naive Bayes, it treats the words in the different regions of a document in exactly the same way, ignoring the fact that perhaps words in a particular region (such as words in the subject) might be more "important." Further, it also tends to allow the words in the longer region to dominate. (Explained below.)

1 We adopt the notational convention that upper-case is used to denote random variables, and lower-case is used to denote particular values taken by the random variables.

2 Other possible text classification examples include: emails consisting of subject and body; technical papers consisting of title, abstract, and body; web pages consisting of title, headings, and body.

In the sequel, we assume that every input document X can be naturally divided into R regions X^1, X^2, ..., X^R. Note that R can be one. The regions are of variable lengths N_1, N_2, ..., N_R. For the sake of conciseness and clarity, in the following discussion we will focus on the case of R = 2 regions, the generalization offering no difficulties. Thus, the document probability in Equation (4) is now replaced with:

P̂(X = x | Y = y) = P̂(X^1 = x^1 | Y = y) P̂(X^2 = x^2 | Y = y)    (5)
                 = ∏_{i=1}^{n_1} P̂(W = x^1_i | Y = y) ∏_{i=1}^{n_2} P̂(W = x^2_i | Y = y)    (6)

Here, x^j_i denotes the ith word in the jth region. Naive Bayes will predict y = 1 if:

∑_{i=1}^{n_1} log P̂(W = x^1_i | Y = 1) + ∑_{i=1}^{n_2} log P̂(W = x^2_i | Y = 1) + log P̂(Y = 1) ≥
∑_{i=1}^{n_1} log P̂(W = x^1_i | Y = 0) + ∑_{i=1}^{n_2} log P̂(W = x^2_i | Y = 0) + log P̂(Y = 0)

and predict y = 0 otherwise. In an email or USENET news classification problem, if the first region is the subject, and the second region is the message body, then n_2 >> n_1, since message bodies are usually much longer than subjects. Thus, in the inequality above, the message body contributes many more terms to both the left and right sides, and the result of the "≥" test will be largely determined by the message body (with the message subject essentially ignored or otherwise having very little effect).

Given the importance and informativeness of message subjects, this suggests that we might obtain better performance than the basic naive Bayes classifier by considering a modified algorithm that assigns different "weights" to different regions, and normalizes for region lengths. Specifically, consider making a prediction using the modified inequality test:

(θ_1/n_1) ∑_{i=1}^{n_1} log P̂(W = x^1_i | Y = 1) + (θ_2/n_2) ∑_{i=1}^{n_2} log P̂(W = x^2_i | Y = 1) + log P̂(Y = 1) ≥
(θ_1/n_1) ∑_{i=1}^{n_1} log P̂(W = x^1_i | Y = 0) + (θ_2/n_2) ∑_{i=1}^{n_2} log P̂(W = x^2_i | Y = 0) + log P̂(Y = 0)

Here, the vector of parameters θ = (θ_1, θ_2) controls the relative "weighting" between the message subjects and bodies, and will be fit discriminatively. 
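The modified inequality test can be sketched as follows (our illustration, not the authors' code; `prior` and `p_word` are hypothetical stand-ins for the generatively trained estimates): each region's log-likelihood is scaled by θ_r/n_r before the two class scores are compared.

```python
import math

def predict_weighted(regions, theta, prior, p_word):
    """Modified inequality test: region r's log-likelihood is weighted
    by theta[r] / n_r, where n_r is the region length."""
    score = {}
    for k in (0, 1):
        s = math.log(prior[k])
        for r, tokens in enumerate(regions):
            n_r = max(len(tokens), 1)
            s += (theta[r] / n_r) * sum(math.log(p_word(w, k)) for w in tokens)
        score[k] = s
    return 1 if score[1] >= score[0] else 0

# Toy model: "good" is indicative of class 1, every other word of class 0.
prior = {0: 0.5, 1: 0.5}
p_word = lambda w, k: 0.9 if (w == "good") == (k == 1) else 0.1
subject, body = ["good"], ["bad"] * 10
```

Because each region's contribution is averaged over its length, the ten-word body no longer automatically outweighs the one-word subject; the learned θ decides their relative influence.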
Specifically, we will model the class posteriors, which we denote by P̂_θ to make explicit the dependence on θ, as:3

P̂_θ(y|x) = P̂(y) P̂(x^1|y)^(θ_1/n_1) P̂(x^2|y)^(θ_2/n_2) / [ P̂(Y=0) P̂(x^1|Y=0)^(θ_1/n_1) P̂(x^2|Y=0)^(θ_2/n_2) + P̂(Y=1) P̂(x^1|Y=1)^(θ_1/n_1) P̂(x^2|Y=1)^(θ_2/n_2) ]    (7)

We had previously motivated our model as assigning different weights to different parts of the document. A second reason for using this model is that the independence assumptions of naive Bayes are too strong. Specifically, with a document of length n, the classifier "assumes" that it has n completely independent pieces of evidence supporting its conclusion about the document's label. Putting n_r in the denominator of the exponent as a normalization factor can be viewed as a way of counteracting the overly strong independence assumptions.4

3 When there is no risk of ambiguity, we will sometimes replace P(X = x|Y = y), P(Y = y|X = x), P(W = x_i|Y = y), etc. with P(x|y), P(y|x), P(x_i|y).

4 θ_r can also be viewed as an "effective region length" parameter, where we assume that region r of the document can be treated as only θ_r independent pieces of observation. For example, note that if each region r of the document has θ_r words exactly, then this model reduces to naive Bayes.

After some simple manipulations, we obtain the following expression for P̂_θ(Y = 1|x):

P̂_θ(Y = 1|x) = 1 / (1 + exp(−a − θ_1 b_1 − ... − θ_R b_R))    (8)

where a = log( P̂(Y=1) / P̂(Y=0) ) and b_r = (1/n_r) log( P̂(x^r|Y=1) / P̂(x^r|Y=0) ). With this expression for P̂_θ(y|x), we see that it is very similar to the form of the class posteriors used by logistic regression, the only difference being that in this case a is a constant calculated from the estimated class priors. To make the parallel to logistic regression complete, we define b_0 = 1, redefine θ as θ = (θ_0, θ_1, θ_2), and define a new class posterior

P̂_θ(Y = 1|x) = 1 / (1 + exp(−θ^T b))    (9)

Throughout the derivation, we had assumed that the parameters P̂(x|y) were fit generatively as in Equation (3) (and b is in turn derived from these parameters as described above). It therefore remains only to specify how θ is chosen. One method would be to pick θ by maximizing the conditional log-likelihood of the training set M = {(x^(i), y^(i))}_{i=1}^{m}:

θ = arg max_{θ'} ∑_{i=1}^{m} log P̂_{θ'}(y^(i)|x^(i))    (10)

However, the word generation probabilities that were used to calculate b were also trained from the training set M. This procedure therefore fits the parameters θ to the training data, using "features" b that were also fit to the data. This leads to a biased estimator. Specifically, since what we care about is the generalization performance of the algorithm, a better method is to pick θ to maximize the log-likelihood of data that wasn't used to calculate the "features" b, because when we see a test example, we will not have had the luxury of incorporating information from the test example into the b's (cf. [15, 12]). 
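The logistic form of Equations (8) and (9) can be sketched as follows (our illustration; `p_word` again stands in for a hypothetical fitted naive Bayes interface). The features b are length-normalized log-likelihood ratios, with b_0 = 1 so that θ_0 plays the role of an intercept.

```python
import math

def features_b(regions, p_word):
    """Feature vector of Equation (9): b_0 = 1, and for each region r,
    b_r = (1/n_r) * log( P(x^r | Y=1) / P(x^r | Y=0) )."""
    b = [1.0]
    for tokens in regions:
        n_r = max(len(tokens), 1)
        llr = sum(math.log(p_word(w, 1) / p_word(w, 0)) for w in tokens)
        b.append(llr / n_r)
    return b

def posterior_theta(theta, b):
    """Equation (9): P_theta(Y=1 | x) = 1 / (1 + exp(-theta . b))."""
    z = sum(t * bi for t, bi in zip(theta, b))
    return 1.0 / (1.0 + math.exp(-z))

# Toy model as before: "good" indicates class 1.
p_word = lambda w, k: 0.9 if (w == "good") == (k == 1) else 0.1
b = features_b([["good"], ["bad", "bad"]], p_word)
```

Setting θ = (a, 1, ..., 1), with a the log-prior-odds of Equation (8), recovers the normalized model above; fitting θ discriminatively generalizes it.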
This leads to the following "leave-one-out" strategy for picking θ:

θ = arg max_{θ'} ∑_{i=1}^{m} log P̂_{θ',−i}(y^(i)|x^(i)),    (11)

where P̂_{θ',−i}(y^(i)|x^(i)) is as given in Equation (9), except that each b_r is computed from word generation probabilities that were estimated with the ith example of the training set held out. We note that optimizing this objective to find θ is still the same optimization problem as in logistic regression, and hence is convex and can be solved efficiently. Further, the word generation probabilities with the ith example left out can also be computed efficiently.5

The predicted label for a new document under this method is arg max_{y∈Y} P̂_θ(y|x). We call this method the normalized hybrid algorithm. For the sake of comparison, we will also consider an algorithm in which the exponents in Equation (7) are not normalized by n_r. In other words, we replace θ_r/n_r there by just θ_r. We refer to this latter method as the unnormalized hybrid algorithm.

3 Experimental Results

We now describe the results of experiments testing the effectiveness of our methods. All experiments were run using pairs of newsgroups from the 20newsgroups dataset [8] of USENET news postings. When parsing this data, we skipped everything in the USENET headers except the subject line; numbers and email addresses were replaced by special tokens NUMBER and EMAILADDR; and tokens were formed after stemming.

In each experiment, we compare the performance of the basic naive Bayes algorithm with that of the normalized hybrid algorithm and logistic regression with Gaussian priors on the parameters. We used logistic regression with word-counts in the feature vectors (as in [6]), which forms a discriminative-generative pair with multinomial naive Bayes. 
All results reported in this section are averages over 10 random train-test splits.

5 Specifically, by precomputing the numerator and denominator of Equation (3), we can later remove any example by subtracting out the terms in the numerator and denominator corresponding to that example.

[Figure 1: six panels of test error vs. size of training set, for the newsgroup pairs atheism vs religion.misc, pc.hardware vs mac.hardware, graphics vs mideast, atheism vs sci.med, autos vs motorcycles, and hockey vs christian.]

Figure 1: Plots of test error vs training size for several different newsgroup pairs. Red dashed line is logistic regression; blue dotted line is standard naive Bayes; black solid line is the hybrid algorithm. (Colors where available.) (If more training data were available, logistic regression would presumably outperform naive Bayes; cf. [6, 11].)

Figure 1 plots learning curves for the algorithms, when used to classify between various pairs of newsgroups. We find that in every experiment, for the training set sizes considered, the normalized hybrid algorithm with R = 2 has test error that is either the lowest or very near the lowest among all the algorithms. In particular, it almost always outperforms the basic naive Bayes algorithm. 
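The precompute-and-subtract trick for the leave-one-out word probabilities (footnote 5) can be sketched as follows (our illustration, with hypothetical names): the numerator and denominator of Equation (3) are accumulated once over the whole training set, and the held-out estimate for example i is recovered by subtracting that example's counts, instead of refitting the model m times.

```python
from collections import Counter

def loo_word_probs(docs, labels, vocab_size):
    """Leave-one-out word probabilities via footnote 5: accumulate
    Equation (3)'s counts once, then subtract example i's contribution."""
    num = {0: Counter(), 1: Counter()}  # per-class word counts
    den = {0: 0, 1: 0}                  # per-class total word counts
    for x, y in zip(docs, labels):
        num[y].update(x)
        den[y] += len(x)

    def p_word_minus_i(w, k, i):
        # Example i only affects class k's counts if its label is k.
        held = docs[i] if labels[i] == k else []
        n = num[k][w] - held.count(w) + 1       # add-one smoothing
        d = den[k] - len(held) + vocab_size
        return n / d

    return p_word_minus_i

docs = [["ball", "game"], ["game", "score"], ["stock", "market"]]
labels = [1, 1, 0]
p_loo = loo_word_probs(docs, labels, vocab_size=5)
```

Each query costs O(|doc_i|) rather than a full refit, which is what makes the leave-one-out objective of Equation (11) practical.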
The difference in performance is especially dramatic for small training sets. Although these results are not shown here, the hybrid algorithm with R = 2 (breaking the document into two regions) outperforms R = 1. Further, the normalized version of the hybrid algorithm generally outperforms the unnormalized version.

4 Theoretical Results

In this section, we give a distribution-free uniform convergence bound for our algorithm. Classical learning and VC theory indicates that, given a discriminative model with a small number of parameters, typically only a small amount of training data should be required to fit the parameters "well" [14]. In our model, a large number of parameters P̂ are fit generatively, but only a small number (the θ's) are fit discriminatively. We would like to show that only a small training set is required to fit the discriminative parameters θ.6 However, standard uniform convergence results do not apply to our problem, because the "features" b_i given to the discriminative logistic regression component also depend on the training set. Further, the θ_i's are fit using the leave-one-out training procedure, so that every pair of training examples is actually dependent.

For our analysis, we assume the training set of size m is drawn i.i.d. from some distribution D over X × Y. Although not necessary, for simplicity we assume that each document has the same total number of words n = ∑_{i=1}^{R} n_i, though the lengths of the individual regions may vary. (It also suffices to have an upper- and a lower-bound on document length.) 
Finally, we also assume that each word occurs at most C_max times in a single document, and that the distribution D from which training examples are drawn satisfies ρ_min ≤ P(Y = 1) ≤ 1 − ρ_min, for some fixed ρ_min > 0.

6 For a result showing that naive Bayes' generatively fit parameters (albeit one using a different event model) converge to their population (asymptotic) values after a number of training examples that depends logarithmically on the number of features, also see [11].

Note that we do not assume that the "naive Bayes assumption" (that words are conditionally independent given the class label) holds. Specifically, even when the naive Bayes assumption does not hold, the naive Bayes algorithm (as well as our hybrid algorithm) can still be applied, and our results apply to this setting.

Given a set M of m training examples, for a particular setting of the parameter θ, the expected log likelihood of a randomly drawn test example is:

ε_M(θ) = E_{(x,y)∼D} log P̂_θ(y|x)    (12)

where P̂_θ is the probability model trained on M as described in the previous section, using parameters P̂ fit to the entire training set. Our algorithm uses a leave-one-out estimate of the true log likelihood; we call this the leave-one-out log likelihood:

ε̂_M^{−1}(θ) = (1/m) ∑_{i=1}^{m} log P̂_{θ,−i}(y^(i)|x^(i))    (13)

where P̂_{θ,−i} represents the probability model trained with the ith example left out.

We would like to choose θ to maximize ε_M, but we do not know ε_M. 
Now, it is well-known that if we have some estimate ε̂ of a generalization error measure ε, and if |ε̂(θ) − ε(θ)| ≤ ϵ for all θ, then optimizing ε̂ will result in a value for θ that comes within 2ϵ of the best possible value for ε [14]. Thus, in order to show that optimizing ε̂_M^{−1} is a good "proxy" for optimizing ε_M, we only need to show that ε̂_M^{−1}(θ) is uniformly close to ε_M(θ). We have:

Theorem 1. Under the previous set of assumptions, in order to ensure that with probability at least 1 − δ, we have |ε_M(θ) − ε̂_M^{−1}(θ)| < ϵ for all parameters θ such that ||θ||_∞ ≤ η, it suffices that m = O(poly(1/δ, 1/ϵ, log n, log |W|, R, η)^R).

The full proof of this result is fairly lengthy, and is deferred to the full version of this paper [13]. From the theorem, the number of training examples m required to fit the θ parameters (under the fairly standard regularity condition that θ be bounded) depends only on the logarithms of the document length n and the vocabulary size |W|. In our bound, there is an exponential dependence on R; however, in our experience, R does not need to be too large for significantly improved performance. In fact, our experimental results demonstrate good performance for R = 2.

5 Calibration Curves

We now consider a second application of these ideas, to a text classification setting where the data is not naturally split into different regions (equivalently, where R = 1). In this setting we cannot use the "reweighting" power of the hybrid algorithm to reduce classification error. 
But, we will see that, by giving better class posteriors, our method still gives improved performance as measured on accuracy/coverage curves.

An accuracy/coverage curve shows the accuracy (fraction correct) of a classifier if it is asked only to provide x% coverage, that is, if it is asked only to label the x% of the test data on which it is most confident. Accuracy/coverage curves towards the upper-right of the graph mean high accuracy even when the coverage is high, and therefore good performance. The accuracy value at 100% coverage is just the normal classification accuracy. In settings where both humans and computers label documents, accuracy/coverage curves play a central role in determining how much data has to be labeled by humans. They are also indicative of the quality of a classifier's class posteriors, because a classifier with better class posteriors would be able to better judge which x% of the test data it should be most confident on, and achieve higher accuracy when it chooses to label that x% of the data.

Figure 2 shows accuracy/coverage curves for classifying several pairs of newsgroups from the 20newsgroups dataset. Each plot is obtained by averaging the results of ten 50%/50% random train/test splits. 
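A point on an accuracy/coverage curve can be computed as in the sketch below (ours; in this setting `confidences` would be max_y P̂(y|x) on the test set): sort by confidence, keep the top fraction, and measure accuracy there.

```python
def accuracy_at_coverage(confidences, correct, coverage):
    """Accuracy on the `coverage` fraction of the test set on which the
    classifier is most confident (e.g. confidence = max_y P(y|x))."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    k = max(1, int(round(coverage * len(order))))
    return sum(1 for i in order[:k] if correct[i]) / k

# Toy test set: the one mistake sits at low confidence.
conf = [0.99, 0.95, 0.90, 0.60, 0.55]
right = [True, True, True, False, True]
```

A classifier with well-ordered posteriors places its mistakes at low confidence, so accuracy stays high until coverage approaches 100%; at full coverage the value is simply overall accuracy.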
The normalized hybrid algorithm (R = 1) does significantly better than naive Bayes, and has accuracy/coverage curves that are higher almost everywhere.

[Figure 2: six panels of accuracy vs. coverage, for the newsgroup pairs atheism vs religion.misc, pc.hardware vs mac.hardware, graphics vs mideast, atheism vs sci.med, autos vs motorcycles, and hockey vs christian.]

Figure 2: Accuracy/Coverage curves for different newsgroup pairs. Black solid line is our normalized hybrid algorithm with R = 1; magenta dash-dot line is naive Bayes; blue dotted line is unnormalized hybrid, and red dashed line is logistic regression. (Colors where available.)

For example, in Figure 2a, the normalized hybrid algorithm with R = 1 has a coverage of over 40% at 95% accuracy, while naive Bayes' coverage is 0 for the same accuracy. Also, the unnormalized algorithm has performance about the same as naive Bayes. 
Even in examples where the various algorithms have comparable overall test error, the normalized hybrid algorithm has significantly better accuracy/coverage.

6 Discussion and Related Work

This paper has described a hybrid generative/discriminative model, and presented experimental results showing that a simple hybrid model can perform better than either its purely generative or discriminative counterpart. Furthermore, we showed that in order to fit the parameters θ of the model, only a small number of training examples is required.

There have been a number of previous efforts to modify naive Bayes to obtain more empirically accurate posterior probabilities. Lewis and Gale [9] use logistic regression to recalibrate naive Bayes posteriors in an active learning task. Their approach is similar to the lower-performing unnormalized version of our algorithm, with only one region. Bennett [1] studies the problem of using asymmetric parametric models to obtain high quality probability estimates from the scores output by text classifiers such as naive Bayes. Zadrozny and Elkan [16] describe a simple non-parametric method for calibrating naive Bayes probability estimates. While these methods can obtain good class posteriors, we note that in order to obtain better accuracy/coverage, it is not sufficient to take naive Bayes' output p(y|x) and find a monotone mapping from that to a set of hopefully better class posteriors (e.g., [16]). Specifically, in order to obtain better accuracy/coverage, it is also important to rearrange the confidence orderings that naive Bayes gives to documents (which our method does because of the normalization).

Jaakkola and Haussler [3] describe a scheme in which the kernel for a discriminative classifier is extracted from a generative model. 
Perhaps the closest to our work, however, is the commonly-used, simple "reweighting" of the language model and acoustic model in speech recognition systems (e.g., [5]). Each of the two models is trained generatively; then a single weight parameter is set using hold-out cross-validation.

In related work, there are also a number of theoretical results on the quality of leave-one-out estimates of generalization error. Some examples include [7, 2]. (See [7] for a brief survey.) Those results tend to be for specialized models or have strong assumptions on the model, and to our knowledge do not apply to our setting, in which we are also trying to fit the parameters θ.

In closing, we have presented one hybrid generative/discriminative algorithm that appears to do well on a number of problems. We suggest that future research in this area is poised to bear much fruit. Some possible future work includes: automatically determining which parameters to train generatively and which discriminatively; training methods for more complex models with latent variables, that require EM to estimate both sets of parameters; methods for taking advantage of the hybrid nature of these models to better incorporate domain knowledge; handling missing data; and support for semi-supervised learning.

Acknowledgments. We thank Dan Klein, David Mulford and Ben Taskar for helpful conversations. Y. Shen is supported by an NSF graduate fellowship. This work was also supported by the Department of the Interior/DARPA under contract number NBCHD030010, and NSF grant #IIS-0326249.

References

[1] Paul N. Bennett. Using asymmetric distributions to improve text classifier probability estimates. In Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval, 2003.
[2] Luc P. Devroye and T. J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 5, September 1979.
[3] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, 1998.
[4] T. Jebara and A. Pentland. Maximum conditional likelihood via bound maximization and the CEM algorithm. In Advances in Neural Information Processing Systems 11, 1998.
[5] D. Jurafsky and J. Martin. Speech and Language Processing. Prentice Hall, 2000.
[6] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.
[7] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Computational Learning Theory, 1997.
[8] Ken Lang. Newsweeder: learning to filter netnews. In Proceedings of the Ninth European Conference on Machine Learning, 1997.
[9] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, 1994.
[10] Andrew McCallum and Kamal Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
[11] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In NIPS 14, 2001.
[12] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, 1999.
[13] R. Raina, Y. Shen, A. Y. Ng, and A. McCallum. Classification with hybrid generative/discriminative models. http://www.cs.stanford.edu/~rajatr/nips03.ps, 2003.
[14] V. N. 
Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[15] David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–260, 1992.
[16] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML '01, 2001.
", "award": [], "sourceid": 2405, "authors": [{"given_name": "Rajat", "family_name": "Raina", "institution": null}, {"given_name": "Yirong", "family_name": "Shen", "institution": null}, {"given_name": "Andrew", "family_name": "McCallum", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}