{"title": "Using Vocabulary Knowledge in Bayesian Multinomial Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1385, "page_last": 1392, "abstract": "", "full_text": "Thomas L .. Griffiths & Joshua B. Tenenbaum\n\nDepartment of Psychology\n\nStanford University, Stanford, CA 94305\n{gruffydd,jbt}\u00a9psych. stanford. edu\n\nAbstract\n\nEstimating the parameters of sparse multinomial distributions is\nan important component of many statistical learning tasks. Recent\napproaches have used uncertainty over the vocabulary of symbols\nin a multinomial distribution as a means of accounting for sparsity.\nWe present a Bayesian approach that allows weak prior knowledge,\nin the form of a small set of approximate candidate vocabularies,\nto be used to dramatically improve the resulting estimates. We\ndemonstrate these improvements in applications to text compres(cid:173)\nsion and estimating distributions over words in newsgroup data.\n\n1\n\nIntroduction\n\nSparse multinomial distributions arise in many statistical domains, including nat(cid:173)\nural language processing and graphical models. Consequently, a number of ap(cid:173)\nproaches to parameter estimation for sparse multinomial distributions have been\nsuggested [3]. These approaches tend to be domain-independent: they make little\nuse of prior knowledge about a specific domain. In many domains where multino(cid:173)\nmial distributions are estimated there is often at least weak prior knowledge about'\nthe potential structure of distributions, such as a set of hypotheses about restricted\nvocabularies from which the symbols might be generated. Such knowledge can be\nsolicited from experts or obtained from unlabeled data. 
We present a method for Bayesian parameter estimation in sparse discrete domains that exploits this weak form of prior knowledge to improve estimates over knowledge-free approaches.

1.1 Bayesian parameter estimation for multinomial distributions

Following the presentation in [4], we consider a language Σ containing L distinct symbols. A multinomial distribution is specified by a parameter vector θ = (θ_1, ..., θ_L), where θ_i is the probability of an observation being symbol i. Consequently, we have the constraints that Σ_{i=1}^L θ_i = 1 and θ_i ≥ 0, i = 1, ..., L. The task of multinomial estimation is to take a data set D and produce a vector θ that results in a good approximation to the distribution that produced D. In this case, D consists of N independent observations x_1, ..., x_N drawn from the distribution to be estimated, which can be summarized by the statistics N_i specifying the number of times the ith symbol occurs in the data. D also determines the set Σ^0 of symbols that occur in the data.

Stated in this way, multinomial estimation involves predicting the next observation based on the data. Specifically, we wish to calculate P(x_{N+1} | D). The Bayesian estimate for this probability is given by

P(x_{N+1} | D) = ∫ P(x_{N+1} | θ) P(θ | D) dθ

where P(x_{N+1} | θ) follows from the multinomial distribution corresponding to θ. The posterior probability P(θ | D) can be obtained via Bayes' rule

P(θ | D) ∝ P(D | θ) P(θ) = P(θ) ∏_{i=1}^L θ_i^{N_i}

where P(θ) is the prior probability of a given θ.

Laplace used this method with a uniform prior over θ to give the famous "law of succession" [6]. A more general approach is to assume a Dirichlet prior over θ, which is conjugate to the multinomial distribution and gives

P(x_{N+1} = i | D) = (N_i + α_i) / (N + Σ_{j=1}^L α_j)    (1)

where the α_i are the hyperparameters of the Dirichlet distribution.
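Equation 1, with the common symmetric choice α_i = α for all i, can be sketched in a few lines. This is a minimal illustration; the function name and example counts are ours, not from the paper.

```python
def dirichlet_predictive(counts, alpha=1.0):
    """Predictive probabilities under a symmetric Dirichlet(alpha) prior (Equation 1):
    P(x_{N+1} = i | D) = (N_i + alpha) / (N + L * alpha)."""
    N = sum(counts)
    L = len(counts)
    return [(n_i + alpha) / (N + L * alpha) for n_i in counts]

# alpha = 1 gives Laplace's law; alpha = 0.5 the Jeffreys-Perks law;
# other values of alpha give Lidstone's law.
probs = dirichlet_predictive([3, 1, 0, 0], alpha=1.0)
```

Note that unseen symbols (N_i = 0) still receive nonzero probability, which is what makes these estimators usable for adaptive coding.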
Different estimates are obtained for different choices of the α_i, with most approaches making the simplifying assumption that α_i = α for all i. Laplace's law results from α = 1. The case with α = 0.5 is the Jeffreys-Perks law or Expected Likelihood Estimation [2, 5, 9], while using arbitrary α is Lidstone's law [7].

1.2 Estimating sparse multinomial distributions

Several authors have extended the Bayesian approach to sparse multinomial distributions, in which only a restricted vocabulary of symbols are used, by maintaining uncertainty over these vocabularies. In [10], Ristad uses assumptions about the probability of strings based upon different vocabularies to give the estimate

P_R(x_{N+1} = i | D) =
  (N_i + 1)/(N + L)                            if k^0 = L
  (N_i + 1)(N + 1 - k^0)/(N^2 + N + 2k^0)      if k^0 < L and N_i > 0
  k^0(k^0 + 1)/((L - k^0)(N^2 + N + 2k^0))     otherwise

where k^0 = |Σ^0| is the size of the smallest vocabulary consistent with the data.

A different approach is taken by Friedman and Singer in [4], who point out that Ristad's method is a special case of their framework. Friedman and Singer consider the vocabulary V ⊆ Σ to be a random variable, allowing them to write

P(x_{N+1} = i | D) = Σ_V P(x_{N+1} = i | V, D) P(V | D)    (2)

where P(x_{N+1} = i | V, D) results from a Dirichlet prior over the symbols in V,

P(x_{N+1} = i | V, D) = (N_i + α)/(N + |V|α) if i ∈ V, and 0 otherwise    (3)

and by Bayes' rule and the properties of Dirichlet priors

P(V | D) ∝ P(D | V) P(V) = [Γ(|V|α)/Γ(N + |V|α)] ∏_{i∈Σ^0} [Γ(N_i + α)/Γ(α)] P(V) if Σ^0 ⊆ V, and 0 otherwise    (4)

Friedman and Singer assume a hierarchical prior over V, such that all vocabularies of cardinality k are given equal probability, namely P(S = k)/(L choose k), where P(S = k) is the probability that the size of the vocabulary (|V|) is k. It follows that if i ∈ Σ^0, P(x_{N+1} = i | D) = Σ_k [(N_i + α)/(N + kα)] P(S = k | D). If i ∉ Σ^0, it is necessary to estimate the proportion of V that contain i for a given k.
The simplified result is

P_F(x_{N+1} = i | D) =
  [(N_i + α)/(N + k^0 α)] C(D, L)    if i ∈ Σ^0
  [1/(L - k^0)] (1 - C(D, L))        otherwise    (5)

where C(D, L) = Σ_k [(N + k^0 α)/(N + kα)] P(S = k | D), with P(S = k | D) ∝ m_k and m_k = P(S = k) · [k!/(k - k^0)!] · [Γ(kα)/Γ(N + kα)].

2 Making use of weak prior knowledge

Friedman and Singer assume a prior that gives equal probability to all vocabularies of a given cardinality. However, many real-world tasks provide limited knowledge about the structure of distributions that we can build into our methods for parameter estimation. In the context of sparse multinomial estimation, one instance of such knowledge is the importance of specific vocabularies. For example, in predicting the next character in a file, our predictions could be facilitated by considering the fact that most files either use a vocabulary consisting of ASCII printing characters (such as text files), or all possible characters (such as object files). This kind of structural knowledge about a domain is typically easier to solicit from experts than accurate distributional information, and forms a valuable informational resource.

If we have this kind of prior knowledge, we can restrict our attention to a subset of the 2^L possible vocabularies. In particular, we can specify a set of vocabularies V which we consider as hypotheses for the vocabulary used in producing D, where the elements of V are specified by our knowledge of the domain. This stands as a compromise between Friedman and Singer's approach, in which V consists of all vocabularies, and traditional Bayesian parameter estimation as represented by Equation 1, in which V consists of only the vocabulary containing all words. To do this, we explicitly evaluate the sum given in Equation 2, where the sum over V includes all V ∈ V.
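Evaluating Equation 2 over a small candidate set can be sketched as follows. This is a minimal illustration under a uniform prior over the candidate vocabularies, with hypothetical function names; it omits the vocabulary-uncertainty extension and is not the authors' implementation.

```python
import math

def log_evidence(counts, vocab, alpha=0.5):
    """log P(D | V) for a Dirichlet(alpha) prior over the symbols in V (cf. Equation 4).
    Returns -inf when the data contain symbols outside V."""
    observed = {i for i, n in counts.items() if n > 0}
    if not observed <= vocab:
        return float('-inf')
    N = sum(counts.values())
    k = len(vocab)
    lp = math.lgamma(k * alpha) - math.lgamma(N + k * alpha)
    for i in observed:
        lp += math.lgamma(counts[i] + alpha) - math.lgamma(alpha)
    return lp

def predictive(counts, candidate_vocabs, alpha=0.5):
    """Mix the per-vocabulary Dirichlet predictives (Equation 2), with the sum over V
    restricted to candidate_vocabs and a uniform prior P(V) over that set."""
    logs = [log_evidence(counts, V, alpha) for V in candidate_vocabs]
    m = max(logs)
    weights = [math.exp(lp - m) for lp in logs]  # common terms cancel in normalization
    Z = sum(weights)
    N = sum(counts.values())

    def prob(i):
        p = 0.0
        for w, V in zip(weights, candidate_vocabs):
            if i in V:
                p += (w / Z) * (counts.get(i, 0) + alpha) / (N + len(V) * alpha)
        return p

    return prob
```

With data drawn from a small alphabet, the smaller consistent vocabulary accumulates posterior weight, so symbols inside it get sharper estimates than under the full-alphabet prior alone.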
This sum remains tractable when V is a small subset of the possible vocabularies, and the efficiency is aided by the fact that P(D | V) shares common terms across all V which can cancel in normalization.

The intuition behind this approach is that it attempts to classify the target distribution as using one of a known set of vocabularies, where the vocabularies are obtained either from experts or from unlabeled data. Applying standard Bayesian multinomial estimation within this vocabulary gives enough flexibility for the method to capture a range of distributions, while making use of our weak prior knowledge.

2.1 An illustration: Text compression

Text compression is an effective test of methods for multinomial estimation. Adaptive coding can be performed by specifying a method for calculating a distribution over the probability of the next byte in a file based upon the preceding bytes [1]. The extent to which the file is compressed depends upon the quality of these predictions. To illustrate the utility of including prior knowledge, we follow Ristad in using the Calgary text compression corpus [1].
This corpus consists of 18 files of several different types, each using some subset of 256 possible characters (L = 256). The files include BibTeX source (bib), formatted English text (book*, paper*), geological data (geo), newsgroup articles (news), object files (obj*), a bit-mapped picture (pic), programs in three different languages (prog*) and a terminal transcript (trans). The task was to estimate the distribution from which characters in the file were drawn based upon the first N characters and thus predict the N+1st character. Performance was measured in terms of the length of the resulting file, where the contribution of the N+1st character to the length is -log_2 P(x_{N+1} | D). The results are expressed as the number of bytes required to encode the file relative to the empirical entropy NH(N_i/N) as assessed by Ristad [10]. Results are shown in Table 1.

Table 1: Text compression lengths (in bytes) on the Calgary corpus

file     size     k^0  NH(N_i/N)   P_V   P_F   P_R   P_J   P_L
bib      111261    81     72330     18    89    92   174   269
book1    768771    82    435043    219   105   116   219   352
book2    610856    96    365952     94   115   124   212   329
geo      102400   256     72274    161   162   165   161   165
news     377109    98    244633     89   113   116   201   304
obj1      21504   256     15989    126   127   129   126   129
obj2     246814   256    193144    182   184   190   182   189
paper1    53161    95     33113     71    94   100   156   236
paper2    82199    91     47280     75    94   105   167   259
paper3    46526    84     27132     70    85    92   154   238
paper4    13286    80      7806     58    72    79   126   190
paper5    11954    91      7376     57    79    83   122   181
paper6    38105    93     23861     68    90    95   149   223
pic      513216   159     77636    205   16~   216   205   323
progc     39611    92     25743     68    89    91   150   222
progl     71646    87     42720     74    91    97   164   253
progp     49379    89     30052     71    89    94   155   236
trans     93695    99     64800    169   101   105   169   252
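The adaptive-coding measurement above can be sketched as follows; a hedged illustration that plugs the symmetric Dirichlet rule of Equation 1 into the per-symbol cost -log_2 P(x_{n+1} | preceding symbols). Any of the estimators discussed here could be substituted for the predictive probability; the function name and example data are ours.

```python
import math
from collections import Counter

def adaptive_code_length_bits(data, L=256, alpha=1.0):
    """Adaptive code length of a byte string: each symbol contributes
    -log2 of its predictive probability given the preceding symbols,
    here computed with a symmetric Dirichlet(alpha) rule (Equation 1).
    Returns the total length in bits."""
    counts = Counter()
    bits = 0.0
    for n, x in enumerate(data):
        bits += -math.log2((counts[x] + alpha) / (n + L * alpha))
        counts[x] += 1  # update counts after coding the symbol
    return bits

# A highly repetitive file costs far fewer bits than one with many distinct symbols.
reps = adaptive_code_length_bits(b"aaaaaaaa" * 16)
mixed = adaptive_code_length_bits(bytes(range(128)))
```

Better predictive rules reduce this total, which is exactly what Table 1 compares (relative to the empirical entropy).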
P_V is the restricted vocabulary model outlined above, with V consisting of just two hypotheses: one corresponding to binary files, containing all 256 characters, and one consisting of a 107 character vocabulary representing formatted English. The latter vocabulary was estimated from 5MB of English text, C code, BibTeX source, and newsgroup data from outside the Calgary corpus. P_F is Friedman and Singer's method. For both of these approaches, α was set to 0.5, to allow direct comparison to the Jeffreys-Perks law, P_J. P_R and P_L are Ristad's and Laplace's laws respectively. P_V outperformed the other methods on all files based upon English text, bar book1, and all files using all 256 symbols.¹ The high performance followed from rapid classification of these files as using the appropriate vocabulary in V. When the vocabulary included all symbols P_V performed as P_J, which gave the best predictions for these files.

¹A number of excellent techniques for text compression exist that outperform all of those presented here. We have not included these techniques for comparison because our interest is in using text compression as a means of assessing estimation procedures, rather than as an end in itself. We thus consider only methods for multinomial estimation as our comparison group.

2.2 Maintaining uncertainty in vocabularies

The results for book1 illustrate a weakness of the approach outlined above. The file length for P_V is higher than those for P_F and P_R, despite the fact that the file uses a text-based vocabulary. This file contains two characters that were not encountered in the data used to construct V. These characters caused P_V to default to the unrestricted vocabulary of all 256 characters. From that point P_V corresponded to P_J, which gave poor results on this file.

This behavior results from the assumption that the candidate vocabularies in V are completely accurate.
Since in many cases the knowledge that informs the vocabularies in V may be imperfect, it is desirable to allow for uncertainty in vocabularies. This uncertainty will be reflected in the fact that symbols outside V are expected to occur with a vocabulary-specific probability ε_V,

P(x_{N+1} = i | V, D) = (1 - (L - |V|)ε_V) (N_i + α)/(N_V + |V|α) if i ∈ V, and ε_V otherwise

where N_V = Σ_{i∈V} N_i. It follows that

P(D | V) = (1 - (L - |V|)ε_V)^{N_V} ε_V^{N - N_V} [Γ(|V|α)/Γ(N_V + |V|α)] ∏_{i∈Σ^0∩V} [Γ(N_i + α)/Γ(α)]

which replaces Equations 3-4 in specifying P_V.

When V is determined by the judgments of domain experts, ε_V is the probability that an unmentioned word actually belongs to a particular vocabulary. While it may not be the most efficient use of such data, the V ∈ V can also be estimated from some form of unlabeled data. In this case, Friedman and Singer's approach provides a means of setting ε_V. Friedman and Singer explicitly calculate the probability that an unseen word is in V based upon a dataset: from the second condition of Equation 5, we find that we should set ε_V = [1/(L - |V|)](1 - C(D, L)). We use this method below.

3 Bayesian parameter estimation in natural language

Statistical natural language processing often uses sparse multinomial distributions over large vocabularies of words. In different contexts, different vocabularies will be used. By specifying a basis set of vocabularies, we can perform parameter estimation by classifying distributions according to their vocabulary. This idea was examined using data from 20 different Usenet newsgroups. This dataset is commonly used in testing text classification algorithms (e.g. [8]). Ten newsgroups were used to estimate a set of vocabularies V with corresponding ε_V.
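The ε_V-augmented predictive rule can be written down directly; a minimal sketch with hypothetical argument names (the vocabulary is passed as a set, and eps plays the role of ε_V).

```python
def predictive_with_uncertainty(i, counts, vocab, eps, L, alpha=0.5):
    """P(x_{N+1} = i | V, D) when each of the L - |V| symbols outside V
    may still occur with probability eps (the epsilon_V above)."""
    N_V = sum(n for s, n in counts.items() if s in vocab)  # in-vocabulary count
    k = len(vocab)
    if i in vocab:
        # in-vocabulary mass, scaled so that the full alphabet sums to one
        return (1 - (L - k) * eps) * (counts.get(i, 0) + alpha) / (N_V + k * alpha)
    return eps

# The in-vocabulary probabilities sum to 1 - (L - |V|) * eps, and the
# L - |V| out-of-vocabulary symbols contribute eps each, so the total is one.
```

Unlike the hard-vocabulary rule of Equation 3, a single out-of-vocabulary symbol no longer forces the posterior weight of a candidate vocabulary to zero.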
These vocabularies were used in estimating multinomial distributions on these newsgroups and ten others.

The dataset was 20news-18827, which consists of the 20newsgroups data with headers and duplicates removed, and was preprocessed to remove all punctuation, capitalization, and distinct numbers. The articles in each of the 20 newsgroups were then divided into three sets. The first 500 articles from ten newsgroups were used to estimate the candidate vocabularies V and uncertainty parameters ε_V. Articles 501-700 for all 20 newsgroups were used as training data for multinomial estimation. Articles 701-900 for all 20 newsgroups were used as testing data. Following [8], a dictionary was built up by running over the 13,000 articles resulting from this division, and all words that occurred only once were mapped to an "unknown" word. The resulting dictionary contained L = 54309 words.

As before, the restricted vocabulary method (P_V), Friedman and Singer's method (P_F), and Ristad's (P_R), Laplace's (P_L) and the Jeffreys-Perks (P_J) laws were applied to the task. Both P_V and P_F used α = 0.5 to facilitate comparison with P_J. V featured one vocabulary that contained all words in the dictionary, and ten vocabularies each corresponding to the words used in the first 500 articles of one of the newsgroups designated for this purpose. ε_V was estimated as outlined above. Testing for each newsgroup consisted of taking words from the 200 articles assigned for training purposes, estimating a distribution using each method, and then computing the cross-entropy between that distribution and an empirical estimate of the true distribution. The cross-entropy is H(Q; P) = -Σ_i Q_i log_2 P_i, where Q is the true distribution and P is the distribution produced by the estimation method. Q was given by the maximum likelihood estimate formed from the word frequencies in all 200 articles assigned for testing purposes. The testing procedure was conducted with just 100 words, and then in increments of 450 up to a total of 10000 words. Long-run performance was examined on talk.politics.mideast and talk.politics.misc, each trained with 50000 words.

[Figure 1: twenty panels, one per newsgroup, each plotting cross-entropy against the number of words for the five estimation methods]

Figure 1: Cross-entropy of predictions on newsgroup data as a function of the logarithm of the number of words. The abscissa is at the empirical entropy of the test distribution. The top ten panels (talk.politics.mideast and those to its right) are for the newsgroups with unknown vocabularies. The bottom ten are for those that contributed vocabularies to V, trained and tested on novel data. P_L and P_J are both indicated with dotted lines, but P_J always performs better than P_L. The box on talk.politics.mideast indicates the point at which P_V defaults to the full vocabulary, as the number of unseen words makes this vocabulary more likely. At this point, the line for P_V joins the line for P_J, since both methods give the same estimates of the distribution.

The results are shown in Figure 1. As expected, P_V consistently outperformed the other methods on the newsgroups that contributed to V. However, performance on novel newsgroups was also greatly improved.
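The evaluation measure can be sketched as follows; the estimator interface here is a hypothetical stand-in for any of the predictive rules above.

```python
import math
from collections import Counter

def cross_entropy_bits(test_tokens, predict):
    """H(Q; P) = -sum_i Q_i log2 P_i, where Q is the empirical (maximum likelihood)
    distribution of the test tokens and predict(w) gives the model probability P_w.
    Assumes predict(w) > 0 for every test word, as the smoothed estimators guarantee."""
    Q = Counter(test_tokens)
    total = sum(Q.values())
    return -sum((n / total) * math.log2(predict(w)) for w, n in Q.items())

# Cross-entropy is bounded below by the empirical entropy of Q, attained when P = Q,
# which is why the figure marks the empirical entropy on the abscissa.
```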
As can be seen in Figure 2, the novel newsgroups were classified to appropriate vocabularies: for example, talk.religion.misc had the highest posterior probability for alt.atheism and soc.religion.christian, while rec.autos had highest posterior probability for rec.motorcycles. The proportion of word types occurring in the test data but not the vocabulary to which the novel newsgroups were classified ranged between 30.5% and 66.2%, with a mean of 42.2%.

[Figure 2: lines tracing, for each of the ten test newsgroups, which vocabulary in V has maximum posterior probability as the number of words grows from 0 to 10000]

Figure 2: Classification of newsgroup vocabularies. The lines illustrate the vocabulary which had maximum posterior probability for each of the ten test newsgroups after exposure to differing numbers of words. The vocabularies in V are listed along the left hand side of the axis, and the lines are identified with newsgroups by the labels on the right hand side. Lines are offset to facilitate identification.
This illustrates that even approximate knowledge can facilitate predictions: the basis set of vocabularies allowed the high frequency words in the data to be modelled effectively, without excess mass being attributed to the low frequency novel word tokens.

Long-run performance on talk.politics.mideast illustrates the same defaulting behavior that was shown for text compression: when the data become more probable under the vocabulary containing all words than under a restricted vocabulary, the method defaults to the Jeffreys-Perks law. This guarantees that the method will tend to perform no worse than P_J when unseen words are encountered in sufficient proportions. This is desirable, since P_J gives good estimates once N becomes large.

4 Discussion

Bayesian approaches to parameter estimation for sparse multinomial distributions have employed the notion of a restricted vocabulary from which symbols are produced. In many domains where such distributions are estimated, there is often at least limited knowledge about the structure of these vocabularies. By considering just the vocabularies suggested by such knowledge, together with some uncertainty concerning those vocabularies, we can achieve very good estimates of distributions in these domains. We have presented a Bayesian approach that employs limited prior knowledge, and shown that it outperforms a range of approaches to multinomial estimation for both text compression and a task involving natural language.

While our applications in this paper estimated approximate vocabularies from data, the real promise of this approach lies with domain knowledge solicited from experts. Experts are typically better at providing qualitative structural information than quantitative distributional information, and our approach provides a way of using this information in parameter estimation.
For example, in the context of parameter estimation for graphical models to be used in medical diagnosis, distinguishing classes of symptoms might be informative in determining the parameters governing their relationship to diseases. This form of knowledge is naturally translated into a set of vocabularies to be considered for each such distribution. More complex applications to natural language may also be possible, such as using syntactic information in estimating probabilities for n-gram models. The approach we have presented in this paper provides a simple way to allow this kind of limited domain knowledge to be useful in Bayesian parameter estimation.

References

[1] T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice Hall, 1990.

[2] G. E. P. Box and G. C. Tiao. Bayesian Inference in Statistical Analysis. Addison-Wesley, 1973.

[3] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, 1998.

[4] N. Friedman and Y. Singer. Efficient Bayesian parameter estimation in large discrete domains. In Neural Information Processing Systems, 1998.

[5] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society A, 186:453-461, 1946.

[6] P.-S. Laplace. Philosophical Essay on Probabilities. Springer-Verlag, 1995. Originally published 1825.

[7] G. Lidstone. Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities. Transactions of the Faculty of Actuaries, 8:182-192, 1920.

[8] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39:103-134, 2000.

[9] W. Perks.
Some observations on inverse probability, including a new indifference rule. Journal of the Institute of Actuaries, 73:285-312, 1947.

[10] E. S. Ristad. A natural law of succession. Technical Report CS-TR-895-95, Department of Computer Science, Princeton University, 1995.
", "award": [], "sourceid": 2063, "authors": [{"given_name": "Thomas", "family_name": "Griffiths", "institution": null}, {"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}