{"title": "Bayesian Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 435, "page_last": 442, "abstract": "", "full_text": "Bayesian Sets\n\nZoubin Ghahramani\u2217 and Katherine A. Heller\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\nLondon WC1N 3AR, U.K.\n\n{zoubin,heller}@gatsby.ucl.ac.uk\n\nAbstract\n\nInspired by \u201cGoogle\u2122 Sets\u201d, we consider the problem of retrieving items\nfrom a concept or cluster, given a query consisting of a few items from\nthat cluster. We formulate this as a Bayesian inference problem and de-\nscribe a very simple algorithm for solving it. Our algorithm uses a model-\nbased concept of a cluster and ranks items using a score which evaluates\nthe marginal probability that each item belongs to a cluster containing\nthe query items. For exponential family models with conjugate priors\nthis marginal probability is a simple function of suf\ufb01cient statistics. We\nfocus on sparse binary data and show that our score can be evaluated ex-\nactly using a single sparse matrix multiplication, making it possible to\napply our algorithm to very large datasets. We evaluate our algorithm on\nthree datasets: retrieving movies from EachMovie, \ufb01nding completions\nof author sets from the NIPS dataset, and \ufb01nding completions of sets of\nwords appearing in the Grolier encyclopedia. We compare to Google\u2122\nSets and show that Bayesian Sets gives very reasonable set completions.\n\n1 Introduction\n\nWhat do Jesus and Darwin have in common? 
Other than being associated with two\ndifferent views on the origin of man, they also have colleges at Cambridge University\nnamed after them. If these two names are entered as a query into Google\u2122 Sets\n(http://labs.google.com/sets) it returns a list of other colleges at Cambridge.\n\nGoogle\u2122 Sets is a remarkably useful tool which encapsulates a very practical and interesting\nproblem in machine learning and information retrieval.1 Consider a universe of items\nD. Depending on the application, the set D may consist of web pages, movies, people,\nwords, proteins, images, or any other object we may wish to form queries on. The user\nprovides a query in the form of a very small subset of items Dc \u2282 D. The assumption\nis that the elements in Dc are examples of some concept / class / cluster in the data. The\nalgorithm then has to provide a completion to the set Dc, that is, some set D\u2032c \u2282 D which\npresumably includes all the elements in Dc and other elements in D which are also in this\nconcept / class / cluster.2\n\n\u2217ZG is also at CALD, Carnegie Mellon University, Pittsburgh PA 15213.\n\n1 Google\u2122 Sets is a large-scale clustering algorithm that uses many millions of data instances\nextracted from web data (Simon Tong, personal communication). We are unable to describe any\ndetails of how the algorithm works due to its proprietary nature.\n\n2 From here on, we will use the term \u201ccluster\u201d to refer to the target concept.\n\nWe can view this problem from several perspectives. First, the query can be interpreted\nas elements of some unknown cluster, and the output of the algorithm is the completion\nof that cluster. Whereas most clustering algorithms are completely unsupervised, here the\nquery provides supervised hints or constraints as to the membership of a particular cluster.\nWe call this view clustering on demand, since it involves forming a cluster once some\nelements of that cluster have been revealed.\n
An important advantage of this approach over\ntraditional clustering is that the few elements in the query can give useful information as\nto the features which are relevant for forming the cluster. For example, the query \u201cBush\u201d,\n\u201cNixon\u201d, \u201cReagan\u201d suggests that the features republican and US President are relevant to\nthe cluster, while the query \u201cBush\u201d, \u201cPutin\u201d, \u201cBlair\u201d suggests that current and world leader\nare relevant. Given the huge number of features in many real world data sets, such hints as\nto feature relevance can produce much more sensible clusters.\n\nSecond, we can think of the goal of the algorithm to be to solve a particular information re-\ntrieval problem [2, 3, 4]. As in other retrieval problems, the output should be relevant to the\nquery, and it makes sense to limit the output to the top few items ranked by relevance to the\nquery. In our experiments, we take this approach and report items ranked by relevance. Our\nrelevance criterion is closely related to a Bayesian framework for understanding patterns of\ngeneralization in human cognition [5].\n\n2 Bayesian Sets\nLet D be a data set of items, and x \u2208 D be an item from this set. Assume the user provides\na query set Dc which is a small subset of D. Our goal is to rank the elements of D by how\nwell they would \u201c\ufb01t into\u201d a set which includes Dc. Intuitively, the task is clear: if the set\nD is the set of all movies, and the query set consists of two animated Disney movies, we\nexpect other animated Disney movies to be ranked highly.\nWe use a model-based probabilistic criterion to measure how well items \ufb01t into Dc. Having\nobserved Dc as belonging to some concept, we want to know how probable it is that x also\nbelongs with Dc. This is measured by p(x|Dc). Ranking items simply by this probability\nis not sensible since some items may be more probable than others, regardless of Dc. 
For\nexample, under most sensible models, the probability of a string decreases with the number\nof characters, the probability of an image decreases with the number of pixels, and the\nprobability of any continuous variable decreases with the precision to which it is measured.\nWe want to remove these effects, so we compute the ratio:\n\n$$\\mathrm{score}(x) = \\frac{p(x \\mid D_c)}{p(x)} \\tag{1}$$\n\nwhere the denominator is the prior probability of x and under most sensible models will\nscale exactly correctly with number of pixels, characters, discretization level, etc. Using\nBayes rule, this score can be re-written as:\n\n$$\\mathrm{score}(x) = \\frac{p(x, D_c)}{p(x)\\, p(D_c)} \\tag{2}$$\n\nwhich can be interpreted as the ratio of the joint probability of observing x and Dc, to the\nprobability of independently observing x and Dc. Intuitively, this ratio compares the probability\nthat x and Dc were generated by the same model with the same, though unknown,\nparameters \u03b8, to the probability that x and Dc came from models with different parameters\n\u03b8 and \u03b8\u2032 (see figure 1). Finally, up to a multiplicative constant independent of x, the score\ncan be written as: score(x) = p(Dc|x), which is the probability of observing the query set\ngiven x (i.e. the likelihood of x).\n\nFigure 1: Our Bayesian score compares the hypotheses that the data was generated by each of the\nabove graphical models.\n\nFrom the above discussion, it is still not clear how one would compute quantities such\nas p(x|Dc) and p(x). A natural model-based way of defining a cluster is to assume that\nthe data points in the cluster all come independently and identically distributed from some\nsimple parameterized statistical model. Assume that the parameterized model is p(x|\u03b8)\nwhere \u03b8 are the parameters.\n
If the data points in Dc all belong to one cluster, then under\nthis definition they were generated from the same setting of the parameters; however, that\nsetting is unknown, so we need to average over possible parameter values weighted by\nsome prior density on parameter values, p(\u03b8). Using these considerations and the basic\nrules of probability we arrive at:\n\n$$p(x) = \\int p(x \\mid \\theta)\\, p(\\theta)\\, d\\theta \\tag{3}$$\n\n$$p(D_c) = \\int \\prod_{x_i \\in D_c} p(x_i \\mid \\theta)\\, p(\\theta)\\, d\\theta \\tag{4}$$\n\n$$p(x \\mid D_c) = \\int p(x \\mid \\theta)\\, p(\\theta \\mid D_c)\\, d\\theta \\tag{5}$$\n\n$$p(\\theta \\mid D_c) = \\frac{p(D_c \\mid \\theta)\\, p(\\theta)}{p(D_c)} \\tag{6}$$\n\nWe are now fully equipped to describe the \u201cBayesian Sets\u201d algorithm:\n\nBayesian Sets Algorithm\n\nbackground: a set of items D, a probabilistic model p(x|\u03b8) where\n    x \u2208 D, a prior on the model parameters p(\u03b8)\ninput: a query Dc = {xi} \u2282 D\nfor all x \u2208 D do\n    compute score(x) = p(x|Dc) / p(x)\nend for\noutput: return elements of D sorted by decreasing score\n\nWe mention two properties of this algorithm to assuage two common worries with Bayesian\nmethods, namely tractability and sensitivity to priors:\n\n1. For the simple models we will consider, the integrals (3)-(5) are analytical. In fact,\nfor the model we consider in section 3 computing all the scores can be reduced to\na single sparse matrix multiplication.\n\n2. Although it clearly makes sense to put some thought into choosing sensible models\np(x|\u03b8) and priors p(\u03b8), we will show in section 5 that even with very simple models\nand almost no tuning of the prior one can get very competitive retrieval results. In\npractice, we use a simple empirical heuristic which sets the prior to be vague but\ncentered on the mean of the data in D.\n\n3 Sparse Binary Data\n\nWe now derive in more detail the application of the Bayesian Sets algorithm to sparse\nbinary data.\n
This type of data is a very natural representation for the large datasets we used\nin our evaluations (section 5). Applications of Bayesian Sets to other forms of data (real-valued,\ndiscrete, ordinal, strings) are also possible, and especially practical if the statistical\nmodel is a member of the exponential family (section 4).\n\nAssume each item xi \u2208 Dc is a binary vector xi = (xi1, . . . , xiJ) where xij \u2208 {0, 1}, and\nthat each element of xi has an independent Bernoulli distribution:\n\n$$p(x_i \\mid \\theta) = \\prod_{j=1}^{J} \\theta_j^{x_{ij}} (1 - \\theta_j)^{1 - x_{ij}} \\tag{7}$$\n\nThe conjugate prior for the parameters of a Bernoulli distribution is the Beta distribution:\n\n$$p(\\theta \\mid \\alpha, \\beta) = \\prod_{j=1}^{J} \\frac{\\Gamma(\\alpha_j + \\beta_j)}{\\Gamma(\\alpha_j)\\,\\Gamma(\\beta_j)}\\, \\theta_j^{\\alpha_j - 1} (1 - \\theta_j)^{\\beta_j - 1} \\tag{8}$$\n\nwhere \u03b1 and \u03b2 are hyperparameters, and the Gamma function is a generalization of the\nfactorial function. For a query Dc = {xi} consisting of N vectors it is easy to show that:\n\n$$p(D_c \\mid \\alpha, \\beta) = \\prod_{j} \\frac{\\Gamma(\\alpha_j + \\beta_j)}{\\Gamma(\\alpha_j)\\,\\Gamma(\\beta_j)}\\, \\frac{\\Gamma(\\tilde{\\alpha}_j)\\,\\Gamma(\\tilde{\\beta}_j)}{\\Gamma(\\tilde{\\alpha}_j + \\tilde{\\beta}_j)} \\tag{9}$$\n\nwhere $\\tilde{\\alpha}_j = \\alpha_j + \\sum_{i=1}^{N} x_{ij}$ and $\\tilde{\\beta}_j = \\beta_j + N - \\sum_{i=1}^{N} x_{ij}$. For an item x = (x\u00b71 . . . x\u00b7J) the\nscore, written with the hyperparameters explicit, can be computed as follows:\n\n$$\\mathrm{score}(x) = \\frac{p(x \\mid D_c, \\alpha, \\beta)}{p(x \\mid \\alpha, \\beta)} = \\prod_{j} \\frac{\\dfrac{\\Gamma(\\alpha_j + \\beta_j + N)}{\\Gamma(\\alpha_j + \\beta_j + N + 1)}\\, \\dfrac{\\Gamma(\\tilde{\\alpha}_j + x_{\\cdot j})\\,\\Gamma(\\tilde{\\beta}_j + 1 - x_{\\cdot j})}{\\Gamma(\\tilde{\\alpha}_j)\\,\\Gamma(\\tilde{\\beta}_j)}}{\\dfrac{\\Gamma(\\alpha_j + \\beta_j)}{\\Gamma(\\alpha_j + \\beta_j + 1)}\\, \\dfrac{\\Gamma(\\alpha_j + x_{\\cdot j})\\,\\Gamma(\\beta_j + 1 - x_{\\cdot j})}{\\Gamma(\\alpha_j)\\,\\Gamma(\\beta_j)}} \\tag{10}$$\n\nThis daunting expression can be dramatically simplified.\n
We use the fact that \u0393(x) =\n(x \u2212 1) \u0393(x \u2212 1) for x > 1. For each j we can consider the two cases x\u00b7j = 0 and x\u00b7j = 1\nseparately. For x\u00b7j = 1 we have a contribution\n$\\frac{\\alpha_j + \\beta_j}{\\alpha_j + \\beta_j + N}\\, \\frac{\\tilde{\\alpha}_j}{\\alpha_j}$. For x\u00b7j = 0 we have a\ncontribution $\\frac{\\alpha_j + \\beta_j}{\\alpha_j + \\beta_j + N}\\, \\frac{\\tilde{\\beta}_j}{\\beta_j}$. Putting these together we get:\n\n$$\\mathrm{score}(x) = \\prod_{j} \\frac{\\alpha_j + \\beta_j}{\\alpha_j + \\beta_j + N} \\left( \\frac{\\tilde{\\alpha}_j}{\\alpha_j} \\right)^{x_{\\cdot j}} \\left( \\frac{\\tilde{\\beta}_j}{\\beta_j} \\right)^{1 - x_{\\cdot j}} \\tag{11}$$\n\nThe log of the score is linear in x:\n\n$$\\log \\mathrm{score}(x) = c + \\sum_{j} q_j x_{\\cdot j} \\tag{12}$$\n\nwhere\n\n$$c = \\sum_{j} \\left[ \\log(\\alpha_j + \\beta_j) - \\log(\\alpha_j + \\beta_j + N) + \\log \\tilde{\\beta}_j - \\log \\beta_j \\right] \\tag{13}$$\n\nand\n\n$$q_j = \\log \\tilde{\\alpha}_j - \\log \\alpha_j - \\log \\tilde{\\beta}_j + \\log \\beta_j \\tag{14}$$\n\nIf we put the entire data set D into one large matrix X with J columns, we can compute\nthe vector s of log scores for all points using a single matrix-vector multiplication:\n\n$$s = c + Xq \\tag{15}$$\n\nFor sparse data sets this linear operation can be implemented very efficiently. Each query\nDc corresponds to computing the vector q and scalar c. This can also be done efficiently if\nthe query is also sparse, since most elements of q will equal log \u03b2j \u2212 log(\u03b2j + N) which\nis independent of the query.\n\n4 Exponential Families\n\nWe generalize the above result to models in the exponential family. The distribution for\nsuch models can be written in the form p(x|\u03b8) = f(x)g(\u03b8) exp{\u03b8\u22a4u(x)}, where u(x) is a\nK-dimensional vector of sufficient statistics, \u03b8 are the natural parameters, and f and g are\nnon-negative functions.\n
The conjugate prior is $p(\\theta \\mid \\eta, \\nu) = h(\\eta, \\nu)\\, g(\\theta)^{\\eta} \\exp\\{\\theta^{\\top} \\nu\\}$, where\n\u03b7 and \u03bd are hyperparameters, and h normalizes the distribution.\n\nGiven a query Dc = {xi} with N items, and a candidate x, it is not hard to show that the\nscore for the candidate is:\n\n$$\\mathrm{score}(x) = \\frac{h(\\eta + 1, \\nu + u(x))\\; h(\\eta + N, \\nu + \\sum_i u(x_i))}{h(\\eta, \\nu)\\; h(\\eta + N + 1, \\nu + u(x) + \\sum_i u(x_i))} \\tag{16}$$\n\nThis expression helps us understand when the score can be computed efficiently. First of\nall, the score only depends on the size of the query (N), the sufficient statistics computed\nfrom each candidate, and from the whole query. It therefore makes sense to precompute U,\na matrix of sufficient statistics corresponding to X. Second, whether the score is a linear\noperation on U depends on whether log h is linear in the second argument. This is the case\nfor the Bernoulli distribution, but not for all exponential family distributions. However,\nfor many distributions, such as diagonal covariance Gaussians, even though the score is\nnonlinear in U, it can be computed by applying the nonlinearity elementwise to U. For\nsparse matrices, the score can therefore still be computed in time linear in the number of\nnon-zero elements of U.\n\n5 Results\n\nWe ran our Bayesian Sets algorithm on three different datasets: the Groliers Encyclopedia\ndataset, consisting of the text of the articles in the Encyclopedia, the EachMovie\ndataset, consisting of movie ratings by users of the EachMovie service, and the NIPS authors\ndataset, consisting of the text of articles published in NIPS volumes 0-12 (spanning\nthe 1987-1999 conferences). The Groliers dataset is 30991 articles by 15276 words, where\nthe entries are the number of times each word appears in each document.\n
We preprocess\n(binarize) the data by column normalizing each word, and then thresholding so that a (ar-\nticle,word) entry is 1 if that word has a frequency of more than twice the article mean.\nWe do essentially no tuning of the hyperparameters. We use broad empirical priors, where\n\u03b1 = c\u00d7m, \u03b2 = c \u00d7 (1\u2212m) where m is a mean vector over all articles, and c = 2. The\nanalogous priors are used for both other datasets.\n\nThe EachMovie dataset was preprocessed, \ufb01rst by removing movies rated by less than 15\npeople, and people who rated less than 200 movies. Then the dataset was binarized so that a\n(person, movie) entry had value 1 if the person gave the movie a rating above 3 stars (from\na possible 0-5 stars). The data was then column normalized to account for overall movie\npopularity. The size of the dataset after preprocessing was 1813 people by 1532 movies.\n\n\fFinally the NIPS author dataset (13649 words by 2037 authors), was preprocessed very\nsimilarly to the Grolier dataset. It was binarized by column normalizing each author, and\nthen thresholding so that a (word,author) entry is 1 if the author uses that word more fre-\nquently than twice the word mean across all authors.\n\nThe results of our experiments, and comparisons with Google Sets for word and movie\nqueries are given in tables 2 and 3. Unfortunately, NIPS authors have not yet achieved the\nkind of popularity on the web necessary for Google Sets to work effectively. Instead we\nlist the top words associated with the cluster of authors given by our algorithm (table 4).\n\nThe running times of our algorithm on all three datasets are given in table 1. All experi-\nments were run in Matlab on a 2GHz Pentium 4, Toshiba laptop. 
Our algorithm is very fast\nboth at pre-processing the data, and answering queries (about 1 sec per query).\n\n | GROLIERS | EACHMOVIE | NIPS\nSIZE | 30991 \u00d7 15276 | 1813 \u00d7 1532 | 13649 \u00d7 2037\nNON-ZERO ELEMENTS | 2,363,514 | 517,709 | 933,295\nPREPROCESS TIME | 6.1 s | 0.56 s | 3.22 s\nQUERY TIME | 1.1 s | 0.34 s | 0.47 s\n\nTable 1: For each dataset we give the size of that dataset along with the time taken to do the\n(one-time) preprocessing and the time taken to make a query (both in seconds).\n\nQUERY: WARRIOR, SOLDIER\nGOOGLE SETS | BAYES SETS\nWARRIOR | SOLDIER\nSOLDIER | WARRIOR\nSPY | MERCENARY\nENGINEER | CAVALRY\nMEDIC | BRIGADE\nSNIPER | COMMANDING\nDEMOMAN | SAMURAI\nPYRO | BRIGADIER\nSCOUT | INFANTRY\nPYROMANIAC | COLONEL\nHWGUY | SHOGUNATE\n\nQUERY: ANIMAL\nGOOGLE SETS | BAYES SETS\nANIMAL | ANIMAL\nPLANT | ANIMALS\nFREE | PLANT\nLEGAL | HUMANS\nFUNGAL | FOOD\nHUMAN | SPECIES\nHYSTERIA | MAMMALS\nVEGETABLE | AGO\nMINERAL | ORGANISMS\nINDETERMINATE | VEGETATION\nFOZZIE BEAR | PLANTS\n\nQUERY: FISH, WATER, CORAL\nGOOGLE SETS | BAYES SETS\nFISH | WATER\nWATER | FISH\nCORAL | SURFACE\nAGRICULTURE | SPECIES\nFOREST | WATERS\nRICE | MARINE\nSILK ROAD | FOOD\nRELIGION | TEMPERATURE\nHISTORY POLITICS | OCEAN\nDESERT | SHALLOW\nARTS | FT\n\nTable 2: Clusters of words found by Google Sets and Bayesian Sets based on the given queries.\nThe top few are shown for each query and each algorithm. Bayesian Sets was run using Grolier\nEncyclopedia data.\n\nIt is very difficult to objectively evaluate our results since there is no ground truth for this\ntask.\n
One person\u2019s idea of a good query cluster may differ drastically from another person\u2019s.\nWe chose to compare our algorithm to Google Sets since it was our main inspiration and it\nis currently the most public and commonly used algorithm for performing this task.\n\nSince we do not have access to the Google Sets algorithm it was impossible for us to run\ntheir method on our datasets. Moreover, Google Sets relies on vast amounts of web data,\nwhich we do not have. Despite those two important caveats, Google Sets clearly \u201cknows\u201d\na lot about movies3 and words, and the comparison to Bayesian Sets is informative.\nWe found that Google Sets performed very well when the query consisted of items which\ncan be found listed on the web (e.g. Cambridge colleges). On the other hand, for more\nabstract concepts (e.g. \u201csoldier\u201d and \u201cwarrior\u201d, see Table 2) our algorithm returned more\nsensible completions.\n\nWhile we believe that most of our results are self-explanatory, there are a few details that we\nwould like to elaborate on. The top query in table 3 consists of two classic romantic movies,\n\n3In fact, one of the example queries on the Google Sets website is a query of movie titles.\n\n\fTable 3: Clusters of movies found by Google Sets and Bayesian Sets based on the given queries. The\ntop 10 are shown for each query and each algorithm. Bayesian Sets was run using the EachMovie\ndataset.\n\nand while most of the movies returned by Bayesian Sets are also classic romances, hardly\nany of the movies returned by Google Sets are romances, and it would be dif\ufb01cult to call\n\u201cErnest Saves Christmas\u201d either a romance or a classic. Both \u201cCutthroat Island\u201d and \u201cLast\nAction Hero\u201d are action movie \ufb02ops, as are many of the movies given by our algorithm\nfor that query. 
All the Bayes Sets movies associated with the query \u201cMary Poppins\u201d and\n\u201cToy Story\u201d are children\u2019s movies, while 5 of Google Sets\u2019 movies are not. \u201cBut I\u2019m\na Cheerleader\u201d, while appearing to be a children\u2019s movie, is actually an R rated movie\ninvolving lesbian and gay teens.\n\nTable 4: NIPS authors found by Bayesian Sets based on the given queries. The top 10 are shown for\neach query along with the top 10 words associated with that cluster of authors. Bayesian Sets was\nrun using NIPS data from vol 0-12 (1987-1999 conferences).\n\nThe NIPS author dataset is rather small, and co-authors of NIPS papers appear very similar\nto each other. Therefore, many of the authors found by our algorithm are co-authors of a\nNIPS paper with one or more of the query authors. An example where this is not the case is\nWim Wiegerinck, who we do not believe ever published a NIPS paper with Lawrence Saul\nor Tommi Jaakkola, though he did have a NIPS paper on variational learning and 
graphical\nmodels.\n\nTable 3\n\nQUERY: GONE WITH THE WIND, CASABLANCA\nGOOGLE SETS | BAYES SETS\nCASABLANCA (1942) | GONE WITH THE WIND (1939)\nGONE WITH THE WIND (1939) | CASABLANCA (1942)\nERNEST SAVES CHRISTMAS (1988) | THE AFRICAN QUEEN (1951)\nCITIZEN KANE (1941) | THE PHILADELPHIA STORY (1940)\nPET DETECTIVE (1994) | MY FAIR LADY (1964)\nVACATION (1983) | THE ADVENTURES OF ROBIN HOOD (1938)\nWIZARD OF OZ (1939) | THE MALTESE FALCON (1941)\nTHE GODFATHER (1972) | REBECCA (1940)\nLAWRENCE OF ARABIA (1962) | SINGING IN THE RAIN (1952)\nON THE WATERFRONT (1954) | IT HAPPENED ONE NIGHT (1934)\n\nQUERY: MARY POPPINS, TOY STORY\nGOOGLE SETS | BAYES SETS\nTOY STORY | MARY POPPINS\nMARY POPPINS | TOY STORY\nTOY STORY 2 | WINNIE THE POOH\nMOULIN ROUGE | CINDERELLA\nTHE FAST AND THE FURIOUS | THE LOVE BUG\nPRESQUE RIEN | BEDKNOBS AND BROOMSTICKS\nSPACED | DAVY CROCKETT\nBUT I\u2019M A CHEERLEADER | THE PARENT TRAP\nMULAN | DUMBO\nWHO FRAMED ROGER RABBIT | THE SOUND OF MUSIC\n\nQUERY: CUTTHROAT ISLAND, LAST ACTION HERO\nGOOGLE SETS | BAYES SETS\nLAST ACTION HERO | CUTTHROAT ISLAND\nCUTTHROAT ISLAND | LAST ACTION HERO\nGIRL | KULL THE CONQUEROR\nEND OF DAYS | VAMPIRE IN BROOKLYN\nHOOK | SPRUNG\nTHE COLOR OF NIGHT | JUDGE DREDD\nCONEHEADS | WILD BILL\nADDAMS FAMILY I | HIGHLANDER III\nADDAMS FAMILY II | VILLAGE OF THE DAMNED\nSINGLES | FAIR GAME\n\nTable 4\n\nQUERY: A. SMOLA, B. SCHOLKOPF\nTOP MEMBERS | TOP WORDS\nA. SMOLA | VECTOR\nB. SCHOLKOPF | SUPPORT\nS. MIKA | KERNEL\nG. RATSCH | PAGES\nR. WILLIAMSON | MACHINES\nK. MULLER | QUADRATIC\nJ. WESTON | SOLVE\nJ. SHAWE-TAYLOR | REGULARIZATION\nV. VAPNIK | MINIMIZING\nT. ONODA | MIN\n\nQUERY: L. SAUL, T. JAAKKOLA\nTOP MEMBERS | TOP WORDS\nL. SAUL | LOG\nT. JAAKKOLA | LIKELIHOOD\nM. RAHIM | MODELS\nM. JORDAN | MIXTURE\nN. LAWRENCE | CONDITIONAL\nT. JEBARA | PROBABILISTIC\nW. WIEGERINCK | EXPECTATION\nM. MEILA | PARAMETERS\nS. IKEDA | DISTRIBUTION\nD. HAUSSLER | ESTIMATION\n\nQUERY: A. NG, R. SUTTON\nTOP MEMBERS | TOP WORDS\nR. SUTTON | DECISION\nA. NG | REINFORCEMENT\nY. MANSOUR | ACTIONS\nB. RAVINDRAN | REWARDS\nD. KOLLER | REWARD\nD. PRECUP | START\nC. WATKINS | RETURN\nR. MOLL | RECEIVED\nT. PERKINS | MDP\nD. MCALLESTER | SELECTS\n\nAs part of the evaluation of our algorithm, we showed 30 na\u00efve subjects the unlabeled\nresults of Bayesian Sets and Google Sets for the queries shown from the EachMovie and\nGroliers Encyclopedia datasets, and asked them to choose which they preferred.\n
The results\nof this study are given in table 5.\n\nQUERY | % BAYES SETS | P-VALUE\nWARRIOR | 96.7 | < 0.0001\nANIMAL | 93.3 | < 0.0001\nFISH | 90.0 | < 0.0001\nGONE WITH THE WIND | 86.7 | < 0.0001\nMARY POPPINS | 96.7 | < 0.0001\nCUTTHROAT ISLAND | 81.5 | 0.0008\n\nTable 5: For each evaluated query (listed by first query item), we give the percentage of\nrespondents who preferred the results given by Bayesian Sets and the p-value rejecting the null\nhypothesis that Google Sets is preferable to Bayesian Sets on that particular query.\n\nSince, in the case of binary data, our method reduces to a matrix-vector multiplication, we\nalso came up with ten heuristic matrix-vector methods which we ran on the same queries,\nusing the same datasets. Descriptions and results can be found in supplemental material on\nthe authors\u2019 websites.\n\n6 Conclusions\n\nWe have described an algorithm which takes a query consisting of a small set of items,\nand returns additional items which belong in this set. Our algorithm computes a score\nfor each item by comparing the posterior probability of that item given the set, to the prior\nprobability of that item. These probabilities are computed with respect to a statistical model\nfor the data, and since the parameters of this model are unknown they are marginalized out.\n\nFor exponential family models with conjugate priors, our score can be computed exactly\nand efficiently. In fact, we show that for sparse binary data, scoring all items in a large\ndata set can be accomplished using a single sparse matrix-vector multiplication. Thus, we\nget a very fast and practical Bayesian algorithm without needing to resort to approximate\ninference.\n
For example, a sparse data set with over 2 million nonzero entries (Grolier) can\nbe queried in just over 1 second.\n\nOur method does well when compared to Google Sets in terms of set completions, demonstrating\nthat this Bayesian criterion can be useful in realistic problem domains. One of the\nproblems we have not yet addressed is deciding on the size of the response set. Since the\nscores have a probabilistic interpretation, it should be possible to find a suitable threshold\nto these probabilities. In the future, we will incorporate such a threshold into our algorithm.\n\nThe problem of retrieving sets of items is clearly relevant to many application domains.\nOur algorithm is very flexible in that it can be combined with a wide variety of types of\ndata (e.g. sequences, images, etc.) and probabilistic models. We plan to explore efficient\nimplementations of some of these extensions. We believe that with even larger datasets the\nBayesian Sets algorithm will be a very useful tool for many application areas.\n\nAcknowledgements: Thanks to Avrim Blum and Simon Tong for useful discussions, and to Sam\nRoweis for some of the data. ZG was partially supported at CMU by the DARPA CALO project.\n\nReferences\n[1] Google\u2122 Sets. http://labs.google.com/sets\n[2] Lafferty, J. and Zhai, C. (2002). Probabilistic relevance models based on document and query generation. In Language modeling and information retrieval.\n[3] Ponte, J. and Croft, W. (1998). A language modeling approach to information retrieval. SIGIR.\n[4] Robertson, S. and Sparck Jones, K. (1976). Relevance weighting of search terms. J Am Soc Info Sci.\n[5] Tenenbaum, J. B. and Griffiths, T. L. (2001). Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629\u2013641.\n[6] Tong, S. (2005). Personal communication.", "award": [], "sourceid": 2817, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Katherine", "family_name": "Heller", "institution": null}]}