{"title": "Learning the context of a category", "book": "Advances in Neural Information Processing Systems", "page_first": 1795, "page_last": 1803, "abstract": "This paper outlines a hierarchical Bayesian model for human category learning that learns both the organization of objects into categories, and the context in which this knowledge should be applied. The model is fit to multiple data sets, and provides a parsimonious method for describing how humans learn context specific conceptual representations.", "full_text": "Learning the context of a category\n\nDaniel J. Navarro\nSchool of Psychology\nUniversity of Adelaide\n\nAdelaide, SA 5005, Australia\n\ndaniel.navarro@adelaide.edu.au\n\nAbstract\n\nThis paper outlines a hierarchical Bayesian model for human category learning\nthat learns both the organization of objects into categories, and the context in\nwhich this knowledge should be applied. The model is \ufb01t to multiple data sets,\nand provides a parsimonious method for describing how humans learn context\nspeci\ufb01c conceptual representations.\n\n1 Introduction\n\nHuman knowledge and expertise is often tied to particular contexts. The superior memory that chess\nmasters have for chessboard con\ufb01gurations is limited to plausible games, and does not generalize\nto arbitrary groupings of pieces [1]. Expert \ufb01re\ufb01ghters make different predictions about the same\n\ufb01re depending on whether it is described as a back-burn or a to-be-controlled \ufb01re [2].\nIn part,\nthis context speci\ufb01city re\ufb02ects the tendency for people to organize knowledge into independent\n\u201cbundles\u201d which may contain contradictory information, and which may be deemed appropriate\nto different contexts. This phenomenon is called knowledge partitioning [2\u20136], and is observed\nin arti\ufb01cial category learning experiments as well as real world situations. 
When people learn to\nclassify stimuli in an environment where there are systematic changes in the \u201ccontext\u201d in which\nobservations are made, they often construct category representations that are tightly linked to the\ncontext, and only generalize their knowledge when the context is deemed appropriate [3, 4, 6].\n\nContext induced knowledge partitioning poses a challenge to models of human learning. As noted in\n[4], many models cannot accommodate the effect, or, as discussed later in this paper, are somewhat\nunsatisfying in the manner that they do so. This paper explores the possibility that Bayesian models\nof human category learning can provide the missing explanation. The structure of the paper is as\nfollows: \ufb01rst, a context-sensitive Bayesian category learning model is described. This model is\nthen shown to provide a parsimonious and psychologically appealing account of the knowledge\npartitioning effect. Following this, a hierarchical extension is introduced to the model, which allows\nit to acquire abstract knowledge about the context speci\ufb01city of the categories, in a manner that is\nconsistent with the data on human learning.\n\n2 Learning categories in context\n\nThis section outlines a Bayesian model that is sensitive to the learning context. It extends Anderson\u2019s\n[7] rational model of categorization (RMC) by allowing the model to track the context in which\nobservations are made, and draw inferences about the role that context plays.\n\n2.1 The statistical model\n\nThe central assumption in the RMC is that the learner seeks to organize his or her observations\ninto clusters. If z_i denotes the cluster to which the ith observation is assigned, then the joint prior\ndistribution over z_n = (z_1, . . . , z_n) can be speci\ufb01ed via the Chinese restaurant process [8],\n\nz_n | \u03b1 \u223c CRP(\u03b1).    (1)\n\nEach cluster of observations is mapped onto a distribution over features. 
Feature values are denoted\nby the vector x_i = (x_i1, . . . , x_id), the values of the ith observation for each of the d features. When\nfeature values vary continuously, the RMC associates the kth cluster with a multivariate Gaussian\nthat has mean vector \u00b5_k and covariance matrix \u03a3_k. Setting standard conjugate priors, we obtain\n\nx_i | \u00b5_k, \u03a3_k, z_i = k \u223c Normal(\u00b5_k, \u03a3_k)\n\u00b5_k | \u03a3_k, \u03ba_0, \u00b5_0 \u223c Normal(\u00b5_0, \u03a3_k/\u03ba_0)\n\u03a3_k | \u039b_0, \u03bd_0 \u223c Inv-Wishart(\u03bd_0, \u039b_0^{\u22121})    (2)\n\nThis is a minor generalization of the original model, as it allows any covariance matrix (i.e., symmetric\npositive de\ufb01nite \u03a3) and does not require the restrictive assumption that the stimulus dimensions\nare independent (which would force \u03a3 to be diagonal). While independence is reasonable when\nstimulus dimensions are separable [9], knowledge partitioning can occur regardless of whether\ndimensions are separable or integral (see [6] for details), so the more general formulation is useful.\n\nIn the RMC, labels are treated in the same way as discrete-valued features. Each cluster is associated\nwith a distribution over category labels. If \u2113_i denotes the label given to the ith observation, then\n\n\u2113_i | z_i = k, \u03b8_k \u223c Bernoulli(\u03b8_k)\n\u03b8_k | \u03b2 \u223c Beta(\u03b2, \u03b2)    (3)\n\nThe \u03b2 parameter describes the extent to which items in the same cluster are allowed to have different\nlabels. If there are more than two labels, this generalizes to a Dirichlet-multinomial model.\n\nEquations 1\u20133 de\ufb01ne the standard RMC. The extension to handle context dependence is straightforward:\ncontextual information is treated as an auxiliary feature, and so each cluster is linked to\na distribution over contexts. 
In the experiments considered later, each observation is assigned to\na context individually, which allows us to apply the exact same model for contextual features as\nregular ones. Thus a very simple context model is suf\ufb01cient:\n\nc_i | z_i = k, \u03c6_k \u223c Bernoulli(\u03c6_k)\n\u03c6_k | \u03b3 \u223c Beta(\u03b3, \u03b3)    (4)\n\nThe context speci\ufb01city parameter \u03b3 is analogous to \u03b2 and controls the extent to which clusters can\ninclude observations made in different contexts. In more general settings, a richer model would be\nrequired to capture the manner in which context can vary.\nApplying the model requires values to be chosen for \u03b1, \u03b2, \u03b3, \u00b5_0, \u039b_0, \u03bd_0 and \u03ba_0, most of which can\nbe \ufb01xed in a sensible way. Firstly, since the categories do not overlap in the experiments discussed\nhere, it makes sense to set \u03b2 = 0, which has the effect of forcing each cluster to be associated only\nwith one category. Secondly, human learners rarely have strong prior knowledge about the features\nused in arti\ufb01cial category learning experiments, which is expressed by setting \u03ba_0 = 1 and \u03bd_0 = 3 (\u03bd_0 is larger\nto ensure that the prior over features always has a well de\ufb01ned covariance structure). Thirdly, to\napproximate the fact that the experiments quickly reveal the full range of stimuli to participants,\nit makes sense to set \u00b5_0 and \u039b_0 to the empirical mean and covariances across all training items.\nHaving made these choices, we may restrict our attention to \u03b1 (the bias to introduce new clusters)\nand \u03b3 (the bias to treat clusters as context general).\n\n2.2 Inference in the model\n\nInference is performed via a collapsed Gibbs sampler, integrating out \u03c6, \u03b8, \u00b5 and \u03a3 and de\ufb01ning a\nsampler only over the cluster assignments z. 
To do so, note that\n\nP(z_i = k | x, \u2113, c, z_{\u2212i}) \u221d P(x_i, \u2113_i, c_i | x_{\u2212i}, \u2113_{\u2212i}, c_{\u2212i}, z_{\u2212i}, z_i = k) P(z_i = k | z_{\u2212i})    (5)\n= P(x_i | x_{\u2212i}, z_{\u2212i}, z_i = k) P(\u2113_i | \u2113_{\u2212i}, z_{\u2212i}, z_i = k) P(c_i | c_{\u2212i}, z_{\u2212i}, z_i = k) P(z_i = k | z_{\u2212i})    (6)\n\nwhere the dependence on the parameters that describe the prior (i.e., \u03b1, \u03b2, \u03b3, \u039b_0, \u03ba_0, \u03bd_0, \u00b5_0) is suppressed\nfor the sake of readability. In this expression z_{\u2212i} denotes the set of all cluster assignments\nexcept the ith, and the normalizing term is calculated by summing Equation 6 over all possible cluster\nassignments k, including the possibility that the ith item is assigned to an entirely new cluster.\nThe conditional prior probability P(z_i = k | z_{\u2212i}) is\n\nP(z_i = k | z_{\u2212i}) = n_k / (n \u2212 1 + \u03b1)    if k is old\nP(z_i = k | z_{\u2212i}) = \u03b1 / (n \u2212 1 + \u03b1)    if k is new    (7)\n\nwhere n_k counts the number of items (not including the ith) that have been assigned to the kth\ncluster. Since the context is modelled using a beta-Bernoulli model:\n\nP(c_i | c_{\u2212i}, z_{\u2212i}, z_i = k) = \u222b_0^1 P(c_i | \u03c6_k, z_i = k) P(\u03c6_k | c_{\u2212i}, z_{\u2212i}) d\u03c6_k = (n_k^{(c_i)} + \u03b3) / (n_k + 2\u03b3)    (8)\n\nwhere n_k^{(c_i)} counts the number of observations that have been assigned to cluster k and appeared in\nthe same context as the ith item. A similar result applies to the labelling scheme:\n\nP(\u2113_i | \u2113_{\u2212i}, z_{\u2212i}, z_i = k) = \u222b_0^1 P(\u2113_i | \u03b8_k, z_i = k) P(\u03b8_k | \u2113_{\u2212i}, z_{\u2212i}) d\u03b8_k = (n_k^{(\u2113_i)} + \u03b2) / (n_k + 2\u03b2)    (9)\n\nwhere n_k^{(\u2113_i)} counts the number of observations that have been assigned to cluster k and given the\nsame label as observation i. Finally, integrating out the mean vector \u00b5_k and covariance matrix \u03a3_k\nfor the feature values yields a d-dimensional multivariate t distribution (e.g., [10], ch. 
3):\n\nP(x_i | x_{\u2212i}, z_{\u2212i}, z_i = k) = \u222b P(x_i | \u00b5_k, \u03a3_k, z_i = k) P(\u00b5_k, \u03a3_k | x_{\u2212i}, z_{\u2212i}) d(\u00b5_k, \u03a3_k)    (10)\n= [\u0393((\u03bd\u2032_k + d)/2) / (\u0393(\u03bd\u2032_k/2) (\u03c0\u03bd\u2032_k)^{d/2} |\u039b\u2032_k|^{1/2})] [1 + (x_i \u2212 \u00b5\u2032_k) (\u039b\u2032_k)^{\u22121} (x_i \u2212 \u00b5\u2032_k)^T / \u03bd\u2032_k]^{\u2212(\u03bd\u2032_k + d)/2}    (11)\n\nIn this expression the posterior degrees of freedom for cluster k is \u03bd\u2032_k = \u03bd_0 + n_k \u2212 d + 1 and the\nposterior mean is \u00b5\u2032_k = (\u03ba_0 \u00b5_0 + n_k x\u0304_k)/(\u03ba_0 + n_k), where x\u0304_k denotes the empirical mean feature\nvalues for items in the cluster. Finally, the posterior scale matrix is\n\n\u039b\u2032_k = [\u039b_0 + S_k + (\u03ba_0 n_k/(\u03ba_0 + n_k)) (x\u0304_k \u2212 \u00b5_0)^T (x\u0304_k \u2212 \u00b5_0)] (\u03ba_0 + n_k + 1) / [(\u03ba_0 + n_k)(\u03bd_0 + n_k \u2212 2d + 2)]    (12)\n\nwhere S_k = \u2211 (x_i \u2212 x\u0304_k)^T (x_i \u2212 x\u0304_k) is the sum of squares matrix around the empirical cluster mean\nx\u0304_k, and the sum in question is taken over all observations assigned to cluster k.\nTaken together, Equations 6, 8, 9 and 11 suggest a simple Gibbs sampler over the cluster assignments\nz. Cluster assignments z_i are initialized randomly, and are then sequentially redrawn from\nthe conditional posterior distribution in Equation 6. For the applications in this paper, the sampler\ntypically converges within only a few iterations, but a much longer burn-in (usually 1000 iterations,\nnever less than 100) was used in order to be safe. Successive samples are drawn at a lag of 10\niterations, and multiple runs (between 5 and 10) are used in all cases.\n\n3 Application to knowledge partitioning experiments\n\nTo illustrate the behavior of the model, consider the most typical example of a knowledge partitioning\nexperiment [3, 4, 6]. 
Stimuli vary along two continuous dimensions (e.g., height of a rectangle,\nlocation of a radial line), and are organized into categories using the scheme shown in Figure 1a.\nThere are two categories organized into an \u201cinside-outside\u201d structure, with one category (black\ncircles/squares) occupying a region along either side of the other one (white circles/squares). The\ncritical characteristic of the experiment is that each stimulus is presented in a particular \u201ccontext\u201d,\nusually operationalized as an auxiliary feature not tied to the stimulus itself, such as the background\ncolor. In Figure 1a, squares correspond to items presented in one context, and circles to items presented\nin the other context. Participants are trained on these items in a standard supervised categorization\nexperiment: stimuli are presented one at a time (with the context variable), and participants\nare asked to predict the category label. After making a prediction, the true label is revealed to them.\n\n[Figure 1. Panel (a) legend: label A, context 1; label A, context 2; label B, context 1; label B, context 2; transfer items. Panel (b): rows for context sensitive and context insensitive participants, columns for context 1 and context 2, shaded in bins of 76\u2212100%, 51\u221275%, 26\u221250% and 0\u221225%.]\n\nFigure 1: Stimuli used in the typical knowledge partitioning design (left) and the different generalization\npatterns that are displayed by human learners (right). Percentages refer to the probability of\nselecting category label A.\n\nThis procedure is repeated until participants can correctly label all items. At this point, participants\nare shown transfer items (the crosses in Figure 1a), and asked what category label these items should\nbe given. No feedback is given during this phase. 
Critically, each transfer item is presented in both\ncontexts, to determine whether people generalize in a context speci\ufb01c way.\n\nThe basic effect, replicated across several different experiments, is that there are strong individual\ndifferences in how people solve the problem. This leads to the two characteristic patterns of general-\nization shown in Figure 1b (these data are from Experiments 1 and 2A in [6]). Some participants are\ncontext insensitive (lower two panels) and their predictions about the transfer items do not change\nas a function of context. However, other participants are context sensitive (upper panels) and adopt\na very different strategy depending on which context the transfer item is presented in. This is taken\nto imply [3, 4, 6] that the context sensitive participants have learned a conceptual representation in\nwhich knowledge is \u201cpartitioned\u201d into different bundles, each associated with a different context.\n\n3.1 Learning the knowledge partition\n\nThe initial investigation focused on what category representations the model learns, as a function\nof \u03b1 and \u03b3. After varying both parameters over a broad range, it was clear that there are two quite\ndifferent solutions that the model can produce, illustrated in Figure 2. In the four cluster solution\n(panel b, small \u03b3), the clusters never aggregate across items observed in different contexts.\nIn\ncontrast, the three cluster solution (panel a, larger \u03b3) is more context general, and collapses category\nB into a single cluster. However, there is an interaction with \u03b1, since large \u03b1 values drive the model\nto introduce more clusters. As a result, for \u03b1 > 1 the model tends not to produce the three cluster\nsolution. Given that the main interest is in \u03b3, we can \ufb01x \u03b1 such that the prior expected number of\nclusters is 3.5, so as to be neutral with respect to the two solutions. 
Since the expected number of\nclusters is given by \u03b1 \u2211_{k=0}^{n\u22121} 1/(\u03b1 + k) [11] and there are n = 40 observations, this value is \u03b1 = 0.72.\n\n[Figure 2. Panels (a) and (b): the 3 cluster solution and the 4 cluster solution, with clusters marked as context 1 only, context 2 only or both contexts. Panel (c): posterior probability of approximate agreement (0 to 1) plotted against gamma (0 to 15), for the 4 cluster solution and the 3 cluster solution.]\n\nFigure 2: The two different clustering schemes produced by the context sensitive RMC, and the\nvalues of \u03b3 that produce them (for \u03b1 \ufb01xed at 0.72). See main text for details.\n\nThe next aim was to quantify the extent to which \u03b3 in\ufb02uences the relative prevalence of the four\ncluster solution versus the three cluster solution. For any given partition produced by the model, the\nadjusted Rand index [12] can be used to assess its similarity to the two idealized solutions (Figure 2a\nand 2b). Since the adjusted Rand index measures the extent to which any given pair of items are classi\ufb01ed\nin the same way by the two solutions, it is a natural measure of how close a model-generated\nsolution is to one of the two idealized solutions. Then, adopting an approach loosely inspired by\nPAC-learning [13], two partitions were deemed to be approximately the same if the adjusted Rand\nindex between the two exceeded 0.9. The estimated posterior probability that the model solutions\napproximate either of the two idealized partitions is plotted in Figure 2c as a function of \u03b3.\nAt smaller values of \u03b3 (below about 3.7) the four cluster solution is extremely dominant whereas\nat larger values the three cluster solution is preferred. 
Since there are approximately 1.6 \u00d7 10^35\npossible partitions of 40 objects, the extent of this dominance is clearly very strong.\n\nThe fact that the model concentrates on two different but entirely sensible solutions as a function of\n\u03b3 is very appealing from a psychological perspective. One of the most desirable characteristics is the\nfact that the partitioning of the learner\u2019s knowledge is made explicit. That is, the model learns a much\nmore differentiated and context bound representation when \u03b3 is small, and a more context general\nand less differentiated representation when \u03b3 is large. By way of comparison, the only other model\nthat has been shown to produce the effect is ATRIUM [14], which in its standard form consists of\na linked \u201crule learning\u201d module and an \u201cexemplar learning\u201d module. In order to \ufb01t the data, the\nmodel was modi\ufb01ed [4] so that it starts with two rule modules and an exemplar model. During\ntraining, the model learns to weight each of the rule modules differently depending on context,\nthereby producing context speci\ufb01c generalizations. This provides a partial explanation of the effect,\nbut it is rather unsatisfying in some ways. In ATRIUM, the knowledge partition is represented via\nthe learned division of responsibilities between two hard coded rule modules [4]. In a very real\nsense, the partition is actually hard coded into the architecture of the model. As such, ATRIUM\nlearns the context dependence, but not the knowledge partition itself.\n\n3.2 Generalizing in context-speci\ufb01c and context-general ways\n\nThe discussion to this point shows how the value of \u03b3 shapes the conceptual knowledge that the\nmodel acquires, but has not looked at what generalizations the model makes. However, it is straightforward\nto show that varying \u03b3 does allow the context sensitive RMC to capture the two generalization\npatterns in Figure 1. 
With this in mind, Figure 3 plots the generalizations made by the model\nfor two different levels of context speci\ufb01city (\u03b3 = 0 and \u03b3 = 10) and for the two different clustering\nsolutions. Obviously, in view of the results in Figure 2c the most interesting cases are panels (a) and\n(d), since those correspond to the solutions most likely to be learned by the model, but it is useful\nto consider all four cases. As is clear from inspection \u2013 and veri\ufb01ed by the squared correlations\nlisted in the Figure caption \u2013 when \u03b3 is small the model generalizes in a context speci\ufb01c manner,\nbut when \u03b3 is large the generalizations are the same in all contexts. This happens for both clustering\nsolutions, which implies that \u03b3 plays two distinct but related roles, insofar as it in\ufb02uences the context\nspeci\ufb01city of both the learned knowledge partition and the generalizations to new observations.\n\n[Figure 3. Panels (a) to (d): model generalizations in context 1 and context 2 for \u03b3 = 0 and \u03b3 = 10, under the 4 cluster and 3 cluster solutions.]\n\nFigure 3: Generalizations made by the model. In panel (a) the model accounts for 82.1% of the\nvariance in the context sensitive data, but only 35.2% of the variance in the context insensitive data.\nFor panel (b) these numbers are 77.9% and 3.6% respectively. When \u03b3 is large the pattern reverses:\nin panel (c) only 23.6% of the variance in the context sensitive data is explained, whereas 67.1% of\nthe context insensitive data can be accounted for. In panel (d), the numbers are 17.5% and 73.9%.\n\n4 Acquiring abstract knowledge about context speci\ufb01city\n\nOne thing missing from both ATRIUM and the RMC is an explanation for how the learner decides\nwhether context speci\ufb01c or context general representations are appropriate. In both cases, the model\nhas free parameters that govern the switch between the two cases, and these parameters must be\nestimated from data.\nIn the RMC, \u03b3 is a free parameter that does all the work; for ATRIUM,\nfour separate parameters are varied [4]. This poses the question: how do people acquire abstract\nknowledge about which way to generalize? In RMC terms, how do we infer the value of \u03b3?\nTo answer this, note that if the context varies in a systematic fashion, an intelligent learner might\ncome to suspect that the context matters, and would be more likely to decide to generalize in a\ncontext speci\ufb01c way. On the other hand, if there are no systematic patterns to the way that observations\nare distributed across contexts, then the learner should deem the context to be irrelevant and\nhence decide to generalize broadly across contexts. Indeed, this is exactly what happens with human\nlearners. For instance, consider the data from Experiment 1 in [4]. One condition of this experiment\nwas a standard knowledge partitioning experiment, identical in every meaningful respect to the data\ndescribed earlier in this paper. As is typical for such experiments, knowledge partitioning was observed\nfor at least some of the participants. In the other condition, however, the context variable was\nrandomized: each of the training items was assigned to a randomly chosen context. In this condition,\nno knowledge partitioning was observed.\n\nWhat this implies is that human learners use the systematicity of the context as a cue to determine\nhow broadly to generalize. As such, the model should learn that \u03b3 is small when the context varies\nsystematically; and similarly should learn that \u03b3 is large if the context is random. 
To that end, this\nsection develops a hierarchical extension to the model that is able to do exactly this, and shows that\nit is able to capture both conditions of the data in [4] without varying any parameter values.\n\n4.1 A hierarchical context-sensitive RMC\n\nExtending the statistical model is straightforward: we place a prior over \u03b3, and allow the model to\ninfer a joint posterior distribution over the cluster assignments z and the context speci\ufb01city \u03b3. This is\nclosely related to other hierarchical Bayesian models of category learning [15\u201319]. A simple choice\nof prior for this situation is the exponential distribution,\n\n\u03b3 | \u03bb \u223c Exponential(\u03bb)    (13)\n\nFollowing the approach taken with \u03b1, \u03bb was \ufb01xed so as to ensure that the model has no a priori bias\nto prefer either of the two solutions. When \u03b3 = 3.7 the two solutions are equally likely (Figure 2);\na value of \u03bb = 0.19 ensures that this value of \u03b3 is the prior median.\n\n[Figure 4. Histogram of frequency (0 to 1000) against log10(\u03b3) (\u22124 to 2), with separate bars for the systematic context and randomized context conditions.]\n\nFigure 4: Learned distributions over \u03b3 in the systematic (dark rectangles) and randomized (light\nrectangles) conditions, plotted on a logarithmic scale. 
The dashed line shows the location of the\nprior median (i.e., \u03b3 = 3.7).\n\nInference in the hierarchical model proceeds as before, with a Metropolis step added to resample \u03b3.\nThe acceptance probabilities for the Metropolis sampler may be calculated by observing that\n\nP(\u03b3 | x, \u2113, c, z) \u221d P(x, \u2113, c | z, \u03b3) P(\u03b3)    (14)\n\u221d P(c | z, \u03b3) P(\u03b3)    (15)\n= [\u222b P(c | z, \u03c6) P(\u03c6 | \u03b3) d\u03c6] P(\u03b3)    (16)\n= P(\u03b3) \u220f_{k=1}^{K} \u222b_0^1 P(c^{(k)} | \u03c6_k) P(\u03c6_k | \u03b3) d\u03c6_k    (17)\n= \u03bb exp(\u2212\u03bb\u03b3) \u220f_{k=1}^{K} [n_k! / (n_k^{(c=1)}! n_k^{(c=2)}!)] B(n_k^{(c=1)} + \u03b3, n_k^{(c=2)} + \u03b3) / B(\u03b3, \u03b3)    (18)\n\u221d exp(\u2212\u03bb\u03b3) \u220f_{k=1}^{K} B(n_k^{(c=1)} + \u03b3, n_k^{(c=2)} + \u03b3) / B(\u03b3, \u03b3)    (19)\n\nwhere B(a, b) = \u0393(a)\u0393(b)/\u0393(a + b) denotes the beta function, and n_k^{(c=j)} counts the number of\nitems in cluster k that appeared in context j.\n\n4.2 Application of the extended model\n\nTo explore the performance of the hierarchical extension of the context sensitive RMC, the model\nwas trained on both the original, systematic version of the knowledge partitioning experiments, and\non a version with the context variables randomly permuted. The posterior distributions over \u03b3 that\nthis produces are shown in Figure 4. As expected, in the systematic condition the model notices the\nfact that the context varies systematically as a function of the feature values x, and learns to form\ncontext speci\ufb01c clusters. Indeed, 97% of the posterior distribution over z is absorbed by the four\ncluster solution (or other solutions that are suf\ufb01ciently similar in the sense discussed earlier). 
In the\nprocess, the model infers that \u03b3 is small and generalizes in a context speci\ufb01c way (as per Figure 3).\nNevertheless, without changing any parameter values, the same model in the randomized condition\ninfers that there is no pattern to the context variable, which ends up being randomly scattered across\nthe clusters. For this condition 57% of the posterior mass is approximately equivalent to the three\ncluster solution. As a result, the model infers that \u03b3 is large, and generalizes in the context general\nfashion. In short, the model captures human performance quite effectively.\n\nWhen considering the implications of Figure 4, it is clear that the model captures the critical feature\nof the experiment: the ability to learn when to make context speci\ufb01c generalizations and when\nnot to. The distributions over \u03b3 are very different as a function of condition, indicating that the\nmodel learns appropriately. What is less clear is the extent to which the model would be expected\nto produce the correct pattern of individual differences. Inspection of Figure 4 reveals that in the\nrandomized context condition the posterior distribution over \u03b3 does not move all that far above the\nprior median of 3.7 (dashed line) which by construction is intended to be a fairly neutral value,\nwhereas in the systematic condition nearly the entire distribution lies below this value. In other\nwords, the systematic condition produces more learning about \u03b3. If one were to suppose that people\nhad no inherent prior biases to prefer to generalize one way or the other, it should follow that the\nless informative condition (i.e., random context) should reveal more individual differences. 
Empir-\nically, the reverse is true: in the less informative condition, all participants generalize in a context\ngeneral fashion; whereas in the more informative condition (i.e., systematic context) some but not\nall participants learn to generalize more narrowly. This does not pose any inherent dif\ufb01culty for the\nmodel, but it does suggest that the \u201cunbiased\u201d prior chosen for this demonstration is not quite right:\npeople do appear to have strong prior biases to prefer context general representations. Fortunately, a\ncursory investigation revealed that altering the prior over \u03b3 moves the posteriors in a sensible fashion\nwhile still keeping the two distributions distinct.\n\n5 Discussion\n\nThe hierarchical Bayesian model outlined in this paper explains how human conceptual learning\ncan be context general in some situations, and context sensitive in others. It captures the critical\n\u201cknowledge partitioning\u201d effect [2\u20134, 6] and does so without altering the core components of the\nRMC [7] and its extensions [15, 16, 18, 20]. This success leads to an interesting question: why does\nALCOVE [21] not account for knowledge partitioning (see [4])? Arguably, ALCOVE has been\nthe dominant theory for learned selective attention for almost 20 years, and its attentional learning\nmechanisms bear a striking similarity to the hierarchical Bayesian learning idea used in this paper\nand elsewhere [15\u201319], as well as to statistical methods for automatic relevance determination in\nBayesian neural networks [22]. On the basis of these similarities, one might expect similar behavior\nfrom ALCOVE and the context sensitive RMC. Yet this is not the case. The answer to this lies in\nthe details of why one learns dimensional biases. In ALCOVE, as in many connectionist models, the\ndimensional biases are chosen to optimize the ability to predict the category label. 
Since the context\nvariable is not correlated with the label in these experiments (by construction), ALCOVE learns to\nignore the context variable in all cases. The approach taken by the RMC is qualitatively different:\nit looks for clusters of items where the label, the context and the feature values are all similar to\none another. Knowledge partitioning experiments more or less require that such clusters exist, so\nthe RMC can learn that the context variable is not distributed randomly. In short, ALCOVE treats\ncontext as important only if it can predict the label; the RMC treats the context as important if it\nhelps the learner infer the structure of the world.\n\nLooking beyond arti\ufb01cial learning tasks, learning the situations in which knowledge should be ap-\nplied is an important task for an intelligent agent operating in a complex world. Moreover, hierar-\nchical Bayesian models provide a natural formalism for describing how human learners are able to\ndo so. Viewed in this light, the fact that it is possible for people to hold contradictory knowledge\nin different \u201cparcels\u201d should be viewed as a special case of the general problem of learning the set\nof relevant contexts. Consider, for instance, the example in which \ufb01re \ufb01ghters make different judg-\nments about the same \ufb01re depending on whether it is called a back-burn or a to-be-controlled \ufb01re\n[2]. If \ufb01re \ufb01ghters observe a very different distribution of \ufb01res in the context of back-burns than\nin the context of to-be-controlled \ufb01res, then it should be no surprise that they acquire two distinct\ntheories of \u201c\ufb01res\u201d, each bound to a different context. Although this particular example is a case in\nwhich the learned context speci\ufb01city is incorrect, it takes only a minor shift to make the behavior\ncorrect. 
While the behavior of \ufb01res does not depend on the reason why they were lit, it does depend\non what combustibles they are fed. If the distinction were between \ufb01res observed in a forest context\nand \ufb01res observed in a tyre yard, context speci\ufb01c category representations suddenly seem very\nsensible. Similarly, social categories such as \u201cpolite behavior\u201d are necessarily highly context dependent,\nso it makes sense that the learner would construct different rules for different contexts. If the\nworld presents the learner with observations that vary systematically across contexts, partitioning\nknowledge by context would seem to be a rational learning strategy.\n\nAcknowledgements\n\nThis research was supported by an Australian Research Fellowship (ARC grant DP-0773794).\n\nReferences\n\n[1] W. G. Chase and H. A. Simon. Perception in chess. Cognitive Psychology, 4:55\u201381, 1973.\n[2] S. Lewandowsky and K. Kirsner. Knowledge partitioning: Context-dependent use of expertise.\nMemory and Cognition, 28:295\u2013305, 2000.\n[3] L.-X. Yang and S. Lewandowsky. Context-gated knowledge partitioning in categorization.\nJournal of Experimental Psychology: Learning, Memory, and Cognition, 29:663\u2013679, 2003.\n[4] L.-X. Yang and S. Lewandowsky. Knowledge partitioning in categorization: Constraints on\nexemplar models. Journal of Experimental Psychology: Learning, Memory, and Cognition,\n30:1045\u20131064, 2004.\n[5] M. L. Kalish, S. Lewandowsky, and J. K. Kruschke. Population of linear experts: Knowledge\npartitioning in function learning. Psychological Review, 111:1072\u20131099, 2004.\n[6] S. Lewandowsky, L. Roberts, and L.-X. Yang. Knowledge partitioning in category learning:\nBoundary conditions. Memory and Cognition, 38:1676\u20131688, 2006.\n[7] J. R. Anderson. The adaptive nature of human categorization. Psychological Review,\n98:409\u2013429, 1991.\n[8] D. Aldous. Exchangeability and related topics. 
In \u00c9cole d\u2019\u00e9t\u00e9 de probabilit\u00e9s de Saint-Flour, XIII-1983, pages 1\u2013198. Springer, Berlin, 1985.\n[9] R. N. Shepard. Integrality versus separability of stimulus dimensions: From an early convergence of evidence to a proposed theoretical basis. In J. R. Pomerantz and G. L. Lockhead, editors, The Perception of Structure: Essays in Honor of Wendell R. Garner, pages 53\u201371. American Psychological Association, Washington, DC, 1991.\n[10] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, Boca Raton, 2nd edition, 2004.\n[11] C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152\u20131174, 1974.\n[12] L. Hubert and P. Arabie. Comparing partitions. Journal of Classi\ufb01cation, 2:193\u2013218, 1985.\n[13] L. Valiant. A theory of the learnable. Communications of the ACM, 27:1134\u20131142, 1984.\n[14] M. A. Erickson and J. K. Kruschke. Rules and exemplars in category learning. Journal of Experimental Psychology: General, 127:107\u2013140, 1998.\n[15] C. Kemp, A. Perfors, and J. B. Tenenbaum. Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10:307\u2013332, 2007.\n[16] A. Perfors and J. B. Tenenbaum. Learning to learn categories. In N. Taatgen, H. van Rijn, L. Schomaker, and J. Nerbonne, editors, Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 136\u2013141, Austin, TX, 2009. Cognitive Science Society.\n[17] D. J. Navarro. From natural kinds to complex categories. In R. Sun and N. Miyake, editors, Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 621\u2013626, Mahwah, NJ, 2006. Lawrence Erlbaum.\n[18] T. L. Grif\ufb01ths, K. R. Canini, A. N. Sanborn, and D. J. Navarro. Unifying rational models of categorization via the hierarchical Dirichlet process. In D. S. McNamara and J. G. 
Trafton, editors, Proceedings of the 29th Annual Conference of the Cognitive Science Society, pages 323\u2013328, Austin, TX, 2007. Cognitive Science Society.\n[19] K. Heller, A. N. Sanborn, and N. Chater. Hierarchical learning of dimensional biases in human categorization. In J. Lafferty and C. Williams, editors, Advances in Neural Information Processing Systems 22, Cambridge, MA, 2009. MIT Press.\n[20] A. N. Sanborn, T. L. Grif\ufb01ths, and D. J. Navarro. Rational approximations to rational models: Alternative algorithms for category learning. Psychological Review, in press.\n[21] J. K. Kruschke. ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99:22\u201344, 1992.\n[22] R. Neal. Bayesian learning for neural networks. Springer-Verlag, New York, 1996.", "award": [], "sourceid": 518, "authors": [{"given_name": "Dan", "family_name": "Navarro", "institution": null}]}