{"title": "Active Learning with Statistical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 705, "page_last": 712, "abstract": null, "full_text": "Active Learning with Statistical Models \n\nDavid A. Cohn, Zoubin Ghahramani, and Michael I. Jordan \n\ncohn@psyche.mit.edu, zoubin@psyche.mit.edu, jordan@psyche.mit.edu \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nFor many types of learners one can compute the statistically \"optimal\" way to select data. We review how these techniques have been used with feedforward neural networks [MacKay, 1992; Cohn, 1994]. We then show how the same principles may be used to select data for two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate. \n\n1 ACTIVE LEARNING - BACKGROUND \n\nAn active learning problem is one where the learner has the ability or need to influence or select its own training data. Many problems of great practical interest allow active learning, and many even require it. \n\nWe consider the problem of actively learning a mapping X → Y based on a set of training examples {(x_i, y_i)}, i = 1, …, m, where x_i ∈ X and y_i ∈ Y. The learner is allowed to iteratively select new inputs x̃ (possibly from a constrained set), observe the resulting output ỹ, and incorporate the new examples (x̃, ỹ) into its training set. The primary question of active learning is how to choose which x̃ to try next. 
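As an illustration (ours, not from the paper), the select/observe/incorporate loop described above can be sketched in a few lines. The `choose_query` and `oracle` interfaces and the toy `MeanLearner` are hypothetical stand-ins for whatever model and data source are in use:

```python
import numpy as np

class MeanLearner:
    # Toy learner (illustration only): predicts the running mean of observed y.
    def __init__(self):
        self.xs, self.ys = [], []
    def add_example(self, x, y):
        self.xs.append(x)
        self.ys.append(y)
    def predict(self, x):
        return float(np.mean(self.ys)) if self.ys else 0.0

def active_learning_loop(learner, choose_query, oracle, n_queries):
    # Iteratively select an input x, observe the resulting y,
    # and incorporate (x, y) into the training set.
    for _ in range(n_queries):
        x = choose_query(learner)   # select the next input to try
        y = oracle(x)               # observe the resulting output
        learner.add_example(x, y)   # incorporate the new example

rng = np.random.default_rng(0)
learner = MeanLearner()
active_learning_loop(learner,
                     choose_query=lambda l: float(rng.uniform(0.0, 1.0)),
                     oracle=lambda x: float(np.sin(2.0 * np.pi * x)),
                     n_queries=10)
```

The interesting design question, addressed in the rest of the paper, is what to put in place of the random `choose_query` above.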
\nThere are many heuristics for choosing x̃ based on intuition, including choosing places where we don't have data, where we perform poorly [Linden and Weber, 1993], where we have low confidence [Thrun and Moller, 1992], where we expect it to change our model [Cohn et al., 1990], and where we previously found data that resulted in learning [Schmidhuber and Storck, 1993]. \n\nIn this paper we consider how one may select x̃ \"optimally\" from a statistical viewpoint. We first review how the statistical approach can be applied to neural networks, as described in MacKay [1992] and Cohn [1994]. We then consider two alternative, statistically-based learning architectures: mixtures of Gaussians and locally weighted regression. While optimal data selection for a neural network is computationally expensive and approximate, we find that optimal data selection for the two statistical models is efficient and accurate. \n\n2 ACTIVE LEARNING - A STATISTICAL APPROACH \n\nWe denote the learner's output given input x as ŷ(x). The mean squared error of this output can be expressed as the sum of the learner's bias and variance. The variance σ²_ŷ(x) indicates the learner's uncertainty in its estimate at x.¹ Our goal will be to select a new example x̃ such that when the resulting example (x̃, ỹ) is added to the training set, the integrated variance IV is minimized: \n\nIV = ∫ σ²_ŷ P(x) dx.   (1) \n\nHere, P(x) is the (known) distribution over X. In practice, we will compute a Monte Carlo approximation of this integral, evaluating σ²_ŷ at a number of random points drawn according to P(x). \n\nSelecting x̃ so as to minimize IV requires computing σ̃²_ŷ, the new variance of ŷ given the example (x̃, ỹ). 
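The Monte Carlo approximation of Equation 1 amounts to averaging the learner's variance estimate at reference points drawn from P(x). A minimal numpy sketch, where `variance_fn` is our stand-in name for the model-specific σ²_ŷ(x) and `sample_p` draws from P(x):

```python
import numpy as np

def integrated_variance(variance_fn, sample_p, n_points=1000, seed=0):
    # Monte Carlo estimate of IV: average the learner's variance
    # estimate over reference points drawn from P(x).
    rng = np.random.default_rng(seed)
    xs = sample_p(n_points, rng)            # random points ~ P(x)
    return float(np.mean(variance_fn(xs)))  # mean of sigma_yhat^2(x)

# Example: a constant variance of 0.25 integrates to 0.25 under any P(x).
iv = integrated_variance(lambda xs: np.full_like(xs, 0.25),
                         lambda n, rng: rng.uniform(0.0, 1.0, n))
```

In a query-selection loop, this estimator would be evaluated once per candidate x̃, with `variance_fn` replaced by the expected post-query variance derived below.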
Until we actually commit to an x̃, we do not know what corresponding ỹ we will see, so the minimization cannot be performed deterministically.² Many learning architectures, however, provide an estimate of P(ỹ|x̃) based on current data, so we can use this estimate to compute the expectation of σ̃²_ŷ. Selecting x̃ to minimize the expected integrated variance provides a solid statistical basis for choosing new examples. \n\n2.1 EXAMPLE: ACTIVE LEARNING WITH A NEURAL NETWORK \n\nIn this section we review the use of techniques from Optimal Experiment Design (OED) to minimize the estimated variance of a neural network [Fedorov, 1972; MacKay, 1992; Cohn, 1994]. We will assume we have been given a learner ŷ = f_ŵ(·), a training set {(x_i, y_i)}, i = 1, …, m, and a parameter vector ŵ that maximizes a likelihood measure. One such measure is the minimum sum squared residual \n\nS² = (1/m) Σ_{i=1..m} (y_i − ŷ(x_i))². \n\n¹ Unless explicitly denoted, ŷ and σ²_ŷ are functions of x. For simplicity, we present our results in the univariate setting. All results in the paper extend easily to the multivariate case. \n\n² This contrasts with related work by Plutowski and White [1993], which is concerned with filtering an existing data set. \n\nThe estimated output variance of the network is \n\nσ²_ŷ ≈ S² (∂ŷ(x)/∂w)ᵀ (∂²S²/∂w²)⁻¹ (∂ŷ(x)/∂w). \n\nThe standard OED approach assumes normality and local linearity. These assumptions allow replacing the distribution P(y|x) by its estimated mean ŷ(x) and variance S². The expected value of the new variance, σ̃²_ŷ, is then \n\n⟨σ̃²_ŷ⟩ ≈ σ²_ŷ − σ²_ŷ(x, x̃) / (S² + σ²_ŷ(x̃))   (2) \n\n[MacKay, 1992], where we define \n\nσ_ŷ(x, x̃) = S² (∂ŷ(x)/∂w)ᵀ (∂²S²/∂w²)⁻¹ (∂ŷ(x̃)/∂w). \n\nFor empirical results on the predictive power of Equation 2, see Cohn [1994]. 
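Under the stated normality and local-linearity assumptions, Equation 2 can be evaluated from the output gradients and the inverse Hessian of S². The following is a sketch under those assumptions, with variable names of our choosing; in practice the gradients and the Hessian inverse would come from the trained network:

```python
import numpy as np

def expected_new_variance(g_x, g_q, a_inv, s2):
    # Expected variance at a reference point x after querying a candidate
    # x~ (Equation 2), for a locally linearized model.
    #   g_x, g_q : gradients of yhat w.r.t. the weights, at x and at x~
    #   a_inv    : inverse of the Hessian of S^2 w.r.t. the weights
    #   s2       : estimated noise variance S^2
    var_x = s2 * g_x @ a_inv @ g_x    # current variance at x
    var_q = s2 * g_q @ a_inv @ g_q    # current variance at the query x~
    cov   = s2 * g_x @ a_inv @ g_q    # cross term between x and x~
    return var_x - cov**2 / (s2 + var_q)

# Querying where the gradients align with those at x reduces the variance:
g = np.array([1.0, 0.0])
v_new = expected_new_variance(g, g, np.eye(2), 1.0)
```

Averaging this quantity over reference points drawn from P(x), as in the Monte Carlo approximation of Equation 1, gives the expected integrated variance for each candidate x̃.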
\n\nThe advantages of minimizing this criterion are that it is grounded in statistics, and is optimal given the assumptions. Furthermore, the criterion is continuous and differentiable. As such, it is applicable in continuous domains with continuous action spaces, and allows hillclimbing to find the \"best\" x̃. \n\nFor neural networks, however, this approach has many disadvantages. The criterion relies on simplifications and strong assumptions which hold only approximately. Computing the variance estimate requires inversion of a |w| × |w| matrix for each new example, and incorporating new examples into the network requires expensive retraining. Paass and Kindermann [1995] discuss an approach which addresses some of these problems. \n\n3 MIXTURES OF GAUSSIANS \n\nThe mixture of Gaussians model is gaining popularity among machine learning practitioners [Nowlan, 1991; Specht, 1991; Ghahramani and Jordan, 1994]. It assumes that the data is produced by a mixture of N Gaussians g_i, for i = 1, …, N. We can use the EM algorithm [Dempster et al., 1977] to find the best fit to the data, after which the conditional expectations of the mixture can be used for function approximation. \n\nFor each Gaussian g_i we will denote the estimated input/output means as μ_x,i and μ_y,i and the estimated covariances as σ²_x,i, σ²_y,i and σ_xy,i. The conditional variance of y given x may then be written \n\nσ²_y|x,i = σ²_y,i − σ²_xy,i / σ²_x,i. \n\nWe will denote as n_i the (possibly fractional) number of training examples for which g_i takes responsibility: \n\nn_i = Σ_{j=1..m} P(i | x_j, y_j). \n\nFor an input x, each g_i has conditional expectation ŷ_i and variance σ̂²_ŷ,i: \n\nŷ_i = μ_y,i + (σ_xy,i / σ²_x,i)(x − μ_x,i),   σ̂²_ŷ,i = (σ²_y|x,i / n_i)(1 + (x − μ_x,i)² / σ²_x,i). \n\nThese expectations and variances are mixed according to the prior probability that g_i has of being responsible for x: \n\nh_i = h_i(x) = P(x|i) / Σ_{j=1..N} P(x|j). \n
\nFor input x then, the conditional expectation ŷ of the resulting mixture and its variance may be written: \n\nŷ = Σ_{i=1..N} h_i ŷ_i,   σ²_ŷ = Σ_{i=1..N} h_i² σ̂²_ŷ,i. \n\nIn contrast to the variance estimate computed for a neural network, here σ²_ŷ can be computed efficiently with no approximations. \n\n3.1 ACTIVE LEARNING WITH A MIXTURE OF GAUSSIANS \n\nWe want to select x̃ to minimize ⟨σ̃²_ŷ⟩. With a mixture of Gaussians, the model's estimated distribution of ỹ given x̃ is explicit: \n\nP(ỹ|x̃) = Σ_{i=1..N} h_i P(ỹ|x̃, i) = Σ_{i=1..N} h_i N(ŷ_i(x̃), σ²_y|x,i(x̃)), \n\nwhere h_i = h_i(x̃). Given this, calculation of ⟨σ̃²_ŷ⟩ is straightforward: we model the change in each g_i separately, calculating its expected variance given a new point sampled from P(ỹ|x̃, i), and weight this change by h_i. The new expectations combine to form the learner's new expected variance, where the expectation can be computed exactly in closed form (Equation 3). \n\n4 LOCALLY WEIGHTED REGRESSION \n\nWe consider here two forms of locally weighted regression (LWR): kernel regression and the LOESS model [Cleveland et al., 1988]. Kernel regression computes ŷ as an average of the y_i in the data set, weighted by a kernel centered at x. The LOESS model performs a linear regression on points in the data set, weighted by a kernel centered at x. The kernel shape is a design parameter: the original LOESS model uses a \"tricubic\" kernel; in our experiments we use the more common Gaussian \n\nh_i(x) ≡ h(x − x_i) = exp(−k(x − x_i)²), \n\nwhere k is a smoothing constant. For brevity, we will drop the argument x for h_i(x), and define n = Σ_i h_i. We can then write the estimated means and covariances as: \n\nμ_x = Σ_i h_i x_i / n,   μ_y = Σ_i h_i y_i / n,   σ²_x = Σ_i h_i (x_i − μ_x)² / n,   σ²_y = Σ_i h_i (y_i − μ_y)² / n,   σ_xy = Σ_i h_i (x_i − μ_x)(y_i − μ_y) / n,   σ²_y|x = σ²_y − σ²_xy / σ²_x. \n
\nWe use them to express the conditional expectations and their estimated variances: \n\nkernel: ŷ = μ_y,   σ²_ŷ = σ²_y / n;   (4) \n\nLOESS: ŷ = μ_y + (σ_xy / σ²_x)(x − μ_x),   σ²_ŷ = (σ²_y|x / n)(1 + (x − μ_x)² / σ²_x).   (5) \n\n4.1 ACTIVE LEARNING WITH LOCALLY WEIGHTED REGRESSION \n\nAgain we want to select x̃ to minimize ⟨σ̃²_ŷ⟩. With LWR, the model's estimated distribution of ỹ given x̃ is explicit: \n\nP(ỹ|x̃) = N(ŷ(x̃), σ²_y|x(x̃)). \n\nThe estimate of ⟨σ̃²_ŷ⟩ is also explicit. Defining h̃ as the weight assigned to x̃ by the kernel, the learner's expected new variance is \n\nkernel: ⟨σ̃²_ŷ⟩ = ⟨σ̃²_y⟩ / (n + h̃),   (6) \n\nwhere the expectation can be computed exactly in closed form. \n\n5 EXPERIMENTAL RESULTS \n\nBelow we describe two sets of experiments demonstrating the predictive power of the query selection criteria in this paper. In the first set, learners were trained on data from a noisy sine wave. The criteria described in this paper were applied to predict how a new training example selected at point x̃ would decrease the learner's variance. These predictions, along with the actual changes in variance when the training points were queried and added, are plotted in Figure 1. \n\nFigure 1: The upper portion of each plot indicates each learner's fit to noisy sinusoidal data. The lower portion of each plot indicates predicted and actual changes in the learner's average estimated variance when x̃ is queried and added to the training set, for x̃ ∈ [0, 1]. 
Changes are not plotted to scale with learners' fits. \n\nIn the second set of experiments, we applied the techniques of this paper to learning the kinematics of a two-joint planar arm (Figure 2; see Cohn [1994] for details). Below, we illustrate the problem using the LOESS algorithm. \n\nAn example of the correlation between predicted and actual changes in variance on this problem is plotted in Figure 2. Figure 3 demonstrates that this correlation may be exploited to guide sequential query selection. We compared a LOESS learner which selected each new query so as to minimize expected variance with LOESS learners which selected queries according to various heuristics. The variance-minimizing learner significantly outperforms the heuristics in terms of both variance and MSE. \n\nFigure 2: (left) The arm kinematics problem. (right) Predicted vs. actual changes in model variance for LOESS on the arm kinematics problem. 100 candidate points are shown for a model trained with 50 initial random examples. Note that most of the potential queries produce very little improvement, and that the algorithm successfully identifies those few that will help most. \n\nFigure 3: Variance and MSE for a LOESS learner selecting queries according to the variance-minimizing criterion discussed in this paper and according to several heuristics. 
\"Sensitivity\" queries where output is most sensitive to new data, \"Bias\" queries according to a bias-minimizing criterion, \"Support\" queries where the model has the least data support. The variance of \"Random\" and \"Sensitivity\" are off the scale. Curves are medians over 15 runs with non-Gaussian noise. \n\n6 SUMMARY \n\nMixtures of Gaussians and locally weighted regression are two statistical models that offer elegant representations and efficient learning algorithms. In this paper we have shown that they also offer the opportunity to perform active learning in an efficient and statistically correct manner. The criteria derived here can be computed cheaply and, for the problems tested, demonstrate good predictive power. \n\nAcknowledgements \n\nThis work was funded by NSF grant CDA-9309300, the McDonnell-Pew Foundation, ATR Human Information Processing Laboratories and Siemens Corporate Research. We thank Stefan Schaal for helpful discussions about locally weighted regression. \n\nReferences \n\nW. Cleveland, S. Devlin, and E. Grosse. (1988) Regression by local fitting. Journal of Econometrics 37:87-114. \nD. Cohn, L. Atlas and R. Ladner. (1990) Training Connectionist Networks with Queries and Selective Sampling. In D. Touretzky, ed., Advances in Neural Information Processing Systems 2, Morgan Kaufmann. \nD. Cohn. (1994) Neural network exploration using optimal experiment design. In J. Cowan et al., eds., Advances in Neural Information Processing Systems 6, Morgan Kaufmann. \nA. Dempster, N. Laird and D. Rubin. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39:1-38. \nV. Fedorov. (1972) Theory of Optimal Experiments. Academic Press, New York. \n
Z. Ghahramani and M. Jordan. (1994) Supervised learning from incomplete data via an EM approach. In J. Cowan et al., eds., Advances in Neural Information Processing Systems 6, Morgan Kaufmann. \nA. Linden and F. Weber. (1993) Implementing inner drive by competence reflection. In H. Roitblat et al., eds., Proc. 2nd Int. Conf. on Simulation of Adaptive Behavior, MIT Press, Cambridge. \nD. MacKay. (1992) Information-based objective functions for active data selection. Neural Computation 4(4):590-604. \nS. Nowlan. (1991) Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. CMU-CS-91-126, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. \nG. Paass and J. Kindermann. (1995) Bayesian Query Construction for Neural Network Models. In this volume. \nM. Plutowski and H. White. (1993) Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4:305-318. \nS. Schaal and C. Atkeson. (1994) Robot Juggling: An Implementation of Memory-based Learning. Control Systems Magazine, 14(1):57-71. \nJ. Schmidhuber and J. Storck. (1993) Reinforcement driven information acquisition in nondeterministic environments. Tech. Report, Fakultät für Informatik, Technische Universität München. \nD. Specht. (1991) A general regression neural network. IEEE Trans. Neural Networks, 2(6):568-576. \nS. Thrun and K. Moller. (1992) Active exploration in dynamic environments. In J. Moody et al., eds., Advances in Neural Information Processing Systems 4, Morgan Kaufmann. \n", "award": [], "sourceid": 1011, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}