{"title": "Adaptive optimal training of animal behavior", "book": "Advances in Neural Information Processing Systems", "page_first": 1947, "page_last": 1955, "abstract": "Neuroscience experiments often require training animals to perform tasks designed to elicit various sensory, cognitive, and motor behaviors. Training typically involves a series of gradual adjustments of stimulus conditions and rewards in order to bring about learning. However, training protocols are usually hand-designed, relying on a combination of intuition, guesswork, and trial-and-error, and often require weeks or months to achieve a desired level of task performance. Here we combine ideas from reinforcement learning and adaptive optimal experimental design to formulate methods for adaptive optimal training of animal behavior. Our work addresses two intriguing problems at once: first, it seeks to infer the learning rules underlying an animal's behavioral changes during training; second, it seeks to exploit these rules to select stimuli that will maximize the rate of learning toward a desired objective.  We develop and test these methods using data collected from rats during training on a two-interval sensory discrimination task.  We show that we can accurately infer the parameters of a policy-gradient-based learning algorithm that describes how the animal's internal model of the task evolves over the course of training.  We then formulate a theory for optimal training, which involves selecting sequences of stimuli that will drive the animal's internal policy toward a desired location in the parameter space. Simulations show that our method can in theory provide a substantial speedup over standard training methods. 
We feel these results will hold considerable theoretical and practical implications both for researchers in reinforcement learning and for experimentalists seeking to train animals.", "full_text": "Adaptive optimal training of animal behavior\n\nJi Hyun Bak1,4 Jung Yoon Choi2,3 Athena Akrami3,5 Ilana Witten2,3 Jonathan W. Pillow2,3\n\n1Department of Physics, 2Department of Psychology, Princeton University\n\n3Princeton Neuroscience Institute, Princeton University\n\n4School of Computational Sciences, Korea Institute for Advanced Study\n\njhbak@kias.re.kr, {jungchoi,aakrami,iwitten,pillow}@princeton.edu\n\n5Howard Hughes Medical Institute\n\nAbstract\n\nNeuroscience experiments often require training animals to perform tasks designed\nto elicit various sensory, cognitive, and motor behaviors. Training typically involves\na series of gradual adjustments of stimulus conditions and rewards in order to bring\nabout learning. However, training protocols are usually hand-designed, relying\non a combination of intuition, guesswork, and trial-and-error, and often require\nweeks or months to achieve a desired level of task performance. Here we combine\nideas from reinforcement learning and adaptive optimal experimental design to\nformulate methods for adaptive optimal training of animal behavior. Our work\naddresses two intriguing problems at once: \ufb01rst, it seeks to infer the learning rules\nunderlying an animal\u2019s behavioral changes during training; second, it seeks to\nexploit these rules to select stimuli that will maximize the rate of learning toward a\ndesired objective. We develop and test these methods using data collected from rats\nduring training on a two-interval sensory discrimination task. We show that we can\naccurately infer the parameters of a policy-gradient-based learning algorithm that\ndescribes how the animal\u2019s internal model of the task evolves over the course of\ntraining. 
We then formulate a theory for optimal training, which involves selecting\nsequences of stimuli that will drive the animal\u2019s internal policy toward a desired\nlocation in the parameter space. Simulations show that our method can in theory\nprovide a substantial speedup over standard training methods. We feel these results\nwill hold considerable theoretical and practical implications both for researchers in\nreinforcement learning and for experimentalists seeking to train animals.\n\n1\n\nIntroduction\n\nAn important \ufb01rst step in many neuroscience experiments is to train animals to perform a particular\nsensory, cognitive, or motor task. In many cases this training process is slow (requiring weeks to\nmonths) or dif\ufb01cult (resulting in animals that do not successfully learn the task). This increases the\ncost of research and the time taken for experiments to begin, and poorly trained animals\u2014for example,\nanimals that incorrectly base their decisions on trial history instead of the current stimulus\u2014may\nintroduce variability in experimental outcomes, reducing interpretability and increasing the risk of\nfalse conclusions.\nIn this paper, we present a principled theory for the design of normatively optimal adaptive training\nmethods. The core innovation is a synthesis of ideas from reinforcement learning and adaptive\nexperimental design: we seek to reverse engineer an animal\u2019s internal learning rule from its observed\nbehavior in order to select stimuli that will drive learning as quickly as possible toward a desired\nobjective. 
Our approach involves estimating a model of the animal\u2019s internal state as it evolves over\ntraining sessions, including both the current policy governing behavior and the learning rule used to\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1:\n(A) Stimulus space for a 2AFC discrimination task, with optimal separatrix between\ncorrect \u201cleft\u201d and \u201cright\u201d choices shown in red. Filled circles indicate a \u201creduced\u201d set of stimuli\n(consisting of those closest to the decision boundary) which have been used in several prominent\nstudies [3, 6, 9]. (B) Schematic of active training paradigm. We infer the animal\u2019s current weights\nwt and its learning rule (\u201cRewardMax\u201d), parametrized by \u03c6, and use them to determine an optimal\nstimulus xt for the current trial (\u201cAlignMax\u201d), where optimality is determined by the expected weight\nchange towards the target weights wgoal.\n\nmodify this policy in response to feedback. We model the animal as using a policy-gradient based\nlearning rule [15], and show that parameters of this learning model can be successfully inferred from\na behavioral time series dataset collected during the early stages of training. We then use the inferred\nlearning rule to compute an optimal sequence of stimuli, selected adaptively on a trial-by-trial basis,\nthat will drive the animal\u2019s internal model toward a desired state. Intuitively, optimal training involves\nselecting stimuli that maximally align the predicted change in model parameters with the trained\nbehavioral goal, which is de\ufb01ned as a point in the space of model parameters. 
We expect this research\nto provide both practical and theoretical bene\ufb01ts: the adaptive optimal training protocol promises a\nsigni\ufb01cantly reduced training time required to achieve a desired level of performance, while providing\nnew scienti\ufb01c insights into how and what animals learn over the course of the training period.\n\n2 Modeling animal decision-making behavior\n\nLet us begin by de\ufb01ning the ingredients of a generic decision-making task. In each trial, the animal\nis presented with a stimulus x from a bounded stimulus space X, and is required to make a choice\ny among a \ufb01nite set of available responses Y . There is a \ufb01xed reward map r : {X, Y } \u2192 R. It\nis assumed that this behavior is governed by some internal model, or the psychometric function,\ndescribed by a set of parameters or weights w. We introduce the \u201cy-bar\u201d notation \u00afy(x) to indicate the\ncorrect choice for the given stimulus x, and let Xy denote the \u201cstimulus group\u201d for a given y, de\ufb01ned\nas the set of all stimuli x that map to the same correct choice y = \u00afy(x).\nFor concreteness, we consider a two-alternative forced-choice (2AFC) discrimination task where the\nstimulus vector for each trial, x = (x1, x2), consists of a pair of scalar-valued stimuli that are to be\ncompared [6, 8, 9, 16]. The animal should report either x1 > x2 or x1 < x2, indicating its choice\nwith a left (y = L) or right (y = R) movement, respectively. This results in a binary response space,\nY = {L, R}. 
We define the reward function r(x, y) to be a Boolean function that indicates whether a stimulus-response pair corresponds to a correct choice (which should therefore be rewarded) or not:\n\nr(x, y) = 1 if {x_1 > x_2, y = L} or {x_1 < x_2, y = R}, and 0 otherwise. (1)\n\nFigure 1A shows an example 2-dimensional stimulus space for such a task, with circles representing a discretized set of possible stimuli X, and the desired separatrix (the boundary separating the two stimulus groups X_L and X_R) shown in red. In some settings, the experimenter may wish to focus on some “reduced” set of stimuli, as indicated here by filled symbols [3, 6, 9].\nWe model the animal’s choice behavior as arising from a Bernoulli generalized linear model (GLM), also known as the logistic regression model. The choice probabilities of the two possible responses at trial t are given by\n\np_R(x_t, w_t) = 1 / (1 + exp(-g(x_t)^T w_t)), p_L(x_t, w_t) = 1 - p_R(x_t, w_t) (2)\n\nwhere g(x) = (1, x^T)^T is the input carrier vector, and w = (b, a^T)^T is the vector of parameters or weights governing behavior. Here b describes the animal’s internal bias toward choosing “right” (y = R), and a = (a_1, a_2) captures the animal’s sensitivity to the stimulus.1\nWe may also incorporate the trial history as additional dimensions of the input governing the animal’s behavior; humans and animals alike are known to exhibit history-dependent behavior in trial-based tasks [1, 3, 5, 7]. 
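Eqs. (1)-(2) are simple enough to state in code. Below is a minimal Python sketch (not from the paper; the function names `reward` and `choice_prob_R` are ours, and history terms are omitted):

```python
import numpy as np

def choice_prob_R(x, w):
    """P(y = R) under the Bernoulli GLM of Eq. (2).

    x = (x1, x2) is the stimulus pair; w = (b, a1, a2) are the weights.
    The input carrier vector is g(x) = (1, x1, x2)."""
    g = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return 1.0 / (1.0 + np.exp(-g @ np.asarray(w, dtype=float)))

def reward(x, y):
    """Boolean reward map of Eq. (1): 1 iff the comparison is reported
    correctly (ties x1 == x2 are excluded from the stimulus grid)."""
    correct = 'L' if x[0] > x[1] else 'R'
    return 1 if y == correct else 0
```

For example, with weights w = (0, -10, 10), a stimulus with x1 > x2 yields a choice_prob_R close to 0, i.e. a near-certain (correct) "left" report.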
Based on some preliminary observations from animal behavior (see Supplementary Material for details), we encode the trial history as a compressed stimulus history, using a binary variable ε_ȳ(x) defined as ε_L = -1 and ε_R = +1. Taking into account the previous d trials, the input carrier vector and the weight vector become:\n\ng(x_t) → (1, x_t^T, ε_ȳ(x_{t-1}), ..., ε_ȳ(x_{t-d}))^T, w_t → (b, a^T, h_1, ..., h_d). (3)\n\nThe history dependence parameter h_d describes the animal’s tendency to stick to the correct answer from the previous trial (d trials back). Because varying the number of history terms d gives a family of psychometric models, determining the optimal d is a well-defined model selection problem.\n\n3 Estimating time-varying psychometric function\n\nIn order to drive the animal’s performance toward a desired objective, we first need a framework to describe, and accurately estimate, the time-varying model parameters of the animal’s behavior, which is fundamentally non-stationary while training is in progress.\n\n3.1 Constructing the random walk prior\nWe assume that the single-step weight change at each trial t follows a random walk, w_t - w_{t-1} = ξ_t, where ξ_t ~ N(0, σ_t^2), for t = 1, ..., N. Let w_0 be some prior mean for the initial weight. We assume σ_2 = ... = σ_N = σ, which is to say that although the behavior is variable, the variability of the behavior is a constant property of the animal. 
We can write this more concisely using a state-space representation [2, 11], in terms of the vector of time-varying weights w = (w_1, w_2, ..., w_N)^T and its prior mean w_0 = w_0 1:\n\nD(w - w_0) = ξ ~ N(0, Σ), (4)\n\nwhere Σ = diag(σ_1^2, σ^2, ..., σ^2) is the N × N covariance matrix, and D is the sparse banded matrix with the first row of an identity matrix and subsequent rows computing first-order differences. Rearranging, the full random walk prior on the N-dimensional vector w is\n\nw ~ N(w_0, C), where C^{-1} = D^T Σ^{-1} D. (5)\n\nIn many practical cases there are multiple weights in the model, say K weights. The full set of parameters should now be arranged into an N × K array of weights {w_ti}, where the two subscripts consistently indicate the trial number (t = 1, ..., N) and the type of parameter (i = 1, ..., K), respectively. This gives a matrix\n\nW = {w_ti} = (w_*1, ..., w_*i, ..., w_*K) = (w_1*, ..., w_t*, ..., w_N*)^T (6)\n\nwhere we denote the vector of all weights at trial t as w_t* = (w_t1, w_t2, ..., w_tK)^T, and the time series of the i-th weight as w_*i = (w_1i, w_2i, ..., w_Ni)^T.\nLet w = vec(W) = (w_*1^T, ..., w_*K^T)^T be the vectorization of W, a long vector with the columns of W stacked together. Equation (5) still holds for this extended weight vector w, where the extended D and Σ are written as block diagonal matrices D = diag(D_1, D_2, ..., D_K) and Σ = diag(Σ_1, Σ_2, ..., Σ_K), respectively, where D_i is the weight-specific N × N difference matrix and Σ_i is the corresponding covariance matrix. 
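As a concrete illustration of Eqs. (4)-(5), the inverse prior covariance C^{-1} = D^T Σ^{-1} D can be assembled directly. This is our own sketch for the single-weight (K = 1) case; dense matrices are used for brevity where the paper notes D is sparse banded:

```python
import numpy as np

def random_walk_prior_inverse_cov(N, sigma1, sigma):
    """Build C^{-1} = D^T Sigma^{-1} D from Eqs. (4)-(5).

    D has the first row of the identity; each later row computes the
    first-order difference w_t - w_{t-1}.
    Sigma = diag(sigma1^2, sigma^2, ..., sigma^2)."""
    D = np.eye(N) - np.diag(np.ones(N - 1), k=-1)
    variances = np.full(N, sigma ** 2)
    variances[0] = sigma1 ** 2
    return D.T @ np.diag(1.0 / variances) @ D
```

Inverting the result recovers the familiar random-walk covariance: for N = 3, σ_1 = 2, σ = 1, the diagonal of C is (4, 5, 6), i.e. Var(w_t) = σ_1² + (t − 1)σ².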
Within a linear model one can freely renormalize the units of the stimulus space in order to keep the sizes of all weights comparable, and keep all Σ_i’s equal. We used a transformed stimulus space in which the center is at 0 and the standard deviation is 1.\n\n1 We use a convention in which a single-indexed tensor object is automatically represented as a column vector (in boldface notation), and the operation (·, ·, ...) concatenates objects horizontally.\n\n3.2 Log likelihood\n\nLet us denote the log likelihood of the observed data by L = Σ_{t=1}^N L_t, where L_t = log p(y_t|x_t, w_t*) is the trial-specific log likelihood. Within the binomial model we have\n\nL_t = (1 - δ_{y_t,R}) log(1 - p_R(x_t, w_t*)) + δ_{y_t,R} log p_R(x_t, w_t*). (7)\n\nAbbreviating p_R(x_t, w_t*) = p_t and p_L(x_t, w_t*) = 1 - p_t, the trial-specific derivatives are ∂L_t/∂w_t* = (δ_{y_t,R} - p_t) g(x_t) ≡ Δ_t and ∂²L_t/∂w_t*∂w_t* = -p_t(1 - p_t) g(x_t) g(x_t)^T ≡ Λ_t. Extension to the full weight vector is straightforward because distinct trials do not interact. Working out the indices, we may write\n\n∂L/∂w = vec([Δ_1, ..., Δ_N]^T), ∂²L/∂w² = [M_11, M_12, ..., M_1K; M_21, M_22, ..., M_2K; ...; M_K1, M_K2, ..., M_KK] (8)\n\nwhere the (i, j)-th block of the full second derivative matrix is an N × N diagonal matrix defined by M_ij = ∂²L/∂w_*i ∂w_*j = diag((Λ_1)_ij, ..., (Λ_t)_ij, ..., (Λ_N)_ij). After this point, we can simplify our notation such that w_t = w_t*. 
The weight-type-specific w_*i will no longer appear.\n\n3.3 MAP estimate of w\n\nThe posterior distribution of w is a combination of the prior and the likelihood (Bayes’ rule):\n\nlog p(w|D) ~ (1/2 log|C^{-1}| - 1/2 (w - w_0)^T C^{-1} (w - w_0)) + L. (9)\n\nWe can perform a numerical maximization of the log posterior using Newton’s method (we used the Matlab function fminunc), knowing its gradient j and Hessian H explicitly:\n\nj = ∂(log p)/∂w = -C^{-1}(w - w_0) + ∂L/∂w, H = ∂²(log p)/∂w² = -C^{-1} + ∂²L/∂w². (10)\n\nThe maximum a posteriori (MAP) estimate ŵ is where the gradient vanishes, j(ŵ) = 0. If we work with a Laplace approximation, the posterior covariance is Cov = -H^{-1} evaluated at w = ŵ.\n\n3.4 Hyperparameter optimization\n\nThe model hyperparameters consist of σ_1, governing the variance of w_1, the weights on the first trial of a session, and σ, governing the variance of the trial-to-trial diffusive change of the weights. To set these hyperparameters, we fixed σ_1 to a large default value, and used maximum marginal likelihood or “evidence optimization” over a fixed grid of σ [4, 11, 13]. The marginal likelihood is given by:\n\np(y|x, σ) = ∫ dw p(y|x, w) p(w|σ) = p(y|x, w) p(w|σ) / p(w|x, y, σ) ≈ exp(L) · N(w|w_0, C) / N(w|ŵ, -H^{-1}), (11)\n\nwhere ŵ is the MAP estimate of the entire vector of time-varying weights and H is the Hessian of the log-posterior over w at its mode. This formula for marginal likelihood results from the well-known Laplace approximation to the posterior [11, 12]. 
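For intuition, the Newton iteration of Eqs. (9)-(10) can be written in a few lines for the stimulus-free special case (a single time-varying weight and no stimulus term). The paper used Matlab's fminunc; this Python sketch, with our own function name, is illustrative only:

```python
import numpy as np

def map_timevarying_weights(y, Cinv, w0=0.0, n_iter=50):
    """MAP estimate of a time-varying scalar weight w_t for the stimulus-free
    logistic model y_t ~ Bernoulli(1 / (1 + exp(-w_t))), via Newton's method
    on the log posterior of Eq. (9). Returns (w_hat, H); by the Laplace
    approximation the posterior covariance is -H^{-1} at the mode."""
    N = len(y)
    y = np.asarray(y, dtype=float)
    w = np.full(N, float(w0))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-w))            # P(y_t = 1) at current w
        grad = -Cinv @ (w - w0) + (y - p)       # gradient j of Eq. (10)
        H = -Cinv - np.diag(p * (1.0 - p))      # Hessian of Eq. (10)
        w = w - np.linalg.solve(H, grad)        # Newton step
    return w, H
```

Maximizing the marginal likelihood of Eq. (11) then amounts to repeating this fit over a grid of σ and comparing the resulting Laplace evidence values.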
We found the estimate to be insensitive to σ_1 so long as it is sufficiently large.\n\n3.5 Application\n\nWe tested our method using a simulation, drawing binary responses from a stimulus-free GLM y_t ~ logistic(w_t), where w_t was diffused as w_{t+1} ~ N(w_t, σ^2) with a fixed hyperparameter σ. Given the time series of responses {y_t}, our method captures the true σ through evidence maximization, and provides a good estimate of the time-varying w = {w_t} (Figure 2A). Whereas the estimate of the weight w_t is robust over independent realizations of the responses, the instantaneous weight changes Δw = w_{t+1} - w_t are not reproducible across realizations (Figure 2B). Therefore it is difficult to analyze the trial-to-trial weight changes directly from real data, where only one realization of the learning process is accessible.\n\nFigure 2: Estimating time-varying model parameters. (A-B) Simulation: (A) Our method captures the true underlying variability σ by maximizing evidence. (B) Weight estimates are accurate and robust over independent realizations of the responses, but weight changes across realizations are not reproducible. (C-E) From the choice behavior of a rat under training, we could (C) estimate the time-varying weights of its psychometric model, and (D) determine the characteristic variability by evidence maximization. (E) The number of history terms to be included in the model was determined by comparing the BIC, using the early/mid/late parts of the rat dataset. Because log-likelihood is calculated up to a constant normalization, both log-evidence and BIC are shown in relative values.\n\nWe also applied our method to an actual experimental dataset from rats during the early training period for a 2AFC discrimination task, as introduced in Section 2 (using classical training methods [3]; see Supplementary Material for a detailed description). 
We estimated the time-varying weights of the GLM (Figure 2C), and estimated the characteristic variability of the rat behavior, σ_rat = 2^-7, by maximizing marginal likelihood (Figure 2D). To determine the length d of the trial history dependence, we fit models with varying d and used the Bayesian Information Criterion BIC(d) = -2 log L(d) + K(d) log N (Figure 2E). We found that animal behavior exhibits long-range history dependence at the beginning of training, but this dependence becomes shorter as training progresses. Near the end of the dataset, the behavior of the rat is best described by d_rat = 1 (single-trial history dependence), and we use this value for the remainder of our analyses.\n\n4 Incorporating learning\n\nThe fact that animals show improved performance as training progresses suggests that we need a non-random component in our model that accounts for learning. We will first introduce a simple model of weight change based on ideas from reinforcement learning, then discuss how we can incorporate the learning model into our time-varying estimation method.\nA good candidate model for animal learning is the policy gradient update from reinforcement learning, for example as in [15]. There are debates as to whether animals actually learn using policy-based methods, but it is difficult to define a reasonable value function that is consistent with our preliminary observations of rat behavior (e.g. win-stay/lose-switch). A recent experimental study supports the use of policy-based models in human learning behavior [10].\n\n4.1 RewardMax model of learning (policy gradient update)\n\nHere we consider a simple model of learning, in which the learner attempts to update its policy (here the weight parameters in the model) to maximize the expected reward. 
Given some fixed reward function r(x, y), the expected reward at the upcoming trial t is defined as\n\nρ(w_t) = ⟨ ⟨r(x_t, y_t)⟩_{p(y_t|x_t,w_t)} ⟩_{P_X(x_t)} (12)\n\nwhere P_X(x_t) reflects the subject animal’s knowledge as to the probability that a given stimulus x will be presented at trial t, which may be dynamically updated.\n\nFigure 3: Estimating the learning model. (A-B) Simulated learner with σ_sim = α_sim = 2^-7. (A) The four weight parameters of the simulated model are successfully recovered by our MAP estimate with the learning effect incorporated, where (B) the learning rate α is accurately determined by evidence maximization. (C) Evidence maximization analysis on the rat training dataset reveals σ_rat = 2^-6 and α_rat = 2^-10. Displayed is a color plot of log evidence on the hyperparameter plane (in relative values). The optimal set of hyperparameters is marked with a star.\n\nOne way to construct the empirical P_X is to accumulate the stimulus statistics up to some timescale τ ≥ 0; here we restrict to the simplest limit τ = 0, where only the most recent stimulus is remembered. 
That is, P_X(x_t) = δ(x_t - x_{t-1}). In practice ρ can be evaluated at w_t = w_{t-1}, the posterior mean from previous observations.\nUnder the GLM (2), the choice probability is p(y|x, w) = 1/(1 + exp(-ε_y g(x)^T w)), where ε_L = -1 and ε_R = +1, trial index suppressed. Therefore the expected reward can be written out explicitly, as well as its gradient with respect to w:\n\n∂ρ/∂w = Σ_{x∈X} P_X(x) f(x) p_R(x, w) p_L(x, w) g(x) (13)\n\nwhere we define the effective reward function f(x) ≡ Σ_{y∈Y} ε_y r(x, y) for each stimulus. In the spirit of the policy gradient update, we consider the RewardMax model of learning, which assumes that the animal will try to climb up the gradient of the expected reward by\n\nΔw_t = α (∂ρ/∂w)|_t ≡ v(w_t, x_t; φ), (14)\n\nwhere Δw_t = (w_{t+1} - w_t). In this simplest setting, the learning rate α is the only learning hyperparameter, φ = {α}. The model can be extended by incorporating more realistic aspects of learning, such as a non-isotropic learning rate, a rate of weight decay (forgetting), or a skewness between experienced and unexperienced rewards. For more discussion, see Supplementary Material.\n\n4.2 Random walk prior with drift\n\nBecause our observation of a given learning process is stochastic and the estimate of the weight change is not robust (Figure 2B), it is difficult to test the learning rule (14) on any individual dataset. However, we can still assume that the learning rule underlies the observed weight changes as\n\n⟨Δw⟩ = v(w, x; φ) (15)\n\nwhere the average ⟨·⟩ is over hypothetical repetitions of the same learning process. This effect of non-random learning can be incorporated into our random walk prior as a drift term, to make a fully Bayesian model for an imperfect learner. 
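To make Eqs. (13)-(14) concrete, here is our own sketch of one RewardMax update for the 2AFC task, using f(x) = r(x, R) - r(x, L) = ±1 and the carrier g(x) = (1, x1, x2) (history terms omitted; the function names are hypothetical):

```python
import numpy as np

def reward_gradient(w, stimuli, PX):
    """Expected-reward gradient of Eq. (13):
    d rho / dw = sum_x P_X(x) f(x) p_R(x, w) p_L(x, w) g(x)."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for x, px in zip(stimuli, PX):
        g = np.concatenate(([1.0], np.asarray(x, dtype=float)))
        pR = 1.0 / (1.0 + np.exp(-g @ w))
        f = 1.0 if x[1] > x[0] else -1.0   # effective reward: +1 iff x1 < x2 (R rewarded)
        grad += px * f * pR * (1.0 - pR) * g
    return grad

def rewardmax_step(w, stimuli, PX, alpha):
    """One RewardMax policy-gradient update, Eq. (14): w <- w + alpha * d rho / dw."""
    return np.asarray(w, dtype=float) + alpha * reward_gradient(w, stimuli, PX)
```

Iterating rewardmax_step on a symmetric stimulus set drives a_1 negative and a_2 positive, matching the sign pattern of the target weights (0, -10, 10, 0) used later, while by symmetry the bias b stays at zero.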
The new weight update prior is written as D(w - w_0) = v + ξ, where v is the “drift velocity” and ξ ~ N(0, Σ) is the noise. The modified prior is\n\nw - D^{-1}v ~ N(w_0, C), C^{-1} = D^T Σ^{-1} D. (16)\n\nEquations (9-10) can be re-written with the additional term D^{-1}v. For the RewardMax model v = α ∂ρ/∂w, in particular, the first and second derivatives of the modified log posterior can be written out analytically. Details can be found in Supplementary Material.\n\n4.3 Application\n\nTo test the model with drift, a simulated RewardMax learner was generated, based on the same task structure as in the rat experiment. The two hyperparameters {σ_sim, α_sim} were chosen such that the resulting time series data is qualitatively similar to the rat data. The simulated learning model can be recovered by maximizing the evidence (11), now with the learning hyperparameter α as well as the variability σ. The solution accurately reflects the true α_sim, shown where σ is fixed at the true σ_sim (Figures 3A-3B). Likewise, the learning model of a real rat was obtained by performing a grid search on the full hyperparameter plane {σ, α}. We get σ_rat = 2^-6 and α_rat = 2^-10 (Figure 3C).2\nCan we determine whether the rat’s behavior is in a regime where the effect of learning dominates the effect of noise, or vice versa? The obtained values of σ and α depend on our choice of units, which is arbitrary; more precisely, α ~ [w^2] and σ ~ [w], where [w] scales as the weight. 
Dimensional analysis suggests a (dimensionless) order parameter β = α/σ^2, where β ≫ 1 would indicate a regime where the effect of learning is larger than the effect of noise. Our estimate of the hyperparameters gives β_rat = α_rat/σ_rat^2 ≈ 4, which leaves us optimistic.\n\n5 AlignMax: Adaptive optimal training\n\nWhereas the goal of the learner/trainee is (presumably) to maximize the expected reward, the trainer’s goal is to drive the behavior of the trainee as close as possible to some fixed model that corresponds to a desirable, yet hypothetically achievable, performance. Here we propose a simple algorithm that aims to align the expected model parameter change of the trainee, ⟨Δw_t⟩ = v(w_t, x_t; φ), towards a fixed goal w_goal. We can summarize this in an AlignMax training formula\n\nx_{t+1} = argmax_x (w_goal - w_t)^T ⟨Δw_t⟩. (17)\n\nLooking at Equations (13), (14) and (17), it is worth noting that g(x) puts a heavier weight on more distinguishable or “easier” stimuli (exploitation), while p_L p_R puts more weight on more difficult stimuli, with more uncertainty (exploration); an exploitation-exploration tradeoff emerges naturally.\nWe tested the AlignMax training protocol3 using a simulated learner with fixed hyperparameters α_sim = 0.005 and σ_sim = 0, using w_goal = (b, a_1, a_2, h)_goal = (0, -10, 10, 0) in the current paradigm. We chose a noise-free learner for clear visualization, but the algorithm works as well in the presence of noise (σ > 0; see Supplementary Material for a simulated noisy learner). As expected, our AlignMax algorithm achieves much faster training than the usual algorithm where stimuli are presented randomly (Figure 4). 
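The selection rule of Eq. (17) reduces to a one-line argmax once the predicted RewardMax update is available. The following is our own simplified sketch, with P_X collapsed onto the single candidate stimulus and history terms omitted, so it ignores the (d + 1)-step lookahead discussed in the footnote:

```python
import numpy as np

def alignmax_stimulus(w, w_goal, stimuli, alpha=1.0):
    """AlignMax rule of Eq. (17): choose the stimulus whose predicted weight
    change <dw> = alpha * f(x) p_R p_L g(x) (Eqs. 13-14, P_X concentrated
    on x) best aligns with the remaining error (w_goal - w)."""
    w = np.asarray(w, dtype=float)
    w_goal = np.asarray(w_goal, dtype=float)
    best_x, best_score = None, -np.inf
    for x in stimuli:
        g = np.concatenate(([1.0], np.asarray(x, dtype=float)))
        pR = 1.0 / (1.0 + np.exp(-g @ w))
        f = 1.0 if x[1] > x[0] else -1.0        # effective reward f(x)
        dw = alpha * f * pR * (1.0 - pR) * g    # predicted RewardMax update
        score = (w_goal - w) @ dw
        if score > best_score:
            best_x, best_score = x, score
    return best_x
```

For a naive learner (w = 0) this picks the easiest stimuli (far from the separatrix), and as w approaches w_goal it shifts to the hardest ones, illustrating the exploitation-exploration tradeoff noted above.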
The task performance was measured in terms of the success rate, the expected reward (12), and the Kullback-Leibler (KL) divergence. The KL divergence is defined as D_KL = Σ_{x∈X} P_X(x) Σ_{y∈Y} p̂_y(x) log(p̂_y(x)/p_y(x)), where p̂_y(x) = r(x, y) is the “correct” psychometric function; a smaller value of D_KL indicates a behavior that is closer to the ideal. Both the expected reward and the KL divergence were evaluated using a uniform stimulus distribution P_X(x). The low success rate is a distinctive feature of the adaptive training algorithm, which selects adversarial stimuli such that “lazy flukes” are actively prevented (e.g. such that a left-biased learner would not get thoughtless rewards from the left side). It is notable that the AlignMax training eliminates the bias b and the history dependence h (the two stimulus-independent parameters) much more quickly than the conventional (random) algorithm, as shown in Figure 4A.\nTwo general rules were observed from the optimal trainer. First, while the history dependence h is non-zero, AlignMax alternates between different stimulus groups in order to suppress the win-stay behavior; once h vanishes, AlignMax tries to neutralize the bias b by presenting more stimuli from the “non-preferred” stimulus group, yet being careful not to re-install the history dependence. For example, it would give LLRLLR... for an R-biased trainee. This suggests that a pre-defined, non-adaptive de-biasing algorithm may be problematic, as it may reinforce an unwanted history dependence (see Supp. Figure S1). Second, AlignMax exploits the full stimulus space by starting from some “easier” stimuli in the early stage of training (farther away from the true separatrix x_1 = x_2), and presenting progressively more difficult stimuli (closer to the separatrix) as the trainee performance improves. This suggests that using the reduced stimulus space may be suboptimal for training purposes. 
Indeed, training was faster on the full stimulus plane than on the reduced set (Figures 4B-4C).\n\n2 Based on a 2000-trial subset of the rat dataset.\n3 When implementing the algorithm within the current task paradigm, because of the way we model the history variable as part of the stimulus, it is important to allow the algorithm to choose up to d + 1 future stimuli, in this case as a pair {x_{t+1}, x_{t+2}}, in order to generate a desired pattern of trial history.\n\nFigure 4: AlignMax training (solid lines) compared to random training (dashed lines), for a simulated noise-free learner. (A) Weights evolving as training progresses, shown from a simulated training on the full stimulus space shown in Figure 1A. (B-C) Performance measured in terms of the success rate (moving average over 500 trials), the expected reward, and the KL divergence. The simulated learner was trained either (B) in the full stimulus space, or (C) in the reduced stimulus space. The low success rate is a natural consequence of the active training algorithm, which tends to select adversarial stimuli to facilitate learning.\n\n6 Discussion\n\nIn this work, we have formulated a theory for designing an optimal training protocol of animal behavior, which works adaptively to drive the current internal model of the animal toward a desired, pre-defined objective state. To this end, we have first developed a method to accurately estimate the time-varying parameters of the psychometric model directly from the animal’s behavioral time series, while characterizing the intrinsic variability σ and the learning rate α of the animal by empirical Bayes. 
Interestingly, a dimensional analysis based on our estimate of the learning model suggests that the rat indeed lives in a regime where the effect of learning is stronger than the effect of noise.
Our method for inferring the learning model from data differs from many conventional approaches to inverse reinforcement learning, which also seek to infer the underlying learning rules from externally observable behavior but usually rely on the stationarity of the policy or the value function. In contrast, our method works directly on the non-stationary behavior. Our technical contribution is twofold: first, building on the existing framework for estimation of state-space vectors [2, 11, 14], we provide a case in which the parameters of a non-stationary model are successfully inferred from real time-series data; second, we develop a natural extension of the existing Bayesian framework in which non-random model change (learning) is incorporated into the prior information.
The AlignMax optimal trainer provides important insights into the general principles of effective training, including a balanced strategy to neutralize both the bias and the history dependence of the animal, and a dynamic tradeoff between difficult and easy stimuli that makes efficient use of a broad range of the stimulus space. There are, however, two potential issues that may be detrimental to the practical success of the algorithm. First, the animal may suffer a loss of motivation due to the low success rate, which is a natural consequence of the adaptive training algorithm. Second, as with any model-based approach, mismatch of either the psychometric model (logistic, or any generalization thereof) or the learning model (RewardMax) may result in poor performance of the training algorithm. These issues remain to be tested in real training experiments. Otherwise, the algorithm is readily applicable.
We expect it to provide both a significant reduction in training time and a set of reliable measures for evaluating training progress, powered by direct access to the internal learning model of the animal.

Acknowledgments

JHB was supported by the Samsung Scholarship and the NSF PoLS program. JWP was supported by grants from the McKnight Foundation, the Simons Collaboration on the Global Brain (SCGB AWD1004351), and the NSF CAREER Award (IIS-1150186). We thank Nicholas Roy for the careful reading of the manuscript.

References

[1] A. Abrahamyan, L. L. Silva, S. C. Dakin, M. Carandini, and J. L. Gardner. Adaptable history biases in human perceptual decisions. Proc. Nat. Acad. Sci., 113(25):E3548–E3557, 2016.

[2] Y. Ahmadian, J. W. Pillow, and L. Paninski. Efficient Markov chain Monte Carlo methods for decoding neural spike trains. Neural Computation, 23(1):46–96, 2011.

[3] A. Akrami, C. Kopec, and C. Brody. Trial history vs. sensory memory – a causal study of the contribution of rat posterior parietal cortex (PPC) to history-dependent effects in working memory. Society for Neuroscience Abstracts, 2016.

[4] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.

[5] L. Busse, A. Ayaz, N. T. Dhruv, S. Katzner, A. B. Saleem, M. L. Schölvinck, A. D. Zaharia, and M. Carandini. The detection of visual contrast in the behaving mouse. J. Neurosci., 31(31):11351–11361, 2011.

[6] A. Fassihi, A. Akrami, V. Esmaeili, and M. E. Diamond. Tactile perception and working memory in rats and humans. Proc. Nat. Acad. Sci., 111(6):2331–2336, 2014.

[7] I. Fründ, F. A. Wichmann, and J. H. Macke. Quantifying the effect of intertrial dependence on perceptual decisions. J. Vision, 14(7):9–9, 2014.

[8] D. M. Green and J. A. Swets. Signal Detection Theory and Psychophysics. Wiley, New York, 1966.

[9] A. Hernández, E. Salinas, R.
García, and R. Romo. Discrimination in the sense of flutter: new psychophysical measurements in monkeys. J. Neurosci., 17(16):6391–6400, 1997.

[10] J. Li and N. D. Daw. Signals in human striatum are appropriate for policy update rather than value prediction. J. Neurosci., 31(14):5504–5511, 2011.

[11] L. Paninski, Y. Ahmadian, D. G. Ferreira, S. Koyama, K. Rahnama Rad, M. Vidne, J. Vogelstein, and W. Wu. A new look at state-space models for neural data. J. Comp. Neurosci., 29(1):107–126, 2010.

[12] J. W. Pillow, Y. Ahmadian, and L. Paninski. Model-based decoding, information estimation, and change-point detection techniques for multineuron spike trains. Neural Computation, 23(1):1–45, 2011.

[13] M. Sahani and J. F. Linden. Evidence optimization techniques for estimating stimulus-response functions. In S. Becker, S. Thrun, and K. Obermayer, editors, Adv. Neur. Inf. Proc. Sys. 15, pages 317–324. MIT Press, 2003.

[14] A. C. Smith, L. M. Frank, S. Wirth, M. Yanike, D. Hu, Y. Kubota, A. M. Graybiel, W. A. Suzuki, and E. N. Brown. Dynamic analysis of learning in behavioral experiments. J. Neurosci., 24(2):447–461, 2004.

[15] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, and K. Müller, editors, Adv. Neur. Inf. Proc. Sys. 12, pages 1057–1063. MIT Press, 2000.

[16] C. W. Tyler and C.-C. Chen. Signal detection theory in the 2AFC paradigm: Attention, channel uncertainty and probability summation.
Vision Research, 40(22):3121–3144, 2000.