{"title": "Threshold Learning for Optimal Decision Making", "book": "Advances in Neural Information Processing Systems", "page_first": 3763, "page_last": 3771, "abstract": "Decision making under uncertainty is commonly modelled as a process of competitive stochastic evidence accumulation to threshold (the drift-diffusion model). However, it is unknown how animals learn these decision thresholds. We examine threshold learning by constructing a reward function that averages over many trials to Wald's cost function that defines decision optimality. These rewards are highly stochastic and hence challenging to optimize, which we address in two ways: first, a simple two-factor reward-modulated learning rule derived from Williams' REINFORCE method for neural networks; and second, Bayesian optimization of the reward function with a Gaussian process. Bayesian optimization converges in fewer trials than REINFORCE but is slower computationally with greater variance. The REINFORCE method is also a better model of acquisition behaviour in animals and a similar learning rule has been proposed for modelling basal ganglia function.", "full_text": "Threshold Learning for Optimal Decision Making\n\nNathan F. Lepora\n\nDepartment of Engineering Mathematics, University of Bristol, UK\n\nn.lepora@bristol.ac.uk\n\nAbstract\n\nDecision making under uncertainty is commonly modelled as a process of com-\npetitive stochastic evidence accumulation to threshold (the drift-diffusion model).\nHowever, it is unknown how animals learn these decision thresholds. We examine\nthreshold learning by constructing a reward function that averages over many trials\nto Wald\u2019s cost function that de\ufb01nes decision optimality. 
These rewards are highly\nstochastic and hence challenging to optimize, which we address in two ways: \ufb01rst,\na simple two-factor reward-modulated learning rule derived from Williams\u2019 RE-\nINFORCE method for neural networks; and second, Bayesian optimization of the\nreward function with a Gaussian process. Bayesian optimization converges in fewer\ntrials than REINFORCE but is slower computationally with greater variance. The\nREINFORCE method is also a better model of acquisition behaviour in animals\nand a similar learning rule has been proposed for modelling basal ganglia function.\n\n1\n\nIntroduction\n\nThe standard view of perceptual decision making across psychology and neuroscience is of a\ncompetitive process that accumulates sensory evidence for the choices up to a threshold (bound)\nthat triggers the decision [1, 2, 3]. While there is debate about whether humans and animals are\n\u2018optimal\u2019, nonetheless the standard psychological model of this process for two-alternative forced\nchoices (the drift-diffusion model [1]) is a special case of an optimal statistical test for selecting\nbetween two hypotheses (the sequential probability ratio test, or SPRT [4]). Formally, this sequential\ntest optimizes a cost function linear in the decision time and type I/II errors averaged over many\ntrials [4]. Thus, under broad assumptions about the decision process, the optimal behaviour is simply\nto stop gathering data after reaching a threshold independent of the data history and collection time.\nHowever, there remains the problem of how to set these decision thresholds. While there is consensus\nthat an animal tunes its decision making by maximizing mean reward ([3, Chapter 5],[5, 6, 7, 8, 9, 10]),\nthe learning rule is not known. 
More generally, it is unknown how an animal tunes its propensity towards making choices while also tuning its overall speed-accuracy balance.

Here we show that optimization of the decision thresholds can be considered as reinforcement learning over single trial rewards derived from Wald's trial averaged cost function considered previously. However, these single trial rewards are highly stochastic and their average has a broad flat peak (Fig. 1B), constituting a challenging optimization problem that will defeat standard methods. We address this challenge by proposing two distinct ways to learn the decision thresholds, with one approach closer to learning rules from neuroscience and the other to machine learning. The first approach is a learning rule derived from Williams' REINFORCE algorithm for training neural networks [11], which we here combine with an appropriate policy for controlling the thresholds for optimal decision making. The second is a Bayesian optimization method that fits a Gaussian process to the reward function and samples according to the mean reward and reward variance [12, 13, 14]. We find that both methods can successfully learn the thresholds, as validated by comparison against an exhaustive optimization of the reward function. Bayesian optimization converges in fewer trials (∼10²) than REINFORCE (∼10³) but is 100-times more computationally expensive with about triple the variance in the threshold estimates. Initial validation is with one decision threshold, corresponding to equal costs of type I/II errors. The methods scale well to two thresholds (unequal costs), and we use REINFORCE to map the full decision performance over both costs. Finally, we compare both methods with experimental two-alternative forced choice data, and find that REINFORCE gives a better account of the acquisition (learning) phase, such as converging over a similar number of trials.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: (A) Drift-diffusion model, representing a noisy stochastic accumulation until reaching a threshold when the decision is made. The optimal threshold maximizes the mean reward (equation 5). (B) Sampled rewards over 1000 trials with equal thresholds θ0 = θ1 (dotted markers); the average reward function is estimated from Gaussian process regression (red curve). Optimizing the thresholds is a challenging problem, particularly when the two thresholds are not equal.

2 Background to the drift-diffusion model and SPRT

The drift-diffusion model (DDM) of Ratcliff and colleagues is a standard approach for modeling the results of two-alternative forced choice (2AFC) experiments in psychophysics [1, 15]. A decision variable z(t) represents the sensory evidence accumulated to time t from a starting bias z(0) = z0. Discretizing time in uniform steps (assumed integer without losing generality), the update equation is

z(t + 1) = z(t) + Δz,    Δz ∼ N(μ, σ²),    (1)

where Δz is the increment of sensory evidence at time t, which is conventionally assumed drawn from a normal distribution N(μ, σ²) of mean μ and variance σ². The decision criterion is that the accumulated evidence crosses one of two decision thresholds, assumed at −θ0 < 0 < θ1.

Wald's sequential probability ratio test (SPRT) optimally determines whether one of two hypotheses H0, H1 is supported by gathering samples x(t) until a confident decision can be made [4]. It is optimal in that it minimizes the average sample size among all sequential tests to the same error probabilities.
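As a concrete illustration, the accumulation process (1) can be simulated per trial. This is a minimal Python sketch with function and parameter names of our own, not the paper's implementation (the paper's own code was written in MATLAB):

```python
import random

def ddm_trial(mu, sigma, theta0, theta1, rng=random):
    """Simulate one drift-diffusion trial (equation 1): evidence z
    accumulates in unit time steps with increments drawn from
    N(mu, sigma^2) until crossing -theta0 or +theta1.

    Returns (decision, T): decision 1 if z >= theta1, 0 if
    z <= -theta0, together with the stopping time T."""
    z, t = 0.0, 0
    while True:
        z += rng.gauss(mu, sigma)  # increment of sensory evidence
        t += 1
        if z >= theta1:
            return 1, t
        if z <= -theta0:
            return 0, t

# Example: a positive drift makes decision 1 the most likely outcome;
# drift magnitude 1/3 and sigma = 1 match the SPRT-equivalent DDM of Section 4.
rng = random.Random(0)
decisions = [ddm_trial(1 / 3, 1.0, 5.0, 5.0, rng) for _ in range(100)]
```

With these parameters most simulated decisions follow the drift direction, but with highly variable stopping times; this stochasticity is what makes the threshold optimization considered below challenging.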
The SPRT can be derived from applying Bayes' rule recursively to sampled data, from when the log posterior ratio log PR(t) passes one of two decision thresholds −θ0 < 0 < θ1:

log PR(t + 1) = log PR(t) + log LR(t),    PR(t) = p(H1|x(t)) / p(H0|x(t)),    LR(t) = p(x(t)|H1) / p(x(t)|H0),    (2)

beginning from priors at time zero: PR(0) = p(H1)/p(H0). The right-hand side of equation (2) can also be written as a log likelihood ratio log LR(t) summed over time t (by iterative substitution).

The DDM is recognized as a special case of SPRT by setting the likelihoods as two equi-variant Gaussians N(μ1, σ), N(μ0, σ), so that

log [p(x|H1) / p(x|H0)] = log [e^{−(x−μ1)²/2σ²} / e^{−(x−μ0)²/2σ²}] = (Δμ/σ²) x + d,    Δμ = μ1 − μ0,    d = (μ0² − μ1²) / 2σ².    (3)

The integrated evidence z(t) in (1) then coincides with the log posterior ratio in (2) and the increments Δz with the log likelihood ratio in (2).

3 Methods to optimize the decision threshold

3.1 Reinforcement learning for optimal decision making

A general statement of decision optimality can be made in terms of minimizing the Bayes risk [4]. This cost function is linear in the type I and II error probabilities α1 = P(H1|H0) = E1(e) and α0 = P(H0|H1) = E0(e), where the decision error e = {0, 1} for correct/incorrect trials, and is also linear in the expected stopping times for each decision outcome¹

Crisk := ½ (W0 α0 + c E0[T]) + ½ (W1 α1 + c E1[T]),    (4)

with type I/II error costs W0, W1 > 0 and cost of time c.
That the Bayes risk Crisk has a unique minimum follows from the error probabilities α0, α1 monotonically decreasing and the expected stopping times E0[T], E1[T] monotonically increasing with increasing threshold θ0 or θ1. For each pair (W0/c, W1/c), there is thus a unique threshold pair (θ0*, θ1*) that minimizes Crisk.

We introduce reward into the formalism by supposing that an application of the SPRT with thresholds (θ0, θ1) has a penalty proportional to the stopping time T and decision outcome

R = { −W0 − cT,   incorrect decision of hypothesis H0
    { −W1 − cT,   incorrect decision of hypothesis H1
    { −cT,        correct decision of hypothesis H0 or H1.    (5)

Over many decision trials, the average reward is thus ⟨R⟩ = −Crisk, the negative of the Bayes risk. Reinforcement learning can then be used to find the optimal thresholds to maximize reward and thus optimize the Bayes risk. Over many trials n = 1, 2, . . . , N with reward R(n), the problem is to estimate these optimal thresholds (θ0*, θ1*) while maintaining minimal regret: the difference between the reward sum of the optimal decision policy and the sum of the collected rewards

ρ(N) = −N Crisk(θ0*, θ1*) − Σ_{n=1}^{N} R(n).    (6)

This is recognized as a multi-armed bandit problem with a continuous two-dimensional action space parametrized by the threshold pairs (θ0, θ1).

The optimization problem of finding the thresholds that maximize mean reward is highly challenging because of the stochastic decision times and errors. Standard approaches such as gradient ascent fail and even state-of-the-art approaches such as cross-entropy or natural evolution strategies are ineffective. A successful approach must combine reward averaging with learning (in a more sophisticated way than batch-averaging or filtering).
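For concreteness, the single-trial reward (5) can be written as a short function; this is an illustrative Python sketch with argument names of our own:

```python
def trial_reward(decided, correct, T, W0, W1, c):
    """Single-trial reward of equation (5): every trial pays the
    time penalty -c*T, and an incorrect decision of hypothesis
    H0 or H1 additionally pays -W0 or -W1. Averaged over many
    trials, the expected reward is -Crisk of equation (4)."""
    if correct:
        return -c * T
    return -(W0 if decided == 0 else W1) - c * T

# Example with the costs used in Figure 2 (c = 0.05, W0 = 0.1, W1 = 1):
# an incorrect H0 decision after T = 10 steps pays -W0 - c*T = -0.6.
r = trial_reward(decided=0, correct=False, T=10, W0=0.1, W1=1.0, c=0.05)
```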
We now consider two distinct approaches for this.

3.2 REINFORCE method

The first approach to optimize the decision threshold is a standard 2-factor learning rule derived from Williams' REINFORCE algorithm for training neural networks [11], but modified to the novel application of continuous bandits. From a modern perspective, the REINFORCE algorithm is seen as an example of a policy gradient method [16, 17]. These are well-suited to reinforcement learning with continuous action spaces, because they use gradient descent to optimize continuously parameterized policies with respect to cumulative reward.

We consider the decision thresholds (θ0, θ1) to parametrize actions that correspond to making a single decision with those thresholds. Here we use a policy that expresses the threshold as a linear combination of binary unit outputs, with fixed coefficients specifying the contribution of each unit

θ0 = Σ_{j=1}^{ns} sj yj,    θ1 = Σ_{j=ns+1}^{2ns} sj yj.    (7)

Exponential coefficients were found to work well (equivalent to binary encoding), scaled to give a range of thresholds from zero to θmax:

sj = s{ns+j} = (1/2)^j / (1 − (1/2)^{ns}) θmax,    (8)

where here we use ns = 10 units per threshold with maximum threshold θmax = 10.
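A sketch of this threshold encoding (equations 7, 8) in Python, with helper names of our own:

```python
def coefficients(ns=10, theta_max=10.0):
    """Exponential coefficients s_j of equation (8); the same ns
    coefficients are reused for both thresholds (s_{ns+j} = s_j)."""
    return [(0.5 ** j) / (1.0 - 0.5 ** ns) * theta_max for j in range(1, ns + 1)]

def thresholds(y, ns=10, theta_max=10.0):
    """Thresholds of equation (7) from 2*ns binary unit outputs y:
    theta0 is read out from units 1..ns, theta1 from units ns+1..2ns."""
    s = coefficients(ns, theta_max)
    theta0 = sum(sj * yj for sj, yj in zip(s, y[:ns]))
    theta1 = sum(sj * yj for sj, yj in zip(s, y[ns:]))
    return theta0, theta1

# All theta0-units on and all theta1-units off: theta0 = theta_max, theta1 = 0,
# since the coefficients (1/2)^j sum to 1 - (1/2)^ns before normalization.
theta0, theta1 = thresholds([1] * 10 + [0] * 10)
```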
The benefit of this policy (7,8) is that the learning rule can be expressed in terms of the binary unit outputs yj = {0, 1}, which are the variables considered in the REINFORCE learning rule [11].

Following Williams, the policy choosing the threshold on a trial is stochastic by virtue of the binary unit outputs yj = {0, 1} being distributed according to a logistic function of weights wj, such that

yj ∼ p(yj|wj) = f(wj) yj + (1 − f(wj))(1 − yj),    f(wj) = 1 / (1 + e^{−wj}).    (9)

The REINFORCE learning rule for these weights is determined by the reward R(n) on trial n

Δwj = β [yj − f(wj)] R(n),    (10)

with learning rate β (here generally taken as 0.1). An improvement to the learning rule can be made with reinforcement comparison, with a reference reward R̄(n) = γR(n) + (1 − γ)R̄(n − 1) subtracted from R(n); a value γ = 0.5 was found to be effective, and is used in all simulations using the REINFORCE rule in this paper.

The power of the REINFORCE learning rule is that the weight change is equal to the gradient of the expected return J(w) = E[R{θ}] over all possible threshold sequences {θ}. Thus, a single-trial learning rule performs like stochastic gradient ascent averaged over many trials. Note also that the neural network input xi of the original formalism [11] is here set to x1 = 1, but a non-trivial input could be used to aid learning recall and generalization (see discussion).

¹The full expression has prior probabilities for the frequency of each outcome, which are here assumed equal.
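Combining the policy (7-9) with the learning rule (10), a single trial of threshold learning can be sketched as follows (Python, names of our own; the decision itself is abstracted behind a reward callback, and the reinforcement-comparison baseline is noted but omitted):

```python
import math
import random

def reinforce_step(w, reward, beta=0.1, rng=random):
    """One REINFORCE trial: sample binary unit outputs y_j with
    logistic probabilities f(w_j) (equation 9), collect the trial
    reward R for the thresholds they encode, then update each
    weight by beta * (y_j - f(w_j)) * R (equation 10).

    A reference reward (reinforcement comparison) would be
    subtracted from R before the update; it is omitted here.
    Returns (y, new_w)."""
    f = [1.0 / (1.0 + math.exp(-wj)) for wj in w]
    y = [1 if rng.random() < fj else 0 for fj in f]
    R = reward(y)  # run one decision with the thresholds encoded by y
    new_w = [wj + beta * (yj - fj) * R for wj, fj, yj in zip(w, f, y)]
    return y, new_w

# Example: saturated weights make the policy deterministic (all units on),
# so y - f(w) vanishes and the weights are left unchanged by any reward.
y, w_new = reinforce_step([40.0] * 4, lambda y: -1.0, rng=random.Random(0))
```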
Overall, the learning follows a reward-modulated two-factor rule that recruits units distributed according to an exponential size principle, and thus resembles models of biological motor learning.

3.3 Bayesian optimization method

The second approach is to use Bayesian optimization to find the optimal thresholds from iteratively building a probabilistic model of the reward function that is used to guide future sampling [12, 13, 14]. Bayesian optimization typically uses a Gaussian process model, which provides a nonlinear regression model both of the mean reward and the reward variance with decision threshold. This model can then be used to guide future threshold choice via maximising an acquisition function of these quantities. The basic algorithm for Bayesian optimization is as follows:

Algorithm: Bayesian optimization applied to optimal decision making

for n = 1 to N do
    New thresholds from optimizing the acquisition function: (θ0, θ1)n = argmax_{(θ0,θ1)} α(θ0, θ1; Dn−1)
    Make the decision with thresholds (θ0, θ1)n to find reward R(n)
    Augment the data by including the new samples: Dn = (Dn−1; (θ0, θ1)n, R(n))
    Update the statistical (Gaussian process) model of the rewards
end for

Following other work on Bayesian optimization, we model the reward dependence on the decision thresholds with a Gaussian process

R(θ0, θ1) ∼ GP[m(θ0, θ1), k(θ0, θ1; θ0′, θ1′)],    (11)

with mean m(θ0, θ1) = E[R(θ0, θ1)] and covariance modelled by a squared-exponential function

k(θ0, θ1; θ0′, θ1′) = σf² exp(−(λ/2) ||(θ0, θ1) − (θ0′, θ1′)||²).    (12)

The fitting of the hyperparameters σf², λ used standard methods [18] (GPML toolbox and a quasi-Newton optimizer in MATLAB).
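The squared-exponential covariance (12) is simple to state in code; here is a Python sketch with illustrative hyperparameter defaults (the paper fitted σf² and λ with the GPML toolbox in MATLAB):

```python
import math

def sq_exp_cov(p, q, sigma_f2=1.0, lam=1.0):
    """Squared-exponential covariance of equation (12) between two
    threshold pairs p = (theta0, theta1) and q = (theta0', theta1').
    sigma_f2 and lam stand for the hyperparameters sigma_f^2 and
    lambda; the defaults here are illustrative, not fitted values."""
    d2 = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return sigma_f2 * math.exp(-0.5 * lam * d2)

# The covariance is sigma_f2 on the diagonal and decays with distance,
# with a single length-scale shared by both thresholds.
k_same = sq_exp_cov((1.0, 2.0), (1.0, 2.0))
```

Using one length-scale λ for both arguments matches the symmetry θ0 ↔ θ1 of the decision problem noted in the text.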
In principle, the two thresholds could each have distinct hyperparameters, but we use one to maintain the symmetry θ0 ↔ θ1 of the decision problem.

The choice of decision thresholds is viewed as a sampling problem, and represented by maximizing an acquisition function of the decision thresholds that trades off exploration and exploitation. Here we use the probability of improvement, which guides the sampling towards regions of high uncertainty and reward by maximizing the chance of improving the present best estimate:

(θ0, θ1)n = argmax_{(θ0,θ1)} α(θ0, θ1),    α(θ0, θ1) = Φ( [m(θ0, θ1) − R(θ0*, θ1*)] / √k(θ0, θ1; θ0, θ1) ),    (13)

where (θ0*, θ1*) are the threshold estimates that have given the greatest reward and Φ is the normal cumulative distribution function. Usually one would include a noise parameter for exploration, but because the decision making is stochastic we use the noise from that process instead.

Figure 2: REINFORCE learning (exponential coefficients) of the two decision thresholds over a single learning episode. Decision costs c = 0.05, W0 = 0.1 and W1 = 1. Plots are smoothed over 50 trials. The red curve is the average accuracy by trial number (fitted to a cumulative Weibull function). Optimal values (from exhaustive optimization) are shown as dashed lines.

Figure 3: Bayesian optimization of the two decision thresholds over a single learning episode. Other details are the same as in Fig. 2, other than only 500 trials were used with smoothing over 20 trials.

4 Results

4.1 Single learning episode

The learning problem is to find the pair of optimal decision thresholds (θ0*, θ1*) that maximize the reward function (5), which is a linear combination of penalties for delays and type I and II errors. The reward function has two free parameters that affect the optimal thresholds: the costs W0/c and W1/c of making type I and II errors relative to time. The methods apply generally, although for concreteness we consider a drift-diffusion model equivalent to the SPRT with distribution means μ0 = −μ1 = 1/3 and standard deviation σ = 1.

Both the REINFORCE method and Bayesian optimization can converge to approximations of the optimal decision thresholds, as shown in Figures 2D,3D above for a typical learning episode. The decision error e, decision time T and reward R are all highly variable from the stochastic nature of the evidence, although displayed plots have their variance reduced by smoothing over 50 trials (to help interpret the results). There is a gradual convergence towards near optimal decision performance.

Clearly the main difference between the REINFORCE method and the Bayesian optimization method is the speed of convergence to the decision thresholds (cf. Figures 2D vs 3D). REINFORCE gradually converges over ∼5000 trials whereas Bayesian optimization converges in ≲500 trials. However, there are other differences between the two methods that are only revealed for multiple learning episodes, which act to balance the pros and cons across the two methods.

4.2 Multiple learning episodes: one decision threshold

For validation purposes, we reduce the learning problem to the simpler case where there is only one decision threshold θ0 = θ1, by setting costs equal for type I and II errors W0/c = W1/c so that the error probabilities are equal α0 = α1.
This will allow us to compare the two methods in a representative scenario that is simpler to visualize and can be validated against an exhaustive optimization of the reward function (which takes too long to calculate for two thresholds).

Figure 4: REINFORCE learning of one decision threshold (for equal thresholds θ1 = θ0) over 200 learning episodes with costs c/W1 = c/W0 sampled uniformly from [0, 0.1]. Results are after 5000 learning trials (averaged over 100 trials). The mean and standard deviation of these results (red line and shaded region) are compared with an exhaustive optimization over 10⁶ episodes (blue curves).

Figure 5: Bayesian optimization of one decision threshold (for equal thresholds θ1 = θ0) over 200 learning episodes with costs c/W1 = c/W0 sampled uniformly from [0, 0.1]. Results are after 500 learning trials (averaged over 100 trials). The mean and standard deviation of these results (red line and shaded region) are compared with an exhaustive optimization over 10⁶ episodes (blue curves).

We consider REINFORCE over 5000 trials and Bayesian optimization over 500 trials, which are sufficient for convergence (Figures 2,3). Costs were considered over a range W/c > 10 via random uniform sampling of c/W over the range [0, 0.1].
Mean decision errors e, decision times T, rewards and thresholds are averaged over the final 50 trials, combining the results for both choices.

Both the REINFORCE and Bayesian optimization methods estimate near-optimal decision thresholds for all considered cost parameters (Figures 4,5; red curves) as verified from comparison with an exhaustive search of the reward function (blue curves) over 10⁶ decision trials (randomly sampling the threshold range to estimate an average reward function, as in Fig. 1B). In both cases, the exhaustive search lies within one standard deviation of the decision threshold from the two learning methods.

There are, however, differences in performance between the two methods. Firstly, the variance of the threshold estimates is greater for Bayesian optimization than for REINFORCE (cf. Figures 4D vs 5D). The variance of the decision thresholds feeds through into larger variances for the decision error, time and reward. Secondly, although Bayesian optimization converges in fewer trials (500 vs 5000), it comes at the expense of greater computational cost of the algorithm (Table 1).

The above results were checked for robustness across reasonable ranges of the various meta-parameters for each learning method. For REINFORCE, the results were not appreciably affected by having any learning rate β within the range 0.1-1; similarly, increasing the unit number ns did not affect the threshold variances, but scales the computation time.

4.3 Multiple learning episodes: two decision thresholds

We now consider the learning problem with two decision thresholds (θ0, θ1) that optimize the reward function (5) with differing W0/c and W1/c values. We saw above that REINFORCE produces the more accurate estimates relative to the computational cost, so we concentrate on that method only.

Figure 6: Reinforcement learning of two decision thresholds.
Method same as Figure 4 except that 200² learning episodes are considered with costs (c/W0, c/W1) sampled from [0, 0.1] × [0, 0.1]. The threshold θ0 results are just reflections of those for θ1 in the axis c/W0 ↔ c/W1 and thus not shown.

Table 1: Comparison of threshold learning methods. Results for one decision threshold, averaging over the data in Figures 4,5. (Benchmarked on an i7 2.7GHz CPU.)

method                   | REINFORCE             | Bayesian optimization | Exhaustive optimization
computation time         | 0.5 sec (5000 trials) | 50 sec (500 trials)   | 44 sec (10⁶ trials)
computation time/trial   | 0.1 msec/trial        | 100 msec/trial        | 0.04 msec/trial
uncertainty, Δθ (1 s.d.) | 0.23                  | 0.75                  | 0.01

The REINFORCE method can find the two decision thresholds (Figure 6), as demonstrated by estimating the thresholds over 200² instances of the reward function with (c/W0, c/W1) sampled uniformly from [0, 0.1] × [0, 0.1]. Because of the high compute time, we cannot compare the results to those from an exhaustive search, apart from that the plot diagonals (W0/c = W1/c) reduce to the single threshold results which matched an exhaustive optimization (Figure 4).

Figure 6 is of general interest because it maps the drift-diffusion model (SPRT) decision performance over a main portion of its parameter space. Results for the two decision thresholds (θ0, θ1) are reflections of each other about W0 ↔ W1, while the decision error, time and reward are reflection symmetric (consistent with these symmetries of the decision problem). All quantities depend on both weight parameters (W0/c, W1/c) in a smooth but non-trivial manner.
To our knowledge, this is the\n\ufb01rst time the full decision performance has been mapped.\n\n4.4 Comparison with animal learning\n\nThe relation between reward and decision optimality is directly relevant to the psychophysics of two\nalternative forced choice tasks in the tradeoff between decision accuracy and speed [3]. Multiple\nstudies support that the decision threshold is set to maximize reward [7, 8, 9]. However, the mechanism\nby which subjects learn the optimal thresholds has not been addressed. Our two learning methods are\ncandidate mechanisms, and thus should be compared with experiment.\nWe have found a couple of studies showing data over the acquisition phase of two-alternative forced\nchoice behavioural experiments: one for rodent whisker vibrotactile discrimination [19, Figure 4] and\nthe other for bat echoacoustic discrimination [20]. Studies detailing the acquisition phase are rare\ncompared to those of the pro\ufb01cient phase, even though they are a necessary component of all such\nbehavioural experiments (and successful studies rest on having a well-designed acquisition phase).\nIn both behavioural studies, the animals acquired pro\ufb01cient decision performance after 5000-10000\ntrials: in rodent, this was after 25-50 sessions of \u223c200 trials [19, Figure 4]; and in bat, after about\n6000 trials for naive animals [20, Figure 4]. The typical progress of learning was to begin with\nrandom choices (mean decision error e = 0.5) and then gradually converge on the appropriate balance\nof decision time vs accuracy. 
There was considerable variance in final performance across different animals (in rodent, mean decision errors were e ∼ 0.05-0.15).

That acquisition takes 5000 or more trials is consistent with the REINFORCE learning rule (Figure 2), and not with Bayesian optimization (Figure 3). Moreover, the shape of the acquisition curve for the REINFORCE method resembles that of the animal learning, in also having a good fit to a cumulative Weibull function over a similar number of trials (red line, Figure 2). That being said, the animals begin making random choices and gradually improve in accuracy with longer decision times, whereas both artificial learning methods (Figures 2,3) begin with accurate choices and then decrease in accuracy and decision time. Taken together, this evidence supports that the REINFORCE learning rule is a plausible model of animal learning, although further theoretical and experimental study is required.

5 Discussion

We examined how to learn decision thresholds in the drift-diffusion model of perceptual decision making. A key step was to use single trial rewards derived from Wald's trial-averaged cost function for the equivalent sequential probability ratio test, which took the simple form of a linear weighting of penalties due to time and type I/II errors.
These highly stochastic rewards are challenging to optimize, which we addressed with two distinct methods to learn the decision thresholds.

The first approach for learning the thresholds was based on a method for training neural networks known as Williams' REINFORCE rule [11]. In modern terminology, this can be viewed as a policy gradient method [16, 17], and here we proposed an appropriate policy for optimal decision making. The second method was a modern Bayesian optimization method that samples and builds a probabilistic model of the reward function to guide further sampling [12, 13, 14]. Both learning methods converged to near the optimal decision thresholds, as validated against an exhaustive optimization (over 10⁶ trials). The Bayesian optimization method converged much faster (∼500 trials) than the REINFORCE method (∼5000 trials). However, Bayesian optimization is three-times as variable in the threshold estimates and 40-times slower in computation time. It appears that the faster convergence for Bayesian optimization leads to less averaging over the stochastic rewards, and hence greater variance than with the REINFORCE method.

We expect that both the REINFORCE and Bayesian optimization methods used here can be improved to compensate for some of their individual drawbacks.
For example, the full REINFORCE learning\nrule has a third factor corresponding to the neural network input, which could represent a context\nsignal to allow recall and generalization over past learnt thresholds; also, information on past trial\nperformance is discarded by REINFORCE, which could be partially retained to improve learning.\nBayesian optimization could be improved in computational speed by updating the Gaussian process\nwith just the new samples after each decision, rather than re\ufb01tting the entire Gaussian process; also,\nthe variance of the threshold estimates may improve with other choices of acquisition function for\nsampling the rewards or other assumptions for the Gaussian process covariance function. In addition,\nthe optimization methods may have broader applicability when the optimal decision thresholds vary\nwith time [10], such as tasks with deadlines or when there are multiple (three or more) choices.\nSeveral more factors support the REINFORCE method as a model of reward-driven learning during\nperceptual decision making. First, REINFORCE is based on a neural network and is thus better\nsuited as a connectionist model of brain function. Second, the REINFORCE model results (Fig. 2)\nresemble acquisition data from behavioural experiments in rodent [19] and bat [20] (Sec. 4.4). Third,\nthe site of reward learning would plausibly be the basal ganglia, and a similar 3-factor learning rule\nhas already been used to model cortico-striatal plasticity [21]. In addition, multi-alternative (MSPRT)\nversions of the drift-diffusion model offer a model of action selection in the basal ganglia [22, 23],\nand so the present REINFORCE model of decision acquisition would extend naturally to encompass\na combined model of reinforcement learning and optimal decision making in the brain.\n\nAcknowledgements\n\nI thank Jack Crago, John Lloyd, Kirsty Aquilina, Kevin Gurney and Giovanni Pezzulo for discussions\nrelated to this research. 
The code used to generate the results and figures for this paper is at http://lepora.com/publications.htm

References

[1] R. Ratcliff. A theory of memory retrieval. Psychological Review, 85:59–108, 1978.

[2] J. Gold and M. Shadlen. The neural basis of decision making. Annual Review of Neuroscience, 30:535–574, 2007.

[3] R. Bogacz, E. Brown, J. Moehlis, P. Holmes, and J. Cohen. The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review, 113(4):700, 2006.

[4] A. Wald and J. Wolfowitz. Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, 19(3):326–339, 1948.

[5] J. Gold and M. Shadlen. Banburismus and the brain: decoding the relationship between sensory stimuli, decisions, and reward. Neuron, 36(2):299–308, 2002.

[6] P. Simen, J. Cohen, and P. Holmes. Rapid decision threshold modulation by reward rate in a neural network. Neural Networks, 19(8):1013–1026, 2006.

[7] P. Simen, D. Contreras, C. Buck, P. Hu, P. Holmes, and J. Cohen. Reward rate optimization in two-alternative decision making: empirical tests of theoretical predictions. Journal of Experimental Psychology: Human Perception and Performance, 35(6):1865, 2009.

[8] R. Bogacz, P. Hu, P. Holmes, and J. Cohen. Do humans produce the speed–accuracy trade-off that maximizes reward rate? The Quarterly Journal of Experimental Psychology, 63(5):863–891, 2010.

[9] F. Balci, P. Simen, R. Niyogi, A. Saxe, J. Hughes, P. Holmes, and J. Cohen. Acquisition of decision making criteria: reward rate ultimately beats accuracy. Attention, Perception, & Psychophysics, 73(2):640–657, 2011.

[10] J. Drugowitsch, R. Moreno-Bote, A. Churchland, M. Shadlen, and A. Pouget. The cost of accumulating evidence in perceptual decision making. The Journal of Neuroscience, 32(11):3612–3628, 2012.

[11] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[12] M. Pelikan. Bayesian optimization algorithm. In Hierarchical Bayesian Optimization Algorithm, pages 31–48. Springer, 2005.

[13] E. Brochu, V. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[14] J. Snoek, H. Larochelle, and R. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[15] R. Ratcliff and G. McKoon. The diffusion decision model: theory and data for two-choice decision tasks. Neural Computation, 20(4):873–922, 2008.

[16] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

[17] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems 12, pages 1057–1063, 2000.

[18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[19] J. Mayrhofer, V. Skreb, W. von der Behrens, S. Musall, B. Weber, and F. Haiss. Novel two-alternative forced choice paradigm for bilateral vibrotactile whisker frequency discrimination in head-fixed mice and rats. Journal of Neurophysiology, 109(1):273–284, 2013.

[20] K. Stich and Y. Winter. Lack of generalization of object discrimination between spatial contexts by a bat. Journal of Experimental Biology, 209(23):4802–4808, 2006.

[21] M. Frank and E. Claus. Anatomy of a decision: striato-orbitofrontal interactions in reinforcement learning, decision making, and reversal. Psychological Review, 113(2):300, 2006.

[22] R. Bogacz and K. Gurney. The basal ganglia and cortex implement optimal decision making between alternative actions. Neural Computation, 19(2):442–477, 2007.

[23] N. Lepora and K. Gurney. The basal ganglia optimize decision making over general perceptual hypotheses. Neural Computation, 24(11):2924–2945, 2012.