{"title": "How Prior Probability Influences Decision Making: A Unifying Probabilistic Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1268, "page_last": 1276, "abstract": "How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? Two competing descriptive models have been proposed based on experimental data. The first posits an additive offset to a decision variable, implying a static effect of the prior. However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. To explain this data, a second model has been proposed which assumes a time-varying influence of the prior. Here we present a normative model of decision making that incorporates prior knowledge in a principled way. We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). Decision making in the model reduces to (1) computing beliefs given observations and prior information in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making.", "full_text": "How Prior Probability In\ufb02uences Decision Making:\n\nA Unifying Probabilistic Model\n\nYanping Huang\n\nUniversity of Washington\n\nhuangyp@cs.washington.edu\n\nAbram L. Friesen\n\nUniversity of Washington\n\nafriesen@cs.washington.edu\n\nTimothy D. Hanks\nPrinceton University\n\nthanks@princeton.edu\n\nMichael N. Shadlen\nColumbia University\n\nHoward Hughes Medical Institute\n\nms4497@columbia.edu\n\nRajesh P. N. 
Rao\n\nUniversity of Washington\n\nrao@cs.washington.edu\n\nAbstract\n\nHow does the brain combine prior knowledge with sensory evidence when making\ndecisions under uncertainty? Two competing descriptive models have been pro-\nposed based on experimental data. The \ufb01rst posits an additive offset to a decision\nvariable, implying a static effect of the prior. However, this model is inconsistent\nwith recent data from a motion discrimination task involving temporal integration\nof uncertain sensory evidence. To explain this data, a second model has been pro-\nposed which assumes a time-varying in\ufb02uence of the prior. Here we present a\nnormative model of decision making that incorporates prior knowledge in a prin-\ncipled way. We show that the additive offset model and the time-varying prior\nmodel emerge naturally when decision making is viewed within the framework\nof partially observable Markov decision processes (POMDPs). Decision making\nin the model reduces to (1) computing beliefs given observations and prior in-\nformation in a Bayesian manner, and (2) selecting actions based on these beliefs\nto maximize the expected sum of future rewards. We show that the model can\nexplain both data previously explained using the additive offset model as well as\nmore recent data on the time-varying in\ufb02uence of prior knowledge on decision\nmaking.\n\n1\n\nIntroduction\n\nA fundamental challenge faced by the brain is to combine noisy sensory information with prior\nknowledge in order to perceive and act in the natural world. It has been suggested (e.g., [1, 2, 3, 4])\nthat the brain may solve this problem by implementing an approximate form of Bayesian inference.\nThese models however leave open the question of how actions are chosen given probabilistic repre-\nsentations of hidden state obtained through Bayesian inference. 
Daw and Dayan [5, 6] were among the first to study decision theoretic and reinforcement learning models with the goal of interpreting results from various neurobiological experiments. Bogacz and colleagues proposed a model that combines a traditional decision making model with reinforcement learning [7] (see also [8, 9]). In the decision making literature, two apparently contradictory models have been suggested to explain how the brain utilizes prior knowledge in decision making: (1) a model that adds an offset to a decision variable, implying a static effect of changes to the prior probability [10, 11, 12], and (2) a model that adds a time-varying weight to the decision variable, representing the changing influence of prior probability over time [13]. The LATER model (Linear Approach to Threshold with Ergodic Rate), an instance of the additive offset model, incorporates prior probability as the starting point of a linearly rising decision variable and successfully predicts changes to saccade latency when discriminating between two low contrast stimuli [10]. However, the LATER model fails to explain data from the random dots motion discrimination task [14], in which the agent is presented with noisy, time-varying stimuli and must continually process this data in order to make a correct choice and receive reward. The drift diffusion model (DDM), which uses a random walk accumulation instead of a linear rise to a boundary, has been successful in explaining behavioral and neurophysiological data in various perceptual discrimination tasks [14, 15, 16]. 
However, in order to explain behavioral\ndata from recent variants of random dots tasks in which the prior probability of motion direction is\nmanipulated [13], DDMs require the additional assumption of dynamic reweighting of the in\ufb02uence\nof the prior over time.\nHere, we present a normative framework for decision making that incorporates prior knowledge and\nnoisy observations under a reward maximization hypothesis. Our work is inspired by models which\ncast human and animal decision making in a rational, or optimal, framework. Frazier & Yu [17]\nused dynamic programming to derive an optimal strategy for two-alternative forced choice tasks\nunder a stochastic deadline. Rao [18] proposed a neural model for decision making based on the\nframework of partially observable Markov decision processes (POMDPs) [19]; the model focuses\non network implementation and learning but assumes a \ufb01xed deadline to explain the collapsing\ndecision threshold seen in many decision making tasks. Drugowitsch et al. [9] sought to explain\nthe collapsing decision threshold by combining a traditional drift diffusion model with reward rate\nmaximization; their model also requires knowledge of decision time in hindsight. In this paper,\nwe derive a novel POMDP model from which we compute the optimal behavior for sequential\ndecision making tasks. We demonstrate our model\u2019s explanatory power on two such tasks:\nthe\nrandom dots motion discrimination task [13] and Carpenter and Williams\u2019 saccadic eye movement\ntask [10]. We show that the urgency signal, hypothesized in previous models, emerges naturally as a\ncollapsing decision boundary with no assumption of a decision deadline. 
Furthermore, our POMDP\nformulation enables incorporation of partial or incomplete prior knowledge about the environment.\nBy \ufb01tting model parameters to the psychometric function in the neutral prior condition (equal prior\nprobability of either direction), our model accurately predicts both the psychometric function and\nthe reaction times for the biased (unequal prior probability) case, without introducing additional free\nparameters. Finally, the same model also accurately predicts the effect of prior probability changes\non the distribution of reaction times in the Carpenter and Williams task, data that was previously\ninterpreted in terms of the additive offset model.\n\n2 Decision Making in a POMDP framework\n\n2.1 Model Setup\n\nWe model a decision making task using a POMDP, which assumes that at any particular time step,\nt, the environment is in a particular hidden state, x \u2208 X , that is not directly observable by the\nanimal. The animal can make sensory measurements in order to observe noisy samples of this hidden\nstate. At each time step, the animal receives an observation (stimulus), st, from the environment as\ndetermined by an emission distribution, Pr(st|x). The animal must maintain a belief over the set\nof possible true world states, given the observations it has made so far: bt(x) = Pr(x|s1:t), where\ns1:t represents the sequence of stimuli that the animal has received so far, and b0(x) represents\nthe animal\u2019s prior knowledge about the environment. At each time step, the animal chooses an\naction, a \u2208 A and receives an observation and a reward, R(x, a), from the environment, depending\non the current state and the action taken. The animal uses Bayes rule to update its belief about the\nenvironment after each observation. Through these interactions, the animal learns a policy, \u03c0(b) \u2208 A\nfor all b, which dictates the action to take for each belief state. 
The goal is to find an optimal policy, \u03c0\u2217(b), that maximizes the animal's total expected future reward in the task.\n\nFor example, in the random dots motion discrimination task, the hidden state, x, is composed of both the coherence of the random dots, c \u2208 [0, 1], and the direction, d \u2208 {\u22121, 1} (corresponding to leftward and rightward motion, respectively), neither of which is known to the animal. The animal is shown a movie of randomly moving dots, a fraction of which are moving in the same direction (this fraction is the coherence). The movie is modeled as a sequence of time-varying stimuli s1:t. Each frame, st, is a snapshot of the changes in dot positions, sampled from the emission distribution st \u223c Pr(st|kc, d), where k > 0 is a free parameter that determines the scale of st. In order to discriminate the direction given the stimuli, the animal uses Bayes' rule to compute the posterior probability of the static joint hidden state, Pr(x = kdc|s1:t)\u00b9. At each time step, the animal chooses one of three actions, a \u2208 {AR, AL, AS}, denoting rightward eye movement, leftward eye movement, and sampling (i.e., waiting for one more observation), respectively. When the animal makes a correct choice (i.e., a rightward eye movement a = AR when x > 0 or a leftward eye movement a = AL when x < 0), the animal receives a positive reward RP > 0. The animal receives a negative reward (penalty) or no reward when an incorrect action is chosen, RN \u2264 0. 
We assume that the animal is motivated by hunger or thirst to make a decision as quickly as possible and model this with a unit penalty RS = \u22121, representing the cost the agent needs to pay when choosing the sampling action AS.\n\n2.2 Bayesian Inference of Hidden State from Prior Information and Noisy Observations\n\nIn a POMDP, decisions are made based on the belief state bt(x) = Pr(x|s1:t), which is the posterior probability distribution over x given a sequence of observations s1:t. The initial belief b0(x) represents the animal's prior knowledge about x. In both Carpenter and Williams' task [10] and the random dots motion discrimination task [13], prior information about the probability of a specific direction (we use the rightward direction here, dR, without loss of generality) is learned by the subjects: Pr(dR) = Pr(d = 1) = Pr(x > 0) = 1 \u2212 Pr(dL). Consider the random dots motion discrimination task. Unlike the traditional case where a full prior distribution is given, this direction-only prior information provides only partial knowledge about the hidden state, which also includes coherence. In the least informative case, only Pr(dR) is known and we model the distribution over the remaining components of x as a uniform distribution. Combining this with the direction prior, which is Bernoulli distributed, gives a piecewise uniform distribution for the prior, b0(x). In the general case, we can express the distribution over coherence as a normal distribution parameterized by \u03bc0 and \u03c30, resulting in a piecewise normal prior over x,\n\nb0(x) = Z0\u207b\u00b9 N(x | \u03bc0, \u03c30) \u00d7 { Pr(dR), x \u2265 0; Pr(dL), x < 0 },   (1)\n\nwhere Zt = Pr(dR)(1 \u2212 \u03a6(0 | \u03bct, \u03c3t)) + Pr(dL)\u03a6(0 | \u03bct, \u03c3t) is the normalization factor and \u03a6(x | \u03bc, \u03c3) = \u222b_{\u2212\u221e}^{x} N(x\u2032 | \u03bc, \u03c3)dx\u2032 is the cumulative distribution function (CDF) of the normal distribution. The piecewise uniform prior is then just a special case with \u03bc0 = 0 and \u03c30 = \u221e.\n\nWe assume the emission distribution is also normally distributed, Pr(st|x) = N(st | x, \u03c3e\u00b2), which, from Bayes' rule, results in a piecewise normal posterior distribution\n\nbt(x) = Zt\u207b\u00b9 N(x | \u03bct, \u03c3t) \u00d7 { Pr(dR), x \u2265 0; Pr(dL), x < 0 },   (2)\n\nwhere\n\n\u03bct = (\u03bc0/\u03c30\u00b2 + t\u00b7s\u0304t/\u03c3e\u00b2) / (1/\u03c30\u00b2 + t/\u03c3e\u00b2),   (3)\n\n\u03c3t\u00b2 = (1/\u03c30\u00b2 + t/\u03c3e\u00b2)\u207b\u00b9,   (4)\n\nand the running average s\u0304t = \u03a3_{t\u2032=1}^{t} s_{t\u2032}/t. Consequently, the posterior distribution depends only on s\u0304 and t, the two sufficient statistics of the sequence s1:t. For the case of a piecewise uniform prior, the variance \u03c3t\u00b2 = \u03c3e\u00b2/t, which decreases inversely in proportion to elapsed time. Unless otherwise mentioned, we fix \u03c3e = 1, \u03c30 = \u221e and \u03bc0 = 0 for the rest of this paper, because we can rescale the POMDP time step t\u2032 = t/\u03c3e\u00b2 to compensate.\n\n\u00b9In the decision making tasks that we model in this paper, the hidden state is fixed within a trial and thus there is no transition distribution to include in the belief update equation. However, the POMDP framework is entirely valid for time-varying states.\n\n2.3 Finding the optimal policy by reward maximization\n\nWithin the POMDP framework, the animal's goal is to find an optimal policy \u03c0\u2217(bt) that maximizes its expected reward, starting at bt. 
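To make the update concrete, the sufficient-statistic form of equations (2)-(4) can be sketched in a few lines of Python (a minimal sketch; the function names are ours, and the piecewise-uniform special case corresponds to \u03bc0 = 0, \u03c30 = \u221e):

```python
import math

def belief_params(s_bar, t, mu0=0.0, var0=float("inf"), var_e=1.0):
    """Posterior mean and variance (Eqs. 3-4) from the running average
    s_bar of t observations."""
    if math.isinf(var0):                 # piecewise-uniform prior
        return s_bar, var_e / t          # mu_t = s_bar, sigma_t^2 = sigma_e^2 / t
    prec = 1.0 / var0 + t / var_e
    return (mu0 / var0 + t * s_bar / var_e) / prec, 1.0 / prec

def normal_cdf(x, mu, sigma):
    """Phi(x | mu, sigma), the normal CDF."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_right(s_bar, t, p_right=0.5):
    """Posterior probability that x > 0, i.e. the rightward mass of the
    piecewise normal belief in Eq. 2 with direction prior Pr(d_R)."""
    mu_t, var_t = belief_params(s_bar, t)
    phi0 = normal_cdf(0.0, mu_t, math.sqrt(var_t))
    z = p_right * (1.0 - phi0) + (1.0 - p_right) * phi0   # normalizer Z_t
    return p_right * (1.0 - phi0) / z
```

With a neutral prior and s\u0304 = 0 the rightward belief stays at 0.5, while a biased prior shifts it to Pr(dR), matching the role of b0(x) above.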
This is encapsulated in the value function\n\nv\u03c0(bt) = E[ \u03a3_{k=1}^{\u221e} r(b_{t+k}, \u03c0(b_{t+k})) | bt, \u03c0 ],   (5)\n\nwhere the expectation is taken with respect to all future belief states (b_{t+1}, . . . , b_{t+k}, . . .) given that the animal is using \u03c0 to make decisions, and r(b, a) is the reward function over belief states or, equivalently, the expected reward over hidden states, r(b, a) = \u222b_x R(x, a)b(x)dx. Given the value function, the optimal policy is simply \u03c0\u2217(b) = arg max_\u03c0 v\u03c0(b). In this model, the belief b is parameterized by s\u0304t and t, so the animal only needs to keep track of these instead of encoding the entire posterior distribution bt(x) explicitly.\n\nIn our model, the expected reward r(b, a) = \u222b_x R(x, a)b(x)dx is\n\nr(b, a) = { RS, when a = AS;  Zt\u207b\u00b9[ RP Pr(dR)(1 \u2212 \u03a6(0 | \u03bct, \u03c3t)) + RN Pr(dL)\u03a6(0 | \u03bct, \u03c3t) ], when a = AR;  Zt\u207b\u00b9[ RN Pr(dR)(1 \u2212 \u03a6(0 | \u03bct, \u03c3t)) + RP Pr(dL)\u03a6(0 | \u03bct, \u03c3t) ], when a = AL },   (6)\n\nwhere \u03bct and \u03c3t are given by (3) and (4), respectively. The above equations can be interpreted as follows. With probability Pr(dL) \u00b7 \u03a6(0 | \u03bct, \u03c3t), the hidden state x is less than 0, making AR an incorrect decision and resulting in a penalty RN if chosen. Similarly, action AR is correct with probability Pr(dR) \u00b7 [1 \u2212 \u03a6(0 | \u03bct, \u03c3t)] and earns a reward of RP. The inverse is true for AL. When AS is selected, the animal simply receives an observation at a cost of RS.\n\nComputing the value function defined in (5) involves an expectation with respect to future beliefs. Therefore, we need to compute the transition probabilities over belief states, T(b_{t+1}|bt, a), for each action. When the animal chooses to sample, at = AS, the animal's belief distribution at the next time step is computed by marginalizing over all possible observations [19]:\n\nT(b_{t+1}|bt, AS) = \u222b_s Pr(b_{t+1} | s, bt, AS) Pr(s | bt, AS) ds,   (7)\n\nwhere\n\nPr(b_{t+1} | s, bt, AS) = { 1, if b_{t+1}(x) = Pr(s|x)bt(x)/Pr(s | bt, AS), \u2200x;  0, otherwise }   (8)\n\nand\n\nPr(s | bt, AS) = \u222b_x Pr(s|x) Pr(x|b, a) dx = E_{x\u223cb}[Pr(s|x)].   (9)\n\nWhen choosing AS, the agent does not affect the world state, so, given the current belief bt and an observation s, the updated belief b_{t+1} is deterministic and thus Pr(b_{t+1} | s, bt, AS) is a delta function, following Bayes' rule. The probability Pr(s | bt, AS) can be treated as a normalization factor and is independent of the hidden state\u00b2. Thus, the transition probability function, T(b_{t+1} | bt, AS), is solely a function of the belief bt and is a stationary distribution over the belief space.\n\nWhen the selected action is AL or AR, the animal stops sampling and makes an eye movement to the left or the right, respectively. To account for these cases, we include a terminal state, \u0393, with zero reward (i.e., R(\u0393, a) = 0, \u2200a) and absorbing behavior, T(\u0393|\u0393, a) = 1, \u2200a. Moreover, whenever the animal chooses AL or AR, the POMDP immediately transitions into \u0393: T(\u0393|b, a \u2208 {AL, AR}) = 1, \u2200b, indicating the end of a trial.\n\nGiven the transition probability between belief states, T(b_{t+1}|bt, a), and the reward function, we can convert our POMDP model into a Markov decision process (MDP) over the belief state. Standard dynamic programming techniques (e.g., value iteration [20]) can then be applied to compute the value function in (5). A neurally plausible method for learning the optimal policy by trial and error using temporal difference (TD) learning was suggested in [18]. 
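The belief-MDP construction above can be sketched with finite-horizon backward induction on a discretized (s\u0304, t) grid. This is an illustrative sketch, not the authors' implementation: the horizon T_MAX is only a numerical truncation (not a task deadline), and the predictive distribution of the next sample is approximated by a single normal rather than the exact mixture given in footnote 2:

```python
import numpy as np
from math import erf, sqrt

RP, RN, RS = 1.0, -999.0, -1.0      # so (RN - RP)/RS = 1,000, the ratio used in the fits
SIG_E, P_R = 1.0, 0.5               # emission std, neutral direction prior
T_MAX = 60                          # numerical truncation horizon (not a deadline)
S_GRID = np.linspace(-3.0, 3.0, 241)
U = np.linspace(-3.0, 3.0, 25)      # quadrature nodes for the next observation
W = np.exp(-0.5 * U**2); W /= W.sum()

def stop_rewards(sb, t):
    """Expected rewards (Eq. 6) of A_R and A_L at belief (s_bar, t);
    piecewise-uniform prior, so mu_t = s_bar and sigma_t = sig_e/sqrt(t)."""
    p0 = 0.5 * (1.0 + erf(-sb * sqrt(t) / (SIG_E * sqrt(2.0))))  # Phi(0 | mu_t, sigma_t)
    z = P_R * (1.0 - p0) + (1.0 - P_R) * p0
    r_R = (RP * P_R * (1.0 - p0) + RN * (1.0 - P_R) * p0) / z
    r_L = (RN * P_R * (1.0 - p0) + RP * (1.0 - P_R) * p0) / z
    return r_R, r_L

V = np.zeros((T_MAX + 1, len(S_GRID)))
bound_R = {}                         # decision boundary psi_R(t)
for t in range(T_MAX, 0, -1):
    sig_pred = sqrt(SIG_E**2 + SIG_E**2 / t)   # approx. predictive std of s_{t+1}
    row, rights = [], []
    for sb in S_GRID:
        r_R, r_L = stop_rewards(sb, t)
        if t == T_MAX:
            row.append(max(r_R, r_L))          # forced to stop at the truncation
            continue
        s_next = (t * sb + (sb + sig_pred * U)) / (t + 1)  # updated running average
        cont = RS + np.dot(W, np.interp(s_next, S_GRID, V[t + 1]))
        row.append(max(r_R, r_L, cont))
        if r_R >= cont and r_R >= r_L:
            rights.append(sb)
    V[t] = row
    if rights:
        bound_R[t] = min(rights)     # smallest s_bar at which A_R is optimal
```

The resulting boundary shrinks toward 0 as t grows, reproducing the collapsing bound of Figure 1 without any deadline assumption.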
Here, we derive the optimal policy from first principles and focus on comparisons between our model's predictions and behavioral data.\n\n\u00b2Explicitly, Pr(s | bt, AS) = Zt\u207b\u00b9 N(s | \u03bct, \u03c3e\u00b2 + \u03c3t\u00b2)[ Pr(dR) + (1 \u2212 2Pr(dR)) \u03a6(0 | (\u03bct/\u03c3t\u00b2 + s/\u03c3e\u00b2)/(1/\u03c3t\u00b2 + 1/\u03c3e\u00b2), (1/\u03c3t\u00b2 + 1/\u03c3e\u00b2)\u207b\u00b9) ].\n\n3 Model Predictions\n\n3.1 Optimal Policy\n\nFigure 1: Optimal policy for Pr(dR) = 0.5 and 0.9. (a\u2013b) Optimal policy as a joint function of s\u0304 and t. Every point in these figures represents a belief state determined by equations (2), (3) and (4). The color of each point represents the corresponding optimal action. The boundaries \u03c8R(t) and \u03c8L(t) divide the belief space into three areas, \u03a0S (center), \u03a0R (upper) and \u03a0L (lower), respectively. Model parameters: (RN \u2212 RP)/RS = 1,000.\n\nFigure 1(a) shows the optimal policy \u03c0\u2217 as a joint function of s\u0304 and t for the unbiased case where the prior probability Pr(dR) = Pr(dL) = 0.5. \u03c0\u2217 partitions the belief space into three regions: \u03a0R, \u03a0L, and \u03a0S, representing the sets of belief states preferring actions AR, AL and AS, respectively. We define the boundary between AR and AS, and the boundary between AL and AS, as \u03c8R(t) and \u03c8L(t), respectively. Early in a trial, the model selects the sampling action AS regardless of the value of the observed evidence. This is because the variance of the running average s\u0304 is high for small t. Later in the trial, the model will choose AR or AL when s\u0304 is only slightly above 0, because this variance decreases as the model receives more observations. For this reason, the width of \u03a0S diminishes over time. 
This gradual decrease in the threshold for choosing one of the non-sampling actions AR or AL has been called a \u201ccollapsing bound\u201d in the decision making literature [21, 17, 22]. For this unbiased prior case, the expected reward function is symmetric, r(bt(x), AR) = r(Pr(x|s\u0304t, t), AR) = r(Pr(x|\u2212s\u0304t, t), AL), and thus the decision boundaries are also symmetric around 0: \u03c8R(t) = \u2212\u03c8L(t).\n\nThe optimal policy \u03c0\u2217 is entirely determined by the reward parameters {RP, RN, RS} and the prior probability (the standard deviation of the emission distribution, \u03c3e, only determines the temporal resolution of the POMDP). It applies to both Carpenter and Williams' task and the random dots task (these two tasks differ only in the interpretation of the hidden state x). The optimal action at a specific belief state is determined by the relative, not the absolute, value of the expected future reward. From (6), we have\n\nr(b, AL) \u2212 r(b, AR) \u221d RN \u2212 RP.   (10)\n\nMoreover, if the unit of reward is specified by the sampling penalty, the optimal policy \u03c0\u2217 is entirely determined by the ratio (RN \u2212 RP)/RS and the prior.\n\nAs the prior probability becomes biased, the optimal policy becomes asymmetric. When the prior probability, Pr(dR), increases, the decision boundary for the more likely direction (\u03c8R(t)) shifts towards the center (the dashed line at s\u0304 = 0 in figure 1), while the decision boundary for the opposite direction (\u03c8L(t)) shifts away from the center, as illustrated in figure 1(b) for the prior Pr(dR) = 0.9. Early in a trial, \u03a0S has approximately the same width as in the neutral prior case, but it is shifted downwards to favor more sampling for dL (s\u0304 < 0). 
Later in a trial, even for some belief states with s\u0304 < 0, the optimal action is still AR, because the effect of the prior is stronger than that of the observed data.\n\n3.2 Psychometric function and reaction times in the random dots task\n\nWe now construct a decision model from the learned policy for the reaction time version of the motion discrimination task [14], and compare the model's predictions to the psychometric and chronometric functions of a monkey performing the same task [13, 14].\n\nFigure 2: Comparison of psychometric (upper panels) and chronometric (lower panels) functions between the model and experiments. Panels: (a) Human SK, (b) Human LH, (c) Monkey Pr(dR) = .8, (d) Monkey Pr(dR) = .7. The dots with error bars represent experimental data from human subjects SK and LH, and the combined results from four monkeys. Blue solid curves are model predictions in the neutral case while green dotted curves are model predictions in the biased case. The R\u00b2 fits are shown in the plots. Model parameters: (a) (RN \u2212 RP)/RS = 1,000, k = 1.45; (b) (RN \u2212 RP)/RS = 1,000, \u03bc = 1.45; (c) Pr(dR) = 0.8, (RN \u2212 RP)/RS = 1,000, k = 1.4; (d) Pr(dR) = 0.7, (RN \u2212 RP)/RS = 1,000, k = 1.4.\n\nRecall that the belief b is parametrized by s\u0304t and t, so the animal only needs to know the elapsed time and compute a running average s\u0304t of the observations in order to maintain the posterior belief bt(x). Given its current belief, the animal selects an action from the optimal policy \u03c0\u2217(bt). When bt \u2208 \u03a0S, the animal chooses the sampling action and gets a new observation s_{t+1}. 
Otherwise the animal terminates the trial by making an eye movement to the right or to the left, for s\u0304t > \u03c8R(t) or s\u0304t < \u03c8L(t), respectively.\n\nThe performance on the task using the optimal policy can be measured in terms of both the accuracy of direction discrimination (the so-called psychometric function) and the reaction time required to reach a decision (the chronometric function). The hidden variable x = kdc encapsulates the unknown direction and coherence, as well as the free parameter k that determines the scale of the stimulus st. Without loss of generality, we fix d = 1 (rightward direction), and set the hidden direction dR as the biased direction. Given an optimal policy, we compute both the psychometric and chronometric functions by simulating a large number of trials (10,000 trials per data point) and collecting the reaction time and chosen direction from each trial.\n\nThe upper panels of figures 2(a) and 2(b) show the performance accuracy as a function of coherence for both the model (blue solid curve) and the human subjects (blue dots) for the neutral prior Pr(dR) = 0.5. We fit our simulation results to the experimental data by adjusting the only two free parameters in our model: (RN \u2212 RP)/RS and k. The lower panels of figures 2(a) and 2(b) show the predicted mean reaction time for correct choices as a function of coherence c for our model (blue solid curve, with the same model parameters) and the data (blue dots). Note that our model's predicted reaction times represent the expected number of POMDP time steps before making a rightward eye movement AR, which we can directly compare to the monkey's experimental data in units of real time. A linear regression is used to determine the duration \u03c4 of a single time step and the onset of decision time tnd. This offset, tnd, can be naturally interpreted as the non-decision residual time. 
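The Monte Carlo procedure described above can be sketched as follows. This is an illustrative stand-in, not the fitted model: the collapsing boundary \u03c8R(t) is approximated by bound/\u221at instead of the boundary computed from the optimal policy, and the parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(coh, k=1.4, bound=2.0, t_max=1500, n_trials=1000):
    """Simulate rightward-motion trials (d = 1, hidden state x = k*c):
    accumulate samples s_t ~ N(k*c, 1) and stop when the running average
    crosses an illustrative collapsing bound +/- bound/sqrt(t)."""
    n_correct, rts = 0, []
    for _ in range(n_trials):
        total = 0.0
        for t in range(1, t_max + 1):
            total += rng.normal(k * coh, 1.0)
            psi = bound / np.sqrt(t)
            if total / t >= psi:                 # choose A_R (correct)
                n_correct += 1
                break
            if total / t <= -psi:                # choose A_L (error)
                break
        rts.append(t)
    return n_correct / n_trials, float(np.mean(rts))

acc_lo, rt_lo = simulate(0.032)   # weak motion: slower, less accurate
acc_hi, rt_hi = simulate(0.512)   # strong motion: faster, near-perfect
```

Sweeping the coherence traces out the model's psychometric (accuracy vs. c) and chronometric (mean steps vs. c) curves; mean steps are then converted to real time by the linear regression described in the text.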
We applied the experimental mean reaction times reported in [13] for motion coherences c = 0.032, 0.064, 0.128, 0.256 and 0.512 to compute the slope and offset, \u03c4 and tnd. Linear regression gives the unit duration per POMDP step as \u03c4 = 5.74 ms and the offset tnd = 314.6 ms for human SK. For human LH, similar results are obtained, with \u03c4 = 5.20 ms and tnd = 250.0 ms. Our predicted offsets compare well with the 300 ms average non-decision time reported in the literature [23, 24].\n\nWhen the human subject is verbally told that the prior probability is Pr(dR) = Pr(d = 1) = 0.8, the experimental data is inconsistent with the predictions of the classic drift diffusion model [14] unless an additional assumption of a dynamic bias signal is introduced. In the POMDP model we propose, we predict both the accuracy and reaction times in the biased setting (green curves in figure 2) with the parameters learned in the neutral case, and achieve a good fit (with the coefficients of determination shown in fig. 2) to the experimental data reported by Hanks et al. [13]. Our model predictions for the biased cases are a direct result of the reward maximization component of our framework and require no additional parameter fitting.\n\nCombined behavioral data from four monkeys is shown by the dotted curves in figure 2(c). We fit our model parameters to the psychometric function in the neutral case, with \u03c4 = 8.20 ms and tnd = 312.50 ms, and predict both the psychometric function and the reaction times in the biased case. However, our results do not match the monkey data as well as the human data when Pr(dR) = 0.8. This may be due to the fact that the monkeys cannot receive verbal instructions from the experimenters and must learn an estimate of the prior during training. As a result, the monkeys' estimate of the prior probability might be inaccurate. 
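The step-to-time calibration can be written out explicitly. The regression form RT = \u03c4\u00b7(model steps) + tnd is from the text, but the step counts below are hypothetical placeholders (the per-coherence step counts are not tabulated here); \u03c4 = 5.74 ms and tnd = 314.6 ms are the values reported for human SK:

```python
import numpy as np

# Hypothetical expected POMDP step counts, one per motion coherence level.
model_steps = np.array([120.0, 95.0, 60.0, 35.0, 20.0])
# Mean reaction times (ms) such a model would produce under human SK's
# reported calibration: RT = tau * steps + t_nd.
measured_rt = 5.74 * model_steps + 314.6
# Linear regression recovers the per-step duration tau (ms/step) and the
# non-decision residual time t_nd (ms).
tau, t_nd = np.polyfit(model_steps, measured_rt, 1)
```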
To test this hypothesis, we simulated our model with Pr(dR) = 0.7 (see figure 2(d)) and these results fit the experimental data much more accurately (even though the actual probability was 0.8).\n\n3.3 Reaction times in Carpenter and Williams' task\n\nFigure 3: Model predictions of saccadic eye movements in Carpenter & Williams' experiments [10]. (a) Saccadic latency distributions from model simulations, plotted in the form of a probit-scale cumulative mass function, as a function of reciprocal latency. For different values of Pr(dR), the simulated data are well fit by straight lines, indicating that the reciprocal of latency follows a normal distribution. The solid lines are linear functions fit to the data with the constraint that all lines must pass through the same intercept for infinite time (see [10]). (b) Median latency plotted as a function of log prior probability. Black dots are from experimental data and blue dots are model predictions. The two (overlapping) straight lines are the linear least squares fits to the experimental data and model data. These lines do not differ noticeably in either slope or offset. Model parameters: (RN \u2212 RP)/RS = 1,000, k = 0.3, \u03c3e = 0.46.\n\nIn Carpenter and Williams' task, the animal needs to decide on which side d \u2208 {\u22121, 1} (denoting the left or right side) a target light appeared, at a fixed distance from a central fixation light. After the sudden appearance of the target light, a constant stimulus st = s is observed by the animal, where s can be regarded as the perceived location of the target. Due to noise and uncertainty in the nervous system, we assume that s varies from trial to trial, centered at the location of the target light and with standard deviation \u03c3e (i.e., s \u223c N(s | k, \u03c3e\u00b2)), where k is the distance between the target and the fixation light. 
Inference over the direction d thus involves joint inference over (d, k), where the emission probability follows Pr(s|d, k). The joint state (k, d) can then be mapped one-to-one to x = kd, where x represents the actual location of the target light. Under the POMDP framework, Carpenter and Williams' task and the random dots task differ in the interpretation of the hidden state x and stimulus s, but they follow the same optimal policy given the same reward parameters.\n\nWithout loss of generality, we set the hidden variable x > 0 and say that the animal makes a correct choice at a hitting time tH when the animal's belief state reaches the right boundary. The saccadic latency can be computed by inverting the boundary function, \u03c8R\u207b\u00b9(s) = tH. Since, for small t, \u03c8R(t) behaves like a simple reciprocal function of t, the reciprocal of the reaction time approximately follows a normal distribution, with 1/tH \u223c N(1/tH | k, \u03c3e\u00b2). In figure 3(a), we plot the distribution of reciprocal reaction times for different values of Pr(dR) on a probit scale (similar to [10]). Note that we label the y-axis using the CDF of the corresponding probit value, and that the x-axis in figure 3(a) has been reversed. If the reciprocal of reaction time (with the same prior Pr(dR)) follows a normal distribution, each point on the graph will fall on a straight line, with a y-intercept k\u221a2/\u03c3e that is independent of Pr(dR). We fit straight lines to the points on the graph, with the constraint that all lines should pass through the same intercept for infinite time (see [10]). We obtain an intercept of 6.19, consistent with the intercept of 6.20 obtained from experimental data in [10]. Figure 3(b) demonstrates that the median of our model's reaction times is a linear function of the log of the prior probability. Increasing the prior probability lowers the decision boundary \u03c8R(t), effectively decreasing the latency. The slope and intercept of the best fit line are consistent with experimental data (see fig. 3(b)).\n\n4 Summary and Conclusion\n\nOur results suggest that decision making in the primate brain may be governed by the dual principles of Bayesian inference and reward maximization as implemented within the framework of partially observable Markov decision processes (POMDPs). The model provides a unified explanation for experimental data previously explained by two competing models, namely, the additive offset model and the dynamic weighting model for incorporating prior knowledge. In particular, the model predicts psychometric and chronometric data for the random dots motion discrimination task [13] as well as Carpenter and Williams' saccadic eye movement task [10].\n\nPrevious models of decision making, such as the LATER model [10] and the drift diffusion model [25, 15], have provided descriptive accounts of reaction time and accuracy data but often require assumptions such as a collapsing bound, urgency signal, or dynamic weighting to fully explain the data [26, 21, 22, 13]. Our model provides a normative account of the data, illustrating how the subject's choices can be interpreted as being optimal under the framework of POMDPs.\n\nOur model relies on the principle of reward maximization to explain how an animal's decisions are influenced by changes in prior probability. The same principle also allows us to predict how an animal's choice is influenced by changes in the reward function. Specifically, the model predicts that the optimal policy \u03c0\u2217 is determined by the ratio (RN \u2212 RP)/RS and the prior probability Pr(dR). 
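The probit-scale analysis of Figure 3(a) can be sketched as follows (an illustrative reconstruction: we draw reciprocal latencies directly from N(k, \u03c3e\u00b2) with the Figure 3 parameters rather than from full model simulations, and recover the straight-line behavior):

```python
import numpy as np
from statistics import NormalDist

k, sigma_e = 0.3, 0.46                     # parameters used for Figure 3
rng = np.random.default_rng(1)
recip = rng.normal(k, sigma_e, 20000)      # simulated reciprocal latencies 1/t_H

# Empirical CDF of reciprocal latency, mapped through the probit (inverse
# normal CDF); for a normal 1/t_H this is linear in the reciprocal latency.
rs = np.linspace(-0.4, 1.0, 15)
z = np.array([NormalDist().inv_cdf(float(np.mean(recip <= r))) for r in rs])
slope, intercept = np.polyfit(rs, z, 1)    # expect 1/sigma_e and -k/sigma_e
```

Changing the prior shifts the effective threshold, and hence the fitted line, while the extrapolated intercept at infinite time stays fixed; this is the constraint used when fitting the lines in Figure 3(a).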
Thus, a testable prediction of the model is that the speed-accuracy trade-off in tasks such as the random dots task is governed by the ratio (RN − RP)/RS: smaller penalties for sampling (RS) will increase accuracy and reaction time, as will larger rewards for correct choices (RP) or greater penalties for errors (RN). Since the reward parameters in our model represent internal reward, our model also provides a bridge for studying the relationship between physical reward and subjective reward.

In our model of the random dots discrimination task, belief is expressed in terms of a piecewise normal distribution with the domain of the hidden variable x ∈ (−∞, ∞). A piecewise beta distribution with domain x ∈ [−1, 1] fits the experimental data equally well. However, the beta distribution is conjugate only to discrete (Bernoulli or binomial) observation models, which can limit the application of this approach. For example, the observations in Carpenter and Williams' task cannot easily be described by a discrete value. The belief in our model can be expressed by any distribution, even a non-parametric one, as long as the observation model provides a faithful representation of the stimuli and captures the essential relationship between the stimuli and the hidden world state.

The POMDP model provides a unifying framework for a variety of perceptual decision making tasks. Our state variable x and action variable a work with arbitrary state and action spaces, ranging from multiple alternative choices to high-dimensional real-valued choices. The state variables can also be dynamic, with xt following a Markov chain. Currently, we have assumed that the stimuli are independent from one time step to the next, but most real-world stimuli are temporally correlated. Our model is suitable for decision tasks with time-varying states and observations that are time-dependent within a trial (as long as the observations are conditionally independent given the time-varying hidden state sequence). We thus expect our model to be applicable to significantly more complicated tasks than the ones modeled here.

References

[1] D. Knill and W. Richards. Perception as Bayesian inference. Cambridge University Press, 1996.

[2] R.S. Zemel, P. Dayan, and A. Pouget. Probabilistic interpretation of population codes. Neural Computation, 10(2), 1998.

[3] R.P.N. Rao. Bayesian computation in recurrent neural circuits. Neural Computation, 16(1):1–38, 2004.

[4] W.J. Ma, J.M. Beck, P.E. Latham, and A. Pouget. Bayesian inference with probabilistic population codes. Nature Neuroscience, 9(11):1432–1438, 2006.

[5] N.D. Daw, A.C. Courville, and D.S. Touretzky. Representation and timing in theories of the dopamine system. Neural Computation, 18(7):1637–1677, 2006.

[6] P. Dayan and N.D. Daw. Decision theory, reinforcement learning, and the brain. Cognitive, Affective and Behavioral Neuroscience, 8:429–453, 2008.

[7] R. Bogacz and T. Larsen. Integration of reinforcement learning and optimal decision making theories of the basal ganglia. Neural Computation, 23:817–851, 2011.

[8] C.T. Law and J.I. Gold. Reinforcement learning can account for associative and perceptual learning on a visual-decision task. Nat. Neurosci., 12(5):655–663, 2009.

[9] J. Drugowitsch, R. Moreno-Bote, A.K. Churchland, M.N. Shadlen, and A. Pouget. The cost of accumulating evidence in perceptual decision making. J. Neurosci., 32(11):3612–3628, 2012.

[10] R.H.S. Carpenter and M.L.L. Williams. Neural computation of log likelihood in the control of saccadic eye movements. Nature, 377:59–62, 1995.

[11] M.C. Dorris and D.P. Munoz.
Saccadic probability influences motor preparation signals and time to saccadic initiation. J. Neurosci., 18:7015–7026, 1998.

[12] J.I. Gold, C.T. Law, P. Connolly, and S. Bennur. The relative influences of priors and sensory evidence on an oculomotor decision variable during perceptual learning. J. Neurophysiol., 100(5):2653–2668, 2008.

[13] T.D. Hanks, M.E. Mazurek, R. Kiani, E. Hopp, and M.N. Shadlen. Elapsed decision time affects the weighting of prior probability in a perceptual decision task. Journal of Neuroscience, 31(17):6339–6352, 2011.

[14] J.D. Roitman and M.N. Shadlen. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22, 2002.

[15] R. Bogacz, E. Brown, J. Moehlis, P. Hu, P. Holmes, and J.D. Cohen. The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced choice tasks. Psychological Review, 113:700–765, 2006.

[16] R. Ratcliff and G. McKoon. The diffusion decision model: Theory and data for two-choice decision tasks. Neural Computation, 20:127–140, 2008.

[17] P.L. Frazier and A.J. Yu. Sequential hypothesis testing under stochastic deadlines. In Advances in Neural Information Processing Systems, 20, 2007.

[18] R.P.N. Rao. Decision making under uncertainty: A neural model based on POMDPs. Frontiers in Computational Neuroscience, 4(146), 2010.

[19] L.P. Kaelbling, M.L. Littman, and A.R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

[20] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[21] P.E. Latham, Y. Roudi, M. Ahmadi, and A. Pouget. Deciding when to decide. Soc. Neurosci. Abstracts, 740(10), 2007.

[22] A.K. Churchland, R. Kiani, and M.N. Shadlen. Decision-making with multiple alternatives.
Nat. Neurosci., 11(6), 2008.

[23] R.D. Luce. Response Times: Their Role in Inferring Elementary Mental Organization. Oxford University Press, 1986.

[24] M.E. Mazurek, J.D. Roitman, J. Ditterich, and M.N. Shadlen. A role for neural integrators in perceptual decision-making. Cerebral Cortex, 13:1257–1269, 2003.

[25] J. Palmer, A.C. Huk, and M.N. Shadlen. The effects of stimulus strength on the speed and accuracy of a perceptual decision. Journal of Vision, 5:376–404, 2005.

[26] J. Ditterich. Stochastic models and decisions about motion direction: Behavior and physiology. Neural Networks, 19:981–1012, 2006.