{"title": "Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting", "book": "Advances in Neural Information Processing Systems", "page_first": 2607, "page_last": 2615, "abstract": "How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects' choices, on a trial-to-trial basis, are best captured by a forgetful\" Bayesian iterative learning model in combination with a partially myopic decision policy known as Knowledge Gradient. This model accounts for subjects' trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, epsilon-greedy and win-stay-lose-shift. It has the added benefit of being closest in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment.\"", "full_text": "Forgetful Bayes and myopic planning: Human\nlearning and decision-making in a bandit setting\n\nShunan Zhang\n\nDepartment of Cognitive Science\nUniversity of California, San Diego\n\nLa Jolla, CA 92093\ns6zhang@ucsd.edu\n\nAngela J. Yu\n\nDepartment of Cognitive Science\nUniversity of California, San Diego\n\nLa Jolla, CA 92093\najyu@ucsd.edu\n\nAbstract\n\nHow humans achieve long-term goals in an uncertain environment, via repeated\ntrials and noisy observations, is an important problem in cognitive science. 
We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our results show that subjects' choices, on a trial-to-trial basis, are best captured by a "forgetful" Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects' trial-by-trial choices better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closer in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment.\n\n1 Introduction\n\nHow humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. The computational challenges consist of the learning component, whereby the observer updates his/her representation of knowledge and uncertainty based on ongoing observations, and the control component, whereby the observer chooses an action that balances between the short-term objective of acquiring reward and the long-term objective of gaining information about the environment. A classic task used to study such sequential decision-making problems is the multi-armed bandit paradigm [15]. In a standard bandit setting, people are given a limited number of trials to choose among a set of alternatives, or arms. 
After each choice, an outcome is generated based on a hidden reward distribution specific to the arm chosen, and the objective is to maximize the total reward over all trials. The reward gained on each trial both has intrinsic value and informs the decision maker about the relative desirability of the arm, which can help with future decisions. To be successful, decision makers have to balance their decisions between exploration (selecting an arm about which one is ignorant) and exploitation (selecting an arm that is known to have relatively high expected reward).\n\nBecause bandit problems elegantly capture the tension between exploration and exploitation that is manifest in real-world decision-making situations, they have received attention in many fields, including statistics [10], reinforcement learning [11, 19], economics [e.g., 1], psychology and neuroscience [5, 4, 18, 12, 6]. There is no known analytical optimal solution to the general bandit problem, though properties of the optimal solution are known for special cases [10]. For relatively simple, finite-horizon problems, the optimal solution can be computed numerically via dynamic programming [11], though its computational complexity grows exponentially with the number of arms and trials. In the psychology literature, a number of heuristic policies, with varying levels of complexity in the learning and control processes, have been proposed as possible strategies used by human subjects [5, 4, 18, 12]. Most models assume that humans either adopt simplistic policies that retain little information about the past and sidestep long-term optimization (e.g. 
win-stay-lose-shift and ε-greedy), or switch between an exploration mode and an exploitation mode either randomly [5] or discretely over time as more is learned about the environment [18].\n\nIn this work, we analyze a new model of human bandit choice behavior, whose learning component is based on the dynamic belief model (DBM) [21], and whose control component is based on the knowledge gradient (KG) algorithm [7]. DBM is a Bayesian iterative inference model that assumes that there exist statistical patterns in a sequence of observations, and that they tend to change at a characteristic timescale [21]. DBM was proposed as a normative learning framework that can capture the commonly observed sequential effects in human choice behavior, whereby choice probabilities (and response times) are sensitive to the local history of preceding events in a systematic manner, even when subjects are instructed that the design is randomized, so that any local trends arise merely by chance and are not truly predictive of upcoming stimuli [13, 8, 20, 3]. KG is a myopic approximation to the optimal policy for sequential informational control problems, originally developed for operations research applications [7]; KG is known to be exactly optimal in some special cases of bandit problems, such as when there are only two arms. Conditioned on the previous observations at each step, KG chooses the option that maximizes the future cumulative reward gain, based on the myopic assumption that the next observation is the last exploratory choice, and that all remaining choices will be exploitative (choosing the option with the highest expected reward by the end of the next trial). Note that this myopic assumption is only used to reduce the complexity of computing the expected value of each option, and is not actually enforced in practice: the algorithm may end up executing arbitrarily many non-exploitative choices. 
KG tends to explore more when the number of trials left is large, because finding an arm with even a slightly better reward rate than the currently best known one can lead to a large cumulative advantage in future gain; on the other hand, when the number of trials left is small, KG tends to stay with the currently best known option, as the relative benefit of finding a better option diminishes against the risk of wasting the limited remaining trials. KG has been shown to outperform several established models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift, for human decision-making in bandit problems, under two learning scenarios other than DBM [22].\n\nIn the following, we first describe the experiment, then describe all the learning and control models that we consider. We then compare the performance of the models both in terms of agreement with human behavior on a trial-to-trial basis, and in terms of computational optimality.\n\n2 Experiment\n\nWe adopt data from [18], where a total of 451 subjects participated in the experiment as part of "testweek" at the University of Amsterdam. Each participant completed 20 bandit problems in sequence; all problems had 4 arms and 15 trials. The reward rates were fixed for all arms in each game, and were generated, prior to the start of data collection, independently from a Beta(2,2) distribution. All participants played games with the same reward rates, but the order of the games was randomized. Participants were instructed that the reward rates in all games were drawn from the same environment, and that the reward rates were drawn only once; participants were not told the exact form of the Beta environment, i.e. Beta(2,2). 
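The task structure just described is straightforward to simulate. Below is a minimal sketch; the function names and the random-baseline policy are ours, purely for illustration, and are not part of the experiment:

```python
import numpy as np

K, T, N_GAMES = 4, 15, 20  # arms, trials per game, games per participant

def make_environment(rng):
    """Draw the fixed reward rates: one Beta(2,2) draw per arm per game."""
    return rng.beta(2, 2, size=(N_GAMES, K))

def play_game(policy, rates, rng):
    """Play one 15-trial game; `policy(S, F, t)` returns the arm to pull."""
    S = np.zeros(K, dtype=int)  # per-arm success counts
    F = np.zeros(K, dtype=int)  # per-arm failure counts
    total = 0
    for t in range(T):
        k = policy(S, F, t)
        r = int(rng.random() < rates[k])  # Bernoulli reward, rate fixed per game
        S[k] += r
        F[k] += 1 - r
        total += r
    return total

rng = np.random.default_rng(0)
rates = make_environment(rng)
# hypothetical baseline: choose an arm uniformly at random on every trial
reward = play_game(lambda S, F, t: int(rng.integers(K)), rates[0], rng)
```

Any of the policies discussed below can be dropped in for the random baseline, since each sees only the success and failure counts and the trial index.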
A screenshot of the experimental interface is shown in Fig. 1a.\n\n3 Models\n\nFigure 1: (a) A screenshot of the experimental interface. The four panels correspond to the four arms, each of which can be chosen by clicking the corresponding button. In each panel, successes from previous trials are shown as green bars, and failures as red bars. At the top of each panel, the ratio of successes to failures, if defined, is shown. The top of the interface provides the count of the total number of successes up to the current trial, the index of the current trial, and the index of the current game. (b) Bayesian graphical model of FBM, assuming fixed reward probabilities. θ ∈ [0,1], R^t ∈ {0,1}. The inset shows an example of the Beta prior for the reward probabilities. The numbers in circles show example values for the variables. (c) Bayesian graphical model of DBM, assuming reward probabilities change from trial to trial. P(θ^t) = γ δ(θ^t = θ^{t-1}) + (1 − γ) P_0(θ^t).\n\nThere exist multiple levels of complexity and optimality in both the learning and the decision components of decision-making models of bandit problems. For the learning component, we examine whether people maintain any statistical representation of the environment at all, and if they do, whether they only keep a mean estimate (running average) of the reward probability of the different options, or also uncertainty about those estimates; in addition, we consider the possibility that they entertain trial-by-trial fluctuation of the reward probabilities. The decision component can also differ in complexity in at least two respects: the objective the decision policy tries to optimize (e.g. reward versus information), and the time horizon over which the decision policy optimizes its objective (e.g. greedy versus long-term). 
In this section, we introduce models that incorporate different combinations of learning and decision policies.\n\n3.1 Bayesian Learning in Beta Environments\n\nThe observations are generated independently and identically distributed (iid) from an unknown Bernoulli distribution for each arm. We consider two Bayesian learning scenarios below: the dynamic belief model (DBM), which assumes that the Bernoulli reward rates for all the arms can reset on any trial with probability 1 − γ, and the fixed belief model (FBM), a special case of DBM that assumes the reward rates to be stationary throughout each game. In either case, we assume the prior distribution that generates the Bernoulli rates is a Beta distribution, Beta(α, β), which is conjugate to the Bernoulli distribution, and whose two hyper-parameters, α and β, specify the pseudo-counts associated with the prior.\n\n3.1.1 Dynamic Belief Model\n\nUnder the dynamic belief model (DBM), the reward probabilities can undergo discrete changes at times during the experimental session, such that on any trial, the subject's prior belief is a mixture of the posterior belief from the previous trial and a generic prior. The subject's implicit task is then to track the evolving reward probability of each arm over the course of the experiment. Suppose on each game we have K arms with reward rates θ_k, k = 1,···,K, which are generated iid from Beta(α, β). Let S_k^t and F_k^t be the numbers of successes and failures obtained from the kth arm up to trial t, and let θ_k^t be the reward probability of arm k on trial t. We assume θ_k^t has a Markovian dependence on θ_k^{t-1}, such that there is a probability γ of them being the same, and a probability 1 − γ of θ_k^t being redrawn from the prior distribution Beta(α, β). 
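This generative process yields a simple iterative belief update, made concrete below on a discretized grid of θ values; the grid size and function names are our choices for illustration, and FBM is recovered by setting γ = 1:

```python
import numpy as np

N = 1000
theta = (np.arange(N) + 0.5) / N  # grid midpoints over [0, 1]

def beta_prior(a, b):
    """Generic prior q0 = Beta(a, b), normalized on the grid."""
    q0 = theta ** (a - 1) * (1 - theta) ** (b - 1)
    return q0 / q0.sum()

def dbm_update(q, r, gamma, q0):
    """One DBM trial for a single arm: mix last posterior with the generic
    prior (the 'forgetting' step), then condition on the Bernoulli outcome r."""
    prior = gamma * q + (1 - gamma) * q0              # mixture prior
    post = prior * (theta if r == 1 else 1 - theta)   # Bernoulli likelihood
    return post / post.sum()

q0 = beta_prior(2, 2)
q1 = dbm_update(q0, 1, 1.0, q0)      # posterior after one success
q2_fbm = dbm_update(q1, 1, 1.0, q0)  # FBM: retain all past evidence
q2_dbm = dbm_update(q1, 1, 0.8, q0)  # DBM: partially forget before updating
```

With γ < 1, each update first shrinks the belief back toward the generic prior, which is what makes recent outcomes weigh more heavily than distant ones.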
The Bayesian ideal observer combines the sequentially developed prior belief about reward probabilities with the incoming stream of observations (successes and failures on each arm) to infer the new posterior distributions. The observation R_k^t is assumed to be Bernoulli: R_k^t ~ Bernoulli(θ_k^t). We use the notation q_k^t(θ_k^t) := Pr(θ_k^t | S_k^t, F_k^t) to denote the posterior distribution of θ_k^t given the observed sequence, also known as the belief state. On each trial, the new posterior distribution can be computed via Bayes' Rule:\n\nq_k^t(θ_k^t) ∝ Pr(R_k^t | θ_k^t) Pr(θ_k^t | S_k^{t-1}, F_k^{t-1})    (1)\n\nwhere the prior probability is a weighted sum (parameterized by γ) of last trial's posterior and the generic prior q_0 := Beta(α, β):\n\nPr(θ_k^t = θ | S_k^{t-1}, F_k^{t-1}) = γ q_k^{t-1}(θ) + (1 − γ) q_0(θ)    (2)\n\n3.1.2 Fixed Belief Model\n\nA simpler generative model (and a more correct one, given the true, stationary environment) is to assume that the statistical contingencies in the task remain fixed throughout each game, i.e. all bandit arms have fixed probabilities of giving a reward throughout the game. What the subjects would then learn about the task over the time course of the experiment is the true value of θ. We call this model the fixed belief model (FBM); it can be viewed as a special case of DBM with γ = 1. In the Bayesian update rule, the prior on each trial is simply the posterior from the previous trial. Figures 1b and 1c illustrate the graphical models of FBM and DBM, respectively.\n\n3.2 Decision Policies\n\nWe consider four different decision policies. 
We first describe the optimal model, and then the three heuristic models with increasing levels of complexity.\n\n3.2.1 The Optimal Model\n\nThe learning and decision problem for bandit problems can be viewed as a Markov Decision Process with a finite horizon [11], with the state being the belief state q^t = (q_1^t, q_2^t, q_3^t, q_4^t), which provides the sufficient statistics for all the data seen up through trial t. Due to the low dimensionality of the bandit problem here (i.e. the small number of arms and trials per game), the optimal policy, up to a discretization of the belief state, can be computed numerically using Bellman's dynamic programming principle [2]. Let V^t(q^t) be the expected total future reward on trial t. The optimal policy satisfies the iterative property\n\nV^t(q^t) = max_k ( θ_k^t + E[V^{t+1}(q^{t+1})] )    (3)\n\nand the optimal action, D^t, is chosen according to\n\nD^t(q^t) = argmax_k ( θ_k^t + E[V^{t+1}(q^{t+1})] )    (4)\n\nWe solve the equations using dynamic programming, backward in time from the last time step, whose value function and optimal policy are known for any belief state: always choose the arm with the highest expected reward; the value function is then just that expected reward. In the simulations, we compute the optimal policy off-line for every conceivable setting of the belief state on each trial (up to a fine discretization of the belief-state space), and then apply the computed policy to each sequence of choices and observations that each subject experiences. We use the term "the optimal solution" to refer to the specific solution under α = 2 and β = 2, which is the true experimental design.\n\n3.2.2 Win-Stay-Lose-Shift\n\nWin-stay-lose-shift (WSLS) does not learn any abstract representation of the environment, and has a very simple decision policy. 
It assumes that the decision-maker keeps choosing the same arm as long as it continues to produce a reward, but shifts to the other arms (with equal probability) following a failure to gain reward. On the first trial it chooses randomly (equal probability for all arms).\n\n3.2.3 ε-Greedy\n\nThe ε-greedy model assumes that decision-making is governed by a parameter ε that controls the balance between random exploration and exploitation. On each trial, with probability ε the decision-maker chooses randomly (exploration); otherwise he/she chooses the arm with the greatest estimated reward rate (exploitation). ε-Greedy keeps simple estimates of the reward rates, but does not track the uncertainty of those estimates. It is not sensitive to the horizon, maximizing the immediate gain at a constant rate and otherwise searching for information by random selection. More concretely, ε-greedy adopts the stochastic policy\n\nPr(D^t = k | ε, θ^t) = (1 − ε)/M_t if k ∈ argmax_{k'} θ_{k'}^t, and ε/(K − M_t) otherwise,\n\nwhere M_t is the number of arms with the greatest estimated value on the tth trial.\n\n3.2.4 Knowledge Gradient\n\nThe knowledge gradient (KG) algorithm [16] is an approximation to the optimal policy: it pretends that only one more exploratory measurement is allowed, and assumes that all remaining choices will exploit what is known after that measurement. 
It evaluates the expected change in each estimated reward rate if a certain arm were to be chosen, based on the current belief state. Its approximate value function for choosing arm k on trial t, given the current belief state q^t, is\n\nv_k^{KG,t} = E[ max_{k'} θ_{k'}^{t+1} | D^t = k, q^t ] − max_{k'} θ_{k'}^t    (5)\n\nThe first term is the expected largest reward rate on the next step (the value of the subsequent exploitative choices) if the kth arm were to be chosen, with the expectation taken over all possible outcomes of choosing k; the second term is the largest expected reward rate under the current belief state; their difference is the "knowledge gradient" of taking one more exploratory sample. The KG decision rule is\n\nD^{KG,t} = argmax_k ( θ_k^t + (T − t − 1) v_k^{KG,t} )    (6)\n\nThe first term of Equation 6 denotes the expected immediate reward of choosing the kth arm on trial t, whereas the second term reflects the expected knowledge gain. The formula for calculating v_k^{KG,t} for binary bandit problems can be found in Chapter 5 of [14].\n\n3.3 Model Inference and Evaluation\n\nUnlike previous modeling papers on human decision-making in the bandit setting [5, 4, 18, 12], which generally look at the average statistics of how people distribute their choices among the options, here we use a more stringent trial-by-trial measure of model agreement, i.e. how well each model captures the subject's choice on each trial. We calculate the per-trial likelihood of the subject's choice conditioned on the previously experienced choices and outcomes. For WSLS, it is 1 for a win-stay decision, 1/3 for a lose-shift decision (because the model predicts shifting to each of the other three arms with equal probability), and 0 otherwise. 
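This per-trial agreement measure for WSLS can be sketched as follows (the function name and the array encoding are ours, for illustration):

```python
import numpy as np

def wsls_likelihoods(choices, rewards, n_arms=4):
    """Per-trial likelihood of a subject's choices under WSLS.
    choices[t] is the arm chosen on trial t; rewards[t] is the binary outcome.
    The first trial is excluded, since WSLS starts with a uniform random pick."""
    liks = []
    for t in range(1, len(choices)):
        stayed = choices[t] == choices[t - 1]
        if rewards[t - 1] == 1:   # win: WSLS predicts staying
            liks.append(1.0 if stayed else 0.0)
        else:                     # lose: shift uniformly to the other arms
            liks.append(1.0 / (n_arms - 1) if not stayed else 0.0)
    return np.array(liks)

# win-stay (likelihood 1), then lose-shift (likelihood 1/3), then win-stay
example = wsls_likelihoods([0, 0, 2, 2], [1, 0, 1, 1])
```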
For probabilistic models, take ε-greedy for example: the likelihood is (1 − ε)/M if the subject chooses an option with the highest predicted reward, where M is the number of arms sharing the highest predicted reward, and ε/(4 − M) for any other choice; when M = 4, all arms are treated as having the highest predicted reward.\n\nWe use sampling to compute a posterior distribution over the following model parameters: the parameters of the prior Beta distribution (α and β) for all policies, γ for all DBM policies, and ε for ε-greedy. For this model-fitting process, we infer the re-parameterization α/(α + β) and α + β, with a uniform prior on the former and a weakly informative prior on the latter, i.e. Pr(α + β) ∝ (α + β)^{−3/2}, as suggested by [9]. This re-parameterization has a psychological interpretation as the mean reward probability and the certainty about it. We use uniform priors for ε and γ. Model inference uses a combined sampling algorithm, with Gibbs sampling of ε, and Metropolis sampling of γ, α and β. All chains contained 3000 steps, with a burn-in of 1000 steps. All chains converged according to the R-hat measure [9]. We calculate the average per-trial likelihood (across trials, games, and subjects) under each model based on its maximum a posteriori (MAP) parameterization.\n\nWe fit each model across all subjects, assuming that every subject shares the same prior belief about the environment (α and β), rate of exploration (ε), and rate of change (γ). For further analyses shown in the results section, we also fit the ε-greedy policy and the KG policy, together with both learning models, to each individual subject. All model inferences are based on a leave-one-out cross-validation containing 20 runs. 
Specifically, for each run, we train the model while withholding one game (sampled without replacement) from each subject, and test the model on the withheld game.\n\nFigure 2: (a) Model agreement with data simulated by the optimal solution, measured as the average per-trial likelihood. All models (except the optimal) are fit to data simulated by the optimal solution under the correct Beta prior, Beta(2,2). Each bar shows the mean per-trial likelihood (across all subjects, trials and games) of a decision policy coupled with a learning framework. For ε-greedy (eG) and KG, the error bars show the standard errors of the mean per-trial likelihood calculated across all tests in the cross-validation procedure (20-fold). WSLS does not rely on any learning framework. (b) Model agreement with human data based on a leave-one(game)-out cross-validation, where we randomly withhold one game from each subject for training, i.e. we train the model on a total of 19 × 451 games, with 19 games from each subject. For the current study, we implement the optimal policy under DBM using the γ estimated under the KG DBM model, in order to reduce the computational burden. (c) Mean per-trial likelihood of the ε-greedy model (eG) and KG with individually fit parameters (for each subject), using cross-validation; the individualized DBM (abbreviated ind. in the legend) assumes each person has his/her own Beta prior and γ. (d) Trialwise agreement of eG and KG under the individually fit MAP parameterization. The mean per-trial likelihood is calculated across all subjects for each trial, with the error bars showing the standard error of the mean per-trial likelihood across all tests.\n\n4 Results\n\n4.1 Model Agreement with the Optimal Policy\n\nWe first examine how well each of the decision policies agrees with the optimal policy on a trial-to-trial basis. 
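As a concrete reference for these comparisons, the KG quantities of Equations 5 and 6 can be computed exactly for independent Beta-distributed beliefs. Below is a minimal sketch under the FBM assumption (function names are ours, not the paper's implementation):

```python
import numpy as np

def kg_values(alpha, beta):
    """Knowledge-gradient value v_k (Eq. 5): the expected next-step best
    posterior mean if arm k is pulled, minus the current best mean, with the
    expectation taken over the two Bernoulli outcomes of the pull."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    mu = alpha / (alpha + beta)  # current posterior means
    v = np.zeros(len(mu))
    for k in range(len(mu)):
        for r, p in ((1, mu[k]), (0, 1.0 - mu[k])):  # success / failure
            nxt = mu.copy()
            nxt[k] = (alpha[k] + r) / (alpha[k] + beta[k] + 1.0)
            v[k] += p * nxt.max()
        v[k] -= mu.max()
    return v

def kg_choice(alpha, beta, trials_left):
    """KG decision rule (Eq. 6): immediate reward plus weighted knowledge gain."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    mu = alpha / (alpha + beta)
    return int(np.argmax(mu + (trials_left - 1) * kg_values(alpha, beta)))
```

With equal posterior means, KG prefers the less-sampled arm when many trials remain, since its posterior mean can move further after one more observation.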
Figure 2a shows the mean per-trial likelihood (averaged across all tests in the cross-validation procedure) of each model when fit to data simulated by the optimal solution under the true design, Beta(2,2). The KG algorithm, under either learning framework, is the most consistent (over 90%) with the optimal algorithm (separately under the FBM and DBM assumptions). This is not surprising, given that KG is an approximation to the optimal policy. The inferred prior is Beta(1.93, 2.15), correctly recovering the actual environment. The simplest model, WSLS, on the other hand, achieves model agreement well above 60%. In fact, the optimal model also almost always stays after a success; the only situation in which WSLS does not resemble the optimal decision occurs when it shifts away from an arm that the optimal policy would otherwise stay with. Because the optimal solution (which simulated the data) knows the true environment, DBM has no advantage over FBM.\n\n4.2 Model Agreement with Human Data\n\nFigure 2b shows the mean per-trial likelihood (averaged across all tests in the cross-validation procedure) of each model when fit to the human data. KG with DBM outperforms the other models under consideration. The average posterior mean of γ across all tests is .81, with standard error .091. The average posterior means for α and β are .65 and 1.05, with standard errors .074 and .122, respectively. A γ value of .81 implies that the subjects behave as if they think the world changes on average about every 5 steps (calculated as 1/(1 − .81)).\n\nWe compared models pairwise on the mean per-trial likelihood of the subject's choice given each model's predictive distribution, using pairwise t-tests. The test between DBM-
The test between DBM-\n\n6\n\n054055056057058059060061062063064065066067068069070071072073074075076077078079080081082083084085086087088089090091092093094095096097098099100101102103104105106107WSLSeGKG0.40.50.60.70.80.91Model agreement with optimala FBMDBMWSLSOptimaleGKG0.40.50.60.70.8Model agreement with subjectsb FBMDBMeGKG0.40.50.60.70.8Individually\u2212fit Model agreementc 510150.550.60.650.70.750.80.85dTrialTrialwise model agreement DBMDBM ind.eG ind.KG ind.Figure1:AveragerewardachievedbytheKGmodelforwardplayingthebanditproblemswiththesamerewardrates.KGachievessimilarrewarddistributionasthehumanperformance,withKGplayingatitsmaximumaposterioriprobability(MAP)estimate,a=.1andb=.8.KGachievesthesamerewarddistributionastheoptimalsolutionwhenplayingwiththecorrectpriorknowledgeoftheenvironment.NewRomanisthepreferredtypefacethroughout.Paragraphsareseparatedby1/2linespace,withnoindentation.Papertitleis17point,initialcaps/lowercase,bold,centeredbetween2horizontalrules.Topruleis4pointsthickandbottomruleis1pointthick.Allow1/4inchspaceaboveandbelowtitletorules.Allpagesshouldstartat1inch(6picas)fromthetopofthepage.Forthe\ufb01nalversion,authors\u2019namesaresetinboldface,andeachnameiscenteredabovethecorre-spondingaddress.Theleadauthor\u2019snameistobelisted\ufb01rst(left-most),andtheco-authors\u2019names(ifdifferentaddress)aresettofollow.Ifthereisonlyoneco-author,listbothauthorandco-authorsidebyside.Pleasepayspecialattentiontotheinstructionsinsection3regarding\ufb01gures,tables,acknowledg-ments,andreferences.2Headings:\ufb01rstlevelFirstlevelheadingsarelowercase(exceptfor\ufb01rstwordandpropernouns),\ufb02ushleft,boldandinpointsize12.Onelinespacebeforethe\ufb01rstlevelheadingand1/2linespaceafterthe\ufb01rstlevelheading.2.1Headings:secondlevelSecondlevelheadingsarelowercase(exceptfor\ufb01rstwordandpropernouns),\ufb02ushleft,boldandinpointsize10.Onelinespacebeforethesecondlevelheadingand1/2linespaceafterthesecondlevelheading.2.1.1Headings:thirdlevelThirdlevelheadingsarelowerc
optimal and DBM-eG, and the test between DBM-optimal and FBM-optimal, are not significant at the .05 level. All other tests are significant. Table 1 shows the p-values for each pairwise comparison.\n\nTable 1: P-values for all pairwise t-tests.\n\nFigure 3: Behavioral patterns in the human data and in data simulated from all models. The four panels show the trialwise probability of staying after winning, shifting after losing, choosing the arm with the greatest estimated value, and choosing the least-known arm when the exploitative choice is not made, respectively. Probabilities are calculated from data simulated from each model at its MAP parameterization, and are averaged across all games and all participants. The optimal solution shown here uses the correct prior Beta(2,2).\n\nFigure 2c shows the model agreement with human data of ε-greedy and KG when their parameters are individually fit. KG with DBM and individual parameterization has the best performance under cross-validation. 
ε-Greedy also gains considerably in model agreement when coupled with DBM. In fact, under DBM, ε-greedy and KG have similar overall model agreement. However, Figure 2d shows a systematic difference between the two models in their trial-by-trial agreement with human data: during early trials, subjects' behavior is more consistent with ε-greedy, whereas during later trials it is more consistent with KG.\n\nWe next break down the overall behavioral performance into four finer measures: how often people win-stay and lose-shift, how often they exploit, and whether they use random selection or search for the greatest amount of information during exploration. Figure 3 shows the results of model comparisons on these additional behavioral criteria. We show the patterns of the subjects, the optimal solution with Beta(2,2), KG and eG under both learning frameworks, and the simplest WSLS.\n\nThe first panel, for example, shows the trialwise probability of staying with the same arm following a previous success. People do not always stay with the same arm after an immediate reward, whereas this is always the case for the optimal algorithm. Subjects also do not persistently explore, as predicted by ε-greedy. In fact, subjects explore more during early trials and become more exploitative later on, similar to KG. As implied by Equation 5, KG calculates the probability of an arm surpassing the currently best known one if chosen, and weights the knowledge gain more heavily in the early stage of the game. During the early trials, it sometimes chooses the second-best arm to maximize the knowledge gain. 
Under DBM, a previous success makes the corresponding arm appear more rewarding, resulting in a smaller knowledge-gradient value; because knowledge is weighted more heavily during the early trials, the KG model then tends to choose second-best arms that have a larger knowledge gain.\n\nThe second panel shows the trialwise probability of shifting away given a previous failure. As the horizon approaches, it becomes increasingly important to stay with an arm that is known to be reasonably good, even if it may occasionally yield a failure. All algorithms, except for the naive WSLS algorithm, show a downward trend in shifting after losing as the horizon approaches, in line with human choices. ε-Greedy with DBM learning is closest to human behavior.\n\nThe third panel shows the probability of choosing the arm with the largest success ratio. KG, under FBM, mimics the optimal model in that the probability of choosing the arm with the highest success ratio increases over time; both grossly overestimate subjects' tendency to select the arm with the highest success ratio.
ratio, as well as predicting an unrealized upward trend. WSLS underestimates how often subjects make this choice, while ε-greedy under DBM learning overestimates it. KG under DBM and ε-greedy under FBM are closest to subjects' behavior.
The fourth panel shows how often subjects choose to explore the least-known option when they shift away from the choice with the highest expected reward. DBM, with either KG or ε-greedy, provides the best fit.
In general, the KG model with DBM matches the second-order trends of the human data best, with ε-greedy following closely behind. However, a gap remains on the absolute scale, especially with respect to the probability of staying with a successful arm.

5 Discussion

Our analysis suggests that human behavior in the multi-armed bandit task is best captured by a knowledge-gradient decision policy supported by a dynamic belief model learning process. Human subjects tend to explore more often than policies that optimize the specific utility of the bandit problems, and KG with DBM attributes this tendency to the belief of a stochastically changing environment, which also produces sequential effects driven by recent trial history. Concretely, we find that people adopt a learning process that (erroneously) assumes the world to be non-stationary, and that they employ a semi-myopic choice policy that is sensitive to the horizon but assumes one-step exploration when comparing action values.
Our results indicate that all decision policies considered here capture human data much better under the dynamic belief model than under the fixed belief model. By assuming the world is changeable, DBM discounts data from the distant past in favor of new data. Instead of attributing this discounting behavior to biological limitations (e.g.
memory loss), DBM explains it as the automatic engagement of mechanisms that are critical for adapting to a changing environment. Indeed, previous work suggests that people approach bandit problems as if expecting a changing world [17], despite being informed that the arms have fixed reward probabilities.
So far, our results also favor the knowledge gradient policy as the best model of human decision-making in the bandit task. It optimizes the semi-myopic goal of maximizing future cumulative reward while assuming only one more time step of exploration and strict exploitation thereafter. The KG model under the more general DBM has the largest proportion of correct predictions of human data, and captures the trial-wise dynamics of human behavior reasonably well. This result implies that humans may explore in a normative way, as captured by KG, combining immediate reward expectation and long-term knowledge gain, in contrast to previously proposed behavioral models, which typically assume that exploration is random or arbitrary. In addition, KG achieves behavioral patterns similar to the optimal model while being computationally much less expensive (in particular, it is online and incurs a constant per-trial cost), making it a more plausible algorithm for human learning and decision-making.
We observed that decision policies vary systematically in their abilities to predict human behavior on different kinds of trials. In the real world, people might use hybrid policies to solve bandit problems; they might also use smart heuristics that dynamically adjust the weight of the knowledge gain relative to the immediate reward gain. Figure 2d suggests that subjects may adopt a strategy that is aggressively greedy at the beginning of the game, and then switch to a policy that is sensitive both to the value of exploration and to the impending horizon as the end of the game approaches.
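The "forgetful" update at the heart of DBM can be sketched for a single Bernoulli arm by discretizing the reward rate on a grid, following the iterative learning model of [21]. Here `gamma` (the assumed probability that the reward rate persists from one trial to the next) and the grid resolution are illustrative choices; `gamma = 1` recovers the fixed belief model.

```python
import numpy as np

def dbm_update(belief, grid, outcome, gamma, prior):
    """One trial of Dynamic Belief Model learning for a Bernoulli reward rate.
    belief: p(theta | past data) on a discrete grid of candidate rates.
    gamma:  probability the rate persisted since the last trial (illustrative)."""
    # prediction step: with prob. gamma the rate is unchanged,
    # otherwise it is redrawn from the prior -- the "forgetting"
    predicted = gamma * belief + (1 - gamma) * prior
    # correction step: Bernoulli likelihood of the outcome (1 = reward, 0 = none)
    likelihood = grid if outcome == 1 else 1.0 - grid
    posterior = likelihood * predicted
    return posterior / posterior.sum()
```

Because each trial mixes a fraction 1 − gamma of the prior back into the belief, evidence from the distant past is exponentially discounted in favor of recent outcomes, which is what produces the sequential effects discussed above.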
One possibility is that subjects discount future rewards, which would produce more exploitative behavior than non-discounted KG, especially at the beginning of the game. These would all be interesting lines of future inquiry.

Acknowledgments

We thank M. Steyvers and E.-J. Wagenmakers for sharing the data. This material is based upon work supported, in whole or in part, by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF1110391 and by NIH NIDA B/START # 1R03DA030440-01A1.

References
[1] J. Banks, M. Olson, and D. Porter. An experimental analysis of the bandit problem. Economic Theory, 10:55-77, 1997.
[2] R. Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 1952.
[3] R. Cho, L. Nystrom, E. Brown, A. Jones, T. Braver, P. Holmes, and J. D. Cohen. Mechanisms underlying dependencies of performance on stimulus history in a two-alternative forced-choice task. Cognitive, Affective and Behavioral Neuroscience, 2:283-299, 2002.
[4] J. D. Cohen, S. M. McClure, and A. J. Yu. Should I stay or should I go? Exploration versus exploitation. Philosophical Transactions of the Royal Society B: Biological Sciences, 362:933-942, 2007.
[5] N. D. Daw, J. P. O'Doherty, P. Dayan, B. Seymour, and R. J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441:876-879, 2006.
[6] A. Ejova, D. J. Navarro, and A. F. Perfors. When to walk away: The effect of variability on keeping options viable. In N. Taatgen, H. van Rijn, L. Schomaker, and J. Nerbonne, editors, Proceedings of the 31st Annual Conference of the Cognitive Science Society, Austin, TX, 2009.
[7] P. Frazier, W. Powell, and S. Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47:2410-2439, 2008.
[8] W. R. Garner.
An informational analysis of absolute judgments of loudness. Journal of Experimental Psychology, 46:373-380, 1953.
[9] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, FL, 2nd edition, 2004.
[10] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, 41:148-177, 1979.
[11] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[12] M. D. Lee, S. Zhang, M. Munro, and M. Steyvers. Psychological models of human and optimal performance in bandit problems. Cognitive Systems Research, 12:164-174, 2011.
[13] M. I. Posner and Y. Cohen. Components of visual orienting. In Attention and Performance Vol. X, 1984.
[14] W. Powell and I. Ryzhov. Optimal Learning. Wiley, 1st edition, 2012.
[15] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.
[16] I. Ryzhov, W. Powell, and P. Frazier. The knowledge gradient algorithm for a general class of online learning problems. Operations Research, 60:180-195, 2012.
[17] J. Shin and D. Ariely. Keeping doors open: The effect of unavailability on incentives to keep options viable. Management Science, 50:575-586, 2004.
[18] M. Steyvers, M. D. Lee, and E.-J. Wagenmakers. A Bayesian analysis of human decision-making on bandit problems. Journal of Mathematical Psychology, 53:168-179, 2009.
[19] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[20] M. C. Treisman and T. C. Williams. A theory of criterion setting with an application to sequential dependencies. Psychological Review, 91:68-111, 1984.
[21] A. J. Yu and J. D. Cohen. Sequential effects: Superstition or rational behavior?
In Advances in Neural Information Processing Systems, volume 21, pages 1873-1880, Cambridge, MA, 2009. MIT Press.
[22] S. Zhang and A. J. Yu. Cheap but clever: Human active learning in a bandit setting. In Proceedings of the Cognitive Science Society Conference, 2013.