{"title": "Goal-directed decision making in prefrontal cortex: a computational framework", "book": "Advances in Neural Information Processing Systems", "page_first": 169, "page_last": 176, "abstract": "Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection.", "full_text": "Goal-directed  decision making  in  prefrontal\n\ncortex: A  computational framework\n\n                         Matthew Botvinick   \n            Princeton Neuroscience Institute and                        Computer Science Department\n      Department of Psychology, Princeton University                  Princeton University\n                        Princeton, NJ 08540\n                    matthewb@princeton.edu\n\n                         Princeton, NJ 08540\n                          an@princeton.edu\n\n                                  James An\n\nAbstract\n\nResearch  in  animal  learning  and  behavioral  neuroscience  has  distinguished\nbetween  two  forms  of  action  control:  a  habit-based  form,  which  relies  on\nstored  action  values,  and  a  goal-directed  form,  which  forecasts  and\ncompares  action  outcomes  based  on  a  model  of  the  environment.    While\nhabit-based  control  has  been  the  subject  of  extensive  computational\nresearch,  the  computational  principles  underlying  goal-directed  control  in\nanimals  have  so  far  received  less  attention.    In  the  present  paper,  we\nadvance  a  computational  framework  for  goal-directed  control  in  animals\nand  humans.    We  take  three  empirically  motivated  points  as  founding\npremises:  (1)  Neurons  in  dorsolateral  prefrontal  cortex  represent  action\npolicies,  (2)  Neurons  in  orbitofrontal  cortex  represent  rewards,  and  (3)\nNeural  computation,  across  domains,  can  be  appropriately  understood  as\nperforming  structured  probabilistic  inference.    
On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection.

1 Goal-directed action control

In the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2]. Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goal-directed behavior [4].

In discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it is currently a critical, and quite open, question what the relevant computations within this part of the brain might be.

Of course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that it seems particularly appealing to consider. In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12].
The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms.

One challenge in investigating this possibility is that previous research furnishes no 'off-the-shelf' algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilistic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection.

2 Computational model

As noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions. In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode task-specific mappings from stimuli to responses (e.g., [16]): "task representations," in the language of psychology, or "policies" in the language of dynamic programming. Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected. Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18].
As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear. However, some evidence suggests that the environmental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21].

As detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions, are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference.

2.1 Architecture

The implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable. State variables (s), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. Action variables (a), representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables (\pi), each representing the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further below, capture the role of OFC in representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon.

Fig 1. Left: Single-step decision. Right: Sequential decision. Each time-slice includes a set of m policy nodes.

The conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable (u). States associated with large positive reward raise p(u) (i.e., p(u=1|s)) near to one; states associated with large negative rewards reduce p(u) to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to p(u):

p(u \mid s_i) = \frac{1}{2} \left( \frac{R(s_i)}{r_{\max}} + 1 \right), \qquad r_{\max} \equiv \max_j |R(s_j)|        (1)

In situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a "global" utility variable (u_G). Like u, this is a binary random variable, but associated with a posterior probability determined as:

p(u_G) = \frac{1}{N} \sum_i p(u_i)        (2)

where N is the number of u nodes. (Temporal discounting can be incorporated into the framework through minimal modifications to Equation 2.)
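To make the reward encoding concrete, the following is a minimal Python sketch of the mappings in Equations (1) and (2). The state names and reward values are illustrative only, and are not the parameters of the simulations reported below.

# Sketch of Equation (1): map scalar rewards R(s) onto p(u = 1 | s),
# and of Equation (2): average local utility probabilities into p(u_G = 1).
# Illustrative values only; not the paper's simulation parameters.

def reward_to_prob(R):
    """R: dict mapping state -> scalar reward. Returns p(u = 1 | s) per Equation (1)."""
    r_max = max(abs(r) for r in R.values())
    return {s: 0.5 * (r / r_max + 1.0) for s, r in R.items()}

def global_utility(p_u_list):
    """Equation (2): p(u_G = 1) is the mean of the local utility probabilities."""
    return sum(p_u_list) / len(p_u_list)

R = {"preferred_food": 2.0, "other_food": 1.0, "neutral": 0.0}
p_u = reward_to_prob(R)   # {'preferred_food': 1.0, 'other_food': 0.75, 'neutral': 0.5}
print(global_utility([p_u["other_food"], p_u["preferred_food"]]))   # 0.875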
The network as a whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies. There are three general methods for accomplishing this, which correspond to three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach). Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable u_G, as explained next.

2.2 Policy selection by probabilistic inference: an iterative algorithm

Cooper [23] introduced the idea of inferring optimal decisions in influence diagrams by treating utility nodes as binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic priors. (2) Treating the initial state and u_G as observed variables (u_G = 1; in the single-action situation, where there is only one u node, it is this variable that is treated as observed, u = 1), use standard belief propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm.
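The following is a minimal sketch of this procedure for the single-step case, where belief propagation reduces to exact enumeration over the values of a single policy node. The likelihoods p(u = 1 | \pi) would in general be computed from the transition and reward tables; the numbers used here are purely illustrative.

import numpy as np

# Sketch of the iterative algorithm (steps 1-4) for a single policy node.
# p_u_given_pi[i] = p(u = 1 | policy i), derived (in the full model) from the
# transition function and the reward encoding of Equation (1).
def select_policy(p_u_given_pi, n_iterations=25):
    n = len(p_u_given_pi)
    prior = np.full(n, 1.0 / n)                 # step 1: non-deterministic priors
    for _ in range(n_iterations):
        posterior = p_u_given_pi * prior        # step 2: condition on u = 1, apply Bayes' rule
        posterior /= posterior.sum()
        prior = posterior                       # step 3: posteriors become the new priors
    return prior                                # step 4 is the loop itself

# Two candidate policies with illustrative likelihoods; the probability mass
# converges onto the policy with the higher p(u = 1 | pi).
print(select_policy(np.array([0.75, 0.9])))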
2.2.1 Monotonicity

We show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define \pi^* as follows:

p(u_G \mid \pi^*, \pi^+) > p(u_G \mid \pi', \pi^+), \quad \forall \pi' \neq \pi^*        (3)

where \pi^+ is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that:

p_t(\pi^*) > p_{t-1}(\pi^*)        (4)

where t indexes processing iterations. The dynamics of the network entail that

p_t(\pi) = p_{t-1}(\pi \mid u_G)        (5)

where \pi represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives

p_{t-1}(\pi^* \mid u_G) > p_{t-1}(\pi^*)        (6)

From this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes' law to (6) yields

\frac{p(u_G \mid \pi^*) \, p(\pi^*)}{\sum_{\pi} p(u_G \mid \pi) \, p(\pi)} > p(\pi^*)        (7)

Canceling, and bringing the denominator up, this becomes

p(u_G \mid \pi^*) > \sum_{\pi} p(u_G \mid \pi) \, p(\pi)        (8)

Rewriting the left hand side, we obtain

\sum_{\pi} p(u_G \mid \pi^*) \, p(\pi) > \sum_{\pi} p(u_G \mid \pi) \, p(\pi)        (9)

Subtracting and further rearranging:

\sum_{\pi} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi) \right] p(\pi) > 0        (10)

\left[ p(u_G \mid \pi^*) - p(u_G \mid \pi^*) \right] p(\pi^*) + \sum_{\pi' \neq \pi^*} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi') \right] p(\pi') > 0        (11)

\sum_{\pi' \neq \pi^*} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi') \right] p(\pi') > 0        (12)

Note that this last inequality (12) follows from the definition of \pi^*.

Remark: Of course, the identity of \pi^* depends on \pi^+. In particular, the policy \pi^* will only be part of a globally optimal plan if the set of choices \pi^+ is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles. Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, \pi^+ is empty. Thus \pi^* at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. The proof above shows that the probability of \pi^* will continuously rise. Once it reaches a maximum, \pi^* at immediately preceding decisions will perforce fit with the globally optimal policy. The process works backward, in the fashion of backward induction.
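As a small illustration of this backward-induction-like behavior, consider two coupled policy nodes and run the iteration of Section 2.2, computing each node's exact marginal posterior given the other node's current prior. The likelihood table below is illustrative only; it is chosen so that the downstream choice must settle before the upstream node aligns with the globally optimal plan.

# Two coupled policy nodes (pi1 chosen before pi2), with illustrative values of
# p(u_G = 1 | pi1, pi2). The optimal plan is (L, x), but with an uncertain pi2
# the first iteration nudges pi1 toward R; pi2 settles on x first, and pi1 then
# falls in line, as described in the Remark above.
p_u = {("L", "x"): 0.90, ("L", "y"): 0.05,
       ("R", "x"): 0.55, ("R", "y"): 0.50}
p1 = {"L": 0.5, "R": 0.5}
p2 = {"x": 0.5, "y": 0.5}
for t in range(1, 16):
    like1 = {a: sum(p_u[(a, b)] * p2[b] for b in p2) for a in p1}   # p(u_G = 1 | pi1)
    like2 = {b: sum(p_u[(a, b)] * p1[a] for a in p1) for b in p2}   # p(u_G = 1 | pi2)
    post1 = {a: like1[a] * p1[a] for a in p1}
    post2 = {b: like2[b] * p2[b] for b in p2}
    z1, z2 = sum(post1.values()), sum(post2.values())
    p1 = {a: v / z1 for a, v in post1.items()}                      # posteriors become priors
    p2 = {b: v / z2 for b, v in post2.items()}
    print(t, round(p1["L"], 3), round(p2["x"], 3))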
2.2.2 Convergence

Continuing with the same notation, we show now that

\lim_{t \to \infty} p_t(\pi^* \mid u_G) = 1        (13)

Note that, if we apply Bayes' law recursively,

p_t(\pi' \mid u_G) = \frac{p(u_G \mid \pi') \, p_t(\pi')}{p_t(u_G)} = \frac{p(u_G \mid \pi')^2 \, p_{t-1}(\pi')}{p_t(u_G) \, p_{t-1}(u_G)} = \frac{p(u_G \mid \pi')^3 \, p_{t-2}(\pi')}{p_t(u_G) \, p_{t-1}(u_G) \, p_{t-2}(u_G)} = \ldots        (14)

Thus,

p_1(\pi' \mid u_G) = \frac{p(u_G \mid \pi') \, p_1(\pi')}{p_1(u_G)}, \quad p_2(\pi' \mid u_G) = \frac{p(u_G \mid \pi')^2 \, p_1(\pi')}{p_2(u_G) \, p_1(u_G)}, \quad p_3(\pi' \mid u_G) = \frac{p(u_G \mid \pi')^3 \, p_1(\pi')}{p_3(u_G) \, p_2(u_G) \, p_1(u_G)}, \ldots        (15)

and so forth. Thus, what we wish to prove is

p_1(\pi^*) \prod_{t=1}^{\infty} \frac{p(u_G \mid \pi^*)}{p_t(u_G)} = 1        (16)

or, rearranging,

\prod_{t=1}^{\infty} \frac{p_t(u_G)}{p(u_G \mid \pi^*)} = p_1(\pi^*)        (17)

Note that, given the stipulated relationship between p(\pi) on each processing iteration and p(\pi \mid u_G) on the previous iteration,

p_t(u_G) = \sum_{\pi} p(u_G \mid \pi) \, p_t(\pi) = \sum_{\pi} p(u_G \mid \pi) \, p_{t-1}(\pi \mid u_G) = \frac{\sum_{\pi} p(u_G \mid \pi)^2 \, p_{t-1}(\pi)}{p_{t-1}(u_G)} = \frac{\sum_{\pi} p(u_G \mid \pi)^3 \, p_{t-2}(\pi)}{p_{t-1}(u_G) \, p_{t-2}(u_G)} = \frac{\sum_{\pi} p(u_G \mid \pi)^4 \, p_{t-3}(\pi)}{p_{t-1}(u_G) \, p_{t-2}(u_G) \, p_{t-3}(u_G)} = \ldots        (18)

With this in mind, we can rewrite the left hand side product in (17) as follows:

\frac{p_1(u_G)}{p(u_G \mid \pi^*)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^2 \, p_1(\pi)}{p(u_G \mid \pi^*) \, p_1(u_G)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^3 \, p_1(\pi)}{p(u_G \mid \pi^*) \, p_2(u_G) \, p_1(u_G)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^4 \, p_1(\pi)}{p(u_G \mid \pi^*) \, p_3(u_G) \, p_2(u_G) \, p_1(u_G)} \cdots        (19)

Note that, given (18), the numerator in each factor of (19) cancels with the denominator in the subsequent factor, leaving only p(u_G \mid \pi^*) in that denominator. The expression can thus be rewritten as

\frac{1}{p(u_G \mid \pi^*)} \cdot \frac{1}{p(u_G \mid \pi^*)} \cdot \frac{1}{p(u_G \mid \pi^*)} \cdots \frac{\sum_{\pi} p(u_G \mid \pi)^4 \, p_1(\pi)}{p(u_G \mid \pi^*)} \cdots = \sum_{\pi} \left( \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} \right)^{\infty} p_1(\pi)        (20)

The objective is then to show that the above equals p_1(\pi^*).
It proceeds directly from the definition of \pi^* that, for all \pi other than \pi^*,

\frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} < 1        (21)

Thus, all but one of the terms in the sum above approach zero, and the remaining term equals p_1(\pi^*). Thus,

\sum_{\pi} \left( \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} \right)^{\infty} p_1(\pi) = p_1(\pi^*)        (22)

3 Simulations

3.1 Binary choice

We begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model obviously converges.

A key empirical assay for purposive behavior involves outcome devaluation. Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (f_L), by making the appropriate change to p(u|f_L). This yielded a reversal in lever choice (Fig. 2B).

Another signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing conditional probabilities at s for t=2 to reflect a decoupling of the left action from the f_L outcome. The resulting behavior is shown in Fig. 2C.

Fig 2. Simulation results, binary choice.

3.2 Stochastic outcomes

A critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches using graphical models to do planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A 'left' coin yields $1 for heads, $0 for tails; a 'right' coin $2 for heads but for tails a $3 loss. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin.
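A brief sketch of how this stochastic case is handled under the present scheme: each coin's p(u = 1 | policy) is the outcome-weighted average of the Equation (1) probabilities, so conditioning on u = 1 favors the option with the higher expected reward. The dollar amounts follow the text; the code is an illustrative reconstruction, not the simulation code behind Fig. 2D.

# 'left' coin: $1 or $0; 'right' coin: $2 or -$3 (fair coins, so outcomes are equiprobable).
rewards = {"left": [1.0, 0.0], "right": [2.0, -3.0]}
r_max = max(abs(r) for outcomes in rewards.values() for r in outcomes)

def p_u_given_policy(outcomes):
    # Equation (1) applied to each equiprobable outcome, then averaged over outcomes.
    return sum(0.5 * (r / r_max + 1.0) for r in outcomes) / len(outcomes)

p_u = {c: p_u_given_policy(o) for c, o in rewards.items()}
print(p_u)   # left ~0.58, right ~0.42: iterating the algorithm drives choice to 'left'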
Fig 3. Simulation results, two-step sequential choice.

3.3 Sequential decision

Here, we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig. 1 (right), and using the reward values indicated in Fig. 3A, yields the choice behavior shown in Figure 3B. Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C. The model can also be used to simulate effort-based decision making. Starting with the scenario in Fig. 2A, we simulated the insertion of an effort-demanding scalable barrier at S2 (R(S2) = -2) by making appropriate changes to p(u|s). The resulting behavior is shown in Fig. 3D.

A famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5, Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding transition and reward functions in a graphical model of the form shown in Fig. 1 (right), representing the insertion of barriers by appropriate changes to the transition function. (In this simulation and the next, the set of states associated with each state node was limited to the set of reachable states for the relevant time-step, assuming an initial state of s1.) The resulting choice behavior at the critical juncture s2 is shown in Fig. 4.

Fig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B.

Another classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in the running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s13, random choices occurred at each intersection. However, setting R(s13) = 1 resulted in the set of choices indicated by the heavier arrows in Fig. 5.

Fig 5. Latent learning.

4 Relation to previous work

Initial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward-maximizing) plans. More closely related to the present research is work by Toussaint and Storkey [10], employing the EM algorithm. The iterative approach we have introduced here has a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions.
It seems possible that there may be a formal equivalence between the algorithm we have proposed and the one reported by [10].

As a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6], addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present efforts are tied more closely to normative principles of decision-making, whereas the work in [6] is tied more closely to the details of neural circuitry. In this respect, the two approaches may prove complementary, and it will be interesting to further consider their interrelations.

Acknowledgments

Thanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments.

References

[1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century.

[2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century.

[3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78.

[4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419.

[5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711.

[6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129.

[7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping. A neural network model. Biological Cybernetics, 1992. 67: p. 165-174.

[8] Tatman, J.A. and R.D. Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379.

[9] Verma, D. and R.P.N. Rao. Planning and acting in uncertain environments using probabilistic inference. in IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006.

[10] Toussaint, M. and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. in Proceedings of the 23rd International Conference on Machine Learning. 2006. Pittsburgh, PA.

[11] Attias, H. Planning by probabilistic inference. in Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics. 2003.

[12] Shachter, R.D. and M.A. Peot. Decision making using probabilistic inference methods. in Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference (1992). 1992. Stanford University: M. Kaufmann.

[13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291.

[14] Doya, K., et al., eds. The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, MIT Press: Cambridge, MA.
[15] Miller, E.K. and J.D. Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202.

[16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459.

[17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29.

[18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226.

[19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31.

[20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p. 1160-1168.

[21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697.

[22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag.

[23] Cooper, G.F. A method for using belief networks as influence diagrams. in Fourth Workshop on Uncertainty in Artificial Intelligence. 1988. University of Minnesota, Minneapolis.

[24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381.

[25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.
", "award": [], "sourceid": 34, "authors": [{"given_name": "Matthew", "family_name": "Botvinick", "institution": null}, {"given_name": "James", "family_name": "An", "institution": null}]}