Inference, Attention, and Decision in a Bayesian Neural Architecture

Angela J. Yu        Peter Dayan
Gatsby Computational Neuroscience Unit, UCL
17 Queen Square, London WC1N 3AR, United Kingdom.
feraina@gatsby.ucl.ac.uk        dayan@gatsby.ucl.ac.uk

Advances in Neural Information Processing Systems, pp. 1577-1584.

Abstract

We study the synthesis of neural coding, selective attention and perceptual decision making. A hierarchical neural architecture is proposed, which implements Bayesian integration of noisy sensory input and top-down attentional priors, leading to sound perceptual discrimination. The model offers an explicit explanation for the experimentally observed modulation that prior information in one stimulus feature (location) can have on an independent feature (orientation). The network's intermediate levels of representation instantiate known physiological properties of visual cortical neurons. The model also illustrates a possible reconciliation of cortical and neuromodulatory representations of uncertainty.

1  Introduction

A constant stream of noisy and ambiguous sensory inputs bombards our brains, informing on-going inferential processes and directing perceptual decision-making. Neurophysiologists and psychologists have long studied inference and decision-making in isolation, as well as the careful attentional filtering that is necessary to optimize them. The recent focus on their interactions poses an important opportunity and challenge for computational models.
In this paper, we study an attentional task which involves all three components, and thereby directly confront their interaction. We first discuss the background of the individual elements, then describe our model.

The first element involves the representation and manipulation of uncertainty in sensory inputs and contextual information. There are two broad families of suggestions. One is microscopic, in which individual cortical neurons and populations either implicitly or explicitly represent the uncertainty. This spans a broad spectrum, from distributional codes that can also encode restricted aspects of uncertainty [1] to more exotic interpretations of codes as representing complex distributions [1, 2, 3, 4, 5]. The other family is macroscopic, with cholinergic (ACh) and noradrenergic (NE) neuromodulatory systems reporting computationally distinct forms of uncertainty to influence the way that information in differentially reliable cortical areas is integrated and learned [6, 7]. How the microscopic and macroscopic families work together remains largely unexplored.

The second element is selective attention and top-down influences over sensory processing. Here, the key challenge is to couple the many ideas about the way that attention should, from a sound statistical viewpoint, modify sensory processing, to the measurable effects of attention on the neural substrate. For instance, one typical consequence of (visual) featural and spatial attention is an increase in the activities of neurons in cortical populations representing those features, which is equivalent to multiplying their tuning functions by a factor [8].
Under the sort of probabilistic representational scheme in which the population activity codes for uncertainty in the underlying variable, it is of obvious importance to understand how this multiplication changes the implied uncertainty, and what statistical characteristic of the attention licenses this change [9].

The third element is the coupling between sensory processing and perceptual decisions. Implementational and computational issues underlying binary decisions, especially in simple cases, have been extensively explored, with psychologists [11, 12] and neuroscientists [13, 14] converging on common statistical ideas [10] about drift-diffusion processes.

In order to explore the interaction of these elements, we model an extensively studied attentional task (due to Posner [15]), in which probabilistic spatial cueing is used to manipulate attentional modulation of visual discrimination. We employ a hierarchical neural architecture in which top-down attentional priors are integrated with sequentially sampled sensory input in a sound Bayesian manner, using a logarithmic mapping between cortical neural activities and uncertainty [4]. In the model, the information provided by the cue is realized as a change in the prior distribution over the cued dimension (space). The effect of the prior is to eliminate inputs from spatial locations considered irrelevant for the task, thus improving discrimination in another dimension (orientation).

In section 2, we introduce the Posner task and give a Bayesian description of the computations underlying successful performance.
In section 3, we describe the probabilistic semantics of the layers, and their functional connections, in the hierarchical neural architecture. In section 4, we compare the perceptual performance of the network to psychophysics data, and the intermediate layers' activities to the relevant physiological data.

2  Spatial Attention as Prior Information

In the classic version of Posner's task [15], a subject is presented with a cue that predicts the location of a subsequent target with a certain probability termed its validity. The cue is valid if it makes a correct prediction, and invalid otherwise. Subjects typically perform detection or discrimination on the target more rapidly and accurately on a valid-cue trial than an invalid one, reflecting cue-induced attentional modulation of visual processing and/or decision making [15]. This difference in reaction time or accuracy is often termed the validity effect [16], and depends on the cue validity [17].

We consider sensory stimuli with two feature dimensions: a periodic variable, orientation θ, about which decisions are to be made, and a linear variable, space y, which is cued. The cue induces a top-down spatial prior, which we model as a mixture of a component sharply peaked at the cued location and a broader component capturing contextual and bottom-up saliency factors (including the possibility of invalidity). For simplicity, we use a Gaussian for the peaked component, and a uniform distribution for the broader one, although more complex priors of a similar nature would not change the model behavior: p(y) = γ N(ŷ, σ_y²) + (1−γ)c.
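As a numerical illustration, this mixture prior can be sketched as follows (a minimal sketch on a discrete grid; the grid, ŷ = 0.5, σ = 0.1, and γ = 0.75 are arbitrary illustrative choices, not the paper's simulation parameters):

```python
import numpy as np

# Hypothetical discretized spatial axis; values are illustrative only.
y = np.linspace(-1.5, 1.5, 31)

def spatial_prior(y, y_hat, sigma, gamma):
    """p(y) = gamma * N(y_hat, sigma^2) + (1 - gamma) * c, normalized on the grid."""
    gauss = np.exp(-(y - y_hat) ** 2 / (2 * sigma ** 2))
    gauss /= gauss.sum()                     # sharply peaked component at the cued location
    uniform = np.full_like(y, 1.0 / len(y))  # broad contextual/saliency component
    p = gamma * gauss + (1 - gamma) * uniform
    return p / p.sum()

p = spatial_prior(y, y_hat=0.5, sigma=0.1, gamma=0.75)
```

Larger γ concentrates more of the prior mass at the cued location; the uniform component guarantees that no location is assigned zero probability, which is what allows eventual recovery on invalid-cue trials.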
Given lower-layer activation patterns X_t ≡ {x_1, ..., x_t}, assumed to be iid samples (with Gaussian noise) of bell-shaped tuning responses to the true underlying stimulus values y, θ: f_ij(y, θ) = Z exp(−(y_i−y)²/2σ² + k cos(θ_j−θ)), the task is to infer a posterior distribution P(θ|X_t), involving the following steps:

    p(x_t|y, θ) = ∏_ij p(x_ij(t)|y, θ)          Likelihood
    p(θ, x_t) = ∫ p(y, θ) p(x_t|y, θ) dy        Prior-weighted marginalization
    p(θ|X_t) ∝ p(θ|X_1^{t−1}) p(θ, x_t)         Temporal accumulation

Because the marginalization step is weighted by the priors, a valid cue results in the integration of more "signal" and less "noise" into the marginal posterior, whereas the opposite results from an invalid cue. To turn this on-line posterior into a decision θ̂, we use an extension of the Sequential Probability Ratio Test (SPRT [10]): observe x_1, x_2, ... until the first time that max_j P(θ_j|X_t) exceeds a fixed threshold q, then terminate the observation process and report θ̂ = argmax_j P(θ_j|X_t) as the estimate of θ for the current trial.

[Figure 1: schematic of the five-layer network, showing example activities at each layer together with the tuning function f_ij(y, θ) and the log spatial prior log P(y_i), and the layer update rules: r1_ij(t) = log p(x_t|y_i, θ_j); r2_ij(t) = r1_ij(t) + log P(y_i) + a_t; r3_j(t) = log Σ_i exp(r2_ij(t)) + b_t; r4_j(t) = r4_j(t−1) + r3_j(t) + c_t; r5_j(t) = exp(r4_j(t))/Σ_k exp(r4_k(t)).]

Figure 1: A Bayesian neural architecture. Layer I activities represent the log likelihood of the data given each possible setting of y_i and θ_j. This gives a noisy version of the smooth bell-shaped tuning curve (shown on the left). In layer II, the log likelihood of each y_i and θ_j is modulated by the prior information log P(y_i), shown on the upper left. The prior in y strongly suppresses the noisy input in the irrelevant part of the y dimension, thus enabling improved inference based on the underlying tuning response f_ij. The layer III neurons represent the log marginal posterior of θ by integrating out the y dimension of layer II activities. Layer IV neurons combine recurrent information and feedforward input from layer III to compute the log marginal posterior given all data so far observed. Layer V computes the cumulative posterior distribution of θ using a softmax operation. Due to the strong nonlinearity of softmax, its activity is much more peaked than in layers III and IV. Solid lines in the diagram represent excitatory connections, dashed lines inhibitory. Blue circles illustrate how the activities of one row of inputs in layer I travel through the hierarchy to affect the final decision layer. Brown circles illustrate how one unit in the spatial prior layer comes into the integration process.

3  A Bayesian Neural Architecture

The neural architecture implements the above computational steps exactly through a logarithmic transform, and has five layers (Fig 1). In layer I, the activity of neuron ij, r1_ij(t), reports the log likelihood, log p(x_t|y_i, θ_j) (throughout, we discretize space and orientation). Layer II combines this log likelihood information with the prior, r2_ij(t) = r1_ij(t) + log P(y_i) + a_t, to yield the joint log posterior up to an additive constant a_t that makes min r2_ij = 0.
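The layer I-II computation can be sketched numerically (a toy illustration assuming iid Gaussian noise around the tuning responses; the grid sizes, tuning parameters, and the flat spatial prior are our own illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
y_grid = np.linspace(-1.5, 1.5, 16)                       # hypothetical spatial grid
th_grid = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)  # hypothetical orientation grid

def tuning(y, th, sigma=0.3, k=2.0, Z=1.0):
    """Bell-shaped tuning response f_ij(y, theta) over the (i, j) grid."""
    return Z * np.exp(-(y_grid[:, None] - y) ** 2 / (2 * sigma ** 2)
                      + k * np.cos(th_grid[None, :] - th))

# One input sample x_t: responses to the true stimulus plus iid Gaussian noise.
y_true, th_true, sig_n = 0.5, np.pi / 2, 0.5
x_t = tuning(y_true, th_true) + sig_n * rng.normal(size=(16, 8))

# Layer I: neuron ij reports log p(x_t | y_i, theta_j). Under iid Gaussian noise
# this is a scaled negative squared distance between x_t and the response
# pattern predicted by hypothesis (y_i, theta_j), up to an additive constant.
r1 = np.array([[-np.sum((x_t - tuning(yi, tj)) ** 2) / (2 * sig_n ** 2)
                for tj in th_grid] for yi in y_grid])

# Layer II: add the log spatial prior, then shift so that min r2 = 0
# (the additive constant a_t). A flat prior is used purely for illustration.
log_prior_y = np.log(np.full(16, 1.0 / 16))
r2 = r1 + log_prior_y[:, None]
r2 -= r2.min()
```

With a peaked (cue-induced) prior in place of the flat one, the log-prior term would suppress the rows of r2 corresponding to uncued locations, which is the mechanism the text describes.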
Layer III performs the marginalization r3_j(t) = log Σ_i exp(r2_ij(t)) + b_t, to give the marginal posterior in θ (up to a constant b_t that makes min r3_j(t) = 0). While this step ('log-of-sums') looks computationally formidable for neural hardware, it has been shown [4] that under certain conditions it can be well approximated by a (weighted) 'sum-of-logs' r3_j(t) ≈ Σ_i c_i r2_ij + b_t, where the c_i are weights optimized to minimize the approximation error. Layer IV neurons combine recurrent information and feedforward input from layer III to compute the log marginal posterior given all data so far observed, r4_j(t) = r4_j(t−1) + r3_j(t) + c_t, up to a constant c_t. Finally, layer V neurons perform a softmax operation to retrieve the exact marginal posterior, r5_j(t) = exp(r4_j)/Σ_k exp(r4_k) = P(θ_j|X_t), with the additive constants dropping out. Note that a pathway parallel to III-IV-V, consisting of neurons that only care about y and not θ, can be constructed in exactly the same manner. Its corresponding layers would report log p(x_t, y_i), log p(X_t, y_i), and p(y_i|X_t). An example of activities at each layer of the network, along with the choice of prior p(y) and tuning function f_ij, is shown in Fig 1.

[Figure 2: (a) model and empirical valid/invalid reaction-time distributions; (b) distributions of inferred θ̂ under valid and invalid cueing; (c) reaction time and error rate as functions of γ.]

Figure 2: Validity effect and dependence on γ. (a) The distribution of reaction times for the invalid condition (γ = 0.5) has a greater mean and longer tail than the valid condition in model simulation results (top). Compare to similar results (bottom) from a Posner task in rats [18]. (b) The distribution of inferred θ̂ is more tightly clustered around the true θ (red dashed line) in the valid case (blue) than the invalid case (red); γ = 0.75. (c) The validity effect, in both reaction time (top) and error rate (bottom), increases with increasing γ. {y_i} = {−1.5, −1.4, ..., 1.5}, {θ_j} = {π/8, 2π/8, ..., 16π/8}, σ = 0.1, k = π/16, q = 0.90, ŷ = 0.5, γ ∈ {0.5, .75, .99}, σ_n = 0.05; 300 trials each of valid and invalid trials, 100 trials for each γ value.

4  Results

We first verify that the model indeed exhibits the cue-induced validity effect, i.e. shorter RT and greater accuracy for valid-cue trials than invalid ones. "Reaction time" on a trial is the number of iid samples necessary to reach a decision, and "error rate" is the average angular distance between the estimated θ̂ and the true θ. Figure 2 shows simulation results for 300 trials each of valid and invalid cue trials, for different values of γ, reflecting the model's belief as to cue validity. Reassuringly, the RT distribution for valid-cue trials is tighter and left-shifted compared to invalid-cue trials (Figure 2(a), top panel), as observed in experimental data [15, 18] (Fig 2(a), bottom panel); (b) shows that accuracy is also higher for valid-cue trials.
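The full recursion through layers III-V, together with the thresholded stopping rule of section 2, can be sketched end-to-end (a toy re-implementation under assumed parameter values, not the paper's simulation code; `run_trial` and all its argument names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n_y, n_th = 16, 8
y_grid = np.linspace(-1.5, 1.5, n_y)
th_grid = np.linspace(0.0, 2 * np.pi, n_th, endpoint=False)

def tuning(y, th, sigma=0.3, k=2.0):
    """Bell-shaped tuning response f_ij(y, theta)."""
    return np.exp(-(y_grid[:, None] - y) ** 2 / (2 * sigma ** 2)
                  + k * np.cos(th_grid[None, :] - th))

# Precompute the predicted response pattern for every hypothesis (y_i, theta_j).
T = np.empty((n_y, n_th, n_y, n_th))
for i, yi in enumerate(y_grid):
    for j, tj in enumerate(th_grid):
        T[i, j] = tuning(yi, tj)

def run_trial(log_prior_y, y_true=0.5, th_true=np.pi / 2,
              sig_n=1.0, q=0.9, max_t=500):
    """Accumulate the posterior over theta; stop once max_j P(theta_j | X_t) > q."""
    f_true = tuning(y_true, th_true)
    r4 = np.zeros(n_th)                                        # layer IV accumulator
    for t in range(1, max_t + 1):
        x = f_true + sig_n * rng.normal(size=(n_y, n_th))      # one iid sample x_t
        r1 = -((T - x) ** 2).sum(axis=(2, 3)) / (2 * sig_n ** 2)  # layer I: log lik.
        r2 = r1 + log_prior_y[:, None]                         # layer II: add prior
        r3 = np.logaddexp.reduce(r2, axis=0)                   # layer III: marginalize y
        r4 = r4 + r3                                           # layer IV: accumulate
        r5 = np.exp(r4 - np.logaddexp.reduce(r4))              # layer V: softmax
        if r5.max() > q:                                       # SPRT-style threshold
            break
    return t, th_grid[np.argmax(r5)]

# Valid cue: spatial prior peaked at the true location (gamma = 0.75 mixture).
peaked = np.exp(-(y_grid - 0.5) ** 2 / (2 * 0.1 ** 2))
prior = 0.75 * peaked / peaked.sum() + 0.25 / n_y
rt, th_hat = run_trial(np.log(prior))
```

"Reaction time" here is simply the number of samples consumed before the threshold crossing; re-running with the prior peaked away from the true location mimics an invalid-cue trial.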
Consistent with data from a human Posner task [17], (c) shows that the VE increases with increasing perceived cue validity, as parameterized by γ, in both reaction times and error rates (precluding a simple speed-error trade-off).

Since we have an explicit model of not only the "behavioral output" but also the whole neural hierarchy, we can relate activities at various levels of representation to existing physiological data. Ample evidence indicates that spatial attention to one side of the visual field increases stimulus-induced activities in the corresponding part of the visual cortex [19, 20]. Fig 3(a) shows that our model qualitatively reproduces this effect; indeed, the effect increases with γ, the perceived cue validity. Electrophysiological data also show that spatial attention has a multiplicative effect on orientation tuning responses in visual cortical neurons [8] (Fig 3(b)). We see a similar phenomenon in the layer IV neurons (Fig 3(c); layer III similar, data not shown). Fig 3(d) is a scatter-plot of ⟨log p(x_t, θ_j)⟩ + c_t^1 for the valid condition versus the invalid condition, for various values of γ, along with the slope fit to the experiment of Fig 3(b) (layer III similar, data not shown). The linear least-square-error fits are good, and the slope increases with increasing confidence in the cued location (larger γ).

[Figure 3: (a) cued vs. uncued layer II activities; (b) attention and V4 activities; (c) layer IV tuning under valid vs. invalid cueing; (d) multiplicative gain.]

Figure 3: Multiplicative gain modulation by spatial attention. (a) r2_ij activities, averaged over the half of layer II where the prior peaks, are greater for valid (blue, left) than invalid (red, right) conditions. (b) Experimentally observed multiplicative modulation of V4 orientation tunings by spatial attention [8]. (c) Similar multiplicative effect in layer IV in the model. (d) Linear fits to the scatter-plot of layer III activities for the valid cue condition vs. the invalid cue condition show that the slope is greatest for large γ and smallest for small γ (magenta: γ = 0.99, blue: γ = 0.75, red: γ = 0.5, black: linear fit to study in (b)). Simulation parameters are the same as in Fig 2. Error bars: standard errors of the mean.

In the model, the slope depends not only on γ but also on the noise model, the discretization, and so on, so the comparison of Figure 3(d) should be interpreted loosely.

In valid cases, the effect of attention is to increase the certainty in the posterior marginal over θ, since the correct prior allows the relative suppression of noisy input from the irrelevant part of space. Were the posterior marginal exactly Gaussian, the increased certainty would translate into a decreased variance. For Gaussian probability distributions, logarithmic coding amounts to something close to a quadratic (adjusted for the circularity of orientation), with a curvature determined by the variance. Decreasing the variance increases the curvature, and therefore has a multiplicative effect on the activities (as in figure 3).

The approximate Gaussianity of the marginal posterior comes from the accumulation of many independent samples over time and space, and something like the central limit theorem. While it is difficult to show this multiplicative modulation rigorously, we can at least demonstrate it mathematically for the case where the spatial prior is very sharply peaked at its Gaussian mean ŷ. In this case, (⟨log p_1(x(t), θ_j)⟩_t + c_1)/(⟨log p_2(x(t), θ_j)⟩_t + c_2) ≈ R, where c_1, c_2, and R are constants independent of θ_j and y_i.
Based on the peaked prior as-\nsumption, p()  (- ^\n                                            ), we have p(x(t), ) =                               p(x(t)|, )p()p()  p(x(t)|, ^\n                                                                                                                                                                                  ).\nWe can expand log p(x(t)|^\n                                                  , ) and compute its average over time\n\n                                                                          N\n                                 log p(x(t)|^\n                                             , )           = C -               (f                                                                       .                       (1)\n                                                        t             22               ij (, ) - fij ( ^\n                                                                                                                           , ))2 ij\n                                                                           n\n\nThen using the tuning function defined earlier, we can compare the joint probabilities given\nvalid (val) and invalid (inv) cues:\n\n                    log p                                                                                                   g()\n                                  val(x(t), )                            1 -           e-(i-)2/2                                           j\n                                                      t =                                                             i                                           ,               (2)\n                    log pinv(x(t), ) t                      2 -  e-((i-)2+(i-^)2)/22                                                    g() j\n                                                                                                                                        i\n\n                         
and therefore,

$$\frac{\left\langle \log p_{\mathrm{val}}(x_t, \theta) \right\rangle_t + c_1}{\left\langle \log p_{\mathrm{inv}}(x_t, \theta) \right\rangle_t + c_2} = e^{(\gamma - \hat{\gamma})^2/(4\sigma^2)} = R . \qquad (3)$$

(The last step uses the identity $(i-\gamma)^2 + (i-\hat{\gamma})^2 = 2\left(i - \frac{\gamma+\hat{\gamma}}{2}\right)^2 + \frac{(\gamma-\hat{\gamma})^2}{2}$, which factors the invalid-cue Gaussian sum into $e^{-(\gamma-\hat{\gamma})^2/(4\sigma^2)}$ times a shifted sum over $i$.)

The derivation for a multiplicative effect on layer IV activities is very similar.

Another aspect of the intermediate representation of interest is the way attention modifies the evidence-accumulation process over time. Fig 4 shows the effect of cueing on the activities of neuron $r_j^5(t)$, or $P(\theta|X_t)$, for all trials with correct responses. The mean activity trajectory is higher in the valid-cue case than in the invalid one: here, spatial attention mainly acts by increasing the rate of evidence accumulation after stimulus onset
[Figure 4: six panels plotting $r_j^5$ against time; legends distinguish valid (val) vs invalid (inv) cues in (a-c), and cue validities 0.5, 0.75, 0.99 in (d-e).]

Figure 4: Accumulation of iid samples in orientation discrimination, and dependence on prior belief about stimulus location. (a-c) Average activity of neuron $r_j^5$, which represents $P(\theta|X_t)$, saturates to 100% certainty much faster for valid-cue trials (blue) than invalid-cue trials (red).
The difference is more drastic when the cue validity is larger, that is, when there is more prior confidence in the cued target location: (a) validity 0.5, (b) 0.75, (c) 0.99. Cyan dashed line indicates stimulus onset. (d) First 15 time steps (from stimulus onset) of the invalid-cue traces from (a-c) are aligned to stimulus onset; cyan line denotes stimulus onset. The differential rates of rise are apparent. (e) Last 8 time steps of the invalid traces from (a-c) are aligned to decision threshold-crossing; there is no clear separation as a function of cue validity. (f) Multiplicative gain modulation of attention on V4 orientation tuning curves. Simulation parameters are the same as in Fig 2.


(steeper rise). This attentional effect is more pronounced when the system is more confident about its prior information ((a) validity 0.5, (b) 0.75, (c) 0.99). Effectively, increasing the cue validity on invalid-cue trials amounts to increasing the input noise. Figure 4(d) shows the average traces for invalid-cueing trials aligned to the stimulus onset, and (e) aligned to the decision threshold crossing. These results bear remarkable similarities to the LIP neuronal activities recorded during monkey perceptual decision-making [13] (shown in (f)). In the stimulus-aligned case, the traces rise linearly at first and then tail off somewhat, and the rate of rise increases for lower (effective) noise. In the decision-aligned case, the traces rise steeply and together. All these characteristics can also be seen in the experimental results in (f), where the input noise level is explicitly varied.


5         Discussion

We have presented a hierarchical neural architecture that implements optimal probabilistic integration of top-down information and sequentially observed data. We consider a class of attentional tasks for which top-down modulation of sensory processing can be conceptualized as changes in the prior distribution over implicit stimulus dimensions.
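The sequential integration over a joint hypothesis space can be illustrated with a small numerical sketch (Python; the grid sizes, tuning function, noise level, and the `prior_conf` stand-in for cue validity are all made-up illustration values, not the model's actual parameters). A joint posterior over (orientation, location) is updated with iid noisy population samples and then marginalized over location; a spatial prior peaked at the true location (valid cue) speeds the rise of the marginal posterior relative to a prior peaked at a wrong location (invalid cue):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative grid: 10 locations, 2 orientations (assumed values, not the paper's).
n_loc, n_ori = 10, 2
true_loc, true_ori = 3, 0
sigma = 1.0  # observation noise std

def mean_response(ori, loc):
    """Mean population activity across locations: Gaussian spatial tuning
    scaled by an orientation-dependent gain (hypothetical tuning function)."""
    gains = (1.0, 0.6)
    i = np.arange(n_loc)
    return gains[ori] * np.exp(-(i - loc) ** 2 / 2.0)

# Mean responses for every joint hypothesis (orientation, location).
mu = np.array([[mean_response(o, l) for l in range(n_loc)] for o in range(n_ori)])

def run_trial(cued_loc, prior_conf=0.9, T=60):
    """Sequentially update the joint posterior over (orientation, location),
    returning the marginal posterior on the true orientation at each step."""
    p_loc = np.full(n_loc, (1 - prior_conf) / (n_loc - 1))
    p_loc[cued_loc] = prior_conf                     # spatial prior peaked at the cue
    log_post = np.log(np.outer(np.full(n_ori, 1.0 / n_ori), p_loc))
    marginal = []
    for _ in range(T):
        x = mean_response(true_ori, true_loc) + sigma * rng.standard_normal(n_loc)
        log_post += -np.sum((x - mu) ** 2, axis=-1) / (2 * sigma ** 2)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        marginal.append(post.sum(axis=1)[true_ori])  # marginalize over location
    return np.array(marginal)

valid = np.mean([run_trial(cued_loc=true_loc) for _ in range(50)], axis=0)
invalid = np.mean([run_trial(cued_loc=7) for _ in range(50)], axis=0)
print(valid[10], invalid[10])  # valid-cue posterior rises faster on average
```

The same mechanism produces the validity-dependent difference in rise rate seen in Fig 4(a-c): the invalid-cue prior downweights the joint-posterior terms at the true location, so more evidence must accumulate before they dominate the marginalization.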
We use the specific example of the Posner spatial cueing task to relate the characteristics of this neural architecture to the experimental literature. The network produces a reaction-time distribution and error rates that qualitatively replicate experimental data. The way these measures depend on valid versus invalid cueing, and on the exact perceived validity of the cue, is similar to that observed in attentional experiments. Moreover, the activities in various levels of the hierarchy resemble electrophysiologically recorded activities of visual cortical neurons during attentional modulation and perceptual discrimination, lending further credence to the particular encoding and computational mechanisms that we have proposed. In particular, the intermediate layers demonstrate a multiplicative gain modulation by attention, as observed in primate V4 neurons [8]; and the temporal behavior of the final layer, representing the marginal posterior, qualitatively replicates the experimental observation that LIP neurons show a noise-dependent firing-rate increase when aligned to stimulus onset, and a noise-independent rise when aligned to the decision [13].

Our results illustrate the important concept that a prior over a variable in one dimension (space) can dramatically alter inferential performance in a completely independent dimension (orientation). In this case, the spatial prior affects the marginal posterior over $\theta$ by altering the relative importance of the joint-posterior terms in the marginalization process. This leads to the difference in performance between valid and invalid trials, a difference that increases with the cue validity. The model elaborates on an earlier phenomenological model [9] by showing explicitly how marginalizing (in layer III) over activities biased by the prior (in layer II) produces the effect.

This work has various theoretical and experimental implications.
The model presents one possible reconciliation of cortical and neuromodulatory representations of uncertainty. The sensory-driven activities (layer I in this model) themselves encode bottom-up uncertainty, including sensory receptor noise and any processing noise that has occurred up until then. The top-down information, which specifies the Gaussian component of the spatial prior $p(\gamma)$, involves two kinds of uncertainty. One determines the locus and spatial extent of visual attention; the other specifies the relative importance of this top-down bias compared to the bottom-up, stimulus-driven input. The first is highly specific in modality and featural dimension, presumably originating from higher visual cortical areas (e.g. parietal cortex for spatial attention, inferotemporal cortex for complex featural attention). The second is more generic, may affect different featural dimensions and perhaps even different modalities simultaneously, and is thus more appropriately signalled by a diffusely projecting neuromodulator such as ACh. This characterization is also in keeping with our previous models of ACh [21, 7] and with experimental data showing that ACh selectively suppresses cortico-cortical transmission relative to bottom-up processing in primary sensory cortices [22].

The perceptual decision strategy employed in this model is a natural multi-dimensional extension of SPRT [10]: the network monitors for the first time at which any one of the posterior values crosses a fixed decision threshold. Note that the distribution of reaction times is skewed to the right (Fig 2(a)), as is commonly observed in visual discrimination tasks [11]. For binary decision tasks modeled using continuous diffusion processes [10, 11, 12, 13, 14], this skew arises from the properties of the first-passage time distribution (the time at which a diffusion barrier is first breached, corresponding to a fixed threshold confidence level in the binary choice).
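A minimal sketch of this threshold-crossing strategy (illustrative Python; the three alternatives, noise level, and threshold are arbitrary choices, not the model's parameters) reproduces the right-skewed first-passage time distribution, with the mean reaction time exceeding the median:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-alternative task: infer which mean generated the iid samples
# by monitoring the posterior until one hypothesis crosses a fixed threshold.
means = np.array([0.0, 0.5, 1.0])
true_idx, sigma, threshold = 2, 1.5, 0.95

def first_passage(max_steps=5000):
    """Return (decision time, chosen hypothesis) for one simulated trial."""
    log_post = np.log(np.full(3, 1.0 / 3.0))
    for t in range(1, max_steps + 1):
        x = means[true_idx] + sigma * rng.standard_normal()
        log_post += -(x - means) ** 2 / (2 * sigma ** 2)  # Gaussian log-likelihood
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        if post.max() >= threshold:                       # first threshold crossing
            return t, int(post.argmax())
    return max_steps, int(post.argmax())

rts, choices = map(np.array, zip(*(first_passage() for _ in range(2000))))
print(rts.mean() > np.median(rts))   # right skew: mean exceeds median
print((choices == true_idx).mean())  # accuracy controlled by the 0.95 threshold
```

Raising the threshold trades longer reaction times for lower error rates, the same speed-accuracy tradeoff governed by the decision threshold in the full model.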
Our multi-choice decision-making realization of visual discrimination, as an extension of SPRT, also retains this skewed first-passage time distribution. Given that SPRT is optimal for binary decisions (smallest average response time for a given error rate), and that the MAP estimate is optimal under 0-1 loss, we conjecture that our particular n-dimensional generalization of SPRT should be optimal for sequential decision-making under 0-1 loss. This is an area of active research.

There are several important open issues. One is that of noise: our network performs exact Bayesian inference when activities are deterministic. The potentially deleterious effects of noise, particularly in log-probability space, need to be explored. Another important question is how uncertainty in signal strength, including the absence of a signal, can be detected and encoded. If the stimulus strength is unknown and can vary over time, then naive integration of bottom-up inputs that ignores the signal-to-noise ratio is no longer optimal. Based on a slightly different task involving sustained attention or vigilance [23], Brown et al [24] have made the interesting suggestion that one role for noradrenergic neuromodulation is to implement a change in the integration strategy when a stimulus is detected. We have also addressed this issue by ascribing to phasic norepinephrine a related but distinct role in signaling unexpected state uncertainty (in preparation).


Acknowledgement

We are grateful to Eric Brown, Jonathan Cohen, Phil Holmes, Peter Latham, and Iain Murray for helpful discussions. Funding was from the Gatsby Charitable Foundation.


References

 [1] Zemel, R S, Dayan, P, & Pouget, A (1998). Probabilistic interpretation of population codes. Neural Comput 10: 403-30.
 [2] Sahani, M & Dayan, P (2003). Doubly distributional population codes: simultaneous representation of uncertainty and multiplicity.
Neural Comput 15: 2255-79.
 [3] Barber, M J, Clark, J W, & Anderson, C H (2003). Neural representation of probabilistic information. Neural Comput 15: 1843-64.
 [4] Rao, R P (2004). Bayesian computation in recurrent neural circuits. Neural Comput 16: 1-38.
 [5] Weiss, Y & Fleet, D J (2002). Velocity likelihoods in biological and machine vision. In Prob Models of the Brain: Perc and Neural Function. Cambridge, MA: MIT Press.
 [6] Dayan, P & Yu, A J (2002). Acetylcholine, uncertainty, and cortical inference. In Adv in Neural Info Process Systems 14.
 [7] Yu, A J & Dayan, P (2003). Expected and unexpected uncertainty: ACh and NE in the neocortex. In Adv in Neural Info Process Systems 15.
 [8] McAdams, C J & Maunsell, J H R (1999). Effects of attention on orientation-tuning functions of single neurons in Macaque cortical area V4. J Neurosci 19: 431-41.
 [9] Dayan, P & Zemel, R S (1999). Statistical models and sensory attention. In ICANN 1999.
[10] Wald, A (1947). Sequential Analysis. New York: John Wiley & Sons, Inc.
[11] Luce, R D (1986). Response Times: Their Role in Inferring Elementary Mental Organization. New York: Oxford Univ. Press.
[12] Ratcliff, R (2001). Putting noise into neurophysiological models of simple decision making. Nat Neurosci 4: 336-7.
[13] Gold, J I & Shadlen, M N (2002). Banburismus and the brain: decoding the relationship between sensory stimuli, decisions, and reward. Neuron 36: 299-308.
[14] Bogacz, R, Brown, E, Moehlis, J, Holmes, P, & Cohen, J D (2004). The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced choice tasks, in press.
[15] Posner, M I (1980). Orienting of attention. Q J Exp Psychol 32: 3-25.
[16] Phillips, J M, et al (2000). Cholinergic neurotransmission influences overt orientation of visuospatial attention in the rat. Psychopharm 150: 112-6.
[17] Yu, A J, et al (2004).
Expected and unexpected uncertainties control allocation of attention in a novel attentional learning task. Soc Neurosci Abst 30: 176.17.
[18] Bowman, E M, Brown, V, Kertzman, C, Schwarz, U, & Robinson, D L (1993). Covert orienting of attention in Macaques: I. Effects of behavioral context. J Neurophys 70: 431-434.
[19] Reynolds, J H & Chelazzi, L (2004). Attentional modulation of visual processing. Annu Rev Neurosci 27: 611-47.
[20] Kastner, S & Ungerleider, L G (2000). Mechanisms of visual attention in the human cortex. Annu Rev Neurosci 23: 315-41.
[21] Yu, A J & Dayan, P (2002). Acetylcholine in cortical inference. Neural Networks 15: 719-30.
[22] Kimura, F, Fukuda, M, & Tsumoto, T (1999). Acetylcholine suppresses the spread of excitation in the visual cortex revealed by optical recording: possible differential effect depending on the source of input. Eur J Neurosci 11: 3597-609.
[23] Rajkowski, J, Kubiak, P, & Aston-Jones, G (1994). Locus coeruleus activity in monkey: phasic and tonic changes are associated with altered vigilance. Synapse 4: 162-4.
[24] Brown, E, et al (2004). Simple neural networks that optimize decisions. Int J Bifurcation and Chaos, in press.