{"title": "Neural representation of action sequences: how far can a simple snippet-matching model take us?", "book": "Advances in Neural Information Processing Systems", "page_first": 593, "page_last": 601, "abstract": "The macaque Superior Temporal Sulcus (STS) is a brain area that receives and integrates inputs from both the ventral and dorsal visual processing streams (thought to specialize in form and motion processing respectively). For the processing of articulated actions, prior work has shown that even a small population of STS neurons contains sufficient information for the decoding of actor invariant to action, action invariant to actor, as well as the specific conjunction of actor and action. This paper addresses two questions. First, what are the invariance properties of individual neural representations (rather than the population representation) in STS? Second, what are the neural encoding mechanisms that can produce such individual neural representations from streams of pixel images? We find that a baseline model, one that simply computes a linear weighted sum of ventral and dorsal responses to short action \u201csnippets\u201d, produces surprisingly good fits to the neural data. Interestingly, even using inputs from a single stream, both actor-invariance and action-invariance can be produced simply by having different linear weights.", "full_text": "Neural representation of action sequences: how far\n\ncan a simple snippet-matching model take us?\n\nCheston Tan\n\nInstitute for Infocomm Research\n\nSingapore\n\ncheston@mit.edu\n\nJedediah M. Singer\n\nBoston Children\u2019s Hospital\n\nBoston, MA 02115\n\njedediah.singer@childrens.harvard.edu\n\nThomas Serre\n\nDavid Sheinberg\n\nBrown University\n\nProvidence, RI 02912\n\n{Thomas Serre, David Sheinberg}@brown.edu\n\nTomaso A. Poggio\n\nMIT\n\nCambridge, MA 02139\n\ntp@ai.mit.edu\n\nAbstract\n\nThe macaque Superior Temporal Sulcus (STS) is a brain area that receives and in-\ntegrates inputs from both the ventral and dorsal visual processing streams (thought\nto specialize in form and motion processing respectively). For the processing of\narticulated actions, prior work has shown that even a small population of STS neu-\nrons contains suf\ufb01cient information for the decoding of actor invariant to action,\naction invariant to actor, as well as the speci\ufb01c conjunction of actor and action.\nThis paper addresses two questions. First, what are the invariance properties of in-\ndividual neural representations (rather than the population representation) in STS?\nSecond, what are the neural encoding mechanisms that can produce such individ-\nual neural representations from streams of pixel images? We \ufb01nd that a simple\nmodel, one that simply computes a linear weighted sum of ventral and dorsal re-\nsponses to short action \u201csnippets\u201d, produces surprisingly good \ufb01ts to the neural\ndata. Interestingly, even using inputs from a single stream, both actor-invariance\nand action-invariance can be accounted for, by having different linear weights.\n\nIntroduction\n\n1\nFor humans and other primates, action recognition is an important ability that facilitates social in-\nteraction, as well as recognition of threats and intentions. For action recognition, in addition to the\nchallenge of position and scale invariance (which are common to many forms of visual recognition),\nthere are additional challenges. The action being performed needs to be recognized in a manner\ninvariant to the actor performing it. 
For the \u201cwhat is where\u201d vision problem, one common simplification of the primate visual system is that the ventral stream handles the \u201cwhat\u201d problem, while the dorsal stream handles the \u201cwhere\u201d problem [1]. Here, we investigate the analogous \u201cwho is doing what\u201d problem. Prior work has found that brain cells in the macaque Superior Temporal Sulcus (STS) \u2014 a brain area that receives converging inputs from the dorsal and ventral streams \u2014 play a major role in solving this problem. Even with a small population subset of only about 120 neurons, STS contains sufficient information for action and actor to be decoded independently of one another [2]. Moreover, the particular conjunction of actor and action (i.e. stimulus-specific information) can also be decoded. In other words, STS neurons have been shown to have successfully tackled the three challenges of actor-invariance, action-invariance and actor-action binding.\n\nWhat sort of neural computations the visual system performs to achieve this feat is still an unsolved question. Singer and Sheinberg [2] performed population decoding from a collection of single neurons. However, they did not investigate the computational mechanisms underlying the individual neuron representations. In addition, they utilized a decoding model (i.e. one that models the usage of the STS neural information by downstream neurons). An encoding model \u2014 one that models the transformation of pixel inputs into the STS neural representation \u2014 was not investigated.\n\nHere, we further analyze the neural data of [2] to investigate the characteristics of the neural representation at the level of individual neurons, rather than at the population level. We find that instead of distinct clusters of actor-invariant and action-invariant neurons, the neurons cover a broad, continuous range of invariance.\n\nTo the best of our knowledge, there have not been any prior attempts to predict single-neuron responses at such a high level in the visual processing hierarchy. Furthermore, attempts at time-series prediction for visual processing are also rare. Therefore, as a baseline, we propose a very simple and biologically plausible encoding model and explore how far this model can go in terms of reproducing the neural responses in the STS. Despite its simplicity, modeling STS neurons as a linear weighted sum of inputs over a short temporal window produces surprisingly good fits to the data.\n\n2 Background: the Superior Temporal Sulcus\n\nThe macaque visual system is commonly described as being separated into the ventral (\u201cwhat\u201d) and dorsal (\u201cwhere\u201d) streams [1]. The Superior Temporal Sulcus (STS) is a high-level brain area that receives inputs from both streams [3, 4]. In particular, it receives inputs from the highest levels of the processing hierarchy of either stream \u2014 inferotemporal (IT) cortex for the ventral stream, and the Medial Superior Temporal (MST) cortex for the dorsal stream. Accordingly, neurons that are biased more towards encoding either form information or motion information have been found in the STS [5]. The upper bank of the STS has been found to contain neurons more selective for motion, with some invariance to form [6, 7]. Relative to the upper bank, neurons in the lower bank of the STS have been found to be more sensitive to form, with some \u201csnapshot\u201d neurons selective for static poses within action sequences [7]. Using functional MRI (fMRI), neurons in the lower bank were found to respond to point-light figures [8] performing biological actions [9], consistent with the idea that actions can be recognized from distinctive static poses [10]. However, there is no clear, quantitative evidence for a neat separation between motion-sensitive, form-invariant neurons in the upper bank and form-sensitive, motion-invariant neurons in the lower bank. STS neurons have been found to be selective for specific combinations of form and motion [3, 11]. Similarly, based on fMRI data, the STS responds to both point-light and video displays, consistent with the idea that the STS integrates both form and motion [12].
3 Materials and methods\n\nNeural recordings. The neural data used in this work have previously been published by Singer and Sheinberg [2]. We summarize the key points here, and refer the reader to [2] for details. Two male rhesus macaques (monkeys G and S) were trained to perform an action recognition task, while neural activity from a total of 119 single neurons (59 and 60 from G and S respectively) was recorded during task performance. The mean firing rate (FR) over repeated stimulus presentations was calculated, and the mean FR over time is termed the response \u201cwaveform\u201d (Fig. 3). The monkeys\u2019 heads were fixed, but their eyes were free to move (other than fixating at the start of each trial).\n\nStimuli and task. The stimuli consisted of 64 movie clips (8 humanoid computer-generated \u201cactors\u201d each performing 8 actions; see Fig. 1). A sample movie of one actor performing the 8 actions can be found at http://www.jneurosci.org/content/30/8/3133/suppl/DC1 (see Supplemental Movie 1 therein). The monkeys\u2019 task was to categorize the action in the displayed clip into two pre-determined but arbitrary groups, pressing one of two buttons to indicate their decision. At the start of each trial, after the monkey maintained fixation for 450ms, a blank screen was shown for 500ms, and then one of the actors was displayed (subtending 6\u00b0 of visual angle vertically). Regardless of action, the actor was first displayed motionless in an upright neutral pose for 300ms, then began performing one of the 8 actions. Each clip ended back at the initial neutral pose after 1900ms of motion. A button-press response at any point by the monkey immediately ended the trial, and the screen was blanked. In this paper, we considered only the data corresponding to the actions (i.e. excluding the motionless neutral pose). Similar to [2], we assumed that all neurons had a response latency of 130ms.\n\nFigure 1: Depiction of stimuli used. Top row: the 8 \u201cactors\u201d in the neutral pose. Bottom row: sample frames of actor 5 performing the 8 actions; frames are from the same time-point within each action. The 64 stimuli were the 8-by-8 cross of each actor performing each action.
Actor- and action-invariance indices. We characterized each neuron\u2019s responses along two dimensions: invariance to actor and invariance to action. For the actor-invariance index, a neuron\u2019s average response waveform to each of the 8 actions was first calculated by averaging over all actors. Then, we calculated the Pearson correlation between the neuron\u2019s actual responses and the responses that would be seen if the neuron were completely actor-invariant (i.e. if it always responded with the average waveform calculated in the previous step). The action-invariance index was calculated similarly. The calculation of these indices bears similarities to that of the pattern and component indices of cells in area MT [13].
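To make this computation concrete, the following sketch implements the two indices for a single neuron. It is an illustrative sketch rather than the authors' code: it assumes the neuron's trial-averaged waveforms are stored in a NumPy array resp of shape (8 actors, 8 actions, T timesteps), and it pools the correlation over all stimuli and timesteps.

    import numpy as np

    def invariance_indices(resp):
        # resp: (n_actors, n_actions, n_timesteps) array of mean firing rates.
        # A completely actor-invariant neuron would respond to every actor with the
        # waveform averaged over actors (one average waveform per action); analogously
        # for a completely action-invariant neuron.
        actor_invariant = np.broadcast_to(resp.mean(axis=0, keepdims=True), resp.shape)
        action_invariant = np.broadcast_to(resp.mean(axis=1, keepdims=True), resp.shape)
        actor_index = np.corrcoef(resp.ravel(), actor_invariant.ravel())[0, 1]
        action_index = np.corrcoef(resp.ravel(), action_invariant.ravel())[0, 1]
        return actor_index, action_index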
Ventral and dorsal stream encoding models. We utilize existing models of brain areas that provide input to the STS. Specifically, we use the HMAX family of models, which includes models of the ventral [14] and dorsal [15] streams. These models receive pixel images as input, and simulate visual processing up to areas V4/IT (ventral) and areas MT/MST (dorsal). Such models build hierarchies of increasingly complex and invariant representations, similar to convolutional and deep-learning networks. While ventral stream processing has traditionally been modeled as producing outputs in response to static images, in practice, neurons in the ventral stream are also sensitive to temporal aspects [16]. As such, we extend the ventral stream model to be more biologically realistic. Specifically, the V1 neurons that project to the ventral stream now have temporal receptive fields (RFs) [17], not just spatial ones. These temporal and spatial RFs are separable, unlike those for V1 neurons that project to the dorsal stream [18]. Such space-time separable V1 neurons that project to the ventral stream are not directionally selective and are not sensitive to motion per se. They are still sensitive to form rather than motion, but are better models of form processing, since in reality the input to the visual system consists of a continuous stream of images. Importantly, the parameters of the dorsal and ventral encoding models were fixed, and no optimization was done to produce better fits to the current data. We used only the highest-level (C2) outputs of these models.\n\nSTS encoding model. As a first approximation, we model the neural processing by STS neurons as a linear weighted sum of inputs. The weights are fixed, and do not change over time. In other words, at any point in time, the output of a model STS neuron is a linear combination of the C2 outputs produced by the ventral and dorsal encoding models. We do not take into account temporal phenomena such as adaptation. We make the simplifying (but unrealistic) assumptions that synaptic efficacy is constant (i.e. no \u201cneural fatigue\u201d) and that time-points are all independent.\n\nEach model neuron has its own set of static weights that determine its unique pattern of neural responses to the 64 action clips. The weights are learned using leave-one-out cross-validation. Of the 64 stimuli, we use 63 for training, and use the learnt weights to predict the neural response waveform to the left-out stimulus. This procedure is repeated 64 times, leaving out a different stimulus each time. The 64 sets of predicted waveforms are collectively compared to the original neural responses. The goodness-of-fit metric is the Pearson correlation (r) between predicted and actual responses.\n\nThe weights are learned using simple linear regression. For F input features, there are F + 1 unknown weights (including a constant bias term). The inputs to the STS model neuron are represented as a (T \u00d7 63) by (F + 1) matrix, where T is the number of timesteps. The output is a (T \u00d7 63) by 1 vector, which is simply a concatenation of the 63 neural response waveforms corresponding to the 63 training stimuli. This simple linear system of equations, with (F + 1) unknowns and (T \u00d7 63) equations, can be solved using various methods; in practice, we used the least-squares method. Importantly, at no point are ground-truth actor or action labels used.\n\nRather than use the 960 dorsal and/or 960 ventral C2 features directly as inputs to the linear regression, we first performed PCA on these features (separately for the two streams) to reduce the dimensionality. Only the first 300 principal components (accounting for 95% or more of the variance) were used; the rest were discarded. Therefore, F = 300. Fitting was also performed using the combination of dorsal and ventral C2 features. As before, PCA was performed, and only the first 300 principal components were retained. Keeping F constant at 300, rather than setting it to 600, allowed for a fairer comparison to using either stream alone.
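The full fitting pipeline for one model neuron can be sketched as follows. This is a minimal illustration under assumed data layouts (features: a (64, T, 960) array of C2 responses per stimulus and timestep; waveforms: a (64, T) array of the neuron's mean firing rates), not the original implementation; scikit-learn's PCA and NumPy's least squares stand in for whatever routines were actually used.

    import numpy as np
    from sklearn.decomposition import PCA

    def fit_snippet_model(features, waveforms, n_components=300):
        # Leave-one-out fit of a static linear weighted sum over PCA-reduced C2 features.
        # features: (64, T, F_raw) C2 outputs; waveforms: (64, T) mean firing rates.
        n_stim, T, _ = features.shape
        pca = PCA(n_components=n_components)
        X = pca.fit_transform(features.reshape(n_stim * T, -1)).reshape(n_stim, T, -1)
        preds = np.zeros_like(waveforms, dtype=float)
        for left_out in range(n_stim):
            train = np.arange(n_stim) != left_out
            # Design matrix: (T * 63) rows and (F + 1) columns, including a constant bias term.
            X_tr = X[train].reshape(-1, n_components)
            X_tr = np.hstack([X_tr, np.ones((X_tr.shape[0], 1))])
            y_tr = waveforms[train].reshape(-1)
            w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
            X_te = np.hstack([X[left_out], np.ones((T, 1))])
            preds[left_out] = X_te @ w
        # Goodness-of-fit: Pearson r between predicted and actual responses,
        # pooled over all 64 left-out stimuli.
        r = np.corrcoef(preds.ravel(), waveforms.ravel())[0, 1]
        return r, preds

In this sketch, fitting to the combination of streams would correspond to concatenating the dorsal and ventral C2 features along the feature axis before the PCA step, so that F stays at 300.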
4 What is the individual neural representation like?\n\nIn this section, we examine the neural representation at the level of individual neurons. Figure 2 shows the invariance characteristics of the population of 119 neurons. Overall, the neurons span a broad range of action and actor invariance (95% of invariance index values span the ranges [0.301, 0.873] and [0.396, 0.894] respectively). The correlation between the two indices is low (r=0.26). Considering each monkey separately, the correlations between the two indices were 0.55 (monkey G) and -0.09 (monkey S). This difference could be linked to slightly different recording regions [2].\n\nFigure 2: Actor- and action-invariance indices for 59 neurons from monkey G (blue) and 60 neurons from monkey S (red). Blue and red crosses indicate mean values.\n\nFigure 3 shows the response waveforms of some example neurons to give a sense of what response patterns correspond to low and high invariance indices. The average over actors, the average over actions and the overall average are also shown. Neuron S09 is highly action-invariant but not actor-invariant, while G54 is the opposite. Neuron G20 is highly invariant to both action and actor, while the invariance of S10 is close to the mean invariance of the population.\n\nFigure 3: Plots of waveforms (mean firing rate in Hz vs. time in seconds) for four example neurons. Rows are actors, columns are actions. Red lines: mean firing rate (FR). Light red shading: \u00b11 SEM of FR. Black lines (row 9 and column 9): waveforms averaged over actors, actions, or both.\n\nWe find that there are no distinct clusters of neurons with high actor-invariance or action-invariance. Such clusters would correspond to a representation scheme in which certain neurons specialize in coding for action invariant to actor, and vice versa. A cluster of neurons with both low actor- and action-invariance could correspond to cells that code for a specific conjunction (binding) of actor and action, but no such cluster is seen. Rather, Fig. 2 indicates that instead of the \u201ccell specialization\u201d approach to neural representation, the visual system adopts a more continuous and distributed representation scheme, one that is perhaps more universal and generalizes better to novel stimuli. In the rest of this paper, we explore how well a linear, feedforward encoding model of STS ventral/dorsal integration can reproduce the neural responses and invariance properties found here.\n\n5 The \u201csnippet-matching\u201d model\n\nIn their paper, Singer and Sheinberg found evidence for the neural population representing actions as \u201csequences of integrated poses\u201d [2]. Each pose contains visual information integrated over a window of about 120ms. However, it was unclear what the representation was for each individual neuron. For example, does each neuron encode just a single pose (i.e. a \u201csnippet\u201d), or can it encode more than one? What are the neural computations underlying this encoding?\n\nIn this paper, we examine what is probably the simplest model of such neural computations, which we call the \u201csnippet-matching\u201d model. According to this model, each individual STS neuron compares its incoming input over a single time step to its preferred stimulus. Due to hierarchical organization, this single time step at the STS level contains information processed from roughly 120ms of raw visual input. For example, a neuron matches the incoming visual input to one particular short segment of the human walking gait cycle, and its output at any time is in effect how similar the visual input (from the previous 120ms up to the current time) is to that preferred stimulus (represented by linear weights; see the sub-section on the STS encoding model in Section 3).\n\nSuch a model is purely feedforward and does not rely on any lateral or recurrent neural connections. The temporal-window matching is straightforward to implement neurally, e.g. using the same \u201cdelay-line\u201d mechanisms [19] proposed for motion-selective cells in V1 and MT. Neurons implementing this model are said to be \u201cmemoryless\u201d or \u201cstateless\u201d, because their outputs depend solely on their current inputs, and not on their own previous outputs. It is important to note that the inputs to be matched could, in theory, be very short. In the extreme, the temporal window is very small, and the visual input to be matched could simply be the current frame. In this extreme case, action recognition is performed by matching individual frames to the neuron\u2019s preferred stimulus.\n\nSuch a \u201csnippet\u201d framework (memoryless matching over a single time step) is consistent with prior findings regarding recognition of biological motion. For instance, it has been found that humans can easily recognize videos of people walking from short clips of point-light stimuli [8]. This is consistent with the idea that action recognition is performed via matching of snippets. Neurons sensitive to such action snippets have been found using techniques such as fMRI [9, 12] and electrophysiology [7]. However, such snippet encoding models have not been investigated in much detail.
While there is some evidence for the snippet model in terms of the existence of neurons responsive to and selective for short action sequences, it is still unclear how feasible such an encoding model is. For instance, given some visual input, if a neuron simply tries to match that sequence to its preferred stimulus, how exactly does the neuron ignore the motion aspects (to recognize actor invariant to action) or ignore the form aspects (to recognize action invariant to actor)? Given the broad range of actor- and action-invariances found in the previous section, it is crucial to see if the snippet model can in fact reproduce such characteristics.\n\n6 How far can snippet-matching go?\n\nIn this section, we explore how well the simple snippet-matching model can predict the response waveforms of our population of STS neurons. This is a challenging task. STS is high up in the visual processing hierarchy, meaning that there are more unknown processing steps and parameters between the retina and STS than for a lower-level visual area. Furthermore, there is a diversity of neural response patterns, both between different neurons (see Figs. 2 and 3) and sometimes also between different stimuli for the same neuron (e.g. S10, Fig. 3).\n\nThe snippet-matching process can utilize a variety of matching functions. Again, we try the simplest possible function: a linear weighted sum. First, we examine the results of the leave-one-out fitting procedure when the inputs to the STS model neurons are from either the dorsal or the ventral stream alone. For monkey G, the mean goodness-of-fit (correlation between actual and predicted neural responses on left-out test stimuli) over all 59 neurons is 0.50 for dorsal and 0.43 for ventral stream inputs. The goodness-of-fit is highly correlated between the two streams (r=0.94). For monkey S, the mean goodness-of-fit over all 60 neurons is 0.33 for either stream (correlation between streams, r=0.91). Averaged over all neurons and both streams, the mean goodness-of-fit is 0.40. As a sanity check, when either the linear weights or the predictions are randomly re-ordered, the mean goodness-of-fit is 0.00.
Figure 4 shows the predicted and actual responses for two example neurons, one with a relatively good fit (G22 fit to dorsal, r=0.70) and one with an average fit (S10 fit to dorsal, r=0.39). In the case of G22 (Fig. 4 left), which is not even the best-fit neuron, there is a surprisingly good fit despite the clear complexity in the neural responses. This complexity is seen most clearly in the responses to the 8 actions averaged over actors, where the number and height of peaks in the waveform vary considerably from one action to another. The fit is remarkable considering the simplicity of the snippet model, in which there is only one set of static linear weights; all fluctuations in the predicted waveforms arise purely from changes in the inputs to this model STS neuron.\n\nFigure 4: Predicted (blue) and actual (red) waveforms for two example neurons, both fit to the dorsal stream. G22: r=0.70, S10: r=0.39. For each of the 64 sub-plots, the prediction for that test stimulus used the other 63 stimuli for training. Solely for visualization purposes, predictions were smoothed using a moving average window of 4 timesteps (total length 89 timesteps).\n\nOver the whole population, the fits to the dorsal model (mean r=0.42) are better than to the ventral model (mean r=0.38). Is there a systematic relationship between the difference in goodness-of-fit to the two streams and the invariance indices calculated in Section 4? For instance, one might expect that neurons with high actor-invariance would be better fit by the dorsal than the ventral model. From Fig. 5, we see that this is exactly the case for actor invariance. There is a strong positive correlation between actor invariance and the difference (dorsal minus ventral) in goodness-of-fit (monkey G: r=0.72; monkey S: r=0.69). For action invariance, as expected, there is a negative correlation (i.e. strong action invariance predicts a better fit to the ventral model) for monkey S (r=-0.35). However, for monkey G, the correlation is moderately positive (r=0.43), contrary to expectation. It is unclear why this is the case, but it may be linked to the robust correlation between the actor- and action-invariance indices for monkey G (r=0.55), seen in Fig. 2. This is not the case for monkey S (r=-0.09).\n\nFigure 5: Relationship between goodness-of-fit and invariance. Y-axis: difference between r from fitting to dorsal versus ventral streams. X-axis: actor (left) and action (right) invariance indices.\n\nInterestingly, either stream can produce actor-invariant and action-invariant responses (Fig. 6). While G54 is better fit to the dorsal than the ventral stream (0.77 vs. 0.67), both fits are relatively good \u2014 and are actor-invariant. The converse is true for S48. These results are consistent with the reality that both streams are interconnected and the what/where distinction is a simplification.\n\nFigure 6: Either stream can produce actor-invariant (G54) and action-invariant (S48) responses.\n\nSo far, we have performed linear fitting using the dorsal and ventral streams separately. Does fitting to a combination of both models improve the fit? For monkey G, the mean goodness-of-fit is 0.53; for monkey S it is 0.38. The improvements over the better of either model alone are moderate (6% for G, 15% for S). Interestingly, this fitting to a combination of streams, without prior knowledge of which stream is more suitable, produces fits that are as good as or better than if we knew a priori which stream would produce a better fit for a specific neuron (0.53 vs. 0.51 for G; 0.38 vs. 0.36 for S).\n\nHow much better does our snippet model fit to the combined outputs of the dorsal and ventral stream models compared to low-level controls? To answer this question, we instead fit our snippet model to a low-level pixel representation while keeping all else constant. The stimuli were resized to 32 \u00d7 32 pixels, so that the number of features (1024 = 32 \u00d7 32) was roughly the same as the number of C2 features. This was then reduced to 300 principal components, as was done for the C2 features. Fitting our snippet model to this pixel-derived representation produced worse fits (G: 0.40, S: 0.32). These were 25% (G) and 16% (S) worse than fitting to the combination of dorsal and ventral models. Furthermore, the monkeys were free to move their eyes during the task (apart from a fixation period at the start of each trial). Even slight random shifts in the pixel-derived representation of less than 0.25\u00b0 of visual angle (on the order of micro-saccades) dramatically reduced the fits to 0.25 (G) and 0.21 (S). In contrast, the same random shifts did not change the average fit numbers for the combination of dorsal and ventral models (0.53 for G, 0.39 for S). These results suggest that the fitting process does in fact learn meaningful weights, and that biologically-realistic, robust encoding models are important in providing suitable inputs to the fitting process.
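The pixel-level control and the eye-movement perturbation can be sketched as follows. The frame size, grayscale format and the px_per_deg conversion from visual angle to pixels are assumptions made for illustration (they are not specified above), and scikit-image's resize stands in for whatever downsampling was actually used.

    import numpy as np
    from scipy.ndimage import shift
    from skimage.transform import resize

    def pixel_features(frames, out_size=32, jitter_deg=0.0, px_per_deg=40, rng=None):
        # frames: (T, H, W) grayscale frames of one stimulus.
        # Each frame is optionally translated by a small random offset (simulating
        # fixational eye movements) and then downsampled to out_size x out_size pixels,
        # giving 1024 low-level features per timestep.
        rng = np.random.default_rng() if rng is None else rng
        feats = np.zeros((frames.shape[0], out_size * out_size))
        for t, frame in enumerate(frames):
            if jitter_deg > 0:
                dy, dx = rng.uniform(-jitter_deg, jitter_deg, size=2) * px_per_deg
                frame = shift(frame, (dy, dx), mode="nearest")
            feats[t] = resize(frame, (out_size, out_size), anti_aliasing=True).ravel()
        return feats

Stacking these per-stimulus arrays in place of the C2 features, applying the same 300-component PCA, and re-running the leave-one-out fit would give the pixel-derived baseline; setting jitter_deg to 0.25 probes its sensitivity to shifts on the order of micro-saccades.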
Finally, how do the actor- and action-invariance indices calculated from the predicted responses compare to those calculated from the ground-truth data? Averaged over all 119 neurons fitted to a combination of dorsal and ventral streams, the actor- and action-invariance indices are within 0.0524 and 0.0542 of their true values (mean absolute error). In contrast, using the pixel-derived representation, the results are much worse (0.0944 and 0.1193 respectively, i.e. the error roughly doubles).
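This comparison amounts to a mean absolute error, over neurons, between invariance indices computed from predicted waveforms and those computed from the recorded waveforms. A small sketch, reusing the illustrative invariance_indices function from Section 3 and again assuming one (8, 8, T) response array per neuron:

    import numpy as np

    def index_recovery_error(actual, predicted):
        # actual, predicted: lists of (n_actors, n_actions, n_timesteps) arrays, one per neuron.
        errors = np.array([np.abs(np.array(invariance_indices(p)) - np.array(invariance_indices(a)))
                           for a, p in zip(actual, predicted)])
        # Mean absolute error of the (actor-invariance, action-invariance) indices.
        return errors.mean(axis=0)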
7 Conclusions\n\nWe found that at the level of individual neurons, the neuronal representation in STS spans a broad, continuous range of actor- and action-invariance, rather than having groups of neurons with distinct invariance properties. Simply as a baseline model, we investigated how well a linear weighted sum of dorsal and ventral stream responses to action \u201csnippets\u201d could reproduce the neural response patterns found in these STS neurons. The results are surprisingly good for such a simple model, consistent with findings from computer vision [20]. Clearly, however, more complex models should, in theory, be able to better fit the data. For example, a non-linear operation can be added, as in the LN family of models [13]. Other models include those with nonlinear dynamics, as well as lateral and feedback connections [21, 22]. Other ventral and dorsal models can also be tested (e.g. [23]), including computer vision models [24, 25]. Nonetheless, this simple \u201csnippet-matching\u201d model is able to grossly reproduce the pattern of neural responses and invariance properties found in the STS.\n\nReferences\n\n[1] L. G. Ungerleider and J. V. Haxby, \u201c\u2018What\u2019 and \u2018where\u2019 in the human brain.\u201d Current Opinion in Neurobiology, vol. 4, no. 2, pp. 157\u201365, 1994.\n[2] J. M. Singer and D. L. Sheinberg, \u201cTemporal cortex neurons encode articulated actions as slow sequences of integrated poses.\u201d Journal of Neuroscience, vol. 30, no. 8, pp. 3133\u201345, 2010.\n[3] M. W. Oram and D. I. Perrett, \u201cIntegration of form and motion in the anterior superior temporal polysensory area of the macaque monkey.\u201d Journal of Neurophysiology, vol. 76, no. 1, pp. 109\u201329, 1996.\n[4] J. S. Baizer, L. G. Ungerleider, and R. Desimone, \u201cOrganization of visual inputs to the inferior temporal and posterior parietal cortex in macaques.\u201d Journal of Neuroscience, vol. 11, no. 1, pp. 168\u201390, 1991.\n[5] T. Jellema, G. Maassen, and D. I. Perrett, \u201cSingle cell integration of animate form, motion and location in the superior temporal cortex of the macaque monkey.\u201d Cerebral Cortex, vol. 14, no. 7, pp. 781\u201390, 2004.\n[6] C. Bruce, R. Desimone, and C. G. Gross, \u201cVisual properties of neurons in a polysensory area in superior temporal sulcus of the macaque.\u201d Journal of Neurophysiology, vol. 46, no. 2, pp. 369\u201384, 1981.\n[7] J. Vangeneugden, F. Pollick, and R. Vogels, \u201cFunctional differentiation of macaque visual temporal cortical neurons using a parametric action space.\u201d Cerebral Cortex, vol. 19, no. 3, pp. 593\u2013611, 2009.\n[8] G. Johansson, \u201cVisual perception of biological motion and a model for its analysis.\u201d Perception & Psychophysics, vol. 14, pp. 201\u2013211, 1973.\n[9] E. Grossman, M. Donnelly, R. Price, D. Pickens, V. Morgan, G. Neighbor, and R. Blake, \u201cBrain areas involved in perception of biological motion.\u201d Journal of Cognitive Neuroscience, vol. 12, no. 5, pp. 711\u201320, 2000.\n[10] J. A. Beintema and M. Lappe, \u201cPerception of biological motion without local image motion.\u201d Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 8, pp. 5661\u20133, 2002.\n[11] D. I. Perrett, P. A. Smith, A. J. Mistlin, A. J. Chitty, A. S. Head, D. D. Potter, R. Broennimann, A. D. Milner, and M. A. Jeeves, \u201cVisual analysis of body movements by neurones in the temporal cortex of the macaque monkey: a preliminary report.\u201d Behavioural Brain Research, vol. 16, no. 2-3, pp. 153\u201370, 1985.\n[12] M. S. Beauchamp, K. E. Lee, J. V. Haxby, and A. Martin, \u201cFMRI responses to video and point-light displays of moving humans and manipulable objects.\u201d Journal of Cognitive Neuroscience, vol. 15, no. 7, pp. 991\u20131001, 2003.\n[13] N. C. Rust, V. Mante, E. P. Simoncelli, and J. A. Movshon, \u201cHow MT cells analyze the motion of visual patterns.\u201d Nature Neuroscience, vol. 9, no. 11, pp. 1421\u201331, 2006.\n[14] M. Riesenhuber and T. Poggio, \u201cHierarchical models of object recognition in cortex.\u201d Nature Neuroscience, vol. 2, no. 11, pp. 1019\u201325, 1999.\n[15] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, \u201cA Biologically Inspired System for Action Recognition,\u201d in 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007.\n[16] C. G. Gross, C. E. Rocha-Miranda, and D. B. Bender, \u201cVisual properties of neurons in inferotemporal cortex of the macaque.\u201d Journal of Neurophysiology, vol. 35, no. 1, pp. 96\u2013111, 1972.\n[17] P. Dayan and L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, MA: The MIT Press, 2005.\n[18] J. A. Movshon and W. T. Newsome, \u201cVisual response properties of striate cortical neurons projecting to area MT in macaque monkeys.\u201d Journal of Neuroscience, vol. 16, no. 23, pp. 7733\u201341, 1996.\n[19] W. Reichardt, \u201cAutocorrelation, a principle for the evaluation of sensory information by the central nervous system,\u201d Sensory Communication, pp. 303\u201317, 1961.\n[20] K. Schindler and L. van Gool, \u201cAction snippets: How many frames does human action recognition require?\u201d in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008.\n[21] M. A. Giese and T. Poggio, \u201cNeural mechanisms for the recognition of biological movements.\u201d Nature Reviews Neuroscience, vol. 4, no. 3, pp. 179\u201392, 2003.\n[22] J. Lange and M. Lappe, \u201cA model of biological motion perception from configural form cues.\u201d Journal of Neuroscience, vol. 26, no. 11, pp. 2894\u2013906, 2006.\n[23] P. J. Mineault, F. A. Khawaja, D. A. Butts, and C. C. Pack, \u201cHierarchical processing of complex motion along the primate dorsal visual pathway.\u201d Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 16, pp. E972\u201380, 2012.
[24] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, \u201cLearning realistic human actions from movies,\u201d in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008.\n[25] A. Bobick and J. Davis, \u201cThe recognition of human movement using temporal templates,\u201d IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257\u2013267, 2001.\n", "award": [], "sourceid": 362, "authors": [{"given_name": "Cheston", "family_name": "Tan", "institution": "Institute for Infocomm Research, Singapore"}, {"given_name": "Jedediah", "family_name": "Singer", "institution": "Boston Children's Hospital"}, {"given_name": "Thomas", "family_name": "Serre", "institution": "Brown University"}, {"given_name": "David", "family_name": "Sheinberg", "institution": "Brown University"}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": "MIT"}]}