NeurIPS 2020

Modeling Task Effects on Meaning Representation in the Brain via Zero-Shot MEG Prediction


Review 1

Summary and Contributions: This paper investigates, using machine learning and MEG data, how the brain processes words conditionally on the task. The paper has clear merit, although the revealed effects are small and the technical novelty is moderate. Sharing the code is particularly appreciated.

Strengths: The paper is very well written, with adequate literature review and reasonable experimental results.

Weaknesses: Major concerns:
- It is unfortunate that source localization results are not provided. The topographical maps in Fig. 5 are not easy to interpret.
- The attentional model is neat and potentially interesting for the NeurIPS audience, but unfortunately it does not lead to any improved results. A learning-curve analysis should help answer the question of whether "more samples are needed to learn a better one".
- Using ADAM in this setting (only a few samples) seems dangerous. Using L-BFGS with gradients from TensorFlow could potentially stabilize and accelerate the estimation (see the sketch after this list).

Minor concerns:
- Please number the equations.
- What software was used to analyse the MEG data? Brainstorm (Tadel et al.), FieldTrip (Oostenveld et al.), MNE (Gramfort et al.)?
- When you introduce a matrix or vector notation such as W, provide its dimensions.
- In the equation of the H4.2 paragraph you minimize over W_s and A (A is missing in the \min).
- Use \|.\| rather than ||.|| for the norm in LaTeX.
- Using B and Y for the target is unnecessarily confusing, as is having features called Q and S. I would stick to X_Q and X_S, so that features are always called X and the target is always Y.
- Use ^\top rather than ^T for transpose, especially as you have T as a notation for time.
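A minimal sketch of the L-BFGS suggestion above, assuming a plain ridge-style objective (the authors' actual loss may include the attention term): TensorFlow supplies the gradients and SciPy's L-BFGS performs the optimization. The names `fit_weights_lbfgs`, `X`, `Y`, and `lam` are illustrative, not from the paper.

```python
import numpy as np
import tensorflow as tf
from scipy.optimize import minimize

def fit_weights_lbfgs(X, Y, lam=1.0):
    """X: (n_samples, n_features) stimulus features; Y: (n_samples, n_sensors) MEG data."""
    n_feat, n_out = X.shape[1], Y.shape[1]
    X_tf = tf.constant(X, dtype=tf.float64)
    Y_tf = tf.constant(Y, dtype=tf.float64)

    def loss_and_grad(w_flat):
        # Rebuild the weight matrix from the flat parameter vector SciPy passes in.
        W = tf.Variable(w_flat.reshape(n_feat, n_out))
        with tf.GradientTape() as tape:
            resid = Y_tf - tf.matmul(X_tf, W)
            loss = tf.reduce_sum(resid ** 2) + lam * tf.reduce_sum(W ** 2)
        grad = tape.gradient(loss, W)
        return loss.numpy(), grad.numpy().ravel()

    w0 = np.zeros(n_feat * n_out)
    res = minimize(loss_and_grad, w0, jac=True, method="L-BFGS-B")
    return res.x.reshape(n_feat, n_out)

# toy usage with arbitrary sizes
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((60, 20)), rng.standard_normal((60, 30))
W_hat = fit_weights_lbfgs(X, Y, lam=10.0)
```

With only a few samples per fold, a full-batch second-order method like this typically converges in far fewer function evaluations than a stochastic optimizer, which is the point of the suggestion.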

Correctness: ok

Clarity: well written

Relation to Prior Work: good

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: This paper presents a re-analysis of the MEG experiment of Sudre et al. (2012), in which participants were tasked with responding to a question about the meaning of an object concept word (e.g., “Can you hold it?” CARROT (respond yes/no)). In the original Sudre et al. analysis, the focus was on testing the predictive power of different perceptual and semantic feature models of the concept word for the MEG data. In the current study, the focus is on the role of the task question that precedes the concept word, and in particular on whether and how the semantics of the task question modulates the subsequent processing and neural activity time-locked to the stimulus word. This is an interesting neurocognitive question, as it sheds light on how lexical-semantic representation and access can be modulated by the preceding context, and on how the timing of processing of the target concept word that is independent of the task demands relates to the timing of the processing that integrates that conceptual knowledge with the task requirements in order to respond on the task.

To analyze the data, the authors construct vector-based semantic models of both the concept words and the task questions, using separately collected human responses in which participants rated the truth of the task questions for the concepts. This yields vector-based representations that reflect the semantic content of the questions and the concepts, where semantically similar questions are represented with similar vectors, and likewise for semantically similar concepts. The authors use these stimulus feature models to predict the MEG brain data. In particular, they construct different predictive linear models, corresponding to different hypotheses about how task question and concept word processing affect the neurocognitive signal. They consider a model that uses the task question representations alone, a model that uses the concept word representations alone, and a model that concatenates the task and concept representations (such that the brain activity is predicted from some linear combination of the task and concept features). In addition, they consider two “interactive” models, in which task representations modulate concept representations through a soft attention mechanism. The first of these uses precomputed attention weights, whereas in the second the attention weights are learned in conjunction with the other linear model weights when fitting the predictive model to the brain data.

The results show that the concept-word-only predictive model yields the most parsimonious prediction of the brain data up to what is presumed to be the end of lexical processing of the word, at around 475 ms, with the task-only model being significant in the 475-525 ms timeframe. In addition, the models combining task and concept features show their best performance relative to the concept-word-only model in this later time frame, and in particular one of the attention-based models significantly outperforms the concept-word-only model at 500-525 ms. This pattern of results suggests a particular time-course of processing in this experimental paradigm, with lexico-semantic access of the word, independent of task requirements, preceding the integration of the task requirements with the word representation in order to perform the task.

UPDATE: I have read and considered the rebuttal document and the other reviews. I am not a fan of the "grouping 20 examples together for a 20v20 classification task" approach to evaluation that the authors mention, which has always seemed to me a kludge to artificially increase reported accuracies by trying to ignore between-item variability (not unlike the "language-as-a-fixed-effect fallacy" in psycholinguistics). The authors have not responded to my comment on the unnatural experimental design (understandable given the space limitations). However, I still have an overall favourable opinion of this work.
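As a reading aid, here is a minimal sketch in the spirit of (but not identical to) the "interactive" hypotheses described above: a task-question vector produces soft attention weights over the dimensions of the concept vector, and the attended concept features are mapped linearly to the MEG sensors. All names and dimensions below are illustrative placeholders, not the authors' actual implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_meg(q, s, A, W):
    """q: (d_q,) task-question features; s: (d_s,) concept features;
    A: (d_q, d_s) attention parameters; W: (d_s, n_sensors) regression weights."""
    attn = softmax(q @ A)      # soft attention weights over concept dimensions
    s_mod = attn * s           # task-modulated concept representation
    return s_mod @ W           # predicted sensor activity at one time point

# toy usage
rng = np.random.default_rng(0)
d_q, d_s, n_sensors = 10, 50, 100
q, s = rng.standard_normal(d_q), rng.standard_normal(d_s)
A = rng.standard_normal((d_q, d_s))
W = rng.standard_normal((d_s, n_sensors))
print(predict_meg(q, s, A, W).shape)  # (100,)
```

In the paper's two interactive hypotheses, the weights playing the role of A are either precomputed or learned jointly with the regression weights; the sketch above only illustrates the shared forward computation.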

Strengths: This study presents a competent set of analyses on the Sudre et al dataset, and I found all aspects of the methodology (design of the stimulus models with the MTurk data, construction of the four types of predictive models corresponding to different neurocognitive hypotheses, and the use of an attention parameter matrix to model interactive effects of task question on concept representation) convincing. The results of the analysis make sense – the initial transient effect of the task-alone model can be interpreted as residual processing of the cue question, and the results nicely support the claim that processing of the concept word initially happens independently of subsequent task effects.

Weaknesses: The experimental paradigm used seems a bit unnatural. Rarely are we asked a question about an unspecified object’s properties (“Can you hold it?”) and only *then* given the object name. This lack of ecological validity may mean that the results do not generalize well to lexico-semantic processes in question answering more generally (and the paper does imply in a number of places that the task is about question answering). Although I liked the paper and would also like to see more work featuring neuroimaging data at NeurIPS, I wonder whether it is not better suited to a cognitive neuroscience venue. The use of soft attention in combining feature models to predict the MEG data is interesting, but apart from this the focus of the paper is perhaps not particularly methodologically novel. It is a pity that the analyses with the BERT model do not seem to work out effectively, as these might be of interest for understanding how BERT represents questions about object concepts. The final section of the paper makes reasonable suggestions about how BERT might be incorporated into the analysis more effectively, but that is not included in the current work. Although the results make sense, are significant, and are interesting, the reported differences between the H1 and H4.1 models in the later time period are not huge (a difference of about 1-1.5% in prediction accuracy) and the overall decoding performance is low (~53%). This, combined with the small number of experimental participants (6), makes me wonder about the robustness of the reported findings.

Correctness: Yes, I think the analyses are correct and appear to have been competently implemented.

Clarity: Yes. I have no concerns about the quality of the writing, and the paper is easy to understand. Typo: line 181: “thes regularization weight” -> “the regularization weight”

Relation to Prior Work: In general, the prior work is well covered, and in particular the paper does a good job of situating itself with respect to other work using similar linear decoding analysis frameworks (e.g., the work of the Gallant and Mitchell labs and others). Given the scientific focus of the paper, there could perhaps be better coverage of the cognitive neuroscience literature on lexical-semantic representation and retrieval, such as work by Beth Jefferies, Jeff Binder, Sharon L. Thompson-Schill, inter alia, and a better situating of the results with respect to neurocognitive theories of word processing.

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: The authors propose a predictive model to examine the relationship between brain recordings and stimulus properties while explicitly investigating task effects at the same time. It is the first computational model to predict brain recordings as a function of both the task and the semantics of the observed stimulus. Such research is beneficial for future studies on question answering in neuroscience.

Strengths: - "Modeling Task Effects on Meaning Representation in the Brain" is an important study. It is good to know the relationship between them which might give some insights for researchers on model designing. - The authors use MEG recordings which 2000-times finer temporal resolution than the fMRI recordings and thus allow to localize the task effect in time. - The hypotheses are thoroughly explored in the context of exploring relationship of task and stimulus.

Weaknesses:
- Apart from ridge regression, there are other regression models, such as lasso regression, that the authors might try.
- For Table 1, the notation should be aligned with its description.
- Overall, I am quite curious why the 2v2 accuracies for all hypotheses are only slightly better than random chance (a sketch of a typical 2v2 evaluation is given below). Why is that? Correct me if I am wrong, thanks.
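For readers unfamiliar with the metric, here is a minimal sketch of a 2v2 evaluation of the kind referred to above, assuming correlation-based matching of predicted to observed responses; the authors' exact distance measure may differ.

```python
import numpy as np
from itertools import combinations

def two_vs_two_accuracy(pred, true):
    """pred, true: (n_items, n_features) predicted and observed brain responses."""
    correct, total = 0, 0
    for i, j in combinations(range(len(pred)), 2):
        # "matched" similarity: prediction i with observation i, prediction j with observation j
        match = np.corrcoef(pred[i], true[i])[0, 1] + np.corrcoef(pred[j], true[j])[0, 1]
        # "mismatched" similarity: prediction i with observation j, and vice versa
        mismatch = np.corrcoef(pred[i], true[j])[0, 1] + np.corrcoef(pred[j], true[i])[0, 1]
        correct += match > mismatch
        total += 1
    return correct / total  # chance level is 0.5
```

Under this scheme, accuracies only a few points above 0.5 are common for single-trial MEG with few participants, which may partly explain the numbers the reviewer asks about.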

Correctness: To the best of my knowledge, yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Overall, there is little methodological novelty, but what the authors study is quite important and interesting. Therefore, my initial rating may change as further information is given.

%%%%% Post-rebuttal: After reading the rebuttal and the comments from the other reviewers, I will increase my score to 6. Understanding task effects is an interesting and important problem for both the ML and neuroscience communities. My only concern is the number of participants used in the experiments; to acquire a more general picture of task effects, more samples may be needed. Lastly, thanks for answering all my questions.