__ Summary and Contributions__: This paper proposes a framework for inferring the learning rule of animals, based on the REINFORCE policy gradient method. The authors validate the method with a model recovery analysis, showing that they can recover the hyperparameters and weight trajectory from behavioral data alone. They apply the method to two animal datasets and recover interesting traits of animal learning.

__ Strengths__: The study is novel to the best of my knowledge. I like this paper and I’m very excited about this topic. Few studies look at the training phase of animal learning or attempt to recover the policy-iteration process. The authors validate model identifiability with simulations. I think the work is relevant to the NeurIPS community, as it applies a method from reinforcement learning (REINFORCE) to neuroscience and finds interesting results about animal learning, which might inspire new RL algorithms.

__ Weaknesses__: As the authors point out (lines 148-151), the results rely strongly on the assumption that the learning model is REINFORCE, which I think is a very strong assumption. It would be better supported by literature showing that animals can perform, or do perform, similar learning.
Also, as the authors point out, their model is descriptive. As is the nature of a descriptive model, I feel that I don’t gain much insight from it into how animals learn. For example, the authors found a non-zero update to the bias weight on incorrect trials, which explains the “incorrect” behavior of repeatedly choosing the wrong option. This sounds like “noise” in the behavior to me, and the model does not explain it further beyond treating it as noise. The authors also point out that the value function the animals optimize can be something other than expected reward, which I completely agree with. However, it seems their model does not provide any insight into that value function, besides saying it is expected reward plus some noise. Please correct me if I have misunderstood something; otherwise, it would be good if the authors could discuss more concretely how their model can provide insight into animal learning for neuroscientists, and how it might inspire machine learning algorithms.
Also, the model recovery results show only one particular (hyper-)parameter setting, which is recovered fairly well. I assume there is a lot of variability across individual animals, and also across tasks and datasets. It is therefore unclear whether other parameter settings can also be recovered well; it would be good to discuss whether there are regions of the parameter space that produce fairly similar behavior.
In addition, it would be good to show model identifiability results of RF_1, RF_K, RF_beta, i.e. if we simulate RF_1, will it be correctly identified as RF_1?
It would be good to compare goodness-of-fit with some value-iteration baselines that are commonly used in the animal neuroscience literature, e.g. RW and its variants (having two learning rates, one for positive PE and one for negative; and/or adding a forgetting factor, so that the value of the option(s) decays towards zero or a parameterized value, with a decay parameter; and/or adding a side bias). I’m also interested to see what you would find if you simulated data with a fairly simple value-iteration method and used your model to recover it.
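To make the suggested baseline concrete, here is a minimal sketch of a Rescorla-Wagner (RW) variant for a two-choice task, combining the options listed above. Everything in it is an assumption for illustration, not something taken from the paper: the function name `rw_choice_loglik`, the parameter names, the logistic choice rule with inverse temperature `beta`, and the choice to decay only the unchosen option towards zero.

```python
import math

def rw_choice_loglik(choices, rewards, alpha_pos, alpha_neg,
                     decay=0.0, side_bias=0.0, beta=1.0):
    """Log-likelihood of binary choices (0 or 1) under an RW variant.

    Hypothetical parameterization:
      alpha_pos / alpha_neg : learning rates for positive / negative PE
      decay                 : forgetting factor for the unchosen option
      side_bias             : constant added to the logit of option 1
      beta                  : inverse temperature of the choice rule
    """
    q = [0.0, 0.0]  # option values, initialized at zero
    loglik = 0.0
    for c, r in zip(choices, rewards):
        # With two options, softmax reduces to a logistic in the value difference.
        logit = beta * (q[1] - q[0]) + side_bias
        p1 = 1.0 / (1.0 + math.exp(-logit))
        loglik += math.log(p1 if c == 1 else 1.0 - p1)

        # Prediction error on the chosen option, with asymmetric learning rates.
        pe = r - q[c]
        q[c] += (alpha_pos if pe >= 0 else alpha_neg) * pe

        # Forgetting: the unchosen option's value decays towards zero.
        q[1 - c] *= (1.0 - decay)
    return loglik
```

The returned log-likelihood could be maximized over the parameters with any standard optimizer and then compared with the REINFORCE variants on the same trials, for instance through AIC.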
--- updates ---
I appreciate the authors’ effort on the response and their openness to the reviewers’ suggestions. With some of the new analyses the paper will be stronger; however, it is hard to predict the results of those analyses, which could potentially make the paper weaker. I understand that it is not reasonable to expect those new results within one week. I'm keeping my original score, which is a borderline paper given the results currently presented.

__ Correctness__: They are correct to the best of my knowledge.

__ Clarity__: The paper is very well written and I found it easy to follow. The figures are also very clear and easy to understand. One part that I found a little confusing was 2.2 and 2.3. I was a little confused by the “placeholder” in equation 3. Would it be better to introduce the learning models first? For 2.3, I wasn’t sure about the notation of the square brackets “[]”. I think the authors could introduce REINFORCE more thoroughly, without assuming that all readers are familiar with it, especially if neuroscientists are also part of the target audience.
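As an illustration of the kind of introduction meant here, the textbook REINFORCE update for a binary choice with a logistic policy can be written in a few lines. The notation is assumed for this sketch (stimulus vector $x_t$, choice $y_t \in \{0, 1\}$, reward $r_t$, learning rate $\alpha$, baseline $b$) and need not match the paper's equations:

```latex
% Logistic policy over a binary choice
p(y_t = 1 \mid x_t, w) = \sigma(w^\top x_t), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% REINFORCE: step along the reward-weighted gradient of the log-policy
\Delta w = \alpha \, (r_t - b) \, \nabla_w \log p(y_t \mid x_t, w)
         = \alpha \, (r_t - b) \, \bigl( y_t - \sigma(w^\top x_t) \bigr) \, x_t
```

The second equality uses the standard gradient of the Bernoulli log-likelihood under a logistic link, which makes explicit that each update depends only on the stimulus, the choice, and the reward on that trial.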

__ Relation to Prior Work__: The authors discuss a few previous works that attempt to look into animal policy-iteration and point out their contributions. The authors’ unique contribution was clear to me.

__ Reproducibility__: Yes

__ Additional Feedback__: I’m a little confused about Figure 5. Curve c also doesn’t look much like curve a?
I would also be interested to see how the model parameters correlate with, for example, animal performance (accuracy). It would be interesting to see whether some weight trajectories lead to better learning (accuracy, or few-shot learning if there are switches in the task, e.g. in the Harlow task).

__ Summary and Contributions__: The paper proposes a really interesting method for inferring parameters in learning models, and in particular the contributions of the learning and noise terms. It provides an interesting way to analyse behaviour, even if the selected learning rules (mainly variations of REINFORCE) are not the most common in computational cognitive neuroscience; temporal difference reinforcement learning with softmax action selection or history-based models (as mentioned in the discussion) are more usual. It would be nice to see (at least in the discussion, if there is insufficient time to run simulations) how the results would compare when using these models, which may help clarify whether the negative bias result is an artifact specific to the limitations of REINFORCE or a truly interesting phenomenon. I would suppose that stimulus uncertainty plays a considerable role, but especially for rodents, history-based patterns (such as win-stay, lose-shift) may be predominant and, with appropriate parameters (e.g. a low probability of shifting), may easily explain the negative bias result.
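A minimal sketch of the history-based alternative mentioned above, assuming a probabilistic win-stay, lose-shift (WSLS) agent on a two-choice task. The function name `simulate_wsls` and the parameters `p_stay_win` and `p_shift_lose` are hypothetical and chosen only for illustration:

```python
import random

def simulate_wsls(correct_sides, p_stay_win=0.9, p_shift_lose=0.2, seed=0):
    """Simulate choices (0 or 1) of a probabilistic WSLS agent.

    correct_sides : sequence of 0/1 giving the rewarded side on each trial
    p_stay_win    : probability of repeating the choice after a correct trial
    p_shift_lose  : probability of switching after an incorrect trial
    """
    rng = random.Random(seed)
    choices = []
    prev_choice, prev_correct = None, None
    for side in correct_sides:
        if prev_choice is None:
            choice = rng.randint(0, 1)  # first trial: pick a side at random
        elif prev_correct:
            stay = rng.random() < p_stay_win
            choice = prev_choice if stay else 1 - prev_choice
        else:
            shift = rng.random() < p_shift_lose
            choice = 1 - prev_choice if shift else prev_choice
        choices.append(choice)
        prev_choice, prev_correct = choice, (choice == side)
    return choices
```

With a low `p_shift_lose`, such an agent tends to repeat a choice after an error, which is the behavioural pattern that the negative bias update is meant to capture. Fitting the proposed model to data simulated this way would show whether the two accounts can be told apart.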

__ Strengths__: The decomposition into learning and noise in evaluating learning models is particularly interesting. Although AIC is more commonly used, this sounds like a great and more specific alternative. Perhaps some discussion comparing the two would help. Furthermore, having the result replicated in two different datasets and species is clearly a plus. The contribution is clearly relevant for NeurIPS and at least some important aspects are novel, although the broader importance of the results is a bit hard to evaluate due to the limited range of learning rules explored.

__ Weaknesses__: The main weakness is that the analysis is limited to variations of the REINFORCE learning rule for predicting behaviour; a broader range, including some more commonly used rules, should be explored. There is also no primary behavioural data presented to show the learning curves, which would help to make sense of the (model-free) learning dynamics and to hypothesize which models may be the most appropriate. Although I presume these could be found in the references online, it would be helpful to add them at least to the SM.

__ Correctness__: Yes, it seems correct. It's a bit strange that mean or median AIC values are not provided in Fig. 4c-d. Also, why is RF_beta missing in Fig. 4d? I think focusing on "example mouse" and presenting it in black is somewhat misleading and mean or median AIC values across mice should be shown instead (or in addition).
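A minimal sketch of the across-animal summary requested above, using the standard definition AIC = 2k - 2 ln L. The function names and the layout of the input (a dict mapping each model name to a list of per-animal AIC values, in the same animal order) are assumptions for illustration:

```python
import statistics

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def summarize_delta_aic(aic_by_model, reference):
    """Mean and median per-animal AIC difference relative to a reference model.

    aic_by_model : dict of model name -> list of AIC values, one per animal
    reference    : name of the model that the others are compared against
    """
    ref = aic_by_model[reference]
    summary = {}
    for name, values in aic_by_model.items():
        # Pair animals across models before taking the difference.
        deltas = [v - r for v, r in zip(values, ref)]
        summary[name] = (statistics.mean(deltas), statistics.median(deltas))
    return summary
```

Reporting the paired differences, rather than a single example animal, shows whether the preferred model wins consistently across animals or only on average.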

__ Clarity__: Yes - it is explained reasonably well, although the pink trajectory in Fig. 5a is overlapping many times and a bit unclear - perhaps some continuous colour coding would help? It's also unusual that differences in learning rates between bias and stimulus (shown in Fig. 4b) are not discussed. Alphas seem much higher for bias and it would be good to discuss this. What exactly is R+B in Fig. 4c? (is it RF_beta?) Finally it's important not to confuse right (when it means direction) with right (meaning correct).

__ Relation to Prior Work__: Generally yes; however, some clearly relevant work on (dynamic) parameter estimation of (dynamic) behavioural data is missing, such as https://papers.nips.cc/paper/2418-estimating-internal-variables-and-paramters-of-a-learning-agent-by-a-particle-filter.pdf and http://papers.neurips.cc/paper/3601-stress-noradrenaline-and-realistic-prediction-of-mouse-behaviour-using-reinforcement-learning.pdf As a result, the novelty assertion in lines 133-134 doesn't really hold.

__ Reproducibility__: Yes

__ Additional Feedback__: I thank the authors for their responses. I understand that REINFORCE is used loosely, but this still doesn't include a number of more common learning rules, such as TD learning. I simply think that both kinds of models should be included and compared openly (and arguing that humans/animals use one and not the other is tricky and should generally be avoided unless a large amount of evidence points that way, which I think is not the case...) Thanks for promising to address my other concerns. Look forward to reading the final version!

__ Summary and Contributions__: The authors develop a method to infer learning rules from observations of behavior. Specifically, they assume a specific parametric form for the relation between stimulus and behavior. They then test various learning rules which adjust these parameters (weights) over the course of training. Using a two-stage inference procedure, the parameters of learning and the weight trajectories are inferred from behavior.
This method is tested on simulated data and on two experimental tasks (in mice and rats). The authors show that a specific variant (multiple learning rates and baselines) provides a better fit to the experimental data compared to other learning rules tested.

__ Strengths__: 1. Inferring learning rules is an important problem, and the authors leverage the recent availability of large behavioral datasets to address it.
2. The method is quite general with respect to the learning rules that can be tested.
3. The results show a “non-optimal” trajectory in weight space (Figure 5) which is an interesting observation.

__ Weaknesses__: 1. Unless I missed it, there is no evaluation of the quality of the fit. Figure 4C,D provides the relative performance of the different rules, but there is no description of how well the best models fit validation sets.
2. The sub-optimal behavior finding calls for a more direct validation outside of the inference framework.
3. The simulations were only tested with matching generative and assumed learning rules. It would be useful to characterize the effect of a mismatch, as this is what is expected in real data.

__ Correctness__: The results appear correct.

__ Clarity__: The overall motivation and results are clearly presented.
If the quality of fit is presented (Weakness #1), then it is not clearly presented.

__ Relation to Prior Work__: Relevant prior work is mentioned.

__ Reproducibility__: Yes

__ Additional Feedback__: POST REBUTTAL UPDATE:
I read all reviews and the author response.
I was somewhat surprised that the rebuttal did not include any results.
It is not realistic to expect a full revised paper within the one-week rebuttal period.
But I think it is realistic to expect some results. For instance - correlating weights with accuracy shouldn't take much time.
I think model mismatch is a crucial point. Even if there isn't time for a "broad exploration of hyperparameter space", there was time for a single example.
All of these requests quickly add up - but I'm a bit concerned that the authors didn't do any of them during the rebuttal period.
Furthermore - some of the results can qualitatively change my evaluation of the paper. If, for instance, the model fails completely once there is some mismatch (which is bound to exist in the data) - then it's hard to interpret the results of data analysis.
Given this state - I'm keeping my original score.
---------------------------------------------
1. Line 112. Is D defined by a linear regression?
2. Simulated data: is it similar to animal behavior? Are there statistics to compare?
3. L153: How many trials/mouse? I think this is more relevant than the total.
4. Figure 4A: “outlier excluded”. How can there be an outlier for a quantity that is between [0,1]?
5. Figure 4D: Why is RF_beta not there?
6. Figure 5, trajectory a. Is this from RF_beta? The text (L168) describes this as the “retrieved weight trajectory for the animal”, but the retrieval is for specific assumptions.
7. L172: “trajectories for RF1 and RFk look very similar”. How does this relate to Figure 5 b,c?

__ Summary and Contributions__: This paper proposes a method for modelling behavioural data in learning tasks. The paper is built upon the framework proposed in reference 19, and adds a learning element to the framework to enable deterministic changes in the learning weights.
================== post rebuttal ==============================
The authors have mentioned that they will add the results regarding the first concern below to the final version, which would be great, thank you!
For the second concern I still think the question of one vs two learning rates can be answered just by fitting the learning rules. Based on this I haven't changed the score.

__ Strengths__: The paper is well-prepared, clearly written and the results are interesting. The mathematical framework is also novel to my knowledge, and can inspire model developments in other areas of computational modelling of decision-making.

__ Weaknesses__: 1- Given the huge number of parameters in the model O(KT), it is important to see how the model performs compared to a baseline learning model *without* the noise. That would be, using ‘v’ in equation (1) instead of ‘w’ and comparing the obtained model in terms of AIC with the models presented in Figure 4c,d. From figure 3g it seems that the role of noise is negligible and there is not much benefit in using the full model beyond the learning rule (‘v’).
2- Whether the weights are updated using one or two learning rates could also be investigated by just fitting the learning rules (only using ‘v’ in equation 1) with different numbers of learning rates to the data and looking at their fit. Based on this, it is unclear to me what insight about the animals’ behaviour we have gained using this model that we couldn’t obtain from fitting just the learning rules to the data.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Except for the addition of the learning term (v), the computational framework is the same as the one in reference 19. This similarity, however, is NOT currently clear in the paper. I still think the contribution of the paper is sufficient, but it is essential that the authors be upfront about this similarity and clearly distinguish their contributions from those of ref 19.

__ Reproducibility__: Yes

__ Additional Feedback__: In Figure 6d, it seems that the positive gap between the learning component of bias and the inferred trajectory systematically increases over time. Does it imply that the model lacks a time-dependent \beta parameter to absorb this variance? (similar to the logic presented in lines 175-178).