NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1053
Title: Adaptive optimal training of animal behavior

Reviewer 1

Summary

The paper postulates that learning is accomplished by a policy gradient method and proposes to adaptively train animals by designing optimal stimulus sequences that drive the animal's internal model to a desired state in less time. The paper focuses on a specific example: a rat training experiment in which the animal must report which of two sequentially played auditory tones has the higher amplitude. The task is modeled with a time-varying logistic model whose weights diffuse from trial to trial as a Gaussian random walk. The weights are estimated via numerical MAP estimation, and simulations suggest the estimation method is reasonable. The paper then amends the model to include the animal's learning by modeling the trial-to-trial variation of the weights as a drift-diffusion process, with the gradient of the expected reward on the upcoming trial as the drift. The paper proposes selecting upcoming stimuli so as to drive the internal state to the desired state faster. Simulations (with no noise) demonstrate interesting phases in the learning: the algorithm first removes stickiness, then removes bias while holding stickiness at 0. There is also suggestive simulation evidence that utilizing the full stimulus space (from simple to difficult) could result in faster learning.
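To fix my own understanding, here is a minimal sketch of the generative model described above (my own code, not the authors'; the feature set, noise level, and all variable names are illustrative assumptions): a logistic choice model whose weights follow a Gaussian random walk across trials.

    # Sketch of a time-varying logistic choice model with random-walk weights.
    # Illustrative only; not the authors' code.
    import numpy as np

    rng = np.random.default_rng(0)
    n_trials, n_features = 500, 3   # e.g. bias, tone-amplitude difference, a history regressor
    sigma_drift = 0.05              # std of the trial-to-trial Gaussian random walk

    X = rng.normal(0, 1, (n_trials, n_features))   # illustrative regressors (real ones come from the task)
    X[:, 0] = 1.0                                  # bias column
    w = np.zeros((n_trials, n_features))
    choices = np.zeros(n_trials, dtype=int)

    for t in range(n_trials):
        if t > 0:
            w[t] = w[t - 1] + rng.normal(0, sigma_drift, n_features)   # purely stochastic weight evolution
        p_right = 1.0 / (1.0 + np.exp(-X[t] @ w[t]))                   # logistic psychometric function
        choices[t] = rng.random() < p_right

    # MAP estimation of the weight trajectory then maximises, over the whole stack of w_t,
    #   sum_t log Bernoulli(choice_t | sigmoid(x_t . w_t)) - sum_t ||w_t - w_{t-1}||^2 / (2 * sigma_drift**2),
    # with sigma_drift itself set by empirical Bayes.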

Qualitative Assessment

The paper presents an intriguing method for optimizing an experimental training procedure so as to maximize the speed at which animals reach a desired level of performance. While the concept is interesting, I feel there are several serious issues with the paper. One is that the algorithm relies on brute-force numerical estimation/optimization, which seems intractable for realistic experimental settings. The paper only presents simulation results from one simple experimental setting (2AFC auditory discrimination), and it assumes that subjects undergo policy-gradient-based learning. The experimental literature makes clear that animals can employ quite complex spatial and abstract representations for complex tasks, and it is not clear how the proposed framework generalizes to scenarios where multiple interacting parameters change over time as the animal learns -- or, more challengingly, where the nature of the internal representation fundamentally changes from model-free to model-based. More generally, it is not clear to me how the framework can cope with a model space that is not rich enough to begin with to capture what animals can learn and what their final learning destination is. The paper does consider possible history dependence in subjects' behavior, but with a growing sequence of stimuli the space of possible history dependencies grows exponentially, and it seems intractable to take this fully into account. Secondly, the goal of a learning experiment is often to see what the animal is capable of learning under "natural" conditions; if the experimental design is changing all the time, this may lead to misleading conclusions about what the animal is capable of learning on its own without a very specific training sequence. I think it is important to point out that the proposed framework is specifically designed for optimizing training rather than for finding out what the animal is "naturally" capable of learning. Finally, I would have liked to see at least one example of the method applied to real data, however simple the scenario, to show that it is practically useful in at least one sufficiently simple/well-controlled instance.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

The authors pursue two goals in this paper. First, they develop a technique to dynamically track the (time-varying) psychometric function of an animal while it learns a new task. Then they propose an algorithm to choose an optimal stimulus sequence that would most quickly drive the animal towards a desired (target) psychometric function. For simplicity, the focus of the paper is restricted to binary choice tasks in which the psychometric model is assumed to be described by a logistic function. The model is somewhat enriched by incorporating additional parameters (“stickiness coefficients”) to allow for the possible history-dependence in choice behavior typically observed in the laboratory.

The bulk of the results are presented in sections 3-5. In section 3, the authors develop a general framework to estimate time-varying parameters of the logistic model when the changes in parameter values are purely stochastic (follow a Gaussian random walk). This is achieved by first using an empirical Bayes approach to determine the hyperparameter that characterises the variability of the random walk, followed by standard MAP estimation of the model parameters for the given dataset. The approach is validated through simulations and applied to data collected from training rodents on an auditory task. This exercise is repeated in section 4, but now after adding a non-random component to the temporal dynamics of the model parameters to mimic the effect of learning. This deterministic component is assumed to push the model parameters along the gradient of the expected reward, with a learning rate that enters the model as a second hyperparameter. They call this the “RewardMax” model of learning. This section ends with a comparison of the two hyperparameters fit to behavioural data from rodents, and the authors argue that these results imply the rodents are indeed behaving in a regime where the effect of learning trumps that of random noise.

In section 5, the authors present an algorithm to select a stimulus sequence that drives the psychometric function of the trainee towards a desired shape as quickly as possible. This is achieved using an “AlignMax” algorithm in which the upcoming stimulus is chosen as the one that maximally aligns the expected change in model parameters with the direction of the target model parameters. A comparison of the evolution of model parameters under optimal and random stimulus sequences revealed that the former is indeed faster, albeit only marginally. The more remarkable observation (although strangely this is not emphasised) was that employing the optimal stimulus sequence completely eliminated bias and history-dependence, whereas random stimuli failed to achieve this even after several thousand training trials. The authors then highlight key features of the resulting optimal stimulus sequence, and explain how their algorithm exploits a large stimulus space to optimise training.
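For concreteness, here is my reading of the AlignMax selection rule under the RewardMax learning model, as a short sketch (my own code, not the authors'; the candidate stimulus grid, learning rate, and all names are illustrative assumptions): the next stimulus is the one whose predicted learning-induced weight update points most strongly towards the target weights.

    # Sketch of an AlignMax-style stimulus choice under a RewardMax (policy-gradient) drift.
    # Illustrative only; not the authors' implementation.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def expected_weight_update(x, correct_is_right, w, alpha):
        """Drift term: alpha * gradient of the expected reward wrt w for a 2AFC logistic policy."""
        p_right = sigmoid(x @ w)                   # probability that the model chooses 'right'
        sign = 1.0 if correct_is_right else -1.0   # the reward gradient flips sign with the rewarded side
        return alpha * sign * p_right * (1.0 - p_right) * x

    def alignmax_choice(candidates, w, w_star, alpha):
        """Return the candidate (stimulus, rewarded side) whose expected update best aligns with w_star - w."""
        direction = w_star - w
        scores = [expected_weight_update(x, right, w, alpha) @ direction for x, right in candidates]
        return candidates[int(np.argmax(scores))]

    # Illustrative usage: features = [bias, tone-amplitude difference]; target = unbiased, steep weights.
    w, w_star, alpha = np.array([1.0, 0.5]), np.array([0.0, 3.0]), 0.2
    candidates = [(np.array([1.0, d]), d > 0) for d in (-2, -1, -0.5, -0.1, 0.1, 0.5, 1, 2)]
    print(alignmax_choice(candidates, w, w_star, alpha))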

Qualitative Assessment

This is an innovative study that establishes a nice model-based method to optimize animal learning of a task. It addresses a simple task and a simple model, but this is challenging enough and a good foundation from which to build. The authors don’t yet apply this method to real experiments, but I’m sure that’s coming soon, and it’s not needed here to demonstrate their technical advance.
- There should be some discussion of the impact of the restriction to a memory-less model of reward (tau = 0).
- Figure 4: define success rate. Better explain why success is so low (because you’re giving essentially adversarial examples?). Why is expected reward worse than for the random stimulus presentation schedule?
- Figure 4: the performance measures are interesting, but how well does the model learn the desired weights (which is the stated goal of the method)?
- Is L the likelihood of the data or the likelihood of the weights?
- Figure 2: the dw of different realisations are not identical, but there is a strong correlation.
- Define “lazy fluke”.
- It would be appropriate to address/cite Christian Machens’ work on optimal stimulus ensembles (http://www.ncbi.nlm.nih.gov/pubmed/16055067) and, especially, Liam Paninski’s work on optimizing stimuli to maximize information about an underlying model (http://www.stat.columbia.edu/~liam/research/pubs/pbr.pdf).

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

This paper presents a technique for optimizing training to shape an animal's choice behavior. It is essentially an adaptive optimal experimental design technique applied to instrumental conditioning tasks. The authors show using simulations and analysis of rat choice data that the model and algorithm can be used effectively (though see my comments below).

Qualitative Assessment

The paper is strong from a technical point of view and very well-written. My main concerns are as follows:
1) The authors only consider a single learning algorithm (based on policy gradient). Yet many others have been considered in the instrumental learning literature. Model comparison would be particularly valuable here.
2) I don't understand why win-stay/lose-switch is considered incompatible with value-based learning algorithms.
3) There is no validation of the optimal training technique in a real experimental preparation. As this is the core of the paper, I think this is a big gap.
4) I appreciate that this is a novel application of optimal experimental design, but beyond that the main theoretical ideas are not novel.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 4

Summary

Assuming that a policy gradient learning algorithm describes the animal's internal learning rule, the algorithm's hyper-parameters are estimated from observed behavior. The authors provide an experimental design framework for selecting stimuli that will drive learning toward a desired location in the parameter space.

Qualitative Assessment

It is not novel to infer the learning rules underlying an animal’s behavioral changes during training; cf., e.g., the tutorial review by N. Daw (2011). Indeed, relating quantities from RL algorithms to fMRI signals is fairly common. Using the inferred algorithm for adaptive optimal training seems a nice idea; however, it remains unclear how an experimenter would choose the abstract goal weights w*. (E.g., simulate the reverse-engineered learning rule in silico with the stimulus-response associations the animal should learn, take the resulting weights as w*, and use these when actually training the animal?)
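A crude sketch of what I have in mind (my own code, following the paper's policy-gradient assumption; the task definition, learning rate, and stopping rule are illustrative assumptions, and the resulting w* is only defined up to how long one lets the weights grow):

    # Sketch: obtain candidate target weights w* by running the inferred (policy-gradient)
    # learning rule in silico on the stimulus-response mapping the animal is supposed to learn.
    # Illustrative only.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def simulate_target_weights(stimuli, rewarded_right, alpha=0.2, n_passes=200):
        """Gradient ascent on expected reward over the intended task, returning the final weights."""
        w = np.zeros(stimuli.shape[1])
        for _ in range(n_passes):
            for x, right in zip(stimuli, rewarded_right):
                p = sigmoid(x @ w)
                sign = 1.0 if right else -1.0
                w += alpha * sign * p * (1.0 - p) * x   # gradient of expected reward for this trial type
        return w

    # Intended task: choose 'right' whenever the tone-amplitude difference is positive.
    diffs = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
    stimuli = np.column_stack([np.ones_like(diffs), diffs])   # [bias, amplitude difference]
    w_star = simulate_target_weights(stimuli, diffs > 0)
    print(w_star)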

Confidence in this Review

1-Less confident (might not have understood significant parts)


Reviewer 5

Summary

This paper proposes a new way to design experiments for animal behavioural training. It makes two main contributions: 1) it derives a policy-gradient setup to quantify animal choice behaviour during training; 2) it shows how to optimise the stimulus statistics online in order to steer an animal's policy towards a target policy. Contribution 1 is applied to experimental data and yields estimates consistent with known animal behaviour. Contribution 2 is tested on synthetic data for now, but provides really important insights into ways to improve on current animal-training techniques.

Qualitative Assessment

[EDIT: I have read the authors’ feedback and stick with my assessment: it’s a nice paper, addressing a good question in the appropriate setting, with a scope aligned to NIPS. The animal-experiment results will be needed, but they should be addressed in another venue building on this paper.]

I really liked this paper. The problem is clearly relevant (animal training is a complex art) and the paper is theoretically strong, novel in its approach, and touches on all the important points. The math is clear (perhaps a bit too didactic and verbose at times, e.g. Section 3.1). The model is impressively clear and well treated, and it uses best practices in Bayesian statistics: MAP estimates plus model evidence for hyperparameters, BIC comparisons, appropriate priors. Treating the trial-by-trial evolution as a drift-diffusion process is interesting, but it is really the way the authors then introduce learning, in a policy-gradient setting, as the drift term that makes it great. That part in itself would be a nice publication, and the fits to animal behaviour in Figure 3 were consistent with prior knowledge and interesting.

The second contribution, where the inputs are manipulated to bring the psychometric function closer to a target policy, was great as well. The form of equation 15 might need a bit more explanation, or might be derived through other means, but it makes a lot of sense. I really want to see actual animal training using the optimal training procedure, though I assume this will come in a later paper (and if it succeeds, it would easily promote this paper to high-impact neuroscience journals, as I think the authors are well aware). I wanted to make the point about a low success rate affecting animals' motivation, but the authors already address it in the conclusion, so I assume they will try to tackle it.

I assume the authors are aware of this literature, but similar ideas have been tried (although not exactly comparable) for neurophysiology:
- http://pillowlab.princeton.edu/pubs/Pillow2016_ActiveLearningChap.pdf
- http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3674314/pdf/fncir-07-00101.pdf
- http://www.stat.columbia.edu/~liam/research/pubs/lewi-nc08.pdf
In this literature, the metric optimised to choose stimuli was the mutual information between the data and the model parameters. Similar ideas might help provide more principled algorithms to replace the simple alignment update proposed here (see the sketch at the end of this review).

The only weakness of the proposed approach is the simplicity of the decision function. Moving away from simple logistic regression with linear interactions will prove complex if the authors want to keep a similar Bayesian treatment throughout. For simple behaviours and trivial inputs this won't be an issue, but more complex tasks may not have simple psychometric functions of the kind currently handled by this paper.
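To make that suggestion concrete, here is a rough sketch of an information-based stimulus choice for a logistic model (my own code, in the spirit of the references above but not taken from them or from the manuscript; the posterior samples and candidate grid are illustrative assumptions): pick the stimulus that maximises the mutual information between the next binary response and the model weights, approximated with posterior samples.

    # Sketch of infomax-style stimulus selection for a Bernoulli/logistic observation model.
    # Illustrative only; not from the cited papers or the manuscript.
    import numpy as np

    def bernoulli_entropy(p, eps=1e-12):
        p = np.clip(p, eps, 1 - eps)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))

    def infomax_choice(candidate_stimuli, weight_samples):
        """Pick x maximising I(y; w | x) = H[mean_w p(y|x,w)] - mean_w H[p(y|x,w)]."""
        best_x, best_gain = None, -np.inf
        for x in candidate_stimuli:
            p = 1.0 / (1.0 + np.exp(-(weight_samples @ x)))   # p(y=1 | x, w) for each posterior sample
            gain = bernoulli_entropy(p.mean()) - bernoulli_entropy(p).mean()
            if gain > best_gain:
                best_x, best_gain = x, gain
        return best_x

    # Illustrative usage with a fake posterior over [bias, slope] weights.
    rng = np.random.default_rng(1)
    samples = rng.normal([0.5, 2.0], [0.5, 1.0], size=(200, 2))
    stimuli = [np.array([1.0, d]) for d in (-2, -1, -0.5, -0.1, 0.1, 0.5, 1, 2)]
    print(infomax_choice(stimuli, samples))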

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 6

Summary

The authors propose a methodology for animal experiments in which the experiment is adapted during training in order to infer the animal's internal learning rule from its behaviour. They postulate that animals follow a policy-gradient type of learning, and use this assumption to give evidence for the efficacy of the training protocol.

Qualitative Assessment

- This paper has a high standard of clarity and presentation, and the research has been carried out well by people with high expertise in machine learning.
- The authors propose a novel approach to experimental design for investigating animal behaviour and learning. While this is obviously a positive in terms of novelty, I do have reservations about how well the paper will engage experimentalists, because the style of the paper is directed at a core audience in machine learning.
- For this reason, the statement 'immense practical benefits to neuroscientists tasked with training animals' seems to me an overstatement, but I can see why the authors would want to say it to 'sell' their work.

Confidence in this Review

1-Less confident (might not have understood significant parts)