NeurIPS 2020

Gibbs Sampling with People

Meta Review

This paper introduces a new method for eliciting human representations of perceptual concepts, such as what RGB values people think correspond to the color “sunset” or what auditory dimensions (e.g. pitch, duration, intensity) they think make a spoken sentence sound “happy” vs. “sad”. Rather than eliciting representations via guess-and-check (i.e., starting with a dataset and then applying human-generated labels), this method (Gibbs Sampling with People, or GSP) enables inference to go in the other direction (i.e., starting with labels and then identifying percepts that match those labels). GSP extends prior work (MCMC with People) to allow eliciting representations of much higher-dimensional stimuli.

The reviewers unanimously praised this paper for tackling an important and relevant problem in cognitive science, for its breadth of empirical results, and for its novelty over prior work. R2 stated that the paper is “impressive in scale, scope, and results”, R3 stated that it was “very relevant to the NeurIPS community and very novel”, and R4 felt there could be “a potentially large impact of this work” with “substantial interest” amongst the NeurIPS community. I agree this work is very exciting, novel, and impactful, both for cognitive scientists and for machine learning researchers interested in representation.

However, the paper does raise some cause for worry from an ethics standpoint. In eliciting percepts that match participants’ internal representations, GSP will naturally reproduce biases and stereotypes that those participants hold. Indeed, this can be seen in the results of one of the experiments in the paper, in which the method is used to elicit facial representations for attributes like “attractive”, “fun”, “intelligent”, etc.: the faces elicited are overwhelmingly white and reflect gender stereotypes, such as that men are “intelligent”. As such, this paper was flagged for ethics review and received three additional reviews, which I summarize here.
The opinions of the ethics reviewers varied substantially. ER3 felt that there was “potential concern regarding the perpetuation of racial and gender stereotypes” but that this was sufficiently addressed by the broader impact statement. ER1 noted that this work is a “very exciting approach for eliciting human semantic representations” but felt that the broader impact statement was insufficient and that “the risks coming from human and societal biases are understated, and I worry practitioners may build datasets or generative models that strongly encode these biases”. They requested three revisions (see below).

ER2 was concerned that “drawing inferences on emotional states of a person based on perceptual judgments of listeners as well as the assumption that social and psychological characteristics...are something that can be read off of faces has a dark history in physiognomy”. They were also concerned that the paper “fail[ed] to mention how [social stereotypes being perpetuated] might be mitigated or how in fact such work can be of a net positive value to society and specifically to the ethnically and racially marginalized”. ER2 argued for rejection, and while I disagree with this recommendation, I do think that their assessment highlights the fact that the paper does not state clearly enough what the method is meant for (e.g. as a tool for psychologists), what it is not meant for (e.g. generating datasets for ML models) and why, and what the experimental limitations are.

ER1's requested revisions (directly copied from their review):

---

1/ Make a much stronger and categorical assertion that the GSP (Gibbs sampling with people) method proposed will reflect individual and cultural biases. The Broader Impact Statement alludes to observed biases (towards men for 'intelligence' and towards women for 'beauty') but it doesn’t quantify them, nor explain what part of the bias is attributable to individuals vs. the composition of the dataset. There are very simple tests that can be done to test for both effects, and it seems at least that controlling for bias in the dataset should be a reasonable requirement.

2/ Call out the risk that this method will be used to generate datasets that reinforce and amplify existing stereotypes in society. What these stereotypes will look like will depend crucially on whom the “people” are in GSP. The paper uses Amazon Mechanical Turk to recruit people, but doesn’t delve into the limitations of this approach for recruiting diverse sets of “people.” It’s imperative to call out the stereotype and representation harms that can come from using non-diverse humans in GSP. My major concern is with stereotyping in the human faces example, but I’m also concerned about the music example. The fact that the distribution peaks at “prototypical sonorities from Western music” is another clear example of how biased the humans behind GSP can be, and the dangers of generating biased datasets.

3/ As a corollary of the two points above, call out the risks of using GSP for tuning the parameters or hyperparameters of generative deep nets. In the human faces example, there’s a direct connection to tuning the hyperparameters of a deep net for face generation, where if the input was set to “intelligent” you’d get more males. It’s imperative to demand that applications of GSP for tuning generative deep nets fully analyse the diversity of the humans behind GSP, and that they analyze their potential biases first.

---

I am recommending conditional acceptance of the paper, as I would like the paper to be revised for the camera ready in light of the ethics reviews.
In particular, I think the paper needs to do a better job at: (1) motivating the use case for this method and for the particular experiments from a cognitive science perspective (not just in the broader impacts but also throughout the paper); (2) making it clear that the elicited “representations are not speaking to intrinsic qualities of the stimuli but about the representations of the people in the study” (as suggested by R5); and (3) discussing ways in which GSP should not be used (such as generating datasets or fine-tuning ML models). I would also like the authors to (4) pay particular attention to (and address) R5’s suggestions in Q3 and Q8 and the suggestions from ER1 above.

**Please note that acceptance of the paper is conditional on these four changes being made in the camera-ready.**

*******************************

Note from Program Chairs: The camera-ready version of this paper has been reviewed with regard to the conditions listed above, and this paper is now fully accepted for publication.