{"title": "Shape and Material from Sound", "book": "Advances in Neural Information Processing Systems", "page_first": 1278, "page_last": 1288, "abstract": "Hearing an object falling onto the ground, humans can recover rich information including its rough shape, material, and falling height. In this paper, we build machines to approximate such competency. We first mimic human knowledge of the physical world by building an efficient, physics-based simulation engine. Then, we present an analysis-by-synthesis approach to infer properties of the falling object. We further accelerate the process by learning a mapping from a sound wave to object properties, and using the predicted values to initialize the inference. This mapping can be viewed as an approximation of human commonsense learned from past experience. Our model performs well on both synthetic audio clips and real recordings without requiring any annotated data. We conduct behavior studies to compare human responses with ours on estimating object shape, material, and falling height from sound. Our model achieves near-human performance.", "full_text": "Shape and Material from Sound\n\nZhoutong Zhang\n\nMIT\n\nQiujia Li\n\nUniversity of Cambridge\n\nZhengjia Huang\n\nShanghaiTech University\n\nJiajun Wu\n\nMIT\n\nJoshua B. Tenenbaum\n\nMIT\n\nWilliam T. Freeman\nMIT, Google Research\n\nAbstract\n\nHearing an object falling onto the ground, humans can recover rich information\nincluding its rough shape, material, and falling height. In this paper, we build\nmachines to approximate such competency. We \ufb01rst mimic human knowledge of\nthe physical world by building an ef\ufb01cient, physics-based simulation engine. Then,\nwe present an analysis-by-synthesis approach to infer properties of the falling\nobject. We further accelerate the process by learning a mapping from a sound wave\nto object properties, and using the predicted values to initialize the inference. 
This mapping can be viewed as an approximation of human commonsense learned from past experience. Our model performs well on both synthetic audio clips and real recordings without requiring any annotated data. We conduct behavior studies to compare human responses with ours on estimating object shape, material, and falling height from sound. Our model achieves near-human performance.\n\n1 Introduction\n\nFrom a short audio clip of interacting objects, humans can recover the number of objects involved, as well as their materials and surface smoothness [Zwicker and Fastl, 2013, Kunkler-Peck and Turvey, 2000, Siegel et al., 2014]. How does our cognitive system recover so much content from so little? What is the role of past experience in understanding auditory data?\nFor physical scene understanding from visual input, recent behavioral and computational studies suggest that human judgments can be well explained as approximate, probabilistic simulations of a mental physics engine [Battaglia et al., 2013, Sanborn et al., 2013]. These studies suggest that the brain encodes rich, but noisy, knowledge of physical properties of objects and basic laws of physical interactions between objects. To understand, reason about, and make predictions for a physical scene, humans seem to rely on simulations from this mental physics engine.\nIn this paper, we develop a computational system to interpret audio clips of falling objects, inspired by the idea that humans may use a physics engine as part of a generative model to understand the physical world. Our generative model has three components. The first is an object representation that includes the object's 3D shape, position in space, and physical properties such as mass, Young's modulus, Rayleigh damping coefficients, and restitution. We aim to infer all these attributes from auditory inputs.\nThe second component is an efficient, physics-based audio synthesis engine. 
Given an initial scene setup and object properties, the engine simulates the object's motion and generates its trajectory using rigid body physics. It also produces the corresponding collision profile — when, where, and how collisions happen. The object's trajectory and collision profile are then combined with its pre-computed sound statistics to generate the sound it makes during the physical event. With this efficient forward model, we can then infer object properties using analysis-by-synthesis; for each audio clip, we want to find a set of latent variables that best reproduce it.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Given an audio clip of a single object falling, we utilize our generative model to infer latent variables that could best reproduce the sound.\n\nThe third component of the model is therefore a likelihood function that measures the perceptual distance between two sounds. Designing such a likelihood function is typically challenging; however, we observe that features like the spectrogram are effective when latent variables have limited degrees of freedom. This motivates us to infer latent variables via methods like Gibbs sampling, where we focus on approximating the conditional probability of a single variable given the others.\nThe inference procedure can be further accelerated with a self-supervised learning paradigm inspired by the wake/sleep phases in Helmholtz machines [Dayan et al., 1995]. We train a deep neural network as the recognition model to regress object properties from sound, where training data are generated using our inference algorithm. Then, for any future audio clip, the output of the recognition model can be used as a good initialization for the sampling algorithm to converge faster.\nWe evaluate our models on a range of perception tasks: inferring object shape, material, and initial height from sound. 
We also collect human responses for each task and compare them with model estimates. Our results indicate that, first, humans are quite successful in these tasks; second, our model not only closely matches human successes, but also makes errors similar to those humans make. For these quantitative evaluations, we have mostly used synthetic data, where ground truth labels are available. We further evaluate the model on recordings to demonstrate that it also performs well on real-world audio.\nWe make three contributions in this paper. First, we propose a novel model for estimating physical object properties from auditory inputs by incorporating the feedback of a physics engine and an audio engine into the inference process. Second, we incorporate a deep recognition network with the generative model for more efficient inference. Third, we evaluate our model and compare it to humans on a variety of judgment tasks, and demonstrate the correlation between human responses and model estimates.\n\n2 Related Work\n\nHuman visual and auditory perception Psychoacoustics researchers have explored how humans can infer object properties, including shape, material and size, from audio over the past decades [Zwicker and Fastl, 2013, Kunkler-Peck and Turvey, 2000, Rocchesso and Fontana, 2003, Klatzky et al., 2000, Siegel et al., 2014]. Recently, McDermott et al. [2013] proposed compact sound representations that capture semantic information and are informative of human auditory perception.\n\nSound simulation Our sound synthesis engine builds upon and extends existing sound simulation systems in computer graphics and computer vision [O'Brien et al., 2001, 2002, James et al., 2006, Bonneel et al., 2008, Van den Doel and Pai, 1998, Zhang et al., 2017]. Van den Doel and Pai [1998] simulated object vibration using the finite element method and approximated the vibrating object as a single point source. O'Brien et al. 
[2001, 2002] used the Rayleigh method to approximate wave equation solutions for better synthesis quality. James et al. [2006] proposed to solve Helmholtz\n\nFigure 2: Our inference pipeline. We use Gibbs sampling over the latent variables. The conditional probability is approximated using the likelihood between the reconstructed sound and the input sound.\n\nequations using the Boundary Element Method, where each object's vibration mode is approximated by a set of vibrating points. Recently, Zhang et al. [2017] built a framework for synthesizing large-scale audio-visual data. In this paper, we accelerate the framework by Zhang et al. [2017] to achieve near real-time rendering, and explore learning object representations from sound with the synthesis engine in the loop.\n\nPhysical Object Perception There has been a growing interest in understanding physical object properties, like mass and friction, from visual input or scene dynamics [Chang et al., 2017, Battaglia et al., 2016, Wu et al., 2015, 2016, 2017]. Much of the existing research has focused on inferring object properties from visual data. Recently, researchers have begun to explore learning object representations from sound. Owens et al. [2016a] attempted to infer material properties from audio, focusing on the scenario of hitting objects with a drumstick. Owens et al. [2016b] further demonstrated that audio signals can be used as supervision for learning object concepts from visual data, and Aytar et al. [2016] proposed to learn sound representations from corresponding video frames. Zhang et al. [2017] discussed the complementary role of auditory and visual data in recovering both geometric and physical object properties. 
In this paper, we learn physical object representations from audio through a combination of powerful deep recognition models and analysis-by-synthesis inference methods.\n\nAnalysis-by-synthesis Our framework also relates to the field of analysis-by-synthesis, or generative models with data-driven proposals [Yuille and Kersten, 2006, Zhu and Mumford, 2007, Wu et al., 2015], as we are incorporating a graphics engine as a black-box synthesizer. Unlike earlier methods that focus mostly on explaining visual data, our work aims to infer latent parameters from auditory data. Please refer to Bever and Poeppel [2010] for a review of analysis-by-synthesis methods.\n\n3 An Efficient, Physics-Based Audio Engine\n\nAt the core of our inference pipeline is an efficient audio synthesis engine. In this section, we first give a brief overview of existing synthesis engines, and then present our technical innovations on accelerating them for real-time rendering in our inference algorithm.\n\n3.1 Audio Synthesis Engine\n\nAudio synthesis engines generate realistic sound by simulating physics. First, rigid body simulation produces the interaction between an object and the environment, where Newton's laws dictate the object's motion and collisions over time. Each collision causes the object to vibrate in certain patterns, changing the air pressure around its surface. These vibrations propagate in air to the recorder and create the sound of this physical process.\n\nSettings | Time (s)\nOriginal algorithm | 30.4\nAmplitude cutoff | 24.5\nPrincipal modes | 12.7\nMulti-threading | 1.5\nAll | 0.8\n\nFigure 3: Our 1D deep convolutional network. Its architecture follows that in Aytar et al. 
[2016], where raw audio waves are forwarded through consecutive conv-pool layers, and then passed to a fully connected layer to produce the output.\n\nTable 1: Acceleration breakdown for each technique we adopted. Timing is evaluated by synthesizing an audio clip with 200 collisions. The last row reports the final timing after adopting all techniques.\n\nRigid Body Simulation Given an object's 3D position and orientation, and its mass and restitution, a physics engine can simulate the physical processes and output the object's position, orientation, and collision information over time. Our implementation uses an open-source physics engine, Bullet [Coumans, 2010]. We use a time step of 1/300 second to ensure simulation accuracy. At each time step, we record the 3D pose and position of the object, as well as the location, magnitude, and direction of collisions. The sound made by the object can then be approximated by accumulating the sounds caused by those discrete impulse collisions on its surface.\n\nAudio Synthesis The audio synthesis procedure is built upon previous work on simulating realistic sounds [James et al., 2006, Bonneel et al., 2008, O'Brien et al., 2001]. To facilitate fast synthesis, this process is decomposed into two modules, one offline and one online. The offline part first uses finite element methods (FEM) to obtain the object's vibration modes, which depend on the shape and Young's modulus of the object. These vibration modes are then used as Neumann boundary conditions of the Helmholtz equation, which can be solved using boundary element methods (BEM). We use the techniques proposed by James et al. [2006] to approximate the solution by modeling the pressure fields with a sparse set of vibrating points. Note that the computation above only depends on the object's intrinsic properties such as shape and Young's modulus, but not on extrinsics such as its position and velocity. 
This allows us to pre-compute a number of shape-modulus configurations before simulation; only minimal computation is needed during the online simulation.\nThe online part of the audio engine loads pre-computed approximations and decomposes impulses on the surface mesh of the object into its modal bases. At the observation point, the engine measures the pressure changes induced by vibrations in each mode, and sums them up to produce the simulated sound. An evaluation of the fidelity of these simulations can be found in Zhang et al. [2017].\n\n3.2 Accelerating Audio Synthesis\n\nAnalysis-by-synthesis inference requires the audio engine to be highly efficient; however, a straightforward implementation of the above simulation procedure would be computationally expensive. We therefore present technical innovations to accelerate the computation to near real-time.\nFirst, we select the most significant modes excited by each impulse until their total energy reaches 90% of the energy of the impulse. Ignoring sound components generated by the less significant modes reduces the computational time by about 50%. Second, we stop the synthesis process if the amplitude of the damped sound goes below a certain threshold, since it is unlikely to be heard. Third, we parallelize the synthesis process by tackling collisions separately, so that each can be computed on an independent thread. We then integrate them into a shared buffer to generate the final audio according to their timestamps. The effect of acceleration is shown in Table 1. Online sound synthesis only involves variables that are fully decoupled from the offline stage, which enables us to freely manipulate these variables with little computational cost during simulation.\n\n3.3 Generating Stimuli\n\nBecause real audio recordings with rich labels are hard to acquire, we synthesize random audio clips using our physics-based simulation to evaluate our models. 
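As an illustration of the two signal-level cutoffs described in Section 3.2 (keeping the most significant modes up to 90% of the impulse energy, and truncating once the damped envelope falls below an audibility threshold), here is a minimal sketch of per-collision modal synthesis. The parameter names are our own, and the mode frequencies, gains, and damping rates are assumed to come from the offline FEM/BEM stage; this is not the authors' implementation.

```python
import numpy as np

def synthesize_impulse(freqs, gains, dampings, sr=44100,
                       energy_keep=0.90, amp_floor=1e-4):
    """Sum damped sinusoids for one collision, with the two cutoffs
    described in Section 3.2 (hypothetical parameter names)."""
    # Keep the most significant modes until they carry 90% of the energy.
    order = np.argsort(gains)[::-1]
    energy = np.cumsum(gains[order] ** 2)
    idx = int(np.searchsorted(energy, energy_keep * energy[-1]))
    keep = order[: idx + 1]

    # Truncate once the envelope falls below the audibility floor; the
    # slowest-decaying kept mode determines the overall duration.
    dur = max((np.log(g / amp_floor) / d
               for g, d in zip(gains[keep], dampings[keep]) if g > amp_floor),
              default=0.01)
    t = np.arange(int(dur * sr)) / sr
    out = np.zeros_like(t)
    for f, g, d in zip(freqs[keep], gains[keep], dampings[keep]):
        out += g * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return out
```

In the full engine, one such waveform per collision would be written into a shared buffer at the collision's timestamp, which is what makes the per-collision multi-threading straightforward.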
Specifically, we focus on a single scenario — shape primitives falling onto the ground. We first construct an audio dataset that includes 14 primitives (some shown in Table 2), each with 10 different specific moduli (defined as Young's modulus over density). After pre-computing their shape-modulus configurations, we can generate synthetic audio clips in a near-real-time fashion. Because the process of objects falling onto the ground is relatively fast, we set the total simulation time of each scenario to 3 seconds. Details of our setup can be found in Table 2.\n\nVariable | Range | C/D\nPrimitive shape (s) | 14 classes | D\nHeight (z) | [1, 2] | C\nRotation axis (i, j, k) | S^2 | C\nRayleigh damping (α) | 10^[-8, -5] | C\nSpecific modulus (E/ρ) | [1, 30] × 10^6 | D\nRestitution (e) | [0.6, 0.9] | C\nRotation angle (ω) | [-π, π) | C\nRayleigh damping (β) | 2^[0, 5] | C\n\nTable 2: Variables in our generative model. The C/D column indicates whether sampling takes place in a continuous (C) or discrete (D) domain; each range is sampled uniformly. Rotation is defined in quaternions.\n\n4 Inference\n\nIn this section, we investigate four models for inferring object properties, each corresponding to a different training condition. Inspired by how humans can infer scene information using a mental physics engine [Battaglia et al., 2013, Sanborn et al., 2013], we start from an unsupervised model where the input is only one single test case with no annotation. We adopt Gibbs sampling over latent variables to find the combination that best reproduces the given audio.\nWe then extend the model to include a deep neural network, analogous to what humans may learn from their past experience. 
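For concreteness, drawing one latent configuration from the ranges in Table 2 can be sketched as below. The dictionary keys are our own names, and the assumption that the 10 discrete specific-modulus values are evenly spaced over [1, 30] × 10^6 is ours (the paper only states that there are 10 of them).

```python
import numpy as np

rng = np.random.default_rng(0)
MODULI = 1e6 * np.linspace(1.0, 30.0, 10)   # assumed spacing of the 10 values

def sample_latents():
    """Draw one latent configuration from the ranges in Table 2
    (shape and specific modulus are discrete; the rest continuous)."""
    axis = rng.normal(size=3)                # uniform direction on S^2
    axis /= np.linalg.norm(axis)
    return {
        "shape": int(rng.integers(14)),      # 14 primitive classes
        "specific_modulus": float(MODULI[rng.integers(10)]),
        "height": float(rng.uniform(1.0, 2.0)),
        "restitution": float(rng.uniform(0.6, 0.9)),
        "rotation_axis": axis,
        "rotation_angle": float(rng.uniform(-np.pi, np.pi)),
        "rayleigh_alpha": float(10 ** rng.uniform(-8.0, -5.0)),
        "rayleigh_beta": float(2 ** rng.uniform(0.0, 5.0)),
    }
```

Note that the two damping coefficients are sampled uniformly in the exponent, matching the 10^[-8, -5] and 2^[0, 5] ranges in Table 2.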
The network is trained using labels inferred by the unsupervised model. During inference, the sampling algorithm uses the network prediction as the initialization. This self-supervised learning paradigm speeds up convergence.\nWe also investigate a third case, where labels can be acquired but are extremely coarse. We first train a recognition model with weak labels, then randomly pick candidates from those labels as an initialization for our analysis-by-synthesis inference.\nLastly, to understand performance limits, we train a deep neural network with fully labeled data that yields the upper-bound performance.\n\n4.1 Models\n\nUnsupervised Given an audio clip S, we would like to recover the latent variables x that make the reproduced sound g(x) most similar to S. Let L(·, ·) be a likelihood function that measures the perceptual distance between two sounds; our goal is then to maximize L(g(x), S), which we denote as p(x) for brevity. In order to find the x that maximizes p(x), p(x) can be treated as a distribution $\hat{p}(x)$ scaled by an unknown partition function Z. Since we do not have an exact form for p(·) or $\hat{p}(x)$, we apply Gibbs sampling to draw samples from p(x). Specifically, at sweep round t, we update each variable x_i by drawing samples from\n\n$\hat{p}(x_i \mid x_1^t, x_2^t, \dots, x_{i-1}^t, x_{i+1}^{t-1}, \dots, x_n^{t-1})$.    (1)\n\nSuch conditional probabilities are straightforward to approximate. For example, to sample Young's modulus conditioned on the other variables, we can use the spectrogram as a feature and measure the l2 distance between the spectrograms of two sounds, because Young's modulus only affects the frequency at each collision. Indeed, we can use the spectrogram as a feature for all variables except height. 
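As one concrete instance of such a feature distance, the spectrogram-based l2 comparison can be sketched with SciPy as below. The window settings follow the sampling setup given in Section 4.2; the log compression and the cropping of both spectrograms to a common length are our own simplifications.

```python
import numpy as np
from scipy.signal import spectrogram

def spectrogram_distance(sound_a, sound_b, sr=44100):
    """l2 distance between log-spectrograms: a sketch of the feature
    used to approximate the Gibbs conditionals for all variables
    except height."""
    def feat(x):
        # Tukey window of length 5,000, 2,000-sample overlap,
        # 10,000-point FFT, as in Section 4.2.
        _, _, s = spectrogram(x, fs=sr, window=("tukey", 0.25),
                              nperseg=5000, noverlap=2000, nfft=10000)
        return np.log1p(s)
    fa, fb = feat(sound_a), feat(sound_b)
    n = min(fa.shape[1], fb.shape[1])    # crop to the shorter clip
    return float(np.linalg.norm(fa[:, :n] - fb[:, :n]))
```

A smaller distance corresponds to a higher likelihood; the sampler only ever compares such distances, so no partition function is needed.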
Since the height can be inferred from the time of the first collision, a simple likelihood function can be designed to measure the time difference between the first impacts in two sounds. Note that this is only an approximate measure: the object's shape and orientation also affect, although only slightly, the time of first impact.\nTo sample from the conditional probabilities, we adopt the Metropolis–Hastings algorithm, where samples are drawn from a Gaussian distribution and are accepted by flipping a biased coin according to their likelihood compared with the previous sample. Specifically, we calculate the l2 distance d_t in feature space between g(x_t) and S. For a new sample x_{t+1}, we also calculate the l2 distance d_{t+1} in feature space between g(x_{t+1}) and S. The new sample is accepted if d_{t+1} is smaller than d_t; otherwise, x_{t+1} is accepted with probability exp(-(d_{t+1} - d_t)/T), where T is a time-varying function inspired by the simulated annealing algorithm. In our implementation, T is set as a quadratic function of the current MCMC sweep number t.\n\nSelf-supervised Learning To accelerate the above sampling process, we propose a self-supervised model, which is analogous to a Helmholtz machine trained by the wake-sleep algorithm. We first train a deep neural network whose labels are generated by running the unsupervised inference model above for a limited number of iterations. For a new audio clip, our self-supervised model uses the result from the neural network as an initialization, and then runs our analysis-by-synthesis algorithm to refine the inference. By making use of the past experience which trained the network, the sampling process starts from a better position and requires fewer iterations to converge than the unsupervised model.\n\nWeakly-supervised Learning We further investigate the case where weak supervision might be helpful for accelerating the inference process. 
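The accept/reject step of this annealed Metropolis–Hastings update can be sketched in a few lines. The `propose` and `distance` callables are hypothetical stand-ins for the Gaussian proposal and the feature-space l2 distance to the target sound; the paper states only that T is a quadratic function of the sweep number, so we leave T as an argument rather than inventing a schedule.

```python
import math
import random

def mh_step(x_curr, d_curr, propose, distance, T):
    """One Metropolis-Hastings update as described above: accept if the
    feature distance decreases, otherwise with prob. exp(-(d' - d)/T)."""
    x_new = propose(x_curr)        # e.g. Gaussian proposal around x_curr
    d_new = distance(x_new)        # l2 distance in feature space to S
    if d_new < d_curr or random.random() < math.exp(-(d_new - d_curr) / T):
        return x_new, d_new
    return x_curr, d_curr
```

With a large T early on, worse proposals are accepted often (exploration); as T shrinks over sweeps, the chain settles around low-distance, high-likelihood configurations.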
Since the latent variables we aim to recover are hard to obtain in real-world settings, it is more realistic to assume that we could acquire very coarse labels, such as the type of material, rough attributes of the object's shape, the height of the fall, etc. Based on such assumptions, we coarsen the ground truth labels for all variables. For our primitive shapes, three attributes are defined, namely "with edge," "with curved surface," and "pointy." The material parameters, i.e., specific modulus, Rayleigh damping coefficients, and restitution, are mapped to steel, ceramic, polystyrene, and wood by finding the nearest neighbor among the parameters of these real materials. Height is divided into "low" and "high" categories. A deep convolutional neural network is trained on our synthesized dataset with coarse labels. As shown in Figure 4, even when trained using coarse labels, our network learns features very similar to the ones learned by the fully supervised network. To go beyond coarse labels, the unsupervised model is then applied using the initialization suggested by the neural network.\n\nFully-supervised Learning To explore the performance upper bound of the inference tasks, we train an oracle model with ground truth labels. To visualize the abstraction and characteristic features learned by the oracle model, we plot the inputs that maximally activate some hidden units in the last layer of the network. Figure 4 illustrates some of the most interesting waveforms. A selection of them learned to recognize specific temporal patterns, and others were sensitive to specific frequencies. Similar patterns were found in the weakly and fully supervised models.\n\n4.2 Contrasting Model Performance\n\nWe evaluate how well our model performs under different settings, studying how past experience or coarse labeling can improve the unsupervised results. 
We first present the implementation details of all four models, then compare their results on all inference tasks.\n\nSampling Setup We perform 80 sweeps of MCMC sampling over all 7 latent variables; in every sweep, each variable is sampled twice. Shape, specific modulus, and rotation are sampled from uniform distributions across their corresponding dimensions. For the other continuous variables, we define an auxiliary Gaussian variable $x_i \sim N(\mu_i, \sigma_i^2)$ for sampling, where the mean $\mu_i$ is based on the current state. To evaluate the likelihood function between the input and the sampled audio (both with a sample rate of 44.1 kHz), we compute the spectrogram of the signal using a Tukey window of length 5,000 with a 2,000 sample overlap. For each window, a 10,000-point Fourier transform is applied.\n\nDeep Learning Setup Our fully supervised and self-supervised recognition models use the architecture of SoundNet-8 [Aytar et al., 2016], as shown in Figure 3, which takes an arbitrarily long raw audio wave as input and produces a 1024-dim feature vector. We append to that a fully connected layer to produce a 28-dim vector as the final output of the neural network. The first 14 dimensions are the one-hot encoding of primitive shapes and the next 10 dimensions are encodings of the specific modulus. The last 4 dimensions regress the initial height, the two Rayleigh damping coefficients and\n\nFigure 4: Visualization of the top two sound waves that activate a hidden unit most significantly, in the temporal and spectral domains. Their common characteristics reflect the values of some latent variables, e.g., Rayleigh damping, restitution, and specific modulus from left to right. 
Both weakly and fully supervised models capture similar features.\n\nInference Model | | shape | mod. | height | α | β\nUnsupervised | initial | 10% | 8% | 0.179 | 0.144 | 0.161\nUnsupervised | final | 56% | 54% | 0.003 | 0.069 | 0.173\nSelf-supervised | initial | 16% | 14% | 0.060 | 0.092 | 0.096\nSelf-supervised | final | 62% | 52% | 0.005 | 0.061 | 0.117\nWeakly supervised | initial | 12% | 18% | 0.018 | 0.077 | 0.095\nWeakly supervised | final | 62% | 66% | 0.005 | 0.055 | 0.153\nFully supervised | final | 98% | 100% | 0.001 | 0.001 | 0.051\n\nTable 3: Initial and final classification accuracies (as percentages) and parameter MSE errors of the different inference models after 80 iterations of MCMC. Initial unsupervised numbers indicate chance performance. Results from the fully supervised model show performance bounds. α and β are Rayleigh damping coefficients.\n\nthe restitution, respectively. All the regression dimensions are normalized to a [-1, 1] range. The weakly supervised model preserves the structure of the fully supervised one, but with an 8-dim final output: 3 for shape attributes, 1 for height, and 4 for materials. We used stochastic gradient descent for training, with a learning rate of 0.001, a momentum of 0.9, and a batch size of 16. Mean Squared Error (MSE) loss is used for back-propagation. We implemented our framework in Torch7 [Collobert et al., 2011], and trained all models from scratch.\n\nResults Results for the four inference models proposed above are shown in Table 3. For shape and specific modulus, we evaluate the results as classification accuracies; for height, Rayleigh damping coefficients, and restitution, results are evaluated as MSE. 
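The 28-dimensional output layout described above can be unpacked as follows. This is a sketch of the encoding with our own names, not the authors' code; the regressed values are the normalized [-1, 1] quantities, which would still need to be mapped back to their physical ranges.

```python
import numpy as np

def decode_output(y):
    """Split the 28-dim network output into its parts: 14-way shape
    one-hot, 10-way specific-modulus one-hot, then four regressed
    values (height, two Rayleigh damping coefficients, restitution)."""
    assert y.shape == (28,)
    shape = int(np.argmax(y[:14]))       # one-hot over 14 primitives
    modulus = int(np.argmax(y[14:24]))   # one-hot over 10 specific moduli
    height, alpha, beta, restitution = (float(v) for v in y[24:])
    return shape, modulus, height, alpha, beta, restitution
```

The weakly supervised variant would decode analogously, but from an 8-dim vector (3 shape attributes, 1 height, 4 materials).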
Before calculating MSE, we normalize the values of each latent variable to the [-1, 1] interval, so that the MSE score is comparable across variables. From Table 3, we can conclude that the self-supervised and weakly supervised models benefit from the better initialization to the analysis-by-synthesis algorithm, especially on the last four continuous latent variables. One can also observe that final inference accuracies and MSEs are marginally better than in the unsupervised case. To illustrate the rate of convergence, we plot the likelihood value, exp(-kd), where d is the distance between sound features, along iterations of MCMC in Figure 5. The mean curve of the self-supervised model meets our expectation, i.e., it converges much faster than the unsupervised model, and reaches a slightly higher likelihood at the end of 30 iterations. The fully supervised model, which is trained on 200,000 audio clips with the full set of ground-truth labels, yields near-perfect results for all latent variables.\n\nFigure 5: Left and middle: confusion matrices of material classification by humans and by our unsupervised model. Right: mean likelihood curve over MCMC iterations.\n\nFigure 6: Human performance and unsupervised performance comparison. The horizontal line represents human performance for each task. Our algorithm closely matches human performance.\n\n5 Evaluations\n\nWe first evaluate our inference procedure by comparing its performance with that of humans. The evaluation is conducted using synthetic audio with ground truth labels. Then, we investigate whether our inference algorithm performs well on real-world recordings. Given recorded audio, our algorithm can distinguish the shape from a set of candidates.\n\n5.1 Human Studies\n\nWe seek to evaluate our model relative to human performance. 
We designed three tasks for our subjects: inferring the object's shape, material, and falling height from sound, all attributes that humans perceive intuitively when hearing an object fall. These tasks are designed as classification problems, where the labels are in accordance with the coarse labels used by our weakly-supervised model. The study was conducted on Amazon Mechanical Turk. For each experiment (shape, material, height), we randomly selected 52 test cases. Before answering test questions, each subject was shown 4 training examples with ground truth to become familiar with the setup. We collected 192 responses for the experiment on inferring shape, 566 for material, and 492 for height, resulting in a total of 1,250 responses.\n\nInferring Shapes After becoming familiar with the experiment, participants are asked to make three binary judgments about the shape by listening to our synthesized audio clips. Prior examples are given so that people understand the distinctions between the "with edge," "with curved surface," and "pointy" attributes. As shown in Figure 6, humans are relatively good at recognizing shape attributes from sound, and the unsupervised algorithm reaches around the same level of competency after 10–30 iterations.\n\nInferring Materials We sampled audio clips whose physical properties – density, Young's modulus, and damping coefficients – are in the vicinity of the true parameters of steel, ceramic, polystyrene, and wood. Participants are required to choose one out of four possible materials. However, it can still be challenging to distinguish between materials, especially when the sampled ones have similar damping and specific modulus. 
Our algorithm occasionally confuses steel with ceramic, and ceramic with polystyrene, which is in accordance with human performance, as shown in Figure 5.\n\nFigure 7: Results of inference on real-world data. The test recording is made by dropping the metal die in (a). Our inferred shape and reproduced sound are shown in (b). Likelihood over iterations is plotted in (c).\n\nInferring Heights In this task, we ask participants to choose whether the object is dropped from a high position or a low one. We provided example videos and audio clips to help people anchor the reference height. Under our scene setup, the touchdown times of the two extremes of the height range differ by 0.2s. To address the potential bias that algorithms may be better at exploiting falling time, we took several precautions. First, we explicitly told participants that the silence at the beginning of each clip is informative. Second, we made sure that the anchoring example was always available during the test, so participants could compare against and refer to it. Third, participants had to play each test clip manually, and therefore had control over when the audio began. Last, we tested on different object shapes. 
Because the time of first impact is shape-dependent, differently shaped objects dropped from the same height have first impacts at different times, making this cue harder for the machine to exploit.

5.2 Transferring to Real Scenes

In addition to the synthetic data, we designed real-world experiments to test our unsupervised model. We select three candidate shapes: tetrahedron, octahedron, and dodecahedron. We record the sound of a metal octahedron dropping onto a table and use our unsupervised model to recover the latent variables. Because real-world scenarios may introduce highly complex factors that cannot be exactly modeled in our simulation, a more robust feature and metric are needed. For every audio clip, we use its temporal energy distribution as the feature, derived from the spectrogram using a window of 2,000 samples with a 1,500-sample overlap. We then use the earth mover's distance (EMD) [Rubner et al., 2000] as the metric, a natural choice for measuring distances between distributions.

The inference result is illustrated in Figure 7. Using the energy distribution with the EMD metric, our generated sound aligns its energy at major collision events with the real audio, which greatly reduces ambiguity among the three candidate shapes. We also plot the normalized likelihood over iterations to show that our sampling has converged to highly probable samples.

6 Conclusion

In this paper, we propose a novel model for estimating physical properties of objects from auditory inputs by incorporating the feedback of an efficient audio synthesis engine in the loop. We demonstrate the possibility of accelerating inference with fast recognition models. We compare our model predictions with human responses on a variety of judgment tasks and demonstrate the correlation between human responses and model estimates.
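The temporal-energy feature and EMD metric of Section 5.2 can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the authors' implementation: function names are ours, the per-window energy is summed directly from waveform samples rather than from the spectrogram (equivalent totals up to windowing, by Parseval's theorem), and the 1-D EMD is evaluated as the L1 distance between cumulative distributions.

```python
import numpy as np

def temporal_energy_distribution(wave, win=2000, overlap=1500):
    # Energy per analysis window (2,000 samples with 1,500-sample overlap, the
    # values reported in the paper), normalized into a distribution over time.
    # NOTE: energy is summed from raw samples here; the paper derives it from
    # the spectrogram.
    hop = win - overlap
    n = max(1, (len(wave) - win) // hop + 1)
    energy = np.array([np.sum(wave[i * hop:i * hop + win] ** 2)
                       for i in range(n)])
    return energy / energy.sum()

def emd_1d(p, q):
    # Earth mover's distance between two 1-D distributions on the same uniform
    # grid: the L1 distance between their cumulative distribution functions.
    n = max(len(p), len(q))
    p = np.pad(p, (0, n - len(p)))
    q = np.pad(q, (0, n - len(q)))
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())
```

During inference, each candidate shape's synthesized clip would be compared to the recording via emd_1d of their energy distributions; a lower distance indicates better alignment of the major collision events.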
We also show that our model generalizes to real recordings.

Acknowledgements

The authors would like to thank Changxi Zheng, Eitan Grinspun, and Josh H. McDermott for helpful discussions. This work is supported by NSF #1212849 and #1447476, ONR MURI N00014-16-1-2007, Toyota Research Institute, Samsung, Shell, and the Center for Brains, Minds and Machines (NSF STC award CCF-1231216).

References

Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In NIPS, 2016.

Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding. PNAS, 110(45):18327–18332, 2013.

Peter W Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In NIPS, 2016.

Thomas G Bever and David Poeppel. Analysis by synthesis: a (re-)emerging program of research for language and vision. Biolinguistics, 4(2-3):174–200, 2010.

Nicolas Bonneel, George Drettakis, Nicolas Tsingos, Isabelle Viaud-Delmon, and Doug James. Fast modal sounds with scalable frequency-domain synthesis. ACM TOG, 27(3):24, 2008.

Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. In ICLR, 2017.

Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

Erwin Coumans. Bullet physics engine. Open-source software: http://bulletphysics.org, 2010.

Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Comput., 7(5):889–904, 1995.
Doug L James, Jernej Barbič, and Dinesh K Pai. Precomputed acoustic transfer: output-sensitive, accurate sound generation for geometrically complex vibration sources. ACM TOG, 25(3):987–995, 2006.

Roberta L Klatzky, Dinesh K Pai, and Eric P Krotkov. Perception of material from contact sounds. Presence: Teleoperators and Virtual Environments, 9(4):399–410, 2000.

Andrew J Kunkler-Peck and MT Turvey. Hearing shape. J. Exp. Psychol. Hum. Percept. Perform., 26(1):279, 2000.

Josh H McDermott, Michael Schemitsch, and Eero P Simoncelli. Summary statistics in auditory perception. Nat. Neurosci., 16(4):493–498, 2013.

James F O'Brien, Perry R Cook, and Georg Essl. Synthesizing sounds from physically based motion. In SIGGRAPH, 2001.

James F O'Brien, Chen Shen, and Christine M Gatchalian. Synthesizing sounds from rigid-body simulations. In SCA, 2002.

Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. In CVPR, 2016a.

Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016b.

Davide Rocchesso and Federico Fontana. The sounding object. Mondo Estremo, 2003.

Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover's distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.

Adam N Sanborn, Vikash K Mansinghka, and Thomas L Griffiths. Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychol. Rev., 120(2):411, 2013.

Max Siegel, Rachel Magid, Joshua B Tenenbaum, and Laura Schulz. Black boxes: Hypothesis testing via indirect perceptual evidence. In CogSci, 2014.

Kees Van den Doel and Dinesh K Pai. The sounds of physical shapes.
Presence: Teleoperators and Virtual Environments, 7(4):382–395, 1998.

Jiajun Wu, Ilker Yildirim, Joseph J Lim, William T Freeman, and Joshua B Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, 2015.

Jiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning physical object properties from unlabeled videos. In BMVC, 2016.

Jiajun Wu, Erika Lu, Pushmeet Kohli, William T Freeman, and Joshua B Tenenbaum. Learning to see physics via visual de-animation. In NIPS, 2017.

Alan Yuille and Daniel Kersten. Vision as Bayesian inference: analysis by synthesis? TiCS, 10(7):301–308, 2006.

Zhoutong Zhang, Jiajun Wu, Qiujia Li, Zhengjia Huang, James Traer, Josh H. McDermott, Joshua B. Tenenbaum, and William T. Freeman. Generative modeling of audible shapes for object perception. In ICCV, 2017.

Song-Chun Zhu and David Mumford. A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision, 2(4):259–362, 2007.

Eberhard Zwicker and Hugo Fastl. Psychoacoustics: Facts and models, volume 22. Springer Science & Business Media, 2013.