{"title": "Learning to Play With Intrinsically-Motivated, Self-Aware Agents", "book": "Advances in Neural Information Processing Systems", "page_first": 8388, "page_last": 8399, "abstract": "Infants are experts at playing, with an amazing ability to generate novel structured behaviors in unstructured environments that lack clear extrinsic reward signals. We seek to mathematically formalize these abilities using a neural network that implements curiosity-driven intrinsic motivation.  Using a simple but ecologically naturalistic simulated environment in which an agent can move and interact with objects it sees, we propose a \"world-model\" network that learns to predict the dynamic consequences of the agent's actions.  Simultaneously, we train a separate explicit \"self-model\" that allows the agent to track the error map of its world-model. It then uses the self-model to adversarially challenge the developing world-model. We demonstrate that this policy causes the agent to explore novel and informative interactions with its environment, leading to the generation of a spectrum of complex behaviors, including ego-motion prediction, object attention, and object gathering.  Moreover, the world-model that the agent learns supports improved performance on object dynamics prediction, detection, localization and recognition tasks.  Taken together, our results are initial steps toward creating flexible autonomous agents that self-supervise in realistic physical environments.", "full_text": "Learning to Play With Intrinsically-Motivated,\n\nSelf-Aware Agents\n\nNick Haber1,2,3,\u21e4, Damian Mrowca4,\u21e4, Stephanie Wang4 , Li Fei-Fei4 , and\n\nDaniel L. K. 
Yamins1,4,5\n\nDepartments of Psychology1, Pediatrics2, Biomedical Data Science3, Computer Science4, and Wu\n\nTsai Neurosciences Institute5, Stanford, CA 94305\n\n{nhaber, mrowca}@stanford.edu\n\nAbstract\n\nInfants are experts at playing, with an amazing ability to generate novel structured\nbehaviors in unstructured environments that lack clear extrinsic reward signals.\nWe seek to mathematically formalize these abilities using a neural network that\nimplements curiosity-driven intrinsic motivation. Using a simple but ecologically\nnaturalistic simulated environment in which an agent can move and interact with\nobjects it sees, we propose a \u201cworld-model\u201d network that learns to predict the\ndynamic consequences of the agent\u2019s actions. Simultaneously, we train a separate\nexplicit \u201cself-model\u201d that allows the agent to track the error map of its world-\nmodel.\nIt then uses the self-model to adversarially challenge the developing\nworld-model. We demonstrate that this policy causes the agent to explore novel\nand informative interactions with its environment, leading to the generation of a\nspectrum of complex behaviors, including ego-motion prediction, object attention,\nand object gathering. Moreover, the world-model that the agent learns supports\nimproved performance on object dynamics prediction, detection, localization and\nrecognition tasks. Taken together, our results are initial steps toward creating\n\ufb02exible autonomous agents that self-supervise in realistic physical environments.\n\nIntroduction\n\n1\nTruly autonomous arti\ufb01cial agents must be able to discover useful behaviors in complex environments\nwithout having humans present to constantly pre-specify tasks and rewards. This ability is beyond\nthat of today\u2019s most advanced autonomous robots. 
In contrast, human infants exhibit a wide range of interesting, apparently spontaneous, visuo-motor behaviors — including navigating their environment, seeking out and attending to novel objects, and engaging physically with these objects in novel and surprising ways [4, 9, 13, 15, 20, 21, 44]. In short, young children are excellent at playing — “scientists in the crib” [13] who create, intentionally, events that are new, informative, and exciting to them [9, 42]. Aside from being fun, play behaviors are an active learning process [40], driving self-supervised learning of representations underlying sensory judgments and motor planning [4, 15, 24]. But how can we use these observations on infant play to improve artificial intelligence? AI theorists have long realized that playful behavior in the absence of rewards can be mathematically formalized via loss functions encoding intrinsic reward signals, in which an agent chooses actions that result in novel but predictable states that maximize its learning [38]. These ideas rely on a virtuous cycle in which the agent actively self-curricularizes as it pushes the boundaries of what its world-model-prediction systems can achieve. As world-modeling capacity improves, what used to be novel becomes old hat, and the cycle starts again.\n\n*Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: 3D Physical Environment. The agent can move around, apply forces to visible objects in close proximity, and receive visual input.\n\nHere, we build on these ideas using the tools of deep reinforcement learning to create an artificial agent that learns to play. We construct a simulated physical environment inspired by infant play rooms, in which an agent can swivel its head, move around, and physically act on nearby visible objects (Fig. 1). 
Akin to challenging video game tasks [26], informative interactions in this environment are possible, but sparse unless actively sought by the agent. However, unlike most video game or constrained robotics environments, there is no extrinsic goal to constrain the agent's action policy. The agent has to learn about its world, and what is interesting in it, for itself.\n\nIn this environment, we describe a neural network architecture with two interacting components, a world-model and a self-model, which are learned simultaneously. The world-model seeks to predict the consequences of the agent's actions, either through forward or inverse dynamics estimation. The self-model learns explicitly to predict the errors of the world-model. The agent then uses the self-model to choose actions that it believes will adversarially challenge the current state of its world-model.\n\nOur core result is the demonstration that this intrinsically-motivated, self-aware architecture stably engages in a virtuous reinforcement learning cycle, spontaneously discovering highly nontrivial cognitive behaviors — first understanding and controlling self-generated motion of the agent (“ego-motion”), and then selectively paying attention to, and eventually organizing, objects. This learning occurs through an emergent active self-supervised process in which new capacities arise at distinct “developmental milestones” like those in human infants. Crucially, it also learns visual encodings with substantially improved transfer to key visual scene understanding tasks such as object detection, localization, and recognition, and learns to predict physical dynamics better than a number of strong baselines. This is to our knowledge the first demonstration of the efficacy of active learning of a deep visual encoding for a complex three-dimensional environment in a purely self-supervised setting. 
Our results are steps toward mathematically well-motivated, flexible autonomous agents that use intrinsic motivation to learn about and spontaneously generate useful behaviors for real-world physical environments.\n\nRelated Work. Our work connects to a variety of existing ideas in self-supervision, active learning, and deep reinforcement learning. Visual learning can be achieved through self-supervised auxiliary tasks including semantic segmentation [18], pose estimation [29], solving jigsaw puzzles [32], colorization [46], and rotation [43]. Self-supervision on video frame prediction [23] is also promising, but faces the challenge that most sequences in recorded videos are “boring”, with little interesting dynamics occurring from one frame to the next.\n\nIn order to encourage interesting events to happen, it is useful for an agent to have the capacity to select the data that it sees in training. In active learning, an agent seeks to learn a supervised task using minimal labeled data [12, 40]. Recent methods obtain diversified sets of hard examples [8, 39], or use confidence-based heuristics to determine when to query for more labels [45]. Beyond selection of examples from a pre-determined set, recent work in robotics [2, 7, 10, 36] studies learning tasks with interactive visuo-motor setups such as robotic arms. The results are promising, but largely use random policies to generate training data without biasing the robot to explore in a structured way. Intrinsic and extrinsic reward structures have been used to learn generic “skills” for a variety of tasks [6, 28, 41]. Houthooft et al. [19] demonstrated that reasonable exploration-exploitation trade-offs can be achieved by intrinsic reward terms formulated as information gain. Frank et al. [11] use information gain maximization to implement artificial curiosity on a humanoid robot. Kulkarni et al. 
[26] combine intrinsic motivation with hierarchical action-value functions operating at different temporal scales, for goal-driven deep reinforcement learning. Achiam and Sastry [1] formulate surprise for intrinsic motivation as the KL-divergence of the true transition probabilities from learned model probabilities. Held et al. [16] use a generator network, optimized via adversarial training to produce tasks that are always at the appropriate level of difficulty for an agent, to automatically produce a curriculum of navigation tasks to learn. Jaderberg et al. [22] show that target tasks can be improved by using auxiliary intrinsic rewards.\n\nOudeyer and colleagues [14, 33, 34] have explored formalizations of curiosity as maximizing prediction-ability change, showing the emergence of interesting realistic cognitive behaviors from simple intrinsic motivations. Unlike this work, we use deep neural networks to learn the world-model and generate action choices, and co-train the world-model and self-model, rather than pre-training the world-model on a separate prediction task and then freezing it before instituting the curious exploration policy. Pathak et al. [35] use curiosity to antagonize a future prediction signal in the latent space of an inverse dynamics prediction task to improve learning in video games, showing that intrinsic motivation leads to faster floor-plan exploration in a 2D game environment. Our work differs in using a physically realistic three-dimensional environment and shows how intrinsic motivation can lead to substantially more sophisticated agent-object behavior generation (the “playing”). Underlying this difference in technical approach is our introduction of a self-model network, representing the agent's awareness of its own internal state. 
This difference can be viewed in RL terms as the use of a more explicit model-based architecture in place of a model-free setup. Unlike previous work, we show the learned representation transfers to improved performance on analogs of real-world visual tasks, such as object localization and recognition. To our knowledge, a self-supervised setup in which an explicitly self-modeling agent uses intrinsic motivation to learn about and restructure its environment has not been explored prior to this work.\n\n2 Environment and Architecture\n\nInteractive Physical Environment. Our agent is situated in a physically realistic simulated environment (black in Fig. 2) built in Unity 3D (Fig. 1). Objects in the environment interact according to Newtonian physics as simulated by the PhysX engine [5]. The agent's avatar is a sphere that swivels in place, moves around, and receives RGB images from a forward-facing camera (as in Fig. 1). The agent can apply forces and torques in all three dimensions to any objects that are both in view and within a fixed maximum interaction distance δ of the agent's position. We say that such an object is in a play state, and that a state with such an object is a play state. Although the floor and walls of the environment are static, the agent and objects can collide with them. The agent's action space is a subset of R^(2+6N). The first 2 dimensions specify ego-motion, restricting agent movement to forward/backward motion v_fwd and horizontal planar rotation v_θ, while the remaining 6N dimensions specify the forces f_x, f_y, f_z and torques τ_x, τ_y, τ_z applied to N objects, sorted from the lower-leftmost to the upper-rightmost object relative to the agent's field of view. All coordinates are bounded by constants and normalized to be within [−1, 1] for input into models and losses. 
In this setup, both the observation space (images from the 3D rendering) and action space (ego-motion and object force application) are continuous and high-dimensional, complicating the challenges of learning the visual encoding and action policy.\n\nAgent Architecture. Our agent consists of two simultaneously-learned components: a world-model and a self-model (Fig. 2). The world-model seeks to solve one or more dynamics prediction problems based on inputs from the environment. The self-model seeks to estimate the world-model's losses for several timesteps into the future, as a function both of visual input and of potential agent actions. An action choice policy based on the self-model's output chooses actions that are “interesting” to the world-model. In this work, we choose perhaps the simplest such motivational mechanism, using policies that try to maximize the world-model's loss. In part as a review of the key issues of prediction error-based curiosity [33–35, 38], we now formalize these ideas mathematically.\n\nWorld-Model: At the core of our architecture is the world-model — i.e. the neural network that attempts to learn how dynamics in the agent's environment work, especially in reaction to the agent's own actions. Finding the right dynamics prediction problem(s) to set as the agent's world-modeling goal is a nontrivial challenge.\n\nConsider a partially observable Markov Decision Process (POMDP) with state s_t, observation o_t, and action a_t. In our agent's situation, s_t is the complete information of object positions, extents, and velocities at time t; o_t is the images rendered by the agent-mounted camera; and a_t is the agent's applied ego-motion, forces and torques vector. The rules of physics are the dynamics which generate s_{t+1} from s_t and a_t. Agents make decisions about what action to take at each time, accumulating histories of observations and actions. 
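The action parametrization above is concrete enough to sketch in code. The layout below is illustrative only: the function name, the clipping scheme, and the example values are our assumptions, not the paper's implementation.

```python
import numpy as np

def pack_action(v_fwd, v_theta, object_forces):
    """Pack ego-motion and per-object interactions into one action vector.

    v_fwd, v_theta: forward/backward motion and horizontal planar rotation.
    object_forces: N tuples (fx, fy, fz, tau_x, tau_y, tau_z), sorted from
    the lower-leftmost to the upper-rightmost object in the field of view.
    All coordinates are assumed already normalized; we clip to [-1, 1].
    """
    flat = [v_fwd, v_theta]
    for entry in object_forces:
        flat.extend(entry)                      # 6 dimensions per object
    action = np.clip(np.asarray(flat, dtype=np.float32), -1.0, 1.0)
    assert action.shape == (2 + 6 * len(object_forces),)
    return action

a = pack_action(0.5, -0.1, [(0.2, 0.0, 0.1, 0.0, 0.0, 0.3)])  # N = 1
```

With N = 1 object this yields an 8-dimensional vector, matching the 2 + 6N dimensionality above.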
Informally, a dynamics prediction problem is a pairing of complementary subsets of data — “inputs” and “outputs” — generated from the history. The goal of the agent is to learn a map from inputs to outputs. More precisely, adopting the notation o_{t1:t2} = (o_{t1}, o_{t1+1}, ..., o_{t2}) and similarly for actions and states, we let D (the “data”) be fixed-time-length segments of history {d_t = (o_{t−b:t+f}, a_{t−b:t+f}) | t = 1, 2, 3, ...}. A dynamics prediction problem (Figure 3) is then defined by specifying (possibly time-varying) maps ι_t: D → In and τ_t: D → Out for some specified input and output spaces In and Out, forming a triangular diagram.\n\nFigure 2: Intrinsically-motivated self-aware agent architecture. The world-model (blue) solves a dynamics prediction problem. Simultaneously a self-model (red) seeks to predict the world-model's loss. Actions are chosen to antagonize the world-model, leading to novel and surprising events in the environment (black). (a) Environment-agent loop. (b) Agent information flow.\n\nFigure 3: Diagramming dynamics prediction problems.\n\nAlso given as part of the dynamics prediction problem is a loss L for comparing ground-truth versus estimated outputs. The agent's world-model at time t is a map ω_{θ_t}: In → Out whose parameters are updated by stochastic gradient descent in order to lower L. In words, the agent's world-model (blue in Fig. 2) tries to learn to reconstruct the true-value from the input datum. Note that batches of data on which this update occurs are not drawn from any fixed distribution, since they come from the history of an agent as it executes its policy; hence this learning process does not correspond to a traditional statistical learning optimization. Since we are focused on agents learning from an environment without external input, the maps ι and τ should in general be easy for the agent to estimate at low cost from its “sense data” — what is sometimes called self-supervision [18, 23, 29, 32, 43, 46]. For example, perhaps the most natural dynamics problem to assign to the agent as the goal of its world-model is forward dynamics prediction, with input (o_{t−b:t}, a_{t−b:t+f−1}) and true-value o_{t+1:t+f}. In words, the agent is trying to predict the next (several) observation(s) given past observations and a sequence (past and present) of actions. In 3-D physical domains such as ours, the outputs correspond to f bitmap image arrays of future frames, and the loss function L_F may be ℓ2 loss on pixels or some discretization thereof. Despite recent progress on the frame prediction problem [10, 23], it remains quite challenging, in part because the dimensionality of the true-value space is so large. In practice, it can be substantially easier to solve inverse dynamics prediction, with input (o_{t−b:t+f}, a_{t−b:t−1}, a_{t+1:t+f−1}) and true-value a_t. In other words, the agent is trying to “post-dict” the action needed to have generated the observed sequence of observations, given knowledge of its past and future actions. Here, the loss function L_ID is computed on (what is generally the comparatively low-dimensional) action space, a problem that has proven tractable [2, 3].\n\nOne major concern in intrinsic motivation, in particular when the agent's policy attempts to maximize the world-model's loss, is when the dynamics prediction problem is inherently unpredictable. This is sometimes referred to (perhaps in less generality than in what we proceed to define) as the white-noise problem (Pathak et al. [35], Schmidhuber [38]). In cases where the agent's policy attempts to maximize the world-model's loss, the agent is motivated to fixate on the unlearnable. Within the above framework, this problem manifests in that there is no requirement that ι_t and τ_t actually induce a well-defined mapping In → Out that makes the diagram above commute. We refer to the existence of policies for which there are obstructions to such a commuting diagram, with nonzero probability, as degeneracy in the dynamics prediction problem. In fact, the inverse dynamics problem can suffer from substantial degeneracy. Consider the case of an agent pressing an object straight into the ground: no matter what the downward force is, the object does not move, so the vision and action input information is insufficient to determine the true-value.\n\nFigure 4: Single-object experiments. (a) World-model training loss. (b) Percentage of frames in which an object is present. (c) World-model test-set loss on “easy” ego-motion-only data, with no objects present. (d) World-model test-set loss on “hard” validation data, with object present, where agent must solve object physics prediction. Validation datasets are evaluated every 1000 batch update steps. Lighter curves represent unsmoothed batch-mean values — min and max for validation.\n\nTo avoid both of these pixel space and degeneracy difficulties, one can instead try forward dynamics prediction, but in a latent space — for example, the latent space determined by an encoder for the inverse dynamics problem [35]. In this case, we begin with a system solving the inverse dynamics prediction problem and assume that its parametrization of the world-model factors into a composition ω^ID_{θ_t} = d^ID_{β_t} ∘ e^ID_{α_t}, where α_t and β_t are non-overlapping sets of parameters. We call e^ID_{α_t} the encoding, and the range of e^ID_{α_t} the latent space L of the ID problem. On top, we define the (time-varying) 1-time-step future prediction problem on trajectories in L given by the time-varying encoding, i.e. by ι^LF_t(d_t) = (e^ID_{α_t}(o_{t−b:t}), a_{t−b:0}) and τ^LF_t(d_t) = e^ID_{α_t}(o_{t+1}). The problem is then supervised by ℓ2 loss. The inverse-prediction world-model ω^ID and latent-space world-model ω^LF evolve simultaneously. 
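To make the two pairings concrete, here is a minimal sketch of how inverse-dynamics and latent-future training pairs could be cut from a history segment d_t = (o_{t-b:t+f}, a_{t-b:t+f}); the tensor shapes and the toy channel-mean encoder standing in for the learned e^ID are our assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
b, f = 2, 1                        # past/future extents of a history segment
# A segment d_t = (o_{t-b:t+f}, a_{t-b:t+f}); index b marks time t.
obs = [rng.random((8, 8, 3)) for _ in range(b + f + 1)]
acts = [rng.random(8) for _ in range(b + f + 1)]
T = b

# Inverse dynamics: inputs are all observations plus past and future actions;
# the true-value is the present action a_t.
id_input = (obs, acts[:T] + acts[T + 1:])
id_target = acts[T]

def encode(o):
    """Toy stand-in for the learned encoder e^ID (here: channel means)."""
    return o.mean(axis=(0, 1))

# Latent future prediction: encode past observations, predict the next latent
# code; supervised by l2 loss (a naive "persistence" prediction shown here).
lf_input = ([encode(o) for o in obs[:T + 1]], acts[:T + 1])
lf_target = encode(obs[T + 1])
lf_loss = float(np.sum((encode(obs[T]) - lf_target) ** 2))
```

Note how the inverse-dynamics target lives in the low-dimensional action space, while the latent-future target lives in the encoder's (here 3-dimensional) latent space rather than pixel space.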
If L is sufficiently low dimensional, this may be a good compromise task that represents only “essential” features for prediction. In this work, we explore both inverse dynamics and latent space future prediction tasks.\n\nExplicit Self-Model: In the strategy outlined above, the agent's action policy goal is to antagonize its world-model. If the agent explicitly predicts its own world-model loss L_{ω_t} incurred at future timesteps as a function of visual input and current action, an antagonistic policy can simply seek to maximize L_{ω_t} over some number of future timesteps. Embodying this idea, the self-model Λ (red in Fig. 2) is given o_{t−1:t} and a proposed next action a, and predicts Λ_{o_{t−1:t}}(a) = (p_1(c | o_{t−1:t}, a), ..., p_T(c | o_{t−1:t}, a)), where p_i(c | o_{t−1:t}, a) is the probability that the loss incurred by the world-model at time t + i will equal c. For convenience of optimization, we discretize the losses into loss bins C, so that each p_i ∈ P(C) is a probability distribution over discrete classes c ∈ C. Λ_{o_{t−1:t}}(a) is penalized with a softmax cross-entropy loss for each i and averaged over i ∈ {1, ..., T}. All future losses aside from the first one depend on future actions taken, and the self-model hence needs to predict in expectation over policy. Each p_i(c | o_{t−1:t}, a) can be interpreted as a map over action space, which turns out to be useful for intuitively visualizing what strategy the agent is taking in any given situation (see Fig. 5).\n\nAdversarial Action Choice Policy: The self-model provides, given o_{t−1:t} and a proposed next action a, T probability distributions p_i. The agent uses a simple mechanism to convert this data to an action choice. To summarize loss map predictions over times t ∈ {1, ..., 
T}, we add expectation values:\n\nΛ(a)[o_{t−1:t}] = Σ_i Σ_{c∈C} c · p_i(c).\n\nThe agent's action policy is then given by sampling with respect to a Boltzmann distribution π(a | o_{t−1:t}) ∼ exp(β(Λ_{o_{t−1:t}}(a))) with fixed hyperparameter β.\n\nArchitectures and Losses: We use convolutional neural networks to learn both world-models ω_θ and self-models Λ. In the experiments described below, these have an encoding structure with a common architecture involving twelve convolutional layers, two-stride max pools every other layer, and one fully-connected layer, to encode observations into a lower-dimensional latent space, with shared weights across time. For the inverse dynamics task, the top encoding layer of the network is combined with actions {a_{t′} | t′ ≠ t}, fed into a two-layer fully-connected network, on top of which a softmax classifier is used to predict action a_t. For the latent space future prediction task, the top convolutional layer of ω^ID_{θ_ID} is used as the latent space L, and the latent model ω^LF_{θ_LF} is parametrized by a fully-connected network that receives, in addition to past encoded images, past actions. In the ID-only case (ID-SP), we optimize min_{θ_ID} L_ID + min L_Λ,ID. In the LF case (LF-SP), we optimize min_{θ_ID} L_ID + min_{θ_LF} L_LF + min L_Λ,LF. See supplementary for details.\n\n3 Experiments\n\nWe randomly place the agent in a square 10 by 10 meter room, together with up to two other objects with which the agent can interact, setting the maximum interaction distance δ to 2 meters. 
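The adversarial action-choice mechanism described above (expected loss over discretized bins, then Boltzmann sampling) admits a compact sketch. The bin values, the horizon T = 5, the value of β, the candidate count, and the random stand-in for the learned self-model are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
bins = np.array([0.05, 0.15, 0.25, 0.35, 0.45])  # centers of the loss bins C
beta = 5.0                                       # Boltzmann hyperparameter

def expected_loss(p):
    """Sum over horizon i of E[c] under each per-timestep distribution p_i.

    p: array of shape (T, |C|); rows are the self-model's p_i(c | o, a)."""
    return float((p @ bins).sum())

def self_model(action):
    """Placeholder for the learned self-model: random bin distributions."""
    return rng.dirichlet(np.ones(len(bins)), size=5)   # horizon T = 5

# Score uniformly sampled candidate actions, then Boltzmann-sample one.
candidates = rng.uniform(-1.0, 1.0, size=(1000, 8))
scores = np.array([expected_loss(self_model(a)) for a in candidates])
logits = beta * scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()
chosen = candidates[rng.choice(len(candidates), p=probs)]
```

With a trained self-model in place of the random stand-in, actions predicted to incur high world-model loss receive exponentially higher sampling probability.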
The objects are drawn from a set of 16 distinct geometric shapes, e.g. cones, cylinders, cuboids, pyramids, and spheroids of various aspect ratios. Training is performed using 16 asynchronous simulation instances [30], with different seeds and objects. The scene is reinitialized periodically, with time of reset randomly chosen between 2^13 and 2^15 steps. Each simulation maintains a data buffer of 250 timesteps to ensure stable training [27]. For model updates, two examples are randomly sampled from each of the 16 simulation buffers to form a batch of size 32. Gradient updates are performed using the Adam algorithm [25], with an initial learning rate of 0.0001. See the supplement for tests of the stability of all results to variations in interaction radius δ, room size, and agent speed, as well as per-object-type behavioral breakdowns.\n\nFor each experiment, we evaluate the agents' abilities with three types of metrics. We first measure the (i) spontaneous emergence of novel behaviors, involving the appearance of highly structured but non-preprogrammed events such as the agent attending to and acting upon objects (rather than just performing mere self-motion), engaging in directed navigation trajectories, or causing interactions between multiple objects. Finding such emergent behaviors indicates that the curiosity-driven policies generate qualitatively novel scenarios in which the agent can push the boundaries of its world-model. For each agent type, we also evaluate (ii) improvements in dynamic task prediction in the agents' world-models, on challenging held-out validation data constructed to test learning about both ego-motion dynamics and object physical interactions. Finding such improvements indicates that the data gleaned from the novel scenarios uncovered by intrinsic motivation actually does improve the agents' world-modeling capacities. 
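The sampling scheme used for model updates (two examples drawn from each of the 16 simulation buffers per batch) can be sketched as follows; the buffer records are placeholders, not actual training data.

```python
import random
from collections import deque

NUM_SIMS, BUFFER_LEN, PER_SIM = 16, 250, 2   # as in the training setup

# One rolling buffer of recent (observation, action) records per simulation.
buffers = [deque(maxlen=BUFFER_LEN) for _ in range(NUM_SIMS)]
for sim_id, buf in enumerate(buffers):
    for step in range(BUFFER_LEN):
        buf.append({"sim": sim_id, "step": step})   # placeholder records

def sample_batch():
    """Two random examples from each of the 16 buffers -> batch of 32."""
    batch = []
    for buf in buffers:
        batch.extend(random.sample(list(buf), PER_SIM))
    return batch

batch = sample_batch()
```

Drawing evenly across instances keeps every simulation represented in each gradient step, which is one way such asynchronous setups stay stable.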
Finally, we also evaluate (iii) task transfer, the ability of the visual encoding features learned by the curious agents to serve as a general basis for other useful visual tasks, such as object recognition and detection.\n\nControl models: In addition to the two curious agents, we study several ablated models as controls. ID-RP is an ablation of ID-SP in which the world-model trains but the agent executes a random policy, used to demonstrate the difference an active policy makes in world-model performance and encoding. IDRW-SP is an ablation of ID-SP in which the policy is executed as above but with the encoding portion of the world-model frozen with random weights. This control measures the importance of having the action policy inform the deep internal layers of the world-model network. IDRW-RP combines both ablations.\n\n3.1 Emergent behaviors\n\nUsing metrics inspired by the developmental psychology literature, we quantify the appearance of novel structured behaviors, including attention to and acting on objects, navigation and planning, and ability to interact with multiple objects. In addition to sharp stage-like transitions in world-model loss and self-model evaluations, to quantify these behaviors we measure play state frequency and (in the case of multiple objects) the average distance between the agent and objects. We compute these quantities by averaging play state count and distance between objects, respectively, over the three simulation steps per batch update. Quantities presented below are aggregates over all 16 simulation instances unless otherwise specified.\n\nObject attention. Fig. 4a shows the total training loss curves of the ID-SP, LF-SP models and baselines. Upon initialization, all agents start with behaviors indistinguishable from the random policy, choosing largely self-motion actions and rarely interacting with objects. 
For learned-weight agents, an initial loss decrease occurs due to learning of ego-motion, as seen in Fig. 4a. For the curious agents, this initial phase is robustly succeeded by a second phase in which loss increases. As shown in Fig. 4b, this loss increase corresponds to the emergence of object attention, in which the agent dramatically increases the play state frequency.\n\nFigure 5: Navigation and planning behavior. Example model roll-out for 12 consecutive timesteps (t through t+11). Red force vectors on the objects depict the actions predicted to maximize world-model loss. Ego-motion self-prediction maps are drawn at the center of the agent's position. Red colors correspond to high and blue colors to low loss predictions. The agent starts without seeing an object and predicts higher loss if it turns around to explore for an object. The self-model predicts higher loss if the agent approaches a faraway object or turns towards a close object to keep it in view.\n\nAs seen by comparing Fig. 4c-d, object interactions are much harder to predict than simple ego-motions, and thus are enriched by the curious policy: for the ID-SP agent, object interactions increase to about 60% of all frames. In comparison, frequency of object interaction increases much less or not at all for control policy agents.\n\nNavigation and planning. The curiosity-driven agents also exhibit emergent navigation and planning abilities. In Fig. 5 we visualize ID-SP self-prediction maps projected onto the agent's position for the one-object setup. The maps are generated by uniformly sampling 1000 actions a, evaluating Λ_{o_{t−1:t}}(a), and applying a post-processing smoothing algorithm. We show an example sequence of 12 timesteps. The self-prediction maps show the agent predicting a higher loss (red) for actions moving it towards the object to reach a play state. 
As a result, the intrinsically-motivated agents learn to take actions to navigate closer to the object.\n\nMulti-object interactions. In experiments with multiple objects present, initial learning stages mirror those for the one-object experiment (Fig. 6a) for both ID-SP and LF-SP. The loss temporarily decreases as the agent learns to predict its ego-motion and rises when its attention shifts towards objects, which it then interacts with. However, for ID-SP agents with sufficiently long time horizon (e.g. T = 40), we robustly observe the emergence of an additional stage in which the loss increases further. This stage corresponds to the agent gathering and “playing” with two objects simultaneously, reflected in a sharp increase in two-object play state frequency (Fig. 6c), and a decrease in the average distance between the agent and both objects (Fig. 6d). We do not observe this additional stage either for ID-SP of shorter time horizon (e.g. T = 5) or for the LF-SP model even with longer horizons. The ID-SP and LF-SP agents both experience two-object play slightly more often than the ID-RP baseline, having achieved substantial one-object play time. However, only the ID-SP agent has discovered how to take advantage of the increased difficulty and therefore “interestingness” of two-object configurations (compare blue with green horizontal line in Fig. 6a).\n\n3.2 Dynamics prediction tasks\n\nWe measure the inverse dynamics prediction performance on two held-out validation subsets of data generated from the uncontrolled background distribution of events: (i) an easy dataset consisting solely of ego-motion with no play states, and (ii) a hard dataset heavily enriched for play states, each with 4800 examples. These data are collected by executing a random policy in sixteen simulation instances, each containing one object, one for each object type. 
The hard dataset is the set of examples for which the object is in a play state immediately before the action to be predicted, and the easy dataset is the complement of this. This measures active learning gains, assessing to what extent the agent self-constructs training data for the hard subset while retaining performance on the easy dataset.

Figure 6: Two-object experiments. (a) World-model training loss. (b) Percentage of frames in which one object is present. (c) Percentage of frames in which two objects are present. (d) Average distance between agent and objects in Unity units. For this average to be low (~2) both objects must be close to the agent simultaneously. Lighter curves represent unsmoothed batch-mean values.

Ego-motion learning. All agents aside from the random-encoding agents IDRW-RP and IDRW-SP learn ego-motion prediction effectively. The ID-RP model quickly converges to a low loss value, where it remains from then on, having effectively learned ego-motion prediction without an antagonistic policy, since ego-motion interactions are common in the background random data distribution. The ID-SP and LF-SP models also learn ego-motion effectively, as seen in the initial decrease of their training losses (Fig. 4a) and low loss on the easy ego-motion validation dataset (Fig. 4c).
Object dynamics prediction. Object attention and navigation lead SP agents to substantially different data distributions than baselines. We evaluate inverse dynamics prediction performance on the held-out hard object-interaction validation set. Here, the ID-SP and LF-SP agents outperform the baselines on predicting the harder object-interaction subset by a significant margin, showing that increased object attention translates to improved inverse dynamics prediction (see Fig. 4d and Table 1).
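The easy/hard evaluation protocol described in this section can be sketched minimally as follows. The frame field names and the predictor interface are hypothetical, not the paper's actual data format.

```python
def split_easy_hard(frames):
    """Partition validation frames per the protocol above (field names
    hypothetical): 'hard' = object in a play state immediately before the
    action to be predicted; 'easy' = the complement."""
    hard = [f for f in frames if f["play_state_before_action"]]
    easy = [f for f in frames if not f["play_state_before_action"]]
    return easy, hard

def subset_accuracy(frames, predict):
    """Fraction of frames whose predicted action bin matches the true bin."""
    correct = sum(predict(f["obs"]) == f["action_bin"] for f in frames)
    return correct / max(len(frames), 1)
```

Reporting `subset_accuracy` separately on the two partitions corresponds to the easy/hard columns of Table 1: gains on the hard set measure what the agent's self-constructed curriculum bought, while the easy set checks that ego-motion prediction was retained.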
Crucially, even though ID-SP and LF-SP have substantially decreased the fraction of time spent on ego-motion interactions (Fig. 4c), they still retain high performance on that easier sub-task.

3.3 Task transfers
We measure the agents' abilities to solve visual tasks for which they were not directly trained, including (i) object presence, (ii) localization, as measured by pixel-wise 2D centroid position, and (iii) 16-way object category recognition. We collect data with a random policy from sixteen simulation instances (each with one object, one for each object type). For object presence, we subselect examples so as to have an equal number with and without an object in view. For localization and category identity (discerning which of the sixteen objects is in view), we take only frames with the object in a play state. These data are split into train (16000 examples), validation (8000 examples), and test (8000 examples) sets. On the train sets, we fit elastic net regression/classification models for each layer of both the world- and self-model encodings, and we use the validation sets to select the best-performing model per agent type. These best models are then evaluated on the test sets. Note that the test sets contain substantial variation in position, pose, and size, rendering these tasks nontrivial. Self-model driven agents substantially outperform alternatives on all three transfer tasks. As shown in Table 1, the SP (T = 5) agents outperform baselines on inverse dynamics and object presence metrics, while ID-SP outperforms LF-SP on localization and recognition. Crucially, the ID-RP ablation comparison shows that without an active learning policy, the encoding learned performs comparatively poorly on transfer tasks. Interestingly, we find that training with two objects present improves recognition transfer performance as compared to one-object scenarios, potentially due to the greater complexity of two-object configurations (Table 1).
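The per-layer readout protocol above (fit a regularized linear readout on each layer's encoding, select the best layer on validation) can be sketched as follows. This is a simplified stand-in: a closed-form ridge readout takes the place of the paper's elastic net, and all variable names are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    # Closed-form ridge regression (standing in for the elastic net readout).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def best_layer_by_validation(layers, y_train, y_val, lam=1e-2):
    """Fit a linear readout on each layer's train encoding, score it on the
    validation encoding, and keep the best layer. `layers` maps layer name
    -> (X_train, X_val); layer names here are hypothetical."""
    best_name, best_err = None, np.inf
    for name, (X_tr, X_va) in layers.items():
        w = fit_ridge(X_tr, y_train, lam)
        err = np.mean((X_va @ w - y_val) ** 2)
        if err < best_err:
            best_name, best_err = name, err
    return best_name, best_err
```

The selected readout would then be evaluated once on the held-out test split, matching the train/validation/test protocol of the section.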
This is especially notable for the ID-SP (T = 40) agent, which constructs a substantially increased percentage of two-object events.

Table 1: Performance comparison. Ego-motion (v_fwd, v_θ) and interaction (f, τ) accuracy in % is compared for play and non-play states. Object frequency, presence and recognition are measured in % and localization in mean pixel error. Models are trained with one object per room unless stated.

TASK                                   IDRW-RP  IDRW-SP  ID-RP  ID-SP  LF-SP
v_fwd accuracy (easy)                     65.9     56.0   96.0   95.3   95.3
v_θ accuracy (easy)                       82.9     75.2   98.7   98.4   98.5
v_fwd accuracy (hard)                     62.4     69.2   90.4   95.9   95.4
v_θ accuracy (hard)                       79.0     80.0   95.5   98.2   98.1
f accuracy (hard)                         20.8     33.1   42.1   51.1   45.1
τ accuracy (hard)                         20.9     32.1   41.3   43.2   43.2
Object frequency                          0.50     47.9   0.40   61.1   12.8
Object presence error                     4.0      3.0    0.92   0.92   0.60
Localization error [px]                  15.04    10.14   5.94   4.36   5.94
Recognition accuracy                      13.0    21.99   12.3   28.5   18.7
Recognition acc. (2-object training)      12.0       -    16.1   39.7   21.1

4 Discussion
We have constructed a simple self-supervised mechanism that spontaneously generates a spectrum of emergent naturalistic behaviors via an active learning process, experiencing "developmental milestones" of increasing complexity as the agent learns to "play".
The agent first learns the dynamics of its own motion, gets "bored", then shifts its attention to locating, moving toward, and interacting with single objects (* in Fig. 4 and Fig. 6). Once these are better understood (° in Fig. 4 and Fig. 6), the agent transitions to gathering multiple objects and learning from their interactions (x in Fig. 6). This increasingly challenging self-generated curriculum leads to performance gains in the agent's world-model and improved transfer to other useful visual tasks on which the system never received any explicit training signal. Our ablation studies show that without this active learning policy, world-model accuracy remains poor and visual encodings transfer much less well. These results constitute a proof-of-concept that both complex behaviors and useful visual features can arise from simple intrinsic motivations in a three-dimensional physical environment with realistically large and continuous state and action spaces.
In future work, we seek to generate much more sophisticated behaviors than those seen here, including the creation of complex planned trajectories and the useful building of environmental structures. Beyond the objective of building more robustly learning AI, we seek to build computational models that admit precise quantitative comparisons to the developmental trajectories observed in human children (Figure 7). To this end, our environment will need better graphics and physics, more varied visual objects, and more realistically embodied robotic agent avatars with articulated actuators and haptic sensors. From a core-algorithms perspective, we will need to improve our handling of the inherent degeneracy (the "white-noise problem") of our dynamics prediction problems. The LF-SP agent employs, as discussed in Section 2, the technique in [35] aimed at this. It was, however, unclear whether this method fully resolved the issue.
It will likely be necessary to improve both the formulation of the world-model dynamics prediction tasks our agent solves and the antagonistic action policies of the agent's self-model. One approach may be improving our formulation of curiosity from the simple adversarial concept to include additional notions of intrinsic motivation such as learning progress [33, 34, 38]. More refined future prediction models (e.g. [31]) may also ameliorate degeneracy and lead to more sophisticated behavior. Finally, including other animate agents in the environment will not only lead to more complex interactions, but potentially also better learning through imitation [17]. In this scenario, the self-model component of our architecture will need to be not only aware of the agent itself, but also make predictions about the actions of other agents, perhaps providing a connection to the cognitive science of theory of mind [37].

Figure 7: Computational model and human development comparison. [Panels show the model loop (environment, world-model, self-model, adversarial signal) alongside human motor milestones from "Chest Up" (2 mo.) to "Walk Alone" (15 mo.).]

Acknowledgments
This work was supported by grants from the James S. McDonnell Foundation, Simons Foundation, and Sloan Foundation (DLKY), a Berry Foundation postdoctoral fellowship and Stanford Department of Biomedical Data Science NLM T-15 LM007033-35 (NH), ONR - MURI (Stanford Lead) N00014-16-1-2127 and ONR - MURI (UCLA Lead) 1015 G TA275 (LF).

References
[1] J. Achiam and S.
Sastry. Surprise-based intrinsic motivation for deep reinforcement learning. CoRR, abs/1703.01732, 2017.
[2] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In NIPS, 2016.
[3] A. Baranes and P.-Y. Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49–73, 2013.
[4] K. Begus, T. Gliga, and V. Southgate. Infants learn what they want to learn: Responding to infant pointing leads to superior learning. PLOS ONE, 9(10):1–4, 2014.
[5] A. Boeing and T. Bräunl. Evaluation of real-time physics simulation systems. In Proceedings of the 5th International Conference on Computer Graphics and Interactive Techniques in Australia and Southeast Asia, pages 281–288. ACM, 2007.
[6] N. Chentanez, A. G. Barto, and S. P. Singh. Intrinsically motivated reinforcement learning. In NIPS, pages 1281–1288, 2005.
[7] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. In CoRL, volume 78, pages 344–356. PMLR, 2017.
[8] E. Elhamifar, G. Sapiro, A. Y. Yang, and S. S. Sastry. A convex optimization framework for active learning. In ICCV, pages 209–216. IEEE Computer Society, 2013.
[9] R. L. Fantz. Visual experience in infants: Decreased attention to familiar patterns relative to novel ones. Science, 146:668–670, 1964.
[10] C. Finn and S. Levine. Deep visual foresight for planning robot motion. In ICRA, pages 2786–2793. IEEE, 2017.
[11] M. Frank, J. Leitner, M. F. Stollenga, A. Förster, and J. Schmidhuber. Curiosity driven reinforcement learning for motion planning on humanoids. Front. Neurorobot., 2014.
[12] R. Gilad-Bachrach, A. Navot, and N. Tishby.
Query by committee made real. In NIPS, pages 443–450, 2005.
[13] A. Gopnik, A. Meltzoff, and P. Kuhl. The Scientist in the Crib: Minds, Brains, and How Children Learn. HarperCollins, 2009.
[14] J. Gottlieb, P.-Y. Oudeyer, M. Lopes, and A. Baranes. Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends in Cognitive Sciences, 17(11):585–593, 2013.
[15] L. Goupil, M. Romand-Monnier, and S. Kouider. Infants ask for help when they know they don't know. Proceedings of the National Academy of Sciences, 113(13):3492–3496, 2016.
[16] D. Held, X. Geng, C. Florensa, and P. Abbeel. Automatic goal generation for reinforcement learning agents. CoRR, abs/1705.06366, 2017.
[17] J. Ho and S. Ermon. Generative adversarial imitation learning. In NIPS, pages 4565–4573, 2016.
[18] S. Hong, D. Yeo, S. Kwak, H. Lee, and B. Han. Weakly supervised semantic segmentation using web-crawled videos. In CVPR, pages 2224–2232. IEEE Computer Society, 2017.
[19] R. Houthooft, X. Chen, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel. VIME: Variational information maximizing exploration. In NIPS, pages 1109–1117, 2016.
[20] K. B. Hurley and L. M. Oakes. Experience and distribution of attention: Pet exposure and infants' scanning of animal images. J Cogn Dev, 16(1):11–30, 2015.
[21] K. B. Hurley, K. A. Kovack-Lesh, and L. M. Oakes. The influence of pets on infants' processing of cat and dog images. Infant Behav Dev, 33(4):619–628, 2010.
[22] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016.
[23] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks.
In ICML, volume 70 of JMLR Workshop and Conference Proceedings, pages 1771–1779. JMLR.org, 2017.
[24] C. Kidd, S. T. Piantadosi, and R. N. Aslin. The Goldilocks effect: Human infants allocate attention to visual sequences that are neither too simple nor too complex. PLOS ONE, 7(5):1–8, 2012.
[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[26] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In NIPS, pages 3675–3683, 2016.
[27] L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Pittsburgh, PA, USA, 1992.
[28] M. C. Machado, M. G. Bellemare, and M. Bowling. A Laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017.
[29] C. Mitash, K. E. Bekris, and A. Boularias. A self-supervised learning system for object detection using physics simulation and multi-view pose estimation. In IROS, pages 545–551. IEEE, 2017.
[30] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937. JMLR.org, 2016.
[31] D. Mrowca, C. Zhuang, E. Wang, N. Haber, L. Fei-Fei, J. B. Tenenbaum, and D. L. Yamins. Flexible neural representation for physics prediction. arXiv preprint arXiv:1806.08047, 2018.
[32] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, volume 9910 of Lecture Notes in Computer Science, pages 69–84. Springer, 2016.
[33] P.-Y. Oudeyer and L. B. Smith. How evolution may work through curiosity-driven developmental process.
Topics in Cognitive Science, 8(2):492–502, 2016.
[34] P.-Y. Oudeyer, F. Kaplan, and V. V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Trans. Evolutionary Computation, 11(2):265–286, 2007.
[35] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, volume 70 of JMLR Workshop and Conference Proceedings, pages 2778–2787. JMLR.org, 2017.
[36] I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.
[37] R. Saxe and N. Kanwisher. People thinking about thinking people: the role of the temporo-parietal junction in "theory of mind". NeuroImage, 19(4):1835–1842, 2003.
[38] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Trans. Autonomous Mental Development, 2(3):230–247, 2010.
[39] O. Sener and S. Savarese. A geometric approach to active learning for convolutional neural networks. CoRR, abs/1708.00489, 2017.
[40] B. Settles. Active Learning, volume 18. Morgan & Claypool Publishers, 2011.
[41] S. P. Singh, R. L. Lewis, A. G. Barto, and J. Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Trans. Autonomous Mental Development, 2(2):70–82, 2010.
[42] E. Sokolov. Perception and the Conditioned Reflex. Pergamon Press, 1963.
[43] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[44] K. E. Twomey and G. Westermann. Curiosity-based learning in infants: a neurocomputational approach. Developmental Science, 2017.
[45] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Trans.
Circuits Syst. Video Techn., 27(12):2591–2600, 2017.
[46] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, volume 9907 of Lecture Notes in Computer Science, pages 649–666. Springer, 2016.