{"title": "Learning to Poke by Poking: Experiential Learning of Intuitive Physics", "book": "Advances in Neural Information Processing Systems", "page_first": 5074, "page_last": 5082, "abstract": "We investigate an experiential learning paradigm for acquiring an internal model of intuitive physics. Our model is evaluated on a real-world robotic manipulation task that requires displacing objects to target locations by poking. The robot gathered over 400 hours of experience by executing more than 50K pokes on different objects. We propose a novel approach based on deep neural networks for modeling the dynamics of robot's interactions directly from images, by jointly estimating forward and inverse models of dynamics. The inverse model objective provides supervision to construct informative visual features, which the forward model can then predict and in turn regularize the feature space for the inverse model. The interplay between these two objectives creates useful, accurate models that can then be used for multi-step decision making. This formulation has the additional benefit that it is possible to learn forward models in an abstract feature space and thus alleviate the need of predicting pixels. Our experiments show that this joint modeling approach outperforms alternative methods. We also demonstrate that active data collection using the learned model further improves performance.", "full_text": "Learning to Poke by Poking: Experiential Learning of Intuitive Physics\n\nPulkit Agrawal\u2217\n\nAshvin Nair\u2217\n\nPieter Abbeel\n\nJitendra Malik\n\nSergey Levine\n\nBerkeley Artificial Intelligence Research Laboratory (BAIR)\n\nUniversity of California Berkeley\n\n{pulkitag,anair17,pabbeel,malik,svlevine}@berkeley.edu\n\nAbstract\n\nWe investigate an experiential learning paradigm for acquiring an internal model of\nintuitive physics. 
Our model is evaluated on a real-world robotic manipulation task\nthat requires displacing objects to target locations by poking. The robot gathered\nover 400 hours of experience by executing more than 100K pokes on different\nobjects. We propose a novel approach based on deep neural networks for modeling\nthe dynamics of the robot\u2019s interactions directly from images, by jointly estimating\nforward and inverse models of dynamics. The inverse model objective provides\nsupervision to construct informative visual features, which the forward model can\nthen predict and in turn regularize the feature space for the inverse model. The\ninterplay between these two objectives creates useful, accurate models that can\nthen be used for multi-step decision making. This formulation has the additional\nbenefit that it is possible to learn forward models in an abstract feature space and\nthus alleviate the need to predict pixels. Our experiments show that this joint\nmodeling approach outperforms alternative methods.\n\n1 Introduction\n\nHumans can effortlessly manipulate previously unseen objects in novel ways. For example, if a\nhammer is not available, a human might use a piece of rock or the back of a screwdriver to hit a nail.\nWhat enables humans to easily perform such tasks that machines struggle with? One possibility is that\nhumans possess an internal model of physics (i.e. \u201cintuitive physics\u201d (Michotte, 1963; McCloskey,\n1983)) that allows them to reason about physical properties of objects and forecast their dynamics\nunder the effect of applied forces. Such models can be used to transform a given task into a search\nproblem in a manner similar to how moves can be planned in a game of chess or tic-tac-toe by\nsearching through the game tree. 
Because the search algorithm is independent of task semantics,\nsolutions to different and possibly new tasks can be determined using the same mechanism.\nIn human development, it is well known that infants spend years\u2019 worth of time playing with objects\nin a seemingly random manner with no specific end goal (Smith & Gasser, 2005; Gopnik et al., 1999).\nOne hypothesis is that infants distill this experience into intuitive physics models that predict how\ntheir actions affect the motion of objects. Once learnt, these models could be used for planning\nactions for achieving novel goals later in life. Inspired by this hypothesis, in this work we investigate\nwhether a robot can use its own experience to learn an intuitive model of physics that is also effective\nfor planning actions. In our setup (see Figure 1), a Baxter robot interacts with objects kept on a table\nin front of it by randomly poking them. The robot records the visual state of the world before and\nafter it executes a poke in order to learn a mapping between its actions and the accompanying change\nin visual state caused by object motion. To date our robot has interacted with objects for more than\n400 hours and in the process collected more than 100K pokes on 16 distinct objects.\n\n\u2217equal contribution, authors are listed in alphabetical order.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Infants spend years\u2019 worth of time playing with objects in a seemingly random manner.\nThey might use this experience to learn a model of physics relating their actions with the resulting\nmotion of objects. Inspired by this hypothesis, we let a robot interact with objects by randomly\npoking them. The robot pokes objects and records the visual state before (left) and after (right) the\npoke. 
The triplet of before image, after image and the applied poke is used to train a neural network\n(center) for learning the mapping between actions and the accompanying change in visual state. We\nshow that this learnt model can be used to push objects into a desired configuration.\n\nWhat kind of a model should the robot learn from its experience? One possibility is to build a model\nthat predicts the next visual state from the current visual state and the applied force (i.e. a forward\ndynamics model). This is challenging because predicting the value of every pixel in the next image is\nnon-trivial in real world scenarios. Moreover, in most cases it is not the precise pixel values that are of\ninterest, but the occurrence of a more abstract event. For example, predicting that a glass jar will break\nwhen pushed from the table onto the ground is of greater interest (and easier) than predicting exactly\nhow every piece of shattered glass will look. The difficulty, however, is that supervision for such\nabstract concepts or events is not readily available in unsupervised settings such as ours. In this work,\nwe propose one solution to this problem by jointly training forward and inverse dynamics models. A\nforward model predicts the next state from the current state and action, and an inverse model predicts\nthe action given the initial and target state. In joint training, the inverse model objective provides\nsupervision for transforming image pixels into an abstract feature space, which the forward model\ncan then predict. 
The inverse model alleviates the need for the forward model to make predictions in\nthe pixel space and the forward model in turn regularizes the feature space for the inverse model.\nWe empirically show that the joint model allows the robot to generalize and plan actions for achieving\ntasks with significantly different visual statistics as compared to the data used in the learning phase.\nOur model can be used for multi-step decision making and can displace objects with novel geometry\nand texture into desired goal locations that are much farther apart than the positions of objects\nbefore and after a single poke. We probe the joint modeling approach further using simulation studies\nand show that the forward model regularizes the inverse model.\n\n2 Data\n\nFigure 1 shows our experimental setup. The robot is equipped with a Kinect camera and a gripper for\npoking objects kept on a table in front of it. At any given time there were 1-3 objects chosen from a\nset of 16 distinct objects present on the table. The robot\u2019s coordinate system was as follows: the X and\nY axes represented the horizontal and vertical axes, while the Z axis pointed away from the robot.\nThe robot poked objects by moving its finger along the XZ plane at a fixed height from the table.\nPoke Representation: For collecting a sample of interaction data, the robot first selects a random\ntarget point in its field of view to poke. One issue with random poking is that most pokes are executed\nin free space, which severely slows down collection of interesting interaction data. For speedy data\ncollection, a point cloud from the Kinect depth camera was used to choose only points that lie on any\nobject except the table. Point cloud information was only used during data collection; at test time\nour system only requires RGB image data. 
After selecting a random point to poke (p) on the object,\nthe robot randomly samples a poke direction (\u03b8) and length (l). Kinematically, the poke is defined\nby points p1, p2 that are at distance l/2 from p in the directions \u03b8\u00b0 and (180 + \u03b8)\u00b0, respectively. The robot\nexecutes the poke by moving its finger from p1 to p2.\n\nFigure 2: These images depict the robot in the process of displacing the bottle away from the indicated\ndotted line. In the middle of the poke, the object flips and ends up moving in the wrong direction.\nSuch occurrences are common because real world objects have complex geometric and material\nproperties. This makes learning manipulation strategies without prior knowledge very challenging.\n\nOur robot can run autonomously 24x7 without any human intervention. Sometimes when objects are\npoked they move as expected, but other times, due to non-linear interaction between the robot\u2019s finger\nand the object, they move in unexpected ways as shown in Figure 2. Any model of the poking data\nmust deal with such non-linear interactions (see project website for more examples). A small amount\nof data in the early stages of the project was collected on a table with a green background, but most\nof our data was collected in a wooden arena with walls for preventing objects from falling down. All\nresults in this paper are from data collected only from the wooden arena.\n\n3 Method\n\nThe forward and inverse models can be formally described by equations 1 and 2, respectively. 
The\nnotation is as follows: $x_t$, $u_t$ are the world state and the action applied at time step $t$; $\\hat{x}_{t+1}$, $\\hat{u}_t$ are the\npredicted state and action; and $W_{fwd}$ and $W_{inv}$ are parameters of the functions $F$ and $G$ that are\nused to construct the forward and inverse models.\n\n$$\\hat{x}_{t+1} = F(x_t, u_t; W_{fwd}) \\tag{1}$$\n\n$$\\hat{u}_t = G(x_t, x_{t+1}; W_{inv}) \\tag{2}$$\n\nGiven an initial and goal state, inverse models provide a direct mapping to actions required for\nachieving the goal state in one step (if feasible). However, multiple possible actions can transform\nthe world from one visual state to another. For example, an object can appear in a certain part of the\nvisual field if the agent moves or if the agent uses its arms to move the object. This multi-modality\nin the action space makes the learning hard. On the other hand, given $x_t$ and $u_t$, there exists a next\nstate $x_{t+1}$ that is unique up to dynamics noise. This suggests that forward models might be easier to\nlearn. However, learning forward models in image space is hard because predicting the value of each\npixel in the future frames is a non-trivial problem with no known good solution. In most\nscenarios, though, we are not interested in predicting every pixel, but in predicting the occurrence of a more\nabstract event such as object motion, change in object pose etc.\nThe ability to learn an abstract task relevant feature space should make it easier to learn a forward\ndynamics model. One possible approach is to learn a dynamics model in the feature representation of\na higher layer of a deep neural network trained to perform image classification (say on ImageNet)\n(Vondrick et al., 2016). However, this is not a general way of learning task relevant features and it is\nunclear whether features adept at object recognition are also optimal for object manipulation. 
The\nalternative of adapting higher layer features of a neural network while simultaneously optimizing\nfor the prediction loss leads to a degenerate solution of all the features reducing to zero, since the\nprediction loss in this case is also zero. Our key observation is that this degenerate solution can be\navoided by imposing the constraint that it should be possible to infer the executed action (ut)\nfrom the feature representation of two images obtained before (xt) and after (xt+1) the action (ut) is\napplied (i.e. optimizing the inverse model). This formulation provides a general mechanism for using\ngeneral purpose function approximators such as deep neural networks for simultaneously learning a\ntask relevant feature space and forecasting the future outcome of actions in this learned space.\nA second challenge in using forward models is that inferring the optimal action inevitably leads to\nfinding a solution to non-convex problems that are subject to local optima. The inverse model does\nnot suffer from this drawback as it directly outputs the required action. These considerations suggest\nthat inverse and forward models have complementary strengths and therefore it is worthwhile to\ninvestigate training a joint model of inverse and forward dynamics.\n\nFigure 3: (a) The collection of objects in the training set poked by the robot. (b) Example pairs\nof before (It) and after images (It+1) after a single poke was made by the robot. (c) A Siamese\nconvolutional neural network was trained to predict the poke location (pt), angle (\u03b8t) and length (lt)\nrequired to transform objects in the image at the tth time step (It) into their state in It+1. Images It\nand It+1 are transformed into their latent feature representations (xt, xt+1) by passing them through\na series of convolutional layers. For building the inverse model, xt, xt+1 are concatenated and passed\nthrough fully connected layers to predict the discretized poke. 
For building the forward model, the\naction ut = {pt, \u03b8t, lt} and xt are passed through a series of fully connected layers to predict xt+1.\n\n3.1 Model\n\nA deep neural network is used to simultaneously learn a model of forward and inverse dynamics (see\nFigure 3). A tuple of before image (It), after image (It+1) and the robot\u2019s action (ut) constitutes one\ntraining sample. Input images at consecutive time steps (It, It+1) are transformed into their latent\nfeature representations (xt, xt+1) by passing them through a series of five convolutional layers with\nthe same architecture as the first five layers of AlexNet (Krizhevsky et al., 2012). For building the\ninverse model, xt, xt+1 are concatenated and passed through fully connected layers to conditionally\npredict the poke location (pt), angle (\u03b8t) and length (lt) separately. For modeling multimodal poke\ndistributions, poke location, angle and length of poke are discretized into a 20 \u00d7 20 grid, 36 bins and\n11 bins respectively. The 11th bin of the poke length is used to denote no poke. For building the\nforward model, the feature representation of the before image (xt) and the action (ut; real-valued\nvector without discretization) are passed into a sequence of fully connected layers that predicts the\nfeature representation of the next image (xt+1). Training is performed to optimize the loss defined in\nequation 3 below.\n\n$$L_{joint} = L_{inv}(u_t, \\hat{u}_t, W) + \\lambda L_{fwd}(x_{t+1}, \\hat{x}_{t+1}, W) \\tag{3}$$\n\n$L_{inv}$ is a sum of three cross entropy losses between the actual and predicted poke location, angle\nand length. $L_{fwd}$ is an L1 loss between the predicted ($\\hat{x}_{t+1}$) and the ground truth ($x_{t+1}$) feature\nrepresentation of the after image (It+1). $W$ are the parameters of the neural network. We used\n$\\lambda = 0.1$ in all our experiments. We call this the joint model and we compare its performance against\nthe inverse only model that was trained by setting $\\lambda = 0$ in equation 3. 
More details about model\ntraining are provided in the supplementary materials.\n\n3.2 Evaluation Procedure\n\nOne way to test the learnt model is to provide the robot with an initial and goal image and task it to\napply pokes that would displace objects into the configuration shown in the goal image. If the robot\nsucceeds at achieving the goal configuration when the visual statistics of the pair of initial and goal\nimages are similar to those of the before and after images in the training set, then this would not be a convincing\ndemonstration of generalization. However, if the robot is able to displace objects into goal positions\nthat are much farther apart than the positions of objects before and after a single poke, then it\nmight suggest that our model has not simply overfit but has learnt something about the underlying\nphysics of how objects move when poked. This suggestion would be further strengthened if the robot\nis also able to push objects with novel geometry and texture in the presence of multiple distractor objects.\nIf the objects in the initial and goal image are farther apart than the maximum distance that can be\npushed by a single poke, then the model would be required to output a sequence of pokes. We use a\ngreedy planning method (see Figure 4(a)) to output a sequence of pokes. First, images depicting the\ninitial and goal state are passed through the learnt model to predict the poke, which is then executed\nby the robot. Then, the image depicting the current world state (i.e. the current image) and the goal\nimage are fed again into the model to output a poke. This process is repeated iteratively until the\nrobot predicts a no-poke (see section 3.1) or a maximum number of 10 pokes is reached.\n\nFigure 4: (a) Greedy planner is used to output a sequence of pokes to displace the objects from their\nconfiguration in the initial image to that in the goal image. (b) The blob model first detects the location of objects in\nthe current and goal image. Based on object positions, the location and angle of the poke are computed\nand then executed by the robot. The obtained next and goal image are used to compute the next poke\nand this process is repeated iteratively. (c) The error of the models in poking objects to their correct\npose is measured as the angle between the major axis of the objects in the final and goal images.\n\nError Metrics: In all our experiments, the initial and goal images differ in the position of only a\nsingle object. The location and pose of the object in the final image after the robot stops and the goal\nimage are compared for quantitative evaluation. The location error is the Euclidean distance between\nthe object locations. In order to account for different object distances in the initial and goal state, we\nuse relative instead of absolute location error. Pose error is defined as the angle (in degrees) between\nthe major axis of the objects in the final and goal images (see Figure 4(c)). Please see supplementary\nmaterials for further details.\n\n3.3 Blob Model\n\nWe compared the performance of the learnt model against a baseline blob model. This model first\nestimates object locations in the current and goal image using a template based object detector. It then uses\nthe vector difference between these to compute the location, angle and length of poke executed by\nthe robot (see supplementary materials for details). 
In a manner similar to greedy planning with the\nlearnt model, this process is repeated iteratively until the object gets closer to the desired location in\nthe goal image by a pre-defined threshold or a maximum number of pokes is reached.\n\n4 Results\n\nThe robot was tasked to displace objects in an initial image into their configuration depicted in a\ngoal image (see Figure 5). The three rows in the figure show the performance when the robot is\nasked to displace an object (Nutella bottle) present in the training set, an object (red cup) whose\ngeometry is different from objects in the training set and when the task is to move an object around\nan obstacle. These examples are representative of the robot\u2019s performance and more examples can be\nfound on the project website. It can be seen that the robot is able to successfully poke objects present\nin the training set and objects with novel geometry and texture into desired goal locations that are\nsignificantly farther apart than the object positions in the pairs of before and after images used in the training set.\nRow 2 in Figure 5 also shows that the robot\u2019s performance is unaffected by the presence of distractor\nobjects that occupy the same location in the current and goal images. These results indicate that the\nlearnt model allows the robot to perform tasks that show generalization beyond the training set (i.e.\npoking objects by small distances). Row 3 in Figure 5 depicts an example where the robot fails to\npush the object around an obstacle (yellow object). The robot acts greedily and ends up pushing the\nobstacle along with the object. One more side-effect of greedy planning is zig-zag instead of straight\ntrajectories taken by the object between its initial and goal locations. 
Investigating alternatives to\ngreedy planning, such as using the learnt forward model for planning pokes, is a very interesting\ndirection for future research.\n\nFigure 5: The robot is able to successfully displace objects in the training set (row 1; Nutella bottle)\nand objects with previously unseen geometry (row 2; red cup) into goal locations that are significantly\nfarther apart than the object positions in the pairs of before and after images used in the training set. The robot is unable to push\nobjects around obstacles (row 3; limitation of greedy planning).\n\nWhat representation could the robot have learnt that allows it to generalize? One possibility is that\nthe robot ignores the geometry of the object and only infers the location of the object in the initial and\ngoal image and uses the difference vector between object locations to deduce what poke to execute.\nThis strategy is invariant to absolute distance between the object locations and is therefore capable\nof explaining the observed generalization to large distances. While we cannot prove that the model\nhas learnt to detect object location, nearest neighbor visualizations of the learnt feature space clearly\nsuggest sensitivity to object location (see supplementary materials). This is interesting because the\nrobot received no direct supervision to locate objects.\nBecause different objects have different geometries, they need to be poked at different places to move\nthem in the same manner. For example, a Nutella bottle can be reliably moved forward without\nrotating the bottle by poking it on the side along the direction toward its center of mass, whereas a\nhammer is reliably moved by poking it where the hammer head meets the handle. 
Pushing an object to\na desired pose is harder and requires a more detailed understanding of object geometry in comparison\nto pushing the object to a desired location. In order to test whether the learnt model represents any\ninformation about object geometry, we compared its performance against the baseline blob model\n(see section 3.3 and Figure 4(b)) that ignores object geometry. For this comparison, the robot was\ntasked to push objects to a nearby goal by making only a single poke (see supplementary materials\nfor more details). Results in Figure 6(a) show that both the inverse and joint model outperform the\nblob model. This indicates that in addition to representing information about object location, the\nlearnt models also represent some information about object geometry.\n\n4.1 Forward model regularizes the inverse model\n\nWe tested the hypothesis that the forward model regularizes the feature space learnt by the\ninverse model in a 2-D simulation environment where the agent interacted with a red rectangular\nobject by poking it with small forces. The rectangle was allowed to freely translate and rotate (Figure\n6(c)). Model training was performed using an architecture similar to the one described in section 3.1.\nAdditional details about the experimental setup, network architecture and training procedure for the\nsimulation experiments are provided in the supplementary materials. Figure 6(c) shows that when\nless training data (10K, 20K examples) is available the joint model outperforms the inverse model\nand reaches closer to the goal state in fewer steps (i.e. fewer actions). This shows that the\nforward model indeed regularizes the inverse model and helps it generalize better. However, when the number\nof training examples is increased to 100K both models are on par. 
This is not surprising because\ntraining with more data often results in better generalization and thus the inverse model is no longer\nreliant on the forward model for the regularization.\nEvaluation on the real robot supports the findings from the simulation experiments. Figure 6(b) shows\nthat in a test of generalization, when an object is required to be displaced by a long distance, the\njoint model outperforms the inverse model. The similar performance of the joint and blob models at this task\nis not surprising: even if the pokes are somewhat inaccurate but generally in the direction\nfrom the object\u2019s current to goal location, the object might traverse a zig-zag path but would eventually\nreach the goal. The joint model is however more accurate at displacing objects into their correct pose\nas compared to the blob model (Figure 6(a)).\n\nFigure 6: (a) Inverse and Joint model are more accurate than the blob model at pushing objects\ntowards the desired pose. (b) The joint model outperforms the inverse-only model when the robot\nis tasked to push objects by distances that are significantly larger than the object distance in before and\nafter images used in the training set (i.e. a test of generalization). (c) Simulation studies reveal that\nwhen fewer training examples (10K, 20K) are available the joint model outperforms the\ninverse model, and the performance is comparable with a larger amount of data (100K). This result\nindicates that the forward model regularizes the inverse model.\n\n5 Related Work\n\nLearning visual control policies using reinforcement learning for tasks such as playing Atari\ngames (Mnih et al., 2015), controlling robots in simulation (Lillicrap et al., 2016) and in the real\nworld (Levine et al., 2016a) is of growing interest. 
However, these methods are model free and learn\ngoal specific policies, which makes it difficult to repurpose the learned policies for new tasks. In\ncontrast, the aim of this work is to learn intuitive physical models of object interaction which we show\nallow the agent to generalize. Other works in visual control have relied on model free methods that\noperate on a low-dimensional state representation of images obtained using autoencoders (Lange\net al., 2012; Finn et al., 2016; Kietzmann & Riedmiller, 2009). It is unclear whether features obtained by\noptimizing pixelwise reconstruction are necessarily well suited for model based control.\nLearning to grasp objects by trial and error from large amounts of interaction data has recently\nbeen explored (Pinto & Gupta, 2016; Levine et al., 2016b). These methods aim to acquire a policy\nfor solving a single concrete task, while our work is concerned with learning a general predictive\nmodel that could be used to achieve a variety of goals at test time. When an object is grasped, it is\npossible to fully control the state of the grasped object. However, in non-prehensile manipulation\n(i.e. manipulation without grasping (LaValle, 2006)) such as poking, the object state is not directly\ncontrollable, which makes manipulation by poking harder than grasping (Dogar & Srinivasa, 2012).\nLearning a model of poking was considered by (Pinto et al., 2016), but their goal was to learn visual\nrepresentations and they did not consider using the learnt models to displace objects to goal locations.\nA good review of model based control can be found in (Mayne, 2014) and (Jordan & Rumelhart,\n1992; Wolpert et al., 1995) provide interesting perspectives. A model based deep learning method for\ncutting vegetables was considered by (Lenz et al., 2015). However, their system operated on the\nrobotic state space instead of vision and is thus limited in its generality. 
Model based control from\nvisual inputs was considered by (Fragkiadaki et al., 2016; Wahlstr\u00f6m et al., 2015; Watter et al., 2015;\nOh et al., 2015) in synthetic domains of manipulating a two degree of freedom robotic arm, an inverted\npendulum, billiards and Atari games. In contrast, we tackle manipulation of complex, compressible\nreal world objects. Instead of learning a model of physics, some recent works (Wu et al., 2015;\nMottaghi et al., 2016; Lerer et al., 2016) have proposed to use Newtonian physics in combination\nwith neural networks to forecast object dynamics.\n\n[Figure 6 plot labels: (a) Pose error for nearby goals; (b) Relative location error for far away goals; (c) Simulation experiments, Relative Location Error against Number of Steps for the inverse and joint models trained on 10K, 20K and 100K examples.]\n\nIn robotic manipulation, a number of prior methods have been proposed that use hand-designed visual\nfeatures and known object poses or key locations to plan and execute pushes and other non-prehensile\nmanipulations (Kopicki et al., 2011; Lau et al., 2011; Meri\u00e7li et al., 2015). Unlike these methods,\nthe goal in our work is to learn an intuitive physics model for pushing only from raw images, thus\nallowing the robot to learn by exploring the environment on its own without human intervention.\n\n6 Discussion and Future Work\n\nIn this work we propose to learn an \u201cintuitive\u201d model of physics using interaction data. An alternative is\nto represent the world in terms of a fixed set of physical parameters such as mass, friction coefficient,\nnormal forces etc. and use a physics simulator for computing object dynamics from this representation\n(Kolev & Todorov, 2015; Mottaghi et al., 2016; Wu et al., 2015; Hamrick et al., 2011). 
This approach\nis general because physics simulators inevitably use Newton\u2019s laws that apply to a wide range of\nphysical phenomena ranging from orbital motion of planets to a swinging pendulum. Estimating\nparameters such as mass, friction coefficient etc. from sensory data is subject to errors, and it is\npossible that one parameterization is easier to estimate or more robust to sensory noise than another.\nFor example, the conclusion that objects with feather like appearance fall slower than objects with\nstone like appearance can be reached by either correlating visual texture to the speed of falling objects,\nor by computing the drag force after estimating the cross section area of the object. Depending on\nwhether estimation of visual texture or cross section area is more robust, one parameterization will\nresult in more accurate predictions than the other. Pre-defining a set of parameters for predicting\nobject dynamics, which is required by the \u201csimulator-based\u201d approach, might therefore lead to suboptimal\nsolutions that are less robust.\nFor many practical object manipulation tasks of interest, such as re-arranging objects, cutting\nvegetables, folding clothes, and so forth, small errors in execution are acceptable. The key challenge\nis robust performance in the face of varying environmental conditions. This suggests that a more\nrobust but somewhat imprecise model may in fact be desirable over a less robust and more\nprecise model. While the arguments presented above suggest that intuitive physics models are likely\nto be more robust than simulator based models, quantifying the robustness of these models is an\ninteresting direction for future work. 
Furthermore, it is non-trivial to use simulator based models\nfor manipulating deformable objects such as clothes and ropes because simulation of deformable\nobjects is hard and also requires representing objects by heavily handcrafted features that are\nunlikely to generalize across objects. The intuitive physics approach does not make any object\nspeci\ufb01c assumptions and can be easily extended to work with deformable objects. This approach is\nin the spirit of recent successful deep learning techniques in computer vision and speech processing\nthat learn features directly from data, whereas the simulator based physics approach is more similar\nto using hand-designed features. Current methods for learning intuitive physics models, such as ours,\nare data inef\ufb01cient, and it is possible that combining intuitive and simulator based approaches leads to\nbetter models than either approach by itself.\nIn poking based interaction, the robot does not have full control of the object state, which makes it\nharder to predict and plan for the outcome of an action. The models proposed in this work generalize\nand are able to push objects into their desired location. However, performance on setting objects\nin the desired pose is not satisfactory, possibly because the robot only executes pokes in large,\ndiscrete time steps. An interesting area of future investigation is to use continuous time control with\nsmaller pokes that are likely to be more predictable than the large pokes used in this work. Further,\nalthough our approach is evaluated on a speci\ufb01c robotic manipulation task, there are no task speci\ufb01c\nassumptions, and the techniques are applicable to other tasks. In the future, it would be interesting to\nsee how the proposed approach scales to more complex environments, diverse object collections,\ndifferent manipulation skills and to other non-manipulation based tasks, such as navigation. 
Other\ndirections for future investigation include the use of the forward model for planning and developing\nbetter strategies for data collection than random interaction.\nSupplementary materials and videos can be found at http://ashvin.me/pokebot-website/.\nAcknowledgement: We thank Alyosha Efros for inspiration and fruitful discussions throughout this\nwork. The title of this paper is partly in\ufb02uenced by the term \u201cpokebot\u201d that Alyosha has been using\nfor several years. We thank Ruzena Bajcsy for access to the Baxter robot and Shubham Tulsiani for\nhelpful comments. This work was supported in part by ONR MURI N00014-14-1-0671, ONR YIP\nand by ARL through the MAST program. We are grateful to NVIDIA Corporation for donating K40\nGPUs and providing access to the NVIDIA PSG cluster.\n\nReferences\nDogar, Mehmet R and Srinivasa, Siddhartha S. A planning framework for non-prehensile manipulation under clutter and uncertainty.\n\nAutonomous Robots, 33(3):217\u2013236, 2012.\n\nFinn, Chelsea, Tan, Xin Yu, Duan, Yan, Darrell, Trevor, Levine, Sergey, and Abbeel, Pieter. Deep spatial autoencoders for visuomotor learning.\n\nICRA, 2016.\n\nFragkiadaki, Katerina, Agrawal, Pulkit, Levine, Sergey, and Malik, Jitendra. Learning visual predictive models of physics for playing billiards.\n\nICLR, 2016.\n\nGopnik, Alison, Meltzoff, Andrew N, and Kuhl, Patricia K. The scientist in the crib: Minds, brains, and how children learn. 1999.\n\nHamrick, Jessica, Battaglia, Peter, and Tenenbaum, Joshua B. Internal physics models guide probabilistic judgments about object dynamics.\n\nIn Cognitive Science Society, pp. 1545\u20131550, 2011.\n\nJordan, Michael I and Rumelhart, David E. Forward models: Supervised learning with a distal teacher. Cognitive science, 16, 1992.\n\nKietzmann, Tim C and Riedmiller, Martin. The neuro slot car racer: Reinforcement learning in a real world setting. In ICMLA, 2009.\n\nKolev, Svetoslav and Todorov, Emanuel. 
Physically consistent state estimation and system identi\ufb01cation for contacts. In International\n\nConference on Humanoid Robots, pp. 1036\u20131043. IEEE, 2015.\n\nKopicki, Marek, Zurek, Sebastian, Stolkin, Rustam, M\u00f6rwald, Thomas, and Wyatt, Jeremy. Learning to predict how rigid objects behave under\n\nsimple manipulation. In ICRA, pp. 5722\u20135729. IEEE, 2011.\n\nKrizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classi\ufb01cation with deep convolutional neural networks. In NIPS, pp.\n\n1097\u20131105, 2012.\n\nLange, Stanislav, Riedmiller, Martin, and Voigtlander, Arne. Autonomous reinforcement learning on raw visual input data in a real world\n\napplication. In IJCNN, pp. 1\u20138. IEEE, 2012.\n\nLau, Manfred, Mitani, Jun, and Igarashi, Takeo. Automatic learning of pushing strategy for delivery of irregular-shaped objects. In ICRA, pp.\n\n3733\u20133738. IEEE, 2011.\n\nLaValle, Steven M. Planning algorithms. Cambridge university press, 2006.\n\nLenz, Ian, Knepper, Ross, and Saxena, Ashutosh. Deepmpc: Learning deep latent features for model predictive control. In RSS, 2015.\n\nLerer, Adam, Gross, Sam, and Fergus, Rob. Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312, 2016.\n\nLevine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. JMLR, 2016a.\n\nLevine, Sergey, Pastor, Peter, Krizhevsky, Alex, and Quillen, Deirdre. Learning hand-eye coordination for robotic grasping with deep learning\n\nand large-scale data collection. arXiv, 2016b.\n\nLillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan.\n\nContinuous control with deep reinforcement learning. ICLR, 2016.\n\nMayne, David Q. Model predictive control: Recent developments and future promise. Automatica, 50(12):2967\u20132986, 2014.\n\nMcCloskey, Michael. Intuitive physics. 
Scienti\ufb01c american, 248(4):122\u2013130, 1983.\n\nMeri\u00e7li, Tekin, Veloso, Manuela, and Ak\u0131n, H Levent. Push-manipulation of complex passive mobile objects using experimentally acquired\n\nmotion models. Autonomous Robots, 38(3):317\u2013329, 2015.\n\nMichotte, Albert. The perception of causality. 1963.\n\nMnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin,\n\nFidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 2015.\n\nMottaghi, Roozbeh, Bagherinezhad, Hessam, Rastegari, Mohammad, and Farhadi, Ali. Newtonian image understanding: Unfolding the\n\ndynamics of objects in static images. CVPR, 2016.\n\nOh, Junhyuk, Guo, Xiaoxiao, Lee, Honglak, Lewis, Richard, and Singh, Satinder. Action-conditional video prediction using deep networks in\n\natari games. NIPS, 2015.\n\nPinto, Lerrel and Gupta, Abhinav. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. ICRA, 2016.\n\nPinto, Lerrel, Gandhi, Dhiraj, Han, Yuanfeng, Park, Yong-Lae, and Gupta, Abhinav. The curious robot: Learning visual representations via\n\nphysical interactions. In ECCV, pp. 3\u201318. Springer, 2016.\n\nSmith, Linda and Gasser, Michael. The development of embodied cognition: Six lessons from babies. Arti\ufb01cial life, 11(1-2):13\u201329, 2005.\n\nVondrick, Carl, Pirsiavash, Hamed, and Torralba, Antonio. Anticipating the future by watching unlabeled video. CVPR, 2016.\n\nWahlstr\u00f6m, Niklas, Sch\u00f6n, Thomas B., and Deisenroth, Marc Peter. From pixels to torques: Policy learning with deep dynamical models.\n\nCoRR, abs/1502.02251, 2015.\n\nWatter, Manuel, Springenberg, Jost, Boedecker, Joschka, and Riedmiller, Martin. Embed to control: A locally linear latent dynamics model\n\nfor control from raw images. In NIPS, pp. 2728\u20132736, 2015.\n\nWolpert, Daniel M, Ghahramani, Zoubin, and Jordan, Michael I. 
An internal model for sensorimotor integration. Science-AAAS-Weekly Paper\n\nEdition, 269(5232):1880\u20131882, 1995.\n\nWu, Jiajun, Yildirim, Ilker, Lim, Joseph J, Freeman, Bill, and Tenenbaum, Josh. Galileo: Perceiving physical object properties by integrating\n\na physics engine with deep learning. In NIPS, pp. 127\u2013135, 2015.", "award": [], "sourceid": 442, "authors": [{"given_name": "Pulkit", "family_name": "Agrawal", "institution": "UC Berkeley"}, {"given_name": "Ashvin", "family_name": "Nair", "institution": "UC Berkeley"}, {"given_name": "Pieter", "family_name": "Abbeel", "institution": "OpenAI / UC Berkeley / Gradescope"}, {"given_name": "Jitendra", "family_name": "Malik", "institution": "UC Berkeley"}, {"given_name": "Sergey", "family_name": "Levine", "institution": "University of Washington"}]}