{"title": "Learning Trajectory Preferences for  Manipulators via Iterative Improvement", "book": "Advances in Neural Information Processing Systems", "page_first": 575, "page_last": 583, "abstract": "We consider the problem of learning good trajectories for manipulation tasks. This is challenging because the criterion defining a good trajectory varies with users, tasks and environments.  In this paper, we propose a co-active online  learning framework for teaching robots the preferences of its users for object  manipulation tasks. The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. We argue that this  co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees of freedom manipulators. Nevertheless, theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. We also formulate a score function to capture the contextual information and demonstrate the generalizability of our algorithm on a variety of household tasks, for whom, the preferences were not only influenced by the object being manipulated but also by the surrounding environment.", "full_text": "Learning Trajectory Preferences for Manipulators\n\nvia Iterative Improvement\n\nAshesh Jain, Brian Wojcik, Thorsten Joachims, Ashutosh Saxena\n\nDepartment of Computer Science, Cornell University.\n\n{ashesh,bmw75,tj,asaxena}@cs.cornell.edu\n\nAbstract\n\nWe consider the problem of learning good trajectories for manipulation tasks. 
This is challenging because the criterion defining a good trajectory varies with users, tasks and environments. In this paper, we propose a co-active online learning framework for teaching robots the preferences of their users for object manipulation tasks. The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. We argue that this co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees-of-freedom manipulators. Nevertheless, the theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. We demonstrate the generalizability of our algorithm on a variety of grocery checkout tasks, for which the preferences were influenced not only by the object being manipulated but also by the surrounding environment (footnote 1).\n\n1 Introduction\n\nMobile manipulator robots have arms with high degrees of freedom (DoF), enabling them to perform household chores (e.g., PR2) or complex assembly-line tasks (e.g., Baxter). In performing these tasks, a key problem lies in identifying appropriate trajectories. An appropriate trajectory not only needs to be valid from a geometric standpoint (i.e., feasible and obstacle-free, the criterion that most path planners focus on), but it also needs to satisfy the user's preferences.\n\nSuch user preferences over trajectories vary between users, between tasks, and between the environments the trajectory is performed in. For example, a household robot should move a glass of water in an upright position without jerks while maintaining a safe distance from nearby electronic devices. 
In another example, a robot checking out a kitchen knife at a grocery store should strictly move it at a safe distance from nearby humans. Furthermore, straight-line trajectories in Euclidean space may no longer be the preferred ones. For example, trajectories of heavy items should not pass over fragile items but rather move around them. These preferences are often hard to describe and anticipate without knowing where and how the robot is deployed. This makes it infeasible to manually encode them (e.g., [18]) in existing path planners (such as [29, 35]) a priori.\n\nIn this work we propose an algorithm for learning user preferences over trajectories through interactive feedback from the user in a co-active learning setting [31]. Unlike in other learning settings, where a human first demonstrates optimal trajectories for a task to the robot, our learning model does not rely on the user's ability to demonstrate optimal trajectories a priori. Instead, our learning algorithm explicitly guides the learning process and merely requires the user to incrementally improve the robot's trajectories. From these interactive improvements the robot learns a general model of the user's preferences in an online fashion. We show empirically that a small number of such interactions is sufficient to adapt a robot to a changed task. Since the user does not have to demonstrate a (near) optimal trajectory to the robot, we argue that our feedback is easier to provide and more widely applicable. Nevertheless, we will show that it leads to an online learning algorithm with provable regret bounds that decay at the same rate as if optimal demonstrations were available.\n\nFootnote 1: For more details and a demonstration video, visit: http://pr.cs.cornell.edu/coactive\n\nFigure 1: Zero-G feedback: Learning trajectory preferences from sub-optimal zero-G feedback. (Left) Robot plans a bad trajectory (waypoints 1-2-4) with the knife close to the flower. 
As feedback, the user corrects waypoint 2 and moves it to waypoint 3. (Right) User providing zero-G feedback on waypoint 2.\n\nIn our empirical evaluation, we learn preferences for a high-DoF Baxter robot on a variety of grocery checkout tasks. By designing expressive trajectory features, we show how our algorithm learns preferences from online user feedback on a broad range of tasks for which object properties are of particular importance (e.g., manipulating sharp objects with humans in the vicinity). We extensively evaluate our approach on a set of 16 grocery checkout tasks, both in batch experiments as well as through robotic experiments wherein users provide their preferences on the robot. Our results show that a robot trained using our algorithm not only quickly learns good trajectories on individual tasks, but also generalizes well to tasks that it has not seen before.\n\n2 Related Work\n\nTeaching a robot to produce desired motions has been a long-standing goal and several approaches have been studied. Most of the past research has focused on mimicking an expert's demonstrations, for example, autonomous helicopter flights [1], the ball-in-a-cup experiment [17], planning 2-D paths [27, 25, 26], etc. Such a setting (learning from demonstration, LfD) is applicable to scenarios where it is clear to an expert what constitutes a good trajectory. In many scenarios, especially those involving high-DoF manipulators, this is extremely challenging to do [2] (footnote 2). This is because the users have to give not only the end-effector's location at each time-step, but also the full configuration of the arm in a way that is spatially and temporally consistent. 
In our setting, the user never discloses the optimal trajectory (or provides optimal feedback) to the robot; instead, the robot learns preferences from sub-optimal suggestions on how the trajectory can be improved.\n\nSome later works in LfD provided ways of handling noisy demonstrations, under the assumption that demonstrations are either near-optimal [39] or locally optimal [22]. Providing noisy demonstrations is different from providing relative preferences, which are biased and can be far from optimal. We compare with an algorithm for noisy LfD learning in our experiments. A recent work [37] leverages user feedback to learn the rewards of a Markov decision process. Our approach advances over [37] and Calinon et al. [5] in that it models sub-optimality in user feedback and theoretically converges to the user's hidden score function. We also capture the necessary contextual information for household and assembly-line robots, while such context is absent in [5, 37]. Our application scenario of learning trajectories for high-DoF manipulators performing tasks in the presence of different objects and environmental constraints goes beyond the application scenarios that previous works have considered. We design appropriate features that consider robot configurations, object-object relations, and temporal behavior, and use them to learn a score function representing the preferences in trajectories.\n\nUser preferences have been studied in the field of human-robot interaction. Sisbot et al. [34, 33] and Mainprice et al. [23] planned trajectories satisfying user-specified preferences in the form of constraints on the distance of the robot from the user, the visibility of the robot, and the user's arm comfort. Dragan et al. [8] used functional gradients [29] to optimize for the legibility of robot trajectories. 
We differ from these in that we learn score functions reflecting user preferences from implicit feedback.\n\n3 Learning and Feedback Model\n\nWe model the learning problem in the following way. For a given task, the robot is given a context x that describes the environment, the objects, and any other input relevant to the problem. The robot has to figure out what is a good trajectory y for this context. Formally, we assume that the user has a scoring function s*(x, y) that reflects how much he values each trajectory y for context x. The higher the score, the better the trajectory. Note that this scoring function cannot be observed directly, nor do we assume that the user can actually provide cardinal valuations according to this function. Instead, we merely assume that the user can provide us with preferences that reflect this scoring function. The robot's goal is to learn a function s(x, y; w) (where w are the parameters to be learned) that approximates the user's true scoring function s*(x, y) as closely as possible.\n\nFootnote 2: Consider the following analogy: in search engine results, it is much harder for a user to provide the best web-pages for each query, but it is easier to provide a relative ranking on the search results by clicking.\n\nInteraction Model. The learning process proceeds through the following repeated cycle of interactions between robot and user.\nStep 1: The robot receives a context x. It then uses a planner to sample a set of trajectories, and ranks them according to its current approximate scoring function s(x, y; w).\nStep 2: The user either lets the robot execute the top-ranked trajectory, or corrects the robot by providing an improved trajectory ȳ. This provides feedback indicating that s*(x, ȳ) > s*(x, y).\nStep 3: The robot now updates the parameter w of s(x, y; w) based on this preference feedback and returns to Step 1.\n\nRegret. The robot's performance will be measured in terms of regret, REG_T = (1/T) Σ_{t=1}^{T} [s*(x_t, y*_t) - s*(x_t, y_t)], which compares the robot's trajectory y_t at each time step t against the optimal trajectory y*_t maximizing the user's unknown scoring function s*(x, y), y*_t = argmax_y s*(x_t, y). Note that the regret is expressed in terms of the user's true scoring function s*, even though this function is never observed. Regret characterizes the performance of the robot over its whole lifetime, therefore reflecting how well it performs throughout the learning process. As we will show in the following sections, we employ learning algorithms with theoretical bounds on the regret for scoring functions that are linear in their parameters, making only minimal assumptions about the difference in score between s*(x, ȳ) and s*(x, y) in Step 2 of the learning process.\n\nUser Feedback and Trajectory Visualization. Since the ability to easily give preference feedback in Step 2 is crucial for making the robot learning system easy to use for humans, we designed two feedback mechanisms that enable the user to easily provide improved trajectories.\n(a) Re-ranking: We rank trajectories in order of their current predicted scores and visualize the ranking using OpenRave [7]. The user observes trajectories sequentially and clicks on the first trajectory which is better than the top-ranked trajectory.\n(b) Zero-G: This feedback allows users to improve trajectory waypoints by physically changing the robot's arm configuration as shown in Figure 1. To enable effortless steering of the robot's arm to a desired configuration we leverage Baxter's zero-force gravity-compensation mode; hence we refer to this feedback as zero-G. 
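The interaction cycle and the regret bookkeeping above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: trajectories are stood in for by feature vectors under a linear score, and all function names are our own placeholders.

```python
import numpy as np

def rank_candidates(w, feats):
    """Step 1: order sampled candidate trajectories by the current score w . phi(y).
    `feats` is an (n, d) array with one feature row per candidate trajectory."""
    scores = feats @ w
    order = np.argsort(-scores)  # indices of candidates, best first
    return order, scores

def average_regret(optimal_scores, chosen_scores):
    """REG_T = (1/T) * sum_t [ s*(x_t, y*_t) - s*(x_t, y_t) ], computed here from
    the true scores purely for illustration; in reality s* is never observed."""
    diff = np.asarray(optimal_scores, float) - np.asarray(chosen_scores, float)
    return float(diff.mean())
```

Note that the robot only ever sees the user's improved trajectory ȳ relative to y_t; the regret itself is an analytical quantity, never available to the learner.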
This feedback is useful (i) for bootstrapping the robot, (ii) for avoiding local maxima where the top trajectories in the ranked list are all bad but ordered correctly, and (iii) when the user is satisfied with the top-ranked trajectory except for minor errors. A counterpart of this feedback is keyframe-based LfD [2], where an expert demonstrates a sequence of optimal waypoints instead of the complete trajectory.\n\nNote that in both re-ranking and zero-G feedback, the user never reveals the optimal trajectory to the algorithm but just provides a slightly improved trajectory.\n\n4 Learning Algorithm\n\nFor each task, we model the user's scoring function s*(x, y) with the following parameterized family of functions:\n\ns(x, y; w) = w · φ(x, y)    (1)\n\nw is a weight vector that needs to be learned, and φ(·) are features describing trajectory y for context x. We further decompose the score function into two parts, one only concerned with the objects the trajectory is interacting with, and the other with the object being manipulated and the environment:\n\ns(x, y; w_O, w_E) = s_O(x, y; w_O) + s_E(x, y; w_E) = w_O · φ_O(x, y) + w_E · φ_E(x, y)    (2)\n\nWe now describe the features for the two terms, φ_O(·) and φ_E(·), in the following.\n\n4.1 Features Describing Object-Object Interactions\n\nThis feature captures the interaction between objects in the environment and the object being manipulated. We enumerate the waypoints of trajectory y as y_1, .., y_N and the objects in the environment as O = {o_1, .., o_K}. The robot manipulates the object ō ∈ O. A few of the trajectory waypoints would be affected by the other objects in the environment. For example, in Figure 2, o_1 and o_2 affect the waypoint y_3 because of proximity. Specifically, we connect an object o_k to a trajectory waypoint if the minimum distance to collision is less than a threshold or if o_k lies below ō. 
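The connection rule in the last sentence can be sketched as follows. This is a sketch under stated assumptions: the pairwise minimum distances to collision and the below-relation are assumed precomputed (e.g., by a collision checker), and the names and threshold value are ours, not the paper's.

```python
import numpy as np

def build_edges(min_dist, below, threshold=0.1):
    """Build the edge set E of (waypoint j, object k) pairs: connect o_k to y_j
    when the minimum distance to collision is under a threshold, or when o_k
    lies below the manipulated object.
    min_dist: (N, K) array of waypoint-object distances; below: K booleans."""
    N, K = min_dist.shape
    return [(j, k) for j in range(N) for k in range(K)
            if min_dist[j, k] < threshold or below[k]]
```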
The edge connecting y_j and o_k is denoted as (y_j, o_k) ∈ E.\n\nFigure 2: (Left) A grocery checkout environment with a few objects, where the robot was asked to check out the flowervase on the left to the right. (Middle) There are two ways of moving it, 'a' and 'b'; both are sub-optimal in that the arm is contorted in 'a' but it tilts the vase in 'b'. Given such constrained scenarios, we need to reason about such subtle preferences. (Right) We encode preferences concerned with object-object interactions in a score function expressed over a graph. Here y_1, . . . , y_n are different waypoints in a trajectory. The shaded nodes correspond to the environment (table node not shown here). Edges denote interactions between nodes.\n\nSince it is the attributes [19] of the objects that really matter in determining the trajectory quality, we represent each object by its attributes. Specifically, for every object o_k, we consider a vector of M binary variables [l^1_k, .., l^M_k], with each l^m_k ∈ {0, 1} indicating whether object o_k possesses property m or not. For example, if the set of possible properties is {heavy, fragile, sharp, hot, liquid, electronic}, then a laptop and a glass table can have labels [0, 1, 0, 0, 0, 1] and [0, 1, 0, 0, 0, 0] respectively. The binary variables l^p_k and l^q indicate whether o_k and ō possess properties p and q respectively (footnote 3). Then, for every (y_j, o_k) edge, we extract the following four features φ_oo(y_j, o_k): the projections of the minimum distance to collision along the x, y and z (vertical) axes, and a binary variable that is 1 if o_k lies vertically below ō, and 0 otherwise.\n\nWe now define the score s_O(·) over this graph as follows:\n\ns_O(x, y; w_O) = Σ_{(y_j, o_k) ∈ E} Σ_{p,q=1}^{M} l^p_k l^q [w_pq · φ_oo(y_j, o_k)]    (3)\n\nHere, the weight vector w_pq captures the interaction between objects with properties p and q. We obtain w_O in eq. 
(2) by concatenating the vectors w_pq. More formally, if the vector at position i of w_O is w_uv, then the vector corresponding to position i of φ_O(x, y) is Σ_{(y_j, o_k) ∈ E} l^u_k l^v [φ_oo(y_j, o_k)].\n\n4.2 Trajectory Features\n\nWe now describe the features φ_E(x, y), obtained by performing operations on a set of waypoints. They comprise the following three types of features:\n\nRobot Arm Configurations. While a robot can reach the same operational-space configuration for its wrist with different configurations of the arm, not all of them are preferred [38]. For example, the contorted way of holding the flowervase shown in Figure 2 may be fine at that time instant, but would present problems if our goal is to perform an activity with it, e.g., packing it after checkout. Furthermore, humans like to anticipate the robot's moves, and to gain users' confidence the robot should produce predictable and legible motion [8].\n\nWe compute features capturing the robot's arm configuration using the locations of its elbow and wrist, w.r.t. its shoulder, in a cylindrical coordinate system, (r, θ, z). We divide a trajectory into three parts in time and compute 9 features for each of the parts. These features encode the maximum and minimum r, θ and z values for the wrist and elbow in that part of the trajectory, giving us 6 features. Since joint locks may happen at the limits of the manipulator configuration, we also add 3 features for the location of the robot's elbow whenever the end-effector attains its maximum r, θ and z values respectively. We therefore obtain φ_robot(·) ∈ R^9 (3+3+3=9) features for each one-third part, and φ_robot(·) ∈ R^27 for the complete trajectory.\n\nOrientation and Temporal Behavior of the Object to be Manipulated. Object orientation during the trajectory is crucial in deciding its quality. 
For some tasks, the orientation must be strictly maintained (e.g., moving a cup full of coffee); and for some others, it may be necessary to change it in a particular fashion (e.g., a pouring activity). Different parts of the trajectory may have different requirements over time. For example, in a placing task, we may need to bring the object closer to obstacles and be more careful.\n\nWe therefore divide the trajectory into three parts in time. For each part we store the cosine of the object's maximum deviation, along the vertical axis, from its final orientation at the goal location. To capture the object's oscillation along the trajectory, we obtain a spectrogram for each one-third part for the movement of the object in the x, y, z directions as well as for the deviation along the vertical axis (e.g., Figure 3). We then compute the average power spectral density in the low- and high-frequency parts as eight additional features for each. This gives us 9 (=1+4*2) features for each one-third part. Together with one additional feature of the object's maximum deviation along the whole trajectory, we get φ_obj(·) ∈ R^28 (=9*3+1).\n\nFootnote 3: In this work, our goal is to relax the assumption of unbiased and close-to-optimal feedback. We therefore assume complete knowledge of the environment for our algorithm, and for the algorithms we compare against. In practice, such knowledge can be extracted using an object attribute labeling algorithm such as in [19].\n\nObject-Environment Interactions. This feature captures the temporal variation of the vertical and horizontal distances of the object ō from its surrounding surfaces. In detail, we divide the trajectory into three equal parts, and for each part we compute the object's: (i) minimum vertical distance from the nearest surface below it; (ii) minimum 
(ii) minimum\nhorizontal distance from the surrounding surfaces; and\n(iii) minimum distance from the table, on which the task\nis being performed, and (iv) minimum distance from the\ngoal location. We also take an average, over all the way-\npoints, of the horizontal and vertical distances between\nthe object and the nearest surfaces around it.4 To capture\ntemporal variation of object\u2019s distance from its surround-\ning we plot a time-frequency spectrogram of the object\u2019s\nvertical distance from the nearest surface below it, from\nwhich we extract six features by dividing it into grids.\nThis feature is expressive enough to differentiate whether\nan object just grazes over table\u2019s edge (steep change in vertical distance) versus, it \ufb01rst goes up and\nover the table and then moves down (relatively smoother change). Thus, the features obtained from\nobject-environment interaction are \u03c6obj\u2212env(\u00b7) \u2208 R20 (3*4+2+6=20).\nFinal feature vector is obtained by concatenating \u03c6obj\u2212env, \u03c6obj and \u03c6robot, giving us \u03c6E(\u00b7) \u2208 R75.\n4.3 Computing Trajectory Rankings\nFor obtaining the top trajectory (or a top few) for a given task with context x, we would like to\nmaximize the current scoring function s(x, y; wO, wE).\n\nFigure 3: (Top) A good and bad trajectory\nfor moving a mug. The bad trajectory un-\ndergoes ups-and-downs. (Bottom) Spectro-\ngrams for movement in z-direction: (Right)\nGood trajectory, (Left) Bad trajectory.\n\ns(x, y; wO, wE).\n\n(4)\n\ny\u2217 = arg max\n\ny\n\nNote that this poses two challenges. First, trajectory space is continuous and needs to be discretized\nto maintain argmax in (4) tractable. Second, for a given set {y(1), . . . , y(n)} of discrete trajectories,\nwe need to compute (4). Fortunately, the latter problem is easy to solve and simply amounts to sort-\ning the trajectories by their trajectory scores s(x, y(i); wO, wE). 
Two effective ways of solving the former problem are either discretizing the robot's configuration space or directly sampling trajectories from the continuous space. Both approaches [3, 4, 6, 36] have been studied previously. However, for high-DoF manipulators, sampling-based approaches [4, 6] maintain tractability of the problem, hence we take this approach. More precisely, similar to Berg et al. [4], we sample trajectories using a rapidly-exploring random tree (RRT) [20] (footnote 5). Since our primary goal is to learn a score function on the sampled set of trajectories, we now describe our learning algorithm; for more literature on sampling trajectories we refer the reader to [9].\n\n4.4 Learning the Scoring Function\n\nThe goal is to learn the parameters w_O and w_E of the scoring function s(x, y; w_O, w_E) so that it can be used to rank trajectories according to the user's preferences. To do so, we adapt the Preference Perceptron algorithm [31] as detailed in Algorithm 1. We call this algorithm the Trajectory Preference Perceptron (TPP). Given a context x_t, the top-ranked trajectory y_t under the current parameters w_O and w_E, and the user's feedback trajectory ȳ_t, the TPP updates the weights in the directions φ_O(x_t, ȳ_t) - φ_O(x_t, y_t) and φ_E(x_t, ȳ_t) - φ_E(x_t, y_t) respectively.\n\nDespite its simplicity, and even though the algorithm typically does not receive the optimal trajectory y*_t = argmax_y s*(x_t, y) as feedback, the TPP enjoys guarantees on the regret [31]. We merely need to characterize by how much the feedback improves on the presented ranking, using the following definition of expected α-informative feedback: E_t[s*(x_t, ȳ_t)] ≥ s*(x_t, y_t) +\n\nFootnote 4: We query the PQP collision checker plugin of OpenRave for these distances.\nFootnote 5: When RRT becomes too slow, we switch to a more efficient bidirectional-RRT. 
The cost function (or its approximation) we learn can be fed to trajectory optimizers like CHOMP [29] or optimal planners like RRT* [15] to produce reasonably good trajectories.\n\nAlgorithm 1 Trajectory Preference Perceptron (TPP)\nInitialize w_O^(1) ← 0, w_E^(1) ← 0\nfor t = 1 to T do\n  Sample trajectories {y^(1), ..., y^(n)}\n  y_t = argmax_y s(x_t, y; w_O^(t), w_E^(t))\n  Obtain user feedback ȳ_t\n  w_O^(t+1) ← w_O^(t) + φ_O(x_t, ȳ_t) - φ_O(x_t, y_t)\n  w_E^(t+1) ← w_E^(t) + φ_E(x_t, ȳ_t) - φ_E(x_t, y_t)\nend for\n\nα(s*(x_t, y*_t) - s*(x_t, y_t)) - ξ_t. This definition states that the score of the user feedback ȳ_t should be, in expectation over the user's choices, higher than that of y_t by a fraction α ∈ (0, 1] of the maximum possible range s*(x_t, y*_t) - s*(x_t, y_t). If this condition is not fulfilled due to bias in the feedback, the slack variable ξ_t captures the amount of violation. In this way any feedback can be described by an appropriate combination of α and ξ_t. Using these two parameters, the proof by [31] can be adapted to show that the expected average regret of the TPP is upper bounded by E[REG_T] ≤ O(1/(α√T) + (1/(αT)) Σ_{t=1}^{T} ξ_t) after T rounds of feedback.\n\n5 Experiments and Results\n\nWe now describe our data set, baseline algorithms and the evaluation metrics we use. Following this, we present quantitative results (Section 5.2) and report robotic experiments on Baxter (Section 5.3).\n\n5.1 Experimental Setup\n\nTask and Activity Set for Evaluation. We evaluate our approach on 16 pick-and-place robotic tasks in a grocery store checkout setting. To assess the generalizability of our approach, for each task we train and test on scenarios with different objects being manipulated, and/or with a different environment. 
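In code, one round of the TPP of Algorithm 1 (Section 4.4) reduces to a single additive update per weight vector. The following is a minimal sketch, with trajectory sampling and feature extraction abstracted into callables; all names are ours, not the paper's implementation:

```python
import numpy as np

def tpp_round(w_O, w_E, phi_O, phi_E, sample, get_feedback):
    """One TPP round: present the top-scoring sampled trajectory, then move the
    weights toward the user's improved trajectory ȳ_t and away from the
    presented trajectory y_t. phi_O/phi_E map a trajectory to feature vectors."""
    candidates = sample()  # {y^(1), ..., y^(n)}
    y_t = max(candidates, key=lambda y: phi_O(y) @ w_O + phi_E(y) @ w_E)
    y_bar = get_feedback(y_t)  # slightly improved trajectory from the user
    w_O = w_O + phi_O(y_bar) - phi_O(y_t)
    w_E = w_E + phi_E(y_bar) - phi_E(y_t)
    return w_O, w_E
```

Note how the update never needs the optimal trajectory, only the user's incremental improvement, matching the α-informative feedback model.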
We evaluate the quality of trajectories after the robot has grasped the items and while it moves them for checkout. Our work complements previous works on grasping items [30, 21], pick-and-place tasks [11], and detecting bar codes for grocery checkout [16]. We consider the following three commonly occurring activities in a grocery store:\n1) Manipulation centric: These activities primarily care about the object being manipulated. Hence the object's properties and the way the robot moves it in the environment are more relevant. Examples include moving common objects like a cereal box, Figure 4 (left), or moving fruits and vegetables, which can be damaged when dropped/pushed into other items.\n2) Environment centric: These activities also care about the interactions of the object being manipulated with the surrounding objects. Our object-object interaction features allow the algorithm to learn preferences on trajectories for moving fragile objects like glasses and egg cartons, Figure 4 (middle).\n3) Human centric: Sudden movements by the robot put the human in danger of getting hurt. We consider activities where a robot manipulates sharp objects, e.g., moving a knife with a human in the vicinity, as shown in Figure 4 (right). In previous work, such relations were considered in the context of scene understanding [10, 12].\n\nBaseline algorithms. We evaluate the algorithms that learn preferences from online feedback under two settings: (a) untrained, where the algorithms learn preferences for the new task from scratch without observing any previous feedback; (b) pre-trained, where the algorithms are pre-trained on other similar tasks, and then adapt to the new task. We compare the following algorithms:\n• Geometric: It plans a path, independent of the task, using a BiRRT [20] planner.\n• Manual: It plans a path following certain manually coded preferences.\n• TPP: This is our algorithm. 
We evaluate it under both the untrained and pre-trained settings.\n\nFigure 4: (Left) Manipulation centric: a box of cornflakes doesn't interact much with surrounding items and is indifferent to orientation. (Middle) Environment centric: an egg carton is fragile and should preferably be kept upright and closer to a supporting surface. (Right) Human centric: a knife is sharp and interacts with nearby soft items and humans. It should strictly be kept at a safe distance from humans.\n\n• Oracle-svm: This algorithm leverages the expert's labels on trajectories (hence the name Oracle) and is trained using SVM-rank [13] in a batch manner. This algorithm is not realizable in practice, as it requires labeling of the large space of trajectories. We use it only in the pre-trained setting, and during prediction it just predicts once and does not learn further.\n• MMP-online: This is an online implementation of the Maximum Margin Planning (MMP) [26, 28] algorithm. MMP attempts to make an expert's trajectory better than any other trajectory by a margin, and can be interpreted as a special case of our algorithm with 1-informative feedback. However, adapting MMP to our experiments poses two challenges: (i) we do not have knowledge of the optimal trajectory; and (ii) the state space of the manipulator we consider is too large, and discretizing it makes learning via MMP intractable. We therefore train MMP from online user feedback observed on a set of trajectories, and further treat the observed feedback as optimal. At every iteration we train a structural support vector machine (SSVM) [14] using all previous feedback as training examples, and use the learned weights to predict trajectory scores for the next iteration. Since we learn on a set of trajectories, the argmax operation in SSVM remains tractable. 
We quantify the closeness of trajectories by the l2-norm of the difference in their feature representations, and choose the regularization parameter C for training the SSVM in hindsight, to give an unfair advantage to MMP-online.\n\nEvaluation metrics. In addition to performing a user study on the Baxter robot (Section 5.3), we also designed a data set to quantitatively evaluate the performance of our online algorithm. An expert labeled 1300 trajectories on a Likert scale of 1-5 (where 5 is the best) on the basis of subjective human preferences. Note that these absolute ratings are never provided to our algorithms and are only used for the quantitative evaluation of different algorithms. We quantify the quality of a ranked list of trajectories by its normalized discounted cumulative gain (nDCG) [24] at positions 1 and 3. While nDCG@1 is a suitable metric for autonomous robots that execute the top-ranked trajectory, nDCG@3 is suitable for scenarios where the robot is supervised by humans.\n\nTable 1: Comparison of different algorithms and study of features in the untrained setting. The table contains average nDCG@1 (nDCG@3) values over 20 rounds of feedback.\n\n5.2 Results and Discussion\n\nWe now present the quantitative results on the data set of 1300 labeled trajectories.\n\nHow well does TPP generalize to new tasks? To study the generalization of preference feedback we evaluate the performance of TPP-pre-trained (i.e., the TPP algorithm under the pre-trained setting) on a set of tasks the algorithm has not seen before. We study generalization when: (a) only the object being manipulated changes, e.g., an egg carton replaced by tomatoes; (b) only the surrounding environment changes, e.g., rearranging objects in the environment or changing the start location of tasks; and (c) when both change. Figure 5 shows nDCG@3 plots averaged over tasks for all types of activities (footnote 6). TPP-pre-trained starts off with higher nDCG@3 values than TPP-untrained in all three cases. 
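For reference, nDCG@k on expert ratings can be sketched as below. This is a standard log2-discounted formulation; the exact gain and discount conventions of [24] may differ.

```python
import numpy as np

def ndcg_at_k(ratings_in_predicted_order, k):
    """nDCG@k: DCG of the predicted ranking's top k, normalized by the DCG of
    the ideal (rating-sorted) ranking. Ratings are e.g. 1-5 Likert scores,
    listed in the order the algorithm ranked the trajectories."""
    r = np.asarray(ratings_in_predicted_order, dtype=float)
    disc = 1.0 / np.log2(np.arange(2, k + 2))  # 1 / log2(rank + 1)
    dcg = float(np.sum(r[:k] * disc))
    idcg = float(np.sum(np.sort(r)[::-1][:k] * disc))
    return dcg / idcg if idcg > 0 else 0.0
```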
Further, as more feedback is received, the performance of both algorithms improves, eventually becoming (almost) identical. We further observe that generalizing to tasks with both a new environment and a new object is harder than when only one of them changes.
How does TPP compare to other algorithms? Despite the fact that TPP never observes optimal feedback, it performs better than the baseline algorithms, see Figure 5. It improves over Oracle-svm in fewer than 5 feedbacks; Oracle-svm is never updated, since updating it requires expert labels on the test set and is hence impractical. MMP-online assumes every user feedback is optimal, and over the iterations accumulates many contradictory training examples. This also highlights the sensitivity of MMP to sub-optimal demonstrations. We also compare against planners with manually coded preferences, e.g., keep a flowervase upright. However, some preferences are difficult to specify, e.g., not to move heavy objects over fragile items. We empirically found that the resulting Manual algorithm produces poor trajectories, with an average nDCG@3 of 0.57 over all types of activities.
How helpful are different features? Table 1 shows the performance of the TPP algorithm in the untrained setting using different features. Individually, each feature captures several aspects indicating the goodness of trajectories, and combined together they give the best performance. Object trajectory features capture preferences related to the orientation of the object. Robot arm configuration and object environment features capture preferences by detecting undesirable contorted arm configurations and by maintaining a safe distance from surrounding surfaces, respectively. Object-object features by themselves can only learn, for example, to move an egg carton closer to a supporting surface, but might still move it with jerks or contorted arms. These features can be combined with other features to yield more expressive features. Nevertheless, by themselves they perform better than the Manual algorithm. Table 1 also compares TPP and MMP-online under the untrained setting.

Table 1:
Algorithms                   | Manipulation centric | Environment centric | Human centric | Mean
Geometric                    | 0.45 (0.39)          | 0.31 (0.30)         | 0.46 (0.48)   | 0.40 (0.39)
Manual                       | 0.77 (0.77)          | 0.33 (0.31)         | 0.61 (0.62)   | 0.57 (0.57)
TPP feature: Obj-obj interaction | 0.80 (0.79)      | 0.79 (0.73)         | 0.68 (0.68)   | 0.76 (0.74)
TPP feature: Robot arm config    | 0.78 (0.72)      | 0.80 (0.69)         | 0.82 (0.77)   | 0.80 (0.73)
TPP feature: Object trajectory   | 0.88 (0.84)      | 0.85 (0.72)         | 0.85 (0.81)   | 0.86 (0.79)
TPP feature: Object environment  | 0.75 (0.74)      | 0.81 (0.65)         | 0.70 (0.69)   | 0.75 (0.69)
TPP (all features)           | 0.90 (0.85)          | 0.90 (0.80)         | 0.88 (0.84)   | 0.89 (0.83)
MMP-online                   | 0.54 (0.56)          | 0.33 (0.30)         | 0.47 (0.50)   | 0.45 (0.46)

6 Similar results were obtained with the nDCG@1 metric; we have not included them due to space constraints.

Figure 5: Study of generalization (nDCG@3) with change in object, environment and both: (a) same environment, different object; (b) new environment, same object; (c) new environment, different object. Curves: Manual, Oracle-SVM, Pre-trained MMP-online (—), Untrained MMP-online (– –), Pre-trained TPP (—), Untrained TPP (– –).

5.3 Robotic Experiment: User Study in Learning Trajectories
We perform a user study of our system on the Baxter robot on a variety of tasks of varying difficulty. This shows that our approach is practically realizable, and that the combination of re-rank and zero-G feedback allows users to train the robot with only a few feedbacks.
Experiment setup: In this study, five users (not associated with this work) used our system to train Baxter for grocery checkout tasks, using zero-G and re-rank feedback. Zero-G was provided kinesthetically on the robot, while re-rank was elicited in a simulator (on a desktop computer). A set of 10 tasks of varying difficulty was presented to the users one at a time, and they were instructed to provide feedback until they were satisfied with the top-ranked trajectory. To quantify the quality of learning, each user evaluated their own trajectories (self score), the trajectories learned by the other users (cross score), and those predicted by Oracle-svm, on a Likert scale of 1-5 (where 5 is the best). We also recorded the time a user took for each task, from the start of training until the user was satisfied.
Results from user study. The study shows that each user on average took 3 re-rank and 2 zero-G feedbacks to train Baxter (Table 2). Within 5 feedbacks the users were able to improve over Oracle-svm, Fig. 6 (Left), consistent with our previous analysis. Re-rank feedback was popular for easier tasks, Fig. 6 (Right). However, as difficulty increased the users relied more on zero-G feedback, which allows rectifying erroneous waypoints precisely. An average difference of 0.6 between users' self and cross scores suggests that preferences varied only marginally across the users.

Table 2: Learning statistics for each user, averaged over all tasks. The number in parentheses is the standard deviation.
User | Trajectory quality (self) | Trajectory quality (cross) | # Re-ranking feedback | # Zero-G feedback | Average time (min.)
1    | 3.8 (0.6) | 4.0 (1.4) | 5.4 (4.1) | 3.3 (3.4) | 7.8 (4.9)
2    | 4.3 (1.2) | 3.6 (1.2) | 1.8 (1.0) | 1.7 (1.3) | 4.6 (1.7)
3    | 4.4 (0.7) | 3.2 (1.2) | 2.9 (0.8) | 2.0 (2.0) | 5.0 (2.9)
4    | 3.0 (1.2) | 3.7 (1.0) | 3.2 (2.0) | 1.5 (0.9) | 5.3 (1.9)
5    | 3.5 (1.3) | 3.3 (0.6) | 3.6 (1.0) | 1.9 (2.1) | 5.0 (2.3)

In terms of training time, each user took on average 5.5 minutes per task, which we believe is acceptable for most applications. Future research in human-computer interaction, visualization and better user interfaces [32] could further reduce this time. Despite its limited size, through the user study we show that our algorithm is realizable in practice on high-DoF manipulators. We hope this motivates researchers to build robotic systems capable of learning from non-expert users.
For more details and video, please visit: http://pr.cs.cornell.edu/coactive

Figure 6: (Left) Average quality of the learned trajectory after every one-third of the total feedback. (Right) Bar chart showing the average number of feedbacks and the time required for each task; task difficulty increases from 1 to 10.

6 Conclusion
In this paper we presented a co-active learning framework for training robots to select trajectories that obey a user's preferences. Unlike standard learning-from-demonstration approaches, our framework does not require the user to provide optimal trajectories as training data, but can learn from iterative improvements. Despite requiring only weak feedback, our TPP learning algorithm has provable regret bounds and empirically performs well. In particular, we propose a set of trajectory features for which the TPP generalizes well on tasks the robot has not seen before. In addition to the batch experiments, robotic experiments confirmed that incremental feedback generation is indeed feasible and that it leads to good learning results after only a few iterations.
Acknowledgments. We thank Shikhar Sharma for help with the experiments. This research was supported by ARO, a Microsoft Faculty Fellowship and an NSF CAREER Award (to Saxena).

References
[1] P. Abbeel, A. Coates, and A. Y. Ng. Autonomous helicopter aerobatics through apprenticeship learning. IJRR, 29(13), 2010.
[2] B. Akgun, M. Cakmak, K. Jiang, and A. L. Thomaz. Keyframe-based learning from demonstration. IJSR, 4(4):343–355, 2012.
[3] R. Alterovitz, T. Siméon, and K. Goldberg. The stochastic motion roadmap: A sampling framework for planning with Markov motion uncertainty. In RSS, 2007.
[4] J. V. D. Berg, P. Abbeel, and K. Goldberg. LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information. In RSS, 2010.
[5] S. Calinon, F. Guenter, and A. Billard. On learning, representing, and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man, and Cybernetics, 2007.
[6] D. Dey, T. Y. Liu, M. Hebert, and J. A. Bagnell. Contextual sequence prediction with application to control library optimization. In RSS, 2012.
[7] R. Diankov. Automated Construction of Robotic Manipulation Programs. PhD thesis, CMU, RI, 2010.
[8] A. Dragan and S. Srinivasa. Generating legible motion. In RSS, 2013.
[9] C. J. Green and A. Kelly. Toward optimal sampling in the space of paths. In ISRR, 2007.
[10] Y. Jiang, M. Lim, and A. Saxena. Learning object arrangements in 3d scenes using human context. In ICML, 2012.
[11] Y. Jiang, M. Lim, C. Zheng, and A. Saxena. Learning to place new objects in a scene. IJRR, 31(9), 2012.
[12] Y. Jiang, H. Koppula, and A. Saxena. Hallucinated humans as the hidden context for labeling 3d scenes. In CVPR, 2013.
[13] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.
[14] T. Joachims, T. Finley, and C. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1), 2009.
[15] S. Karaman and E. Frazzoli.
Incremental sampling-based algorithms for optimal motion planning. In RSS, 2010.
[16] E. Klingbeil, D. Rao, B. Carpenter, V. Ganapathi, A. Y. Ng, and O. Khatib. Grasping with application to an autonomous checkout robot. In ICRA, 2011.
[17] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1), 2011.
[18] H. S. Koppula and A. Saxena. Anticipating human activities using object affordances for reactive robotic response. In RSS, 2013.
[19] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3d point clouds for indoor scenes. In NIPS, 2011.
[20] S. M. LaValle and J. J. Kuffner. Randomized kinodynamic planning. IJRR, 20(5):378–400, 2001.
[21] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. In RSS, 2013.
[22] S. Levine and V. Koltun. Continuous inverse optimal control with locally optimal examples. In ICML, 2012.
[23] J. Mainprice, E. A. Sisbot, L. Jaillet, J. Cortés, R. Alami, and T. Siméon. Planning human-aware motions using a sampling-based costmap planner. In ICRA, 2011.
[24] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, 2008.
[25] N. Ratliff. Learning to Search: Structured Prediction Techniques for Imitation Learning. PhD thesis, CMU, RI, 2009.
[26] N. Ratliff, J. A. Bagnell, and M. Zinkevich. Maximum margin planning. In ICML, 2006.
[27] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt. Boosting structured prediction for imitation learning. In NIPS, 2007.
[28] N. Ratliff, D. Silver, and J. A. Bagnell. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1):25–53, 2009.
[29] N. Ratliff, M. Zucker, J. A. Bagnell, and S. Srinivasa. CHOMP: Gradient optimization techniques for efficient motion planning.
In ICRA, 2009.
[30] A. Saxena, J. Driemeyer, and A. Y. Ng. Robotic grasping of novel objects using vision. IJRR, 27(2), 2008.
[31] P. Shivaswamy and T. Joachims. Online structured prediction via coactive learning. In ICML, 2012.
[32] B. Shneiderman and C. Plaisant. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley, 2010.
[33] E. A. Sisbot, L. F. Marin, and R. Alami. Spatial reasoning for human robot interaction. In IROS, 2007.
[34] E. A. Sisbot, L. F. Marin-Urias, R. Alami, and T. Simeon. A human aware mobile robot motion planner. IEEE Transactions on Robotics, 2007.
[35] I. A. Sucan, M. Moll, and L. E. Kavraki. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, 2012. http://ompl.kavrakilab.org.
[36] P. Vernaza and J. A. Bagnell. Efficient high dimensional maximum entropy modeling via symmetric partition functions. In NIPS, 2012.
[37] A. Wilson, A. Fern, and P. Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In NIPS, 2012.
[38] F. Zacharias, C. Schlette, F. Schmidt, C. Borst, J. Rossmann, and G. Hirzinger. Making planned paths look more human-like in humanoid robot manipulation planning. In ICRA, 2011.
[39] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.