{"title": "Scalable Planning with Tensorflow for Hybrid Nonlinear Domains", "book": "Advances in Neural Information Processing Systems", "page_first": 6273, "page_last": 6283, "abstract": "Given recent deep learning results that demonstrate the ability to effectively optimize high-dimensional non-convex functions with gradient descent optimization on GPUs, we ask in this paper whether symbolic gradient optimization tools such as Tensorflow can be effective for planning in hybrid (mixed discrete and continuous) nonlinear domains with high dimensional state and action spaces? To this end, we demonstrate that hybrid planning with Tensorflow and RMSProp gradient descent is competitive with mixed integer linear program (MILP) based optimization on piecewise linear planning domains (where we can compute optimal solutions) and substantially outperforms state-of-the-art interior point methods for nonlinear planning domains. Furthermore, we remark that Tensorflow is highly scalable, converging to a strong plan on a large-scale concurrent domain with a total of 576,000 continuous action parameters distributed over a horizon of 96 time steps and 100 parallel instances in only 4 minutes. We provide a number of insights that clarify such strong performance including observations that despite long horizons, RMSProp avoids both the vanishing and exploding gradient problems. 
Together these results suggest a new frontier for highly scalable planning in nonlinear hybrid domains by leveraging GPUs and the power of recent advances in gradient descent with highly optimized toolkits like Tensorflow.", "full_text": "Scalable Planning with Tensor\ufb02ow for Hybrid\n\nNonlinear Domains\n\nGa Wu\n\nBuser Say\n\nDepartment of Mechanical & Industrial Engineering, University of Toronto, Canada\n\nemail: {wuga,bsay,ssanner}@mie.utoronto.ca\n\nScott Sanner\n\nAbstract\n\nGiven recent deep learning results that demonstrate the ability to effectively opti-\nmize high-dimensional non-convex functions with gradient descent optimization on\nGPUs, we ask in this paper whether symbolic gradient optimization tools such as\nTensor\ufb02ow can be effective for planning in hybrid (mixed discrete and continuous)\nnonlinear domains with high dimensional state and action spaces? To this end, we\ndemonstrate that hybrid planning with Tensor\ufb02ow and RMSProp gradient descent\nis competitive with mixed integer linear program (MILP) based optimization on\npiecewise linear planning domains (where we can compute optimal solutions)\nand substantially outperforms state-of-the-art interior point methods for nonlinear\nplanning domains. Furthermore, we remark that Tensor\ufb02ow is highly scalable,\nconverging to a strong plan on a large-scale concurrent domain with a total of\n576,000 continuous action parameters distributed over a horizon of 96 time steps\nand 100 parallel instances in only 4 minutes. We provide a number of insights that\nclarify such strong performance including observations that despite long horizons,\nRMSProp avoids both the vanishing and exploding gradient problems. 
Together\nthese results suggest a new frontier for highly scalable planning in nonlinear hybrid\ndomains by leveraging GPUs and the power of recent advances in gradient descent\nwith highly optimized toolkits like Tensor\ufb02ow.\n\n1\n\nIntroduction\n\nMany real-world hybrid (mixed discrete continuous) planning problems such as Reservoir Con-\ntrol [Yeh, 1985], Heating, Ventilation and Air Conditioning (HVAC) [Erickson et al., 2009; Agarwal\net al., 2010], and Navigation [Faulwasser and Findeisen, 2009] have highly nonlinear transition and\n(possibly nonlinear) reward functions to optimize. Unfortunately, existing state-of-the-art hybrid\nplanners [Ivankovic et al., 2014; L\u00f6hr et al., 2012; Coles et al., 2013; Piotrowski et al., 2016] are not\ncompatible with arbitrary nonlinear transition and reward models. While HD-MILP-PLAN [Say et\nal., 2017] supports arbitrary nonlinear transition and reward models, it also assumes the availability of\ndata to learn the state-transitions. Monte Carlo Tree Search (MCTS) methods [Coulom, 2006; Kocsis\nand Szepesv\u00e1ri, 2006; Keller and Helmert, 2013] including AlphaGo [Silver et al., 2016] that can use\nany (nonlinear) black box model of transition dynamics do not inherently work with continuous action\nspaces due to the in\ufb01nite branching factor. While MCTS with continuous action extensions such as\nHOOT [Weinstein and Littman, 2012] have been proposed, their continuous partitioning methods do\nnot scale to high-dimensional continuous action spaces (for example, 100\u2019s or 1,000\u2019s of dimensions\nas used in this paper). Finally, of\ufb02ine model-free reinforcement learning (for example, Q-learning)\nwith function approximation [Sutton and Barto, 1998; Szepesv\u00e1ri, 2010] and deep extensions [Mnih\net al., 2013] do not require any knowledge of the (nonlinear) transition model or reward, but they also\ndo not directly apply to domains with high-dimensional continuous action spaces. 
That is, of\ufb02ine\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: The evolution of RMSProp gradient descent based Tensor\ufb02ow planning in a two-\ndimensional Navigation domain with nested central rectangles indicating nonlinearly increasing\nresistance to robot movement. (top) In initial RMSProp epochs, the plan evolves directly towards the\ngoal shown as a star. (bottom) As later epochs of RMSProp descend the objective cost surface, the\nfastest path evolves to avoid the central obstacle entirely.\n\nlearning methods like Q-learning [Watkins and Dayan, 1992] require action maximization for every\nupdate, but in high-dimensional continuous action spaces such nonlinear function maximization is\nnon-convex and computationally intractable at the scale of millions or billions of updates.\nTo address the above scalability and expressivity limitations of existing methods, we turn to Tensor-\n\ufb02ow [Abadi et al., 2015], which is a symbolic computation platform used in the machine learning\ncommunity for deep learning due to its compilation of complex layered symbolic functions into a\nrepresentation amenable to fast GPU-based reverse-mode automatic differentiation [Linnainmaa,\n1970] for gradient-based optimization. Given recent results in gradient descent optimization with deep\nlearning that demonstrate the ability to effectively optimize high-dimensional non-convex functions,\nwe ask whether Tensor\ufb02ow can be effective for planning in discrete time, hybrid (mixed discrete and\ncontinuous) nonlinear domains with high dimensional state and action spaces?\nOur results answer this question af\ufb01rmatively, where we demonstrate that hybrid planning with\nTensor\ufb02ow and RMSProp gradient descent [Tieleman and Hinton, 2012] is surprisingly effective at\nplanning in complex hybrid nonlinear domains1. 
As evidence, we reference \ufb01gure 1, where we show\nTensor\ufb02ow with RMSProp ef\ufb01ciently \ufb01nding and optimizing a least-cost path in a two-dimensional\nnonlinear Navigation domain. In general, Tensor\ufb02ow with RMSProp planning results are competitive\nwith optimal MILP-based optimization on piecewise linear planning domains. The performance\ndirectly extends to nonlinear domains where Tensor\ufb02ow with RMSProp substantially outperforms\ninterior point methods for nonlinear function optimization. Furthermore, we remark that Tensor\ufb02ow\nconverges to a strong plan on a large-scale concurrent domain with 576,000 continuous actions\ndistributed over a horizon of 96 time steps and 100 parallel instances in 4 minutes.\nTo explain such excellent results, we note that gradient descent algorithms such as RMSProp are\nhighly effective for non-convex function optimization that occurs in deep learning. Further, we\nprovide an analysis of many transition functions in planning domains that suggest gradient descent\non these domains will not suffer from either the vanishing or exploding gradient problems, and hence\nprovide a strong signal for optimization over long horizons. 
Together\nthese results suggest a new frontier for highly scalable planning in nonlinear hybrid\ndomains by leveraging GPUs and the power of recent advances in gradient descent\nwith Tensorflow and related toolkits.\n\n2 Hybrid Nonlinear Planning via Tensorflow\n\nIn this section, we present a general framework of hybrid nonlinear planning along with a compilation\nof the objective in this framework to a symbolic recurrent neural network (RNN) architecture with\naction parameter inputs directly amenable to optimization with the Tensorflow toolkit.\n\n2.1 Hybrid Planning\nA hybrid planning problem is a tuple \u27e8S, A, T, R, C\u27e9 with S denoting the (infinite) set of hybrid\nstates with a state represented as a mixed discrete and continuous vector, A the set of actions bounded\nby action constraints C, R : S \u00d7 A \u2192 R the reward function and T : S \u00d7 A \u2192 S the transition\n1The approach in this paper is implemented in Tensorflow, but it is not specific to Tensorflow. While \u201cscalable\nhybrid planning with symbolic representations, auto-differentiation, and modern gradient descent methods for\nnon-convex functions implemented on a GPU\u201d would make for a more general description of our contributions,\nwe felt that \u201cTensorflow\u201d succinctly imparts at least the spirit of all of these points in a single term.\n\n[Figure 1 plot panels omitted; the panel titles mark Epochs 10, 20, 40, 80, 160 and 320.]\n\nFigure 2: A recurrent neural network (RNN) encoding of a hybrid planning problem: A single-step\nreward and transition function of a discrete time decision-process are embedded in an RNN cell.\nRNN inputs correspond to the starting state and action; the outputs correspond to reward and next\nstate. Rewards are additively accumulated in V. 
Since the entire specification of V is a symbolic\nrepresentation in Tensorflow with action parameters as inputs, the sequential action plan can be\ndirectly optimized via gradient descent using the auto-differentiated representation of V.\n\nfunction. There is also an initial state s0 and the planning objective is to maximize the cumulative\nreward over a decision horizon of H time steps. Before proceeding, we outline the necessary notation:\n\n\u2022 st: mixed discrete, continuous state vector at time t.\n\u2022 at: mixed discrete, continuous action vector at time t.\n\u2022 R(st, at): a non-positive reward function (i.e., negated costs).\n\u2022 T (st, at): a (nonlinear) transition function.\n\u2022 $V = \sum_{t=1}^{H} r_t = \sum_{t=0}^{H-1} R(s_t, a_t)$: cumulative reward value to maximize.\n\nIn general, due to the stochastic nature of gradient descent, we will run a number of planning domain\ninstances i in parallel (to take the best performing plan over all instances), so we additionally define\ninstance-specific states and actions:\n\n\u2022 sitj: the jth dimension of the state vector of problem instance i at time t.\n\u2022 aitj: the jth dimension of the action vector of problem instance i at time t.\n\n2.2 Planning through Backpropagation\n\nBackpropagation [Rumelhart et al.] is a standard method for optimizing parameters of large\nmultilayer neural networks via gradient descent. Using the chain rule of derivatives, backpropagation\npropagates the derivative of the output error of a neural network back to each of its parameters in a\nsingle linear time pass in the size of the network using what is known as reverse-mode automatic\ndifferentiation [Linnainmaa, 1970]. 
Despite its relative efficiency, backpropagation in large-scale\n(deep) neural networks is still computationally expensive, and it is only with the advent of recent\nGPU-based symbolic toolkits like Tensorflow [Abadi et al., 2015] that recent advances in training\nvery large deep neural networks have become possible.\nIn this paper, we reverse the idea of training parameters of the network given fixed inputs to instead\noptimize the inputs (i.e., actions) subject to fixed parameters (effectively the transition and reward\nparameterization assumed a priori known in planning). That is, as shown in figure 2, given transition\nT (st, at) and reward function R(st, at), we want to optimize the input at for all t to maximize the\naccumulated reward value V. Specifically, we want to optimize all actions a = (a1, . . . , aH\u22121) w.r.t.\na planning loss L (defined shortly) that we minimize via the following gradient update schema\n\n$$a' = a - \eta \frac{\partial L}{\partial a}, \qquad (1)$$\n\nwhere \u03b7 is the optimization rate and the partial derivatives comprising the gradient based optimization\nin problem instance i are computed as\n\n$$\frac{\partial L}{\partial a_{itj}} = \frac{\partial L}{\partial L_i} \frac{\partial L_i}{\partial a_{itj}} = \frac{\partial L}{\partial L_i} \frac{\partial L_i}{\partial s_{i(t+1)}} \frac{\partial s_{i(t+1)}}{\partial a_{itj}} = \frac{\partial L}{\partial L_i} \left[ \sum_{\tau=t+2}^{T} \frac{\partial L_i}{\partial r_{i\tau}} \frac{\partial r_{i\tau}}{\partial s_{i\tau}} \prod_{\kappa=\tau}^{t+2} \frac{\partial s_{i\kappa}}{\partial s_{i(\kappa-1)}} \right] \frac{\partial s_{i(t+1)}}{\partial a_{itj}}. \qquad (2)$$\n\nWe must now connect our planning objective to a standard Tensorflow loss function. First, however,\nlet us assume that we have N structurally identical instances i of our planning domain given in\nFigure 2, each with objective value Vi; then let us define V = (. . . , Vi, . . .). 
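The update schema of equation (1) can be sketched in miniature. The following pure-Python example is an illustrative assumption, not the paper's setup: a 1-D state, a linear transition s' = s + a, a quadratic distance cost, and plain gradient descent standing in for RMSProp; the hand-coded backward pass plays the role that reverse-mode autodiff plays in equation (2).

```python
# Minimal sketch of planning by gradient descent on the action sequence.
# Assumptions (illustrative, not from the paper): 1-D state, transition
# s_{t+1} = s_t + a_t, cost sum_t (s_t - goal)^2, plain gradient descent.

def plan_by_gradient_descent(s0, goal, horizon, epochs=500, eta=0.05):
    actions = [0.0] * horizon
    for _ in range(epochs):
        # Forward pass: roll the current plan out through the transition.
        states = [s0]
        for a in actions:
            states.append(states[-1] + a)
        # Backward pass: dL/da_{t-1} = sum over tau >= t of 2*(s_tau - goal),
        # since ds_tau/da_{t-1} = 1 for every later state under s' = s + a.
        grads = [0.0] * horizon
        running = 0.0
        for t in range(horizon, 0, -1):
            running += 2.0 * (states[t] - goal)
            grads[t - 1] = running
        # Gradient update schema of equation (1).
        actions = [a - eta * g for a, g in zip(actions, grads)]
    # Final rollout under the optimized plan.
    states = [s0]
    for a in actions:
        states.append(states[-1] + a)
    return actions, states

actions, states = plan_by_gradient_descent(s0=0.0, goal=1.0, horizon=5)
# The optimized plan moves to the goal in the first step and then stays put.
```

With a longer horizon and a nonlinear transition the rollout structure is identical; only the forward and backward passes change, which is exactly what a symbolic toolkit derives automatically.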
In Tensorflow, we\nchoose Mean Squared Error (MSE), which given two continuous vectors Y and Y\u2217 is defined as\n$\mathrm{MSE}(\mathbf{Y}, \mathbf{Y}^*) = \frac{1}{N} \|\mathbf{Y}^* - \mathbf{Y}\|^2$. We specifically choose to minimize L = MSE(0, V) with inputs\nof constant vector 0 and value vector V in order to maximize our value for each instance i; we remark\nthat here we want to independently maximize each non-positive Vi, but minimize each positive $V_i^2$,\nwhich is achieved with MSE. We will further explain the use of MSE in a moment, but first we digress\nto explain why we need to solve multiple problem instances i.\nSince both transition and reward functions are not assumed to be convex, optimization on a domain\nwith such dynamics could result in a local minimum. To mitigate this problem, we use randomly\ninitialized actions in a batch optimization: we optimize multiple mutually independent planning\nproblem instances i simultaneously since the GPU can exploit their parallel computation, and then\nselect the best-performing action sequence among the independent simultaneously solved problem\ninstances. MSE then has the dual effects of optimizing each problem instance i independently and\nproviding fast convergence (faster than optimizing V directly). We remark that simply defining the\nobjective V and the definition of all state variables in terms of predecessor state and action variables\nvia the transition dynamics (back to the known initial state constants) is enough for Tensorflow to\nbuild the symbolic directed acyclic graph (DAG) representing the objective and take its gradient with\nrespect to all free action parameters as shown in (2) using reverse-mode automatic differentiation.\n\n2.3 Planning over Long Horizons\n\nThe Tensorflow compilation of a nonlinear planning problem reflects the same structure as a recurrent\nneural network (RNN) that is commonly used in deep learning. 
The connection here is not superficial\nsince a longstanding difficulty with training RNNs lies in the vanishing gradient problem, that is,\nmultiplying long sequences of gradients in the chain rule usually renders them extremely small and\nmakes them irrelevant for weight updates, especially when using nonlinear transfer functions such\nas a sigmoid. However, in hybrid planning problems, continuous state updates often take the form\nsi(t+1)j = sitj + \u2206 for some \u2206 that is a function of the state and action at time t. Critically, we note that\nthe transfer function here is linear in sitj, which is the largest determiner of si(t+1)j, hence avoiding\nvanishing gradients.\nIn addition, a gradient can explode with the chain rule through backpropagation if the elements of\nthe Jacobian matrix of state transitions are too large. In this case, if the planning horizon is large\nenough, a simple Stochastic Gradient Descent (SGD) optimizer may suffer from overshooting the\noptimum and never converge (as our experiments appear to demonstrate for SGD). The RMSProp\noptimization algorithm has a significant advantage for backpropagation-based planning because of\nits ability to perform gradient normalization that avoids exploding gradients and additionally deals\nwith piecewise gradients [Balduzzi et al., 2016] that arise naturally as conditional transitions in\nmany nonlinear domains (e.g., the Navigation domain of Figure 1 has different piecewise transition\ndynamics depending on the starting region). 
Specifically, instead of naively updating action aitj\nthrough equation 1, RMSProp maintains a decaying root mean squared gradient value G for each\nvariable, which averages over squared gradients of previous epochs\n\n$$G'_{a_{itj}} = 0.9\, G_{a_{itj}} + 0.1 \left( \frac{\partial L}{\partial a_{itj}} \right)^2, \qquad (3)$$\n\nand updates each action variable through\n\n$$a'_{itj} = a_{itj} - \frac{\eta}{\sqrt{G_{a_{itj}} + \epsilon}} \frac{\partial L}{\partial a_{itj}}. \qquad (4)$$\n\nHere, the resulting normalized update is relatively small and consistent over iterations. Although the Adagrad [Duchi\net al., 2011] and Adadelta [Zeiler, 2012] optimization algorithms have similar mechanisms, their\nlearning rate could quickly reduce to an extremely small value when encountering large gradients. In\nsupport of these observations, we note the superior performance of RMSProp in Section 3.\n\n2.4 Handling Constrained and Discrete Actions\n\nIn most hybrid planning problems, there exist natural range constraints for actions. To handle those\nconstraints, we use projected stochastic gradient descent. Projected stochastic gradient descent\n(PSGD) is a well-known descent method that can handle constrained optimization problems by\nprojecting the parameters (actions) into a feasible range after each gradient update. To this end, we\nclip all actions to their feasible range after each epoch of gradient descent.\nFor planning problems with discrete actions, we use a one-hot encoding for optimization purposes\nand then use a {0, 1} projection for the maximal action to feed into the forward propagation. In this\npaper, we focus on constrained continuous actions, which are representative of many hybrid nonlinear\nplanning problems in the literature.\n\n3 Experiments\n\nIn this section, we introduce our three benchmark domains and then validate Tensorflow planning\nperformance in the following steps: 
(1) We evaluate the optimality of the Tensorflow backpropagation\nplanning on linear and bilinear domains through comparison with the optimal solution given by\nMixed Integer Linear Programming (MILP). (2) We evaluate the performance of Tensorflow\nbackpropagation planning on nonlinear domains (that MILPs cannot handle) through comparison\nwith the Matlab-based interior point nonlinear solver FMINCON. (3) We investigate the impact of\nseveral popular gradient descent optimizers on planning performance. (4) We evaluate optimization\nof the learning rate. (5) We investigate how other state-of-the-art hybrid planners perform.\n\n3.1 Domain Descriptions\n\nNavigation: The Navigation domain is designed to test the optimization ability of Tensorflow\nin a relatively small environment that supports transitions of varying complexity. Navigation has a\ntwo-dimensional state of the agent location s and a two-dimensional action a. Both the state and\naction spaces are continuous and constrained by their maximum and minimum boundaries separately.\nThe objective of the domain is for an agent to move to the goal state as soon as possible (cf. 
figure 1).\nTherefore, we compute the reward based on the Manhattan distance from the agent to the goal state at\neach time step as $R(s_t, a_t) = -\|s_t - g\|_1$, where g is the goal state.\nWe designed three different transitions; from left to right, nonlinear, bilinear and linear:\n\n$$d_t = \|s_t - z\|_2, \quad \lambda = \frac{2}{1 + \exp(-2 d_t)} - 0.99 \qquad (5)$$\n\n$$d_t = \|s_t - z\|_1 = \sum_{j=1}^{2} |s_{tj} - z_j|, \quad \lambda = \begin{cases} d_t/4, & d_t < 4 \\ 1, & d_t \geq 4 \end{cases} \qquad (6)$$\n\n$$d_t = \|s_t - z\|_1, \quad \lambda = \begin{cases} 0.05, & d_t < 0.8 \\ 0.2, & 0.8 \leq d_t < 1.6 \\ 0.4, & 1.6 \leq d_t < 2.4 \\ 0.6, & 2.4 \leq d_t < 3.6 \\ 0.8, & 3.6 \leq d_t < 4 \\ 1, & d_t \geq 4 \end{cases} \qquad (7)$$\n\nIn all three cases, the proposed next state is $p = s_t + \lambda a_t$ and the transition $T(s_t, a_t) = \max(l, \min(u, p))$ clips it to the domain boundaries.\n\nThe nonlinear transition has a velocity reduction zone based on its Euclidean distance to the center z.\nHere, dt is the distance from the deceleration zone z, p is the proposed next state, \u03bb is the velocity\nreduction factor, and u, l are the upper and lower boundaries of the domain, respectively.\nThe bilinear domain is designed to compare with MILP where domain discretization is possible. In\nthis setting, we evaluate the efficacy of approximately discretizing bilinear planning problems into\nMILPs. Equation 6 shows the bilinear transition function.\nThe linear domain is the discretized version of the bilinear domain used for MILP optimization. We\nalso test Tensorflow on this domain to see the optimality of the Tensorflow solution. Equation 7\nshows the linear transition function.\n\nReservoir Control: Reservoir Control [Yeh, 1985] is a system to control multiple connected\nreservoirs. 
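Stepping back to the Navigation transitions of equations (5)-(7), here is a compact sketch of the three velocity-attenuation factors as we read them; the function names, 2-D coordinates, and scalar bounds are our own illustrative choices:

```python
# Sketch of the Navigation velocity reduction factors (equations (5)-(7));
# names and sample points are illustrative, not from the paper's code.
import math

def lam_nonlinear(s, z):
    d = math.dist(s, z)                      # Euclidean distance to center z
    return 2.0 / (1.0 + math.exp(-2.0 * d)) - 0.99

def lam_bilinear(s, z):
    d = sum(abs(sj - zj) for sj, zj in zip(s, z))  # Manhattan distance
    return d / 4.0 if d < 4.0 else 1.0

def lam_linear(s, z):
    # Discretized (piecewise constant) version of the bilinear factor.
    d = sum(abs(sj - zj) for sj, zj in zip(s, z))
    for bound, lam in [(0.8, 0.05), (1.6, 0.2), (2.4, 0.4), (3.6, 0.6), (4.0, 0.8)]:
        if d < bound:
            return lam
    return 1.0

def transition(s, a, lam_fn, z, lower, upper):
    # p = s + lambda * a, clipped elementwise to the domain boundaries.
    lam = lam_fn(s, z)
    return [min(upper, max(lower, sj + lam * aj)) for sj, aj in zip(s, a)]
```

Near the center the factor is tiny (movement is almost fully absorbed), which is exactly the flat-gradient region discussed for the linear variant later in the experiments.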
Each of the reservoirs in the system has a single state sj \u2208 R that denotes the water level\nof the reservoir j and a corresponding action to permit a flow aj \u2208 [0, sj] from the reservoir to the\nnext downstream reservoir.\nThe objective of the domain is to maintain the target water level of each reservoir in a safe range and\nas close to half of its capacity as possible. Therefore, we compute the reward through:\n\n$$c_j = \begin{cases} 0, & L_j \leq s_j \leq U_j \\ -5, & s_j < L_j \\ -100, & s_j > U_j \end{cases} \qquad R(s_t, a_t) = -\left\| c - 0.1 \cdot \left| \frac{u - l}{2} - s_t \right| \right\|_1,$$\n\nwhere cj is the cost value of Reservoir j that penalizes water levels outside a safe range.\nIn this domain, we introduce two settings: namely, Nonlinear and Linear. For the nonlinear domain,\nnonlinearity due to the water loss ej for each reservoir j includes water usage and evaporation. The\ntransition function is\n\n$$e_t = 0.5 \cdot s_t \odot \sin\left(\frac{s_t}{m}\right), \qquad T(s_t, a_t) = s_t + r_t - e_t - a_t + a_t \Sigma, \qquad (8)$$\n\nwhere \u2299 represents an elementwise product, r is a rain quantity parameter, m is the maximum\ncapacity of the largest tank, and \u03a3 is a lower triangular adjacency matrix that indicates connections to\nupstream reservoirs.\nFor the linear domain, we only replace the nonlinear function of water loss by a linear function:\n\n$$e_t = 0.1 \cdot s_t, \qquad T(s_t, a_t) = s_t + r_t - e_t - a_t + a_t \Sigma. \qquad (9)$$\n\nUnlike Navigation, we do not limit the state dimension of the whole system to two dimensions. In\nthe experiments, we use a domain setting of a network with 20 reservoirs.\n\nHVAC: Heating, Ventilation, and Air Conditioning [Erickson et al., 2009; Agarwal et al., 2010] is\na centralized control problem, with concurrent controls of multiple rooms and multiple connected\nbuildings. 
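Returning to the reservoir dynamics, one step of equation (8) can be sketched as follows; the 3-reservoir chain, the `downstream` list encoding of the routing matrix \u03a3, and every numeric value are illustrative assumptions:

```python
# Sketch of one step of the nonlinear reservoir transition (equation (8)).
# The chain topology and all parameter values are illustrative assumptions.
import math

def reservoir_step(s, a, rain, m, downstream):
    # downstream[i] = index of the reservoir fed by reservoir i (or None);
    # this plays the role of the lower triangular matrix Sigma in (8).
    # Water loss e_j = 0.5 * s_j * sin(s_j / m) models usage + evaporation.
    e = [0.5 * sj * math.sin(sj / m) for sj in s]
    # s' = s + r - e - a, then route each released flow a_i downstream.
    nxt = [sj + rj - ej - aj for sj, rj, ej, aj in zip(s, rain, e, a)]
    for i, j in enumerate(downstream):
        if j is not None:
            nxt[j] += a[i]
    return nxt

levels = reservoir_step([100.0] * 3, [10.0] * 3, [5.0] * 3, 200.0, [1, 2, None])
```

Water released from one reservoir reappears in its downstream neighbor, so total water is conserved except for rain inflow, evaporation, and the final outflow.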
For each room j there is a state variable sj denoting the temperature and an action aj for\nsending the specified volume of heated air to each room j via vent actuation.\nThe objective of the domain is to maintain the temperature of each room in a comfortable range and\nconsume as little energy as possible in doing so. Therefore, we compute the reward through:\n\n$$d_t = \left| \frac{u - l}{2} - s_t \right|, \qquad e_t = a_t \cdot C, \qquad R(s_t, a_t) = -\| e_t + d_t \|_1,$$\n\nwhere C is the unit electricity cost.\nSince thermal models for HVAC are inherently nonlinear, we only present one version with a nonlinear\ntransition function:\n\n$$\theta_t = a_t \odot (F^{vent} - s_t), \qquad \phi_t = \Big( s_t Q - s_t \odot \sum_{j=1}^{J} q_j \Big) / w^q,$$\n$$\vartheta_t = (F^{out}_t - s_t) \odot o / w^o, \qquad \psi_t = (F^{hall}_t - s_t) \odot h / w^h,$$\n$$T(s_t, a_t) = s_t + \alpha \cdot (\theta_t + \phi_t + \vartheta_t + \psi_t), \qquad (10)$$\n\nwhere $F^{vent}$, $F^{out}_t$ and $F^{hall}_t$ are temperatures of the room vent, outside and hallway, respectively.\nQ, o and h are respectively the adjacency matrix of rooms, the adjacency vector of outside areas, and the\nadjacency vector of hallways. $w^q$, $w^o$ and $w^h$ are the thermal resistances between rooms, with the outside\nwalls, and with the hallway walls, respectively.\nIn the experiments, we work with a building layout with five floors and 12 rooms on each floor for a\ntotal of 60 rooms. For scalability testing, we apply batched backpropagation on 100 instances of this\ndomain simultaneously, for which there are 576,000 actions to plan concurrently.\n\n3.2 Planning Performance\n\nIn this section, we investigate the performance of Tensorflow optimization through comparison with\nthe MILP on linear domains and with Matlab's fmincon nonlinear interior point solver on nonlinear\ndomains. We ran our experiments on an Ubuntu Linux system with one E5-1620 v4 CPU, 16GB RAM,\nand one GTX1080 GPU. 
The Tensor\ufb02ow version is beta 0.12.1, the Matlab version is R2016b, and\nthe MILP version is IBM ILOG CPLEX 12.6.3.\n\n3.2.1 Performance in Linear Domains\n\n(a) Navigation Linear\n\n(b) Navigation Bilinear\n\n(c) Reservoir Linear\n\nFigure 3: The total reward comparison (values are negative, lower bars are better) among Tensor\ufb02ow\n(Red), MILP optimization guided planning (Green) and domain-speci\ufb01c heuristic policy (Blue). Error\nbars show standard deviation across the parallel Tensor\ufb02ow instances; most are too small to be visible.\nThe heuristic policy is a manually designed baseline solution. In the linear domains (a) and (c), the\nMILP is optimal and Tensor\ufb02ow is near-optimal for \ufb01ve out of six domains.\n\nIn Figure 3, we show that Tensor\ufb02ow backpropagation results in lower cost plans than domain-speci\ufb01c\nheuristic policies, and the overall cost is close to the MILP-optimal solution in \ufb01ve of six linear\ndomains.\nWhile Tensor\ufb02ow backpropagation planning generally shows strong performance, when comparing\nthe performance of Tensor\ufb02ow on bilinear and linear domains of Navigation to the MILP solution\n(recall that the linear domain was discretized from the bilinear case), we notice that Tensor\ufb02ow does\nmuch better relative to the MILP on the bilinear domain than the discretized linear domain. The\nreason for this is quite simple: gradient optimization of smooth bilinear functions is actually much\neasier for Tensor\ufb02ow than the piecewise linear discretized version which has large piecewise steps that\nmake it hard for RMSProp to get a consistent and smooth gradient signal. We additionally note that\nthe standard deviation of the linear navigation domain is much larger than the others. 
This is because\nthe piecewise constant transition function computing the speed reduction factor \u03bb provides a flat loss\nsurface with no curvature to aid gradient descent methods, leading to high variation depending on the\ninitial random starting point in the instance.\n\n3.2.2 Performance in Nonlinear Domains\n\nIn figure 4, we show Tensorflow backpropagation planning always achieves the best performance\ncompared to the heuristic solution and the Matlab nonlinear optimizer fmincon. For relatively simple\ndomains like Navigation, we see the fmincon nonlinear solver provides a very competitive solution,\nwhile, for the complex domain HVAC with a large concurrent action space, the fmincon solver shows\na complete failure at solving the problem in the given time period.\nIn figure 5(a), Tensorflow backpropagation planning shows 16 times faster optimization in the first\n15s, which is close to the result given by fmincon at 4 mins. In figure 5(b), the optimization speed of\n\n(a) Navigation Nonlinear\n\n(b) Reservoir Nonlinear\n\n(c) HVAC Nonlinear\n\nFigure 4: The total reward comparison (values are negative, lower bars are better) among Tensorflow\nbackpropagation planning (Red), Matlab nonlinear solver fmincon guided planning (Purple) and\ndomain-specific heuristic policy (Blue). 
We gathered the results after 16 minutes of optimization time\nto allow all algorithms to converge to their best solution.\n\n(a) Reservoir, Horizon 60\n\n(b) Reservoir, Horizon 120\n\nFigure 5: Optimization comparison between Tensorflow RMSProp gradient planning (Green) and\nMatlab nonlinear solver fmincon interior point optimization planning (Orange) on Nonlinear Reservoir\nDomains with Horizon (a) 60 and (b) 120. As a function of the logarithmic time x-axis, Tensorflow is\nsubstantially faster and more optimal than fmincon.\n\nTensorflow shows it to be hundreds of times faster than the fmincon nonlinear solver to achieve the\nsame value (if fmincon does ever reach it). These remarkable results demonstrate the power of fast\nparallel GPU computation of the Tensorflow framework.\n\n3.2.3 Scalability\n\nIn table 1, we show the scalability of Tensorflow backpropagation planning via the running times\nrequired to converge for different domains. The results demonstrate the extreme efficiency with\nwhich Tensorflow can converge on exceptionally large nonlinear hybrid planning domains.\n\nDomain | Dim | Horizon | Batch | Actions | Time\nNav.   |   2 |     120 |   100 |  24,000 | < 1 min\nRes.   |  20 |     120 |   100 | 240,000 | 4 mins\nHVAC   |  60 |      96 |   100 | 576,000 | 4 mins\n\nTable 1: Timing evaluation of the largest instances of the three domains we tested. All of these tests\nwere performed on the nonlinear versions of the respectively named domains.\n\n3.2.4 Optimization Methods\n\nIn this experiment, we investigate the effects of different backpropagation optimizers. In figure 6(a),\nwe show that the RMSProp optimizer provides exceptionally fast convergence among the five standard\noptimizers of Tensorflow. This observation reflects the previous analysis and discussion concerning\nequation (4) that RMSProp manages to avoid exploding gradients. 
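The normalized update of equations (3) and (4), combined with the action clipping of Section 2.4, can be sketched in scalar form; the function name, decay constants 0.9/0.1 (from equation (3)), and toy bounds are illustrative:

```python
# Sketch of one RMSProp update (equations (3)-(4)) with the projection
# (clipping) step of Section 2.4. Scalar form and names are illustrative.
import math

def rmsprop_step(a, grad, G, eta=0.01, eps=1e-8, lo=-1.0, hi=1.0):
    G = 0.9 * G + 0.1 * grad * grad            # equation (3): decaying RMS
    a = a - eta * grad / math.sqrt(G + eps)    # equation (4): normalized step
    return max(lo, min(hi, a)), G              # project action onto [lo, hi]

# Even an enormous gradient produces a bounded step: with G starting at 0
# the step magnitude is at most eta / sqrt(0.1), regardless of |grad|.
a1, G1 = rmsprop_step(0.0, 1e6, 0.0)
a2, G2 = rmsprop_step(1.0, -1e6, 0.0)   # update pushes past hi, gets clipped
```

This bounded-step property is one way to see why the method resists the exploding gradients discussed above, while an unnormalized SGD step of `eta * grad` would overshoot by orders of magnitude.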
As mentioned, although Adagrad\nand Adadelta have similar mechanisms, their normalization methods may cause vanishing gradients\nafter several epochs, which corresponds to our observation of nearly flat curves for these methods.\nThis is a strong indicator that exploding gradients are a significant concern for hybrid planning with\ngradient descent and that RMSProp performs well despite this well-known potential problem for\ngradients over long horizons.\n\n(a)\n\n(b)\n\nFigure 6: (a) Comparison of Tensorflow gradient methods in the HVAC domain. All of these\noptimizers use the same learning rate of 0.001.\n(b) Optimization learning rate comparison of\nTensorflow with the RMSProp optimizer on the HVAC domain. The optimization rate 0.1 (Orange) gave\nthe fastest initial convergence speed but was not able to reach the best score that optimization rate\n0.001 (Blue) found.\n\n3.2.5 Optimization Rate\n\nIn figure 6(b), we show the best learning optimization rate for the HVAC domain is 0.01 since this\nrate converges to near-optimal extremely fast. The overall trend is that smaller optimization rates have a\nbetter opportunity to reach a better final optimization solution, but can be extremely slow as shown for\noptimization rate 0.001. Hence, while larger optimization rates may cause overshooting the optima,\nrates that are too small may simply converge too slowly for practical use. 
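The overshoot-versus-slow-convergence trade-off can be illustrated on a one-dimensional quadratic; this toy function and the particular rates are illustrative, not one of the paper's domains:

```python
# Gradient descent on f(x) = x^2: the update multiplies the error by
# (1 - 2*eta), so eta > 1 overshoots and diverges, while a tiny eta
# barely moves in a fixed step budget.

def descend(eta, x0=1.0, steps=50):
    x = x0
    for _ in range(steps):
        x = x - eta * 2.0 * x     # gradient of x^2 is 2x
    return abs(x)

# eta = 1.2 diverges, eta = 0.4 converges quickly, eta = 0.001 barely moves.
```

The same arithmetic, with the transition Jacobian in place of the constant curvature 2, is what makes the usable rate range domain dependent.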
This suggests a critical need to tune the optimization rate per planning domain.

3.3 Comparison to State-of-the-art Hybrid Planners

Finally, we discuss and test the scalability of state-of-the-art hybrid planners on our hybrid domains. We note that neither DiNo [Piotrowski et al., 2016], dReal [Bryce et al., 2015], nor SMTPlan [Cashmore et al., 2016] supports general metric optimization. We ran ENHSP [Scala et al., 2016] on a much smaller version of the HVAC domain with only 2 rooms over multiple horizon settings. We found that ENHSP returned a feasible solution to the instance with horizon 2 in 31 seconds, whereas the remaining instances with larger horizons timed out under a one-hour limit.

4 Conclusion

We investigated the practical feasibility of using the Tensorflow toolbox to do fast, large-scale planning in hybrid nonlinear domains. We worked with a direct symbolic (nonlinear) planning domain compilation to Tensorflow, for which we optimized planning actions directly through gradient-based backpropagation. We then investigated planning over long horizons, suggested that RMSProp avoids both the vanishing and exploding gradient problems, and presented experiments to corroborate this finding. Our key empirical results demonstrated that Tensorflow with RMSProp is competitive with MILPs on linear domains (where the optimal solution is known, indicating near optimality of Tensorflow and RMSProp for these non-convex functions) and strongly outperforms Matlab's state-of-the-art interior point optimizer on nonlinear domains, optimizing up to 576,000 actions in under 4 minutes.
These results suggest a new frontier for highly scalable planning in nonlinear hybrid domains by leveraging GPUs and the power of recent advances in gradient descent such as RMSProp with highly optimized toolkits like Tensorflow.

For future work, we plan to further investigate Tensorflow-based planning improvements for domains with discrete action and state variables as well as difficult domains with only terminal rewards that provide little gradient signal guidance to the optimizer.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

Yuvraj Agarwal, Bharathan Balaji, Rajesh Gupta, Jacob Lyles, Michael Wei, and Thomas Weng. Occupancy-driven energy management for smart building automation. In Proceedings of the 2nd ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building, pages 1–6. ACM, 2010.

David Balduzzi, Brian McWilliams, and Tony Butler-Yeoman.
Neural Taylor approximations: Convergence and exploration in rectifier networks. arXiv preprint arXiv:1611.02345, 2016.

Daniel Bryce, Sicun Gao, David Musliner, and Robert Goldman. SMT-based nonlinear PDDL+ planning. In 29th AAAI, pages 3247–3253, 2015.

Michael Cashmore, Maria Fox, Derek Long, and Daniele Magazzeni. A compilation of the full PDDL+ language into SMT. In ICAPS, pages 79–87, 2016.

Amanda Jane Coles, Andrew Coles, Maria Fox, and Derek Long. A hybrid LP-RPG heuristic for modelling numeric resource flows in planning. J. Artif. Intell. Res. (JAIR), 46:343–412, 2013.

Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72–83. Springer Berlin Heidelberg, 2006.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Varick L. Erickson, Yiqing Lin, Ankur Kamthe, Rohini Brahme, Alberto E. Cerpa, Michael D. Sohn, and Satish Narayanan. Energy efficient building environment control strategies using real-time occupancy measurements. In Proceedings of the 1st ACM Workshop On Embedded Sensing Systems For Energy-Efficient Buildings (BuildSys 2009), pages 19–24, Berkeley, CA, USA, November 2009. ACM.

Timm Faulwasser and Rolf Findeisen. Nonlinear Model Predictive Path-Following Control. In Nonlinear Model Predictive Control - Towards New Challenging Applications, Lecture Notes in Control and Information Sciences, pages 335–343. Springer, Berlin, Heidelberg, 2009.

Franc Ivankovic, Patrik Haslum, Sylvie Thiebaux, Vikas Shivashankar, and Dana Nau.
Optimal planning with global numerical state constraints. In International Conference on Automated Planning and Scheduling (ICAPS), pages 145–153, Portsmouth, New Hampshire, USA, June 2014.

Thomas Keller and Malte Helmert. Trial-based heuristic tree search for finite horizon MDPs. In Proceedings of the 23rd International Conference on Automated Planning and Scheduling, ICAPS 2013, Rome, Italy, June 10-14, 2013, 2013.

Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML-06), pages 282–293, 2006.

Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, pages 6–7, 1970.

Johannes Löhr, Patrick Eyerich, Thomas Keller, and Bernhard Nebel. A planning based framework for controlling hybrid systems. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, ICAPS 2012, Atibaia, São Paulo, Brazil, June 25-29, 2012, 2012.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.

Wiktor Mateusz Piotrowski, Maria Fox, Derek Long, Daniele Magazzeni, and Fabio Mercorio. Heuristic planning for hybrid systems. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 4254–4255, 2016.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1.

Buser Say, Wu Ga, Yu Qing Zhou, and Scott Sanner. Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming.
In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 750–756, 2017.

Enrico Scala, Patrik Haslum, Sylvie Thiébaux, and Miquel Ramírez. Interval-based relaxation for general numeric planning. In ECAI, pages 655–663, 2016.

David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.

Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010.

Tijmen Tieleman and Geoffrey E Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, May 1992.

Ari Weinstein and Michael L. Littman. Bandit-based planning and learning in continuous-action Markov decision processes. In Proceedings of the Twenty-Second International Conference on Automated Planning and Scheduling, ICAPS 2012, Atibaia, São Paulo, Brazil, June 25-29, 2012, 2012.

William G Yeh. Reservoir management and operations models: A state-of-the-art review. Water Resources Research, 21(12):1797–1818, 1985.

Matthew D Zeiler. Adadelta: an adaptive learning rate method.
arXiv preprint arXiv:1212.5701, 2012.