{"title": "Reinforcement Learning for Solving the Vehicle Routing Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 9839, "page_last": 9849, "abstract": "We present an end-to-end framework for solving the Vehicle Routing Problem (VRP) using reinforcement learning. In this approach, we train a single policy model that finds near-optimal solutions for a broad range of problem instances of similar size, only by observing the reward signals and following feasibility rules. We consider a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. On capacitated VRP, our approach outperforms classical heuristics and Google's OR-Tools on medium-sized instances in solution quality with comparable computation time (after training). We demonstrate how our approach can handle problems with split delivery and explore the effect of such deliveries on the solution quality. Our proposed framework can be applied to other variants of the VRP such as the stochastic VRP, and has the potential to be applied more generally to combinatorial optimization problems", "full_text": "Reinforcement Learning for Solving the\n\nVehicle Routing Problem\n\nMohammadreza Nazari Afshin Oroojlooy Martin Tak\u00e1\u02c7c Lawrence V. Snyder\n\nDepartment of Industrial and Systems Engineering\n\nLehigh University, Bethlehem, PA 18015\n\n{mon314,afo214,takac,lvs2}@lehigh.edu\n\nAbstract\n\nWe present an end-to-end framework for solving the Vehicle Routing Problem\n(VRP) using reinforcement learning. 
In this approach, we train a single policy\nmodel that \ufb01nds near-optimal solutions for a broad range of problem instances of\nsimilar size, only by observing the reward signals and following feasibility rules.\nWe consider a parameterized stochastic policy, and by applying a policy gradient\nalgorithm to optimize its parameters, the trained model produces the solution as\na sequence of consecutive actions in real time, without the need to re-train for\nevery new problem instance. On capacitated VRP, our approach outperforms\nclassical heuristics and Google\u2019s OR-Tools on medium-sized instances in solution\nquality with comparable computation time (after training). We demonstrate how\nour approach can handle problems with split delivery and explore the effect of such\ndeliveries on the solution quality. Our proposed framework can be applied to other\nvariants of the VRP such as the stochastic VRP, and has the potential to be applied\nmore generally to combinatorial optimization problems.\n\n1\n\nIntroduction\n\nThe Vehicle Routing Problem (VRP) is a combinatorial optimization problem that has been studied\nin applied mathematics and computer science for decades. VRP is known to be a computationally\ndif\ufb01cult problem for which many exact and heuristic algorithms have been proposed, but providing\nfast and reliable solutions is still a challenging task. In the simplest form of the VRP, a single\ncapacitated vehicle is responsible for delivering items to multiple customer nodes; the vehicle must\nreturn to the depot to pick up additional items when it runs out. The objective is to optimize a set of\nroutes, all beginning and ending at a given node, called the depot, in order to attain the maximum\npossible reward, which is often the negative of the total vehicle distance or average service time. 
This\nproblem is computationally dif\ufb01cult to solve to optimality, even with only a few hundred customer\nnodes [12], and is classi\ufb01ed as an NP-hard problem. For an overview of the VRP, see, for example,\n[15, 23, 24, 33].\nThe prospect of new algorithm discovery, without any hand-engineered reasoning, makes neural\nnetworks and reinforcement learning a compelling choice that has the potential to be an important\nmilestone on the path toward solving these problems. In this work, we develop a framework with\nthe capability of solving a wide variety of combinatorial optimization problems using Reinforcement\nLearning (RL) and show how it can be applied to solve the VRP. For this purpose, we consider the\nMarkov Decision Process (MDP) formulation of the problem, in which the optimal solution can be\nviewed as a sequence of decisions. This allows us to use RL to produce near-optimal solutions by\nincreasing the probability of decoding \u201cdesirable\u201d sequences.\nA naive approach would be to train an instance-speci\ufb01c policy by considering every instance separately.\nIn this approach, an RL algorithm needs to take many samples, maybe millions of them, from the\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\funderlying MDP of the problem to be able to produce a good-quality solution. Obviously, this\napproach is not practical since the RL method should be comparable to existing algorithms not only\nin terms of the solution quality but also in terms of runtime. For example, for all of the problems\nstudied in this paper, we wish to have a method that can produce near-optimal solutions in less than a\nsecond. 
Moreover, the policy learned by this naive approach would not apply to instances other than\nthe one that was used in the training; after a small perturbation of the problem setting, e.g., changing\nthe location or demand of a customer, we would need to rebuild the policy from scratch.\nTherefore, rather than focusing on training a separate policy for every problem instance, we propose\na structure that performs well on any problem sampled from a given distribution. This means that if\nwe generate a new VRP instance with the same number of nodes and vehicle capacity, and the same\nlocation and demand distributions as the ones that we used during training, then the trained policy\nwill work well, and we can solve the problem right away, without retraining for every new instance.\nAs long as we approximate the generating distribution of the problem, the framework can be applied.\nOne can view the trained policy as a black-box heuristic (or a meta-algorithm) which generates a\nhigh-quality solution in a reasonable amount of time.\nThis study is motivated by the recent work by Bello et al. [4]. We have generalized their framework\nto include a wider range of combinatorial optimization problems such as the VRP. Bello et al. [4]\npropose the use of a Pointer Network [34] to decode the solution. One major issue that complicates\nthe direct use of their approach for the VRP is that it assumes the system is static over time. In\ncontrast, in the VRP, the demands change over time in the sense that once a node has been visited its\ndemand becomes, effectively, zero. To overcome this, we propose an alternate approach\u2014which is\nsimpler than the Pointer Network approach\u2014that can ef\ufb01ciently handle both the static and dynamic\nelements of the system. Our policy model consists of a recurrent neural network (RNN) decoder\ncoupled with an attention mechanism. 
At each time step, the embeddings of the static elements are\nthe input to the RNN decoder, and the output of the RNN and the dynamic element embeddings are\nfed into an attention mechanism, which forms a distribution over the feasible destinations that can be\nchosen at the next decision point.\nThe proposed framework is appealing to practitioners since we utilize a self-driven learning procedure\nthat only requires the reward calculation based on the generated outputs; as long as we can observe\nthe reward and verify the feasibility of a generated sequence, we can learn the desired meta-algorithm.\nFor instance, if one does not know how to solve the VRP but can compute the cost of a given\nsolution, then one can provide the signal required for solving the problem using our method. Unlike\nmost classical heuristic methods, it is robust to problem changes, e.g., when a customer changes\nits demand value or relocates to a different position, it can automatically adapt the solution. Using\nclassical heuristics for VRP, the entire distance matrix must be recalculated and the system must\nbe re-optimized from scratch, which is often impractical, especially if the problem size is large.\nIn contrast, our proposed framework does not require an explicit distance matrix, and only one\nfeed-forward pass of the network will update the routes based on the new data.\nOur numerical experiments indicate that our framework performs signi\ufb01cantly better than well-known\nclassical heuristics designed for the VRP, and that it is robust in the sense that its worst results are still\nrelatively close to optimal. Comparing our method with the OR-Tools VRP engine [16], which is one\nof the best open-source VRP solvers, we observe a noticeable improvement; in VRP instances with\n50 and 100 customers, our method provides shorter tours in roughly 61% of the instances. 
Another\ninteresting observation that we make in this study is that by allowing multiple vehicles to supply the\ndemand of a single node, our RL-based framework \ufb01nds policies that outperform the solutions that\nrequire single deliveries. We obtain this appealing property, known as the split delivery, without any\nhand engineering and at no extra cost.\n\n2 Background\n\nBefore presenting the problem formalization, we brie\ufb02y review the required notation and relation to\nexisting work.\n\nSequence-to-Sequence Models Sequence-to-Sequence models [32, 34, 25] are useful in tasks for\nwhich a mapping from one sequence to another is required. They have been extensively studied in\nthe \ufb01eld of neural machine translation over the past several years, and there are numerous variants\n\n2\n\n\fof these models. The general architecture, which is shared by many of these models, consists of\ntwo RNN networks called the encoder and decoder. An encoder network reads through the input\nsequence and stores the knowledge in a \ufb01xed-size vector representation (or a sequence of vectors);\nthen, a decoder converts the encoded information back to an output sequence.\nIn the vanilla Sequence-to-Sequence architecture [32], the source sequence appears only once in the\nencoder and the entire output sequence is generated based on one vector (i.e., the last hidden state\nof the encoder RNN). Other extensions, for example Bahdanau et al. [3], illustrate that the source\ninformation can be used more explicitly to increase the amount of information during the decoding\nsteps. In addition to the encoder and decoder networks, they employ another neural network, namely\nan attention mechanism that attends to the entire encoder RNN states. This mechanism allows the\ndecoder to focus on the important locations of the source sequence and use the relevant information\nduring decoding steps for producing \u201cbetter\u201d output sequences. 
Recently, the concept of attention has\nbeen a popular research idea due to its capability to align different objects, e.g., in computer vision\n[6, 39, 40, 18] and neural machine translation [3, 19, 25]. In this study, we also employ a special\nattention structure for the policy parameterization. See Section 3.3 for a detailed discussion of the\nattention mechanism.\n\nNeural Combinatorial Optimization Over the last several years, multiple methods have been\ndeveloped to tackle combinatorial optimization problems by using recent advances in arti\ufb01cial\nintelligence. The \ufb01rst attempt was proposed by Vinyals et al. [34], who introduce the concept of\na Pointer Network, a model originally inspired by sequence-to-sequence models. Because it is\ninvariant to the length of the encoder sequence, the Pointer Network enables the model to apply to\ncombinatorial optimization problems, where the output sequence length is determined by the source\nsequence. They use the Pointer Network architecture in a supervised fashion to \ufb01nd near-optimal\nTraveling Salesman Problem (TSP) tours from ground truth optimal (or heuristic) solutions. This\ndependence on supervision prohibits the Pointer Network from \ufb01nding better solutions than the ones\nprovided during the training.\nClosest to our approach, Bello et al. [4] address this issue by developing a neural combinatorial\noptimization framework that uses RL to optimize a policy modeled by a Pointer Network. Using\nseveral classical combinatorial optimization problems such as TSP and the knapsack problem, they\nshow the effectiveness and generality of their architecture.\nOn a related topic, Dai et al. [11] solve optimization problems over graphs using a graph embedding\nstructure [10] and a deep Q-learning (DQN) algorithm [26]. Even though VRP can be represented\nby a graph with weighted nodes and edges, their proposed approach does not directly apply since in\nVRP, a particular node (e.g. 
the depot) might be visited multiple times.\nNext, we introduce our model, which is a simplified version of the Pointer Network.\n\n3 The Model\n\nIn this section, we formally define the problem and our proposed framework for a generic combinatorial optimization problem with a given set of inputs X = {x^i, i = 1, \u00b7\u00b7\u00b7 , M}. We allow some of the elements of each input to change between the decoding steps, which is, in fact, the case in many problems such as the VRP. The dynamic elements might be an artifact of the decoding procedure itself, or they can be imposed by the environment. For example, in the VRP, the remaining customer demands change over time as the vehicle visits the customer nodes; or we might consider a variant in which new customers arrive or adjust their demand values over time, independent of the vehicle decisions. Formally, we represent each input x^i by a sequence of tuples {x^i_t = (s^i, d^i_t), t = 0, 1, \u00b7\u00b7\u00b7 }, where s^i and d^i_t are the static and dynamic elements of the input, respectively, and can themselves be tuples. One can view x^i_t as a vector of features that describes the state of input i at time t. For instance, in the VRP, x^i_t gives a snapshot of customer i, where s^i corresponds to the 2-dimensional coordinate of customer i\u2019s location and d^i_t is its demand at time t. We will denote the set of all input states at a fixed time t with Xt.\nWe start from an arbitrary input in X0, where we use the pointer y0 to refer to that input. At every decoding time t (t = 0, 1, \u00b7\u00b7\u00b7 ), yt+1 points to one of the available inputs Xt, which determines the input of the next decoder step; this process continues until a termination condition is satisfied. The termination condition is problem-specific, showing that the generated sequence satisfies the feasibility constraints. 
For instance, in the VRP that we consider in this work, the terminating condition is that there is no more demand to satisfy. This process will generate a sequence of length T, Y = {yt, t = 0, ..., T}, possibly with a different sequence length compared to the input length M. This is due to the fact that, for example, the vehicle may have to go back to the depot several times to refill. We also use the notation Yt to denote the decoded sequence up to time t, i.e., Yt = {y0, \u00b7\u00b7\u00b7 , yt}.\nWe are interested in finding a stochastic policy \u03c0 which generates the sequence Y in a way that minimizes a loss objective while satisfying the problem constraints. The optimal policy \u03c0\u2217 will generate the optimal solution with probability 1. Our goal is to make \u03c0 as close to \u03c0\u2217 as possible. Similar to Sutskever et al. [32], we use the probability chain rule to decompose the probability of generating sequence Y, i.e., P(Y|X0), as follows:\n\nP(Y|X0) = \u220f_{t=0}^{T} \u03c0(yt+1|Yt, Xt),   (1)\n\nand\n\nXt+1 = f(yt+1, Xt)   (2)\n\nis a recursive update of the problem representation with the state transition function f. Each component in the right-hand side of (1) is computed by the attention mechanism, i.e.,\n\n\u03c0(\u00b7|Yt, Xt) = softmax(g(ht, Xt)),   (3)\n\nwhere g is an affine function that outputs an input-sized vector, and ht is the state of the RNN decoder that summarizes the information of previously decoded steps y0, \u00b7\u00b7\u00b7 , yt. We will describe the details of our proposed attention mechanism in Section 3.3.\nRemark 1: This structure can handle combinatorial optimization problems in both a more classical static setting as well as in dynamically changing ones. In static combinatorial optimization, X0 fully defines the problem that we are trying to solve. 
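As a side note, the greedy roll-out implied by the decomposition in (1)–(3) can be sketched in a few lines of Python. Here `policy_step`, `transition`, and `terminated` are hypothetical stand-ins for \u03c0(\u00b7|Yt, Xt), the update f in (2), and the problem-specific termination condition; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def decode(X0, policy_step, transition, terminated, max_steps=1000):
    """Greedy roll-out of the chain-rule decomposition:
    P(Y|X0) = prod_t pi(y_{t+1} | Y_t, X_t)."""
    X, Y = X0, []
    for _ in range(max_steps):
        probs = policy_step(Y, X)      # distribution over available inputs
        y = int(np.argmax(probs))      # greedy decoder: most probable pointer
        Y.append(y)
        X = transition(y, X)           # X_{t+1} = f(y_{t+1}, X_t)
        if terminated(X):
            break
    return Y
```

A beam-search decoder would replace the single argmax with the k most probable partial sequences at every step.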
For example, in the VRP, X0 includes all customer locations as well as their demands, and the depot location; then, the remaining demands are updated with respect to the vehicle destination and its load. With this consideration, often there exists a well-defined Markovian transition function f, as defined in (2), which is sufficient to update the dynamics between decision points. However, our framework can also be applied to problems in which the state transition function is unknown and/or is subject to external noise, since the training does not explicitly make use of the transition function. Nevertheless, knowing this transition function helps in simulating the environment that the training algorithm interacts with. See Appendix C.6 for an example of how to handle a stochastic version of the VRP in which random customers with random demands appear over time.\n\n3.1 Limitations of Pointer Networks\n\nAlthough the framework proposed by Bello et al. [4] works well on problems such as the knapsack problem and TSP, it does not extend efficiently to more complicated combinatorial optimization problems in which the system representation varies over time, such as the VRP. Bello et al. [4] feed a random sequence of inputs to the RNN encoder. Figure 1 illustrates with an example why using the RNN in the encoder is restrictive. Suppose that at the first decision step, the policy sends the vehicle to customer 1, and as a result, its demand is satisfied, i.e., d^1_0 \u2260 d^1_1. Then in the second decision step, we need to re-calculate the whole network with the new d^1_1 information in order to choose the next customer. The dynamic elements complicate the forward pass of the network since there should be encoder/decoder updates when an input changes. The situation is even worse during back-propagation to accumulate the gradients since we need to remember when the dynamic elements changed. 
In order to resolve this complication, we require the policy model to be invariant to the input sequence so that changing the order of any two inputs does not affect the network. In Section 3.2, we present a simple network that satisfies this property.\n\n3.2 The Proposed Neural Network Model\n\nWe argue that the RNN encoder adds extra complication but is actually not necessary, and the approach can be made much more general by omitting it. RNNs are necessary only when the inputs convey sequential information; e.g., in text translation the combination of words and their relative position must be captured in order for the translation to be accurate. But the question here is, why do we need to have them in the encoder for combinatorial optimization problems when there is no meaningful order in the input set? As an example, in the VRP, the inputs are the set of unordered customer locations with their respective demands, and their order is not meaningful; any random permutation contains the same information as the original inputs.\n\nFigure 1: Limitation of the Pointer Network. After a change in dynamic elements (d^1_1 in this example), the whole Pointer Network must be updated to compute the probabilities in the next decision point.\n\nFigure 2: Our proposed model. The embedding layer maps the inputs to a high-dimensional vector space. On the right, an RNN decoder stores the information of the decoded sequence. Then, the RNN hidden state and embedded input produce a probability distribution over the next input using the attention mechanism.\n\n
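As an aside, this order-invariance is easy to sanity-check: with a shared per-node embedding, permuting the inputs merely permutes the rows of the embedded set. The linear map and the sizes below are illustrative only, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))        # illustrative shared embedding weights

def embed(nodes):
    # Each (x, y, demand) row is embedded independently of the others,
    # so reordering the inputs only reorders the rows of the output.
    return nodes @ W

nodes = rng.normal(size=(5, 3))    # 5 customers: 2-D location + demand
perm = rng.permutation(5)
assert np.allclose(embed(nodes)[perm], embed(nodes[perm]))
```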
Therefore, in our model, we simply leave out the encoder RNN and directly use the embedded inputs instead of the RNN hidden states. By this modification, many of the computational complications disappear, without decreasing the efficiency. In Appendix A, we provide experiments to verify this claim.\nAs illustrated in Figure 2, our model is composed of two main components. The first is a set of graph embeddings [30] that can be used to encode structured data inputs. Among the available techniques, we tried a one-layer Graph Convolutional Network [21] embedding, but it did not show any improvement in the results, so we kept the embedding in this paper simple by utilizing the local information at each node, e.g., its coordinates and demand values, without incorporating adjacency information. In fact, this embedding maps each customer\u2019s information into a D-dimensional vector space encoding. We might have multiple embeddings corresponding to different elements of the input, but they are shared among the inputs. The second component is a decoder that points to an input at every decoding step. As is common in the literature [3, 32, 7], we use an RNN to model the decoder network. Notice that we feed the static elements as the inputs to the decoder network. The dynamic elements can also be an input to the decoder, but our experiments on the VRP do not suggest any improvement by doing so. For this reason, the dynamic elements are used only in the attention layer, described next.\n\n3.3 Attention Mechanism\n\nAn attention mechanism is a differentiable structure for addressing different parts of the input. Figure 2 illustrates the attention mechanism employed in our method. At decoder step t, we utilize a context-based attention mechanism with glimpse, similar to Vinyals et al. [35], which extracts the relevant information from the inputs using a variable-length alignment vector at. 
In other words, at specifies how much every input data point might be relevant in the next decoding step t.\nLet \u00afx^i_t = (\u00afs^i, \u00afd^i_t) be the embedded input i, and ht \u2208 R^D be the memory state of the RNN cell at decoding step t. The alignment vector at is then computed as\n\nat = at(\u00afxt, ht) = softmax(ut), where u^i_t = v_a^T tanh(Wa[\u00afx^i_t; ht]).   (4)\n\nHere \u201c;\u201d means the concatenation of two vectors. We compute the conditional probabilities by combining the context vector ct, computed as\n\nct = \u2211_{i=1}^{M} a^i_t \u00afx^i_t,   (5)\n\nwith the embedded inputs, and then normalizing the values with the softmax function, as follows:\n\n\u03c0(\u00b7|Yt, Xt) = softmax(\u02dcut), where \u02dcu^i_t = v_c^T tanh(Wc[\u00afx^i_t; ct]).   (6)\n\nIn (4)\u2013(6), va, vc, Wa and Wc are trainable variables.\nRemark 2: Model Symmetry: Vinyals et al. [35] discuss an extension of sequence-to-sequence models where they empirically demonstrate that in tasks with no obvious input sequence, such as sorting, the order in which the inputs are fed into the network matters. A similar concern arises when using Pointer Networks for combinatorial optimization problems. However, the policy model proposed in this paper does not suffer from such a complication since the embeddings and the attention mechanism are invariant to the input order.\n\n3.4 Training Method\n\nTo train the network, we use well-known policy gradient approaches. To use these methods, we parameterize the stochastic policy \u03c0 with parameters \u03b8, where \u03b8 is the vector of all trainable variables used in the embedding, decoder, and attention mechanism. Policy gradient methods use an estimate of the gradient of the expected return with respect to the policy parameters to iteratively improve the policy. 
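For concreteness, one decoding step of the attention in (4)–(6) can be sketched numerically as follows; the parameter shapes are one plausible choice, and `softmax` and `attention_step` are illustrative helper names rather than the paper's implementation:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def attention_step(xbar, h, Wa, va, Wc, vc):
    """One decoding step of the attention in (4)-(6).
    xbar: (M, D) embedded inputs; h: (D,) decoder memory state."""
    M, D = xbar.shape
    H = np.repeat(h[None, :], M, axis=0)
    u = np.tanh(np.concatenate([xbar, H], axis=1) @ Wa) @ va    # eq. (4)
    a = softmax(u)                                              # alignment a_t
    c = a @ xbar                                                # eq. (5): context c_t
    C = np.repeat(c[None, :], M, axis=0)
    util = np.tanh(np.concatenate([xbar, C], axis=1) @ Wc) @ vc
    return softmax(util)                                        # eq. (6): pi(.|Y_t, X_t)
```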
In principle, the policy gradient algorithm contains two networks: (i) an actor network that predicts a probability distribution over the next action at any given decision step, and (ii) a critic network that estimates the reward for any problem instance from a given state. Our training methods are quite standard, and due to space limitations we leave the details to the Appendix.\n\n4 Computational Experiment\n\nMany variants of the VRP have been extensively studied in the operations research literature. See, for example, the reviews by Laporte [23], Laporte et al. [24], or the book by Toth and Vigo [33] for different variants of the problem. In this section, we consider a specific capacitated version of the problem in which one vehicle with a limited capacity is responsible for delivering items to many geographically distributed customers with finite demands. When the vehicle\u2019s load runs out, it returns to the depot to refill. We will denote the vehicle\u2019s remaining load at time t as lt. The objective is to minimize the total route length while satisfying all of the customer demands. This problem is often called the capacitated VRP (CVRP) to distinguish it from other variants, but we will refer to it simply as the VRP.\nWe assume that the node locations and demands are randomly generated from a fixed distribution. Specifically, the customer and depot locations are randomly generated in the unit square [0, 1] \u00d7 [0, 1]. For simplicity of exposition, we assume that the demand of each node is a discrete number in {1, ..., 9}, chosen uniformly at random. We note, however, that the demand values can be generated from any distribution, including continuous ones.\nWe assume that the vehicle is located at the depot at time 0, so the first input to the decoder is an embedding of the depot location. At each decoding step, the vehicle chooses from among the customer nodes or the depot to visit in the next step. 
After visiting customer node i, the demands and vehicle load are updated as follows:\n\nd^i_{t+1} = max(0, d^i_t \u2212 lt),   d^k_{t+1} = d^k_t for k \u2260 i,   and   lt+1 = max(0, lt \u2212 d^i_t),   (7)\n\nwhich is an explicit definition of the state transition function (2) for the VRP. Once a sequence of the nodes to be visited is sampled, we compute the total vehicle distance and use its negative value as the reward signal.\nIn this experiment, we have employed two different decoders: (i) greedy, in which at every decoding step, the node (either customer or depot) with the highest probability is selected as the next destination, and (ii) beam search (BS), which keeps track of the most probable paths and then chooses the one with the minimum tour length [28]. Our results indicate that by applying the beam search algorithm, the quality of the solutions can be improved with only a slight increase in computation time.\nFor faster training and generating feasible solutions, we have used a masking scheme which sets the log-probabilities of infeasible solutions to \u2212\u221e or forces a solution if a particular condition is satisfied. In the VRP, we use the following masking procedures: (i) nodes with zero demand are not allowed to be visited; (ii) all customer nodes will be masked if the vehicle\u2019s remaining load is exactly 0; and (iii) the customers whose demands are greater than the current vehicle load are masked. Notice that under this masking scheme, the vehicle must satisfy all of a customer\u2019s demands when visiting it. We note, however, that if the situation being modeled does allow split deliveries, one can relax (iii). Indeed, the relaxed masking allows split deliveries, so the solution can allocate the demands of a given customer into multiple routes. This property is, in fact, an appealing behavior that is present in many real-world applications, but is often neglected in classical VRP algorithms. 
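A minimal sketch of the transition (7) together with masking rules (i)–(iii), assuming node 0 is the depot and using illustrative function names (not the paper's code):

```python
import numpy as np

def step(i, demands, load):
    """State transition (7): serve node i, then update demands and load.
    min/subtraction below is equivalent to the max(0, .) form in (7)."""
    d = list(demands)
    served = min(d[i], load)
    d[i] -= served
    return d, load - served

def feasibility_mask(demands, load, depot=0):
    """Masking rules (i)-(iii) as an additive log-probability mask."""
    mask = np.zeros(len(demands))
    for i, d in enumerate(demands):
        if i == depot:
            continue                  # the depot itself is never masked here
        if d == 0 or load == 0 or d > load:
            mask[i] = -np.inf         # (i) zero demand, (ii) empty vehicle,
                                      # (iii) demand exceeds load (no split)
    return mask
```

Relaxing rule (iii) amounts to dropping the `d > load` condition, which permits split deliveries.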
In all the experiments of the next section, we do not allow demands to be split. Further investigation and illustrations of this property are included in Appendix C.2\u2013C.4.\n\n4.1 Results\n\nIn this section, we compare the solutions found using our framework with those obtained from the Clarke-Wright savings heuristic (CW), the Sweep heuristic (SW), and Google\u2019s optimization tools (OR-Tools). We run our tests on problems with 10, 20, 50 and 100 customer nodes and corresponding vehicle capacities of 20, 30, 40 and 50; for example, VRP10 consists of 10 customers, and the default vehicle capacity is 20 unless otherwise specified. The results are based on 1000 instances, sampled for each problem size.\n\nFigure 3: Parts 3a and 3b show the optimality gap (in percent) using different algorithms/solvers for VRP10 and VRP20 ((a) Comparison for VRP10; (b) Comparison for VRP20). Parts 3c and 3d give the proportion of the samples for which the algorithms in the rows outperform those in the columns ((c) Comparison for VRP50; (d) Comparison for VRP100); for example, RL-BS(5) is superior to RL-greedy in 85.8% of the VRP50 instances.\n\nFigure 3 shows the distribution of total tour lengths generated by our method, using greedy and BS decoders, with the number inside the parentheses indicating the beam-width parameter. In the experiments, we label our method with the \u201cRL\u201d prefix. In addition, we also implemented a randomized version of both heuristic algorithms to improve the solution quality; for Clarke-Wright, the numbers inside the parentheses are the randomization depth and randomization iterations parameters; and for Sweep, it is the number of random initial angles for grouping the nodes. Finally, we use Google\u2019s OR-Tools [16], which is a more competitive baseline. 
See Appendix B for a detailed discussion of the baselines.\n\n[Figure 3 appears here: box plots of the optimality gap for VRP10 and VRP20 (parts 3a and 3b) and pairwise win-rate tables for VRP50 and VRP100 (parts 3c and 3d).]\n\nFor small problems of VRP10 and VRP20, it is possible to find the optimal solution, which we do by solving a mixed integer formulation of the VRP [33]. Figures 3a and 3b measure how far the solutions are from optimality. The optimality gap is defined as the distance from the optimal objective value, normalized by the latter. 
We observe that using a beam width of 10 yields the best-performing method; roughly 95% of the instances are at most 10% away from optimality for VRP10, and at most 13% for VRP20. Even the outliers are within 20–25% of optimality, suggesting that our RL-BS methods are robust in comparison to the other baseline approaches.

Since obtaining the optimal objective values for VRP50 and VRP100 is not computationally affordable, in Figures 3c and 3d we compare the algorithms in terms of their winning rate. Each table gives the percentage of the instances in which the algorithms in the rows outperform those in the columns. In other words, the cell corresponding to (A, B) shows the percentage of the samples in which algorithm A provides shorter tours than B. We observe that the classical heuristics are outperformed by the other approaches in almost 100% of the samples. Moreover, RL-greedy is comparable with OR-Tools, but incorporating beam search into our framework increases the winning rate of our approach to above 60%.

Figure 4 shows the solution times normalized by the number of customer nodes. We observe that this ratio stays almost the same for RL with different decoders. In contrast, the run time for the Clarke-Wright and Sweep heuristics increases faster than linearly with the number of nodes. This observation is one motivation for applying our framework to more general combinatorial problems, since it suggests that our method scales well. Even though the greedy Clarke-Wright and basic Sweep heuristics are fast for small instances, they do not provide competitive solutions. Moreover, for larger problems, our framework is faster than the randomized heuristics. We also include the solution times for OR-Tools in the graph, but we should note that OR-Tools is implemented in C++, which makes exact time comparisons impossible, since the other baselines are implemented in Python.
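The winning-rate tables of Figures 3c and 3d can be computed from per-instance tour lengths as follows (a sketch; the function and argument names are ours):

```python
import numpy as np

def winning_rates(lengths):
    """Pairwise winning rates as in Figures 3c and 3d.

    `lengths` maps algorithm name -> array of tour lengths over the same test
    instances; cell (A, B) is the percentage of instances in which A produces
    a strictly shorter tour than B."""
    names = list(lengths)
    m = np.zeros((len(names), len(names)))
    for a, na in enumerate(names):
        for b, nb in enumerate(names):
            if a != b:
                m[a, b] = 100.0 * np.mean(lengths[na] < lengths[nb])
    return names, m
```

Note that the (A, B) and (B, A) cells need not sum to 100% when ties occur.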
It is worthwhile to mention that the runtimes reported for the RL methods are for the case when we decode a single problem at a time. It is also possible to decode all 1000 test problems in a batch, which results in an approximately 50× speedup. For example, one-by-one decoding of VRP10 for 1000 instances takes around 50 seconds, but by passing all 1000 instances to the decoder at once, the total decoding time decreases to around 1 second on a K80 GPU.

Active search is another method, used by [4], to assist the RL training on a specific problem instance in order to iteratively search for a better solution. We do not believe that active search is a practical tool for our problem. One reason is that it is very time-consuming. A second is that we intend to provide a solver that produces solutions by just scoring a trained policy, while active search requires a separate training phase for every instance. To test our conjecture that active search will not be effective for our problem, we implemented active search for VRP10 with samples of size 640 and 1280; the average route lengths were 4.78 and 4.77, with 15s and 31s solution times per instance, which are far worse than the BS decoders. Note that BS(5) and BS(10) give 4.72 and 4.68 in less than 0.1s. For this reason, we exclude active search from our comparisons.

Figure 4: Ratio of solution time to the number of customer nodes using different algorithms.

Figure 5: Trained for VRP100 and tested for VRP90-VRP110.

One desired property of the method is that it should be able to handle variable problem sizes. To test this property, we designed two experiments. In the first experiment, we used the trained policy for VRP100 and evaluated its performance on VRP90-VRP110. As can be seen in Figure 5, our method with BS(10) outperforms OR-Tools on all problem sizes.
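A generic beam-search decoder of the kind labeled RL-BS above can be sketched as follows; the `log_prob` policy interface is our assumption for illustration, not the paper's code. Unlike greedy decoding (the special case of width 1), beam search keeps several partial tours alive and can recover from an early low-probability step.

```python
def beam_search(log_prob, start, n_steps, width):
    """Keep the `width` highest-scoring partial action sequences.

    `log_prob(seq)` returns a dict {action: log-probability} for the next
    step given the partial sequence `seq`."""
    beams = [([start], 0.0)]            # (sequence, cumulative log-prob)
    for _ in range(n_steps):
        candidates = []
        for seq, score in beams:
            for action, lp in log_prob(seq).items():
                candidates.append((seq + [action], score + lp))
        # Prune to the best `width` sequences by total log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0][0]                  # most likely complete sequence
```

Per step this expands width × |actions| candidates, so BS(10) costs roughly 10× a greedy decode, consistent with the sub-0.1s times reported above.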
In the second experiment, we test the generalization when the problems are very different. More specifically, we use the models trained for VRP50-Cap40 and VRP50-Cap50 in order to generate a solution for VRP100-Cap50. Using BS(10), the average tour lengths are 18.00 and 17.80, which is still better than the classical heuristics, but worse than OR-Tools. Overall, these two experiments suggest that when the problems are close in terms of the number of customers and the vehicle capacity, it is reasonable to expect a near-optimal solution, but we will see a degradation when the testing problems are very different from the training ones.

4.2 Extension to Other VRPs

The proposed framework can be extended easily to problems with multiple depots; one only needs to construct the corresponding state transition function and masking procedure. It is also possible to include various side constraints: soft constraints can be applied by penalizing the rewards, and hard constraints such as time windows can be enforced through a masking scheme. However, designing such a scheme might be a challenging task, possibly harder than solving the optimization problem itself. Another interesting extension is to VRPs with multiple vehicles. In the simplest case, in which the vehicles travel independently, one must only design a shared masking scheme to prevent the vehicles from pointing to the same customer nodes.
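To illustrate the masking idea in the single-vehicle capacitated setting (a sketch under our own naming; since demands are not split in these experiments, a customer is masked out when its demand exceeds the remaining load):

```python
import numpy as np

def feasibility_mask(demands, remaining_load, visited):
    """Illustrative capacitated-VRP mask over the action space.

    A customer is a feasible next action only if it has not yet been served
    and its full demand fits in the vehicle's remaining load (no split
    deliveries). Returning to the depot (index 0) is always allowed."""
    demands = np.asarray(demands, dtype=float)
    feasible = (~np.asarray(visited)) & (demands > 0) & (demands <= remaining_load)
    return np.concatenate(([True], feasible))   # prepend the depot action
```

Time windows or a shared multi-vehicle scheme would add further conditions to `feasible`, leaving the policy network unchanged.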
Incorporating competition or collaboration among the vehicles is also an interesting line of research that relates to multi-agent RL (MARL) [5].

This framework can also be applied to real-time services, including on-demand deliveries and taxis. In Appendix C.6, we design an experiment to illustrate the performance of the algorithm on a VRP where both the customer locations and their demands are subject to change. Our results indicate performance superior to that of the baselines.

5 Discussion and Conclusion

According to the findings of this paper, our RL algorithm is competitive with state-of-the-art VRP heuristics, and this represents progress toward solving the VRP with RL for real applications. The fact that we can solve similar-sized instances without retraining for every new instance makes it easy to deploy our method in practice. For example, a vehicle equipped with a processor can use the trained model and solve its own VRP, only by performing a sequence of pre-defined arithmetic operations. Moreover, unlike many classical heuristics, our proposed method scales well as the problem size increases, and it has superior performance with competitive solution time. It does not require a distance matrix calculation, which might be computationally cumbersome, especially in dynamically changing VRPs. One important aspect that is usually neglected by the classical heuristics is that, in the real world, one or more elements of the VRP are stochastic. In this paper, we also illustrate that the proposed RL-based method can be applied to a more complicated stochastic version of the VRP.
In summary, we expect that the proposed architecture has significant potential to be used in real-world problems, with further improvements and extensions that incorporate other realistic constraints.

Noting that the proposed algorithm is not limited to the VRP, an important topic of future research is to apply it to other combinatorial optimization problems, such as bin packing and job-shop or flow-shop scheduling. This method is quite appealing since the only requirements are a verifier to check the feasibility of solutions and a reward signal to indicate how well the policy is performing. Once the trained policy is available, it can be used many times, without needing to re-train for new problems as long as they are generated from the training distribution.

References

[1] David L Applegate, Robert E Bixby, Vasek Chvatal, and William J Cook. The Traveling Salesman Problem: A Computational Study. Princeton University Press, 2006.

[2] Claudia Archetti and Maria Grazia Speranza. The split delivery vehicle routing problem: a survey. In The Vehicle Routing Problem: Latest Advances and New Challenges, pages 103–122. Springer, 2008.

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.

[4] Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

[5] Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1, pages 183–221. Springer, 2010.

[6] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answering.
arXiv preprint arXiv:1511.05960, 2015.

[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing, 2014.

[8] Nicos Christofides. Worst-case analysis of a new heuristic for the travelling salesman problem. Technical report, Carnegie-Mellon Univ Pittsburgh Pa Management Sciences Research Group, 1976.

[9] Geoff Clarke and John W Wright. Scheduling of vehicles from a central depot to a number of delivery points. Operations Research, 12(4):568–581, 1964.

[10] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.

[11] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, 2017.

[12] Ricardo Fukasawa, Humberto Longo, Jens Lysgaard, Marcus Poggi de Aragão, Marcelo Reis, Eduardo Uchoa, and Renato F Werneck. Robust branch-and-cut-and-price for the capacitated vehicle routing problem. Mathematical Programming, 106(3):491–511, 2006.

[13] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[14] Fred Glover and Manuel Laguna. Tabu search. In Handbook of Combinatorial Optimization, pages 3261–3362. Springer, 2013.

[15] Bruce L Golden, Subramanian Raghavan, and Edward A Wasil. The Vehicle Routing Problem: Latest Advances and New Challenges, volume 43. Springer Science & Business Media, 2008.

[16] Google Inc.
Google\u2019s optimization tools (or-tools), 2018. URL https://github.com/\n\ngoogle/or-tools.\n\n[17] Inc. Gurobi Optimization. Gurobi optimizer reference manual, 2016. URL http://www.\n\ngurobi.com.\n\n[18] Seunghoon Hong, Junhyuk Oh, Honglak Lee, and Bohyung Han. Learning transferrable\nknowledge for semantic segmentation with deep convolutional neural network. In Proceedings\nof the IEEE Conference on Computer Vision and Pattern Recognition, pages 3204\u20133212, 2016.\n\n[19] S\u00e9bastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large\n\ntarget vocabulary for neural machine translation. 2015.\n\n[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Interna-\n\ntional Conference on Machine Learning, 2015.\n\n[21] Thomas N Kipf and Max Welling. Semi-supervised classi\ufb01cation with graph convolutional\n\nnetworks. arXiv preprint arXiv:1609.02907, 2016.\n\n[22] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing.\n\nscience, 220(4598):671\u2013680, 1983.\n\n10\n\n\f[23] Gilbert Laporte. The vehicle routing problem: An overview of exact and approximate algorithms.\n\nEuropean journal of operational research, 59(3):345\u2013358, 1992.\n\n[24] Gilbert Laporte, Michel Gendreau, Jean-Yves Potvin, and Fr\u00e9d\u00e9ric Semet. Classical and modern\nheuristics for the vehicle routing problem. International transactions in operational research, 7\n(4-5):285\u2013300, 2000.\n\n[25] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-\nbased neural machine translation. Conference on Empirical Methods in Natural Language\nProcessing, 2015.\n\n[26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G\nBellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.\nHuman-level control through deep reinforcement learning. 
Nature, 518(7540):529–533, 2015.

[27] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[28] Graham Neubig. Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619, 2017.

[29] Ulrike Ritzinger, Jakob Puchinger, and Richard F Hartl. A survey on dynamic and stochastic vehicle routing problems. International Journal of Production Research, 54(1):215–231, 2016.

[30] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. Computational capabilities of graph neural networks. IEEE Transactions on Neural Networks, 20(1):81–102, 2009.

[31] Lawrence V Snyder and Zuo-Jun Max Shen. Fundamentals of Supply Chain Theory. John Wiley & Sons, 2nd edition, 2018.

[32] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[33] Paolo Toth and Daniele Vigo. The Vehicle Routing Problem. SIAM, 2002.

[34] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.

[35] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. 2016.

[36] Christos Voudouris and Edward Tsang. Guided local search and its application to the traveling salesman problem. European Journal of Operational Research, 113(2):469–499, 1999.

[37] Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

[38] Anthony Wren and Alan Holliday.
Computer scheduling of vehicles from one or more depots to a number of delivery points. Operational Research Quarterly, pages 333–344, 1972.

[39] Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 842–850, 2015.

[40] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.