{"title": "Learning Combinatorial Optimization Algorithms over Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 6348, "page_last": 6358, "abstract": "The design of good heuristics or approximation algorithms for NP-hard combinatorial optimization problems often requires significant specialized knowledge and trial-and-error. Can we automate this challenging, tedious process, and learn the algorithms instead? In many real-world applications, it is typically the case that the same optimization problem is solved again and again on a regular basis, maintaining the same problem structure but differing in the data. This provides an opportunity for learning heuristic algorithms that exploit the structure of such recurring problems.  In this paper, we propose a unique combination of reinforcement learning and graph embedding to address this challenge. The learned greedy policy behaves like a meta-algorithm that incrementally constructs a solution, and the action is determined by the output of a graph embedding network capturing the current state of the solution. We show that our framework can be applied to a diverse range of optimization problems over graphs, and learns effective algorithms for the Minimum Vertex Cover, Maximum Cut and Traveling Salesman problems.", "full_text": "Learning Combinatorial Optimization Algorithms over Graphs\n\nHanjun Dai\u2020\u21e4, Elias B. Khalil\u2020\u21e4, Yuyu Zhang\u2020, Bistra Dilkina\u2020, Le Song\u2020\u00a7\n\n\u2020 College of Computing, Georgia Institute of Technology\n\n{hanjun.dai, elias.khalil, yuyu.zhang, bdilkina, lsong}@cc.gatech.edu\n\n\u00a7 Ant Financial\n\nAbstract\n\nThe design of good heuristics or approximation algorithms for NP-hard combi-\nnatorial optimization problems often requires signi\ufb01cant specialized knowledge\nand trial-and-error. Can we automate this challenging, tedious process, and learn\nthe algorithms instead? 
In many real-world applications, it is typically the case\nthat the same optimization problem is solved again and again on a regular basis,\nmaintaining the same problem structure but differing in the data. This provides\nan opportunity for learning heuristic algorithms that exploit the structure of such\nrecurring problems. In this paper, we propose a unique combination of reinforce-\nment learning and graph embedding to address this challenge. The learned greedy\npolicy behaves like a meta-algorithm that incrementally constructs a solution, and\nthe action is determined by the output of a graph embedding network capturing\nthe current state of the solution. We show that our framework can be applied to a\ndiverse range of optimization problems over graphs, and learns effective algorithms\nfor the Minimum Vertex Cover, Maximum Cut and Traveling Salesman problems.\n\nIntroduction\n\n1\nCombinatorial optimization problems over graphs arising from numerous application domains, such\nas social networks, transportation, telecommunications and scheduling, are NP-hard, and have thus\nattracted considerable interest from the theory and algorithm design communities over the years. In\nfact, of Karp\u2019s 21 problems in the seminal paper on reducibility [19], 10 are decision versions of graph\noptimization problems, while most of the other 11 problems, such as set covering, can be naturally\nformulated on graphs. Traditional approaches to tackling an NP-hard graph optimization problem\nhave three main \ufb02avors: exact algorithms, approximation algorithms and heuristics. Exact algorithms\nare based on enumeration or branch-and-bound with an integer programming formulation, but may\nbe prohibitive for large instances. On the other hand, polynomial-time approximation algorithms are\ndesirable, but may suffer from weak optimality guarantees or empirical performance, or may not even\nexist for inapproximable problems. 
Heuristics are often fast, effective algorithms that lack theoretical\nguarantees, and may also require substantial problem-specific research and trial-and-error on the part\nof algorithm designers.\nAll three paradigms seldom exploit a common trait of real-world optimization problems: instances\nof the same type of problem are solved again and again on a regular basis, maintaining the same\ncombinatorial structure, but differing mainly in their data. That is, in many applications, values of\nthe coefficients in the objective function or constraints can be thought of as being sampled from the\nsame underlying distribution. For instance, an advertiser on a social network targets a limited set of\nusers with ads, in the hope that they spread them to their neighbors; such covering instances need\nto be solved repeatedly, since the influence pattern between neighbors may be different each time.\nAlternatively, a package delivery company routes trucks on a daily basis in a given city; thousands of\nsimilar optimizations need to be solved, since the underlying demand locations can differ.\n\n*Both authors contributed equally to the paper.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n[Figure 1 panels: State; Embedding the graph + partial solution (1st iteration, 2nd iteration); Greedy node selection]\nFigure 1: Illustration of the proposed framework as applied to an instance of Minimum Vertex Cover. The\nmiddle part illustrates two iterations of the graph embedding, which results in node scores (green bars).\n\nDespite the inherent similarity between problem instances arising in the same domain, classical\nalgorithms do not systematically exploit this fact. 
However, in industrial settings, a company may\nbe willing to invest in upfront, of\ufb02ine computation and learning if such a process can speed up its\nreal-time decision-making and improve its quality. This motivates the main problem we address:\n\nProblem Statement: Given a graph optimization problem G and a distribution D of problem\ninstances, can we learn better heuristics that generalize to unseen instances from D?\n\nRecently, there has been some seminal work on using deep architectures to learn heuristics for\ncombinatorial problems, including the Traveling Salesman Problem [37, 6, 14]. However, the\narchitectures used in these works are generic, not yet effectively re\ufb02ecting the combinatorial structure\nof graph problems. As we show later, these architectures often require a huge number of instances in\norder to learn to generalize to new ones. Furthermore, existing works typically use the policy gradient\nfor training [6], a method that is not particularly sample-ef\ufb01cient. While the methods in [37, 6] can\nbe used on graphs with different sizes \u2013 a desirable trait \u2013 they require manual, ad-hoc input/output\nengineering to do so (e.g. padding with zeros).\nIn this paper, we address the challenge of learning algorithms for graph problems using a unique\ncombination of reinforcement learning and graph embedding. The learned policy behaves like a\nmeta-algorithm that incrementally constructs a solution, with the action being determined by a graph\nembedding network over the current state of the solution. More speci\ufb01cally, our proposed solution\nframework is different from previous work in the following aspects:\n1. Algorithm design pattern. We will adopt a greedy meta-algorithm design, whereby a feasible\nsolution is constructed by successive addition of nodes based on the graph structure, and is maintained\nso as to satisfy the problem\u2019s graph constraints. 
Greedy algorithms are a popular pattern for designing\napproximation and heuristic algorithms for graph problems. As such, the same high-level design can\nbe seamlessly used for different graph optimization problems.\n2. Algorithm representation. We will use a graph embedding network, called structure2vec\n(S2V) [9], to represent the policy in the greedy algorithm. This novel deep learning architecture\nover the instance graph \u201cfeaturizes\u201d the nodes in the graph, capturing the properties of a node in the\ncontext of its graph neighborhood. This allows the policy to discriminate among nodes based on\ntheir usefulness, and generalizes to problem instances of different sizes. This contrasts with recent\napproaches [37, 6] that adopt a graph-agnostic sequence-to-sequence mapping that does not fully\nexploit graph structure.\n3. Algorithm training. We will use \ufb01tted Q-learning to learn a greedy policy that is parametrized\nby the graph embedding network. The framework is set up in such a way that the policy will aim\nto optimize the objective function of the original problem instance directly. The main advantage of\nthis approach is that it can deal with delayed rewards, which here represent the remaining increase in\nobjective function value obtained by the greedy algorithm, in a data-ef\ufb01cient way; in each step of the\ngreedy algorithm, the graph embeddings are updated according to the partial solution to re\ufb02ect new\nknowledge of the bene\ufb01t of each node to the \ufb01nal objective value. In contrast, the policy gradient\napproach of [6] updates the model parameters only once w.r.t. the whole solution (e.g. the tour in\nTSP).\n\n2\n\n\fThe application of a greedy heuristic learned with our framework is illustrated in Figure 1. To\ndemonstrate the effectiveness of the proposed framework, we apply it to three extensively studied\ngraph optimization problems. 
Experimental results show that our framework, a single meta-learning\nalgorithm, efficiently learns effective heuristics for each of the problems. Furthermore, we show that\nour learned heuristics preserve their effectiveness even when used on graphs much larger than the\nones they were trained on. Since many combinatorial optimization problems, such as the set covering\nproblem, can be explicitly or implicitly formulated on graphs, we believe that our work opens up a\nnew avenue for graph algorithm design and discovery with deep learning.\n\n2 Common Formulation for Greedy Algorithms on Graphs\nWe will illustrate our framework using three optimization problems over weighted graphs. Let\nG(V, E, w) denote a weighted graph, where V is the set of nodes, E the set of edges and $w : E \\to \\mathbb{R}^+$\nthe edge weight function, i.e. w(u, v) is the weight of edge $(u, v) \\in E$. These problems are:\n\u2022 Minimum Vertex Cover (MVC): Given a graph G, find a subset of nodes $S \\subseteq V$ such that every\nedge is covered, i.e. $(u, v) \\in E \\Rightarrow u \\in S$ or $v \\in S$, and |S| is minimized.\n\u2022 Maximum Cut (MAXCUT): Given a graph G, find a subset of nodes $S \\subseteq V$ such that the weight\nof the cut-set $\\sum_{(u,v) \\in C} w(u, v)$ is maximized, where cut-set $C \\subseteq E$ is the set of edges with one\nend in S and the other end in $V \\setminus S$.\n\u2022 Traveling Salesman Problem (TSP): Given a set of points in 2-dimensional space, find a tour\nof minimum total weight, where the corresponding graph G has the points as nodes and is fully\nconnected with edge weights corresponding to distances between points; a tour is a cycle that visits\neach node of the graph exactly once.\n\nWe will focus on a popular pattern for designing approximation and heuristic algorithms, namely\na greedy algorithm. 
A greedy algorithm will construct a solution by sequentially adding nodes to\na partial solution S, based on maximizing some evaluation function Q that measures the quality\nof a node in the context of the current partial solution. We will show that, despite the diversity of\nthe combinatorial problems above, greedy algorithms for them can be expressed using a common\nformulation. Specifically:\n1. A problem instance G of a given optimization problem is sampled from a distribution D, i.e. the\nV, E and w of the instance graph G are generated according to a model or real-world data.\n2. A partial solution is represented as an ordered list $S = (v_1, v_2, \\ldots, v_{|S|})$, $v_i \\in V$, and $\\bar{S} = V \\setminus S$ is\nthe set of candidate nodes for addition, conditional on S. Furthermore, we use a vector of binary\ndecision variables x, with each dimension $x_v$ corresponding to a node $v \\in V$, $x_v = 1$ if $v \\in S$\nand 0 otherwise. One can also view $x_v$ as a tag or extra feature on v.\n3. A maintenance (or helper) procedure h(S) will be needed, which maps an ordered list S to a\ncombinatorial structure satisfying the specific constraints of a problem.\n4. The quality of a partial solution S is given by an objective function c(h(S), G) based on the\ncombinatorial structure h of S.\n5. A generic greedy algorithm selects a node v to add next such that v maximizes an evaluation\nfunction, $Q(h(S), v) \\in \\mathbb{R}$, which depends on the combinatorial structure h(S) of the current\npartial solution. Then, the partial solution S will be extended as\n$$S := (S, v^*), \\quad \\text{where } v^* := \\operatorname{argmax}_{v \\in \\bar{S}} Q(h(S), v), \\tag{1}$$\nand $(S, v^*)$ denotes appending $v^*$ to the end of a list S. This step is repeated until a termination\ncriterion t(h(S)) is satisfied.\n\nIn our formulation, we assume that the distribution D, the helper function h, the termination criterion\nt and the cost function c are all given. 
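
The common formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names greedy_construct and make_mvc are hypothetical, and the evaluation function used here is a hand-crafted stand-in (the number of still-uncovered edges a node would cover) for the learned function that the paper develops later.

```python
from typing import Callable, List, Sequence, Tuple

Edge = Tuple[int, int]


def greedy_construct(
    nodes: Sequence[int],
    evaluate: Callable[[List[int], int], float],
    terminated: Callable[[List[int]], bool],
) -> List[int]:
    # Generic greedy loop in the spirit of Eq. (1): keep appending the
    # candidate node with the highest evaluation score until termination.
    solution: List[int] = []
    while not terminated(solution):
        candidates = [v for v in nodes if v not in solution]
        if not candidates:
            break  # nothing left to add
        best = max(candidates, key=lambda v: evaluate(solution, v))
        solution.append(best)
    return solution


def make_mvc(edges: Sequence[Edge]):
    # Hypothetical MVC instantiation: no helper work is needed, and the
    # termination criterion checks that every edge is covered.
    def uncovered(solution: List[int]) -> List[Edge]:
        chosen = set(solution)
        return [(u, v) for (u, v) in edges if u not in chosen and v not in chosen]

    def evaluate(solution: List[int], v: int) -> float:
        # Hand-crafted score (an assumption, not the learned Q).
        return float(sum(1 for (a, b) in uncovered(solution) if v in (a, b)))

    def terminated(solution: List[int]) -> bool:
        return not uncovered(solution)

    return evaluate, terminated


edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
evaluate, terminated = make_mvc(edges)
cover = greedy_construct(range(5), evaluate, terminated)
print(cover)
```

Swapping in a different evaluate and terminated pair is all that is needed to target another problem, which is the point of the shared formulation.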
Given the above abstract model, various optimization problems\ncan be expressed by using different helper functions, cost functions and termination criteria:\n\u2022 MVC: The helper function does not need to do any work, and $c(h(S), G) = -|S|$. The termination\ncriterion checks whether all edges have been covered.\n\u2022 MAXCUT: The helper function divides V into two sets, S and its complement $\\bar{S} = V \\setminus S$,\nand maintains a cut-set $C = \\{(u, v) \\mid (u, v) \\in E, u \\in S, v \\in \\bar{S}\\}$. Then, the cost is\n$c(h(S), G) = \\sum_{(u,v) \\in C} w(u, v)$, and the termination criterion does nothing.\n\u2022 TSP: The helper function will maintain a tour according to the order of the nodes in S. The\nsimplest way is to append nodes to the end of the partial tour in the same order as S. Then the cost is\n$c(h(S), G) = -\\sum_{i=1}^{|S|-1} w(S(i), S(i+1)) - w(S(|S|), S(1))$, and the termination criterion is\nactivated when S = V. Empirically, inserting a node u in the partial tour at the position which\nincreases the tour length the least is a better choice. We adopt this insertion procedure as a helper\nfunction for TSP.\n\nAn estimate of the quality of the solution resulting from adding a node to partial solution S will\nbe determined by the evaluation function Q, which will be learned using a collection of problem\ninstances. This is in contrast with traditional greedy algorithm design, where the evaluation function\nQ is typically hand-crafted, and requires substantial problem-specific research and trial-and-error. In\nthe following, we will design a powerful deep learning parameterization for the evaluation function,\n$\\widehat{Q}(h(S), v; \\Theta)$, with parameters $\\Theta$.\n\n3 Representation: Graph Embedding\nSince we are optimizing over a graph G, we expect that the evaluation function $\\widehat{Q}$ should take into\naccount the current partial solution S as it maps to the graph. That is, $x_v = 1$ for all nodes $v \\in S$,\nand the nodes are connected according to the graph structure. 
Intuitively, $\\widehat{Q}$ should summarize the\nstate of such a \u201ctagged\u201d graph G, and figure out the value of a new node if it is to be added in\nthe context of such a graph. Here, both the state of the graph and the context of a node v can be\nvery complex, hard to describe in closed form, and may depend on complicated statistics such as\nglobal/local degree distribution, triangle counts, distance to tagged nodes, etc. In order to represent\nsuch complex phenomena over combinatorial structures, we will leverage a deep learning architecture\nover graphs, in particular the structure2vec of [9], to parameterize $\\widehat{Q}(h(S), v; \\Theta)$.\n\n3.1 Structure2Vec\nWe first provide an introduction to structure2vec. This graph embedding network will compute\na p-dimensional feature embedding $\\mu_v$ for each node $v \\in V$, given the current partial solution S.\nMore specifically, structure2vec defines the network architecture recursively according to an\ninput graph structure G, and the computation graph of structure2vec is inspired by graphical\nmodel inference algorithms, where node-specific tags or features $x_v$ are aggregated recursively\naccording to G\u2019s graph topology. After a few steps of recursion, the network will produce a new\nembedding for each node, taking into account both graph characteristics and long-range interactions\nbetween these node features. One variant of the structure2vec architecture will initialize the\nembedding $\\mu_v^{(0)}$ at each node as 0, and for all $v \\in V$ update the embeddings synchronously at each\niteration as\n$$\\mu_v^{(t+1)} \\leftarrow F\\left(x_v, \\{\\mu_u^{(t)}\\}_{u \\in N(v)}, \\{w(v, u)\\}_{u \\in N(v)}; \\Theta\\right), \\tag{2}$$\nwhere N(v) is the set of neighbors of node v in graph G, and F is a generic nonlinear mapping such\nas a neural network or kernel function.\nBased on the update formula, one can see that the embedding update process is carried out based on\nthe graph topology. A new round of embedding sweeping across the nodes will start only after the\nembedding update for all nodes from the previous round has finished. It is easy to see that the update\nalso defines a process where the node features $x_v$ are propagated to other nodes via the nonlinear\npropagation function F. Furthermore, the more update iterations one carries out, the farther away\nthe node features will propagate and get aggregated nonlinearly at distant nodes. In the end, if one\nterminates after T iterations, each node embedding $\\mu_v^{(T)}$ will contain information about its T-hop\nneighborhood as determined by graph topology, the involved node features and the propagation\nfunction F. An illustration of two iterations of graph embedding can be found in Figure 1.\n\n3.2 Parameterizing $\\widehat{Q}(h(S), v; \\Theta)$\nWe now discuss the parameterization of $\\widehat{Q}(h(S), v; \\Theta)$ using the embeddings from\nstructure2vec. In particular, we design F to update a p-dimensional embedding $\\mu_v$ as:\n$$\\mu_v^{(t+1)} \\leftarrow \\mathrm{relu}\\left(\\theta_1 x_v + \\theta_2 \\sum_{u \\in N(v)} \\mu_u^{(t)} + \\theta_3 \\sum_{u \\in N(v)} \\mathrm{relu}(\\theta_4 w(v, u))\\right), \\tag{3}$$\nwhere $\\theta_1 \\in \\mathbb{R}^p$, $\\theta_2, \\theta_3 \\in \\mathbb{R}^{p \\times p}$ and $\\theta_4 \\in \\mathbb{R}^p$ are the model parameters, and relu is the rectified linear\nunit (relu(z) = max(0, z)) applied elementwise to its input. The summation over neighbors is one\nway of aggregating neighborhood information that is invariant to permutations over neighbors. For\nsimplicity of exposition, $x_v$ here is a binary scalar as described earlier; it is straightforward to extend\n$x_v$ to a vector representation by incorporating any additional useful node information. To make the\nnonlinear transformations more powerful, we can add some more layers of relu before we pool over\nthe neighboring embeddings $\\mu_u$.\nOnce the embedding for each node is computed after T iterations, we will use these embeddings\nto define the $\\widehat{Q}(h(S), v; \\Theta)$ function. 
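
The synchronous update of Eq. (3) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function name s2v_embed, the dictionary-based graph representation and the toy parameter values are all hypothetical, and a practical implementation would use a tensor library.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def relu(vec: Vector) -> Vector:
    return [max(0.0, z) for z in vec]


def matvec(mat: Matrix, vec: Vector) -> Vector:
    return [sum(m * z for m, z in zip(row, vec)) for row in mat]


def add(*vecs: Vector) -> Vector:
    # Elementwise sum of one or more equal-length vectors.
    return [sum(parts) for parts in zip(*vecs)]


def s2v_embed(x, neighbors, theta1, theta2, theta3, theta4, T=4):
    # x[v]: binary tag of node v; neighbors[v]: dict {u: w(v, u)}.
    p = len(theta1)
    mu = {v: [0.0] * p for v in x}  # all embeddings start at zero
    for _ in range(T):
        new_mu = {}
        for v in x:
            # Sum of neighbor embeddings from the previous round.
            pooled = add([0.0] * p, *[mu[u] for u in neighbors[v]])
            # Sum over neighbors of relu(theta4 * w(v, u)).
            edge = add([0.0] * p,
                       *[relu([t * w for t in theta4]) for w in neighbors[v].values()])
            new_mu[v] = relu(add(
                [t * x[v] for t in theta1],
                matvec(theta2, pooled),
                matvec(theta3, edge),
            ))
        mu = new_mu  # synchronous: swap only after every node is updated
    return mu


# Toy path graph 0 - 1 - 2 - 3 with unit weights; only node 0 is tagged.
neighbors = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
x = {0: 1, 1: 0, 2: 0, 3: 0}
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
mu = s2v_embed(x, neighbors, [1.0, 1.0], identity, zeros, [1.0, 1.0], T=2)
```

With these toy parameters the tag on node 0 reaches node 1 after the second round but has not yet reached nodes 2 and 3, which illustrates how more iterations propagate node features farther through the graph.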
More specifically, we will use the embedding $\\mu_v^{(T)}$ for node v\nand the pooled embedding over the entire graph, $\\sum_{u \\in V} \\mu_u^{(T)}$, as the surrogates for v and h(S),\nrespectively, i.e.\n$$\\widehat{Q}(h(S), v; \\Theta) = \\theta_5^{\\top} \\mathrm{relu}\\left(\\left[\\theta_6 \\sum_{u \\in V} \\mu_u^{(T)}, \\theta_7 \\mu_v^{(T)}\\right]\\right), \\tag{4}$$\nwhere $\\theta_5 \\in \\mathbb{R}^{2p}$, $\\theta_6, \\theta_7 \\in \\mathbb{R}^{p \\times p}$ and $[\\cdot, \\cdot]$ is the concatenation operator. Since the embedding $\\mu_u^{(T)}$\nis computed based on the parameters from the graph embedding network, $\\widehat{Q}(h(S), v)$ will depend\non a collection of 7 parameters $\\Theta = \\{\\theta_i\\}_{i=1}^{7}$. The number of iterations T for the graph embedding\ncomputation is usually small, such as T = 4.\nThe parameters $\\Theta$ will be learned. Previously, [9] required a ground truth label for every input\ngraph G in order to train the structure2vec architecture. There, the output of the embedding\nis linked with a softmax-layer, so that the parameters can be trained end-to-end by minimizing the\ncross-entropy loss. This approach is not applicable to our case due to the lack of training labels.\nInstead, we train these parameters together end-to-end using reinforcement learning.\n\n4 Training: Q-learning\nWe show how reinforcement learning is a natural framework for learning the evaluation function $\\widehat{Q}$.\nThe definition of the evaluation function $\\widehat{Q}$ naturally lends itself to a reinforcement learning (RL)\nformulation [36], and we will use $\\widehat{Q}$ as a model for the state-action value function in RL. We note that we\nwould like to learn a function $\\widehat{Q}$ across a set of m graphs from distribution $\\mathbb{D}$, $\\mathcal{D} = \\{G_i\\}_{i=1}^{m}$, with\npotentially different sizes. The advantage of the graph embedding parameterization in our previous\nsection is that we can deal with different graph instances and sizes seamlessly.\n\n4.1 Reinforcement learning formulation\nWe define the states, actions and rewards in the reinforcement learning framework as follows:\n1. States: a state S is a sequence of actions (nodes) on a graph G. 
Since we have already represented\nnodes in the tagged graph with their embeddings, the state is a vector in p-dimensional space,\n$\\sum_{v \\in V} \\mu_v$. It is easy to see that this embedding representation of the state can be used across\ndifferent graphs. The terminal state $\\widehat{S}$ will depend on the problem at hand;\n2. Transition: transition is deterministic here, and corresponds to tagging the node $v \\in G$ that was\nselected as the last action with feature $x_v = 1$;\n3. Actions: an action v is a node of G that is not part of the current state S. Similarly, we will\nrepresent actions as their corresponding p-dimensional node embedding $\\mu_v$, and such a definition\nis applicable across graphs of various sizes;\n4. Rewards: the reward function r(S, v) at state S is defined as the change in the cost function after\ntaking action v and transitioning to a new state $S' := (S, v)$. That is,\n$$r(S, v) = c(h(S'), G) - c(h(S), G), \\tag{5}$$\nand $c(h(\\emptyset), G) = 0$. As such, the cumulative reward R of a terminal state $\\widehat{S}$ coincides exactly\nwith the objective function value of $\\widehat{S}$, i.e. $R(\\widehat{S}) = \\sum_{i=1}^{|\\widehat{S}|} r(S_i, v_i)$ is equal to $c(h(\\widehat{S}), G)$;\n5. Policy: based on $\\widehat{Q}$, a deterministic greedy policy $\\pi(v|S) := \\operatorname{argmax}_{v' \\in \\bar{S}} \\widehat{Q}(h(S), v')$ will be\nused. Selecting action v corresponds to adding a node of G to the current partial solution, which\nresults in collecting a reward r(S, v).\n\nTable 1 shows the instantiations of the reinforcement learning framework for the three optimization\nproblems considered herein. We let $Q^*$ denote the optimal Q-function for each RL problem. Our graph\nembedding parameterization $\\widehat{Q}(h(S), v; \\Theta)$ from Section 3 will then be a function approximation\nmodel for it, which will be learned via n-step Q-learning.\n\n4.2 Learning algorithm\nIn order to perform end-to-end learning of the parameters in $\\widehat{Q}(h(S), v; \\Theta)$, we use a combination\nof n-step Q-learning [36] and fitted Q-iteration [33], as illustrated in Algorithm 1. 
We use the term\nepisode to refer to a complete sequence of node additions starting from an empty solution, and until\ntermination; a step within an episode is a single action (node addition).\n\nTable 1: Definition of reinforcement learning components for each of the three problems considered.\n| Problem | State | Action | Helper function | Reward | Termination |\n|---|---|---|---|---|---|\n| MVC | subset of nodes selected so far | add node to subset | None | -1 | all edges are covered |\n| MAXCUT | subset of nodes selected so far | add node to subset | None | change in cut weight | cut weight cannot be improved |\n| TSP | partial tour | grow tour by one node | Insertion operation | change in tour cost | tour includes all nodes |\n\nStandard (1-step) Q-learning updates the function approximator\u2019s parameters at each step of an\nepisode by performing a gradient step to minimize the squared loss:\n$$(y - \\widehat{Q}(h(S_t), v_t; \\Theta))^2, \\tag{6}$$\nwhere $y = \\gamma \\max_{v'} \\widehat{Q}(h(S_{t+1}), v'; \\Theta) + r(S_t, v_t)$ for a non-terminal state $S_t$. The n-step Q-learning\nhelps deal with the issue of delayed rewards, where the final reward of interest to the agent is only\nreceived far in the future during an episode. In our setting, the final objective value of a solution is\nonly revealed after many node additions. As such, the 1-step update may be too myopic. A natural\nextension of 1-step Q-learning is to wait n steps before updating the approximator\u2019s parameters, so\nas to collect a more accurate estimate of the future rewards. Formally, the update is over the same\nsquared loss (6), but with a different target, $y = \\sum_{i=0}^{n-1} r(S_{t+i}, v_{t+i}) + \\gamma \\max_{v'} \\widehat{Q}(h(S_{t+n}), v'; \\Theta)$.\n\nThe fitted Q-iteration approach has been shown to result in faster learning convergence when using\na neural network as a function approximator [33, 28], a property that also applies in our setting, as\nwe use the embedding defined in Section 3.2. 
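
The n-step target used with the squared loss (6) can be sketched as below. This is an illustrative helper, not the authors' code: the name n_step_target is hypothetical, and treating a terminal next state as contributing no bootstrap term is an assumption, since the text only defines the target for non-terminal states.

```python
from typing import Optional, Sequence


def n_step_target(
    rewards: Sequence[float],
    next_q_values: Optional[Sequence[float]],
    gamma: float = 1.0,
) -> float:
    # rewards: the n rewards collected from step t to step t + n - 1.
    # next_q_values: estimated Q-values of every candidate node in the
    # state reached after n steps; None or empty if that state is terminal
    # (assumption: no bootstrap term in that case).
    bootstrap = max(next_q_values) if next_q_values else 0.0
    return sum(rewards) + gamma * bootstrap


# Two MVC steps (reward -1 each), then bootstrap from the best next estimate.
y = n_step_target([-1.0, -1.0], [-3.0, -2.0])
# Episode ends after one more step: only the collected reward remains.
y_terminal = n_step_target([-1.0], None)
```

The squared loss (6) is then evaluated between this target and the current estimate for the state and action at step t.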
Instead of updating the Q-function sample-by-sample\nas in Equation (6), the fitted Q-iteration approach uses experience replay to update the function\napproximator with a batch of samples from a dataset E, rather than the single sample being currently\nexperienced. The dataset E is populated during previous episodes, such that at step t + n, the tuple\n$(S_t, a_t, R_{t,t+n}, S_{t+n})$ is added to E, with $R_{t,t+n} = \\sum_{i=0}^{n-1} r(S_{t+i}, a_{t+i})$. Instead of performing\na gradient step in the loss of the current sample as in (6), stochastic gradient descent updates are\nperformed on a random sample of tuples drawn from E.\nIt is known that off-policy reinforcement learning algorithms such as Q-learning can be more sample\nefficient than their policy gradient counterparts [15]. This is largely due to the fact that policy gradient\nmethods require on-policy samples for the new policy obtained after each parameter update of the\nfunction approximator.\n\nAlgorithm 1 Q-learning for the Greedy Algorithm\nInitialize experience replay memory M to capacity N\nfor episode e = 1 to L do\n    Draw graph G from distribution D\n    Initialize the state to empty $S_1 = ()$\n    for step t = 1 to T do\n        $v_t$ = random node $v \\in \\bar{S}_t$ with probability $\\epsilon$; otherwise $v_t = \\operatorname{argmax}_{v \\in \\bar{S}_t} \\widehat{Q}(h(S_t), v; \\Theta)$\n        Add $v_t$ to partial solution: $S_{t+1} := (S_t, v_t)$\n        if $t \\ge n$ then\n            Add tuple $(S_{t-n}, v_{t-n}, R_{t-n,t}, S_t)$ to M\n            Sample random batch $B \\overset{iid}{\\sim} M$\n            Update $\\Theta$ by SGD over (6) for B\n        end if\n    end for\nend for\nreturn $\\Theta$\n\n5 Experimental Evaluation\nInstance generation. To evaluate the proposed method against other approximation/heuristic algorithms\nand deep learning approaches, we generate graph instances for each of the three problems.\nFor the MVC and MAXCUT problems, we generate Erd\u0151s-R\u00e9nyi (ER) [11] and Barab\u00e1si-Albert\n(BA) [1] graphs which have been used to model many real-world networks. For a given range on the\nnumber of nodes, e.g. 
50-100, we first sample the number of nodes uniformly at random from that\nrange, then generate a graph according to either ER or BA. For the two-dimensional TSP problem,\nwe use an instance generator from the DIMACS TSP Challenge [18] to generate uniformly random\nor clustered points in the 2-D grid. We refer the reader to Appendix D.1 for complete details on\ninstance generation. We have also tackled the Set Covering Problem, for which the description and\nresults are deferred to Appendix B.\nStructure2Vec Deep Q-learning. For our method, S2V-DQN, we use the graph representations and\nhyperparameters described in Appendix D.4. The hyperparameters are selected via preliminary results\non small graphs, and then fixed for large ones. Note that for TSP, where the graph is fully-connected,\nwe build the K-nearest neighbor graph (K = 10) to scale up to large graphs. For MVC, where\nwe train the model on graphs with up to 500 nodes, we use the model trained on small graphs as\ninitialization for training on larger ones. We refer to this trick as \u201cpre-training\u201d, which is illustrated in\nFigure D.2.\nPointer Networks with Actor-Critic. We compare our method to one based on Recurrent\nNeural Networks (RNNs) that does not make full use of graph structure [6]. We implement\nand train their algorithm (PN-AC) for all three problems. The original model only works on the\nEuclidean TSP problem, where each node is represented by its (x, y) coordinates, and is not designed\nfor problems with graph structure. To handle other graph problems, we describe each node by its\nadjacency vector instead of coordinates. To handle different graph sizes, we use a singular value\ndecomposition (SVD) to obtain a rank-8 approximation for the adjacency matrix, and use the low-rank\nembeddings as inputs to the pointer network.\nBaseline Algorithms. Besides the PN-AC, we also include powerful approximation or heuristic\nalgorithms from the literature. 
These algorithms are speci\ufb01cally designed for each type of problem:\n\u2022 MVC: MVCApprox iteratively selects an uncovered edge and adds both of its endpoints [30]. We\ndesigned a stronger variant, called MVCApprox-Greedy, that greedily picks the uncovered edge\nwith maximum sum of degrees of its endpoints. Both algorithms are 2-approximations.\n\u2022 MAXCUT: We include MaxcutApprox, which maintains the cut set (S, V \\ S) and moves a node\nfrom one side to the other side of the cut if that operation results in cut weight improvement [25].\nTo make MaxcutApprox stronger, we greedily move the node that results in the largest improvement\nin cut weight. A randomized, non-greedy algorithm, referred to as SDP, is also implemented based\non [12]; 100 solutions are generated for each graph, and the best one is taken.\n\u2022 TSP: We include the following approximation algorithms: Minimum Spanning Tree (MST),\nFarthest insertion (Farthest), Cheapest insertion (Cheapest), Closest insertion (Closest), Christo\ufb01des\nand 2-opt. We also add the Nearest Neighbor heuristic (Nearest); see [4] for algorithmic details.\nDetails on Validation and Testing. For S2V-DQN and PN-AC, we use a CUDA K80-enabled cluster\nfor training and testing. Training convergence for S2V-DQN is discussed in Appendix D.6. S2V-DQN\nand PN-AC use 100 held-out graphs for validation, and we report the test results on another 1000\ngraphs. We use CPLEX[17] to get optimal solutions for MVC and MAXCUT, and Concorde [3] for\nTSP (details in Appendix D.1). All approximation ratios reported in the paper are with respect to the\nbest (possibly optimal) solution found by the solvers within 1 hour. For MVC, we vary the training\nand test graph sizes in the ranges {15\u201320, 40\u201350, 50\u2013100, 100\u2013200, 400\u2013500}. For MAXCUT and\nTSP, which involve edge weights, we train up to 200\u2013300 nodes due to the limited computation\nresource. 
For all problems, we test on graphs of size up to 1000\u20131200.\nDuring testing, instead of using Active Search as in [6], we simply use the greedy policy. This gives\nus much faster inference, while still being powerful enough. We modify existing open-source code to\nimplement both S2V-DQN (https://github.com/Hanjun-Dai/graphnn) and PN-AC\n(https://github.com/devsisters/pointer-network-tensorflow). Our code is publicly available\n(https://github.com/Hanjun-Dai/graph_comb_opt).\n\n5.1 Comparison of solution quality\nTo evaluate the solution quality on test instances, we use the approximation ratio of each method\nrelative to the optimal solution, averaged over the set of test instances. The approximation ratio of a\nsolution S to a problem instance G is defined as $R(S, G) = \\max\\left(\\frac{OPT(G)}{c(h(S))}, \\frac{c(h(S))}{OPT(G)}\\right)$, where c(h(S))\nis the objective value of solution S, and OPT(G) is the best-known solution value of instance G.\n\n[Figure 2 panels: (a) MVC BA; (b) MAXCUT BA; (c) TSP random]\nFigure 2: Approximation ratio on 1000 test graphs. Note that on MVC, our performance is pretty close to\noptimal. In this figure, training and testing graphs are generated according to the same distribution.\n\nFigure 2 shows the average approximation ratio across the three problems; other graph types are in\nFigure D.1 in the appendix. In all of these figures, a lower approximation ratio is better. Overall,\nour proposed method, S2V-DQN, performs significantly better than other methods. In MVC, the\nperformance of S2V-DQN is particularly good, as the approximation ratio is roughly 1 and the bar is\nbarely visible.\nThe PN-AC algorithm performs well on TSP, as expected. Since the TSP graph is essentially fully-\nconnected, graph structure is not as important. On problems such as MVC and MAXCUT, where\ngraph information is more crucial, our algorithm performs significantly better than PN-AC. 
For TSP,\nthe Farthest and 2-opt algorithms perform as well as S2V-DQN, and slightly better in some cases.\nHowever, we will show later that on real-world TSP data, our algorithm still performs better.\n\n5.2 Generalization to larger instances\nThe graph embedding framework enables us to train and test on graphs of different sizes, since the\nsame set of model parameters is used. How well does an algorithm learned on\nsmall graphs generalize to test graphs of larger sizes? To investigate this, we train S2V-DQN on\ngraphs with 50\u2013100 nodes, and test its generalization ability on graphs with up to 1200 nodes. Table 2\nsummarizes the results, and full results are in Appendix D.3.\n\nTable 2: S2V-DQN\u2019s generalization ability. Values are average approximation ratios over 1000 test instances.\nThese test results are produced by S2V-DQN algorithms trained on graphs with 50-100 nodes.\n\nTest Size | 50-100 | 100-200 | 200-300 | 300-400 | 400-500 | 500-600 | 1000-1200\nMVC (BA) | 1.0033 | 1.0041 | 1.0045 | 1.0040 | 1.0045 | 1.0048 | 1.0062\nMAXCUT (BA) | 1.0150 | 1.0181 | 1.0202 | 1.0188 | 1.0123 | 1.0177 | 1.0038\nTSP (clustered) | 1.0730 | 1.0895 | 1.0869 | 1.0918 | 1.0944 | 1.0975 | 1.1065\n\nS2V-DQN generalizes well: across all test sizes, the average approximation ratio stays below 1.01 for MVC,\nbelow 1.03 for MAXCUT and below 1.11 for TSP. Note that the \u201coptimal\u201d value\nused in the computation of approximation ratios may not be truly optimal (due to the solver time\ncutoff at 1 hour), and CPLEX\u2019s solutions typically get worse as problem size grows. This is\nwhy we sometimes obtain a better approximation ratio on larger graphs.\n\n5.3 Scalability & Trade-off between running time and approximation ratio\nTo construct a solution on a test graph, our algorithm has polynomial complexity O(k|E|), where k\nis the number of greedy steps (at most the number of nodes |V|) and |E| is the number of edges. 
For instance,\non graphs with 1200 nodes, we can find the solution of MVC within 11 seconds using a single GPU,\nwhile getting an approximation ratio of 1.0062. For dense graphs, we can also sample the edges for\nthe graph embedding computation to save time, a measure we will investigate in the future.\nFigure 3 illustrates the approximation ratios of various approaches as a function of running time.\nAll algorithms report a single solution at termination, whereas CPLEX reports multiple improving\nsolutions, for which we recorded the corresponding running time and approximation ratio. Figure D.3\n(Appendix D.7) includes other graph sizes and types, where the results are consistent with Figure 3.\nFigure 3 shows that, for MVC, we are slightly slower than the approximation algorithms but enjoy a\nmuch better approximation ratio. Also note that although CPLEX finds the first feasible solution\nquickly, that solution has a much worse ratio; the second improved solution found by CPLEX takes similar or\nlonger time than our S2V-DQN, but is still of worse quality. For MAXCUT, the observations are\nconsistent. Note that our algorithm sometimes obtains better results than 1-hour\nCPLEX, which gives ratios below 1.0. Furthermore, S2V-DQN is sometimes even faster than\nMaxcutApprox, although this comparison is not exactly fair, since we use GPUs; however, we can\nstill see that our algorithm is efficient.\n\nFigure 3: Time-approximation trade-off for MVC and MAXCUT; panels: (a) MVC BA 200-300, (b) MAXCUT BA 200-300. In this figure, each dot\nrepresents a solution found for a single problem instance, for 100 instances. For CPLEX, we\nalso record the time and quality of each solution it finds, e.g. CPLEX-1st means the first feasible\nsolution found by CPLEX.\n\n5.4 Experiments on real-world datasets\nIn addition to the experiments on synthetic data, we identified sets of publicly available benchmark\nor real-world instances for each problem, and performed experiments on them. A summary of results\nis in Table 3, and details are given in Appendix C. S2V-DQN significantly outperforms all competing\nmethods for MVC, MAXCUT and TSP.\n\nTable 3: Realistic data experiments, results summary. Values are average approximation ratios.\n\nProblem | Dataset | S2V-DQN | Best Competitor | 2nd Best Competitor\nMVC | MemeTracker | 1.0021 | 1.2220 (MVCApprox-Greedy) | 1.4080 (MVCApprox)\nMAXCUT | Physics | 1.0223 | 1.2825 (MaxcutApprox) | 1.8996 (SDP)\nTSP | TSPLIB | 1.0475 | 1.0800 (Farthest) | 1.0947 (2-opt)\n\n5.5 Discovery of interesting new algorithms\nWe further examined the algorithms learned by S2V-DQN, and tried to interpret what greedy heuristics\nhave been learned. We found that S2V-DQN is able to discover new and interesting algorithms which\nintuitively make sense but have not been analyzed before. For instance, S2V-DQN discovers an\nalgorithm for MVC where nodes are selected to balance between their degrees and the connectivity\nof the remaining graph (Appendix Figures D.4 and D.7). For MAXCUT, S2V-DQN discovers an\nalgorithm where nodes are picked to avoid cancelling out existing edges in the cut set (Appendix\nFigure D.5). 
These results suggest that S2V-DQN may also be a good assistive tool for discovering\nnew algorithms, especially in cases when the graph optimization problems are new and less\nwell-studied.\n\n6 Conclusions\nWe presented an end-to-end machine learning framework for automatically designing greedy\nheuristics for hard combinatorial optimization problems on graphs. Central to our approach is the\ncombination of a deep graph embedding with reinforcement learning. Through extensive experimental\nevaluation, we demonstrate the effectiveness of the proposed framework in learning greedy heuristics\nas compared to manually-designed greedy algorithms. The excellent performance of the learned\nheuristics is consistent across multiple different problems, graph types, and graph sizes, suggesting\nthat the framework is a promising new tool for designing algorithms for graph problems.\n\nAcknowledgments\nThis project was supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF\nCAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340,\nIntel ISTC, NVIDIA and Amazon AWS. Dilkina is supported by NSF grant CCF-1522054 and\nExxonMobil.\n\nReferences\n[1] Albert, R\u00e9ka and Barab\u00e1si, Albert-L\u00e1szl\u00f3. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.\n\n[2] Andrychowicz, Marcin, Denil, Misha, Gomez, Sergio, Hoffman, Matthew W, Pfau, David, Schaul, Tom, and de Freitas, Nando. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981\u20133989, 2016.\n\n[3] Applegate, David, Bixby, Robert, Chvatal, Vasek, and Cook, William. Concorde TSP solver, 2006.\n\n[4] Applegate, David L, Bixby, Robert E, Chvatal, Vasek, and Cook, William J. The traveling salesman problem: a computational study. Princeton University Press, 2011.\n\n[5] Balas, Egon and Ho, Andrew. 
Set covering algorithms using cutting planes, heuristics, and subgradient optimization: a computational study. Combinatorial Optimization, pp. 37\u201360, 1980.\n\n[6] Bello, Irwan, Pham, Hieu, Le, Quoc V, Norouzi, Mohammad, and Bengio, Samy. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.\n\n[7] Boyan, Justin and Moore, Andrew W. Learning evaluation functions to improve optimization by local search. Journal of Machine Learning Research, 1(Nov):77\u2013112, 2000.\n\n[8] Chen, Yutian, Hoffman, Matthew W, Colmenarejo, Sergio Gomez, Denil, Misha, Lillicrap, Timothy P, and de Freitas, Nando. Learning to learn for global optimization of black box functions. arXiv preprint arXiv:1611.03824, 2016.\n\n[9] Dai, Hanjun, Dai, Bo, and Song, Le. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.\n\n[10] Du, Nan, Song, Le, Gomez-Rodriguez, Manuel, and Zha, Hongyuan. Scalable influence estimation in continuous-time diffusion networks. In NIPS, 2013.\n\n[11] Erd\u0151s, Paul and R\u00e9nyi, A. On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci, 5:17\u201361, 1960.\n\n[12] Goemans, M.X. and Williamson, D. P. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115\u20131145, 1995.\n\n[13] Gomez-Rodriguez, Manuel, Leskovec, Jure, and Krause, Andreas. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1019\u20131028. ACM, 2010.\n\n[14] Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwi\u0144ska, Agnieszka, Colmenarejo, Sergio G\u00f3mez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. 
Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471\u2013476, 2016.\n\n[15] Gu, Shixiang, Lillicrap, Timothy, Ghahramani, Zoubin, Turner, Richard E, and Levine, Sergey. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.\n\n[16] He, He, Daume III, Hal, and Eisner, Jason M. Learning to search in branch and bound algorithms. In Advances in Neural Information Processing Systems, pp. 3293\u20133301, 2014.\n\n[17] IBM. CPLEX User\u2019s Manual, Version 12.6.1, 2014.\n\n[18] Johnson, David S and McGeoch, Lyle A. Experimental analysis of heuristics for the STSP. In The traveling salesman problem and its variations, pp. 369\u2013443. Springer, 2007.\n\n[19] Karp, Richard M. Reducibility among combinatorial problems. In Complexity of computer computations, pp. 85\u2013103. Springer, 1972.\n\n[20] Kempe, David, Kleinberg, Jon, and Tardos, \u00c9va. Maximizing the spread of influence through a social network. In KDD, pp. 137\u2013146. ACM, 2003.\n\n[21] Khalil, Elias B., Dilkina, B., and Song, L. Scalable diffusion-aware optimization of network topology. In Knowledge Discovery and Data Mining (KDD), 2014.\n\n[22] Khalil, Elias B., Le Bodic, Pierre, Song, Le, Nemhauser, George L, and Dilkina, Bistra N. Learning to branch in mixed integer programming. In AAAI, pp. 724\u2013731, 2016.\n\n[23] Khalil, Elias B., Dilkina, Bistra, Nemhauser, George, Ahmed, Shabbir, and Shao, Yufen. Learning to run heuristics in tree search. In 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.\n\n[24] Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[25] Kleinberg, Jon and Tardos, Eva. Algorithm design. Pearson Education India, 2006.\n\n[26] Lagoudakis, Michail G and Littman, Michael L. Learning to select branching rules in the DPLL procedure for satisfiability. 
Electronic Notes in Discrete Mathematics, 9:344\u2013359, 2001.\n\n[27] Li, Ke and Malik, Jitendra. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.\n\n[28] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin A. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602.\n\n[29] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533, 2015.\n\n[30] Papadimitriou, C. H. and Steiglitz, K. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, New Jersey, 1982.\n\n[31] Peleg, David, Schechtman, Gideon, and Wool, Avishai. Approximating bounded 0-1 integer linear programs. In Proceedings of the 2nd Israel Symposium on Theory and Computing Systems, pp. 69\u201377. IEEE, 1993.\n\n[32] Reinelt, Gerhard. TSPLIB\u2014a traveling salesman problem library. ORSA Journal on Computing, 3(4):376\u2013384, 1991.\n\n[33] Riedmiller, Martin. Neural fitted Q iteration \u2013 first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317\u2013328. Springer, 2005.\n\n[34] Sabharwal, Ashish, Samulowitz, Horst, and Reddy, Chandra. Guiding combinatorial optimization with UCT. In CPAIOR, pp. 356\u2013361. Springer, 2012.\n\n[35] Samulowitz, Horst and Memisevic, Roland. Learning to solve QBF. In AAAI, 2007.\n\n[36] Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction. MIT Press, 1998.\n\n[37] Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692\u20132700, 2015.\n\n[38] Zhang, Wei and Dietterich, Thomas G. 
Solving combinatorial optimization tasks by reinforcement learning: A general methodology applied to resource-constrained scheduling. Journal of Artificial Intelligence Research, 1:1\u201338, 2000.\n", "award": [], "sourceid": 3183, "authors": [{"given_name": "Elias", "family_name": "Khalil", "institution": "Georgia Tech"}, {"given_name": "Hanjun", "family_name": "Dai", "institution": "Georgia Tech"}, {"given_name": "Yuyu", "family_name": "Zhang", "institution": null}, {"given_name": "Bistra", "family_name": "Dilkina", "institution": "Georgia Institute of Technology"}, {"given_name": "Le", "family_name": "Song", "institution": "Georgia Institute of Technology"}]}