{"title": "Learning dynamic polynomial proofs", "book": "Advances in Neural Information Processing Systems", "page_first": 4179, "page_last": 4188, "abstract": "Polynomial inequalities lie at the heart of many mathematical disciplines. In this paper, we consider the fundamental computational task of automatically searching for proofs of polynomial inequalities. We adopt the framework of semi-algebraic proof systems that manipulate polynomial inequalities via elementary inference rules that infer new inequalities from the premises. These proof systems are known to be very powerful, but searching for proofs remains a major difficulty. In this work, we introduce a machine learning based method to search for a dynamic proof within these proof systems. We propose a deep reinforcement learning framework that learns an embedding of the polynomials and guides the choice of inference rules, taking the inherent symmetries of the problem as an inductive bias. We compare our approach with powerful and widely-studied linear programming hierarchies based on static proof systems, and  show that our method reduces the size of the linear program by several orders of magnitude while also improving performance. These results hence pave the way towards augmenting powerful and well-studied semi-algebraic proof systems with machine learning guiding strategies for enhancing the expressivity of such proof systems.", "full_text": "Learning dynamic polynomial proofs\n\nAlhussein Fawzi\n\nDeepMind\n\nafawzi@google.com\n\nMateusz Malinowski\n\nDeepMind\n\nmateuszm@google.com\n\nHamza Fawzi\n\nUniversity of Cambridge\n\nhf323@cam.ac.uk\n\nOmar Fawzi\nENS Lyon\n\nomar.fawzi@ens-lyon.fr\n\nAbstract\n\nPolynomial inequalities lie at the heart of many mathematical disciplines. In this\npaper, we consider the fundamental computational task of automatically searching\nfor proofs of polynomial inequalities. 
We adopt the framework of semi-algebraic\nproof systems that manipulate polynomial inequalities via elementary inference\nrules that infer new inequalities from the premises. These proof systems are known\nto be very powerful, but searching for proofs remains a major difficulty. In this\nwork, we introduce a machine learning based method to search for a dynamic\nproof within these proof systems. We propose a deep reinforcement learning\nframework that learns an embedding of the polynomials and guides the choice of\ninference rules, taking the inherent symmetries of the problem as an inductive bias.\nWe compare our approach with powerful and widely-studied linear programming\nhierarchies based on static proof systems, and show that our method reduces the\nsize of the linear program by several orders of magnitude while also improving\nperformance. These results hence pave the way towards augmenting powerful and\nwell-studied semi-algebraic proof systems with machine learning guiding strategies\nfor enhancing the expressivity of such proof systems.\n\n1 Introduction\n\nPolynomial inequalities abound in mathematics and its applications. Many questions in the areas\nof control theory [Par00], robotics [MAT13], geometry [PP04], combinatorics [Lov79], and program\nverification [MFK+16] can be modeled using polynomial inequalities. For example, deciding the\nstability of a control system can be reduced to proving the nonnegativity of a polynomial [PP02].\nProducing proofs of polynomial inequalities is thus of paramount importance for these applications,\nand has been a very active field of research [Las15].\nTo produce such proofs, we rely on semi-algebraic proof systems, which define a framework for\nmanipulating polynomial inequalities. These proof systems define inference rules that generate new\npolynomial inequalities from existing ones. For example, inference rules can state that the product\nand the sum of two non-negative polynomials are non-negative. 
Given a polynomial f(x), a proof of\nglobal non-negativity of f consists of a sequence of applications of the inference rules, starting from\na set of axioms, until we reach the target statement. Finding such a path is in general a very complex\ntask. To overcome this, a very popular approach in polynomial optimization is to use hierarchies\nthat are based on static proof systems, whereby inference rules are unrolled for a fixed number of\nsteps, and convex optimization is leveraged for the proof search. Despite the great success of such\nmethods in computer science and polynomial optimization [Lau03, CT12], this approach\ncan suffer from a lack of expressivity at lower levels of the hierarchy, and a curse of dimensionality\nat higher levels of the hierarchy. Moreover, such static proofs significantly depart from our common\nconception of the proof search process, which is inherently sequential. This makes static proofs\ndifficult to interpret.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper, we use machine learning to guide the search for a dynamic proof of polynomial\ninequalities. We believe this is the first attempt to use machine learning to search for semi-algebraic\nproofs. Specifically, we list our main contributions as follows:\n\u2022 We propose a novel neural network architecture to handle polynomial inequalities with built-in\nsupport for the symmetries of the problem.\n\u2022 Leveraging the proposed architecture, we train a prover agent with DQN [MKS+13] in an\nunsupervised environment; i.e., without having access to any existing proof or ground truth\ninformation.\n\u2022 We illustrate our results on the maximum stable set problem, a well known combinatorial problem\nthat is intractable in general. Using a well-known semi-algebraic proof system [LS91, SA90],\nwe show that our dynamic prover significantly outperforms the corresponding static, unrolled,\nmethod.\n\nRelated works. Semi-algebraic proof systems have been studied by various communities, e.g., in real\nalgebraic geometry, global optimization, and in theoretical computer science. Completeness results\nfor these proof systems have been obtained in real algebraic geometry, e.g., [Kri64, Ste74]. In global\noptimization, such proof systems have led to the development of very successful convex relaxations\nbased on static hierarchies [Par00, Las01, Lau03]. In theoretical computer science, static hierarchies\nhave become a standard tool for algorithm design [BS14], often leading to optimal performance.\nGrigoriev et al. [GHP02] studied the proof complexity of various problems using different\nsemi-algebraic proof systems. This fundamental work has shown that problems admitting proofs of very\nlarge static degree can admit a compact dynamic proof. While most previous work has focused on\nunderstanding the power of bounded-degree static proofs, there has been very little work on devising\nstrategies to search for dynamic proofs, and our work is a first step in this direction.\nRecent works have also studied machine learning strategies for automated theorem proving [BLR+19,\nHDSS18, KUMO18, GKU+18]. Such works generally build on existing theorem provers and seek\nto improve the choice of inference rules or tactics at each step of the proof. In contrast, our work\ndoes not rely on existing theorem provers and instead uses elementary inference rules in the context\nof semi-algebraic systems. We see these two lines of work as complementary, as building improved\nprovers for polynomial inequalities can provide a crucial tactic that integrates into general ATP\nsystems. 
We finally note that prior works have applied neural networks to combinatorial optimization\nproblems [BLP18], such as the satisfiability problem [SLB+18]. While such techniques seek to\nshow the existence of good-quality feasible points (e.g., a satisfying assignment), we emphasize\nthat we focus here on proving statements for all values in a set (e.g., showing the nonexistence of\nany satisfying assignment) \u2013 i.e., \u2203 vs \u2200. We also note that the class of polynomial optimization\nproblems contains combinatorial optimization problems as a special case.\nNotations. We let R[x] denote the ring of multivariate polynomials in x = (x1, . . . , xn). For\n\u03b1 \u2208 N^n and x = (x1, . . . , xn), we let x^\u03b1 = x1^\u03b11 \u00b7\u00b7\u00b7 xn^\u03b1n. The degree of a monomial x^\u03b1 is\n|\u03b1| = \u2211_{i=1}^n \u03b1i. The degree of any polynomial in R[x] is the largest degree of any of its monomials.\nFor n \u2208 N, we use [n] to denote the set {1, . . . , n}. We use | \u00b7 | to denote the cardinality of a finite set.\n\n2 Problem modeling using polynomials\n\nTo illustrate the scope of this paper, we review the connection between optimization problems and\nproving the non-negativity of polynomials. We also describe the example of the stable set problem,\nwhich we will use as a running example throughout the paper.\nPolynomial optimization. A general polynomial optimization problem takes the form\n\nmaximize f(x) subject to x \u2208 S, (1)\n\nwhere f(x) is a polynomial and S is a basic closed semi-algebraic set defined using polynomial\nequations and inequalities S = {x \u2208 R^n : gi(x) \u2265 0, hj(x) = 0 \u2200i, j}, where gi, hj are arbitrary\npolynomials. Such a problem subsumes many optimization problems as a special case. For example,\nthe polynomial equality constraint xi^2 = xi restricts xi to be an integer in {0, 1}. As such,\ninteger programming is a special case of (1). Problem (1) can also model many other optimization\nproblems that arise in theory and practice, see e.g., [Las15].\nOptimization and inequalities. In this paper we are interested in proving upper bounds on the\noptimal value of (1). Proving an upper bound of \u03b3 on the optimal value of (1) amounts to proving\nthat\n\n\u2200x \u2208 S, \u03b3 \u2212 f(x) \u2265 0. (2)\n\nWe are looking at proving such inequalities using semi-algebraic proof systems. Therefore, developing\ntractable approaches to proving nonnegativity of polynomials on semialgebraic sets has important\nconsequences on polynomial optimization.\nRemark 1. We note that proving an upper bound on the value of (1) is more challenging than\nproving a lower bound. Indeed, to prove a lower bound on the value of the maximization problem (1)\none only needs to exhibit a feasible point x0 \u2208 S; such a feasible point implies that the optimal value\nis at least f(x0). In contrast, to prove an upper bound we need to prove a polynomial inequality, valid for\nall x \u2208 S (notice the \u2200 quantifier in (2)).\nStable sets in graphs. We now give an example of a well-known combinatorial optimization problem,\nand explain how it can be modeled using polynomials. Let G = (V, E) denote a graph of n = |V|\nnodes. A stable set S in G is a subset of the vertices of G such that for every two vertices in S, there\nis no edge connecting the two. The stable set problem is the problem of finding a stable set with\nlargest cardinality in a given graph. This problem can be formulated as a polynomial optimization\nproblem as follows:\n\nmaximize_{x \u2208 R^n} \u2211_{i=1}^n xi\nsubject to xixj = 0 for all (i, j) \u2208 E,\nxi^2 = xi for all i \u2208 {1, . . . , n}. (3)\n\nThe constraint xi^2 = xi is equivalent to xi \u2208 {0, 1}. The variable x \u2208 R^n is interpreted as the\ncharacteristic function of S: xi = 1 if and only if vertex i belongs to the stable set S. The cardinality\nof S is measured by \u2211_{i=1}^n xi, and the constraint xixj = 0 for ij \u2208 E disallows having two nodes in S\nthat are connected by an edge. Finding a stable set of largest size is a classical NP-hard problem, with\nmany diverse applications [Lov79, Sch03]. As explained earlier for general polynomial optimization\nproblems, showing that there is no stable set of size larger than \u03b3 corresponds to showing that\n\u03b3 \u2212 \u2211_{i=1}^n xi \u2265 0 for all x verifying the constraints of (3).\n\n3 Static and dynamic semi-algebraic proofs\n\nA semi-algebraic proof system is defined by elementary inference rules, which produce non-negative\npolynomials. Specifically, a proof consists in applying these inference rules starting from a set of\naxioms gi(x) \u2265 0, hj(x) = 0 until we reach a desired inequality p \u2265 0 (see footnote 1).\nIn this paper, we will focus on proving polynomial inequalities valid on the hypercube [0, 1]^n =\n{x \u2208 R^n : 0 \u2264 xi \u2264 1, \u2200i = 1, . . . , n}. As such, we consider the following inference rules, which\nappear in the so-called Lov\u00e1sz-Schrijver (LS) proof system [LS91] as well as in the Sherali-Adams\nframework [SA90]:\n\n(g \u2265 0) / (xi g \u2265 0), (g \u2265 0) / ((1 \u2212 xi) g \u2265 0), (gi \u2265 0) / (\u2211_i \u03bbi gi \u2265 0, \u2200\u03bbi \u2265 0), (4)\n\nwhere A / B denotes that A implies B. The proof of a statement (i.e., non-negativity of a polynomial\np) consists in the composition of these elementary inference rules, which exactly yields the desired\npolynomial p. Starting from the axiom 1 \u2265 0, the composition of inference rules in Eq. (4) yields\nfunctions of the form \u2211_{\u03b1,\u03b2} \u03bb_{\u03b1,\u03b2} x^\u03b1 (1 \u2212 x)^\u03b2, where \u03b1 = (\u03b11, . . . 
, \u03b1n) \u2208 N^n and \u03b2 = (\u03b21, . . . , \u03b2n) \u2208\nN^n are tuples of length n, and \u03bb_{\u03b1,\u03b2} are non-negative coefficients. It is clear that all polynomials\nof this form are non-negative for all x \u2208 [0, 1]^n, as they consist in a composition of the inference\nrules (4). As such, writing a polynomial p in this form gives a proof of non-negativity of p on the\nhypercube. The following theorem shows that such a proof always exists provided we assume p(x) is\nstrictly positive for all x \u2208 [0, 1]^n. In words, this shows that the set of inference rules (4) forms a\ncomplete proof system (see footnote 2):\nTheorem 1 ([Kri64] Positivstellensatz). Assume p is a polynomial such that p(x) > 0 for all\nx \u2208 [0, 1]^n. Then there exists an integer l, and nonnegative scalars \u03bb_{\u03b1,\u03b2} \u2265 0 such that\n\np(x) = \u2211_{|\u03b1|+|\u03b2|\u2264l} \u03bb_{\u03b1,\u03b2} x^\u03b1 (1 \u2212 x)^\u03b2. (5)\n\n1 In the setting discussed in Section 2, the desired inequality is p = \u03b3 \u2212 f \u2265 0, where f is the objective\nfunction of the optimization problem in (1).\n\nFigure 1: Illustration of a dynamic vs. static proof. Each concentric circle depicts the set of polynomials\nthat can be proved non-negative by the l\u2019th level of the hierarchy. The wiggly area is the set of polynomials\nof degree e.g., 1. A dynamic proof (black arrows) of p \u2265 0 seeks an (adaptive) sequence of inference\nrules that goes from the initial set of axioms (dots in L0) to target p.\n\nStatic proofs. Theorem 1 suggests the following approach to\nproving non-negativity of a polynomial p(x): fix an integer l\nand search for non-negative coefficients \u03bb_{\u03b1,\u03b2} (for |\u03b1|+|\u03b2| \u2264 l)\nsuch that (5) holds. This static proof technique is one of the\nmost widely used approaches for finding proofs of polynomial\ninequalities, as it naturally translates to solving a convex\noptimization problem [Lau03]. 
In fact, (5) is a linear condition\nin the unknowns \u03bb_{\u03b1,\u03b2}, as the functional equality of two polynomials\nis equivalent to the equality of the coefficients of each\nmonomial. Thus, finding such coefficients is a linear program\nwhere the number of variables is equal to the number of tuples\n(\u03b1, \u03b2) \u2208 N^n \u00d7 N^n such that |\u03b1| + |\u03b2| \u2264 l, i.e., of order \u0398(n^l)\nfor l constant. The collection of these linear programs gives\na hierarchy, indexed by l \u2208 N, for proving non-negativity\nof polynomials. Theorem 1 shows that as long as p > 0 on\n[0, 1]^n there exists l such that p can be proved nonnegative by\nthe l\u2019th level of the hierarchy. However, we do not know a\npriori the value of l. In fact this value of l can be much larger\nthan the degree of the polynomial p. In other words, in order\nto prove the non-negativity of a low-degree polynomial p, one\nmay need to manipulate high-degree polynomial expressions\nand leverage cancellations in the right-hand side of (5) \u2013 see\nillustration below for an example.\nDynamic proofs. For large values of l, the linear program associated to the l\u2019th level of the hierarchy\nis prohibitively large to solve. To remedy this, we propose to search for dynamic proofs of\nnon-negativity. This technique relies on proving intermediate lemmas in a sequential way, as a way to\nfind a concise proof of the desired objective. Crucially, the choice of the intermediate lemmas is\nstrongly problem-dependent \u2013 it depends on the target polynomial p, in addition to the axioms and\npreviously derived lemmas. This is in stark contrast with the static approach, where hierarchies\nare problem-independent (e.g., they are obtained by limiting the degree of proof generators, the\nx^\u03b1(1 \u2212 x)^\u03b2 in our case). 
In spite of the benefits of a dynamic proof system, searching for these\nproofs is a challenging problem on its own, where one has to decide on inference rules applied at\neach step of the proof. We also believe such a dynamic proving approach is more aligned with human\nreasoning, which is also a sequential process where intuition plays an important role in deriving new\nlemmas by applying suitable inference rules that lead to interpretable proofs. We finally note that the\ndynamic proving strategy subsumes the static one, as a static proof can be seen as a non-adaptive\nversion of a dynamic proof.\nIllustration. To illustrate the difference between the static and dynamic proof systems, consider\nthe stable set problem in Sect. 2 on the complete graph on n nodes, where each pair of nodes is\nconnected. It is clear that the maximal stable set has size 1; this can be formulated as follows (see footnote 3):\n\n{ xi^2 = xi, i = 1, . . . , n; xixj = 0, \u2200i \u2260 j; xi \u2265 0, 1 \u2212 xi \u2265 0, i = 1, . . . , n } \u21d2 1 \u2212 \u2211_{i=1}^n xi \u2265 0. (6)\n\nIn the static framework, we seek to express the polynomial 1 \u2212 \u2211_{i=1}^n xi as in (5), modulo the\nequalities xixj = 0. One can verify that\n\n1 \u2212 \u2211_{i=1}^n xi = \u220f_{i=1}^n (1 \u2212 xi) mod (xixj = 0, \u2200i \u2260 j). (7)\n\n2 The result is only true for strictly positive polynomials. More precisely, the proof system in (4) is only\nrefutationally complete.\n3 Note that redundant inequalities xi \u2265 0 and 1 \u2212 xi \u2265 0 have been added for sake of clarity in what follows.\n\nThe proof in Equation (7) is a static proof of degree n because it involves the degree n product\n\u220f_{i=1}^n (1 \u2212 xi). This means that the proof (7) will only be found at level n of the static hierarchy,\nwhich is a linear program of size exponential in n. 
One can further show that it is necessary to go to\nlevel at least n to find a proof of (6) (cf. Supp. Mat.).\nIn contrast, one can provide a dynamic proof of the above where the degree of any intermediate\nlemma is at most two. To see why, it suffices to multiply the polynomials 1 \u2212 xi sequentially, each\ntime eliminating the degree-two terms using the equalities xixj = 0 for i \u2260 j. The dynamic proof\nproceeds as follows (note that no polynomial of degree greater than two is ever formed).\n\n1 \u2212 x1 \u2265 0\n\u2192 [multiply by 1 \u2212 x2 \u2265 0] (1 \u2212 x1)(1 \u2212 x2) \u2265 0 \u2192 [reduce using x1x2 = 0] 1 \u2212 x1 \u2212 x2 \u2265 0\n\u2192 [multiply by 1 \u2212 x3 \u2265 0] (1 \u2212 x1 \u2212 x2)(1 \u2212 x3) \u2265 0 \u2192 [reduce using x1x3 = x2x3 = 0] 1 \u2212 x1 \u2212 x2 \u2212 x3 \u2265 0\n...\n\u2192 [multiply by 1 \u2212 xn \u2265 0] (1 \u2212 x1 \u2212 . . . \u2212 xn\u22121)(1 \u2212 xn) \u2265 0 \u2192 [reduce using xixn = 0 for i < n] 1 \u2212 x1 \u2212 . . . \u2212 xn \u2265 0.\n\n4 Learning dynamic proofs of polynomials\n\n4.1 Reinforcement learning framework for semi-algebraic proof search\n\nWe model the task of finding dynamic proofs as an interaction between the agent and an environment,\nformalized as a Markov Decision Process (MDP), resulting in a sequence of states, actions and\nobserved rewards. 
The agent state st at time step t is defined through the triplet (f, Mt, Et), where:\n\u2022 Mt denotes the memory at t; i.e., the set of polynomials that are known to be non-negative at t.\nThis contains the set of polynomials that are assumed to be non-negative (i.e., axioms gi), as well\nas intermediate steps (i.e., lemmas), which are derived from the axioms through inference rules,\n\u2022 Et denotes the set of equalities; i.e., the set of polynomials identically equal to zero,\n\u2022 f denotes the objective polynomial to bound (cf. Section 2).\nAt each time t, the agent selects an action at from a set of legal actions At, obtained by applying\none or more inference rules in Eq. (4) to elements in Mt (see footnote 4). Observe that since elements in Mt are\nnon-negative, the polynomials in At are also non-negative. The selected action at \u2208 At is then\nappended to the memory Mt+1 at the next time step. After selecting at, a reward rt is observed,\nindicating how close the agent is to finding the proof of the statement, with higher rewards indicating\nthat the agent is \u201ccloser\u201d to finding a proof \u2013 see Sect. 4.2 for more details.\n\nThe goal of the agent is to select actions that maximize future returns Rt = E[\u2211_{t'=t}^T \u03b3^{t'\u2212t} r_{t'}], where\nT indicates the length of an episode, and \u03b3 is the discount factor. We use a deep reinforcement\nlearning algorithm where the action-value function is modeled using a deep neural network q_\u03b8(s, a).\nSpecifically, the neural network takes as input a state-action pair, and outputs an estimate of the return;\nwe use the DQN [MKS+13] algorithm for training, which leverages a replay memory buffer for\nincreased stability [Lin92]. We refer to [MKS+13, Algorithm 1] for more details about this approach.\nNote that in contrast to many RL scenarios, the action space here grows with t, as larger memories\nmean that more lemmas can be derived. 
The large action space makes the task of finding a dynamic\nproof particularly challenging; we therefore rely on dense rewards (Sect. 4.2) and specialized\narchitectures (Sect. 4.3) for tackling this problem.\n\n4 In practice, we limit ourselves to the first two inference rules (i.e., multiplication by xi and 1 \u2212 xi), and find\nlinear combinations using the LP strategy described in Section 4.2. This yields action spaces At of size 2n|Mt|.\n\n4.2 Reward signal\n\nWe now describe the reward signal rt. One potential choice is to assign a positive reward (rt > 0)\nwhen the objective \u03b3\u2217 \u2265 f is reached (where \u03b3\u2217 is the optimal bound) and zero otherwise. However,\nthis suffers from two important problems: 1) the reward is sparse, which makes learning difficult,\n2) this requires the knowledge of the optimal bound \u03b3\u2217. Here, we rely instead on a dense and\nunsupervised reward scheme, where positive reward is given whenever the chosen action results in an\nimprovement of the bound.\nMore formally, at each step t, we solve the following linear program:\n\nmin_{\u03b3t, {\u03bb}} \u03b3t subject to \u03b3t \u2212 f = \u2211_{i=1}^{|Mt|} \u03bbi mi, \u03bb \u2265 0, (8)\n\nwhere {mi} denote the polynomials in Mt. Note that the constraint in Eq. (8) is a functional equality\nof two polynomials, which is equivalent to the equality of the coefficients of the polynomials. In\nwords, Eq. (8) computes the optimal upper bound \u03b3t on f that can be derived through a non-negative\nlinear combination of elements in the memory; in fact, since \u2211_{i=1}^{|Mt|} \u03bbi mi is non-negative, we have\nf \u2264 \u03b3t. Crucially, the computation of the bound in Eq. 
(8) can be done very efficiently, as Mt is\nkept of small size in practice (e.g., |Mt| \u2264 200 in the experiments).\nThen, we compute the reward as the improvement of the bound: rt = \u03b3t \u2212 \u03b3t+1, where\nrt is the reward observed after taking action at. Since a smaller \u03b3 is a tighter upper bound, positive reward is observed only when\nthe chosen action at leads to an improvement of the current bound. We emphasize that this reward\nattribution scheme alleviates the need for any supervision during our training procedure; specifically,\nthe agent does not require human proofs or even estimates of bounds for training.\n\n4.3 Q-network with symmetries\n\nThe basic objects we manipulate are polynomials and sets of polynomials, which impose natural\nsymmetry requirements. We now describe how we build in symmetries in our Q-network q_\u03b8.\nOur Q-network q_\u03b8 takes as input the state st = (f, Mt, Et), as well as the action polynomial at.\nWe represent polynomials as vectors of coefficients of size N, where N is the number of possible\nmonomials. While sets of polynomials (e.g., Mt) can be encoded with a matrix of size c \u00d7 N, where\nc denotes the cardinality of the set, such an encoding does not take into account the orderless nature\nof sets. 
We therefore require our Q-value function to be invariant to the order of enumeration of\nelements in Mt and Et; that is, we require that the following hold for any permutations \u03c0 and \u03c7:\n\nSymmetry I (orderless sets).\nq_\u03b8({m_\u03c0(i)}_{i=1}^{|Mt|}, {e_\u03c7(j)}_{j=1}^{|Et|}, f, a) = q_\u03b8({mi}_{i=1}^{|Mt|}, {ej}_{j=1}^{|Et|}, f, a).\n\nTo satisfy the above symmetry, we consider value functions of the form:\n\nq_\u03b8({mi}_{i=1}^{|Mt|}, {ej}_{j=1}^{|Et|}, f, a) = \u03b6_{\u03b8^(3)}(\u03c3(V), \u03c3(W)),\n\nwhere V = {v_{\u03b8^(1)}(mi, f, a)}_{i=1}^{|Mt|}, W = {v_{\u03b8^(2)}(ej, f, a)}_{j=1}^{|Et|}, v_{\u03b8^(1)} and v_{\u03b8^(2)} are trainable neural networks\nwith additional symmetry constraints (see below), \u03c3 is a symmetric function of the arguments\n(e.g., max, sum), and \u03b6_{\u03b8^(3)} is a trainable neural network.\nIn addition to the above symmetry, v_\u03b8 has to be well chosen in order to guarantee invariance under\nrelabeling of variables (that is, xi \u2192 x_\u03c0(i) for any permutation \u03c0). In fact, the variable names do\nnot have any specific meaning per se; relabeling all polynomials in the same way results in the exact\nsame problem. We therefore require that the following constraint is satisfied for any permutation \u03c0:\n\nSymmetry II (variable relabeling).\nv_\u03b8(m, f, a) = v_\u03b8(\u03c0m, \u03c0f, \u03c0a), (9)\n\nwhere \u03c0m indicates a permutation of the variables in m using \u03c0. For example, if \u03c0 is such that\n\u03c0(1) = 2, \u03c0(2) = 3 and \u03c0(3) = 1, and m = x1 + 2x1x3 then \u03c0m = x2 + 2x1x2. Note that in the\nabove constraint, the same permutation \u03c0 is acting on m, f and a.\n\nFigure 2: Structure of Q-network. {mi} denotes the set of axioms and lemmas, a denotes the action,\nf is the objective function, and ej denotes the set of equality polynomials.\n\nWe now describe how we impose this symmetry. Given two triplets of monomials (x^\u03b11, x^\u03b12, x^\u03b13)\nand (x^\u03b21, x^\u03b22, x^\u03b23), we say that these two triplets are equivalent (denoted by the symbol \u223c) iff\nthere exists a permutation \u03c0 such that \u03b2i = \u03c0(\u03b1i) for i = 1, 2, 3. For example, (x1x2, x2^2, x2x3) \u223c\n(x1x3, x3^2, x2x3). The equivalence class [(x^\u03b11, x^\u03b12, x^\u03b13)] regroups all triplets of monomials that\nare equivalent to (x^\u03b11, x^\u03b12, x^\u03b13). We denote by E the set of all such equivalence classes. Our first\nstep to construct v_\u03b8 consists in mapping the triplet (m, f, a) to a feature vector which respects the\nvariable relabeling symmetry. To do so, let m, f, a be polynomials in R[x]; we consider a feature\nfunction that is trilinear in (m, f, a); that is, it is linear in each argument m, f and a. For such a\nfunction, T : R[x] \u00d7 R[x] \u00d7 R[x] \u2192 R^s (where s denotes the feature size), we have: T(m, f, a) =\n\u2211_{\u03b1,\u03b2,\u03b3} m_\u03b1 f_\u03b2 a_\u03b3 T(x^\u03b1, x^\u03b2, x^\u03b3). If (x^\u03b1, x^\u03b2, x^\u03b3) \u223c (x^\u03b1', x^\u03b2', x^\u03b3'), then we set T(x^\u03b1, x^\u03b2, x^\u03b3) =\nT(x^\u03b1', x^\u03b2', x^\u03b3'). In other words, the function T has to be constant on each equivalence class. Such\na T will satisfy our symmetry constraint that T(m, f, a) = T(\u03c0m, \u03c0f, \u03c0a) for any permutation \u03c0.\nFor example, the above equality constrains T(1, x1, x1) = T(1, xi, xi) for all i since (1, x1, x1) \u223c\n(1, xi, xi), and T(x1, x2, x3) = T(xi, xj, xk) for pairwise distinct i, j, k as (x1, x2, x3) \u223c (xi, xj, xk). 
Note,\nhowever, that T(1, x1, x1) \u2260 T(1, xi, xj) for i \u2260 j; in fact, (1, x1, x1) \u2241 (1, xi, xj). Finally, we set\nv_\u03b8 = u_\u03b8 \u25e6 T where u_\u03b8 is a trainable neural network. Fig. 2 summarizes the architecture we use for the\nQ-network. We refer to Supp. Mat. for more details about architectures and practical implementation.\n\n5 Experimental results\n\nWe illustrate our dynamic proving approach on the stable set problem described in Section 2. This\nproblem has been extensively studied in the polynomial optimization literature [Lau03]. We evaluate\nour method against standard linear programming hierarchies considered in this field. The largest\nstable set in a graph G is denoted \u03b1(G).\nTraining setup. We train our prover on randomly generated graphs of size n = 25, where an edge\nbetween nodes i and j is created with probability p \u2208 [0.5, 1]. We seek dynamic proofs using the\nproof system in Eq. (4), starting from the axioms {xi \u2265 0, 1 \u2212 xi \u2265 0, i = 1, . . . , n} and the\npolynomial equalities xixj = 0 for all edges ij in the graph and xi^2 = xi for all nodes i. We\nrestrict the number of steps in the dynamic proof to be at most 100 steps and limit the degree of any\nintermediate lemma to 2. We note that our training procedure is unsupervised and does not require\nprior proofs, or knowledge of \u03b1(G) for learning. We use the DQN approach presented in Sect. 4 and\nprovide additional details about hyperparameters and architecture choices in the Supp. Mat.\nWe compare our approach to the following static hierarchy of linear programs indexed by l:\n\nmin. \u03b3 s.t. \u03b3 \u2212 \u2211_{i=1}^n xi = \u2211_{|\u03b1|+|\u03b2|\u2264l} \u03bb_{\u03b1,\u03b2} x^\u03b1 (1 \u2212 x)^\u03b2 mod (xixj = 0, ij \u2208 E; xi^2 = xi, i \u2208 V). (10)\n\nThis hierarchy corresponds to the level l of the Sherali-Adams hierarchy applied to the maximum\nstable set problem [LS14, Section 4], which is one of the most widely studied hierarchies for\ncombinatorial optimization [Lau03]. Observe that the linear program (10) has \u0398(n^l) variables and\nconstraints for l constant. By completeness of the hierarchy, we know that solving the linear program\n(10) at level l = n yields the exact value \u03b1(G) of the maximum stable set.\n\nn | Dyn. (deg. 2) | Static l = 2 | Static l = 3 | Static l = 4 | Static l = 5 | Random | LP size (Dyn.) | LP size (Static l = 5)\n15 | 3.43 | 7.50 | 5.01 | 3.94 | 3.48 | 5.91 | 130 | 5.9 \u00d7 10^3\n20 | 3.96 | 10.0 | 6.67 | 5.04 | 4.32 | 8.91 | 140 | 2.6 \u00d7 10^4\n25 | 4.64 | 12.50 | 8.33 | 6.26 | 5.08 | 12.7 | 150 | 7.6 \u00d7 10^4\n30 | 5.44 | 15.0 | 10.0 | 7.50 | 6.03 | 15.6 | 160 | 1.9 \u00d7 10^5\n35 | 6.37 | 17.5 | 11.67 | 8.75 | 7.02 | 19.6 | 170 | 4.2 \u00d7 10^5\n40 | 7.23 | 20.0 | 13.33 | 10.0 | 8.00 | 23.5 | 180 | 8.3 \u00d7 10^5\n45 | 8.14 | 22.5 | 15.0 | 11.25 | 9.00 | 28.1 | 190 | 1.5 \u00d7 10^6\n50 | 8.89 | 25.0 | 16.67 | 12.50 | 10.0 | 31.6 | 200 | 2.6 \u00d7 10^6\n\nTable 1: Evaluation of different methods on 100 randomly sampled problems on the maximal stable\nset problem. For each method, the average estimated bound is displayed (lower values correspond to\nbetter \u2013 i.e., tighter \u2013 bounds). Moreover, the average size of the linear program in which the proof is\nsought is reported in the last two columns. The proof size is limited to 100 for the dynamic proof,\nleading to an LP of size 100 + 2n, as the problem has 2n inequality axioms (xi \u2265 0, 1 \u2212 xi \u2265 0).\nNote that the static linear program at level l cannot give a bound smaller than n/l; we prove this\nresult in Theorem 1 in Supp. Mat.\n\nProof that 3 \u2212 \u2211_{i=1}^7 xi \u2265 0:\n\n[Step 0] 0 <= -x2 - x3 + 1 = (-x3 + 1) * (-x2 + 1)\n[Step 1] 0 <= -x5 - x6 + 1 = (-x6 + 1) * (-x5 + 1)\n[Step 2] 0 <= -x4 - x5 + 1 = (-x4 + 1) * (-x5 + 1)\n[Step 3] 0 <= -x1 - x7 + 1 = (-x7 + 1) * (-x1 + 1)\n[Step 4] 0 <= -x1 - x2 + 1 = (-x1 + 1) * (-x2 + 1)\n[Step 5] 0 <= -x2*x4 - x2*x5 + x2 = [Step 2] * (x2)\n[Step 6] 0 <= x1*x5 - x1 + x2*x5 - x2 - x5 + 1 = [Step 4] * (-x5 + 1)\n[Step 7] 0 <= x5*x7 - x5 - x6 - x7 + 1 = [Step 1] * (-x7 + 1)\n[Step 8] 0 <= x2*x4 - x2 - x3 - x4 + 1 = [Step 0] * (-x4 + 1)\n[Step 9] 0 <= -x1*x5 - x5*x7 + x5 = [Step 3] * (x5)\n0 <= 1 * [Step 5] + 1 * [Step 7] + 1 * [Step 8]\n+ 1 * [Step 9] + 1 * [Step 6] = 3 \u2212 \u2211_{i=1}^7 xi.\n\nTable 2: An example of proof generated by our agent. Axioms are shown in blue, and derived\npolynomials (i.e., intermediate lemmas) are shown in red. Note that coefficients in the proof are all\nrational, leading to an exact and fully verifiable proof. See more examples of proofs in the Supp. Mat.\n\nResults. Table 1 shows the results of the proposed dynamic prover on a test set consisting of random\ngraphs of different sizes (see footnote 5). We compare the value obtained by the dynamic prover with a random\nprover taking random legal actions (from the considered proof system), as well as with the\nSherali-Adams hierarchy (10). The reported values correspond to an average over a set of 100 randomly\ngenerated graphs. We note that for all methods, bounds are accompanied with a formal, verifiable,\nproof, and are hence correct by definition.\n\n5 Despite training the network on graphs of fixed size, we can test it on graphs of any size, as the embedding\ndimension is independent of n. In fact, it is equal to the number of equivalence classes |E|.\n\nOur dynamic polynomial prover is able to prove an upper bound on \u03b1(G) that is better than the one\nobtained by the Sherali-Adams hierarchy with a linear program that is smaller by several orders of\nmagnitude. For example on graphs of 50 nodes, the Sherali-Adams linear program at level l = 5 has\nmore than two million variables, and gives an upper bound on \u03b1(G) that is worse than our approach\nwhich only uses a linear program of size 200. This highlights the huge benefits that dynamic proofs\ncan offer, in comparison to hierarchy-based static approaches. We also see that our agent is able to\nlearn useful strategies for proving polynomial inequalities, as it significantly outperforms the random\nagent. 
We emphasize that while the proposed agent is only trained on graphs of size n = 25, it still\noutperforms all other methods for larger values of n, showing good out-of-distribution generalization.\nNote finally that the proposed architecture, which incorporates symmetries (as described in Sect. 4.3),\nsignificantly outperforms other generic architectures, as shown in the Supp. Mat.\nTable 2 provides an example of a proof produced by our automatic prover, showing that the largest\nstable set in the cycle graph on 7 nodes has size at most 3. Despite the symmetric nature of the graph\n(unlike random graphs in the training set), our proposed approach leads to human-interpretable and\nrelatively concise proofs. In contrast, the static approach involves searching for a proof in a very\nlarge algebraic set.\n\n6 Conclusion\n\nExisting hierarchies for polynomial optimization currently rely on a static viewpoint of algebraic\nproofs and leverage the convexity of the search problem. We propose here a new approach for\nsearching for a dynamic proof using machine learning based strategies. The framework we propose\nfor proving inequalities on polynomials leads to more natural, interpretable proofs and, in our\nstable set experiments, significantly outperforms static proof techniques. We believe that augmenting\npolynomial systems with ML-guided dynamic proofs will have significant impact in application areas such as control theory,\nrobotics, and verification, where many problems can be cast as proving polynomial inequalities. One very\npromising avenue for future research is to extend our dynamic proof search method to other more\npowerful semi-algebraic proof systems, e.g., those based on semi-definite programming.\n\nReferences\n\n[BLP18] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial\noptimization: a methodological tour d\u2019horizon. 
arXiv preprint arXiv:1811.06128, 2018.\n\n[BLR+19] Kshitij Bansal, Sarah M Loos, Markus N Rabe, Christian Szegedy, and Stewart Wilcox.\nHOList: An environment for machine learning of higher-order theorem proving (extended\nversion). arXiv preprint arXiv:1904.03241, 2019.\n\n[BS14] Boaz Barak and David Steurer. Sum-of-squares proofs and the quest toward optimal\nalgorithms. In Proceedings of International Congress of Mathematicians (ICM), 2014.\n\n[CT12] Eden Chlamtac and Madhur Tulsiani. Convex relaxations and integrality gaps. In\nHandbook on semidefinite, conic and polynomial optimization, pages 139\u2013169. Springer,\n2012.\n\n[GHP02] Dima Grigoriev, Edward A. Hirsch, and Dmitrii V. Pasechnik. Complexity of semialgebraic\nproofs. Mosc. Math. J., 2(4):647\u2013679, 805, 2002.\n\n[GKU+18] Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ramana Kumar, and Michael Norrish.\nLearning to prove with tactics. arXiv preprint arXiv:1804.00596, 2018.\n\n[HDSS18] Daniel Huang, Prafulla Dhariwal, Dawn Song, and Ilya Sutskever. Gamepad: A learning\nenvironment for theorem proving. arXiv preprint arXiv:1806.00608, 2018.\n\n[Kri64] Jean-Louis Krivine. Quelques propri\u00e9t\u00e9s des pr\u00e9ordres dans les anneaux commutatifs\nunitaires. Comptes Rendus Hebdomadaires des Seances de l\u2019Academie des Sciences,\n258(13):3417, 1964.\n\n[KUMO18] Cezary Kaliszyk, Josef Urban, Henryk Michalewski, and Miroslav Ol\u0161\u00e1k. Reinforcement\nlearning of theorem proving. In Advances in Neural Information Processing Systems,\npages 8822\u20138833, 2018.\n\n[Las01] Jean B Lasserre. Global optimization with polynomials and the problem of moments.\nSIAM Journal on Optimization, 11(3):796\u2013817, 2001.\n\n[Las15] Jean B Lasserre. An introduction to polynomial and semi-algebraic optimization,\nvolume 52. Cambridge University Press, 2015.\n\n[Lau03] Monique Laurent. 
A comparison of the Sherali-Adams, Lov\u00e1sz-Schrijver, and Lasserre\nrelaxations for 0\u20131 programming. Mathematics of Operations Research, 28(3):470\u2013496,\n2003.\n\n[Lin92] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning\nand teaching. Machine learning, 8(3-4):293\u2013321, 1992.\n\n[Lov79] L\u00e1szl\u00f3 Lov\u00e1sz. On the Shannon capacity of a graph. IEEE Transactions on Information\ntheory, 25(1):1\u20137, 1979.\n\n[LS91] L\u00e1szl\u00f3 Lov\u00e1sz and Alexander Schrijver. Cones of matrices and set-functions and 0\u20131\noptimization. SIAM journal on optimization, 1(2):166\u2013190, 1991.\n\n[LS14] Monique Laurent and Zhao Sun. Handelman\u2019s hierarchy for the maximum stable set\nproblem. Journal of Global Optimization, 60(3):393\u2013423, 2014.\n\n[MAT13] Anirudha Majumdar, Amir Ali Ahmadi, and Russ Tedrake. Control design along\ntrajectories with sums of squares programming. In 2013 IEEE International Conference\non Robotics and Automation, pages 4054\u20134061. IEEE, 2013.\n\n[MFK+16] Alexandre Mar\u00e9chal, Alexis Fouilh\u00e9, Tim King, David Monniaux, and Micha\u00ebl P\u00e9rin.\nPolyhedral approximation of multivariate polynomials using Handelman\u2019s theorem. In\nInternational Conference on Verification, Model Checking, and Abstract Interpretation,\npages 166\u2013184. Springer, 2016.\n\n[MKS+13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou,\nDaan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning.\narXiv preprint arXiv:1312.5602, 2013.\n\n[Par00] Pablo A Parrilo. Structured semidefinite programs and semialgebraic geometry methods\nin robustness and optimization. PhD thesis, California Institute of Technology, 2000.\n\n[PP02] Antonis Papachristodoulou and Stephen Prajna. On the construction of Lyapunov\nfunctions using the sum of squares decomposition. 
In Proceedings of the 41st IEEE\nConference on Decision and Control, 2002, volume 3, pages 3482\u20133487. IEEE, 2002.\n\n[PP04] Pablo A Parrilo and Ronen Peretz. An inequality for circle packings proved by semidefinite\nprogramming. Discrete & Computational Geometry, 31(3):357\u2013367, 2004.\n\n[SA90] Hanif D Sherali and Warren P Adams. A hierarchy of relaxations between the continuous\nand convex hull representations for zero-one programming problems. SIAM Journal on\nDiscrete Mathematics, 3(3):411\u2013430, 1990.\n\n[Sch03] Alexander Schrijver. Combinatorial optimization: polyhedra and efficiency, volume 24.\nSpringer Science & Business Media, 2003.\n\n[SLB+18] Daniel Selsam, Matthew Lamm, Benedikt B\u00fcnz, Percy Liang, Leonardo de Moura,\nand David L Dill. Learning a SAT solver from single-bit supervision. arXiv preprint\narXiv:1802.03685, 2018.\n\n[Ste74] Gilbert Stengle. A nullstellensatz and a positivstellensatz in semialgebraic geometry.\nMathematische Annalen, 207(2):87\u201397, 1974.\n", "award": [], "sourceid": 2323, "authors": [{"given_name": "Alhussein", "family_name": "Fawzi", "institution": "DeepMind"}, {"given_name": "Mateusz", "family_name": "Malinowski", "institution": "DeepMind"}, {"given_name": "Hamza", "family_name": "Fawzi", "institution": "University of Cambridge"}, {"given_name": "Omar", "family_name": "Fawzi", "institution": "ENS Lyon"}]}