{"title": "Reinforcement Learning of Theorem Proving", "book": "Advances in Neural Information Processing Systems", "page_first": 8822, "page_last": 8833, "abstract": "We introduce a theorem proving algorithm that uses practically no domain heuristics for guiding its connection-style proof search. Instead, it runs many Monte-Carlo simulations guided by reinforcement learning from previous proof attempts. We produce several versions of the prover, parameterized by different learning and guiding algorithms. The strongest version of the system is trained on a large corpus of mathematical problems and evaluated on previously unseen problems. The trained system solves within the same number of inferences over 40% more problems than a baseline prover, which is an unusually high improvement in this hard AI domain. To our knowledge this is the first time reinforcement learning has been convincingly applied to solving general mathematical problems on a large scale.", "full_text": "Reinforcement Learning of Theorem Proving\n\nCezary Kaliszyk\u2217\nUniversity of Innsbruck\n\nJosef Urban\u2217\nCzech Technical University in Prague\n\nHenryk Michalewski\nUniversity of Warsaw, Institute of Mathematics of the Polish Academy of Sciences, deepsense.ai\n\nMirek Ol\u0161\u00e1k\nCharles University\n\nAbstract\n\nWe introduce a theorem proving algorithm that uses practically no domain heuristics for guiding its connection-style proof search. Instead, it runs many Monte-Carlo simulations guided by reinforcement learning from previous proof attempts. We produce several versions of the prover, parameterized by different learning and guiding algorithms. The strongest version of the system is trained on a large corpus of mathematical problems and evaluated on previously unseen problems. The trained system solves within the same number of inferences over 40% more problems than a baseline prover, which is an unusually high improvement in this hard AI domain. 
To our knowledge this is the first time reinforcement learning has been convincingly applied to solving general mathematical problems on a large scale.\n\n1 Introduction\n\nAutomated theorem proving (ATP) [38] can in principle be used to attack any formally stated mathematical problem. For this, state-of-the-art ATP systems rely on fast implementations of complete proof calculi such as resolution [37], superposition [4], SMT [5] and (connection) tableau [15] that have been improved over several decades by many search heuristics. This is already useful for automatically discharging smaller proof obligations in large interactive theorem proving (ITP) verification projects [7]. In practice, today\u2019s best ATP systems are, however, still far weaker than trained mathematicians in most research domains. Machine learning from many proofs could be used to improve on this.\nFollowing this idea, large formal proof corpora have recently been translated to ATP formalisms [45, 32, 19], and machine learning over them has started to be used to train guidance of ATP systems [47, 28, 2]. It has been used first to select a small number of relevant facts for proving new conjectures over large formal libraries [1, 6, 11], and more recently also to guide the internal search of the ATP systems. In sophisticated saturation-style provers this has been done by feedback loops for strategy invention [46, 17, 39] and by using supervised learning [16, 30] to select the next given clause [31]. In the simpler connection tableau systems such as leanCoP [34], supervised learning has been used to choose the next tableau extension step [48, 20], and first experiments with Monte-Carlo guided proof search [10] have been done. Despite a limited ability to prioritize the proof search, the guided search in the latter connection systems is still organized by iterative deepening. 
This ensures completeness, which has long been a sine qua non for building proof calculi.\nIn this work, we remove this requirement, since it basically means that all shorter proof candidates have to be tried before a longer proof is found. The result is a bare connection-style theorem prover that does not use any human-designed proof-search restrictions, heuristics and targeted (decision) procedures. This is in stark contrast to recent mainstream ATP research, which has to a large extent focused on adding more and more sophisticated human-designed procedures in domains such as SMT solving.\n\n\u2217These authors contributed equally to this work.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fBased on the bare prover, we build a sequence of systems, adding Monte-Carlo tree search [25] and reinforcement learning [44] of policy and value guidance. We show that while the performance of the system (called rlCoP) is initially much worse than that of standard leanCoP, after ten iterations of proving and learning it solves significantly more previously unseen problems than leanCoP when using the same total number of inference steps.\nThe rest of the paper is organized as follows. Section 2 explains the basic connection tableau setting and introduces the bare prover. Section 3 describes the integration of the learning-based guiding mechanisms, i.e. Monte-Carlo search and the policy and value guidance. Section 4 evaluates the system on a large corpus of problems extracted from the Mizar Mathematical Library [13].\n\n2 The Game of Connection Based Theorem Proving\n\nWe assume basic first-order logic and theorem proving terminology [38]. We start with the connection tableau architecture as implemented by the leanCoP [34] system. leanCoP is a compact theorem prover whose core procedure can be written in seven lines of Prolog. 
Its input is a (mathematical) problem consisting of axioms and a conjecture formally stated in first-order logic (FOL). The calculus searches for refutational proofs, i.e. proofs showing that the axioms together with the negated conjecture are unsatisfiable.\u00b2 The FOL formulas are first translated to clause normal form (CNF), producing a set of first-order clauses consisting of literals (atoms or their negations). An example set of clauses is shown in Figure 1. The figure also shows a closed connection tableau, i.e., a finished proof tree where every branch contains complementary literals (literals with opposite polarity). Since all branches contain a pair of contradictory literals, this shows that the set of clauses is unsatisfiable.\n\nClauses:\nc1 : P(x)\nc2 : R(x, y) \u2228 \u00acP(x) \u2228 Q(y)\nc3 : S(x) \u2228 \u00acQ(b)\nc4 : \u00acS(x) \u2228 \u00acQ(x)\nc5 : \u00acQ(x) \u2228 \u00acR(a, x)\nc6 : \u00acR(a, x) \u2228 Q(x)\n\n[Tableau: a closed proof tree with root P(a), in which every branch ends in a pair of complementary literals such as Q(b) and \u00acQ(b).]\n\nFigure 1: Closed connection tableau for a set of clauses (adapted from Letz et al. [29]).\n\nThe proof search starts with a start clause as a goal and proceeds by building a connection tableau by repeatedly applying extension steps and reduction steps to it. The extension step connects (unifies) the current goal (a selected tip of a tableau branch) with a complementary literal of a new clause. This extends the current branch, possibly splitting it into several branches if there are more literals in the new clause, and possibly instantiating some variables in the tableau. The reduction step connects the current goal to a complementary literal of the active path, thus closing the current branch. The proof is finished when all branches are closed. 
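For illustration, the branch-closing condition can be checked by a small sketch (our Python illustration of the ground case, ignoring unification; the literal representation is an assumption, not part of leanCoP):

```python
# Literals are modeled as (polarity, atom) pairs, e.g. ("+", "Q(b)") and
# ("-", "Q(b)") are complementary. This ground version ignores unification.

def is_closed(branch):
    """Return True if the branch contains a pair of complementary literals."""
    seen = {}
    for polarity, atom in branch:
        if seen.get(atom, polarity) != polarity:
            return True              # the opposite polarity was seen earlier
        seen[atom] = polarity
    return False

# One branch of the tableau in Figure 1: P(a), Q(b), S(b), not-Q(b).
branch = [("+", "P(a)"), ("+", "Q(b)"), ("+", "S(b)"), ("-", "Q(b)")]
print(is_closed(branch))
```

In the real calculus the complementarity test additionally unifies the two atoms, possibly instantiating variables throughout the tableau.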
The extension and reduction steps are nondeterministic, requiring backtracking in the standard connection calculus. Iterative deepening is typically used to ensure completeness, i.e. making sure that the proof search finds a proof if there is any. Incomplete strategies that restrict backtracking can be used in leanCoP, sometimes improving its performance on benchmarks [33].\n\n\u00b2To minimize the required theorem proving background, we follow the more standard connection tableau calculus presentations using CNF and the refutational setting as in [29]. The leanCoP calculus is typically presented in a dual (DNF) form, which is however isomorphic to the more standard one.\n\n2.1 The Bare Prover\n\nOur bare prover is based on a previous reimplementation [23] of leanCoP in OCaml (mlCoP). Unlike the Prolog version, mlCoP uses an explicit stack for storing the full proof state. This allows us to use the full proof state for machine learning guidance. We first modify mlCoP by removing iterative deepening, i.e., the traversal strategy that makes sure that shorter (shallower) tableaux are tested before deeper ones. Instead, the bare prover randomly chooses extension and reduction steps operating on the current goal, possibly going into arbitrary depth. This makes our bare prover trivially incomplete. A simple example to demonstrate this is the unsatisfiable set of clauses {P(0), \u00acP(x) \u2228 P(s(x)), \u00acP(0)}. If the prover starts with the goal P(0), the third clause can be used to immediately close the tableau. 
Without any depth bounds, the bare prover may however also always extend with the second clause, generating an infinite branch P(0), P(s(0)), P(s(s(0))), .... Rather than designing such completeness bounds and corresponding exhaustive strategies, we will use Monte-Carlo search and reinforcement learning to gradually teach the prover to avoid such bad branches and focus on the promising ones.\nNext, we add playouts and search node visit counts. A playout of length d is simply a sequence of d consecutive extension/reduction steps (inferences) from a given proof state (a tableau with a selected goal). Inferences thus correspond to actions and are similar to moves in games. We represent inferences as integers that encode the selected clause together with the literal that connected to the goal. Instead of running one potentially infinite playout, the bare prover can be instructed to play n playouts of length d. Each playout updates the counts for the search nodes that it visits. Search nodes are encoded as sequences of inferences starting at the empty tableau. A playout can also run without length restrictions until it visits a previously unexplored search node. This is the current default. Each playout in this version always starts with an empty tableau, i.e., it starts randomly from scratch.\nThe next modification to this simple setup is bigsteps, done after b playouts. They correspond to moves that are chosen in games after many playouts. Instead of starting all playouts from scratch (the empty tableau) as above, we choose after the first b playouts a particular single inference (a bigstep), resulting in a new bigstep tableau. The next b playouts will start with this tableau, followed by another bigstep, etc.\nThis finishes the description of the bare prover. 
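The playout and bigstep loop just described can be sketched as follows (a simplified Python illustration; the encoding of search nodes as inference sequences follows the text, but the toy action space and all names are ours, not mlCoP's):

```python
import random

def run_playouts(root, actions_fn, b, visit_counts):
    """Run b playouts from the bigstep node `root`. Each playout follows
    random inferences until it reaches a previously unexplored search node."""
    for _ in range(b):
        node = root                       # a node is a tuple of inference ids
        while node in visit_counts:
            visit_counts[node] += 1
            acts = actions_fn(node)
            if not acts:                  # dead end: no inference applies
                break
            node = node + (random.choice(acts),)
        visit_counts.setdefault(node, 1)  # first visit of a new node

def bigstep(root, actions_fn, visit_counts):
    """Choose the most visited child of `root` as the next bigstep."""
    children = [root + (a,) for a in actions_fn(root)]
    return max(children, key=lambda n: visit_counts.get(n, 0))

# Toy action space: inferences 0 and 1 are applicable up to depth 5.
random.seed(0)
counts = {(): 0}
run_playouts((), lambda n: [0, 1] if len(n) < 5 else [], 20, counts)
print(bigstep((), lambda n: [0, 1], counts))
```

The real prover replaces the toy `actions_fn` with the applicable extension and reduction steps on the current goal, and (in the guided versions below) replaces the uniform random choice with UCT.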
Without any guidance and heuristics for choosing the playout inferences and bigsteps, this prover is typically much weaker than standard mlCoP with iterative deepening; see Section 4. The bare prover will just iterate between randomly doing b new playouts and randomly making a bigstep.\n\n3 Guidance\n\nrlCoP extends the bare prover with (i) Monte-Carlo tree search balancing exploration and exploitation using the UCT formula [25], (ii) learning-based mechanisms for estimating the prior probability of inferences to lead to a proof (policy), and (iii) learning-based mechanisms for assigning a heuristic value to the proof states (tableaux).\n\n3.1 Monte-Carlo Guidance\n\nTo implement Monte-Carlo tree search, we maintain at each search node i the number of its visits ni, the total reward wi, and its prior probability pi. This is the transition probability of the action (inference) that leads from i\u2019s parent node to i. If no policy learning is used, the prior probabilities are all equal to one. The total reward for a node is computed as the sum of the rewards of all nodes below that node. In the basic setting, the reward for a leaf node is 1 if the sequence of inferences results in a closed tableau, i.e., a proof of the conjecture. Otherwise it is 0.\nInstead of this basic setting, we will by default use a simple evaluation heuristic that will later be replaced by the learned value. The heuristic is based on the number of open (non-closed) goals (tips of the tableau) Go. The exact value is computed as 0.95^Go, i.e., the leaf value drops exponentially with the number of open goals in the tableau. The motivation is similar to, e.g., preferring smaller clauses (closer to the empty clause) in saturation-style theorem provers. If nothing else is known and the open goals are assumed to be independent, the chances of closing the tableau within a given inference limit drop exponentially with each added open goal. 
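A minimal sketch of this heuristic (Go is the number of open goals; the function name is ours):

```python
def leaf_value(open_goals, base=0.95):
    """Heuristic value of a playout leaf: decays exponentially with the
    number of open (non-closed) goals Go in the tableau."""
    return base ** open_goals

print(leaf_value(0))   # a closed tableau gets the full reward 1.0
print(leaf_value(10))  # ten open goals: roughly 0.6
```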
The exact value of 0.95 has been determined experimentally using a small grid search.\n\n\fWe use the standard UCT formula [25] to select the next inferences (actions) in the playouts:\n\nwi/ni + c \u00b7 pi \u00b7 \u221a(ln N / ni)\n\nwhere N stands for the total number of visits of the parent node. We have also experimented with PUCT as in AlphaZero [43]; however, the results are practically the same. The value of c has been experimentally set to 2 when learned policy and value are used.\n\n3.2 Policy Learning and Guidance\n\nFrom many proof runs we learn prior probabilities of actions (inferences) in particular proof states corresponding to the search nodes in which the actions were taken. We characterize the proof states for policy learning by extracting features (see Section 3.4) from the current goal, the active path, and the whole tableau. Similarly, we extract features from the clause and its literal that were used to perform the inference. Both are extracted as sparse vectors and concatenated into pairs (fstate, faction).\nFor each search node, we extract from its UCT data the frequency of each action a, and normalize it by dividing by the average action frequency at that node. This yields a relative proportion ra \u2208 (0, \u221e). Each concatenated pair of feature vectors (fs, fa) is then associated with ra, which constitutes the training data for policy learning, implemented as regression on the logarithms. During the proof search, the prior probabilities pi of the available actions ai in a state s are computed as a softmax of their predictions. We use \u03c4 = 2.5 by default as the softmax temperature. This value has been optimized by a small grid search.\nThe policy learning data can be extracted from all search nodes or only from some of them. By default, we only extract the training examples from the bigstep nodes. 
This makes the amount of training data manageable for our experiments and also focuses on important examples.\n\n3.3 Value Learning and Guidance\n\nBigstep nodes are also used for learning the proof state evaluation (value). For value learning we characterize the proof states of the nodes by extracting features from all goals, the active path, and the whole tableau. If a proof was found, each bigstep node b is assigned the value vb = 1. If the proof search was unsuccessful, each bigstep is assigned the value vb = 0. By default we also apply a small discount factor to the positive bigstep values, based on their distance dproof(b) to the closed tableau, measured by the number of inferences. This is computed as 0.99^dproof(b). The exact value of 0.99 has again been determined experimentally using a small grid search.\nFor each bigstep node b, the sparse vector of its proof state features fb is associated with the value vb. This constitutes the training data for value learning, which is implemented as regression on the logits. The trained predictor is then used during the proof search to estimate the logit of the proof state value.\n\n3.4 Features\n\nWe have briefly experimented with using deep neural networks to learn the policy and value predictors directly from the theorem proving data. Current deep neural architectures, however, do not seem to perform significantly better on such data than non-neural classifiers such as XGBoost with manually engineered features [16, 36, 2]. We use the latter approach, which is also significantly faster [30].\nFeatures are collected from the first-order terms, clauses, goals and tableaux. Most of them are based on (normalized) term walks of length up to 3, as used in the ENIGMA system [16]. These features are in turn based on the syntactic and semantic features developed in [24, 47, 19, 18, 21]. We uniquely identify each symbol by a 64-bit integer. 
To combine a sequence of integers originating from symbols in a term walk into a single integer, the components are multiplied by fixed large primes and added. The resulting integers are then reduced to a smaller feature space by taking them modulo a large prime (2^18 \u2212 5). The value of each feature is the sum of its occurrences in the given expression.\nIn addition to the term walks we also use several common abstract features, especially for more complicated data such as tableaux and paths. Such features have been previously used for learning strategy selection [27, 40]. These are: number of goals, total symbol size of all goals, maximum goal size, maximum goal depth, length of the active path, number of current variable instantiations, and the two most common symbols and their frequencies. The exact features used have been optimized based on several experiments and analysis of the reinforcement learning data.\n\n3.5 Learners and Their Integration\n\nFor both policy and value we have experimented with several fast linear learners such as LIBLINEAR [9] and the XGBoost [8] gradient boosting toolkit (used with the linear regression objective). The latter performs significantly better and has been chosen for conducting the final evaluation. The XGBoost parameters have been optimized on a smaller dataset using randomized cross-validated search,\u00b3 taking the speed of training and evaluation into account. The final values that we use both for policy and value learning are as follows: maximum number of iterations = 400, maximum tree depth = 9, ETA (learning rate) = 0.3, early stopping rounds = 200, lambda (weight regularization) = 1.5. 
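The term-walk feature hashing of Section 3.4 can be sketched as follows (a Python illustration; the modulus matches the paper, while the concrete primes and all names are our assumptions):

```python
# Hypothetical primes for combining symbol ids; the modulus is the paper's.
PRIMES = (1000003, 10000019, 100000007)
MOD = 2**18 - 5

def hash_walk(symbol_ids):
    """Fold a term walk (length up to 3) of symbol ids into one feature
    index: multiply the components by fixed primes, add, reduce modulo."""
    return sum(p * s for p, s in zip(PRIMES, symbol_ids)) % MOD

def featurize(walks):
    """Sparse feature vector: a feature's value counts its occurrences."""
    vec = {}
    for walk in walks:
        idx = hash_walk(walk)
        vec[idx] = vec.get(idx, 0) + 1
    return vec

# Three walks over toy symbol ids; the first walk occurs twice.
walks = [(17, 42), (17, 42), (42, 99, 7)]
print(featurize(walks))
```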
Table 1 compares the performance of XGBoost and LIBLINEAR\u2074 on value data extracted from 2003 proof attempts.\n\nTable 1: Machine learning performance of XGBoost and LIBLINEAR on the value data extracted from 2003 problems. The errors are errors on the logits.\n\nPredictor | Train Time | RMSE (Train) | RMSE (Test)\nXGBoost | 19 min | 0.99 | 2.89\nLIBLINEAR | 37 min | 0.83 | 16.31\n\nFor real-time guidance during the proof search we have integrated LIBLINEAR and XGBoost into rlCoP using the OCaml foreign interface, which allows for a reasonably low prediction overhead. Another part of the guidance overhead is feature computation and transformation of the computed feature vectors into the form accepted by the learned predictors. Table 2 shows that the resulting slowdown is within a low linear factor.\n\nTable 2: Inference speed comparison of mlCoP and rlCoP. IPS stands for inferences per second. The data are averaged over 2003 problems.\n\nSystem | Average IPS\nmlCoP | 64335.5\nrlCoP without policy/value (UCT only) | 64772.4\nrlCoP with XGBoost policy/value | 16205.7\n\n4 Experimental Results\n\nThe evaluation is done on two datasets of first-order problems exported from the Mizar Mathematical Library [13] by the MPTP system [45]. The larger Miz40 dataset\u2075 consists of 32524 problems that have been proved by several state-of-the-art ATPs used with many strategies and high time limits in the experiments described in [21]. See Section 4.6 for a discussion of the performance of these systems on Miz40. We have also created a smaller M2k dataset by taking 2003 Miz40 problems that come from related Mizar articles. Most experiments and tuning were done on the smaller dataset. All problems are run on the same hardware\u2076 and with the same memory limits. 
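The UCT child selection of Section 3.1 can be sketched as follows (a simplified Python illustration; rlCoP maintains these statistics on its OCaml search nodes, and using the sum of child visits for N is our simplification):

```python
import math

def uct_select(children, c=2.0):
    """Pick the child maximizing w/n + c * p * sqrt(ln N / n).
    Unvisited children (n = 0) are tried first."""
    N = sum(ch["n"] for ch in children)  # stand-in for parent visit count
    def score(ch):
        if ch["n"] == 0:
            return float("inf")
        exploit = ch["w"] / ch["n"]
        explore = c * ch["p"] * math.sqrt(math.log(N) / ch["n"])
        return exploit + explore
    return max(children, key=score)

children = [
    {"id": 0, "p": 0.7, "w": 3.0, "n": 10},  # high prior, well explored
    {"id": 1, "p": 0.3, "w": 2.5, "n": 5},   # higher average reward
]
print(uct_select(children)["id"])
```

With c = 2, the high-prior child wins here despite its lower average reward; as its visit count grows, the exploration term shrinks and the balance shifts.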
When using UCT, we always run 2000 playouts (each until a new node is found) per bigstep.\n\n4.1 Performance without Learning\n\nFirst, we use the M2k dataset to compare the performance of the baseline mlCoP with the bare prover and with the non-learning rlCoP using only UCT with the simple goal-counting proof state evaluation heuristic. The results of runs with a limit of 200000 inferences are shown in Table 3.\n\n\u00b3We have used the RandomizedSearchCV method of Scikit-learn [35].\n\u2074We use L2-regularized L2-loss support vector regression and \u03b5 = 0.0001 for LIBLINEAR.\n\u2075https://github.com/JUrban/deepmath\n\u2076Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz with 256G RAM.\n\n\fRaising the inference limit helps only a little: mlCoP solves 1003 problems with a limit of 2 \u2217 10^6 inferences, and 1034 problems with a limit of 4 \u2217 10^6 inferences. The performance of the bare prover is low as expected: only about half of the performance of mlCoP. rlCoP using UCT with no policy and only the simple proof state evaluation heuristic is also weaker than mlCoP, though already significantly better than the bare prover.\n\nTable 3: Performance on the M2k dataset of mlCoP, the bare prover and non-learning rlCoP with UCT and simple goal-counting proof state evaluation (200000 inference limit).\n\nSystem | Problems proved\nmlCoP | 876\nbare prover | 434\nrlCoP without policy/value (UCT only) | 770\n\n4.2 Reinforcement Learning of Policy Only\n\nNext we evaluate on the M2k dataset rlCoP with UCT using only policy learning, i.e., the value is still estimated heuristically. We run 20 iterations, each with a 200000 inference limit. After each iteration we use the policy training data (Section 3.2) from all previous iterations to train a new XGBoost predictor. This is then used for estimating the prior action probabilities in the next run. The 0th run uses no policy. 
This means that it is the same as in Section 4.1, solving 770 problems. Table 4 shows the problems solved by iterations 1 to 20. Already the first iteration significantly improves over mlCoP run with a 200000 inference limit. Starting with the fourth iteration, rlCoP is better than mlCoP run with the much higher 4 \u2217 10^6 inference limit.\n\nTable 4: 20 policy-guided iterations of rlCoP on the M2k dataset.\n\nIteration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10\nProved | 974 | 1008 | 1028 | 1053 | 1066 | 1054 | 1058 | 1059 | 1075 | 1070\nIteration | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20\nProved | 1074 | 1079 | 1077 | 1080 | 1075 | 1075 | 1087 | 1071 | 1076 | 1075\n\n4.3 Reinforcement Learning of Value Only\n\nSimilarly, we evaluate on the M2k dataset 20 iterations of rlCoP with UCT and value learning, but with no learned policy (i.e., all prior inference probabilities are the same). Each iteration again uses a limit of 200000 inferences. After each iteration a new XGBoost predictor is trained on the value data (Section 3.3) from all previous iterations, and is used to evaluate the proof states in the next iteration. The 0th run again uses neither policy nor value, solving 770 problems. Table 5 shows the problems solved by iterations 1 to 20. The performance nearly reaches that of mlCoP; however, it is far below rlCoP using policy learning.\n\nTable 5: 20 value-guided iterations of rlCoP on the M2k dataset.\n\nIteration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10\nProved | 809 | 818 | 821 | 821 | 818 | 824 | 856 | 831 | 842 | 826\nIteration | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20\nProved | 832 | 830 | 825 | 832 | 828 | 820 | 825 | 825 | 831 | 815\n\n4.4 Reinforcement Learning of Policy and Value\n\nFinally, we run on the M2k dataset 20 iterations of full rlCoP with UCT and both policy and value learning. The inference limits, the 0th run and the policy and value learning are as above. Table 6 shows the problems solved by iterations 1 to 20. 
\n\nTable 6: 20 iterations of rlCoP with policy and value guidance on the M2k dataset.\n\nIteration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10\nProved | 1037 | 1110 | 1166 | 1179 | 1182 | 1198 | 1196 | 1193 | 1212 | 1210\nIteration | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20\nProved | 1206 | 1217 | 1204 | 1219 | 1223 | 1225 | 1224 | 1217 | 1226 | 1235\n\nThe 20th iteration proves 1235 problems, which is 19.4% more than mlCoP with 4 \u2217 10^6 inferences, 13.6% more than the best iteration of rlCoP with policy only, and 44.3% more than the best iteration of rlCoP with value only. The first iteration improves over mlCoP with 200000 inferences by 18.4%, and the second iteration already outperforms the best policy-only result.\nWe also evaluate the effect of joint reinforcement learning of policy and value. Replacing the final policy with the best one from the policy-only runs decreases the performance in the 20th iteration from 1235 to 1182. Replacing the final value with the best one from the value-only runs decreases the performance in the 20th iteration from 1235 to 1144.\n\n4.5 Evaluation on the Whole Miz40 Dataset\n\nThe Miz40 dataset is sufficiently large to allow an ultimate train/test evaluation in which rlCoP is trained in several iterations on 90% of the problems, and then compared to mlCoP on the 10% of previously unseen problems. This will provide the final comparison of human-designed proof search with proof search trained by reinforcement learning on many related problems. We therefore randomly split Miz40 into a training set of 29272 problems and a testing set of 3252 problems.\nFirst, we again measure the performance of the unguided systems, i.e., comparing mlCoP, the bare prover and the non-learning rlCoP using only UCT with the simple goal-counting proof state evaluation heuristic. The results of runs with a limit of 200000 inferences are shown in Table 7. mlCoP performs slightly better here than on M2k. 
It solves 13450 problems in total with a higher limit of 2 \u2217 10^6 inferences, and 13952 problems in total with a limit of 4 \u2217 10^6 inferences.\n\nTable 7: Performance on the Miz40 dataset of mlCoP, the bare prover and non-learning rlCoP with UCT and simple goal-counting proof state evaluation (200000 inference limit).\n\nSystem | Training proved | Testing proved | Total proved\nmlCoP | 10438 | 1143 | 11581\nbare prover | 4184 | 431 | 4615\nrlCoP without policy/value (UCT only) | 7348 | 804 | 8152\n\nFinally, we run 10 iterations of full rlCoP with UCT and both policy and value learning. However, only the training set problems are used for the policy and value learning. The inference limit is again 200000, and the 0th run is as above, solving 7348 training and 804 testing problems. Table 8 shows the problems solved by iterations 1 to 10. rlCoP guided by the policy and value learned on the training data from iterations 0\u20134 proves (in the 5th iteration) 1624 testing problems, which is 42.1% more than mlCoP run with the same inference limit. This is our final result, comparing the baseline prover with the trained prover on previously unseen data. 42.1% is an unusually high improvement, which we achieved with practically no domain-specific engineering. Published improvements in the theorem proving field are typically between 3 and 10%.\n\nTable 8: 10 iterations of rlCoP with policy and value guidance on the Miz40 dataset. 
Only the training problems are used for the policy and value learning.\n\nIteration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10\nTraining proved | 12325 | 13749 | 14155 | 14363 | 14403 | 14431 | 14342 | 14498 | 14481 | 14487\nTesting proved | 1354 | 1519 | 1566 | 1595 | 1624 | 1586 | 1582 | 1591 | 1577 | 1621\n\n4.6 Comparison with State-of-the-art Saturation Systems\n\nThe Miz40 dataset was created [21] by running state-of-the-art first-order ATPs, mainly E [40] and Vampire [26], on various axiom selections in many different ways, followed by pseudominimization [1] of the set of axioms. E and Vampire are complex human-programmed saturation-style ATP systems developed over decades. They consist of hundreds of thousands of lines of code efficiently implementing many specific procedures and heuristics in low-level languages. This includes optimizations for equational reasoning based on term orderings, many heuristics for clause and literal selection, integration of propositional splitting, etc. The systems also include auto-configuration methods, choosing portfolios of strategies. New strategies can be formulated in domain-specific languages either manually or, e.g., by evolutionary methods. The latter have been used in prior research to develop good strategies and their portfolios specifically targeting Mizar problems [46, 17].\nTable 9 shows the performance of E and Vampire run in several ways on the Miz40 dataset. In all cases we use a CPU limit of 3 seconds, which approximately corresponds to the time it takes mlCoP to do 200000 inferences. Vampire is run in its default competition mode, where it uses a portfolio of strategies for each problem. 
E is run in four different ways: (auto) using its automated single-strategy selection; (noauto) using a default term ordering, only a simple clause selection, and no literal selection; (restrict) additionally limiting literal orderings; and (noorder) limiting all literal and term orderings. Since the majority of Miz40 problems contain equality, limiting the orderings often results in more prolific (less smart) generation of clauses by the inference rules.\n\nTable 9: Performance of E and Vampire run in several ways on the Miz40 dataset.\n\nSystem | Training proved | Testing proved | Total proved\nVampire | 26304 | 2923 | 29227\nE-auto | 26645 | 2942 | 29587\nE-noauto | 20914 | 2330 | 23244\nE-restrict | 11735 | 1271 | 13006\nE-noorder | 11235 | 1229 | 12464\n\nBoth E and Vampire clearly outperform mlCoP and rlCoP when using their best strategies and portfolios. The performance drops when only weak clause selection and no literal selection is used. Further restricting literal and term orderings makes E weaker than rlCoP and comparable to mlCoP. This is consistent with the facts that (i) practically all Miz40 problems (32413) make use of equality, (ii) good strategies and term orderings have been invented for E and Vampire on the Mizar dataset for many years, and (iii) the Miz40 dataset has been created as the Mizar problems that E or Vampire could solve.\nThe rlCoP implementation is two orders of magnitude smaller than E and Vampire: the core is about 2200 lines, and about 700 lines is the interface to the learners. 
An obvious question this performance comparison poses is how much rlCoP can improve solely by better machine learning in the current setting, and whether further engineering and learning of more abstract guidance systems such as equality handling and literal/clause selection heuristics will be needed.\n\n4.7 Examples\n\nThere are 577 test problems that rlCoP trained in 10 iterations on Miz40 can solve and standard mlCoP cannot.\u2077 We show three of the problems which cannot be solved by standard mlCoP even with a much higher inference limit (4 million). Theorem TOPREALC:10\u2078 states commutativity of scalar division with squaring for complex-valued functions. Theorem WAYBEL_0:28\u2079 states that a union of upper sets in a relation is an upper set. And theorem FUNCOP_1:34\u00b9\u2070 states commutativity of two kinds of function composition. All these theorems have nontrivial human-written formal proofs in Mizar, and they are also relatively hard to prove using state-of-the-art saturation-style ATPs.\nFigure 2 partially shows an example of the completed Monte-Carlo tree search for WAYBEL_0:28. The local goals corresponding to the nodes leading to the proof are printed to the right.\n\ntheorem :: TOPREALC:10\nfor c being complex number for f being complex-valued Function\nholds (f (/) c)^2 = (f^2) (/) (c^2)\n\ntheorem :: WAYBEL_0:28\nfor L being RelStr for A being Subset-Family of L st\n( for X being Subset of L st X in A holds X is upper )\nholds union A is upper Subset of L\n\ntheorem Th34: :: FUNCOP_1:34\n\n\u2077Since theorem proving is almost never monotonically better, there are also 96 problems solved by mlCoP and not solved by rlCoP in this experiment. 
The \ufb01nal performance difference between rlCoP and mlCoP is\nthus 577 \u2212 96 = 481 problems.\n\n8http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/toprealc#T10\n9http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/waybel_0#T28\n10http://grid01.ciirc.cvut.cz/~mptp/7.13.01_4.181.1147/html/funcop_1#T34\n\n8\n\n\fp=0.21\nr=0.1859\nn=28...\n\np=0.10\nr=0.2038\nn=9...\n\np=0.13\nr=0.2110\nn=14...\n\nr=0.3099\nn=1182\n\n# (tableau starting atom)\n\np=0.24\nr=0.3501\nn=536\n\np=0.14\nr=0.2384\nn=21...\n\np=0.19\nr=0.2289\nn=58...\np=0.14\nr=0.3370\n...\nn=181\np=0.30\nr=0.1368\nn=14...\n\np=0.22\nr=0.1783\nn=40...\np=0.20\nr=0.3967\nn=279\n\np=0.15\nr=0.0288\nn=2...\np=0.66\nr=0.4217\nn=247\n\np=0.35\nr=0.2889\n...\nn=548\np=0.08\nr=0.1116\nn=3...\np=0.56\nr=0.4135\nn=262\n\np=0.18\nr=0.2633\nn=8...\n\np=0.17\nr=0.2554\nn=6...\n\n36 more MCTS tree levels until proved\n\nRelStr(c1)\n\nupper(c1)\n\nSubset(union(c2), carrier(c1))\n\nSubset(c2, powerset(carrier(c1))\n\nFigure 2: The MCTS tree for the WAYBEL 0:28 problem at the moment when the proof is found.\nFor each node we display the predicted probability p, the number of visits n and the average reward\nr = w/n. For the (thicker) nodes leading to the proof the corresponding local proof goals are\npresented on the right.\n\nfor f , h , F being F u n c t i o n for x being set\n\nholds ( F [;] (x , f )) * h = F [;] (x ,( f * h ))\n\n5 Related Work\n\nSeveral related systems have been mentioned in Section 1. Many iterations of a feedback loop\nbetween proving and learning have been explored since the MaLARea [47] system, signi\ufb01cantly\nimproving over human-designed heuristics when reasoning in large theories [22, 36]. Such systems\nhowever only learn high-level selection of relevant facts from a large knowledge base, and delegate\nthe internal proof search to standard ATP systems treated there as black boxes. 
Related high-level feedback loops have been designed for the invention of targeted strategies of ATP systems [46, 17]. Several systems have been produced recently that use supervised learning from large proof corpora for guiding the internal proof search of ATPs. This has been done in the connection tableau setting [48, 20, 10], in the saturation-style setting [16, 30], and also as direct automation inside interactive theorem provers [12, 14, 49]. Reinforcement-style feedback loops have, however, not yet been explored in this setting. The closest recent work is [10], where Monte-Carlo tree search is added to connection tableau, however without reinforcement learning iterations, with complete backtracking, and without a learned value function. The improvement over the baseline measured in that work is much less significant than here. Obvious recent inspirations for this work are the latest reinforcement learning advances in playing Go and other board games [41, 43, 42, 3].

6 Conclusion

In this work we have developed a theorem proving algorithm that uses practically no domain engineering and instead relies on Monte-Carlo simulations guided by reinforcement learning from previous proof searches. We have shown that when trained on a large corpus of general mathematical problems, the resulting system is more than 40% stronger than the baseline system in terms of solving nontrivial new problems.
We believe that this is a landmark in the field of automated reasoning, demonstrating that building general problem solvers for mathematics, verification and the hard sciences by reinforcement learning is a very viable approach.
Obvious future research includes developing stronger learning algorithms for characterizing mathematical data. We believe that the development of suitable (deep) learning architectures that capture both syntactic and semantic features of the mathematical objects will be crucial for training strong assistants for mathematics and the hard sciences by reinforcement learning.

7 Acknowledgments

Kaliszyk was supported by ERC grant no. 714034 SMART. Urban was supported by the AI4REASON ERC Consolidator grant number 649043, and by the Czech project AI&Reasoning CZ.02.1.01/0.0/0.0/15_003/0000466 and the European Regional Development Fund. Michalewski and Kaliszyk acknowledge the support of the Academic Computer Center Cyfronet of the AGH University of Science and Technology in Kraków and their Prometheus supercomputer.

References

[1] J. Alama, T. Heskes, D. Kühlwein, E. Tsivtsivadze, and J. Urban. Premise selection for mathematics by corpus analysis and kernel methods. J. Autom. Reasoning, 52(2):191–213, 2014.

[2] A. A. Alemi, F. Chollet, N. Eén, G. Irving, C. Szegedy, and J. Urban. DeepMath - deep sequence models for premise selection. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pages 2235–2243, 2016.

[3] T. Anthony, Z. Tian, and D. Barber. Thinking fast and slow with deep learning and tree search. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R.
Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 5366–5376, 2017.

[4] L. Bachmair and H. Ganzinger. Rewrite-based equational theorem proving with selection and simplification. Journal of Logic and Computation, 4(3):217–247, 1994.

[5] C. W. Barrett, R. Sebastiani, S. A. Seshia, C. Tinelli, et al. Satisfiability modulo theories. Handbook of Satisfiability, 185:825–885, 2009.

[6] J. C. Blanchette, D. Greenaway, C. Kaliszyk, D. Kühlwein, and J. Urban. A learning-based fact selector for Isabelle/HOL. J. Autom. Reasoning, 57(3):219–244, 2016.

[7] J. C. Blanchette, C. Kaliszyk, L. C. Paulson, and J. Urban. Hammering towards QED. J. Formalized Reasoning, 9(1):101–148, 2016.

[8] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.

[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.

[10] M. Färber, C. Kaliszyk, and J. Urban. Monte Carlo tableau proof search. In L. de Moura, editor, 26th International Conference on Automated Deduction (CADE), volume 10395 of LNCS, pages 563–579. Springer, 2017.

[11] T. Gauthier and C. Kaliszyk. Premise selection and external provers for HOL4. In X. Leroy and A. Tiu, editors, Proc. of the 4th Conference on Certified Programs and Proofs (CPP'15), pages 49–57. ACM, 2015.

[12] T. Gauthier, C. Kaliszyk, and J. Urban. TacticToe: Learning to reason with HOL4 tactics. In T. Eiter and D.
Sands, editors, 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, LPAR-21, volume 46 of EPiC Series in Computing, pages 125–143. EasyChair, 2017.

[13] A. Grabowski, A. Korniłowicz, and A. Naumowicz. Mizar in a nutshell. J. Formalized Reasoning, 3(2):153–245, 2010.

[14] T. Gransden, N. Walkinshaw, and R. Raman. SEPIA: Search for proofs using inferred automata. In Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings, pages 246–255, 2015.

[15] R. Hähnle. Tableaux and related methods. In Robinson and Voronkov [38], pages 100–178.

[16] J. Jakubuv and J. Urban. ENIGMA: Efficient learning-based inference guiding machine. In H. Geuvers, M. England, O. Hasan, F. Rabe, and O. Teschke, editors, Intelligent Computer Mathematics - 10th International Conference, CICM 2017, volume 10383 of Lecture Notes in Computer Science, pages 292–302. Springer, 2017.

[17] J. Jakubuv and J. Urban. Hierarchical invention of theorem proving strategies. AI Commun., 31(3):237–250, 2018.

[18] C. Kaliszyk and J. Urban. Stronger automation for Flyspeck by feature weighting and strategy evolution. In J. C. Blanchette and J. Urban, editors, PxTP 2013, volume 14 of EPiC Series, pages 87–95. EasyChair, 2013.

[19] C. Kaliszyk and J. Urban. Learning-assisted automated reasoning with Flyspeck. J. Autom. Reasoning, 53(2):173–213, 2014.

[20] C. Kaliszyk and J. Urban. FEMaLeCoP: Fairly efficient machine learning connection prover. In M. Davis, A. Fehnker, A. McIver, and A. Voronkov, editors, Logic for Programming, Artificial Intelligence, and Reasoning - 20th International Conference, volume 9450 of Lecture Notes in Computer Science, pages 88–96. Springer, 2015.

[21] C. Kaliszyk and J. Urban. MizAR 40 for Mizar 40. J. Autom.
Reasoning, 55(3):245–256, 2015.

[22] C. Kaliszyk, J. Urban, and J. Vyskočil. Machine learner for automated reasoning 0.4 and 0.5. In S. Schulz, L. de Moura, and B. Konev, editors, 4th Workshop on Practical Aspects of Automated Reasoning, PAAR@IJCAR 2014, Vienna, Austria, 2014, volume 31 of EPiC Series in Computing, pages 60–66. EasyChair, 2014.

[23] C. Kaliszyk, J. Urban, and J. Vyskočil. Certified connection tableaux proofs for HOL Light and TPTP. In X. Leroy and A. Tiu, editors, Proc. of the 4th Conference on Certified Programs and Proofs (CPP'15), pages 59–66. ACM, 2015.

[24] C. Kaliszyk, J. Urban, and J. Vyskočil. Efficient semantic features for automated reasoning over large theories. In Q. Yang and M. Wooldridge, editors, IJCAI'15, pages 3084–3090. AAAI Press, 2015.

[25] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, editors, Machine Learning: ECML 2006, 17th European Conference on Machine Learning, volume 4212 of LNCS, pages 282–293. Springer, 2006.

[26] L. Kovács and A. Voronkov. First-order theorem proving and Vampire. In N. Sharygina and H. Veith, editors, CAV, volume 8044 of LNCS, pages 1–35. Springer, 2013.

[27] D. Kühlwein and J. Urban. MaLeS: A framework for automatic tuning of automated theorem provers. J. Autom. Reasoning, 55(2):91–116, 2015.

[28] D. Kühlwein, T. van Laarhoven, E. Tsivtsivadze, J. Urban, and T. Heskes. Overview and evaluation of premise selection techniques for large theory mathematics. In B. Gramlich, D. Miller, and U. Sattler, editors, IJCAR, volume 7364 of LNCS, pages 378–392. Springer, 2012.

[29] R. Letz, K. Mayr, and C. Goller. Controlled integration of the cut rule into connection tableau calculi. Journal of Automated Reasoning, 13:297–337, 1994.

[30] S. M. Loos, G. Irving, C. Szegedy, and C.
Kaliszyk. Deep network guided proof search. In T. Eiter and D. Sands, editors, 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, LPAR-21, volume 46 of EPiC Series in Computing, pages 85–105. EasyChair, 2017.

[31] W. McCune. Otter 2.0. In International Conference on Automated Deduction, pages 663–664. Springer, 1990.

[32] J. Meng and L. C. Paulson. Translating higher-order clauses to first-order clauses. J. Autom. Reasoning, 40(1):35–60, 2008.

[33] J. Otten. Restricting backtracking in connection calculi. AI Commun., 23(2-3):159–182, 2010.

[34] J. Otten and W. Bibel. leanCoP: Lean connection-based theorem proving. J. Symb. Comput., 36(1-2):139–161, 2003.

[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[36] B. Piotrowski and J. Urban. ATPboost: Learning premise selection in binary setting with ATP feedback. In D. Galmiche, S. Schulz, and R. Sebastiani, editors, Automated Reasoning - 9th International Joint Conference, IJCAR 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, July 14-17, 2018, Proceedings, volume 10900 of Lecture Notes in Computer Science, pages 566–574. Springer, 2018.

[37] J. A. Robinson. A machine-oriented logic based on the resolution principle. Journal of the ACM (JACM), 12(1):23–41, 1965.

[38] J. A. Robinson and A. Voronkov, editors. Handbook of Automated Reasoning (in 2 volumes). Elsevier and MIT Press, 2001.

[39] S. Schäfer and S. Schulz. Breeding theorem proving heuristics with genetic algorithms. In G. Gottlob, G. Sutcliffe, and A.
Voronkov, editors, Global Conference on Artificial Intelligence, GCAI 2015, volume 36 of EPiC Series in Computing, pages 263–274. EasyChair, 2015.

[40] S. Schulz. System description: E 1.8. In K. L. McMillan, A. Middeldorp, and A. Voronkov, editors, LPAR, volume 8312 of LNCS, pages 735–743. Springer, 2013.

[41] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[42] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.

[43] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[44] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[45] J. Urban. MPTP 0.2: Design, implementation, and initial experiments. J. Autom. Reasoning, 37(1-2):21–43, 2006.

[46] J. Urban. BliStr: The Blind Strategymaker. In G. Gottlob, G. Sutcliffe, and A. Voronkov, editors, Global Conference on Artificial Intelligence, GCAI 2015, volume 36 of EPiC Series in Computing, pages 312–319. EasyChair, 2015.

[47] J. Urban, G. Sutcliffe, P. Pudlák, and J. Vyskočil. MaLARea SG1 - Machine Learner for Automated Reasoning with Semantic Guidance. In A. Armando, P. Baumgartner, and G. Dowek, editors, IJCAR, volume 5195 of LNCS, pages 441–456.
Springer, 2008.

[48] J. Urban, J. Vyskočil, and P. Štěpánek. MaLeCoP: Machine learning connection prover. In K. Brünnler and G. Metcalfe, editors, TABLEAUX, volume 6793 of LNCS, pages 263–277. Springer, 2011.

[49] D. Whalen. Holophrasm: A neural automated theorem prover for higher-order logic. CoRR, abs/1608.02644, 2016.