{"title": "Depth-First Proof-Number Search with Heuristic Edge Cost and Application to Chemical Synthesis Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 7226, "page_last": 7236, "abstract": "Search techniques, such as Monte Carlo Tree Search (MCTS) and Proof-Number Search (PNS), are effective in playing and solving games.  However, the understanding of their performance in industrial applications is still limited.  We investigate MCTS and Depth-First Proof-Number (DFPN) Search, a PNS variant, in the domain of Retrosynthetic Analysis (RA). \nWe find that DFPN's strengths, that justify its success in games, have limited value in RA, and that an enhanced MCTS variant by Segler et al. significantly outperforms DFPN.  We address this disadvantage of DFPN in RA with a novel approach to combine DFPN with Heuristic Edge Initialization.  Our new search algorithm DFPN-E outperforms the enhanced MCTS in search time by a factor of 3 on average, with comparable success rates.", "full_text": "Depth-First Proof-Number Search with Heuristic\nEdge Cost and Application to Chemical Synthesis\n\nPlanning\n\nAkihiro Kishimoto\nIBM Research, Ireland\n\nBeat Buesser\n\nIBM Research, Ireland\n\nBei Chen\n\nIBM Research, Ireland\n\nAdi Botea\nEaton, Ireland\u2217\n\nAbstract\n\nSearch techniques, such as Monte Carlo Tree Search (MCTS) and Proof-Number\nSearch (PNS), are effective in playing and solving games. However, the understand-\ning of their performance in industrial applications is still limited. We investigate\nMCTS and Depth-First Proof-Number (DFPN) Search, a PNS variant, in the do-\nmain of Retrosynthetic Analysis (RA). We \ufb01nd that DFPN\u2019s strengths, that justify\nits success in games, have limited value in RA, and that an enhanced MCTS variant\nby Segler et al. signi\ufb01cantly outperforms DFPN. We address this disadvantage\nof DFPN in RA with a novel approach to combine DFPN with Heuristic Edge\nInitialization. 
Our new search algorithm DFPN-E outperforms the enhanced MCTS\nin search time by a factor of 3 on average, with comparable success rates.\n\n1\n\nIntroduction\n\nSearch is a core AI technique, especially in games [5, 27, 31] and domain-independent planning\n[3, 12]. Historically, new search algorithms and novel combinations of existing algorithms have led\nto signi\ufb01cant performance improvements. Another way to achieve progress is to transfer successful\napproaches from one domain to another. This paper combines both strategies.\nProof-Number Search (PNS) [1] and Monte Carlo Tree Search (MCTS) [18] are two notable al-\ngorithms that have been extensively studied in solving games/game positions and playing games,\nrespectively. Their success has been demonstrated by solving the game of checkers [27] and by\nachieving super-human strength in playing the game of Go [31].\nDue to recent advances in AI, chemistry is being reconsidered as an AI research domain [14]. We\nfocus on chemical synthesis planning, the task of planning chemical reaction routes to synthesize\na given organic molecule. Retrosynthetic Analysis (RA) is a technique going backwards from the\ntarget molecule towards a library of usually smaller, starting molecules. This task can be modeled\nsimilarly to solving games [11] and can be tackled with search algorithms such as PNS and MCTS\n[11, 29]. However, since no direct performance comparison between PNS and MCTS has been made,\ntheir search performance in RA is not well understood.\nWe investigate the search performance of MCTS and Depth-First Proof-Number (DFPN) Search\n[21], a variant of PNS, in RA. PNS variants [1, 21] are designed to ef\ufb01ciently identify the smallest\namount of work needed to solve a game position. The game research community largely agrees that\nthis advantage is one reason why PNS variants have been a popular choice in solving dif\ufb01cult games\nor game positions [16, 21, 27], with solution lengths exceeding 1500 steps. 
However, we discover\nthat DFPN performs poorly in the domain of RA. We \ufb01nd that DFPN\u2019s signi\ufb01cant performance\ndegradation is due to RA\u2019s lopsided search space, which alternates large branching factors with\nvery small ones. To address this, we introduce DFPN-E, a novel algorithm to combine DFPN with\nHeuristic Edge Initialization. We empirically evaluate DFPN-E and an enhanced MCTS variant [29]\non a dataset of complex target molecules extracted from the US patent literature. We demonstrate that\nDFPN-E outperforms this enhanced MCTS in search time, with competitive success rates of \ufb01nding\nsuccessful synthesis routes.\n\n\u2217Work performed while the author was af\ufb01liated with IBM Research, Ireland.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Example of reaction rule. Esteri\ufb01cation reaction adapted from [11].\n\n2 Chemical Synthesis Planning\n\nChemical synthesis planning is the task of creating a sequence of chemical reactions (or reaction\npathway) that synthesizes a target organic molecule by using existing (e.g., commercially available)\nstarting materials. Chemists perform retrosynthetic analysis (RA), a systematic analysis that recursively\nsplits a target molecule into simpler reactant molecules with reactions they expect to be\nfeasible, until they \ufb01nd a branching reverse route from the target to starting materials. Automated\nretrosynthetic analysis has been an open problem for 50 years, e.g., [6, 7, 33], because of its scienti\ufb01c\nimportance in chemistry as well as its direct applications to pharmaceutical and material industries.\nRA takes as input a target molecule, a library of starting materials and a database of reactions. A\nmolecule is represented as a graph where nodes and links correspond to atoms and chemical bonds,\nrespectively. A starting material database contains de\ufb01nitions of available starting molecules. 
A\nreaction database has a set of reaction rules specifying graph substructures of molecules that must be\npresent to allow the application of the reaction. Reaction rules can be constructed manually [11, 33]\nor automatically [19, 29].\nSince RA explores the space backwards, from the target molecule, reaction rules are applied in\na reverse (or retrosynthetic) manner. When checking if a reaction rule can be applied to a target\nmolecule, RA checks if the target has a substructure speci\ufb01ed in the product of the rule. If this is the\ncase, the target is split into the reactant molecules speci\ufb01ed in the reaction rule. Unless a reactant\nmolecule P appears in the starting material database, P needs to be synthesized, recursively applying\na reaction rule to P . A molecule may have multiple matching substructures where a reaction rule can\nbe applied. In this paper, all these are considered as different moves.\nFigure 1 illustrates a reaction rule for producing an ester from an alcohol and anhydride. The carbon\natoms located in numbers 1, 5 and 7 are omitted. To activate this reaction, an alcohol and anhydride\nmolecules need to have substructures as illustrated in the \ufb01gure. An ester is produced by modifying\nthe substructures speci\ufb01ed there and keeping the remaining substructures unchanged.\nOur focus is to elucidate the behavior of MCTS and PNS in the search space of RA, modeled in [11],\nwhere each reaction rule captures only substructural modi\ufb01cations between a product and its reactants.\nIn practice, reaction rules need to consider constraints such as protecting groups, side compounds,\ntemperature and yield [33]. This is beyond the scope of our paper.\nWhile RA is often modeled as single-agent search (OR search), e.g., [19, 29, 33], Heifets and Jurisica\nmodel RA as a problem similar to solving a game position in two-player games such as chess and Go\n[11]. 
Their model allows a problem to be ef\ufb01ciently split into a set of independent subproblems that can\nbe solved with AND/OR search. In their game position solving framework, two players alternately\nplay moves. The terms win and loss are used from the viewpoint of the \ufb01rst player. The \ufb01rst\nplayer attempts to synthesize a target molecule, while the second player (or opponent) attempts to\nprevent it. A position of the \ufb01rst player corresponds to a molecule they want to synthesize. A move\nof the \ufb01rst player is a reaction rule applicable to that molecule in a reverse manner. On the other hand,\na position for the opponent corresponds to a set of precursor molecules generated by a reaction rule.\nA move for the opponent is to select one of the precursor molecules in their position. In Figure 1, an\napplication of this rule corresponds to a move of the \ufb01rst player if it is applicable. The opponent has\ntwo moves, one choosing an alcohol and the other choosing an anhydride.\n\nA molecule in the starting material database is a winning terminal position, while a molecule to which no\nreaction rule is applicable is a losing terminal position. At a position for the \ufb01rst player, if at least one\nmove leads to a win, that position is a win, as one reaction to synthesize the molecule exists. If all\nlegal moves lead to losses, the position is a loss (i.e., no pathway exists to synthesize it). At a position\nfor the opponent, if all moves (precursor molecules presented to the \ufb01rst player) lead to wins, the\nposition is a win (all precursors are successfully generated). Otherwise, the position is a loss.\nThe size of the RA search space can be very large. The branching factor at OR nodes is typically\nestimated as 80 and a synthetic route may require up to several tens of synthetic steps [33]. In addition,\ncomplex molecules may require over 100 steps [11]. 
As in many games, the search space of RA is\ncyclic since a sequence of reaction rules may create repeated states, e.g., oxidation and reduction,\nwhile a solution may not contain a cycle (see the next section for the de\ufb01nition of a solution).\nUnlike in many games, successor state generation is a serious bottleneck in RA, because checking\nthe applicability of a reaction rule requires solving subgraph matching, which is an NP-complete\nproblem. For example, Kishimoto et al. [14] report that the implementation in [11] generates at\nmost several tens of molecules per second, as compared to millions of positions per second in a game\nsuch as chess.\n\n3 Background\n\nThis section gives an overview of PNS and MCTS. See [4, 17] for more comprehensive surveys.\n\n3.1 AND/OR Search and Proof-Number Search Variants\n\nIn our model, OR and AND nodes correspond to the \ufb01rst player and the opponent, respectively. OR\nand AND nodes alternate along a given pathway in the search tree. A \u201cmove\u201d represents an edge\n(transition) in a search tree. Once the AND/OR search \ufb01nds a solution, the value of the root node is\ndetermined to be either a win or a loss. Node n is called proven if the value of n is proven to be a\nwin, which is also called a proof. Node n is called disproven if the value of n is proven to be a loss. A\ndisproof is also used to express a loss. The value of n is unknown or unproven if n is neither proven\nnor disproven.\nA proof tree T of node r theoretically ensures that r is proven. T has the following properties: (1)\nThe root node r is in T, (2) for each internal OR node n in T, at least one child of n is in T, (3) for\neach internal AND node n in T, all the children of n are in T, and (4) all terminal nodes in T have\nvalues of wins. A disproof tree, which provides a disproof, is analogously de\ufb01ned.\nAn ef\ufb01cient AND/OR search aims at \ufb01nding a proof or disproof tree as quickly as possible. 
PNS\nuses proof and disproof numbers to estimate the dif\ufb01culty of \ufb01nding a proof or a disproof for node\nn [1]. The proof number pn(n) for node n is de\ufb01ned as the minimum number of leaf nodes to be\nproven to \ufb01nd a proof for n. A node with a smaller proof number is more promising to \ufb01nd a proof.\nAnalogously, the disproof number dn(n) for n is the minimum number of leaf nodes to be disproven\nto \ufb01nd a disproof for n. For proven terminal node n, pn(n) = 0 and dn(n) = \u221e, and pn(n) = \u221e\nand dn(n) = 0 for disproven terminal node n. At an internal node n with a set of children S(n),\nproving one child leads to a proof for an OR node n, while all children must be proven for a proof\nof an AND node n (and vice versa for disproof). Hence, pn(n) and dn(n) are: For an internal\nOR node n, pn(n) = min_{s\u2208S(n)} pn(s) and dn(n) = \u2211_{s\u2208S(n)} dn(s). For an internal AND node\nn, pn(n) = \u2211_{s\u2208S(n)} pn(s) and dn(n) = min_{s\u2208S(n)} dn(s). Figure 2(left) illustrates an example of\nproof and disproof numbers, where numbers on the top and bottom inside each node indicate proof\nand disproof numbers, respectively.\nPNS [1] maintains proof and disproof numbers of each node in a search tree. Starting with the root\nuntil reaching a leaf node, PNS traverses down a tree in a best-\ufb01rst manner by selecting a child with\nthe smallest proof number at each internal OR node, and a child with the smallest disproof number\nat each internal AND node (selection). PNS then expands the selected leaf node (expansion), and\nrecalculates proof and disproof numbers of the nodes along the path back to the root (backup). For\nexample, in Figure 2(right), PNS selects path A \u2192 C \u2192 F, generates two new leaf nodes H and I,\nand updates proof and disproof numbers at F, C and A. PNS repeats these steps until \ufb01nding either a\nproof or disproof or exhausting its time/memory resources.\n\nFigure 2: Example of PNS\n\nDepth-First Proof-Number (DFPN) search [21] is a PNS variant that reformulates best-\ufb01rst PNS to\ndepth-\ufb01rst search. DFPN has been successfully used to solve dif\ufb01cult games or game positions in\nmany games, e.g., [2, 21, 27], often involving a solution with several hundred steps [13, 21].\nDFPN re-expands fewer internal nodes than best-\ufb01rst PNS and operates in limited memory.\nA large part of the available memory is allocated to the transposition table (TT). The TT caches\nthe proof and disproof numbers of examined nodes. DFPN introduces thresholds for proof and\ndisproof numbers: thpn(n) and thdn(n), which are initially set to large values at the root node.\nDFPN recalculates pn(n) and dn(n) by using the proof and disproof numbers of n\u2019s children. When\nthpn(n) \u2264 pn(n) or thdn(n) \u2264 dn(n) holds for an unproven node n, DFPN identi\ufb01es that there are\nmore promising nodes than n and postpones the examination of n. Otherwise, DFPN selects a child\ns1 with the smallest (dis)proof number for a further examination, with the following thresholds: For\nOR node n, thpn(s1) = min(thpn(n), pn(s2) + 1) and thdn(s1) = thdn(n) \u2212 dn(n) + dn(s1). For\nAND node n, thpn(s1) = thpn(n) \u2212 pn(n) + pn(s1) and thdn(s1) = min(thdn(n), dn(s2) + 1),\nwhere s2 is a child with the second smallest (dis)proof number among a list of children of an OR\n(AND) node n. DFPN tends to gradually increment the thresholds as search progresses.\nThe basic PNS sets pn(n) = dn(n) = 1 for an unproven leaf n. Heuristic Initialization enhances PNS\nvariants including DFPN+ [21] by initializing pn(n) = hpn(n) and dn(n) = hdn(n) at unproven\nleaf n, where hpn(n) and hdn(n) are evaluation functions for proof and disproof numbers. 
Existing\nwork on creating hpn(n) and hdn(n) for games includes manual [16, 35] and machine learning\n[9, 21] approaches. Although DFPN reduces re-expansions of internal nodes compared to basic PNS,\nit still often suffers from the high overhead of such a re-examination, resulting in examining only\nsmall portions of the new search spaces. To alleviate this issue, approaches to increase thpn(n) and\nthdn(n) have been developed [16, 21, 24].\n\n3.2 Monte Carlo Tree Search\n\nMCTS is based on a best-\ufb01rst search that repeats the selection, expansion, and backup steps. Most\nMCTS algorithms except [32] perform Monte Carlo samplings to calculate a heuristic value of a leaf\nnode [8, 18]. A sampling at a leaf estimates a probability of a win by randomly selecting moves\nuntil it reaches a terminal position. For more accurate evaluations, samplings select moves with\nnon-uniform probabilities, considering game board con\ufb01gurations [10, 31]. Additionally, MCTS can\nuse evaluation functions for leaf-node evaluations with [31] or without [32] sampling.\nOne conceptual difference between MCTS and PNS is that MCTS aims to balance a trade-off between\nexploration and exploitation. A move with a high reward (e.g., high winning ratio) must be selected\nto have a higher chance to win a game (exploitation). On the other hand, a move with a low reward\nmust also be selected to overcome an inaccuracy due to a small number of examinations of the move\n(exploration). A variety of formulas [10, 18, 34] have been presented to achieve the right balance.\nLet n be the current node, A(n) be a set of legal moves for n, and Q(n, ai) be an accumulated reward\nfor move ai. 
To select the best move abest in RA, consider Segler et al.\u2019s formula [29], originally\npresented in [26, 31]:\n\nabest = argmax_{ai\u2208A(n)} ( Q(n, ai) / N(n, ai) + c P(n, ai) \u221a(\u2211_{a\u2208A(n)} N(n, a)) / (1 + N(n, ai)) )\n\nwhere N(n, ai) is\nthe number of visits to ai at n, P(n, ai) is a prior probability of move ai at n, and c is a constant\nempirically preset. In RA, P(n, ai) corresponds to a probability that one step retrosynthesis with\nreaction rule ai is successfully applied to molecule n. A neural network is trained to estimate P(n, ai)\n[29, 30]. As in UCT [18], the \ufb01rst and second terms of the above formula take exploitation and\nexploration into account. However, the above second term more gradually reduces the exploration\nfactor than UCT and encourages exploration of the search space based on P(n, ai) encoding domain-speci\ufb01c\nknowledge.\nLet bi be a sequence of selected moves in a search tree leading to node n from the root, L(bi) be\nits length, and nj be a node with selected move aj on bi. Segler et al. de\ufb01ne Q(n, a) to prefer\na move with a higher winning ratio as well as a shorter pathway leading to a win:\n\nQ(n, a) = \u2211_{i=1}^{n} Ii(n, a) zi max(0, 1 \u2212 (L(bi) \u2212 \u2211_{aj\u2208bi} k P(nj, aj)) / Lmax)\n\nwhere Ii(n, a) is a function indicating whether\nmove a at node n is selected at the i-th traversal, k is a damping constant (0 \u2264 k \u2264 1), Lmax is the\nmaximum tree depth, and zi is a reward received depending on whether a sampling leads to a win or\nnot. A cut-off depth dr stops a sampling that reached no terminal node.\n\nFigure 3: Proof number limitation in lopsided search space\n\n4 Proof-Number Search with Heuristic Edge Initialization\n\nIn \ufb01nding a proof at the root, PNS tends to select a path that has nodes with small proof numbers.\nSince the proof numbers of OR children are summed up to calculate the proof number of an AND\nnode, PNS prefers selecting an AND node that has a small number of moves. In games where\nPNS variants have been successfully applied, e.g., [16, 21, 27], the number of available moves can\nchange signi\ufb01cantly between consecutive game positions. Such a property of the opponent\u2019s move\ndistribution in the game search space allows PNS variants to have various values of proof numbers,\nsuccessfully identifying promising portions of the search space.\nRA raises a new challenge due to different characteristics of its search space. In RA, many reaction\nrules are applicable to a molecule, increasing the number of moves at OR nodes. However, most of\nthe reaction rules have only one precursor. There are at most a few reactants in the most complicated\nreaction rules. In other words, the search space is lopsided, since the branching factor at OR nodes is\nlarge, while it is very small at AND nodes. In the lopsided search space, PNS variants have dif\ufb01culties\nin identifying moves with higher chances of leading to proofs. For example, assume that the AND\nnodes illustrated in Figure 3 always have only one move each. Then, even if there are many moves\nat OR nodes, pn(n) = 1 holds for any node n in the \ufb01gure, thus preventing PNS variants from\nidentifying a most promising path leading to a proof. 
In this \ufb01gure, DFPN is turned into simple,\ninef\ufb01cient depth-\ufb01rst search with no depth limit.\nHeuristic initialization at leaf nodes can assign a variety of proof numbers to leaf nodes and turns PNS\nvariants into greedy search in the lopsided search space. This approach suffers from performance\ndegradation when an evaluation function hpn(n) assigns an inaccurate proof number to a leaf. Assume\nthat hpn(n) evaluates all leaf nodes illustrated in Figure 3, only the left leaf node l has a proof, and\nhpn(l) > hpn(m) holds for any other leaf m. Then, PNS needs to examine all the nodes in this\n\ufb01gure, before leaf l is examined.\nOur DFPN with Heuristic Edge Initialization (DFPN-E) addresses a limitation arising in the lopsided\nsearch space. DFPN with heuristic initialization of leaf nodes attempts to sum up the values of the\nleaf nodes that can be part of a proof tree. Based on more informed heuristics values calculated by a\nset of the leaf nodes, DFPN with heuristic initialization determines a leaf to expand next. In contrast,\nDFPN-E assigns a heuristic cost to an edge from an OR node to an AND node, which estimates the\ndif\ufb01culty of \ufb01nding a proof. In addition to the number of leaf nodes needed to \ufb01nd a proof, DFPN-E\nattempts to estimate the total cost of the edges needed to be included in a proof tree.\nA leaf node to expand next is determined by this estimated effort. Once DFPN-E proves that an edge\ncan lead to a proof, it does not need any effort to \ufb01nd a proof. DFPN-E, therefore, updates the edge\ncost to zero. Formally, we de\ufb01ne pn(n) for an internal OR node n as follows:\n\npn(n) = 0 if min_{s\u2208S(n)} pn(s) = 0,\npn(n) = min_{s\u2208S(n)} (h(n, s) + pn(s)) otherwise,\n\nwhere h(n, s) is an evaluation function evaluating an edge from node n to its child s. In addition,\nthe evaluation value is non-negative for any n and s. 
The remaining cases of calculating pn(n) and\ndn(n) as well as the child selection scheme are the same as described in the previous section for PNS.\nHeifets\u2019 and Jurisica\u2019s PNS implementation [11] is regarded as a special case of h(n, s) = 1 for any\nn and s. In this case, their PNS implementation is almost identical to iterative deepening based on\ndepth except that it always returns to the root whenever a leaf is expanded. While an advantage of\nPNS in solving games is to examine the search space as deep as possible without any depth limit,\ntheir approach no longer inherits this advantage, losing the capability of \ufb01nding long pathways.\nAs in DFPN, DFPN-E uses two thresholds thpn(n) and thdn(n): one for the proof number and the\nother for the disproof number, enabling DFPN-E to search as long as thpn(n) > pn(n) \u2227 thdn(n) >\ndn(n) holds. Let sbest be a child chosen for an examination at an OR node n. DFPN-E updates\nthpn(n) in a combination of [21] with [16]:\n\nthpn(sbest) = min(thpn(n), pn(s2) + \u03b4) \u2212 h(n, sbest)\n\nwhere s2 is a child where pn(s2) is the second smallest proof number with our modi\ufb01cation among a\nlist of n\u2019s children, and \u03b4 is an integer for threshold controlling to decrease the node reexamination\noverhead [16]. The remaining cases are identical to original DFPN [21].\nOur evaluation function h(n, s) is based on a one-step retrosynthesis prediction combined with the\nidea behind so-called forced moves in games. First, as described in [30], our approach encodes a\nmolecule into a \ufb01ngerprint that is a \ufb01xed bit vector. A neural network, described in the next section,\nreceives the \ufb01ngerprint as input. The neural network has R nodes in its output layer where R is the\nnumber of reaction rules in the reaction database. The neural network predicts a probability P (n, a)\nthat reaction rule a is applied to a molecule n in a reverse manner. 
Then, we de\ufb01ne h(n, s) as follows:\n\nh(n, s) = min(Mpn, \u230a\u2212log(P(n, ai) + \u03b5) + 1\u230b)\n\nwith Mpn a constant, ai the reaction rule to generate a child s at OR node n, and \u03b5 a small constant.\nFinally, let ai\u22121 be the reaction rule applied just before ai in the current search tree. On top of the\nabove h(n, s), our approach sets h(n, s) = 0, if (1) ai\u22121 = ai holds and (2) the largest molecule in\ns is smaller than that in n. We de\ufb01ne the size of a molecule as the number of atoms. The intuition\nbehind the latter heuristic rule is that a simpler precursor molecule could often be used to synthesize\na product, as the feasibility of a simpler molecule tends to be easier to verify. In addition, applying the\nsame rule encourages DFPN-E to keep simplifying the molecule structure. Forced moves in games\nhave similar principles to guide search towards positions that can be more easily veri\ufb01ed.\n\n5 Experimental Results\n\nThis section is dedicated to empirical evaluations of our algorithms.\n\n5.1 Setup\n\nOur implementation easily solves the benchmark instances in [11] with their reaction and starting\nmaterial databases, except instance #18 that has a special symmetric structure. Therefore, for our\nempirical evaluation, we create more dif\ufb01cult benchmarks from reactions extracted from the text\nmining of US patents from 1976\u20132013 provided by Lowe\u00b2 [20]. We \ufb01rst select unique reactions\nthat RDKit\u00b3 can correctly parse, and split them into a training set and a test set. The training set\ncontains 681,866 reactions in patents \ufb01led before those in the test set. We use the training set to create\ndatabases of reaction rules and starting materials, and train a neural network.\n\n\u00b2 https://bitbucket.org/dan2097/patent-reaction-extraction/downloads\n\u00b3 https://www.rdkit.org/\n\nWe automatically construct reaction rules, as commonly done in the literature, e.g., [19, 29]. 
We\nkeep only the largest molecule in the product of each reaction. From each reaction, we then extract a\nreaction core and its direct neighbors as a candidate for a reaction rule. If that candidate appears at\nleast 8 times in the training set, we keep it. We obtain 1550 reaction rules. Molecules present in the\ntraining set are included in the starting material database, which ends up having 977,435 molecules.\nThe training set of Segler et al. [29] is more than 18 times larger than ours (12.4 million versus 681\nthousand), because we do not have any access to commercial databases. Therefore, as a trade-off, we\nhave a smaller number of reaction rules and use \ufb01ngerprints and neural networks of a smaller\nsize. Our \ufb01ngerprints are based on the Morgan \ufb01ngerprints [25] of radius 3 with 512 bits.\n\n5.2 Implementation Details\n\nThe neural network for edge initialization consists of one fully connected layer of 512 neurons with\nRecti\ufb01ed Linear Unit (ReLU) activations followed by a softmax layer for 1550 output categories\nimplemented in TensorFlow 1.12. The network parameters are optimized using Adam optimization\nwith a learning rate of 10^{\u22124}, batch size of 128 and a dropout rate of 0.2 applied after the dense layer.\nFrom the test set, we extract 1945 target molecules that are not in the starting material database.\nWe run each algorithm with a time limit of 900 seconds per instance on a machine whose CPU is\nIntel Xeon E5-2683 at 2.00GHz with 32GB memory and with only one core in use. Algorithms\nevaluated include: (1) MCTS: MCTS with Segler et al.\u2019s formula [29], (2) DFPN: Basic DFPN,\nwith no edge costs, (3) DFPNH: DFPN with heuristic initialization of proof numbers at leaf nodes,\n(4) DFPN-1: DFPN with a unit edge cost, an enhanced version of Heifets\u2019 and Jurisica\u2019s PNS for\nchemical synthesis planning [11], and (5) DFPN-E: DFPN with heuristic edge initialization. 
For\nDFPN-E, we set Mpn = 20, \u03b5 = 10^{\u221230} and a threshold controlling parameter \u03b4 = 2. For MCTS, we\nuse parameters shown in [29], i.e., c = 3, k = 0.99, Lmax = 25, dr = 5, z = 10 if leading to a win\nand otherwise z = \u22121.\n\nEnhancements to MCTS. Unlike [29], our MCTS implementation uses only one neural network\nand performs neither forward pruning nor in-scope \ufb01ltering, because we have a smaller set of reaction\nrules extracted from a smaller training set and because the purpose of our paper is to elucidate the\nperformance of MCTS and DFPN under the game-solving model presented in [11]. However, we\nfurther enhance MCTS of Segler et al. [29] by formulating their Markov Decision Processes (MDPs)\nin a game-proof search. When two reactants B and C are generated from a product A by\napplying a reaction rule in a retrosynthetic manner, a state in their representation always includes\nboth B and C. In contrast, our MCTS implementation regards it as an AND node with two OR\nchildren B and C. This is more generic than the partial reward of Segler et al. [29], because B and C\nare considered as their partial states. In addition, at an AND node our MCTS implementation more\ndynamically selects B and C, based on which one is more promising to prove.\n\n5.3 Results\n\nOf 1550 reaction rules, there are 1213 rules that have only one reactant, 330 rules with two reactants\nand 7 rules with three reactants, demonstrating that the search space is lopsided in RA. 78% of the\nrules generate only one move at an AND node, and at most three moves can be generated there. 
We\ndo not intend to argue that there are more uni-molecular reactions than bi-molecular reactions, but\nthese statistics result from writing the reaction templates obtained from the patent literature in the\nreaction direction of large molecules reacting into smaller molecules to facilitate the search from a\nusually larger target molecule towards the usually smaller starting materials.\nTable 1 summarizes the performance of all algorithms. We include the search performance for the\ninstances where the best known pathway lengths are longer than two, since we are interested in\nsolving dif\ufb01cult instances. However, there are two instances with a pathway length of two which are\nnot solved by MCTS but solved by DFPN-E. This shows that even \ufb01nding extremely short pathways\ncan fail if MCTS continues examining the wrong branches due to slow node expansion rates and a\nlarge branching factor. There are 897 instances in total, and 483 instances are solved by all methods\nwith a time limit of 15 minutes per instance. The runtime and node expansion in Table 1 indicate the\ntotal runtime and the total node expansion that each algorithm needs to solve these 483 instances.\n\nTable 1: Summary of performance\n\nMethod | MCTS | DFPN | DFPNH | DFPN-1 | DFPN-E\nRuntime (s) | 18,552 | 20,133 | 30,822 | 46,542 | 5,654\nNode expansion | 184,347 | 730,241 | 668,421 | 460,738 | 68,719\nNum solved | 852 | 770 | 619 | 691 | 842\nLongest pathway | 21 | 2,000 | 455 | 8 | 19\nAverage pathway | 5.58 | 227.42 | 21.09 | 3.59 | 5.72\n\nFigure 4: Performance comparison based on density distribution on instances.\n\nDFPN-E tends to solve instances much more quickly than the other algorithms by signi\ufb01cantly\ndecreasing node expansions. For example, on average DFPN-E is 3.3 times faster than MCTS and\nreduces node expansions by a factor of 2.7. Compared to DFPN, DFPN-E improves the runtime\nby a factor of 3.6. 
We plot the logarithmic value of the runtime/node expansion on the vertical axis\nagainst the density distribution on the instances on the horizontal axis in Figure 4. This \ufb01gure also\nclearly shows that DFPN-E performs better than DFPN and MCTS. DFPNH, which initializes proof\nnumbers at leaf nodes, worsens the performance of DFPN even if DFPN-E\u2019s evaluation function is\nused, demonstrating the importance of assigning heuristic values not to leaf nodes but to edges.\nIn terms of the number of problems solved, MCTS and DFPN-E are two of the most competitive\nalgorithms. MCTS performs slightly better than DFPN-E. Of the 897 instances, both DFPN-E and\nMCTS solve 809 instances. DFPN-E solves 33 instances unsolved by MCTS, while there are 43\ninstances solved only by MCTS. The other 12 problems remain unsolved.\nBoth DFPN and DFPNH tend to return very long pathways. DFPN tends to search as deep as possible\nif an AND node has one branch. The lopsided search space of RA leads DFPN to return a pathway\nof 2000 steps. In this case, DFPN keeps making very small changes to the target molecule such as\nchanging one double bond to a single bond until it reaches a set of simple starting materials. DFPN\u2019s\nperformance is degraded, especially when it accidentally fails to select a correct move immediately\nleading to a proof. In addition, synthesizing a target molecule with a long pathway is dif\ufb01cult for\nthe chemist to perform in practice. Even with an evaluation function, DFPNH has a pathway of\n455 steps in the worst case, showing that DFPNH\u2019s behavior is also affected by the lopsided search\nspace. DFPN-1 returns the shortest pathways on average. However, as DFPN-1 performs similarly to\ndepth-\ufb01rst iterative deepening limited by its search depth, it does not \ufb01nd pathways longer than 8.\nDFPN-1 is 8.2 times slower on average and solves 151 fewer instances than DFPN-E. 
We conclude\nthat DFPN-E and MCTS are the strongest candidates for chemical synthesis planning.\n\n6 Related Work\n\nNagai also introduces edge costs in DFPN [21]. His approach, DFPN+, initializes proof and disproof\nnumbers at the leaf nodes and uses a uniform edge cost (e.g., -1 in the game of Othello [22]). The\nedge cost of DFPN+ aims to control the re-examination overhead of DFPN. In contrast, DFPN-E\nassigns non-uniform edge costs whose values depend on the promise of each edge, and is designed to\nwork in lopsided search spaces. DFPN-E handles the re-examination overhead using the formula of\nKishimoto and M\u00fcller [16].\nOptimal AND/OR search algorithms [15, 23] use constant edge costs and an admissible heuristic\nfunction to evaluate leaf nodes, and calculate so-called q-values that are backed up in a manner similar to proof\nnumbers. These approaches do not heuristically initialize edge costs. In addition, they monotonically\nincrease the q-values to ensure optimality. However, all PNS variants including DFPN-E have a\nscenario where updated proof numbers decrease as search progresses, namely when one of the OR children\nis proven at an AND node. When this scenario occurs, the search space rooted at that node becomes\nmore promising for finding a proof, which is a key advantage for DFPN-E.\nThere are other approaches based on OR search to perform retrosynthetic analysis, e.g., [19, 33].\nSegler et al. [29] compare MCTS with greedy best-first search (GBFS) based on OR search and\nconclude that MCTS outperforms GBFS. Schreck et al. [28] employ deep reinforcement learning\nwith a value network trained by using simulated experiences. 
Whether the policy learned by\ntheir approach can be effectively combined with DFPN-E remains important future work.\n\n7 Conclusions\n\nWe elucidated the behavior of MCTS and PNS in the search space of RA and presented DFPN-E,\na DFPN-based algorithm with edge cost initialization that addresses an essential problem of the\nlopsided search space in RA. Our experiments in search spaces generated from the US patent literature\ndemonstrate that DFPN-E is a competitive alternative to MCTS for chemical synthesis planning.\nIn future work, we plan to further improve the search performance of DFPN-E by incorporating\nsuccessful approaches from games as well as more accurate neural networks for cases with less\ndata than in [29]. Furthermore, combining MCTS and DFPN-E in a portfolio and considering the\nfeasibility of reactions [29, 33] are promising research ideas.\n\nReferences\n\n[1] L. V. Allis, M. van der Meulen, and H. J. van den Herik. Proof-number search. Artificial\nIntelligence, 66(1):91\u2013124, 1994.\n\n[2] B. Arneson, R. B. Hayward, and P. Henderson. Solving Hex: Beyond humans. In Computers\nand Games 2010, volume 6515 of Lecture Notes in Computer Science, pages 1\u201310. Springer,\nBerlin, Germany, 2011.\n\n[3] B. Bonet and H. Geffner. Planning as heuristic search. Artificial Intelligence, 120:5\u201333, 2001.\n\n[4] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener,\nD. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE\nTransactions on Computational Intelligence and AI in Games, 4(1):1\u201349, 2012.\n\n[5] M. Campbell, A. J. Hoane Jr., and F. Hsu. Deep Blue. Artificial Intelligence, 134(1\u20132):57\u201383, 2002.\n\n[6] E. J. Corey. General methods for the construction of complex molecules. Pure and Applied\nChemistry, 14:19\u201338, 1967.\n\n[7] E. J. Corey and W. T. Wipke. Computer-assisted design of complex organic syntheses. 
Science,\n166:178\u2013192, 1969.\n\n[8] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo Tree Search. In Proceedings\nof the 5th Computers and Games Conference, volume 4630 of Lecture Notes in Computer\nScience, pages 72\u201383, 2007.\n\n[9] C. Gao, M. M\u00fcller, and R. Hayward. Focused depth-first proof number search using convolutional\nneural networks for the game of Hex. In IJCAI, pages 3668\u20133674, 2017.\n\n[10] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In Z. Ghahramani,\neditor, ICML, pages 273\u2013280, 2007.\n\n[11] A. Heifets and I. Jurisica. Construction of new medicines via game proof search. In AAAI,\npages 1564\u20131570, 2012.\n\n[12] M. Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research,\n26:191\u2013246, 2006.\n\n[13] A. Kishimoto. Dealing with infinite loops, underestimation, and overestimation of depth-first\nproof-number search. In AAAI, pages 108\u2013113, 2010.\n\n[14] A. Kishimoto, B. Buesser, and A. Botea. AI meets chemistry. In AAAI, pages 7978\u20137982, 2018.\n\n[15] A. Kishimoto and R. Marinescu. Recursive best-first AND/OR search for optimization in\ngraphical models. In UAI, pages 400\u2013409, 2014.\n\n[16] A. Kishimoto and M. M\u00fcller. Search versus knowledge for solving life and death problems in\nGo. In AAAI, pages 1374\u20131379, 2005.\n\n[17] A. Kishimoto, M. Winands, M. M\u00fcller, and J.-T. Saito. Game-tree search using proof numbers:\nThe first twenty years. ICGA Journal, 35(3):131\u2013156, 2012.\n\n[18] L. Kocsis and C. Szepesv\u00e1ri. Bandit based Monte-Carlo planning. In Proceedings of the\n17th European Conference on Machine Learning (ECML), volume 4212 of Lecture Notes in\nComputer Science, pages 282\u2013293. Springer, 2006.\n\n[19] J. Law, Z. Zsoldos, A. Simon, D. Reid, Y. Liu, S. Y. Khew, A. P. Johnson, S. Major, R. A.\nWade, and H. Y. Ando. 
Route Designer: A retrosynthetic analysis tool utilizing automated\nretrosynthetic rule generation. Journal of Chemical Information and Modeling, 49(3):593\u2013602,\n2009.\n\n[20] D. M. Lowe. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis,\nUniversity of Cambridge, 2012.\n\n[21] A. Nagai. Df-pn Algorithm for Searching AND/OR Trees and Its Applications. PhD thesis, The\nUniversity of Tokyo, 2002.\n\n[22] A. Nagai and H. Imai. Application of df-pn+ to Othello endgames. In Proceedings of the 5th\nGame Programming Workshop (GPW) in Japan, pages 16\u201323, 1999.\n\n[23] N. J. Nilsson. Principles of Artificial Intelligence. Tioga Publishing Co, Palo Alto, CA, 1980.\n\n[24] J. Pawlewicz and L. Lew. Improving depth-first pn-search: 1+\u03b5 trick. In Proceedings of the 5th\nComputers and Games Conference, volume 4630 of Lecture Notes in Computer Science, pages\n160\u2013170, 2007.\n\n[25] D. Rogers and M. Hahn. Extended-connectivity fingerprints. Journal of Chemical Information\nand Modeling, 50(5):742\u2013754, 2010.\n\n[26] C. Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial\nIntelligence, 61(3):203\u2013230, 2011.\n\n[27] J. Schaeffer, N. Burch, Y. Bj\u00f6rnsson, A. Kishimoto, M. M\u00fcller, R. Lake, P. Lu, and S. Sutphen.\nCheckers is solved. Science, 317(5844):1518\u20131522, 2007.\n\n[28] J. S. Schreck, C. W. Coley, and K. J. M. Bishop. Learning retrosynthetic planning through\nsimulated experience. ACS Central Science, 5(6):970\u2013981, 2019.\n\n[29] M. H. S. Segler, M. Preuss, and M. P. Waller. Planning chemical syntheses with deep neural\nnetworks and symbolic AI. Nature, 555:604\u2013610, 2018.\n\n[30] M. H. S. Segler and M. P. Waller. Neural-symbolic machine learning for retrosynthesis and\nreaction prediction. Chemistry \u2013 A European Journal, 1521(3765), 2017.\n\n[31] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. 
van den Driessche, J. Schrittwieser,\nI. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,\nI. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis.\nMastering the game of Go with deep neural networks and tree search. Nature, 529:484\u2013489,\n2016.\n\n[32] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre,\nD. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement\nlearning algorithm that masters chess, shogi, and Go through self-play. Science,\n362(6419):1140\u20131144, 2018.\n\n[33] S. Szymku\u0107, E. P. Gajewska, T. Klucznik, K. Molga, P. Dittwald, M. Startek, M. Bajczyk,\nand B. A. Grzybowski. Computer-assisted synthetic planning: The end of the beginning.\nAngewandte Chemie International Edition, 55(20):5904\u20135937, 2016.\n\n[34] M. H. M. Winands, Y. Bj\u00f6rnsson, and J.-T. Saito. Monte-Carlo tree search solver. In Proceedings\nof the 6th Computers and Games Conference, volume 5131 of Lecture Notes in Computer\nScience, pages 25\u201336, 2008.\n\n[35] M. H. M. Winands and M. P. D. Schadd. Evaluation-function based proof-number search. In\nComputers and Games, volume 6515 of Lecture Notes in Computer Science, pages 23\u201335, 2011.", "award": [], "sourceid": 3926, "authors": [{"given_name": "Akihiro", "family_name": "Kishimoto", "institution": "IBM Research"}, {"given_name": "Beat", "family_name": "Buesser", "institution": "IBM Research"}, {"given_name": "Bei", "family_name": "Chen", "institution": "IBM Research"}, {"given_name": "Adi", "family_name": "Botea", "institution": "IBM Research"}]}