{"title": "A Machine Learning Approach to Predict Chemical Reactions", "book": "Advances in Neural Information Processing Systems", "page_first": 747, "page_last": 755, "abstract": "Being able to predict the course of arbitrary chemical reactions is essential to the theory and applications of organic chemistry. Previous approaches are not high-throughput, are not generalizable or scalable, or lack sufficient data to be effective. We describe single mechanistic reactions as concerted electron movements from an electron orbital source to an electron orbital sink. We use an existing rule-based expert system to derive a dataset consisting of 2,989 productive mechanistic steps and 6.14 million non-productive mechanistic steps. We then pose identifying productive mechanistic steps as a ranking problem: rank potential orbital interactions such that the top ranked interactions yield the major products. The machine learning implementation follows a two-stage approach, in which we first train atom level reactivity filters to prune 94.0% of non-productive reactions with less than a 0.1% false negative rate. Then, we train an ensemble of ranking models on pairs of interacting orbitals to learn a relative productivity function over single mechanistic reactions in a given system. Without the use of explicit transformation patterns, the ensemble perfectly ranks the productive mechanisms at the top 89.1% of the time, rising to 99.9% of the time when top ranked lists with at most four non-productive reactions are considered. The final system allows multi-step reaction prediction. Furthermore, it is generalizable, making reasonable predictions over reactants and conditions which the rule-based expert system does not handle.", "full_text": "A Machine Learning Approach to Predict Chemical\n\nReactions\n\nMatthew A. 
Kayala\n\nPierre Baldi\u2217\nInstitute of Genomics and Bioinformatics\n\nSchool of Information and Computer Sciences\n\nUniversity of California, Irvine\n\nIrvine, CA 92697\n\n{mkayala,pfbaldi}@ics.uci.edu\n\nAbstract\n\nBeing able to predict the course of arbitrary chemical reactions is essential to the\ntheory and applications of organic chemistry. Previous approaches are not high-\nthroughput, are not generalizable or scalable, or lack suf\ufb01cient data to be effective.\nWe describe single mechanistic reactions as concerted electron movements from\nan electron orbital source to an electron orbital sink. We use an existing rule-based\nexpert system to derive a dataset consisting of 2,989 productive mechanistic steps\nand 6.14 million non-productive mechanistic steps. We then pose identifying pro-\nductive mechanistic steps as a ranking problem: rank potential orbital interactions\nsuch that the top ranked interactions yield the major products. The machine learn-\ning implementation follows a two-stage approach, in which we \ufb01rst train atom\nlevel reactivity \ufb01lters to prune 94.0% of non-productive reactions with less than a\n0.1% false negative rate. Then, we train an ensemble of ranking models on pairs of\ninteracting orbitals to learn a relative productivity function over single mechanis-\ntic reactions in a given system. Without the use of explicit transformation patterns,\nthe ensemble perfectly ranks the productive mechanisms at the top 89.1% of the\ntime, rising to 99.9% of the time when top ranked lists with at most four non-\nproductive reactions are considered. The \ufb01nal system allows multi-step reaction\nprediction. Furthermore, it is generalizable, making reasonable predictions over\nreactants and conditions which the rule-based expert system does not handle.\n\n1 Introduction\n\nDetermining the major products of chemical reactions given the input reactants and conditions is\na fundamental problem in organic chemistry. 
Reaction prediction is a necessary component of\nretro-synthetic analysis or virtual library generation for drug design[1, 2] and has the potential to\nincrease our understanding of biochemical catalysis and metabolism[3]. There is a broad range\nof approaches to reaction prediction falling around at least three main poles: physical simulations\nof transition states using various quantum mechanical and other approximations[4, 5, 6], rule-based\nexpert systems[2, 7, 8, 9, 10, 11], and inductive machine learning methods[12]. However, none of\nthese approaches can successfully emulate the remarkable abilities of a human chemist.\n\n\u2217To whom correspondence should be addressed\n\n1.1 Previous approaches and representations\n\nThe very concept of a \u201creaction\u201d can be ambiguous, as it corresponds to a macroscopic abstraction,\nhence simplification, of a very complex underlying microscopic reality, ultimately driven by the\nlaws of quantum and statistical mechanics. However, even for relatively small systems, it is impossible\nto find exact solutions to the Schr\u00f6dinger equation. Thus in practice, energies are calculated\nwith varyingly accurate approximations, ranging from ab-initio Hartree-Fock approaches or density\nfunctional theory to semi-empirical methods or mechanical force fields[6]. This leads to modeling\nreactions as minimum energy paths between stable atom configurations on a high-dimensional potential\nenergy surface, where the path through the lowest energy transition state, i.e., saddle point, is\nthe most favorable[4, 5]. By explicitly modeling energies, these approaches can be highly accurate\nand generalize to a diverse range of chemistries but require careful initialization and are computationally\nexpensive (see [13] for a representative example). 
This branch of computational chemistry\nprovides invaluable tools for in-depth analysis but is currently not suitable for high-throughput re-\nactivity tasks and is far from being able to recapitulate the knowledge and ability of human experts.\n\nIn contrast, most rule-based expert systems for high-throughput reactivity tasks use a much more\nabstract representation, in the form of general transformations over molecular graphs[2, 7, 8, 9, 10].\nReactions are predicted when a match is found in a library of allowable graph transformations. These\ngeneral transformations model only net molecular changes for processes that in reality involve a\nsequence of transition states, as shown in Figure 1. These rule-based approaches suffer from at least\nfour drawbacks: (1) they use a representation that is too high-level, in that an overall transformation\nobfuscates the underlying physical reality; (2) they require the manual curation of large amounts of\nexpert knowledge; (3) they become unmanageable at larger scales, in that adding a new graph pattern\noften involves having to update a large proportion of existing transformations with exceptions; and\n(4) they lack generality, in that particular chemistries must explicitly be encoded to be predicted.\n\n[C;X3H0:1]=[C;X3:2].[H:3][Br:4]>>[Br:4][C:1][C:2][H:3]\n\nFigure 1: Overall transformation of an alkene (hydrocarbon with double bond) with hydro-\nbromic acid (HBr) and corresponding mechanistic reactions. (a) shows the overall transform as\na SMIRKS[14] string pattern and as a graph representation. In a molecular graph, vertices represent\natoms, with carbons at unlabeled vertices. The number of edges between two vertices represents\nbond order. +/\u2212 symbols represent formal charge. Standard valences are \ufb01lled using implicit\nhydrogens. (b) shows the two mechanistic reactions composing the overall transformation as arrow-\npushing diagrams[15, 16]. 
Dots represent non-bonded (lone pair) electrons, while arrows represent\nconcerted electron movement. In the \ufb01rst step, electrons in the electron-rich carbon-carbon double\nbond attack the hydrogen and break the electron-poor hydrogen-bromine single bond, producing an\nanionic bromide (Br\u2212) and a carbocation (C+). In the second step, electrons from the charged,\nelectron-rich bromide attack the electron-poor carbocation, yielding the \ufb01nal alkyl halide.\n\nSomewhere between low-level QM treatment and abstract graph-based overall transformations, one\ncan consider reactions at the mechanistic level. A mechanistic, or elementary, reaction is a con-\ncerted electron movement through a single transition state[15, 16]. These mechanistic reactions can\nbe composed to yield overall transformations. For example, Figure 1 shows the overall transfor-\nmation of an alkene interacting with hydrobromic acid to yield an alkyl halide, along with the two\nelementary reactions which compose the transformation. A mechanistic reaction is described as an\nidealized molecular orbital (MO) interaction between an electron source (donor) MO and an electron\nsink (acceptor) MO. MOs represent regions of the molecule with high (source) or low (sink) electron\n\n2\n\n\fdensity. In general, potential electron sources are composed of lone pairs of electrons and bonds, and\npotential electron sinks are composed of empty atomic orbitals and bonds. Bonds can act as either\na source or a sink depending on the context. Because of space constraints, we cannot fully describe\nsubtle chemical details that must be handled, such as chaining for resonance rearrangement. For\ndetails, see texts[15, 16] on mechanisms. 
Note that by considering all possible pairings of source\nand sink MOs, this representation allows the exhaustive enumeration of all potential mechanistic\nreactions over an arbitrary set of molecules.\n\nRecent work by Chen and Baldi[11] introduces a rule-based expert system (Reaction Explorer) in\nwhich each rearrangement pattern encompasses an elementary reaction. Here, the elementary reac-\ntions represent \u201cproductive\u201d mechanistic steps, i.e. those reactions which lead to the overall major\nproducts. Thus, elementary reactions which are not the most kinetically favorable, but which even-\ntually lead to the overall thermodynamic transformation product may be considered \u201cproductive\u201d.\nThis approach is a marked change from previous approaches using overall transformations, but as a\nrule-based system still suffers from the problems of curation, scale, and generality.\n\nWhile mechanistic reaction representations are approximations quite far from the Schr\u00a8odinger equa-\ntion, we expect them to be closer to the underlying reality and therefore more useful than overall\ntransformations. Furthermore, we expect them also to be easier to predict than overall transfor-\nmations due to their more elementary nature and mechanistic interpretation. In combination, these\narguments suggest that working with mechanistic steps may facilitate the application of statistical\nmachine learning approaches, and take advantage of their capability to generalize. Thus, in this\nwork, reactions are modeled as mechanisms, and for the remainder of the paper, we consider the\nterm \u201creaction\u201d to denote a single elementary reaction. Furthermore, we consider the problem of\nreaction prediction to be precisely that of identifying the \u201cproductive\u201d reactions over a given set of\nreactants under particular conditions.\n\nThere has been very little work on machine learning approaches to reaction prediction. 
The sole\nexample is a paper from 1990 on inductively extracting overall transformation patterns from reac-\ntion databases[12], a method which was never actually incorporated into a full reaction prediction\nsystem. This situation is surprising. Given improvements in both computing power and machine\nlearning methods over the past 20 years, one could imagine a machine learning system that mines\nreaction information to learn the grammar of chemistry, e.g., in terms of graph grammars[17]. One\npotential reason behind the lack of progress in this area is the paucity of available data. Chemical\npublishing is dominated by closed models, making literature information dif\ufb01cult to access. Further-\nmore, parsing scienti\ufb01c text and extracting relevant chemical information from text and image data\nis an open problem of research[18, 19]. While commercial reaction databases exist, e.g., Reaxys[20]\nor SPRESI[21], the reactions in these databases are mostly unbalanced, not atom-mapped, and lack\nmechanistic detail[22]. This is in addition to suffering from a severe lack of openness; the databases\nare exorbitantly priced or provided with a restrictive query interface which precludes serious statisti-\ncal data mining. As a result, and to the best of our knowledge, effective machine learning approaches\nto reaction prediction still need to be developed.\n\n1.2 A new approach\n\nThe limitations of previous work motivate a new, fresh approach to reaction prediction combining\nmachine learning with mechanistic representations. The key idea is to \ufb01rst enumerate all potential\nsource and sink MOs, and thus all possible reactions by their pairing, and then use classi\ufb01cation\nand ranking techniques to identify productive reactions. There are multiple bene\ufb01ts resulting from\nsuch an approach. By using very general rules to enumerate possible reactions, the approach is not\nrestricted to manually curated reaction patterns. 
By detailing individual reactions at the mechanistic\nlevel, the system may be able to statistically learn ef\ufb01cient predictive models based on physico-\nchemical attributes rather than abstract overall transformations. And by ranking possible reactions\ninstead of making binary decisions, the system may provide results amenable to \ufb02exible interpreta-\ntion. However, the new approach also faces three key challenges: (1) the development of appropri-\nate training datasets of productive reactions; (2) the development of a machine learning approach to\ncontrol the combinatorial complexity resulting from considering all possible pairs of electron donors\nand acceptors among the reacting molecules; and (3) the development of machine learning solutions\nto the problem of predictively ranking the possible mechanisms. These challenges are addressed\none-by-one in the following sections.\n\n3\n\n\f2 The data challenge\n\nA mechanistically de\ufb01ned dataset of reactions to use with the proposed approach does not currently\nexist. To derive a dataset, we use a mechanistically de\ufb01ned rule-based expert system (Reaction\nExplorer) together with its validation suite[11]. The validation suite is a manually composed set of\nreactants, reagents, and products covering a complete undergraduate organic chemistry curriculum.\n\nEntering a set of reactants and a reagent model into Reaction Explorer yields the complete sequence\nof mechanistic steps leading to the \ufb01nal products, where all reactions in this sequence share the\nconditions encoded by the corresponding reagent model. Each one of these mechanistic steps is\nconsidered to be a distinct productive elementary reaction. 
For a given set of reactants and condi-\ntions, which we call a (r, c) query tuple, the Reaction Explorer system labels a small set of reactions\nproductive, while all other reactions enumerated by pairing source and sink MOs over the reactants\nare considered non-productive.\n\nWe then de\ufb01ne two {0, 1} labels for each atom (up to symmetries) and conditions (a, c) tuple over\nall (r, c) queries. An (a, c) tuple has label srcreact = 1 if it is the main atom of a source MO in\na productive reaction over any corresponding (r, c) query, and has label srcreact = 0 otherwise.\nThe label sinkreact is de\ufb01ned similarly using sink MOs.\n\nReaction conditions are described with three parameters: temperature, anion solvation potential,\nand cation solvation potential. Temperature is listed in Kelvin. The solvation potentials are unitless\nnumbers between 0 and 1 representing ease of cation or anion solvation, thus providing a quantita-\ntive scale to describe polar protic, polar aprotic, and nonpolar solvents. Note that any mechanistic\ninteraction with the solvent or reagent is explicitly modeled, e.g. as in Figure 1.\n\nAs an initial validation of the method, we consider general ionic reactions from the Reaction Ex-\nplorer validation suite involving C, H, N, O, Li, Mg, and the halides. Extensions to include stere-\noselective, pericyclic, and radical reactions are discussed in Section 5. The dataset consists of 6.14\nmillion reactions composed of 84,825 source and 74,725 sink MOs from 2,752 distinct reactants\nand reaction conditions, i.e., (r, c) queries. Of these 6.14 million reactions, the Reaction Explorer\nsystem labels 2,989 of them as productive. There are 22,894 atom symmetry classes, which when\npaired with reaction condition yields 29,104 (a, c) tuples. 
Of these 29,104 (a, c) tuples, 1,262 have\nlabel srcreact = 1 , and 1,786 have label sinkreact = 1.\nAtom and MO interaction data is available at our chemoinformatics portal (http://cdb.ics.\nuci.edu) under Supplements.\n\n3 The combinatorial complexity challenge\n\nIn the dataset, the average molecule has 44 source MOs and 50 sink MOs. For this average molecule,\nconsidering only intermolecular reactions with a second copy of the same molecule gives 44 \u00d7\n50 = 2200 potential elementary reactions. Thus, the number of possible reactions is very large,\nmotivating identifying productive reactions given a (r, c) query in two stages. In the \ufb01rst stage, we\ntrain \ufb01lters using classi\ufb01cation techniques on the source and sink reactivity labels. The idea is to\ntrain highly sensitive classi\ufb01ers which reduce the breadth of possible reactions without erroneously\n\ufb01ltering productive reactions. Then only those source and sink MOs where the main atom passes\nthe respective atom level \ufb01lter are considered when enumerating reactions to consider in the second\nranking stage for predicting reaction productivity.\n\nHere, we train two separate classi\ufb01ers to predict the source and sink atom level reactivity labels,\neach using the same feature descriptions and machine learning implementations. To assess the\nperformance of the reactive site \ufb01lter training, we perform full 10-fold cross-validation (CV) over\nall distinct tuples of molecules and conditions (m, c).\n\n3.1 Feature representation\n\nEach (a, c) tuple is represented as a vector of physicochemical and topological features. There are\n14 real-valued physicochemical features such as the reaction conditions, the molecular weight of\nthe molecule, and the charge at and around the atom. Topological features are meant to capture the\nneighboring context of a in the molecular graph, for example counts over vertex-and-edge labeled\n\n4\n\n\fpaths and trees rooted at a. 
We compute paths to length 4 and trees to depth 2, producing 743\nmolecular graph features. In addition to standard molecular graph features, we also include similar\ntopological features over a restricted alphabet pharmacophore point graph, where pharmacophore\npoint graph definitions are adapted from H\u00e4hnke et al.[23]. Using paths of length 4 and trees of\ndepth 2 in the pharmacophore point graph yields another 759 features. This results in a total of\n1,516 features.\n\n3.2 Training\n\nBefore training, all features are normalized to [0, 1] using the minimum and maximum values of\nthe training set. We oversample (a, c) tuples with label 1 to ensure approximately balanced classes.\nWe experimented with a variety of architectures. Here we report the results obtained with artificial\nneural networks using sigmoidal activation functions, with a single hidden layer and a single output\nnode with a cross-entropy error function. Grid search using internal three-fold CV on a single training\nset is used to fit the architecture size (converging to 10 hidden nodes) and the L2-regularization\n(weight decay) parameter shared by all folds of the overall 10-fold CV. Weights are optimized by\nstochastic gradient descent with per-weight adaptive learning rates[24]. Optimization is stopped\nafter 100 epochs as this is observed to be sufficient for convergence.\n\nAs highly sensitive classifiers are desired, the choice of a decision threshold is important. We perform\ninternal three-fold CV on the training set to find decision thresholds yielding a false negative\nrate of 0 on each respective internal test set. The decision threshold for the overall CV fold is taken\nas the average of these internal CV fold thresholds.\n\n3.3 Results\n\nWe report the true negative rate (TNR) and the false negative rate (FNR) for both the source and\nsink classification problems as well as for the actual reaction filtering problem, as shown in\nTable 1. 
In a CV regime, we are able to filter 94.0% of the 6.14 million non-productive reactions\nwith less than 0.1% false negatives, effectively reducing the ranking problem imbalance by an order\nof magnitude with minimal error. Having established excellent filtering results with rigorous CV,\nwe then train classifiers with all available data in order to independently assess the ranking method.\nThe results of these classifiers are shown in the last column of Table 1.\n\nTable 1: Reactive site classification results. Source reactive and sink reactive rows show results\non the respective classification problems. The reaction row shows results of using the two atom\nclassifiers for an initial reaction filtering. CV columns indicate results of full 10-fold cross-validation\nover (m, c) tuples. CV results show the mean and standard deviation over folds. The best TNR\ncolumn shows results when trained with all available data.\n\nProblem          CV TNR % (SD)   CV FNR % (SD)   Best TNR %\nSource Reactive  87.7 (2.0)      0.1 (0.2)       92.1\nSink Reactive    75.6 (5.8)      0.2 (0.4)       85.6\nReaction         94.0 (1.5)      < 0.1 (< 0.1)   97.2\n\n4 The ranking challenge\n\nWe pose the task of identifying the productive reactions as a ranking problem. To assess performance,\nwe perform full 10-fold CV over the 2,752 distinct (r, c) queries. With the overall filtered\nset of reactions, there are, on average, 1.1 productive and 62.5 non-productive reactions per (r, c)\nquery.\n\n4.1 Feature representation\n\nEach reaction is composed of a source and sink MO. The reaction feature vector is the concatenation\nof the corresponding source and sink atom level feature vectors with some modifications. To keep\nthe size reasonable, only real valued and pharmacophore (path length 3 and tree depth 2) atom level\nfeatures are included. 
124 features are calculated to describe the net difference between reactants\nand products, such as counts over bond types, rings, and formal charges. And \ufb01nally, 450 features\ndescribing the forward and inverse reactions are calculated, including atoms and bonds involved and\nimplied transition state geometry. This leads to a total of 1,677 reaction features.\n\n4.2 Training\n\nWe use a pairwise approach to ranking similar to [25], using two identical shared-weight arti\ufb01cial\nneural networks linked to a single comparator output node with \ufb01xed \u00b11 weights. The general\narchitecture is shown in Figure 2. Each shared network receives as an input a potential reaction, i.e.\na source-sink pair. Training is performed via back-propagation with weight-sharing.\n\n...\n\n...\n\n(Source, Sink) A\n\n(Source, Sink) B\n\nFigure 2: Shared weight arti\ufb01cial neural network architecture for pairwise ranking. The goal is\nto determine a productivity order between the (source, sink) A and (source, sink) B pairs. This\nis done with a pair of shared-weight arti\ufb01cial neural networks with sigmoidal hidden nodes and a\nlinear output node. The output of these internal networks are tied to a single sigmoidal output node\nwith \ufb01xed weights. The \ufb01nal output will approach 1 if the (source, sink) A pair is predicted to be\nrelatively more productive than the (source, sink) B pair, and 0 otherwise.\n\nTraining details are similar to the reactive site classi\ufb01cation. All features are normalized to [0, 1]\nand grid search with internal three-fold CV on a single training set is used to \ufb01t the architecture\nsize (converging to 20 hidden nodes) and L2-regularization (weight decay) parameter shared by all\nfolds of the overall 10-fold CV. Weights are optimized using stochastic gradient descent with the\nsame per-weight adaptive learning rate scheme[24]. 
Optimization is stopped after 25 epochs as this\nis observed to be suf\ufb01cient for convergence.\n\nAn ensemble consisting of \ufb01ve separate pairwise ranking machines (as described in Figure 2) is used\nfor each training set. Each machine in the ensemble is trained with all the productive reactions (from\nthe training set) and a random partition of the non-productive reactions (from the training set). Final\nranking on the test set is determined by either simple majority vote or by ranking the average scores\nfrom the linear output node of the inner shared-weight network for each machine in the ensemble.\nThe latter yields a minute performance increase and is reported.\n\n4.3 Results\n\nWe consider two measures for evaluating rankings, Normalized Discounted Cumulative Gain at list\nsize i (NDCG@i) and Percent Within-n. NDCG@i is a common information retrieval metric[26]\nthat sums the overall usefulness (or gain) of productive reactions in a given list of the top-i results,\nwhere individual gain decays exponentially with lower position. The measure is normalized such\nthat the best possible ranking of a size i list has NDCG@i = 1. For example, NDCG@1 is the\nfraction of (r, c) queries in which the top ranked reaction is a productive reaction. Percent Within-n\nis simply how many (r, c) queries have at most n non-productive reactions in the smallest ranked list\ncontaining all productive reactions. For example, Percent Within-0 measures the percent of (r, c)\n\n6\n\n\fqueries with perfect rank, and Percent Within-4 measures how often all productive reactions are\nrecovered with at most 4 mis-ranked non-productive reactions. Note that the NDCG@1 and Percent\nWithin-0 will differ because roughly 10% of (r, c) queries have more than one productive reaction.\nThe non-productive MO interactions vastly outnumber the productive interactions. In spite of this\nimbalance, our approach gives excellent ranking results, shown in Table 2. 
The NDCG results\nshow, for example, that in 89.5% of the queries, the top ranked reaction is productive. The Percent\nWithin-n results show that 89.1% of queries have perfect ranking, while 99.9% of queries recover\nall productive reactions by considering lists with at most four non-productive reactions.\n\nTable 2: Reaction ranking results. We show Normalized Discounted Cumulative Gain at different\nlist sizes i (NDCG@i) and Percent Within-n. See text for description of the measures. We report\nmean (standard deviation) results over CV folds.\n\ni   Mean NDCG@i (SD)    n   Percent Within-n (SD)\n1   0.895 (0.016)       0   89.1 (1.7)\n2   0.939 (0.011)       1   96.8 (1.0)\n3   0.952 (0.008)       2   98.9 (0.6)\n4   0.954 (0.007)       3   99.5 (0.4)\n5   0.956 (0.007)       4   99.9 (0.3)\n\n4.4 Chemical applications\n\nThe strong performance of the ranking system is exhibited by its ability to make accurate multi-step\nreaction predictions. An example, shown in the first row of Table 3, is an intramolecular Claisen\ncondensation reaction with conditions (room temperature, polar aprotic solvent) requiring three elementary\nsteps. The ranking method correctly predicts the given reaction as the highest ranked\nreaction at each step.\n\nTable 3: Chemical reactions of interest. The first row shows an example of full multi-step reaction\nprediction by the ranking system, a three step intramolecular Claisen condensation (room temp.,\npolar aprotic). At each stage, the reaction shown is the top ranked when all possible reactions are\nconsidered by the two stage machine learning system. The second row shows two macrocyclizations\nwhich the rule-based system (Reaction Explorer) is unable to predict, but the machine learning\napproach effectively generalizes and ranks correctly. 
These reactions lead to the formation of a\nseven-membered homocycle (7 carbons) on the left and a seven-membered heterocycle (6 carbons, 1 oxygen) on the right.\nThe third row shows an intelligible error of the machine learning approach (see text).\n\n[Table 3 rows: Multi-Step Reaction Prediction; Generality; Reasonable Errors. Reaction diagrams not reproduced.]\n\nA generalizable system should be able to make reasonable predictions about reactants and reaction\ntypes with which it has only had implicit, rather than explicit, experience. Reaction Explorer, as a\nrule-based expert system without explicit rules about larger ring forming reactions, does not make\nany predictions about seven and eight atom cyclizations. In reality though, larger ring forming\nreactions are possible. The second row of Table 3 shows the top two ranked reactions over a set\nof bromo-hept-1-en-2-olate reactants, leading to seven-member ring formation. The ranking model,\nwithout ever being trained with seven or eight-member ring forming reactions, returns the enolate\nattack as the most favorable, but also returns the lone pair nucleophilic substitution as the second\nmost favorable. Similar results are obtained for similar eight-membered ring systems (not shown).\nThus the ranking model is able to generalize and make reasonable suggestions, while the rule-based\nsystem is limited by hard-coded transformation patterns.\n\nFinally, the vast majority of errors are close errors, as exhibited by the 99.9% Within-4 measure.\nFurthermore, upon examination of these errors, they are largely intelligible and not unreasonable\npredictions. For example, the third row of Table 3 shows two reactions involving an oxonium compound\nand a bromide anion. Our ranking models return these two reactions as the highest, ranking\nthe deprotonation slightly ahead of the substitution. This is considered a Within-1 ranking because\nthe Reaction Explorer system labels only the substitution reaction as productive. 
However, the\nimmediate precursor reaction in the sequence of Reaction Explorer mechanisms leading to these\nreactants is the inverse of the deprotonation reaction, i.e., the protonation of the alcohol. Hydrogen\ntransfer reactions like this are reversible, and thus the deprotonation is likely the kinetically favored\nmechanism, i.e., it is reasonable to rank the deprotonation highly. It is just not productive, in that it\ndoes not lead to the \ufb01nal overall product. In a prediction system working with multi-step syntheses,\nsuch reversals of previous steps are easily discarded.\n\n5 Conclusion\n\nBeing able to predict the outcome of chemical reactions is a fundamental scienti\ufb01c problem. The\nultimate goal of a reaction prediction system is to recapitulate and eventually surpass the ability of\nhuman chemists. In this work, we take a signi\ufb01cant step in this direction, showing how to formulate\nreaction prediction as a machine learning problem and building an accurate implementation for a\nlarge and key subset of organic chemistry. There are a number of immediate applications of our\nsystem, including validating retro-synthetic suggestions, generating virtual libraries of molecules,\nand mechanistically annotating existing reaction databases.\n\nReaction prediction is a largely untapped area for machine learning approaches. As such, there is of\ncourse room for improvements. The \ufb01rst is increasing the breadth of chemistry captured, e.g. radical,\npericyclic, and stereoselective chemistry. Augmenting the MO description with number of electrons,\nallowing cyclic chained MO interactions, and including face orientations are plausible extensions to\nattack each of these additional areas of chemical reactivity. A second area of improvement is the\ncuration of larger mechanistically de\ufb01ned datasets. 
We can approach this manually, by further use\nof expert systems to construct data with the required level of detail, or by carefully crafted\ncrowdsourcing approaches. Other ongoing areas of research include improving the features, performing\nsystematic feature selection, and experimenting with different statistical ranking techniques.\n\nBecause reaction prediction remains an untapped research problem for the machine learning community, we hope that the current\nwork and our publicly available data will spark continued and open research in this important area.\n\nAcknowledgments\n\nWork supported by NIH grants LM010235-01A1 and 5T15LM007743 and NSF grant MRI EIA-0321390\nto PB. We acknowledge OpenEye Scientific Software and ChemAxon for academic software\nlicenses. We wish to thank Profs. James Nowick, David Van Vranken, and Gregory Weiss for\nuseful discussions.\n\nReferences\n[1] E.J. Corey and W.T. Wipke. Computer-assisted design of complex organic syntheses. Science, 166(3902):178\u201392, 1969.\n[2] M.H. Todd. Computer-aided organic synthesis. Chem. Soc. Rev., 34(3):247\u2013266, 2005.\n[3] P. Rydberg, D.E. Gloriam, J. Zaretzki, C. Breneman, and L. Olsen. SMARTCyp: A 2D method for prediction of cytochrome P450-mediated drug metabolism. ACS Med. Chem. Lett., 1(3):96\u2013100, 2010.\n[4] G. Henkelman, B.P. Uberuaga, and H. J\u00f3nsson. A climbing image nudged elastic band method for finding saddle points and minimum energy paths. J. Chem. Phys., 113(22):9901\u20139904, 2000.\n[5] B. Peters, A. Heyden, A.T. Bell, and A. Chakraborty. A growing string method for determining transition states: comparison to the nudged elastic band and string methods. J. Chem. Phys., 120(17):7877\u20137886, 2004.\n[6] C.J. Cramer. Essentials of Computational Chemistry: Theories and Models. Wiley, West Sussex, England, 2nd edition, 2004.\n[7] W.L. Jorgensen, E.R. Laird, A.J. Gushurst, J.M. Fleischer, S.A. Gothe, H.E. Helson, G.D. Paderes, and S. Sinclair. 
CAMEO: a program for the logical prediction of the products of organic reactions. Pure Appl. Chem., 62:1921\u20131932, 1990.\n[8] R. Hollering, J. Gasteiger, L. Steinhauer, K.-P. Schulz, and A. Herwig. Simulation of organic reactions: from the degradation of chemicals to combinatorial synthesis. J. Chem. Inf. Model., 40(2):482\u2013494, 2000.\n[9] G. Benk\u00f6, C. Flamm, and P.F. Stadler. A graph-based toy model of chemistry. J. Chem. Inf. Model., 43(4):1085\u20131093, 2003.\n[10] I.M. Socorro, K. Taylor, and J.M. Goodman. ROBIA: a reaction prediction program. Org. Lett., 7(16):3541\u20133544, 2005.\n[11] J. Chen and P. Baldi. No electron left behind: a rule-based expert system to predict chemical reactions and reaction mechanisms. J. Chem. Inf. Model., 49(9):2034\u20132043, 2009.\n[12] P. R\u00f6se and J. Gasteiger. Automated derivation of reaction rules for the EROS 6.0 system for reaction prediction. Anal. Chim. Acta, 235:163\u2013168, 1990.\n[13] B. Wang and Z. Cao. Mechanism of acid-catalyzed hydrolysis of formamide from cluster-continuum model calculations: concerted versus stepwise pathway. J. Phys. Chem. A, 114(49):12918\u201312927, 2010.\n[14] C.A. James, D. Weininger, and J. Delany. Daylight theory manual. http://www.daylight.com/dayhtml/doc/theory/index.html, 2008. Last accessed January 2011.\n[15] C.K. Ingold. Structure and Mechanism in Organic Chemistry. Cornell University Press, Ithaca, NY, 1953.\n[16] R. Grossman. The Art of Writing Reasonable Organic Reaction Mechanisms. Springer-Verlag, New York, NY, 2nd edition, 2003.\n[17] G. Rozenberg, editor. Handbook of Graph Grammars and Computing by Graph Transformation: Volume I. Foundations. World Scientific Publishing, River Edge, NJ, 1997.\n[18] D.L. Banville. Mining chemical structural information from the drug literature. Drug Discovery Today, 11:35\u201342, 2006.\n[19] J. Park, G.R. Rosania, and K. Saitou. 
Tunable machine vision-based strategy for automated annotation of chemical databases. J. Chem. Inf. Model., 49(8):1993\u20132001, 2009.\n[20] D.D. Ridley. Searching for chemical reaction information. In S.R. Heller, editor, The Beilstein Online Database, volume 436 of ACS Symposium Series, pages 88\u2013112. American Chemical Society, Washington, DC, 1990.\n[21] D.L. Roth. SPRESIweb 2.1, a selective chemical synthesis and reaction database. J. Chem. Inf. Model., 45(5):1470\u20131473, 2005.\n[22] J. Gasteiger and T. Engel, editors. Chemoinformatics: A Textbook. Wiley-VCH, Weinheim, Germany, 2003.\n[23] V. H\u00e4hnke, B. Hofmann, T. Grgat, E. Proschak, D. Steinhilber, and G. Schneider. PhAST: pharmacophore alignment search tool. J. Comput. Chem., 30(5):761\u201371, 2009.\n[24] R. Neuneier and H.-G. Zimmermann. How to train neural networks. In G.B. Orr and K.-R. M\u00fcller, editors, Neural Networks: Tricks of the Trade, pages 373\u2013423. Springer-Verlag, Heidelberg, Germany, 1998.\n[25] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML05), pages 89\u201396. ACM Press, Bonn, Germany, 2005.\n[26] K. J\u00e4rvelin and J. Kek\u00e4l\u00e4inen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422\u2013446, 2002.\n", "award": [], "sourceid": 504, "authors": [{"given_name": "Matthew", "family_name": "Kayala", "institution": null}, {"given_name": "Pierre", "family_name": "Baldi", "institution": null}]}