{"title": "Integrating Markov processes with structural causal modeling enables counterfactual inference in complex systems", "book": "Advances in Neural Information Processing Systems", "page_first": 14234, "page_last": 14244, "abstract": "This manuscript contributes a general and practical framework for casting a Markov process model of a system at equilibrium as a structural causal model, and carrying out counterfactual inference. Markov processes mathematically describe the mechanisms in the system, and predict the system\u2019s equilibrium behavior upon intervention, but do not support counterfactual inference. In contrast, structural causal models support counterfactual inference, but do not identify the mechanisms. This manuscript leverages the benefits of both approaches. We define the structural causal models in terms of the parameters and the equilibrium dynamics of the Markov process models, and counterfactual inference flows from these settings. The proposed approach alleviates the identifiability drawback of the structural causal models, in that the counterfactual inference is consistent with the counterfactual trajectories simulated from the Markov process model. We showcase the benefits of this framework in case studies of complex biomolecular systems with nonlinear dynamics. 
We illustrate that, in the presence of Markov process model misspecification, counterfactual inference leverages prior data, and therefore estimates the outcome of an intervention more accurately than a direct simulation.", "full_text": "Integrating Markov processes with structural causal modeling enables counterfactual inference in complex systems

Robert Ness
Gamalon Inc.
robert.ness@gamalon.com

Kaushal Paneri
Northeastern University
kaushalpaneri@gmail.com

Olga Vitek
Northeastern University
o.vitek@northeastern.edu

Abstract

This manuscript contributes a general and practical framework for casting a Markov process model of a system at equilibrium as a structural causal model, and carrying out counterfactual inference. Markov processes mathematically describe the mechanisms in the system, and predict the system's equilibrium behavior upon intervention, but do not support counterfactual inference. In contrast, structural causal models support counterfactual inference, but do not identify the mechanisms. This manuscript leverages the benefits of both approaches. We define the structural causal models in terms of the parameters and the equilibrium dynamics of the Markov process models, and counterfactual inference flows from these settings. The proposed approach alleviates the identifiability drawback of the structural causal models, in that the counterfactual inference is consistent with the counterfactual trajectories simulated from the Markov process model. We showcase the benefits of this framework in case studies of complex biomolecular systems with nonlinear dynamics. 
We illustrate that, in the presence of Markov process model misspecification, counterfactual inference leverages prior data, and therefore estimates the outcome of an intervention more accurately than a direct simulation.

1 Introduction

Many complex systems contain discrete components that interact in continuous time, and maintain interactions that are stochastic, dynamic, and governed by natural laws. For example, molecular systems biology studies molecules (e.g., gene products, proteins) in a living cell that interact according to biochemical laws. An important aspect of studying these systems is predicting the equilibrium behavior of the system upon an intervention, and selecting high-value interventions. For example, we may want to predict the effect of a drug intervention on a new equilibrium of gene expression [1, 27]. The intervention may have a high value if it reduces the expression of a specific gene, while minimizing changes to the other genes.

Recent work in the reinforcement learning community has highlighted the utility of counterfactual policy evaluation for evaluating and comparing interventions. Counterfactual policy evaluation uses data from past experimental interventions to ask whether a higher value could have been achieved under an alternative intervention [7, 16, 8, 19]. Counterfactual inference answers this question by predicting the outcome of the alternative intervention, conditional on the outcome of the intervention for which the data were observed [7, 21].

Predicting the outcome of an intervention requires us to model the system. In particular, discrete-state continuous-time Markov process models unambiguously describe the changes of system components across all the system states (i.e., not only at equilibrium) in terms of hazard functions [11, 28]. 
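A discrete-state continuous-time process driven by hazard functions can be simulated directly with Gillespie's algorithm. The following minimal sketch is illustrative only: the single on/off species, the mass-action hazards, and all rate values are assumptions for this sketch, not the models used later in the manuscript.

```python
import random

def gillespie_equilibrium(v1, v2, x1, x2, T=100, t_end=100.0, seed=0):
    """Direct stochastic simulation (Gillespie) of one species Y whose
    particles switch on with hazard v1*x1*(T - y) and off with hazard
    v2*x2*y. Returns the active count at t_end (one equilibrium sample)."""
    rng = random.Random(seed)
    t, y = 0.0, 0
    while t < t_end:
        h_on = v1 * x1 * (T - y)   # activation hazard
        h_off = v2 * x2 * y        # deactivation hazard
        h_total = h_on + h_off
        if h_total == 0:
            break
        t += rng.expovariate(h_total)        # waiting time to next reaction
        if rng.random() < h_on / h_total:    # choose which reaction fires
            y += 1
        else:
            y -= 1
    return y

# One equilibrium sample; repeated runs fluctuate around
# T * v1*x1 / (v1*x1 + v2*x2).
y_eq = gillespie_equilibrium(v1=0.01, v2=0.01, x1=34, x2=45)
```

Averaging such samples over many seeds is the direct-simulation estimate of the post-intervention equilibrium discussed in the text.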
A Markov process model predicts the equilibrium upon an intervention by applying the intervention to the initial conditions, performing multiple direct stochastic simulations to reach post-intervention equilibria, and averaging over these equilibria. Markov process modeling is one way of modeling complexity in biological systems, particularly in systems that are intrinsically stochastic [1]. The Markov process models are called stochastic kinetic models in this context.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Unfortunately, Markov process models do not support counterfactual inference. Moreover, it is often impossible to correctly specify a Markov process model of a complex system such as a biological system, where many aspects of the underlying mechanism are unknown. Direct simulations from an incorrectly specified model may incorrectly predict the outcomes of interventions.

An alternative class of models are structural causal models (SCMs). These probabilistic generative causal models are attractive, in that they enable both interventional and counterfactual inference [20]. Recent work used SCMs to model the transition functions in simple Markov decision process models and apply counterfactual policy evaluation to the decisions (i.e. interventions) at each time step [8, 19]. Unfortunately, these approaches require outcome data at each time point. This limits their use in situations where we are only interested in the outcome at equilibrium, and only collect data once the equilibrium is reached.

Defining SCM models at equilibrium directly is non-trivial, because multiple SCMs may be consistent with the equilibrium distribution of the system components upon an intervention, but provide contradictory answers to the same counterfactual query [20, 23]. 
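This non-identifiability can be made concrete with a toy example of our own (not from the manuscript): the two binary SCMs below entail identical observational and interventional distributions, yet answer the same counterfactual query differently.

```python
# Two hypothetical SCMs over binary X -> Y with noise N_Y ~ Bernoulli(0.5).
scm_a = lambda x, n_y: n_y          # Y ignores X entirely
scm_b = lambda x, n_y: x ^ n_y      # Y = X XOR N_Y

def interventional(f, x):
    """P(Y = 1 | do(X = x)), by enumerating the noise values."""
    return sum(f(x, n) for n in (0, 1)) / 2

# Identical interventional predictions under both models...
assert interventional(scm_a, 0) == interventional(scm_b, 0) == 0.5
assert interventional(scm_a, 1) == interventional(scm_b, 1) == 0.5

# ...but contradictory counterfactuals: having observed X=0, Y=1,
# abduction pins down N_Y = 1 in both models, and replaying under
# do(X = 1) gives different answers.
n_y_a = next(n for n in (0, 1) if scm_a(0, n) == 1)
n_y_b = next(n for n in (0, 1) if scm_b(0, n) == 1)
print(scm_a(1, n_y_a), scm_b(1, n_y_b))  # -> 1 0
```

Interventional data alone therefore cannot distinguish the two models, which is exactly the drawback the proposed Markov-process grounding addresses.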
Recent work [6, 18, 5] connected a broader class of dynamic models and SCMs, and established the conditions under which interventions in dynamic simulations correspond to the SCM's predictions of equilibrium upon the interventions. However, researchers lack practical examples that leverage this connection, and combine the benefits of these two approaches for counterfactual inference.

This manuscript builds on these prior results, and contributes a general and practical framework for casting the equilibrium distribution of a Markov process model as an SCM model of equilibrium behavior. The SCMs are defined in terms of the structure and the hazard rate parameters of the Markov process model, and counterfactual inference flows from these settings. The proposed approach alleviates the identifiability drawback of the SCMs, in that their counterfactual inference is consistent with the counterfactual trajectories simulated from the Markov process model. We showcase the benefits of this approach in two studies of cell signal transduction with nonlinear dynamics. The first is a canonical model of the MAPK signaling pathway [17]. The second is a larger model that connects the MAPK pathway to stimulus from growth factors [3]. We illustrate that, when the underlying Markov process model is misspecified, counterfactual inference anchors intervention predictions to past observed data, and makes selection of interventions more robust to model misspecification.

2 Background

Discrete-state continuous-time Markov process models Discrete-state continuous-time Markov process models describe the temporal interactions between the system components in terms of abstract or physical processes, called rate laws, with real-valued parameters called rates [11]. The rate laws determine hazard functions, which provide instantaneous probabilities of state transitions.

A place invariant is a set of system components with an invariant sum. 
A minimal place invariant cannot be further reduced to smaller place invariants [9]. Define random variables X(t) = {Xi(t) : i ∈ 1...J} representing the states of J minimal place invariant components in a Markov process model. We use capital letters to refer to random variables, lower case letters to refer to instances of random variables, normal font for a single variable, and boldface for a tuple of variables. Denote P^M(t) the probability distribution of X(t), and P^M_{Xi}(t) the marginal probability of Xi(t). A Markov process model M is defined by master equations, i.e. a coupled set of ordinary differential equations that describe the rate of change of the probabilities of the states X(t) over time [29]:

dP^M_{Xi}(t)/dt = hi(t, vi, PA_{M,i}(t)), Xi(0) = (x0)i ∀i ∈ J  (1)

The function hi is the hazard function that determines the probability of a state change between Xi(t) and Xi(s), s > t. Here vi is a set of parameters of the rate laws, and x0 is an initial condition. PA_{M,i}(t) ⊆ X(t) \ Xi(t) is the set of parents of variable Xi(t), i.e. variables that regulate Xi(t). Here we consider only Markov process models that converge to unique equilibrium stationary distributions. If equilibrium exists, then lim_{t→∞} dP^M_{Xi}(t)/dt = 0. We denote X*_i the random variable to which Xi(t) converges in distribution, X*_i := lim_{t→∞} Xi(t). We denote P^M_{X*} the equilibrium distribution of X*, and P^M_{X*_i} the marginal probability of X*_i.

Equilibrium distribution of a Markov process model as a generative model In the equilibrium distribution the place invariants in a Markov process model factorize into a set of conditional 
Based on this, the equilibrium distribution can be cast as a causal generative model G that consists of [20, 23]:

1. Random variables X = {Xi; i ∈ 1...J}: the states of the system
2. A directed acyclic graph D with nodes {i ∈ J} that impose an ordering on X.
3. A set of probabilistic generative functions for each variable Xi, p = {pi, i ∈ J}, such that Xi ∼ pi(PA_{D,i}, Ni), ∀i ∈ J, where PA_{D,i} ⊆ X \ Xi are the parents of Xi in D.

G is a generative model that entails an observational distribution P^G. This means that a procedure that first samples from each pi along the ordering in D generates samples from P^G. This is viewed as the generating process for the observed X. A primary contribution of this work is a method for transforming G into a structural causal model.

Structural causal models (SCMs) A structural causal model C of the same system has the same causal directed graph D, ordering the same random variables X. The model consists of [20, 23]:

1. A distribution P^C_N on independent noise random variables N = {Ni; i ∈ J}
2. A set of functions f = {fi, i ∈ J} called structural assignments, such that Xi = fi(PA_{C,i}, Ni), ∀i ∈ J, where PA_{C,i} ⊆ X \ Xi are the parents of Xi in D.

C is a generative model that entails P^G, the same observational distribution as G. For consistency, we refer to this distribution as P^C when discussed in the context of C. This means that a procedure that first samples noise values from P^C_N, and then sets the values of X deterministically with f, generates samples from P^C. This is viewed as the generating process for the observed X.

Interventions in Markov process models and in SCMs An SCM C uses ideal interventions, which replace a random variable with a fixed point value. These are represented with Pearl's "do" notation, denoted do(Xi = x) [10, 22]. 
The intervention that sets Xi to x replaces the structural assignment Xi = fi(PA_{C,i}, Ni) with Xi = x. The intervention distribution P^{C; do(Xi=x)} is entailed by C under the intervention, and is generally different from the equilibrium distribution P^M of X*.

In the context of a Markov process model, a typical intervention definition is that an intervention increases a reaction rate (catalyzation) or decreases a reaction rate (inhibition). We define a type of soft intervention [10] for Markov process models that makes this rate manipulation comparable to the SCM's ideal intervention. We define a fixed post-equilibrium expected value for a variable that we want to achieve, then find a change to the variable's rate parameter values that achieves that outcome. For example, an intervention that sets the equilibrium value of X*_i to x does so by manipulating Xi's rate parameters to achieve this result. Borrowing the "do" notation, denote this as do(X*_i = x). Let the equilibrium distribution under intervention be P^{M; do(X*_i = x)}. We compare intervention queries on P^{M; do(X*_i = x)} to P^{C; do(Xi=x)}. For both Markov process models and SCMs, the intervention queries are answered by sampling from these distributions. See Supplementary materials for contrasts to related intervention modeling approaches.

Counterfactual inference in SCMs Counterfactual inference is the process of observing a random outcome, making inference about the unseen causes of the outcome, and then inferring the outcome that would have been observed under an intervention [23, 26]. For example, an SCM C helps answer the query "Having observed Xi = x, what would have happened under the intervention do(Xi = ¬x)?". SCMs support the following algorithm for counterfactual inference [2]: (1) having observed X = x, infer the noise distribution conditional on the observation P^{C; X=x}_N, (2) replace P^C_N with P^{C; X=x}_N in C, (3) apply the intervention do(X = ¬x), and (4) sample from the resulting mutated model. 
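On a toy SCM with invertible structural assignments (a hypothetical linear example of ours, not a model from this manuscript), the four steps reduce to a few lines, because abduction recovers the noise exactly:

```python
# Toy SCM: X = N_X;  Y = 2*X + N_Y, with independent noise N_X, N_Y.
f_y = lambda x, n_y: 2 * x + n_y

def counterfactual_y(x_obs, y_obs, x_cf):
    # (1) Abduction: invert the structural assignments to recover the noise.
    n_y = y_obs - 2 * x_obs
    # (2) Replace the noise distribution with the abducted value,
    # (3) apply do(X = x_cf), and (4) replay the mutated model.
    return f_y(x_cf, n_y)

# Having observed (X=1, Y=2.5), the noise N_Y = 0.5 is carried into the
# counterfactual world do(X = 3):
print(counterfactual_y(1.0, 2.5, 3.0))  # -> 6.5
```

With non-invertible assignments, step (1) instead yields a posterior over the noise, which is the general case handled by the inference machinery described later.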
The intuition is that in (2) we infer the latent initial conditions (values of N) that could have led to the outcome X = x; this information is encoded in P^{C; X=x}_N, the posterior of N given X = x. We then pass that encoded information to the counterfactual world where X is set to ¬x, and play out scenarios in that world by sampling from P^{C; X=x}_N and deriving downstream variables given those noise values. Thus the algorithm mutates C into an SCM entailing the counterfactual distribution P^{C; X=x, do(X=¬x)}.

3 Methods

3.1 Motivating example

This manuscript contributes a practical framework for casting Markov process models of a system observed at equilibrium as an SCM, for the purposes of conducting counterfactual inference. As a motivating example, we consider a system of three biomolecules (i.e., components) X1, X2 and Y. Each component takes two states: active ("on") and inactive ("off"). Component X1 in the "on" state activates Y; component X2 in the "on" state deactivates Y, as shown in the causal diagram [1] below:

X1^on + Y^off --v1--> X1^on + Y^on and X2^on + Y^on --v2--> X2^on + Y^off  (2)

Let X1(t), X2(t), and Y(t) be the total number of active-state particles of X1, X2, and Y at time t. Assume that each component has T = 100 particles in total, such that T − Y(t) is the number of inactive particles of Y at time t, and that each component is initialized with 100 off-state particles. To ensure that the equilibrium distribution of the Markov process model M has a closed-form solution, we limit this work to M with zero or first-order hazard functions (i.e. 
hazard functions for which outputs are either constant or directly proportional to a product of the inputs) [15, 29]. In this example, the hazard functions assume mass action kinetics [13], a common assumption in biochemical modeling. Let h1(Y(t)) and h2(Y(t)) denote stochastic rate laws for the activation and deactivation of Y, expressing the probabilities that the reactions occur in the instant (t, t + dt]. Then, according to a first-order stochastic kinetic assumption of chemical reactions [28], h1 and h2 are

h1(Y(t)) = v1 X1(t) (T − Y(t)) and h2(Y(t)) = v2 X2(t) Y(t)  (3)

The hazard functions are parameterized by v = {v1, v2} regulating X1 and X2, and by the initial states.

The Kolmogorov forward equations determine the change in P^M_{Y(t)} as the system evolves in time:

dP^M_{Y(t)}/dt = (h1(Y(t) − 1) P^M_{Y(t)−1} − h1(Y(t)) P^M_{Y(t)}) + (h2(Y(t) + 1) P^M_{Y(t)+1} − h2(Y(t)) P^M_{Y(t)})  (4)

We pose a counterfactual query: "Having observed X1 = 34, X2 = 45, Y = 56, what would Y have been if X1 was set to 50?"

3.2 Converting a Markov process model into an SCM

Algorithm 1 summarizes the proposed steps of converting the Markov process model into an SCM. The steps are a series of mathematical derivations (as opposed to pseudocode for a computational implementation). 
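Taking expectations of Eq. (4) gives the mean dynamics dE(Y(t))/dt = v1 X1 (T − E(Y(t))) − v2 X2 E(Y(t)), which can be checked numerically. In this sketch, the rate values are illustrative assumptions:

```python
def mean_trajectory(v1, v2, x1, x2, T=100, t_end=50.0, dt=0.001):
    """Euler-integrate dE/dt = v1*x1*(T - E) - v2*x2*E, the mean
    dynamics implied by the forward equation (4), from E(0) = 0."""
    e = 0.0
    for _ in range(int(t_end / dt)):
        e += dt * (v1 * x1 * (T - e) - v2 * x2 * e)
    return e

# The trajectory converges to the fixed point T * v1*x1 / (v1*x1 + v2*x2).
v1, v2, x1, x2 = 0.01, 0.01, 34, 45
e_eq = mean_trajectory(v1, v2, x1, x2)
print(round(e_eq, 2), round(100 * v1 * x1 / (v1 * x1 + v2 * x2), 2))  # -> 43.04 43.04
```

Because the mean dynamics are linear, the Euler fixed point coincides exactly with the analytical equilibrium used in the derivations that follow.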
Below we illustrate these steps for the component Y in the motivating example. Additional mathematical details are available in Supplementary materials.

Algorithm 1 Convert Markov process into SCM
Inputs: Markov process model M
Output: Structural causal model C
1: procedure GETSCM(M)
2:   ▷ Solve master equation
3:   P^M(t) := ∫_t (dP^M(t)/dt) dt
4:   ▷ Find the equilibrium distribution
5:   P^M := lim_{t→∞} P^M(t)
6:   ▷ Use P^M to define generative model G
7:   G := {x ∼ P^M}
8:   ▷ Convert the generative model to an SCM
9:   ▷ that entails P^M
10:  C := {N ∼ P^C_N; X = f(X, N)} : P^C ≈ P^M
11:  return C

Algorithm 2 Counterfactual inference on SCM
Inputs: Prior distribution on exogenous noise NPrior; Structural causal model C; Observed endogenous variables X = x; Counterfactual interventions X = ¬x; Desired sample size ssize
Output: ssize samples from P^{C; X=x, do(X:=¬x)}
1: procedure CFQUERY(C, NPrior, x, ¬x, ssize)
2:   ▷ Create "observation" and "intervention" models
3:   obsModel ← Condition(C, X = x)
4:   intModel ← Do(C, X = ¬x)
5:   ▷ Infer noise distribution with observation model
6:   NPosterior ← Infer(obsModel, NPrior)
7:   ▷ Simulate from intervention model w/ updated noise
8:   samples = array(ssize)
9:   for i in (0:ssize) do
10:    samples[i] ← intModel(NPosterior)
11:  return samples

Solve the master equation (Algo. 1 line 3). We can arrive at the solution for P^M_{Y(t)} in Eq. (4) indirectly by solving the ordinary differential equation on the expectation of Y(t) over P^M_{Y(t)}:

dE(Y(t))/dt = v1 X1(t) T − (v1 X1(t) + v2 X2(t)) E(Y(t))  (5)

This has an analytical solution, where:

E(Y(t))/T = e^{−t (v1 X1(t) + v2 X2(t))} + v1 X1(t) / (v1 X1(t) + v2 X2(t))  (6)

Finally, Y(t) is a count of binary state variables with the same probability of being activated at a given instant. Then P^M_{Y(t)} must be a Binomial distribution with T trials, and trial probability E(Y(t))/T.

Find the equilibrium distribution (Algo. 1 line 5). Taking the limit in time of Eq. (6):

E(Y)/T = lim_{t→∞} E(Y(t))/T = v1 X1 / (v1 X1 + v2 X2)  (7)

Thus at equilibrium Y follows the Binomial probability distribution with parameter v1 X1 / (v1 X1 + v2 X2).

Use P^M to define generative model G (Algo. 1 line 7). Let θ_{X1} and θ_{X2} be the probability parameters for the equilibrium Binomial distributions P^M_{X1} and P^M_{X2}. Let θ_Y(X1, X2) = E(Y)/T be the probability parameter for the equilibrium Binomial distribution P^M_Y. Define a generative model G:

G := {X1 ∼ Binom(T, θ_{X1}); X2 ∼ Binom(T, θ_{X2}); Y ∼ Binom(T, θ_Y(X1, X2))}  (8)

Convert the generative model to an SCM that entails P^M (Algo. 1 line 10). We rely on a method of monotonic conversion, which restricts the class of possible SCMs to those with a common set of identifiable counterfactual quantities (such as the probability of necessity, i.e. the probability that Y would not have been activated without X1) [20]. For each structural assignment Xi = fi(PA_{C,i}, Ni), ∀i ∈ J, the method enforces the property E[Xi | do(PA_{C,i} = y)] ≥ E[Xi | do(PA_{C,i} = y′)] ⇒ fi(y, ni) ≥ fi(y′, ni) ∀ni.

For this example we selected a monotonic conversion by means of the inverse CDF transform. 
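Concretely, with F^{−1}(u, n, p) the inverse Binomial CDF, the construction can be sketched end-to-end on the counterfactual query of the motivating example. The helpers `binom_cdf` and `binom_ppf` are hand-rolled for self-containment (a library inverse CDF would serve equally), and the rates v1 = v2 = 1 are an illustrative assumption:

```python
import random

def binom_cdf(k, n, p):
    """Binomial CDF by direct pmf summation (fine for small n)."""
    total, pmf = 0.0, (1 - p) ** n
    for i in range(k + 1):
        total += pmf
        pmf *= (n - i) / (i + 1) * p / (1 - p)
    return total

def binom_ppf(u, n, p):
    """Inverse Binomial CDF F^{-1}(u, n, p): smallest k with CDF(k) >= u."""
    total, pmf = 0.0, (1 - p) ** n
    for k in range(n + 1):
        total += pmf
        if total >= u:
            return k
        pmf *= (n - k) / (k + 1) * p / (1 - p)
    return n

T = 100
def theta_y(x1, x2, v1=1.0, v2=1.0):
    """Equilibrium activation probability of Y, Eq. (7); v1 = v2 = 1 assumed."""
    return v1 * x1 / (v1 * x1 + v2 * x2)

# Query: having observed X1=34, X2=45, Y=56, what would Y have been if X1=50?
# Abduction: the uniform noise N_Y consistent with Y=56 lies in [CDF(55), CDF(56)).
lo = binom_cdf(55, T, theta_y(34, 45))
hi = binom_cdf(56, T, theta_y(34, 45))
rng = random.Random(0)
# Action + prediction: replay the abducted noise under do(X1 = 50).
cf = [binom_ppf(rng.uniform(lo, hi), T, theta_y(50, 45)) for _ in range(1000)]
print(sum(cf) / len(cf))
```

Monotonicity of F^{−1} in u is what makes the abducted noise carry the "how unusual was this outcome" information into the counterfactual world.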
Denote F^{−1}(u, n, p) the inverse CDF of the Binomial distribution, where 0 < u < 1, and n (the number of trials) and p (the success probability) are the parameters of the Binomial distribution. Then the SCM C that entails P^M is defined as

N_{X1}, N_{X2}, N_Y ∼iid Uniform(0, 1); C := {X1 = F^{−1}(N_{X1}, T, θ_{X1}); X2 = F^{−1}(N_{X2}, T, θ_{X2}); Y = F^{−1}(N_Y, T, θ_Y(X1, X2))}  (9)

For larger models, such as those in Case studies 1 and 2 below, it may be desirable to work with alternative transforms that are more amenable to gradient-based inference, such as stochastic variational inference.

3.3 Counterfactual inference and evaluation

Algorithm 2 details the counterfactual inference on C. Algorithms 3 and 4 in Supplementary materials detail the evaluation. The evaluation stems from the insight that the noise at equilibrium captures the stochasticity in the Markov process trajectories. Therefore, we repeatedly simulate pairs of trajectories with and without the counterfactual intervention, with the same random seed within a pair, such that each pair has an identical stochastic component. We then compare the differences in the values of these pairs at equilibrium to the differences between the original and the intervened-upon values projected by the SCM. These differences estimate the respective causal effects. The algorithms differ in choosing a deterministic or a stochastic approach for the estimation of causal effects. To ensure scalability to large models and the ability to do inference over a broad set of structural assignments, we implemented the algorithms in PyTorch and the probabilistic programming language Pyro [4]. The code and the runtime data are in Supplementary materials.

4 Case studies

4.1 Case Study 1: The MAPK signaling pathway

The system The mitogen-activated protein kinase (MAPK) pathway is important in many biological processes, such as determination of cell fate. 
It is a cascade of three proteins, a MAPK, a MAPK kinase (MAP2K), and a MAPK kinase kinase (MAP3K), represented with a causal diagram [14, 24]

E1 → MAP3K → MAP2K → MAPK  (10)

Here E1 is an input signal to the pathway. The cascade relays the signal from one protein to the next by changing the count of proteins in an active state.

The biochemical reactions A protein molecule is in an active state if it has one or more attached phosphoryl groups. Each arrow in Eq. (10) combines the reactions of phosphorylation (i.e., activation) and dephosphorylation (i.e., deactivation). For example, E1 → MAP3K combines two reactions

E1 + MAP3K --v^act_{K3}--> E1 + P-MAP3K and P-MAP3K --v^inh_{K3}--> MAP3K  (11)

In the first reaction in Eq. (11), a particle of the input signal E1 binds (i.e., activates) a molecule of MAP3K to produce MAP3K with an attached phosphoryl. The rate parameter associated with this reaction is v^act_{K3}. In the second reaction, phosphorylated MAP3K loses its phosphoryl (i.e., deactivates), with the rate v^inh_{K3}. The remaining arrows in Eq. (10) aggregate similar reactions, with rate pairs v^act_{K2}, v^inh_{K2} and v^act_{K}, v^inh_{K}.

The mechanistic model Let K3(t), K2(t) and K(t) denote the counts of phosphorylated MAP3K, MAP2K, and MAPK at time t. Let T_{K3}, T_{K2}, and T_K represent the total amount of each of the three proteins, and E1 the total amount of input, which we assume are constant in time.

Table 1: The hazard functions in Case study 1 (MAPK), specified according to mass action enzyme kinetics.

                      MAP3K                            MAP2K                              MAPK
activation hazard     v^act_{K3} E1 (T_{K3} − K3(t))   v^act_{K2} K3(t) (T_{K2} − K2(t))  v^act_{K} K2(t) (T_K − K(t))
deactivation hazard   v^inh_{K3} K3(t)                 v^inh_{K2} K2(t)                   v^inh_{K} K(t)
We model the system as a continuous-time discrete-state Markov process M with the hazard rate functions in Table 1.

The data We simulated the counts of protein particles using the Markov process model with rate parameters v^act_{K3}, v^inh_{K3}, v^act_{K2}, v^inh_{K2}, v^act_{K}, v^inh_{K}. We conducted three simulation experiments with three sets of rates, all consistent with a low concentration in a cell-sized volume (see Supplementary materials). The initial conditions assumed 1 particle of E1, 100 particles of the unphosphorylated form of each protein, and 0 particles of the phosphorylated form.

The counterfactual of interest Let K3, K2 and K denote the observed counts of phosphorylated MAP3K, MAP2K, and MAPK at 100 seconds, the time corresponding to an equilibrium for all the rates. Let K3′ be the count of phosphorylated MAP3K generated by a 3 times smaller v^act_{K3}. Thus v′ = [v^act_{K3}/3, v^inh_{K3}, v^act_{K2}, v^inh_{K2}, v^act_{K}, v^inh_{K}]. We pose the counterfactual question: "Having observed the equilibrium particle counts K3, K2 and K, what would have been the count of K if we had K3′?".

The evaluation We derive the SCM C of the Markov process model and evaluate the counterfactual distribution P^{C; K3=x, K2=y, K=z, do(K3=x′)}_K, where x′ is the expected equilibrium value associated with v′. We evaluate this counterfactual statement as described in Algorithms 3 and 4 (with 500 seeds). If the counterfactuals from the converted SCMs are consistent with the Markov process models, their histograms from Algorithms 3 and 4 should overlap.

The evaluation under model misspecification We consider the Markov process model M with the first set of rates (see Supplementary materials). Let [x, y, z] be sampled from M. Next, instead of the correct model we consider a misspecified model M′, where v^act_{K2} is perturbed with noise sampled from Uniform(0.1, 0.5). We denote as C′ the SCM corresponding to M′, and evaluate the counterfactual distribution P^{C′; K3=x, K2=y, K=z, do(K3=x′)}_K. We expect that, since the counterfactual distribution from C′ incorporates the data from the correct model, it should be closer to the true causal effect simulated from M than the direct simulation from the misspecified M′. We repeat this experiment 50 times.

4.2 Case Study 2: The IGF signaling system

The system The growth factor signaling system is involved in growth and development of tissues. When external stimuli activate the epidermal growth factor (EGF) or the insulin-like growth factor (IGF), this triggers a cascade [3] in Fig. (1)(a). The Raf-Mek-Erk pathway is equivalent to Eq. (10), renamed to follow the convention adopted by the biological literature in this context.

The biochemical reactions All the edges in Fig. (1)(a) represent enzyme reactions E + S --v--> E + P, where the change of substrate S to product P is catalyzed by enzyme E. As in Case study 1, the pointed edges combine activation and deactivation. The flat-headed edges only represent deactivation.

The mechanistic model is built similarly to Case study 1.

The data We simulated the counts of protein particles using the Markov process model with rates in Supplemental Tables 2 and 3. The other settings are as in Case study 1. The initial condition assumed 37 particles of EGFR, 5 particles of IGFR, 100 particles of the unphosphorylated form of other proteins, and 0 particles of the phosphorylated form.

The counterfactual of interest Let R′ be the number of phosphorylated particles of Ras at equilibrium, achieved with v′^act_{Ras-SOS} = v^act_{Ras-SOS}/6. We pose the counterfactual: Having observed the number of phosphorylated particles of each protein before the intervention, what would be the number of particles of Erk if the intervention had fixed Ras = R′? 
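The need to propagate an intervention along multiple paths can be illustrated on a toy linear SCM with additive noise (a deliberate simplification of ours, not the kinetic model of this case study): the counterfactual for Erk must carry the intervention on Ras through both the Raf path and the PI3K → AKT path.

```python
# Toy linear SCM (illustrative; coefficients are assumptions) with two paths
# from Ras to Erk: Ras -> Raf -> Erk and Ras -> PI3K -> AKT -> Erk.
coef = dict(raf=0.8, pi3k=0.5, akt=0.9, erk_raf=0.6, erk_akt=0.4)

def forward(ras, noise):
    raf = coef["raf"] * ras + noise["raf"]
    pi3k = coef["pi3k"] * ras + noise["pi3k"]
    akt = coef["akt"] * pi3k + noise["akt"]
    erk = coef["erk_raf"] * raf + coef["erk_akt"] * akt + noise["erk"]
    return dict(ras=ras, raf=raf, pi3k=pi3k, akt=akt, erk=erk)

def counterfactual_erk(observed, ras_cf):
    # Abduction: additive noise is recovered exactly from the observations.
    noise = {
        "raf": observed["raf"] - coef["raf"] * observed["ras"],
        "pi3k": observed["pi3k"] - coef["pi3k"] * observed["ras"],
        "akt": observed["akt"] - coef["akt"] * observed["pi3k"],
        "erk": (observed["erk"] - coef["erk_raf"] * observed["raf"]
                - coef["erk_akt"] * observed["akt"]),
    }
    # Action + prediction: do(Ras = ras_cf), replaying BOTH paths.
    return forward(ras_cf, noise)["erk"]

obs = forward(60.0, dict(raf=1.0, pi3k=-2.0, akt=0.5, erk=2.0))
# Shrinking Ras changes Erk through Raf and through PI3K -> AKT:
print(round(counterfactual_erk(obs, 10.0), 3))  # -> 8.68
```

The per-unit effect of Ras on Erk is the sum of the two path products (0.6·0.8 and 0.4·0.9·0.5), which is exactly what the replay computes.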
Unlike the MAPK pathway, where the intervention on MAP3K affects the counterfactual target MAPK through a direct path, this system has two paths from Ras to Erk. One path goes directly through Raf, and the other through a mediating path PI3K → AKT. This challenges the algorithm to address multiple paths of influence.

The evaluation We consider the rates v′^act_{Ras-SOS} = v^act_{Ras-SOS}/6, the counterfactual distribution P^{C; Xi=xi, do(Ras=R′)}_{Erk}, and the Algorithms 3 and 4 (with 300 seeds).

The evaluation under model misspecification We consider the Markov process model M with the same rates and initial conditions as above. Let xi be sampled from M. We then introduce a misspecified model M′, where v^act_{AKT-PI3K} is perturbed with noise sampled from Uniform(0.01, 0.1). We denote as C′ the SCM corresponding to M′, and evaluate the counterfactual distribution P^{C′; Xi=xi, do(Ras=R′)}_{Erk}. The resulting counterfactual distribution from C′ should be closer to the true causal effect simulated from M than the direct simulation from the misspecified M′. We repeat this experiment 50 times.

Figure 1: Case study 2 (IGF). (a) IGF signaling. The top nodes are receptors for the epidermal growth factor (EGF) and the insulin growth factor (IGF). Downstream of the receptors are several canonical signaling pathways (including Raf-Mek-Erk, a renamed equivalent of Eq. (10)). Each reaction has a single rate parameter. The auto-deactivation reactions are not pictured. (b) Deterministic and stochastic trajectories of the active-state proteins in the system. Horizontal lines are the expected values at equilibrium. 
(c) Histogram of causal effects, defined as differences between the "observed" and the "counterfactual" trajectories of ERK at equilibrium.

5 Results

5.1 Case Study 1: The MAPK signaling pathway

Solve the stochastic process's master equation (Algorithm 1 line 3). As in the motivating example, we indirectly solve dP^M(t)/dt by way of solving the forward equations for the expectation. For K3(t) this is

dE(K3(t))/dt = v^act_{K3} E1 (T_{K3} − E(K3(t))) − v^inh_{K3} E(K3(t))

We derive similar forward equations for K2(t) and K(t). We solve the ODE above:

E(K3(t))/T_{K3} = e^{−t (v^act_{K3} E1 + v^inh_{K3})} + v^act_{K3} E1 / (v^act_{K3} E1 + v^inh_{K3})  (12)

and obtain the equilibrium by taking the limit t → ∞. The first term in Eq. (12) goes to 0:

E(K3) = v^act_{K3} T_{K3} E1 / (v^act_{K3} E1 + v^inh_{K3})  (13)

Find the equilibrium distribution (Algorithm 1 line 5). As in Sec. 3.2, each active-state MAPK protein has a Binomial marginal distribution. Let θ_{K3}(E1) denote the probability that a MAP3K particle is active at equilibrium given E1. After solving the master equation,

θ_{K3}(E1) = E(K3)/T_{K3} = v^act_{K3} E1 / (v^act_{K3} E1 + v^inh_{K3})  (14)

Extending this solution to MAP2K and MAPK leads to probabilities

θ_{K3}(E1) = v^act_{K3} E1 / (v^act_{K3} E1 + v^inh_{K3}); θ_{K2}(K3) = v^act_{K2} K3 / (v^act_{K2} K3 + v^inh_{K2}); θ_K(K2) = v^act_{K} K2 / (v^act_{K} K2 + v^inh_{K})  (15)

and the following equilibrium distributions:

P^M_{K3} ≡ Binomial(T_{K3}, θ_{K3}(E1)); P^M_{K2} ≡ Binomial(T_{K2}, θ_{K2}(K3)); P^M_K ≡ Binomial(T_K, θ_K(K2))  (16)

Use P^M to define generative model G (Algorithm 1 line 7). From here it is straightforward to create a generative model that entails P^M:

G := {K3 ∼ Binom(T_{K3}, θ_{K3}(E1)); K2 ∼ Binom(T_{K2}, θ_{K2}(K3)); K ∼ Binom(T_K, θ_K(K2))}  (17)

Convert the generative model to an SCM that entails P^M (Algorithm 1 line 10). Here the challenge is in expressing the stochasticity in G, while defining K3, K2, K as deterministic functions of the noise variables N_{K3}, N_{K2}, N_K. Instead of using the inverse Binomial CDF, we demonstrate the use of a differentiable monotonic conversion, so that we can validate approximate counterfactual inference with stochastic gradient descent. We achieve this by first applying a Gaussian approximation to the Binomial distribution, and then applying the "reparameterization trick" used in variational autoencoders [25] (combined in the helper function q in Eq. (18)).

N_{K3}, N_{K2}, N_K ∼iid N(0, 1); q(θ, T, N) = N · (T θ (1 − θ))^{1/2} + θ T  (18)

C := {K3 = q(θ_{K3}(E1), T_{K3}, N_{K3}); K2 = q(θ_{K2}(K3), T_{K2}, N_{K2}); K = q(θ_K(K2), T_K, N_K)}  (19)

The Gaussian approximation facilitates the gradient-based inference in line 6 of Algorithm 2. Despite the approximation, the resulting SCM is still defined in terms of θ. In this manner the SCM retains the biological mechanisms and the interpretation of the Markov process model.

Create "observation" and "intervention" models (Algorithm 2 lines 3-4) In a probabilistic programming language, the deterministic functions in Eq. (19) are specified with a Dirac Delta distribution. However, at the time of writing, gradient-based inference in Pyro produced errors when conditioning on a Dirac sample. We relaxed the Dirac Delta to allow a small amount of density.

Figure 2: Case study 1 (MAPK). 
(a) Deterministic and stochastic trajectories of the active-state MAPK proteins. Horizontal lines are the expected values at equilibrium. (b) Histograms of causal effects, defined as differences between the "observed" and the "counterfactual" trajectories of MAP3K at equilibrium.

Infer noise distribution with observation model (Algorithm 2 line 6) We use stochastic variational inference [12] to infer and update NK3, NK2 and NK from the observation model, with independent Normal distributions as the approximating distributions.
Simulate from intervention model with updated noise (Algorithm 2 line 10) After updating the noise distributions, we generate the target distribution of the intervention model.
Deterministic and stochastic counterfactual simulation and evaluation (Algorithms 3 and 4 in Supplementary materials) Fig. (2)(a) illustrates that the simulated trajectories converge in steady state. Since we rely on the Gaussian approximation to the Binomial in constructing C, we would expect worse results if we were to set the rates on or near the boundaries 0 and 100, where the approximation is weak. Fig. (2)(b) shows that for each experiment with different sets of rates, the causal effects from the SCM's counterfactual distribution are centered around the ground truth simulated deterministically using Eq. (12) and similar equations for K2 and K. The SCM's distribution has less variance, likely because ideal interventions in the SCM allow less variation than rate-manipulation-based interventions in the Markov process model.
Evaluation under model misspecification Fig. (3)(a) shows histograms from one of the 50 repetitions of the experiment conducted to evaluate the robustness of the SCM under model misspecification, and illustrates that the causal effects from the misspecified SCM are closer to the true causal effect than the causal effects derived from a direct but misspecified simulation. Over the 50 repetitions, the absolute difference between the median of the true causal effect and the causal effect derived from the misspecified SCM is on average 0.343. The absolute difference between the median of the true causal effect and the misspecified direct simulation is on average 1.03.
5.2 Case Study 2: The IGF signaling system
The derivations for the growth factor signaling system align closely with those of the motivating example and of the MAPK model. For variable Xi with parents PA_M,i, we partition each parent set into activators and inhibitors PA_M,i = {PAact_M,i, PAinh_M,i}. The rate parameters are also partitioned into v = {vact, vinh}. For each Xi the probability of particle activation at equilibrium is:

θXi(PA_M,i) = vact PAact_M,i / (vact PAact_M,i + vinh PAinh_M,i)   (20)

Next, we derive an SCM using the same Normal approximation to the Binomial distribution as in the MAPK pathway. Fig. (1)(b) plots deterministic and stochastic time courses for the active-state counts of the proteins in the pathway. Fig. (1)(c) illustrates that the counterfactual inference was successful despite the increased model complexity and size.
The evaluation under model misspecification Similarly to Case study 1, Fig. (3)(b) illustrates that the causal effects from the misspecified SCM are closer to the true causal effect than the causal effects derived from a direct but misspecified simulation. Over the 50 repetitions, the absolute difference between the median of the true causal effect and the causal effect derived from the misspecified SCM is on average 7.563.
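Both case studies compute counterfactuals with the same abduction-action-prediction loop over the reparameterized binomial SCM of Eqs. (18)-(19). The sketch below illustrates the loop for the MAPK cascade; it is a simplified Python illustration, not the authors' implementation: the function and variable names are hypothetical, and the closed-form noise inversion replaces the stochastic variational inference used in the paper (which instead relaxes the Dirac deltas so that Pyro can condition on them).

```python
import math

def theta(v_act, v_inh, parent):
    # Equilibrium activation probability, as in Eqs. (15) and (20):
    # v_act * parent / (v_act * parent + v_inh)
    return v_act * parent / (v_act * parent + v_inh)

def q(th, T, N):
    # Gaussian reparameterization of Binomial(T, th), Eq. (18)
    return N * math.sqrt(T * th * (1.0 - th)) + th * T

def abduct(observed, th, T):
    # Abduction: invert q to recover the noise value consistent with an observation
    return (observed - th * T) / math.sqrt(T * th * (1.0 - th))

def counterfactual_K(obs, rates, totals, k3_do):
    """Abduction-action-prediction for the MAPK cascade (illustrative names).

    obs: observed equilibrium counts {'K3', 'K2', 'K'};
    rates, totals: hypothetical dictionaries of rate parameters and total counts.
    """
    # Abduction: recover the noise from observed counts, given observed parents.
    th2 = theta(rates['act2'], rates['inh2'], obs['K3'])
    n2 = abduct(obs['K2'], th2, totals['K2'])
    thK = theta(rates['actK'], rates['inhK'], obs['K2'])
    nK = abduct(obs['K'], thK, totals['K'])
    # Action: do(K3 = k3_do) severs the dependence of K3 on E1.
    # Prediction: push the abducted noise through the intervened SCM, Eq. (19).
    k2_cf = q(theta(rates['act2'], rates['inh2'], k3_do), totals['K2'], n2)
    k_cf = q(theta(rates['actK'], rates['inhK'], k2_cf), totals['K'], nK)
    return k_cf
```

Because q is linear and invertible in N, setting do(K3) equal to the observed K3 reproduces the observed downstream counts exactly, the counterfactual consistency property that the variational inference in Algorithm 2 satisfies approximately.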
The absolute difference between the median of the true causal effect and the misspecified direct simulation is on average 92.55.

Figure 3: Histograms of causal effects, i.e. differences between the "observed" and the "counterfactual" trajectories at equilibrium, for one repetition of the evaluation. The causal effect from the misspecified SCM (blue histogram) is closer to the true causal effect (orange histogram) than the causal effect derived from a direct but misspecified simulation (green histogram). (a) MAPK, (b) IGF.
6 Discussion
This work proposed a practical approach for casting a Markov process model of a system at equilibrium as an SCM. Equilibrium counterfactual inferences using this SCM are anchored to the rate laws of the Markov process. We derived the specific steps of conducting counterfactual inference in real-life case studies of biochemical networks. The case studies illustrate that the counterfactual inference is consistent with the differences between the initial and the intervened-upon trajectories of the Markov process, and makes the selection of interventions more robust to model misspecification. This approach opens many opportunities for future methodological research, such as extending this approach to models with cycles, a common feature of complex systems. Overall, this work is a step towards broader adoption of counterfactual inference in systems biology and other applications.

References
[1] U. Alon. An Introduction to Systems Biology. CRC Press, 2006.

[2] A. Balke and J. Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1994.

[3] F. Bianconi, E. Baldelli, V. Ludovini, L. Crinò, A. Flacco, and P. Valigi. Computational model of EGFR and IGF1R pathways in lung cancer: A systems biology approach for translational oncology.
Biotechnology Advances, 30:142, 2012.

[4] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming. arXiv:1810.09538, 2018.

[5] T. Blom, S. Bongers, and J. M. Mooij. Beyond structural causal models: Causal constraints models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2019.

[6] S. Bongers and J. M. Mooij. From random differential equations to structural causal models: The stochastic case. arXiv:1803.08784, 2018.

[7] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14:3207, 2013.

[8] L. Buesing, T. Weber, Y. Zwols, S. Racaniere, A. Guez, J.-B. Lespiau, and N. Heess. Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv:1811.06272, 2018.

[9] L. E. Dubins and D. A. Freedman. Invariant probabilities for certain Markov processes. The Annals of Mathematical Statistics, 37:837, 1966.

[10] F. Eberhardt and R. Scheines. Interventions and causal inference. Philosophy of Science, 74:981, 2007.

[11] R. Hilborn and M. Mangel. The Ecological Detective: Confronting Models with Data (MPB-28). Princeton University Press, 1997.

[12] M. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. arXiv:1206.7051, 2012.

[13] F. Horn and R. Jackson. General mass action kinetics. Archive for Rational Mechanics and Analysis, 47:81, 1972.

[14] C.-Y. F. Huang and J. E. Ferrell. Ultrasensitivity in the mitogen-activated protein kinase cascade. Proceedings of the National Academy of Sciences, 93:10078, 1996.

[15] T. Jahnke and W. Huisinga.
Solving the chemical master equation for monomolecular reaction systems analytically. Journal of Mathematical Biology, 54:1, 2007.

[16] T. Joachims and A. Swaminathan. Counterfactual evaluation and learning for search, recommendation and ad placement. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, page 1199. ACM, 2016.

[17] E. K. Kim and E. Choi. Pathological roles of MAPK signaling pathways in human diseases. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease, 1802:396, 2010.

[18] J. M. Mooij, D. Janzing, and B. Schölkopf. From ordinary differential equations to structural causal models: The deterministic case. arXiv:1304.7920, 2013.

[19] M. Oberst and D. Sontag. Counterfactual off-policy evaluation with Gumbel-Max structural causal models. arXiv:1905.05824, 2019.

[20] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.

[21] J. Pearl. The algorithmization of counterfactuals. Annals of Mathematics and Artificial Intelligence, 61:29, 2011.

[22] J. Pearl. On the interpretation of do(x). Journal of Causal Inference, 2019.

[23] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

[24] L. Qiao, R. B. Nachbar, I. G. Kevrekidis, and S. Y. Shvartsman. Bistability and oscillations in the Huang-Ferrell model of MAPK signaling. PLoS Computational Biology, 3:e184, 2007.

[25] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv:1401.4082, 2014.

[26] N. J. Roese. Counterfactual thinking. Psychological Bulletin, 121:133, 1997.

[27] J. J. Tyson, K. C. Chen, and B. Novak. Sniffers, buzzers, toggles and blinkers: Dynamics of regulatory and signaling pathways in the cell. Current Opinion in Cell Biology, 15:221, 2003.

[28] D. J. Wilkinson. Stochastic Modelling for Systems Biology. Chapman and Hall/CRC, 2006.

[29] D. J. Wilkinson. Stochastic modeling for quantitative description of heterogeneous biological systems. Nature Reviews Genetics, 10:122, 2009.