{"title": "Identification and Estimation of Causal Effects from Dependent Data", "book": "Advances in Neural Information Processing Systems", "page_first": 9424, "page_last": 9435, "abstract": "The assumption that data samples are independent and identically distributed (iid) is standard in many areas of statistics and machine learning. Nevertheless, in some settings, such as social networks, infectious disease modeling, and reasoning with spatial and temporal data, this assumption is false. An extensive literature exists on making causal inferences under the iid assumption [12, 8, 21, 16], but, as pointed out in [14], causal inference in non-iid contexts is challenging due to the combination of unobserved confounding bias and data dependence. In this paper we develop a general theory describing when causal inferences are possible in such scenarios. We use segregated graphs [15], a generalization of latent projection mixed graphs [23], to represent causal models of this type and provide a complete algorithm for non-parametric identification in these models. We then demonstrate how statistical inferences may be performed on causal parameters identified by this algorithm, even in cases where parts of the model exhibit full interference, meaning only a single sample is available for parts of the model [19]. 
We apply these techniques to a synthetic data set which considers the adoption of fake news articles given the social network structure, articles read by each person, and baseline demographics and socioeconomic covariates.", "full_text": "Identification and Estimation Of Causal Effects from Dependent Data

Eli Sherman
Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218
esherman@jhu.edu

Ilya Shpitser
Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218
ilyas@cs.jhu.edu

Abstract

The assumption that data samples are independent and identically distributed (iid) is standard in many areas of statistics and machine learning. Nevertheless, in some settings, such as social networks, infectious disease modeling, and reasoning with spatial and temporal data, this assumption is false. An extensive literature exists on making causal inferences under the iid assumption [17, 11, 26, 21], even when unobserved confounding bias may be present. But, as pointed out in [19], causal inference in non-iid contexts is challenging due to the presence of both unobserved confounding and data dependence. In this paper we develop a general theory describing when causal inferences are possible in such scenarios. We use segregated graphs [20], a generalization of latent projection mixed graphs [28], to represent causal models of this type and provide a complete algorithm for non-parametric identification in these models. We then demonstrate how statistical inference may be performed on causal parameters identified by this algorithm. In particular, we consider cases where only a single sample is available for parts of the model due to full interference, i.e., all units are pathwise dependent and neighbors' treatments affect each others' outcomes [24].
We apply these techniques to a synthetic data set which considers users sharing fake news articles given the structure of their social network, user activity levels, and baseline demographics and socioeconomic covariates.

1 Introduction

The assumption of independent and identically distributed (iid) samples is ubiquitous in data analysis. In many research areas, however, this assumption simply does not hold. For instance, social media data often exhibit dependence due to homophily and contagion [19]. Similarly, in epidemiology, data exhibiting herd immunity is likely dependent across units. Likewise, signal processing and sequence learning often consider data that are spatially [8] or temporally [23] dependent.

In causal inference, dependence in data often manifests as interference, wherein some units' treatments may causally affect other units' outcomes [3, 9]. Herd immunity is a canonical example of interference, since other subjects' vaccination status causally affects the likelihood of a particular subject contracting a disease. Even under the iid assumption, making causal inferences from observed data is difficult due to the presence of unobserved confounding. This difficulty is worsened when interference is present, as described in detail in [19]. In general, these difficulties prevent identification of causal parameters of interest, making estimation of these parameters from data an ill-posed problem. An extensive literature on identification of causal parameters (under the iid assumption) has been developed. The g-formula [17] identifies any interventional distribution in directed acyclic graph-based (DAG) causal models without latent variables. Pearl showed that in certain cases identification is possible even in the presence of unobserved confounding via the front-door criterion [11]. These results were generalized into a complete identification theory in hidden variable causal DAG models via the ID algorithm [26, 21]. An extensive theory of estimation of identified causal parameters has been developed; some approaches are described in [17, 18], although this is far from an exhaustive list. While work on identification and estimation of causal parameters under interference exists [3, 25, 9, 14, 13, 7, 1], no general theory has been developed up to now. In this paper, we aim to provide this theory for a general class of causal models that permit interference.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 A Motivating Example

To motivate subsequent developments, we introduce the following example application. Consider a large group of internet users belonging to a set of online communities, perhaps based on shared hobbies or political views. For each user i, their time spent online Ai is influenced by their observed vector of baseline factors Ci and unobserved factors Ui. In addition, each user maintains a set of friendship ties with other users via an online social network. The user's activity level in the network, Mi, is potentially dependent on the user's friends' activities, meaning that for users j and k, Mj and Mk are potentially dependent. The dependence between M variables is modeled as a stable symmetric relationship that has reached an equilibrium state. Furthermore, the activity level Mi of user i is influenced by observed factors Ci, time spent online Ai, and the time spent online Aj of any unit j who is a friend of i. Finally, we denote user i's sharing behavior by Yi.
This behavior is influenced by the social network activity of the unit, and possibly by the unit's friends' time spent online.

A crucial assumption in our example is that for each user i, sharing behavior Yi is causally influenced by baseline characteristics Ci, social network activity Mi, and unobserved characteristics Ui, but time spent online Ai does not directly influence sharing Yi, except as mediated by the social network activity of the users. While this might seem like a rather strong assumption, it is more reasonable than standard "front-door" assumptions [12] in the literature, since we allow the entire social network structure to mediate the influence of Ai on Yi for every user.

We are interested in predicting how a counterfactual change in a set of users' time spent online influences their sharing behavior. Note that solving this problem from observed data on users as we described is made challenging both by the fact that unobserved variables causally affect both community membership and sharing, creating spurious correlations, and because social network membership introduces dependence among users. In particular, for realistic social networks, every user's activity potentially depends on every other user's activity (even if indirectly). This implies that a part of the data for this problem may effectively consist of a single dependent sample [24].

In the remainder of the paper we formally describe how causal inference may be performed in examples like the one above, where both unobserved confounding and data dependence are present. In section 3 we review relevant terminology and notation, give factorizations defining graphical models, describe causal inference in models without hidden variables, and give identification theory for such models in terms of a modified factorization. We also introduce the dependent data setting we will consider.
In section 4 we describe more general nested factorizations [16] applicable to marginals obtained from hidden variable DAG models, and describe identification theory in causal models with hidden variables in terms of a modified nested factorization. In section 5, we introduce causal chain graph models [6] as a way of modeling causal problems with interference and data dependence, and pose the identification problem for interventional distributions in such models. In section 6 we give a sound and complete identification algorithm for interventional distributions in a large class of causal chain graph models with hidden variables, which includes the above example, but also many others. We describe our experiments, which illustrate how identified functionals given by our algorithm may be estimated in practice, even in full interference settings where all units are mutually dependent, in section 7. Our concluding remarks are found in section 8.

3 Background on Causal Inference and Interference Problems

3.1 Graph Theory

We will consider causal models represented by mixed graphs containing directed (→), bidirected (↔) and undirected (−) edges. Vertices in these graphs and their corresponding random variables will be used interchangeably, denoted by capital letters, e.g. V; values, or realizations, of vertices and variables will be denoted by lowercase letters, e.g. v; bold letters will denote sets of variables or values, e.g. V or v. We will denote the state space of a variable V or a set of variables V by XV and XV, respectively. Unless stated otherwise, all graphs will be assumed to have a vertex set denoted by V.

Figure 1: (a) A causal model representing the effect of community membership on article sharing, mediated by social network structure. (b) A causal model on dyads which is a variation of causal models of interference considered in [9]. (c) A latent projection of the CG in (a) onto observed variables. (d) The graph representing GY∗ for the intervention operation do(a1) applied to (c). (e) The ADMG obtained by fixing M1, M2 in (c).

For a mixed graph G of the above type, we denote the standard genealogic sets for a variable V ∈ V as follows: parents paG(V) ≡ {W ∈ V | W → V}, children chG(V) ≡ {W ∈ V | V → W}, siblings sibG(V) ≡ {W ∈ V | W ↔ V}, neighbors nbG(V) ≡ {W ∈ V | W − V}, ancestors anG(V) ≡ {W ∈ V | W → ⋯ → V}, descendants deG(V) ≡ {W ∈ V | V → ⋯ → W}, and non-descendants ndG(V) ≡ V \ deG(V). We define the anterior of V, or antG(V), to be the set of all vertices with a partially directed path (a path containing only → and − edges such that no − edge can be oriented to induce a directed cycle) into V. These relations generalize disjunctively to sets; for instance, for a set S, paG(S) ≡ ⋃_{S∈S} paG(S). We also define the set pasG(S) as paG(S) \ S. Given a graph G and a subset S of V, define the induced subgraph GS to be the graph with vertex set S and all edges in G between elements of S.

Given a mixed graph G, we define a district D to be a maximal set of vertices where every vertex pair in GD is connected by a bidirected path (a path containing only ↔ edges). Similarly, we define a block B to be a maximal set of vertices where every vertex pair in GB is connected by an undirected path (a path containing only − edges). Any block of size at least 2 is called a non-trivial block.
We define a maximal clique as a maximal set of vertices pairwise connected by undirected edges. The set of districts in G is denoted by D(G), the set of blocks by B(G), the set of non-trivial blocks by Bnt(G), and the set of cliques by C(G). The district of V is denoted by disG(V). By convention, for any V, disG(V) ∩ deG(V) ∩ anG(V) ∩ antG(V) = {V}.

A mixed graph is called segregated (SG) if it contains no partially directed cycles, and no vertex has both neighbors and siblings; Fig. 1 (c) is an example. In an SG G, D(G) and Bnt(G) partition V. An SG without bidirected edges is called a chain graph (CG) [5]. An SG without undirected edges is called an acyclic directed mixed graph (ADMG) [15]. A CG without undirected edges or an ADMG without bidirected edges is a directed acyclic graph (DAG) [10]. A CG without directed edges is called an undirected graph (UG). Given a CG G, the augmented graph Ga is the UG where any adjacent vertices in G, or any elements of paG(B) for any B ∈ B(G), are connected by an undirected edge.

3.2 Graphical Models

A graphical model is a set of distributions with conditional independences represented by structures in a graph. The following (standard) definitions appear in [5]. A DAG model, or Bayesian network, is a set of distributions associated with a DAG G that can be written in terms of a DAG factorization: p(V) = ∏_{V∈V} p(V | paG(V)). A UG model, or Markov random field, is a set of distributions associated with a UG G that can be written in terms of a UG factorization: p(V) = Z⁻¹ ∏_{C∈C(G)} ψC(C), where Z is a normalizing constant.
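The districts and non-trivial blocks above are simply connected components under ↔ and − edges, respectively. A minimal sketch (the helper names and edge lists are ours, not the paper's), using the latent projection of Fig. 1 (a), where projecting out U1, U2 yields bidirected edges A1 ↔ Y1 and A2 ↔ Y2, and M1 − M2 is the only non-trivial block:

```python
# Sketch: districts = bidirected-connected components, blocks =
# undirected-connected components. In a segregated graph, D(G) and the
# non-trivial blocks Bnt(G) partition the vertex set, so districts are
# computed over the vertices outside the non-trivial blocks.
from collections import defaultdict

def components(vertices, pairs):
    """Connected components of the symmetric adjacency given by `pairs`."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    comps, seen = set(), set()
    for v in vertices:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.add(frozenset(comp))
    return comps

V = {'C1', 'C2', 'A1', 'A2', 'M1', 'M2', 'Y1', 'Y2'}
bidirected = {('A1', 'Y1'), ('A2', 'Y2')}   # from projecting out U1, U2
undirected = {('M1', 'M2')}                 # the social-network tie

blocks = components(V, undirected)
nontrivial = {b for b in blocks if len(b) > 1}                    # Bnt(G)
districts = components(V - set().union(*nontrivial), bidirected)  # D(G)
```

Here `districts` recovers {A1, Y1}, {A2, Y2} and the singleton districts {C1}, {C2}, matching the partition of V by D(G) and Bnt(G).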
A CG model is a set of distributions associated with a CG G that can be written in terms of the following two-level factorization: p(V) = ∏_{B∈B(G)} p(B | paG(B)), where for each B ∈ B(G), p(B | paG(B)) = Z(paG(B))⁻¹ ∏_{C∈C((G_{B∪paG(B)})ᵃ); C⊈paG(B)} ψC(C).

3.3 Causal Inference and Causal Models

A causal model of a DAG is also a set of distributions, but on counterfactual random variables. Given Y ∈ V and A ⊆ V \ {Y}, a counterfactual variable, or 'potential outcome', written as Y(a), represents the value of Y in a hypothetical situation where a set of treatments A is set to values a by an intervention operation [12]. Given a set Y, define Y(a) ≡ {Y}(a) ≡ {Y(a) | Y ∈ Y}. The distribution p(Y(a)) is sometimes written as p(Y | do(a)) [12].

Causal models of a DAG G consist of distributions defined on counterfactual random variables of the form V(a), where a are values of paG(V). In this paper we assume Pearl's functional model for a DAG G with vertices V, where V(a) is determined by a structural equation fV(a, εV), which remains invariant under any possible intervention on a, with εV an exogenous disturbance variable which introduces randomness into V even after all elements of paG(V) are fixed. Under Pearl's model, the distribution p({εV | V ∈ V}) is assumed to factorize as ∏_{V∈V} p(εV). This implies that the sets of variables {{V(aV) | aV ∈ XpaG(V)} | V ∈ V} are mutually independent [12]. The atomic counterfactuals in the above set model the relationship between paG(V), representing the direct causes of V, and V itself. From these, all other counterfactuals may be defined using recursive substitution: for any A ⊆ V \ {V}, V(a) ≡ V(a_{paG(V)∩A}, {paG(V) \ A}(a)). For example, in the DAG in Fig. 1 (b), Y1(a1) is defined to be Y1(a1, U1, A2(U2)).
Counterfactual responses to interventions are often compared on the mean difference scale for two values a, a′, representing cases and controls: E[Y(a)] − E[Y(a′)]. This quantity is known as the average causal effect (ACE).

A causal parameter is said to be identified in a causal model if it is a function of the observed data distribution p(V). Otherwise the parameter is said to be non-identified. In any causal model of a DAG G, all interventional distributions p(V \ A | do(a)) are identified by the g-formula [17]:

p(V \ A | do(a)) = ∏_{V∈V\A} p(V | paG(V)) |_{A=a}    (1)

Note that the g-formula may be viewed as a modified (or truncated) DAG factorization, with terms corresponding to elements in A missing.

3.4 Modeling Dependent Data

So far, the causal and statistical models we have introduced assumed data generating processes that produce independent samples. To capture examples of the sort we introduced in section 2, we must generalize these models. Suppose we analyze data with M blocks of N units each. It is not necessary to assume that blocks are equally sized for the kinds of problems we consider, but we make this assumption to simplify our notation. Denote the variable Y for the i'th unit in block j as Y^j_i. For each block j, let Y^j ≡ (Y^j_1, . . . , Y^j_N), and let Y ≡ (Y^1, . . . , Y^M). In some cases we will not be concerned with units' block memberships. In these cases we will accordingly omit the superscript, and the subscript will index the unit with respect to all units in the network.

We are interested in counterfactual responses to interventions on A, treatments on all units in all blocks. For any a ∈ XA, define Y^j_i(a) to be the potential response of unit i in block j to a hypothetical treatment assignment of a to A.
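To make the g-formula (1) concrete, consider a toy fully observed DAG C → A → Y with C → Y; the conditional tables below are invented for illustration. The truncated factorization drops the p(A | C) factor and evaluates the remaining factors at A = a:

```python
# Hedged illustration of the g-formula (1) on a toy DAG C -> A -> Y, C -> Y:
# p(Y | do(a)) = sum_c p(Y | a, c) p(c). All probability tables are made up.
p_C = {0: 0.5, 1: 0.5}
# p(Y = 1 | A = a, C = c)
p_Y1_given_AC = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.6}

def p_y1_do_a(a):
    # Truncated DAG factorization: the p(A | C) term is removed and A is
    # set to the intervened value a in the remaining factors.
    return sum(p_Y1_given_AC[(a, c)] * p_C[c] for c in p_C)

effect = p_y1_do_a(1) - p_y1_do_a(0)  # an ACE on the mean-difference scale
```

Here p_y1_do_a(1) = 0.5 * 0.2 + 0.5 * 0.6 = 0.4, so the ACE is 0.4 − 0.2 = 0.2 in this toy model.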
We define Y^j(a) and Y(a) in the natural way as vectors of responses, given a hypothetical treatment assignment a, either for units in block j or for all units, respectively. Let a(j) be a vector of values of A, where values assigned to units in block j are free variables, and other values are bound variables. Furthermore, for any ãj ∈ XA^j, let a(j)[ãj] be a vector of values which agrees on all bound values with a(j), but which assigns ãj to all units in block j (i.e., which binds the free variables in a(j) to ãj).

A commonly made assumption is interblock non-interference, also known as partial interference [22, 25], where for any block j, treatments assigned to units in blocks other than j do not affect the responses of any unit in block j. Formally, this is stated as (∀j, a(j), a′(j), ãj), Y^j(a(j)[ãj]) = Y^j(a′(j)[ãj]). Counterfactuals under this assumption are written in a way that emphasizes that they only depend on treatments assigned within that block. That is, for any a(j), Y^j(a(j)[ãj]) ≡ Y^j(ãj).

In this paper we largely follow the convention of [9], where variables corresponding to distinct units within a block are shown as distinct vertices in a graph. As an example, Fig. 1 (b) represents a causal model with observed data on multiple realizations of dyads, or blocks of two dependent units [4]. Note that the arrow from A2 to Y1 in this model indicates that the treatment of unit 2 in a block influences the outcome of unit 1, and similarly for the treatment of unit 1 and the outcome of unit 2. In this model, a variation of models considered in [9], the interventional distributions are identified as p(Y2 | do(a1)) = p(Y2 | a1) and p(Y1 | do(a2)) = p(Y1 | a2), even if U1, U2 are unobserved.

4 Causal Inference with Hidden Variables

If a causal model contains hidden variables, only data on the observed marginal distribution is available. In this case, not every interventional distribution is identified, and identification theory becomes more complex. However, just as identified interventional distributions were expressible as a truncated DAG factorization via the g-formula (1) in fully observed causal models, identified interventional distributions are expressible as a truncated nested factorization [16] of a latent projection ADMG [28] that represents a class of hidden variable DAGs that share identification theory. In this section we define latent projection ADMGs, introduce the nested factorization with respect to an ADMG in terms of a fixing operator, and re-express the ID algorithm [27, 21] as a truncated nested factorization.

4.1 Latent Projection ADMGs

Given a DAG G(V ∪ H), where V are observed and H are hidden variables, the latent projection G(V) is the following ADMG with vertex set V. An edge A → B exists in G(V) if there exists a directed path from A to B in G(V ∪ H) with all intermediate vertices in H. Similarly, an edge A ↔ B exists in G(V) if there exists a path from A to B without consecutive edges → ◦ ←, with the first edge on the path of the form A ← and the last edge on the path of the form → B, and all intermediate vertices on the path in H. As an example of this operation, the graph in Fig. 1 (c) is the latent projection of Fig. 1 (a). Note that a variable pair in a latent projection G(V) may be connected by both a directed and a bidirected edge, and that multiple distinct hidden variable DAGs G1(V ∪ H1) and G2(V ∪ H2) may share the same latent projection ADMG.

4.2 The Nested Factorization

The nested factorization of p(V) with respect to an ADMG G(V) is defined on kernel objects derived from p(V) and conditional ADMGs derived from G(V).
The derivations are via a fixing operation, which can be causally interpreted as a single application of the g-formula on a single variable (to either a graph or a kernel) to obtain another graph or another kernel.

4.2.1 Conditional Graphs and Kernels

A kernel qV(V | W) is a mapping from values in W to normalized densities over V [5]. In other words, kernels act like conditional distributions in the sense that ∑_{v∈XV} qV(v | w) = 1 for all w ∈ XW. Conditioning and marginalization in kernels are defined in the usual way. For A ⊆ V, we define q(A | W) ≡ ∑_{V\A} q(V | W) and q(V \ A | A, W) ≡ q(V | W)/q(A | W).

A conditional acyclic directed mixed graph (CADMG) G(V, W) is an ADMG in which the nodes are partitioned into W, representing fixed variables, and V, representing random variables. Variables in W have the property that only outgoing directed edges may be adjacent to them. Genealogic relationships generalize from ADMGs to CADMGs without change. Districts are defined to be subsets of V in a CADMG G, i.e., no element of W is in any element of D(G).

4.2.2 Fixability and Fixing

A variable V ∈ V in a CADMG G is fixable if deG(V) ∩ disG(V) = {V}. In other words, V is fixable if paths V ↔ ⋯ ↔ B and V → ⋯ → B do not both exist in G for any B ∈ V \ {V}. Given a CADMG G(V, W) and V ∈ V fixable in G, the fixing operator φV(G) yields a new CADMG G′(V \ {V}, W ∪ {V}), where all edges with arrowheads into V are removed, and all other edges in G are kept. Similarly, given a CADMG G(V, W), a kernel qV(V | W), and V ∈ V fixable in G, the fixing operator φV(qV; G) yields a new kernel q′_{V\{V}}(V \ {V} | W ∪ {V}) ≡ qV(V | W) / qV(V | ndG(V), W).

Note that fixing is a probabilistic operation in which we divide a kernel by a conditional kernel.
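The graph-level part of fixability and φV can be sketched on edge sets (representation and names are ours). In the front-door-style ADMG A → M → Y with A ↔ Y, the variables M and Y are fixable but A is not:

```python
# Sketch: graph-level fixability and the fixing operator phi_V on an ADMG
# given as a set of directed pairs and a set of bidirected frozenset pairs.
def descendants(directed, v):
    out, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in directed:
            if a == u and b not in out:
                out.add(b)
                stack.append(b)
    return out

def district(bidirected, v):
    out, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for e in bidirected:
            if u in e:
                for w in e:
                    if w not in out:
                        out.add(w)
                        stack.append(w)
    return out

def fixable(directed, bidirected, v):
    # V is fixable iff de(V) and dis(V) intersect only in V itself
    return descendants(directed, v) & district(bidirected, v) == {v}

def fix(directed, bidirected, v):
    # remove every edge with an arrowhead into v; v becomes a context vertex
    return ({(a, b) for a, b in directed if b != v},
            {e for e in bidirected if v not in e})

directed = {('A', 'M'), ('M', 'Y')}
bidirected = {frozenset({'A', 'Y'})}
```

Here A is not fixable because Y lies in both its district (via A ↔ Y) and its descendants (via A → M → Y), exactly the forbidden pattern in the definition above.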
In some cases this operates as a conditioning operation, in other cases as a marginalization operation, and in yet other cases as neither, depending on the structure of the kernel being divided.

For a set S ⊆ V in a CADMG G, if all vertices in S can be ordered into a sequence σS = ⟨S1, S2, . . .⟩ such that S1 is fixable in G, S2 in φS1(G), etc., then S is said to be fixable in G, V \ S is said to be reachable in G, and σS is said to be valid. A reachable set C is said to be intrinsic if GC has a single district. We define φσS(G) and φσS(q; G) via the usual function composition to yield operators that fix all elements in S in the order given by σS.

The distribution p(V) is said to obey the nested factorization for an ADMG G if there exists a set of kernels {qC(C | paG(C)) | C is intrinsic in G} such that for every fixable S and any valid σS, φσS(p(V); G) = ∏_{D∈D(φσS(G))} qD(D | pasG(D)). All valid fixing sequences for S yield the same CADMG G(V \ S, S), and if p(V) obeys the nested factorization for G, all valid fixing sequences for S yield the same kernel. As a result, for any valid sequence σ for S, we will redefine the operator φσ, for both graphs and kernels, to be φS. In addition, it can be shown [16] that the above kernel set is characterized as:

{qC(C | paG(C)) | C is intrinsic in G} = {φV\C(p(V); G) | C is intrinsic in G}.

Thus, we can re-express the above nested factorization as stating that for any fixable set S, we have φS(p(V); G) = ∏_{D∈D(φS(G))} φV\D(p(V); G). Since fixing is defined on CADMGs and kernels, the definition of nested Markov models generalizes in a straightforward way to a kernel q(V | W) being in the nested Markov model for a CADMG G(V, W).
This holds if for every S fixable in G(V, W), φS(q(V | W); G) = ∏_{D∈D(φS(G))} φV\D(q(V | W); G).

An important result in [16] states that if p(V ∪ H) obeys the factorization for a DAG G with vertex set V ∪ H, then p(V) obeys the nested factorization for the latent projection ADMG G(V).

4.3 Identification in Hidden Variable Causal DAGs

For any disjoint subsets Y, A of V in a latent projection G(V) representing a causal DAG G(V ∪ H), define Y∗ ≡ anG(V)V\A(Y). Then p(Y | do(a)) is identified in G if and only if every set D ∈ D(G(V)Y∗) is reachable (in fact, intrinsic). Moreover, if identification holds, we have [16]:

p(Y | do(a)) = ∑_{Y∗\Y} ∏_{D∈D(G(V)Y∗)} φV\D(p(V); G(V)) |_{A=a}    (2)

In other words, p(Y | do(a)) is identified only if it can be expressed as a factorization where every piece corresponds to a kernel associated with a set intrinsic in G(V). Moreover, no piece in this factorization contains elements of A as random variables, just as was the case in (1). In fact, (2) provides a concise formulation of the ID algorithm [27, 21] in terms of the nested Markov model in which the observed distribution in the causal problem lies. For a full proof, see [16].

5 Chain Graphs for Causal Inference with Dependent Data

We generalize causal models to represent settings with data dependence, specifically to cases where variables may exhibit stable but symmetric relationships. These may correspond to friendship ties in a social network, physical proximity, or rules of infectious disease spread. These stand in contrast to causal relationships, which are also stable, but asymmetric. We represent settings with both of these kinds of relationships using causal CG models under the Lauritzen-Wermuth-Frydenberg (LWF) interpretation.
Though there are alternative conceptions of chain graphs [2], we concentrate on LWF CGs here. This is because LWF CGs yield observed data distributions with smooth parameterizations. In addition, LWF CGs yield Markov properties where each unit's friends (and direct causes) screen the unit from other units in the network. This sort of independence is intuitively appealing in many network settings. Extensions of our results to other CG models are likely possible, but we leave them to future work.

LWF CGs were given a causal interpretation in [6]. In a causal CG, the distribution p(B | paG(B)) for each block B is determined via a computer program that implements a Gibbs sampler on the variables B ∈ B, where the conditional distribution p(B | B \ {B}, paG(B)) is determined via a structural equation of the form fB(B \ {B}, paG(B), εB). This interpretation of p(B | paG(B)) allows the implementation of a simple intervention operation do(b). The operation sets B to b by replacing the line of the Gibbs sampler program that assigns B the value returned by fB(B \ {B}, paG(B), εB) (given a new realization of εB) with an assignment of B to the value b. It was shown [6] that in a causal CG model, for any disjoint Y, A, p(Y | do(a)) is identified by the CG version of the g-formula (1): p(Y | do(a)) = ∏_{B∈B(G)} p(B \ A | paG(B), B ∩ A) |_{A=a}.

In our example above, stable symmetric relationships inducing data dependence, represented by undirected edges, coexist with hidden variables. To represent causal inference in this setting, we generalize earlier developments for hidden variable causal DAG models to hidden variable causal CG models. Specifically, we first define a latent projection analogue, called the segregated projection, for a large class of hidden variable CGs using segregated graphs (SGs).
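The Gibbs-sampler reading of a causal CG block and the do(·) operation above can be sketched as follows (the structural equation f_m and all names are invented for illustration); intervening on M1 literally replaces its assignment line in the sampler:

```python
import random

# Sketch of the causal CG interpretation for a block {M1, M2} with block
# parents a1, a2: each structural equation resamples one block variable
# given the other and its parent; do(m1) replaces the M1 update line.
def f_m(other, parent, eps):
    # hypothetical structural equation: noisy agreement with neighbor/parent
    return 1 if (other + parent) / 2 + eps > 0.5 else 0

def gibbs_block(a1, a2, do_m1=None, iters=200, seed=0):
    rng = random.Random(seed)
    m1, m2 = 0, 0
    for _ in range(iters):
        if do_m1 is None:
            m1 = f_m(m2, a1, rng.uniform(-0.25, 0.25))
        else:
            m1 = do_m1  # the intervention do(m1) replaces this assignment
        m2 = f_m(m1, a2, rng.uniform(-0.25, 0.25))
    return m1, m2

m1, m2 = gibbs_block(1, 1, do_m1=1)
```

Run to (approximate) equilibrium, the sampler draws from p(M1, M2 | a1, a2) in the observational regime and from the intervened kernel when do_m1 is supplied.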
We then define a factorization for SGs that generalizes the nested factorization and the CG factorization, and show that if a distribution p(V ∪ H) factorizes given a CG G(V ∪ H) in the class, then p(V) factorizes according to the segregated projection G(V). Finally, we derive identification theory for hidden variable CGs as a generalization of (2) that can be viewed as a truncated SG factorization.

5.1 Segregated Projections of Latent Variable Chain Graphs

Fix a chain graph G and a vertex set H such that for all H ∈ H, H does not lie in B ∪ paG(B) for any B ∈ Bnt(G). We call such a set H block-safe.

Definition 1 Given a CG G(V ∪ H) and a block-safe set H, define the segregated projection graph G(V) with vertex set V as follows. For any collider-free path between two elements V1, V2 of V, where all intermediate vertices are in H, G(V) contains an edge with endpoints matching the path. That is, V1 ← ◦ ⋯ ◦ → V2 leads to the edge V1 ↔ V2 in G(V), and V1 → ◦ ⋯ ◦ → V2 leads to the edge V1 → V2 in G(V); since H is block-safe, any undirected edge V1 − V2 in G is retained in G(V).

As an example, the SG in Fig. 1 (c) is the segregated projection of the hidden variable CG in Fig. 1 (a). While segregated graphs preserve the conditional independence structure on the observed marginal of a CG for any H [20], we chose to further restrict the set H in order to ensure that the directed edges in the segregated projection retain the intuitive causal interpretation of edges in a latent projection [28]. That is, whenever A → B in a segregated projection, A is a causal ancestor of B in the underlying causal CG.
SGs represent latent variable CGs, meaning that they allow causal systems that model feedback that leads to network structures, of the sort considered in [6], while simultaneously allowing certain forms of unobserved confounding in such causal systems.

5.2 Segregated Factorization

The segregated factorization of an SG can be defined as a product of two kernels which themselves factorize, one in terms of a CADMG (a conditional graph with only directed and bidirected edges), and another in terms of a conditional chain graph (CCG) G(V, W), a CG with the property that the only type of edge adjacent to any element W of W is a directed edge out of W. A kernel q(V | W) is said to be Markov relative to the CCG G(V, W) if q(V | W) = Z(W)⁻¹ ∏_{B∈B(G)} q(B | paG(B)), and q(B | paG(B)) = Z(paG(B))⁻¹ ∏_{C∈C((G_{B∪paG(B)})ᵃ); C⊈paG(B)} ψC(C), for each B ∈ B(G).

We now show, given p(V) and an SG G(V), how to construct the appropriate CADMG and CCG, and the two corresponding kernels. Given an SG G, let the district variables D∗ be defined as ⋃_{D∈D(G)} D, and let the block variables B∗ be defined as ⋃_{B∈Bnt(G)} B. Since D(G) and Bnt(G) partition V in an SG, B∗ and D∗ partition V as well. Let the induced CADMG Gd of an SG G be the graph containing the vertex sets D∗ as V and pasG(D∗) as W, which inherits all edges in G between D∗, and all directed edges from pasG(D∗) to D∗ in G. Similarly, let the induced CCG Gb of G be the graph containing the vertex set B∗ as V and pasG(B∗) as W, which inherits all edges in G between B∗, and all directed edges from paG(B∗) to B∗.
We say that p(V) obeys the segregated factorization of an SG G(V) if p(V) = q(D* | pas_G(D*)) q(B* | pa_G(B*)), where q(B* | pa_G(B*)) is Markov relative to the CCG G_b, and q(D* | pas_G(D*)) is in the nested Markov model of the CADMG G_d.

The following theorem gives the relationship between a joint distribution that factorizes given a hidden variable CG G, its marginal distribution, and the corresponding segregated factorization. This theorem is a generalization of the result proven in [16] relating hidden variable DAGs and latent projection ADMGs. The proof is deferred to the appendix.

Theorem 1 If p(V ∪ H) obeys the CG factorization relative to G(V ∪ H), and H is block-safe, then p(V) obeys the segregated factorization relative to the segregated projection G(V).

6 A Complete Identification Algorithm for Latent Variable Chain Graphs

With Theorem 1 in hand, we are ready to characterize general non-parametric identification of interventional distributions in hidden variable causal chain graph models, where the hidden variables form a block-safe set. This result can be viewed on the one hand as a generalization of the CG g-formula derived in [6], and on the other hand as a generalization of the ID algorithm (2).

Theorem 2 Assume G(V ∪ H) is a causal CG, where H is block-safe. Fix disjoint subsets Y, A of V. Let Y* = ant_{G(V)_{V \ A}}(Y). Then p(Y | do(a)) is identified from p(V) if and only if every element in D(G̃_d) is reachable in G_d, where G̃_d is the induced CADMG of G(V)_{Y*}. Moreover, if p(Y | do(a)) is identified, it is equal to

    Σ_{Y* \ Y} [ ∏_{D ∈ D(G̃_d)} φ_{D* \ D}( q(D* | pa_{G(V)}(D*)); G_d ) ] [ ∏_{B ∈ B(G̃_b)} p(B \ A | pa_{G(V)_{Y*}}(B), B ∩ A) ] |_{A = a},

where q(D* | pa_{G(V)}(D*)) = p(V) / ∏_{B ∈ Bnt(G(V))} p(B | pa_{G(V)}(B)), and G̃_b is the induced CCG of G(V)_{Y*}.

To illustrate the application of this theorem, consider the SG G in Fig. 1 (c), where we are interested in p(Y2 | do(a1, a2)). It is easy to see that Y* = {C1, C2, M1, M2, Y2} (see G_{Y*} in Fig. 1 (d)), with B(G_{Y*}) = {{M1, M2}} and D(G_{Y*}) = {{C1}, {C2}, {Y2}}. The chain graph factor of the factorization in Theorem 2 is p(M1, M2 | A1 = a1, A2, C1, C2). Note that this expression further factorizes according to the (second level) undirected factorization of blocks in a CCG. For the three district factors {C1}, {C2}, {Y2} in Fig. 1 (d), we must fix variables in three different sets {C2, A1, A2, Y1, Y2}, {C1, A1, A2, Y1, Y2}, {C1, C2, A1, A2, Y1} in G_d, shown in Fig. 1 (e). We defer the full derivation involving the fixing operator to the supplementary material. The resulting identifying functional for p(Y2 | do(a1, a2)) is:

    Σ_{C1,C2,M1,M2} p(M1, M2 | a1, a2, C1, C2) Σ_{A2} p(Y2 | a1, A2, M2, C2) p(A2 | C2) p(C1) p(C2)    (3)

7 Experiments

We now illustrate how identified functionals given by Theorem 2 may be estimated from data. Specifically, we consider the network average effect (NE), the network analogue of the average causal effect (ACE), as defined in [3]:

    NE = (1/N) Σ_{i=1}^{N} ( E[Y_i(A_i = 1, A_{-i} = 1)] − E[Y_i(A_i = 0, A_{-i} = 0)] ),

in our article sharing example described in section 2, and shown in simplified form (for two units) in Fig. 1 (a).
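Before turning to the network experiments, note that the two-unit functional (3) is a purely observational expression: every factor in it can be computed from the joint over (C1, C2, A1, A2, M1, M2, Y1, Y2) by marginalization, and brute-force summation then yields a properly normalized distribution over Y2. The sketch below checks this on a synthetic joint over binary variables; the synthetic joint and all helper names are our own illustration:

```python
import itertools
import random

VARS = ("C1", "C2", "A1", "A2", "M1", "M2", "Y1", "Y2")

def make_joint(seed=0):
    """A strictly positive synthetic joint over 8 binary variables."""
    rng = random.Random(seed)
    raw = {k: rng.random() for k in itertools.product((0, 1), repeat=len(VARS))}
    z = sum(raw.values())
    return {k: p / z for k, p in raw.items()}

def marg(joint, names):
    """Marginal distribution over `names`, keyed by value tuples in that order."""
    idx = [VARS.index(n) for n in names]
    out = {}
    for k, p in joint.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond(joint, target, given):
    """p(target | given), keyed by target values followed by given values."""
    num = marg(joint, target + given)
    den = marg(joint, given)
    nt = len(target)
    return {k: p / den[k[nt:]] for k, p in num.items()}

def functional_3(joint, a1, a2):
    """Brute-force evaluation of the identifying functional (3) for p(Y2 | do(a1, a2))."""
    pM = cond(joint, ("M1", "M2"), ("A1", "A2", "C1", "C2"))
    pY2 = cond(joint, ("Y2",), ("A1", "A2", "M2", "C2"))
    pA2 = cond(joint, ("A2",), ("C2",))
    pC1, pC2 = marg(joint, ("C1",)), marg(joint, ("C2",))
    out = {}
    for y2 in (0, 1):
        total = 0.0
        for c1, c2, m1, m2 in itertools.product((0, 1), repeat=4):
            inner = sum(pY2[(y2, a1, A2, m2, c2)] * pA2[(A2, c2)] for A2 in (0, 1))
            total += pM[(m1, m2, a1, a2, c1, c2)] * inner * pC1[(c1,)] * pC2[(c2,)]
        out[y2] = total
    return out
```

Summing the output over y2 telescopes each conditional back to 1, so the result is a valid distribution regardless of the input joint; this is a useful sanity check when implementing identifying functionals of this form.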
The experiments and results we present here generalize easily to other network effects, such as direct and spillover effects [3], although we do not consider these here in the interest of space. For purposes of illustration we consider a simple setting where the social network is a 3-regular graph, with networks of size N ∈ {400, 800, 1000, 2000}. Under the hidden variable CG model we described in section 2, the above effect is identified by a functional which generalizes (3) from a network of size 2 to a larger network. Importantly, since we assume the M variables form a single connected network, we are in the full interference setting where only a single sample from p(M1, . . . , MN | A1, . . . , AN, C1, . . . , CN) is available. This means that while the standard maximum likelihood plug-in estimation strategy is possible for the models for Yi and Ai in (3), it does not work for the model for M. Instead, as part of our estimation procedure, we adapt the auto-g-computation approach, based on the pseudo-likelihood and coding estimators proposed in [24], which is appropriate for full interference settings with a Markov property given by a CG. Note that the approach in [24] was applied to a special case of the set of causal models considered here, in particular those with no unmeasured confounding. Here we use the same approach to estimate general functionals in models that may include unobserved confounders between treatments and outcomes. In fact, our example model is analogous to the model in [24] in the same way that the front-door criterion is analogous to the back-door criterion in causal inference under the assumption of iid data [12].
Our detailed estimation strategy, along with a more detailed description of our results, is given in the appendix. We performed 1000 bootstrap samples for each of the 4 networks.
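To give a flavor of the auto-g-computation step used above, the sketch below Gibbs-samples an autologistic mediator field on a small 3-regular network under an intervention setting every unit's treatment, and plugs the draws into a logistic outcome model to approximate the network average effect. The network construction, functional forms, and parameter values are illustrative stand-ins, not the models fitted in our experiments:

```python
import math
import random

def neighbors(i, n):
    """A simple 3-regular network (n even): ring neighbours plus the antipodal unit."""
    return [(i - 1) % n, (i + 1) % n, (i + n // 2) % n]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gibbs_mediators(a, n, sweeps, rng, alpha=-0.5, beta=1.0, gamma=0.3):
    """Gibbs-sample the mediator field M under do(A_i = a for all i),
    using an assumed autologistic conditional for each M_i."""
    m = [rng.random() < 0.5 for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            s = sum(m[j] for j in neighbors(i, n))
            m[i] = rng.random() < sigmoid(alpha + beta * a + gamma * s)
    return m

def network_effect(n=100, sweeps=200, draws=50, seed=0):
    """Monte Carlo approximation of NE under the toy model above."""
    rng = random.Random(seed)
    means = {}
    for a in (1, 0):
        tot = 0.0
        for _ in range(draws):
            m = gibbs_mediators(a, n, sweeps // draws + 5, rng)
            # assumed outcome model: Y_i depends on own treatment and own mediator
            tot += sum(sigmoid(-1.0 + 0.8 * a + 1.2 * mi) for mi in m) / n
        means[a] = tot / draws
    return means[1] - means[0]
```

In the actual procedure the conditional models for the M field would first be fitted by pseudo-likelihood (or the coding estimator, which fits on a maximal independent set of units); the Gibbs step above is then run with the fitted parameters.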
Since calculating the true causal effects is intractable even when the true model parameters are known, we calculate an approximate 'ground truth' for each intervention by sampling from our data generating process under the intervention 5 times and averaging the relevant effect. We calculated the (approximate) bias of each effect by subtracting the estimate from the 'ground truth.' The 'ground truth' network average effects range from −.453 to −.456. As shown in Tables 1 and 2, both estimators recover the ground truth effect with relatively small bias. Estimators for effects which used the pseudo-likelihood estimator for M generally have lower variance than those that used the coding estimator for M, which is expected due to the greater efficiency of the former. This behavior was also observed in [24]. For both estimators, bias decreases with network size. This is also expected intuitively, although detailed asymptotic theory for statistical inference in networks is currently an open problem, due to the dependence between samples.

    95% Confidence Intervals of Bias of Network Average Effects
    N         400             800             1000            2000
    Coding    (-.157, .103)   (-.129, .106)   (-.100, .065)   (-.086, .051)
    Pseudo    (-.133, .080)   (-.099, .089)   (-.116, .074)   (-.070, .041)

Table 1: 95% confidence intervals for the bias of each estimating method for the network average effects. All intervals cover the approximated ground truth, since they include 0.

    Bias of Network Average Effects
    N         400            800            1000           2000
    Coding    -.000 (.060)   -.020 (.051)   -.024 (.052)   -.022 (.034)
    Pseudo     .006 (.052)   -.023 (.042)   -.023 (.042)   -.021 (.026)

Table 2: The biases of each estimating method for the network average effects.
Standard deviation of the bias of each estimate is given in parentheses.

8 Conclusion

In this paper, we generalized existing non-parametric identification theory for hidden variable causal DAG models to hidden variable causal chain graph models, which can represent both causal relationships and stable symmetric relationships that induce data dependence. Specifically, we gave a representation of all identified interventional distributions in such models as a truncated factorization associated with segregated graphs, mixed graphs containing directed, undirected, and bidirected edges which represent marginals of chain graphs.
We also demonstrated how statistical inference may be performed on identifiable causal parameters, by adapting a combination of maximum likelihood plug-in estimation and methods based on the coding and pseudo-likelihood estimators that were adapted to full interference problems in [24]. We illustrated our approach with an example of calculating the effect of community membership on article sharing, where the effect of the former on the latter is mediated by a complex social network of units inducing full dependence.

9 Acknowledgements

The second author would like to thank the American Institute of Mathematics for supporting this research via the SQuaRE program. This project is sponsored in part by the National Institutes of Health grant R01 AI127271-01 A1, the Office of Naval Research grant N00014-18-1-2760, and the Defense Advanced Research Projects Agency (DARPA) under contract HR0011-18-C-0049. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

[1] D. Arbour, D. Garant, and D. Jensen. Inferring network effects from observational data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 715–724.
ACM, 2016.

[2] M. Drton. Discrete chain graph models. Bernoulli, 15(3):736–753, 2009.

[3] M. Hudgens and M. Halloran. Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832–842, 2008.

[4] D. A. Kenny, D. A. Kashy, and W. L. Cook. Dyadic Data Analysis. Guilford Press, New York, 2006.

[5] S. L. Lauritzen. Graphical Models. Clarendon Press, Oxford, U.K., 1996.

[6] S. L. Lauritzen and T. S. Richardson. Chain graph models and their causal interpretations (with discussion). Journal of the Royal Statistical Society: Series B, 64:321–361, 2002.

[7] M. Maier, K. Marazopoulou, and D. Jensen. Reasoning about independence in probabilistic models of relational data. arXiv preprint arXiv:1302.4381, 2013.

[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[9] E. L. Ogburn and T. J. VanderWeele. Causal diagrams for interference. Statistical Science, 29(4):559–578, 2014.

[10] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.

[11] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–709, 1995.

[12] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.

[13] J. M. Peña. Learning acyclic directed mixed graphs from observations and interventions. In Conference on Probabilistic Graphical Models, pages 392–402, 2016.

[14] J. M. Peña. Reasoning with alternative acyclic directed mixed graphs. Behaviormetrika, pages 1–34, 2018.

[15] T. S. Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30(1):145–157, 2003.

[16] T. S. Richardson, R. J. Evans, J. M. Robins, and I. Shpitser.
Nested Markov properties for acyclic directed mixed graphs, 2017. Working paper.

[17] J. M. Robins. A new approach to causal inference in mortality studies with sustained exposure periods – application to control of the healthy worker survivor effect. Mathematical Modeling, 7:1393–1512, 1986.

[18] J. M. Robins. Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology: The Environment and Clinical Trials. Springer-Verlag, New York, 1999.

[19] C. R. Shalizi and A. C. Thomas. Homophily and contagion are generically confounded in observational social network studies. Sociological Methods & Research, 40(2):211–239, 2011.

[20] I. Shpitser. Segregated graphs and marginals of chain graph models. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015.

[21] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). AAAI Press, Palo Alto, 2006.

[22] M. E. Sobel. What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. Journal of the American Statistical Association, 101(476):1398–1407, 2006.

[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[24] E. J. Tchetgen Tchetgen, I. Fulcher, and I. Shpitser. Auto-g-computation of causal effects on a network. https://arxiv.org/abs/1709.01577, 2017. Working paper.

[25] E. J. Tchetgen Tchetgen and T. J. VanderWeele. On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75, 2012.

[26] J. Tian and J. Pearl. On the identification of causal effects.
Technical Report R-290-L, Department of Computer Science, University of California, Los Angeles, 2002.

[27] J. Tian and J. Pearl. On the testable implications of causal models with hidden variables. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI-02), volume 18, pages 519–527. AUAI Press, Corvallis, Oregon, 2002.

[28] T. S. Verma and J. Pearl. Equivalence and synthesis of causal models. Technical Report R-150, Department of Computer Science, University of California, Los Angeles, 1990.