{"title": "Causal Inference with Noisy and Missing Covariates via Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 6921, "page_last": 6932, "abstract": "Valid causal inference in observational studies often requires controlling for confounders. However, in practice measurements of confounders may be noisy, and can lead to biased estimates of causal effects. We show that we can reduce bias induced by measurement noise using a large number of noisy measurements of the underlying confounders. We propose the use of matrix factorization to infer the confounders from noisy covariates. This flexible and principled framework adapts to missing values, accommodates a wide variety of data types, and can enhance a wide variety of causal inference methods. We bound the error for the induced average treatment effect estimator and show it is consistent in a linear regression setting, using Exponential Family Matrix Completion preprocessing. We demonstrate the effectiveness of the proposed procedure in numerical experiments with both synthetic data and real clinical data.", "full_text": "Causal Inference with Noisy and Missing Covariates\n\nvia Matrix Factorization\n\nNathan Kallus\u2217\n\nXiaojie Mao\u2217\nCornell University\n\nMadeleine Udell\u2217\n\n{kallus, xm77, udell}@cornell.edu\n\nAbstract\n\nValid causal inference in observational studies often requires controlling for con-\nfounders. However, in practice measurements of confounders may be noisy, and\ncan lead to biased estimates of causal effects. We show that we can reduce bias\ninduced by measurement noise using a large number of noisy measurements of\nthe underlying confounders. We propose the use of matrix factorization to infer\nthe confounders from noisy covariates. This \ufb02exible and principled framework\nadapts to missing values, accommodates a wide variety of data types, and can\nenhance a wide variety of causal inference methods. 
We bound the error for the induced average treatment effect estimator and show it is consistent in a linear regression setting, using Exponential Family Matrix Completion preprocessing. We demonstrate the effectiveness of the proposed procedure in numerical experiments with both synthetic data and real clinical data.

1 Introduction

Estimating the causal effect of an intervention is a fundamental goal across many domains. Examples include evaluating the effectiveness of recommender systems [1], identifying the effect of therapies on patients' health [2], and understanding the impact of compulsory schooling on earnings [3]. However, this task is notoriously difficult in observational studies due to the presence of confounders: variables that affect both the intervention and the outcomes. For example, intelligence level can influence both students' decisions regarding whether to go to college and their earnings later on. Students who choose to go to college may have higher intelligence than those who do not. As a result, the observed increase in earnings associated with attending college is confounded with the effect of intelligence and thus cannot faithfully represent the causal effect of college education.

One standard way to avoid such confounding effects is to control for all confounders [4]. However, this solution poses practical difficulties. On the one hand, an exhaustive list of confounders is not known a priori, so investigators usually adjust for a large number of covariates for fear of missing important confounders. On the other hand, measurement noise may abound in the collected data: some confounder measurements may be contaminated with noise (e.g., data recording errors), while other confounders may not be amenable to direct measurement and instead admit only proxy measurements. For example, we may use an IQ test score as a proxy for intelligence.
It is well known that using proxies in place of the true confounders leads to biased causal effect estimates [5, 6, 7]. However, we show in a linear regression setting that the bias due to measurement noise can be effectively alleviated by using many proxies for the underlying confounders (Section 2.2). For example, in addition to an IQ test score, we may also use coursework grades and other academic achievements to characterize intelligence. Intuitively, using more proxies may allow for a more accurate reconstruction of the confounder and thus may facilitate more accurate causal inference.

*Alphabetical order

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Therefore, collecting a large number of covariates is beneficial for causal inference not only to avoid confounding effects but also to alleviate bias caused by measurement noise.

Although in the big-data era collecting myriad covariates is easier than ever before, it is still challenging to use the collected noisy covariates in causal inference. On the one hand, data is inevitably contaminated with missing values, especially when we collect many covariates. Inaccurate imputation of these missing values may aggravate measurement noise. Moreover, missing-value imputation can at best recover the values of the noisy covariates, whereas it is inferring the latent confounders that matters most for accurate causal inference. On the other hand, the large number of covariates may include heterogeneous data types (e.g., continuous, ordinal, categorical, etc.) that must be handled appropriately to exploit covariate information.

To address the aforementioned problems, we propose to use low rank matrix factorization as a principled approach to preprocess covariate matrices for causal inference.
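As a quick numerical illustration of the claim that many proxies shrink measurement-noise bias (our own sketch, not part of the paper), the following simulates the linear model of Section 2.2 with Gaussian proxies of a single confounder; all constants are illustrative choices:

```python
import numpy as np

# Sketch (ours): OLS of Y on (T, X) where X holds p noisy proxies of the
# confounder U. With few proxies the treatment coefficient is badly biased;
# with many proxies the bias shrinks (cf. Corollary 1.1).
rng = np.random.default_rng(0)
N, tau, alpha, sigma_w = 20_000, 1.0, 2.0, 2.0

def ols_tau_hat(p):
    U = rng.normal(size=N)                          # latent confounder
    T = rng.random(N) < 1 / (1 + np.exp(-2 * U))    # treatment depends on U
    Y = alpha * U + tau * T + rng.normal(size=N)
    X = np.outer(U, np.ones(p)) + sigma_w * rng.normal(size=(N, p))  # proxies
    Z = np.column_stack([np.ones(N), T.astype(float), X])
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coef[1]                                  # coefficient on T

err_few = abs(ols_tau_hat(2) - tau)     # only 2 noisy proxies
err_many = abs(ols_tau_hat(100) - tau)  # 100 noisy proxies
```

With 100 proxies the estimate lands much closer to the true effect than with 2, matching the intuition above.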
This preprocessing step\ninfers the confounders for subsequent causal inference from partially observed noisy covariates.\nInvestigators can thus collect more covariates to control for potential confounders and use more proxy\nvariables to characterize the unmeasured traits of the subjects without being hindered by missing\nvalues. Moreover, matrix factorization preprocessing is a very general framework. It can adapt to a\nwide variety of data types and it can be seamlessly integrated with many causal inference techniques,\ne.g., regression adjustment, propensity score reweighting, matching [4]. Using matrix factorization as\na preprocessing step makes the whole procedure modular and enables investigators to take advantage\nof existing packages for matrix factorization and causal inference.\nWe rigorously investigate the theoretical implication of the matrix factorization preprocessing with\nrespect to causal effect estimation. We establish a convergence rate for the induced average treatment\neffect (ATE) estimator and show its consistency in a linear regression setting with Exponential\nFamily Matrix Completion preprocessing [8].\nIn contrast to traditional applications of matrix\nfactorization methods with matrix reconstruction as the end goal, our theoretical analysis validates\nmatrix factorization as a preprocessing step for causal inference.\nWe evaluate the effectiveness of our proposed procedure on both synthetic datasets and a clinical\ndataset involving the mortality of twins born in the USA introduced by Louizos et al. [9]. We\nempirically demonstrate that matrix factorization can accurately estimate causal effects by inferring\nthe latent confounders from a large number of noisy covariates. Moreover, matrix factorization\npreprocessing enhances the performance of many causal inference methods and is robust to the\npresence of missing values.\nRelated work. 
Our paper builds upon low rank matrix completion methods that have been successfully applied in many domains to recover data matrices from incomplete and noisy observations [10, 11, 12]. These methods are not only computationally efficient but also theoretically sound, with provable guarantees [8, 13, 14, 15, 16, 17]. Moreover, matrix completion methods have been developed to accommodate the heterogeneous data types prevalent in empirical studies by using a rich library of loss functions and penalties [18]. Recently, Athey et al. [19] use matrix completion methods to impute the unobservable counterfactual outcomes and estimate the ATE for panel data. In contrast, our paper focuses on measurement noise in the covariate matrix. Measurement noise has been considered in the literature for a long time [5, 6, 20]. Kuroki and Pearl [21] and Miao et al. [22] show that causal effects are identifiable when the emission probabilities of proxies given confounders are known and satisfy an invertibility condition. In contrast, our method assumes a simpler proxy model (Figure 1) and provides a practical approach to carry out the estimation based on matrix factorization. Louizos et al. [9] recently propose to use a Variational Autoencoder as a heuristic way to recover the latent confounders from multiple proxies. In contrast, matrix factorization methods, despite stronger parametric assumptions, address the problem of missing values simultaneously, require considerably less parameter tuning, and have theoretical justifications.

Notation. For two scalars a, b ∈ R, denote a ∨ b = max{a, b} and a ∧ b = min{a, b}. For a positive integer N, we let [N] = {1, 2, . . . , N}. For a set Ω, |Ω| is the total number of elements in Ω. For a matrix X ∈ R^{N×p}, denote its singular values as σ_1 ≥ σ_2 ≥ ··· ≥ σ_{N∧p} ≥ 0, and its smallest singular value as σ_min.
The spectral norm, nuclear norm, Frobenius norm and max norm of X are defined as ‖X‖ = σ_1, ‖X‖_* = Σ_{i=1}^{N∧p} σ_i, ‖X‖_F = √(σ_1² + ··· + σ_{N∧p}²) and ‖X‖_max = max_{ij} |X_ij|, respectively. The projection matrix for X is defined as P_X = X(X⊤X)⁻¹X⊤. We use col(X) to denote the column space of X and σ(z) to denote the sigmoid function 1/(1 + exp(−z)).

Figure 1: Causal graph for the ith individual, i ∈ [N]. The confounders U_i are unobserved (dashed); the proxy variables X_i, treatment T_i, and outcome Y_i are all observed (solid).

2 Causal inference with low rank matrix factorization

In this section, we first introduce the problem of causal inference under measurement noise and missing values formally and define notation. We then show that the bias caused by measurement noise in linear regression is alleviated when more covariates are used. Finally we review low rank matrix factorization methods and describe the proposed procedure for causal inference.

2.1 Problem formulation

We consider an observational study with N subjects. For subject i, T_i is the treatment variable and we assume T_i ∈ {0, 1} for simplicity. We use Y_i(0), Y_i(1) to denote the potential outcomes for subject i under control and treatment respectively [4]. We can only observe the potential outcome corresponding to the received treatment level, i.e., Y_i = Y_i(T_i). Assume that {Y_i(0), Y_i(1), T_i}_{i=1}^N are independently and identically distributed (i.i.d.). We denote T = [T_1, ..., T_N]⊤ and Y = [Y_1, ..., Y_N]⊤. For ease of exposition, we focus on estimating the average treatment effect (ATE):

τ = E(Y_i(1) − Y_i(0)).

One standard way to estimate ATE is to adjust for the confounders.
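As a concrete sketch (ours, not from the paper) of adjustment when the confounders are observed: under unconfoundedness, OLS of Y on an intercept, T and U recovers the ATE in the linear outcome model used later in Section 2.2. All constants below are illustrative:

```python
import numpy as np

# Regression adjustment given observed confounders U (sketch, ours).
rng = np.random.default_rng(1)
N, r, tau = 50_000, 3, 2.0
alpha = np.array([1.0, -1.0, 0.5])

U = rng.normal(size=(N, r))                                   # confounders
T = (rng.random(N) < 1 / (1 + np.exp(-U.sum(axis=1)))).astype(float)
Y = U @ alpha + tau * T + rng.normal(size=N)                  # linear outcome

Z = np.column_stack([np.ones(N), T, U])                       # intercept, T, U
tau_hat = np.linalg.lstsq(Z, Y, rcond=None)[0][1]             # coefficient on T
```

With the confounders controlled for, `tau_hat` is close to the true ATE of 2.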
Suppose we have access to the confounders U_i ∈ R^r for subject i, ∀i ∈ [N]. Then we can employ many standard causal inference techniques (e.g., regression adjustment, propensity score reweighting, matching, etc.) to estimate ATE under the following unconfoundedness assumption:

Assumption 1 (Unconfoundedness given unobservables). For each t = 0, 1 and i = 1, ..., N, Y_i(t) is independent of T_i conditionally on U_i: P(T_i = 1 | Y_i(t), U_i) = P(T_i = 1 | U_i).

However, in practice we may not observe {U_i}_{i=1}^N directly. Instead suppose we can only partially observe covariates X_i ∈ R^p, which are a collection of noisy measurements of the confounders. The causal graph is given in Figure 1. The covariates X_i can represent various data types via canonical encoding schemes. For example, Boolean data is encoded using 1 for true and −1 for false. Many other encoding examples, e.g., for categorical or ordinal data, can be found in Udell et al. [18]. We concatenate these covariates into X ∈ R^{N×p}. We assume that only entries of X over a subset of indices Ω ⊂ [N] × [p] are observed.

We further specify the generative model for individual entries X_ij, (i, j) ∈ [N] × [p]. We assume that X_ij is drawn independently from a distribution P(X_ij | U_i⊤V_j), where V_j ∈ R^r represents the loadings of the jth covariate on the confounders. The distribution P(X_ij | U_i⊤V_j) models the measurement noise mechanism for X_ij. For example, if X_i1 is a measurement of U_i1 contaminated with standard Gaussian noise, then P(X_i1 | U_i⊤V_1) ∼ N(U_i⊤V_1, 1) with V_1 = [1, 0, ..., 0]⊤. This generative model also accommodates proxy variables. Consider a simplified version of Spearman's measurable intelligence theory [23] where multiple kinds of test scores are used to characterize two kinds of (unobservable) intelligence: quantitative and verbal. Suppose that there are p tests (e.g., Classics, Math, Music, etc.) which are recorded in X_i1, ..., X_ip and that the two kinds of intelligence are represented by U_i1 and U_i2. We assume that these proxy variables are noisy realizations of linear combinations of the two kinds of intelligence. This can be modelled using the generative model X_ij ∼ P(X_ij | U_i⊤V_j) with V_j = [V_j1, V_j2, 0, ..., 0]⊤ for j ∈ [p]. While this linear assumption seems restrictive, it is approximately true for a large class of nonlinear latent variable models when many proxies are used for a small number of latent variables [24].

We aim to estimate ATE based on P_Ω(X), Y and T. This is, however, very challenging in the presence of measurement noise and missing values. On the one hand, most causal inference techniques cannot adapt to missing values directly, and appropriate preprocessing is needed. On the other hand, it is well known that measurement noise can dramatically undermine the unconfoundedness assumption and lead to biased causal effect estimation [5, 6], i.e., P(Y_i(t) | T_i, X_i) ≠ P(Y_i(t) | X_i) for t = 0, 1.

2.2 Measurement noise and bias

In this subsection, we show that using a large number of noisy covariates can effectively alleviate the ATE estimation bias resulting from measurement noise in the linear regression setting. Suppose there are no missing values. We consider the linear regression model: ∀i ∈ [N], Y_i = U_i⊤α + τT_i + ε_i, where α ∈ R^r is the coefficient for the confounders U_i, τ is the ATE, and ε_i ∼ N(0, σ²) i.i.d. For each i ∈ [N], T_i is independently and probabilistically assigned according to the confounders U_i. Unconfoundedness (Assumption 1) implies that T_i is independent of ε_i conditionally on U_i.

Proposition 1.
Consider the additive noise model X = UV⊤ + W, where {U_i}_{i=1}^N are i.i.d. samples, W ∈ R^{N×p} contains independent noise entries with mean 0 and variance σ_w², and the entries of W are independent of {U_i}_{i=1}^N. Suppose that r, p are fixed and p < N. As N → ∞, the asymptotic bias of the least squares estimator in the linear regression of Y_i on X_i and T_i has the following form:

E(T_iU_i⊤)[(1/σ_w²)V⊤V + E(U_iU_i⊤)⁻¹]⁻¹α / ( E(T_i²) − E(T_iU_i⊤)[((1/σ_w²)V⊤V)⁻¹ + E(U_iU_i⊤)]⁻¹E(U_iT_i) )    (1)

Corollary 1.1. The asymptotic bias (1) diminishes to 0 when σ_min(V) → ∞.

Corollary 1.1 suggests an important fact: collecting a large number of noisy covariates is an effective remedy for the bias induced by measurement noise, as long as the noisy covariates are sufficiently informative about the confounders. The condition σ_min(V) → ∞ requires that all confounders have asymptotically infinitely many proxies as p → ∞.² Surprisingly, in this independent additive noise case, the asymptotic bias (1) is even nearly optimal: it is identical to the optimal asymptotic bias we would have if we knew the unobservable V (Proposition 2, Appendix A). In the rest of the paper, we further exploit this fact by using matrix factorization preprocessing, which adapts to missing values, heterogeneous data types and more general noise models.

2.3 Low rank matrix factorization preprocessing

In this paper, we propose to recover the latent confounders {U_i}_{i=1}^N from noisy and incomplete observations of X by using low rank matrix factorization methods, which rely on the assumption:

Assumption 2 (Low Rank Matrix).
The full matrix X is a noisy realization of a low rank matrix Φ ∈ R^{N×p} with rank r ≪ min{N, p}.

In the context of causal inference, Assumption 2 corresponds to the surrogate-rich setting where many proxies are used for a small number of latent confounders. For example, we have access to IQ test scores, coursework grades, academic achievements and other proxies for the unobserved confounder intelligence. Under the generative model in Section 2.1, Assumption 2 implies that Φ = UV⊤, where U = [U_1, ..., U_N]⊤ is the confounder matrix and V = [V_1, ..., V_p]⊤ is the covariate loading matrix. Although this assumption is unverifiable, low rank structure has been shown to pervade many domains such as images [11], customer preferences [10], healthcare [12], etc. The recent work by Udell and Townsend [24] provides theoretical justification that low rank structure arises naturally from a large class of latent variable models.

Moreover, low rank matrix factorization methods usually assume the Missing Completely at Random (MCAR) setting where the observed entries are sampled uniformly at random [8, 25].

Assumption 3 (MCAR). ∀(i, j) ∈ Ω, i ∼ uniform([N]) and j ∼ uniform([p]) independently, and the sampling is independent of the measurement noise.

Our paper takes Exponential Family Matrix Completion (EFMC) as a concrete example, which further assumes an exponential family mechanism for the measurement noise [8].

²For example, suppose the number of confounders is r = 2 and V = [[1, 0, ..., 0], [0, 1, ..., 1]]⊤. Then only the first covariate is a noisy proxy for the first confounder and σ_min(V) = 1 < ∞ for any p.

Assumption 4 (Natural Exponential Family).
Suppose that each entry X_ij is drawn independently from the corresponding natural exponential family with Φ_ij as the natural parameter:

P(X_ij | Φ_ij) = h(X_ij) exp(X_ijΦ_ij − G(Φ_ij))

where h : R → R is any function and G : R → R (called the log-partition function) is a strictly convex analytic function with ∇²G(u) ≥ e^{−η|u|} for some η > 0.

The exponential family encompasses a wide variety of distributions, like the Gaussian, Poisson and Bernoulli, which are extensively used for modelling different data types [26]. For example, if X_ij takes binary values ±1, then we can model it using a Bernoulli distribution: P(X_ij | Φ_ij) = σ(X_ijΦ_ij).

EFMC estimates Φ by the following regularized M-estimator:

Φ̂ = argmin_{‖Φ‖_max ≤ α*/√(Np)} (Np/|Ω|) Σ_{(i,j)∈Ω} [−log P(X_ij | Φ_ij)] + λ‖Φ‖_*    (2)

The estimator in (2) involves solving a convex optimization problem, whose solution can be found efficiently by many off-the-shelf algorithms [27]. The nuclear norm regularization encourages a low-rank solution: the larger the tuning parameter λ, the smaller the rank of the solution Φ̂. In practice, λ is usually selected by cross-validation. Moreover, the constraint ‖Φ‖_max ≤ α*/√(Np) appears merely as an artifact of the proof, and it is recommended to drop this constraint in practice [28]. It can be proved that under Assumptions 2–4 and some regularity assumptions, the relative reconstruction error of Φ̂ converges to 0 with high probability (Lemma 4, Appendix A).
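For the quadratic-loss (Gaussian) instance, the nuclear-norm problem can be solved by soft-thresholded SVD iterations, the idea behind the softImpute package cited later. The following is our own minimal sketch (not the paper's implementation; constants and iteration count are illustrative):

```python
import numpy as np

# Soft-impute sketch (ours): alternate between filling missing entries with
# the current estimate and shrinking singular values by lam, which is the
# proximal step for lam * nuclear norm under quadratic loss.
def soft_impute(X, mask, lam, iters=200):
    Phi = np.zeros_like(X)
    for _ in range(iters):
        filled = np.where(mask, X, Phi)              # observed data + current fit
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)                 # singular value soft-threshold
        Phi = (u * s) @ vt
    return Phi

rng = np.random.default_rng(2)
N, p, r = 200, 60, 3
Phi_true = rng.normal(size=(N, r)) @ rng.normal(size=(r, p))  # rank-r truth
X = Phi_true + 0.5 * rng.normal(size=(N, p))                  # additive noise
mask = rng.random((N, p)) < 0.7                               # 70% observed
Phi_hat = soft_impute(X, mask, lam=10.0)
rel_err = np.linalg.norm(Phi_hat - Phi_true) / np.linalg.norm(Phi_true)
```

The relative reconstruction error is small despite 30% missing entries, illustrating the low-rank denoising that the preprocessing step relies on.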
Furthermore, EFMC can be\nextended by using a rich library of loss functions and regularization functions [18, 29].\nWe use the left singular matrix of \u02c6\u03a6 corresponding to nonzero singular values to estimate the column\nspace of the confounder matrix U. Although \u02c6U has orthonormal columns, the original confounders\nare allowed to be correlated (Assumption 5). The estimated confounder space matrix \u02c6U is then used\nin place of the covariate matrix for subsequent causal inference methods (e.g., regression adjustment,\npropensity reweighting, matching, etc.). Admittedly, only the column space of the confounder matrix\nU can be identi\ufb01ed, and any nonsingular linear transformation of \u02c6U is a valid estimator. However,\nthis suf\ufb01ces for many causal inference techniques. For example, regression adjustment methods\nbased on linear regression [7], polynomial regression, neural networks trained by backpropagation\n[30], propensity reweighting or propensity matching using propensity score estimated by logistic\nregressions, and Mahalanobis matching are invariant to nonsingular linear transformations. Moreover,\nthe invariance to linear transformation and scale-free property is important since the latent confounders\nmay be abstract without commonly acknowledged scale or units (e.g., intelligence).\n\n3 Theoretical guarantee\n\nIn this section, we derive an error bound for the ATE estimator induced by EFMC preprocessing (2)\nin linear regression setting. Proofs are deferred to Appendix A.\nConsider the linear regression model in Section 2.2. Suppose that we use EFMC preprocessing and\nlinear regression for causal inference, which leads to the ATE estimator \u02c6\u03c4. It is well known that the\naccuracy of \u02c6\u03c4 relies on how well the estimated column space col( \u02c6U ) approximates the column space\nof true confounder matrix col(U ). 
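The invariance claim from Section 2.3 — that downstream estimators depend on Û only through col(Û) — is easy to check numerically for OLS adjustment. A quick sketch (ours; the data-generating constants are illustrative):

```python
import numpy as np

# Sketch (ours): the OLS coefficient on T is unchanged when the adjustment
# variables are replaced by any nonsingular linear transformation of them.
rng = np.random.default_rng(3)
N, r = 5_000, 3
Uhat = rng.normal(size=(N, r))                   # estimated confounder basis
T = (rng.random(N) < 1 / (1 + np.exp(-Uhat[:, 0]))).astype(float)
Y = Uhat @ np.array([1.0, -2.0, 0.5]) + 2.0 * T + rng.normal(size=N)

def tau_ols(controls):
    Z = np.column_stack([np.ones(N), T, controls])
    return np.linalg.lstsq(Z, Y, rcond=None)[0][1]

A = rng.normal(size=(r, r))                      # almost surely nonsingular
tau1, tau2 = tau_ols(Uhat), tau_ols(Uhat @ A)    # same column space
```

`tau1` and `tau2` agree to numerical precision, so only the column space of the estimated confounders matters for this class of methods.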
Ideally, if col(Û) aligns with col(U) perfectly, then τ̂ is identical to the least squares estimator based on the true confounders and is thus consistent. We introduce the following distance metric between two column spaces [31]:

Definition 1. Consider two matrices M̂ ∈ R^{N×k} and M ∈ R^{N×r} with orthonormal columns. The principal angle between their column spaces is defined as

∠(M, M̂) = √(1 − σ²_{r∧k}(M̂⊤M))

This metric measures the magnitude of the "angle" between two column spaces. For example, ∠(M, M̂) = 0 if col(M) = col(M̂), while ∠(M, M̂) = 1 if they are orthogonal.

Theorem 1. Assume the following assumptions hold:

(a) ‖α‖_max ≤ A for a positive constant A;

(b) (1/√(Nr))‖U‖ is bounded above for any N;

(c) U_i is almost surely not linearly dependent with T_i;

(d) r∠(Û, U) → 0 as N → ∞;

(e) unconfoundedness (Assumption 1).

Then there exists a constant c > 0 such that with probability at least 1 − 2 exp(−cN^{1/2}),

|τ̂ − τ*| ≤ ((2A/√N)‖T‖)((1/√(Nr))‖U‖)(r∠(U, Û)) / ( (1/N)T⊤(I − P_U)T − (2/N)‖T‖²∠(U, Û) ) → 0 as N → ∞    (3)

In the above theorem, assumption (c) rules out multicollinearity between the treatment and the confounders, which is necessary for identifying ATE. This assumption guarantees that (1/N)T⊤(I − P_U)T in (3) is bounded away from 0 for any N (Lemma 5, Appendix A). Assumption (d) states that col(Û) should converge to col(U) at a rate faster than 1/r to guarantee consistency of the resulting ATE estimator.
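The principal angle of Definition 1 is straightforward to compute; a small sketch (ours) checking its two boundary cases:

```python
import numpy as np

# Principal angle between column spaces of orthonormal-column matrices
# M (N x r) and Mhat (N x k): sqrt(1 - sigma_{min(r,k)}^2(Mhat' M)).
def principal_angle(M, Mhat):
    s = np.linalg.svd(Mhat.T @ M, compute_uv=False)
    k = min(M.shape[1], Mhat.shape[1])
    return np.sqrt(max(0.0, 1.0 - s[k - 1] ** 2))  # clamp tiny negatives

# Identical subspaces have angle 0; orthogonal ones have angle 1.
Q = np.linalg.qr(np.random.default_rng(4).normal(size=(10, 4)))[0]
M, M_orth = Q[:, :2], Q[:, 2:]
same = principal_angle(M, M)
orth = principal_angle(M, M_orth)
```
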
Theorem 1 shows that bounding the ATE estimation error requires bounding ∠(U, Û), i.e., the error of estimating col(U) in matrix factorization. In the following theorem, we derive an upper bound on the column space estimation error for EFMC (2).

Theorem 2. Assume that the following assumptions hold:

(a) Assumptions 1–4 (Unconfoundedness, Low Rank Matrix, Missing Completely at Random, Natural Exponential Family);

(b) X_ij is sub-exponential conditionally on U_i for any (i, j);

(c) for Φ = UV⊤, σ_r(Φ)/σ_1(Φ) is bounded away from 0;

(d) Û is estimated by EFMC (2) with λ = 2c_0σ′√(Np)√(rN̄ log N̄ / |Ω|), where N̄ = N ∨ p;

(e) |Ω| > c_1 rN̄ log N̄ for positive constants c_0 and c_1.

Then the following holds with probability at least 1 − 4e^{−2 log² N̄} − e^{−2 log N̄}:

∠(Û, U) ≤ ( c_2 α_sp(Φ)√(r³N̄ log N̄ / |Ω|) ) / ( σ_r(Φ)/σ_1(Φ) − c_2 α_sp(Φ)√(r³N̄ log N̄ / |Ω|) ) ∧ 1    (4)

where c_2 > 0 is a constant and α_sp(Φ) = √(Np)‖Φ‖_max/‖Φ‖_F is the spikiness ratio of Φ = UV⊤.

Theorem 2 shows that the column space estimation error of EFMC depends on two critical quantities: α_sp(Φ) and σ_r(Φ)/σ_1(Φ). The spikiness ratio α_sp(Φ) is a standard measure quantifying the ill-posedness of matrix factorization problems [8, 32]. Small α_sp(Φ) is necessary for small matrix estimation error in matrix factorization, i.e., small ‖Φ̂ − Φ‖ (Lemma 6, Appendix A).
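The spikiness ratio has a simple closed form; a small sketch (ours) computing it for the two extreme cases:

```python
import numpy as np

# Spikiness ratio alpha_sp(Phi) = sqrt(N*p) * max|Phi_ij| / ||Phi||_F.
# A flat matrix achieves the minimum value 1; a matrix with all its mass
# in one entry achieves the maximum sqrt(N*p).
def spikiness(Phi):
    N, p = Phi.shape
    return np.sqrt(N * p) * np.abs(Phi).max() / np.linalg.norm(Phi)

flat = np.ones((100, 50))            # perfectly spread-out mass
spiky = np.zeros((100, 50))
spiky[0, 0] = 1.0                    # all mass in a single entry
```

Here `spikiness(flat)` equals 1 while `spikiness(spiky)` equals √(100·50), the two extremes of well- and ill-posed matrix factorization problems.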
Moreover, a nonvanishing σ_r(Φ)/σ_1(Φ) means that Φ does not lose information in any direction of col(U), and thus guarantees that small matrix estimation error ‖Φ̂ − Φ‖ translates into small column space estimation error ∠(Û, U) (Lemma 7, Appendix A). Next we introduce some generative assumptions on the confounder matrix U and the covariate loading matrix V. Under these assumptions, EFMC accurately estimates the column space of the confounder matrix U such that r∠(U, Û) → 0, and thus results in an accurate ATE estimator.

Assumption 5 (Latent Confounders and Covariate Loadings). U and V satisfy the following for some positive constants v, v̄, c_V and c_L:

(a) For i ∈ [N], U_i are i.i.d. Gaussian samples with covariance matrix Σ_{r×r} = LL⊤ for some full rank matrix L ∈ R^{r×r} such that (1/√r)‖L‖ < c_L;

(b) vp ≤ σ_r²(VL⊤) ≤ σ_1²(VL⊤) ≤ v̄p and max_j ‖V_j‖/‖V‖_F ≤ c_V/√p, j = 1, ..., p.

Assumption 5(a) specifies a Gaussian distribution for the latent confounders, which implies assumption (b) in Theorem 1 with high probability (Lemma 10, Appendix A). It also assumes without loss of generality that the latent confounders are not perfectly linearly correlated. Moreover, Assumption 5(b) excludes the case where almost all covariates have vanishing loadings on the latent confounders.³ In that case, the collected covariates are not informative enough for recovering the latent confounders.

Theorem 3. Suppose that r/N → 0 and ∃δ > 0 such that p^{1+δ}/N → 0.
Under Assumption 5, there exist positive constants c_3–c_5 such that

• α_sp(Φ) ≤ c_3 c_V √(r ∨ log N̄) with probability at least 1 − N̄^{−1/2} − 2 exp(−c_4 N^{1/2});

• σ_r(Φ)/σ_1(Φ) ≥ √(v/(v + 2v̄)) with high probability 1 − 2 exp(−c_5 p^δ).

If we further assume the assumptions in Theorem 2 and that ‖α‖_max ≤ A, then for a constant C > 0,

|τ̂ − τ*| ≤ 2AC√(r⁵r̄N̄ log N̄ / |Ω|) / ( (1/N)T⊤(I − P_U)T [√(v/(v + 2v̄)) − Λ(r, N̄, |Ω|)] − 2Λ(r, N̄, |Ω|) ),

where Λ(r, N̄, |Ω|) = C√(r̄r³N̄ log N̄ / |Ω|) and r̄ = r ∨ log N̄.

The assumption that p^{1+δ}/N → 0 appears as an artifact of the proof, and our simulations show that consistency also holds when N < p (Figure 3, Appendix B). Theorem 3 guarantees that the ATE estimator induced by EFMC is consistent as long as r⁵r̄N̄ log N̄ / |Ω| → 0 when N, p → ∞. This seems much more restrictive than consistent matrix reconstruction, which merely requires rN̄ log N̄ / |Ω| → 0 (Lemma 6, Appendix A). However, this is due to the pessimistic nature of the error bound. Our simulations in Section 4.1 show that matrix factorization works very well for r = 5, N = 1500 and p = 1450 such that r⁶ ≫ N.

4 Numerical results

In this section, we show that low rank matrix factorization effectively reduces the ATE estimation error caused by measurement noise using two experimental settings: 1) synthetic datasets with both continuous and binary covariates, and 2) the twins dataset introduced by Louizos et al. [9].
To implement matrix factorization, we use the following nonconvex formulation:

(Û, V̂) = argmin_{U∈R^{N×k}, V∈R^{p×k}} Σ_{(i,j)∈Ω} L_{ij}(X_ij, U_i⊤V_j) + (λ/2)(‖U‖_F² + ‖V‖_F²)    (5)

where L_{ij} is a loss function assessing how well U_i⊤V_j fits the observation X_ij for (i, j) ∈ Ω. The solution Û is an estimator for the confounder space. This nonconvex formulation (5) provably recovers the solution of the convex formulation (2) when log-likelihood loss functions and sufficiently large k are used [18, 28]. Solving the nonconvex formulation (5) approximately is usually much faster than solving the convex counterpart. In our experiments, we use the R package softImpute [33] for continuous covariates and quadratic loss, the R package logisticPCA [34] for binary covariates and logistic loss, and the Julia package LowRankModels [18] for categorical variables and multinomial loss. All tuning parameters are chosen via 5-fold cross-validation.

4.1 Synthetic experiment

We generate synthetic samples according to the following linear regression process: Y_i | U_i, T_i ∼ N(α⊤U_i + τT_i, 1), where the confounders U_ij ∼ N(0, 1) and the treatment variable T_i | U_i ∼ Bernoulli(σ(β⊤U_i)) for i ∈ [N], j ∈ [r].
We consider covariates generated from both independent Gaussian noise and independent Bernoulli noise: X_ij ∼ N(U_i⊤V_j, 5) and X_ij ∼ Bernoulli(σ(U_i⊤V_j)) for V_j ∈ R^r. We set the dimension of the latent confounders r = 5, use α = [−2, 3, −2, −3, −2] and β = [1, 2, 2, 2, 2], and choose τ = 2 in our example. (But our conclusions are robust to different values of these parameters.) We consider a low dimensional case where the number of covariates p varies from 100 to 1000 and the sample size is N = 2p, and a high dimensional case where p varies from 150 to 1500 and N = p + 50. For each dimensional setting, we compute the error metrics based on 50 replications of the experiment, and we generate the entries of V independently from the standard normal distribution with V fixed across the replications.

³For example, if only n_V noisy covariates have nonvanishing loadings on the confounders, and their loading vectors have norms of similar order, then ‖V‖_F = √(Σ_{j=1}^p ‖V_j‖²) ≈ √(n_V) max_j ‖V_j‖, so max_j ‖V_j‖/‖V‖_F ≈ 1/√(n_V). When lim_{p→∞} n_V/p → 0, i.e., almost all covariates have vanishing loadings, Assumption 5(b) is violated.

Figure 2: Estimation error for different ATE estimators with Gaussian and binary proxy variables.

We compare the root mean squared error (RMSE) scaled by the true ATE in Figure 2 for the following five ATE estimators in linear regression: the Lasso, Ridge and OLS estimators from regressing Y_i on T_i and the noisy covariates X_i, the OLS estimator from regressing Y_i on T_i and the estimated confounders Û_i from matrix factorization (MF), and the OLS estimator from regressing Y_i on T_i and the true confounders U_i (Oracle).
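A scaled-down sketch of the Gaussian-covariate comparison (ours, not the paper's code — the paper uses cross-validated softImpute in R, while here a plain rank-r truncated SVD stands in for the quadratic-loss factorization, and the problem sizes are reduced for speed):

```python
import numpy as np

# Compare OLS on noisy covariates X against OLS on an SVD-estimated
# confounder space (sketch, ours; sizes smaller than in the paper).
rng = np.random.default_rng(5)
N, p, r, tau = 2_000, 300, 5, 2.0
alpha = np.array([-2.0, 3.0, -2.0, -3.0, -2.0])
beta = np.array([1.0, 2.0, 2.0, 2.0, 2.0])

U = rng.normal(size=(N, r))
T = (rng.random(N) < 1 / (1 + np.exp(-U @ beta))).astype(float)
Y = rng.normal(U @ alpha + tau * T, 1.0)
V = rng.normal(size=(p, r))
X = U @ V.T + np.sqrt(5.0) * rng.normal(size=(N, p))   # Gaussian proxy noise

def tau_ols(controls):
    Z = np.column_stack([np.ones(N), T, controls])
    return np.linalg.lstsq(Z, Y, rcond=None)[0][1]

Uhat = np.linalg.svd(X, full_matrices=False)[0][:, :r]  # estimated col(U)
err_mf = abs(tau_ols(Uhat) - tau)                       # MF-style estimator
err_raw = abs(tau_ols(X) - tau)                         # OLS on raw proxies
```

Both estimators use the same data; the factorized controls adjust for a 5-dimensional estimated confounder space instead of 300 noisy columns.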
The shaded area corresponds to the 2-standard-deviation error band for the estimated relative RMSE across the 50 replications.

Figure 2 shows that OLS leads to accurate ATE estimation for Gaussian additive noise when the number of covariates is sufficiently large, which is consistent with Corollary 1.1. However, for high dimensional data, matrix factorization preprocessing dominates all other feasible methods, and its RMSE is very close to that of the oracle regression for a sufficiently large number of covariates. While all feasible methods tend to perform better when more covariates are available, matrix factorization preprocessing is the most effective at exploiting the noisy covariates for accurate causal inference. Having sufficiently many noisy covariates is crucial for accurate ATE estimation in the presence of measurement noise: we can show that the error does not converge when only N grows but p is fixed (Figure 6, Appendix B). With only a few covariates, matrix factorization preprocessing may have high error because cross-validation chooses a rank smaller than the ground truth. Furthermore, the gain from using matrix factorization is more dramatic for binary covariates, which demonstrates the advantage of matrix factorization preprocessing with loss functions adapted to the data type. More numerical results on different dimensional settings and on missing data can be found in the Appendix.

4.2 Twin mortality

We further examine the effectiveness of matrix factorization preprocessing using the twins dataset introduced by Louizos et al. [9]. This dataset includes information for N = 11984 pairs of same-sex twins who were born in the USA between 1989 and 1991 and who each weighed less than 2kg. For the ith twin-pair, the treatment variable Ti corresponds to being the heavier twin, and the outcomes Yi(0), Yi(1) are the mortality in the first year after birth.
We have outcome records for both twins and view them as the two potential outcomes for the treatment variable. Therefore, the −2.5% difference between the average mortality rate of heavier twins and that of lighter twins can be viewed as the "true" ATE. This dataset also includes 46 other covariates relating to the parents, the pregnancy, and the birth for each pair of twins. More details about the dataset can be found in Louizos et al. [9].

To simulate confounders in observational studies, we follow the practice in Louizos et al. [9] and selectively hide one of the two twins based on one variable highly correlated with the outcome: GESTAT10, the number of gestation weeks prior to the birth. This is an ordinal variable with values from 0 to 9 indicating less than 20 gestation weeks, 20–27 gestation weeks, and so on. We simulate Ti | Ui ∼ Bernoulli(σ(5(Ui/10 − 0.1))), where Ui is the confounder GESTAT10.

Figure 3: Left: estimation errors of different ATE estimators with no missing values. Right: estimation errors of ATE estimators based on matrix factorization and imputation methods when each entry of the proxy variable matrix is missing with probability 30%.

Then for each twin-pair, we only observe the lighter twin if Ti = 0 and the heavier twin otherwise. We create noisy proxies for the confounder as follows: we replicate GESTAT10 p times and independently perturb the entries of these p copies with probability 0.5. Each perturbed entry is assigned a new value sampled uniformly at random from 0 to 9. We also consider the presence of missing values: we set each entry to missing independently with probability 0.3.
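The treatment assignment and proxy-construction scheme just described can be sketched as follows (an illustrative NumPy version; the function and variable names are ours, not from the released dataset):

```python
import numpy as np

def assign_treatment(gestat10, rng=None):
    """T_i | U_i ~ Bernoulli(sigmoid(5 * (U_i / 10 - 0.1))), where U_i is
    the ordinal confounder GESTAT10 with values in {0, ..., 9}."""
    rng = rng or np.random.default_rng(0)
    u = np.asarray(gestat10, dtype=float)
    prob = 1.0 / (1.0 + np.exp(-5.0 * (u / 10.0 - 0.1)))
    return rng.binomial(1, prob)  # T = 1: observe the heavier twin

def make_noisy_proxies(gestat10, p, perturb_prob=0.5, missing_prob=0.0, rng=None):
    """Replicate GESTAT10 p times, perturb each entry independently with
    probability perturb_prob by resampling it uniformly from {0, ..., 9},
    and optionally mark entries missing (NaN) with probability missing_prob."""
    rng = rng or np.random.default_rng(0)
    u = np.asarray(gestat10, dtype=float).reshape(-1, 1)
    X = np.tile(u, (1, p))                       # p exact copies of the confounder
    perturb = rng.random(X.shape) < perturb_prob
    X[perturb] = rng.integers(0, 10, size=int(perturb.sum()))
    if missing_prob > 0:
        X[rng.random(X.shape) < missing_prob] = np.nan
    return X
```

Since each perturbed entry is resampled uniformly rather than shifted by additive noise, this mechanism deliberately departs from the standard noise assumptions in the matrix factorization literature.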
We vary p from 5 to 50, and for each p we repeat the experiment 20 times to compute error metrics.

We compare the performance of different methods for both complete data and missing data in Figure 3. For complete data, we consider logistic regression (LR), the doubly robust estimator (DR), Mahalanobis matching (Match), and propensity score matching (PS Match) using the noisy covariates, and their counterparts using the estimated confounders from matrix factorization. All propensity scores are estimated by logistic regression using the noisy covariates or the estimated confounders accordingly. The matching methods are implemented via the full matching algorithm in the R package optmatch [35]. For missing data, we consider logistic regression using data output from different preprocessing methods: imputing missing values by the column-wise mode, multiple imputation using the R package MICE with 5 repeated imputations [36], and the estimated confounders {Ûi}_{i=1}^N from matrix factorization. We also discuss comparisons to [9] in Appendix C.

We observe that all methods that use matrix factorization clearly outperform their counterparts that do not, even though the noise mechanism does not obey common noise assumptions in the matrix factorization literature. In particular, Mahalanobis matching (Match) benefits the most from matrix factorization, which simultaneously alleviates the measurement noise and reduces the dimension. The effect of solely reducing measurement noise is shown in the results for propensity score matching, where matching is based on the univariate propensity score and thus dimensionality is not the primary issue. Our results also demonstrate that matrix factorization preprocessing can augment popular causal inference methods beyond linear regression.
Furthermore, matrix factorization preprocessing is robust to a considerable amount of missing values, and it dominates both the ad-hoc mode imputation method and the state-of-the-art multiple imputation method. This suggests that inferring the latent confounders is more important for causal inference than imputing the noisy covariates.

5 Conclusion

In this paper, we address the problem of measurement noise prevalent in causal inference. We show that with a large number of noisy proxies, we can reduce the bias resulting from measurement noise by using matrix factorization preprocessing to infer latent confounders. We guarantee the effectiveness of this approach in a linear regression setting, and show its effectiveness numerically on both synthetic and real clinical datasets. These results demonstrate that preprocessing by matrix factorization to infer latent confounders has a number of advantages: it accommodates a wide variety of data types, is robust to missing values, and can improve causal effect estimation when used in conjunction with a wide variety of causal inference methods. As such, matrix factorization allows more principled and accurate estimation of causal effects from observational data.

Acknowledgments

This work was supported by the National Science Foundation under Grant No. 1656996 and by DARPA Award FA8750-17-2-0101.

References

[1] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation.
arXiv preprint arXiv:1602.05352, 2016.

[2] Alfred F Connors, Theodore Speroff, Neal V Dawson, Charles Thomas, Frank E Harrell, Douglas Wagner, Norman Desbiens, Lee Goldman, Albert W Wu, Robert M Califf, et al. The effectiveness of right heart catheterization in the initial care of critically ill patients. Journal of the American Medical Association, 276(11):889–897, 1996.

[3] Joshua D Angrist and Alan B Krueger. Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics, 106(4):979–1014, 1991.

[4] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.

[5] Peter A Frost. Proxy variables and specification bias. The Review of Economics and Statistics, pages 323–325, 1979.

[6] Michael R Wickens. A note on the use of proxy variables. Econometrica: Journal of the Econometric Society, pages 759–761, 1972.

[7] Jeffrey M Wooldridge. Introductory econometrics: A modern approach. Nelson Education, 2015.

[8] Suriya Gunasekar, Pradeep Ravikumar, and Joydeep Ghosh. Exponential family matrix completion under structural constraints. In International Conference on Machine Learning, pages 1917–1925, 2014.

[9] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6449–6459, 2017.

[10] James Bennett, Stan Lanning, et al. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, New York, NY, USA, 2007.

[11] Feilong Cao, Miaomiao Cai, and Yuanpeng Tan. Image interpolation via low-rank matrix completion and recovery.
IEEE Transactions on Circuits and Systems for Video Technology, 25(8):1261–1270, 2015.

[12] Alejandro Schuler, Vincent Liu, Joe Wan, Alison Callahan, Madeleine Udell, David E Stark, and Nigam H Shah. Discovering patient phenotypes using generalized low rank models. In Biocomputing 2016: Proceedings of the Pacific Symposium, pages 144–155. World Scientific, 2016.

[13] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[14] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

[15] Emmanuel J Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

[16] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430, 2011.

[17] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.

[18] Madeleine Udell, Corinne Horn, Reza Zadeh, Stephen Boyd, et al. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118, 2016.

[19] Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. Matrix completion methods for causal panel data models. arXiv preprint arXiv:1710.10251, 2017.

[20] Raymond J Carroll, David Ruppert, Ciprian M Crainiceanu, and Leonard A Stefanski. Measurement error in nonlinear models: a modern perspective. Chapman and Hall/CRC, 2006.

[21] Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.

[22] Wang Miao, Zhi Geng, and Eric Tchetgen Tchetgen.
Identifying causal effects with proxy variables of an unmeasured confounder. arXiv preprint arXiv:1609.08816, 2016.

[23] Charles Spearman. "General intelligence," objectively determined and measured. The American Journal of Psychology, 15(2):201–292, 1904.

[24] Madeleine Udell and Alex Townsend. Nice latent variable models have log-rank. arXiv preprint arXiv:1705.07474, 2017.

[25] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data, volume 333. John Wiley & Sons, 2014.

[26] Peter McCullagh. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.

[27] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[28] Nathan Kallus and Madeleine Udell. Dynamic assortment personalization in high dimensions. arXiv preprint arXiv:1610.05604, 2016.

[29] Ajit P Singh and Geoffrey J Gordon. A unified view of matrix factorization models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 358–373. Springer, 2008.

[30] Andrew Y Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78. ACM, 2004.

[31] T Tony Cai, Anru Zhang, et al. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89, 2018.

[32] Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(May):1665–1697, 2012.

[33] Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16:3367–3402, 2015.

[34] Michael Collins, Sanjoy Dasgupta, and Robert E Schapire.
A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems, pages 617–624, 2002.

[35] Ben B Hansen and Stephanie Olsen Klopfer. Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15(3):609–627, 2006.

[36] Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67, 2011.

[37] Gautam Tripathi. A matrix extension of the Cauchy-Schwarz inequality. Economics Letters, 63(1):1–3, 1999.

[38] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[39] Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.