{"title": "Selecting causal brain features with a single conditional independence test per feature", "book": "Advances in Neural Information Processing Systems", "page_first": 12553, "page_last": 12564, "abstract": "We propose a constraint-based causal feature selection method for identifying causes of a given target variable, selecting from a set of candidate variables, while there can also be hidden variables acting as common causes with the target. We prove that if we observe a cause for each candidate cause, then a single conditional independence test with one conditioning variable is sufficient to decide whether a candidate associated with the target is indeed causing it. We thus improve upon existing methods by significantly simplifying statistical testing and requiring a weaker version of causal faithfulness. Our main assumption is inspired by neuroscience paradigms where the activity of a single neuron is considered to be also caused by its own previous state. We demonstrate successful application of our method to simulated as well as encephalographic data of twenty-one participants, recorded at the Max Planck Institute for Intelligent Systems. The detected causes of motor performance are in accordance with the latest consensus about the neurophysiological pathways, and can provide new insights into personalised brain stimulation.", "full_text": "Selecting causal brain features with a single\nconditional independence test per feature\n\nAtalanti A. Mastakouri\nEmpirical Inference Department\nMax Planck Institute for Intelligent Systems\nT\u00fcbingen, 72076\namastakouri@tue.mpg.de\n\nBernhard Sch\u00f6lkopf\nEmpirical Inference Department\nMax Planck Institute for Intelligent Systems\nT\u00fcbingen, 72076\nbs@tue.mpg.de\n\nDominik Janzing\nAmazon Research\nT\u00fcbingen, 72076\njanzind@amazon.com\n\nAbstract\n\nWe propose a constraint-based causal feature selection method for identifying\ncauses of a given target variable, selecting from a set of candidate variables, while\nthere can also be hidden variables acting as common causes with the target. We\nprove that if we observe a cause for each candidate cause, then a single conditional\nindependence test with one conditioning variable is sufficient to decide whether a\ncandidate associated with the target is indeed causing it. We thus improve upon\nexisting methods by significantly simplifying statistical testing and requiring a weaker\nversion of causal faithfulness. Our main assumption is inspired by neuroscience\nparadigms where the activity of a single neuron is considered to be also caused by\nits own previous state. We demonstrate successful application of our method to\nsimulated as well as encephalographic data of twenty-one participants, recorded\nat the Max Planck Institute for Intelligent Systems. The detected causes of motor\nperformance are in accordance with the latest consensus about the neurophysiological\npathways, and can provide new insights into personalised brain stimulation.\n\n1 Introduction\n\nConditional independence (CI) relations have been an important tool in the field of computational\nstatistics [1, 2] and play a significant role in causal inference [3]. However, causal inference through\nconditional independencies in real datasets is challenging, since testing them is hard\n[1], particularly when the number of conditioning variables is large. 
PC [4], FCI [4] and CPC [5] are\nthree of the most prominent CI based causal discovery methods. To recover the underlying graph\nfrom the data they require some assumptions, which, however, are often violated. These include the\ncausal Markov condition, faithfulness and, in addition for PC method, also causal suf\ufb01ciency, i.e., the\nassumption that all common causes of observed nodes are observed. Although FCI algorithm [4] does\nnot assume that, it becomes unreliable because it requires many statistical tests if the connections\nbetween the features are not sparse. Furthermore, faithfulness is a rather problematic assumption, as\ntypical parameter values in causal models with many variables yield distributions that are close to\nbeing unfaithful [6].\nThe \ufb01eld of non-invasive neuroimaging, such as Electroencephalography (EEG), is one characteristic\ncase where the discovery of causal features is needed. There, the activity of billions of neurons is\nrecorded as noisy mixtures of activity reaching through several layers of cortex, skull and skin and\nhence causal suf\ufb01ciency cannot be assumed. Furthermore, the dimensionality of the data is large,\noften comparable to the sample size. In such datasets, the need for causal inference often arises,\nin order to be able to differentiate a set of causal brain features from a large number of simple\ncorrelations between the brain activity and the observed behavioral response [7, 8, 9].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOur motivation emanates from the \ufb01eld of non-invasive brain stimulation (NIBS); a novel treatment\ntool that aims, among others, at the rehabilitation of motor functions, for patients with motor\ndisabilities. One fundamental problem is the lack of exact knowledge of the mechanism that entrains\nthe ongoing brain oscillations during the stimulation [10, 11]. 
Subsequently, the selection of the\nfrequency, intensity and exact location of the stimulation is made based on collected observations,\ninstead of being derived from the individual\u2019s brain activity. For instance, stimulation at \u03b3-range\nfrequencies (70Hz) has been proposed to facilitate movement [12, 13], while frequencies in \u03b2-range\nhave been reported to inhibit it [14, 15, 16]. However, particularly in motor tasks, similar stimulation\nparameters have been reported to result in very heterogeneous responses across subjects, that span\nfrom positive to negative [17, 18]. It has been argued [19] that the reason for this discrepancy\nof responses to NIBS originates from the limited or extensive variability of each brain\u2019s activity\nduring movement, and hence personalized stimulation parameters are required to ensure positive\nresponse. Better understanding of the motor cortex activation of each individual could contribute to\nthe identi\ufb01cation of such individual parameters.\nHere we present a constraint-based causal feature selection method for identifying causes of a target\nvariable, in set-ups where one cause for each candidate cause is known. We prove that the detection of\ncauses remains unaffected by common causes, regardless of whether they are observed or unobserved.\nOur method restricts the identi\ufb01cation of a cause to one targeted conditional independence test per\ncandidate, with only one conditioning variable. This restriction simpli\ufb01es the required statistical\ntesting by both limiting the problem of possible faithfulness violation and by scaling with linear\ncomplexity with the number of features. As a \ufb01rst step, we apply our causal methodology on\nsimulated data, which leads to very low percentages of false positives due to statistical error. 
While\nthe application of our proposed algorithm is not restricted to brain datasets, we apply it also on\nEEG data that we recorded from twenty-one healthy subjects during a motor task, with the aim\nof detecting causal brain features of motor performance for each individual. We give evidence of\ndifferent detected causal brain features according to the motor performance, which are in accordance\nwith the state-of-the-art research in the \ufb01eld of neuroscience. We then discuss how this could be used\nto identify personalized stimulation parameters for rehabilitation.\n\n2 Methods\n\n2.1 De\ufb01nitions and notations\n\nWe brie\ufb02y present some fundamental de\ufb01nitions in Causal Bayesian Networks [20], which we will\nuse to present our methodology and prove our theorem below. For a thorough study see [20, 21, 22].\nThe notions of faithfulness and of causal Markov condition are fundamental to be able to relate the\ndistributions of the variables of interest to properties of a causal graph. Markov condition enables us\nto read off independences from the graph structure, while faithfulness allows us to infer dependences\nfrom the graph [21]. In other words, a distribution P is faithful to a directed acyclic graph (DAG) G if\nno conditional independence relations other than the ones entailed by the Markov property are present.\nAnother important notion for causal discovery is the confounding path between two variables. Here,\nwe de\ufb01ne a confounder variable as a variable (observed or unobserved) that is a common ancestor of\ntwo other variables. 
In Appendix A.1 we give a list of the exact definitions for the terms that we use\n(d-separation, Ancestor, Descendant, Causal Markov Condition, Faithfulness, Confounder).\nFrom now on, we use the following notation to describe our method:\n\n\u2022 \u21e2: denotes a directed path with observed variables or a direct link.\n\u2022 \u2192: denotes a direct link.\n\nWe briefly introduce the setting of our methodology. The problem of selecting causal features is\ninspired by brain datasets, where the causal candidates are brain features i, i.e. activity in different\nbrain regions and in different brain frequencies, and the target variable R is a behavioural response\nwe measure on the subject. We also consider that each brain feature i has an observed previous\nstate P^i and a current state M^i. The variables\u2019 names read as \u201cPlan\u201d, \u201cMove\u201d and\n\u201cResponse\u201d respectively. An example of such a structure is given in Figure 1, where for example\n(brain) feature M^1 has an ancestor P^1 and is a cause of R, while feature M^2 is not causing R but\nis connected to it through a confounding path that includes M^1. Without knowing the structure, our theorem\nis able to differentiate the true causes (M^1) from the ones that are dependent on the target due to\nconfounding paths (M^2).\n\n2.2 Formal problem description\n\nGiven the random variables P^i, M^i, i = 1, 2, ..., n and R, we assume the class of DAGs in which there\ncan be instantaneous acyclic effects between P^i variables, between M^i variables, as well as forward\neffects between P^i, M^i and R. In Section 2.3 we explain how the assumptions described below are\ncommonly met in datasets where candidate causes can be measured at two time stamps, and hence\na causal path from the previous to the current state can be assumed. A brain set-up is such a case.\nBelow we present the necessary assumptions for our theorem.\n\nAssumption (1). Causal Markov condition\nAssumption (2). Faithfulness\nAssumption (3). P \u21e0\u0338 M \u21e0\u0338 R: In the class of DAGs we target here, variable R is measured\nafter M, which is measured after P (there can be no backwards arrows in time).\nAssumption (4). P^i \u21e2 M^i exists: Variables P^i and M^i represent two consecutive states of the\nsame brain feature i. We assume that the state P^i is always a cause of state M^i for the same feature i.\nAssumption (5). (R, M^i, P^i) are independently drawn from some distribution (i.i.d.)\n\nTheorem. Given the variables P^i, M^i, i = 1, 2, ..., n and R, and assuming 1-5,\nif M^i \u22a5\u0338\u22a5 R (1) and P^i \u22a5\u22a5 R | M^i (2) then M^i \u21e2 R.\nProof. We prove the claim by contradiction. Assume 1-5 and that M^i and R are dependent (cond. 1),\nbut there is no directed path from M^i to R. Then there is a confounding path p1 := M^i \u21e0 C \u21e2 R\nwith some common cause C (hidden or observed). Now consider some path p2 := P^i \u21e2 M^i\n(it exists due to Assumption 4). If p1 and p2 have only M^i in common, M^i is a collider and thus\nP^i and R are not d-separated by M^i. If p1 and p2 share more nodes, assume first they have P^i in\ncommon, that is, P^i lies on p1. Then P^i and R are not d-separated by M^i because the sub-path of p1\nconnecting P^i and R does not contain M^i and p1 is collider-free. Assume now that P^i does not lie\non p1, and p1 and p2 share some node X other than M^i and P^i. Then either (i) X = C, or (ii) X is a\nnode between C and R, or (iii) X is a node between C and M^i. For (i) and (ii), we have a directed\npath from P^i to R (that does not contain M^i). In case (iii), X is a collider and M^i a descendant of\nthis collider, hence M^i unblocks the path from P^i to R. In all three cases, M^i does not d-separate P^i\nand R, which contradicts P^i \u22a5\u22a5 R | M^i (cond. 2) due to faithfulness. Hence there must be a directed\npath M^i \u21e2 R.\n\nNote that our algorithm requires only one CI test for each node. Therefore, it\nspeeds up the causal feature selection as it scales linearly with the number of nodes\nin the graph; hence its complexity is O(n). The Matlab code can be found at\nhttps://gitlab.tuebingen.mpg.de/amastakouri/singleCICausalFeatureSelection.git\n\nAlgorithm 1: Find causes of R\nInput: P^i, M^i, R, \u2200i = 1, ..., n.\nOutput: CausesR\nfor i \u2190 1 to n do\n    pvalue1 \u2190 ind_test(M^i, R)\n    if pvalue1 < threshold1 then\n        pvalue2 \u2190 cond_ind_test(P^i, R, M^i)\n        if pvalue2 > threshold2 then\n            CausesR \u2190 [CausesR, M^i]\n        end\n    end\nend\n\nFigure 1: A possible DAG that includes the random variables P^i, M^i, i = 1, 2, ..., n and R, assuming\n1-5. Each candidate causal feature M^i has a cause P^i and may have other acyclic edges with the\nother candidates, and some features M^i cause target R.\n\nNote that Theorem 2.2 provides sufficient but not necessary conditions for M^i to be a cause of R. In\nother words, not all causes of R may be identified. Note that our assumptions do not include causal\nsufficiency. So even in the case of unobserved common causes, if the conditions described in 2.2\nare met, then we know that the dependency between M^i and R is due to a directed path and not\ndue to a confounder variable. (We should check that the relationship between M^i and P^i is not too\ndeterministic. A near-deterministic relationship would yield the conditional independence P^i \u22a5\u22a5 R | M^i even in\nthe presence of confounding. This violation of faithfulness could happen if M^i and P^i are too close\nin time.)\nTo corroborate our statement, we further explore the following case: In real datasets where the\ndifferent samples have been measured at different timestamps (such as our EEG experiment), even if\nthe interval between measurements is large, we should not consider the samples as i.i.d. data (Assump.\n5). There is, however, a heuristic argument suggesting that under Assumptions 1-4, our method is\nrobust with respect to the i.i.d. violation. To this end, we model the time dependence formally by\na hidden time variable T. We examine the conditions of Theorem 2.2 exhaustively for each of the\ndifferent ways in which T can affect the variables of the DAG, i.e. one variable at a time, two, and then all\nof them (see Fig. 6 in suppl.). In case that M^i \u21e2 R exists, condition (2) is violated in the graphs (a),\n(c) and (d), since P^i \u22a5\u0338\u22a5 R | M^i. In all the other graphs of Fig. 6 (suppl.), the cause M^i is correctly\nidentified. If M^i \u21e2\u0338 R (Fig. 7 in suppl.), then graphs (a), (c) and (d) comply with condition (1) of\nTheorem 2.2, but violate condition (2), correctly rejecting the variable M^i. For the rest of the graphs,\nthe variables already violate condition (1), and are thus rejected. Therefore, if the hidden variable T\nis present, some causal variables may be rejected but no non-causal variable is falsely accepted. This\nis desired in applications where false positives are more harmful than false negatives.\n\n2.3 Experimental part\n\nWe apply our method on simulated data and on EEG data that we recorded from twenty-one healthy\nparticipants. All EEG experiments and recordings were performed at the Max Planck Institute for\nIntelligent Systems under the ethics approval of the Committee of the Eberhard Karls University of\nT\u00fcbingen. 
Informed consent was obtained from all participants prior to their participation in the study.\nFor the implementation, to make sure that Assumption 4 is not violated in the data in practice, we\ncheck the dependence between P^i and M^i for the same i with an independence test, and in case it\nis not significant we reject the candidate without further checking. Both for the simulated graphs\ndescribed below and for the EEG data, we calculate the independencies using the HSIC test [23]\nand the conditional independencies using the conditional independence HSIC test from [24, 25],\nwith Gaussian kernel and the usual heuristic bandwidth used in [23]. Therefore, our algorithm also\nchecks for non-linear relationships between the variables. For the statistical testing we examine the\nnull hypothesis H01 : M^i \u22a5\u22a5 R and consider the null hypothesis rejected (hence consider\nM^i and R to be dependent) if p < \u03b1_D = 0.05. Then, we examine the null hypothesis\nH02 : P^i \u22a5\u22a5 R | M^i and accept it (hence the conditional independence) if p > \u03b1_CI = 0.25 (usual\nvalues for accepting CI in EEG datasets include thresholds above 0.25 [8]).\n\n2.3.1 Simulated graphs\n\nGiven the variables P^i, M^i, i = 1, 2, ..., n and R as described in 2.2, and assuming 1-5, we build\nsimulations of possible DAGs and apply our Theorem 2.2. Simulations were run on a 12-CPU\ncomputer using the parallel toolbox of Matlab.\n\nConstruction of simulated graphs: We sample the noise terms of the P, M and R variables from a\nGaussian distribution with variance randomly sampled from a uniform distribution. We then define\nthe adjacency matrix of the subgraph that consists of all P^i variables as an n \u00d7 n matrix A_P, whose\nelements are independently drawn from a Bernoulli(p) distribution, denoting the existence of an edge\nbetween the different P^i variables, forbidding any self-cycles (a_Pij = 0 for i = j). We update the P^i values\nby adding a function f1 of each parent P^j variable: P^i = P^i + \u03a3_{j=1}^{k_P} f1(P^j_{a_Pij==1}), for the k_P parent\nP^j variables of P^i. As a second step, we define the adjacency matrix of the subgraph that consists of\nall P^i and M^i variables as an n \u00d7 n matrix A_PM, whose elements are values independently drawn from\na Bernoulli(p), denoting the existence of an edge between the different P^i and M^i variables, making\nsure that for i = j the edge exists (a_PMij = 1 for i = j). We update the M^i values by adding a function f2 of\neach parent P^j variable: M^i = M^i + \u03a3_{j=1}^{k_PM} f2(P^j_{a_PMij==1}), for the k_PM parent P^j variables of M^i.\nTo avoid creation of cycles, we only generate the following types of arrows: (1) P^i \u2192 M^j for i \u2264 j,\n(2) P^i \u2192 P^j for i < j, (3) M^i \u2192 M^j for i < j, (4) P^i \u2192 R and (5) M^i \u2192 R. As a third step,\nwe create the adjacency matrix of the subgraph that consists of all M^i variables as an n \u00d7 n matrix\nA_M, whose elements are values independently drawn from a Bernoulli(p), denoting the existence of\nan edge between the different M^i variables, forbidding any self-cycles (a_Mij = 0 for i = j). We update the\nM^i values by adding a function f3 of each parent M^j variable: M^i = M^i + \u03a3_{j=1}^{k_M} f3(M^j_{a_Mij==1}),\nfor the k_M parent M^j variables of M^i. Finally, we create the vectors A_MR and A_PR with n elements\nindependently drawn from a Bernoulli(p), denoting the existence of an edge from M to R and from\nP to R. We update the R values by adding a function f4 of each M^i that is a parent and a function f5\nof each P^i that is a parent: R = R + \u03a3_{i=1}^{k_MR} f4(M^i_{a_MRi==1}) + \u03a3_{i=1}^{k_PR} f5(P^i_{a_PRi==1}), for the k_MR parent\nvariables M^i and the k_PR parent variables P^i of R. We sample the coefficients for the five linear\nfunctions f1, f2, f3, f4, f5 from a Gaussian distribution. We examine the statistical performance of\nour algorithm for different numbers of nodes n for the P and M variables, different sparsity of edges and\ndifferent numbers of samples. For each combination, we examine 20 random graphs and report the\npercentage of false positives and false negatives, calculated on the number n of features i.\nComparison with Markov Blanket methods and Lasso: Lasso and Markov Blanket (MB) discovery\nmethods require causal sufficiency and, in addition, suffer from the curse of dimensionality. Furthermore, with high\ndimensional data, any algorithm using CI tests has to condition on large variable sets, in which case\nCI testing is hard [1] and cannot be trusted unless sample sizes are huge. Finally, even if causal\nsufficiency were to hold, the known MB detection algorithms and Lasso do not detect variables but\nrank them, and gradually evaluate the prediction accuracy by including more variables, according to\nthe ranked order the algorithm returned. This requires a heuristic hyperparameter to define the\nacceptable number of variables to be included in the MB, which affects the false positive and\nthe false negative rates. For completeness, however, we provide comparison results of our method\nagainst the following three available algorithms (average over 10 random graphs): HSIC Lasso [26],\nBackward elimination with HSIC, and Forward selection with HSIC for MB discovery [27]. We\npresent the case most favourable to the other algorithms, that of a large sample size (800), for\nsmall (20 nodes) and large (125 nodes) graphs, with sparse (0.2) and dense (0.5, more true causes)\nedges. We report the % of false positives and false negatives in the number of variables.\n\n2.3.2 Identifying brain causes of motor performance from EEG data\n\nOur motivation behind the development of this method was to identify causal brain features of\nupper limb movement from brain activity during a motor task, which could help to identify targets\nof personalised non-invasive brain stimulation. Here, we apply our method to EEG data (no brain\nstimulation applied), independently for each subject. Our causal candidate variables are bandpower\nin different frequency bands and in different electrode locations.\nWe recorded twenty-one healthy participants with high density EEG (128 electrodes, Brain Products),\nduring a motor task. Our paradigm consisted of 150 trials. During each trial, a new target appeared\nat a randomized location on a screen in front of the subject. After a planning period of 2.5-4 s,\nsubjects were instructed to move their right arm to reach the target within 10 s. The subject\u2019s arm\nwas tracked in real time with four infrared cameras (PhaseSpace) and was represented on the\nscreen as a sphere which they could control.\n\nEach trial k consisted of a planning phase (p^i_k) followed by a moving phase (m^i_k). Trials in which\nthe subject did not reach the target within the 10 s window are excluded from the analysis. As an\ninput to our causal discovery algorithm, we examine the bandpower of four brain frequency bands\n(\u03b1: 8-12 Hz, \u03b2: 12-25 Hz, low-\u03b3: 25-45 Hz and \u03b3: 60-80 Hz) and thirty-eight\nelectrodes over the left and right primary motor cortices, the supplementary motor areas and the\ncentral sulcus. That results in n = 4 \u00d7 38 = 142 features. 
We calculate each feature i as the log-bandpower\nduring a window of 1 s at the end of the planning phase (P^i) and at the beginning of\nthe moving phase (M^i) for the aforementioned four canonical brain frequency bands (a larger interval\nbetween the periods used to calculate P^i and M^i was also examined, which led to fewer detected causes).\nFinally, we quantify the response R as the natural logarithm of the duration of the reaching movement\nin seconds (see Figure 1). Each sample of the random variables P^i, M^i and R is one experimental\ntrial. We assume the interval between the trials is wide enough to consider them i.i.d. (Assump. 5). In\nSection 2.2 we examine the violation of this assumption. Assumptions 3 and 4 arise in a natural way\nfrom an EEG set-up: There is a time ordering between the brain states P^i, M^i and R; that is why the\nmeasured response R cannot affect the preceding brain state (Assump. 3). In addition, we assume\nthat the previous state of brain feature i (P^i) is a cause of its current brain state M^i (Assump. 4).\nPreprocessing of EEG data: Before the bandpower calculation, to attenuate non-cortical artifacts in\nthe EEG data we followed a standardized procedure often applied in this field [28, 29]. We filtered the\nEEG signal with a Butterworth 3 Hz high-pass filter, performed common average reference filtering\non all electrodes, and then performed SOBI [30] Independent Component Analysis (ICA) followed\nby manual rejection of non-cortical sources [31], which we then re-projected on the raw signal.\n\n3 Results\n\n3.1 Simulated data\n\nFigure 2: Percentage of false positives (FP) and false negatives (FN) of detected causes, calculated\non the number of features n, for twenty random simulated graphs, for different sparsity of edges,\nnumber of samples and number of features i. Solid lines: FP. Dashed lines: FN. FN increase with the\nnumber of nodes. 
FP due to statistical error remain very low regardless of the number of nodes.\n\nFigure 2 depicts the percentage of false positives and false negatives over twenty random graphs, for\neach combination of number of M^i nodes n, samples and sparsity of edges. As shown in detail in Fig.\n8 of suppl., the false positives occurring due to statistical error in the computation of the dependencies\nand conditional independences are very few, with a tendency to decrease with more samples. Clearly,\nthe probability of false positives increases with the number of nodes. The number of false negatives\n(Fig. 9 suppl.) appears inflated because we consider as true causes both the direct and the indirect\nones; so in case only the direct cause is correctly identified, its ancestors, which are indirect\ncauses, will be counted as false negatives. That is why the number of false negatives increases with\nthe number of features n and the density of the graph.\nComparison with Markov Blanket methods and Lasso: In the simulated data, in sparse large\ngraphs Forward Selection gave more false positives (Table 1). Lasso and Forward Selection gave\nmore false positives in small sparse and dense graphs. Backward Elimination performed worse in\nsmall sparse graphs. Overall, our method managed to keep the false positive rate very low (\u223c 2.1%)\nfor all dense/sparse, small/large graphs, while other algorithms\u2019 performance varied with the case.\nOptimal parameters based on the true number of causes were selected for Lasso. Backward Elimination\nand Forward Selection computations took a significantly long time. 
Furthermore, we stress that in these\nsimulations no hidden variables exist, which is an extra advantage for the compared algorithms.\n\nTable 1: Comparison of false positive (FP) and false negative (FN) rates calculated in 10 random simulated\ngraphs, among our method, HSIC Lasso, and Backward Elimination (BE) and Forward Selection (FS)\nwith HSIC for Markov blanket detection.\n\n(nodes, sparse)   Our method       HSIC Lasso       BE HSIC          FS HSIC\n                  FP(%)   FN(%)    FP(%)   FN(%)    FP(%)   FN(%)    FP(%)   FN(%)\n(20, .2)          3.5     31.5     9.5     22.5     11      23       6       25\n(20, .5)          2       80       5.5     47.5     1.5     79       7.5     26\n(125, .2)         2.9     70.3     1.1     77.4     1.4     77.9     7.8     47.4\n(125, .5)         0       80.8     0       84.8     0       97.6     1.1     14.5\n\n3.2 Electroencephalographic data\n\nOur findings are consistent across all subjects and divided in three categories that couple detected\ncauses with subjects\u2019 performance: 1. \u03b3-power is detected when subjects improve their performance,\n2. \u03b2-power is detected when subjects worsen or do not improve their performance, and finally 3.\n\u03b1-power is detected in the ipsilateral hemisphere. The three groups are discussed in Section 4.\n\nTable 2: Detected causes for six representative subjects; two subjects for each of the three categories\nof detected causes: 1. \u03b2-range detected causal electrodes for inhibition of performance (AB and\nDC), 2. \u03b3-range detected causal electrodes for improvement of performance (KK and II), 3. 
\u03b1-range\ndetected causal electrodes over ipsilateral hemisphere (HH and JJ).\n\nSubject Alpha\n\nBeta\n\nGamma\n\nPerformance\n\nLow\nGamma\n\nAB\n\nDC\n\nKK\n\nII\nHH\n\nJJ\n\n-\n\nFCC5h\n\nC6, CP2\n\n-\nFC2, FCz\n\nCCP1h,\n\nFC2,\nCPP1h, CP6\nCPP2h, CP5, CPz\n\n-\n\n-\n\nC2, CCP2h\n\nFC5, CCP2h\n\nCCP4h, CCP3h,\nFCC3h, FCC5h,\nFCC6h, CP3\n-\n-\n\nFCC2h,\nCP3, CP1,\nCP2\n-\n-\n\nFC5, CCP4h,\nC6,\nCCP6h,\nFC6, FCC3h\nFC2, FC4\n-\n\nFC4, FC6\n\n-\n\n-\n\nFCC5h, CP1\n\nAbove\nGroup\nAverage\nFalse\n\nFalse\n\nFalse\n\nFalse\nFalse\n\nFalse\n\nFull inhibition\n\nInhibited but then\nimproved\nFull improvement\n\nFull improvement\nImprovement but\nthen inhibited\nFull improvement\n\nFigure 3: Electrodes over contralateral motor and parietal cortex in the \u03b2-range (colored red, 2nd\nplot) are detected as causal features from our algorithm, for subject AB, who worsened her movement\nduration during the reaching trials. Findings are in line with literature about the inhibitory role of\nbeta power. Grey color depicts the motor channels we examine. The y-axis is in logarithmic scale.\n\n7\n\n\fWe applied our method on the preprocessed EEG data described in 2.3.2, individually for each subject.\nIn total, our algorithm identi\ufb01ed causes in seventeen out of twenty-one subjects. Due to lack of\nspace, here we present results for six representative subjects in Table 2 and visualisation for two\nsubjects in Figures 3 and 4. Subject AB (Fig. 3) and DC in Table 2 are two representative subjects\nwho worsened or did not improve their movement duration throughout the sequence of reaching\ntrials (larger durations for completing the trial). Subject AB performed on average (green line) worse\nthan the median performance of all subjects (pink line). Our algorithm detected causes over motor\nchannels in the \u03b2-range (2nd head-plot), as well as a few in gamma range (for subject DC in table).\nSubjects KK and II (Fig. 
4) improved their performance, decreasing the duration of their reaching\nmovements throughout the trials. Our algorithm detected causes over motor channels in the \u03b3-range\n(4th head-plot), for both subjects. Finally, HH and JJ are two representative subjects for whom our\nalgorithm detected causes over ipsilateral motor channels in the \u03b1-range. Results for each subject are\npresented and explained based on their performance in Section A.5 (suppl.).\n\nFigure 4: Electrodes over motor cortex in the \u03b3-range (colored pink, 4th plot) are detected as causal\nfeatures from our algorithm, for subject II, who improved her movement duration over the trials.\nFindings are in line with literature about the prokinetic role of gamma power. Grey color depicts the\nmotor channels we examine. The y-axis is in logarithmic scale.\n\n4 Discussion\n\nImprovements upon previous methods. To the best of our knowledge, this is the \ufb01rst constraint\nbased algorithm that scales linearly with the number of candidate features. Previous methods based\non CI tests grow exponentially in time with the number of variables, (if sparse data then they grow\npolynomially), as they require more than one CI test per variable. Therefore, we greatly reduce\nthe computational complexity. Moreover, our algorithm builds on tests that condition on only one\nvariable each. With this improvement, the statistical strength of our inference is superior compared to\nalgorithms where there is more than one conditioning variable. Furthermore, due to this improvement\nwe require a weaker notion of faithfulness [6], as we only assume one triplet of variables per candidate\ncause. As a third point, our method does not assume causal suf\ufb01ciency - a common assumption which\nis, however, often violated in real datasets. Finally, although originally for completeness we assume\ni.i.d. samples, we prove in the suppl. 
that our method is robust against false positives when the i.i.d.\nassumption is violated (a common violation in real data).\nSufficient conditions for fast causal feature selection in large datasets. Our causal discovery\ntheorem imposes assumptions that are commonly met in real datasets where candidate variables have\none known cause. We proved that our proposed conditions, under Assumptions 1-5, are sufficient\nfor the identification of direct or indirect causes of a target variable. Thus, we can rule out that\nthe measured dependency between the causal variable and the response is due to a confounding\npath, even one through a hidden variable. However, our procedure may not identify all causes (see Fig. 5\nin suppl.). In simulations, our algorithm was applied successfully, with very low percentages\nof false positives in dense and large graphs. The robustness of our algorithm against confounders,\nalongside the linear scaling of complexity, renders it suitable for causal feature selection in large\ndatasets, where false acceptance is considered much more serious than false rejection.\nNot an instrumental variable approach. Note that although our assumption about the existence\nof a path from P to M (Assump. 4) resembles part of the definition of instrumental variables (IV)\n[32, 33], our method is not an IV approach. To apply our method, in contrast to IV, we do not assume any independence of\nvariable P from unobserved variables that may affect M and R as hidden confounders, nor do we\nassume the lack of a directed path from P to R that does not include M (\u201cexclusion restriction\u201d).\nIn our setting, we do not assume that the variables P_i are exogenous variables as in [34]. Note also\nthat the approach of [35] is not applicable here because [35] assume that none of the other observed\nvariables are descendants of the potential cause and the target variable. We do not have any prior\nknowledge of this kind, apart from the time order. 
Further, [35] need to search for a set of (possibly\nmultiple) variables to condition on, raising the known statistical problems of CI testing with\nseveral conditioning variables.\nNeurophysiological validity of results. Applying our proposed method to our EEG data\nyielded performance-specific causes across subjects, which are consistent with the known roles of\nphysiological \u03b1, \u03b2 and \u03b3 brain rhythms in upper-limb movements. In particular, \u03b2 activity has been\nfound significantly elevated in patients with motor disorders (tremors, slowed movements) such\nas Parkinson\u2019s disease [16, 36, 37]. Furthermore, in healthy subjects, elevated \u03b2-power has been\nfound to play an antikinetic role [37]. Our findings support this conclusion, as we found channels\nin the \u03b2 band to play a causal role for subjects that did not improve their motor performance. On\nthe other hand, increased \u03b3 activity over the motor cortices has been associated with large ballistic\nmovements [12, 13]. It has also been suggested to be prokinetic, given that it is increased during\nvoluntary movement [38]. Our findings appear in accordance with this conclusion, since our method\ndetected causal motor channels in the \u03b3 band, in subjects who managed to reduce their reaching times\nand improved their motor performance. Moreover, our detected causal channels in the ipsilateral\nhemisphere in the \u03b1-band are consistent with neurophysiological studies that report increased \u03b1-power\nover ipsilateral sensorimotor cortex during selection of movement [39]. Yet, no association of\n\u03b1-power and motor performance has been reported. Although there is no ground truth for comparing\nour neurophysiological results, the findings appear at least plausible given current understanding of\nthe aforementioned physiological brain rhythms in movement. 
Therefore, our method contributes to\nthe more detailed localisation of causal cortical electrode-areas.\nFinally, we want to emphasize the appropriate way of interpreting our neurophysiological results.\nSince EEG electrodes record mixtures of underlying neuronal activity and are, therefore,\nmacro-variables, one could argue about their adequacy as variables for causal inference [40]. In order to\nconsider EEG electrodes as appropriate causal candidates, we assume that the power measured at the\nelectrode level mostly reflects the cortical activity directly underneath. We can then interpret our causal\nfindings as the brain activity which plays a causal role for the motor performance we observe. This\ndetection of causal features sheds more light on the underlying cortical mechanism that acts during\nupper-limb movements. However, as it is still unknown how the stimulation current at a specific\nfrequency interacts with ongoing brain oscillations, there is not a one-to-one mapping between the\ncausal brain features and the stimulation targets. For example, as shown in [17], \u03b2-rhythms\nmay mediate the effect of \u03b3 stimulation on motor performance. In the chain stimulation parameters\n\u2192 brain activity \u2192 response, our causal method contributes to the second link; thus it narrows the\nquestion of personalised stimulation to stimulation parameters \u2192 detected causal brain activity.\nHence, the search for personalised stimulation parameters can be reduced to the detection of those\nthat up- or down-modulate the causal brain features which our algorithm identifies.\nContribution. 
We propose an algorithm and prove a theorem that allow the identification of direct or indirect\ncauses of a response variable, tailored to problems in which a cause of a candidate cause is known.\nThis can naturally happen in set-ups where two nodes constitute consecutive time stamps of a\nvariable\u2019s state in a system, and an edge from the previous to the present state can be assumed. The\nnumber of required CI tests is reduced to one targeted CI test per variable with one conditioning\nvariable. Therefore, the complexity of the algorithm scales linearly with the number of variables.\nFurthermore, we thus need a substantially weaker version of faithfulness. We also show why our\nmethod is robust against violation of the i.i.d. assumption, assuming we can model the time effect as\nan independent variable. Finally, applying our algorithm to EEG data gave results that are\nconsistent with current neuroscientific conclusions, a step towards personalized\nstimulation.\n\nAcknowledgments\n\nThe authors would like to thank Sebastian Weichwald and Mateo Rojas-Carulla for their interesting\nfeedback.\n\nReferences\n[1] Rajen D Shah and Jonas Peters. The hardness of conditional independence testing and the\ngeneralised covariance measure. arXiv preprint arXiv:1804.07203, 2018.\n\n[2] Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and\ntechniques. MIT press, 2009.\n\n[3] Judea Pearl. Causality: Models, reasoning, and inference, 2000.\n\n[4] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search,\nvolume 81. 1993.\n\n[5] Joseph Ramsey, Jiji Zhang, and Peter Spirtes. Adjacency-faithfulness and conservative causal\ninference. CoRR, abs/1206.6843, 2006.\n\n[6] Caroline Uhler, Garvesh Raskutti, Peter B\u00fchlmann, Bin Yu, et al. Geometry of the faithfulness\nassumption in causal inference. The Annals of Statistics, 41(2):436\u2013463, 2013.\n\n[7] Sebastian 
Weichwald, Timm Meyer, Ozan \u00d6zdenizci, Bernhard Sch\u00f6lkopf, Tonio Ball, and\nMoritz Grosse-Wentrup. Causal interpretation rules for encoding and decoding models in\nneuroimaging. NeuroImage, 110:48\u201359, 2015.\n\n[8] Moritz Grosse-Wentrup, Dominik Janzing, Markus Siegel, and Bernhard Sch\u00f6lkopf. Identification\nof causal relations in neuroimaging data with latent confounders: An instrumental variable\napproach. NeuroImage, 125:825\u2013833, 2016.\n\n[9] Anil K. Seth, Adam B. Barrett, and Lionel Barnett. Granger causality analysis in neuroscience\nand neuroimaging. Journal of Neuroscience, 35(8):3293\u20133297, 2015.\n\n[10] Nick Davis and Martijn Koningsbruggen. \u201cNon-invasive\u201d brain stimulation is not non-invasive.\nFrontiers in Systems Neuroscience, 7:76, 2013.\n\n[11] Randolph F. Helfrich, Christoph S. Herrmann, Andreas K. Engel, and Till R. Schneider. Different\ncoupling modes mediate cortical cross-frequency interactions. NeuroImage, 140:76\u201382, 2016.\n\n[12] Magdalena Nowak, Catharina Zich, and Charlotte J. Stagg. Motor cortical gamma oscillations:\nWhat have we learnt and where are we headed? Curr Behav Neurosci Rep, 136(5), 2018.\n\n[13] Suresh D. Muthukumaraswamy. Functional properties of human primary motor cortex gamma\noscillations. Journal of Neurophysiology, 104(5):2873\u20132885, 2010.\n\n[14] Svenja Espenhahn. The relationship between cortical beta oscillations and motor learning.\nDoctoral Thesis, University College London, 2018.\n\n[15] Alessandro Gulberti, Christian Karl Eberhard Moll, W. R. Hamel, Carsten Buhmann, Jacqueline\nKoeppen, Kai Boelmans, Simone Zittel, Christian Gerloff, Manfred Westphal, Tatyana Schneider,\nand Alexandra Engel. Predictive timing functions of cortical beta oscillations are impaired\nin Parkinson\u2019s disease and influenced by L-dopa and deep brain stimulation of the subthalamic\nnucleus. NeuroImage: Clinical, 9:436\u2013449, 2015.\n\n[16] Craig J. 
McAllister, Kim C. R\u00f6nnqvist, Ian M. Stanford, Gavin L. Woodhall, Paul L. Furlong,\nand Stephen D. Hall. Oscillatory beta activity mediates neuroplastic effects of motor cortex\nstimulation in humans. Journal of Neuroscience, 33(18):7919\u20137927, 2013.\n\n[17] Atalanti A. Mastakouri, Bernhard Sch\u00f6lkopf, and Moritz Grosse-Wentrup. Beta power may\nmediate the effect of gamma-tACS on motor performance. In Engineering in Medicine and\nBiology Conference (EMBC), July 2019.\n\n[18] Sarah Wiethoff, Masashi Hamada, and John C. Rothwell. Variability in response to transcranial\ndirect current stimulation of the motor cortex. Brain Stimulation, 7(3):468\u2013475, 2014.\n\n[19] Atalanti A. Mastakouri, Sebastian Weichwald, Ozan Ozdenizci, Timm Meyer, Bernhard\nSch\u00f6lkopf, and Moritz Grosse-Wentrup. Personalized brain-computer interface models for\nmotor rehabilitation. In Proceedings of the IEEE International Conference on Systems, Man,\nand Cybernetics (SMC 2017), 2017.\n\n[20] Judea Pearl. Reasoning and inference. Econometric Theory, page 2nd ed., 2009.\n\n[21] Jonas Peters, Dominik Janzing, and Bernhard Sch\u00f6lkopf. Elements of Causal Inference -\nFoundations and Learning Algorithms. Adaptive Computation and Machine Learning Series.\nThe MIT Press, Cambridge, MA, USA, 2017.\n\n[22] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search,\nvolume 81. 1993.\n\n[23] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Sch\u00f6lkopf. Measuring statistical\ndependence with Hilbert-Schmidt norms. In Sanjay Jain, Hans Ulrich Simon, and Etsuji Tomita,\neditors, Algorithmic Learning Theory, pages 63\u201377. Springer Berlin Heidelberg, 2005.\n\n[24] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Sch\u00f6lkopf. Kernel measures of\nconditional dependence. pages 489\u2013496, 2008.\n\n[25] Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Sch\u00f6lkopf. 
Kernel-based conditional\nindependence test and application in causal discovery. UAI conference, pages 804\u2013813, 2011.\n\n[26] Makoto Yamada, Wittawat Jitkrittum, Leonid Sigal, Eric P. Xing, and Masashi Sugiyama.\nHigh-dimensional feature selection by feature-wise kernelized lasso. Neural Computation,\n26(1):185\u2013207, 2014.\n\n[27] Le Song, Alexander Smola, Arthur Gretton, Karsten M. Borgwardt, and Justin Bedo. Supervised\nfeature selection via dependence estimation. Proceedings of the 24th International Conference\non Machine Learning, abs/0704.2668, 2007.\n\n[28] Moritz Grosse-Wentrup and Bernhard Sch\u00f6lkopf. High gamma-power predicts performance in\nsensorimotor-rhythm brain\u2013computer interfaces. Journal of Neural Engineering, 9(4):046001,\n2012.\n\n[29] Laura Fr\u00f8lich and Irene Dowding. Removal of muscular artifacts in EEG signals: a comparison\nof linear decomposition methods. Brain Informatics, 5(1):13\u201322, 2018.\n\n[30] Adel Belouchrani, Karim Abed-Meraim, J. F. Cardoso, and Eric Moulines. Second order blind\nseparation of temporally correlated sources. In Proc. Int. Conf. on Digital Sig. Proc., pages\n346\u2013351, 1993.\n\n[31] Brenton W. McMenamin, Alexander J. Shackman, Jeffrey S. Maxwell, David R.W. Bachhuber,\nAdam M. Koppenhaver, Lawrence L. Greischar, and Richard J. Davidson. Validation of ICA-based\nmyogenic artifact correction for scalp and source-localized EEG. NeuroImage, 49(3):2416\u20132432,\n2010.\n\n[32] Judea Pearl. On the testability of causal models with latent and instrumental variables.\nUncertainty in Artificial Intelligence, 11:435\u2013443, 1995.\n\n[33] Sander Greenland. An introduction to instrumental variables for epidemiologists. International\nJournal of Epidemiology, 29(4):722\u2013729, 2000.\n\n[34] Lin S. Chen, Frank Emmert-Streib, and John D. Storey. Harnessing naturally randomized\ntranscription to infer regulatory relationships among genes. 
Genome Biology, 8:R219,\n2007.\n\n[35] Doris Entner, Patrik Hoyer, and Peter Spirtes. Data-driven covariate selection for nonparametric\nestimation of causal effects. In Proceedings of the Sixteenth International Conference on\nArtificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research,\npages 256\u2013264, May 2013.\n\n[36] Peter Brown. Abnormal oscillatory synchronisation in the motor system leads to impaired\nmovement. Current Opinion in Neurobiology, 17(6):656\u2013664, 2007.\n\n[37] Preeya Khanna and Jose M Carmena. Beta band oscillations in motor cortex reflect neural\npopulation signals that delay movement onset. ELife, 6, 2017.\n\n[38] Peter Brown. Oscillatory nature of human basal ganglia activity: Relationship to the\npathophysiology of Parkinson\u2019s disease. Movement Disorders, 18(4):357\u2013363, 2003.\n\n[39] Loek Brinkman, Arjen Stolk, H. Chris Dijkerman, Floris P. de Lange, and Ivan Toni. Distinct\nroles for alpha- and beta-band oscillations during mental simulation of goal-directed actions.\nJournal of Neuroscience, 34(44):14783\u201314792, 2014.\n\n[40] Paul K. Rubenstein*, Sebastian Weichwald*, Stephan Bongers, Joris M. Mooij, Dominik\nJanzing, Moritz Grosse-Wentrup, and Bernhard Sch\u00f6lkopf. Causal consistency of structural\nequation models. Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence\n(UAI), abs/1707.00819, 2017. *equal contribution.", "award": [], "sourceid": 6822, "authors": [{"given_name": "Atalanti", "family_name": "Mastakouri", "institution": "Max Planck Institute for Intelligent Systems"}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": "MPI for Intelligent Systems"}, {"given_name": "Dominik", "family_name": "Janzing", "institution": "Amazon"}]}