{"title": "Neuropathic Pain Diagnosis Simulator for Causal Discovery Algorithm Evaluation", "book": "Advances in Neural Information Processing Systems", "page_first": 12793, "page_last": 12804, "abstract": "Discovery of causal relations from observational data is essential for many disciplines of science and real-world applications. However, unlike other machine learning algorithms, whose development has been greatly fostered by a large amount of available benchmark datasets, causal discovery algorithms are notoriously difficult to be systematically evaluated because few datasets with known ground-truth causal relations are available. In this work, we handle the problem of evaluating causal discovery algorithms by building a flexible simulator in the medical setting. We develop a neuropathic pain diagnosis simulator, inspired by the fact that the biological processes of neuropathic pathophysiology are well studied with well-understood causal influences. Our simulator exploits the causal graph of the neuropathic pain pathology and its parameters in the generator are estimated from real-life patient cases. We show that the data generated from our simulator have similar statistics as real-world data. As a clear advantage, the simulator can produce infinite samples without jeopardizing the privacy of real-world patients. Our simulator provides a natural tool for evaluating various types of causal discovery algorithms, including those to deal with practical issues in causal discovery, such as unknown confounders, selection bias, and missing data. 
Using our simulator, we have evaluated extensively causal discovery algorithms under various settings.", "full_text": "Neuropathic Pain Diagnosis Simulator for\nCausal Discovery Algorithm Evaluation\n\nKTH Royal Institute of Technology\n\nCarnegie Mellon University\n\nRuibo Tu\n\nruibo@kth.se\n\nKun Zhang\n\nkunz1@cmu.edu\n\nBo Christer Bertilson\nKarolinska Institute\nbo.bertilson@ki.se\n\nHedvig Kjellstr\u00f6m\n\nKTH Royal Institute of Technology\n\nhedvig@kth.se\n\nCheng Zhang\n\nMicrosoft Research, Cambridge\nCheng.Zhang@microsoft.com\n\nAbstract\n\nDiscovery of causal relations from observational data is essential for many dis-\nciplines of science and real-world applications. However, unlike other machine\nlearning algorithms, whose development has been greatly fostered by a large\namount of available benchmark datasets, causal discovery algorithms are notori-\nously dif\ufb01cult to be systematically evaluated because few datasets with known\nground-truth causal relations are available. In this work, we handle the problem of\nevaluating causal discovery algorithms by building a \ufb02exible simulator in the medi-\ncal setting. We develop a neuropathic pain diagnosis simulator, inspired by the fact\nthat the biological processes of neuropathic pathophysiology are well studied with\nwell-understood causal in\ufb02uences. Our simulator exploits the causal graph of the\nneuropathic pain pathology and its parameters in the generator are estimated from\nreal-life patient cases. We show that the data generated from our simulator have\nsimilar statistics as real-world data. As a clear advantage, the simulator can pro-\nduce in\ufb01nite samples without jeopardizing the privacy of real-world patients. Our\nsimulator provides a natural tool for evaluating various types of causal discovery\nalgorithms, including those to deal with practical issues in causal discovery, such\nas unknown confounders, selection bias, and missing data. 
Using our simulator,\nwe have evaluated extensively causal discovery algorithms under various settings.\n\n1\n\nIntroduction\n\nMany real-life decision-making processes require an understanding of underlying causal relations. For\nexample, understanding the cause of symptoms is essential for physicians to make correct treatment\ndecisions; understanding the cause of observed environmental changes is critical to take action\nagainst global warming. However, it is generally infeasible or even impossible to do interventions\nor randomized experiments to verify these causal relations. Therefore, causal discovery from\nobservational data has attracted much attention [29, 31, 40, 49].\nHowever, the evaluation of causal discovery algorithms has been a challenge [3]. The great application\ndemand also indicates that ground-truth causal relations in a complex scenario are often unknown to\nhumans. The lack of systematic evaluations of causal discovery algorithms has hindered both the\ndevelopment of the \ufb01eld and the impact of these algorithms on solving real-life problems. Research-\nwise, it is hard to identify the advantages and disadvantages of causal discovery algorithms performing\nin real-world scenarios. A systematic way to evaluate causal discovery algorithms is pressing.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOther machine learning disciplines such as supervised learning and reinforcement learning have made\ngreat success in real-world applications such as image classi\ufb01cation [34, 45] and speech recognition\n[2]. An important driving factor for their fast development and great success is the existence of a\nlarge amount of benchmark datasets for systematic evaluation. The benchmark datasets can be in\nthe form of large-scale labeled and publicly available datasets such as [13, 22], which are commonly\nused for supervised and unsupervised learning. 
They can also be in the form of synthetic data that\nare generated from simulators, e.g. an autonomous driving simulator [4], an agent motion [5], and\na gaming environment [19]. Such simulators accelerate the development of reinforcement learning\nalgorithms and promote usage in real-life applications.\nEstablishing benchmark datasets for the evaluation of causal discovery algorithms will naturally\naccelerate the development of this research discipline and increase its real-world impact. However, it\nis dif\ufb01cult to collect such datasets with known ground-truth because underlying real-world causal\nrelations are usually highly complex. Fortunately, domain experts in disciplines such as biology and\nphysics can provide information about well-understood causal in\ufb02uences in some speci\ufb01c scenarios.\nThis gives us opportunities to utilize domain knowledge to reveal ground-truth causal relations and\nbuild realistic simulators. In this way, we can generate data from simulators and use such benchmark\ndatasets for the evaluation of causal discovery algorithms.\nIn this work, we present a neuropathic pain diagnosis simulator for evaluating causal discovery\nalgorithms. As one of the most important healthcare issues, neuropathic pain is well-studied in\nbio-medicine and has well-understood causal in\ufb02uences. By de\ufb01nition, neuropathic pain is caused\nby disease or injury of the nervous system. It includes various chronic conditions that, together,\naffect up to 8% of the population. The prevalence of neuropathic pain increased to 60% in those with\nsevere clinical neuropathy [9]. We build a simulator based on the causal relations in neuropathic\npain diagnoses. Given the causal relations, we estimate the parameters of the corresponding causal\ngraph using a small cohort of anonymous real-world clinical records to generate simulated data. 
Our\nsimulator not only provides the simulated data and the ground-truth causal relations for evaluating\ncausal discovery algorithms but also builds a bridge between machine learning and neuropathic\npain diagnoses. In summary, our contribution is a neuropathic pain diagnosis simulator. Specifically:\n\u2022 It represents a complex real-world scenario with more than 200 variables and around 800\nwell-defined causal relations. It can also generate any amount of data without jeopardizing\nthe security or privacy of patients\u2019 data (Section 2).\n\u2022 Our simulator can produce data indistinguishable from real-world data. We have verified the\nsimulation quality using both medical expertise and statistical evaluation (Section 3).\n\u2022 Our simulator is flexible and can be used to generate data with different practical issues,\nsuch as confounding, selection bias, and missing data (Section 2.3 and Section 4).\n\u2022 We have evaluated major causal discovery algorithms, including PC [40], Fast Causal\nInference (FCI) [40], and Greedy Equivalence Search (GES) [6] with simulated data under\ndifferent settings (Section 4).\n\n2 Neuropathic Pain Simulator\n\nIn this section, we introduce our neuropathic pain diagnosis simulator.1 We first show essential\ncausal relations in the neuropathic pain diagnosis, and then present details of the simulator design.\nFinally, we discuss some open problems in causal discovery and how to use our simulator to simulate\ninstances of such problems.\n\n2.1 Causal Relations for Neuropathic Pain Diagnosis\n\nNeuropathic pain diagnoses mainly contain symptom diagnosis, pattern diagnosis, and pathophysiological\ndiagnosis. For example, Table 1a shows a typical neuropathic pain diagnostic record. Symptom\ndiagnosis describes the discomfort of patients. Pattern diagnosis identifies symptom patterns. In\nneuropathic pain diagnosis, it identifies which set of nerves does not work properly. 
Such a condition\nis commonly called radiculopathy. The main tool of pattern diagnosis is the dermatome map as\nshown in Figure 1. Pathophysiological diagnosis refers to the original cause of symptoms where\n\n1 The simulator is available at https://github.com/TURuibo/Neuropathic-Pain-Diagnosis-Simulator.\n\nTable 1: Diagnostic records and dataset.\n\n(a) A typical neuropathic pain diagnostic record. \"L\" and \"R\" stand for \"left\" and \"right\".\n\nSymptom diagnosis: R back thigh discomfort, R knee discomfort,\nL knee thigh discomfort, Patellofemoral pain syndrome\nPattern diagnosis: L L5 Radiculopathy, R L5 Radiculopathy\nPathophysiological diagnosis: Discoligamentous injury L4-5\n\n(b) Given many patient records, a diagnostic record dataset takes the following form. \"ID\" identifies\npatients. \"DLI\" and \"Radi\" stand for discoligamentous injury and radiculopathy. Each row is a patient\u2019s\ndiagnostic record in which \"1\" means that the diagnostic label is present and \"0\" means that it is absent.\n\nID   DLI C1-C2   DLI C2-C3   ...   L C5 Radi   ...   R knee   L neck   ...\n1    0           0           ...   1           ...   1        0        ...\n2    1           0           ...   0           ...   0        1        ...\n...  ...         ...         ...   ...         ...   ...      ...      ...\nn    0           1           ...   0           ...   0        0        ...\n\nFigure 1: Dermatome map (image source [1]) shows surface regions of different nerves.\n\n[Figure 2 node labels. Pathophysiology: DLI C4-C5. Pattern: L C5-Radi, R C5-Radi. Symptom: L neck,\nL front shld, Interscapular, L shld, L shld im, L arm, L lateral arm, L upper arm, L elbow,\nL upper elbow, L lateral elbow.]\n\nFigure 2: Typical structure of the ground-truth\ncausal graph. \"DLI\" and \"Radi\" represent discoligamentous\ninjury and radiculopathy. \"shldr\" and\n\"im\" stand for shoulder and impingement. \"L\" and\n\"R\" stand for left and right. 
We show the left side\nsymptoms, and the corresponding connections are\nthe same on the right side.\n\ndiscoligamentous injury is the most common factor in the neuropathic pathophysiological diagnosis.\nGiven a set of patient data, we can present the data as in Table 1b, where 1 indicates that the diagnostic\nlabel exists in a patient record and 0 otherwise.\nIn neuropathic pain diagnoses causal relations are well studied in biomedical research [27, 43]. In\ngeneral, neuropathic pain symptoms in symptom diagnosis are mainly caused by radiculopathies\n(Radi) in the pattern diagnosis, and the radiculopathy is mostly caused by discoligamentous injuries\n(DLI) in the pathophysiological diagnosis. For example, some of the causal relations are shown in\nFigure 2. DLI C4-C5 causes left side C5 radiculopathy and right side C5 radiculopathy. Left side C5\nradiculopathy further causes symptoms at the left front shoulder, the left lateral arm, etc. We see that\nthese locations are consistent with the dermatome map in Figure 1. Despite that there are other causes\nof neuropathic pain symptoms and radiculopathies such as tumors and diabetes, they rarely appear\nin primary care. Therefore, we focus on the causal relations among the discoligamentous injuries,\nradiculopathies, and neuropathic pain symptoms in this work.\nThe complete causal relations are summarized in Appendix A, and we further provide interactive\ncausal graph visualization at: https://cutt.ly/BekNFSy. The causal graph is similar to Figure 2\n\n3\n\n\fand consists of three layers: Symptom diagnosis, pattern diagnosis, and pathophysiological diagnosis.\nNodes in each layer have no connection with each other. Arrows either point from nodes in the\npathophysiological diagnosis layer to nodes in the pattern diagnosis layer or from nodes in the pattern\ndiagnosis layer to nodes in the symptom diagnosis layer. 
The causal graph also contains different\nd-separations such as the folk structure, denoted by ^ structure (e.g., Left C5 Radiculopathy \nDiscoligamentous injury C4-C5 ! Right C5 Radiculopathy), the collider structure, denoted by _\nstructure (e.g., Left C5 Radiculopathy ! Left neck pain Left C4 Radiculopathy), and the chain\nstructure (e.g., Discoligamentous injury C4-C5 ! Left C5 Radiculopathy ! Left Neck pain).\n2.2 Neuropathic Pain Diagnosis Simulator\n\nWith the domain knowledge mentioned in Section 2.1, we create our simulator to generate patient\ndiagnostic records.\n\nReal-world diagnostic records. To make our generated records close to the real-world scenario,\nwe learn parameters from a dataset including 141 patient diagnostic records [46]. 2 These patients\u2019\ndiagnostic records are represented as a table of binary variables as shown in Table 1b. The variables in\nthe pathophysiological diagnosis consist of the craniocervical junction injury and 26 discoligamentous\ninjuries; the variables in the pattern diagnosis include 52 radiculopathies; the variables in the symptom\ndiagnosis contain 143 symptoms. Similar to the real-world diagnostic records, the columns of\ngenerated records are the mentioned variables and the rows represent the synthetic patients.\n\nParameter estimation of the causal graph. We estimate the Conditional Probability Distribution\n(CPD) of each variable given its parents in the causal graph with the real dataset. We compute the\nCPD of a variable X by P (X | P a(X)) = P (X,P a(X))\nP (P a(X)) , where P a(X) represents the parents of X\nin the causal graph. Since variables are binary, the joint distributions can be computed using the\nnumber of variable values in the dataset. However, we cannot estimate the CPDs accurately for the\nvariables with many parents because of the curse of dimensionality and the limited number of the real\ndata. 
Therefore, instead of computing the CPD of $X$ given all its parents, we introduce the heuristic\n\n$P(X = 1 \\mid Pa(X) = c) \\approx \\max_{i \\in I_1} P(X = 1 \\mid Pa_i(X) = c_i)$,    (1)\n\nwhere $c$ is a given vector of parent values (each entry is either zero or one), and $I_1$ is a\nsubset of the indices of all variables in $Pa(X)$ such that for $\\forall i \\in I_1$, $Pa_i(X) \\in Pa(X)$ and $c_i = 1$.\nThe condition of Equation 1 is that there exists $c_i \\in c$ such that $c_i = 1$. This condition is satisfied in\nthe real data. Given the parent values $c$, we only consider the parents taking the value one, and take\nthe maximum conditional probability of $X = 1$ given a parent taking the value one in $c$ to estimate\nthe CPD $P(X = 1 \\mid Pa(X) = c)$.\nThis approximation is supported by medical insights. Intuitively, if a symptom is caused by\nmultiple nerves, the chance for the symptom to exist is in general higher when these causes occur at the\nsame time than when only one of the causes occurs. For example, both L4 and L5 radiculopathies\ncan cause knee pain. The chance that a person with both L4 and L5 radiculopathies feels knee\npain is higher than or equal to the chance that a person with either one of the radiculopathies feels knee\npain. In other words, $P(X = 1 \\mid Pa_1(X) = 1, Pa_2(X) = 1) \\geq P(X = 1 \\mid Pa_1(X) = 1)$ and\n$P(X = 1 \\mid Pa_1(X) = 1, Pa_2(X) = 1) \\geq P(X = 1 \\mid Pa_2(X) = 1)$, where $Pa_1(X)$ and $Pa_2(X)$\nare L4 and L5 radiculopathies and $X$ is knee pain.\nGiven all the conditional probability and marginal probability distributions, we use ancestral sampling\nto sample neuropathic pain diagnosis data of synthetic patients.\n\n2.3 Simulating Data with Practical Issues of Causal Discovery\n\nCausal discovery faces many practical issues when applied in real-world applications. Our\nsimulator has many advantages over real datasets in evaluating causal discovery algorithms in the\npresence of these challenges. 
In this section, we introduce how to use our simulator to generate\ndatasets exhibiting different open problems. In Section 4 we show experimental results of applying\ncausal discovery algorithms to these simulated data reflecting different real-world problems.\n\n2 The dataset is collected in a hospital department specialized in neuropathic pain [46]. Only Ruibo Tu and\nBo C. Bertilson had access to the dataset during the course of the project.\n\nUnmeasured Confounding. Most causal discovery algorithms assume that all variables of concern\nare observed. However, in most real-life applications collected datasets may not cover all\nfactors needed to discover the causal relations of interest. If there is an unobserved common direct cause of two\nor more observed variables, this may produce wrong causal conclusions. This problem is known as\nunmeasured confounding, which is one of the common issues that one is faced with when applying\ncausal discovery algorithms. Addressing unmeasured confounding is an active research direction\n[18, 20, 28, 40, 47].\nThere are many ways for our simulator to generate datasets with unmeasured confounding. We can\ndelete the data of parent nodes in a fork ($\\wedge$) structure. More specifically, deleting the simulated data of the\npathophysiology diagnosis and the pattern diagnosis variables leads to confounding in the dataset\nbecause they have at least two direct effects. We can also introduce external variables as confounders\nin the data generation process. For example, we can add patients\u2019 occupation as a confounder which\nis not included in the given causal graph. The occupation affects daily activities and thereby increases\nthe risk of injuring different parts of the spine. With such datasets, we can evaluate how unmeasured\nconfounding influences the results of causal discovery algorithms and hopefully develop new and\nbetter algorithms to address this issue.\n\nSelection bias. 
Selection bias is an important issue in learning causal structures from real-world\nobservational data. In practice, it is a common scenario where the data collection process is in\ufb02uenced\nby some attributes of variables. For example, samples in a dataset are not drawn randomly from\nthe population, but from the people who have higher degrees than a bachelor\u2019s degree. Then, the\nselection variable is whether a person has a higher degree than a bachelor\u2019s degree. Such selection\nbias is non-trivial to be removed from the collected dataset and may introduce erroneous causal\nrelations in the results of causal discovery algorithms. Few methods have been developed to address\nthis issue [11, 12, 39, 47, 48]. We can also introduce selection bias to the simulated data. We \ufb01rst\nchoose variables which the selection depends on, and then remove or maintain records based on the\nvalues of the chosen variables in the simulated dataset.\n\nMissing data. Missing data is a ubiquitous issue, especially in healthcare. It is common to classify\nmissingness mechanisms into Missing Completely At Random (MCAR), Missing At Random (MAR),\nand Missing Not At Random (MNAR) [32]. Among them, MAR and MNAR may introduce wrong\ncausal conclusions if one simply deletes the data with missing entries, and applies causal discovery\nalgorithms to the deleted complete dataset. Thus, methods that can handle different missingness\nmechanisms are much in demand for causal discovery [23, 24, 38, 42, 44].\nUsing our simulator, we can easily generate data with different missingness mechanisms. We can\nintroduce missingness indicators to our causal graph. 
We then introduce causal relations between\nmissingness indicators and substantive variables, depending on the missingness mechanism wanted.\nIn the end, we sample the missingness indicators and mask out the data according to the values of\nmissingness indicators.\n\n3 Simulation Quality\n\nWe now evaluate whether generated data from our simulator have the similar property to the real-world\ndata. We examine the quality of our simulated data by medical experts and statistical analysis.\n\n3.1 Physician Evaluation\n\nTo examine the quality of our simulated data, we mix 50 simulated records with 50 records sampled\nfrom the real-world dataset. We then ask a physician specialized in neuropathic pain diagnoses to\nrate the 100 mixed records with the following score system:\n\n\u2022 Score 1: This is not likely to be a real patient (possible but never see such patient before);\n\u2022 Score 2: This is likely to be a real patient but is not very common (similar cases have\n\nhappened before but rarely);\n\n5\n\n\f(a) Real data variables marginal distribution\n\n(b) Simulated data variables marginal distribution.\n\n(c) Co-occurrence matrix of the real dataset.\n\n(d) Co-occurrence matrix of the simulated dataset.\n\nFigure 4: Comparison of the marginal distributions and the co-occurrence matrices of the real and\nsimulated datasets. The orders of variables are the same in Panel (a) and (b). In Panel (c) and (d), the\nred color represents pathophysiological diagnosis, the blue color represents pattern diagnosis, and the\nyellow color represents symptom diagnosis.\n\n\u2022 Score 3: This is a common patient (similar cases show up time by time);\n\u2022 Score 4: This is a typical patient (similar cases show up very often).\n\nThe physician evaluates the 100 records without knowing\nthe source of the records (the simulator or the real dataset).\nFigure 3 shows the physician\u2019s evaluation results of the\nreal and the synthetic data. 
For the synthetic data, the number of records\nincreases with the score, which\nis expected given our score design. The simulator generates\nfewer unlikely diagnostic records than appear in the\nreal dataset, which may be due to the missing and noisy\nlabels in the real-world data. Also, when one or two unlikely\ndiagnostic labels are generated among many likely\ndiagnostic labels in a record, the physician considers the\ncase as \"likely\". This case happens more often in the simulated\ndata than in the real-world data. In general, the result shows\nthat the physician cannot distinguish the generated data from\nthe real-world data. Also, the simulated data follow the desired frequency (increased numbers for\nhigher scores) from the physician evaluation.\n\nFigure 3: Physician\u2019s evaluation results\nof 50 real data and 50 simulated data. [Bar chart; x-axis: Score (1 to 4); y-axis: Count (0 to 30); series: Real, Synthetic.]\n\n3.2 Data Properties\n\nWe compare the marginal probability distributions of the same variables in the real dataset and the\nsimulated dataset as shown in Figure 4a and Figure 4b. The comparison shows that the marginal probability distributions\nof variables in both datasets are similar.\nWe use the co-occurrence matrix normalized by the sample size to show the relation between each\npair of variables in Figure 4c and Figure 4d. For example, the upper left corner of the co-occurrence\nmatrices represents the relations between the variables in the pathophysiological diagnosis and the\npattern diagnosis. We find that the pattern of the simulated data is similar to that of the real data.\nIn our simulator, we impose no constraints on the relations between variables on opposite sides of the body, e.g. it is\npossible to have a connection between left C5 radiculopathy and right neck pain in the graph. 
We\nalso compare the correlation matrices in Appendix B.\n\nTable 2: Results of causal discovery algorithms using the real dataset and the simulated dataset with\nthe same sample size. \"CauAcc\" and \"Sim\" represent \"Causal Accuracy\" and \"Simulated\".\n\n      CauAcc                      F1            Recall        Precision\n      PC     FCI    RFCI   GES    PC     GES    PC     GES    PC     GES\nReal  0.041  0.024  0.021  0.038  0.044  0.037  0.025  0.022  0.187  0.199\nSim   0.038  0.023  0.016  0.063  0.047  0.076  0.025  0.043  0.425  0.377\n\nTable 3: Results of different causal discovery algorithms with different sample sizes. The performance\nis better when causal accuracy and F1 score have larger values.\n\nSample size  128    256    512    1024   2048   4096   8192   16384\nF1 PC        0.019  0.028  0.016  0.040  0.066  0.100  0.142  0.188\nF1 GES       0.042  0.083  0.120  0.150  0.173  0.217  0.261  0.325\nCauAcc PC    0.009  0.012  0.009  0.020  0.031  0.048  0.066  0.094\nCauAcc GES   0.020  0.045  0.067  0.085  0.105  0.134  0.162  0.230\nCauAcc RFCI  0.021  0.023  0.027  0.033  0.036  0.041  0.053  0.070\nCauAcc FCI   0.026  0.029  0.034  0.039  0.045  0.051  0.062  0.082\n\n4 Evaluating Causal Discovery Algorithms with Proposed Simulator\n\nWe evaluate major causal discovery algorithms with datasets generated from our simulator. We\nfirst further evaluate the simulation quality by comparing the causal discovery results of baseline\nmethods between a real-world dataset and a simulated dataset. One advantage of the simulator is\nthat we can generate any amount of data. Thus, we can evaluate causal discovery algorithms with\ndifferent sample sizes to show the asymptotic property of causal discovery algorithms. Next, we apply\ncausal discovery algorithms to the simulated datasets with different practical issues: unmeasured\nconfounding, selection bias, and missing data.\nWe use the causal discovery algorithms implemented by Tetrad [41]. 
In the experiments the causal\ndiscovery algorithms comprise: Constraint-based methods, PC [40], FCI [40], and RFCI [10]; score-\nbased method, GES [6]. PC and GES output Complete Partially Directed Acyclic Graph (CPDAG),\nwhile FCI and RFCI output Partial Ancestral Graph (PAG). We use the F1 score and causal accuracy\n[7] as the evaluation metrics. Results of other metrics such as Structural Hamming Distance (SHD),\nprecision, and recall are shown in Appendix .\n\nComparison between simulated and real data. We sample 141 patient records from our simulator\nwith the same sample size as the real-world dataset. We apply causal discovery algorithms to both\ndatasets. The results are shown in Table 2. We \ufb01nd that the causal accuracies and F1 scores of both\ndatasets are similar and the algorithms in the table cannot recover most edges of the ground-truth\ncausal graph. The reason might be that the real dataset has a small sample size 141 compared with\nthe number of nodes and edges in the causal graph. Moreover, Figure 4a shows that the appearance\nfrequencies of diagnostic labels in the real dataset decay exponentially, which means that many\ndiagnostic labels only appear in few patient diagnostic records. This is especially dif\ufb01cult for these\nmethods because they are based on conditional independence tests that require suf\ufb01cient samples.\nFurthermore, we \ufb01nd that the recall rates of PC on both datasets are similar and the precision rate of\nPC on the simulated dataset is higher than the precision rate on the real dataset. The reason might\nbe that we generate values of a variable only based on the values of its parents. Consequently, our\nsimulator can cancel out the in\ufb02uence of unknown confounders, such as the age and the occupation\nof a patient, and other practical issues in the real dataset. We also \ufb01nd that GES bene\ufb01ts relatively\nmore than other methods from such property of the simulated dataset.\n\nSample size. 
To show the influence of the sample size, we generate simulated datasets with sample\nsize 128, 256, 512, 1024, 2048, 4096, 8192, and 16384. Under certain assumptions, these methods\nare asymptotically correct when infinite data are available. Table 3 shows that the performance of the\nalgorithms improves as the sample size increases, when there is no selection bias, unknown\nconfounders, or missing values. However, none of these methods is sample efficient, as the F1 score\nand causal accuracy are still low and have not saturated even with 16384 data points. Thus, developing\nsample efficient causal discovery algorithms is needed, especially when real-life data are costly.\n\nConfounding. We generate simulated data with external variables as confounders (see Appendix C\nfor details). We compare the performance of FCI and RFCI on the dataset containing unknown confounders\nwith that on the dataset without confounders. The sample size of both datasets is 1024. The causal accuracy\nof FCI and RFCI is 0.033 and 0.030 on the dataset with unknown confounders, and 0.039 and 0.033 on the dataset without\nunknown confounders. The results of the FCI algorithms on the dataset with unknown confounders\nare only slightly worse than those on the dataset without unknown confounders because the FCI algorithms consider the\nunknown confounders and output a Partial Ancestral Graph (PAG) that provides information about\npotential unknown confounders. However, the performance is still far from ideal. We also generate confounding data by\ndeleting all the data of some common parents in the causal graph. The results are shown in Appendix C.\n\nSelection bias. We choose both sides of\nC6, C7, L5, and S1 radiculopathy as the\ncauses of a selection variable. We then delete\nsimulated records according to the values of\nthe selection variable. We interpret this setting\nas a situation where the patients without\nthose radiculopathies hardly ever go to the\nhospital; thus, the hospital hardly collects their data. 
Table 4 shows the results on the dataset with\nselection bias and the reference one without selection bias. RFCI is more robust to selection bias\nthan FCI, even though both should be able to handle it by design. Among the algorithms that do not account for\nselection bias, GES achieves higher causal accuracy than PC.\n\nTable 4: Results of different causal discovery methods\nin the presence of selection bias.\n\n              FCI    RFCI   PC     GES\nCauAcc        0.039  0.039  0.031  0.109\nCauAcc (ref)  0.046  0.037  0.033  0.114\n\nMissing data. We evaluate the performance on all three missingness mechanisms: MCAR,\nMAR, and MNAR. We generate missing values in the dataset according to the definition in [23].\nTo generate the data that are MCAR, missingness follows a\nBernoulli distribution with the missingness probability 0.0007. To generate the data that are MAR,\nwe choose variables in the pattern diagnosis as the causes of missingness indicators and variables in\nthe pathophysiological diagnosis and the symptom diagnosis as the variables with missing values.\nLikewise, to generate the data that are MNAR, the variables with missing values are chosen from\namong all the variables in the causal graph. Since FCI, PC, and GES cannot deal with a dataset\ncontaining missing values, we delete the records containing any missing value and input the remaining\ncomplete dataset. The sample size of the remaining complete dataset is 7042. 
As a reference, we\ncreate a simulated dataset whose sample size is 7042 without missing values.\nTable 5 shows that the results of MAR and\nMNAR experiments are worse than the results\nof MCAR experiments, which are close\nto the reference one without missing values.\nThis is expected as [44] shows: when the\ndata are MCAR, causal discovery results are\nasymptotically correct; when the data are\nMAR or MNAR, these algorithms may produce\nerroneous edges in the case where the\nmissingness indicators are the common children\nor descendants of the common children\nof the concerned variables. We further check\nthe number of missingness indicators satisfying\nthis condition: 4 in MNAR and 7 in\nMAR out of a total of 52 missingness indicators.\n\nTable 5: Results of applying causal discovery algorithms\nto the MCAR, MAR, and MNAR datasets.\n\n               FCI    RFCI   PC     GES\nCauAcc MNAR    0.059  0.051  0.061  0.154\nCauAcc MAR     0.063  0.049  0.050  0.135\nCauAcc MCAR    0.066  0.055  0.067  0.161\nCauAcc ref     0.062  0.050  0.059  0.145\nF1 MNAR        X      X      0.133  0.251\nF1 MAR         X      X      0.132  0.241\nF1 MCAR        X      X      0.141  0.256\nF1 ref         X      X      0.156  0.253\n\n5 Related Work\n\nThe evaluation of causal discovery algorithms mainly consists of synthetic and real data experiments.\nSynthetic data are mostly sampled from randomly generated graph structures, or based on models\nproposed in different works. Such synthetic data experiments can show the superior performance\nof proposed methods but sometimes may oversimplify the challenges in real-world scenarios [15].\nUnfortunately, there are few available real-world datasets for evaluating causal discovery algorithms.\nMooij et al. [25] provided a set of cause-effect pairs with ground-truth causal relations. However,\nthe cause-effect pairs can be used for the evaluation of only a limited range of causal discovery methods\nsuch as the Linear Non-Gaussian Acyclic Model (LiNGAM) [37]. 
Also, a dataset containing only\npair-wise data is not complex enough to evaluate causal discovery algorithms in real-world scenarios.\nSeveral other datasets from genomics [30, 35, 14] and health-care [44] contain causal relations among\nmultiple variables and are commonly used for evaluation; however, few pairs of ground-truth\ncausal relations are known or labeled by domain experts, and the evaluation is not systematic. Therefore,\nit is necessary to develop causal discovery benchmarks for real-world evaluation.\nTo fill the gap between synthetic and real data evaluation [17], a simulator set in the context\nof a real-world application is needed. Glymour et al. [17] discussed the evaluation of search tasks,\nespecially causal discovery, and concluded that simulation is a desirable way to evaluate research in\nthis direction. Despite this argument, [17] did not build a concrete simulator. Very recently, a few\nsimulators for causal discovery evaluation have been developed, especially for time-series\ndata. Sanchez-Romero et al. [36] generated simulated fMRI data over time with a focus on the\nsituation where feedback loops exist. Runge et al. [33] provided ground-truth time-series datasets by\nmimicking properties of real climate and weather datasets. However, these simulators are still limited\nin how much of the complexity of real-world causal discovery they reflect and are not suitable for evaluating\ncausal discovery methods for static data.\nIn machine learning, there are many simulators built for other disciplines. For example, reinforcement\nlearning benefits from simulators covering practical issues in different applications [8, 5, 19].\nSome of them are used for evaluating sequential decision making by considering counterfactual\noutcomes. Oberst and Sontag [26] simulated data about treating sepsis among intensive care unit\n(ICU) patients. 
The data consist of vital signs, treatment options, and the final mortality with a fully\nspecified underlying Markov Decision Process. Another simulator [16] is used for evaluating the\nestimation of the treatment response over time [21]. Geng et al. [16] provided the dynamics of the\ntumor volume and its relation with chemotherapy, tumor growth, and radiation. Given parameters of\nthe dynamic equations, Lim [21] simulated data satisfying this domain knowledge and introduced\npractical issues such as unmeasured confounding. However, these simulators\nadvance research on estimating treatment response over time, not causal discovery.\n\n6 Discussion\n\nIn this work, we build a simulator in the neuropathic pain diagnosis setting for evaluating causal\ndiscovery algorithms. Our simulator is based on ground-truth causal relations from domain\nknowledge, and its parameters are estimated from a real-world dataset. It contains 222 nodes and\n770 edges, which pose complex real-world challenges. Our simulator can generate any number\nof synthetic records that physicians judge to be indistinguishable from real-world records. The\nsimulator can also simulate practical issues in causal discovery research such as missing data, selection\nbias, and unknown confounding. We demonstrated how to evaluate causal discovery algorithms using\nour simulator under different challenges.\nOur simulator contributes not only to causal discovery research but also to machine learning in healthcare\nresearch, where public data are extremely scarce due to privacy concerns. In the future, we will refine\nour simulator to consider broader scenarios. At the same time, we will seek further opportunities to\nbuild different simulators for causal discovery evaluation and machine learning in healthcare research.\n\nAcknowledgements. Kun Zhang would like to acknowledge the support by National Institutes of\nHealth under Contract No. 
NIH-1R01EB022858-01, FAINR01EB022858, NIH-1R01LM012087,\nNIH-5U54HG008540-02, and FAIN-U54HG008540, by the United States Air Force under Contract\nNo. FA8650-17-C-7715, and by National Science Foundation EAGER Grant No. IIS-1829681.\nThe National Institutes of Health, the U.S. Air Force, and the National Science Foundation are not\nresponsible for the views reported in this article. In addition, the authors thank Akshaya Thippur\nSridatta and Tino Weinkauf for their help with the audio dubbing of the 3-minute introduction video at\nhttps://youtu.be/1UvVnIbjSX8 and the visualization of the causal graph.\n\nReferences\n\n[1] Dermatone map source.\nhttps://i.pinimg.com/736x/ef/76/47/ef7647ceae98d10588f14b4ecd7e6a89.jpg.\n\n[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper,\nB. Catanzaro, Q. Cheng, G. Chen, et al. Deep speech 2: End-to-end speech recognition in\nenglish and mandarin. In International conference on machine learning, pages 173\u2013182, 2016.\n\n[3] E. Bareinboim, I. Guyon, D. Blei, N. Meinshausen, C. Szepesv\u00e1ri, S. Magliacane, and Y. Bengio.\nPanel discussion on datasets and benchmarks for causal learning.\nhttps://www.youtube.com/watch?v=QaoijubZTTA, 2008.\n\n[4] A. Bewley, J. Rigley, Y. Liu, J. Hawke, R. Shen, V.-D. Lam, and A. Kendall. Learning to drive\nfrom simulation without real world labels. arXiv preprint arXiv:1812.03823, 2018.\n\n[5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba.\nOpenai gym. arXiv preprint arXiv:1606.01540, 2016.\n\n[6] D. M. Chickering. Optimal structure identification with greedy search. Journal of machine\nlearning research, 3(Nov):507\u2013554, 2002.\n\n[7] T. Claassen and T. Heskes. A bayesian approach to constraint based causal inference. arXiv\npreprint arXiv:1210.4866, 2012.\n\n[8] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. Quantifying generalization in\nreinforcement learning. 
arXiv preprint arXiv:1812.02341, 2018.\n\n[9] L. Colloca, T. Ludman, D. Bouhassira, R. Baron, A. H. Dickenson, D. Yarnitsky, R. Freeman,\nA. Truini, N. Attal, N. B. Finnerup, et al. Neuropathic pain. Nature reviews Disease primers,\n3:17002, 2017.\n\n[10] D. Colombo, M. H. Maathuis, M. Kalisch, and T. S. Richardson. Learning high-dimensional\ndirected acyclic graphs with latent and selection variables. The Annals of Statistics, pages\n294\u2013321, 2012.\n\n[11] J. D. Correa and E. Bareinboim. Causal effect identification by adjustment under confounding\nand selection biases. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.\n\n[12] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory.\nIn International conference on algorithmic learning theory, pages 38\u201353. Springer, 2008.\n\n[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical\nimage database. In 2009 IEEE conference on computer vision and pattern recognition, pages\n248\u2013255. IEEE, 2009.\n\n[14] A. Dixit, O. Parnas, B. Li, J. Chen, C. P. Fulco, L. Jerby-Arnon, N. D. Marjanovic, D. Dionne,\nT. Burks, R. Raychowdhury, et al. Perturb-seq: dissecting molecular circuits with scalable\nsingle-cell rna profiling of pooled genetic screens. Cell, 167(7):1853\u20131866, 2016.\n\n[15] D. Garant and D. Jensen. Evaluating causal models by comparing interventional distributions.\narXiv preprint arXiv:1608.04698, 2016.\n\n[16] C. Geng, H. Paganetti, and C. Grassberger. Prediction of treatment response for combined\nchemo- and radiation therapy for non-small cell lung cancer patients using a bio-mathematical\nmodel. Scientific reports, 7(1):13542, 2017.\n\n[17] C. Glymour, J. D. Ramsey, and K. Zhang. The evaluation of discovery: Models, simulation and\nsearch through \u201cbig data\u201d. Open Philosophy, 2(1):39\u201348, 2019.\n\n[18] P. O. Hoyer, S. Shimizu, A. J. Kerminen, and M. 
Palviainen. Estimation of causal effects using\nlinear non-gaussian causal models with hidden variables. International Journal of Approximate\nReasoning, 49(2):362\u2013378, 2008.\n\n[19] M. Johnson, K. Hofmann, T. Hutton, and D. Bignell. The malmo platform for artificial\nintelligence experimentation. In IJCAI, pages 4246\u20134247, 2016.\n\n[20] N. Kallus, X. Mao, and A. Zhou. Interval estimation of individual-level causal effects under\nunobserved confounding. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine\nLearning Research, volume 89 of Proceedings of Machine Learning Research, pages 2281\u20132290.\nPMLR, 16\u201318 Apr 2019. URL http://proceedings.mlr.press/v89/kallus19a.html.\n\n[21] B. Lim. Forecasting treatment responses over time using recurrent marginal structural networks.\nIn Advances in Neural Information Processing Systems, pages 7483\u20137493, 2018.\n\n[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick.\nMicrosoft coco: Common objects in context. In European conference on computer vision, pages\n740\u2013755. Springer, 2014.\n\n[23] K. Mohan, J. Pearl, and J. Tian. Graphical models for inference with missing data. In Advances\nin neural information processing systems, pages 1277\u20131285, 2013.\n\n[24] K. Mohan, F. Thoemmes, and J. Pearl. Estimation with incomplete data: The linear case. In\nProceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence,\nIJCAI-18, pages 5082\u20135088. International Joint Conferences on Artificial Intelligence Organization, 7\n2018. doi: 10.24963/ijcai.2018/705. URL https://doi.org/10.24963/ijcai.2018/705.\n\n[25] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Sch\u00f6lkopf. Distinguishing cause from\neffect using observational data: methods and benchmarks. The Journal of Machine Learning\nResearch, 17(1):1103\u20131204, 2016.\n\n[26] M. Oberst and D. Sontag. 
Counterfactual off-policy evaluation with gumbel-max structural\ncausal models. In International Conference on Machine Learning, pages 4881\u20134890, 2019.\n\n[27] D. D. Ohnmeiss, H. Vanharanta, and J. Ekholm. Relation between pain location and disc\npathology: a study of pain drawings and ct/discography. The Clinical journal of pain,\n15(3):210\u2013217, 1999.\n\n[28] M. Osama, D. Zachariah, and T. Sch\u00f6n. Inferring heterogeneous causal effects in presence of\nspatial confounding. arXiv preprint arXiv:1901.09919, 2019.\n\n[29] J. Pearl. Causality. Cambridge university press, 2009.\n\n[30] J. Peters, P. B\u00fchlmann, N. Meinshausen, et al. Causal inference by using invariant prediction:\nidentification and confidence intervals. Journal of the Royal Statistical Society Series B,\n78(5):947\u20131012, 2016.\n\n[31] J. Peters, D. Janzing, and B. Sch\u00f6lkopf. Elements of causal inference: foundations and learning\nalgorithms. MIT press, 2017.\n\n[32] D. B. Rubin. Inference and missing data. Biometrika, 63(3):581\u2013592, 1976.\n\n[33] J. Runge, S. Bathiany, E. Bollt, G. Camps-Valls, D. Coumou, E. Deyle, C. Glymour,\nM. Kretschmer, M. D. Mahecha, J. Mu\u00f1oz-Mar\u00ed, et al. Inferring causation from time series in\nearth system sciences. Nature communications, 10(1):2553, 2019.\n\n[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,\nA. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International\njournal of computer vision, 115(3):211\u2013252, 2015.\n\n[35] K. Sachs, O. Perez, D. Pe\u2019er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling\nnetworks derived from multiparameter single-cell data. Science, 308(5721):523\u2013529, 2005.\n\n[36] R. Sanchez-Romero, J. D. Ramsey, K. Zhang, M. R. Glymour, B. Huang, and C. Glymour.\nEstimating feedforward and feedback effective connections from fmri time series: Assessments\nof statistical methods. 
Network Neuroscience, 3(2):274\u2013306, 2019.\n\n[37] S. Shimizu, P. O. Hoyer, A. Hyv\u00e4rinen, and A. Kerminen. A linear non-gaussian acyclic model\nfor causal discovery. Journal of Machine Learning Research, 7(Oct):2003\u20132030, 2006.\n\n[38] I. Shpitser. Consistent estimation of functions of data missing non-monotonically and not at\nrandom. In Advances in Neural Information Processing Systems, pages 3144\u20133152, 2016.\n\n[39] P. Spirtes, C. Meek, and T. Richardson. Causal inference in the presence of latent variables\nand selection bias. In Proceedings of the Eleventh conference on Uncertainty in artificial\nintelligence, pages 499\u2013506. Morgan Kaufmann Publishers Inc., 1995.\n\n[40] P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson.\nCausation, prediction, and search. 2000.\n\n[41] P. Spirtes, C. Glymour, and R. Scheines. The tetrad project: Causal models and statistical data.\npittsburgh, 2004.\n\n[42] E. V. Strobl, S. Visweswaran, and P. L. Spirtes. Fast causal inference with non-random\nmissingness by test-wise deletion. International Journal of Data Science and Analytics, pages\n1\u201316.\n\n[43] Y. Tanaka, S. Kokubun, T. Sato, and H. Ozawa. Cervical roots as origin of pain in the neck or\nscapular regions. Spine, 31(17):E568\u2013E573, 2006.\n\n[44] R. Tu, C. Zhang, P. Ackermann, K. Mohan, H. Kjellstr\u00f6m, and K. Zhang. Causal discovery in the\npresence of missing data. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine\nLearning Research, volume 89 of Proceedings of Machine Learning Research, pages 1762\u20131770.\nPMLR, 16\u201318 Apr 2019. URL http://proceedings.mlr.press/v89/tu19a.html.\n\n[45] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention\nnetwork for image classification. In Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, pages 3156\u20133164, 2017.\n\n[46] C. Zhang, H. 
Kjellstrom, C. H. Ek, and B. C. Bertilson. Diagnostic prediction using discomfort\ndrawings with IBTM. In MLHC, 2016.\n\n[47] J. Zhang. On the completeness of orientation rules for causal discovery in the presence of latent\nconfounders and selection bias. Artificial Intelligence, 172(16-17):1873\u20131896, 2008.\n\n[48] K. Zhang, J. Zhang, B. Huang, B. Sch\u00f6lkopf, and C. Glymour. On the identifiability and\nestimation of functional causal models in the presence of outcome-dependent selection. In UAI,\n2016.\n\n[49] K. Zhang, B. Sch\u00f6lkopf, P. Spirtes, and C. Glymour. Learning causality and causality-related\nlearning: some recent progress. National science review, 5(1):26\u201329, 2017.", "award": [], "sourceid": 6955, "authors": [{"given_name": "Ruibo", "family_name": "Tu", "institution": "KTH Royal Institute of Technology"}, {"given_name": "Kun", "family_name": "Zhang", "institution": "CMU"}, {"given_name": "Bo", "family_name": "Bertilson", "institution": "KI Karolinska Institutet"}, {"given_name": "Hedvig", "family_name": "Kjellstrom", "institution": "KTH Royal Institute of Technology"}, {"given_name": "Cheng", "family_name": "Zhang", "institution": "Microsoft Research, Cambridge, UK"}]}