{"title": "Estimators for Multivariate Information Measures in General Probability Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 8664, "page_last": 8675, "abstract": "Information theoretic quantities play an important role in various settings in machine learning, including causality testing, structure inference in graphical models, time-series problems, feature selection as well as in providing privacy guarantees. A key quantity of interest is the mutual information and generalizations thereof, including conditional mutual information, multivariate mutual information, total correlation and directed information. While the aforementioned information quantities are well defined in arbitrary probability spaces, existing estimators employ a $\\Sigma H$ method, which can only work in purely discrete space or purely continuous case since entropy (or differential entropy) is well defined only in that regime.\nIn this paper, we define a general graph divergence measure ($\\mathbb{GDM}$), generalizing the aforementioned information measures and we construct a novel estimator via a coupling trick that directly estimates these multivariate information measures using the Radon-Nikodym derivative. These estimators are proven to be consistent in a general setting which includes several cases where the existing estimators fail, thus providing the only known estimators for the following settings: (1) the data has some discrete and some continuous valued components (2) some (or all) of the components themselves are discrete-continuous \\textit{mixtures} (3) the data is real-valued but does not have a joint density on the entire space, rather is supported on a low-dimensional manifold. 
We show that our proposed estimators significantly outperform known estimators on synthetic and real datasets.", "full_text": "Estimators for Multivariate Information Measures in General Probability Spaces\n\nArman Rahimzamani, Department of ECE, University of Washington, armanrz@uw.edu\nHimanshu Asnani, Department of ECE, University of Washington, asnani@uw.edu\nPramod Viswanath, Department of ECE, University of Illinois at Urbana-Champaign, pramodv@illinois.edu\nSreeram Kannan, Department of ECE, University of Washington, ksreeram@uw.edu\n\nAbstract\n\nInformation theoretic quantities play an important role in various settings in machine learning, including causality testing, structure inference in graphical models, time-series problems, feature selection, as well as in providing privacy guarantees. A key quantity of interest is the mutual information and generalizations thereof, including conditional mutual information, multivariate mutual information, total correlation and directed information. While the aforementioned information quantities are well defined in arbitrary probability spaces, existing estimators add or subtract entropies (we term them ΣH methods). These methods work only in the purely discrete or the purely continuous case, since entropy (or differential entropy) is well defined only in those regimes.\nIn this paper, we define a general graph divergence measure (GDM), as a measure of incompatibility between the observed distribution and a given graphical model structure. This generalizes the aforementioned information measures, and we construct a novel estimator via a coupling trick that directly estimates these multivariate information measures using the Radon-Nikodym derivative. 
These estimators are proven to be consistent in a general setting which includes several cases where the existing estimators fail, thus providing the only known estimators for the following settings: (1) the data has some discrete and some continuous valued components; (2) some (or all) of the components themselves are discrete-continuous mixtures; (3) the data is real-valued but does not have a joint density on the entire space, but rather is supported on a low-dimensional manifold. We show that our proposed estimators significantly outperform known estimators on synthetic and real datasets.\n\n1 Introduction\n\nInformation theoretic quantities, such as mutual information and its generalizations, play an important role in various settings in machine learning and statistical estimation and inference. Here we summarize briefly the role of some generalizations of mutual information in learning (cf. Sec. 2.1 for a mathematical definition of these quantities).\n\n1. Conditional mutual information measures the amount of information between two variables X and Y given a third variable Z, and is zero iff X is independent of Y given Z. CMI finds a wide range of applications in learning, including causality testing [1, 2], structure inference in graphical models [3], feature selection [4], as well as in providing privacy guarantees [5].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n2. Total correlation measures the degree to which a set of N variables are independent of each other, and appears as a natural metric of interest in several machine learning problems; for example, in independent component analysis, the objective is to maximize the independence of the variables quantified through total correlation [6]. In feature selection, ensuring the independence of selected features is one goal, pursued using total correlation in [7, 8].\n\n3. 
Multivariate mutual information measures the amount of information shared between multiple variables [9, 10] and is useful in feature selection [11, 12] and clustering [13].\n\n4. Directed information measures the amount of information between two random processes [14, 15] and has been shown to be the correct metric for identifying time-series graphical models [16–21].\n\nEstimation of these information-theoretic quantities from observed samples is a non-trivial problem that needs to be solved in order to utilize these quantities in the aforementioned applications. While there is a long history of research on the estimation of entropy [22–25], and renewed recent interest [26–28], much less effort has been spent on the multivariate versions. A standard approach to estimating general information theoretic quantities is to write them out as a sum or difference of entropy (usually denoted H) terms which are then directly estimated; we term such a paradigm the ΣH paradigm. However, the ΣH paradigm is applicable only when the variables involved are all discrete or there is a joint density on the space of all variables (in which case, differential entropy h can be utilized). The underlying information measures themselves are well defined in very general probability spaces, for example, involving mixtures of discrete and continuous variables; however, no known estimators exist.\nWe motivate the requirement of estimators in general probability spaces by some examples in contemporary machine learning and statistical inference.\n\n1. It is commonplace in machine learning to have datasets where some variables are discrete and some are continuous. For example, in recent work on utilizing the information bottleneck to understand deep learning [29], an important step is to quantify the mutual information between the training samples (which are discrete) and the layer output (which is continuous). 
The employed methodology was to quantize the continuous variables; this is common practice, even though it is highly sub-optimal.\n\n2. Some variables involved in the calculation may be mixtures of discrete and continuous variables. For example, the output of a ReLU neuron will not have a density even when the input data has a density. Instead, the neuron will have a discrete mass at 0 (or wherever the ReLU breakpoint is) but will have a continuous distribution on the positive values. This is also the case in gene expression data, where a gene may have a discrete mass at expression 0 due to an effect called drop-out [30] but have a continuous distribution elsewhere.\n\n3. The variables involved may have a joint density only on a low-dimensional manifold. For example, when calculating mutual information between the input and output of a neural network, some of the neurons are deterministic functions of the input variables, and hence they will have a joint density supported on a low-dimensional manifold rather than the entire space.\n\nIn the aforementioned cases, no existing estimators are known to work. Nor is it merely a matter of lacking provable guarantees: when we plug estimators that assume a joint density into data that does not have one, the estimated information measure can be strongly negative.\nWe summarize our main contributions below:\n\n1. General paradigm (Section 2): We define a general paradigm of graph divergence measures which captures the aforementioned generalizations of mutual information as special cases. Given a directed acyclic graph (DAG) between n variables, the graph divergence is defined as the Kullback-Leibler (KL) divergence between the true data distribution PX and a restricted distribution P̄X defined on the Bayesian network, and can be thought of as a measure of incompatibility with the given graphical model structure. 
These graph divergence measures are defined using Radon-Nikodym derivatives, which are well-defined for general probability spaces.\n\n2. Novel estimators (Section 3): We propose novel estimators for these graph divergence measures, which directly estimate the corresponding Radon-Nikodym derivatives. To the best of our knowledge, these are the first family of estimators that are well defined for general probability spaces (breaking the ΣH paradigm).\n\nFigure 1: (a) An example of a Bayesian network G with P̄X as the induced distribution PX1 PX2 PX3|X1 PX4|X1,X2 PX5|X4 PX6|X4. (b) A Bayesian network G inducing a Markov chain, with P̄X = PX3 PX1|X3 PX2|X3. (c) A Bayesian network G with P̄X as the induced distribution PX1 PX2 · · · PXd.\n\n3. Consistency proofs (Section 4): We prove that the proposed estimators converge to the true value of the corresponding graph divergence measures as the number of observed samples increases, in a general setting which includes several cases: (1) the data has some discrete and some continuous valued components; (2) some (or all) of the components themselves are discrete-continuous mixtures; (3) the data is real-valued but does not have a joint density on the entire space but is supported on a low-dimensional manifold.\n\n4. Numerical results (Section 5): Extensive numerical results demonstrate that (1) existing algorithms have severe failure modes in general probability spaces (strongly negative values, for example), and (2) our proposed estimator achieves consistency as well as significantly better finite-sample performance.\n\n2 Graph Divergence Measure\n\nIn this section, we define the family of graph divergence measures. To begin with, we first define some notational preliminaries. 
We denote any random variable by an uppercase letter such as X. The sample space of the variable X is denoted by X, and any value in X is denoted by the lowercase letter x. For any subset A ⊆ X, the probability of A for a given distribution function PX(.) over X is denoted by PX(A). We note that the random variable X can be a d-dimensional vector of random variables, i.e. X ≡ (X1, . . . , Xd). The N observed samples drawn from the distribution PX are denoted by x(1), x(2), . . . , x(N), i.e. x(i) is the ith observed sample.\nSometimes we might be interested in a subset of the components of a random variable, S ⊆ {X1, . . . , Xd}, instead of the entire vector X. Accordingly, the sample space of the variable S is denoted by S; for instance, X = (X1, X2, X3, X4) and S = (X1, X2). Throughout the entire paper, unless otherwise stated, there is a one-to-one correspondence between the notations of X and any S. For example, for any value x ∈ X, the corresponding value in S is simply denoted by s. Further, s(i) ∈ S represents the lower-dimensional sample corresponding to the ith observed sample x(i) ∈ X. Furthermore, any marginal distribution defined over S with respect to PX is denoted by PS.\nConsider a directed acyclic graph (DAG) G defined over d nodes (corresponding to the d components of the random variable X). A probability measure Q over X is said to be compatible with the graph G if it is a Bayesian network on G. Given a graph G and a distribution PX, there is a natural measure P̄X(.) which is compatible with the graph and is defined as follows:\n\nP̄X = ∏_{l=1}^{d} PXl|pa(Xl)    (1)\n\nwhere pa(Xl) ⊂ X is the set of the parent nodes of the random variable Xl, with the sample space denoted by Xpa(l), and the sample values xpa(l) corresponding to x. The distribution PXl|pa(Xl) is the conditional distribution of Xl given pa(Xl). 
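For intuition, in the fully discrete case both the induced distribution of Eq. (1) and the KL divergence between PX and the induced distribution can be computed by simple plug-in counting of empirical frequencies. The following is a minimal sketch under that fully discrete assumption; the function name `plug_in_gdm` and the `parents` dictionary encoding of the DAG G are our own illustrative choices, not the paper's implementation:

```python
import math
from collections import Counter

def plug_in_gdm(samples, parents):
    """Plug-in estimate of D(P_X || induced P over G) for fully discrete samples.

    samples: list of d-tuples of hashable values.
    parents: dict mapping node index l -> tuple of parent indices (the DAG G).
    The induced measure is the product of empirical conditionals P(X_l | pa(X_l)).
    """
    n = len(samples)
    d = len(samples[0])

    def marginal(idx):
        # empirical counts over the sub-vector of components listed in idx
        return Counter(tuple(x[t] for t in idx) for x in samples)

    joint = Counter(samples)
    total = 0.0
    for x, c in joint.items():
        p = c / n
        log_induced = 0.0
        for l in range(d):
            pa = tuple(parents.get(l, ()))
            num = marginal((l,) + pa)[(x[l],) + tuple(x[t] for t in pa)] / n
            den = marginal(pa)[tuple(x[t] for t in pa)] / n if pa else 1.0
            log_induced += math.log(num / den)  # log P(x_l | x_pa(l))
        total += p * (math.log(p) - log_induced)
    return total
```

With the empty graph on two components this is exactly the plug-in mutual information; with the edge X1 → X2 the induced measure equals the empirical joint, so the divergence is zero, matching the compatibility property of the graph divergence.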
Throughout the paper, whenever mentioning the variable Xl together with its own parents pa(Xl), we indicate it by pa+(Xl), i.e. pa+(Xl) ≡ (Xl, pa(Xl)). An example is shown in Fig. 1a.\n\nWe note the fact that PS|X\\S is well defined for any subset of variables S ⊂ X. Therefore, if we let S = X \\ pa(Xl), then PX\\pa(Xl)|pa(Xl) is well defined for any l ∈ {1, . . . , d}. By marginalizing over X \\ pa+(Xl) we see that PXl|pa(Xl), and thus the distribution P̄X, is well defined.\nThe graph divergence measure is now defined as a function of the probability measure PX and the graph G. In this work we will focus only on the KL divergence as the distance metric; hence, unless otherwise stated, D(· ‖ ·) = DKL(· ‖ ·). Let us first consider the case where PX is absolutely continuous with respect to P̄X and hence the Radon-Nikodym derivative dPX/dP̄X exists. Therefore, for a given set of random variables X and a Bayesian network G, we define the Graph Divergence Measure (GDM) as:\n\nGDM(X, G) = D(PX ‖ P̄X) = ∫_X log(dPX/dP̄X) dPX    (2)\n\nHere we implicitly assume that log(dPX/dP̄X) is measurable and integrable with respect to the measure PX. The GDM is set to infinity wherever the Radon-Nikodym derivative does not exist. It is clear that GDM(X, G) = 0 if and only if the data distribution is compatible with the graphical model; thus the GDM can be thought of as a measure of incompatibility with the given graphical model structure.\nWe now have the following variational characterization of our graph divergence measure, which can be harnessed to compute upper and lower bounds (more details in the supplementary material):\nProposition 2.1. For a random variable X and a DAG G, let Π(G) be the set of measures QX defined on the Bayesian network G; then GDM(X, G) = inf_{QX ∈ Π(G)} D(PX ‖ QX).\nFurthermore, let C denote the set of functions h : X → R such that EQX[exp(h(X))] < ∞. If GDM(X, G) < ∞, then for every h ∈ C, EPX[h(X)] exists and:\n\nGDM(X, G) = sup_{h ∈ C} EPX[h(X)] − log EQX[exp(h(X))]    (3)\n\n2.1 Special cases\n\nFor specific choices of X and Bayesian network G, Equation 2 reduces to well-known information measures. Some examples of these measures are as follows:\nMutual Information (MI): X = (X1, X2) and G has no directed edge between X1 and X2. Thus P̄X = PX1 · PX2, and we get GDM(X, G) = I(X1; X2) = D(PX1X2 ‖ PX1 PX2).\nConditional Mutual Information (CMI): We recover the conditional mutual information of X1 and X2 given X3 by constraining G to be the one in Fig. 1b, since P̄X = PX3 · PX2|X3 · PX1|X3, i.e., GDM(X, G) = I(X1; X2|X3) = D(PX1X2X3 ‖ PX1|X3 PX2|X3 PX3).\nTotal Correlation (TC): When X = (X1, · · · , Xd) and G is the graph with no edges (as in Fig. 1c), we recover the total correlation of the random variables X1, . . . , Xd, since P̄X = PX1 . . . PXd, i.e., GDM(X, G) = TC(X1, . . . , Xd) = D(PX1...Xd ‖ PX1 . . . PXd).\n\nMultivariate Mutual Information (MMI): While the MMI defined by [9] is not positive in general, there is a different definition by [10] which is both non-negative and has an operational interpretation. Since MMI can be defined as the optimal total correlation after clustering, we can utilize our definition to define MMI (cf. 
supplementary material).\nDirected Information: Suppose there are two stationary random processes X and Y. The directed information rate from X to Y, as first introduced by Massey [31], is defined as:\n\nI(X → Y) = (1/T) ∑_{t=1}^{T} I(X^t; Yt | Y^{t−1})\n\nIt can be seen that the directed information can be written as:\n\nI(X → Y) = GDM((X^T, Y^T), GI) − GDM((X^T, Y^T), GC)\n\nwhere the graphical model GI corresponds to the independent distribution between X^T and Y^T, and GC corresponds to the causal distribution from X to Y (more details provided in the supplementary material).\n\n3 Estimators\n\n3.1 Prior Art\n\nEstimators for entropy date back to Shannon, who guesstimated the entropy rate of English [32]. Discrete entropy estimation is a well-studied topic, and the minimax rate of this problem is well understood as a function of the alphabet size [33–35]. The estimation of differential entropy is considerably harder and is also studied extensively in the literature [23, 25, 26, 36–39]; the estimators can be broadly divided into two groups, based either on kernel density estimates [40, 41] or on k-nearest-neighbor estimation [27, 42, 43]. In a remarkable work, Kozachenko and Leonenko suggested a nearest neighbor method for entropy estimation [22], which was then generalized to a kth nearest neighbor approach [44]. 
In this method, the distance to the kth nearest neighbor (KNN) is measured for each data point, and based on this the probability density around each data point is estimated and substituted into the entropy expression. When k is fixed, each density estimate is noisy and the estimator accrues a bias; a remarkable result is that the bias is distribution-independent and can be subtracted out [45].\nWhile the entropy estimation problem is well studied, mutual information and its generalizations are typically estimated using a sum of signed entropy (H) terms, which are estimated first; we term such estimators ΣH estimators. In the discrete alphabet case, this principle has been shown to be worst-case optimal [28]. In the case of distributions with a joint density, an estimator that breaks the ΣH principle is the KSG estimator [46], which builds on the KNN estimation paradigm but couples the estimates in order to reduce the bias. This estimator is widely used and has excellent practical performance. The original paper did not have any consistency guarantees, and its convergence rates were established only recently [47]. There have been some extensions of the KSG estimator to other information measures such as conditional mutual information [48, 49] and directed information [50], but none of them come with theoretical guarantees on consistency; furthermore, they fail completely on mixture distributions.\nWhen the data distribution is neither discrete nor admits a joint density, the ΣH approach is no longer feasible, as the individual entropy terms are not well defined. This is the regime of interest in our paper. 
Recently, Gao et al. [51] proposed a mutual-information estimator based on the KNN principle, which can handle such continuous-discrete mixture cases, and its consistency was demonstrated. However, it is not clear how it generalizes to even Conditional Mutual Information (CMI) estimation, let alone other generalizations of mutual information. In this paper, we build on that estimator in order to design an estimator for general graph divergence measures and establish its consistency for generic probability spaces.\n\n3.2 Proposed Estimator\n\nThe proposed estimator is given in Algorithm 1, where ψ(·) is the digamma function and 1{·} is the indicator function. The process is schematically shown in Fig. 3 (cf. supplementary material). We use the ℓ∞-norm everywhere in our algorithm and proofs.\nThe estimator intuitively estimates the GDM by the resubstitution estimate (1/N) ∑_{i=1}^{N} log f̂(x(i)), in which f̂(x(i)) is the estimate of the Radon-Nikodym derivative at each sample x(i). If x(i) lies in a region where there is a density, the RN derivative is equal to gX(x(i))/ḡX(x(i)), in which gX(.) and ḡX(.) are the density functions corresponding to PX and P̄X respectively. On the other hand, if x(i) lies on a point where there is a discrete mass, the RN derivative will be equal to hX(x(i))/h̄X(x(i)), in which hX(.) and h̄X(.) are the mass functions corresponding to PX and P̄X respectively.\nThe density function ḡX(x(i)) can be written as ∏_{l=1}^{d} gpa+(Xl)(xpa+(l)(i)) / gpa(Xl)(xpa(l)(i)). Equivalently, the mass function h̄X(x(i)) can be written as ∏_{l=1}^{d} hpa+(Xl)(xpa+(l)(i)) / hpa(Xl)(xpa(l)(i)). Thus we need to estimate the density functions g(.) for continuous components and the mass functions h(.) according to the type of x(i). 
The existing kth nearest neighbor algorithms will suffer while estimating the mass functions h(.), since ρnS,i (the distance to the nS-th nearest neighbor in subspace S) for such points will be equal to zero for large N. Our algorithm, however, is designed in a way that it is capable of approximating both the g(.) functions, as ≈ (nS/N)(1/ρnS,i)^{dS}, and the h(.) functions, as ≈ nS/N, dynamically for any subset S ⊆ X. This is achieved by setting the ρnS,i terms such that all of them cancel out, yielding the estimator as in Eq. (4).\n\nAlgorithm 1: Estimating the Graph Divergence Measure GDM(X, G)\nInput: Parameter: k ∈ Z+, Samples: x(1), x(2), . . . , x(N), Bayesian Network: G on Variables: X = (X1, X2, · · · , Xd)\nOutput: GDM^(N)(X, G)\n1: for i = 1 to N do\n2:   Query:\n3:     ρk,i = ℓ∞-distance to the kth nearest neighbor of x(i) in the space X\n4:   Inquire:\n5:     k̃i = # points within the ρk,i-neighborhood of x(i) in the space X\n6:     n(i)_pa(Xl) = # points within the ρk,i-neighborhood of x(i) in the space Xpa(l)\n7:     n(i)_pa+(Xl) = # points within the ρk,i-neighborhood of x(i) in the space Xpa+(l)\n8:   Compute:\n9:     ζi = ψ(k̃i) + ∑_{l=1}^{d} ( 1{pa(Xl) ≠ ∅} log(n(i)_pa(Xl) + 1) − log(n(i)_pa+(Xl) + 1) )\n10: end for\n11: Final Estimator:\n\nGDM^(N)(X, G) = (1/N) ∑_{i=1}^{N} ζi + ( ∑_{l=1}^{d} 1{pa(Xl) = ∅} − 1 ) log N    (4)\n\n4 Proof of Consistency\n\nThe proof of consistency for our estimator consists of two steps: first we prove that the expected value of the estimator in Eq. (4) converges to the true value as N → ∞, and second we prove that the variance of the estimator converges to zero as N → ∞. 
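To make the procedure concrete, below is a minimal brute-force Python sketch of Algorithm 1, using O(N²) neighbor searches rather than the tree-based search one would use in practice. The function name `gdm_estimate` and the `parents` dictionary encoding of G are our own illustrative choices; the digamma function at a positive integer m is computed exactly as ψ(m) = −γ + Σ_{j=1}^{m−1} 1/j:

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi(m):
    # digamma at a positive integer: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))

def gdm_estimate(samples, parents, k=5):
    """Brute-force sketch of Algorithm 1 for GDM(X, G).

    samples: list of d-tuples of floats; parents: dict mapping node index l
    to a list of parent indices (the DAG G); k: nearest-neighbor parameter.
    """
    def linf(a, b, dims):
        # l_inf distance restricted to the subspace spanned by `dims`
        return max(abs(a[t] - b[t]) for t in dims)

    N = len(samples)
    d = len(samples[0])
    full = range(d)
    total = 0.0
    for i, x in enumerate(samples):
        others = [y for j, y in enumerate(samples) if j != i]
        rho = sorted(linf(x, y, full) for y in others)[k - 1]       # rho_{k,i}
        k_tilde = sum(1 for y in others if linf(x, y, full) <= rho)  # ties count
        zeta = psi(k_tilde)
        for l in range(d):
            pa = list(parents.get(l, []))
            n_pa_plus = sum(1 for y in others if linf(x, y, [l] + pa) <= rho)
            zeta -= math.log(n_pa_plus + 1)
            if pa:  # indicator 1{pa(X_l) != empty}
                n_pa = sum(1 for y in others if linf(x, y, pa) <= rho)
                zeta += math.log(n_pa + 1)
        total += zeta
    roots = sum(1 for l in range(d) if not parents.get(l))
    return total / N + (roots - 1) * math.log(N)
```

With an edgeless graph on (X1, X2) this reduces to the mixture mutual-information estimator; the ties at zero distance are what let the same neighborhood counts serve as discrete mass estimates.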
Let us begin with the definition of PX(x, r):\n\nPX(x, r) = PX({a ∈ X : ‖a − x‖∞ ≤ r}) = PX(Br(x))    (5)\n\nThus PX(x, r) is the probability of a hypercube with edge length 2r centered at the point x. We then state the following assumptions:\nAssumption 1. We make the following assumptions to prove the consistency of our method:\n\n1. k is set such that lim_{N→∞} k = ∞ and lim_{N→∞} (k log N)/N = 0.\n2. The set of discrete points {x : PX(x, 0) > 0} is finite.\n3. ∫_X |log f(x)| dPX < +∞, where f ≡ dPX/dP̄X is the Radon-Nikodym derivative.\n\nAssumptions 1.1 and 1.2 control the boundary effect between the continuous and the discrete regions; with these assumptions we make sure that all the k nearest neighbors of each point belong to the same region almost surely (i.e. all of them are either continuous or discrete). Assumption 1.3 is the log-integrability of the Radon-Nikodym derivative. These assumptions are satisfied under mild technical conditions whenever the distribution PX over the set X is (1) finitely discrete; (2) continuous; (3) finitely discrete over some dimensions and continuous over some others; (4) a mixture of the previous cases; (5) has a joint density supported over a lower-dimensional manifold. These cases represent almost all real-world data.\nAs an example of a case not conforming to the aforementioned cases, consider singular distributions, among which the Cantor distribution is a significant example, whose cumulative distribution function is the Cantor function. This distribution has neither a probability density function nor a probability mass function, although its cumulative distribution function is a continuous function. 
It is thus neither a discrete nor an absolutely continuous probability distribution, nor is it a mixture of these.\nTheorem 1 formally states the mean-convergence of the estimator, while Theorem 2 formally states the convergence of the variance to zero.\n\nTheorem 1. Under Assumption 1, we have lim_{N→∞} E[GDM^(N)(X, G)] = GDM(X, G).\nTheorem 2. In addition to Assumption 1, assume that (kN log N)² / N → 0 as N goes to infinity. Then we have lim_{N→∞} Var[GDM^(N)(X, G)] = 0.\n\nTheorems 1 and 2 combined yield the consistency of the estimator in Eq. (4).\nThe proof of Theorem 1 starts with writing the Radon-Nikodym derivative explicitly. Then we need to upper-bound the term |E[GDM^(N)(X, G)] − GDM(X, G)|. To achieve this goal, we segregate the domain of X into three parts as X = Ω1 ∪ Ω2 ∪ Ω3, where Ω1 = {x : f(x) = 0}, Ω2 = {x : f(x) > 0, PX(x, 0) > 0} and Ω3 = {x : f(x) > 0, PX(x, 0) = 0}. We show that PX(Ω1) = 0. The sets Ω2 and Ω3 correspond to the discrete and continuous regions respectively. Then, for each of the two regions, we introduce an upper bound which goes to zero as N grows boundlessly. Thus, equivalently, we show that the mean of the estimate ζi is close to log f(x) for any x.\nThe proof of Theorem 2 is based on the Efron-Stein inequality, which upper-bounds the variance of any estimator computed from the observed samples x(1), . . . , x(N). For any sample x(i), we then upper-bound the term |ζi(X) − ζi(X\\j)| by segregating the samples into various cases and examining each case separately. Here ζi(X) is the estimate using all the samples x(1), . . . , x(N), and ζi(X\\j) is the estimate when the jth sample is removed. 
Summing up over all the i's, we obtain an upper bound which converges to 0 as N goes to infinity.\n\n5 Empirical Results\n\nIn this section, we evaluate the performance of our proposed estimator in comparison with other estimators via numerical experiments. The estimators evaluated here are our estimator, referred to as GDM; the plain KSG-based estimators for continuous distributions, to which we refer as KSG; the binning estimators; and the noise-induced ΣH estimators. A more detailed discussion can be found in Section G.\nExperiment 1: Markov chain model with continuous-discrete mixture. For the first experiment, we simulated an X-Z-Y Markov chain model in which the random variable X is a uniform random variable U(0, 1) clipped at a threshold 0 < α1 < 1 from above. Then Z = min(X, α2) and Y = min(Z, α3), in which 0 < α3 < α2 < α1. We simulated this system for various numbers of samples, setting α1 = 0.9, α2 = 0.8 and α3 = 0.7. For each set of samples we estimated I(X; Y |Z) via different methods. The theoretical value of I(X; Y |Z) is 0. The results are shown in Figure 2a. We can see that in this regime, only the GDM estimator converges correctly. The KSG estimator and the ΣH estimator show high negative biases, and the binning estimator shows a positive bias.\nExperiment 2: Mixture of AWGN and BSC channels with variable error probability. For the second scheme of our experiments, we considered an Additive White Gaussian Noise (AWGN) channel in parallel with a Binary Symmetric Channel (BSC), where only one of the two can be activated at a time. The random variable Z = min(α, Z̃), where Z̃ ∼ U(0, 1), controls which channel is activated; i.e. if Z is lower than the threshold β, the AWGN channel is activated, otherwise the BSC channel is, where Z also determines the error probability at each time point. 
We set α = 0.3, β = 0.2, the BSC channel input as X ∼ Bern(0.5), and the AWGN input and noise deviations as σX = 1 and σN = 0.1 respectively, and obtained the estimates of I(X; Y |Z, Z², Z³) for various estimators. The theoretical value is equal to I(X; Y |Z) = 0.53241, yet here the conditioning is over a low-dimensional manifold in a high-dimensional space. The results are shown in Figure 2b. Similar to the previous experiment, the GDM estimator converges correctly to the true value. The ΣH and binning estimators show a negative bias, and the KSG estimator gets totally lost.\nExperiment 3: Total correlation for independent mixtures. In this experiment, we estimate the total correlation of three independent variables X, Y and Z. The samples for the variable X are generated in the following fashion: first toss a fair coin; if heads appears we fix X at αX, otherwise we draw X from a uniform distribution between 0 and 1. Samples of Y and Z are also generated in the same way, independently, with parameters αY and αZ respectively. For this setup, TC(X, Y, Z) = 0. We set αX = 1, αY = 1/2 and αZ = 1/4, and generated various datasets with different lengths. The estimated total correlation values are shown in Figure 2c.\nExperiment 4: Total correlation for independent uniforms with correlated zero-inflation. Here we first consider four auxiliary uniform variables X̃1, X̃2, X̃3 and X̃4, which are taken from U(0.5, 1.5). Then each sample is erased with a Bernoulli probability; i.e. X1 = α1 X̃1, X2 = α1 X̃2 and X3 = α2 X̃3, X4 = α2 X̃4, in which α1 ∼ Bern(p1) and α2 ∼ Bern(p2). As we see, after zero-inflation X1 and X2 become correlated, and so do X3 and X4, while still (X1, X2) ⊥ (X3, X4). In the experiment, we set p1 = p2 = 0.6. 
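The zero-inflation scheme of Experiment 4 is straightforward to reproduce; below is a minimal sketch under the shared-Bernoulli-flag construction described above (the helper name `zero_inflated_samples` is our own, and the seed is fixed only for reproducibility):

```python
import random

def zero_inflated_samples(n, p1=0.6, p2=0.6, seed=0):
    """Draw n samples of (X1, X2, X3, X4) as in Experiment 4.

    X~1..X~4 ~ U(0.5, 1.5); (X1, X2) share the erasure flag a1 ~ Bern(p1),
    and (X3, X4) share a2 ~ Bern(p2), which induces the pairwise correlation.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a1 = 1.0 if rng.random() < p1 else 0.0
        a2 = 1.0 if rng.random() < p2 else 0.0
        data.append((a1 * rng.uniform(0.5, 1.5), a1 * rng.uniform(0.5, 1.5),
                     a2 * rng.uniform(0.5, 1.5), a2 * rng.uniform(0.5, 1.5)))
    return data
```

Each marginal Xi is then a discrete-continuous mixture: a point mass at 0 with probability 1 − pi and a uniform density elsewhere, which is exactly the regime where density-based estimators break down.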
The results of running the different algorithms over the data can be seen in Figure 2d. For the total correlation experiments 3 and 4, similar to the conditional mutual information experiments 1 and 2, only the GDM estimator closely estimates the true value. The ΣH estimator was removed from the figures due to its high inaccuracy.\nExperiment 5: Gene regulatory networks. In this experiment we use different estimators to perform gene regulatory network inference based on conditional Restricted Directed Information (cRDI) [20]. We run our test on a simulated neuron cells' development process, based on a model from [52]. In this model, the time-series vector X consists of 13 random variables, each of which corresponds to a single gene in the development process. We simulated the development process for various lengths of time series, in which noise N ∼ N(0, 0.03) is added for all the genes, and every single sample is then subject to erasure (i.e. replacement by 0s) with a probability of 0.5. We then applied the cRDI method utilizing various CMI estimators and calculated the Area Under the ROC Curve (AUROC). The results are shown in Figure 2e. It is seen that the cRDI method implemented with the GDM estimator outperforms the other estimators by at least 10% in terms of AUROC. In the tests, cRDI for each (Xi, Xj) is conditioned on the node k ≠ i with the highest RDI value to j. We note that the causal signals are largely destroyed by the zero-inflation, so we do not expect high performance of causal inference over this data. We did not include the ΣH estimator results due to its very low performance.\nExperiment 6: Feature selection by conditional mutual information maximization. Feature selection is an important pre-processing step in many learning tasks. The application of information theoretic measures to feature selection is well studied in the literature [7]. 
Among the well-known methods is conditional mutual information maximization (CMIM), first introduced by Fleuret [4]; a variation of it, CMIM-2, was introduced later [53]. Both methods use conditional mutual information as their core measure for selecting features, so the quality of the estimators can significantly influence the performance of the methods. In our experiment, we generated a vector X = (X1, . . . , X15) of 15 random variables, each drawn from N(0, 1); each Xi is then clipped from above at αi, where αi is drawn once from U(0.25, 0.3) and kept constant during sample generation. Then Y is generated as Y = cos(X1 + · · · + X5). We ran the CMIM-2 algorithm with the various CMI estimators to evaluate their performance in extracting the relevant features X1, . . . , X5. The AUROC values for each algorithm versus the number of samples are shown in Figure 2f. The feature selection methods implemented with the GDM estimator outperform the other estimators.

6 Discussion and Future Work

We proposed a general paradigm of graph divergence measures, together with novel estimators for general probability spaces, which estimate several generalizations of mutual information. In future work, we would like to derive more efficient estimators for high-dimensional data. In the current work, the estimators are shown to be consistent under infinite scaling of the parameter k; we would also like to understand the finite-k performance of the estimators, as well as guarantees on sample complexity and rates of convergence. Another potential direction is to study the variational characterization of the graph divergence measure to design estimators. Improving the computational efficiency of the estimator is another direction of future work.
Recent literature, including [54], provides new methodology for estimating mutual information in a computationally efficient manner; leveraging these ideas for the generalized measures and general probability distributions is a promising direction.

7 Acknowledgement

This work was partially supported by NSF grants 1651236, 1703403 and NIH grant 5R01HG008164. The authors would also like to thank Yihan Jiang for presenting our work at the NeurIPS conference.

Figure 2: The results for the experiments versus the number of samples. 2a: the estimated CMI for the X-Z-Y Markov chain. 2b: CMI for the AWGN+BSC channels with low-dimensional Z manifold. 2c: the estimated TC values for three independent variables. 2d: the estimated TC for zero-inflated variables. 2e: the AUROC values for gene regulatory network inference; the error bars show the standard deviation scaled down by 0.2. 2f: the AUROC values for feature selection accuracy; the error bars show the standard deviations scaled down by 0.2.

References

[1] A. P. Dawid, “Conditional independence in statistical theory,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–31, 1979.

[2] K. Zhang, J. Peters, D. Janzing, and B.
Schölkopf, “Kernel-based conditional independence test and application in causal discovery,” arXiv preprint arXiv:1202.3775, 2012.

[3] J. Whittaker, Graphical models in applied multivariate statistics. Wiley Publishing, 2009.

[4] F. Fleuret, “Fast binary feature selection with conditional mutual information,” Journal of Machine Learning Research, vol. 5, no. Nov, pp. 1531–1555, 2004.

[5] P. Cuff and L. Yu, “Differential privacy as a mutual information constraint,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 43–54, ACM, 2016.

[6] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.

[7] J. R. Vergara and P. A. Estévez, “A review of feature selection methods based on mutual information,” Neural Computing and Applications, vol. 24, no. 1, pp. 175–186, 2014.

[8] P. E. Meyer, C. Schretter, and G. Bontempi, “Information-theoretic feature selection in microarray data using variable complementarity,” IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 3, pp. 261–274, 2008.

[9] W. McGill, “Multivariate information transmission,” Transactions of the IRE Professional Group on Information Theory, vol. 4, no. 4, pp. 93–111, 1954.

[10] C. Chan, A. Al-Bashabsheh, J. B. Ebrahimi, T. Kaced, and T. Liu, “Multivariate mutual information inspired by secret-key agreement,” Proceedings of the IEEE, vol. 103, no. 10, pp. 1883–1913, 2015.

[11] J. Lee and D.-W. Kim, “Feature selection for multi-label classification using multivariate mutual information,” Pattern Recognition Letters, vol. 34, no. 3, pp. 349–357, 2013.

[12] G.
Brown, “A new perspective for information theoretic feature selection,” in Artificial Intelligence and Statistics, pp. 49–56, 2009.

[13] C. Chan, A. Al-Bashabsheh, Q. Zhou, T. Kaced, and T. Liu, “Info-clustering: A mathematical theory for data clustering,” IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 2, no. 1, pp. 64–91, 2016.

[14] S. Watanabe, “Information theoretical analysis of multivariate correlation,” IBM Journal of Research and Development, vol. 4, no. 1, pp. 66–82, 1960.

[15] H. H. Permuter, Y.-H. Kim, and T. Weissman, “Interpretations of directed information in portfolio theory, data compression, and hypothesis testing,” IEEE Transactions on Information Theory, vol. 57, no. 6, pp. 3248–3259, 2011.

[16] C. J. Quinn, N. Kiyavash, and T. P. Coleman, “Directed information graphs,” IEEE Transactions on Information Theory, vol. 61, no. 12, pp. 6887–6909, 2015.

[17] J. Sun, D. Taylor, and E. M. Bollt, “Causal network inference by optimal causation entropy,” SIAM Journal on Applied Dynamical Systems, vol. 14, no. 1, pp. 73–106, 2015.

[18] K. Hlaváčková-Schindler, M. Paluš, M. Vejmelka, and J. Bhattacharya, “Causality detection based on information-theoretic approaches in time series analysis,” Physics Reports, vol. 441, no. 1, pp. 1–46, 2007.

[19] P.-O. Amblard and O. J. Michel, “On directed information theory and Granger causality graphs,” Journal of Computational Neuroscience, vol. 30, no. 1, pp. 7–16, 2011.

[20] A. Rahimzamani and S. Kannan, “Network inference using directed information: The deterministic limit,” in Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pp. 156–163, IEEE, 2016.

[21] A. Rahimzamani and S.
Kannan, “Potential conditional mutual information: Estimators and properties,” in Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on, pp. 1228–1235, IEEE, 2017.

[22] L. Kozachenko and N. N. Leonenko, “Sample estimate of the entropy of a random vector,” Problemy Peredachi Informatsii, vol. 23, no. 2, pp. 9–16, 1987.

[23] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van der Meulen, “Nonparametric entropy estimation: An overview,” International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.

[24] R. Wieczorkowski and P. Grzegorzewski, “Entropy estimators: improvements and comparisons,” Communications in Statistics - Simulation and Computation, vol. 28, no. 2, pp. 541–567, 1999.

[25] E. G. Miller, “A new class of entropy estimators for multi-dimensional densities,” in Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on, vol. 3, pp. III-297, IEEE, 2003.

[26] I. Lee, “Sample-spacings-based density and entropy estimators for spherically invariant multidimensional data,” Neural Computation, vol. 22, no. 8, pp. 2208–2227, 2010.

[27] K. Sricharan, D. Wei, and A. O. Hero, “Ensemble estimators for multivariate entropy estimation,” IEEE Transactions on Information Theory, vol. 59, no. 7, pp. 4374–4388, 2013.

[28] Y. Han, J. Jiao, and T. Weissman, “Adaptive estimation of Shannon entropy,” in Information Theory (ISIT), 2015 IEEE International Symposium on, pp. 1372–1376, IEEE, 2015.

[29] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in Information Theory Workshop (ITW), 2015 IEEE, pp. 1–5, IEEE, 2015.

[30] S. Liu and C.
Trapnell, “Single-cell transcriptome sequencing: recent advances and remaining challenges,” F1000Research, vol. 5, 2016.

[31] J. Massey, “Causality, feedback and directed information,” in Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), pp. 303–305, 1990.

[32] C. E. Shannon, “Prediction and entropy of printed English,” Bell Labs Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.

[33] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.

[34] J. Jiao, K. Venkat, Y. Han, and T. Weissman, “Minimax estimation of functionals of discrete distributions,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015.

[35] Y. Wu and P. Yang, “Minimax rates of entropy estimation on large alphabets via best polynomial approximation,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3702–3720, 2016.

[36] I. Nemenman, F. Shafee, and W. Bialek, “Entropy and inference, revisited,” in Advances in Neural Information Processing Systems, pp. 471–478, 2002.

[37] M. Leśniewicz, “Expected entropy as a measure and criterion of randomness of binary sequences,” Przegląd Elektrotechniczny, vol. 90, no. 1, pp. 42–46, 2014.

[38] K. Sricharan, R. Raich, and A. O. Hero, “Estimation of nonlinear functionals of densities with confidence,” IEEE Transactions on Information Theory, vol. 58, no. 7, pp. 4135–4159, 2012.

[39] S. Singh and B. Póczos, “Exponential concentration of a density functional estimator,” in Advances in Neural Information Processing Systems, pp. 3032–3040, 2014.

[40] K. Kandasamy, A. Krishnamurthy, B. Poczos, L.
Wasserman, et al., “Nonparametric von Mises estimators for entropies, divergences and mutual informations,” in Advances in Neural Information Processing Systems, pp. 397–405, 2015.

[41] W. Gao, S. Oh, and P. Viswanath, “Breaking the bandwidth barrier: Geometrical adaptive entropy estimation,” in Advances in Neural Information Processing Systems, pp. 2460–2468, 2016.

[42] J. Jiao, W. Gao, and Y. Han, “The nearest neighbor information estimator is adaptively near minimax rate-optimal,” arXiv preprint arXiv:1711.08824, 2017.

[43] D. Pál, B. Póczos, and C. Szepesvári, “Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs,” in Advances in Neural Information Processing Systems, pp. 1849–1857, 2010.

[44] H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk, “Nearest neighbor estimates of entropy,” American Journal of Mathematical and Management Sciences, vol. 23, no. 3-4, pp. 301–321, 2003.

[45] S. Singh and B. Póczos, “Finite-sample analysis of fixed-k nearest neighbor density functional estimators,” in Advances in Neural Information Processing Systems, pp. 1217–1225, 2016.

[46] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical Review E, vol. 69, no. 6, p. 066138, 2004.

[47] W. Gao, S. Oh, and P. Viswanath, “Demystifying fixed k-nearest neighbor information estimators,” IEEE Transactions on Information Theory, pp. 1–1, 2018.

[48] J. Runge, “Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information,” arXiv preprint arXiv:1709.01447, 2017.

[49] S. Frenzel and B. Pompe, “Partial mutual information for coupling analysis of multivariate time series,” Physical Review Letters, vol. 99, no. 20, p.
204101, 2007.

[50] M. Vejmelka and M. Paluš, “Inferring the directionality of coupling with conditional mutual information,” Physical Review E, vol. 77, no. 2, p. 026214, 2008.

[51] W. Gao, S. Kannan, S. Oh, and P. Viswanath, “Estimating mutual information for discrete-continuous mixtures,” in Advances in Neural Information Processing Systems, pp. 5988–5999, 2017.

[52] X. Qiu, S. Ding, and T. Shi, “From understanding the development landscape of the canonical fate-switch pair to constructing a dynamic landscape for two-step neural differentiation,” PLoS ONE, vol. 7, no. 12, p. e49271, 2012.

[53] J. R. Vergara and P. A. Estévez, “CMIM-2: an enhanced conditional mutual information maximization criterion for feature selection,” Journal of Applied Computer Science Methods, vol. 2, 2010.

[54] M. Noshad and A. O. Hero III, “Scalable hash-based estimation of divergence measures,” arXiv preprint arXiv:1801.00398, 2018.

[55] Y. Wu, “Lecture notes in information theory,” www.stat.yale.edu/~yw562/teaching/itlectures.pdf.

[56] J. M. Bernardo, “Algorithm AS 103: Psi (digamma) function,” Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 25, no. 3, pp. 315–317, 1976.

[57] L. Evans, Measure theory and fine properties of functions. Routledge, 2018.