{"title": "Sobolev Independence Criterion", "book": "Advances in Neural Information Processing Systems", "page_first": 9509, "page_last": 9519, "abstract": "We propose the Sobolev Independence Criterion (SIC), an interpretable dependency measure between a high dimensional random variable X and a response variable Y. SIC decomposes to the sum of feature importance scores and hence can be used for nonlinear feature selection. SIC can be seen as a gradient regularized Integral Probability Metric (IPM) between the joint distribution of the two random variables and the product of their marginals. We use sparsity inducing gradient penalties to promote input sparsity of the critic of the IPM. In the kernel version we show that SIC can be cast as a convex optimization problem by introducing auxiliary variables that play an important role in feature selection as they are normalized feature importance scores. We then present a neural version of SIC where the critic is parameterized as a homogeneous neural network, improving its representation power as well as its interpretability. We conduct experiments validating SIC for feature selection in synthetic and real-world experiments. We show that SIC enables reliable and interpretable discoveries, when used in conjunction with the holdout randomization test and knockoffs to control the False Discovery Rate. Code is available at http://github.com/ibm/sic.", "full_text": "Sobolev Independence Criterion\n\nYoussef Mroueh, Tom Sercu, Mattia Rigotti, Inkit Padhi, Cicero Dos Santos \u21e4\n\nIBM Research & MIT-IBM Watson AI lab\n\nmroueh,mrigotti@us.ibm.com,inkit.padhi@ibm.com\n\nAbstract\n\nWe propose the Sobolev Independence Criterion (SIC), an interpretable dependency\nmeasure between a high dimensional random variable X and a response variable\nY . SIC decomposes to the sum of feature importance scores and hence can be used\nfor nonlinear feature selection. 
SIC can be seen as a gradient regularized Integral\nProbability Metric (IPM) between the joint distribution of the two random variables\nand the product of their marginals. We use sparsity inducing gradient penalties\nto promote input sparsity of the critic of the IPM. In the kernel version we show\nthat SIC can be cast as a convex optimization problem by introducing auxiliary\nvariables that play an important role in feature selection as they are normalized\nfeature importance scores. We then present a neural version of SIC where the critic\nis parameterized as a homogeneous neural network, improving its representation\npower as well as its interpretability. We conduct experiments validating SIC for\nfeature selection in synthetic and real-world experiments. We show that SIC enables\nreliable and interpretable discoveries, when used in conjunction with the holdout\nrandomization test and knockoffs to control the False Discovery Rate. Code is\navailable at http://github.com/ibm/sic.\n\n1 Introduction\n\nFeature Selection is an important problem in statistics and machine learning for interpretable\npredictive modeling and scientific discoveries. Our goal in this paper is to design a dependency measure that\nis interpretable and can be reliably used to control the False Discovery Rate in feature selection. The\nmutual information between two random variables X and Y is the most commonly used dependency\nmeasure. The mutual information $I(X; Y)$ is defined as the Kullback-Leibler divergence between the\njoint distribution $p_{xy}$ of X, Y and the product of their marginals $p_x p_y$, $I(X; Y) = \\mathrm{KL}(p_{xy}, p_x p_y)$.\nMutual information is however challenging to estimate from samples, which motivated the\nintroduction of dependency measures based on f-divergences other than the KL divergence, or on\nIntegral Probability Metrics [1]. 
For instance, the Hilbert-Schmidt Independence Criterion (HSIC) [2] uses\nthe Maximum Mean Discrepancy (MMD) [3] to assess the dependency between two variables, i.e.\n$\\mathrm{HSIC}(X, Y) = \\mathrm{MMD}(p_{xy}, p_x p_y)$, which can be easily estimated from samples via kernel mean\nembeddings in a Reproducing Kernel Hilbert Space (RKHS) [4]. In this paper we introduce the\nSobolev Independence Criterion (SIC), a form of gradient regularized Integral Probability Metric\n(IPM) [5, 6, 7] between the joint distribution and the product of marginals. SIC relies on the statistics\nof the gradient of a witness function, or critic, for both (1) defining the IPM constraint and (2) finding\nthe features that discriminate between the joint and the marginals. Intuitively, the magnitude of\nthe average gradient with respect to a feature gives an importance score for each feature. Hence,\npromoting its sparsity is a natural constraint for feature selection.\n\n*Tom Sercu is now with Facebook AI Research, and Cicero Dos Santos with Amazon AWS AI. The work\nwas done when they were at IBM Research.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe paper is organized as follows: we show in Section 2 how sparsity-inducing gradient penalties can\nbe used to define an interpretable dependency measure that we name Sobolev Independence Criterion\n(SIC). We devise an equivalent computation-friendly formulation of SIC in Section 3, which gives rise\nto additional auxiliary variables $\\eta_j$. These naturally define normalized feature importance scores that\ncan be used for feature selection. In Section 4 we study the case where the SIC witness function f is\nrestricted to an RKHS and show that it leads to an optimization problem that is jointly convex in f\nand the importance scores $\\eta$. We show that in this case SIC decomposes into the sum of feature scores,\nwhich is ideal for feature selection. 
In Section 5 we introduce a Neural version of SIC, which we\nshow preserves the advantages in terms of interpretability when the witness function is parameterized\nas a homogeneous neural network, and which we show can be optimized using stochastic Block\nCoordinate Descent. In Section 6 we show how SIC and conditional Generative models can be used\nto control the False Discovery Rate using the recently introduced Holdout Randomization Test [8]\nand Knockoffs [9]. We validate SIC and its FDR control on synthetic and real datasets in Section 8.\n\n2 Sobolev Independence Criterion: Interpretable Dependency Measure\n\nMotivation: Feature Selection. We start by motivating gradient-sparsity regularization in SIC as a\nmeans of selecting the features that maintain maximum dependency between two random variables X\n(the input) and Y (the response) defined on two spaces $\\mathcal{X} \\subset \\mathbb{R}^{d_x}$ and $\\mathcal{Y} \\subset \\mathbb{R}^{d_y}$ (in the simplest case\n$d_y = 1$). Let $p_{xy}$ be the joint distribution of (X, Y) and $p_x$, $p_y$ be the marginals of X and Y respectively. Let\nD be an Integral Probability Metric associated with a function space $\\mathcal{F}$, i.e. for two distributions p, q:\n\n$D(p, q) = \\sup_{f \\in \\mathcal{F}} E_{x \\sim p} f(x) - E_{x \\sim q} f(x)$.\n\nWith $p = p_{xy}$ and $q = p_x p_y$ this becomes a generalized definition of Mutual Information.\nInstead of the usual KL divergence, the metric D with its witness function, or critic, f(x, y)\nmeasures the distance between the joint $p_{xy}$ and the product of marginals $p_x p_y$. With this generalized\ndefinition of mutual information, the feature selection problem can be formalized as finding a\nsparse selector or gate $w \\in \\mathbb{R}^{d_x}$ such that $D(p_{w \\odot x, y}, p_{w \\odot x} p_y)$ is maximal [10, 11, 12, 13], i.e.\n$\\sup_{w, \\|w\\|_{\\ell_0} \\leq s} D(p_{w \\odot x, y}, p_{w \\odot x} p_y)$, where $\\odot$ is a pointwise multiplication and $\\|w\\|_{\\ell_0} = \\#\\{j \\mid w_j \\neq 0\\}$. 
This problem can be written in the following penalized form:\n\n(P): $\\sup_{w} \\sup_{f \\in \\mathcal{F}} E_{p_{xy}} f(w \\odot x, y) - E_{p_x p_y} f(w \\odot x, y) - \\lambda \\|w\\|_{\\ell_0}$.\n\nWe can relabel $\\tilde{f}(x, y) = f(w \\odot x, y)$ and write (P) as: $\\sup_{\\tilde{f} \\in \\tilde{\\mathcal{F}}} E_{p_{xy}} \\tilde{f}(x, y) - E_{p_x p_y} \\tilde{f}(x, y)$, where\n$\\tilde{\\mathcal{F}} = \\{\\tilde{f} \\mid \\tilde{f}(x, y) = f(w \\odot x, y), f \\in \\mathcal{F}, \\|w\\|_{\\ell_0} \\leq s\\}$. Observe that we have: $\\frac{\\partial \\tilde{f}}{\\partial x_j} = w_j \\frac{\\partial f(w \\odot x, y)}{\\partial x_j}$.\nSince w is sparse, the gradient of $\\tilde{f}$ is sparse on the support of $p_{xy}$ and $p_x p_y$. Hence, we can\nreformulate the problem (P) as follows:\n\n(SIC): $\\sup_{f \\in \\mathcal{F}} E_{p_{xy}} f(x, y) - E_{p_x p_y} f(x, y) - \\lambda P_S(f)$,\n\nwhere $P_S(f)$ is a penalty that controls the sparsity of the gradient of the witness function f on the\nsupport of the measures. Controlling the nonlinear sparsity of the witness function in (SIC) via its\ngradients is more general and powerful than the linear sparsity control suggested in the initial form\n(P), since it takes into account the nonlinear interactions with other variables. In the following Section\nwe formalize this intuition by theoretically examining sparsity-inducing gradient penalties [14].\nSparsity Inducing Gradient Penalties. Gradient penalties have a long history in machine learning\nand signal processing. In image processing the total variation norm is used for instance as a\nregularizer to induce smoothness. Splines in Sobolev spaces [15] and manifold learning exploit\ngradient regularization to promote smoothness and regularity of the estimator. In the context of neural\nnetworks, gradient penalties were made possible through double back-propagation introduced in\n[16] and were shown to promote robustness and better generalization. Such smoothness penalties\nbecame popular in deep learning partly following the introduction of WGAN-GP [17], and were used\nas regularizers for distance measures between distributions in connection to optimal transport theory\n[5, 6, 7]. 
Let $\\mu$ be a dominant measure of $p_{xy}$ and $p_x p_y$. The most commonly used gradient penalty is\n\n$\\Omega_{L^2}(f) = E_{(x,y) \\sim \\mu} \\|\\nabla_x f(x, y)\\|^2$.\n\nWhile this penalty promotes smoothness, it does not control the desired sparsity as discussed in the\nprevious section. We therefore elect to instead use the nonlinear sparsity penalty introduced in [14]:\n\n$\\Omega_{\\ell_0}(f) = \\#\\{j \\mid E_{(x,y) \\sim \\mu} |\\frac{\\partial f(x,y)}{\\partial x_j}|^2 \\neq 0\\}$, and its relaxation:\n\n$\\Omega_S(f) = \\sum_{j=1}^{d_x} \\sqrt{E_{(x,y) \\sim \\mu} |\\frac{\\partial f(x,y)}{\\partial x_j}|^2}$.\n\nAs discussed in [14], $E_{(x,y) \\sim \\mu} |\\frac{\\partial f(x,y)}{\\partial x_j}|^2 = 0$ implies that f is constant with respect to variable $x_j$, if\nthe function f is continuously differentiable and the support of $\\mu$ is connected. These considerations\nmotivate the following definition of the Sobolev Independence Criterion (SIC):\n\n$\\mathrm{SIC}_{(L_1)^2}(p_{xy}, p_x p_y) = \\sup_{f \\in \\mathcal{F}} E_{p_{xy}} f(x, y) - E_{p_x p_y} f(x, y) - \\frac{\\lambda}{2} (\\Omega_S(f))^2 - \\frac{\\rho}{2} E_{\\mu} f^2(x, y)$.\n\nNote that we add an $\\ell_1$-like penalty ($\\Omega_S(f)$) to ensure sparsity and an $\\ell_2$-like penalty ($E_{\\mu} f^2(x, y)$) to\nensure stability. This is similar to practices with linear models such as Elastic net.\nHere we will consider $\\mu = p_x p_y$ (although we could also use $\\mu = \\frac{1}{2}(p_{xy} + p_x p_y)$). Then,\ngiven samples $\\{(x_i, y_i), i = 1, \\dots, N\\}$ from the joint probability distribution $p_{xy}$ and iid samples\n$\\{(x_i, \\tilde{y}_i), i = 1, \\dots, N\\}$ from $p_x p_y$, SIC can be estimated as follows:\n\n$\\widehat{\\mathrm{SIC}}_{(L_1)^2}(p_{xy}, p_x p_y) = \\sup_{f \\in \\mathcal{F}} \\frac{1}{N} \\sum_{i=1}^{N} f(x_i, y_i) - \\frac{1}{N} \\sum_{i=1}^{N} f(x_i, \\tilde{y}_i) - \\frac{\\lambda}{2} (\\hat{\\Omega}_S(f))^2 - \\frac{\\rho}{2} \\frac{1}{N} \\sum_{i=1}^{N} f^2(x_i, \\tilde{y}_i)$,\n\nwhere $\\hat{\\Omega}_S(f) = \\sum_{j=1}^{d_x} \\sqrt{\\frac{1}{N} \\sum_{i=1}^{N} |\\frac{\\partial f(x_i, \\tilde{y}_i)}{\\partial x_j}|^2}$.\nRemark 1. Throughout this paper we consider feature selection only on x since y is thought of as\nthe response. 
Nevertheless, in many other problems one can perform feature selection on x and y\njointly, which can be simply achieved by also controlling the sparsity of $\\nabla_y f(x, y)$ in a similar way.\n\n3 Equivalent Forms of SIC with $\\eta$-trick\n\nAs it was just presented, the SIC objective is a difficult function to optimize in practice. First of all,\nthe expectation appears inside the square root in the gradient penalties, resulting in a non-smooth term\n(since the derivative of the square root is not continuous at 0). Moreover, the fact that the expectation\nis inside the nonlinearity introduces a gradient estimation bias when the optimization of the SIC\nobjective is performed using stochastic gradient descent (i.e. using mini-batches). We alleviate these\nproblems (non-smoothness and biased expectation estimation) by making the expectation linear in the\nobjective thanks to the introduction of auxiliary variables $\\eta_j$ that will end up playing an important role\nin this work. This is achieved thanks to a variational form of the square root that is derived from the\nfollowing Lemma (which was used for a similar purpose as ours when alleviating the non-smoothness\nof mixed norms encountered in multiple kernel learning and group sparsity norms):\n\nLemma 1 ([18],[19]). Let $a_j$, $j = 1 \\dots d$, $a_j > 0$. We have: $(\\sum_{j=1}^{d} \\sqrt{a_j})^2 = \\inf\\{\\sum_{j=1}^{d} \\frac{a_j}{\\eta_j} : \\eta, \\eta_j > 0, \\sum_{j=1}^{d} \\eta_j = 1\\}$, optimum achieved at $\\eta_j = \\sqrt{a_j} / \\sum_{j} \\sqrt{a_j}$.\nWe alleviate first the issue of non smoothness of the square root by adding an $\\varepsilon \\in (0, 1)$, and we\ndefine: $\\Omega_{S,\\varepsilon}(f) = \\sum_{j=1}^{d_x} \\sqrt{E_{(x,y) \\sim \\mu} |\\frac{\\partial f(x,y)}{\\partial x_j}|^2 + \\varepsilon}$. 
Using Lemma 1 the nonlinear sparsity inducing\ngradient penalty can be written as:\n\n$(\\Omega_{S,\\varepsilon}(f))^2 = \\inf\\{\\sum_{j=1}^{d_x} \\frac{E_{p_x p_y} |\\frac{\\partial f(x,y)}{\\partial x_j}|^2 + \\varepsilon}{\\eta_j} : \\eta, \\eta_j > 0, \\sum_{j=1}^{d_x} \\eta_j = 1\\}$,\n\nwhere the optimum is achieved for: $\\eta^*_{j,\\varepsilon} = \\frac{\\beta_j}{\\sum_{k=1}^{d_x} \\beta_k}$, where $\\beta_j^2 = E_{p_x p_y} |\\frac{\\partial f(x,y)}{\\partial x_j}|^2 + \\varepsilon$. We refer to\n$\\eta^*_{j,\\varepsilon}$ as the normalized importance score of feature j. Note that $\\eta$ is a distribution over the features\nand gives a natural ranking between the features. Hence, substituting $\\Omega_S(f)$ with $\\Omega_{S,\\varepsilon}(f)$ in its\nequivalent form we obtain the $\\varepsilon$ perturbed SIC:\n\n$\\mathrm{SIC}_{(L_1)^2,\\varepsilon}(p_{xy}, p_x p_y) = -\\inf\\{L_{\\varepsilon}(f, \\eta) : f \\in \\mathcal{F}, \\eta_j, \\eta_j > 0, \\sum_{j=1}^{d_x} \\eta_j = 1\\}$,\n\nwhere $L_{\\varepsilon}(f, \\eta) = -\\Delta(f, p_{xy}, p_x p_y) + \\frac{\\lambda}{2} \\sum_{j=1}^{d_x} \\frac{E_{p_x p_y} |\\frac{\\partial f(x,y)}{\\partial x_j}|^2 + \\varepsilon}{\\eta_j} + \\frac{\\rho}{2} E_{p_x p_y} f^2(x, y)$, and\n$\\Delta(f, p_{xy}, p_x p_y) = E_{p_{xy}} f(x, y) - E_{p_x p_y} f(x, y)$. Finally, SIC can be empirically estimated as\n\n$\\widehat{\\mathrm{SIC}}_{(L_1)^2,\\varepsilon}(p_{xy}, p_x p_y) = -\\inf\\{\\hat{L}_{\\varepsilon}(f, \\eta) : f \\in \\mathcal{F}, \\eta_j, \\eta_j > 0, \\sum_{j=1}^{d_x} \\eta_j = 1\\}$,\n\nwhere $\\hat{L}_{\\varepsilon}(f, \\eta) = -\\hat{\\Delta}(f, p_{xy}, p_x p_y) + \\frac{\\lambda}{2} \\sum_{j=1}^{d_x} \\frac{\\frac{1}{N} \\sum_{i=1}^{N} |\\frac{\\partial f(x_i, \\tilde{y}_i)}{\\partial x_j}|^2 + \\varepsilon}{\\eta_j} + \\frac{\\rho}{2} \\frac{1}{N} \\sum_{i=1}^{N} f^2(x_i, \\tilde{y}_i)$, and\nthe main objective is $\\hat{\\Delta}(f, p_{xy}, p_x p_y) = \\frac{1}{N} \\sum_{i=1}^{N} f(x_i, y_i) - \\frac{1}{N} \\sum_{i=1}^{N} f(x_i, \\tilde{y}_i)$.\nRemark 2 (Group Sparsity). We can define similarly nonlinear group sparsity, if we would like\nour critic to depend on subsets of coordinates. Let $G_k$, $k = 1, \\dots, K$ be overlapping or non\noverlapping groups: $\\Omega_{gS}(f) = \\sum_{k=1}^{K} \\sqrt{\\sum_{j \\in G_k} E_{p_x p_y} |\\frac{\\partial f(x,y)}{\\partial x_j}|^2}$. 
The $\\eta$-trick applies naturally.\n\n4 Convex Sobolev Independence Criterion in Fixed Feature Spaces\n\nWe will now specify the function space $\\mathcal{F}$ in SIC and consider in this Section critics of the form:\n\n$\\mathcal{F} = \\{f \\mid f(x, y) = \\langle u, \\Phi_{\\omega}(x, y) \\rangle, \\|u\\|_2 \\leq \\gamma\\}$,\n\nwhere $\\Phi_{\\omega} : \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}^m$ is a fixed finite dimensional feature map. We define the mean\nembeddings of the joint distribution $p_{xy}$ and product of marginals $p_x p_y$ as follows: $\\mu(p_{xy}) = E_{p_{xy}}[\\Phi_{\\omega}(x, y)]$,\n$\\mu(p_x p_y) = E_{p_x p_y}[\\Phi_{\\omega}(x, y)] \\in \\mathbb{R}^m$. Define the covariance embedding of $p_x p_y$ as\n$C(p_x p_y) = E_{p_x p_y}[\\Phi_{\\omega}(x, y) \\otimes \\Phi_{\\omega}(x, y)] \\in \\mathbb{R}^{m \\times m}$ and finally define the Gramian of derivatives\nembedding for coordinate j as $D_j(p_x p_y) = E_{p_x p_y}[\\frac{\\partial \\Phi_{\\omega}(x,y)}{\\partial x_j} \\otimes \\frac{\\partial \\Phi_{\\omega}(x,y)}{\\partial x_j}] \\in \\mathbb{R}^{m \\times m}$. We can write\nthe constraint $\\|u\\|_2 \\leq \\gamma$ as the penalty term $\\tau \\|u\\|^2$. Define\n\n$L_{\\varepsilon}(u, \\eta) = \\langle u, \\mu(p_x p_y) - \\mu(p_{xy}) \\rangle + \\frac{1}{2} \\langle u, (\\lambda \\sum_{j=1}^{d_x} \\frac{D_j(p_x p_y)}{\\eta_j} + \\rho C(p_x p_y) + \\tau I_m) u \\rangle + \\frac{\\lambda}{2} \\sum_{j=1}^{d_x} \\frac{\\varepsilon}{\\eta_j}$.\n\nObserve that:\n\n$\\mathrm{SIC}_{(L_1)^2,\\varepsilon}(p_{xy}, p_x p_y) = -\\inf\\{L_{\\varepsilon}(u, \\eta) : u \\in \\mathbb{R}^m, \\eta_j, \\eta_j > 0, \\sum_{j=1}^{d_x} \\eta_j = 1\\}$.\n\nWe start by remarking that SIC is a form of gradient regularized maximum mean discrepancy [3].\nPrevious MMD work comparing joint and product of marginals did not use the concept of nonlinear\nsparsity. For example the Hilbert-Schmidt Independence Criterion (HSIC) [2] uses $\\Phi_{\\omega}(x, y) =\\phi(x) \\otimes \\psi(y)$\nwith a constraint $\\|u\\|_2 \\leq 1$. CCA and related kernel measures of dependence [20, 21]\nuse $L_2^2$ constraints $L_2^2(p_x)$ and $L_2^2(p_y)$ on each function space separately.\nOptimization Properties of Convex SIC. We analyze in this Section the optimization properties of\nSIC. Theorem 1 shows that the $\\mathrm{SIC}_{(L_1)^2,\\varepsilon}$ loss function is jointly strictly convex in $(u, \\eta)$ and hence\nadmits a unique solution that solves a fixed point problem.\nTheorem 1 (Existence of a solution, Uniqueness, Convexity and Continuity). 
Note that $L(u, \\eta) = L_{\\varepsilon=0}(u, \\eta)$. The following properties hold for the SIC loss:\n1) $L(u, \\eta)$ is differentiable and jointly convex in $(u, \\eta)$. $L(u, \\eta)$ is not continuous for $\\eta$ such that\n$\\eta_j = 0$ for some j.\n2) Smoothing, Perturbed SIC: For $\\varepsilon \\in (0, 1)$, $L_{\\varepsilon}(u, \\eta) = L(u, \\eta) + \\frac{\\lambda}{2} \\sum_{j=1}^{d_x} \\frac{\\varepsilon}{\\eta_j}$ is jointly strictly\nconvex and has compact level sets on the probability simplex, and admits a unique minimizer $(u^*_{\\varepsilon}, \\eta^*_{\\varepsilon})$.\n3) The unique minimizer of $L_{\\varepsilon}(u, \\eta)$ is a solution of the following fixed point problem:\n$u^*_{\\varepsilon} = (\\lambda \\sum_{j=1}^{d_x} \\frac{D_j(p_x p_y)}{\\eta^*_{j,\\varepsilon}} + \\rho C(p_x p_y) + \\tau I_m)^{-1} (\\mu(p_{xy}) - \\mu(p_x p_y))$, and $\\eta^*_{j,\\varepsilon} = \\frac{\\sqrt{\\langle u^*_{\\varepsilon}, D_j(p_x p_y) u^*_{\\varepsilon} \\rangle + \\varepsilon}}{\\sum_{k=1}^{d_x} \\sqrt{\\langle u^*_{\\varepsilon}, D_k(p_x p_y) u^*_{\\varepsilon} \\rangle + \\varepsilon}}$.\nThe following Theorem shows that a solution of the unperturbed SIC problem can be obtained from\nthe smoothed $\\mathrm{SIC}_{(L_1)^2,\\varepsilon}$ in the limit $\\varepsilon \\to 0$:\n\nTheorem 2 (From Perturbed SIC to SIC). Consider a sequence $\\varepsilon_{\\ell}$, $\\varepsilon_{\\ell} \\to 0$ as $\\ell \\to \\infty$, and consider\na sequence of minimizers $(u^*_{\\varepsilon_{\\ell}}, \\eta^*_{\\varepsilon_{\\ell}})$ of $L_{\\varepsilon_{\\ell}}(u, \\eta)$, and let $(u^*, \\eta^*)$ be the limit of this sequence, then\n$(u^*, \\eta^*)$ is a minimizer of $L(u, \\eta)$.\nInterpretability of SIC. The following corollary shows that SIC can be written in terms of the\nimportance scores of the features, since at optimum the main objective is proportional to the constraint\nterm. It is to the best of our knowledge the first dependency criterion that decomposes in the sum of\ncontributions of each coordinate, and hence it is an interpretable dependency measure. Moreover, $\\eta^*_j$\nare normalized importance scores of each feature j, and their ranking can be used to assess feature\nimportance.\nCorollary 1 (Interpretability of Convex SIC). Let $(u^*, \\eta^*)$ be the limit defined in Theorem 2. 
Define\n$f^*(x, y) = \\langle u^*, \\Phi_{\\omega}(x, y) \\rangle$, and $\\|f^*\\|_{\\mathcal{F}} = \\|u^*\\|$. We have that\n\n$\\mathrm{SIC}_{(L_1)^2}(p_{xy}, p_x p_y) = \\frac{1}{2} (E_{p_{xy}} f^*(x, y) - E_{p_x p_y} f^*(x, y)) = \\frac{\\lambda}{2} (\\sum_{j=1}^{d_x} \\sqrt{E_{p_x p_y} |\\frac{\\partial f^*(x,y)}{\\partial x_j}|^2})^2 + \\frac{\\rho}{2} E_{p_x p_y} (f^*)^2(x, y) + \\frac{\\tau}{2} \\|f^*\\|^2_{\\mathcal{F}}$.\n\nMoreover, $\\sqrt{E_{p_x p_y} |\\frac{\\partial f^*(x,y)}{\\partial x_j}|^2} = \\eta^*_j \\Omega_S(f^*)$ and $\\sum_{j=1}^{d_x} \\eta^*_j = 1$. The terms $\\eta^*_j$ can be seen as\nquantifying how much dependency as measured by SIC can be explained by a coordinate j. Ranking\nof $\\eta^*_j$ can be used to rank influence of coordinates.\nThanks to the joint convexity and the smoothness of the perturbed SIC, we can solve convex empirical\nSIC using alternating minimization on u and $\\eta$ or block coordinate descent using first order methods\nsuch as gradient descent on u and mirror descent [22] on $\\eta$ that are known to be globally convergent\nin this case (see Appendix A for more details).\n\n5 Non Convex Neural SIC with Deep ReLU Networks\n\nWhile Convex SIC enjoys a lot of theoretical properties, a crucial shortcoming is the need to choose\na feature map $\\Phi_{\\omega}$ that essentially goes back to the choice of a kernel in classical kernel methods. As\nan alternative, we propose to learn the feature map as a deep neural network. The architecture of\nthe network can be problem dependent, but we focus here on a particular architecture: Deep ReLU\nNetworks with biases removed. As we show below, using our sparsity inducing gradient penalties\nwith such networks results in input sparsity at the level of the witness function f of SIC. 
This is\ndesirable since it allows for an interpretable model: similar to the effect of Lasso with linear models,\nour sparsity inducing gradient penalties result in a nonlinear self-explainable witness function f [23],\nwith explicit sparse dependency on the inputs.\nDeep ReLU Networks with no biases, homogeneity and Input Sparsity via Gradient Penalties.\nWe start by invoking the Euler Theorem for homogeneous functions:\nTheorem 3 (Euler Theorem for Homogeneous Functions). A continuously differentiable function f\nis defined as homogeneous of degree k if $f(\\lambda x) = \\lambda^k f(x), \\forall \\lambda \\in \\mathbb{R}$. The Theorem states that f is\nhomogeneous of degree k if and only if $k f(x) = \\langle \\nabla_x f(x), x \\rangle = \\sum_{j=1}^{d_x} \\frac{\\partial f(x)}{\\partial x_j} x_j$.\nNow consider deep ReLU networks with biases removed for any number of layers L: $\\mathcal{F}_{ReLU} =\\{f \\mid f(x, y) = \\langle u, \\Phi_{\\omega}(x, y) \\rangle$, where $\\Phi_{\\omega}(x, y) = \\sigma(W_L \\dots \\sigma(W_2 \\sigma(W_1 [x, y])))$, $u \\in \\mathbb{R}^m$, $\\Phi_{\\omega} : \\mathbb{R}^{d_x + d_y} \\to \\mathbb{R}^m\\}$,\nwhere $\\sigma(t) = \\max(t, 0)$ and $W_j$ are linear weights. Any $f \\in \\mathcal{F}_{ReLU}$ is clearly\nhomogeneous of degree 1. As an immediate consequence of Euler Theorem we then have:\n$f(x, y) = \\langle \\nabla_x f(x, y), x \\rangle + \\langle \\nabla_y f(x, y), y \\rangle$. The first term is similar to a linear term in a\nlinear model, the second term can be seen as a bias. Using our sparsity-inducing gradient penalties\nwith such networks guarantees that on average on the support of a dominant measure the gradients\nwith respect to x are sparse. Intuitively, the gradients with respect to x act like the weights in linear models, and\nour sparsity inducing gradient penalty acts like the $\\ell_1$ regularization of Lasso. The main advantage\ncompared to Lasso is that we have a highly nonlinear decision function, which has better capacity for\ncapturing dependencies between X and Y.\nNon-convex SIC with Stochastic Block Coordinate Descent (BCD). We define the empirical non\nconvex $\\mathrm{SIC}_{(L_1)^2}$ using this function space $\\mathcal{F}_{ReLU}$ as follows:\n\n$\\widehat{\\mathrm{SIC}}_{(L_1)^2}(p_{xy}, p_x p_y) = -\\inf\\{\\hat{L}(f_{\\theta}, \\eta) : f_{\\theta} \\in \\mathcal{F}_{ReLU}, \\eta_j, \\eta_j > 0, \\sum_{j=1}^{d_x} \\eta_j = 1\\}$,\n\nwhere $\\theta = (\\mathrm{vec}(W_1) \\dots \\mathrm{vec}(W_L), u)$ are the network parameters. Algorithm 3 in Appendix B\nsummarizes our stochastic BCD algorithm for training the Neural SIC. The algorithm consists of\nSGD updates to $\\theta$ and mirror descent updates to $\\eta$.\nBoosted SIC. When training Neural SIC, we can obtain different critics $f_{\\ell}$ and importance scores $\\eta_{\\ell}$,\nby varying random seeds or hyper-parameters (architecture, batch size etc). Inspired by importance\nscores in random forest, we define Boosted SIC as the arithmetic mean or the geometric mean of $\\eta_{\\ell}$.\n\n6 FDR Control and the Holdout Randomization Test / Knockoffs\n\nControlling the False Discovery Rate (FDR) in Feature Selection is an important problem for\nreproducible discoveries. In a nutshell, for a feature selection problem given the ground-truth set of\nfeatures S, and a feature selection method such as SIC that gives a candidate set $\\hat{S}$, our goal is to\nmaximize the TPR (True Positive Rate) or the power, and to keep the False Discovery Rate (FDR)\nunder control. TPR and FDR are defined as follows:\n\n$\\mathrm{TPR} := E[\\frac{\\#\\{i : i \\in \\hat{S} \\cap S\\}}{\\#\\{i : i \\in S\\}}]$, $\\mathrm{FDR} := E[\\frac{\\#\\{i : i \\in \\hat{S} \\setminus S\\}}{\\#\\{i : i \\in \\hat{S}\\}}]$. (1)\n\nWe explore in this paper two methods that provably control the FDR: 1) The Holdout Randomization\nTest (HRT) introduced in [8], that we specialize for SIC in Algorithm 4; 2) Knockoffs introduced\nin [9] that can be used with any basic feature selection method such as Neural SIC, and guarantees\nprovable FDR control.\nHRT-SIC. We are interested in measuring the conditional dependency between a feature $x_j$ and the\nresponse variable y conditionally on the other features, denoted $x_{-j}$. Hence we have the following\nnull hypothesis: $H_0 : x_j \\perp y \\mid x_{-j} \\iff p_{xy} = p_{x_j | x_{-j}} p_{y | x_{-j}} p_{x_{-j}}$. In order to simulate the null\nhypothesis, we propose to use generative models for sampling from $x_j | x_{-j}$ (See Appendix D). 
The\nprinciple in HRT [8] that we specify here for SIC in Algorithm 4 (given in Appendix B) is the\nfollowing: instead of refitting SIC under $H_0$, we evaluate the mean of the witness function of SIC on\na holdout set sampled under $H_0$ (using conditional generators for R rounds). The deviation of the\nmean of the witness function under $H_0$ from its mean on a holdout from the real distribution gives us\np-values. We use the Benjamini-Hochberg [24] procedure on those p-values to achieve a target FDR.\nWe apply HRT-SIC on a shortlist of pre-selected features per their ranking of $\\eta_j$.\nKnockoffs-SIC. Knockoffs [25] work by finding control variables called knockoffs $\\tilde{x}$ that mimic the\nbehavior of the real features x and provably control the FDR [9]. We use here Gaussian knockoffs\n[9] and train SIC on the concatenation of $[x, \\tilde{x}]$, i.e. we train $\\mathrm{SIC}([X; \\tilde{X}], Y)$ and obtain $\\eta$ that has\nnow twice the dimension $d_x$, i.e. for each real feature j, there is the real importance score $\\eta_j$ and the\nknockoff importance score $\\eta_{j+d_x}$. Knockoffs-SIC consists in using the statistics $W_j = \\eta_j - \\eta_{j+d_x}$\nand the knockoff filter [9] to select features based on the sign of $W_j$ (See Alg. 5 in Appendix).\n\n7 Relation to Previous Work\n\nKernel/Neural Measure of Dependencies. As discussed earlier SIC can be seen as a sparse gradient\nregularized MMD [3, 7] and relates to the Sobolev Discrepancy of [5, 6]. Feature selection with\nMMD was introduced in [10] and is based on backward elimination of features by recomputing MMD\non the ablated vectors. SIC has the advantage of fitting one critic that has interpretable feature scores.\nRelated to the MMD is the Hilbert Schmidt Independence Criterion (HSIC) and other variants of\nkernel dependency measures introduced in [2, 21]. None of those criteria has a nonparametric sparsity\nconstraint on its witness function that allows for explainability and feature selection. 
Other neural\nmeasures of dependency include MINE [26], which estimates the KL divergence using neural networks, and\nthat of [27], which estimates a proxy to the Wasserstein distance using neural networks.\nInterpretability, Sparsity, Saliency and Sensitivity Analysis. Lasso and elastic net [28] are\ninterpretable linear models that exploit sparsity, but are limited to linear relationships. Random forests\n[29] have a heuristic for determining feature importance and are successful in practice as they can\ncapture nonlinear relationships similar to SIC. We believe SIC can potentially leverage the deep\nlearning toolkit for going beyond tabular data where random forests excel, to more structured data\nsuch as time series or graph data. Finally, SIC relates to saliency based post-hoc interpretation of\ndeep models such as [30, 31, 32]. While those methods use the gradient information for a post-hoc\nanalysis, SIC incorporates this information to guide the learning towards the important features. As\ndiscussed in Section 2.1 many recent works introduce deep networks with input sparsity control\nthrough a learned gate or a penalty on the weights of the network [11, 12, 13]. SIC exploits a stronger\nnotion of sparsity that leverages the relationship between the different covariates.\n\n8 Experiments\n\nSynthetic Data Validation. We first validate our methods and compare them to baseline models\nin simulation studies on synthetic datasets where the ground truth is available by construction. For\nthis we generate the data according to a model $y = f(x) + \\epsilon$ where the model $f(\\cdot)$ and the noise $\\epsilon$\ndefine the specific synthetic dataset (see Appendix F.1). In particular, the value of y only depends\non a subset of features $x_i$, $i = 1, \\dots, p$ through $f(\\cdot)$, and performance is quantified in terms of TPR\nand FDR in discovering them among the irrelevant features. 
We experiment with two datasets: A)\nComplex multivariate synthetic data (SinExp), which is generated from a complex multivariate\nmodel proposed in [33] Sec 5.3, where 6 ground truth features $x_i$ out of 50 generate the output y\nthrough a non-linearity involving the product and composition of the cos, sin and exp functions (see\nAppendix F.1). We therefore dub this dataset SinExp. To increase the difficulty even further, we\nintroduce a pairwise correlation between all features of 0.5. In Fig. 1 we show results for datasets\nof 125 and 500 samples repeated 100 times comparing performance of our models with that of\ntwo baselines: Elastic Net (EN) and Random Forest (RF). B) Liang Dataset. We show results on the\nbenchmark dataset proposed by [34], specifically the generalized Liang dataset matching most of the\nsetup from [8] Sec 5.1. We provide dataset details and results in Appendix F.1 (Results in Figure 2).\n\nFigure 1: SinExp synthetic dataset. TPR and FDR of Elastic Net (EN) and Random Forest (RF)\nbaseline models (left panels) are compared to our methods: a 2-hidden layer neural network with no\nbiases trained to minimize an objective comprising an MSE cost and a Sobolev Penalty term (MSE +\nSobolev Penalty), and the same network trained to optimize the SIC criterion (right panels), for datasets\nof 125 samples (top panels) and 500 samples (bottom panels). For all models TPR and FDR are\ncomputed by selecting the top 6 features in order of feature importance (which for EN is defined\nas the absolute value of the weight of a feature, for RF is the out-of-bag error associated to it (see\n[35]), and for our method is the value of its $\\eta$). Selecting the first 6 features is useful to compare\nmodels, but assumes oracle knowledge of the fact that there are 6 ground truth features. We therefore\nalso compute FDR and TPR after selecting features using the HRT method of [8] among the top 20\nfeatures. 
HRT estimates the importance of a feature by quantifying its effect on the distribution of y on a\nholdout set when its values are replaced with samples from a conditional distribution (see Section 6). We\nuse HRT to control the FDR at 10% (red horizontal dotted line). Standard box plots are generated\nover 100 repetitions of each simulation.\n\n[Figure 1 image not reproduced. Panel titles: Power and FDR for Elastic Net, Random Forest, MSE + Sobolev Penalty, and SIC; x-axis groups: TPR top 6, FDR top 6, TPR HRT, FDR HRT; y-axis from 0.0 to 1.0; dataset SINEXP with n = 125 samples and n = 500 samples.]\n\nFeature Selection on Drug Response dataset. We consider as a real-world application the Cancer\nCell Line Encyclopedia (CCLE) dataset [36], described in Appendix F.2. We study the result of\nusing the normalized importance scores $\\eta_j$ from SIC for (heuristic) feature selection, against features\nselected by Elastic Net. Table 1 shows the heldout MSE of a predictor trained on selected features,\naveraged over 100 runs (each run: new randomized 90%/10% data split, NN initialization). The\ngoal here is to quantify the predictiveness of features selected by SIC on its own, without the full\nrandomized testing machinery. The SIC critic and regressor NN were respectively the big_critic and\nregressor_NN described with training details in Appendix F.3, while the random forest is trained\nwith default hyper parameters from scikit-learn [37]. We can see that, with just $\\eta_j$, informative\nfeatures are selected for the downstream regression task, with performance comparable to those\nselected by ElasticNet, which was trained explicitly for this task. 
The features selected with high $\\eta_j$\nvalues and their overlap with the features selected by ElasticNet are listed in Appendix F.2 Table 3.\n\nFeatures                  NN               RF\nAll 7251 features         1.160 \u00b1 3.990    0.783 \u00b1 0.167\nElastic-Net1 [36] top-7   0.864 \u00b1 0.432    0.931 \u00b1 0.215\nElastic-Net2 [8] top-10   0.663 \u00b1 0.161    0.830 \u00b1 0.190\nSIC top-7                 0.728 \u00b1 0.166    0.856 \u00b1 0.189\nSIC top-10                0.706 \u00b1 0.158    0.817 \u00b1 0.173\nSIC top-15                0.734 \u00b1 0.168    0.859 \u00b1 0.202\n\nTable 1: CCLE results on downstream regression task. Heldout MSE for drug PLX4720 prediction\nbased on selected features. Columns: neural network (NN) and random forest (RF) regressors.\nHIV-1 Drug Resistance with Knockoffs-SIC. The second real-world dataset that we analyze is\nthe HIV-1 Drug Resistance dataset [38], which consists in detecting mutations associated with resistance\nto a drug type. For our experiments we use all the three classes of drugs: Protease Inhibitors (PIs),\nNucleoside Reverse Transcriptase Inhibitors (NRTIs), and Non-nucleoside Reverse Transcriptase\nInhibitors (NNRTIs). We use the pre-processing of each dataset (<drug-class, drug-type>) of the\nknockoff tutorial [39] made available by the authors. Concretely, we construct a dataset $(X, \\tilde{X})$ of\nthe concatenation of the real data and Gaussian knockoffs [9], and fit $\\mathrm{SIC}([X, \\tilde{X}], Y)$. As explained\nin Section 6, we use in the knockoff filter the statistics $W_j = \\eta_j - \\eta_{j+d_x}$, i.e. the difference of SIC\nimportance scores between each feature and its corresponding knockoff. For SIC experiments, we use\nthe small_critic architecture (See Appendix F.3 for training details). We use Boosted SIC, by varying\nthe batch sizes in $N \\in \\{10, 30, 50\\}$, and computing the geometric mean of $\\eta$ produced by those three\nsetups as the feature importance needed for Knockoffs. 
Results are summarized in Table 2.\n\n                         Knockoff with GLM      Boosted SIC Knockoff\nDrug Class   Drug Type   TD    FD    FDP        TD    FD    FDP\nPIs          APV         19    3     0.13       17    5     0.22\nPIs          ATV         22    8     0.26       19    1     0.05\nPIs          IDV         19    12    0.38       15    3     0.16\nPIs          LPV         16    1     0.05       14    2     0.12\nPIs          NFV         24    7     0.22       19    5     0.21\nPIs          RTV         19    8     0.29       12    2     0.20\nPIs          SQV         17    4     0.19       14    8     0.36\nNRTIs        X3TC        0     0     0          7     0     0\nNRTIs        ABC         10    1     0.09       11    1     0.08\nNRTIs        AZT         16    4     0.2        12    5     0.29\nNRTIs        D4T         6     1     0.14       8     0     0\nNRTIs        DDI         0     0     0          8     0     0\nNNRTIs       DLV         10    13    0.56       8     10    0.55\nNNRTIs       EFV         11    11    0.5        11    10    0.47\nNNRTIs       NVP         7     10    0.58       7     11    0.611\n\nTable 2: Comparison of applying (Knockoff filter + GLM) and (Knockoff filter + Boosted SIC). For\neach <drug-class, drug-type> we compared the True Discoveries (TD), False Discoveries (FD) and\nFalse Discovery Proportion (FDP). Knockoff with Boosted SIC keeps FDP under control without\ncompromising power, and succeeds in making true discoveries that GLM with knockoffs doesn\u2019t find.\n\n9 Conclusion\n\nWe introduced in this paper the Sobolev Independence Criterion (SIC), a dependency measure that\ngives rise to feature importance scores that can be used for feature selection and interpretable decision\nmaking. We laid down the theoretical foundations of SIC and showed how it can be used in\nconjunction with the Holdout Randomization Test and Knockoffs to control the FDR, enabling\nreliable discoveries. We demonstrated the merits of SIC for feature selection in extensive synthetic\nand real-world experiments with controlled FDR.\n\nReferences\n\n[1] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Sch\u00f6lkopf, and Gert\nR. G. Lanckriet. On integral probability metrics, \u03c6-divergences and binary classification. 2009.\n\n[2] A. Gretton, K. Fukumizu, CH. Teo, L. Song, B. Sch\u00f6lkopf, and AJ. Smola. A kernel statistical\ntest of independence. In Advances in Neural Information Processing Systems 20, 2008.\n\n[3] Arthur Gretton, Karsten M. 
Borgwardt, Malte J. Rasch, Bernhard Sch\u00f6lkopf, and Alexander\nSmola. A kernel two-sample test. JMLR, 2012.\n\n[4] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch\u00f6lkopf. Kernel\nmean embedding of distributions: A review and beyond. arXiv, 2017.\n\n[5] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. ICLR,\n2018.\n\n[6] Youssef Mroueh, Tom Sercu, and Anant Raj. Sobolev descent. In AISTATS, 2019.\n\n[7] Michael Arbel, Dougal J. Sutherland, Mikolaj Binkowski, and Arthur Gretton. On gradient\nregularizers for MMD GANs. NeurIPS, 2018.\n\n[8] W. Tansey, V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei. The holdout randomization test:\nPrincipled and easy black box feature selection. arXiv preprint arXiv:1811.00645, 2018.\n\n[9] Emmanuel Cand\u00e8s, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: Model-X\nknockoffs for high dimensional controlled variable selection. 2018.\n\n[10] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature selection\nvia dependence maximization. J. Mach. Learn. Res., 2012.\n\n[11] Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric\nregression and classification. 2017.\n\n[12] Mao Ye and Yan Sun. Variable selection via penalized neural network: a drop-out-one loss\napproach. In Proceedings of the 35th International Conference on Machine Learning, 2018.\n\n[13] Yutaro Yamada, Ofir Lindenbaum, Sahand Negahban, and Yuval Kluger. Deep supervised\nfeature selection using stochastic gates. arXiv, 2018.\n\n[14] Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, and Alessandro Verri. Nonparametric\nsparsity and regularization. J. Mach. Learn. Res., 2013.\n\n[15] Grace Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 24(4),\n1975.\n\n[16] Harris Drucker and Yann LeCun. Improving generalization performance using double\nbackpropagation. 
IEEE Transactions on Neural Networks, 1992.\n\n[17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville.\nImproved training of Wasserstein GANs. arXiv:1704.00028, 2017.\n\n[18] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature\nlearning. Mach. Learn., 2008.\n\n[19] Francis Bach, Rodolphe Jenatton, and Julien Mairal. Optimization with Sparsity-Inducing\nPenalties (Foundations and Trends(R) in Machine Learning). Now Publishers Inc., Hanover,\nMA, USA, 2011.\n\n[20] H.D. Vinod. Canonical ridge and econometrics of joint production. Journal of Econometrics,\n1976.\n\n[21] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Sch\u00f6lkopf. Kernel measures of\nconditional dependence. In Advances in Neural Information Processing Systems 20. 2008.\n\n[22] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for\nconvex optimization. Oper. Res. Lett., 2003.\n\n[23] David Alvarez Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining\nneural networks. In Advances in Neural Information Processing Systems 31. 2018.\n\n[24] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful\napproach to multiple testing. J. Roy. Statist. Soc., 57:289\u2013300, 1995.\n\n[25] Rina Foygel Barber, Emmanuel J Cand\u00e8s, et al. Controlling the false discovery rate via\nknockoffs. The Annals of Statistics, 43(5):2055\u20132085, 2015.\n\n[26] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio,\nAaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation, 2018.\n\n[27] Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, and Pierre\nSermanet. Wasserstein dependency measure for representation learning, 2019.\n\n[28] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 
The Elements of Statistical Learning.\nSpringer New York Inc., 2001.\n\n[29] Leo Breiman. Random forests. Mach. Learn., 2001.\n\n[30] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through\npropagating activation differences. In Proceedings of the 34th International Conference on\nMachine Learning, 2017.\n\n[31] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks:\nVisualising image classification models and saliency maps. International Conference on\nLearning Representations (Workshop Track), 2014.\n\n[32] Sebastian Bach, Alexander Binder, Gr\u00e9goire Montavon, Frederick Klauschen, Klaus-Robert\nM\u00fcller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by\nlayer-wise relevance propagation. PLoS ONE, 2015.\n\n[33] Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric\nregression and classification. arXiv preprint arXiv:1711.07592, 2017.\n\n[34] Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of drug sensitive\ngenes. Journal of the American Statistical Association, 113(523), 2018.\n\n[35] Leo Breiman. Random forests. Machine Learning, 45(1):5\u201332, 2001.\n\n[36] Jordi Barretina, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, Adam A Margolin,\nSungjoon Kim, Christopher J Wilson, Joseph Leh\u00e1r, Gregory V Kryukov, Dmitriy Sonkin, et al.\nThe cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity.\nNature, 483(7391):603, 2012.\n\n[37] Fabian Pedregosa, Ga\u00ebl Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion,\nOlivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al.\nScikit-learn: Machine learning in Python. 
Journal of Machine Learning Research, 12(Oct):2825\u20132830,\n2011.\n\n[38] Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L Brutlag, and\nRobert W Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance.\nProceedings of the National Academy of Sciences, 103(46):17355\u201317360, 2006.\n\n[39] Matteo Sesia and Evan Patterson. R tutorial for knockoffs - 4.\nhttps://web.stanford.edu/group/candes/knockoffs/software/knockoffs/tutorial-4-r.html, 2017.\n\n[40] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization.\nJ. Optim. Theory Appl., 109, 2001.\n\n[41] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block\nsuccessive minimization methods for nonsmooth optimization. SIAM Journal on Optimization,\n2013.\n\n[42] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.\nConditional image generation with PixelCNN decoders. In Advances in Neural Information\nProcessing Systems, pages 4790\u20134798, 2016.\n\n[43] Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, and Aaron Courville. Learning\nvisual reasoning without strong priors. arXiv preprint arXiv:1707.03017, 2017.\n\n[44] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\n[45] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,\nZeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in\nPyTorch. In NIPS-W, 2017.\n\n[46] Yaniv Romano, Matteo Sesia, and Emmanuel Cand\u00e8s. Deep knockoffs. 
Journal of the American\nStatistical Association, pages 1\u201327, 2019.", "award": [], "sourceid": 5057, "authors": [{"given_name": "Youssef", "family_name": "Mroueh", "institution": "IBM T.J Watson Research Center"}, {"given_name": "Tom", "family_name": "Sercu", "institution": "Facebook AI Research"}, {"given_name": "Mattia", "family_name": "Rigotti", "institution": "IBM Research AI"}, {"given_name": "Inkit", "family_name": "Padhi", "institution": "IBM Research"}, {"given_name": "Cicero", "family_name": "Nogueira dos Santos", "institution": "Amazon AWS AI"}]}