{"title": "Active Learning of Multi-Index Function Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1466, "page_last": 1474, "abstract": "We consider the problem of actively learning \\textit{multi-index} functions of the form $f(\\vecx) = g(\\matA\\vecx)= \\sum_{i=1}^k g_i(\\veca_i^T\\vecx)$ from point evaluations of $f$. We assume that the function $f$ is defined on an $\\ell_2$-ball in $\\Real^d$, $g$ is twice continuously differentiable almost everywhere, and $\\matA \\in \\mathbb{R}^{k \\times d}$ is a rank $k$ matrix, where $k \\ll d$.  We propose a randomized, active sampling scheme for estimating such functions with uniform approximation guarantees. Our theoretical developments leverage recent techniques from low rank matrix recovery, which enables us to derive an estimator of the function $f$ along with sample complexity bounds. We also characterize the noise robustness of the scheme, and provide empirical evidence that the high-dimensional scaling of our sample complexity bounds are quite accurate.", "full_text": "Active Learning of Multi-Index Function Models\n\nHemant Tyagi and Volkan Cevher\n\nLIONS \u2013 EPFL\n\nAbstract\n\nWe consider the problem of actively learning multi-index functions of the form $f(x) = g(Ax) = \\sum_{i=1}^k g_i(a_i^T x)$ from point evaluations of $f$. We assume that the function $f$ is defined on an $\\ell_2$-ball in $\\mathbb{R}^d$, $g$ is twice continuously differentiable almost everywhere, and $A \\in \\mathbb{R}^{k \\times d}$ is a rank $k$ matrix, where $k \\ll d$. We propose a randomized, active sampling scheme for estimating such functions with uniform approximation guarantees. Our theoretical developments leverage recent techniques from low rank matrix recovery, which enables us to derive an estimator of the function $f$ along with sample complexity bounds. 
We also characterize the noise robustness of the scheme, and provide empirical evidence that the high-dimensional scaling of our sample complexity bounds is quite accurate.\n\n1 Introduction\n\nLearning functions $f: x \\to y$ based on training data $(y_i, x_i)_{i=1}^m : \\mathbb{R} \\times \\mathbb{R}^d$ is a fundamental problem with many scientific and engineering applications. Often, the function $f$ has a parametric model, as in linear regression when $f(x) = a^T x$, and hence, learning the function amounts to learning the model parameters. In this setting, obtaining an approximate model $\\widehat{f}$ when $d \\gg 1$ is challenging due to the curse-of-dimensionality. Fortunately, low-dimensional parameter models, such as sparsity and low-rank models, enable successful learning from dimensionality reduced or incomplete data [1, 2].\nSince any parametric form is at best an approximation, non-parametric models remain as important alternatives where we also attempt to learn the structure of the mapping $f$ from data [3\u201315]. Unfortunately, the curse-of-dimensionality problem in non-parametric function learning in high-dimensions is particularly difficult even with smoothness assumptions on $f$ [16\u201318]. For instance, learning functions $f \\in C^s$ (i.e., the derivatives $f', \\ldots, f^{(s)}$ exist and are continuous), defined over compact supports, requires $m = \\Omega((1/\\delta)^{d/s})$ samples for a uniform approximation guarantee of $\\delta \\ll 1$ (i.e., $\\|f - \\widehat{f}\\|_{L_\\infty} \\le \\delta$) [17]. Surprisingly, even infinitely differentiable functions ($s = \\infty$) are not immune to this problem ($m = \\Omega(2^{\\lfloor d/2 \\rfloor})$) [18]. 
Therefore, further assumptions on the multivariate functions beyond smoothness are needed for the tractability of successful learning [13, 14, 16, 19].\nTo this end, we seek to learn low-dimensional function models $f(x) = g(Ax)$ that decompose as\n\nModel 1: $f(x) = \\sum_{i=1}^k g_i(a_i^T x)$ | Model 2: $f(x) = a_1^T x + \\sum_{i=2}^k g_i(a_i^T x)$, (1)\n\nthereby constraining $f$ to effectively live on $k$-dimensional subspaces, where $k \\ll d$. The models in (1) have several important machine learning applications, and are known as the multi-index models in statistics and econometrics, and multi-ridge functions in signal processing [4\u20137, 20\u201324].\nIn stark contrast to the classical regression setting where $(y_i, x_i)_{i=1}^m$ is given a priori, we posit the active learning setting where we can query the function to obtain first an explicit approximation of $A$ and subsequently of $f$. As a stylized example of the active learning setting, consider numerical solutions of parametric partial differential equations (PDE). Given PDE$(f, x) = 0$, where $f(x): \\Omega \\to \\mathbb{R}$ is the implicit solution, obtaining a function sample typically requires running a computationally expensive numerical solver. As we have the ability to choose the samples, we can minimize the number of queries to the PDE solver in order to learn an explicit approximation of $f$ [13].\n\nBackground: To set the context for our contributions, it is necessary to review the (rather extensive) literature that revolves around the models (1). We categorize the earlier works by how the samples are obtained (regression (passive) vs. active learning), what the underlying low-dimensional model is (low-rank vs. sparse), and how the smoothness is encoded (kernels vs. $C^s$).\nRegression/low-rank [3\u20137]: We consider the function model $f(x) = g(Ax)$ to be kernel smooth or $C^s$. 
Noting the differentiability of $f$, we observe that the gradients $\\nabla f(x) = A^T \\nabla g(Ax)$ live within the low-dimensional subspaces of $A^T$. Assuming that $\\nabla g$ has sufficient richness to span $k$-dimensional subspaces of the rows of $A$, we use the given samples to obtain Hessian estimates via local smoothing techniques, such as kernel estimates, nearest-neighbor, or spline methods. We then use the $k$-principal vectors of the estimated Hessian to approximate $A$. In some cases, we can even establish asymptotic distribution of $A$ estimates but not finite sample complexity bounds.\nRegression/sparse [8\u201312]: We add sparsity restrictions on the function models: for instance, we assume only one coordinate is active per additive term in (1). To encode smoothness, we restrict $f$ to a particular functional space, such as the reproducing kernel Hilbert or Sobolev spaces. We employ greedy algorithms, back-fitting approaches, or convex regularizers to not only estimate the active coordinates but also the function itself. We can then establish finite sample complexity rates with guarantees of the form $\\|f - \\widehat{f}\\|_{L_2} \\le \\delta$, which grow logarithmically with $d$ as well as match the minimax bounds for the learning problem. Moreover, the function estimation incurs a linear cost in $k$ since the problem formulation affords a rotation-free structure between $x$ and $g_i$\u2019s.\nActive learning [13\u201315]: The majority of the (rather limited) literature on active non-parametric function learning makes sparsity assumptions on $A$ to obtain guarantees of the form $\\|f - \\widehat{f}\\|_{L_\\infty} \\le \\delta$, where $f \\in C^s$ with $s > 1$.$^1$ For instance, we consider the form $f(x) = g(Ax)$, where the rows of $A$ live in a weak $\\ell_q$ ball with $q < 2$ (i.e., they are approximately sparse).$^2$ We then leverage a prescribed random sampling, and prove that the sample complexity grows logarithmically with $d$ and is inversely proportional to the $k$-th singular value of a \u201cHessian\u201d matrix $H^f$ (for a precise definition of $H^f$, see (7)). Thus far, the only known characterization for the $k$-th singular value of $H^f$ is for radial basis functions, i.e., $f(x) = g(\\|Ax\\|_2)$. Just recently, we also see a low-rank model to handle $f(x) = g(a^T x)$ for a general $a$ ($k = 1$) with a sample complexity proportional to $d$ [15].\n\nOur contributions: In this paper, which is a summary of [26], we take the active learning perspective via low-rank methods, where we have a general $A$ with only $C^s$ assumptions on $g_i$\u2019s. Our main contributions are as follows:\n\n1. $k$-th singular value of $H^f$ [14,15]: Based on the random sampling schemes of [14,15], we rigorously establish the first high-dimensional scaling characterization of the $k$-th singular value of $H^f$, which governs the sample complexity in both sparse and general $A$ for the multi-index models in (1). To achieve this result, we introduce an easy-to-verify, new analysis tool based on Lipschitz continuous second order partial derivatives.\n2. Generalization of [13\u201315]: We derive the first sample complexity bound for the $C^s$ functions in (1) with arbitrary number of linear parameters $k$ without the compressibility assumptions on the rows of $A$. 
Along the way, we leverage the conventional low-rank models in regression approaches and bridge them with the recent low-rank recovery algorithms. Our result also lifts the sparse additive models in regression [8\u201312] to a basis-free setting.\n3. Impact of additive noise: We analytically show how additive white Gaussian noise in the function queries impacts the sample complexity of our low-rank approach.\n\n$^1$Not to be confused with the online active learning approaches, which \u201coptimize\u201d a function, such as finding its maximum [25]. In contrast, we would like to obtain uniform approximation guarantees on $f$, which might lead to redundant samples if we truly are only interested in finding a critical point of the function.\n\n$^2$As having one known basis to sparsify all $k$-dimensions in order to obtain a sparse $A$ is rather restrictive, this model does not provide a basis-free generalization of the sparse additive models in regression [8\u201312].\n\n2 A recipe for active learning of low-dimensional non-parametric models\n\nThis section provides the preliminaries for our low-rank active learning approach for multi-index models in (1). We first introduce our sampling scheme (based on [14, 15]), summarize our main observation model (based on [6, 7, 14, 15]), and explain our algorithmic approach (based on [15]). This discussion sets the stage for our main theoretical contributions, as described in Section 4.\n\nOur sampling scheme: Our sampling approach relies on a specific interaction of two sets: sampling centers and an associated set of directions for each center. We denote the set of sampling centers as $\\mathcal{X} = \\{\\xi_j \\in S^{d-1}; j = 1, \\ldots, m_{\\mathcal{X}}\\}$. We form $\\mathcal{X}$ by sampling points uniformly at random in $S^{d-1}$ (the unit sphere in $d$-dimensions) according to the uniform measure $\\mu_{S^{d-1}}$. Along with each $\\xi_j \\in \\mathcal{X}$, we define a directions vector $\\Phi_j = [\\phi_{1,j} | \\ldots | \\phi_{m_\\Phi,j}]^T$, and construct the sampling directions operator $\\Phi$ for $j = 1, \\ldots, m_{\\mathcal{X}}$, $i = 1, \\ldots, m_\\Phi$, and $l = 1, \\ldots, d$ as\n\n$\\Phi = \\left\\{ \\phi_{i,j} \\in B_{\\mathbb{R}^d}\\left(\\sqrt{d/m_\\Phi}\\right) : [\\phi_{i,j}]_l = \\pm \\frac{1}{\\sqrt{m_\\Phi}} \\text{ with probability } 1/2 \\right\\}$, (2)\n\nwhere $B_{\\mathbb{R}^d}(\\sqrt{d/m_\\Phi})$ is the $\\ell_2$-ball with radius $r = \\sqrt{d/m_\\Phi}$.\n\nOur low-rank observation model: We first write the Taylor series approximation of $f$ as follows\n\n$f(x + \\epsilon\\phi) = f(x) + \\epsilon \\langle \\phi, \\nabla f(x) \\rangle + \\epsilon E(x, \\epsilon, \\phi); \\quad E(x, \\epsilon, \\phi) = \\frac{\\epsilon}{2} \\phi^T \\nabla^2 f(\\zeta(x, \\phi)) \\phi$, (3)\n\nwhere $\\epsilon \\ll 1$, $\\epsilon E(x, \\epsilon, \\phi)$ is the curvature error, and $\\zeta(x, \\phi) \\in [x, x + \\epsilon\\phi] \\subset B_{\\mathbb{R}^d}(1 + \\epsilon r)$. Substituting $f(x) = g(Ax)$ into (3), we obtain a perturbed observation model ($\\nabla g(\\cdot)$ is a $k \\times 1$ vector):\n\n$\\langle \\phi, A^T \\nabla g(Ax) \\rangle = \\frac{1}{\\epsilon}(f(x + \\epsilon\\phi) - f(x)) - E(x, \\epsilon, \\phi)$. (4)\n\nWe then introduce a matrix $X := A^T G$ with $G := [\\nabla g(A\\xi_1) | \\nabla g(A\\xi_2) | \\cdots | \\nabla g(A\\xi_{m_{\\mathcal{X}}})]_{k \\times m_{\\mathcal{X}}}$. Based on (4), we then derive the following linear system via the operator $\\Phi : \\mathbb{R}^{d \\times m_{\\mathcal{X}}} \\to \\mathbb{R}^{m_\\Phi}$\n\n$y = \\Phi(X) + \\varepsilon; \\quad y_i = \\epsilon^{-1} \\sum_{j=1}^{m_{\\mathcal{X}}} [f(\\xi_j + \\epsilon\\phi_{i,j}) - f(\\xi_j)]$, (5)\n\nwhere $y \\in \\mathbb{R}^{m_\\Phi}$ are the perturbed measurements of $X$ with $[\\Phi(X)]_j = \\mathrm{trace}(\\Phi_j^T X)$, and $\\varepsilon = E(\\mathcal{X}, \\epsilon, \\Phi)$ is the curvature perturbations. The formulation (5) motivates us to leverage affine rank-minimization algorithms [27\u201329] for low-rank matrix recovery since rank$(X) \\le k \\ll d$.\n\nOur active low-rank learning algorithm: Algorithm 1 outlines the main steps involved in our approximation scheme. 
Step 1 constructs the operator $\\Phi$ and the measurements $y$, given $m_\\Phi$, $m_{\\mathcal{X}}$, and $\\epsilon$. Step 2 revolves around the affine-rank minimization algorithms. Step 3 maps the recovered low-rank matrix to $\\widehat{A}$ using the singular value decomposition (SVD) and rank-$k$ approximation. Given $\\widehat{A}$, step 4 constructs $\\widehat{f}(x) = \\widehat{g}(\\widehat{A}x)$ as our estimator, where $\\widehat{g}(y) = f(\\widehat{A}^T y)$.\n\nAlgorithm 1: Active learner algorithm for the non-parametric model $f(x) = g(Ax)$\n1: Choose $m_\\Phi$, $m_{\\mathcal{X}}$, and $\\epsilon$ and construct the sets $\\mathcal{X}$ and $\\Phi$, and the measurements $y$.\n2: Obtain $\\widehat{X}$ via a stable low-rank recovery algorithm (see Section 3 for an example).\n3: Compute SVD$(\\widehat{X}) = \\widehat{U}\\widehat{\\Sigma}\\widehat{V}^T$ and set $\\widehat{A}^T = \\widehat{U}^{(k)}$, corresponding to $k$ largest singular values.\n4: Obtain an approximation $\\widehat{f}(x) := \\widehat{g}(\\widehat{A}x)$ via quasi interpolants where $\\widehat{g}(y) := f(\\widehat{A}^T y)$.\n\nRemark 1. We uniformly approximate the function $\\widehat{g}$ by first sampling it on a rectangular grid: $h\\mathbb{Z}^k \\cap (-(1 + \\bar{\\epsilon}), (1 + \\bar{\\epsilon}))^k$ with uniformly spaced points in each direction (step size $h$). We then use quasi-interpolants to interpolate in between the points thereby obtaining the approximation $\\widehat{g}_h$, where the complexity is exponential in $k$ (see the tractability discussion in the introduction). We refer the reader to Chapter 12 of [17] regarding the construction of these operators.\n\n3 Stable low-rank recovery algorithms within our learning scheme\n\nBy stable low-rank recovery in Algorithm 1, we mean any algorithm that returns an $\\widehat{X}$ with the following guarantee: $\\|X - \\widehat{X}\\|_F \\le c_1 \\|X - X_k\\|_F + c_2 \\|\\varepsilon\\|_2$, where $c_{1,2}$ are constants, and $X_k$ is the best rank-$k$ approximation of $X$. 
Since there exists a vast set of algorithms with such guarantees, we use the matrix Dantzig selector [29] as a running example. This discussion is intended to expose the reader to the key elements necessary to re-derive the sample complexity of our scheme in Section 4 for different algorithms, which might offer additional computational trade-offs.\n\nStable embedding: We first explain an elementary result stating that our sampling mechanism satisfies the restricted isometry property (RIP) for all rank-$k$ matrices with overwhelming probability. That is, $(1 - \\kappa_k)\\|X_k\\|_F^2 \\le \\|\\Phi(X_k)\\|_{\\ell_2}^2 \\le (1 + \\kappa_k)\\|X_k\\|_F^2$, where $\\kappa_k$ is the RIP constant [29]. This property can be used in establishing stability of virtually all low-rank recovery algorithms.\nAs $\\Phi$ in (5) is a Bernoulli random measurement ensemble, it follows from standard concentration inequalities [30,31] that for any rank-$k$ $X \\in \\mathbb{R}^{d \\times m_{\\mathcal{X}}}$, we have $P(|\\|\\Phi(X)\\|_{\\ell_2}^2 - \\|X\\|_F^2| > t\\|X\\|_F^2) \\le 2e^{-\\frac{m_\\Phi}{2}(t^2/2 - t^3/3)}$, $t \\in (0, 1)$. By using a standard covering argument, as shown in Theorem 2.3 of [29], we can verify that our $\\Phi$ satisfies RIP with isometry constant $0 < \\kappa_k < \\kappa < 1$ with probability at least $1 - 2e^{-m_\\Phi q(\\kappa) + k(d + m_{\\mathcal{X}} + 1)u(\\kappa)}$, where $q(\\kappa) = \\frac{1}{144}\\left(\\kappa^2 - \\frac{\\kappa^3}{9}\\right)$ and $u(\\kappa) = \\log\\left(\\frac{36\\sqrt{2}}{\\kappa}\\right)$.\n\nRecovery algorithm and its tuning parameters: The Dantzig selector criterion is given by\n\n$\\widehat{X}_{DS} = \\arg\\min_M \\|M\\|_* \\text{ s.t. } \\|\\Phi^*(y - \\Phi(M))\\| \\le \\lambda$, (6)\n\nwhere $\\|\\cdot\\|_*$ and $\\|\\cdot\\|$ are the nuclear and spectral norms, respectively, and $\\lambda$ is a tuning parameter. 
We require the true $X$ to be feasible, i.e., $\\|\\Phi^*(\\varepsilon)\\| \\le \\lambda$. Hence, the parameter $\\lambda$ can be obtained via\nProposition 1. In (5), we have $\\|\\varepsilon\\|_{\\ell_2^{m_\\Phi}} \\le \\frac{C_2 \\epsilon d m_{\\mathcal{X}} k^2}{2\\sqrt{m_\\Phi}}$. Moreover, it holds that $\\|\\Phi^*(\\varepsilon)\\| \\le \\lambda = \\frac{C_2 \\epsilon d m_{\\mathcal{X}} k^2}{2\\sqrt{m_\\Phi}} (1 + \\kappa)^{1/2}$, with probability at least $1 - 2e^{-m_\\Phi q(\\kappa) + (d + m_{\\mathcal{X}} + 1)u(\\kappa)}$.\n\nProposition 1 is a new result that provides the typical low-rank recovery algorithm tuning parameters for the random sampling scheme in Section 2. We prove Proposition 1 in [26]. Note that the dimension $d$ appears in the bound as we do not make any compressibility assumption on $A$. If the rows of $A$ are compressible, that is $(\\sum_{j=1}^d |a_{ij}|^q)^{1/q} \\le D_1 \\ \\forall i = 1, \\ldots, k$ for some $0 < q < 1$, $D_1 > 0$, we can then remove the explicit $d$-dependence in the bound here.\n\nStability of low-rank recovery: We first restate a stability result from [29] for bounded noise in Theorem 1. We then exploit this result in Corollary 1 along with Proposition 1 in order to obtain the error bound for the rank-$k$ approximation $\\widehat{X}_{DS}^{(k)}$ to $X$ in step 4 of our Algorithm 1:\nTheorem 1 (Theorem 2.4 in [29]). Let rank$(X) \\le k$ and let $\\widehat{X}_{DS}$ be the solution to (6). If $\\kappa_{4k} < \\kappa < \\sqrt{2} - 1$ and $\\|\\Phi^*(\\varepsilon)\\| \\le \\lambda$, then we have with probability at least $1 - 2e^{-m_\\Phi q(\\kappa) + 4k(d + m_{\\mathcal{X}} + 1)u(\\kappa)}$\n\n$\\|\\widehat{X}_{DS} - X\\|_F^2 \\le C_0 k \\lambda^2$,\n\nwhere $C_0$ depends only on the isometry constant $\\kappa_{4k}$.\nCorollary 1. Denoting $\\widehat{X}_{DS}$ to be the solution of (6), if $\\widehat{X}_{DS}^{(k)}$ is the best rank-$k$ approximation to $\\widehat{X}_{DS}$ in the sense of $\\|\\cdot\\|_F$, and if $\\kappa_{4k} < \\kappa < \\sqrt{2} - 1$, then we have\n\n$\\|X - \\widehat{X}_{DS}^{(k)}\\|_F^2 \\le 4 C_0 k \\lambda^2 = \\frac{C_0 C_2^2 k^5 \\epsilon^2 d^2 m_{\\mathcal{X}}^2}{m_\\Phi}(1 + \\kappa)$,\n\nwith probability at least $1 - 2e^{-m_\\Phi q(\\kappa) + 4k(d + m_{\\mathcal{X}} + 1)u(\\kappa)}$.\nCorollary 1 is the main result of this section, which is proved in [26]. The approximation guarantee in Corollary 1 can be tightened if other low-rank recovery algorithms are employed in estimation of $X$. However, we note again that the Dantzig selector enables us to highlight the key steps that lead to the sample complexity of our approach in the next section.\n\n4 Main results\nOverview: Below, we study $m_\\Phi$, $m_{\\mathcal{X}}$, and $\\epsilon$ that together achieve and balance three objectives:\n\n$m_{\\mathcal{X}}$: Sampling centers $\\mathcal{X}$ are chosen so that the matrix $G$ has rank-$k$. This is critical in ensuring that $G$ explores the full $k$-dimensional subspaces as spanned by $A^T$ lest $X$ is rank deficient.\n$m_\\Phi$: Sampling directions $\\Phi$ (2) are designed to satisfy the RIP for rank-$k$ matrices (cf., Section 3). This property is typically key in proving low-rank recovery guarantees.\n$\\epsilon$: The step-size $\\epsilon$ in (3) manages the impact of the curvature effects $E$ in the linear system (5). Unfortunately, this leads to a collateral damage of amplifying the impact of noise if the queries are corrupted. We provide a remedy below based on sampling the same data points.\nAssumptions: We explicitly mention our assumptions here. Without loss of generality, we assume A = [a1, . . . 
, ak]T is an arbitrary $k \\times d$ matrix with orthogonal rows so that $AA^T = I_k$, and the function $f$ is defined over the unit ball, i.e., $f: B_{\\mathbb{R}^d}(1) \\to \\mathbb{R}$.$^3$ For simplicity, we carry out our analysis by assuming $g$ to be a $C^2$ function. By our set up, $g$ also lives over a compact set, hence all its partial derivatives till the order of two are bounded as a result of the Stone-Weierstrass theorem:\n\n$\\sup_{|\\beta| \\le 2} \\|D^\\beta g\\|_\\infty \\le C_2; \\quad D^\\beta g = \\frac{\\partial^{|\\beta|} g}{\\partial y_1^{\\beta_1} \\cdots \\partial y_k^{\\beta_k}}; \\quad |\\beta| = \\beta_1 + \\cdots + \\beta_k$,\n\nfor some constant $C_2 > 0$. Finally, the effectiveness of our sampling approach depends on whether or not the following \u201cHessian\u201d matrix $H^f$ is well-conditioned:\n\n$H^f := \\int_{S^{d-1}} \\nabla f(x) \\nabla f(x)^T d\\mu_{S^{d-1}}(x)$. (7)\n\nThat is, for the singular values of $H^f$, we assume $\\sigma_1(H^f) \\ge \\cdots \\ge \\sigma_k(H^f) \\ge \\alpha > 0$ for some $\\alpha$. This assumption ensures $X$ has full rank-$k$ so that $A$ can be successfully learned.\nRestricted singular values of multi-index models: Our first main technical contribution provides a local condition in Proposition 2 that fully characterizes $\\alpha$ for multi-index models in (1) below. We prove Proposition 2 and the ensuing Proposition 3 in [26].\nProposition 2. Assume that $g \\in C^2 : B_{\\mathbb{R}^k} \\to \\mathbb{R}$ has Lipschitz continuous second order partial derivatives in an open neighborhood of the origin, $U_\\theta = B_{\\mathbb{R}^k}(\\theta)$ for some fixed $\\theta = O(d^{-(s+1)})$, and for some $s > 0$:\n\n$\\frac{\\left| \\frac{\\partial^2 g}{\\partial y_i \\partial y_j}(y_1) - \\frac{\\partial^2 g}{\\partial y_i \\partial y_j}(y_2) \\right|}{\\|y_1 - y_2\\|_{\\ell_2^k}} \\le L_{i,j} \\quad \\forall y_1, y_2 \\in U_\\theta, \\ y_1 \\ne y_2, \\ i, j = 1, \\ldots, k$.\n\nDenote $L = \\max_{1 \\le i,j \\le k} L_{i,j}$. Also assume, $\\left. \\frac{\\partial^2 g(y)}{\\partial y_i^2} \\right|_{y=0} \\ne 0 \\ \\forall i = 1, \\ldots, k$ for Model 1 and $\\forall i = 2, \\ldots, k$ for Model 2 in (1). Then, we have $\\alpha = \\Theta(1/d)$ as $d \\to \\infty$.\nThe proof of Proposition 2 also leads to the following proposition for tractability of learning the general set $f(x) = g(Ax)$ without the particular modular decomposition as in (1):\nProposition 3. With the same Lipschitz continuous second order partial derivative assumption as in Proposition 2, if $\\nabla^2 g(0)$ is rank-$k$, then we have $\\alpha = \\Theta(1/d)$ as $d \\to \\infty$.\n\nSampling complexity of active multi-index model learning: The importance of Proposition 2 and Proposition 3 is further made explicit in our second main technical contribution as Theorem 2 below, which characterizes the sample complexity of our low-rank learning recipe in Section 2 for non-parametric models along with the Dantzig selector algorithm. Its proof can be found in [26].\n\n$^3$Unless further assumptions are made on $f$ or $g_i$\u2019s, we can only identify the subspace spanned by the rows of $A$ up to a rotation. Hence, while we discuss approximation results on $A$, the reader should keep in mind that our final guarantees only apply to the function $f$ and not necessarily for $A$ and $g$ individually. Moreover, if $f$ lives in some other convex body other than $B_{\\mathbb{R}^d}(1)$, say $L_\\infty$-ball, our analysis can be extended in a straightforward fashion (cf., the concluding discussion in [14]). We also assume that an enlargement of the unit ball $B_{\\mathbb{R}^d}(1)$ on the domain of $f$ for a sufficiently small $\\bar{\\epsilon} > 0$ is allowed. This is not a restriction, but is a consequence of our scheme as we work with directional derivatives of $f$ at points on the unit sphere $S^{d-1}$.\n\nTheorem 2. [Sample complexity of Algorithm 1] Let $\\delta \\in \\mathbb{R}^+$, $\\rho \\ll 1$, and $\\kappa < \\sqrt{2} - 1$ be fixed constants. Choose $m_{\\mathcal{X}} \\ge \\frac{2kC_2^2}{\\alpha\\rho^2} \\log(k/p_1)$, $m_\\Phi \\ge \\frac{\\log(2/p_2) + 4k(d + m_{\\mathcal{X}} + 1)u(\\kappa)}{q(\\kappa)}$, and $\\epsilon \\le \\frac{\\delta}{C_2 k^{5/2} d (\\delta + 2C_2\\sqrt{2k})} \\left( \\frac{(1 - \\rho) m_\\Phi \\alpha}{(1 + \\kappa) C_0 m_{\\mathcal{X}}} \\right)^{1/2}$. Then, given $m = m_{\\mathcal{X}}(m_\\Phi + 1)$ samples, our function estimator $\\widehat{f}$ in step 4 of Algorithm 1 obeys $\\|f - \\widehat{f}\\|_{L_\\infty} \\le \\delta$ with probability at least $1 - p_1 - p_2$.\nTheorem 2 characterizes the necessary scaling of the sample complexity for our active learning scheme in order to obtain uniform approximation guarantees on $f$ with overwhelming probability: $m_{\\mathcal{X}} = O\\left(\\frac{k \\log k}{\\alpha}\\right)$, $m_\\Phi = O(k(d + m_{\\mathcal{X}}))$, and $\\epsilon = O\\left(\\frac{\\alpha}{\\sqrt{d}}\\right)$. Note the important role played by $\\alpha$ in the sample complexity. Finally, we also mention that the sample complexity can be written differently to trade-off $\\delta$ among $m_{\\mathcal{X}}$, $m_\\Phi$, and $\\epsilon$. For instance, we can remove $\\delta$ dependence in the sampling bound for $\\epsilon$: let $\\delta < 1$, then we just need to scale $m_{\\mathcal{X}}$ by $\\delta^{-2}$, and $m_\\Phi$ by $\\delta^{-4}$.\nRemark 2. Note that the sample complexity in [14] for learning compressible $A$ is $m = O\\left(k^{\\frac{4-q}{2-q}} d^{\\frac{2}{2-q}} \\log(k)\\right)$ with uniform approximation guarantees on $f \\in C^2$. However, the authors are able to obtain this result only for a restricted set of radial basis functions. Surprisingly, our sample complexity for multi-index models (1) not only generalizes this result for general $A$ but features a better dimensional dependence for $q \\in (1, 2)$: $m = O(k^3 d^2 (\\log(k))^2)$. Of course, we require more computation since we use low-rank recovery as opposed to sparse recovery methods.\n\nImpact of noisy queries: Here, we focus on how $\\alpha$ impacts $\\epsilon$ in particular. Our motivation is to understand how additive noise in function queries, a realistic assumption in many applications, can impact our learning scheme, which will form the basis of our third main technical contribution.\nLet us assume that the evaluation of $f$ at a point $x$ yields: $f(x) + Z$, where $Z \\sim N(0, \\sigma^2)$. Thus under this noise model, (5) changes to $y = \\Phi(X) + \\varepsilon + z$, where $z \\in \\mathbb{R}^{m_\\Phi}$ and $z_i = \\sum_{j=1}^{m_{\\mathcal{X}}} \\frac{z_{ij}}{\\epsilon}$. Assuming independent and identically distributed (iid) noise, we have $z_{ij} \\sim N(0, 2\\sigma^2)$, and $z_i \\sim N\\left(0, \\frac{2 m_{\\mathcal{X}} \\sigma^2}{\\epsilon^2}\\right)$. Therefore, the noise variance gets amplified by a factor of $\\frac{2 m_{\\mathcal{X}}}{\\epsilon^2}$.\nIn our analysis in Section 3, recall that we require the true matrix $X$ to be feasible. Then, from Lemma 1.1 in [29] and Proposition 1, it follows that the bound below holds with high probability.\n\n$\\|\\Phi^*(\\varepsilon + z)\\| \\le \\frac{2\\gamma\\sigma}{\\epsilon} \\sqrt{2(1 + \\kappa) m_{\\mathcal{X}} m_\\Phi} + \\frac{C_2 \\epsilon d m_{\\mathcal{X}} k^2}{2\\sqrt{m_\\Phi}} (1 + \\kappa)^{1/2}, \\quad (\\gamma > 2\\sqrt{\\log 12})$. (8)\n\nUnfortunately, we cannot control the upper bound $\\lambda$ on $\\|\\Phi^*(\\varepsilon + z)\\|$ by simply choosing smaller $\\epsilon$, due to the appearance of the $(1/\\epsilon)$ term. Hence, unless $\\sigma$ is $O(\\epsilon)$ or less, (e.g., $\\sigma$ reduces with $d$), we can only declare that our learning scheme with the matrix Dantzig selector is sensitive to noise unless we resample the same data points $O(\\epsilon^{-1})$-times and average. If the noise variance $\\sigma^2$ is constant, this would keep the impact of noise below a constant times the impact of the curvature errors, which our scheme can handle. The sample complexity then becomes $m = O(\\sqrt{d}/\\alpha)\\, m_{\\mathcal{X}}(m_\\Phi + 1)$, since we choose $m_{\\mathcal{X}}(m_\\Phi + 1)$ unique points, and then re-query and average the same points $O(\\sqrt{d}/\\alpha)$-times. Unfortunately, we cannot qualitatively improve the $O(\\sqrt{d}/\\alpha)$-expansion for noise robustness by simply changing the low-rank recovery algorithm since it depends on the relative ratio of the curvature errors $\\|\\varepsilon\\|_2$ to the norm of the noise vector $\\|z\\|$. As $\\Phi$ satisfies the RIP assumption, we can verify that this relative ratio is approximately preserved in (8) for iid Gaussian noise.\n\n5 Numerical Experiments\n\nWe present simulation results on toy examples to empirically demonstrate the tightness of the sampling bounds. In the sequel, we assume $A$ to be row orthonormal and concern ourselves only with the recovery of $A$ up to an orthonormal transformation. Therefore, we seek a guaranteed lower bound on $\\|A\\widehat{A}^T\\|_F \\ge (k\\eta)^{1/2}$ for some $0 < \\eta < 1$. Then it is possible to show, along the lines of the proof for Theorem 2 (see [26]), we would need to pick $\\epsilon$ as follows:\n\n$\\epsilon \\le \\frac{1}{C_2 k^2 d (\\sqrt{k(1 - \\eta)} + \\sqrt{2})} \\left( \\frac{(1 - \\rho) m_\\Phi \\alpha (1 - \\eta)}{(1 + \\kappa) C_0 m_{\\mathcal{X}}} \\right)^{1/2}$. (9)\n\nLogistic function ($k = 1$): We first consider $f(x) = g(a^T x)$ where $g(y) = (1 + e^{-y})^{-1}$ is the logistic function. This case allows us to explicitly calculate all the necessary parameters within our paper. For instance, we can easily verify that $C_2 = \\sup_{|\\beta| \\le 2} |g^{(\\beta)}(y)| = 1$. Furthermore we compute the value of $\\alpha$ through the approximation: $\\alpha = \\int |g'(a^T x)|^2 d\\mu_{S^{d-1}} \\approx |g'(0)|^2 = (1/16)$, which holds for large $d$. We require $|\\langle \\widehat{a}, a \\rangle|$ to be greater than $\\eta = 0.99$. We fix values of $\\kappa < \\sqrt{2} - 1$, $\\rho \\in (0, 1)$ and $\\epsilon = 10^{-3}$. The value of $m_{\\mathcal{X}}$ (number of points sampled on $S^{d-1}$) is fixed at 20 and we vary $d$ over the range 200\u20133000. For each value of $d$, we increase $m_\\Phi$ till $|\\langle \\widehat{a}, a \\rangle|$ reaches the specified performance criterion of $\\eta$. We remark that for each value of $d$ and $m_\\Phi$, we choose $\\epsilon$ according to the derived equation (9) for the specified performance criterion given by $\\eta$.\nFigure 1 depicts the scaling of $m_\\Phi$ with the dimension $d$. The results are obtained by selecting $a$ uniformly at random on $S^{d-1}$ and averaging the value of $|\\langle \\widehat{a}, a \\rangle|$ over 10 independent trials using the Dantzig selector. We observe that for large values of $d$, the minimum number of directional derivatives needed to achieve the performance bound on $|\\langle \\widehat{a}, a \\rangle|$ scales approximately linearly with $d$, with a scaling factor of around 1.45.\n\nFigure 1: Plot of $m_\\Phi / d$ versus $d$ for $m_{\\mathcal{X}} = 20$, with $m_\\Phi$ chosen to be minimum value needed to achieve $|\\langle \\widehat{a}, a \\rangle| \\ge 0.99$. $\\epsilon$ is fixed at $10^{-3}$. $m_\\Phi$ scales approximately linearly with $d$ where the constant is 1.45.\n\nSum of Gaussian functions ($k > 1$): We next consider functions of the form $f(x) = g(Ax + b) = \\sum_{i=1}^k g_i(a_i^T x + b_i)$, where $g_i(y) = (2\\pi\\sigma_i^2)^{-1/2} \\exp(-(y + b_i)^2/(2\\sigma_i^2))$. We fix $d = 100$, $\\epsilon = 10^{-3}$, $m_{\\mathcal{X}} = 100$ and vary $k$ from 8 to 32 in steps of 4. For each value of $k$ we are interested in the minimum value of $m_\\Phi$ needed to achieve $\\frac{1}{k}\\|A\\widehat{A}^T\\|_F^2 \\ge 0.99$. In Figure 2(a), we see that $m_\\Phi$ scales approximately linearly with the number of Gaussian atoms $k$. The results are averaged over 10 trials. In each trial, we select the rows of $A$ over the left Haar measure on $S^{d-1}$, and the parameter $b$ uniformly at random on $S^{k-1}$ scaled by a factor 0.2. Furthermore we generate the standard deviations of the individual Gaussian functions uniformly over the range [0.1, 0.5].\nImpact of Noise ($k > 1$): We now consider quadratic forms, i.e. $f(x) = g(Ax) = \\|Ax - b\\|^2$ with the point queries corrupted with Gaussian noise. Here, we take $\\alpha$ to be $1/d$. We fix $k = 5$, $m_{\\mathcal{X}} = 30$, $\\epsilon = 10^{-1}$ and vary $d$ from 30 to 120 in steps of 15. For each $d$ we perturb the point queries with Gaussian noise of standard deviation: $0.01/d^{3/2}$. This is the same as repeatedly sampling each random location approximately $d^{3/2}$ times followed by averaging. We then compute the minimum value of $m_\\Phi$ needed to achieve $\\frac{1}{k}\\|A\\widehat{A}^T\\|_F^2 \\ge 0.99$. We average the results over 10 trials, and in each trial, we select the rows of $A$ over the left Haar measure on $S^{d-1}$. The parameter $b$ is chosen uniformly at random on $S^{k-1}$. In Figure 2(b), we see that $m_\\Phi$ scales approximately linearly with $d$, which follows our sample complexity bound for $m_\\Phi$ in Theorem 2.\n\n(a) $k > 1$ (Gaussian)\n\n(b) $k > 1$ (quadratic) with noise\n\nFigure 2: The empirical performance of our oracle-based low-rank learning scheme (circles) agrees well with the theoretical scaling (dashed). Section 5 has further details.\n\n6 Conclusions\n\nIn this work, we consider the problem of learning non-parametric low-dimensional functions $f(x) = g(Ax)$, which can also have a modular decomposition as in (1), for arbitrary $A \\in \\mathbb{R}^{k \\times d}$ where rank$(A) = k$. The main contributions of the work are three-fold. By introducing a new analysis tool based on Lipschitz property on the second order derivatives, we provide the first rigorous characterization of the dimension dependence of the $k$-restricted singular value of the \u201cHessian\u201d matrix $H^f$ for general multi-index models. We establish the first sample complexity bound for learning non-parametric multi-index models with low-rank recovery algorithms and also analyze the impact of additive noise to the sample complexity of the scheme. Lastly, we provide empirical evidence on toy examples to show the tightness of the sampling bounds. Finally, while our active learning scheme ensures the tractability of learning non-parametric multi-index models, it does not establish a lower bound on the sample complexity, which is left for future work.\n\n7 Acknowledgments\n\nThis work was supported in part by the European Commission under Grant MIRG-268398, ERC Future Proof, SNF 200021-132548, ARO MURI W911NF0910383, and DARPA KeCoM program #11-DARPA-1055. 
VC also would like to acknowledge Rice University for his Faculty Fellowship. The authors thank Jan Vyb\u00edral for useful discussions and Anastasios Kyrillidis for his help with the low-rank matrix recovery simulations.\n\nReferences\n[1] P. B\u00fchlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer-Verlag New York Inc, 2011.\n\n[2] L. Carin, R.G. Baraniuk, V. Cevher, D. Dunson, M.I. Jordan, G. Sapiro, and M.B. Wakin. Learning low-dimensional signal models. Signal Processing Magazine, IEEE, 28(2):39\u201351, 2011.\n\n[3] M. Hristache, A. Juditsky, J. Polzehl, and V. Spokoiny. Structure adaptive approach for dimension reduction. The Annals of Statistics, 29(6):1537\u20131566, 2001.\n\n[4] K.C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, pages 316\u2013327, 1991.\n\n[5] P. Hall and K.C. Li. On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics, pages 867\u2013889, 1993.\n\n[6] Y. Xia, H. Tong, W.K. Li, and L.X. Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363\u2013410, 2002.\n\n[7] Y. Xia. A multiple-index model and dimension reduction. Journal of the American Statistical Association, 103(484):1631\u20131640, 2008.\n\n[8] Y. Lin and H.H. Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272\u20132297, 2006.\n\n[9] L. Meier, S. Van De Geer, and P. B\u00fchlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779\u20133821, 2009.\n\n[10] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Technical Report, UC Berkeley, Department of Statistics, August 2010.\n\n[11] P. Ravikumar, J. Lafferty, H. Liu, and L. 
Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009\u20131030, 2009.\n\n[12] V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6):3660\u20133695, 2010.\n\n[13] A. Cohen, I. Daubechies, R. A. DeVore, G. Kerkyacharian, and D. Picard. Capturing ridge functions in high dimensions from point queries. Constr. Approx., pages 1\u201319, 2011.\n\n[14] M. Fornasier, K. Schnass, and J. Vyb\u00edral. Learning functions of few arbitrary linear parameters in high dimensions. Preprint, 2010.\n\n[15] H. Tyagi and V. Cevher. Learning ridge functions with randomized sampling in high dimensions. In ICASSP, 2011.\n\n[16] J.F. Traub, G.W. Wasilkowski, and H. Wo\u017aniakowski. Information-based complexity. Academic Press, New York, 1988.\n\n[17] R. DeVore and G.G. Lorentz. Constructive approximation. vol. 303, Grundlehren, Springer Verlag, N.Y., 1993.\n\n[18] E. Novak and H. Wo\u017aniakowski. Approximation of infinitely differentiable multivariate functions is intractable. J. Complex., 25:398\u2013404, August 2009.\n\n[19] W. H\u00e4rdle. Applied nonparametric regression, volume 26. Cambridge Univ Press, 1990.\n\n[20] J.H. Friedman and W. Stuetzle. Projection pursuit regression. J. Amer. Statist. Assoc., 76:817\u2013823, 1981.\n\n[21] D.L. Donoho and I.M. Johnstone. Projection based regression and a duality with kernel methods. Ann. Statist., 17:58\u2013106, 1989.\n\n[22] P.J. Huber. Projection pursuit. Ann. Statist., 13:435\u2013475, 1985.\n\n[23] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143\u2013195, 1999.\n\n[24] E.J. Cand\u00e8s. Harmonic analysis of neural networks. Appl. Comput. Harmon. Anal., 6(2):197\u2013218, 1999.\n\n[25] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. 
To appear in the IEEE Trans. on Information Theory, 2012.\n\n[26] H. Tyagi and V. Cevher. Learning non-parametric basis independent models from point queries via low-rank methods. Technical Report, Infoscience EPFL, 2012.\n\n[27] E.J. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717\u2013772, 2009.\n\n[28] E.J. Cand\u00e8s and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inf. Theor., 56:2053\u20132080, May 2010.\n\n[29] E.J. Cand\u00e8s and Y. Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. CoRR, abs/1001.0339, 2010.\n\n[30] B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52:471\u2013501, 2010.\n\n[31] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302\u20131338.\n", "award": [], "sourceid": 701, "authors": [{"given_name": "Tyagi", "family_name": "Hemant", "institution": null}, {"given_name": "Volkan", "family_name": "Cevher", "institution": null}]}