{"title": "The Limits of Learning with Missing Data", "book": "Advances in Neural Information Processing Systems", "page_first": 3495, "page_last": 3503, "abstract": "We study regression and classification in a setting where the learning algorithm is allowed to access only a limited number of attributes per example, known as the limited attribute observation model. In this well-studied model, we provide the first lower bounds giving a limit on the precision attainable by any algorithm for several variants of regression, notably linear regression with the absolute loss and the squared loss, as well as for classification with the hinge loss. We complement these lower bounds with a general purpose algorithm that gives an upper bound on the achievable precision limit in the setting of learning with missing data.", "full_text": "The Limits of Learning with Missing Data\n\nBrian Bullins\n\nElad Hazan\n\nPrinceton University\n\nPrinceton, NJ\n\n{bbullins,ehazan}@cs.princeton.edu\n\nTomer Koren\nGoogle Brain\n\nMountain View, CA\ntkoren@google.com\n\nAbstract\n\nWe study linear regression and classification in a setting where the learning algorithm\nis allowed to access only a limited number of attributes per example, known\nas the limited attribute observation model. In this well-studied model, we provide\nthe first lower bounds giving a limit on the precision attainable by any algorithm for\nseveral variants of regression, notably linear regression with the absolute loss and\nthe squared loss, as well as for classification with the hinge loss. We complement\nthese lower bounds with a general purpose algorithm that gives an upper bound on\nthe achievable precision limit in the setting of learning with missing data.\n\n1 Introduction\n\nThe primary objective of linear regression is to determine the relationships between multiple variables\nand how they may affect a certain outcome. 
A standard example is that of medical diagnosis, whereby\nthe data gathered for a given patient provides information about their susceptibility to certain illnesses.\nA major drawback to this process is the work necessary to collect the data, as it requires running\nnumerous tests for each person, some of which may be discomforting. In such cases it may be\nnecessary to impose limitations on the amount of data available for each example. For medical\ndiagnosis, this might mean having each patient only undergo a small subset of tests.\nA formal setting for capturing regression and learning with limits on the number of attribute observations\nis known as the Limited Attribute Observation (LAO) setting, first introduced by Ben-David\nand Dichterman [1]. For example, in a regression problem, the learner has access to a distribution\n$D$ over data $(x, y) \\in \\mathbb{R}^d \\times \\mathbb{R}$, and fits the best (generalized) linear model according to a certain loss\nfunction, i.e., it approximately solves the optimization problem\n\n$$\\min_{w : \\|w\\|_p \\le B} L_D(w), \\qquad L_D(w) = \\mathbb{E}_{(x, y) \\sim D}\\left[\\ell(w^\\top x - y)\\right].$$\n\nIn the LAO setting, the learner does not have complete access to the examples $x$, which the reader\nmay think of as attributes of a certain patient. Rather, the learner can observe at most a fixed number\nof these attributes, denoted $k \\le d$. If $k = d$, this is the standard regression problem which can be\nsolved to arbitrary precision.\nThe main question we address: is it possible to compute an arbitrarily accurate solution if the number\nof observations per example, $k$, is strictly less than $d$? More formally, given any $\\varepsilon > 0$, can one\ncompute a vector $w$ for which\n\n$$L_D(w) \\le \\min_{\\|w^*\\|_p \\le B} L_D(w^*) + \\varepsilon.$$\n\nEfficient algorithms for regression with squared loss when $k < d$ have been shown in previous work\n[2], and the sample complexity bounds have since been tightened [6, 8]. 
However, similar results for\nother common loss functions such as the absolute loss have only been shown by relaxing the hard\nlimit of $k$ attributes per example [3, 6].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper we show, for the first time, that in fact this problem cannot be solved in general. Our\nmain result shows that even for regression with the absolute loss function, for any $k \\le d - 1$, there\nis an information-theoretic lower bound on the error attainable by any algorithm. That is, there is\nsome $\\varepsilon_0 > 0$ for which an $\\varepsilon_0$-optimal solution cannot be determined, irrespective of the number\nof examples the learner sees. Formally, with constant probability, any algorithm returning a vector\n$w \\in \\mathbb{R}^d$ must satisfy\n\n$$L_D(w) > \\min_{\\|w^*\\|_p \\le B} L_D(w^*) + \\varepsilon_0.$$\n\nWe further show that this ultimate achievable precision parameter is bounded from below by a\npolynomial in the dimension, i.e., $\\varepsilon_0 = \\Omega(d^{-3/2})$.\nAdditionally, for the basic setting of Ridge regression (with the squared loss), we give a tight lower\nbound for the LAO setting. Cesa-Bianchi et al. [2] provided the first efficient algorithm for this\nsetting with sample complexity of $O(d^2/(k\\varepsilon^2))$ for $\\varepsilon$ error. Hazan and Koren [6] improved upon this\nresult and gave a tight sample complexity of $O(d/(k\\varepsilon^2))$ to achieve $\\varepsilon$ error. In both cases, however, the\nalgorithms only work when $k \\ge 2$. We complete the picture and show that $k \\ge 2$ attributes are in\nfact necessary to obtain arbitrarily low error. That is, with only one attribute per example, there is an\ninformation-theoretic limit on the accuracy attainable by any regression algorithm. We remark that a\nsimilar impossibility result was proven by Cesa-Bianchi et al. [3] in the related setting of learning\nwith noisy examples.\nClassification may be similarly cast in the LAO setting. 
For classification with the hinge loss, namely\nsoft-margin SVM, we give a related lower bound, showing that it is impossible to achieve arbitrarily\nlow error if the number of observed attributes is bounded by $k \\le d - 1$. However, unlike our lower\nbound for regression, the lower bound we prove for classification scales exponentially with the\ndimension. Although Hazan et al. [7] showed how classification may be done with missing data, their\nwork includes low rank assumptions and so it is not in contradiction with the lower bounds presented\nhere.\nSimilar to the LAO setting, the setting of learning with missing data [9, 4, 10, 11] presents the learner\nwith examples where the attributes are randomly observed. Since the missing data setting is at least\nas difficult as the LAO setting, our lower bounds extend to this case as well.\nWe complement these lower bounds with a general purpose algorithm for regression and classification\nwith missing data that, given a sufficient number of samples, can achieve an error of $O(1/\\sqrt{d})$. This\nresult leaves only a small polynomial gap compared to the information-theoretic lower bound that we\nprove.\n\n2 Setup and Statement of Results\n\nThe general framework of linear regression involves a set of instances, each of the form $(x, y)$ where\n$x \\in \\mathbb{R}^d$ is the attribute vector and $y \\in \\mathbb{R}$ is the corresponding target value. Under the typical statistical\nlearning framework [5], each $(x, y)$ pair is drawn from a joint distribution $D$ over $\\mathbb{R}^d \\times \\mathbb{R}$. The\nlearner\u2019s objective is to determine some linear predictor $w$ such that $w^\\top x$ does well in predicting $y$.\nThe quality of prediction is measured according to a loss function $\\ell : \\mathbb{R} \\mapsto \\mathbb{R}$. Two commonly used\nloss functions for regression are the squared loss $\\ell(w^\\top x - y) = \\frac{1}{2}(w^\\top x - y)^2$ and the absolute loss\n$\\ell(w^\\top x - y) = |w^\\top x - y|$. 
Since our examples are drawn from some arbitrary distribution $D$, it is best\nto consider the expected loss\n\n$$L_D(w) = \\mathbb{E}_{(x, y) \\sim D}\\left[\\ell(w^\\top x - y)\\right].$$\n\nThe learner\u2019s goal then is to determine a regressor $w$ that minimizes the expected loss $L_D(w)$.\nTo avoid overfitting, a regularization term is typically added, which up to some constant factor is\nequivalent to\n\n$$\\min_{w \\in \\mathbb{R}^d} L_D(w) \\quad \\text{s.t.} \\quad \\|w\\|_p \\le B$$\n\nfor some regularization parameter $B > 0$, where $\\|\\cdot\\|_p$ is the standard $\\ell_p$ norm, $p \\ge 1$. Two common\nvariants of regression are Ridge regression ($p = 2$ with squared loss) and Lasso regression ($p = 1$\nwith squared loss).\n\nThe framework for classification is nearly identical to that of linear regression. The main distinction\ncomes from a different meaning of $y \\in \\mathbb{R}$, namely that $y$ acts as a label for the corresponding example.\nThe loss function also changes when learning a classifier, and in this paper we are interested in the\nhinge loss $\\ell(y \\cdot w^\\top x) = \\max\\{0, 1 - y \\cdot w^\\top x\\}$. The overall goal of the learner, however, remains the\nsame: namely, to determine a classifier $w$ such that $L_D(w)$ is minimized. Throughout the paper, we\nlet $w^*$ denote the minimizer of $L_D(w)$.\n\n2.1 Main Results\n\nAs a first step, for Lasso and Ridge regressions, we show that one always needs to observe at least two\nattributes to be able to learn a regressor to arbitrary precision. This is given formally in Theorem 1.\nTheorem 1. Let $0 < \\varepsilon < \\frac{1}{32}$ and let $\\ell$ be the squared loss. Then there exists a distribution $D$ over\n$\\{x : \\|x\\|_1 \\le 1\\} \\times [-1, 1]$ such that $\\|w^*\\|_1 \\le 2$, and any regression algorithm that can observe at\nmost one attribute of each training example of a training set $S$ cannot output a regressor $\\hat{w}$ such that\n$\\mathbb{E}_S[L_D(\\hat{w})] < L_D(w^*) + \\varepsilon$.\nCorollary 2. Let $0 < \\varepsilon < \\frac{1}{64}$ and let $\\ell$ be the squared loss. 
Then there exists a distribution $D$ over\n$\\{x : \\|x\\|_2 \\le 1\\} \\times [-1, 1]$ such that $\\|w^*\\|_2 \\le 2$, and any regression algorithm that can observe at\nmost one attribute of each training example of a training set $S$ cannot output a regressor $\\hat{w}$ such that\n$\\mathbb{E}_S[L_D(\\hat{w})] < L_D(w^*) + \\varepsilon$.\nThe lower bounds are tight: recall that with two attributes, it is indeed possible to learn a regressor to\nwithin arbitrary precision [2, 6]. Also, notice the order of quantification in the theorems: it turns out\nthat there exists a single distribution which is hard for all algorithms (rather than a different hard distribution\nfor each algorithm).\nFor regression with absolute loss, we consider the setting where the learner is limited to seeing $k$ or\nfewer attributes of each training sample. Theorem 3 below shows that in the case where $k < d$ the\nlearner cannot hope to learn an $\\varepsilon$-optimal regressor for some $\\varepsilon > 0$.\nTheorem 3. Let $d \\ge 4$, $d \\equiv 0 \\pmod 2$, $0 < \\varepsilon < \\frac{1}{60} d^{-3/2}$, and let $\\ell$ be the absolute loss. Then there\nexists a distribution $D$ over $\\{x : \\|x\\|_1 \\le 1\\} \\times [-1, 1]$ such that $\\|w^*\\|_1 \\le 2$, and any regression\nalgorithm that can observe at most $d - 1$ attributes of each training example of a training set $S$ cannot\noutput a regressor $\\hat{w}$ such that $\\mathbb{E}_S[L_D(\\hat{w})] < L_D(w^*) + \\varepsilon$.\nCorollary 4. Let $0 < \\varepsilon < \\frac{1}{60} d^{-2}$, and let $\\ell$ be the absolute loss. Then there exists a distribution $D$\nover $\\{x : \\|x\\|_2 \\le 1\\} \\times [-1, 1]$ such that $\\|w^*\\|_2 \\le 1$, and any regression algorithm that can observe at\nmost $d - 1$ attributes of each training example of a training set $S$ cannot output a regressor $\\hat{w}$ such\nthat $\\mathbb{E}_S[L_D(\\hat{w})] < L_D(w^*) + \\varepsilon$.\nWe complement our findings for regression with the following analogous lower bound for classification\nwith the hinge loss (a.k.a. soft margin SVM).\nTheorem 5. Let $d \\ge 4$, $d \\equiv 0 \\pmod 2$, and let $\\ell$ be the hinge loss. 
Then there exists an $\\varepsilon_0 > 0$\nsuch that the following holds: there exists a distribution $D$ over $\\{x : \\|x\\|_2 \\le 1\\} \\times [-1, 1]$ such that\n$\\|w^*\\|_2 \\le 1$, and any classification algorithm that can observe at most $d - 1$ attributes of each training\nexample of a training set $S$ cannot output a regressor $\\hat{w}$ such that $\\mathbb{E}_S[L_D(\\hat{w})] < L_D(w^*) + \\varepsilon_0$.\n\n3 Lower Bounds\n\nIn this section we discuss our lower bounds for regression with missing attributes. As a warm-up,\nwe first prove Theorem 1 for regression with the squared loss. While the proof is very simple,\nit illustrates some of the main ideas used in all of our lower bounds. Then, we give a proof of\nTheorem 3 for regression with the absolute loss. The proofs of the remaining bounds are deferred to\nthe supplementary material.\n\n3.1 Lower bounds for the squared loss\n\nProof of Theorem 1. It is enough to prove the theorem for deterministic learning algorithms, namely,\nfor algorithms that do not use any external randomization (i.e., any randomization besides the random\nsamples drawn from the data distribution itself). This is because any randomized algorithm can\nbe thought of as a distribution over deterministic algorithms, which is independent of the data\ndistribution.\nNow, suppose $0 < \\varepsilon < \\frac{1}{32}$. Let $X_1 = \\{(0, 0), (1, 1)\\}$, $X_2 = \\{(0, 1), (1, 0)\\}$, and let $D_1$ and $D_2$\nbe uniform distributions over $X_1 \\times \\{1\\}$ and $X_2 \\times \\{1\\}$, respectively. The main observation is that\nany learner that can observe at most one attribute of each example cannot distinguish between the\ntwo distributions with probability greater than $\\frac{1}{2}$, no matter how many samples it is given. This is\nbecause the marginal distributions of the individual attributes under both $D_1$ and $D_2$ are exactly the\nsame. Thus, to prove the theorem it is enough to show that the sets of $\\varepsilon$-optimal solutions under the\ndistributions $D_1$ and $D_2$ are disjoint. 
Indeed, suppose that there is a learning algorithm that emits a\nvector $\\hat{w}$ such that $\\mathbb{E}[L_D(\\hat{w}) - L_D(w^*)] < \\varepsilon/2$ (where the expectation is over the random samples\nfrom $D$ used by the algorithm). By Markov\u2019s inequality, it holds that $L_D(\\hat{w}) < L_D(w^*) + \\varepsilon$ with\nprobability $> 1/2$. Hence, the output of the algorithm allows one to distinguish between the two\ndistributions with probability $> 1/2$, contradicting the indistinguishability property.\nWe now characterize the sets of $\\varepsilon$-optimal solutions under $D_1$ and $D_2$. For $D_1$, we have\n\n$$L_{D_1}(w) = \\frac{1}{2} \\sum_{x \\in X_1} \\frac{1}{2}(w^\\top x - 1)^2 = \\frac{1}{4} + \\frac{1}{4}(w_1 + w_2 - 1)^2,$$\n\nwhile for $D_2$,\n\n$$L_{D_2}(w) = \\frac{1}{2} \\sum_{x \\in X_2} \\frac{1}{2}(w^\\top x - 1)^2 = \\frac{1}{4}(w_1 - 1)^2 + \\frac{1}{4}(w_2 - 1)^2.$$\n\nNote that the set of $\\varepsilon$-optimal regressors for $L_{D_1}$ is $S_1 = \\{w : |w^\\top \\mathbf{1} - 1| \\le 2\\sqrt{\\varepsilon}\\}$, whereas for $L_{D_2}$\nthe set is $S_2 = \\{w : \\|w - \\mathbf{1}\\|_2 \\le 2\\sqrt{\\varepsilon}\\}$. Let $S_2' = \\{w : |w^\\top \\mathbf{1} - 2| \\le 2\\sqrt{2\\varepsilon}\\}$. Then $S_2 \\subseteq S_2'$, so it is\nsufficient to show that $S_1$ and $S_2'$ are disjoint.\nSince $\\varepsilon < \\frac{1}{32}$, for any $w \\in S_1$, $|w^\\top \\mathbf{1} - 1| < \\frac{1}{2}$, meaning $w^\\top \\mathbf{1} < \\frac{3}{2}$. However, for any $w \\in S_2'$,\n$|w^\\top \\mathbf{1} - 2| < \\frac{1}{2}$, meaning $w^\\top \\mathbf{1} > \\frac{3}{2}$, and so $w$ cannot be a member of both $S_1$ and $S_2$. As we argued\nearlier, this suffices to prove the theorem.\n$\\square$\n\n3.2 Lower bounds for the absolute loss\n\nAs in the proof of Theorem 1, the main idea is to show that one can design two distributions that are\nindistinguishable to a learner who can observe no more than $d - 1$ attributes of any sample given by\nthe distribution (i.e., that their marginals over any choice of $d - 1$ attributes are identical), but whose\nrespective sets of $\\varepsilon$-optimal regressors are disjoint. However, in contrast to Theorem 1, both handling\ngeneral $d$ and switching to the absolute loss introduce additional complexities to the proof that\nrequire different techniques.\nWe start by constructing these two distributions $D_1$ and $D_2$. Let X1 = {x = (x1, . . . 
, xd) : x \u2208 {0, 1}^d, \u2016x\u2016_1 \u2261 0 (mod 2)} and $X_2 = \\{x = (x_1, \\ldots, x_d) : x \\in \\{0, 1\\}^d, \\|x\\|_1 \\equiv 1 \\pmod 2\\}$, and let $D_1$\nand $D_2$ be uniform over $X_1 \\times \\{1\\}$ and $X_2 \\times \\{1\\}$, respectively. From this construction, it is not hard to\nsee that for any choice of $k \\le d - 1$ attributes, the marginals over the $k$ attributes of both distributions\nare identical: they are both a uniform distribution over $k$ bits. Thus, the distributions $D_1$ and $D_2$ are\nindistinguishable to a learner that can only observe at most $d - 1$ attributes of each example.\nLet $\\ell(w^\\top x - y) = |w^\\top x - y|$, and let\n\n$$L_{D_1}(w) = \\mathbb{E}_{(x, y) \\sim D_1}\\left[\\ell(w^\\top x - y)\\right] = \\frac{1}{2^{d-1}} \\sum_{x \\in X_1} |w^\\top x - 1|,$$\n\nand\n\n$$L_{D_2}(w) = \\mathbb{E}_{(x, y) \\sim D_2}\\left[\\ell(w^\\top x - y)\\right] = \\frac{1}{2^{d-1}} \\sum_{x \\in X_2} |w^\\top x - 1|.$$\n\nIt turns out that the subgradients of $L_{D_1}(w)$ and $L_{D_2}(w)$, which we denote by $\\partial L_{D_1}(w)$ and $\\partial L_{D_2}(w)$\nrespectively, can be expressed precisely. In fact, the full subgradient set at every point in the domain\nfor both functions can be made explicit. With these representations in hand, we can show that\n$w_1^* = \\frac{2}{d} \\mathbf{1}_d$ and $w_2^* = \\frac{2}{d+2} \\mathbf{1}_d$ are minimizers of $L_{D_1}(w)$ and $L_{D_2}(w)$, respectively.\nIn fact, using the subgradient sets we can prove a much stronger property of the expected losses\n$L_{D_1}$ and $L_{D_2}$, akin to a \u201cdirectional strong convexity\u201d property around their respective minimizers.\nThe geometric idea behind this property is shown in Figure 1, whereby $L_D$ is lower bounded by an\nabsolute value function.\n\nFigure 1: Geometric intuition for Lemmas 6 and 7. The lower bounding absolute value function acts\nas a relaxation of the true expected loss $L_D$ (depicted here as a cone).\n\nLemma 6. Let $w_1^* = \\frac{2}{d} \\mathbf{1}_d$. For any $w \\in \\mathbb{R}^d$ we have\n\n$$L_{D_1}(w) - L_{D_1}(w_1^*) \\ge \\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\left|\\mathbf{1}_d^\\top (w - w_1^*)\\right|.$$\n\nLemma 7. Let $w_2^* = \\frac{2}{d+2} \\mathbf{1}_d$. For any $w \\in \\mathbb{R}^d$ we have\n\n$$L_{D_2}(w) - L_{D_2}(w_2^*) \\ge \\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\left|\\mathbf{1}_d^\\top (w - w_2^*)\\right|.$$\n\nGiven Lemmas 6 and 7, the proof of Theorem 3 is immediate.\n\nProof of Theorem 3. As a direct consequence of Lemmas 6 and 7, we obtain that the sets\n\n$$S_1 = \\left\\{ w : \\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\left|\\mathbf{1}_d^\\top (w - w_1^*)\\right| \\le \\varepsilon \\right\\}$$\n\nand\n\n$$S_2 = \\left\\{ w : \\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\left|\\mathbf{1}_d^\\top (w - w_2^*)\\right| \\le \\varepsilon \\right\\}$$\n\ncontain the sets of $\\varepsilon$-optimal regressors for $L_{D_1}(w)$ and $L_{D_2}(w)$, respectively. All that is needed now\nis to show a separation of their $\\varepsilon$-optimal sets for $0 < \\varepsilon < \\frac{1}{60} d^{-3/2}$, and this is done by showing a\nseparation of the more manageable sets $S_1$ and $S_2$. Indeed, fix $0 < \\varepsilon < \\frac{1}{60} d^{-3/2}$ and observe that for\nany $w \\in S_1$ we have\n\n$$\\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\left|\\mathbf{1}_d^\\top (w - w_1^*)\\right| \\le \\frac{1}{60} d^{-3/2},$$\n\nand so, for $d \\ge 4$ (using $\\mathbf{1}_d^\\top w_1^* = 2$ and $e^4/(60\\sqrt{2\\pi}) < 1/2$),\n\n$$\\mathbf{1}_d^\\top w \\ge 2 - \\frac{1}{2d} > 2 - \\frac{1}{d+2} = \\frac{2d+3}{d+2}.$$\n\nOn the other hand, for any $w \\in S_2$ we have\n\n$$\\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\left|\\mathbf{1}_d^\\top (w - w_2^*)\\right| \\le \\frac{1}{60} d^{-3/2},$$\n\nthus (using $\\mathbf{1}_d^\\top w_2^* = \\frac{2d}{d+2}$)\n\n$$\\mathbf{1}_d^\\top w \\le \\frac{2d}{d+2} + \\frac{1}{2d} < \\frac{2d}{d+2} + \\frac{1}{d+2} = \\frac{2d+1}{d+2}.$$\n\nWe see that no $w$ can exist in both $S_1$ and $S_2$, so these sets are disjoint. Theorem 3 follows by the\nsame reasoning used to conclude the proof of Theorem 1.\n$\\square$\n\nIt remains to prove Lemmas 6 and 7. As the proofs are very similar, we will only prove Lemma 6\nhere and defer the proof of Lemma 7 to the supplementary material.\n\nProof of Lemma 6. 
We first write\n\n$$\\partial L_{D_1}(w) = \\frac{1}{2^{d-1}} \\sum_{x \\in X_1} \\partial \\ell(w^\\top x, 1) = \\frac{1}{2^{d-1}} \\sum_{x \\in X_1} \\mathrm{sign}(w^\\top x - 1) \\cdot x.$$\n\nLetting $w_1^* = \\frac{2}{d} \\cdot \\mathbf{1}_d$, we have that\n\n$$\\partial L_{D_1}(w_1^*) = \\frac{1}{2^{d-1}} \\sum_{x \\in X_1} \\mathrm{sign}(w_1^{*\\top} x - 1) \\cdot x$$\n\n$$= \\frac{1}{2^{d-1}} \\left( \\sum_{x \\in X_1,\\, \\|x\\|_1 > \\frac{d}{2}} \\mathrm{sign}(w_1^{*\\top} x - 1) \\cdot x + \\sum_{x \\in X_1,\\, \\|x\\|_1 = \\frac{d}{2}} \\mathrm{sign}(w_1^{*\\top} x - 1) \\cdot x + \\sum_{x \\in X_1,\\, \\|x\\|_1 < \\frac{d}{2}} \\mathrm{sign}(w_1^{*\\top} x - 1) \\cdot x \\right)$$\n\n$$= \\frac{1}{2^{d-1}} \\left( \\sum_{x \\in X_1,\\, \\|x\\|_1 > \\frac{d}{2}} x + \\sum_{x \\in X_1,\\, \\|x\\|_1 = \\frac{d}{2}} \\mathrm{sign}(0) \\cdot x - \\sum_{x \\in X_1,\\, \\|x\\|_1 < \\frac{d}{2}} x \\right),$$\n\nwhere $\\mathrm{sign}(0)$ can be any number in $[-1, 1]$. Next, we compute\n\n$$\\sum_{x \\in X_1,\\, \\|x\\|_1 > \\frac{d}{2}} x - \\sum_{x \\in X_1,\\, \\|x\\|_1 < \\frac{d}{2}} x = \\left( \\sum_{i = \\frac{d}{4}+1}^{\\frac{d}{2}} \\binom{d-1}{2i-1} \\right) \\cdot \\mathbf{1}_d - \\left( \\sum_{i=1}^{\\frac{d}{4}-1} \\binom{d-1}{2i-1} \\right) \\cdot \\mathbf{1}_d$$\n\n$$= \\left( \\sum_{i=0}^{\\frac{d}{2}-2} (-1)^i \\binom{d-1}{i} \\right) \\cdot \\mathbf{1}_d = \\binom{d-2}{\\frac{d}{2}-2} \\cdot \\mathbf{1}_d,$$\n\nwhere the last equality follows from the elementary identity $\\sum_{i=0}^{k} (-1)^i \\binom{n}{i} = (-1)^k \\binom{n-1}{k}$, which we\nprove in Lemma 9 in the supplementary material. Now, let $X^* = \\{x \\in X_1 : \\|x\\|_1 = \\frac{d}{2}\\}$, let $m = |X^*|$,\nand let $X = [x_1, \\ldots, x_m] \\in \\mathbb{R}^{d \\times m}$ be the matrix formed by all $x \\in X^*$. Then we may express the\nentire subgradient set explicitly as\n\n$$\\partial L_{D_1}(w_1^*) = \\left\\{ \\frac{1}{2^{d-1}} \\left( Xr + \\binom{d-2}{\\frac{d}{2}-2} \\cdot \\mathbf{1}_d \\right) \\;\\middle|\\; r \\in [-1, 1]^m \\right\\}.$$\n\nThus, any choice of $r \\in [-1, 1]^m$ will result in a specific subgradient of $L_{D_1}(w_1^*)$. Consider two such\nchoices: $r_1 = 0$ and $r_2 = -\\mathbf{1}_m$. Note that $X r_1 = 0$ and $X r_2 = -\\binom{d-1}{\\frac{d}{2}-1} \\cdot \\mathbf{1}_d$; to see the last equality,\nconsider any fixed coordinate $i$ and notice that the number of elements in $X^*$ with non-zero values in\nthe $i$\u2019th coordinate is equal to the number of ways to choose the remaining $\\frac{d}{2} - 1$ non-zero coordinates\nfrom the other $d - 1$ coordinates. We then observe that the corresponding subgradients are\n\n$$h^+ = \\frac{1}{2^{d-1}} \\left( X r_1 + \\binom{d-2}{\\frac{d}{2}-2} \\cdot \\mathbf{1}_d \\right) = \\frac{1}{2^{d-1}} \\binom{d-2}{\\frac{d}{2}-2} \\cdot \\mathbf{1}_d,$$\n\nand\n\n$$h^- = \\frac{1}{2^{d-1}} \\left( X r_2 + \\binom{d-2}{\\frac{d}{2}-2} \\cdot \\mathbf{1}_d \\right) = -\\frac{1}{2^{d-1}} \\binom{d-2}{\\frac{d}{2}-1} \\cdot \\mathbf{1}_d.$$\n\nNote that, since the set of subgradients of $L_{D_1}(w_1^*)$ is a convex set, by taking a convex combination\nof $h^+$ and $h^-$ it follows that $0 \\in \\partial L_{D_1}(w_1^*)$ and so we see that $w_1^*$ is a minimizer of $L_{D_1}(w)$.\nGiven a handle on the subgradient set, we now show that these coefficients are polynomial in $d$.\nObserve that, using the fact that $\\sqrt{2\\pi n} \\left(\\frac{n}{e}\\right)^n \\le n! \\le e \\sqrt{n} \\left(\\frac{n}{e}\\right)^n$, we have\n\n$$\\frac{1}{2^{d-1}} \\binom{d-2}{\\frac{d}{2}-2} \\ge \\frac{1}{2^{d-1}} \\cdot \\frac{\\sqrt{2\\pi (d-2)} \\left(\\frac{d-2}{e}\\right)^{d-2}}{e^2 \\sqrt{\\frac{d-4}{2}} \\sqrt{\\frac{d}{2}} \\left(\\frac{d}{2e}\\right)^{\\frac{d}{2}} \\left(\\frac{d-4}{2e}\\right)^{\\frac{d}{2}-2}} \\ge \\frac{\\sqrt{2\\pi}}{e^2 \\sqrt{d}} \\left(\\frac{d-2}{d}\\right)^{d-2} \\ge \\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}}.$$\n\nLet $h^* = \\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\mathbf{1}_d$. Since $h^*$ can be written as a convex combination of $h^+$ and $0$, we see that\n$h^* \\in \\partial L_{D_1}(w_1^*)$. Similarly we may see that\n\n$$-\\frac{1}{2^{d-1}} \\binom{d-2}{\\frac{d}{2}-1} \\le -\\frac{1}{2^{d-1}} \\cdot \\frac{\\sqrt{2\\pi (d-2)} \\left(\\frac{d-2}{e}\\right)^{d-2}}{e^2 \\left(\\frac{d}{2}-1\\right) \\left(\\frac{d-2}{2e}\\right)^{d-2}} = -\\frac{\\sqrt{2\\pi}}{e^2 \\sqrt{d-2}} \\le -\\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}}.$$\n\nAgain, since $-h^*$ can be written as a convex combination of the vectors $h^-$ and $0$ in the subgradient\nset, we may conclude that $-h^* \\in \\partial L_{D_1}(w_1^*)$ as well.\nBy the subgradient inequality it follows that, for all $w \\in \\mathbb{R}^d$,\n\n$$L_{D_1}(w) - L_{D_1}(w_1^*) \\ge h^{*\\top}(w - w_1^*) = \\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\mathbf{1}_d^\\top (w - w_1^*)$$\n\nand\n\n$$L_{D_1}(w) - L_{D_1}(w_1^*) \\ge -h^{*\\top}(w - w_1^*) = -\\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\mathbf{1}_d^\\top (w - w_1^*),$$\n\nwhich taken together imply that\n\n$$L_{D_1}(w) - L_{D_1}(w_1^*) \\ge \\frac{\\sqrt{2\\pi}}{e^4 \\sqrt{d}} \\cdot \\left|\\mathbf{1}_d^\\top (w - w_1^*)\\right|$$\n\nas required.\n$\\square$\n\n4 General Algorithm for Limited Precision\n\nAlthough we have established limits on the attainable precision for some learning problems, there is\nstill the possibility of reaching this limit. 
In this section we provide a general algorithm, whereby a\nlearner that can observe $k < d$ attributes can always achieve an expected loss within $O(\\sqrt{1 - k/d})$ of the optimum.\nWe provide the pseudo-code in Algorithm 1. Although similar to the AERR algorithm of Hazan and\nKoren [6], which is designed to work only with the squared loss, Algorithm 1 avoids the necessity\nof an unbiased gradient estimator by replacing the original loss function with a slightly biased one.\nAs long as the new loss function is chosen carefully (and the functions are Lipschitz bounded), and\ngiven enough samples, the algorithm can return a regressor of limited precision. This is in contrast to\nAERR whereby an arbitrarily precise regressor can always be achieved with enough samples.\nFormally, for Algorithm 1 we prove the following (proof in the supplementary material).\nTheorem 8. Let $\\ell : \\mathbb{R} \\mapsto \\mathbb{R}$ be an $H$-Lipschitz function defined over $[-2B, 2B]$. Assume the\ndistribution $D$ is such that $\\|x\\|_2 \\le 1$ and $|y| \\le B$ with probability 1. Let $\\tilde{B} = \\max\\{B, 1\\}$, and let $\\hat{w}$\nbe the output of Algorithm 1, when run with $\\eta = \\frac{2B}{G\\sqrt{m}}$. 
Then, $\\|\\hat{w}\\|_2 \\le B$, and for any $w^* \\in \\mathbb{R}^d$ with\n$\\|w^*\\|_2 \\le B$,\n\n$$\\mathbb{E}[L_D(\\hat{w})] \\le L_D(w^*) + \\frac{2HB}{\\sqrt{m}} + 2H\\tilde{B}^2 \\sqrt{1 - \\frac{k}{d}}.$$\n\nAlgorithm 1 General algorithm for regression/classification with missing attributes\nInput: Loss function $\\ell$, training set $S = \\{(x_t, y_t)\\}_{t \\in [m]}$, $k$, $B$, $\\eta > 0$\nOutput: Regressor $\\hat{w}$ with $\\|\\hat{w}\\|_2 \\le B$\n1: Initialize $w_1 \\ne 0$, $\\|w_1\\|_2 \\le B$ arbitrarily\n2: for $t = 1$ to $m$ do\n3:   Uniformly choose subset of $k$ indices $\\{i_{t,r}\\}_{r \\in [k]}$ from $[d]$ without replacement\n4:   Set $\\tilde{x}_t = \\sum_{r=1}^{k} x_t[i_{t,r}] \\cdot e_{i_{t,r}}$\n5:   Regression case:\n6:     Choose $\\hat{\\phi}_t \\in \\partial \\ell(w_t^\\top \\tilde{x}_t - y_t)$\n7:   Classification case:\n8:     Choose $\\hat{\\phi}_t \\in \\partial \\ell(y_t \\cdot w_t^\\top \\tilde{x}_t)$\n9:   Update $w_{t+1} = \\frac{B}{\\max\\{\\|w_t - \\eta(\\hat{\\phi}_t \\cdot \\tilde{x}_t)\\|_2,\\, B\\}} \\cdot (w_t - \\eta(\\hat{\\phi}_t \\cdot \\tilde{x}_t))$\n10: end for\n11: Return $\\hat{w} = \\frac{1}{m} \\sum_{t=1}^{m} w_t$\n\nIn particular, for $m = d/(d - k)$ we have\n\n$$\\mathbb{E}[L_D(\\hat{w})] \\le L_D(w^*) + 4H\\tilde{B}^2 \\sqrt{1 - \\frac{k}{d}},$$\n\nand so when the learner observes $k = d - 1$ attributes, the expected loss is $O(1/\\sqrt{d})$-away from\noptimum.\n\n5 Conclusions and Future Work\n\nIn the limited attribute observation setting, we have shown information-theoretic lower bounds for\nsome variants of regression, proving that a distribution-independent algorithm for regression with\nabsolute loss that attains $\\varepsilon$ error cannot exist and closing the gap for ridge regression as suggested\nby Hazan and Koren [6]. 
We have also shown that the proof technique applied for regression\nwith absolute loss can be extended to show a similar bound for classification with the hinge loss.\nIn addition, we have described a general purpose algorithm which complements these results by\nproviding a means of achieving error up to a certain precision limit.\nAn interesting possibility for future work would be to try to bridge the gap between the upper and\nlower bounds of the precision limits, particularly in the case of the exponential gap for classification\nwith hinge loss. Another direction would be to develop a more comprehensive understanding of these\nlower bounds in terms of more general functions, one example being classification with logistic loss.\n\nReferences\n[1] S. Ben-David and E. Dichterman. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56(3):277\u2013298, 1998.\n[2] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. In Proceedings of the 27th International Conference on Machine Learning, 2010.\n[3] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Online learning of noisy data. IEEE Transactions on Information Theory, 57(12):7907\u20137931, 2011.\n[4] O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features. Machine Learning Journal, 81(2):149\u2013178, 2010.\n[5] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78\u2013150, 1992.\n[6] E. Hazan and T. Koren. Linear regression with limited observation. In Proceedings of the 29th International Conference on Machine Learning (ICML\u201912), Edinburgh, Scotland, UK, 2012.\n[7] E. Hazan, R. Livni, and Y. Mansour. 
Classification with low rank and missing data. In Proceedings of the 32nd International Conference on Machine Learning, 2015.\n[8] D. Kukliansky and O. Shamir. Attribute efficient linear regression with data-dependent sampling. In Proceedings of the 32nd International Conference on Machine Learning, 2015.\n[9] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data, 2nd Edition. Wiley-Interscience, 2002.\n[10] P.-L. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, 2011.\n[11] A. Rostamizadeh, A. Agarwal, and P. Bartlett. Learning with missing features. In The 27th Conference on Uncertainty in Artificial Intelligence, 2011.\n[12] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n[13] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.", "award": [], "sourceid": 1743, "authors": [{"given_name": "Brian", "family_name": "Bullins", "institution": "Princeton University"}, {"given_name": "Elad", "family_name": "Hazan", "institution": "Princeton University"}, {"given_name": "Tomer", "family_name": "Koren", "institution": "Technion---Israel Inst. of Technology"}]}