Unsupervised Risk Estimation Using Only Conditional Independence Structure

Jacob Steinhardt
Stanford University
jsteinhardt@cs.stanford.edu

Percy Liang
Stanford University
pliang@cs.stanford.edu

Advances in Neural Information Processing Systems (NIPS 2016), pp. 3657-3665.

Abstract

We show how to estimate a model's test error from unlabeled data, on distributions very different from the training distribution, while assuming only that certain conditional independencies are preserved between train and test. We do not need to assume that the optimal predictor is the same between train and test, or that the true distribution lies in any parametric family. We can also efficiently compute gradients of the estimated error and hence perform unsupervised discriminative learning. Our technical tool is the method of moments, which allows us to exploit conditional independencies in the absence of a fully-specified model.
Our framework encompasses a large family of losses including the log and exponential loss, and extends to structured output settings such as conditional random fields.

1 Introduction

Can we measure the accuracy of a model at test time without any ground truth labels, and without assuming the test distribution is close to the training distribution? This is the problem of unsupervised risk estimation (Donmez et al., 2010): Given a loss function L(θ; x, y) and a fixed model θ, estimate the risk R(θ) := E_{x,y∼p*}[L(θ; x, y)] with respect to a test distribution p*(x, y), given access only to m unlabeled examples x^{(1:m)} ∼ p*(x). Unsupervised risk estimation lets us estimate model accuracy on a novel distribution, and is thus important for building reliable machine learning systems. Beyond evaluating a single model, it also provides a way of harnessing unlabeled data for learning: by minimizing the estimated risk over θ, we can perform unsupervised learning and domain adaptation.

Unsupervised risk estimation is impossible without some assumptions on p*, as otherwise p*(y | x), about which we have no observable information, could be arbitrary. How satisfied we should be with an estimator depends on how strong its underlying assumptions are. In this paper, we present an approach which rests on surprisingly weak assumptions: that p* satisfies certain conditional independencies, but not that it lies in any parametric family or is close to the training distribution.

To give a flavor for our results, suppose that y ∈ {1, ..., k} and that the loss decomposes as a sum of three parts: L(θ; x, y) = Σ_{v=1}^{3} f_v(θ; x_v, y), where the x_v (v = 1, 2, 3) are independent conditioned on y. In this case, we show that we can estimate the risk to error ε in poly(k)/ε² samples, independently of the dimension of x or θ, with only very mild additional assumptions on p*. In Sections 2 and 3 we generalize to a larger family of losses including the log and exponential losses, and extend beyond the multiclass case to conditional random fields.

Some intuition behind our result is provided in Figure 1. At a fixed value of x, we can think of each f_v as "predicting" that y = j if f_v(x_v, j) is low and f_v(x_v, j′) is high for j′ ≠ j. Since f_1, f_2, and f_3 all provide independent signals about y, their rate of agreement gives information about the model accuracy. If f_1, f_2, and f_3 all predict that y = 1, then it is likely that the true y equals 1 and the loss is small. Conversely, if f_1, f_2, and f_3 all predict different values of y, then the loss is likely large.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: Two possible loss profiles at a given value of x. Left: if f_1, f_2, and f_3 are all minimized at the same value of y, that is likely to be the correct value and the total loss is likely to be small. Right: conversely, if f_1, f_2, and f_3 are small at differing values of y, then the loss is likely to be large.

This intuition is formalized by Dawid and Skene (1979) when the f_v measure the 0/1-loss of independent classifiers; in particular, if r_v is the prediction of a classifier based on x_v, then Dawid and Skene model the r_v as independent given y: p(r_1, r_2, r_3) = Σ_{j=1}^{k} p(y = j) Π_{v=1}^{3} p(r_v | y = j). They then use the learned parameters of this model to compute the 0/1-loss.

Partial specification.
Dawid and Skene's approach relies on the prediction r_v only taking on k values. In this case, the full distribution p(r_1, r_2, r_3) can be parametrized by k×k conditional probability matrices p(r_v | y) and marginals p(y). However, as shown in Figure 1, we want to estimate continuous losses such as the log loss. We must therefore work with the prediction vector f_v ∈ R^k rather than a single predicted output r_v ∈ {1, ..., k}. To fully model p(f_1, f_2, f_3) would require nonparametric estimation, resulting in an undesirable sample complexity exponential in k; in contrast to the discrete case, conditional independence effectively only partially specifies a model for the losses.

To sidestep this issue, we make use of the method of moments, which has recently been used to fit non-convex latent variable models (e.g. Anandkumar et al., 2012). In fact, it has a much older history in the econometrics literature, where it is used as a tool for making causal identifications under structural assumptions, even when no explicit form for the likelihood is known (Anderson and Rubin, 1949; 1950; Sargan, 1958; 1959; Hansen, 1982; Powell, 1994; Hansen, 2014). It is this latter perspective that we draw upon. The key insight is that even in the absence of a fully-specified model, certain moment equations, such as E[f_1 f_2 | y] = E[f_1 | y] E[f_2 | y], can be derived solely from the assumed conditional independence. Solving these equations yields estimates of E[f_v | y], which can in turn be used to estimate the risk. Importantly, our procedure avoids estimation of the full loss distribution p(f_1, f_2, f_3), on which we make no assumptions other than conditional independence.

Our paper is structured as follows. In Section 2, we present our basic framework, and state and prove our main result on estimating the risk. In Section 3, we extend our framework in several directions, including to conditional random fields.
In Section 4, we present a gradient-based learning algorithm and show that the sample complexity needed for learning is d · poly(k)/ε², where d is the dimension of the parameters θ. In Section 5, we investigate how our method performs empirically.

Related Work. While the formal problem of unsupervised risk estimation was only posed recently by Donmez et al. (2010), several older ideas from domain adaptation and semi-supervised learning are also relevant. The covariate shift assumption posits access to labeled samples from a training distribution p_0(x, y) for which p*(y | x) = p_0(y | x). If p*(x) and p_0(x) are close, we can approximate p* by p_0 via importance weighting (Shimodaira, 2000; Quiñonero-Candela et al., 2009). If p* and p_0 are not close, another approach is to assume a well-specified discriminative model family Θ, such that p_0(y | x) = p*(y | x) = p_{θ*}(y | x) for some θ* ∈ Θ; then the only error when moving from p_0 to p* is statistical error in the estimation of θ* (Blitzer et al., 2011; Li et al., 2011). Such assumptions are restrictive: importance weighting only allows small perturbations from p_0 to p*, and mis-specified models of p(y | x) are common in practice; many authors report that mis-specification can lead to severe issues in semi-supervised settings (Merialdo, 1994; Nigam et al., 1998; Cozman and Cohen, 2006; Liang and Klein, 2008; Li and Zhou, 2015). More sophisticated approaches based on discrepancy minimization (Mansour et al., 2009) or learning invariant representations (Ben-David et al., 2006; Johansson et al., 2016) typically also make some form of the covariate shift assumption.

Our approach is closest to Dawid and Skene (1979) and some recent extensions (Zhang et al., 2014; Platanios, 2015; Jaffe et al., 2015; Fetaya et al., 2016). Similarly to Zhang et al. (2014) and Jaffe et al.
(2015), we use the method of moments for estimating latent-variable models. However, those papers use it for parameter estimation in the face of non-convexity, rather than as a way to avoid full estimation of p(f_v | y). The insight that the method of moments works under partial specification lets us extend beyond the simple discrete settings they consider to handle more complex continuous and structured losses. The intriguing work of Balasubramanian et al. (2011) provides an alternate approach to continuous losses; they show that the distribution of losses L | y is often approximately Gaussian, and use that to estimate the risk. Among all this work, ours is the first to perform gradient-based learning and the first to handle a structured loss (the log loss for conditional random fields).

Figure 2: Left: our basic 3-view setup (Assumption 1). Center: Extension 1, to CRFs; the embedding of 3 views into the CRF is indicated in blue. Right: Extension 3, to include a mediating variable z.

2 Framework and Estimation Algorithm

We will focus on multiclass classification; we assume an unknown true distribution p*(x, y) over X × Y, where Y = {1, ..., k}, and are given unlabeled samples x^{(1)}, ..., x^{(m)} drawn i.i.d. from p*(x). Given parameters θ ∈ R^d and a loss function L(θ; x, y), our goal is to estimate the risk of θ on p*: R(θ) := E_{x,y∼p*}[L(θ; x, y)]. Throughout, we will make the 3-view assumption:

Assumption 1 (3-view). Under p*, x can be split into x_1, x_2, x_3, which are conditionally independent given y (see Figure 2).
Moreover, the loss decomposes additively across views: L(θ; x, y) = A(θ; x) − Σ_{v=1}^{3} f_v(θ; x_v, y), for some functions A and f_v.

Note that each x_v can be large (e.g. they could be vectors in R^d). If we have V > 3 views, we can combine views to obtain V = 3 without loss of generality. It also suffices for just the f_v to be independent rather than the x_v. Given only 2 views, the risk can be shown to be unidentifiable in general, although obtaining upper bounds may be possible.

We give some examples where Assumption 1 holds, then state and prove our main result (see Section 3 for additional examples). We start with logistic regression, which will be our primary focus later on:

Example 1 (Logistic Regression). Suppose that we have a log-linear model p_θ(y | x) = exp(θ^T (φ_1(x_1, y) + φ_2(x_2, y) + φ_3(x_3, y)) − A(θ; x)), where x_1, x_2, and x_3 are independent conditioned on y. If our loss function is the log-loss L(θ; x, y) = −log p_θ(y | x), then Assumption 1 holds with f_v(θ; x_v, y) = θ^T φ_v(x_v, y) and A(θ; x) equal to the partition function of p_θ.

Assumption 1 does not hold for the hinge loss (see Appendix A for details), but it does hold for a modified hinge loss, where we apply the hinge separately to each view:

Example 2 (Modified Hinge Loss). Suppose that L(θ; x, y) = Σ_{v=1}^{3} (1 + max_{j≠y} θ^T φ_v(x_v, j) − θ^T φ_v(x_v, y))_+. In other words, L is the sum of 3 hinge losses, one for each view. Then Assumption 1 holds with A = 0, and −f_v equal to the hinge loss for view v.

The model can also be non-linear within each view x_v, as long as the views are combined additively:

Example 3 (Neural Networks). Suppose that for each view v we have a neural network whose output is a score for each of the k classes, (f_v(θ; x_v, j))_{j=1}^{k}. Sum the scores f_1 + f_2 + f_3, apply a soft-max, and evaluate using the log loss; then L(θ; x, y) = A(θ; x) − Σ_{v=1}^{3} f_v(θ; x_v, y), where A(θ; x) is the log-normalization constant of the softmax, and hence L satisfies Assumption 1.

We are now ready to present our main result on recovering the risk R(θ). The key starting point is the conditional risk matrices M_v ∈ R^{k×k}, defined as (suppressing the dependence on θ)

(M_v)_{ij} = E[f_v(θ; x_v, i) | y = j].    (1)

In the case of the 0/1-loss, the M_v are confusion matrices; in general, (M_v)_{ij} measures how strongly we predict class i when the true class is j. If we could recover these matrices along with the marginal class probabilities π_j := p*(y = j), then estimating the risk would be straightforward; indeed,

R(θ) = E[A(θ; x) − Σ_{v=1}^{3} f_v(θ; x_v, y)] = E[A(θ; x)] − Σ_{j=1}^{k} π_j Σ_{v=1}^{3} (M_v)_{j,j},    (2)

where E[A(θ; x)] can be estimated from unlabeled data alone.

Caveat: Class permutation. Suppose that at training time, we learn to predict whether an image contains the digit 0 or 1. At test time, nothing changes except the definitions of 0 and 1 are reversed. It is clearly impossible to detect this from unlabeled data; mathematically, the risk matrices M_v are only recoverable up to column permutation. We will end up computing the minimum risk over these permutations, which we call the optimistic risk and denote R̃(θ) := min_{σ∈Sym(k)} E_{x,y∼p*}[L(θ; x, σ(y))]. This equals the true risk as long as θ is at least aligned with the correct classes in the sense that E_x[L(θ; x, j) | y = j] ≤ E_x[L(θ; x, j′) | y = j] for j′ ≠ j.
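To make the permutation step concrete, here is a minimal Python sketch (illustrative only, with made-up inputs) that, given hypothetical estimates of the M_v and π, searches for the best column permutation and plugs it into the identity (2). For clarity it brute-forces over permutations; a maximum-weight bipartite matching solver achieves the same result in O(k³).

```python
from itertools import permutations

def optimistic_risk(EA, Ms, pi):
    """Plug-in estimate of the optimistic risk via identity (2), minimized
    over column permutations sigma. EA approximates E[A(theta; x)]; Ms is a
    list of three k x k conditional risk matrices (lists of lists); pi is
    the vector of class marginals. Brute force over sigma for clarity."""
    k = len(pi)
    # S[i][j] = sum_v (M_v)_{ij}
    S = [[sum(M[i][j] for M in Ms) for j in range(k)] for i in range(k)]
    score = lambda s: sum(pi[s[j]] * S[j][s[j]] for j in range(k))
    sigma = max(permutations(range(k)), key=score)
    return EA - score(sigma), sigma

# toy sanity check: diagonal risk matrices, aligned classes
I = [[2.0, 0.0], [0.0, 2.0]]
risk, sigma = optimistic_risk(5.0, [I, I, I], [0.5, 0.5])
assert sigma == (0, 1) and abs(risk - (-1.0)) < 1e-9
```

When the columns of the M_v are permuted (the class-relabeling scenario above), the same call recovers the permutation automatically and returns the same optimistic risk.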
The optimal σ can be computed from M_v and π in O(k³) time using maximum weight bipartite matching; see Section B for details.

Our main result, Theorem 1, says that we can recover both M_v and π up to permutation, with a number of samples that is polynomial in k:

Theorem 1. Suppose Assumption 1 holds. Then, for any ε, δ ∈ (0, 1), we can estimate M_v and π up to column permutation, to error ε (in Frobenius and ∞-norm respectively). Our algorithm requires

m = poly(k, π_min^{-1}, λ^{-1}, τ) · log(2/δ)/ε²

samples to succeed with probability 1 − δ, where

π_min := min_{j=1}^{k} p*(y = j),   τ := E[Σ_{v,j} f_v(θ; x_v, j)²],   and   λ := min_{v=1}^{3} σ_k(M_v),    (3)

and σ_k denotes the kth singular value. Moreover, the algorithm runs in time m · poly(k).

Estimates for M_v and π imply an estimate for R̃ via (2); see Algorithm 1 below for details. Importantly, the sample complexity in Theorem 1 depends on the number of classes k, but not on the dimension d of θ. Moreover, Theorem 1 holds even if p* lies outside the model family Θ, and even if the train and test distributions are very different (in fact, the result is agnostic to how the model θ was produced). The only requirement is the 3-view assumption for p* and that λ, π_min ≠ 0.

Let us interpret each term in (3). First, τ tracks the variance of the loss, and we should expect the difficulty of estimating the risk to increase with this variance. The log(2/δ)/ε² term is typical and shows up even when estimating the parameter of a random variable to accuracy ε from m samples.
The π_min^{-1} term appears because, if one of the classes is very rare, we need to wait a long time to observe even a single sample from that class, and even longer to estimate the risk on that class accurately. Perhaps least intuitive is the λ^{-1} term, which is large e.g. when two classes have similar conditional risk vectors E[(f_v(θ; x_v, i))_{i=1}^{k} | y = j]. To see why this matters, consider an extreme where x_1, x_2, and x_3 are independent not only of each other but also of y. Then p*(y) is completely unconstrained, and it is impossible to estimate R at all. Why does this not contradict Theorem 1? The answer is that in this case, all rows of M_v are equal and hence M_v has rank 1, λ = 0, λ^{-1} = ∞, and we need infinitely many samples for Theorem 1 to hold; λ measures how close we are to this degenerate case.

Proof of Theorem 1. We now outline a proof of Theorem 1. Recall the goal is to estimate the conditional risk matrices M_v, defined as (M_v)_{ij} = E[f_v(θ; x_v, i) | y = j]; from these we can recover the risk itself using (2). The key insight is that certain moments of p*(y | x) can be expressed as polynomial functions of the matrices M_v, and therefore we can solve for the M_v even without explicitly estimating p*. Our approach follows the technical machinery behind the spectral method of moments (e.g., Anandkumar et al., 2012), which we explain below for completeness.

Define the loss vector h_v(x_v) = (f_v(θ; x_v, i))_{i=1}^{k}, which measures the loss that would be incurred under each of the k classes. The conditional independence of the x_v means that E[h_1(x_1) h_2(x_2)^T | y] = E[h_1(x_1) | y] E[h_2(x_2) | y]^T, and similarly for higher-order conditional moments.
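This product rule is easy to check numerically. The following simulation (with made-up conditional means, not data from our experiments) draws two loss vectors conditionally independently at a fixed y and compares the empirical second moment to the outer product of the conditional means:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 200_000

# made-up conditional means E[h_1 | y] and E[h_2 | y] at a fixed y
m1 = np.array([0.2, 0.7, 0.1])
m2 = np.array([0.5, 0.3, 0.9])

# draw the two loss vectors conditionally independently given y
h1 = m1 + rng.uniform(-0.5, 0.5, size=(n, k))
h2 = m2 + rng.uniform(-0.5, 0.5, size=(n, k))

# empirical E[h1 h2^T | y] vs the product E[h1 | y] E[h2 | y]^T
emp = h1.T @ h2 / n
assert np.abs(emp - np.outer(m1, m2)).max() < 0.01
```

With correlated noise across the two views the assertion would fail, which is exactly the failure mode that the mediating-variable extension in Section 3 is designed to address.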
Marginalizing over y, we see that there is low-rank structure in the moments of h that we can exploit; in particular (letting ⊗ denote outer product and A_{·,j} denote the jth column of A):

E[h_v(x_v)] = Σ_{j=1}^{k} π_j (M_v)_{·,j},
E[h_v(x_v) ⊗ h_{v′}(x_{v′})] = Σ_{j=1}^{k} π_j (M_v)_{·,j} ⊗ (M_{v′})_{·,j} for v ≠ v′, and
E[h_1(x_1) ⊗ h_2(x_2) ⊗ h_3(x_3)] = Σ_{j=1}^{k} π_j (M_1)_{·,j} ⊗ (M_2)_{·,j} ⊗ (M_3)_{·,j}.    (4)

The left-hand-side of each equation can be estimated from unlabeled data; using tensor decomposition (Lathauwer, 2006; Comon et al., 2009; Anandkumar et al., 2012; 2013; Kuleshov et al., 2015), it is then possible to solve for M_v and π. In particular, we can recover M and π up to permutation: that is, we recover M̂ and π̂ such that M_{i,j} ≈ M̂_{i,σ(j)} and π_j ≈ π̂_{σ(j)} for some permutation σ ∈ Sym(k). This then yields Theorem 1; see Section C for a full proof.

Algorithm 1 Algorithm for estimating R̃(θ) from unlabeled data.
1: Input: unlabeled samples x^{(1)}, ..., x^{(m)} ∼ p*(x).
2: Estimate the left-hand-side of each term in (4) using x^{(1:m)}.
3: Compute approximations M̂_v and π̂ to M_v and π using tensor decomposition.
4: Compute σ maximizing Σ_{j=1}^{k} π̂_{σ(j)} Σ_{v=1}^{3} (M̂_v)_{j,σ(j)} using maximum bipartite matching.
5: Output: estimated risk, (1/m) Σ_{i=1}^{m} A(θ; x^{(i)}) − Σ_{j=1}^{k} π̂_{σ(j)} Σ_{v=1}^{3} (M̂_v)_{j,σ(j)}.

Assumption 1 thus yields a set of moment equations (4) whose solution lets us estimate the risk without any labels y.
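To see concretely why the system (4) determines the M_v, note that in matrix form B := E[h_1 ⊗ h_2] = M_1 diag(π) M_2^T, and contracting the triple moment with a probe vector η in the third mode gives T_η = M_1 diag(π) diag(M_3^T η) M_2^T; hence T_η B^{-1} = M_1 diag(M_3^T η) M_1^{-1}, whose eigenvalues reveal the entries of M_3^T η. A numpy sketch of this standard observable-operator identity, on synthetic (generically invertible) M_v and π of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
pi = rng.dirichlet(np.ones(k))                             # synthetic class marginals
M1, M2, M3 = (rng.normal(size=(k, k)) for _ in range(3))   # synthetic risk matrices

# population moments implied by (4), in matrix form
B = M1 @ np.diag(pi) @ M2.T                     # pairwise moment E[h1 (outer) h2]
eta = rng.normal(size=k)                        # probe vector for the third mode
T_eta = M1 @ np.diag(pi * (M3.T @ eta)) @ M2.T  # triple moment contracted with eta

# observable operator: T_eta B^{-1} = M1 diag(M3^T eta) M1^{-1},
# so its eigenvalues equal the entries of M3^T eta
eigs = np.sort(np.linalg.eigvals(T_eta @ np.linalg.inv(B)).real)
assert np.allclose(eigs, np.sort(M3.T @ eta), atol=1e-6)
```

The eigenvectors likewise recover the columns of M_1 up to scaling, which is the backbone of the tensor-decomposition step in Algorithm 1.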
The procedure is summarized in Algorithm 1: we (i) approximate the left-hand-side of each term in (4) by sample averages; (ii) use tensor decomposition to solve for π and M_v; (iii) use maximum matching to compute the permutation σ; and (iv) use (2) to obtain R̃ from π and M_v.

3 Extensions

Theorem 1 provides a basic building block which admits several extensions to more complex model structures. We go over several cases below, omitting most proofs to avoid tedium.

Extension 1 (Conditional Random Field). Most importantly, the variable y need not belong to a small discrete set; we can handle structured outputs such as a CRF as long as p* has the right structure. This contrasts with previous work on unsupervised risk estimation that was restricted to multiclass classification (though in a different vein, it is close to Proposition 8 of Anandkumar et al. (2012)). Suppose that p*(x_{1:T}, y_{1:T}) factorizes as a hidden Markov model, and that p_θ is a CRF respecting the HMM structure: p_θ(y_{1:T} | x_{1:T}) ∝ Π_{t=2}^{T} f_θ(y_{t−1}, y_t) · Π_{t=1}^{T} g_θ(y_t, x_t). For the log-loss L(θ; x, y) = −log p_θ(y_{1:T} | x_{1:T}), we can exploit the decomposition

−log p_θ(y_{1:T} | x_{1:T}) = Σ_{t=2}^{T} ℓ_t − Σ_{t=2}^{T−1} ℓ′_t,  where ℓ_t := −log p_θ(y_{t−1}, y_t | x_{1:T}) and ℓ′_t := −log p_θ(y_t | x_{1:T}).    (5)

Each of the components ℓ_t and ℓ′_t satisfies Assumption 1 (see Figure 2; for ℓ_t, the views are x_{1:t−2}, x_{t−1:t}, x_{t+1:T}, and for ℓ′_t they are x_{1:t−1}, x_t, x_{t+1:T}). We use Theorem 1 to estimate each E[ℓ_t], E[ℓ′_t] individually, and thus also the full risk E[L]. (We actually estimate the risk for y_{2:T−1} | x_{1:T} due to the 3-view assumption failing at the boundaries.)

In general, the idea in (5) applies to any structured output problem that is a sum of local 3-view structures.
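The decomposition (5) is just the standard chain factorization of a Markov distribution (the posterior of a chain CRF is itself Markov): the joint splits into pairwise marginals divided by the overcounted singleton marginals. A self-contained numerical check, with a synthetic time-homogeneous transition matrix standing in for p_θ(· | x_{1:T}):

```python
import numpy as np

# a synthetic Markov chain standing in for the posterior p_theta(y_{1:T} | x_{1:T})
mu1 = np.array([0.6, 0.4])                 # p(y_1)
P = np.array([[0.7, 0.3], [0.2, 0.8]])     # p(y_t | y_{t-1})
T, y = 4, [0, 1, 1, 0]                     # chain length and an arbitrary assignment

mu = [mu1]                                  # singleton marginals p(y_t)
for _ in range(T - 1):
    mu.append(mu[-1] @ P)

# left-hand side of (5): -log p(y_{1:T})
joint = mu1[y[0]] * np.prod([P[y[t - 1], y[t]] for t in range(1, T)])
lhs = -np.log(joint)

# right-hand side: pairwise terms l_t (t = 2..T) minus singleton terms l'_t (t = 2..T-1)
l = [-np.log(mu[t - 1][y[t - 1]] * P[y[t - 1], y[t]]) for t in range(1, T)]
lp = [-np.log(mu[t][y[t]]) for t in range(1, T - 1)]
assert abs(lhs - (sum(l) - sum(lp))) < 1e-12
```

Each ℓ_t and ℓ′_t in the sum then gets its own 3-view risk estimate as described above.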
It would be interesting to extend our results to other structures such as more general graphical models (Chaganty and Liang, 2014) and parse trees (Hsu et al., 2012).

Extension 2 (Exponential Loss). We can also relax the additivity L = A − f_1 − f_2 − f_3 in Assumption 1. For instance, suppose L(θ; x, y) = exp(−θ^T Σ_{v=1}^{3} φ_v(x_v, y)) is the exponential loss. Theorem 1 lets us estimate the matrices M_v corresponding to f_v(θ; x_v, y) = exp(−θ^T φ_v(x_v, y)). Then

R(θ) = E[Π_{v=1}^{3} f_v(θ; x_v, y)] = Σ_j π_j Π_{v=1}^{3} E[f_v(θ; x_v, j) | y = j]    (6)

by conditional independence, so the risk can be computed as Σ_j π_j Π_{v=1}^{3} (M_v)_{j,j}. This idea extends to any loss expressible as L(θ; x, y) = A(θ; x) + Σ_{i=1}^{n} Π_{v=1}^{3} f_v^i(θ; x_v, y) for some functions f_v^i.

Extension 3 (Mediating Variable). Assuming that x_{1:3} are independent conditioned only on y may not be realistic; there might be multiple subclasses of a class (e.g., multiple ways to write the digit 4) which would induce systematic correlations across views. To address this, we show that independence need only hold conditioned on a mediating variable z, rather than on the class y itself. Let z be a refinement of y (in the sense that knowing z determines y) which takes on k′ values, and suppose that the views x_1, x_2, x_3 are independent conditioned on z, as in Figure 2. Then we can
try to estimate the risk by defining L′(θ; x, z) = L(θ; x, y(z)), which satisfies Assumption 1. The problem is that the corresponding risk matrices M′_v will only have k distinct rows and hence have rank k < k′. To fix this, suppose that the loss vector h_v(x_v) = (f_v(x_v, j))_{j=1}^{k} can be extended to a vector h′_v(x_v) ∈ R^{k′}, such that (i) the first k coordinates of h′_v(x_v) are h_v(x_v) and (ii) the conditional risk matrix M′_v corresponding to h′_v has full rank. Then, Theorem 1 allows us to recover M′_v and hence also M_v (since it is a sub-matrix of M′_v) and thereby estimate the risk.

4 From Estimation to Learning

We now turn our attention to unsupervised learning, i.e., minimizing R(θ) over θ ∈ R^d. Unsupervised learning is impossible without some additional information, since even if we could learn the k classes, we wouldn't know which class had which label (this is the same as the class permutation issue from before). Thus we assume that we have a small amount of information to break this symmetry:

Assumption 2 (Seed Model). We have access to a "seed model" θ_0 such that R̃(θ_0) = R(θ_0).

Assumption 2 is very weak: it merely asks for θ_0 to be aligned with the true classes on average. We can obtain θ_0 from a small amount of labeled data (semi-supervised learning) or by training in a nearby domain (domain adaptation). We define gap(θ_0) to be the difference between R(θ_0) and the next smallest permutation of the classes, i.e., gap(θ_0) := min_{σ≠id} E[L(θ_0; x, σ(y)) − L(θ_0; x, y)], which will affect the difficulty of learning.

For simplicity we will focus on the case of logistic regression, and show how to learn given only Assumptions 1 and 2. Our algorithm extends to general losses, as we show in Section F.

Learning from moments.
Note that for logistic regression (Example 1), we have

R(θ) = E[A(θ; x) − θ^T Σ_{v=1}^{3} φ_v(x_v, y)] = E[A(θ; x)] − θ^T φ̄,  where φ̄ := Σ_{v=1}^{3} E[φ_v(x_v, y)].    (7)

From (7), we see that it suffices to estimate φ̄, after which all terms on the right-hand-side of (7) are known. Given an approximation φ̂ to φ̄ (we will show how to obtain φ̂ below), we can learn a near-optimal θ by solving the following convex optimization problem:

θ̂ = argmin_{‖θ‖₂ ≤ ρ} E[A(θ; x)] − θ^T φ̂.    (8)

In practice we would need to approximate E[A(θ; x)] by samples, but we ignore this for simplicity (it generally only contributes lower-order terms to the error). The reason for the ℓ₂-constraint on θ is that it imparts robustness to the error between φ̂ and φ̄. In particular (see Section D for a proof):

Lemma 1. Suppose ‖φ̂ − φ̄‖₂ ≤ ε. Then the output θ̂ from (8) satisfies R(θ̂) ≤ min_{‖θ‖₂ ≤ ρ} R(θ) + 2ερ.

If the optimal θ* has ℓ₂-norm at most ρ, Lemma 1 says that θ̂ nearly minimizes the risk: R(θ̂) ≤ R(θ*) + 2ερ. The problem of learning θ thus reduces to computing a good estimate φ̂ of φ̄.

Computing φ̂. Estimating φ̄ can be done in a manner similar to how we estimated R(θ) in Section 2. In addition to the conditional risk matrix M_v ∈ R^{k×k}, we compute the conditional moment matrix G_v ∈ R^{dk×k}, which tracks the conditional expectation of φ_v: (G_v)_{i+(r−1)k, j} := E[φ_v(x_v, i)_r | y = j], where r indexes 1, ..., d. We then have φ̄_r = Σ_{j=1}^{k} π_j Σ_{v=1}^{3} (G_v)_{j+(r−1)k, j}.

As in Theorem 1, we can solve for G_1, G_2, and G_3 using a tensor factorization similar to (4), though some care is needed to avoid explicitly forming the (kd) × (kd) × (kd) tensor that would result (since O(k³d³) memory is intractable for even moderate values of d). We take a standard approach based on random projections (Halko et al., 2011), described in Section 6.1.2 of Anandkumar et al. (2013). We refer the reader to the aforementioned references for details, and cite only the resulting sample complexity and runtime, which are both roughly d times larger than in Theorem 1.

Theorem 2. Suppose that Assumptions 1 and 2 hold. Let δ < 1 and ε < min(1, gap(θ_0)). Then, given m = poly(k, π_min^{-1}, λ^{-1}, τ) · log(2/δ)/ε² samples, where λ and τ are as defined in (3), with probability 1 − δ we can recover M_v and π to error ε, and G_v to error (B/τ)ε, where B² = E[Σ_{i,v} ‖φ_v(x_v, i)‖₂²] measures the ℓ₂-norm of the features. The algorithm runs in time O(d(m + poly(k))), and the errors are in Frobenius norm for M and G, and ∞-norm for π.

See Section E for a proof sketch. Whereas before we estimated the risk matrix M_v to error ε, now we estimate the gradient matrix G_v (and hence φ̄) to error (B/τ)ε.

Figure 3: A few sample train images (left) and test images (right) from the modified MNIST data set.

Figure 4: Results on the modified MNIST data set. (a) Risk estimation for varying degrees of distortion a. (b) Domain adaptation with 10,000 training and 10,000 test examples. (c) Domain adaptation with 300 training and 10,000 test examples.
To achieve error ε in estimating G_v requires (B/τ)² · poly(k, π_min^{-1}, λ^{-1}, τ) · log(2/δ)/ε² samples, which is (B/τ)² times as large as in Theorem 1. The quantity (B/τ)² typically grows as O(d), and so the sample complexity needed to estimate φ̄ is typically d times larger than the sample complexity needed to estimate R. This matches the behavior of the supervised case, where we need d times as many samples for learning as compared to (supervised) risk estimation of a fixed model.

Summary. We have shown how to perform unsupervised logistic regression, given only a seed model θ_0. This enables unsupervised learning under fairly weak assumptions (only the multi-view and seed model assumptions) even for mis-specified models and zero train-test overlap, and without assuming covariate shift. See Section F for learning under more general losses.

5 Experiments

To better understand the behavior of our algorithms, we perform experiments on a version of the MNIST data set that is modified to ensure that the 3-view assumption holds. To create an image I, we sample a class in {0, ..., 9}, then sample 3 images I_1, I_2, I_3 at random from that class, letting every third pixel in I come from the respective image I_v. This guarantees there are 3 conditionally independent views. To explore train-test variation, we dim pixel p in the image by exp(a (‖p − p_0‖₂ − 0.4)), where p_0 is the image center and distances are normalized to be at most 1. We show example images for a = 0 (train) and a = 5 (a possible test distribution) in Figure 3.

Risk estimation. We use Algorithm 1 to perform unsupervised risk estimation for a model trained on a = 0, testing on various values of a ∈ [0, 10]. We trained the model with AdaGrad (Duchi et al., 2010) on 10,000 training examples, and used 10,000 test examples to estimate the risk.
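For concreteness, the data construction described above can be sketched as follows (a hypothetical helper, not our actual experiment code; the dimming factor follows the formula in the text, and the function name and defaults are illustrative):

```python
import numpy as np

def three_view_image(imgs_of_class, a=0.0, side=28):
    """Sketch of the modified-MNIST construction: interleave three images
    of the same class (every third pixel taken from each view), then dim
    pixel p by exp(a * (||p - p0||_2 - 0.4)), with distances from the
    center p0 normalized to be at most 1."""
    I1, I2, I3 = (im.reshape(-1) for im in imgs_of_class)
    I = np.empty(side * side)
    I[0::3], I[1::3], I[2::3] = I1[0::3], I2[1::3], I3[2::3]
    # dimming mask over pixel coordinates
    xs, ys = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
    p0 = (side - 1) / 2
    dist = np.sqrt((xs - p0) ** 2 + (ys - p0) ** 2).reshape(-1)
    dist /= dist.max()
    return I * np.exp(a * (dist - 0.4))
```

With a = 0 the dimming mask is identically 1 and the output is just the pixel-interleaved composite, which is the training distribution above.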
To solve for π and M in (4), we first use the tensor power method implemented by Chaganty and Liang (2013) to initialize, and then locally minimize a weighted ℓ₂-norm of the moment errors in (4) using L-BFGS. We compared with two other methods: (i) the validation error from held-out samples (which would be valid if train = test), and (ii) the predictive entropy Σ_j −p_θ(j | x) log p_θ(j | x) on the test set (which would be valid if the predictions were well-calibrated). The results are shown in Figure 4a; both the tensor method in isolation and tensor + L-BFGS estimate the risk accurately, with the latter performing slightly better.

Unsupervised domain adaptation. We next evaluate our learning algorithm in an unsupervised domain adaptation setting, where we receive labeled training data at a = 0 and unlabeled test data at a different value of a. We use the training data to obtain a seed model θ_0, and then perform unsupervised learning (Section 4), setting ρ = 10 in (8). The results are shown in Figure 4b. For small values of a, our algorithm performs worse than the baseline of directly using θ_0, likely due to finite-sample effects. However, our algorithm is far more robust as a increases, and tracks the performance of an oracle that was trained on the same distribution as the test examples.

Because we only need to provide our algorithm with a seed model for disentangling the classes, we do not need much data when training θ_0. To verify this, we tried obtaining θ_0 from only 300 labeled examples.
Tensor decomposition sometimes led to bad initializations in this limited-data regime, in which case we obtained a different θ_0 by training with a smaller step size. The results are shown in Figure 4c. Our algorithm generally performs well, but has higher variability than before, seemingly due to the higher condition number of the matrices M_v.

Summary. Our experiments show that given 3 views, we can estimate the risk and perform unsupervised domain adaptation, even with limited labeled data from the source domain.

6 Discussion

We have presented a method for estimating the risk from unlabeled data, which relies only on conditional independence structure and hence makes no parametric assumptions about the true distribution. Our approach applies to a large family of losses and extends beyond classification tasks to conditional random fields. We can also perform unsupervised learning given only a seed model that can distinguish between classes in expectation; the seed model can be trained on a related domain, on a small amount of labeled data, or any combination of the two, and thus provides a pleasingly general formulation highlighting the similarities between domain adaptation and semi-supervised learning.

Previous approaches to domain adaptation and semi-supervised learning have also exploited multi-view structure. Given two views, Blitzer et al. (2011) perform domain adaptation with zero source/target overlap (covariate shift is still assumed). Two-view approaches (e.g., co-training and CCA) are also used in semi-supervised learning (Blum and Mitchell, 1998; Ando and Zhang, 2007; Kakade and Foster, 2007; Balcan and Blum, 2010). These methods all assume some form of low noise or low regret, as do, e.g., transductive SVMs (Joachims, 1999). 
By focusing on the central\nproblem of risk estimation, our work connects multi-view learning approaches for domain adaptation\nand semi-supervised learning, and removes covariate shift and low-noise/low-regret assumptions\n(though we make stronger independence assumptions, and specialize to discrete prediction tasks).\nIn addition to reliability and unsupervised learning, our work is motivated by the desire to build\nmachine learning systems with contracts, a challenge recently posed by Bottou (2015); the goal is for\nmachine learning systems to satisfy a well-de\ufb01ned input-output contract in analogy with software\nsystems (Sculley et al., 2015). Theorem 1 provides the contract that under the 3-view assumption the\ntest error is close to our estimate of the test error; the typical (weak) contract of ML systems is that if\ntrain and test are similar, then the test error is close to the training error. One other interesting contract\nis to provide prediction regions that contain the truth with probability 1 \u2212 \u0001 (Shafer and Vovk, 2008;\nKhani et al., 2016), which includes abstaining when uncertain as a special case (Li et al., 2011).\nThe most restrictive part of our framework is the three-view assumption, which is inappropriate if the\nviews are not completely independent or if the data have structure that is not captured in terms of\nmultiple views. Since Balasubramanian et al. (2011) obtain results under Gaussianity (which would\nbe implied by many somewhat dependent views), we are optimistic that unsupervised risk estimation\nis possible for a wider family of structures. Along these lines, we end with the following questions:\nOpen question. In the 3-view setting, suppose the views are not completely independent. Is it still\npossible to estimate the risk? How does the degree of dependence affect the number of views needed?\nOpen question. 
Given only two independent views, can one obtain an upper bound on the risk R(θ)?

The results of this paper have caused us to adopt the following perspective: To leverage unlabeled data, we should make generative structural assumptions, but still optimize discriminative model performance. This hybrid approach allows us to satisfy the traditional machine learning goal of predictive accuracy, while handling lack of supervision and under-specification in a principled way. Perhaps, then, what is truly needed for learning is understanding the structure of a domain.

Acknowledgments. This research was supported by a Fannie & John Hertz Foundation Fellowship, an NSF Graduate Research Fellowship, and a Future of Life Institute grant.

References

A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012.
A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv, 2013.
T. W. Anderson and H. Rubin. Estimation of the parameters of a single equation in a complete system of stochastic equations. The Annals of Mathematical Statistics, pages 46–63, 1949.
T. W. Anderson and H. Rubin. The asymptotic properties of estimates of the parameters of a single equation in a complete system of stochastic equations. The Annals of Mathematical Statistics, pages 570–582, 1950.
R. K. Ando and T. Zhang. Two-view feature generation model for semi-supervised learning. In COLT, 2007.
K. Balasubramanian, P. Donmez, and G. Lebanon. Unsupervised supervised learning II: Margin-based classification without labels. JMLR, 12:3119–3145, 2011.
M. Balcan and A. Blum. A discriminative model for semi-supervised learning. JACM, 57(3), 2010.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, pages 137–144, 2006.
J. Blitzer, S. 
Kakade, and D. P. Foster. Domain adaptation with coupled subspaces. In AISTATS, 2011.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
L. Bottou. Two high stakes challenges in machine learning. Invited talk at ICML, 2015.
A. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. In ICML, 2013.
A. Chaganty and P. Liang. Estimating latent-variable graphical models using moments and likelihoods. In ICML, 2014.
P. Comon, X. Luciani, and A. L. D. Almeida. Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics, 23(7):393–405, 2009.
F. Cozman and I. Cohen. Risks of semi-supervised learning: How unlabeled data can degrade performance of generative classifiers. In Semi-Supervised Learning. 2006.
A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 1:20–28, 1979.
P. Donmez, G. Lebanon, and K. Balasubramanian. Unsupervised supervised learning I: Estimating classification and regression errors without labels. JMLR, 11:1323–1351, 2010.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010.
J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. JACM, 19(2):248–264, 1972.
E. Fetaya, B. Nadler, A. Jaffe, Y. Kluger, and T. Jiang. Unsupervised ensemble learning with dependent classifiers. In AISTATS, pages 351–360, 2016.
N. Halko, P.-G. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53:217–288, 2011.
L. P. Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 1982.
L. P. Hansen. Uncertainty outside and inside economic models. 
Journal of Political Economy, 122(5), 2014.
D. Hsu, S. M. Kakade, and P. Liang. Identifiability and unmixing of latent parse trees. In NIPS, 2012.
A. Jaffe, B. Nadler, and Y. Kluger. Estimating the accuracies of multiple classifiers without labeled data. In AISTATS, pages 407–415, 2015.
T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.
F. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In ICML, 2016.
S. M. Kakade and D. P. Foster. Multi-view regression via canonical correlation analysis. In COLT, 2007.
F. Khani, M. Rinard, and P. Liang. Unanimous prediction for 100% precision with application to learning semantic mappings. In ACL, 2016.
V. Kuleshov, A. Chaganty, and P. Liang. Tensor factorization via matrix factorization. In AISTATS, 2015.
L. D. Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM Journal of Matrix Analysis and Applications, 28(3):642–666, 2006.
L. Li, M. L. Littman, T. J. Walsh, and A. L. Strehl. Knows what it knows: A framework for self-aware learning. Machine Learning, 82(3):399–443, 2011.
Y. Li and Z. Zhou. Towards making unlabeled data never hurt. IEEE TPAMI, 37(1):175–188, 2015.
P. Liang and D. Klein. Analyzing the errors of unsupervised learning. In HLT/ACL, 2008.
Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.
B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20:155–171, 1994.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. In Association for the Advancement of Artificial Intelligence (AAAI), 1998.
E. A. Platanios. Estimating accuracy from unlabeled data. Master's thesis, Carnegie Mellon University, 2015.
J. L. Powell. Estimation of semiparametric models. In Handbook of Econometrics, volume 4. 1994.
J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
J. D. Sargan. The estimation of economic relationships using instrumental variables. Econometrica, 1958.
J. D. Sargan. The estimation of relationships with autocorrelated residuals by the use of instrumental variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 91–105, 1959.
D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J. Crespo, and D. Dennison. Hidden technical debt in machine learning systems. In NIPS, pages 2494–2502, 2015.
G. Shafer and V. Vovk. A tutorial on conformal prediction. JMLR, 9:371–421, 2008.
H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244, 2000.
J. Steinhardt, G. Valiant, and S. Wager. Memory, communication, and statistical queries. In COLT, 2016.
N. Tomizawa. On some techniques useful for solution of transportation network problems. Networks, 1971.
Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. arXiv, 2014.