{"title": "No Unbiased Estimator of the Variance of K-Fold Cross-Validation", "book": "Advances in Neural Information Processing Systems", "page_first": 513, "page_last": 520, "abstract": "", "full_text": "No Unbiased Estimator of the Variance of\n\nK-Fold Cross-Validation\n\nYoshua Bengio and Yves Grandvalet\n\nDept. IRO, Universit\u00b4e de Montr\u00b4eal\n\nC.P. 6128, Montreal, Qc, H3C 3J7, Canada\n\nfbengioy,grandvayg@iro.umontreal.ca\n\nAbstract\n\nMost machine learning researchers perform quantitative experiments to\nestimate generalization error and compare algorithm performances. In\norder to draw statistically convincing conclusions, it is important to esti-\nmate the uncertainty of such estimates. This paper studies the estimation\nof uncertainty around the K-fold cross-validation estimator. The main\ntheorem shows that there exists no universal unbiased estimator of the\nvariance of K-fold cross-validation. An analysis based on the eigende-\ncomposition of the covariance matrix of errors helps to better understand\nthe nature of the problem and shows that naive estimators may grossly\nunderestimate variance, as con\u00a3rmed by numerical experiments.\n\n1\n\nIntroduction\n\nThe standard measure of accuracy for trained models is the prediction error (PE), i.e. the\nexpected loss on future examples. Learning algorithms themselves are often compared on\ntheir average performance, which estimates expected value of prediction error (EPE) over\ntraining sets. If the amount of data is large enough, PE can be estimated by the mean\nerror over a hold-out test set. The hold-out technique does not account for the variance\nwith respect to the training set, and may thus be considered inappropriate for the purpose\nof algorithm comparison [4]. Moreover, it makes an inef\u00a3cient use of data which forbids\nits application to small sample sizes. 
In this situation, one resorts to computer intensive resampling methods such as cross-validation or bootstrap to estimate PE or EPE. We focus here on K-fold cross-validation. While it is known that cross-validation provides an unbiased estimate of EPE, it is also known that its variance may be very large [2]. This variance should be estimated to provide faithful confidence intervals on PE or EPE, and to test the significance of observed differences between algorithms. This paper provides theoretical arguments showing the difficulty of this estimation.

The difficulties of variance estimation have already been addressed [4, 7, 8]. Some distribution-free bounds on the deviations of cross-validation are available, but they are specific to locally defined classifiers, such as nearest neighbors [3]. This paper builds upon the work of Nadeau and Bengio [8], which investigated in detail the theoretical and practical merits of several estimators of the variance of cross-validation. Our analysis departs from this work in the sampling procedure defining the cross-validation estimate. While [8] considers K independent training and test splits, we focus on the standard K-fold cross-validation procedure, with no overlap between test sets: each example is used once and only once as a test example.

2 General Framework

Formally, we have a training set D = {z1, ..., zn}, with zi ∈ Z, assumed independently sampled from an unknown distribution P. We also have a learning algorithm A : Z* → F, which maps a data set to a function. Here we consider symmetric algorithms, i.e. A is insensitive to the ordering of examples in the training set D. The discrepancy between the prediction and the observation z is measured by a loss functional L : F × Z →
R. For example, one may take in regression L(f, (x, y)) = (f(x) − y)², and in classification L(f, (x, y)) = 1_{f(x)≠y}.

Let f = A(D) be the function returned by algorithm A on the training set D. In application-based evaluation, the goal of learning is usually stated as the minimization of the expected loss of f = A(D) on future test examples:

PE(D) = E[L(f, z)] ,   (1)

where the expectation is taken with respect to z ∼ P. To evaluate and compare learning algorithms [4] we care about the expected performance of learning algorithm A over different training sets:

EPE(n) = E[L(A(D), z)] ,   (2)

where the expectation is taken with respect to (D, z) independently sampled from P^n × P.

When P is unknown, PE and EPE have to be estimated, and it is crucial to assess the uncertainty attached to this estimation. Although this point is often overlooked, estimating the variance of the estimates P̂E and ÊPE requires caution, as illustrated here.

2.1 Hold-out estimates of performance

The mean error over a hold-out test set estimates PE, and the variance of P̂E is given by the usual variance estimate for means of independent variables. However, this variance estimator is not suited to ÊPE: the test errors are correlated when the training set is considered as a random variable.

Figure 1 illustrates how crucial it is to take these correlations into account. The average ratio (estimator of variance/empirical variance) is displayed for two variance estimators, in an ideal situation where 10 independent training and test sets are available. The average of θ̂1/θ, the naive variance estimator ignoring correlations, shows that this estimate is highly down-biased, even for large sample sizes.

Figure 1: Average ratio (estimator of variance/empirical variance) on 100 000 experiments: θ̂1/θ (ignoring correlations, lower curve) and θ̂2/θ (taking into account correlations, upper curve) vs. sample size n. The error bars represent ±2 standard errors on the average value.

Experiment 1: Ideal hold-out estimate of EPE.
We have K = 10 independent training sets D1, ..., DK of n independent examples zi = (xi, yi), where xi = (xi1, ..., xid)' is a d-dimensional centered, unit covariance Gaussian variable (d = 30), and yi = sqrt(3/d) Σ_{k=1}^{d} xik + εi, with the εi being independent, centered, unit variance Gaussian variables (the sqrt(3/d) factor provides R² ≈ 3/4). We also have K independent test sets T1, ..., TK of size n sampled from the same distribution. The learning algorithm consists in fitting a line by ordinary least squares, and the estimate of EPE is the average quadratic loss on test examples, ÊPE = L̄ = (1/K) Σ_{k=1}^{K} (1/n) Σ_{zi∈Tk} Lki, where Lki = L(A(Dk), zi).

The first estimate of the variance of ÊPE is θ̂1 = 1/(Kn(Kn−1)) Σ_{k=1}^{K} Σ_i (Lki − L̄)², which is unbiased provided there is no correlation between test errors. The second estimate is θ̂2 = 1/(K(K−1)n²) Σ_{k=1}^{K} Σ_{i,j} (Lki − L̄)(Lkj − L̄), which estimates correlations.

Note that Figure 1 suggests that the naive estimator of variance θ̂1 asymptotically converges to the true variance. This can be shown by taking advantage of the results in this paper, as long as the learning algorithm converges (PE(D) converges a.s. to lim_{n→∞} EPE(n)), i.e. provided that the only source of variability of ÊPE is due to the finite test size.

2.2 K-fold cross-validation estimates of performance

In K-fold cross-validation [9], the data set D is first chunked into K disjoint subsets (or blocks) of the same size m = n/K (to simplify the analysis below we assume that n is a multiple of K).
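This chunking step can be sketched in a few lines (a minimal illustration using NumPy; the helper name kfold_blocks is ours, not the paper's):

```python
import numpy as np

def kfold_blocks(n, K):
    """Chunk indices 0..n-1 into K disjoint test blocks T_1, ..., T_K
    of equal size m = n/K (the analysis assumes n is a multiple of K)."""
    assert n % K == 0, "n must be a multiple of K"
    m = n // K
    idx = np.arange(n)
    return [idx[k * m:(k + 1) * m] for k in range(K)]

blocks = kfold_blocks(n=120, K=10)
# each example is used once and only once as a test example
assert sorted(np.concatenate(blocks).tolist()) == list(range(120))
```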
Let us write Tk for the k-th such block, and Dk for the training set obtained by removing the elements of Tk from D. The estimator is

CV = (1/K) Σ_{k=1}^{K} (1/m) Σ_{zi∈Tk} L(A(Dk), zi) .   (3)

Under stability assumptions on A, CV estimates PE(D) at least as accurately as the training error [6]. However, as CV is an average of unbiased estimates of PE(D1), ..., PE(DK), a more general statement is that CV is an unbiased estimate of EPE(n−m).

Note that the forthcoming analysis also applies to the version of cross-validation dedicated to comparing algorithms, using matched pairs,

ΔCV = (1/K) Σ_{k=1}^{K} (1/m) Σ_{zi∈Tk} [L(A1(Dk), zi) − L(A2(Dk), zi)] ,

and to the delete-m jackknife estimate of PE(D) debiasing the training error (see e.g. [5]):

JK = (1/n) Σ_{i=1}^{n} L(A(D), zi) − (K−1) [ 1/(K(n−m)) Σ_{k=1}^{K} Σ_{zi∈Dk} L(A(Dk), zi) − (1/n) Σ_{i=1}^{n} L(A(D), zi) ] .

In what follows, CV, ΔCV and JK will generically be denoted by μ̂:

μ̂ = (1/n) Σ_{i=1}^{n} ei = (1/K) Σ_{k=1}^{K} (1/m) Σ_{i∈Tk} ei ,

where, slightly abusing notation, i ∈ Tk means zi ∈ Tk, and for all i ∈ Tk:

ei = L(A(Dk), zi)                                   for μ̂ = CV ;
ei = L(A1(Dk), zi) − L(A2(Dk), zi)                  for μ̂ = ΔCV ;
ei = K L(A(D), zi) − Σ_{ℓ≠k} L(A(Dℓ), zi)           for μ̂ = JK .

Note that μ̂ is the average of identically distributed (dependent) variables.
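As a concrete sketch, the error vector e and the estimator μ̂ = CV of equation (3) can be computed as follows for the least-squares setup of the paper's experiments (squared loss is assumed; the function name is ours):

```python
import numpy as np

def cv_error_vector(X, y, K):
    """e_i = L(A(D_k), z_i) for i in T_k (eq. 3), with A = ordinary
    least squares and L = squared loss, as in the paper's experiments."""
    n = len(y)
    m = n // K  # the analysis assumes n is a multiple of K
    e = np.empty(n)
    for k in range(K):
        test = np.arange(k * m, (k + 1) * m)          # T_k
        train = np.delete(np.arange(n), test)         # D_k = D \ T_k
        # fit A(D_k) on the k-th training set
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        e[test] = (X[test] @ w - y[test]) ** 2
    return e

rng = np.random.default_rng(0)
n, d, K = 120, 30, 10
X = rng.standard_normal((n, d))
y = np.sqrt(3 / d) * X.sum(axis=1) + rng.standard_normal(n)
e = cv_error_vector(X, y, K)
mu_hat = e.mean()  # CV = (1/n) * sum_i e_i
```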
Thus, it asymptotically converges to a normally distributed variable, which is completely characterized by its expectation E[μ̂] and its variance Var[μ̂].

3 Structure of the Covariance Matrix

The variance of μ̂ is θ = (1/n²) Σ_{i,j} Cov(ei, ej). By using symmetry over permutations of the examples in D, we show that the covariance matrix has a simple block structure.

Lemma 1 Using the notation introduced in Section 2.2, 1) all ei are identically distributed; 2) all pairs (ei, ej) belonging to the same test block are jointly identically distributed; 3) all pairs (ei, ej) belonging to different test blocks are jointly identically distributed.

Proof: derived immediately from the permutation-invariance of P(D) and the symmetry of A. See [1] for details and for the proofs not shown here for lack of space.

Corollary 1 The covariance matrix Σ of cross-validation errors e = (e1, ..., en)' has the simple block structure depicted in Figure 2: 1) all diagonal elements are identical, ∀i, Cov(ei, ei) = Var[ei] = σ²; 2) all the off-diagonal entries of the K m × m diagonal blocks are identical, ∀(i, j) ∈ Tk², j ≠ i: Cov(ei, ej) = ω; 3) all the remaining entries are identical, ∀i ∈ Tk, ∀j ∈ Tℓ, ℓ ≠ k: Cov(ei, ej) = γ.

Figure 2: Structure of the covariance matrix (n × n, with K diagonal blocks of size m × m).

Corollary 2 The variance of the cross-validation estimator is a linear combination of three moments:

θ = (1/n²) Σ_{i,j} Cov(ei, ej) = (1/n) σ² + ((m−1)/n) ω + ((n−m)/n) γ .   (4)

Hence, the problem of estimating θ does not involve estimating n(n+1)/2 covariances, but it cannot be reduced to that of estimating a single variance parameter. Three components intervene, which may be interpreted as follows when μ̂ is the K-fold cross-validation estimate of EPE:

1.
the variance σ² is the average (taken over training sets) variance of errors for “true” test examples (i.e. sampled independently from the training sets) when algorithm A is fed with training sets of size m(K−1);

2. the within-block covariance ω would also apply to these “true” test examples; it arises from the dependence of test errors stemming from the common training set;

3. the between-blocks covariance γ is due to the dependence of training sets (which share n(K−2)/K examples) and to the fact that test block Tk appears in all the training sets Dℓ for ℓ ≠ k.

4 No Unbiased Estimator of Var[μ̂] Exists

Consider a generic estimator θ̂ that depends on the sequence of cross-validation errors e = (e1, e2, ..., en)'. Assuming θ̂ is analytic in e, consider its Taylor expansion:

θ̂ = α0 + Σ_i α1(i) ei + Σ_{i,j} α2(i,j) ei ej + Σ_{i,j,k} α3(i,j,k) ei ej ek + ...   (5)

We first show that for unbiased variance estimates (i.e. E[θ̂] = Var[μ̂]), all the α coefficients must vanish except for the second order coefficients α2(i,j).

Lemma 2 There is no universal unbiased estimator of Var[μ̂] that involves the ei in a non-quadratic way.

Proof: take the expected value of θ̂ expressed as in (5), and equate it with Var[μ̂] (4).

Since estimators that include moments other than the second moments in their expectation are biased, we now focus on estimators that are quadratic forms of the errors, i.e.

θ̂ = e'We = Σ_{i,j} Wij ei ej .   (6)

Lemma 3 The expectation of quadratic estimators θ̂ defined as in (6) is a linear combination of only three terms,

E[θ̂] = a(σ² + μ²) + b(ω
+ μ²) + c(γ + μ²) ,   (7)

where (a, b, c) are defined as follows:

a := Σ_{i=1}^{n} Wii ;
b := Σ_{k=1}^{K} Σ_{i∈Tk} Σ_{j∈Tk, j≠i} Wij ;
c := Σ_{k=1}^{K} Σ_{ℓ≠k} Σ_{i∈Tk} Σ_{j∈Tℓ} Wij .

A “trivial” representer of estimators with this expected value is

θ̂ = a s1 + b s2 + c s3 ,   (8)

where (s1, s2, s3) are the only quadratic statistics of e that are invariant to the within-block and between-block permutations described in Lemma 1:

s1 := (1/n) Σ_{i=1}^{n} ei² ;
s2 := 1/(n(m−1)) Σ_{k=1}^{K} Σ_{i∈Tk} Σ_{j∈Tk, j≠i} ei ej ;
s3 := 1/(n(n−m)) Σ_{k=1}^{K} Σ_{ℓ≠k} Σ_{i∈Tk} Σ_{j∈Tℓ} ei ej .

Proof: in (6), group the terms that have the same expected values (from Corollary 1).

Theorem 1 There exists no universally unbiased estimator of Var[μ̂].

Proof: thanks to Lemmas 2 and 3, it is enough to show that E[θ̂] = Var[μ̂] has no solution for quadratic estimators:

E[θ̂] = Var[μ̂] ⟺ a(σ² + μ²) + b(ω + μ²) + c(γ + μ²) = (1/n) σ² + ((m−1)/n) ω + ((n−m)/n) γ .   (9)

Finding (a, b, c) satisfying this equality for all admissible values of (μ, σ², ω, γ) is impossible, since it is equivalent to solving the following overdetermined system:

a = 1/n ;
b = (m−1)/n ;
c = (n−m)/n ;
a + b + c = 0 .   (10)

Indeed, the first three equations imply a + b + c = 1, contradicting the fourth. Q.E.D.

5 Eigenanalysis of the covariance matrix

One way to gain insight on the origin of the negative statement of Theorem 1 is via the eigenanalysis of Σ, the covariance matrix of e. This decomposition can be performed analytically thanks to the very particular block structure displayed in Figure 2.

Lemma 4 Let vk be the binary vector indicating the membership of each example to test block k. The eigenvalues of Σ are as follows:

• λ1 = σ² − ω
with multiplicity n − K and eigenspace orthogonal to {vk}_{k=1..K};

• λ2 = σ² + (m−1) ω − m γ, with multiplicity K − 1 and eigenspace defined in the orthogonal of 1 by the basis {vk}_{k=1..K};

• λ3 = σ² + (m−1) ω + (n−m) γ, with eigenvector 1.

Lemma 4 states that the vector e can be decomposed into three uncorrelated parts: n − K projections on the subspace orthogonal to {vk}_{k=1..K}, K − 1 projections on the subspace spanned by {vk}_{k=1..K} in the orthogonal of 1, and one projection on 1.

A single vector example with n independent elements can be seen as n independent examples. Similarly, the uncorrelated projections of e can be equivalently represented by respectively n − K, K − 1 and one uncorrelated one-dimensional examples.

In particular, for the projection on 1, with a single example, the sample variance is null, resulting in the absence of an unbiased variance estimator of λ3. The projection of e on the eigenvector (1/n) 1 is precisely μ̂. Hence there is no unbiased estimate of Var[μ̂] = λ3/n when we have only one realization of the vector e. For the same reason, even with simple parametric assumptions on e (such as e Gaussian), the maximum likelihood estimate of θ is not defined. Only λ1 and λ2 can be estimated unbiasedly. Note that this problem cannot be addressed by performing multiple K-fold splits of the data set: such a procedure would not provide independent realizations of e.

6 Possible values for ω and γ

Theorem 1 states that no estimator is unbiased, and its demonstration shows that the bias of any quadratic estimator is a linear combination of μ², σ², ω and γ.
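The eigenstructure of Lemma 4, and the identity Var[μ̂] = λ3/n (consistent with equation (4)), can be checked numerically by building a small covariance matrix with the block structure of Figure 2; this is a sketch with arbitrary admissible values of (σ², ω, γ), not the paper's code:

```python
import numpy as np

def block_cov(n, K, sigma2, omega, gamma):
    """Covariance matrix of Figure 2: sigma2 on the diagonal, omega inside
    each m x m diagonal block, gamma everywhere else."""
    m = n // K
    S = np.full((n, n), gamma)
    for k in range(K):
        S[k*m:(k+1)*m, k*m:(k+1)*m] = omega
    np.fill_diagonal(S, sigma2)
    return S

n, K, s2, w, g = 20, 4, 1.0, 0.3, 0.1
m = n // K
S = block_cov(n, K, s2, w, g)
lam = np.linalg.eigvalsh(S)                 # ascending eigenvalues
l1 = s2 - w                                 # multiplicity n - K
l2 = s2 + (m - 1) * w - m * g               # multiplicity K - 1
l3 = s2 + (m - 1) * w + (n - m) * g         # eigenvector 1
expected = np.sort([l1] * (n - K) + [l2] * (K - 1) + [l3])
assert np.allclose(lam, expected)
# Var[mu_hat] = (1/n^2) 1' S 1 = l3 / n
one = np.ones(n) / n
assert np.isclose(one @ S @ one, l3 / n)
```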
Regarding estimation, it is thus interesting to see what constraints restrict the possible range of ω and γ.

Lemma 5 For μ̂ = CV and μ̂ = ΔCV, the following inequalities hold:

0 ≤ ω ≤ σ² ,
−(1/(n−m)) (σ² + (m−1) ω) ≤ γ ≤ (1/m) (σ² + (m−1) ω) ,

which implies

0 ≤ ω ≤ σ² ,
−(m/(n−m)) σ² ≤ γ ≤ σ² .

The admissible (ω, γ) region is very large, and there is no constraint linking μ to σ². Hence, we cannot propose a variance estimate with universally small bias.

7 Experiments

The bias of any quadratic estimator is a linear combination of μ², σ², ω and γ. The admissible values provided earlier suggest that ω and γ cannot be proved to be negligible compared to σ². This section illustrates that in practice, the contribution to the variance of μ̂ due to ω and γ (see Equation (4)) can be of the same order as the one due to σ². This confirms that estimators of θ should indeed take into account the correlations of the ei.

Experiment 2: True variance of K-fold cross-validation.
We repeat the experimental setup of Experiment 1, except that only one sample of size n is available. Since cross-validation is known to be sensitive to the instability of algorithms, in addition to this standard setup, we also consider another one with outliers: the input xi = (xi1, ..., xid)' is still 30-dimensional, but it is now a mixture of two centered Gaussians: let ti be a binary variable with P(ti = 1) = p = 0.95; ti = 1 ⇒ xi ∼ N(0, I), ti = 0 ⇒ xi ∼ N(0, 100 I); yi = sqrt(3/(d(p + 100(1−p)))) Σ_{k=1}^{d} xik + εi; ti = 1 ⇒ εi ∼ N(0, 1/(p + 100(1−p))), ti = 0 ⇒ εi ∼ N(0, 100/(p + 100(1−p))).

We now look at the variance of K-fold cross-validation (K = 10), and decompose it into the three orthogonal components σ², ω and γ. The results are shown in Figure 3.

Figure 3: Contributions of (σ², ω, γ) to the total variance Var[CV] vs. n − m (two panels: no outliers, and outliers).

Without outliers, the contribution of γ is very important for small sample sizes. For large sample sizes, the overall variance is considerably reduced and is mainly caused by σ², because the learning algorithm returns very similar answers for all training sets. When there are outliers, the contribution of γ is of the same order as the one of σ², even when the ratio of examples to free parameters is large (here up to 20). Thus, in difficult situations, where A(D) varies according to the realization of D, neglecting the effect of ω and γ can be expected to introduce a bias of the order of the true variance.

It is also interesting to see how these quantities are affected by the number of folds K. The decomposition of θ into σ², ω and γ (4) does not imply that K should be set either to n or to 2 (according to the sign of ω − γ) in order to minimize the variance of μ̂. Modifying K affects σ², ω and γ through the size and overlaps of the training sets D1, ..., DK, as illustrated in Figure 4. For a fixed sample size, the variance of μ̂ and the contribution of σ², ω
and γ vary smoothly with K (of course, the mean of μ̂ is also affected in the process).

Figure 4: Contributions of (σ², ω, γ) to the total variance Var[CV] vs. K, for n = 120 (two panels: no outliers, and outliers).

8 Discussion

The analysis presented in this paper for K-fold cross-validation can be instantiated to several interesting cases. First, when having K independent training and test sets (K = 1 is the realistic case), the structure of hold-out errors resembles the one of cross-validation errors, with γ = 0. Knowing that allows one to build the unbiased estimate θ̂2 given in Section 2.1: knowing that γ = 0 removes the third equation of system (10) in the proof of Theorem 1.

Two-fold cross-validation has been advocated to perform hypothesis testing [4]. It is a special case of K-fold cross-validation where the training blocks are mutually independent, since they do not overlap. However, this independence does not modify the structure of e, in the sense that γ is not null. The between-block correlation stems from the fact that the training block D1 is the test block T2 and vice versa.

Finally, leave-one-out cross-validation is another particular case, with K = n. The structure of the covariance matrix is simplified, without diagonal blocks. The estimation difficulties however remain: even in this particular case, there is no unbiased estimate of variance.
From the definition of b in Lemma 3, we have b = 0, and with m = 1 the linear system (10) still admits no solution.

To summarize, it is known that K-fold cross-validation may suffer from high variability, which can be responsible for bad choices in model selection and erratic behavior in the estimated expected prediction error [2, 4, 8]. This paper demonstrates that estimating the variance of K-fold cross-validation is difficult. Not only is there no unbiased estimate of this variance, but we have no theoretical result showing that this bias should be negligible in the non-asymptotic regime. The eigenanalysis of the covariance matrix of errors traces the problem back to the dependencies between test-block errors, which induce the absence of redundant pieces of information regarding the average test error, i.e. the K-fold cross-validation estimate. It is clear that this absence of redundancy is bound to cause difficulties in the estimation of variance.

Our experiments show that the bias incurred by ignoring test error dependencies can be of the order of the variance itself, even for large sample sizes. Thus, the assessment of the significance of observed differences in cross-validation scores should be treated with much caution. The next step of this study consists in building and comparing variance estimators dedicated to the very specific structure of the test-block error dependencies.

References

[1] Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of K-fold cross-validation. Journal of Machine Learning Research, 2003.

[2] L. Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383, 1996.

[3] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[4] T. G. Dietterich.
Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1924, 1998.

[5] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap, volume 57 of Monographs on Statistics and Applied Probability. Chapman & Hall, 1993.

[6] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.

[7] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1137–1143, 1995.

[8] C. Nadeau and Y. Bengio. Inference for the generalization error. Machine Learning, 52(3):239–281, 2003.

[9] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, B, 36(1):111–147, 1974.
", "award": [], "sourceid": 2468, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Yves", "family_name": "Grandvalet", "institution": null}]}