{"title": "Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 6955, "page_last": 6964, "abstract": "Uncertainty sampling, a popular active learning algorithm, is used to reduce the amount of data required to learn a classifier, but it has been observed in practice to converge to different parameters depending on the initialization and sometimes to even better parameters than standard training on all the data. In this work, we give a theoretical explanation of this phenomenon, showing that uncertainty sampling on a convex (e.g., logistic) loss can be interpreted as performing a preconditioned stochastic gradient step on the population zero-one loss. Experiments on synthetic and real datasets support this connection.", "full_text": "Uncertainty Sampling is Preconditioned Stochastic\n\nGradient Descent on Zero-One Loss\n\nStephen Mussmann\n\nDepartment of Computer Science\n\nStanford University\n\nStanford, CA\n\nmussmann@stanford.edu\n\nPercy Liang\n\nDepartment of Computer Science\n\nStanford University\n\nStanford, CA\n\npliang@cs.stanford.edu\n\nAbstract\n\nUncertainty sampling, a popular active learning algorithm, is used to reduce the\namount of data required to learn a classi\ufb01er, but it has been observed in practice to\nconverge to different parameters depending on the initialization and sometimes to\neven better parameters than standard training on all the data. In this work, we give\na theoretical explanation of this phenomenon, showing that uncertainty sampling\non a convex (e.g., logistic) loss can be interpreted as performing a preconditioned\nstochastic gradient step on the population zero-one loss. 
Experiments on synthetic\nand real datasets support this connection.\n\n1\n\nIntroduction\n\nActive learning algorithms aim to learn parameters with less data by querying labels adaptively.\nHowever, since such algorithms change the sampling distribution, they can introduce bias in the\nlearned parameters. While there has been some work to understand this (Sch\u00fctze et al., 2006; Bach,\n2007; Dasgupta and Hsu, 2008; Beygelzimer et al., 2009), the most common algorithm, \u201cuncertainty\nsampling\u201d (Lewis and Gale, 1994; Settles, 2010), remains elusive. One of the oddities of uncertainty\nsampling is that sometimes the bias is helpful: uncertainty sampling with a subset of the data can\nyield lower error than random sampling on all the data (Schohn and Cohn, 2000; Bordes et al., 2005;\nChang et al., 2017). But sometimes, uncertainty sampling can vastly underperform, and in general,\ndifferent initializations can yield different parameters asymptotically. Despite the wealth of theory\non active learning (Balcan et al., 2006; Hanneke et al., 2014), a theoretical account of uncertainty\nsampling is lacking.\nIn this paper, we characterize the dynamics of a variant of uncertainty sampling to explain the bias\nintroduced. For convex models, we show that uncertainty sampling with respect to a convex loss on all\nthe points is performing a preconditioned 1 stochastic gradient step on the (non-convex) population\nzero-one loss. Furthermore, each uncertainty sampling iterate in expectation moves in a descent\ndirection of the zero-one loss, unless the parameters are at an approximate stationary point. This\nexplains why uncertainty sampling sometimes achieves lower zero-one loss than random sampling,\nsince that is the quantity it implicitly optimizes. 
At the same time, as the zero-one loss is non-convex, we can get stuck in a local minimum with higher zero-one loss (see Figure 1).
Empirically, we validate the properties of uncertainty sampling on a simple synthetic dataset for intuition as well as 25 real-world datasets. Our new connection between uncertainty sampling and zero-one loss minimization clarifies the importance of a good (sufficiently large) seed set, rather than

1Preconditioned refers to multiplying the (stochastic) gradient by a symmetric positive semidefinite matrix in (stochastic) gradient descent (Li, 2018; Klein et al., 2011). The matrix is often chosen to approximate the inverse Hessian.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: A typical run of uncertainty sampling. At each iteration, uncertainty sampling chooses to label the point closest to the current decision boundary. (a) Random initialization of uncertainty sampling. (b) A point closest to the decision boundary is added. (c) Several more points around the decision boundary are added until convergence. We see that uncertainty sampling uses only a fraction of the data, but converges to a local minimum of the zero-one loss. (d) Three different local minima of the zero-one loss are shown, where the horizontal linear classifier is much preferable to the other two.

using a single point per class, as is commonly done in the literature (Tong and Koller, 2001; Yang and Loog, 2016).

2 Setup
We focus on binary classification. Let z = (x, y) be a data point, where x ∈ R^k is the input and y ∈ {−1, 1} is the label, drawn from some unknown true data distribution z ∼ p∗. 
Assume we have a family of scoring functions S(x, θ), where θ ∈ R^d are the parameters; for linear models, we have S(x, θ) = θ · φ(x), where φ : R^k → R^d is the feature map.
Given parameters θ, we predict 1 if S(x, θ) > 0 and −1 otherwise, and therefore err when y and S(x, θ) have opposite signs. Define Z(θ) to be the zero-one loss (misclassification rate) over the data distribution, generally the quantity of interest:

Z(θ) def= E_{(x,y)∼p∗}[H(−y S(x, θ))],    (1)

where H is the Heaviside function:

H(x) def= 0 if x < 0;  1/2 if x = 0;  1 if x > 0.    (2)

Note that the training zero-one loss is piecewise constant, and its gradient is 0 almost everywhere. However, if the PDF of p∗ is continuous, then the population zero-one loss is differentiable at most parameters, a fact that will be shown later.
Since minimizing the zero-one loss is computationally intractable (Feldman et al., 2012), it is common to define a convex surrogate ℓ((x, y), θ) = ψ(y S(x, θ)) which upper bounds the zero-one loss; for example, the logistic loss is ψ(s) = log(1 + e^{−s}). Given a labeled dataset D = {z1, . . . , zn}, we can define the estimator that minimizes the sum of the loss plus quadratic regularization:

θ_D def= arg min_θ ∑_{z∈D} ℓ(z, θ) + λ‖θ‖₂².    (3)

This can generally be solved efficiently via convex optimization.

Passive learning: random sampling. Define the population loss as

L(θ) def= E_{z∼p∗}[ℓ(z, θ)].    (4)

In standard passive learning, we sample D randomly from the population and compute θ_D. 
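To make these definitions concrete, here is a small Python sketch (our own illustrative code, not part of the paper; a plain gradient-descent loop stands in for a generic convex solver) of the zero-one loss (1)–(2) and the regularized logistic estimator (3) for a linear score S(x, θ) = θ · x:

```python
import numpy as np

def zero_one_loss(theta, X, y):
    # Empirical analogue of Z(theta) in (1) for a linear score S(x, theta) = theta . x:
    # H(-y * S) is 1 on a misclassification, 1/2 exactly on the boundary, 0 otherwise.
    margins = -y * (X @ theta)
    return np.mean(np.where(margins > 0, 1.0, np.where(margins == 0, 0.5, 0.0)))

def fit_regularized_logistic(X, y, lam, steps=2000, lr=0.1):
    # Gradient descent on the objective in (3) with psi(s) = log(1 + e^{-s}):
    # sum_i psi(y_i * theta . x_i) + lam * ||theta||_2^2.
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        s = y * (X @ theta)
        # d/ds log(1 + e^{-s}) = -1 / (1 + e^{s}); the chain rule contributes y_i * x_i.
        grad = -(X * (y / (1.0 + np.exp(s)))[:, None]).sum(axis=0) + 2 * lam * theta
        theta -= lr * grad
    return theta

# Tiny separable toy set: the negatives are mirror images of the positives.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
theta = fit_regularized_logistic(X, y, lam=0.1)
print(zero_one_loss(theta, X, y))  # 0.0 on this separable toy set
```

Note that the surrogate is what gets optimized, while `zero_one_loss` is only evaluated; this separation is exactly the gap the paper's analysis is about.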
As |D| → ∞, the parameters will generally converge to the minimizer of the population loss L(θ) def= E_{z∼p∗}[ℓ(z, θ)] from (4) (note this is in general distinct from the minimizer of Z).

2

Active learning: uncertainty sampling. In active learning, we have access to a pool of npool unlabeled data points (known x, unknown y) drawn from p∗ and adaptively choose the points to label. In this work, we analyze uncertainty sampling (Lewis and Gale, 1994; Settles, 2010), which is widely used for its simplicity and efficacy (Yang and Loog, 2016).
Let us denote our label budget as n, the number of points we label. Uncertainty sampling (Algorithm 1) begins with nseed < n labeled points D drawn randomly from the pool and minimizes the regularized loss (3) to obtain initial parameters. Then, the algorithm draws a random minipool (subset XM of the data pool XU), and chooses the point x ∈ XM that the current model is most uncertain about, i.e., the one with the smallest absolute value of score.2 It then queries x to get the corresponding label y and adds (x, y) to D. Finally, we update the model by optimizing (3). A key difference between this version of uncertainty sampling and most other versions is that we remove the minipool XM from XU after choosing a point from it; this is done simply for theoretical convenience. The process is continued until we have labeled n points in total.

Algorithm 1 Uncertainty Sampling

Input: Loss ℓ, regularization parameter λ, label budget n, labeled D of size nseed, unlabeled XU of size npool, minipool size nminipool
Train θ_nseed = arg min_θ ∑_{z∈D} ℓ(z, θ) + λ‖θ‖₂²
for t = (nseed + 1), . . . , n do
    Draw a random subset XM of size nminipool from XU
    Choose x = arg min_{x∈XM} |S(x, θ)|
    Query x to get label y
    D = D ∪ {(x, y)}
    XU = XU \ XM
    Train θ_t = arg min_θ ∑_{z∈D} ℓ(z, θ) + λ‖θ‖₂²
end for
Return θ_n

We have four hyperparameters related to the number of data points: nseed, nminipool, npool, and n. We start with nseed labeled points and npool unlabeled points, and select a new point from a random subset of size nminipool sampled without replacement until we have n labeled points in total. Note that npool ≥ n · nminipool.

3 Theory

We present three results shedding light on uncertainty sampling that build on each other. First, in Section 3.1, we show how the optimal parameters change with the addition of a single point to the convex surrogate (e.g., logistic) loss. Then, we show that uncertainty sampling is preconditioned stochastic gradient descent on the zero-one loss in Section 3.2. Finally, we show that uncertainty sampling iterates in expectation move in a descent direction of Z in Section 3.3.

3.1 Incremental Parameter Updates

First, we analyze how the sample convex surrogate loss minimizer changes with each additional point; these are the iterates of uncertainty sampling. Let us assume the loss is convex and thrice-differentiable with bounded derivatives:
Assumption 1 (Convex Surrogate Loss). The surrogate loss ℓ(z, θ) is convex in θ.
Assumption 2 (Surrogate Loss Regularity). 
The surrogate loss ℓ(z, θ) is continuously thrice differentiable in θ, and the first three derivatives are bounded by M_ℓ in Frobenius norm.

2For binary classification, the uncertainty measures of highest entropy, smallest margin, and most uncertain (Settles, 2010) are equivalent.

3

Consider any iterative algorithm (e.g., random sampling or uncertainty sampling) that at each iteration t adds a single point z(t) and minimizes the regularized training loss:

L_t(θ) def= ∑_{i=1}^{t} ℓ(z(i), θ) + λ‖θ‖₂²    (5)

to produce θ_t. Since L_{t−1} and L_t differ by only one point, we expect θ_{t−1} and θ_t to also be close. We can make this formal using Taylor's theorem. First, since θ_t is a minimizer, we have ∇L_t(θ_t) = 0. Then, since the loss is continuously twice-differentiable, for some θ′ between θ_{t−1} and θ_t:

0 = ∇L_t(θ_t) = ∇L_t(θ_{t−1}) + ∇²L_t(θ′)(θ_t − θ_{t−1}).    (6)

Since ℓ is convex and the regularizer is quadratic, ∇²L_t is invertible and we can solve for θ_t:

θ_t = θ_{t−1} − [∇²L_t(θ′)]⁻¹ ∇L_t(θ_{t−1}).    (7)

Since θ_{t−1} minimizes L_{t−1}, we have ∇L_{t−1}(θ_{t−1}) = 0. Also note that L_t(θ) = L_{t−1}(θ) + ℓ(z(t), θ). Thus,

θ_t = θ_{t−1} − [∇²L_t(θ′)]⁻¹ ∇ℓ(z(t), θ_{t−1}).    (8)

The update above holds for any choice of z(t), in particular, when z(t) is chosen by random sampling or uncertainty sampling. 
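To illustrate the incremental update (8) concretely, the following Python sketch (our own illustrative code with made-up helper names, not from the paper) applies the one-step update for the regularized logistic loss, with the caveat that it evaluates the Hessian at θ_{t−1} rather than at the intermediate point θ′ from Taylor's theorem:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_point(z, theta):
    # Gradient of the logistic loss psi(y * theta.x), psi(s) = log(1 + e^{-s}).
    x, y = z
    return -sigmoid(-y * (x @ theta)) * y * x

def hessian_L(points, theta, lam):
    # Hessian of the regularized training loss L_t in (5):
    # sum of per-point logistic Hessians plus 2*lam*I.
    H = 2 * lam * np.eye(len(theta))
    for x, _ in points:
        p = sigmoid(x @ theta)
        H += p * (1 - p) * np.outer(x, x)
    return H

def fit(points, lam, iters=50):
    # Minimize L_t by full Newton iterations (fast on this small convex problem).
    theta = np.zeros(len(points[0][0]))
    for _ in range(iters):
        g = sum(grad_point(z, theta) for z in points) + 2 * lam * theta
        theta = theta - np.linalg.solve(hessian_L(points, theta, lam), g)
    return theta

lam = 0.1
points = [(np.array([1.0, 2.0]), 1.0), (np.array([-2.0, -1.0]), -1.0)]
theta_prev = fit(points, lam)                 # minimizer of L_{t-1}
z_new = (np.array([0.5, -0.5]), 1.0)          # newly labeled point z^(t)
points_t = points + [z_new]

# Update (8): theta_t = theta_{t-1} - [Hessian of L_t]^{-1} grad l(z^(t), theta_{t-1}),
# with the Hessian taken at theta_{t-1} in place of the intermediate theta'.
H = hessian_L(points_t, theta_prev, lam)
theta_next = theta_prev - np.linalg.solve(H, grad_point(z_new, theta_prev))

# Compare against retraining from scratch on all t points.
theta_exact = fit(points_t, lam)
print(np.linalg.norm(theta_next - theta_exact))
```

Because θ_{t−1} already minimizes L_{t−1}, the gradient of L_t there reduces to ∇ℓ(z(t), θ_{t−1}), so the single preconditioned step lands near the exact new minimizer, which is what makes reading (8) as one step of preconditioned SGD natural.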
For random sampling, z(t) ∼ p∗, so we have

E[∇ℓ(z(t), θ_{t−1})] = ∇L(θ_{t−1}),    (9)

from which one can interpret the iterates of random sampling as preconditioned SGD on the population surrogate loss L.

3.2 Parameter Updates of Uncertainty Sampling

Let us now turn to uncertainty sampling. Whereas random sampling is preconditioned SGD on the population surrogate loss L, we will now show that uncertainty sampling is preconditioned SGD on the population zero-one loss Z.
The very rough intuition is as follows: the gradient ∇Z(θ) only depends on the density at the decision boundary corresponding to θ, since points not at the decision boundary contribute zero gradient. Asymptotically, uncertainty sampling selects points close to the decision boundary defined by θ, and points are selected proportionally to the density.
Based on (8), we seek to understand E[∇ℓ(z(t), θ_{t−1})], where z(t) is chosen by uncertainty sampling. First, we must define some concepts. Each parameter vector θ defines a decision boundary:
Definition 3 (Decision Boundary).

B_θ def= {x : S(x, θ) = 0}.    (10)

If S(x, θ) is differentiable with respect to x and ∇_x S(x, θ) ≠ 0 for all x ∈ B_θ, then by the implicit function theorem, B_θ is a (d − 1)-dimensional differentiable manifold and has measure zero (see Proposition 10 in the Appendix). When this condition is satisfied, the decision boundary is well behaved, and Z and uncertainty sampling have nice properties. 
For these reasons, denote the set of parameters that meet this condition as regular parameters Θregular:
Definition 4 (Regular Parameters).

Θregular def= {θ : ∀x ∈ B_θ, ∇_x S(x, θ) ≠ 0}.    (11)

For logistic regression with identity features (φ(x) = x), ∇_x S(x, θ) = θ, so the only point not in Θregular is θ = 0. For logistic regression with quadratic features, θ · φ(x) = x⊤Ax + b⊤x + c (the parameters are A, b, and c), parameters where A is non-singular and c ≠ (1/4) b⊤A⁻¹b are in Θregular. Thus, the parameters not in Θregular have measure zero.
Another important quantity is the probability density at the decision boundary. Before defining this, we first need to make two assumptions on the data distribution p∗ and an assumption that the score function is smooth.

4

Assumption 5 (Smooth PDF). p∗ has a smooth (all derivatives exist) probability density function.
Assumption 6 (Bounded Support). The support of p∗ is bounded.
Assumption 7 (Smooth Score). The score S(x, θ) is smooth, that is, all derivatives with respect to x and θ exist.
Recall that the decision boundary B_θ has measure zero for θ ∈ Θregular. Assumption 5 implies that there is zero probability mass on all decision boundaries corresponding to θ ∈ Θregular (P(x ∈ B_θ) = 0 for x ∼ p∗). However, we can define a probability density on the decision boundary B_θ:

b(θ) def= lim_{h→0} [P(S(x, θ) ≤ h) − P(S(x, θ) ≤ 0)] / h.    (12)

If Assumptions 5, 6, and 7 hold, then the limit exists for θ ∈ Θregular. 
For this statement, see Proposition 12 in the appendix.
Now that we have defined the set of regular parameters Θregular, the density at the decision boundary b(θ), and Assumptions 5, 6, and 7, we are ready to formally state the expected gradient of the surrogate loss on a point chosen by uncertainty sampling. Our main result, Theorem 8, states that as the minipool size goes to infinity, E[∇ℓ(z(t), θ)] tends in the direction of the gradient of the population zero-one loss Z(θ), where the expectation is with respect to the randomness of the minipool XM and label y. In particular, let z(t) be chosen via uncertainty sampling with the parameters θ: x(t) = arg min_{x∈XM} |S(x, θ)| and y(t) ∼ p∗(y | x(t)). We require that the size of the minipool goes to infinity (and thus the size of the unlabeled pool must go to infinity as well) to ensure that we are choosing points arbitrarily close to the decision boundary.
Theorem 8 (Expected Uncertainty Sampling Gradient). If Assumptions 2, 5, 6, and 7 hold and θ ∈ Θregular and b(θ) ≠ 0, then if z(t) is chosen via uncertainty sampling with parameters θ,

lim_{nminipool→∞} E[∇ℓ(z(t), θ)] = (−ψ′(0) / b(θ)) ∇Z(θ).    (13)

Thus, similar to how random sampling yields preconditioned SGD on the population surrogate loss L (9), uncertainty sampling yields preconditioned SGD on the population zero-one loss Z.

3.3 Descent Direction

So far, we have shown that uncertainty sampling is preconditioned SGD on the population zero-one loss Z by analyzing E[∇ℓ(z(t), θ)]. To show that these updates are descent directions on Z, we need to also consider the preconditioner [∇²L_t(θ′)]⁻¹ appearing in (8). Due to quadratic regularization (5), the preconditioner is positive definite. 
However, we need to be careful since the preconditioner depends on the iterate θ_t both through θ′ and the function L_t. Because of this, we only move in a descent direction in expectation if ‖∇Z(θ_{t−1})‖ ≥ ε and the regularization is large enough, which ensures that the dependence on θ_t doesn't change the preconditioner too much.
Theorem 9 (Uncertainty Sampling Descent Direction). Assume that Assumptions 1, 2, 5, 6, and 7 hold, and assume ψ′(0) < 0. For any b_0 > 0, ε > 0, and n, for any sufficiently large λ ≥ 2 M_ℓ^{3/2} b_0^{1/2} (−ψ′(0))^{−1/2} ε^{−1/2} n^{2/3}, for all iterates of uncertainty sampling {θ_t} where θ_{t−1} ∈ Θregular, ‖∇Z(θ_{t−1})‖ ≥ ε, and b(θ_{t−1}) ≤ b_0, as nminipool → ∞,

∇Z(θ_{t−1}) · E[θ_t − θ_{t−1} | θ_{t−1}] < 0.    (14)

Although λ may appear to have to be quite large, note that typical regularization is proportional to the number of data points, while this regularization can be sub-linear, which corresponds to rather weak regularization for large n.
This result explains why uncertainty sampling can achieve lower zero-one loss than random sampling: it is implicitly descending on Z. Further, since Z is non-convex, uncertainty sampling can converge to different values depending on the initialization.

5

Figure 2: Synthetic dataset based on a mixture of four Gaussians (left) and the associated learning curves for runs of uncertainty sampling with different initial seed sets (right). Depending on the seed set, uncertainty sampling can produce either better or worse parameters than random sampling. 
See the main text for more information.

4 Experiments

We run uncertainty sampling on a simple synthetic dataset to illustrate the dynamics (Section 4.1) as well as 25 real datasets (Section 4.2). In both cases, we show how uncertainty sampling converges to different parameters depending on initialization, and how it can achieve lower asymptotic zero-one loss compared to minimizing the surrogate loss on all the data. Note that most active learning experiments are interested in measuring the rate of convergence (data efficiency), whereas this paper focuses exclusively on asymptotic values and the variation that we obtain from different seed sets. Also note that we measure only zero-one loss (error) but all algorithms optimize the logistic loss.

4.1 Synthetic Data

Figure 2 (left) shows a mixture of Gaussian distributions in two dimensions. All the Gaussians are isotropic, and the size of the circle indicates the variance (one standard deviation for the inner circle, and two standard deviations for the outer circle). The points drawn from the two red Gaussian distributions are labeled y = 1 and the points drawn from the two blue ones are labeled y = −1. The percentages refer to the mixture proportions of the clusters. We see that there are four local minima of the population zero-one loss, indicated by the green dashed lines. Each minimum misclassifies one of the Gaussian clusters, yielding losses of just over 10%, 20%, 30%, and 40%. The black dotted line corresponds to the parameters that minimize the logistic loss, which yields a loss of 18%.
Figure 2 (right) shows learning curves for different seed sets, which consist of two points, one from each class. We see that the uncertainty sampling learning curves converge to four different asymptotic losses, corresponding to the four local minima of the zero-one loss mentioned earlier. The thick black dashed line is the zero-one loss for random sampling. 
We see that uncertainty sampling can achieve lower loss than random sampling. This occurs when the conditional label distribution is misspecified in a way that the (global) optimum of the logistic loss does not correspond to the global minimum of the zero-one loss.

4.2 Real-World Datasets

We collected 25 datasets from OpenML (retrieved August, 2017) that had a large number of data points and where logistic regression outperformed the majority classifier (predict the majority label). We further subsampled each dataset to have 10,000 points, which was divided into 7000 training points and 3000 test points. We ran a different version of uncertainty sampling from the version that we analyzed theoretically. Selecting from large minipools would require too much data, and sampling without replacement (as is usually done in practice) would converge trivially since the entire dataset would eventually be labeled. Thus, instead of selecting points from the minipool subset and without replacement, we select points from the entire pool with replacement. We ran uncertainty sampling on

6

Figure 3: A scatter plot of the asymptotic zero-one loss for uncertainty sampling for two particular datasets for 13 seed sizes. The black line is the zero-one loss on the full dataset.

Figure 4: A plot showing the distribution of runs over the datasets (with 10 runs per dataset) of when uncertainty sampling converges to a lower zero-one loss than using the entire dataset.

Figure 5: A violin plot capturing the relative asymptotic zero-one loss compared to the zero-one loss on the full dataset. The plot shows the density of points with kernel density estimation. The red lines are the median losses. Each "violin" captures 230 points (10 runs over 23 datasets).

each dataset with random seed sets of sizes that are powers of two from 2 to 4096 and then 7000. 
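The experimental variant just described (scoring the entire pool and querying the point with the smallest |S(x, θ)|, with replacement) can be sketched as follows; this is our own illustrative code on a synthetic two-cluster pool, not the released implementation, and the helper names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y, lam=0.1, steps=500, lr=0.2):
    # Regularized logistic fit: gradient descent on the mean logistic loss
    # plus lam * ||theta||^2 (a rescaling of the objective in (3)).
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        s = y * (X @ theta)
        grad = -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0) + 2 * lam * theta
        theta -= lr * grad
    return theta

# Pool of 200 points from two Gaussian clusters; labels live with the "oracle".
X_pool = np.vstack([rng.normal([2.0, 0.0], 1.0, (100, 2)),
                    rng.normal([-2.0, 0.0], 1.0, (100, 2))])
y_pool = np.array([1.0] * 100 + [-1.0] * 100)

# Random seed set, then repeatedly query the most uncertain point, i.e. the
# smallest |S(x, theta)|, over the whole pool and with replacement.
labeled = list(rng.choice(len(X_pool), size=4, replace=False))
for _ in range(30):
    theta = fit(X_pool[labeled], y_pool[labeled])
    uncertainty = np.abs(X_pool @ theta)      # small |score| means uncertain
    labeled.append(int(np.argmin(uncertainty)))
theta = fit(X_pool[labeled], y_pool[labeled])
```

With replacement, the same near-boundary point can be chosen repeatedly, which simply reweights it in the training loss; on this easy pool the final linear classifier labels nearly every point correctly while querying only a handful of them.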
We stopped when uncertainty sampling did not choose an unlabeled point for 1000 iterations. For each dataset and seed set size, we ran uncertainty sampling 10 times, for a total of 25 · 13 · 10 = 3250 runs.
In Figure 3, we see scatter plots of the asymptotic zero-one loss of 130 points: 13 seed set sizes, each with 10 runs. The dataset on the left was chosen to exhibit the wide range of convergence values of uncertainty sampling, some with lower zero-one loss than with the full dataset. In both plots, we see that the variance of the zero-one loss of uncertainty sampling decreases as the seed set grows. This is expected from theory since the initialization has less variance for larger seed set sizes (as the seed set size goes to infinity, the parameters converge). For most of the datasets, the behavior was more similar to the plot on the right, where uncertainty sampling has a higher mean zero-one loss than random sampling for most seed sizes.
To gain a more quantitative understanding of all the datasets, we summarized the asymptotic zero-one loss of uncertainty sampling for various random seed set sizes. In Figure 4, we show the proportions of the runs over the datasets where uncertainty sampling converges to a lower zero-one loss than using the entire dataset. In Figure 5, we show a "violin plot" for the distribution of the ratio between the asymptotic zero-one loss of uncertainty sampling and the zero-one loss using the full dataset. We note that the mean and variance of uncertainty sampling significantly drop as the size of the seed set grows larger. 
The initial parameters are poor if the seed set is small, and it is well-known that poor initializations for optimizing non-convex functions locally can yield poor results, as seen here.

7

5 Related Work and Discussion

The phenomenon that uncertainty sampling can achieve lower error with a subset of the data rather than using the entire dataset has been observed multiple times. In fact, the original uncertainty sampling paper (Lewis and Gale, 1994) notes that "For 6 of 10 categories, the mean [F-score] for a classifier trained on a uncertainty sample of 999 examples actually exceeds that from training on the full training set of 319,463". Schohn and Cohn (2000) defines a heuristic that selects the point closest to the decision boundary of an SVM, which is equivalent to uncertainty sampling in our formulation. In the abstract, the authors note, "We observe... that a SVM trained on a well-chosen subset of the available corpus frequently performs better than one trained on all available data". More recently, Chang et al. (2017) develops an "active bias" technique that emphasizes the uncertain points and finds that it increases the performance compared to using a fully-labeled dataset.
There is also work showing the bias of active learning can harm final performance. Schütze et al. (2006) notes the "missed cluster effect", where active learning can ignore clusters in the data and never query points from there, corresponding to a local minimum of the zero-one loss. 
Dasgupta and Hsu (2008) has a section on the bias of uncertainty sampling and provides another example where uncertainty sampling fails due to sampling bias, which we can explain as convergence to a spurious local minimum of the zero-one loss. Bach (2007) and Beygelzimer et al. (2009) note this bias issue and propose different importance sampling schemes to re-weight points and correct for the bias.
In this work, we find that uncertainty sampling updates are preconditioned SGD steps on the population zero-one loss and move in descent directions for parameters that are not approximate stationary points. Note that this does not give any global optimality guarantees. In fact, for linear classifiers, it is NP-hard to optimize the training zero-one loss below 1/2 − ε (for any ε > 0) even when there is a linear classifier that achieves just ε training zero-one loss (Feldman et al., 2012).
One of the key questions in light of this work is when optimizing convex surrogate losses yields good zero-one losses. If the loss function corresponds to the negative log-likelihood of a well-specified model, then the zero-one loss Z will have a local minimum at the parameters that optimize the log-likelihood. If the loss function is "classification-calibrated", Bartlett et al. (2006) shows that if the convex surrogate loss of the estimated parameters converges to the optimal convex surrogate loss, then the zero-one loss of the estimated parameters converges to the global minimum of the zero-one loss (Bayes error). This holds only for universal classifiers (Micchelli et al., 2006), but in practice, these assumptions are unrealistic. 
For instance, several papers show how outliers and noise can cause linear classifiers learned on convex surrogate losses to suffer high zero-one loss (Nguyen and Sanner, 2013; Wu and Liu, 2007; Long and Servedio, 2010).
Other works connect active learning with optimization in rather different ways. Ramdas and Singh (2013) uses active learning as a subroutine to improve stochastic convex optimization. Guillory et al. (2009) shows how performing online active learning updates corresponds to online optimization updates of non-convex functions, more specifically, truncated convex losses. In this work, we analyze active learning with offline optimization and show the connection between uncertainty sampling and one particularly important non-convex loss, the zero-one loss.
In summary, our work is the first to show a connection between the zero-one loss and the commonly-used uncertainty sampling. This provides an explanation and understanding of the various empirical phenomena observed in the active learning literature. Uncertainty sampling simultaneously offers the hope of converging to lower error but also the danger of converging to local minima (an issue that can possibly be avoided with larger seed sizes). We hope this connection can lead to improved active learning and optimization algorithms.

Reproducibility. The code, data, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0xf8dfe5bcc1dc408fb54b3cc15a5abce8/.

Acknowledgments. This research was supported by an NSF Graduate Fellowship to the first author.

8

References
Bach, F. R. (2007). Active learning for misspecified generalized linear models. In Advances in neural information processing systems, pages 65–72.

Balcan, M.-F., Beygelzimer, A., and Langford, J. (2006). Agnostic active learning. 
In Proceedings of the 23rd international conference on Machine learning, pages 65–72. ACM.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.

Beygelzimer, A., Dasgupta, S., and Langford, J. (2009). Importance weighted active learning. In Proceedings of the 26th annual international conference on machine learning, pages 49–56. ACM.

Bordes, A., Ertekin, S., Weston, J., and Bottou, L. (2005). Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(Sep), 1579–1619.

Chang, H.-S., Learned-Miller, E., and McCallum, A. (2017). Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pages 1003–1013.

Dasgupta, S. and Hsu, D. (2008). Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pages 208–215. ACM.

Feldman, V., Guruswami, V., Raghavendra, P., and Wu, Y. (2012). Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6), 1558–1590.

Guillory, A., Chastain, E., and Bilmes, J. (2009). Active learning as non-convex optimization. In Artificial Intelligence and Statistics, pages 201–208.

Hanneke, S. et al. (2014). Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3), 131–309.

Hoveijn, I. (2007). Differentiability of the volume of a region enclosed by level sets. arXiv preprint arXiv:0712.0915.

Klein, S., Staring, M., Andersson, P., and Pluim, J. P. (2011). Preconditioned stochastic gradient descent optimisation for monomodal image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 549–556. 
Springer.\n\nLewis, D. D. and Gale, W. A. (1994). A sequential algorithm for training text classi\ufb01ers.\n\nIn\nProceedings of the 17th annual international ACM SIGIR conference on Research and development\nin information retrieval, pages 3\u201312. Springer-Verlag New York, Inc.\n\nLi, X.-L. (2018). Preconditioned stochastic gradient descent. IEEE transactions on neural networks\n\nand learning systems, 29(5), 1454\u20131466.\n\nLong, P. M. and Servedio, R. A. (2010). Random classi\ufb01cation noise defeats all convex potential\n\nboosters. Machine learning, 78(3), 287\u2013304.\n\nMicchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. Journal of Machine Learning\n\nResearch, 7(Dec), 2651\u20132667.\n\nNguyen, T. and Sanner, S. (2013). Algorithms for direct 0\u20131 loss optimization in binary classi\ufb01cation.\n\nIn International Conference on Machine Learning, pages 1085\u20131093.\n\nRamdas, A. and Singh, A. (2013). Algorithmic connections between active learning and stochastic\nconvex optimization. In International Conference on Algorithmic Learning Theory, pages 339\u2013353.\nSpringer.\n\nSchohn, G. and Cohn, D. (2000). Less is more: Active learning with support vector machines. In\n\nICML, pages 839\u2013846. Citeseer.\n\nSch\u00fctze, H., Velipasaoglu, E., and Pedersen, J. O. (2006). Performance thresholding in practical\ntext classi\ufb01cation. In Proceedings of the 15th ACM international conference on Information and\nknowledge management, pages 662\u2013671. ACM.\n\n9\n\n\fSettles, B. (2010). Active learning literature survey. Computer Sciences Technical Report, 1648.\n\nTong, S. and Koller, D. (2001). Support vector machine active learning with applications to text\n\nclassi\ufb01cation. Journal of machine learning research, 2(Nov), 45\u201366.\n\nWu, Y. and Liu, Y. (2007). Robust truncated hinge loss support vector machines. Journal of the\n\nAmerican Statistical Association, 102(479), 974\u2013983.\n\nYang, Y. and Loog, M. (2016). 
A benchmark and comparison of active learning for logistic regression.\n\narXiv preprint arXiv:1611.08618.\n\n10\n\n\f", "award": [], "sourceid": 3454, "authors": [{"given_name": "Stephen", "family_name": "Mussmann", "institution": "Stanford University"}, {"given_name": "Percy", "family_name": "Liang", "institution": "Stanford University"}]}