{"title": "Faster Online Learning of Optimal Threshold for Consistent F-measure Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3889, "page_last": 3899, "abstract": "In this paper, we consider online F-measure optimization (OFO). Unlike traditional performance metrics (e.g., classification error rate), F-measure is non-decomposable over training examples and is a non-convex function of model parameters, making it much more difficult to be optimized in an online fashion. Most existing results of OFO usually suffer from high memory/computational costs and/or lack  statistical consistency  guarantee for optimizing F-measure at the population level. To advance OFO, we propose an efficient online algorithm based on simultaneously learning a posterior probability of class and learning an optimal threshold by minimizing  a stochastic strongly convex function with unknown strong convexity parameter. A key component of the proposed method is  a novel stochastic algorithm with low memory and computational costs, which can enjoy a  convergence rate of $\\widetilde O(1/\\sqrt{n})$ for learning the optimal threshold under a mild condition on the convergence of the posterior probability,  where $n$ is the number of processed examples. It is provably  faster than its predecessor based on a heuristic for updating the threshold.   The experiments verify  the efficiency of the proposed algorithm in comparison with state-of-the-art OFO algorithms.", "full_text": "Faster Online Learning of Optimal Threshold for\n\nConsistent F-measure Optimization\n\nMingrui Liu\u2217\u2020, Xiaoxuan Zhang\u2217\u2020, Xun Zhou\u2021, Tianbao Yang\u2020\n\n\u2020Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA\n\u2021Department of Management Sciences, The University of Iowa, Iowa City, IA 52242, USA\n\nmingrui-liu, tianbao-yang@uiowa.edu\n\nAbstract\n\nIn this paper, we consider online F-measure optimization (OFO). 
Unlike traditional performance metrics (e.g., classification error rate), F-measure is non-decomposable over training examples and is a non-convex function of model parameters, making it much more difficult to optimize in an online fashion. Most existing results of OFO usually suffer from high memory/computational costs and/or lack a statistical consistency guarantee for optimizing F-measure at the population level. To advance OFO, we propose an efficient online algorithm based on simultaneously learning a posterior probability of class and learning an optimal threshold by minimizing a stochastic strongly convex function with unknown strong convexity parameter. A key component of the proposed method is a novel stochastic algorithm with low memory and computational costs, which can enjoy a convergence rate of $\\widetilde O(1/\\sqrt{n})$ for learning the optimal threshold under a mild condition on the convergence of the posterior probability, where n is the number of processed examples. It is provably faster than its predecessor based on a heuristic for updating the threshold. The experiments verify the efficiency of the proposed algorithm in comparison with state-of-the-art OFO algorithms.\n\n1 Introduction\n\nA learning algorithm aims to optimize a certain performance metric defined over a set or population of examples. Online learning [18, 28, 4, 10] is a paradigm in which an algorithm alternately makes a prediction on a received example and then updates the model given the feedback on that prediction to optimize a target performance metric. It has attracted tremendous attention due to its efficiency in handling large-scale and/or streaming data, and has been actively investigated for decades. 
While many studies are devoted to learning with traditional performance metrics (e.g., classification error rate), there has also been an increasing interest in designing efficient online learning algorithms to maximize F-measure for tackling large-scale streaming data or by one pass of large-scale batch data [3, 9, 14, 23]. This is because F-measure is better suited to imbalanced classification data, since it enforces a better balance between performance on the rare class and the dominating class. Imbalanced data can be found in many applications, e.g., medical diagnostics [13], spam email detection [20], malicious URL detection [27], etc.\n\nOnline F-measure optimization is much more challenging than traditional online learning with point-wise loss functions since F-measure is non-decomposable over training examples and is a non-convex function of model parameters. Nevertheless, several previous works have made efforts to tackle this difficult problem [3, 9, 14, 23], which fall into three categories. The first category is based on minimizing a surrogate loss function or maximizing a surrogate reward function in an online fashion. This type of approach usually has large memory costs and/or lacks statistical consistency for F-measure because the consistency/calibration of the surrogate loss of F-measure is not clear.\n\n\u2217equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTable 1: Comparison with Existing Work of Online F-measure Optimization. The comparison is based on fixing the total number of processed examples to be n, where d is the dimensionality of data, m > 0 is a parameter of the referred algorithm, and \u03b1 > 0. Comp. 
costs is short for computational costs.\n\nMethod | Target of Convergence Analysis | Convergence Result | Consistency | Memory Costs | Comp. Costs\n[9] | Empirical Structural Surrogate Loss | $O(1/n^{1/4})$ | No | $O(\\sqrt{n}d)$ | $O(nd)$\n[14] | Surrogate F-measure | $O(1/\\sqrt{n})$ | No | $O(d)$ | $O(nd)$\n[23] | Cost-sensitive loss/F-measure | $O(1/\\sqrt{n})$/$O(1/m + 1/\\sqrt{n})$ | Yes (iff $m = n^\\alpha$) | $O(md)$ | $O(mnd)$\n[3] | Optimal Threshold/F-measure | asymptotic/asymptotic | Yes | $O(d)$ | $O(nd)$\nThis work | Optimal Threshold/F-measure | $\\widetilde O(1/\\sqrt{n})$/asymptotic | Yes | $O(d)$ | $O(nd)$\n\nThe second category of approaches leverages the characterization that F-measure optimization is equivalent to a cost-sensitive loss minimization and uses online learning algorithms for minimizing a cost-sensitive loss. However, the optimal costs in this characterization depend on the optimal F-measure, which makes this type of approach suffer from a large computational cost for tuning or searching the optimal costs. The third family of methods is based on a result that the optimal classifier for maximizing the F-measure can be achieved by thresholding the posterior class probability. Then, the problem reduces to learning the optimal threshold and the posterior probability incrementally. Online learning of the posterior probability can be implemented by minimizing calibrated surrogate loss functions (e.g., logistic loss). However, the challenge of this type of approach lies in how to learn the optimal threshold on the fly.\n\nIn this paper, we address this challenge (online learning of the optimal threshold) in an elegant way. In particular, we cast the problem of learning the optimal threshold as stochastic strongly convex optimization. Nevertheless, the existing online gradient descent method with $\\widetilde O(1/n)$ convergence for minimizing strongly convex functions is not directly applicable. 
The reason is that the strong convexity parameter is unknown and an unbiased stochastic gradient is not available. The significance of this work is to address these challenges by a new design of online algorithm and a novel high probability analysis of the proposed algorithm. Our main contributions are summarized below:\n\u2022 We propose a Fast Online F-measure Optimization (FOFO) with a novel component for learning the optimal threshold for a probabilistic binary classifier. The proposed algorithm has low memory and computational costs.\n\u2022 We prove an $\\widetilde O(1/\\sqrt{n})$ (see footnote 2) convergence rate for learning the optimal threshold and the consistency of F-measure optimization at the population level under a point-wise convergence condition of learning the posterior probability. It is provably faster than its predecessor [3] that updates the threshold based on a heuristic.\n\u2022 We conduct extensive experiments comparing the proposed algorithm with existing algorithms in the three categories. Experimental results show that FOFO has much better online and testing performance than other algorithms especially on highly imbalanced datasets.\n\n2 Related Work\n\nThis work is motivated by addressing the deficiencies of previous algorithms of OFO and is also built on existing results for F-measure optimization. In this section, we will highlight them. As mentioned before, previous OFO algorithms can be organized into three categories. In reviewing related work, we focus on the F-measure optimization part, though some of them also include contributions for optimizing other non-decomposable metrics (e.g., AUC, Precision) [9, 14, 23].\n\nTwo representative algorithms in the first category are proposed in [9, 14]. In particular, Kar et al. [9] developed an online gradient descent method for minimizing structural surrogate loss, which was motivated by a batch-learning method for optimizing the structural surrogate loss [7]. 
The authors of [9] established a convergence rate for minimizing the empirical structural surrogate loss in the order of $O(1/n^{1/4})$, where n is the total number of processed examples. One deficiency of their algorithm is its large memory cost, since it needs to maintain $O(\\sqrt{n})$ examples for computing the gradient of the structural surrogate loss for each update of the model parameters. Narasimhan et al. [14] addressed the issue of high memory cost by optimizing a surrogate F-measure that is defined based on surrogate reward functions for approximating true-positive and true-negative rate. They leveraged the pseudo-linear property of the (surrogate) F-measure and developed an alternate-maximization algorithm for optimizing the surrogate F-measure, which has a stochastic version called STAMP with a convergence rate of $O(1/\\sqrt{n})$. However, their stochastic algorithm is not designed for online learning where online performance is important. In particular, their algorithm alternates between two stages with one stage updating the models using received examples and another stage updating the so-called challenge level using received examples. In contrast, our algorithm simultaneously updates the probabilistic classifier and its threshold for making predictions, making it more suitable for online learning. Another deficiency of both works [9, 14] is the lack of statistical consistency of F-measure optimization at the population level.\n\nFootnote 2: The $\\widetilde O(\\cdot)$ notation hides logarithmic factors. This convergence rate is implied by a convergence rate of $\\widetilde O(1/n)$ for minimizing the involved strongly convex function.\n\nA related algorithm in the second category of OFO is proposed in [23], which is based on minimizing cost-sensitive loss. 
It is motivated by the fact that F-measure maximization is equivalent to a cost-sensitive error minimization that consists of a weighted sum of false positives and false negatives [16]. Nevertheless, the optimal weight is dependent on the optimal F-measure and thus is not available. To tackle the unknown optimal weight, Yan et al. [23] proposed to learn multiple cost-sensitive classifiers corresponding to multiple settings of the weight. For online prediction, they also maintain and update selection probabilities of the multiple classifiers. At each iteration, the algorithm selects one classifier for making a prediction and updates all classifiers upon receiving the label information of received data. As a result, their algorithm suffers from high memory and computational costs. They proved a convergence result for F-measure optimization by utilizing the fact that cost-sensitive surrogate loss is calibrated [19]. Nevertheless, the consistency of F-measure optimization requires the number of maintained classifiers (denoted by m in Table 1) to be very large.\n\nRecently, Busa-Fekete et al. [3] proposed a remarkably simple OFO algorithm, which belongs to the third category. It is based on the fact that the optimal F-measure can be achieved by thresholding the true posterior probability of the positive class [24, 15]. Hence, an online algorithm for updating the model of the posterior probability and updating the threshold is developed in [3]. Their update for the threshold is based on a heuristic that sets the threshold as half of the F-measure computed on the historical examples. This is motivated by the fact established in [26] that the optimal threshold for the true posterior probability is half of the optimal F-measure. However, it is generally not true that for any probabilistic classifier the optimal threshold is half of its F-measure. 
As a result, they can only prove asymptotic convergence (with n \u2192 \u221e) for learning the optimal threshold even using the true posterior probability at each iteration. In contrast, we overcome this shortcoming by learning the optimal threshold through solving a strongly convex optimization problem. With careful design and analysis of the proposed algorithm, we are able to prove a convergence rate of $\\widetilde O(1/\\sqrt{n})$ for learning the optimal threshold under a mild condition for learning the posterior class probability.\n\nWe note that the consistency of F-measure optimization is an important concern for the design and analysis of OFO algorithms [15]. It requires that, given an infinite amount of data, the learned classifier should achieve the best F-measure at the population level. As a summary, we present a quick comparison between this work and related studies in Table 1 from various perspectives, including theoretical convergence results, consistency of F-measure optimization, memory and computational costs, where linear models are assumed for different algorithms in order to compare the memory and computational costs. Finally, we emphasize that although there are some batch-learning based F-measure optimization algorithms [15, 24], the comparison in this paper focuses on online algorithms.\n\n3 Preliminaries and Notations\n\nLet z = (x, y) denote a random data, where x \u2208 X represents the feature vector and y \u2208 {1, 0} represents the binary class label. Let Z = X \u00d7 {1, 0} denote the domain of the data. We assume z follows an unknown distribution P, and denote the marginal distribution of the feature x by \u00b5(x). We denote the probability of positive class by \u03c0 = Pr(y = 1), and the true posterior probability of the positive class by \u03b7(x) = Pr(y = 1|x), and thus we have $\\pi = \\int_{x \\in \\mathcal X} \\eta(x) d\\mu(x)$. 
Since we assume the received examples follow a distribution, in the sequel we use online gradient descent (OGD) and stochastic gradient descent (SGD) interchangeably.\n\nLet F = {f : X \u2192 {1, 0}} be the set of all binary classifiers on X. The F-measure (in particular F1 measure) of f at the population level is defined as\n\n$F(f) = \\frac{2\\int_{\\mathcal X} \\eta(x) f(x) d\\mu(x)}{\\int_{\\mathcal X} \\eta(x) d\\mu(x) + \\int_{\\mathcal X} f(x) d\\mu(x)}$. (1)\n\nDenote by F\u2217 = max_{f \u2208 F} F(f) the optimal F-measure. Let G = {g : X \u2192 [0, 1]} denote a set of probabilistic classifiers that assign to any example x a probability that it belongs to the positive class. It induces a family of thresholded binary classifiers H = {g\u03b8(x) := I[g(x)\u2265\u03b8]} \u2286 F, where I is an indicator function, and \u03b8 \u2208 [0, 1] is a threshold.\n\nIt was shown that the optimal binary classifier that maximizes the F-measure at the population level can be achieved by thresholding the true posterior probability \u03b7(x), i.e., \u03b7\u03b8(x) = I[\u03b7(x)\u2265\u03b8] [24, 15]. As a result, we have max_{f \u2208 F} F(f) = max_{\u03b8 \u2208 [0,0.5]} F(\u03b7\u03b8). This reduces the problem of F-measure optimization into two sub-problems: learning the posterior probability \u03b7(x) and learning the optimal threshold. The best threshold \u03b8\u2217 that maximizes F(\u03b7\u03b8) has a relationship with the optimal F-measure, i.e., \u03b8\u2217 = F\u2217/2 [26]. It also implies that the optimal threshold \u03b8\u2217 \u2208 [0, 0.5].\n\nDefinition 1. An algorithm is said to be F-measure consistent if the learned classifier f satisfies $F_* - F(f) \\xrightarrow{p} 0$ as $n \\to \\infty$.\n\nLet W = [0, 0.5] and B(\u03b80, r) = {\u03b8 : |\u03b8 \u2212 \u03b80| \u2264 r}. Denote by \u03a0W[\u03b8] a projection of \u03b8 into the domain W. 
Denote by X\u03b8 = {x \u2208 X : \u03b7(x) \u2265 \u03b8} for any \u03b8 \u2208 [0, 0.5] and by $\\rho_\\theta = \\int_{x \\in \\mathcal X_\\theta} d\\mu(x)$. In Definition 1, the requirement is $F_* - F(f) \\xrightarrow{p} 0$ as $n \\to \\infty$, where $\\xrightarrow{p}$ denotes convergence in probability.\n\n4 Fast Online F-measure Optimization Algorithm\n\nFrom the discussion above, we can cast OFO into two sub-problems, i.e., online learning of the posterior probability and online learning of the optimal threshold. Let us first discuss the first sub-problem and then focus on the second sub-problem. For the first sub-problem, we assume that there exists an online algorithm A that can incrementally learn the posterior probability. To better illustrate this, let us consider a scenario that the true posterior probability is specified by a generalized linear model, i.e., with an appropriate feature mapping \u03c6(x) \u2208 R^d there exists w\u2217 \u2208 R^d such that\n\n$\\eta(x) = \\Pr(y = 1 | x) = \\frac{1}{1 + \\exp(-w_*^\\top \\phi(x))}$. (2)\n\nIt is not difficult to show that the model parameter w\u2217 can be learned by minimizing the expected logistic loss (see supplement), i.e.,\n\n$w_* \\in \\arg\\min_{w \\in \\mathbb R^d} L(w) \\triangleq \\mathrm E_{x,y} \\log(1 + \\exp(-(2y - 1) w^\\top \\phi(x)))$. (3)\n\nTherefore, one can use existing online learning algorithms (e.g., SGD [28, 17]) to learn w\u2217. To this end, we denote by\n\n$w_t = A(w_{t-1}, x_t, y_t)$, (4)\n\nthe update of an online algorithm that updates the model parameter $w_{t-1}$ iteratively, where t = 1, . . . , n. In the next section, we discuss some choices of the online algorithm A and their implications for the convergence result. At the t-th iteration (before receiving the t-th example), let $\\widehat\\eta_t(x)$ denote an estimate of the posterior probability. For the generalized linear model considered above, $\\widehat\\eta_t(x)$ can be computed by (5), where $\\bar w_{t-1}$ is a solution computed based on w0, . 
. . , wt\u22121 that has a convergence guarantee (please see Assumption 1 and its discussion in the next section); explicitly, (5) reads\n\n$\\widehat\\eta_t(x) = 1/(1 + \\exp(-\\bar w_{t-1}^\\top \\phi(x)))$. (5)\n\nIt is notable that online learning of the posterior probability is also used in [3].\n\nNext, we focus on online learning of the optimal threshold \u03b8\u2217. According to [26], \u03b8\u2217 is achieved at the unique root of\n\n$q(\\theta) = \\pi\\theta - \\mathrm E_x[(\\eta(x) - \\theta)_+]$,\n\nwhere q(\u03b8) is continuous and strictly increasing, and the function (\u00b7)+ is defined as (x)+ = max(x, 0). Instead of searching for this root directly, we will cast the problem of learning the optimal threshold \u03b8\u2217 as the following strongly convex optimization problem.\n\nLemma 1. \u03b8\u2217 is the unique optimizer of the following strongly convex function\n\n$\\min_{\\theta \\in [0, 0.5]} Q(\\theta) \\triangleq \\frac{1}{2} \\mathrm E_x[(\\eta(x) - \\theta)_+^2] + \\frac{1}{2} \\pi \\theta^2$. (6)\n\nIndeed, Q(\u03b8) is a \u03c3-strongly convex function with $\\sigma = \\pi + \\min_{\\theta \\in [0, 0.5]} \\rho_\\theta$.\n\nAlgorithm 1 FOFO(n)\n1: Set $w_0 = 0$, $\\widehat\\theta^{(0)} = 0$, $m = \\lfloor \\frac{1}{2} \\log_2 \\frac{2n}{\\log_2 n} \\rfloor - 1$, $n_0 = \\lfloor n/m \\rfloor$, $\\widehat\\pi_0 = 0$, $R_0 = 0.5$.\n2: for k = 1, . . . , m do\n3: Set $\\gamma_k = \\frac{R_{k-1}}{\\sqrt{10 n_0}}$, $R_k = R_{k-1}/2$\n4: $[w^{(k)}, \\widehat\\theta^{(k)}, \\widehat\\pi^{(k)}] = \\mathrm{SFO}(w^{(k-1)}, \\widehat\\theta^{(k-1)}, \\widehat\\pi^{(k-1)}, n_0, \\gamma_k, R_{k-1}, (k-1) n_0)$\n5: end for\n\nAlgorithm 2 SFO($w, \\theta, \\widehat\\pi, T, \\gamma, R, T_0$)\nEach iteration \u03c4 of the loop below, with global iteration index t = \u03c4 + T0, receives an example $x_t$, computes $\\widehat\\eta_t(x_t)$ according to (5), and predicts $\\hat y_t = 1$ if $\\widehat\\eta_t(x_t) \\ge \\bar\\theta_\\tau$ (and $\\hat y_t = 0$ otherwise).\n1: Initialize $\\bar\\theta_1 = \\theta_1 = \\theta$, $\\widehat\\pi_{T_0} = \\widehat\\pi$, $w_{T_0} = w$\n2: for \u03c4 = 1, . . . 
, T do\n3: Let t = \u03c4 + T0 be the global iteration index.\n4: Receive an example $x_t$\n5: Compute $\\widehat\\eta_t(x_t)$ according to (5)\n6: if $\\widehat\\eta_t(x_t) \\ge \\bar\\theta_\\tau$ then\n7: Make a prediction $\\hat y_t = 1$\n8: else\n9: Make a prediction $\\hat y_t = 0$\n10: end if\n11: Observe the label $y_t$\n12: $\\widehat\\pi_t = \\frac{1}{t}((t - 1)\\widehat\\pi_{t-1} + I[y_t = 1])$\n13: $\\theta_{\\tau+1} = \\Pi_{W \\cap B(\\theta_1, R)}(\\theta_\\tau - \\gamma \\partial \\widehat Q_t(\\theta_\\tau))$\n14: $\\bar\\theta_{\\tau+1} = \\frac{1}{\\tau+1}(\\tau \\bar\\theta_\\tau + \\theta_{\\tau+1})$\n15: $w_t = A(w_{t-1}, x_t, y_t)$\n16: end for\n17: return $w_t$, $\\bar\\theta_{T+1}$, $\\widehat\\pi_t$\n\nThe above lemma can be easily shown by noting that \u2207Q(\u03b8) = q(\u03b8). The strong convexity of Q(\u03b8) is also not difficult to prove (see supplement for details). The next lemma shows that if \u03b8 is closer to \u03b8\u2217 then F(\u03b7\u03b8) is closer to F\u2217, which further justifies our approach of learning the optimal threshold by optimizing the strongly convex function Q(\u03b8).\n\nLemma 2. There exists c > 0 such that for all \u03b8 \u2208 [0, 0.5], F\u2217 \u2212 F(\u03b7\u03b8) \u2264 c|\u03b8 \u2212 \u03b8\u2217|.\n\nThe proof of this lemma can be found in the supplement. However, there are several challenges for minimizing Q(\u03b8) to have a faster convergence. First, even an unbiased stochastic gradient of Q(\u03b8) is not available because \u03b7(x) is not given a priori. Only a noisy estimate $\\widehat\\eta_t(x)$ is available through the historical model parameters. Second, the strong convexity parameter \u03c3 \u2265 \u03c0 of Q(\u03b8) is unknown. The standard SGD method for minimizing a \u03c3-strongly convex function with an $\\widetilde O(1/(\\sigma n))$ convergence rate on the objective value requires knowing the strong convexity parameter for setting the step size. One may use historical examples y1, . . . , yt\u22121 to obtain a new estimate of \u03c0 at each iteration t and use it to set the step size. However, analysis of such an approach is difficult. 
Another simple approach is to use a dedicated set of examples to obtain a lower bound of \u03c0 and then use it to set the step size. Nevertheless, such an approach could yield a large convergence error because the lower bound of \u03c0 could be very small especially when the data is highly imbalanced.\n\nWe address these issues by proposing a novel stochastic algorithm that does not require using the strong convexity parameter to set the step size, and also can tolerate moderate noise in $\\widehat\\eta_t(x)$ to enjoy a fast convergence rate of $\\widetilde O(1/(\\sigma n))$ for minimizing Q(\u03b8). We present the Fast Online F-measure Optimization (FOFO) in Algorithm 1 and Algorithm 2, where $\\widehat\\pi_t$ in Step 12 is the estimate of \u03c0 up to the t-th iteration, i.e., $\\widehat\\pi_t = \\sum_{\\tau=1}^{t} y_\\tau / t$, and $\\widehat Q_t(\\theta) = \\frac{1}{2}(\\widehat\\eta_t(x_t) - \\theta)_+^2 + \\frac{1}{2}\\widehat\\pi_t \\theta^2$. It is worth mentioning that Steps 5 and 15 in Algorithm 2 can be replaced by other online learners of the posterior probability which are not necessarily restricted to the generalized linear model (2). The main algorithm FOFO is presented in a way that facilitates the analysis. The updates of FOFO are divided into m stages, where at each stage a stochastic F-measure optimization method (Algorithm 2) is called for running n0 iterations. For each received example $x_t$, the prediction $\\hat y_t$ is computed by thresholding the current estimate of the posterior probability $\\widehat\\eta_t(x_t)$ by the current value of \u03b8. At each iteration of a stage k, we use the gradient of $\\widehat Q_t(\\theta)$ to update \u03b8 with a constant step size \u03b3k and project it into W \u2229 B(\u03b81, Rk\u22121), where \u03b81 is the initial solution of this stage and Rk\u22121 is a radius parameter. The step-size \u03b3k and radius parameter Rk\u22121 are changed according to Step 3 in Algorithm 1. 
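As a minimal sketch of the stage schedule in Algorithm 1 and the projected threshold update in Algorithm 2, the Python below works on scalar inputs. The function names (stage_schedule, sfo_threshold_step), the argument names, and the guard that keeps at least one stage are illustrative assumptions, not part of the paper. Because the set W intersected with B(theta_1, R) is an interval, the projection reduces to clipping.

```python
import math


def stage_schedule(n, r0=0.5):
    # Step 1 of Algorithm 1: number of stages m and iterations per stage n0.
    # Assumes n > 2 so that log2(n) is positive (the analysis takes n > 100).
    m = math.floor(0.5 * math.log2(2 * n / math.log2(n))) - 1
    m = max(m, 1)  # assumption: keep at least one stage for small n
    n0 = n // m
    schedule = []
    radius = r0
    for _ in range(m):
        # Step 3: step size from the current radius, then halve the radius.
        gamma = radius / math.sqrt(10 * n0)
        schedule.append((gamma, radius))
        radius /= 2
    return n0, schedule


def sfo_threshold_step(theta, theta_start, eta_hat, pi_hat, gamma, radius):
    # Gradient of Q_hat_t(theta) = 0.5 * (eta_hat - theta)_+^2 + 0.5 * pi_hat * theta^2.
    grad = -max(eta_hat - theta, 0.0) + pi_hat * theta
    candidate = theta - gamma * grad
    # Projection onto [0, 0.5] intersected with [theta_start - radius, theta_start + radius].
    low = max(0.0, theta_start - radius)
    high = min(0.5, theta_start + radius)
    return max(low, min(high, candidate))
```

For instance, stage_schedule(1024) yields two stages of 512 iterations each, with radii 0.5 and 0.25; each call to sfo_threshold_step then performs one update of Step 13 using the step size and radius of the current stage.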
We remark that the multi-stage scheme of Algorithm 1 (especially the setting of m and n0) is due to [8], and was also used in several stochastic optimization algorithms for solving different problems [11, 12]. However, the difference from these studies is that $\\partial \\widehat Q_t(\\theta)$ is not an unbiased stochastic gradient of Q(\u03b8).\n\n5 Convergence and Consistency Results of FOFO\n\nIn this section, we further justify the proposed FOFO by presenting a convergence result of FOFO for learning the optimal threshold \u03b8\u2217 and a consistency result of FOFO for F-measure optimization. Omitted proofs can be found in the supplement.\n\nFor simplicity of analysis, we let \u03c6(x) = x and w.l.o.g. assume that feature vectors are bounded by a positive number \u03ba, i.e., $\\sup_{x \\in \\mathcal X} \\|x\\|_2 \\le \\kappa$. As mentioned in the last section, $\\partial \\widehat Q_t(\\theta)$ is not an unbiased stochastic gradient of Q(\u03b8). Nevertheless, we expect that $\\partial \\widehat Q_t(\\theta)$ gets close to an unbiased stochastic gradient of Q(\u03b8) as $\\widehat\\eta_t(x)$ converges to \u03b7(x). To formalize this notion, we will introduce the following assumption about convergence of the model for the posterior probability.\n\nAssumption 1. Assume there exists an online algorithm A that learns the posterior probability at a rate of $1/\\sqrt{t}$, i.e., by learning with t examples the error of the learned posterior probability $\\widehat\\eta_t(\\cdot)$ is $\\max_{x \\in \\mathcal X} |\\widehat\\eta_t(x) - \\eta(x)| \\le O(1/\\sqrt{t})$ with high probability.\n\nRemark: Please note that this is a high-level assumption without imposing any form of the posterior probability. The assumed $O(1/\\sqrt{t})$ rate for learning the posterior probability is the minimal assumption for achieving an $O(1/\\sqrt{n})$ rate for learning the threshold. 
It is worth mentioning that estimating the posterior probability can be considered as a special problem of statistical density estimation, which has been studied in the literature with a convergence rate as fast as O(1/t) [25, 6, 21]. We provide a justification here for the considered generalized linear model (2), which is Lipschitz continuous with respect to w. To this end, it suffices to assume that there exists an algorithm A as in (4) that produces solutions $\\bar w_t$ converging to w\u2217 at a rate of $O(1/\\sqrt{t})$ with high probability, i.e., we have $\\|\\bar w_t - w_*\\|_2 \\le C'/\\sqrt{t}$ with high probability, where $\\bar w_t$ is a solution computed from w1, . . . , wt and C' is a problem-dependent value. This can be justified as follows. Note that the objective function L(w) in (3) is strongly convex for w in a compact domain if the data covariance matrix is nonsingular [1]. Thus the SGD method for strongly convex functions using a suffix-averaging solution $\\bar w_t = (w_{(1-\\alpha)t+1} + \\ldots + w_t)/(\\alpha t)$ with \u03b1 \u2208 (0, 1) can have an O(log(1/\u03b4)/t) convergence rate for minimizing L(w), i.e., $L(\\bar w_t) - L(w_*) \\le O(\\log(1/\\delta)/t)$ [17], which implies $\\|\\bar w_t - w_*\\|_2 \\le C'/\\sqrt{t}$ with a high probability 1 \u2212 \u03b4 for \u03b4 \u2208 (0, 1) and $C' = O(\\sqrt{\\log(1/\\delta)})$. When the covariance matrix is singular, by Corollary 7 of [22], a quadratic growth condition is satisfied and we can still get the O(1/t) convergence for minimizing L(w) via the accelerated stochastic subgradient method proposed in [22]. Even though the strong convexity parameter of L(w) is not exploited, the stochastic approximation algorithm proposed in [2] with a constant step size also has an O(1/t) convergence rate of an averaged solution $\\bar w_t$ for minimizing L(w) with a large probability (e.g., 0.99). 
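To make the suffix-averaging solution mentioned above concrete, here is a minimal Python sketch that averages the last alpha * t of the iterates w_1, ..., w_t. The function name suffix_average and the list-of-lists representation are assumptions for illustration; when alpha * t is not an integer the count is rounded, which is a choice made here and not something the paper specifies.

```python
def suffix_average(iterates, alpha):
    # iterates = [w_1, ..., w_t], each a list of floats of the same length.
    # Averages the last k = alpha * t iterates (rounded, at least one).
    t = len(iterates)
    k = max(1, min(t, round(alpha * t)))
    tail = iterates[t - k:]
    dim = len(tail[0])
    return [sum(w[j] for w in tail) / k for j in range(dim)]
```

With four scalar iterates 1, 2, 3, 4 and alpha = 0.5, only the last two are averaged, so early iterates that are still far from the optimum do not contribute.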
These results would imply an $O(1/\\sqrt{t})$ convergence for $\\|\\bar w_t - w_*\\|$ for an optimal solution $w_* \\in \\arg\\min_{w \\in W} L(w)$.\n\nIn the following results, we will assume that $\\max_{x \\in \\mathcal X} |\\widehat\\eta_t(x) - \\eta(x)| \\le O(\\sqrt{\\log(1/\\delta)/t})$ holds with a high probability 1 \u2212 \u03b4. The results can be extended to the case that it holds with a large constant probability.\n\nWe first state the convergence result of one stage of FOFO, i.e., SFO.\n\nTheorem 2 (Convergence Result of SFO). Suppose Assumption 1 holds with high probability 1 \u2212 \u03b4 in the sense that $\\max_{x \\in \\mathcal X} |\\widehat\\eta_t(x) - \\eta(x)| \\le C\\kappa\\sqrt{\\log(1/\\delta)}/\\sqrt{t}$. If $|\\theta_1 - \\theta_*| \\le R$, running Algorithm 2 for T iterations with $\\gamma = \\frac{R}{\\sqrt{10T}}$, we have with probability at least 1 \u2212 \u03b4,\n\n$Q(\\bar\\theta_T) - Q(\\theta_*) \\le \\frac{2\\sqrt{10}R + R(20 + 4C\\kappa)\\sqrt{\\ln(12T/\\delta)}}{\\sqrt{T}}$.\n\nRemark: The above result indicates that FOFO has at least an $O(1/\\sqrt{n})$ convergence for optimizing Q(\u03b8). The next theorem establishes a faster convergence $\\widetilde O(1/n)$ by utilizing the above result and the multi-stage scheme of FOFO.\n\nTheorem 3 (Convergence Result of FOFO). Given \u03b4 \u2208 (0, 1), suppose the same condition as in Theorem 2 holds and n is sufficiently large such that n > 100. Then with probability at least 1 \u2212 \u03b4,\n\n$Q(\\widehat\\theta_m) - Q(\\theta_*) \\le \\widetilde O\\left(\\frac{\\log(1/\\delta)}{\\sigma n}\\right)$.\n\nRemark: Since Q(\u03b8) is \u03c3-strongly convex, the above result implies that $|\\widehat\\theta_m - \\theta_*| \\le \\widetilde O(1/(\\sigma\\sqrt{n}))$. Finally, by the convergence of $\\widehat\\theta_m$ and $\\widehat\\eta_n(x)$, we can establish the consistency of FOFO for F-measure optimization by using Proposition 13 in [15].\n\nTheorem 4 (Consistency of F-measure Optimization). 
Suppose Assumption 1 holds, then FOFO is F-measure consistent, i.e., the final binary classifier $[\\widehat\\eta_n(x) \\ge \\widehat\\theta_m]$ is F-measure consistent.\n\nThe two ingredients of this result are the bound $|\\widehat\\theta_m - \\theta_*| \\le \\widetilde O(1/(\\sigma\\sqrt{n}))$, which follows from Theorem 3 because Q(\u03b8) is \u03c3-strongly convex, and the convergence of $\\widehat\\eta_n(x)$.\n\nWe would like to mention that for establishing the consistency of FOFO for F-measure optimization a weaker assumption about the convergence of $\\widehat\\eta_t$ can be used. As long as $\\max_{x \\in \\mathcal X} |\\widehat\\eta_t(x) - \\eta(x)| \\le O(1/t^\\alpha)$ for some \u03b1 > 0, a convergence result can be established for $\\widehat\\theta_m$, which implies the consistency of FOFO for F-measure optimization by Proposition 13 in [15].\n\nExtension to Other Metrics. Before ending this section, we would like to mention that the proposed method can be extended to other non-decomposable metrics such as Jaccard similarity coefficient (JAC) and F\u03b2 measure. We present more details in the supplement for interested readers.\n\n6 Experiments\n\nWe present some experimental results in this section. We will compare with four baselines including online learning with logistic loss using 0.5 for thresholding the posterior probability (referred to as LR), STAMP [14], OMCSL [23], and OFO [3]. The last three are representative algorithms from the three categories. For LR, OFO and FOFO, we use the same SGD for learning the posterior probability by minimizing the logistic loss. To be fair, we implement the STAMP algorithm with a logistic loss based reward function, and also use logistic loss for OMCSL.\n\nWe evaluate the performance on 25 binary classification tasks from seven benchmark datasets (covtype, webspam, a9a, ijcnn1, w8a, sensorless, protein). All the datasets involved are downloaded from the LIBSVM repository [5]. 
It is notable that covtype, sensorless and protein are multi-class\ndatasets. We construct binary tasks following the one-vs-others scheme, denoted by \u201cX vs o\u201d below.\nEach dataset is divided into three parts (1:1:1) for online training, online validation and offline testing.\nThe validation data is used to select the best parameters: we run each considered algorithm and\nkeep the setting with the highest final F-measure. In particular, for FOFO, OFO, and\nLR, we tune the initial step parameter for learning the posterior probability in the range $2^{[-4:1:4]}$.\nFor STAMP and OMCSL, the stepsize parameter is tuned in $2^{[-8:1:4]}$. For OMCSL, we use 10\nsettings for the weights and learn 10 classifiers online. For each dataset, we repeat the experiments 10\ntimes on 10 randomly shuffled copies of the data and report the average and variance. We report\nonline performance (evaluation of predictions on historical examples) on the online training data, and\noffline performance by evaluating the final models on the testing data.\nDue to limitation of space, we only report part of the results (complete results are included in the\nsupplement). The online performance (F-score vs iterations) is plotted in Figures 1 and 2 for the covtype\ndatasets and the other datasets, respectively. The F-measure vs running time is plotted in Figures 3 and 4, and the offline\ntesting performance is reported in the supplement. We first consider the results on covtype.\nFrom (a) to (f) in Figure 1, where p = x% denotes the percentage of positive examples, we organize\nthe data in the order of increasing imbalance. We can see that as the data becomes more imbalanced, the\nimprovement of our algorithm FOFO over the baselines becomes larger. On the datasets that are more\nbalanced (covtype 2 vs o, 1 vs o), the difference between FOFO and OFO is small, and they both\noutperform the other baselines. 
When the datasets become highly imbalanced (covtype 6 vs o, 5 vs\no), the baseline LR becomes much worse, and the margin of FOFO over OFO becomes larger.\nThe comparison between FOFO and OFO verifies that the proposed method for learning the optimal\nthreshold converges faster, as they share the same component for learning the posterior probability.\nThe comparison between FOFO and the other baselines verifies that the proposed method is better than\nother categories of algorithms. The results on other datasets in Figure 2 also demonstrate that FOFO\nis faster than OFO and other baselines, especially for highly imbalanced data. The running time\nresults in Figures 3 and 4 also verify that the proposed FOFO algorithm is the most efficient.\n\nFigure 1: Online Performance of F-measure for covtype dataset. Panels: (a) cov (2 vs o) (p=48.76%); (b) cov (1 vs o) (p=36.46%); (c) cov (3 vs o) (p=6.15%); (d) cov (7 vs o) (p=3.53%); (e) cov (6 vs o) (p=2.99%); (f) cov (5 vs o) (p=1.63%). x-axis: iteration; y-axis: F-score; curves: FOFO, OFO, LR, STAMP, OMCSL.\n\nFigure 2: Online Performance of F-measure for other datasets. Panels: (a) webspam (p=60.72%); (b) protein (0 vs o) (p=46.14%); (c) a9a (p=24.08%); (d) ijcnn1 (p=9.49%); (e) Sensorless (1 vs o) (p=9.09%); (f) w8a (p=2.97%). x-axis: iteration; y-axis: F-score; curves: FOFO, OFO, LR, STAMP, OMCSL.\n\n7 Conclusions\nIn this paper, we proposed a fast online F-measure optimization algorithm with low memory and\ncomputational costs by learning the optimal threshold for a probabilistic classifier. A novel stochastic\nalgorithm was proposed for learning the optimal threshold. We prove that the proposed algorithm\nfor learning of the optimal threshold has a convergence rate $\\widetilde{O}(1/\\sqrt{n})$, and the proposed algorithm\nenjoys F-measure consistency at the population level. 
Extensive experimental results comparing\nwith state-of-the-art online F-measure optimization algorithms also demonstrate the efficiency of the\nproposed algorithm, especially on highly imbalanced datasets.\n\nFigure 3: Online F-measure vs Running Time for covtype dataset. Panels: (a) cov (2 vs o); (b) cov (1 vs o); (c) cov (3 vs o); (d) cov (7 vs o); (e) cov (6 vs o); (f) cov (5 vs o).\n\nFigure 4: Online F-measure vs Running Time for other datasets. Panels: (a) webspam; (b) protein (0 vs o); (c) a9a; (d) ijcnn1; (e) Sensorless (1 vs o); (f) w8a.\n\nAcknowledgement\n\nThe authors thank the anonymous reviewers for their helpful comments. M. Liu, X. 
Zhang and T.\nYang are partially supported by National Science Foundation (IIS-1545995).\n\nReferences\n[1] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Advances in Neural Information Processing Systems 25 (NIPS), pages 1547\u20131555, 2012.\n\n[2] Francis R. Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), pages 773\u2013781, 2013.\n\n[3] R\u00f3bert Busa-Fekete, Bal\u00e1zs Sz\u00f6r\u00e9nyi, Krzysztof Dembczynski, and Eyke H\u00fcllermeier. Online F-measure optimization. In Advances in Neural Information Processing Systems, pages 595\u2013603, 2015.\n\n[4] Nicolo Cesa-Bianchi, Philip M Long, and Manfred K Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604\u2013619, 1996.\n\n[5] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. 
ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.\n\n[6] Peter D. Gr\u00fcnwald. The Minimum Description Length Principle (Adaptive Computation and Machine Learning). The MIT Press, 2007.\n\n[7] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 377\u2013384, 2005.\n\n[8] Anatoli Juditsky, Yuri Nesterov, et al. Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems, 4(1):44\u201380, 2014.\n\n[9] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems 27 (NIPS), pages 694\u2013702, 2014.\n\n[10] Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1\u201363, 1997.\n\n[11] Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, and Tianbao Yang. Fast stochastic AUC maximization with O(1/n)-convergence rate. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 3195\u20133203, 2018.\n\n[12] Mingrui Liu, Xiaoxuan Zhang, Lijun Zhang, Rong Jin, and Tianbao Yang. Fast rates of ERM and stochastic approximation: Adaptive to error bound conditions. In Advances in Neural Information Processing Systems (NIPS), pages \u2013, 2018.\n\n[13] Maciej A Mazurowski, Piotr A Habas, Jacek M Zurada, Joseph Y Lo, Jay A Baker, and Georgia D Tourassi. Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks, 21(2-3):427\u2013436, 2008.\n\n[14] Harikrishna Narasimhan, Purushottam Kar, and Prateek Jain. Optimizing non-decomposable performance measures: A tale of two classes. 
In International Conference on Machine Learning, pages 199\u2013208, 2015.\n\n[15] Nagarajan Natarajan, Oluwasanmi Koyejo, Pradeep Ravikumar, and Inderjit S. Dhillon. Consistent binary classification with generalized performance metrics. In Neural Information Processing Systems (NIPS), 2014.\n\n[16] Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems 27, pages 2123\u20132131, 2014.\n\n[17] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of International Conference on Machine Learning (ICML), 2012.\n\n[18] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.\n\n[19] Clayton Scott. Calibrated asymmetric surrogate losses. Electron. J. Statist., 6:958\u2013992, 2012.\n\n[20] Yuchun Tang, Sven Krasser, Paul Judge, and Yan-Qing Zhang. Fast and effective spam sender detection with granular SVM on highly imbalanced mail server behavior data. In Collaborative Computing: Networking, Applications and Worksharing, 2006. CollaborateCom 2006. International Conference on, pages 1\u20136. IEEE, 2006.\n\n[21] Tim van Erven, Peter D. Gr\u00fcnwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 2015.\n\n[22] Yi Xu, Qihang Lin, and Tianbao Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3821\u20133830, 2017.\n\n[23] Yan Yan, Tianbao Yang, Yi Yang, and Jianhui Chen. A framework of online learning with imbalanced streaming data. 
In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 2817\u20132823, 2017.\n\n[24] Nan Ye, Kian Ming Adam Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measure: A tale of two approaches. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.\n\n[25] Tong Zhang. From \u03b5-entropy to KL-entropy: Analysis of minimum information complexity density estimation. Ann. Statist., 34(5):2180\u20132210, 10 2006.\n\n[26] Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, and Gavin Brown. Beyond Fano\u2019s inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. Journal of Machine Learning Research, 14(Apr):1033\u20131090, 2013.\n\n[27] Peilin Zhao and Steven CH Hoi. Cost-sensitive online active learning with application to malicious URL detection. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 919\u2013927. ACM, 2013.\n\n[28] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning (ICML), pages 928\u2013936, 2003.\n", "award": [], "sourceid": 1916, "authors": [{"given_name": "Xiaoxuan", "family_name": "Zhang", "institution": "University of Iowa"}, {"given_name": "Mingrui", "family_name": "Liu", "institution": "The University of Iowa"}, {"given_name": "Xun", "family_name": "Zhou", "institution": "University of Iowa"}, {"given_name": "Tianbao", "family_name": "Yang", "institution": "The University of Iowa"}]}