{"title": "Efficient Algorithms for Non-convex Isotonic Regression through Submodular Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 10, "abstract": "We consider the minimization of submodular functions subject to ordering constraints. We show that this potentially non-convex optimization problem can be cast as a convex optimization problem on a space of uni-dimensional measures, with ordering constraints corresponding to first-order stochastic dominance. We propose new discretization schemes that lead to simple and efficient algorithms based on zero-th, first, or higher order oracles; these algorithms also lead to improvements without isotonic constraints. Finally, our experiments show that non-convex loss functions can be much more robust to outliers for isotonic regression, while still being solvable in polynomial time.", "full_text": "Efficient Algorithms for Non-convex Isotonic Regression through Submodular Optimization\n\nFrancis Bach\n\nInria\n\nD\u00e9partement d\u2019Informatique de l\u2019Ecole Normale Sup\u00e9rieure\n\nPSL Research University, Paris, France\n\nfrancis.bach@ens.fr\n\nAbstract\n\nWe consider the minimization of submodular functions subject to ordering constraints. We show that this potentially non-convex optimization problem can be cast as a convex optimization problem on a space of uni-dimensional measures, with ordering constraints corresponding to first-order stochastic dominance. We propose new discretization schemes that lead to simple and efficient algorithms based on zero-th, first, or higher order oracles; these algorithms also lead to improvements without isotonic constraints. 
Finally, our experiments show that non-convex loss functions can be much more robust to outliers for isotonic regression, while still being solvable in polynomial time.\n\n1 Introduction\n\nShape constraints such as ordering constraints appear everywhere in estimation problems in machine learning, signal processing and statistics. They typically correspond to prior knowledge, and are imposed for the interpretability of models, or to allow non-parametric estimation with improved convergence rates [16, 8]. In this paper, we focus on imposing ordering constraints in an estimation problem, a setting typically referred to as isotonic regression [4, 26, 22], and we aim to generalize the set of problems for which efficient (i.e., polynomial-time) algorithms exist.\n\nWe thus focus on the following optimization problem:\n\n$$\\min_{x \\in [0,1]^n} H(x) \\quad \\text{such that} \\quad \\forall (i,j) \\in E, \\; x_i \\geq x_j, \\tag{1}$$\n\nwhere $E \\subset \\{1, \\dots, n\\}^2$ represents the set of constraints, which form a directed acyclic graph. For simplicity, we restrict $x$ to the set $[0,1]^n$, but our results extend to general products of (potentially unbounded) intervals.\n\nAs convex constraints, isotonic constraints are well-adapted to estimation problems formulated as convex optimization problems where $H$ is convex, such as for linear supervised learning problems, with many efficient algorithms for separable convex problems [4, 26, 22, 30], which can thus be used as inner loops in more general convex problems by using projected gradient methods (see, e.g., [3]).\n\nIn this paper, we show that another form of structure can be leveraged. We will assume that $H$ is submodular, which is equivalent, when twice continuously differentiable, to having nonpositive cross second-order derivatives. 
This notably includes all (potentially non-convex) separable functions (i.e., sums of functions that depend on single variables), but also many other examples (see Section 2).\n\nMinimizing submodular functions on continuous domains has been recently shown to be equivalent to a convex optimization problem on a space of uni-dimensional measures [2], and given that the functions $x \\mapsto \\lambda (x_j - x_i)_+$ are submodular for any $\\lambda > 0$, it is natural that by using $\\lambda$ tending to $+\\infty$, we recover as well a convex optimization problem; the main contribution of this paper is to provide a simple framework based on stochastic dominance, for which we design efficient algorithms which are based on simple oracles on the function $H$ (typically access to function values and derivatives). In order to obtain such algorithms, we go significantly beyond [2] by introducing novel discretization algorithms that also provide improvements without any isotonic constraints.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nMore precisely, we make the following contributions:\n\n\u2013 We show in Section 3 that minimizing a submodular function with isotonic constraints can be cast as a convex optimization problem on a space of uni-dimensional measures, with isotonic constraints corresponding to first-order stochastic dominance.\n\n\u2013 On top of the naive discretization schemes presented in Section 4, we propose in Section 5 new discretization schemes that lead to simple and efficient algorithms based on zero-th, first, or higher order oracles. 
They go from requiring $O(1/\\varepsilon^3) = O(1/\\varepsilon^{2+1})$ function evaluations to reach a precision $\\varepsilon$, to $O(1/\\varepsilon^{5/2}) = O(1/\\varepsilon^{2+1/2})$ and $O(1/\\varepsilon^{7/3}) = O(1/\\varepsilon^{2+1/3})$.\n\n\u2013 Our experiments in Section 6 show that non-convex loss functions can be much more robust to outliers for isotonic regression.\n\n2 Submodular Analysis in Continuous Domains\n\nIn this section, we review the framework of [2] that shows how to minimize submodular functions using convex optimization.\n\nDefinition. Throughout this paper, we consider a continuous function $H : [0,1]^n \\to \\mathbb{R}$. The function $H$ is said to be submodular if and only if [21, 29]:\n\n$$\\forall (x,y) \\in [0,1]^n \\times [0,1]^n, \\quad H(x) + H(y) \\geq H(\\min\\{x,y\\}) + H(\\max\\{x,y\\}), \\tag{2}$$\n\nwhere the min and max operations are applied component-wise. If $H$ is twice continuously differentiable, then this is equivalent to $\\frac{\\partial^2 H}{\\partial x_i \\partial x_j}(x) \\leq 0$ for any $i \\neq j$ and $x \\in [0,1]^n$ [29].\n\nThe cone of submodular functions on $[0,1]^n$ is invariant by marginal strictly increasing transformations, and includes all functions that depend on a single variable (which play the role of linear functions for convex functions), which we refer to as separable functions.\n\nExamples. The classical examples are: (a) any separable function, (b) convex functions of the difference of two components, (c) concave functions of a positive linear combination, (d) negative log densities of multivariate totally positive distributions [17]. See Section 6 for a concrete example.\n\nExtension on a space of measures. We consider the convex set $P([0,1])$ of Radon probability measures [24] on $[0,1]$, which is the closure (for the weak topology) of the convex hull of all Dirac measures. In order to get an extension, we look for a function defined on the set of products of probability measures $\\mu \\in P([0,1])^n$, such that if all $\\mu_i$, $i = 1, \\dots, n$, are Dirac measures at points $x_i \\in [0,1]$, then we have a function value equal to $H(x_1, \\dots, x_n)$. 
Note that $P([0,1])^n$ is different from $P([0,1]^n)$, which is the set of probability measures on $[0,1]^n$.\n\nFor a probability distribution $\\mu_i \\in P([0,1])$ defined on $[0,1]$, we can define the (reversed) cumulative distribution function $F_{\\mu_i} : [0,1] \\to [0,1]$ as $F_{\\mu_i}(x_i) = \\mu_i([x_i, 1])$. This is a non-increasing left-continuous function from $[0,1]$ to $[0,1]$, such that $F_{\\mu_i}(0) = 1$ and $F_{\\mu_i}(1) = \\mu_i(\\{1\\})$. See illustrations in the left plot of Figure 1.\n\nWe can then define the \u201cinverse\u201d cumulative function from $[0,1]$ to $[0,1]$ as $F_{\\mu_i}^{-1}(t_i) = \\sup\\{x_i \\in [0,1], \\; F_{\\mu_i}(x_i) \\geq t_i\\}$. The function $F_{\\mu_i}^{-1}$ is non-increasing and right-continuous, and such that $F_{\\mu_i}^{-1}(1) = \\min \\operatorname{supp}(\\mu_i)$ and $F_{\\mu_i}^{-1}(0) = 1$. Moreover, we have $F_{\\mu_i}(x_i) \\geq t_i \\Leftrightarrow F_{\\mu_i}^{-1}(t_i) \\geq x_i$.\n\nThe extension from $[0,1]^n$ to the set of product probability measures is obtained by considering a single threshold $t$ applied to all $n$ cumulative distribution functions, that is:\n\n$$\\forall \\mu \\in P([0,1])^n, \\quad h(\\mu_1, \\dots, \\mu_n) = \\int_0^1 H\\big[F_{\\mu_1}^{-1}(t), \\dots, F_{\\mu_n}^{-1}(t)\\big]\\,dt. \\tag{3}$$\n\n[Figure 1 (plots omitted). Left panel legend: $F_\\mu^{-1}$, $F_\\mu$, $\\mu$. Right panel legend: $F_\\nu$, $F_\\mu$, $\\nu$, $\\mu$.]\n\nFigure 1: Left: cumulative and inverse cumulative distribution functions with the corresponding density (with respect to the Lebesgue measure). Right: cumulative functions for two distributions $\\mu$ and $\\nu$ such that $\\mu \\succcurlyeq \\nu$.\n\nWe have the following properties when $H$ is submodular: (a) it is an extension, that is, if for all $i$, $\\mu_i$ is a Dirac at $x_i$, then $h(\\mu) = H(x)$; (b) it is convex; (c) minimizing $h$ on $P([0,1])^n$ and minimizing $H$ on $[0,1]^n$ is equivalent; moreover, the minimal values are equal and $\\mu$ is a minimizer if and only if $\\big[F_{\\mu_1}^{-1}(t), \\dots, F_{\\mu_n}^{-1}(t)\\big]$ is a minimizer of $H$ for almost all $t \\in [0,1]$. Thus, submodular minimization is equivalent to a convex optimization problem in a space of uni-dimensional measures.\n\nNote that the extension is defined on all tuples of measures $\\mu = (\\mu_1, \\dots, \\mu_n)$ but it can equivalently be defined through non-increasing functions from $[0,1]$ to $[0,1]$, e.g., the representation in terms of cumulative distribution functions $F_{\\mu_i}$ defined above (this representation will be used in Section 4 where algorithms based on the discretization of the equivalent obtained convex problem are discussed).\n\n3 Isotonic Constraints and Stochastic Dominance\n\nIn this paper, we consider the following problem:\n\n$$\\inf_{x \\in [0,1]^n} H(x) \\quad \\text{such that} \\quad \\forall (i,j) \\in E, \\; x_i \\geq x_j, \\tag{4}$$\n\nwhere $E$ is the edge set of a directed acyclic graph on $\\{1, \\dots, n\\}$ and $H$ is submodular. We denote by $X \\subset \\mathbb{R}^n$ (not necessarily a subset of $[0,1]^n$) the set of $x \\in \\mathbb{R}^n$ satisfying the isotonic constraints.\n\nIn order to define an extension in a space of measures, we consider a specific order on measures on $[0,1]$, namely first-order stochastic dominance [20], defined as follows.\n\nGiven two distributions $\\mu$ and $\\nu$ on $[0,1]$, with (reversed) cumulative distribution functions $F_\\mu$ and $F_\\nu$, we have $\\mu \\succcurlyeq \\nu$, if and only if $\\forall x \\in [0,1]$, $F_\\mu(x) \\geq F_\\nu(x)$, or equivalently, $\\forall t \\in [0,1]$, $F_\\mu^{-1}(t) \\geq F_\\nu^{-1}(t)$. As shown in the right plot of Figure 1, the densities may still overlap. An equivalent characterization [19, 9] is the existence of a joint distribution on a vector $(X, X') \\in \\mathbb{R}^2$ with marginals $\\mu(x)$ and $\\nu(x')$ and such that $X \\geq X'$ almost surely (see footnote 1). We now prove the main proposition of the paper:\n\nProposition 1 We consider the convex minimization problem:\n\n$$\\inf_{\\mu \\in P([0,1])^n} h(\\mu) \\quad \\text{such that} \\quad \\forall (i,j) \\in E, \\; \\mu_i \\succcurlyeq \\mu_j. \\tag{5}$$\n\nProblems in Eq. (4) and Eq. (5) have the same objective values. Moreover, $\\mu$ is a minimizer of Eq. (5) if and only if $F_\\mu^{-1}(t)$ is a minimizer of $H$ of Eq. (4) for almost all $t \\in [0,1]$.\n\nProof We denote by $\\mathcal{M}$ the set of $\\mu \\in P([0,1])^n$ satisfying the stochastic ordering constraints. For any $x \\in [0,1]^n$ that satisfies the constraints in Eq. (4), i.e., $x \\in X \\cap [0,1]^n$, the associated Dirac measures satisfy the constraint in Eq. (5). Therefore, the objective value $M$ of Eq. (4) is greater or equal to the one $M'$ of Eq. (5). Given a minimizer $\\mu$ for the convex problem in Eq. (5), we have: $M \\geq M' = h(\\mu) = \\int_0^1 H\\big[F_{\\mu_1}^{-1}(t), \\dots, F_{\\mu_n}^{-1}(t)\\big]\\,dt \\geq \\int_0^1 M \\,dt = M$, where the last inequality holds because, for every $t$, the stochastic ordering constraints make the vector $\\big[F_{\\mu_1}^{-1}(t), \\dots, F_{\\mu_n}^{-1}(t)\\big]$ feasible for Eq. (4). This shows the proposition by studying the equality cases above.\n\nFootnote 1: Such a joint distribution may be built as the distribution of $(F_\\mu^{-1}(T), F_\\nu^{-1}(T))$, where $T$ is uniformly distributed in $[0,1]$.\n\nAlternatively, we could add the penalty term $\\lambda \\sum_{(i,j) \\in E} \\int_{-\\infty}^{+\\infty} (F_{\\mu_j}(z) - F_{\\mu_i}(z))_+ \\,dz$, which corresponds to the unconstrained minimization of $H(x) + \\lambda \\sum_{(i,j) \\in E} (x_j - x_i)_+$. For $\\lambda > 0$ big enough (see footnote 2), this is equivalent to the problem above, but with a submodular function which has a large Lipschitz constant (and is thus harder to optimize with the iterative methods presented below).\n\n4 Discretization algorithms\n\nProp. 
1 shows that the isotonic regression problem with a submodular cost can be cast as a convex optimization problem; however, this is achieved in a space of measures, which cannot be handled directly computationally in polynomial time. Following [2], we consider a polynomial time and space discretization scheme of each interval $[0,1]$ (and not of $[0,1]^n$), but we propose in Section 5 an improvement that significantly reduces the number of discrete points. All pseudo-codes for the algorithms are available in Appendix B.\n\n4.1 Review of submodular optimization in discrete domains\n\nAll our algorithms will end up minimizing approximately a submodular function $F$ on $\\{0, \\dots, k-1\\}^n$, that is, which satisfies Eq. (2). Isotonic constraints will be added in Section 4.2.\n\nFollowing [2], this can be formulated as minimizing a convex function $f_\\downarrow$ on the set of $\\rho \\in [0,1]^{n \\times (k-1)}$ so that for each $i \\in \\{1, \\dots, n\\}$, $(\\rho_{ij})_{j \\in \\{1, \\dots, k-1\\}}$ is a non-increasing sequence (we denote by $S$ this set of constraints) corresponding to the cumulative distribution function. For any feasible $\\rho$, a subgradient of $f_\\downarrow$ may be computed by sorting all $n(k-1)$ elements of the matrix $\\rho$ and computing at most $n(k-1)$ values of $F$. An approximate minimizer of $F$ (which exactly inherits approximation properties from the approximate optimality of $\\rho$) is then obtained by selecting the minimum value of $F$ in the computation of the subgradient. Projected subgradient methods can then be used, and if $\\Delta F$ is the largest absolute difference in values of $F$ when a single variable is changed by $\\pm 1$, we obtain an $\\varepsilon$-minimizer (for function values) after $t$ iterations, with $\\varepsilon \\leq nk \\Delta F / \\sqrt{t}$.\n\nThe projection step is composed of $n$ simple separable quadratic isotonic regressions with chain constraints in dimension $k$, which can be solved easily in $O(nk)$ using the pool-adjacent-violator algorithm [4]. 
Computing a subgradient requires a sorting operation, which is thus $O(nk \\log(nk))$. See more details in [2].\n\nAlternatively, we can minimize the strongly-convex $f_\\downarrow(\\rho) + \\frac{1}{2} \\|\\rho\\|_F^2$ on the set of $\\rho \\in \\mathbb{R}^{n \\times (k-1)}$ so that for each $i$, $(\\rho_{ij})_j$ is a non-increasing sequence, that is, $\\rho \\in S$ (the constraints that $\\rho_{ij} \\in [0,1]$ are dropped). We then get a minimizer $z$ of $F$ by looking for all $i \\in \\{1, \\dots, n\\}$ at the largest $j \\in \\{1, \\dots, k-1\\}$ such that $\\rho_{ij} \\geq 0$. We take then $z_i = j$ (and if no such $j$ exists, $z_i = 0$). A gap of $\\varepsilon$ in the problem above leads to a gap of $\\sqrt{\\varepsilon nk}$ for the original problem (see more details in [2]). The subgradient method in the primal, or Frank-Wolfe algorithm in the dual may be used for this problem. We obtain an $\\varepsilon$-minimizer (for function values) after $t$ iterations, with $\\varepsilon \\leq \\Delta F / t$, which leads for the original submodular minimization problem to the same optimality guarantees as above, but with a faster algorithm in practice. See the detailed computations and comparisons in [2].\n\n4.2 Naive discretization scheme\n\nFollowing [2], we simply discretize $[0,1]$ by selecting the $k$ values $\\frac{i}{k-1}$ or $\\frac{2i+1}{2k}$, for $i \\in \\{0, \\dots, k-1\\}$. If the function $H : [0,1]^n \\to \\mathbb{R}$ is $L_1$-Lipschitz-continuous with respect to the $\\ell_1$-norm, that is $|H(x) - H(x')| \\leq L_1 \\|x - x'\\|_1$, the function $F$ is $(L_1/k)$-Lipschitz-continuous with respect to the $\\ell_1$-norm (and thus we have $\\Delta F \\leq L_1/k$ above). Moreover, if $F$ is minimized up to $\\varepsilon$, $H$ is optimized up to $\\varepsilon + nL_1/k$.\n\nIn order to take into account the isotonic constraints, we simply minimize with respect to $\\rho \\in [0,1]^{n \\times (k-1)} \\cap S$, with the additional constraint that for all $j \\in \\{1, \\dots, k-1\\}$, $\\forall (a,b) \\in E$, $\\rho_{a,j} \\geq \\rho_{b,j}$. This corresponds to additional constraints $T \\subset \\mathbb{R}^{n \\times (k-1)}$.\n\nFootnote 2: A short calculation shows that when $H$ is differentiable, the first-order optimality condition (which is only necessary here) implies that if $\\lambda$ is strictly larger than $n$ times the largest possible partial first-order derivative of $H$, the isotonic constraints have to be satisfied.\n\nFollowing Section 4.1, we can either choose to solve the convex problem $\\min_{\\rho \\in [0,1]^{n \\times (k-1)} \\cap S \\cap T} f_\\downarrow(\\rho)$, or the strongly-convex problem $\\min_{\\rho \\in S \\cap T} f_\\downarrow(\\rho) + \\frac{1}{2} \\|\\rho\\|_F^2$. In the two situations, after $t$ iterations, that is $tnk$ accesses to values of $H$, we get a constrained minimizer of $H$ with approximation guarantee $nL_1/k + nL_1/\\sqrt{t}$. Thus in order to get a precision $\\varepsilon$, it suffices to select $k \\geq 2nL_1/\\varepsilon$ and $t \\geq 4n^2L_1^2/\\varepsilon^2$, leading to an overall $tnk = 8n^4L_1^3/\\varepsilon^3$ accesses to function values of $H$, which is the same as obtained in [2] (except for an extra factor of $n$ due to a different definition of $L_1$).\n\n4.3 Improved behavior for smooth functions\n\nWe consider the discretization points $\\frac{i}{k-1}$ for $i \\in \\{0, \\dots, k-1\\}$, and we assume that all first-order (resp. second-order) partial derivatives are bounded by $L_1$ (resp. $L_2^2$). In the reasoning above, we may upper-bound the infimum of the discrete function in a finer way, going from $\\inf_{x \\in X} H(x) + nL_1/k$ to $\\inf_{x \\in X} H(x) + \\frac{1}{2} n^2 L_2^2/k^2$ (by doing a Taylor expansion around the global optimum, where the first-order terms are always zero, either because the partial derivative is zero or the deviation is zero). We now select $k \\geq nL_2/\\sqrt{\\varepsilon}$, leading to a number of accesses to $H$ that scales as $tnk = 4n^4L_1^2L_2/\\varepsilon^{5/2}$. We thus gain a factor $\\sqrt{\\varepsilon}$ with the exact same algorithm, but different assumptions.\n\n4.4 Algorithms for isotonic problem\n\nCompared to plain submodular minimization where we need to project onto $S$, we need to take into account the extra isotonic constraints, i.e., $\\rho \\in T$, and thus use more complex orthogonal projections.\n\nOrthogonal projections. We now require the orthogonal projections on $S \\cap T$ or $[0,1]^{n \\times (k-1)} \\cap S \\cap T$, which are themselves isotonic regression problems with $nk$ variables. If there are $m$ original isotonic constraints in Eq. (4), the number of isotonic constraints for the projection step is $O(nk + mk)$, which is typically $O(mk)$ if $m \\geq n$, which we now assume. Thus, we can use existing parametric max-flow algorithms which can solve these in $O(nmk^2 \\log(nk))$ [13] or in $O(nmk^2 \\log(n^2k/m))$ [11]. See in Appendix A a description of the reformulation of isotonic regression as a parametric max-flow problem, and the link with minimum cut. Following [7, Prop. 5.3], we incorporate the $[0,1]$ box constraints, by first ignoring them and thus by projecting onto the regular isotonic constraints, and then thresholding the result through $x \\to \\max\\{\\min\\{x, 1\\}, 0\\}$.\n\nAlternatively, we can explicitly consider a sequence of max-flow problems (with at most $\\log(1/\\varepsilon)$ of these, where $\\varepsilon$ is the required precision) [28, 15]. We may also consider (approximate) alternate projection algorithms such as Dykstra\u2019s algorithm and its accelerated variants [6], since the set $S$ is easy to project to, while, in some cases, such as chain isotonic constraints for the original problem, $T$ is easy to project to.\n\nFinally, we could also use algorithms dedicated to special structures for isotonic regression (see [27]), in particular when our original set of isotonic constraints in Eq. (4) is a chain, and the orthogonal projection corresponds to a two-dimensional grid [26]. 
In our experiments, we use a standard max-flow code [5] and the usual divide-and-conquer algorithms [28, 15] for parametric max-flow.\n\nSeparable problems. The function $f_\\downarrow$ from Section 4.2 is then a linear function of the form $f_\\downarrow(\\rho) = \\operatorname{tr} w^\\top \\rho$, and then, a single max-flow algorithm can be used.\n\nFor these separable problems, the alternative strongly-convex problem of minimizing $f_\\downarrow(\\rho) + \\frac{1}{2} \\|\\rho\\|_F^2$ becomes $\\min_{\\rho \\in S \\cap T} \\frac{1}{2} \\|\\rho + w\\|_F^2$, which is simply the problem of projecting on the intersection of two convex sets, for which an accelerated Dykstra algorithm may be used [6], with convergence rate in $O(1/t^2)$ after $t$ iterations. Each step is $O(kn)$ for projecting onto $S$, while this is $k$ parametric network flows with $n$ variables and $m$ constraints for projecting onto $T$, in $O(knm \\log n)$ for the general case and $O(kn)$ for chains and rooted trees [4, 30].\n\nIn our experiments in Section 6, we show that Dykstra\u2019s algorithm converges quickly for separable problems. Note that when the underlying losses are convex (see footnote 3), then Dykstra converges in a single iteration. Indeed, in this situation, the sequences $(w_{ij})_j$ are non-increasing and isotonic regression along a direction preserves decreasingness in the other direction, which implies that after two alternate projections, the algorithm has converged to the optimal solution.\n\nFootnote 3: This is a situation where direct algorithms such as the ones by [22] are much more efficient than our discretization schemes.\n\nAlternatively, for the non-strongly convex formulation, this is a single network flow problem with $n(k-1)$ nodes and $mk$ constraints, thus in $O(nmk^2 \\log(nk))$ [25]. When $E$ corresponds to a chain, then this is a 2-dimensional-grid with an algorithm in $O(n^2k^2)$ [26]. 
For a precision $\\varepsilon$, and thus $k$ proportional to $n/\\varepsilon$ with the assumptions of Section 4.2, this makes a number of function calls for $H$, equal to $O(kn) = O(n^2/\\varepsilon)$ and a running-time complexity of $O(n^3m/\\varepsilon^2 \\cdot \\log(n^2/\\varepsilon))$; for smooth functions, as shown in Section 4.3, we get $k$ proportional to $n/\\sqrt{\\varepsilon}$ and thus an improved behavior.\n\n5 Improved discretization algorithms\n\nWe now consider a different discretization scheme that can take advantage of access to higher-order derivatives. We divide $[0,1]$ into $k$ disjoint pieces $A_0 = [0, \\frac{1}{k})$, $A_1 = [\\frac{1}{k}, \\frac{2}{k})$, ..., $A_{k-1} = [\\frac{k-1}{k}, 1]$. This defines a new function $\\tilde{H} : \\{0, \\dots, k-1\\}^n \\to \\mathbb{R}$ defined only for elements $z \\in \\{0, \\dots, k-1\\}^n$ that satisfy the isotonic constraint, i.e., $z \\in \\{0, \\dots, k-1\\}^n \\cap X$:\n\n$$\\tilde{H}(z) = \\min_{x \\in \\prod_{i=1}^n A_{z_i}} H(x) \\quad \\text{such that} \\quad \\forall (i,j) \\in E, \\; x_i \\geq x_j. \\tag{6}$$\n\nThe function $\\tilde{H}(z)$ is equal to $+\\infty$ if $z$ does not satisfy the isotonic constraints.\n\nProposition 2 The function $\\tilde{H}$ is submodular, and minimizing $\\tilde{H}(z)$ for $z \\in \\{0, \\dots, k-1\\}^n$ such that $\\forall (i,j) \\in E$, $z_i \\geq z_j$ is equivalent to minimizing Eq. (4).\n\nProof We consider $z$ and $z'$ that satisfy the isotonic constraints, with minimizers $x$ and $x'$ in the definition in Eq. (6). We have $\\tilde{H}(z) + \\tilde{H}(z') = H(x) + H(x') \\geq H(\\min\\{x, x'\\}) + H(\\max\\{x, x'\\}) \\geq \\tilde{H}(\\min\\{z, z'\\}) + \\tilde{H}(\\max\\{z, z'\\})$. Thus it is submodular on the sub-lattice $\\{0, \\dots, k-1\\}^n \\cap X$.\n\nNote that in order to minimize $\\tilde{H}$, we need to make sure that we only access $\\tilde{H}$ for elements $z$ that satisfy the isotonic constraints, that is $\\rho \\in S \\cap T$ (which our algorithms impose).\n\n5.1 Approximation from high-order smoothness\n\nThe main idea behind our discretization scheme is to use high-order smoothness to approximate for any required $z$, the function value $\\tilde{H}(z)$. If we assume that $H$ is $q$-times differentiable, with uniform bounds $L_r^r$ on all $r$-th order derivatives, then, the $(q-1)$-th order Taylor expansion of $H$ around $y$ is equal to $H_q(x|y) = H(y) + \\sum_{r=1}^{q-1} \\sum_{|\\alpha| = r} \\frac{1}{\\alpha!} (x-y)^\\alpha H^{(\\alpha)}(y)$, where $\\alpha \\in \\mathbb{N}^n$ and $|\\alpha|$ is the sum of elements, $(x-y)^\\alpha$ is the product of the terms $(x_i - y_i)^{\\alpha_i}$, $\\alpha!$ the product of all factorials of elements of $\\alpha$, and $H^{(\\alpha)}(y)$ is the partial derivative of $H$ with order $\\alpha_i$ for each $i$.\n\nWe thus approximate $\\tilde{H}(z)$, for any $z$ that satisfies the isotonic constraint (i.e., $z \\in X$), by $\\hat{H}(z) = \\min_{x \\in (\\prod_{i=1}^n A_{z_i}) \\cap X} H_q\\big(x \\,\\big|\\, \\frac{z + 1/2}{k}\\big)$. We have for any $z$, $|\\tilde{H}(z) - \\hat{H}(z)| \\leq (nL_q/2k)^q/q!$. Moreover, when moving a single element of $z$ by one, the maximal deviation is $L_1/k + 2(nL_q/2k)^q/q!$.\n\nIf $\\hat{H}$ is submodular, then the same reasoning as in Section 4.2 leads to an approximate error of $(nk/\\sqrt{t})\\big[L_1/k + 2(nL_q/2k)^q/q!\\big]$ after $t$ iterations, on top of $(nL_q/2k)^q/q!$, thus, with $t \\geq 16n^2L_1^2/\\varepsilon^2$ and $k \\geq (q!\\varepsilon/2)^{-1/q} nL_q/2$ (assuming $\\varepsilon$ small enough such that $t \\geq 16n^2k^2$), this leads to a number of accesses to the $(q-1)$-th order oracle equal to $O(n^4L_1^2L_q/\\varepsilon^{2+1/q})$. We thus get an improvement in the power of $\\varepsilon$, which tends to $\\varepsilon^{-2}$ for infinitely smooth problems. Note that when $q = 2$ (so that $2 + 1/q = 5/2$) we recover the same rate as in Section 4.3 (with the same assumptions but a slightly different algorithm).\n\nHowever, unless $q = 1$, the function $\\hat{H}(z)$ is not submodular, and we cannot apply directly the bounds for convex optimization of the extension. We show in Appendix D that the bound still holds for $q > 1$ by using the special structure of the convex problem.\n\nWhat remains unknown is the computation of $\\hat{H}$, which requires minimizing polynomials on a small cube. We can always use the generic algorithms from Section 4.2 for this, which do not access extra function values but can be slow. 
For quadratic functions, we can use a convex relaxation which is not tight but already allows strong improvements with much faster local steps, and which we now present. See the pseudo-code in Appendix B. In any case, using expansions of higher order is only practically useful in situations where function evaluations are expensive.\n\n5.2 Quadratic problems\n\nIn this section, we consider the minimization of a quadratic submodular function $H(x) = \\frac{1}{2} x^\\top A x + c^\\top x$ (thus with all off-diagonal elements of $A$ non-positive, since submodularity requires nonpositive cross second-order derivatives) on $[0,1]^n$, subject to isotonic constraints $x_i \\geq x_j$ for all $(i,j) \\in E$. This is the sub-problem required in Section 5.1 when using second-order Taylor expansions.\n\nIt could be solved iteratively (and approximately) with the algorithm from Section 4.2; in this section, we consider a semidefinite relaxation which is tight for certain problems ($A$ positive semi-definite, $c$ non-positive, or $A$ with non-positive diagonal elements), but not in general (we have found counter-examples but it is most often tight).\n\nThe relaxation is based on considering the set of $(Y, y) \\in \\mathbb{R}^{n \\times n} \\times \\mathbb{R}^n$ such that there exists $x \\in [0,1]^n \\cap X$ with $Y = xx^\\top$ and $y = x$. Our problem is thus equivalent to minimizing $\\frac{1}{2} \\operatorname{tr} AY + c^\\top y$ such that $(Y, y)$ is in the convex hull $\\mathcal{Y}$ of this set, which is NP-hard to characterize [10]. However, following ideas from [18], we can find a simple relaxation by considering the following constraints: (a) for all $i \\neq j$, $\\begin{pmatrix} Y_{ii} & Y_{ij} & y_i \\\\\\\\ Y_{ij} & Y_{jj} & y_j \\\\\\\\ y_i & y_j & 1 \\end{pmatrix}$ is positive semi-definite, (b) for all $i \\neq j$, $Y_{ij} \\leq \\inf\\{y_i, y_j\\}$, which corresponds to $x_i x_j \\leq \\inf\\{x_i, x_j\\}$ for any $x \\in [0,1]^n$, (c) for all $i$, $Y_{ii} \\leq y_i$, which corresponds to $x_i^2 \\leq x_i$, and (d) for all $(i,j) \\in E$, $y_i \\geq y_j$, $Y_{ii} \\geq Y_{jj}$, $Y_{ij} \\geq \\max\\{Y_{jj}, y_j - y_i + Y_{ii}\\}$ and $Y_{ij} \\leq \\max\\{Y_{ii}, y_i - y_j + Y_{jj}\\}$, which corresponds to $x_i \\geq x_j$, $x_i^2 \\geq x_j^2$, $x_i x_j \\geq x_j^2$, $x_i(1 - x_i) \\leq x_i(1 - x_j)$, $x_i x_j \\leq x_i^2$, and $x_i(1 - x_j) \\geq x_j(1 - x_j)$. This leads to a semi-definite program which provides a lower-bound on the optimal value of the problem. See Appendix E for a proof of tightness for special cases and a counter-example for the tightness in general.\n\n6 Experiments\n\nWe consider experiments aiming at showing (a) that the ability to minimize submodular functions with isotonic constraints brings new possibilities and (b) that the new discretization algorithms are faster than the naive one.\n\nRobust isotonic regression. Given some $z \\in \\mathbb{R}^n$, we consider a separable function $H(x) = \\frac{1}{n} \\sum_{i=1}^n G(x_i - z_i)$ with various possibilities for $G$: (a) the square loss $G(t) = \\frac{1}{2} t^2$, (b) the absolute loss $G(t) = |t|$ and (c) a logarithmic loss $G(t) = \\frac{\\kappa^2}{2} \\log(1 + t^2/\\kappa^2)$, which is the negative log-density of a Student distribution and non-convex. The non-convexity of the cost function and the fact that it has vanishing derivatives for large values make it a good candidate for robust estimation [12]. The first two losses may be handled with methods for separable convex isotonic regression [22, 30], but the non-convex loss can only be dealt with exactly by the new optimization routine that we present; majorization-minimization algorithms [14] based on the concavity of $G$ as a function of $t^2$ can be used with such non-convex losses, but as shown below, they converge to bad local optima.\n\nFor simplicity, we consider chain constraints $1 \\geq x_1 \\geq x_2 \\geq \\cdots \\geq x_n \\geq 0$. We consider two set-ups: (a) a separable set-up where maximum flow algorithms can be used directly (with $n = 200$), and (b) a general submodular set-up (with $n = 25$ and $n = 200$), where we add a smoothness penalty of the form $\\frac{\\lambda}{2} \\sum_{i=1}^{n-1} (x_i - x_{i+1})^2$, which is submodular (but not separable).\n\nData generation. 
We generate the data $z \\in \\mathbb{R}^n$, with $n = 200$, as follows: we first generate a simple decreasing function of $i \\in \\{1, \\dots, n\\}$ (here an affine function); we then perturb this ground truth by (a) adding some independent noise and (b) corrupting the data by changing a random subset of the $n$ values by the application of another function which is increasing (see Figure 2, left). This is an adversarial perturbation, while the independent noise is not adversarial; the presence of the adversarial noise makes the problem harder as the proportion of corrupted data increases.\n\n[Figure 2 (plots omitted). Left panel: observations and fits against index i, with proportion of corrupted data = 50%; legend: square, absolute, logarithm. Right panel: log10(mean square error) against proportion of outliers; legend: square, absolute, logarithm 0.01, logarithm 0.001.]\n\nFigure 2: Left: robust isotonic regression with decreasing constraints, with 50% of corrupted data (observation in pink crosses, and results of isotonic regression with various losses in red, blue and black); the dashed black line corresponds to majorization-minimization algorithm started from the observations. Right: robustness of various losses to the proportion of corrupted data. The two logarithm-based losses are used with two values of $\\kappa$ (0.01 and 0.001); the dashed line corresponds to the majorization-minimization algorithm (with no convergence guarantees and worse performance).\n\nOptimization of separable problems with maximum flow algorithms. We solve the discretized version by a single maximum-flow problem of size $nk$. We compare the various losses for $k = 1000$ on data which is along a decreasing line (plus noise), but corrupted (i.e., replaced for a certain proportion) by data along an increasing line. 
See an example in the left plot of Figure 2 for 50% of corrupted data. We see that the square loss is highly non-robust, while the (still convex) absolute loss is slightly more robust, and the robust non-convex loss still approximates the decreasing function correctly with 50% of corrupted data when optimized globally, while the method with no guarantee (based on majorization-minimization, dashed line) does not converge to an acceptable solution. In Appendix C, we show additional examples where it is robust up to 75% of corruption.\n\nIn the right plot of Figure 2, we also show the robustness to an increasing proportion of outliers (for the same type of data as for the left plot), by plotting the mean-squared error in log-scale and averaged over 20 replications. Overall, this shows the benefits of non-convex isotonic regression with guaranteed global optimization, even for large proportions of corrupted data.\n\nOptimization of separable problems with pool-adjacent violator (PAV) algorithm. As shown in Section 4.4, discretized separable submodular optimization corresponds to the orthogonal projection of a matrix onto the intersection of chain isotonic constraints in each row, and isotonic constraints in each column equal to the original set of isotonic constraints (in these simulations, these are also chain constraints). This can be done by Dykstra\u2019s alternating projection algorithm or its accelerated version [6], for which each projection step can be performed with the PAV algorithm because each of them corresponds to chain constraints.\n\nIn the left plot of Figure 3, we show the difference in function values (in log-scale) for various discretization levels (defined by the integer $k$ spaced by 1/4 in base-10 logarithm), as a function of the number of iterations (averaged over 20 replications). 
For large k (small difference of function\nvalues), we see a spacing between the ends of the plots of approximately 1/2 (twice the 1/4 spacing\nof log10(k)), highlighting the dependence in 1/k^2 of the final error on the discretization k, as\nsuggested by our analysis in Section 4.3.\n\nEffect of the discretization for separable problems. In order to highlight the effect of discretization\nand its interplay with differentiability properties of the function to minimize, we consider, in\nthe middle plot of Figure 3, the distance in function values after full optimization of the discrete\nsubmodular function for various values of k. We see that for the simple smooth function (quadratic\nloss), we have a decay in 1/k^2, while for the simple non-smooth function (absolute loss), we have a\nfinal decay in 1/k, as predicted by our analysis. For the logarithm-based loss, whose smoothness\nconstant depends on \u03ba, when \u03ba is large, it behaves like a smooth function immediately, while for\nsmaller \u03ba, k needs to be large enough to reach that behavior.\n\nNon-separable problems. We consider adding a smoothness penalty to add the prior knowledge\nthat values should be decreasing and close. In Appendix C, we show the effect of adding a smoothness\nprior (for n = 200): it leads to better estimation. 
\n\n[Figure 3 plots. Left panel (title: Dykstra): log10(\u0394H) against number of iterations. Middle panel (title: Effect of discretization): log10(\u0394H) against log10(k), for the square, absolute, logarithm 0.1 and logarithm 0.001 losses. Right panel: log10(\u0394H) against log10(k), for orders 0, 1 and 2.]\n\nFigure 3: Left: Dykstra\u2019s projection algorithms for separable problems, with several values of k, spaced\nby 1/4 in base-10 logarithm, from 10^1 to 10^3.5. Dykstra in dashed and accelerated Dykstra in\nplain. Middle: effect of discretization value k for various loss functions for separable problems (the\nlogarithm-based loss is considered with two values of \u03ba, \u03ba = 0.1 and \u03ba = 0.001). Right: effect of\ndiscretization k on non-separable problems.\n\nIn the right plot of Figure 3, we show the effect\nof various discretization schemes (for n = 25), from order 0 (naive discretization), to order 1 and 2\n(our new schemes based on Taylor expansions from Section 5.1), and we plot the difference in\nfunction values after 50 steps of subgradient descent: in each plot, the quantity \u0394H is equal to\nH(x*_k) - H*, where x*_k is an approximate minimizer of the discretized problem with k values and\nH* the minimum of H (taking into account the isotonic constraints). 
As outlined in our analysis, the\nfirst-order scheme does not help because our function has bounded Hessians, while the second-order\nscheme helps significantly.\n\n7 Conclusion\n\nIn this paper, we have shown how submodularity could be leveraged to obtain polynomial-time\nalgorithms for isotonic regressions with a submodular cost, based on convex optimization in a space\nof measures; although based on convexity arguments, our algorithms apply to all separable\nnon-convex functions. The final algorithms are based on discretization, with a new scheme that also\nprovides improvements based on smoothness (also without isotonic constraints). Our framework is\nworth extending in the following directions: (a) we currently consider a fixed discretization; it would\nbe advantageous to consider adaptive schemes, potentially improving the dependence on the number\nof variables n and the precision \u03b5; (b) other shape constraints can be considered in a similar submodular\nframework, such as x_i x_j > 0 for certain pairs (i, j); (c) a direct convex formulation without\ndiscretization could probably be found for quadratic programming with submodular costs (which are\npotentially non-convex but solvable in polynomial time); (d) a statistical study of isotonic regression\nwith adversarial corruption could now rely on formulations with polynomial-time algorithms.\n\nAcknowledgements\n\nWe acknowledge support from the European Research Council (grant SEQUOIA 724063).\n\nReferences\n\n[1] F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective, volume 6 of Foundations and Trends in Machine Learning. NOW, 2013.\n\n[2] F. Bach. Submodular functions: from discrete to continuous domains. Mathematical Programming, 2018.\n\n[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 2016. 3rd edition.\n\n[4] M. J. Best and N. Chakravarti. Active set algorithms for isotonic regression: a unifying framework. 
Mathematical Programming, 47(1):425\u2013439, 1990.\n\n[5] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124\u20131137, 2004.\n\n[6] A. Chambolle and T. Pock. A remark on accelerated block coordinate descent for computing the proximity operators of a sum of convex functions. SMAI Journal of Computational Mathematics, 1:29\u201354, 2015.\n\n[7] Xi Chen, Qihang Lin, and Bodhisattva Sen. On degrees of freedom of projection estimators with applications to multivariate shape restricted regression. Technical Report 1509.01877, arXiv, 2015.\n\n[8] Y. Chen and R. J. Samworth. Generalized additive and index models with shape constraints. Journal of the Royal Statistical Society Series B, 78(4):729\u2013754, 2016.\n\n[9] D. Dentcheva and A. Ruszczy\u0144ski. Semi-infinite probabilistic optimization: first-order stochastic dominance constraint. Optimization, 53(5-6):583\u2013601, 2004.\n\n[10] M. M. Deza and M. Laurent. Geometry of Cuts and Metrics, volume 15. Springer, 2009.\n\n[11] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1):30\u201355, 1989.\n\n[12] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: the Approach Based on Influence Functions, volume 196. John Wiley & Sons, 2011.\n\n[13] D. S. Hochbaum. The pseudoflow algorithm: A new algorithm for the maximum-flow problem. Operations Research, 56(4):992\u20131009, 2008.\n\n[14] D. R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30\u201337, 2004.\n\n[15] S. Jegelka, F. Bach, and S. Sra. Reflection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.\n\n[16] S. M. Kakade, V. 
Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems (NIPS), 2011.\n\n[17] S. Karlin and Y. Rinott. Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. Journal of Multivariate Analysis, 10(4):467\u2013498, 1980.\n\n[18] S. Kim and M. Kojima. Exact solutions of some nonconvex quadratic optimization problems via SDP and SOCP relaxations. Comp. Optimization and Applications, 26(2):143\u2013154, 2003.\n\n[19] E. L. Lehmann. Ordered families of distributions. The Annals of Mathematical Statistics, 26(3):399\u2013419, 1955.\n\n[20] H. Levy. Stochastic dominance and expected utility: survey and analysis. Management Science, 38(4):555\u2013593, 1992.\n\n[21] G. G. Lorentz. An inequality for rearrangements. Am. Math. Monthly, 60(3):176\u2013179, 1953.\n\n[22] R. Luss and S. Rosset. Generalized isotonic regression. Journal of Computational and Graphical Statistics, 23(1):192\u2013210, 2014.\n\n[23] Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer, 2004.\n\n[24] W. Rudin. Real and complex analysis. McGraw-Hill, 1986.\n\n[25] D. D. Sleator and R. E. Tarjan. A data structure for dynamic trees. Journal of Computer and System Sciences, 26(3):362\u2013391, 1983.\n\n[26] J. Spouge, H. Wan, and W. J. Wilbur. Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications, 117(3):585\u2013605, 2003.\n\n[27] Q. F. Stout. Isotonic regression via partitioning. Algorithmica, 66(1):93\u2013112, 2013.\n\n[28] R. Tarjan, J. Ward, B. Zhang, Y. Zhou, and J. Mao. Balancing applied to maximum network flow problems. In European Symposium on Algorithms, pages 612\u2013623. Springer, 2006.\n\n[29] D. M. Topkis. Minimizing a submodular function on a lattice. Operations Research, 26(2):305\u2013321, 1978.\n\n[30] Y.-L. 
Yu and E. P. Xing. Exact algorithms for isotonic regression and related. In Journal of Physics: Conference Series, volume 699. IOP Publishing, 2016.", "award": [], "sourceid": 29, "authors": [{"given_name": "Francis", "family_name": "Bach", "institution": "INRIA - Ecole Normale Superieure"}]}