{"title": "Decomposing Isotonic Regression for Efficiently Solving Large Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1513, "page_last": 1521, "abstract": "A new algorithm for isotonic regression is presented based on recursively partitioning the solution space. We develop efficient methods for each partitioning subproblem through an equivalent representation as a network flow problem, and prove that this sequence of partitions converges to the global solution. These network flow problems can further be decomposed in order to solve very large problems. Success of isotonic regression in prediction and our algorithm's favorable computational properties are demonstrated through simulated examples as large as 2x10^5 variables and 10^7 constraints.", "full_text": "Decomposing Isotonic Regression for Efficiently Solving Large Problems

Ronny Luss
Dept. of Statistics and OR, Tel Aviv University
ronnyluss@gmail.com

Saharon Rosset
Dept. of Statistics and OR, Tel Aviv University
saharon@post.tau.ac.il

Moni Shahar
Dept. of Electrical Eng., Tel Aviv University
moni@eng.tau.ac.il

Abstract

A new algorithm for isotonic regression is presented based on recursively partitioning the solution space. We develop efficient methods for each partitioning subproblem through an equivalent representation as a network flow problem, and prove that this sequence of partitions converges to the global solution. These network flow problems can further be decomposed in order to solve very large problems.
Success of isotonic regression in prediction and our algorithm's favorable computational properties are demonstrated through simulated examples as large as 2 × 10^5 variables and 10^7 constraints.

1 Introduction

Assume we have a set of n data observations (x_1, y_1), ..., (x_n, y_n), where x ∈ X (usually X = R^p) is a vector of covariates or independent variables, y ∈ R is the response, and we wish to fit a model f̂ : X → R to describe the dependence of y on x, i.e., y ≈ f̂(x). Isotonic regression is a non-parametric modeling approach which only restricts the fitted model to being monotone in all independent variables [1]. Define G as the family of isotonic functions, that is, g ∈ G satisfies

x_1 ⪯ x_2 ⇒ g(x_1) ≤ g(x_2),

where the partial order ⪯ here will usually be the standard Euclidean one, i.e., x_1 ⪯ x_2 if x_1j ≤ x_2j ∀j. Given these definitions, isotonic regression solves

f̂ = arg min_{g∈G} ‖y − g(x)‖^2.   (1)

As many authors have noted, the optimal solution to this problem comprises a partitioning of the space X into regions obeying a monotonicity property, with a constant fitted to f̂ in each region. It is clear that isotonic regression is a very attractive model for situations where monotonicity is a reasonable assumption, but other common assumptions like linearity or additivity are not. Indeed, this formulation has found useful applications in biology [2], medicine [3], statistics [1] and psychology [4], among others.
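For concreteness, the componentwise partial order and the isotonicity requirement on g can be checked directly on sample points (a minimal illustrative sketch; the function names and data are ours, not from the paper):

```python
from itertools import product

def precedes(x1, x2):
    """x1 ⪯ x2 under the standard componentwise (Euclidean) partial order."""
    return all(a <= b for a, b in zip(x1, x2))

def is_isotonic(points, fits):
    """Check g ∈ G restricted to the data: x1 ⪯ x2 ⇒ g(x1) ≤ g(x2)."""
    pairs = product(zip(points, fits), repeat=2)
    return all(f1 <= f2 for (x1, f1), (x2, f2) in pairs if precedes(x1, x2))

# (1, 2) ⪯ (2, 2), while (1, 3) and (2, 1) are incomparable
print(is_isotonic([(1, 2), (2, 2)], [0.5, 1.0]))  # → True: fits respect the order
print(is_isotonic([(1, 2), (2, 2)], [1.0, 0.5]))  # → False: violates x1 ⪯ x2
```

Incomparable pairs place no constraint on the fits, which is exactly why the partially ordered case is harder than the totally ordered one.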
Practicality of isotonic regression has already been demonstrated in various fields, and in this paper we focus on algorithms for computing isotonic regressions on large problems.

An equivalent formulation of L2 isotonic regression seeks an optimal isotonic fit ŷ_i at every point by solving

minimize Σ_{i=1}^n (ŷ_i − y_i)^2
subject to ŷ_i ≤ ŷ_j ∀(i, j) ∈ I   (2)

where I denotes a set of isotonic constraints. This paper assumes that I contains no redundant constraints, i.e. (i, j), (j, k) ∈ I ⇒ (i, k) ∉ I. Problem (2) is a quadratic program subject to simple linear constraints, and, according to a literature review, appears to be largely ignored due to computational difficulty on large problems. The worst-case O(n^4) complexity (a large overstatement in practice, as will be shown) has resulted in the results that follow being overlooked [5, 6].

The discussion of isotonic regression originally focused on the case x ∈ R, where ⪯ denoted a complete order [4]. For this case, the well-known pooled adjacent violators algorithm (PAVA) efficiently solves the isotonic regression problem. For the partially ordered case, many different algorithms have been developed over the years, with most early efforts concentrated on generalizations of PAVA [7, 5]. These algorithms typically have no polynomial complexity guarantees and are impractical when data sizes exceed a few thousand observations. Problem (1) can also be treated as a separable quadratic program subject to simple linear inequality constraints. Such was done, for example, in [8], which applies active set methods to solve the problem. While such algorithms can often be efficient in practice, the algorithm of [8] gives no complexity guarantees.
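For the totally ordered case mentioned above, PAVA admits a compact implementation that pools adjacent blocks whose means violate monotonicity (an illustrative sketch of the classic algorithm, not the partitioning method of this paper):

```python
def pava(y):
    """Pooled adjacent violators for x ∈ R: returns the isotonic L2 fit ŷ."""
    blocks = []  # each block is [sum, count]; its fitted value is sum / count
    for v in y:
        blocks.append([float(v), 1.0])
        # pool while the previous block's mean is not strictly below the new one's
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * int(c))
    return fit

print(pava([1, 3, 2, 4]))  # → [1.0, 2.5, 2.5, 4.0]
```

The output illustrates the block structure of the optimal solution: the violating pair (3, 2) is pooled into one block with fitted value 2.5, while 1 and 4 remain singleton blocks.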
Algorithms in [9], related to those described here, were applied to problems of scheduling reorder intervals in production systems; they have complexity O(n^4), and connections to isotonic regression can be made through [1]. Interior point methods are another tool for solving Problem (1), and have time complexity guarantees of O(n^3) when the number of constraints is on the same order as the number of variables (see [10]). However, the excessive memory requirements of interior point methods, which arise from solving large systems of linear equations, typically make them impractical for large data sizes. Recently, [6] and [11] gave an O(n^2) approximate generalized PAVA algorithm; however, solution quality can only be demonstrated via experimentation. An even better complexity of O(n log n) can be obtained for the optimal solution when the isotonic constraints take a special structure such as a tree, e.g. [12].

1.1 Contribution

Our novel approach to isotonic regression offers an exact solution of (1) with a complexity bounded by O(n^4), but acts on the order of O(n^3) for practical problems. We demonstrate here that it accommodates problems with tens of thousands of observations, or even more with our decomposition. The main goal of this paper is to make isotonic regression a reasonable computational tool for large data sets, as the assumptions in this framework are very applicable in real-world applications. Our framework solves quadratic programs with 2 × 10^5 variables and more than 10^7 constraints, a problem of size not solved anywhere in the previous isotonic regression literature, and with the decomposition detailed below, even larger problems can be solved.

The paper is organized as follows. Section 2 describes a partitioning algorithm for isotonic regression and proves convergence to the globally optimal solution.
Section 3 explains how the subproblems (creating a single partition) can be solved efficiently and decomposed in order to solve large-scale problems. Section 4 demonstrates that the partitioning algorithm is significantly better in practice than the O(n^4) worst-case complexity. Finally, Section 5 gives numerical results and demonstrates favorable predictive performance on large simulated data sets, and Section 6 concludes with future directions.

Notation

The weight of a set of points A is defined as y_A = (1/|A|) Σ_{i∈A} y_i. A subset U of A is an upper set of A if x ∈ U, y ∈ A, x ≺ y ⇒ y ∈ U. A set B ⊆ A is defined as a block of A if y_{U∩B} ≤ y_B for each upper set U of A such that U ∩ B ≠ {}. A general block A is considered a block of the entire space. For two blocks A and B, we denote A ⪯ B if ∃x ∈ A, y ∈ B such that x ⪯ y and ∄x ∈ A, y ∈ B such that y ⪯ x (i.e. there is at least one comparable pair of points that satisfies the direction of isotonicity). A and B are then said to be isotonic blocks (or to obey isotonicity). A group of nodes X majorizes (minorizes) another group Y if X ⪰ Y (X ⪯ Y). A group X is a majorant (minorant) of X ∪ A, where A = ∪_{i=1}^k A_i, if X ⋠ A_i (X ⋡ A_i) ∀i = 1 . . . k.

2 Partitioning Algorithm

We first describe the structure of the classic L2 isotonic regression problem and continue to detail the partitioning algorithm. The section concludes by proving convergence of the algorithm to the globally optimal isotonic regression solution.

2.1 Structure

Problem (2) is a quadratic program subject to simple linear constraints. The structure of the optimal solution to (2) is well known. Observations are divided into k groups where the fits in each group take the group mean observation value.
This can be seen through the equations given by the following Karush-Kuhn-Tucker (KKT) conditions:

(a) ŷ_i = y_i − (1/2)(Σ_{j:(i,j)∈I} λ_ij − Σ_{j:(j,i)∈I} λ_ji)
(b) ŷ_i ≤ ŷ_j ∀(i, j) ∈ I
(c) λ_ij ≥ 0 ∀(i, j) ∈ I
(d) λ_ij (ŷ_i − ŷ_j) = 0 ∀(i, j) ∈ I.

This set of conditions exposes the nature of the optimal solution, since condition (d) implies that λ_ij > 0 ⇒ ŷ_i = ŷ_j. Hence λ_ij can be non-zero only within blocks in the isotonic solution which have the same fitted value. For observations in different blocks, λ_ij = 0. Furthermore, the fit within each block is trivially seen to be the average of the observations in the block, i.e. the fits minimize the block's squared loss. Thus, we get the familiar characterization of the isotonic regression problem as one of finding a division into isotonic blocks.

2.2 Partitioning

In order to take advantage of the optimal solution's structure, we propose solving the isotonic regression problem (2) as a sequence of subproblems that divides a group of nodes into two groups at each iteration. An important property of our partitioning approach is that nodes separated at one iteration are never rejoined into the same group in future iterations. This gives a clear bound on the total number of iterations in the worst case.

We now describe the partitioning criterion used for each subproblem. Suppose a current block V is optimal and thus ŷ*_i = y_V ∀i ∈ V. From condition (a) of the KKT conditions, we define the net outflow of a group V as Σ_{i∈V} (y_i − ŷ_i). Finding two groups within V such that the net outflow from the higher group is greater than the net outflow from the lower group should be infeasible, according to the KKT conditions. The partition here looks for two such groups.
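On a toy group, these quantities can be computed directly; the sketch below (our own illustration, with made-up numbers) shows that the normalized fits z_i = y_i − y_V sum to zero over the group, and evaluates the split criterion for one candidate cut:

```python
# observations in a candidate group V, ordered by a chain ŷ1 ≤ ŷ2 ≤ ŷ3 ≤ ŷ4
y = [1.0, 3.0, 2.0, 6.0]
yV = sum(y) / len(y)                # fitted value if V were a single block
z = [yi - yV for yi in y]           # normalized fits z_i = y_i - y_V

assert abs(sum(z)) < 1e-12          # net outflow of the whole group is zero

# candidate isotonic cut: lower group = first two nodes, upper group = last two
lower, upper = z[:2], z[2:]
gain = sum(upper) - sum(lower)      # the cut objective for this candidate
print(gain)                         # → 4.0 > 0, so V is not an optimal block
```

A strictly positive gain certifies that the group violates the block conditions and must be split; for an optimal block the best achievable gain is zero.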
Denote by C_V the set of all feasible (i.e. isotonic) cuts through the network defined by the nodes in V. A cut is called isotonic if the two blocks created by the cut are isotonic. The optimal cut is determined as the cut that solves the problem

max_{c∈C_V}  Σ_{i∈V_c^+} (y_i − y_V) − Σ_{i∈V_c^−} (y_i − y_V)   (3)

where V_c^− (V_c^+) is the group on the lower (upper) side of the edges of cut c. In terms of isotonic regression, the optimal cut is such that the difference in the sums of the normalized fits (y_i − y_V) over the two groups is maximized. If this maximized difference is zero, then the group must be an optimal block. The optimal cut problem (3) can also be written as the binary program

maximize Σ_i x_i (y_i − y_V)
subject to x_i ≤ x_j ∀(i, j) ∈ I
           x_i ∈ {−1, +1} ∀i ∈ V.   (4)

Well-known results from [13] (due to the fact that the constraint matrix is totally unimodular) say that the following relaxation of this binary program is optimal with x* on the boundary, and hence the optimal cut can be determined by solving the linear program

maximize z^T x
subject to x_i ≤ x_j ∀(i, j) ∈ I
           −1 ≤ x_i ≤ 1 ∀i ∈ V   (5)

where z_i = y_i − y_V. This group-wise partitioning operation is the basis for our partitioning algorithm, which is given explicitly in Algorithm 1. It starts with all observations as one group (i.e., V = {1, . . . , n}), and recursively splits each group optimally by solving subproblem (5). At each iteration, a list C of potential optimal cuts for each group generated thus far is maintained, and the cut among them with the highest objective value is performed. The list C is updated with the optimal cuts in both sub-groups generated. Partitioning ends whenever the solution to (5) is trivial (i.e., no split is found because the group is a block).
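The recursive scheme can be prototyped end-to-end on toy instances by replacing LP (5) with brute-force enumeration of monotone ±1 vectors (exponential in the group size, so for illustration only; the paper's method solves (5) via its network flow dual instead):

```python
from itertools import product

def best_cut(z, order):
    """Brute-force the cut problem (4): maximize z·x over monotone x ∈ {-1,+1}^n."""
    best_val, best_x = None, None
    for x in product((-1, 1), repeat=len(z)):
        if all(x[i] <= x[j] for i, j in order):
            val = sum(zi * xi for zi, xi in zip(z, x))
            if best_val is None or val > best_val:
                best_val, best_x = val, x
    return best_x

def isotonic_fit(y, order):
    """Sketch of the partitioning algorithm: recursively split by the optimal cut."""
    fit = [0.0] * len(y)
    groups = [list(range(len(y)))]
    while groups:
        v = groups.pop()
        mean = sum(y[i] for i in v) / len(v)
        sub = [(v.index(i), v.index(j)) for i, j in order if i in v and j in v]
        x = best_cut([y[i] - mean for i in v], sub)
        lower = [i for i, xi in zip(v, x) if xi == -1]
        upper = [i for i, xi in zip(v, x) if xi == +1]
        if not lower or not upper:      # trivial cut: v is an optimal block
            for i in v:
                fit[i] = mean
        else:
            groups += [lower, upper]
    return fit

# chain order ŷ1 ≤ ŷ2 ≤ ŷ3 ≤ ŷ4
print(isotonic_fit([1, 3, 2, 4], [(0, 1), (1, 2), (2, 3)]))  # → [1.0, 2.5, 2.5, 4.0]
```

On a total order this reproduces the PAVA fit; on a genuine partial order the same code applies with `order` listing the non-redundant pairs of I.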
As proven next, this algorithm terminates with the optimal global (isotonic) solution to the isotonic regression problem (2).

Algorithm 1 Partitioning Algorithm
Require: Observations y_1, . . . , y_n and partial order I.
Require: V = {{1, . . . , n}}, C = {(0, {1, . . . , n}, {})}, W = {}.
1: while V ≠ {} do
2:   Let (val, w−, w+) ∈ C be the potential cut with largest val.
3:   Update V = (V \ (w− ∪ w+)) ∪ {w−, w+}, C = C \ (val, w−, w+).
4:   for all v ∈ {w−, w+} do
5:     Set z_i = y_i − y_v ∀i ∈ v, where y_v is the mean of the observations in v.
6:     Solve LP (5) with input z and get x*.
7:     if x*_1 = · · · = x*_n (the group is optimally divided, i.e. it is a block) then
8:       Update V = V \ v and W = W ∪ v.
9:     else
10:      Let v− = {i : x*_i = −1}, v+ = {i : x*_i = +1}.
11:      Update C = C ∪ {(z^T x*, v−, v+)}.
12:    end if
13:  end for
14: end while
15: return W, the optimal groups

2.3 Convergence

Theorem 1 next states the main result that allows for a no-regret partitioning algorithm for isotonic regression. This will lead to our convergence result. We assume that group V is isotonic (i.e. has no holes) and is the union of optimal blocks.

Theorem 1 Assume a group V is a union of blocks from the optimal solution to problem (2). Then a cut made by solving (5) does not cut through any block in the global optimal solution.

Proof. The following is a brief sketch of the proof idea: Let M be the union of K optimal blocks in V that get broken by the cut. Define M_1 (M_K) to be a minorant (majorant) block in M. For each M_k define M_k^L (M_k^U) as the groups in M_k below (above) the algorithm cut. Using the definitions of how the algorithm makes partitions, the following two consequences can be proven: (1) y_{M_1} < y_{M_K} by optimality (i.e.
according to KKT conditions) and isotonicity, and (2) y_{M_1} > y_V and y_{M_K} < y_V. This is proven by showing that y_{M_1^U} > y_V, because otherwise the M_1^U block would be on the lower side of the cut, resulting in M_1 being on the lower side of the cut, and thus y_{M_1} > y_V since y_{M_1^L} > y_{M_1^U} > y_V by the optimality assumption on block M_1 (with symmetric arguments for M_K). This leads to the contradiction y_V < y_{M_1} < y_{M_K} < y_V, and hence M must be empty.

Since Algorithm 1 starts with V = {1, ..., n}, which is a union of (all) optimal blocks, we can conclude from this theorem that partitions never cut an optimal block. The following corollary is then a direct consequence of repeatedly applying Theorem 1 in Algorithm 1:

Corollary 2 Algorithm 1 converges to the global optimal solution of (2) with no regret (i.e. without having to rejoin observations that are divided at a previous iteration).

3 Efficient solutions of the subproblems

Linear program (5) has a special structure that can be taken advantage of in order to solve larger problems faster. We first show why these problems can be solved faster than typical linear programs, followed by a novel decomposition of the structure that allows problems of extremely large size to be solved efficiently.

3.1 Network flow problems

The dual to Problem (2) is a network flow problem with a quadratic objective. The network flow constraints are identical to those in (6) below, but the objective is (1/4) Σ_{i=1}^n (s_i^2 + t_i^2), which, to the authors' knowledge, currently still precludes this dual from being efficiently solved with special network algorithms.

While this structure does not directly help solve the quadratic program, the network structure allows the linear program for the subproblems to be solved very efficiently.
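The total unimodularity claim behind relaxation (5) can be sanity-checked with an off-the-shelf LP solver: even though the box constraint allows −1 ≤ x_i ≤ 1, an optimal vertex has every coordinate at ±1 (an illustrative sketch using scipy with made-up data; the paper instead solves the dual network flow problem with a network simplex method):

```python
import numpy as np
from scipy.optimize import linprog

# chain order x1 ≤ x2 ≤ x3 ≤ x4 and made-up normalized fits z
z = np.array([-2.0, 1.0, -0.5, 1.0])
# each isotonic constraint x_i ≤ x_j becomes the row x_i - x_j ≤ 0
A_ub = np.array([[1, -1, 0, 0],
                 [0, 1, -1, 0],
                 [0, 0, 1, -1]], dtype=float)
b_ub = np.zeros(3)

# linprog minimizes, so negate z to maximize z·x subject to -1 ≤ x ≤ 1
res = linprog(-z, A_ub=A_ub, b_ub=b_ub, bounds=[(-1, 1)] * 4)
print(res.x)  # optimal vertex [-1, 1, 1, 1]: integral, so it encodes a valid cut
```

Here the unique optimum puts the first node in the lower group and the rest in the upper group, exactly the binary cut that problem (4) would select.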
The dual program to (5) is

minimize Σ_{i∈V} (s_i + t_i)
subject to Σ_{j:(i,j)∈I} λ_ij − Σ_{j:(j,i)∈I} λ_ji − s_i + t_i = z_i ∀i ∈ V
           λ, s, t ≥ 0   (6)

where again z_i = y_i − y_V. Linear program (6) is a network flow problem with |V| + 2 nodes and |I| + 2|V| arcs. Variable s denotes links directed from a source node into each other node, while t denotes links connecting each node into a sink node. The network flow problem here minimizes the total sum of flow over links from the source and into the sink, with the goal to leave z_i units of flow at each node i ∈ V. Note that this is very similar to the network flow problem solved in [14], where z_i there represents the classification performance on node i. Specialized simplex methods for such network flow problems are typically much faster ([15] documents an average speedup factor of 10 to 100 over standard simplex solvers) due to several reasons, such as simpler operations on network data structures rather than maintaining and operating on the simplex tableau (see [16] for an overview of network simplex methods).

3.2 Large-scale decompositions

In addition to having a very efficient method for solving this network flow problem, further enhancements can be made on extremely large problems of similar structure that might suffer from memory problems. It is already assumed that no redundant arcs exist in I (i.e. (i, j), (j, k) ∈ I ⇒ (i, k) ∉ I). One simple reduction involves eliminating negative (positive) nodes, i.e. nodes with z_i < 0 (z_i ≥ 0), where z_i = y_i − y_V, that are bounded only from above (below). It is trivial to observe that these nodes will be equal to −1 (+1) in the optimal solution and that eliminating them does not affect the solution of (5) on the remaining nodes.
However, in practice, this trivial reduction has a minimal computational effect on large data sets. These reductions were also discussed in [14].

We next consider a novel reduction for the primal linear program (5). The main idea is that it can be solved through a sequence of smaller linear programs that reduce the total size of the full linear program at each iteration. Consider a minorant group of nodes J ⊆ V and the subset of arcs I_J ⊆ I connecting them. Solving problem (5) on this reduced network with the original input z divides the nodes in J into a lower and an upper group, denoted J_L and J_U. Nodes in J_L are not bounded from above and will be in the lower group of the full problem solved on V. In addition, the same problem solved on the remaining nodes in V \ J_L will give the optimal solutions for these nodes. This is formalized in Proposition 3.

Proposition 3 Let J ⊆ V be a minorant group of nodes in V. Let w* and x* be optimal solutions to Problem (5) on the reduced set J and the full set V of nodes, respectively. If w*_i = −1, then x*_i = −1 ∀i ∈ J. The optimal solution for the remaining nodes (V \ J) can be found by solving (5) over only those nodes. The same claims can be made when J ⊆ V is a majorant group of nodes in V, where instead w*_i = +1 ⇒ x*_i = +1 ∀i ∈ J.

Proof. Denote by W the set of nodes such that w*_i = −1, and let Ŵ = V \ W. Clearly, the solution to Problem (5) over nodes in W has the solution with all variables equal to −1. Problem (5) can be written in the following form with separable objective:

maximize Σ_{i∈W} z_i x_i + Σ_{i∈V\W} z_i x_i
subject to x_i ≤ x_j ∀(i, j) ∈ I, i, j ∈ W
           x_i ≤ x_j ∀(i, j) ∈ I, i ∈ V, j ∈ V \ W
           −1 ≤ x_i ≤ 1 ∀i ∈ V   (7)

Start with an initial solution x_i = 1 ∀i ∈ V.
Variables in W can be optimized over first and by assumption have the optimal value with all variables equal to −1. Optimization over variables in Ŵ is not bounded from below by variables in W, since those variables are all at the lower bound. Hence the optimal solution for variables in Ŵ is given by optimizing over only these variables. The result for minorant groups follows. The final claim is easily argued in the same way as for the minorant groups.

Given Proposition 3, Algorithm 2, which iteratively solves (5), can be stated. The subtrees are built as follows. First, an upper triangular adjacency matrix C can be constructed to represent I, where C_ij = 1 if x_i ≤ x_j is an isotonic constraint and C_ij = 0 otherwise. A minorant (majorant) subtree with k nodes is then constructed as the upper left (lower right) k × k sub-matrix of C.

Algorithm 2 Iterative algorithm for linear program (5)
Require: Observations y_1, . . . , y_n and partial order I.
Require: MAXSIZE of problem to be solved by general LP solver.
Require: V = {1, . . . , n}, L = U = {}.
1: while |V| ≥ MAXSIZE do
2:   ELIMINATE A MINORANT SET OF NODES:
3:   Build a minorant subtree T.
4:   Solve linear program (5) on T and get solution ŷ ∈ {−1, +1}^|T|.
5:   L = L ∪ {v ∈ T : ŷ_v = −1}, V = V \ {v ∈ T : ŷ_v = −1}.
6:   ELIMINATE A MAJORANT SET OF NODES:
7:   Build a majorant subtree T.
8:   Solve linear program (5) on T and get solution ŷ ∈ {−1, +1}^|T|.
9:   U = U ∪ {v ∈ T : ŷ_v = +1}, V = V \ {v ∈ T : ŷ_v = +1}.
10: end while
11: Solve linear program (5) on V and get solution ŷ ∈ {−1, +1}^|V|.
12: L = L ∪ {v ∈ V : ŷ_v = −1}, U = U ∪ {v ∈ V : ŷ_v = +1}.

The computational bottleneck of Algorithm 2 is solving linear program (5), which is done efficiently by solving the dual network flow problem (6). This shows that, if the first network flow problem is too large to solve, it can be solved by a sequence of smaller network flow problems, as illustrated in Figure 1. Lemma 4 below proves that this reduction optimally solves the full problem (5). In the worst case, many network flow problems will be solved until the original full-size network flow problem is solved. However, in practice on large problems, this behavior is never observed. Computational performance of this reduction is demonstrated in Section 5.

Lemma 4 Algorithm 2 optimally solves Problem (5).

Proof. The result follows from repeated application of Proposition 3 over the set of nodes V that has not yet been optimally solved for.

4 Complexity of the partitioning algorithm

Linear program (5) can be solved in O(n^3) using interior point methods. Given that the algorithm performs at most n iterations, the worst-case complexity of Algorithm 1 is O(n^4). However, the practical complexity of IRP is significantly better than the worst case.
Each iteration of LP (5) solves smaller problems. Consider the case of balanced partitioning at each iteration until there are n final blocks. In this case, we can represent the partitioning path as a binary tree with log n levels, and at each level k, LP (5) is solved 2^k times on instances of size n/2^k, which leads to a total complexity of

Σ_{k=0}^{log n} 2^k (n/2^k)^3 = n^3 (Σ_{k=0}^{log n} (1/4)^k) = n^3 ((1 − 0.25^{log n + 1}) / 0.75),

subject to additional constants. For n ≥ 10, the summation is approximately 1.33, and hence in this case the partitioning algorithm has complexity O(1.33 n^3) (considering the complexity of interior point methods for partitioning).

Figure 1: Illustration of the LP (5) decomposition. Data here is 2-dimensional with only 1000 nodes in order to leave a clear picture. The first 7 iterations and the final iteration 16 of the decomposition are shown from left to right and top to bottom. The number of remaining nodes (blue circles) to identify as ±1 decreases through the iterations. LP (5) solved on the entire set of nodes in the first picture may be too large for memory. Hence subproblems are solved on the lower left (red dots) and upper right (green dots) of the networks, and some nodes are fixed from the solution of these subproblems. This is repeated until the number of unidentified nodes in the last iteration is of small enough size for memory. Note that at each iteration the three groups obey isotonicity.

More generally, let p and 1 − p be the proportions of each split. Table 1 displays the constants c representing the complexity O(c n^3) over varying p and n. As demonstrated, the problem size rapidly decreases and the complexity is in practice O(n^3).

         n=100     n=1000    n=10000
p=0.55   1.35n^3   1.35n^3   1.35n^3
p=0.65   1.46n^3   1.46n^3   1.47n^3
p=0.75   1.77n^3   1.78n^3   1.78n^3
p=0.85   2.56n^3   2.61n^3   2.61n^3
p=0.95   6.41n^3   6.94n^3   7.01n^3

Table 1: Complexity: groups are split with ratio p at each iteration. Complexity in practice is O(n^3).

5 Numerical experiments

We here demonstrate that exact isotonic regression is computationally tractable for very large problems, and compare against the time it takes to get an approximation. We first show the computational performance of isotonic regression on simulated data sets as large as 2 × 10^5 training points with more than 10^7 constraints. We then show the favorable predictive performance of isotonic regression on large simulated data sets.

5.1 Large-Scale Computations

Figure 2 demonstrates that the partitioning algorithm with decompositions of the partitioning step can solve very large isotonic regressions. Three-dimensional data is simulated from U(0, 2) and the responses are created as linear functions plus noise. The size of the training sets varies from 10^4 to 2 × 10^5 points.
The left figure shows that the partitioning algorithm finds the globally optimal isotonic regression solution in not much more time than it takes to find an approximation as done in [6] for very large problems. Although the worst-case complexity of our exact algorithm is much worse, the two algorithms scale comparably in practice.

Figure 2 (right) shows how the number of partitions (left axis) increases as the number of training points increases. It is not clear why the approximation in [6] makes fewer partitions as the size of the problem grows. More partitions (left axis) require solving more network flow problems; however, as discussed, these problems shrink very quickly over the partitioning path, resulting in the practical complexity seen in the figure on the left. The bold black line also shows the number of constraints (right axis), which reaches more than 10^7 constraints.

Figure 2: IRP performance on large-scale simulations. Data x ∈ R^3 has x_i ∼ U(0, 2). Responses y are linear functions plus noise. The number of training points varies from 10^4 to 2 × 10^5. Results shown are averages of 5 simulations, with dotted lines at ± one standard deviation. Time (seconds) versus number of training points is on the left.
On the right, the number of partitions is illustrated using the left axis, and the bold black line shows the average number of constraints per test using the right axis.

5.2 Predictive Performance

Here we show that isotonic regression is a useful tool when the data fits the monotonic framework. Data is simulated as above and responses are constructed as y_i = Π_j x_ij + N(0, 0.5^2), where p varies from 2 to 6. The training set varies from 500 to 5000 to 50000 points and the test size is fixed at 5000. Results are averaged over 10 trials and 95% confidence intervals are given. A comparison is made between isotonic regression and linear least squares regression. With only 500 training points, the model is poorly fitted and a simple linear regression performs much better. 5000 training points is sufficient to fit the model well with up to 4 dimensions, after which linear regression outperforms the isotonic regression, and 50000 training points fits the model well with up to 5 dimensions. Two trends are observed.
Larger training sets allow better models to be fit, which improves performance, while higher dimensions increase overfitting, which, in turn, decreases performance.

Dim | IRP MSE n=500 | LS MSE n=500 | IRP MSE n=5000 | LS MSE n=5000 | IRP MSE n=50000 | LS MSE n=50000
 2  | 0.69 ± 0.01   | 0.37 ± 0.00  | 0.27 ± 0.00    | 0.36 ± 0.00   | 0.25 ± 0.00     | 0.36 ± 0.00
 3  | 0.76 ± 0.03   | 0.65 ± 0.01  | 0.31 ± 0.00    | 0.61 ± 0.01   | 0.26 ± 0.00     | 0.62 ± 0.00
 4  | 1.45 ± 0.08   | 1.08 ± 0.01  | 0.61 ± 0.02    | 1.08 ± 0.02   | 0.34 ± 0.01     | 1.06 ± 0.03
 5  | 4.61 ± 0.65   | 1.76 ± 0.02  | 2.61 ± 0.16    | 1.88 ± 0.04   | 0.93 ± 0.04     | 1.86 ± 0.05
 6  | 12.89 ± 1.30  | 3.06 ± 0.04  | 8.41 ± 1.36    | 2.84 ± 0.07   | 3.37 ± 0.06     | 2.83 ± 0.12

Table 2: Statistics for the simulation generated with y_i = Π_j x_ij + N(0, 0.5^2). A comparison between the results of IRP and a least squares linear regression is shown. Bold demonstrates statistical significance at 95% confidence.

6 Conclusion

This paper demonstrates that isotonic regression can be used to solve extremely large problems. Fast approximations are useful; however, as shown, globally optimal solutions are also computationally tractable. Indeed, isotonic regression as done here performs with a complexity of O(n^3) in practice. As also shown, isotonic regression performs well at reasonable dimensions, but suffers from overfitting as the dimension of the data increases. Extensions of this algorithm will analyze the path of partitions in order to control overfitting by stopping the algorithm early. Statistical complexity of the models generated by partitioning will be examined. Furthermore, similar results will be developed for isotonic regression with different loss functions.

References

[1] R.E. Barlow and H.D. Brunk. The isotonic regression problem and its dual.
Journal of the American Statistical Association, 67(337):140–147, 1972.

[2] G. Obozinski, G. Lanckriet, C. Grant, M.I. Jordan, and W.S. Noble. Consistent probabilistic outputs for protein function prediction. Genome Biology, 9:247–254, 2008. Open Access.

[3] M.J. Schell and B. Singh. The reduced monotonic regression method. Journal of the American Statistical Association, 92(437):128–135, 1997.

[4] J.B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1964.

[5] H. Block, S. Qian, and A. Sampson. Structure algorithms for partially ordered isotonic regression. Journal of Computational and Graphical Statistics, 3(3):285–300, 1994.

[6] O. Burdakov, O. Sysoev, A. Grimvall, and M. Hussian. An O(n^2) algorithm for isotonic regression. In: G. Di Pillo and M. Roma (Eds.), Large-Scale Nonlinear Optimization, Nonconvex Optimization and Its Applications, 83:25–83, 2006.

[7] C.-I. C. Lee. The min-max algorithm and isotonic regression. The Annals of Statistics, 11(2):467–477, 1983.

[8] J. de Leeuw, K. Hornik, and P. Mair. Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. 2009. UC Los Angeles: Department of Statistics, UCLA. Retrieved from: http://cran.r-project.org/web/packages/isotone/vignettes/isotone.pdf.

[9] W.L. Maxwell and J.A. Muckstadt. Establishing consistent and realistic reorder intervals in production-distribution systems. Operations Research, 33(6):1316–1341, 1985.

[10] R.D.C. Monteiro and I. Adler. Interior path following primal-dual algorithms. Part II: Convex quadratic programming. Mathematical Programming, 44:43–66, 1989.

[11] O. Burdakov, O. Sysoev, and A. Grimvall. Generalized PAV algorithm with block refinement for partially ordered monotonic regression. Pages 23–37, 2009. In: A. Feelders and R. Potharst (Eds.), Proc.
of the Workshop on Learning Monotone Models from Data at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.

[12] P.M. Pardalos and G. Xue. Algorithms for a class of isotonic regression problems. Algorithmica, 23:211–222, 1999.

[13] K.G. Murty. Linear Programming. John Wiley & Sons, Inc., 1983.

[14] R. Chandrasekaran, Y.U. Ryu, V.S. Jacob, and S. Hong. Isotonic separation. INFORMS Journal on Computing, 17(4):462–474, 2005.

[15] MOSEK ApS. The MOSEK optimization tools manual. Version 6.0, revision 61. 2010. Software available at http://www.mosek.com.

[16] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice-Hall, Inc., 1993.
", "award": [], "sourceid": 252, "authors": [{"given_name": "Ronny", "family_name": "Luss", "institution": null}, {"given_name": "Saharon", "family_name": "Rosset", "institution": null}, {"given_name": "Moni", "family_name": "Shahar", "institution": null}]}