{"title": "Globally Optimal Learning for Structured Elliptical Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 13488, "page_last": 13497, "abstract": "Heavy tailed and contaminated data are common in various applications of machine learning. A standard technique to handle regression tasks that involve such data is to use robust losses, e.g., the popular Huber's loss. In structured problems, however, where there are multiple labels and structural constraints on the labels are imposed (or learned), robust optimization is challenging, and more often than not the loss used is simply the negative log-likelihood of a Gaussian Markov random field. In this work, we analyze robust alternatives. Theoretical understanding of such problems is quite limited, with guarantees on optimization given only for special cases and non-structured settings. The core of the difficulty is the non-convexity of the objective function, implying that standard optimization algorithms may converge to sub-optimal critical points. Our analysis focuses on loss functions that arise from elliptical distributions, which appealingly include most loss functions proposed in the literature as special cases. We show that, even though these problems are non-convex, they can be optimized efficiently. Concretely, we prove that at the limit of infinite training data, due to algebraic properties of the problem, all stationary points are globally optimal.
Finally, we demonstrate the empirical appeal of using these losses for regression on synthetic and real-life data.", "full_text": "Globally Optimal Learning for Structured Elliptical Losses

Yoav Wald∗ (Hebrew University, yoav.wald@mail.huji.ac.il)
Nofar Noy (Hebrew University, nofar.noy@mail.huji.ac.il)
Ami Wiesel (Google Research and Hebrew University, awiesel@google.com)
Gal Elidan (Google Research and Hebrew University, elidan@google.com)

Abstract

Heavy tailed and contaminated data are common in various applications of machine learning. A standard technique to handle regression tasks that involve such data is to use robust losses, e.g., the popular Huber's loss.
In structured problems, however, where there are multiple labels and structural constraints on the labels are imposed (or learned), robust optimization is challenging, and more often than not the loss used is simply the negative log-likelihood of a Gaussian Markov random field. In this work, we analyze robust alternatives. Theoretical understanding of such problems is quite limited, with guarantees on optimization given only for special cases and non-structured settings. The core of the difficulty is the non-convexity of the objective function, implying that standard optimization algorithms may converge to sub-optimal critical points. Our analysis focuses on loss functions that arise from elliptical distributions, which appealingly include most loss functions proposed in the literature as special cases. We show that, even though these problems are non-convex, they can be optimized efficiently. Concretely, we prove that at the limit of infinite training data, due to algebraic properties of the problem, all stationary points are globally optimal.
Finally, we demonstrate the empirical appeal of using these losses for regression on synthetic and real-life data.

1 Introduction

Many machine learning tasks require the prediction of several correlated real-valued variables. For example, in modeling of the stock market, the price of multiple interacting stocks is of interest; in the context of weather prediction, different locations exhibit a natural spatial dependence that is governed by proximity as well as geographical features; in river discharge forecasting, water volume is predicted at different locations and times; etc. In all of these tasks, the labels exhibit correlations (e.g., nearby locations will have similar temperatures), and it seems advantageous to build prediction models that can use these correlations to improve prediction accuracy. Such models are known as structured prediction approaches, and they have been studied since the early works on graphical models [2], and later in frameworks like conditional random fields [13] and maximum margin Markov networks [25].
We are interested in capturing realistic scenarios, i.e., to be able to account for heavy tailed and contaminated data. In statistics, the prominent approach for coping with such settings is to use robust M-estimators [11].

∗Work done during an internship at Google Research.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
A loss function ρ with certain desirable properties is chosen, and the model is trained by minimizing the total loss over training instances, possibly with an added regularization term r(w):

min_w ∑_{i=1}^m ρ(z_i; w) + r(w).    (1)

Perhaps the most popular robust estimator is the Huber loss, currently a vital tool in machine learning:

ρ_δ(t) = { (1/2)t²,           |t| ≤ δ
         { δ(|t| − (1/2)δ),   |t| ≥ δ.    (2)

This loss and its variants are applicable in supervised regression problems with a single label y. For the simplest case of linear regression, this will be given by ρ(x, y; w) = ρ_δ(⟨w, x⟩ − y). In this scalar label scenario, the problem of learning such a linear regression model is convex, and has a rich theory describing its statistical and computational properties. Recently, the interesting work of [19] extended this theory to non-convex losses.
Unfortunately, as we move away from the single label setting, and structured models are needed, our understanding of robust learning is quite limited, even for the simplest cases. In this work we make an important step toward rectifying this gap, and develop a theoretical characterization of the optimization landscape in a powerful structured robust setting.
Concretely, we consider robust estimation of a structured inverse covariance matrix, also known as a graphical model or Markov random field (MRF) [6, 30]. Most works that rely on MRFs for regression are limited to the multivariate Gaussian loss, which is convex and theoretically understood. Unfortunately, modifying the loss to one of the robust variants, including convex ones such as Huber's loss, results in a non-convex minimization problem.
Such objectives appear hard to analyze and prone to bad local minima.
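For concreteness, the Huber loss of Equation (2) takes only a few lines of Python (a minimal sketch; the function name and vectorized form are our own):

```python
import numpy as np

def huber_loss(t, delta=1.0):
    """Huber's loss rho_delta of Equation (2): quadratic for small residuals,
    linear beyond the threshold delta, so outliers are penalized less harshly."""
    t = np.asarray(t, dtype=float)
    quadratic = 0.5 * t ** 2
    linear = delta * (np.abs(t) - 0.5 * delta)
    return np.where(np.abs(t) <= delta, quadratic, linear)
```

The two branches agree at |t| = δ, so the loss is continuous (and continuously differentiable) at the transition.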
Consequently, analysis has been mostly constrained to limited structures [32], or special cases of loss functions [36, 22, 34, 4]. In this work we provide encouraging results for a wide range of robust loss functions, showing that they can be globally optimized. Concretely, we show that at the limit of infinite training data, due to algebraic properties of the problem, all stationary points are globally optimal.
The loss functions we consider arise from the elliptical family of distributions [5, 3, 35], and appealingly include many of the robust loss functions for covariance estimation that were analyzed in the literature on the unstructured settings [24, 22]. Empirically, we demonstrate that using these losses in the structured setting leads to substantial performance gains both in synthetic and real-life problems.

2 Formal setting

We are given a dataset {z_i}_{i=1}^m of i.i.d. samples from an unknown distribution on R^n. The standard way to fit an inverse covariance matrix Γ from the data is to solve the Gaussian maximum likelihood problem:

(GMLE)  arg min_{Γ ≻ 0}  (1/m) ∑_{i=1}^m z_i^⊤ Γ z_i + log|Γ^{-1}|.    (3)

Following Huber's loss, a natural generalization of this problem is to replace the Gaussian squared loss with a robust loss that is less sensitive to heavy tails and outlier contamination. This results in a robust maximum likelihood estimation problem:

(RMLE)  arg min_{Γ ≻ 0}  (1/m) ∑_{i=1}^m ρ(√(z_i^⊤ Γ z_i)) + log|Γ^{-1}|.    (4)

In particular, this formulation includes many commonly used losses, detailed in Table 1. Their main property is that they can be interpreted as the negative log likelihoods of scaled multivariate normal distributions.
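To make the objectives concrete, the following Python sketch (our own illustration; `rmle_objective` is a hypothetical helper, not code from the paper) evaluates the RMLE objective of Equation (4) for a given loss ρ; plugging in ρ(t) = t² recovers the Gaussian objective of Equation (3):

```python
import numpy as np

def rmle_objective(gamma, z, rho):
    """(1/m) * sum_i rho(sqrt(z_i^T Gamma z_i)) + log|Gamma^{-1}|, Equation (4)."""
    quad = np.einsum('ij,jk,ik->i', z, gamma, z)   # z_i^T Gamma z_i for every sample
    sign, logdet = np.linalg.slogdet(gamma)
    assert sign > 0, "Gamma must be positive definite"
    return np.mean(rho(np.sqrt(quad))) - logdet    # log|Gamma^{-1}| = -log|Gamma|

# The Gaussian choice rho(t) = t^2 turns Equation (4) back into Equation (3).
rho_gauss = lambda t: t ** 2
```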
Depending on the specific type of scaling, these distributions are known as elliptical, spherically invariant random vectors (SIRV) or angular. For simplicity, our analysis below will focus on the SIRV formulation [35].

Table 1: Common loss functions that satisfy the conditions of our theoretical analysis.

  Gaussian:              ρ(t) = t²                                   [30, 6]
  Generalized Gaussian:  ρ(t) = t^{2β}, β ∈ (0, 1)                   [5, 22]
  T distribution:        ρ(t) = ((n+ν)/2) log(1 + t²/ν), ν > 2       [5, 4]
  Angular / Tyler:       ρ(t) = n log(t²)                            [26, 32]
  Huber:                 ρ(t) = min{(1/2)t², δ(|t| − (1/2)δ)}        usually in regression [11]
  Trimmed:               ρ(t) = min{(1/2)t², δ}                      penalized version of [34]

Definition 1 A spherically invariant random vector z ∼ SIRV(g, Σ) is defined as the product of a positive random scalar ν, known as texture, with density g(ν) and an independent zero mean multivariate normal u ∼ N(0, Σ) with covariance Σ, i.e., z = νu.
This work will be focused on structured covariance estimation, where we assume the inverse covariance matrix lies in some linear subspace that is known a-priori. Such settings are natural, e.g. when the different dimensions correspond to entities that are spatially arranged. The precise notion of structure in Γ that we will consider in this work is defined below.
Definition 2 Let {G_α}_{α∈I} be a set of matrices in R^{n×n}, where I is a set of indices.
For a vector w ∈ R^{|I|}, denote:

Γ(w) = ∑_{α∈I} w_α G_α.

We will say that an inverse covariance matrix is structured according to these matrices if it belongs to the following set G:

G = {Γ(w) | w ∈ R^{|I|}, Γ(w) ≻ 0}.

The most common structure considered in the literature is a graphical structure, or a Markov Random Field:

G_{ij} = e_i e_j^⊤ + e_j e_i^⊤,  (i, j) ∈ E.    (5)

Here e_i is the i'th standard unit vector, E are the edges of an undirected graph, and we also allow self edges (i, i) ∈ E to accommodate the diagonal entries of the matrix. The type of structure considered in this work is more general, allowing for example parameter sharing between edges, or even non-graphical structures. Imposing the structural constraints on Equation (3) gives the maximum likelihood estimation of a Gaussian Markov Random Field [12, 30]:

(GMRF)  arg min_{w: Γ(w) ∈ G}  (1/m) ∑_{i=1}^m z_i^⊤ Γ(w) z_i + log|Γ(w)^{-1}|.    (6)

Placing these constraints on Equation (4), we arrive at the robust structured problem we will analyze in this paper:

(RMRF)  arg min_{w: Γ(w) ∈ G}  (1/m) ∑_{i=1}^m ρ(√(z_i^⊤ Γ(w) z_i)) + log|Γ(w)^{-1}|.    (7)

Application to linear regression  In practice the task we are interested in, and for which we present results in the experimental part, is linear regression. This supervised problem fits into our framework when we have z = (x, y), a vector that concatenates features x ∈ R^{n₁} and labels y ∈ R^{n₂}, such that n₁ + n₂ = n. Briefly, we derive a linear regressor from an estimated inverse covariance matrix as follows.
Assume Γ̂ ∈ R^{n×n} is the output of one of the algorithms considered in this work, and write it as a block matrix corresponding to features and labels:

Γ̂ = [ Γ̂_xx  Γ̂_xy ]
    [ Γ̂_yx  Γ̂_yy ].    (8)

Then the linear regressor will be given by:

ŷ(x) = −Γ̂_yy^{-1} Γ̂_yx x.    (9)

When no structure is imposed and Γ̂ is obtained by maximizing the likelihood of a Gaussian, this regressor coincides with the solution obtained by using the sample covariance. But, when the loss is changed, or structural constraints are imposed, this is no longer the case.
With all formal definitions in place, the next section reviews the most relevant of the vast literature on robust and structured inverse covariance estimation. We then provide our central result in Section 4.

3 Related work

Robust machine learning: There is a renewed interest in robust statistics in the machine learning community. Recent works consider sample complexity analysis [10] and computational analysis of non-convex robust loss functions [16, 19]. In this work we generalize some of these insights to the structured prediction setting.
Unstructured elliptical losses: The use of elliptical distributions in the context of multivariate robust statistics dates back to the works of [11, 26]. These models lead to non-convex optimization but are well understood in terms of efficient algorithms [26, 22], loss landscape [32] and sample complexity [24]. In the structured problem, it is unclear how to adapt these algorithms, or whether properties of the optimization landscape (e.g. geodesic convexity) hold under linear constraints. Hence the need for the results in this work.
Robust graphical models: Following the success of graphical models and structured prediction [17, 6], there are many works on non-Gaussian alternatives.
Multivariate t elliptical graphical models have been considered in [29, 4]. These were extended in [14] to the transelliptical family via a copula, similar to the way Gaussian models are extended to the non-paranormal [15]. Another related work considered trimmed graphical models [34]. These works do not analyze the maximum likelihood formulation, and consequently they do not provide guarantees on the landscape of the loss function. Our work aims to provide a firmer theory to motivate these approaches, and the further development of principled techniques for robust structured prediction. We emphasize that the works above also consider structure learning, whereas we address the case of known structure.
Non-Gaussian graphical models: Recent growing interest in generalizing continuous graphical models to non-Gaussian settings includes [20], who identify the sparsity pattern in the inverse covariance matrix for non-Gaussian data. Other works focus on inference when the model is not Gaussian [7, 31], but these are less relevant for the robust regression problem.

4 Globally optimal learning for elliptical losses

As discussed, many works are concerned with the problem of estimating structured inverse covariance matrices, and there is wide interest in using robust losses that can account for realistic data scenarios. Accordingly, the key question that we tackle in this work is whether Equation (7) can be solved efficiently. Our main result is that these structured problems can in fact be efficiently minimized, for an important range of losses. Concretely, the analysis holds for well-behaved loss functions that satisfy the following assumption:

Assumption 1 The loss ρ(√t) is twice differentiable and concave in t.
Its derivative w.r.t. t, denoted by ψ, satisfies ψ(t) ≥ −tψ′(t) for all t > 0.

Losses that satisfy the assumption include all the ones mentioned in Table 1, except for the Trimmed loss. The condition on ψ can be translated roughly as ρ growing at least as fast as a logarithm, which corresponds to the angular/Tyler loss. This is also the one loss where the condition on ψ is met with equality.
We start with the following auxiliary lemma:

Lemma 1 Let v be an SIRV(g, Σ) with arbitrary texture g, and ρ a function that satisfies Assumption 1. Define the matrix:

Σ^ρ(v) = E_v[v v^⊤ ψ(‖v‖²₂)].    (10)

Then Σ^ρ(v) and Σ commute, and maintain the same order of eigenvalues.

This result builds on the analysis in [1] and extends it to a more general class of loss functions. The proof is provided in the supplementary material. Note that Lemma 1 holds for arbitrary textures and there is no requirement for ψ and g to match.
Let us clarify what Lemma 1 means in terms of optimality conditions. We use the shorthand Γ* = Γ(w*) for the true inverse covariance of z and Σ* = Γ*^{-1} for its inverse. Using these notations, if

Σ^ρ(Γ(w)^{1/2} z) = I,    (11)

then Γ(w)^{1/2} Σ* Γ(w)^{1/2} maintains the ordering of eigenvalues in I, or in other words it equals I up to a multiplicative scalar.

Corollary 1 If Equation (11) holds then Γ(w) = cΓ* for some constant c > 0.

Identifying the ground truth covariance matrix only up to a multiplicative constant is an inherent limitation of the problem in some losses. For instance, in the angular case cw* will be an optimal point of the problem for any c > 0.
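This ambiguity is easy to verify on the regressor of Equation (9): multiplying an estimate Γ̂ by any c > 0 leaves the prediction unchanged, since the constant cancels in Γ̂_yy^{-1} Γ̂_yx. A small numerical check (the matrix, dimensions, and constant here are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def regressor(gamma, x, n1):
    """y_hat(x) = -Gamma_yy^{-1} Gamma_yx x, the linear regressor of Equation (9)."""
    gamma_yy = gamma[n1:, n1:]
    gamma_yx = gamma[n1:, :n1]
    return -np.linalg.solve(gamma_yy, gamma_yx @ x)

# A random positive definite "estimate" over n1 = 3 features and n2 = 2 labels.
a = rng.standard_normal((5, 5))
gamma_hat = a @ a.T + 5.0 * np.eye(5)

x = rng.standard_normal(3)
y1 = regressor(gamma_hat, x, n1=3)
y2 = regressor(7.3 * gamma_hat, x, n1=3)   # same prediction: the constant cancels
```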
In practice, this is of no real concern since the optimal regressor in (9) is invariant under this multiplicative constant.
Our main result is that at the population limit, these multiples of the ground truth matrix are the only stationary points of the non-convex optimization problem:

Theorem 1 Let z be an SIRV(g, Γ^{-1}(w*)) and ρ a function that satisfies Assumption 1, and consider the optimization problem:

min_{w: Γ(w) ∈ G}  E_z{ρ(√(z^⊤ Γ(w) z))} + log|Γ(w)^{-1}|.    (12)

If w is a stationary point of the loss, it holds that Γ(w) equals Γ(w*) up to a multiplicative constant.

Proof We take the derivative of our loss L(w) = E_z{ρ(√(z^⊤ Γ(w) z))} + log|Γ(w)^{-1}| with respect to w_α for some α ∈ I. Following simple manipulations we have:

∂L(w)/∂w_α = Tr{(E_z{z z^⊤ ψ(z^⊤ Γ(w) z)} − Γ(w)^{-1}) G_α}
           = Tr{(Γ(w)^{-1/2} Σ^ρ(Γ(w)^{1/2} z) Γ(w)^{-1/2} − Γ(w)^{-1}) G_α}.

Denote the eigenvalues of Σ^ρ(Γ(w)^{1/2} z) by {δ_i}_{i=1}^n. We will prove that whenever

⟨∇L(w), w*⟩ = ⟨∇L(w), w⟩ = 0,

then δ_i = 1 for all i ∈ [n]. From Corollary 1 this will imply that ∇L(w) = 0 only when Γ(w) = cΓ(w*) for some constant c.
Consider the SIRV vector Γ(w)^{1/2} z. Due to Lemma 1, its covariance Γ(w)^{1/2} Γ*^{-1} Γ(w)^{1/2} and Σ^ρ(Γ(w)^{1/2} z) commute and the order of their eigenvalues is maintained. Taking the inverse, the eigenvalues of Γ(w)^{-1/2} Γ* Γ(w)^{-1/2}, denoted by {λ_i^{-1}}_{i=1}^n, are ordered in reverse.
Then we have:

⟨∇L(w), w⟩ = Tr{Σ^ρ(Γ(w)^{1/2} z) − I} = ∑_{i=1}^n (δ_i − 1),    (13)

⟨∇L(w), w*⟩ = Tr{(Σ^ρ(Γ(w)^{1/2} z) − I) Γ(w)^{-1/2} Γ* Γ(w)^{-1/2}} = ∑_{i=1}^n λ_i^{-1} (δ_i − 1).    (14)

If the term in Equation (13) equals 0, then the average (1/n)∑_{i=1}^n δ_i = 1. The term ∑_{i=1}^n (λ_i^{-1} / ∑_{j=1}^n λ_j^{-1}) δ_i is a convex combination of the eigenvalues {δ_i}_{i=1}^n. To satisfy equality to 0 in Equation (14), this convex combination must equal the average (1/n)∑_{i=1}^n δ_i. But because δ and λ^{-1} are in reverse order, this can only hold if δ_i = 1 for all i.

Algorithm 1 Minimization Majorization for Elliptical Markov Random Fields
Require: ρ : R₊₊ → R, {z_i}_{i=1}^m
Set Γ₀ ← I
for t = 0 … T do
    Rescale data: z̃_i = ψ(z_i^⊤ Γ_t z_i)^{1/2} · z_i  ∀i ∈ [m]
    Solve the convex minimization in Equation (6) with data {z̃_i}_{i=1}^m and set Γ_{t+1} to the solution
end for

Optimization via minimization majorization and Newton coordinate descent  Motivated by the optimality result on critical points, we now discuss practical considerations regarding the choice of algorithm for optimizing Equation (7). We propose using a minimization-majorization approach, where we majorize the concave ρ(·) functions by their linear approximation. The resulting minimizations are classical Gaussian MRFs with reweighted data. These are optimized by Newton coordinate descent, as proposed in other works on structured Gaussian models [33, 18].
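For intuition, here is a minimal unstructured sketch of Algorithm 1 for the Tyler loss, where the rescaling weights become n/(z_i^⊤ Γ_t z_i) and the inner Gaussian step has a closed form (the inverse sample covariance of the rescaled data), recovering Tyler's classical fixed-point iteration. In the structured case this inner step would instead be a Newton coordinate-descent solve of Equation (6); the function name and the trace normalization, which pins down the arbitrary scale, are our own choices:

```python
import numpy as np

def tyler_mm(z, num_iters=50):
    """Minimization-majorization for the unstructured Tyler loss: each iteration
    rescales the data and solves the Gaussian problem in closed form."""
    m, n = z.shape
    gamma = np.eye(n)                                  # Gamma_0 <- I
    for _ in range(num_iters):
        q = np.einsum('ij,jk,ik->i', z, gamma, z)      # z_i^T Gamma_t z_i
        z_tilde = z * np.sqrt(n / q)[:, None]          # down-weight large residuals
        s = z_tilde.T @ z_tilde / m                    # Gaussian step on rescaled data
        gamma = np.linalg.inv(s)
        gamma *= n / np.trace(gamma)                   # fix the arbitrary scale
    return gamma
```

Down-weighting samples with a large Mahalanobis norm is exactly the intuition behind the rescaling step: outliers contribute less to each Gaussian sub-problem.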
This method enables us to use the extremely efficient algorithms developed for Gaussian models [9, 8]. In practice, few iterations of minimization-majorization are required. The overall procedure is given in Algorithm 1.
Scaling the dataset as done in this algorithm is intuitive in the context of robust regression. Since ψ decays for large arguments, z̃_i is scaled down when we suffer a high loss due to z_i, thus reducing the loss incurred by outliers or points with very large magnitudes. A detailed derivation of the algorithm can be found in the supplementary material.

5 Experiments

Having proved that the problem of structured covariance estimation with robust losses is in fact amenable to efficient optimization, we now demonstrate the efficacy of doing so in a synthetic setting as well as two markedly different real-life datasets.

5.1 Synthetic data

We start with the synthetic setting, where we aim to contrast the Gaussian case with non-Gaussian ones, including a heavy tailed scenario. Our setup is as follows:

• A sparse inverse covariance matrix Γ ∈ R^{10×10} is drawn at each trial.²
• Instances of {z_i}_{i=1}^m are drawn from multivariate Generalized Gaussian distributions [22], with zero mean and a sparse inverse covariance Γ, for varying values of the parameter β: starting at β = 1, which is simply a multivariate Gaussian, β = 0.5, which corresponds to a Laplace distribution, and β = 0.2, which gives a heavy tailed distribution.
• Structured methods are given the true graphical structure defined by the sparsity pattern of Γ, as described in Equation (5).

Figure 1: Experiments on synthetic datasets. Distance of the estimated parameters from ground truth for different generating distributions.
(a) Multivariate Gaussian; (b) Generalized Gaussian, β = 0.5; (c) Generalized Gaussian, β = 0.2.

²We created random sparse precision matrices using the sklearn function make_sparse_psd [23].

Our baseline is the common GMRF, as in Equation (6). We run our Algorithm 1 with all the losses described in Table 1. The greatest contrast is observed between losses that are suited for heavy tailed data (e.g. the multivariate-t loss, Tyler, etc.), and ones that resemble the squared loss (e.g. Gaussian). Thus, for readability, we only present a comparison with the structured Tyler loss. The experiments on real data will also include results for the Laplace loss (which corresponds to the Generalized Gaussian with β = 0.5). We also note that the performance of unstructured alternatives (not shown for clarity) is drastically inferior in the synthetic setting. Errors are measured in Frobenius distance from the ground truth matrix Γ(w*), normalized by its Frobenius norm.
The results are shown in Figure 1. As can be observed, the robust loss is slightly sub-optimal when the data is purely Gaussian (left panel). However, when the generating distribution is heavy-tailed (right panel) there is a large gap in favor of using a robust loss.

5.2 Stock market dataset

We now consider the first real-life setting: the stock market. The "Huge Stock Market Dataset" on Kaggle has historical data on the value of stocks over many years. Our experiment was conducted as follows:

• We took the intra-day returns (difference between closing and opening price, divided by opening price) of 342 stocks in the years between 2004 and 2010.
To fix a structure for use in structured algorithms, we ran the Graphical Lasso [6] from sklearn over the training data. All structured approaches were then given the obtained sparsity pattern.
• Stocks are randomly divided into a set of 105 observed and 15 hidden stocks. Our task is to predict the intraday return y ∈ R^{15} of the hidden stocks, given the intraday return of the other ones, x ∈ R^{105}, on the same day. We repeated the experiments for 60 random divisions.
• We use data on the years between 2004 and mid-2011 (excluding the mid-2007 to mid-2009 financial crisis) as training data and test over the values from then until 2015. Training data was randomly permuted, and algorithms were given samples of increasing size.

Figure 2: Comparison of the Gaussian and robust structured loss on the "Huge Stock Market" dataset. Shown is the MSE of the prediction as a function of the number of samples used for training.

Figure 2 shows the average error over repetitions. Since different divisions give errors in slightly different scales, for each division we calculate the ratio between the MSE and the best observed MSE over all losses. The advantage of using a robust loss is quite evident. This should not come as a surprise given the synthetic experiment, since we expect real-life stock behavior to be quite heavy tailed.

5.3 River discharge estimation

Finally, we consider the real-life heavy tailed challenge of river discharge estimation, where we jointly forecast water discharge (water volume per second) in multiple rivers based on historical data and the precipitation over their drainage basins. We downloaded daily water discharge levels of rivers at 34 different sites from the United States Geological Survey (USGS) website, and synchronized them with rainfall measurements available from the Global Satellite Mapping of Precipitation (GSMaP) product [27].

Figure 3:
Experiments on river discharge regression. Normalized MSE of the robust losses vs. the Gaussian loss for two types of models: 1. unstructured models; 2. temporal and spatial structure.

• The features we used are precipitation levels and discharge at day t, used to predict the discharge for days t + 1, t + 2, t + 3. Overall this amounts to 68 features, x = (d_t, p_t), where d_t ∈ R^{34} and p_t ∈ R^{34}, and 102 labels y = (d_{t+1}, d_{t+2}, d_{t+3}).
• A structured prediction graphical model and a vanilla unstructured linear regression were used. The structure used is temporal and spatial, where we place an edge between discharge variables for the same site across different days, and between each site and its four nearest neighbors. Hence it is a very parsimonious structure.
• As in the stocks experiments, we use Tyler's robust loss and a Laplace loss. Error is calculated by normalized MSE.

The results are shown in Figure 3. Appealingly, the results show a clear benefit from using structure when the amount of training data is scarce. Further clear benefit is gained when the structured model is optimized using a robust loss. Experiments with additional structures can be found in the supplementary material.

6 Conclusion and future work

Robust statistics and structured prediction are two important concepts in machine learning. Applying both of them in a principled manner is a step towards solutions to realistic complex problems that still pose a challenge to modern machine learning approaches. In this work we proposed a powerful family of robust structured losses that are easy to optimize, at least for linear structured models. In practice, the losses proposed here give promising results, and the algorithms used to minimize them are very efficient.
Our theoretical results follow the line of many recent works that aim to better understand non-convex optimization [16, 19], and contribute to the understanding of such optimization in the context of structured prediction.
There are many possible extensions that we believe are of value. Gaussian conditional random fields (CRFs) are a valuable tool in structured regression and have been used successfully in practice [33, 28]. It is straightforward to generalize them with the elliptical losses considered here, by defining appropriate structures, and minimizing a conditional loss using an almost identical procedure to Algorithm 1:

min_{Γ_yy ∈ G_yy, Γ_yx ∈ G_yx}  (1/m) ∑_{i=1}^m ρ((y_i − Γ_yy^{-1} Γ_yx x_i)^⊤ Γ_yy (y_i − Γ_yy^{-1} Γ_yx x_i)) + log|Γ_yy^{-1}|.

Preliminary experiments with these models show promising results [21], and we hope to give an analysis of these losses in future work. Another promising direction is structure learning with robust losses, for instance by adding l1 regularization in a similar fashion to the graphical lasso.

Acknowledgments

We thank Elad Mezuman and Amir Globerson for fruitful discussions, and Guy Shalev for preparing the river discharge dataset. This research was partially supported by ISF grant 1339/15.

References

[1] S. Bausson, F. Pascal, P. Forster, J.-P. Ovarlez, and P. Larzabal. First- and second-order moments of the normalized sample covariance matrix of spherically invariant random vectors. IEEE Signal Processing Letters, 14(6):425–428, 2007.

[2] J. Besag. On the statistical analysis of dirty pictures.
Journal of the Royal Statistical Society: Series B (Methodological), 48(3):259–279, 1986.

[3] S. Cambanis, S. Huang, and G. Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368–385, 1981.

[4] M. A. Finegold and M. Drton. Robust graphical modeling with t-distributions. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 169–176. AUAI Press, 2009.

[5] G. Frahm. Generalized elliptical distributions: theory and applications. PhD thesis, Universität zu Köln, 2004.

[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[7] Y. Guo, H. Xiong, Y. Yang, and N. Ruozzi. One-shot marginal MAP inference in Markov random fields. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2019. To appear.

[8] C.-J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pages 2330–2338, 2011.

[9] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pages 3165–3173, 2013.

[10] D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.

[11] P. J. Huber. Robust Statistics. Springer, 2011.

[12] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[13] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 282–289, 2001.

[14] H. Liu, F. Han, and C.-h. Zhang. Transelliptical graphical models. In Advances in Neural Information Processing Systems, pages 800–808, 2012.

[15] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328, 2009.

[16] P.-L. Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866–896, 2017.

[17] R. Mazumder and T. Hastie. The graphical lasso: New insights and alternatives. Electronic Journal of Statistics, 6:2125, 2012.

[18] C. McCarter and S. Kim. Large-scale optimization algorithms for sparse conditional Gaussian graphical models. In Artificial Intelligence and Statistics, pages 528–537, 2016.

[19] S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

[20] R. Morrison, R. Baptista, and Y. Marzouk. Beyond normality: Learning sparse probabilistic graphical models in the non-Gaussian setting. In Advances in Neural Information Processing Systems, pages 2359–2369, 2017.

[21] N. Noy, Y. Wald, G. Elidan, and A. Wiesel. Robust multitask elliptical regression (ROMER). In 2019 8th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2019.

[22] F. Pascal, L. Bombrun, J.-Y. Tourneret, and Y. Berthoumieu. Parameter estimation for multivariate generalized Gaussian distributions. IEEE Transactions on Signal Processing, 61(23):5960–5971, 2013.

[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V.
Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[24] I. Soloveychik and A. Wiesel. Performance analysis of Tyler's covariance estimator. IEEE Transactions on Signal Processing, 63(2):418–426, 2015.

[25] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, pages 25–32, 2004.

[26] D. E. Tyler. A distribution-free M-estimator of multivariate scatter. The Annals of Statistics, 15(1):234–251, 1987.

[27] T. Ushio, K. Okamoto, T. Iguchi, N. Takahashi, K. Iwanami, K. Aonashi, S. Shige, H. Hashizume, T. Kubota, and T. Inoue. The global satellite mapping of precipitation (GSMaP) project. Aqua (AMSR-E), 2004, 2003.

[28] R. Vemulapalli, O. Tuzel, M.-Y. Liu, and R. Chellapa. Gaussian conditional random field network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3233, 2016.

[29] D. Vogel and R. Fried. Elliptical graphical modelling. Biometrika, 98(4):935–951, 2011.

[30] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.

[31] D. Wang, Z. Zeng, and Q. Liu. Stein variational message passing for continuous graphical models. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 5206–5214, 2018.

[32] A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182–6189, 2012.

[33] M. Wytock and Z. Kolter. Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting.
In International Conference on Machine Learning, pages 1265–1273, 2013.

[34] E. Yang and A. C. Lozano. Robust Gaussian graphical modeling with the trimmed graphical lasso. In Advances in Neural Information Processing Systems, pages 2602–2610, 2015.

[35] K. Yao. A representation theorem and its applications to spherically-invariant random processes. IEEE Transactions on Information Theory, 19(5):600–608, 1973.

[36] T. Zhang, A. Wiesel, and M. S. Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. IEEE Transactions on Signal Processing, 61(16):4141–4148, 2013.
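As a supplementary illustration of the conditional elliptical loss proposed in the conclusion, the following is a minimal sketch, not the authors' implementation, of evaluating that objective in Python. The function name and the specific choices of ρ (the Gaussian ρ(t) = t, and a Tyler-style ρ(t) = d·log t) are our own illustrative assumptions; handling of the structural constraints Γ ∈ G (e.g., zeroing entries outside a given sparsity pattern) is left to the caller.

```python
import numpy as np

def elliptical_crf_loss(gamma_yy, gamma_yx, X, Y, rho):
    """Evaluate (1/m) sum_i rho(r_i^T Gamma_yy r_i) + log|Gamma_yy^{-1}|,
    where r_i = y_i - Gamma_yy^{-1} Gamma_yx x_i.

    gamma_yy : (d, d) positive-definite label precision block.
    gamma_yx : (d, p) label-feature block.
    X : (m, p) features, Y : (m, d) labels.
    rho : scalar loss applied to each quadratic form (vectorized).
    """
    # Conditional-mean residuals r_i = y_i - Gamma_yy^{-1} Gamma_yx x_i.
    R = Y - X @ np.linalg.solve(gamma_yy, gamma_yx).T
    # Quadratic forms q_i = r_i^T Gamma_yy r_i, computed for all i at once.
    q = np.einsum('ij,jk,ik->i', R, gamma_yy, R)
    _, logdet = np.linalg.slogdet(gamma_yy)
    # log|Gamma_yy^{-1}| = -log det Gamma_yy.
    return np.mean(rho(q)) - logdet

# Illustrative rho parameterizations (our assumptions, not from the paper):
rho_gaussian = lambda t: t                 # recovers the Gaussian MRF loss
rho_tyler = lambda t, d=2: d * np.log(t)   # Tyler-style loss, d = label dim
```

With Γyy = I and Γyx = 0, the Gaussian case reduces to the mean squared residual norm, which gives a quick sanity check for the implementation.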