{"title": "Learning Positive Functions with Pseudo Mirror Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 14144, "page_last": 14154, "abstract": "The nonparametric learning of positive-valued functions appears widely in machine learning, especially in the context of estimating intensity functions of point processes. Yet, existing approaches either require computing expensive projections or semidefinite relaxations, or lack convexity and theoretical guarantees after introducing nonlinear link functions. In this paper, we propose a novel algorithm, pseudo mirror descent, that performs efficient estimation of positive functions within a Hilbert space without expensive projections. The algorithm guarantees positivity by performing mirror descent with an appropriately selected Bregman divergence, and a pseudo-gradient is adopted to speed up the gradient evaluation procedure in practice. We analyze both asymptotic and nonasymptotic convergence of the algorithm. Through simulations, we show that pseudo mirror descent outperforms the state-of-the-art benchmarks for learning intensities of Poisson and multivariate Hawkes processes, in terms of both computational efficiency and accuracy.", "full_text": "Learning Positive Functions with Pseudo Mirror Descent\n\nYingxiang Yang\u2217\n\nUIUC\n\nHaoxiang Wang\n\nUIUC\n\nNegar Kiyavash\n\nEPFL\n\nyyang172@illinois.edu\n\nhwang264@illinois.edu\n\nnegar.kiyavash@epfl.ch\n\nNiao He\nUIUC\n\nniaohe@illinois.edu\n\nAbstract\n\nThe nonparametric learning of positive-valued functions appears widely in machine\nlearning, especially in the context of estimating intensity functions of point pro-\ncesses. Yet, existing approaches either require computing expensive projections or\nsemide\ufb01nite relaxations, or lack convexity and theoretical guarantees after introduc-\ning nonlinear link functions. 
In this paper, we propose a novel algorithm, pseudo\nmirror descent, that performs ef\ufb01cient estimation of positive functions within a\nHilbert space without expensive projections. The algorithm guarantees positivity\nby performing mirror descent with an appropriately selected Bregman divergence,\nand a pseudo-gradient is adopted to speed up the gradient evaluation procedure\nin practice. We analyze both asymptotic and nonasymptotic convergence of the\nalgorithm. Through simulations, we show that pseudo mirror descent outperforms\nthe state-of-the-art benchmarks for learning intensities of Poisson and multivariate\nHawkes processes, in terms of both computational ef\ufb01ciency and accuracy.\n\n1\n\nIntroduction\n\nLearning positive-valued functions (or positive functions for short) in Hilbert spaces is pervasive\nin machine learning, especially when estimating intensity functions of point processes. In recent\nyears, there has been a surge of interest and demand for modeling large-scale time-series and discrete\nevent data using point processes. This is fueled by a wide spectrum of applications ranging from\nmodeling \ufb01nancial activities [Embrechts et al., 2011], to modeling network diffusion such as in\ndisease propagation [Yang and Zha, 2013] and spread of news on social networks [Farajtabar et al.,\n2015, 2017], to tracking and control of large-scale and real-time systems [Craciun et al., 2015].\nDespite this, progress has been slow on nonparametric learning of positive functions (or positive\nintensities in case of point processes).\n\n1.1 Learning Positive Functions: Existing Results\n\nSemi-in\ufb01nite/Semi-de\ufb01nite relaxations. In regularized empirical risk minimization over a repro-\nducing kernel Hilbert space (RKHS), the representer theorem [Sch\u00a8olkopf et al., 2001] allows one\nto write the estimate as a linear combination of reproducing kernels. 
*This work was supported in part by MURI grant ARMY W911NF-15-1-0479, ONR grant W911NF-15-1-0479, NSF CCF-1755829 and NSF CMMI-1761699.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Therefore, the optimization problem reduces to a special instance of semi-infinite programming (SIP), which can then be solved using a variety of methods, such as cutting plane methods [Wu and Fang, 1999, Betrò, 2004, Kortanek and No, 1993, Papp, 2017]. If the RKHS has a polynomial kernel [Prestel and Delzell, 2013, Bagnell and Farahmand, 2015], then the problem further reduces to a sum-of-squares (SOS) optimization, which can be solved using semi-definite programming (SDP) solvers, e.g., Grant and Boyd [2014]. Although these approaches guarantee positivity, they are often limited to the batch learning setting and are computationally expensive, making them unsuitable for learning large or streaming data sets.

Link functions. Another approach for enforcing positivity is to perform a change of variable via a pointwise mapping h : R → R+, known as a link function. Examples include h(t) = t² and h(t) = exp(t), as well as various types of activation functions in neural networks [Mei and Eisner, 2017, Xiao et al., 2017]. By introducing h, the original problem of learning over a constrained set of functions is effectively transformed into an unconstrained one. Such methods have been successfully applied to nonparametric learning of intensity functions of Poisson and multivariate Hawkes processes [Flaxman et al., 2017, Yang et al., 2017]. However, despite their numerical advantage, the introduction of a link function often breaks convexity of the underlying learning problem. Consequently, the numerical results are not backed by theoretical guarantees.

Projection. When applying iterative optimization algorithms such as gradient descent, an ad hoc way to enforce positivity of the intermediate updates is to perform projection. In a parametric setting, this can be carried out by solving a quadratic program (QP), with the positivity constraint enforced on a large but finite set of points over the support of the estimate (see Appendix K for details). However, this approach does not guarantee an optimal solution due to the relaxation of the constraints.

1.2 Our Contribution

Despite recent advances in learning positive functions, nonparametric learning algorithms that are both computationally efficient and backed by theoretical guarantees remain largely elusive. In this paper, we design a pseudo mirror descent algorithm that leverages the classical mirror descent algorithm and a sequence of pseudo-gradients to achieve these goals. When the objective is smooth and the pseudo-gradient is close to the true gradient, we prove that the gradient norm vanishes at the rate of O(1/√k), where k is the number of iterations. Under a generalized version of the Polyak-Łojasiewicz condition [Karimi et al., 2016], we further show that the objective value converges to the optimum at the rate of O(1/k). For several point process estimation applications of interest, including learning intensities of nonhomogeneous Poisson processes and multivariate Hawkes processes, we construct pseudo-gradients based on kernel embeddings, as the true functional gradients for these problems are not accessible in practice. We also conduct extensive numerical experiments on both synthetic and real-world datasets.
Those numerical results show that pseudo mirror descent outperforms existing nonparametric approaches in terms of both efficiency and accuracy.

2 Learning Positive Functions in Hilbert Spaces

We first focus on a general optimization problem:

min_{x ∈ H+} f(x),  (1)

where H is a Hilbert space that consists of functions mapping a compact support Ω ⊂ R^d to R, and H+ := {x ∈ H : x(t) ≥ 0, ∀t ∈ Ω}. The topological dual of H, which consists of continuous linear operators on H, is denoted by H*, and the norm and inner product of H are denoted by ‖·‖ and ⟨·,·⟩, respectively. Next, we introduce notation and definitions that will be used frequently in our analysis.

Functional gradient. For a Gâteaux differentiable functional f : H → R, denote its Gâteaux derivative by [Df(x)](·). The functional gradient of f at x, denoted by ∇f(x), belongs to H and satisfies [Df(x)](y) = ⟨∇f(x), y⟩ for any y ∈ H. By the Riesz representation theorem, ∇f(x) exists and is unique. Likewise, if f is twice Gâteaux differentiable, one can define the Hessian of f at x by ∇²f(x) ∈ H*, such that for any y, z ∈ H, [D[Df(x)](y)](z) = ⟨z, [∇²f(x)](y)⟩. For more details, please see Bauschke and Combettes [2011].

Bregman divergence and Fenchel conjugate. Let int(H+) be the interior of H+, and consider a continuously differentiable functional Φ : int(H+) → R that is μ-strongly-convex with respect to some norm ‖·‖_♯. That is, Φ(x) ≥ Φ(y) + ⟨∇Φ(y), x − y⟩ + (μ/2)‖x − y‖²_♯ for all x, y ∈ int(H+). Define the Bregman divergence induced by Φ as Δ_Φ(x, y) = Φ(x) − Φ(y) − ⟨∇Φ(y), x − y⟩ for x, y ∈ int(H+); it satisfies Δ_Φ(x, y) ≥ (μ/2)‖x − y‖²_♯. The Fenchel conjugate of Φ is Φ*(u) = sup_{x ∈ H} {⟨x, u⟩ − Φ(x)}, which is μ⁻¹-Lipschitz-smooth with respect to ‖·‖_{♯,*}, the dual norm of ‖·‖_♯.²

In this paper, we aim at leveraging the classic mirror descent algorithm [Nemirovski and Yudin, 1983] to guarantee positivity. This approach requires the following assumption.

Assumption 1. Suppose min_{x ∈ H+} f(x) = f* > −∞ is achieved at x* ∈ int(H+), and there exists a Φ : int(H+) → R, continuously differentiable and μ-strongly-convex with respect to ‖·‖_♯, such that ∇Φ*(x) ∈ int(H+) for x ∈ H. Moreover, let f_Φ(x) = (f ∘ ∇Φ*)(x) = f(∇Φ*(x)). We assume ∇f_Φ(∇Φ(x)) ∈ H for x ∈ int(H+), and that f_Φ is Mμ⁻¹-Lipschitz-smooth for a constant M:

‖∇f_Φ(∇Φ(y)) − ∇f_Φ(∇Φ(x))‖_♯ ≤ Mμ⁻¹ ‖∇Φ(y) − ∇Φ(x)‖_{♯,*},  ∀x, y ∈ int(H+).  (2)

When Φ(x) = ‖x‖²/2, we have ∇Φ*(x) = ∇Φ(x) = x and ∇f_Φ(∇Φ(x)) = ∇f(x). In this case, (2) reduces to the standard smoothness assumption on the objective f. For more general choices of Φ, a sufficient condition for (2) is that f is Lipschitz smooth and ∇²Φ has uniformly bounded eigenvalues over int(H+).
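To make the pair (∇Φ, ∇Φ*) and the induced Bregman divergence concrete, the following minimal numerical sketch discretizes Ω = [0, 1] and uses the potential Φ(x) = ⟨x, log x − 1⟩; the grid size and test functions are our own illustrative choices, not values from the paper:

```python
import numpy as np

# Discretize Ω = [0, 1]; functions become vectors and ⟨x, y⟩ ≈ a Riemann sum.
t = np.linspace(0.0, 1.0, 200, endpoint=False)
dt = t[1] - t[0]

def inner(x, y):
    return float(np.sum(x * y) * dt)

def Phi(x):
    # Φ(x) = ⟨x, log x − 1⟩, defined for strictly positive x
    return inner(x, np.log(x) - 1.0)

grad_Phi = np.log        # ∇Φ(x) = log x : int(H+) → "dual space"
grad_Phi_star = np.exp   # ∇Φ*(u) = exp(u) : maps any u ∈ H back into int(H+)

def bregman(x, y):
    # Δ_Φ(x, y) = Φ(x) − Φ(y) − ⟨∇Φ(y), x − y⟩
    return Phi(x) - Phi(y) - inner(grad_Phi(y), x - y)

x = 1.0 + np.sin(2.0 * np.pi * t) ** 2   # a positive test function
y = np.full_like(t, 2.0)                 # another point of int(H+)
u = np.cos(5.0 * t) - 2.0                # an arbitrary sign-changing dual element

assert np.allclose(grad_Phi_star(grad_Phi(x)), x)  # ∇Φ* inverts ∇Φ
assert np.all(grad_Phi_star(u) > 0)                # ∇Φ* lands in int(H+)
assert bregman(x, y) >= 0.0                        # divergence of a convex Φ is nonnegative
```

Note how positivity of ∇Φ* here comes for free from the exponential; this is the mechanism the mirror descent updates below exploit in place of projections.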
However, this is not necessary, as we will show in Section 3.

Intuitively, Assumption 1 can be interpreted by introducing a "dual space" [Bubeck et al., 2015], H′ = {∇Φ(x) : x ∈ int(H+)}, which is connected to the primal space H+ through a pair of mappings: ∇Φ : int(H+) → H′ and ∇Φ* : H′ → int(H+). Notice that, in the "dual space", the objective and its gradient become f_Φ(∇Φ(x)) and ∇f_Φ(∇Φ(x)), respectively. Therefore, Assumption 1 assumes smoothness of the objective in the "dual space", where the dependence on Φ is incorporated into the Lipschitz constant. A more detailed illustration can be found in Appendix A.

2.1 Pseudo-gradients

In practice, the exact gradient can be costly to evaluate, store, or transmit; sometimes it may also lack desired properties such as continuity or smoothness. To circumvent these challenges, a common practice is to use a rough direction as a substitute for the exact gradient in optimization algorithms. Examples include pseudo-gradients [Poljak and Tsypkin, 1973], the gradient sign [Goodfellow et al., 2015], ternary gradients [Wen et al., 2017], and quantized gradients [Wu et al., 2018]. Among them, the concept of pseudo-gradient is the most general, and the starting point of our algorithm design, using mirror descent to guarantee positivity, further motivates us to introduce a generalized notion of pseudo-gradient that is compatible with the Bregman divergence.

Definition 1 (Pseudo-gradient). Consider an iterative algorithm initialized at x(0) and with intermediate updates x(1), ..., x(k), where each x(k) is generated from some given rule r(x(k−1), g(k)) with a random direction g(k) ∈ H. Let F(k) be the minimum σ-algebra generated by x(0), ..., x(k). Then, a pseudo-gradient for f at x(k) is a random element g(k+1) ∈ H satisfying

⟨∇f_Φ(∇Φ(x(k))), E[g(k+1) | F(k)]⟩ ≥ 0.  (3)

The notion of pseudo-gradient was originally introduced in Poljak and Tsypkin [1973] as a random element in H that has an acute angle with the true gradient: ⟨E[g(k+1) | F(k)], ∇f(x(k))⟩ ≥ 0. This can be recovered from Definition 1 by setting Φ(x) = ‖x‖²/2. Following the intuition that led us to Assumption 1, Definition 1 requires a pseudo-gradient to form an acute angle with the gradient in the "dual space". Below we give a few examples of pseudo-gradients.

²See Appendix B for details. The two norms ‖·‖ and ‖·‖_♯ need not be the same.

Algorithm 1 Pseudo Mirror Descent Algorithm
1: Input: number of iterations T; step sizes {η_k}_{k=0}^T; objective f; strongly convex function Φ.
2: Initialize x(0) ∈ int(H+).
3: for k = 1 to T do
4:   Compute pseudo-gradient g(k).
5:   x(k) = argmin_{x ∈ int(H+)} { f(x(k−1)) + ⟨g(k), x − x(k−1)⟩ + η_{k−1}⁻¹ Δ_Φ(x, x(k−1)) }.
6: end for
7: Output: estimated function x(T).

Example 1 (Stochastic gradients are pseudo-gradients). Suppose g(k) ∈ H is a stochastic gradient of f at x(k−1): E[g(k) | F(k−1)] = ∇f(x(k−1)). Then g(k) is a pseudo-gradient of f at x(k−1).

The proof is in Appendix D.
While stochastic gradients are pseudo-gradients, the converse is not true. As the following examples show, pseudo-gradients can be, and often turn out to be, biased.

Example 2 (Kernel embeddings are pseudo-gradients). Suppose K(·,·) : Ω × Ω → R is a symmetric positive definite kernel satisfying ⟨x, ⟨K, x⟩⟩ ≥ 0 for any x ∈ H. Let K_t = K(t,·); then

g(k)(t) = ⟨K_t, ∇f_Φ(∇Φ(x(k−1)))⟩

is a pseudo-gradient of f at x(k−1).

Example 3 (The sign of the gradient is a pseudo-gradient for H = L²(Ω)). For any x ∈ int(H+),

⟨∇f_Φ(∇Φ(x)), sgn(∇f_Φ(∇Φ(x)))⟩ = ∫_Ω |[∇f_Φ(∇Φ(x))](t)| dt ≥ 0.

2.2 Pseudo Mirror Descent: Algorithm and Theory

In this section, we introduce a new algorithm, pseudo mirror descent, that integrates stochastic mirror descent with pseudo-gradients. Stochastic mirror descent has been extensively studied and widely applied to solving constrained optimization problems; see, e.g., the seminal work of Nemirovski et al. [2009]. When it comes to the positivity constraint, the stochastic mirror descent algorithm, leveraging a properly chosen Bregman divergence, leads to a simple multiplicative update rule that preserves positivity and reduces the runtime in practice.

The pseudo mirror descent algorithm is described in Algorithm 1, with the main iteration

x(k) = argmin_{x ∈ int(H+)} { f(x(k−1)) + ⟨g(k), x − x(k−1)⟩ + η_{k−1}⁻¹ Δ_Φ(x, x(k−1)) },  (4)

where Δ_Φ(·,·) is the Bregman divergence induced by Φ. When ∇Φ*(x) ∈ int(H+) for x ∈ H, x(k) has an explicit expression, as we show below (see Appendix E for proof).

Lemma 2.
Under Assumption 1, the solution of (4) reduces to

x(k) = ∇Φ*(∇Φ(x(k−1)) − η_{k−1} g(k)).

Below is an example that applies Lemma 2.

Example 4 (The generalized I-divergence). Let H = L²[0, 1], and Φ(x) ≜ ⟨x, log(x) − 1⟩. Then Δ_Φ(x, y) = ⟨x, log(x) − log(y)⟩, and (4) reduces to x(k) = x(k−1) exp{−η_{k−1} g(k)}.

Selection of the Bregman divergence. Example 4 gave one Bregman divergence, but the choice is rather flexible and can be designed in a more general fashion. In the context of learning positive functions, any distance-generating function Φ such that ∇Φ* preserves positivity suffices. Intuitively, this means one can start out by choosing an appropriate ∇Φ* and determine the corresponding Φ subsequently. Following this way of designing the Bregman divergence, a few more examples are easily constructed, including Φ(x) = −∫ log x(t) dt, which leads to the Itakura-Saito divergence, as well as Φ(x) = ∫ 0.4 x^{2.5}(t) dt.

Next, we provide both asymptotic and nonasymptotic convergence analyses for Algorithm 1. It is noteworthy that none of our results assume convexity of the objective. To the best of our knowledge, these are the first proven convergence results on mirror descent with pseudo-gradient updates.

Convergence of a vanishing gradient. First, we prove that the pseudo-gradient and the true gradient are asymptotically orthogonal.

Theorem 3. Suppose Assumption 1 holds, and the step sizes in (4) satisfy η_k ≥ 0, Σ_{k=0}^∞ η_k = ∞, and Σ_{k=0}^∞ η_k² < ∞. In addition, let g(k) satisfy

E[‖g(k)‖²_{♯,*} | F(k−1)] ≤ λ_k + ρ ⟨∇f_Φ(∇Φ(x(k−1))), E[g(k) | F(k−1)]⟩,  (5)

where the sequence λ_k ≥ 0 satisfies Σ_{k=0}^∞ η_k² λ_{k+1} < ∞, and ρ is a positive constant. Then, with probability 1, lim_{k→∞} f(x(k)) exists and

lim inf_{k→∞} ⟨∇f_Φ(∇Φ(x(k−1))), E[g(k) | F(k−1)]⟩ = 0.

The proof can be found in Appendix H. The above theorem requires a set of assumptions on the step sizes, as well as an upper bound on the pseudo-gradient's norm, which are standard in the optimization literature [Bottou et al., 2018, Poljak and Tsypkin, 1973]. Under such assumptions, the pseudo-gradient and the gradient eventually become orthogonal to each other in probability. This implies that either the angle between the pseudo-gradient and the gradient becomes asymptotically perpendicular, or the norm of the pseudo-gradient converges to 0. Since Algorithm 1 leaves the design of the pseudo-gradient free, we can immediately claim that if (i) the pseudo-gradient always forms an acute angle with the true gradient, and (ii) the norm ratio between the pseudo-gradient and the gradient is bounded below, then the norm of the gradient converges to 0. An example is given in the following corollary (see proof in Appendix G).

Corollary 4. In Algorithm 1, suppose ∇²Φ is positive definite, and let g(k) = ∇f_Φ(∇Φ(x(k−1))) or g(k) = ∇f(x(k−1)).
Then, we have lim_{k→∞} ‖∇f(x(k))‖ = 0 in probability.

Note that, if the gradients ∇f(x(k)) are uniformly continuous, this further implies convergence towards a stationary point of the objective f.

Next, we investigate the nonasymptotic convergence rate of Algorithm 1, to characterize the behavior of the approximate solution after a finite number of iterations (see proof in Appendix H).

Theorem 5. Suppose that Assumption 1 holds, and that constants c₂ and c₃ exist such that

E[‖g(k)‖²_{♯,*}] ≤ c₂² + c₃² E[⟨∇f_Φ(∇Φ(x(k−1))), E[g(k) | F(k−1)]⟩].  (6)

In addition, suppose that the step size η_k in Algorithm 1 satisfies η_k = Θ(1/√k) and η_k ≤ 2μc₃⁻²M⁻¹ for all k, and that a constant c₄ exists such that f(x(0)) − f* ≤ c₄. Then,

min_{0≤i≤k} E[⟨∇f_Φ(∇Φ(x(i))), E[g(i) | F(i−1)]⟩] = O(log k / √k).

Note that if (5) holds with λ_k ≡ c₂² and ρ = c₃², then (6) holds by taking expectations on both sides of (5). Theorem 5 states the rate at which the inner product between the pseudo-gradient and the actual gradient vanishes under just the smoothness assumption. Faster rates and global convergence can be achieved under stronger assumptions. Below we present the convergence rate when the objective satisfies a generalized version of the well-known Polyak-Łojasiewicz condition [Polyak, 1963].

Global convergence under the Polyak-Łojasiewicz condition. We introduce our assumption below.

Assumption 2 (Generalized Polyak-Łojasiewicz condition). For any x ∈ int(H+), suppose

(1/2) ‖∇f_Φ(∇Φ(x))‖² ≥ γ (f(x) − f*)

for some universal constant γ > 0.

The above assumption generalizes the Polyak-Łojasiewicz condition [Polyak, 1963], which corresponds to the specific choice Φ(x) = ‖x‖²/2. Under this choice of Φ, pseudo mirror descent reduces to pseudo gradient descent, and converges linearly [Poljak and Tsypkin, 1973]. Note that Assumption 2 is a slightly more restrictive condition than the Polyak-Łojasiewicz condition because, by the chain rule, it implies the Polyak-Łojasiewicz condition as long as ∇²Φ(x) has bounded eigenvalues.

Theorem 6. Suppose Assumptions 1 and 2 and Equation (6) hold, and a constant c₁ > 0 exists such that, for all x(k) satisfying f(x(k)) ≠ f*,

E[⟨∇f_Φ(∇Φ(x(k−1))), E[g(k) | F(k−1)]⟩] ≥ c₁ E‖∇f_Φ(∇Φ(x(k)))‖²,  ∀k ≥ 1.  (7)

If we set η_k ≡ η < min{1/(2γc₁), 2M⁻¹μc₃⁻²}, then

E[f(x(k)) − f*] ≤ (1 − 2γc₁(η − Mμ⁻¹η²c₃²/2))^k [f(x(0)) − f*] + (Mμ⁻¹η²/2) c₂².

If instead we set η_k = min{(2k + 1)/[γc₁(k + 1)²], M⁻¹μc₃⁻²}, then

E[f(x(k)) − f*] ≤ Mμ⁻¹c₂² / (2γ²c₁²k)  for k ≥ Mc₃²/(γc₁μ).

The proof of Theorem 6 is given in Appendix I, and builds on Karimi et al. [2016], in which the same rate is obtained for stochastic gradient descent under the standard Polyak-Łojasiewicz condition in a Euclidean space.
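To see the multiplicative update of Lemma 2 and the fast decrease promised under a constant step size in action, here is a minimal, self-contained sketch on a discretized toy problem; the target x*, grid, step size, and iteration count are illustrative choices of ours, not values from the paper, and g = ∇f(x) is used as the pseudo-gradient in the spirit of Corollary 4:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200, endpoint=False)
dt = t[1] - t[0]
x_star = np.exp(-t)                        # illustrative positive target

def f(x):
    # toy objective f(x) = ½‖x − x*‖², whose gradient is x − x*
    return 0.5 * float(np.sum((x - x_star) ** 2) * dt)

x = np.ones_like(t)                        # x(0) ∈ int(H+)
eta = 0.5                                  # illustrative constant step size
values = [f(x)]
for _ in range(200):
    g = x - x_star                         # g = ∇f(x), a valid pseudo-gradient here
    x = x * np.exp(-eta * g)               # Lemma 2 with Φ(x) = ⟨x, log x − 1⟩
    values.append(f(x))

assert np.all(x > 0)                       # positivity preserved with no projection
assert values[-1] < 1e-8 and values[-1] < values[0]
```

Every iterate stays strictly positive by construction, and on this smooth toy objective the decay of f(x(k)) − f* is geometric, consistent with the constant-step bound above.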
By comparison, Theorem 6 is a more general result: (i) it applies to stochastic mirror descent on Hilbert spaces, and (ii) it applies to any pseudo-gradient satisfying (7). As it turns out, the flexibility of utilizing pseudo-gradients instead of unbiased stochastic gradients plays an important role in many practical applications, as we illustrate in the following section.

3 Pseudo Mirror Descent for Point Process Estimation

In this section, we apply pseudo mirror descent to the problems of learning the intensity functions of Poisson processes, as well as the triggering functions of multivariate Hawkes processes.

3.1 Learning Poisson Intensities with Pseudo Mirror Descent

For simplicity of exposition, we consider a one-dimensional Poisson process over [0, 1] with intensity x*(t). The objective for estimating x*(t) is

f(x) = ∫₀¹ x(t) dt − ∫₀¹ x*(t) log x(t) dt.  (8)

This objective can be viewed as the expectation of the negative log-likelihood of a Poisson process over an infinite number of sample paths. Our goal is to minimize f(x) over x ∈ int(H+) with H = L²[0, 1]. We restrict x to be continuous, and we choose the generalized I-divergence as the Bregman divergence, with Φ(x) ≜ ⟨x, log(x) − 1⟩.

Deriving pseudo-gradients. We have ∇Φ(x) = log x, ∇Φ*(x) = exp(x), and

f_Φ(y) = ∫₀¹ exp(y)(t) dt − ∫₀¹ x*(t) y(t) dt.

Hence, ∇f_Φ(∇Φ(x)) = x − x*. In practice, we cannot simply choose ∇f_Φ(∇Φ(x)) as the pseudo-gradient, since x*(t) is unknown; only sample arrivals from the Poisson process are observed. Hence, we choose the pseudo-gradient as

g(t) = ∫₀¹ x(τ) K(t, τ) dτ − Σ_{i=1}^N K(τ_i, t),

where K(·,·) is a positive definite kernel, and τ₁, ..., τ_N are arrival times from the Poisson process. The introduction of K(·,·) is necessary to avoid the presence of Dirac delta functions in the expression of the pseudo-gradient. Substituting x with x(k) in the expression of g, the resulting g(k) is a pseudo-gradient, since E[g(k) | F(k−1)] is the kernel embedding of ∇f_Φ(∇Φ(x(k−1))).

On convergence of pseudo mirror descent. We verify that the conditions in Theorem 6 hold. When K(·,·) is a finite-dimensional kernel (e.g., a polynomial kernel), we have

⟨E[g(k) | F(k−1)], ∇f_Φ(∇Φ(x(k−1)))⟩ = ∫₀¹ ∫₀¹ (x(k−1) − x*)(t₁) K(t₁, t₂) (x(k−1) − x*)(t₂) dt₁ dt₂,

which is lower bounded by λ_min ‖x(k−1) − x*‖², where λ_min is the minimum eigenvalue of the integral operator associated with K(·,·). This design guarantees that (7) holds.

The expected log-likelihood objective in (8) is not particularly well-behaved for learning positive functions: as ‖x‖_∞ approaches 0, f(x) becomes non-smooth and violates the generalized Polyak-Łojasiewicz condition. Nevertheless, for a finite number of iterations, it is reasonable to assume that the extreme values of x(t) are bounded, and thus the following proposition follows.

Proposition 7. Consider objective (8) and let Φ(x) = ⟨x, log x − 1⟩.
Then,

• The μ-strong-convexity of Φ and (2) are satisfied for the L¹-norm when ‖x‖_{L¹} ≤ μ⁻¹.
• The objective satisfies Assumption 2 with constant ν when min_{t∈[0,1]} x(t) ≥ 2ν.

Although this proposition requires ‖x‖_{L¹} ≤ μ⁻¹ in order for Φ(x) to be μ-strongly-convex and for f_Φ to be Mμ⁻¹-Lipschitz-smooth for a constant M, a crude analysis shows that the updates are essentially of the form x(k+1)(t) = x(k)(t) exp(−η_k [∇f_Φ(∇Φ(x(k)))](t)) = O(η_k⁻¹). Therefore, with i.i.d. sample paths of the Poisson process observed in practice, one can expect, using a standard concentration-inequality argument (see, e.g., Rosasco et al. [2010]), that such a condition holds with high probability for the constant step size specified in Theorem 6. Indeed, in the next section we show that, although the Polyak-Łojasiewicz condition is not strictly satisfied, linear convergence at the early stage can still be observed. Meanwhile, the proof of Proposition 7 also shows that (2) may hold even when ∇²Φ does not have uniformly bounded eigenvalues over int(H+).

3.2 Learning Multivariate Hawkes Processes with Pseudo Mirror Descent

Herein, we apply pseudo mirror descent to learn the triggering functions of a multivariate Hawkes process. A p-dimensional multivariate Hawkes process is a set of stochastic processes whose intensity functions, denoted by x*₁, ..., x*_p, are causally dependent on the past arrivals [Hawkes, 1971]:

x*_i(t) = x*_{i0} + Σ_{j=1}^p ∫_{−∞}^t y*_{ij}(t − τ) dN_j(τ),  i ∈ {1, ..., p}.  (9)

Here, x*_{i0} is a given base intensity, N_j(t) is the counting process of dimension j, and y*_{ij} ∈ H := L²[0, 1] is the triggering function that captures the mutual excitation impact from dimension j to i. Our goal is to learn the p × p triggering functions by maximizing the expected log-likelihood, which can be carried out by optimizing p separate objectives of the form [Yang et al., 2017]:

min_{y_{i1},...,y_{ip} ∈ H} f_i(y_{i1}, ..., y_{ip}) = E[ ∫₀ᵀ (x_i(t) − x*_i(t) log x_i(t)) dt ],  (10)

where x₁, ..., x_p are calculated by

x_i(t) = x*_{i0} + Σ_{j=1}^p ∫_{−∞}^t y_{ij}(t − τ) dN_j(τ),  i ∈ {1, ..., p}.  (11)

Deriving pseudo-gradients. We consider Φ(x) = ⟨x, log x − 1⟩. After some calculation, the partial derivative of f_Φ with respect to ∇Φ(y_{ij}) can be expressed as

[∂_{∇Φ(y_{ij})} f_Φ(∇Φ(y_{i1}), ..., ∇Φ(y_{ip}))](s) = E[ ∫₀ᵀ (1 − x*_i(t)/x_i(t)) y_{ij}(s) x*_j(t − s) dt ],

where s > 0 (due to causality), and the expectation is over the sample paths. We choose the pseudo-gradient to be the kernel embedding of the above, where x*(t) is accessed through samples:

g_{ij}(s) = Σ_{k=1}^{N_j(T)} ∫₀ᵀ K(s, t − t_{jk}) y_{ij}(t − t_{jk}) dt − Σ_{m=0}^{N_i(T)} Σ_{n=0}^{N_j(t_{im})} [K(s, t_{im} − t_{jn}) y_{ij}(t_{im} − t_{jn}) / x_i(t_{im})],  (12)

where t_{im} is the m-th arrival in the i-th dimension (see Appendix for the detailed construction).

Remark 8 (On efficient representation of the updates). For both Poisson and multivariate Hawkes processes, the updates can be tracked pointwise. If we replace the integration over y_{ij} or x by sample averages, g(s) and g_{ij}(s) become linear combinations of the kernels. This allows us to perform updates by merely keeping track of the coefficients and the parameters of those kernels.

4 Numerical Experiments

In this section, we present numerical results on synthetic and real datasets. With synthetic data, the goal is to verify the results of Theorem 6 and to compare the performance of pseudo mirror descent with the link-function and projection approaches mentioned in the introduction. Meanwhile, the experiment on real data is designed to show the practical performance of pseudo mirror descent. We conduct experiments with various choices of kernels, including the polynomial kernel K(x, y) = (1 + xy)² and the Sobolev kernel K(x, y) = 1 + min{x, y}. As noted in Theorem 6, a finite-dimensional kernel guarantees (7), whereas an infinite-dimensional kernel has better representation capability, and hence better performance when fewer iterations are performed. Detailed parameter settings and additional results can be found in Appendices K and L.

Learning a synthetic one-dimensional Poisson process. We set x*(t) = exp(−t), and evaluated the performance of pseudo mirror descent under constant and vanishing step sizes.
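Before turning to the results, this synthetic setup can be reproduced in outline as follows; the thinning sampler, grid, step size, batch size, and iteration count are illustrative choices of ours, not the tuned hyperparameters of Appendix K:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200, endpoint=False)
dt = t[1] - t[0]
x_star = lambda s: np.exp(-s)            # true intensity of the synthetic process

def K(a, b):
    # polynomial kernel K(x, y) = (1 + xy)^2, evaluated on all pairs
    return (1.0 + np.outer(a, b)) ** 2

def sample_path(lam_max=1.0):
    # thinning: homogeneous candidates at rate lam_max, kept w.p. x*(τ)/lam_max
    cand = rng.uniform(0.0, 1.0, size=rng.poisson(lam_max))
    return cand[rng.uniform(0.0, lam_max, size=cand.size) < x_star(cand)]

def pseudo_gradient(x, batch_size=10):
    # g(t) = ∫ x(τ) K(t, τ) dτ − Σ_i K(τ_i, t), averaged over a mini-batch of paths
    embed = K(t, t) @ x * dt
    data = np.mean([K(t, sample_path()).sum(axis=1) for _ in range(batch_size)],
                   axis=0)
    return embed - data

x = np.ones_like(t)                       # x(0) ≡ 1 ∈ int(H+)
for _ in range(300):
    x = x * np.exp(-0.2 * pseudo_gradient(x))   # multiplicative mirror step

assert np.all(x > 0)
error = float(np.sum((x - x_star(t)) ** 2) * dt)   # squared L² estimation error
```

Since E[g(k) | F(k−1)] is the kernel embedding of x(k−1) − x*, the noisy multiplicative iterates drift toward x* while remaining strictly positive throughout.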
The result is shown in Figure 1, where we plotted $\log(f(x^{(k)}) - f^*)$ versus $k$ under constant (left) and vanishing step sizes (mid), and compared the estimation errors between pseudo mirror descent, projected gradient descent, and the link function approach (right). The pseudo-gradient is calculated using a mini-batch of 10 realizations and a polynomial kernel $K(x, y) = (1 + xy)^2$. All hyperparameters are fine-tuned and reported in Appendix K. From the left-most subplot, we see that, even though the objective does not satisfy the Polyak-Łojasiewicz condition, we still observe linear convergence under a constant step size at the initial stages. From the right-most subplot, we see that pseudo mirror descent achieves faster convergence compared to the link function approach and projected gradient descent. An extension of this experiment is carried out in Figure 2, where the underlying intensity function is set to a discontinuous function $x^*(t) = 1 + \lfloor 10t \rfloor$ for $t \in [0, 1]$. The left-hand side of Figure 2 shows that both the Sobolev and polynomial kernels can learn a continuous approximation of the intensity function. The right-hand side of the figure shows that the Sobolev kernel has slightly better representation power and thus slightly better overall performance in the given number of iterations.
Learning shot distances in professional basketball games. We used the shot distance data of several professional basketball players over 500 games (available at stats.nba.com). We applied pseudo mirror descent, the link function approach, and a neural network estimator built with PyTorch [Paszke et al., 2017] to learn each player's shooting distance modeled as a Poisson process. The pseudo-gradient is computed with a Sobolev kernel $K(x, y) = 1 + \min\{x, y\}$ [Wahba, 1990], and the hyperparameters are fine-tuned and reported in Appendix K.
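Both kernels used in the experiments are one-liners; the sketch below also includes a Gram-matrix helper of our own for checking positive semidefiniteness (it is not part of the paper's code):

```python
import numpy as np

def polynomial_kernel(x, y):
    # K(x, y) = (1 + xy)^2: a finite-dimensional kernel whose feature map
    # (1, sqrt(2) x, x^2) spans a 3-dimensional RKHS.
    return (1.0 + x * y) ** 2

def sobolev_kernel(x, y):
    # K(x, y) = 1 + min{x, y}: the reproducing kernel of a first-order
    # Sobolev space on [0, 1] [Wahba, 1990]; its RKHS is infinite-dimensional.
    return 1.0 + np.minimum(x, y)

def gram(kernel, points):
    """Gram matrix G[i, j] = kernel(points[i], points[j]) via broadcasting."""
    pts = np.asarray(points, dtype=float)
    return kernel(pts[:, None], pts[None, :])
```

Any Gram matrix built from a valid kernel must be symmetric positive semidefinite, which gives an easy correctness check on an implementation.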
Figure 3 depicts the result with the histogram of the data in the background. We can see that pseudo mirror descent shows similar accuracy compared to the link function approach and the neural network estimator.
Online learning for multivariate Hawkes process. We studied the mouse embryonic stem cell data, which is often modeled as a multivariate Hawkes process. The dataset we adopted [Chen et al., 2008] consists of 15 DNA sequences, where each sequence documents the co-occurrence of 15 types of transcriptional regulatory elements (TREs). We modeled each DNA sequence as a 15-dimensional Hawkes process, following the setting of [Carstensen et al., 2010]. Our goal is to compare the log-likelihood per dimension, (10), evaluated using the estimates of pseudo mirror descent, the expectation maximization (EM) algorithm [Lewis and Mohler, 2011], and the MLE-SGLP proposed by Xu et al. [2016]. The pseudo-gradient is computed with the Sobolev kernel introduced above. Figure 4 shows two scatter plots of performance comparison, between pseudo mirror descent and the EM algorithm (left), and between pseudo mirror descent and MLE-SGLP (right). The horizontal axis is the log-likelihood of the benchmarks, implemented with the tick library [Bacry et al., 2017], and the vertical axis is the log-likelihood of pseudo mirror descent. As each dot represents the per-dimensional log-likelihood of one TRE in one DNA sequence, there are a total of 15 × 15 = 225 dots. We can see that, on the left-hand subplot in Figure 4, most dots fall to the left of the diagonal line, indicating that pseudo mirror descent is slightly better than the EM algorithm; on the right-hand subplot, most dots fall in the vicinity of the diagonal line, implying similar performances between pseudo mirror descent and MLE-SGLP.
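For an observed path, the expectation in (10) is replaced by the empirical log-likelihood $\sum_m \log x_i(t_{im}) - \int_0^T x_i(t)\,dt$. A minimal sketch of this evaluation for a generic intensity, with the compensator integral approximated by the trapezoidal rule (grid resolution is our choice, not the paper's):

```python
import numpy as np

def point_process_loglik(intensity, arrivals, T, n_grid=1000):
    """Log-likelihood of one observed path of a point process: the sum of
    log-intensities at the arrival times, minus the compensator, i.e. the
    integral of the intensity over [0, T] (trapezoidal approximation)."""
    grid = np.linspace(0.0, T, n_grid)
    vals = intensity(grid)
    compensator = np.sum((vals[:-1] + vals[1:]) * np.diff(grid)) / 2.0
    return float(np.sum(np.log(intensity(np.asarray(arrivals)))) - compensator)
```

For a constant intensity $\lambda$ with $N$ arrivals this reduces to $N \log \lambda - \lambda T$, which gives a quick correctness check.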
Note that both the EM algorithm and MLE-SGLP are batch learning algorithms.

Figure 1: Synthetic dataset: log of objective error for pseudo mirror descent under constant (left) and vanishing (mid) step sizes; estimation error of pseudo mirror descent, projected gradient descent, and the link function approach (right).

Figure 2: Synthetic dataset: the fitting of a Poisson process with piecewise constant intensity function. We compare the performance using a polynomial kernel and a Sobolev kernel (left), and compare the estimation error with the link function approach (right).

Figure 3: Basketball shot distance dataset: recovery of the intensities using pseudo mirror descent (red curve), the link function approach, and neural networks (yellow curve).

Figure 4: Mouse embryonic stem cell dataset: scatter plot comparison between pseudo mirror descent and expectation maximization (left), and between pseudo mirror descent and MLE-SGLP (right).

5 Conclusion
This paper introduced a principled algorithm, pseudo mirror descent, and a new theoretical framework for nonparametric estimation of positive functions. Convergence results on pseudo mirror descent apply to general-purpose (non-convex) optimization problems, which can be of independent interest. We provided examples of applying pseudo mirror descent to learning intensity and triggering functions of Poisson and multivariate Hawkes processes.
Besides its strong theoretical guarantees, numerical results also showed that pseudo mirror descent achieves near-optimal performance in practice.

References

M. Andersen, Joachim Dahl, and Lieven Vandenberghe. CVXOPT: A Python package for convex optimization. abel.ee.ucla.edu/cvxopt, 2013.

Emmanuel Bacry, Martin Bompaire, Stéphane Gaïffas, and Soren Poulsen. Tick: a Python library for statistical learning, with a particular emphasis on time-dependent modelling.
arXiv preprint arXiv:1707.03003, 2017.

J. Andrew Bagnell and Amir-massoud Farahmand. Learning positive functions in a Hilbert space. NIPS Workshop on Optimization (OPT2015), 2015.

Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.

Bruno Betrò. An accelerated central cutting plane algorithm for linear semi-infinite programming. Mathematical Programming, 101(3):479–495, 2004.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Lisbeth Carstensen, Albin Sandelin, Ole Winther, and Niels R. Hansen. Multivariate Hawkes process models of the occurrence of regulatory elements. BMC Bioinformatics, 11(1):456, 2010.

Xi Chen, Han Xu, Ping Yuan, Fang Fang, Mikael Huss, Vinsensius B. Vega, Eleanor Wong, Yuriy L. Orlov, Weiwei Zhang, Jianming Jiang, et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133(6):1106–1117, 2008.

Paula Craciun, Mathias Ortner, and Josiane Zerubia. Joint detection and tracking of moving objects using spatio-temporal marked point processes. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, pages 177–184. IEEE, 2015.

Paul Embrechts, Thomas Liniger, and Lu Lin. Multivariate Hawkes processes: an application to financial data. Journal of Applied Probability, 48(A):367–378, 2011.

Mehrdad Farajtabar, Yichen Wang, Manuel Gomez Rodriguez, Shuang Li, Hongyuan Zha, and Le Song. Coevolve: A joint point process model for information diffusion and network co-evolution.
Advances in Neural Information Processing Systems (NIPS), pages 1954–1962, 2015.

Mehrdad Farajtabar, Jiachen Yang, Xiaojing Ye, Huan Xu, Rakshit Trivedi, Elias Khalil, Shuang Li, Le Song, and Hongyuan Zha. Fake news mitigation via point process based intervention. 34th International Conference on Machine Learning (ICML), 70:1097–1106, 06–11 Aug 2017.

Seth Flaxman, Yee Whye Teh, Dino Sejdinovic, et al. Poisson intensity estimation with reproducing kernels. Electronic Journal of Statistics, 11(2):5081–5104, 2017.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations (ICLR), 2015.

Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.

Alan G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811, 2016.

Kenneth O. Kortanek and Hoon No. A central cutting plane algorithm for convex semi-infinite programming problems. SIAM Journal on Optimization, 3(4):901–918, 1993.

Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497–506, 2009.

Erik Lewis and George Mohler. A nonparametric EM algorithm for multiscale Hawkes processes. 2011.

Hongyuan Mei and Jason M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process.
Advances in Neural Information Processing Systems (NIPS), pages 6754–6764, 2017.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Arkadi S. Nemirovski and David B. Yudin. Problem complexity and method efficiency in optimization. Wiley, New York, 1983.

Dávid Papp. Semi-infinite programming using high-degree polynomial interpolants and semidefinite programming. SIAM Journal on Optimization, 27(3):1858–1879, 2017.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. NIPS Autodiff Workshop, 2017.

B. T. Poljak and Ya. Z. Tsypkin. Pseudogradient adaptation and training algorithms. Automation and Remote Control, 34:45–67, 1973.

David Pollard, 2005. URL http://www.stat.yale.edu/~pollard/Courses/607.spring05/handouts/Totalvar.pdf.

Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.

Alexander Prestel and Charles Delzell. Positive polynomials: from Hilbert's 17th problem to real algebra. Springer Science & Business Media, 2013.

Lorenzo Rosasco, Mikhail Belkin, and Ernesto De Vito. On learning with integral operators. Journal of Machine Learning Research, 11(Feb):905–934, 2010.

Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.

Grace Wahba. Spline models for observational data, volume 59. SIAM, 1990.

Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li.
TernGrad: Ternary gradients to reduce communication in distributed deep learning. Advances in Neural Information Processing Systems (NIPS), pages 1509–1519, 2017.

Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.

Soon-Yi Wu and S-C Fang. Solving convex programs with infinitely many linear constraints by a relaxed cutting plane method. Computers & Mathematics with Applications, 38(3-4):23–33, 1999.

Shuai Xiao, Junchi Yan, Xiaokang Yang, Hongyuan Zha, and Stephen M. Chu. Modeling the intensity function of point process via recurrent neural networks. AAAI Conference on Artificial Intelligence (AAAI), 17:1597–1603, 2017.

Hongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. Learning Granger causality for Hawkes processes. International Conference on Machine Learning (ICML), pages 1717–1726, 2016.

Shuang-Hong Yang and Hongyuan Zha. Mixture of mutually exciting processes for viral diffusion. International Conference on Machine Learning (ICML), pages 1–9, 2013.

Yingxiang Yang, Jalal Etesami, Niao He, and Negar Kiyavash. Online learning for multivariate Hawkes processes. Advances in Neural Information Processing Systems (NIPS), pages 4937–4946, 2017.