{"title": "Minimizing Quadratic Functions in Constant Time", "book": "Advances in Neural Information Processing Systems", "page_first": 2217, "page_last": 2225, "abstract": "A sampling-based optimization method for quadratic functions is proposed. Our method approximately solves the following $n$-dimensional quadratic minimization problem in constant time, which is independent of $n$: $z^* = \\min_{v \\in \\mathbb{R}^n} \\langle v, Av \\rangle + n \\langle v, \\mathrm{diag}(d)v \\rangle + n \\langle b, v \\rangle$, where $A \\in \\mathbb{R}^{n \\times n}$ is a matrix and $d, b \\in \\mathbb{R}^n$ are vectors. Our theoretical analysis specifies the number of samples $k(\\delta, \\epsilon)$ such that the approximated solution $z$ satisfies $|z - z^*| = O(\\epsilon n^2)$ with probability $1-\\delta$. The empirical performance (accuracy and runtime) is positively confirmed by numerical experiments.", "full_text": "Minimizing Quadratic Functions in Constant Time

Kohei Hayashi
National Institute of Advanced Industrial Science and Technology
hayashi.kohei@gmail.com

Yuichi Yoshida
National Institute of Informatics and Preferred Infrastructure, Inc.
yyoshida@nii.ac.jp

Abstract

A sampling-based optimization method for quadratic functions is proposed. Our method approximately solves the following n-dimensional quadratic minimization problem in constant time, which is independent of n: z* = min_{v∈R^n} ⟨v, Av⟩ + n⟨v, diag(d)v⟩ + n⟨b, v⟩, where A ∈ R^{n×n} is a matrix and d, b ∈ R^n are vectors. Our theoretical analysis specifies the number of samples k(δ, ε) such that the approximated solution z satisfies |z − z*| = O(εn²) with probability 1 − δ. The empirical performance (accuracy and runtime) is positively confirmed by numerical experiments.

1 Introduction

A quadratic function is one of the most important function classes in machine learning, statistics, and data mining. Many fundamental problems such as linear regression, k-means clustering, principal component analysis, support vector machines, and kernel methods [14] can be formulated as the minimization of a quadratic function.

In some applications, it suffices to compute the minimum value of a quadratic function rather than its minimizer. For example, Yamada et al. [21] proposed an efficient method for estimating the Pearson divergence, which provides useful information about data, such as the density ratio [18]. They formulated the estimation problem as the minimization of a squared loss and showed that the Pearson divergence can be estimated from the minimum value. The least-squares mutual information [19] is another example that can be computed in a similar manner.

Despite its importance, the minimization of a quadratic function has a scalability issue. Let n ∈ N be the number of variables (the "dimension" of the problem). In general, such a minimization problem can be solved by quadratic programming (QP), which requires poly(n) time. If the problem is convex and there are no constraints, then it reduces to solving a system of linear equations, which requires O(n³) time. Both methods easily become infeasible, even for medium-scale problems, say, n > 10000.

Although several techniques have been proposed to accelerate quadratic function minimization, they require at least linear time in n. This is problematic when handling problems of ultrahigh dimension, for which even linear time is slow or prohibitive. For example, stochastic gradient descent (SGD) is an optimization method that is widely used for large-scale problems.
A nice property of this method is that, if the objective function is strongly convex, it outputs a point that is sufficiently close to an optimal solution after a constant number of iterations [5]. Nevertheless, in each iteration, we need at least O(n) time to access the variables. Another technique is low-rank approximation such as Nyström's method [20]. The underlying idea is to approximate the problem by using a low-rank matrix, and by doing so, we can drastically reduce the time complexity. However, we still need to compute a matrix–vector product of size n, which requires O(n) time. Clarkson et al. [7] proposed sublinear-time algorithms for special cases of quadratic function minimization. However, "sublinear" there is with respect to the number of pairwise interactions of the variables, which is O(n²), and their algorithms require O(n log^c n) time for some c ≥ 1.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Our contributions: Let A ∈ R^{n×n} be a matrix and d, b ∈ R^n be vectors. Then, we consider the following quadratic problem:

  minimize_{v∈R^n} p_{n,A,d,b}(v), where p_{n,A,d,b}(v) = ⟨v, Av⟩ + n⟨v, diag(d)v⟩ + n⟨b, v⟩.  (1)

Here, ⟨·,·⟩ denotes the inner product and diag(d) denotes the diagonal matrix whose diagonal entries are specified by d. Note that a constant term can be included in (1); however, it is irrelevant when optimizing (1), and hence we ignore it.

Let z* ∈ R be the optimal value of (1) and let ε, δ ∈ (0, 1) be parameters. Then, the main goal of this paper is the computation of z with |z − z*| = O(εn²) with probability at least 1 − δ in constant time, that is, independent of n. Here, we assume the real RAM model [6], in which we can perform basic algebraic operations on real numbers in one step.
Moreover, we assume that we have query access to A, b, and d, with which we can obtain an entry by specifying an index. We note that z* is typically Θ(n²) because ⟨v, Av⟩ consists of Θ(n²) terms, whereas ⟨v, diag(d)v⟩ and ⟨b, v⟩ consist of Θ(n) terms. Hence, we can regard an error of Θ(εn²) as an error of Θ(ε) per term, which is reasonably small in typical situations.

Let ·|_S be the operator that extracts a submatrix (or subvector) specified by an index set S ⊂ N; then, our algorithm is defined as follows, where the parameter k := k(ε, δ) will be determined later.

Algorithm 1
Input: An integer n ∈ N, query accesses to the matrix A ∈ R^{n×n} and to the vectors d, b ∈ R^n, and ε, δ > 0
1: S ← a sequence of k = k(ε, δ) indices independently and uniformly sampled from {1, 2, ..., n}.
2: return (n²/k²) · min_{v∈R^k} p_{k,A|_S,d|_S,b|_S}(v).

In other words, we sample a constant number of indices from the set {1, 2, ..., n} and then solve problem (1) restricted to those indices. Note that the number of queries and the time complexity are O(k²) and poly(k), respectively. In order to analyze the difference between the optimal values of p_{n,A,d,b} and p_{k,A|_S,d|_S,b|_S}, we want to measure the "distances" between A and A|_S, d and d|_S, and b and b|_S, and to show that they are small. To this end, we exploit graph limit theory, initiated by Lovász and Szegedy [11] (see [10] for a book treatment), in which the distance between two graphs on different numbers of vertices is measured by considering continuous versions.
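Algorithm 1 is short enough to sketch directly. The following Python sketch is our own illustration, not the authors' released code; the function name is ours, and it additionally assumes the sampled subproblem is strictly convex so that its minimizer is obtained from a single linear system:

```python
import numpy as np

def approx_min_quadratic(A, d, b, k, rng=None):
    """Sketch of Algorithm 1: estimate min_v <v,Av> + n<v,diag(d)v> + n<b,v>
    from a random k-subset of indices; the solve cost depends only on k.
    Assumes the restricted Hessian A_S + A_S^T + 2k diag(d_S) is positive definite."""
    rng = np.random.default_rng(rng)
    n = len(b)
    S = rng.integers(0, n, size=k)                # k indices, uniform with replacement
    A_S, d_S, b_S = A[np.ix_(S, S)], d[S], b[S]
    # Restricted objective: p_k(v) = <v, A_S v> + k <v, diag(d_S) v> + k <b_S, v>.
    # Stationarity: (A_S + A_S^T + 2k diag(d_S)) v = -k b_S.
    H = A_S + A_S.T + 2 * k * np.diag(d_S)
    v = np.linalg.solve(H, -k * b_S)
    z_k = v @ A_S @ v + k * (v * d_S) @ v + k * (b_S @ v)
    return (n / k) ** 2 * z_k                     # rescale to the n-dimensional problem
```

As a sanity check, when A = 0 and d, b are constant vectors, the restricted problem is an exact scaled copy of the full one, so the estimate is exact for any sample S.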
Although the primary interest of graph limit theory is graphs, we can extend the argument to analyze matrices and vectors. Using synthetic and real settings, we demonstrate that our method is orders of magnitude faster than standard polynomial-time algorithms and that its accuracy is sufficiently high.

Related work: Several constant-time approximation algorithms are known for combinatorial optimization problems such as the max cut problem on dense graphs [8, 13], constraint satisfaction problems [1, 22], and the vertex cover problem [15, 16, 25]. However, as far as we know, no such algorithm is known for continuous optimization problems.

A related notion is property testing [9, 17], which aims to design constant-time algorithms that distinguish inputs satisfying some predetermined property from inputs that are "far" from satisfying it. Characterizations of constant-time testable properties are known for the properties of a dense graph [2, 3] and the affine-invariant properties of a function on a finite field [23, 24].

2 Preliminaries

For an integer n, let [n] denote the set {1, 2, ..., n}. The notation a = b ± c means that b − c ≤ a ≤ b + c. In this paper, we only consider functions and sets that are measurable.

Let S = (x₁, ..., x_k) be a sequence of k indices in [n]. For a vector v ∈ R^n, we denote the restriction of v to S by v|_S ∈ R^k; that is, (v|_S)_i = v_{x_i} for every i ∈ [k]. For a matrix A ∈ R^{n×n}, we denote the restriction of A to S by A|_S ∈ R^{k×k}; that is, (A|_S)_{ij} = A_{x_i x_j} for every i, j ∈ [k].

2.1 Dikernels

Following [12], we call a (measurable) function f : [0,1]² → R a dikernel. A dikernel is a generalization of a graphon [11], which is symmetric and whose range is bounded in [0,1]. We can regard a dikernel as a matrix whose index is specified by a real value in [0,1].
We stress that the term dikernel has nothing to do with kernel methods.

For two functions f, g : [0,1] → R, we define their inner product as ⟨f, g⟩ = ∫₀¹ f(x)g(x)dx. For a dikernel W : [0,1]² → R and a function f : [0,1] → R, we define the function Wf : [0,1] → R as (Wf)(x) = ⟨W(x, ·), f⟩.

Let W : [0,1]² → R be a dikernel. The L_p norm ‖W‖_p for p ≥ 1 and the cut norm ‖W‖_□ of W are defined as

  ‖W‖_p = (∫₀¹∫₀¹ |W(x,y)|^p dx dy)^{1/p}  and  ‖W‖_□ = sup_{S,T⊆[0,1]} |∫_S∫_T W(x,y) dx dy|,

respectively, where the supremum is over all pairs of measurable subsets. We note that these norms satisfy the triangle inequality and that ‖W‖_□ ≤ ‖W‖₁.

Let λ be the Lebesgue measure. A map π : [0,1] → [0,1] is said to be measure-preserving if the pre-image π⁻¹(X) is measurable for every measurable set X and λ(π⁻¹(X)) = λ(X). A measure-preserving bijection is a measure-preserving map whose inverse map exists and is also measurable (and then also measure-preserving). For a measure-preserving bijection π : [0,1] → [0,1] and a dikernel W : [0,1]² → R, we define the dikernel π(W) : [0,1]² → R as π(W)(x,y) = W(π(x), π(y)).

2.2 Matrices and Dikernels

Let W : [0,1]² → R be a dikernel and S = (x₁, ..., x_k) be a sequence of elements in [0,1]. Then, we define the matrix W|_S ∈ R^{k×k} so that (W|_S)_{ij} = W(x_i, x_j).

We can construct a dikernel Â : [0,1]² → R from a matrix A ∈ R^{n×n} as follows. Let I₁ = [0, 1/n], I₂ = (1/n, 2/n], ..., I_n = ((n−1)/n, 1]. For x ∈ [0,1], we define in(x) ∈ [n] as the unique integer such that x ∈ I_{in(x)}. Then, we define Â(x, y) = A_{in(x) in(y)}.
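For intuition about the cut norm, note that for the step dikernel Â of a matrix the supremum can be restricted to unions of the intervals I_i, i.e., to index subsets S, T ⊆ [n], since the objective is bilinear in the fractional memberships and is maximized at extreme points. A minimal brute-force sketch (our own illustration, exponential in n and only for toy sizes):

```python
import numpy as np
from itertools import product

def cut_norm(W):
    """Cut norm of the dikernel of matrix W:
    max over index subsets S, T of |sum_{i in S, j in T} W_ij| / n^2."""
    n = W.shape[0]
    best = 0.0
    for mask in product([0, 1], repeat=n):        # enumerate subsets S
        s = np.array(mask, dtype=bool)
        col = W[s].sum(axis=0)                     # column sums restricted to S
        # For fixed S, the best T takes all positive (or all negative) columns.
        best = max(best, col[col > 0].sum(), -col[col < 0].sum())
    return best / n ** 2
```

For example, the all-ones matrix has cut norm 1, in agreement with ‖W‖_□ ≤ ‖W‖₁.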
The main motivation for creating a dikernel from a matrix is that, by doing so, we can define the distance between two matrices A and B of different sizes via the cut norm, that is, ‖Â − B̂‖_□. We note that the distribution of A|_S, where S is a sequence of k indices uniformly and independently sampled from [n], exactly matches the distribution of Â|_{S′}, where S′ is a sequence of k elements uniformly and independently sampled from [0,1].

3 Sampling Theorem and the Properties of the Cut Norm

In this section, we prove the following theorem, which states that, given a sequence of dikernels W¹, ..., W^T : [0,1]² → [−L, L], we can obtain a good approximation to them by sampling a small number of elements in [0,1]. Formally, we prove the following:

Theorem 3.1. Let W¹, ..., W^T : [0,1]² → [−L, L] be dikernels. Let S be a sequence of k elements uniformly and independently sampled from [0,1]. Then, with probability at least 1 − exp(−Ω(kT/log₂ k)), there exists a measure-preserving bijection π : [0,1] → [0,1] such that, for any functions f, g : [0,1] → [−K, K] and any t ∈ [T], we have

  |⟨f, W^t g⟩ − ⟨f, π(\widehat{W^t|_S})g⟩| = O(LK² √(T/log₂ k)).

We start with the following lemma, which states that, if a dikernel W : [0,1]² → R has a small cut norm, then ⟨f, Wg⟩ is negligible no matter what f and g are. Hence, we can focus on the cut norm when proving Theorem 3.1.

Lemma 3.2. Let ε ≥ 0 and let W : [0,1]² → R be a dikernel with ‖W‖_□ ≤ ε.
Then, for any functions f, g : [0,1] → [−K, K], we have |⟨f, Wg⟩| ≤ εK².

Proof. For τ ∈ R and a function h : [0,1] → R, let L_τ(h) := {x ∈ [0,1] | h(x) = τ} be the level set of h at τ. For f′ = f/K and g′ = g/K, we have

  |⟨f, Wg⟩| = K²|⟨f′, Wg′⟩| = K² |∫_{−1}^{1}∫_{−1}^{1} τ₁τ₂ ∫_{L_{τ₁}(f′)}∫_{L_{τ₂}(g′)} W(x,y) dx dy dτ₁ dτ₂|
    ≤ K² ∫_{−1}^{1}∫_{−1}^{1} |τ₁||τ₂| |∫_{L_{τ₁}(f′)}∫_{L_{τ₂}(g′)} W(x,y) dx dy| dτ₁ dτ₂
    ≤ εK² ∫_{−1}^{1}∫_{−1}^{1} |τ₁||τ₂| dτ₁ dτ₂ = εK².

To introduce the next technical tool, we need several definitions. We say that a partition Q is a refinement of a partition P = (V₁, ..., V_p) if Q is obtained by splitting each set V_i into one or more parts. A partition P = (V₁, ..., V_p) of the interval [0,1] is called an equipartition if λ(V_i) = 1/p for every i ∈ [p]. For a dikernel W : [0,1]² → R and an equipartition P = (V₁, ..., V_p) of [0,1], we define W_P : [0,1]² → R as the function obtained by averaging W over each V_i × V_j for i, j ∈ [p].
More formally, we define

  W_P(x, y) = (1/(λ(V_i)λ(V_j))) ∫_{V_i×V_j} W(x′, y′) dx′ dy′ = p² ∫_{V_i×V_j} W(x′, y′) dx′ dy′,

where i and j are the unique indices such that x ∈ V_i and y ∈ V_j, respectively.

The following lemma states that any dikernel W : [0,1]² → R can be well approximated by W_Q for an equipartition Q into a small number of parts.

Lemma 3.3 (Weak regularity lemma for functions on [0,1]² [8]). Let P be an equipartition of [0,1] into k sets. Then, for any dikernel W : [0,1]² → R and any ε > 0, there exists a refinement Q of P with |Q| ≤ k·2^{C/ε²} for some constant C > 0 such that

  ‖W − W_Q‖_□ ≤ ε‖W‖₂.

Corollary 3.4. Let W¹, ..., W^T : [0,1]² → R be dikernels. Then, for any ε > 0, there exists an equipartition P into |P| ≤ 2^{CT/ε²} parts for some constant C > 0 such that, for every t ∈ [T],

  ‖W^t − W^t_P‖_□ ≤ ε‖W^t‖₂.

Proof. Let P⁰ be the trivial partition, that is, the partition consisting of the single part [0,1]. Then, for each t ∈ [T], we iteratively apply Lemma 3.3 with P^{t−1}, W^t, and ε, and obtain a partition P^t into at most |P^{t−1}|·2^{C/ε²} parts such that ‖W^t − W^t_{P^t}‖_□ ≤ ε‖W^t‖₂. Since P^t is a refinement of P^{t−1}, we have ‖W^i − W^i_{P^t}‖_□ ≤ ‖W^i − W^i_{P^{t−1}}‖_□ for every i ∈ [t−1]. Then, P^T satisfies the desired property with |P^T| ≤ (2^{C/ε²})^T = 2^{CT/ε²}.

As long as S is sufficiently large, W and \widehat{W|_S} are close in the cut norm:

Lemma 3.5 ((4.15) of [4]). Let W : [0,1]² → [−L, L] be a dikernel and let S be a sequence of k elements uniformly and independently sampled from [0,1].
Then, we have

  −2L/k ≤ E_S‖\widehat{W|_S}‖_□ − ‖W‖_□ < 8L/k^{1/4}.

Finally, we need the following concentration inequality.

Lemma 3.6 (Azuma's inequality). Let (Ω, A, P) be a probability space, k be a positive integer, and C > 0. Let z = (z₁, ..., z_k), where z₁, ..., z_k are independent random variables and z_i takes values in some measure space (Ω_i, A_i). Let f : Ω₁ × ··· × Ω_k → R be a function such that |f(x) − f(y)| ≤ C whenever x and y differ in only one coordinate. Then,

  Pr[|f(z) − E_z[f(z)]| > λC] < 2e^{−λ²/2k}.

Now we prove the counterpart of Theorem 3.1 for the cut norm.

Lemma 3.7. Let W¹, ..., W^T : [0,1]² → [−L, L] be dikernels. Let S be a sequence of k elements uniformly and independently sampled from [0,1]. Then, with probability at least 1 − exp(−Ω(kT/log₂ k)), there exists a measure-preserving bijection π : [0,1] → [0,1] such that, for every t ∈ [T], we have

  ‖W^t − π(\widehat{W^t|_S})‖_□ = O(L√(T/log₂ k)).

Proof. First, we bound the expectations and then prove their concentration. We apply Corollary 3.4 to W¹, ..., W^T and ε, and let P = (V₁, ..., V_p) be the obtained partition with p ≤ 2^{CT/ε²} parts such that ‖W^t − W^t_P‖_□ ≤ ε‖W^t‖₂ ≤ εL for every t ∈ [T].
By Lemma 3.5, for every t ∈ [T], we have

  E_S‖\widehat{(W^t_P)|_S} − \widehat{W^t|_S}‖_□ = E_S‖\widehat{(W^t_P − W^t)|_S}‖_□ ≤ εL + 8L/k^{1/4}.

Then, for any measure-preserving bijection π : [0,1] → [0,1] and any t ∈ [T], we have

  E_S‖W^t − π(\widehat{W^t|_S})‖_□ ≤ ‖W^t − W^t_P‖_□ + E_S‖W^t_P − π(\widehat{(W^t_P)|_S})‖_□ + E_S‖π(\widehat{(W^t_P)|_S}) − π(\widehat{W^t|_S})‖_□
    ≤ 2εL + 8L/k^{1/4} + E_S‖W^t_P − π(\widehat{(W^t_P)|_S})‖_□.    (2)

Thus, we are left with the problem of sampling from P. Let S = {x₁, ..., x_k} be a sequence of independent random variables that are uniformly distributed in [0,1], and let Z_i be the number of points x_j that fall into the set V_i. It is easy to compute that

  E[Z_i] = k/p  and  Var[Z_i] = (k/p)(1 − 1/p) < k/p.

We construct a partition P′ = (V′₁, ..., V′_p) of [0,1] such that λ(V′_i) = Z_i/k and λ(V_i ∩ V′_i) = min(1/p, Z_i/k). For each t ∈ [T], we construct the dikernel W̄^t : [0,1]² → R such that the value of W̄^t on V′_i × V′_j equals the value of W^t_P on V_i × V_j. Then, there exists a measure-preserving bijection π such that π(\widehat{(W^t_P)|_S}) = W̄^t, and W̄^t agrees with W^t_P on the set Q = ∪_{i,j∈[p]} (V_i ∩ V′_i) × (V_j ∩ V′_j), which has measure λ(Q) = (Σ_{i∈[p]} min(1/p, Z_i/k))², for each t ∈ [T].
Then, for every t ∈ [T], we have

  ‖W^t_P − π(\widehat{(W^t_P)|_S})‖_□ = ‖W^t_P − W̄^t‖_□ ≤ ‖W^t_P − W̄^t‖₁ ≤ 2L(1 − λ(Q))
    = 2L(1 − (Σ_{i∈[p]} min(1/p, Z_i/k))²) ≤ 4L(1 − Σ_{i∈[p]} min(1/p, Z_i/k))
    = 4L Σ_{i∈[p]} (1/p − min(1/p, Z_i/k)) = 2L Σ_{i∈[p]} |1/p − Z_i/k|,

where the last equality holds because Σ_{i∈[p]} 1/p = Σ_{i∈[p]} Z_i/k = 1, so the positive and negative parts of Σ_{i∈[p]} (1/p − Z_i/k) coincide. By the Cauchy–Schwarz inequality,

  ‖W^t_P − π(\widehat{(W^t_P)|_S})‖²_□ ≤ 4L²p Σ_{i∈[p]} (1/p − Z_i/k)².

The expectation of the right-hand side is (4L²p/k²) Σ_{i∈[p]} Var(Z_i) < 4L²p/k, and hence E_S‖W^t_P − π(\widehat{(W^t_P)|_S})‖_□ ≤ 2L√(p/k). Inserting this into (2), we obtain

  E_S‖W^t − π(\widehat{W^t|_S})‖_□ ≤ 2εL + 8L/k^{1/4} + 2L√(2^{CT/ε²}/k).

Choosing ε = √(CT/log₂(k^{1/4})) = √(4CT/log₂ k), so that 2^{CT/ε²} = k^{1/4}, we obtain the upper bound

  E_S‖W^t − π(\widehat{W^t|_S})‖_□ ≤ 2L√(4CT/log₂ k) + 8L/k^{1/4} + 2L/k^{3/8} = O(L√(T/log₂ k)).

Observing that ‖W^t − π(\widehat{W^t|_S})‖_□ changes by at most O(L/k) if one element in S changes, we apply Azuma's inequality with λ = k√(T/log₂ k) and the union bound over t ∈ [T] to complete the proof.

The proof of Theorem 3.1 immediately follows from Lemmas 3.2 and 3.7.

4 Analysis of Algorithm 1

In this section, we analyze Algorithm 1.
Because we want to use dikernels for the analysis, we introduce a continuous version of p_{n,A,d,b} (recall (1)). The real-valued functional P_{n,A,d,b} on functions f : [0,1] → R is defined as

  P_{n,A,d,b}(f) = ⟨f, Âf⟩ + ⟨f², \widehat{d1ᵀ}1⟩ + ⟨f, \widehat{b1ᵀ}1⟩,

where f² : [0,1] → R is the function with f²(x) = f(x)² for every x ∈ [0,1] and 1 : [0,1] → R is the constant function whose value is 1 everywhere. The following lemma states that the minimizations of p_{n,A,d,b} and P_{n,A,d,b} are equivalent:

Lemma 4.1. Let A ∈ R^{n×n} be a matrix and d, b ∈ R^n be vectors. Then, for any K > 0, we have

  min_{v∈[−K,K]^n} p_{n,A,d,b}(v) = n² · inf_{f:[0,1]→[−K,K]} P_{n,A,d,b}(f).

Proof. First, we show that n² · inf_{f:[0,1]→[−K,K]} P_{n,A,d,b}(f) ≤ min_{v∈[−K,K]^n} p_{n,A,d,b}(v). Given a vector v ∈ [−K, K]^n, we define f : [0,1] → [−K, K] as f(x) = v_{in(x)}. Then,

  ⟨f, Âf⟩ = Σ_{i,j∈[n]} ∫_{I_i}∫_{I_j} A_{ij} f(x)f(y) dx dy = (1/n²) Σ_{i,j∈[n]} A_{ij} v_i v_j = (1/n²)⟨v, Av⟩,
  ⟨f², \widehat{d1ᵀ}1⟩ = Σ_{i,j∈[n]} ∫_{I_i}∫_{I_j} d_i f(x)² dx dy = Σ_{i∈[n]} ∫_{I_i} d_i f(x)² dx = (1/n) Σ_{i∈[n]} d_i v_i² = (1/n)⟨v, diag(d)v⟩,
  ⟨f, \widehat{b1ᵀ}1⟩ = Σ_{i,j∈[n]} ∫_{I_i}∫_{I_j} b_i f(x) dx dy = Σ_{i∈[n]} ∫_{I_i} b_i f(x) dx = (1/n) Σ_{i∈[n]} b_i v_i = (1/n)⟨b, v⟩.

Hence, we have n²P_{n,A,d,b}(f) = p_{n,A,d,b}(v), and taking the infimum over f yields the claimed inequality.

Next, we show that min_{v∈[−K,K]^n} p_{n,A,d,b}(v) ≤ n² · inf_{f:[0,1]→[−K,K]} P_{n,A,d,b}(f).
Let f : [0,1] → [−K, K] be a measurable function. Then, for x ∈ [0,1], we have

  ∂P_{n,A,d,b}(f)/∂f(x) = Σ_{i∈[n]} ∫_{I_i} A_{i,in(x)} f(y) dy + Σ_{j∈[n]} ∫_{I_j} A_{in(x),j} f(y) dy + 2d_{in(x)} f(x) + b_{in(x)}.

Note that the form of this partial derivative depends only on in(x); hence, in an optimal solution f* : [0,1] → [−K, K], we can assume f*(x) = f*(y) whenever in(x) = in(y). In other words, f* is constant on each of the intervals I₁, ..., I_n. For such an f*, we define the vector v ∈ R^n by v_i = f*(x), where x ∈ [0,1] is any element of I_i. Then, we have

  ⟨v, Av⟩ = Σ_{i,j∈[n]} A_{ij} v_i v_j = n² Σ_{i,j∈[n]} ∫_{I_i}∫_{I_j} A_{ij} f*(x)f*(y) dx dy = n²⟨f*, Âf*⟩,
  ⟨v, diag(d)v⟩ = Σ_{i∈[n]} d_i v_i² = n Σ_{i∈[n]} ∫_{I_i} d_i f*(x)² dx = n⟨(f*)², \widehat{d1ᵀ}1⟩,
  ⟨b, v⟩ = Σ_{i∈[n]} b_i v_i = n Σ_{i∈[n]} ∫_{I_i} b_i f*(x) dx = n⟨f*, \widehat{b1ᵀ}1⟩.

Hence, p_{n,A,d,b}(v) = n²P_{n,A,d,b}(f*), which completes the proof.

Now we show that Algorithm 1 well-approximates the optimal value of (1) in the following sense:

Theorem 4.2. Let v* and z* be an optimal solution and the optimal value, respectively, of problem (1). By choosing k(ε, δ) = 2^{Θ(1/ε²)} + Θ(log(1/δ) · log log(1/δ)), with probability at least 1 − δ, a sequence S of k indices independently and uniformly sampled from [n] satisfies the following: Let ṽ* and z̃* be an optimal solution and the optimal value, respectively, of the problem min_{v∈R^k} p_{k,A|_S,d|_S,b|_S}(v).
Then, we have

  |(n²/k²)·z̃* − z*| ≤ εLK²n²,

where K = max{max_{i∈[n]} |v*_i|, max_{i∈[k]} |ṽ*_i|} and L = max{max_{i,j} |A_{ij}|, max_i |d_i|, max_i |b_i|}.

Proof. We instantiate Theorem 3.1 with k = 2^{Θ(1/ε²)} + Θ(log(1/δ) · log log(1/δ)) and the dikernels Â, \widehat{d1ᵀ}, and \widehat{b1ᵀ}. Then, with probability at least 1 − δ, there exists a measure-preserving bijection π : [0,1] → [0,1] such that

  max{ |⟨f, (Â − π(\widehat{A|_S}))f⟩|, |⟨f², (\widehat{d1ᵀ} − π(\widehat{(d1ᵀ)|_S}))1⟩|, |⟨f, (\widehat{b1ᵀ} − π(\widehat{(b1ᵀ)|_S}))1⟩| } ≤ εLK²/3

for any function f : [0,1] → [−K, K]. Then, we have

  z̃* = min_{v∈R^k} p_{k,A|_S,d|_S,b|_S}(v) = min_{v∈[−K,K]^k} p_{k,A|_S,d|_S,b|_S}(v)
     = k² · inf_{f:[0,1]→[−K,K]} P_{k,A|_S,d|_S,b|_S}(f)    (by Lemma 4.1)
     = k² · inf_{f:[0,1]→[−K,K]} ( ⟨f, π(\widehat{A|_S})f⟩ + ⟨f², π(\widehat{(d1ᵀ)|_S})1⟩ + ⟨f, π(\widehat{(b1ᵀ)|_S})1⟩ )
     = k² · inf_{f:[0,1]→[−K,K]} ( ⟨f, Âf⟩ + ⟨f², \widehat{d1ᵀ}1⟩ + ⟨f, \widehat{b1ᵀ}1⟩ ± εLK² )
     = (k²/n²) · min_{v∈[−K,K]^n} p_{n,A,d,b}(v) ± εLK²k²    (by Lemma 4.1)
     = (k²/n²) ·
z\u2217 \u00b1 \u0001LK 2k2.\n\nRearranging the inequality, we obtain the desired result.\n\nWe can show that K is bounded when A is symmetric and full rank. To see this, we \ufb01rst note\nthat we can assume A + ndiag(d) is positive-de\ufb01nite, as otherwise pn,A,d,b is not bounded and\nthe problem is uninteresting. Then, for any set S \u2286 [n] of k indices, (A + ndiag(d))|S is again\npositive-de\ufb01nite because it is a principal submatrix. Hence, we have v\u2217 = (A + ndiag(d))\u22121nb/2\nand \u02dcv\u2217 = (A|S + ndiag(d|S))\u22121nb|S/2, which means that K is bounded.\n\n5 Experiments\n\nIn this section, we demonstrate the effectiveness of our method by experiment.1 All experiments\nwere conducted on an Amazon EC2 c3.8xlarge instance. Error bars indicate the standard deviations\nover ten trials with different random seeds.\n\nNumerical simulation We investigated the actual relationships between n, k, and \u0001. To this end,\nwe prepared synthetic data as follows. We randomly generated inputs as Aij \u223c U[\u22121,1], di \u223c U[0,1],\nand bi \u223c U[\u22121,1] for i, j \u2208 [n], where U[a,b] denotes the uniform distribution with the support [a, b].\nAfter that, we solved (1) by using Algorithm 1 and compared it with the exact solution obtained by\nQP.2 The result (Figure 1) show the approximation errors were evenly controlled regardless of n,\nwhich meets the error analysis (Theorem 4.2).\n\n1The program codes are available at https://github.com/hayasick/CTOQ.\n2We used GLPK (https://www.gnu.org/software/glpk/) for the QP solver.\n\n7\n\n\fTable 1: Pearson divergence: runtime (second).\n\ne\ns\no\np\no\nr\nP\n\nk\nd 20\n40\n80\n160\nm 20\n40\n80\n160\n\n\u00a8o\nr\nt\ns\ny\nN\n\nn = 500\n0.002\n0.003\n0.007\n0.030\n0.005\n0.010\n0.022\n0.076\n\n1000\n0.002\n0.003\n0.007\n0.030\n0.012\n0.022\n0.049\n0.116\n\n2000\n0.002\n0.003\n0.008\n0.033\n0.046\n0.087\n0.188\n0.432\n\n5000\n0.002\n0.003\n0.008\n0.035\n0.274\n0.513\n0.942\n1.972\n\nFigure 1: 
Numerical simulation: absolute approximation error scaled by n².

Table 2: Pearson divergence: absolute approximation error.

             k     n = 500           1000              2000              5000
Proposed     20    0.0027 ± 0.0028   0.0012 ± 0.0012   0.0021 ± 0.0019   0.0016 ± 0.0022
             40    0.0018 ± 0.0023   0.0006 ± 0.0007   0.0012 ± 0.0011   0.0011 ± 0.0020
             80    0.0007 ± 0.0008   0.0004 ± 0.0003   0.0008 ± 0.0008   0.0007 ± 0.0017
             160   0.0003 ± 0.0003   0.0002 ± 0.0001   0.0003 ± 0.0003   0.0002 ± 0.0003
Nyström      20    0.3685 ± 0.9142   1.3006 ± 2.4504   3.1119 ± 6.1464   0.6989 ± 0.9644
             40    0.3549 ± 0.6191   0.4207 ± 0.7018   0.9838 ± 1.5422   0.3744 ± 0.6655
             80    0.0184 ± 0.0192   0.0398 ± 0.0472   0.2056 ± 0.2725   0.5705 ± 0.7918
             160   0.0143 ± 0.0209   0.0348 ± 0.0541   0.0585 ± 0.1112   0.0254 ± 0.0285

Application to kernel methods: Next, we considered the kernel approximation of the Pearson divergence [21]. The problem is defined as follows. Suppose we have two different data sets x = (x₁, ..., x_n) ∈ R^n and x′ = (x′₁, ..., x′_{n′}) ∈ R^{n′}, where n, n′ ∈ N. Let H ∈ R^{n×n} be the Gram matrix such that

  H_{l,m} = (α/n) Σ_{i=1}^{n} φ(x_i, x_l)φ(x_i, x_m) + ((1−α)/n′) Σ_{j=1}^{n′} φ(x′_j, x_l)φ(x′_j, x_m),

where φ(·,·) is a kernel function and α ∈ (0, 1) is a parameter. Also, let h ∈ R^n be the vector such that h_l = (1/n) Σ_{i=1}^{n} φ(x_i, x_l). Then, an estimator of the α-relative Pearson divergence between the distributions of x and x′ is obtained by

  −1/2 − min_{v∈R^n} ( (1/2)⟨v, Hv⟩ − ⟨h, v⟩ + (λ/2)⟨v, v⟩ ).

Here, λ > 0 is a regularization parameter.
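Since the objective above is an unconstrained strictly convex quadratic, its minimizer solves (H + λI)v = h, and the estimator can be computed in closed form. The following sketch is our own illustration (function name and hyperparameter defaults are ours, not cross-validated as in the paper):

```python
import numpy as np

def pearson_divergence_estimate(x, xp, alpha=0.5, sigma=1.0, lam=0.1):
    """Estimate the alpha-relative Pearson divergence between samples x and xp via
    -1/2 - min_v [ <v,Hv>/2 - <h,v> + lam<v,v>/2 ], using a Gaussian kernel."""
    phi = lambda a, c: np.exp(-(a[:, None] - c[None, :]) ** 2 / (2 * sigma ** 2))
    Px, Pp = phi(x, x), phi(xp, x)               # kernel values against centers x
    H = alpha * Px.T @ Px / len(x) + (1 - alpha) * Pp.T @ Pp / len(xp)
    h = Px.mean(axis=0)
    v = np.linalg.solve(H + lam * np.eye(len(x)), h)   # minimizer of the quadratic
    return -0.5 - (0.5 * v @ H @ v - h @ v + 0.5 * lam * v @ v)
```

Note that the minimum of the quadratic equals −⟨h, v⟩/2 with v = (H + λI)⁻¹h, so the estimate is always at least −1/2, since H + λI is positive definite.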
In this experiment, we used the Gaussian kernel φ(x, y) = exp(−(x − y)²/2σ²) and set n' = 200 and α = 0.5; σ² and λ were chosen by 5-fold cross-validation as suggested in [21]. We randomly generated the data sets as x_i ∼ N(1, 0.5) for i ∈ [n] and x'_j ∼ N(1.5, 0.5) for j ∈ [n'], where N(µ, σ²) denotes the Gaussian distribution with mean µ and variance σ².

We encoded this problem into (1) by setting A = (1/2)H, b = −h, and d = (λ/2n)1_n, where 1_n denotes the n-dimensional vector whose elements are all one. After that, given k, we computed the second step of Algorithm 1 with the pseudoinverse of A|_S + k·diag(d|_S). Absolute approximation errors and runtimes were compared with those of Nyström's method, whose approximation rank was set to k. In terms of accuracy, our method clearly outperformed Nyström's method (Table 2). In addition, the runtimes of our method were nearly constant, whereas the runtimes of Nyström's method grew linearly in k (Table 1).

6 Acknowledgments

We would like to thank Makoto Yamada for suggesting a motivating problem for our method. K. H. is supported by MEXT KAKENHI 15K16055. Y. Y. is supported by MEXT Grant-in-Aid for Scientific Research on Innovative Areas (No. 24106001), JST, CREST, Foundations of Innovative Algorithms for Big Data, and JST, ERATO, Kawarabayashi Large Graph Project.

References

[1] N. Alon, W. F. de la Vega, R. Kannan, and M. Karpinski. Random sampling and approximation of MAX-CSP problems. In STOC, pages 232–239, 2002.

[2] N. Alon, E. Fischer, I. Newman, and A. Shapira.
A combinatorial characterization of the testable graph properties: It's all about regularity. SIAM Journal on Computing, 39(1):143–167, 2009.

[3] C. Borgs, J. Chayes, L. Lovász, V. T. Sós, B. Szegedy, and K. Vesztergombi. Graph limits and parameter testing. In STOC, pages 261–270, 2006.

[4] C. Borgs, J. T. Chayes, L. Lovász, V. T. Sós, and K. Vesztergombi. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6):1801–1851, 2008.

[5] L. Bottou. Stochastic learning. In Advanced Lectures on Machine Learning, pages 146–168, 2004.

[6] V. Brattka and P. Hertling. Feasible real random access machines. Journal of Complexity, 14(4):490–526, 1998.

[7] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. Journal of the ACM, 59(5):23:1–23:49, 2012.

[8] A. Frieze and R. Kannan. The regularity lemma and approximation schemes for dense problems. In FOCS, pages 12–20, 1996.

[9] O. Goldreich, S. Goldwasser, and D. Ron. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653–750, 1998.

[10] L. Lovász. Large Networks and Graph Limits. American Mathematical Society, 2012.

[11] L. Lovász and B. Szegedy. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B, 96(6):933–957, 2006.

[12] L. Lovász and K. Vesztergombi. Non-deterministic graph property testing. Combinatorics, Probability and Computing, 22(5):749–762, 2013.

[13] C. Mathieu and W. Schudy. Yet another algorithm for dense max cut: go greedy. In SODA, pages 176–182, 2008.

[14] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

[15] H. N. Nguyen and K. Onak. Constant-time approximation algorithms via local improvements. In FOCS, pages 327–336, 2008.

[16] K. Onak, D. Ron, M. Rosen, and R. Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In SODA, pages 1123–1131, 2012.

[17] R. Rubinfeld and M. Sudan. Robust characterizations of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252–271, 1996.

[18] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

[19] T. Suzuki and M. Sugiyama. Least-squares independent component analysis. Neural Computation, 23(1):284–301, 2011.

[20] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.

[21] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama. Relative density-ratio estimation for robust distribution comparison. In NIPS, 2011.

[22] Y. Yoshida. Optimal constant-time approximation algorithms and (unconditional) inapproximability results for every bounded-degree CSP. In STOC, pages 665–674, 2011.

[23] Y. Yoshida. A characterization of locally testable affine-invariant properties via decomposition theorems. In STOC, pages 154–163, 2014.

[24] Y. Yoshida. Gowers norm, function limits, and parameter estimation. In SODA, pages 1391–1406, 2016.

[25] Y. Yoshida, M. Yamamoto, and H. Ito. Improved constant-time approximation algorithms for maximum matchings and other optimization problems. SIAM Journal on Computing, 41(4):1074–1093, 2012.