{"title": "Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1361, "page_last": 1368, "abstract": "Policy gradient (PG) reinforcement learning algorithms have strong (local) convergence guarantees, but their learning performance is typically limited by a large variance in the estimate of the gradient. In this paper, we formulate the variance reduction problem by describing a signal-to-noise ratio (SNR) for policy gradient algorithms, and evaluate this SNR carefully for the popular Weight Perturbation (WP) algorithm. We confirm that SNR is a good predictor of long-term learning performance, and that in our episodic formulation, the cost-to-go function is indeed the optimal baseline. We then propose two modifications to traditional model-free policy gradient algorithms in order to optimize the SNR. First, we examine WP using anisotropic sampling distributions, which introduces a bias into the update but increases the SNR; this bias can be interpretted as following the natural gradient of the cost function. Second, we show that non-Gaussian distributions can also increase the SNR, and argue that the optimal isotropic distribution is a \u00e2\u0080\u0098shell\u00e2\u0080\u0099 distribution with a constant magnitude and uniform distribution in direction. We demonstrate that both modifications produce substantial improvements in learning performance in challenging policy gradient experiments.", "full_text": "Signal-to-Noise Ratio Analysis\nof Policy Gradient Algorithms\n\nJohn W. Roberts and Russ Tedrake\n\nComputer Science and\n\nArti\ufb01cial Intelligence Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nAbstract\n\nPolicy gradient (PG) reinforcement learning algorithms have strong (local) con-\nvergence guarantees, but their learning performance is typically limited by a large\nvariance in the estimate of the gradient. 
In this paper, we formulate the variance\nreduction problem by describing a signal-to-noise ratio (SNR) for policy gradient\nalgorithms, and evaluate this SNR carefully for the popular Weight Perturbation\n(WP) algorithm. We con\ufb01rm that SNR is a good predictor of long-term learn-\ning performance, and that in our episodic formulation, the cost-to-go function is\nindeed the optimal baseline. We then propose two modi\ufb01cations to traditional\nmodel-free policy gradient algorithms in order to optimize the SNR. First, we\nexamine WP using anisotropic sampling distributions, which introduces a bias\ninto the update but increases the SNR; this bias can be interpreted as following the\nnatural gradient of the cost function. Second, we show that non-Gaussian distribu-\ntions can also increase the SNR, and argue that the optimal isotropic distribution is\na \u2018shell\u2019 distribution with a constant magnitude and uniform distribution in direc-\ntion. We demonstrate that both modi\ufb01cations produce substantial improvements\nin learning performance in challenging policy gradient experiments.\n\n1 Introduction\n\nModel-free policy gradient algorithms allow for the optimization of control policies on systems\nwhich are impractical to model effectively, whether due to cost, complexity or uncertainty in the\nvery structure and dynamics of the system (Kohl & Stone, 2004; Tedrake et al., 2004). However,\nthese algorithms often suffer from high variance and relatively slow convergence times (Greensmith\net al., 2004). As the same systems on which one wishes to use these algorithms tend to have a\nhigh cost of policy evaluation, much work has been done on maximizing the policy improvement\nfrom any individual evaluation (Meuleau et al., 2000; Williams et al., 2006). 
Techniques such as Natural Gradient (Amari, 1998; Peters et al., 2003a) and GPOMDP (Baxter & Bartlett, 2001) have become popular through their ability to match the performance gains of more basic model-free policy gradient algorithms while using fewer policy evaluations.\nAs practitioners of policy gradient algorithms on complicated mechanical systems, our group has a vested interest in making practical and substantial improvements to the performance of these algorithms. Variance reduction, in itself, is not a sufficient metric for optimizing the performance of PG algorithms - of greater significance is the magnitude of the variance relative to the magnitude of the gradient update. Here we formulate a signal-to-noise ratio (SNR) which permits simple and fast evaluation of a PG algorithm's average performance, and which facilitates algorithmic performance improvements. Though the SNR does not capture all facets of a policy gradient algorithm's capability to learn, we show that achieving a high SNR will often result in a superior convergence rate with less violent variations in the policy.\n\nThrough a close analysis of the SNR, and the means by which it is maximized, we find several modifications to traditional model-free policy gradient updates that improve learning performance. The first of these is the reshaping of the sampling distribution so that it differs across parameters, a modification which introduces a bias into the update. We show that this reshaping can improve performance, and that the introduced bias results in following the natural gradient of the cost function, rather than the true point gradient. 
The second improvement is the use of non-Gaussian distributions for sampling, and through the SNR we find a simple distribution which improves performance without increasing the complexity of implementation.\n\n2 The weight perturbation update\n\nConsider minimizing a scalar function J(w) with respect to the parameters w (note that it is possible that J(w) is a long-term cost and results from running a system with the parameters w until conclusion). The weight perturbation algorithm (Jabri & Flower, 1992) performs this minimization with the update:\n\n\u0394w = \u2212\u03b7 ( J(w + z) \u2212 J(w) ) z,   (1)\n\nwhere the components of the 'perturbation' z are drawn independently from a mean-zero distribution, and \u03b7 is a positive scalar controlling the magnitude of the update (the \u201clearning rate\u201d). Performing a first-order Taylor expansion of J(w + z) yields:\n\n\u0394w = \u2212\u03b7 ( J(w) + \u2211_i (\u2202J/\u2202w_i) z_i \u2212 J(w) ) z = \u2212\u03b7 ( \u2211_i (\u2202J/\u2202w_i) z_i ) z.   (2)\n\nIn expectation, this becomes the gradient times a (diagonal) covariance matrix, and reduces to\n\nE[\u0394w] = \u2212\u03b7\u03c3\u00b2 \u2202J/\u2202w,   (3)\n\nan unbiased estimate of the gradient, scaled by the learning rate and \u03c3\u00b2, the variance of the perturbation. However, this unbiasedness comes with a very high variance, as the direction of an update is uniformly distributed. It is only the fact that updates near the direction of the true gradient have a larger magnitude than do those nearly perpendicular to the gradient that allows for the true gradient to be achieved in expectation. 
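As a concrete illustration, the update of Equation (1) takes only a few lines to implement. The sketch below is illustrative only - the quadratic test function, learning rate, perturbation scale, and iteration count are assumptions for the example, not values from the paper:

```python
import numpy as np

def wp_step(J, w, eta, sigma, rng):
    # One weight-perturbation update: w <- w - eta * (J(w + z) - J(w)) * z
    z = rng.normal(0.0, sigma, size=w.shape)  # mean-zero Gaussian perturbation
    return w - eta * (J(w + z) - J(w)) * z

# Illustrative use on the bowl J(w) = w^T w, whose minimum is at the origin
J = lambda w: float(w @ w)
rng = np.random.default_rng(0)
w = np.ones(2)
for _ in range(500):
    w = wp_step(J, w, eta=0.05, sigma=0.1, rng=rng)
print(J(w))  # well below the initial cost J([1, 1]) = 2
```

Note that each step needs only two evaluations of J, which is what makes the method attractive when policy evaluations are expensive.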
Note also that all samples parallel to the gradient are equally useful, whether they be in the same or opposite direction, as the sign does not affect the resulting update.\nThe WP algorithm is one of the simplest examples of a policy gradient reinforcement learning algorithm, and thus is well suited for analysis. In the special case when z is drawn from a Gaussian distribution, weight perturbation can be interpreted as a REINFORCE update (Williams, 1992).\n\n3 SNR for policy gradient algorithms\n\nThe SNR is the expected power of the signal (update in the direction of the true gradient) divided by the expected power of the noise (update perpendicular to the true gradient). Taking care to ensure that the magnitude of the true gradient does not affect the SNR, we have:\n\nSNR = E[\u0394w_\u2225\u1d40 \u0394w_\u2225] / E[\u0394w_\u22a5\u1d40 \u0394w_\u22a5],   (4)\n\n\u0394w_\u2225 = ( \u0394w\u1d40 J_w/\u2016J_w\u2016 ) J_w/\u2016J_w\u2016,   \u0394w_\u22a5 = \u0394w \u2212 \u0394w_\u2225,   (5)\n\nusing J_w(w_0) = \u2202J(w)/\u2202w |_(w=w_0) for convenience.\n\nIntuitively, this expression measures how large a proportion of the update is \u201cuseful\u201d. If the update were purely in the direction of the gradient the SNR would be infinite, while if the update moved perpendicular to the true gradient, it would be zero. 
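This decomposition translates directly into a Monte-Carlo estimate of the SNR. The sketch below (sample count, scales, and the test gradient are illustrative assumptions) draws Gaussian perturbations, forms the first-order WP update, and splits each sample into parallel and perpendicular components:

```python
import numpy as np

def wp_snr(grad, sigma=0.1, eta=1.0, n=200000, seed=1):
    # Monte-Carlo SNR = E[|dw_par|^2] / E[|dw_perp|^2] for the linearized
    # WP update dw = -eta * (grad . z) * z with Gaussian perturbations z.
    rng = np.random.default_rng(seed)
    ghat = grad / np.linalg.norm(grad)       # unit vector along the true gradient
    Z = rng.normal(0.0, sigma, size=(n, grad.size))
    dw = -eta * (Z @ grad)[:, None] * Z      # sampled updates
    par = (dw @ ghat)[:, None] * ghat        # components parallel to the gradient
    perp = dw - par
    return np.sum(par * par) / np.sum(perp * perp)

# For N = 4 the closed form derived in Section 3.1 gives 3/(N - 1) = 1
print(wp_snr(np.array([1.0, 2.0, -0.5, 0.3])))
```

The estimator is independent of the gradient's overall scale, mirroring the normalization built into the definition above.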
As such, all else being equal, an algorithm with a higher SNR should generally perform as well as or better than one with a lower SNR, and result in less violent swings in cost and policy for the same improvement in performance.\n\n3.1 Weight perturbation with Gaussian distributions\n\nEvaluating the SNR for the WP update in Equation 1 with a deterministic J(w) and z drawn from a Gaussian distribution yields a surprisingly simple result. If one first considers the numerator:\n\nE[\u0394w_\u2225\u1d40 \u0394w_\u2225] = E[ (\u03b7\u00b2/\u2016J_w\u2016\u2074) ( \u2211_(i,j) J_(w_i) J_(w_j) z_i z_j ) J_w\u1d40 ( \u2211_(k,p) J_(w_k) J_(w_p) z_k z_p ) J_w ] = E[ (\u03b7\u00b2/\u2016J_w\u2016\u00b2) \u2211_(i,j,k,p) J_(w_i) J_(w_j) J_(w_k) J_(w_p) z_i z_j z_k z_p ] = Q,   (6)\n\nwhere we have named this term Q for convenience as it occurs several times in the expansion of the SNR. We now expand the denominator as follows:\n\nE[\u0394w_\u22a5\u1d40 \u0394w_\u22a5] = E[ \u0394w\u1d40\u0394w \u2212 2\u0394w_\u2225\u1d40(\u0394w_\u2225 + \u0394w_\u22a5) + \u0394w_\u2225\u1d40\u0394w_\u2225 ] = E[\u0394w\u1d40\u0394w] \u2212 2Q + Q = E[\u0394w\u1d40\u0394w] \u2212 Q.   (7)\n\nSubstituting Equation (1) into Equation (7) and simplifying results in:\n\nE[\u0394w_\u22a5\u1d40 \u0394w_\u22a5] = \u03b7\u00b2 E[ ( \u2211_i J_(w_i) z_i )\u00b2 \u2211_k z_k\u00b2 ] \u2212 Q.   (8)\n\nWe now assume that each component z_i is drawn from a Gaussian distribution with variance \u03c3\u00b2. Taking the expected value, Q evaluates to\n\nQ = (\u03b7\u00b2/\u2016J_w\u2016\u00b2) ( 3\u03c3\u2074 \u2211_i J_(w_i)\u2074 + 3\u03c3\u2074 \u2211_(i\u2260j) J_(w_i)\u00b2 J_(w_j)\u00b2 ) = 3\u03b7\u00b2\u03c3\u2074\u2016J_w\u2016\u00b2,   (9)\n\nand Equation (8) may be further simplified to\n\nE[\u0394w_\u22a5\u1d40 \u0394w_\u22a5] = \u03b7\u00b2\u03c3\u2074(2 + N)\u2016J_w\u2016\u00b2 \u2212 3\u03b7\u00b2\u03c3\u2074\u2016J_w\u2016\u00b2 = \u03b7\u00b2\u03c3\u2074(N \u2212 1)\u2016J_w\u2016\u00b2,   (10)\n\nwhere N is the number of parameters. Dividing (9) by (10) and canceling \u03c3 results in:\n\nSNR = 3/(N \u2212 1).   (11)\n\nThus, for small noises and constant \u03c3 the SNR and the parameter number have a simple inverse relationship. This is a particularly concise model for performance scaling in PG algorithms.\n\n3.2 Relationship of the SNR to learning performance\n\nTo evaluate the degree to which the SNR is correlated with actual learning performance, we ran a number of experiments on a simple quadratic bowl cost function, which may be written as:\n\nJ(w) = w\u1d40 A w,   (12)\n\nwhere the optimum is always at the point 0. The SNR suggests a simple inverse relationship between the number of parameters and the learning performance. To evaluate this claim we performed three tests: 1) true gradient descent on the identity cost function (A set to the identity matrix) as a benchmark, 2) WP on the identity cost function and 3) WP on 150 randomly generated cost functions (each component drawn from a Gaussian distribution), all of the form given in Equation (12), and for values of N between 2 and 10. For each trial w was initially set to be 1 (a vector of ones). 
As can be seen in Figure 1a, both the SNR and the reduction in cost after running WP for 100 iterations decrease monotonically as the number of parameters N increases. The fact that this occurs in the case of randomly generated cost functions demonstrates that this effect is not related to the simple form of the identity cost function, but is in fact related to the number of dimensions.\n\nFigure 1: Two comparisons of SNR and learning performance: (A) Relationship as dimension N is increased (Section 3.2). The curves are 15,000 averaged runs, each run 100 iterations. For randomly generated cost functions, 150 A matrices were tested. True gradient descent was run on the identity cost function. The SNR for each case was computed with Equation (11). (B) Relationship as the Gaussian is reshaped by changing variances for the case of a 2D anisotropic cost function (ratio of gradients in different directions is 5) (Section 4.1.1). The constraint \u03c3\u2081\u00b2 + \u03c3\u2082\u00b2 = 0.1 is imposed, while \u03c3\u2081\u00b2 is between 0 and .1. For each value of \u03c3\u2081, 15,000 updates were averaged to produce the curve plotted. The plot shows that variances which increase the SNR also improve the performance of the update.\n\n3.3 SNR with parameter-independent additive noise\n\nIn many real-world systems, the evaluation of the cost J(w) is not deterministic, a property which can significantly affect learning performance. In this section we investigate how additive 'noise' in the function evaluation affects the analytical expression for the SNR. We demonstrate that for very high noise WP begins to behave like a random walk, and we find in the SNR the motivation for an improvement to the WP algorithm that will be examined in Section 4.2.\nConsider modifying the update seen in Equation (1) to allow for a parameter-independent additive noise term v and a more general baseline b(w), and again perform the Taylor expansion. 
Writing the update with these terms gives:\n\n\u0394w = \u2212\u03b7 ( J(w) + \u2211_i J_(w_i) z_i \u2212 b(w) + v ) z = \u2212\u03b7 ( \u2211_i J_(w_i) z_i + \u03be(w) ) z,   (13)\n\nwhere we have combined the terms J(w), b(w) and v into a single random variable \u03be(w). The new variable \u03be(w) has two important properties: its mean can be controlled through the value of b(w), and its distribution is independent of the parameters w; thus \u03be(w) is independent of all the z_i.\nWe now essentially repeat the calculation seen in Section 3.1, with the small modification of including the noise term. When we again assume independent z_i, each drawn from identical Gaussian distributions with standard deviation \u03c3, we obtain the expression:\n\nSNR = (\u03c6 + 3) / ( (N \u2212 1)(\u03c6 + 1) ),   \u03c6 = ( (J(w) \u2212 b(w))\u00b2 + \u03c3_v\u00b2 ) / ( \u03c3\u00b2 \u2016J_w\u2016\u00b2 ),   (14)\n\nwhere \u03c3_v is the standard deviation of the noise v and we have termed \u03c6 the error component. This expression depends upon the fact that the noise v is mean-zero and independent of the parameters, although as stated earlier, the assumption that v is mean-zero is not limiting. It is clear that in the limit of small \u03c6 the expression reduces to that seen in Equation (11), while in the limit of very large \u03c6 it becomes the expression for the SNR of a random walk (see Section 3.4). This expression makes it clear that minimizing \u03c6 is desirable, a result that suggests two things: (1) the optimal baseline (from the perspective of the SNR) is the value function (i.e. b*(w) = J(w)) and (2) higher values of \u03c3 are desirable, as they reduce \u03c6 by increasing the size of its denominator. 
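Because Equation (14) is a closed form, the effect of the baseline is easy to inspect numerically. In the sketch below (all numbers are illustrative assumptions), the cost-to-go baseline b = J(w) drives the error component to zero and maximizes the SNR, while heavy evaluation noise pushes the SNR toward the random-walk value 1/(N - 1):

```python
def snr_noisy(N, grad_norm, sigma, sigma_v, J, b):
    # Equation (14): SNR of WP with additive evaluation noise (std sigma_v)
    # and baseline b; phi is the 'error component'.
    phi = ((J - b) ** 2 + sigma_v ** 2) / (sigma ** 2 * grad_norm ** 2)
    return (phi + 3.0) / ((N - 1) * (phi + 1.0))

print(snr_noisy(N=5, grad_norm=1.0, sigma=0.1, sigma_v=0.0, J=2.0, b=2.0))    # 3/(N-1) = 0.75
print(snr_noisy(N=5, grad_norm=1.0, sigma=0.1, sigma_v=0.0, J=2.0, b=0.0))    # far lower: poor baseline
print(snr_noisy(N=5, grad_norm=1.0, sigma=0.1, sigma_v=100.0, J=2.0, b=2.0))  # ~ 1/(N-1) = 0.25
```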
However, there is clearly a limit on the size of \u03c3 due to higher-order terms in the Taylor expansion; very large \u03c3 will result in samples which do not represent the local gradient. Thus, in the case of noisy measurements, there is some optimal sampling distance that is as large as possible without resulting in poor sampling of the local gradient. This is explored in Section 4.2.1.\n\n3.4 SNR of a Random Walk\n\nBecause the update is squared in the SNR, only the degree to which the update is parallel to the true gradient is relevant, not its direction along that line. In the case of WP on a deterministic function, this is not a concern, as the update is always within 90\u00b0 of the gradient, and thus the parallel component is always in the correct direction. For a system with noise, however, components of the update parallel to the gradient can in fact be in the incorrect direction, contributing to the SNR even though they do not actually result in learning. This effect only becomes significant when the noise is particularly large, and reaches its extreme in the case of a true random walk (a strong bias in the \u201cwrong\u201d direction is in fact a good update with an incorrect sign). 
If one considers moving by a vector drawn from a multivariate Gaussian distribution without any correlation to the cost function, the SNR is particularly easy to compute. With \u0394w = z, the component parallel to the gradient is \u0394w_\u2225 = (1/\u2016J_w\u2016\u00b2) ( \u2211_i J_(w_i) z_i ) J_w, and the SNR takes the form:\n\nSNR = E[\u0394w_\u2225\u1d40 \u0394w_\u2225] / E[ (z \u2212 \u0394w_\u2225)\u1d40 (z \u2212 \u0394w_\u2225) ] = \u03c3\u00b2 / (N\u03c3\u00b2 \u2212 2\u03c3\u00b2 + \u03c3\u00b2) = 1/(N \u2212 1).   (15)\n\nAs was discussed in Section 3.3, this value of the SNR is the limiting case of very high measurement noise, a situation which will in fact produce a random walk.\n\n4 Applications of SNR\n\n4.1 Reshaping the Gaussian Distribution\n\nConsider a generalized WP algorithm, in which we allow each component z_i to be drawn independently from separate mean-zero distributions. Returning to the derivation in Section 3.1, we no longer assume each z_i is drawn from an identical distribution, but rather associate each with its own \u03c3_i (the vector of the \u03c3_i will be referred to as \u03c3). 
Removing this assumption results in the SNR:\n\nSNR(\u03c3, J_w) = [ \u2016J_w\u2016\u00b2 ( 2 \u2211_i J_(w_i)\u00b2 \u03c3_i\u2074 + \u2211_(i,j) J_(w_i)\u00b2 \u03c3_i\u00b2 \u03c3_j\u00b2 ) / ( 3 \u2211_(i,j) J_(w_i)\u00b2 \u03c3_i\u00b2 J_(w_j)\u00b2 \u03c3_j\u00b2 ) \u2212 1 ]\u207b\u00b9.   (16)\n\nAn important property of this SNR is that it depends only upon the direction of J_w and the relative magnitudes of the \u03c3_i (as opposed to parameters such as the learning rate \u03b7 and the absolute magnitudes \u2016\u03c3\u2016 and \u2016J_w\u2016).\n\n4.1.1 Effect of reshaping on performance\n\nWhile the absolute magnitudes of the variance and true gradient do not affect the SNR given in Equation (16), the relative magnitudes of the different \u03c3_i and their relationship to the true gradient can affect it. To study this property, we investigate a cost function with a significant degree of anisotropy. Using a cost function of the form given in Equation (12) and N = 2, we choose an A matrix whose first diagonal component is five times that of the second. We then investigate a series of possible variances \u03c3\u2081\u00b2 and \u03c3\u2082\u00b2, constrained such that their sum is a constant (\u03c3\u2081\u00b2 + \u03c3\u2082\u00b2 = C). We observe the performance of the first update (rather than the full trial), as the true gradient can vary significantly over the course of a trial, thereby having major effects on the SNR even as the variances are unchanged. As is clear in Figure 1b, as the SNR is increased through the choice of variances, the performance of this update is improved. The variation of the SNR is much more significant than the change in performance; however, this is not surprising, as the SNR is infinite if the update is exactly along the correct direction, while the improvement from this update will eventually saturate.\n\n4.1.2 Demonstration in simulation\n\nThe improved performance of the previous section suggests the possibility of a modification to the WP algorithm in which an estimate of the true gradient is used before each update to select new variances which are more likely to learn effectively. Changing the shape of the distribution does add a bias to the update direction, but the resulting biased update is in fact descending the natural gradient of the cost function. To make use of this opportunity, some knowledge of the likely gradient direction is required. This knowledge can be provided via a momentum estimate (an average of previous updates) or through an inaccurate model that is able to capture some facets of the geometry of the cost function. With this estimated gradient the expression given in Equation (16) can be optimized over the \u03c3_i numerically using a method such as Sequential Quadratic Programming (SQP). Care must be taken to avoid converging to very narrow distributions (e.g. by placing some small minimum noise on all parameters regardless of the optimization), but ultimately this reshaping of the Gaussian can provide real performance benefits.\n\nFigure 2: (a) The cart-pole system. The task is to apply a horizontal force f to the cart such that the pole swings to the vertical position. (b) The average of 200 curves showing reduction in cost versus trial number for both a symmetric Gaussian distribution and a distribution reshaped using the SNR. 
The blue shaded region marks the area within one standard deviation for the symmetric Gaussian distribution, the red region marks one standard deviation for the reshaped distribution, and the purple region is within one standard deviation of both. The reshaping began on the eighth trial to give time for the momentum-based gradient estimate to stabilize.\n\nTo demonstrate the improvement in convergence time this reshaping can achieve, weight perturbation was used to develop a barycentric feedback policy for the cart-pole swingup task, where the cost was defined as a weighted sum of the actuation used and the squared distance from the upright position. A gradient estimate was obtained through averaging previous updates, and SQP was used to optimize the SNR prior to each trial. Figure 2 demonstrates the superior performance of the reshaped distribution over a symmetric Gaussian using the same total variance (i.e. the traces of the covariance matrices for both distributions were the same).\n\n4.1.3 WP with Gaussian distributions follows the natural gradient\n\nThe natural gradient for a policy that samples with a mean-zero Gaussian of covariance \u03a3 may be written (see (Peters et al., 2003b)):\n\nJ\u0303_w = F\u207b\u00b9 J_w,   F = E_(\u03c0(\u03be;w)) [ (\u2202 log \u03c0(\u03be;w)/\u2202w_i) (\u2202 log \u03c0(\u03be;w)/\u2202w_j) ],   (17)\n\nwhere F is the Fisher information matrix, \u03c0 is the sampling distribution, and \u03be = w + z. Using the Gaussian form of the sampling distribution, F may be evaluated easily and becomes \u03a3\u207b\u00b9; thus:\n\nJ\u0303_w = \u03a3 J_w.   (18)\n\nThis is true for all mean-zero multivariate Gaussian distributions; thus the biased update, while no longer following the local point gradient, does follow the natural gradient. 
It is important to note that the natural gradient is a function of the shape of the sampling distribution, and it is because of this that all sampling distributions of this form can follow the natural gradient.\n\n4.2 Non-Gaussian Distributions\n\nThe analysis in Section 3.3 suggests that for a function with noisy measurements there is an optimal sampling distance which depends upon the local noise and gradient as well as the strength of higher-order terms in that region. For a two-dimensional cost function of the form given in Equation (12), Figure 3 shows the SNR's dependence upon the radius of the shell distribution (i.e. the magnitude of the sampling). For various levels of additive mean-zero noise, the SNR was computed for a distribution uniform in angle and fixed in its distance from the mean (this distance is the \u201csampling magnitude\u201d). The fact that there is a unique maximum for each case suggests the possibility of sampling only at that maximal magnitude, rather than over all magnitudes as is done with a Gaussian, and thus improving SNR and performance. While determining the exact magnitude of maximum SNR may be impractical, by choosing a distribution with uniformly distributed direction and a constant magnitude close to this optimal value, performance can be improved. This idea was tested on the benchmark proposed in (Riedmiller et al., 2007), where comparisons showed it was able to learn at rates similar to optimized RPROP from reasonable initial policies, and was capable of learning from a zero initial policy.\n\nFigure 3: SNR vs. update magnitude for a 2D quadratic cost function. Mean-zero measurement noise is included, with variances from 0 to .65. As the noise is increased, the sampling magnitude producing the maximum SNR is larger and the SNR achieved is lower. Note that the highest SNR achieved is for the smallest sampling magnitude with no noise, where it approaches the theoretical value (for 2D) of 3. Also note that for small sampling magnitudes and large noises the SNR approaches the random-walk value.\n\n4.2.1 Experimental Demonstration\n\nTo provide compelling evidence of improved performance, the shell distribution was implemented on a laboratory experimental system with actuator limitations and innate stochasticity. We have recently been exploring the use of PG algorithms in an incredibly difficult and exciting control domain - fluid dynamics - and as such applied the shell distribution to a fluid dynamical system. Specifically, we applied learning to a system used to study the dynamics of flapping flight via a wing submerged in water (see Figure 4 for a description of the system (Vandenberghe et al., 2004)). The task is to determine the vertical motion producing the highest ratio of rotational displacement to energy input. Model-free methods are particularly exciting in this domain because direct numerical simulation can take days (Shelley et al., 2005); in contrast, optimization on the experimental physical flapping wing can be done in real time, at the cost of dealing with noise in the evaluation of the cost function. Success here would be enabling for experimental fluid dynamics. We explored the idea of using a \u201cshell\u201d distribution to improve the performance of our PG learning on this real-world system.\n\nFigure 4: (a) Schematic of the flapping setup. The plate rotates freely about its vertical axis, while the vertical motion is prescribed by the learnt policy. This vertical motion is coupled with the plate's rotation through hydrodynamic effects. (b) 5 averaged runs on the flapping plate using Gaussian or shell distributions for sampling. 
The error bars represent one standard deviation in the performance of different runs at that trial.\n\nRepresenting the vertical position as a function of time with a 13-point periodic cubic spline, a 5D space was searched (points 1, 7 and 13 were fixed at zero, while points 2 and 8, 3 and 9, etc. were set to equal and opposite values determined by the control parameters). Beginning with a smoothed square wave, WP was run for 20 updates using shell distributions and Gaussians. Both forms of distributions were run 5 times and averaged to produce the curves in Figure 4. The sampling magnitude of the shell distribution was set to be the expected value of the length of a sample from the Gaussian distribution, while all other parameters were set equal. With optimized sampling, we acquired locally optimal policies in as little as 15 minutes, with repeated optimizations from very different initial policies converging to the same waveform. The result deepened our understanding of this fluid system and suggests promising applications to other fluid systems of similar complexity.\n\n5 Conclusion\n\nIn this paper we presented an expression for the SNR of PG algorithms, and looked in detail at the common case of WP. This expression gives us a quantitative means of evaluating the expected performance of a PG algorithm, although the SNR does not completely capture an algorithm's capacity to learn. SNR analysis revealed two distinct mechanisms for improving the WP update - perturbing different parameters with different distributions, and using non-Gaussian distributions. Both of them showed real improvement on highly nonlinear problems (the cart-pole example used a very high-dimensional policy), without knowledge of the problem's dynamics and structure. We believe that SNR-optimized PG algorithms show promise for many complicated, real-world applications.\n\n6 Acknowledgements\n\nThe authors thank Drs. 
Lionel Moret and Jun Zhang for valuable assistance with the heaving foil.\nReferences\nAmari, S. (1998). Natural gradient works ef\ufb01ciently in learning. Neural Computation, 10, 251\u2013276.\nBaxter, J., & Bartlett, P. (2001). In\ufb01nite-horizon policy-gradient estimation. Journal of Arti\ufb01cial\n\nIntelligence Research, 15, 319\u2013350.\n\nGreensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient\n\nestimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471\u20131530.\n\nJabri, M., & Flower, B. (1992). Weight perturbation: An optimal architecture and learning technique\nfor analog VLSI feedforward and recurrent multilayer networks. IEEE Trans. Neural Netw., 3,\n154\u2013157.\n\nKohl, N., & Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomo-\n\ntion. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).\nMeuleau, N., Peshkin, L., Kaelbling, L. P., & Kim, K.-E. (2000). Off-policy policy search. NIPS.\nPeters, J., Vijayakumar, S., & Schaal, S. (2003a). Policy gradient methods for robot control (Tech-\n\nnical Report CS-03-787). University of Southern California.\n\nPeters, J., Vijayakumar, S., & Schaal, S. (2003b). Reinforcement learning for humanoid robotics.\n\nProceedings of the Third IEEE-RAS International Conference on Humanoid Robots.\n\nRiedmiller, M., Peters, J., & Schaal, S. (2007). Evaluation of policy gradient methods and variants on\nthe cart-pole benchmark. Symposium on Approximate Dynamic Programming and Reinforcement\nLearning (pp. 254\u2013261).\n\nShelley, M., Vandenberghe, N., & Zhang, J. (2005). Heavy \ufb02ags undergo spontaneous oscillations\n\nin \ufb02owing water. Physical Review Letters, 94.\n\nTedrake, R., Zhang, T. W., & Seung, H. S. (2004). Stochastic policy gradient reinforcement learning\non a simple 3D biped. Proceedings of the IEEE International Conference on Intelligent Robots\nand Systems (IROS) (pp. 
2849\u20132854). Sendai, Japan.\n\nVandenberghe, N., Zhang, J., & Childress, S. (2004). Symmetry breaking leads to forward \ufb02apping\n\n\ufb02ight. Journal of Fluid Mechanics, 506, 147\u2013155.\n\nWilliams, J. L., III, J. W. F., & Willsky, A. S. (2006). Importance sampling actor-critic algorithms.\n\nProceedings of the 2006 American Control Conference.\n\nWilliams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforce-\n\nment learning. Machine Learning, 8, 229\u2013256.\n\n8\n\n\f", "award": [], "sourceid": 715, "authors": [{"given_name": "John", "family_name": "Roberts", "institution": null}, {"given_name": "Russ", "family_name": "Tedrake", "institution": null}]}