{"title": "Learning convex bounds for linear quadratic control policy synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 9561, "page_last": 9572, "abstract": "Learning to make decisions from observed data in dynamic environments remains a problem of fundamental importance in a number of fields, from artificial intelligence and robotics, to medicine and finance.\nThis paper concerns the problem of learning control policies for unknown linear dynamical systems so as to maximize a quadratic reward function.\nWe present a method to optimize the expected value of the reward over the posterior distribution of the unknown system parameters, given data.\nThe algorithm involves sequential convex programing, and enjoys reliable local convergence and robust stability guarantees.\nNumerical simulations and stabilization of a real-world inverted pendulum are used to demonstrate the approach, with strong performance and robustness properties observed in both.", "full_text": "Learning convex bounds for linear quadratic control policy synthesis\n\nThomas B. Sch\u00f6n\nDepartment of Information Technology\nUppsala University\nSweden\nthomas.schon@it.uu.se\n\nJack Umenberger\nDepartment of Information Technology\nUppsala University\nSweden\njack.umenberger@it.uu.se\n\nAbstract\n\nLearning to make decisions from observed data in dynamic environments remains a problem of fundamental importance in a number of fields, from artificial intelligence and robotics, to medicine and finance. This paper concerns the problem of learning control policies for unknown linear dynamical systems so as to maximize a quadratic reward function. We present a method to optimize the expected value of the reward over the posterior distribution of the unknown system parameters, given data. The algorithm involves sequential convex programing, and enjoys reliable local convergence and robust stability guarantees.
Numerical simulations and stabilization of a real-world inverted pendulum are used to demonstrate the approach, with strong performance and robustness properties observed in both.\n\n1 Introduction\n\nDecision making for dynamical systems in the presence of uncertainty is a problem of great prevalence and importance, as well as considerable difficulty, especially when knowledge of the dynamics is available only via limited observations of system behavior. In machine learning, the data-driven search for a control policy to maximize the expected reward attained by a stochastic dynamic process is known as reinforcement learning (RL) [45]. Despite remarkable recent success in games [32, 43], a major obstacle to the deployment of RL-based control on physical systems (e.g. robots and self-driving cars) is the issue of robustness, i.e., guaranteed safe and reliable operation. With the necessity of such guarantees widely acknowledged [2], so-called \u2018safe RL\u2019 remains an active area of research [21].\n\nThe problem of robust automatic decision making for uncertain dynamical systems has also been the subject of intense study in the area of robust control (RC) [57]. In RC, one works with a set of plausible models and seeks a control policy that is guaranteed to stabilize all models within the set. In addition, there is also a performance objective to optimize, i.e. a reward to be maximized, or equivalently, a cost to be minimized. Such cost functions are usually defined with reference to either a nominal model [20, 25] or the worst-case model [36] in the set.
RC has been extremely successful in a number of engineering applications [38]; however, as has been noted in, e.g., [48, 35], robustness may (understandably) come at the expense of performance, particularly for worst-case design.\n\nThe problem we address in this paper lies at the intersection of reinforcement learning and robust control, and can be summarized as follows: given observations from an unknown dynamical system, we seek a policy to optimize the expected cost (as in RL), subject to certain robust stability guarantees (as in RC). Specifically, we focus our attention on control of linear time-invariant dynamical systems, subject to Gaussian disturbances, with the goal of minimizing a quadratic function penalizing state deviations and control action. When the system is known, this is the classical linear quadratic regulator (LQR), a.k.a. $\\mathcal{H}_2$, optimal control problem [8]. We are interested in the setting in which the system is unknown, and knowledge of the dynamics must be inferred from observed data.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nContributions and paper structure. The principal contribution of this paper is an algorithm to optimize the expected value of the linear quadratic regulator reward/cost function, where the expectation is w.r.t. the posterior distribution of unknown system parameters, given observed data; cf. Section 3 for a detailed problem formulation. Specifically, we construct a sequence of convex approximations (upper bounds) to the expected cost, that can be optimized via semidefinite programing [50]. The algorithm, developed in Section 4, invokes the majorize-minimization (MM) principle [29], and consequently enjoys reliable convergence to local optima. An important part of our contribution lies in guarantees on the robust stability properties of the resulting control policies, cf. Section 4.3.
We demonstrate the proposed method via two experimental case studies: i) the benchmark problem on simulated systems considered in [17, 48], and ii) stabilization of a real-world inverted pendulum. Strong performance and robustness properties are observed in both. Moving forward, from a machine learning perspective this work contributes to the growing body of research concerned with ensuring robustness in RL, cf. Section 2. From a control perspective, this work appropriates cost functions more commonly found in RL (namely, expected reward) to an RC setting, with the objective of reducing conservatism of the resulting robust control policies.\n\n2 Related work\n\nIncorporating various notions of \u2018robustness\u2019 into RL has long been an area of active research [21]. In so-called \u2018safe RL\u2019, one seeks to respect certain safety constraints during exploration and/or policy optimization, for example, avoiding undesirable regions of the state-action space [22, 1]. A related problem is addressed in \u2018risk-sensitive RL\u2019, in which the search for a policy takes both the expected value and variance of the reward into account [31, 19]. Recently, there has been an increased interest in notions of robustness more commonly considered in control theory, chiefly stability [35, 3]. Of particular relevance is the work of [4], which employs Lyapunov theory [27] to verify stability of learned policies. Like the present paper, [4] adopts a Bayesian framework; however, [4] makes use of Gaussian processes [39] to model the uncertain nonlinear dynamics, which are assumed to be deterministic. A major difference between [4] and our work is the cost function; in the former the policy is selected by optimizing for worst-case performance, whereas we optimize the expected cost.\n\nRobustness of data-driven control has also been the focus of a recently developed family of methods referred to as \u2018coarse-ID control\u2019, cf.
[47, 17, 7, 44], in which finite-data bounds on the accuracy of the least squares estimator are combined with modern robust control tools, such as system level synthesis [55]. Coarse-ID builds upon so-called \u2018$\\mathcal{H}_\\infty$ identification\u2019 methods for learning models of dynamical systems, along with error bounds that are compatible with robust synthesis methods [26, 14, 13]. $\\mathcal{H}_\\infty$ identification assumes an adversarial (i.e. worst-case) disturbance model, whereas Coarse-ID is applicable to probabilistic models, such as those considered in the present paper. Of particular relevance to the present paper is [17], which provides sample complexity bounds on the performance of robust control synthesis for the infinite horizon LQR problem, when the true system is not known. Such bounds necessarily consider the worst-case model, given the observed data, whereas we are concerned with expected cost over the posterior distribution of models.\n\nThis approach of controller synthesis w.r.t. distributions over models has much in common with the field of probabilistic robust control [11, 46]. Early work in this area applied statistical learning theory [53] to randomized algorithms for feasibility analysis and policy design, cf. e.g., [51, 52]. Of particular relevance to the present paper is the so-called \u2018scenario approach\u2019 to control: robustness requirements lead to semi-infinite convex programs, which are approximated by sampling a finite number of constraints, cf. e.g., [9, 10]. A key focus of the scenario approach is bounding sample complexity (i.e., the number of sampled constraints required to ensure some probability of feasibility), without resorting to statistical learning theory, so as to reduce conservatism.\n\nIn closing, we briefly mention the so-called \u2018Riemann-Stieltjes\u2019 class of optimal control problems, for uncertain continuous-time dynamical systems, cf. e.g., [41, 40].
Such problems often arise in aerospace applications (e.g. satellite control) where the objective is to design an open-loop control signal (e.g. for an orbital maneuver) rather than a feedback policy.\n\n3 Problem formulation\n\nIn this section we describe in detail the specific problem that we address in this paper. The following notation is used: $\\mathbb{S}^n$ denotes the set of $n \\times n$ symmetric matrices; $\\mathbb{S}^n_+$ ($\\mathbb{S}^n_{++}$) denotes the cone of positive semidefinite (positive definite) matrices. $A \\succeq B$ denotes $A - B \\in \\mathbb{S}^n_+$, similarly for $\\succ$ and $\\mathbb{S}^n_{++}$. The trace of $A$ is denoted $\\operatorname{tr} A$. The transpose of $A$ is denoted $A'$. $|a|_Q^2$ is shorthand for $a'Qa$. The convex hull of set $\\Theta$ is denoted $\\operatorname{conv}\\Theta$. The set of Schur stable matrices is denoted $\\mathcal{S}$.\n\nDynamics, reward function and policies. We are concerned with control of discrete-time linear time-invariant dynamical systems of the form\n\n$$x_{t+1} = A x_t + B u_t + w_t, \\quad w_t \\sim \\mathcal{N}(0, \\Pi), \\tag{1}$$\n\nwhere $x_t \\in \\mathbb{R}^{n_x}$, $u_t \\in \\mathbb{R}^{n_u}$, and $w_t \\in \\mathbb{R}^{n_w}$ denote the state, input, and unobserved exogenous disturbance at time $t$, respectively. Let $\\theta := \\{A, B, \\Pi\\}$. Our objective is to design a feedback control policy $u_t = \\phi(x_t)$ that minimizes the cost function $\\lim_{T \\to \\infty} \\frac{1}{T} \\sum_{t=0}^{T} \\mathbb{E}[x_t' Q x_t + u_t' R u_t]$, where $x_t$ evolves according to (1), and $Q \\succeq 0$ and $R \\succ 0$ are user defined weight matrices. A number of different parametrizations of the policy $\\phi$ have been considered in the literature, from neural networks (popular in RL, e.g., [4]) to causal (typically linear) dynamical systems (common in RC, e.g., [36]). In this paper, we will restrict our attention to static-gain policies of the form $u_t = K x_t$, where $K \\in \\mathbb{R}^{n_u \\times n_x}$ is constant. As noted in [17], controller synthesis and implementation is simpler (and more computationally efficient) for such policies.
When the parameters of the true system, denoted $\\theta_{\\mathrm{tr}} := \\{A_{\\mathrm{tr}}, B_{\\mathrm{tr}}, \\Pi_{\\mathrm{tr}}\\}$, are known this is the infinite horizon LQR problem, the optimal solution of which is well-known [5]. We assume that $\\theta_{\\mathrm{tr}}$ is unknown; rather, our knowledge of the dynamics must be inferred from observed sequences of inputs and states.\n\nObserved data. We adopt the data-driven setup used in [17], and assume that $\\mathcal{D} := \\{x^r_{0:T}, u^r_{0:T}\\}_{r=1}^N$ where $x^r_{0:T} = \\{x^r_t\\}_{t=0}^T$ is the observed state sequence attained by evolving the true system for $T$ time steps, starting from an arbitrary $x^r_0$ and driven by arbitrary input $u^r_{0:T} = \\{u^r_t\\}_{t=0}^T$. Each of these $N$ independent experiments is referred to as a rollout. We perform parameter inference in the offline/batch setting; i.e., all data $\\mathcal{D}$ is assumed to be available at the time of controller synthesis.\n\nOptimization objective. Given observed data and, possibly, prior knowledge of the system, we then have the posterior distribution over the model parameters, denoted $\\pi(\\theta) := p(A, B, \\Pi | \\mathcal{D})$, in place of the true parameters $\\theta_{\\mathrm{tr}}$. The function that we seek to minimize is the expected cost w.r.t. the posterior distribution, i.e.,\n\n$$\\lim_{T \\to \\infty} \\frac{1}{T} \\sum_{t=0}^{T} \\mathbb{E}\\left[x_t' Q x_t + u_t' R u_t \\mid x_{t+1} = A x_t + B u_t + w_t, \\ w_t \\sim \\mathcal{N}(0, \\Pi), \\ \\{A, B, \\Pi\\} \\sim \\pi(\\theta)\\right]. \\tag{2}$$\n\nIn practice, the support of $\\pi$ almost surely contains $\\{A, B\\}$ that are unstabilizable, which implies that (2) is infinite. Consequently, we shall consider averages over confidence regions w.r.t. $\\pi$. For convenience, let us denote the infinite horizon LQR cost, for given system parameters $\\theta$, by\n\n$$J(K|\\theta) := \\lim_{t \\to \\infty} \\mathbb{E}\\left[x_t'(Q + K'RK)x_t \\mid x_{t+1} = (A + BK)x_t + w_t, \\ w_t \\sim \\mathcal{N}(0, \\Pi)\\right] \\tag{3a}$$\n\n$$= \\begin{cases} \\operatorname{tr} X\\Pi \\text{ with } X = (A + BK)'X(A + BK) + Q + K'RK, & A + BK \\in \\mathcal{S}, \\\\ \\infty, & \\text{otherwise}, \\end{cases} \\tag{3b}$$\n\nwhere the second equality follows from standard Gramian calculations, and $\\mathcal{S}$ denotes the set of Schur stable matrices.
As an alternative to (2) we may consider a cost function like $J^c(K) := \\int_{\\Theta^c} J(K|\\theta)\\pi(\\theta)\\,d\\theta$, where $\\Theta^c$ denotes a $c\\%$ confidence region of the parameter space w.r.t. the posterior $\\pi$. Though better suited to optimization than (2), which is almost surely infinite, this integral cannot be evaluated in closed form, due to the complexity of $J(\\cdot|\\theta)$ w.r.t. $\\theta$. Furthermore, there is still no guarantee that $\\Theta^c$ contains only stabilizable models. To circumvent both of these issues, we propose the following Monte Carlo (MC) approximation of $J^c(K)$,\n\n$$J^c_M(K) := \\frac{1}{M} \\sum_{i=1}^{M} J(K|\\theta_i), \\quad \\theta_i \\sim \\Theta^c \\cap \\mathcal{M}, \\quad i = 1, \\dots, M, \\tag{4}$$\n\nwhere $M$ is the number of samples used, and $\\mathcal{M}$ denotes the set of stabilizable $\\{A, B\\}$. Note that (4) is not a true MC approximation of $J^c(K)$ as only stabilizable samples $\\{A_i, B_i\\} \\in \\mathcal{M}$ are used.\n\nPosterior distribution. Given data $\\mathcal{D}$, the parameter posterior distribution is given by Bayes\u2019 rule:\n\n$$\\pi(\\theta) := p(\\theta|\\mathcal{D}) = \\frac{1}{p(\\mathcal{D})} p(\\mathcal{D}|\\theta)p(\\theta) \\propto p(\\theta) \\prod_{r=1}^{N} \\prod_{t=1}^{T} p(x^r_t | x^r_{t-1}, u^r_{t-1}, \\theta) =: \\bar{\\pi}(\\theta), \\tag{5}$$\n\nwhere $p(\\theta)$ denotes our prior belief on $\\theta$, $p(x^r_t | x^r_{t-1}, u^r_{t-1}, \\theta) = \\mathcal{N}(A x^r_{t-1} + B u^r_{t-1}, \\Pi)$, and $\\bar{\\pi} = p(\\mathcal{D})\\pi$ denotes the unnormalized posterior. To sample from $\\pi$, we can distinguish between two different cases. First, consider the case when $\\Pi_{\\mathrm{tr}}$ is known or can be reliably estimated independently of $\\{A, B\\}$. This is the setting in, e.g., [17]. In this case, the likelihood can be equivalently expressed as a Gaussian distribution over $\\{A, B\\}$. Then, when the prior $p(A, B)$ is uniform (i.e. non-informative) or Gaussian (self-conjugate), the posterior $p(A, B|\\Pi_{\\mathrm{tr}}, \\mathcal{D})$ is also Gaussian, cf. Appendix A.1.1.\n\nSecond, consider the general case in which $\\Pi_{\\mathrm{tr}}$, along with $\\{A, B\\}$, is unknown.
In this setting, one can select from a number of methods adapted for Bayesian inference in dynamical systems, such as Metropolis-Hastings [33], Hamiltonian Monte Carlo [15], and Gibbs sampling [16, 56]. When one places a non-informative prior on $\\Pi$ (e.g., $p(\\Pi) \\propto \\det(\\Pi)^{-\\frac{n_x+1}{2}}$), each iteration of a Gibbs sampler targeting $\\pi$ requires sampling from either a Gaussian or an inverse Wishart distribution, for which reliable numerical methods exist; cf. Appendix A.1.2. In both of these cases we can sample from $\\pi$ and evaluate $\\bar{\\pi}$ point-wise. To draw $\\theta_i \\sim \\Theta^c \\cap \\mathcal{M}$, as in (4), we can first draw a large number of samples from $\\pi$, discard the $(100 - c)\\%$ of samples with the lowest unnormalized posterior values, and then further discard any samples that happen to be unstabilizable. For convenience, we define $\\tilde{\\Theta}^c_M := \\{\\{\\theta_i\\}_{i=1}^M : \\theta_i \\sim \\Theta^c \\cap \\mathcal{M}, \\ i = 1, \\dots, M\\}$, which should be interpreted as a set of $M$ realizations of this procedure for sampling $\\theta_i \\sim \\Theta^c \\cap \\mathcal{M}$.\n\nSummary. We seek the solution of the optimization problem $\\min_K J^c_M(K)$ for $K \\in \\mathbb{R}^{n_u \\times n_x}$.\n\n4 Solution via semidefinite programing\n\nIn this section we present the principal contribution of this paper: a method for solving $\\min_K J^c_M(K)$ via convex (semidefinite) programing (SDP). It is convenient to consider an equivalent representation\n\n$$\\min_{K, \\ \\{X_i\\}_{i=1}^M \\in \\mathbb{S}^{n_x}_{++}} \\ \\frac{1}{M} \\sum_{i=1}^{M} \\operatorname{tr} X_i \\Pi_i, \\tag{6a}$$\n\n$$\\text{s.t.} \\quad X_i \\succeq (A_i + B_iK)'X_i(A_i + B_iK) + Q + K'RK, \\quad \\{A_i, B_i, \\Pi_i\\} \\in \\tilde{\\Theta}^c_M, \\tag{6b}$$\n\nwhere the Comparison Lemma [34, Lecture 2] has been used to replace the equality in (3b) with the inequality in (6b). We introduce the notation $\\mathbb{S}^n_\\epsilon := \\{S \\in \\mathbb{S}^n : S \\succeq \\epsilon I, \\ S \\preceq \\mu I\\}$, where $\\epsilon$ and $\\mu$ are arbitrarily small and large positive constants, respectively.
$\\mathbb{S}^n_\\epsilon$ serves as a compact approximation of $\\mathbb{S}^n_{++}$, suitable for use with SDP solvers, i.e., $S \\in \\mathbb{S}^n_\\epsilon \\implies S \\in \\mathbb{S}^n_{++}$.\n\n4.1 Common Lyapunov relaxation\n\nThe principal challenge in solving (6) is that the constraint (6b) is not jointly convex in $K$ and $X_i$. The usual approach to circumventing this nonconvexity is to first apply the Schur complement to (6b), and then conjugate by the matrix $\\operatorname{diag}(X_i^{-1}, I, I, I)$, which leads to the equivalent constraint\n\n$$\\begin{bmatrix} X_i^{-1} & X_i^{-1}(A_i + B_iK)' & X_i^{-1}Q^{1/2} & X_i^{-1}K' \\\\ (A_i + B_iK)X_i^{-1} & X_i^{-1} & 0 & 0 \\\\ Q^{1/2}X_i^{-1} & 0 & I & 0 \\\\ KX_i^{-1} & 0 & 0 & R^{-1} \\end{bmatrix} \\succeq 0. \\tag{7}$$\n\nWith the change of variables $Y_i = X_i^{-1}$ and $L_i = KX_i^{-1}$, (7) becomes a linear matrix inequality (LMI) in $Y_i$ and $L_i$. This approach is effective when $M = 1$ (i.e. we have a single nominal system, as in standard LQR). However, when $M > 1$ we cannot introduce a new $Y_i$ for each $X_i^{-1}$, as we lose uniqueness of the controller $K$ in $L_i = KX_i^{-1}$, i.e., in general $L_iY_i^{-1} \\neq L_jY_j^{-1}$ for $i \\neq j$. One strategy (prevalent in robust control, e.g., [17, \u00a7C]) is to employ a \u2018common Lyapunov function\u2019, i.e., $Y = X_i^{-1}$ for all $i = 1, \\dots, M$. This gives the following convex relaxation (upper bound) of problem (6),\n\n$$\\min_{L, \\ Y \\in \\mathbb{S}^{n_x}_\\epsilon, \\ \\{Z_i\\}_{i=1}^M \\in \\mathbb{S}^{n_x}} \\ \\frac{1}{M} \\sum_{i=1}^{M} \\operatorname{tr} Z_i, \\tag{8a}$$\n\n$$\\text{s.t.} \\quad \\begin{bmatrix} Z_i & G_i' \\\\ G_i & Y \\end{bmatrix} \\succeq 0, \\quad \\begin{bmatrix} Y & YA_i' + L'B_i' & YQ^{1/2} & L' \\\\ A_iY + B_iL & Y & 0 & 0 \\\\ Q^{1/2}Y & 0 & I & 0 \\\\ L & 0 & 0 & R^{-1} \\end{bmatrix} \\succeq 0, \\quad \\theta_i \\in \\tilde{\\Theta}^c_M, \\tag{8b}$$\n\nwhere $L = KY$ (so that the controller is recovered as $K = LY^{-1}$), $G_i$ denotes the Cholesky factorization of $\\Pi_i$, i.e., $\\Pi_i = G_iG_i'$, and $\\{Z_i\\}_{i=1}^M$ are slack variables used to encode the cost (6a) with the change of variables, i.e.,\n\n$$\\min_Y \\operatorname{tr} Y^{-1}\\Pi_i \\le \\min_{Y, Z_i} \\operatorname{tr} Z_i \\ \\text{s.t.} \\ Z_i \\succeq G_i'Y^{-1}G_i \\iff \\min_{Y, Z_i} \\operatorname{tr} Z_i \\ \\text{s.t.} \\ \\begin{bmatrix} Z_i & G_i' \\\\ G_i & Y \\end{bmatrix} \\succeq 0.$$\n\nThe approximation in (8) is highly conservative, which motivates the iterative local optimization method presented in Section 4.2. Nevertheless, (8) provides a principled way (i.e., a one-shot convex program) to initialize the iterative search method derived in Section 4.2.\n\n4.2 Iterative improvement by sequential semidefinite programing\n\nTo develop this iterative search method first consider an equivalent representation of $J(K|\\theta_i)$,\n\n$$J(K|\\theta_i) = \\min_{X_i \\in \\mathbb{S}^{n_x}_\\epsilon} \\operatorname{tr} X_i\\Pi_i \\tag{9a}$$\n\n$$\\text{s.t.} \\quad \\begin{bmatrix} X_i - Q & (A_i + B_iK)' & K' \\\\ A_i + B_iK & X_i^{-1} & 0 \\\\ K & 0 & R^{-1} \\end{bmatrix} \\succeq 0, \\tag{9b}$$\n\nrecalling that $\\theta_i = \\{A_i, B_i, \\Pi_i\\}$. This representation highlights the nonconvexity of $J(K|\\theta_i)$ due to the $X_i^{-1}$ term, which was addressed (in the usual way) by a change of variables in Section 4.1. In this section, we will instead replace $X_i^{-1}$ with a linear approximation and prove that this leads to a tight convex upper bound. Given $S \\in \\mathbb{S}^n_{++}$, let $T(S, S_0)$ denote the first order (i.e. linear) Taylor series approximation of $S^{-1}$ about some nominal $S_0 \\in \\mathbb{S}^n_{++}$, i.e., $T(S, S_0) := S_0^{-1} + \\left.\\frac{\\partial S^{-1}}{\\partial S}\\right|_{S=S_0}(S - S_0) = S_0^{-1} - S_0^{-1}(S - S_0)S_0^{-1}$. We now define the function\n\n$$\\hat{J}(K, \\bar{K}|\\theta_i) := \\min_{X_i \\in \\mathbb{S}^{n_x}_\\epsilon} \\operatorname{tr} X_i\\Pi_i \\tag{10a}$$\n\n$$\\text{s.t.} \\quad \\begin{bmatrix} X_i - Q & (A_i + B_iK)' & K' \\\\ A_i + B_iK & T(X_i, \\bar{X}_i) & 0 \\\\ K & 0 & R^{-1} \\end{bmatrix} \\succeq 0, \\tag{10b}$$\n\nwhere $\\bar{X}_i$ is any $X_i \\in \\mathbb{S}^{n_x}_\\epsilon$ that achieves the minimum in (9), with $K = \\bar{K}$ for some nominal $\\bar{K}$, i.e., $J(\\bar{K}|\\theta_i) = \\operatorname{tr} \\bar{X}_i\\Pi_i$. Analogously to (4), we define\n\n$$\\hat{J}^c_M(K, \\bar{K}) := \\frac{1}{M} \\sum_{\\theta_i \\in \\tilde{\\Theta}^c_M} \\hat{J}(K, \\bar{K}|\\theta_i). \\tag{11}$$\n\nWe now show that $\\hat{J}^c_M(K, \\bar{K})$ is a convex upper bound on $J^c_M(K)$, which is tight at $K = \\bar{K}$. The proof is given in A.2.2 and makes use of the following technical lemma (cf. A.2.1 for proof).\n\nLemma 4.1. $T(S, S_0) \\preceq S^{-1}$ for all $S, S_0 \\in \\mathbb{S}^n_{++}$, where $T(S, S_0)$ denotes the first-order Taylor series expansion of $S^{-1}$ about $S_0$.\n\nTheorem 4.1. Let $\\hat{J}^c_M(K, \\bar{K})$ be defined as in (11), with $\\bar{K}$ such that $J^c_M(\\bar{K})$ is finite. Then $\\hat{J}^c_M(K, \\bar{K})$ is a convex upper bound on $J^c_M(K)$, i.e., $\\hat{J}^c_M(K, \\bar{K}) \\ge J^c_M(K)$ $\\forall K$. Furthermore, the bound is \u2018tight\u2019 at $\\bar{K}$, i.e., $\\hat{J}^c_M(\\bar{K}, \\bar{K}) = J^c_M(\\bar{K})$.\n\nIterative algorithm. To improve upon the common Lyapunov solution given by (8), we can solve a sequence of convex optimization problems: $K^{(k+1)} = \\arg\\min_K \\hat{J}^c_M(K, K^{(k)})$, cf. Algorithm 1 for details. This procedure of optimizing tight surrogate functions in lieu of the actual objective function is an example of the \u2018majorize-minimization (MM) principle\u2019, a.k.a. optimization transfer [29]. MM algorithms enjoy good numerical robustness, and (with the exception of some pathological cases) reliable convergence to local minima [49]. Indeed, it is readily verified that\n\n$$J^c_M(K^{(k)}) = \\hat{J}^c_M(K^{(k)}, K^{(k)}) \\ge \\hat{J}^c_M(K^{(k+1)}, K^{(k)}) \\ge J^c_M(K^{(k+1)}),$$\n\nwhere equality follows from tightness of the bound, the first inequality holds because $K^{(k+1)}$ minimizes $\\hat{J}^c_M(\\cdot, K^{(k)})$, and the second inequality is due to the fact that $\\hat{J}^c_M(K, K^{(k)})$ is an upper bound.
This implies that $\\{J^c_M(K^{(k)})\\}_{k=1}^\\infty$ is a converging sequence.\n\nBefore proceeding, let us comment briefly on the computational complexity of the approach, which will be dominated by the convex program $\\min_K \\hat{J}^c_M(K, \\bar{K})$ in (11). The complexity of each iteration of an interior point method for solving this problem is $O(\\max\\{m^3, Mmn^3, Mm^2n^2\\})$, cf. e.g. [30, \u00a72], where $m = n_xn_u + Mn_x(n_x + 1)/2$ denotes the dimensionality of the decision variable, and $n = 2n_x + n_u$ denotes the dimension of the LMI in (10b). It has been observed that the number of iterations required for convergence grows slowly with problem dimension [50]. For computation times on numerical examples, refer to Table 2.\n\nAlgorithm 1 Optimization of $J^c_M(K)$ via semidefinite programing\n1: Input: observed data $\\mathcal{D}$, confidence $c$, LQR cost matrices $Q$ and $R$, number of particles in Monte Carlo approximation $M$, convergence tolerance $\\epsilon$.\n2: Generate $M$ samples from $\\Theta^c \\cap \\mathcal{M}$, i.e., $\\tilde{\\Theta}^c_M$, using the appropriate Bayesian inference method from Section 3.\n3: Solve (8). Let $K_{\\mathrm{cl}}$ denote the optimal solution of (8). Set $K^{(0)} \\leftarrow \\infty$, $K^{(1)} \\leftarrow K_{\\mathrm{cl}}$ and $k \\leftarrow 1$.\n4: while $|J^c_M(K^{(k)}) - J^c_M(K^{(k-1)})| > \\epsilon$ do\n5:   Solve $K^* = \\arg\\min_K \\hat{J}^c_M(K, K^{(k)})$. Set $K^{(k+1)} \\leftarrow K^*$ and $k \\leftarrow k + 1$.\n6: end while\n7: return $K^{(k)}$ as the control policy.\n\nRemark 4.1. This sequential SDP approach can be applied in other robust control settings, e.g., mixed $\\mathcal{H}_2/\\mathcal{H}_\\infty$ [20], to improve on the common Lyapunov solution, cf. Section 5.1 for an illustration.\n\n4.3 Robustness\n\nHitherto, we have considered the performance component of the robust control problem, namely minimization of the expected cost; we now address the robust stability requirement. It is desirable for the learned policy to stabilize every model in the confidence region $\\Theta^c$; in fact, this is necessary for the cost $J^c(K)$ to be finite.
Algorithm 1 ensures stability of each of the $M$ sampled systems from $\\tilde{\\Theta}^c_M$, which implies that the policy stabilizes the entire region as $M \\to \\infty$. However, we would like to be able to say something about robustness for finite $M$. To this end, we make two remarks. First, if closed-loop stability of each sampled model is verified with a common Lyapunov function, then the policy stabilizes the convex hull of the sampled systems:\n\nTheorem 4.2. Suppose there exists $K \\in \\mathbb{R}^{n_u \\times n_x}$ such that $(A_i + B_iK)'X(A_i + B_iK) - X \\prec 0$ for $X \\succ 0$ and all $\\{A_i, B_i\\} \\in \\Theta := \\{A_i, B_i\\}_{i=1}^N$. Then $(A + BK)'X(A + BK) - X \\prec 0$ for all $\\{A, B\\} \\in \\operatorname{conv}\\Theta$, where $\\operatorname{conv}\\Theta$ denotes the convex hull of $\\Theta$.\n\nThe proof of Theorem 4.2 is given in A.2.3. The conditions of Theorem 4.2 hold for the common Lyapunov approach in (8), and can be made to hold for Algorithm 1 by introducing an additional Lyapunov stability constraint (with common Lyapunov function) for each sampled system, at the expense of some conservatism. Second, we observe empirically that Algorithm 1 returns policies that very nearly stabilize the entire region $\\Theta^c$, despite a very modest number of samples $M$ relative to the dimension of the parameter space, cf. Section 5.1, in particular Figure 2. In principle, results from probabilistic robust control could be used to bound the number of samples required for such robustness properties, cf. e.g., [9, Theorem 1]; however, at least for the examples in this paper, such bounds appear to be quite conservative.
Furthermore, a number of recent papers have investigated sampling (or grid) based approaches to stability verification of control policies, e.g., [54, 4, 6]. Understanding why policies from Algorithm 1 generalize effectively to the entire region $\\Theta^c$, as well as clarifying connections to probabilistic robust control, are interesting topics for future research.\n\n5 Experimental results\n\n5.1 Numerical simulations using synthetic systems\n\nIn this section, we study the infinite horizon LQR problem specified by\n\n$$A_{\\mathrm{tr}} = \\operatorname{toeplitz}(a, a'), \\quad a = [1.01, 0.01, 0, \\dots, 0] \\in \\mathbb{R}^{n_x}, \\quad B_{\\mathrm{tr}} = I, \\quad \\Pi_{\\mathrm{tr}} = I, \\quad Q = 10^{-3}I, \\quad R = I,$$\n\nwhere $\\operatorname{toeplitz}(r, c)$ denotes the Toeplitz matrix with first row $r$ and first column $c$. This is the same problem studied in [17, \u00a76] (for $n_x = 3$), where it is noted that such dynamics naturally arise in consensus and distributed averaging problems. To obtain problem data $\\mathcal{D}$, each rollout involves simulating (1), with the true parameters, for $T = 6$ time steps, excited by $u_t \\sim \\mathcal{N}(0, I)$ with $x_0 = 0$. Note: to facilitate comparison with [17], we too shall assume that $\\Pi_{\\mathrm{tr}}$ is known. Furthermore, for all experiments $\\Theta^c$ will denote a 95% confidence region, as in [17]. We compare the following methods of control synthesis: existing methods: (i) nominal: standard LQR using the nominal model from least squares, i.e., $\\{A_{\\mathrm{ls}}, B_{\\mathrm{ls}}\\} := \\arg\\min_{A,B} \\sum_{r=1}^{N} \\sum_{t=1}^{T} |x^r_t - Ax^r_{t-1} - Bu^r_{t-1}|^2$; (ii) worst-case: optimize for worst-case model (95% confidence) subject to
robust stability constraints, i.e., the method of [17, \u00a75.2]; (iii) $\\mathcal{H}_2/\\mathcal{H}_\\infty$: enforce stability constraint from [17, \u00a75.2], but optimize performance for the nominal model $\\{A_{\\mathrm{ls}}, B_{\\mathrm{ls}}\\}$; proposed method(s): (iv) CL: the common Lyapunov relaxation of (8); (v) proposed: the method proposed in this paper, i.e., Algorithm 1; additional new methods: (vi) alternate-r: initialize with the $\\mathcal{H}_2/\\mathcal{H}_\\infty$ solution, and apply the iterative optimization method proposed in Section 4.2, cf. Remark 4.1; (vii) alternate-s: optimize for the nominal model $\\{A_{\\mathrm{ls}}, B_{\\mathrm{ls}}\\}$, enforce stability for the sampled systems in $\\tilde{\\Theta}^c_M$. Before proceeding, we wish to emphasize that the different control synthesis methods have different objectives; a lower cost does not mean that the associated method is \u2018better\u2019. This is particularly true for worst-case which seeks to optimize performance for the worst possible model so as to bound the cost on the true system.\n\nTo evaluate performance, we compare the cost of applying a learned policy $K$ to the true system $\\theta_{\\mathrm{tr}} = \\{A_{\\mathrm{tr}}, B_{\\mathrm{tr}}\\}$, to the optimal cost achieved by the optimal controller $K_{\\mathrm{lqr}}$ (designed using $\\theta_{\\mathrm{tr}}$), i.e., $J(K|\\theta_{\\mathrm{tr}})/J(K_{\\mathrm{lqr}}|\\theta_{\\mathrm{tr}})$. We refer to this as \u2018LQR suboptimality\u2019. In Figure 1, LQR suboptimality is shown as a function of the number of rollouts $N$, for $n_x = 3$. We make the following observations. Foremost, the methods that enforce stability \u2018stochastically\u2019 (i.e. point-wise), namely proposed and alternate-s, attain significantly lower costs than the methods that enforce stability \u2018robustly\u2019. Furthermore, in situations with very little data, e.g. $N = 5$, the robust control methods are usually unable to find a stabilizing controller, yet the proposed method finds a stabilizing controller in the majority of trials.
Finally, we note that the iterative procedure in proposed (and alternate-s) significantly improves on the common-Lyapunov relaxation CL; similarly, alternate-r consistently improves upon $\\mathcal{H}_2/\\mathcal{H}_\\infty$ (as expected).\n\nFigure 1: LQR suboptimality as a function of the number of rollouts (i.e. amount of training data). $\\infty$ suboptimality denotes cases in which the method was unable to find a stabilizing controller for the true system (including infeasibility of the optimization problem for policy synthesis), and the % denotes the frequency with which this occurred for the 50 experimental trials conducted.\n\nTable 1: Median % of unstable closed-loop models, with open-loop models sampled from a 95% confidence region of the posterior, for systems of varying dimension $n_x$; cf. Section 5.1 for details. 50 experiments were conducted, with $N = 50$. The policy synthesis optimization problems were always feasible, except for the worst-case method, which was infeasible in 46% of trials when $n_x = 12$. $\\mathcal{H}_2/\\mathcal{H}_\\infty$ and alternate-r have the same robustness guarantees as worst-case, and are omitted.\n\n$n_x$ | optimal | nominal | worst-case | CL | proposed | alternate-s\n3 | 61.6 | 28.75 | 0 | 0 | 0.10 | 1.35\n6 | 95.37 | 58.41 | 0 | 0 | 0.18 | 1.76\n9 | 99.6 | 81.9 | 0 | 0 | 0.24 | 1.40\n12 | 100 | 94.28 | 0 | 0 | 0.27 | 1.27\n\nTable 2: Mean computation times in seconds for control synthesis for systems of varying dimension $n_x$; cf. Section 5.1 for details.
50 experiments were conducted, with $N = 50$.\n\n$n_x$ | worst-case | $\\mathcal{H}_2/\\mathcal{H}_\\infty$ | CL | proposed | alternate-s\n3 | 0.159 | 1.91 | 0.605 | 20.3 | 4.56\n6 | 0.173 | 2.05 | 0.962 | 28.9 | 13.4\n9 | 0.208 | 2.51 | 1.79 | 48.1 | 27.1\n12 | 0.329 | 3.72 | 3.90 | 96.8 | 62.9\n\nIt is natural to ask whether the reduction in cost exhibited by proposed (and alternate-s) comes at the expense of robustness, namely, the ability to stabilize a large region of the parameter space. Empirical results suggest that this is not the case. To investigate this we sample 5000 fresh (i.e. not used for learning) models from $\\Theta^c \\cap \\mathcal{M}$ and check closed-loop stability of each; this is repeated for 50 independent experiments with varying $n_x$ and $N = 50$. The median percentage of models that were unstable in closed-loop is recorded in Table 1. We make two observations: (i) the proposed method exhibits strong robustness. Even for $n_x = 12$ (i.e., 288-dim parameter space), it stabilizes more than 99% of samples from the confidence region, with only $M = 100$ MC samples. (ii) when the robust methods (worst-case, $\\mathcal{H}_2/\\mathcal{H}_\\infty$, alternate-r) are feasible, the resulting policies were found to stabilize 100% of samples; however, for $n_x = 12$, the methods were infeasible almost half the time, whereas proposed always returned a policy. Further evidence is provided in Figure 2, which plots robustness and performance as a function of the number of MC samples, $M$. For $n_x = 3$ and $M \\ge 800$, the entire confidence region is stabilized with very high probability, suggesting that $M \\to \\infty$ is not required for robust stability in practice.\n\nFigure 2: (left) Median % of unstable closed-loop models, with open-loop models sampled from a 95% confidence region of the posterior, for $n_x = 3$ and $N = 15$, as a function of the number of samples $M$ used in the MC approximation (4).
(right) LQR suboptimality as a function of M. 50\nexperiments were conducted, cf. Section 5.1 for details. Shaded regions cover the interquartile range.\n\nFigure 3: (a) (Median) LQR cost on real-world pendulum experiment, as a function of the number\nof rollouts. \u221e cost denotes controllers that resulted in instability during testing. n/a denotes cases\nin which the synthesis problem was infeasible. Five trials were conducted to evaluate the cost of\neach policy. The shaded region spans from the minimum to maximum cost. Note: for this particular\nexperiment, the nominal models from least squares happened to yield stabilizing controllers that\noffered good performance. Such behavior is not to be expected in general, cf. Figure 1. (b) Pendulum\nangle and control signal recorded after 10 rollouts.\n\n5.2 Real-world experiments on a rotary inverted pendulum\n\nWe now apply the proposed algorithm to the classic problem of stabilizing a (rotary) inverted\npendulum, on real (i.e. physical) hardware (Quanser QUBE 2), cf. A.3 for details. To generate\ntraining data, the superposition of a non-stabilizing control signal and a sinusoid of random frequency\nis applied to the rotary arm motor while the pendulum is inverted. The arm and pendulum angles\n(along with velocities) are sampled at 100 Hz until the pendulum angle exceeds 20\u00b0, which takes\nno more than 5 seconds. This constitutes one rollout. We applied the worst-case, H2/H\u221e, and\nproposed methods to optimize the LQ cost with Q = I and R = 1. To generate bounds \u03b5_A \u2265\n\u2016A_ls \u2212 A_tr\u2016_2 and \u03b5_B \u2265 \u2016B_ls \u2212 B_tr\u2016_2 for worst-case and H2/H\u221e, we sample {A_i, B_i}, i = 1, ..., 5000, from a\n95% confidence region of the posterior, using Gibbs sampling, and take \u03b5_A = max_i \u2016A_ls \u2212 A_i\u2016_2\nand \u03b5_B = max_i \u2016B_ls \u2212 B_i\u2016_2. The proposed method used 100 such samples for synthesis. 
We also\napplied the least squares policy iteration method [28], but none of the policies could stabilize the\npendulum given the amount of training data. Results are presented in Figure 3, from which we make\nthe following remarks. First, as in Section 5.1, the proposed method achieves high performance\n(low cost), especially in the low data regime where the magnitude of system uncertainty renders the\nother synthesis methods infeasible. Insight into this performance is offered by Figure 3(b), which\nindicates that policies from the proposed method stabilize the pendulum with control signals of\nsmaller magnitude. Second, performance of the proposed method converges after very few rollouts.\nData-inefficiency is a well-known limitation of RL; understanding and mitigating this inefficiency is\nthe subject of considerable research [17, 48, 18, 42, 23, 24]. Investigating the role that a Bayesian\napproach to uncertainty quantification plays in the apparent sample-efficiency of the proposed method\nis an interesting topic for further inquiry.\n\nAcknowledgments\n\nThis research was financially supported by the Swedish Foundation for Strategic Research (SSF)\nvia the project ASSEMBLE (contract number: RIT15-0012) and via the projects Learning flexible\nmodels for nonlinear dynamics (contract number: 2017-03807) and NewLEADS - New Directions\nin Learning Dynamical Systems (contract number: 621-2016-06079), both funded by the Swedish\nResearch Council.\n\nReferences\n\n[1] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pages 1\u20138. ACM, 2005.\n\n[2] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man\u00e9. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.\n\n[3] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin. 
Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216\u20131226, 2013.\n\n[4] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[5] D. P. Bertsekas. Dynamic programming and optimal control. Belmont, MA: Athena Scientific, 1995.\n\n[6] R. Bobiti and M. Lazar. A sampling approach to finding Lyapunov functions for nonlinear discrete-time systems. In Control Conference (ECC), 2016 European, pages 561\u2013566. IEEE, 2016.\n\n[7] R. Boczar, N. Matni, and B. Recht. Finite-data performance guarantees for the output-feedback control of an unknown system. arXiv preprint arXiv:1803.09186, 2018.\n\n[8] J. B. Burl. Linear optimal control: H2 and H\u221e methods. Addison-Wesley Longman Publishing Co., Inc., 1998.\n\n[9] G. C. Calafiore and M. C. Campi. Uncertain convex programs: randomized solutions and confidence levels. Mathematical Programming, 102(1):25\u201346, 2005.\n\n[10] G. C. Calafiore and M. C. Campi. The scenario approach to robust control design. IEEE Transactions on Automatic Control, 51(5):742\u2013753, 2006.\n\n[11] G. C. Calafiore, F. Dabbene, and R. Tempo. Research on probabilistic methods for control system design. Automatica, 47(7):1279\u20131293, 2011.\n\n[12] C. K. Carter and R. Kohn. On Gibbs sampling for state space models. Biometrika, 81(3):541\u2013553, 1994.\n\n[13] J. Chen and G. Gu. Control-oriented system identification: an H\u221e approach, volume 19. Wiley-Interscience, 2000.\n\n[14] J. Chen and C. N. Nett. The Caratheodory-Fejer problem and H2/H\u221e identification: a time domain approach. In Proceedings of the 32nd IEEE Conference on Decision and Control (CDC), pages 68\u201373, 1993.\n\n[15] S. H. Cheung and J. L. Beck. 
Bayesian model updating using hybrid Monte Carlo simulation with application to structural dynamic models with many uncertain parameters. Journal of engineering mechanics, 135(4):243\u2013255, 2009.\n\n[16] J. Ching, M. Muto, and J. Beck. Bayesian linear structural model updating using Gibbs sampler with modal data. In Proceedings of the 9th International Conference on Structural Safety and Reliability, pages 2609\u20132616. Millpress, 2005.\n\n[17] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688, 2017.\n\n[18] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465\u2013472, 2011.\n\n[19] S. Depeweg, J. M. Hern\u00e1ndez-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning. arXiv preprint arXiv:1710.07283, 2017.\n\n[20] J. Doyle, K. Zhou, K. Glover, and B. Bodenheimer. Mixed H2 and H\u221e performance objectives. II. Optimal control. IEEE Transactions on Automatic Control, 39(8):1575\u20131587, 1994.\n\n[21] J. Garc\u00eda and F. Fern\u00e1ndez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437\u20131480, 2015.\n\n[22] P. Geibel and F. Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81\u2013108, 2005.\n\n[23] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. In International Conference on Learning Representations, 2017.\n\n[24] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. 
Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829\u20132838, 2016.\n\n[25] W. M. Haddad, D. S. Bernstein, and D. Mustafa. Mixed-norm H2/H\u221e regulation and estimation: The discrete-time case. Systems & Control Letters, 16(4):235\u2013247, 1991.\n\n[26] A. J. Helmicki, C. A. Jacobson, and C. N. Nett. Control oriented system identification: a worst-case/deterministic approach in H\u221e. IEEE Transactions on Automatic control, 36(10):1163\u20131176, 1991.\n\n[27] H. K. Khalil. Nonlinear systems. Prentice-Hall, New Jersey, 2(5):5\u20131, 1996.\n\n[28] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of machine learning research, 4(Dec):1107\u20131149, 2003.\n\n[29] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of computational and graphical statistics, 9(1):1\u201320, 2000.\n\n[30] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM J. Matrix Analysis and Applications, 31(3):1235\u20131256, 2009.\n\n[31] O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. Machine learning, 49(2-3):267\u2013290, 2002.\n\n[32] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\n[33] B. Ninness and S. Henriksen. Bayesian system identification via Markov chain Monte Carlo techniques. Automatica, 46(1):40\u201351, 2010.\n\n[34] Oliveira, Maur\u00edcio de. MAE 280B: Linear Control Design, 2009.\n\n[35] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot. Robust Constrained Learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547\u20131563, 2016.\n\n[36] I. R. Petersen, M. R. 
James, and P. Dupuis. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45(3):398\u2013412, 2000.\n\n[37] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, 2012.\n\n[38] I. Postlethwaite, M. C. Turner, and G. Herrmann. Robust control applications. Annual Reviews in Control, 31(1):27\u201339, 2007.\n\n[39] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63\u201371. Springer, 2004.\n\n[40] I. M. Ross, R. J. Proulx, and M. Karpenko. Unscented optimal control for space flight. In 24th International Symposium on Space Flight Dynamics, 2014.\n\n[41] I. M. Ross, R. J. Proulx, M. Karpenko, and Q. Gong. Riemann\u2013Stieltjes optimal control problems for uncertain dynamic systems. Journal of Guidance, Control, and Dynamics, 38(7):1251\u20131263, 2015.\n\n[42] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.\n\n[43] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484\u2013489, 2016.\n\n[44] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.\n\n[45] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.\n\n[46] R. Tempo, G. Calafiore, and F. Dabbene. Randomized algorithms for analysis and control of uncertain systems: with applications. Springer Science & Business Media, 2012.\n\n[47] S. Tu, R. Boczar, A. Packard, and B. Recht. 
Non-asymptotic analysis of robust control from coarse-grained identification. arXiv preprint arXiv:1707.04791, 2017.\n\n[48] S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642, 2017.\n\n[49] F. Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, pages 831\u2013840, 2005.\n\n[50] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM review, 38(1):49\u201395, 1996.\n\n[51] M. Vidyasagar. Statistical learning theory and randomized algorithms for control. IEEE Control Systems, 18(6):69\u201385, 1998.\n\n[52] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515\u20131528, 2001.\n\n[53] M. Vidyasagar. A theory of learning and generalization. Springer-Verlag New York, Inc., 2002.\n\n[54] J. Vinogradska, B. Bischoff, D. Nguyen-Tuong, A. Romer, H. Schmidt, and J. Peters. Stability of controllers for Gaussian process forward models. In International Conference on Machine Learning, pages 545\u2013554, 2016.\n\n[55] Y.-S. Wang, N. Matni, and J. C. Doyle. A system level approach to controller synthesis. arXiv preprint arXiv:1610.04815, 2016.\n\n[56] A. Wills, T. B. Sch\u00f6n, F. Lindsten, and B. Ninness. Estimation of linear systems using a Gibbs sampler. IFAC Proceedings Volumes, 45(16):203\u2013208, 2012.\n\n[57] K. Zhou and J. C. Doyle. Essentials of robust control, volume 104. Prentice Hall Upper Saddle River, NJ, 1998.\n", "award": [], "sourceid": 5839, "authors": [{"given_name": "Jack", "family_name": "Umenberger", "institution": "Uppsala University"}, {"given_name": "Thomas", "family_name": "Sch\u00f6n", "institution": "Uppsala University"}]}