{"title": "Adaptive Batch Size for Safe Policy Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 3591, "page_last": 3600, "abstract": "Policy gradient methods are among the best Reinforcement Learning (RL) techniques to solve complex control problems. In real-world RL applications, it is common to have a good initial policy whose performance needs to be improved and it may not be acceptable to try bad policies during the learning process. Although several methods for choosing the step size exist, research paid less attention to determine the batch size, that is the number of samples used to estimate the gradient direction for each update of the policy parameters. In this paper, we propose a set of methods to jointly optimize the step and the batch sizes that guarantee (with high probability) to improve the policy performance after each update. Besides providing theoretical guarantees, we show numerical simulations to analyse the behaviour of our methods.", "full_text": "Adaptive Batch Size for Safe Policy Gradients\n\nMatteo Papini\n\nDEIB\n\nPolitecnico di Milano, Italy\n\nMatteo Pirotta\nSequeL Team\n\nInria Lille, France\n\nMarcello Restelli\n\nDEIB\n\nPolitecnico di Milano, Italy\n\nmatteo.papini@polimi.it\n\nmatteo.pirotta@inria.fr\n\nmarcello.restelli@polimi.it\n\nAbstract\n\nPolicy gradient methods are among the best Reinforcement Learning (RL) tech-\nniques to solve complex control problems. In real-world RL applications, it is\ncommon to have a good initial policy whose performance needs to be improved and\nit may not be acceptable to try bad policies during the learning process. Although\nseveral methods for choosing the step size exist, research paid less attention to\ndetermine the batch size, that is the number of samples used to estimate the gradient\ndirection for each update of the policy parameters. 
In this paper, we propose a set of methods to jointly optimize the step and the batch sizes that guarantee (with high probability) an improvement of the policy performance after each update. Besides providing theoretical guarantees, we show numerical simulations to analyse the behaviour of our methods.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

In many real-world sequential decision-making problems (e.g., industrial robotics, natural resource management, smart grids), engineers have developed automatic control policies, usually derived from modelling approaches. The performance of such policies strictly depends on the model accuracy, which for some tasks (e.g., financial applications) may be quite poor. Furthermore, even when accurate models are available and good control policies are obtained, their performance may degrade over time due to the non-stationary dynamics of the problem, thus requiring human intervention to adjust the policy parameters (think about equipment wear in smart manufacturing). In such scenarios, Reinforcement Learning (RL) techniques represent an interesting solution for optimizing control policies online and for limiting the performance loss caused by unpredictable environment changes, thus improving the autonomy of the control system.

In recent years, several RL studies [1, 2, 3, 4, 5, 6, 7] have shown that policy-search methods can effectively be employed to solve complex control tasks (e.g., robotic ones), thanks to their ability to handle high-dimensional continuous problems, to cope with uncertain and partial observations of the state, and to incorporate prior knowledge about the problem through the definition of a proper policy model whose parameters need to be optimized (refer to [8, 9] for recent surveys). This last property is particularly appealing when the reinforcement learning algorithm needs to operate online in scenarios where bad exploratory policies may damage the system. A proper design of the policy model may allow excluding such policies. On the other hand, in order to speed up the learning process, most RL methods need to explore the policy space by executing policies that may be worse than the initial one. This is not acceptable in many relevant applications. Under this perspective, we are interested in developing RL methods that are (with high probability) monotonically improving.

Inspired by the conservative policy iteration approach [10], new advances have recently been made in the field of approximate policy iteration algorithms [11, 12], obtaining methods that can learn faster while still giving statistical guarantees of improvement after each policy update [13, 14]. These methods are usually referred to as conservative, monotonically improving, or safe (as we do in this paper). These ideas have also been exploited to derive novel safe policy-search approaches [15, 16, 17, 18, 19] that have obtained significant empirical results. In particular, policy-gradient methods are among the most commonly used RL techniques to solve complex high-dimensional tasks. Up to now, works on safe policy gradients [15, 16] have focused mainly on the choice of the step size, a parameter that significantly affects the speed and stability of gradient methods. By adopting small enough step sizes, one can limit oscillations and avoid worsening updates, but the consequent reduction of the learning rate is paid for in the long term as poor overall performance. On the other hand, as we will show in this paper, there is another parameter that plays an important role in the definition of safe policy gradient approaches: the batch size (i.e., the number of samples used to estimate the gradient). So far, the optimization of the batch size has not been considered in the RL literature.
The batch size, besides conditioning the optimal step size, has a non-negligible impact on the speed of improvement when samples are trajectories performed on the actual system. In the present paper, we investigate the relationship between the step size and the batch size, showing an interesting duality. Focusing on Gaussian policies, we make a first attempt at developing practical methods aimed at achieving the best average performance in the long term, by jointly optimizing both meta-parameters. After providing some background in Section 2, in Section 3 we improve an existing adaptive step-size method [15]. Building on this result, in Section 4 we derive the main result on the batch size, proposing jointly adaptive methods. Finally, in Section 5 we empirically analyse the behaviour of the proposed methods on a simple simulated control task.

2 Preliminaries

A discrete-time continuous Markov decision process (MDP) is a tuple $\langle S, A, P, R, \gamma, \mu \rangle$, where $S$ is the continuous state space, $A$ is the continuous action space, $P$ is a Markovian transition model where $P(s'|s,a)$ defines the transition density between states $s$ and $s'$ under action $a$, $R : S \times A \to [-R, R]$ is the reward function, such that $R(s,a)$ is the expected immediate reward for the state-action pair $(s,a)$ and $R$ is the maximum absolute reward value, $\gamma \in [0,1)$ is the discount factor for future rewards, and $\mu$ is the initial state distribution. A policy is defined as a density distribution $\pi(\cdot|s)$ that, for each state $s$, specifies the density distribution over the action space $A$. We consider infinite-horizon problems where future rewards are exponentially discounted with $\gamma$.
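As a concrete illustration of the discounted objective, $J$ can be estimated by Monte-Carlo averaging of discounted returns over sampled trajectories. The sketch below uses our own naming and assumes trajectories are given as plain reward sequences:

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """Discounted return of one trajectory: sum_t gamma^t * r_t."""
    return sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))

def estimate_performance(trajectories, gamma=0.9):
    """Monte-Carlo estimate of J: average discounted return over trajectories."""
    return np.mean([discounted_return(r, gamma) for r in trajectories])
```

This is only the plain sample-mean estimator; the estimators used later in the paper (REINFORCE, G(PO)MDP) additionally weight the rewards by likelihood-ratio score terms.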
For each state-action pair $(s,a)$, the utility of taking action $a$ in state $s$ and then following a stationary policy $\pi$ is defined as:

$$Q^\pi(s,a) = R(s,a) + \gamma \int_S P(s'|s,a) \int_A \pi(a'|s') Q^\pi(s',a') \, da' \, ds'.$$

Policies can be ranked by their expected discounted reward starting from the initial state distribution $\mu$:

$$J^\pi_\mu = \int_S \mu(s) \int_A \pi(a|s) Q^\pi(s,a) \, da \, ds = \frac{1}{1-\gamma} \int_S d^\pi_\mu(s) \int_A \pi(a|s) R(s,a) \, da \, ds,$$

where $d^\pi_\mu(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi, \mu)$ is the $\gamma$-discounted future state distribution for a starting state distribution $\mu$ [2]. In the following, we will often refer to $J^\pi_\mu$ as the performance of policy $\pi$. Solving an MDP means finding a policy $\pi^*$ maximizing $J^\pi_\mu$. We consider the problem of finding a policy that maximizes the expected discounted reward over a class of parametrized policies $\Pi_\theta = \{\pi_\theta : \theta \in \mathbb{R}^m\}$. A particular class of parametrized policies is the Gaussian policy model with standard deviation $\sigma$ and mean linear in the state features $\phi(\cdot)$:

$$\pi(a|s,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{a - \theta^T\phi(s)}{\sigma}\right)^2\right),$$

which is a common choice for MDPs with continuous actions. The exact gradient of the expected discounted reward w.r.t. the policy parameters [2] is:

$$\nabla_\theta J_\mu(\theta) = \frac{1}{1-\gamma} \int_S d^{\pi_\theta}_\mu(s) \int_A \nabla_\theta \pi(a|s,\theta) Q^{\pi_\theta}(s,a) \, da \, ds.$$

In most commonly used policy gradient methods, the policy parameters are updated by following the direction of the gradient of the expected discounted reward: $\theta' = \theta + \alpha \nabla_\theta J_\mu(\theta)$, where $\alpha \geq 0$ is a scalar step size. In the following we will denote by $\|\nabla_\theta J_\mu(\theta)\|_p$ the $L_p$-norm of the policy gradient.

3 Non-Scalar Step Size for Gaussian Policies

Before starting to optimize the batch size for the gradient estimation, in this section we extend the results in [15] to the case of a non-scalar step size, showing that, for the Gaussian policy model, this extension guarantees a larger performance improvement than the one obtained in [15]. Furthermore, this result significantly simplifies the closed-form solutions obtained for the optimization of the batch size described in the following sections. In Section 3.1 we stick to the theoretical setting in which the gradient is known exactly, while in Section 3.2 we take the estimation error into account.

3.1 Exact Framework

The idea is to have a separate adaptive step size $\alpha_i$ for each component $\theta_i$ of $\theta$. For notational convenience, we define a non-scalar step size as a diagonal matrix $\Lambda = \mathrm{diag}(\alpha_1, \alpha_2, \ldots, \alpha_m)$ with $\alpha_i \geq 0$ for $i = 1, \ldots, m$. The policy parameters can be updated as:

$$\theta' = \theta + \Lambda \nabla_\theta J_\mu(\theta).$$

Note that the direction of the update can differ from the gradient direction. Since the $\alpha_i$ are non-negative, the absolute angular difference is never more than $\pi/2$. The traditional scalar step-size update can be seen as a special case where $\Lambda = \alpha I$.

Assumption 3.1.
State features are uniformly bounded: $|\phi_i(s)| \leq M_\phi$, $\forall s \in S$, $\forall i = 1, \ldots, m$.

By adapting Theorem 4.3 in [15] to the new parameter update, we obtain a lower bound on the policy performance improvement:

Lemma 3.2. For any initial state distribution $\mu$ and any pair of stationary Gaussian policies $\pi_\theta \sim \mathcal{N}(\theta^T\phi(s), \sigma^2)$ and $\pi_{\theta'} \sim \mathcal{N}(\theta'^T\phi(s), \sigma^2)$, so that $\theta' = \theta + \Lambda \nabla_\theta J_\mu(\theta)$, and under Assumption 3.1, the difference between the performance of $\pi_{\theta'}$ and that of $\pi_\theta$ can be bounded from below as follows:

$$J_\mu(\theta') - J_\mu(\theta) \geq \nabla_\theta J_\mu(\theta)^T \Lambda \nabla_\theta J_\mu(\theta) - \frac{\|\Lambda \nabla_\theta J_\mu(\theta)\|_1^2 M_\phi^2}{(1-\gamma)\sigma^2} \left( \frac{1}{\sqrt{2\pi}\,\sigma} \int_S d^{\pi_\theta}_\mu(s) \int_A Q^{\pi_\theta}(s,a) \, da \, ds + \frac{\gamma \|Q^{\pi_\theta}\|_\infty}{2(1-\gamma)} \right),$$

where $\|Q^{\pi_\theta}\|_\infty$ is the supremum norm of the Q-function: $\|Q^{\pi_\theta}\|_\infty = \sup_{s \in S, a \in A} Q^{\pi_\theta}(s,a)$.

The above bound requires us to compute the Q-function explicitly, but this is often not possible in real-world applications. We now consider a simplified (although less tight) version of the bound that does not have this requirement, which is an adaptation of Corollary 5.1 in [15]:

Theorem 3.3. For any initial state distribution $\mu$ and any pair of stationary Gaussian policies $\pi_\theta \sim \mathcal{N}(\theta^T\phi(s), \sigma^2)$ and $\pi_{\theta'} \sim \mathcal{N}(\theta'^T\phi(s), \sigma^2)$, so that $\theta' = \theta + \Lambda \nabla_\theta J_\mu(\theta)$, and under Assumption 3.1, the difference between the performance of $\pi_{\theta'}$ and that of $\pi_\theta$ can be bounded from below as follows:

$$J_\mu(\theta') - J_\mu(\theta) \geq \nabla_\theta J_\mu(\theta)^T \Lambda \nabla_\theta J_\mu(\theta) - c \, \|\Lambda \nabla_\theta J_\mu(\theta)\|_1^2,$$

where $c = \frac{R M_\phi^2}{(1-\gamma)^2 \sigma^2}\left(\frac{|A|}{\sqrt{2\pi}\,\sigma} + \frac{\gamma}{2(1-\gamma)}\right)$ and $|A|$ is the volume of the action space.

We then find the step size $\Lambda^*$ that maximizes this lower bound under the natural constraint $\alpha_i \geq 0$, $\forall i = 1, \ldots, m$. The derivation is not trivial and is provided in Appendix A.

Corollary 3.4. The lower bound of Theorem 3.3 is maximized by the following non-scalar step size:

$$\alpha^*_k = \begin{cases} \frac{1}{2c} & \text{if } k = \min\{\arg\max_i |\nabla_{\theta_i} J_\mu(\theta)|\}, \\ 0 & \text{otherwise}, \end{cases}$$

which guarantees the following performance improvement: $J_\mu(\theta') - J_\mu(\theta) \geq \frac{\|\nabla_\theta J_\mu(\theta)\|_\infty^2}{4c}$.

Note that the update induced by the obtained $\Lambda^*$ corresponds to employing a constant, scalar step size to update just the parameter corresponding to the largest absolute gradient component. This method is known in the literature as greedy coordinate descent. Convergence of this algorithm to a local optimum is guaranteed for small step sizes, as shown in [20]. Note also that the way in which the index is selected in case of multiple maxima (here min) is arbitrary; see the proof of Corollary 3.4 for details.
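The resulting update rule is simple enough to state in a few lines. This is a minimal sketch with our own naming, assuming the gradient vector and the constant $c$ are computed elsewhere:

```python
import numpy as np

def greedy_coordinate_update(theta, grad, c):
    """Corollary 3.4: apply a step of 1/(2c) to the single coordinate with
    the largest absolute gradient component; all other coordinates are frozen.
    Ties are broken by the lowest index (np.argmax returns the first maximizer)."""
    k = int(np.argmax(np.abs(grad)))
    theta = theta.copy()
    theta[k] += grad[k] / (2.0 * c)
    return theta
```

For example, with `grad = [0.1, -0.5, 0.2]` only the second coordinate moves, in the direction of its own gradient component.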
We now propose an intuitive explanation of our result: the employed performance lower bound ultimately derives from Corollary 3.6 in [13]. From the original bound, one can easily see that the positive part accounts for the average advantage of the new policy over the old one, while the negative part penalizes large parameter updates, which may result in overshooting. Updating just the parameter corresponding to the largest policy gradient component represents an intuitive trade-off between these two objectives. We now show that this result represents an improvement w.r.t. the adaptive scalar step size proposed in [15] for the current setting:

Corollary 3.5. Under identical hypotheses, the performance improvement guaranteed by Corollary 3.4 is never less than the one guaranteed by Corollary 5.1 in [15], i.e.:

$$\frac{\|\nabla_\theta J_\mu(\theta)\|_\infty^2}{4c} \geq \frac{\|\nabla_\theta J_\mu(\theta)\|_2^4}{4c\,\|\nabla_\theta J_\mu(\theta)\|_1^2}.$$

This corollary derives from the trivial norm inequality $\|\nabla_\theta J_\mu(\theta)\|_\infty \|\nabla_\theta J_\mu(\theta)\|_1 \geq \|\nabla_\theta J_\mu(\theta)\|_2^2$.

3.2 Approximate Framework

We now turn to the more realistic case in which the policy gradient, $\nabla_\theta J_\mu(\theta)$, is not known and has to be estimated from a finite number of trajectory samples. A performance improvement can still be guaranteed with high probability. To adapt the result of Theorem 3.3 to the stochastic gradient case, we need both a lower bound on the policy gradient estimate $\hat{\nabla}_\theta J_\mu(\theta)$:

$$\underline{\hat{\nabla}}_\theta J_\mu(\theta) = \max(|\hat{\nabla}_\theta J_\mu(\theta)| - \epsilon, 0)$$

(where the maximum is component-wise) and an upper bound:

$$\overline{\hat{\nabla}}_\theta J_\mu(\theta) = |\hat{\nabla}_\theta J_\mu(\theta)| + \epsilon,$$

where $\epsilon = [\epsilon_1, \ldots, \epsilon_m]$, and $\epsilon_i$ is an upper bound on the approximation error of $\nabla_{\theta_i} J_\mu(\theta)$ with probability at least $1-\delta$. We can now state the following:

Theorem 3.6. Under the same assumptions of Theorem 3.3, and provided that a policy gradient estimate $\hat{\nabla}_\theta J_\mu(\theta)$ is available, so that $P(|\nabla_{\theta_i} J_\mu(\theta) - \hat{\nabla}_{\theta_i} J_\mu(\theta)| \geq \epsilon_i) \leq \delta$, $\forall i = 1, \ldots, m$, the difference between the performance of $\pi_{\theta'}$ and that of $\pi_\theta$ can be bounded from below with probability at least $(1-\delta)^m$ as follows:

$$J_\mu(\theta') - J_\mu(\theta) \geq \underline{\hat{\nabla}}_\theta J_\mu(\theta)^T \Lambda \, \underline{\hat{\nabla}}_\theta J_\mu(\theta) - c \, \big\|\Lambda \, \overline{\hat{\nabla}}_\theta J_\mu(\theta)\big\|_1^2,$$

where $c$ is defined as in Theorem 3.3.

To derive the optimal step size, we first restrict our analysis to the case in which $\epsilon_1 = \epsilon_2 = \ldots = \epsilon_m \triangleq \epsilon$. We call this common estimation error $\epsilon$. This comes naturally in the following section, where we use concentration bounds to give an expression for $\epsilon$. However, it is always possible to define a common error by $\epsilon = \max_i \epsilon_i$. We then need the following assumption:

Assumption 3.7. At least one component of the policy gradient estimate is, in absolute value, no less than the approximation error:

$$\big\|\hat{\nabla}_\theta J_\mu(\theta)\big\|_\infty \geq \epsilon.$$

The violation of the above assumption can be used as a stopping condition, since it prevents us from guaranteeing any performance improvement. We can now state the following (the derivation is similar to the one of Corollary 3.5 and is, again, left to Appendix A):

Corollary 3.8. The performance lower bound of Theorem 3.6 is maximized under Assumption 3.7 by the following non-scalar step size:

$$\alpha^*_k = \begin{cases} \frac{\left(\|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty - \epsilon\right)^2}{2c\left(\|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty + \epsilon\right)^2} & \text{if } k = \min\{\arg\max_i |\hat{\nabla}_{\theta_i} J_\mu(\theta)|\}, \\ 0 & \text{otherwise}, \end{cases}$$

which guarantees with probability $(1-\delta)^m$ a performance improvement

$$J_\mu(\theta') - J_\mu(\theta) \geq \frac{\left(\|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty - \epsilon\right)^4}{4c\left(\|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty + \epsilon\right)^2}.$$

4 Adaptive Batch Size

In this section we jointly optimize the step size for parameter updates and the batch size for policy gradient estimation, taking into consideration the cost of collecting sample trajectories. We call $N$ the batch size, i.e., the number of trajectories sampled to compute the policy gradient estimate $\hat{\nabla}_\theta J_\mu(\theta)$ at each parameter update. We define the following cost-sensitive performance improvement measure:

Definition 4.1. The cost-sensitive performance improvement measure $\Upsilon_\delta$ is defined as:

$$\Upsilon_\delta(\Lambda, N) := \frac{B_\delta(\Lambda, N)}{N},$$

where $B_\delta$ is the high-probability lower bound on performance improvement given in Theorem 3.6.

The rationale behind this choice of performance measure is to maximize the performance improvement per sample trajectory. Using larger batch sizes leads to more accurate policy updates, but the gained performance improvement is spread over a larger number of trials.
This is particularly relevant in real-world online applications, where the collection of more samples with a sub-optimal policy affects the overall performance and must be justified by a greater improvement in the learned policy. By defining $\Upsilon_\delta$ in this way, we can control the improvement provided, on average, by each collected sample. We now show how to jointly select the step size $\Lambda$ and the batch size $N$ so as to maximize $\Upsilon_\delta$. Notice that the dependence of $B_\delta$ on $N$ is entirely through $\epsilon$, whose expression depends on which concentration bound is considered. We first restrict our analysis to concentration bounds that allow us to express $\epsilon$ as follows:

Assumption 4.1. The per-component policy gradient estimation error made by averaging over $N$ sample trajectories can be bounded with probability at least $1-\delta$ by:

$$\epsilon(N) = \frac{d_\delta}{\sqrt{N}},$$

where $d_\delta$ is a constant w.r.t. $N$.

This class of inequalities includes well-known concentration bounds such as Chebyshev's and Hoeffding's. Under Assumption 4.1, $\Upsilon_\delta$ can be optimized in closed form:

Theorem 4.2. Under the hypotheses of Theorem 3.3 and Assumption 4.1, the cost-sensitive performance improvement measure $\Upsilon_\delta$, as defined in Definition 4.1, is maximized by the following step size and batch size:

$$\alpha^*_k = \begin{cases} \frac{13 - 3\sqrt{17}}{4c} & \text{if } k = \min\{\arg\max_i |\hat{\nabla}_{\theta_i} J_\mu(\theta)|\}, \\ 0 & \text{otherwise}, \end{cases} \qquad N^* = \left\lceil \frac{(13 + 3\sqrt{17})\, d_\delta^2}{2\, \|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty^2} \right\rceil,$$

where $c = \frac{R M_\phi^2}{(1-\gamma)^2 \sigma^2}\left(\frac{|A|}{\sqrt{2\pi}\,\sigma} + \frac{\gamma}{2(1-\gamma)}\right)$. This choice guarantees with probability $(1-\delta)^m$ a performance improvement of:

$$J_\mu(\theta') - J_\mu(\theta) \geq \frac{393 - 95\sqrt{17}}{8} \cdot \frac{\|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty^2}{4c} \geq 0.16 \, \frac{\|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty^2}{4c}.$$

Notice that, under Assumption 4.1, Assumption 3.7 can be restated as $N \geq \frac{d_\delta^2}{\|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty^2}$, which is always verified by the proposed $N^*$. This means that the adaptive batch size never allows an estimation error larger than the gradient estimate. Another peculiarity of this result is that the step size is constant, in the sense that its value does not depend on the gradient estimate. This can be explained in terms of a duality between step size and batch size: in other conservative adaptive step-size approaches, such as the one proposed with Corollary 3.8, the step size is kept small to counteract policy updates that are too off due to bad gradient estimates. When the batch size is also made adaptive, a sufficient number of sample trajectories can be taken to keep the policy update on track even with a constant-valued step size. Note that, in this formulation, the batch size selection process is always one step behind the gradient estimation.
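Under Assumption 4.1, these closed-form choices are a one-line computation. The sketch below uses our own naming and assumes the constant $c$ and the bound constant $d_\delta$ are computed elsewhere:

```python
import numpy as np

SQRT17 = np.sqrt(17.0)

def optimal_step_and_batch(grad_est, c, d_delta):
    """Theorem 4.2 (sketch): constant step size for the dominant coordinate
    and closed-form batch size N*, assuming eps(N) = d_delta / sqrt(N)."""
    alpha = (13.0 - 3.0 * SQRT17) / (4.0 * c)  # applied only to the max-|gradient| coordinate
    g_inf = np.max(np.abs(grad_est))           # sup-norm of the gradient estimate
    n_star = int(np.ceil((13.0 + 3.0 * SQRT17) * d_delta ** 2 / (2.0 * g_inf ** 2)))
    return alpha, n_star
```

Note that `alpha` does not depend on the gradient estimate at all, which is exactly the step/batch duality discussed above.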
A possible synchronous implementation is to update $N^*$ each time a trajectory is performed, using all the data collected since the last learning step. As soon as the number of trajectories performed in the current learning iteration is larger than or equal to $N^*$, a new learning step is performed.

We now consider some concentration bounds in more detail: we provide the values for $d_\delta$, while the full expressions for $N^*$ can be found in Appendix B.

4.1 Chebyshev's Bound

By using the sample-mean version of Chebyshev's bound we obtain:

$$d_\delta = \sqrt{\frac{\mathrm{Var}[\tilde{\nabla}_{\theta_i} J_\mu(\theta)]}{\delta}},$$

where $\tilde{\nabla}_{\theta_i} J_\mu(\theta)$ is the policy gradient approximator (from a single sample trajectory). The main advantage of this bound is that it does not make any assumption on the range of the gradient sample. The variance of the sample can be upper bounded in the case of the REINFORCE [1] and the G(PO)MDP [3]/PGT [2] gradient estimators by using results from [21], already adapted for similar purposes in [15]. The G(PO)MDP/PGT estimator suffers from a smaller variance compared with REINFORCE, and the variance bound is indeed tighter.

4.2 Hoeffding's Bound

By using Hoeffding's bound we obtain:

$$d_\delta = R \sqrt{\frac{\log(2/\delta)}{2}},$$

where $R$ is the range of the gradient approximator, i.e., $|\mathrm{supp}(\tilde{\nabla}_{\theta_i} J_\mu(\theta))|$. For the class of policies we are considering, i.e., Gaussian with mean linear in the features, under some assumptions the range can be upper bounded as follows:

Lemma 4.3. For any Gaussian policy $\pi_\theta \sim \mathcal{N}(\theta^T\phi(s), \sigma^2)$, assuming that the action space is bounded ($\forall a \in A$, $|a| \leq A$) and the policy gradient is estimated on trajectories of length $H$, the range $R$ of the policy gradient sample $\tilde{\nabla}_{\theta_i} J_\mu(\theta)$ can be upper bounded $\forall i = 1, \ldots, m$ and $\forall \theta$ by

$$R \leq \frac{2 H M_\phi A R}{\sigma^2 (1-\gamma)}.$$

As we will show in Section 5, a more practical (even if less rigorous) solution consists in computing the range as the difference between the largest and the smallest gradient sample seen during learning.

4.3 Empirical Bernstein's Bound

Tighter concentration bounds allow for smaller batch sizes (which result in more frequent policy updates) and larger step sizes, thus speeding up the learning process and improving the long-term average performance. An empirical Bernstein bound from [22] allows us to use the sample variance instead of the variance bounds from [21] and to limit the impact of the gradient range. On the other hand, this bound does not satisfy Assumption 4.1, giving for the estimation error the following, more complex, expression:

$$\epsilon(N) = \frac{d_\delta}{\sqrt{N}} + \frac{f_\delta}{N}, \quad \text{where} \quad d_\delta = \sqrt{2 S_N \ln(3/\delta)}, \quad f_\delta = 3 R \ln(3/\delta),$$

and $S_N$ is the sample variance of the gradient approximator. No reasonably simple closed-form solution is available in this case, requiring a linear search for the batch size $N^*$ maximizing $\Upsilon_\delta$. By adapting Assumption 3.7 to this case, a starting point $N_0$ for this search can be provided:

$$N \geq \left( \frac{d_\delta + \sqrt{d_\delta^2 + 4 f_\delta \|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty}}{2 \|\hat{\nabla}_\theta J_\mu(\theta)\|_\infty} \right)^2.$$

We also know that there is a unique maximum in $[N_0, +\infty)$ (see Appendix A for more details) and that $\Upsilon_\delta$ goes to 0 as $N$ goes to infinity. Hence, to find the optimal batch size, it is enough to start from $N_0$ and stop as soon as the value of the cost function $\Upsilon_\delta(\Lambda^*, N)$ begins to decrease.
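This linear search can be sketched in a few lines (our naming; `g_inf` stands for the sup-norm of the gradient estimate, and the objective uses the improvement bound of Corollary 3.8 with its optimal step size):

```python
import math

def bernstein_batch_search(g_inf, c, d_delta, f_delta):
    """Linear search for N* when eps(N) = d/sqrt(N) + f/N (empirical Bernstein).
    Starts from the N0 implied by Assumption 3.7 and stops at the first decrease
    of Upsilon(N) = B(N)/N, which has a unique maximum on [N0, +inf)."""
    def upsilon(n):
        eps = d_delta / math.sqrt(n) + f_delta / n
        if eps >= g_inf:  # Assumption 3.7 violated: no improvement can be guaranteed
            return -float("inf")
        return (g_inf - eps) ** 4 / (4.0 * c * (g_inf + eps) ** 2) / n

    n0 = math.ceil(((d_delta + math.sqrt(d_delta ** 2 + 4.0 * f_delta * g_inf))
                    / (2.0 * g_inf)) ** 2)
    n = max(n0, 1)
    while upsilon(n + 1) > upsilon(n):
        n += 1
    return n
```

As a sanity check, setting $f_\delta = 0$ recovers the Assumption 4.1 case, and the search then lands on the same $N^*$ as the closed form of Theorem 4.2.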
Furthermore, the optimal step size is no longer constant: it can be computed with the expression given in Corollary 3.8 by setting $\epsilon := \epsilon(N^*)$. As for Hoeffding's bound, the range $R$ can be upper bounded exactly or estimated from samples.

Table 1: Improvement rate of the policy updates for different policy standard deviations $\sigma$, fixed batch sizes $N$ and fixed step sizes $\alpha$, using the G(PO)MDP gradient estimator.

                 σ = 0.5                            σ = 1
α        N = 10000  N = 1000  N = 100     N = 10000  N = 1000  N = 100
1e-3     49.79%     52.85%    95.96%      50.4%      24.24%    37.4%
1e-4     51.41%     73.27%    100%        46.08%     100%      27.03%
1e-5     55.69%     81.88%    98.99%      39.04%     100%      99.9%
1e-6     58.44%     83.88%    100%        86.04%     100%      100%

Table 2: Average performance for different gradient estimators, statistical bounds and values of $\delta$. All results are averaged over 5 runs (95% confidence intervals are reported).

Estimator    Bound                          δ      Υ          Confidence interval
REINFORCE    Chebyshev                      0.95   -11.3266   [-11.3277; -11.3256]
REINFORCE    Chebyshev                      0.75   -11.4303   [-11.4308; -11.4297]
REINFORCE    Chebyshev                      0.5    -11.5947   [-11.5958; -11.5937]
G(PO)MDP     Chebyshev                      0.95   -10.6085   [-10.6087; -10.6083]
G(PO)MDP     Chebyshev                      0.75   -10.7141   [-10.7145; -10.7136]
G(PO)MDP     Chebyshev                      0.5    -10.9036   [-10.904; -10.9031]
G(PO)MDP     Chebyshev                      0.25   -11.2355   [-11.2363; -11.2346]
G(PO)MDP     Chebyshev                      0.05   -11.836    [-11.8368; -11.8352]
G(PO)MDP     Hoeffding                      0.95   -11.914    [-11.9143; -11.9136]
G(PO)MDP     Bernstein                      0.95   -10.2159   [-10.2162; -10.2155]
G(PO)MDP     Hoeffding (empirical range)    0.95   -9.8582    [-9.8589; -9.8574]
G(PO)MDP     Bernstein (empirical range)    0.95   -9.6623    [-9.6627; -9.6619]

5 Numerical Simulations

In this section, we test the proposed methods on the linear-quadratic Gaussian regulation (LQG) problem [23].
The LQG problem is defined by the transition model $s_{t+1} \sim \mathcal{N}(s_t + a_t, \sigma_0^2)$, Gaussian policy $a_t \sim \mathcal{N}(\theta \cdot s_t, \sigma^2)$ and reward $r_t = -0.5(s_t^2 + a_t^2)$. In all our simulations we use $\sigma_0 = 0$, since all the noise can be modelled on the agent's side without loss of generality. Both action and state variables are bounded to the interval $[-2, 2]$ and the initial state is drawn uniformly at random. We use this task as a testing ground because it is simple, all the constants involved in our bounds can be computed exactly, and the true optimal parameter $\theta^*$ is available as a reference. We use a discount factor $\gamma = 0.9$, which gives an optimal parameter $\theta^* \approx -0.59$, corresponding to an expected performance $J(\theta^*) \approx -13.21$. Coherently with the framework described in Section 1, we are interested both in the convergence speed and in the ratio of policy updates that do not result in a worsening of the expected performance, which we call the improvement ratio. First of all, we want to analyze how the choice of fixed step sizes and batch sizes may affect the improvement ratio, and how much it depends on the variability of the trajectories (which in this case is due to the variance of the policy). Table 1 shows the improvement ratio for two parameterizations ($\sigma = 0.5$ and $\sigma = 1$) when various constant step sizes and batch sizes are used, starting from $\theta = -0.55$ and stopping after a total of one million trajectories. As expected, small batch sizes combined with large step sizes lead to low improvement ratios. However, the effect is non-trivial and problem-dependent, justifying the need for an adaptive method.

We then proceed to test the methods described in Section 4. In the following simulations, we use $\sigma = 1$ and start from $\theta = 0$, stopping after a total of 30 million trajectories.
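For reference, the LQG dynamics described above can be simulated in a few lines; this is a sketch with our own naming (horizon and clipping bound are illustrative choices):

```python
import numpy as np

def lqg_trajectory(theta, sigma=1.0, gamma=0.9, horizon=20, bound=2.0, rng=None):
    """One rollout of the LQG task with sigma_0 = 0: a ~ N(theta*s, sigma^2),
    s' = clip(s + a), r = -0.5*(s^2 + a^2); returns the discounted return."""
    if rng is None:
        rng = np.random.default_rng()
    s = rng.uniform(-bound, bound)  # initial state drawn uniformly at random
    ret = 0.0
    for t in range(horizon):
        a = np.clip(rng.normal(theta * s, sigma), -bound, bound)
        ret += gamma ** t * (-0.5 * (s ** 2 + a ** 2))
        s = np.clip(s + a, -bound, bound)
    return ret
```

Averaging such returns over a batch (and weighting by score functions for REINFORCE or G(PO)MDP) yields the gradient estimates used throughout this section.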
Figure 1 shows the expected performance over sample trajectories for both the REINFORCE and G(PO)MDP gradient estimators, using Chebyshev's bound with different values of $\delta$. Expected performance is computed for each parameter update; data are then scaled to account for the different batch sizes. In general, REINFORCE performs worse than G(PO)MDP due to its larger variance (in both cases the proper optimal baseline from [23] was used), and larger values of $\delta$ (the probability with which worsening updates are allowed to take place) lead to better performance. Notice that an improvement ratio of 1 is achieved even with large values of $\delta$. This is due to the fact that the bounds used in the development of our method are not tight. Since the method is this conservative, in practical applications $\delta$ can be set to a high value to improve the convergence rate. Another common practice in empirical applications is to shrink confidence intervals through a scalar multiplicative factor; however, in this work we chose not to exploit this trick. Figure 2 compares the performance of the different concentration bounds described in the previous section, always using G(PO)MDP to estimate the gradient and $\delta = 0.95$. As expected, Bernstein's bound performs better than Chebyshev's, especially in the empirical-range version. The rigorous version of Hoeffding's bound performs very poorly, while the one using the empirical range is almost as good as the corresponding Bernstein method. This is due to the fact that the bound on the gradient estimate range is very loose, since it also accounts for unrealistic combinations of state, action and reward.
Finally, to better capture the performance of the different variants of the algorithm in a real-time scenario, we define a metric Υ, obtained by averaging the real performance (measured during learning) over all the trajectories, consistently with the cost function used to derive the optimal batch size. The results are reported in Table 2. In Appendix C we also show how the adaptive batch size evolves as the policy approaches the optimum.

6 Conclusions

We showed the relationship between the batch size and the step size in policy gradient approaches under Gaussian policies, and how their joint optimization can lead to parameter updates that guarantee, with high probability, a fixed improvement in the policy performance. In addition to the formal analysis, we proposed practical methods to compute the information required by the algorithms. Finally, we presented a preliminary evaluation on a simple control task. Future work should focus on developing more practical methods. It would also be interesting to investigate the extension of the proposed methodology to other classes of policies.

Acknowledgments

This research was supported in part by French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and French National Research Agency (ANR) under project ExTra-Learn (n.ANR-14-CE24-0010-01).

References

[1] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

[2] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.

[3] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation.
Journal of Artificial Intelligence Research, 15:319–350, 2001.

Figure 1: Expected performance over sample trajectories using the G(PO)MDP and REINFORCE (dashed) gradient estimators and the Chebyshev bound, for different values of δ. All results are averaged over 5 runs.

Figure 2: Comparison of the performance of different statistical bounds, using the G(PO)MDP gradient estimator and δ = 0.95. All results are averaged over 5 runs.

[4] Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Policy gradients with parameter-based exploration for control. In Artificial Neural Networks - ICANN 2008, pages 387–396. Springer Berlin Heidelberg, 2008.

[5] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems 21, pages 849–856, 2008.

[6] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.

[7] Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI Conference on Artificial Intelligence 24. AAAI Press, 2010.

[8] Ivo Grondman, Lucian Busoniu, Gabriel A. D. Lopes, and Robert Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(6):1291–1307, 2012.

[9] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.

[10] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning 19, pages 267–274. Morgan Kaufmann, 2002.

[11] Dimitri P. Bertsekas. Approximate policy iteration: A survey and some new methods.
Journal of Control Theory and Applications, 9(3):310–335, 2011.

[12] Bruno Scherrer. Approximate policy iteration schemes: A comparison. In International Conference on Machine Learning 31, volume 32 of JMLR Workshop and Conference Proceedings, pages 1314–1322. JMLR.org, 2014.

[13] Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In International Conference on Machine Learning 30, volume 28 of JMLR Workshop and Conference Proceedings, pages 307–315. JMLR.org, 2013.

[14] Yasin Abbasi-Yadkori, Peter L. Bartlett, and Stephen J. Wright. A fast and reliable policy improvement algorithm. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1338–1346, 2016.

[15] Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems 26, pages 1394–1402. Curran Associates, Inc., 2013.

[16] Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.

[17] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning 32, volume 37 of JMLR Workshop and Conference Proceedings, pages 1889–1897.
JMLR.org, 2015.

[18] Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning 32, volume 37 of JMLR Workshop and Conference Proceedings, pages 2380–2388. JMLR.org, 2015.

[19] Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems 29, pages 2298–2306, 2016.

[20] Julie Nutini, Mark W. Schmidt, Issam H. Laradji, Michael P. Friedlander, and Hoyt A. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In International Conference on Machine Learning 32, volume 37 of JMLR Workshop and Conference Proceedings, pages 1632–1641. JMLR.org, 2015.

[21] Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of policy gradient estimation. Neural Networks, 26:118–129, 2012.

[22] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bernstein stopping. In International Conference on Machine Learning 25, volume 307 of ACM International Conference Proceeding Series, pages 672–679. ACM, 2008.

[23] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, May 2008.

[24] M. S. Pinsker. Information and Information Stability of Random Variables and Processes. Izv. Akad. Nauk, Moskva, 1960.