The recently developed Natural Actor-Critic (NAC) [1] [2], which employs natural policy gradient learning for the actor and LSTD-Q(λ) for the critic, provides a good model-free reinforcement learning scheme applicable to high-dimensional systems. Since NAC is an on-policy learning method, however, a new sample sequence must be collected to estimate the sufficient statistics of the policy gradient whenever the current policy or its parameterization is modified; otherwise the gradient estimate is biased. Moreover, the trade-off between exploration and exploitation must be controlled by directly manipulating the policy, which severely constrains how an exploratory factor can be introduced. To overcome these problems, we propose an off-policy NAC in which the policy gradient is estimated without asymptotic bias from past system trajectories generated under the control of past policies. In addition, this method allows exploration to be controlled from outside the policy optimization, which is more effective than direct policy manipulation. We also propose a hierarchical parameterization of the policy and an off-policy NAC learning scheme for that parameterization. Computer experiments with a snake-like robot simulator show that our new off-policy NAC is so effective that the number of required trajectories is much smaller than that of the on-policy method. Moreover, even when the policy parameterization is high-dimensional, our hierarchical off-policy NAC learning successfully obtains a good policy by dynamically controlling the policy's effective dimensionality. These improvements enable more efficient application to high-dimensional control problems.
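The central idea — correcting a policy-gradient estimate computed from trajectories generated by *past* (behavior) policies — is typically realized with importance weighting. The following is a minimal illustrative sketch, not the paper's NAC/LSTD-Q(λ) algorithm: it assumes a hypothetical one-dimensional Gaussian policy with a linear state-dependent mean (`mean = theta * s`) and uses per-sample importance weights `π_θ(a|s) / π_b(a|s)` so that the REINFORCE-style gradient estimate remains unbiased in expectation under the behavior policy's samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(a, mean, std):
    # Log-density of a 1-D Gaussian policy pi(a|s) = N(mean, std^2).
    return -0.5 * np.log(2 * np.pi * std**2) - (a - mean) ** 2 / (2 * std**2)

def off_policy_gradient(states, actions, returns, theta, theta_behavior, std=1.0):
    """Importance-weighted policy-gradient estimate (illustrative sketch).

    The target policy's mean is linear in the state: mean = theta * s.
    Samples come from a behavior policy with parameter theta_behavior;
    importance weights pi_theta / pi_behavior correct the distribution
    mismatch, removing the asymptotic bias of reusing old trajectories.
    """
    mean_t = theta * states            # target-policy action means
    mean_b = theta_behavior * states   # behavior-policy action means
    # Per-sample importance weights (per-trajectory products in general).
    w = np.exp(gaussian_logpdf(actions, mean_t, std)
               - gaussian_logpdf(actions, mean_b, std))
    # Score function of the Gaussian target policy w.r.t. theta.
    score = (actions - mean_t) / std**2 * states
    return np.mean(w * score * returns)
```

When `theta == theta_behavior` the weights are all 1 and the estimator reduces to the ordinary on-policy estimate; as the target policy drifts away from the behavior policy, the weights reweight old samples instead of forcing a fresh rollout.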