Abstract
Hierarchical reinforcement learning (HRL) can learn decomposed subpolicies, each corresponding to a local region of the state space; it is therefore a promising solution for complex robotic assembly control tasks that must limit interactions with the environment. However, most existing HRL algorithms require on-policy learning, where new samples must be collected at every training step. In this article, we propose a data-efficient HRL algorithm based on off-policy learning, with three main contributions. First, two augmented Markov decision processes (MDPs) are formulated so that the higher level policy and the lower level policies can be learned from the same samples. Second, to learn a higher level policy that drives efficient exploration, a softmax gating policy is derived to determine which lower level policy interacts with the environment. Third, to learn the lower level policies from off-policy samples stored in a single lower level replay buffer, the higher level policy derived from the option-value network selects the appropriate option for training the corresponding lower level policy. The data efficiency of our algorithm is validated in two simulations and on real-world robotic dual peg-in-hole assembly tasks.
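The softmax gating idea described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' implementation: it assumes the option-value network has already produced a vector of option values Q(s, o) for the current state, and that `temperature` is a hypothetical exploration parameter.

```python
import numpy as np

def softmax_gating(option_values, temperature=1.0, rng=None):
    """Sample an option index from a softmax over option values Q(s, o).

    Higher-valued options are selected more often, but every option
    retains nonzero probability, which preserves exploration.
    (Illustrative sketch; `temperature` is an assumed hyperparameter.)
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(option_values, dtype=float) / temperature
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()            # normalize to a distribution
    option = rng.choice(len(probs), p=probs)
    return option, probs
```

In an HRL loop of this kind, the sampled option would pick which lower level policy acts in the environment, and the resulting transitions could be written to a shared replay buffer for off-policy updates of both levels.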
| Original language | English |
|---|---|
| Article number | 9264727 |
| Pages (from-to) | 11565-11575 |
| Number of pages | 11 |
| Journal | IEEE Transactions on Industrial Electronics |
| Volume | 68 |
| Issue number | 11 |
| Early online date | 19 Nov 2020 |
| DOIs | |
| Publication status | Published - Nov 2021 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 1982-2012 IEEE.
Funding
This work was supported in part by the National Key R&D Program of China under Grant 2017YFC0822204; in part by the National Natural Science Foundation of China under Grant U1613205, Grant 51675291, and Grant 51935010; in part by the Beijing Municipal Natural Science Foundation under Grant L192001; in part by the Funding for Basic Scientific Research Program under Grant JCKY2018205B029; and in part by the State Key Laboratory of China under Grant SKL2020C15.
Keywords
- Data-efficiency
- hierarchical reinforcement learning
- robotic assembly control