
An expected experience replay based Q-learning algorithm with random state transition
ZHANG Feng, QIAN Hui, DONG Chunru, and HUA Qiang

Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding 071002, Hebei Province, P. R. China

artificial intelligence; machine learning; reinforcement learning; experience replay; Q-learning algorithm; stochastic environment; convergence; overestimation

DOI: 10.3724/SP.J.1249.2020.02202

The experience replay method in reinforcement learning reduces the correlation between state sequences through random sampling while improving data utilization, but so far it has only been applicable to deterministic state environments. To make full use of experience replay in a stochastic environment while keeping the original state-transition distribution unchanged, we propose a tree-based experience storage structure that records the state-transition probabilities observed during exploration, and, on top of this structure, an expected experience replay based Q-learning algorithm. The method provides an unbiased estimate of the environment's transition distribution without increasing the algorithmic complexity, and it reduces the overestimation of Q values in Q-learning. Experiments on the classical robot random-walk problem show that, compared with uniform replay and prioritized experience replay, the proposed algorithm improves the convergence speed by about 50%.
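To make the idea above concrete, the following is a minimal tabular sketch of how a tree-structured transition store and an expected-backup replay step could fit together. It is an illustration under assumptions, not the authors' implementation: the environment interface (reset() and step() returning next state, reward, and a done flag), the names TransitionTree, expected_backup, and expected_replay_q_learning, and all hyperparameter values are hypothetical, and the tree is realized here as nested dictionaries keyed by state, action, and successor state.

```python
import random
from collections import defaultdict

class TransitionTree:
    """Illustrative tree-like store: each (state, action) node keeps empirical
    counts and rewards of its successor states, so the observed transition
    distribution can be recovered when replaying."""

    def __init__(self):
        # state -> action -> {next_state: [visit_count, accumulated_reward]}
        self.nodes = defaultdict(lambda: defaultdict(dict))

    def add(self, s, a, r, s_next):
        leaf = self.nodes[s][a].setdefault(s_next, [0, 0.0])
        leaf[0] += 1      # visit count for the branch (s, a, s_next)
        leaf[1] += r      # total reward observed along this branch

    def expected_backup(self, s, a, q, gamma):
        """Expectation of r + gamma * max_a' Q(s', a') under the empirical
        transition distribution stored for (s, a)."""
        leaves = self.nodes[s][a]
        total = sum(cnt for cnt, _ in leaves.values())
        backup = 0.0
        for s_next, (cnt, r_sum) in leaves.items():
            prob = cnt / total            # empirical P(s' | s, a)
            mean_r = r_sum / cnt          # empirical E[r | s, a, s']
            backup += prob * (mean_r + gamma * max(q[s_next].values()))
        return backup

def expected_replay_q_learning(env, actions, episodes=500, alpha=0.1,
                               gamma=0.95, epsilon=0.1, replay_per_step=4):
    """Tabular Q-learning that replays expected backups from the transition
    tree instead of single sampled transitions (hypothetical env interface)."""
    q = defaultdict(lambda: {a: 0.0 for a in actions})
    tree = TransitionTree()
    visited = []                          # (state, action) pairs seen so far

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = (random.choice(actions) if random.random() < epsilon
                 else max(q[s], key=q[s].get))
            s_next, r, done = env.step(a)     # assumed env interface
            tree.add(s, a, r, s_next)
            if (s, a) not in visited:
                visited.append((s, a))
            # replay a few stored (state, action) pairs with expected backups
            for rs, ra in random.sample(visited,
                                        min(replay_per_step, len(visited))):
                target = tree.expected_backup(rs, ra, q, gamma)
                q[rs][ra] += alpha * (target - q[rs][ra])
            s = s_next
    return q
```

Because each replayed update averages over every successor recorded for a (state, action) pair, weighted by its empirical frequency, the replayed targets follow the observed transition distribution rather than a single resampled trajectory; this is what keeps the estimate unbiased and tempers the maximization bias behind Q-value overestimation.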
