[1]张峰,钱辉,董春茹,等.随机状态下基于期望经验回放的Q学习算法[J].深圳大学学报理工版,2020,37(2):202-207.[doi:10.3724/SP.J.1249.2020.02202]
 ZHANG Feng,QIAN Hui,DONG Chunru,et al.An expected experience replay based Q-learning algorithm with random state transition[J].Journal of Shenzhen University Science and Engineering,2020,37(2):202-207.[doi:10.3724/SP.J.1249.2020.02202]

随机状态下基于期望经验回放的Q学习算法

《深圳大学学报理工版》[ISSN:1000-2618/CN:44-1401/N]

卷/Volume:
第37卷 (Vol.37)
期数/Issue:
2020年第2期 (No.2, 2020)
页码/Pages:
202-207
栏目/Column:
电子与信息科学 (Electronics and Information Science)
出版日期/Publication date:
2020-03-16

文章信息/Info

Title:
An expected experience replay based Q-learning algorithm with random state transition
文章编号/Article No.:
202002013
作者:
张峰, 钱辉, 董春茹, 花强
河北省机器学习与计算智能重点实验室,河北大学数学与信息科学学院,河北保定071002
Author(s):
ZHANG Feng QIAN Hui DONG Chunru and HUA Qiang
Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding 071002, Hebei Province, P.R.China
关键词:
人工智能; 机器学习; 强化学习; 经验回放; Q学习算法; 随机环境; 收敛; 过估计
Keywords:
artificial intelligence; machine learning; reinforcement learning; memory replay; Q-learning algorithm; stochastic environment; convergence; overestimation
分类号/CLC number:
TP181
DOI:
10.3724/SP.J.1249.2020.02202
文献标志码/Document code:
A
摘要:
强化学习的经验回放方法在减少状态序列间相关性的同时提高了数据的利用效率,但目前只能用于确定性的状态环境.为在随机状态环境下充分利用经验回放,且能够保持原有的状态转移分布,提出一种基于树的经验存储结构来存储探索过程中的状态转移概率,并根据该存储方式,提出基于期望经验回放的Q学习算法.该方法在保证算法复杂度较低的情况下,可实现对环境状态转移的无偏估计,减少Q学习算法的过估计问题.在经典的机器人随机行走问题中进行实验,结果证明,相比于基于均匀回放和优先回放的经验回放方法,基于期望经验回放的Q学习算法的收敛速度约提高了50%.
Abstract:
The experience replay method in reinforcement learning reduces the correlation between state sequences by random sampling and improves the efficiency of data utilization, but at present it can only be used in deterministic environments. In order to use experience replay effectively in a stochastic environment while keeping the original state-transition distribution unchanged, we propose a tree-based experience storage structure that records the state-transition probabilities observed during exploration and, based on it, an expected-experience-replay Q-learning algorithm that realizes an unbiased estimate of the transition distribution. The main advantage of the proposed algorithm is that it keeps the transition distribution unchanged without increasing the algorithmic complexity, and it alleviates the overestimation of the Q value in an efficient way. Experimental results on the classical robot random-walk problem show that, compared with uniform and prioritized experience replay, the proposed algorithm improves the convergence speed by about 50%.
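
The abstract describes the mechanism only in prose: transition statistics gathered during exploration are kept in a tree-based store, and each replayed Q backup averages its target over the recorded successor distribution instead of over a single sampled transition. The Python sketch below is our own minimal reconstruction of that idea under assumptions, not the authors' implementation: a nested dictionary stands in for the tree structure, and the names TransitionStore and expected_replay_q_update are hypothetical.

# Minimal illustrative sketch (not the published code); names and structure are assumptions.
from collections import defaultdict

class TransitionStore:
    """Nested-dictionary stand-in for the tree-based store: (s, a) -> {s': count}."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # visit counts per successor state
        self.rewards = defaultdict(float)                      # running mean reward per (s, a, s')

    def add(self, s, a, r, s_next):
        n = self.counts[(s, a)][s_next]
        # incremental mean of the reward observed for this transition
        self.rewards[(s, a, s_next)] = (self.rewards[(s, a, s_next)] * n + r) / (n + 1)
        self.counts[(s, a)][s_next] += 1

    def distribution(self, s, a):
        """Empirical estimate of P(s' | s, a) built from the stored counts."""
        total = sum(self.counts[(s, a)].values())
        return {s_next: c / total for s_next, c in self.counts[(s, a)].items()}

def expected_replay_q_update(Q, store, s, a, n_actions, gamma=0.9, alpha=0.1):
    """One expected-experience-replay backup: the target is averaged over the
    empirical successor distribution rather than a single sampled transition."""
    target = 0.0
    for s_next, p in store.distribution(s, a).items():
        r = store.rewards[(s, a, s_next)]
        best_next = max(Q[(s_next, b)] for b in range(n_actions))
        target += p * (r + gamma * best_next)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy usage: state 0 under action 1 stochastically moves to state 2 or stays at 0.
Q = defaultdict(float)
store = TransitionStore()
store.add(0, 1, -1.0, 2)
store.add(0, 1, -1.0, 0)
expected_replay_q_update(Q, store, s=0, a=1, n_actions=2)

Averaging the target over the empirical successor distribution keeps the replayed updates consistent with the environment's transition distribution and damps the overestimation that single noisy samples induce through the max operator.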

参考文献/References:

[1] WATKINS C H. Learning from delayed rewards[D]. London: King’s College, 1989: 89-95.
[2] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge, USA: MIT Press, 2018.
[3] SZEPESVRI C. The asymptotic convergence-rate of Q-learning[C]// Advances in Neural Information Processing Systems (NIPS). Cambridge, USA: MIT Press, 1998: 1064-1070.
[4] HWANG I, YOUNG J J. Q(λ) learning-based dynamic route guidance algorithm for overhead hoist transport systems in semiconductor fabs[J]. International Journal of Production Research, 2019(3): 1-23.
[5] ALIMORADI M R, KASHAN A H. A league championship algorithm equipped with network structure and backward Q-learning for extracting stock trading rules[J]. Applied Soft Computing, 2018, 68: 478-493.
[6] LIN Longji. Self-improving reactive agents based on reinforcement learning, planning and teaching[J]. Machine Learning, 1992, 8(3/4): 293-321.
[7] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533.
[8] HASSELT H V, GUEZ A, SILVER D. Deep reinforcement learning with double Q-Learning[C]// Proceedings of the 30th AAAI Conference on Artificial Intelligence. Phoenix, USA: AAAI Press, 2016: 2094-2100.
[9] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[EB/OL]. (2017-07-20)[2017-08-28]. https://arxiv.org/abs/1707.06347
[10] GU Shixiang, LILLICRAP T, SUTSKEVER I, et al. Continuous deep Q-learning with model-based acceleration[C]// International Conference on Machine Learning. New York, USA: [s.n.], 2016: 2829-2838.
[11] MAHMOOD A R, van HASSELT H, SUTTON R S. Weighted importance sampling for off-policy learning with linear function approximation[C]// Advances in Neural Information Processing Systems. [S.l.]: Neural Information Processing Systems Foundation, Inc., 2014: 3014-3022.
[12] SCHAUL T, QUAN J, ANTONOGLOU I, et al. Prioritized experience replay[EB/OL]. (2015-11-18)[2019-04-01]. https://arxiv.org/abs/1511.05952.
[13] VAN HASSELT H. Double Q-learning[C]// Advances in Neural Information Processing Systems.[S.l.]: Neural Information Processing Systems Foundation, Inc., 2010: 2613-2621.
[14] ARJONA-MEDINA J A, GILLHOFER M, WIDRICH M, et al. RUDDER: return decomposition for delayed rewards[DB/OL]. (2018-06-20)[2019-09-10]. https://arxiv.org/abs/1806.07857.
[15] ROLNICK D, AHUJA A, SCHWARZ J, et al. Experience replay for continual learning[C]// Advances in Neural Information Processing Systems.[S.l.]: Neural Information Processing Systems Foundation, Inc., 2019: 348-358.

相似文献/Similar References:

[1]林春漪,尹俊勋,高 学,等.基于统计学习的多层医学图像语义建模方法[J].深圳大学学报理工版,2007,24(2):138.
 LIN Chun-yi,YIN Jun-xun,GAO Xue,et al.A multi-level medical image semantic modeling approach based on statistical learning[J].Journal of Shenzhen University Science and Engineering,2007,24(2):138.
[2]骆剑平,李霞.求解TSP的改进混合蛙跳算法[J].深圳大学学报理工版,2010,27(2):173.
 LUO Jian-ping and LI Xia.Improved shuffled frog leaping algorithm for solving TSP[J].Journal of Shenzhen University Science and Engineering,2010,27(2):173.
[3]蔡良伟,李霞.基于混合蛙跳算法的作业车间调度优化[J].深圳大学学报理工版,2010,27(4):391.
 CAI Liang-wei and LI Xia.Optimization of job shop scheduling based on shuffled frog leaping algorithm[J].Journal of Shenzhen University Science and Engineering,2010,27(4):391.
[4]张重毅,刘彦斌,于繁华,等.CDA市场环境模型进化研究[J].深圳大学学报理工版,2010,27(4):413.
 ZHANG Zhong-yi,LIU Yan-bin,YU Fan-hua,et al.Research on the evolution model of CDA market environment[J].Journal of Shenzhen University Science and Engineering,2010,27(4):413.
[5]姜建国,周佳薇,郑迎春,等.一种双菌群细菌觅食优化算法[J].深圳大学学报理工版,2014,31(1):43.[doi:10.3724/SP.J.1249.2014.01043]
 Jiang Jianguo,Zhou Jiawei,Zheng Yingchun,et al.A double flora bacteria foraging optimization algorithm[J].Journal of Shenzhen University Science and Engineering,2014,31(1):43.[doi:10.3724/SP.J.1249.2014.01043]
[6]蔡良伟,刘思麒,李霞,等.基于蚁群优化的正则表达式分组算法[J].深圳大学学报理工版,2014,31(3):279.[doi:10.3724/SP.J.1249.2014.03279]
 Cai Liangwei,Liu Siqi,Li Xia,et al.Regular expression grouping algorithm based on ant colony optimization[J].Journal of Shenzhen University Science and Engineering,2014,31(3):279.[doi:10.3724/SP.J.1249.2014.03279]
[7]宁剑平,王冰,李洪儒,等.递减步长果蝇优化算法及应用[J].深圳大学学报理工版,2014,31(4):367.[doi:10.3724/SP.J.1249.2014.04367]
 Ning Jianping,Wang Bing,Li Hongru,et al.Research on and application of diminishing step fruit fly optimization algorithm[J].Journal of Shenzhen University Science and Engineering,2014,31(4):367.[doi:10.3724/SP.J.1249.2014.04367]
[8]刘万峰,李霞.车辆路径问题的快速多邻域迭代局部搜索算法[J].深圳大学学报理工版,2015,32(2):196.[doi:10.3724/SP.J.1249.2015.02000]
 Liu Wanfeng and Li Xia.A fast multi-neighborhood iterated local search algorithm for vehicle routing problem[J].Journal of Shenzhen University Science and Engineering,2015,32(2):196.[doi:10.3724/SP.J.1249.2015.02000]
[9]蔡良伟,程璐,李军,等.基于遗传算法的正则表达式规则分组优化[J].深圳大学学报理工版,2015,32(3):281.[doi:10.3724/SP.J.1249.2015.03281]
 Cai Liangwei,Cheng Lu,Li Jun,et al.Regular expression grouping optimization based on genetic algorithm[J].Journal of Shenzhen University Science and Engineering,2015,32(3):281.[doi:10.3724/SP.J.1249.2015.03281]
[10]罗雪晖,李霞,张基宏.支持向量机及其应用研究[J].深圳大学学报理工版,2003,20(3):40.
 LUO Xue-hui,LI Xia and ZHANG Ji-hong.Introduction to Support Vector Machine and Its Applications[J].Journal of Shenzhen University Science and Engineering,2003,20(3):40.
[11]潘长城,徐晨,李国.解全局优化问题的差分进化策略[J].深圳大学学报理工版,2008,25(2):211.
 PAN Chang-cheng,XU Chen,and LI Guo.Differential evolutionary strategies for global optimization[J].Journal of Shenzhen University Science and Engineering,2008,25(2):211.

备注/Memo

Received: 2019-04-10; Accepted: 2019-05-15
Foundation: Natural Science Foundation of Hebei Province (F2017201020, F2018201115); Key Science and Technology Foundation of the Education Department of Hebei Province (ZD2019021); Youth Fund of Hebei Education Department (QN2017019)
Corresponding author: Professor HUA Qiang. E-mail: huaq@hbu.edu.cn
Citation: ZHANG Feng, QIAN Hui, DONG Chunru, et al. An expected experience replay based Q-learning algorithm with random state transition[J]. Journal of Shenzhen University Science and Engineering, 2020, 37(2): 202-207. (in Chinese)
基金项目:河北省自然科学面上基金资助项目(F2017201020,F2018201115);河北省教育厅科学技术研究重点资助项目(ZD2019021);河北省教育厅青年基金资助项目(QN2017019)
作者简介/About the author: ZHANG Feng (1976—), PhD, associate professor at Hebei University. His research interests include reinforcement learning and intelligent decision making. E-mail: amyfzhang@yahoo.com
引文:张峰,钱辉,董春茹,等.随机状态下基于期望经验回放的Q学习算法[J]. 深圳大学学报理工版,2020,37(2):202-207.
更新日期/Last Update: 2020-03-30