Abstract: Hex is a perfect-information game, and the opening library is an important component of a Hex game-playing system. Opening libraries have mainly been generated from human experience and Monte Carlo tree search (MCTS), which takes a great deal of time and makes accuracy hard to guarantee. To address this, a self-play method based on Q-learning is proposed for efficiently generating Hex opening libraries. The method simulates games on multiple threads, uses an improved upper confidence bound applied to trees (UCT) algorithm to search for good opening positions, and introduces an improved ε-greedy strategy to speed up the convergence of Q-learning. To further improve performance, Q-values are combined with the upper confidence bound (UCB) formula: during actual play, the Q-values supply prior experience to the UCB formula, which improves decision accuracy. Experimental results show that the Q-values of the board positions tend to converge once training reaches 3,000 iterations, demonstrating the feasibility of the method for building opening libraries. In playing-strength tests, play from the opening library alone achieved an average win rate of 62.9% against the improved UCT algorithm; with Q-values supplying prior experience, the average win rate rose to 75.9%. A program using the proposed method won a first prize at the Chinese Computer Game Competition, supporting the effectiveness of the method.
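The abstract names three ingredients without giving formulas: a Q-learning update, an ε-greedy exploration schedule, and a UCB score that takes Q-values as a prior. The sketch below illustrates one plausible reading of each. It is not the paper's implementation: the function names, the decaying-ε schedule, the additive prior term `w * q_prior / (1 + visits)`, and all default parameter values are assumptions made for illustration only.

```python
import math


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update on a dict keyed by (state, action)."""
    # Best estimated value reachable from the next state (0.0 if terminal).
    best_next = max((q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


def epsilon_at(episode, eps0=0.5, eps_min=0.05, decay=0.995):
    """Assumed schedule: exploration rate decays geometrically to a floor."""
    return max(eps_min, eps0 * decay ** episode)


def ucb_score(wins, visits, parent_visits, q_prior, c=math.sqrt(2), w=1.0):
    """UCB1 score plus an assumed Q-value prior that fades as visits grow."""
    if visits == 0:
        # Unvisited moves are always tried first, as in plain UCB1.
        return math.inf
    exploit = wins / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    prior = w * q_prior / (1 + visits)
    return exploit + explore + prior


# Example: two moves with identical search statistics; the one with the
# higher learned Q-value receives the higher score.
q = {}
q_update(q, "empty", "c3", 1.0, "after_c3", [], alpha=0.5)
strong = ucb_score(5, 10, 100, q_prior=q[("empty", "c3")])
weak = ucb_score(5, 10, 100, q_prior=0.0)
print(strong > weak)
```

In this form the prior matters most while a move has few visits and is gradually outweighed by the observed win rate, which is one common way to let learned values guide early search without overriding simulation results.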
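Basic information:
DOI: 10.12194/j.ntu.20241023001
Chinese Library Classification (CLC) number: TP18
Citation:
[1] XU Z F, LI Y, WANG J W, et al. A Q-learning method for building Hex opening libraries[J]. Journal of Nantong University (Natural Science Edition), 2025, 24(02): 22-28+47. DOI: 10.12194/j.ntu.20241023001.
Funding:
National Key Research and Development Program of China (2022YFB4100802)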