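Journal of Nantong University (Natural Science Edition)

2024, Vol. 23, No. 01 (Serial No. 88), pp. 58-65

Just-in-Time Software Defect Prediction Method Based on Dataset Augmentation

YANG Fan (杨帆)¹, XIA Hongling (夏鸿崚)²

1. Library and Information Center, Jiangsu College of Engineering and Technology; 2. School of Information Science and Technology, Nantong University

Foundation: Nantong Science and Technology Program, General Project (JC2023070)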

DOI: 10.12194/j.ntu.20231206001

Release date: 2024-01-09

Publication date: 2024-01-09

Online publication date: 2024-01-09

Abstract:

Just-in-time (JIT) software defect prediction aims to predict whether a code commit made during project development or maintenance will introduce defects. Model training in this field relies on high-quality datasets, yet existing JIT software defect prediction methods have not yet investigated how dataset augmentation affects prediction performance. To improve that performance, a method named prediction based on data augmentation (PDA) is proposed. PDA consists of four parts: feature stitching, sample generation, sample filtering, and sampling processing. The augmented dataset contains a sufficient number of high-quality samples and no longer suffers from class imbalance. Compared with the latest JIT software defect prediction method, JIT-Fine, PDA improves the F₁ score by 18.33% on the JIT-Defects4J dataset and by 3.67% on the LLTC4J dataset, which demonstrates its generalization ability. Ablation studies confirm that the performance gain of PDA comes mainly from the dataset augmentation and filtering mechanisms.
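
The four parts named in the abstract can be pictured as a small pipeline. The abstract does not describe how each part works, so the Python sketch below is only an illustration under stated assumptions: stitching is taken to be vector concatenation, generation is Gaussian jitter around each original sample, filtering keeps a generated sample only if it lies closer to its own class centroid, and sampling is random oversampling of the smaller class. All function names and parameters are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Tuple

# (feature vector, label); label 1 = defect-inducing commit, 0 = clean commit
Sample = Tuple[List[float], int]


def stitch_features(semantic: List[float], expert: List[float]) -> List[float]:
    """Feature stitching, ASSUMED here to be plain concatenation."""
    return semantic + expert


def generate_samples(data: List[Sample], per_sample: int, noise: float,
                     rng: random.Random) -> List[Sample]:
    """Sample generation, ASSUMED here to be Gaussian jitter of each original."""
    generated = []
    for features, label in data:
        for _ in range(per_sample):
            jittered = [x + rng.gauss(0.0, noise) for x in features]
            generated.append((jittered, label))
    return generated


def centroid(data: List[Sample], label: int) -> List[float]:
    """Mean feature vector of one class (the class must not be empty)."""
    rows = [f for f, y in data if y == label]
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filter_samples(generated: List[Sample], originals: List[Sample]) -> List[Sample]:
    """Sample filtering, ASSUMED: keep samples nearer to their own class centroid."""
    centers = {label: centroid(originals, label) for label in (0, 1)}
    return [(f, y) for f, y in generated
            if sq_dist(f, centers[y]) < sq_dist(f, centers[1 - y])]


def balance_classes(data: List[Sample], rng: random.Random) -> List[Sample]:
    """Sampling processing, ASSUMED: random oversampling of the smaller class."""
    by_label = {y: [s for s in data if s[1] == y] for y in (0, 1)}
    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    rng.shuffle(balanced)
    return balanced


def augment(originals: List[Sample], per_sample: int = 3,
            noise: float = 0.05, seed: int = 0) -> List[Sample]:
    """Run generation, filtering and balancing on already stitched samples."""
    rng = random.Random(seed)
    generated = generate_samples(originals, per_sample, noise, rng)
    kept = filter_samples(generated, originals)
    return balance_classes(originals + kept, rng)


if __name__ == "__main__":
    # Toy imbalanced data: 8 clean commits, 2 defect-inducing commits.
    toy = [(stitch_features([0.01 * i, 0.1], [0.0]), 0) for i in range(8)]
    toy += [(stitch_features([1.0, 0.9 + 0.01 * i], [1.0]), 1) for i in range(2)]
    out = augment(toy)
    print(len(toy), "->", len(out), "samples after augmentation")
```

The sketch requires both classes to be present in the input. The ordering (generate, then filter, then balance) follows the order in which the abstract lists the parts; whether the paper applies them in exactly this way is not stated on this page.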

Keywords: data augmentation; deep learning; just-in-time defect prediction; sample generation; imbalanced datasets
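
The results in the abstract are reported as F₁ improvements, and the abstract does not say whether the percentages are relative gains or absolute percentage points. The short Python sketch below shows how F₁ is computed from confusion counts, why it is more informative than accuracy on class-imbalanced data, and how the two readings of "improvement" differ. All counts and scores in it are made-up illustrations, not values from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R) = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


def absolute_gain(baseline: float, improved: float) -> float:
    """Improvement in percentage points."""
    return (improved - baseline) * 100.0


if __name__ == "__main__":
    # Hypothetical imbalanced test set: 950 clean commits, 50 defective ones.
    # A model that labels every commit "clean" looks accurate but finds nothing.
    print("accuracy:", accuracy(tp=0, tn=950, fp=0, fn=50))
    print("F1:      ", f1_score(tp=0, fp=0, fn=50))

    # Hypothetical baseline and improved F1 scores, to contrast the two readings.
    base, new = 0.5, 0.6
    print("relative gain (%):", relative_gain(base, new))
    print("absolute gain (points):", absolute_gain(base, new))
```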

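Basic information:

CLC number: TP311.5

Citation:

[1] YANG Fan, XIA Hongling. Just-in-time software defect prediction method based on dataset augmentation[J]. Journal of Nantong University (Natural Science Edition), 2024, 23(01): 58-65. DOI: 10.12194/j.ntu.20231206001.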
