Journal of Shenzhen University Science and Engineering

WANG Yibin, WU Chen, CHENG Yusheng, et al. Multi-label feature selection algorithm with imbalance label otherness[J]. Journal of Shenzhen University Science and Engineering, 2020, 37(3): 234-242. [doi:10.3724/SP.J.1249.2020.03234]

Multi-label feature selection algorithm with imbalance label otherness

WANG Yibin^{1, 2}, WU Chen^{1}, CHENG Yusheng^{1, 2}, and JIANG Jiansheng^{1, 2}

1) School of Computer and Information, Anqing Normal University, Anqing 246133, Anhui Province, P.R. China; 2) The University Key Laboratory of Intelligent Perception and Computing of Anhui Province, Anqing 246133, Anhui Province, P.R. China

Keywords: artificial intelligence; multi-label learning; feature selection; imbalanced data; label correlation; information entropy; label otherness

DOI: 10.3724/SP.J.1249.2020.03234

Abstract

Most existing feature selection algorithms do not account for the fact that different labels may describe a sample to different degrees. To address this, a multi-label feature selection algorithm with imbalance label otherness (MSIO) is proposed. For each label, the frequency distribution of its positive and negative instances is used as that label's weight in the feature selection process, and the traditional information entropy calculation is modified accordingly, yielding a more effective feature ranking. With multi-label k-nearest neighbor (ML-kNN) as the base classifier, MSIO is evaluated on 11 multi-label benchmark datasets from the Mulan repository against four algorithms: multi-label dimensionality reduction via dependence maximization (MDDM), the multivariate-mutual-information-based multi-label feature selection algorithm PMU (pairwise multivariate mutual information), feature selection for multi-label naive Bayes classification (MLNB), and multi-label feature selection with label correlation (MUCO). Experimental results and statistical hypothesis tests indicate that MSIO is stable and achieves high classification accuracy, which supports its effectiveness and its advantage over the compared algorithms.
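The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not give MSIO's formulas, so everything specific here is an assumption made for illustration: the per-label weight is taken to be the binary entropy of the label's positive/negative split (normalized over labels), and a feature's score is the weighted sum of its mutual information with each label. The function names `label_weights`, `mutual_information`, and `rank_features` are hypothetical and do not come from the paper.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) variable; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def label_weights(Y):
    """Hypothetical per-label weights from positive/negative frequencies.

    Y is an (n_samples, n_labels) 0/1 matrix. Each label's weight is the
    entropy of its positive/negative split, normalized to sum to 1.
    ASSUMPTION: this is an illustrative choice, not the paper's formula.
    """
    pos_freq = Y.mean(axis=0)
    raw = np.array([binary_entropy(p) for p in pos_freq])
    total = raw.sum()
    if total == 0:
        # All labels are constant: fall back to uniform weights.
        return np.full(Y.shape[1], 1.0 / Y.shape[1])
    return raw / total

def mutual_information(x, y):
    """Mutual information (bits) between two discrete 1-D arrays."""
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return float(mi)

def rank_features(X, Y):
    """Rank discrete features by label-weighted mutual information.

    Returns (order, scores), where order lists feature indices from the
    highest score to the lowest.
    """
    w = label_weights(Y)
    scores = np.array([
        sum(w[j] * mutual_information(X[:, i], Y[:, j])
            for j in range(Y.shape[1]))
        for i in range(X.shape[1])
    ])
    return np.argsort(-scores, kind="stable"), scores
```

Calling `rank_features(X, Y)` on a discretized feature matrix `X` and a 0/1 label matrix `Y` returns a feature ordering whose top entries could then be passed to a classifier such as ML-kNN. Continuous features would need to be discretized first, which this sketch leaves out.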

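Introduction
1 Preliminaries
2 Multi-label feature selection with imbalance label otherness
3 Experimental data and result analysis
4 Conclusion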

[1] PAN Xiaoyong, FAN Yongxian, JIA Jue, et al. Identifying RNA-binding proteins using multi-label deep learning[J]. Science China Information Sciences, 2019, 62(1): 19103.
[2] ROMAN-RANGEL E, MARCHAND-MAILLET S. Inductive t-SNE via deep learning to visualize multi-label images[J]. Engineering Applications of Artificial Intelligence, 2019, 81: 336-345.
[3] CHENG Yusheng, ZHAO Dawei, ZHAN Wenfa, et al. Multi-label learning of non-equilibrium labels completion with mean shift[J]. Neurocomputing, 2018, 321: 92-102.
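[4] LIU Junyu, JIA Xiuyi. A multi-label classification algorithm using association rule mining[J]. Journal of Software, 2017, 28(11): 2865-2878. (in Chinese)
[5] HE Zhifen, YANG Ming, LIU Huidong. Joint learning of multi-label classification and label correlations[J]. Journal of Software, 2014, 25(9): 1967-1981. (in Chinese)
[6] CAI Yaping, YANG Ming. A multi-label feature selection algorithm using local label correlations[J]. Journal of Nanjing University (Natural Sciences), 2016, 52(4): 693-704. (in Chinese)
[7] WU Lei, ZHANG Minling. Multi-label learning algorithm based on label-specific features[J]. Journal of Software, 2014, 25(9): 1992-2001. (in Chinese)
[8] WANG Yibin, CHENG Yusheng, HE Yue, et al. Multi-label learning algorithm based on regression kernel extreme learning machine[J]. Pattern Recognition and Artificial Intelligence, 2018, 31(5): 419-430. (in Chinese)
[9] HUANG Lili, TANG Jin, SUN Dengdi, et al. Feature selection algorithm based on multi-label ReliefF[J]. Journal of Computer Applications, 2012, 32(10): 2888-2890. (in Chinese)
[10] ZHANG Zhenhai, LI Shining, LI Zhigang, et al. A class of multi-label feature selection algorithms based on information entropy[J]. Journal of Computer Research and Development, 2013, 50(6): 1177-1184. (in Chinese)
[11] LIU Jinghua, LIN Menglei, WANG Chenxi, et al. Multi-label feature selection algorithm based on local subspace[J]. Pattern Recognition and Artificial Intelligence, 2016, 29(3): 240-251. (in Chinese)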
[12] LIN Yaojin, HU Qinghua, LIU Jinghua, et al. Multi-label feature selection based on neighborhood mutual information[J]. Applied Soft Computing, 2016, 38: 244-256.
[13] LIN Yaojin, HU Qinghua, LIU Jinghua, et al. Streaming feature selection for multilabel learning based on fuzzy mutual information[J]. IEEE Transactions on Fuzzy Systems, 2017, 25(6): 1491-1507.
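[14] CHENG Yusheng, ZHAO Dawei, QIAN Kun. Multi-label learning with non-equilibrium label completion in the neighborhood label space[J]. Pattern Recognition and Artificial Intelligence, 2018, 31(8): 740-749. (in Chinese)
[15] CHENG Yusheng, CHEN Fei, WANG Yibin. Rough-set-based feature selection for multi-label distribution data streams[J]. Journal of Computer Applications, 2018, 38(11): 3105-3111. (in Chinese)
[16] LI Zhixin, ZHUO Yaqi, ZHANG Canlong, et al. Survey on multi-label learning[J]. Application Research of Computers, 2014, 31(6): 1601-1605. (in Chinese)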
[17] TSOUMAKAS G, SPYROMITROS-XIOUFIS E, VILCEK J, et al. Mulan: a java library for multi-label learning[DB/OL].(2011-07-12). http://mulan.sourceforge.net/datasets.html.
[18] LIN Yaojin, HU Xuegang, WU Xindong. Quality of information-based source assessment and selection[J]. Neurocomputing, 2014, 133: 95-102.
[19] ZHANG Minling, ZHOU Zhihua. ML-kNN: a lazy learning approach to multi-label learning[J]. Pattern Recognition, 2007, 40(7): 2038-2048.
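[20] WANG Chenxi, LIN Yaojin, TANG Li, et al. Multi-label feature selection algorithm based on information granulation[J]. Pattern Recognition and Artificial Intelligence, 2017, 31(2): 123-131. (in Chinese)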
[21] ZHANG Yin, ZHOU Zhihua. Multilabel dimensionality reduction via dependence maximization[J]. ACM Transactions on Knowledge Discovery from Data, 2010, 4(3): 14.
[22] LEE J, KIM D W. Feature selection for multi-label classification using multivariate mutual information[J]. Pattern Recognition Letters, 2013, 34(3): 349-357.