[1] HE Yulin, JIN Yi, et al. A new method for measuring the distribution consistency of mixed-attribute data sets[J]. Journal of Shenzhen University Science and Engineering, 2021, 38(2): 170-179. [doi:10.3724/SP.J.1249.2021.02170]

A new method for measuring the distribution consistency of mixed-attribute data sets

Journal of Shenzhen University Science and Engineering [ISSN: 1000-2618 / CN: 44-1401/N]

Volume:
38
Issue:
2021, No. 2
Pages:
170-179
Section:
Electronics and Information Science
Publication date:
2021-03-12

Article Info

Title:
A new method for measuring the distribution consistency of mixed-attribute data sets
Article number:
202102009
Author(s):
HE Yulin 1,2, JIN Yi 3, DAI Dexin 1, HUANG Baihao 1, and HUANG Jiajie 1
1) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R.China
2) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R.China
3) College of Forensic Science and Technology, Criminal Investigation Police University of China, Shenyang 110854, Liaoning Province, P.R.China
Keywords:
artificial intelligence; random sample partition; distribution consistency; maximum mean discrepancy; mixed-attribute data; one-hot encoding; deep encoding
Classification number:
TP311
DOI:
10.3724/SP.J.1249.2021.02170
Document code:
A
Abstract:
Measuring the consistency of data distributions is a key problem in generating a random sample partition (RSP) of big data, and how to measure distribution consistency reasonably and effectively for mixed-attribute data sets is a current focus of research on RSP technology. This paper proposes a new method, based on deep encoding and maximum mean discrepancy (DE-MMD), for measuring the distribution consistency of mixed-attribute data sets. Rather than measuring distribution consistency directly on two original data sets, the method first applies one-hot encoding to the discrete attributes, which transforms each original data set into a one-hot encoding data set. It then constructs and trains an autoencoder with a single hidden layer on the one-hot encoding data and represents the original data by the hidden-layer output, which yields the corresponding deep encoding data set. Finally, it measures the distribution consistency of the two deep encoding data sets with the maximum mean discrepancy index. On four benchmark mixed-attribute data sets (Adult, Australian, CRX, and German), DE-MMD is compared with the one-hot encoding-based MMD (OE-MMD) method and the binarization-based similarity measure (BSM) method, which binarizes the continuous attributes. The experimental results show that the proposed method measures the distribution consistency of mixed-attribute data sets more accurately and more effectively than the OE-MMD and BSM methods.
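The three steps named in the abstract (one-hot encoding, a single-hidden-layer autoencoder, and maximum mean discrepancy) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything the abstract does not state is an assumption made here for illustration: the function names (`one_hot_encode`, `train_autoencoder`, `mmd2`, `de_mmd`), the sigmoid hidden layer with a linear decoder, plain gradient descent, training one shared autoencoder on the pooled and standardized data, the Gaussian kernel with `gamma=1.0`, the biased MMD estimate, and the synthetic data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hot_encode(numeric, discrete, n_categories):
    """Append a one-hot encoding of one integer-coded discrete column."""
    return np.hstack([numeric, np.eye(n_categories)[discrete]])

def train_autoencoder(X, hidden=4, epochs=300, lr=0.05, seed=0):
    """Fit a single-hidden-layer autoencoder; return the encoder weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # hidden-layer output (the code)
        err = H @ W2 + b2 - X              # reconstruction error
        dH = (err @ W2.T) * H * (1.0 - H)  # backpropagate through the sigmoid
        W2 -= lr * H.T @ err / n; b2 -= lr * err.mean(axis=0)
        W1 -= lr * X.T @ dH / n;  b1 -= lr * dH.mean(axis=0)
    return W1, b1

def mmd2(A, B, gamma=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel."""
    def kernel(U, V):
        sq = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)
    return kernel(A, A).mean() + kernel(B, B).mean() - 2.0 * kernel(A, B).mean()

def de_mmd(data_a, data_b, n_categories, hidden=4):
    """One-hot encode, deep-encode with a shared autoencoder, then compute MMD."""
    Xa = one_hot_encode(*data_a, n_categories)
    Xb = one_hot_encode(*data_b, n_categories)
    pooled = np.vstack([Xa, Xb])
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0) + 1e-8
    Xa, Xb = (Xa - mu) / sd, (Xb - mu) / sd  # assumption: pooled standardization
    W1, b1 = train_autoencoder(np.vstack([Xa, Xb]), hidden)
    return mmd2(sigmoid(Xa @ W1 + b1), sigmoid(Xb @ W1 + b1))

# Synthetic mixed-attribute data: two numeric columns, one 3-valued discrete column.
rng = np.random.default_rng(0)

def sample(n, shift, probs):
    return rng.normal(shift, 1.0, (n, 2)), rng.choice(3, size=n, p=probs)

A = sample(200, 0.0, [0.6, 0.3, 0.1])
B = sample(200, 0.0, [0.6, 0.3, 0.1])  # same distribution as A
C = sample(200, 2.0, [0.1, 0.3, 0.6])  # shifted numeric and discrete attributes

same = de_mmd(A, B, n_categories=3)
diff = de_mmd(A, C, n_categories=3)
print("MMD^2 (same distribution):", same)
print("MMD^2 (different distribution):", diff)
```

In this sketch a smaller value indicates more consistent distributions, so the pair drawn from the same distribution should score lower than the shifted pair. Sharing one encoder between the two data sets is a design choice of the sketch, made so that both deep encodings live in the same hidden space.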



Memo:
Received: 2020-04-20; Accepted: 2020-08-18
Foundation: Open Foundation of Key Laboratory of Impression Evidence Examination and Identification Technology, Ministry of Public Security of China (HJKF201901); Basic Research Foundation of Strengthening Police with Science and Technology of the Ministry of Public Security of China (2017GABJC09); Scientific Research Start-up Fund for Newly Recruited Teachers of Shenzhen University (2018060); College Students' Innovation and Entrepreneurship Training Program (201910590017)
Corresponding author: Assistant professor HE Yulin. E-mail: yulinhe@szu.edu.cn
About the author: HE Yulin (born 1982), PhD, associate researcher at Shenzhen University. Research interests: big data system computing technology and applications.
Citation: HE Yulin, JIN Yi, DAI Dexin, et al. A new method for measuring the distribution consistency of mixed-attribute data sets[J]. Journal of Shenzhen University Science and Engineering, 2021, 38(2): 170-179. (in Chinese)
Last Update: 2021-03-30