Journal of Shenzhen University Science and Engineering

HE Yulin, JIN Yi, DAI Dexin, et al. A new method for measuring the distribution consistency of mixed-attribute data sets[J]. Journal of Shenzhen University Science and Engineering, 2021, 38(2): 170-179. [doi:10.3724/SP.J.1249.2021.02170]

A new method for measuring the distribution consistency of mixed-attribute data sets

HE Yulin^{1,2}, JIN Yi^{3}, DAI Dexin^{1}, HUANG Baihao^{1}, and HUANG Jiajie^{1}

1) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R. China
2) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R. China
3) College of Forensic Science and Technology, Criminal Investigation Police University of China, Shenyang 110854, Liaoning Province, P.R. China

Keywords: artificial intelligence; random sample partition; distribution consistency; maximum mean discrepancy; mixed-attribute data; one-hot encoding; deep encoding

DOI: 10.3724/SP.J.1249.2021.02170

Abstract

Measuring the consistency of data distributions is a key problem in generating random sample partitions (RSPs) of big data, and doing so reasonably and effectively for mixed-attribute data sets is a current focus of RSP research. This paper proposes a new method, based on deep encoding and maximum mean discrepancy (DE-MMD), for measuring the distribution consistency of mixed-attribute data sets. Rather than comparing two original data sets directly, the method first applies one-hot encoding to the discrete attributes to obtain a one-hot encoded data set. It then constructs and trains an autoencoder with a single hidden layer on the one-hot encoded data and takes the hidden-layer output as the deep encoded data set. Finally, it measures the distribution consistency of two data sets by computing the maximum mean discrepancy between their deep encoded versions. On four benchmark mixed-attribute data sets (Adult, Australian, CRX, and German), DE-MMD is compared with the one-hot encoding-based MMD (OE-MMD) method and the binarization-based similarity measure (BSM) method, which binarizes continuous attributes. The experimental results show that DE-MMD measures the distribution consistency of mixed-attribute data more accurately and effectively than OE-MMD and BSM.
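The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`one_hot_encode`, `train_autoencoder`, `mmd`), the Gaussian kernel and its bandwidth `sigma`, the biased MMD estimator, the sigmoid hidden layer, and plain gradient-descent training are all assumptions, since the abstract does not specify them. The toy data set is synthetic.

```python
import numpy as np

def one_hot_encode(columns):
    """One-hot encode each discrete column and stack the results side by side."""
    blocks = []
    for col in columns:
        categories, codes = np.unique(np.asarray(col).ravel(), return_inverse=True)
        blocks.append(np.eye(len(categories))[codes])
    return np.hstack(blocks)

def train_autoencoder(X, n_hidden, epochs=300, lr=0.1, seed=0):
    """Train a single-hidden-layer autoencoder (assumed: sigmoid hidden units,
    linear output, full-batch gradient descent on squared reconstruction error).
    Returns a function mapping rows to their hidden-layer representation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))   # hidden-layer output
        R = H @ W2 + b2                            # reconstruction
        G = 2.0 * (R - X) / n                      # d(loss)/dR
        GH = (G @ W2.T) * H * (1.0 - H)            # backprop through sigmoid
        W2 -= lr * (H.T @ G); b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ GH); b1 -= lr * GH.sum(axis=0)
    return lambda Z: 1.0 / (1.0 + np.exp(-(Z @ W1 + b1)))

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel values between rows of A and rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd(X, Y, sigma=1.0):
    """Biased estimate of the maximum mean discrepancy between two samples."""
    m2 = (gaussian_kernel(X, X, sigma).mean()
          + gaussian_kernel(Y, Y, sigma).mean()
          - 2.0 * gaussian_kernel(X, Y, sigma).mean())
    return float(np.sqrt(max(m2, 0.0)))

# Toy mixed-attribute data: one discrete attribute, one continuous attribute in [0, 1].
rng = np.random.default_rng(0)
n = 200
color = rng.choice(["red", "green", "blue"], size=n)
age = rng.uniform(0.0, 1.0, size=n)
X = np.hstack([one_hot_encode([color]), age[:, None]])

encode = train_autoencoder(X, n_hidden=2)      # deep encoding
block_a, block_b = X[: n // 2], X[n // 2:]     # two data blocks to compare
score = mmd(encode(block_a), encode(block_b))  # smaller means more consistent
print(f"MMD between deep-encoded blocks: {score:.4f}")
```

A smaller value indicates more similar distributions. How the score is compared against a decision threshold is not described in the abstract and is left out of this sketch.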

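Introduction
1 One-hot encoding of discrete attributes
2 Deep encoding based on an autoencoder neural network
3 MMD-based measurement of distribution consistency
4 Experimental validation and analysis of results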

Table 1 Mixed-attribute data set

Table 2 One-hot encoding data set

Table 3 Four KEEL benchmark data sets

Fig.1 Impact of the autoencoder neural network on the stability of deep encoding of the mixed-attribute data set

Fig.2 Relationship among the threshold ε, the number of samples, and the significance level α

Fig.3 Distribution consistency determined by the DE-MMD method on non-RSP and RSP data blocks of the Adult data set (ε'=0.0131)

Fig.4 Distribution consistency determined by the DE-MMD method on non-RSP and RSP data blocks of the Australian data set (ε'=0.1098)

Fig.5 Distribution consistency determined by the DE-MMD method on non-RSP and RSP data blocks of the CRX data set (ε'=0.0397)

Fig.6 Distribution consistency determined by the DE-MMD method on non-RSP and RSP data blocks of the German data set (ε'=0.0381)

Table 4 Comparison of OE-MMD, BSM, and DE-MMD on RSP data blocks of the Australian data set

Table 5 Comparison of OE-MMD, BSM, and DE-MMD on RSP data blocks of the Adult data set

Table 6 Comparison of OE-MMD, BSM, and DE-MMD on RSP data blocks of the CRX data set

Table 7 Comparison of OE-MMD, BSM, and DE-MMD on RSP data blocks of the German data set

Fig.7 Comparison of the results of OE-MMD, BSM, and DE-MMD on the four KEEL data sets
