A new method for measuring the distribution consistency of mixed-attribute data sets

HE Yulin1,2, JIN Yi3, DAI Dexin1, HUANG Baihao1, and HUANG Jiajie1

1) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R. China
2) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R. China
3) College of Forensic Science and Technology, Criminal Investigation Police University of China, Shenyang 110854, Liaoning Province, P.R. China

artificial intelligence; random sample partition; distribution consistency; maximum mean discrepancy; mixed-attribute data; one-hot encoding; deep encoding

DOI: 10.3724/SP.J.1249.2021.02170


Measuring data distribution consistency is a key problem in generating a random sample partition (RSP) of big data, and measuring it reasonably and effectively for mixed-attribute data sets is a current focus of RSP research. This paper proposes a new method for measuring the distribution consistency of mixed-attribute data sets based on deep encoding and maximum mean discrepancy (DE-MMD). Rather than comparing two original data sets directly, the method works in three steps. First, the discrete attributes of the mixed-attribute data are one-hot encoded, giving a one-hot encoding data set. Second, an autoencoder with a single hidden layer is constructed and trained on the one-hot encoding data, and the hidden-layer output is used to represent the original data, giving the corresponding deep encoding data set. Finally, the distribution consistency of two mixed-attribute data sets is measured by computing the maximum mean discrepancy (MMD) index between their deep encoding data sets. On four benchmark mixed-attribute data sets (Adult, Australian, CRX, and German), the measurement performance of DE-MMD is compared with that of the MMD method based on one-hot encoding of discrete attributes (OE-MMD) and the similarity measure method based on binarization of continuous attributes (BSM). The experimental results show that the proposed method measures the distribution consistency of mixed-attribute data sets more accurately and more effectively than OE-MMD and BSM.
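The three steps summarized in the abstract (one-hot encoding, single-hidden-layer autoencoding, MMD) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Gaussian kernel and its bandwidth `sigma`, the biased MMD estimator, the tanh activation, the hidden size, the learning rate, the standardization of the continuous attribute, the choice to train the autoencoder on the pooled data, and the synthetic data are all assumptions made for illustration, and the function names (`one_hot`, `train_encoder`, `mmd2`, `de_mmd`) are hypothetical.

```python
import numpy as np

def one_hot(column):
    """One-hot encode a 1-D array of discrete category codes."""
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(float)

def train_encoder(X, hidden=3, epochs=300, lr=0.01, seed=0):
    """Train a single-hidden-layer autoencoder by plain gradient descent
    on squared reconstruction error; return the hidden-layer mapping.
    All hyperparameters here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)              # hidden-layer output
        G = 2.0 * (H @ W2 + b2 - X) / n       # gradient w.r.t. reconstruction
        dH = (G @ W2.T) * (1.0 - H ** 2)      # backpropagate through tanh
        W2 -= lr * (H.T @ G); b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)
    return lambda Z: np.tanh(Z @ W1 + b1)

def mmd2(A, B, sigma=1.0):
    """Biased estimate of squared MMD with a Gaussian kernel (assumed)."""
    def k(U, V):
        sq = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))
    return k(A, A).mean() + k(B, B).mean() - 2.0 * k(A, B).mean()

def de_mmd(num_x, cat_x, num_y, cat_y):
    """Encode two mixed-attribute data sets jointly, then compare them."""
    num = np.concatenate([num_x, num_y])
    num = (num - num.mean(axis=0)) / num.std(axis=0)   # standardize (assumption)
    cat = one_hot(np.concatenate([cat_x, cat_y]))      # shared category set
    X = np.hstack([num, cat])                          # one-hot encoding data set
    Z = train_encoder(X)(X)                            # deep encoding data set
    n = len(num_x)
    return mmd2(Z[:n], Z[n:])

# Synthetic mixed-attribute data: one continuous and one discrete attribute.
rng = np.random.default_rng(42)

def sample(n, mean, probs):
    num = rng.normal(mean, 1.0, (n, 1))
    cat = rng.choice(3, size=n, p=probs)
    return num, cat

a = sample(200, 0.0, [0.5, 0.3, 0.2])
b = sample(200, 0.0, [0.5, 0.3, 0.2])   # same distribution as a
c = sample(200, 2.0, [0.1, 0.2, 0.7])   # shifted distribution

same = de_mmd(*a, *b)
diff = de_mmd(*a, *c)
print(f"same-distribution MMD^2: {same:.5f}")
print(f"different-distribution MMD^2: {diff:.5f}")
```

In this sketch a smaller value indicates more consistent distributions, so the pair drawn from the same distribution should score well below the shifted pair. One-hot encoding the pooled data keeps the category columns aligned across the two data sets, which is a design choice of the sketch rather than something stated in the abstract.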
