A Parzen window ensemble method based on sampling without replacement

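1) Department of Information Engineering, Cangzhou Technical College, Cangzhou 061001, Hebei, China; 2) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China; 3) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, Guangdong, China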

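probability distribution; probability density function estimation; Parzen window; kernel density estimation method; window width; sampling without replacement; ensemble method; large-scale dataset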

Sampling without replacement-based Parzen window ensemble method
HE Wuchao¹, WANG Xiaolan¹, HE Yulin²,³, and XIONG Ruijie²

1) Department of Information Engineering, Cangzhou Technical College, Cangzhou 061001, Hebei Province, P. R. China; 2) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P. R. China; 3) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, Guangdong Province, P. R. China

probability distribution; probability density function estimation; Parzen window; kernel density estimation method; bandwidth; sampling without replacement; ensemble method; large-scale dataset

DOI: 10.3724/SP.J.1249.2018.06617

Abstract

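To address probability density function (PDF) estimation for large-scale datasets, this paper proposes a sampling without replacement-based Parzen window ensemble (SR-PWE) method. Without using all of the data, the method obtains satisfactory PDF estimation performance at low computational complexity. Several data subsets of the original dataset are drawn by sampling without replacement; the Parzen window method is used to estimate a base PDF on each subset; and the base PDFs estimated on the samples are combined to give the PDF of the original dataset. A comparison of the PDF estimation performance of the Parzen window method and the SR-PWE method on Cauchy and normal distributions confirms that SR-PWE is feasible and effective.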

Although the Parzen window method is a classical probability density function (PDF) estimation method that is widely applied in machine learning and pattern recognition, it is unsuitable for PDF estimation on large-scale data because of its high computational complexity and its sensitivity to the bandwidth. To handle PDF estimation for large-scale data, we propose a sampling without replacement-based Parzen window ensemble (SR-PWE) method, which estimates the PDF from only part of the data and obtains satisfactory estimation performance at low computational complexity. First, we generate a number of sub-datasets from the original dataset by sampling without replacement. Second, we estimate the base PDFs by applying the Parzen window method to these sub-datasets. Then, we determine the PDF of the original dataset by fusing the base PDFs. Finally, experimental results on Cauchy and normal distributions demonstrate the feasibility and effectiveness of the SR-PWE method.
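
The three steps described in the abstract (subsets drawn by sampling without replacement, a base Parzen window estimate on each subset, fusion of the base estimates) can be illustrated with a minimal Python sketch. It should not be read as the authors' implementation. Several details are assumptions that the abstract does not state: the Gaussian kernel, the rule-of-thumb bandwidth, fusion by simple averaging, and the function names `parzen_pdf` and `sr_pwe_pdf`. Each subset here is drawn without replacement independently of the others, so different subsets may share points, which may also differ from the paper.

```python
import numpy as np


def parzen_pdf(x, sample, h):
    """Parzen window estimate at points x, using a Gaussian kernel (assumed)."""
    u = (x[:, None] - sample[None, :]) / h  # shape (len(x), len(sample))
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return k.mean(axis=1) / h


def sr_pwe_pdf(x, data, n_subsets, subset_size, rng):
    """Average the base Parzen estimates built on subsets of `data`."""
    estimates = []
    for _ in range(n_subsets):
        # Sampling without replacement: no point repeats inside a subset.
        subset = rng.choice(data, size=subset_size, replace=False)
        # Rule-of-thumb bandwidth, a placeholder rather than the paper's choice.
        h = 1.06 * subset.std(ddof=1) * subset_size ** (-0.2)
        estimates.append(parzen_pdf(x, subset, h))
    # Fusion by simple averaging (assumed; the abstract only says "fusion").
    return np.mean(estimates, axis=0)


rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)  # synthetic normal data
grid = np.linspace(-4.0, 4.0, 201)

estimate = sr_pwe_pdf(grid, data, n_subsets=10, subset_size=500, rng=rng)
true_pdf = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)
max_abs_error = float(np.max(np.abs(estimate - true_pdf)))
print(f"max abs error vs. true normal PDF: {max_abs_error:.4f}")
```

In this sketch each base estimate costs on the order of `subset_size` kernel evaluations per query point, so the ensemble evaluates 10 × 500 kernels per point instead of the 100,000 that a Parzen window estimate on the full dataset would need.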
