A statistics-aware computing framework for big data systems

Big Data Technology and Application Research Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R. China

computer system architecture; big data; random sample partition; asymptotic ensemble learning; parallel and distributed computing; distributed processing system

A statistics-aware computing framework for big data systems
WEI Chenghao, HUANG Zhexue, and HE Yulin

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R.China

computer system architecture; big data; random sample partition; asymptotic ensemble learning; parallel and distributed computing; distributed processing system

DOI: 10.3724/SP.J.1249.2018.05441


To make big data computable under limited computing resources, this study proposes Bigdata-α, a statistics-aware computing framework for Tbyte-scale big data systems. The core of the framework consists of a random sample partition model for big data and an asymptotic ensemble learning model. The former guarantees that the samples in every data block produced by the partition follow the same probability distribution as the whole dataset; the latter replaces the analysis of the full Tbyte-scale data with the analysis of a number of random sample blocks. The effectiveness of the random sample partition model is verified on a 1 Tbyte simulated dataset, and gradually increasing the number of random sample blocks improves the classification accuracy of the base classifiers on the Higgs dataset, demonstrating that the method can overcome the computing-resource bottleneck in big data analysis.

In order to make big data computable with limited computing resources, a statistics-aware big data computing framework (abbreviated as Bigdata-α) is proposed in this paper to deal with Tbyte-scale big data. The core of the framework is a random sample partition model and an asymptotic ensemble learning model. The first one guarantees the consistency between the probability distribution of the big data and that of each of its data blocks, while the second one provides an unbiased and convergent learning model that analyzes only some random sample blocks of the big data. The effectiveness of the random sample partition model is verified by using a 1 Tbyte simulated dataset. By gradually increasing the number of random sample blocks, the classification accuracy of the base classifiers on the Higgs dataset is improved, showing that massive computing resources can be avoided in big data analysis.
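As a rough illustration of the two models described in the abstract, the following Python sketch partitions a small synthetic dataset into random sample blocks and trains one base classifier per block, combining more and more blocks by majority vote. The dataset, the block count, and the choice of decision trees as base classifiers are illustrative assumptions only, not the configuration used in the paper.

# Minimal sketch. Assumptions: scikit-learn decision trees as base classifiers,
# and a small synthetic dataset standing in for the paper's 1 Tbyte / Higgs data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in "full" dataset and a held-out test set.
X, y = make_classification(n_samples=60_000, n_features=20,
                           n_informative=10, random_state=0)
X_train, y_train = X[:50_000], y[:50_000]
X_test,  y_test  = X[50_000:], y[50_000:]

# --- Random sample partition ---
# Shuffle the records once, then cut them into K equal-sized blocks, so that each
# block is a random sample whose empirical distribution matches the full data.
K = 50
perm = rng.permutation(len(X_train))
blocks = np.array_split(perm, K)

# --- Asymptotic ensemble learning ---
# Train one base classifier per block; as more blocks are drawn, combine their
# votes and watch the ensemble accuracy improve without touching the full data.
models = []
for n_blocks in (1, 5, 10, 25, 50):
    while len(models) < n_blocks:
        idx = blocks[len(models)]
        clf = DecisionTreeClassifier(max_depth=8, random_state=len(models))
        clf.fit(X_train[idx], y_train[idx])
        models.append(clf)
    # Majority vote over the base classifiers trained so far.
    votes = np.stack([m.predict(X_test) for m in models])
    y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
    print(f"{n_blocks:3d} blocks -> accuracy {accuracy_score(y_test, y_pred):.4f}")

In this sketch only the blocks actually used for training need to be loaded, which mirrors the idea of replacing full-volume analysis with the analysis of a growing number of random sample blocks.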
