Author biography: WEI Chenghao (1986—), male, postdoctoral researcher at Shenzhen University. Research interests: machine learning and data mining. E-mail: chenghao.wei@szu.edu.cn
Chinese text editor: Ying Zi; English text editor: Zi Lan
DOI: 10.3724/SP.J.1249.2018.05441
To make big data computable under limited computing resources, this study proposes Bigdata-α, a statistics-aware computing framework for Tbyte-scale big data systems. The core of the framework consists of a random sample partition model and an asymptotic ensemble learning model. The former guarantees that the samples contained in each data block after partitioning are consistent with the probability distribution of the whole data set; the latter replaces analysis of the full Tbyte-scale data with analysis of a number of random sample blocks. The effectiveness of the random sample partition model is verified on a 1 Tbyte simulated dataset, and by gradually increasing the number of random sample blocks, the classification accuracy of the base classifiers on the Higgs dataset is improved, demonstrating that the method can overcome the computing-resource bottleneck in big data analysis.
In order to make big data computable with limited computing resources, a statistics-aware big data computing framework (abbreviated as Bigdata-α) is proposed in this paper to handle Tbyte-scale big data. The core of the framework comprises a random sample partition model and an asymptotic ensemble learning model. The former guarantees consistent distributions between the big data set and its data blocks, while the latter provides an unbiased and convergent learning model using only some random sample blocks of the big data. The effectiveness of the random sample partition model is verified on a 1 Tbyte simulated dataset. By gradually increasing the number of random sample blocks, the classification accuracy of the base classifiers on the Higgs dataset is improved, showing that massive computing resources can be avoided in big data analysis.
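The random sample partition idea described in the abstract can be sketched in miniature as follows. This is a minimal illustration under assumed conditions, not the paper's Bigdata-α implementation: the function name `random_sample_partition`, the synthetic one-dimensional data, and the block count are all hypothetical. The point demonstrated is the framework's core property: after a global shuffle, each block is a simple random sample of the whole, so block-level statistics approximate full-data statistics.

```python
import numpy as np

def random_sample_partition(data, n_blocks, seed=0):
    """Shuffle the full dataset once, then split it into equal-size
    blocks; each block is then a simple random sample of the whole,
    so its distribution is consistent with the full data set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    return np.array_split(data[idx], n_blocks)

# Synthetic stand-in for a large dataset (hypothetical toy data).
data = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=100_000)
blocks = random_sample_partition(data, n_blocks=10)

# Each block's mean approximates the global mean, so analyzing a few
# blocks can stand in for analyzing the full data set.
global_mean = data.mean()
max_dev = max(abs(b.mean() - global_mean) for b in blocks)
print(f"global mean: {global_mean:.3f}, max block deviation: {max_dev:.3f}")
```

In the asymptotic ensemble learning step, one would train a base classifier on each such block and aggregate their predictions, adding blocks until accuracy converges; the same partition routine supplies the blocks.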