[1] HE Yulin, HUANG Zhexue. A review on clustering algorithms for large-scale data sets[J]. Journal of Shenzhen University Science and Engineering, 2019, 36(1): 4-17. [doi:10.3724/SP.J.1249.2019.01004]

A review on clustering algorithms for large-scale data sets

Journal of Shenzhen University Science and Engineering [ISSN: 1000-2618 / CN: 44-1401/N]

Volume:
36
Issue:
No. 1, 2019
Pages:
4-17
Section:
Electronics and Information Science
Publication date:
2019-01-20

Article Info

Title:
A review on clustering algorithms for large-scale data sets
Author(s):
HE Yulin 1,2 and HUANG Zhexue 1,2
1) Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China; 2) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, Guangdong, China
Keywords:
artificial intelligence; large-scale data; clustering; sequential computing; parallel computing; data mining; review
Classification number:
TP 311
DOI:
10.3724/SP.J.1249.2019.01004
Abstract:
Clustering is an important research branch of machine learning. Over the past decades, many well-known clustering algorithms have been designed for small and medium-scale data sets of various types. Although these algorithms achieve good clustering performance on such data, they are usually inefficient on large-scale data sets because of their high computational complexity and their weak capability for handling high-dimensional data. In the big data era, collecting and storing data has become easier and more convenient, while data volumes keep growing. Clustering techniques are therefore urgently needed by real applications that generate large-scale data sets, and clustering for large-scale data sets has become an important research direction in machine learning. This paper reviews and analyses current clustering algorithms for large-scale data sets under both sequential computing (e.g., algorithms based on instance selection, incremental learning, feature subsets and feature transformation) and parallel computing (e.g., algorithms based on the MapReduce, Spark and Storm frameworks). Unlike existing literature reviews, this paper takes the computability of large-scale data sets as its starting point. We also offer some thoughts on the design and application of clustering algorithms for large-scale data sets, including design strategies based on data parallelism and automated training, and observations on clustering big social network data.

Last Update: 2019-01-30