CHEN Guoliang.Editorial of special issue on big data clustering[J].Journal of Shenzhen University Science and Engineering,2019,36(No.1(1-110)):1-3.[doi:10.3724/SP.J.1249.2019.01001]
artificial intelligence; big data; storage management; system computing; clustering
DOI: 10.3724/SP.J.1249.2019.01001
1 Chinese full text
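2013 has been called the “first year of big data”. After nearly five years of rapid development, big data has become one of the new technologies drawing the most public attention, and its application signals that the information age has entered a new stage. Big data applications have now penetrated every corner of human society, and efficient analysis and use of big data will have a significant positive impact on China's future economic development, social governance, state administration and people's lives. Using “big data” and “大数据” as keywords, I searched the literature published since 2013 on Web of Science (WOS) and on the CNKI platform (journal papers in the field of computer software and computer applications), which returned more than 16 000 documents. From an analysis of about 100 highly cited and hot papers in WOS and of the CNKI papers downloaded more than 10 000 times, I conclude that big data research has gone through the following three key periods.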
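◆ Concept exploration period (2013): In this period, people tried to find a reasonable and precise definition of big data that both academia and industry could accept. Unfortunately, no generally accepted definition has emerged so far. Practitioners turned instead to defining big data by its characteristics; representative examples are the “4V”, “4V+1O”, “4V+1C” and “4V+1U” characterizations. Here, 4V refers to huge volume, great variety, high velocity of growth and large value; 1O refers to online, i.e., big data is always online; 1C refers to complexity, i.e., big data is exceptionally difficult to process and analyze; and 1U refers to usability, i.e., the usability of big data.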
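◆ Data management period (2014 to 2015): In this period, with the rapid development of the Internet industry and the rapid spread of smart hardware products, data volumes in every industry surged (for example, the 2015 WeChat User Data Report released by Tencent shows that in September 2015 WeChat had 570 million daily logged-in users on average, and that daily active users grew 64% year on year), and big data research began to shift to the storage and management of big data itself. There are currently three typical technical routes for big data storage: new database clusters based on the MPP (massively parallel processing) architecture, extensions and encapsulations based on Hadoop, and all-in-one big data appliances. The first two are distributed storage, and the third is centralized storage.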
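◆ Data analysis and computation period (2016 onward): With the great success of AlphaGo and AlphaGo Zero as the dividing line, a new stage of big data analysis began. Earlier research paid more attention to handling the surface of big data, whereas this stage puts more emphasis on mining the value that big data contains. The development and launch of key tasks and major projects such as “online machine learning for big data analysis”, “new computing technologies for big data”, “big-data-driven knowledge learning” and “big data intelligence” indicate that a new generation of big data analysis and computing technologies will receive great attention and development for some time to come.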
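This special issue focuses on the third period of big data research, the data analysis and computation period. During this period, through the persistent efforts of researchers in academia and industry, a large number of pioneering results have been obtained in big data system computing, statistical analysis, supervised learning, unsupervised learning and semi-supervised learning. This “special issue on big data clustering” is a focused showcase of the latest research results in unsupervised learning for big data. It publishes five excellent papers, each with its own character, in the hope of offering inspiration and help to big data clustering research in China.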
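The first is a review article entitled “A review on clustering algorithms for large-scale data sets”. Taking the computability of big data as its starting point, it reviews and analyzes current clustering algorithms designed for big data in serial and parallel environments, and offers reflections and discussion on future design ideas and application prospects for big data clustering algorithms, in the hope that this modest start will draw more excellent researchers in China into the field.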
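The second paper is entitled “Fast clustering based on bipartite graph”. It proposes a fast clustering algorithm based on a bipartite graph (fast clustering based on bipartite graph, FCBG). By imposing a rank constraint on the Laplacian matrix of the bipartite graph, FCBG optimizes the edge weights of the bipartite graph while preserving its cluster structure, and outputs the clustering result directly without depending on the initial weight assigned to each edge when the graph is constructed. Experimental results show that FCBG learns the bipartite graph weights effectively and obtains high-quality clustering results at a small time cost.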
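The role of a rank constraint on the Laplacian can be illustrated with a standard fact from spectral graph theory: a graph with N nodes and non-negative edge weights has exactly c connected components when its Laplacian L = D - W has rank N - c. The sketch below is not the FCBG algorithm and does not learn any weights; it only checks this fact on a small hand-made bipartite graph. The "points" and "anchors" naming, the example weights and all function names are illustrative assumptions.

```python
from fractions import Fraction


def bipartite_laplacian(B):
    """Laplacian L = D - W of the bipartite graph whose point-to-anchor weights are B (n x m)."""
    n, m = len(B), len(B[0])
    size = n + m
    W = [[Fraction(0)] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            w = Fraction(B[i][j])
            W[i][n + j] = w      # edge point i -> anchor j
            W[n + j][i] = w      # symmetric entry
    L = [[-W[r][c] for c in range(size)] for r in range(size)]
    for r in range(size):
        L[r][r] = sum(W[r])      # degree on the diagonal (no self-loops)
    return L


def rank(M):
    """Exact matrix rank by Gaussian elimination over the rationals."""
    A = [row[:] for row in M]
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if A[i][c] != 0), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, rows):
            factor = A[i][c] / A[r][c]
            if factor != 0:
                A[i] = [a - factor * b for a, b in zip(A[i], A[r])]
        r += 1
        if r == rows:
            break
    return r


def num_components(B):
    """Number of connected components, read off as N - rank(L)."""
    L = bipartite_laplacian(B)
    return len(L) - rank(L)


# Four points (rows) and two anchors (columns): points 0-1 attach to anchor 0, points 2-3 to anchor 1.
B_two_clusters = [[2, 0], [1, 0], [0, 3], [0, 1]]
# One extra cross edge (point 1 to anchor 1) merges the two groups.
B_one_cluster = [[2, 0], [1, 1], [0, 3], [0, 1]]

print(num_components(B_two_clusters), num_components(B_one_cluster))
```

Under this reading, requiring the Laplacian to have rank N - c is a way of requiring the learned graph to fall apart into exactly c groups, so the clusters can be read directly from the graph.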
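The third paper is entitled “Stratified sampling based ensemble classification for imbalanced data”. It proposes a stratified sampling-based ensemble classification method for imbalanced data (EC-SS). The method uses self-tuning spectral clustering to mine the structural information of the majority-class samples, and then builds the data sets for ensemble learning by stratified sampling, so that the input to each individual learner is balanced and retains the structural information of the original data, which improves the subsequent ensemble classification performance. Experimental results show that the proposed EC-SS method consistently and effectively improves classification on imbalanced data.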
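The sampling step described above can be sketched as follows. This is a minimal illustration, not the EC-SS implementation: the strata are assumed to be given already (standing in for the clusters found by spectral clustering), draws are allocated to strata in proportion to stratum size (an assumed allocation rule), and the base learners themselves are omitted. All function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_undersample(majority, strata, target_size, rng):
    """Draw roughly target_size majority samples, with per-stratum counts proportional to stratum size."""
    groups = defaultdict(list)
    for x, s in zip(majority, strata):
        groups[s].append(x)
    total = len(majority)
    picked = []
    for members in groups.values():
        # At least one sample per stratum so every cluster stays represented.
        k = max(1, round(target_size * len(members) / total))
        k = min(k, len(members))
        picked.extend(rng.sample(members, k))
    return picked


def build_balanced_sets(majority, strata, minority, n_learners, seed=0):
    """One training set per base learner: a fresh stratified majority subsample plus all minority samples."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_learners):
        maj = stratified_undersample(majority, strata, len(minority), rng)
        sets.append([(x, 0) for x in maj] + [(x, 1) for x in minority])
    return sets


majority = list(range(90))
strata = ["a"] * 60 + ["b"] * 30      # stand-in for cluster labels of the majority class
minority = list(range(100, 110))
training_sets = build_balanced_sets(majority, strata, minority, n_learners=3)
print([len(s) for s in training_sets])
```

Because of rounding, the majority subsample can differ from the minority size by a few samples on other inputs; each set would then be passed to its own base classifier and the predictions combined by voting.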
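The fourth paper is entitled “An unsupervised outlier detection algorithm for categorical matrix-object data”. It defines the outlier factor of a matrix object in terms of the cohesion of the matrix object itself and its coupling with other matrix objects, and on that basis proposes an outlier detection algorithm for categorical matrix-object data (outlier detection algorithm for matrix-object data, ODAMD). Comparisons on real data sets with the common-neighbor-based outlier factor algorithm, the local outlier factor algorithm and the information-entropy-based algorithm show that ODAMD detects outliers in categorical matrix-object data more effectively.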
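The fifth paper is entitled “The application of optimization algorithm based on incremental learning in app usage prediction”. It proposes an app usage prediction system named Predictor, which uses an incremental k-nearest neighbor (IkNN) algorithm with a cluster effective value (CEV) strategy to provide app usage prediction for users. The computation of CEV relies on learning contextual associations among app features. Because CEV uses a multidimensional feature approach to raise classification accuracy, it improves the accuracy of app usage prediction. Experimental results show that the IkNN model with the CEV strategy has more stable prediction accuracy than the usual default IkNN model, and significantly improves prediction accuracy while reducing modeling time.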
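For readers unfamiliar with incremental k-nearest neighbor classification, the baseline idea can be sketched in a few lines: new observations are appended to the stored sample set, so the model is updated without rebuilding it. The sketch covers only a plain IkNN; the CEV strategy is not reproduced, and the feature layout (hour of day, weekend flag), the app names and the class name are hypothetical.

```python
import math
from collections import Counter


class IncrementalKNN:
    """Plain incremental k-NN: learning a sample means storing it."""

    def __init__(self, k=3):
        self.k = k
        self.samples = []  # list of (feature_vector, label)

    def partial_fit(self, features, label):
        # Incremental update: no retraining, the new sample is simply kept.
        self.samples.append((tuple(features), label))

    def predict(self, features):
        if not self.samples:
            raise ValueError("no samples have been observed yet")
        nearest = sorted(self.samples, key=lambda s: math.dist(s[0], features))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


model = IncrementalKNN(k=3)
history = [((8, 0), "mail"), ((9, 0), "mail"), ((21, 0), "video"), ((22, 1), "video")]
for features, app in history:
    model.partial_fit(features, app)
before = model.predict((8.5, 0))

# The user's morning habit changes; three new observations arrive one by one.
for features in [(8.5, 0), (8.4, 0), (8.6, 0)]:
    model.partial_fit(features, "news")
after = model.predict((8.5, 0))
print(before, after)
```

The prediction follows the newly observed behavior as soon as enough recent samples fall among the nearest neighbors, which is the property an incremental scheme is meant to provide.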
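Finally, I express my sincerest thanks to the authors of this special issue for their hard work and selfless contributions, and I hope that readers in big data related fields will exchange their latest research results more often and together promote the vigorous development of big data research!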
2 English full text
With the coming of the big data era, efficient data analysis plays an increasingly important role in economic development, social governance, state administration and people's livelihoods. An extensive literature study indicates that big data research has gone through three main periods: the concept exploration period, the big data management period and the big data analysis/computation period. In the first period, people tried to give a reasonable and precise definition of big data that would be widely acceptable to academic and industrial researchers. The research emphasis in the second period was on how to store and manage big data effectively. Research in the third period has focused on exploring the value of big data.
Many innovative and valuable studies have been carried out in the third period of big data research, covering big data system computation technology, statistical analysis, supervised learning, unsupervised learning and semi-supervised learning. This special issue is devoted to big data clustering. It includes five papers, which respectively review the latest developments in big data clustering, propose a fast clustering algorithm based on a bipartite graph, present a stratified sampling-based ensemble classification method with adaptive spectral clustering for imbalanced data, give a cohesion degree-based outlier detection algorithm for matrix-object data, and describe an app usage prediction system with a cluster effective value-based incremental k-nearest neighbor algorithm. The following is a brief introduction to the five contributions.
◆ The paper “A review on clustering algorithms for large-scale data sets”, authored by HE Yulin and HUANG Zhexue, reviews and analyzes current clustering algorithms for large-scale data sets under both sequential and parallel computational frameworks. Unlike existing literature reviews, this paper focuses on the computability of large-scale data sets. The authors also offer some new thoughts on the design and application of clustering algorithms for large-scale data sets.
◆ The paper “Fast clustering based on bipartite graph”, authored by NIE Feiping, WANG Cheng-long and WANG Rong, proposes a fast clustering method based on a bipartite graph (FCBG), which reduces the size of the original data structure using a sampling method and learns the relationship between the sampled data and the original data. The FCBG algorithm optimizes the edge weights of the bipartite graph while maintaining its cluster structure. The experimental analysis shows that the algorithm effectively learns the data relationship and obtains better clustering results at an acceptable time cost.
◆ The paper “Stratified sampling based ensemble classification for imbalanced data”, authored by WANG Xinyue and JING Liping, presents a new ensemble classification method for imbalanced data based on stratified sampling of the majority class (EC-SS). To mine the hidden structure of the majority class sufficiently, an adaptive self-tuning clustering strategy separates the majority-class samples into different strata, and stratified sampling is then used to downsample the majority class. A series of experiments on real benchmark datasets demonstrates the superiority of EC-SS.
◆ The paper “An unsupervised outlier detection algorithm for categorical matrix-object data”, authored by WU Xiaolin and CAO Fuyuan, first defines the outlier factor of a matrix-object through the cohesion degree of the matrix-object itself and its coupling degree with other matrix-objects, and then gives an outlier detection algorithm for categorical matrix-object data. The experimental results show that the proposed algorithm detects outliers in matrix-object data more effectively than the benchmark outlier detection algorithms.
◆ The paper “The application of optimization algorithm based on incremental learning in app usage prediction”, authored by HAN Di, LI Wenting, WANG Qingjuan, et al., describes a new app usage prediction system which uses a cluster effective value-based incremental k-nearest neighbor algorithm to predict the next app usage. The cluster effective value compensates for the error induced by the multidimensional features and thus helps improve the prediction accuracy. Large-scale experiments show that the improved prediction algorithm obtains better prediction performance with lower remodeling time.
I sincerely thank all the authors for their contributions to this special issue, and I hope it will be enlightening and helpful to researchers in big data analysis and computation, as well as to the development of big data clustering research.
[Chinese-language editor: Ying Zi; English-language editor: Mu Ke]
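Journal information
JOURNAL OF SHENZHEN UNIVERSITY SCIENCE AND ENGINEERING (深圳大学学报理工版)
(Founded in 1984; bimonthly)
Administered by: Shenzhen University
Sponsored by: Shenzhen University
Edited and published by: Editorial Office of Journal of Shenzhen University Science and Engineering
Editor-in-chief: LI Qingquan
Domestic distribution: Shenzhen Post and Telecommunications Bureau
Overseas distribution: China International Book Trading Corporation (P.O. Box 399, Beijing)
Address: 16 Donghuangchenggen North Street, Beijing
Postal code: 100717
Telephone: 0755-26732266, 0755-26538306
Email: journal@szu.edu.cn
Standard serial numbers: ISSN 1000-2618, CN 44-1401/N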