
Distributed random vector functional link network with subspace-based local connections
YU Wanguo1, YUAN Zhenhao2, CHEN Jiaqi2, and HE Yulin3

1. College of Mathematics and Computer Science, Hebei Normal University for Nationalities, Chengde 067000, Hebei Province, P. R. China; 2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P. R. China; 3. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, Guangdong Province, P. R. China

artificial intelligence; random vector functional link network; subspace-based local connection; random sample partition; Hadoop distributed file system

DOI: 10.3724/SP.J.1249.2022.06675


To address the poor generalization ability and high computational complexity of the random vector functional link (RVFL) network on large-scale data classification tasks, we design and implement a distributed RVFL network with subspace-based local connections (DRVFL-SLC) on the Spark framework. First, to exploit the partition parallelism of the resilient distributed dataset (RDD), the large-scale dataset stored in the Hadoop distributed file system (HDFS) is divided by the random sample partition (RSP) operation so that each RSP data block corresponds to one partition of the RDD, where an RSP data block is a data subset whose probability distribution is consistent with that of the whole dataset at a given significance level. Then, the mapPartitions transformation is invoked on the multi-partition RDD in the distributed environment to train the optimal RVFL-SLC network for each partition efficiently in parallel. Next, the collect action is used to efficiently fuse the optimal RVFL-SLC networks of all RDD partitions into the DRVFL-SLC network, which yields an approximate solution to the big-data classification problem. Finally, the feasibility and effectiveness of DRVFL-SLC are verified on a Spark cluster with six computing nodes using eight large-scale datasets, each containing millions of records. The results show that DRVFL-SLC achieves good speedup, scalability, and sizeup, and obtains better generalization performance than an RVFL-SLC network trained on a single machine with the full data.
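As a rough illustration of the workflow described above, the following PySpark sketch shows how the three steps fit together: the RSP-style partitioning of the HDFS data into RDD partitions, the per-partition training via the mapPartitions transformation, and the gathering of the local models via the collect action. PySpark is an assumption (the paper does not state its implementation language here), and parse_record, train_rvfl_slc, and majority_vote are hypothetical placeholders for the record parser, the RVFL-SLC trainer, and the model-fusion step.

# Minimal PySpark sketch of the DRVFL-SLC workflow (illustrative only).
# parse_record(), train_rvfl_slc() and majority_vote() are hypothetical
# placeholders for the paper's record parsing, RVFL-SLC training, and
# model-fusion steps; they are not provided by Spark.
import random
from pyspark.sql import SparkSession

NUM_RSP_BLOCKS = 48  # assumed number of RSP data blocks (= RDD partitions)

spark = SparkSession.builder.appName("DRVFL-SLC").getOrCreate()
sc = spark.sparkContext

# Load the large-scale dataset from HDFS (the path is illustrative).
lines = sc.textFile("hdfs:///data/bigdata.csv")

# Rough stand-in for the RSP operation: assign each record a random block id
# and shuffle, so that every partition approximates a random sample of the data.
rsp_rdd = (lines
           .map(lambda rec: (random.randrange(NUM_RSP_BLOCKS), rec))
           .partitionBy(NUM_RSP_BLOCKS)
           .values())

def train_partition(records):
    # Train one RVFL-SLC network on the RSP block held by this partition.
    block = [parse_record(r) for r in records]   # hypothetical parser
    yield train_rvfl_slc(block)                  # hypothetical trainer

# mapPartitions trains one local model per RSP block in parallel; collect
# gathers the per-partition models on the driver, where they are fused into
# the final DRVFL-SLC ensemble.
local_models = rsp_rdd.mapPartitions(train_partition).collect()
drvfl_slc = majority_vote(local_models)          # hypothetical fusion step

mapPartitions is used rather than a per-record map so that each task sees one complete RSP block at a time, which is what allows a full RVFL-SLC network to be trained per RDD partition before the lightweight models, rather than the raw data, are collected on the driver.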