
Distributed random vector functional link network with subspace-based local connections
YU Wanguo1, YUAN Zhenhao2, CHEN Jiaqi2, and HE Yulin3

1. College of Mathematics and Computer Science, Hebei Normal University for Nationalities, Chengde 067000, Hebei Province, P. R. China; 2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P. R. China; 3. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen 518107, Guangdong Province, P. R. China

artificial intelligence; random vector functional link network; subspace-based local connection; random sample partition; Hadoop distributed file system

DOI: 10.3724/SP.J.1249.2022.06675


To address the poor generalization ability and high computational complexity of the random vector functional link (RVFL) network on large-scale data classification tasks, we design and implement a distributed RVFL network with subspace-based local connections (DRVFL-SLC) on the Spark framework. First, to exploit the partition parallelism of the resilient distributed dataset (RDD), the large-scale dataset stored in the Hadoop distributed file system (HDFS) is divided by the random sample partition (RSP) operation so that each RSP data block corresponds to one partition of the RDD, where an RSP data block is a data subset whose probability distribution is consistent with that of the whole dataset at a given significance level. Then, the mapPartitions transformation is invoked on the multi-partition RDD in the distributed environment to train the optimal RVFL-SLC network for each partition efficiently in parallel. Next, the collect action is used to efficiently fuse the optimal RVFL-SLC networks of all RDD partitions into the DRVFL-SLC network, which yields an approximate solution to the big-data classification problem. Finally, the feasibility and effectiveness of DRVFL-SLC are verified on a Spark cluster with six computing nodes using eight large-scale datasets, each containing millions of records. The results show that DRVFL-SLC achieves good speedup, scalability, and sizeup, and obtains better generalization performance than an RVFL-SLC network trained on a single machine with the full data.
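As a rough illustration of the workflow described above, the following PySpark sketch shows how the three steps fit together: the RSP-style partitioning of the HDFS data into RDD partitions, the per-partition training via the mapPartitions transformation, and the gathering of the local models via the collect action. PySpark is an assumption (the paper does not state its implementation language here), and parse_record, train_rvfl_slc, and majority_vote are hypothetical placeholders for the record parser, the RVFL-SLC trainer, and the model-fusion step.

# Minimal PySpark sketch of the DRVFL-SLC workflow (illustrative only).
# parse_record(), train_rvfl_slc() and majority_vote() are hypothetical
# placeholders for the paper's record parsing, RVFL-SLC training, and
# model-fusion steps; they are not provided by Spark.
import random
from pyspark.sql import SparkSession

NUM_RSP_BLOCKS = 48  # assumed number of RSP data blocks (= RDD partitions)

spark = SparkSession.builder.appName("DRVFL-SLC").getOrCreate()
sc = spark.sparkContext

# Load the large-scale dataset from HDFS (the path is illustrative).
lines = sc.textFile("hdfs:///data/bigdata.csv")

# Rough stand-in for the RSP operation: assign each record a random block id
# and shuffle, so that every partition approximates a random sample of the data.
rsp_rdd = (lines
           .map(lambda rec: (random.randrange(NUM_RSP_BLOCKS), rec))
           .partitionBy(NUM_RSP_BLOCKS)
           .values())

def train_partition(records):
    # Train one RVFL-SLC network on the RSP block held by this partition.
    block = [parse_record(r) for r in records]   # hypothetical parser
    yield train_rvfl_slc(block)                  # hypothetical trainer

# mapPartitions trains one local model per RSP block in parallel; collect
# gathers the per-partition models on the driver, where they are fused into
# the final DRVFL-SLC ensemble.
local_models = rsp_rdd.mapPartitions(train_partition).collect()
drvfl_slc = majority_vote(local_models)          # hypothetical fusion step

mapPartitions is used rather than a per-record map so that each task sees one complete RSP block at a time, which is what allows a full RVFL-SLC network to be trained per RDD partition before the lightweight models, rather than the raw data, are collected on the driver.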