Zhang Zonghua, Qu Ying, Ye Zhijia, et al. Deduplication based on multi-feature matching and Bloom filter[J]. Journal of Shenzhen University Science and Engineering, 2016, (05): 531-535. [doi:10.3724/SP.J.1249.2016.05531]
Deduplication based on multi-feature matching and Bloom filter

Zhang Zonghua1, Qu Ying2, Ye Zhijia2, and Niu Xinzheng2

1) Ministry of Information and Communication, Beijing Electric Power Hospital, State Grid Corporation of China, Beijing 100073, P.R. China
2) School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan Province, P.R. China

computing technology; deduplication; multi-feature matching; Bloom filter; extreme binning; disk optimization

DOI: 10.3724/SP.J.1249.2016.05531

To address the low deduplication rate and high disk I/O overhead of the extreme binning (EB) algorithm, we propose DBMB, a deduplication algorithm based on multi-feature matching and a Bloom filter. First, small files are aggregated into locality file units, each of which is deduplicated as a whole. Second, similarity matching uses multiple features: the maximum, minimum, and middle data chunk IDs. Finally, a Bloom filter is used to optimize the lookup and matching of data chunks on disk. Experimental results show that DBMB effectively increases the deduplication rate and reduces execution time, while also reducing the memory overhead of processing small files, so that overall performance improves significantly.
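
The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no parameters, so the fixed-size chunking, the SHA-1 chunk IDs, the reading of "middle" as the median of the sorted distinct chunk IDs, the per-bin Bloom filter, and all names (`group_small_files`, `features`, `BloomFilter`, `Deduplicator`) are assumptions made only for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, for illustration only


def group_small_files(files, small_limit, unit_size):
    """Yield units: large files alone, small files concatenated up to unit_size."""
    buf = b""
    for f in files:
        if len(f) >= small_limit:
            yield f
            continue
        buf += f
        if len(buf) >= unit_size:
            yield buf
            buf = b""
    if buf:
        yield buf


def chunk_ids(data):
    """Split data into fixed-size chunks and return each chunk's SHA-1 hex ID."""
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]


def features(ids):
    """Return (minimum, middle, maximum) chunk ID; 'middle' is assumed to be the median."""
    ordered = sorted(set(ids))
    return ordered[0], ordered[len(ordered) // 2], ordered[-1]


class BloomFilter:
    """Plain Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


class Deduplicator:
    def __init__(self):
        self.index = {}    # feature chunk ID -> bin number (in-memory index)
        self.bins = []     # each bin is a set of chunk IDs, standing in for a disk bin
        self.filters = []  # one Bloom filter per bin, consulted before the bin itself
        self.stored = 0    # chunks actually written
        self.total = 0     # chunks seen

    def add_unit(self, data):
        ids = chunk_ids(data)
        if not ids:
            return
        feats = features(ids)
        # A unit matches an existing bin if any of its three features is indexed.
        bin_no = next((self.index[f] for f in feats if f in self.index), None)
        if bin_no is None:
            bin_no = len(self.bins)
            self.bins.append(set())
            self.filters.append(BloomFilter())
        for f in feats:
            self.index.setdefault(f, bin_no)
        chunk_bin, bloom = self.bins[bin_no], self.filters[bin_no]
        for cid in ids:
            self.total += 1
            # A Bloom filter miss means "definitely new", so the bin lookup is skipped.
            if cid in bloom and cid in chunk_bin:
                continue
            chunk_bin.add(cid)
            bloom.add(cid)
            self.stored += 1
```

In this sketch, matching on three features rather than one gives a unit more chances to land in a bin that already holds its chunks, and the Bloom filter lets chunks that are certainly new bypass the slower bin lookup. Whether DBMB keeps one filter per bin or a single global filter is not stated in the abstract.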
