[1] CAI Yina, CHEN Xin, QIN Zhiwu, et al. An algorithm for matching original experimental records based on improved CDC [J]. Journal of Shenzhen University Science and Engineering, 2022, 39(5): 509-514. [doi:10.3724/SP.J.1249.2022.05509]

An algorithm for matching original experimental records based on improved CDC

Journal of Shenzhen University Science and Engineering [ISSN: 1000-2618 / CN: 44-1401/N]

Volume:
39
Issue:
No. 5, 2022
Pages:
509-514
Section:
Electronics and Information Science
Publication date:
2022-09-16

Article Info

Title:
An algorithm for matching original experimental records based on improved CDC
Article number:
202205004
Author(s):
CAI Yina 1,2; CHEN Xin 3; QIN Zhiwu 1,3; WANG Xin 1; BAO Xianyu 1,3; PENG Jinxue 2; LIN Yongqi 1; LI Junlin 1
1) Shenzhen Academy of Inspection and Quarantine, Shenzhen 518045, Guangdong Province, P.R. China
2) Food Inspection and Quarantine Center, Shenzhen Customs, Shenzhen 518045, Guangdong Province, P.R. China
3) Information Center, Shenzhen Customs, Shenzhen 518045, Guangdong Province, P.R. China
Keywords:
computer application; data block; pattern string; string matching; original experimental record; content-defined chunking algorithm; generation of test reports
CLC number:
TP301.6; TP391.1
DOI:
10.3724/SP.J.1249.2022.05509
Document code:
A
Abstract:
To address the long turnaround time and occasional errors in generating laboratory test reports, we propose a general technique, based on a fence factor, for automatically capturing original experimental record files. First, files already read that day are accurately filtered out by computing a hash value over the whole file. The text is then chunked with an improved content-defined chunking (CDC) algorithm. The improvement lies mainly in two settings: the sliding window advances in units of one line height plus the line spacing, and the number of bytes inside the sliding window is constrained to a range. Once chunking is complete, a string matching algorithm based on a data block index performs the matching. This algorithm uses a data block index table to build the mapping between pattern strings and data blocks, so that a pattern string Pn can be matched quickly to its corresponding data block through the table. Tests on original experimental record files from a customs laboratory show that the algorithm uses little memory and achieves higher chunking throughput.
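The abstract names three steps (whole-file hash filtering, line-based CDC chunking with a byte-size range, and lookup through a data block index table) but gives no implementation details. The following Python sketch is only one possible reading of those steps, not the authors' implementation. All function names, the SHA-256 and MD5 hash choices, the boundary mask, and the default size limits are assumptions made for illustration; the fence factor is not modeled.

```python
import hashlib
from collections import defaultdict


def file_digest(data: bytes) -> str:
    """Hash of the whole file (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def unread_files(files: dict, seen: set) -> list:
    """Return names of files whose whole-file hash has not been seen yet.

    `seen` is updated in place, so a second call with the same files
    returns an empty list.
    """
    fresh = []
    for name, data in files.items():
        digest = file_digest(data)
        if digest not in seen:
            seen.add(digest)
            fresh.append(name)
    return fresh


def chunk_by_lines(text: str, min_bytes: int = 64, max_bytes: int = 256,
                   mask: int = 0x3) -> list:
    """Content-defined chunking that advances one text line at a time.

    A chunk is closed after a line when it holds at least `min_bytes` and
    the line's hash satisfies the (assumed) boundary condition, or before
    a line that would push the chunk past `max_bytes`.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
        # Deterministic per-line hash decides content-defined boundaries.
        h = int.from_bytes(hashlib.md5(line.encode("utf-8")).digest()[:4], "big")
        if size >= min_bytes and (h & mask) == 0:
            chunks.append("".join(current))
            current, size = [], 0
    if current:
        chunks.append("".join(current))
    return chunks


def build_index(chunks: list, patterns: list) -> dict:
    """Map each pattern string to the ids of the chunks that contain it."""
    index = defaultdict(list)
    for chunk_id, chunk in enumerate(chunks):
        for pattern in patterns:
            if pattern in chunk:
                index[pattern].append(chunk_id)
    return dict(index)


if __name__ == "__main__":
    record = "".join(f"item {i}: value {i * 3}\n" for i in range(40))
    blocks = chunk_by_lines(record, min_bytes=32, max_bytes=128)
    table = build_index(blocks, ["item 7:", "item 39:"])
    for pattern, ids in table.items():
        print(pattern, "->", ids)
```

Because chunks are cut only at line ends, a pattern string that fits on one line never straddles two chunks in this sketch; patterns spanning several lines would need extra handling.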

Memo:
Received: 2021-09-22; Revised: 2022-03-11; Accepted: 2022-06-01; Online (CNKI): 2022-07-27
Foundation: National Key R&D Program (2019YFC1605504, 2018YFC1603601)
Corresponding author: Senior engineer BAO Xianyu. E-mail: 601459563@qq.com
About the first author: CAI Yina (1979-), senior engineer at Shenzhen Academy of Inspection and Quarantine. Research interests: food safety and its informatization. E-mail: 1530210935@qq.com
Citation: CAI Yina, CHEN Xin, QIN Zhiwu, et al. An algorithm for matching original experimental records based on improved CDC [J]. Journal of Shenzhen University Science and Engineering, 2022, 39(5): 509-514. (in Chinese)
Last Update: 2022-09-30