[1] Zhu Zexuan, Zhang Yongpeng, You Zhuhong, et al. Advances in the compression of high-throughput DNA sequencing data[J]. Journal of Shenzhen University Science and Engineering, 2013, 30(4): 409-415. [doi:10.3724/SP.J.1249.2013.04409]

Advances in the compression of high-throughput DNA sequencing data
Journal of Shenzhen University Science and Engineering [ISSN: 1000-2618 / CN: 44-1401/N]

Volume:
30
Issue:
2013, No. 4 (pp. 331-440)
Pages:
409-415
Section:
Electronics and Information Science
Publication date:
2013-07-12

Article Info

Title:
Advances in the compression of high-throughput DNA sequencing data
Article ID:
20130411
Author(s):
Zhu Zexuan¹, Zhang Yongpeng¹, You Zhuhong¹, Jiang Liang², and Ji Zhen¹
1) Shenzhen City Key Laboratory of Embedded System Design, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, P.R. China
2) College of Life Sciences, Shenzhen University, Shenzhen 518060, P.R. China
Keywords:
computer application; DNA sequencing; next generation sequencing; resequencing; de novo sequencing; high-throughput sequencing; data compression
CLC number:
TP 391; TP 319
DOI:
10.3724/SP.J.1249.2013.04409
Document code:
A
Abstract:
The rapid development of high-throughput DNA sequencing technology has led to a sharp increase in the volume of DNA sequencing data, and data compression is one of the important approaches to storing and transmitting these data. This paper surveys the traditional DNA sequence compression methods, namely substitution-based and statistical methods, as well as the reference-genome-based compression method for high-throughput DNA sequencing data. Representative algorithms for resequencing data compression, de novo sequencing data compression, quality score compression, and compressed data indexing are introduced and compared. The challenges facing high-throughput DNA sequencing data compression and its future prospects are also discussed.
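To make the idea of reference-genome-based compression concrete, the following is a minimal illustrative sketch and not any of the surveyed algorithms. It assumes the alignment position of a read is already known (real tools obtain it from a read aligner) and that the read differs from the reference only by substitutions, with no insertions or deletions. The names `encode_read` and `decode_read` are hypothetical.

```python
from typing import List, Tuple

# A mismatch is (offset within the read, base observed in the read).
Mismatch = Tuple[int, str]


def encode_read(reference: str, read: str, pos: int) -> Tuple[int, int, List[Mismatch]]:
    """Represent a read as (position, length, mismatches) against a reference.

    Assumption: the read aligns at `pos` with substitutions only.
    """
    window = reference[pos:pos + len(read)]
    if pos < 0 or len(window) != len(read):
        raise ValueError("read does not fit inside the reference at this position")
    mismatches = [
        (i, read_base)
        for i, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return pos, len(read), mismatches


def decode_read(reference: str, pos: int, length: int, mismatches: List[Mismatch]) -> str:
    """Rebuild the original read from the reference and the stored differences."""
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    read = "ACGATAGC"  # differs from reference[4:12] at one position
    record = encode_read(reference, read, 4)
    print(record)
    print(decode_read(reference, *record) == read)
```

When reads closely match the reference, the mismatch list is short, so the stored record is much smaller than the read itself. That is the intuition behind the reference-based approach; practical schemes additionally entropy-code the positions and differences.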

Memo:
Received: 2013-03-09; Accepted: 2013-05-27
Foundation: National Natural Science Foundation of China (61211130120, 61001185)
Corresponding author: Professor Ji Zhen. E-mail: jizhen@szu.edu.cn
Citation: Zhu Zexuan, Zhang Yongpeng, You Zhuhong, et al. Advances in the compression of high-throughput DNA sequencing data[J]. Journal of Shenzhen University Science and Engineering, 2013, 30(4): 409-415. (in Chinese)
Author biography: Zhu Zexuan (1981-), male, Han ethnicity, from Chaozhou, Guangdong Province; associate professor at Shenzhen University, Ph.D. E-mail: zhuzx@szu.edu.cn
Last Update: 2013-07-12