LEC-Codec: Learning-Based Genome Data Compression

Zhenhao SUN, Meng WANG, Shiqi WANG, Sam KWONG

Research output: Journal Publications; Journal Article (refereed); peer-review

1 Citation (Scopus)

Abstract

In this paper, we propose a Learning-based gEnome Codec (LEC) designed for high efficiency and enhanced flexibility. LEC integrates several techniques, including Group of Bases (GoB) compression, multi-stride coding and bidirectional prediction, all aimed at balancing coding complexity against performance in lossless compression. The codec's model is data-driven: deep neural networks infer the probability of each symbol, enabling fully parallel encoding and decoding with configurable complexity for diverse applications. Experiments over a set of configurations with different compression ratios and inference speeds show that the proposed method is efficient in terms of compression performance and offers improved flexibility in real-world applications.
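The abstract's core idea, that a model assigns a probability to each symbol and those probabilities determine the lossless code length, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the base-4 packing used to form a group of bases, and the adaptive add-one frequency model, which merely stands in for the deep neural network. Multi-stride coding and bidirectional prediction are not reproduced here.

```python
import math

BASES = "ACGT"  # assumed 4-letter alphabet; real data may also contain N, etc.


def group_bases(seq, k):
    """Pack each run of k bases into one integer symbol in [0, 4**k).

    Hypothetical stand-in for a 'group of bases' symbol; the paper's
    actual grouping scheme may differ.
    """
    if len(seq) % k != 0:
        raise ValueError("sequence length must be a multiple of k")
    symbols = []
    for i in range(0, len(seq), k):
        value = 0
        for base in seq[i:i + k]:
            value = value * 4 + BASES.index(base)
        symbols.append(value)
    return symbols


def ungroup_bases(symbols, k):
    """Invert group_bases, recovering the original base string."""
    out = []
    for value in symbols:
        chunk = []
        for _ in range(k):
            chunk.append(BASES[value % 4])
            value //= 4
        out.extend(reversed(chunk))
    return "".join(out)


def ideal_code_length_bits(symbols, alphabet_size):
    """Sum of -log2 p(symbol) under an adaptive add-one frequency model.

    The frequency model is only a placeholder for a learned predictor:
    any model that outputs per-symbol probabilities could be plugged in.
    """
    counts = [1] * alphabet_size
    total = alphabet_size
    bits = 0.0
    for s in symbols:
        bits += -math.log2(counts[s] / total)
        counts[s] += 1  # update after coding, so a decoder can mirror it
        total += 1
    return bits


if __name__ == "__main__":
    seq = "ACGTACGTACGTACGT"
    k = 2
    symbols = group_bases(seq, k)
    assert ungroup_bases(symbols, k) == seq  # grouping is lossless
    bits = ideal_code_length_bits(symbols, 4 ** k)
    print(f"{bits / len(seq):.3f} bits per base (ideal, under the toy model)")
```

An entropy coder such as an arithmetic coder, driven by the same probabilities, would produce an output close to this ideal bit count; a better predictor lowers the count, which is the role the learned model plays in a codec of this kind.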
Original language: English
Pages (from-to): 2447-2458
Number of pages: 12
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Volume: 21
Issue number: 6
Early online date: 3 Oct 2024
Publication status: Published - Nov 2024

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Funding

This work is supported by the Key Project of Science and Technology Innovation 2030, funded by the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).

Keywords

  • Data compression
  • learning-based method
  • lossless genome compression
  • non-reference method
