MPEG-G Part 2

Compression

The widespread use of high-throughput deoxyribonucleic acid (DNA) sequencing technologies opens up new perspectives in the treatment of several diseases and enables a new approach to healthcare known as “precision medicine”.

DNA sequencing technologies produce extremely large amounts of raw data, which are stored in different repositories worldwide. Processing, analyzing, and comparing such distributed data is fundamental to the effective use of sequencing data for clinical and scientific purposes. Standard application programming interfaces (APIs) and metadata are the basis for interoperable and automated data access and processing systems that can operate efficiently on the sets of sequencing data available worldwide.

The MPEG-G standard, jointly developed by WG 11 (MPEG) and the ISO technical committee for biotechnology standards (ISO TC 276/WG 5), is the first international standard to address the efficient and cost-effective handling of genomic data. It provides not only new compression and transport technologies (ISO/IEC 23092-1/2) but also a standard specification for associating relevant information in the form of metadata, along with a rich set of APIs for data access and mining. Together these allow a full ecosystem of interoperable applications capable of efficiently processing sequencing data to be built.

In more detail, this part provides specifications for the normative representation of genomic sequence read identifiers, genomic sequence reads (both unaligned and aligned), reference sequences, and quality values. It specifies compression in terms of normative bitstream syntax and decoding behavior.

The second part of MPEG-G, Coding of Genomic Information (ISO/IEC 23092-2), was first published in October 2019. The current version, ISO/IEC 23092-2:2020, provides specifications for the representation of the following types of genomic information:

  • unaligned sequencing reads including read identifiers and quality values;
  • aligned sequencing reads including read identifiers and quality values;
  • reference sequences.
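
As a purely illustrative aid, the three kinds of information listed above can be pictured as simple records. The sketch below is not taken from ISO/IEC 23092-2 and does not reflect its bitstream syntax; the class and field names (`SequencingRead`, `ReferenceSequence`, `reference_name`, `position`) are hypothetical and chosen only to show how aligned and unaligned reads differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SequencingRead:
    """Hypothetical record for one read; not the MPEG-G data model."""
    identifier: str                        # read identifier
    bases: str                             # nucleotide sequence, e.g. "ACGT"
    quality: str                           # one quality symbol per base
    reference_name: Optional[str] = None   # set only for aligned reads
    position: Optional[int] = None         # mapping position, aligned reads only

    @property
    def is_aligned(self) -> bool:
        # A read counts as aligned only if both mapping fields are present.
        return self.reference_name is not None and self.position is not None


@dataclass
class ReferenceSequence:
    """Hypothetical record for a reference sequence."""
    name: str
    bases: str


unaligned = SequencingRead("read_1", "ACGT", "IIII")
aligned = SequencingRead("read_2", "ACGT", "IIII", reference_name="chr1", position=100)
reference = ReferenceSequence("chr1", "ACGTACGT")
```

In this sketch an unaligned read simply leaves the mapping fields empty, whereas the standard itself defines how each kind of data is actually represented and decoded.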

The revised standard was published in October 2020 and is now available here.

The MPEG-G standard parts:

Part 1:

Transport and Storage

Read more

Part 2:

Compression

Read more

Part 3:

Metadata and APIs

Read more

Part 4:

Reference Software

Read more  

Part 5:

Conformance

Read more 

Part 6:

Genomic Annotation

Read more