The extensive use of high-throughput deoxyribonucleic acid (DNA) sequencing technologies opens up new perspectives in the treatment of several diseases and enables a new approach to healthcare known as “precision medicine”. DNA sequencing technologies produce extremely large amounts of raw data, which are stored in different repositories worldwide. The processing, analysis, and comparison of such distributed data are fundamental to the effective use of sequencing data for clinical and scientific purposes. Standard Application Programming Interfaces (APIs) and metadata are the basis for interoperable and automated data access and processing systems that can operate efficiently on the sets of sequencing data available worldwide.

The MPEG-G standard, jointly developed by ISO/IEC JTC 1/SC 29/WG 11 (MPEG) and the ISO Technical Committee for biotechnology standards (ISO TC 276/WG 5), is the first international standard to address the efficient and cost-effective handling of genomic data. It provides not only new compression and transport technologies (ISO/IEC 23092-1 and 23092-2), but also a standard specification for associating relevant information in the form of metadata, together with a rich set of APIs for data access and mining. The aim is a full ecosystem of interoperable applications capable of efficiently processing sequencing data.

In more detail, Part 2 of the standard (ISO/IEC 23092-2) provides specifications for the normative representation of genomic sequence read identifiers, genomic sequence reads (both unaligned and aligned), reference sequences, and quality values. It specifies compression in terms of normative bitstream syntax and decoding behaviour.

The second part of the MPEG-G specifications, Coding of Genomic Information (ISO/IEC 23092-2), was promoted to the Published International Standard stage in October 2019. The final specification is now available to the industry.

In more detail, Part 1 of the standard (ISO/IEC 23092-1) specifies data formats for both the transport and the storage of genomic information, together with a reference conversion process and informative annexes. The main topics covered by this part are genomic data streaming and the file format.

The first part of the MPEG-G specifications, Transport and Storage of Genomic Information (ISO/IEC 23092-1), was promoted to the Published International Standard stage in July 2019. The final specification is now available to the industry.

The core group of people behind MPEG-G has published a detailed overview of the standard on bioRxiv. The paper gives an overview of the main parts of the standard and the supported features. The supplementary information goes into more detail on the technology and on the reference dataset used during the development process.
MPEG-G Technology
"An introduction to MPEG-G, the new ISO standard for genomic information representation"
Claudio Alberti, Tom Paridaens, Jan Voges, Daniel Naro, Junaid J. Ahmad, Massimo Ravasi, Daniele Renzi, Giorgio Zoia, Idoia Ochoa, Paolo Ribeca, Marco Mattavelli, Jaime Delgado, Mikel Hernaez

doi: https://doi.org/10.1101/426353