The extensive use of high-throughput deoxyribonucleic acid (DNA) sequencing technologies opens up new perspectives in the treatment of several diseases and enables a new approach to healthcare known as “precision medicine”. DNA sequencing technologies produce extremely large amounts of raw data, which are stored in different repositories worldwide. Processing, analyzing, and comparing such distributed data is fundamental to the effective use of sequencing data for clinical and scientific purposes. Standard Application Programming Interfaces (APIs) and metadata are the basis for interoperable, automated data access and processing systems that can operate efficiently on the sequencing data sets available worldwide.

The MPEG-G standard, jointly developed by WG 11 (MPEG) and the ISO Technical Committee for biotechnology standards (ISO TC 276/WG 5), is the first international standard to address and solve the problem of efficient and cost-effective handling of genomic data. It provides not only new compression and transport technologies (ISO/IEC 23092-1/2), but also a standard specification for associating relevant information in the form of metadata and a rich set of APIs for data access and mining, with the aim of building a full ecosystem of interoperable applications capable of efficiently processing sequencing data.

In more detail, Part 2 of the standard provides specifications for the normative representation of genomic sequence read identifiers, genomic sequence reads (both unaligned and aligned), reference sequences, and quality values. It specifies compression in terms of normative bitstream syntax and decoding behaviour.

The second part of the MPEG-G specifications, Coding of Genomic Information (ISO/IEC 23092-2), was promoted to Published International Standard stage in October 2019. The final specification is now available to the industry.

In more detail, Part 1 of the standard specifies data formats for both the transport and the storage of genomic information, together with a reference conversion process and informative annexes. The main topics covered by this part are genomic data streaming and the file format.

The first part of the MPEG-G specifications, Transport and Storage of Genomic Information (ISO/IEC 23092-1), was promoted to Published International Standard stage in July 2019. The final specification is now available to the industry.

The introduction of high-throughput DNA sequencing has led to the generation of large quantities of genomic sequencing data that have to be stored, transferred, and analyzed. So far, WG 11 (MPEG) and ISO TC 276/WG 5 have addressed the representation, compression, and transport of genome sequencing data by developing the ISO/IEC 23092 series of standards, also known as MPEG-G. The series provides a file and transport format, compression technology, metadata specifications, protection support, and standard APIs for accessing sequencing data in the native compressed format.

An important element in the effective use of sequencing data is the association of the data with the analysis results and annotations generated by processing pipelines and analysts. At the moment this association happens as a separate step; standard and effective ways of linking sequencing data with the meta-information derived from it are not available.

At its 127th meeting, MPEG and ISO TC 276/WG 5 issued a joint Call for Proposals (CfP) to address this problem. The call seeks submissions of technologies that can provide efficient representation and compression solutions for the processing of genomic annotation data.

Companies and organizations are invited to submit proposals in response to this call. Responses are expected to be submitted by 8 January 2020 and will be evaluated during the 129th WG 11 (MPEG) meeting. Detailed information, including how to respond to the call for proposals, the requirements to be considered, and the test data to be used, is given in documents N18648, N18647, and N18649, available online. For further questions about the call, test conditions, required software, or test sequences, please contact Joern Ostermann, MPEG Requirements Group Chair (ostermann@tnt.uni-hannover.de), or Martin Golebiewski, Convenor of ISO TC 276/WG 5 (martin.golebiewski@h-its.org).

At the meeting, the third part of the MPEG-G specifications, Application Program Interfaces and Metadata (ISO/IEC 23092-3), was promoted to Final Draft International Standard (FDIS) stage. This will enable the industry to rely on a final specification for this part of the standard in October 2019.

Whole Genome Sequencing Data Analysis
MPEG-G Genomic Information Representation

Workshop on Genomic Sequencing Data Compression

GA4GH – MPEG, Basel 3rd October 2018

Call for Contributions

The amount of genome sequencing data generated day by day is comparable to, or larger than, that of other big data problems. Shrinking storage costs alone do not make the ambition of turning genomic medicine into common practice affordable. Moreover, storage costs are not the only factor to consider, because genomic data, once generated, have to be made available to the scientific community for frequent and repeated access.

Current sequencing technologies compensate for the errors generated by intrinsically noisy processes by generating redundant data and associated metadata (i.e. quality values). Compression approaches are therefore effective solutions for reducing the costs and mitigating the technological limitations associated with handling extremely large volumes of data.

The heterogeneity of genome sequencing data and the diversity of the available compression solutions pose several challenges to the quest for an ideal technology able to deliver, at the same time, high compression ratios, high coding and decoding speed, efficient selective access to data and guaranteed interoperability among applications while respecting a variety of data protection and privacy requirements.

The goal of this workshop is to collect technical contributions on emerging and new compression technologies with particular attention to:

  • DNA sequencing data compression
  • Selective access and processing in the compressed domain
  • Emerging standard frameworks for the specification, representation and compression of genomic sequencing data
  • Interoperability of genomic sequencing data formats, applications standard frameworks and APIs
  • Use cases and processing applications requiring genomic data/metadata compression and protection

Interested authors are invited to submit an abstract of no more than 600 words (excluding pictures and graphics which are welcome) describing their technical work by 31st August 2018.

The submission should indicate the preferred form of the contribution:

  • Oral presentation
  • Poster
  • Demonstration

Submissions of abstracts must be sent by email to:

GA4GH assembly program

Registration

MPEG-G Standard

The extensive use of high-throughput deoxyribonucleic acid (DNA) sequencing technologies opens up new perspectives in the treatment of several diseases and enables “precision medicine”. Because DNA sequencing technologies produce extremely large amounts of raw data, the ICT costs of storing, transmitting, and processing DNA sequence data and related information are very high owing to the lack of universal standards, which hinders the timely application of effective treatments.

The MPEG-G standard, jointly developed by MPEG and the ISO Technical Committee for biotechnology standards (ISO TC 276/WG 5), is the first international standard to address and solve the problem of efficient and cost-effective handling of genomic data. It provides not only new compression and transport technologies, but also a family of standard specifications for associating relevant information in the form of metadata and a rich set of Application Programming Interfaces (APIs), with the aim of building a full ecosystem of interoperable applications and services capable of efficiently processing sequencing data.

At its 122nd meeting, MPEG promoted its core set of MPEG-G specifications, i.e., transport and compression technologies, to Draft International Standard (DIS) stage. These parts of the standard provide new transport technologies (ISO/IEC 23092-1) and compression technologies (ISO/IEC 23092-2) that support rich functionality for the access and transport, including streaming, of genomic data by interoperable applications. This will enable the industry to rely on a final specification in October 2018. Reference software (ISO/IEC 23092-4) and conformance (ISO/IEC 23092-5) will reach this stage in the next 12 months.

Besides these standardization achievements, a workshop on the “applications of genomic information processing” was held in conjunction with the 122nd MPEG meeting to discuss requirements, open problems in genome information processing, and the solutions provided by the MPEG-G standards. Use cases representative of selective remote access with streaming, and of running the Genome Analysis Toolkit (GATK) and equivalent processing pipelines on sequencing data in MPEG-G compressed form, were also demonstrated.