MPEG Genome Compression

At its 114th meeting, MPEG progressed its exploration of genome compression toward formal standardization. The meeting included a seminar to collect additional perspectives on genome data standardization and a review of technologies submitted in response to a Call for Evidence (CfE). The purpose of the CfE, issued at the 113th meeting, was to assess whether new technologies could achieve better compression efficiency than currently used formats.

In all, 22 tools were evaluated. The results demonstrate that by integrating several of these tools, it is possible to improve compression by up to 27% relative to the best state-of-the-art tool. With this evidence, MPEG has issued a Draft Call for Proposals (CfP) on Genomic Information Representation and Compression. The Draft CfP targets technologies for compressing raw and aligned genomic data and metadata for efficient storage and analysis.

As demonstrated by the results of the Call for Evidence, lossless compression of genomic data can be improved beyond the current state-of-the-art tools by combining and further developing them. The call also addresses lossy compression of the metadata, which make up the dominant volume of the resulting compressed data. The Draft CfP seeks lossy compression technologies that can provide higher compression performance without affecting the accuracy of analysis application results.

Responses to the Genomic Information Representation and Compression CfP will be evaluated prior to the 116th MPEG meeting in October 2016 (in Chengdu, China). An ad hoc group, co-chaired by Martin Golobiewski, convenor of Working Group 5 of ISO TC 276 (i.e. the ISO committee for Biotechnology), and Dr. Marco Mattavelli (of MPEG), will coordinate the receipt and pre-analysis of submissions received in response to the call. Detailed results of the CfE and the presentations shown during the seminar will soon be available as MPEG documents N16137 and N16147 at: http://mpeg.chiariglione.org/meetings/114.

Genomic Information Representation

A seminar on Genome Compression Standardization was held on 23 February 2016 during the 114th MPEG meeting in San Diego.

The purpose of this seminar was twofold: to raise awareness of the need for new approaches to genome compression that can efficiently handle the increasing flood of sequencing data, and to collect requirements from stakeholders in the different fields involved in the acquisition and processing of genome data.

The main topics covered by the seminar presentations were:

  • New approaches, tools and algorithms to compress genome sequence data
  • Genome compression and genomic medicine applications
  • Objectives and issues of quality score compression and its impact on downstream analysis applications

Venue:
San Diego Marriott La Jolla
4240 La Jolla Village Drive
La Jolla, CA 92037 United States
(see the 114th MPEG meeting website for further details)

Organizing Committee:
Marco Mattavelli (EPFL), Joern Ostermann (TNT), Mikel Hernaez (Stanford University)

Program

Speaker | Title
Marco Mattavelli, chairman of MPEG Ad-hoc group on Genome Compression and Storage | MPEG experience in Genomic Information Representation
Come Raczy, Illumina | Compression
Bayo Lau, Bina Technology/Roche Sequencing | Practical Considerations of Genomic Data
Cenk Sahinalp, Simon Fraser University Vancouver | High Throughput Sequencing Compression – State of the Art
Yong Zhang, Co-convenor of TC 276 “Biotechnology” WG 5 “Data Processing” | The Time of Peta-byte Is Coming. Challenges and Opportunities in Big BioData
Joern Ostermann, MPEG Requirements Chair | MPEG workplan for Genome Compression Standardization

The presentations shown during the seminar can be found on the Documents page.