Part 1: Transport and Storage of Genomic Information

MPEG-G specifies a digital container format for transmission and storage of the genomic data compressed according to Part 2 of the standard. In MPEG jargon the container format used for the transport of packetized data (i.e., stream) on a telecommunication network is referred to as Transport Format, while the container format used for storage on a physical medium (i.e., file) is referred to as File Format. The process of converting a stream to a file and vice versa is normative and specified in the standard.

File Format

An MPEG-G file is organized into a file header and one or more containers named Dataset Groups. Each Dataset Group contains a Dataset Group header and optional metadata containers, and encapsulates one or more Datasets. Each Dataset has a Dataset header and optional metadata containers, and carries one or more Access Units. The Access Unit is the actual container of the compressed genomic data. It includes an Access Unit header, which describes the compressed content (type of data, number of reads, genomic region covered by the compressed reads, etc.). When an MPEG-G file is constructed without Descriptor Streams, the Access Unit contains a collection of Blocks of coded information that can be decoded independently using global data at the Dataset level and, where needed, information contained in other Access Units, such as Access Units carrying the data of an MPEG-G encoded reference sequence. Otherwise, the Blocks of coded information are grouped by type, concatenated and stored as Descriptor Streams, and the index mechanism associates a given Access Unit with the corresponding collection of Blocks. In either case, each Block is compressed using the entropy coding techniques most suitable to the measured statistical properties. This nested data structure is depicted in Figure 2.

Figure 2: Key elements of the MPEG-G File Format. Multiple Dataset Groups contain multiple Datasets of sequencing data. Each Dataset is composed of Access Units containing Genomic Records pertaining to one specific Data Class. Each Access Unit is composed of Blocks of Read Descriptors.
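The nesting described above can be pictured with a minimal Python sketch. All class and field names here (MpeggFile, descriptor_type, num_reads and so on) are illustrative assumptions, not the normative syntax of the standard, and headers and metadata containers are reduced to a few fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    descriptor_type: str   # hypothetical label for the kind of descriptor
    payload: bytes         # entropy-coded bytes, treated as opaque here


@dataclass
class AccessUnit:
    data_class: str        # one Data Class per Access Unit
    num_reads: int         # header field: number of reads
    start: int             # header field: first covered reference position
    end: int               # header field: last covered reference position
    blocks: List[Block] = field(default_factory=list)


@dataclass
class Dataset:
    access_units: List[AccessUnit] = field(default_factory=list)


@dataclass
class DatasetGroup:
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class MpeggFile:
    dataset_groups: List[DatasetGroup] = field(default_factory=list)


def total_reads(f: MpeggFile) -> int:
    """Sum read counts from Access Unit headers only; no payload is decoded."""
    return sum(
        au.num_reads
        for group in f.dataset_groups
        for dataset in group.datasets
        for au in dataset.access_units
    )


# A toy file: one Dataset Group, one Dataset, two Access Units.
example = MpeggFile([
    DatasetGroup([
        Dataset([
            AccessUnit("P", 100, 1, 5000, [Block("pos", b"\x00")]),
            AccessUnit("M", 50, 1, 5000, [Block("pos", b"\x01")]),
        ])
    ])
])
print(total_reads(example))
```

The point of the sketch is that summary information lives in the Access Unit header, so a question such as "how many reads are stored" can be answered by walking the containers without touching any Block payload.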

Transport Format

In addition to the data containers defined for the File Format, MPEG-G specifies data structures supporting packetized data transport over a network. Such structures are defined both to carry the compressed genomic data and to update metadata describing the streamed content. An example of the latter type of data is indexing information used by the receiving end to enable selective access even on partially transmitted content. The Transport Format structures are instrumental in the specification of the normative process of conversion between Transport Format (i.e., an MPEG-G stream transmitted over the internet) and File Format (i.e., an MPEG-G file stored on disk).

Access Unit is the fundamental structure allowing random and selective access to genomic data.

Genomic records are compressed in MPEG-G. Thus, so that they can be searched, filtered and accessed, all structures of MPEG-G shown in Figure 1 contain metadata describing the genomic data contained in that structure and the parameters that were used to encode them.

The Access Unit is the fundamental structure allowing random access to and filtering of genomic records. Each Access Unit contains genomic records from only one Data Class (P, N, M, I, HM or U; see Table 1), and each Access Unit can be searched and filtered based on positions in the reference genome (start position, end position). Furthermore, Access Units can be filtered using aggregated features of the genomic records encapsulated in each of them, such as number of reads, alignment/nucleotide quality and number of spliced reads (Table 2), which can be accessed without decompressing the data. Like Datasets and Dataset Groups, all Access Units are secured independently. Thus, one may select which regions of the genome are made available to certain users, in order to avoid accidental findings and data leakage. Finally, when new sequence data are added to a given Dataset, they are added as new Access Units, which reduces the time required for file compression.

Figure 1. Key elements of MPEG-G File Format. Each Access Unit contains Genomic Records belonging to only one Data Class.
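As an illustration of selective access, the following Python sketch picks the Access Units to decode by looking only at header fields. The AccessUnitHeader type and the select_access_units function are hypothetical names introduced for this example, and the header is reduced to a Data Class plus an inclusive start/end range.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AccessUnitHeader:
    au_id: int        # hypothetical identifier of the Access Unit
    data_class: str   # one of "P", "N", "M", "I", "HM", "U"
    start: int        # first covered reference position (inclusive)
    end: int          # last covered reference position (inclusive)


def select_access_units(
    headers: Iterable[AccessUnitHeader],
    data_class: str,
    region_start: int,
    region_end: int,
) -> List[int]:
    """Return ids of Access Units of one Data Class that overlap a region."""
    return [
        h.au_id
        for h in headers
        if h.data_class == data_class
        and h.start <= region_end      # starts before the region ends
        and h.end >= region_start      # ends after the region starts
    ]


headers = [
    AccessUnitHeader(0, "P", 1, 1000),
    AccessUnitHeader(1, "P", 1001, 2000),
    AccessUnitHeader(2, "M", 1, 1000),
    AccessUnitHeader(3, "P", 2001, 3000),
]
selected = select_access_units(headers, "P", 900, 1500)
print(selected)  # [0, 1]
```

Only the selected Access Units would then need to be decrypted and decompressed; the others are skipped on the basis of their headers alone.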

Table 1. Six different data classes defined in MPEG-G, SR – Sequence Reads, ref. – reference genome, InDel – Insertion or deletion

Part 2: Coding of Genomic Information

Genomic Records are classified into six Data Classes according to the result of the primary alignment(s) of their reads against one or more reference sequences as shown in Table 1. Records are classified according to the types of mismatches with respect to the reference sequences used for alignment.
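The classification can be sketched in Python as below. Because the contents of Table 1 are not reproduced in this text, the rules are an assumption based on the usual description of the six classes: P for perfect matches, N for mismatches involving only unknown bases, M for substitutions, I for insertions or deletions, HM for pairs in which only one read is mapped, and U for unmapped reads. The edit labels "n", "sub", "ins" and "del" are hypothetical.

```python
from typing import List, Optional


def classify(mapped: bool, mate_mapped: Optional[bool], edits: List[str]) -> str:
    """Assign a Data Class to a simplified alignment result.

    mapped      -- whether this read aligned to the reference
    mate_mapped -- same for the mate, or None for single-end reads
    edits       -- hypothetical labels: "n", "sub", "ins", "del"
    """
    if mate_mapped is not None and mapped != mate_mapped:
        return "HM"                      # only one read of the pair is mapped
    if not mapped:
        return "U"                       # nothing aligned
    if any(e in ("ins", "del") for e in edits):
        return "I"                       # at least one insertion or deletion
    if "sub" in edits:
        return "M"                       # substitutions, no indels
    if "n" in edits:
        return "N"                       # only unknown-base mismatches
    return "P"                           # perfect match


print(classify(True, True, []))              # P
print(classify(True, True, ["n", "sub"]))    # M
print(classify(True, False, []))             # HM
```

The checks run from the most permissive class to the most restrictive one, so a record with both a substitution and a deletion lands in class I rather than M.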

File Format

To further improve compression efficiency, the information contained in the clustered Genomic Records is split across Descriptor Streams. This split allows encoding parameters to be tailored to the statistical properties of each Descriptor Stream.

Compression modes for raw sequencing data
Raw sequencing data can be encoded according to two different approaches, depending on the application at hand:

  • High compression ratio and indexing
  • Low latency

Compression modes for aligned reads
Genomic sequence reads mapped onto reference sequences can be compressed following two approaches:

  • Reference-based compression
  • Reference-free compression

Compression modes for quality values

Due to their higher entropy and larger alphabet, quality values have proven more difficult to compress than the reads [16, 11]. In addition, there is evidence that quality values are inherently noisy, and downstream applications that use them do so in varying heuristic manners.

Compression modes for read identifiers

Read identifiers are broken down into a series of tokens which can be of three main types: strings, digits and single characters. A read identifier is represented as a set of differences and matches with respect to one of the previously decoded Read Identifiers. This approach does not rely on any sequencing manufacturer implementation and only assumes that within the same sequencing run the structure of read identifiers is mostly constant.
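A small Python sketch of this idea follows. It is not the normative token set or syntax of the standard: the regular expression, the operation names MATCH, DELTA and NEW, and the choice to compare tokens position by position against the immediately preceding identifier are all assumptions made for illustration.

```python
import re
from typing import List, Tuple, Union

# Three token kinds: runs of digits, runs of letters, any other single character.
TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|.")

Op = Tuple[str, Union[None, int, str]]


def tokenize(read_id: str) -> List[str]:
    return TOKEN_RE.findall(read_id)


def encode_against(previous: List[str], current: List[str]) -> List[Op]:
    """Describe each token of `current` relative to the same slot in `previous`."""
    ops: List[Op] = []
    for i, tok in enumerate(current):
        prev = previous[i] if i < len(previous) else None
        if tok == prev:
            ops.append(("MATCH", None))
        elif (prev is not None and tok.isdigit() and prev.isdigit()
              and str(int(tok)) == tok):          # skip zero-padded numbers
            ops.append(("DELTA", int(tok) - int(prev)))
        else:
            ops.append(("NEW", tok))
    return ops


def decode_against(previous: List[str], ops: List[Op]) -> List[str]:
    """Rebuild the token list from the previous identifier and the operations."""
    out: List[str] = []
    for i, (kind, value) in enumerate(ops):
        if kind == "MATCH":
            out.append(previous[i])
        elif kind == "DELTA":
            out.append(str(int(previous[i]) + value))
        else:
            out.append(value)
    return out


prev_id = "SRR001.1 HWI:7:12:100"
cur_id = "SRR001.2 HWI:7:12:164"
ops = encode_against(tokenize(prev_id), tokenize(cur_id))
restored = "".join(decode_against(tokenize(prev_id), ops))
print(ops[3], ops[11], restored == cur_id)
```

In this example ten of the twelve tokens are plain matches and the remaining two are small numeric differences, which is the kind of highly repetitive output an entropy coder handles well.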

Compression modes for reference sequences

MPEG-G supports the use of reference sequences both in the FASTA format and in the MPEG-G compressed format. The reference sequences can also be embedded as Datasets within the same MPEG-G file. Optionally, external reference sequences (i.e., sequences that are not included in the bitstream) can be used. MPEG-G specifies how external reference sequences can be identified unambiguously using a URI, checksums, etc.
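To illustrate checksum-based identification of an external reference, here is a minimal Python sketch. The choice of SHA-256, the normalization step (whitespace removed, upper case) and the function names are assumptions for this example; the standard defines which checksum algorithms and which sequence representation actually apply.

```python
import hashlib


def sequence_checksum(sequence: str) -> str:
    """Checksum of a sequence after a simple, assumed normalization."""
    normalized = "".join(sequence.split()).upper()   # drop line breaks, fold case
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def matches_expected(sequence: str, expected_checksum: str) -> bool:
    """True if a locally available reference is the one the bitstream refers to."""
    return sequence_checksum(sequence) == expected_checksum


# The checksum stored alongside the reference URI (computed here for the demo).
expected = sequence_checksum("ACGTACGT")

print(matches_expected("acgt\nACGT", expected))   # same bases, different layout
print(matches_expected("ACGTACGA", expected))     # one base differs
```

A decoder that resolves the URI to a local copy can use such a comparison to refuse to decode against a reference that differs from the one used at encoding time.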

Entropy coding
Storing different types of data in separate Descriptor Streams allows for a significantly higher compression effectiveness. The different statistical properties of each descriptor can be exploited to define different source models to be used for entropy coding. The increased compression efficiency is generated by the adoption of the appropriate context adaptive probability models according to the statistical properties of each source model.
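The benefit of separating descriptors can be seen with a toy Python example that measures the empirical entropy of each stream on its own. The record fields (pos_delta, length, strand) are hypothetical descriptors chosen for illustration, and zeroth-order Shannon entropy stands in for the context-adaptive models the text refers to.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy_bits_per_symbol(symbols: Sequence[Hashable]) -> float:
    """Empirical zeroth-order Shannon entropy of a symbol sequence."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def split_into_streams(records: List[Dict[str, int]]) -> Dict[str, List[int]]:
    """Collect the values of each descriptor into its own stream."""
    streams: Dict[str, List[int]] = {}
    for record in records:
        for name, value in record.items():
            streams.setdefault(name, []).append(value)
    return streams


records = [
    {"pos_delta": 3, "length": 100, "strand": 0},
    {"pos_delta": 7, "length": 100, "strand": 1},
    {"pos_delta": 1, "length": 100, "strand": 0},
    {"pos_delta": 12, "length": 100, "strand": 1},
]
streams = split_into_streams(records)
for name, values in streams.items():
    print(name, entropy_bits_per_symbol(values))
```

In this toy data the length stream is constant and costs nothing to code, the strand stream needs one bit per symbol and the position stream needs two, so each stream calls for a different probability model. Interleaving the three in one sequence would hide these differences from the coder.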

Decoding process

The MPEG-G specification not only defines the syntax and semantics of the compressed genome sequencing data, but also the deterministic decoding process.
The normative input of an MPEG-G decoding process is a concatenation of data structures called Data Units. Data Units can be of three types according to the type of conveyed data. A Data Unit of type 0 encapsulates the decoded representation of one or more reference sequences, a Data Unit of type 1 contains parameters used during the decoding process in a structure called Parameter Set, and a Data Unit of type 2 contains one Access Unit.
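A decoder front end that sorts incoming Data Units by type might look like the following Python sketch. The numeric type codes 0, 1 and 2 come from the description above; the enum member names, the (type, payload) tuple representation and the route_data_units function are assumptions, and the real bitstream syntax of a Data Unit is not modeled.

```python
from enum import IntEnum
from typing import Dict, Iterable, List, Tuple


class DataUnitType(IntEnum):
    RAW_REFERENCE = 0   # decoded representation of reference sequence(s)
    PARAMETER_SET = 1   # parameters used during decoding
    ACCESS_UNIT = 2     # exactly one Access Unit


def route_data_units(
    data_units: Iterable[Tuple[int, bytes]],
) -> Dict[DataUnitType, List[bytes]]:
    """Group the payloads of a concatenation of Data Units by their type."""
    routed: Dict[DataUnitType, List[bytes]] = {t: [] for t in DataUnitType}
    for type_code, payload in data_units:
        # DataUnitType(...) raises ValueError for a type outside 0..2.
        routed[DataUnitType(type_code)].append(payload)
    return routed


stream = [(1, b"params"), (0, b"ACGT"), (2, b"au-0"), (2, b"au-1")]
routed = route_data_units(stream)
print({t.name: len(p) for t, p in routed.items()})
```

Keeping Parameter Sets and references separate from Access Units mirrors the dependency order: an Access Unit can only be decoded once the parameters, and for reference-based coding the reference, it relies on are available.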

MPEG-G Standard
DNA Sequence Database Storage

13:00 – 18:00, 13th October 2018

Shenzhen (CN)

A workshop on applications of genomic information processing will be held on 13th October 2018 one day after the 124th MPEG meeting.

The workshop intends to provide an overview of MPEG-G, the new ISO standard on the compression of and optimized access to genomic information, and of its impact on the relevant industry, the various related standardization initiatives, use cases, sequencing technology evolution and perspectives for standardization in other –omics fields.

Specifically the workshop addresses:

  • An overview of the ISO genomic compression standard and its new features and performance
  • The challenges for the generation and management of very large volumes of genome sequencing data
  • The status and future perspectives of sequencing technology and genomic data generation
  • The vision of genomic information storage and processing on the cloud
  • The vision of further standardization objectives in the –omics fields

The workshop is open to the public and interested parties who want to learn more on the perspectives of genomic data processing applications and on new technologies for the processing of genome sequencing data.

Registration is free of charge. To register (only for logistic purposes), please send an email to Massimo Ravasi.

Date: 13th October (Saturday), 2018

Program: 13:00 – 18:00

Venue: 2F Function Room, Tencent Building, No. 10000 Shennan Avenue, NanShan District, Shenzhen, Guangdong province, China

 广省深圳市南山区深南大道10000腾讯大厦二楼多功能

 Organizing Committee:

Joern Ostermann (LUH), Rongshan Yu (AGINOME Scientific), Claudio Alberti (GenomSys), Tom Paridaens (imec and UGent)

Preliminary Program

Start | End | What | Who
12:30 | 13:00 | Registration |
13:00 | 13:10 | Welcome & workshop goals | Leonardo Chiariglione
13:10 | 13:40 | “An overview of the MPEG-G standard for the compression and processing of genomic sequencing data” | Marco Mattavelli (EPFL, Switzerland)
13:40 | 14:10 | “An overview of standardization initiatives on genomic data” | Yong Zhang (ISO/TC 276/ WG2 & WG 5 Convenor)
14:10 | 14:40 | “GSA: Genome Sequence Archive, in China” | Yanqing Wang (BIG Data Center, BIG, CAS)
14:40 | 14:50 | Short presentation of demos | Demonstrators companies
14:50 | 15:20 | Demo session and Coffee Break |
15:20 | 15:50 | “State-of-the-art and future of NGS, a standard perspective” | Ming NI (BGI-Shenzhen and MGI)
15:50 | 16:20 | “Constructing an open ecosystem for bioinformatics and genomic big data” | Chen Shifu (Haplox)
16:20 | 16:50 | “Practice and Challenges of 20,000 human WGS data analysis on BGI Online” | Kang FANG (BGI-Online, BGI)
16:50 | 17:20 | Panel discussion, Q&A and concluding remarks | All speakers
17:20 | 18:00 | Demo session continues |

Demonstrations of genome sequencing data processing prototypes and products

Alongside the workshop, it will be possible to show demos and to present prototypes and products related to genome sequencing data processing, analytics, compression and storage to workshop participants.

MPEG Genome Compression

At its 114th meeting, MPEG has progressed its exploration of genome compression toward formal standardization. The 114th meeting included a seminar to collect additional perspectives on genome data standardization, and a review of technologies that had been submitted in response to a Call for Evidence. The purpose of that CfE, which had been previously issued at the 113th meeting, was to assess whether new technologies could achieve better performance in terms of compression efficiency compared with currently used formats.

In all, 22 tools were evaluated. The results demonstrate that by integrating several of these tools, it is possible to improve compression by up to 27% with respect to the best state-of-the-art tool. With this evidence, MPEG has issued a Draft Call for Proposals (CfP) on Genomic Information Representation and Compression. The Draft CfP targets technologies for compressing raw and aligned genomic data and metadata for efficient storage and analysis.

As demonstrated by the results of the Call for Evidence, improved lossless compression of genomic data beyond the current state-of-the-art tools is achievable by combining and further developing them. The call also addresses lossy compression of the metadata, which make up the dominant volume of the resulting compressed data. The Draft CfP seeks lossy compression technologies that can provide higher compression performance without affecting the accuracy of analysis application results. Responses to the Genomic Information Representation and Compression CfP will be evaluated prior to the 116th MPEG meeting in October 2016 (in Chengdu, China). An ad hoc group, co-chaired by Martin Golebiewski, convenor of Working Group 5 of ISO TC 276 (i.e. the ISO committee for Biotechnology), and Dr. Marco Mattavelli (of MPEG), will coordinate the receipt and pre-analysis of submissions received in response to the call. Detailed results of the CfE and the presentations shown during the seminar will soon be available as MPEG documents N16137 and N16147 at: http://mpeg.chiariglione.org/meetings/114.

Genomic Data Compression

MPEG issues Call for Evidence (CfE) for Genome Compression and Storage

At its 113th meeting, MPEG has taken its first formal step toward leveraging its compression expertise to code an entirely new kind of essential information, i.e. the single recipe that describes each one of us as an individual: the human genome. A sequenced human genome comprises DNA sequences of roughly 3 billion DNA base pairs, which make up the genetic information within each human cell. It is fundamentally the complete set of our hereditary information.

To aid in the representation and storage of this unique information, MPEG has issued a Call for Evidence (CfE) on Genome Compression and Storage with the goal to assess the performance of new technologies for the efficient compression of genomic information when compared to currently used file formats. This is vitally important because the amount of genomic and related information from a sequenced genome can be as high as several Tbytes (trillion bytes).

Additional purposes of the call are to:

  • become aware of which additional functionalities (e.g. non-sequential access, lossy compression efficiency, etc.) are provided by these new technologies
  • collect information that may be used in drafting a future Call for Proposals

Responses to the CfE will be evaluated during the 114th MPEG meeting in February 2016.

Detailed information, including how to respond to the CfE, will soon be available as documents N15740 and N15739 at the 113th meeting website.