Part 2: Coding of Genomic Information
Genomic Records are classified into six Data Classes according to the result of the primary alignment(s) of their reads against one or more reference sequences as shown in Table 1. Records are classified according to the types of mismatches with respect to the reference sequences used for alignment.
File Format
To further improve compression efficiency, the information contained in the clustered Genomic Records is split across Descriptor Streams. The concept of splitting the information contained in the clustered Genomic Records into Descriptor Streams allows tailoring encoding parameters according to the statistical properties of each Descriptor Stream.
Compression modes for raw sequencing data
Raw sequencing data can be encoded according to two different approaches, depending on application at hand:
- High compression ratio and indexing
- Low latency
Compression modes for aligned reads
Genomic sequence reads mapped onto reference sequences can be compressed following two approaches:refere
- Reference-based compression
- Reference-free compression
Compression modes for quality values
Due to their higher entropy and larger alphabet, quality values have proven more difficult to compress than the reads [16, 11]. In addition, there is evidence that quality values are inherently noisy, and downstream applications that use them do so in varying heuristic manners.
Compression modes for read identifiers
Read identifiers are broken down into a series of tokens which can be of three main types: strings, digits and single characters. A read identifier is represented as a set of differences and matches with respect to one of the previously decoded Read Identifiers. This approach does not rely on any sequencing manufacturer implementation and only assumes that within the same sequencing
run the structure of read identifiers is mostly constant.
Compression modes for reference sequences
MPEG-G supports the use of reference sequences both in the FASTA format and in the MPEG-G compressed format. The reference sequences can also be embedded as Datasets within the same MPEG-G file. Optionally, external reference sequences (i.e., sequences that are not included in the bitstream) can be used. MPEG-G specifies how external reference sequences can be identified unambiguously using a URI, checksums, etc.
Entropy coding
Storing different types of data in separate Descriptor Streams allows for a significantly higher compression effectiveness. The different statistical properties of each descriptor can be exploited to define different source models to be used for entropy coding. The increased compression efficiency is generated by the adoption of the appropriate context adaptive probability models according to the statistical properties of each source model.
Decoding process
The MPEG-G specification not only defines the syntax and semantics of the compressed genome sequencing data, but also the deterministic decoding process.
The normative input of an MPEG-G decoding process is a concatenation of data structures called Data Units. Data Units can be of three types according to the type of conveyed data. A Data Unit of type 0 encapsulates the decoded representation of one or more reference sequences, a Data Unit of type 1 contains parameters used during the decoding process in a structure called Parameter Set, and a Data Unit of type 2 contains one Access Unit.