MPEG-G Part 6

Genomic Annotation

The extensive usage of high-throughput deoxyribonucleic acid (DNA) sequencing technologies opens up new perspectives in the treatment of several diseases and enables the implementation of a new approach to healthcare known as “precision medicine”.

DNA sequencing technologies produce extremely large amounts of raw data, which are stored in different repositories worldwide. Processing, analyzing, and comparing such distributed data is fundamental to the effective use of sequencing data for clinical and scientific purposes. Standard Application Programming Interfaces (APIs) and metadata are the basis for interoperable and automated data access and processing systems that can operate efficiently on the sets of sequencing data available worldwide.

Annotations as a critical element in genomic information processing

The existing Parts 1 to 5 of the ISO/IEC 23092 (MPEG-G) standard deal with the representation of genomic information derived from the so-called primary analysis of sequencing data: sequencing reads and their qualities, their storage, and their alignment to a reference set of sequences. That is only the first step in a long series, however. The results of primary analysis are usually processed further in order to obtain higher-level information.

For instance, the alignments of RNA-sequencing reads that overlap a gene (i.e., a set of specific intervals of the reference genome) might be counted in order to establish the expression level of that gene in a given biological condition. Several conditions (and hence several sets of reads coming from different experiments) might need to be compared. In general, the process of aggregating information deduced from single reads and their alignments to the genome into more complex results is called secondary analysis.
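
The read-counting step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method defined by MPEG-G: the names Interval, overlaps and count_reads_per_gene are hypothetical, the coordinates are assumed to be 0-based and half-open, each alignment is reduced to a single interval, and the gene and read data are invented for the example.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """A region on a reference sequence (assumed 0-based, end exclusive)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def count_reads_per_gene(
    genes: Dict[str, List[Interval]], alignments: List[Interval]
) -> Dict[str, int]:
    """Count, for each gene, the alignments overlapping any of its intervals.

    An alignment touching several intervals of the same gene counts once.
    """
    counts = {name: 0 for name in genes}
    for aln in alignments:
        for name, intervals in genes.items():
            if any(overlaps(aln, iv) for iv in intervals):
                counts[name] += 1
    return counts


# Invented toy data: two genes, five aligned reads.
genes = {
    "geneA": [Interval("chr1", 100, 200), Interval("chr1", 300, 400)],
    "geneB": [Interval("chr1", 1000, 1200)],
}
alignments = [
    Interval("chr1", 150, 250),    # overlaps the first interval of geneA
    Interval("chr1", 190, 310),    # spans both geneA intervals, counted once
    Interval("chr1", 200, 300),    # falls between the geneA intervals
    Interval("chr1", 1100, 1150),  # inside geneB
    Interval("chr2", 100, 200),    # different reference sequence
]
counts = count_reads_per_gene(genes, alignments)
print(counts)
```

The nested loop is quadratic and is kept only for readability; real tools typically sort the intervals or use an interval index. The point is that the result is itself an annotation: a number attached to a set of intervals on the reference.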

In most biological studies based on sequencing protocols, the output of secondary analysis is represented as different types of annotations (meta-information), all associated with one or more intervals on the reference sequences.

At the moment, each source of information (functional annotation of the reference genome, signal intensity for ChIP-seq, coverage for DNA- or RNA-seq, databases of variants) is represented using a number of completely different formats. In addition, the precise semantics of such formats are usually left unspecified, which leaves room for different, slightly incompatible formats for each information source. This places a heavy burden on scientists working on integrative analysis: they must convert information frequently, set-up becomes complicated when sets of experiments need to be visualized together, and exchanging information is in most cases unnecessarily difficult.
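
One familiar instance of this conversion burden is the coordinate convention. As a hedged illustration, assuming the commonly documented conventions of two widely used annotation formats (BED: 0-based start, end not included; GFF3: 1-based start, end included), the same region has to be rewritten when moving between them. The function names below are hypothetical helpers written for this example.

```python
from typing import Tuple


def bed_to_gff(start: int, end: int) -> Tuple[int, int]:
    """Convert 0-based half-open coordinates (assumed BED convention)
    to 1-based closed coordinates (assumed GFF3 convention)."""
    return start + 1, end


def gff_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Inverse conversion: 1-based closed to 0-based half-open."""
    return start - 1, end


# The first 100 bases of a reference sequence in both conventions.
bed_region = (0, 100)
gff_region = bed_to_gff(*bed_region)
print(gff_region)
```

An off-by-one slip in a step like this shifts every annotation silently, which is why unspecified or loosely specified semantics make integrative analysis error-prone.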