The extensive usage of high-throughput deoxyribonucleic acid (DNA) sequencing technologies opens up new perspectives in the treatment of several diseases and enables the implementation of a new approach to healthcare known as “precision medicine”.
DNA sequencing technologies produce extremely large amounts of raw data, which are stored in different repositories worldwide. Processing, analyzing, and comparing such distributed data is fundamental to using sequencing data effectively for clinical and scientific purposes. Standard Application Programming Interfaces (APIs) and metadata are the basis for interoperable, automated data access and processing systems that can operate efficiently on the sequencing data sets available worldwide.
The third part of the MPEG-G series of standards specifies information metadata, interoperability with the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats, protection metadata, and programming interfaces for accessing genomic information. The main goals are to enable (controlled) access to MPEG-G data from external applications and to add metadata to compressed genomic information.
Enforcement of privacy rules
Data encoded in an MPEG-G file can be linked to multiple owner-defined privacy rules, which impose restrictions on data access and usage. The MPEG-G file format carries protection information in specific containers available at most levels of the MPEG-G hierarchy, including the Dataset Group, Dataset, Descriptor Stream and Access Unit levels. In addition to the privacy rules that apply to the information they refer to, these protection containers provide mechanisms to manage the confidentiality and integrity of that information. The privacy rules themselves, however, are available only at the Dataset Group and Dataset levels. A privacy rule could specify, for example, access control to specific regions related to identifying a predisposition to Alzheimer's disease. Combining encryption techniques with privacy rules protects the genomic data from unauthorized access, so only users authorized by the rules can perform operations on protected regions.
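As a rough illustration of how region-based rules of this kind could be evaluated, the following Python sketch models a privacy rule as a protected genomic interval plus a set of authorized users. The names `PrivacyRule` and `is_access_allowed`, the user identifiers, and the coordinates are hypothetical placeholders; they are not taken from ISO/IEC 23092-3, which defines its own rule syntax.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PrivacyRule:
    """Hypothetical rule: only the listed users may access the given region."""
    sequence: str   # reference sequence name (placeholder)
    start: int      # inclusive start position
    end: int        # exclusive end position
    authorized_users: frozenset = field(default_factory=frozenset)

    def overlaps(self, sequence, start, end):
        # Two half-open intervals overlap when each starts before the other ends.
        return sequence == self.sequence and start < self.end and end > self.start


def is_access_allowed(user, sequence, start, end, rules):
    """Deny access if any rule covering the region does not authorize the user.

    Regions not covered by any rule are treated as unrestricted (an assumption).
    """
    for rule in rules:
        if rule.overlaps(sequence, start, end) and user not in rule.authorized_users:
            return False
    return True


# Illustrative rule protecting one region for a single authorized user.
rules = [PrivacyRule("chr19", 44_900_000, 44_910_000, frozenset({"clinician_a"}))]
```

In this sketch a request such as `is_access_allowed("researcher_b", "chr19", 44_905_000, 44_906_000, rules)` is denied because the requested interval overlaps the protected region and the user is not listed in the rule. In a real system the denial would be enforced by withholding the decryption keys rather than by a check in application code.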
Encryption of sequencing data and metadata
MPEG-G supports the encryption of genomic information at different levels in its hierarchy of logical data structures. The protection information specifies how the data structures at the same level, and the protection information containers of a layer immediately below, are encrypted. Encryption is not only possible for the “low-level” detailed sequencing data and metadata included in Genomic Records, but also for the “high-level” metadata available for the Dataset Group and Dataset hierarchy levels.
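The relationship between levels described above can be sketched as a small lookup. This is only a simplified model: the `LEVELS` list reduces the hierarchy to a single chain of three levels, and the function name `described_by` is hypothetical. How the top-level protection container is handled is not stated here, so the sketch returns `None` for that case.

```python
# Simplified, assumed chain of hierarchy levels, from top to bottom.
LEVELS = ["Dataset Group", "Dataset", "Access Unit"]


def described_by(level, item):
    """Return the level whose protection information describes how `item` is encrypted.

    `item` is "data" for the data structures at `level`, or "protection"
    for the protection information container at `level`.
    """
    index = LEVELS.index(level)
    if item == "data":
        # Data structures are described by protection information at the same level.
        return level
    if item == "protection":
        # A protection container is described by the level immediately above it.
        return LEVELS[index - 1] if index > 0 else None
    raise ValueError(f"unknown item kind: {item!r}")
```

For example, `described_by("Dataset", "protection")` returns `"Dataset Group"`, reflecting that the protection information at one level covers the protection containers of the level immediately below.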
The third part of MPEG-G, Coding of Genomic Information (ISO/IEC 23092-3), was first published in March 2020. The current version of the standard, ISO/IEC 23092-3:2022, Part 3: Metadata and application programming interfaces (APIs), provides specifications for the following:
- metadata storage and interpretation for the different encapsulation levels as specified in ISO/IEC 23092-1 (in Clause 6);
- protection elements providing confidentiality, integrity and privacy rules at the different encapsulation levels specified in ISO/IEC 23092-1 (in Clause 7);
- how to associate auxiliary fields to encoded reads (in Clause 8);
- mechanisms for backward compatibility with existing SAM content, and exportation to this format (in Annex C);
- interfaces to access genomic information coded in compliance with ISO/IEC 23092-1 and ISO/IEC 23092-2 (in subclause 8.1).
The revised standard was published in October 2022 and is now available.