# SomaticPipe Output Container Format

This document proposes a single binary container for SomaticPipe outputs. The goal is to replace scattered `.json.gz`, `.bit`, VCF-derived tables, BAM QC files, pipeline QC files, methylation summaries, CNV files, and fingerprint artifacts with one random-access file that is compact, typed, encrypted, and versioned.

Recommended extension: `.pandora`

## Design Goals

- Store all sample-level SomaticPipe results in one file.
- Preserve the current merged `Variants` model instead of forcing SNV/SV data into a flat table.
- Keep optional derived indexes columnar and fast to load with Arrow IPC.
- Keep small structured metadata simple with MessagePack.
- Allow random access to each result section without reading the full file.
- Encrypt patient-identifying result payloads with authenticated encryption.
- Make corruption and version mismatch failures explicit before decoding payloads.
- Keep the format append-friendly enough for future sections.

## File Layout

All integers in the fixed prelude are unsigned big-endian.

```text
[MAGIC: 8 bytes]          "PANDORA\0"
[VERSION: u16]            format version, initial value 1
[HEADER_LEN: u64]         compressed header length in bytes
[HEADER_CHECKSUM: 32]     BLAKE3(header_zstd)
[HEADER: msgpack + zstd]
[SECTION 0..N]
```

The header is not encrypted because readers need the section table before they can load payloads. It must not contain direct clinical results. Patient-identifying fields such as `sample_id` should be either pseudonymized or moved into an encrypted metadata section if the file may leave a controlled environment.

## Header Schema

The header is MessagePack serialized and then zstd-compressed.
```text
{
  "format": "somaticpipe.output",
  "format_version": 1,
  "producer": {
    "name": "pandora_lib_promethion",
    "pipeline": "SomaticPipe",
    "pipeline_version": "0.1.0",
    "git_commit": "...",
    "created_at": "2026-05-27T00:00:00Z"
  },
  "sample": {
    "sample_id": "pseudonym-or-local-id",
    "tumor_timepoint": "diag",
    "normal_timepoint": "constit",
    "reference": "hs1",
    "reference_digest": "blake3:..."
  },
  "compression": {
    "algorithm": "zstd",
    "level": 3
  },
  "encryption": {
    "algorithm": "AES-256-GCM",
    "key_derivation": "Argon2id",
    "salt": "",
    "aad": "fixed-prelude + canonical-section-descriptor"
  },
  "sections": [
    {
      "name": "variants",
      "kind": "bitcode",
      "compression": "zstd",
      "encryption": "AES-256-GCM",
      "offset": 0,
      "length": 0,
      "nonce": "",
      "tag": "",
      "checksum": "blake3:",
      "schema_hash": "blake3:",
      "required": true
    }
  ]
}
```

Offsets are absolute byte offsets from the start of the file. `length` is the stored section length: the ciphertext bytes, excluding the authentication tag when the tag is stored in the header. A reader must validate the magic, version, header checksum, and every offset/length bound before touching any payload, and must verify a section's checksum before decoding that section. Checking only the sections actually read keeps random access intact.

## Section Encoding

Each encrypted section uses this transform order:

```text
logical payload
  -> encode
  -> zstd compress
  -> AES-256-GCM encrypt
  -> write bytes
```

The GCM authentication tag is stored in the header. The additional authenticated data (AAD) should be a canonical representation of the fixed prelude plus the section descriptor with `offset`, `length`, `nonce`, `name`, `kind`, and `schema_hash`. This prevents moving encrypted bytes between sections or files without detection.

Checksums are BLAKE3 over the stored bytes, not the plaintext. AES-GCM tag verification already authenticates the ciphertext and the AAD; the checksum catches storage corruption before decryption and helps diagnostics.
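The document does not fix what "canonical representation" means for the AAD. The following is a minimal sketch of one possible encoding, assuming length-prefixed fields in a fixed order. The `SectionDescriptor` struct, the `put_field` and `build_aad` helpers, the field order, and the `u32` length prefix are all illustrative assumptions, not part of the proposed format.

```rust
/// Hypothetical subset of a section descriptor: only the fields that the
/// text above says should be bound into the AAD.
pub struct SectionDescriptor<'a> {
    pub name: &'a str,
    pub kind: &'a str,
    pub offset: u64,
    pub length: u64,
    pub nonce: &'a [u8],
    pub schema_hash: &'a [u8],
}

/// Append a variable-length field with a u32 big-endian length prefix so
/// that field boundaries are unambiguous.
fn put_field(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_be_bytes());
    out.extend_from_slice(bytes);
}

/// Build AAD bytes from the raw fixed prelude and one section descriptor.
/// Assumption: the field order and encoding below are one possible choice.
pub fn build_aad(prelude: &[u8], d: &SectionDescriptor) -> Vec<u8> {
    let mut aad = Vec::new();
    put_field(&mut aad, prelude);
    put_field(&mut aad, d.name.as_bytes());
    put_field(&mut aad, d.kind.as_bytes());
    // Fixed-width integers need no length prefix.
    aad.extend_from_slice(&d.offset.to_be_bytes());
    aad.extend_from_slice(&d.length.to_be_bytes());
    put_field(&mut aad, d.nonce);
    put_field(&mut aad, d.schema_hash);
    aad
}
```

The length prefixes matter: with plain concatenation, `name = "ab", kind = "c"` and `name = "a", kind = "bc"` would produce the same bytes, so two different descriptors could share one AAD.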
## Standard Sections

| Section | Required | Encoding | Notes |
| --- | --- | --- | --- |
| `variants` | yes | bitcode or MessagePack + zstd + AES-256-GCM | Canonical merged `Variants` struct. This keeps SNV, indel, SV, BND, and CNV-like VCF calls together. |
| `variant_index` | no | Arrow IPC file + zstd + AES-256-GCM | Optional flattened projection for fast GUI/search. It is derived from `variants`, not canonical. |
| `copy_number` | no | MessagePack or Arrow IPC file + zstd + AES-256-GCM | Savana `SavanaCN` / `CNSegment` absolute copy-number segments. |
| `bam_qc` | yes | MessagePack + zstd + AES-256-GCM | Tumor/normal `WGSBamStats`, including coverage, N50, karyotype, read groups, source files, and flag stats. |
| `pipe_qc` | yes | MessagePack + zstd + AES-256-GCM | SomaticPipe filtering counters, input caller counts, annotation summaries, VEP stats, tool versions. |
| `methylation` | no | Arrow IPC file + zstd + AES-256-GCM | Per-region or per-CpG 5mC outputs. |
| `fingerprint` | yes | MessagePack + zstd + AES-256-GCM | Sample identity/fingerprint data; encrypt by default because it is identifying. |
| `provenance` | yes | MessagePack + zstd + AES-256-GCM | Input file digests, command lines, container/image versions, config digest. |

The user-suggested unencrypted `fingerprint` section is risky: fingerprints are identifiers. The safer default is to encrypt it like the other clinical sections. If a public fingerprint is required for indexing, add a separate `public_index` section containing only non-reversible IDs and high-level availability flags.
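The Required column of the table can be checked mechanically once the header's section table has been decoded. A minimal sketch, assuming the section names are already available as strings; `REQUIRED_SECTIONS` and `missing_required` are hypothetical names introduced here for illustration:

```rust
/// Section names marked "Required: yes" in the Standard Sections table.
pub const REQUIRED_SECTIONS: [&str; 5] =
    ["variants", "bam_qc", "pipe_qc", "fingerprint", "provenance"];

/// Return the required section names that are absent from `present`,
/// in table order. An empty result means the file is complete.
pub fn missing_required(present: &[&str]) -> Vec<&'static str> {
    REQUIRED_SECTIONS
        .iter()
        .copied()
        .filter(|name| !present.contains(name))
        .collect()
}
```

A reader could run this right after decoding the header and fail early, before any section is decrypted.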
## Canonical Data Models

### `variants`

The canonical variant section should store the existing merged Rust model:

```text
Variants {
  data: Vec<Variant>
}

Variant {
  hash: Hash128,
  position: GenomePosition,
  reference: ReferenceAlternative,
  alternative: ReferenceAlternative,
  vcf_variants: Vec,
  annotations: Vec
}
```

This is important because the current model intentionally merges SNV, indel, SV/BND, and caller annotations into one nested representation. A flat Arrow table would lose structure or require many lossy string columns.

For version 1, use the existing `bitcode` representation when compatibility with the `.bit` output is desired. MessagePack is acceptable if cross-language readers are more important than Rust-native speed.

### `copy_number`

The CNV section should store Savana absolute copy-number segments:

```text
SavanaCN {
  segments: Vec<CNSegment>
}

CNSegment {
  chromosome: String,
  start: u64,
  end: u64,
  segment_id: String,
  bin_count: u32,
  sum_of_bin_lengths: u64,
  weight: f64,
  copy_number: f64,
  minor_allele_copy_number: Option,
  mean_baf: Option,
  no_het_snps: u32
}
```

This section is distinct from `variants` even if a caller also emits CNV-like VCF records. Segment-level absolute copy number is a continuous genome track and should not be forced into the merged variant list.

### `bam_qc`

The BAM QC section should store one `WGSBamStats` payload per analyzed BAM, keyed by role/timepoint:

```text
{
  "tumor": WGSBamStats,
  "normal": WGSBamStats,
  "additional_bams": [
    {"role": "mrd", "stats": WGSBamStats}
  ]
}
```

`WGSBamStats` already contains the useful BAM-level QC fields: total records, passing reads, mapped fraction, unmapped/duplicate/low-MAPQ counts, mapped yield, read lengths, global coverage, per-contig coverage/karyotype, N50, length histogram, read-group stats, source-file stats, and `FlagStats`.
### `pipe_qc`

The pipeline QC section should store SomaticPipe-specific metrics separately from BAM QC:

```text
{
  "somatic_pipe_stats": SomaticPipeStats,
  "variant_stats": VariantsStats,
  "annotation_stats": {
    "initial": AnnotationsStats,
    "post_filters": AnnotationsStats,
    "vep": VepStats
  },
  "caller_outputs": [
    {"caller": "ClairS Somatic", "n_input": 0, "n_after_filters": 0}
  ],
  "filter_steps": [
    {"name": "germline_or_constit", "removed": 0},
    {"name": "low_constit_depth", "removed": 0},
    {"name": "high_constit_alt", "removed": 0},
    {"name": "gnomad_and_constit_alt", "removed": 0},
    {"name": "low_entropy", "removed": 0}
  ]
}
```

The current `SomaticPipeStats` tracks the key filtering counters and input categorization. If future code records every intermediate annotation snapshot, store the snapshots here as named MessagePack entries or add optional sections such as `annotations_01_initial`.

## Optional Arrow Indexes

Arrow should be used for derived projections, not for the canonical nested `Variants` payload.

### `variant_index`

This section can be regenerated from `variants`, so it is optional. It is useful for a GUI, quick filtering, and summary browsing before loading the full nested payload.

- `variant_id: utf8`
- `contig: utf8`
- `start: int64`
- `end: int64`
- `ref: utf8`
- `alt: utf8`
- `variant_type: utf8`
- `callers: list`
- `n_callers: uint8`
- `tumor_depth: int32`
- `tumor_alt: int32`
- `normal_depth: int32`
- `normal_alt: int32`
- `vaf: float32`
- `filters: list`
- `vep_consequence: utf8`
- `vep_impact: utf8`
- `gene: utf8`
- `cosmic_count: int32`
- `gnomad_af: float32`

### `methylation`

- `contig: utf8`
- `start: int64`
- `end: int64`
- `region_id: utf8`
- `mod_code: utf8`
- `valid_coverage: int32`
- `modified_count: int32`
- `fraction_modified: float32`
- `strand: utf8`

## Versioning Rules

- Readers must reject unsupported major versions.
- New optional sections may be added without changing the major version.
- Adding nullable columns to Arrow sections is a minor-compatible change.
- Removing columns, changing types, or changing crypto/compression semantics requires a major version bump.
- Section names are stable ASCII identifiers.

## Minimal Reader Algorithm

1. Read and validate the 8-byte magic.
2. Read `VERSION`, `HEADER_LEN`, and `HEADER_CHECKSUM`.
3. Read `HEADER_LEN` bytes, verify BLAKE3, decompress zstd, decode MessagePack.
4. Check every section `offset + length` is inside the file and non-overlapping.
5. For the requested section, read bytes and verify BLAKE3 checksum.
6. Decrypt with AES-256-GCM using the section nonce, tag, and AAD.
7. Decompress zstd.
8. Decode bitcode, Arrow IPC, or MessagePack according to `kind`.

## Open Implementation Choices

- Use `bitcode` for Rust-native canonical sections that already derive `Encode` / `Decode`, especially `Variants`.
- Use `rmp-serde` for MessagePack.
- Use `zstd` crate for compression.
- Use `aes-gcm` and `argon2` crates for encryption and key derivation.
- Store Arrow IPC as file format rather than stream format for derived indexes and tabular tracks, because file format carries schema/footer metadata and is better for self-contained sections.
- Add a `public_index` section only if GUI browsing needs non-secret metadata before decryption.
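Steps 1, 2, and 4 of the minimal reader algorithm need nothing beyond the standard library. The sketch below assumes the 50-byte fixed prelude from the File Layout section and represents each section as an `(offset, length)` pair. The `Prelude` and `FormatError` types and the `parse_prelude` and `check_bounds` functions are hypothetical names, and treating the single `VERSION` field as the major version is an assumption. Checksum verification, decryption, and decoding (steps 3 and 5 to 8) are left out because they depend on external crates.

```rust
pub const MAGIC: &[u8; 8] = b"PANDORA\0";
pub const SUPPORTED_VERSION: u16 = 1;
/// magic (8) + version (2) + header_len (8) + header_checksum (32)
pub const PRELUDE_LEN: usize = 8 + 2 + 8 + 32;

#[derive(Debug, PartialEq)]
pub struct Prelude {
    pub version: u16,
    pub header_len: u64,
    pub header_checksum: [u8; 32],
}

#[derive(Debug, PartialEq)]
pub enum FormatError {
    Truncated,
    BadMagic,
    UnsupportedVersion(u16),
    SectionOutOfBounds(usize),
    SectionsOverlap(usize, usize),
}

/// Steps 1-2: validate the magic and read the big-endian prelude fields.
pub fn parse_prelude(bytes: &[u8]) -> Result<Prelude, FormatError> {
    if bytes.len() < PRELUDE_LEN {
        return Err(FormatError::Truncated);
    }
    if bytes[..8] != MAGIC[..] {
        return Err(FormatError::BadMagic);
    }
    let version = u16::from_be_bytes([bytes[8], bytes[9]]);
    if version != SUPPORTED_VERSION {
        return Err(FormatError::UnsupportedVersion(version));
    }
    let mut len = [0u8; 8];
    len.copy_from_slice(&bytes[10..18]);
    let mut header_checksum = [0u8; 32];
    header_checksum.copy_from_slice(&bytes[18..50]);
    Ok(Prelude { version, header_len: u64::from_be_bytes(len), header_checksum })
}

/// Step 4: every (offset, length) pair must end inside the file, and no two
/// sections may overlap. Indexes in the errors refer to the input order.
pub fn check_bounds(sections: &[(u64, u64)], file_len: u64) -> Result<(), FormatError> {
    let mut spans = Vec::with_capacity(sections.len());
    for (i, &(offset, length)) in sections.iter().enumerate() {
        // checked_add rejects descriptors whose end would overflow u64.
        match offset.checked_add(length) {
            Some(end) if end <= file_len => spans.push((offset, end, i)),
            _ => return Err(FormatError::SectionOutOfBounds(i)),
        }
    }
    // After sorting by start offset, overlaps can only occur between neighbours.
    spans.sort();
    for pair in spans.windows(2) {
        if pair[1].0 < pair[0].1 {
            return Err(FormatError::SectionsOverlap(pair[0].2, pair[1].2));
        }
    }
    Ok(())
}
```

Running both checks before any payload is read keeps a malformed or truncated file from ever reaching the decryption and decoding steps.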