SomaticPipe Output Container Format

This document proposes a single binary container for SomaticPipe outputs. The goal is to replace scattered .json.gz, .bit, VCF-derived tables, BAM QC files, pipeline QC files, methylation summaries, CNV files, and fingerprint artifacts with one random-access file that is compact, typed, encrypted, and versioned.

Recommended extension: .pandora

Design Goals

Store all sample-level SomaticPipe results in one file.
Preserve the current merged Variants model instead of forcing SNV/SV data into a flat table.
Keep optional derived indexes columnar and fast to load with Arrow IPC.
Keep small structured metadata simple with MessagePack.
Allow random access to each result section without reading the full file.
Encrypt patient-identifying result payloads with authenticated encryption.
Make corruption and version mismatch failures explicit before decoding payloads.
Keep the format append-friendly enough for future sections.

File Layout

All integers in the fixed prelude are unsigned big-endian.

[MAGIC: 8 bytes]          "PANDORA\0"
[VERSION: u16]            format version, initial value 1
[HEADER_LEN: u64]         compressed header length in bytes
[HEADER_CHECKSUM: 32]     BLAKE3(header_zstd)
[HEADER: msgpack + zstd]
[SECTION 0..N]
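
As an illustration of the layout above, the fixed prelude can be parsed with the Rust standard library alone. This is a minimal sketch, not the project's reader: the names `Prelude`, `PreludeError`, and `parse_prelude` are hypothetical, and the 50-byte size simply follows from 8 + 2 + 8 + 32.

```rust
/// Size of the fixed prelude: 8 (magic) + 2 (version) + 8 (header length) + 32 (checksum).
pub const PRELUDE_LEN: usize = 50;
pub const MAGIC: &[u8; 8] = b"PANDORA\0";

/// Hypothetical in-memory view of the fixed prelude.
#[derive(Debug, PartialEq, Eq)]
pub struct Prelude {
    pub version: u16,
    pub header_len: u64,
    pub header_checksum: [u8; 32],
}

#[derive(Debug, PartialEq, Eq)]
pub enum PreludeError {
    TooShort,
    BadMagic,
}

/// Parses the first 50 bytes of a file; all integers are unsigned big-endian.
pub fn parse_prelude(bytes: &[u8]) -> Result<Prelude, PreludeError> {
    if bytes.len() < PRELUDE_LEN {
        return Err(PreludeError::TooShort);
    }
    if bytes[..8] != MAGIC[..] {
        return Err(PreludeError::BadMagic);
    }
    let version = u16::from_be_bytes([bytes[8], bytes[9]]);
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[10..18]);
    let header_len = u64::from_be_bytes(len_bytes);
    let mut header_checksum = [0u8; 32];
    header_checksum.copy_from_slice(&bytes[18..50]);
    Ok(Prelude { version, header_len, header_checksum })
}
```

The sketch only decodes the fields. Comparing `header_checksum` against BLAKE3 of the compressed header is a separate step that needs a BLAKE3 implementation.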

The header is not encrypted because readers need the section table before they can load payloads. It must not contain direct clinical results. Patient-identifying fields such as sample_id should be either pseudonymized or moved into an encrypted metadata section if the file may leave a controlled environment.

Header Schema

The header is MessagePack serialized and then zstd-compressed.

{
  "format": "somaticpipe.output",
  "format_version": 1,
  "producer": {
    "name": "pandora_lib_promethion",
    "pipeline": "SomaticPipe",
    "pipeline_version": "0.1.0",
    "git_commit": "...",
    "created_at": "2026-05-27T00:00:00Z"
  },
  "sample": {
    "sample_id": "pseudonym-or-local-id",
    "tumor_timepoint": "diag",
    "normal_timepoint": "constit",
    "reference": "hs1",
    "reference_digest": "blake3:..."
  },
  "compression": {
    "algorithm": "zstd",
    "level": 3
  },
  "encryption": {
    "algorithm": "AES-256-GCM",
    "key_derivation": "Argon2id",
    "salt": "<base64 16-32 bytes>",
    "aad": "fixed-prelude + canonical-section-descriptor"
  },
  "sections": [
    {
      "name": "variants",
      "kind": "bitcode",
      "compression": "zstd",
      "encryption": "AES-256-GCM",
      "offset": 0,
      "length": 0,
      "nonce": "<base64 12 bytes>",
      "tag": "<base64 16 bytes>",
      "checksum": "blake3:<ciphertext digest>",
      "schema_hash": "blake3:<logical schema digest>",
      "required": true
    }
  ]
}

Offsets are absolute byte offsets from the start of the file. length is the stored section length in bytes: it covers the ciphertext but not the GCM tag when the tag is stored in the header. Before decoding anything, a reader must validate the magic, version, header checksum, and every offset/length bound. It must also verify the checksum of each section it reads before decrypting that section, so random access does not require reading the whole file.
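
The offset/length checks described above can be sketched as follows. This is an illustrative helper under stated assumptions, not the project's code: `SectionSpan`, `BoundsError`, and `validate_spans` are hypothetical names, and `payload_start` is assumed to be the first byte after the compressed header (50 + HEADER_LEN).

```rust
/// Hypothetical minimal view of a section descriptor: only the fields needed for bounds checks.
#[derive(Debug, Clone)]
pub struct SectionSpan {
    pub name: String,
    pub offset: u64,
    pub length: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BoundsError {
    Overflow(String),
    InsideHeader(String),
    PastEndOfFile(String),
    Overlap(String, String),
}

/// Checks that every span lies in [payload_start, file_len) and that no two spans overlap.
pub fn validate_spans(
    spans: &[SectionSpan],
    payload_start: u64,
    file_len: u64,
) -> Result<(), BoundsError> {
    let mut ranges: Vec<(u64, u64, &str)> = Vec::with_capacity(spans.len());
    for s in spans {
        // offset + length must not wrap around u64.
        let end = s
            .offset
            .checked_add(s.length)
            .ok_or_else(|| BoundsError::Overflow(s.name.clone()))?;
        if s.offset < payload_start {
            return Err(BoundsError::InsideHeader(s.name.clone()));
        }
        if end > file_len {
            return Err(BoundsError::PastEndOfFile(s.name.clone()));
        }
        ranges.push((s.offset, end, s.name.as_str()));
    }
    // After sorting by (start, end), only neighbouring spans need to be compared.
    ranges.sort_by_key(|r| (r.0, r.1));
    for pair in ranges.windows(2) {
        if pair[1].0 < pair[0].1 {
            return Err(BoundsError::Overlap(pair[0].2.to_string(), pair[1].2.to_string()));
        }
    }
    Ok(())
}
```

Running this once over the whole section table, before any payload is read, keeps later seeks and allocations within known limits.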

Section Encoding

Each encrypted section uses this transform order:

logical payload -> encode -> zstd compress -> AES-256-GCM encrypt -> write bytes

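The GCM authentication tag is stored in the header. The additional authenticated data (AAD) should be a canonical, unambiguous byte representation of the fixed prelude plus the section descriptor fields offset, length, nonce, name, kind, and schema_hash. Because the header itself stores each tag, the prelude portion of the AAD can only cover fields that do not depend on the header contents (MAGIC and VERSION); HEADER_LEN and HEADER_CHECKSUM are computed after the tags exist and cannot be included. Binding the AAD this way prevents moving encrypted bytes between sections or files without detection.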
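
One possible canonical AAD encoding is sketched below. The document does not fix a byte layout, so everything here is an assumption made for illustration: the `SectionDescriptor` struct, the `build_aad` function, the field order, and the 4-byte big-endian length prefix on variable-length fields. Which prelude bytes to bind is left to the caller.

```rust
/// Hypothetical subset of a section descriptor: the fields bound into the AAD.
#[derive(Debug, Clone, Copy)]
pub struct SectionDescriptor<'a> {
    pub name: &'a str,
    pub kind: &'a str,
    pub offset: u64,
    pub length: u64,
    pub nonce: [u8; 12],
    pub schema_hash: &'a str,
}

/// Appends a variable-length field with a u32 big-endian length prefix,
/// so that adjacent fields cannot be confused with one another.
/// (Fields are assumed to be shorter than 4 GiB.)
fn put_field(out: &mut Vec<u8>, field: &[u8]) {
    out.extend_from_slice(&(field.len() as u32).to_be_bytes());
    out.extend_from_slice(field);
}

/// Builds the AAD bytes for one section (assumed layout, not a specified one).
pub fn build_aad(prelude: &[u8], d: &SectionDescriptor) -> Vec<u8> {
    let mut aad = Vec::new();
    put_field(&mut aad, prelude);
    put_field(&mut aad, d.name.as_bytes());
    put_field(&mut aad, d.kind.as_bytes());
    // Fixed-width fields need no length prefix.
    aad.extend_from_slice(&d.offset.to_be_bytes());
    aad.extend_from_slice(&d.length.to_be_bytes());
    aad.extend_from_slice(&d.nonce);
    put_field(&mut aad, d.schema_hash.as_bytes());
    aad
}
```

The length prefixes are what make the encoding unambiguous: without them, the pair ("ab", "c") for name and kind would serialize to the same bytes as ("a", "bc").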

Checksums are BLAKE3 over the stored bytes (the ciphertext), not the plaintext. AES-GCM already authenticates the ciphertext and AAD during decryption; the checksum catches storage corruption before decryption, without needing the key, and helps diagnostics.

Standard Sections

| Section | Required | Encoding | Notes |
|---|---|---|---|
| `variants` | yes | bitcode or MessagePack + zstd + AES-256-GCM | Canonical merged `Variants` struct. This keeps SNV, indel, SV, BND, and CNV-like VCF calls together. |
| `variant_index` | no | Arrow IPC file + zstd + AES-256-GCM | Optional flattened projection for fast GUI/search. It is derived from `variants`, not canonical. |
| `copy_number` | no | MessagePack or Arrow IPC file + zstd + AES-256-GCM | Savana `SavanaCN` / `CNSegment` absolute copy-number segments. |
| `bam_qc` | yes | MessagePack + zstd + AES-256-GCM | Tumor/normal `WGSBamStats`, including coverage, N50, karyotype, read groups, source files, and flag stats. |
| `pipe_qc` | yes | MessagePack + zstd + AES-256-GCM | SomaticPipe filtering counters, input caller counts, annotation summaries, VEP stats, tool versions. |
| `methylation` | no | Arrow IPC file + zstd + AES-256-GCM | Per-region or per-CpG 5mC outputs. |
| `fingerprint` | yes | MessagePack + zstd + AES-256-GCM | Sample identity/fingerprint data; encrypt by default because it is identifying. |
| `provenance` | yes | MessagePack + zstd + AES-256-GCM | Input file digests, command lines, container/image versions, config digest. |

Leaving the fingerprint section unencrypted, as has been suggested, is risky because fingerprints are identifiers. The safer default is to encrypt it like the other clinical sections. If a public fingerprint is required for indexing, add a separate public_index section containing only non-reversible IDs and high-level availability flags.

Canonical Data Models

`variants`

The canonical variant section should store the existing merged Rust model:

Variants {
  data: Vec<Variant>
}

Variant {
  hash: Hash128,
  position: GenomePosition,
  reference: ReferenceAlternative,
  alternative: ReferenceAlternative,
  vcf_variants: Vec<VcfVariant>,
  annotations: Vec<Annotation>
}

This is important because the current model intentionally merges SNV, indel, SV/BND, and caller annotations into one nested representation. A flat Arrow table would lose structure or require many lossy string columns. For version 1, use the existing bitcode representation when compatibility with the .bit output is desired. MessagePack is acceptable if cross-language readers are more important than Rust-native speed.

`copy_number`

The CNV section should store Savana absolute copy-number segments:

SavanaCN {
  segments: Vec<CNSegment>
}

CNSegment {
  chromosome: String,
  start: u64,
  end: u64,
  segment_id: String,
  bin_count: u32,
  sum_of_bin_lengths: u64,
  weight: f64,
  copy_number: f64,
  minor_allele_copy_number: Option<f64>,
  mean_baf: Option<f64>,
  no_het_snps: u32
}

This section is distinct from variants even if a caller also emits CNV-like VCF records. Segment-level absolute copy number is a continuous genome track and should not be forced into the merged variant list.
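
To illustrate why a segment track is queried differently from a variant list, the sketch below looks up the copy number covering a genomic position. It uses a reduced, hypothetical copy of `CNSegment` with only four of its fields, and it assumes half-open `[start, end)` coordinates and non-overlapping segments; neither assumption is stated by this document.

```rust
/// Reduced copy of CNSegment for illustration; the real struct has more fields.
#[derive(Debug, Clone)]
pub struct CNSegment {
    pub chromosome: String,
    pub start: u64,
    pub end: u64,
    pub copy_number: f64,
}

/// Returns the copy number of the segment covering `pos` on `chromosome`, if any.
/// ASSUMPTION: segments are half-open [start, end) and do not overlap.
pub fn copy_number_at(segments: &[CNSegment], chromosome: &str, pos: u64) -> Option<f64> {
    segments
        .iter()
        .find(|s| s.chromosome == chromosome && s.start <= pos && pos < s.end)
        .map(|s| s.copy_number)
}
```

A linear scan keeps the example short. A real reader would more likely group segments by chromosome and binary-search on `start`.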

`bam_qc`

The BAM QC section should store one WGSBamStats payload per analyzed BAM, keyed by role/timepoint:

{
  "tumor": WGSBamStats,
  "normal": WGSBamStats,
  "additional_bams": [
    {"role": "mrd", "stats": WGSBamStats}
  ]
}

WGSBamStats already contains the useful BAM-level QC fields: total records, passing reads, mapped fraction, unmapped/duplicate/low-MAPQ counts, mapped yield, read lengths, global coverage, per-contig coverage/karyotype, N50, length histogram, read-group stats, source-file stats, and FlagStats.

`pipe_qc`

The pipeline QC section should store SomaticPipe-specific metrics separately from BAM QC:

{
  "somatic_pipe_stats": SomaticPipeStats,
  "variant_stats": VariantsStats,
  "annotation_stats": {
    "initial": AnnotationsStats,
    "post_filters": AnnotationsStats,
    "vep": VepStats
  },
  "caller_outputs": [
    {"caller": "ClairS Somatic", "n_input": 0, "n_after_filters": 0}
  ],
  "filter_steps": [
    {"name": "germline_or_constit", "removed": 0},
    {"name": "low_constit_depth", "removed": 0},
    {"name": "high_constit_alt", "removed": 0},
    {"name": "gnomad_and_constit_alt", "removed": 0},
    {"name": "low_entropy", "removed": 0}
  ]
}

The current SomaticPipeStats tracks the key filtering counters and input categorization. If future code records every intermediate annotation snapshot, store the snapshots here as named MessagePack entries or add optional sections such as annotations_01_initial.

Optional Arrow Indexes

Arrow should be used for derived projections, not for the canonical nested Variants payload.

`variant_index`

This section can be regenerated from variants, so it is optional. It is useful for a GUI, quick filtering, and summary browsing before loading the full nested payload.

variant_id: utf8
contig: utf8
start: int64
end: int64
ref: utf8
alt: utf8
variant_type: utf8
callers: list<utf8>
n_callers: uint8
tumor_depth: int32
tumor_alt: int32
normal_depth: int32
normal_alt: int32
vaf: float32
filters: list<utf8>
vep_consequence: utf8
vep_impact: utf8
gene: utf8
cosmic_count: int32
gnomad_af: float32

`methylation`

contig: utf8
start: int64
end: int64
region_id: utf8
mod_code: utf8
valid_coverage: int32
modified_count: int32
fraction_modified: float32
strand: utf8

Versioning Rules

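Readers must reject unsupported major versions. The prelude VERSION field is a single u16 with no minor component, so it acts as the major version.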
New optional sections may be added without changing the major version.
Adding nullable columns to Arrow sections is a minor-compatible change.
Removing columns, changing types, or changing crypto/compression semantics requires a major version bump.
Section names are stable ASCII identifiers.
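
The rules above could be enforced by a reader along the following lines. This is a sketch under assumptions, not specified behaviour: `check_compat` and `CompatError` are hypothetical names, the prelude VERSION is treated as the major version, and an unknown section marked `required` is treated as a hard error while unknown optional sections are skipped.

```rust
/// Format version this hypothetical reader understands.
pub const SUPPORTED_VERSION: u16 = 1;

#[derive(Debug, PartialEq, Eq)]
pub enum CompatError {
    UnsupportedVersion(u16),
    UnknownRequiredSection(String),
}

/// `sections` holds (name, required) pairs taken from the header's section table.
/// `known` lists the section names this reader can decode.
/// On success, returns the names of unknown optional sections that will be skipped.
pub fn check_compat(
    version: u16,
    sections: &[(&str, bool)],
    known: &[&str],
) -> Result<Vec<String>, CompatError> {
    if version != SUPPORTED_VERSION {
        return Err(CompatError::UnsupportedVersion(version));
    }
    let mut skipped = Vec::new();
    for &(name, required) in sections {
        if known.contains(&name) {
            continue;
        }
        if required {
            // ASSUMPTION: a reader cannot safely ignore a section flagged as required.
            return Err(CompatError::UnknownRequiredSection(name.to_string()));
        }
        skipped.push(name.to_string());
    }
    Ok(skipped)
}
```

Returning the skipped names lets a caller log which optional sections were ignored instead of dropping them silently.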

Minimal Reader Algorithm

Read and validate the 8-byte magic.
Read VERSION, HEADER_LEN, and HEADER_CHECKSUM.
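Reject a HEADER_LEN that exceeds the remaining file size, then read HEADER_LEN bytes, verify BLAKE3, decompress zstd, decode MessagePack.
Check that every section's offset + length does not overflow, lies inside the file after the header, and does not overlap any other section.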
For the requested section, read bytes and verify BLAKE3 checksum.
Decrypt with AES-256-GCM using the section nonce, tag, and AAD.
Decompress zstd.
Decode bitcode, Arrow IPC, or MessagePack according to kind.
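
The random-access read in the section step can be sketched with the standard library's `Read` and `Seek` traits. `read_section_bytes` is a hypothetical helper; checksum verification, decryption, decompression, and decoding would follow it and need the external crates listed under Open Implementation Choices.

```rust
use std::convert::TryFrom;
use std::io::{self, Read, Seek, SeekFrom};

/// Reads exactly `length` stored bytes starting at absolute `offset`.
/// Intended to be called only after the offset/length bounds were validated
/// against the file size, so the allocation below is bounded by the file length.
pub fn read_section_bytes<R: Read + Seek>(
    reader: &mut R,
    offset: u64,
    length: u64,
) -> io::Result<Vec<u8>> {
    // On 32-bit targets a u64 length may not fit in usize.
    let len = usize::try_from(length)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "section too large"))?;
    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    // read_exact fails with UnexpectedEof if the file is shorter than declared.
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

Because the function is generic over `Read + Seek`, it works the same way on a `std::fs::File` and on an in-memory `std::io::Cursor`, which is convenient for tests.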

Open Implementation Choices

Use bitcode for Rust-native canonical sections that already derive Encode / Decode, especially Variants.
Use rmp-serde for MessagePack.
Use zstd crate for compression.
Use aes-gcm and argon2 crates for encryption and key derivation.
Store Arrow IPC as file format rather than stream format for derived indexes and tabular tracks, because file format carries schema/footer metadata and is better for self-contained sections.
Add a public_index section only if GUI browsing needs non-secret metadata before decryption.
