
SomaticPipe Output Container Format

This document proposes a single binary container for SomaticPipe outputs. The goal is to replace scattered .json.gz, .bit, VCF-derived tables, BAM QC files, pipeline QC files, methylation summaries, CNV files, and fingerprint artifacts with one random-access file that is compact, typed, encrypted, and versioned.

Recommended extension: .pandora

Design Goals

  • Store all sample-level SomaticPipe results in one file.
  • Preserve the current merged Variants model instead of forcing SNV/SV data into a flat table.
  • Keep optional derived indexes columnar and fast to load with Arrow IPC.
  • Keep small structured metadata simple with MessagePack.
  • Allow random access to each result section without reading the full file.
  • Encrypt patient-identifying result payloads with authenticated encryption.
  • Make corruption and version mismatch failures explicit before decoding payloads.
  • Keep the format append-friendly enough for future sections.

File Layout

All integers in the fixed prelude are unsigned big-endian.

[MAGIC: 8 bytes]          "PANDORA\0"
[VERSION: u16]            format version, initial value 1
[HEADER_LEN: u64]         compressed header length in bytes
[HEADER_CHECKSUM: 32 B]   BLAKE3(header_zstd), 32-byte digest
[HEADER: msgpack + zstd]
[SECTION 0..N]
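
As an illustration of the layout above, the fixed prelude occupies 8 + 2 + 8 + 32 = 50 bytes and can be parsed with the standard library alone. This is a minimal sketch, not the project's reader: the names Prelude, PreludeError, and parse_prelude are hypothetical, and verifying the header checksum is left out because it needs a BLAKE3 implementation.

```rust
/// Size of the fixed prelude: 8 (magic) + 2 (version) + 8 (header length) + 32 (checksum).
pub const PRELUDE_LEN: usize = 50;
pub const MAGIC: &[u8; 8] = b"PANDORA\0";

/// Hypothetical in-memory view of the fixed prelude.
#[derive(Debug, PartialEq)]
pub struct Prelude {
    pub version: u16,
    pub header_len: u64,
    pub header_checksum: [u8; 32],
}

#[derive(Debug, PartialEq)]
pub enum PreludeError {
    TooShort,
    BadMagic,
}

/// Parses the first 50 bytes of a file. All integers are unsigned big-endian.
pub fn parse_prelude(bytes: &[u8]) -> Result<Prelude, PreludeError> {
    if bytes.len() < PRELUDE_LEN {
        return Err(PreludeError::TooShort);
    }
    if !bytes.starts_with(MAGIC) {
        return Err(PreludeError::BadMagic);
    }
    let version = u16::from_be_bytes([bytes[8], bytes[9]]);

    let mut len_buf = [0u8; 8];
    len_buf.copy_from_slice(&bytes[10..18]);
    let header_len = u64::from_be_bytes(len_buf);

    let mut header_checksum = [0u8; 32];
    header_checksum.copy_from_slice(&bytes[18..50]);

    Ok(Prelude { version, header_len, header_checksum })
}
```

The compressed header then starts at byte 50 and runs for header_len bytes.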

The header is not encrypted because readers need the section table before they can load payloads. It must not contain direct clinical results. Patient-identifying fields such as sample_id should be either pseudonymized or moved into an encrypted metadata section if the file may leave a controlled environment.

Header Schema

The header is MessagePack serialized and then zstd-compressed.

{
  "format": "somaticpipe.output",
  "format_version": 1,
  "producer": {
    "name": "pandora_lib_promethion",
    "pipeline": "SomaticPipe",
    "pipeline_version": "0.1.0",
    "git_commit": "...",
    "created_at": "2026-05-27T00:00:00Z"
  },
  "sample": {
    "sample_id": "pseudonym-or-local-id",
    "tumor_timepoint": "diag",
    "normal_timepoint": "constit",
    "reference": "hs1",
    "reference_digest": "blake3:..."
  },
  "compression": {
    "algorithm": "zstd",
    "level": 3
  },
  "encryption": {
    "algorithm": "AES-256-GCM",
    "key_derivation": "Argon2id",
    "salt": "<base64 16-32 bytes>",
    "aad": "fixed-prelude + canonical-section-descriptor"
  },
  "sections": [
    {
      "name": "variants",
      "kind": "bitcode",
      "compression": "zstd",
      "encryption": "AES-256-GCM",
      "offset": 0,
      "length": 0,
      "nonce": "<base64 12 bytes>",
      "tag": "<base64 16 bytes>",
      "checksum": "blake3:<ciphertext digest>",
      "schema_hash": "blake3:<logical schema digest>",
      "required": true
    }
  ]
}

Offsets are absolute byte offsets from the start of the file. length is the number of bytes stored for the section, that is, the ciphertext only; the GCM tag is kept in the header and is not counted. Before decoding, a reader must validate the magic, version, header checksum, every offset/length bound, and the checksum of each section it loads. Checking only the sections it loads keeps random access possible without reading the full file.
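
The offset/length validation described above could look like the following sketch. SectionSpan, LayoutError, and validate_layout are hypothetical names. The requirement that sections start at or after payload_start (the first byte after the header, 50 + HEADER_LEN) is an assumption added here; the text only requires that sections lie inside the file.

```rust
/// Hypothetical minimal view of one entry in the header's section table.
#[derive(Debug, Clone, PartialEq)]
pub struct SectionSpan {
    pub name: String,
    pub offset: u64,
    pub length: u64,
}

#[derive(Debug, PartialEq)]
pub enum LayoutError {
    /// offset + length does not fit in a u64.
    Overflow(String),
    /// The section starts before the payload area or ends past the file.
    OutOfBounds(String),
    /// Two sections share at least one byte.
    Overlap(String, String),
}

/// Checks that every section lies in [payload_start, file_len) and that
/// no two sections overlap.
pub fn validate_layout(
    sections: &[SectionSpan],
    payload_start: u64,
    file_len: u64,
) -> Result<(), LayoutError> {
    let mut spans: Vec<(u64, u64, &str)> = Vec::with_capacity(sections.len());
    for s in sections {
        let end = s
            .offset
            .checked_add(s.length)
            .ok_or_else(|| LayoutError::Overflow(s.name.clone()))?;
        if s.offset < payload_start || end > file_len {
            return Err(LayoutError::OutOfBounds(s.name.clone()));
        }
        spans.push((s.offset, end, s.name.as_str()));
    }
    // After sorting by start offset, an overlap can only occur between neighbours.
    spans.sort();
    for pair in spans.windows(2) {
        if pair[1].0 < pair[0].1 {
            return Err(LayoutError::Overlap(
                pair[0].2.to_string(),
                pair[1].2.to_string(),
            ));
        }
    }
    Ok(())
}
```

Using checked_add matters because offset and length come from an untrusted file and a wrapped sum could otherwise pass the bounds test.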

Section Encoding

Each encrypted section uses this transform order:

logical payload -> encode -> zstd compress -> AES-256-GCM encrypt -> write bytes

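The GCM authentication tag is stored in the header. The additional authenticated data should be a canonical representation of the fixed prelude plus the section descriptor with offset, length, nonce, name, kind, and schema_hash. Of the prelude, only MAGIC and VERSION can be covered: HEADER_LEN and HEADER_CHECKSUM are computed from the header bytes, which themselves contain the tag, so including them would be circular. Binding the descriptor into the AAD prevents moving encrypted bytes between sections or files without detection.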

Checksums are BLAKE3 over the stored bytes (the ciphertext), not the plaintext. AES-GCM already authenticates the ciphertext and AAD when decrypting; the checksum catches storage corruption before decryption, without needing the key, and helps diagnostics.
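
One way to make the AAD canonical is to serialize its fields in a fixed order, with a length prefix on every variable-length field so that field boundaries are unambiguous. The sketch below is an assumption about the encoding, which this document does not pin down: AadInput, put_field, and build_aad are hypothetical names, the u32 length prefixes are an illustrative choice, and only MAGIC and VERSION are taken from the prelude because HEADER_LEN and HEADER_CHECKSUM depend on header bytes that include the tag.

```rust
/// Hypothetical bundle of the values bound into a section's AAD.
pub struct AadInput<'a> {
    pub version: u16,
    pub name: &'a str,
    pub kind: &'a str,
    pub offset: u64,
    pub length: u64,
    pub nonce: &'a [u8; 12],
    /// Raw digest bytes of the logical schema.
    pub schema_hash: &'a [u8],
}

/// Appends a variable-length field as a big-endian u32 length followed by the bytes.
/// Assumes fields are shorter than 4 GiB, which holds for names and digests.
fn put_field(out: &mut Vec<u8>, field: &[u8]) {
    out.extend_from_slice(&(field.len() as u32).to_be_bytes());
    out.extend_from_slice(field);
}

/// Builds the AAD bytes in a fixed field order.
pub fn build_aad(input: &AadInput) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"PANDORA\0"); // MAGIC
    out.extend_from_slice(&input.version.to_be_bytes()); // VERSION
    put_field(&mut out, input.name.as_bytes());
    put_field(&mut out, input.kind.as_bytes());
    out.extend_from_slice(&input.offset.to_be_bytes());
    out.extend_from_slice(&input.length.to_be_bytes());
    out.extend_from_slice(input.nonce);
    put_field(&mut out, input.schema_hash);
    out
}
```

The writer and every reader must produce byte-identical AAD, otherwise GCM tag verification fails even for an intact file. The length prefixes ensure that, for example, name "ab" with kind "c" never serializes to the same bytes as name "a" with kind "bc".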

Standard Sections

| Section | Required | Encoding | Notes |
|---|---|---|---|
| variants | yes | bitcode or MessagePack + zstd + AES-256-GCM | Canonical merged Variants struct. This keeps SNV, indel, SV, BND, and CNV-like VCF calls together. |
| variant_index | no | Arrow IPC file + zstd + AES-256-GCM | Optional flattened projection for fast GUI/search. It is derived from variants, not canonical. |
| copy_number | no | MessagePack or Arrow IPC file + zstd + AES-256-GCM | Savana SavanaCN / CNSegment absolute copy-number segments. |
| bam_qc | yes | MessagePack + zstd + AES-256-GCM | Tumor/normal WGSBamStats, including coverage, N50, karyotype, read groups, source files, and flag stats. |
| pipe_qc | yes | MessagePack + zstd + AES-256-GCM | SomaticPipe filtering counters, input caller counts, annotation summaries, VEP stats, tool versions. |
| methylation | no | Arrow IPC file + zstd + AES-256-GCM | Per-region or per-CpG 5mC outputs. |
| fingerprint | yes | MessagePack + zstd + AES-256-GCM | Sample identity/fingerprint data; encrypt by default because it is identifying. |
| provenance | yes | MessagePack + zstd + AES-256-GCM | Input file digests, command lines, container/image versions, config digest. |

Storing the fingerprint section unencrypted is risky: fingerprints are identifiers. The safer default is to encrypt it like the other clinical sections. If a public fingerprint is required for indexing, add a separate public_index section containing only non-reversible IDs and high-level availability flags.

Canonical Data Models

variants

The canonical variant section should store the existing merged Rust model:

Variants {
  data: Vec<Variant>
}

Variant {
  hash: Hash128,
  position: GenomePosition,
  reference: ReferenceAlternative,
  alternative: ReferenceAlternative,
  vcf_variants: Vec<VcfVariant>,
  annotations: Vec<Annotation>
}

This is important because the current model intentionally merges SNV, indel, SV/BND, and caller annotations into one nested representation. A flat Arrow table would lose structure or require many lossy string columns. For version 1, use the existing bitcode representation when compatibility with the .bit output is desired. MessagePack is acceptable if cross-language readers are more important than Rust-native speed.

copy_number

The CNV section should store Savana absolute copy-number segments:

SavanaCN {
  segments: Vec<CNSegment>
}

CNSegment {
  chromosome: String,
  start: u64,
  end: u64,
  segment_id: String,
  bin_count: u32,
  sum_of_bin_lengths: u64,
  weight: f64,
  copy_number: f64,
  minor_allele_copy_number: Option<f64>,
  mean_baf: Option<f64>,
  no_het_snps: u32
}

This section is distinct from variants even if a caller also emits CNV-like VCF records. Segment-level absolute copy number is a continuous genome track and should not be forced into the merged variant list.
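
To illustrate treating the segments as a continuous track, the sketch below mirrors the CNSegment fields listed above and computes a length-weighted mean copy number. The helper mean_copy_number is hypothetical, and the half-open [start, end) coordinate convention is an assumption; this document does not state which convention Savana uses.

```rust
/// Mirror of the CNSegment fields listed in this section.
#[derive(Debug, Clone, PartialEq)]
pub struct CNSegment {
    pub chromosome: String,
    pub start: u64,
    pub end: u64,
    pub segment_id: String,
    pub bin_count: u32,
    pub sum_of_bin_lengths: u64,
    pub weight: f64,
    pub copy_number: f64,
    pub minor_allele_copy_number: Option<f64>,
    pub mean_baf: Option<f64>,
    pub no_het_snps: u32,
}

/// Length-weighted mean copy number over all segments.
/// Assumes half-open coordinates, so a segment spans `end - start` bases.
/// Returns None for an empty list or if any segment has end <= start.
pub fn mean_copy_number(segments: &[CNSegment]) -> Option<f64> {
    let mut weighted = 0.0_f64;
    let mut total_len = 0_u64;
    for s in segments {
        if s.end <= s.start {
            return None;
        }
        let len = s.end - s.start;
        weighted += s.copy_number * len as f64;
        total_len = total_len.checked_add(len)?;
    }
    if total_len == 0 {
        None
    } else {
        Some(weighted / total_len as f64)
    }
}
```

A summary like this only makes sense on segment data, which is one reason to keep copy_number separate from the merged variant list.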

bam_qc

The BAM QC section should store one WGSBamStats payload per analyzed BAM, keyed by role/timepoint:

{
  "tumor": WGSBamStats,
  "normal": WGSBamStats,
  "additional_bams": [
    {"role": "mrd", "stats": WGSBamStats}
  ]
}

WGSBamStats already contains the useful BAM-level QC fields: total records, passing reads, mapped fraction, unmapped/duplicate/low-MAPQ counts, mapped yield, read lengths, global coverage, per-contig coverage/karyotype, N50, length histogram, read-group stats, source-file stats, and FlagStats.

pipe_qc

The pipeline QC section should store SomaticPipe-specific metrics separately from BAM QC:

{
  "somatic_pipe_stats": SomaticPipeStats,
  "variant_stats": VariantsStats,
  "annotation_stats": {
    "initial": AnnotationsStats,
    "post_filters": AnnotationsStats,
    "vep": VepStats
  },
  "caller_outputs": [
    {"caller": "ClairS Somatic", "n_input": 0, "n_after_filters": 0}
  ],
  "filter_steps": [
    {"name": "germline_or_constit", "removed": 0},
    {"name": "low_constit_depth", "removed": 0},
    {"name": "high_constit_alt", "removed": 0},
    {"name": "gnomad_and_constit_alt", "removed": 0},
    {"name": "low_entropy", "removed": 0}
  ]
}
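
A reader can use the filter_steps counters as a consistency check. The sketch below assumes that steps are applied in the listed order and that each removed variant is counted by exactly one step; neither is stated in this document. FilterStep and remaining_after_filters are hypothetical names.

```rust
/// One entry of the "filter_steps" array.
#[derive(Debug, Clone, PartialEq)]
pub struct FilterStep {
    pub name: String,
    pub removed: u64,
}

/// Subtracts each step's `removed` count from the running total.
/// Returns an error naming the first step that removes more variants than remain.
pub fn remaining_after_filters(n_input: u64, steps: &[FilterStep]) -> Result<u64, String> {
    let mut remaining = n_input;
    for step in steps {
        match remaining.checked_sub(step.removed) {
            Some(left) => remaining = left,
            None => {
                return Err(format!(
                    "step '{}' removes {} variants but only {} remain",
                    step.name, step.removed, remaining
                ))
            }
        }
    }
    Ok(remaining)
}
```

Under those assumptions, the result should equal the number of variants that survive all filters.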

The current SomaticPipeStats tracks the key filtering counters and input categorization. If future code records every intermediate annotation snapshot, store the snapshots here as named MessagePack entries or add optional sections such as annotations_01_initial.

Optional Arrow Indexes

Arrow should be used for derived projections, not for the canonical nested Variants payload.

variant_index

This section can be regenerated from variants, so it is optional. It is useful for a GUI, quick filtering, and summary browsing before loading the full nested payload.

  • variant_id: utf8
  • contig: utf8
  • start: int64
  • end: int64
  • ref: utf8
  • alt: utf8
  • variant_type: utf8
  • callers: list<utf8>
  • n_callers: uint8
  • tumor_depth: int32
  • tumor_alt: int32
  • normal_depth: int32
  • normal_alt: int32
  • vaf: float32
  • filters: list<utf8>
  • vep_consequence: utf8
  • vep_impact: utf8
  • gene: utf8
  • cosmic_count: int32
  • gnomad_af: float32

methylation

  • contig: utf8
  • start: int64
  • end: int64
  • region_id: utf8
  • mod_code: utf8
  • valid_coverage: int32
  • modified_count: int32
  • fraction_modified: float32
  • strand: utf8

Versioning Rules

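  • Readers must reject unsupported major versions. The VERSION field in the fixed prelude is the major version; the format defines no separate minor version field.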
  • New optional sections may be added without changing the major version.
  • Adding nullable columns to Arrow sections is a minor-compatible change.
  • Removing columns, changing types, or changing crypto/compression semantics requires a major version bump.
  • Section names are stable ASCII identifiers.
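
Two of these rules are easy to enforce mechanically, as the sketch below shows. It assumes a reader that supports exactly one format version. The identifier grammar (lowercase ASCII letter first, then lowercase letters, digits, or underscores) is also an assumption, chosen to match names used in this document such as variant_index and annotations_01_initial. All names in the block are hypothetical.

```rust
/// The only format version this hypothetical reader understands.
pub const SUPPORTED_VERSION: u16 = 1;

#[derive(Debug, PartialEq)]
pub enum VersionError {
    Unsupported { found: u16, supported: u16 },
}

/// Rejects any file whose VERSION differs from the supported one.
pub fn check_version(found: u16) -> Result<(), VersionError> {
    if found == SUPPORTED_VERSION {
        Ok(())
    } else {
        Err(VersionError::Unsupported { found, supported: SUPPORTED_VERSION })
    }
}

/// Assumed grammar for section names: [a-z][a-z0-9_]*
pub fn is_valid_section_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_lowercase() => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
}
```

Running check_version immediately after reading the prelude keeps version mismatches explicit before any payload is touched.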

Minimal Reader Algorithm

  1. Read and validate the 8-byte magic.
  2. Read VERSION and reject unsupported versions, then read HEADER_LEN and HEADER_CHECKSUM.
  3. Check that HEADER_LEN fits inside the file, read HEADER_LEN bytes, verify BLAKE3, decompress zstd, decode MessagePack.
  4. Check that every section's offset + length lies inside the file, after the header, and that no two sections overlap.
  5. For the requested section, read bytes and verify BLAKE3 checksum.
  6. Derive the key with Argon2id using the header salt, then decrypt with AES-256-GCM using the section nonce, tag, and AAD.
  7. Decompress zstd.
  8. Decode bitcode, Arrow IPC, or MessagePack according to kind.
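
Steps 5 to 7 can be sketched with the standard library alone by passing the cryptographic and compression stages in as closures. This is an illustrative skeleton, not the project's reader: read_section_bytes and load_section are hypothetical names, and a real implementation would supply closures backed by BLAKE3, AES-256-GCM, and zstd.

```rust
use std::io::{self, Read, Seek, SeekFrom};

fn invalid(msg: &str) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, msg)
}

/// Seeks to `offset` and reads exactly `length` bytes, without touching
/// the rest of the file. Bounds are checked against the file size first.
pub fn read_section_bytes<R: Read + Seek>(
    reader: &mut R,
    offset: u64,
    length: u64,
) -> io::Result<Vec<u8>> {
    let file_len = reader.seek(SeekFrom::End(0))?;
    let end = offset
        .checked_add(length)
        .ok_or_else(|| invalid("offset + length overflows"))?;
    if end > file_len {
        return Err(invalid("section extends past end of file"));
    }
    if length > usize::MAX as u64 {
        return Err(invalid("section too large for this platform"));
    }
    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; length as usize];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

/// Applies the stages in the required order: checksum over the stored bytes,
/// then decryption, then decompression. Decoding by kind is left to the caller.
pub fn load_section<R: Read + Seek>(
    reader: &mut R,
    offset: u64,
    length: u64,
    verify_checksum: impl Fn(&[u8]) -> bool,
    decrypt: impl Fn(&[u8]) -> io::Result<Vec<u8>>,
    decompress: impl Fn(&[u8]) -> io::Result<Vec<u8>>,
) -> io::Result<Vec<u8>> {
    let stored = read_section_bytes(reader, offset, length)?;
    if !verify_checksum(&stored) {
        return Err(invalid("section checksum mismatch"));
    }
    let compressed = decrypt(&stored)?;
    decompress(&compressed)
}
```

Because the function only requires Read + Seek, the same code works on a std::fs::File or on an in-memory std::io::Cursor in tests.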

Open Implementation Choices

  • Use bitcode for Rust-native canonical sections that already derive Encode / Decode, especially Variants.
  • Use rmp-serde for MessagePack.
  • Use zstd crate for compression.
  • Use aes-gcm and argon2 crates for encryption and key derivation.
  • Store Arrow IPC as file format rather than stream format for derived indexes and tabular tracks. Both carry the schema, but the file format adds a footer with record-batch locations, which allows random access to batches and suits self-contained sections.
  • Add a public_index section only if GUI browsing needs non-secret metadata before decryption.