This document proposes a single binary container for SomaticPipe outputs. The goal is to replace scattered .json.gz, .bit, VCF-derived tables, BAM QC files, pipeline QC files, methylation summaries, CNV files, and fingerprint artifacts with one random-access file that is compact, typed, encrypted, and versioned.
Recommended extension: .pandora
The canonical payload keeps the existing nested Variants model instead of forcing SNV/SV data into a flat table.
All integers in the fixed prelude are unsigned big-endian.
[MAGIC: 8 bytes] "PANDORA\0"
[VERSION: u16] format version, initial value 1
[HEADER_LEN: u64] compressed header length in bytes
[HEADER_CHECKSUM: 32 bytes] BLAKE3(header_zstd)
[HEADER: msgpack + zstd]
[SECTION 0..N]
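As a minimal sketch of how a reader could parse the fixed prelude above, the following uses only the Rust standard library. The names `Prelude`, `PreludeError`, and `parse_prelude` are hypothetical, and rejecting every version other than 1 is an assumption about reader policy, not something this proposal mandates. The BLAKE3 check of the header bytes is left out because it needs an external crate.

```rust
/// Fixed prelude size: 8 (magic) + 2 (version) + 8 (header length) + 32 (checksum).
pub const PRELUDE_LEN: usize = 8 + 2 + 8 + 32;
pub const MAGIC: [u8; 8] = *b"PANDORA\0";

/// Hypothetical in-memory view of the fixed prelude.
#[derive(Debug, PartialEq)]
pub struct Prelude {
    pub version: u16,
    pub header_len: u64,
    pub header_checksum: [u8; 32],
}

#[derive(Debug, PartialEq)]
pub enum PreludeError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u16),
}

/// Parses the first PRELUDE_LEN bytes of a file. All integers are big-endian.
pub fn parse_prelude(bytes: &[u8]) -> Result<Prelude, PreludeError> {
    if bytes.len() < PRELUDE_LEN {
        return Err(PreludeError::TooShort);
    }
    if &bytes[0..8] != &MAGIC[..] {
        return Err(PreludeError::BadMagic);
    }
    let version = u16::from_be_bytes([bytes[8], bytes[9]]);
    // Assumption: this reader only understands format version 1.
    if version != 1 {
        return Err(PreludeError::UnsupportedVersion(version));
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[10..18]);
    let header_len = u64::from_be_bytes(len_bytes);
    let mut header_checksum = [0u8; 32];
    header_checksum.copy_from_slice(&bytes[18..50]);
    Ok(Prelude { version, header_len, header_checksum })
}
```

A caller would then read `header_len` bytes starting at offset `PRELUDE_LEN` and compare their BLAKE3 digest with `header_checksum` before decompressing.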
The header is not encrypted because readers need the section table before they can load payloads. It must not contain direct clinical results. Patient-identifying fields such as sample_id should be either pseudonymized or moved into an encrypted metadata section if the file may leave a controlled environment.
The header is MessagePack serialized and then zstd-compressed.
{
  "format": "somaticpipe.output",
  "format_version": 1,
  "producer": {
    "name": "pandora_lib_promethion",
    "pipeline": "SomaticPipe",
    "pipeline_version": "0.1.0",
    "git_commit": "...",
    "created_at": "2026-05-27T00:00:00Z"
  },
  "sample": {
    "sample_id": "pseudonym-or-local-id",
    "tumor_timepoint": "diag",
    "normal_timepoint": "constit",
    "reference": "hs1",
    "reference_digest": "blake3:..."
  },
  "compression": {
    "algorithm": "zstd",
    "level": 3
  },
  "encryption": {
    "algorithm": "AES-256-GCM",
    "key_derivation": "Argon2id",
    "salt": "<base64 16-32 bytes>",
    "aad": "fixed-prelude + canonical-section-descriptor"
  },
  "sections": [
    {
      "name": "variants",
      "kind": "bitcode",
      "compression": "zstd",
      "encryption": "AES-256-GCM",
      "offset": 0,
      "length": 0,
      "nonce": "<base64 12 bytes>",
      "tag": "<base64 16 bytes>",
      "checksum": "blake3:<ciphertext digest>",
      "schema_hash": "blake3:<logical schema digest>",
      "required": true
    }
  ]
}
Offsets are absolute byte offsets from the start of the file. length is the number of bytes stored for the section, that is, the ciphertext; it excludes the GCM tag whenever the tag is stored in the header. A reader must validate the magic, version, header checksum, every offset/length bound, and every section checksum before decoding.
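The offset/length bound check can be sketched as follows, using only the standard library. `SectionSpan`, `BoundsError`, and `validate_spans` are hypothetical names. The `payload_start` parameter (prelude size plus HEADER_LEN) reflects an assumption that sections may not overlap the prelude or header, and the sketch also rejects sections that overlap one another.

```rust
/// Hypothetical minimal view of one entry in the header's section table.
#[derive(Debug, Clone, PartialEq)]
pub struct SectionSpan {
    pub name: String,
    pub offset: u64,
    pub length: u64,
}

#[derive(Debug, PartialEq)]
pub enum BoundsError {
    /// offset + length does not fit in a u64.
    Overflow(String),
    /// The section starts before the end of the header (assumed to be forbidden).
    InsideHeader(String),
    /// The section ends past the end of the file.
    OutOfFile(String),
    /// Two sections share at least one byte.
    Overlap(String, String),
}

/// Checks every span against the file size and against its neighbours.
pub fn validate_spans(
    spans: &[SectionSpan],
    payload_start: u64,
    file_len: u64,
) -> Result<(), BoundsError> {
    // Sort by offset so that overlap detection only needs the previous span.
    let mut sorted: Vec<&SectionSpan> = spans.iter().collect();
    sorted.sort_by_key(|s| (s.offset, s.length));

    let mut prev: Option<(&SectionSpan, u64)> = None;
    for s in sorted {
        let end = s
            .offset
            .checked_add(s.length)
            .ok_or_else(|| BoundsError::Overflow(s.name.clone()))?;
        if s.offset < payload_start {
            return Err(BoundsError::InsideHeader(s.name.clone()));
        }
        if end > file_len {
            return Err(BoundsError::OutOfFile(s.name.clone()));
        }
        if let Some((p, p_end)) = prev {
            if s.offset < p_end {
                return Err(BoundsError::Overlap(p.name.clone(), s.name.clone()));
            }
        }
        prev = Some((s, end));
    }
    Ok(())
}
```

Using `checked_add` matters here: a crafted header with an offset near `u64::MAX` would otherwise wrap around and pass a naive `offset + length <= file_len` test.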
Each encrypted section uses this transform order:
logical payload -> encode -> zstd compress -> AES-256-GCM encrypt -> write bytes
The GCM authentication tag is stored in the header. The additional authenticated data should be a canonical representation of the fixed prelude plus the section descriptor with offset, length, nonce, name, kind, and schema_hash. This prevents moving encrypted bytes between sections or files without detection.
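The proposal does not fix a byte layout for the canonical AAD, so the following is one possible encoding, given as a sketch only. Variable-length fields are prefixed with a big-endian u32 length so that adjacent strings cannot be shifted into each other; fixed-width fields are written as-is. `SectionDescriptor`, `put_field`, and `build_aad` are hypothetical names, and the field order is an assumption.

```rust
/// Hypothetical descriptor holding the fields the AAD is meant to cover.
#[derive(Debug, Clone, Copy)]
pub struct SectionDescriptor<'a> {
    pub name: &'a str,
    pub kind: &'a str,
    pub offset: u64,
    pub length: u64,
    pub nonce: [u8; 12],
    pub schema_hash: &'a str,
}

/// Appends a variable-length field with a big-endian u32 length prefix.
/// Sketch only: assumes each field is shorter than 4 GiB.
fn put_field(out: &mut Vec<u8>, field: &[u8]) {
    out.extend_from_slice(&(field.len() as u32).to_be_bytes());
    out.extend_from_slice(field);
}

/// Builds the additional authenticated data for one section.
/// `prelude` is the raw fixed prelude as stored at the start of the file.
pub fn build_aad(prelude: &[u8], d: &SectionDescriptor) -> Vec<u8> {
    let mut aad = Vec::new();
    put_field(&mut aad, prelude);
    put_field(&mut aad, d.name.as_bytes());
    put_field(&mut aad, d.kind.as_bytes());
    aad.extend_from_slice(&d.offset.to_be_bytes());
    aad.extend_from_slice(&d.length.to_be_bytes());
    aad.extend_from_slice(&d.nonce);
    put_field(&mut aad, d.schema_hash.as_bytes());
    aad
}
```

Because offset and name are part of the AAD, ciphertext copied to another position or relabelled as another section would fail GCM tag verification, which is the property the paragraph above asks for.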
Checksums are BLAKE3 over the stored bytes (the ciphertext), not the plaintext. AES-GCM already authenticates the ciphertext and the AAD at decryption time; the separate checksum catches storage corruption before decryption and helps diagnostics.
| Section | Required | Encoding | Notes |
|---|---|---|---|
| variants | yes | bitcode or MessagePack + zstd + AES-256-GCM | Canonical merged Variants struct. This keeps SNV, indel, SV, BND, and CNV-like VCF calls together. |
| variant_index | no | Arrow IPC file + zstd + AES-256-GCM | Optional flattened projection for fast GUI/search. It is derived from variants, not canonical. |
| copy_number | no | MessagePack or Arrow IPC file + zstd + AES-256-GCM | Savana SavanaCN / CNSegment absolute copy-number segments. |
| bam_qc | yes | MessagePack + zstd + AES-256-GCM | Tumor/normal WGSBamStats, including coverage, N50, karyotype, read groups, source files, and flag stats. |
| pipe_qc | yes | MessagePack + zstd + AES-256-GCM | SomaticPipe filtering counters, input caller counts, annotation summaries, VEP stats, tool versions. |
| methylation | no | Arrow IPC file + zstd + AES-256-GCM | Per-region or per-CpG 5mC outputs. |
| fingerprint | yes | MessagePack + zstd + AES-256-GCM | Sample identity/fingerprint data; encrypt by default because it is identifying. |
| provenance | yes | MessagePack + zstd + AES-256-GCM | Input file digests, command lines, container/image versions, config digest. |
The user-suggested unencrypted fingerprint section is risky: fingerprints are identifiers. The safer default is to encrypt it like the other clinical sections. If a public fingerprint is required for indexing, add a separate public_index section containing only non-reversible IDs and high-level availability flags.
variants
The canonical variant section should store the existing merged Rust model:
Variants {
    data: Vec<Variant>
}
Variant {
    hash: Hash128,
    position: GenomePosition,
    reference: ReferenceAlternative,
    alternative: ReferenceAlternative,
    vcf_variants: Vec<VcfVariant>,
    annotations: Vec<Annotation>
}
This is important because the current model intentionally merges SNV, indel, SV/BND, and caller annotations into one nested representation. A flat Arrow table would lose structure or require many lossy string columns. For version 1, use the existing bitcode representation when compatibility with the .bit output is desired. MessagePack is acceptable if cross-language readers are more important than Rust-native speed.
copy_number
The CNV section should store Savana absolute copy-number segments:
SavanaCN {
    segments: Vec<CNSegment>
}
CNSegment {
    chromosome: String,
    start: u64,
    end: u64,
    segment_id: String,
    bin_count: u32,
    sum_of_bin_lengths: u64,
    weight: f64,
    copy_number: f64,
    minor_allele_copy_number: Option<f64>,
    mean_baf: Option<f64>,
    no_het_snps: u32
}
This section is distinct from variants even if a caller also emits CNV-like VCF records. Segment-level absolute copy number is a continuous genome track and should not be forced into the merged variant list.
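To illustrate why a segment track is queried differently from a variant list, here is a small sketch of the two structs above as compilable Rust with two hypothetical helpers, `segment_at` and `mean_copy_number`. The coordinate convention (start inclusive, end exclusive) is an assumption made for the example; the convention Savana actually uses should be checked before relying on it.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct CNSegment {
    pub chromosome: String,
    pub start: u64,
    pub end: u64,
    pub segment_id: String,
    pub bin_count: u32,
    pub sum_of_bin_lengths: u64,
    pub weight: f64,
    pub copy_number: f64,
    pub minor_allele_copy_number: Option<f64>,
    pub mean_baf: Option<f64>,
    pub no_het_snps: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SavanaCN {
    pub segments: Vec<CNSegment>,
}

impl SavanaCN {
    /// Returns the segment covering `pos`, if any.
    /// Assumption: segments are half-open intervals [start, end).
    pub fn segment_at(&self, chromosome: &str, pos: u64) -> Option<&CNSegment> {
        self.segments
            .iter()
            .find(|s| s.chromosome == chromosome && s.start <= pos && pos < s.end)
    }

    /// Length-weighted mean absolute copy number over one chromosome.
    /// Returns None when no segment with non-zero length is present.
    pub fn mean_copy_number(&self, chromosome: &str) -> Option<f64> {
        let (sum, len) = self
            .segments
            .iter()
            .filter(|s| s.chromosome == chromosome)
            .fold((0.0_f64, 0u64), |(sum, len), s| {
                let l = s.end.saturating_sub(s.start);
                (sum + s.copy_number * l as f64, len + l)
            });
        if len == 0 {
            None
        } else {
            Some(sum / len as f64)
        }
    }
}
```

Both helpers scan linearly, which is fine for the few hundred segments a typical genome produces; a sorted vector with binary search would be the next step if lookups became hot.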
bam_qc
The BAM QC section should store one WGSBamStats payload per analyzed BAM, keyed by role/timepoint:
{
  "tumor": WGSBamStats,
  "normal": WGSBamStats,
  "additional_bams": [
    {"role": "mrd", "stats": WGSBamStats}
  ]
}
WGSBamStats already contains the useful BAM-level QC fields: total records, passing reads, mapped fraction, unmapped/duplicate/low-MAPQ counts, mapped yield, read lengths, global coverage, per-contig coverage/karyotype, N50, length histogram, read-group stats, source-file stats, and FlagStats.
pipe_qc
The pipeline QC section should store SomaticPipe-specific metrics separately from BAM QC:
{
  "somatic_pipe_stats": SomaticPipeStats,
  "variant_stats": VariantsStats,
  "annotation_stats": {
    "initial": AnnotationsStats,
    "post_filters": AnnotationsStats,
    "vep": VepStats
  },
  "caller_outputs": [
    {"caller": "ClairS Somatic", "n_input": 0, "n_after_filters": 0}
  ],
  "filter_steps": [
    {"name": "germline_or_constit", "removed": 0},
    {"name": "low_constit_depth", "removed": 0},
    {"name": "high_constit_alt", "removed": 0},
    {"name": "gnomad_and_constit_alt", "removed": 0},
    {"name": "low_entropy", "removed": 0}
  ]
}
The current SomaticPipeStats tracks the key filtering counters and input categorization. If future code records every intermediate annotation snapshot, store the snapshots here as named MessagePack entries or add optional sections such as annotations_01_initial.
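One use of the filter_steps list is a consistency check when the file is written or read. The sketch below assumes the steps are applied in order and that each removed variant is counted by exactly one step; neither is stated by the current SomaticPipeStats, so treat both as assumptions. `FilterStep` and `remaining_after_filters` are hypothetical names.

```rust
/// Hypothetical typed form of one filter_steps entry.
#[derive(Debug, Clone, PartialEq)]
pub struct FilterStep {
    pub name: String,
    pub removed: u64,
}

/// Returns how many variants remain after all steps, or None if the
/// counters remove more variants than were available (inconsistent stats).
/// Assumption: steps are sequential and their `removed` counts do not overlap.
pub fn remaining_after_filters(n_input: u64, steps: &[FilterStep]) -> Option<u64> {
    steps
        .iter()
        .try_fold(n_input, |left, step| left.checked_sub(step.removed))
}
```

A writer could compare the returned value with the total number of variants in the variants section and refuse to emit a file whose counters do not add up.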
Arrow should be used for derived projections, not for the canonical nested Variants payload.
variant_index
This section can be regenerated from variants, so it is optional. It is useful for a GUI, quick filtering, and summary browsing before loading the full nested payload.
- variant_id: utf8
- contig: utf8
- start: int64
- end: int64
- ref: utf8
- alt: utf8
- variant_type: utf8
- callers: list<utf8>
- n_callers: uint8
- tumor_depth: int32
- tumor_alt: int32
- normal_depth: int32
- normal_alt: int32
- vaf: float32
- filters: list<utf8>
- vep_consequence: utf8
- vep_impact: utf8
- gene: utf8
- cosmic_count: int32
- gnomad_af: float32
methylation
- contig: utf8
- start: int64
- end: int64
- region_id: utf8
- mod_code: utf8
- valid_coverage: int32
- modified_count: int32
- fraction_modified: float32
- strand: utf8
Reader steps:
1. Read the fixed prelude and validate the magic, VERSION, HEADER_LEN, and HEADER_CHECKSUM.
2. Read HEADER_LEN bytes, verify BLAKE3, decompress zstd, decode MessagePack.
3. Check that every section's offset + length is inside the file and non-overlapping.
4. Decode each section according to its kind.
Implementation notes:
- Use bitcode for Rust-native canonical sections that already derive Encode / Decode, especially Variants.
- Use rmp-serde for MessagePack.
- Use the zstd crate for compression.
- Use the aes-gcm and argon2 crates for encryption and key derivation.
- Add a public_index section only if GUI browsing needs non-secret metadata before decryption.
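Returning to the variant_index projection, the quick filtering it is meant to support can be sketched on plain Rust rows, without the Arrow crate. `VariantIndexRow` mirrors only a subset of the schema columns, `QuickFilter` and `select` are hypothetical names, and treating gnomad_af as nullable (`Option<f32>`, with missing values passing the filter) is an assumption about how absent population frequencies would be stored.

```rust
/// Hypothetical row type mirroring a subset of the variant_index columns.
#[derive(Debug, Clone, PartialEq)]
pub struct VariantIndexRow {
    pub variant_id: String,
    pub contig: String,
    pub start: i64,
    pub end: i64,
    pub n_callers: u8,
    pub vaf: f32,
    /// Assumption: null when the variant is absent from gnomAD.
    pub gnomad_af: Option<f32>,
}

/// Thresholds a GUI might expose before loading the full nested payload.
#[derive(Debug, Clone, PartialEq)]
pub struct QuickFilter {
    pub min_callers: u8,
    pub min_vaf: f32,
    pub max_gnomad_af: f32,
}

impl QuickFilter {
    /// True when the row passes every threshold.
    pub fn keep(&self, row: &VariantIndexRow) -> bool {
        row.n_callers >= self.min_callers
            && row.vaf >= self.min_vaf
            && row.gnomad_af.map_or(true, |af| af <= self.max_gnomad_af)
    }
}

/// Returns the ids of the rows that pass the filter, in input order.
pub fn select<'a>(rows: &'a [VariantIndexRow], filter: &QuickFilter) -> Vec<&'a str> {
    rows.iter()
        .filter(|r| filter.keep(r))
        .map(|r| r.variant_id.as_str())
        .collect()
}
```

The returned ids would then be used to look up the matching entries in the canonical variants section, which stays the source of truth.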