# SomaticPipe Output Container Format

This document proposes a single binary container for SomaticPipe outputs. The goal is to replace scattered `.json.gz`, `.bit`, VCF-derived tables, BAM QC files, pipeline QC files, methylation summaries, CNV files, and fingerprint artifacts with one random-access file that is compact, typed, encrypted, and versioned.

Recommended extension: `.pandora`

## Design Goals

- Store all sample-level SomaticPipe results in one file.
- Preserve the current merged `Variants` model instead of forcing SNV/SV data into a flat table.
- Keep optional derived indexes columnar and fast to load with Arrow IPC.
- Keep small structured metadata simple with MessagePack.
- Allow random access to each result section without reading the full file.
- Encrypt patient-identifying result payloads with authenticated encryption.
- Make corruption and version mismatch failures explicit before decoding payloads.
- Keep the format append-friendly enough for future sections.

## File Layout

All integers in the fixed prelude are unsigned big-endian.

```text
[MAGIC: 8 bytes]          "PANDORA\0"
[VERSION: u16]            format version, initial value 1
[HEADER_LEN: u64]         compressed header length in bytes
[HEADER_CHECKSUM: 32]     BLAKE3(header_zstd)
[HEADER: msgpack + zstd]
[SECTION 0..N]
```
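The fixed prelude is 50 bytes long: 8 (magic) + 2 (version) + 8 (header length) + 32 (checksum). A minimal sketch of parsing it with the Rust standard library only; the names `Prelude`, `PreludeError`, and `parse_prelude` are hypothetical and not part of the existing codebase, and BLAKE3 verification of the header is left out because it needs an external crate.

```rust
/// Size of the fixed prelude in bytes: 8 + 2 + 8 + 32.
pub const PRELUDE_LEN: usize = 50;
/// Magic bytes from the layout: "PANDORA" followed by a NUL byte.
pub const MAGIC: [u8; 8] = *b"PANDORA\0";
/// The only format version this sketch accepts.
pub const SUPPORTED_VERSION: u16 = 1;

/// Hypothetical in-memory view of the prelude fields after the magic.
#[derive(Debug, PartialEq, Eq)]
pub struct Prelude {
    pub version: u16,
    pub header_len: u64,
    pub header_checksum: [u8; 32],
}

#[derive(Debug, PartialEq, Eq)]
pub enum PreludeError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u16),
}

/// Parse the prelude from the first bytes of a file.
/// All integers are unsigned big-endian, as stated in the layout.
pub fn parse_prelude(bytes: &[u8]) -> Result<Prelude, PreludeError> {
    if bytes.len() < PRELUDE_LEN {
        return Err(PreludeError::TooShort);
    }
    if &bytes[..8] != &MAGIC[..] {
        return Err(PreludeError::BadMagic);
    }
    let version = u16::from_be_bytes([bytes[8], bytes[9]]);
    if version != SUPPORTED_VERSION {
        return Err(PreludeError::UnsupportedVersion(version));
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[10..18]);
    let mut header_checksum = [0u8; 32];
    header_checksum.copy_from_slice(&bytes[18..50]);
    Ok(Prelude {
        version,
        header_len: u64::from_be_bytes(len_bytes),
        header_checksum,
    })
}
```

A reader would call `parse_prelude` on the first 50 bytes, then read `header_len` bytes and compare their BLAKE3 digest with `header_checksum` before decompressing anything.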

The header is not encrypted because readers need the section table before they can load payloads. It must not contain direct clinical results. Patient-identifying fields such as `sample_id` should be either pseudonymized or moved into an encrypted metadata section if the file may leave a controlled environment.

## Header Schema

The header is MessagePack serialized and then zstd-compressed.

```text
{
  "format": "somaticpipe.output",
  "format_version": 1,
  "producer": {
    "name": "pandora_lib_promethion",
    "pipeline": "SomaticPipe",
    "pipeline_version": "0.1.0",
    "git_commit": "...",
    "created_at": "2026-05-27T00:00:00Z"
  },
  "sample": {
    "sample_id": "pseudonym-or-local-id",
    "tumor_timepoint": "diag",
    "normal_timepoint": "constit",
    "reference": "hs1",
    "reference_digest": "blake3:..."
  },
  "compression": {
    "algorithm": "zstd",
    "level": 3
  },
  "encryption": {
    "algorithm": "AES-256-GCM",
    "key_derivation": "Argon2id",
    "salt": "<base64 16-32 bytes>",
    "aad": "fixed-prelude + canonical-section-descriptor"
  },
  "sections": [
    {
      "name": "variants",
      "kind": "bitcode",
      "compression": "zstd",
      "encryption": "AES-256-GCM",
      "offset": 0,
      "length": 0,
      "nonce": "<base64 12 bytes>",
      "tag": "<base64 16 bytes>",
      "checksum": "blake3:<ciphertext digest>",
      "schema_hash": "blake3:<logical schema digest>",
      "required": true
    }
  ]
}
```

Offsets are absolute byte offsets from the start of the file. `length` is the number of stored bytes in the section, that is, the ciphertext length; the GCM tag is stored in the header and is not counted. Before decoding anything, a reader must validate the magic, version, header checksum, and every section's offset/length bounds. Section checksums are verified per section, when that section is read, so that random access does not require scanning the whole file.
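The bounds validation can be sketched as follows, using only the standard library. `SectionSpan`, `LayoutError`, and `validate_layout` are hypothetical names, and `payload_start` is assumed to be the first byte after the compressed header (50 + `HEADER_LEN`).

```rust
/// Hypothetical subset of a section descriptor: just what bounds checking needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SectionSpan {
    pub name: String,
    pub offset: u64,
    pub length: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LayoutError {
    Overflow(String),
    BeforePayloadStart(String),
    PastEndOfFile(String),
    Overlap(String, String),
}

/// Check that every section lies in [payload_start, file_len) and that no two overlap.
pub fn validate_layout(
    sections: &[SectionSpan],
    payload_start: u64,
    file_len: u64,
) -> Result<(), LayoutError> {
    let mut spans: Vec<(u64, u64, &str)> = Vec::with_capacity(sections.len());
    for s in sections {
        // checked_add guards against a crafted offset + length wrapping around.
        let end = s
            .offset
            .checked_add(s.length)
            .ok_or_else(|| LayoutError::Overflow(s.name.clone()))?;
        if s.offset < payload_start {
            return Err(LayoutError::BeforePayloadStart(s.name.clone()));
        }
        if end > file_len {
            return Err(LayoutError::PastEndOfFile(s.name.clone()));
        }
        spans.push((s.offset, end, s.name.as_str()));
    }
    // After sorting by start offset, any overlap shows up between neighbours.
    spans.sort_by_key(|&(start, _, _)| start);
    for pair in spans.windows(2) {
        let (_, prev_end, prev_name) = pair[0];
        let (next_start, _, next_name) = pair[1];
        if next_start < prev_end {
            return Err(LayoutError::Overlap(
                prev_name.to_string(),
                next_name.to_string(),
            ));
        }
    }
    Ok(())
}
```

Running this check once, right after decoding the header, means later section reads can trust the descriptor table.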

## Section Encoding

Each encrypted section uses this transform order:

```text
logical payload -> encode -> zstd compress -> AES-256-GCM encrypt -> write bytes
```

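The GCM authentication tag is stored in the header. The additional authenticated data (AAD) should be a canonical representation of the fixed prelude plus the section descriptor fields `offset`, `length`, `nonce`, `name`, `kind`, and `schema_hash`. Only `MAGIC` and `VERSION` can be taken from the prelude: `HEADER_LEN` and `HEADER_CHECKSUM` depend on the header, which itself contains the tag, so including them would make the tag depend on its own value. Binding these fields prevents moving encrypted bytes between sections or files without detection.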
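One possible canonical AAD encoding is sketched below. The document does not define the encoding, so the length-prefixed layout, the field order, and the names `SectionDescriptor`, `push_field`, and `build_aad` are all assumptions made for illustration. Length prefixes keep adjacent fields from running together, so `name = "ab", kind = "c"` cannot produce the same bytes as `name = "a", kind = "bc"`.

```rust
/// Hypothetical view of the descriptor fields that are bound into the AAD.
pub struct SectionDescriptor<'a> {
    pub name: &'a str,
    pub kind: &'a str,
    pub offset: u64,
    pub length: u64,
    pub nonce: [u8; 12],
    pub schema_hash: &'a str,
}

/// Append one field as an 8-byte big-endian length followed by the raw bytes.
fn push_field(out: &mut Vec<u8>, field: &[u8]) {
    out.extend_from_slice(&(field.len() as u64).to_be_bytes());
    out.extend_from_slice(field);
}

/// Build the AAD from the prelude bytes chosen for binding plus the descriptor.
/// The field order is fixed, so writer and reader always produce identical bytes.
pub fn build_aad(prelude_fields: &[u8], d: &SectionDescriptor<'_>) -> Vec<u8> {
    let mut aad = Vec::new();
    push_field(&mut aad, prelude_fields);
    push_field(&mut aad, d.name.as_bytes());
    push_field(&mut aad, d.kind.as_bytes());
    push_field(&mut aad, &d.offset.to_be_bytes());
    push_field(&mut aad, &d.length.to_be_bytes());
    push_field(&mut aad, &d.nonce);
    push_field(&mut aad, d.schema_hash.as_bytes());
    aad
}
```

The resulting bytes would be passed as the associated data to AES-256-GCM on both encryption and decryption; any change to a bound field then makes tag verification fail.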

Checksums are BLAKE3 over the stored bytes (the ciphertext), not the plaintext. AES-GCM already authenticates the ciphertext and the AAD during decryption; the checksum adds no security, but it detects storage corruption without needing the key and helps diagnostics.

## Standard Sections

| Section | Required | Encoding | Notes |
| --- | --- | --- | --- |
| `variants` | yes | bitcode or MessagePack + zstd + AES-256-GCM | Canonical merged `Variants` struct. This keeps SNV, indel, SV, BND, and CNV-like VCF calls together. |
| `variant_index` | no | Arrow IPC file + zstd + AES-256-GCM | Optional flattened projection for fast GUI/search. It is derived from `variants`, not canonical. |
| `copy_number` | no | MessagePack or Arrow IPC file + zstd + AES-256-GCM | Savana `SavanaCN` / `CNSegment` absolute copy-number segments. |
| `bam_qc` | yes | MessagePack + zstd + AES-256-GCM | Tumor/normal `WGSBamStats`, including coverage, N50, karyotype, read groups, source files, and flag stats. |
| `pipe_qc` | yes | MessagePack + zstd + AES-256-GCM | SomaticPipe filtering counters, input caller counts, annotation summaries, VEP stats, tool versions. |
| `methylation` | no | Arrow IPC file + zstd + AES-256-GCM | Per-region or per-CpG 5mC outputs. |
| `fingerprint` | yes | MessagePack + zstd + AES-256-GCM | Sample identity/fingerprint data; encrypt by default because it is identifying. |
| `provenance` | yes | MessagePack + zstd + AES-256-GCM | Input file digests, command lines, container/image versions, config digest. |

Leaving the `fingerprint` section unencrypted, as has been suggested, is risky: fingerprints are identifiers. The safer default is to encrypt it like the other clinical sections. If a public fingerprint is required for indexing, add a separate `public_index` section containing only non-reversible IDs and high-level availability flags.

## Canonical Data Models

### `variants`

The canonical variant section should store the existing merged Rust model:

```text
Variants {
  data: Vec<Variant>
}

Variant {
  hash: Hash128,
  position: GenomePosition,
  reference: ReferenceAlternative,
  alternative: ReferenceAlternative,
  vcf_variants: Vec<VcfVariant>,
  annotations: Vec<Annotation>
}
```

This is important because the current model intentionally merges SNV, indel, SV/BND, and caller annotations into one nested representation. A flat Arrow table would lose structure or require many lossy string columns. For version 1, use the existing `bitcode` representation when compatibility with the `.bit` output is desired. MessagePack is acceptable if cross-language readers are more important than Rust-native speed.

### `copy_number`

The CNV section should store Savana absolute copy-number segments:

```text
SavanaCN {
  segments: Vec<CNSegment>
}

CNSegment {
  chromosome: String,
  start: u64,
  end: u64,
  segment_id: String,
  bin_count: u32,
  sum_of_bin_lengths: u64,
  weight: f64,
  copy_number: f64,
  minor_allele_copy_number: Option<f64>,
  mean_baf: Option<f64>,
  no_het_snps: u32
}
```

This section is distinct from `variants` even if a caller also emits CNV-like VCF records. Segment-level absolute copy number is a continuous genome track and should not be forced into the merged variant list.
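To illustrate why a segment track is queried differently from a variant list, here is a minimal sketch of a position lookup. The struct mirrors the fields listed above; the function `copy_number_at` is hypothetical, and treating `start` and `end` as inclusive coordinates is an assumption, since the coordinate convention is not stated here.

```rust
/// Mirrors the `CNSegment` fields listed in this document.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct CNSegment {
    pub chromosome: String,
    pub start: u64,
    pub end: u64,
    pub segment_id: String,
    pub bin_count: u32,
    pub sum_of_bin_lengths: u64,
    pub weight: f64,
    pub copy_number: f64,
    pub minor_allele_copy_number: Option<f64>,
    pub mean_baf: Option<f64>,
    pub no_het_snps: u32,
}

#[derive(Debug, Clone, Default, PartialEq)]
pub struct SavanaCN {
    pub segments: Vec<CNSegment>,
}

/// Return the absolute copy number of the segment covering `pos`, if any.
/// Assumption: `start` and `end` are both inclusive.
pub fn copy_number_at(track: &SavanaCN, chromosome: &str, pos: u64) -> Option<f64> {
    track
        .segments
        .iter()
        .find(|s| s.chromosome == chromosome && s.start <= pos && pos <= s.end)
        .map(|s| s.copy_number)
}
```

A linear scan is enough for a sketch; a real reader would likely sort segments per chromosome and use a binary search.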

### `bam_qc`

The BAM QC section should store one `WGSBamStats` payload per analyzed BAM, keyed by role/timepoint:

```text
{
  "tumor": WGSBamStats,
  "normal": WGSBamStats,
  "additional_bams": [
    {"role": "mrd", "stats": WGSBamStats}
  ]
}
```

`WGSBamStats` already contains the useful BAM-level QC fields: total records, passing reads, mapped fraction, unmapped/duplicate/low-MAPQ counts, mapped yield, read lengths, global coverage, per-contig coverage/karyotype, N50, length histogram, read-group stats, source-file stats, and `FlagStats`.

### `pipe_qc`

The pipeline QC section should store SomaticPipe-specific metrics separately from BAM QC:

```text
{
  "somatic_pipe_stats": SomaticPipeStats,
  "variant_stats": VariantsStats,
  "annotation_stats": {
    "initial": AnnotationsStats,
    "post_filters": AnnotationsStats,
    "vep": VepStats
  },
  "caller_outputs": [
    {"caller": "ClairS Somatic", "n_input": 0, "n_after_filters": 0}
  ],
  "filter_steps": [
    {"name": "germline_or_constit", "removed": 0},
    {"name": "low_constit_depth", "removed": 0},
    {"name": "high_constit_alt", "removed": 0},
    {"name": "gnomad_and_constit_alt", "removed": 0},
    {"name": "low_entropy", "removed": 0}
  ]
}
```

The current `SomaticPipeStats` tracks the key filtering counters and input categorization. If future code records every intermediate annotation snapshot, store the snapshots here as named MessagePack entries or add optional sections such as `annotations_01_initial`.

## Optional Arrow Indexes

Arrow should be used for derived projections, not for the canonical nested `Variants` payload.

### `variant_index`

This section can be regenerated from `variants`, so it is optional. It is useful for a GUI, quick filtering, and summary browsing before loading the full nested payload.

- `variant_id: utf8`
- `contig: utf8`
- `start: int64`
- `end: int64`
- `ref: utf8`
- `alt: utf8`
- `variant_type: utf8`
- `callers: list<utf8>`
- `n_callers: uint8`
- `tumor_depth: int32`
- `tumor_alt: int32`
- `normal_depth: int32`
- `normal_alt: int32`
- `vaf: float32`
- `filters: list<utf8>`
- `vep_consequence: utf8`
- `vep_impact: utf8`
- `gene: utf8`
- `cosmic_count: int32`
- `gnomad_af: float32`

### `methylation`

- `contig: utf8`
- `start: int64`
- `end: int64`
- `region_id: utf8`
- `mod_code: utf8`
- `valid_coverage: int32`
- `modified_count: int32`
- `fraction_modified: float32`
- `strand: utf8`

## Versioning Rules

- Readers must reject unsupported major versions. The prelude carries a single `VERSION` value, so it acts as the major version.
- New optional sections may be added without changing the major version.
- Adding nullable columns to Arrow sections is a minor-compatible change.
- Removing columns, changing types, or changing crypto/compression semantics requires a major version bump.
- Section names are stable ASCII identifiers.
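The column rules above can be expressed as a small compatibility check. This is a sketch with hypothetical names (`Column`, `requires_major_bump`); type names are plain strings rather than real Arrow types, and treating a newly added non-nullable column as a breaking change is an assumption, since the rules only mention nullable additions.

```rust
/// Hypothetical, simplified description of one Arrow column.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Column {
    pub name: &'static str,
    pub data_type: &'static str,
    pub nullable: bool,
}

/// Return true if going from `old` to `new` needs a major version bump.
pub fn requires_major_bump(old: &[Column], new: &[Column]) -> bool {
    // Every existing column must still be present with the same type.
    let kept_intact = old.iter().all(|o| {
        new.iter()
            .any(|n| n.name == o.name && n.data_type == o.data_type)
    });
    // Every added column must be nullable (assumption, see lead-in).
    let additions_nullable = new
        .iter()
        .filter(|n| old.iter().all(|o| o.name != n.name))
        .all(|n| n.nullable);
    !(kept_intact && additions_nullable)
}
```

A writer could run such a check in CI against the previous release's schema to catch accidental breaking changes to `variant_index` or `methylation`.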

## Minimal Reader Algorithm

1. Read and validate the 8-byte magic.
2. Read `VERSION`, `HEADER_LEN`, and `HEADER_CHECKSUM`; reject an unsupported `VERSION`.
3. Read `HEADER_LEN` bytes, verify BLAKE3, decompress zstd, decode MessagePack.
4. Check that every section's `offset + length` range starts after the header, ends inside the file, and does not overlap another section.
5. For the requested section, read bytes and verify BLAKE3 checksum.
6. Derive the key with Argon2id using the header `salt`, then decrypt with AES-256-GCM using the section nonce, tag, and AAD.
7. Decompress zstd.
8. Decode bitcode, Arrow IPC, or MessagePack according to `kind`.
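The random-access read in step 5 can be sketched with the standard library alone. `read_section_bytes` is a hypothetical helper; it only fetches the stored bytes of one section, and the BLAKE3 check, decryption, and decompression of steps 5 to 7 are left out because they depend on external crates.

```rust
use std::convert::TryFrom;
use std::io::{self, Read, Seek, SeekFrom};

/// Read exactly `length` stored bytes starting at absolute `offset`,
/// without touching any other part of the file.
pub fn read_section_bytes<R: Read + Seek>(
    reader: &mut R,
    offset: u64,
    length: u64,
) -> io::Result<Vec<u8>> {
    let invalid = |msg: &'static str| io::Error::new(io::ErrorKind::InvalidData, msg);

    // Re-check bounds before allocating, so a corrupt length cannot trigger
    // a huge allocation.
    let file_len = reader.seek(SeekFrom::End(0))?;
    let end = offset
        .checked_add(length)
        .ok_or_else(|| invalid("section range overflows u64"))?;
    if end > file_len {
        return Err(invalid("section extends past end of file"));
    }
    let len = usize::try_from(length)
        .map_err(|_| invalid("section too large for this platform"))?;

    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

Because the function is generic over `Read + Seek`, it works with a `std::fs::File` as well as an in-memory `std::io::Cursor`, which is convenient for tests.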

## Open Implementation Choices

- Use `bitcode` for Rust-native canonical sections that already derive `Encode` / `Decode`, especially `Variants`.
- Use `rmp-serde` for MessagePack.
- Use `zstd` crate for compression.
- Use `aes-gcm` and `argon2` crates for encryption and key derivation.
- Store Arrow IPC as file format rather than stream format for derived indexes and tabular tracks, because file format carries schema/footer metadata and is better for self-contained sections.
- Add a `public_index` section only if GUI browsing needs non-secret metadata before decryption.