# Fold-Back Inversion — Geometry, Math, and Implementation Notes ## Reference Python reference implementation: https://github.com/cortes-ciriano-lab/ont_fb-inv_artifacts/blob/main/fb-inv_artefact_rates.py Rust implementation: `src/io/bam.rs` — `fb_inv_from_record()`, `FbInv` struct, `breakpoints()`. --- ## What Is a Fold-Back Inversion? A fold-back inversion (fb-inv) occurs when a sequencing read aligns to the reference twice on the **same chromosome** with **opposite strands**. The read goes forward along the reference, then "folds back" and runs in reverse over the same (or overlapping) region. This creates a signal that looks like an **inverted duplication**: the read covers a stretch of reference sequence twice — once in the forward direction, once in reverse. This is a known artefact in ONT long-read data at sites of DNA damage or structural variation (double-strand breaks, inverted repeats). --- ## Geometry ### General case (fold with offset `S`) ``` Reference position: a d b c | | | | v v v v [=============A===============>] A: primary, + strand [<=================B====] B: supplementary, − strand |←— A length —————————————————→| |←——— B length ————————→| |←overlap→| |←— a_d ——→| outer offset = |a − d| |←— b_c ——→| junction gap = |b − c| ``` ### Canonical perfect inverted duplication (`S = 0`) When B covers exactly the same region as A: ``` a b | | v v [=============A===============>] [<==============B==============] ^ ^ d c b_c = |b − c| = 0 (no junction gap) a_d = |a − d| = 0 (perfectly symmetric) ``` --- ## Breakpoint Convention Breakpoints `(a, b)` and `(c, d)` are defined in **read order** (5′→3′ along the read), not in reference order. This is critical: | Strand | 5′ terminus (read-first) | 3′ terminus (read-last) | |--------|--------------------------|--------------------------| | `+` | leftmost ref pos = `start` | rightmost ref pos = `end` | | `−` | rightmost ref pos = `end` | leftmost ref pos = `start` | So `breakpoints(start, end, strand)` always returns `(5′_coord, 3′_coord)`: ``` + strand: (start, end) → a = start, b = end − strand: (end, start) → c = end, d = start ``` Alignments A and B are ordered by their position **on the read** (A comes first). The canonical fold-back case is A on `+` strand (forward part of read), B on `−` strand (reverse part of read). --- ## Derived Metrics ``` aln_a_len = |a − b| alignment A length on reference aln_b_len = |c − d| alignment B length on reference b_c = |b − c| read-order junction gap (3′ end of A → 5′ end of B) a_d = |a − d| read-order outer span (5′ end of A → 3′ end of B) ``` ### Interpretation with offset `S` For an inverted duplication shifted by `S` bp: ``` A: [X, X+L) on + B: [X+S, X+L+S) on − a = X, b = X+L c = X+L+S, d = X+S b_c = |(X+L) − (X+L+S)| = S a_d = |X − (X+S)| = S ``` Both metrics equal `S`, the size of the non-duplicated tail on each side. The **duplicated region** is `[d, b] = [X+S, X+L)`, of length `L − S`. For `S = 0` (perfect duplication): `b_c = 0`, `a_d = 0`, duplicated region = all of A. --- ## Reference-Space Overlap / Gap (Filtering) Before computing breakpoints, a separate metric `interval_gap` is used to pre-filter: ``` interval_gap = max(A_start, B_start) − min(A_end, B_end) ``` - Negative → the two alignments **overlap** on the reference by `|gap|` bp (typical fb-inv) - Zero → adjacent - Positive → separated by a gap Note: `interval_gap` is a reference-space measure. `b_c`/`a_d` are read-space measures. They capture complementary aspects of the fold geometry. ### Recommended thresholds for artefact detection ``` max_overlap = Some(150) allow up to 150 bp overlap max_gap = Some(0) require overlap — no gap allowed ``` **Why `max_gap = Some(0)` for artefacts:** in Nanopore sequencing, a fold-back artefact arises when the single-stranded DNA molecule forms a **hairpin** and re-translocates back through the pore in the reverse direction. Because the same physical DNA strand is read twice — once forward, once in reverse — the two alignments **must overlap** on the reference. A positive reference-space gap between the alignments is inconsistent with this mechanism and instead points to a true SV junction with unaligned breakpoint sequence. Allowing a gap (e.g. `max_gap = Some(200)`) risks capturing genuine structural variants and inflating the artefact count. The reference Python implementation requires strict overlap (`overlap > 0 bp`) for exactly this reason. ### Artefact vs. true SV classification Events passing the overlap filter are further classified by `a_d`: ``` a_d < 150 bp → likely sequencing artefact (short hairpin / palindrome) a_d ≥ 150 bp → candidate true structural variant ``` This threshold comes directly from the reference Python implementation. --- ## Diagram Label Summary ``` Reference:---[=====A=================>]--------- --------[<=============B======]------- ^ ^ ^ ^ a d b c a = 5′ of A (leftmost of A, read-first) = aln_a_start b = 3′ of A (rightmost of A, read-last) = aln_a_end c = 5′ of B (rightmost of B, read-first) = aln_b_end d = 3′ of B (leftmost of B, read-last) = aln_b_start ``` The region `[d, b]` (where d < b) is the **inverted duplication** — covered forward by A and backward by B. --- ## Filtering Logic Summary ``` 1. Record must be primary, mapped, MAPQ ≥ min_mapq 2. Exactly one SA tag entry, same chromosome, MAPQ ≥ min_mapq 3. Opposite strands (one +, one −) 4. |interval_gap| ≤ max_overlap (if overlap) 5. interval_gap ≤ max_gap (if gap) ``` All five conditions must hold for a record to be classified as a fold-back inversion.