memo_fb_inv.md 6.2 KB

Fold-Back Inversion — Geometry, Math, and Implementation Notes

Reference

Python reference implementation: https://github.com/cortes-ciriano-lab/ont_fb-inv_artifacts/blob/main/fb-inv_artefact_rates.py

Rust implementation: src/io/bam.rsfb_inv_from_record(), FbInv struct, breakpoints().


What Is a Fold-Back Inversion?

A fold-back inversion (fb-inv) occurs when a sequencing read aligns to the reference twice on the same chromosome with opposite strands. The read goes forward along the reference, then "folds back" and runs in reverse over the same (or overlapping) region.

This creates a signal that looks like an inverted duplication: the read covers a stretch of reference sequence twice — once in the forward direction, once in reverse. This is a known artefact in ONT long-read data at sites of DNA damage or structural variation (double-strand breaks, inverted repeats).


Geometry

General case (fold with offset S)

Reference position:
    a        d         b         c
    |        |         |         |
    v        v         v         v

    [=============A===============>]          A: primary, + strand
             [<=================B====]        B: supplementary, − strand

    |←— A length —————————————————→|
             |←——— B length ————————→|
             |←overlap→|
    |←— a_d ——→|                             outer offset  = |a − d|
                         |←— b_c ——→|        junction gap  = |b − c|

Canonical perfect inverted duplication (S = 0)

When B covers exactly the same region as A:

    a                             b
    |                             |
    v                             v
    [=============A===============>]
    [<==============B==============]
    ^                             ^
    d                             c

    b_c = |b − c| = 0    (no junction gap)
    a_d = |a − d| = 0    (perfectly symmetric)

Breakpoint Convention

Breakpoints (a, b) and (c, d) are defined in read order (5′→3′ along the read), not in reference order. This is critical:

Strand 5′ terminus (read-first) 3′ terminus (read-last)
+ leftmost ref pos = start rightmost ref pos = end
rightmost ref pos = end leftmost ref pos = start

So breakpoints(start, end, strand) always returns (5′_coord, 3′_coord):

+ strand: (start, end)   →  a = start,  b = end
− strand: (end,   start) →  c = end,    d = start

Alignments A and B are ordered by their position on the read (A comes first). The canonical fold-back case is A on + strand (forward part of read), B on strand (reverse part of read).


Derived Metrics

aln_a_len = |a − b|   alignment A length on reference
aln_b_len = |c − d|   alignment B length on reference
b_c       = |b − c|   read-order junction gap  (3′ end of A → 5′ end of B)
a_d       = |a − d|   read-order outer span    (5′ end of A → 3′ end of B)

Interpretation with offset S

For an inverted duplication shifted by S bp:

A: [X,     X+L)  on +
B: [X+S,   X+L+S) on −

a = X,     b = X+L
c = X+L+S, d = X+S

b_c = |(X+L) − (X+L+S)| = S
a_d = |X − (X+S)|        = S

Both metrics equal S, the size of the non-duplicated tail on each side. The duplicated region is [d, b] = [X+S, X+L), of length L − S.

For S = 0 (perfect duplication): b_c = 0, a_d = 0, duplicated region = all of A.


Reference-Space Overlap / Gap (Filtering)

Before computing breakpoints, a separate metric interval_gap is used to pre-filter:

interval_gap = max(A_start, B_start) − min(A_end, B_end)
  • Negative → the two alignments overlap on the reference by |gap| bp (typical fb-inv)
  • Zero → adjacent
  • Positive → separated by a gap

Note: interval_gap is a reference-space measure. b_c/a_d are read-space measures. They capture complementary aspects of the fold geometry.

Recommended thresholds for artefact detection

max_overlap = Some(150)   allow up to 150 bp overlap
max_gap     = Some(0)     require overlap — no gap allowed

Why max_gap = Some(0) for artefacts: in Nanopore sequencing, a fold-back artefact arises when the single-stranded DNA molecule forms a hairpin and re-translocates back through the pore in the reverse direction. Because the same physical DNA strand is read twice — once forward, once in reverse — the two alignments must overlap on the reference. A positive reference-space gap between the alignments is inconsistent with this mechanism and instead points to a true SV junction with unaligned breakpoint sequence. Allowing a gap (e.g. max_gap = Some(200)) risks capturing genuine structural variants and inflating the artefact count.

The reference Python implementation requires strict overlap (overlap > 0 bp) for exactly this reason.

Artefact vs. true SV classification

Events passing the overlap filter are further classified by a_d:

a_d < 150 bp  →  likely sequencing artefact (short hairpin / palindrome)
a_d ≥ 150 bp  →  candidate true structural variant

This threshold comes directly from the reference Python implementation.


Diagram Label Summary

Reference:---[=====A=================>]---------
          --------[<=============B======]-------
             ^    ^              ^    ^
             a    d              b    c

a = 5′ of A (leftmost of A,  read-first)    = aln_a_start
b = 3′ of A (rightmost of A, read-last)     = aln_a_end
c = 5′ of B (rightmost of B, read-first)    = aln_b_end
d = 3′ of B (leftmost of B,  read-last)     = aln_b_start

The region [d, b] (where d < b) is the inverted duplication — covered forward by A and backward by B.


Filtering Logic Summary

1. Record must be primary, mapped, MAPQ ≥ min_mapq
2. Exactly one SA tag entry, same chromosome, MAPQ ≥ min_mapq
3. Opposite strands (one +, one −)
4. |interval_gap| ≤ max_overlap  (if overlap)
5. interval_gap   ≤ max_gap      (if gap)

All five conditions must hold for a record to be classified as a fold-back inversion.