Python reference implementation: https://github.com/cortes-ciriano-lab/ont_fb-inv_artifacts/blob/main/fb-inv_artefact_rates.py
Rust implementation: src/io/bam.rs — fb_inv_from_record(), FbInv struct, breakpoints().
A fold-back inversion (fb-inv) occurs when a sequencing read aligns to the reference twice on the same chromosome with opposite strands. The read goes forward along the reference, then "folds back" and runs in reverse over the same (or overlapping) region.
This creates a signal that looks like an inverted duplication: the read covers a stretch of reference sequence twice — once in the forward direction, once in reverse. This is a known artefact in ONT long-read data at sites of DNA damage or structural variation (double-strand breaks, inverted repeats).
S)Reference position:
a d b c
| | | |
v v v v
[=============A===============>] A: primary, + strand
[<=================B====] B: supplementary, − strand
|←— A length —————————————————→|
|←——— B length ————————→|
|←overlap→|
|←— a_d ——→| outer offset = |a − d|
|←— b_c ——→| junction gap = |b − c|
S = 0)When B covers exactly the same region as A:
a b
| |
v v
[=============A===============>]
[<==============B==============]
^ ^
d c
b_c = |b − c| = 0 (no junction gap)
a_d = |a − d| = 0 (perfectly symmetric)
Breakpoints (a, b) and (c, d) are defined in read order (5′→3′ along the read),
not in reference order. This is critical:
| Strand | 5′ terminus (read-first) | 3′ terminus (read-last) |
|---|---|---|
+ |
leftmost ref pos = start |
rightmost ref pos = end |
− |
rightmost ref pos = end |
leftmost ref pos = start |
So breakpoints(start, end, strand) always returns (5′_coord, 3′_coord):
+ strand: (start, end) → a = start, b = end
− strand: (end, start) → c = end, d = start
Alignments A and B are ordered by their position on the read (A comes first). The
canonical fold-back case is A on + strand (forward part of read), B on − strand
(reverse part of read).
aln_a_len = |a − b| alignment A length on reference
aln_b_len = |c − d| alignment B length on reference
b_c = |b − c| read-order junction gap (3′ end of A → 5′ end of B)
a_d = |a − d| read-order outer span (5′ end of A → 3′ end of B)
SFor an inverted duplication shifted by S bp:
A: [X, X+L) on +
B: [X+S, X+L+S) on −
a = X, b = X+L
c = X+L+S, d = X+S
b_c = |(X+L) − (X+L+S)| = S
a_d = |X − (X+S)| = S
Both metrics equal S, the size of the non-duplicated tail on each side.
The duplicated region is [d, b] = [X+S, X+L), of length L − S.
For S = 0 (perfect duplication): b_c = 0, a_d = 0, duplicated region = all of A.
Before computing breakpoints, a separate metric interval_gap is used to pre-filter:
interval_gap = max(A_start, B_start) − min(A_end, B_end)
|gap| bp (typical fb-inv)Note: interval_gap is a reference-space measure. b_c/a_d are read-space measures.
They capture complementary aspects of the fold geometry.
max_overlap = Some(150) allow up to 150 bp overlap
max_gap = Some(0) require overlap — no gap allowed
Why max_gap = Some(0) for artefacts: in Nanopore sequencing, a fold-back artefact
arises when the single-stranded DNA molecule forms a hairpin and re-translocates
back through the pore in the reverse direction. Because the same physical DNA strand is
read twice — once forward, once in reverse — the two alignments must overlap on the
reference. A positive reference-space gap between the alignments is inconsistent with
this mechanism and instead points to a true SV junction with unaligned breakpoint
sequence. Allowing a gap (e.g. max_gap = Some(200)) risks capturing genuine structural
variants and inflating the artefact count.
The reference Python implementation requires strict overlap (overlap > 0 bp) for
exactly this reason.
Events passing the overlap filter are further classified by a_d:
a_d < 150 bp → likely sequencing artefact (short hairpin / palindrome)
a_d ≥ 150 bp → candidate true structural variant
This threshold comes directly from the reference Python implementation.
Reference:---[=====A=================>]---------
--------[<=============B======]-------
^ ^ ^ ^
a d b c
a = 5′ of A (leftmost of A, read-first) = aln_a_start
b = 3′ of A (rightmost of A, read-last) = aln_a_end
c = 5′ of B (rightmost of B, read-first) = aln_b_end
d = 3′ of B (leftmost of B, read-last) = aln_b_start
The region [d, b] (where d < b) is the inverted duplication — covered forward by A
and backward by B.
1. Record must be primary, mapped, MAPQ ≥ min_mapq
2. Exactly one SA tag entry, same chromosome, MAPQ ≥ min_mapq
3. Opposite strands (one +, one −)
4. |interval_gap| ≤ max_overlap (if overlap)
5. interval_gap ≤ max_gap (if gap)
All five conditions must hold for a record to be classified as a fold-back inversion.