CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a Rust library for somatic variant calling and analysis from Oxford Nanopore long-read sequencing data. The library provides a complete pipeline from POD5 files through basecalling, alignment, variant calling, annotation, and statistical analysis. Commands can run either locally or on a Slurm HPC cluster.

Build and Test Commands

# Build the library
cargo build

# Run tests with full output
cargo test -- --nocapture

# Run tests with debug logging
RUST_LOG=debug cargo test -- --nocapture

# Format code
cargo fmt

# Lint with warnings as errors
cargo clippy -- -D warnings

# Generate documentation
cargo doc --open

Configuration

The library requires a configuration file at ~/.local/share/pandora/pandora-config.toml. Use pandora-config.example.toml as a template. The configuration system uses path templates with placeholders like {result_dir}, {id}, {time}, {reference_name}, and {haplotagged_bam_tag_name}.
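
The placeholder substitution described above can be sketched as follows. This is a minimal illustration, not the library's actual resolver: the function name resolve_template, the use of str::replace, and the example template are all assumptions made for the example.

```rust
/// Replace every `{key}` placeholder in `template` with its value from `vars`.
/// Placeholders without a matching key are left untouched so they stay visible.
/// NOTE: hypothetical helper, not taken from the library.
fn resolve_template(template: &str, vars: &[(&str, &str)]) -> String {
    let mut out = template.to_string();
    for (key, value) in vars {
        // `{{` and `}}` are literal braces, so this builds the text "{key}".
        out = out.replace(&format!("{{{key}}}"), value);
    }
    out
}
```

Calling resolve_template("{result_dir}/{id}/{time}/{id}_{time}.bam", &[("result_dir", "/data/results"), ("id", "CASE01"), ("time", "diag")]) would yield /data/results/CASE01/diag/CASE01_diag.bam. Leaving unknown placeholders in place makes a missing config field easy to spot in the resulting path.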

Key configuration sections:

  • Filesystem layout (result directories, temp paths, database location)
  • Reference genome and annotations (FASTA, GFF3, BED files for regions/panels)
  • Tool-specific settings (DeepVariant, ClairS, Savana, Nanomonsv, Severus, Longphase, Modkit)
  • Alignment configuration (Dorado basecalling, samtools parameters)
  • Slurm vs local execution toggle

Architecture Overview

Command Execution Pattern

The library uses a trait-based command execution system defined in src/commands/mod.rs:

  • Command trait: Provides init(), cmd(), and clean_up() lifecycle methods
  • LocalRunner trait: Executes commands directly via bash
  • SlurmRunner trait: Wraps commands with srun or sbatch for HPC execution
  • run! macro (line 639): Dispatches to LocalRunner or SlurmRunner based on config.slurm_runner
  • run_many! macro (line 987): Parallelizes multiple commands using Rayon

All external tools (dorado, samtools, bcftools, longphase, modkit) implement these traits, allowing seamless switching between local and Slurm execution.
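
To make the lifecycle and the local/Slurm dispatch more concrete, here is a simplified sketch. The trait below only mirrors the method names listed above; its signatures, the Runner enum, build_argv, and SamtoolsIndex are hypothetical and do not reproduce the definitions in src/commands/mod.rs. It builds the argument vector that would be spawned instead of running anything.

```rust
/// Hypothetical stand-in for the lifecycle trait (signatures are assumptions).
trait Command {
    /// Prepare inputs (create directories, check files). Default: nothing to do.
    fn init(&mut self) -> Result<(), String> {
        Ok(())
    }
    /// The shell command line to execute.
    fn cmd(&self) -> String;
    /// Remove temporary files after the process exits. Default: nothing to do.
    fn clean_up(&self) -> Result<(), String> {
        Ok(())
    }
}

/// Where a command should run; plays the role of the `config.slurm_runner` toggle.
#[derive(Clone, Copy)]
enum Runner {
    Local,
    Slurm { cpus: u32 },
}

/// Initialize `command` and build the argv that would be spawned under `runner`.
/// Nothing is executed here, so `clean_up()` is not called.
fn build_argv<C: Command>(command: &mut C, runner: Runner) -> Result<Vec<String>, String> {
    command.init()?;
    let line = command.cmd();
    let argv = match runner {
        // Local execution: hand the command line to bash.
        Runner::Local => vec!["bash".into(), "-c".into(), line],
        // Slurm execution: the same bash invocation, wrapped in srun.
        Runner::Slurm { cpus } => vec![
            "srun".into(),
            format!("--cpus-per-task={cpus}"),
            "bash".into(),
            "-c".into(),
            line,
        ],
    };
    Ok(argv)
}

/// Example command: index a BAM file with samtools.
struct SamtoolsIndex {
    bam: String,
}

impl Command for SamtoolsIndex {
    fn cmd(&self) -> String {
        format!("samtools index {}", self.bam)
    }
}
```

Because the command only describes what to run, the same SamtoolsIndex value can be passed to either runner; only the wrapper around the command line changes.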

Module Organization

  • callers/: Variant calling tool interfaces

    • clairs.rs: ClairS somatic small variant caller with LongPhase haplotagging
    • deep_variant.rs, deep_somatic.rs: Google DeepVariant/DeepSomatic wrappers
    • nanomonsv.rs: Structural variant calling (paired tumor/normal)
    • savana.rs: SV and CNV analysis with haplotagged BAM support
    • severus.rs: Severus structural variant calling (can use VNTR annotations)
  • commands/: External command wrappers implementing Command, LocalRunner, and SlurmRunner

    • dorado.rs: Basecalling and alignment from POD5 files
    • samtools.rs, bcftools.rs: SAM/BAM/VCF manipulation
    • longphase.rs: Phasing and modcall for methylation
    • modkit.rs: Methylation pileup and summary
  • collection/: Input data discovery and organization

    • run.rs, prom_run.rs: PromethION run metadata and POD5 file discovery
    • bam.rs: BAM file collection across cases and time points
    • vcf.rs: VCF file organization
    • flowcells.rs: Flowcell metadata management
    • minknow.rs: MinKNOW sample sheet and telemetry parsing
  • runners.rs: Defines Run, Wait, RunWait traits and run_wait() function for command execution lifecycle with timestamped RunReport generation

  • pipes/: Multi-caller pipeline composition

    • somatic.rs: Orchestrates full somatic pipeline across ClairS, Nanomonsv, Savana, etc.
    • somatic_slurm.rs: Slurm-optimized batch submission variants
  • annotation/: VEP (Variant Effect Predictor) line parsing and consequence filtering

  • variant/: Variant data structures, loading, filtering, and statistics

    • variant.rs: Core variant types, BND graph construction, alteration categorization
    • variant_collection.rs: Bulk variant loading and grouping operations
    • variants_stats.rs: Mutation rates, depth quality ranges, panel-based stats
  • io/: File readers/writers (BED, GFF, VCF, gzip handling)

  • positions.rs: Genome coordinate representations (GenomePosition, GenomeRange) with parallel overlap operations

  • config.rs: Global Config struct loaded from TOML (line 14 defines the struct)

  • helpers.rs: Path utilities, Shannon entropy, Singularity bind flag generation

  • scan/: Somatic variant scanning algorithms

  • functions/: Genome assembly and custom analysis logic
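
As an illustration of one of the utilities mentioned for helpers.rs, Shannon entropy of a sequence can be computed as H = -Σ p_i log2(p_i) over the observed symbol frequencies. The sketch below is an assumption about how such a helper might look; the function name and the choice to work on raw bytes are not taken from the library.

```rust
use std::collections::HashMap;

/// Shannon entropy (in bits) of the byte composition of `seq`.
/// Returns 0.0 for an empty sequence.
/// NOTE: hypothetical helper, not the implementation in helpers.rs.
fn shannon_entropy(seq: &[u8]) -> f64 {
    if seq.is_empty() {
        return 0.0;
    }
    // Count how often each byte occurs.
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &b in seq {
        *counts.entry(b).or_insert(0) += 1;
    }
    let n = seq.len() as f64;
    // Sum -p * log2(p) over the observed symbols.
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

A homopolymer such as AAAA has entropy 0 bits, while ACGT, with four equally frequent bases, reaches the 2-bit maximum for a four-letter alphabet. Low values are a common signal of low-complexity sequence context.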

Typical Workflow Pattern

  1. POD5 → BAM: commands::dorado::Dorado basecalls and aligns POD5 to reference
  2. BAM → VCF (Variants): Use caller modules (e.g., callers::clairs::ClairS::initialize(...)?.run()?)
  3. VCF → Annotated JSON: Load with variant::variant_collection::Variants, filter, annotate with annotation::vep
  4. Stats Generation: Create variant::variants_stats::VariantsStats for mutation rates and quality metrics
  5. Multi-case orchestration: Use pipes::somatic::Somatic runner or collection::run::Collections for batch processing

Testing Notes

  • Integration tests expect test data at the path in TEST_DIR constant (src/lib.rs:158): /mnt/beegfs02/scratch/t_steimle/test_data
  • If this path is unavailable, tests may fail or need to be skipped
  • Tests are co-located with modules using #[cfg(test)]
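
One way to handle a missing test data directory is to check for it at the start of each integration test and return early. The sketch below assumes this approach; test_data_available and skip_reason are hypothetical helpers, and only the TEST_DIR value comes from the notes above.

```rust
use std::path::Path;

/// Path quoted in the testing notes above.
const TEST_DIR: &str = "/mnt/beegfs02/scratch/t_steimle/test_data";

/// Returns true when `dir` exists and is a directory.
fn test_data_available(dir: &str) -> bool {
    Path::new(dir).is_dir()
}

/// Hypothetical helper: `Some(message)` when a test should be skipped
/// because its data directory is missing, `None` when it can proceed.
fn skip_reason(dir: &str) -> Option<String> {
    if test_data_available(dir) {
        None
    } else {
        Some(format!("skipping: test data not found at {dir}"))
    }
}
```

A test would begin with something like: if let Some(reason) = skip_reason(TEST_DIR) { eprintln!("{reason}"); return; }. Note that a test skipped this way is reported as passed, so the message is only visible when running with --nocapture.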

Key Dependencies

External tools required at runtime (ensure they are in PATH or configured in config file):

  • minimap2, samtools, bcftools (alignment and BAM/VCF handling)
  • dorado (ONT basecalling)
  • modkit (methylation analysis)
  • VEP (variant annotation; see pandora_lib_variants for VEP install)
  • ClairS, DeepVariant/DeepSomatic, Nanomonsv, Savana, Severus, LongPhase (variant callers, via Docker/Singularity)

Rust dependencies of note:

  • rust-htslib: HTSlib bindings for BAM/VCF reading (requires cmake, libclang-dev for build)
  • rayon: Parallel iteration across samples and tasks
  • dashmap: Concurrent hashmaps for thread-safe collections
  • arrow: Efficient columnar data handling (from Apache Arrow)
  • noodles-*: Pure-Rust bioinformatics file parsers (FASTA, GFF, CSI)

Containerized Tool Execution

Tools like ClairS, DeepVariant, and DeepSomatic run via Singularity containers. The config.singularity_bin setting defaults to module load singularity-ce && singularity. Image paths are specified per tool in the config (e.g., deepvariant_image, clairs_image).

Important Conventions

  • Use anyhow::Result with ? operator; avoid unwrap() in production code paths
  • Propagate errors with .context() for debugging clarity
  • All paths in config use templates; resolve with format!() and config field substitution
  • Tumor sample is labeled tumoral_name (default "diag"), normal is normal_name (default "norm")
  • Haplotagged BAMs use tag name from haplotagged_bam_tag_name config field (default "HP")
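
The error-handling convention (propagate with ?, attach context for debugging) can be illustrated without external crates. The sketch below uses only the standard library as a stand-in; in the library itself the equivalent would be anyhow::Result with .context(). ContextError and read_config_text are hypothetical names introduced for this example.

```rust
use std::fmt;
use std::fs;

/// Minimal error type carrying a context message plus the underlying cause.
/// Std-only stand-in for what anyhow's `.context()` provides.
#[derive(Debug)]
struct ContextError {
    context: String,
    source: std::io::Error,
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.context, self.source)
    }
}

impl std::error::Error for ContextError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        Some(&self.source)
    }
}

/// Read a file, attaching the path to any I/O error before propagating it with `?`.
fn read_config_text(path: &str) -> Result<String, ContextError> {
    let text = fs::read_to_string(path).map_err(|source| ContextError {
        context: format!("failed to read config at {path}"),
        source,
    })?;
    Ok(text)
}
```

The point of the pattern is that the caller sees which file failed rather than a bare I/O error, and no unwrap() is needed along the way.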