# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a Rust library for somatic variant calling and analysis from Oxford Nanopore long-read sequencing data. The library provides a complete pipeline from POD5 files through basecalling, alignment, variant calling, annotation, and statistical analysis. It supports execution both locally and via Slurm HPC environments.

## Build and Test Commands

```bash
# Build the library
cargo build

# Run tests with full output
cargo test -- --nocapture

# Run tests with debug logging
RUST_LOG=debug cargo test -- --nocapture

# Format code
cargo fmt

# Lint with warnings as errors
cargo clippy -- -D warnings

# Generate documentation
cargo doc --open
```

## Configuration

The library requires a configuration file at `~/.local/share/pandora/pandora-config.toml`. Use `pandora-config.example.toml` as a template. The configuration system uses path templates with placeholders like `{result_dir}`, `{id}`, `{time}`, `{reference_name}`, and `{haplotagged_bam_tag_name}`.
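
The placeholder substitution described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the helper name `resolve_template`, the template string, and the values passed to it are all hypothetical.

```rust
/// Replace each `{key}` placeholder in `template` with its value.
/// Hypothetical helper for illustration; not the library's actual API.
fn resolve_template(template: &str, vars: &[(&str, &str)]) -> String {
    let mut resolved = template.to_string();
    for (key, value) in vars {
        // Build the literal placeholder text, e.g. "{id}".
        let placeholder = format!("{{{}}}", key);
        resolved = resolved.replace(&placeholder, value);
    }
    // Placeholders with no matching key are left untouched.
    resolved
}
```

Calling `resolve_template("{result_dir}/{id}/{time}", &[("result_dir", "/data/results"), ("id", "CASE01"), ("time", "diag")])` would yield `/data/results/CASE01/diag`. Leaving unknown placeholders in place makes a missing substitution easy to spot in the resulting path.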

Key configuration sections:

- Filesystem layout (result directories, temp paths, database location)
- Reference genome and annotations (FASTA, GFF3, BED files for regions/panels)
- Tool-specific settings (DeepVariant, ClairS, Savana, Nanomonsv, Severus, Longphase, Modkit)
- Alignment configuration (Dorado basecalling, samtools parameters)
- Slurm vs local execution toggle

## Architecture Overview

### Command Execution Pattern

The library uses a trait-based command execution system defined in `src/commands/mod.rs`:

- **`Command` trait**: Provides `init()`, `cmd()`, and `clean_up()` lifecycle methods
- **`LocalRunner` trait**: Executes commands directly via bash
- **`SlurmRunner` trait**: Wraps commands with `srun` or `sbatch` for HPC execution
- **`run!` macro** (line 639): Dispatches to LocalRunner or SlurmRunner based on `config.slurm_runner`
- **`run_many!` macro** (line 987): Parallelizes multiple commands using Rayon

All external tools (dorado, samtools, bcftools, longphase, modkit) implement these traits, allowing seamless switching between local and Slurm execution.
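
The lifecycle and dispatch idea above can be sketched roughly as below. This is a simplified stand-in under stated assumptions: the method signatures, the `SamtoolsIndex` wrapper, and the `build_argv` function are hypothetical, and the real traits and macros in `src/commands/mod.rs` may look quite different. The sketch only builds the argument vector instead of spawning a process.

```rust
/// Simplified stand-in for the lifecycle described above; the real trait
/// in `src/commands/mod.rs` may use different signatures.
trait Command {
    fn init(&mut self) -> Result<(), String> {
        Ok(())
    }
    fn cmd(&self) -> String;
    fn clean_up(&mut self) -> Result<(), String> {
        Ok(())
    }
}

/// Hypothetical wrapper, for illustration only.
struct SamtoolsIndex {
    bam: String,
}

impl Command for SamtoolsIndex {
    fn cmd(&self) -> String {
        format!("samtools index {}", self.bam)
    }
}

/// Build the argv that would be spawned, choosing the local or Slurm form.
/// A real runner would execute the process between `init` and `clean_up`.
fn build_argv<C: Command>(command: &mut C, use_slurm: bool) -> Result<Vec<String>, String> {
    command.init()?;
    let shell_cmd = command.cmd();
    let argv = if use_slurm {
        // Slurm form: prefix the same shell invocation with `srun`.
        vec!["srun".to_string(), "bash".to_string(), "-c".to_string(), shell_cmd]
    } else {
        // Local form: run the command line directly through bash.
        vec!["bash".to_string(), "-c".to_string(), shell_cmd]
    };
    command.clean_up()?;
    Ok(argv)
}
```

Because the tool wrapper only describes its command line, the choice between local and Slurm execution stays in one place, which is the property the `run!` macro relies on.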

### Module Organization

- **`callers/`**: Variant calling tool interfaces
  - `clairs.rs`: ClairS somatic small variant caller with LongPhase haplotagging
  - `deep_variant.rs`, `deep_somatic.rs`: Google DeepVariant/DeepSomatic wrappers
  - `nanomonsv.rs`: Structural variant calling (paired tumor/normal)
  - `savana.rs`: SV and CNV analysis with haplotagged BAM support
  - `severus.rs`: VNTR and repeat-based variant calling
- **`commands/`**: External command wrappers implementing `Command`, `LocalRunner`, and `SlurmRunner`
  - `dorado.rs`: Basecalling and alignment from POD5 files
  - `samtools.rs`, `bcftools.rs`: SAM/BAM/VCF manipulation
  - `longphase.rs`: Phasing and modcall for methylation
  - `modkit.rs`: Methylation pileup and summary
- **`collection/`**: Input data discovery and organization
  - `run.rs`, `prom_run.rs`: PromethION run metadata and POD5 file discovery
  - `bam.rs`: BAM file collection across cases and time points
  - `vcf.rs`: VCF file organization
  - `flowcells.rs`: Flowcell metadata management
  - `minknow.rs`: MinKNOW sample sheet and telemetry parsing
- **`runners.rs`**: Defines `Run`, `Wait`, `RunWait` traits and `run_wait()` function for command execution lifecycle with timestamped `RunReport` generation
- **`pipes/`**: Multi-caller pipeline composition
  - `somatic.rs`: Orchestrates full somatic pipeline across ClairS, Nanomonsv, Savana, etc.
  - `somatic_slurm.rs`: Slurm-optimized batch submission variants
- **`annotation/`**: VEP (Variant Effect Predictor) line parsing and consequence filtering
- **`variant/`**: Variant data structures, loading, filtering, and statistics
  - `variant.rs`: Core variant types, BND graph construction, alteration categorization
  - `variant_collection.rs`: Bulk variant loading and grouping operations
  - `variants_stats.rs`: Mutation rates, depth quality ranges, panel-based stats
- **`io/`**: File readers/writers (BED, GFF, VCF, gzip handling)
- **`positions.rs`**: Genome coordinate representations (`GenomePosition`, `GenomeRange`) with parallel overlap operations
- **`config.rs`**: Global `Config` struct loaded from TOML (line 14 defines the struct)
- **`helpers.rs`**: Path utilities, Shannon entropy, Singularity bind flag generation
- **`scan/`**: Somatic variant scanning algorithms
- **`functions/`**: Genome assembly and custom analysis logic

## Typical Workflow Pattern

1. **POD5 → BAM**: `commands::dorado::Dorado` basecalls and aligns POD5 to reference
2. **BAM → VCF (Variants)**: Use caller modules (e.g., `callers::clairs::ClairS::initialize(...)?.run()?`)
3. **VCF → Annotated JSON**: Load with `variant::variant_collection::Variants`, filter, annotate with `annotation::vep`
4. **Stats Generation**: Create `variant::variants_stats::VariantsStats` for mutation rates and quality metrics
5. **Multi-case orchestration**: Use `pipes::somatic::Somatic` runner or `collection::run::Collections` for batch processing

## Testing Notes

- Integration tests expect test data at the path in `TEST_DIR` constant (`src/lib.rs:158`): `/mnt/beegfs02/scratch/t_steimle/test_data`
- If this path is unavailable, tests may fail or need to be skipped
- Tests are co-located with modules using `#[cfg(test)]`

## Key Dependencies

External tools required at runtime (ensure they are in PATH or configured in config file):

- minimap2, samtools, bcftools (alignment and BAM/VCF handling)
- dorado (ONT basecalling)
- modkit (methylation analysis)
- VEP (variant annotation; see `pandora_lib_variants` for VEP install)
- ClairS, DeepVariant/DeepSomatic, Nanomonsv, Savana, Severus, LongPhase (variant callers, via Docker/Singularity)

Rust dependencies of note:

- `rust-htslib`: HTSlib bindings for BAM/VCF reading (requires `cmake`, `libclang-dev` for build)
- `rayon`: Parallel iteration across samples and tasks
- `dashmap`: Concurrent hashmaps for thread-safe collections
- `arrow`: Efficient columnar data handling (from Apache Arrow)
- `noodles-*`: Pure-Rust bioinformatics file parsers (FASTA, GFF, CSI)

## Dockerized Tool Execution

Tools like ClairS, DeepVariant, and DeepSomatic run via Singularity containers. The `config.singularity_bin` setting defaults to `module load singularity-ce && singularity`. Image paths are specified per tool in the config (e.g., `deepvariant_image`, `clairs_image`).

## Important Conventions

- Use `anyhow::Result` with `?` operator; avoid `unwrap()` in production code paths
- Propagate errors with `.context()` for debugging clarity
- All paths in config use templates; resolve with `format!()` and config field substitution
- Tumor sample is labeled `tumoral_name` (default "diag"), normal is `normal_name` (default "norm")
- Haplotagged BAMs use tag name from `haplotagged_bam_tag_name` config field (default "HP")
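
The error-handling convention above (propagate with `?`, attach context, avoid `unwrap()`) can be illustrated with the sketch below. To stay dependency-free it uses only the standard library, so `ContextError` is a hypothetical stand-in for what `anyhow`'s `.context()` provides; code in this repository would use `anyhow` directly. The function name `read_config_text` is also an assumption.

```rust
use std::fmt;
use std::fs;

/// Hypothetical std-only stand-in for an `anyhow`-style error:
/// a human-readable context message plus the underlying cause.
#[derive(Debug)]
struct ContextError {
    context: String,
    source: std::io::Error,
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.context, self.source)
    }
}

impl std::error::Error for ContextError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        Some(&self.source)
    }
}

/// Read a file and, on failure, say which path was involved
/// instead of surfacing a bare I/O error or panicking.
fn read_config_text(path: &str) -> Result<String, ContextError> {
    fs::read_to_string(path).map_err(|source| ContextError {
        context: format!("failed to read config at {}", path),
        source,
    })
}
```

With `anyhow`, the body of `read_config_text` would typically shrink to a `read_to_string` call followed by `.with_context(...)`, but the intent is the same: the caller sees which file failed and why.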