Pipeline Integration¶
Integrate ChimeraLM into your bioinformatics pipelines using Bash, Nextflow, and Snakemake for reproducible, scalable analysis.
Learning Objectives
By the end of this tutorial, you will be able to:
- Integrate ChimeraLM into Bash scripts for simple automation
- Build Nextflow pipelines with ChimeraLM filtering
- Create Snakemake workflows for reproducible analysis
- Handle errors and logging in production pipelines
- Scale to large cohorts (100s-1000s of samples)
Prerequisites: ChimeraLM installed, basic knowledge of Bash/Nextflow/Snakemake
Time: ~45 minutes
Integration Options¶
| Method | Best For | Complexity | Scalability |
|---|---|---|---|
| Bash Script | Simple workflows, single machine | Low | 1-10 samples |
| Nextflow | Cloud/HPC, complex pipelines | Medium | 10-1000s samples |
| Snakemake | Reproducibility, local/cluster | Medium | 10-1000s samples |
| WDL | Cloud platforms (Terra, Cromwell) | Medium-High | 100s-1000s samples |
Bash Script Integration¶
Basic Pipeline¶
#!/bin/bash
# chimera_filter_pipeline.sh - Simple ChimeraLM filtering pipeline
set -euo pipefail # Exit on error, undefined variables, pipe failures
# Configuration
INPUT_BAM=$1
OUTPUT_DIR=$2
GPUS=${3:-1} # Default to 1 GPU
BATCH_SIZE=${4:-24} # Default batch size 24
echo "ChimeraLM Filtering Pipeline"
echo "Input: $INPUT_BAM"
echo "Output: $OUTPUT_DIR"
# Create output directory
mkdir -p $OUTPUT_DIR
# Step 1: Predict chimera artifacts induced by WGA
echo "Step 1/3: Predicting chimera artifacts induced by WGA..."
BASENAME=$(basename $INPUT_BAM .bam)
chimeralm predict $INPUT_BAM --gpus $GPUS --batch-size $BATCH_SIZE -o ${BASENAME}.predictions
# Step 2: Filter BAM
echo "Step 2/3: Filtering BAM file..."
# chimeralm filter reads predictions.txt and writes <input>.filtered.sorted.bam next to the input BAM
chimeralm filter $INPUT_BAM ${BASENAME}.predictions
FILTERED_BAM="${INPUT_BAM%.bam}.filtered.sorted.bam"
# Step 3: Generate QC report
echo "Step 3/3: Generating QC report..."
# grep -c prints the count even when no lines match; "|| true" keeps set -e from aborting
CHIMERIC_ARTIFACT=$(grep -c "1$" ${BASENAME}.predictions/predictions.txt || true)
BIOLOGICAL_READS=$(grep -c "0$" ${BASENAME}.predictions/predictions.txt || true)
TOTAL_READS=$((CHIMERIC_ARTIFACT + BIOLOGICAL_READS))
CHIMERA_ARTIFACT_RATE=$(echo "scale=2; $CHIMERIC_ARTIFACT * 100 / $TOTAL_READS" | bc)
cat > ${OUTPUT_DIR}/qc_report.txt <<EOF
ChimeraLM WGA Artifact QC Report
=================================
Input BAM: $INPUT_BAM
Output BAM: $FILTERED_BAM
Read Statistics:
Total analyzed: $TOTAL_READS
Biological reads: $BIOLOGICAL_READS
Chimera artifacts (WGA): $CHIMERIC_ARTIFACT
Chimera artifact rate: ${CHIMERA_ARTIFACT_RATE}%
Filtering complete: $(date)
EOF
echo "Pipeline complete! QC report: ${OUTPUT_DIR}/qc_report.txt"
Usage¶
# Make script executable
chmod +x chimera_filter_pipeline.sh
# Run pipeline
./chimera_filter_pipeline.sh input.bam output/ 1 24
# Batch process multiple files
for bam in data/*.bam; do
./chimera_filter_pipeline.sh $bam output/ 1 24
done
Advanced Bash Pipeline with Error Handling¶
#!/bin/bash
# advanced_chimera_pipeline.sh - Production-ready pipeline with error handling
set -euo pipefail
# Logging function
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*" | tee -a pipeline.log
}
error_exit() {
log "ERROR: $1"
exit 1
}
# Validate inputs
INPUT_BAM=${1:?Usage: $0 <input.bam> <output_dir> [gpus] [batch_size]}
OUTPUT_DIR=${2:?Output directory required}
GPUS=${3:-1}
BATCH_SIZE=${4:-24}
# Check dependencies
command -v chimeralm >/dev/null || error_exit "chimeralm not found"
command -v samtools >/dev/null || error_exit "samtools not found"
# Check input file exists
[[ -f $INPUT_BAM ]] || error_exit "Input BAM not found: $INPUT_BAM"
# Check disk space (need at least 2x input size)
INPUT_SIZE=$(du -b $INPUT_BAM | cut -f1)
REQUIRED_SPACE=$((INPUT_SIZE * 2))
AVAILABLE_SPACE=$(df --output=avail -B 1 $(dirname $OUTPUT_DIR) | tail -1)
[[ $AVAILABLE_SPACE -gt $REQUIRED_SPACE ]] || error_exit "Insufficient disk space"
log "Starting ChimeraLM WGA artifact filtering pipeline"
log "Input: $INPUT_BAM ($(du -h $INPUT_BAM | cut -f1))"
BASENAME=$(basename $INPUT_BAM .bam)
# Step 1: Predict chimera artifacts induced by WGA
log "Step 1/3: Running predictions..."
if chimeralm predict $INPUT_BAM --gpus $GPUS --batch-size $BATCH_SIZE -o ${BASENAME}.predictions 2>&1 | tee -a pipeline.log; then
log "Predictions complete"
else
error_exit "Prediction failed"
fi
# Step 2: Filter BAM
log "Step 2/3: Filtering BAM..."
if chimeralm filter $INPUT_BAM ${BASENAME}.predictions 2>&1 | tee -a pipeline.log; then
log "Filtering complete"
else
error_exit "Filtering failed"
fi
# ChimeraLM automatically creates .filtered.sorted.bam
FILTERED_BAM="${INPUT_BAM%.bam}.filtered.sorted.bam"
# Step 3: Verify output
log "Step 3/3: Verifying output..."
samtools quickcheck $FILTERED_BAM || error_exit "Output BAM is corrupted"
ORIGINAL_COUNT=$(samtools view -c $INPUT_BAM)
FILTERED_COUNT=$(samtools view -c $FILTERED_BAM)
REMOVED_COUNT=$((ORIGINAL_COUNT - FILTERED_COUNT))
log "Removed $REMOVED_COUNT reads (${ORIGINAL_COUNT} -> ${FILTERED_COUNT})"
log "Pipeline complete! Output: $FILTERED_BAM"
Nextflow Integration¶
Simple Nextflow Pipeline¶
// chimera_filter.nf - Nextflow pipeline for ChimeraLM filtering
nextflow.enable.dsl=2
// Parameters
params.input_bam = "input.bam"
params.output_dir = "results/"
params.gpus = 1
params.batch_size = 24
// Process: Predict chimera artifacts induced by WGA
process predict {
tag { bam.baseName }
publishDir "${params.output_dir}/predictions", mode: 'copy'
input:
path bam
output:
tuple path(bam), path("${bam.baseName}.predictions")
script:
"""
chimeralm predict ${bam} --gpus ${params.gpus} --batch-size ${params.batch_size} -o ${bam.baseName}.predictions
"""
}
// Process: Filter BAM to remove WGA artifacts
process filter {
tag { bam.baseName }
publishDir "${params.output_dir}/filtered_bams", mode: 'copy'
input:
tuple path(bam), path(predictions_dir)
output:
path "${bam.baseName}.filtered.sorted.bam"
path "${bam.baseName}.filtered.sorted.bam.bai"
script:
"""
chimeralm filter ${bam} ${predictions_dir}
"""
}
// Workflow
workflow {
// Read input BAMs
bam_ch = Channel.fromPath(params.input_bam)
// Run prediction
predictions_ch = predict(bam_ch)
// Filter BAMs
filter(predictions_ch)
// A QC-report process could be added here; it is omitted in this minimal example
}
Run Nextflow Pipeline¶
# Single sample
nextflow run chimera_filter.nf --input_bam input.bam --output_dir results/
# Multiple samples
nextflow run chimera_filter.nf --input_bam "data/*.bam" --output_dir results/
# With resource limits
nextflow run chimera_filter.nf \
--input_bam "data/*.bam" \
--output_dir results/ \
--gpus 1 \
--batch_size 32 \
-with-report report.html \
-with-trace
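Nextflow caches completed tasks, so a failed or interrupted run can be resumed without recomputing samples that already finished; this is especially useful for large cohorts.
# Resume a previous run, reusing cached results for completed samples
nextflow run chimera_filter.nf --input_bam "data/*.bam" --output_dir results/ -resume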
Advanced Nextflow with Cluster Support¶
// nextflow.config - Configuration for HPC cluster
process {
executor = 'slurm'
queue = 'gpu'
memory = '32 GB'
cpus = 4
withName: predict {
time = '2h'
clusterOptions = '--gres=gpu:1'
}
withName: filter {
time = '1h'
cpus = 8
}
}
docker {
enabled = true
runOptions = '--gpus all'
}
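Nextflow reads nextflow.config from the launch directory automatically; an explicit file can also be passed with -c, for example when keeping a separate config per cluster environment.
# Use an explicit configuration file (e.g. one per cluster)
nextflow run chimera_filter.nf -c nextflow.config --input_bam "data/*.bam" --output_dir results/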
Snakemake Integration¶
Snakemake Workflow¶
# Snakefile - Snakemake workflow for ChimeraLM filtering
configfile: "config.yaml"
# Sample names from input directory
SAMPLES = glob_wildcards("data/{sample}.bam").sample
rule all:
input:
expand("results/filtered_bams/{sample}.filtered.bam", sample=SAMPLES),
expand("results/qc/{sample}_qc.txt", sample=SAMPLES),
"results/summary_report.html"
rule predict:
input:
bam="data/{sample}.bam"
output:
predictions="results/predictions/{sample}.predictions/predictions.txt"
params:
gpus=config.get("gpus", 1),
batch_size=config.get("batch_size", 24)
log:
"logs/predict/{sample}.log"
shell:
"""
chimeralm predict {input.bam} \
--gpus {params.gpus} \
--batch-size {params.batch_size} \
-o {wildcards.sample}.predictions \
2>&1 | tee {log}
# Snakemake pre-creates the declared output directory; remove the empty
# placeholder so the fresh predictions land in its place
rm -rf results/predictions/{wildcards.sample}.predictions
mv {wildcards.sample}.predictions results/predictions/
"""
rule filter:
input:
bam="data/{sample}.bam",
predictions="results/predictions/{sample}.predictions/predictions.txt"
output:
filtered_bam="results/filtered_bams/{sample}.filtered.sorted.bam",
filtered_bai="results/filtered_bams/{sample}.filtered.sorted.bam.bai"
log:
"logs/filter/{sample}.log"
shell:
"""
chimeralm filter {input.bam} \
results/predictions/{wildcards.sample}.predictions/ \
2>&1 | tee {log}
# Move output to expected location
mv data/{wildcards.sample}.filtered.sorted.bam {output.filtered_bam}
mv data/{wildcards.sample}.filtered.sorted.bam.bai {output.filtered_bai}
"""
rule qc_report:
input:
predictions="results/predictions/{sample}.predictions/predictions.txt"
output:
qc="results/qc/{sample}_qc.txt"
shell:
"""
# grep -c prints the count even when no lines match; "|| true" avoids a non-zero exit
CHIMERIC=$(grep -c '1$' {input.predictions} || true)
BIOLOGICAL=$(grep -c '0$' {input.predictions} || true)
TOTAL=$((CHIMERIC + BIOLOGICAL))
RATE=$(echo "scale=2; $CHIMERIC * 100 / $TOTAL" | bc)
echo "Sample: {wildcards.sample}" > {output.qc}
echo "Total reads analyzed: $TOTAL" >> {output.qc}
echo "Biological reads: $BIOLOGICAL" >> {output.qc}
echo "Chimera artifacts (WGA): $CHIMERIC" >> {output.qc}
echo "Chimera artifact rate: $RATE%" >> {output.qc}
"""
rule summary:
input:
qc=expand("results/qc/{sample}_qc.txt", sample=SAMPLES)
output:
report="results/summary_report.html"
script:
"scripts/generate_summary.py"
Configuration File¶
# config.yaml - Snakemake configuration
# ChimeraLM parameters
gpus: 1
batch_size: 24
# Per-rule cluster resources, read via --cluster-config. Rule names must sit at
# the top level so {cluster.<key>} placeholders resolve; __default__ covers all other rules.
__default__:
  mem: "16GB"
  cpus: 4
  time: "1:00:00"
  partition: "batch"   # adjust to your cluster's default partition
predict:
  mem: "32GB"
  cpus: 4
  time: "2:00:00"
  partition: "gpu"
filter:
  mem: "16GB"
  cpus: 8
  time: "1:00:00"
Run Snakemake Workflow¶
# Dry run to check workflow
snakemake -n
# Run locally with 4 cores
snakemake --cores 4
# Run on HPC cluster with SLURM
snakemake --cluster "sbatch -p {cluster.partition} -c {cluster.cpus} --mem={cluster.mem} -t {cluster.time}" \
--cluster-config config.yaml \
--jobs 10
# With Conda environment
snakemake --use-conda --cores 4
# Generate workflow diagram
snakemake --dag | dot -Tpng > workflow.png
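For large cohorts it is often worth letting independent samples continue when one job fails and redoing work left half-finished by a crash; both are standard Snakemake flags.
# Keep processing other samples if one fails, and rerun jobs left incomplete by a crash
snakemake --cores 4 --keep-going --rerun-incomplete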
WDL Integration (Bonus)¶
WDL Workflow¶
# chimera_filter.wdl - WDL workflow for Terra/Cromwell
version 1.0
workflow ChimeraFilter {
input {
Array[File] input_bams
Int gpus = 1
Int batch_size = 24
}
scatter (bam in input_bams) {
call Predict {
input:
bam = bam,
gpus = gpus,
batch_size = batch_size
}
call Filter {
input:
bam = bam,
predictions = Predict.predictions
}
}
output {
Array[File] filtered_bams = Filter.filtered_bam
Array[File] qc_reports = Predict.qc_report
}
}
task Predict {
input {
File bam
Int gpus
Int batch_size
}
command <<<
BASENAME=$(basename ~{bam} .bam)
chimeralm predict ~{bam} --gpus ~{gpus} --batch-size ~{batch_size} -o ${BASENAME}.predictions
# Basic QC summary derived from the predictions file
CHIMERIC=$(grep -c '1$' ${BASENAME}.predictions/predictions.txt || true)
BIOLOGICAL=$(grep -c '0$' ${BASENAME}.predictions/predictions.txt || true)
echo "Sample: ${BASENAME}" > qc_report.txt
echo "Biological reads: ${BIOLOGICAL}" >> qc_report.txt
echo "Chimera artifacts (WGA): ${CHIMERIC}" >> qc_report.txt
>>>
output {
# WDL 1.0 has no Directory type and does not evaluate shell substitutions in output
# paths; build the path with the basename() standard-library function instead
File predictions = basename(bam, ".bam") + ".predictions/predictions.txt"
File qc_report = "qc_report.txt"
}
runtime {
docker: "chimeralm/chimeralm:latest"
gpuCount: gpus
memory: "32 GB"
disks: "local-disk 100 HDD"
}
}
task Filter {
input {
File bam
File predictions
}
command <<<
chimeralm filter ~{bam} $(dirname ~{predictions})
# ChimeraLM writes <input>.filtered.sorted.bam next to the input BAM;
# move it into the task working directory under a fixed name
mv "$(dirname ~{bam})/$(basename ~{bam} .bam).filtered.sorted.bam" filtered.sorted.bam
>>>
output {
File filtered_bam = "filtered.sorted.bam"
}
runtime {
docker: "chimeralm/chimeralm:latest"
memory: "16 GB"
disks: "local-disk 100 HDD"
}
}
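To run the workflow outside Terra, Cromwell can execute it locally. The sketch below assumes a local Cromwell jar; the jar name and BAM paths are placeholders, and the JSON keys mirror the workflow inputs declared above.
# Describe the workflow inputs (paths are placeholders)
cat > inputs.json <<'EOF'
{
  "ChimeraFilter.input_bams": ["data/sample1.bam", "data/sample2.bam"],
  "ChimeraFilter.gpus": 1,
  "ChimeraFilter.batch_size": 24
}
EOF
# Execute with a local Cromwell jar (jar name/version is a placeholder)
java -jar cromwell.jar run chimera_filter.wdl --inputs inputs.json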
Best Practices¶
Error Handling¶
# Always use set -euo pipefail in Bash scripts
set -euo pipefail
# Check exit codes
if ! chimeralm predict input.bam --gpus 1; then
echo "Prediction failed!" >&2
exit 1
fi
# Use trap for cleanup
trap 'echo "Pipeline failed at line $LINENO"; exit 1' ERR
Logging¶
# Log all output
exec 1> >(tee pipeline.log)
exec 2>&1
# Or per-command logging
chimeralm predict input.bam 2>&1 | tee predict.log
Resource Management¶
# Limit parallel jobs to the number of available GPUs and pin each job to its own device
NUM_GPUS=$(nvidia-smi -L | wc -l)
# {%} is GNU parallel's job-slot number (1..NUM_GPUS); subtract 1 to get a GPU index
parallel -j $NUM_GPUS 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) chimeralm predict {} --gpus 1' ::: data/*.bam
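To confirm the GPUs are actually being used while predictions run, utilization can be watched from a second terminal:
# Refresh GPU utilization and memory usage every 5 seconds
watch -n 5 nvidia-smi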
Production Checklist¶
Before deploying to production:
- Test pipeline on sample data
- Implement error handling and logging
- Set resource limits (memory, time, GPUs)
- Add data validation checks
- Include QC report generation
- Document pipeline parameters
- Version control your pipeline code
- Test pipeline failure scenarios
Next Steps¶
- Batch Processing: Use CLI commands for multiple sequences
- API Access: Use Models API for custom workflows
Summary¶
You've learned how to:
- ✅ Integrate ChimeraLM into Bash scripts
- ✅ Build Nextflow pipelines for scalable processing
- ✅ Create Snakemake workflows for reproducibility
- ✅ Handle errors and logging in production
- ✅ Deploy to HPC clusters and cloud platforms
- ✅ Follow best practices for bioinformatics pipelines
Pipeline Ready!
You're now ready to integrate ChimeraLM into production bioinformatics workflows!