Pipeline Integration¶
Integrate ChimeraLM into your bioinformatics pipelines using Bash, Nextflow, and Snakemake for reproducible, scalable analysis.
Learning Objectives
By the end of this tutorial, you will be able to:
- Integrate ChimeraLM into Bash scripts for simple automation
- Build Nextflow pipelines with ChimeraLM filtering
- Create Snakemake workflows for reproducible analysis
- Handle errors and logging in production pipelines
- Scale to large cohorts (100s-1000s of samples)
Prerequisites: ChimeraLM installed, basic knowledge of Bash/Nextflow/Snakemake
Time: ~45 minutes
Integration Options¶
| Method | Best For | Complexity | Scalability |
|---|---|---|---|
| Bash Script | Simple workflows, single machine | Low | 1-10 samples |
| Nextflow | Cloud/HPC, complex pipelines | Medium | 10-1000s samples |
| Snakemake | Reproducibility, local/cluster | Medium | 10-1000s samples |
| WDL | Cloud platforms (Terra, Cromwell) | Medium-High | 100s-1000s samples |
Bash Script Integration¶
Basic Pipeline¶
#!/bin/bash
# chimera_filter_pipeline.sh - Simple ChimeraLM filtering pipeline
set -euo pipefail # Exit on error, undefined variables, pipe failures
# Configuration
INPUT_BAM=$1
OUTPUT_DIR=$2
GPUS=${3:-1} # Default to 1 GPU
BATCH_SIZE=${4:-24} # Default batch size 24
echo "ChimeraLM Filtering Pipeline"
echo "Input: $INPUT_BAM"
echo "Output: $OUTPUT_DIR"
# Create output directory
mkdir -p $OUTPUT_DIR
# Step 1: Predict chimera artifacts induced by WGA
echo "Step 1/3: Predicting chimera artifacts induced by WGA..."
BASENAME=$(basename $INPUT_BAM .bam)
chimeralm predict $INPUT_BAM --gpus $GPUS --batch-size $BATCH_SIZE -o ${BASENAME}.predictions
# Step 2: Filter BAM
echo "Step 2/3: Filtering BAM file..."
# chimeralm filter reads predictions.txt and writes <input>.filtered.sorted.bam next to the input BAM
chimeralm filter $INPUT_BAM ${BASENAME}.predictions
FILTERED_BAM="${INPUT_BAM%.bam}.filtered.sorted.bam"
# Step 3: Generate QC report
echo "Step 3/3: Generating QC report..."
# grep -c prints the count even when no lines match; "|| true" keeps set -e from aborting
CHIMERIC_ARTIFACT=$(grep -c "1$" ${BASENAME}.predictions/predictions.txt || true)
BIOLOGICAL_READS=$(grep -c "0$" ${BASENAME}.predictions/predictions.txt || true)
TOTAL_READS=$((CHIMERIC_ARTIFACT + BIOLOGICAL_READS))
CHIMERA_ARTIFACT_RATE=$(echo "scale=2; $CHIMERIC_ARTIFACT * 100 / $TOTAL_READS" | bc)
cat > ${OUTPUT_DIR}/qc_report.txt <<EOF
ChimeraLM WGA Artifact QC Report
=================================
Input BAM: $INPUT_BAM
Output BAM: $FILTERED_BAM
Read Statistics:
Total analyzed: $TOTAL_READS
Biological reads: $BIOLOGICAL_READS
Chimera artifacts (WGA): $CHIMERIC_ARTIFACT
Chimera artifact rate: ${CHIMERA_ARTIFACT_RATE}%
Filtering complete: $(date)
EOF
echo "Pipeline complete! QC report: ${OUTPUT_DIR}/qc_report.txt"
Usage¶
# Make script executable
chmod +x chimera_filter_pipeline.sh
# Run pipeline
./chimera_filter_pipeline.sh input.bam output/ 1 24
# Batch process multiple files
for bam in data/*.bam; do
./chimera_filter_pipeline.sh $bam output/ 1 24
done
Advanced Bash Pipeline with Error Handling¶
#!/bin/bash
# advanced_chimera_pipeline.sh - Production-ready pipeline with error handling
set -euo pipefail
# Logging function
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*" | tee -a pipeline.log
}
error_exit() {
log "ERROR: $1"
exit 1
}
# Validate inputs
INPUT_BAM=${1:?Usage: $0 <input.bam> <output_dir> [gpus] [batch_size]}
OUTPUT_DIR=${2:?Output directory required}
GPUS=${3:-1}
BATCH_SIZE=${4:-24}
# Check dependencies
command -v chimeralm >/dev/null || error_exit "chimeralm not found"
command -v samtools >/dev/null || error_exit "samtools not found"
# Check input file exists
[[ -f $INPUT_BAM ]] || error_exit "Input BAM not found: $INPUT_BAM"
# Check disk space (need at least 2x input size)
INPUT_SIZE=$(du -b $INPUT_BAM | cut -f1)
REQUIRED_SPACE=$((INPUT_SIZE * 2))
AVAILABLE_SPACE=$(df --output=avail -B 1 $(dirname $OUTPUT_DIR) | tail -1)
[[ $AVAILABLE_SPACE -gt $REQUIRED_SPACE ]] || error_exit "Insufficient disk space"
log "Starting ChimeraLM WGA artifact filtering pipeline"
log "Input: $INPUT_BAM ($(du -h $INPUT_BAM | cut -f1))"
BASENAME=$(basename $INPUT_BAM .bam)
# Step 1: Predict chimera artifacts induced by WGA
log "Step 1/3: Running predictions..."
if chimeralm predict $INPUT_BAM --gpus $GPUS --batch-size $BATCH_SIZE -o ${BASENAME}.predictions 2>&1 | tee -a pipeline.log; then
log "Predictions complete"
else
error_exit "Prediction failed"
fi
# Step 2: Filter BAM
log "Step 2/3: Filtering BAM..."
if chimeralm filter $INPUT_BAM ${BASENAME}.predictions 2>&1 | tee -a pipeline.log; then
log "Filtering complete"
else
error_exit "Filtering failed"
fi
# ChimeraLM automatically creates .filtered.sorted.bam
FILTERED_BAM="${INPUT_BAM%.bam}.filtered.sorted.bam"
# Step 3: Verify output
log "Step 3/3: Verifying output..."
samtools quickcheck $FILTERED_BAM || error_exit "Output BAM is corrupted"
ORIGINAL_COUNT=$(samtools view -c $INPUT_BAM)
FILTERED_COUNT=$(samtools view -c $FILTERED_BAM)
REMOVED_COUNT=$((ORIGINAL_COUNT - FILTERED_COUNT))
log "Removed $REMOVED_COUNT reads (${ORIGINAL_COUNT} -> ${FILTERED_COUNT})"
log "Pipeline complete! Output: $FILTERED_BAM"
Nextflow Integration¶
Simple Nextflow Pipeline¶
// chimera_filter.nf - Nextflow pipeline for ChimeraLM filtering
nextflow.enable.dsl=2
// Parameters
params.input_bam = "input.bam"
params.output_dir = "results/"
params.gpus = 1
params.batch_size = 24
// Process: Predict chimera artifacts induced by WGA
process predict {
tag { bam.baseName }
publishDir "${params.output_dir}/predictions", mode: 'copy'
input:
path bam
output:
tuple path(bam), path("${bam.baseName}.predictions")
script:
"""
chimeralm predict ${bam} --gpus ${params.gpus} --batch-size ${params.batch_size} -o ${bam.baseName}.predictions
"""
}
// Process: Filter BAM to remove WGA artifacts
process filter {
tag { bam.baseName }
publishDir "${params.output_dir}/filtered_bams", mode: 'copy'
input:
tuple path(bam), path(predictions_dir)
output:
path "${bam.baseName}.filtered.sorted.bam"
path "${bam.baseName}.filtered.sorted.bam.bai"
script:
"""
chimeralm filter ${bam} ${predictions_dir}
"""
}
// Workflow
workflow {
// Read input BAMs
bam_ch = Channel.fromPath(params.input_bam)
// Run prediction
predictions_ch = predict(bam_ch)
// Filter BAMs
filter(predictions_ch)
// A QC-report process could be added here; it is omitted in this minimal example
}
Run Nextflow Pipeline¶
# Single sample
nextflow run chimera_filter.nf --input_bam input.bam --output_dir results/
# Multiple samples
nextflow run chimera_filter.nf --input_bam "data/*.bam" --output_dir results/
# With resource limits
nextflow run chimera_filter.nf \
--input_bam "data/*.bam" \
--output_dir results/ \
--gpus 1 \
--batch_size 32 \
-with-report report.html \
-with-trace
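Nextflow caches completed tasks, so a failed or interrupted run can be resumed without recomputing samples that already finished; this is especially useful for large cohorts.
# Resume a previous run, reusing cached results for completed samples
nextflow run chimera_filter.nf --input_bam "data/*.bam" --output_dir results/ -resume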
Advanced Nextflow with Cluster Support¶
// nextflow.config - Configuration for HPC cluster
process {
executor = 'slurm'
queue = 'gpu'
memory = '32 GB'
cpus = 4
withName: predict {
time = '2h'
clusterOptions = '--gres=gpu:1'
}
withName: filter {
time = '1h'
cpus = 8
}
}
docker {
enabled = true
runOptions = '--gpus all'
}
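Nextflow reads nextflow.config from the launch directory automatically; an explicit file can also be passed with -c, for example when keeping a separate config per cluster environment.
# Use an explicit configuration file (e.g. one per cluster)
nextflow run chimera_filter.nf -c nextflow.config --input_bam "data/*.bam" --output_dir results/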
Snakemake Integration¶
Snakemake Workflow¶
# Snakefile - Snakemake workflow for ChimeraLM filtering
configfile: "config.yaml"
# Sample names from input directory
SAMPLES = glob_wildcards("data/{sample}.bam").sample
rule all:
input:
expand("results/filtered_bams/{sample}.filtered.bam", sample=SAMPLES),
expand("results/qc/{sample}_qc.txt", sample=SAMPLES),
"results/summary_report.html"
rule predict:
input:
bam="data/{sample}.bam"
output:
predictions="results/predictions/{sample}.predictions/predictions.txt"
params:
gpus=config.get("gpus", 1),
batch_size=config.get("batch_size", 24)
log:
"logs/predict/{sample}.log"
shell:
"""
chimeralm predict {input.bam} \
--gpus {params.gpus} \
--batch-size {params.batch_size} \
-o {wildcards.sample}.predictions \
2>&1 | tee {log}
# Snakemake pre-creates the declared output directory; remove the empty
# placeholder so the fresh predictions land in its place
rm -rf results/predictions/{wildcards.sample}.predictions
mv {wildcards.sample}.predictions results/predictions/
"""
rule filter:
input:
bam="data/{sample}.bam",
predictions="results/predictions/{sample}.predictions/predictions.txt"
output:
filtered_bam="results/filtered_bams/{sample}.filtered.sorted.bam",
filtered_bai="results/filtered_bams/{sample}.filtered.sorted.bam.bai"
log:
"logs/filter/{sample}.log"
shell:
"""
chimeralm filter {input.bam} \
results/predictions/{wildcards.sample}.predictions/ \
2>&1 | tee {log}
# Move output to expected location
mv data/{wildcards.sample}.filtered.sorted.bam {output.filtered_bam}
mv data/{wildcards.sample}.filtered.sorted.bam.bai {output.filtered_bai}
"""
rule qc_report:
input:
predictions="results/predictions/{sample}.predictions/predictions.txt"
output:
qc="results/qc/{sample}_qc.txt"
shell:
"""
# grep -c prints the count even when no lines match; "|| true" avoids a non-zero exit
CHIMERIC=$(grep -c '1$' {input.predictions} || true)
BIOLOGICAL=$(grep -c '0$' {input.predictions} || true)
TOTAL=$((CHIMERIC + BIOLOGICAL))
RATE=$(echo "scale=2; $CHIMERIC * 100 / $TOTAL" | bc)
echo "Sample: {wildcards.sample}" > {output.qc}
echo "Total reads analyzed: $TOTAL" >> {output.qc}
echo "Biological reads: $BIOLOGICAL" >> {output.qc}
echo "Chimera artifacts (WGA): $CHIMERIC" >> {output.qc}
echo "Chimera artifact rate: $RATE%" >> {output.qc}
"""
rule summary:
input:
qc=expand("results/qc/{sample}_qc.txt", sample=SAMPLES)
output:
report="results/summary_report.html"
script:
"scripts/generate_summary.py"
Configuration File¶
# config.yaml - Snakemake configuration
# ChimeraLM parameters
gpus: 1
batch_size: 24
# Per-rule cluster resources, read via --cluster-config. Rule names must sit at
# the top level so {cluster.<key>} placeholders resolve; __default__ covers all other rules.
__default__:
  mem: "16GB"
  cpus: 4
  time: "1:00:00"
  partition: "batch"   # adjust to your cluster's default partition
predict:
  mem: "32GB"
  cpus: 4
  time: "2:00:00"
  partition: "gpu"
filter:
  mem: "16GB"
  cpus: 8
  time: "1:00:00"
Run Snakemake Workflow¶
# Dry run to check workflow
snakemake -n
# Run locally with 4 cores
snakemake --cores 4
# Run on HPC cluster with SLURM
snakemake --cluster "sbatch -p {cluster.partition} -c {cluster.cpus} --mem={cluster.mem} -t {cluster.time}" \
--cluster-config config.yaml \
--jobs 10
# With Conda environment
snakemake --use-conda --cores 4
# Generate workflow diagram
snakemake --dag | dot -Tpng > workflow.png
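For large cohorts it is often worth letting independent samples continue when one job fails and redoing work left half-finished by a crash; both are standard Snakemake flags.
# Keep processing other samples if one fails, and rerun jobs left incomplete by a crash
snakemake --cores 4 --keep-going --rerun-incomplete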
WDL Integration (Bonus)¶
WDL Workflow¶
# chimera_filter.wdl - WDL workflow for Terra/Cromwell
version 1.0
workflow ChimeraFilter {
input {
Array[File] input_bams
Int gpus = 1
Int batch_size = 24
}
scatter (bam in input_bams) {
call Predict {
input:
bam = bam,
gpus = gpus,
batch_size = batch_size
}
call Filter {
input:
bam = bam,
predictions = Predict.predictions
}
}
output {
Array[File] filtered_bams = Filter.filtered_bam
Array[File] qc_reports = Predict.qc_report
}
}
task Predict {
input {
File bam
Int gpus
Int batch_size
}
command <<<
BASENAME=$(basename ~{bam} .bam)
chimeralm predict ~{bam} --gpus ~{gpus} --batch-size ~{batch_size} -o ${BASENAME}.predictions
# Basic QC summary derived from the predictions file
CHIMERIC=$(grep -c '1$' ${BASENAME}.predictions/predictions.txt || true)
BIOLOGICAL=$(grep -c '0$' ${BASENAME}.predictions/predictions.txt || true)
echo "Sample: ${BASENAME}" > qc_report.txt
echo "Biological reads: ${BIOLOGICAL}" >> qc_report.txt
echo "Chimera artifacts (WGA): ${CHIMERIC}" >> qc_report.txt
>>>
output {
# WDL 1.0 has no Directory type and does not evaluate shell substitutions in output
# paths; build the path with the basename() standard-library function instead
File predictions = basename(bam, ".bam") + ".predictions/predictions.txt"
File qc_report = "qc_report.txt"
}
runtime {
docker: "chimeralm/chimeralm:latest"
gpuCount: gpus
memory: "32 GB"
disks: "local-disk 100 HDD"
}
}
task Filter {
input {
File bam
File predictions
}
command <<<
chimeralm filter ~{bam} $(dirname ~{predictions})
# ChimeraLM writes <input>.filtered.sorted.bam next to the input BAM;
# move it into the task working directory under a fixed name
mv "$(dirname ~{bam})/$(basename ~{bam} .bam).filtered.sorted.bam" filtered.sorted.bam
>>>
output {
File filtered_bam = "filtered.sorted.bam"
}
runtime {
docker: "chimeralm/chimeralm:latest"
memory: "16 GB"
disks: "local-disk 100 HDD"
}
}
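To run the workflow outside Terra, Cromwell can execute it locally. The sketch below assumes a local Cromwell jar; the jar name and BAM paths are placeholders, and the JSON keys mirror the workflow inputs declared above.
# Describe the workflow inputs (paths are placeholders)
cat > inputs.json <<'EOF'
{
  "ChimeraFilter.input_bams": ["data/sample1.bam", "data/sample2.bam"],
  "ChimeraFilter.gpus": 1,
  "ChimeraFilter.batch_size": 24
}
EOF
# Execute with a local Cromwell jar (jar name/version is a placeholder)
java -jar cromwell.jar run chimera_filter.wdl --inputs inputs.json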
Best Practices¶
Error Handling¶
# Always use set -euo pipefail in Bash scripts
set -euo pipefail
# Check exit codes
if ! chimeralm predict input.bam --gpus 1; then
echo "Prediction failed!" >&2
exit 1
fi
# Use trap for cleanup
trap 'echo "Pipeline failed at line $LINENO"; exit 1' ERR
Logging¶
# Log all output
exec 1> >(tee pipeline.log)
exec 2>&1
# Or per-command logging
chimeralm predict input.bam 2>&1 | tee predict.log
Resource Management¶
# Limit parallel jobs to the number of available GPUs and pin each job to its own device
NUM_GPUS=$(nvidia-smi -L | wc -l)
# {%} is GNU parallel's job-slot number (1..NUM_GPUS); subtract 1 to get a GPU index
parallel -j $NUM_GPUS 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) chimeralm predict {} --gpus 1' ::: data/*.bam
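To confirm the GPUs are actually being used while predictions run, utilization can be watched from a second terminal:
# Refresh GPU utilization and memory usage every 5 seconds
watch -n 5 nvidia-smi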
Production Checklist¶
Before deploying to production:
- Test pipeline on sample data
- Implement error handling and logging
- Set resource limits (memory, time, GPUs)
- Add data validation checks
- Include QC report generation
- Document pipeline parameters
- Version control your pipeline code
- Test pipeline failure scenarios
Next Steps¶
- Batch Processing: Use CLI commands for multiple sequences
- API Access: Use Models API for custom workflows
Summary¶
You've learned how to:
- ✅ Integrate ChimeraLM into Bash scripts
- ✅ Build Nextflow pipelines for scalable processing
- ✅ Create Snakemake workflows for reproducibility
- ✅ Handle errors and logging in production
- ✅ Deploy to HPC clusters and cloud platforms
- ✅ Follow best practices for bioinformatics pipelines
Pipeline Ready!
You're now ready to integrate ChimeraLM into production bioinformatics workflows!