Performance Optimization¶
Maximize ChimeraLM's throughput and minimize prediction time with GPU acceleration, batch tuning, and parallelization strategies.
Learning Objectives
By the end of this tutorial, you will be able to:
- Choose optimal hardware (CPU, CUDA GPU, MPS) for your workload
- Tune batch size for maximum throughput without OOM errors
- Optimize data loading with worker processes
- Profile and benchmark ChimeraLM performance
- Scale to large datasets (millions of reads)
Prerequisites: ChimeraLM installed, basic command-line experience
Time: ~30 minutes
Performance Overview¶
ChimeraLM's prediction speed depends on:
- Hardware: GPU >>> CPU (10-50x speedup)
- Batch size: Larger batches = better GPU utilization
- Data loading: Parallel workers reduce I/O bottlenecks
- Dataset size: Fixed startup overhead is amortized over large files
Key Takeaways
- GPU is 10-50x faster than CPU
- Batch size of 24-64 is optimal for modern GPUs
- Diminishing returns after batch size 64
Step 1: Choose Your Hardware¶
Check Available Hardware¶
# Check CUDA GPU availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
# Check MPS (Apple Silicon) availability
python -c "import torch; print(f'MPS: {torch.backends.mps.is_available()}')"
# Check GPU details (if CUDA available)
nvidia-smi
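If you want one command that works on any machine, you can choose the --gpus value from whatever PyTorch reports at run time. A minimal bash sketch using only the flags shown in this tutorial (input.bam is a placeholder):
# Select GPU mode when CUDA or MPS is available, otherwise fall back to CPU
if python -c "import torch, sys; sys.exit(0 if torch.cuda.is_available() or torch.backends.mps.is_available() else 1)"; then
    GPUS=1
else
    GPUS=0
fi
chimeralm predict input.bam --gpus "$GPUS" --batch-size 24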
Hardware Recommendations¶
CUDA GPU: Best for large-scale prediction (>100K reads)
GPU Requirements:
- Minimum: 8GB VRAM (batch-size 12)
- Recommended: 16GB VRAM (batch-size 24-32)
- Optimal: 24GB+ VRAM (batch-size 48-64)
MPS (Apple Silicon): Best for Mac users and moderate datasets (<100K reads)
MPS Limitations
- Single device only (no multi-GPU)
- Slower than CUDA GPUs
- Limited memory: unified memory shared with the system (8-96GB depending on Mac model)
Step 2: Optimize Batch Size¶
Batch size is the most important parameter for GPU performance.
Finding Optimal Batch Size¶
# Start with default (batch-size 12)
chimeralm predict input.bam --gpus 1 --batch-size 12
# Increase until you hit a CUDA out-of-memory error, then back off to the last size that worked
chimeralm predict input.bam --gpus 1 --batch-size 24
chimeralm predict input.bam --gpus 1 --batch-size 32
chimeralm predict input.bam --gpus 1 --batch-size 48 # May OOM on smaller GPUs
Batch Size Guidelines¶
| GPU VRAM | Recommended Batch Size | Max Batch Size |
|---|---|---|
| 8GB | 12 | 16 |
| 12GB | 16 | 24 |
| 16GB | 24 | 32 |
| 24GB | 32 | 48 |
| 40GB+ | 48 | 64+ |
Out of Memory Errors
If you get RuntimeError: CUDA out of memory, reduce --batch-size (use the table above as a guide); see the Troubleshooting section below for more options.
Measure Throughput¶
# Benchmark with different batch sizes
time chimeralm predict tests/data/mk1c_test.bam --gpus 1 --batch-size 12
time chimeralm predict tests/data/mk1c_test.bam --gpus 1 --batch-size 24
time chimeralm predict tests/data/mk1c_test.bam --gpus 1 --batch-size 32
# Compare total time
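To compare batch sizes systematically, wrap the benchmark in a loop. A sketch that assumes the bundled test file above and a single CUDA GPU:
# Sweep batch sizes and print the wall-clock time for each run
for bs in 12 24 32 48; do
    echo "batch size: $bs"
    time chimeralm predict tests/data/mk1c_test.bam --gpus 1 --batch-size "$bs"
done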
Step 3: Optimize Data Loading¶
Increase Worker Threads¶
# CPU mode: Use multiple workers for parallelism
chimeralm predict input.bam --gpus 0 --workers 8
# GPU mode: 2-4 workers to keep GPU fed
chimeralm predict input.bam --gpus 1 --batch-size 24 --workers 4
Worker Guidelines
- CPU mode: Set workers = number of CPU cores
- GPU mode: 2-4 workers (more doesn't help)
- Default: 0 (main thread only)
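On Linux you can derive the CPU-mode worker count from the available cores instead of hard-coding it (a convenience sketch; nproc is a coreutils tool, use sysctl -n hw.ncpu on macOS):
# CPU mode: one worker per available core
chimeralm predict input.bam --gpus 0 --workers "$(nproc)"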
I/O Bottlenecks¶
For very large BAM files (>10GB), I/O can become a bottleneck:
# Check if I/O is the bottleneck
# Monitor GPU utilization with nvidia-smi
# If GPU utilization < 80%, increase workers
chimeralm predict large_file.bam --gpus 1 --workers 4
# If utilization is still low, data loading is the bottleneck; try faster local storage (e.g. copy the BAM to a local SSD)
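If GPU utilization stays low even with more workers, check whether the disk itself is saturated. One quick look, assuming the sysstat package is installed:
# Report extended per-device I/O statistics every 2 seconds while a prediction runs
iostat -x 2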
Step 4: Scale to Large Datasets¶
Multi-GPU Prediction (Advanced)¶
You can use multiple GPUs by processing different files in parallel, one prediction job per GPU.
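A sketch using the standard CUDA_VISIBLE_DEVICES environment variable to pin each job to a single GPU (assumes two GPUs; sample1.bam and sample2.bam are placeholder file names):
# Run one chimeralm process per GPU in the background, then wait for both
CUDA_VISIBLE_DEVICES=0 chimeralm predict sample1.bam --gpus 1 --batch-size 32 &
CUDA_VISIBLE_DEVICES=1 chimeralm predict sample2.bam --gpus 1 --batch-size 32 &
wait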
Process in Chunks¶
For very large datasets, process in chunks to avoid memory issues:
# Process first 100K reads
chimeralm predict huge_file.bam --max-sample 100000 --gpus 1 --batch-size 32
# Split BAM file into chunks (manual approach; every chunk needs a copy of the header)
samtools view -H huge_file.bam > header.sam
samtools view huge_file.bam | head -n 100000 | cat header.sam - | samtools view -b - > chunk1.bam
samtools view huge_file.bam | tail -n +100001 | cat header.sam - | samtools view -b - > chunk2.bam
chimeralm predict chunk1.bam --gpus 1 --batch-size 32
chimeralm predict chunk2.bam --gpus 1 --batch-size 32
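If you split into many chunks, a small loop keeps the commands manageable (a sketch; it assumes the chunk*.bam files produced above):
# Predict on every chunk in turn
for chunk in chunk*.bam; do
    chimeralm predict "$chunk" --gpus 1 --batch-size 32
done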
Step 5: Profile and Benchmark¶
Measure End-to-End Time¶
# Time the full pipeline
time chimeralm predict input.bam --gpus 1 --batch-size 24
# Output:
# real 0m29.123s
# user 0m45.678s
# sys 0m3.456s
Monitor GPU Utilization¶
# Run nvidia-smi in another terminal while predicting
watch -n 1 nvidia-smi
# Check GPU utilization (should be 90-100% during inference)
# Check GPU memory usage
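For a lighter-weight view than watch, nvidia-smi can print just the numbers that matter:
# Log GPU utilization and memory once per second in CSV form (Ctrl-C to stop)
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1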
Best Practices¶
For Maximum Speed¶
# NVIDIA GPU: Large batch, parallel data loading
chimeralm predict input.bam --gpus 1 --batch-size 48 --workers 4
# Apple Silicon: Moderate batch
chimeralm predict input.bam --gpus 1 --batch-size 12 --workers 2
# CPU: Multiple workers
chimeralm predict input.bam --gpus 0 --workers 8 --batch-size 32
For Limited GPU Memory¶
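The main lever is a smaller batch, with CPU mode as the fallback. A sketch using only flags shown earlier (input.bam is a placeholder):
# Conservative settings for an 8GB (or smaller) GPU
chimeralm predict input.bam --gpus 1 --batch-size 8 --workers 2
# Last resort: CPU mode
chimeralm predict input.bam --gpus 0 --workers 8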
For Reproducibility¶
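Record exactly what you ran alongside the results. A minimal sketch, assuming the package is installed as chimeralm via pip (adjust the name if it differs):
# Save the installed version and the exact command next to the output
pip show chimeralm > run_info.txt
echo 'chimeralm predict input.bam --gpus 1 --batch-size 24 --workers 4' >> run_info.txt
chimeralm predict input.bam --gpus 1 --batch-size 24 --workers 4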
Troubleshooting¶
Slow Predictions¶
Predictions are slower than expected
Symptom: GPU mode is not much faster than CPU
Possible Causes:
- GPU not being used: confirm --gpus 1 is set and torch.cuda.is_available() returns True
- Batch size too small: increase --batch-size until GPU utilization rises
- I/O bottleneck: increase --workers and check disk throughput
Out of Memory¶
CUDA out of memory error
Solutions:
- Reduce --batch-size (halve it and retry)
- Process the file in smaller chunks or with --max-sample (see Step 4)
- Use a GPU with more VRAM, or fall back to CPU mode with --gpus 0
GPU Not Detected¶
GPU available but not being used
Check CUDA installation:
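The same checks from Step 1 apply here:
# PyTorch should print True; if it prints False, reinstall PyTorch with CUDA support matching your driver
python -c "import torch; print(torch.cuda.is_available())"
# The driver side: nvidia-smi should list your GPU
nvidia-smi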
Performance Checklist¶
Before running large-scale predictions:
- GPU is detected (python -c "import torch; print(torch.cuda.is_available())")
- Optimal batch size determined (start with 24, increase until OOM)
- Workers set appropriately (2-4 for GPU, 8+ for CPU)
- GPU utilization monitored with nvidia-smi
- Benchmark completed on sample data
Next Steps¶
- Web Interface: See Web Interface for interactive filtering
- Pipeline integration: See Pipeline Integration for scaling across multiple samples
Summary¶
You've learned how to:
- ✅ Choose optimal hardware for your workload
- ✅ Tune batch size for maximum throughput
- ✅ Optimize data loading with worker processes
- ✅ Scale to large datasets with chunking
- ✅ Profile and benchmark performance
- ✅ Troubleshoot common performance issues
Performance Boost Achieved!
With proper optimization, you can achieve a 10-50x speedup over an unoptimized CPU-only run!