FAQ¶
Frequently asked questions and answers about DeepChopper.
Find quick answers to common questions about installation, usage, performance, and troubleshooting.
General Questions¶
What is DeepChopper?¶
DeepChopper is a deep learning tool designed to detect and remove chimeric artifacts in Nanopore direct RNA sequencing data. It uses a transformer-based language model to identify adapter sequences within base-called reads that traditional basecallers miss.
Why do I need DeepChopper?¶
Chimeric reads (reads containing internal adapter sequences) can lead to:
- False gene fusion calls
- Incorrect transcript annotations
- Inflated gene expression estimates
- Poor transcriptome assembly quality
DeepChopper removes these artifacts, improving downstream analysis accuracy.
How does DeepChopper differ from Dorado's trimming?¶
- Dorado: Trims adapters from read ends (5' and 3')
- DeepChopper: Detects and removes internal adapter sequences that Dorado misses
You can use both tools together for comprehensive adapter removal.
Installation & Setup¶
What are the system requirements?¶
Minimum:
- Python 3.10+
- 8GB RAM
- 2GB storage
Recommended:
- 16GB+ RAM
- NVIDIA GPU with CUDA support
- 10GB storage (for models and data)
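The requirements above can be checked from a terminal before installing. The following is a minimal sketch, assuming `python3` is the interpreter you intend to use; `python_meets_minimum` is a hypothetical helper written for this example and is not part of DeepChopper.

```shell
# python_meets_minimum is a hypothetical helper, not part of DeepChopper.
# It succeeds when the given "major.minor" version is at least 3.10.
python_meets_minimum() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]; }
}

if command -v python3 >/dev/null 2>&1; then
    version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if python_meets_minimum "$version"; then
        echo "Python $version: OK"
    else
        echo "Python $version: too old, 3.10+ required"
    fi
else
    echo "python3 not found on PATH"
fi

# nvidia-smi ships with the NVIDIA driver; its absence suggests CPU-only mode.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader ||
        echo "nvidia-smi found but could not query a GPU"
else
    echo "No NVIDIA GPU detected; DeepChopper would run on CPU"
fi
```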
Can I use DeepChopper without a GPU?¶
Yes! DeepChopper works on CPU, though it's slower. For small datasets (<10K reads), CPU processing is reasonable. For larger datasets, GPU acceleration is recommended.
Which operating systems are supported?¶
DeepChopper supports:
- Linux (x86_64)
- macOS (Intel and Apple Silicon)
- Windows (x86_64)
Pre-built wheels are available on PyPI for all platforms.
Usage Questions¶
Which model should I use: rna002 or rna004?¶
- RNA002: For data sequenced with RNA002 chemistry
- RNA004: For data sequenced with RNA004 or newer chemistries
DeepChopper has zero-shot capability, so the RNA002 model works well on RNA004 data, but using the matching model is recommended for best results.
How long does processing take?¶
Processing time depends on:
- Dataset size
- Hardware (CPU vs GPU)
- Batch size
Approximate times for 1 million reads:
- CPU: 2-6 hours
- GPU (single): 10-30 minutes
- GPU (multiple): 5-15 minutes
Can I process multiple files in parallel?¶
Yes! You can run multiple DeepChopper instances simultaneously:
# Process multiple files, one background instance per file
for file in *.fastq; do
    deepchopper predict "$file" --output "predictions_${file%.fastq}" &
done
# Block until every background instance has finished
wait
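Launching one instance per file can oversubscribe memory or GPUs when a directory holds many files. A possible variation, sketched below, caps the number of concurrent instances by running the files in fixed-size batches. `predict_in_batches` is a hypothetical wrapper written for this example; it reuses the `deepchopper predict` invocation shown above.

```shell
# predict_in_batches is a hypothetical wrapper, not a DeepChopper command.
# Usage: predict_in_batches MAX_JOBS file1.fastq file2.fastq ...
predict_in_batches() {
    max_jobs=$1
    shift
    running=0
    for file in "$@"; do
        deepchopper predict "$file" --output "predictions_${file%.fastq}" &
        running=$((running + 1))
        # Once MAX_JOBS are in flight, wait for the whole batch to finish.
        if [ "$running" -ge "$max_jobs" ]; then
            wait
            running=0
        fi
    done
    # Catch the final, possibly partial, batch.
    wait
}

# Example: at most two concurrent instances.
# predict_in_batches 2 *.fastq
```

Batching is simpler than a true job pool but leaves slots idle while the slowest file in each batch finishes.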
What input formats are supported?¶
- FASTQ (.fastq, .fq, .fastq.gz, .fq.gz)
- Parquet (for already-encoded data)
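When scripting over a directory of mixed files, it can help to filter by extension first. The sketch below classifies a file name using the FASTQ extensions listed above; `input_format` is a hypothetical helper, and the `.parquet` extension is an assumption, since the FAQ names the format but not its file suffix.

```shell
# input_format is a hypothetical helper, not part of DeepChopper.
# Prints the format name for supported inputs and fails otherwise.
input_format() {
    case "$1" in
        *.fastq|*.fq|*.fastq.gz|*.fq.gz) echo "fastq" ;;
        *.parquet) echo "parquet" ;;   # assumed extension for Parquet input
        *) return 1 ;;
    esac
}

# Example: report files that would be skipped.
# for f in data/*; do
#     input_format "$f" >/dev/null || echo "skipping unsupported file: $f"
# done
```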
What is the output format?¶
- Predictions: Parquet files with adapter positions
- Chopped reads: FASTQ format with adapters removed
Performance & Optimization¶
How can I speed up processing?¶
- Use GPU: Add --gpus 1 to prediction
- Increase batch size: Try --batch-size 32 or higher
- Use multiple GPUs: --gpus 2 for parallel processing
- Process in parallel: Run multiple instances on different files
I'm running out of memory. What should I do?¶
For prediction:
For chopping:
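A generic workaround that applies to both steps, and does not rely on any DeepChopper-specific option, is to split the input FASTQ into smaller chunks and process them one at a time. The sketch below assumes uncompressed FASTQ with exactly four lines per record; `split_fastq` and the chunk prefix are illustrative names, not part of DeepChopper.

```shell
# split_fastq is a hypothetical helper. It assumes uncompressed FASTQ with
# exactly four lines per record (no wrapped sequence lines).
# Usage: split_fastq INPUT.fastq READS_PER_CHUNK PREFIX
split_fastq() {
    input=$1
    reads_per_chunk=$2
    prefix=$3
    # Four lines per read keeps every record intact within a chunk.
    split -l $((reads_per_chunk * 4)) "$input" "$prefix"
    # split emits PREFIXaa, PREFIXab, ...; add the .fastq extension back.
    for chunk in "$prefix"??; do
        mv "$chunk" "$chunk.fastq"
    done
}

# Example: chunks of 100,000 reads each.
# split_fastq data.fastq 100000 chunk_
```

With the default two-letter suffixes, `split` can produce at most 676 chunks, so choose a chunk size large enough for the input.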
How much memory do I need?¶
| Dataset Size | Prediction (CPU) | Prediction (GPU) | Chopping |
|---|---|---|---|
| 100K reads | ~40 GB | ~40 GB | 1-2 GB |
| 1M reads | ~70 GB | ~20 GB | 2-5 GB |
| 10M reads | ~70 GB | ~40 GB | 5-20 GB |
Note: Memory usage varies with read length and system configuration.
Results & Quality¶
How do I know if DeepChopper worked correctly?¶
Check these indicators:
- Output file size: Should be smaller than input (adapters removed)
- Read count: May increase (reads split at adapters)
- Log messages: Look for "processed X reads" messages
- Quality metrics: Review prediction confidence scores
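The read-count indicator is easy to check from the shell. This is a minimal sketch, assuming four-line FASTQ records; `count_reads` is a hypothetical helper and the file names in the example are placeholders.

```shell
# count_reads is a hypothetical helper; it assumes four-line FASTQ records
# and handles both plain and gzip-compressed files.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Example with placeholder file names:
# echo "input reads:  $(count_reads data.fastq)"
# echo "output reads: $(count_reads chopped.fastq)"
```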
Why are there more reads in the output?¶
This is expected! DeepChopper splits chimeric reads at adapter positions, creating multiple valid reads from single chimeric reads.
Example:
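As a toy illustration of the idea, the snippet below removes an internal adapter interval from one sequence and keeps the two flanking segments. The sequence and the adapter coordinates are invented for this example, and real DeepChopper output (read names, quality strings) will look different.

```shell
# Toy illustration only: the sequence and the 1-based adapter interval are
# made up, and this is not how DeepChopper itself is implemented.
read_seq="AAAACCCCGGGGTTTTACGTACGTCCCCAAAATTTT"
adapter_start=13   # first base of the internal adapter
adapter_end=24     # last base of the internal adapter

# Keep everything before the adapter and everything after it.
left=$(printf '%s\n' "$read_seq" | cut -c1-$((adapter_start - 1)))
right=$(printf '%s\n' "$read_seq" | cut -c$((adapter_end + 1))-)

echo "1 chimeric read: $read_seq"
echo "2 chopped reads: $left and $right"
# The second line prints: 2 chopped reads: AAAACCCCGGGG and CCCCAAAATTTT
```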
How do I validate the results?¶
- Alignment improvement: Map reads to the reference genome and compare alignment rates before and after chopping
- Chimeric alignment reduction: Count chimeric alignments before and after chopping
- Gene fusion validation: Check whether gene fusion calls become more accurate
- Visual inspection: Use the web interface to inspect individual reads
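The chimeric-alignment check can be approximated by counting supplementary alignments before and after chopping. The sketch below assumes `minimap2` and `samtools` are installed; `align_reads` and `count_supplementary` are hypothetical wrappers, and the file names in the example are placeholders.

```shell
# Both helpers are hypothetical wrappers around minimap2 and samtools.
align_reads() {
    # $1 = reference FASTA, $2 = reads FASTQ, $3 = output BAM
    # "-ax splice -uf -k14" is minimap2's documented setting for
    # Nanopore direct RNA reads.
    minimap2 -ax splice -uf -k14 "$1" "$2" | samtools sort -o "$3" -
}

count_supplementary() {
    # SAM flag 2048 (0x800) marks supplementary alignments, which is how
    # the extra segments of a chimeric alignment are reported.
    samtools view -c -f 2048 "$1"
}

# Example with placeholder file names:
# align_reads genome.fa data.fastq before.bam
# align_reads genome.fa chopped.fastq after.bam
# echo "before: $(count_supplementary before.bam)"
# echo "after:  $(count_supplementary after.bam)"
```

A lower count after chopping is consistent with fewer chimeric reads, although genuine fusion or split alignments also carry this flag.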
Can DeepChopper introduce false positives?¶
Yes. Like any predictive tool, DeepChopper can produce false positives. To reduce them:
- Increase --smooth-window (e.g., 31)
- Increase --min-interval-size (e.g., 15)
- Use the model matching your chemistry
Troubleshooting¶
Error: "command not found: deepchopper"¶
Solution: Ensure DeepChopper is installed and in your PATH:
# Check installation
pip show deepchopper
# Add to PATH if needed
export PATH="$HOME/.local/bin:$PATH"
Error: "CUDA out of memory"¶
Solution: Reduce batch size:
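One way to find a batch size that fits is to halve it until prediction succeeds. The sketch below uses the `--gpus` and `--batch-size` options mentioned elsewhere in this FAQ; `predict_with_backoff` is a hypothetical wrapper, and it assumes that `deepchopper` exits with a non-zero status when prediction fails.

```shell
# predict_with_backoff is a hypothetical wrapper, not a DeepChopper command.
# It retries on any failure, not only on out-of-memory errors.
# Usage: predict_with_backoff FILE STARTING_BATCH_SIZE
predict_with_backoff() {
    file=$1
    batch_size=$2
    while [ "$batch_size" -ge 1 ]; do
        if deepchopper predict "$file" --gpus 1 --batch-size "$batch_size"; then
            echo "succeeded with --batch-size $batch_size"
            return 0
        fi
        # Halve the batch size and try again.
        batch_size=$((batch_size / 2))
    done
    echo "prediction failed even with --batch-size 1" >&2
    return 1
}

# Example: start at 32 and back off as needed.
# predict_with_backoff data.fastq 32
```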
Error: "FileNotFoundError"¶
Solution: Check file paths and ensure files exist:
# Verify file exists
ls -lh data.fastq
# Use absolute paths
deepchopper predict /full/path/to/data.fastq
Predictions are empty or incorrect¶
Possible causes:
- Wrong model: Make sure you're using the correct model (rna002 vs rna004)
- Already trimmed data: If Dorado already trimmed adapters, DeepChopper may not find internal adapters
- Low-quality data: Very noisy data may produce poor predictions
Solutions:
- Use matching model for your chemistry
- If data is already clean, DeepChopper may not be needed
- Adjust parameters (see Parameters Guide)
Processing is very slow¶
Common causes:
- Using CPU instead of GPU
- Small batch size
- Many workers with limited CPU cores
Solutions:
# Enable GPU
deepchopper predict data.fastq --gpus 1 --batch-size 32
# Optimize workers (usually 0 or 4 works best)
deepchopper predict data.fastq --workers 4
Advanced Usage¶
Can I train my own model?¶
Yes! DeepChopper is built on PyTorch Lightning. See the development documentation for training instructions.
Can I use DeepChopper programmatically?¶
Yes! DeepChopper provides a Python API. However, for most use cases, the CLI is recommended as it's optimized and easier to use.
Does DeepChopper work with DNA sequencing?¶
DeepChopper is specifically designed and trained for direct RNA sequencing. It may not work well with DNA data.
Can I use DeepChopper with other sequencing platforms?¶
DeepChopper is optimized for Oxford Nanopore direct RNA sequencing. It's not tested or recommended for other platforms (Illumina, PacBio, etc.).
Can DeepChopper identify chimeric reads from Whole Genome Amplification (WGA)?¶
No. DeepChopper is specifically designed for direct RNA sequencing and identifies chimeric reads caused by internal adapter sequences in Nanopore dRNA-seq data.
For WGA-related chimeric reads, you should use ChimeraLM, a specialized tool designed to identify artificial chimeric reads arising from whole genome amplification processes.
Key Differences:
- DeepChopper: RNA sequencing, adapter-induced chimeras
- ChimeraLM: DNA sequencing, WGA-induced chimeras
Data & Privacy¶
Does DeepChopper send my data anywhere?¶
No. DeepChopper processes all data locally. The only network access is for:
- Downloading models from Hugging Face Hub (one-time)
- Using the optional --share flag in the web interface
Where are models stored?¶
Models are cached locally in:
- Linux: ~/.cache/huggingface/
- macOS: ~/Library/Caches/huggingface/
- Windows: %USERPROFILE%\.cache\huggingface\
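The effective location on a given machine may differ from these defaults, because the Hugging Face libraries honor the `HF_HOME` and `XDG_CACHE_HOME` environment variables and otherwise fall back to `~/.cache/huggingface`. The sketch below resolves the directory in that order; `hf_cache_dir` is a hypothetical helper, and it assumes DeepChopper does not override the Hugging Face cache path itself.

```shell
# hf_cache_dir is a hypothetical helper. It mirrors the lookup order of the
# Hugging Face libraries: HF_HOME, then XDG_CACHE_HOME, then ~/.cache.
hf_cache_dir() {
    if [ -n "${HF_HOME:-}" ]; then
        echo "$HF_HOME"
    else
        echo "${XDG_CACHE_HOME:-$HOME/.cache}/huggingface"
    fi
}

# Example: show where the cache lives and how much space it uses.
# du -sh "$(hf_cache_dir)"
```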
Can I use DeepChopper offline?¶
Yes, after the initial model download. Models are cached locally and don't require internet access for subsequent runs.
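On machines without network access, it may help to tell the Hugging Face Hub client not to attempt any connection. `HF_HUB_OFFLINE=1` is the Hub's documented offline switch; whether DeepChopper needs it is an assumption, since cached models may load without it. `predict_offline` is a hypothetical wrapper written for this example.

```shell
# predict_offline is a hypothetical wrapper, not a DeepChopper command.
# The subshell keeps HF_HUB_OFFLINE from leaking into the calling shell.
predict_offline() {
    (
        export HF_HUB_OFFLINE=1
        deepchopper predict "$@"
    )
}

# Example:
# predict_offline data.fastq
```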
Getting More Help¶
Where can I find more information?¶
- Tutorial - Complete walkthrough
- CLI Reference - All commands and options
- Parameters Guide - Optimization tips
- GitHub Issues - Report bugs
- GitHub Discussions - Ask questions
How do I report a bug?¶
- Check existing issues
- Open a new issue with:
  - DeepChopper version (deepchopper --version)
  - Operating system and Python version
  - Full error message
  - Steps to reproduce
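The version details for a bug report can be gathered in one step. This is a minimal sketch, assuming `python3` is the interpreter DeepChopper was installed into; `collect_bug_report_info` is a hypothetical helper, not a DeepChopper command.

```shell
# collect_bug_report_info is a hypothetical helper, not a DeepChopper command.
# Errors are captured too, so a missing tool shows up in the report.
collect_bug_report_info() {
    echo "DeepChopper: $(deepchopper --version 2>&1)"
    echo "Python:      $(python3 --version 2>&1)"
    echo "OS:          $(uname -srm)"
}

# Example: save the details to paste into the issue.
# collect_bug_report_info > bug_report_env.txt
```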
How do I request a feature?¶
Open a GitHub Discussion describing:
- The feature you'd like
- Your use case
- Why it would be helpful
How can I contribute?¶
We welcome contributions! See the Contributing Guide for:
- Setting up a development environment
- Code style guidelines
- Testing procedures
- Pull request process