Tutorial¶
Complete guide for using DeepChopper with Nanopore direct-RNA sequencing data.
This tutorial walks you through identifying and removing artificial chimeric reads in Nanopore direct-RNA sequencing data. Whether you're new to bioinformatics or an experienced researcher, this guide will help you get the most out of DeepChopper.
Prerequisites¶
Before we begin, ensure you have the following installed:
- DeepChopper (latest version)
- Dorado (Oxford Nanopore's basecaller)
- Samtools (for BAM to FASTQ conversion)
- Sufficient storage space for Nanopore data
Working with DNA sequencing data?
DeepChopper is designed for direct RNA sequencing. If you need to identify artificial chimeric reads from whole genome amplification (WGA) processes, please see ChimeraLM.
1. Data Acquisition¶
Start by obtaining your Nanopore direct-RNA sequencing data (POD5 files).
# Example: Download sample data (replace with your actual data source)
wget https://raw.githubusercontent.com/ylab-hi/DeepChopper/refs/heads/main/tests/data/200cases.pod5
💡 Tip: Organize your data in a dedicated project folder for easy management.
2. Basecall Using Dorado¶
Convert raw signal data to nucleotide sequences using Dorado.
# Dorado must already be installed (see Prerequisites)
# Run Dorado without trimming to preserve all sequences
dorado basecaller --no-trim rna002_70bps_hac@v3 200cases.pod5 > raw_no_trim.bam
# Convert BAM to FASTQ
samtools view raw_no_trim.bam -d dx:0 | samtools fastq > raw_no_trim.fastq
Replace 200cases.pod5 with your own POD5 file or the directory containing your POD5 files. Use rna002_70bps_hac@v3 for the RNA002 kit or rna004_130bps_hac@v5.0.0 for the RNA004 kit.
The output will be a FASTQ file containing the basecalled sequences with all adapters preserved for DeepChopper analysis.
📝 Note: You can also use Dorado WITH trimming (default behavior without --no-trim), then apply DeepChopper. Dorado's trimming removes 3' end adapters, and DeepChopper can identify and remove internal adapter regions that Dorado doesn't detect. Both approaches work well with DeepChopper.
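The kit-to-model pairing described above can be captured in a small helper so the basecalling command is always built the same way. The following is a minimal Python sketch, not part of DeepChopper or Dorado: `DORADO_MODELS` and `dorado_command` are hypothetical names introduced here, and the two model names are the ones given in this tutorial. It only builds the argument list; it does not run Dorado.

```python
from typing import List

# Kit -> Dorado model, using the model names given in this tutorial.
DORADO_MODELS = {
    "rna002": "rna002_70bps_hac@v3",
    "rna004": "rna004_130bps_hac@v5.0.0",
}


def dorado_command(kit: str, pod5_path: str, trim: bool = False) -> List[str]:
    """Build a `dorado basecaller` argument list for an RNA kit (hypothetical helper)."""
    key = kit.lower()
    if key not in DORADO_MODELS:
        raise ValueError(f"unknown kit {kit!r}; expected one of {sorted(DORADO_MODELS)}")
    cmd = ["dorado", "basecaller"]
    if not trim:
        cmd.append("--no-trim")  # keep adapters in the basecalled reads
    cmd += [DORADO_MODELS[key], pod5_path]
    return cmd


print(" ".join(dorado_command("rna002", "200cases.pod5")))
```

To actually run the command, the list could be passed to `subprocess.run` with standard output redirected to the BAM file, mirroring the `> raw_no_trim.bam` redirection shown earlier.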
For convenience, you can download a pre-prepared FASTQ file for testing:
wget https://raw.githubusercontent.com/ylab-hi/DeepChopper/refs/heads/main/tests/data/raw_no_trim.fastq
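Before running prediction, it can help to confirm that the FASTQ file is well-formed and to see how long the reads are. The sketch below is one way to do that in plain Python, assuming an uncompressed FASTQ with exactly four lines per record (the layout `samtools fastq` writes). The function names `read_fastq` and `summarize_fastq` are hypothetical helpers, not DeepChopper APIs.

```python
from typing import Dict, Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from an uncompressed 4-line FASTQ."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return  # end of file
            seq = handle.readline().rstrip("\n")
            plus = handle.readline().rstrip("\n")
            qual = handle.readline().rstrip("\n")
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed record near {header!r}")
            if len(seq) != len(qual):
                raise ValueError(f"sequence/quality length mismatch in {header!r}")
            fields = header[1:].split(maxsplit=1)
            yield (fields[0] if fields else ""), seq, qual


def summarize_fastq(path: str) -> Dict[str, float]:
    """Return the read count plus min/mean/max read length."""
    lengths = [len(seq) for _, seq, _ in read_fastq(path)]
    if not lengths:
        return {"reads": 0, "min": 0, "mean": 0.0, "max": 0}
    return {
        "reads": len(lengths),
        "min": min(lengths),
        "mean": sum(lengths) / len(lengths),
        "max": max(lengths),
    }
```

Calling `summarize_fastq("raw_no_trim.fastq")` returns a small dictionary; a `ValueError` here usually points to a truncated download or an interrupted conversion.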
3. Predicting Adapters to Detect Artificial Chimeric Reads¶
DeepChopper analyzes your FASTQ data directly to identify chimeric reads:
Basic Usage¶
# Predict chimeric reads (default: RNA002 model, CPU)
deepchopper predict raw_no_trim.fastq --output predictions
# With GPU acceleration
deepchopper predict raw_no_trim.fastq --output predictions --gpus 1
Model Selection¶
DeepChopper supports different models optimized for different RNA sequencing kits:
# Use RNA002 model (default - for RNA002 sequencing kit)
deepchopper predict raw_no_trim.fastq --output predictions --model rna002
# Use RNA004 model (for RNA004 sequencing kit)
deepchopper predict raw_no_trim.fastq --output predictions --model rna004
🎯 Important: Choose the model that matches your sequencing kit:
- rna002: For data generated with the RNA002 sequencing kit
- rna004: For data generated with the RNA004 sequencing kit (newer version with improved chemistry)
Advanced Options¶
# Process a small subset for testing
deepchopper predict raw_no_trim.fastq --output predictions --max-sample 1000
# Use larger batch size for faster processing (requires more memory)
deepchopper predict raw_no_trim.fastq --output predictions --batch-size 32 --gpus 1
# Specify number of data loader workers (default: 0)
deepchopper predict raw_no_trim.fastq --output predictions --workers 4
# Enable verbose output
deepchopper predict raw_no_trim.fastq --output predictions --verbose
📊 Results: Check the predictions folder for output files containing chimera predictions for each read.
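When prediction is driven from a script, the options above can be assembled programmatically. The following Python sketch builds a `deepchopper predict` argument list using only the flags shown in this section. `predict_command` and `SUPPORTED_MODELS` are hypothetical names introduced for illustration; the helper does not call DeepChopper itself, and it omits any flag you leave unset so that the tool's own defaults apply.

```python
from typing import List, Optional

SUPPORTED_MODELS = ("rna002", "rna004")  # the two models named in this tutorial


def predict_command(
    fastq: str,
    output: str = "predictions",
    model: str = "rna002",
    gpus: int = 0,
    batch_size: Optional[int] = None,
    workers: Optional[int] = None,
    max_sample: Optional[int] = None,
    verbose: bool = False,
) -> List[str]:
    """Build a `deepchopper predict` argument list (hypothetical helper)."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"model must be one of {SUPPORTED_MODELS}, got {model!r}")
    cmd = ["deepchopper", "predict", fastq, "--output", output, "--model", model]
    if gpus > 0:
        cmd += ["--gpus", str(gpus)]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if max_sample is not None:
        cmd += ["--max-sample", str(max_sample)]
    if verbose:
        cmd.append("--verbose")
    return cmd


# Small test run on an RNA004 dataset with one GPU.
print(" ".join(predict_command("raw_no_trim.fastq", model="rna004", gpus=1, max_sample=1000)))
```

Rejecting an unknown model name up front avoids a long run with the wrong kit setting, which is the mistake the note above warns about.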
Hardware Acceleration¶
DeepChopper can leverage GPUs for significantly faster processing:
# Use single GPU (recommended)
deepchopper predict raw_no_trim.fastq --output predictions --gpus 1
# Use multiple GPUs (if available)
deepchopper predict raw_no_trim.fastq --output predictions --gpus 2
💡 Performance Tip: GPU acceleration can provide 10-50x speedup for large datasets. For datasets with fewer than 10K reads, CPU processing is sufficient.
4. Chopping Artificial Sequences¶
Now that you have predictions, remove the identified adapter sequences:
Chopping Reads¶
# Chop reads using the predictions from step 3
deepchopper chop predictions/0 raw_no_trim.fastq
Chopping Options¶
# Adjust smoothing and filtering parameters
deepchopper chop predictions/0 raw_no_trim.fastq \
--smooth-window 21 \
--min-interval-size 13 \
--min-read-length 20
# Include chopped sequences in output
deepchopper chop predictions/0 raw_no_trim.fastq --output-chopped
# Use multiple threads for faster processing
deepchopper chop predictions/0 raw_no_trim.fastq --threads 4
Parameter Guide¶
Key parameters you can adjust:
- --output, -o: Custom output file prefix
- --max-batch: Maximum batch size for memory management (default: auto)
- --threads, -t: Number of threads to use (default: 2)
- --smooth-window: Smooth window size for prediction smoothing (default: 21)
- --min-interval-size: Minimum interval size to consider (default: 13)
- --min-read-length: Minimum read length after chopping (default: 20)
- --approved-intervals: Number of approved intervals (default: 20)
- --output-chopped: Output the chopped sequences separately
- --chop-type: Type of chopping to perform (default: "all")
🎉 Success: Look for the output file with the .chop.fq.gz suffix.
The chop command takes the original FASTQ file (raw_no_trim.fastq) and the prediction output (predictions/0), and produces a new FASTQ file (with suffix .chop.fq.gz) in which the chimeric artifacts have been chopped out.
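To build intuition for what --smooth-window, --min-interval-size, and --min-read-length control, the sketch below applies the same three ideas to a toy per-base label track (1 = predicted adapter, 0 = native sequence). This is an illustration under stated assumptions, not DeepChopper's implementation: majority-vote smoothing is a stand-in chosen here for simplicity, and all function names are hypothetical.

```python
from typing import List, Tuple


def smooth_labels(labels: List[int], window: int = 21) -> List[int]:
    """Majority-vote smoothing of per-base 0/1 labels over a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        chunk = labels[max(0, i - half): i + half + 1]
        smoothed.append(1 if 2 * sum(chunk) > len(chunk) else 0)
    return smoothed


def adapter_intervals(labels: List[int], min_interval_size: int = 13) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs of 1s that are at least min_interval_size long."""
    intervals, start = [], None
    for i, label in enumerate(labels + [0]):  # trailing 0 closes a run at the end
        if label and start is None:
            start = i
        elif not label and start is not None:
            if i - start >= min_interval_size:
                intervals.append((start, i))
            start = None
    return intervals


def chop(seq: str, intervals: List[Tuple[int, int]], min_read_length: int = 20) -> List[str]:
    """Cut out the intervals and keep the remaining pieces that are long enough."""
    pieces, prev = [], 0
    for start, end in intervals:
        pieces.append(seq[prev:start])
        prev = end
    pieces.append(seq[prev:])
    return [p for p in pieces if len(p) >= min_read_length]
```

In this toy model, a larger smoothing window suppresses isolated mispredicted bases, a larger minimum interval size ignores short spurious adapter calls, and the minimum read length discards fragments too short to be useful after chopping.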
Understanding the Output¶
The default output is a compressed file in BGZIP format:
- Format: BGZIP-compressed FASTQ (.chop.fq.gz)
- View: Use zless -S OUTPUT to view the output file contents in a terminal
- The -S flag: Prevents line wrapping, making it easier to read long sequences
- Compatibility: Can be directly used with most bioinformatics tools that support BGZIP
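Because a BGZIP file is a series of standard gzip blocks, Python's built-in gzip module can read the chopped output directly. The sketch below counts reads and bases in a FASTQ file, compressed or not, so the raw and chopped files can be compared. `fastq_totals` is a hypothetical helper and assumes four lines per record.

```python
import gzip
from typing import Tuple


def fastq_totals(path: str) -> Tuple[int, int]:
    """Return (number of reads, total bases) for a plain or gzip/BGZIP FASTQ."""
    opener = gzip.open if path.endswith(".gz") else open
    reads = bases = 0
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header.strip():
                break  # end of file (or trailing blank line)
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            reads += 1
            bases += len(seq)
    return reads, bases
```

Running it on both raw_no_trim.fastq and the .chop.fq.gz output gives a quick before/after view. Chopping a chimeric read can yield more than one output read, so the read count may go up while the total base count goes down.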
Performance Notes¶
The default parameters used in DeepChopper are optimized based on extensive testing and validation during our research, as detailed in our paper. These parameters have been shown to provide robust and reliable results across a wide range of sequencing data.
Processing Time:
- Demo data: ~20-30 minutes
- Large datasets: May vary depending on:
  - Machine specifications
  - CPU/GPU availability
  - Number of threads used
  - Batch size settings
Memory Management:
- Lower batch sizes = less memory but slower processing
- Higher batch sizes = more memory but faster processing
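One way to act on this trade-off is to start with a large batch size and halve it whenever memory runs out. The sketch below shows that pattern in generic Python. The `run` callable is a hypothetical stand-in for whatever launches prediction with a given batch size and raises `MemoryError` on failure. If you drive the deepchopper CLI through a subprocess instead, you would need to detect the out-of-memory failure from its exit status or error output, which is not covered here.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_backoff(run: Callable[[int], T], batch_size: int = 32, min_batch_size: int = 1) -> T:
    """Call run(batch_size), halving the batch size after each MemoryError."""
    while True:
        try:
            return run(batch_size)
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)


def fake_predict(batch_size: int) -> int:
    """Toy workload that only fits in memory with a batch size of 8 or less."""
    if batch_size > 8:
        raise MemoryError(f"batch size {batch_size} does not fit")
    return batch_size


print(run_with_backoff(fake_predict, batch_size=32))  # tries 32, then 16, then 8
```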
5. Web Interface (Optional)¶
DeepChopper also provides a user-friendly web interface for quick tests and demonstrations. Launching it starts a local web server where you can:
- Upload single FASTQ records
- Visualize predictions in real-time
- Test DeepChopper without command-line operations
⚠️ Note: The web interface is designed for quick tests with single reads. For production use with large datasets, use the command-line interface.
🌐 Online Version: Try DeepChopper online at Hugging Face Spaces without any installation!
Next Steps¶
- Advanced Parameters: Check our documentation for detailed parameters of the chop command
- CLI Options: Explore all available options with deepchopper --help, deepchopper predict --help, deepchopper chop --help, etc.
- Downstream Analysis: Use your cleaned data for:
  - Transcript annotation
  - Gene expression quantification
  - Gene fusion detection
  - Alternative splicing analysis
Troubleshooting¶
Memory Issues¶
- Issue: Out of memory errors for CPU or CUDA (GPU) when predicting
  Solution:
  - Reduce batch size: deepchopper predict input.fastq --batch-size 4
  - Use --max-sample for testing: deepchopper predict input.fastq --max-sample 1000
  - Process smaller files separately if dealing with very large FASTQ files
- Issue: Out of memory when chopping
  Solution:
  - Reduce the number of threads: deepchopper chop predictions/0 input.fastq --threads 1
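The suggestion to process smaller files separately can be sketched as a simple FASTQ splitter. This Python example assumes an uncompressed FASTQ with exactly four lines per record. `split_fastq` and the `.partNNNN.fastq` naming scheme are hypothetical choices made for illustration.

```python
import os
from itertools import islice
from typing import List


def split_fastq(path: str, records_per_chunk: int, out_dir: str) -> List[str]:
    """Split a 4-line FASTQ into files holding at most records_per_chunk reads each."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(path))[0]
    chunk_paths: List[str] = []
    with open(path) as handle:
        while True:
            lines = list(islice(handle, 4 * records_per_chunk))  # next batch of records
            if not lines:
                break
            chunk_path = os.path.join(out_dir, f"{stem}.part{len(chunk_paths):04d}.fastq")
            with open(chunk_path, "w") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Each chunk can then be passed to deepchopper predict and deepchopper chop on its own, with a separate --output folder per chunk so that results do not overwrite each other.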
Performance Issues¶
- Issue: Slow processing
  Solution:
  - Enable GPU acceleration: deepchopper predict input.fastq --gpus 1
  - Increase threads for chopping: deepchopper chop predictions/0 input.fastq --threads 4
  - Increase batch size (if memory allows): deepchopper predict input.fastq --batch-size 16
- Issue: Apple Silicon (M1/M2/M3) not using GPU
  Solution:
  - Specify --gpus 1 to enable MPS acceleration
  - Ensure PyTorch was installed with MPS support
  - Check with: python -c "import torch; print(torch.backends.mps.is_available())"
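The one-line MPS check above can be extended into a small script that reports which accelerator PyTorch can see, which helps when deciding whether to pass --gpus 1. This is a minimal sketch: `detect_accelerator` is a hypothetical helper, and it only reflects what PyTorch reports, not what DeepChopper will actually select.

```python
def detect_accelerator() -> str:
    """Return "cuda", "mps", or "cpu" based on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed in this environment
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # attribute is absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(detect_accelerator())
```

If this prints cpu on a machine that has a GPU, the PyTorch build is the likely culprit rather than DeepChopper.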
Model and Results Issues¶
- Issue: Unexpected results or poor performance
  Solution:
  - Verify model selection: Use --model rna002 for RNA002 data or --model rna004 for RNA004 data
  - Try both workflows: You can use Dorado with or without trimming; both work well with DeepChopper
  - Verify input data quality (check FASTQ quality scores)
  - Check DeepChopper version: deepchopper --version
  - Review the prediction output files before chopping
Hardware and Compatibility Issues¶
- Issue: GPU driver compatibility error
  Solution:
  - Update your GPU driver to the latest version
  - Install a compatible PyTorch version:
    - CUDA 11.8: pip install torch --force-reinstall --index-url https://download.pytorch.org/whl/cu118
    - CUDA 12.1: pip install torch --force-reinstall --index-url https://download.pytorch.org/whl/cu121
    - CPU only: pip install torch --force-reinstall --index-url https://download.pytorch.org/whl/cpu
- Issue: deepchopper command not found
  Solution:
  - Ensure the installation directory is in your PATH
  - Check installation: pip show deepchopper
  - Try reinstalling: pip install --force-reinstall deepchopper
  - Activate your virtual environment if you created one
Getting Help¶
If you encounter issues not covered here:
- Check the GitHub Issues for similar problems
- Open a new issue with:
  - DeepChopper version (deepchopper --version)
  - Command you ran
  - Full error message
  - System information (OS, Python version, GPU if applicable)
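Most of the details requested above can be gathered with a short script and pasted into the issue. The sketch below uses only the Python standard library plus the deepchopper --version command mentioned in this guide. `collect_report_info` is a hypothetical helper, and GPU details are left for you to add by hand.

```python
import platform
import shutil
import subprocess
from typing import Dict


def collect_report_info() -> Dict[str, str]:
    """Gather OS, Python, and DeepChopper version strings for a bug report."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "deepchopper": "not found on PATH",
    }
    exe = shutil.which("deepchopper")
    if exe:
        result = subprocess.run([exe, "--version"], capture_output=True, text=True)
        info["deepchopper"] = (result.stdout or result.stderr).strip()
    return info


for key, value in collect_report_info().items():
    print(f"{key}: {value}")
```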
Happy sequencing, and may your data be artificial-chimera-free! 🧬🔍