ChimeraLM

Genomic Language Model for Detecting WGA Chimeric Artifacts

A deep learning-powered tool to identify artificial chimeric reads arising from whole genome amplification (WGA) processes.


Key Features

High Accuracy

Deep learning model trained on real WGA data for precise chimeric artifact detection

GPU Accelerated

Optimized for CUDA, MPS (Apple Silicon), and CPU with configurable batch processing

Easy to Use

Simple CLI with sensible defaults: get started in minutes

Fast Processing

Batch inference with configurable parallelism for large-scale genomic datasets

Web Interface

Try the interactive demo on HuggingFace Spaces, no installation needed!

Production Ready

Includes filtering, sorting, and indexing of BAM files

Quick Start

Get up and running with ChimeraLM in under 15 minutes:

# Install ChimeraLM
pip install chimeralm

# Predict chimeric reads (CPU)
chimeralm predict your_data.bam

# Predict with GPU acceleration
chimeralm predict your_data.bam --gpus 1 --batch-size 24
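When many BAM files need the same treatment, the commands above can be scripted. The following is a minimal sketch, not part of ChimeraLM itself: it only assembles the `chimeralm predict` invocation shown above, with the `--gpus` and `--batch-size` flags as they appear there. The helper names `build_predict_command` and `predict_all` are hypothetical, and the sketch assumes `chimeralm` is on the PATH and that `--batch-size` is also accepted without `--gpus`.

```python
import subprocess
from pathlib import Path


def build_predict_command(bam_path, gpus=0, batch_size=None):
    """Assemble the `chimeralm predict` argument list for one BAM file."""
    command = ["chimeralm", "predict", str(bam_path)]
    if gpus > 0:
        command += ["--gpus", str(gpus)]
    if batch_size is not None:
        # Assumption: --batch-size may be passed with or without --gpus.
        command += ["--batch-size", str(batch_size)]
    return command


def predict_all(bam_dir, gpus=0, batch_size=None, dry_run=True):
    """Run (or, with dry_run=True, only list) one prediction per BAM file."""
    commands = [
        build_predict_command(bam, gpus, batch_size)
        for bam in sorted(Path(bam_dir).glob("*.bam"))
    ]
    if not dry_run:
        for command in commands:
            # check=True stops the batch at the first failing file.
            subprocess.run(command, check=True)
    return commands
```

Calling `predict_all("bams/", gpus=1, batch_size=24)` returns the commands without running them; pass `dry_run=False` to execute them one after another.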

Ready to dive in? Check out our Quick Start Guide.

Try ChimeraLM Online - No Installation Required!

Want to test ChimeraLM before installing? Try our interactive web demo:

Launch Web Demo on HuggingFace Spaces

Perfect for:

Testing with individual DNA sequences
Visualizing prediction confidence scores
Learning about chimeric artifact detection
Quick validation before batch processing

The web demo runs the same model as the CLI tool but provides an intuitive visual interface for single-sequence analysis.

What is ChimeraLM?

ChimeraLM is a genomic language model that detects chimeric artifacts introduced by whole genome amplification (WGA). Built with PyTorch Lightning and optimized for modern GPUs, it provides fast and accurate identification of chimeric reads in BAM files.

Chimeric artifacts are artificial DNA sequences created during WGA that join sequences from different genomic locations into a single read. If they are not removed before analysis, they can lead to incorrect biological conclusions, such as spurious structural variant calls.
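To make the definition concrete, here is a deliberately simplified toy model of a chimeric read. It does not reflect how ChimeraLM works internally; the reference string, coordinates, and the helper `make_chimeric_read` are all made up for illustration.

```python
def make_chimeric_read(reference, first, second):
    """Join two reference segments, each given as a (start, end) pair."""
    (s1, e1), (s2, e2) = first, second
    return reference[s1:e1] + reference[s2:e2]


# Toy reference; real genomes are millions to billions of bases long.
reference = "AAAACCCCGGGGTTTTACGTACGT"

# A biological read comes from one contiguous locus.
biological = reference[0:8]
# A chimeric read fuses two loci that are not adjacent in the genome.
chimeric = make_chimeric_read(reference, (0, 4), (12, 16))

print(biological)             # AAAACCCC
print(chimeric)               # AAAATTTT
print(chimeric in reference)  # False
```

The chimeric read never occurs as a contiguous stretch of the reference, which is why such reads can be mistaken for evidence of a genomic rearrangement.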

ChimeraLM uses the HyenaDNA backbone architecture to learn patterns that distinguish biological reads (label 0) from chimeric artifacts (label 1), helping researchers clean their sequencing data before downstream analysis.
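The label convention above (0 for biological, 1 for chimeric) is enough to sketch what cleaning a read set looks like. This is an illustration only: the `predictions` dictionary and the helper `split_by_label` are hypothetical and do not represent ChimeraLM's actual output format or API.

```python
BIOLOGICAL = 0  # label for biological reads, as stated above
CHIMERIC = 1    # label for chimeric artifacts, as stated above


def split_by_label(predictions):
    """Split read names into (kept, removed) lists by predicted label."""
    kept = [name for name, label in predictions.items() if label == BIOLOGICAL]
    removed = [name for name, label in predictions.items() if label == CHIMERIC]
    return kept, removed


# Hypothetical predictions keyed by read name (not ChimeraLM's real output format).
predictions = {"read_1": 0, "read_2": 1, "read_3": 0, "read_4": 1}

kept, removed = split_by_label(predictions)
print(kept)     # ['read_1', 'read_3']
print(removed)  # ['read_2', 'read_4']
print(f"chimeric fraction: {len(removed) / len(predictions):.2f}")  # chimeric fraction: 0.50
```

Only the reads in `kept` would be passed on to downstream analysis.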

See also: DeepChopper, a tool for identifying chimera artifacts caused by internal adapter sequences in Nanopore direct RNA sequencing (dRNA-seq) data.

Citation

If you use ChimeraLM in your research, please cite:

@software{chimeralm2025,
  title={ChimeraLM: A genomic language model to identify chimera artifacts},
  author={Li, Yangyang and Guo, Qingxiang and Yang, Rendong},
  year={2025},
  url={https://github.com/ylab-hi/ChimeraLM}
}

License

ChimeraLM is licensed under the Apache License 2.0. See License for details.