Output¶
ScanNeo2 returns its output files in the results/<name/of/sample> folder (as specified in the config file). The final output is located within prioritization.
results/<name/of/sample>
- dnaseq
- align
- qualitycontrol
- reads
- indel
- rnaseq
- align
- altsplicing
- indel
- exitron
- qualitycontrol
- reads
- variants
- annotation
- hla
- prioritization
Note: the final results are located in the prioritization folder. In addition, these individual folders contain separate results for each
PRE-PROCESSING¶
Two pre-processing stages run per sample, both gated on preproc.activate: true:
Quality Control¶
FastQC reports land in results/<sample>/{dnaseq,rnaseq}/qualitycontrol/ per read group, with separate forward / reverse reports for paired-end inputs. Use these to spot adapter contamination, GC bias, and per-base quality drop-off before trimming.
Pre-processed reads¶
fastp writes the trimmed reads to results/<sample>/{dnaseq,rnaseq}/reads/ as <group>_preproc.fq.gz (single-end) or <group>_preproc_r1.fq.gz / <group>_preproc_r2.fq.gz (paired-end). Sliding-window trimming and minimum-length filtering follow the preproc.slidingwindow and preproc.minlen settings in the config.
HLA¶
This folder contains the results of the HLA genotyping. The files mhc-I.tsv and mhc-II.tsv carry the typed alleles for MHC class I and class II respectively. Each row is <source>\t<allele>:
DNA HLA-A*02:01
RNA HLA-A*02:01
custom HLA-A*68:01
custom HLA-B*15:07
The first column records where the allele came from:
DNA— predicted from DNA-seq reads (OptiType for class I, HLA-HD for class II).RNA— predicted from RNA-seq reads.custom— user-supplied viadata.custom.hlatyping.MHC-{I,II}in the config; useful when alleles are already known or when running on a sample where read-based typing isn't appropriate.
Multiple sources for the same allele are kept (e.g. an allele typed independently from both DNA and RNA appears twice). Downstream binding-affinity prediction operates on the deduplicated allele set.
ALIGNMENT¶
Aligned BAMs land in results/<sample>/dnaseq/align/ and results/<sample>/rnaseq/align/. The two paths use different aligners by design:
- DNA-seq is aligned directly with BWA-MEM — sufficient for variant calling against a reference genome.
- RNA-seq is first aligned with STAR in chimeric-aware mode (the
align.chim*config parameters control chimeric-segment thresholds; see the STAR manual). The STAR BAM is then re-aligned with BWA via therealignrule because the downstream RNA-variant callers (transIndel, ScanExitron) need a BWA-style CIGAR. Both intermediates and the final BAM are kept.
postproc_bam_index writes .bai index files alongside each BAM.
VARIANT CALLING¶
ScanNeo2 calls different variants (according to the configuration) and then collects the results in VCF in the folder results/<name/of/sample>/variants/ (except for gene fusion events). This can be helpful to gain insights into the variants of each type.
ALTERNATIVE SPLICING¶
SplAdder detects alternative-splicing events from the RNA-seq alignment; intermediate splice-graph files land in results/<sample>/rnaseq/altsplicing/. The splicing_to_vcf rule converts SplAdder's output to per-group VCFs that are then sorted, augmented with GRP / SRC INFO keys (same convention as the exitron path below), and merged into results/<sample>/variants/altsplicing.vcf.gz.
EXITRON¶
Exitron events are called using ScanExitron and the results are stored in results/<name/of/sample>/rnaseq/exitron/. Most importantly, ScanExitron generates the <group>_exitron.vcf). In the next step, this file is augmented with information about the <group> and source (exitron), which is stored in the keys GRP and SRC of the INFO field, respectively. The file <group>_exitrons_augmented.vcf is generated for this. Finally, the files are sorted (<group>_exitrons.vcf.gz), and merged into results/<name/of/sample>/variants/exitrons.vcf.gz.
INDEL/SNVs¶
Two callers feed this path:
- transIndel for long indels. The
detect_long_indel_ti_build_DNAanddetect_long_indel_ti_build_RNArules each build a remapped BAM (with redefined CIGAR) beforedetect_long_indel_ti_callextracts the indels; per-group VCFs are augmented withGRP/SRC=long_indelINFO and merged intoresults/<sample>/variants/long.indels.vcf.gz. Whether the DNA build, RNA build, or both run is set byindel.mode. - GATK Mutect2 for short indels and SNVs.
detect_short_indels_m2runs per split BAM (parallel across read-groups),filter_short_indels_m2applies Mutect2's own learned filters, and the augment / merge / select rules separate the per-VCF SNV / short-indel streams. Final results land inresults/<sample>/variants/somatic.short.indels.vcf.gzandresults/<sample>/variants/somatic.snvs.vcf.gz.
The indel.type config key selects which callers run (short, long, or all); indel.mode selects the input modality (DNA, RNA, or BOTH) where applicable.
PRIORITIZATION¶
In the prioritization, the output files are generated in results/<name/of/sample>/prioritization/. For each variant type, this includes the <variant_type>_variant_effects.tsv in which the effects of each variant are listed, and for each MHC class the file <variant_type>_<mhc_class>_neoepitopes.txt which contains the detected neoepitopes. In addition, <mhc_class>_neoepitopes_all.txt contains all detected neoepitopes in one file.
The folder structure looks this this (if all modules were activated)
- altsplicing_mhc-I_neoepitopes.txt
- altsplicing_variant_effects.tsv
- exitrons_mhc-I_neoepitopes.txt
- exitrons_variant_effects.tsv
- fusions_mhc-I_neoepitopes.txt
- fusions_variant_effects.tsv
- long.indels_mhc-I_neoepitopes.txt
- long.indels_variant_effects.tsv
- somatic.short.indels_mhc-I_neoepitopes.txt
- somatic.short.indels_variant_effects.tsv
- somatic.snvs_mhc-I_neoepitopes.txt
- somatic.snvs_variant_effects.tsv
- custom_protein_mhc-I_neoepitopes.txt
- custom_protein_variant_effects.tsv
- mhc-I_neoepitopes_all.txt
- altsplicing_mhc-II_neoepitopes.txt
- altsplicing_variant_effects.tsv
- exitrons_mhc-II_neoepitopes.txt
- exitrons_variant_effects.tsv
- fusions_mhc-II_neoepitopes.txt
- fusions_variant_effects.tsv
- long.indels_mhc-II_neoepitopes.txt
- long.indels_variant_effects.tsv
- somatic.short.indels_mhc-II_neoepitopes.txt
- somatic.short.indels_variant_effects.tsv
- somatic.snvs_mhc-II_neoepitopes.txt
- somatic.snvs_variant_effects.tsv
- mhc-II_neoepitopes_all.txt
<vartype>_variant_effects.tsv and include variant_effects.txt and individual files for predicted MHC classes (e.g., mhc-I_neoepitopes.txt and mhc-II_neoepitopes.txt). The former is an intermediate file that contains the variants and their effects on the protein sequence. It can be used as a reference and provides more information about the variants. The following table describes its content.
| Field | Value | Description |
|---|---|---|
| chrom | String | Chromosome in which the variant occurs. In the case of fusion events, this describes the chromosome of each segment, separated by \| (e.g., chr1\|chr2) |
| start | Integer | Reference position (0-based) of the variant. In the case of fusion events, this describes the start position of each segment, separated by \| (e.g., 1341234\|418728) |
| end | Integer | End position of the variant |
| gene_id | String | Gene ID the variant occurs in |
| gene_name | String | Corresponding gene name the variant occurs in |
| transcript_id | String | Correspond transcript id in the variant occurs in |
| source | String | The source of the variant (e.g., SNV, Indel,..) |
| group | String | The group of the variant |
| var_type | String | The effect of the variant (e.g., inframe deletion,...) |
| wt_subseq | String | Wildtype protein sequence (flanking left and right) of the variant |
| mt_subseq | String | Mutant protein sequence (flanking left and right) of the variant |
| var_start | Number | 0-based start position of the variant in the annotation |
| aa_var_start | Integer | Start position (0-based) of the variant within the subsequence |
| aa_var_end | Integer | Start position (0-based) of the variant within the subsequence |
| vaf | Float | Corresponding variant allele frequency (if available) |
| ao | Float | Observed alleles/reads that support the variant |
| dp | Float | Sequencing depth at the position of the variant |
| TPM | Float | Transcript per Million (TPM) for the corresponding transcript |
| NMD | String | Indicates if the variant is involved in the nonsense-mediated decay (NMD) pathway |
| PTC_dist_ejc | Integer | Distance of the premature stop codon (PTC) to the next exon junction |
| PTC_exon_number | Integer | Exon number the PTC occurs in |
| NMD_escape_rule | Integer | Rule used to escape the NMD pathway (if applicable) |
In addition, the mhc-I_neoepitopes.txt is partly redundant to variant_effects.txt and contains the following fields:
| Field | Value | Description |
|---|---|---|
| chrom | String | Chromosome in which the variant occurs. In the case of fusion events, this describes the chromosome of each segment, separated by \| (e.g., chr1\|chr2) |
| start | Integer | Reference position (0-based) of the variant. In the case of fusion events, this describes the start position of each segment, separated by \| (e.g., 1341234\|418728) |
| end | Integer | End position of the variant |
| allele | String | HLA allele |
| gene_id | String | Gene ID the variant occurs in |
| gene_name | String | Corresponding gene name the variant occurs in |
| transcript_id | String | Correspond transcript id in the variant occurs in |
| source | String | The source of the variant (e.g., SNV, Indel,..) |
| group | String | The group of the variant |
| var_type | String | The effect of the variant (e.g., inframe deletion,...) |
| var_start | Number | 0-based start position of the variant in the annotation |
| wt_epitope_seq | String | wildtype sequence of the epitope |
| wt_epitope_seq_ic50 | Float | binding affinity of the wildtype sequence |
| wt_epitope_rank | Float | rank of the wildtype epitope |
| mt_epitope_seq | String | mutant sequence of the epitope |
| mt_epitope_seq_ic50 | Float | binding affinity of the mutant sequence |
| mt_epitope_rank | Float | rank of the mutant epitope |
| vaf | Float | variant allele frequency |
| supporting | Integer | reads supporting the variant |
| TPM | Float | Transcripts per Million |
| agretopicity | Float | agretopicity score defined mt_affinity/wt_affinity |
| NMD | String | Indicates if the variant is involved in the nonsense-mediated decay (NMD) pathway. Populated for all frameshift variants (SNV / short-indel / long-indel / exitron / alt-splicing / fusion). Values: NMD_variant, NMD_escaping_variant, or empty (.) when no PTC can be determined. |
| PTC_dist_ejc | Integer | Distance of the premature stop codon (PTC) to the next exon junction |
| PTC_exon_number | Integer | Exon number the PTC occurs in |
| NMD_escape_rule | Integer | Rule used to escape the NMD pathway (if applicable) |
| wt_immunogenicity | Float | Immunogenicity score of the wildtype epitope. A higher score indicates a greater probability of eliciting an immune response |
| mt_immunogenicity | Float | Immunogenicity score of the mutant epitope. A higher score indicates a greater probability of eliciting an immune response |
| self-similarity | Float | Similarity measure between the wildtype and mutant epitope. Float values between 0 and 1. 0 Indicates no similarity or a complete difference between the WT and MT sequences. 1 Indicates perfect similarity, meaning the WT and MT sequences are identical in terms of their k-mer similarities. |
| pathogen_similarity | Float | Similarity measure between the mutant epitope and known pathogens - more details below |
| pathogen_evalue | Float | BLAST e-value for the pathogen similarity |
| pathogen_bitscore | Float | BLAST bitscore for the pathogen similarity |
| pathogen | String | Name of the detected (similar) pathogen |
| proteome_similarity | Float | Similarity measure between the mutant epitope and the (human) proteome |
| proteome_evalue | Float | BLAST e-value for the proteome similarity |
| proteome_bitscore | Float | BLAST bitscore for the proteome similarity |
| protein | String | Name/ID of the detected (similar) protein |
pathogen/proteome similarity¶
The sequence similarity ssim is defined as:
ssim = \frac{\text{identity}}{100}*aligncov
aligncov is defined as:
aligncov = \frac{\text{length of alignment}}{\text{length of mutant epitope}}