Output

ScanNeo2 writes its output files to the results/<name/of/sample> folder (as specified in the config file). The final output is located in the prioritization subfolder.

results/<name/of/sample>
  - dnaseq
    - align
    - qualitycontrol
    - reads
    - indel 
  - rnaseq 
    - align
    - altsplicing
    - indel
    - exitron
    - qualitycontrol
    - reads
  - variants
  - annotation
  - hla
  - prioritization

Note: the final results are located in the prioritization folder. The other folders hold the separate results of each enabled step (as specified in the configuration file), which are merged in later stages of ScanNeo2.
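As a quick sanity check after a run, the layout above can be compared against what is actually on disk. The following is a minimal sketch, not part of ScanNeo2: the helper name `missing_output_dirs` is hypothetical, and a missing folder may simply mean the corresponding module was not activated in the config.

```python
from pathlib import Path

# Sub-folders from the layout above, relative to results/<name/of/sample>.
EXPECTED_DIRS = [
    "dnaseq/align", "dnaseq/qualitycontrol", "dnaseq/reads", "dnaseq/indel",
    "rnaseq/align", "rnaseq/altsplicing", "rnaseq/indel", "rnaseq/exitron",
    "rnaseq/qualitycontrol", "rnaseq/reads",
    "variants", "annotation", "hla", "prioritization",
]


def missing_output_dirs(sample_dir):
    """Return the expected sub-folders that do not exist under sample_dir.

    Hypothetical helper for illustration; ScanNeo2 does not ship it.
    """
    root = Path(sample_dir)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_output_dirs("results/<name/of/sample>")` with the real sample path returns an empty list when every folder is present.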

PRE-PROCESSING

Two pre-processing stages run per sample, both gated on preproc.activate: true:

Quality Control

FastQC reports land in results/<sample>/{dnaseq,rnaseq}/qualitycontrol/ per read group, with separate forward / reverse reports for paired-end inputs. Use these to spot adapter contamination, GC bias, and per-base quality drop-off before trimming.

Pre-processed reads

fastp writes the trimmed reads to results/<sample>/{dnaseq,rnaseq}/reads/ as <group>_preproc.fq.gz (single-end) or <group>_preproc_r1.fq.gz / <group>_preproc_r2.fq.gz (paired-end). Sliding-window trimming and minimum-length filtering follow the preproc.slidingwindow and preproc.minlen settings in the config.
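The naming scheme for the trimmed reads can be written down as a small helper, which is handy when locating the files from a script. This is a sketch under the naming described above; the function name and its arguments are hypothetical and not part of ScanNeo2.

```python
from pathlib import PurePosixPath


def preproc_read_paths(sample, seqtype, group, paired_end):
    """Build the expected path(s) of the fastp-trimmed reads.

    Hypothetical helper: 'sample' and 'group' are whatever names were
    used in the config; 'seqtype' is 'dnaseq' or 'rnaseq'.
    """
    if seqtype not in ("dnaseq", "rnaseq"):
        raise ValueError("seqtype must be 'dnaseq' or 'rnaseq'")
    reads_dir = PurePosixPath("results") / sample / seqtype / "reads"
    if paired_end:
        names = [f"{group}_preproc_r1.fq.gz", f"{group}_preproc_r2.fq.gz"]
    else:
        names = [f"{group}_preproc.fq.gz"]
    return [str(reads_dir / name) for name in names]
```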

HLA

This folder contains the results of the HLA genotyping. The files mhc-I.tsv and mhc-II.tsv carry the typed alleles for MHC class I and class II respectively. Each row is <source>\t<allele>:

DNA       HLA-A*02:01
RNA       HLA-A*02:01
custom    HLA-A*68:01
custom    HLA-B*15:07

The first column records where the allele came from:

DNA — predicted from DNA-seq reads (OptiType for class I, HLA-HD for class II).
RNA — predicted from RNA-seq reads.
custom — user-supplied via data.custom.hlatyping.MHC-{I,II} in the config; useful when alleles are already known or when running on a sample where read-based typing isn't appropriate.

Multiple sources for the same allele are kept (e.g. an allele typed independently from both DNA and RNA appears twice). Downstream binding-affinity prediction operates on the deduplicated allele set.
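To illustrate the difference between the per-source rows and the deduplicated allele set, the sketch below reads an mhc-I.tsv style file and collapses repeated alleles. It assumes two whitespace- or tab-separated columns and no header row, as in the example above; the function name `read_alleles` is hypothetical and this is not the code ScanNeo2 itself uses.

```python
import io


def read_alleles(handle):
    """Return (rows, unique_alleles) from an mhc-I.tsv / mhc-II.tsv style file.

    Assumes each non-empty line is '<source> <allele>' with no header row.
    """
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows.append((fields[0], fields[1]))
    # dict.fromkeys keeps the first occurrence of each allele, in file order
    unique = list(dict.fromkeys(allele for _, allele in rows))
    return rows, unique


example = io.StringIO(
    "DNA\tHLA-A*02:01\n"
    "RNA\tHLA-A*02:01\n"
    "custom\tHLA-A*68:01\n"
    "custom\tHLA-B*15:07\n"
)
rows, unique = read_alleles(example)
print(len(rows), unique)
# prints: 4 ['HLA-A*02:01', 'HLA-A*68:01', 'HLA-B*15:07']
```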

ALIGNMENT

Aligned BAMs land in results/<sample>/dnaseq/align/ and results/<sample>/rnaseq/align/. The two paths use different aligners by design:

DNA-seq is aligned directly with BWA-MEM — sufficient for variant calling against a reference genome.
RNA-seq is first aligned with STAR in chimeric-aware mode (the align.chim* config parameters control chimeric-segment thresholds; see the STAR manual). The STAR BAM is then re-aligned with BWA via the realign rule because the downstream RNA-variant callers (transIndel, ScanExitron) need a BWA-style CIGAR. Both intermediates and the final BAM are kept.

postproc_bam_index writes .bai index files alongside each BAM.

VARIANT CALLING

ScanNeo2 calls the variant types enabled in the configuration and collects the results as VCF files in the folder results/<name/of/sample>/variants/ (gene fusion events are the exception). These per-type files make it possible to inspect the variants of each type separately.

ALTERNATIVE SPLICING

SplAdder detects alternative-splicing events from the RNA-seq alignment; intermediate splice-graph files land in results/<sample>/rnaseq/altsplicing/. The splicing_to_vcf rule converts SplAdder's output to per-group VCFs that are then sorted, augmented with GRP / SRC INFO keys (same convention as the exitron path below), and merged into results/<sample>/variants/altsplicing.vcf.gz.

EXITRON

Exitron events are called using ScanExitron, and the results are stored in results/<name/of/sample>/rnaseq/exitron/. Most importantly, ScanExitron generates the .exitron file, which contains all predicted exitron events. Please consult the ScanExitron documentation for a detailed description of the data fields. The intermediate results (*.janno) are also kept. ScanNeo2 processes the ScanExitron output in three steps. First, it converts the output into VCF (<group>_exitron.vcf). Second, it augments this file with the <group> and the source (exitron), stored in the GRP and SRC keys of the INFO field, respectively; this produces <group>_exitrons_augmented.vcf. Finally, the files are sorted (<group>_exitrons.vcf.gz) and merged into results/<name/of/sample>/variants/exitrons.vcf.gz.
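The GRP and SRC keys can be read back from any of the augmented VCFs with a few lines of plain Python. The sketch below parses the INFO column of a single record; the function names and the example record (including the group name `rna_tumor`) are made up for illustration and do not come from ScanNeo2.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'GRP=x;SRC=exitron' into a dict.

    Flag entries without '=' are stored with the value True.
    """
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def group_and_source(vcf_line):
    """Return (GRP, SRC) for one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the 8th VCF column
    return info.get("GRP"), info.get("SRC")


# Hypothetical record, only to show the call
record = "chr1\t100\t.\tACGT\tA\t.\tPASS\tGRP=rna_tumor;SRC=exitron"
print(group_and_source(record))
# prints: ('rna_tumor', 'exitron')
```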

INDEL/SNVs

Two callers feed this path:

transIndel for long indels. The detect_long_indel_ti_build_DNA and detect_long_indel_ti_build_RNA rules each build a remapped BAM (with redefined CIGAR) before detect_long_indel_ti_call extracts the indels; per-group VCFs are augmented with GRP / SRC=long_indel INFO and merged into results/<sample>/variants/long.indels.vcf.gz. Whether the DNA build, RNA build, or both run is set by indel.mode.
GATK Mutect2 for short indels and SNVs. detect_short_indels_m2 runs per split BAM (parallel across read-groups), filter_short_indels_m2 applies Mutect2's own learned filters, and the augment / merge / select rules separate the per-VCF SNV / short-indel streams. Final results land in results/<sample>/variants/somatic.short.indels.vcf.gz and results/<sample>/variants/somatic.snvs.vcf.gz.

The indel.type config key selects which callers run (short, long, or all); indel.mode selects the input modality (DNA, RNA, or BOTH) where applicable.
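Read together with the caller descriptions above, the indel.type setting suggests which merged VCFs to expect in results/<sample>/variants/. The mapping below is an inference from this page rather than something taken from the ScanNeo2 source, and the function name is hypothetical.

```python
def expected_indel_vcfs(indel_type):
    """Return the merged VCF names expected for a given indel.type value.

    Assumption: 'short' runs Mutect2 only, 'long' runs transIndel only,
    and 'all' runs both, as described in the text above.
    """
    short_outputs = ["somatic.short.indels.vcf.gz", "somatic.snvs.vcf.gz"]
    long_outputs = ["long.indels.vcf.gz"]
    if indel_type == "short":
        return short_outputs
    if indel_type == "long":
        return long_outputs
    if indel_type == "all":
        return long_outputs + short_outputs
    raise ValueError("indel.type must be 'short', 'long', or 'all'")
```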

PRIORITIZATION

In the prioritization step, the output files are written to results/<name/of/sample>/prioritization/. For each variant type, this includes <variant_type>_variant_effects.tsv, which lists the effects of each variant, and, for each MHC class, <variant_type>_<mhc_class>_neoepitopes.txt, which contains the detected neoepitopes. In addition, <mhc_class>_neoepitopes_all.txt collects all detected neoepitopes in one file.

The folder structure looks like this (if all modules were activated):

- altsplicing_mhc-I_neoepitopes.txt
- altsplicing_variant_effects.tsv
- exitrons_mhc-I_neoepitopes.txt
- exitrons_variant_effects.tsv
- fusions_mhc-I_neoepitopes.txt
- fusions_variant_effects.tsv
- long.indels_mhc-I_neoepitopes.txt
- long.indels_variant_effects.tsv
- somatic.short.indels_mhc-I_neoepitopes.txt
- somatic.short.indels_variant_effects.tsv
- somatic.snvs_mhc-I_neoepitopes.txt
- somatic.snvs_variant_effects.tsv
- custom_protein_mhc-I_neoepitopes.txt
- custom_protein_variant_effects.tsv
- mhc-I_neoepitopes_all.txt

- altsplicing_mhc-II_neoepitopes.txt
- altsplicing_variant_effects.tsv
- exitrons_mhc-II_neoepitopes.txt
- exitrons_variant_effects.tsv
- fusions_mhc-II_neoepitopes.txt
- fusions_variant_effects.tsv
- long.indels_mhc-II_neoepitopes.txt
- long.indels_variant_effects.tsv
- somatic.short.indels_mhc-II_neoepitopes.txt
- somatic.short.indels_variant_effects.tsv
- somatic.snvs_mhc-II_neoepitopes.txt
- somatic.snvs_variant_effects.tsv
- mhc-II_neoepitopes_all.txt
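The file names in this listing follow a regular pattern, which the sketch below reproduces. The variant-type and MHC-class names are copied from the listing; the custom_protein entries are left out because the listing shows them for MHC class I only. The function name is hypothetical.

```python
VARIANT_TYPES = [
    "altsplicing", "exitrons", "fusions",
    "long.indels", "somatic.short.indels", "somatic.snvs",
]
MHC_CLASSES = ["mhc-I", "mhc-II"]


def prioritization_files(variant_types=VARIANT_TYPES, mhc_classes=MHC_CLASSES):
    """List the file names expected in the prioritization folder."""
    names = []
    for vartype in variant_types:
        # one variant-effects table per variant type
        names.append(f"{vartype}_variant_effects.tsv")
        # one neoepitope file per variant type and MHC class
        for mhc in mhc_classes:
            names.append(f"{vartype}_{mhc}_neoepitopes.txt")
    # one combined neoepitope file per MHC class
    names.extend(f"{mhc}_neoepitopes_all.txt" for mhc in mhc_classes)
    return names
```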

The prioritization folder thus contains one <variant_type>_variant_effects.tsv per variant type and individual neoepitope files for each predicted MHC class (e.g., <variant_type>_mhc-I_neoepitopes.txt and <variant_type>_mhc-II_neoepitopes.txt). The former is an intermediate file that contains the variants and their effects on the protein sequence. It can be used as a reference and provides more information about the variants. The following table describes its content.

Field	Value	Description
chrom	String	Chromosome in which the variant occurs. In the case of fusion events, this describes the chromosome of each segment, separated by `|` (e.g., `chr1|chr2`)
start	Integer	Reference position (0-based) of the variant. In the case of fusion events, this describes the start position of each segment, separated by `|` (e.g., `1341234|418728`)
end	Integer	End position of the variant
gene_id	String	Gene ID the variant occurs in
gene_name	String	Corresponding gene name the variant occurs in
transcript_id	String	Corresponding transcript ID the variant occurs in
source	String	The source of the variant (e.g., SNV, Indel,..)
group	String	The group of the variant
var_type	String	The effect of the variant (e.g., inframe deletion,...)
wt_subseq	String	Wildtype protein sequence (flanking left and right) of the variant
mt_subseq	String	Mutant protein sequence (flanking left and right) of the variant
var_start	Number	0-based start position of the variant in the annotation
aa_var_start	Integer	Start position (0-based) of the variant within the subsequence
aa_var_end	Integer	End position (0-based) of the variant within the subsequence
vaf	Float	Corresponding variant allele frequency (if available)
ao	Float	Observed alleles/reads that support the variant
dp	Float	Sequencing depth at the position of the variant
TPM	Float	Transcripts per Million (TPM) for the corresponding transcript
NMD	String	Indicates if the variant is involved in the nonsense-mediated decay (NMD) pathway
PTC_dist_ejc	Integer	Distance of the premature stop codon (PTC) to the next exon junction
PTC_exon_number	Integer	Exon number the PTC occurs in
NMD_escape_rule	Integer	Rule used to escape the NMD pathway (if applicable)
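The table can be loaded with Python's standard csv module. The sketch below assumes the file is tab-delimited with a header row that uses the field names above, and that missing numeric values appear as an empty string or `.`; these are assumptions, so check them against an actual file. The function names are hypothetical.

```python
import csv
import io


def to_float(value):
    """Convert a field to float; return None for '.', '' or other non-numbers."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def load_variant_effects(handle):
    """Read a *_variant_effects.tsv style table into a list of dicts.

    Assumes a tab-delimited file whose first line holds the field names.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_vaf(rows, min_vaf):
    """Keep rows whose 'vaf' field is numeric and at least min_vaf."""
    kept = []
    for row in rows:
        vaf = to_float(row.get("vaf"))
        if vaf is not None and vaf >= min_vaf:
            kept.append(row)
    return kept


# Tiny made-up table with a subset of the columns, for illustration only
demo = io.StringIO("chrom\tstart\tend\tvaf\nchr1\t10\t11\t0.25\nchr2\t20\t21\t.\n")
print([row["chrom"] for row in filter_by_vaf(load_variant_effects(demo), 0.1)])
# prints: ['chr1']
```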

The <variant_type>_mhc-I_neoepitopes.txt file partly overlaps with <variant_type>_variant_effects.tsv and contains the following fields:

Field	Value	Description
chrom	String	Chromosome in which the variant occurs. In the case of fusion events, this describes the chromosome of each segment, separated by `|` (e.g., `chr1|chr2`)
start	Integer	Reference position (0-based) of the variant. In the case of fusion events, this describes the start position of each segment, separated by `|` (e.g., `1341234|418728`)
end	Integer	End position of the variant
allele	String	HLA allele
gene_id	String	Gene ID the variant occurs in
gene_name	String	Corresponding gene name the variant occurs in
transcript_id	String	Corresponding transcript ID the variant occurs in
source	String	The source of the variant (e.g., SNV, Indel,..)
group	String	The group of the variant
var_type	String	The effect of the variant (e.g., inframe deletion,...)
var_start	Number	0-based start position of the variant in the annotation
wt_epitope_seq	String	wildtype sequence of the epitope
wt_epitope_seq_ic50	Float	binding affinity of the wildtype sequence
wt_epitope_rank	Float	rank of the wildtype epitope
mt_epitope_seq	String	mutant sequence of the epitope
mt_epitope_seq_ic50	Float	binding affinity of the mutant sequence
mt_epitope_rank	Float	rank of the mutant epitope
vaf	Float	variant allele frequency
supporting	Integer	reads supporting the variant
TPM	Float	Transcripts per Million
agretopicity	Float	agretopicity score, defined as mt_affinity/wt_affinity
NMD	String	Indicates if the variant is involved in the nonsense-mediated decay (NMD) pathway. Populated for all frameshift variants (SNV / short-indel / long-indel / exitron / alt-splicing / fusion). Values: `NMD_variant`, `NMD_escaping_variant`, or empty (`.`) when no PTC can be determined.
PTC_dist_ejc	Integer	Distance of the premature stop codon (PTC) to the next exon junction
PTC_exon_number	Integer	Exon number the PTC occurs in
NMD_escape_rule	Integer	Rule used to escape the NMD pathway (if applicable)
wt_immunogenicity	Float	Immunogenicity score of the wildtype epitope. A higher score indicates a greater probability of eliciting an immune response
mt_immunogenicity	Float	Immunogenicity score of the mutant epitope. A higher score indicates a greater probability of eliciting an immune response
self-similarity	Float	Similarity measure between the wildtype and mutant epitope, with values between 0 and 1. `0` indicates no similarity between the WT and MT sequences. `1` indicates perfect similarity, meaning the WT and MT sequences are identical in terms of their k-mer similarities.
pathogen_similarity	Float	Similarity measure between the mutant epitope and known pathogens - more details below
pathogen_evalue	Float	BLAST e-value for the pathogen similarity
pathogen_bitscore	Float	BLAST bitscore for the pathogen similarity
pathogen	String	Name of the detected (similar) pathogen
proteome_similarity	Float	Similarity measure between the mutant epitope and the (human) proteome
proteome_evalue	Float	BLAST e-value for the proteome similarity
proteome_bitscore	Float	BLAST bitscore for the proteome similarity
protein	String	Name/ID of the detected (similar) protein
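Two of these fields lend themselves to a short worked example: agretopicity, defined in the table as mt_affinity/wt_affinity, and the mutant binding affinity used for ordering candidates. The sketch below assumes rows have been read into dicts keyed by the field names above; the function names are hypothetical, and ScanNeo2's own implementation may differ.

```python
def agretopicity(mt_ic50, wt_ic50):
    """Ratio of mutant to wildtype binding affinity (mt_affinity / wt_affinity)."""
    if wt_ic50 <= 0:
        raise ValueError("wildtype affinity must be positive")
    return mt_ic50 / wt_ic50


def rank_by_mutant_affinity(rows):
    """Sort rows by mt_epitope_seq_ic50, lowest (strongest binding) first.

    Rows whose mt_epitope_seq_ic50 is missing or non-numeric are dropped.
    """
    scored = []
    for row in rows:
        try:
            scored.append((float(row["mt_epitope_seq_ic50"]), row))
        except (KeyError, TypeError, ValueError):
            continue
    scored.sort(key=lambda pair: pair[0])
    return [row for _, row in scored]


# A mutant epitope that binds 100x more strongly than its wildtype counterpart
print(agretopicity(50.0, 5000.0))
# prints: 0.01
```

An agretopicity below 1 means the mutant epitope has the lower IC50, i.e. it binds more strongly than the wildtype epitope.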

pathogen/proteome similarity

The sequence similarity ssim is defined as:

$$\text{ssim} = \frac{\text{identity}}{100} \cdot \text{aligncov}$$

where aligncov is defined as:

$$\text{aligncov} = \frac{\text{length of alignment}}{\text{length of mutant epitope}}$$
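The two formulas translate directly into code. The sketch below assumes that identity is a percentage between 0 and 100 (as the division by 100 suggests) and that both lengths are counted in residues; the function name is hypothetical.

```python
def sequence_similarity(identity_percent, alignment_length, epitope_length):
    """Compute ssim = identity / 100 * aligncov.

    aligncov is the alignment length divided by the mutant epitope length.
    identity_percent is assumed to be given in percent (0 to 100).
    """
    if epitope_length <= 0:
        raise ValueError("epitope length must be positive")
    aligncov = alignment_length / epitope_length
    return identity_percent / 100 * aligncov


# A 9-mer epitope aligned over 8 residues with 100% identity
print(round(sequence_similarity(100.0, 8, 9), 3))
# prints: 0.889
```

A full-length alignment with 100% identity gives an ssim of 1, and shorter or less identical alignments scale the score down proportionally.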