Advanced Configuration¶
The config file, found under config/config.yaml, can be used to adapt your analysis.
Execution Mode¶
# execution mode. Can be either "patient" or "environment"
mode: environment
Defines the execution mode of UnCoVar.
When the mode is set to patient, the sample is assumed to come from a single
host organism and to contain only one strain of SARS-CoV-2. The parts of the
workflow for reconstructing the SARS-CoV-2 strain genome are activated.
If the mode is set to environment, the sample is assumed to be from the
environment (e.g. wastewater) and to contain different SARS-CoV-2 strains.
The parts of the workflow responsible for creating and analysing individual
genomes (e.g. assembly, lineage calling via Pangolin) are disabled.
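The effect of the two modes can be summarized in a small Python sketch. The `enabled_steps` helper and the step names are illustrative assumptions and not UnCoVar rule names; only the two mode values and the behaviour described above come from the configuration.

```python
# Illustrative only: step names are assumptions, not UnCoVar rule names.
GENOME_STEPS = {"assembly", "lineage-calling"}  # per-genome steps
SHARED_STEPS = {"preprocessing"}  # assumed to run in both modes


def enabled_steps(mode: str) -> set[str]:
    """Return the (hypothetical) workflow parts active for a given mode."""
    if mode == "patient":
        # single host, one strain: genome reconstruction is active
        return SHARED_STEPS | GENOME_STEPS
    if mode == "environment":
        # mixed strains: per-genome steps are disabled
        return set(SHARED_STEPS)
    raise ValueError(f'mode must be "patient" or "environment", got {mode!r}')
```

Rejecting unknown values early mirrors the comment in the config, which allows only "patient" or "environment".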
Sending lab number¶
UnCoVar automatically generates a multi-FASTA file and a corresponding .csv file for
all samples that have a 1 flag for include_in_high_genome_summary in the sample sheet
and that match the given quality criteria (see below). The reporting format and the
quality criteria are inspired by the requirements for SARS-CoV-2 genome submission
to the Robert Koch Institute (RKI), Germany.
The sending lab number will be included in the .csv file.
Data handling¶
With the root of the UnCoVar workflow as the working directory, we recommend using the following folder structure:
├── archive
├── incoming
└── uncovar
    └── data
        └── 2023-12-24
The structure can be adjusted via the config under data-handling:
data-handling:
  # flag for using the following data-handling structure
  # True: data-handling structure is used as shown below
  # False: only the sample sheet needs to be updated (manually)
  use-data-handling: True
  # flag for archiving data
  # True: data is archived in path defined below
  # False: data is not archived
  archive-data: False
  # path of incoming data, which is moved to the
  # data directory by the preprocessing script
  incoming: ../incoming/
  # path to store data within the workflow
  data: data/
  # path to archive data from incoming and
  # the results from the latest run to
  archive: ../archive/
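To see where the relative paths in this section point, the following Python sketch resolves them against the workflow root. The `resolve_paths` helper and the location `/work/uncovar` are hypothetical and chosen only for illustration; the three path values are the ones from the config above.

```python
import posixpath

# Values copied from the data-handling section of the config.
data_handling = {
    "incoming": "../incoming/",
    "data": "data/",
    "archive": "../archive/",
}


def resolve_paths(workflow_root: str, cfg: dict) -> dict:
    """Resolve the configured relative paths against the workflow root."""
    return {
        key: posixpath.normpath(posixpath.join(workflow_root, rel))
        for key, rel in cfg.items()
    }


# "/work/uncovar" is a hypothetical location of the workflow root.
paths = resolve_paths("/work/uncovar", data_handling)
for key, path in paths.items():
    print(f"{key}: {path}")
```

With these values, incoming and archive resolve to sibling directories of the workflow root, which matches the recommended folder structure.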
Quality criteria¶
The quality criteria can be adjusted to your individual needs. By default, they match the quality criteria required for submission to the RKI (see Sending lab number above).
quality-criteria:
  illumina:
    # minimal length of acceptable reads
    min-length-reads: 30
    # average quality of acceptable reads (PHRED)
    min-PHRED: 20
  ont:
    # minimal length of acceptable reads
    min-length-reads: 200
    # average quality of acceptable reads (PHRED)
    min-PHRED: 10
  # identity to virus reference genome (see-above) of reconstructed sequence
  min-identity: 0.9
  # share N in the reconstructed sequence
  max-n: 0.05
  # minimum local sequencing depth without filtering of PCR duplicates
  min-depth-with-PCR-duplicates: 20
  # minimum local sequencing depth after filtering PCR duplicates
  min-depth-without-PCR-duplicates: 10
  # minimum informative allele frequency
  min-allele: 0.9
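As a rough illustration of how the two sequence-level thresholds could be applied, the Python sketch below checks a reconstructed sequence against min-identity and max-n. The helper names are hypothetical, the identity value is assumed to be computed elsewhere, and treating both thresholds as inclusive is an assumption; this is not UnCoVar's own implementation.

```python
# Thresholds copied from the default quality-criteria shown above.
criteria = {"min-identity": 0.9, "max-n": 0.05}


def share_of_n(sequence: str) -> float:
    """Fraction of N bases in a sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)


def passes_criteria(sequence: str, identity: float, cfg: dict) -> bool:
    """Apply the two sequence-level thresholds; identity is precomputed."""
    return identity >= cfg["min-identity"] and share_of_n(sequence) <= cfg["max-n"]


# Made-up sequences of length 100 for illustration.
good = "ACGT" * 24 + "NNNN"   # 4 of 100 bases are N
bad = "ACGT" * 22 + "N" * 12  # 12 of 100 bases are N
print(passes_criteria(good, 0.98, criteria))
print(passes_criteria(bad, 0.98, criteria))
```

The first sequence stays below the allowed share of N and passes; the second exceeds it and fails even though its identity is high.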
Preprocessing¶
Here, different preprocessing options can be adjusted. By default, the standard Illumina adapters
are trimmed. For samples prepared with an amplicon sequencing approach, you can
define the path to the primer file in .bed format. If you are processing Nanopore
samples, you can also define the primer version by changing the number after artic-primer-version.
The default primer file is a BED file from the ARTIC network. However, the primers for clipping can be customized. First, save the custom primers in BED format. Next, open config/config.yaml and, in the preprocessing section, change the path after amplicon-primers to the location of your primer file.
preprocessing:
  # only for *non* Oxford Nanopore data. Adapters to trim.
  # see: https://www.nimagen.com/shop/products/rc-cov096/easyseq-sars-cov-2-novel-coronavirus-whole-genome-sequencing-kit
  kit-adapters: "--adapter_sequence GCGAATTTCGACGATCGTTGCATTAACTCGCGAA --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
  # only for Oxford Nanopore data.
  # ARTIC primer version to clip from reads. See
  # https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019/V4
  # for more information
  artic-primer-version: 4
  # path to amplicon primers in bed format for hard-clipping on paired end files (illumina) or url to file that should be downloaded
  amplicon-primers: "resources/SARS-CoV-2-artic-v4_1.primer.bed"
  # GenBank accession of reference sequence of the amplicon primers
  amplicon-reference: "MN908947"
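Before pointing amplicon-primers to a custom file, it can help to sanity-check that the file is plausible BED. The Python sketch below is a minimal check under stated assumptions: `check_primer_bed` is a hypothetical helper, it only inspects the first three columns, and requiring start to be smaller than end is an assumption that suits primers rather than a general BED rule. The example records are made up and not taken from the ARTIC primer scheme.

```python
def check_primer_bed(text: str) -> list[str]:
    """Return a list of problems found in BED-formatted primer text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header/comment lines
        fields = line.split()
        if len(fields) < 3:
            problems.append(f"line {number}: expected at least 3 columns")
            continue
        start, end = fields[1], fields[2]
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {number}: start and end must be non-negative integers")
        elif int(start) >= int(end):
            problems.append(f"line {number}: start must be smaller than end")
    return problems


# Made-up example records, not taken from the ARTIC primer scheme.
example = "MN908947.3\t10\t35\tprimer_1_LEFT\nMN908947.3\t400\t380\tprimer_1_RIGHT\n"
print(check_primer_bed(example))
```

An empty list means no problems were found by these basic checks; it does not guarantee that the primers match the reference given under amplicon-reference.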
Assembly¶
In this section you define which assembler you want to use for the genome reconstruction. UnCoVar uses MEGAHIT and metaSPAdes by default, as those achieved the best results in a benchmarking comparison. The assembly options can be changed independently.
The following options are available:
- megahit-std
- megahit-meta-large
- megahit-meta-sensitive
- trinity
- velvet
- metaspades
- coronaspades
- spades
- rnaviralspades
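A typo in the assembler name is easy to make, so a small check against the options listed above can be useful before starting a run. The Python sketch below is an illustration only: `validate_assembler` is a hypothetical helper, and the config keys under which the assembler is set are not shown here.

```python
import difflib

# Option names as listed above.
ASSEMBLERS = [
    "megahit-std", "megahit-meta-large", "megahit-meta-sensitive",
    "trinity", "velvet", "metaspades", "coronaspades", "spades",
    "rnaviralspades",
]


def validate_assembler(name: str) -> str:
    """Return the name unchanged if it is a listed option, else raise."""
    if name in ASSEMBLERS:
        return name
    # Suggest the closest listed option, if any is similar enough.
    hint = difflib.get_close_matches(name, ASSEMBLERS, n=1)
    suggestion = f" Did you mean {hint[0]!r}?" if hint else ""
    raise ValueError(f"unknown assembler {name!r}.{suggestion}")
```

Calling `validate_assembler("metaspades")` returns the name unchanged, while an unlisted name raises a `ValueError`.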