Denoising amplicon sequences
To provide a full tool chain from raw sequences to data in SBDI, we have, in collaboration with nf-core, developed a reproducible workflow – Ampliseq – for denoising of Illumina and PacBio sequences with DADA2. The tool also performs taxonomic annotation of the resulting Amplicon Sequence Variants (ASVs).
At the time of writing, the workflow is moving away from the QIIME2 implementation of DADA2 and taxonomic annotation; the new version is expected to be released as version 2 during the first half of 2021. The development version of the workflow can be run after first installing Nextflow, a workflow management tool. Please follow the Nextflow installation instructions and consult the development version of the Ampliseq README. In short, after installing Nextflow, and assuming you have Illumina read pairs in a directory called sequences, you can run the development version of the workflow like this:
nextflow run nf-core/ampliseq -r dev -profile docker --input sequences
The above will run all steps of the analysis using Docker, which needs to be installed on your local machine. Other profiles are available for running the workflow on other types of computing resources; see the nf-core usage instructions.
The Swedish ASV portal: A submission and analysis tool
SBDI hosts the Swedish ASV portal, currently accepting denoised amplicon sequences (PCR products). The webapp serves as a portal to the main SBDI site and provides search tools for sequence data. It also allows submission of sequences.
The submitted denoised amplicon sequences are given a standardized taxonomic annotation before entering the ASV database. To achieve consistency, this is done regardless of whether the submitted sequences already have a taxonomic annotation.
The annotation is performed with Ampliseq, which uses DADA2's Bayesian classifier (the assignTaxonomy() function) and its species assignment function addSpecies(), which assigns a species to a sequence if there are exact matches without conflicts in the database. Which database is used depends on which group of organisms the amplicons were generated from.
| Organism group | Marker | Reference database |
|---|---|---|
| Archaea and Bacteria | 16S rRNA | GTDB-SBDI*) (latest release) |
| Fungi | ITS | Unite fungi (latest release) |
| Other eukaryotes | ITS | Unite all eukaryotes (latest release) |
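The rule "exact matches without conflicts" described above can be illustrated with a small toy sketch. This is not DADA2 code: the function name assign_species and the reference tuples are hypothetical, and treating an exact match as "the ASV occurs verbatim inside a reference sequence" is a simplifying assumption made for illustration only.

```python
def assign_species(asv, references):
    """Toy illustration of species assignment by exact matching.

    `references` is an iterable of (sequence, genus, species) tuples.
    Assumption: an "exact match" means the ASV occurs verbatim inside
    a reference sequence. A species is returned only when every
    matching reference agrees on the same genus and species.
    """
    hits = {
        (genus, species)
        for sequence, genus, species in references
        if asv in sequence
    }
    if len(hits) == 1:
        # All exact matches point to one species: no conflict.
        return next(iter(hits))
    # Either no exact match, or matches to conflicting species.
    return None
```

With this rule, an ASV that matches references from two different species is left without a species label rather than being assigned to one of them arbitrarily.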
Cleaning of the GTDB-SBDI taxonomy database
The genomes used in the GTDB taxonomy database have been vetted for completeness and contamination. However, since many of the genomes are metagenome-assembled genomes (MAGs), rRNA operons are sometimes wrongly binned, giving them wrong taxonomic labels. To ensure that sequences used for taxonomy annotation in SBDI are correct, we filter 16S rRNA gene sequences downloaded from GTDB using the following steps:
- Sequences longer than 2000 base pairs are removed.
- After aligning the sequences to the archaeal and bacterial Barrnap SSU rRNA profiles, sequences with an aligned length shorter than 1000 base pairs are removed.
- Sequences that the Sativa algorithm detects as misclassified at genus to phylum level are removed.
- A maximum of five sequences per species is selected. Sequences belonging to a GTDB species representative genome and longer sequences are prioritized.
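The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the actual SBDI pipeline: the Record fields and the function filter_records are hypothetical, the aligned length and the misclassification flag are assumed to have been computed beforehand by the external tools (Barrnap profiles and Sativa), and ranking representative-genome sequences before longer ones is one possible reading of the prioritization.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    seq_id: str
    species: str
    length: int                  # full sequence length in base pairs
    aligned_length: int          # length aligned to the SSU rRNA profile (precomputed)
    is_representative: bool      # from a GTDB species representative genome
    flagged_misclassified: bool  # flagged by an external mislabel check (precomputed)


def filter_records(records, max_length=2000, min_aligned=1000, max_per_species=5):
    """Apply the length, alignment, and mislabel filters, then cap per species."""
    kept = [
        r for r in records
        if r.length <= max_length
        and r.aligned_length >= min_aligned
        and not r.flagged_misclassified
    ]

    by_species = defaultdict(list)
    for r in kept:
        by_species[r.species].append(r)

    selected = []
    for species in sorted(by_species):
        # Representative-genome sequences first, then longer sequences.
        ranked = sorted(
            by_species[species],
            key=lambda r: (not r.is_representative, -r.length),
        )
        selected.extend(ranked[:max_per_species])
    return selected
```

The thresholds are exposed as keyword arguments so that the same sketch can be reused if the cut-offs change.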
ASV portal vs. Bioatlas taxonomy
To become fully accessible in the SBDI Bioatlas, all occurrence data are also indexed against the GBIF taxonomy backbone. While this taxonomy covers a huge number of names, GBIF is currently in the process of improving its coverage of underrepresented groups such as prokaryotes. This means that the taxonomy of some ASVs may differ slightly between the ASV portal and the Bioatlas. Also note that, for practical reasons, we follow GBIF in listing the prokaryote domains Archaea and Bacteria together with eukaryote kingdoms in our search interface. See the help page for taxonomy under Publish and share data for more information.
Sequence search and connection to SBDI occurrence data
Currently, it is not possible to search the SBDI Bioatlas directly with sequences. Users who are interested in tracking the presence or abundance of organisms associated with specific amplicon sequences can instead use the Swedish ASV portal's sequence search interface to search for similar sequences with BLAST, or to filter by primer pair and taxonomy.
The sequences found are linked to occurrence data in the SBDI Bioatlas.
Submission of molecular data to SBDI
The Swedish ASV portal has an interface for sequence data submission. Further instructions for this are available at the submission page.
Submission of raw sequence data
To publish manuscripts with results from sequencing efforts, it is generally required that raw sequences are deposited in an online archive. SBDI does not handle raw sequences, only denoised, quantified, and taxonomically annotated sequences. We have nevertheless prepared a guide to help with submission of raw sequence data to the European Nucleotide Archive (ENA).