Denoising amplicon sequences
To provide a full tool chain from raw sequences to data in SBDI, we have, in collaboration with nf-core, developed a reproducible workflow – Ampliseq – for denoising of Illumina and PacBio sequences with DADA2. The tool also performs taxonomic annotation of the resulting Amplicon Sequence Variants (ASVs).
At the time of writing, the workflow is moving away from the QIIME2 implementation of DADA2 and taxonomic annotation, and the new version is expected to be released as version 2 during the first half of 2021. The development version of the workflow can be run after first installing Nextflow, a workflow management tool. Please follow the Nextflow installation instructions and consult the Ampliseq documentation. In short, after installing Nextflow, and assuming you have Illumina read pairs in a directory called sequences, you can run the workflow like this:
nextflow run nf-core/ampliseq -profile docker --input sequences
The above will run all steps of the analysis using Docker, which needs to be installed on your local machine. Other profiles are available for running the workflow on other types of computing resources; see the nf-core usage instructions.
To get help, join the nf-core Slack and, after being invited, ask your question on the ampliseq channel.
The Swedish ASV portal: A submission and analysis tool
SBDI hosts the Swedish ASV portal, currently accepting denoised amplicon sequences (PCR products). The webapp serves as a portal to the main SBDI site and provides search tools for sequence data. It also allows submission of sequences.
The submitted denoised amplicon sequences are given a standardized taxonomic annotation before entering the ASV database. To achieve consistency, this is done regardless of whether the submitted sequences already have a taxonomic annotation.
The annotation is performed with Ampliseq, which by default uses DADA2's Bayesian classifier (the assignTaxonomy() function) and its species assignment function addSpecies(), which assigns a species to a sequence if there are exact matches without conflicts in the database. COI sequences are annotated using the VSEARCH implementation of the SINTAX classifier, which is included as an option in Ampliseq. At least 80% bootstrap support is required for the taxonomic ranks. Which database is used depends on which group of organisms the amplicons were generated from.
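The 80% bootstrap requirement can be illustrated with a minimal Python sketch. It is not the Ampliseq or DADA2 code: the function name filter_ranks, the input layout and the example taxon names and support values are all hypothetical, and it assumes that the annotation is cut at the first rank that falls below the threshold.

```python
MIN_BOOTSTRAP = 80  # percent; the threshold stated in the text above


def filter_ranks(assignments, min_bootstrap=MIN_BOOTSTRAP):
    """Keep (rank, name) pairs down to the last rank with enough support.

    `assignments` is a list of (rank, name, bootstrap_percent) tuples ordered
    from the highest rank to the lowest. Assumption: once one rank fails the
    threshold, all lower ranks are dropped as well.
    """
    kept = []
    for rank, name, support in assignments:
        if support < min_bootstrap:
            break
        kept.append((rank, name))
    return kept


# Hypothetical classifier output for a single ASV (made-up values).
example = [
    ("domain", "Bacteria", 100),
    ("phylum", "Pseudomonadota", 98),
    ("class", "Gammaproteobacteria", 91),
    ("order", "Enterobacterales", 72),
    ("family", "Enterobacteriaceae", 65),
]

print(filter_ranks(example))
```

In this made-up example the annotation stops at class level, because the order-level support is below 80%.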
For ITS sequences, the annotation is complemented with Unite species hypothesis (SH) assignments. These are made by matching the ITS part of the sequence to the database with vsearch --usearch_global and a sequence identity cutoff of 98.5%. If a good enough and unique match is found, the SH of the match is assigned to the sequence, and the taxonomy of the sequence is changed to that of the SH.
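The decision rule for SH assignment described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the function assign_sh and the SH identifiers are hypothetical, the hits are assumed to be already parsed into (SH identifier, percent identity) pairs, and "unique" is interpreted here as all best-scoring hits pointing to the same SH, which the text does not spell out.

```python
SH_IDENTITY_CUTOFF = 98.5  # percent identity; the cutoff stated in the text


def assign_sh(hits, cutoff=SH_IDENTITY_CUTOFF):
    """Return the SH identifier for one query sequence, or None.

    `hits` is a list of (sh_id, identity_percent) tuples for the query.
    Assumption: a match is "unique" when every hit sharing the best
    identity (at or above the cutoff) belongs to the same SH.
    """
    good = [(sh, ident) for sh, ident in hits if ident >= cutoff]
    if not good:
        return None  # no match is good enough
    best = max(ident for _, ident in good)
    best_shs = {sh for sh, ident in good if ident == best}
    if len(best_shs) == 1:
        return best_shs.pop()
    return None  # best hits disagree, so no SH is assigned


# Hypothetical hits; "SH_A" and "SH_B" are placeholder identifiers.
print(assign_sh([("SH_A", 99.2), ("SH_B", 98.7)]))
print(assign_sh([("SH_A", 99.2), ("SH_B", 99.2)]))
print(assign_sh([("SH_A", 97.0)]))
```

Only when a single SH is returned would the taxonomy of the sequence be replaced with that of the SH.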
|Organism group|Marker|Reference database|
|Archaea and Bacteria|16S rRNA|GTDB-SBDI*) (latest release)|
|Fungi|ITS|Unite fungi (latest release)|
|Metazoa|COI|SBDI-COI*) (latest release)|
|Other eukaryotes|ITS|Unite all eukaryotes (latest release)|
Vetting of 16S and 18S sequences with “Barrnap”
Amplicon sequences from 16S and 18S rRNA genes are checked with Barrnap, a tool that identifies potential rRNA sequences. Only sequences that are identified as SSU rRNA by Barrnap will be included in SBDI. We make exceptions for 16S sequences that are not identified as SSU rRNA by Barrnap, if they have a taxonomy assignment at domain level.
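The inclusion rule above can be written as a small predicate. This is a sketch, not the portal's actual code: the function name passes_vetting and its parameters are hypothetical, and the Barrnap result and the taxonomy status are assumed to be available as booleans.

```python
def passes_vetting(marker, barrnap_ssu, has_domain_assignment):
    """Decide whether a 16S or 18S sequence is included.

    marker: "16S" or "18S".
    barrnap_ssu: True if Barrnap identified the sequence as SSU rRNA.
    has_domain_assignment: True if the sequence has a taxonomy
    assignment at domain level.
    """
    if barrnap_ssu:
        return True
    # Exception stated in the text: 16S sequences missed by Barrnap are
    # still accepted when they have a domain-level assignment.
    return marker == "16S" and has_domain_assignment


print(passes_vetting("16S", False, True))
print(passes_vetting("18S", False, True))
```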
Cleaning of the GTDB-SBDI taxonomy database
The genomes used in the GTDB taxonomy database have been vetted for completeness and contamination. However, since many of the genomes are metagenome-assembled genomes (MAGs), rRNA operons are sometimes wrongly binned, giving them wrong taxonomic labels. To ensure that the sequences used for taxonomy annotation in SBDI are correct, we filter the 16S rRNA gene sequences downloaded from GTDB in the following steps.
- Sequences longer than 2000 basepairs are removed.
- After aligning the sequences to the archaeal and bacterial Barrnap SSU rRNA profiles, sequences with an aligned length shorter than 1000 basepairs are removed.
- Sequences that the Sativa algorithm detects as misclassified at genus to phylum level are removed.
- A maximum of five sequences per species are selected. Sequences belonging to a GTDB species representative genome and longer sequences are prioritized.
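The four filtering steps listed above could be sketched in Python roughly as follows. This is an illustration, not the code used to build GTDB-SBDI: the Seq record and its field names are hypothetical, the aligned lengths and Sativa results are assumed to be precomputed, and ranking representative genomes ahead of length is an assumption about how the two priorities are combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Seq:
    seq_id: str
    species: str
    length: int                 # sequence length in basepairs
    aligned_length: int         # length aligned to the Barrnap SSU profile
    is_representative: bool     # from a GTDB species representative genome
    sativa_misclassified: bool  # flagged by Sativa (precomputed input)


def clean(seqs, max_len=2000, min_aligned=1000, max_per_species=5):
    """Apply the length, alignment and Sativa filters, then cap per species."""
    kept = [
        s for s in seqs
        if s.length <= max_len
        and s.aligned_length >= min_aligned
        and not s.sativa_misclassified
    ]
    by_species = defaultdict(list)
    for s in kept:
        by_species[s.species].append(s)
    selected = []
    for species in sorted(by_species):
        # Assumption: representatives first, then longer sequences.
        ranked = sorted(
            by_species[species],
            key=lambda s: (not s.is_representative, -s.length),
        )
        selected.extend(ranked[:max_per_species])
    return selected


# Made-up records: seven candidates for one species, four for another.
example = [Seq(f"s{i}", "sp1", 1400 + 10 * i, 1300, i == 0, False) for i in range(7)]
example += [
    Seq("long", "sp2", 2100, 1500, True, False),   # longer than 2000 bp
    Seq("short", "sp2", 1200, 900, False, False),  # aligned length under 1000 bp
    Seq("bad", "sp2", 1500, 1400, False, True),    # flagged by Sativa
    Seq("ok", "sp2", 1500, 1400, False, False),
]

print([s.seq_id for s in clean(example)])
```

With the made-up records above, species sp1 is capped at five sequences, starting with the representative one, and only one sp2 sequence survives the filters.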
ASV portal vs. Bioatlas taxonomy
To become fully accessible in the SBDI Bioatlas, all occurrence data are also indexed against the GBIF taxonomy backbone. While this taxonomy covers a huge number of names, GBIF is still in the process of improving its coverage of underrepresented groups such as prokaryotes. This means that the taxonomy of some ASVs may differ slightly between the ASV portal and the Bioatlas. Also note that, for practical reasons, we follow GBIF in listing the prokaryote domains Archaea and Bacteria together with eukaryote kingdoms in our search interface. See the help page for taxonomy under Publish and share data for more information.
Sequence search and connection to SBDI occurrence data
Currently, it is not possible to search the SBDI Bioatlas directly with sequences. Users who are interested in tracking the presence or abundance of organisms associated with specific amplicon sequences can instead use the Swedish ASV portal's sequence search interface to search for similar sequences with BLAST, or to filter by primer pair and taxonomy.
Sequences found link to occurrence data in the SBDI Bioatlas.
Submission of molecular data to SBDI
The Swedish ASV portal has an interface for sequence data submission. Further instructions for this are available at the submission page.
Submission of raw sequence data
To be able to publish manuscripts with results from sequencing efforts, it is generally required that raw sequences are deposited in an online archive. However, SBDI does not handle raw sequences, but only denoised, quantified and taxonomically annotated sequences. We have nevertheless prepared a guide to help with submission of raw sequence data to the European Nucleotide Archive, ENA.