We aim to make it easier to share and use molecular biodiversity data, i.e. occurrence data derived from DNA sequences. Our initial focus has been on developing tools and resources for metabarcoding (amplicon sequencing) data, but upcoming releases will extend support to metagenomics.
Guide to ENA submission
Publishing research findings from sequencing studies typically requires sharing raw sequence reads in an online archive. While SBDI doesn’t handle raw data, we provide a step-by-step guide to submitting sequence reads from environmental samples to the European Nucleotide Archive (ENA).
Workflow for denoising and taxonomic annotation
To streamline the conversion of raw sequences into occurrence data, we contribute to the development of nf-core/ampliseq – a reproducible workflow for denoising Illumina and PacBio sequences using DADA2. This tool also performs taxonomic annotation of the resulting Amplicon Sequence Variants (ASVs).
To use the workflow, first install Nextflow, a workflow management tool, by following the Nextflow installation instructions. Then refer to the nf-core/ampliseq documentation to install ampliseq. After installation, assuming you have Illumina read pairs in a sequences directory, run the workflow like this:
nextflow run nf-core/ampliseq -profile docker --input sequences
The above command runs all steps of the analysis using Docker, which must be installed on your local machine. Other profiles are available for running the workflow on other types of computing resources, and the --sbdiexport option outputs data in a format suitable for ASV portal submission (see below). Refer to the nf-core usage instructions for details.
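For scripted or repeated runs, the command line can be assembled programmatically. The following is a minimal Python sketch and not part of nf-core/ampliseq: the helper name build_ampliseq_command is hypothetical, and it only uses the options mentioned in this section (-profile, --input and --sbdiexport). Depending on the workflow version, additional parameters may be needed, so check the usage instructions before relying on it.

```python
import shlex


def build_ampliseq_command(input_dir, profile="docker", sbdi_export=False):
    """Return the nextflow call as an argument list (hypothetical helper)."""
    cmd = [
        "nextflow", "run", "nf-core/ampliseq",
        "-profile", profile,
        "--input", input_dir,
    ]
    if sbdi_export:
        # Ask the workflow for output suitable for ASV portal submission.
        cmd.append("--sbdiexport")
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it, so it can be inspected first.
    print(shlex.join(build_ampliseq_command("sequences", sbdi_export=True)))
```

The list form can be passed directly to subprocess.run once Nextflow and Docker are installed.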
Web interface to ASV occurrences in SBDI
The Swedish ASV portal serves as a web interface to DNA-derived biodiversity data in SBDI. Users can submit denoised metabarcoding data, such as the output from nf-core/ampliseq (see above), as well as search for ASVs and Bioatlas records using BLAST or filters based on sequencing details and taxonomy. Read more about the ASV portal application here.
While we store the original taxonomy of submitted ASVs, we also perform a standard re-annotation of these, using current classification methods and reference databases. This ensures that each unique sequence only has one valid classification at any given time. Our system also allows for subsequent taxonomic updates, as reference databases grow and improve.
The annotation is performed with nf-core/ampliseq, which by default uses DADA2's Bayesian classifier, the assignTaxonomy() function, and the species assignment function addSpecies() to assign species to a sequence if there are exact matches without conflicts in the database. For 16S and 18S rRNA sequences, no deviations are made from this protocol. However, for ITS sequences, the default annotation is complemented with Unite species hypothesis (SH) assignments by using the --addsh option together with --cut_its its2 when running the workflow. This matches the ITS part of the sequence to the database with vsearch --usearch_global and a sequence identity cutoff of 98.5%. If a sufficiently good and unique match is found, the SH of the match is assigned to the sequence, and the taxonomy of the sequence is changed to that of the SH. If only the ITS1 region has been sequenced, this is used instead of ITS2. COI sequences are, in turn, annotated using the VSEARCH implementation of the SINTAX classifier, which is enabled with the --sintax_ref_taxonomy option in ampliseq. At least 80% bootstrap support is required for the taxonomic ranks.
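The 80% bootstrap requirement can be illustrated with a short Python sketch. It is not the workflow's own code (the workflow applies the cutoff itself); it assumes a SINTAX-style prediction string in which each rank is followed by its bootstrap support in parentheses, and the function name and the example taxonomy are made up for illustration.

```python
import re

# Matches one rank prediction such as "p:Arthropoda(0.98)".
_RANK_RE = re.compile(r"([a-z]):([^,()]+)\(([\d.]+)\)")


def truncate_by_bootstrap(prediction, cutoff=0.8):
    """Keep ranks from the top down until support falls below the cutoff."""
    kept = []
    for rank, name, support in _RANK_RE.findall(prediction):
        if float(support) < cutoff:
            # All lower ranks are dropped once one rank fails the cutoff.
            break
        kept.append(f"{rank}:{name}")
    return kept


if __name__ == "__main__":
    example = "d:Animalia(1.00),p:Arthropoda(0.98),c:Insecta(0.91),o:Diptera(0.62)"
    print(truncate_by_bootstrap(example))
```

In this made-up example the order-level prediction has only 62% support, so the annotation stops at class level.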
The database we use for the annotation depends on the target gene and the group of organisms from which amplicons were generated, as indicated in the table below. Exact versions of applied algorithms and databases are recorded in the Darwin Core field identificationRemarks for each ASV occurrence.
|Organism group|Target gene|Reference database|
|---|---|---|
|Archaea and Bacteria|16S rRNA|GTDB-SBDI*) (latest release)|
|Fungi|ITS|Unite fungi (latest release)|
|Metazoa|COI|SBDI-COI*) (latest release)|
|Other eukaryotes|ITS|Unite all eukaryotes (latest release)|
Ribosomal RNA (rRNA) status vetting
Amplicon sequences from 16S and 18S rRNA genes are checked with Barrnap, a tool that identifies potential rRNA sequences. Only sequences that are identified as SSU rRNA by Barrnap will be included in SBDI. We make exceptions for 16S sequences that are not identified as SSU rRNA by Barrnap, if they have a taxonomy assignment at domain level.
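The inclusion rule above can be written as a small decision function. This is a minimal Python sketch under stated assumptions, not the SBDI implementation: the function name and arguments are hypothetical, barrnap_ssu stands for "Barrnap identified the sequence as SSU rRNA", and sequences from other target genes are passed through because the check is only described for 16S and 18S.

```python
def passes_rrna_vetting(target_gene, barrnap_ssu, domain=None):
    """Return True if an ASV would be included under the rRNA vetting rule."""
    if target_gene not in ("16S", "18S"):
        # The Barrnap check is only described for 16S and 18S amplicons.
        return True
    if barrnap_ssu:
        return True
    # Exception: 16S sequences missed by Barrnap are kept if they
    # still received a taxonomy assignment at domain level.
    return target_gene == "16S" and bool(domain)


if __name__ == "__main__":
    print(passes_rrna_vetting("16S", barrnap_ssu=False, domain="Bacteria"))
    print(passes_rrna_vetting("18S", barrnap_ssu=False, domain="Eukaryota"))
```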
Cleaning of reference databases
The genomes used in the GTDB taxonomy database have been vetted for completeness and contamination. However, since many of the genomes are metagenome-assembled genomes (MAGs), rRNA operons are sometimes wrongly binned, giving them wrong taxonomic labels. To ensure that sequences used for taxonomy annotation in SBDI are correct, we filter 16S rRNA gene sequences downloaded from GTDB by the following steps.
- Sequences longer than 2000 base pairs are removed.
- After aligning the sequences to the archaeal and bacterial Barrnap SSU rRNA profiles, sequences with an aligned length shorter than 1000 base pairs are removed.
- Sequences that the Sativa algorithm detects as misclassified at genus to phylum level are removed.
- A maximum of five sequences per species is selected. Sequences belonging to a GTDB species representative genome and longer sequences are prioritized.
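The filtering steps above can be sketched in Python as follows. This is an illustration only and not the actual SBDI pipeline: the RefSeq record and its field names are hypothetical, the alignment and Sativa results are assumed to have been computed beforehand, and the priority order (representative genome first, then length) is an assumption, since the text does not say how the two criteria are combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RefSeq:
    seq_id: str
    species: str
    length: int           # full sequence length in base pairs
    aligned_length: int   # length aligned to the Barrnap SSU rRNA profile
    misclassified: bool   # flagged by Sativa at genus to phylum level
    representative: bool  # from a GTDB species representative genome


def filter_reference(seqs, max_len=2000, min_aligned=1000, per_species=5):
    """Apply the length, alignment and misclassification filters, then cap per species."""
    by_species = defaultdict(list)
    for s in seqs:
        if s.length > max_len or s.aligned_length < min_aligned or s.misclassified:
            continue
        by_species[s.species].append(s)

    selected = []
    for species in sorted(by_species):
        # Assumed priority: representative genomes first, then longer sequences.
        ranked = sorted(by_species[species],
                        key=lambda s: (not s.representative, -s.length))
        selected.extend(ranked[:per_species])
    return selected
```

Sorting on a tuple keeps the two priorities explicit, so the order is easy to swap if the actual procedure weighs them differently.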
ASV portal vs. Bioatlas taxonomy
To become fully accessible in the SBDI Bioatlas, all occurrence data are also indexed against the GBIF taxonomy backbone. While this taxonomy covers a huge number of names, GBIF is currently in the process of improving its coverage of underrepresented groups such as prokaryotes. This means that the taxonomy of some ASVs may differ slightly between the ASV portal and the Bioatlas. Also note that, for practical reasons, we follow GBIF in listing the prokaryote domains Archaea and Bacteria together with eukaryote kingdoms in our search interface. See the help page for taxonomy under Publish and share data for more information.