Molecular tools - Swedish Biodiversity Data Infrastructure

We aim to make it easier to share and use molecular biodiversity data, i.e. occurrence data derived from DNA sequences. Our initial focus has been on developing tools and resources for metabarcoding (amplicon sequencing) data, but upcoming releases will extend support to metagenomics.

Guide to ENA submission

Publishing research findings from sequencing studies typically requires sharing raw sequence reads in an online archive. While SBDI doesn’t handle raw data, we provide a step-by-step guide to submitting sequence reads from environmental samples to the European Nucleotide Archive (ENA).

Workflow for denoising and taxonomic annotation

To streamline the conversion of raw sequences into occurrence data, we contribute to the development of nf-core/ampliseq – a reproducible workflow for denoising Illumina and PacBio sequences using DADA2. This tool also performs taxonomic annotation of the resulting Amplicon Sequence Variants (ASVs).

To use the workflow, first install Nextflow, a workflow management tool, by following the Nextflow installation instructions. Then refer to the nf-core/ampliseq documentation to install ampliseq. After installation, assuming you have Illumina read pairs in a sequences directory, run the workflow like this:

nextflow run nf-core/ampliseq -profile docker --input sequences

The above command will run all steps of the analysis using Docker, which must be installed on your local machine. Other profiles are available for running the workflow on other types of computing resources, and an --sbdiexport option outputs data in a format suitable for ASV portal submission (see below). See the nf-core usage instructions for details.

To get help, join the nf-core Slack and, after being invited, ask your question on the ampliseq channel.

Web interface to ASV occurrences in SBDI

The Swedish ASV portal serves as a web interface to DNA-derived biodiversity data in SBDI. Users can submit denoised metabarcoding data, such as the output from nf-core/ampliseq (see above), as well as search for ASVs and Bioatlas records using BLAST or filters based on sequencing details and taxonomy. Read more about the ASV portal application here.

Taxonomic re-annotation

While we store the original taxonomy of submitted ASVs, we also perform a standard re-annotation of these, using current classification methods and reference databases. This ensures that each unique sequence only has one valid classification at any given time. Our system also allows for subsequent taxonomic updates, as reference databases grow and improve.

The annotation is performed with nf-core/ampliseq, which by default uses DADA2's Bayesian classifier, the assignTaxonomy() function, together with the species assignment function addSpecies(), which assigns a species to a sequence if there are exact matches without conflicts in the database. For 16S and 18S rRNA sequences, we do not deviate from this protocol.

For ITS sequences, however, the default annotation is complemented with UNITE species hypothesis (SH) assignments by using the --addsh option together with --cut_its its2 when running the workflow. This matches the ITS part of the sequence to the database with vsearch --usearch_global and a sequence identity cutoff of 98.5%. If a sufficiently good and unique match is found, the SH of the match is assigned to the sequence, and the taxonomy of the sequence is changed to that of the SH. If only the ITS1 region has been sequenced, this is used instead of ITS2.

COI sequences are, in turn, annotated using the VSEARCH implementation of the SINTAX classifier, which is available through the --sintax_ref_taxonomy option in ampliseq. At least 80% bootstrap support is required for each taxonomic rank.
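To illustrate the 80% bootstrap requirement for SINTAX annotations, here is a minimal Python sketch. It assumes a prediction string in the comma-separated rank:name(support) form commonly seen in SINTAX output, and it assumes that the taxonomy is truncated at the first rank whose support falls below the cutoff. The function name apply_bootstrap_cutoff is hypothetical, and this is not the code used by ampliseq or SBDI.

```python
import re

# One SINTAX-style field, e.g. "p:Arthropoda(0.95)" (format is an assumption).
RANK_PATTERN = re.compile(r"^([a-z]):(.+)\(([0-9.]+)\)$")


def apply_bootstrap_cutoff(prediction: str, cutoff: float = 0.8) -> list[tuple[str, str]]:
    """Keep ranks from the top down until support drops below the cutoff."""
    kept = []
    for field in prediction.split(","):
        match = RANK_PATTERN.match(field.strip())
        if match is None:
            break  # unparsable field: stop rather than guess
        rank, name, support = match.group(1), match.group(2), float(match.group(3))
        if support < cutoff:
            break  # lower ranks are left unassigned
        kept.append((rank, name))
    return kept


if __name__ == "__main__":
    example = "k:Metazoa(1.00),p:Arthropoda(0.95),c:Insecta(0.79),o:Diptera(0.60)"
    print(apply_bootstrap_cutoff(example))
```

In this example, only the kingdom and phylum would be retained, because the class-level support of 0.79 is below the 0.8 cutoff.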

The database we use for the annotation depends on the target gene and the group of organisms from which amplicons were generated, as indicated in the table below. Exact versions of applied algorithms and databases are recorded in the Darwin Core field identificationRemarks for each ASV occurrence.

Taxa	Gene	Database
Archaea and Bacteria	16S rRNA	GTDB-SBDI^* (latest release)
Eukaryotes	18S rRNA	PR2
Fungi	ITS	UNITE all eukaryotes^** (latest release)
Metazoa	COI	SBDI-COI^* (latest release)
Other eukaryotes	ITS	UNITE all eukaryotes (latest release)

Databases used by SBDI’s web app for taxonomic annotation. ^*Databases suffixed with “SBDI” have been cleaned by SBDI, see description below. ^**Since May 2024, we annotate against UNITE alleuk, but subsequently exclude any ASVs that are not classified as kingdom Fungi (also from total sample read counts, i.e. DwC term ‘sampleSizeValue’). More genes/databases will be added in the future.
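The Fungi-only filtering described in the table note can be sketched as follows. This is an illustration under stated assumptions, not SBDI's implementation: the function name keep_fungi_only and the dictionary layouts (read counts per sample and ASV, kingdom per ASV) are hypothetical. It drops every ASV not classified as kingdom Fungi and recomputes the per-sample total that would be reported as sampleSizeValue.

```python
def keep_fungi_only(
    asv_counts: dict[str, dict[str, int]],
    asv_kingdom: dict[str, str],
) -> tuple[dict[str, dict[str, int]], dict[str, int]]:
    """Drop non-Fungi ASVs and recompute total read counts per sample.

    asv_counts:  {sample_id: {asv_id: read_count}}  (hypothetical layout)
    asv_kingdom: {asv_id: kingdom name from the annotation}
    """
    filtered = {}
    sample_size = {}
    for sample, counts in asv_counts.items():
        # ASVs with no kingdom or a non-Fungi kingdom are excluded.
        kept = {asv: n for asv, n in counts.items() if asv_kingdom.get(asv) == "Fungi"}
        filtered[sample] = kept
        # The sample total only counts reads from retained ASVs.
        sample_size[sample] = sum(kept.values())
    return filtered, sample_size


if __name__ == "__main__":
    counts = {"s1": {"asv1": 10, "asv2": 5}, "s2": {"asv2": 7}}
    kingdom = {"asv1": "Fungi", "asv2": "Viridiplantae"}
    print(keep_fungi_only(counts, kingdom))
```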

Ribosomal RNA (rRNA) status vetting

Amplicon sequences from 16S and 18S rRNA genes are checked with Barrnap, a tool that identifies potential rRNA sequences. Only sequences identified as SSU rRNA by Barrnap will be included in SBDI datasets (and in total sample read counts reported under ‘sampleSizeValue’). We make exceptions for 16S sequences that are not identified as SSU rRNA by Barrnap, if they have a taxonomy assignment at domain level.
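The inclusion rule above can be written as a small predicate. This is only a sketch of the rule as described here: the function name passes_rrna_vetting and its arguments are hypothetical, and running Barrnap itself is outside the scope of the example.

```python
from typing import Optional


def passes_rrna_vetting(gene: str, barrnap_ssu_hit: bool, domain: Optional[str]) -> bool:
    """Decide whether a 16S or 18S ASV is included in a dataset.

    gene:            "16S" or "18S"
    barrnap_ssu_hit: True if Barrnap identified the sequence as SSU rRNA
    domain:          domain-level taxonomy assignment, or None/"" if missing
    """
    if barrnap_ssu_hit:
        return True
    # Exception: 16S sequences without a Barrnap hit are kept
    # if they have a taxonomy assignment at domain level.
    return gene == "16S" and bool(domain)


if __name__ == "__main__":
    print(passes_rrna_vetting("16S", False, "Bacteria"))
    print(passes_rrna_vetting("18S", False, "Eukaryota"))
```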

Cleaning of reference databases

The genomes used in the GTDB taxonomy database have been vetted for completeness and contamination. However, since many of the genomes are metagenome-assembled genomes (MAGs), rRNA operons are sometimes wrongly binned, giving them wrong taxonomic labels. To ensure that sequences used for taxonomy annotation in SBDI are correct, we filter 16S rRNA gene sequences downloaded from GTDB using the following steps:

Sequences longer than 2000 basepairs are removed.
After aligning the sequences to the archaeal and bacterial Barrnap SSU rRNA profiles, sequences with an aligned length shorter than 1000 basepairs are removed.
Sequences that the Sativa algorithm detects as misclassified at genus to phylum level are removed.
A maximum of five sequences per species are selected. Sequences belonging to a GTDB species representative genome and longer sequences are prioritized.
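The filtering steps above can be sketched in Python as follows. This is a simplified illustration, not the actual SBDI pipeline: the RefSeq record and its fields are hypothetical, and the aligned length and misclassification flag are assumed to have been computed beforehand by the alignment and Sativa steps. The prioritization is read here as "species representatives first, then longer sequences", which is one possible interpretation of the last step.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RefSeq:
    """Hypothetical record for one 16S sequence downloaded from GTDB."""
    seq_id: str
    species: str
    length: int                      # full sequence length in basepairs
    aligned_length: int              # length aligned to the Barrnap SSU profile
    is_species_representative: bool  # from a GTDB species representative genome
    flagged_as_misclassified: bool   # precomputed flag from the Sativa step


def clean_reference(seqs, max_length=2000, min_aligned=1000, max_per_species=5):
    """Apply the length, alignment, misclassification and per-species filters."""
    passed = [
        s for s in seqs
        if s.length <= max_length
        and s.aligned_length >= min_aligned
        and not s.flagged_as_misclassified
    ]
    by_species = defaultdict(list)
    for s in passed:
        by_species[s.species].append(s)

    selected = []
    for species in sorted(by_species):
        # Representatives sort first (False < True), then longer sequences.
        ranked = sorted(
            by_species[species],
            key=lambda s: (not s.is_species_representative, -s.length),
        )
        selected.extend(ranked[:max_per_species])
    return selected
```

Sorting on a tuple keeps the two priorities explicit: a shorter sequence from a representative genome is still preferred over a longer one from a non-representative genome.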

ASV portal vs. Bioatlas taxonomy

To become fully accessible in the SBDI Bioatlas, all occurrence data are also indexed against the GBIF taxonomy backbone. While this taxonomy covers a huge number of names, GBIF is currently in the process of improving its coverage of underrepresented groups such as prokaryotes. This means that the taxonomy of some ASVs may differ slightly between the ASV portal and the Bioatlas. Also note that, for practical reasons, we follow GBIF in listing prokaryote domains Archaea and Bacteria together with eukaryote kingdoms in our search interface. See the help page for taxonomy under Publish and share data for more information.