Frequently asked questions (FAQs) - Swedish Biodiversity Data Infrastructure

Here you can find FAQs ordered by topic. If you do not find your question(s) answered below, you can always contact us through the SBDI support web form.

General questions

What sort of help can I get from SBDI? SBDI has a support network of experts across all 14 partner institutions who can assist you in using the SBDI portal and tools. Our experts can also help you with consultations about data mining, analysis, and publication. In addition, we arrange regular workshops and training events that may help you with your scientific analyses. If you do not find answers to your question(s) in the FAQs or in any of our tutorials, you can also contact us through the support channel.

Find, analyze, and cite data

How do I cite data downloaded from SBDI? The citation guides can be found at the SBDI help pages.

Can I find sequence-based biodiversity data in SBDI? Yes, through the SBDI ASV portal. There, you can search for processed barcode sequences (i.e. Amplicon Sequence Variants; ASVs) using BLAST (Basic Local Alignment Search Tool), as well as filter occurrence data based on target gene regions and PCR primers. The resulting sequences will be linked to the SBDI Bioatlas.

How do I get access to data from permanent sample plots from the Swedish National Forest Inventory? All data from the permanent sample plots that have been collected since 1983 can be ordered through SBDI or directly from the Swedish National Forest Inventory. However, some restrictions apply to exact coordinates. The coordinates for forest vegetation that are displayed in open services are approximate: the center position of each plot has been moved randomly by 200 to 1000 meters in any direction. Detailed coordinates for the permanent plots are only disclosed in exceptional cases, as disclosure may jeopardize data integrity. Specific research projects can apply for access to detailed coordinates for the permanent vegetation plots. If access is granted, the use of the data is regulated by a confidentiality agreement.
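As a rough illustration of what such a displacement means for users of the open data, the sketch below moves a plot center by a random distance between 200 and 1000 meters in a random direction. This is not the procedure used by the Swedish National Forest Inventory; the function name, the uniform choice of distance and direction, the planar coordinates in meters, and the example coordinates are all assumptions made for illustration.

```python
import math
import random


def displace_plot_center(easting, northing, rng, min_m=200.0, max_m=1000.0):
    """Return a plot center moved by a random distance in a random direction.

    Coordinates are assumed to be planar and expressed in meters.
    """
    distance = rng.uniform(min_m, max_m)       # assumption: uniform in distance
    bearing = rng.uniform(0.0, 2.0 * math.pi)  # assumption: all directions equally likely
    return (easting + distance * math.sin(bearing),
            northing + distance * math.cos(bearing))


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
true_center = (674000.0, 6580000.0)  # made-up coordinates
shown = displace_plot_center(*true_center, rng)
offset = math.dist(true_center, shown)
print(f"displayed position is {offset:.0f} m from the true center")
```

In practice this means that any analysis relying on the open coordinates should tolerate a positional error of up to about one kilometer.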

How do I get access to data from temporary sample plots from the Swedish National Forest Inventory? A wide range of variables and detailed coordinates for temporary sample plots from the Swedish National Forest Inventory are available for download. This data is useful as reference data in various remote sensing applications.

Format and publish data

My funding agency asked me to provide a Data Management Plan. How do I do that? A data management plan (DMP) is a formal document that describes how data is to be managed both during a research project and after the project has been completed. It includes important information on what data your research will produce, where they will be archived, and under which license they will be published. Every research project should set up a DMP at the start. If you need help, we can offer a consultation; just contact us through the registration form.

Can I submit raw sequence output to you? No, raw reads should be submitted to a public repository for primary sequence data, but we do provide a step-by-step guide to ENA submission to help you with this. We are also collaborating on a scripted pipeline for denoising and taxonomic annotation of amplicon sequencing data, from both Illumina and PacBio platforms. Using this pipeline, you will (in the near future) be able to process data into a format that you can submit to SBDI.

Do I need to comply with the Nagoya Protocol when I submit molecular data to SBDI? Yes, if your data is based on genetic material obtained from a country other than Sweden, you need to check whether that country is party to the Nagoya Protocol and what type of access legislation applies, if any. Please see the guidance on this from the Swedish Environmental Protection Agency. If your sequences derive from material of Swedish origin, you do not have to worry about the Nagoya Protocol, as Sweden has no specific access regulations for genetic resources. You do, however, need to comply with any legislation restricting or prohibiting sampling of specific organisms.

Can I publish data from observations made in other countries? Yes. Swedish natural history collections and herbaria include specimens from many regions and countries of the world, and research datasets from Swedish researchers have a similar worldwide scope. Most of the 97 million occurrences published from Sweden refer to observations made in Sweden, but about half a million are species occurrences reported for localities outside of Sweden.

I have a relatively small data set (e.g. <200 species observations) – how do I publish these records? The number of observations is not considered a criterion for GBIF publication. We encourage data providers to consider the long-term data life of these publications and make decisions about the scope of the dataset accordingly. Datasets with few observations are characteristic of short-term sampling projects where the efforts to compile biological observations are discontinued and the dataset is static following the initial publication.

Many datasets can be subdivided to reflect their origin as the summary outcome of limited sampling efforts over a time period, geographic region, or taxonomic group. However, it can be more effective to publish these smaller datasets as a single dataset to facilitate metadata management. This is especially relevant when these data subsets are expected to be updated with additional observations. This is best illustrated by datasets from public collections where digitization is ongoing and active.

If my data does not adhere to Darwin Core (DwC) standards, can it be published? The Darwin Core standard is a glossary of terms originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatio-temporal occurrence, and their supporting evidence housed in collections (physical or digital). It provides a stable, standard reference for sharing information on biological diversity, with semantic definitions intended to be maximally reusable in a variety of contexts. This means that datasets not adhering to DwC standards have greatly diminished levels of technical and conceptual interoperability with published biodiversity informatics data. Please see the “Darwin Core Archive How-To Guide” for further information.

My data is not structured according to Darwin Core (DwC). How do I get help restructuring it? The majority of biological datasets are structured to normalize the same data elements, so it is more common to find partial rather than no overlap with DwC standards, even when data were not compiled with this standard in mind. It follows that data not adhering to the standard would be poorly understood, or even misleading, in the context of other datasets and should be excluded from publication in the global context until it does adhere.
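A quick way to see how much of a dataset already overlaps with Darwin Core is to compare its column names against DwC term names. The sketch below does this against a small, hand-picked subset of real DwC terms; the local column names are hypothetical, and a real check would use the full term list from the standard.

```python
# Small, illustrative subset of Darwin Core terms; the full standard has many more.
DWC_TERMS = {
    "occurrenceID", "basisOfRecord", "scientificName", "eventDate",
    "decimalLatitude", "decimalLongitude", "recordedBy", "countryCode",
}


def split_columns(columns):
    """Separate column names into those that match DwC terms and those that do not."""
    matched = [c for c in columns if c in DWC_TERMS]
    unmatched = [c for c in columns if c not in DWC_TERMS]
    return matched, unmatched


# Hypothetical header row from a local spreadsheet.
local_columns = ["scientificName", "eventDate", "lat", "lon", "collector"]
matched, unmatched = split_columns(local_columns)
print("already DwC:", matched)
print("need mapping:", unmatched)
```

The unmatched columns are the ones that would need to be renamed or restructured before publication.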

If I have a FileMaker database, can the data be published? Yes. There are many export options for data in FileMaker and in other software widely used to organize and manage biological observations. The guiding principle is to export local database observations as comma- or tab-delimited text files whose content is Darwin Core compliant and can therefore be published in GBIF.
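The export principle described above can be sketched as follows: records from a local database are written as tab-delimited text with Darwin Core terms as column headers. The local field names, the mapping, and the example record are hypothetical; only the DwC term names on the right-hand side of the mapping come from the standard.

```python
import csv
import io

# Hypothetical mapping from local field names to Darwin Core terms.
FIELD_MAP = {
    "record_no": "occurrenceID",
    "species": "scientificName",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def to_dwc_tsv(records, out):
    """Write records as tab-delimited text with Darwin Core column headers."""
    writer = csv.DictWriter(out, fieldnames=list(FIELD_MAP.values()),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for rec in records:
        # Missing local fields are written as empty strings.
        writer.writerow({dwc: rec.get(local, "") for local, dwc in FIELD_MAP.items()})


records = [  # made-up example row
    {"record_no": "obs-1", "species": "Parus major", "date": "2021-05-14",
     "lat": "59.3690", "lon": "18.0530"},
]
buffer = io.StringIO()
to_dwc_tsv(records, buffer)
print(buffer.getvalue())
```

Writing to a file instead of an in-memory buffer only requires passing an open text file as `out`.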

How do I update the datasets? Updates are handled by routines established by the data resource managers and the publisher; in many cases, this is managed in the Integrated Publishing Toolkit (IPT) by an automatic routine.

How frequently are the datasets updated? The frequency of updates is determined in the agreement between the data publisher and the data provider, and can be revisited at any time. When errors or corrections are discovered, it is common practice to perform an update in coordination with the data publisher. Generally, active datasets with well-established data pipelines are updated regularly, for example weekly, to accommodate ongoing digitization in public collections.

How do I report or correct an error in an SBDI dataset? In practice, any of the contacts associated with a data resource are available to answer questions concerning published data. In most cases, reported errors are content issues that will ultimately be directed to the data provider, who makes corrections to the local database. With regularly updated datasets, errors reported directly to the data providers are corrected in the next publication or following a requested interim publication.

How do I use Darwin Core extensions to enrich my dataset? The Darwin Core standard (DwC) is able to handle a wide range of biodiversity data types, and its scope continues to be improved and expanded. There are several areas of active work on these community standards where the DwC standard is expanded to associate, for example, multimedia and research-generated data with biological observations. These extensions have varying degrees of refinement and adoption. Exploration of these extensions is highly recommended, particularly when preparing datasets with rich data associations established during research. Practical use of extensions involves establishing a core data file, consisting of the standard set of DwC terms, and creating extension files, in which each extension file row points to a row in the core data file. This relationship between the core and extensions results in a star schema and allows several extension records to refer to a single core data file row.
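The core and extension relationship can be illustrated with a minimal sketch in which each extension row carries the identifier of the core row it belongs to. The data, the `coreid` column name, and the media columns are illustrative assumptions rather than a definition of any particular extension.

```python
from collections import defaultdict

# Core file rows: one row per occurrence, keyed by occurrenceID (made-up data).
core_rows = [
    {"occurrenceID": "occ-1", "scientificName": "Parus major"},
    {"occurrenceID": "occ-2", "scientificName": "Picea abies"},
]

# Extension rows: each row points back to a core row through "coreid".
# Column names here are illustrative.
media_rows = [
    {"coreid": "occ-1", "identifier": "https://example.org/img/1.jpg"},
    {"coreid": "occ-1", "identifier": "https://example.org/img/2.jpg"},
]


def group_by_core(extension_rows):
    """Index extension rows by the core row they refer to."""
    grouped = defaultdict(list)
    for row in extension_rows:
        grouped[row["coreid"]].append(row)
    return grouped


media_by_core = group_by_core(media_rows)
for core in core_rows:
    linked = media_by_core.get(core["occurrenceID"], [])
    print(core["occurrenceID"], core["scientificName"], len(linked), "media record(s)")
```

Here two extension rows refer to the same core row, while the second core row has no extension records at all; both situations are valid in a star schema.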

Which data licenses (data agreements) are typically used in SBDI? When publishing data, one of three Creative Commons licenses should be applied to the dataset: CC0, CC-BY, or CC-BY-NC. A CC0 license, used often in the scientific community, applies to the dataset a status of unrestricted use (free in the public domain). A CC-BY license allows for reuse of the data with attribution to the originator, as a reference to the original data source. This attribution is facilitated for GBIF datasets through the assignment of DOIs to data downloaded from gbif.org and citation formats for GBIF data resources. A CC-BY-NC license indicates that the data can be reused, with attribution, only for noncommercial activities. You can find more information about these licenses here.

How do I know who is using data that I have published? There is a range of mechanisms that document the use of GBIF published data and allow resource managers and the general public to track data use relevant to each data publication. This tracking is facilitated by technical attributes of data publications, namely DOI prefixes and URLs of IPT installations. First, information is collected in association with downloads of data, primarily to ensure that the requester receives information on the dataset's origins before its use. Second, GBIF retrieves alerts and XML-based feeds to identify research uses and citations of biodiversity information accessed through GBIF’s global infrastructure [https://www.gbif.org/literature-tracking]. For example, the data resource published as “Entomological Collections (NHRS), Swedish Museum of Natural History (NRM)” has its use presented as a citations page. When more comprehensive statistics on data use are required, such as the raw data used to produce data use summaries, the GBIF helpdesk can be contacted directly.

Do I need to belong to an institution (public museum or university) in order to publish data? No, although establishing a reliable publication route is a critical element of ensuring the long-term availability and accessibility of the dataset. It is therefore necessary to identify an endorsed publisher for the dataset, either one that is already publishing data with GBIF or a newly registered one. This publisher must be endorsed by the national GBIF node prior to publication of the data resource, and there are many precedents for endorsements of non-profit organizations and, more recently, citizen science platforms. Publishers in Sweden include public museums and universities but also a research station, a federal agency, an eDNA company, and a non-governmental organization.

Search and publish genetic data

Does the Swedish ASV portal have Application Programming Interfaces (APIs) for uploading, searching and downloading data, in addition to manual forms and interfaces? No, we don’t currently provide API access to the ASV database. We are working on making sequence-based occurrence data available for search/download via R (SBDI4R package), but this will use the Bioatlas API, which will not be useful for data upload to the ASV database (which is where data are stored before going into the Bioatlas).

Are there plans to add more search fields in the direct search page of the ASV portal, e.g. names of contributors, projects, dates or geography? Our main aim was to add some basic sequence-related functionality (mainly BLAST search) that was missing from the main Bioatlas platform, which otherwise has advanced search functionality. But a few additions to the basic search functions may be possible in the ASV portal if there are good reasons.

The search field in the ASV portal on “Taxon” does not make reference to what taxonomy is used during the search. But there are differences in taxonomies between e.g. WoRMS and NCBI. Which taxonomy is this search interface running on? The search interface uses the standard taxonomic annotation that we apply during data import into the ASV database. It is explained in detail on the “Molecular tools” documentation page. Data providers can use any taxonomy, but this will be replaced with our standard, and only be displayed under ‘previousIdentifications’ in the Bioatlas. A caveat, at the moment, is that the GTDB taxonomy we use for prokaryotes in the ASV database has only recently been added to the GBIF/Bioatlas backbone, and hence there are still some issues leading to discrepancies between these platforms.

Is there a GitHub repository for the ASV portal plugin? The GitHub repository for the module is here: https://github.com/biodiversitydata-se/mol-mod. The reverse proxy setup for the same module is here: https://github.com/biodiversitydata-se/proxy-ws-mol-mod-docker.

Systematic monitoring data

In the national forest monitoring program, how do you extrapolate from monitoring sites to obtain a complete spatial coverage? Do you use models or remote sensing or other methods? Data from national laser scanning and/or satellite data are combined with information from field plots, and models are used to produce raster databases.

In the national forest monitoring program, are only woody plants counted in the forest inventory plots or all vascular plants? Current inventories include presence/absence data for 270 ground vegetation species/groups with coverage estimation for around 70 (out of 270) species. More information and research examples are given here: Skogsdata2011_webb.pdf (slu.se) and Environmental analysis data as a basis for research on large-scale changes in woodland vegetation.

In the national forest monitoring program, how is “good data quality” defined and controlled (e.g. manually or automatically)? As the primary use of the data is for official reporting and statistics, a large number of controls are applied throughout the integrated data handling process. For the forest and soil programs, a control field inventory is also performed on a subsample of the field plots to provide a basis for estimation of systematic errors. You can find more information (in Swedish) here: Om inventeringen.
