
Improving on data quality

Data quality is crucial for usability. SBDI provides access to, and support for, a number of tools that can help you check and improve the quality of your data and metadata. For instance, there are services that allow you to validate your data against the Darwin Core standard and to match the taxonomic names you use against those in the appropriate reference taxonomy.

Data quality principles are involved at different stages of data management – from data capture to data presentation and dissemination. A loss of data quality at any of these stages reduces the applicability, or “fitness for use”, of the data. SBDI works with data publishers to improve data quality by providing the highest possible standard for data delivery and curation of source data. There are tools available within the SBDI network for independent and supervised data validation and quality vetting as well as technical support to guide preparation of datasets for SBDI publication.

Data validation is the process of determining whether data are complete, logical and accurate. The process usually happens at different levels:

  1. The ‘Technical Check’ may include checks on the format, completeness, or technical reasonableness of the values. These checks may also involve evaluating compliance with relevant standards, rules, and conventions.
  2. The ‘Probability Assessment’ may include automated checks of, for example, whether values are reasonable with respect to local ‘rules’ based on previous observations in time and space, as well as checks for other objectively recognizable errors (e.g., incongruence of observation values with taxonomic and geographic hierarchies). Automated image recognition can also be included at this level. These assessments lead to the flagging, documentation and evaluation of questionable records for subsequent corrective action.
  3. ‘Expert Verification’ is the ‘gold standard’ of data validation. Here, human experts on specific taxa manually check observations, in particular the accuracy of the taxonomic identification, location and time.
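As an illustration of the first level, a 'Technical Check' on a single record could look roughly like the sketch below. The list of required terms, the issue labels and the function name `technical_check` are assumptions made for this example; they are not the rules applied by SBDI, GBIF or the Bioatlas. The date check only accepts plain `YYYY-MM-DD` values, although Darwin Core allows richer `eventDate` formats.

```python
from datetime import date

# Assumed set of terms to require; real requirements depend on the dataset type.
REQUIRED_TERMS = ("occurrenceID", "scientificName", "eventDate",
                  "decimalLatitude", "decimalLongitude")


def _in_range(value, low, high):
    """True if value parses as a number within [low, high]."""
    try:
        return low <= float(value) <= high
    except (TypeError, ValueError):
        return False


def technical_check(record):
    """Return a list of issue labels for one record (a dict of term -> string)."""
    issues = []

    # Completeness: every required term must have a non-blank value.
    for term in REQUIRED_TERMS:
        if not str(record.get(term) or "").strip():
            issues.append(f"missing:{term}")

    # Technical reasonableness: coordinates must be numbers in a valid range.
    if record.get("decimalLatitude") and not _in_range(record["decimalLatitude"], -90, 90):
        issues.append("invalid:decimalLatitude")
    if record.get("decimalLongitude") and not _in_range(record["decimalLongitude"], -180, 180):
        issues.append("invalid:decimalLongitude")

    # Format: this sketch only accepts plain ISO dates (YYYY-MM-DD).
    if record.get("eventDate"):
        try:
            date.fromisoformat(record["eventDate"])
        except ValueError:
            issues.append("invalid:eventDate")

    return issues
```

An empty list means the record passed these particular checks; it says nothing about whether the observation itself is plausible or correctly identified, which is what the second and third levels address.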

The IPT is a publishing tool rather than a processing tool, and it is not designed to validate the content of verbatim data. It does, however, allow default values to be assigned to required Darwin Core data elements; those values are used when the corresponding elements are empty. For content validation of DwC-A files, we recommend the GBIF data validator. The service reports on the syntactical correctness and validity of the dataset content. To identify potential issues, a submitted dataset is subjected to the same validation and interpretation procedures that are applied to all datasets published through SBDI to GBIF and the Bioatlas. The GBIF validator accepts zip-compressed DwC-A files (checklist, occurrence and sampling-event data) as well as simple CSV files with Darwin Core terms in the first row.
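The default-value behaviour described for the IPT can be pictured with a small sketch. This is not the IPT's own code; the function name `apply_defaults` and the example values are hypothetical and only show the idea that a default is used when an element is empty, and never overrides a value that is present.

```python
def apply_defaults(record, defaults):
    """Return a copy of record where empty or missing terms take the default value."""
    filled = dict(record)  # leave the verbatim record untouched
    for term, default in defaults.items():
        value = filled.get(term)
        if value is None or not str(value).strip():
            filled[term] = default
    return filled


# Hypothetical usage: basisOfRecord is empty, so it receives the default;
# country already has a value, so it is kept as delivered.
record = {"occurrenceID": "occ-1", "basisOfRecord": "", "country": "Sweden"}
defaults = {"basisOfRecord": "HumanObservation", "country": "Norway"}
filled = apply_defaults(record, defaults)
```

Note that filling in defaults does not validate anything: a default can make a record complete without making it correct.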

The LA system has a validation service similar to the GBIF data validator. In the ‘Sandbox’ module, which is not yet implemented in Sweden (but may be in the near future), this service can be applied to datasets before they are published. The same service is also applied to data shared in the LA system, where potential issues are flagged without changing the original data. In the Bioatlas, you can see the flagged elements and search for records with particular issues. Data providers can thus use the Bioatlas for data validation and quality improvement after publication, and other users can discover issues in the same way and report them to the data providers.
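The principle of flagging issues while leaving the original data unchanged, and then searching by issue, can be sketched as follows. The issue labels, the `FlaggedRecord` class and the helper functions are hypothetical and written for this example; they do not reproduce the Bioatlas issue names or its search interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedRecord:
    verbatim: dict      # the record exactly as the provider delivered it
    issues: frozenset   # issue labels attached by the checks


def is_blank(value):
    return value is None or not str(value).strip()


# Hypothetical checks: label -> function returning True when the record has the issue.
CHECKS = {
    "MISSING_COORDINATES": lambda r: is_blank(r.get("decimalLatitude"))
    or is_blank(r.get("decimalLongitude")),
    "MISSING_EVENT_DATE": lambda r: is_blank(r.get("eventDate")),
}


def flag_record(record, checks=CHECKS):
    """Attach issue labels to a record without modifying the record itself."""
    issues = frozenset(label for label, has_issue in checks.items() if has_issue(record))
    return FlaggedRecord(verbatim=record, issues=issues)


def records_with_issue(flagged_records, label):
    """Return the flagged records that carry a particular issue label."""
    return [f for f in flagged_records if label in f.issues]
```

Keeping the flags separate from the verbatim values is what lets a provider review each flagged record and decide whether to correct the source data.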

Data are also checked in SOS, which reports on the syntactical correctness and validity of the dataset content, including, for example, the success of the data import, data readability, missing information, and what the data look like as DwC-A. As a complement, the resulting DwC-A file can be submitted to the GBIF data validator before further processing. A probability assessment module connected to SOS is being developed during spring 2021 to classify observations as plausible or implausible based on, for example, geographical location or phenology. A verification module that will allow expert verification is to be developed during 2022.
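To make the idea of a probability assessment concrete, the sketch below classifies a new observation as plausible or implausible by comparing it with the geographic extent and the seasonal window of earlier observations of the same taxon. It is a deliberately simple illustration and not the method used by the SOS module; the function names, the margins and the bounding-box approach are all assumptions, and the day-of-year window does not handle seasons that wrap around the new year.

```python
from datetime import date


def build_profile(previous):
    """Summarise earlier observations of one taxon as (min, max) ranges.

    previous: list of (latitude, longitude, date) tuples.
    """
    lats = [lat for lat, _, _ in previous]
    lons = [lon for _, lon, _ in previous]
    days = [d.timetuple().tm_yday for _, _, d in previous]  # day of year
    return {
        "lat": (min(lats), max(lats)),
        "lon": (min(lons), max(lons)),
        "day": (min(days), max(days)),
    }


def classify(lat, lon, when, profile, margin_deg=1.0, margin_days=14):
    """Return 'plausible' if the observation falls within the profile plus margins."""

    def within(value, bounds, margin):
        low, high = bounds
        return low - margin <= value <= high + margin

    ok = (
        within(lat, profile["lat"], margin_deg)
        and within(lon, profile["lon"], margin_deg)
        and within(when.timetuple().tm_yday, profile["day"], margin_days)
    )
    return "plausible" if ok else "implausible"


# Hypothetical earlier observations (latitude, longitude, date).
previous = [
    (57.7, 11.9, date(2019, 5, 10)),
    (59.3, 18.1, date(2020, 7, 20)),
    (63.8, 20.3, date(2020, 6, 1)),
]
profile = build_profile(previous)
```

An 'implausible' result in a scheme like this is a flag for follow-up, for example by expert verification, rather than proof that the record is wrong.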

Post-publication processing

As described in these manuals, there are several ways to publish SBDI datasets, and they differ slightly in how the data are processed after publication. Both the GBIF and Bioatlas platforms process the data in the datasets provided to them before the data become available through their data APIs and other tools and services. This processing may include checks on the format, completeness, or technical reasonableness of the values (see above). SOS also includes a data processing step, as described in the section on SOS, which means that SBDI datasets passing through SOS are processed twice. Work is underway to harmonize these data processing pipelines so that the tools and methods are either the same or complementary, but it will be some time before this work is complete. Read more about this in the section on taxonomy and taxonomic indexing.

Contact the SBDI Support Center for best practices, if you are missing a suitable DwC term, or if you need help improving the quality of your datasets or validating them against Darwin Core elements.