Registration and Welcome

Overview

Teaching: min
Exercises: min
Questions
Objectives

Registration

You can find the registration desk on the 3rd floor of the ZG Tower Annex, in the big conference hall (Velika dvorana)

Zagreb Tower

Welcome

Welcome by Gabrijela Šestani
Načelnica Sektora za prirodu / Head of the Nature sector
Zavod za zaštitu okoliša i prirode / Institute for Environment and Nature

Pre-workshop survey

Preparatory Survey

Key Points


Introduction to GBIF

Overview

Teaching: 45 min
Exercises: 30 min
Questions
  • What is GBIF?

  • How is GBIF organized?

  • The GBIF portal

Objectives
  • Understand how GBIF works.

  • Understand how GBIF is organized.

  • Basic understanding of the GBIF portal

GBIF—the Global Biodiversity Information Facility—is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.

About

The Global Biodiversity Information Facility (GBIF) is an international network of country and organizational Participants that exists to enable free and open access to biodiversity data from all sources and to support biodiversity science, environmental research, and evidence based decision-making. GBIF operates as a federated system of distributed data publishing efforts, coordinated through a global informatics infrastructure and collaborative network. In this module, you will learn more about GBIF.

GBIF introduction movie

In this video (03:23) you will learn about GBIF through animation. It is a co-production of SiB Colombia and GBIF Spain, both GBIF national nodes, with co-funding from GBIF Spain and Instituto de Investigación de Recursos Biológicos Alexander von Humboldt (IAvH). The video is in Spanish, but subtitles are available.

Alternate Text

What is GBIF video

In this video (07:55), Tim Hirsch, Deputy Director of the GBIF Secretariat, provides you with an overview of GBIF.

Alternate Text

GBIF Overview

GBIF updates a set of overview slides with key information on a quarterly basis. The deck provides an overview of the network status, data use, access and publication, and the latest news and announcements.

Presentation

Exercises 1: Investigate how your country/organization is represented in GBIF

The GBIF website has pages dedicated to countries/economies, including for countries/economies that do not yet participate in GBIF. You can use the search box on the homepage to look up a country/organization.

  • The tabs on these pages provide a general description of:
    • data available about the biodiversity of the country
    • data published by your country/organization,
    • the participation of the country/organization
    • other relevant information on the use of data by researchers in the country/organization.
  1. Look up your country on the GBIF website.
  2. Is a list of introduced and invasive alien species available for your country?
  3. How well does the information available on GBIF reflect what you know about the biodiversity of your country?
  4. Do you see publications from/about your country?
  5. Download the Croatia Activity report

Solution

  1. Find Croatia on this map
  2. [Global Register of Introduced and Invasive Species - Croatia](https://www.gbif.org/dataset/f82a4f7f-6f84-4b58-82e6-6b41ec9a1f49)
  3. Note your own assessment of this
  4. publications
  5. activity report

Exercises 2: Downloading data from GBIF

The GBIF website has pages dedicated to data retrieval and downloading.

  1. Find and download this occurrence only dataset from GBIF: “Tipuloidea of Martinique - 2016-2018”
  2. Find and download this sampling event dataset from GBIF: “Meetnetten.be - Chorus counts for Amphibia in Flanders, Belgium”
  3. Find and investigate this checklist on GBIF: https://www.gbif.org/dataset/fd004d9a-2ea4-4244-bb60-0df508d20a15

How many times is this dataset cited?

How many occurrences are in this dataset?

Are all the occurrences georeferenced?

What is the licence on the dataset?

Which sampling methods were used to collect the specimens?

Is there any media available about this dataset?

How should you cite this dataset?

What is the DOI of this dataset?

Solution

  1. url
    5
    282
    All georeferenced (check the issues and flags)
    282 occurrences where ‘Licence’ equals: CC0 1.0
    During the 2018 expedition, the main collecting methods were coloured pan traps and sweep nets. In 2016-2017, Malaise traps were also in operation.
    Mederos J, Pollet M, Oosterbroek P, Brosens D (2023). Tipuloidea of Martinique - 2016-2018. Version 1.10. Research Institute for Nature and Forest (INBO). Occurrence dataset https://doi.org/10.15468/s8h9pg accessed via GBIF.org on 2024-02-12.
    https://doi.org/10.15468/s8h9pg

  2. url
    33
    710
    All georeferenced (check the issues and flags)
    710 occurrences where ‘Licence’ equals: CC0 1.0
    url Piesschaert F, Speybroeck J, Brosens D, Westra T, Desmet P, Ledegen H, Van de Poel S, Pollet M (2023). Meetnetten.be - Chorus counts for Amphibia in Flanders, Belgium. Version 1.17. Research Institute for Nature and Forest (INBO). Sampling event dataset https://doi.org/10.15468/d4bu8j accessed via GBIF.org on 2024-02-12. https://doi.org/10.15468/d4bu8j

  3. url
    0
    0 (this is a checklist)
    Nothing georeferenced, but the target area of this checklist is Flanders, Belgium
    CC0 1.0

    No
    Branquart E, Adriaens T, Desmet P, Devisscher S, D’hondt B, Denys L, Dumortier A, Latli A, Oldoni D, Packet J, Scheers K, Vanderhoeven S, Vermeersch X, Willeput R (2023). RIPARIAS target species list. Version 1.47. Research Institute for Nature and Forest (INBO). Checklist dataset https://doi.org/10.15468/p4ugqr accessed via GBIF.org on 2024-02-12. https://doi.org/10.15468/p4ugqr
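If you prefer to answer some of these questions programmatically, the same information is exposed through the public GBIF API. Below is a minimal Python sketch (not part of the exercise) that looks up the checklist from step 3 by its dataset key; it assumes the requests package is installed.

import requests

dataset_key = "fd004d9a-2ea4-4244-bb60-0df508d20a15"  # checklist from step 3

# Registry API: title, DOI and licence of the dataset
meta = requests.get(f"https://api.gbif.org/v1/dataset/{dataset_key}").json()
print(meta["title"], meta["doi"], meta["license"])

# Occurrence API: number of occurrence records indexed from this dataset
count = requests.get(
    "https://api.gbif.org/v1/occurrence/count",
    params={"datasetKey": dataset_key},
).json()
print(count)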

Key Points

  • The GBIF portal is the place to find biodiversity data


Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Key Points


The GBIF nodes, governance and network

Overview

Teaching: 45 min
Exercises: 0 min
Questions
  • How is GBIF organized?

Objectives
  • Understanding the GBIF governance and bodies

1: GBIF participation

GBIF participation

Exercise 1: Investigate Croatia's participation in GBIF

  • Find out information on Croatia on GBIF.org
  • In which regional network is Croatia?
  • Explore Croatia Country page

Solution

  • Europe and Central Asia

2: The Nodes

A node is a team designated by a Participant to coordinate a network of people and institutions that produce, manage and use biodiversity data, collectively building an infrastructure for delivering biodiversity information.

Nodes meet every year, either globally or regionally.

ECA Nodes meeting

Exercise 2: Seek information on the Croatian Node

  • When did Croatia join the GBIF network?
  • Who is the contact person for the Croatian Node?
  • How many data publishers are there from Croatia?

Solution

  1. 2022
  2. No contact details
  3. Two data publishers

3: GBIF Governance

GBIF Governance

Exercise 3: Governing Board and Standing Committee

  • Which countries hosted the last two Governing Boards?
  • Who is the Chair of the Science Committee?

Solution

  1. Australia in 2023 & Belgium in 2022
  2. Birgit Gemeinholzer is Science Committee Chair

Key Points

  • The GBIF Participants

  • The GBIF Nodes network

  • The GBIF Governance


Lunch Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

lunch

Key Points


Croment BioBlitz

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • As a citizen scientist, how can you share your observations with GBIF?

Objectives
  • Initiate Croment Bioblitz.

Bioblitz

Action : Croment Bioblitz

The iNaturalist website is the place where you will start your Bioblitz

  1. Go to inaturalist.org and register
  2. Go to the Projects tab and search for Croment
  3. Click on Croment Bioblitz and join
  4. Check this tutorial video
  5. How many observations from Croatia are already in iNaturalist?
  6. And how many of them are on GBIF?

TASK

  1. Go out in nature
  2. Add observations to INaturalist
  3. Check the competition

SOLUTION

  1. Check iNaturalist
  2. Check GBIF

Key Points

  • Citizen Science & Biodiversity Data

  • What data is already available from iNaturalist


Belgian Node

Overview

Teaching: 30 min
Exercises: min
Questions
  • How is the Belgian node organized?

  • Belgium in the world of Biodiversity data

Objectives
  • History of the Belgian Node

  • How the Belgian node is organized.

Presentation Belgium in the world of Biodiversity data

Exercise:

  1. How many data publishers are active in Belgium?
  2. How many datasets contain records about Belgium?
  3. How many datasets are published by Belgium?
  4. Navigate to the INBO (Research Institute for Nature and Forest) IPT and check how many publishers are making use of this IPT instance
  5. Navigate to www.biodiversity.be and find GBIF info bbpf

SOLUTION

  1. 27
  2. 1274
  3. 536
  4. 8 (Check here)

Presentation : Engaging research communities for data mobilization and use: The GBIF node in Belgium

Key Points

  • The GBIF portal is the place to find biodiversity data


Living Atlasses & Croatian GBIF node

Overview

Teaching: 60 min
Exercises: 0 min
Questions
  • Introduction to Living Atlas Croatia

Objectives
  • Introduction to Living Atlas Croatia & Croatian GBIF node

  • Introduction to Living Atlas Flanders (Belgium)

Presentation Living Atlas Croatia

Presentation Living Atlas Flanders (Belgium)

Key Points

  • Main concepts of a living Atlas project


Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Break

Key Points


GBIF Policy & science

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • GBIF Science & Policy

Objectives
  • Understand how GBIF interacts with Science.

  • Understand how GBIF interacts with Policy.

1: GBIF & Science

GBIF Science

Exercise 1 : GBIF and Science

  1. How many citations for the Meditera3 dataset published by the University of Zagreb Faculty of Science?
  2. This dataset has been reused in one thesis. Which one?
  3. In the Science Review topics, which ones are of interest for you?

Solution

  1. 11 citations
  2. Decreases over time in Shannon diversity of land snails by Hmming J.
  3. The topics covered by the Science Review are: AGRICULTURE, BIODIVERSITY SCIENCE, BIOGEOGRAPHY, CITIZEN SCIENCE, CLIMATE CHANGE, CONSERVATION, DATA MANAGEMENT, DNA, ECOLOGY, ECOSYSTEM SERVICES, EVOLUTION, FRESHWATER, HUMAN HEALTH, MARINE, PHYLOGENETICS, SPECIES DISTRIBUTION, TAXONOMY & INVASIVES.

2: GBIF & CBD

GBIF & CBD

Exercise 2 : GBIF and CBD

  1. Does GBIF play an official role in the CBD? If so, which role?
  2. Are GBIF data relevant to GBF targets?
  3. Which Ad-Hoc Technical Expert Group includes GBIF staff?

Solution

  1. Yes, GBIF is an observer as an Inter-Governmental Organization
  2. Yes, definitely
  3. AHTEG on Indicators

3: Delivering relevant data

GBIF & IPBES

Exercise 3 : Delivering relevant data

  1. Is GBIF data reused by science?
  2. Is GBIF data reused by policy?

Solution

  1. Yes! see Science Review
  2. Yes! e.g. CBD, IPBES and impact assessments

4: Science Policy Interface

Supporting SPI

Exercise 4 : Science Policy Interface

  1. Does GBIF support national biodiversity commitments?
  2. Does GBIF support the Science-Policy Interface?
  3. Is this done by the nodes and/or the Secretariat?

Solution

  1. Yes
  2. Yes
  3. Both

Key Points

  • GBIF data reuse by scientists

  • GBIF Science review

  • GBIF interaction with CBD

  • Delivering relevant data

  • Science Policy Interface


GBIF Discussion

Overview

Teaching: 0 min
Exercises: 60 min
Questions
Objectives
  • Group discussion

Discussion

What challenges are next in relation to:

  1. Data Quality: Discuss the challenges and strategies for ensuring the quality of biodiversity data in GBIF. How can we address issues such as incomplete or inaccurate data?

Quality is important and data should be verified (AI could help in some parts, but in the end you need well-trained AI; specialists are still very much needed, especially for difficult species). Versioning of the data verification is important. Spatial visualisation is important, and the contract should state that quality data are to be published.

  2. Data Access and Use: Explore the various ways researchers, policymakers, and the public can access and utilize the data available through GBIF. How can we maximize the impact of this data for conservation and research?

Maximizing the use of the data: there is a need to improve data coverage (geographically and also by environmental conditions) through gap analysis and environmental variables. Support and collaborate with, for example, Biodiversa+ and maximize the use of the data.

  3. Data Sharing and Collaboration: Discuss the importance of collaboration among institutions and countries in sharing biodiversity data through GBIF. How can we encourage more participation and data sharing?

There are many advantages in sharing data with GBIF: your work becomes visible for science and policy, new collaborations arise, and it avoids (in some cases) repetition of collection. It allows better planning of data collection and avoids mistakes made in the past. The data you share are already prepared and standardized. On collaboration: a node is a good thing. On more participation: give datasets more visibility and make them more usable (small grants for institutions for data mobilization and digitizing).

  4. Technological Advances: Consider how technological advancements such as machine learning, remote sensing, and DNA sequencing are shaping the future of biodiversity data collection and analysis within GBIF.

We need more integration between BOLD and GBIF. It is a good thing that they are working on that!

  5. Data Privacy and Security: Address concerns around data privacy and security within GBIF, particularly regarding sensitive species or locations. How can we balance open access with the need to protect sensitive information?

Already addressed today and before. Privacy concerns (is GDPR an issue, or maybe not?). Name and surname should be provided. Locations for sensitive species can be blurred, but species blurring is not a good thing!

  6. Capacity Building: Explore opportunities for capacity building initiatives to empower researchers, particularly in developing countries, to contribute to and utilize biodiversity data through GBIF.

    Recognition for GBIF should be given in the citation; use of bibliographicCitation in DwC should be promoted. In a call, an obligation to share data could be mentioned. If you publish in scientific papers before sharing with GBIF, you should make sure the connection is made. More workshops.

  7. Monitoring and Assessment: Discuss how biodiversity data from GBIF can be used for monitoring changes in biodiversity over time and assessing the effectiveness of conservation efforts.

  8. Future Directions: Consider the future directions of GBIF and the role it can play in addressing global challenges such as climate change, habitat loss, and species extinction.

Discussion challenge

Choose a topic to discuss in your group. 30 minutes for group discussion. 30 minutes for reporting back to the room. 5-6 persons per group.

Solution

Report back to the room on your group discussion

Key Points

  • How GBIF works


FAIR and Open Science

Overview

Teaching: 45 min
Exercises: 0 min
Questions
  • What is open science?

  • A new research life cycle?

  • FAIR data?

Objectives
  • Understand the principles of Open Science

  • Understand how FAIR data works.

  • Understand how FAIR data is organized.

Presentation

Exercise : FAIR data & Open Science

  1. What is the difference between FAIR and OPEN data?
  2. Check the FAIR Self assessment tool here Think about a dataset you know and run over the assessment
  3. What could you do to make your data more FAIR?
  4. Is data published through GBIF FAIR?
  5. Is all data published by GBIF considered as open data?

SOLUTION

  1. FAIR data is not always open; FAIR data is findable and well documented. Open data, by definition, is not always FAIR (just an Excel file somewhere on a website is considered open data).
  2. fair
  3. Publish your data in GBIF or in another open repository like Zenodo
  4. YES
  5. No, CC-BY-NC is not considered as open data

Presentation

Exercise : Creative Commons license chooser

  1. Check the Creative Commons license chooser
  2. Learn how to find an appropriate license for your biodiversity data
  3. Is this license allowed for GBIF?
  4. Is CC-BY-NC an open data license?

SOLUTION

  1. cc
  2. Check the license chooser
  3. The only licenses allowed for GBIF are CC0, CC-BY, and CC-BY-NC
  4. CC-BY-NC is not considered as an open data license

Key Points

  • Open Science & FAIR data


Data management & best practices

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How to deal with data?

  • Tips & tricks for good data management

  • What is Biodiversity informatics?

  • What is a Biodiversity dataset?

Objectives
  • Introduction to data management plan

  • Introduction to Biodiversity informatics

  • Feel familiar with open data types

  • Distinction between data quality and fitness for use

  • Make sure your data are tidy data

  • Learn some best practices

Presentation

Exercise

Challenge: Make this data tidy.

  1. Download this SAMPLE_DATE
  2. Open it in a spreadsheet program (Excel, LibreOffice, OpenOffice, ...)
  3. Make this data tidy (each variable forms a column and contains values, each observation forms a row, each type of observational unit forms a table). Open this link for the complete exercise and tips. A small pandas sketch of this reshaping follows the solution below.

Solution

  1. screenshot
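For those who want to script the reshaping rather than do it by hand, here is a hedged pandas sketch of the same idea, using a small made-up wide table (one column per species) rather than the actual sample file.

import pandas as pd

# Hypothetical wide table: one column per species, one row per site visit
wide = pd.DataFrame({
    "site": ["A", "B"],
    "date": ["2021-05-01", "2021-05-02"],
    "Parus major": [3, 0],
    "Turdus merula": [1, 2],
})

# Tidy: each observation (site + date + species + count) becomes one row
tidy = wide.melt(
    id_vars=["site", "date"],
    var_name="scientificName",
    value_name="individualCount",
)
print(tidy)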

Key Points

  • Best data management practices

  • Organize your Data and Metadata


Coffee Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Break

Key Points


OpenRefine

Overview

Teaching: 0 min
Exercises: 90 min
Questions
  • What is OpenRefine? An introduction

  • Data cleaning with OpenRefine

  • Name matching with OpenRefine

Objectives
  • Understand the purpose of OpenRefine

OpenRefine

Presentation

You can find the complete user manual here

Exercise : OpenRefine

  1. Complete this exercise

SOLUTION

  1. follow the guidelines in the tutorial document

Key Points

  • OpenRefine saves time


Lunch Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

lunch

Key Points


OpenRefine part 2

Overview

Teaching: 0 min
Exercises: 90 min
Questions
  • What is OpenRefine? An introduction

  • Data cleaning with OpenRefine

  • Name matching with OpenRefine

Objectives
  • Understand the purpose of OpenRefine

OpenRefine

Presentation

Exercise : OpenRefine

  1. Finish this exercise

SOLUTION

  1. follow the guidelines in the document

Exercise : OpenRefine extra exercise

The Global Names Verifier gives you the opportunity to check your names against numerous sources; 209 checklists are used for this service.

  1. On your column ‘scientificName’, go to Reconcile –> Start reconciling
  2. Click on ‘add reconciliation service’ and fill in “https://verifier.globalnames.org/api/v1/reconcile”
  3. Click on Globalnames
  4. Click on start reconciling
  5. Click reconcile –> facets –> ‘choose your way of judgement’

SOLUTION

  1. After reconciliation your names are matched
  2. More information on this service here

Key Points

  • OpenRefine saves time


SQLite

Overview

Teaching: 0 min
Exercises: 90 min
Questions
  • Data cleaning with SQLite

Objectives
  • Understand how SQLite can help cleaning data

Presentation: SQLite

SQLite


Exercise 1 : Download from GBIF.org

Instructions

Exercise 2 : Import data

Instructions

Exercise 3 : Explore data

Instructions

Exercise 4 : Discard data

Instructions

Exercise 5 : Filter data

Instructions

Exercise 6 : Annotate data

Instructions


Solutions

If needed, see the solutions page.

Key Points

  • SQL can be very useful to clean your data

  • Views are great to filter the records and fields you want to keep without changing your original data

  • Store your SQL statements under Git

  • SQL statements are easy to understand, sustainable and reusable


Coffee Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Break

Key Points


SQLite

Overview

Teaching: 0 min
Exercises: 90 min
Questions
  • Data cleaning with SQLite

Objectives
  • Understand how SQLite can help cleaning data

Presentation: SQLite

SQLite

Exercise 1 : Download from GBIF.org

Instructions

Solution 1

Exercise 2 : Import data

Instructions

Solution 2

select count(*) from occ;
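The import step itself can be done in several ways (for example with the sqlite3 command-line shell). As a hedged alternative, the sketch below uses Python with pandas to load a tab-separated GBIF occurrence file into a table named occ; the filename occurrence.txt is a placeholder for whatever your download contains.

import sqlite3
import pandas as pd

con = sqlite3.connect("gbif.sqlite")

# GBIF simple downloads are tab-separated; adjust the filename/separator as needed
df = pd.read_csv("occurrence.txt", sep="\t", low_memory=False)
df.to_sql("occ", con, if_exists="replace", index=False)

# Same sanity check as the SQL solution above
print(con.execute("select count(*) from occ;").fetchone())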

Exercise 3 : Explore data

Instructions

Solution 3

select * from occ where scientificName is null;
select * from occ where eventdate is null;
select * from occ where year is null or month is null or day is null;
select * from occ where decimalLatitude is null or decimalLongitude is null;

select count(*) from occ where individualCount is null;
select taxonRank, count(*) from occ group by taxonRank;
select phylum, count(*) from occ group by phylum;
select license, count(*) from occ group by license;

Exercise 4 : Discard data

Instructions

Solution 4

select count(*) from occ where occurrenceStatus='ABSENT';

create view trusted as select * from occ where occurrenceStatus='PRESENT' and taxonRank='SPECIES';
select count(*) from trusted;

Exercise 5 : Filter data

Instructions

Solution 5

select count(*) from occ where coordinateUncertaintyInMeters is null;
select coordinateUncertaintyInMeters, count(*) from occ group by coordinateUncertaintyInMeters;
select * from occ where CAST(coordinateUncertaintyInMeters as INTEGER) > 10000;

drop view if exists trusted ;
create view trusted as select scientificName, year,month,day,decimalLatitude, decimalLongitude,  CAST(coordinateUncertaintyInMeters as INTEGER) as uncertainty, occurrenceID from occ where occurrenceStatus='PRESENT'  and taxonRank='SPECIES' and uncertainty <= 10000;
select count(*) from trusted;

select eventdate, strftime('%d',eventdate) as day, strftime('%m',eventdate) as month, strftime('%Y', eventdate) as year from occ;

Exercise 6 : Annotate data

Instructions

Solution 6

update occ set individualCount=1 where individualCount is null;

drop view if exists trusted ;
create view trusted as select scientificName, year,month,day,decimalLatitude, decimalLongitude,  CAST(coordinateUncertaintyInMeters as INTEGER) as uncertainty, occurrenceID, individualCount, mediaType is not null as withMedia from occ where occurrenceStatus='PRESENT' and taxonRank='SPECIES' and uncertainty <= 10000;

Key Points

  • SQL can be very useful to clean your data

  • Views are great to filter the records and fields you want to keep without changing your original data

  • Store your SQL statements under Git

  • SQL statements are easy to understand, sustainable and reusable


Discussion on data publication

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • Are you ready to publish your biodiversity data in an open way?

  • What are your concerns?

  • What do you want to change?

Objectives
  • Discussion on the principles of Open Science

  • Discussion on FAIR data.

  • Discussion on data publication

Become a data publisher for GBIF

The Endorsement procedure

The endorsement procedure aims to ensure that:

https://www.gbif.org/become-a-publisher

image

Discussion on data publication

GBIF

  1. Data Publishing Challenges: What are your personal thresholds in relation to biodiversity data publication? How would you make your data acceptable for publication?

Writing a data policy for an institution with 81+ scientists is a difficult assignment.

  2. Data Licensing and Usage Policies: Are you willing to publish under CC0, CC-BY or CC-BY-NC?

Not in favour of CC0; discussion on CC-BY and CC-BY-NC; CC-BY might be an option. CC-BY vs CC-BY-NC discussion.

  3. Your Institute's view: What is the position of your institute on open biodiversity data publication?

In general everyone is “pro” open data publication, but questions arise. Are they willing to do this? The answer can be different for each person. PPNEA: dealing with the donors of the data is also an issue. In many cases it is not only the institute or the researchers who decide. There are different types of data: some will be openly available, some are sensitive (there are different cases). There is also a need to interact with third parties.

  4. Impact and Outcomes: Reflect on the impact and outcomes of data publication through GBIF. How can the availability of biodiversity data through GBIF contribute to scientific research, conservation efforts, and policymaking? Data literacy is needed. In Croatia, open data in the areas of biodiversity data and nature protection are in progress and leading. Talk to persons, and also to the institutions. Create obligations…?

Not much discussion needed, it is a fact (no opinion). Also think about sensitive data (open or aggregated?). Open data awareness raising is required, also in education.

Discussion challenge

Choose a topic to discuss in your group. 30 minutes for group discussion. 30 minutes for reporting back to the room. 5-6 persons per group.

Solution

Report back to the room on your group discussion

Key Points

  • Open Science & FAIR data


Introduction to Darwin Core

Overview

Teaching: 30 min
Exercises: 40 min
Questions
  • What is Darwin Core?

  • What is a Darwin Core Archive?

  • Why do people use Darwin Core for their data?

  • What are the required Darwin Core terms for sharing to GBIF?

Objectives
  • Understand the purpose of Darwin Core.

  • Understand how to map data to Darwin Core.

  • Plan for mapping to Darwin Core.

Presentation

Darwin Core - A global community of data sharing and integration

Darwin Core is a data standard to mobilize and share biodiversity data. Over the years, the Darwin Core standard has expanded to enable exchange and sharing of diverse types of biological observations from citizen scientists, ecological monitoring, eDNA, animal telemetry, taxonomic treatments, and many others. Darwin Core is applicable to any observation of an organism (scientific name, OTU, or other methods of defining a species) at a particular place and time. In Darwin Core this is an occurrence. To learn more about the foundations of Darwin Core read Wieczorek et al. 2012.

Demonstrated Use of Darwin Core

The power of Darwin Core is most evident in the data aggregators that harvest data using that standard. The one we will refer to most frequently in this workshop is the Global Biodiversity Information Facility (learn more about GBIF). Another prominent one is the Ocean Biodiversity Information System (learn more about OBIS). It’s also used by the Atlas of Living Australia and iDigBio, among others.

Darwin Core Archives

Darwin Core Archives are what OBIS and GBIF harvest into their systems. Fortunately the software created and maintained by GBIF, the Integrated Publishing Toolkit, produces Darwin Core Archives for us. Darwin Core Archives are pretty simple. It’s a zipped folder containing the data (one or several files depending on how many extensions you use), an Ecological Metadata Language (EML) XML file, and a meta.xml file that describes what’s in the zipped folder.
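As a quick illustration, the short Python sketch below lists the contents of such an archive without unzipping it by hand; dwca-example.zip is a placeholder for whichever archive you have downloaded.

import zipfile

# Peek inside a Darwin Core Archive: expect data file(s), eml.xml and meta.xml
with zipfile.ZipFile("dwca-example.zip") as dwca:
    for name in dwca.namelist():
        print(name)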

Exercise

Challenge: Download this Darwin Core Archive and examine what’s in it. Did you find anything unusual, or anything you don’t understand?

Solution

 dwca-tpwd_harc_texasaransasbay_bagseine-v2.3
 |-- eml.xml
 |-- event.txt
 |-- extendedmeasurementorfact.txt
 |-- meta.xml
 |-- occurrence.txt

Darwin Core Mapping

Now that we understand a bit more about why Darwin Core was created and how it is used today we can begin the work of mapping data to the standard. The key resource when mapping data to Darwin Core is the Darwin Core Quick Reference Guide. This document provides an easy-to-read reference of the currently recommended terms for the Darwin Core standard. There are a lot of terms there and you won’t use them all for every dataset (or even use them all on any dataset) but as you apply the standard to more datasets you’ll become more familiar with the terms.

Tip

If your raw column headers are Darwin Core terms verbatim then you can skip this step! Next time you plan data collection use the standard DwC term headers!

Exercise

Challenge: Find the matching Darwin Core term for these column headers.

  1. SAMPLE_DATE (example data: 09-MAR-21 05.45.00.000000000 PM)
  2. lat (example data: 32.6560)
  3. depth_m (example data: 6 meters)
  4. COMMON_NAME (example data: staghorn coral)
  5. percent_cover (example data: 15)
  6. COUNT (example data: 2 Females)

Solution

  1. eventDate
  2. decimalLatitude
  3. minimumDepthInMeters and maximumDepthInMeters
  4. vernacularName
  5. organismQuantity and organismQuantityType
  6. This one is tricky: it’s two terms combined and will need to be split. individualCount and sex

Tip

To make the mapping step easier on yourself, we recommend starting a mapping document/spreadsheet (or document it as a comment in your script). List out all of your column headers in one column and document the appropriate Darwin Core term(s) in a second column. For example:

my term DwC term
lat decimalLatitude
date eventDate
species scientificName
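Once such a crosswalk exists, applying it is mostly a renaming exercise. The sketch below is a minimal pandas illustration using the hypothetical raw column names from the example table above.

import pandas as pd

# Hypothetical raw data with non-standard column names
raw = pd.DataFrame({
    "lat": [32.6560],
    "date": ["2021-03-09"],
    "species": ["Acropora cervicornis"],
})

# The crosswalk: raw column name -> Darwin Core term
crosswalk = {
    "lat": "decimalLatitude",
    "date": "eventDate",
    "species": "scientificName",
}

dwc = raw.rename(columns=crosswalk)
print(dwc.columns.tolist())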

What are the required Darwin Core terms for publishing to GBIF?

When doing your mapping some required information may be missing. Below are the Darwin Core terms that are required to share your data to OBIS plus a few that are needed for GBIF.

Darwin Core Term Definition Comment Example
occurrenceID An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique. To construct a globally unique identifier for each occurrence you can usually concatenate station + date + scientific name (or something similar) but you’ll need to check this is unique for each row in your data. It is preferred to use the fields that are least likely to change in the future for this. For ways to check the uniqueness of your occurrenceIDs see the QA / QC section of the workshop. Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula
basisOfRecord The specific nature of the data record. Pick from these controlled vocabulary terms: HumanObservation, MachineObservation, MaterialSample, PreservedSpecimen, LivingSpecimen, FossilSpecimen HumanObservation
scientificName The full scientific name, with authorship and date information if known. When forming part of an Identification, this should be the name in lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the identificationQualifier term. Note that cf., aff., etc. need to be parsed out to the identificationQualifier term. For a more thorough review of identificationQualifier see this paper. Atractosteus spatula
eventDate The date-time or interval during which an Event occurred. For occurrences, this is the date-time when the event was recorded. Not suitable for a time in a geological context. Must follow ISO 8601. See more information on dates in the Data Cleaning section of the workshop. 2009-02-20T08:40Z
decimalLatitude The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive. For OBIS and GBIF the required geodeticDatum is WGS84. Uncertainty around the geographic center of a Location (e.g. when sampling event was a transect) can be recorded in coordinateUncertaintyInMeters. See more information on coordinates in the Data Cleaning section of the workshop. -41.0983423
decimalLongitude The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive For OBIS and GBIF the required geodeticDatum is WGS84. See more information on coordinates in the Data Cleaning section of the workshop. -121.1761111
countryCode The standard code for the country in which the location occurs. Use an ISO 3166-1-alpha-2 country code. Not required for OBIS but GBIF prefers to have this for their system. For international waters, leave blank. US, MX, CA
kingdom The full scientific name of the kingdom in which the taxon is classified. Not required for OBIS but GBIF needs this to disambiguate scientific names that are the same but in different kingdoms. Animalia
geodeticDatum The ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in decimalLatitude and decimalLongitude as based. Must be WGS84 for data shared to OBIS and GBIF but it’s best to state explicitly that it is. WGS84
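As noted for occurrenceID above, a stable identifier can often be built by concatenating fields that are unlikely to change. The following is a hedged pandas sketch of that idea; the column names are assumptions, not part of any particular dataset.

import pandas as pd

df = pd.DataFrame({
    "station": ["Station_95"],
    "eventDate": ["1997-01-09T14:35"],
    "scientificName": ["Atractosteus spatula"],
})

# Concatenate stable fields into an identifier
df["occurrenceID"] = (
    df["station"] + "_" + df["eventDate"] + "_"
    + df["scientificName"].str.replace(" ", "_")
)

# Every occurrenceID must be unique within the dataset
assert df["occurrenceID"].is_unique
print(df["occurrenceID"].iloc[0])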

What other terms should be considered?

While these terms are not required for publishing data to GBIF, they are extremely helpful for downstream users because without them the data are less useful for future analyses. For instance, depth is a crucial piece of information for marine observations, but it is not always included. For the most part the ones listed below are not going to be sitting there in the data, so you’ll have to determine what the values should be and add them in. Really try your hardest to include them if you can.

Darwin Core Term Definition Comment Example
coordinateUncertaintyInMeters The horizontal distance (in meters) from the given decimalLatitude and decimalLongitude describing the smallest circle containing the whole of the Location. Leave the value empty if the uncertainty is unknown, cannot be estimated, or is not applicable (because there are no coordinates). Zero is not a valid value for this term There’s always uncertainty associated with locations. Recording the uncertainty is crucial for downstream analyses. 15
occurrenceStatus A statement about the presence or absence of a Taxon at a Location. For GBIF & OBIS, only valid values are present and absent. present
samplingProtocol The names of, references to, or descriptions of the methods or protocols used during an Event.   Bag Seine
taxonRank The taxonomic rank of the most specific name in the scientificName. Also helps with disambiguation of scientific names. Species
organismQuantity A number or enumeration value for the quantity of organisms. OBIS and GBIF also likes to see this in the Extended Measurement or Fact extension. 2.6
organismQuantityType The type of quantification system used for the quantity of organisms.   Relative Abundance
datasetName The name identifying the data set from which the record was derived.   TPWD HARC Texas Coastal Fisheries Aransas Bag Bay Seine
dataGeneralizations Actions taken to make the shared data less specific or complete than in its original form. Suggests that alternative data of higher quality may be available on request. This veers somewhat into the realm of metadata and will not be applicable to all datasets but if the data were modified such as due to sensitive species then it’s important to note that for future users. Coordinates generalized from original GPS coordinates to the nearest half degree grid cell
informationWithheld Additional information that exists, but that has not been shared in the given record. Also useful if the data have been modified this way for sensitive species or for other reasons. location information not given for endangered species
institutionCode The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record.   TPWD

Other than these specific terms, work through the data that you have and try to crosswalk it to the Darwin Core terms that match best.

Exercise

Challenge: Create some crosswalk notes for your dataset.

Compare your data files to the table(s) above to devise a plan to crosswalk your data columns into the DwC terms.

Key Points

  • Darwin Core isn’t difficult to apply, it just takes a little bit of time.

  • Using Darwin Core allows datasets from across projects, organizations, and countries to be integrated together.

  • Applying certain general principles to the data will make it easier to map to Darwin Core.

  • Implementing Darwin Core makes data FAIR-er and means becoming part of a community of people working together to understand species no matter where they work or are based.


Coffee Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

xkcd 0927 (standards) Image credit: xkcd

Key Points


Data Cleaning for Darwin Core

Overview

Teaching: 0 min
Exercises: 120 min
Questions
  • How to convert dates to ISO?

  • How to match scientific names to GBIF?

  • How to convert latitudes and longitudes to decimal degrees?

Objectives
  • Aligning dates to the ISO 8601 standard.

  • Matching scientific names

  • Converting latitude and longitude variations to decimal degrees North and East.

Some useful links:

Now that you know what the mapping is between your raw data and the Darwin Core standard, it’s time to start cleaning up the data to align with the conventions described in the standard. The following activities are the three most common conversions a dataset will undergo to align to the Darwin Core standard:

  1. Ensuring dates follow the ISO 8601 standard
  2. Matching scientific names to an authoritative resource
  3. Ensuring latitude and longitude values are in decimal degrees

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Make sure that Python is installed on your machine. You can download Python here: https://www.anaconda.com/

R is a language and environment for statistical computing and graphics. The core R language is augmented by a large number of extension packages, containing reusable code, documentation, and sample data. You can download R & RStudio (visual interface) here: https://posit.co/download/rstudio-desktop/

Below is a short summary of each of those conversions as well as some example conversion scripts. The exercises are intended to give you a sense of the variability we’ve seen in datasets and how we went about converting them. While the examples use the pandas package for Python and the tidyverse collection of packages for R (in particular the lubridate package), those are not the only options for dealing with these conversions but simply the ones we use more frequently in our experiences.

Getting your dates in order

Dates can be surprisingly tricky because people record them in many different ways. For our purposes we must follow ISO 8601 which means using a four digit year, two digit month, and two digit day with dashes as separators (i.e. YYYY-MM-DD). You can also record time in ISO 8601 but make sure to include the time zone which can also get tricky if your data take place across time zones and throughout the year where daylight savings time may or may not be in effect (and start and end times of daylight savings vary across years). There are packages in R and Python that can help you with these vagaries. Finally, it is possible to record time intervals in ISO 8601 using a slash (e.g. 2022-01-02/2022-01-12). Examine the dates in your data to determine what format they are following and what amendments need to be made to ensure they are following ISO 8601. Below are some examples and solutions in Python and R for them.

ISO 8601 dates can represent moments in time at different resolutions, as well as time intervals, which use “/” as a separator. Date and time are separated by “T”. Timestamps can have a time zone indicator at the end. If not, then they are assumed to be local time. When a time is UTC, the letter “Z” is added at the end (e.g. 2009-02-20T08:40Z, which is the equivalent of 2009-02-20T08:40+00:00).

Tip

Focus on getting your package of choice to read the dates appropriately. While you can use regular expressions to replace and substitute strings to align with the ISO convention, it will typically save you time if you work in your package of choice to translate the dates.

Darwin Core Term Description Example
eventDate The date-time or interval during which an Event occurred. For occurrences, this is the date-time when the event was recorded. Not suitable for a time in a geological context. 1963-03-08T14:07-0600 (8 Mar 1963 at 2:07pm in the time zone six hours earlier than UTC).
2009-02-20T08:40Z (20 February 2009 8:40am UTC).
2018-08-29T15:19 (3:19pm local time on 29 August 2018).
1809-02-12 (some time during 12 February 1809).
1906-06 (some time in June 1906).
1971 (some time in the year 1971).
2007-03-01T13:00:00Z/2008-05-11T15:30:00Z (some time during the interval between 1 March 2007 1pm UTC and 11 May 2008 3:30pm UTC).
1900/1909 (some time during the interval between the beginning of the year 1900 and the end of the year 1909).
2007-11-13/15 (some time in the interval between 13 November 2007 and 15 November 2007).

Examples in Openrefine

When dealing with dates using Openrefine, there are a few base tricks that are useful to wrangle your dates in the correct format.

The examples below show how to use OpenRefine to format your data to the ISO 8601 standard. Here is an overview of the OpenRefine data functions. In OpenRefine, the expressions below are entered in the expression box of the transform dialog.


  1. 01/31/2021 17:00 GMT
    Choose Edit cells –> Common transforms –> To date, or choose Edit column –> Add column based on this column and use an expression such as:

     value.toDate('MM/dd/yyyy HH:mm z').toString('yyyy-MM-dd')
    

    If you have multiple date formats in one column.

     value.toDate('MM/yy','MMM-yy').toString('yyyy-MM')
    

    If parsing a date with text components in a language other than your system language, you can specify a language code as the format1 argument. For example, a French-language date such as “10 janvier 2023”:

    value.toDate('fr','dd MMM yyyy') 
    
  2. Another option is to split your date column into 3 separate columns using the split function. After splitting, join the columns into a date format with an expression such as:

       cells["year"].value + "-" +cells["month"].value + "-" + cells["day"].value
    

Examples in Python

When dealing with dates using pandas in Python it is best to create a Series as your time column with the appropriate datatype. Then, when writing your file(s) using .to_csv() you can specify the format which your date will be written in using the date_format parameter.

The examples below show how to use the pandas.to_datetime() function to read various date formats. The process can be applied to entire columns (or Series) within a DataFrame.

  1. 01/31/2021 17:00 GMT

    This date follows a typical date construct of month/day/year 24-hour:minute time-zone. The pandas .to_datetime() function will correctly interpret these dates without the format parameter.

    import pandas as pd
    df = pd.DataFrame({'date':['01/31/2021 17:00 GMT']})
    df['eventDate'] = pd.to_datetime(df['date'], format="%m/%d/%Y %H:%M %Z")
    df
    
                       date                 eventDate
       01/31/2021 17:00 GMT 2021-01-31 17:00:00+00:00
    
  2. 31/01/2021 12:00 EST

    This date is similar to the first date but switches the month and day and identifies a different time-zone. The construct looks like day/month/year 24-hour:minute time-zone

    import pandas as pd
    df = pd.DataFrame({'date':['31/01/2021 12:00 EST']})
    df['eventDate'] = pd.to_datetime(df['date'], format="%d/%m/%Y %H:%M %Z")
    df
    
                       date                 eventDate
       31/01/2021 12:00 EST 2021-01-31 12:00:00-05:00
    
  3. January, 01 2021 5:00 PM GMT

    import pandas as pd
    df = pd.DataFrame({'date':['January, 01 2021 5:00 PM GMT']})
    df['eventDate'] = pd.to_datetime(df['date'],format='%B, %d %Y %I:%M %p %Z')
    df
    
                               date                 eventDate
       January, 01 2021 5:00 PM GMT 2021-01-01 17:00:00+00:00
    
  4. 1612112400 in seconds since 1970

    This uses the units of seconds since 1970 which is common when working with data in netCDF.

    import pandas as pd
    df = pd.DataFrame({'date':['1612112400']})
    df['eventDate'] = pd.to_datetime(df['date'], unit='s', origin='unix')
    df
    
             date           eventDate
       1612112400 2021-01-31 17:00:00
    
  5. 44227.708333333333

    This is the numerical value for dates in Excel because Excel stores dates as sequential serial numbers so that they can be used in calculations. In some cases, when you export an Excel spreadsheet to CSV, the dates are preserved as a floating point number.

    import pandas as pd
    df = pd.DataFrame({'date':['44227.708333333333']})
    df['eventDate'] = pd.to_datetime(df['date'].astype(float), unit='D', origin='1899-12-30')
    df
    
                     date                     eventDate
       44227.708333333333 2021-01-31 17:00:00.000000256
    
  6. Observations with a start date of 2021-01-30 and an end date of 2021-01-31.

    Here we store the date as a duration following the ISO 8601 convention. In some cases, it is easier to use a regular expression or simply paste strings together:

    import pandas as pd
    df = pd.DataFrame({'start_date':['2021-01-30'],
                       'end_date':['2021-01-31']})
    df['eventDate'] = df['start_date']+'/'+df['end_date']
    df
    
       start_date    end_date              eventDate
       2021-01-30  2021-01-31  2021-01-30/2021-01-31
    

Examples in R

When dealing with dates using R, there are a few base functions that are useful to wrangle your dates in the correct format. An R package that is useful is lubridate, which is part of the tidyverse. It is recommended to bookmark this lubridate cheatsheet.

The examples below show how to use the lubridate package and format your data to the ISO-8601 standard.

  1. 01/31/2021 17:00 GMT
   library(lubridate)
   date_str <- '01/31/2021 17:00 GMT'
   date <- lubridate::mdy_hm(date_str, tz="UTC")
   date <- lubridate::format_ISO8601(date) # Separates date and time with a T.
   date <- paste0(date, "Z") # Add a Z because time is in UTC.
   [1] "2021-01-31T17:00:00Z"
  2. 31/01/2021 12:00 EST

    library(lubridate)
    date_str <- '31/01/2021 12:00 EST'
    date <- lubridate::dmy_hm(date_str, tz="EST")
    date <- lubridate::with_tz(date, tz="UTC")
    date <- lubridate::format_ISO8601(date)
    date <- paste0(date, "Z")
    
    [1] "2021-01-31T17:00:00Z"
    
  3. January, 01 2021 5:00 PM GMT

    library(lubridate)
    date_str <- 'January, 01 2021 5:00 PM GMT'
    date <- lubridate::parse_date_time(date_str, orders = '%B, %d %Y %I:%M %p', tz = "GMT")
    date <- lubridate::with_tz(date, tz = "UTC")
    date <- lubridate::format_ISO8601(date)
    date <- paste0(date, "Z")
    
    [1] "2021-01-01T17:00:00Z"
    
  4. 1612112400 in seconds since 1970

    This uses the units of seconds since 1970 which is common when working with data in netCDF.

    library(lubridate)
    date_str <- '1612112400'
    date_str <- as.numeric(date_str)
    date <- lubridate::as_datetime(date_str, origin = lubridate::origin, tz = "UTC")
    date <- lubridate::format_ISO8601(date)
    date <- paste0(date, "Z")
    print(date)
    
    [1] "2021-01-31T17:00:00Z"
    
  5. 44227.708333333333

    This is the numerical value for dates in Excel because Excel stores dates as sequential serial numbers so that they can be used in calculations. In some cases, when you export an Excel spreadsheet to CSV, the dates are preserved as a floating point number.

    library(openxlsx)
    library(lubridate)
    date_str <- 44227.708333333333
    date <- as.Date(date_str, origin = "1899-12-30") # If you're only interested in the YYYY-MM-DD
    fulldate <- openxlsx::convertToDateTime(date_str, tz = "UTC")
    fulldate <- lubridate::format_ISO8601(fulldate)
    fulldate <- paste0(fulldate, "Z")
    print(date)
    print(fulldate)
    
    [1] "2021-01-31"
    [1] "2021-01-31T17:00:00Z"
    
  6. Observations with a start date of 2021-01-30 and an end date of 2021-01-31. For added complexity, consider adding in a 4-digit deployment and retrieval time.

    Here we store the date as a duration following the ISO 8601 convention. In some cases, it is easier to use a regular expression or simply paste strings together:

    library(lubridate)
    event_start <- '2021-01-30'
    event_finish <- '2021-01-31'
       
    deployment_time <- 1002
    retrieval_time <- 1102
    
    # Time is recorded numerically (1002 instead of 10:02), so convert these columns:
    deployment_time <- substr(as.POSIXct(sprintf("%04.0f", deployment_time), format = "%H%M"), 12, 16)
    retrieval_time <- substr(as.POSIXct(sprintf("%04.0f", retrieval_time), format = "%H%M"), 12, 16)
    
    # If you're interested in just pasting the event dates together:
    eventDate <- paste(event_start, event_finish, sep = "/") 
    
    # If you're interested in including the deployment and retrieval times in the eventDate:
    eventDateTime_start <- lubridate::format_ISO8601(as.POSIXct(paste(event_start, deployment_time), tz = "UTC"))
    eventDateTime_start <- paste0(eventDateTime_start, "Z")
    eventDateTime_finish <- lubridate::format_ISO8601(as.POSIXct(paste(event_finish, retrieval_time), tz = "UTC"))
    eventDateTime_finish <- paste0(eventDateTime_finish, "Z")
    eventDateTime <- paste(eventDateTime_start, eventDateTime_finish, sep = "/") 
       
    print(eventDate)
    print(eventDateTime)
    
    [1] "2021-01-30/2021-01-31"
    [1] "2021-01-30T10:02:00Z/2021-01-31T11:02:00Z"
    

Tip

When all else fails, treat the dates as strings and use substitutions/regular expressions to manipulate the strings into ISO 8601.

Matching your scientific names to a taxonomic backbone

Introduction

Working with different partners/institutes/researchers results in a diversity of taxonomic names used to define species. This makes comparison among datasets harder, since in many cases the goal is aggregation or filtering on specific species. By translating all species names to a common taxonomic backbone (ensuring unique IDs for each species name), this becomes possible.

Darwin Core Term Description Example
scientificNameID An identifier for the nomenclatural (not taxonomic) details of a scientific name. urn:lsid:ipni.org:names:37829-1:1.3
kingdom The full scientific name of the kingdom in which the taxon is classified. Animalia, Archaea, Bacteria, Chromista, Fungi, Plantae, Protozoa, Viruses
taxonRank The taxonomic rank of the most specific name in the scientificName. subspecies, varietas, forma, species, genus

Using the commandline using Python

This small utility adds the species information from the GBIF backbone to any data table (CSV-style or a Pandas dataframe) by requesting this information via the GBIF API. For each match, the corresponding accepted name is looked up. Nevertheless, there will always be errors and control is still essential; the acceptedKeys provide the ability to compare species names from different data sources. The functionality can be loaded within Python itself by importing the function extract_species_information, or by running the script from the command line. We will show you how to use the command line; a minimal API sketch also follows the numbered steps below.

  1. Create a folder which will be used for name matching.

  2. Place your CSV (comma separated value) file with the scientific names of the species of interest in that folder. Here we are showing some of the contents of the file species.csv. screenshot

  3. Place this Python file gbif_species_name_match.py in your name matching folder screenshot

  4. Navigate in the Python terminal to the correct folder. screenshot

  5. Run the command > python gbif_species_name_match.py yourfilename_input.csv yourfilename_output
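The sketch below shows the underlying idea using the public GBIF species match API directly (the workshop script wraps this up for whole CSV files); it assumes the requests package and uses two example names.

import requests

names = ["Atractosteus spatula", "Puma concolor"]

for name in names:
    match = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name},
    ).json()
    # matchType can be EXACT, FUZZY, HIGHERRANK or NONE, so always review the results
    print(name, match.get("matchType"), match.get("scientificName"), match.get("usageKey"))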

Using the Global Names Verifier

Verify a list of scientific names against biodiversity data-sources. This service parses incoming names, executes exact or fuzzy matching as required, and returns the best-scored result. Optionally, it can also return matches from data-sources selected by a user.

  1. Create a CSV (comma separated value) file with the scientific name of the species of interest. Here we are showing some of the contents of the file species.csv. screenshot

  2. Copy your scientific names to the Global Names Verifier screenshot

  3. Click on Search Names. Don’t forget to choose your output format (here choose .csv)

  4. Hopefully, your names will be matched

    1. In some cases you will have ambiguous matches.
    2. Copy your response and use it when building your Darwin Core file

Getting lat/lon to decimal degrees

Latitude (decimalLatitude) and longitude (decimalLongitude) are the geographic coordinates (in decimal degrees north and east, respectively), using the spatial reference system given in geodeticDatum of the geographic center of a location.

Note that the requirement for decimalLatitude and decimalLongitude is that they must be in decimal degrees in WGS84. Since this is the requirement for Darwin Core, OBIS and GBIF will assume data shared using those Darwin Core terms are in the geodetic datum WGS84. We highly recommend checking the coordinate reference system (CRS) of your observations to confirm they are using the same datum and documenting it in the geodeticDatum Darwin Core term. If your coordinates are not using WGS84, they will need to be converted in order to share the data to OBIS and GBIF since decimalLatitude and decimalLongitude are required terms.

Helpful packages for managing CRS and geodetic datum:

Tip

If at all possible, it’s best to extract out the components of the information you have in order to compile the appropriate field. For example, if you have the coordinates as one lone string 17° 51' 57.96" S 149° 39' 13.32" W, try to split it out into its component pieces: 17, 51, 57.96, S, 149, 39, 13.32, and W. Just be sure to track which values are latitude and which are longitude.
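As a hedged illustration of the splitting step in the tip above, the Python sketch below uses a regular expression to break the single string into its components before converting to decimal degrees; it assumes the degree, minute and second symbols appear exactly as in the example.

import re

coord = """17° 51' 57.96" S 149° 39' 13.32" W"""

# degrees ° minutes ' seconds " hemisphere
pattern = r"""(\d+)°\s*(\d+)'\s*([\d.]+)"\s*([NSEW])"""
values = []
for degrees, minutes, seconds, hemisphere in re.findall(pattern, coord):
    dd = float(degrees) + float(minutes) / 60 + float(seconds) / 3600
    if hemisphere in ("S", "W"):
        dd = -dd
    values.append(round(dd, 7))

decimalLatitude, decimalLongitude = values
print(decimalLatitude, decimalLongitude)  # -17.8661 -149.6537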

Darwin Core Term Description Example
decimalLatitude The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are north of the Equator, negative values are south of it. Legal values lie between -90 and 90, inclusive. -41.0983423
decimalLongitude The geographic longitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic center of a Location. Positive values are east of the Greenwich Meridian, negative values are west of it. Legal values lie between -180 and 180, inclusive. -121.1761111
geodeticDatum The ellipsoid, geodetic datum, or spatial reference system (SRS) upon which the geographic coordinates given in decimalLatitude and decimalLongitude as based. WGS84

coordinate_precision Image credit: xkcd

Examples in Python

  1. 17° 51' 57.96" S 149° 39' 13.32" W
    • This example assumes you have already split the two strings into discrete components (as shown in the table). An example converting the full strings 17° 51' 57.96" S 149° 39' 13.32" W to decimal degrees can be found here.
    lat_degrees lat_minutes lat_seconds lat_hemisphere lon_degrees lon_minutes lon_seconds lon_hemisphere
    17 51 57.96 S 149 39 13.32 W
    import pandas as pd
    df = pd.DataFrame({'lat_degrees':[17],
                       'lat_minutes':[51],
                       'lat_seconds':[57.96],
                       'lat_hemisphere':['S'],
                       'lon_degrees': [149], 
                       'lon_minutes': [39], 
                       'lon_seconds':[13.32], 
                       'lon_hemisphere': ['W'],
                      })
       
    df['decimalLatitude'] = df['lat_degrees'] + ( (df['lat_minutes'] + (df['lat_seconds']/60) )/60)
    df['decimalLongitude'] = df['lon_degrees'] + ( (df['lon_minutes'] + (df['lon_seconds']/60) )/60)
    
    # Convert hemisphere S and W to negative values as units should be `degrees North` and `degrees East`
    df.loc[df['lat_hemisphere']=='S','decimalLatitude'] = df.loc[df['lat_hemisphere']=='S','decimalLatitude']*-1
    df.loc[df['lon_hemisphere']=='W','decimalLongitude'] = df.loc[df['lon_hemisphere']=='W','decimalLongitude']*-1
          
    df[['decimalLatitude','decimalLongitude']]
    
       decimalLatitude  decimalLongitude
              -17.8661         -149.6537
    
  2. 33° 22.967' N 117° 35.321' W
    • Similar to above, this example assumes you have already split the two strings into discrete components (as shown in the table).
    lat_degrees lat_dec_minutes lat_hemisphere lon_degrees lon_dec_minutes lon_hemisphere
    33 22.967 N 117 35.321 W
    df = pd.DataFrame({'lat_degrees':[33],
                       'lat_dec_minutes':[22.967],
                       'lat_hemisphere':['N'],
                       'lon_degrees': [117], 
                       'lon_dec_minutes': [35.321], 
                       'lon_hemisphere': ['W'],
                      })
       
    df['decimalLatitude'] = df['lat_degrees'] + (df['lat_dec_minutes']/60)
    df['decimalLongitude'] = df['lon_degrees'] + (df['lon_dec_minutes']/60)
       
    # Convert hemisphere S and W to negative values as units should be `degrees North` and `degrees East`
    df.loc[df['lat_hemisphere']=='S','decimalLatitude'] = df.loc[df['lat_hemisphere']=='S','decimalLatitude']*-1
    df.loc[df['lon_hemisphere']=='W','decimalLongitude'] = df.loc[df['lon_hemisphere']=='W','decimalLongitude']*-1
       
    df[['decimalLatitude','decimalLongitude']]
    
    decimalLatitude  decimalLongitude
    0        33.382783       -117.588683
    

Examples in R

  1. 17° 51' 57.96" S 149° 39' 13.32" W

    lat_degrees lat_minutes lat_seconds lat_hemisphere lon_degrees lon_minutes lon_seconds lon_hemisphere
    17 51 57.96 S 149 39 13.32 W
    library(tibble)
    tbl <- tibble(lat_degrees = 17,
                  lat_minutes = 51,
                  lat_seconds = 57.96,
                  lat_hemisphere = "S",
                  lon_degrees = 149,
                  lon_minutes = 39, 
                  lon_seconds = 13.32, 
                  lon_hemisphere = "W")
       
    tbl$decimalLatitude <- tbl$lat_degrees + ( (tbl$lat_minutes + (tbl$lat_seconds/60)) / 60 )
    tbl$decimalLongitude <- tbl$lon_degrees + ( (tbl$lon_minutes + (tbl$lon_seconds/60)) / 60 )
       
    tbl$decimalLatitude = as.numeric(as.character(tbl$decimalLatitude))*(-1)
    tbl$decimalLongitude = as.numeric(as.character(tbl$decimalLongitude))*(-1)
    
    > tbl$decimalLatitude
    [1] -17.8661
    > tbl$decimalLongitude
    [1] -149.6537
    
  2. 33° 22.967' N 117° 35.321' W

    lat_degrees lat_dec_minutes lat_hemisphere lon_degrees lon_dec_minutes lon_hemisphere
    33 22.967 N 117 35.321 W
    library(tibble)
    tbl <- tibble(lat_degrees = 33,
                  lat_dec_minutes = 22.967,
                  lat_hemisphere = "N",
                  lon_degrees = 117, 
                  lon_dec_minutes = 35.321, 
                  lon_hemisphere = "W")
       
    tbl$decimalLatitude <- tbl$lat_degrees + ( tbl$lat_dec_minutes/60 )
    tbl$decimalLongitude <- tbl$lon_degrees + ( tbl$lon_dec_minutes/60 )
       
    tbl$decimalLongitude = as.numeric(as.character(tbl$decimalLongitude))*(-1)
    
    > tbl$decimalLatitude
    [1] 33.38278
    > tbl$decimalLongitude
    [1] -117.5887
    
  3. 33° 22.967' N 117° 35.321' W

    • Using the measurements package, the conv_unit() function can work with space-separated coordinate strings.
    lat lat_hemisphere lon lon_hemisphere
    33 22.967 N 117 35.321 W
   tbl <- tibble(lat = "33 22.967",
                 lat_hemisphere = "N",
                 lon = "117 35.321", 
                 lon_hemisphere = "W")
  
  tbl$decimalLongitude = measurements::conv_unit(tbl$lon, from = 'deg_dec_min', to = 'dec_deg')
  tbl$decimalLongitude = as.numeric(as.character(tbl$decimalLongitude))*(-1)
  
  tbl$decimalLatitude = measurements::conv_unit(tbl$lat, from = 'deg_dec_min', to = 'dec_deg')
   > tbl$decimalLatitude
   [1] 33.38278
   > tbl$decimalLongitude
   [1] -117.5887

You can find some more tutorials on data transformation and publication on the INBO tutorial page: https://inbo.github.io/tutorials/

Key Points

  • When doing conversions it’s best to break out your data into its component pieces.

  • Dates are messy to deal with. Some packages have easy solutions, otherwise use regular expressions to align date strings to ISO 8601.

  • Latitudes and longitudes are like dates: they can be messy to deal with. Take a similar approach.


GBIF community forum & helpdesk

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What is the GBIF community forum

  • What is the GBIF helpdesk

  • What is the Technical support hour for GBIF nodes

Objectives
  • Learn more on how the GBIF community is ready to help

Presentation: The GBIF community forum

Exercise: Browse the Community forum

  1. Create an account on the community forum
  2. Check this post and check this out on GBIF (www.gbif.org)
  3. Browse the forum a bit

SOLUTION

  1. That was easy

Presentation: The GBIF Helpdesk

GBIF portal ‘Feedback system’

Instructions

  1. Create an account on Github
  2. Instead of sending bugs or ideas to helpdesk@gbif.org you can also use Github issues
  3. In the upper right corner of the GBIF portal, click here: portal. You can choose between different options:
    content
    Bug
    Idea
    Helpdesk

SOLUTION

  1. That was easy

Key Points

  • You are not alone in this world


Lunch Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

lunch

Key Points


Darwin Core+Extensions archive

Overview

Teaching: 20 min
Exercises: 60 min
Questions
  • What is a core and what are the extensions in Darwin Core?

  • How to organize my data and metadata?

  • How to create Darwin Core Archive

Objectives
  • Creating IDs and using them

  • Creating core and extensions files.

Darwin Core Extensions

DwC Schema

Now that we have a firm basis for understanding the different terms in Darwin Core, the next thing to understand is how data tables are organized and the difference between cores and extensions. You will always have a core table (Occurrence core or Event core) with either no extensions or several. What you choose depends on the data you have and how to represent it best. The original core in Darwin Core is the Occurrence core. Once people started using that core, they saw that they needed extensions to best represent the data they were trying to share, and several extensions have since been developed (with more still in development). As more monitoring data has been shared over time, another core type, the Event core, was added. Without getting too far into the weeds on cores and extensions, what is most important to understand is that you pick your core type first, and then pick the extensions to go with it. For example, if your data were collected as part of a sampling event (a cruise, transects, etc.) you will pick the Event core. If there was no sampling event, you will pick the Occurrence core.

Different options for sharing the data

Occurrence Core only

The bare minimum for sharing data to OBIS is to use the Occurrence Core with no extensions. This core type covers datasets that only include observations and/or specimen records where no information on sampling is available. Occurrence core is also used for eDNA or DNA derived data.

The Occurrence core allows you to provide all the required Darwin Core terms detailed in the intro section. You can produce a fully compliant Darwin Core version of your data using only the Occurrence core (see this example by Tylar Murray). On the one hand, if the data were collected using some kind of sampling methodology, you will lose much of that information if you use this most simple form of the data. On the other hand, it is faster and easier to produce.
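
Purely as a hedged illustration (not the example linked above), a minimal Python (pandas) sketch of what a single Occurrence-core-only row using a handful of commonly required Darwin Core terms could look like; the column names are Darwin Core terms, but the values and the output file name are invented.

import pandas as pd

# One occurrence record using only a small set of commonly required Darwin Core terms
occurrence = pd.DataFrame([{
    "occurrenceID": "mydataset:station95:1997-01-09:Atractosteus_spatula",  # invented
    "basisOfRecord": "HumanObservation",
    "scientificName": "Atractosteus spatula",
    "eventDate": "1997-01-09",
    "decimalLatitude": -17.8661,
    "decimalLongitude": -149.6537,
    "occurrenceStatus": "present",
}])

# A simple CSV like this can later be uploaded to an IPT and mapped to the Occurrence core
occurrence.to_csv("occurrence.csv", index=False)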

Thought Experiment

Look at the minimum required fields example. What is possible to do in future reuse? What would not be possible? For instance, note that there is no information about depth or the uncertainty of the coordinates. For more examples check out the Datasets folder in the IOOS Bio Data Guide.

Occurrence Core + extensions

Using the Occurrence core plus relevant extensions means that you can capture more of the data that’s been recorded. As an example, let’s consider an environmental DNA dataset. eDNA datasets have information that is unique to that method and will not be represented well using Occurrence core only. To document eDNA using Darwin Core you should follow this guide; you will need the Occurrence core plus the DNA derived data extension. Adding the DNA derived data extension allows you to capture information such as the PCR primer used, DNA sequences, standard operating procedure used in the assembly and other information specific to this type of data.

Let’s consider another example: a museum dataset that has biological measurements for each individual specimen (e.g. length). All information about each organism’s occurrence (taxonomic information, locality, identification, etc.) will go into the Occurrence core. You can then capture the biotic measurement information (type of measurement, units, accuracy, etc.) by using either the Measurement or Fact extension or the Extended Measurement or Fact extension (we elaborate on this extension below). Note again that here we do not have information on how the organisms were sampled.

Checklist Core + extensions

Suitable for publishing taxonomic data; still in use, but the ColDP standard is expected to replace it.

Event Core + extensions

As we have indicated earlier, the Event core is for datasets that include known sampling events - details are known about how, when, and where samples were taken.

An innovation that OBIS made in this space was introducing the Extended Measurement or Fact extension (also sometimes referred to as OBIS-ENV-DATA, or eMoF). This uses the Event core with an Occurrence extension + the Extended Measurement or Fact extension. The eMoF extension makes it possible to include measurements for both the events (salinity, temperature, dissolved oxygen, gear type, etc.) as well as measurements about the occurrences (weight, length, etc.). Prior to this, you were only able to include measurements of the occurrence (in the Measurement or Fact extension).

When these types of measurement data were collected, they may have each had their own column in your dataset. However, the Extended Measurement or Fact extension does not format data in this way. Rather than documenting each of your measurements in separate columns, measurements are condensed into one column: measurementValue (e.g. 15). Then, to tell us what that value is, there is the column measurementType which describes what the measurement actually is (e.g. length). Finally, the column measurementUnit is used to indicate the unit of the measurement (e.g. cm).
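
As a rough sketch (Python/pandas, with invented column names and values), reshaping measurements recorded as separate columns into the long measurementType / measurementValue / measurementUnit layout could look like this:

import pandas as pd

# 'Wide' source data: one column per measurement (values are invented)
df = pd.DataFrame({
    "occurrenceID": ["occ1", "occ2"],
    "length": [15, 23],
    "weight": [120, 340],
})

# Reshape to the long eMoF layout: one row per measurement
emof = df.melt(id_vars="occurrenceID",
               var_name="measurementType",
               value_name="measurementValue")

# Attach a unit to each measurement type (assumed units, for illustration)
emof["measurementUnit"] = emof["measurementType"].map({"length": "cm", "weight": "g"})

print(emof)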

Now, you may wonder - what do you write in the “measurementType” field? For some measurements, it may be simple. For others, maybe not as simple. The good news is this field is unconstrained - you can populate it with free text as you like. But what if you were interested in getting all records that have “length” measurements from OBIS? Due to the inevitable heterogeneity in how different people would document “length”, you would have to try to account for all these different ways!

The key thing about the Extended Measurement or Fact extension that gets around this challenge is that it provides a way to include Uniform Resource Identifiers (URIs). These URIs are used to populate the measurementTypeID field (as well as measurementUnitID and measurementValueID). URIs mean that if you call the measurementType “abundance” but I call it “Abundance per square meter” and we both use the measurementTypeID “http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL02/” then we know this is the same measurement type even if we didn’t use the same free text words to describe it. Choosing the right URI can be difficult but you can read about finding codes here. All you need to know for now is that you should try to find a measurementTypeID URI that belongs to the P01 collection. OBIS is developing guidelines to help you with the process of choosing URIs, so stay tuned to their manual for updates.
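
A small, hedged illustration (Python/pandas, with invented records) of the point above: two providers use different free-text measurementType labels, but the shared measurementTypeID URI quoted in the text makes the records comparable.

import pandas as pd

# Two records describing the same kind of measurement with different free-text labels
emof = pd.DataFrame({
    "occurrenceID": ["datasetA:occ1", "datasetB:occ7"],
    "measurementType": ["abundance", "Abundance per square meter"],
    "measurementValue": [12, 3],
})

# Both labels point to the same P01 URI, so users can find them together
emof["measurementTypeID"] = "http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL02/"

print(emof)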

Tip

Consider checking the dataset classes pages and the Data Quality requirements for each of them.

What’s in an ID?

Darwin Core Term Description Example
eventID An identifier for the set of information associated with an Event (something that occurs at a place and time). May be a global unique identifier or an identifier specific to the data set. INBO:VIS:Ev:00009375
Station_95_Date_09JAN1997:14:35:00.000
FFS-216:2007-09-21:A:replicateID1024
occurrenceID An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique. urn:catalog:UWBM:Bird:89776
Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula
FFS-216:2007-09-21:A:replicateID1024:objectID1345330
measurementID An identifier for the MeasurementOrFact (information pertaining to measurements, facts, characteristics, or assertions). May be a global unique identifier or an identifier specific to the data set. 9c752d22-b09a-11e8-96f8-529269fb1459

IDs are the keys in your data that are used to link tables together. For example, an occurrenceID in the Extended Measurement or Fact table records information about an organism with the same occurrenceID within the Occurrence core table. IDs are also the keys that keep track of each of the records, so that if you notice a mistake or missing information you can keep the record in place in the global aggregators and fix the mistake or add the missing information. For instance, let’s say you have a record with an occurrenceID Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula and after it’s published to OBIS you notice that the latitude was recorded incorrectly. When you fix that record in the data you would keep the occurrenceID Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula, fix the latitude, and republish the data so that the record is still present in OBIS but you have fixed the mistake.
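
As a small illustration (Python/pandas, with invented measurement values) of how the occurrenceID key links an Extended Measurement or Fact record back to its Occurrence core record:

import pandas as pd

# Occurrence core and eMoF extension share the occurrenceID key
occurrence = pd.DataFrame({
    "occurrenceID": ["Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula"],
    "scientificName": ["Atractosteus spatula"],
})

emof = pd.DataFrame({
    "occurrenceID": ["Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula"],
    "measurementType": ["length"],
    "measurementValue": [104],   # invented value
    "measurementUnit": ["cm"],
})

# Joining on the shared key reunites each measurement with its occurrence
print(occurrence.merge(emof, on="occurrenceID"))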

With that in mind, what is the best way to create an eventID, occurrenceID, or measurementID? Until we have a system that mints Persistent Identifiers for individual records, the best way we have seen is to build the ID from information in the data itself. That way, if you need to update or fix a record, you simply use the same information again to build the same ID for the same record. Take our example above, Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula. This is a concatenation of information from the original source data: the Station number + Verbatim Date + Scientific name. Because this is unique for each row in the occurrence file and we have kept the original data in its original format, we can always rebuild this ID by concatenating this same information together again.
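
A minimal sketch (Python/pandas; the source column names are assumptions) of rebuilding the identifiers described above by concatenating fields from the source data:

import pandas as pd

# Source columns as they might appear in the original data (column names are assumptions)
df = pd.DataFrame({
    "station": [95],
    "verbatim_date": ["09JAN1997:14:35:00.000"],
    "scientificName": ["Atractosteus spatula"],
})

# The same inputs always rebuild the same identifiers
df["eventID"] = "Station_" + df["station"].astype(str) + "_Date_" + df["verbatim_date"]
df["occurrenceID"] = df["eventID"] + "_" + df["scientificName"].str.replace(" ", "_")

print(df[["eventID", "occurrenceID"]])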

It is very important that these IDs do not change over time. So if an ID for a museum specimen is built from e.g. the institution the specimen is being held at, but then the specimen changes institutions - its ID should not change to reflect the move. If the ID changes then the record will be duplicated in the global database and record information could be lost over time.

Exercise Time!

Now, let’s try a practical use case where a bird watchers’ group sends you their data. See the explanations.

1. Initial checks

  1. yes
  2. yes
  3. not everything is crystal clear
  4. Metadata is not complete
  5. Event Core
  6. Occurrence and MeasurementOrFact
  7. Yes
  8. Maybe geodeticDatum, coordinatesUncertainty…
  9. Yes into event, occurrence and measurement
  10. Occurrence identifiers are missing

2. Data cleaning

  • scientificName shall appear on each row
  • Colors, if meaningful, shall be added as a column
  • Missing data should be null
  • occurrenceIDs are missing; we suggest using the spreadsheet row ID
  • Incorrect eventIDs shall be removed or corrected

3. DarwinCore mapping

  • Original data shall be organized into event, occurrence, and measurement tables
  • Metadata should be more elaborate: taxonomic, geographic, and temporal scope…
  • A license/waiver should be selected
  • Field names should correspond to Darwin Core terms

Key Points

  • The Darwin Core star schema, with a core and extensions, models the multitude of biological observation data.

  • Identifier fields are important keys in your data; we recommend building them from the information in your data.

  • Each dataset class has minimum required data fields.

  • Core entities are linked with the extensions through these identifiers.


Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

xkcd 1179 (iso 8601) Image credit: xkcd

Key Points


Metadata

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • How are my data published?

  • What metadata are required for publishing?

Objectives
  • Showing data publishing pipeline

  • Introducing the IPT

  • Introduction to EML

Integrated Publishing Toolkit

The Integrated Publishing Toolkit (IPT) is an open-source web application developed and maintained by the Global Biodiversity Information Facility (GBIF) for publishing biodiversity data. The IPT makes it easy to share four types of biodiversity-related information:

GBIF maintains a very detailed IPT manual. The Croatian IPT is available here.

The requirements for publishing data through your node IPT are that:

  1. you have contacted the node to ensure the data are a good fit
  2. the data follow Darwin Core (DwC) and Ecological Metadata Language (EML)
  3. the data include the required Darwin Core and EML metadata elements

Presentation

Ecological Metadata Language (EML)

Both OBIS and GBIF use Ecological Metadata Language (EML) as the metadata standard associated with the data. For the purposes of this workshop we will not dive into the world of EML. However, we should note that when publishing your data through the IPT, the IPT helps you create an EML file as part of the Darwin Core Archive (DwC-A). As such, if you publish your own data through the IPT, there is no need for in-depth knowledge of the EML format. There is, however, a minimum set of fields that need to be filled out in the IPT: title, abstract, citation, and several contacts.

More information on EML can be found at the EML standard page, and in the bio data guide. There are also a number of R packages for working with EML, reviewed here.

Tip

Try to collect as much of this information as possible before and during the Darwin Core alignment process. It will significantly reduce the amount of time it takes to load the data into the IPT.

Required EML metadata fields for sharing to GBIF

Best practices for these fields are explained in detail in the GBIF IPT user manual (Resource metadata). Simply use the IPT’s built-in metadata editor to populate the metadata.

IPT/EML Fields Definition Comment
title A good descriptive title is indispensable and can provide the user with valuable information, making the discovery of data easier. The IPT also requires you to provide a Shortname. Shortnames serve as an identifier for the resource within the IPT installation and should be unique, descriptive and short (max. 100 characters). Spell out acronyms in Title but they are ok to use in the shortname.
description The abstract or description of a dataset provides basic information on the content of the dataset. The information in the abstract should improve understanding and interpretation of the data.  
license The licence that you apply to the resource. The license provides a standardized way to define appropriate uses of your work. Must use CC-0, CC-BY, or CC-BY-NC. Description of the licenses can be found here.
resource Contact(s) The list of people and organizations that should be contacted to get more information about the resource, that curate the resource, or to whom putative problems with the resource or its data should be addressed. Last name, Position, and Organization are required; it is helpful to include an ORCID and a contact method such as an email address or phone number.
resource Creator(s) The people and organizations who created the resource, in priority order. The list will be used to auto-generate the resource citation (if auto-generation is turned on).  
metadata Provider(s) The people and organizations responsible for producing the resource metadata.  
publishing organisation The organization that publishes the data, i.e. the data publisher.

Other EML fields to consider

IPT/EML Fields Definition Comment
Bounding Box Farthest North, South, East, and West coordinate.  
Geographic Description A textual description of the geographic coverage.  
Temporal Coverage This can either be a Single Date, Date Range, Formation Period, or Living Time Period.  
Study Extent This field represents both a specific sampling area and the sampling frequency (temporal boundaries, frequency of occurrence) of the project.  
Sampling Description This field allows for a text-based/human readable description of the sampling procedures used in the research project. The content of this element would be similar to a description of sampling procedures found in the methods section of a journal article.
Step Description This field allows for repeated sets of elements that document a series of methods and procedures used in the study, and the processing steps leading to the production of the data files. These include e.g. text descriptions of the procedures, relevant literature, software, instrumentation and any quality control measurements taken. Each method should be described in enough detail to allow other researchers to interpret and repeat the study, if required.
citation To ensure your dataset gets cited the way you want  

Exercises 1: Create an ‘imaginary’ dataset in the Croatian IPT

  1. Go to the Croatian ‘test’ IPT instance on ipt.bioportal.hr
  2. Log in to the IPT instance; you can log in with your email address and WelcomCroMent iptlogin
  3. Click on manage resources iptmanage
  4. Click on Create New and choose the type of your dataset (here, choose occurrence) iptcreate
  5. Give a shortname for your resource; the shortname serves as an identifier for the resource and will be used as a parameter in the URL iptshortname
  6. Click on Create

Solution

  1. You have created your first resource on the IPT

Exercises 2: Create ‘imaginary’ metadata for your dataset

  1. Go to the Croatian ‘test’ IPT instance
  2. Login
  3. Click on Manage resources
  4. Click on your ‘imaginary’ dataset iptimaginary
  5. Click on edit in the Metadata section iptmetadata
  6. Complete the metadata wizard, starting by providing a title for your dataset iptwizzard

Solution

  1. Congratulations, you added metadata to your dataset

Exercises 3: Add source data and Darwin Core mappings to your dataset

  1. Go to the Croatian ‘test’ IPT instance
  2. Login
  3. Click on Manage resources
  4. Click on your ‘imaginary’ dataset
  5. Click on add in the Source data section source
  6. Choose your source data:
    • A file (choose occurrencememo.csv if you don’t have an ‘imaginary’ dataset)
    • A URL
    • An SQL statement
  7. Click on add in the Darwin Core Mappings section mapping
  8. Click again on add, and make sure Darwin Core Occurrence is selected
  9. Select the source ‘occurrencememo’ and click save
  10. Your data is automapped to Darwin Core, you can click on save

Solution

  1. Congratulations, you or your node manager can publish this dataset after validation

Datapapers

datapaper

Tip

  • In some cases you’ll want to ensure the values are representative of the entity you are reporting.
  • For example, individualCount should be an integer, so checking that column for integer values would be good (a short sketch follows below).
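
A minimal sketch (Python/pandas, with an invented individualCount column) of flagging values that are not whole numbers:

import pandas as pd

# Example column with one problematic value (values are invented)
df = pd.DataFrame({"individualCount": [1, 3, 2.5, 7]})

# Rows where individualCount is not an integer
print(df.loc[df["individualCount"] % 1 != 0])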

Key Points

  • The IPT is a well-documented and flexible system for publishing data to GBIF (and OBIS)

  • Some Darwin Core and Ecological Metadata Language fields are required for publishing to GBIF (and OBIS).

  • Strive to write more than the minimal metadata


Data Validation & GBIF Validator

Overview

Teaching: 0 min
Exercises: 30 min
Questions
  • How to quality check my data before publication?

Objectives
  • Showing GBIF data validator

Data enhancement and quality control

Data validation with GBIF data validation tool

  • Check that all the required Darwin Core terms are present and contain the correct information.
  • Make a map from your data to ensure the coordinates are valid and within your expected range.
  • Look at unique values of columns containing string entries to identify potential issues (e.g. spelling).
  • Check for uniqueness of occurrenceID field. dwc:occurrenceID
  • Check for uniqueness of eventID for each event, if applicable. dwc:eventID
  • Check that dates follow ISO 8601 (see the sketch below). dwc:eventDate
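
A minimal, hedged sketch (Python/pandas; the file name and column names are assumptions) of scripting two of the checks above, the uniqueness of occurrenceID and the ISO 8601 date format:

import pandas as pd

occ = pd.read_csv("occurrence.csv")  # assumed file name and column names

# occurrenceID must be unique across the whole file
duplicated = occ.loc[occ["occurrenceID"].duplicated(keep=False)]
print("Rows with a duplicated occurrenceID:", len(duplicated))

# eventDate should follow ISO 8601 (format="ISO8601" requires pandas >= 2.0)
parsed = pd.to_datetime(occ["eventDate"], format="ISO8601", errors="coerce")
print("eventDate values that did not parse as ISO 8601:")
print(occ.loc[parsed.isna(), "eventDate"])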

GBIF data validator

GBIF Data validator

One fairly simple method for data validation is the GBIF data validator.

What is the GBIF data validator?

The GBIF data validator is a service that allows anyone with a GBIF-relevant dataset to receive a report on the syntactical correctness and the validity of the content contained within the dataset. By submitting a dataset to the validator, you can go through the validation and interpretation procedures usually associated with publishing in GBIF and quickly determine potential issues in data - without having to publish it.

How does it work?

You start by uploading the dataset file to the validator, either by 1) clicking SELECT FILE and selecting it on your local computer or 2) dragging the file from a local folder and dropping it on the Drop here icon. You can also enter the URL of a dataset file accessible from the internet. This is particularly useful for larger datasets. Once you hit the Submit button, the validator starts processing your dataset file. You will be taken straight to a page showing the status of the validation.

Depending on the size of your dataset, processing might take a while. You don’t have to keep the browser window open, as a unique job ID is issued every time a new validation process is started. If your dataset is taking too long to process, just save the ID (bookmark the URL) and use it to return at a later time to view the report. We’ll keep the report for a month, during which you can come back whenever you like.

Which file types are accepted?

  • ZIP-compressed Darwin Core Archives (DwC-A) containing an Occurrence, Taxon, or Event core
  • Integrated Publishing Toolkit (IPT) Excel templates containing Checklist, Occurrence, or Sampling-event data
  • Simple CSV files containing Darwin Core terms in the first row

What information will I get from the validation report?

Once processing is done, you will be able to see the validation report containing the following information:

You will also be able to view the metadata as a draft version of the dataset page as it would appear when the dataset is published and registered with GBIF.

I’ve got the validation report - now what?

If the validator finds that your dataset cannot be indexed by GBIF, you should address the issues raised by the validation report before you consider publishing it to GBIF. On the other hand, if you get the green light and your dataset is indexable by GBIF, you should still carefully review any issues that may be the result of e.g. conversion errors, which could affect the quality of the data. If you find and correct any error - from a single typo to large systematic problems - feel free to resubmit your dataset as many times as you like.

Technical details

As with all GBIF tools and software, the data validator is an open source project. More information, source code and documentation are available in a GitHub repository.

Exercises 1: Check a Dwc-a (Darwin Core Archive) in the GBIF validator

The GBIF validator is simple and easy to use.

  1. Go to the GBIF validator website
  2. Drop your DwC-A file (you can find an example file here: dwc-a sample file), or use the DwC-A you published before in the IPT.
  3. Check the GBIF report on your dataset.
  4. Is your data ready to be published by GBIF?

Solution

  1. screenshot
  2. Check the validation status of your file

Hmisc Describe (optional)

Another method for reviewing your data is to use the R package Hmisc and its describe function. Expand the example below, using output from this notebook, to see how it works.

Hmisc::describe

# pull in the occurrence file from https://www.sciencebase.gov/catalog/item/53a887f4e4b075096c60cfdd
url <- "https://www.sciencebase.gov/catalog/file/get/53a887f4e4b075096c60cfdd?f=__disk__32%2F24%2F80%2F322480c9bcbad19030e29c9ec5e2caeb54cb4a08&allowOpen=true"

occurrence <- read.csv(url)

head(occurrence,n=1)       
  vernacularName                                eventID occurrenceStatus
1  Alligator gar Station_95_Date_09JAN1997:14:35:00.000           Absent
     basisOfRecord       scientificName
1 HumanObservation Atractosteus spatula
                           scientificNameID  kingdom   phylum       class
1 urn:lsid:marinespecies.org:taxname:279822 Animalia Chordata Actinopteri
             order        family        genus scientificNameAuthorship
1 Lepisosteiformes Lepisosteidae Atractosteus        (Lacepède, 1803)
  taxonRank organismQuantity organismQuantityType
1   Species                0   Relative Abundance
                                                 occurrenceID
1 Station_95_Date_09JAN1997:14:35:00.000_Atractosteus_spatula
         collectionCode
1 Aransas Bay Bag Seine

Hmisc::describe(occurrence)
occurrence 

 18  Variables      334341  Observations
--------------------------------------------------------------------------------
vernacularName
       n  missing distinct
  334341        0       61

lowest : Alligator gar        Arrow shrimp         Atlantic brief squid Atlantic bumper      Atlantic croaker
highest: Striped mullet       Thinstripe hermit    Threadfin shad       White mullet         White shrimp
--------------------------------------------------------------------------------
eventID
       n  missing distinct
  334341        0     5481

lowest : Station_10_Date_04DEC1991:13:59:00.000 Station_10_Date_04SEP2002:13:17:00.000 Station_10_Date_05JUN1991:15:20:00.000 Station_10_Date_07APR1995:12:54:00.000 Station_10_Date_07APR2000:11:16:00.000
highest: Station_99_Date_21APR1998:18:24:00.000 Station_99_Date_22OCT2001:13:12:00.000 Station_99_Date_25JUN1990:13:48:00.000 Station_99_Date_25NOV2003:11:11:00.000 Station_99_Date_27JUN1988:12:45:00.000
--------------------------------------------------------------------------------
occurrenceStatus
       n  missing distinct
  334341        0        2

Value       Absent Present
Frequency   294469   39872
Proportion   0.881   0.119
--------------------------------------------------------------------------------
basisOfRecord
               n          missing         distinct            value
          334341                0                1 HumanObservation 

Value      HumanObservation
Frequency            334341
Proportion                1
--------------------------------------------------------------------------------
scientificName
       n  missing distinct
  334341        0       61

lowest : Adinia xenica               Anchoa mitchilli            Archosargus probatocephalus Ariopsis felis              Atractosteus spatula
highest: Stomatopoda                 Stomolophus meleagris       Syngnathus scovelli         Tozeuma carolinense         Trichiurus lepturus
--------------------------------------------------------------------------------
scientificNameID 
       n  missing distinct
  334341        0       61

lowest : urn:lsid:marinespecies.org:taxname:105792 urn:lsid:marinespecies.org:taxname:107034 urn:lsid:marinespecies.org:taxname:107379 urn:lsid:marinespecies.org:taxname:126983 urn:lsid:marinespecies.org:taxname:127089
highest: urn:lsid:marinespecies.org:taxname:367528 urn:lsid:marinespecies.org:taxname:396707 urn:lsid:marinespecies.org:taxname:421784 urn:lsid:marinespecies.org:taxname:422069 urn:lsid:marinespecies.org:taxname:443955

--------------------------------------------------------------------------------
kingdom
       n  missing distinct    value
  334341        0        1 Animalia

Value      Animalia
Frequency    334341
Proportion        1
--------------------------------------------------------------------------------
phylum
       n  missing distinct
  328860     5481        4

Value      Arthropoda   Chordata   Cnidaria   Mollusca
Frequency       71253     246645       5481       5481
Proportion      0.217      0.750      0.017      0.017
--------------------------------------------------------------------------------
class
       n  missing distinct
  328860     5481        5

lowest : Actinopteri    Cephalopoda    Elasmobranchii Malacostraca   Scyphozoa     
highest: Actinopteri    Cephalopoda    Elasmobranchii Malacostraca   Scyphozoa

Value         Actinopteri    Cephalopoda Elasmobranchii   Malacostraca
Frequency          235683           5481          10962          71253
Proportion          0.717          0.017          0.033          0.217

Value           Scyphozoa
Frequency            5481
Proportion          0.017
--------------------------------------------------------------------------------
order
       n  missing distinct
  328860     5481       22

lowest : Atheriniformes            Batrachoidiformes         Carangaria incertae sedis Carangiformes             Carcharhiniformes        
highest: Rhizostomeae              Scombriformes             Siluriformes              Syngnathiformes           Tetraodontiformes
--------------------------------------------------------------------------------
family
       n  missing distinct
  328860     5481       36

lowest : Ariidae        Atherinopsidae Batrachoididae Carangidae     Carcharhinidae
highest: Stromateidae   Syngnathidae   Tetraodontidae Trichiuridae   Triglidae
--------------------------------------------------------------------------------
genus
       n  missing distinct
  328860     5481       52

lowest : Adinia       Anchoa       Archosargus  Ariopsis     Atractosteus
highest: Sphoeroides  Stomolophus  Syngnathus   Tozeuma      Trichiurus  
--------------------------------------------------------------------------------
scientificNameAuthorship
       n  missing distinct
  328860     5481       52

lowest : (Baird & Girard, 1853)        (Baird & Girard, 1855)        (Blainville, 1823)            (Bosc, 1801)                  (Burkenroad, 1939)
highest: Rathbun, 1896                 Say, 1817 [in Say, 1817-1818] Shipp & Yerger, 1969          Valenciennes, 1836            Winchell, 1864
--------------------------------------------------------------------------------
taxonRank
       n  missing distinct
  334341        0        3

Value        Genus   Order Species
Frequency     5481    5481  323379
Proportion   0.016   0.016   0.967
--------------------------------------------------------------------------------
organismQuantity
       n  missing distinct     Info     Mean      Gmd      .05      .10
  334341        0     8696    0.317  0.01639  0.03141  0.00000  0.00000
     .25      .50      .75      .90      .95
 0.00000  0.00000  0.00000  0.01005  0.07407

lowest : 0.0000000000 0.0000917684 0.0001835370 0.0002136300 0.0002241650
highest: 0.9969931270 0.9974226800 0.9981570220 0.9982300880 1.0000000000
--------------------------------------------------------------------------------
organismQuantityType
       n  missing distinct              value
  334341        0        1 Relative Abundance

Value      Relative Abundance
Frequency              334341
Proportion                  1
--------------------------------------------------------------------------------
collectionCode
       n  missing distinct                 value
  334341        0        1 Aransas Bay Bag Seine

Value      Aransas Bay Bag Seine
Frequency                 334341
Proportion                     1
--------------------------------------------------------------------------------

Exercise

Challenge: Perform the following minimal quality assurance and control checks:

  1. Run a diagnostics report for the data quality.
  2. Ensure that the eventIDs are unique.
  3. Make sure that the eventDates follow ISO-8601 standards.
  4. Determine whether reported depths are accurate.

The event core data used in the checks below can be found in this Excel file.

Solution in R

Install the obistools R package. Use the readxl package to read the Excel file.

  1. Run a diagnostics report for the data quality
library(readxl)
library(obistools)

trawl_fish <- readxl::read_excel('data/trawl_fish.xlsx')
report <- obistools::report(trawl_fish)
report

drawing

  1. Check to make sure eventIDs are unique
    eventid <- obistools::check_eventids(trawl_fish)
    head(eventid)
    
    # A tibble: 6 x 4
     field         level   row message                                                    
     <chr>         <chr> <int> <chr>                                                      
     1 eventID       error     7 eventID IYS:GoA2019:Stn6:trawl is duplicated               
     2 eventID       error     8 eventID IYS:GoA2019:Stn6:trawl is duplicated               
     3 parentEventID error     1 parentEventID IYS:GoA2019:Stn1 has no corresponding eventID
     4 parentEventID error     2 parentEventID IYS:GoA2019:Stn2 has no corresponding eventID
     5 parentEventID error     3 parentEventID IYS:GoA2019:Stn3 has no corresponding eventID
     6 parentEventID error     4 parentEventID IYS:GoA2019:Stn4 has no corresponding eventID
    
  2. Check for proper eventDate to ensure they follow ISO 8601 standards:
     eventDate <- obistools::check_eventdate(trawl_fish)
     print(eventDate)  
    
     # A tibble: 3 x 4
      level   row field     message                                                       
      <chr> <int> <chr>     <chr>                                                         
     1 error    10 eventDate eventDate 2019-02-24T07u40 does not seem to be a valid date   
     2 error    13 eventDate eventDate 2019-02-25 11h25min does not seem to be a valid date
     3 error    15 eventDate eventDate 2019-26-2 does not seem to be a valid date    
    
  3. From the report generated under exercise 1, you can already see that there are measurements made on land. This information can also be gathered by plotting the map separately or using the check_onland() or check_depth() functions in the obistools package.
     depth <- obistools::check_depth(trawl_fish)
     onland <- obistools::check_onland(trawl_fish) # Gives the same output.           
     print(depth)  
    
     # A tibble: 1 x 16
      eventID parentEventID eventDate  year month   day decimalLatitude decimalLongitude footprintWKT coordinateUncer~ minimumDepthInM~
      <chr>   <chr>         <chr>     <dbl> <dbl> <dbl>           <dbl>            <dbl> <chr>                   <dbl>            <dbl>
     1 IYS:Go~ IYS:GoA2019:~ 2019-02-~  2019     2    22            67.4            -140. LINESTRING ~            2313.                0
     # ... with 5 more variables: maximumDepthInMeters <dbl>, samplingProtocol <chr>, locality <chr>, locationID <chr>, type <chr>    
    

Solution in Python

Install the pandas, cartopy, and geopandas Python packages. Use pandas to read the Excel file.

import pandas as pd
url = 'https://ioos.github.io/bio_mobilization_workshop/data/trawl_fish.xlsx'
df = pd.read_excel(url) # might need to install openpyxl
df['row'] = df.index.to_numpy()+1 # python starts at zero
  1. Run a diagnostics report for the data quality.
    import cartopy.io.shapereader as shpreader
    import geopandas as gpd
    import shapely.geometry as sgeom
    from shapely.ops import unary_union
    from shapely.prepared import prep
    import matplotlib.pyplot as plt
       
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.decimalLongitude, df.decimalLatitude))
       
    land_shp_fname = shpreader.natural_earth(resolution='50m',
                                           category='physical', name='land')
       
    land_geom = unary_union(list(shpreader.Reader(land_shp_fname).geometries()))
    land = prep(land_geom)
       
    for index, row in gdf.iterrows():
        gdf.loc[index, 'on_land'] = land.contains(row.geometry)
       
    fig, axs = plt.subplots(ncols=1,nrows=2)
    # Make a map:
    xlim = ([gdf.total_bounds[0]-2,  gdf.total_bounds[2]+2])
    ylim = ([gdf.total_bounds[1]-2,  gdf.total_bounds[3]+2])
    
    axs[0].set_xlim(xlim)
    axs[0].set_ylim(ylim)
    
    gpd.read_file(land_shp_fname).plot(ax=axs[0])
       
    gdf.loc[gdf['on_land']==False].plot(ax=axs[0], color='green', markersize=1)
    gdf.loc[gdf['on_land']==True].plot(ax=axs[0], color='red', markersize=1)
       
    # Collect some informational material about potential issues w/ data:
    invalid_coord = []
    if len(gdf.loc[gdf['on_land']==True]) > 0:
       invalid_coord.append('Row {} coordinates on land.'.format(gdf.loc[gdf['on_land'] == True,'row'].tolist()[0]))
        
    req_cols = ['eventDate', 'decimalLongitude', 'decimalLatitude', 'scientificName', 'scientificNameID', 'occurrenceStatus', 'basisOfRecord']
    missing_cols = []
    for col in req_cols:
     if col not in gdf.columns:
       missing_cols.append('Column {} is missing.'.format(col))
       
    # Add the information to the figure
    axs[1].text(0.25,0.25,'\n'.join(['\n'.join(missing_cols),'\n'.join(invalid_coord)]))
    axs[1].axis('off')
    plt.show()
    

    drawing

  2. Check to make sure eventIDs are unique
    dup_events = df.loc[df['eventID'].duplicated()]
    print('Duplicated eventID:\n',dup_events[['eventID','row']])
       
    parent_not_event = df.loc[~df['parentEventID'].isin(df['eventID'].unique())]
    print('\nparentEventID missing corresponding eventID:\n', parent_not_event[['parentEventID','row']])
    
    Duplicated eventID:
                       eventID  row
    6  IYS:GoA2019:Stn6:trawl    7
    7  IYS:GoA2019:Stn6:trawl    8
       
    parentEventID missing corresponding eventID:
             parentEventID  row
    0    IYS:GoA2019:Stn1    1
    1    IYS:GoA2019:Stn2    2
    2    IYS:GoA2019:Stn3    3
    3    IYS:GoA2019:Stn4    4
    4    IYS:GoA2019:Stn5    5
    ..                ...  ...
    59  IYS:GoA2019:Stn60   60
    60  IYS:GoA2019:Stn61   61
    61  IYS:GoA2019:Stn62   62
    62  IYS:GoA2019:Stn63   63
    63  IYS:GoA2019:Stn64   64
    [64 rows x 2 columns]
    
  3. Check for proper eventDate to ensure they follow ISO 8601 standards:
    for date in df['eventDate']:
        try:
            pd.to_datetime(date)
        except:
            print("Date",date,"might not follow ISO 8601")
    
  4. From the report generated under exercise 1, you can already see that there are measurements made on land. Now let’s check that the depths are within reason for the points. Let’s use the GEBCO bathymetry dataset served in the coastwatch ERDDAP.
    import time
    import numpy as np
       
    df['bathy'] = np.nan # initialize column
       
    for index, row in df.iterrows():
        base_url = 'https://coastwatch.pfeg.noaa.gov/erddap/griddap/GEBCO_2020.csvp?'
        query_url = 'elevation%5B({})%5D%5B({})%5D'.format(row['decimalLatitude'],row['decimalLongitude'])
        url = base_url+query_url
        bathy = pd.read_csv(url)
        df.at[index,'bathy'] = bathy['elevation (m)'].values[0] # insert bathymetry value
        time.sleep(0.5) # to not ping erddap too much
       
    # make new column for depth in meters as negative because GEBCO is Elevation relative to sea level
    df['neg_maximumDepthInMeters'] = -1*df['maximumDepthInMeters']
       
    print('maximumDepthInMeters deeper than GEBCO bathymetry:')
    if len( df.loc[df['neg_maximumDepthInMeters'] < df['bathy']] ) > 0:
       print(df.loc[df['neg_maximumDepthInMeters'] < df['bathy']].T)
    else:
       print('None')
    
    maximumDepthInMeters deeper than GEBCO bathymetry:
                                                                                  4
    eventID                                                  IYS:GoA2019:Stn5:trawl
    parentEventID                                                  IYS:GoA2019:Stn5
    eventDate                             2019-02-22T09:49:00Z/2019-02-22T10:49:00Z
    year                                                                       2019
    month                                                                         2
    day                                                                          22
    decimalLatitude                                                       67.399004
    decimalLongitude                                                    -139.552501
    footprintWKT                   LINESTRING ( -139.583 67.397 , -139.522 67.401 )
    coordinateUncertaintyInMeters                                       2313.094678
    minimumDepthInMeters                                                          0
    maximumDepthInMeters                                                       33.2
    samplingProtocol                                                 midwater trawl
    locality                                                                    NaN
    locationID                                                                  NaN
    type                                                             midwater trawl
    row                                                                           5
    bathy                                                                     306.0
    neg_maximumDepthInMeters                                                  -33.2
    

Data Publishing Pipeline

After going through QAQC and being standardized to Darwin Core, the dataset is uploaded to an IPT. Metadata is added in the form of EML and the dataset is published as a Darwin Core Archive (DwC-A). The data are then pushed to central OBIS. Each dataset also has the option of being pushed to GBIF through the OBIS IPT.

Image by Enrique Montes Data publishing pipeline. Image credit: Enrique Montes

Key Points

  • Use at least the GBIF data validator before you publish data on the network


Continuing the Conversation

Overview

Teaching: 0 min
Exercises: 120 min
Questions
  • How do I continue my work after the workshop?

  • Where do I go to look for help with DwC alignment?

  • How do I provide feedback about this workshop?

Objectives
  • Complete the (optional) post-workshop survey.

Thank you for attending the workshop!

We hope that you were able to register with GBIF and submit some of your data. If not, this is just the beginning, and the work needs to continue beyond the workshop. The national Node and the entire GBIF community will help you to succeed.

“Creativity comes from applying things you learn in other fields to the field you work in.” Aaron Swartz

Post-Workshop Survey

If you wish to provide feedback please use this post-workshop survey.


GBIF’s Technical Support Hour

The theme for the March session of the Technical Support Hour for nodes is GBIF’s data quality workflow. We will go through how published data are processed in terms of quality checks, show how you can get an overview of the flags and issues of datasets, how users provide publicly accessible feedback, and how you can navigate that feedback yourself.

Registration

The event will take place on the 6th of March 2024 at 4pm CET (UTC+1)


Further readings

This section covers some useful links, grouped by topic.

GBIF documentation

Key documents

Other useful resources

On Data Standards

Well established ones

Emerging ones

On Georeferencing

On Persistent identifiers

Key Points


How to publish biodiversity data through GBIF.org

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • What is IPT for the GBIF node

  • How is IPT organized

Objectives
  • Understand how IPT works.

GBIF—the Global Biodiversity Information Facility—is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.

Presentation: Data Publication workflow ‘generic’

GBIF supports publication, discovery and use of four classes of data:

IPT manual

At the simplest, GBIF enables sharing information describing a biodiversity data resource – even when no further digital information is currently available from the resource. Other data classes support an increasingly richer and wider range of information on species, their distributions and abundance.

Data publishers are strongly encouraged to share their data using the richest appropriate data class. This maximizes the usefulness of the data for users.

To give yourself an introduction to how the IPT can be used to publish biodiversity data through GBIF.org, we highly recommend watching the concise 25-minute live demo below:

Alternate Text

Announcement

Prerequisites

You require an account on a GBIF Integrated Publishing Toolkit (IPT) to publish your data.

Hint: it is highly recommended that you save yourself time and money by requesting an account on an IPT located at a data hosting centre in your country or community.

Hint: you could install and maintain your own IPT instance if you have the technical skills and capacity to keep it online nearly 100% of the time.

Hint: if no data hosting centre exists in your country, and you or your organization don’t have the technical skills and capacity to host an IPT, you can contact the GBIF Helpdesk helpdesk@gbif.org for assistance.

Assuming that you would like to register your dataset with GBIF and make it globally discoverable via GBIF.org, your dataset must be affiliated with an organization that is registered with GBIF.

Hint: to register your organization with GBIF, start by completing this online questionnaire. The registration process can take days, so in parallel you can proceed to publish your data.

Hint: if you aren’t affiliated with any organization, you can contact the GBIF Helpdesk helpdesk@gbif.org for assistance. In the meantime, you can proceed to publish your data.

Instructions

To publish your data, follow the 7 steps below.

screenshot

1. Select the class of biodiversity data you have from this list:

2. Transform your data into a table structure, using Darwin Core (DwC) terms as column names

Hint: try using an Excel template to structure your data, and understand what DwC terms are required and recommended (Excel templates for each dataset class are available in the above links - see the previous point)

Hint: it is possible to use data stored in a supported database (a scripted alternative is sketched below)
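
If you prefer scripting over a spreadsheet, here is a minimal sketch (Python/pandas, with invented source column names and values) of renaming source columns to Darwin Core terms; which terms are required or recommended for your dataset class comes from the templates linked above.

import pandas as pd

# Source spreadsheet with local column names (names and values are invented)
df = pd.DataFrame({
    "species": ["Atractosteus spatula"],
    "lat": [-17.8661],
    "lon": [-149.6537],
    "date": ["1997-01-09"],
})

# Rename columns to the corresponding Darwin Core terms
dwc = df.rename(columns={
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
})

dwc.to_csv("occurrence.csv", index=False)  # ready to upload to the IPT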

3. Upload your data to the IPT

Hint: refer to other sections of this manual for additional guidance, such as the Manage Resources Menu section.

4. Map the data (e.g. Checklist Data gets mapped to the Taxon Core, Occurrence Data gets mapped to the Occurrence Core, Sampling Event Data gets mapped to the Event Core).

5. Fill in resource metadata using the IPT’s metadata editor

6. Publish the dataset (make it freely and openly available worldwide)

7. Register the dataset with GBIF.

Your organization must be registered with GBIF (see prerequisite 2 above) and added to your IPT by the IPT administrator. Otherwise, the organization will not be available to choose from in the IPT.

Exercises 1: Publish this occurrence dataset (dwc-a) on the Croatian IPT ipt.bioportal.hr

Most of the work in publishing data lies in the data cleaning, the mapping, and the description of the dataset. Once a Darwin Core Archive has been generated, it is fairly simple to publish it again, for example on another IPT.
Publish this dataset, already published by the Croatian Faculty of Science (which is already a GBIF data publisher) on the GBIF ECA Cloud IPT, again on the Croatian IPT. Make sure you are logged in on the IPT instance.
You should have received a password and a login for the Croatian IPT instance.

Solution

  1. download the DwC-A file here
  2. Go to the tab manage resources
  3. create a new dataset Create new dataset
  4. provide a new shortname
  5. Choose Import from an archived resource import
  6. Choose the Dwc-a file
  7. Click save
  8. If everything went correctly, your metadata and data are correctly mapped in the IPT and ready to publish.
  9. Click publish to finish this exercise

Exercises 2: Publish this occurrence dataset on the Croatian IPT ipt.bioportal.hr

Unfortunately, in most cases you will not have a DwC-A file available, meaning that you will need to create one together with the researcher or person who would like to publish their data to GBIF.
The IPT is a good tool to create DwC archives. (There are also other tools available, here for example, but we do not recommend them.)
For this exercise we prepared all the files needed to generate a DwC-A. Make sure you are logged in on the IPT instance.
You should have received a password and a login for the Croatian IPT instance.
You can find an occurrence file here
You can find the metadata here. Copy and paste only the minimal set of information into the right place in the IPT.

Solution

  1. download the occurrence file
  2. go to the tab manage resources
  3. create a new dataset Create new dataset
  4. provide a new shortname
  5. select type occurrence and push create
  6. deal with source data, Darwin Core mappings and metadata (tip: see the metadata & data validation sessions)
  7. publish your dataset
  8. change visibility to public
  9. register your dataset (not needed in this exercise)
  10. Click publish to finish this exercise

Exercises 3: Publish this sample based dataset dataset on the Croatian IPT ipt.bioportal.hr

Unfortunately, in most cases you will not have a DwC-A file available, meaning that you will need to create one together with the researcher or person who would like to publish their data to GBIF.
The IPT is a good tool to create DwC archives. For this exercise we prepared all the files needed to generate a DwC-A. Make sure you are logged in on the IPT instance.
You should have received a password and a login for the Croatian IPT instance.
You can find an occurrence file here occurrence
You can find the event file here event
You can find the metadata here. Copy and paste only the minimal set of information into the right place in the IPT.

Solution

  1. go to the tab manage resources
  2. create a new dataset Create new dataset
  3. provide a new shortname
  4. select type sampling event and push create
  5. deal with source data add both files to the IPT
  6. deal with darwin core mappings for the occurrence file
  7. deal with darwin core mappings for the event file
  8. deal with metadata also here, only copy paste the minimum needed
  9. publish your dataset
  10. change visibility to public
  11. register your dataset (not needed in this exercise)
  12. Click publish to finish this exercise

Exercises 4: Publish this checklist dataset dataset on the Croatian IPT ipt.bioportal.hr

Now we will publish a checklist dataset on the IPT. A checklist is a third type of dataset you can publish on the Global Biodiversity Information Facility. A checklist does not have occurrences as its core file; instead, the species (the taxon) is at the centre of the star schema. For this exercise we prepared all the files needed to generate a DwC-A. Make sure you are logged in on the IPT instance.
You should have received a password and a login for the Croatian IPT instance.
You can find all the needed data here: TrIAS. The TrIAS checklist is a live ‘checklist’ which is regularly updated through GitHub actions and an automatic update function in the IPT.
You can download the needed files from GitHub. If you want to make sure your published dataset is always up to date, you can use the raw online files as a source file: raw Github content. For this checklist, we have a taxon, description, distribution, speciesprofile and references file. Only use (download) the taxon, description and speciesprofile files for this exercise. You can find the metadata here. Copy and paste only the minimal set of information into the right place in the IPT.

Solution

  1. go to the tab manage resources
  2. create a new dataset Create new dataset
  3. provide a new shortname
  4. select type checklist and push create
  5. deal with the source data: import all files in the IPT. In the IPT, for the taxon file choose URL as the source data instead of a file and use this URL: raw Github content import
  6. deal with Darwin Core mappings for the taxon file
  7. deal with Darwin Core mappings for the description file
  8. deal with Darwin Core mappings for the species profile file
  9. deal with metadata also here, only copy paste the minimum needed
  10. publish your dataset
  11. change visibility to public
  12. register your dataset (not needed in this exercise)
  13. Click publish to finish this exercise

Key Points

  • IPT is the main tool to publish your data to GBIF


How to become a data publisher for GBIF.org

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • What is a data publisher?

Objectives
  • Understand what it is to be a data publisher.

Becoming a data publisher

1. Secure institutional agreements

Before sharing data through the GBIF network, you should alert the administrators of your institution to your plans to publish on its behalf. Sharing open data will increase the visibility of your institution, building on traditional methods like academic publications and specimen loans to bring new collaboration opportunities and international recognition through DOI-based citations.

2. Request Endorsement

To become a data publisher, your organization must request endorsement from the GBIF community. Once you have reviewed the data publisher agreement and agree in principle to share data, we encourage you to request endorsement for your organization as soon as possible to avoid delays in publishing data.

Data Publisher Agreement

Terms and conditions:

see on GBIF.org

Data user agreement

It is also important for data publishers to carefully read the agreement between GBIF and data users: see on GBIF.org

3. Select your publishing tools and partners

Your data may reach GBIF via different routes and tools. Today, much of the data arrives through an Integrated Publishing Toolkit (IPT) installed at the institution, at the national node, at the GBIF Secretariat (hosted IPT) or elsewhere. The Living Atlases platform, originally developed by the Atlas of Living Australia, also offers data publication.

Other alternatives exist: fully automated publishing based on the GBIF API, or simply putting the files on the web (HTTP installation).
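
For orientation only, the API-based route looks roughly like the sketch below: you create a dataset entry in the GBIF registry and then attach an endpoint pointing to your Darwin Core Archive. This is a minimal sketch against the public GBIF registry API (api.gbif.org), not a recipe to follow as-is: it assumes your organization and technical installation are already registered and endorsed, that your GBIF account has publishing rights for them, that the created dataset key is returned in the response body, and the UUIDs and URLs are placeholders. In practice most publishers let the IPT handle all of this automatically.

```python
import requests

API = "https://api.gbif.org/v1"
AUTH = ("your_gbif_username", "your_password")  # account linked to the publisher (placeholder)

# Placeholder keys: the endorsed organization and the registered installation.
dataset = {
    "title": "Example occurrence dataset",
    "type": "OCCURRENCE",
    "publishingOrganizationKey": "00000000-0000-0000-0000-000000000000",
    "installationKey": "00000000-0000-0000-0000-000000000000",
}

# 1. Create the dataset entry in the registry.
r = requests.post(f"{API}/dataset", json=dataset, auth=AUTH)
r.raise_for_status()
dataset_key = r.json()  # assumption: the new dataset key is returned in the body

# 2. Tell GBIF where to crawl the Darwin Core Archive from.
endpoint = {"type": "DWC_ARCHIVE", "url": "https://example.org/dwca/my-dataset.zip"}
requests.post(f"{API}/dataset/{dataset_key}/endpoint", json=endpoint, auth=AUTH).raise_for_status()

print("Registered dataset", dataset_key)
```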

Publishing

4. Data Management Plan

As a data publisher, you will decide:

We suggest documenting all of this in your institution's or unit's Data Management Plan.

Exercise

Please take some time to answer these questions:

Incentives for publishing open-access biodiversity data

An important part of GBIF’s mission is to promote a culture in which people recognize the benefits of publishing open-access biodiversity data, for themselves as well as for the broader society.

By making your data discoverable and accessible through GBIF and similar information infrastructures, you will contribute to global knowledge about biodiversity, and thus to the solutions that will promote its conservation and sustainable use. Data publishing enables datasets held all over the world to be integrated, revealing new opportunities for collaboration among data owners and researchers. Publishing data enables individuals and institutions to be properly credited for their work to create and curate biodiversity data, by giving visibility to publishing institutions through good metadata authoring. This recognition can be further developed if you author a peer-reviewed data paper, giving scholarly recognition to the publication of biodiversity datasets. Collection managers can trace usage and citations of digitized data published from their institutions and accessed through GBIF and similar infrastructures. Some funding agencies now require researchers receiving public funds to make data freely accessible at the end of a project.

see How to become a data publisher chapter

Glossary

GBIF Participant
Signatory of the GBIF Memorandum of Understanding (MoU)
GBIF Secretariat
Legal entity empowered by the GBIF Participants to enter into contracts, execute the Work Programme, and maintain the central services for the GBIF network including software components and updates, interfaces, indexing and registry services, helpdesk and relevant training.
GBIF network
The infrastructure consisting of the central services of the GBIF Secretariat, Participant Nodes and Data Publishers. Making data available through the GBIF network means registering and advertising the pertinent services via the GBIF central services.
Participant Node
An organisational unit designated by the GBIF Participant to coordinate activities in its domain. It may also provide data.
Biodiversity data
Primary data on specimens, observations, names, taxonomic concepts, and sites, and other data on biological diversity.
Metadata
Data describing the attributes and combinations of biodiversity data.
Data
Biodiversity data and metadata.
Data publishing
The process of and agreements for making data freely and universally available on the Internet.
Data Publisher
A custodian of data making it technically available. This may or may not be the data owner. If not, they will have declared to GBIF that they have permission to make the data available.
Data User
Anyone who uses the Internet to access data through the GBIF network.
Owner of data
The legal entity possessing the right resulting from the act of creating a digital record. The record may be a product derived from another, possibly non-digital product, which may affect the right.
Sensitive data
Any data that the Data Publisher does not want to make available, e.g. precise localities of endangered species.

Key Points

  • Use the DOI assigned to your dataset for citation and tracking


Coffee break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Key Points


IPT Administration

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How is IPT organized and managed

Objectives
  • Understand how IPT can be managed

presentation IPT admin

Exercises 1: Get familiar with the Croatian IPT

You should already be a little bit familiar with the Croatian IPT. Now log in as an admin and use the force.

  1. Add a logo to your IPT installation
  2. Change the IPT colors to the colors you want
  3. Add a user account with the role Manager and registration rights (you can delete this user if needed)
  4. Any organisations we would like to add?
  5. Make sure all the Core types and extensions are available

Solution

  1. You are now familiar with your IPT instance

Key Points

  • The GBIF portal is the place to find biodiversity data


Lunch Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

lunch

Key Points


The GBIF registry

Overview

Teaching: 30 min
Exercises: 30 min
Questions
Objectives
  • Get used to the GBIF Registry

  • Understand what it can deliver

  • Explore GBIF API, Pygbif, Rgbif

Presentation

This presentation will introduce you to the GBIF Registry (a core component of the GBIF architecture).

Registry


Exercise 1 : Find your organisation in the Registry

Within your browser:

Solution 1


Exercise 2 : use Python or R programming

With Pygbif or Rgbif packages:

Solution 2
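
One possible solution in Python. The pygbif package's registry module wraps the GBIF registry web services; the sketch below queries the same endpoints directly with requests so the parameters stay explicit (the search term "Croatia" is just an example, not the only correct answer):

```python
import requests

API = "https://api.gbif.org/v1"

# Find an organization in the registry by name (example search term).
orgs = requests.get(f"{API}/organization", params={"q": "Croatia"}).json()
for org in orgs["results"][:5]:
    print(org["key"], org["title"])

# List the datasets published by the first match.
if orgs["results"]:
    key = orgs["results"][0]["key"]
    datasets = requests.get(f"{API}/dataset/search", params={"publishingOrg": key}).json()
    print(f"{datasets['count']} dataset(s) published by this organization")
    for d in datasets["results"]:
        print("-", d["title"])
```

In rgbif, the equivalent registry functions exist as well; the point of the exercise is to see that everything shown in the Registry web interface is also available through these web services.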

Key Points

  • GBIF Registry

  • Pygbif & Rgbif

  • Webservices API


GBIF GRSciColl

Overview

Teaching: 60 min
Exercises: 0 min
Questions
  • Why is GRSciColl important?

  • How can I help?

Objectives
  • Discover GRSciColl

  • Help curation of Scientific Collections

GRSciColl

The Global Registry of Scientific Collections is a comprehensive, community-curated repository of information about scientific collections that extends work initially started by the Consortium for the Barcode of Life (CBOL).

Global Registry of Scientific Collections (GRSciColl) from GBIF on Vimeo.

In this issue of the support hour, the Data Products team will give you an overview of the Global Registry of Scientific Collections (GRSciColl): how to edit it in the interface or with the Collections registry API, and how the occurrences published on GBIF are linked to GRSciColl entries.
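
If you prefer to explore GRSciColl programmatically, it is exposed under the same public GBIF API. A minimal sketch, assuming the /v1/grscicoll/institution listing endpoint and a q search parameter (the search term is only an example):

```python
import requests

API = "https://api.gbif.org/v1"

# Search institutions in GRSciColl by name (example term).
resp = requests.get(f"{API}/grscicoll/institution",
                    params={"q": "natural history", "limit": 5}).json()
for inst in resp["results"]:
    print(inst.get("code"), "-", inst.get("name"))
```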

Exercise

Key Points

  • GRSciColl is a central registry maintained by GBIF

  • GRSciColl is a community effort


Coffee break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Key Points


GBIF Hosted Portals

Overview

Teaching: 60 min
Exercises: 0 min
Questions
  • What are GBIF hosted portals

Objectives
  • Understand how GBIF hosted portals work.

Participant nodes play an essential role in promoting the use of biodiversity data mobilized by the GBIF network. To further support their engagement across national, institutional, regional and thematic levels, the GBIF Secretariat has developed a fully hosted service that provides simple, customizable biodiversity data portals to GBIF Participant nodes and their partners.

Each hosted portal is a simple website hosted and maintained on the GBIF infrastructure. The programme complements other tools available through the network by lowering the technical threshold for maintaining a branded web presence displaying a targeted subset of the data and information already available through GBIF.org.

Watch this introduction video:

Alternate Text

What is a hosted portal?

A hosted portal is a simple, branded and fully customizable website that displays a targeted subset of GBIF-mediated data to support Participant nodes and their partners.

This service is designed to support biodiversity data use and engagement at national, institutional, regional and thematic scales.

Who is it for?

A hosted portal can benefit Participant nodes and publishers that need a simple yet fully functional data portal.

By lowering the technical demands for delivering biodiversity data, hosted portals can enable a focus on other critical activities like data management and community engagement.

How to apply

Participant nodes and their partner institutions can apply to participate in GBIF’s hosted portal programme at any time.

Learn more about the application process, then fill out and submit your application below to get started!

Fill out the application: https://www.gbif.org/composition/7zgSnALNuD1OvzanAUPG4z/hosted-portals-application-form. Check out the Hosted Portal service agreement.

Some examples of hosted portals

DiSSCo-Flanders Hosted Portal DiSSCo-UK Hosted portal

Key Points

  • A GBIF hosted portal is a nice way to showcase your data


GBIF Discussion

Overview

Teaching: 0 min
Exercises: 60 min
Questions
Objectives
  • Group discussion

Discussion

  1. Data Standardization: Discuss the role of data standardization in ensuring interoperability and usability of biodiversity data across different platforms, including GBIF hosted portals, GRSciColl, and IPT. How can we encourage data publishers to adhere to established standards?

  2. Metadata Management: Explore strategies for managing metadata associated with biodiversity datasets. How can metadata standards be applied consistently across different platforms to improve data discoverability and usability?

  3. Quality Control and Assurance: Address the importance of quality control and assurance processes in maintaining the reliability and accuracy of biodiversity data hosted on GBIF portals. How can data publishers and GBIF collaborate to ensure data quality?

  4. Capacity Building: Explore opportunities for capacity building initiatives to empower researchers and institutions to utilize GRSciColl, and IPT effectively. How can training programs and resources be tailored to the needs of different user groups?

  5. Sustainability: Consider the long-term sustainability of GRSciColl and IPT. What are some challenges and opportunities for ensuring the continued operation and development of these platforms?

  6. A hosted portal for your institute or network? Discuss the possibility of starting a hosted portal for your institute or network. What would be your way to go?

Discussion challenge

Choose a topic to discuss in your group. 30 minutes for group discussion. 30 minutes for reporting back to the room. 5-6 persons per group.

Solution

Report back to the room on your group discussion

Key Points

  • How GBIF works


GBIF opportunities (CESP, BID, Ambassador programme, data use club...)

Overview

Teaching: 60 min
Exercises: 0 min
Questions
  • What is CESP

  • What is BID

Objectives
  • Learn about CESP

  • Learn about BID

  • Learn about the GBIF Ambassador programme

Living atlasses community

LA

As GBIF nodes, one of our goals is to highlight our publishers and their data. To achieve this, the Atlas of Living Australia (ALA) developed a huge open source platform with several modules re-usable by other organizations. Since 2013, the community around this tool has organized technical workshops to present ALA modules to other institutions that wanted to implement it, to improve already existing national data portals and to learn from each other’s achievements.

LA

CESP: Capacity Enhancement Support Programme

cesp

This programme aims to address specific capacity needs identified by GBIF Participants by facilitating collaboration at regional and global levels.

The annual calls for proposals under this programme provide co-funding to GBIF Participants for capacity enhancement projects based on combinations of the following types of action:

Mentoring activities: Interactions among two or more Participants where the core objective is the transfer of information, technology, experience and best practices in small groups.

Support for regional events and training workshops: Courses and workshops with a regional (multi-national) component to enhance the capacity of individuals or networks to contribute to and benefit from GBIF.

Documentation: Production of key learning and promotional resources or their adaptation to the national or regional context (e.g. by translation or including local/regional perspectives). The GBIF Secretariat advocates digital-first documentation to provide technical guidance and support training and skills development across GBIF’s communities of practice. The key features of this system include standardized documentation, routine updates, versioning, translations, community input, peer review, and searchable format.

These types of action are part of the suite of capacity enhancement activities provided by GBIF, to enable effective mobilization and use of biodiversity information.

croment

funded

BID: Biodiversity Information for Development

BID

Biodiversity Information for Development (BID) is a multi-year programme funded by the European Union and led by GBIF with the aim of enhancing capacity for effective mobilization and use of biodiversity data in research and policy in the ‘ACP’ nations of sub-Saharan Africa, the Caribbean and the Pacific.

Funding from the programme’s first phase has supported capacity enhancement activities and projects to mobilize biodiversity data and strengthened national and regional biodiversity information facilities in these regions. Its impacts to date have focused on data that support the regions’ policy needs, particularly in connection with protected areas, threatened species and invasive alien species.

bid

Data use club

Data Use club

The Data Use Club is a space that promotes the interaction between data users and provides them with tools for developing skills in data use, no matter where they are in the world. In the club, we provide support in the following form:

Training seminars: This quarterly webinar series highlights approaches to global problems using GBIF-mediated data. Each seminar provides opportunities for knowledge exchange and inspiration for GBIF users who wish to develop their own solutions to similar challenges.

Practical sessions: This quarterly webinar series focuses on developing the informatic and data management skills necessary to fully exploit the potential of GBIF-mediated data. The material for these sessions expands on the biodiversity data use curriculum developed by GBIF.

DataCamp online training: GBIF users can receive free access to the full suite of online training offered by DataCamp through DataCamp Donates.

For questions regarding the Data Use Club, please contact datause@gbif.org

datacamp

Ambassador programme

ambassadors

The success of GBIF depends in part on establishing a good understanding within research and policy communities of the benefits and opportunities provided by free and open access to biodiversity data, as well as the importance of responsible use of such data through proper citation and attribution.

The GBIF Secretariat and GBIF participant nodes work to promote this understanding through their communication platforms, at meetings and across networks around the world. However, this relatively small group can never hope to reach all relevant communities without assistance.

Biodiversity Open Data Ambassadors can fill that gap. If you are a biodiversity professional who promotes the principles and best practices of open data sharing and use, we can equip you with information resources, networking opportunities and recognition to make you an even more effective advocate in your own professional communities.

How to become a GBIF data ambassador

Are you a potential Biodiversity Open Data Ambassador? We ask for some minimum qualifications and a basic level of commitment, namely that:

GBIF translators

translators

Can I join the GBIF translator community?

The work of translation is never done: we are always creating new content, and we introduce small but incremental additions and adjustments to the user interface on an almost weekly basis. We often have other publications and documentation in need of translations (or updated translations), so we welcome the involvement of newcomers who can help us maintain progress in making free and open data even more widely accessible around the world.

If you are interested in joining our volunteer community, please feel free to email us at communication@gbif.org. We’ll discuss current priorities and arrange a time to orient you to the key tools and the most pressing tasks.

If you already have experience with CrowdIn (which our community first started using in 2014 to translate the interface for the GBIF IPT), you can simply request to join the CrowdIn project for your preferred language.

Ebbe Nielsen Challenge

ebbe

The GBIF Ebbe Nielsen Challenge is an annual incentive prize that seeks to inspire innovative applications of open-access biodiversity data by scientists, informaticians, data modelers, cartographers and other experts.

The call for entries for the 2024 Ebbe Nielsen Challenge is now open! DEADLINE: 24 August 2024

Like the Ebbe Nielsen Prize it replaced, the Challenge honours the memory of Dr Ebbe Schmidt Nielsen, an inspirational leader in the fields of biosystematics and biodiversity informatics and one of the principal founders of GBIF, who died unexpectedly just before it came into being.

While the focus of the competition has evolved from year to year, Challenge entries manifest a variety of forms and approaches—new analytical research, richer policy-relevant visualizations, web and mobile applications, or improvements to processes around data digitization, quality and access. Allocations from the annual prize pool of €20,000 will award the first-place team with €8,000, €6,000 for second place, and €3,000 each for the third-place winners.

ebbewinner

The Graduate Researchers Award

Graduate Researchers Award

The Graduate Researchers Award (previously the Young Researchers Award) is an annual programme aimed at fostering innovative research and discovery in biodiversity informatics. The GRA provides prizes to two graduate students—generally one Master’s candidate and one Ph.D candidate—nominated by GBIF Participant countries.

Since its inception in 2010, the Award has encouraged innovation in the use and mobilization of biodiversity data shared through the GBIF network.

Calls for nominations go out each spring, and interested students must submit applications to either a national Head of Delegation or node manager from a GBIF Participant country. These national delegations may nominate a maximum of two students each year to the GBIF Secretariat.

Graduate students wishing to be considered for the nominations should consult the websites of their national GBIF Participants or contact the Head of Delegation or node manager directly. The GBIF Secretariat and national Participants whose nominees are selected to receive the award announce the award winners each autumn just before the annual Governing Board meeting.

GBIF data use

Data use

Data from the GBIF network is used in scientific studies published in peer-reviewed journals at a rate of more than four papers every day. Review highlights from the most recent publications drawn from the Secretariat’s ongoing literature tracking programme or check out the comprehensive literature database, which comprises more than 9,000 entries from scientific literature that cite the use of GBIF-mediated data.
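
The literature database can also be queried through the public API. A minimal sketch, assuming the literature search endpoint at /v1/literature/search and a q parameter (the search term is only an example):

```python
import requests

API = "https://api.gbif.org/v1"

# Search the GBIF literature database for papers mentioning a term (example query).
resp = requests.get(f"{API}/literature/search", params={"q": "Croatia", "limit": 5}).json()
print(f"{resp['count']} matching entries")
for paper in resp["results"]:
    print("-", paper.get("year"), paper.get("title"))
```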

Based on their scientific impact, relevance and uniqueness as well as diversity in taxonomy and geography, we select and feature a handful of papers every month. These are later compiled into a printed publication—the Science Review, our annual compilation of scientific articles—partial but instructive—enhanced and supported by free and open data that the GBIF network of members and publishers make available.

The applicability of free and open biodiversity data spans beyond academia: GBIF-mediated data is also used to inform decision-making and policy, ranging from documents produced by local and national agencies, to extinction risk assessments for thousands of species on the IUCN Red List of Threatened Species, to comprehensive reports by large-scale intergovernmental and convention-based bodies such as the Intergovernmental Panel on Climate Change (IPCC) and the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES).

GBIF mentors and trainers

Mentors

Mentors and trainers are a pillar of the GBIF community of practice. By sharing their expertise within the community, GBIF mentors contribute to capacity enhancement by providing support to project teams, participating in training events and by interacting with other mentors.

Becoming a volunteer mentor is an opportunity to collaborate with the growing community of scientists, institutions, and networks involved in the mobilization and use of biodiversity data, and to make new connections all around the world. GBIF maintains an open call for volunteers to join this group of mentors and trainers.

Areas of expertise: mentors typically have experience in several of the following areas:

Become a GBIF mentor

  1. Volunteer
  2. Complete the mentor certification after your application has been reviewed.
  3. Join the GBIF Community Forum and participate in the discussions (https://discourse.gbif.org/c/mentors) with other mentors on the community forum.
  4. Answer occasional calls to mentors to participate in capacity support activities as project mentors and/or workshop mentors and trainers.

image

Key Points



Biodiversity Informatics Network (TDWG)

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • Biodiversity Informatics Network (TDWG)

Objectives
  • Understand what TDWG is doing and how it proceeds

TDWG

Biodiversity Information Standards (TDWG) is a non-profit organization and a community dedicated to developing biodiversity information standards. To achieve its goals, TDWG:

see TDWG website for more.

TDWG 2023 Hobart

Community

TDWG is an open, bottom-up organization. Anyone can become a member, individuals as well as institutions (including government agencies, non-governmental organizations, programs, and projects). Everything TDWG produces comes from the expertise, collaboration, and support of its members - people and organizations who create, manage, integrate, and analyze biodiversity information.

As an example, let’s look at How we developed a Data Exchange Format: Lessons Learned from Camera Trap Data Package (Camtrap DP), presented by Peter Desmet as part of the TDWG 2023 Hybrid Annual Conference in Hobart, Tasmania.

Standards

Darwin Core is a standard maintained by the Darwin Core Maintenance Interest Group.

What is Darwin Core, and why does it matter?

The Darwin Core Standard (DwC) offers a stable, straightforward and flexible framework for compiling biodiversity data from varied and variable sources.
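
To make that concrete, a single occurrence record expressed with a few common Darwin Core terms might look like the minimal sketch below (the values are invented for illustration; real datasets will use many more terms):

```python
import csv

# A minimal occurrence record using common Darwin Core terms (illustrative values).
record = {
    "occurrenceID": "urn:example:occ:0001",
    "basisOfRecord": "HumanObservation",
    "scientificName": "Salamandra atra",
    "eventDate": "2023-06-14",
    "decimalLatitude": 45.4375,
    "decimalLongitude": 15.0614,
    "countryCode": "HR",
}

# Writing it as CSV gives you a file that an IPT can map to Darwin Core.
with open("occurrence.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)
```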

Conferences

TDWG annual conferences provide a showcase for state of the art biodiversity informatics, much of which relies on the standards created by TDWG and other organizations. Our conferences also provide a forum for developing, refining, and extending our standards in response to new challenges and opportunities.

Key Points

  • TDWG community

  • TDWG standards

  • TDWG conferences


Coffee break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Key Points


GBIF mentoring programme CESP

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • What is IPT?

  • How is IPT organized

  • The GBIF portal

Objectives
  • Understand how IPT works.

  • Understand how GBIF is organized.

  • Basic understanding of the GBIF portal

Key Points

  • The GBIF portal is the place to find biodiversity data


Lunch Break

Overview

Teaching: min
Exercises: min
Questions
Objectives

lunch

Key Points


GBIF BID programme

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • What is IPT?

  • How is IPT organized

  • The GBIF portal

Objectives
  • Understand how IPT works.

  • Understand how GBIF is organized.

  • Basic understanding of the GBIF portal

Key Points

  • The GBIF portal is the place to find biodiversity data


Develop GBIF data mobilization Strategy

Overview

Teaching: 30 min
Exercises: 60 min
Questions
  • What is a data mobilization strategy

Objectives
  • Develop a data mobilization strategy.

Guidance document on Belgian Data Mobilization strategy

Key Points

  • The GBIF portal is the place to find biodiversity data

  • A data mobilization strategy helps you in organizing your data mobilization


Coffee break

Overview

Teaching: min
Exercises: min
Questions
Objectives

Key Points


Develop GBIF data mobilization Strategy 2

Overview

Teaching: 10 min
Exercises: 30 min
Questions
  • Who are your stakeholders?

  • Who are your key stakeholders?

Objectives
  • Identify all your stakeholders

  • Map your stakeholders (influence vs interest)

  • Key stakeholders and primary audience

Presentation

StakeholdersLandscape

Exercise

  1. List all your stakeholders
  2. Map them with this matrix
  3. Determine your key stakeholders = influential AND interested
  4. Determine your audience = influential OR interested

Key Points

  • Node success depends on interaction with your stakeholders.

  • Some stakeholders (with high influence & interest) are key.


Continuing the Conversation

Overview

Teaching: 0 min
Exercises: 120 min
Questions
  • How do I continue my work after the workshop?

  • Where do I go to look for help with DwC alignment?

  • How do I provide feedback about this workshop?

Objectives
  • Complete the (optional) post-workshop survey.

Thank you for attending the workshop!

We hope that you were able to register with GBIF and submit some of your data. If not, this is just the beginning, and the work needs to continue beyond the workshop. The national Node and the entire GBIF community will help you to succeed.

“Creativity comes from applying things you learn in other fields to the field you work in.” Aaron Swartz

Post-Workshop Survey

If you wish to provide feedback please use this post-workshop survey.


GBIF’s Technical Support Hour

The theme for the March session of the Technical Support Hour for nodes is GBIF’s data quality workflow. We will go through how published data is processed in terms of quality checks, show how you can get an overview of the flags and issues on datasets, how users provide publicly accessible feedback, and how you can navigate that feedback yourself.
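
If you want to explore these flags yourself before the session, the occurrence search API can facet on the issue field. A minimal sketch, assuming a dataset key of interest and the issue facet name (the UUID below is a placeholder):

```python
import requests

API = "https://api.gbif.org/v1"
DATASET_KEY = "00000000-0000-0000-0000-000000000000"  # placeholder: use a real dataset key

# Facet the occurrence records of one dataset by data-quality issue flags.
params = {"datasetKey": DATASET_KEY, "limit": 0, "facet": "issue", "facetLimit": 20}
resp = requests.get(f"{API}/occurrence/search", params=params).json()

for facet in resp.get("facets", []):
    if facet["field"].lower() == "issue":
        for count in facet["counts"]:
            print(f"{count['name']}: {count['count']}")
```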

Registration

The event will take place on the 6th of March 2024 at 4pm CET (UTC+1)


Further readings

This section covers some useful links grouped by topic.

GBIF documentation

Key documents

Other useful resources

On Data Standards

Well established ones

Emerging ones

On Georeferencing

On Persistent identifiers

Key Points


Course Evaluation

Overview

Teaching: min
Exercises: min
Questions
Objectives

evaluation

Key Points


GBIF discussion

Overview

Teaching: 60 min
Exercises: 0 min
Questions
  • GBIF nodes and ECA network

  • Strategic Framework

  • GBIF implementation plans

  • work programmes

Objectives
  • Understand how GBIF works.

  • Understand how GBIF is organized.

Key Points

  • How GBIF works