SQLite

Overview

Teaching: 0 min
Exercises: 90 min
Questions
  • Data cleaning with SQLite

Objectives
  • Understand how SQLite can help cleaning data

Presentation: SQLite

SQLite

Exercise 1 : Download from GBIF.org

Instructions

Solution 1

Exercise 2 : Import data

Instructions

Solution 2

select count(*) from occ;

Exercise 3 : Explore data

Instructions

Solution 3

select * from occ where scientificName is null;
select * from occ where eventdate is null;
select * from occ where year is null or month is null or day is null;
select * from occ where decimalLatitude is null or decimalLongitude is null;

select count(*) from occ where individualCount is null;
select taxonRank, count(*) from occ group by taxonRank;
select phylum, count(*) from occ group by phylum;
select license, count(*) from occ group by license;

Exercice 4 : Discard data

Instructions

Solution 4

select count(*) from occ where occurrenceStatus='ABSENT';

create view trusted as select * from occ where occurrenceStatus='PRESENT' and taxonRank='SPECIES';
select count(*) from trusted;

Exercice 5 : Filter data

Instructions

Solution 5

select count(*) from occ where coordinateUncertaintyInMeters is null;
select coordinateUncertaintyInMeters, count(*) from occ group by coordinateUncertaintyInMeters;
select * from occ where CAST(coordinateUncertaintyInMeters as INTEGER) > 10000;

drop view if exists trusted ;
create view trusted as select scientificName, year,month,day,decimalLatitude, decimalLongitude,  CAST(coordinateUncertaintyInMeters as INTEGER) as uncertainty, occurrenceID from occ where occurrenceStatus='PRESENT'  and taxonRank='SPECIES' and uncertainty <= 10000;
select count(*) from trusted;

select eventdate, strftime('%d',eventdate) as day, strftime('%m',eventdate) as month, strftime('%Y', eventdate) as year from occ;

Exercice 6 : Annotate data

Instructions

Solution 6

update occ set individualCount=1 where individualCount is null;

drop view if exists trusted ;
create view trusted as select scientificName, year,month,day,decimalLatitude, decimalLongitude,  CAST(coordinateUncertaintyInMeters as INTEGER) as uncertainty, occurrenceID, individualCount, mediaType is not null as withMedia from occ where occurrenceStatus='PRESENT' and taxonRank='SPECIES' and uncertainty <= 10000;

Key Points

  • SQL can be very useful to clean your data

  • Views are great to filter the records and fields you want to keep without changing your original data

  • Store your SQL statements under Git

  • SQL statements are easy to understand, sustainable and reusable