15  Data validation

15.1 Learning Objectives:

This chapter covers different methods for validating our assumptions about data in R. By the end you will be able to:

  • Explain why data validation matters for reproducible research.

  • Use assertr functions (assertr::verify(), assertr::assert(), assertr::insist()) to check data.

  • Write automated checks for numeric ranges and categorical values.

  • Combine validation checks into a simple, readable pipeline.

  • Interpret errors and handle validation failures gracefully.

15.2 Why validate data

Before we analyse data, we need to trust it.

Think about the assumptions we build into our data:

  • Counts are always positive

  • Column B is always less than Column A

  • Measurements are within acceptable ranges

Remember: garbage in = garbage out.

15.3 Our dataset

Here’s our experimental dataset: each row is a female insect, with basic life-cycle and reproduction data.

library(readr)  # provides read_csv()

female_egg_data <- read_csv(here::here("data", "raw", "female_egg_data.csv"))
female_egg_data
female_id treatment age_days eggs_laid eggs_hatched
1 A 0 52 47
2 B 19 120 52
3 A 14 50 55
4 B 3 46 46
5 A 10 59 50
6 B 18 55 43
7 A 22 49 49
8 B 11 44 38
9 A 5 42 42
10 B 20 47 39
11 A 14 44 44
12 B 22 57 46
13 A 25 48 44
14 B 26 41 41
15 A 27 41 41
16 B 5 47 43
17 A 19 47 47
18 B 27 49 34
19 A 25 56 37
20 B 28 48 46

Question

If this were your dataset, what checks would you want to include?

Column | What we want to ensure | assertr check
female_id | No missing values, unique, positive integer | assert(not_na, female_id), assert(is_uniq, female_id)
treatment | Only “A” or “B” | verify(treatment %in% c("A","B"))
age_days | > 0 and within a reasonable range | assert(within_bounds(1, 30), age_days)
eggs_laid | > 0 and within 3 SDs of the mean | assert(within_bounds(1, Inf), eggs_laid), insist(within_n_sds(3), eggs_laid)
eggs_hatched | ≥ 0 and ≤ eggs_laid | assert(within_bounds(0, Inf), eggs_hatched), verify(eggs_hatched <= eggs_laid)

15.4 Run an assertion check

By default, if an assertr validation check fails, the code throws an error:

library(assertr)

female_egg_data |>
  verify(age_days > 0)                      # every age must be > 0; row 1 (age_days = 0) fails
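
Because a failed check is an ordinary R error, one way to handle it is to wrap the call in tryCatch(). The sketch below is illustrative only: toy_data is a hypothetical three-row stand-in for the real file, and the handler simply reports the message and returns NULL.

```r
library(assertr)

# Hypothetical stand-in for female_egg_data (not the real file)
toy_data <- data.frame(
  female_id = 1:3,
  age_days  = c(0, 19, 14)
)

# With the default error function, a failed check stops execution,
# so tryCatch() can intercept it like any other error.
result <- tryCatch(
  toy_data |> verify(age_days > 0),
  error = function(e) {
    message("Validation failed: ", conditionMessage(e))
    NULL  # signal to the caller that no validated data is available
  }
)

is.null(result)
```

Returning NULL is only one possible design; a real script might instead log the failure or stop with a more informative message.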

15.5 assert, insist, verify

The assertr package (Fischetti 2023) has three main functions:

  • verify() - evaluates a logical expression against the data frame: is this statement true for the whole data?
    df |> verify(has_all_names("female_id", "treatment", "age_days", "eggs_laid", "eggs_hatched"))
  • assert() - applies a predicate function to each value in the specified columns
    df |> assert(is.numeric, female_id, age_days, eggs_laid, eggs_hatched)
  • insist() - flags values using a rule that depends on the whole column
    df |> insist(within_n_sds(3), eggs_laid)
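
To compare the three verbs side by side, the following sketch runs each of them on a hypothetical four-row data frame (toy, not the real dataset). It assumes the built-in success_logical and error_logical functions, so that each check returns TRUE or FALSE instead of the data or an error.

```r
library(assertr)

# Hypothetical four-row example, not the real dataset
toy <- data.frame(
  eggs_laid    = c(52, 50, 46, 59),
  eggs_hatched = c(47, 55, 46, 50)
)

# verify(): a logical expression evaluated with the columns in scope
hatched_ok <- toy |>
  verify(eggs_hatched <= eggs_laid,
         success_fun = success_logical, error_fun = error_logical)

# assert(): a predicate applied to every value of the chosen column
laid_ok <- toy |>
  assert(within_bounds(1, Inf), eggs_laid,
         success_fun = success_logical, error_fun = error_logical)

# insist(): the predicate is built from the column itself (its mean and SD)
sds_ok <- toy |>
  insist(within_n_sds(3), eggs_laid,
         success_fun = success_logical, error_fun = error_logical)

c(hatched = hatched_ok, laid = laid_ok, sds = sds_ok)
```

Here only the verify() check fails, because the second row has more eggs hatched than laid.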

15.5.1 Error functions

The error_fun argument defines what happens when a validation check produces an error. By default it prints a summary of the errors and halts the code.

This may not be desirable if we want to run a longer pipeline or many checks in one go. There are several alternatives, including just_warn, which prints a warning and lets script execution continue.

female_egg_data |>
  verify(age_days > 0,
         error_fun = just_warn)  
verification [age_days > 0] failed! (1 failure)

    verb redux_fn    predicate column index value
1 verify       NA age_days > 0     NA     1    NA
female_id treatment age_days eggs_laid eggs_hatched
1 A 0 52 47
2 B 19 120 52
3 A 14 50 55
4 B 3 46 46
5 A 10 59 50
6 B 18 55 43
7 A 22 49 49
8 B 11 44 38
9 A 5 42 42
10 B 20 47 39
11 A 14 44 44
12 B 22 57 46
13 A 25 48 44
14 B 26 41 41
15 A 27 41 41
16 B 5 47 43
17 A 19 47 47
18 B 27 49 34
19 A 25 56 37
20 B 28 48 46
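
Another built-in option is error_append, which stays silent and attaches each failure to the returned data so it can be inspected later. A minimal sketch, assuming a hypothetical three-row toy_data in place of the real file:

```r
library(assertr)

# Hypothetical stand-in for female_egg_data
toy_data <- data.frame(
  age_days     = c(0, 19, 14),
  eggs_laid    = c(52, 120, 50),
  eggs_hatched = c(47, 52, 55)
)

# Both checks fail on this toy data, but neither stops the pipeline
checked <- toy_data |>
  verify(age_days > 0, error_fun = error_append) |>
  verify(eggs_hatched <= eggs_laid, error_fun = error_append)

# Failures are stored in the "assertr_errors" attribute of the returned data
failures <- attr(checked, "assertr_errors")
length(failures)
```

This keeps the console quiet during a long pipeline while still leaving a record of which checks failed.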

15.6 Pipelines

We can string multiple checks together into a single pipeline. We do not strictly need assertr for this, since the same checks can be written with dplyr, but the package makes such pipelines easier to write and read.

female_egg_data |>
  verify(has_all_names("female_id", "treatment",
                       "age_days", "eggs_laid", "eggs_hatched"),
         error_fun = just_warn) |>
  assert(is.numeric, female_id, age_days, eggs_laid, eggs_hatched,
         error_fun = just_warn) |>
  assert(is.character, treatment, error_fun = just_warn) |>
  assert(is_uniq, female_id, error_fun = just_warn) |>
  assert(not_na, female_id, treatment, age_days,
         eggs_laid, eggs_hatched, error_fun = just_warn) |>
  assert(in_set("A", "B"), treatment, error_fun = just_warn) |>
  assert(within_bounds(1, 30), age_days, error_fun = just_warn) |>
  verify(eggs_hatched <= eggs_laid, error_fun = just_warn) |>
  insist(within_n_sds(3), eggs_laid, error_fun = just_warn)

# The same checks written with dplyr alone
library(dplyr)

# Step 1: summary statistics for eggs_laid, used by the 3-SD outlier rule
female_egg_data_summary <- female_egg_data |>
  summarise(
    mean_eggs_laid = mean(eggs_laid, na.rm = TRUE),
    sd_eggs_laid   = sd(eggs_laid, na.rm = TRUE)
  )

# Check column classes
female_egg_data |>
  summarise(across(everything(), ~ class(.x)[1])) |>
  tidyr::pivot_longer(everything(),
                      names_to = "column",
                      values_to = "type")

# Check for duplicated IDs (zero rows returned means none)
female_egg_data |>
  filter(duplicated(female_id))

# Step 2: add one logical flag per rule
female_egg_data_report <- female_egg_data |>
  mutate(
    age_invalid          = age_days <= 0,
    eggs_laid_invalid    = eggs_laid <= 0,
    eggs_laid_outlier    = abs(eggs_laid - female_egg_data_summary$mean_eggs_laid) >
                             3 * female_egg_data_summary$sd_eggs_laid,
    eggs_hatched_invalid = eggs_hatched > eggs_laid
  )

# Step 3: view rows where at least one flag is TRUE
female_egg_data_report |>
  filter(age_invalid | eggs_laid_invalid | eggs_laid_outlier | eggs_hatched_invalid)
column type
female_id integer
treatment character
age_days integer
eggs_laid integer
eggs_hatched integer

female_id treatment age_days eggs_laid eggs_hatched

female_id treatment age_days eggs_laid eggs_hatched age_invalid eggs_laid_invalid eggs_laid_outlier eggs_hatched_invalid
1 A 0 52 47 TRUE FALSE FALSE FALSE
2 B 19 120 52 FALSE FALSE TRUE FALSE
3 A 14 50 55 FALSE FALSE FALSE TRUE
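
assertr can also gather failures from several checks into one table, similar to the dplyr report above. The sketch below assumes the package's chain_start() and chain_end() helpers together with the error_df_return error function, and uses a hypothetical three-row toy_data; treat it as an illustration rather than the recommended workflow.

```r
library(assertr)

# Hypothetical stand-in for female_egg_data
toy_data <- data.frame(
  age_days     = c(0, 19, 14),
  eggs_laid    = c(52, 120, 50),
  eggs_hatched = c(47, 52, 55)
)

# Inside a chain, failures are collected instead of stopping at the first one;
# chain_end() then hands all of them to the chosen error function.
problems <- toy_data |>
  chain_start() |>
  assert(within_bounds(1, 30), age_days) |>
  verify(eggs_hatched <= eggs_laid) |>
  chain_end(error_fun = error_df_return)

problems
```

The returned data frame has one row per violation, with the same columns as the printed failure summaries (verb, predicate, column, index, value).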

15.7 Options for within-group data checks

assertr is designed for whole-dataset validation, but some assertr::insist() rules make more sense when applied separately within each group.

Question

When might we want to consider groups for some data validations?

The insist() function compares each value against a summary of the entire column; in these examples we check that values lie within a set number of standard deviations of the mean. If the data contain several groups, it often makes more sense to run this check within each group.

female_egg_data |>
  group_by(treatment) |>
  dplyr::group_modify(~ .x |> insist(within_n_sds(3), eggs_laid, error_fun = just_warn))
treatment female_id age_days eggs_laid eggs_hatched
A 1 0 52 47
A 3 14 50 55
A 5 10 59 50
A 7 22 49 49
A 9 5 42 42
A 11 14 44 44
A 13 25 48 44
A 15 27 41 41
A 17 19 47 47
A 19 25 56 37
B 2 19 120 52
B 4 3 46 46
B 6 18 55 43
B 8 11 44 38
B 10 20 47 39
B 12 22 57 46
B 14 26 41 41
B 16 5 47 43
B 18 27 49 34
B 20 28 48 46
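
To see how grouping changes the result, the within-group rule can also be written with dplyr. This is a sketch under stated assumptions: the eggs data frame re-types the treatment and eggs_laid columns from the table above, and the z-score columns and flag names are illustrative.

```r
library(dplyr)

# treatment and eggs_laid re-typed from the dataset shown earlier
eggs <- data.frame(
  treatment = rep(c("A", "B"), times = 10),
  eggs_laid = c(52, 120, 50, 46, 59, 55, 49, 44, 42, 47,
                44, 57, 48, 41, 41, 47, 47, 49, 56, 48)
)

flags <- eggs |>
  # z-score against the whole column
  mutate(z_overall = (eggs_laid - mean(eggs_laid)) / sd(eggs_laid)) |>
  # z-score against each treatment's own mean and SD
  group_by(treatment) |>
  mutate(z_within = (eggs_laid - mean(eggs_laid)) / sd(eggs_laid)) |>
  ungroup() |>
  mutate(outlier_overall = abs(z_overall) > 3,
         outlier_within  = abs(z_within) > 3)

flags |>
  filter(outlier_overall | outlier_within)
```

With these values, 120 lies more than three SDs from the overall mean but fewer than three SDs from the treatment B mean, because the extreme value inflates the SD of its own small group. This is consistent with the grouped check above returning the data without a warning. More generally, in a sample of n values no z-score can exceed (n - 1) / sqrt(n), which is about 2.85 for n = 10, so a 3-SD rule cannot fire within groups of ten and a lower threshold or a different rule may be needed.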

15.8 Summary

You now know how to:

  • Use assertr to check data integrity before analysis.

  • Validate numeric and categorical variables.

  • Combine checks into a tidy pipeline.

  • Control what happens when checks fail.

Automated validation turns your scripts into self-auditing workflows: they document and test your assumptions each time the data changes.