15 Data validation
15.1 Learning Objectives:
This section covers different methods for validating the assumptions about our data in R. By the end of this chapter you will be able to:
Explain why data validation matters for reproducible research.
Use assertr functions (assertr::verify(), assertr::assert(), assertr::insist()) to check data.
Write automated checks for numeric ranges and categorical values.
Combine validation checks into a simple, readable pipeline.
Interpret errors and handle validation failures gracefully.
15.2 Why validate data
Before we analyse data, we need to trust it.
Think about the assumptions we build into our data:
Counts are always positive
Column B is always less than Column A
Measurements are within acceptable ranges
Remember: garbage in, garbage out.
15.3 Our dataset
Here’s our experimental dataset. Each row is a female insect, with basic life-cycle and reproduction data.
| female_id | treatment | age_days | eggs_laid | eggs_hatched |
|---|---|---|---|---|
| 1 | A | 0 | 52 | 47 |
| 2 | B | 19 | 120 | 52 |
| 3 | A | 14 | 50 | 55 |
| 4 | B | 3 | 46 | 46 |
| 5 | A | 10 | 59 | 50 |
| 6 | B | 18 | 55 | 43 |
| 7 | A | 22 | 49 | 49 |
| 8 | B | 11 | 44 | 38 |
| 9 | A | 5 | 42 | 42 |
| 10 | B | 20 | 47 | 39 |
| 11 | A | 14 | 44 | 44 |
| 12 | B | 22 | 57 | 46 |
| 13 | A | 25 | 48 | 44 |
| 14 | B | 26 | 41 | 41 |
| 15 | A | 27 | 41 | 41 |
| 16 | B | 5 | 47 | 43 |
| 17 | A | 19 | 47 | 47 |
| 18 | B | 27 | 49 | 34 |
| 19 | A | 25 | 56 | 37 |
| 20 | B | 28 | 48 | 46 |
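To follow along, the table above can be typed into R as a data frame. This is a minimal sketch: the object name `female_egg_data` matches the code later in this chapter, but building it by hand with `data.frame()` (rather than reading it from a file) is an assumption made here for illustration.

```r
# Recreate the table above by hand (assumption: the original data
# may have been read from a file instead)
female_egg_data <- data.frame(
  female_id    = 1:20,
  treatment    = rep(c("A", "B"), times = 10),
  age_days     = as.integer(c(0, 19, 14, 3, 10, 18, 22, 11, 5, 20,
                              14, 22, 25, 26, 27, 5, 19, 27, 25, 28)),
  eggs_laid    = as.integer(c(52, 120, 50, 46, 59, 55, 49, 44, 42, 47,
                              44, 57, 48, 41, 41, 47, 47, 49, 56, 48)),
  eggs_hatched = as.integer(c(47, 52, 55, 46, 50, 43, 49, 38, 42, 39,
                              44, 46, 44, 41, 41, 43, 47, 34, 37, 46))
)

# Quick look at the structure before writing any checks
str(female_egg_data)
```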
Question
If this were your dataset, what checks would you want to include?
| Column | What we want to ensure | assertr check |
|---|---|---|
| female_id | No missing, unique, positive integer | assert(not_na, female_id), assert(is_uniq, female_id) |
| treatment | Only “A” or “B” | verify(treatment %in% c("A","B")) |
| age_days | >0 and reasonable range | assert(within_bounds(1, 30), age_days) |
| eggs_laid | >0, within 3 SDs | assert(within_bounds(1, Inf), eggs_laid), insist(within_n_sds(3), eggs_laid) |
| eggs_hatched | ≥0 and ≤ eggs_laid | assert(within_bounds(0, Inf), eggs_hatched), verify(eggs_hatched <= eggs_laid) |
15.4 Run an assertion check
By default, assertr throws an error and halts the code when a validation check fails.
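As a minimal sketch of this default behaviour, the example below runs one failing check on the first three rows of the dataset. The object names `first_rows` and `result`, and the use of `tryCatch()` so the script can carry on, are assumptions added for illustration.

```r
library(assertr)

# First three rows of the dataset from section 15.3
first_rows <- data.frame(
  female_id    = 1:3,
  treatment    = c("A", "B", "A"),
  age_days     = c(0, 19, 14),
  eggs_laid    = c(52, 120, 50),
  eggs_hatched = c(47, 52, 55)
)

# age_days is 0 in row 1, so this check fails and, with the default
# error function, raises an R error. tryCatch() intercepts that error
# here so the example itself does not halt.
result <- tryCatch(
  first_rows |> verify(age_days > 0),
  error = function(e) "validation failed"
)

result
```

Without the `tryCatch()` wrapper, the script would stop at the failing `verify()` call.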
15.5 assert, insist, verify
The assertr package (Fischetti (2023)) has three main functions:
- verify(): is this statement true for the whole data?
- assert(): applies a predicate row-by-row to the specified columns
- insist(): flags values using a rule that depends on the whole column
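The three verbs can be compared side by side with a short sketch. It assumes the dataset from section 15.3 has been typed in as `female_egg_data`, and it uses `success_logical` and `error_logical` so that each check returns `TRUE` or `FALSE` instead of halting. The result names (`hatch_ok` and so on) are hypothetical.

```r
library(assertr)

# Dataset from section 15.3, typed in by hand
female_egg_data <- data.frame(
  female_id    = 1:20,
  treatment    = rep(c("A", "B"), times = 10),
  age_days     = as.integer(c(0, 19, 14, 3, 10, 18, 22, 11, 5, 20,
                              14, 22, 25, 26, 27, 5, 19, 27, 25, 28)),
  eggs_laid    = as.integer(c(52, 120, 50, 46, 59, 55, 49, 44, 42, 47,
                              44, 57, 48, 41, 41, 47, 47, 49, 56, 48)),
  eggs_hatched = as.integer(c(47, 52, 55, 46, 50, 43, 49, 38, 42, 39,
                              44, 46, 44, 41, 41, 43, 47, 34, 37, 46))
)

# verify(): a logical expression evaluated against the data frame.
# Female 3 hatched more eggs than she laid, so this check fails.
hatch_ok <- verify(female_egg_data, eggs_hatched <= eggs_laid,
                   success_fun = success_logical,
                   error_fun = error_logical)

# assert(): a predicate applied to every value in the named columns.
treatment_ok <- assert(female_egg_data, in_set("A", "B"), treatment,
                       success_fun = success_logical,
                       error_fun = error_logical)

# insist(): bounds are computed from the column itself (mean +/- 3 SD),
# then each value is tested against them. The value 120 falls outside.
laid_ok <- insist(female_egg_data, within_n_sds(3), eggs_laid,
                  success_fun = success_logical,
                  error_fun = error_logical)

c(hatch_ok = hatch_ok, treatment_ok = treatment_ok, laid_ok = laid_ok)
```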
15.5.1 Error functions
The error_fun argument defines what happens when a validation check produces an error. By default it prints a summary of the errors and halts the code.
This may not be desirable if we wish to run a longer pipeline or many error checks. There are multiple alternatives, including just_warn, where script execution continues and a warning is printed.
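A minimal sketch of two alternatives to the default, assuming the dataset from section 15.3 is available as `female_egg_data`. The names `checked` and `ages_ok` are hypothetical. `just_warn` lets the pipeline continue, while `error_logical` (paired with `success_logical`) reduces the outcome to `FALSE` or `TRUE`.

```r
library(assertr)

# Dataset from section 15.3, typed in by hand
female_egg_data <- data.frame(
  female_id    = 1:20,
  treatment    = rep(c("A", "B"), times = 10),
  age_days     = as.integer(c(0, 19, 14, 3, 10, 18, 22, 11, 5, 20,
                              14, 22, 25, 26, 27, 5, 19, 27, 25, 28)),
  eggs_laid    = as.integer(c(52, 120, 50, 46, 59, 55, 49, 44, 42, 47,
                              44, 57, 48, 41, 41, 47, 47, 49, 56, 48)),
  eggs_hatched = as.integer(c(47, 52, 55, 46, 50, 43, 49, 38, 42, 39,
                              44, 46, 44, 41, 41, 43, 47, 34, 37, 46))
)

# just_warn: the failure is reported as a warning and the data is
# returned, so later steps in a pipeline still run
checked <- female_egg_data |>
  verify(age_days > 0, error_fun = just_warn)

# error_logical / success_logical: return FALSE on failure, TRUE on success
ages_ok <- female_egg_data |>
  verify(age_days > 0,
         success_fun = success_logical,
         error_fun = error_logical)

nrow(checked)
ages_ok
```

A check like the first one would be expected to print a failure report and then return the full dataset, as in the output that follows.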
verification [age_days > 0] failed! (1 failure)

    verb redux_fn    predicate column index value
1 verify       NA age_days > 0     NA     1    NA
| female_id | treatment | age_days | eggs_laid | eggs_hatched |
|---|---|---|---|---|
| 1 | A | 0 | 52 | 47 |
| 2 | B | 19 | 120 | 52 |
| 3 | A | 14 | 50 | 55 |
| 4 | B | 3 | 46 | 46 |
| 5 | A | 10 | 59 | 50 |
| 6 | B | 18 | 55 | 43 |
| 7 | A | 22 | 49 | 49 |
| 8 | B | 11 | 44 | 38 |
| 9 | A | 5 | 42 | 42 |
| 10 | B | 20 | 47 | 39 |
| 11 | A | 14 | 44 | 44 |
| 12 | B | 22 | 57 | 46 |
| 13 | A | 25 | 48 | 44 |
| 14 | B | 26 | 41 | 41 |
| 15 | A | 27 | 41 | 41 |
| 16 | B | 5 | 47 | 43 |
| 17 | A | 19 | 47 | 47 |
| 18 | B | 27 | 49 | 34 |
| 19 | A | 25 | 56 | 37 |
| 20 | B | 28 | 48 | 46 |
15.6 Pipelines
We can string together multiple checks into a single pipeline. We don’t need assertr for this, because the same checks can be written with dplyr, but the package makes pipelines easier to write and read.
library(assertr)
library(dplyr)

# assertr pipeline: checks given error_fun = just_warn warn rather than halt
female_egg_data |>
  verify(has_all_names("female_id", "treatment",
                       "age_days", "eggs_laid", "eggs_hatched"),
         error_fun = just_warn) |>
  assert(is.numeric, female_id, age_days, eggs_laid, eggs_hatched,
         error_fun = just_warn) |>
  assert(is.character, treatment, error_fun = just_warn) |>
  assert(is_uniq, female_id, error_fun = just_warn) |>
  assert(not_na, female_id, treatment, age_days,
         eggs_laid, eggs_hatched, error_fun = just_warn) |>
  assert(in_set("A", "B"), treatment) |>
  assert(within_bounds(1, 30), age_days, error_fun = just_warn) |>
  verify(eggs_hatched <= eggs_laid, error_fun = just_warn) |>
  insist(within_n_sds(3), eggs_laid, error_fun = just_warn)

# Comparable checks written with dplyr
# Step 1: Get summary stats for eggs_laid
female_egg_data_summary <- female_egg_data |>
  summarise(
    mean_eggs_laid = mean(eggs_laid, na.rm = TRUE),
    sd_eggs_laid = sd(eggs_laid, na.rm = TRUE)
  )

# Check column classes
female_egg_data |>
  summarise(across(everything(), ~ class(.)[1])) |>
  tidyr::pivot_longer(everything(),
                      names_to = "column",
                      values_to = "type")

# Check for duplicated IDs
female_egg_data |>
  filter(duplicated(female_id))

# Step 2: Flag rows that break each rule
female_egg_data_report <- female_egg_data |>
  mutate(
    age_invalid = age_days <= 0,
    eggs_laid_invalid = eggs_laid <= 0,
    eggs_laid_outlier =
      eggs_laid < (female_egg_data_summary$mean_eggs_laid - 3 * female_egg_data_summary$sd_eggs_laid) |
      eggs_laid > (female_egg_data_summary$mean_eggs_laid + 3 * female_egg_data_summary$sd_eggs_laid),
    eggs_hatched_invalid = eggs_hatched > eggs_laid
  )

# Step 3: View rows where any issue is flagged TRUE
female_egg_data_report |>
  filter(age_invalid | eggs_laid_invalid | eggs_laid_outlier | eggs_hatched_invalid)

| column | type |
|---|---|
| female_id | integer |
| treatment | character |
| age_days | integer |
| eggs_laid | integer |
| eggs_hatched | integer |
| female_id | treatment | age_days | eggs_laid | eggs_hatched |
|---|---|---|---|---|
| female_id | treatment | age_days | eggs_laid | eggs_hatched | age_invalid | eggs_laid_invalid | eggs_laid_outlier | eggs_hatched_invalid |
|---|---|---|---|---|---|---|---|---|
| 1 | A | 0 | 52 | 47 | TRUE | FALSE | FALSE | FALSE |
| 2 | B | 19 | 120 | 52 | FALSE | FALSE | TRUE | FALSE |
| 3 | A | 14 | 50 | 55 | FALSE | FALSE | FALSE | TRUE |
15.7 Options for within-group data checks
assertr is designed for whole-dataset validation, but for some assertr::insist() rules it can make sense to apply them in a group-specific way.
Question
When might we want to consider groups for some data validations?
The insist() function compares each value against a rule built from the entire column. In these examples, we check that values lie within a set number of standard deviations of the mean. If the data contain multiple groups, it can make more sense to run this check within each group.
| treatment | female_id | age_days | eggs_laid | eggs_hatched |
|---|---|---|---|---|
| A | 1 | 0 | 52 | 47 |
| A | 3 | 14 | 50 | 55 |
| A | 5 | 10 | 59 | 50 |
| A | 7 | 22 | 49 | 49 |
| A | 9 | 5 | 42 | 42 |
| A | 11 | 14 | 44 | 44 |
| A | 13 | 25 | 48 | 44 |
| A | 15 | 27 | 41 | 41 |
| A | 17 | 19 | 47 | 47 |
| A | 19 | 25 | 56 | 37 |
| B | 2 | 19 | 120 | 52 |
| B | 4 | 3 | 46 | 46 |
| B | 6 | 18 | 55 | 43 |
| B | 8 | 11 | 44 | 38 |
| B | 10 | 20 | 47 | 39 |
| B | 12 | 22 | 57 | 46 |
| B | 14 | 26 | 41 | 41 |
| B | 16 | 5 | 47 | 43 |
| B | 18 | 27 | 49 | 34 |
| B | 20 | 28 | 48 | 46 |
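One way to run a deviation check within groups is to compute group-wise z-scores with dplyr. This is a sketch under stated assumptions and not an assertr feature: the column names `eggs_laid_z` and `eggs_laid_outlier`, the object names, and the threshold of 2 SDs are all hypothetical choices for illustration.

```r
library(dplyr)

# Dataset from section 15.3, typed in by hand
female_egg_data <- data.frame(
  female_id    = 1:20,
  treatment    = rep(c("A", "B"), times = 10),
  age_days     = as.integer(c(0, 19, 14, 3, 10, 18, 22, 11, 5, 20,
                              14, 22, 25, 26, 27, 5, 19, 27, 25, 28)),
  eggs_laid    = as.integer(c(52, 120, 50, 46, 59, 55, 49, 44, 42, 47,
                              44, 57, 48, 41, 41, 47, 47, 49, 56, 48)),
  eggs_hatched = as.integer(c(47, 52, 55, 46, 50, 43, 49, 38, 42, 39,
                              44, 46, 44, 41, 41, 43, 47, 34, 37, 46))
)

# Hypothetical threshold: flag values more than 2 SDs from the group mean
sd_threshold <- 2

grouped_report <- female_egg_data |>
  group_by(treatment) |>
  mutate(
    # z-score relative to the mean and SD of the female's own treatment
    eggs_laid_z = (eggs_laid - mean(eggs_laid)) / sd(eggs_laid),
    eggs_laid_outlier = abs(eggs_laid_z) > sd_threshold
  ) |>
  ungroup()

# Rows flagged within their own treatment group
flagged <- grouped_report |>
  filter(eggs_laid_outlier)

flagged
```

The threshold matters in small groups. With ten females per treatment, a z-score based on the sample SD cannot exceed about 2.85, so a 3 SD rule could never be triggered within these groups. That is why this sketch uses 2.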
15.8 Summary
You now know how to:
Use assertr to check data integrity before analysis.
Validate numeric and categorical variables.
Combine checks into a tidy pipeline.
Control what happens when checks fail.
Automated validation transforms your scripts into self-auditing workflows: they document and test your assumptions each time the data changes.