15 Data validation – Data Analysis in R

15.1 Learning Objectives:

This section covers different methods for validating the assumptions about our data in R. By the end of this chapter you will be able to:

Explain why data validation matters for reproducible research.
Use assertr functions (assertr::verify(), assertr::assert(), assertr::insist()) to check data.
Write automated checks for numeric ranges and categorical values.
Combine validation checks into a simple, readable pipeline.
Interpret errors and handle validation failures gracefully.

15.2 Why validate data

Before we analyse data, we need to trust it.

Think about the assumptions we build into our data:

Counts are always positive
Column B is always less than Column A
Measurements are within acceptable ranges

Remember: garbage in, garbage out.
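Each of these assumptions can be written down as an executable check. The following is a minimal sketch in base R on a small made-up data frame; the names `counts`, `col_a` and `col_b` are hypothetical and are not part of the chapter's dataset.

```r
# Hypothetical toy data, used only to illustrate the idea
counts <- data.frame(
  col_a = c(10, 8, 12),
  col_b = c(4, 7, 11)
)

# Each assumption becomes a logical vector with one value per row
checks <- list(
  counts_positive = counts$col_a > 0 & counts$col_b > 0,
  b_less_than_a   = counts$col_b < counts$col_a
)

# An assumption holds only if it is TRUE for every row
passed <- vapply(checks, all, logical(1))
print(passed)
```

Writing assumptions this way turns them from something we hope is true into something the script tests every time it runs.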

15.3 Our dataset

Here’s our experimental dataset: each row is a female insect, with basic life-cycle and reproduction data.

library(readr)  # provides read_csv()

female_egg_data <- read_csv(here::here("data", "raw", "female_egg_data.csv"))

female_egg_data

female_id	treatment	age_days	eggs_laid	eggs_hatched
1	A	0	52	47
2	B	19	120	52
3	A	14	50	55
4	B	3	46	46
5	A	10	59	50
6	B	18	55	43
7	A	22	49	49
8	B	11	44	38
9	A	5	42	42
10	B	20	47	39
11	A	14	44	44
12	B	22	57	46
13	A	25	48	44
14	B	26	41	41
15	A	27	41	41
16	B	5	47	43
17	A	19	47	47
18	B	27	49	34
19	A	25	56	37
20	B	28	48	46

Question

If this were your dataset, what checks would you want to include?

Column	What we want to ensure	assertr check
`female_id`	No missing, unique, positive integer	`assert(not_na, female_id)`, `assert(is_uniq, female_id)`
`treatment`	Only “A” or “B”	`verify(treatment %in% c("A","B"))`
`age_days`	>0 and reasonable range	`assert(within_bounds(1, 30), age_days)`
`eggs_laid`	>0, within 3 SDs	`assert(within_bounds(1, Inf), eggs_laid)`, `insist(within_n_sds(3), eggs_laid)`
`eggs_hatched`	≥0 and ≤ eggs_laid	`assert(within_bounds(0, Inf), eggs_hatched)`, `verify(eggs_hatched <= eggs_laid)`

15.4 Run an assertion check

By default, assertr throws an error and halts the code as soon as a validation check fails:

library(assertr)

female_egg_data |>
  verify(age_days > 0)                      # every age must be greater than 0
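If we want to respond to a failed check rather than let the script stop, one option is to wrap the call in base R's `tryCatch()`. This is a sketch under the assumption that the default error function signals an ordinary R error; the data frame `toy` is typed in by hand from the first three rows above, and the `NULL` fallback is just one possible choice.

```r
library(assertr)

# First three rows of the chapter's data, entered by hand
toy <- data.frame(
  female_id = 1:3,
  age_days  = c(0, 19, 14)
)

# verify() stops with an error when the check fails,
# so tryCatch() can intercept it and return a fallback value
checked <- tryCatch(
  toy |> verify(age_days > 0),
  error = function(e) {
    message("Validation failed: ", conditionMessage(e))
    NULL  # fallback: no validated data available
  }
)

is.null(checked)
```

Downstream code can then test `is.null(checked)` and decide whether to skip the analysis, log the problem, or stop with a more informative message.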

15.5 assert, insist, verify

The assertr package (Fischetti 2023) has three main functions:

verify() - is this statement true for the whole data?

df |>  verify(has_all_names("female_id","treatment","age_days","eggs_laid","eggs_hatched"))

assert() - applies a predicate function to every value in the specified columns

df |>  assert(is.numeric, female_id, age_days, eggs_laid, eggs_hatched)

insist() - flags values using a rule that is calculated from the whole column

df |>  insist(within_n_sds(3), eggs_laid)
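To see the three verbs side by side, the sketch below runs one check of each kind on a five-row subset of the data (entered by hand as `toy`). It assumes assertr's `success_logical` and `error_logical` helpers, which make each check return `TRUE` or `FALSE` instead of returning the data or stopping.

```r
library(assertr)

# First five rows of the chapter's data, entered by hand
toy <- data.frame(
  female_id    = 1:5,
  eggs_laid    = c(52, 120, 50, 46, 59),
  eggs_hatched = c(47, 52, 55, 46, 50)
)

# verify(): a logical expression evaluated inside the data frame
hatched_ok <- toy |>
  verify(eggs_hatched <= eggs_laid,
         success_fun = success_logical, error_fun = error_logical)

# assert(): a fixed predicate applied to each value of a column
laid_ok <- toy |>
  assert(within_bounds(1, Inf), eggs_laid,
         success_fun = success_logical, error_fun = error_logical)

# insist(): the predicate is built from the column itself (its mean and SD)
sds_ok <- toy |>
  insist(within_n_sds(3), eggs_laid,
         success_fun = success_logical, error_fun = error_logical)

c(hatched_ok = hatched_ok, laid_ok = laid_ok, sds_ok = sds_ok)
```

In this subset only the `verify()` check fails, because female 3 has more eggs hatched than laid. With just five rows the value 120 stays within 3 SDs, since it strongly inflates the standard deviation it is compared against.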

15.5.1 error functions

The error_fun argument defines what happens when a validation check produces an error. By default it prints a summary of the errors and halts the code.

This may not be desirable if we wish to run a longer pipeline or many checks. There are several alternatives, including just_warn, which prints a warning and lets script execution continue.

female_egg_data |>
  verify(age_days > 0,
         error_fun = just_warn)

verification [age_days > 0] failed! (1 failure)

    verb redux_fn    predicate column index value
1 verify       NA age_days > 0     NA     1    NA

female_id	treatment	age_days	eggs_laid	eggs_hatched
1	A	0	52	47
2	B	19	120	52
3	A	14	50	55
4	B	3	46	46
5	A	10	59	50
6	B	18	55	43
7	A	22	49	49
8	B	11	44	38
9	A	5	42	42
10	B	20	47	39
11	A	14	44	44
12	B	22	57	46
13	A	25	48	44
14	B	26	41	41
15	A	27	41	41
16	B	5	47	43
17	A	19	47	47
18	B	27	49	34
19	A	25	56	37
20	B	28	48	46
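Another of the available options is to collect the failures as a data frame so they can be inspected or saved. The sketch below assumes assertr's `error_df_return` error function and uses a hand-typed three-row subset called `toy`.

```r
library(assertr)

# First three rows of the chapter's data, entered by hand
toy <- data.frame(
  female_id = 1:3,
  age_days  = c(0, 19, 14)
)

# error_df_return hands back a data frame describing each failure
# instead of stopping or warning
failures <- toy |>
  assert(within_bounds(1, 30), age_days, error_fun = error_df_return)

failures
```

Note that when the check fails, `failures` holds the failure report rather than the data, so this style suits a dedicated checking step more than the middle of an analysis pipeline.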

15.6 Pipelines

We can string multiple checks together into a single pipeline. We don’t strictly need assertr for this, as the same checks can be written with dplyr, but assertr makes such pipelines easier to write and read.

  female_egg_data  |> 
      verify(has_all_names("female_id", "treatment",
                           "age_days", "eggs_laid", "eggs_hatched"),
             error_fun = just_warn) |>
      assert(is.numeric, female_id, age_days, eggs_laid, eggs_hatched,
             error_fun = just_warn) |>
      assert(is.character, treatment, error_fun = just_warn) |>
      assert(is_uniq, female_id, error_fun = just_warn) |>
      assert(not_na, female_id, treatment, age_days,
             eggs_laid, eggs_hatched, error_fun = just_warn) |>
      assert(in_set("A", "B"), treatment) |> 
      assert(within_bounds(1, 30), age_days, error_fun = just_warn) |>
      verify(eggs_hatched <= eggs_laid, error_fun = just_warn) |>
      insist(within_n_sds(3), eggs_laid, error_fun = just_warn)
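Instead of repeating `error_fun = just_warn` on every line, assertr also lets a set of checks be wrapped in a chain, so that all failures are gathered and handled once at the end. The sketch below assumes the `chain_start()` and `chain_end()` functions together with `error_df_return`; `toy` is a hand-typed three-row subset of the data.

```r
library(assertr)

# First three rows of the chapter's data, entered by hand
toy <- data.frame(
  female_id    = 1:3,
  age_days     = c(0, 19, 14),
  eggs_laid    = c(52, 120, 50),
  eggs_hatched = c(47, 52, 55)
)

# Checks inside the chain do not stop at the first failure;
# chain_end() decides what to do with everything that was collected
all_failures <- toy |>
  chain_start() |>
  assert(within_bounds(1, 30), age_days) |>
  verify(eggs_hatched <= eggs_laid) |>
  chain_end(error_fun = error_df_return)

all_failures
```

Here the report should cover both the zero age and the row where more eggs hatched than were laid, which makes it easier to fix several problems in one pass.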

The same checks can be written with dplyr alone, at the cost of more code:

library(dplyr)

# Step 1: Get summary stats for eggs_laid
female_egg_data_summary <- female_egg_data |> 
  summarise(
    mean_eggs_laid = mean(eggs_laid, na.rm = TRUE),
    sd_eggs_laid   = sd(eggs_laid, na.rm = TRUE)
  )

# Check column classes
female_egg_data |> 
  summarise(across(everything(), ~ class(.)[1])) |>
  tidyr::pivot_longer(everything(),
                      names_to = "column",
                      values_to = "type")

# Check for duplicated IDs (zero rows returned means all IDs are unique)
female_egg_data |> 
  filter(duplicated(female_id))

# Step 2: Flag rows that break each rule
female_egg_data_report <- female_egg_data |>
  mutate(
    age_invalid           = age_days <= 0,
    eggs_laid_invalid     = eggs_laid <= 0,
    eggs_laid_outlier     = eggs_laid < (female_egg_data_summary$mean_eggs_laid - 3*female_egg_data_summary$sd_eggs_laid) |
                             eggs_laid > (female_egg_data_summary$mean_eggs_laid + 3*female_egg_data_summary$sd_eggs_laid),
    eggs_hatched_invalid  = eggs_hatched > eggs_laid
  )

# Step 3: View rows where any flag is TRUE
female_egg_data_report |> 
  filter(age_invalid | eggs_laid_invalid | eggs_laid_outlier | eggs_hatched_invalid)

column	type
female_id	integer
treatment	character
age_days	integer
eggs_laid	integer
eggs_hatched	integer

female_id	treatment	age_days	eggs_laid	eggs_hatched

female_id	treatment	age_days	eggs_laid	eggs_hatched	age_invalid	eggs_laid_invalid	eggs_laid_outlier	eggs_hatched_invalid
1	A	0	52	47	TRUE	FALSE	FALSE	FALSE
2	B	19	120	52	FALSE	FALSE	TRUE	FALSE
3	A	14	50	55	FALSE	FALSE	FALSE	TRUE

15.7 Options for within-group data checks

assertr is designed for whole-dataset validation, but some assertr::insist() rules make more sense when applied separately within each group.

Question

When might we want to consider groups for some data validations?

The insist() function compares each value against a summary of the entire column; in these examples we check that values lie within a set number of standard deviations of the mean. If the data contain multiple groups, it often makes more sense to perform this check within each group.

female_egg_data |> 
  group_by(treatment) |> 
  dplyr::group_modify(~ .x |> insist(within_n_sds(3), eggs_laid, error_fun = just_warn))

treatment	female_id	age_days	eggs_laid	eggs_hatched
A	1	0	52	47
A	3	14	50	55
A	5	10	59	50
A	7	22	49	49
A	9	5	42	42
A	11	14	44	44
A	13	25	48	44
A	15	27	41	41
A	17	19	47	47
A	19	25	56	37
B	2	19	120	52
B	4	3	46	46
B	6	18	55	43
B	8	11	44	38
B	10	20	47	39
B	12	22	57	46
B	14	26	41	41
B	16	5	47	43
B	18	27	49	34
B	20	28	48	46
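To make the group-wise rule concrete, the sketch below computes the 3 SD limits that each treatment group would be compared against. It uses dplyr only, on a hand-typed subset `toy` (the first four females of each treatment); the column names `mean_eggs`, `sd_eggs`, `lower` and `upper` are chosen here for illustration.

```r
library(dplyr)

# First four females of each treatment, entered by hand
toy <- data.frame(
  treatment = rep(c("A", "B"), each = 4),
  eggs_laid = c(52, 50, 59, 49, 120, 46, 55, 44)
)

# Limits are calculated separately for each treatment group
group_bounds <- toy |>
  group_by(treatment) |>
  summarise(
    mean_eggs = mean(eggs_laid),
    sd_eggs   = sd(eggs_laid),
    lower     = mean_eggs - 3 * sd_eggs,
    upper     = mean_eggs + 3 * sd_eggs
  )

group_bounds
```

The two groups end up with very different limits. Note that an extreme value also inflates the standard deviation of its own group, so in small groups a 3 SD rule may not flag it.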

15.8 Summary

You now know how to:

Use assertr to check data integrity before analysis.
Validate numeric and categorical variables.
Combine checks into a tidy pipeline.
Control what happens when checks fail.

Automated validation transforms your scripts into self-auditing workflows: they document and test your assumptions each time the data changes.