16  Function-based data checks

16.1 Motivation

We have run through the principles of organised data dictionaries and data validation. Once we understand the consistent types of dataset we encounter, we can start building data validation pipelines.

  • Pipelines are collections of custom data functions that can apply consistent data-cleaning steps to any data that fits a particular layout (a minimal sketch follows this list)

  • Pipelines can help speed up the process of data validation, analysis and plotting
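
To make this concrete, here is a toy pipeline of our own (the two-row dataset and both steps are purely illustrative): each step is an ordinary function, chained together with the pipe.

library(janitor)
library(dplyr)

# a made-up two-row dataset with messy column names
raw_data <- data.frame("Female ID" = c(1, 2),
                       "Eggs Laid" = c("52", "120"),
                       check.names = FALSE)

cleaned_data <- raw_data |>
  clean_names() |>                           # standardise column names
  mutate(eggs_laid = as.integer(eggs_laid))  # enforce column types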

16.1.1 Examples

  • Parameterised reports - simple summary reports that can be generated from a dataset at the push of a button

  • Parameterised data import and validation pipelines - our second example, and the one presented here

16.2 The data

In the previous chapters we worked through the process of applying data validation checks to a dataset on female age, fertility and fecundity across treatments:

head(female_egg_data)
 female_id treatment age_days eggs_laid eggs_hatched
         1         A        0        52           47
         2         B       19       120           52
         3         A       14        50           55
         4         B        3        46           46
         5         A       10        59           50
         6         B       18        55           43

Even this preview hints at a problem: for female 3, more eggs hatched (55) than were laid (50) - exactly the kind of inconsistency a validation check should catch.

16.2.1 Loading and cleaning data

In our previous chapters we worked through the steps of importing a dataset, standardising column names, formatting column types and standardising date formats.

We also carried out simple exploratory steps using packages such as skimr.

When we understand the requirements of our data, we can choose to convert code written for one specific dataset into something more abstract and reusable: our own functions, a more functional style of programming.

library(tidyverse)
library(janitor)   # clean_names()
library(lubridate) # dmy()
library(skimr)     # skim()

data <- read_csv("data/raw/female_egg_data.csv") |>
  clean_names() |>
  mutate(date = dmy(date))

skim(data)

print(data)

R makes it easy to create user-defined functions using function(). Here is how it works:

# this is an example function
my_function_name <- function(argument1, argument2) {
  # document your function here:
  # what the function does
  # its inputs (arguments) and outputs
  some_calculated_output <- argument1 + argument2
  
  return(some_calculated_output)
}

Here is a very simple function. Can you guess what it does?

add_one <- function(x) {
  return(x + 1)
}
add_one(10)
[1] 11
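
Functions usually take several arguments, and optional ones can be given default values - a pattern the functions later in this chapter rely on heavily. A small example of our own:

# a function with a default value for its second argument
add_n <- function(x, n = 1) {
  x + n
}
add_n(10)        # uses the default: 11
add_n(10, n = 5) # overrides it: 15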

Your turn

Using the RStudio drop-down Code > Extract Function, you can highlight code and start turning it into a named function.

  • this is relatively simple and may not give you exactly what you want (AI support can help here) - see the rough illustration below
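
As a rough illustration (the exact result will vary), extracting the import code from earlier might give you something like this, assuming the same packages are loaded as above:

# approximately what Code > Extract Function produces from the import code;
# you will usually want to rename the function and generalise its arguments
load_eggs <- function(path) {
  read_csv(path) |>
    clean_names() |>
    mutate(date = dmy(date))
}

data <- load_eggs("data/raw/female_egg_data.csv")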

When we have a series of functions, these can be run without modification on new data:

#' Load and clean a delimited dataset
#'
#' This function reads a delimited text file, cleans its column names, 
#' optionally converts a specified date column to a proper `Date` class, 
#' and can display a summary of the data using `skimr::skim()`.
#'
#' @param path Character string giving the file path to the dataset to load.
#' @param date_col Optional; character string specifying
#'   which column to parse as a date. If the column exists, it will be converted
#'   using `lubridate::dmy()`. If not found, a warning is issued.
#' @param delim Character string specifying the field delimiter used in the file.
#'   Defaults to a comma (`,`) for CSV files.
#' @param show_skim Logical; if `TRUE` (default), prints a quick data summary
#'   using `skimr::skim()`.
#'
#' @return A cleaned `data.frame` (tibble) with standardized column names, and
#'   optionally a converted date column.
#'
#' @details
#' Column names are standardized to lower snake_case via `janitor::clean_names()`.
#' If a date column is specified and present, the function attempts to convert it
#' assuming day-month-year format.
#'
#' @examples
#' \dontrun{
#' df <- load_and_clean_data("data/sales.csv", date_col = "order_date")
#' }
#'
#' @export
load_and_clean_data <- function(path, 
                                date_col = NULL,
                                delim = ",",
                                show_skim = TRUE) {
  # Read data using readr
  df <- readr::read_delim(path, delim = delim, show_col_types = FALSE) |>
    janitor::clean_names()
  
  # Convert date column if provided
  if (!is.null(date_col) && all(date_col %in% names(df))) {
    df <- df |>
      dplyr::mutate(dplyr::across(dplyr::all_of(date_col),
                                  ~ lubridate::dmy(as.character(.x))))
  } else if (!is.null(date_col)) {
    warning(paste("Column", date_col, "not found in data - skipping date conversion"))
  }
  
  # Optionally skim summary
  if (show_skim) {
    cat("\n📊 Quick data summary:\n")
    print(skimr::skim(df))
  }
  
  return(df)
}
load_and_clean_data("data/raw/female_egg_data.csv")
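
The same function now handles other layouts just by changing its arguments. For example (the file name and date column here are hypothetical):

# a hypothetical tab-delimited file with a day-month-year date column
load_and_clean_data("data/raw/male_longevity_data.txt",
                    delim = "\t",
                    date_col = "collection_date",
                    show_skim = FALSE)
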
Note

In theory this function could be applied to any dataset, so long as it is stored as a tidy, delimited file.

16.2.2 Validating data

#' Robust data validation checks
#'
#' Performs simple but resilient validation of a dataset:
#' verifies column presence (optionally from metadata), checks types,
#' enforces uniqueness and bounds, and validates logical consistency.
#'
#' @param data A data frame or tibble.
#' @param metadata_path Optional path to an Excel file containing a
#'   "Data Dictionary" sheet. Column names are read from the `Name` column.
#' @param numeric_cols Character vector of columns expected to be numeric.
#' @param character_cols Character vector of columns expected to be character.
#' @param unique_cols Character vector of columns whose values must be unique.
#' @param bounded_cols Character vector of numeric columns checked within `bounds`.
#' @param bounds Numeric vector of length 2 specifying lower and upper limits.
#' @param treatment_levels Allowed values for the `treatment` column.
#' @param just_warn If TRUE (default), issues warnings instead of stopping on errors.
#'
#' @return The validated data frame, invisibly.
#' @export
data_checks <- function(data,
                        metadata_path = NULL,
                        numeric_cols = NULL,
                        character_cols = NULL,
                        unique_cols = NULL,
                        bounded_cols = NULL,
                        bounds = c(1, 30),
                        treatment_levels = c("A", "B"),
                        just_warn = TRUE) {
  stopifnot(is.data.frame(data))
  stopifnot(length(bounds) == 2, is.numeric(bounds), bounds[1] < bounds[2])

  handler <- if (just_warn) assertr::just_warn else assertr::error_stop

  # --- Metadata check (safe) ---
  required_cols <- NULL
  if (!is.null(metadata_path)) {
    metadata <- tryCatch(
      readxl::read_excel(metadata_path, sheet = "Data Dictionary"),
      error = function(e) {
        warning("Could not read metadata: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(metadata) && "Name" %in% names(metadata)) {
      required_cols <- janitor::make_clean_names(metadata$Name)
      data_names <- janitor::make_clean_names(names(data))
      missing <- setdiff(required_cols, data_names)
      if (length(missing) > 0) {
        msg <- paste("Missing required columns:", paste(missing, collapse = ", "))
        if (just_warn) warning(msg, call. = FALSE) else stop(msg, call. = FALSE)
      }
    }
  }

  # --- Type checks ---
  if (!is.null(numeric_cols)) {
    found <- intersect(numeric_cols, names(data))
    data <- assertr::assert(data, is.numeric, !!!rlang::syms(found), error_fun = handler)
  }
  if (!is.null(character_cols)) {
    found <- intersect(character_cols, names(data))
    data <- assertr::assert(data, is.character, !!!rlang::syms(found), error_fun = handler)
  }

  # --- Non-missing ---
  if (ncol(data) > 0) {
    data <- assertr::assert(data, assertr::not_na, !!!rlang::syms(names(data)), error_fun = handler)
  }

  # --- Uniqueness ---
  if (!is.null(unique_cols)) {
    found <- intersect(unique_cols, names(data))
    data <- assertr::assert(data, assertr::is_uniq, !!!rlang::syms(found), error_fun = handler)
  }

  # --- Treatment levels ---
  if ("treatment" %in% names(data)) {
    data <- assertr::assert(data, assertr::in_set(treatment_levels), treatment, error_fun = handler)
  }

  # --- Bounded numeric ---
  if (!is.null(bounded_cols)) {
    found <- intersect(bounded_cols, names(data))
    data <- assertr::assert(data,
                            assertr::within_bounds(bounds[1], bounds[2]),
                            !!!rlang::syms(found),
                            error_fun = handler)
  }

  # --- Logical rule ---
  if (all(c("eggs_hatched", "eggs_laid") %in% names(data))) {
    data <- assertr::verify(data, eggs_hatched <= eggs_laid, error_fun = handler)
  }

  invisible(data)
}
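
By default, failed checks only issue warnings (via assertr::just_warn), so all of the remaining checks still get a chance to run. Setting just_warn = FALSE halts at the first failure instead (via assertr::error_stop). The bounds below are illustrative:

# strict mode: any failed assertion halts the pipeline
data_checks(female_egg_data,
            numeric_cols = c("age_days", "eggs_laid", "eggs_hatched"),
            bounded_cols = "age_days",
            bounds = c(0, 30),
            just_warn = FALSE)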

Question

Can we think of any ways this code could potentially be abstracted/improved further?
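
One possible direction (a sketch of one answer, not the only one): store the expectations as data - a list of rules that could eventually be generated straight from the data dictionary - and splice them into the call:

# hypothetical: expectations stored as data rather than typed arguments
rules <- list(numeric_cols   = c("female_id", "age_days",
                                 "eggs_laid", "eggs_hatched"),
              character_cols = "treatment",
              unique_cols    = "female_id")

checked <- do.call(data_checks, c(list(data = female_egg_data), rules))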

Your turn

egg_data <- load_and_clean_data("data/raw/female_egg_data.csv") |>
  data_checks(metadata_path = "data-dictionary/insect_egg_metadata.xlsx",
              numeric_cols = c("female_id",
                               "age_days",
                               "eggs_laid",
                               "eggs_hatched"),
              character_cols = "treatment")
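
Because data_checks() returns the data invisibly, the final result is not printed; inspect it explicitly:

head(egg_data)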