Save time, reduce errors, and work more efficiently in teams
September 8, 2025
Pragmatic Solutions and Best Practices




Create a new repository on GitHub, clone it to your local machine, and add a README.md file with a brief description of your project.


Windows context menu for GitHub: TortoiseGit



GitHub Desktop App


Many other ways to use GitHub: Eclipse, Positron, VS Code, RStudio, …




main only after review
main is always stable, ready for production
thomas, friedrich



github.com/fpahlke/good-engineering-workshop-demo/branche
GitHub branches overview page
Fetch origin (update local information) and then select branch friedrich.


Add a new R script and data file

Push the changes to GitHub
Use Copilot for the pull request description and review of the changes.




First invite Thomas as collaborator.






Then add Thomas as reviewer.











Check reviewer comments
We use the usethis package to create a new R package structure that offers various advantages, even for projects that are not R package projects:
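A minimal sketch of that setup, assuming the usethis package is installed. The project name demoProject and the temporary location are hypothetical and only serve to keep the example self-contained.

```r
# Hypothetical project location; in practice this would be your repository folder.
pkg_path <- file.path(tempdir(), "demoProject")

if (requireNamespace("usethis", quietly = TRUE)) {
  # Creates DESCRIPTION, NAMESPACE, and an R/ folder without opening a new session.
  usethis::create_package(pkg_path, open = FALSE, rstudio = FALSE)
  print(list.files(pkg_path))
}
```

Scripts, data, and tests can then be added to the folders of this skeleton step by step.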







Check reviewer comments
inst/extdata/ (preferred over inst/data/)
Restructure folders as requested by Thomas
Try to guess in 30 seconds. Would you trust this in production?
set.seed(7)
d <- read.csv("data.csv")
d <- d[!is.na(d$x1)&d$x1>0,]
d$g <- ifelse(d$trt==1,1,0)
d$y <- with(d, (x1*0.3+x2*0.1+g*0.5) + rnorm(nrow(d),0,1))
res <- tapply(d$y,d$g,mean)
zz <- res[2]-res[1]
S <- replicate(1000,{
  jj <- sample(nrow(d), nrow(d), replace=TRUE)
  tt <- tapply(d$y[jj], d$g[jj], mean)
  tt[2]-tt[1]
})
ci <- quantile(S, c(.025,.975))
cat(zz>0, ci[1], ci[2])
This script breaks common clean code rules:
(d, g, zz, S)
(trt)
(1000, 0.025, 0.975)
# Parameters
input_path <- "data.csv"
bootstrap_iterations <- 1000
alpha <- 0.05
seed <- 248672026 # runif(1, 1e08, 9e08); must fit into R's integer range for set.seed()
# Load & validate
stopifnot(file.exists(input_path))
raw <- read.csv(input_path)
stopifnot(all(c("x1", "x2", "trt") %in% names(raw)))
set.seed(seed)
# Prepare data
prepared <- subset(raw, !is.na(x1) & x1 > 0)
prepared$group <- ifelse(prepared$trt == 1, "treatment", "control")
# Effect estimate function
mean_diff <- function(y, group) {
  by_vals <- tapply(y, group, mean)
  unname(by_vals["treatment"] - by_vals["control"])
}
# Calculate effect estimate
prepared$y <- with(prepared,
  (x1 * 0.3 + x2 * 0.1 + (trt == 1) * 0.5) +
    rnorm(nrow(prepared), 0, 1))
estimate <- mean_diff(prepared$y, prepared$group)
# Bootstrap CI
re_idx <- replicate(bootstrap_iterations, sample.int(nrow(prepared),
  nrow(prepared), replace = TRUE))
boot_diffs <- apply(re_idx, 2,
  function(idx) mean_diff(prepared$y[idx], prepared$group[idx]))
ci <- quantile(boot_diffs,
  probs = c(alpha / 2, 1 - alpha / 2),
  names = FALSE)
result <- list(estimate = estimate, ci = ci)
result
Here, dplyr & friends can improve readability and intent.
# install.packages("dplyr") # if needed
library(dplyr)
params <- list(
  input_path = "data.csv",
  iterations = 1000,
  alpha = 0.05,
  seed = 248672026 # runif(1, 1e08, 9e08); must fit into R's integer range for set.seed()
)
set.seed(params$seed)
raw <- read.csv(params$input_path)
stopifnot(all(c("x1", "x2", "trt") %in% names(raw)))
prepared <- raw |>
  filter(!is.na(x1), x1 > 0) |>
  mutate(
    group = if_else(trt == 1, "treatment", "control"),
    y = (x1 * 0.3 + x2 * 0.1 + (trt == 1) * 0.5) + rnorm(n(), 0, 1)
  )
mean_diff <- function(df) {
  df |>
    summarize(diff = mean(y[group == "treatment"]) -
      mean(y[group == "control"])) |>
    pull(diff)
}
boot_diffs <- replicate(params$iterations, {
  s <- sample(nrow(prepared), nrow(prepared), replace = TRUE)
  mean_diff(prepared[s, ])
})
ci <- quantile(boot_diffs, c(params$alpha / 2, 1 - params$alpha / 2))
result <- list(estimate = mean_diff(prepared), ci = unname(ci))
result
This script breaks all common clean code rules:
y=function(x){
s1=0
for(v1 in x){s1=s1+v1}
m1=s1/length(x)
i=ceiling(length(x)/2)
if(length(x) %% 2 == 0){i=c(i,i+1)}
s2=0
for(v2 in i){s2=s2+x[v2]}
m2=s2/length(i)
c(m1,m2)
}
y(c(1:7, 100))
[1] 16.0  4.5
We now refactor it by applying clean code rules…
CCR#1 Naming: Are the names of the variables, functions, and classes descriptive and meaningful?
Examples:
Personal opinion: shorter words, i.e. less to write; as easy to read as snake_case
“Camels may eat snakes to obtain nutrients and cope with their harsh desert environment”
Source: afjrd.org/camels-eating-snakes
getMeanAndMedian=function(x){
sum1=0
for(value in x){sum1=sum1+value}
meanValue=sum1/length(x)
centerIndices=ceiling(length(x)/2)
if(length(x) %% 2 == 0){
centerIndices=c(centerIndices,centerIndices+1)
}
sum2=0
for(centerIndex in centerIndices){sum2=sum2+x[centerIndex]}
medianValue=sum2/length(centerIndices)
c(meanValue,medianValue)
}
CCR#1 Naming
CCR#2 Formatting: Are indentation, spacing, and bracketing consistent, i.e., is the code easy to read?
getMeanAndMedian <- function(x) {
  sum1 <- 0
  for (value in x) {
    sum1 <- sum1 + value
  }
  meanValue <- sum1 / length(x)
  centerIndices <- ceiling(length(x) / 2)
  if (length(x) %% 2 == 0) {
    centerIndices <- c(
      centerIndices, centerIndices + 1)
  }
  sum2 <- 0
  for (centerIndex in centerIndices) {
    sum2 <- sum2 + x[centerIndex]
  }
  medianValue <- sum2 / length(centerIndices)
  c(meanValue, medianValue)
}
CCR#2 Formatting
CCR#3 Simplicity: Did you keep the code as simple and straightforward as possible, i.e., did you avoid unnecessary complexity?
R/ folder
R/load_data.R, R/summarize_parameter.R
for (f in list.files(here::here("R"), "\\.R$", full.names = TRUE)) source(f) to source all R scripts in the R/ folder (devtools::load_all() might be useful)
inst/scripts/ (or scripts/), e.g., inst/scripts/run_analysis.R
CCR#3 Simplicity
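A small sketch of sourcing every script in an R/ folder with base R only. A temporary folder stands in for the project's R/ folder (in a real project the path could come from here::here("R")), and the two helper functions written into the files are hypothetical.

```r
# Demo setup: a temporary folder stands in for the project's R/ folder.
r_dir <- file.path(tempdir(), "R")
dir.create(r_dir, showWarnings = FALSE)
writeLines("loadData <- function() data.frame(x = 1:3)",
  file.path(r_dir, "load_data.R"))
writeLines("summarizeParameter <- function(d) mean(d$x)",
  file.path(r_dir, "summarize_parameter.R"))

# source() reads one file per call, so iterate over the full paths.
scripts <- list.files(r_dir, pattern = "\\.R$", full.names = TRUE)
for (script in scripts) {
  source(script)
}

# The functions defined in the sourced files are now available.
summarizeParameter(loadData())
```

Note that full.names = TRUE matters here: without it, list.files() returns bare file names that only resolve when the working directory happens to be the R/ folder.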
CCR#4 Single Responsibility Principle (SRP): Does each function have only a single, well-defined purpose?
getMean <- function(x) {
  sum(x) / length(x)
}
isLengthAnEvenNumber <- function(x) {
  length(x) %% 2 == 0
}
getMedian <- function(x) {
  centerIndices <- ceiling(length(x) / 2)
  if (isLengthAnEvenNumber(x)) {
    centerIndices <- c(centerIndices, centerIndices + 1)
  }
  sum(x[centerIndices]) / length(centerIndices)
}
CCR#4 Single Responsibility Principle (SRP)
CCR#5 Don’t Repeat Yourself (DRY): Did you avoid duplication of code, either by reusing existing code or creating functions?
CCR#5: DRY
Suppose you have a code block that performs the same calculation multiple times:
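As an illustration of such duplication, a hedged sketch with made-up measurements; the variable names and values are hypothetical and chosen only for this example.

```r
# Hypothetical measurements for three groups.
weightsA <- c(61, 72, 68)
weightsB <- c(80, 77, 91)
weightsC <- c(55, 63, 59)

# The same standard-error formula is typed three times;
# a typo or a later change has to be fixed in three places.
seA <- sd(weightsA) / sqrt(length(weightsA))
seB <- sd(weightsB) / sqrt(length(weightsB))
seC <- sd(weightsC) / sqrt(length(weightsC))
```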
Create a function to encapsulate this calculation and reuse it multiple times:
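One possible shape of such a function, sketched with a hypothetical helper getStandardError() and made-up measurements; neither comes from the workshop material.

```r
# Hypothetical helper: standard error of the mean.
getStandardError <- function(x) {
  sd(x) / sqrt(length(x))
}

# Hypothetical measurements for three groups.
weights <- list(
  A = c(61, 72, 68),
  B = c(80, 77, 91),
  C = c(55, 63, 59)
)

# One definition, reused for every group.
standardErrors <- sapply(weights, getStandardError)
standardErrors
```

A change to the formula now happens in exactly one place.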
CCR#5 Don’t Repeat Yourself (DRY)
CCR#6 Documentation: Did you use comments to explain the purpose of code blocks and to clarify complex logic?
Roxygen (R package roxygen2):
#'
#' Calculate Mean Value
#'
#' @description
#' Computes the arithmetic mean of a numeric vector.
#'
#' @param x A numeric vector.
#'
#' @return A numeric scalar representing the mean of \code{x}.
#'
#' @examples
#' getMean(c(1, 2, 3, 4))
#'
getMean <- function(x) {
  sum(x) / length(x)
}
#'
#' Check if Length is Even
#'
#' @description
#' Checks whether the length of the provided vector is even.
#'
#' @param x A vector to check.
#'
#' @return A logical value. Returns \code{TRUE} if the length of
#' \code{x} is even and \code{FALSE} otherwise.
#'
#' @examples
#' isLengthAnEvenNumber(c(1, 2, 3, 4))
#' isLengthAnEvenNumber(1:5)
#'
isLengthAnEvenNumber <- function(x) {
  length(x) %% 2 == 0
}
#'
#' Calculate Median
#'
#' @description
#' Computes the median value of a numeric vector.
#' For even-length vectors, the median is calculated
#' as the mean of the two center elements.
#'
#' @param x A numeric vector.
#'
#' @return A numeric scalar representing the median of \code{x}.
#'
#' @examples
#' getMedian(c(1, 3, 5, 7))
#'
getMedian <- function(x) {
  centerIndices <- ceiling(length(x) / 2)
  if (isLengthAnEvenNumber(x)) {
    centerIndices <- c(centerIndices,
      centerIndices + 1)
  }
  getMean(x[centerIndices])
}
# returns the mean of x
getMean <- function(x) {
  sum(x) / length(x)
}
# returns TRUE if the length of x is
# an even number; FALSE otherwise
isLengthAnEvenNumber <- function(x) {
  length(x) %% 2 == 0
}
# returns the median of x
getMedian <- function(x) {
  centerIndices <- ceiling(length(x) / 2)
  if (isLengthAnEvenNumber(x)) {
    centerIndices <- c(centerIndices,
      centerIndices + 1)
  }
  getMean(x[centerIndices])
}
#' returns the mean of x
getMean <- function(x) {
  checkmate::assertNumeric(x)
  sum(x) / length(x)
}
#' returns TRUE if the length of x is an even number; FALSE otherwise
isLengthAnEvenNumber <- function(x) {
  checkmate::assertVector(x)
  length(x) %% 2 == 0
}
#' returns the median of x
getMedian <- function(x) {
  checkmate::assertNumeric(x)
  centerIndices <- ceiling(length(x) / 2)
  if (isLengthAnEvenNumber(x)) {
    centerIndices <- c(centerIndices, centerIndices + 1)
  }
  getMean(x[centerIndices])
}
CCR#7 Error Handling
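To see what such an input check does at run time, here is a small sketch that uses base R's stopifnot() as a stand-in for the checkmate assertions (an assumption made only to keep the example free of dependencies). The function name getMeanChecked is hypothetical.

```r
# Hypothetical variant of getMean() with a base-R input check.
getMeanChecked <- function(x) {
  stopifnot(is.numeric(x))
  sum(x) / length(x)
}

# Valid input is processed as before.
getMeanChecked(c(1, 2, 3, 4))

# Invalid input stops at the check; the message names the violated condition.
errorMessage <- tryCatch(
  getMeanChecked("a"),
  error = function(e) conditionMessage(e)
)
errorMessage
```

Failing at the top of the function keeps the error close to its cause, which is the same idea the checkmate calls follow.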
Recommended quality workflow for R scripts and projects:
R package testthat
testthat in your project with usethis::use_testthat() (see below) to create a tests/testthat/ folder
Example: unit test passed
Example: unit test failed
Error: getMean(c(1, 3, 2, NA)) not equal to 2.
Error: getMedian(c(1, 3, 2)) not equal to 2.
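The two failures point at missing values in getMean() and at unsorted input in getMedian(). A hedged sketch of matching tests, assuming testthat is installed; dropping NA values and calling sort() are one possible fix chosen for illustration, not necessarily the one used in the workshop.

```r
library(testthat)

# Possible fix (assumption): drop missing values before averaging.
getMean <- function(x) {
  x <- x[!is.na(x)]
  sum(x) / length(x)
}

# Possible fix (assumption): sort before picking the center element(s).
getMedian <- function(x) {
  x <- sort(x)
  centerIndices <- ceiling(length(x) / 2)
  if (length(x) %% 2 == 0) {
    centerIndices <- c(centerIndices, centerIndices + 1)
  }
  getMean(x[centerIndices])
}

test_that("getMean ignores missing values", {
  expect_equal(getMean(c(1, 3, 2, NA)), 2)
})

test_that("getMedian does not depend on input order", {
  expect_equal(getMedian(c(1, 3, 2)), 2)
  expect_equal(getMedian(c(4, 1, 3, 2)), 2.5)
})
```

In a project, the test_that() blocks would live in a file under tests/testthat/, and the functions under R/.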
message() for progress; keep it short
set.seed() where randomness matters
sessionInfo()
Example: sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] testthat_3.2.3 assertthat_0.2.1
loaded via a namespace (and not attached):
[1] desc_1.4.3 digest_0.6.37 R6_2.6.1 fastmap_1.2.0
[5] xfun_0.53 magrittr_2.0.3 glue_1.8.0 knitr_1.50
[9] htmltools_0.5.8.1 rmarkdown_2.29 lifecycle_1.0.4 cli_3.6.5
[13] vctrs_0.6.5 pkgload_1.4.0 compiler_4.5.1 rprojroot_2.1.1
[17] tools_4.5.1 brio_1.1.5 pillar_1.11.0 evaluate_1.0.5
[21] yaml_2.3.10 rlang_1.1.6 jsonlite_2.0.0
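The three habits named before this output (message(), set.seed(), sessionInfo()) fit into a few lines of base R. The sketch below is only an illustration; the seed value and the file name session_info.txt are hypothetical.

```r
# Fix the seed so the simulated values can be reproduced.
set.seed(123)
simulated <- rnorm(5)

# Short progress message; message() writes to stderr, not into the results.
message("Simulated ", length(simulated), " values")

# Store the session details next to the results; the file name is hypothetical.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```

Saving the session details with each run makes it possible to check later which R and package versions produced a result.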
Avoid the need to edit the source code on different systems and in different repositories, e.g., due to the use of absolute paths.
Parameters in a params.yml file
Use the config package to read the YAML file:
Note: save the yml file in inst/ folder.
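A minimal sketch of this pattern, assuming the config package is installed. The parameter names mirror the refactored script shown earlier, but the file contents are an assumption, and a temporary file stands in for inst/params.yml.

```r
# Hypothetical parameter file; in a project it would live at inst/params.yml.
params_file <- file.path(tempdir(), "params.yml")
writeLines(c(
  "default:",
  "  input_path: data.csv",
  "  bootstrap_iterations: 1000",
  "  alpha: 0.05"
), params_file)

if (requireNamespace("config", quietly = TRUE)) {
  # config::get() returns the entries of the active configuration ("default").
  params <- config::get(file = params_file)
  print(params$bootstrap_iterations)
}
```

The top-level default: key is required by the config package; further configurations (for example for a test system) can be added next to it.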
vignettes/ folder of your project to enable automatic building of documents, reports, or vignettes
usethis function usethis::use_vignette())
Two popular R packages support the tidyverse style guide:
Quite new (2025):
The devtools function spell_check runs a spell check on text fields in the package description file, manual pages, and optionally vignettes.
library(dplyr)
library(knitr)
data_clean |>
  filter(!is.na(y)) |>
  mutate(treatment_arm = arm) |>
  group_by(treatment_arm) |>
  summarize(n = n(),
            mean = mean(y),
            sd = sd(y),
            se = sd(y) / sqrt(length(y))) |>
  kable()

| treatment_arm | n | mean | sd | se |
|---|---|---|---|---|
| A | 103 | 17.73372 | 8.523611 | 0.8398563 |
| B | 97 | 21.37104 | 6.888219 | 0.6993927 |
README.md files ensure you still understand your work years later
Take-home message: GitHub makes everyday work and team collaboration much easier, even in very small teams.
Advantages of using an R package structure for projects:
Example: github.com/fpahlke/demoProject1
roxygen2-style documentation for functions
Example project repository:
Example R package repository:
openstatsware working group:
Cloud based coding agents with GitHub integration:
Coding agents for the command line:
Your scenarios, your code, …