Reproducibility with R

Martin Schweinberger

January 1, 2026


# Introduction

This tutorial covers best practices for conducting reproducible, transparent, and well-organised research in R. Reproducibility — the ability of another researcher (or your future self) to re-run your analysis and obtain the same results — is increasingly recognised as a cornerstone of credible science. Journals, funders, and research institutions are all moving toward requiring it. R, combined with a small set of tools and habits, makes genuine reproducibility achievable with relatively little extra effort.
The tutorial works through the key components of a reproducible R workflow: project organisation, R Projects, reproducible documents, dependency control, version control, file paths, random seeds, and tidy data principles. Each section explains not just how to use a tool, but why it matters for reproducibility.
Before working through this tutorial, please complete or familiarise yourself with:

- Getting Started with R and RStudio
- Loading, Saving, and Generating Data in R
- Handling Tables in R
- Concepts in Reproducible Research

**Citation:** Martin Schweinberger. 2026. *Reproducibility with R*. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/r_reproducibility/r_reproducibility.html (Version 2026.03.28).
**What this tutorial covers:**

1. **R Projects** — self-contained, portable project folders
2. **Folder structure** — organising files for clarity and reproducibility
3. **Reproducible documents** — R Markdown and Quarto notebooks
4. **Code style** — writing readable, maintainable R code
5. **Dependency control with `renv`** — locking package versions
6. **Version control with Git and GitHub** — tracking and sharing changes
7. **File paths with `here`** — portable, machine-independent paths
8. **Reproducible randomness with `set.seed()`** — replicable stochastic results
9. **Tidy data principles** — structuring data for analysis
10. **Efficient data storage** — choosing the right file format

## Preparation

Install the packages used in this tutorial (once only):

```r
install.packages("dplyr")
install.packages("tidyr")
install.packages("flextable")
install.packages("here")
install.packages("renv")
install.packages("gapminder")
install.packages("checkdown")
```

Load packages:

```r
library(dplyr)
library(tidyr)
library(flextable)
library(here)
library(checkdown)
```
# R Projects

**What you’ll learn:** How to use R Projects to create self-contained, portable analysis environments

**Why it matters:** R Projects eliminate the most common cause of broken scripts — hardcoded file paths — and make it trivial to share a complete analysis with collaborators
An R Project is a folder identified by a .Rproj file. When you open a project in RStudio (via File → Open Project, or by double-clicking the .Rproj file), RStudio automatically sets the working directory to the project folder. Everything — scripts, data, outputs — lives relative to that root.

## Why Not `setwd()`?

Before R Projects became standard, analysts used `setwd()` to tell R where to look for files:

```r
# The old way — do not do this
setwd("C:/Users/Martin/Documents/Projects/MyCorpusStudy/")
```
This approach has serious problems:
- **It is machine-specific.** The path above works only on Martin's computer. Anyone else who runs the script must manually edit it.
- **It breaks when folders move.** Renaming or moving the project folder invalidates every `setwd()` call.
- **It creates hidden dependencies.** The script appears to be self-contained but secretly depends on a specific folder structure on a specific machine.

R Projects solve all of these problems. The working directory is always the project root, regardless of which machine the project is opened on.
## Creating an R Project

1. In RStudio, click `File → New Project`
2. Select `New Directory` (for a fresh project) or `Existing Directory` (for a folder you have already created)
3. Navigate to or name your project folder
4. Click `Create Project`

A `.Rproj` file is created in the folder. From now on, open this project by double-clicking the `.Rproj` file or via `File → Open Project` in RStudio.
Create a new R Project for each distinct research project or analysis. Never work across projects by navigating to different folders with setwd(). Each project should be fully self-contained: opening the .Rproj file should be sufficient to reproduce the entire analysis.
# Folder Structure

**What you’ll learn:** How to organise files within a project for clarity and long-term maintainability

**Why it matters:** A consistent folder structure makes it immediately obvious where to find files and where to put new ones — for you and for collaborators
A well-organised project folder makes a project understandable at a glance. The following structure works well for most linguistic research projects:
```
my_project/
├── my_project.Rproj    ← R Project file (root anchor)
├── data/
│   ├── raw/            ← original, unmodified data (treat as read-only)
│   └── processed/      ← cleaned/transformed data
├── R/                  ← R scripts (.R files)
├── notebooks/          ← R Markdown / Quarto notebooks (.Rmd, .qmd)
├── outputs/
│   ├── figures/        ← saved plots
│   └── tables/         ← exported tables
└── docs/               ← notes, reports, paper drafts
```
A few principles worth following:
**Raw data is sacred.** Never overwrite or modify the original data files. Save all processed versions separately to `data/processed/`. This means you can always re-derive processed data from the raw source.

**Separate scripts from notebooks.** Keep short, reusable functions and data processing steps in `.R` scripts inside `R/`. Keep narrative analyses with integrated output in notebooks inside `notebooks/`.

**Name files to sort logically.** Use numeric prefixes for scripts that must run in order (`01_clean_data.R`, `02_analyse.R`, `03_visualise.R`). Use `snake_case` for all file names — spaces and special characters in file names cause problems across operating systems.
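The folder skeleton above can also be created programmatically. A minimal sketch (assuming the project root is already set by the `.Rproj` file; folder names mirror the structure shown above):

```r
library(here)

# Folders to create, relative to the project root
dirs <- c("data/raw", "data/processed", "R", "notebooks",
          "outputs/figures", "outputs/tables", "docs")

# recursive = TRUE creates parent folders as needed;
# showWarnings = FALSE makes the script safe to re-run
for (d in dirs) {
  dir.create(here::here(d), recursive = TRUE, showWarnings = FALSE)
}
```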
# Reproducible Documents: R Markdown and Quarto

**What you’ll learn:** What R Markdown and Quarto notebooks are, why they are the gold standard for reproducible reporting, and how to use them effectively

**Why it matters:** A rendered notebook is a complete, self-verifying record of your analysis — every number, table, and figure is generated fresh from code each time it is rendered
A notebook (an R Markdown `.Rmd` file or a Quarto `.qmd` file) combines three things in one document: narrative prose, executable code chunks, and the output (numbers, tables, figures) that those chunks produce.
When you click Render (Quarto) or Knit (R Markdown), R executes every code chunk from scratch in a clean environment and weaves the output together with the prose into a finished HTML, PDF, or Word document.


The key property of a rendered notebook is that every result is derived directly from code. There is no manual copying of numbers from a statistical output into a Word document — a step that is both error-prone and opaque. When a reviewer asks “where does this number come from?”, you can point directly to the code chunk that produced it.
This means that every number, table, and figure in the rendered document can be regenerated at any time simply by re-rendering it.
Notebooks also document the reasoning behind analytical choices, not just the code. Prose in a notebook can explain why a particular model was chosen, what a diagnostic plot revealed, or why certain observations were excluded — information that a bare script cannot capture.
While notebooks are most commonly associated with quantitative and computational analyses, they are increasingly used to document qualitative and interpretative work. A notebook can display annotation decisions alongside the data being annotated, making the logic of qualitative coding transparent and verifiable. Recent studies have demonstrated their value in corpus pragmatics and corpus-based discourse analysis for exactly this purpose.
R Markdown (.Rmd) is the original notebook format for R, stable and widely used. Quarto (.qmd) is its successor — it supports R, Python, Julia, and Observable JS in the same document, has a cleaner syntax, and is the format used by all LADAL tutorials. If you are starting fresh, use Quarto. If you have existing R Markdown files, they continue to work and do not need to be converted.
| Feature | R Markdown | Quarto |
|---|---|---|
| File extension | `.Rmd` | `.qmd` |
| Render function | `knitr::knit()` / `rmarkdown::render()` | `quarto::quarto_render()` |
| Multi-language | R only (+ Python via reticulate) | R, Python, Julia, Observable |
| Output formats | HTML, PDF, Word, slides | HTML, PDF, Word, slides, books, websites |
| LADAL tutorials | Legacy format | Current format |
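Both formats can also be rendered programmatically, which is useful in scripted pipelines. A sketch (the file names are illustrative, not from the original):

```r
# Render a Quarto notebook (requires the quarto R package and the Quarto CLI)
quarto::quarto_render("notebooks/analysis.qmd")

# Render an R Markdown notebook
rmarkdown::render("notebooks/analysis.Rmd")
```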
**Q1.** Why is using `setwd()` at the top of a script considered bad practice for reproducibility?

**Q2.** What is the key reproducibility advantage of a rendered notebook over a plain R script?
# Code Style

**What you’ll learn:** Conventions for writing readable, consistent R code

**Why it matters:** Code you write today will be read — by collaborators, reviewers, and your future self — months or years from now. Readable code is reproducible code.
Reproducibility is not only about running code — it is also about understanding it. Code that is hard to read is hard to verify, hard to modify, and hard to debug. The following conventions are widely adopted in the R community and are used throughout LADAL tutorials.
```r
# Good: lowercase with underscores (snake_case)
word_count <- 42
reaction_time_ms <- 487.3
corpus_summary <- data.frame()

# Avoid: mixed case, dots, or cryptic abbreviations
WordCount <- 42        # CamelCase (used in some communities, not LADAL)
reaction.time <- 487.3 # dots can conflict with S3 method names
rt <- 487.3            # cryptic — what does rt mean?
```

```r
# Good: spaces around operators, after commas, consistent indentation
corpus_summary <- corpus_data |>
  dplyr::filter(register == "Academic") |>
  dplyr::group_by(speaker_id) |>
  dplyr::summarise(
    mean_wc = mean(word_count, na.rm = TRUE),
    sd_wc = sd(word_count, na.rm = TRUE),
    .groups = "drop"
  )

# Avoid: cramped, hard-to-read code
corpus_summary<-corpus_data|>dplyr::filter(register=="Academic")|>
  dplyr::group_by(speaker_id)|>dplyr::summarise(mean_wc=mean(word_count,na.rm=TRUE))
```

```r
# Good: comments explain WHY, not just WHAT
# Remove speakers with fewer than 3 observations — insufficient data for per-speaker models
corpus_data <- corpus_data |>
  dplyr::group_by(speaker_id) |>
  dplyr::filter(dplyr::n() >= 3) |>
  dplyr::ungroup()

# Avoid: comments that merely restate the code
# group by speaker_id and filter
corpus_data <- corpus_data |>
  dplyr::group_by(speaker_id) |>
  dplyr::filter(dplyr::n() >= 3) |>
  dplyr::ungroup()
```

**Keep lines under 80 characters.** RStudio shows a vertical guideline at column 80 by default (`Tools → Global Options → Code → Display → Show margin`). Long lines are hard to read and cause problems in version control diffs.
Every R script or notebook should follow a consistent top-to-bottom structure:
```r
# ===========================================================
# Script: 02_analyse_register_variation.R
# Author: Martin Schweinberger
# Date: 2026-02-19
# Description: Mixed-effects model of word count by register
# ===========================================================

# 1. PACKAGES ------------------------------------------------
library(dplyr)
library(lme4)
library(here)

# 2. OPTIONS -------------------------------------------------
options(stringsAsFactors = FALSE)
options(scipen = 100)
set.seed(42)

# 3. LOAD DATA -----------------------------------------------
corpus <- readRDS(here::here("data", "processed", "corpus_clean.rds"))

# 4. ANALYSIS ------------------------------------------------
# ... analysis code here

# 5. SAVE OUTPUTS --------------------------------------------
# ... save results here
```

## The `lintr` and `styler` Packages
Two packages automate style checking and fixing:
- `lintr` checks your code against a style guide and reports violations — like a spell-checker for code style
- `styler` automatically reformats your code to comply with the tidyverse style guide

# Dependency Control with `renv`

**What you’ll learn:** How to use `renv` to lock the exact package versions used in a project, so the analysis can be reproduced identically on any machine — now or in the future

**Why it matters:** R packages change over time. An analysis that runs correctly today may produce different results or fail entirely in two years if packages have been updated. `renv` prevents this.
Consider this scenario: you publish a paper in 2024 using a mixed-effects model fitted with lme4 version 1.1-35. A reviewer in 2025 tries to reproduce your analysis, but lme4 1.1-37 has changed how it handles singular fit warnings and produces slightly different output. Your analysis is no longer exactly reproducible — even with identical code and data.
This is package version drift, and it is one of the most common obstacles to long-term reproducibility. The renv package solves it.
## How `renv` Works

`renv` creates a project-local library — a folder inside your project that contains exactly the versions of all packages used. When you share the project, collaborators install the same versions from the `renv.lock` file. The project is isolated from the user’s main R library, so updates to packages installed globally do not affect the project.

## Initialising `renv`

Running `renv::init()` scans your project for package dependencies, installs them into the project library, and creates a `renv.lock` file recording exact package versions.
## The `renv.lock` File

The lock file is plain text (JSON format) and records the name, version, and source of every package:
```json
{
  "R": {
    "Version": "4.3.2",
    "Repositories": [{"Name": "CRAN", "URL": "https://cloud.r-project.org"}]
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.1.4",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "..."
    }
  }
}
```

This file should be committed to version control (Git) — it is the machine-readable specification of your software environment.
## The `renv` Workflow

```r
# Install a new package (adds it to the project library)
renv::install("emmeans")

# After installing or removing packages, update the lock file
renv::snapshot()

# Restore the project environment on another machine or after a clean install
renv::restore()

# Check the status of the project library vs. the lock file
renv::status()
```

## Sharing an `renv` Project

When you share your project (e.g., by pushing to GitHub or sending a zip file):

1. the collaborator opens the `.Rproj` file in RStudio
2. `renv` automatically detects the lock file and prompts the collaborator to run `renv::restore()`
3. `renv` installs the exact package versions specified in `renv.lock`

## `renv` vs. `packrat`

`renv` is the modern successor to the older `packrat` package. If you have projects using `packrat`, they can be migrated to `renv` with `renv::migrate()`. `packrat` should be considered deprecated for new projects.
**Q1.** What does `renv::snapshot()` do?
# Version Control with Git and GitHub

**What you’ll learn:** How to use Git for version control and GitHub for sharing and collaborating on R projects

**Why it matters:** Version control is a complete, timestamped record of every change ever made to your code. It makes collaboration safe, enables you to revert mistakes, and provides a permanent citable home for your analysis.
Git is a version control system — software that tracks changes to files over time. Every time you commit a set of changes, Git records what changed, who changed it, and when. You can browse the entire history of a project, compare any two versions, and revert to any earlier state.
GitHub is a web platform that hosts Git repositories. It serves as a permanent, shareable, and optionally public home for your project. A GitHub repository can be shared with collaborators, cited in papers (with a DOI via Zenodo), and submitted as supplementary material to journals.

Before using Git with RStudio, you need Git installed on your computer:
- **macOS:** open the Terminal and run `git --version` — if Git is not installed, macOS will prompt you to install it via Xcode Command Line Tools
- **Linux (Debian/Ubuntu):** `sudo apt install git`

After installation, tell Git your name and email (used to label your commits):
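The standard commands are (the name and email below are placeholders, not the author's):

```shell
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
```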
You also need a free GitHub account. The usethis package provides the easiest way to connect RStudio to GitHub:
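A typical connection sequence looks like this (a sketch using real `usethis`/`gitcreds` functions; run it interactively in the Console):

```r
# Open a browser page where you can create a GitHub personal access token (PAT)
usethis::create_github_token()

# Store the token locally so RStudio and Git can authenticate with GitHub
gitcreds::gitcreds_set()
```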
The simplest workflow is to create the GitHub repository first, then clone it as an R Project:
1. On GitHub, create a new repository and copy its URL (e.g., `https://github.com/yourusername/your-repo.git`)
2. In RStudio, select `File → New Project → Version Control → Git` and paste the repository URL

Alternatively, if you already have an R Project and want to connect it to a new GitHub repository:
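For that alternative route, the `usethis` package can initialise Git and create the GitHub repository from within the open project (a sketch; both functions are real `usethis` functions and prompt interactively):

```r
# Put the current project under Git version control
usethis::use_git()

# Create a GitHub repository for it and push the first commit
usethis::use_github()
```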
Once your project is connected to GitHub, the daily workflow has three steps:
```shell
# In the Terminal pane (not the Console):

# 1. Stage: mark files to include in the next commit
git add R/my_analysis.R
git add data/processed/corpus_clean.rds

# 2. Commit: save the staged changes with a message
git commit -m "Add register effect to mixed-effects model"

# 3. Push: upload commits to GitHub
git push
```

In RStudio, the Git pane (top right, after Git is initialised) provides a visual interface for staging, committing, and pushing without using the Terminal.
Commit everything needed to reproduce the analysis:
- scripts and notebooks (`.R`, `.Rmd`, `.qmd`)
- the `renv.lock` file
- the `.Rproj` file
- a `README.md` describing the project

Do not commit:
- rendered output files (`.html`, `.pdf`) — these are derived products that can be regenerated
- the `renv/library/` folder (the lock file is sufficient; collaborators restore from it)

Create a `.gitignore` file to tell Git which files to ignore:
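A minimal `.gitignore` for this kind of project might contain (entries illustrative, following the lists above):

```
# Derived outputs — regenerate by rendering
*.html
*.pdf

# renv project library — restored from renv.lock
renv/library/

# R session artefacts
.Rhistory
.RData
```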
**Q1.** What is the difference between `git commit` and `git push`?

**Q2.** Why should large raw data files generally NOT be committed to a Git repository?
# File Paths with `here`

**What you’ll learn:** How to use the `here` package to write file paths that work on any machine without modification

**Key function:** `here::here()`
Even within an R Project, file paths can cause problems if written carelessly. The here package provides a single, simple function that constructs file paths relative to the project root — correctly on Windows, Mac, and Linux — without any configuration.
Within an R Project, you might write:
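For example, a plain relative path (the exact call below is illustrative, not from the original):

```r
# Relies on the working directory being the project root
data <- read.csv("data/processed/corpus.csv")
```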
This works when the working directory is the project root (which R Projects guarantee). But it breaks if a script is sourced from a subdirectory, or if the file is run as part of a rendered document that temporarily changes the working directory.
## `here::here()` as the Solution

```r
library(here)

# Constructs the full path from the project root, regardless of where R is run from
data <- read.csv(here::here("data", "processed", "corpus.csv"))

# Save a processed file
saveRDS(data, here::here("data", "processed", "corpus_clean.rds"))

# Save a plot
ggsave(here::here("outputs", "figures", "register_plot.png"),
       width = 8, height = 5, dpi = 300)
```

`here::here("data", "processed", "corpus.csv")` assembles the path from its components and always anchors it to the project root, so the same call works unchanged on Windows, macOS, and Linux.
Calling `here::here()` with no arguments returns the project root:

```
[1] "C:/Users/Martin/Documents/projects/ladal"
```
Absolute paths ("C:/Users/Martin/...", "/home/martin/...") break on any other machine. Paths relative to an unspecified working directory break when the script is run from a different location. here::here() is the correct solution in both cases. Make it a habit to use it for every file read and write operation.
# Reproducible Randomness with `set.seed()`

**What you’ll learn:** How to make analyses involving random processes exactly reproducible using `set.seed()`

**Key function:** `set.seed()`

**Why it matters:** Any analysis involving random numbers — random forests, bootstrap confidence intervals, train/test splits, simulation studies — will produce different results each run unless the random seed is fixed
R’s random number functions produce different output every time they are called. Two successive calls to the same sampling function, for example, return different draws:

```
[1] 1 5 6 9 8
[1] 5 4 1 6 7
```
This means that any analysis using randomness — shuffling data, drawing bootstrap samples, initialising random forests — will differ between runs. A collaborator running the same code will get different numbers. A reported result cannot be reproduced exactly.
## What `set.seed()` Does

`set.seed()` initialises R’s internal random number generator to a known state. Every random operation that follows will produce the same sequence of numbers on any machine running the same R version (see the version caveat below):
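A minimal demonstration (the seed value 42 is arbitrary):

```r
set.seed(42)
sample(1:10, 5)

set.seed(42)
sample(1:10, 5) # identical to the first draw
```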
## Where to Call `set.seed()`

Set the seed once, at the top of your script or notebook, immediately after loading packages and options. This ensures every random operation in the document is reproducible from a single, documented starting point.
## `set.seed()` Is Version-Sensitive

The default random number generator changed in R 3.6.0. Results from `set.seed(42)` in R 3.5 differ from results in R 3.6+. This is another reason why recording your R version (via `sessionInfo()`) and locking your environment with `renv` is important for long-term reproducibility.
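If you must reproduce results generated under an older R, `RNGversion()` can switch back to the pre-3.6 generator (a sketch; R warns that the old sampling method is slightly non-uniform, so use this only for replication):

```r
# Revert to the pre-3.6 random number generator (emits a warning)
RNGversion("3.5.0")
set.seed(42)

# ... replicate the old analysis here ...

# Return to the current default afterwards
RNGversion(as.character(getRversion()))
```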
**Q1.** You run a random forest model twice with identical code and get slightly different variable importance scores each time. What is the most likely cause and fix?
# Tidy Data Principles

**What you’ll learn:** The principles of tidy data — a consistent, analysis-ready way to structure tabular data

**Why it matters:** Tidy data works immediately with all tidyverse functions; untidy data requires transformation before almost any analysis can begin
The same underlying data can be stored in multiple formats. Consider life expectancy data for five countries across two years:
Table 1 (wide format): Years are column names — compact for reading, problematic for analysis
| country | continent | 2002 | 2007 |
|---|---|---|---|
| Afghanistan | Asia | 42.1 | 43.8 |
| Australia | Oceania | 80.4 | 81.2 |
| China | Asia | 72.0 | 72.9 |
| Germany | Europe | 78.7 | 79.4 |
| Tanzania | Africa | 50.7 | 52.5 |
Table 2 (long/tidy format): One observation per row — each year is its own row, life expectancy is one column
| country | continent | year | life_exp |
|---|---|---|---|
| Afghanistan | Asia | 2002 | 42.1 |
| Afghanistan | Asia | 2007 | 43.8 |
| Australia | Oceania | 2002 | 80.4 |
| Australia | Oceania | 2007 | 81.2 |
| China | Asia | 2002 | 72.0 |
| China | Asia | 2007 | 72.9 |
| Germany | Europe | 2002 | 78.7 |
| Germany | Europe | 2007 | 79.4 |
| Tanzania | Africa | 2002 | 50.7 |
| Tanzania | Africa | 2007 | 52.5 |
Table 2 is tidy. Tidy data follows three rules:
1. Every variable is a column — `country`, `continent`, `year`, and `life_exp` are all separate columns.
2. Every observation is a row.
3. Every value is a single cell.

Tidy data is not just a convention — it is the format that all tidyverse functions expect. `dplyr::group_by()`, `ggplot2::ggplot()`, and statistical model functions all assume that the variable you want to use as a grouping factor is a column, not spread across multiple column names.
```r
# Tidy format enables immediate plotting without reshaping
life_exp_long <- life_exp |>
  tidyr::pivot_longer(
    cols = c(`2002`, `2007`),
    names_to = "year",
    values_to = "life_exp"
  )

ggplot2::ggplot(life_exp_long,
                ggplot2::aes(x = country, y = life_exp, fill = year)) +
  ggplot2::geom_col(position = "dodge") +
  ggplot2::scale_fill_manual(values = c("steelblue", "tomato")) +
  ggplot2::coord_flip() +
  ggplot2::theme_bw() +
  ggplot2::theme(panel.grid.minor = ggplot2::element_blank()) +
  ggplot2::labs(title = "Life expectancy by country and year",
                x = NULL, y = "Life expectancy (years)", fill = "Year")
```
| Problem | Example | Fix |
|---|---|---|
| Column headers are values, not variable names | Columns `country`, `2002`, `2007`, `2012` (years as column names) | `pivot_longer()` to move year values into a `year` column |
| Multiple variables stored in one column | Column `age_gender` contains values like `M_25`, `F_30` | `tidyr::separate()` to split into `age` and `gender` columns |
| One observation spread across multiple rows | A speaker's metadata split across three rows | Aggregate or pivot to one row per observation unit |
| Multiple types of observational units in one table | Speaker info and utterance data mixed in one table | Split into separate tables, join when needed |
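As an illustration of the `tidyr::separate()` fix, with invented toy data:

```r
library(tidyr)

# Two variables fused into one column
df <- data.frame(speaker = c("s1", "s2"),
                 age_gender = c("M_25", "F_30"))

# Split into separate gender and age columns at the underscore
tidyr::separate(df, age_gender, into = c("gender", "age"), sep = "_")
```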
# Efficient Data Storage

**What you’ll learn:** How to choose the right file format for saving data, balancing portability, size, and fidelity

**Key functions:** `write.csv()`, `saveRDS()`, `readRDS()`
Data files can be stored in many formats, each with trade-offs in portability, file size, and how much R-specific information (column types, factor levels) is preserved.
| Format | Size | Portable to | Preserves | Best for |
|---|---|---|---|---|
| CSV (`.csv`) | Medium | Any software | Values as text only | Sharing data with non-R users |
| Excel (`.xlsx`) | Large | Excel / R / Python | Values + basic formatting | Sharing with Excel users |
| RDS (`.rds`) | Small | R only | All R types and attributes | Saving a single processed R object |
| RData (`.rda`/`.RData`) | Small | R only | Multiple objects at once | Saving multiple R objects together |
To illustrate the size difference, we create a moderately sized data frame and save it in each format:
```r
# Create a sample data frame (1000 rows, 4 columns)
set.seed(42)
n <- 1000
demo_data <- data.frame(
  doc_id = paste0("doc", 1:n),
  register = sample(c("Academic", "News", "Fiction"), n, replace = TRUE),
  word_count = round(rnorm(n, 300, 60)),
  year = sample(2015:2023, n, replace = TRUE)
)

# Save in different formats
write.csv(demo_data, here::here("data", "demo.csv"), row.names = FALSE)
saveRDS(demo_data, here::here("data", "demo.rds"))
openxlsx::write.xlsx(demo_data, here::here("data", "demo.xlsx"))

# Compare file sizes
sizes <- file.info(c(
  here::here("data", "demo.csv"),
  here::here("data", "demo.rds"),
  here::here("data", "demo.xlsx")
))$size / 1024 # size in KB
names(sizes) <- c("CSV", "RDS", "Excel")
round(sizes, 1)
```

In practice, RDS is typically 3–5× smaller than the equivalent CSV and 5–8× smaller than Excel, because it uses R’s native binary compression.
Avoid saving your entire workspace (via `save.image()` or `.RData`) — this creates an opaque binary blob that is hard to version-control and makes scripts implicitly depend on a hidden state.

# Recording the Session with `sessionInfo()`

**Why it matters:** Recording your session information at the end of every notebook creates a permanent, human-readable log of the exact software environment used — essential for troubleshooting, peer review, and long-term reproducibility
The `sessionInfo()` function prints a complete record of:

- the R version and the platform it runs on
- the operating system
- the locale and time zone
- all attached packages and their exact version numbers

Always include a call to `sessionInfo()` at the end of every notebook.
In combination with renv.lock, sessionInfo() output provides a complete snapshot of the software environment in a human-readable format that can be included in supplementary materials or reported in a methods section.
By the end of any project, a fully reproducible analysis should have:

- a self-contained R Project with a clear, documented folder structure
- notebooks in which every number, table, and figure is generated from code
- an `renv.lock` file recording exact package versions
- a Git history of all changes, ideally shared on GitHub
- file paths constructed with `here::here()`
- a fixed random seed for every stochastic step
- `sessionInfo()` output recorded at the end of each rendered document
Martin Schweinberger. 2026. *Reproducibility with R*. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/r_reproducibility/r_reproducibility.html (Version 2026.03.28).
```
@manual{martinschweinberger2026reproducibility,
  author = {Martin Schweinberger},
  title = {Reproducibility with R},
  year = {2026},
  note = {https://ladal.edu.au/tutorials/r_reproducibility/r_reproducibility.html},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
  edition = {2026.03.28},
  doi = {}
}
```
## Session Information

```
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
 [1] flextable_0.9.7 checkdown_0.0.13 gapminder_1.0.0 lubridate_1.9.4
 [5] forcats_1.0.0 stringr_1.5.1 dplyr_1.2.0 purrr_1.0.4
 [9] readr_2.1.5 tibble_3.2.1 ggplot2_4.0.2 tidyverse_2.0.0
[13] tidyr_1.3.2 here_1.0.1 DT_0.33 kableExtra_1.4.0
[17] knitr_1.51

loaded via a namespace (and not attached):
 [1] gtable_0.3.6 xfun_0.56 htmlwidgets_1.6.4
 [4] tzdb_0.4.0 vctrs_0.7.1 tools_4.4.2
 [7] generics_0.1.3 pkgconfig_2.0.3 data.table_1.17.0
[10] RColorBrewer_1.1-3 S7_0.2.1 uuid_1.2-1
[13] lifecycle_1.0.5 compiler_4.4.2 farver_2.1.2
[16] textshaping_1.0.0 codetools_0.2-20 litedown_0.9
[19] fontquiver_0.2.1 fontLiberation_0.1.0 htmltools_0.5.9
[22] yaml_2.3.10 pillar_1.10.1 openssl_2.3.2
[25] fontBitstreamVera_0.1.1 commonmark_2.0.0 zip_2.3.2
[28] tidyselect_1.2.1 digest_0.6.39 stringi_1.8.4
[31] labeling_0.4.3 rprojroot_2.0.4 fastmap_1.2.0
[34] grid_4.4.2 cli_3.6.4 magrittr_2.0.3
[37] withr_3.0.2 gdtools_0.4.1 scales_1.4.0
[40] timechange_0.3.0 rmarkdown_2.30 officer_0.6.7
[43] askpass_1.2.1 ragg_1.3.3 hms_1.1.3
[46] evaluate_1.0.3 viridisLite_0.4.2 markdown_2.0
[49] rlang_1.1.7 Rcpp_1.0.14 glue_1.8.0
[52] xml2_1.3.6 renv_1.1.1 svglite_2.1.3
[55] rstudioapi_0.17.1 jsonlite_1.9.0 R6_2.6.1
[58] systemfonts_1.2.1
```
This tutorial was revised and restyled with the assistance of Claude (claude.ai), a large language model created by Anthropic. All substantive content — code, statistical explanations, exercises, and reporting conventions — was retained from the original. All changes were reviewed and approved by Martin Schweinberger, who takes full responsibility for the tutorial’s accuracy.
---
title: "Reproducibility with R"
author: "Martin Schweinberger"
date: "2026"
params:
title: "Reproducibility with R"
author: "Martin Schweinberger"
year: "2026"
version: "2026.03.28"
url: "https://ladal.edu.au/tutorials/r_reproducibility/r_reproducibility.html"
institution: "The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia"
doi: ""
format:
html:
toc: true
toc-depth: 4
code-fold: show
code-tools: true
theme: cosmo
---
```{r setup, echo=FALSE, message=FALSE, warning=FALSE}
library(checkdown)
library(dplyr)
library(tidyr)
library(flextable)
library(here)
options(stringsAsFactors = FALSE)
options(scipen = 100)
options(max.print = 100)
```
{ width=100% }
# Introduction {#intro}
{ width=15% style="float:right; padding:10px" }
This tutorial covers best practices for conducting reproducible, transparent, and well-organised research in R. Reproducibility — the ability of another researcher (or your future self) to re-run your analysis and obtain the same results — is increasingly recognised as a cornerstone of credible science. Journals, funders, and research institutions are all moving toward requiring it. R, combined with a small set of tools and habits, makes genuine reproducibility achievable with relatively little extra effort.
The tutorial works through the key components of a reproducible R workflow: project organisation, R Projects, reproducible documents, dependency control, version control, file paths, random seeds, and tidy data principles. Each section explains not just *how* to use a tool, but *why* it matters for reproducibility.
::: {.callout-note}
## Prerequisite Tutorials
Before working through this tutorial, please complete or familiarise yourself with:
- [Getting Started with R and RStudio](/tutorials/intror/intror.html)
- [Loading, Saving, and Generating Data in R](/tutorials/load/load.html)
- [Handling Tables in R](/tutorials/table/table.html)
- [Concepts in Reproducible Research](/tutorials/repro/repro.html)
:::
::: {.callout-note}
## Citation
```{r citation-callout-top, echo=FALSE, results='asis'}
cat(
params$author, ". ",
params$year, ". *",
params$title, "*. ",
params$institution, ". ",
"url: ", params$url, " ",
"(Version ", params$version, ").",
sep = ""
)
```
:::
::: {.callout-tip}
## What This Tutorial Covers
1. **R Projects** — self-contained, portable project folders
2. **Folder structure** — organising files for clarity and reproducibility
3. **Reproducible documents** — R Markdown and Quarto notebooks
4. **Code style** — writing readable, maintainable R code
5. **Dependency control with `renv`** — locking package versions
6. **Version control with Git and GitHub** — tracking and sharing changes
7. **File paths with `here`** — portable, machine-independent paths
8. **Reproducible randomness with `set.seed()`** — replicable stochastic results
9. **Tidy data principles** — structuring data for analysis
10. **Efficient data storage** — choosing the right file format
:::
---
## Preparation {-}
Install the packages used in this tutorial (once only):
```{r install, eval=FALSE}
install.packages("dplyr")
install.packages("tidyr")
install.packages("flextable")
install.packages("here")
install.packages("renv")
install.packages("gapminder")
install.packages("checkdown")
```
Load packages:
```{r load, message=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
library(flextable)
library(here)
library(checkdown)
```
---
# R Projects {#rproj}
::: {.callout-note}
## Section Overview
**What you'll learn:** How to use R Projects to create self-contained, portable analysis environments
**Why it matters:** R Projects eliminate the most common cause of broken scripts — hardcoded file paths — and make it trivial to share a complete analysis with collaborators
:::
An **R Project** is a folder identified by a `.Rproj` file. When you open a project in RStudio (via `File → Open Project`, or by double-clicking the `.Rproj` file), RStudio automatically sets the working directory to the project folder. Everything — scripts, data, outputs — lives relative to that root.
{ width=40% style="float:right; padding:10px" }
## Why Not `setwd()`? {-}
Before R Projects became standard, analysts used `setwd()` to tell R where to look for files:
```{r setwd_bad, eval=FALSE}
# The old way — do not do this
setwd("C:/Users/Martin/Documents/Projects/MyCorpusStudy/")
```
This approach has serious problems:
- **It is machine-specific.** The path above works only on Martin's computer. Anyone else who runs the script must manually edit it.
- **It breaks when folders move.** Renaming or moving the project folder invalidates every `setwd()` call.
- **It creates hidden dependencies.** The script appears to be self-contained but secretly depends on a specific folder structure on a specific machine.
R Projects solve all of these problems. The working directory is always the project root, regardless of which machine the project is opened on.
## Creating an R Project {-}
1. In RStudio, click `File → New Project`
2. Select `New Directory` (for a fresh project) or `Existing Directory` (for a folder you have already created)
3. Navigate to or name your project folder
4. Click `Create Project`
A `.Rproj` file is created in the folder. From now on, open this project by double-clicking the `.Rproj` file or via `File → Open Project` in RStudio.
::: {.callout-important}
## One Project Per Analysis
Create a new R Project for each distinct research project or analysis. Never work across projects by navigating to different folders with `setwd()`. Each project should be fully self-contained: opening the `.Rproj` file should be sufficient to reproduce the entire analysis.
:::
---
# Folder Structure {#folders}
::: {.callout-note}
## Section Overview
**What you'll learn:** How to organise files within a project for clarity and long-term maintainability
**Why it matters:** A consistent folder structure makes it immediately obvious where to find files and where to put new ones — for you and for collaborators
:::
A well-organised project folder makes a project understandable at a glance. The following structure works well for most linguistic research projects:
```
my_project/
├── my_project.Rproj ← R Project file (root anchor)
├── data/
│ ├── raw/ ← original, unmodified data (treat as read-only)
│ └── processed/ ← cleaned/transformed data
├── R/ ← R scripts (.R files)
├── notebooks/ ← R Markdown / Quarto notebooks (.Rmd, .qmd)
├── outputs/
│ ├── figures/ ← saved plots
│ └── tables/ ← exported tables
└── docs/ ← notes, reports, paper drafts
```
A few principles worth following:
**Raw data is sacred.** Never overwrite or modify the original data files. Save all processed versions separately to `data/processed/`. This means you can always re-derive processed data from the raw source.
**Separate scripts from notebooks.** Keep short, reusable functions and data processing steps in `.R` scripts inside `R/`. Keep narrative analyses with integrated output in notebooks inside `notebooks/`.
**Name files to sort logically.** Use numeric prefixes for scripts that must run in order (`01_clean_data.R`, `02_analyse.R`, `03_visualise.R`). Use `snake_case` for all file names — spaces and special characters in file names cause problems across operating systems.
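The structure above can be created from the R console in one call — a sketch using only base R, with folder names following the layout shown:

```{r folder_create, eval=FALSE}
# Create the standard project folders (safe to re-run; existing folders are kept)
dirs <- c("data/raw", "data/processed", "R", "notebooks",
          "outputs/figures", "outputs/tables", "docs")
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
```

Run this once from the project root (i.e., with the `.Rproj` file open); `recursive = TRUE` creates nested folders such as `data/raw` in a single step.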
---
# Reproducible Documents {#notebooks}
::: {.callout-note}
## Section Overview
**What you'll learn:** What R Markdown and Quarto notebooks are, why they are the gold standard for reproducible reporting, and how to use them effectively
**Why it matters:** A rendered notebook is a complete, self-verifying record of your analysis — every number, table, and figure is generated fresh from code each time it is rendered
:::
## What Are Reproducible Notebooks? {-}
A **notebook** (R Markdown `.Rmd` file or Quarto `.qmd` file) combines three things in one document:
1. **Prose** — written in Markdown for formatted text, headings, and lists
2. **Code** — in executable code chunks that R runs when the document is rendered
3. **Output** — tables, figures, and printed results generated directly from the code
When you click **Render** (Quarto) or **Knit** (R Markdown), R executes every code chunk from scratch in a clean environment and weaves the output together with the prose into a finished HTML, PDF, or Word document.
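For orientation, a minimal Quarto source file looks like this — the title, file path, and chunk contents here are illustrative, not part of any real project:

````markdown
---
title: "Register variation in academic speech"
author: "Your Name"
format: html
---

## Data

```{r}
#| message: false
library(dplyr)
corpus <- readRDS("data/processed/corpus_clean.rds")  # hypothetical file
```

The corpus contains `r knitr::inline_expr("nrow(corpus)")` texts.
````

The YAML header at the top controls the output format; everything below it alternates freely between Markdown prose and executable chunks.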
## Why Notebooks Matter for Reproducibility {-}
The key property of a rendered notebook is that **every result is derived directly from code**. There is no manual copying of numbers from a statistical output into a Word document — a step that is both error-prone and opaque. When a reviewer asks "where does this number come from?", you can point directly to the code chunk that produced it.
This means:
- If you update your data, you re-render and all results update automatically
- If a collaborator wants to verify a result, they open the notebook and render it
- If you revisit the analysis a year later, the notebook shows exactly what was done and why
Notebooks also document the *reasoning* behind analytical choices, not just the code. Prose in a notebook can explain why a particular model was chosen, what a diagnostic plot revealed, or why certain observations were excluded — information that a bare script cannot capture.
## Notebooks for Qualitative Work {-}
While notebooks are most commonly associated with quantitative and computational analyses, they are increasingly used to document qualitative and interpretative work. A notebook can display annotation decisions alongside the data being annotated, making the logic of qualitative coding transparent and verifiable. Recent studies have demonstrated their value in corpus pragmatics and corpus-based discourse analysis for exactly this purpose.
## R Markdown vs. Quarto {-}
**R Markdown** (`.Rmd`) is the original notebook format for R, stable and widely used. **Quarto** (`.qmd`) is its successor — it supports R, Python, Julia, and Observable JS in the same document, has a cleaner syntax, and is the format used by all LADAL tutorials. If you are starting fresh, use Quarto. If you have existing R Markdown files, they continue to work and do not need to be converted.
```{r notebook_table, echo=FALSE, message=FALSE, warning=FALSE}
data.frame(
Feature = c("File extension", "Render function", "Multi-language",
"Output formats", "LADAL tutorials"),
R_Markdown = c(".Rmd", "knitr::knit() / rmarkdown::render()",
"R only (+ Python via reticulate)",
"HTML, PDF, Word, slides", "Legacy format"),
Quarto = c(".qmd", "quarto::quarto_render()",
"R, Python, Julia, Observable",
"HTML, PDF, Word, slides, books, websites", "Current format")
) |>
dplyr::rename("R Markdown" = R_Markdown) |>
flextable() |>
flextable::set_table_properties(width = .95, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Comparison of R Markdown and Quarto notebook formats.") |>
flextable::border_outer()
```
---
::: {.callout-tip}
## Exercises: Projects and Notebooks
:::
**Q1. Why is using `setwd()` at the top of a script considered bad practice for reproducibility?**
```{r}
#| echo: false
#| label: "PROJ_Q1"
check_question("setwd() uses an absolute path specific to one machine — the script breaks immediately on any other computer or if the folder is moved",
options = c(
"setwd() uses an absolute path specific to one machine — the script breaks immediately on any other computer or if the folder is moved",
"setwd() is deprecated and no longer works in modern versions of R",
"setwd() is slower than R Projects for large datasets",
"setwd() only works on Windows, not on Mac or Linux"
),
type = "radio",
q_id = "PROJ_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! A path like setwd('C:/Users/Martin/Documents/...') is tied to one specific machine. Any collaborator, reviewer, or future version of yourself on a different computer would need to manually edit the path before the script could run — a fragile, error-prone requirement. R Projects eliminate this: the working directory is always the project root, wherever the project folder happens to be located.",
wrong = "Think about what an absolute file path like 'C:/Users/Martin/...' means on another computer or after moving the project folder.")
```
---
**Q2. What is the key reproducibility advantage of a rendered notebook over a plain R script?**
```{r}
#| echo: false
#| label: "PROJ_Q2"
check_question("A rendered notebook shows both the code AND the output it produced — results are verifiable without re-running anything, and every number is traceable to the code that generated it",
options = c(
"A rendered notebook shows both the code AND the output it produced — results are verifiable without re-running anything, and every number is traceable to the code that generated it",
"Notebooks run faster than plain scripts because they cache results",
"Notebooks are the only format that supports ggplot2 graphics",
"A rendered notebook is smaller in file size than a plain R script"
),
type = "radio",
q_id = "PROJ_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! A plain script is a set of instructions; its outputs exist only if the reader re-runs it in a matching environment. A rendered notebook captures both the instructions (code) and their results (output) in one document. A colleague can read the rendered HTML and verify every result without needing R installed. And because the output is generated from the code rather than typed manually, there is no risk of a number being accidentally mis-copied.",
wrong = "Think about what a reviewer or collaborator needs to do with a plain .R script versus a rendered HTML notebook to verify a reported result.")
```
---
# Code Style {#style}
::: {.callout-note}
## Section Overview
**What you'll learn:** Conventions for writing readable, consistent R code
**Why it matters:** Code you write today will be read — by collaborators, reviewers, and your future self — months or years from now. Readable code is reproducible code.
:::
Reproducibility is not only about *running* code — it is also about *understanding* it. Code that is hard to read is hard to verify, hard to modify, and hard to debug. The following conventions are widely adopted in the R community and are used throughout LADAL tutorials.
## Naming Conventions {-}
```{r naming, eval=FALSE}
# Good: lowercase with underscores (snake_case)
word_count <- 42
reaction_time_ms <- 487.3
corpus_summary <- data.frame()
# Avoid: mixed case, dots, or cryptic abbreviations
WordCount <- 42 # CamelCase (used in some communities, not LADAL)
reaction.time <- 487.3 # dots can conflict with S3 method names
rt <- 487.3 # cryptic — what does rt mean?
```
## Spacing and Indentation {-}
```{r spacing, eval=FALSE}
# Good: spaces around operators, after commas, consistent indentation
corpus_summary <- corpus_data |>
dplyr::filter(register == "Academic") |>
dplyr::group_by(speaker_id) |>
dplyr::summarise(
mean_wc = mean(word_count, na.rm = TRUE),
sd_wc = sd(word_count, na.rm = TRUE),
.groups = "drop"
)
# Avoid: cramped, hard-to-read code
corpus_summary<-corpus_data|>dplyr::filter(register=="Academic")|>
dplyr::group_by(speaker_id)|>dplyr::summarise(mean_wc=mean(word_count,na.rm=TRUE))
```
## Commenting {-}
```{r comments, eval=FALSE}
# Good: comments explain WHY, not just WHAT
# Remove speakers with fewer than 3 observations — insufficient data for per-speaker models
corpus_data <- corpus_data |>
dplyr::group_by(speaker_id) |>
dplyr::filter(dplyr::n() >= 3) |>
dplyr::ungroup()
# Avoid: comments that merely restate the code
# group by speaker_id and filter
corpus_data <- corpus_data |>
dplyr::group_by(speaker_id) |>
dplyr::filter(dplyr::n() >= 3) |>
dplyr::ungroup()
```
## Line Length {-}
Keep lines under **80 characters**. RStudio shows a vertical guideline at column 80 by default (`Tools → Global Options → Code → Display → Show margin`). Long lines are hard to read and cause problems in version control diffs.
## Script Structure {-}
Every R script or notebook should follow a consistent top-to-bottom structure:
```{r script_structure, eval=FALSE}
# ===========================================================
# Script: 02_analyse_register_variation.R
# Author: Martin Schweinberger
# Date: 2026-02-19
# Description: Mixed-effects model of word count by register
# ===========================================================
# 1. PACKAGES ------------------------------------------------
library(dplyr)
library(lme4)
library(here)
# 2. OPTIONS -------------------------------------------------
options(scipen = 100)  # avoid scientific notation in printed output
set.seed(42)
# 3. LOAD DATA -----------------------------------------------
corpus <- readRDS(here::here("data", "processed", "corpus_clean.rds"))
# 4. ANALYSIS ------------------------------------------------
# ... analysis code here
# 5. SAVE OUTPUTS --------------------------------------------
# ... save results here
```
::: {.callout-tip}
## The `lintr` and `styler` Packages
Two packages automate style checking and fixing:
- `lintr` checks your code against a style guide and reports violations — like a spell-checker for code style
- `styler` automatically reformats your code to comply with the tidyverse style guide
```r
install.packages("lintr")
install.packages("styler")
lintr::lint("R/my_script.R") # check style
styler::style_file("R/my_script.R") # auto-fix style
```
:::
---
# Dependency Control with `renv` {#renv}
::: {.callout-note}
## Section Overview
**What you'll learn:** How to use `renv` to lock the exact package versions used in a project, so the analysis can be reproduced identically on any machine — now or in the future
**Why it matters:** R packages change over time. An analysis that runs correctly today may produce different results or fail entirely in two years if packages have been updated. `renv` prevents this.
:::
## The Problem: Package Version Drift {-}
Consider this scenario: you publish a paper in 2024 using a mixed-effects model fitted with `lme4` version 1.1-35. A reviewer in 2025 tries to reproduce your analysis, but `lme4` 1.1-37 has changed how it handles singular fit warnings and produces slightly different output. Your analysis is no longer exactly reproducible — even with identical code and data.
This is **package version drift**, and it is one of the most common obstacles to long-term reproducibility. The `renv` package solves it.
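You can check which version of a package is currently installed with `packageVersion()`, and record the full computational environment with `sessionInfo()` — useful for reporting in a paper or appendix, though not a substitute for locking versions:

```{r version_check, eval=FALSE}
# Which version of a package is installed right now?
packageVersion("dplyr")

# Full record of R version, OS, and loaded packages
# — paste into a log file or the appendix of a report
sessionInfo()
```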
## How `renv` Works {-}
`renv` creates a **project-local library** — a folder inside your project that contains exactly the versions of all packages used. When you share the project, collaborators install the same versions from the `renv.lock` file. The project is isolated from the user's main R library, so updates to packages installed globally do not affect the project.
## Setting Up `renv` {-}
```{r renv_setup, eval=FALSE}
# Install renv (only once, globally)
install.packages("renv")
# Initialise renv in your project
renv::init()
```
`renv::init()` scans your project for package dependencies, installs them into the project library, and creates a `renv.lock` file recording exact package versions.
## The `renv.lock` File {-}
The lock file is plain text (JSON format) and records the name, version, and source of every package:
```json
{
"R": {
"Version": "4.3.2",
"Repositories": [{"Name": "CRAN", "URL": "https://cloud.r-project.org"}]
},
"Packages": {
"dplyr": {
"Package": "dplyr",
"Version": "1.1.4",
"Source": "Repository",
"Repository": "CRAN",
"Hash": "..."
}
}
}
```
This file should be committed to version control (Git) — it is the machine-readable specification of your software environment.
## Daily `renv` Workflow {-}
```{r renv_workflow, eval=FALSE}
# Install a new package (adds it to the project library)
renv::install("emmeans")
# After installing or removing packages, update the lock file
renv::snapshot()
# Restore the project environment on another machine or after a clean install
renv::restore()
# Check the status of the project library vs. the lock file
renv::status()
```
## Sharing a `renv` Project {-}
When you share your project (e.g., by pushing to GitHub or sending a zip file):
1. The collaborator clones or unzips the project
2. They open the `.Rproj` file in RStudio
3. `renv` detects the lock file and prompts them to run `renv::restore()`
4. They confirm, and `renv` installs the exact package versions specified
5. The analysis runs in an identical environment
::: {.callout-tip}
## `renv` vs. `packrat`
`renv` is the modern successor to the older `packrat` package. If you have projects using `packrat`, they can be migrated to `renv` with `renv::migrate()`. `packrat` should be considered deprecated for new projects.
:::
---
::: {.callout-tip}
## Exercises: `renv`
:::
**Q1. What does `renv::snapshot()` do?**
```{r}
#| echo: false
#| label: "RENV_Q1"
check_question("It updates the renv.lock file to reflect the current state of the project library — recording which packages and versions are currently installed",
options = c(
"It updates the renv.lock file to reflect the current state of the project library — recording which packages and versions are currently installed",
"It installs all packages listed in the renv.lock file",
"It takes a screenshot of the current RStudio session",
"It saves a backup copy of all data files in the project"
),
type = "radio",
q_id = "RENV_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! renv::snapshot() writes the current state of the project library to renv.lock. Run it after installing, removing, or updating any package used in the project. The complementary function is renv::restore(), which does the opposite: it reads renv.lock and installs the specified package versions — used when setting up the project on a new machine or restoring after a clean install.",
wrong = "Think about the two core renv operations: one records the environment (snapshot), one recreates it (restore). Which is which?")
```
---
# Version Control with Git and GitHub {#git}
::: {.callout-note}
## Section Overview
**What you'll learn:** How to use Git for version control and GitHub for sharing and collaborating on R projects
**Why it matters:** Version control is a complete, timestamped record of every change ever made to your code. It makes collaboration safe, enables you to revert mistakes, and provides a permanent citable home for your analysis.
:::
## What Is Git? {-}
**Git** is a version control system — software that tracks changes to files over time. Every time you *commit* a set of changes, Git records what changed, who changed it, and when. You can browse the entire history of a project, compare any two versions, and revert to any earlier state.
**GitHub** is a web platform that hosts Git repositories. It serves as a permanent, shareable, and optionally public home for your project. A GitHub repository can be shared with collaborators, cited in papers (with a DOI via Zenodo), and submitted as supplementary material to journals.
## Installing Git {-}
Before using Git with RStudio, you need Git installed on your computer:
- **Windows:** Download from [git-scm.com](https://git-scm.com/downloads) and run the installer with default settings
- **Mac:** Open Terminal and run `git --version` — if Git is not installed, macOS will prompt you to install it via Xcode Command Line Tools
- **Linux:** Install via your package manager, e.g., `sudo apt install git`
After installation, tell Git your name and email (used to label your commits):
```{bash git_config, eval=FALSE}
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
```
You also need a free [GitHub account](https://github.com/). The `usethis` package provides the easiest way to connect RStudio to GitHub:
```{r usethis_setup, eval=FALSE}
install.packages("usethis")
usethis::create_github_token() # opens GitHub to create a token
gitcreds::gitcreds_set() # stores the token in your credential manager
```
## Connecting an R Project to GitHub {-}
The simplest workflow is to create the GitHub repository first, then clone it as an R Project:
1. On GitHub, click **New** to create a repository. Give it a name, add a README, and click **Create repository**.
2. Copy the HTTPS URL of the repository (e.g., `https://github.com/yourusername/your-repo.git`)
3. In RStudio: `File → New Project → Version Control → Git`
4. Paste the URL. RStudio clones the repository and opens it as an R Project.
Alternatively, if you already have an R Project and want to connect it to a new GitHub repository:
```{r usethis_git, eval=FALSE}
# From inside your R Project
usethis::use_git() # initialise Git in the project
usethis::use_github() # create a GitHub repo and push
```
## The Daily Git Workflow {-}
Once your project is connected to GitHub, the daily workflow has three steps:
```{r git_workflow, eval=FALSE}
# In the Terminal pane (not the Console):
# 1. Stage: mark files to include in the next commit
git add R/my_analysis.R
git add data/processed/corpus_clean.rds
# 2. Commit: save the staged changes with a message
git commit -m "Add register effect to mixed-effects model"
# 3. Push: upload commits to GitHub
git push
```
In RStudio, the **Git pane** (top right, after Git is initialised) provides a visual interface for staging, committing, and pushing without using the Terminal.
## What to Commit — and What Not to {-}
Commit everything needed to reproduce the analysis:
- All R scripts and notebooks (`.R`, `.Rmd`, `.qmd`)
- The `renv.lock` file
- Small, shareable data files
- The `.Rproj` file
- A `README.md` describing the project
Do **not** commit:
- Large raw data files (use a data repository like OSF or Zenodo instead, and link to them)
- Sensitive or confidential data
- Rendered output files (`.html`, `.pdf`) — these are derived products that can be regenerated
- The `renv/library/` folder (the lock file is sufficient; collaborators restore from it)
Create a `.gitignore` file to tell Git which files to ignore:
```{r gitignore, eval=FALSE}
# Automatically create a sensible .gitignore for an R project
usethis::use_git_ignore(c("renv/library/", "*.html", "*.pdf",
"data/raw/", ".Rhistory", ".RData"))
```
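The resulting `.gitignore` is just a plain-text file in the project root, one pattern per line:

```
# .gitignore
renv/library/
*.html
*.pdf
data/raw/
.Rhistory
.RData
```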
---
::: {.callout-tip}
## Exercises: Git and GitHub
:::
**Q1. What is the difference between `git commit` and `git push`?**
```{r}
#| echo: false
#| label: "GIT_Q1"
check_question("git commit saves changes to your local repository history; git push uploads those commits to the remote GitHub repository",
options = c(
"git commit saves changes to your local repository history; git push uploads those commits to the remote GitHub repository",
"git push saves changes locally; git commit sends them to GitHub",
"They are identical — both save and upload changes",
"git commit creates a new branch; git push merges it into main"
),
type = "radio",
q_id = "GIT_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! Git works in two layers: a local repository on your computer and a remote repository on GitHub. git commit records a snapshot of your staged changes in the local repository — this works even without an internet connection. git push uploads all local commits that GitHub does not yet have. This two-step design means you can commit frequently (saving your progress locally) and push when you are ready to share.",
wrong = "Git has two separate layers: a local repository and a remote repository. Which operation affects which layer?")
```
---
**Q2. Why should large raw data files generally NOT be committed to a Git repository?**
```{r}
#| echo: false
#| label: "GIT_Q2"
check_question("Git is optimised for text files — large binary or data files bloat the repository history permanently, slow down all operations, and may exceed GitHub's file size limits",
options = c(
"Git is optimised for text files — large binary or data files bloat the repository history permanently, slow down all operations, and may exceed GitHub's file size limits",
"GitHub charges per file committed, so large files are expensive",
"Large files cannot be staged with git add",
"Data files change too frequently for version control to handle"
),
type = "radio",
q_id = "GIT_Q2",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! Git stores the complete history of every committed file — once a large file is committed, it remains in the repository history even if it is later deleted, permanently inflating the repository size. GitHub also enforces a 100 MB per-file limit and recommends keeping repositories under 1 GB total. For large datasets, the recommended practice is to store them in a dedicated data repository (OSF, Zenodo, Figshare) and link to them with a persistent URL in your README.",
wrong = "Think about how Git stores history: every committed version of every file is kept forever. What happens when that file is 500 MB?")
```
---
# Portable File Paths with `here` {#here}
::: {.callout-note}
## Section Overview
**What you'll learn:** How to use the `here` package to write file paths that work on any machine without modification
**Key function:** `here::here()`
:::
Even within an R Project, file paths can cause problems if written carelessly. The `here` package provides a single, simple function that constructs file paths relative to the project root — correctly on Windows, Mac, and Linux — without any configuration.
## The Problem with Relative Paths {-}
Within an R Project, you might write:
```{r relative_bad, eval=FALSE}
# This looks like a relative path, but it depends on the working directory
data <- read.csv("data/processed/corpus.csv")
```
This works when the working directory is the project root (which R Projects guarantee). But it breaks if a script is sourced from a subdirectory, or if the file is run as part of a rendered document that temporarily changes the working directory.
## `here::here()` as the Solution {-}
```{r here_demo, eval=FALSE}
library(here)
# Constructs the full path from the project root, regardless of where R is run from
data <- read.csv(here::here("data", "processed", "corpus.csv"))
# Save a processed file
saveRDS(data, here::here("data", "processed", "corpus_clean.rds"))
# Save a plot
ggsave(here::here("outputs", "figures", "register_plot.png"),
width = 8, height = 5, dpi = 300)
```
`here::here("data", "processed", "corpus.csv")` builds the path from the project root using forward slashes, which R accepts on every operating system (including Windows), so the same call works unchanged on any machine.
```{r here_show, message=FALSE, warning=FALSE}
library(here)
# Shows where here considers the project root to be
here::here()
```
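A small defensive habit worth combining with `here`: fail early, with an informative message, if an expected file is missing. This is a sketch — `check_path()` is a hypothetical helper written here for illustration, not part of the `here` package:

```{r here_check, eval=FALSE}
library(here)

# Hypothetical helper: stop with a clear message if a file does not exist
check_path <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path,
         "\nDid you run the preprocessing script first?")
  }
  path
}

corpus <- readRDS(check_path(here::here("data", "processed", "corpus_clean.rds")))
```

Without the check, a missing file surfaces as a cryptic low-level error; with it, the message tells you (or a collaborator) exactly which file is expected and where.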
::: {.callout-tip}
## The Rule: Never Hardcode Paths
Absolute paths (`"C:/Users/Martin/..."`, `"/home/martin/..."`) break on any other machine. Paths relative to an unspecified working directory break when the script is run from a different location. `here::here()` is the correct solution in both cases. Make it a habit to use it for every file read and write operation.
:::
---
# Reproducible Randomness with `set.seed()` {#seed}
::: {.callout-note}
## Section Overview
**What you'll learn:** How to make analyses involving random processes exactly reproducible using `set.seed()`
**Key function:** `set.seed()`
**Why it matters:** Any analysis involving random numbers — random forests, bootstrap confidence intervals, train/test splits, simulation studies — will produce different results each run unless the random seed is fixed
:::
## The Problem: Stochastic Results {-}
R's random number functions produce different output every time they are called:
```{r seed_demo1}
# Two calls to sample() give different results
sample(1:10, size = 5)
sample(1:10, size = 5)
```
This means that any analysis using randomness — shuffling data, drawing bootstrap samples, initialising random forests — will differ between runs. A collaborator running the same code will get different numbers. A reported result cannot be reproduced exactly.
## The Solution: `set.seed()` {-}
`set.seed()` initialises R's internal random number generator to a known state. Every random operation that follows will produce the same sequence of numbers, on any machine, provided the generator settings are the same (see the warning on R versions below):
```{r seed_demo2}
# Fix the seed, then draw a sample
set.seed(42)
sample(1:10, size = 5)
```
```{r seed_demo3}
# Same seed, same result
set.seed(42)
sample(1:10, size = 5)
```
```{r seed_demo4}
# Different seed, different result
set.seed(99)
sample(1:10, size = 5)
```
## Where to Put `set.seed()` {-}
Set the seed **once, at the top of your script or notebook**, immediately after loading packages and options. This ensures every random operation in the document is reproducible from a single, documented starting point:
```{r seed_practice, eval=FALSE}
# Top of script: packages, options, seed
library(dplyr)
library(ranger)
options(scipen = 100)
set.seed(42) # ← set once, here, at the top
# All subsequent random operations are now reproducible
```
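As a concrete illustration, a bootstrap confidence interval becomes exactly reproducible once the seed is fixed. This sketch uses made-up word counts and only base R:

```{r boot_seed}
set.seed(42)

# Hypothetical data: word counts for 20 texts
word_counts <- c(112, 98, 143, 120, 87, 156, 101, 134, 95, 128,
                 140, 109, 92, 151, 117, 99, 138, 104, 125, 131)

# 1000 bootstrap resamples of the mean
boot_means <- replicate(1000, mean(sample(word_counts, replace = TRUE)))

# 95% percentile confidence interval — identical on every run with this seed
quantile(boot_means, probs = c(0.025, 0.975))
```

Delete the `set.seed(42)` line and the interval shifts slightly on every run — harmless for exploration, fatal for a reported result.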
::: {.callout-warning}
## `set.seed()` Is Version-Sensitive
The default random number generator changed in R 3.6.0. Results from `set.seed(42)` in R 3.5 differ from results in R 3.6+. This is another reason why recording your R version (via `sessionInfo()`) and locking your environment with `renv` is important for long-term reproducibility.
:::
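If you ever need to reproduce results generated under the old generator, base R provides `RNGversion()` to switch back — use it sparingly, and document it in the script:

```{r rng_version, eval=FALSE}
# Temporarily reproduce pre-3.6.0 sampling behaviour
RNGversion("3.5.0")   # switches sample() to the old "Rounding" method (with a warning)
set.seed(42)
sample(1:10, size = 5)

# Return to the current default generator
RNGversion(as.character(getRversion()))
```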
---
::: {.callout-tip}
## Exercises: `set.seed()`
:::
**Q1. You run a random forest model twice with identical code and get slightly different variable importance scores each time. What is the most likely cause and fix?**
```{r}
#| echo: false
#| label: "SEED_Q1"
check_question("The random forest uses random sampling internally — fix by adding set.seed(42) (or any integer) before the model call",
options = c(
"The random forest uses random sampling internally — fix by adding set.seed(42) (or any integer) before the model call",
"The data is being loaded differently each time — check the data import",
"Variable importance scores are inherently unstable and cannot be made reproducible",
"Increase the number of trees to 10,000 — results stabilise with enough trees"
),
type = "radio",
q_id = "SEED_Q1",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Correct! Random forests use random sampling at two stages: random feature selection at each split and random bootstrap sampling of the training data. Without a fixed seed, each run uses a different random sequence, producing slightly different trees and therefore slightly different importance scores. Setting set.seed() before the model call fixes the random sequence and makes results exactly reproducible. The integer you choose (42, 123, 2024, etc.) does not matter — any value works, as long as you record and report it.",
wrong = "Random forests are explicitly stochastic algorithms. What mechanism in R controls random number generation?")
```
---
# Tidy Data Principles {#tidydata}
::: {.callout-note}
## Section Overview
**What you'll learn:** The principles of tidy data — a consistent, analysis-ready way to structure tabular data
**Why it matters:** Tidy data works immediately with all tidyverse functions; untidy data requires transformation before almost any analysis can begin
:::
The same underlying data can be stored in multiple formats. Consider life expectancy data for five countries across two years:
```{r tidy_setup, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyr)
countries <- c("Afghanistan", "Australia", "China", "Germany", "Tanzania")
life_exp <- data.frame(
country = countries,
continent = c("Asia", "Oceania", "Asia", "Europe", "Africa"),
`2002` = c(42.1, 80.4, 72.0, 78.7, 50.7),
`2007` = c(43.8, 81.2, 72.9, 79.4, 52.5),
check.names = FALSE
)
```
**Table 1 (wide format):** Years are column names — compact for reading, problematic for analysis
```{r tidy_t1, echo=FALSE}
life_exp |>
flextable() |>
flextable::set_table_properties(width = .65, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 12) |>
flextable::fontsize(size = 12, part = "header") |>
flextable::align_text_col(align = "center") |>
flextable::set_caption(caption = "Table 1: Life expectancy, wide format.") |>
flextable::border_outer()
```
**Table 2 (long/tidy format):** One observation per row — each year is its own row, life expectancy is one column
```{r tidy_t2, echo=FALSE}
life_exp |>
tidyr::pivot_longer(
cols = c(`2002`, `2007`),
names_to = "year",
values_to = "life_exp"
) |>
flextable() |>
flextable::set_table_properties(width = .6, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 12) |>
flextable::fontsize(size = 12, part = "header") |>
flextable::align_text_col(align = "center") |>
flextable::set_caption(caption = "Table 2: Life expectancy, long (tidy) format.") |>
flextable::border_outer()
```
Table 2 is **tidy**. Tidy data follows three rules:
1. **Each variable has its own column** — `country`, `continent`, `year`, `life_exp` are all separate columns
2. **Each observation has its own row** — one row per country-year combination
3. **Each value has its own cell** — no merged cells, no values encoded in column names
## Why Tidy Data Matters {-}
Tidy data is not just a convention — it is the format that all tidyverse functions expect. `dplyr::group_by()`, `ggplot2::ggplot()`, and statistical model functions all assume that the variable you want to use as a grouping factor is a column, not spread across multiple column names.
```{r tidy_plot, message=FALSE, warning=FALSE}
# Tidy format enables immediate plotting without reshaping
life_exp_long <- life_exp |>
tidyr::pivot_longer(
cols = c(`2002`, `2007`),
names_to = "year",
values_to = "life_exp"
)
ggplot2::ggplot(life_exp_long,
ggplot2::aes(x = country, y = life_exp,
fill = year)) +
ggplot2::geom_col(position = "dodge") +
ggplot2::scale_fill_manual(values = c("steelblue", "tomato")) +
ggplot2::coord_flip() +
ggplot2::theme_bw() +
ggplot2::theme(panel.grid.minor = ggplot2::element_blank()) +
ggplot2::labs(title = "Life expectancy by country and year",
x = NULL, y = "Life expectancy (years)", fill = "Year")
```
## Common Untidy Patterns and How to Fix Them {-}
```{r untidy_table, echo=FALSE, message=FALSE, warning=FALSE}
data.frame(
Problem = c(
"Column headers are values, not variable names",
"Multiple variables stored in one column",
"One observation spread across multiple rows",
"Multiple types of observational units in one table"
),
Example = c(
"Columns: country, 2002, 2007, 2012 (years as column names)",
"Column 'age_gender' contains values like 'M_25', 'F_30'",
"A speaker's metadata split across three rows",
"Speaker info and utterance data mixed in one table"
),
Fix = c(
"pivot_longer() to move year values into a year column",
"tidyr::separate() to split into age and gender columns",
"Aggregate or pivot to one row per observation unit",
"Split into separate tables, join when needed"
)
) |>
flextable::flextable() |>
flextable::set_table_properties(width = .99, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Common untidy data patterns and their fixes.") |>
flextable::border_outer()
```
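The second pattern above — multiple variables packed into one column — can be fixed in a single call. Here is a minimal sketch using a hypothetical `age_gender` column (the `speakers` data frame and its values are invented for illustration):

```{r untidy_fix}
# Hypothetical untidy data: gender and age encoded together in one column
speakers <- data.frame(
  speaker    = c("S01", "S02"),
  age_gender = c("M_25", "F_30")
)
# Split the combined column into two variables;
# convert = TRUE turns the age strings into integers
speakers |>
  tidyr::separate(age_gender,
                  into = c("gender", "age"),
                  sep = "_",
                  convert = TRUE)
```

After the split, `gender` and `age` are ordinary columns that `dplyr::group_by()` or a model formula can use directly.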
---
# Efficient Data Storage {#storage}
::: {.callout-note}
## Section Overview
**What you'll learn:** How to choose the right file format for saving data, balancing portability, size, and fidelity
**Key functions:** `write.csv()`, `saveRDS()`, `readRDS()`
:::
Data files can be stored in many formats, each with trade-offs in portability, file size, and how much R-specific information (column types, factor levels) is preserved.
## Comparing File Formats {-}
```{r format_compare, echo=FALSE, message=FALSE, warning=FALSE}
data.frame(
Format = c("CSV (.csv)", "Excel (.xlsx)", "RDS (.rds)", "RData (.rda/.RData)"),
Size = c("Medium", "Large", "Small", "Small"),
Portable = c("Any software", "Excel / R / Python", "R only", "R only"),
Preserves = c("Values as text only",
"Values + basic formatting",
"All R types and attributes",
"Multiple objects at once"),
Best_for = c("Sharing data with non-R users",
"Sharing with Excel users",
"Saving a single processed R object",
"Saving multiple R objects together")
) |>
dplyr::rename("Best for" = Best_for, "Preserves" = Preserves,
"Portable to" = Portable) |>
flextable::flextable() |>
flextable::set_table_properties(width = .99, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 11) |>
flextable::fontsize(size = 11, part = "header") |>
flextable::align_text_col(align = "left") |>
flextable::set_caption(caption = "Comparison of common data file formats.") |>
flextable::border_outer()
```
## A Concrete Size Comparison {-}
To illustrate the size difference, we create a moderately sized data frame and save it in each format:
```{r storage_demo, eval=FALSE}
# Create a sample data frame (1,000 rows, 4 columns)
set.seed(42)
n <- 1000
demo_data <- data.frame(
doc_id = paste0("doc", 1:n),
register = sample(c("Academic", "News", "Fiction"), n, replace = TRUE),
word_count = round(rnorm(n, 300, 60)),
year = sample(2015:2023, n, replace = TRUE)
)
# Save in different formats
write.csv(demo_data,
here::here("data", "demo.csv"), row.names = FALSE)
saveRDS(demo_data,
here::here("data", "demo.rds"))
openxlsx::write.xlsx(demo_data,
here::here("data", "demo.xlsx"))
# Compare file sizes
sizes <- file.info(c(
here::here("data", "demo.csv"),
here::here("data", "demo.rds"),
here::here("data", "demo.xlsx")
))$size / 1024 # size in KB
names(sizes) <- c("CSV", "RDS", "Excel")
round(sizes, 1)
```
In practice, RDS is typically 3–5× smaller than the equivalent CSV and 5–8× smaller than Excel, because `saveRDS()` writes a compressed binary file (gzip compression by default).
## Recommended Practice {-}
- **Save raw data as CSV** — human-readable, portable, future-proof
- **Save processed R objects as RDS** — preserves all R-specific attributes (factor levels, column types) without needing to re-run preprocessing each time
- **Save outputs for sharing as CSV or Excel** — collaborators without R can open them
- **Never save your workspace** (`save.image()` or `.RData`) — this creates an opaque binary blob that is hard to version-control and makes scripts implicitly depend on a hidden state
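The second recommendation is easy to verify with a quick round trip. This sketch (using temporary files, so nothing is written into your project) shows that RDS preserves a factor's class and levels while CSV demotes it to plain text:

```{r rds_roundtrip}
# A factor with an explicit level order — an R-specific attribute
reg <- factor(c("News", "Academic"),
              levels = c("Academic", "Fiction", "News"))
df  <- data.frame(register = reg)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(df, rds_path)
write.csv(df, csv_path, row.names = FALSE)

class(readRDS(rds_path)$register)   # "factor"    — levels preserved
class(read.csv(csv_path)$register)  # "character" — type information lost
```

Reading the CSV back means re-declaring the factor (and its level order) by hand, which is exactly the kind of silent preprocessing step that breaks reproducibility.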
```{r workspace_setting, eval=FALSE}
# Disable automatic workspace saving in RStudio:
# Tools → Global Options → General →
# set "Save workspace to .RData on exit" to NEVER
# and uncheck "Restore .RData into workspace at startup"
```
---
# Session Information {#sessioninfo}
::: {.callout-note}
## Section Overview
**Why it matters:** Recording your session information at the end of every notebook creates a permanent, human-readable log of the exact software environment used — essential for troubleshooting, peer review, and long-term reproducibility
:::
The `sessionInfo()` function prints a complete record of:
- The R version
- The operating system
- All loaded packages and their versions
- Locale settings
Always include this at the end of every notebook:
```{r sessioninfo_demo, eval=FALSE}
sessionInfo()
```
In combination with `renv.lock`, `sessionInfo()` output provides a complete snapshot of the software environment in a human-readable format that can be included in supplementary materials or reported in a methods section.
::: {.callout-tip}
## The Complete Reproducibility Checklist
By the end of any project, a fully reproducible analysis should have:
- ☐ All code in scripts or notebooks (no Console-only work)
- ☐ An R Project (`.Rproj`) as the root anchor
- ☐ `renv` initialised with a committed `renv.lock`
- ☐ A Git repository with meaningful commit history
- ☐ A GitHub repository (public or private) linked to the project
- ☐ `here::here()` used for all file paths
- ☐ `set.seed()` at the top of every script using randomness
- ☐ Tidy data structure in all analysis-ready data files
- ☐ `sessionInfo()` at the end of every notebook
- ☐ A `README.md` explaining what the project does and how to run it
:::
---
# Citation & Session Info {-}
::: {.callout-note}
## Citation
```{r citation-callout, echo=FALSE, results='asis'}
cat(
params$author, ". ",
params$year, ". *",
params$title, "*. ",
params$institution, ". ",
"url: ", params$url, " ",
"(Version ", params$version, "), ",
"doi: ", params$doi, ".",
sep = ""
)
```
```{r citation-bibtex, echo=FALSE, results='asis'}
key <- paste0(
tolower(gsub(" ", "", gsub(",.*", "", params$author))),
params$year,
tolower(gsub("[^a-zA-Z]", "", strsplit(params$title, " ")[[1]][1]))
)
cat("```\n")
cat("@manual{", key, ",\n", sep = "")
cat(" author = {", params$author, "},\n", sep = "")
cat(" title = {", params$title, "},\n", sep = "")
cat(" year = {", params$year, "},\n", sep = "")
cat(" note = {", params$url, "},\n", sep = "")
cat(" organization = {", params$institution, "},\n", sep = "")
cat("  edition      = {", params$version, "},\n", sep = "")
cat(" doi = {", params$doi, "}\n", sep = "")
cat("}\n```\n")
```
:::
```{r fin}
sessionInfo()
```
::: {.callout-note}
## AI Transparency Statement
This tutorial was revised and restyled with the assistance of **Claude** (claude.ai), a large language model created by Anthropic. All substantive content — code, statistical explanations, exercises, and reporting conventions — was retained from the original. All changes were reviewed and approved by Martin Schweinberger, who takes full responsibility for the tutorial's accuracy.
:::
---
[Back to top](#intro)
[Back to HOME](/index.html)
---
# References {-}