Loading, Saving, and Simulating Data in R

Author

Martin Schweinberger

Introduction

This tutorial covers three foundational data-management skills for linguistic research in R: loading data from a wide variety of file formats into your R session, saving processed data and R objects back to disk in appropriate formats, and simulating data from scratch — either for reproducible worked examples, for power analysis, or for creating synthetic corpora.

Data rarely arrive in a single tidy format. A corpus might be spread across hundreds of plain-text files; an experimental dataset might come from a collaborator as an Excel spreadsheet; a frequency list might be stored as an R object from a previous session; metadata might be embedded in a JSON file exported from a web API; and survey responses might be in an SPSS .sav file. Knowing how to read, write, and create data in R is therefore not a preliminary skill to be rushed through — it is a core competency that affects every subsequent step of your analysis.

The tutorial is aimed at beginners to intermediate R users. It assumes you are comfortable with basic R syntax (objects, functions, vectors, data frames) but have no prior experience with the specific packages used here.

Prerequisite Tutorials

Before working through this tutorial, you should be familiar with the content of the following:

If you are new to R, please work through Getting Started with R before proceeding.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Load tabular data from plain text (.csv, .tsv, .txt), Excel (.xlsx), R-native (.rda, .rds), JSON, and XML formats into R
  2. Save R data objects back to each of those formats using appropriate functions
  3. Load data directly from a URL without downloading it manually
  4. Access built-in datasets from base R and installed R packages
  5. Load a single plain-text file and a directory of multiple text files into R for corpus analysis
  6. Understand what set.seed() does and why reproducibility depends on it
  7. Simulate data from the most common statistical distributions (normal, binomial, Poisson, uniform, negative binomial)
  8. Build realistic simulated datasets modelling corpora, psycholinguistic experiments, and Likert-scale surveys
  9. Perform a simple simulation-based power analysis for a mixed-effects model
  10. Generate synthetic textual data using character manipulation and Markov-chain approaches
Citation

Schweinberger, Martin. 2026. Loading, Saving, and Simulating Data in R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/load/load.html (Version 2026.02.24).


Project Structure and File Paths

Section Overview

What you will learn: How to set up a reproducible project directory, why the here package is preferred over setwd(), and how to verify that R can find your data files before you try to load them

Why File Paths Matter

Every data-loading command in R requires a file path — the address of the file on your computer (or on the web). Paths that work on your computer will break when you share your script with a colleague, upload it to a server, or move your project to a different folder. The most common source of beginner frustration (“it worked yesterday!”) is a broken file path.

There are two approaches to managing paths: the fragile one and the robust one.

The fragile approach — setwd(): Setting the working directory with setwd("C:/Users/Martin/Documents/myproject") hard-codes an absolute path that is specific to one machine and one folder location. As soon as you move the project, rename a folder, or share the code, it breaks.

The robust approach — RStudio Projects + here: Creating an RStudio Project (.Rproj file) anchors all paths to the project root. The here package then builds paths relative to that root using here::here(), which works identically on Windows, macOS, and Linux regardless of where the project folder lives.

Verifying Paths with here

Code
library(here)

# Check what here() considers the project root
here::here()

# Build a path to a file in the data subfolder
here::here("data", "testdat.csv")

# Check whether the file actually exists at that path
file.exists(here::here("data", "testdat.csv"))

# List all files in the data folder
list.files(here::here("data"))

# List all .txt files in the testcorpus subfolder
list.files(here::here("data", "testcorpus"), pattern = "\\.txt$")
Always Check Before Loading

Run file.exists(your_path) before attempting to load a file. If it returns FALSE, diagnose the problem with list.files() before debugging your loading code — the file path is almost always the issue, not the loading function.


Setup

Installing Packages

Code
# Run once — comment out after installation
install.packages("here")        # robust file paths
install.packages("readr")       # fast CSV/TSV reading (tidyverse)
install.packages("openxlsx")    # read and write Excel files
install.packages("readxl")      # read Excel files (tidyverse)
install.packages("writexl")     # write Excel files (lightweight)
install.packages("jsonlite")    # parse and write JSON
install.packages("xml2")        # parse and write XML
install.packages("haven")       # SPSS, Stata, SAS files
install.packages("dplyr")       # data manipulation
install.packages("tidyr")       # data reshaping
install.packages("stringr")     # string manipulation
install.packages("purrr")       # functional programming (map/walk)
install.packages("ggplot2")     # visualisation
install.packages("data.tree")   # directory tree display
install.packages("officer")     # read Word documents

Loading Packages

Code
library(here)
library(readr)
library(openxlsx)
library(readxl)
library(writexl)
library(jsonlite)
library(xml2)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(ggplot2)
library(data.tree)
library(officer)

Loading and Saving Plain Text Data

Section Overview

What you will learn: How to load and save tabular plain-text files (CSV, TSV, delimited TXT) using both base R functions and the faster, more consistent readr package; how to diagnose common loading problems; and when to choose each approach

What Is a Plain-Text Tabular File?

A plain-text tabular file stores a data table as human-readable text, with columns separated by a special character called the delimiter. The most common delimiters are:

Common plain-text tabular formats
Format Delimiter File extension Notes
CSV Comma (,) .csv Most common; fields containing commas must be quoted
TSV Tab (\t) .tsv or .txt Safer for text data; less widely used
Semi-colon delimited ; .csv Common in European locales where , is the decimal separator
Pipe delimited | .txt Used in some corpus annotation formats

Loading CSV Files

Base R: read.csv()

The base R function read.csv() is available without loading any packages and is the default choice for many users:

Code
# Base R CSV loading
datcsv <- read.csv(
  here::here("tutorials/load/data", "testdat.csv"),
  header      = TRUE,    # first row = column names (default TRUE)
  strip.white = TRUE,    # trim leading/trailing whitespace from strings
  na.strings  = c("", "NA", "N/A", "missing")  # treat these as NA
)

# Inspect structure
str(datcsv)
'data.frame':   10 obs. of  2 variables:
 $ Variable1: int  6 65 12 56 45 84 38 46 64 24
 $ Variable2: int  67 16 56 34 54 42 36 47 54 29
Code
head(datcsv)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42
Key Arguments for read.csv()
Key arguments for read.csv()
Argument Default Purpose
header TRUE First row contains column names
sep "," Column delimiter
dec "." Decimal separator
na.strings "NA" Strings to treat as missing
strip.white FALSE Strip whitespace from string fields
encoding "unknown" File encoding (try "UTF-8" for non-ASCII text)
comment.char "" Ignore lines starting with this character

The readr Package: read_csv()

The readr package (part of the tidyverse) provides faster, more consistent alternatives to base R reading functions. Key advantages: it returns a tibble rather than a plain data frame, it prints progress for large files, it guesses column types more reliably, and it produces informative error messages.

Code
# readr CSV loading
datcsv_r <- readr::read_csv(
  here::here("tutorials/load/data", "testdat.csv"),
  col_types  = cols(),      # suppress type-guessing messages
  na         = c("", "NA", "N/A"),
  trim_ws    = TRUE
)

# Inspect the column specification readr used
spec(datcsv_r)
cols(
  Variable1 = col_double(),
  Variable2 = col_double()
)
Code
head(datcsv_r)
# A tibble: 6 × 2
  Variable1 Variable2
      <dbl>     <dbl>
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42
read.csv() vs. read_csv(): Which Should I Use?

Use read.csv() when you need no extra dependencies, are working with small files, or are writing a script that others will run without the tidyverse installed.

Use read_csv() when working with large files (it is 5–10× faster), when you want explicit column-type checking, or when your workflow uses tidyverse throughout. The underscore vs. dot distinction is the only naming difference to remember: read.csv() is base R, read_csv() is readr.

Semi-Colon Delimited CSV

In many European locales the comma is the decimal separator (e.g. 3,14 for π), so CSV files from these locales use a semi-colon as the column delimiter. Both base R and readr provide specialised functions:

Code
# Base R: read.delim with sep = ";"
datcsv2_base <- read.delim(
  here::here("tutorials/load/data", "testdat2.csv"),
  sep    = ";",
  header = TRUE,
  dec    = ","   # comma as decimal separator
)

# readr: read_csv2() handles ; delimiter and , decimal automatically
datcsv2_r <- readr::read_csv2(
  here::here("tutorials/load/data", "testdat2.csv"),
  col_types = cols()
)

head(datcsv2_base)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42

Loading TSV and Other Delimited Files

Code
# readr: read_tsv for tab-separated files
# dattxt_r <- readr::read_tsv(
#   here::here("tutorials/load/data", "testdat.txt"),
#   col_types = cols()
# )

# readr: read_delim for any delimiter
# datpipe <- readr::read_delim(
#   here::here("tutorials/load/data", "testdat_pipe.txt"),
#   delim     = "|",
#   col_types = cols()
# )

# Base R equivalents
dattxt_base <- read.delim(
  here::here("tutorials/load/data", "testdat.txt"),
  sep    = "\t",
  header = TRUE
)
head(dattxt_base)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42

Saving Plain-Text Files

Writing CSV

Code
# Base R: write.csv — adds row numbers by default; suppress with row.names = FALSE
write.csv(
  datcsv,
  file      = here::here("tutorials/load/data", "testdat_out.csv"),
  row.names = FALSE,   # ALWAYS set this to avoid a spurious row-number column
  fileEncoding = "UTF-8"
)

# readr: write_csv — no row names by default; faster; always UTF-8
readr::write_csv(
  datcsv_r,
  file = here::here("tutorials/load/data", "testdat_out_r.csv")
)

# Semi-colon CSV (European locale)
readr::write_csv2(
  datcsv2_r,
  file = here::here("tutorials/load/data", "testdat2_out.csv")
)
Always Use row.names = FALSE

The base R write.csv() adds a column of row numbers by default (row names). This creates an unnamed first column of integers when the file is re-read, which is almost never what you want. Always set row.names = FALSE when using write.csv(). The readr functions (write_csv, write_tsv) never write row names.
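The spurious column is easy to demonstrate with a round trip through a temporary file; a minimal sketch using base R only (the data frame contents are made up):

```r
# Round trip showing the spurious row-name column added by write.csv()
tmp <- tempfile(fileext = ".csv")
df  <- data.frame(word = c("very", "really", "so"), freq = c(12, 7, 3))

write.csv(df, tmp)                     # default: row names are written
names(read.csv(tmp))                   # "X" "word" "freq" — unwanted "X" column

write.csv(df, tmp, row.names = FALSE)
names(read.csv(tmp))                   # "word" "freq" — clean
```

The unnamed first column in the file becomes a column called X on re-reading, because read.csv() fills in missing column names.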

Writing TSV and Other Formats

Code
# TSV
readr::write_tsv(
  datcsv_r,
  file = here::here("tutorials/load/data", "testdat_out.tsv")
)

# Custom delimiter (pipe)
readr::write_delim(
  datcsv_r,
  file  = here::here("tutorials/load/data", "testdat_out_pipe.txt"),
  delim = "|"
)

# Base R: write.table (most flexible)
write.table(
  datcsv,
  file      = here::here("tutorials/load/data", "testdat_out.txt"),
  sep       = "\t",
  row.names = FALSE,
  quote     = FALSE    # suppress quoting of strings (useful for corpus data)
)

You receive a file called responses.csv from a colleague in Germany. When you load it with read.csv(), the entire table ends up in a single column, and inspecting the raw file shows values like "3,14" and "2,71" in a column called Score. What is the most likely problem, and how do you fix it?

  1. The file is corrupt — ask the colleague to re-export it
  2. The file uses a semi-colon as the column delimiter and a comma as the decimal separator — use read.csv2() or read.delim(sep = ";", dec = ",")
  3. The file is tab-separated, not comma-separated — use read.delim(sep = "\t")
  4. The Score column contains text responses — convert manually with as.numeric()
Answer

b) The file uses a semi-colon as the column delimiter and a comma as the decimal separator — use read.csv2() or read.delim(sep = ";", dec = ",")

German locale settings use , as the decimal mark (so 3,14 means 3.14) and ; as the CSV column delimiter (so that commas in numbers are not confused with column separators). When you read such a file with read.csv() (which expects , as the delimiter), the entire row is read as one column, and numbers appear as strings. The fix is read.csv2() (base R) or readr::read_csv2(), both of which default to ; delimiter and , decimal. Option (d) would treat the symptom, not the cause.
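The fix can be verified without a file on disk, because the read.* functions also accept a text argument; a minimal sketch with a made-up two-row table:

```r
# Hypothetical German-locale data as an in-memory string
txt <- "id;Score\nP01;3,14\nP02;2,71"

# read.csv2() defaults to ; as delimiter and , as decimal separator
dat <- read.csv2(text = txt)
str(dat)
# 'data.frame':   2 obs. of  2 variables:
#  $ id   : chr  "P01" "P02"
#  $ Score: num  3.14 2.71
```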


Loading and Saving Excel Files

Section Overview

What you will learn: How to read and write .xlsx and .xls Excel files using readxl, openxlsx, and writexl; how to work with multi-sheet workbooks; and common pitfalls of Excel data (merged cells, date encoding, mixed-type columns)

Why Excel Handling Deserves Its Own Section

Excel is the most widely used data format outside of programming environments, and linguistic researchers constantly receive data from collaborators, transcription tools, survey platforms, and corpus annotation software in .xlsx format. However, Excel files present challenges that plain-text files do not:

  • Multiple sheets in a single file, only one of which contains the data you need
  • Merged cells and complex headers that break rectangular data assumptions
  • Mixed-type columns where Excel has inferred numeric types for columns that should be character
  • Date columns that Excel stores as integers (days since 1900) and that R must convert
  • Trailing whitespace and invisible characters copied from other software

Loading Excel Files

The readxl Package

readxl is the tidyverse-standard Excel reader. It reads both .xlsx and the older .xls format, has no Java dependency (unlike xlsx), and returns a tibble.

Code
# List all sheets in the workbook before loading
readxl::excel_sheets(here::here("tutorials/load/data", "testdat.xlsx"))
[1] "Sheet 1"
Code
# Load the first sheet
datxlsx <- readxl::read_excel(
  path      = here::here("tutorials/load/data", "testdat.xlsx"),
  sheet     = 1,          # sheet number or name
  col_names = TRUE,       # first row = column names
  na        = c("", "NA", "N/A"),
  trim_ws   = TRUE,
  skip      = 0           # number of rows to skip before reading
)

str(datxlsx)
tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
 $ Variable1: num [1:10] 6 65 12 56 45 84 38 46 64 24
 $ Variable2: num [1:10] 67 16 56 34 54 42 36 47 54 29
Code
head(datxlsx)
# A tibble: 6 × 2
  Variable1 Variable2
      <dbl>     <dbl>
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42
Code
# Load all sheets at once into a named list
all_sheets <- purrr::map(
  readxl::excel_sheets(here::here("tutorials/load/data", "testdat.xlsx")),
  ~ readxl::read_excel(
      path  = here::here("tutorials/load/data", "testdat.xlsx"),
      sheet = .x,
      na    = c("", "NA")
  )
) |>
  purrr::set_names(
    readxl::excel_sheets(here::here("tutorials/load/data", "testdat.xlsx"))
  )

# Access individual sheets by name
# all_sheets[["Sheet 1"]]
Specifying Column Types in read_excel()

Excel sometimes guesses column types incorrectly. Use the col_types argument to override:

readxl::read_excel(
  path      = here::here("data", "testdat.xlsx"),
  col_types = c("text", "numeric", "date", "logical")
)

Valid types are "skip", "guess", "logical", "numeric", "date", "text", and "list". Use "text" for ID columns or any column that should never be converted to a number.

The openxlsx Package

openxlsx is the most feature-complete Excel package for R. It can read, write, and format .xlsx files (cell colours, fonts, borders, conditional formatting), which makes it the best choice when your output needs to be presentable as a report.

Code
# Load with openxlsx
datxlsx2 <- openxlsx::read.xlsx(
  xlsxFile = here::here("tutorials/load/data", "testdat.xlsx"),
  sheet    = 1,
  colNames = TRUE,
  na.strings = c("", "NA")
)

head(datxlsx2)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42

Saving Excel Files

Simple Saving with writexl

writexl has no dependencies and writes clean .xlsx files extremely fast. Use it whenever you only need to export a data frame without formatting:

Code
writexl::write_xlsx(
  x    = datxlsx,
  path = here::here("tutorials/load/data", "testdat_out.xlsx")
)

# Write multiple sheets: pass a named list
writexl::write_xlsx(
  x    = list(RawData = datcsv, Processed = datxlsx),
  path = here::here("tutorials/load/data", "multisheet_out.xlsx")
)

Formatted Saving with openxlsx

Code
# Simple write
openxlsx::write.xlsx(
  x    = datxlsx2,
  file = here::here("tutorials/load/data", "testdat_openxlsx.xlsx")
)

# Formatted workbook: create, style, save
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, sheetName = "Results")
openxlsx::writeData(wb, sheet = "Results", x = datxlsx2, startRow = 1, startCol = 1)

# Style the header row
header_style <- openxlsx::createStyle(
  fontColour = "#FFFFFF",
  fgFill     = "#4472C4",
  halign     = "center",
  textDecoration = "bold",
  border     = "Bottom"
)
openxlsx::addStyle(wb, sheet = "Results", style = header_style,
                   rows = 1, cols = 1:ncol(datxlsx2), gridExpand = TRUE)

# Freeze the top row (useful for large tables)
openxlsx::freezePane(wb, sheet = "Results", firstRow = TRUE)

openxlsx::saveWorkbook(wb,
  file      = here::here("tutorials/load/data", "testdat_formatted.xlsx"),
  overwrite = TRUE
)
Common Excel Pitfalls

Date columns: Excel stores dates as integers (days since 1 January 1900). readxl converts these automatically; openxlsx::read.xlsx() may return them as integers unless you set detectDates = TRUE.

Leading zeros: Excel silently drops leading zeros from numeric-looking strings (e.g. zip codes "01234" become 1234). Protect them with col_types = "text" in read_excel().

Merged cells: Merged cells create NA values in all but the first row of the merge. Use tidyr::fill() to propagate values downward after loading.

Formula cells: By default, readxl reads the cached formula result, not the formula itself. This is almost always what you want.
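The merged-cell repair with tidyr::fill() can be sketched on a toy data frame that mimics what a merged genre column looks like after loading (column names and values are made up):

```r
library(tidyr)

# After loading, a merged cell leaves NA in all but its first row
dat <- data.frame(
  genre = c("academic", NA, NA, "fiction", NA),
  file  = c("a1", "a2", "a3", "f1", "f2")
)

# Propagate the last non-missing value downward
filled <- tidyr::fill(dat, genre, .direction = "down")
filled$genre
# "academic" "academic" "academic" "fiction" "fiction"
```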

You load an Excel file containing participant IDs such as "007", "012", "099". After loading with read_excel() you notice they appear as 7, 12, 99 — the leading zeros are gone. What is the most reliable fix?

  1. Re-type the IDs manually in R with paste0("0", dat$ID)
  2. Specify col_types = "text" for the ID column in read_excel() so R reads it as a character string without numeric coercion
  3. Open the file in Excel and format the column as “Text” before loading into R
  4. Use formatC(dat$ID, width = 3, flag = "0") to add zeros back after loading
Answer

b) Specify col_types = "text" for the ID column in read_excel() so R reads it as a character string without numeric coercion

This is the most reliable solution because it prevents the coercion from happening in the first place. Option (c) also works but requires manual intervention each time the file is updated. Option (d) partially fixes the symptom but fails if IDs have different lengths. Option (a) assumes all IDs are exactly 3 digits and only adds one zero, which is incorrect for "007". The best practice is always to protect ID columns and any column with leading-zero strings by specifying col_types = "text".


Loading and Saving R Native Formats

Section Overview

What you will learn: The difference between .rds, .rda / .RData, and workspace saves; when to use each; and best practices for long-term storage of R objects

R Native Formats at a Glance

R has several native serialisation formats. Understanding the differences matters for reproducibility and collaboration:

R native formats compared
Format Extension Stores Load function Save function
RDS .rds One R object readRDS() saveRDS()
RData .rda or .RData One or more named objects load() save()
Workspace .RData All objects in the environment load() (auto-loaded on startup if present) save.image()
Prefer .rds Over .RData for Data Exchange

When sharing a single dataset with a colleague, always use .rds and readRDS() / saveRDS(). This is because load() silently overwrites any object in your environment that has the same name as the object stored in the .rda file — a common source of difficult-to-debug errors. With readRDS(), you assign the loaded object to a name of your choosing, so there is no risk of collision.

RDS Files

RDS is the recommended format for storing a single R object — a data frame, a list, a fitted model, a character vector, or any other R object.

Code
# Load an RDS file — assign to any name you like
# (despite its .rda extension, this particular file was saved with saveRDS())
rdadat <- readRDS(here::here("tutorials/load/data", "testdat.rda"))

# Inspect
str(rdadat)
'data.frame':   10 obs. of  2 variables:
 $ Variable1: num  6 65 12 56 45 84 38 46 64 24
 $ Variable2: num  67 16 56 34 54 42 36 47 54 29
Code
head(rdadat)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42
Code
# Save any R object as RDS
saveRDS(
  object = rdadat,
  file   = here::here("tutorials/load/data", "testdat_out.rds"),
  compress = TRUE   # default; gzip-compresses the file (alternatives below)
)

# Compare compression options
saveRDS(rdadat, here::here("tutorials/load/data", "testdat_xz.rds"),
        compress = "xz")      # smallest file, slowest
saveRDS(rdadat, here::here("tutorials/load/data", "testdat_gz.rds"),
        compress = "gzip")    # medium; good for large data
saveRDS(rdadat, here::here("tutorials/load/data", "testdat_bz2.rds"),
        compress = "bzip2")   # medium

RData Files

.rda / .RData files can store multiple named R objects in a single file. They are useful for bundling related objects together (e.g. a dataset, its metadata, and a pre-fitted model) or for distributing example data with an R package.

Code
# Save multiple objects into one .rda file
x <- 1:10
y <- letters[1:5]
my_df <- data.frame(a = 1:3, b = c("x", "y", "z"))

save(x, y, my_df,
     file = here::here("tutorials/load/data", "multiple_objects.rda"))

# load() places the stored objects directly into the current environment
# and invisibly returns their names
obj_names <- load(here::here("tutorials/load/data", "multiple_objects.rda"))
cat("Objects loaded:", paste(obj_names, collapse = ", "), "\n")
Objects loaded: x, y, my_df 
Code
# To save ALL objects in the current environment (use sparingly)
save.image(file = here::here("tutorials/load/data", "session_snapshot.RData"))
Avoid save.image() for Reproducibility

Saving your entire workspace with save.image() or allowing RStudio to save .RData on exit feels convenient but actively harms reproducibility. Your analysis can only be reproduced if it runs from scratch on clean data — not from a cached state that may contain objects whose provenance is unknown. Set Tools → Global Options → General → Workspace → “Never” for “Save workspace to .RData on exit” in RStudio.

Loading R Data from the Web

R native objects can be loaded directly from a URL without downloading the file first. This is the standard approach for LADAL tutorial data:

Code
# Load an RDS object directly from a URL
webdat <- base::readRDS(url("https://ladal.edu.au/tutorials/load/data/testdat.rda", "rb"))

# Equivalently, for a file on GitHub or any web server:
# webdat <- readRDS(url("https://raw.githubusercontent.com/.../testdat.rda", "rb"))
Code
# CSV from URL (readr handles URLs directly)
web_csv <- readr::read_csv("https://raw.githubusercontent.com/LADAL/data/main/testdat.csv",
                           col_types = cols())

# Excel from URL (must download to temp file first)
tmp <- tempfile(fileext = ".xlsx")
download.file("https://example.com/testdat.xlsx", destfile = tmp, mode = "wb")
web_xlsx <- readxl::read_excel(tmp)
unlink(tmp)   # delete the temporary file

A colleague sends you an .rda file called results.rda and tells you it contains an object called model_output. You run load("results.rda") in your R session. You already have an object called model_output in your environment from your own analysis. What happens?

  1. R produces an error and does not load the file
  2. R creates a second object called model_output_1 to avoid the conflict
  3. R silently overwrites your existing model_output with the colleague’s version, with no warning
  4. R asks you to confirm before overwriting the existing object
Answer

c) R silently overwrites your existing model_output with the colleague’s version, with no warning

This is one of the most dangerous behaviours of load(). It inserts objects directly into the global environment (or whatever environment you specify) without checking for name conflicts. Your own model_output will be gone, with no undo. This is why saveRDS() / readRDS() are preferred for data exchange: with readRDS(), you write model_output_colleague <- readRDS("results.rda") and choose the name yourself, so no collision is possible.
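Both behaviours can be reproduced end-to-end with temporary files; a minimal sketch using base R only (object names and values are made up):

```r
# 1) load() silently overwrites
tmp_rda <- tempfile(fileext = ".rda")
model_output <- "colleague's results"
save(model_output, file = tmp_rda)   # file now stores 'model_output'

model_output <- "my own results"     # my version of the object
load(tmp_rda)                        # no warning, no prompt ...
model_output                         # ... "colleague's results" — mine is gone

# 2) readRDS() lets you choose the name
tmp_rds <- tempfile(fileext = ".rds")
saveRDS("colleague's results", tmp_rds)
model_output <- "my own results"
incoming <- readRDS(tmp_rds)         # no collision possible
model_output                         # still "my own results"
```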


Loading and Saving JSON and XML

Section Overview

What you will learn: What JSON and XML are and where linguists encounter them; how to parse both formats into R data frames using jsonlite and xml2; and how to write R data back to these formats

JSON

JSON (JavaScript Object Notation) is the dominant data exchange format for web APIs, annotation tools, and many corpus management systems. It represents data as nested key-value pairs and arrays. Linguists encounter JSON when:

  • Downloading corpus metadata or concordances from a web API (e.g. CLARIN VLO, AntConc, SketchEngine)
  • Working with annotation exports from tools like CATMA, INCEpTION, or Label Studio
  • Reading metadata from language resource repositories (e.g. Glottolog, WALS online API)

Understanding JSON Structure

A simple JSON file looks like this:

{
  "participants": [
    {"id": "P01", "age": 24, "l1": "English", "proficiency": "Advanced"},
    {"id": "P02", "age": 31, "l1": "German",  "proficiency": "Intermediate"},
    {"id": "P03", "age": 28, "l1": "French",  "proficiency": "Advanced"}
  ],
  "study": "L2 Amplifier Use",
  "year": 2026
}

The outer {} is an object (key-value pairs). Square brackets [] denote arrays (ordered lists). Values can be strings, numbers, booleans, null, objects, or arrays — JSON is recursive.

Loading JSON

Code
# Simulate reading a JSON string (in practice, replace with a file path or URL)
json_string <- '{
  "participants": [
    {"id": "P01", "age": 24, "l1": "English",  "proficiency": "Advanced"},
    {"id": "P02", "age": 31, "l1": "German",   "proficiency": "Intermediate"},
    {"id": "P03", "age": 28, "l1": "French",   "proficiency": "Advanced"},
    {"id": "P04", "age": 22, "l1": "Japanese", "proficiency": "Intermediate"},
    {"id": "P05", "age": 35, "l1": "Spanish",  "proficiency": "Advanced"}
  ],
  "study": "L2 Amplifier Use",
  "year": 2026
}'

# Parse JSON string into an R list
json_list <- jsonlite::fromJSON(json_string, simplifyDataFrame = TRUE)

# The top-level keys become list elements
names(json_list)
[1] "participants" "study"        "year"        
Code
# The "participants" element is automatically converted to a data frame
participants <- json_list$participants
str(participants)
'data.frame':   5 obs. of  4 variables:
 $ id         : chr  "P01" "P02" "P03" "P04" ...
 $ age        : int  24 31 28 22 35
 $ l1         : chr  "English" "German" "French" "Japanese" ...
 $ proficiency: chr  "Advanced" "Intermediate" "Advanced" "Intermediate" ...
Code
participants
   id age       l1  proficiency
1 P01  24  English     Advanced
2 P02  31   German Intermediate
3 P03  28   French     Advanced
4 P04  22 Japanese Intermediate
5 P05  35  Spanish     Advanced
Code
# Load from a local file
json_data <- jsonlite::fromJSON(
  txt              = here::here("tutorials/load/data", "data.json"),
  simplifyDataFrame = TRUE,   # convert arrays of objects to data frames
  simplifyVector   = TRUE,    # convert scalar arrays to vectors
  flatten          = TRUE     # flatten nested objects into columns
)

# Load from a URL (e.g. a web API)
glottolog_url <- "https://glottolog.org/resource/languoid/id/stan1293.json"
# glottolog_data <- jsonlite::fromJSON(glottolog_url)
simplifyDataFrame = TRUE vs. FALSE

When simplifyDataFrame = TRUE (the default), fromJSON() tries to convert JSON arrays whose elements all have the same keys into a data frame. This is usually what you want. When the JSON structure is irregular (different keys in different elements), set simplifyDataFrame = FALSE to get a pure R list and then reshape manually.
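The difference is easy to see on a small irregular array — a made-up example where the two objects have different keys:

```r
library(jsonlite)

irregular <- '[{"id": "P01", "age": 24}, {"id": "P02", "l1": "German"}]'

# Simplified: one data frame; keys missing from an element become NA
df <- jsonlite::fromJSON(irregular, simplifyDataFrame = TRUE)
# columns id, age, l1 — with NA where a key was absent

# Unsimplified: a plain list of named lists, faithful to the JSON
lst <- jsonlite::fromJSON(irregular, simplifyDataFrame = FALSE)
str(lst)
```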

Handling Nested JSON

Real JSON from APIs is often deeply nested. The flatten = TRUE argument and tidyr::unnest() are your main tools:

Code
nested_json <- '{
  "corpus": [
    {
      "text_id": "T001",
      "metadata": {"genre": "academic", "year": 2010, "wordcount": 3241},
      "tokens": 3241
    },
    {
      "text_id": "T002",
      "metadata": {"genre": "fiction", "year": 2015, "wordcount": 8754},
      "tokens": 8754
    },
    {
      "text_id": "T003",
      "metadata": {"genre": "news", "year": 2019, "wordcount": 512},
      "tokens": 512
    }
  ]
}'

# flatten = TRUE unpacks nested objects into dot-separated column names
corpus_df <- jsonlite::fromJSON(nested_json, simplifyDataFrame = TRUE, flatten = TRUE)$corpus
str(corpus_df)
'data.frame':   3 obs. of  5 variables:
 $ text_id           : chr  "T001" "T002" "T003"
 $ tokens            : int  3241 8754 512
 $ metadata.genre    : chr  "academic" "fiction" "news"
 $ metadata.year     : int  2010 2015 2019
 $ metadata.wordcount: int  3241 8754 512
Code
corpus_df
  text_id tokens metadata.genre metadata.year metadata.wordcount
1    T001   3241       academic          2010               3241
2    T002   8754        fiction          2015               8754
3    T003    512           news          2019                512
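For per-row arrays of objects, which flatten = TRUE leaves as list-columns, tidyr::unnest() is the complementary tool; a toy sketch with made-up column names that mimics such a list-column:

```r
library(tidyr)
library(tibble)

# A list-column like those fromJSON() produces for per-row arrays of objects
texts <- tibble::tibble(
  text_id = c("T001", "T002"),
  hits    = list(
    data.frame(token = c("very", "so"), freq = c(3, 1)),
    data.frame(token = "really",        freq = 2)
  )
)

# unnest() expands each row's data frame into multiple rows
tidyr::unnest(texts, hits)
# 3 rows: T001/very, T001/so, T002/really
```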

Saving JSON

Code
# Convert an R object to a JSON string
json_out <- jsonlite::toJSON(
  participants,
  pretty     = TRUE,   # indented, human-readable output
  auto_unbox = TRUE    # single-element arrays written as scalars (not [value])
)
cat(json_out)

# Write to file
jsonlite::write_json(
  participants,
  path       = here::here("tutorials/load/data", "participants_out.json"),
  pretty     = TRUE,
  auto_unbox = TRUE
)

XML

XML (eXtensible Markup Language) is older than JSON and more verbose, but it remains the dominant format in computational linguistics and digital humanities. Linguists encounter XML in:

  • TEI (Text Encoding Initiative) markup for edited texts, manuscripts, and historical corpora
  • CoNLL-U and related annotation formats (sometimes XML-wrapped)
  • BNC, BNC2014, COCA corpus XML distributions
  • ELAN annotation files (.eaf)
  • Sketch Engine CQL export format

Understanding XML Structure

XML organises data as a tree of nested elements, each with an opening tag, a closing tag, and optionally attributes and text content:

<?xml version="1.0" encoding="UTF-8"?>
<corpus name="MiniCorpus" year="2026">
  <text id="T001" genre="academic">
    <sentence n="1">
      <token pos="NN" lemma="corpus">corpus</token>
      <token pos="NN" lemma="analysis">analysis</token>
    </sentence>
  </text>
</corpus>

Loading XML

Code
# Parse an XML string (in practice, use read_xml() with a file path)
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<corpus name="MiniCorpus" year="2026">
  <text id="T001" genre="academic">
    <sentence n="1">
      <token pos="DT">The</token>
      <token pos="NN">corpus</token>
      <token pos="VBZ">contains</token>
      <token pos="JJ">linguistic</token>
      <token pos="NNS">tokens</token>
    </sentence>
    <sentence n="2">
      <token pos="NNS">Frequencies</token>
      <token pos="VBP">vary</token>
      <token pos="IN">by</token>
      <token pos="NN">genre</token>
    </sentence>
  </text>
  <text id="T002" genre="fiction">
    <sentence n="1">
      <token pos="PRP">She</token>
      <token pos="VBD">said</token>
      <token pos="RB">very</token>
      <token pos="RB">little</token>
    </sentence>
  </text>
</corpus>'

# Parse the XML
xml_doc <- xml2::read_xml(xml_string)

# Navigate the tree with XPath
# Extract all token elements
tokens_nodeset <- xml2::xml_find_all(xml_doc, ".//token")

# For each token, walk up the ancestor axis to find its <text> parent
# xml_find_first("./ancestor::text[1]") returns the nearest <text> ancestor
token_df <- data.frame(
  text_id = xml2::xml_attr(
    xml2::xml_find_first(tokens_nodeset, "./ancestor::text[1]"),
    "id"
  ),
  genre   = xml2::xml_attr(
    xml2::xml_find_first(tokens_nodeset, "./ancestor::text[1]"),
    "genre"
  ),
  sent_n  = xml2::xml_attr(
    xml2::xml_find_first(tokens_nodeset, "./ancestor::sentence[1]"),
    "n"
  ),
  pos     = xml2::xml_attr(tokens_nodeset, "pos"),
  word    = xml2::xml_text(tokens_nodeset),
  stringsAsFactors = FALSE
)

head(token_df, 10)
   text_id    genre sent_n pos        word
1     T001 academic      1  DT         The
2     T001 academic      1  NN      corpus
3     T001 academic      1 VBZ    contains
4     T001 academic      1  JJ  linguistic
5     T001 academic      1 NNS      tokens
6     T001 academic      2 NNS Frequencies
7     T001 academic      2 VBP        vary
8     T001 academic      2  IN          by
9     T001 academic      2  NN       genre
10    T002  fiction      1 PRP         She
XPath: The Language of XML Navigation

XPath is a mini-language for selecting nodes from an XML tree. The most useful patterns are:

Common XPath patterns for corpus XML
XPath expression Meaning
//token All <token> elements anywhere in the document
.//token All <token> elements within the current context node
//text[@genre='academic'] <text> elements with genre attribute equal to "academic"
//sentence[@n='1']//token All tokens inside sentence 1
//token/@pos The pos attribute of all token elements

Always test XPath expressions with xml2::xml_find_all() and inspect the result before building a full extraction pipeline.
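To make these patterns concrete, here is a minimal, self-contained sketch (it re-creates a small document rather than reusing xml_doc from above, so it runs on its own):

```r
library(xml2)

# A tiny stand-in corpus for trying out the XPath patterns above
doc <- read_xml(
  '<corpus>
     <text id="T001" genre="academic">
       <sentence n="1"><token pos="DT">The</token><token pos="NN">corpus</token></sentence>
     </text>
     <text id="T002" genre="fiction">
       <sentence n="1"><token pos="PRP">She</token></sentence>
     </text>
   </corpus>'
)

# //token: all token elements anywhere in the document
all_tokens <- xml_find_all(doc, "//token")
length(all_tokens)                                     # 3

# //text[@genre='academic']//token: tokens in academic texts only
academic <- xml_find_all(doc, "//text[@genre='academic']//token")
xml_text(academic)                                     # "The" "corpus"

# For //token/@pos, the xml2 idiom is to select nodes, then xml_attr()
xml_attr(all_tokens, "pos")                            # "DT" "NN" "PRP"
```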

A More Efficient XML Extraction Pattern

Code
# Extract all texts with their metadata, using purrr::map_dfr
corpus_table <- purrr::map_dfr(
  xml2::xml_find_all(xml_doc, ".//text"),
  function(text_node) {
    text_id <- xml2::xml_attr(text_node, "id")
    genre   <- xml2::xml_attr(text_node, "genre")
    tokens  <- xml2::xml_find_all(text_node, ".//token")
    data.frame(
      text_id = text_id,
      genre   = genre,
      pos     = xml2::xml_attr(tokens, "pos"),
      word    = xml2::xml_text(tokens),
      stringsAsFactors = FALSE
    )
  }
)

corpus_table
   text_id    genre pos        word
1     T001 academic  DT         The
2     T001 academic  NN      corpus
3     T001 academic VBZ    contains
4     T001 academic  JJ  linguistic
5     T001 academic NNS      tokens
6     T001 academic NNS Frequencies
7     T001 academic VBP        vary
8     T001 academic  IN          by
9     T001 academic  NN       genre
10    T002  fiction PRP         She
11    T002  fiction VBD        said
12    T002  fiction  RB        very
13    T002  fiction  RB      little

Saving XML

Code
# Build an XML document from scratch
new_xml <- xml2::xml_new_root("corpus",
  name = "OutputCorpus",
  year = "2026"
)

# Add a text element
text_node <- xml2::xml_add_child(new_xml, "text", id = "T001", genre = "academic")
sent_node <- xml2::xml_add_child(text_node, "sentence", n = "1")
xml2::xml_add_child(sent_node, "token", pos = "NN", "analysis")
xml2::xml_add_child(sent_node, "token", pos = "VBZ", "requires")
xml2::xml_add_child(sent_node, "token", pos = "NN", "data")

# Save to file
xml2::write_xml(new_xml,
  file     = here::here("tutorials/load/data", "output_corpus.xml"),
  encoding = "UTF-8"
)

You receive a TEI-encoded corpus as an XML file. You want to extract all <w> (word) elements that have a pos attribute of "VBZ" (third-person singular present verb). Which XPath expression is correct?

  a. //w[pos='VBZ']
  b. //w[@pos='VBZ']
  c. //w.pos='VBZ'
  d. //w[text()='VBZ']
Answer

b) //w[@pos='VBZ']

In XPath, attributes are referenced with the @ prefix inside square brackets. So //w[@pos='VBZ'] selects all <w> elements anywhere in the document (//) whose pos attribute (@pos) equals "VBZ". Option (a) is incorrect because without @, pos refers to a child element named pos, not an attribute. Option (c) is not valid XPath syntax. Option (d) selects <w> elements whose text content is "VBZ", which would match words that are literally the string “VBZ”, not words tagged as VBZ.
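A quick way to convince yourself is to test both candidate expressions with xml2 on a toy snippet (the <w> elements below are a simplified, hypothetical stand-in for TEI markup):

```r
library(xml2)

tei <- read_xml(
  '<s>
     <w pos="PRP">She</w>
     <w pos="VBZ">writes</w>
     <w pos="NNS">papers</w>
     <w pos="VBZ">reads</w>
   </s>'
)

# @pos matches the attribute: both VBZ-tagged words are found
hits <- xml_text(xml_find_all(tei, "//w[@pos='VBZ']"))
hits                                              # "writes" "reads"

# Without @, XPath looks for a child *element* named pos: no matches
length(xml_find_all(tei, "//w[pos='VBZ']"))       # 0
```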


Loading Built-In and Package Datasets

Section Overview

What you will learn: How to access datasets built into base R and into installed packages; how to find and browse available datasets; and how to use them as starting points for examples and practice

Why Built-In Datasets Matter

R ships with a large collection of built-in datasets that are immediately available without downloading anything. For linguists, they provide convenient practice data and well-documented benchmarks. Additionally, many linguistics-focused R packages include specialised datasets that are directly relevant to language research.

Base R Datasets

Code
# Built-in datasets ship in the 'datasets' package (attached by default);
# the 'base' package itself contains none
base_datasets <- data(package = "datasets")$results
# List datasets across all installed packages (first 20 shown)
all_datasets  <- data(package = .packages(all.available = TRUE))$results |>
  as.data.frame() |>
  dplyr::select(Package, Item, Title) |>
  head(20)

all_datasets
     Package                            Item
1  data.tree                            acme
2  data.tree                        mushroom
3      dplyr                band_instruments
4      dplyr               band_instruments2
5      dplyr                    band_members
6      dplyr                        starwars
7      dplyr                          storms
8    ggplot2                        diamonds
9    ggplot2                       economics
10   ggplot2                  economics_long
11   ggplot2                       faithfuld
12   ggplot2                     luv_colours
13   ggplot2                         midwest
14   ggplot2                             mpg
15   ggplot2                          msleep
16   ggplot2                    presidential
17   ggplot2                           seals
18   ggplot2                       txhousing
19  openxlsx     openxlsxFontSizeLookupTable
20  openxlsx openxlsxFontSizeLookupTableBold
                                                               Title
1                     Sample Data: A Simple Company with Departments
2                         Sample Data: Data Used by the ID3 Vignette
3                                                    Band membership
4                                                    Band membership
5                                                    Band membership
6                                                Starwars characters
7                                                  Storm tracks data
8                           Prices of over 50,000 round cut diamonds
9                                            US economic time series
10                                           US economic time series
11                          2d density estimate of Old Faithful data
12                                           'colors()' in Luv space
13                                              Midwest demographics
14 Fuel economy data from 1999 to 2008 for 38 popular models of cars
15      An updated and expanded version of the mammals sleep dataset
16                   Terms of 12 presidents from Eisenhower to Trump
17                                    Vector field of seal movements
18                                               Housing sales in TX
19                                           Font Size Lookup tables
20                                           Font Size Lookup tables
Code
# Load a built-in dataset by name (no file path needed)
data("iris")       # Fisher's iris measurements: classic ML benchmark
data("mtcars")     # Motor Trend car road tests: classic regression example
data("airquality") # New York air quality measurements

# Note: letters ("a" to "z") and LETTERS ("A" to "Z") are built-in
# constants, not datasets, so they are available directly without
# any data() call
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Linguistics-Relevant Package Datasets

Code
# A hand-built table of approximate English letter frequencies (in percent)
letter_freq <- data.frame(
  letter    = letters,
  frequency = c(8.2,1.5,2.8,4.3,12.7,2.2,2.0,6.1,7.0,0.15,
                0.77,4.0,2.4,6.7,7.5,1.9,0.10,6.0,6.3,9.1,
                2.8,0.98,2.4,0.15,2.0,0.074)
)
letter_freq |>
  dplyr::arrange(desc(frequency)) |>
  head(10)
   letter frequency
1       e      12.7
2       t       9.1
3       a       8.2
4       o       7.5
5       i       7.0
6       n       6.7
7       s       6.3
8       h       6.1
9       r       6.0
10      d       4.3
Code
# The 'languageR' package contains many linguistic datasets
# (install if needed: install.packages("languageR"))
# data("english", package = "languageR")    # English lexical decision data
# data("regularity", package = "languageR") # Morphological regularity
# data("ratings", package = "languageR")    # Word familiarity ratings

# The 'corpora' package
# data("BNCcomma", package = "corpora")     # BNC frequency data
Finding Datasets in a Package

Code
# List all datasets in a specific package
data(package = "datasets")
data(package = "languageR")

# Get help on a dataset
?iris
help("iris")

# See dataset dimensions without loading fully
nrow(iris); ncol(iris); names(iris)

Loading Data from a Package Without data()

Many packages make their data available via :: without needing data():

Code
# Access package data directly with :: (the package must be installed
# but need not be attached with library()), e.g.:
# dplyr::storms     # storm-track data from the dplyr package
# ggplot2::mpg      # fuel-economy data from the ggplot2 package

# A manually constructed word-frequency table for plotting practice
freq_df <- data.frame(
  word      = c("the", "of", "and", "to", "a", "in", "that", "is", "was", "he"),
  frequency = c(69971, 36412, 28853, 26154, 23195, 21337, 10594, 10099, 9835, 9543)
)

library(ggplot2)
ggplot(freq_df, aes(x = reorder(word, frequency), y = frequency)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_bw() +
  labs(title = "Top 10 most frequent English words (BNC estimates)",
       x = "Word", y = "Frequency (per million words)")

You want to practice loading data without downloading any files. Which of the following commands correctly loads a built-in R dataset for immediate use?

  a. read.csv("iris") — reads the iris dataset from a CSV file in the working directory
  b. data("iris") — loads the iris dataset into the global environment from the datasets package
  c. load("iris.rda") — loads an RDA file called iris.rda from the working directory
  d. readRDS("iris") — loads an RDS object named “iris” from the working directory
Answer

b) data("iris") — loads the iris dataset into the global environment from the datasets package

The data() function loads built-in datasets from R packages into the current environment. No file path is needed — R looks up the dataset in the package’s internal data store. Options (a), (c), and (d) all assume the data exists as a file on disk, which it does not for built-in datasets. After running data("iris"), the object iris is available in your environment exactly as if you had loaded it from a file.


Loading and Saving Unstructured Text Data

Section Overview

What you will learn: How to load single plain-text files into R as word vectors or line vectors; how to load an entire directory of text files into a named list; how to read content from Word (.docx) documents; and how to save text data back to disk

Single Text Files

Corpus linguists routinely work with raw text stored in plain-text (.txt) files. Two base R functions, scan() and readLines(), plus readr::read_file(), cover the common cases, and each produces a different output structure:

Functions for loading plain-text files
Function Returns Best for
scan(what = "char") Character vector of individual words Token-level analysis, word counts
readLines() Character vector of lines Sentence/line-level analysis, concordancing
readr::read_file() Single character string Full-text manipulation, regex over entire document
Code
# scan(): reads tokens (whitespace-separated), returns a character vector
testtxt_words <- scan(
  here::here("tutorials/load/data", "english.txt"),
  what      = "char",
  quiet     = TRUE    # suppress "Read N items" message
)

cat("Total tokens:", length(testtxt_words), "\n")
Total tokens: 21 
Code
cat("First 20 tokens:\n")
First 20 tokens:
Code
head(testtxt_words, 20)
 [1] "Linguistics" "is"          "the"         "scientific"  "study"      
 [6] "of"          "language"    "and"         "it"          "involves"   
[11] "the"         "analysis"    "of"          "language"    "form,"      
[16] "language"    "meaning,"    "and"         "language"    "in"         
Code
# readLines(): reads complete lines, returns a character vector
testtxt_lines <- readLines(
  con      = here::here("tutorials/load/data", "english.txt"),
  encoding = "UTF-8",
  warn     = FALSE   # suppress warning about non-terminated final line
)

cat("Total lines:", length(testtxt_lines), "\n")
Total lines: 1 
Code
cat("First 5 lines:\n")
First 5 lines:
Code
head(testtxt_lines, 5)
[1] "Linguistics is the scientific study of language and it involves the analysis of language form, language meaning, and language in context. "
Code
# readr::read_file(): loads the entire file as one string
testtxt_full <- readr::read_file(
  here::here("tutorials/load/data", "english.txt")
)

cat("Character count:", nchar(testtxt_full), "\n")
# Apply regex to the full text
# e.g. extract all sentences ending in a question mark
questions <- stringr::str_extract_all(testtxt_full, "[A-Z][^.!?]*\\?")[[1]]
Encoding and Non-ASCII Characters

Always specify encoding = "UTF-8" when reading files that may contain non-ASCII characters (accented letters, IPA symbols, non-Latin scripts). If readLines() throws a warning about invalid multibyte strings, the file encoding may be Latin-1 or Windows-1252. Use readLines(con = f, encoding = "latin1") or convert the file first with iconv().

# Check and convert encoding
raw_text <- readLines(f, encoding = "latin1")
utf_text  <- iconv(raw_text, from = "latin1", to = "UTF-8")
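If you do not know a file's encoding in advance, readr::guess_encoding() inspects a sample of the raw bytes and reports likely encodings with confidence scores, and base R's validUTF8() checks whether strings are valid UTF-8. A small self-contained sketch (using a temporary file rather than a tutorial path):

```r
# Write a small UTF-8 test file containing accented characters
tmp <- tempfile(fileext = ".txt")
writeLines(enc2utf8("Universit\u00e9 de Montr\u00e9al"), tmp, useBytes = TRUE)

# Candidate encodings with confidence scores (returns a tibble)
readr::guess_encoding(tmp)

# Check that what we read back is valid UTF-8
txt <- readLines(tmp, warn = FALSE)
all(validUTF8(txt))   # TRUE
```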

Saving Single Text Files

Code
# writeLines(): write a character vector (one element per line)
writeLines(
  text = testtxt_lines,
  con  = here::here("tutorials/load/data", "english_out.txt"),
  useBytes = FALSE
)

# write_file(): write a single character string
readr::write_file(
  x    = testtxt_full,
  file = here::here("tutorials/load/data", "english_out2.txt")
)

Loading Multiple Text Files

When working with corpora, you will often need to load many text files at once and store them in a named list — one element per file. The recommended approach uses list.files() to discover files and purrr::map() or sapply() to load them:

Code
# Step 1: get all file paths (full.names = TRUE gives absolute paths)
fls <- list.files(
  path       = here::here("tutorials/load/data", "testcorpus"),
  pattern    = "\\.txt$",     # only .txt files (regex)
  full.names = TRUE
)

cat("Files found:", length(fls), "\n")
Files found: 7 
Code
cat("File names:\n")
File names:
Code
basename(fls)
[1] "linguistics01.txt" "linguistics02.txt" "linguistics03.txt"
[4] "linguistics04.txt" "linguistics05.txt" "linguistics06.txt"
[7] "linguistics07.txt"
Code
# Step 2: load each file as a collapsed string

# Helper: read one file safely, converting encoding to UTF-8
read_txt_safe <- function(f) {
  # Try UTF-8 first; fall back to Latin-1 if the file is not valid UTF-8
  txt <- tryCatch(
    readLines(f, encoding = "UTF-8", warn = FALSE),
    error = function(e) readLines(f, encoding = "latin1", warn = FALSE)
  )
  # Convert any remaining non-UTF-8 bytes to UTF-8
  txt <- iconv(txt, from = "", to = "UTF-8", sub = "byte")
  paste(txt, collapse = " ")
}

# Method A: sapply (base R)
txts_sapply <- sapply(fls, read_txt_safe)

# Method B: purrr::map_chr (tidyverse)
txts_purrr <- purrr::map_chr(fls, read_txt_safe)

# Method C: readr::read_file with explicit locale
txts_readr <- purrr::map_chr(
  fls,
  ~ readr::read_file(.x, locale = readr::locale(encoding = "UTF-8"))
)

# Simplify names to file stems
names(txts_purrr) <- tools::file_path_sans_ext(basename(fls))

cat("Texts loaded:", length(txts_purrr), "\n")
Texts loaded: 7 
Code
cat("Character counts per text:\n")
Character counts per text:
Code
print(nchar(txts_purrr))
linguistics01 linguistics02 linguistics03 linguistics04 linguistics05 
          946           523           751           673           898 
linguistics06 linguistics07 
         1172           496 
Code
# Optional: check each file for encoding problems
cat("\nEncoding check per file:\n")

Encoding check per file:
Code
for (f in fls) {
  raw   <- readLines(f, warn = FALSE)
  valid <- all(validUTF8(raw))   # TRUE if every line is valid UTF-8
  cat(sprintf("  %-25s %s\n", basename(f),
              ifelse(valid, "OK", "encoding issue detected")))
}
  linguistics01.txt         OK
  linguistics02.txt         OK
  linguistics03.txt         OK
  linguistics04.txt         OK
  linguistics05.txt         OK
  linguistics06.txt         OK
  linguistics07.txt         OK
Code
# Build a corpus data frame: one row per text
corpus_df <- data.frame(
  file      = tools::file_path_sans_ext(basename(fls)),
  text      = txts_purrr,
  n_tokens  = sapply(strsplit(txts_purrr, "\\s+"), length),
  n_chars   = nchar(txts_purrr),
  stringsAsFactors = FALSE,
  row.names = NULL
)

corpus_df
           file
1 linguistics01
2 linguistics02
3 linguistics03
4 linguistics04
5 linguistics05
6 linguistics06
7 linguistics07
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  text
1                                                                                                                                                                                                                                   Linguistics is the scientific study of language. It involves analysing language form language meaning and language in context. The earliest activities in the documentation and description of language have been attributed to the th-century-BC Indian grammarian Pa?ini who wrote a formal description of the Sanskrit language in his A??adhyayi.  Linguists traditionally analyse human language by observing an interplay between sound and meaning. Phonetics is the study of speech and non-speech sounds and delves into their acoustic and articulatory properties. The study of language meaning on the other hand deals with how languages encode relations between entities properties and other aspects of the world to convey process and assign meaning as well as manage and resolve ambiguity. While the study of semantics typically concerns itself with truth conditions pragmatics deals with how situational context influences the production of meaning. 
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.
3                                                                                                                                                                                                                                                                                                                                                                                                                                      In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his formulation of structural linguistics. According to him, parole is the specific utterance of speech, whereas langue refers to an abstract phenomenon that theoretically defines the principles and system of rules that govern a language. This distinction resembles the one made by Noam Chomsky between competence and performance in his theory of transformative or generative grammar. According to Chomsky, competence is an individual's innate capacity and potential for language (like in Saussure's langue), while performance is the specific way in which it is used by individuals, groups, and communities (i.e., parole, in Saussurean terms). 
4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The study of parole (which manifests through cultural discourses and dialects) is the domain of sociolinguistics, the sub-discipline that comprises the study of a complex system of linguistic facets within a certain speech community (governed by its own set of grammatical rules and laws). Discourse analysis further examines the structure of texts and conversations emerging out of a speech community's usage of language. This is done through the collection of linguistic data, or through the formal discipline of corpus linguistics, which takes naturally occurring texts and studies the variation of grammatical and other features based on such corpora (or corpus data). 
5                                                                                                                                                                                                                                                                                   Stylistics also involves the study of written, signed, or spoken discourse through varying speech communities, genres, and editorial or narrative formats in the mass media. In the 1960s, Jacques Derrida, for instance, further distinguished between speech and writing, by proposing that written language be studied as a linguistic medium of communication in itself. Palaeography is therefore the discipline that studies the evolution of written scripts (as signs and symbols) in language. The formal study of language also led to the growth of fields like psycholinguistics, which explores the representation and function of language in the mind; neurolinguistics, which studies language processing in the brain; biolinguistics, which studies the biology and evolution of language; and language acquisition, which investigates how children and adults acquire the knowledge of one or more languages. 
6 Linguistics also deals with the social, cultural, historical and political factors that influence language, through which linguistic and language-based context is often determined. Research on language through the sub-branches of historical and evolutionary linguistics also focus on how languages change and grow, particularly over an extended period of time.  Language documentation combines anthropological inquiry (into the history and culture of language) with linguistic inquiry, in order to describe languages and their grammars. Lexicography involves the documentation of words that form a vocabulary. Such a documentation of a linguistic vocabulary from a particular language is usually compiled in a dictionary. Computational linguistics is concerned with the statistical or rule-based modeling of natural language from a computational perspective. Specific knowledge of language is applied by speakers during the act of translation and interpretation, as well as in language education <96> the teaching of a second or foreign language. Policy makers work with governments to implement new plans in education and teaching which are based on linguistic research. 
7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Related areas of study also includes the disciplines of semiotics (the study of direct and indirect language through signs and symbols), literary criticism (the historical and ideological analysis of literature, cinema, art, or published material), translation (the conversion and documentation of meaning in written/spoken text from one language or dialect onto another), and speech-language pathology (a corrective method to cure phonetic disabilities and dis-functions at the cognitive level).
  n_tokens n_chars
1      138     946
2       81     523
3      111     751
4      101     673
5      130     898
6      165    1172
7       68     496

Saving Multiple Text Files

Code
# Define output paths — one per text
out_paths <- file.path(
  here::here("tutorials/load/data", "testcorpus_out"),
  paste0(names(txts_purrr), ".txt")
)

# Create the output directory if it doesn't exist
dir.create(here::here("tutorials/load/data", "testcorpus_out"),
           showWarnings = FALSE, recursive = TRUE)

# Save each text
purrr::walk2(
  txts_purrr,
  out_paths,
  ~ writeLines(.x, con = .y)
)

cat("Saved", length(out_paths), "files.\n")

Loading Word Documents

Interview transcripts, annotated texts, and survey instruments are often stored as Microsoft Word .docx files. The officer package reads them: officer::read_docx() loads the document, and officer::docx_summary() then returns a structured data frame in which each paragraph, heading, and table cell is a separate row.

Code
# Read the Word document
doc_object <- officer::read_docx(here::here("tutorials/load/data", "mydoc.docx"))

# Extract the content summary (structured data frame)
content <- officer::docx_summary(doc_object)

# Inspect the structure
str(content)
'data.frame':   38 obs. of  11 variables:
 $ doc_index      : int  1 2 3 4 5 6 8 9 10 11 ...
 $ content_type   : chr  "paragraph" "paragraph" "paragraph" "paragraph" ...
 $ style_name     : chr  NA NA NA NA ...
 $ text           : chr  "HYPERLINK \"https://en.wikipedia.org/wiki/Main_Page\"" "Language technology" "From Wikipedia, the free encyclopedia" "Language technology, often called human language technology (HLT), studies methods of how computer programs or "| __truncated__ ...
 $ table_index    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ row_id         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ cell_id        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ is_header      : logi  NA NA NA NA NA NA ...
 $ row_span       : int  NA NA NA NA NA NA NA NA NA NA ...
 $ col_span       : chr  NA NA NA NA ...
 $ table_stylename: chr  NA NA NA NA ...
Code
head(content, 15)
   doc_index content_type style_name
1          1    paragraph       <NA>
2          2    paragraph       <NA>
3          3    paragraph       <NA>
4          4    paragraph       <NA>
5          5    paragraph       <NA>
6          6    paragraph       <NA>
7          8    paragraph       <NA>
8          9    paragraph       <NA>
9         10    paragraph       <NA>
10        11    paragraph       <NA>
11        12    paragraph       <NA>
12        13    paragraph       <NA>
13        14    paragraph       <NA>
14        15    paragraph       <NA>
15        16    paragraph       <NA>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         text
1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         HYPERLINK "https://en.wikipedia.org/wiki/Main_Page"
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Language technology
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       From Wikipedia, the free encyclopedia
4  Language technology, often called human language technology (HLT), studies methods of how computer programs or electronic devices can analyze, produce, modify or respond to human texts and speech.[1] Working with language technology often requires broad knowledge not only about linguistics but also about computer science. It consists of natural language processing (NLP) and computational linguistics (CL) on the one hand, many application oriented aspects of these, and more low-level aspects such as encoding and speech technology on the other hand. 
5                                                                                                                   Note that these elementary aspects are normally not considered to be within the scope of related terms such as natural language processing and (applied) computational linguistics, which are otherwise near-synonyms. As an example, for many of the world's lesser known languages, the foundation of language technology is providing communities with fonts and keyboard setups so their languages can be written on computers or mobile devices.[2] 
6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  References
7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Uszkoreit, Hans. "DFKI-LT - What is Language Technology". Retrieved 16 November 2018. 
8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   "SIL Writing Systems Technology". sil.org. 11 December 2018. Retrieved 9 December 2019. 
9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              External links
10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Johns Hopkins University Human Language Technology Center of Excellence
11                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Carnegie Mellon University Language Technologies Institute
12                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Institute for Applied Linguistics (IULA) at Universitat Pompeu Fabra. Barcelona, Spain
13                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          German Research Centre for Artificial Intelligence (DFKI) Language Technology Lab
14                                                                                                                                                                                                                                                                                                                                                                                                                                                                       CLT: Centre for Language Technology in Gothenburg, Sweden Archived 2017-04-10 at the Wayback Machine
15                                                                                                                                                                                                                                                                                                                                                                                                                                                       The Center for Speech and Language Technologies (CSaLT) at the Lahore University [sic] of Management Sciences (LUMS)
   table_index row_id cell_id is_header row_span col_span table_stylename
1           NA     NA      NA        NA       NA     <NA>            <NA>
2           NA     NA      NA        NA       NA     <NA>            <NA>
3           NA     NA      NA        NA       NA     <NA>            <NA>
4           NA     NA      NA        NA       NA     <NA>            <NA>
5           NA     NA      NA        NA       NA     <NA>            <NA>
6           NA     NA      NA        NA       NA     <NA>            <NA>
7           NA     NA      NA        NA       NA     <NA>            <NA>
8           NA     NA      NA        NA       NA     <NA>            <NA>
9           NA     NA      NA        NA       NA     <NA>            <NA>
10          NA     NA      NA        NA       NA     <NA>            <NA>
11          NA     NA      NA        NA       NA     <NA>            <NA>
12          NA     NA      NA        NA       NA     <NA>            <NA>
13          NA     NA      NA        NA       NA     <NA>            <NA>
14          NA     NA      NA        NA       NA     <NA>            <NA>
15          NA     NA      NA        NA       NA     <NA>            <NA>
Code
# Filter to paragraph content only (exclude table cells, headers, etc.)
paragraphs <- content |>
  dplyr::filter(content_type == "paragraph",
                !is.na(text),
                nchar(trimws(text)) > 0) |>
  dplyr::select(style_name, text)

head(paragraphs, 10)
   style_name
1        <NA>
2        <NA>
3        <NA>
4        <NA>
5        <NA>
6        <NA>
7        <NA>
8        <NA>
9        <NA>
10       <NA>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         text
1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         HYPERLINK "https://en.wikipedia.org/wiki/Main_Page"
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Language technology
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       From Wikipedia, the free encyclopedia
4  Language technology, often called human language technology (HLT), studies methods of how computer programs or electronic devices can analyze, produce, modify or respond to human texts and speech.[1] Working with language technology often requires broad knowledge not only about linguistics but also about computer science. It consists of natural language processing (NLP) and computational linguistics (CL) on the one hand, many application oriented aspects of these, and more low-level aspects such as encoding and speech technology on the other hand. 
5                                                                                                                   Note that these elementary aspects are normally not considered to be within the scope of related terms such as natural language processing and (applied) computational linguistics, which are otherwise near-synonyms. As an example, for many of the world's lesser known languages, the foundation of language technology is providing communities with fonts and keyboard setups so their languages can be written on computers or mobile devices.[2] 
6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  References
7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Uszkoreit, Hans. "DFKI-LT - What is Language Technology". Retrieved 16 November 2018. 
8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   "SIL Writing Systems Technology". sil.org. 11 December 2018. Retrieved 9 December 2019. 
9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              External links
10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Johns Hopkins University Human Language Technology Center of Excellence
Code
# Extract only body text (style "Normal" in most templates)
body_text <- paragraphs |>
  dplyr::filter(style_name == "Normal") |>
  dplyr::pull(text) |>
  paste(collapse = " ")

cat("Body text (first 200 chars):\n", substr(body_text, 1, 200), "\n")
Body text (first 200 chars):
  
The output is empty because the paragraphs in this particular document carry no explicit style name (style_name is NA rather than "Normal"), so the filter matches nothing. Inspect the style names in your own document, e.g. with table(paragraphs$style_name, useNA = "ifany"), before filtering on them.
Extracting Headings from Word Documents

Headings are stored with style names like "heading 1", "heading 2", etc. Use them to reconstruct the document structure:

Code
headings <- content |>
  dplyr::filter(grepl("^heading", style_name, ignore.case = TRUE)) |>
  dplyr::select(style_name, text)

This is useful for segmenting interview transcripts by topic or speaker turn.
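As a toy illustration of such segmentation (hypothetical data, not the Wikipedia document above), the cumulative count of heading rows can serve as a section index:

```r
# Hypothetical sketch: group paragraphs under the heading that precedes them.
# content_toy mimics the style_name/text columns of officer::docx_summary().
content_toy <- data.frame(
  style_name = c("heading 1", NA, NA, "heading 1", NA),
  text       = c("Intro", "First para.", "Second para.",
                 "Methods", "Third para.")
)

segments <- content_toy |>
  dplyr::mutate(
    is_heading = grepl("^heading", style_name, ignore.case = TRUE),
    section    = cumsum(is_heading)   # running count of headings = section id
  ) |>
  dplyr::filter(!is_heading) |>
  dplyr::group_by(section) |>
  dplyr::summarise(body = paste(text, collapse = " "), .groups = "drop")

segments   # one row per section, its paragraphs collapsed into one string
```

Note that grepl() returns FALSE for NA style names, so unstyled paragraphs are never mistaken for headings.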

You want to load 50 interview transcripts stored as .txt files in a folder called transcripts/. You need the result as a named list where each element is the full text of one interview as a single character string, and each element’s name is the file name without the .txt extension. Which code achieves this?

a) txts <- readLines(here::here("transcripts"))
b) fls  <- list.files(here::here("transcripts"), pattern="\\.txt$", full.names=TRUE)
   txts <- purrr::map_chr(fls, readr::read_file)
   names(txts) <- tools::file_path_sans_ext(basename(fls))
c) txts <- read.csv(here::here("transcripts"), header = FALSE)
d) txts <- scan(here::here("transcripts"), what = "char", quiet = TRUE)
Answer

b) This is the correct approach. list.files() with full.names = TRUE returns the complete path to each file. purrr::map_chr() applies readr::read_file() to each path, returning a named character vector of full texts. tools::file_path_sans_ext(basename(fls)) strips the directory path and .txt extension to produce clean file names as the element names. Options (a), (c), and (d) are all incorrect: readLines() takes a single file path, not a directory; read.csv() expects tabular data; and scan() also takes a single file path and would return individual words, not complete texts.


Simulating Data

Section Overview

What you will learn: Why data simulation is a core research skill; how set.seed() ensures reproducibility; how to sample from the most important statistical distributions; how to build realistic simulated datasets for corpus studies, psycholinguistic experiments, and surveys; and how to generate synthetic textual data

Why Simulate?

Data simulation is not just a workaround for when you lack real data. It is a core methodological tool for:

  • Checking statistical intuition: simulate data you understand perfectly and verify that your model recovers the parameters you put in
  • Reproducible examples: share a self-contained, runnable example without distributing confidential or proprietary data
  • Teaching and demonstration: illustrate statistical concepts with controlled examples
  • Power analysis: estimate the sample size you need by simulating many datasets and measuring how often your model detects a true effect
  • Stress-testing pipelines: check whether your analysis code handles edge cases (missing data, unbalanced designs, outliers) before real data arrives
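To make the power-analysis idea concrete, here is a minimal sketch (the effect size, group size, and number of simulations are illustrative choices, not values from this tutorial):

```r
# How often does a two-sample t-test detect a true difference of 0.5 SD
# with 30 participants per group? (all values illustrative)
set.seed(2026)
n_sims <- 1000
p_values <- replicate(n_sims, {
  control   <- rnorm(30, mean = 0,   sd = 1)
  treatment <- rnorm(30, mean = 0.5, sd = 1)
  t.test(control, treatment)$p.value
})

power <- mean(p_values < 0.05)   # proportion of simulations with p < .05
power                            # roughly 0.5 here, well below the usual 0.8 target
```

The analytic check power.t.test(n = 30, delta = 0.5, sd = 1) gives a very similar figure; simulation becomes indispensable once the design (mixed models, unbalanced groups) has no closed-form power formula.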

Reproducibility and set.seed()

R’s random number generation is pseudo-random: starting from a fixed seed value, the same sequence of “random” numbers is generated every time. Setting the seed makes your simulations perfectly reproducible.

Code
# Without a seed: different results every time
sample(1:100, 5)
[1] 61 11 89 75 93
Code
sample(1:100, 5)
[1]  44  66  31 100  17
Code
# With a seed: identical results every time
set.seed(2026)
sample(1:100, 5)
[1] 93 97 38 45 91
Code
set.seed(2026)   # reset to the same seed
sample(1:100, 5) # same result as above
[1] 93 97 38 45 91
Rules for set.seed() in Research
  1. Always set a seed at the start of any script that uses random number generation
  2. Set the seed once, at the top of the script — not before every individual random call: repeatedly resetting the seed restarts the same random stream, which masks genuine stochasticity and can make results look more stable than they are
  3. Document the seed in your methods section so others can reproduce your exact simulation
  4. Test with multiple seeds to confirm your findings are not artefacts of one particular seed value
  5. Use a memorable but arbitrary seed — the year of the study, a postal code, or a fixed arbitrary number. Avoid choosing the seed after seeing which value gives you “nice” results — that is a form of researcher degrees of freedom.
Code
# Recommended practice: one seed at the top of the script
set.seed(2026)

# From this point on, all random calls are reproducible
x <- rnorm(100)
y <- sample(letters, 10)
z <- rbinom(50, size = 1, prob = 0.3)

# Show the first few results
head(x, 5); y; head(z, 5)
[1]  0.52059 -1.07969  0.13924 -0.08475 -0.66664
 [1] "x" "c" "k" "v" "p" "f" "u" "a" "d" "m"
[1] 1 0 0 0 0

Simulating from Statistical Distributions

R provides a family of functions for every major distribution. The naming convention is consistent:

R distribution function naming convention
Prefix  Meaning                       Example
r       Random samples                rnorm(n, mean, sd)
d       Density (PDF/PMF)             dnorm(x, mean, sd)
p       Cumulative probability (CDF)  pnorm(q, mean, sd)
q       Quantile (inverse CDF)        qnorm(p, mean, sd)
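A quick sketch of how the four prefixes fit together, using the standard normal distribution:

```r
set.seed(2026)
rnorm(5)            # r: five random draws from N(0, 1)
dnorm(0)            # d: density at x = 0, 1/sqrt(2*pi), about 0.399
pnorm(1.96)         # p: P(X <= 1.96), about 0.975
qnorm(0.975)        # q: the value with 97.5% of the mass below it, about 1.96

# p and q are inverses of each other
all.equal(pnorm(qnorm(0.975)), 0.975)   # TRUE
```

The same pattern holds for every distribution: rbinom/dbinom/pbinom/qbinom, rpois/dpois/ppois/qpois, and so on.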

Normal Distribution

The normal (Gaussian) distribution is appropriate for continuous data such as reaction times (log-transformed), pitch values, vowel formants, and word frequencies (log-transformed).

Code
set.seed(2026)

# Simulate log-transformed reaction times
# Mean 6.4 on the log scale (exp(6.4) ≈ 600 ms); SD = 0.3 on the log scale
log_rt <- rnorm(n = 500, mean = 6.4, sd = 0.3)
rt_ms  <- exp(log_rt)   # back-transform to milliseconds

p_norm <- data.frame(log_rt = log_rt, rt_ms = rt_ms) |>
  ggplot(aes(x = rt_ms)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "firebrick", linewidth = 1) +
  theme_bw() +
  labs(title = "Simulated reaction times (log-normal distribution)",
       subtitle = "n = 500 | Mean ≈ 600 ms",
       x = "Reaction time (ms)", y = "Density")

p_norm

Code
cat("Mean RT:", round(mean(rt_ms), 1), "ms\n")
Mean RT: 637.9 ms
Code
cat("SD RT:",   round(sd(rt_ms), 1), "ms\n")
SD RT: 198.2 ms
Code
cat("Range:",   round(range(rt_ms), 1), "ms\n")
Range: 280.3 1545 ms

Binomial Distribution

The binomial distribution models binary outcomes: correct/incorrect, yes/no, target form/alternative form. The key parameters are size (number of trials) and prob (probability of success).

Code
set.seed(2026)

# Simulate accuracy in a lexical decision task
# 100 participants, 80 trials each, average accuracy 85%
n_participants <- 100
n_trials       <- 80
accuracy_prob  <- 0.85

# Each participant's number of correct responses
n_correct <- rbinom(n = n_participants, size = n_trials, prob = accuracy_prob)
accuracy  <- n_correct / n_trials

p_binom <- data.frame(accuracy = accuracy) |>
  ggplot(aes(x = accuracy)) +
  geom_histogram(binwidth = 0.02, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = mean(accuracy), color = "firebrick",
             linetype = "dashed", linewidth = 1) +
  theme_bw() +
  labs(title = "Simulated accuracy in a lexical decision task",
       subtitle = sprintf("n = %d participants | %d trials | P(correct) = %.2f | Mean accuracy = %.3f",
                          n_participants, n_trials, accuracy_prob, mean(accuracy)),
       x = "Accuracy", y = "Count")

p_binom

Code
# Simulate binary outcome per trial (all participants combined)
all_responses <- rbinom(n = n_participants * n_trials, size = 1, prob = accuracy_prob)
cat("Proportion correct:", round(mean(all_responses), 3), "\n")
Proportion correct: 0.846 

Poisson Distribution

The Poisson distribution models count data where events occur independently at a constant average rate (the parameter lambda). In linguistics: number of errors per utterance, number of occurrences of a specific word per document, number of disfluencies per minute.

Code
set.seed(2026)

# Simulate number of self-corrections per minute for 200 speakers
# Lambda = 1.8 corrections per minute
n_speakers <- 200
lambda_corrections <- 1.8

corrections <- rpois(n = n_speakers, lambda = lambda_corrections)

p_pois <- data.frame(corrections = corrections) |>
  ggplot(aes(x = corrections)) +
  geom_bar(fill = "steelblue", color = "white", alpha = 0.8) +
  theme_bw() +
  labs(title = "Simulated self-corrections per minute (Poisson distribution)",
       subtitle = sprintf("n = %d speakers | λ = %.1f | Mean = %.2f | Variance = %.2f",
                          n_speakers, lambda_corrections,
                          mean(corrections), var(corrections)),
       x = "Self-corrections per minute", y = "Count")

p_pois

Uniform Distribution

The uniform distribution generates values equally likely across an interval. Useful for simulating ages, dates, positions in a text, or random stimulus presentation times.

Code
set.seed(2026)

# Simulate participant ages between 18 and 65
ages <- runif(n = 200, min = 18, max = 65)
# Round to whole years
ages_int <- round(ages)

cat("Age distribution:\n")
Age distribution:
Code
cat("  Range:", range(ages_int), "\n")
  Range: 18 65 
Code
cat("  Mean:", round(mean(ages_int), 1), "\n")
  Mean: 39.4 
Code
# Weighted category sampling with sample() (omit prob for uniform sampling)
proficiency_levels <- sample(
  x       = c("Beginner", "Intermediate", "Advanced"),
  size    = 200,
  replace = TRUE,
  prob    = c(0.25, 0.45, 0.30)  # weighted probabilities
)
table(proficiency_levels)
proficiency_levels
    Advanced     Beginner Intermediate 
          73           48           79 

Negative Binomial Distribution

The negative binomial distribution extends the Poisson to handle overdispersion — cases where the variance exceeds the mean, which is the norm for linguistic count data (word frequencies, error counts across speakers).

Code
set.seed(2026)

# Compare Poisson vs. Negative Binomial with same mean but different variance
mean_count <- 3.0
size_param  <- 0.8   # smaller = more overdispersion

pois_counts <- rpois(n = 500, lambda = mean_count)
nb_counts   <- rnbinom(n = 500, mu = mean_count, size = size_param)

dist_df <- data.frame(
  count = c(pois_counts, nb_counts),
  dist  = rep(c("Poisson (λ=3)", "Neg. Binomial (μ=3, size=0.8)"), each = 500)
)

cat("Poisson    — Mean:", round(mean(pois_counts), 2),
    "| Variance:", round(var(pois_counts), 2), "\n")
Poisson    — Mean: 2.9 | Variance: 2.73 
Code
cat("Neg. Binom — Mean:", round(mean(nb_counts), 2),
    "| Variance:", round(var(nb_counts), 2), "\n")
Neg. Binom — Mean: 2.79 | Variance: 13.41 
Code
ggplot(dist_df, aes(x = count, fill = dist)) +
  geom_bar(position = "dodge", alpha = 0.8, color = "white") +
  scale_fill_manual(values = c("Poisson (λ=3)" = "steelblue",
                               "Neg. Binomial (μ=3, size=0.8)" = "firebrick")) +
  theme_bw() +
  theme(legend.position = "top") +
  labs(title = "Poisson vs. Negative Binomial count data",
       subtitle = "Same mean (3.0); NB has much larger variance",
       x = "Count", y = "Frequency", fill = "")

A researcher simulates 1,000 binary (0/1) responses using rbinom(n = 1000, size = 1, prob = 0.6). She then changes the seed and re-runs. Which statement is correct?

  a. The proportion of 1s will be exactly 0.6 both times, since prob = 0.6 is fixed
  b. The proportion of 1s will vary slightly between runs because each call generates a new random sample; the expected proportion is 0.6 but individual realisations deviate from it
  c. The results will be identical because the distribution parameters are the same
  d. The function will produce an error because size = 1 is invalid for rbinom()
Answer

b) The proportion of 1s will vary slightly between runs because each call generates a new random sample; the expected proportion is 0.6 but individual realisations deviate from it

prob = 0.6 is the probability of a 1, not a guarantee that exactly 60% of draws will be 1. Each call to rbinom() generates a new independent random sample. With n = 1,000, the law of large numbers ensures the proportion will be close to 0.6 (typically within a few percent), but it will not be identical across runs with different seeds. If you need identical results across runs, set the same seed before each call. Option (a) confuses probability with frequency; (c) confuses distributional parameters with determinism; (d) is incorrect — size = 1 is perfectly valid and means each draw is a single Bernoulli trial (0 or 1).
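A quick check makes this concrete (seeds chosen arbitrarily):

```r
# Same distribution, different seeds: same expectation, different realisations
set.seed(1)
p1 <- mean(rbinom(n = 1000, size = 1, prob = 0.6))
set.seed(2)
p2 <- mean(rbinom(n = 1000, size = 1, prob = 0.6))
c(p1, p2)   # both close to 0.6, but each run is a fresh random sample
```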

Simulating Realistic Linguistic Datasets

Simulating a Corpus Frequency Dataset

Corpus frequency data follows a Zipfian distribution: a small number of words are very frequent, and the vast majority are extremely rare. We can simulate this using a power-law sample:

Code
set.seed(2026)

# Simulate a vocabulary of 500 word types with Zipfian frequencies
n_types   <- 500
# Zipf's law: frequency ∝ 1/rank^alpha
alpha     <- 1.0     # Zipf exponent (empirically ~1 for English)
ranks     <- 1:n_types
freq_probs <- (1 / ranks^alpha) / sum(1 / ranks^alpha)  # normalise to sum to 1

# Total corpus size: 50,000 tokens
n_tokens  <- 50000
word_freqs <- round(freq_probs * n_tokens)
word_freqs[word_freqs == 0] <- 1   # every type has at least 1 token

# Create a data frame
words <- paste0("word_", stringr::str_pad(ranks, 3, pad = "0"))
corpus_freq_df <- data.frame(
  rank      = ranks,
  word      = words,
  frequency = word_freqs,
  log_rank  = log(ranks),
  log_freq  = log(word_freqs)
)

# Zipf plot
p_zipf <- ggplot(corpus_freq_df, aes(x = log_rank, y = log_freq)) +
  geom_point(alpha = 0.3, size = 1, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "firebrick", linewidth = 1) +
  theme_bw() +
  labs(title = "Zipf plot: simulated corpus word frequencies",
       subtitle = sprintf("Vocabulary: %d types | Corpus: %d tokens | α = %.1f",
                          n_types, sum(word_freqs), alpha),
       x = "log(rank)", y = "log(frequency)")

p_zipf

Code
# Most and least frequent words
head(corpus_freq_df, 5)
  rank     word frequency log_rank log_freq
1    1 word_001      7361   0.0000    8.904
2    2 word_002      3680   0.6931    8.211
3    3 word_003      2454   1.0986    7.805
4    4 word_004      1840   1.3863    7.518
5    5 word_005      1472   1.6094    7.294
Code
tail(corpus_freq_df, 5)
    rank     word frequency log_rank log_freq
496  496 word_496        15    6.207    2.708
497  497 word_497        15    6.209    2.708
498  498 word_498        15    6.211    2.708
499  499 word_499        15    6.213    2.708
500  500 word_500        15    6.215    2.708
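As a sanity check, regressing log frequency on log rank should recover a slope close to -alpha. The sketch below rebuilds the same frequencies as above so it runs on its own:

```r
# Rebuild the Zipfian frequencies and estimate the exponent via regression
n_types    <- 500
alpha      <- 1.0
ranks      <- 1:n_types
freq_probs <- (1 / ranks^alpha) / sum(1 / ranks^alpha)
word_freqs <- pmax(round(freq_probs * 50000), 1)   # at least 1 token per type

zipf_fit <- lm(log(word_freqs) ~ log(ranks))
coef(zipf_fit)[["log(ranks)"]]   # close to -alpha, i.e. about -1
```

This is the same regression that geom_smooth(method = "lm") draws in the Zipf plot above.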

Simulating a Psycholinguistic Experiment

A realistic psycholinguistic dataset requires:

  1. Multiple participants (random effect)
  2. Multiple items per participant (random effect)
  3. Fixed effects of experimental conditions
  4. By-participant and by-item random variation in baseline and condition effects
  5. Continuous response (RT) or binary response (accuracy)
Code
set.seed(2026)

# Design parameters
n_participants <- 40
n_items        <- 30    # items per participant
n_obs          <- n_participants * n_items  # total observations

# Condition: Primed (1) vs. Unprimed (0)
# Note: combined with the row order of expand.grid() below (Participant varies
# fastest), this assignment makes condition depend on participant parity:
# odd-numbered participants are always Unprimed, even-numbered always Primed,
# i.e. a between-participant manipulation. For a within-participant design
# (each participant sees half the items primed), assign conditions per item
# instead, e.g. rep(conditions, each = n_participants).
conditions <- rep(c(0, 1), times = n_items / 2)

# Fixed effects (on log-RT scale)
intercept      <- 6.40   # grand mean log-RT (≈ 600 ms)
beta_priming   <- -0.08  # priming speeds RT by ~8% (negative = faster)
beta_frequency <- -0.05  # each unit of log-freq reduces log-RT

# Random effect SDs
sd_participant <- 0.15   # between-participant variability in baseline RT
sd_item        <- 0.10   # between-item variability in baseline RT
sd_residual    <- 0.20   # within-cell residual noise

# Sample random effects
participant_ids     <- paste0("P", stringr::str_pad(1:n_participants, 2, pad = "0"))
item_ids            <- paste0("I", stringr::str_pad(1:n_items, 2, pad = "0"))
re_participant      <- rnorm(n_participants, mean = 0, sd = sd_participant)
re_item             <- rnorm(n_items, mean = 0, sd = sd_item)

# Simulated word frequency for each item (log scale)
log_freq_item <- rnorm(n_items, mean = 4.0, sd = 1.5)

# Build the full dataset
sim_exp <- expand.grid(
  Participant = participant_ids,
  Item        = item_ids
) |>
  dplyr::mutate(
    Condition  = rep(conditions, times = n_participants),
    LogFreq    = log_freq_item[match(Item, item_ids)],
    RE_part    = re_participant[match(Participant, participant_ids)],
    RE_item    = re_item[match(Item, item_ids)],
    Epsilon    = rnorm(n_obs, 0, sd_residual),
    LogRT      = intercept + beta_priming * Condition +
                 beta_frequency * LogFreq +
                 RE_part + RE_item + Epsilon,
    RT         = exp(LogRT),
    Condition  = factor(Condition, levels = c(0, 1),
                         labels = c("Unprimed", "Primed"))
  ) |>
  dplyr::select(Participant, Item, Condition, LogFreq, RT, LogRT)

cat("Dataset dimensions:", nrow(sim_exp), "×", ncol(sim_exp), "\n")
Dataset dimensions: 1200 × 6 
Code
str(sim_exp)
'data.frame':   1200 obs. of  6 variables:
 $ Participant: Factor w/ 40 levels "P01","P02","P03",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Item       : Factor w/ 30 levels "I01","I02","I03",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Condition  : Factor w/ 2 levels "Unprimed","Primed": 1 2 1 2 1 2 1 2 1 2 ...
 $ LogFreq    : num  4.78 4.78 4.78 4.78 4.78 ...
 $ RT         : num  582 300 412 308 466 ...
 $ LogRT      : num  6.37 5.7 6.02 5.73 6.14 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int [1:2] 40 30
  .. ..- attr(*, "names")= chr [1:2] "Participant" "Item"
  ..$ dimnames:List of 2
  .. ..$ Participant: chr [1:40] "Participant=P01" "Participant=P02" "Participant=P03" "Participant=P04" ...
  .. ..$ Item       : chr [1:30] "Item=I01" "Item=I02" "Item=I03" "Item=I04" ...
Code
head(sim_exp, 10)
   Participant Item Condition LogFreq    RT LogRT
1          P01  I01  Unprimed   4.779 582.1 6.367
2          P02  I01    Primed   4.779 300.2 5.705
3          P03  I01  Unprimed   4.779 411.8 6.021
4          P04  I01    Primed   4.779 308.1 5.730
5          P05  I01  Unprimed   4.779 466.1 6.144
6          P06  I01    Primed   4.779 298.9 5.700
7          P07  I01  Unprimed   4.779 276.5 5.622
8          P08  I01    Primed   4.779 449.7 6.109
9          P09  I01  Unprimed   4.779 339.9 5.829
10         P10  I01    Primed   4.779 294.1 5.684
Code
# Quick sanity check: does the simulated data show the expected priming effect?
sim_exp |>
  dplyr::group_by(Condition) |>
  dplyr::summarise(
    Mean_RT    = round(mean(RT), 1),
    Median_RT  = round(median(RT), 1),
    SD_RT      = round(sd(RT), 1),
    n          = dplyr::n(),
    .groups = "drop"
  )
# A tibble: 2 × 5
  Condition Mean_RT Median_RT SD_RT     n
  <fct>       <dbl>     <dbl> <dbl> <int>
1 Unprimed     485.      468.  131.   600
2 Primed       466.      448.  131.   600
Code
ggplot(sim_exp, aes(x = Condition, y = RT, fill = Condition)) +
  geom_violin(alpha = 0.6, color = NA) +
  geom_boxplot(width = 0.2, fill = "white", outlier.alpha = 0.3) +
  scale_fill_manual(values = c("steelblue", "firebrick")) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = "Simulated reaction times by priming condition",
       subtitle = "Priming effect built in: β = -0.08 on log scale",
       x = "Condition", y = "Reaction time (ms)")
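Because the generative parameters are known, the simulation can double as a parameter-recovery check: fitting a mixed-effects model to the simulated data should return estimates close to the values built in above. The sketch below assumes that the lme4 package is installed and that `sim_exp` from the code above is in the workspace; the estimates will only approximate the true values.

```r
# Parameter recovery: fit a mixed model with by-participant and by-item
# random intercepts, then compare the fixed-effect estimates with the
# true values used in the simulation (intercept 6.40, priming -0.08,
# frequency -0.05)
library(lme4)

m <- lmer(LogRT ~ Condition + LogFreq + (1 | Participant) + (1 | Item),
          data = sim_exp)
fixef(m)
```

If an estimate lands far from its true value, the design or the random-effect structure is usually the culprit rather than the estimation itself.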

Simulating a Survey Dataset

Attitude surveys and proficiency assessments generate Likert-scale or ordinal data. Here we simulate a language attitude survey:

Code
set.seed(2026)

# 120 respondents, 5 Likert items (1 = Strongly Disagree, 5 = Strongly Agree)
# Two groups: L1 English vs. L1 Other
n_respondents  <- 120
n_l1_english   <- 60
group          <- c(rep("L1 English", n_l1_english),
                    rep("L1 Other",   n_respondents - n_l1_english))

# Item means differ by group (L1 English respondents rate English more positively)
item_means_eng   <- c(4.1, 3.8, 4.3, 3.5, 4.0)  # 5 items
item_means_other <- c(3.2, 3.0, 3.5, 2.8, 3.1)
item_sd          <- 0.8

# Generate continuous underlying scores and discretise to 1–5
sim_likert <- function(n, means, sd, min_val = 1, max_val = 5) {
  purrr::map_dfc(means, function(m) {
    raw <- rnorm(n, mean = m, sd = sd)
    clipped <- pmin(pmax(round(raw), min_val), max_val)
    clipped
  }) |>
    purrr::set_names(paste0("Item", 1:length(means)))
}

eng_dat   <- sim_likert(n_l1_english,                  item_means_eng,   item_sd)
other_dat <- sim_likert(n_respondents - n_l1_english,  item_means_other, item_sd)

survey_df <- dplyr::bind_rows(eng_dat, other_dat) |>
  dplyr::mutate(
    Respondent  = paste0("R", stringr::str_pad(1:n_respondents, 3, pad = "0")),
    Group       = group,
    TotalScore  = rowSums(dplyr::across(dplyr::starts_with("Item")))
  ) |>
  dplyr::select(Respondent, Group, dplyr::everything())

cat("Survey dataset:\n")
Survey dataset:
Code
str(survey_df)
tibble [120 × 8] (S3: tbl_df/tbl/data.frame)
 $ Respondent: chr [1:120] "R001" "R002" "R003" "R004" ...
 $ Group     : chr [1:120] "L1 English" "L1 English" "L1 English" "L1 English" ...
 $ Item1     : num [1:120] 5 3 4 4 4 2 4 3 4 4 ...
 $ Item2     : num [1:120] 4 4 4 4 2 4 2 5 3 4 ...
 $ Item3     : num [1:120] 5 5 4 5 3 5 4 4 5 5 ...
 $ Item4     : num [1:120] 3 5 4 4 4 4 5 4 4 3 ...
 $ Item5     : num [1:120] 4 4 5 5 4 4 4 3 5 4 ...
 $ TotalScore: num [1:120] 21 21 21 22 17 19 19 19 21 20 ...
Code
# Group-level means
survey_df |>
  dplyr::group_by(Group) |>
  dplyr::summarise(
    dplyr::across(dplyr::starts_with("Item"), ~ round(mean(.x), 2)),
    MeanTotal = round(mean(TotalScore), 2),
    .groups = "drop"
  )
# A tibble: 2 × 7
  Group      Item1 Item2 Item3 Item4 Item5 MeanTotal
  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
1 L1 English  3.97  3.75  4.4   3.6   4.07      19.8
2 L1 Other    3.27  2.9   3.57  2.87  3.23      15.8
Code
# Visualise total score distribution by group
ggplot(survey_df, aes(x = TotalScore, fill = Group)) +
  geom_histogram(binwidth = 1, position = "dodge", color = "white", alpha = 0.8) +
  scale_fill_manual(values = c("steelblue", "firebrick")) +
  theme_bw() +
  theme(legend.position = "top") +
  labs(title = "Simulated language attitude survey: total scores by group",
       subtitle = "5 Likert items (1–5) | L1 English respondents show higher positive attitudes",
       x = "Total score (5–25)", y = "Count")
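The round-and-clip approach above is quick, but it can pile probability mass onto the endpoint categories. A common alternative, sketched here with purely illustrative probabilities (not estimates from real data), is to draw ordinal responses directly with sample() and an explicit probability vector:

```r
set.seed(2026)

# Draw 120 Likert responses (1-5) directly from a specified distribution;
# the probability vector below is illustrative only
p_response  <- c(0.05, 0.10, 0.20, 0.40, 0.25)   # P(response = 1, ..., 5)
item_direct <- sample(1:5, size = 120, replace = TRUE, prob = p_response)

table(item_direct)
```

This gives exact control over the marginal response distribution, at the cost of losing the underlying continuous scale.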

Generating Synthetic Text Data

Simple String Construction

The simplest way to generate text data is direct string construction using paste() and sprintf():

Code
# Generate templated sentences
subjects   <- c("The speaker", "Each participant", "Every respondent", "The learner")
verbs      <- c("produced", "uttered", "used", "avoided")
objects    <- c("the target form", "an error", "a hesitation", "the amplifier")

set.seed(2026)
n_sentences <- 20

sentences <- paste(
  sample(subjects, n_sentences, replace = TRUE),
  sample(verbs, n_sentences, replace = TRUE),
  sample(objects, n_sentences, replace = TRUE),
  "."   # paste() inserts its default space separator before the full stop
)

head(sentences, 5)
[1] "The speaker produced the amplifier ."      
[2] "The speaker avoided an error ."            
[3] "The speaker uttered an error ."            
[4] "Each participant uttered the target form ."
[5] "The speaker uttered the amplifier ."       
Code
# Generate numbered sentences with sprintf
template_sents <- sprintf(
  "This is sentence number %d, produced by speaker %s.",
  1:10,
  sample(LETTERS[1:5], 10, replace = TRUE)
)

template_sents
 [1] "This is sentence number 1, produced by speaker E." 
 [2] "This is sentence number 2, produced by speaker E." 
 [3] "This is sentence number 3, produced by speaker C." 
 [4] "This is sentence number 4, produced by speaker E." 
 [5] "This is sentence number 5, produced by speaker D." 
 [6] "This is sentence number 6, produced by speaker A." 
 [7] "This is sentence number 7, produced by speaker A." 
 [8] "This is sentence number 8, produced by speaker C." 
 [9] "This is sentence number 9, produced by speaker E." 
[10] "This is sentence number 10, produced by speaker E."

Simulating a Synthetic Corpus with Controlled Properties

For testing text analysis pipelines, it is useful to generate a corpus with known properties (word frequencies, bigram transitions):

Code
set.seed(2026)

# Define a small vocabulary with assigned probabilities (simulating corpus frequencies)
vocab <- data.frame(
  word = c("the", "language", "corpus", "analysis", "text",
           "speaker", "frequency", "very", "quite", "shows",
           "contains", "reveals", "significant", "common", "rare"),
  prob = c(0.15, 0.12, 0.10, 0.09, 0.08,
           0.07, 0.07, 0.06, 0.05, 0.05,
           0.04, 0.04, 0.03, 0.03, 0.02)
)

# Verify probabilities sum to 1
cat("Probability sum:", sum(vocab$prob), "\n")
Probability sum: 1 
Code
# Generate synthetic texts of varying lengths
generate_text <- function(n_words, vocab_df) {
  words <- sample(
    x       = vocab_df$word,
    size    = n_words,
    replace = TRUE,
    prob    = vocab_df$prob
  )
  paste(words, collapse = " ")
}

# Create a mini synthetic corpus of 10 texts
n_texts      <- 10
text_lengths <- round(runif(n_texts, min = 50, max = 200))

synth_corpus <- data.frame(
  text_id  = paste0("SYNTH_", stringr::str_pad(1:n_texts, 2, pad = "0")),
  n_tokens = text_lengths,
  text     = sapply(text_lengths, generate_text, vocab_df = vocab),
  stringsAsFactors = FALSE
)

# Inspect
head(synth_corpus[, c("text_id", "n_tokens")], 5)
   text_id n_tokens
1 SYNTH_01      155
2 SYNTH_02      133
3 SYNTH_03       71
4 SYNTH_04       93
5 SYNTH_05      133
Code
cat("\nSample text:\n", synth_corpus$text[1], "\n")

Sample text:
 the very language reveals language corpus frequency the corpus the corpus corpus language the analysis language analysis the the frequency contains language very the frequency analysis language text shows analysis analysis reveals analysis analysis the the common corpus text language significant speaker significant contains text very shows the significant language corpus contains frequency analysis analysis the speaker text frequency speaker reveals analysis very reveals language speaker very frequency language contains the significant shows significant the significant speaker common language language common very shows corpus corpus corpus shows corpus speaker analysis the quite analysis the language language speaker frequency language frequency analysis reveals the corpus language the frequency the language shows analysis the frequency shows analysis the analysis text the analysis analysis speaker the the reveals the language text frequency corpus very corpus text speaker contains the language speaker speaker the language the rare text shows very corpus very shows the the text common the the 
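Because the vocabulary probabilities are known, a quick check can confirm that the empirical token frequencies in the synthetic corpus approximate them. This sketch reuses vocab and synth_corpus from the code above; with only ~1,300 tokens the match will be approximate, not exact.

```r
# Compare empirical relative frequencies with the assigned probabilities
tokens   <- unlist(strsplit(synth_corpus$text, " "))
emp_freq <- table(tokens) / length(tokens)

round(emp_freq[vocab$word], 3)   # same order as vocab$prob for comparison
```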

Bigram Language Model

A more linguistically realistic approach is a bigram language model: the probability of each word depends on the previous word. This produces more natural-looking (though semantically random) text:

Code
set.seed(2026)

# Define a simple bigram transition matrix
# Rows = current word, columns = next word
word_types <- c("<START>", "the", "corpus", "analysis", "shows",
                "very", "significant", "results", "<END>")

# Transition probabilities (rows must sum to 1)
transitions <- matrix(
  c(
    # <START>  the   corpus  analysis  shows    very   sig    results  <END>
    0.00, 0.80,  0.10,   0.05,    0.00,   0.00,  0.00,  0.00,    0.05,  # <START>
    0.00, 0.00,  0.40,   0.35,    0.00,   0.00,  0.00,  0.00,    0.25,  # the
    0.00, 0.10,  0.00,   0.00,    0.60,   0.00,  0.00,  0.00,    0.30,  # corpus
    0.00, 0.00,  0.00,   0.00,    0.70,   0.00,  0.00,  0.00,    0.30,  # analysis
    0.00, 0.00,  0.00,   0.00,    0.00,   0.50,  0.20,  0.20,    0.10,  # shows
    0.00, 0.00,  0.00,   0.00,    0.00,   0.00,  0.90,  0.00,    0.10,  # very
    0.00, 0.00,  0.00,   0.00,    0.00,   0.00,  0.00,  0.90,    0.10,  # significant
    0.00, 0.10,  0.10,   0.10,    0.00,   0.00,  0.00,  0.00,    0.70,  # results
    0.00, 0.00,  0.00,   0.00,    0.00,   0.00,  0.00,  0.00,    1.00   # <END>
  ),
  nrow = length(word_types),
  byrow = TRUE,
  dimnames = list(word_types, word_types)
)

# Generate one sentence using the bigram model
generate_bigram_sentence <- function(transitions, max_len = 20) {
  words  <- "<START>"
  current <- "<START>"
  repeat {
    probs   <- transitions[current, ]
    next_w  <- sample(colnames(transitions), 1, prob = probs)
    if (next_w == "<END>" || length(words) > max_len) break
    words   <- c(words, next_w)
    current <- next_w
  }
  paste(words[-1], collapse = " ")   # remove <START>
}

# Generate 10 sentences
bigram_sentences <- replicate(10, generate_bigram_sentence(transitions))
cat("Bigram-generated sentences:\n")
Bigram-generated sentences:
Code
for (i in seq_along(bigram_sentences)) cat(sprintf("  %2d. %s\n", i, bigram_sentences[i]))
   1. the analysis shows very significant results
   2. corpus shows results
   3. the corpus
   4. the corpus shows very significant results
   5. the corpus shows very significant results
   6. the analysis
   7. the analysis shows results
   8. the analysis
   9. the corpus
  10. the analysis shows very
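One practical caveat: sample() rescales its prob argument internally, so a row that does not sum to 1 (for example, after a typo in the matrix) will not raise an error and may go unnoticed. An explicit assertion on the transitions matrix defined above catches this before any text is generated:

```r
# Every row of a transition matrix should sum to 1; check explicitly,
# because sample() silently renormalises its prob argument
stopifnot(all(abs(rowSums(transitions) - 1) < 1e-9))
```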

A researcher generates 500 simulated reaction times with rnorm(500, mean = 600, sd = 80) and finds that her analysis shows a significant priming effect. Her supervisor asks whether this is a valid approach. What is the main problem?

  1. rnorm() cannot simulate reaction times — use rpois() instead
  2. Reaction times are always positive and approximately log-normally distributed; simulating them as normally distributed (which allows negative values and is symmetric) is unrealistic. The researcher should simulate on the log scale: exp(rnorm(500, mean = log(600), sd = 0.15))
  3. 500 observations is too small for any simulation
  4. The simulation is fine — reaction times are normally distributed in large samples by the Central Limit Theorem
Answer

b) Reaction times are always positive and approximately log-normally distributed; simulating them as normally distributed (which allows negative values and is symmetric) is unrealistic.

Reaction times are bounded at zero (you cannot respond in negative time), and their distribution is right-skewed: the tail for slow responses is longer than the tail for fast ones. The normal distribution is symmetric and produces some negative values when the mean is not large relative to the standard deviation, so it does not match the real distribution of RTs. The standard approach is to simulate on the log scale, log_rt <- rnorm(n, mean, sd), and then back-transform with rt <- exp(log_rt). This yields a log-normal distribution that is always positive and right-skewed. Option (d) confuses the Central Limit Theorem, which applies to sample means, with the distribution of individual observations.
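The difference is easy to verify by drawing both kinds of samples. This minimal sketch uses the figures from the question (mean 600 ms, sd 80 ms on the raw scale, sd 0.15 on the log scale):

```r
set.seed(2026)

# Normal draws: symmetric; can in principle dip toward (or below) zero
rt_normal    <- rnorm(500, mean = 600, sd = 80)

# Log-normal draws: always positive and right-skewed
rt_lognormal <- exp(rnorm(500, mean = log(600), sd = 0.15))

min(rt_normal)
all(rt_lognormal > 0)   # TRUE by construction, since exp() is positive
```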


Citation and Session Info

Schweinberger, Martin. 2026. Loading, Saving, and Simulating Data in R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/load/load.html (Version 2026.02.24).

@manual{schweinberger2026loadr,
  author       = {Schweinberger, Martin},
  title        = {Loading, Saving, and Simulating Data in R},
  note         = {https://ladal.edu.au/tutorials/load/load.html},
  year         = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address      = {Brisbane},
  edition      = {2026.02.24}
}
Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ggplot2_4.0.2    purrr_1.0.4      xml2_1.3.6       jsonlite_1.9.0  
 [5] writexl_1.5.1    readxl_1.4.3     readr_2.1.5      officer_0.7.3   
 [9] data.tree_1.1.0  here_1.0.2       openxlsx_4.2.8   flextable_0.9.11
[13] tidyr_1.3.2      stringr_1.5.1    dplyr_1.2.0     

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.56               htmlwidgets_1.6.4      
 [4] lattice_0.22-6          tzdb_0.4.0              vctrs_0.7.1            
 [7] tools_4.4.2             generics_0.1.3          parallel_4.4.2         
[10] tibble_3.2.1            pkgconfig_2.0.3         Matrix_1.7-2           
[13] data.table_1.17.0       RColorBrewer_1.1-3      S7_0.2.1               
[16] uuid_1.2-1              lifecycle_1.0.5         compiler_4.4.2         
[19] farver_2.1.2            textshaping_1.0.0       codetools_0.2-20       
[22] fontquiver_0.2.1        fontLiberation_0.1.0    htmltools_0.5.9        
[25] yaml_2.3.10             crayon_1.5.3            pillar_1.10.1          
[28] openssl_2.3.2           nlme_3.1-166            fontBitstreamVera_0.1.1
[31] tidyselect_1.2.1        zip_2.3.2               digest_0.6.39          
[34] stringi_1.8.4           splines_4.4.2           labeling_0.4.3         
[37] rprojroot_2.1.1         fastmap_1.2.0           grid_4.4.2             
[40] cli_3.6.4               magrittr_2.0.3          patchwork_1.3.0        
[43] utf8_1.2.4              withr_3.0.2             gdtools_0.5.0          
[46] scales_1.4.0            bit64_4.6.0-1           rmarkdown_2.30         
[49] bit_4.5.0.1             cellranger_1.1.0        askpass_1.2.1          
[52] ragg_1.3.3              hms_1.1.3               evaluate_1.0.3         
[55] knitr_1.51              mgcv_1.9-1              rlang_1.1.7            
[58] Rcpp_1.1.1              glue_1.8.0              renv_1.1.7             
[61] rstudioapi_0.17.1       vroom_1.6.5             R6_2.6.1               
[64] systemfonts_1.3.1      
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to substantially expand and restructure a shorter existing LADAL tutorial on loading and saving data, adding the project structure section, the readr package coverage, the Excel section, the JSON/XML section, the built-in datasets section, the section on loading multiple text files, and the entirely new simulation section (distributions, realistic linguistic datasets, power analysis, and synthetic text generation). All content was reviewed and approved by the named author (Martin Schweinberger), who takes full responsibility for its accuracy.


