install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("flextable")
install.packages("purrr")
install.packages("checkdown")

Martin Schweinberger


This tutorial is the second in the LADAL R series. It picks up where Getting Started with R and RStudio left off and introduces the programming side of R: how to write code that makes decisions, repeats itself, and encapsulates reusable logic. These are the tools that transform R from a calculator into a real programming environment — the tools you reach for when you want to automate a task, process many files at once, or build a custom analysis pipeline.
The tutorial uses linguistic examples throughout: text cleaning, corpus processing, token counting, and data wrangling tasks typical of language research. By the end, you will be able to write your own functions, process data in loops, and apply the same operation efficiently across many groups or files.
Before working through this tutorial, please complete:
You should be comfortable with objects, vectors, data frames, and basic dplyr operations before continuing.
This tutorial covers:

if/else, ifelse(), dplyr::case_when()
for loops — iterating over vectors, lists, and files
while loops — condition-driven iteration
apply family — sapply(), lapply(), apply()
purrr — map() and its variants
tryCatch() for robust code

Install required packages (once only):
Load packages and set options:
library(dplyr) # data manipulation
library(ggplot2) # data visualisation
library(tidyr) # data reshaping
library(flextable) # formatted tables
library(purrr) # functional programming tools
library(checkdown) # interactive exercises
options(stringsAsFactors = FALSE)
options(scipen = 100)
options(max.print = 100)
set.seed(42)

We will use a small simulated dataset throughout this tutorial — a collection of text samples with associated metadata:
# A small corpus of text samples
corpus <- data.frame(
doc_id = paste0("doc", 1:12),
register = rep(c("Academic", "News", "Fiction"), each = 4),
text = c(
"The syntactic properties of embedded clauses remain poorly understood.",
"Phonological alternations in unstressed syllables exhibit considerable variation.",
"Discourse coherence is maintained through a variety of cohesive devices.",
"The morphological complexity of agglutinative languages poses theoretical challenges.",
"Scientists announced a major breakthrough in renewable energy storage yesterday.",
"Local authorities confirmed that road closures will affect the city centre this weekend.",
"The prime minister addressed parliament amid growing calls for electoral reform.",
"Unemployment figures fell sharply in the third quarter according to new statistics.",
"She had not expected the letter to arrive so soon, or to contain such news.",
"The old house creaked and groaned as the storm gathered strength outside.",
"He said nothing for a long time, watching the rain trace patterns on the glass.",
"By morning the fog had lifted and the valley lay green and still below them."
),
n_tokens = c(11, 10, 12, 11, 14, 16, 13, 14, 17, 15, 18, 16),
year = c(2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022,
2019, 2020, 2021, 2022),
stringsAsFactors = FALSE
)

What you’ll learn: How to make R take different actions depending on the data — the foundation of any decision-making code
Key functions: if, else, else if, ifelse(), dplyr::case_when()
Why it matters: Real data is messy and varied — conditional logic lets your code respond intelligently to what it finds
if / else Statements

An if statement runs a block of code only when a condition is TRUE. The optional else block runs when it is FALSE.
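A minimal sketch of such a check, using the corpus created above (the 10-document threshold is an illustrative assumption):

```r
# Run the analysis only if the corpus holds enough documents
# (the threshold of 10 is an illustrative assumption)
n_docs <- nrow(corpus)
if (n_docs >= 10) {
  cat("Corpus is large enough for analysis:", n_docs, "documents.\n")
} else {
  cat("Corpus is too small:", n_docs, "documents.\n")
}
```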
Corpus is large enough for analysis: 12 documents.
Chain multiple conditions with else if:
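For instance, classifying the corpus by its average token count (the cut-offs of 12 and 15 tokens are illustrative assumptions):

```r
avg <- mean(corpus$n_tokens)
if (avg < 12) {
  complexity <- "low"
} else if (avg < 15) {
  complexity <- "moderate"
} else {
  complexity <- "high"
}
cat("Average token count:", round(avg, 1), "→ Complexity:", complexity, "\n")
```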
Average token count: 13.9 → Complexity: moderate
if Requires a Single TRUE or FALSE
The condition inside if() must evaluate to exactly one logical value. In R 4.2 and later, passing a vector of logicals is a hard error (in older versions it was a warning that used only the first element). Either way, it is never what you want.
Error: the condition has length > 1
Use any() or all() when you need to reduce a logical vector to a single value:
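For example:

```r
# any(): TRUE if at least one element of the logical vector is TRUE
if (any(corpus$n_tokens > 15)) {
  cat("At least one document exceeds 15 tokens.\n")
}
# all(): TRUE only if every element is TRUE
if (all(corpus$n_tokens > 5)) {
  cat("Every document has more than 5 tokens.\n")
}
```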
ifelse() — Vectorised Conditional

ifelse() applies a condition to an entire vector and returns a vector of results — one for each element. This makes it ideal for creating or recoding columns:
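A sketch using the corpus data (the 14-token cut-off and the column name length_class are illustrative assumptions):

```r
# One condition applied to all twelve rows at once
corpus$length_class <- ifelse(corpus$n_tokens >= 14, "long", "short")
table(corpus$length_class)   # 7 long, 5 short
```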
dplyr::case_when() — Multiple Conditions

When you need more than two categories, case_when() is far cleaner than nested ifelse() calls. It works like a series of if/else if conditions, evaluated top to bottom:
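Here is a sketch that buckets the documents into eras by year (the era cut-offs are illustrative assumptions):

```r
corpus$era <- dplyr::case_when(
  corpus$year < 2020  ~ "Early",    # first match wins
  corpus$year <= 2021 ~ "Middle",
  TRUE                ~ "Recent"    # catch-all default
)
table(corpus$era)
```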
Early Middle Recent
3 6 3
case_when() Evaluation Order
Conditions are tested top to bottom and the first match wins. Always put more specific conditions before less specific ones. The final TRUE ~ "value" acts as a catch-all default (like else) — it is good practice to always include one.
switch() — Selecting Among Named Options

switch() is useful when you have a single variable that can take one of several known values, and you want to map each value to a different result or action:
describe_register <- function(reg) {
switch(reg,
"Academic" = "Formal; high lexical density; passive constructions common",
"News" = "Neutral; inverted pyramid structure; quotations frequent",
"Fiction" = "Varied; narrative voice; dialogue and description",
"Unknown register"
)
}
describe_register("Academic")
[1] "Formal; high lexical density; passive constructions common"
describe_register("News")
[1] "Neutral; inverted pyramid structure; quotations frequent"
[1] "Unknown register"
Q1. What is the key difference between if and ifelse() in R?
Q2. In a case_when() call, what does the final TRUE ~ "Unknown" line do?
Q3. You want to add a column pos_class that is "function" when word is in c("the", "a", "of", "in") and "content" otherwise. Which code is correct?
for Loops

What you’ll learn: How to repeat a block of code for each element of a sequence or list
Key concepts: Loop variable, iteration, pre-allocation, seq_along()
Why it matters: Loops automate repetitive tasks — processing multiple files, computing statistics per document, or building up results iteratively
A for loop iterates over a sequence, executing its body once per element. The loop variable takes each element’s value in turn:
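For example, counting documents per register:

```r
# The loop variable reg takes each register name in turn
for (reg in unique(corpus$register)) {
  n <- sum(corpus$register == reg)
  cat(reg, ":", n, "documents\n")
}
```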
Academic : 4 documents
News : 4 documents
Fiction : 4 documents
When you need both the element and its position, loop over indices using seq_along(). This is safer than 1:length(x) because it handles zero-length vectors correctly:
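A sketch (the words vector is illustrative):

```r
words <- c("syntax", "morphology", "phonology", "pragmatics", "semantics")
for (i in seq_along(words)) {
  cat("Word ", i, ": ", words[i], " (", nchar(words[i]), " characters)\n", sep = "")
}
```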
Word 1: syntax (6 characters)
Word 2: morphology (10 characters)
Word 3: phonology (9 characters)
Word 4: pragmatics (10 characters)
Word 5: semantics (9 characters)
The most important loop performance rule: pre-allocate your output object before the loop, then fill it by index. Growing a vector by appending inside a loop forces R to copy the entire vector on every iteration — catastrophically slow for large inputs.
# BAD: growing inside the loop (slow for large n)
results_slow <- c()
for (i in seq_along(words)) {
results_slow <- c(results_slow, nchar(words[i])) # full copy each time!
}
# GOOD: pre-allocate, then fill by index
results_fast <- integer(length(words))
for (i in seq_along(words)) {
results_fast[i] <- nchar(words[i])
}
results_fast
[1] 6 10 9 10 9
Here we loop over registers, compute summary statistics for each, and collect the results in a pre-allocated list:
registers <- unique(corpus$register)
summaries <- vector("list", length(registers)) # pre-allocate a list
names(summaries) <- registers
for (reg in registers) {
subset_df <- corpus[corpus$register == reg, ]
summaries[[reg]] <- data.frame(
register = reg,
n_docs = nrow(subset_df),
mean_tok = round(mean(subset_df$n_tokens), 1),
sd_tok = round(sd(subset_df$n_tokens), 2),
min_tok = min(subset_df$n_tokens),
max_tok = max(subset_df$n_tokens)
)
}
# Combine list of data frames into one
do.call(rbind, summaries) %>%
flextable() %>%
flextable::set_table_properties(width = .85, layout = "autofit") %>%
flextable::theme_zebra() %>%
flextable::fontsize(size = 12) %>%
flextable::fontsize(size = 12, part = "header") %>%
flextable::align_text_col(align = "center") %>%
flextable::set_caption(caption = "Token statistics per register computed with a for loop.") %>%
flextable::border_outer()

register | n_docs | mean_tok | sd_tok | min_tok | max_tok |
|---|---|---|---|---|---|
Academic | 4 | 11.0 | 0.82 | 10 | 12 |
News | 4 | 14.2 | 1.26 | 13 | 16 |
Fiction | 4 | 16.5 | 1.29 | 15 | 18 |
One of the most practical uses of for loops in corpus linguistics is processing many text files in a folder:
# List all .txt files in a folder
txt_files <- list.files(path = "data/corpus/",
pattern = "\\.txt$",
full.names = TRUE)
# Pre-allocate results
results <- data.frame(
filename = character(length(txt_files)),
n_chars = integer(length(txt_files)),
n_lines = integer(length(txt_files)),
stringsAsFactors = FALSE
)
for (i in seq_along(txt_files)) {
text <- readLines(txt_files[i], warn = FALSE)
results$filename[i] <- basename(txt_files[i])
results$n_chars[i] <- sum(nchar(text))
results$n_lines[i] <- length(text)
}
head(results)

break and next

Two special keywords control loop flow:
break: exit the loop immediately
next: skip to the next iteration (like continue in other languages)

Long documents only:
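A sketch using next to skip short documents (the 13-token cut-off is an illustrative assumption):

```r
for (i in seq_len(nrow(corpus))) {
  if (corpus$n_tokens[i] < 13) next   # skip short documents
  cat(corpus$doc_id[i], "-", corpus$n_tokens[i], "tokens\n")
}
```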
doc5 - 14 tokens
doc6 - 16 tokens
doc7 - 13 tokens
doc8 - 14 tokens
doc9 - 17 tokens
doc10 - 15 tokens
doc11 - 18 tokens
doc12 - 16 tokens
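And a sketch using break to stop as soon as the first match is found:

```r
for (i in seq_len(nrow(corpus))) {
  if (corpus$register[i] == "Academic") {
    cat(corpus$doc_id[i], ":", substr(corpus$text[i], 1, 50), "...\n")
    break   # found one — stop looping
  }
}
```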
First Academic document:
doc1 : The syntactic properties of embedded clauses remai ...
Nested for Loops

Loops can be nested — the inner loop runs completely for each iteration of the outer loop:
Documents per register × era:
Academic × Early : 1
Academic × Middle : 2
Academic × Recent : 1
News × Early : 1
News × Middle : 2
News × Recent : 1
Fiction × Early : 1
Fiction × Middle : 2
Fiction × Recent : 1
Before writing a loop, always ask: does a vectorised function or dplyr verb already do this? Vectorised operations in R are implemented in C and run orders of magnitude faster than R-level loops.
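For example, character counts for all twelve texts need no loop at all:

```r
nchar(corpus$text)   # vectorised — one call, twelve results
```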
[1] 70 81 72 85 80 88 80 83 75 73 79 76
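Likewise, per-group summaries are a dplyr one-liner rather than a loop:

```r
corpus %>%
  dplyr::group_by(register) %>%
  dplyr::summarise(mean_tok = mean(n_tokens))
```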
# A tibble: 3 × 2
register mean_tok
<chr> <dbl>
1 Academic 11
2 Fiction 16.5
3 News 14.2
Loops shine when: (a) each iteration depends on the result of the previous one, (b) you are reading/writing files, or (c) no vectorised alternative exists.
for Loops
Q1. Why should you pre-allocate your output vector before a for loop rather than growing it with c() inside the loop?
Q2. What does next do inside a for loop?
Q3. Why is seq_along(x) preferred over 1:length(x) when looping over a vector x?
while Loops

What you’ll learn: How to write loops that run until a condition changes rather than for a fixed number of iterations
Key concepts: Loop condition, infinite loops, break as a safety exit
When to use: Convergence algorithms, reading streams of data, retrying failed operations
A while loop runs its body as long as its condition remains TRUE. Use it when the number of iterations is not known in advance.
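A sketch: accumulate documents until a token budget is reached (the 50-token budget is an illustrative assumption):

```r
total <- 0
i <- 0
while (total < 50 && i < nrow(corpus)) {
  i <- i + 1
  total <- total + corpus$n_tokens[i]   # update the condition variable!
}
cat("Reached", total, "tokens after", i, "documents.\n")
```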
Reached 58 tokens after 5 documents.
Here we simulate reading tokens from a stream until we hit a sentence boundary (a token ending in .):
tokens <- c("The", "quick", "brown", "fox", "jumps", ".", "Over", "the", "lazy")
sentence <- character(0)
j <- 0
while (j < length(tokens)) {
j <- j + 1
current <- tokens[j]
sentence <- c(sentence, current)
if (grepl("\\.$", current)) break # stop at sentence boundary
}
cat("First sentence:", paste(sentence, collapse = " "), "\n")

First sentence: The quick brown fox jumps .
A while loop runs forever if its condition never becomes FALSE. Always ensure:

the loop body updates something that can eventually make the condition FALSE
a break safety exit is included for unexpected situations

Converged to 0.9698 after 44 iterations.
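A typical convergence loop pairs the update step with a max_iter safety exit (illustrative sketch; the tolerance and update rule are assumptions, not the code behind the output above):

```r
x <- 0
x_new <- 1
tol <- 1e-4
max_iter <- 1000
iter <- 0
while (abs(x_new - x) >= tol && iter < max_iter) {
  x <- x_new
  x_new <- cos(x)      # illustrative update step
  iter <- iter + 1     # guarantees the safety condition eventually triggers
}
cat("Converged to", round(x_new, 4), "after", iter, "iterations.\n")
```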
If you accidentally create an infinite loop, press Escape in the Console, or click the Stop button (red square) in the Console toolbar. RStudio will interrupt the running code. If that fails, use Session → Interrupt R from the menu.
while Loops
Q1. When is a while loop more appropriate than a for loop?
Q2. What is the risk of writing while (TRUE) { ... } without a break statement inside the body?
What you’ll learn: How to write your own reusable functions in R — the single most important skill for writing clean, maintainable code
Key concepts: Function definition, arguments, default values, return values, scope, documentation
Why it matters: Functions eliminate copy-paste errors, make your intentions explicit, and make code testable and shareable
The rule of thumb: if you have written the same block of code more than twice, it should be a function.
# General template:
# my_function <- function(required_arg, optional_arg = default_value) {
# # body: code that does the work
# return(result) # optional: last expression is returned automatically
# }
# A minimal example
greet_language <- function(language) {
paste("Hello from", language, "linguistics!")
}
greet_language("computational")
[1] "Hello from computational linguistics!"
greet_language("corpus")
[1] "Hello from corpus linguistics!"
Arguments without a default are required — omitting them raises an error. Arguments with a default are optional and use their default when not supplied:
# type_token_ratio: required x (character vector of tokens), optional lowercase
ttr <- function(tokens, lowercase = TRUE) {
if (lowercase) tokens <- tolower(tokens)
n_tokens <- length(tokens)
n_types <- length(unique(tokens))
n_types / n_tokens
}
sample_tokens <- c("The", "cat", "sat", "on", "the", "mat", "the", "cat")
ttr(sample_tokens) # lowercase = TRUE (default)
[1] 0.625
ttr(sample_tokens, lowercase = FALSE)
[1] 0.75
A function automatically returns its last evaluated expression. Use return() explicitly for early exits or when clarity matters:
# Early return when input is invalid
safe_ttr <- function(tokens, lowercase = TRUE) {
if (length(tokens) == 0) {
warning("Empty token vector supplied — returning NA.")
return(NA_real_)
}
if (!is.character(tokens)) {
stop("tokens must be a character vector.")
}
if (lowercase) tokens <- tolower(tokens)
length(unique(tokens)) / length(tokens)
}
safe_ttr(character(0)) # triggers warning, returns NA
Warning in safe_ttr(character(0)): Empty token vector supplied — returning NA.
[1] NA
safe_ttr(sample_tokens)
[1] 0.625
Functions can only return one object, but that object can be a named list containing as many results as needed:
corpus_stats <- function(tokens, lowercase = TRUE) {
if (lowercase) tokens <- tolower(tokens)
list(
n_tokens = length(tokens),
n_types = length(unique(tokens)),
ttr = round(length(unique(tokens)) / length(tokens), 3),
longest = tokens[which.max(nchar(tokens))]
)
}
result <- corpus_stats(sample_tokens)
result$ttr
[1] 0.625
result$longest
[1] "the"
str(result)
List of 4
$ n_tokens: int 8
$ n_types : int 5
$ ttr : num 0.625
$ longest : chr "the"
Variables created inside a function live only inside that function — they are invisible to (and cannot accidentally overwrite) the global environment:
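For example (make_greeting and its local variable are illustrative):

```r
make_greeting <- function() {
  local_var <- "hello world"   # exists only inside this call
  local_var
}
make_greeting()
exists("local_var")   # FALSE — the variable never reached the global environment
```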
[1] "hello world"
[1] FALSE
The <<- Operator
If you genuinely need to modify a variable in the calling environment from inside a function (rare), use <<- instead of <-. This searches up the call stack to find the variable and modifies it there. However, this is considered bad practice in most data analysis code because it creates hidden side effects that make functions unpredictable. Prefer returning a value and assigning it explicitly.
Documenting Functions: roxygen2 Style

Good functions should be documented so that you (and colleagues) can understand them months later. The conventional format mirrors the roxygen2 package style:
#' Compute Type-Token Ratio
#'
#' @description
#' Calculates the type-token ratio (TTR) of a character vector of tokens.
#' TTR = number of unique word types / total number of tokens.
#'
#' @param tokens A character vector of tokens (words).
#' @param lowercase Logical. If TRUE (default), tokens are lowercased before
#' counting, so "The" and "the" count as the same type.
#'
#' @return A single numeric value between 0 and 1. Values closer to 1
#' indicate higher lexical diversity.
#'
#' @examples
#' ttr(c("the", "cat", "sat", "on", "the", "mat"))
#' ttr(c("The", "Cat", "sat"), lowercase = FALSE)
ttr <- function(tokens, lowercase = TRUE) {
if (length(tokens) == 0) return(NA_real_)
if (lowercase) tokens <- tolower(tokens)
length(unique(tokens)) / length(tokens)
}

Here is a realistic example: a family of small, focused functions composed into a pipeline:
# Step 1: normalise whitespace and case
normalise_text <- function(text) {
text <- tolower(text)
text <- trimws(text)
gsub("\\s+", " ", text) # collapse multiple spaces
}
# Step 2: remove punctuation
remove_punct <- function(text) {
gsub("[[:punct:]]", "", text)
}
# Step 3: tokenise (split on whitespace)
tokenise <- function(text) {
strsplit(text, "\\s+")[[1]]
}
# Step 4: remove stopwords
remove_stopwords <- function(tokens,
stopwords = c("the","a","an","of","in","and","to","is")) {
tokens[!tokens %in% stopwords]
}
# Compose into a full pipeline
clean_and_tokenise <- function(text, stopwords = NULL) {
text <- normalise_text(text)
text <- remove_punct(text)
tokens <- tokenise(text)
if (!is.null(stopwords)) tokens <- remove_stopwords(tokens, stopwords)
tokens
}
# Apply to one document
example_text <- "The syntactic properties of embedded clauses remain poorly understood."
clean_and_tokenise(example_text,
stopwords = c("the", "a", "an", "of", "in", "and", "to", "is"))

[1] "syntactic" "properties" "embedded" "clauses" "remain"
[6] "poorly" "understood"
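Applying the pipeline to every text then gives a content-token count per document (the content_tokens column name is an assumption; sapply() is covered in detail in the next section):

```r
stops <- c("the", "a", "an", "of", "in", "and", "to", "is")
corpus$content_tokens <- sapply(
  corpus$text,
  function(txt) length(clean_and_tokenise(txt, stopwords = stops)),
  USE.NAMES = FALSE
)
head(corpus[, c("doc_id", "register", "n_tokens", "content_tokens")])
```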
doc_id register n_tokens content_tokens
1 doc1 Academic 11 7
2 doc2 Academic 10 7
3 doc3 Academic 12 7
4 doc4 Academic 11 7
5 doc5 News 14 8
6 doc6 News 16 12
Q1. A function has no explicit return() statement. What does it return?
Q2. You write x <- 99 inside a function body. After calling the function, does x exist in the global environment?
Q3. Your function computes three things: n_tokens, n_types, and TTR. What is the best way to return all three?
The apply Family

What you’ll learn: How to apply a function to every element of a vector or list without writing an explicit loop
Key functions: sapply(), lapply(), apply()
Why it matters: The apply family is more concise than loops and often faster — it expresses intent clearly
The apply family of functions replaces many common loop patterns with a single, expressive call. They all share the same pattern: apply this function to each element of this object.
sapply() — Simplified Apply

sapply() applies a function to each element of a vector or list and simplifies the result to a vector or matrix if possible:
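For instance, character counts per document — one call replaces an entire loop:

```r
sapply(corpus$text, nchar)   # named integer vector: one count per text
```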
The syntactic properties of embedded clauses remain poorly understood.
70
Phonological alternations in unstressed syllables exhibit considerable variation.
81
Discourse coherence is maintained through a variety of cohesive devices.
72
The morphological complexity of agglutinative languages poses theoretical challenges.
85
Scientists announced a major breakthrough in renewable energy storage yesterday.
80
Local authorities confirmed that road closures will affect the city centre this weekend.
88
# Type-token ratio for a list of token vectors
token_lists <- list(
doc1 = c("the", "cat", "sat", "on", "the", "mat"),
doc2 = c("a", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
doc3 = c("to", "be", "or", "not", "to", "be")
)
sapply(token_lists, ttr) # returns a named numeric vector

doc1 doc2 doc3
0.8333333 1.0000000 0.6666667
Use an anonymous function (a function defined inline without a name) for more complex operations:
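One plausible reconstruction of the counts shown: the number of longer words per text (the 4-character cut-off is an assumption):

```r
sapply(corpus$text, function(txt) {
  words <- strsplit(gsub("[[:punct:]]", "", txt), "\\s+")[[1]]
  sum(nchar(words) >= 4)   # count the longer words
})
```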
The syntactic properties of embedded clauses remain poorly understood.
7
Phonological alternations in unstressed syllables exhibit considerable variation.
7
Discourse coherence is maintained through a variety of cohesive devices.
7
The morphological complexity of agglutinative languages poses theoretical challenges.
7
Scientists announced a major breakthrough in renewable energy storage yesterday.
8
Local authorities confirmed that road closures will affect the city centre this weekend.
12
The prime minister addressed parliament amid growing calls for electoral reform.
9
Unemployment figures fell sharply in the third quarter according to new statistics.
8
She had not expected the letter to arrive so soon, or to contain such news.
7
The old house creaked and groaned as the storm gathered strength outside.
7
He said nothing for a long time, watching the rain trace patterns on the glass.
9
By morning the fog had lifted and the valley lay green and still below them.
7
lapply() — List Apply

lapply() always returns a list, making it safer when results have different lengths or types:
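For example, the unique types per document. Note that sapply() produces the identical list here: the results have different lengths (5, 9, 4), so simplification to a matrix fails and sapply() silently falls back to a list:

```r
lapply(token_lists, unique)   # always a list
sapply(token_lists, unique)   # same result — ragged output cannot be simplified
```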
$doc1
[1] "the" "cat" "sat" "on" "mat"
$doc2
[1] "a" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
$doc3
[1] "to" "be" "or" "not"
$doc1
[1] "the" "cat" "sat" "on" "mat"
$doc2
[1] "a" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
$doc3
[1] "to" "be" "or" "not"
apply() — Matrix/Data Frame Apply

apply() operates on matrices or data frames, applying a function across rows (MARGIN = 1) or columns (MARGIN = 2):
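A sketch, assuming a content_tokens column has been computed earlier in the tutorial:

```r
m <- as.matrix(corpus[, c("n_tokens", "content_tokens")])
rownames(m) <- corpus$doc_id
apply(m, 2, mean)        # MARGIN = 2: column means
head(apply(m, 1, sum))   # MARGIN = 1: row sums (total counts per document)
```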
n_tokens content_tokens
13.91667 9.25000
doc1 doc2 doc3 doc4 doc5 doc6
18 17 19 18 22 28
sapply() and lapply()

Function | Input | Output | Use when |
|---|---|---|---|
sapply() | vector or list | vector/matrix (simplified) or list if simplification fails | results are all the same type and length |
lapply() | vector or list | always a list | results differ in length or type; you always want a list |
apply() | matrix or data frame | vector or list | you want to summarise across rows or columns of a matrix |
purrr

What you’ll learn: How to use purrr::map() and its variants as a modern, consistent alternative to the apply family
Key functions: map(), map_chr(), map_dbl(), map_df(), map2(), walk()
Why it matters: purrr functions have consistent, predictable behaviour and integrate cleanly with dplyr pipelines
The purrr package provides a family of map() functions that replace the apply family with a more consistent interface. Every map() function takes a list or vector and applies a function to each element.
map() and Type-Specific Variants

map() always returns a list. Type-specific variants guarantee a particular output type and fail informatively if the results do not match:
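Using the token_lists from above (the paste() example is illustrative):

```r
purrr::map(token_lists, length)        # list of lengths
purrr::map_dbl(token_lists, ttr)       # guaranteed numeric vector
purrr::map_int(token_lists, length)    # guaranteed integer vector
purrr::map_chr(token_lists, ~ paste(.x[1], .x[2]))   # first two tokens as a string
```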
$doc1
[1] 6
$doc2
[1] 9
$doc3
[1] 6
doc1 doc2 doc3
0.8333333 1.0000000 0.6666667
doc1 doc2 doc3
6 9 6
doc1 doc2 doc3
"the cat" "a quick" "to be"
map_df() — Map to a Data Frame

map_df() (or map() |> list_rbind()) is extremely useful for applying a function that returns a data frame to each element and binding the results together:
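A sketch using purrr::imap_dfr(), which also passes each element's name (here, the document id) to the function:

```r
purrr::imap_dfr(token_lists, function(tokens, doc) {
  data.frame(
    doc      = doc,
    n_tokens = length(tokens),
    n_types  = length(unique(tokens)),
    ttr      = round(length(unique(tokens)) / length(tokens), 3)
  )
})
```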
doc n_tokens n_types ttr
1 doc1 6 5 0.833
2 doc2 9 9 1.000
3 doc3 6 4 0.667
map2() — Map Over Two Inputs Simultaneously

map2() applies a function to corresponding elements of two vectors or lists:
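A sketch with hypothetical paired inputs — document ids and their token vectors (the data is illustrative; ttr() is the function defined earlier):

```r
ids  <- c("text_A", "text_B", "text_C")
toks <- list(c("one", "two"),
             c("red", "blue", "green"),
             c("to", "be", "or", "be"))
purrr::map2_chr(ids, toks, ~ paste(.x, ": TTR =", ttr(.y)))   # pairs (.x, .y)
purrr::map2_dbl(ids, toks, ~ ttr(.y))
```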
text_A : TTR = 1
text_B : TTR = 1
text_C : TTR = 0.75
[1] 1.00 1.00 0.75
walk() — Map for Side Effects

walk() is like map() but is used when you want the side effect (printing, writing a file, making a plot) rather than the return value. It invisibly returns the input:
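A sketch that prints one summary line per register (the summarising step is an assumption):

```r
reg_stats <- corpus %>%
  dplyr::group_by(register) %>%
  dplyr::summarise(n_docs = dplyr::n(),
                   mean_tok = mean(n_tokens), .groups = "drop")
purrr::walk(seq_len(nrow(reg_stats)), function(i) {
  cat("Register:", reg_stats$register[i],
      "| Docs:", reg_stats$n_docs[i],
      "| Mean tokens:", sprintf("%.1f", reg_stats$mean_tok[i]), "\n")
})
```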
Register: Academic | Docs: 4 | Mean tokens: 11.0
Register: Fiction | Docs: 4 | Mean tokens: 16.5
Register: News | Docs: 4 | Mean tokens: 14.2
apply and purrr
Q1. What is the difference between sapply() and lapply()?
Q2. When would you use purrr::walk() instead of purrr::map()?
What you’ll learn: How to write code that handles errors and warnings gracefully rather than crashing
Key functions: tryCatch(), try(), stop(), warning(), message()
Why it matters: When processing many files or documents, a single error should not halt your entire pipeline
Use stop(), warning(), and message() to communicate problems from inside your functions:
compute_ttr <- function(tokens) {
if (!is.character(tokens)) stop("tokens must be a character vector")
if (length(tokens) == 0) warning("Empty vector — returning NA")
if (length(tokens) < 10) message("Note: TTR is unreliable for short texts")
if (length(tokens) == 0) return(NA_real_)
length(unique(tokens)) / length(tokens)
}
compute_ttr(c("the", "cat", "sat")) # message: short text

Note: TTR is unreliable for short texts
[1] 1
tryCatch() — Handle Errors Gracefully

tryCatch() lets you intercept errors, warnings, and messages, and decide what to do instead of crashing:
# With tryCatch: a bad input yields NA instead of crashing everything
safe_ttr <- function(tokens) {
tryCatch(
expr = compute_ttr(tokens),
error = function(e) {
cat("Error in compute_ttr:", conditionMessage(e), "\n")
NA_real_
},
warning = function(w) {
cat("Warning:", conditionMessage(w), "\n")
NA_real_
}
)
}
safe_ttr(c("the", "cat", "sat", "on", "the", "mat")) # normal

Note: TTR is unreliable for short texts
[1] 0.8333333
Error in compute_ttr: tokens must be a character vector
[1] NA
Warning: Empty vector — returning NA
[1] NA
tryCatch() Across a Pipeline

This pattern is invaluable when processing many documents or files — one bad item should not stop the whole run:
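A sketch — a mixed bag of inputs, only some of them valid (the token_sets list is illustrative; safe_ttr() is the tryCatch wrapper defined above):

```r
token_sets <- list(
  c("the", "cat", "sat", "on", "the", "mat"),   # fine (but short)
  1:5,                                          # wrong type → error → NA
  character(0),                                 # empty → warning → NA
  c("to", "be", "or", "not")                    # fine (but short)
)
sapply(token_sets, safe_ttr)   # the run completes despite the bad inputs
```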
Note: TTR is unreliable for short texts
Error in compute_ttr: tokens must be a character vector
Warning: Empty vector — returning NA
Note: TTR is unreliable for short texts
[1] 0.8333333 NA NA 1.0000000
Q1. What is the difference between stop(), warning(), and message() inside a function?
Q2. Why is wrapping a function call in tryCatch() useful when processing a large number of files or documents?
A concise guide to writing better R code: when to loop, when to vectorise, how to name and document functions, and how to structure growing code projects
Situation | Best tool |
|---|---|
Apply the same operation to every element of a vector | Vectorised operation (e.g., nchar(), tolower(), arithmetic) |
Apply the same operation to each group in a data frame | dplyr::group_by() + summarise() or mutate() |
Apply a function to each element and collect results | sapply() / lapply() / purrr::map() |
Iterate when each step depends on the previous result | for loop with pre-allocated output |
Number of iterations unknown; stop when condition met | while loop (with break safety exit) |
Apply a function for its side effects (print, save, plot) | purrr::walk() or a for loop |
Handle different cases of a single categorical variable | ifelse() / case_when() / switch() |
One function, one job: a name like clean_and_tokenise_and_count_and_plot() is a sign it should be four functions
Use descriptive names: clean_text(), compute_ttr(), plot_frequency() — not myFunc() or data2()
Validate inputs early: call stop() at the top of the function body for invalid arguments

# Good: clear structure, consistent indentation, descriptive names
compute_register_stats <- function(data, group_col = "register") {
data %>%
dplyr::group_by(.data[[group_col]]) %>%
dplyr::summarise(
n = dplyr::n(),
mean_tok = round(mean(n_tokens), 1),
sd_tok = round(sd(n_tokens), 2),
.groups = "drop"
)
}
# Bad: cryptic names, no whitespace, no structure
f<-function(d,g="register"){d%>%group_by(.data[[g]])%>%summarise(n=n(),m=round(mean(n_tokens),1))}Don’t Repeat Yourself. If you catch yourself copy-pasting a block of code and changing one value, that block should be a function parameterised by that value. Code duplication multiplies the places you must update when requirements change, and multiplies the opportunities for inconsistency.
# BEFORE: copy-pasted three times with minor changes
academic_ttr <- sum(corpus$register == "Academic") |> ...
news_ttr <- sum(corpus$register == "News") |> ...
fiction_ttr <- sum(corpus$register == "Fiction") |> ...
# AFTER: one function, called three times
get_register_ttr <- function(data, reg) { ... }
sapply(c("Academic", "News", "Fiction"), get_register_ttr, data = corpus)

Schweinberger, Martin. 2026. Working with R: Control Flow, Functions, and Programming. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/workingwithr/workingwithr.html (Version 2026.02.19).
@manual{schweinberger2026workingwithr,
author = {Schweinberger, Martin},
title = {Working with R: Control Flow, Functions, and Programming},
note = {https://ladal.edu.au/tutorials/workingwithr/workingwithr.html},
year = {2026},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2026.02.19}
}
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Australia/Brisbane
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] purrr_1.0.4 flextable_0.9.7 tidyr_1.3.2 ggplot2_4.0.2
[5] dplyr_1.2.0 checkdown_0.0.13
loaded via a namespace (and not attached):
[1] utf8_1.2.4 generics_0.1.3 fontLiberation_0.1.0
[4] renv_1.1.1 xml2_1.3.6 digest_0.6.39
[7] magrittr_2.0.3 evaluate_1.0.3 grid_4.4.2
[10] RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_1.9.0
[13] zip_2.3.2 scales_1.4.0 fontBitstreamVera_0.1.1
[16] codetools_0.2-20 textshaping_1.0.0 cli_3.6.4
[19] rlang_1.1.7 fontquiver_0.2.1 litedown_0.9
[22] commonmark_2.0.0 withr_3.0.2 yaml_2.3.10
[25] gdtools_0.4.1 tools_4.4.2 officer_0.6.7
[28] uuid_1.2-1 vctrs_0.7.1 R6_2.6.1
[31] lifecycle_1.0.5 htmlwidgets_1.6.4 ragg_1.3.3
[34] pkgconfig_2.0.3 pillar_1.10.1 gtable_0.3.6
[37] data.table_1.17.0 glue_1.8.0 Rcpp_1.0.14
[40] systemfonts_1.2.1 xfun_0.56 tibble_3.2.1
[43] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.51
[46] farver_2.1.2 htmltools_0.5.9 rmarkdown_2.30
[49] compiler_4.4.2 S7_0.2.1 askpass_1.2.1
[52] markdown_2.0 openssl_2.3.2