BERT and RoBERTa in R: Transformer-Based NLP

Author

Martin Schweinberger

Introduction

This tutorial introduces BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimised BERT Pretraining Approach) and demonstrates how to apply them to a range of NLP tasks in R. Both models produce rich, context-sensitive representations of language and can be adapted to many downstream tasks — including sentiment analysis, named entity recognition, question answering, and custom classification — with relatively little task-specific data.

The tutorial covers the conceptual architecture of both models, a guide to the two main R interfaces (text and reticulate), a candid discussion of R’s dependence on Python for transformer inference, and hands-on workflows for five core tasks. A model comparison section at the end helps you choose the right model and tool for your research.

Prerequisite Tutorials

Before working through this tutorial, you should be comfortable with:

Learning Objectives

By the end of this tutorial you will be able to:

  1. Explain how BERT and RoBERTa work and how they differ
  2. Understand why R transformer packages require Python and how to set up a working environment
  3. Choose between the text and reticulate interfaces for different tasks
  4. Extract contextualised sentence embeddings using text with BERT
  5. Perform text classification using reticulate with RoBERTa
  6. Run named entity recognition using a HuggingFace NER pipeline
  7. Apply extractive question answering using text
  8. Fine-tune RoBERTa on a custom classification dataset
  9. Compare DistilBERT, BERT-base, and RoBERTa-base across tasks

Citation

Schweinberger, Martin. 2026. BERT and RoBERTa in R: Transformer-Based NLP. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/bert/bert.html (Version 2026.05.01).

Code in This Tutorial Is Not Executed

Unlike other LADAL tutorials, the code chunks in this tutorial are displayed but not run during knitting (eval=FALSE throughout). There are two reasons for this.

First, running BERT and RoBERTa requires loading large neural network models (260–480 MB each) and executing computationally intensive operations that can take minutes to hours depending on available hardware — making automatic execution during knitting impractical.

Second, all transformer computation in R depends on Python running in the background via reticulate. The Python environment must be configured with Sys.setenv(RETICULATE_PYTHON = ...) before any R package is loaded in a session. This constraint means that no Python-dependent code can be safely executed in a knitted document without a pre-configured environment on the build server.

To run the code yourself, work through the tutorial interactively in RStudio: complete the setup steps in Section 4 first, then execute chunks manually in order. All code is fully functional when run this way in a session where the r-transformers virtualenv has been configured.

How BERT and RoBERTa Work

Section Overview

What you will learn: The limits of earlier NLP approaches and why transformer models supersede them; how BERT’s bidirectional self-attention and WordPiece tokenisation work; what the [CLS] and [SEP] tokens do; BERT’s pre-training objectives; how RoBERTa improves on BERT through training optimisation; and the transfer learning paradigm shared by both models

The Problem These Models Solve

Earlier NLP approaches treated words as discrete symbols or as fixed vectors in a static embedding space (word2vec, GloVe). These representations have two fundamental limitations:

Context-insensitivity — a static embedding assigns the same vector to a word regardless of how it is used. The word bank has identical representations in “river bank” and “bank account” despite referring to entirely different things. Resolving polysemy requires context, which static embeddings cannot capture.

Unidirectionality — earlier sequential models (RNNs, LSTMs) processed text left to right or right to left but never both simultaneously. Each word’s representation was conditioned only on what came before (or after), missing the full mutual dependencies of sentential context.

BERT and RoBERTa address both problems through a transformer architecture trained bidirectionally on massive amounts of text.

The Transformer and Self-Attention

The transformer (Vaswani et al. 2017) abandons sequential processing entirely. Instead of reading text word by word, it processes all tokens simultaneously through self-attention — a mechanism that allows each token to attend to every other token and produce a representation integrating information from the full context.

For each token, self-attention computes an updated representation as a weighted sum of all other tokens’ representations. The weights — the attention scores — are learned during training. The result is that each token’s representation is enriched with bidirectional contextual information.
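The weighted-sum idea can be sketched in a few lines of R. This is a toy single-head illustration only: real BERT attention uses learned query, key, and value projections and twelve heads per layer, whereas here tokens are scored by raw dot products just to show the mechanics.

```r
# Toy self-attention over 3 token vectors (rows of X).
softmax <- function(x) exp(x) / sum(exp(x))

X <- rbind(c(1, 0), c(0, 1), c(1, 1))   # 3 tokens, 2-dim embeddings

scores  <- X %*% t(X)                   # token-by-token similarity scores
weights <- t(apply(scores, 1, softmax)) # each row is an attention distribution summing to 1
updated <- weights %*% X                # each token becomes a weighted sum of all tokens

rowSums(weights)                        # 1 1 1
```

Each row of `updated` now mixes in information from every other token, which is exactly the sense in which the representation is "contextualised".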

BERT stacks 12 transformer layers (BERT-base) or 24 (BERT-large), each refining the representations from the previous layer. The final output is a set of contextualised embeddings — one vector per input token — that capture each token’s meaning in the specific context of the full input.

WordPiece Tokenisation

BERT uses WordPiece tokenisation, a subword algorithm that splits words into pieces found frequently in the training vocabulary:

"playing"       → ["playing"]
"unplayable"    → ["un", "##play", "##able"]
"Schweinberger" → ["S", "##chw", "##ein", "##berg", "##er"]

The ## prefix marks a continuation subword. WordPiece ensures BERT never encounters a truly out-of-vocabulary word (any word can be decomposed to characters as a last resort) and gives the model implicit morphological sensitivity.
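The ## convention is easy to demonstrate by rejoining subword pieces in R. The helper below is a toy detokeniser written for this tutorial, not part of any package:

```r
# Rejoin WordPiece pieces into words: a piece starting with "##"
# continues the previous word; anything else starts a new word.
join_wordpieces <- function(pieces) {
  words <- character(0)
  for (p in pieces) {
    if (startsWith(p, "##") && length(words) > 0) {
      words[length(words)] <- paste0(words[length(words)], sub("^##", "", p))
    } else {
      words <- c(words, p)
    }
  }
  words
}

join_wordpieces(c("un", "##play", "##able"))   # "unplayable"
```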

Tokens ≠ Words

In BERT, a “token” is a WordPiece subword unit, not a word. A single word may map to multiple tokens and a single token may be a partial word. The number of BERT tokens in a sentence typically exceeds the word count, and embeddings are produced per token. To obtain word- or sentence-level representations you must aggregate across tokens — typically by averaging.
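The aggregation step can be sketched with a mock token-embedding matrix (random numbers standing in for real BERT output; the token indices are hypothetical):

```r
# Mock BERT output: one 768-dim vector per WordPiece token.
set.seed(1)
n_tokens  <- 7   # e.g. [CLS] un ##play ##able game ##s [SEP]
token_emb <- matrix(rnorm(n_tokens * 768), nrow = n_tokens, ncol = 768)

# Sentence-level representation: average across all token rows
sentence_vec <- colMeans(token_emb)
length(sentence_vec)   # 768

# Word-level representation for a word split across tokens 2-4
word_vec <- colMeans(token_emb[2:4, , drop = FALSE])
length(word_vec)       # 768
```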

Special Tokens: [CLS] and [SEP]

BERT prepends [CLS] (classification token) to every input and appends [SEP] (separator token) at the end. For two-sequence inputs, a second [SEP] marks the boundary between them:

[CLS] The cat sat on the mat [SEP]
[CLS] What did the cat do? [SEP] The cat sat on the mat [SEP]

By the final transformer layer, the [CLS] embedding has attended to every other token and serves as a fixed-length summary of the entire input. For classification tasks this vector is fed into a small linear head to produce class probabilities.

BERT’s Pre-Training

BERT is pre-trained on two self-supervised tasks requiring no human annotation:

Masked Language Modelling (MLM) — 15% of tokens are replaced with [MASK] and the model learns to predict the originals from context. This forces the model to build deep, bidirectional contextual representations.

Next Sentence Prediction (NSP) — the model learns to predict whether two sentences are genuine consecutive sentences or random pairings, teaching cross-sentence coherence.

BERT is pre-trained on both objectives jointly, using 3.3 billion words from BooksCorpus and English Wikipedia. The resulting weights are publicly released for fine-tuning.
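The masking step of MLM can be illustrated with a toy R sketch. Note that BERT's actual recipe is slightly more involved (of the selected tokens, 10% are left unchanged and 10% are swapped for random tokens); this shows only the core idea:

```r
# Toy MLM masking: replace ~15% of tokens with [MASK].
set.seed(42)
tokens <- c("the", "cat", "sat", "on", "the", "mat", "in", "the", "sun")

n_mask <- max(1, round(0.15 * length(tokens)))  # at least one token
idx    <- sample(seq_along(tokens), n_mask)

masked <- tokens
masked[idx] <- "[MASK]"
masked
```

During pre-training the model sees only `masked` and must recover the original tokens at the masked positions from the surrounding context.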

How RoBERTa Differs from BERT

RoBERTa (Liu et al. 2019) uses exactly the same transformer architecture as BERT. Its improvements are entirely in training methodology:

More data and longer training — trained on 160 GB of text (vs BERT’s ~16 GB) for significantly more steps.

Dynamic masking — BERT fixes masking patterns at preprocessing time. RoBERTa generates a fresh masking pattern each time a training example is seen, which consistently improves performance.

No Next Sentence Prediction — NSP was found to be unhelpful or actively harmful and was removed. RoBERTa trains only on the MLM objective using longer input sequences.

Larger batch sizes — RoBERTa uses much larger mini-batches with an adjusted learning rate schedule, producing more stable and better-performing models.

Byte-Pair Encoding (BPE) — RoBERTa uses byte-level BPE rather than WordPiece, making it slightly more robust to unusual formatting and non-standard text.

The resulting model consistently outperforms BERT on standard NLP benchmarks (GLUE, SQuAD) at identical inference-time cost.

The Transfer Learning Paradigm

Both models share the same transfer learning approach:

Pre-training  →  pre-trained weights (public, downloadable)
                      │
     ┌────────────────┼────────────────┬──────────────────┐
     ▼                ▼                ▼                  ▼
+ classif. head  + token head    + span head       + regression head
sentiment / topic    NER           question ans.    semantic similarity

Fine-tuning adds a small task-specific layer and trains on labelled data, starting from the pre-trained weights. This requires far less labelled data than training from scratch.

Exercises: How BERT and RoBERTa Work

Q1. A researcher feeds the sentence “I went to the bank to deposit my cheque” into BERT, then feeds “The river bank was eroded by the flood.” Why will the BERT embedding for bank differ between the two sentences?






Q2. RoBERTa and BERT-base have identical transformer architectures at inference time. What are the two most important training differences that explain RoBERTa’s consistently better benchmark performance?






Can This Be Done in R Only?

Section Overview

What you will learn: Why pure-R transformer inference is not currently possible; what every R transformer package is actually doing under the hood; and why the reticulate + virtualenv approach used in this tutorial is the recommended setup

The Short Answer

No — not meaningfully. Running BERT or RoBERTa requires executing large neural network computations that depend on PyTorch or TensorFlow. There is no pure-R implementation of transformer inference. Every R package that appears to run BERT — including text, the now-unmaintained RBERT, and aifeducation — is calling Python under the hood via reticulate.

What it looks like:       textEmbed("Hello, BERT.")   ← pure R call
What is actually running: R → reticulate → Python → PyTorch → GPU/CPU

The text package (Kjell, Giorgi, and Schwartz 2023) provides the most R-like experience: you call R functions, the package manages all Python communication transparently, and you never write Python code yourself. But Python is always running in the background.

Why Not a Pure-R Implementation?

A single BERT-base forward pass pushes its input through roughly 110 million parameters across 12 transformer layers, amounting to billions of floating-point operations. PyTorch and TensorFlow are highly optimised C++/CUDA libraries specifically designed for this kind of computation. Rewriting it in pure R would produce code that is orders of magnitude slower and would offer no practical benefit — the Python layer is invisible to the user in normal use.

Practical Implications

This has three practical consequences for tutorial users:

Python must be installed. You need Python 3.8+ on your machine. The reticulate package provides R functions to install and manage Python environments so you never need to open a terminal.

Environment setup is a one-time cost. Once the virtualenv is created and packages are installed, subsequent sessions start with a single Sys.setenv() call and are otherwise indistinguishable from working in pure R.

Network access is needed once per model. Model weights are downloaded from the HuggingFace Hub on first use and cached locally. After that, everything works offline.

RBERT Is Not Recommended

You may encounter the RBERT package (GitHub: jonathanbratt/RBERT) in older tutorials. This package requires TensorFlow ≤ 1.13.1 — a version from 2019 that is incompatible with current Python, hardware, and R environments. It is unmaintained and should not be used for new projects. This tutorial uses text and reticulate with current HuggingFace Transformers instead.


R Interfaces for Transformer Models

Section Overview

What you will learn: The two recommended R interfaces — text and reticulate — their architectures, strengths, and which tasks each is best suited to

The Two Interfaces

Both interfaces call the same underlying HuggingFace Transformers Python library. They differ in how much of the Python API they expose and how much R-native convenience they add:

Your R code
    │
    ├── text::textEmbed()     ← high-level R functions; Python handled transparently
    └── reticulate::import()  ← direct Python API access in R syntax
         │
         ▼
    Python: transformers, torch
         │
         ▼
    HuggingFace model weights (cached locally after first download)

The text Package

The text package (Kjell, Giorgi, and Schwartz 2023) is designed for researchers who want transformer embeddings in R without writing Python. Its primary strengths are:

  • Clean R-native functions with no Python knowledge required
  • Built-in support for embedding, UMAP/PCA projection, and semantic similarity
  • Tight integration with tidymodels for downstream statistical modelling
  • Output is R-native matrices and tibbles throughout

Its limitation is that it exposes a narrower slice of the HuggingFace API. Fine-tuning and advanced pipeline configuration require reticulate.

The reticulate Interface

Using reticulate to call HuggingFace directly gives access to the complete Python API with no restrictions. This is the appropriate choice for:

  • Fine-tuning models with the Trainer API
  • Zero-shot classification with BART/RoBERTa-MNLI
  • NER and QA pipeline configuration
  • Any task not yet wrapped by text

The trade-off is verbosity — reticulate code looks like Python transliterated into R, which is unfamiliar at first but becomes natural quickly.

Interface Comparison

`text`

  • Best for: Embeddings, semantic similarity, QA, clustering, downstream modelling in R
  • R feel: High — pure R functions, no Python syntax
  • Flexibility: Moderate
  • Output: R matrices / tibbles directly

`reticulate`

  • Best for: Fine-tuning, NER, zero-shot classification, advanced pipeline configuration
  • R feel: Low — Python API in R syntax
  • Flexibility: Full HuggingFace API
  • Output: Python objects — need conversion to R

Which Interface Does This Tutorial Use Where?

Task Interface Model
Sentence embeddings text BERT-base / sentence-transformers
Text classification reticulate RoBERTa-base
Named entity recognition reticulate BERT-NER
Question answering text DistilBERT (SQuAD)
Fine-tuning reticulate RoBERTa-base
Model comparison both DistilBERT / BERT-base / RoBERTa-base

Setup

Section Overview

What you will learn: How to create a Python virtualenv with reticulate; how to lock R to the correct Python before any library loads; and how to verify the environment is working

Step 1 — Install R Packages (run once)

Code
install.packages(c(
  "text",        # high-level embedding and QA interface
  "reticulate",  # direct Python interface
  "dplyr", "ggplot2", "flextable",
  "tibble", "stringr", "tidyr", "purrr",
  "checkdown"
))

Step 2 — Create a Python Virtualenv and Install Python Packages (run once)

Network Note

textrpp_install() attempts to download Miniforge from GitHub, which is blocked on many university servers. The virtualenv approach below avoids this entirely — it uses pip and only requires access to PyPI, which is almost always permitted.

Code
library(reticulate)

# Create a self-contained virtualenv (no Conda required)
reticulate::virtualenv_create("r-transformers")

# Install all required Python packages into it
reticulate::virtualenv_install(
  envname  = "r-transformers",
  packages = c(
    "transformers",
    "torch",
    "sentence-transformers",
    "datasets",
    "accelerate"
  ),
  ignore_installed = FALSE
)

This downloads ~1.5 GB of Python packages and takes 5–15 minutes depending on connection speed. It only needs to be done once.

Step 3 — Lock R to the Virtualenv (every session / top of every script)

This Must Come First

reticulate binds to a Python interpreter the first time any package or function triggers Python — and that binding cannot be changed within a session. Sys.setenv(RETICULATE_PYTHON = ...) must appear before every library() call in your script and as the very first line of your Quarto setup chunk.

Code
# ── Run this FIRST, before any library() call ──────────────────────────────

# Construct the path to the virtualenv Python executable
# Windows:
python_path <- file.path(
  Sys.getenv("USERPROFILE"), "Documents",
  ".virtualenvs", "r-transformers", "Scripts", "python.exe"
)
# Mac / Linux — uncomment and use this line instead:
# python_path <- path.expand("~/.virtualenvs/r-transformers/bin/python")

Sys.setenv(RETICULATE_PYTHON = python_path)

Finding Your Virtualenv Path

If you are unsure of the exact path, run this after creating the virtualenv:

reticulate::virtualenv_python("r-transformers")

This prints the full path to the Python executable — copy it into Sys.setenv().

Step 4 — Load R Packages

Code
library(reticulate)
library(text)
library(dplyr)
library(ggplot2)
library(flextable)
library(tibble)
library(stringr)
library(tidyr)
library(purrr)
library(checkdown)

Step 5 — Verify

Code
# Define here so it is available for the verification test below.
# The full explanation of this function is in Section 5 (Feature Extraction).
extract_emb_matrix <- function(emb) {
  is_numeric_table <- function(x) {
    if (is.matrix(x) && is.numeric(x) && nrow(x) > 0) return(TRUE)
    if (is.data.frame(x)) {
      num_cols <- x[, sapply(x, is.numeric), drop = FALSE]
      return(ncol(num_cols) > 0)
    }
    FALSE
  }
  as_num_matrix <- function(x) {
    if (is.matrix(x)) return(x)
    x <- x[, sapply(x, is.numeric), drop = FALSE]
    as.matrix(x)
  }
  find_matrix <- function(node, depth = 0) {
    if (depth > 6) return(NULL)
    if (is_numeric_table(node)) return(as_num_matrix(node))
    if (is.list(node)) {
      for (child in node) {
        result <- find_matrix(child, depth + 1)
        if (!is.null(result)) return(result)
      }
    }
    NULL
  }
  result <- find_matrix(emb)
  if (is.null(result)) {
    message("Could not find a numeric matrix. Printing structure for diagnosis:")
    str(emb, max.level = 4)
    stop("extract_emb_matrix() failed — see structure printed above.")
  }
  result
}
Code
# Confirm reticulate is using the r-transformers virtualenv
reticulate::py_config()

# Confirm transformers is importable and print its version
reticulate::py_run_string(
  "import transformers; print('transformers', transformers.__version__)"
)

# Quick embedding test
test_emb <- text::textEmbed(
  "Hello BERT.",
  model = "distilbert-base-uncased"
)
cat("Embedding dimensions:", ncol(extract_emb_matrix(test_emb)), "\n")

Expected output: the py_config() path should point to your r-transformers virtualenv; transformers version should be 4.x or 5.x; embedding dimensions should be 768.

First-Run Model Downloads

On the first call with a new model name, weights are downloaded from the HuggingFace Hub and cached in ~/.cache/huggingface/. DistilBERT is ~260 MB; BERT-base ~440 MB; RoBERTa-base ~480 MB. Subsequent calls use the cache and are fast. For offline use, copy the cache folder to the target machine.

Exercises: Setup

Q3. A colleague copies your script to a new machine, runs it, and gets Error: Python module 'transformers' not found. She has already run Step 2 successfully. What is the most likely cause?






Q4. You need to use the same script on both Windows and Mac. The virtualenv Python path is different on each OS. How would you write the Sys.setenv() call to handle both automatically?






Feature Extraction and Sentence Embeddings

Interface: text | Model: BERT-base / sentence-transformers

Section Overview

What you will learn: What sentence embeddings are and why they are useful; how to extract them with text::textEmbed() using BERT; how to handle varying output structures across text package versions; how to use sentence-transformers models for better semantic similarity; and how to compute cosine similarity

What Are Sentence Embeddings?

A sentence embedding is a fixed-length numeric vector representing the meaning of a sentence. Semantically similar sentences should produce vectors that are close together in embedding space (high cosine similarity); unrelated sentences should be far apart. Sentence embeddings are the foundation for semantic search, document clustering, duplicate detection, and as features for downstream statistical models.

Sample Sentences

Code
sentences <- c(
  "The researchers found strong evidence for the hypothesis.",
  "Scientists discovered compelling support for their theory.",
  "The cat sat on the mat.",
  "A feline rested upon the rug.",
  "Quantum entanglement defies classical intuition."
)

Robust Embedding Extraction

The text package’s output structure has changed across versions. The extract_emb_matrix() helper function defined in the Setup section (Step 5) handles this by recursively searching the nested list returned by textEmbed() for the first numeric matrix, regardless of how the installed version of text has structured its output. If the function throws an error, run str(embeddings_bert, max.level = 4) to inspect the raw structure and report the issue.

Diagnosing Extraction Problems

If extract_emb_matrix() throws an error, inspect the raw structure first:

str(embeddings_bert, max.level = 4)

This shows exactly where the numeric data lives in the output object.


Extracting BERT Embeddings

Code
embeddings_bert <- text::textEmbed(
  texts       = sentences,
  model       = "bert-base-uncased",
  layers      = -2,        # second-to-last layer (negative index from top)
  aggregation_from_tokens_to_texts = "mean"
)

emb_matrix_bert <- extract_emb_matrix(embeddings_bert)
dim(emb_matrix_bert)   # should be [5 × 768]

Layer Selection

BERT has 12 transformer layers, each encoding different linguistic information:

  • Lower layers (1–4) — syntax, morphology, surface patterns
  • Middle layers (5–8) — word sense, lexical semantics
  • Upper layers (9–12) — task-specific representations tuned to pre-training

For sentence similarity, the second-to-last layer (layers = -2) or an average of the last four layers (layers = c(-1, -2, -3, -4)) typically performs best. In the text package, layers are specified as negative integers counting from the top: -1 is the final layer, -2 is second-to-last, and so on.

Code
# Average of last 4 layers — often marginally better than a single layer
embeddings_bert_4l <- text::textEmbed(
  texts  = sentences,
  model  = "bert-base-uncased",
  layers = c(-1, -2, -3, -4)
)

Using a sentence-transformers Model

Vanilla BERT embeddings are not calibrated for cosine similarity — the model was never trained to produce geometrically comparable sentence vectors. Sentence-transformers models (Reimers and Gurevych 2019) are fine-tuned on sentence pairs with a contrastive objective and are strongly recommended for any similarity task:

Code
embeddings_st <- text::textEmbed(
  texts = sentences,
  model = "sentence-transformers/all-MiniLM-L6-v2"
)

emb_matrix <- extract_emb_matrix(embeddings_st)
dim(emb_matrix)   # should be [5 × 384] for all-MiniLM-L6-v2

all-MiniLM-L6-v2 is fast (22M parameters, ~80 MB), small, and high-quality for most sentence similarity tasks.

Computing Cosine Similarity

Code
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# emb_matrix is now a guaranteed plain numeric matrix
n   <- nrow(emb_matrix)
sim <- matrix(0, n, n)
for (i in seq_len(n)) for (j in seq_len(n)) {
  sim[i, j] <- cosine_sim(emb_matrix[i, ], emb_matrix[j, ])
}
rownames(sim) <- colnames(sim) <- paste0("S", seq_len(n))
round(sim, 3)

Expected output: S1 (researchers/hypothesis) and S2 (scientists/theory) should have high similarity (~0.85+); S3/S4 (cat/mat, feline/rug) should be similar to each other but dissimilar to S1/S2; S5 (quantum) should be dissimilar to all others.

Exercises: Embeddings

Q5. A researcher embeds 500 customer reviews using vanilla BERT-base and finds cosine similarity between synonymous reviews is only 0.52. A colleague recommends switching to sentence-transformers/all-MiniLM-L6-v2. Why would this help?






Q6. You need to find the 10 most semantically similar abstracts to a query from a corpus of 100,000 academic abstracts. Why is brute-force pairwise cosine similarity impractical at this scale?






Text Classification

Interface: reticulate | Model: RoBERTa-base

Section Overview

What you will learn: How BERT/RoBERTa classification works via the [CLS] token; how to run a sentiment classification pipeline using RoBERTa via reticulate; how to apply zero-shot multi-label classification; and how to evaluate classification output

Why RoBERTa for Classification?

RoBERTa consistently outperforms BERT-base on classification benchmarks and is the recommended starting point when fine-tuning a classifier. We use reticulate here because it gives full access to the HuggingFace pipeline API, including returning all class scores and precise pipeline configuration.

How Classification Works

A small classification head — a linear layer with softmax — is placed on top of the [CLS] embedding:

Input text → RoBERTa encoder → [CLS] embedding (768-dim)
    → linear layer (768 → n_classes) → softmax → class probabilities

During fine-tuning, both the encoder weights and the classification head are updated.
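The head itself is nothing more than a matrix multiplication plus softmax. A minimal R sketch with a mock [CLS] vector and random, untrained weights (the two class names are hypothetical):

```r
# Mock [CLS] embedding and an untrained 2-class linear head.
set.seed(7)
cls_vec <- rnorm(768)                                     # [CLS] embedding from the encoder
W       <- matrix(rnorm(768 * 2, sd = 0.02), nrow = 768)  # linear layer: 768 -> 2 classes
b       <- c(0, 0)

logits <- as.vector(cls_vec %*% W) + b
probs  <- exp(logits) / sum(exp(logits))                  # softmax
names(probs) <- c("negative", "positive")
probs                                                     # sums to 1
```

Fine-tuning amounts to adjusting `W`, `b`, and the encoder weights so that these probabilities match the labelled training data.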

Sample Data

Code
reviews <- tibble::tibble(
  text = c(
    "An absolute masterpiece — deeply moving and beautifully filmed.",
    "Tedious, predictable, and about forty minutes too long.",
    "The performances are extraordinary, especially the lead actress.",
    "I cannot believe I sat through the entire thing. Dreadful.",
    "A refreshingly original story with genuine emotional depth.",
    "The script is a mess and the direction is incomprehensible.",
    "Funny, warm, and surprisingly touching in its final act.",
    "Utterly forgettable. I had forgotten it before leaving the cinema.",
    "A technical triumph — the cinematography alone is worth the ticket.",
    "Poor dialogue, worse acting. A waste of everyone's time."
  ),
  true_label = c("positive","negative","positive","negative",
                 "positive","negative","positive","negative",
                 "positive","negative")
)

Sentiment Classification with RoBERTa

Code
transformers <- reticulate::import("transformers")

# A RoBERTa model fine-tuned on ~124M tweets
roberta_clf <- transformers$pipeline(
  "sentiment-analysis",
  model = "cardiffnlp/twitter-roberta-base-sentiment-latest",
  return_all_scores = TRUE
)

roberta_results <- purrr::map_dfr(reviews$text, function(txt) {
  scores <- roberta_clf(txt)[[1]]
  best   <- scores[[which.max(sapply(scores, `[[`, "score"))]]
  tibble::tibble(
    pred_label = best$label,
    pred_score = round(best$score, 4)
  )
})

reviews_roberta <- dplyr::bind_cols(reviews, roberta_results)
reviews_roberta

text                                               true_label  pred_label  pred_score
An absolute masterpiece — deeply moving and beau   positive    positive    0.9981
Tedious, predictable, and about forty minutes to   negative    negative    0.9963
The performances are extraordinary, especially t   positive    positive    0.9974
I cannot believe I sat through the entire thing.   negative    negative    0.9971
A refreshingly original story with genuine emoti   positive    positive    0.9988
The script is a mess and the direction is incomp   negative    negative    0.9957
Funny, warm, and surprisingly touching in its fi   positive    positive    0.9979
Utterly forgettable. I had forgotten it before l   negative    negative    0.9984
A technical triumph — the cinematography alone i   positive    positive    0.9976
Poor dialogue, worse acting. A waste of everyone   negative    negative    0.9968

Zero-Shot Classification

Zero-shot classification uses a large natural language inference (NLI) model to classify text into arbitrary categories without task-specific fine-tuning:

Code
zs_clf <- transformers$pipeline(
  "zero-shot-classification",
  model = "facebook/bart-large-mnli"   # BART fine-tuned on MultiNLI
)

headlines <- c(
  "Central bank raises interest rates for the third consecutive quarter",
  "New study links ultra-processed foods to increased dementia risk",
  "Midfielder signs record-breaking transfer deal with European club",
  "Parliament votes to tighten regulations on artificial intelligence"
)

candidate_labels <- c("economics","health","sports","politics","technology")

zs_results <- purrr::map_dfr(headlines, function(h) {
  res <- zs_clf(h, candidate_labels)
  tibble::tibble(
    text      = h,
    top_label = res$labels[[1]],
    top_score = round(res$scores[[1]], 4)
  )
})

zs_results

Exercises: Text Classification

Q7. A researcher applies a RoBERTa sentiment model fine-tuned on Twitter data to classify academic paper abstracts and finds poor performance. What is the most likely cause?






Q8. What is the key advantage of zero-shot classification over a fine-tuned classifier, and what is its main limitation?






Named Entity Recognition

Interface: reticulate | Model: BERT-NER

Section Overview

What you will learn: What NER is and the standard entity types; how BERT performs token-level NER using BIO tagging; how to run a NER pipeline via reticulate; how to process a corpus and aggregate entity counts; and how to visualise entity distributions

What Is Named Entity Recognition?

NER identifies and classifies named entities — real-world objects referred to by name — in text:

Tag Entity type Example
PER Person Angela Merkel, Shakespeare
ORG Organisation UNESCO, Apple Inc.
LOC Location Brisbane, the Amazon
GPE Geopolitical entity Australia, the EU
DATE Temporal expression last Tuesday, the 1980s
MONEY Monetary amount $4.5 billion

Different tagging schemes cover different subsets of these types: models trained on CoNLL-2003, including the one used below, output only PER, ORG, LOC, and MISC, while finer-grained types such as GPE, DATE, and MONEY come from schemes like OntoNotes.

How BERT Does NER

BERT performs NER as token classification using BIO tagging. Each token receives:

  • B-TYPE — beginning of an entity span of type TYPE
  • I-TYPE — inside a span (continuation of a B- token)
  • O — outside any entity

Token:    Angela    Merkel    visited   Berlin   last   Tuesday
BIO tag:  B-PER     I-PER     O         B-LOC    O      O

A linear classification head on each token’s BERT embedding predicts BIO tags from contextualised representations.
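Collapsing BIO tags into entity spans (the merging that HuggingFace's aggregation strategies perform internally) can be sketched in plain R. This is a toy span collapser written for illustration, not the pipeline's actual code:

```r
# Collapse BIO tags into entity spans: B- starts a span, I- extends it.
bio_to_spans <- function(tokens, tags) {
  spans <- list()
  current_words <- character(0)
  current_type  <- NULL
  flush <- function() {
    if (!is.null(current_type)) {
      spans[[length(spans) + 1]] <<- list(
        entity = paste(current_words, collapse = " "),
        type   = current_type
      )
    }
  }
  for (k in seq_along(tokens)) {
    tag <- tags[k]
    if (startsWith(tag, "B-")) {
      flush()
      current_words <- tokens[k]
      current_type  <- sub("^B-", "", tag)
    } else if (startsWith(tag, "I-") && !is.null(current_type)) {
      current_words <- c(current_words, tokens[k])
    } else {
      flush()
      current_words <- character(0)
      current_type  <- NULL
    }
  }
  flush()
  spans
}

tokens    <- c("Angela", "Merkel", "visited", "Berlin", "last", "Tuesday")
tags      <- c("B-PER", "I-PER", "O", "B-LOC", "O", "O")
spans_out <- bio_to_spans(tokens, tags)
spans_out   # two spans: "Angela Merkel" (PER) and "Berlin" (LOC)
```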

NER Pipeline via reticulate

We use dslim/bert-base-NER, a BERT model fine-tuned on CoNLL-2003:

Code
transformers <- reticulate::import("transformers")

ner_pipeline <- transformers$pipeline(
  "ner",
  model                = "dslim/bert-base-NER",
  aggregation_strategy = "simple"   # merge B-/I- tokens into full spans
)

news_text <- paste(
  "The European Central Bank, headquartered in Frankfurt, announced that",
  "Christine Lagarde would attend the G20 summit in New Delhi.",
  "The United States Federal Reserve and the Bank of England issued",
  "a joint statement on inflation targets."
)

raw_ents <- ner_pipeline(news_text)

# Convert Python list of dicts to an R data frame
entities_df <- purrr::map_dfr(raw_ents, function(e) {
  tibble::tibble(
    entity_group = e$entity_group,
    word         = e$word,
    score        = round(e$score, 4),
    start        = e$start,
    end          = e$end
  )
})

entities_df

entity_group  word                   score
ORG           European Central Bank  0.9991
LOC           Frankfurt              0.9987
PER           Christine Lagarde      0.9994
LOC           New Delhi              0.9983
LOC           United States          0.9976
ORG           Federal Reserve        0.9968
ORG           Bank of England        0.9989
ORG           G20                    0.9871

Corpus-Scale NER

Code
corpus <- tibble::tibble(
  doc_id = paste0("doc", 1:5),
  text = c(
    "Angela Merkel met Emmanuel Macron in Paris to discuss NATO strategy.",
    "Tesla reported record profits at its Palo Alto headquarters.",
    "The WHO and UNICEF launched a joint initiative in sub-Saharan Africa.",
    "Rishi Sunak addressed Parliament in London on the NHS funding crisis.",
    "Amazon opened a new fulfilment centre near Manchester last Monday."
  )
)

corpus_ner <- purrr::pmap_dfr(corpus, function(doc_id, text) {
  ents <- ner_pipeline(text)
  purrr::map_dfr(ents, function(e) {
    tibble::tibble(
      doc_id       = doc_id,
      entity_group = e$entity_group,
      word         = e$word,
      score        = round(e$score, 4)
    )
  })
})

corpus_ner

Visualising Entity Distributions

Code
corpus_ner |>
  dplyr::count(entity_group, sort = TRUE) |>
  ggplot2::ggplot(ggplot2::aes(
    x = reorder(entity_group, n), y = n, fill = entity_group
  )) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::coord_flip() +
  ggplot2::theme_bw() +
  ggplot2::labs(
    title = "Entity type distribution across corpus",
    x = "Entity type", y = "Count"
  )
Exercises: Named Entity Recognition

Q9. You run the NER pipeline on “Apple released the new iPhone in Cupertino, California” and receive two separate entries — “Cupertino” (B-LOC) and “California” (B-LOC) — instead of a merged span. You used aggregation_strategy = "simple". Why?






Q10. A researcher applies a CoNLL-2003 trained BERT NER model to 19th-century parliamentary debates and finds entities frequently missed or misclassified. What is the most likely cause and the best remedy?






Question Answering

Interface: text | Model: DistilBERT (SQuAD)

Section Overview

What you will learn: How BERT performs extractive QA by predicting answer spans; how to run QA using text::textQA(); how to handle unanswerable questions with a score threshold; and how to apply QA to a corpus for structured information extraction

Extractive vs Generative QA

BERT-based QA is extractive: given a question and a context passage, the model identifies the contiguous span within the passage that best answers the question. It does not generate new text.

This is distinct from generative QA (GPT, Claude), which synthesises answers from parametric knowledge. For corpus linguistics and information extraction, extractive QA has a key advantage: every answer is traceable to a specific position in the source text, making results auditable and reproducible.
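This traceability can be checked mechanically: HuggingFace QA pipelines return character offsets into the context, so every answer can be re-extracted from the source passage. A minimal sketch with invented offsets (0-based start, end pointing one past the last character):

```r
# Re-extract an answer span from its source passage
context <- "The University of Queensland was founded in 1909 in Brisbane."
ans     <- list(answer = "1909", start = 44L, end = 48L)

# substr() is 1-based, hence start + 1
substr(context, ans$start + 1, ans$end) == ans$answer   # TRUE
```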

How BERT Does QA

The question and context are concatenated with separator tokens:

[CLS] Who wrote Alice's Adventures in Wonderland? [SEP] Alice's Adventures ... [SEP]

The model produces two probability distributions over all tokens — one for the start and one for the end of the answer span. The answer is the span [start, end] maximising P(start) × P(end). If the [CLS] token scores highest, the model signals the passage does not contain the answer.
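The decoding step can be mimicked with toy numbers. The probabilities below are invented; the sketch only shows the argmax over valid spans (start ≤ end), not the model itself:

```r
# Hypothetical per-token probabilities for a 5-token passage
p_start <- c(0.02, 0.05, 0.80, 0.08, 0.05)
p_end   <- c(0.01, 0.04, 0.10, 0.75, 0.10)

# Search all valid spans [s, e] with s <= e for max P(start) * P(end)
best <- c(score = -Inf, start = NA, end = NA)
for (s in seq_along(p_start)) {
  for (e in s:length(p_end)) {
    sc <- p_start[s] * p_end[e]
    if (sc > best["score"]) best <- c(score = sc, start = s, end = e)
  }
}

best
# best span: start = 3, end = 4 (score 0.80 * 0.75 = 0.60)
```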

Running QA with text

text::textQA() wraps the HuggingFace QA pipeline and returns a tidy data frame with answer, score, start, and end — no Python object handling required:

Code
context <- paste(
  "The University of Queensland was founded in 1909 in Brisbane, Australia.",
  "It is a member of the Group of Eight, a coalition of leading Australian",
  "research universities. The main campus is located at St Lucia, on the",
  "banks of the Brisbane River. UQ has produced numerous Nobel laureates",
  "and is consistently ranked among the top 50 universities in the world."
)

questions <- c(
  "When was the University of Queensland founded?",
  "Where is the main campus located?",
  "What coalition is UQ a member of?",
  "How many students does UQ enrol each year?"   # not answerable from context
)

qa_results <- purrr::map_dfr(questions, function(q) {
  text::textQA(
    question = q,
    context  = context,
    model    = "distilbert-base-cased-distilled-squad"
  )
}) |>
  dplyr::mutate(question = questions)

qa_results |> dplyr::select(question, answer, score)

question                                          answer                score
When was the University of Queensland founded?    1909                  0.9973
Where is the main campus located?                 St Lucia              0.9887
What coalition is UQ a member of?                 Group of Eight        0.9812
How many students does UQ enrol each year?        top 50 universities   0.0241

Handling Unanswerable Questions

Code
ANSWER_THRESHOLD <- 0.10

qa_results_filtered <- qa_results |>
  dplyr::mutate(
    answer_reliable = score >= ANSWER_THRESHOLD,
    answer_display  = dplyr::if_else(
      answer_reliable, answer, "[not found in passage]"
    )
  )

qa_results_filtered |> dplyr::select(question, answer_display, score)
QA Models Always Select a Span

BERT QA models never refuse to answer — they always return the most probable span, even when the context does not contain the answer. Score thresholds are a necessary but imperfect safeguard. Always manually review low-confidence answers in high-stakes information extraction.

Corpus-Scale Information Extraction

Code
universities <- tibble::tibble(
  name = c("Oxford","Cambridge","Harvard","MIT"),
  text = c(
    "The University of Oxford is the oldest university in the English-speaking world, with teaching dating back to 1096.",
    "The University of Cambridge was founded in 1209 by scholars leaving Oxford after a dispute.",
    "Harvard University, established in 1636, is the oldest institution of higher learning in the United States.",
    "The Massachusetts Institute of Technology was founded in 1861 in response to industrialisation."
  )
)

founding_dates <- purrr::pmap_dfr(universities, function(name, text) {
  res <- text::textQA(
    question = "When was this university founded?",
    context  = text,
    model    = "distilbert-base-cased-distilled-squad"
  )
  dplyr::bind_cols(tibble::tibble(university = name), res)
})

founding_dates |> dplyr::select(university, answer, score)
Exercises: Question Answering

Q11. You apply extractive QA to ask “What did the government announce?” across 500 news articles and find many results are too short — the answer is “measures” when the full context is “new economic stimulus measures”. What causes this?






Q12. What is the fundamental difference between extractive and generative QA, and why is extractive QA preferred for systematic corpus analysis?






Fine-Tuning RoBERTa

Interface: reticulate | Model: RoBERTa-base

Section Overview

What you will learn: When fine-tuning is appropriate; how to prepare a labelled dataset; how to fine-tune RoBERTa-base using the HuggingFace Trainer API via reticulate; how to evaluate; and how to save and reload the model

When to Fine-Tune

Fine-tuning is appropriate when:

  • Pre-trained models underperform due to domain mismatch
  • Zero-shot classification does not achieve adequate accuracy and you have labelled data
  • You need calibrated probabilities for a specific stable classification scheme
  • You have at least ~100–200 labelled examples per class

Always try a pre-trained pipeline or zero-shot approach first.

GPU Recommended

Fine-tuning RoBERTa-base for 3 epochs takes ~10–20 minutes on a modern GPU and several hours on CPU only. Google Colab (free tier) or a university HPC cluster are practical alternatives.

Training Data

Code
train_data <- tibble::tibble(
  text = c(
    "Previous studies have investigated the relationship between vocabulary size and reading comprehension.",
    "The role of prosody in second language acquisition has received considerable attention.",
    "Corpus-based approaches to grammar description emerged in the 1980s.",
    "Early work on discourse coherence focused primarily on written texts.",
    "We collected data from 45 native speakers of Australian English aged 18 to 34.",
    "Transcripts were coded using a modified version of the DT annotation scheme.",
    "A random forest classifier was trained on TF-IDF features extracted from the corpus.",
    "Participants completed a 30-minute semi-structured interview.",
    "The analysis revealed a significant positive correlation between frequency and acceptability ratings.",
    "Results showed hedging devices were more common in written than spoken academic discourse.",
    "The classifier achieved an F1 score of 0.87 on the held-out test set.",
    "Three distinct intonation patterns were identified in the target construction.",
    "These findings suggest that frequency effects operate at the level of the construction.",
    "The results support the hypothesis that discourse coherence is sensitive to genre.",
    "We conclude that BERT-based approaches offer a viable alternative to rule-based parsers.",
    "This study provides evidence for the usage-based account of grammaticalization."
  ),
  label = c(0L,0L,0L,0L, 1L,1L,1L,1L, 2L,2L,2L,2L, 3L,3L,3L,3L)
)

id2label <- c("0"="background","1"="method","2"="result","3"="conclusion")
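The Trainer call below uses all 16 sentences for training. In real applications you would first hold out a validation set, stratified by class so each label is represented in both partitions. A minimal sketch (toy data standing in for train_data; names invented):

```r
# Toy stand-in for train_data: 8 texts, 2 classes
toy <- data.frame(
  text  = paste("example", 1:8),
  label = rep(0:1, each = 4)
)

set.seed(42)

# Sample 75% of each class for training (stratified split)
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$label), function(idx) {
  sample(idx, size = ceiling(0.75 * length(idx)))
}))

train_split <- toy[train_idx, ]
valid_split <- toy[-train_idx, ]

table(train_split$label)   # 3 examples per class
```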

Fine-Tuning with HuggingFace Trainer

Code
transformers <- reticulate::import("transformers")
datasets_py  <- reticulate::import("datasets")

model_name <- "roberta-base"   # change to "bert-base-uncased" to compare

# ── 1. Tokeniser ───────────────────────────────────────────────────────────
tokenizer <- transformers$AutoTokenizer$from_pretrained(model_name)

# ── 2. HuggingFace Dataset ─────────────────────────────────────────────────
py_train <- datasets_py$Dataset$from_dict(list(
  text  = as.list(train_data$text),
  label = as.list(as.integer(train_data$label))
))

py_train_tok <- py_train$map(
  reticulate::py_func(function(batch) {
    tokenizer(batch[["text"]],
              padding    = TRUE,
              truncation = TRUE,
              max_length = 128L)
  }),
  batched = TRUE
)

# ── 3. Model ───────────────────────────────────────────────────────────────
model <- transformers$AutoModelForSequenceClassification$from_pretrained(
  model_name,
  num_labels = 4L
)

# ── 4. Training arguments ──────────────────────────────────────────────────
training_args <- transformers$TrainingArguments(
  output_dir              = "tutorials/bert/models/rhetorical-roberta",
  num_train_epochs        = 3L,
  per_device_train_batch_size = 8L,
  learning_rate           = 2e-5,
  weight_decay            = 0.01,
  logging_steps           = 10L,
  save_strategy           = "epoch"
)

# ── 5. Train ───────────────────────────────────────────────────────────────
trainer <- transformers$Trainer(
  model         = model,
  args          = training_args,
  train_dataset = py_train_tok
)

trainer$train()
Switching to BERT-base

To compare RoBERTa against BERT-base on the same task, change one line:

model_name <- "bert-base-uncased"   # instead of "roberta-base"

All other code is identical — HuggingFace handles both models through the same AutoModel API.

Evaluating and Saving

Code
test_texts  <- c(
  "Participants were recruited via an online platform and completed the task remotely.",
  "The frequency effect was larger for low-proficiency learners.",
  "Future work should examine whether these patterns hold in spontaneous speech."
)
true_labels <- c("method","result","conclusion")

classifier <- transformers$pipeline(
  "text-classification",
  model     = "tutorials/bert/models/rhetorical-roberta",
  tokenizer = model_name
)

preds <- purrr::map_dfr(test_texts, function(t) {
  res <- classifier(t)[[1]]
  tibble::tibble(
    text       = t,
    pred_label = id2label[stringr::str_extract(res$label, "\\d")],
    score      = round(res$score, 4)
  )
}) |>
  dplyr::mutate(true_label = true_labels,
                correct    = pred_label == true_label)

preds

# Save for reuse
model$save_pretrained("tutorials/bert/models/rhetorical-roberta")
tokenizer$save_pretrained("tutorials/bert/models/rhetorical-roberta")
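With more than a handful of test items, per-row correctness is usually summarised as a confusion matrix plus overall accuracy. A sketch with toy predictions (preds_toy is invented, standing in for preds above):

```r
preds_toy <- data.frame(
  true_label = c("method", "result", "conclusion", "result"),
  pred_label = c("method", "result", "result",     "result")
)

# Rows: gold labels; columns: model predictions
table(true = preds_toy$true_label, predicted = preds_toy$pred_label)

# Overall accuracy
mean(preds_toy$true_label == preds_toy$pred_label)   # 0.75
```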
Exercises: Fine-Tuning

Q13. Training accuracy is 0.98 but validation accuracy is 0.61 after 3 epochs of fine-tuning RoBERTa on 600 labelled tweets. What is happening and what should you do?






Q14. A colleague argues BERT-base should always be used for fine-tuning instead of DistilBERT because it has more parameters. Under what circumstances is DistilBERT the better choice?






BERT vs RoBERTa: Model Comparison

Section Overview

What you will learn: A systematic comparison of DistilBERT, BERT-base, and RoBERTa-base; their trade-offs across speed, size, and performance; a side-by-side classification run; and a decision framework for choosing among them

Side-by-Side on the Same Task

We run all three models on the same sentiment classification task:

Code
transformers <- reticulate::import("transformers")

models_to_compare <- list(
  list(name = "DistilBERT",
       model = "distilbert-base-uncased-finetuned-sst-2-english"),
  list(name = "BERT-base",
       model = "textattack/bert-base-uncased-SST-2"),
  list(name = "RoBERTa-base",
       model = "textattack/roberta-base-SST-2")
)

comparison_results <- purrr::map_dfr(models_to_compare, function(m) {
  clf <- transformers$pipeline("sentiment-analysis", model = m$model)
  purrr::map_dfr(reviews$text, function(txt) {
    t_start <- proc.time()[["elapsed"]]
    result  <- clf(txt)[[1]]
    elapsed <- proc.time()[["elapsed"]] - t_start
    tibble::tibble(
      model      = m$name,
      text       = txt,
      pred_label = tolower(result$label),
      score      = round(result$score, 4),
      time_ms    = round(elapsed * 1000, 1)
    )
  })
})

Pre-Computed Comparison Results

model          accuracy   mean_score   mean_time_ms
DistilBERT     1          0.9987       42.3
BERT-base      1          0.9991       78.1
RoBERTa-base   1          0.9993       83.6
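A summary like this can be derived from comparison_results with a grouped summary, once gold labels are joined in. The sketch below uses a toy stand-in with a true_label column (all values invented), since accuracy requires gold labels:

```r
# Toy stand-in for comparison_results, with gold labels attached
toy_results <- data.frame(
  model      = rep(c("DistilBERT", "BERT-base"), each = 2),
  pred_label = c("positive", "negative", "positive", "positive"),
  true_label = c("positive", "negative", "positive", "negative"),
  score      = c(0.99, 0.98, 0.97, 0.60),
  time_ms    = c(40, 44, 78, 80)
)

comparison_summary <- toy_results |>
  dplyr::group_by(model) |>
  dplyr::summarise(
    accuracy     = mean(pred_label == true_label),
    mean_score   = round(mean(score), 4),
    mean_time_ms = round(mean(time_ms), 1)
  )

comparison_summary
```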

Comprehensive Model Comparison Table

Property                       DistilBERT                   BERT-base                          RoBERTa-base
Parameters                     66M                          110M                               125M
Layers                         6                            12                                 12
Hidden size                    768                          768                                768
Tokenisation                   WordPiece                    WordPiece                          BPE (byte-level)
Pre-training data              Same as BERT (~16 GB)        BooksCorpus + Wikipedia (~16 GB)   CC-News + OpenWebText + more (~160 GB)
Pre-training objectives        MLM (distilled from BERT)    MLM + NSP                          MLM only (dynamic masking, no NSP)
GLUE score (avg, indicative)   ~77                          ~79                                ~86
Inference speed (relative)     Fastest (1×)                 Medium (1.9×)                      Slowest (2×)
Model size on disk             ~260 MB                      ~440 MB                            ~480 MB
Best for                       Speed-constrained inference; small hardware; prototyping   |   General baseline; NER; QA; widest model ecosystem   |   Classification; fine-tuning; tasks requiring highest accuracy

When to Use Which Model

Choose DistilBERT when:

  • Speed is critical (real-time pipelines, large-scale batch inference on CPU)
  • Hardware is constrained (no GPU, limited RAM, edge deployment)
  • You are prototyping and want fast iteration
  • The task is straightforward and performance requirements are modest

Choose BERT-base when:

  • You need a well-studied, widely-cited baseline
  • You are doing NER or QA (most pre-trained NER/QA models on HuggingFace Hub are BERT-based)
  • You want compatibility with the widest range of existing fine-tuned models
  • Inference speed is moderate and accuracy matters

Choose RoBERTa-base when:

  • You are fine-tuning for classification and need the best accuracy
  • You have sufficient training data and a GPU
  • You are building a production classification system
  • The task involves informal or web text (BPE tokenisation handles this better)

Trade-Off Visualisation

Code
tibble::tibble(
  model        = c("DistilBERT","BERT-base","RoBERTa-base"),
  glue_score   = c(77, 79, 86),
  speed_factor = c(1.0, 1.9, 2.0),
  size_mb      = c(260, 440, 480)
) |>
  ggplot2::ggplot(ggplot2::aes(
    x = speed_factor, y = glue_score,
    label = model, size = size_mb, colour = model
  )) +
  ggplot2::geom_point(show.legend = FALSE) +
  ggplot2::geom_text(vjust = -1.2, size = 4, show.legend = FALSE) +
  ggplot2::scale_size_continuous(range = c(5, 12)) +
  ggplot2::theme_bw() +
  ggplot2::labs(
    title    = "BERT-family trade-offs: accuracy vs inference speed",
    subtitle = "Point size proportional to model size on disk",
    x        = "Relative inference time (1× = DistilBERT speed)",
    y        = "Average GLUE score (indicative)"
  )
Exercises: Model Comparison

Q15. A sociolinguist wants to identify dialect features in 50,000 social media posts with no labelled data and no GPU. Which model and approach would you recommend?






Q16. A team fine-tunes DistilBERT, BERT-base, and RoBERTa-base on the same task and gets F1 of 0.81, 0.83, and 0.89 respectively. They must process 10,000 texts per hour on a single CPU server. Which model should they deploy?






Summary and Further Reading

This tutorial has provided a comprehensive introduction to BERT and RoBERTa in R, covering architectural foundations, R’s Python dependency, the two recommended interfaces, five hands-on NLP tasks, and a systematic model comparison.

Section 1 established the conceptual foundations: static embedding limitations, bidirectional self-attention, WordPiece tokenisation (tokens ≠ words), [CLS]/[SEP] special tokens, BERT’s MLM and NSP pre-training, and how RoBERTa improves on BERT through more data, dynamic masking, removal of NSP, and BPE tokenisation.

Section 2 answered the “R-only?” question directly: pure-R transformer inference is not currently possible. All R transformer packages use Python via reticulate. The text package provides the most R-native experience but Python always runs in the background. The now-unmaintained RBERT package (requiring TensorFlow ≤ 1.13.1) should not be used for new projects.

Section 3 introduced the two recommended interfaces — text (high-level, R-native, ideal for embeddings, QA, and downstream modelling) and reticulate (full Python API access, required for fine-tuning, NER, and advanced pipelines) — with a comparison table and task-assignment guide.

Section 4 covered setup: creating a Python virtualenv with reticulate::virtualenv_create(), installing packages with pip (avoiding the Miniforge/GitHub download that is blocked on many university networks), and the critical Sys.setenv(RETICULATE_PYTHON = ...) pattern that must precede all library() calls.

Sections 5–9 demonstrated five tasks, each with the most appropriate interface and model: embeddings (text + BERT-base), classification (reticulate + RoBERTa), NER (reticulate + BERT-NER), question answering (text + DistilBERT/SQuAD), and fine-tuning (reticulate + RoBERTa-base).

Section 10 provided a systematic comparison of DistilBERT, BERT-base, and RoBERTa-base with a comprehensive property table and a decision framework for choosing among them.

Further reading: The original BERT paper (Devlin et al. 2019) and the RoBERTa paper (Liu et al. 2019) are the essential primary sources. Reimers and Gurevych (2019) introduces sentence-transformers. Rogers, Kovaleva, and Rumshisky (2020) surveys what BERT learns. Tunstall, Von Werra, and Wolf (2022) is the definitive practical guide to HuggingFace Transformers. For R-specific coverage, the text package documentation (Kjell, Giorgi, and Schwartz 2023) and r-text.org are the primary resources.


Citation & Session Info

Schweinberger, Martin. 2026. BERT and RoBERTa in R: Transformer-Based NLP. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/bert/bert.html (Version 2026.05.01).

@manual{schweinberger2026bert,
  author       = {Schweinberger, Martin},
  title        = {BERT and RoBERTa in R: Transformer-Based NLP},
  note         = {tutorials/bert/bert.html},
  year         = {2026},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address      = {Brisbane},
  edition      = {2026.05.01}
}
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to draft and structure the entire tutorial, including R code, conceptual explanations, and exercises. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] purrr_1.2.1       tidyr_1.3.2       stringr_1.6.0     tibble_3.3.1     
 [5] flextable_0.9.11  ggplot2_4.0.2     dplyr_1.2.0       text_1.8.1       
 [9] reticulate_1.45.0 checkdown_0.0.13 

loaded via a namespace (and not attached):
  [1] rlang_1.1.7             magrittr_2.0.3          furrr_0.3.1            
  [4] compiler_4.4.2          png_0.1-8               quanteda_4.2.0         
  [7] systemfonts_1.3.1       vctrs_0.7.1             lhs_1.2.1              
 [10] tune_2.0.1              pkgconfig_2.0.3         fastmap_1.2.0          
 [13] rmarkdown_2.30          tzdb_0.4.0              prodlim_2024.06.25     
 [16] markdown_2.0            ragg_1.5.1              xfun_0.56              
 [19] litedown_0.9            jsonlite_2.0.0          recipes_1.1.1          
 [22] uuid_1.2-1              tweenr_2.0.3            RcppProgress_0.4.2     
 [25] parallel_4.4.2          stopwords_2.3           R6_2.6.1               
 [28] rsample_1.3.2           stringi_1.8.4           RColorBrewer_1.1-3     
 [31] parallelly_1.42.0       rpart_4.1.23            lubridate_1.9.4        
 [34] dials_1.4.2             Rcpp_1.1.1              knitr_1.51             
 [37] future.apply_1.11.3     readr_2.1.5             Matrix_1.7-2           
 [40] splines_4.4.2           nnet_7.3-19             timechange_0.3.0       
 [43] tidyselect_1.2.1        rstudioapi_0.17.1       yaml_2.3.10            
 [46] timeDate_4041.110       codetools_0.2-20        ggwordcloud_0.6.2      
 [49] listenv_0.9.1           lattice_0.22-6          withr_3.0.2            
 [52] S7_0.2.1                askpass_1.2.1           evaluate_1.0.5         
 [55] future_1.34.0           survival_3.7-0          polyclip_1.10-7        
 [58] zip_2.3.2               xml2_1.3.6              topics_0.70            
 [61] pillar_1.10.1           renv_1.1.7              generics_0.1.3         
 [64] hms_1.1.3               commonmark_2.0.0        scales_1.4.0           
 [67] globals_0.16.3          class_7.3-22            glue_1.8.0             
 [70] gdtools_0.5.0           tools_4.4.2             data.table_1.17.0      
 [73] gower_1.0.2             fastmatch_1.1-6         cowplot_1.2.0          
 [76] grid_4.4.2              yardstick_1.3.2         ipred_0.9-15           
 [79] colorspace_2.1-1        patchwork_1.3.0         textmineR_3.0.5        
 [82] ggforce_0.4.2           DiceDesign_1.10         cli_3.6.5              
 [85] textshaping_1.0.0       workflows_1.3.0         officer_0.7.3          
 [88] parsnip_1.4.1           fontBitstreamVera_0.1.1 lava_1.8.1             
 [91] gtable_0.3.6            GPfit_1.0-9             digest_0.6.39          
 [94] fontquiver_0.2.1        htmlwidgets_1.6.4       farver_2.1.2           
 [97] htmltools_0.5.9         lifecycle_1.0.5         hardhat_1.4.2          
[100] fontLiberation_0.1.0    gridtext_0.1.6          openssl_2.3.2          
[103] MASS_7.3-61            


References

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86.
Kjell, Oscar, Salvatore Giorgi, and H. Andrew Schwartz. 2023. “The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers.” Psychological Methods 28 (6): 1478.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv Preprint arXiv:1907.11692.
Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–92.
Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2020. “A Primer in BERTology: What We Know about How BERT Works.” Transactions of the Association for Computational Linguistics 8: 842–66.
Tunstall, Lewis, Leandro von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O’Reilly Media.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.